Last updated: 2021-05-27

Checks: 7 0

Knit directory: fa_sim_cal/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20201104) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version a6fb2e3. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    .tresorit/
    Ignored:    _targets/
    Ignored:    data/VR_20051125.txt.xz
    Ignored:    data/VR_Snapshot_20081104.txt.xz
    Ignored:    renv/library/
    Ignored:    renv/local/
    Ignored:    renv/staging/

Unstaged changes:
    Modified:   analysis/index.Rmd
    Modified:   analysis/m_00_status.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/m_01_9_clean_vars.Rmd) and HTML (docs/m_01_9_clean_vars.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd 5c06e88 Ross Gayler 2021-05-26 WIP
html 5c06e88 Ross Gayler 2021-05-26 WIP
Rmd 3ca94f0 Ross Gayler 2021-05-24 WIP
Rmd 6421aeb Ross Gayler 2021-04-04 wip
Rmd 411de1e Ross Gayler 2021-04-04 WIP
html 411de1e Ross Gayler 2021-04-04 WIP
Rmd 0bd4a5f Ross Gayler 2021-04-03 WIP

# NOTE this notebook can be run manually or automatically by {targets}
# So load the packages required by this notebook here
# rather than relying on _targets.R to load them.

# Set up the project environment, because {workflowr} knits each Rmd file 
# in a new R session, and doesn't execute the project .Rprofile

library(targets) # access data from the targets cache

library(tictoc) # capture execution time
library(here) # construct file paths relative to project root
here() starts at /home/ross/RG/projects/academic/entity_resolution/fa_sim_cal_TOP/fa_sim_cal
library(fs) # file system operations
library(dplyr) # data wrangling

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(gt) # table formatting
library(vroom) # fast reading of delimited text files
library(tibble) # enhanced data frames
library(stringr) # string matching
library(skimr) # compact summary of each variable
library(lubridate) # date parsing

Attaching package: 'lubridate'
The following objects are masked from 'package:base':

    date, intersect, setdiff, union
library(forcats) # manipulation of factors
library(ggplot2) # graphics
library(tidyr) # data tidying
#

# start the execution time clock
tictoc::tic("Computation time (excl. render)")

# Get the path to the raw entity data file
# This is a target managed by {targets}
f_entity_raw_tsv <- tar_read(c_raw_entity_data_file)

1 Introduction

The aim of this set of meta notebooks is to work out how to read the raw entity data. and get it sufficiently neatened so that we can construct standardised names and modelling features without needing any further neatening. To be clear, the target (c_raw_entity_data) corresponding to the objective of this set of notebooks is the neatened raw data, before constructing any modelling features.

This notebook documents the cleaning of all the variables based on the issues discovered in the previous analyses.

It also adds a record ID variable.

1.1 Variable roles

The name variables, last_name, first_name, and midl_name, will definitely be used in compatibility modelling.

We intend to use the one snapshot file as both the database to be queried and as the set of queries. Consequently, strictly speaking, we don’t need to standardise the name variables because the database and query records are guaranteed to be identical (they will literally be the same record). However, we will look at the name variables with an eye to standardisation because it is never a good idea to statistically model data without having an idea about the quality of the data. We will apply some basic standardisation to the name variables, if appropriate, because it parallels what would be necessary in practice.

The demographic variables sex, age, birth_place, and the administrative variable county_desc may be used as predictors and/or blocking variables.

The remainder of the variables (residence and administrative variables) will be kept in case they are useful for manually assessing claimed matches.

1.2 Cleanup for predictors

1.2.1 Name standardisation

Standardisation will be applied to the name variables last_name, first_name, and midl_name. This attempts to remove variation that is probably irrelevant to identity (e.g. case, punctuation, and spacing).

1.2.2 Missing values

In the previous notebooks I have converted empty strings to missing values (NA_character_ in R). This was convenient because table() and skim() count missing values as a separate category. However, modelling is a different kettle of fish.

In modelling, we want to get an estimated probability of identity match for every query, regardless of how many attributes have missing values. Typical modelling functions do not tolerate any missing (NA) values in predictors. If any of the predictors is missing then the estimate is also missing.

We avoid that problem by transforming the missing values into some nonmissing value and creating an extra variable to indicate the missingness. This will be done for the name variables last_name, first_name, and midl_name, and the demographic variable age. (This is not necessary for birth_place because “missing” is just another valid level of the variable.

1.2.3 Cleanup summary

The cleanup actions to be applied are (in order):

  • age

  • convert from string to integer

  • add missing value indicator and set to true if age < 17 or age > 104

  • if age missing indicator is true, set age to 0

  • preprocess all character variables

  • map missing to empty string

  • map lower case letters to upper case

  • all names

  • map each non-alphanumeric character to a space (Remove variability of punctuation while preserving word boundaries.)

  • map words 11, 111, 1111 to words II, III, IIII (Correct substitution of 1 for I in generation suffixes.)

  • if name contains zero and no other digits, map zero to O (Correct substitution of 0 for O in names.)

  • map each remaining digit to an empty string (Remove random digit insertions)

  • last name

  • map words DR, II, III, IIII, IV, JR, MD, SR to empty string

  • if number of letters in last name = 1, map name to empty string

  • middle name

  • map words AKA, DR, II, III, IV, JR, MD, MISS, MR, MRS, MS, NMN, NN, REV, SR to empty string

  • first name

  • map words DR, FATHER, III, IV, JR, MD, MISS, MR, MRS, NMN, REV, SISTER, SR to empty string

  • if number of letters in first name = 0, move first word of middle name to first name

  • postprocess all name variables

  • map all spaces to empty strings (Remove variability of spacing.)

  • add missing value indicator variables for all name variables

2 Read entity data

Read the raw entity data file using the previously defined functions raw_entity_data_read(), raw_entity_data_excl_status(), raw_entity_data_excl_test(), raw_entity_data_drop_novar(), raw_entity_data_parse_dates(), and raw_entity_data_drop_cancel_dt().

# Show the data file name
fs::path_file(f_entity_raw_tsv)
[1] "VR_20051125.txt.xz"
d <- raw_entity_data_read(f_entity_raw_tsv) %>% 
  raw_entity_data_excl_status() %>% 
  raw_entity_data_excl_test() %>% 
  raw_entity_data_drop_novar() %>% 
  raw_entity_data_parse_dates() %>% 
  raw_entity_data_drop_admin() %>% 
  raw_entity_data_drop_demog()

dim(d)
[1] 4099699      21

3 Define and apply cleanup functions

3.0.1 Age

  • convert from string to integer
  • add missing indicator and set to true if age < 17 or age > 104
  • if age missing indicator is true, set age to 0
# Function to clean age
raw_entity_data_clean_age <- function(
  d # data frame - raw entity data
) {
  d %>%
    dplyr::mutate(
      age_cln = as.integer(age),
      age_cln_miss = ! dplyr::between(age_cln, 17, 104), # valid age range
      age_cln = dplyr::if_else(age_cln_miss, 0L, age_cln)
    )
}

Apply and check function.

# Apply

dim(d)
[1] 4099699      21
d <- d %>% raw_entity_data_clean_age()

dim(d)
[1] 4099699      23
# After

summary(d$age_cln)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0    33.0    45.0    46.3    58.0   104.0 
summary(d$age_cln_miss)
   Mode   FALSE    TRUE 
logical 4068644   31055 
d %>% 
  dplyr::select(starts_with("age_cln")) %>% 
  dplyr::group_by(age_cln_miss) %>% 
  dplyr::summarise(qtl = quantile(age_cln) %>% list()) %>%
  tidyr::unnest_wider(col = qtl) %>% 
  gt::gt() %>% 
  gt::opt_row_striping() %>% 
  gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels())
age_cln_miss 0% 25% 50% 75% 100%
FALSE 17 33 45 58 104
TRUE 0 0 0 0 0

3.0.2 Preprocess all character variables

  • map missing to empty string
  • map lower case letters to upper case

Note that this is an in-place transformation of the character variables rather than adding the transformed values as new variables. This is because I am viewing this a light tidying rather than as creating distinctly new cleaned values.

# Helper function to preprocess one char variable
raw_entity_data_clean_preprocess_char_1 <- function(x) {
  x %>%
    tidyr::replace_na("") %>% # map NA to ""
    stringr::str_to_upper() # map lower case to upper case
}

# Function to preprocess all char variables
raw_entity_data_clean_preprocess_char <- function(
  d # data frame - raw entity data
) {
  d %>%
    dplyr::mutate(
      across(where(is.character), raw_entity_data_clean_preprocess_char_1) # apply to all char vars
    )
}

Apply and check function.

# This is only a partial check - look at a very small subset of variables

# Before

d %>% 
  dplyr::select(ends_with("_name")) %>% 
  skimr::skim()
Table 3.1: Data summary
Name Piped data
Number of rows 4099699
Number of columns 4
_______________________
Column type frequency:
character 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
last_name 0 1.00 1 21 0 191996 0
first_name 23 1.00 1 19 0 126589 0
midl_name 252695 0.94 1 20 0 175742 0
street_name 7 1.00 1 30 0 83244 0
d %>% 
  dplyr::select(ends_with("_name")) %>% 
  dplyr::filter(stringr::str_detect(first_name, "[a-z]")) %>% 
  gt::gt() %>% 
  gt::opt_row_striping() %>% 
  gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>% 
  gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name first_name midl_name street_name
CARLAND JoANN CROOK TRAVIS
HUGHES LaVERNE MARY FALCON
BRAGG JoANNE A CAMP CREEK
NATIONS JoANN NMN BRADLEY BRANCH
SIMMONS SiROBERT JOHNSON DEVONSHIRE
BRITT McCKINES LAMONT GRANTHAM SCHOOL
STAFFORD DeNEAL ROBERT COTTONWOOD
CARROLL McHILDIA THOMPSON GOVERNOR MOORE
WATT JoANN <NA> CARLYLE
THOMPSON LaSONYA D DERRYDOWNE
FLEMING JeROME S BEATTIES FORD
# Apply

d <- d %>% raw_entity_data_clean_preprocess_char()

# After

d %>% 
  dplyr::select(ends_with("_name")) %>% 
  skimr::skim()
Table 3.1: Data summary
Name Piped data
Number of rows 4099699
Number of columns 4
_______________________
Column type frequency:
character 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
last_name 0 1 1 21 0 191995 0
first_name 0 1 0 19 23 126584 0
midl_name 0 1 0 20 252695 175695 0
street_name 0 1 0 30 7 83241 0
d %>% 
  dplyr::select(ends_with("_name")) %>% 
  dplyr::filter(stringr::str_detect(first_name, "[a-z]")) %>% 
  gt::gt() %>% 
  gt::opt_row_striping() %>% 
  gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>% 
  gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name first_name midl_name street_name

3.0.3 All names

  • map each non-alphanumeric character to a space (Remove variability of punctuation. This preserves word boundaries.)
  • map words 11, 111, 1111 to words II, III, IIII (Correct substitution of 1 for I in generation suffixes.)
  • if name contains zero and no other digits, map zero to O (Correct substitution of 0 for O in names.)
  • map each digit to an empty string (Remove random digit insertions)
# Helper function to fix zeroes in all names
# Map zero to O if there are no other digits in the string
raw_entity_data_clean_all_names_fix_zero <- function(x) { # x: vector of strings
  dplyr::if_else(
    stringr::str_detect( x, "0") & # if string contains zero AND
      stringr::str_detect( x, "[1-9]", negate = TRUE), # string contains no other digits
    stringr::str_replace_all( x, "0", "O"), # then map zero to O
    x # else return x
  )
}

# Helper function to apply all-name cleaning to one name
# Apply all-name cleaning
raw_entity_data_clean_all_names_1 <- function(x) { # x: vector of strings
  x %>%
    stringr::str_replace_all("[^ A-Z0-9]", " ") %>% # map non-alphanumeric to " "
    stringr::str_replace_all( # fix generation suffixes
      c("\\b11\\b"   = "II",
        "\\b111\\b"  = "III",
        "\\b1111\\b" = "IIII")
    ) %>%
    raw_entity_data_clean_all_names_fix_zero() %>%
    stringr::str_remove_all("[0-9]") %>% # map remaining digits to ""
    stringr::str_squish() # remove excess whitespace
}

# Function to apply all-name cleaning to all names
raw_entity_data_clean_all_names <- function(
  d # data frame - raw entity data
) {
  d %>%
    dplyr::mutate(
      across(
        .cols = c(last_name, first_name, midl_name), # apply to all name vars
        .fns = raw_entity_data_clean_all_names_1,
        .names = "{.col}_cln")
    )
}

Apply and check function.

# This is only a partial check

# Apply

dim(d)
[1] 4099699      23
d <- d %>% raw_entity_data_clean_all_names()

dim(d)
[1] 4099699      26
# After

d %>% 
  dplyr::select(
    last_name, last_name_cln, first_name, first_name_cln, midl_name, midl_name_cln
  ) %>% 
  dplyr::filter(stringr::str_detect(first_name, "[^ a-zA-Z0-9]")) %>% 
  dplyr::distinct(first_name, .keep_all = TRUE) %>% 
  dplyr::slice_sample(n = 20) %>% 
  gt::gt() %>% 
  gt::opt_row_striping() %>% 
  gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>% 
  gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name last_name_cln first_name first_name_cln midl_name midl_name_cln
SMITH SMITH ROSALIN-ANN ROSALIN ANN WASHINGTON WASHINGTON
OWENS OWENS TA-JUANA TA JUANA MICHELLE MICHELLE
NELSON NELSON YONG-CHA YONG CHA CHA CHA
HARPER HARPER CHE-CHE CHE CHE
BAUDREAU BAUDREAU M-ROBERTA M ROBERTA D D
KUHL KUHL HANNE-LORE HANNE LORE HESS HESS
SMITH SMITH D'SHONE D SHONE LAKIA LAKIA
CONNER CONNER MARY-JANE MARY JANE BOHOLANO BOHOLANO
GRIMES GRIMES LA'NAE LA NAE TANISHA TANISHA
CURRY CURRY JULIANNA-LEE JULIANNA LEE BETTS BETTS
FLYNN FLYNN KERRI-LYNN KERRI LYNN
FREEMAN FREEMAN MONTE' MONTE TERRAN TERRAN
FORBES FORBES PATSY-DAWN PATSY DAWN WILLIAMS WILLIAMS
LEE LEE HEE-KWAN HEE KWAN
MULLEN MULLEN O'DEYNE O DEYNE B B
ADAMS ADAMS AL-DEAN AL DEAN
BELLAMY BELLAMY JA'VON JA VON BOBO BOBO
CLINTON CLINTON SHANTE' SHANTE LYVETTE LYVETTE
PHILLIPS PHILLIPS MEN-NESHIANNE MEN NESHIANNE QUNA QUNA
VELASCO VELASCO JOSE` JOSE L L
d %>% 
  dplyr::select(
    last_name, last_name_cln, first_name, first_name_cln, midl_name, midl_name_cln
  ) %>% 
  dplyr::filter(stringr::str_detect(midl_name, "111*")) %>% 
  dplyr::distinct(midl_name, .keep_all = TRUE) %>% 
  dplyr::slice_sample(n = 20) %>% 
  gt::gt() %>% 
  gt::opt_row_striping() %>% 
  gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>% 
  gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name last_name_cln first_name first_name_cln midl_name midl_name_cln
EDWARDS EDWARDS FRANK FRANK D 11 D II
FEATHERSTONE FEATHERSTONE GEORGE GEORGE 111 III
BENNETT BENNETT HAROLD HAROLD 11 II
d %>% 
  dplyr::select(
    last_name, last_name_cln, first_name, first_name_cln, midl_name, midl_name_cln
  ) %>% 
  dplyr::filter(stringr::str_detect(first_name, "0")) %>% 
  dplyr::distinct(first_name, .keep_all = TRUE) %>% 
  dplyr::slice_sample(n = 20) %>% 
  gt::gt() %>% 
  gt::opt_row_striping() %>% 
  gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>% 
  gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name last_name_cln first_name first_name_cln midl_name midl_name_cln
JOHNSON JOHNSON T0NYA TONYA BETH BETH
HODGINS HODGINS J0HNNY JOHNNY BURNETTE BURNETTE
SPRINKLE SPRINKLE L0RI LORI WOODWARD WOODWARD
BAREFOOT BAREFOOT B0BBY BOBBY R R
THOMAS THOMAS P0LLY POLLY BROWN BROWN
TANNAHILL TANNAHILL J0SEPH JOSEPH ERIC ERIC
KNOX KNOX ANTONI0 ANTONIO F F
TIDDY TIDDY J0HN JOHN F F
EDMONDS EDMONDS MARI0N MARION CAVINESS CAVINESS
MITCHELL MITCHELL SHANN0N SHANNON ARLINE ARLINE
GRAU GRAU TIM0THY TIMOTHY
COOPER COOPER RODRIG0 RODRIGO SALOMON SALOMON
BROWN BROWN ALONZ0 ALONZO
BLEDSOE BLEDSOE R0Y ROY JACK JACK
BUIE BUIE S0NTE SONTE Y Y
BLUE BLUE KAR0N KARON LEE LEE
COOPER COOPER C0RDELIA CORDELIA P P
WILLIAMS WILLIAMS M0NIKA MONIKA UDANA UDANA
AYALA AYALA ANDREA-0 ANDREA O
RAMIREZ RAMIREZ REYNALD0 REYNALDO G G
d %>% 
  dplyr::select(
    last_name, last_name_cln, first_name, first_name_cln, midl_name, midl_name_cln
  ) %>% 
  dplyr::filter(stringr::str_detect(midl_name, "[2-9]")) %>% 
  dplyr::distinct(midl_name, .keep_all = TRUE) %>% 
  dplyr::slice_sample(n = 20) %>% 
  gt::gt() %>% 
  gt::opt_row_striping() %>% 
  gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>% 
  gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name last_name_cln first_name first_name_cln midl_name midl_name_cln
GUIDO GUIDO DEANA DEANA LYNN2513 LYNN
BEACHAM BEACHAM HEATHER HEATHER ANDERSON9104576 ANDERSON
FAICLOTH FAICLOTH TIMOTHY TIMOTHY LOUIS7100 LOUIS
SMITH SMITH TRACY TRACY WAYNE030986 WAYNE
TOOMES TOOMES BRIAN BRIAN SCOTT3450 SCOTT
BECHTEL BECHTEL TERESA TERESA MARIE103062 MARIE
JONES JONES DONNA DONNA 10052004
YOUNG YOUNG WANWYNE WANWYNE 4625
BREEN BREEN TERRANCE TERRANCE MICHAEL146 MICHAEL
PHAIR PHAIR IDELL IDELL 8017
PERKINS PERKINS TERESA TERESA ROSENBAUM3305 ROSENBAUM
WILLIS WILLIS DEJUAN DEJUAN 328
PYRTLE PYRTLE PHILLIP PHILLIP W5RAY WRAY
PLESS PLESS JOAN JOAN WRIGHT2106 WRIGHT
GLOVER GLOVER DIONNE DIONNE LYNN1820 LYNN
FRENCH FRENCH SHNETTA SHNETTA ALEXANDER080572 ALEXANDER
SHUMAKER SHUMAKER RUTH RUTH ANN BURTON47 ANN BURTON
HOWERIN HOWERIN MICHAEL MICHAEL DALE401 DALE
BLAIR BLAIR ESSIE ESSIE MIZELLE25248249 MIZELLE
HILL HILL ZEB ZEB MITCHELL368 MITCHELL

3.0.4 Last name

  • map words DR, II, III, IIII, IV, JR, MD, SR to empty string
  • if number of letters in last name = 1, map name to empty string
# Helper function to remove words (w) from vector of strings (x)
raw_entity_data_clean_all_names_remove_words <- function(x, w) { # x, w: vectors of char (w = words to remove)
  x %>%
    stringr::str_remove_all(
      pattern =  paste0("\\b", w, "\\b", collapse = "|") #convert word list to regexp
    ) %>%
    stringr::str_squish() # remove excess whitespace
}

# Function to apply last-name cleaning
raw_entity_data_clean_last_name <- function(
  d # data frame - raw entity data
) {
  d %>%
    dplyr::mutate(
      last_name_cln = last_name_cln %>% # remove special words
        raw_entity_data_clean_all_names_remove_words(
          c("DR", "II", "III", "IIII", "IV", "JR", "MD", "SR")
        ),

      last_name_cln = dplyr::if_else( # remove very short names
        stringr::str_length(last_name_cln) > 1,
        last_name_cln,
        ""
      )
    )
}

Apply and check function.

# This is only a partial check

# Before

d %>% 
  dplyr::select(
    last_name, last_name_cln, first_name, first_name_cln, midl_name, midl_name_cln
  ) %>% 
  dplyr::filter(stringr::str_detect(
    last_name_cln, 
    "\\bDR\\b|\\bII\\b|\\bIII\\b|\\bIIII\\b|\\bIV\\b|\\bJR\b|\\bMD\\b|\\bSR\\b")
  ) %>% 
  dplyr::distinct(last_name_cln, .keep_all = TRUE) %>% 
  dplyr::slice_sample(n = 20) %>% 
  gt::gt() %>% 
  gt::opt_row_striping() %>% 
  gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>% 
  gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name last_name_cln first_name first_name_cln midl_name midl_name_cln
FILLINGHAM, II FILLINGHAM II ROBERT ROBERT E E
MCGUIRT III MCGUIRT III JAMES JAMES WILLIAM WILLIAM
METTS II METTS II CAREY CAREY MONTGOMERY MONTGOMERY
REDFEARN III REDFEARN III WILBERT WILBERT
BLACKWELDER III BLACKWELDER III DWIGHT DWIGHT MCNAIRY MCNAIRY
WORD II WORD II JOE JOE NAHAN NAHAN
STIMSON SR STIMSON SR RICHARD RICHARD BARRETT BARRETT
GOODWIN III GOODWIN III WARD WARD ALEXANDER ALEXANDER
DOSS SR DOSS SR MICHAEL MICHAEL RAY RAY
GROUSE III GROUSE III CHARLES CHARLES J J
WILLIAMSON DR WILLIAMSON DR IRVIN IRVIN D D
HICKS SR HICKS SR WILFORD WILFORD LYTLE LYTLE
SMITH III SMITH III GUY GUY R R
BEATTY III BEATTY III CURTIS CURTIS M M
CHAPPELL III CHAPPELL III TRAVIS TRAVIS
CRITTENDON II CRITTENDON II WILLIAM WILLIAM BURRELL BURRELL
VAUGHN SR VAUGHN SR WALTER WALTER S S
GREEN II GREEN II BILLY BILLY HOWARD HOWARD
SATTERFIELD 111 SATTERFIELD III CHARLES CHARLES MASON MASON
COCKERHAM III COCKERHAM III BOBBY BOBBY LEE LEE
d %>% 
  dplyr::select(
    last_name, last_name_cln, first_name, first_name_cln, midl_name, midl_name_cln
  ) %>% 
  dplyr::filter(stringr::str_length(last_name_cln) <= 1) %>% 
  dplyr::distinct(last_name_cln, .keep_all = TRUE) %>% 
  dplyr::slice_sample(n = 20) %>% 
  gt::gt() %>% 
  gt::opt_row_striping() %>% 
  gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>% 
  gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name last_name_cln first_name first_name_cln midl_name midl_name_cln
M M COY COY FAY FAY
Y Y PRUM PRUM
J J J J
X X MARCUS MARCUS
U U RAYMOND RAYMOND
H H MOIH MOIH
N N RENEE RENEE VIVIAN VIVIAN
A A CHUH CHUH
S S PETER PETER THOMAS THOMAS
K K NIUH NIUH
R R MARY MARY
# Apply

dim(d)
[1] 4099699      26
d <- d %>% raw_entity_data_clean_last_name()

dim(d)
[1] 4099699      26
# After

d %>% 
  dplyr::select(
    last_name, last_name_cln, first_name, first_name_cln, midl_name, midl_name_cln
  ) %>% 
  dplyr::filter(stringr::str_detect(
    last_name_cln, 
    "\\bDR\\b|\\bII\\b|\\bIII\\b|\\bIIII\\b|\\bIV\\b|\\bJR\b|\\bMD\\b|\\bSR\\b")
  ) %>% 
  dplyr::distinct(last_name_cln, .keep_all = TRUE) %>% 
  dplyr::slice_sample(n = 20) %>% 
  gt::gt() %>% 
  gt::opt_row_striping() %>% 
  gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>% 
  gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name last_name_cln first_name first_name_cln midl_name midl_name_cln
d %>% 
  dplyr::select(
    last_name, last_name_cln, first_name, first_name_cln, midl_name, midl_name_cln
  ) %>% 
  dplyr::filter(stringr::str_length(last_name_cln) <= 1) %>% 
  dplyr::distinct(last_name_cln, .keep_all = TRUE) %>% 
  dplyr::slice_sample(n = 20) %>% 
  gt::gt() %>% 
  gt::opt_row_striping() %>% 
  gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>% 
  gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name last_name_cln first_name first_name_cln midl_name midl_name_cln
M COY COY FAY FAY

3.0.5 Middle name

  • map words AKA, DR, II, III, IV, JR, MD, MISS, MR, MRS, MS, NMN, NN, REV, SR to empty string
# Function to apply middle-name cleaning
raw_entity_data_clean_middle_name <- function(
  d # data frame - raw entity data
) {
  d %>%
    dplyr::mutate(
      midl_name_cln = midl_name_cln %>% # remove special words
        raw_entity_data_clean_all_names_remove_words(
          c("AKA", "DR", "II", "III", "IV", "JR", "MD", "MISS",
            "MR", "MRS", "MS", "NMN", "NN", "REV", "SR")
        )
    )
}

Apply and check function.

# This is only a partial check

# Before

d %>% 
  dplyr::select(
    last_name, last_name_cln, first_name, first_name_cln, midl_name, midl_name_cln
  ) %>% 
  dplyr::filter(stringr::str_detect(
    midl_name_cln, 
    "\\bAKA\\b|\\bDR\\b|\\bII\\b|\\bIII\\b|\\bIV\\b|\\bJR\b|\\bMD\\b|\\bMISS\\b|\\bMR\\b|\\MRS\\b|\\bMS\\b|\\bNMN\\b|\\bNN\\b|\\bREV\\b|\\bSR\\b")
  ) %>% 
  dplyr::distinct(midl_name_cln, .keep_all = TRUE) %>% 
  dplyr::slice_sample(n = 20) %>% 
  gt::gt() %>% 
  gt::opt_row_striping() %>% 
  gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>% 
  gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name last_name_cln first_name first_name_cln midl_name midl_name_cln
BALLOU BALLOU JAMES JAMES STAFFORD MRS STAFFORD MRS
SMITH SMITH S S LEROY MRS LEROY MRS
FOWLER FOWLER H H JACK MRS JACK MRS
GREENE GREENE LESTER LESTER D(NN) D NN
ARMSTON ARMSTON MILTON MILTON SR SR
WILLIAMS WILLIAMS ERVIN ERVIN W., SR., W SR
LAMBERT LAMBERT W W T MRS T MRS
BOWLES BOWLES J J VAN MRS VAN MRS
ARDIZZOLA ARDIZZOLA RUDY RUDY NMN NMN
PRICE PRICE THOMAS THOMAS EUGENE MRS EUGENE MRS
VETTER VETTER JOHN JOHN S DR S DR
JOHNSON JOHNSON FRED FRED ALAN SR ALAN SR
BENNETT BENNETT CHARLES CHARLES J MRS J MRS
HILDEBRAND HILDEBRAND JAMES JAMES OVERLAND MRS OVERLAND MRS
WARD WARD J J HERMAN MRS HERMAN MRS
LUPTON LUPTON DENNIS DENNIS WAYNE SR WAYNE SR
MOSS MOSS HERBERT HERBERT H MRS H MRS
VAN GORDER VAN GORDER CHARLES CHARLES OSCAR II OSCAR II
WHITNEY WHITNEY WILLIAM WILLIAM PRESTON SR PRESTON SR
KELLY KELLY DAVID DAVID LEE II LEE II
# Apply

dim(d)
[1] 4099699      26
d <- d %>% raw_entity_data_clean_middle_name()

dim(d)
[1] 4099699      26
# After

d %>% 
  dplyr::select(
    last_name, last_name_cln, first_name, first_name_cln, midl_name, midl_name_cln
  ) %>% 
  dplyr::filter(stringr::str_detect(
    midl_name_cln, 
    "\\bAKA\\b|\\bDR\\b|\\bII\\b|\\bIII\\b|\\bIV\\b|\\bJR\b|\\bMD\\b|\\bMISS\\b|\\bMR\\b|\\MRS\\b|\\bMS\\b|\\bNMN\\b|\\bNN\\b|\\bREV\\b|\\bSR\\b")
  ) %>% 
  dplyr::distinct(midl_name_cln, .keep_all = TRUE) %>% 
  dplyr::slice_sample(n = 20) %>% 
  gt::gt() %>% 
  gt::opt_row_striping() %>% 
  gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>% 
  gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name last_name_cln first_name first_name_cln midl_name midl_name_cln

3.0.6 First name

  • map words DR, FATHER, III, IV, JR, MD, MISS, MR, MRS, NMN, REV, SISTER, SR to empty string
  • if number of letters in first name = 0, move first word of middle name to first name
# Helper function
# If no first name, move first word of middle name to first name
raw_entity_data_clean_first_name_move_name <- function(d) { # d: data frame of entity data
  has_first_name <- d$first_name_cln != ""

  re_fword <- "^[A-Z]+\\b" # regular expression for first word

  midl <- d$midl_name_cln

  midl_head <- midl %>% # get first word
    stringr::str_extract(re_fword) %>%
    tidyr::replace_na("")

  midl_tail <- midl %>% # get remainder of words
    stringr::str_remove(re_fword) %>%
    stringr::str_squish()

  d %>%
    dplyr::mutate(
      first_name_cln = dplyr::if_else(has_first_name,
                                      first_name_cln,
                                      midl_head
      ),
      midl_name_cln = dplyr::if_else(has_first_name,
                                     midl_name_cln,
                                     midl_tail
      )
    )
}

# Function to apply first-name cleaning
raw_entity_data_clean_first_name <- function(
  d # data frame - raw entity data
) {
  d %>%
    dplyr::mutate(
      first_name_cln = first_name_cln %>% # remove special words
        raw_entity_data_clean_all_names_remove_words(
          c("DR", "FATHER", "III", "IV", "JR", "MD", "MISS",
            "MR", "MRS", "NMN", "REV", "SISTER", "SR")
        )
    ) %>%
    raw_entity_data_clean_first_name_move_name()
}

Apply and check function.

# This is only a partial check

# Before

d %>% 
  dplyr::select(
    last_name, last_name_cln, first_name, first_name_cln, midl_name, midl_name_cln
  ) %>% 
  dplyr::filter(stringr::str_detect(
    first_name_cln, 
    "\\bDR\\b|\\bFATHER\\b|\\bIII\\b|\\bIV\\b|\\bJR\b|\\bMD\\b|\\bMISS\\b|\\bMR\\b|\\MRS\\b|\\bNMN\\b|\\bREV\\b|\\bSISTER\\b|\\bSR\\b")
  ) %>% 
  dplyr::distinct(first_name_cln, .keep_all = TRUE) %>% 
  dplyr::slice_sample(n = 20) %>% 
  gt::gt() %>% 
  gt::opt_row_striping() %>% 
  gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>% 
  gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name last_name_cln first_name first_name_cln midl_name midl_name_cln
COVINGTON COVINGTON EDNA(MRS EDNA MRS PERRY, JR) PERRY
BORDERS BORDERS EUGENE(NMN) EUGENE NMN
MCPHERSON MCPHERSON VAN III VAN III
BINGMAN BINGMAN GRAY MRS GRAY MRS
GRAHAM GRAHAM STEPHEN SR STEPHEN SR LEGREE LEGREE
HARRIS HARRIS MRS FRED MRS FRED W W
TANCRAITOR TANCRAITOR SISTER MAXINE SISTER MAXINE ELIZABETH ELIZABETH
COPELAND COPELAND IV IV EDWARD JAMES EDWARD JAMES
GOOLSBY GOOLSBY EUGENE MRS EUGENE MRS
GURGANIOUS GURGANIOUS JOHN MRS JOHN MRS HALLIE HALLIE
DAVENPORT DAVENPORT MRS H MRS H T T
HARRIS HARRIS MRS WILLIAM MRS WILLIAM W W
STOCKELL STOCKELL MD MD COOPER COOPER
HAMRICK HAMRICK JOHN R MRS JOHN R MRS MARGARET MARGARET
SMITH SMITH MRS WILLIAM JOE MRS WILLIAM JOE DAVIS DAVIS
POPE POPE MRS O MRS O N N
RIVES RIVES MRS WILBUR MRS WILBUR A A
NICHOLS NICHOLS DORIS ( MRS W DORIS MRS W
MOODY MOODY MRS WILLARD MRS WILLARD W W
BURKE BURKE MRS GEORGE MRS GEORGE W W
d %>% 
  dplyr::select(
    last_name, last_name_cln, first_name, first_name_cln, midl_name, midl_name_cln
  ) %>% 
  dplyr::filter(stringr::str_length(first_name_cln) == 0) %>% 
  dplyr::distinct(first_name_cln, .keep_all = TRUE) %>% 
  dplyr::slice_sample(n = 20) %>% 
  gt::gt() %>% 
  gt::opt_row_striping() %>% 
  gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>% 
  gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name last_name_cln first_name first_name_cln midl_name midl_name_cln
FUQUA FUQUA MARY MARY
# Apply

dim(d)
[1] 4099699      26
d <- d %>% raw_entity_data_clean_first_name()

dim(d)
[1] 4099699      26
# After

d %>% 
  dplyr::select(
    last_name, last_name_cln, first_name, first_name_cln, midl_name, midl_name_cln
  ) %>% 
  dplyr::filter(stringr::str_detect(
    first_name_cln, 
    "\\bDR\\b|\\bFATHER\\b|\\bIII\\b|\\bIV\\b|\\bJR\b|\\bMD\\b|\\bMISS\\b|\\bMR\\b|\\MRS\\b|\\bNMN\\b|\\bREV\\b|\\bSISTER\\b|\\bSR\\b")
  ) %>% 
  dplyr::distinct(first_name_cln, .keep_all = TRUE) %>% 
  dplyr::slice_sample(n = 20) %>% 
  gt::gt() %>% 
  gt::opt_row_striping() %>% 
  gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>% 
  gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name last_name_cln first_name first_name_cln midl_name midl_name_cln
d %>% 
  dplyr::select(
    last_name, last_name_cln, first_name, first_name_cln, midl_name, midl_name_cln
  ) %>% 
  dplyr::filter(stringr::str_length(first_name_cln) == 0) %>% 
  dplyr::distinct(first_name_cln, .keep_all = TRUE) %>% 
  dplyr::slice_sample(n = 20) %>% 
  gt::gt() %>% 
  gt::opt_row_striping() %>% 
  gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>% 
  gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name last_name_cln first_name first_name_cln midl_name midl_name_cln
WARREN WARREN

3.0.7 Postprocess all name variables

  • map all spaces to empty strings (Remove variability of spacing.)
  • add missing value indicator variables for all name variables
# Function to postprocess all cleaned name variables
raw_entity_data_clean_postprocess_names <- function(
  d # data frame - raw entity data
) {
  d %>%
    dplyr::mutate(
      # remove all spaces
      last_name_cln  = last_name_cln  %>% stringr::str_remove_all(" "),
      first_name_cln = first_name_cln %>% stringr::str_remove_all(" "),
      midl_name_cln  = midl_name_cln  %>% stringr::str_remove_all(" "),

      # add missing value indicators
      last_name_cln_miss  = last_name_cln  == "",
      first_name_cln_miss = first_name_cln == "",
      midl_name_cln_miss  = midl_name_cln  == ""
    )
}

Apply and check function.

# This is only a partial check

# Before

d %>% 
  dplyr::select(
    last_name, last_name_cln, first_name, first_name_cln, midl_name, midl_name_cln
  ) %>% 
  dplyr::filter(stringr::str_detect(first_name_cln, " ")) %>% 
  dplyr::distinct(first_name_cln, .keep_all = TRUE) %>% 
  dplyr::slice_sample(n = 20) %>% 
  gt::gt() %>% 
  gt::opt_row_striping() %>% 
  gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>% 
  gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name last_name_cln first_name first_name_cln midl_name midl_name_cln
ZAVADA ZAVADA CLELLA-ANN CLELLA ANN
MCGIRT MCGIRT QUA-ANDREA QUA ANDREA SHAQUITA SHAQUITA
AUTRY AUTRY LILLIAN D LILLIAN D FUSSELL FUSSELL
MEGINNIS MEGINNIS F MARION F MARION
PRATT PRATT ANDREA GAIL ANDREA GAIL PEOPLES PEOPLES
MITCHENER MITCHENER A STEVENSON A STEVENSON
WESSON WESSON VERNESSA ANN VERNESSA ANN MURPHY MURPHY
JENKINS JENKINS EL CISCO EL CISCO
MCCALL MCCALL EDITH LORETTA EDITH LORETTA HARMON HARMON
THOMAS THOMAS JANE NICOLE JANE NICOLE MILTON MILTON
CHENG CHENG MEI-JUE MEI JUE LEE LEE
MATHIS MATHIS L STEVE L STEVE
SEMANS SEMANS MARY B MARY B T T
BURCH BURCH FAITH R FAITH R BAKER BAKER
BENJAMIN BENJAMIN J ERIC J ERIC
PARKS PARKS M DELORES M DELORES
EVERHART EVERHART SHEILA NANETTE SHEILA NANETTE COMBS COMBS
LEGGETT LEGGETT ANNA ALICIA ANNA ALICIA DANNIELLE DANNIELLE
SCOTT SCOTT OLIA MAE OLIA MAE LOCKLEAR LOCKLEAR
PARRIS PARRIS PORTIA DENISE PORTIA DENISE MCDANIEL MCDANIEL
# Apply

dim(d)
[1] 4099699      26
d <- d %>% raw_entity_data_clean_postprocess_names()

dim(d)
[1] 4099699      29
# After

d %>% 
  dplyr::select(
    last_name, last_name_cln, first_name, first_name_cln, midl_name, midl_name_cln
  ) %>% 
  dplyr::filter(stringr::str_detect(first_name_cln, " ")) %>% 
  dplyr::distinct(first_name_cln, .keep_all = TRUE) %>% 
  dplyr::slice_sample(n = 20) %>% 
  gt::gt() %>% 
  gt::opt_row_striping() %>% 
  gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>% 
  gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name last_name_cln first_name first_name_cln midl_name midl_name_cln

4 Examples

Define a function to apply all the name cleaning actions.

# Function to apply all the cleaning actions
raw_entity_data_clean_all <- function(
  d # data frame - raw entity data
) {
  d %>%
    raw_entity_data_clean_age() %>%
    raw_entity_data_clean_preprocess_char() %>%
    raw_entity_data_clean_all_names() %>%
    raw_entity_data_clean_last_name() %>%
    raw_entity_data_clean_middle_name() %>%
    raw_entity_data_clean_first_name() %>%
    raw_entity_data_clean_postprocess_names()
}

This function is equivalent to the incremental processing that occurred in the earlier chunks.

Show some examples of the cleaned data.

Quick distributions

d %>% 
  dplyr::select(ends_with("cln"), ends_with("_miss")) %>% 
  skimr::skim()
Table 4.1: Data summary
Name Piped data
Number of rows 4099699
Number of columns 8
_______________________
Column type frequency:
character 3
logical 4
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
last_name_cln 0 1 0 20 18 189999 0
first_name_cln 0 1 0 18 15 124073 0
midl_name_cln 0 1 0 18 254774 172071 0

Variable type: logical

skim_variable n_missing complete_rate mean count
age_cln_miss 0 1 0.01 FAL: 4068644, TRU: 31055
last_name_cln_miss 0 1 0.00 FAL: 4099681, TRU: 18
first_name_cln_miss 0 1 0.00 FAL: 4099684, TRU: 15
midl_name_cln_miss 0 1 0.06 FAL: 3844925, TRU: 254774

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age_cln 0 1 46.3 17.71 0 33 45 58 104 ▁▇▇▃▁

4.1 Age

d %>% 
  dplyr::select(age, age_cln, age_cln_miss) %>% 
  dplyr::group_by(age_cln_miss) %>% 
  dplyr::distinct(age, .keep_all = TRUE) %>% 
  dplyr::slice_sample(n = 10) %>% 
  gt::gt() %>% 
  gt::opt_row_striping() %>% 
  gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>% 
  gt::fmt_missing(columns = everything(), missing_text = "<NA>")
age age_cln
FALSE
70 70
64 64
84 84
19 19
99 99
38 38
88 88
37 37
92 92
54 54
TRUE
116 0
142 0
163 0
111 0
12 0
109 0
150 0
117 0
7 0
106 0

4.2 Last name

d %>% 
  dplyr::select(
    last_name, last_name_cln, last_name_cln_miss,
    first_name, first_name_cln, first_name_cln_miss,
    midl_name, midl_name_cln, midl_name_cln_miss
  ) %>% 
  dplyr::group_by(last_name_cln_miss) %>% 
  dplyr::distinct(last_name, .keep_all = TRUE) %>% 
  dplyr::slice_sample(n = 10) %>% 
  gt::gt() %>% 
  gt::opt_row_striping() %>% 
  gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>% 
  gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name last_name_cln first_name first_name_cln first_name_cln_miss midl_name midl_name_cln midl_name_cln_miss
FALSE
CHITTILLA CHITTILLA RAMA RAMA FALSE KRISHNA KRISHNA FALSE
SWERTFEGER SWERTFEGER ANN ANN FALSE H H FALSE
CORKUM CORKUM DONALD DONALD FALSE TRUE
GWVARA GWVARA JENNIFER JENNIFER FALSE ARLENE ARLENE FALSE
MEIDL MEIDL BARBARA BARBARA FALSE MAE MAE FALSE
CLAY-PUGH CLAYPUGH ELIZABETH ELIZABETH FALSE R R FALSE
BAEZ CRUZ BAEZCRUZ ALEXANDER ALEXANDER FALSE TRUE
ABOUTA ABOUTA ANDRE ANDRE FALSE TRUE
DANIELS-WRIGHT DANIELSWRIGHT BETTY BETTY FALSE TRUE
CRONISER CRONISER TARA TARA FALSE CHAMBERS CHAMBERS FALSE
TRUE
K NIUH NIUH FALSE TRUE
S PETER PETER FALSE THOMAS THOMAS FALSE
H MOIH MOIH FALSE TRUE
M COY COY FALSE FAY FAY FALSE
J J J FALSE TRUE
U RAYMOND RAYMOND FALSE TRUE
Y PRUM PRUM FALSE TRUE
X MARCUS MARCUS FALSE TRUE
A CHUH CHUH FALSE TRUE
N RENEE RENEE FALSE VIVIAN VIVIAN FALSE
d %>% 
  dplyr::select(
    last_name, last_name_cln, last_name_cln_miss,
    first_name, first_name_cln, first_name_cln_miss,
    midl_name, midl_name_cln, midl_name_cln_miss
  ) %>% 
  dplyr::filter(last_name != last_name_cln) %>% 
  dplyr::distinct(last_name, .keep_all = TRUE) %>% 
  dplyr::slice_sample(n = 20) %>% 
  gt::gt() %>% 
  gt::opt_row_striping() %>% 
  gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>% 
  gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name last_name_cln last_name_cln_miss first_name first_name_cln first_name_cln_miss midl_name midl_name_cln midl_name_cln_miss
D'ARIENZO DARIENZO FALSE PETER PETER FALSE F F FALSE
VAN DAM VANDAM FALSE VICKY VICKY FALSE MAYLE MAYLE FALSE
NEWMAN-SLADE NEWMANSLADE FALSE JUSTINE JUSTINE FALSE TRUE
CHASSE' CHASSE FALSE JOSEPH JOSEPH FALSE TRUE
TAMES-WALTON TAMESWALTON FALSE KATHRYN KATHRYN FALSE ANN ANN FALSE
SHADE DUYS SHADEDUYS FALSE JANE JANE FALSE SPERRY SPERRY FALSE
MCVICKER-DAVIS MCVICKERDAVIS FALSE NANCY NANCY FALSE M M FALSE
LUCAS-YE LUCASYE FALSE KELLY KELLY FALSE MARIE MARIE FALSE
ROBINSON-STOVALL ROBINSONSTOVALL FALSE CHEESET CHEESET FALSE TRUE
SHIELDS-BABER SHIELDSBABER FALSE CAROLYN CAROLYN FALSE JOAN JOAN FALSE
TOLAND-WOOD TOLANDWOOD FALSE BARBARA BARBARA FALSE TRUE
WASHINGTON-COLEY WASHINGTONCOLEY FALSE KARA KARA FALSE G G FALSE
SANCHEZ-COLLINS SANCHEZCOLLINS FALSE MICHAEL MICHAEL FALSE JAMES JAMES FALSE
BUICE-LUCIANI BUICELUCIANI FALSE JILL JILL FALSE TRUE
PEELE-BUTCHER PEELEBUTCHER FALSE TIANDRA TIANDRA FALSE LASHAWN LASHAWN FALSE
BURCH-FAULKNER BURCHFAULKNER FALSE DIANNE DIANNE FALSE MORGAN MORGAN FALSE
POMPEY-ROSS POMPEYROSS FALSE TAMMY TAMMY FALSE M M FALSE
VAN GILDER VANGILDER FALSE JOE JOE FALSE J J FALSE
ZIEGENFUSS-MENDOZA ZIEGENFUSSMENDOZA FALSE JAIME JAIME FALSE ANN ANN FALSE
FIELDS-MINKINS FIELDSMINKINS FALSE ALICIA ALICIA FALSE M M FALSE

4.3 First name

d %>% 
  dplyr::select(
    last_name, last_name_cln, last_name_cln_miss,
    first_name, first_name_cln, first_name_cln_miss,
    midl_name, midl_name_cln, midl_name_cln_miss
  ) %>% 
  dplyr::group_by(first_name_cln_miss) %>% 
  dplyr::distinct(first_name, .keep_all = TRUE) %>% 
  dplyr::slice_sample(n = 10) %>% 
  gt::gt() %>% 
  gt::opt_row_striping() %>% 
  gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>% 
  gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name last_name_cln last_name_cln_miss first_name first_name_cln midl_name midl_name_cln midl_name_cln_miss
FALSE
HOBSON HOBSON FALSE SAHNIA SAHNIA DENISE DENISE FALSE
SPIVEY SPIVEY FALSE DAUGHTREY DAUGHTREY W W FALSE
LIVERIS LIVERIS FALSE JUDITH MICHEL JUDITHMICHEL SAFFORD SAFFORD FALSE
EVENSON EVENSON FALSE REGINALE REGINALE GARY GARY FALSE
TERRY TERRY FALSE JONNELL JONNELL WAYNE WAYNE FALSE
LANCASTER LANCASTER FALSE DO'RANNE DORANNE M M FALSE
LEE LEE FALSE AABIDAH AABIDAH AAZIZAH AAZIZAH FALSE
SHAHRIER SHAHRIER FALSE JARIN JARIN SULTANA SULTANA FALSE
BRODGEN BRODGEN FALSE ANDIE ANDIE MICHELE MICHELE FALSE
ALLEN ALLEN FALSE TINA-DIANNE TINADIANNE PATE PATE FALSE
TRUE
KANE KANE FALSE MR TRUE
LUU LUU FALSE MRS TRUE
TOOLE TOOLE FALSE JR TRUE
WARREN WARREN FALSE TRUE
d %>% 
  dplyr::select(
    last_name, last_name_cln, last_name_cln_miss,
    first_name, first_name_cln, first_name_cln_miss,
    midl_name, midl_name_cln, midl_name_cln_miss
  ) %>% 
  dplyr::filter(first_name != first_name_cln) %>% 
  dplyr::distinct(first_name, .keep_all = TRUE) %>% 
  dplyr::slice_sample(n = 20) %>% 
  gt::gt() %>% 
  gt::opt_row_striping() %>% 
  gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>% 
  gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name last_name_cln last_name_cln_miss first_name first_name_cln first_name_cln_miss midl_name midl_name_cln midl_name_cln_miss
WRIGHT WRIGHT FALSE LA-CRETIA LACRETIA FALSE R R FALSE
WILSON WILSON FALSE LESSIE CATHERIN LESSIECATHERIN FALSE M M FALSE
OWEN OWEN FALSE ROBYN-LEIGH ROBYNLEIGH FALSE FISHER FISHER FALSE
COOMBS COOMBS FALSE ELIZABETH NICOL ELIZABETHNICOL FALSE SMITH SMITH FALSE
HARRIS HARRIS FALSE MARY LETITIA MARYLETITIA FALSE S S FALSE
MOORE MOORE FALSE MIRANDA JO MIRANDAJO FALSE HOLDER HOLDER FALSE
GUSLER GUSLER FALSE SUSAN LEIGH SUSANLEIGH FALSE SNEAD SNEAD FALSE
OHLHAUT OHLHAUT FALSE A MARTIN AMARTIN FALSE TRUE
TIPTON TIPTON FALSE JULIE LYNNE JULIELYNNE FALSE SATTER SATTER FALSE
TANKSLEY TANKSLEY FALSE KATHARINA-LUISE KATHARINALUISE FALSE CLEAVER CLEAVER FALSE
WHITE WHITE FALSE KERI ANN KERIANN FALSE TRUE
CALHOUN CALHOUN FALSE PAULA L PAULAL FALSE BESS BESS FALSE
JEN JEN FALSE YUAN PEI YUANPEI FALSE P P FALSE
HOLMES HOLMES FALSE IDA LUCILLE IDALUCILLE FALSE AVERY AVERY FALSE
HINSON HINSON FALSE JANE-L JANEL FALSE O'TUEL OTUEL FALSE
SANDERS SANDERS FALSE DONETTE LEE DONETTELEE FALSE WALLACE WALLACE FALSE
MEACHAM MEACHAM FALSE MARY IRENE MARYIRENE FALSE MABE MABE FALSE
CHANG CHANG FALSE JEN-YEA JENYEA FALSE TRUE
KING KING FALSE LA-VAITER LAVAITER FALSE MARIE MARIE FALSE
MELVIN MELVIN FALSE ALLIE JANE ALLIEJANE FALSE ELNORA ELNORA FALSE
d %>% 
  dplyr::select(
    last_name, last_name_cln, last_name_cln_miss,
    first_name, first_name_cln, first_name_cln_miss,
    midl_name, midl_name_cln, midl_name_cln_miss
  ) %>% 
  dplyr::filter(stringr::str_detect(first_name, "SISTER")) %>% 
  dplyr::distinct(first_name, .keep_all = TRUE) %>% 
  dplyr::slice_sample(n = 20) %>% 
  gt::gt() %>% 
  gt::opt_row_striping() %>% 
  gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>% 
  gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name last_name_cln last_name_cln_miss first_name first_name_cln first_name_cln_miss midl_name midl_name_cln midl_name_cln_miss
ROSS ROSS FALSE SISTER S FALSE S TRUE
TANCRAITOR TANCRAITOR FALSE SISTER MAXINE MAXINE FALSE ELIZABETH ELIZABETH FALSE
d %>% 
  dplyr::select(
    last_name, last_name_cln, last_name_cln_miss,
    first_name, first_name_cln, first_name_cln_miss,
    midl_name, midl_name_cln, midl_name_cln_miss
  ) %>% 
  dplyr::filter(first_name == "") %>% 
  dplyr::distinct(first_name, .keep_all = TRUE) %>% 
  dplyr::slice_sample(n = 20) %>% 
  gt::gt() %>% 
  gt::opt_row_striping() %>% 
  gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>% 
  gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name last_name_cln last_name_cln_miss first_name first_name_cln first_name_cln_miss midl_name midl_name_cln midl_name_cln_miss
FUQUA FUQUA FALSE MARY FALSE MARY TRUE

4.4 Middle name

d %>% 
  dplyr::select(
    last_name, last_name_cln, last_name_cln_miss,
    first_name, first_name_cln, first_name_cln_miss,
    midl_name, midl_name_cln, midl_name_cln_miss
  ) %>% 
  dplyr::group_by(midl_name_cln_miss) %>% 
  dplyr::distinct(midl_name, .keep_all = TRUE) %>% 
  dplyr::slice_sample(n = 10) %>% 
  gt::gt() %>% 
  gt::opt_row_striping() %>% 
  gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>% 
  gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name last_name_cln last_name_cln_miss first_name first_name_cln first_name_cln_miss midl_name midl_name_cln
FALSE
SIMS SIMS FALSE KRISTEN KRISTEN FALSE DENISE MORRIS DENISEMORRIS
LEE LEE FALSE KAILI KAILI FALSE HOUA HOUA
DANDO DANDO FALSE JOAN JOAN FALSE NUFFER NUFFER
WRIGHT WRIGHT FALSE BOBBIE BOBBIE FALSE TAWNAT TAWNAT
SCHOOLEY SCHOOLEY FALSE JEAN JEAN FALSE NINNETTE NINNETTE
JONES JONES FALSE KIA KIA FALSE SHAUMON SHAUMON
HEMINGWAY HEMINGWAY FALSE RAYSIL RAYSIL FALSE RAMON FITZGERAL RAMONFITZGERAL
DAWSON DAWSON FALSE DEREK DEREK FALSE DESGALIER DESGALIER
WILLIAMS WILLIAMS FALSE JENNIFER JENNIFER FALSE MICHELLE LARKIN MICHELLELARKIN
COX COX FALSE CYNTHIA CYNTHIA FALSE DENISE HOPPER DENISEHOPPER
TRUE
FEATHERSTONE FEATHERSTONE FALSE GEORGE GEORGE FALSE 111
CROSS CROSS FALSE EUGENE EUGENE FALSE IV
ROSE ROSE FALSE FRANK FRANK FALSE JR.
FUQUA FUQUA FALSE WILLIAM FALSE WILLIAM
BOLES BOLES FALSE FAUSTINE FAUSTINE FALSE MISS
HOAGLAND HOAGLAND FALSE JR SANDY FALSE SANDY
BENNETT BENNETT FALSE HAROLD HAROLD FALSE 11
KELLY KELLY FALSE SISTER ANN FALSE ANN
ZIMMER ZIMMER FALSE CLIFFORD FALSE CLIFFORD
GIOVANNETTI GIOVANNETTI FALSE JACQUELINE JACQUELINE FALSE (NMN
d %>% 
  dplyr::select(
    last_name, last_name_cln, last_name_cln_miss,
    first_name, first_name_cln, first_name_cln_miss,
    midl_name, midl_name_cln, midl_name_cln_miss
  ) %>% 
  dplyr::filter(midl_name != midl_name_cln) %>% 
  dplyr::distinct(midl_name, .keep_all = TRUE) %>% 
  dplyr::slice_sample(n = 20) %>% 
  gt::gt() %>% 
  gt::opt_row_striping() %>% 
  gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>% 
  gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name last_name_cln last_name_cln_miss first_name first_name_cln first_name_cln_miss midl_name midl_name_cln midl_name_cln_miss
ALLISON ALLISON FALSE TRANSHER TRANSHER FALSE NAKIA UVOICE NAKIAUVOICE FALSE
DUBIS DUBIS FALSE PATRICIA PATRICIA FALSE GAIL RUSSELL GAILRUSSELL FALSE
UPCHURCH UPCHURCH FALSE GLORIA GLORIA FALSE MCINTOSH PERRY MCINTOSHPERRY FALSE
RICHARDSON RICHARDSON FALSE TAMMY TAMMY FALSE LYNN STAFFOR LYNNSTAFFOR FALSE
REAGAN REAGAN FALSE ANNA ANNA FALSE SCHELL-PATTON SCHELLPATTON FALSE
SMITH SMITH FALSE ANNA ANNA FALSE MARIE KIRKLAND MARIEKIRKLAND FALSE
STROUD STROUD FALSE JONATHAN JONATHAN FALSE ROSCOE FULLER ROSCOEFULLER FALSE
MILLER MILLER FALSE PATRICIA PATRICIA FALSE LYNN-WOODRING LYNNWOODRING FALSE
MCLEAN MCLEAN FALSE SHEILA SHEILA FALSE LYNN HOOPER LYNNHOOPER FALSE
BULLARD BULLARD FALSE JOSEPH JOSEPH FALSE MATTHEW KELLER MATTHEWKELLER FALSE
PORTER PORTER FALSE SHIRLEY SHIRLEY FALSE M PUETT MPUETT FALSE
HALL HALL FALSE JENNIFER JENNIFER FALSE ANNE STEVENS ANNESTEVENS FALSE
VALONE VALONE FALSE KATHRYN KATHRYN FALSE JO PRUITT JOPRUITT FALSE
TRAVIS TRAVIS FALSE CHRISTOPHER CHRISTOPHER FALSE WILLIAM CALVIN WILLIAMCALVIN FALSE
WHITE WHITE FALSE KATHLEEN KATHLEEN FALSE KEM PENG TAM KEMPENGTAM FALSE
THOMPSON THOMPSON FALSE ANITA ANITA FALSE SUE A SUEA FALSE
BURGESS BURGESS FALSE ANNIE ANNIE FALSE GENEVA DAVIS GENEVADAVIS FALSE
NICHOLSON NICHOLSON FALSE PATSY PATSY FALSE GAIL HOXIT GAILHOXIT FALSE
LEWALLEN LEWALLEN FALSE NATHANIEL NATHANIEL FALSE ANTHONY-WRAY ANTHONYWRAY FALSE
PIZZO PIZZO FALSE CHARLOTTE CHARLOTTE FALSE HELENE TRYMB HELENETRYMB FALSE

5 Putting it all together

Define a function to add a record ID variable.

# Function to add ID
# This is just transitional.
# Remove it after switching to 2008 snapshot with NCID
raw_entity_data_add_id <- function(
  d # data frame - raw entity data
) {
  d %>%
    dplyr::mutate(id = 1:nrow(.))
}

Define a function to apply all the steps from reading the raw entity data through to creating the cleaned entity data. This will be used in the core pipeline.

# Function to make the cleaned raw entity data
# Reads the data, applies exclusions
# and neatens it to where it's ready to calculate modelling features
raw_entity_data_make_clean <- function(
  file_path # character - file path usable by vroom
  # value - data frame
) {
  raw_entity_data_read(file_path) %>%
    raw_entity_data_excl_status() %>%
    raw_entity_data_excl_test() %>%
    raw_entity_data_drop_novar() %>%
    raw_entity_data_parse_dates() %>%
    raw_entity_data_drop_admin() %>%
    raw_entity_data_drop_demog() %>%
    raw_entity_data_clean_all() %>%
    raw_entity_data_add_id()
}

Timing

Computation time (excl. render): 422.009 sec elapsed

sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.10

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
 [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] tidyr_1.1.3      ggplot2_3.3.3    forcats_0.5.1    lubridate_1.7.10
 [5] skimr_2.1.3      stringr_1.4.0    tibble_3.1.2     vroom_1.4.0     
 [9] gt_0.3.0         dplyr_1.0.6      fs_1.5.0         here_1.0.1      
[13] tictoc_1.0.1     targets_0.4.2   

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.1  xfun_0.23         bslib_0.2.5       repr_1.1.3       
 [5] purrr_0.3.4       colorspace_2.0-1  vctrs_0.3.8       generics_0.1.0   
 [9] htmltools_0.5.1.1 base64enc_0.1-3   yaml_2.2.1        utf8_1.2.1       
[13] rlang_0.4.11      jquerylib_0.1.4   later_1.2.0       pillar_1.6.1     
[17] glue_1.4.2        withr_2.4.2       bit64_4.0.5       lifecycle_1.0.0  
[21] munsell_0.5.0     gtable_0.3.0      workflowr_1.6.2   codetools_0.2-18 
[25] evaluate_0.14     knitr_1.33        callr_3.7.0       httpuv_1.6.1     
[29] ps_1.6.0          parallel_4.1.0    fansi_0.4.2       highr_0.9        
[33] Rcpp_1.0.6        backports_1.2.1   checkmate_2.0.0   renv_0.13.2      
[37] promises_1.2.0.1  scales_1.1.1      jsonlite_1.7.2    bit_4.0.4        
[41] digest_0.6.27     stringi_1.6.2     bookdown_0.22     processx_3.5.2   
[45] rprojroot_2.0.2   grid_4.1.0        cli_2.5.0         tools_4.1.0      
[49] magrittr_2.0.1    sass_0.4.0        crayon_1.4.1      whisker_0.4      
[53] pkgconfig_2.0.3   ellipsis_0.3.2    data.table_1.14.0 rmarkdown_2.8    
[57] R6_2.5.0          igraph_1.2.6      git2r_0.28.0      compiler_4.1.0