Last updated: 2021-05-27
Checks: 7 0
Knit directory:
fa_sim_cal/
This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20201104)
was run prior to running the code in the R Markdown file.
Setting a seed ensures that any results that rely on randomness, e.g.
subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version a6fb2e3. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the
analysis have been committed to Git prior to generating the results (you can
use wflow_publish
or wflow_git_commit
). workflowr only
checks the R Markdown file, but you know if there are other scripts or data
files that it depends on. Below is the status of the Git repository when the
results were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: .tresorit/
Ignored: _targets/
Ignored: data/VR_20051125.txt.xz
Ignored: data/VR_Snapshot_20081104.txt.xz
Ignored: renv/library/
Ignored: renv/local/
Ignored: renv/staging/
Unstaged changes:
Modified: analysis/index.Rmd
Modified: analysis/m_00_status.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made
to the R Markdown (analysis/m_01_9_clean_vars.Rmd
) and HTML (docs/m_01_9_clean_vars.html
)
files. If you’ve configured a remote Git repository (see
?wflow_git_remote
), click on the hyperlinks in the table below to
view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | 5c06e88 | Ross Gayler | 2021-05-26 | WIP |
html | 5c06e88 | Ross Gayler | 2021-05-26 | WIP |
Rmd | 3ca94f0 | Ross Gayler | 2021-05-24 | WIP |
Rmd | 6421aeb | Ross Gayler | 2021-04-04 | wip |
Rmd | 411de1e | Ross Gayler | 2021-04-04 | WIP |
html | 411de1e | Ross Gayler | 2021-04-04 | WIP |
Rmd | 0bd4a5f | Ross Gayler | 2021-04-03 | WIP |
# NOTE this notebook can be run manually or automatically by {targets}
# So load the packages required by this notebook here
# rather than relying on _targets.R to load them.
# Set up the project environment, because {workflowr} knits each Rmd file
# in a new R session, and doesn't execute the project .Rprofile
library(targets) # access data from the targets cache
library(tictoc) # capture execution time
library(here) # construct file paths relative to project root
here() starts at /home/ross/RG/projects/academic/entity_resolution/fa_sim_cal_TOP/fa_sim_cal
library(fs) # file system operations
library(dplyr) # data wrangling
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(gt) # table formatting
library(vroom) # fast reading of delimited text files
library(tibble) # enhanced data frames
library(stringr) # string matching
library(skimr) # compact summary of each variable
library(lubridate) # date parsing
Attaching package: 'lubridate'
The following objects are masked from 'package:base':
date, intersect, setdiff, union
library(forcats) # manipulation of factors
library(ggplot2) # graphics
library(tidyr) # data tidying
#
# start the execution time clock
tictoc::tic("Computation time (excl. render)")
# Get the path to the raw entity data file
# This is a target managed by {targets}
f_entity_raw_tsv <- tar_read(c_raw_entity_data_file)
The aim of this set of meta notebooks is to work out how to read the raw
entity data. and get it sufficiently neatened so that we can construct
standardised names and modelling features without needing any further
neatening. To be clear, the target (c_raw_entity_data
) corresponding
to the objective of this set of notebooks is the neatened raw data,
before constructing any modelling features.
This notebook documents the cleaning of all the variables based on the issues discovered in the previous analyses.
It also adds a record ID variable.
The name variables, last_name
, first_name
, and midl_name
, will
definitely be used in compatibility modelling.
We intend to use the one snapshot file as both the database to be queried and as the set of queries. Consequently, strictly speaking, we don’t need to standardise the name variables because the database and query records are guaranteed to be identical (they will literally be the same record). However, we will look at the name variables with an eye to standardisation because it is never a good idea to statistically model data without having an idea about the quality of the data. We will apply some basic standardisation to the name variables, if appropriate, because it parallels what would be necessary in practice.
The demographic variables sex
, age
, birth_place
, and the
administrative variable county_desc
may be used as predictors and/or
blocking variables.
The remainder of the variables (residence and administrative variables) will be kept in case they are useful for manually assessing claimed matches.
Standardisation will be applied to the name variables last_name
,
first_name
, and midl_name
. This attempts to remove variation that is
probably irrelevant to identity (e.g. case, punctuation, and spacing).
In the previous notebooks I have converted empty strings to missing
values (NA_character_
in R). This was convenient because table()
and
skim()
count missing values as a separate category. However, modelling
is a different kettle of fish.
In modelling, we want to get an estimated probability of identity match
for every query, regardless of how many attributes have missing values.
Typical modelling functions do not tolerate any missing (NA
) values in
predictors. If any of the predictors is missing then the estimate is
also missing.
We avoid that problem by transforming the missing values into some
nonmissing value and creating an extra variable to indicate the
missingness. This will be done for the name variables last_name
,
first_name
, and midl_name
, and the demographic variable age
. (This
is not necessary for birth_place
because “missing” is just another
valid level of the variable.
The cleanup actions to be applied are (in order):
age
convert from string to integer
add missing value indicator and set to true if age < 17 or age > 104
if age missing indicator is true, set age to 0
preprocess all character variables
map missing to empty string
map lower case letters to upper case
all names
map each non-alphanumeric character to a space (Remove variability of punctuation while preserving word boundaries.)
map words 11, 111, 1111 to words II, III, IIII (Correct substitution of 1 for I in generation suffixes.)
if name contains zero and no other digits, map zero to O (Correct substitution of 0 for O in names.)
map each remaining digit to an empty string (Remove random digit insertions)
last name
map words DR, II, III, IIII, IV, JR, MD, SR to empty string
if number of letters in last name = 1, map name to empty string
middle name
map words AKA, DR, II, III, IV, JR, MD, MISS, MR, MRS, MS, NMN, NN, REV, SR to empty string
first name
map words DR, FATHER, III, IV, JR, MD, MISS, MR, MRS, NMN, REV, SISTER, SR to empty string
if number of letters in first name = 0, move first word of middle name to first name
postprocess all name variables
map all spaces to empty strings (Remove variability of spacing.)
add missing value indicator variables for all name variables
Read the raw entity data file using the previously defined functions
raw_entity_data_read()
, raw_entity_data_excl_status()
,
raw_entity_data_excl_test()
, raw_entity_data_drop_novar()
,
raw_entity_data_parse_dates()
, and raw_entity_data_drop_cancel_dt()
.
# Show the data file name
fs::path_file(f_entity_raw_tsv)
[1] "VR_20051125.txt.xz"
d <- raw_entity_data_read(f_entity_raw_tsv) %>%
raw_entity_data_excl_status() %>%
raw_entity_data_excl_test() %>%
raw_entity_data_drop_novar() %>%
raw_entity_data_parse_dates() %>%
raw_entity_data_drop_admin() %>%
raw_entity_data_drop_demog()
dim(d)
[1] 4099699 21
# Function to clean age
raw_entity_data_clean_age <- function(
d # data frame - raw entity data
) {
d %>%
dplyr::mutate(
age_cln = as.integer(age),
age_cln_miss = ! dplyr::between(age_cln, 17, 104), # valid age range
age_cln = dplyr::if_else(age_cln_miss, 0L, age_cln)
)
}
Apply and check function.
# Apply
dim(d)
[1] 4099699 21
d <- d %>% raw_entity_data_clean_age()
dim(d)
[1] 4099699 23
# After
summary(d$age_cln)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 33.0 45.0 46.3 58.0 104.0
summary(d$age_cln_miss)
Mode FALSE TRUE
logical 4068644 31055
d %>%
dplyr::select(starts_with("age_cln")) %>%
dplyr::group_by(age_cln_miss) %>%
dplyr::summarise(qtl = quantile(age_cln) %>% list()) %>%
tidyr::unnest_wider(col = qtl) %>%
gt::gt() %>%
gt::opt_row_striping() %>%
gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels())
age_cln_miss | 0% | 25% | 50% | 75% | 100% |
---|---|---|---|---|---|
FALSE | 17 | 33 | 45 | 58 | 104 |
TRUE | 0 | 0 | 0 | 0 | 0 |
Note that this is an in-place transformation of the character variables rather than adding the transformed values as new variables. This is because I am viewing this a light tidying rather than as creating distinctly new cleaned values.
# Helper function to preprocess one char variable
raw_entity_data_clean_preprocess_char_1 <- function(x) {
x %>%
tidyr::replace_na("") %>% # map NA to ""
stringr::str_to_upper() # map lower case to upper case
}
# Function to preprocess all char variables
raw_entity_data_clean_preprocess_char <- function(
d # data frame - raw entity data
) {
d %>%
dplyr::mutate(
across(where(is.character), raw_entity_data_clean_preprocess_char_1) # apply to all char vars
)
}
Apply and check function.
# This is only a partial check - look at a very small subset of variables
# Before
d %>%
dplyr::select(ends_with("_name")) %>%
skimr::skim()
Name | Piped data |
Number of rows | 4099699 |
Number of columns | 4 |
_______________________ | |
Column type frequency: | |
character | 4 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
last_name | 0 | 1.00 | 1 | 21 | 0 | 191996 | 0 |
first_name | 23 | 1.00 | 1 | 19 | 0 | 126589 | 0 |
midl_name | 252695 | 0.94 | 1 | 20 | 0 | 175742 | 0 |
street_name | 7 | 1.00 | 1 | 30 | 0 | 83244 | 0 |
d %>%
dplyr::select(ends_with("_name")) %>%
dplyr::filter(stringr::str_detect(first_name, "[a-z]")) %>%
gt::gt() %>%
gt::opt_row_striping() %>%
gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>%
gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name | first_name | midl_name | street_name |
---|---|---|---|
CARLAND | JoANN | CROOK | TRAVIS |
HUGHES | LaVERNE | MARY | FALCON |
BRAGG | JoANNE | A | CAMP CREEK |
NATIONS | JoANN | NMN | BRADLEY BRANCH |
SIMMONS | SiROBERT | JOHNSON | DEVONSHIRE |
BRITT | McCKINES | LAMONT | GRANTHAM SCHOOL |
STAFFORD | DeNEAL | ROBERT | COTTONWOOD |
CARROLL | McHILDIA | THOMPSON | GOVERNOR MOORE |
WATT | JoANN | <NA> | CARLYLE |
THOMPSON | LaSONYA | D | DERRYDOWNE |
FLEMING | JeROME | S | BEATTIES FORD |
# Apply
d <- d %>% raw_entity_data_clean_preprocess_char()
# After
d %>%
dplyr::select(ends_with("_name")) %>%
skimr::skim()
Name | Piped data |
Number of rows | 4099699 |
Number of columns | 4 |
_______________________ | |
Column type frequency: | |
character | 4 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
last_name | 0 | 1 | 1 | 21 | 0 | 191995 | 0 |
first_name | 0 | 1 | 0 | 19 | 23 | 126584 | 0 |
midl_name | 0 | 1 | 0 | 20 | 252695 | 175695 | 0 |
street_name | 0 | 1 | 0 | 30 | 7 | 83241 | 0 |
d %>%
dplyr::select(ends_with("_name")) %>%
dplyr::filter(stringr::str_detect(first_name, "[a-z]")) %>%
gt::gt() %>%
gt::opt_row_striping() %>%
gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>%
gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name | first_name | midl_name | street_name |
---|
# Helper function to fix zeroes in all names
# Map zero to O if there are no other digits in the string
raw_entity_data_clean_all_names_fix_zero <- function(x) { # x: vector of strings
dplyr::if_else(
stringr::str_detect( x, "0") & # if string contains zero AND
stringr::str_detect( x, "[1-9]", negate = TRUE), # string contains no other digits
stringr::str_replace_all( x, "0", "O"), # then map zero to O
x # else return x
)
}
# Helper function to apply all-name cleaning to one name
# Apply all-name cleaning
raw_entity_data_clean_all_names_1 <- function(x) { # x: vector of strings
x %>%
stringr::str_replace_all("[^ A-Z0-9]", " ") %>% # map non-alphanumeric to " "
stringr::str_replace_all( # fix generation suffixes
c("\\b11\\b" = "II",
"\\b111\\b" = "III",
"\\b1111\\b" = "IIII")
) %>%
raw_entity_data_clean_all_names_fix_zero() %>%
stringr::str_remove_all("[0-9]") %>% # map remaining digits to ""
stringr::str_squish() # remove excess whitespace
}
# Function to apply all-name cleaning to all names
raw_entity_data_clean_all_names <- function(
d # data frame - raw entity data
) {
d %>%
dplyr::mutate(
across(
.cols = c(last_name, first_name, midl_name), # apply to all name vars
.fns = raw_entity_data_clean_all_names_1,
.names = "{.col}_cln")
)
}
Apply and check function.
# This is only a partial check
# Apply
dim(d)
[1] 4099699 23
d <- d %>% raw_entity_data_clean_all_names()
dim(d)
[1] 4099699 26
# After
d %>%
dplyr::select(
last_name, last_name_cln, first_name, first_name_cln, midl_name, midl_name_cln
) %>%
dplyr::filter(stringr::str_detect(first_name, "[^ a-zA-Z0-9]")) %>%
dplyr::distinct(first_name, .keep_all = TRUE) %>%
dplyr::slice_sample(n = 20) %>%
gt::gt() %>%
gt::opt_row_striping() %>%
gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>%
gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name | last_name_cln | first_name | first_name_cln | midl_name | midl_name_cln |
---|---|---|---|---|---|
SMITH | SMITH | ROSALIN-ANN | ROSALIN ANN | WASHINGTON | WASHINGTON |
OWENS | OWENS | TA-JUANA | TA JUANA | MICHELLE | MICHELLE |
NELSON | NELSON | YONG-CHA | YONG CHA | CHA | CHA |
HARPER | HARPER | CHE-CHE | CHE CHE | ||
BAUDREAU | BAUDREAU | M-ROBERTA | M ROBERTA | D | D |
KUHL | KUHL | HANNE-LORE | HANNE LORE | HESS | HESS |
SMITH | SMITH | D'SHONE | D SHONE | LAKIA | LAKIA |
CONNER | CONNER | MARY-JANE | MARY JANE | BOHOLANO | BOHOLANO |
GRIMES | GRIMES | LA'NAE | LA NAE | TANISHA | TANISHA |
CURRY | CURRY | JULIANNA-LEE | JULIANNA LEE | BETTS | BETTS |
FLYNN | FLYNN | KERRI-LYNN | KERRI LYNN | ||
FREEMAN | FREEMAN | MONTE' | MONTE | TERRAN | TERRAN |
FORBES | FORBES | PATSY-DAWN | PATSY DAWN | WILLIAMS | WILLIAMS |
LEE | LEE | HEE-KWAN | HEE KWAN | ||
MULLEN | MULLEN | O'DEYNE | O DEYNE | B | B |
ADAMS | ADAMS | AL-DEAN | AL DEAN | ||
BELLAMY | BELLAMY | JA'VON | JA VON | BOBO | BOBO |
CLINTON | CLINTON | SHANTE' | SHANTE | LYVETTE | LYVETTE |
PHILLIPS | PHILLIPS | MEN-NESHIANNE | MEN NESHIANNE | QUNA | QUNA |
VELASCO | VELASCO | JOSE` | JOSE | L | L |
d %>%
dplyr::select(
last_name, last_name_cln, first_name, first_name_cln, midl_name, midl_name_cln
) %>%
dplyr::filter(stringr::str_detect(midl_name, "111*")) %>%
dplyr::distinct(midl_name, .keep_all = TRUE) %>%
dplyr::slice_sample(n = 20) %>%
gt::gt() %>%
gt::opt_row_striping() %>%
gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>%
gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name | last_name_cln | first_name | first_name_cln | midl_name | midl_name_cln |
---|---|---|---|---|---|
EDWARDS | EDWARDS | FRANK | FRANK | D 11 | D II |
FEATHERSTONE | FEATHERSTONE | GEORGE | GEORGE | 111 | III |
BENNETT | BENNETT | HAROLD | HAROLD | 11 | II |
d %>%
dplyr::select(
last_name, last_name_cln, first_name, first_name_cln, midl_name, midl_name_cln
) %>%
dplyr::filter(stringr::str_detect(first_name, "0")) %>%
dplyr::distinct(first_name, .keep_all = TRUE) %>%
dplyr::slice_sample(n = 20) %>%
gt::gt() %>%
gt::opt_row_striping() %>%
gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>%
gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name | last_name_cln | first_name | first_name_cln | midl_name | midl_name_cln |
---|---|---|---|---|---|
JOHNSON | JOHNSON | T0NYA | TONYA | BETH | BETH |
HODGINS | HODGINS | J0HNNY | JOHNNY | BURNETTE | BURNETTE |
SPRINKLE | SPRINKLE | L0RI | LORI | WOODWARD | WOODWARD |
BAREFOOT | BAREFOOT | B0BBY | BOBBY | R | R |
THOMAS | THOMAS | P0LLY | POLLY | BROWN | BROWN |
TANNAHILL | TANNAHILL | J0SEPH | JOSEPH | ERIC | ERIC |
KNOX | KNOX | ANTONI0 | ANTONIO | F | F |
TIDDY | TIDDY | J0HN | JOHN | F | F |
EDMONDS | EDMONDS | MARI0N | MARION | CAVINESS | CAVINESS |
MITCHELL | MITCHELL | SHANN0N | SHANNON | ARLINE | ARLINE |
GRAU | GRAU | TIM0THY | TIMOTHY | ||
COOPER | COOPER | RODRIG0 | RODRIGO | SALOMON | SALOMON |
BROWN | BROWN | ALONZ0 | ALONZO | ||
BLEDSOE | BLEDSOE | R0Y | ROY | JACK | JACK |
BUIE | BUIE | S0NTE | SONTE | Y | Y |
BLUE | BLUE | KAR0N | KARON | LEE | LEE |
COOPER | COOPER | C0RDELIA | CORDELIA | P | P |
WILLIAMS | WILLIAMS | M0NIKA | MONIKA | UDANA | UDANA |
AYALA | AYALA | ANDREA-0 | ANDREA O | ||
RAMIREZ | RAMIREZ | REYNALD0 | REYNALDO | G | G |
d %>%
dplyr::select(
last_name, last_name_cln, first_name, first_name_cln, midl_name, midl_name_cln
) %>%
dplyr::filter(stringr::str_detect(midl_name, "[2-9]")) %>%
dplyr::distinct(midl_name, .keep_all = TRUE) %>%
dplyr::slice_sample(n = 20) %>%
gt::gt() %>%
gt::opt_row_striping() %>%
gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>%
gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name | last_name_cln | first_name | first_name_cln | midl_name | midl_name_cln |
---|---|---|---|---|---|
GUIDO | GUIDO | DEANA | DEANA | LYNN2513 | LYNN |
BEACHAM | BEACHAM | HEATHER | HEATHER | ANDERSON9104576 | ANDERSON |
FAICLOTH | FAICLOTH | TIMOTHY | TIMOTHY | LOUIS7100 | LOUIS |
SMITH | SMITH | TRACY | TRACY | WAYNE030986 | WAYNE |
TOOMES | TOOMES | BRIAN | BRIAN | SCOTT3450 | SCOTT |
BECHTEL | BECHTEL | TERESA | TERESA | MARIE103062 | MARIE |
JONES | JONES | DONNA | DONNA | 10052004 | |
YOUNG | YOUNG | WANWYNE | WANWYNE | 4625 | |
BREEN | BREEN | TERRANCE | TERRANCE | MICHAEL146 | MICHAEL |
PHAIR | PHAIR | IDELL | IDELL | 8017 | |
PERKINS | PERKINS | TERESA | TERESA | ROSENBAUM3305 | ROSENBAUM |
WILLIS | WILLIS | DEJUAN | DEJUAN | 328 | |
PYRTLE | PYRTLE | PHILLIP | PHILLIP | W5RAY | WRAY |
PLESS | PLESS | JOAN | JOAN | WRIGHT2106 | WRIGHT |
GLOVER | GLOVER | DIONNE | DIONNE | LYNN1820 | LYNN |
FRENCH | FRENCH | SHNETTA | SHNETTA | ALEXANDER080572 | ALEXANDER |
SHUMAKER | SHUMAKER | RUTH | RUTH | ANN BURTON47 | ANN BURTON |
HOWERIN | HOWERIN | MICHAEL | MICHAEL | DALE401 | DALE |
BLAIR | BLAIR | ESSIE | ESSIE | MIZELLE25248249 | MIZELLE |
HILL | HILL | ZEB | ZEB | MITCHELL368 | MITCHELL |
# Helper function to remove words (w) from vector of strings (x)
raw_entity_data_clean_all_names_remove_words <- function(x, w) { # x, w: vectors of char (w = words to remove)
x %>%
stringr::str_remove_all(
pattern = paste0("\\b", w, "\\b", collapse = "|") #convert word list to regexp
) %>%
stringr::str_squish() # remove excess whitespace
}
# Function to apply last-name cleaning
raw_entity_data_clean_last_name <- function(
d # data frame - raw entity data
) {
d %>%
dplyr::mutate(
last_name_cln = last_name_cln %>% # remove special words
raw_entity_data_clean_all_names_remove_words(
c("DR", "II", "III", "IIII", "IV", "JR", "MD", "SR")
),
last_name_cln = dplyr::if_else( # remove very short names
stringr::str_length(last_name_cln) > 1,
last_name_cln,
""
)
)
}
Apply and check function.
# This is only a partial check
# Before
d %>%
dplyr::select(
last_name, last_name_cln, first_name, first_name_cln, midl_name, midl_name_cln
) %>%
dplyr::filter(stringr::str_detect(
last_name_cln,
"\\bDR\\b|\\bII\\b|\\bIII\\b|\\bIIII\\b|\\bIV\\b|\\bJR\b|\\bMD\\b|\\bSR\\b")
) %>%
dplyr::distinct(last_name_cln, .keep_all = TRUE) %>%
dplyr::slice_sample(n = 20) %>%
gt::gt() %>%
gt::opt_row_striping() %>%
gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>%
gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name | last_name_cln | first_name | first_name_cln | midl_name | midl_name_cln |
---|---|---|---|---|---|
FILLINGHAM, II | FILLINGHAM II | ROBERT | ROBERT | E | E |
MCGUIRT III | MCGUIRT III | JAMES | JAMES | WILLIAM | WILLIAM |
METTS II | METTS II | CAREY | CAREY | MONTGOMERY | MONTGOMERY |
REDFEARN III | REDFEARN III | WILBERT | WILBERT | ||
BLACKWELDER III | BLACKWELDER III | DWIGHT | DWIGHT | MCNAIRY | MCNAIRY |
WORD II | WORD II | JOE | JOE | NAHAN | NAHAN |
STIMSON SR | STIMSON SR | RICHARD | RICHARD | BARRETT | BARRETT |
GOODWIN III | GOODWIN III | WARD | WARD | ALEXANDER | ALEXANDER |
DOSS SR | DOSS SR | MICHAEL | MICHAEL | RAY | RAY |
GROUSE III | GROUSE III | CHARLES | CHARLES | J | J |
WILLIAMSON DR | WILLIAMSON DR | IRVIN | IRVIN | D | D |
HICKS SR | HICKS SR | WILFORD | WILFORD | LYTLE | LYTLE |
SMITH III | SMITH III | GUY | GUY | R | R |
BEATTY III | BEATTY III | CURTIS | CURTIS | M | M |
CHAPPELL III | CHAPPELL III | TRAVIS | TRAVIS | ||
CRITTENDON II | CRITTENDON II | WILLIAM | WILLIAM | BURRELL | BURRELL |
VAUGHN SR | VAUGHN SR | WALTER | WALTER | S | S |
GREEN II | GREEN II | BILLY | BILLY | HOWARD | HOWARD |
SATTERFIELD 111 | SATTERFIELD III | CHARLES | CHARLES | MASON | MASON |
COCKERHAM III | COCKERHAM III | BOBBY | BOBBY | LEE | LEE |
d %>%
dplyr::select(
last_name, last_name_cln, first_name, first_name_cln, midl_name, midl_name_cln
) %>%
dplyr::filter(stringr::str_length(last_name_cln) <= 1) %>%
dplyr::distinct(last_name_cln, .keep_all = TRUE) %>%
dplyr::slice_sample(n = 20) %>%
gt::gt() %>%
gt::opt_row_striping() %>%
gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>%
gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name | last_name_cln | first_name | first_name_cln | midl_name | midl_name_cln |
---|---|---|---|---|---|
M | M | COY | COY | FAY | FAY |
Y | Y | PRUM | PRUM | ||
J | J | J | J | ||
X | X | MARCUS | MARCUS | ||
U | U | RAYMOND | RAYMOND | ||
H | H | MOIH | MOIH | ||
N | N | RENEE | RENEE | VIVIAN | VIVIAN |
A | A | CHUH | CHUH | ||
S | S | PETER | PETER | THOMAS | THOMAS |
K | K | NIUH | NIUH | ||
R | R | MARY | MARY |
# Apply
dim(d)
[1] 4099699 26
d <- d %>% raw_entity_data_clean_last_name()
dim(d)
[1] 4099699 26
# After
d %>%
dplyr::select(
last_name, last_name_cln, first_name, first_name_cln, midl_name, midl_name_cln
) %>%
dplyr::filter(stringr::str_detect(
last_name_cln,
"\\bDR\\b|\\bII\\b|\\bIII\\b|\\bIIII\\b|\\bIV\\b|\\bJR\b|\\bMD\\b|\\bSR\\b")
) %>%
dplyr::distinct(last_name_cln, .keep_all = TRUE) %>%
dplyr::slice_sample(n = 20) %>%
gt::gt() %>%
gt::opt_row_striping() %>%
gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>%
gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name | last_name_cln | first_name | first_name_cln | midl_name | midl_name_cln |
---|
d %>%
dplyr::select(
last_name, last_name_cln, first_name, first_name_cln, midl_name, midl_name_cln
) %>%
dplyr::filter(stringr::str_length(last_name_cln) <= 1) %>%
dplyr::distinct(last_name_cln, .keep_all = TRUE) %>%
dplyr::slice_sample(n = 20) %>%
gt::gt() %>%
gt::opt_row_striping() %>%
gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>%
gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name | last_name_cln | first_name | first_name_cln | midl_name | midl_name_cln |
---|---|---|---|---|---|
M | COY | COY | FAY | FAY |
# Function to apply middle-name cleaning
raw_entity_data_clean_middle_name <- function(
d # data frame - raw entity data
) {
d %>%
dplyr::mutate(
midl_name_cln = midl_name_cln %>% # remove special words
raw_entity_data_clean_all_names_remove_words(
c("AKA", "DR", "II", "III", "IV", "JR", "MD", "MISS",
"MR", "MRS", "MS", "NMN", "NN", "REV", "SR")
)
)
}
Apply and check function.
# This is only a partial check
# Before
d %>%
dplyr::select(
last_name, last_name_cln, first_name, first_name_cln, midl_name, midl_name_cln
) %>%
dplyr::filter(stringr::str_detect(
midl_name_cln,
"\\bAKA\\b|\\bDR\\b|\\bII\\b|\\bIII\\b|\\bIV\\b|\\bJR\b|\\bMD\\b|\\bMISS\\b|\\bMR\\b|\\MRS\\b|\\bMS\\b|\\bNMN\\b|\\bNN\\b|\\bREV\\b|\\bSR\\b")
) %>%
dplyr::distinct(midl_name_cln, .keep_all = TRUE) %>%
dplyr::slice_sample(n = 20) %>%
gt::gt() %>%
gt::opt_row_striping() %>%
gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>%
gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name | last_name_cln | first_name | first_name_cln | midl_name | midl_name_cln |
---|---|---|---|---|---|
BALLOU | BALLOU | JAMES | JAMES | STAFFORD MRS | STAFFORD MRS |
SMITH | SMITH | S | S | LEROY MRS | LEROY MRS |
FOWLER | FOWLER | H | H | JACK MRS | JACK MRS |
GREENE | GREENE | LESTER | LESTER | D(NN) | D NN |
ARMSTON | ARMSTON | MILTON | MILTON | SR | SR |
WILLIAMS | WILLIAMS | ERVIN | ERVIN | W., SR., | W SR |
LAMBERT | LAMBERT | W | W | T MRS | T MRS |
BOWLES | BOWLES | J | J | VAN MRS | VAN MRS |
ARDIZZOLA | ARDIZZOLA | RUDY | RUDY | NMN | NMN |
PRICE | PRICE | THOMAS | THOMAS | EUGENE MRS | EUGENE MRS |
VETTER | VETTER | JOHN | JOHN | S DR | S DR |
JOHNSON | JOHNSON | FRED | FRED | ALAN SR | ALAN SR |
BENNETT | BENNETT | CHARLES | CHARLES | J MRS | J MRS |
HILDEBRAND | HILDEBRAND | JAMES | JAMES | OVERLAND MRS | OVERLAND MRS |
WARD | WARD | J | J | HERMAN MRS | HERMAN MRS |
LUPTON | LUPTON | DENNIS | DENNIS | WAYNE SR | WAYNE SR |
MOSS | MOSS | HERBERT | HERBERT | H MRS | H MRS |
VAN GORDER | VAN GORDER | CHARLES | CHARLES | OSCAR II | OSCAR II |
WHITNEY | WHITNEY | WILLIAM | WILLIAM | PRESTON SR | PRESTON SR |
KELLY | KELLY | DAVID | DAVID | LEE II | LEE II |
# Apply
dim(d)
[1] 4099699 26
d <- d %>% raw_entity_data_clean_middle_name()
dim(d)
[1] 4099699 26
# After
d %>%
dplyr::select(
last_name, last_name_cln, first_name, first_name_cln, midl_name, midl_name_cln
) %>%
dplyr::filter(stringr::str_detect(
midl_name_cln,
"\\bAKA\\b|\\bDR\\b|\\bII\\b|\\bIII\\b|\\bIV\\b|\\bJR\b|\\bMD\\b|\\bMISS\\b|\\bMR\\b|\\MRS\\b|\\bMS\\b|\\bNMN\\b|\\bNN\\b|\\bREV\\b|\\bSR\\b")
) %>%
dplyr::distinct(midl_name_cln, .keep_all = TRUE) %>%
dplyr::slice_sample(n = 20) %>%
gt::gt() %>%
gt::opt_row_striping() %>%
gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>%
gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name | last_name_cln | first_name | first_name_cln | midl_name | midl_name_cln |
---|
# Helper function
# If no first name, move first word of middle name to first name
raw_entity_data_clean_first_name_move_name <- function(d) { # d: data frame of entity data
has_first_name <- d$first_name_cln != ""
re_fword <- "^[A-Z]+\\b" # regular expression for first word
midl <- d$midl_name_cln
midl_head <- midl %>% # get first word
stringr::str_extract(re_fword) %>%
tidyr::replace_na("")
midl_tail <- midl %>% # get remainder of words
stringr::str_remove(re_fword) %>%
stringr::str_squish()
d %>%
dplyr::mutate(
first_name_cln = dplyr::if_else(has_first_name,
first_name_cln,
midl_head
),
midl_name_cln = dplyr::if_else(has_first_name,
midl_name_cln,
midl_tail
)
)
}
# Function to apply first-name cleaning
raw_entity_data_clean_first_name <- function(
d # data frame - raw entity data
) {
d %>%
dplyr::mutate(
first_name_cln = first_name_cln %>% # remove special words
raw_entity_data_clean_all_names_remove_words(
c("DR", "FATHER", "III", "IV", "JR", "MD", "MISS",
"MR", "MRS", "NMN", "REV", "SISTER", "SR")
)
) %>%
raw_entity_data_clean_first_name_move_name()
}
Apply and check function.
# This is only a partial check
# Before
d %>%
dplyr::select(
last_name, last_name_cln, first_name, first_name_cln, midl_name, midl_name_cln
) %>%
dplyr::filter(stringr::str_detect(
first_name_cln,
"\\bDR\\b|\\bFATHER\\b|\\bIII\\b|\\bIV\\b|\\bJR\b|\\bMD\\b|\\bMISS\\b|\\bMR\\b|\\MRS\\b|\\bNMN\\b|\\bREV\\b|\\bSISTER\\b|\\bSR\\b")
) %>%
dplyr::distinct(first_name_cln, .keep_all = TRUE) %>%
dplyr::slice_sample(n = 20) %>%
gt::gt() %>%
gt::opt_row_striping() %>%
gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>%
gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name | last_name_cln | first_name | first_name_cln | midl_name | midl_name_cln |
---|---|---|---|---|---|
COVINGTON | COVINGTON | EDNA(MRS | EDNA MRS | PERRY, JR) | PERRY |
BORDERS | BORDERS | EUGENE(NMN) | EUGENE NMN | ||
MCPHERSON | MCPHERSON | VAN III | VAN III | ||
BINGMAN | BINGMAN | GRAY MRS | GRAY MRS | ||
GRAHAM | GRAHAM | STEPHEN SR | STEPHEN SR | LEGREE | LEGREE |
HARRIS | HARRIS | MRS FRED | MRS FRED | W | W |
TANCRAITOR | TANCRAITOR | SISTER MAXINE | SISTER MAXINE | ELIZABETH | ELIZABETH |
COPELAND | COPELAND | IV | IV | EDWARD JAMES | EDWARD JAMES |
GOOLSBY | GOOLSBY | EUGENE MRS | EUGENE MRS | ||
GURGANIOUS | GURGANIOUS | JOHN MRS | JOHN MRS | HALLIE | HALLIE |
DAVENPORT | DAVENPORT | MRS H | MRS H | T | T |
HARRIS | HARRIS | MRS WILLIAM | MRS WILLIAM | W | W |
STOCKELL | STOCKELL | MD | MD | COOPER | COOPER |
HAMRICK | HAMRICK | JOHN R MRS | JOHN R MRS | MARGARET | MARGARET |
SMITH | SMITH | MRS WILLIAM JOE | MRS WILLIAM JOE | DAVIS | DAVIS |
POPE | POPE | MRS O | MRS O | N | N |
RIVES | RIVES | MRS WILBUR | MRS WILBUR | A | A |
NICHOLS | NICHOLS | DORIS ( MRS W | DORIS MRS W | ||
MOODY | MOODY | MRS WILLARD | MRS WILLARD | W | W |
BURKE | BURKE | MRS GEORGE | MRS GEORGE | W | W |
d %>%
dplyr::select(
last_name, last_name_cln, first_name, first_name_cln, midl_name, midl_name_cln
) %>%
dplyr::filter(stringr::str_length(first_name_cln) == 0) %>%
dplyr::distinct(first_name_cln, .keep_all = TRUE) %>%
dplyr::slice_sample(n = 20) %>%
gt::gt() %>%
gt::opt_row_striping() %>%
gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>%
gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name | last_name_cln | first_name | first_name_cln | midl_name | midl_name_cln |
---|---|---|---|---|---|
FUQUA | FUQUA | MARY | MARY |
# Apply
dim(d)
[1] 4099699 26
d <- d %>% raw_entity_data_clean_first_name()
dim(d)
[1] 4099699 26
# After
d %>%
dplyr::select(
last_name, last_name_cln, first_name, first_name_cln, midl_name, midl_name_cln
) %>%
dplyr::filter(stringr::str_detect(
first_name_cln,
"\\bDR\\b|\\bFATHER\\b|\\bIII\\b|\\bIV\\b|\\bJR\b|\\bMD\\b|\\bMISS\\b|\\bMR\\b|\\MRS\\b|\\bNMN\\b|\\bREV\\b|\\bSISTER\\b|\\bSR\\b")
) %>%
dplyr::distinct(first_name_cln, .keep_all = TRUE) %>%
dplyr::slice_sample(n = 20) %>%
gt::gt() %>%
gt::opt_row_striping() %>%
gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>%
gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name | last_name_cln | first_name | first_name_cln | midl_name | midl_name_cln |
---|
d %>%
dplyr::select(
last_name, last_name_cln, first_name, first_name_cln, midl_name, midl_name_cln
) %>%
dplyr::filter(stringr::str_length(first_name_cln) == 0) %>%
dplyr::distinct(first_name_cln, .keep_all = TRUE) %>%
dplyr::slice_sample(n = 20) %>%
gt::gt() %>%
gt::opt_row_striping() %>%
gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>%
gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name | last_name_cln | first_name | first_name_cln | midl_name | midl_name_cln |
---|---|---|---|---|---|
WARREN | WARREN |
# Function to postprocess all cleaned name variables
raw_entity_data_clean_postprocess_names <- function(
d # data frame - raw entity data
) {
d %>%
dplyr::mutate(
# remove all spaces
last_name_cln = last_name_cln %>% stringr::str_remove_all(" "),
first_name_cln = first_name_cln %>% stringr::str_remove_all(" "),
midl_name_cln = midl_name_cln %>% stringr::str_remove_all(" "),
# add missing value indicators
last_name_cln_miss = last_name_cln == "",
first_name_cln_miss = first_name_cln == "",
midl_name_cln_miss = midl_name_cln == ""
)
}
Apply and check function.
# This is only a partial check
# Before
d %>%
dplyr::select(
last_name, last_name_cln, first_name, first_name_cln, midl_name, midl_name_cln
) %>%
dplyr::filter(stringr::str_detect(first_name_cln, " ")) %>%
dplyr::distinct(first_name_cln, .keep_all = TRUE) %>%
dplyr::slice_sample(n = 20) %>%
gt::gt() %>%
gt::opt_row_striping() %>%
gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>%
gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name | last_name_cln | first_name | first_name_cln | midl_name | midl_name_cln |
---|---|---|---|---|---|
ZAVADA | ZAVADA | CLELLA-ANN | CLELLA ANN | ||
MCGIRT | MCGIRT | QUA-ANDREA | QUA ANDREA | SHAQUITA | SHAQUITA |
AUTRY | AUTRY | LILLIAN D | LILLIAN D | FUSSELL | FUSSELL |
MEGINNIS | MEGINNIS | F MARION | F MARION | ||
PRATT | PRATT | ANDREA GAIL | ANDREA GAIL | PEOPLES | PEOPLES |
MITCHENER | MITCHENER | A STEVENSON | A STEVENSON | ||
WESSON | WESSON | VERNESSA ANN | VERNESSA ANN | MURPHY | MURPHY |
JENKINS | JENKINS | EL CISCO | EL CISCO | ||
MCCALL | MCCALL | EDITH LORETTA | EDITH LORETTA | HARMON | HARMON |
THOMAS | THOMAS | JANE NICOLE | JANE NICOLE | MILTON | MILTON |
CHENG | CHENG | MEI-JUE | MEI JUE | LEE | LEE |
MATHIS | MATHIS | L STEVE | L STEVE | ||
SEMANS | SEMANS | MARY B | MARY B | T | T |
BURCH | BURCH | FAITH R | FAITH R | BAKER | BAKER |
BENJAMIN | BENJAMIN | J ERIC | J ERIC | ||
PARKS | PARKS | M DELORES | M DELORES | ||
EVERHART | EVERHART | SHEILA NANETTE | SHEILA NANETTE | COMBS | COMBS |
LEGGETT | LEGGETT | ANNA ALICIA | ANNA ALICIA | DANNIELLE | DANNIELLE |
SCOTT | SCOTT | OLIA MAE | OLIA MAE | LOCKLEAR | LOCKLEAR |
PARRIS | PARRIS | PORTIA DENISE | PORTIA DENISE | MCDANIEL | MCDANIEL |
# Apply
dim(d)
[1] 4099699 26
d <- d %>% raw_entity_data_clean_postprocess_names()
dim(d)
[1] 4099699 29
# After
d %>%
dplyr::select(
last_name, last_name_cln, first_name, first_name_cln, midl_name, midl_name_cln
) %>%
dplyr::filter(stringr::str_detect(first_name_cln, " ")) %>%
dplyr::distinct(first_name_cln, .keep_all = TRUE) %>%
dplyr::slice_sample(n = 20) %>%
gt::gt() %>%
gt::opt_row_striping() %>%
gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>%
gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name | last_name_cln | first_name | first_name_cln | midl_name | midl_name_cln |
---|
Define a function to apply all the name cleaning actions.
# Function to apply all the cleaning actions
raw_entity_data_clean_all <- function(
d # data frame - raw entity data
) {
d %>%
raw_entity_data_clean_age() %>%
raw_entity_data_clean_preprocess_char() %>%
raw_entity_data_clean_all_names() %>%
raw_entity_data_clean_last_name() %>%
raw_entity_data_clean_middle_name() %>%
raw_entity_data_clean_first_name() %>%
raw_entity_data_clean_postprocess_names()
}
This function is equivalent to the incremental processing that occurred in the earlier chunks.
Show some examples of the cleaned data.
Quick distributions
d %>%
dplyr::select(ends_with("cln"), ends_with("_miss")) %>%
skimr::skim()
Name | Piped data |
Number of rows | 4099699 |
Number of columns | 8 |
_______________________ | |
Column type frequency: | |
character | 3 |
logical | 4 |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
last_name_cln | 0 | 1 | 0 | 20 | 18 | 189999 | 0 |
first_name_cln | 0 | 1 | 0 | 18 | 15 | 124073 | 0 |
midl_name_cln | 0 | 1 | 0 | 18 | 254774 | 172071 | 0 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
age_cln_miss | 0 | 1 | 0.01 | FAL: 4068644, TRU: 31055 |
last_name_cln_miss | 0 | 1 | 0.00 | FAL: 4099681, TRU: 18 |
first_name_cln_miss | 0 | 1 | 0.00 | FAL: 4099684, TRU: 15 |
midl_name_cln_miss | 0 | 1 | 0.06 | FAL: 3844925, TRU: 254774 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
age_cln | 0 | 1 | 46.3 | 17.71 | 0 | 33 | 45 | 58 | 104 | ▁▇▇▃▁ |
d %>%
dplyr::select(age, age_cln, age_cln_miss) %>%
dplyr::group_by(age_cln_miss) %>%
dplyr::distinct(age, .keep_all = TRUE) %>%
dplyr::slice_sample(n = 10) %>%
gt::gt() %>%
gt::opt_row_striping() %>%
gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>%
gt::fmt_missing(columns = everything(), missing_text = "<NA>")
age | age_cln |
---|---|
FALSE | |
70 | 70 |
64 | 64 |
84 | 84 |
19 | 19 |
99 | 99 |
38 | 38 |
88 | 88 |
37 | 37 |
92 | 92 |
54 | 54 |
TRUE | |
116 | 0 |
142 | 0 |
163 | 0 |
111 | 0 |
12 | 0 |
109 | 0 |
150 | 0 |
117 | 0 |
7 | 0 |
106 | 0 |
d %>%
dplyr::select(
last_name, last_name_cln, last_name_cln_miss,
first_name, first_name_cln, first_name_cln_miss,
midl_name, midl_name_cln, midl_name_cln_miss
) %>%
dplyr::group_by(last_name_cln_miss) %>%
dplyr::distinct(last_name, .keep_all = TRUE) %>%
dplyr::slice_sample(n = 10) %>%
gt::gt() %>%
gt::opt_row_striping() %>%
gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>%
gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name | last_name_cln | first_name | first_name_cln | first_name_cln_miss | midl_name | midl_name_cln | midl_name_cln_miss |
---|---|---|---|---|---|---|---|
FALSE | |||||||
CHITTILLA | CHITTILLA | RAMA | RAMA | FALSE | KRISHNA | KRISHNA | FALSE |
SWERTFEGER | SWERTFEGER | ANN | ANN | FALSE | H | H | FALSE |
CORKUM | CORKUM | DONALD | DONALD | FALSE | TRUE | ||
GWVARA | GWVARA | JENNIFER | JENNIFER | FALSE | ARLENE | ARLENE | FALSE |
MEIDL | MEIDL | BARBARA | BARBARA | FALSE | MAE | MAE | FALSE |
CLAY-PUGH | CLAYPUGH | ELIZABETH | ELIZABETH | FALSE | R | R | FALSE |
BAEZ CRUZ | BAEZCRUZ | ALEXANDER | ALEXANDER | FALSE | TRUE | ||
ABOUTA | ABOUTA | ANDRE | ANDRE | FALSE | TRUE | ||
DANIELS-WRIGHT | DANIELSWRIGHT | BETTY | BETTY | FALSE | TRUE | ||
CRONISER | CRONISER | TARA | TARA | FALSE | CHAMBERS | CHAMBERS | FALSE |
TRUE | |||||||
K | NIUH | NIUH | FALSE | TRUE | |||
S | PETER | PETER | FALSE | THOMAS | THOMAS | FALSE | |
H | MOIH | MOIH | FALSE | TRUE | |||
M | COY | COY | FALSE | FAY | FAY | FALSE | |
J | J | J | FALSE | TRUE | |||
U | RAYMOND | RAYMOND | FALSE | TRUE | |||
Y | PRUM | PRUM | FALSE | TRUE | |||
X | MARCUS | MARCUS | FALSE | TRUE | |||
A | CHUH | CHUH | FALSE | TRUE | |||
N | RENEE | RENEE | FALSE | VIVIAN | VIVIAN | FALSE |
d %>%
dplyr::select(
last_name, last_name_cln, last_name_cln_miss,
first_name, first_name_cln, first_name_cln_miss,
midl_name, midl_name_cln, midl_name_cln_miss
) %>%
dplyr::filter(last_name != last_name_cln) %>%
dplyr::distinct(last_name, .keep_all = TRUE) %>%
dplyr::slice_sample(n = 20) %>%
gt::gt() %>%
gt::opt_row_striping() %>%
gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>%
gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name | last_name_cln | last_name_cln_miss | first_name | first_name_cln | first_name_cln_miss | midl_name | midl_name_cln | midl_name_cln_miss |
---|---|---|---|---|---|---|---|---|
D'ARIENZO | DARIENZO | FALSE | PETER | PETER | FALSE | F | F | FALSE |
VAN DAM | VANDAM | FALSE | VICKY | VICKY | FALSE | MAYLE | MAYLE | FALSE |
NEWMAN-SLADE | NEWMANSLADE | FALSE | JUSTINE | JUSTINE | FALSE | TRUE | ||
CHASSE' | CHASSE | FALSE | JOSEPH | JOSEPH | FALSE | TRUE | ||
TAMES-WALTON | TAMESWALTON | FALSE | KATHRYN | KATHRYN | FALSE | ANN | ANN | FALSE |
SHADE DUYS | SHADEDUYS | FALSE | JANE | JANE | FALSE | SPERRY | SPERRY | FALSE |
MCVICKER-DAVIS | MCVICKERDAVIS | FALSE | NANCY | NANCY | FALSE | M | M | FALSE |
LUCAS-YE | LUCASYE | FALSE | KELLY | KELLY | FALSE | MARIE | MARIE | FALSE |
ROBINSON-STOVALL | ROBINSONSTOVALL | FALSE | CHEESET | CHEESET | FALSE | TRUE | ||
SHIELDS-BABER | SHIELDSBABER | FALSE | CAROLYN | CAROLYN | FALSE | JOAN | JOAN | FALSE |
TOLAND-WOOD | TOLANDWOOD | FALSE | BARBARA | BARBARA | FALSE | TRUE | ||
WASHINGTON-COLEY | WASHINGTONCOLEY | FALSE | KARA | KARA | FALSE | G | G | FALSE |
SANCHEZ-COLLINS | SANCHEZCOLLINS | FALSE | MICHAEL | MICHAEL | FALSE | JAMES | JAMES | FALSE |
BUICE-LUCIANI | BUICELUCIANI | FALSE | JILL | JILL | FALSE | TRUE | ||
PEELE-BUTCHER | PEELEBUTCHER | FALSE | TIANDRA | TIANDRA | FALSE | LASHAWN | LASHAWN | FALSE |
BURCH-FAULKNER | BURCHFAULKNER | FALSE | DIANNE | DIANNE | FALSE | MORGAN | MORGAN | FALSE |
POMPEY-ROSS | POMPEYROSS | FALSE | TAMMY | TAMMY | FALSE | M | M | FALSE |
VAN GILDER | VANGILDER | FALSE | JOE | JOE | FALSE | J | J | FALSE |
ZIEGENFUSS-MENDOZA | ZIEGENFUSSMENDOZA | FALSE | JAIME | JAIME | FALSE | ANN | ANN | FALSE |
FIELDS-MINKINS | FIELDSMINKINS | FALSE | ALICIA | ALICIA | FALSE | M | M | FALSE |
d %>%
dplyr::select(
last_name, last_name_cln, last_name_cln_miss,
first_name, first_name_cln, first_name_cln_miss,
midl_name, midl_name_cln, midl_name_cln_miss
) %>%
dplyr::group_by(first_name_cln_miss) %>%
dplyr::distinct(first_name, .keep_all = TRUE) %>%
dplyr::slice_sample(n = 10) %>%
gt::gt() %>%
gt::opt_row_striping() %>%
gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>%
gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name | last_name_cln | last_name_cln_miss | first_name | first_name_cln | midl_name | midl_name_cln | midl_name_cln_miss |
---|---|---|---|---|---|---|---|
FALSE | |||||||
HOBSON | HOBSON | FALSE | SAHNIA | SAHNIA | DENISE | DENISE | FALSE |
SPIVEY | SPIVEY | FALSE | DAUGHTREY | DAUGHTREY | W | W | FALSE |
LIVERIS | LIVERIS | FALSE | JUDITH MICHEL | JUDITHMICHEL | SAFFORD | SAFFORD | FALSE |
EVENSON | EVENSON | FALSE | REGINALE | REGINALE | GARY | GARY | FALSE |
TERRY | TERRY | FALSE | JONNELL | JONNELL | WAYNE | WAYNE | FALSE |
LANCASTER | LANCASTER | FALSE | DO'RANNE | DORANNE | M | M | FALSE |
LEE | LEE | FALSE | AABIDAH | AABIDAH | AAZIZAH | AAZIZAH | FALSE |
SHAHRIER | SHAHRIER | FALSE | JARIN | JARIN | SULTANA | SULTANA | FALSE |
BRODGEN | BRODGEN | FALSE | ANDIE | ANDIE | MICHELE | MICHELE | FALSE |
ALLEN | ALLEN | FALSE | TINA-DIANNE | TINADIANNE | PATE | PATE | FALSE |
TRUE | |||||||
KANE | KANE | FALSE | MR | TRUE | |||
LUU | LUU | FALSE | MRS | TRUE | |||
TOOLE | TOOLE | FALSE | JR | TRUE | |||
WARREN | WARREN | FALSE | TRUE |
d %>%
dplyr::select(
last_name, last_name_cln, last_name_cln_miss,
first_name, first_name_cln, first_name_cln_miss,
midl_name, midl_name_cln, midl_name_cln_miss
) %>%
dplyr::filter(first_name != first_name_cln) %>%
dplyr::distinct(first_name, .keep_all = TRUE) %>%
dplyr::slice_sample(n = 20) %>%
gt::gt() %>%
gt::opt_row_striping() %>%
gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>%
gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name | last_name_cln | last_name_cln_miss | first_name | first_name_cln | first_name_cln_miss | midl_name | midl_name_cln | midl_name_cln_miss |
---|---|---|---|---|---|---|---|---|
WRIGHT | WRIGHT | FALSE | LA-CRETIA | LACRETIA | FALSE | R | R | FALSE |
WILSON | WILSON | FALSE | LESSIE CATHERIN | LESSIECATHERIN | FALSE | M | M | FALSE |
OWEN | OWEN | FALSE | ROBYN-LEIGH | ROBYNLEIGH | FALSE | FISHER | FISHER | FALSE |
COOMBS | COOMBS | FALSE | ELIZABETH NICOL | ELIZABETHNICOL | FALSE | SMITH | SMITH | FALSE |
HARRIS | HARRIS | FALSE | MARY LETITIA | MARYLETITIA | FALSE | S | S | FALSE |
MOORE | MOORE | FALSE | MIRANDA JO | MIRANDAJO | FALSE | HOLDER | HOLDER | FALSE |
GUSLER | GUSLER | FALSE | SUSAN LEIGH | SUSANLEIGH | FALSE | SNEAD | SNEAD | FALSE |
OHLHAUT | OHLHAUT | FALSE | A MARTIN | AMARTIN | FALSE | TRUE | ||
TIPTON | TIPTON | FALSE | JULIE LYNNE | JULIELYNNE | FALSE | SATTER | SATTER | FALSE |
TANKSLEY | TANKSLEY | FALSE | KATHARINA-LUISE | KATHARINALUISE | FALSE | CLEAVER | CLEAVER | FALSE |
WHITE | WHITE | FALSE | KERI ANN | KERIANN | FALSE | TRUE | ||
CALHOUN | CALHOUN | FALSE | PAULA L | PAULAL | FALSE | BESS | BESS | FALSE |
JEN | JEN | FALSE | YUAN PEI | YUANPEI | FALSE | P | P | FALSE |
HOLMES | HOLMES | FALSE | IDA LUCILLE | IDALUCILLE | FALSE | AVERY | AVERY | FALSE |
HINSON | HINSON | FALSE | JANE-L | JANEL | FALSE | O'TUEL | OTUEL | FALSE |
SANDERS | SANDERS | FALSE | DONETTE LEE | DONETTELEE | FALSE | WALLACE | WALLACE | FALSE |
MEACHAM | MEACHAM | FALSE | MARY IRENE | MARYIRENE | FALSE | MABE | MABE | FALSE |
CHANG | CHANG | FALSE | JEN-YEA | JENYEA | FALSE | TRUE | ||
KING | KING | FALSE | LA-VAITER | LAVAITER | FALSE | MARIE | MARIE | FALSE |
MELVIN | MELVIN | FALSE | ALLIE JANE | ALLIEJANE | FALSE | ELNORA | ELNORA | FALSE |
d %>%
dplyr::select(
last_name, last_name_cln, last_name_cln_miss,
first_name, first_name_cln, first_name_cln_miss,
midl_name, midl_name_cln, midl_name_cln_miss
) %>%
dplyr::filter(stringr::str_detect(first_name, "SISTER")) %>%
dplyr::distinct(first_name, .keep_all = TRUE) %>%
dplyr::slice_sample(n = 20) %>%
gt::gt() %>%
gt::opt_row_striping() %>%
gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>%
gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name | last_name_cln | last_name_cln_miss | first_name | first_name_cln | first_name_cln_miss | midl_name | midl_name_cln | midl_name_cln_miss |
---|---|---|---|---|---|---|---|---|
ROSS | ROSS | FALSE | SISTER | S | FALSE | S | TRUE | |
TANCRAITOR | TANCRAITOR | FALSE | SISTER MAXINE | MAXINE | FALSE | ELIZABETH | ELIZABETH | FALSE |
d %>%
dplyr::select(
last_name, last_name_cln, last_name_cln_miss,
first_name, first_name_cln, first_name_cln_miss,
midl_name, midl_name_cln, midl_name_cln_miss
) %>%
dplyr::filter(first_name == "") %>%
dplyr::distinct(first_name, .keep_all = TRUE) %>%
dplyr::slice_sample(n = 20) %>%
gt::gt() %>%
gt::opt_row_striping() %>%
gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>%
gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name | last_name_cln | last_name_cln_miss | first_name | first_name_cln | first_name_cln_miss | midl_name | midl_name_cln | midl_name_cln_miss |
---|---|---|---|---|---|---|---|---|
FUQUA | FUQUA | FALSE | MARY | FALSE | MARY | TRUE |
d %>%
dplyr::select(
last_name, last_name_cln, last_name_cln_miss,
first_name, first_name_cln, first_name_cln_miss,
midl_name, midl_name_cln, midl_name_cln_miss
) %>%
dplyr::group_by(midl_name_cln_miss) %>%
dplyr::distinct(midl_name, .keep_all = TRUE) %>%
dplyr::slice_sample(n = 10) %>%
gt::gt() %>%
gt::opt_row_striping() %>%
gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>%
gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name | last_name_cln | last_name_cln_miss | first_name | first_name_cln | first_name_cln_miss | midl_name | midl_name_cln |
---|---|---|---|---|---|---|---|
FALSE | |||||||
SIMS | SIMS | FALSE | KRISTEN | KRISTEN | FALSE | DENISE MORRIS | DENISEMORRIS |
LEE | LEE | FALSE | KAILI | KAILI | FALSE | HOUA | HOUA |
DANDO | DANDO | FALSE | JOAN | JOAN | FALSE | NUFFER | NUFFER |
WRIGHT | WRIGHT | FALSE | BOBBIE | BOBBIE | FALSE | TAWNAT | TAWNAT |
SCHOOLEY | SCHOOLEY | FALSE | JEAN | JEAN | FALSE | NINNETTE | NINNETTE |
JONES | JONES | FALSE | KIA | KIA | FALSE | SHAUMON | SHAUMON |
HEMINGWAY | HEMINGWAY | FALSE | RAYSIL | RAYSIL | FALSE | RAMON FITZGERAL | RAMONFITZGERAL |
DAWSON | DAWSON | FALSE | DEREK | DEREK | FALSE | DESGALIER | DESGALIER |
WILLIAMS | WILLIAMS | FALSE | JENNIFER | JENNIFER | FALSE | MICHELLE LARKIN | MICHELLELARKIN |
COX | COX | FALSE | CYNTHIA | CYNTHIA | FALSE | DENISE HOPPER | DENISEHOPPER |
TRUE | |||||||
FEATHERSTONE | FEATHERSTONE | FALSE | GEORGE | GEORGE | FALSE | 111 | |
CROSS | CROSS | FALSE | EUGENE | EUGENE | FALSE | IV | |
ROSE | ROSE | FALSE | FRANK | FRANK | FALSE | JR. | |
FUQUA | FUQUA | FALSE | WILLIAM | FALSE | WILLIAM | ||
BOLES | BOLES | FALSE | FAUSTINE | FAUSTINE | FALSE | MISS | |
HOAGLAND | HOAGLAND | FALSE | JR | SANDY | FALSE | SANDY | |
BENNETT | BENNETT | FALSE | HAROLD | HAROLD | FALSE | 11 | |
KELLY | KELLY | FALSE | SISTER | ANN | FALSE | ANN | |
ZIMMER | ZIMMER | FALSE | CLIFFORD | FALSE | CLIFFORD | ||
GIOVANNETTI | GIOVANNETTI | FALSE | JACQUELINE | JACQUELINE | FALSE | (NMN |
d %>%
dplyr::select(
last_name, last_name_cln, last_name_cln_miss,
first_name, first_name_cln, first_name_cln_miss,
midl_name, midl_name_cln, midl_name_cln_miss
) %>%
dplyr::filter(midl_name != midl_name_cln) %>%
dplyr::distinct(midl_name, .keep_all = TRUE) %>%
dplyr::slice_sample(n = 20) %>%
gt::gt() %>%
gt::opt_row_striping() %>%
gt::tab_style(style = cell_text(weight = "bold"), locations = cells_column_labels()) %>%
gt::fmt_missing(columns = everything(), missing_text = "<NA>")
last_name | last_name_cln | last_name_cln_miss | first_name | first_name_cln | first_name_cln_miss | midl_name | midl_name_cln | midl_name_cln_miss |
---|---|---|---|---|---|---|---|---|
ALLISON | ALLISON | FALSE | TRANSHER | TRANSHER | FALSE | NAKIA UVOICE | NAKIAUVOICE | FALSE |
DUBIS | DUBIS | FALSE | PATRICIA | PATRICIA | FALSE | GAIL RUSSELL | GAILRUSSELL | FALSE |
UPCHURCH | UPCHURCH | FALSE | GLORIA | GLORIA | FALSE | MCINTOSH PERRY | MCINTOSHPERRY | FALSE |
RICHARDSON | RICHARDSON | FALSE | TAMMY | TAMMY | FALSE | LYNN STAFFOR | LYNNSTAFFOR | FALSE |
REAGAN | REAGAN | FALSE | ANNA | ANNA | FALSE | SCHELL-PATTON | SCHELLPATTON | FALSE |
SMITH | SMITH | FALSE | ANNA | ANNA | FALSE | MARIE KIRKLAND | MARIEKIRKLAND | FALSE |
STROUD | STROUD | FALSE | JONATHAN | JONATHAN | FALSE | ROSCOE FULLER | ROSCOEFULLER | FALSE |
MILLER | MILLER | FALSE | PATRICIA | PATRICIA | FALSE | LYNN-WOODRING | LYNNWOODRING | FALSE |
MCLEAN | MCLEAN | FALSE | SHEILA | SHEILA | FALSE | LYNN HOOPER | LYNNHOOPER | FALSE |
BULLARD | BULLARD | FALSE | JOSEPH | JOSEPH | FALSE | MATTHEW KELLER | MATTHEWKELLER | FALSE |
PORTER | PORTER | FALSE | SHIRLEY | SHIRLEY | FALSE | M PUETT | MPUETT | FALSE |
HALL | HALL | FALSE | JENNIFER | JENNIFER | FALSE | ANNE STEVENS | ANNESTEVENS | FALSE |
VALONE | VALONE | FALSE | KATHRYN | KATHRYN | FALSE | JO PRUITT | JOPRUITT | FALSE |
TRAVIS | TRAVIS | FALSE | CHRISTOPHER | CHRISTOPHER | FALSE | WILLIAM CALVIN | WILLIAMCALVIN | FALSE |
WHITE | WHITE | FALSE | KATHLEEN | KATHLEEN | FALSE | KEM PENG TAM | KEMPENGTAM | FALSE |
THOMPSON | THOMPSON | FALSE | ANITA | ANITA | FALSE | SUE A | SUEA | FALSE |
BURGESS | BURGESS | FALSE | ANNIE | ANNIE | FALSE | GENEVA DAVIS | GENEVADAVIS | FALSE |
NICHOLSON | NICHOLSON | FALSE | PATSY | PATSY | FALSE | GAIL HOXIT | GAILHOXIT | FALSE |
LEWALLEN | LEWALLEN | FALSE | NATHANIEL | NATHANIEL | FALSE | ANTHONY-WRAY | ANTHONYWRAY | FALSE |
PIZZO | PIZZO | FALSE | CHARLOTTE | CHARLOTTE | FALSE | HELENE TRYMB | HELENETRYMB | FALSE |
Define a function to add a record ID variable.
# Function to add ID
# This is just transitional.
# Remove it after switching to 2008 snapshot with NCID
raw_entity_data_add_id <- function(
d # data frame - raw entity data
) {
d %>%
dplyr::mutate(id = 1:nrow(.))
}
Define a function to apply all the steps from reading the raw entity data through to creating the cleaned entity data. This will be used in the core pipeline.
# Function to make the cleaned raw entity data
# Reads the data, applies exclusions
# and neatens it to where it's ready to calculate modelling features
raw_entity_data_make_clean <- function(
file_path # character - file path usable by vroom
# value - data frame
) {
raw_entity_data_read(file_path) %>%
raw_entity_data_excl_status() %>%
raw_entity_data_excl_test() %>%
raw_entity_data_drop_novar() %>%
raw_entity_data_parse_dates() %>%
raw_entity_data_drop_admin() %>%
raw_entity_data_drop_demog() %>%
raw_entity_data_clean_all() %>%
raw_entity_data_add_id()
}
Computation time (excl. render): 422.009 sec elapsed
sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.10
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8
[5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
[7] LC_PAPER=en_AU.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] tidyr_1.1.3 ggplot2_3.3.3 forcats_0.5.1 lubridate_1.7.10
[5] skimr_2.1.3 stringr_1.4.0 tibble_3.1.2 vroom_1.4.0
[9] gt_0.3.0 dplyr_1.0.6 fs_1.5.0 here_1.0.1
[13] tictoc_1.0.1 targets_0.4.2
loaded via a namespace (and not attached):
[1] tidyselect_1.1.1 xfun_0.23 bslib_0.2.5 repr_1.1.3
[5] purrr_0.3.4 colorspace_2.0-1 vctrs_0.3.8 generics_0.1.0
[9] htmltools_0.5.1.1 base64enc_0.1-3 yaml_2.2.1 utf8_1.2.1
[13] rlang_0.4.11 jquerylib_0.1.4 later_1.2.0 pillar_1.6.1
[17] glue_1.4.2 withr_2.4.2 bit64_4.0.5 lifecycle_1.0.0
[21] munsell_0.5.0 gtable_0.3.0 workflowr_1.6.2 codetools_0.2-18
[25] evaluate_0.14 knitr_1.33 callr_3.7.0 httpuv_1.6.1
[29] ps_1.6.0 parallel_4.1.0 fansi_0.4.2 highr_0.9
[33] Rcpp_1.0.6 backports_1.2.1 checkmate_2.0.0 renv_0.13.2
[37] promises_1.2.0.1 scales_1.1.1 jsonlite_1.7.2 bit_4.0.4
[41] digest_0.6.27 stringi_1.6.2 bookdown_0.22 processx_3.5.2
[45] rprojroot_2.0.2 grid_4.1.0 cli_2.5.0 tools_4.1.0
[49] magrittr_2.0.1 sass_0.4.0 crayon_1.4.1 whisker_0.4
[53] pkgconfig_2.0.3 ellipsis_0.3.2 data.table_1.14.0 rmarkdown_2.8
[57] R6_2.5.0 igraph_1.2.6 git2r_0.28.0 compiler_4.1.0