Last updated: 2021-05-27
Checks: 7 passed, 0 failed
Knit directory: fa_sim_cal/
This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20201104)
was run prior to running the code in the R Markdown file.
Setting a seed ensures that any results that rely on randomness, e.g.
subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version a6fb2e3. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: .tresorit/
Ignored: _targets/
Ignored: data/VR_20051125.txt.xz
Ignored: data/VR_Snapshot_20081104.txt.xz
Ignored: renv/library/
Ignored: renv/local/
Ignored: renv/staging/
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/m_01_1_read_raw_entity_data.Rmd) and HTML (docs/m_01_1_read_raw_entity_data.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | ab90fe6 | Ross Gayler | 2021-05-18 | WIP |
html | ab90fe6 | Ross Gayler | 2021-05-18 | WIP |
html | 0bb37f0 | Ross Gayler | 2021-05-15 | Build site. |
Rmd | 54a8052 | Ross Gayler | 2021-05-15 | wflow_publish("analysis/m_01_1*.Rmd") |
Rmd | d7b5c39 | Ross Gayler | 2021-05-15 | WIP |
Rmd | e1b609b | Ross Gayler | 2021-05-14 | WIP |
Rmd | 4109078 | Ross Gayler | 2021-05-13 | WIP |
html | 4109078 | Ross Gayler | 2021-05-13 | WIP |
Rmd | ebd787e | Ross Gayler | 2021-03-28 | WIP |
html | ebd787e | Ross Gayler | 2021-03-28 | WIP |
# NOTE this notebook can be run manually or automatically by {targets}
# So load the packages required by this notebook here
# rather than relying on _targets.R to load them.
# Set up the project environment, because {workflowr} knits each Rmd file
# in a new R session, and doesn't execute the project .Rprofile
library(targets) # access data from the targets cache
library(tictoc) # capture execution time
library(here) # construct file paths relative to project root
here() starts at /home/ross/RG/projects/academic/entity_resolution/fa_sim_cal_TOP/fa_sim_cal
library(fs) # file system operations
library(dplyr) # data wrangling
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(vroom) # fast reading of delimited text files
# start the execution time clock
tictoc::tic("Computation time (excl. render)")
# Get the path to the raw entity data file
# This is a target managed by {targets}
f_entity_raw_tsv <- tar_read(c_raw_entity_data_file)
These meta notebooks document the development of functions that will be applied in the core pipeline.
The aim of the m_01 set of meta notebooks is to work out how to read the raw entity data, drop excluded cases, discard irrelevant variables, apply any cleaning, and construct standardised names. This does not include construction of any modelling features. To be clear, the target (c_raw_entity_data) corresponding to the objective of this set of notebooks is the cleaned and standardised raw data, before constructing any modelling features.
This notebook documents the process of working out how to read the raw entity data. This is necessary because the documentation of a data set often omits some essential detail.
The subsequent notebooks in this set will develop the other functions needed to generate the cleaned and standardised data.
This project uses historical voter registration data from the North Carolina State Board of Elections. This information is made publicly available in accordance with North Carolina state law. The Voter Registration Data page links to an online folder of Voter Registration snapshots, which contains the snapshot data files and a data dictionary file describing the layout of the snapshot data files. At the time of writing the snapshot files cover the years 2005 to 2020 with at least one snapshot per year. The files are ZIP compressed and relatively large, with the smallest being 572 MB after compression.
The snapshots contain many columns that are irrelevant to this project (e.g. school district name) and/or prohibited under Australian privacy law (e.g. political affiliation, race). We do not read these unneeded columns from the snapshot file.
We use only one snapshot file (VR_Snapshot_20051125.zip) because this project does not investigate linkage of records across time. We chose the oldest snapshot (2005) because it is the smallest and the contents are the most out of date, minimising the current information made available. Note that this project will not generate any information that is not already directly, publicly available from NCSBE.
The data dictionary is stored in the data/ directory.
f_entity_raw_dd <- here::here("data", "layout_VR_Snapshot.txt") # data dictionary file
readLines(f_entity_raw_dd) %>% writeLines()
/* *******************************************************************************
* name: layout_VR_Snapshot.txt
* purpose: Layout for the VR_SNAPSHOT_YYYYMMDD file. This file contains a denormalized
* point-in-time snapshot of information for active and inactive voters
* as-well-as removed voters going back for a period of ten years.
* format: tab delimited column names in first row
* updated: 06/28/2020
******************************************************************************* */
-- --------------------------------------------------------------------------------
name data type description
-- --------------------------------------------------------------------------------
snapshot_dt char 10 Date of snapshot
county_id char 3 County identification number
county_desc char 15 County description
voter_reg_num char 12 Voter registration number (unique by county)
ncid char 12 North Carolina identification number (NCID) of voter
status_cd char 1 Status code for voter registration
voter_status_desc char 10 Satus code descriptions.
reason_cd char 2 Reason code for voter registration status
voter_status_reason_desc char 60 Reason code description
absent_ind char 1 <not used>
name_prefx_cd char 4 <not used>
last_name char 25 Voter last name
first_name char 20 Voter first name
midl_name char 20 Voter middle name
name_sufx_cd char 4 Voter name suffix
house_num char 10 Residential address street number
half_code char 1 Residential address street number half code
street_dir char 2 Residential address street direction (N,S,E,W,NE,SW, etc.)
street_name char 30 Residential address street name
street_type_cd char 4 Residential address street type (RD, ST, DR, BLVD, etc.)
street_sufx_cd char 4 Residential address street suffix (BUS, EXT, and directional)
unit_designator char 4 <not used>
unit_num char 7 Residential address unit number
res_city_desc char 20 Residential address city name
state_cd char 2 Residential address state code
zip_code char 9 Residential address zip code
mail_addr1 char 40 Mailing street address
mail_addr2 char 40 Mailing address line two
mail_addr3 char 40 Mailing address line three
mail_addr4 char 40 Mailing address line four
mail_city char 30 Mailing address city name
mail_state char 2 Mailing address state code
mail_zipcode char 9 Mailing address zip code
area_cd char 3 Area code for phone number
phone_num char 7 Telephone number
race_code char 3 Race code
race_desc char 35 Race description
ethnic_code char 2 Ethnicity code
ethnic_desc char 30 Ethnicity description
party_cd char 3 Party affiliation code
party_desc char 12 Party affiliation description
sex_code char 1 Gender code
sex char 6 Gender description
age char 3 Age
birth_place char 2 Birth place
registr_dt char 10 Voter registration date
precinct_abbrv char 6 Precinct abbreviation
precinct_desc char 30 Precinct name
municipality_abbrv char 4 Municipality abbreviation
municipality_desc char 30 Municipality name
ward_abbrv char 4 Ward abbreviation
ward_desc char 30 Ward name
cong_dist_abbrv char 4 Congressional district abbreviation
cong_dist_desc char 30 Congressional district name
super_court_abbrv char 4 Supreme Court abbreviation
super_court_desc char 30 Supreme Court name
judic_dist_abbrv char 4 Judicial district abbreviation
judic_dist_desc char 30 Judicial district name
NC_senate_abbrv char 4 NC Senate district abbreviation
NC_senate_desc char 30 NC Senate district name
NC_house_abbrv char 4 NC House district abbreviation
NC_house_desc char 30 NC House district name
county_commiss_abbrv char 4 County Commissioner district abbreviation
county_commiss_desc char 30 County Commissioner district name
township_abbrv char 6 Township district abbreviation
township_desc char 30 Township district name
school_dist_abbrv char 6 School district abbreviation
school_dist_desc char 30 School district name
fire_dist_abbrv char 4 Fire district abbreviation
fire_dist_desc char 30 Fire district name
water_dist_abbrv char 4 Water district abbreviation
water_dist_desc char 30 Water district name
sewer_dist_abbrv char 4 Sewer district abbreviation
sewer_dist_desc char 30 Sewer district name
sanit_dist_abbrv char 4 Sanitation district abbreviation
sanit_dist_desc char 30 Sanitation district name
rescue_dist_abbrv char 4 Rescue district abbreviation
rescue_dist_desc char 30 Rescue district name
munic_dist_abbrv char 4 Municipal district abbreviation
munic_dist_desc char 30 Municipal district name
dist_1_abbrv char 4 Prosecutorial district abbreviation
dist_1_desc char 30 Prosecutorial district name
dist_2_abbrv char 4 <not used>
dist_2_desc char 30 <not used>
confidential_ind char 1 Confidential indicator
cancellation_dt char 10 Cancellation date
vtd_abbrv char 6 Voter tabuluation district abbreviation
vtd_desc char 30 Voter tabuluation district name
load_dt char 10 Data load date
age_group char 35 Age group range
-- ---------------------------------------------------------------------------------
The snapshot ZIP file was manually downloaded (572 MB), uncompressed (5.7 GB), then re-compressed in XZ format to minimise the size (248 MB). The compressed snapshot file and the data dictionary file are stored in the data/ directory.
These analysis notebooks are rendered to webpages by the knitr package. This uses pandoc as a postprocessor, which requires its input as UTF-8 encoded character strings. In these notebooks we will occasionally display literal values from the voter registration snapshot file, so we will need to convert the voter registration data to UTF-8 encoding if it is not already encoded that way.
The data dictionary indicates that the data is stored as characters but gives no hint as to the encoding used. I don’t have any details on how this data is collected at source and subsequently assembled into these snapshot files. However, I wouldn’t be surprised if voter registration is managed by the individual counties, and in the smaller counties it might be managed by part-time employees using Microsoft Excel on old PCs running obsolete versions of Microsoft Windows. If this is the case, it is unlikely that the voter registration data is encoded in UTF-8.
Use the uchardet tool to guess the encoding of the voter registration snapshot file.
# The snapshot needs to be uncompressed for uchardet to read it
# This probably only works on linux
# xz and uchardet must be installed
system(paste0("xz --decompress --stdout '", f_entity_raw_tsv, "' | uchardet"), intern = TRUE)
[1] "WINDOWS-1252"
The guessed encoding is WINDOWS-1252 (https://en.wikipedia.org/wiki/Windows-1252). This is not surprising, as it is the default encoding for English in the legacy components of Microsoft Windows. The function to read the data will have to convert the encoding from WINDOWS-1252 to UTF-8.
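As an aside, base R’s iconv() can perform this conversion directly; vroom does the equivalent internally when given a locale with the input encoding. A minimal sketch (the byte 0xE9 is an illustrative WINDOWS-1252 "é", not a value taken from the snapshot file):

```r
# Sketch: convert a WINDOWS-1252 byte string to UTF-8 in base R.
# The byte 0xE9 encodes "é" in WINDOWS-1252 (illustrative only).
x <- rawToChar(as.raw(0xe9))                       # WINDOWS-1252 bytes
y <- iconv(x, from = "WINDOWS-1252", to = "UTF-8") # re-encode as UTF-8
Encoding(y) # expected: "UTF-8"
```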
The data is tab-separated. The data dictionary says that the data file is tab separated, but it also gives column widths, which could be interpreted as implying fixed-width fields. Examining the uncompressed data with a text editor confirms that the columns are tab separated.
The field widths in the data dictionary are probably intended to be interpreted as maximum lengths. However, even that interpretation is not accurate: some fields contain values longer than the stated width.
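This claim can be checked once the data has been read. A sketch of the kind of check involved, assuming `d` is the data frame of character columns read later in this notebook, and using the dictionary’s stated width of 25 characters for last_name:

```r
# Sketch: compare the longest observed value in a column against the
# width stated in the data dictionary (25 characters for last_name).
# Assumes `d` is the data frame read later in this notebook.
longest <- max(nchar(d$last_name), na.rm = TRUE)
longest > 25 # TRUE would mean some values exceed the stated width
```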
Inspection of the raw data with a text editor shows that the character fields are unquoted. However, at least one character value contains an un-escaped double-quote character, which has the potential to confuse the parsing if it is looking for quoted values.
The column specifications are written by taking the column names and their order in the data dictionary as correct.
Read the data file as character columns (i.e. treat numbers and dates as character strings), to simplify finding wrongly formatted input.
# Function to get the raw entity data
raw_entity_data_read <- function(
file_path # character - file path usable by vroom
) {
vroom::vroom(
file_path,
# n_max = 1e4, # limit the rows for testing
col_select = c( # get all the columns that might conceivably be used
# the names and ordering are from the metadata file
snapshot_dt : voter_status_reason_desc, # 9 cols
last_name : street_sufx_cd, # 10 cols
unit_num : zip_code, # 4 cols
area_cd, phone_num, # 2 cols
sex_code : registr_dt, # 5 cols
cancellation_dt, load_dt # 2 cols
), # total 32 cols
col_types = cols(
.default = col_character() # all cols as chars to allow for bad formatting
),
delim = "\t", # assume that fields are *only* delimited by tabs
col_names = TRUE, # use the column names on the first line of data
na = "", # missing fields are empty string or whitespace only (see trim_ws argument)
quote = "", # strings NEVER quoted. Read embedded double quote as just another character
comment = "", # don't allow for comments
trim_ws = TRUE, # trim leading and trailing whitespace
escape_double = FALSE, # assume no escaped quotes
escape_backslash = FALSE , # assume no escaped backslashes
locale = locale(encoding = "WINDOWS-1252") # tell vroom the encoding of the input file
  ) # the returned value is encoded as UTF-8
}
# Show the data file name
fs::path_file(f_entity_raw_tsv)
[1] "VR_20051125.txt.xz"
# Read the raw entity data
d <- raw_entity_data_read(f_entity_raw_tsv)
Check the number of rows and columns read and take a quick look at the data.
dplyr::glimpse(d)
Rows: 8,003,293
Columns: 32
$ snapshot_dt <chr> "2005-11-25 00:00:00", "2005-11-25 00:00:00",…
$ county_id <chr> "18", "7", "10", "16", "58", "60", "62", "73"…
$ county_desc <chr> "CATAWBA", "BEAUFORT", "BRUNSWICK", "CARTERET…
$ voter_reg_num <chr> "0", "000000000000", "000000000000", "0000000…
$ ncid <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ status_cd <chr> "R", "R", "R", "R", "R", "R", "R", "R", "R", …
$ voter_status_desc <chr> "REMOVED", "REMOVED", "REMOVED", "REMOVED", "…
$ reason_cd <chr> "RL", "R2", "R2", "RP", "R2", "RL", "RP", "RP…
$ voter_status_reason_desc <chr> "MOVED FROM COUNTY", "DUPLICATE", "DUPLICATE"…
$ last_name <chr> "AARON", "THOMPSON", "WILSON", "LANGSTON", "B…
$ first_name <chr> "CHARLES", "JESSICA", "WILLIAM", "VON", "LIZZ…
$ midl_name <chr> "F", "RUTH", "B", NA, "IRENE", "R", "HUGHES",…
$ name_sufx_cd <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ house_num <chr> "0", "961", "0", "264", "1536", "1431", "171"…
$ half_code <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ street_dir <chr> NA, NA, NA, NA, NA, "E", NA, NA, NA, NA, NA, …
$ street_name <chr> "ROUTE 4", "TAYLOR", "MIRROR LAKE", "CARL GAR…
$ street_type_cd <chr> NA, "RD", NA, "RD", "RD", "ST", NA, NA, NA, "…
$ street_sufx_cd <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ unit_num <chr> "147 BA", NA, NA, NA, NA, "1", NA, NA, NA, NA…
$ res_city_desc <chr> "CONOVER", "CHOCOWINITY", "BOILING SPRING LAK…
$ state_cd <chr> "NC", "NC", "NC", "NC", "NC", "NC", "NC", NA,…
$ zip_code <chr> "28613", "27817", "28461", "28570", "27892", …
$ area_cd <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ phone_num <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ sex_code <chr> "M", "F", "U", "M", "F", "F", "M", "U", "U", …
$ sex <chr> "MALE", "FEMALE", "UNK", "MALE", "FEMALE", "F…
$ age <chr> "62", "26", "0", "58", "63", "30", "93", "0",…
$ birth_place <chr> NA, "NC", NA, "MI", NA, "VA", "NC", NA, NA, "…
$ registr_dt <chr> "1984-10-06 00:00:00", "2000-07-31 00:00:00",…
$ cancellation_dt <chr> NA, "2001-07-06 00:00:00", "2001-02-05 00:00:…
$ load_dt <chr> "2014-07-15 22:21:54.150000000", "2014-07-15 …
Correct number of data rows read
Correct number of columns read (checked against manual count of columns in data dictionary)
The initial values in each column seem plausible with respect to the column descriptions
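These checks could also be made programmatic, so that a future re-run fails loudly if the input file changes. A minimal sketch using the counts observed in this run (32 columns matches the manual count of selected columns from the data dictionary):

```r
# Sketch: hard-coded sanity checks on the raw entity data just read.
# The row count comes from this run of the notebook; the column count
# comes from a manual count of selected columns in the data dictionary.
stopifnot(
  nrow(d) == 8003293,
  ncol(d) == 32
)
```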
Computation time (excl. render): 71.384 sec elapsed
sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.10
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8
[5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
[7] LC_PAPER=en_AU.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] vroom_1.4.0 dplyr_1.0.6 fs_1.5.0 here_1.0.1 tictoc_1.0.1
[6] targets_0.4.2
loaded via a namespace (and not attached):
[1] Rcpp_1.0.6 pillar_1.6.1 compiler_4.1.0 bslib_0.2.5
[5] later_1.2.0 jquerylib_0.1.4 git2r_0.28.0 workflowr_1.6.2
[9] tools_4.1.0 bit_4.0.4 digest_0.6.27 jsonlite_1.7.2
[13] evaluate_0.14 lifecycle_1.0.0 tibble_3.1.2 pkgconfig_2.0.3
[17] rlang_0.4.11 igraph_1.2.6 rstudioapi_0.13 cli_2.5.0
[21] parallel_4.1.0 yaml_2.2.1 xfun_0.23 withr_2.4.2
[25] stringr_1.4.0 knitr_1.33 generics_0.1.0 vctrs_0.3.8
[29] sass_0.4.0 bit64_4.0.5 tidyselect_1.1.1 rprojroot_2.0.2
[33] data.table_1.14.0 glue_1.4.2 R6_2.5.0 processx_3.5.2
[37] fansi_0.4.2 rmarkdown_2.8 bookdown_0.22 purrr_0.3.4
[41] callr_3.7.0 magrittr_2.0.1 whisker_0.4 codetools_0.2-18
[45] ps_1.6.0 promises_1.2.0.1 ellipsis_0.3.2 htmltools_0.5.1.1
[49] renv_0.13.2 httpuv_1.6.1 utf8_1.2.1 stringi_1.6.2
[53] crayon_1.4.1