Last updated: 2021-05-27
Checks: 7 passed, 0 failed
Knit directory: fa_sim_cal/
This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20201104)
was run prior to running the code in the R Markdown file.
Setting a seed ensures that any results that rely on randomness, e.g.
subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version a6fb2e3. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: .tresorit/
Ignored: _targets/
Ignored: data/VR_20051125.txt.xz
Ignored: data/VR_Snapshot_20081104.txt.xz
Ignored: renv/library/
Ignored: renv/local/
Ignored: renv/staging/
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/m_01_1_read_raw_entity_data.Rmd) and HTML (docs/m_01_1_read_raw_entity_data.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | ab90fe6 | Ross Gayler | 2021-05-18 | WIP |
html | ab90fe6 | Ross Gayler | 2021-05-18 | WIP |
html | 0bb37f0 | Ross Gayler | 2021-05-15 | Build site. |
Rmd | 54a8052 | Ross Gayler | 2021-05-15 | wflow_publish("analysis/m_01_1*.Rmd") |
Rmd | d7b5c39 | Ross Gayler | 2021-05-15 | WIP |
Rmd | e1b609b | Ross Gayler | 2021-05-14 | WIP |
Rmd | 4109078 | Ross Gayler | 2021-05-13 | WIP |
html | 4109078 | Ross Gayler | 2021-05-13 | WIP |
Rmd | ebd787e | Ross Gayler | 2021-03-28 | WIP |
html | ebd787e | Ross Gayler | 2021-03-28 | WIP |
# NOTE this notebook can be run manually or automatically by {targets}
# So load the packages required by this notebook here
# rather than relying on _targets.R to load them.
# Set up the project environment, because {workflowr} knits each Rmd file
# in a new R session, and doesn't execute the project .Rprofile
library(targets) # access data from the targets cache
library(tictoc) # capture execution time
library(here) # construct file paths relative to project root
here() starts at /home/ross/RG/projects/academic/entity_resolution/fa_sim_cal_TOP/fa_sim_cal
library(fs) # file system operations
library(dplyr) # data wrangling
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(vroom) # fast reading of delimited text files
# start the execution time clock
tictoc::tic("Computation time (excl. render)")
# Get the path to the raw entity data file
# This is a target managed by {targets}
f_entity_raw_tsv <- tar_read(c_raw_entity_data_file)
These meta notebooks document the development of functions that will be applied in the core pipeline.
The aim of the m_01 set of meta notebooks is to work out how to read the raw entity data, drop excluded cases, discard irrelevant variables, apply any cleaning, and construct standardised names. This does not include construction of any modelling features. To be clear, the target (c_raw_entity_data) corresponding to the objective of this set of notebooks is the cleaned and standardised raw data, before constructing any modelling features.
This notebook documents the process of working out how to read the raw entity data. This is necessary because the documentation of a data set often omits some essential detail.
The subsequent notebooks in this set will develop the other functions needed to generate the cleaned and standardised data.
This project uses historical voter registration data from the North Carolina State Board of Elections. This information is made publicly available in accordance with North Carolina state law. The Voter Registration Data page links to an online folder of Voter Registration snapshots, which contains the snapshot data files and a data dictionary file describing the layout of the snapshot data files. At the time of writing the snapshot files cover the years 2005 to 2020 with at least one snapshot per year. The files are ZIP compressed and relatively large, with the smallest being 572 MB after compression.
The snapshots contain many columns that are irrelevant to this project (e.g. school district name) and/or prohibited under Australian privacy law (e.g. political affiliation, race). We do not read these unneeded columns from the snapshot file.
We use only one snapshot file (VR_Snapshot_20051125.zip) because this project does not investigate linkage of records across time. We chose the oldest snapshot (2005) because it is the smallest and the contents are the most out of date, minimising the current information made available. Note that this project will not generate any information that is not already directly, publicly available from NCSBE.
The data dictionary is stored in the data/ directory.
f_entity_raw_dd <- here::here("data", "layout_VR_Snapshot.txt") # data dictionary file
readLines(f_entity_raw_dd) %>% writeLines()
/* *******************************************************************************
* name: layout_VR_Snapshot.txt
* purpose: Layout for the VR_SNAPSHOT_YYYYMMDD file. This file contains a denormalized
* point-in-time snapshot of information for active and inactive voters
* as-well-as removed voters going back for a period of ten years.
* format: tab delimited column names in first row
* updated: 06/28/2020
******************************************************************************* */
-- --------------------------------------------------------------------------------
name data type description
-- --------------------------------------------------------------------------------
snapshot_dt char 10 Date of snapshot
county_id char 3 County identification number
county_desc char 15 County description
voter_reg_num char 12 Voter registration number (unique by county)
ncid char 12 North Carolina identification number (NCID) of voter
status_cd char 1 Status code for voter registration
voter_status_desc char 10 Satus code descriptions.
reason_cd char 2 Reason code for voter registration status
voter_status_reason_desc char 60 Reason code description
absent_ind char 1 <not used>
name_prefx_cd char 4 <not used>
last_name char 25 Voter last name
first_name char 20 Voter first name
midl_name char 20 Voter middle name
name_sufx_cd char 4 Voter name suffix
house_num char 10 Residential address street number
half_code char 1 Residential address street number half code
street_dir char 2 Residential address street direction (N,S,E,W,NE,SW, etc.)
street_name char 30 Residential address street name
street_type_cd char 4 Residential address street type (RD, ST, DR, BLVD, etc.)
street_sufx_cd char 4 Residential address street suffix (BUS, EXT, and directional)
unit_designator char 4 <not used>
unit_num char 7 Residential address unit number
res_city_desc char 20 Residential address city name
state_cd char 2 Residential address state code
zip_code char 9 Residential address zip code
mail_addr1 char 40 Mailing street address
mail_addr2 char 40 Mailing address line two
mail_addr3 char 40 Mailing address line three
mail_addr4 char 40 Mailing address line four
mail_city char 30 Mailing address city name
mail_state char 2 Mailing address state code
mail_zipcode char 9 Mailing address zip code
area_cd char 3 Area code for phone number
phone_num char 7 Telephone number
race_code char 3 Race code
race_desc char 35 Race description
ethnic_code char 2 Ethnicity code
ethnic_desc char 30 Ethnicity description
party_cd char 3 Party affiliation code
party_desc char 12 Party affiliation description
sex_code char 1 Gender code
sex char 6 Gender description
age char 3 Age
birth_place char 2 Birth place
registr_dt char 10 Voter registration date
precinct_abbrv char 6 Precinct abbreviation
precinct_desc char 30 Precinct name
municipality_abbrv char 4 Municipality abbreviation
municipality_desc char 30 Municipality name
ward_abbrv char 4 Ward abbreviation
ward_desc char 30 Ward name
cong_dist_abbrv char 4 Congressional district abbreviation
cong_dist_desc char 30 Congressional district name
super_court_abbrv char 4 Supreme Court abbreviation
super_court_desc char 30 Supreme Court name
judic_dist_abbrv char 4 Judicial district abbreviation
judic_dist_desc char 30 Judicial district name
NC_senate_abbrv char 4 NC Senate district abbreviation
NC_senate_desc char 30 NC Senate district name
NC_house_abbrv char 4 NC House district abbreviation
NC_house_desc char 30 NC House district name
county_commiss_abbrv char 4 County Commissioner district abbreviation
county_commiss_desc char 30 County Commissioner district name
township_abbrv char 6 Township district abbreviation
township_desc char 30 Township district name
school_dist_abbrv char 6 School district abbreviation
school_dist_desc char 30 School district name
fire_dist_abbrv char 4 Fire district abbreviation
fire_dist_desc char 30 Fire district name
water_dist_abbrv char 4 Water district abbreviation
water_dist_desc char 30 Water district name
sewer_dist_abbrv char 4 Sewer district abbreviation
sewer_dist_desc char 30 Sewer district name
sanit_dist_abbrv char 4 Sanitation district abbreviation
sanit_dist_desc char 30 Sanitation district name
rescue_dist_abbrv char 4 Rescue district abbreviation
rescue_dist_desc char 30 Rescue district name
munic_dist_abbrv char 4 Municipal district abbreviation
munic_dist_desc char 30 Municipal district name
dist_1_abbrv char 4 Prosecutorial district abbreviation
dist_1_desc char 30 Prosecutorial district name
dist_2_abbrv char 4 <not used>
dist_2_desc char 30 <not used>
confidential_ind char 1 Confidential indicator
cancellation_dt char 10 Cancellation date
vtd_abbrv char 6 Voter tabuluation district abbreviation
vtd_desc char 30 Voter tabuluation district name
load_dt char 10 Data load date
age_group char 35 Age group range
-- ---------------------------------------------------------------------------------
The snapshot ZIP file was manually downloaded (572 MB), uncompressed (5.7 GB), then re-compressed in XZ format to minimise the size (248 MB). The compressed snapshot file and the data dictionary file are stored in the data/ directory.
These analysis notebooks are rendered to webpages by the knitr package. This uses pandoc as a postprocessor, which requires its input as UTF-8 encoded character strings. In these notebooks we will occasionally display literal values from the voter registration snapshot file, so we will need to convert the voter registration data to UTF-8 encoding if it is not already encoded that way.
The data dictionary indicates that the data is stored as characters but gives no hint as to the encoding used. I don’t have any details on how this data is collected at source and subsequently assembled into these snapshot files. However, I wouldn’t be surprised if voter registration is managed by the individual counties, and in the smaller counties it might be managed by part-time employees using Microsoft Excel on old PCs running obsolete versions of Microsoft Windows. If this is the case, it is unlikely that the voter registration data is encoded in UTF-8.
Use the uchardet tool to guess the encoding of the voter registration snapshot file.
# The snapshot needs to be uncompressed for uchardet to read it
# This probably only works on linux
# xz and uchardet must be installed
system(paste0("xz --decompress --stdout '", f_entity_raw_tsv, "' | uchardet"), intern = TRUE)
[1] "WINDOWS-1252"
The guessed encoding is WINDOWS-1252 (https://en.wikipedia.org/wiki/Windows-1252). This is not surprising, as it is the default encoding for English in the legacy components of Microsoft Windows. The function to read the data will have to convert the encoding from WINDOWS-1252 to UTF-8.
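As an aside, base R’s iconv() can perform this conversion directly; vroom does the equivalent internally when given a locale with the input encoding. A minimal sketch (the byte 0xE9 is an illustrative WINDOWS-1252 "é", not a value taken from the snapshot file):

```r
# Sketch: convert a WINDOWS-1252 byte string to UTF-8 in base R.
# The byte 0xE9 encodes "é" in WINDOWS-1252 (illustrative only).
x <- rawToChar(as.raw(0xe9))                       # WINDOWS-1252 bytes
y <- iconv(x, from = "WINDOWS-1252", to = "UTF-8") # re-encode as UTF-8
Encoding(y) # expected: "UTF-8"
```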
The data is tab-separated. The data dictionary says that the data file is tab separated, but it also gives column widths, which could be interpreted as implying fixed-width fields. Examining the uncompressed data with a text editor confirms that the columns are tab separated.
The field widths in the data dictionary are probably intended to be interpreted as maximum lengths. However, even that interpretation is not accurate: some fields contain values longer than the stated width.
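This claim can be checked once the data has been read. A sketch of the kind of check involved, assuming `d` is the data frame of character columns read later in this notebook, and using the dictionary’s stated width of 25 characters for last_name:

```r
# Sketch: compare the longest observed value in a column against the
# width stated in the data dictionary (25 characters for last_name).
# Assumes `d` is the data frame read later in this notebook.
longest <- max(nchar(d$last_name), na.rm = TRUE)
longest > 25 # TRUE would mean some values exceed the stated width
```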
Inspection of the raw data with a text editor shows that the character fields are unquoted. However, at least one character value contains an un-escaped double-quote character, which has the potential to confuse the parsing if it is looking for quoted values.
The column specifications are written by taking the column names and their order in the data dictionary as correct.
Read the data file as character columns (i.e. treat numbers and dates as character strings), to simplify finding wrongly formatted input.
# Function to get the raw entity data
raw_entity_data_read <- function(
file_path # character - file path usable by vroom
) {
vroom::vroom(
file_path,
# n_max = 1e4, # limit the rows for testing
col_select = c( # get all the columns that might conceivably be used
# the names and ordering are from the metadata file
snapshot_dt : voter_status_reason_desc, # 9 cols
last_name : street_sufx_cd, # 10 cols
unit_num : zip_code, # 4 cols
area_cd, phone_num, # 2 cols
sex_code : registr_dt, # 5 cols
cancellation_dt, load_dt # 2 cols
), # total 32 cols
col_types = cols(
.default = col_character() # all cols as chars to allow for bad formatting
),
delim = "\t", # assume that fields are *only* delimited by tabs
col_names = TRUE, # use the column names on the first line of data
na = "", # missing fields are empty string or whitespace only (see trim_ws argument)
quote = "", # strings NEVER quoted. Read embedded double quote as just another character
comment = "", # don't allow for comments
trim_ws = TRUE, # trim leading and trailing whitespace
escape_double = FALSE, # assume no escaped quotes
escape_backslash = FALSE , # assume no escaped backslashes
locale = locale(encoding = "WINDOWS-1252") # tell vroom the encoding of the input file
  ) # the returned value is encoded as UTF-8
}
# Show the data file name
fs::path_file(f_entity_raw_tsv)
[1] "VR_20051125.txt.xz"
# Read the raw entity data
d <- raw_entity_data_read(f_entity_raw_tsv)
Check the number of rows and columns read and take a quick look at the data.
dplyr::glimpse(d)
Rows: 8,003,293
Columns: 32
$ snapshot_dt <chr> "2005-11-25 00:00:00", "2005-11-25 00:00:00",…
$ county_id <chr> "18", "7", "10", "16", "58", "60", "62", "73"…
$ county_desc <chr> "CATAWBA", "BEAUFORT", "BRUNSWICK", "CARTERET…
$ voter_reg_num <chr> "0", "000000000000", "000000000000", "0000000…
$ ncid <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ status_cd <chr> "R", "R", "R", "R", "R", "R", "R", "R", "R", …
$ voter_status_desc <chr> "REMOVED", "REMOVED", "REMOVED", "REMOVED", "…
$ reason_cd <chr> "RL", "R2", "R2", "RP", "R2", "RL", "RP", "RP…
$ voter_status_reason_desc <chr> "MOVED FROM COUNTY", "DUPLICATE", "DUPLICATE"…
$ last_name <chr> "AARON", "THOMPSON", "WILSON", "LANGSTON", "B…
$ first_name <chr> "CHARLES", "JESSICA", "WILLIAM", "VON", "LIZZ…
$ midl_name <chr> "F", "RUTH", "B", NA, "IRENE", "R", "HUGHES",…
$ name_sufx_cd <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ house_num <chr> "0", "961", "0", "264", "1536", "1431", "171"…
$ half_code <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ street_dir <chr> NA, NA, NA, NA, NA, "E", NA, NA, NA, NA, NA, …
$ street_name <chr> "ROUTE 4", "TAYLOR", "MIRROR LAKE", "CARL GAR…
$ street_type_cd <chr> NA, "RD", NA, "RD", "RD", "ST", NA, NA, NA, "…
$ street_sufx_cd <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ unit_num <chr> "147 BA", NA, NA, NA, NA, "1", NA, NA, NA, NA…
$ res_city_desc <chr> "CONOVER", "CHOCOWINITY", "BOILING SPRING LAK…
$ state_cd <chr> "NC", "NC", "NC", "NC", "NC", "NC", "NC", NA,…
$ zip_code <chr> "28613", "27817", "28461", "28570", "27892", …
$ area_cd <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ phone_num <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ sex_code <chr> "M", "F", "U", "M", "F", "F", "M", "U", "U", …
$ sex <chr> "MALE", "FEMALE", "UNK", "MALE", "FEMALE", "F…
$ age <chr> "62", "26", "0", "58", "63", "30", "93", "0",…
$ birth_place <chr> NA, "NC", NA, "MI", NA, "VA", "NC", NA, NA, "…
$ registr_dt <chr> "1984-10-06 00:00:00", "2000-07-31 00:00:00",…
$ cancellation_dt <chr> NA, "2001-07-06 00:00:00", "2001-02-05 00:00:…
$ load_dt <chr> "2014-07-15 22:21:54.150000000", "2014-07-15 …
Correct number of data rows read
Correct number of columns read (checked against manual count of columns in data dictionary)
The initial values in each column seem plausible with respect to the column descriptions
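These checks could also be made programmatic, so that a future re-run fails loudly if the input file changes. A minimal sketch using the counts observed in this run (32 columns matches the manual count of selected columns from the data dictionary):

```r
# Sketch: hard-coded sanity checks on the raw entity data just read.
# The row count comes from this run of the notebook; the column count
# comes from a manual count of selected columns in the data dictionary.
stopifnot(
  nrow(d) == 8003293,
  ncol(d) == 32
)
```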
Computation time (excl. render): 71.384 sec elapsed
sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.10
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8
[5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
[7] LC_PAPER=en_AU.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] vroom_1.4.0 dplyr_1.0.6 fs_1.5.0 here_1.0.1 tictoc_1.0.1
[6] targets_0.4.2
loaded via a namespace (and not attached):
[1] Rcpp_1.0.6 pillar_1.6.1 compiler_4.1.0 bslib_0.2.5
[5] later_1.2.0 jquerylib_0.1.4 git2r_0.28.0 workflowr_1.6.2
[9] tools_4.1.0 bit_4.0.4 digest_0.6.27 jsonlite_1.7.2
[13] evaluate_0.14 lifecycle_1.0.0 tibble_3.1.2 pkgconfig_2.0.3
[17] rlang_0.4.11 igraph_1.2.6 rstudioapi_0.13 cli_2.5.0
[21] parallel_4.1.0 yaml_2.2.1 xfun_0.23 withr_2.4.2
[25] stringr_1.4.0 knitr_1.33 generics_0.1.0 vctrs_0.3.8
[29] sass_0.4.0 bit64_4.0.5 tidyselect_1.1.1 rprojroot_2.0.2
[33] data.table_1.14.0 glue_1.4.2 R6_2.5.0 processx_3.5.2
[37] fansi_0.4.2 rmarkdown_2.8 bookdown_0.22 purrr_0.3.4
[41] callr_3.7.0 magrittr_2.0.1 whisker_0.4 codetools_0.2-18
[45] ps_1.6.0 promises_1.2.0.1 ellipsis_0.3.2 htmltools_0.5.1.1
[49] renv_0.13.2 httpuv_1.6.1 utf8_1.2.1 stringi_1.6.2
[53] crayon_1.4.1