01_read_data

Last updated: 2023-02-02

Checks: 7 passed, 0 failed

Knit directory: multiclass_AUC/

This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

Seed: set.seed(20230112)

The command set.seed(20230112) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 55c2c04

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 55c2c04. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    renv/library/
    Ignored:    renv/sandbox/
    Ignored:    renv/staging/

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/01_read_data.Rmd) and HTML (docs/01_read_data.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
html	413c810	Ross Gayler	2023-01-28	end 2023-01-28
html	b13919c	Ross Gayler	2023-01-26	Add notebook 04_score_dependencies
Rmd	0cdb34a	Ross Gayler	2023-01-24	Initial commit
html	0cdb34a	Ross Gayler	2023-01-24	Initial commit

Introduction

Read the raw data from all the data files, reshape them into a more useful format, save them as an R object, and summarise the contents to give some assurance that the data were read correctly.

Data organisation

The data are generated by applying models to some of the test datasets from the UCR Time Series Classification Repository. All the models are classification models, that is, they assign each case to one of a fixed set of dataset-specific classes.

All the models of interest here map each case to a vector of scores, one for each class. The case is categorised as belonging to the class with the highest score.
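As a minimal sketch of this rule in base R, the score vector below uses the rounded scores printed for case 1 of UCR_14 / HDC_MINIROCKET in the d_scores listing later on this page. The variable names are illustrative only and are not used elsewhere in the analysis.

```r
# Scores for one case, one per class (classes are numbered from 0).
# Rounded values as printed for case 1 of UCR_14 / HDC_MINIROCKET.
scores <- c(-0.636, 0.380, -1.14, -0.604)

# which.max() returns a 1-based position, so subtract 1 to get the 0-based class.
# If several classes tie for the highest score, which.max() picks the first.
predicted_class <- which.max(scores) - 1L
predicted_class
```

For this case the highest score belongs to class 1, which agrees with the class_id recorded for the case.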

All the raw data are stored in data/UCR_Data_Scores.

There is a separate subdirectory of data/UCR_Data_Scores for each dataset analysed (e.g. data/UCR_Data_Scores/UCR_14).

The datasets used are shown in the table below. (The table has to be manually populated.)

Table of dataset names and links to their online descriptions.
Dataset	Name	Description page
UCR_14	CinCECGTorso	https://www.timeseriesclassification.com/description.php?Dataset=CinCECGTorso
UCR_48	GestureMidAirD3	https://www.timeseriesclassification.com/description.php?Dataset=GestureMidAirD3

In the subdirectory for each dataset analysed there are multiple files, each corresponding to the application of a single model to that dataset. The file naming convention is ModelName_test_results.csv (e.g. MINIROCKET_test_results.csv).

Each data file is a CSV with no header row and \(k + 1\) columns, where \(k\) is the number of classes in the dataset.

Each row corresponds to a case from the UCR dataset that has been processed by the model corresponding to the file.

The first column contains an integer in the range \([0, k - 1]\), indicating which of the \(k\) classes is the “true” class of the case.

The remaining \(k\) columns are the class scores for the case for each of the classes in order from \(0\) to \(k - 1\).
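To make the layout concrete, here is a small sketch that parses a made-up three-class file and checks it against the description above. The CSV content is hypothetical, and base R read.csv() is used only to keep the example free of package dependencies.

```r
# A hypothetical three-class file (k = 3), no header row:
# true class first, then the scores for classes 0, 1 and 2.
csv_text <- "0,0.9,-0.2,-0.7
2,-0.5,0.1,0.4
1,-0.3,0.8,-0.6"

d <- read.csv(text = csv_text, header = FALSE)

k <- ncol(d) - 1L      # number of classes implied by the column count
true_class <- d[[1]]   # first column holds the true class

# Checks implied by the format description
stopifnot(
  k == 3L,
  all(true_class == as.integer(true_class)),   # true class is an integer
  all(true_class >= 0 & true_class <= k - 1)   # and lies in [0, k - 1]
)
```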

Output organisation

The data from all the UCR datasets are concatenated into a single R data frame and saved as an RDS file (output/d_scores.RDS).

Different datasets have different numbers of classes, so the data are pivoted from wide to long format to enable concatenation of the datasets into a single data frame.

The columns of the data frame are: dataset, model, case, class_id, score_id, score_val.
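The need for the long format can be illustrated with a toy example. The sketch below uses hypothetical data and a hypothetical helper, to_long(), written in base R; it is not the function used in this notebook, and the model column is omitted for brevity. Two wide data frames with different numbers of score columns cannot be row-bound directly, but their long forms share the same columns.

```r
# Hypothetical wide data, one case from each of two datasets:
# dataset A has 2 classes, dataset B has 3 classes.
wide_a <- data.frame(class_id = 1L, score_0 = -0.4, score_1 = 0.7)
wide_b <- data.frame(class_id = 0L, score_0 = 0.9, score_1 = -0.2, score_2 = -0.5)

# Pivot one wide data frame to long: one row per (case, score) pair.
to_long <- function(wide, dataset) {
  score_cols <- grep("^score_", names(wide), value = TRUE)
  n_score <- length(score_cols)
  data.frame(
    dataset   = dataset,
    case      = rep(seq_len(nrow(wide)), each = n_score),
    class_id  = rep(wide$class_id, each = n_score),
    score_id  = rep(as.integer(sub("^score_", "", score_cols)), times = nrow(wide)),
    # transpose so that the scores of each case stay together
    score_val = as.vector(t(as.matrix(wide[score_cols])))
  )
}

# The long data frames have identical columns, so they can be concatenated.
long <- rbind(to_long(wide_a, "A"), to_long(wide_b, "B"))
long
```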

Get the file names

Get the names of all the data files and extract the dataset and model names.

d_files <- here::here("data/UCR_Data_Scores") |>
  fs::dir_ls(glob = "*_test_results.csv", recurse = 1) |>
  tibble::as_tibble_col(column_name = "path") |>
  dplyr::arrange(path) |>
  dplyr::mutate(
    dataset = fs::path_dir(path) |>
      stringr::str_remove(pattern = ".*/"),
    model = fs::path_file(path) |>
      stringr::str_remove(pattern = "_test_results\\.csv"),
  )

# quick view of the data files to be read
d_files

# A tibble: 4 × 3
  path                                                             dataset model
  <fs::path>                                                       <chr>   <chr>
1 …AUC/data/UCR_Data_Scores/UCR_14/HDC_MINIROCKET_test_results.csv UCR_14  HDC_…
2 …ass_AUC/data/UCR_Data_Scores/UCR_14/MINIROCKET_test_results.csv UCR_14  MINI…
3 …AUC/data/UCR_Data_Scores/UCR_48/HDC_MINIROCKET_test_results.csv UCR_48  HDC_…
4 …ass_AUC/data/UCR_Data_Scores/UCR_48/MINIROCKET_test_results.csv UCR_48  MINI…
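The name extraction can also be sketched with base R alone, which may help when checking the naming convention by hand. The relative path below is assembled from the directory and file naming conventions described earlier; this is an illustration, not the code used to build d_files.

```r
# Example path following the directory and file naming conventions
path <- "data/UCR_Data_Scores/UCR_14/MINIROCKET_test_results.csv"

# Dataset ID: the name of the directory that contains the file
dataset <- basename(dirname(path))

# Model name: the file name with the fixed suffix removed
model <- sub("_test_results\\.csv$", "", basename(path))

dataset
model
```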

Read the files

Create a function to read one file and reformat it.

read_1 <- function(
    path, # character - path of file to read
    dataset, # character - ID of dataset
    model # character - name of model applied to dataset
) {
  # read the file
  d <- readr::read_csv(path, col_names = FALSE, show_col_types = FALSE)
  
  # rename the columns
  n_class <- ncol(d) - 1 # 1 column for each class score plus 1 for the true class
  colnames(d) <- c("class_id", paste0("score_", 0:(n_class - 1))) # 0-origin class indexing
  
  d |>
    # add file identifiers and within-file case numbers
    dplyr::mutate(
      dataset = dataset,
      model = model,
      case = dplyr::row_number(), # namespaced, so it works without attaching dplyr
      # force the types for neatness
      class_id = as.integer(class_id)
    ) |>
    # reformat to long
    tidyr::pivot_longer( 
      cols = tidyr::starts_with("score_"), 
      names_to = "score_id",
      names_prefix = "score_",
      values_to = "score_val"
    ) |>
    dplyr::mutate(
      # force the types for neatness
      score_id = as.integer(score_id)
    ) |>
    # reorder the columns for more intuitive display
    dplyr::relocate(dataset, model, case)
}

Read all the files and concatenate them.

d_scores <- purrr::pmap_dfr(.l = d_files, .f = read_1)

# quick view of the data that was read
d_scores

# A tibble: 17,800 × 6
   dataset model           case class_id score_id score_val
   <chr>   <chr>          <int>    <int>    <int>     <dbl>
 1 UCR_14  HDC_MINIROCKET     1        1        0    -0.636
 2 UCR_14  HDC_MINIROCKET     1        1        1     0.380
 3 UCR_14  HDC_MINIROCKET     1        1        2    -1.14 
 4 UCR_14  HDC_MINIROCKET     1        1        3    -0.604
 5 UCR_14  HDC_MINIROCKET     2        3        0    -1.03 
 6 UCR_14  HDC_MINIROCKET     2        3        1    -0.633
 7 UCR_14  HDC_MINIROCKET     2        3        2    -0.560
 8 UCR_14  HDC_MINIROCKET     2        3        3     0.219
 9 UCR_14  HDC_MINIROCKET     3        2        0    -1.21 
10 UCR_14  HDC_MINIROCKET     3        2        1    -1.06 
# … with 17,790 more rows

Save the concatenated data.

d_scores |> saveRDS(file = here::here("output", "d_scores.RDS"))
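A later notebook would presumably read the object back with readRDS(). As a hedged sketch of the round trip, the example below writes a small stand-in data frame (hypothetical values, same columns as d_scores) to a temporary file and reads it back, so nothing in output/ is touched.

```r
# Small stand-in data frame with the same columns as d_scores (values are made up)
d <- data.frame(
  dataset = "UCR_14", model = "MINIROCKET", case = 1L,
  class_id = 1L, score_id = 0:3, score_val = c(-0.6, 0.4, -1.1, -0.6)
)

tmp <- tempfile(fileext = ".RDS")  # temporary file, not the project output directory
saveRDS(d, file = tmp)
d_back <- readRDS(tmp)

# Compare the original object with the one read back from disk
round_trip_ok <- isTRUE(all.equal(d, d_back))
round_trip_ok

unlink(tmp)  # clean up the temporary file
```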

Get check summaries

Calculate the number of observations, classes, and scores per file as a basic check. These need to be manually checked against the metadata for the datasets.

d_scores |>
  dplyr::group_by(dataset, model) |>
  dplyr::summarise(
    n_case = max(case),
    min_class_id = min(class_id),
    max_class_id = max(class_id),
    n_class_id = unique(class_id) |> length(),
    min_score_id = min(score_id),
    max_score_id = max(score_id),
    n_score_id = unique(score_id) |> length()
  ) |>
  gt::gt()

`summarise()` has grouped output by 'dataset'. You can override using the
`.groups` argument.

model	n_case	min_class_id	max_class_id	n_class_id	min_score_id	max_score_id	n_score_id
UCR_14
HDC_MINIROCKET	1380	0	3	4	0	3	4
MINIROCKET	1380	0	3	4	0	3	4
UCR_48
HDC_MINIROCKET	130	0	25	26	0	25	26
MINIROCKET	130	0	25	26	0	25	26

That looks as expected.
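One further cross-check follows from the long format: each file contributes one row per combination of case and class score, so the total row count of d_scores can be predicted from the summary table. The sketch below copies the counts from that table; the variable names are illustrative.

```r
# Cases and classes per dataset, copied from the summary table above
n_case  <- c(UCR_14 = 1380, UCR_48 = 130)
n_class <- c(UCR_14 = 4,    UCR_48 = 26)
n_model <- 2  # HDC_MINIROCKET and MINIROCKET were applied to each dataset

# Each file contributes one row per (case, class score) pair
rows_per_file <- n_case * n_class
expected_rows <- sum(rows_per_file) * n_model
expected_rows
```

The result, 17800, matches the 17,800 rows reported when d_scores was printed.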

Calculate the number of observations and scores for each class in each file. These need to be manually checked against the metadata for the datasets.

d_scores |>
  dplyr::group_by(dataset, model, class_id) |>
  dplyr::summarise(
    n_case = unique(case) |> length(),
    n_score = unique(score_id) |> length()
  ) |>
  gt::gt()

`summarise()` has grouped output by 'dataset', 'model'. You can override using
the `.groups` argument.

class_id	n_case	n_score
UCR_14 - HDC_MINIROCKET
0	350	4
1	343	4
2	345	4
3	342	4
UCR_14 - MINIROCKET
0	350	4
1	343	4
2	345	4
3	342	4
UCR_48 - HDC_MINIROCKET
0	5	26
1	5	26
2	5	26
3	5	26
4	5	26
5	5	26
6	5	26
7	5	26
8	5	26
9	5	26
10	5	26
11	5	26
12	5	26
13	5	26
14	5	26
15	5	26
16	5	26
17	5	26
18	5	26
19	5	26
20	5	26
21	5	26
22	5	26
23	5	26
24	5	26
25	5	26
UCR_48 - MINIROCKET
0	5	26
1	5	26
2	5	26
3	5	26
4	5	26
5	5	26
6	5	26
7	5	26
8	5	26
9	5	26
10	5	26
11	5	26
12	5	26
13	5	26
14	5	26
15	5	26
16	5	26
17	5	26
18	5	26
19	5	26
20	5	26
21	5	26
22	5	26
23	5	26
24	5	26
25	5	26
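As a quick internal consistency check, the per-class case counts should add up to the per-file n_case values in the earlier summary table. The sketch below copies the counts from the table above; the counts are the same for both models within each dataset.

```r
# Per-class case counts copied from the table above
ucr_14 <- c(350, 343, 345, 342)  # classes 0 to 3
ucr_48 <- rep(5, 26)             # classes 0 to 25, 5 cases each

total_ucr_14 <- sum(ucr_14)  # compare with n_case = 1380
total_ucr_48 <- sum(ucr_48)  # compare with n_case = 130

total_ucr_14
total_ucr_48
```

Both totals agree with the earlier table, and n_score equals the number of classes for every class, as expected when every case has a score for every class.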