Last updated: 2023-02-02
Checks: 7
Knit directory: multiclass_AUC/
This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20230112) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 55c2c04. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: renv/library/
Ignored: renv/sandbox/
Ignored: renv/staging/
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/01_read_data.Rmd) and HTML (docs/01_read_data.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
html | 413c810 | Ross Gayler | 2023-01-28 | end 2023-01-28 |
html | b13919c | Ross Gayler | 2023-01-26 | Add notebook 04_score_dependencies |
Rmd | 0cdb34a | Ross Gayler | 2023-01-24 | Initial commit |
html | 0cdb34a | Ross Gayler | 2023-01-24 | Initial commit |
Read the raw data from all the data files, reshape it into a more useful format, save it as an R object, and provide a summary of the contents to give some assurance that the data was read correctly.
The data are generated by applying models to some of the test datasets from the UCR Time Series Classification Repository. All the models are classification models, that is, they assign each case to one of a fixed set of dataset-specific classes.
All the models of interest here map each case to a vector of scores, one for each class. The case is categorised as belonging to the class with the highest score.
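As a minimal sketch of this categorisation rule (the score values below are invented for illustration, not taken from the data):

```r
# Toy example: one case with k = 4 class scores, classes indexed 0 to k - 1
scores <- c(-0.64, 0.38, -1.14, -0.60)

# Predicted class: the 0-origin index of the highest score.
# which.max() returns a 1-origin index, so subtract 1.
predicted_class <- which.max(scores) - 1
predicted_class # 1
```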
All the raw data are stored in data/UCR_Data_Scores. There is a separate subdirectory of data/UCR_Data_Scores for each dataset analysed (e.g. data/UCR_Data_Scores/UCR_14).
The datasets used are shown in the table below. (The table has to be manually populated.)
Dataset | Name | Description page |
---|---|---|
UCR_14 | CinCECGTorso | https://www.timeseriesclassification.com/description.php?Dataset=CinCECGTorso |
UCR_48 | GestureMidAirD3 | https://www.timeseriesclassification.com/description.php?Dataset=GestureMidAirD3 |
In the subdirectory for each dataset analysed there are multiple files, each corresponding to the application of a single model to that dataset. The file naming convention is ModelName_test_results.csv (e.g. MINIROCKET_test_results.csv).
Each data file is a CSV with \(k + 1\) columns, where \(k\) is the number of classes.
Each row corresponds to a case from the UCR dataset that has been processed by the model corresponding to the file.
The first column contains an integer in the range \([0, k - 1]\), indicating which of the \(k\) classes is the “true” class of the case.
The remaining \(k\) columns are the class scores for the case for each of the classes in order from \(0\) to \(k - 1\).
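To make the layout concrete, here is a hypothetical fragment of such a file for a dataset with \(k = 3\) classes (all values invented), read with the same readr call used later in this notebook:

```r
# Hypothetical first rows of a data file for k = 3 classes:
# column 1 = true class (0 to 2), columns 2-4 = scores for classes 0, 1, 2
csv_text <- "2,-0.1,0.3,1.2
0,0.9,-0.4,-0.2"

# readr accepts literal data wrapped in I()
d <- readr::read_csv(I(csv_text), col_names = FALSE, show_col_types = FALSE)
# X1 is the true class; X2..X4 are the scores for classes 0, 1, 2
```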
The data from all the UCR datasets is concatenated into a single R data frame and saved as an RDS file (output/d_scores.RDS).
Different datasets have different numbers of classes, so the data is pivoted from wide to tall format to enable concatenation of the datasets into a single data frame.
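As a toy illustration of why pivoting helps (values invented): after pivoting, the column layout no longer depends on \(k\), so datasets with different numbers of classes can be stacked.

```r
# Toy wide-format rows from two hypothetical datasets with different k
wide_k2 <- tibble::tibble(class_id = 0L, score_0 = 0.1, score_1 = 0.9)
wide_k3 <- tibble::tibble(class_id = 2L, score_0 = 0.2, score_1 = 0.3, score_2 = 0.5)

to_long <- function(d) {
  tidyr::pivot_longer(
    d,
    cols = tidyr::starts_with("score_"),
    names_to = "score_id", names_prefix = "score_",
    values_to = "score_val"
  )
}

# Both datasets now share the same columns (class_id, score_id, score_val),
# so they concatenate cleanly regardless of k
long <- dplyr::bind_rows(to_long(wide_k2), to_long(wide_k3))
```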
The columns of the data frame are: dataset, model, case, class_id, score_id, score_val.
Get the names of all the data files and extract the dataset and model names.
d_files <- here::here("data/UCR_Data_Scores") |>
fs::dir_ls(glob = "*_test_results.csv", recurse = 1) |>
tibble::as_tibble_col(column_name = "path") |>
dplyr::arrange(path) |>
dplyr::mutate(
dataset = fs::path_dir(path) |>
stringr::str_remove(pattern = ".*/"),
model = fs::path_file(path) |>
stringr::str_remove(pattern = "_test_results\\.csv"),
)
# quick view of the data files to be read
d_files
# A tibble: 4 × 3
path dataset model
<fs::path> <chr> <chr>
1 …AUC/data/UCR_Data_Scores/UCR_14/HDC_MINIROCKET_test_results.csv UCR_14 HDC_…
2 …ass_AUC/data/UCR_Data_Scores/UCR_14/MINIROCKET_test_results.csv UCR_14 MINI…
3 …AUC/data/UCR_Data_Scores/UCR_48/HDC_MINIROCKET_test_results.csv UCR_48 HDC_…
4 …ass_AUC/data/UCR_Data_Scores/UCR_48/MINIROCKET_test_results.csv UCR_48 MINI…
Create a function to read one file and reformat it.
read_1 <- function(
path, # character - path of file to read
dataset, # character - ID of dataset
model # character - name of model applied to dataset
) {
# read the file
d <- readr::read_csv(path, col_names = FALSE, show_col_types = FALSE)
# rename the columns
n_class <- ncol(d) - 1 # 1 column for each class score plus 1 for the true class
colnames(d) <- c("class_id", paste0("score_", 0:(n_class - 1))) # 0-origin class indexing
d |>
# add file identifiers and within-file case numbers
dplyr::mutate(
dataset = dataset,
model = model,
case = 1:n(),
# force the types for neatness
class_id = as.integer(class_id)
) |>
# reformat to long
tidyr::pivot_longer(
cols = tidyr::starts_with("score_"),
names_to = "score_id",
names_prefix = "score_",
values_to = "score_val"
) |>
dplyr::mutate(
# force the types for neatness
score_id = as.integer(score_id)
) |>
# reorder the columns for more intuitive display
dplyr::relocate(dataset, model, case)
}
Read all the files and concatenate them.
d_scores <- purrr::pmap_dfr(.l = d_files, .f = read_1)
# quick view of the data that was read
d_scores
# A tibble: 17,800 × 6
dataset model case class_id score_id score_val
<chr> <chr> <int> <int> <int> <dbl>
1 UCR_14 HDC_MINIROCKET 1 1 0 -0.636
2 UCR_14 HDC_MINIROCKET 1 1 1 0.380
3 UCR_14 HDC_MINIROCKET 1 1 2 -1.14
4 UCR_14 HDC_MINIROCKET 1 1 3 -0.604
5 UCR_14 HDC_MINIROCKET 2 3 0 -1.03
6 UCR_14 HDC_MINIROCKET 2 3 1 -0.633
7 UCR_14 HDC_MINIROCKET 2 3 2 -0.560
8 UCR_14 HDC_MINIROCKET 2 3 3 0.219
9 UCR_14 HDC_MINIROCKET 3 2 0 -1.21
10 UCR_14 HDC_MINIROCKET 3 2 1 -1.06
# … with 17,790 more rows
Save the concatenated data.
d_scores |> saveRDS(file = here::here("output", "d_scores.RDS"))
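For later notebooks that reuse this object, it can be read back in with the matching base R function (a sketch, assuming the file saved above exists):

```r
# Read the saved scores back into a data frame
d_scores <- here::here("output", "d_scores.RDS") |> readRDS()
```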
Calculate the number of observations, classes, and scores per file as a basic check. These need to be manually checked against the metadata for the datasets.
d_scores |>
dplyr::group_by(dataset, model) |>
dplyr::summarise(
n_case = max(case),
min_class_id = min(class_id),
max_class_id = max(class_id),
n_class_id = unique(class_id) |> length(),
min_score_id = min(score_id),
max_score_id = max(score_id),
n_score_id = unique(score_id) |> length()
) |>
gt::gt()
`summarise()` has grouped output by 'dataset'. You can override using the
`.groups` argument.
model | n_case | min_class_id | max_class_id | n_class_id | min_score_id | max_score_id | n_score_id |
---|---|---|---|---|---|---|---|
UCR_14 | |||||||
HDC_MINIROCKET | 1380 | 0 | 3 | 4 | 0 | 3 | 4 |
MINIROCKET | 1380 | 0 | 3 | 4 | 0 | 3 | 4 |
UCR_48 | |||||||
HDC_MINIROCKET | 130 | 0 | 25 | 26 | 0 | 25 | 26 |
MINIROCKET | 130 | 0 | 25 | 26 | 0 | 25 | 26 |
That looks as expected.
Calculate the number of observations and scores for each class in each file. These need to be manually checked against the metadata for the datasets.
d_scores |>
dplyr::group_by(dataset, model, class_id) |>
dplyr::summarise(
n_case = unique(case) |> length(),
n_score = unique(score_id) |> length()
) |>
gt::gt()
`summarise()` has grouped output by 'dataset', 'model'. You can override using
the `.groups` argument.
class_id | n_case | n_score |
---|---|---|
UCR_14 - HDC_MINIROCKET | ||
0 | 350 | 4 |
1 | 343 | 4 |
2 | 345 | 4 |
3 | 342 | 4 |
UCR_14 - MINIROCKET | ||
0 | 350 | 4 |
1 | 343 | 4 |
2 | 345 | 4 |
3 | 342 | 4 |
UCR_48 - HDC_MINIROCKET | ||
0 | 5 | 26 |
1 | 5 | 26 |
2 | 5 | 26 |
3 | 5 | 26 |
4 | 5 | 26 |
5 | 5 | 26 |
6 | 5 | 26 |
7 | 5 | 26 |
8 | 5 | 26 |
9 | 5 | 26 |
10 | 5 | 26 |
11 | 5 | 26 |
12 | 5 | 26 |
13 | 5 | 26 |
14 | 5 | 26 |
15 | 5 | 26 |
16 | 5 | 26 |
17 | 5 | 26 |
18 | 5 | 26 |
19 | 5 | 26 |
20 | 5 | 26 |
21 | 5 | 26 |
22 | 5 | 26 |
23 | 5 | 26 |
24 | 5 | 26 |
25 | 5 | 26 |
UCR_48 - MINIROCKET | ||
0 | 5 | 26 |
1 | 5 | 26 |
2 | 5 | 26 |
3 | 5 | 26 |
4 | 5 | 26 |
5 | 5 | 26 |
6 | 5 | 26 |
7 | 5 | 26 |
8 | 5 | 26 |
9 | 5 | 26 |
10 | 5 | 26 |
11 | 5 | 26 |
12 | 5 | 26 |
13 | 5 | 26 |
14 | 5 | 26 |
15 | 5 | 26 |
16 | 5 | 26 |
17 | 5 | 26 |
18 | 5 | 26 |
19 | 5 | 26 |
20 | 5 | 26 |
21 | 5 | 26 |
22 | 5 | 26 |
23 | 5 | 26 |
24 | 5 | 26 |
25 | 5 | 26 |
That looks as expected.
sessionInfo()
R version 4.2.2 Patched (2022-11-10 r83330)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.1 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
locale:
[1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8
[5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
[7] LC_PAPER=en_AU.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] gt_0.8.0 purrr_1.0.1 tidyr_1.3.0 readr_2.1.3
[5] stringr_1.5.0 dplyr_1.0.10 tibble_3.1.8 fs_1.6.0
[9] here_1.0.1 workflowr_1.7.0
loaded via a namespace (and not attached):
[1] tidyselect_1.2.0 xfun_0.36 bslib_0.4.2 colorspace_2.1-0
[5] vctrs_0.5.2 generics_0.1.3 htmltools_0.5.4 yaml_2.3.7
[9] utf8_1.2.2 rlang_1.0.6 jquerylib_0.1.4 later_1.3.0
[13] pillar_1.8.1 withr_2.5.0 glue_1.6.2 bit64_4.0.5
[17] lifecycle_1.0.3 munsell_0.5.0 gtable_0.3.1 evaluate_0.20
[21] knitr_1.42 callr_3.7.3 tzdb_0.3.0 fastmap_1.1.0
[25] httpuv_1.6.8 ps_1.7.2 parallel_4.2.2 fansi_1.0.4
[29] Rcpp_1.0.10 scales_1.2.1 renv_0.16.0 promises_1.2.0.1
[33] cachem_1.0.6 vroom_1.6.1 jsonlite_1.8.4 bit_4.0.5
[37] ggplot2_3.4.0 hms_1.1.2 digest_0.6.31 stringi_1.7.12
[41] processx_3.8.0 getPass_0.2-2 rprojroot_2.0.3 grid_4.2.2
[45] cli_3.6.0 tools_4.2.2 magrittr_2.0.3 sass_0.4.5
[49] crayon_1.5.2 whisker_0.4.1 pkgconfig_2.0.3 ellipsis_0.3.2
[53] rmarkdown_2.20 httr_1.4.4 rstudioapi_0.14 R6_2.5.1
[57] git2r_0.30.1 compiler_4.2.2