Example structure for data dictionary + code used for derivation using RMarkdown. Creates three data tables and documents general + field-specific info.
---
title: "Example Data Dictionary"
author: "Jennifer Thompson"
date: "11/1/2018"
output:
  html_document:
    theme: yeti
    code_folding: hide
---

*This is a toy example of how I created a data dictionary for tables with flags
to include patients in specific cohorts (e.g., all hospital survivors, all patients
with data at a follow-up time point, etc.).*

Several analysts will be using the combined data from Study A and Study B,
necessitating a single "source of truth" for criteria to determine common
cohorts. Indicators for specific cohorts will be created and documented below,
alongside a data dictionary for each of three data tables which should be
incorporated into all future analyses using this data. Our goal is to eliminate
confusion and inconsistencies resulting from different analysts making slightly
different decisions when building a cohort for a project.

# File Structure

These cohort definitions are applied to *analysis* datasets, created in separate
scripts and stored in `[data file]`. Code for creating individual variables for
analysis is available in `derivationscripts/[study]_datamgmt.R`; code used to
combine the two studies' data into a single deidentified data file is in
`combine_data.R`, all in this directory.

All tables created in this document are stored in `cohorttables/` within this
directory, in both RDS and CSV formats, and are intended for merging with the
tables in `[data file]`.
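
As a rough sketch of the intended merge, an analyst could read one cohort table and join it onto an analysis table. This assumes both tables share a patient identifier column; the name `id`, the toy values, and the temporary file path are all hypothetical stand-ins, not the real study data or the real `cohorttables/` directory.

```r
library(dplyr)

## Toy stand-ins; the `id` column and all values are hypothetical
inhosp_df <- data.frame(
  id            = c("A-001", "A-002", "B-001"),
  hosp_survivor = c(TRUE, FALSE, TRUE)
)
analysis_df <- data.frame(
  id  = c("A-001", "A-002", "B-001"),
  age = c(54, 67, 71)
)

## Save and re-read the cohort table, mimicking the RDS files in cohorttables/
tmp_path <- file.path(tempdir(), "inhosp_cohorts.rds")
saveRDS(inhosp_df, file = tmp_path)
cohort_flags <- readRDS(tmp_path)

## Attach the cohort flags, then keep only hospital survivors
survivors <- analysis_df %>%
  left_join(cohort_flags, by = "id") %>%
  filter(hosp_survivor)
```

Joining the flags onto the analysis data, rather than recomputing them, is what keeps every analyst on the same cohort definition.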

```{r setup, message = FALSE}
knitr::opts_chunk$set(message = FALSE, eval = FALSE) ## obvs, eval = TRUE in real life

library(knitr)
library(tidyverse)
library(kableExtra)

```

```{r load_data}
load('analysisdata/datafile.Rdata')

```

# In-Hospital Indicators

```{r inhosp, eval = TRUE}
################################################################################
## In-hospital indicators
################################################################################

## -- Prep ---------------------------------------------------------------------
## Derivation code would go here!

## -- Create indicators for in-hospital statuses: ------------------------------
## More derivation code!

## data.frame to contain info; putting it in a data.frame allows the table to be
## prettified with kableExtra
inhosp_info <- tribble(
  ~ indicator, ~ description,
  "`wd_data`", "Patient withdrew from study *and* revoked permission to use any data. **All other data should be missing.**",
  "`died_inhosp`", "Died during index hospitalization at any point after enrollment.",
  "`wd_inhosp`", "Withdrew from study during index hospitalization (but allowed use of data already collected).",
  "`hosp_survivor`", "Survived index hospitalization without death or withdrawal.",
  "`had_biomarker`", "Had >=1 measurement of the following biomarkers during hospitalization (per protocol, these were drawn on days 1, 3, 5 following enrollment): [list specific markers]"
)

```

`inhosp_df` includes one row per enrolled patient (total N = `nrow(df1)`) and
one column for each of the following indicators (`TRUE/FALSE`):

```{r print_inhosp, eval = TRUE}
kable(
  inhosp_info,
  format = "html",
  col.names = c("", "")
) %>%
  group_rows(index = c("Discharge Status" = 4, "In-Hospital Cohorts" = 1)) %>%
  kable_styling(bootstrap_options = c("hover"))

```

Each patient must have at least one of `died_inhosp`, `wd_inhosp`, and
`hosp_survivor` set to `TRUE`. A withdrawn patient can also have death
information, because some patients allowed us to continue accessing their
medical records after withdrawal.
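
A quick consistency check for this rule might look like the sketch below. The toy `inhosp_df`, its `id` column, and its values are hypothetical; patients with `wd_data = TRUE` (whose indicators are missing) are left out here and would need separate handling.

```r
## Toy in-hospital table; the `id` column and all values are hypothetical
inhosp_df <- data.frame(
  id            = c("A-001", "A-002", "A-003", "B-001"),
  died_inhosp   = c(FALSE, TRUE,  TRUE,  FALSE),
  wd_inhosp     = c(FALSE, FALSE, TRUE,  FALSE),
  hosp_survivor = c(TRUE,  FALSE, FALSE, FALSE)
)

## Count how many discharge statuses are TRUE for each patient
status_cols <- c("died_inhosp", "wd_inhosp", "hosp_survivor")
n_status <- rowSums(inhosp_df[, status_cols])

## Patients with no discharge status violate the rule and need review
no_status_ids <- inhosp_df$id[n_status == 0]

## Death plus withdrawal is allowed (record access continued after withdrawal)
both_ids <- inhosp_df$id[inhosp_df$died_inhosp & inhosp_df$wd_inhosp]
```

In this toy data, `B-001` is flagged for review and `A-003` is the permitted died-and-withdrew case.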

# Overall Patient Status at Each Time Point

These studies are longitudinal, and have several prespecified time points for
data collection:

- **In-hospital**: Data was collected daily in the hospital, during and
  following critical illness, until death, withdrawal, or discharge from the
  hospital.
- **[Original follow-up points]**: The original follow-up time points for these
  studies, when a full this and that battery was performed for all available
  patients. If patients were found to have died since last contact, this
  information was also entered; similarly, if patients withdrew from the study
  when they were contacted for assessment, this was noted. Some patients could not
  be found or could not complete an assessment; these patients are considered
  "lost to follow-up."
- **[Later follow-up points]**: These follow-up points were added for Study A
  **only** after the initial studies were complete, using a battery similar to
  the one performed at the time1 and time2 follow-ups.
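
One way to turn these time points into a patient-by-time-point skeleton is sketched below. The `id` and `study` columns and the toy patients are hypothetical, and this is only an illustration of the idea, not the derivation code used for the real tables. The time point labels follow the `timept` values used in this document.

```r
library(dplyr)
library(tidyr)

## Hypothetical patient list; `id` and `study` are assumed column names
patients <- data.frame(
  id    = c("A-001", "A-002", "B-001"),
  study = c("A", "A", "B")
)

all_timepts <- c("inhosp", "time1", "time2", "time3", "time4")

## Cross every patient with every time point, then drop the later
## follow-up points (time3, time4) for Study B, which did not have them
skeleton <- expand_grid(id = patients$id, timept = all_timepts) %>%
  left_join(patients, by = "id") %>%
  filter(study == "A" | !timept %in% c("time3", "time4"))
```

Starting from a complete skeleton means a patient with no data at a time point still gets a row, so missing assessments show up explicitly instead of silently dropping out.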

```{r overall_status}
################################################################################
## Status indicators at ALL time points
## (in-hospital, ...)
################################################################################

## -- Prep ---------------------------------------------------------------------
## You guessed it! More derivation code!

## -- Create dummy dataset: all IDs, all time points ---------------------------
## And yet MORE CODE

status_info <- tribble(
  ~ indicator, ~ description,
  "`timept`", "Time point (`inhosp`, `time1`, `time2`, `time3`, `time4`)",
  "`wd_data`", "Patient withdrew from study *and* revoked permission to use any data. **All other data should be missing.**",
  "`status`", "Status at this time point (`Deceased`, `Withdrawn`, or `Survived, in study`)",
  "`died`", "Patient deceased at (or prior to) this time point",
  "`wd`", "Patient withdrew at (or prior to) this time point",
  "`alive_instudy`", "Patient remained alive and in the study at this time point; may or may not have assessment data"
)

```

`status_df` includes one row per enrolled patient per time point (5 time points
for Study A patients, and 3 time points for Study B patients), and one column
for each of the following indicators (`TRUE/FALSE`):

```{r print_status}
kable(
  status_info,
  format = "html",
  col.names = c("", "")
) %>%
  group_rows(index = c(" " = 3, "Status Indicators" = 3)) %>%
  kable_styling(bootstrap_options = c("hover"))

```

Patients are only `alive_instudy` if they have neither died nor withdrawn. A
patient who withdrew could, however, also be deceased, if the patient still
allowed us to access health records or public information after withdrawal, and
that patient was found to have died.

<insert specific example>

Patients who withdrew and revoked access to all data have `NA` for each
indicator. Patients who died or withdrew, but allowed continued record access,
have `FALSE` for each indicator.
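
The `alive_instudy` rule can be sketched as follows, using the `died` and `wd` definitions from the table above. The toy `status_df`, its `id` column, and its values are hypothetical, and the real derivation code may differ.

```r
library(dplyr)

## Toy status table at one time point; `id` and all values are hypothetical
status_df <- data.frame(
  id      = c("A-001", "A-002", "A-003", "A-004"),
  wd_data = c(FALSE, FALSE, FALSE, TRUE),
  died    = c(FALSE, TRUE,  TRUE,  NA),
  wd      = c(FALSE, FALSE, TRUE,  NA)
)

status_df <- status_df %>%
  mutate(
    ## Alive and in the study only if the patient neither died nor withdrew;
    ## set to NA when the patient revoked permission to use any data
    alive_instudy = if_else(wd_data, NA, !died & !wd)
  )
```

Here `A-003` both withdrew and was later found to have died, so `alive_instudy` is `FALSE`, while `A-004` revoked all data and stays `NA`.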

# Follow-Up Status

At each follow-up assessment point, patients could be fully assessed; partially
assessed; alive, but not assessed; withdrawn; or deceased. Some tests on the
assessment battery (..., ..., ...) had to be done in person, whereas other tests
(...) could be done over the phone; therefore, more patients were able to
complete the phone-based tests.

<info about how we tend to handle analysis for patients with incomplete data>

```{r followup}
################################################################################
## Follow-Up Indicators
################################################################################

## COOOODDDDEEEEE

fu_info <- tribble(
  ~ indicator, ~ description,
  "`timept`", "Time point (`time1`, `time2`, `time3`, `time4`)",
  "`any_outcomes`", "Patient has data for any this and/or that outcome assessment",
  "`this_outcomes`", "Patient has data for >=1 *this* assessment (...)",
  "`that_outcomes`", "Patient has data for >=1 *that* assessment (...)"
)

```

`fu_df` includes one row per enrolled patient per follow-up time point (time1 +
time2 for all patients; time3 + time4 for Study A only) and one column for each
of the following indicators (`TRUE/FALSE`):

```{r print_fu}
kable(
  fu_info,
  format = "html",
  col.names = c("", "")
) %>%
  group_rows(index = c(" " = 1, "Assessment Indicators" = 3)) %>%
  kable_styling(bootstrap_options = c("hover"))

```

Patients who withdrew and revoked access to all data have `NA` for each
indicator. Patients who died or withdrew, but allowed continued access, have
`FALSE` for each indicator.
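
As an illustration of how these indicators might be used, the sketch below derives `any_outcomes` from the two battery-specific flags and pulls the IDs for a time1 analysis cohort. The toy `fu_df`, its `id` column, and its values are hypothetical.

```r
library(dplyr)

## Toy follow-up table; `id` and all values are hypothetical
fu_df <- data.frame(
  id            = c("A-001", "A-001", "A-002", "B-001"),
  timept        = c("time1", "time2", "time1", "time1"),
  this_outcomes = c(TRUE,  FALSE, FALSE, NA),
  that_outcomes = c(FALSE, FALSE, TRUE,  NA)
)

## A patient has any outcome data if either battery has >=1 assessment
fu_df <- fu_df %>%
  mutate(any_outcomes = this_outcomes | that_outcomes)

## IDs for a time1 analysis cohort; filter() drops rows where the
## condition is NA, which excludes patients who revoked all data
time1_cohort <- fu_df %>%
  filter(timept == "time1", any_outcomes) %>%
  pull(id)
```

Because `filter()` treats `NA` as not meeting the condition, revoked-data patients are excluded without an extra step.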

### Code to Save All Reference Tables

```{r save_tables}
walk2(
  .x = list(inhosp_df, status_df, fu_df),
  .y = c("inhosp_cohorts", "overall_status", "fu_cohorts"),
  ~ saveRDS(.x, file = paste0("cohorttables/", .y, ".rds"))
)

walk2(
  .x = list(inhosp_df, status_df, fu_df),
  .y = c("inhosp_cohorts", "overall_status", "fu_cohorts"),
  ~ write.csv(.x, row.names = FALSE, file = paste0("cohorttables/", .y, ".csv"))
)

```