jenniferthompson · November 11, 2023 23:56 · jenniferthompson · Nov 1, 2018 · caitlinhudon · Nov 1, 2018
diff --git a/genericized_dd.Rmd b/genericized_dd.Rmd
 ---
 title: "Example Data Dictionary"
 author: "Jennifer Thompson"
 date: "11/1/2018"
 output:
  html_document:
    theme: yeti
    code_folding: hide
 ---

 *This is a toy example of how I created a data dictionary for tables with flags
 to include patients in specific cohorts (eg, all hospital survivors, all patients
 with data at a follow-up time point, etc).*

 Several analysts will be using the combined data from Study A and Study B,
 necessitating a single "source of truth" for criteria to determine common
 cohorts. Indicators for specific cohorts will be created and documented below,
 alongside a data dictionary for each of three data tables which should be
 incorporated into all future analyses using this data. Our goal is to eliminate
 confusion and inconsistencies resulting from different analysts making slightly
 different decisions when building a cohort for a project.

 # File Structure

 These cohort definitions are applied to *analysis* datasets, created in separate
 scripts and stored in `[data file]`. Code for creating individual variables for
 analysis is available in `derivationscripts/[study]_datamgmt.R`; code used to
 combine the two studies' data into a single deidentified data file is in
 `combine_data.R`, all in this directory.

 All tables created in this document are stored in `cohorttables/` within this
 directory, in both RDS and CSV formats, and are intended for merging with the
 tables in `[data file]`.
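
 As a rough sketch of that merge: the example below assumes each table carries a
 patient identifier column, called `id` here. That name, the `age` column, and
 the toy values are hypothetical, since the real key and variables are not shown
 in this document. In practice the cohort table would come from something like
 `readRDS("cohorttables/inhosp_cohorts.rds")` rather than being typed in.

 ```{r merge_sketch}
 ## Hypothetical stand-ins; `id` and `age` are assumed column names
 analysis_df <- data.frame(
   id  = c("A001", "A002", "B001"),
   age = c(64, 71, 58)
 )
 cohort_tbl <- data.frame(
   id            = c("A001", "A002", "B001"),
   hosp_survivor = c(TRUE, FALSE, TRUE)
 )

 ## Left join: keep every row of the analysis data, add the cohort flags
 merged_df <- merge(analysis_df, cohort_tbl, by = "id", all.x = TRUE)

 ## Restrict to a documented cohort; %in% treats NA flags as "not in cohort"
 survivors_df <- merged_df[merged_df$hosp_survivor %in% TRUE, ]

 ```

 Filtering on the stored flag, rather than re-deriving survival in each project,
 is what keeps cohorts consistent across analysts.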

 ```{r setup, message = FALSE}
 knitr::opts_chunk$set(message = FALSE, eval = FALSE) ## obvs, eval = TRUE in real life

 library(knitr)
 library(tidyverse)
 library(kableExtra)

 ```

 ```{r load_data}
 load('analysisdata/datafile.Rdata')

 ```

 # In-Hospital Indicators

 ```{r inhosp, eval = TRUE}
 ################################################################################
 ## In-hospital indicators
 ################################################################################

 ## -- Prep ---------------------------------------------------------------------
 ## Derivation code would go here!

 ## -- Create indicators for in-hospital statuses: ------------------------------
 ## More derivation code!

 ## data.frame to contain info; putting it in a data.frame allows the table to be
 ## prettified with kableExtra
 inhosp_info <- tribble(
  ~ indicator, ~ description,
  "`wd_data`",   "Patient withdrew from study *and* revoked permission to use any data. **All other data should be missing.**",
  "`died_inhosp`", "Died during index hospitalization at any point after enrollment.",
  "`wd_inhosp`", "Withdrew from study during index hospitalization (but allowed use of data already collected).",
  "`hosp_survivor`", "Survived index hospitalization without death or withdrawal.",
  "`had_biomarker`", "Had >=1 measurement of the following biomarkers during hospitalization (per protocol, these were drawn on days 1, 3, 5 following enrollment): [list specific markers]"
 )

 ```

 `inhosp_df` includes one row per enrolled patient (total N = `nrow(df1)`) and
 one column for each of the following indicators (`TRUE/FALSE`):

 ```{r print_inhosp, eval = TRUE}
 kable(
  inhosp_info,
  format = "html",
  col.names = c("", "")
 ) %>%
  pack_rows(index = c("Discharge Status" = 4, "In-Hospital Cohorts" = 1)) %>%
  kable_styling(bootstrap_options = c("hover"))

 ```

 Each patient must have >=1 of `died_inhosp`, `wd_inhosp`, and `hosp_survivor` =
 `TRUE`. It is possible to have death information on a withdrawn patient, as some
 patients allowed us to continue to access their medical records after
 withdrawal.
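
 A rule like this is easy to verify once the table exists. The sketch below uses
 a toy stand-in for `inhosp_df` with an assumed `id` column and made-up values,
 and it assumes that patients with `wd_data = TRUE` are exempt from the rule
 because all of their other data should be missing.

 ```{r check_inhosp_sketch}
 ## Toy stand-in for inhosp_df; `id` is an assumed column name
 toy_inhosp <- data.frame(
   id            = c("A001", "A002", "A003", "A004"),
   wd_data       = c(FALSE, FALSE, FALSE, TRUE),
   died_inhosp   = c(FALSE, TRUE, TRUE, NA),
   wd_inhosp     = c(FALSE, FALSE, TRUE, NA),
   hosp_survivor = c(TRUE, FALSE, FALSE, NA)
 )

 ## Count discharge-status flags set to TRUE for each patient
 status_cols <- c("died_inhosp", "wd_inhosp", "hosp_survivor")
 n_true <- rowSums(toy_inhosp[, status_cols], na.rm = TRUE)

 ## Anyone who did not revoke all data needs at least one TRUE flag
 no_status <- toy_inhosp$id[!toy_inhosp$wd_data & n_true == 0]
 stopifnot(length(no_status) == 0)

 ```

 Patient `A003` has two flags set, which matches the case of a withdrawn patient
 whose death was later found in the medical record.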

 # Overall Patient Status at Each Time Point

 These studies are longitudinal, and have several prespecified time points for
 data collection:

 - **In-hospital**: Data was collected daily in the hospital, during and
 following critical illness, until death, withdrawal, or discharge from the
 hospital.
 - **[Original follow-up points]**: The original follow-up time points for these
 studies, when a full this and that battery was performed for all available
 patients. If patients were found to have died since last contact, this
 information was also entered; similarly, if patients withdrew from the study
 when they were contacted for assessment, this was noted. Some patients could not
 be found or could not complete an assessment; these patients are considered
 "lost to follow-up."
 - **[Later follow-up points]**: These follow-up points were added for Study A
 **only**, after the initial studies were complete, and used a battery similar
 to that of the time1 and time2 follow-ups.
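
 The time points above imply a patient-by-time-point layout in which Study A
 patients contribute more rows than Study B patients. One way to build that
 skeleton is sketched below; the patient list and the `id` and `study` column
 names are hypothetical, and the time point labels follow the `timept` values
 documented later in this file.

 ```{r skeleton_sketch}
 ## Hypothetical patient list; `id` and `study` are assumed column names
 patients <- data.frame(
   id    = c("A001", "A002", "B001"),
   study = c("A", "A", "B")
 )
 timepts <- c("inhosp", "time1", "time2", "time3", "time4")

 ## Cross every patient with every time point (by = NULL gives all combinations)
 skeleton <- merge(patients, data.frame(timept = timepts), by = NULL)

 ## time3 and time4 were collected for Study A only, so drop them for Study B
 skeleton <- skeleton[
   skeleton$study == "A" | !(skeleton$timept %in% c("time3", "time4")),
 ]

 ## Sort by patient, then by time point in protocol order
 skeleton <- skeleton[order(skeleton$id, match(skeleton$timept, timepts)), ]

 ## Rows per patient
 table(skeleton$id)

 ```

 Starting from a complete skeleton and joining status information onto it makes
 missing time points explicit instead of silently absent.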

 ```{r overall_status}
 ################################################################################
 ## Status indicators at ALL time points
 ##  (in-hospital, ...)
 ################################################################################

 ## -- Prep ---------------------------------------------------------------------
 ## You guessed it! More derivation code!

 ## -- Create dummy dataset: all IDs, all time points ---------------------------
 ## And yet MORE CODE

 status_info <- tribble(
  ~ indicator, ~ description,
  "`timept`", "Time point (`inhosp`, `time1`, `time2`, `time3`, `time4`)",
  "`wd_data`", "Patient withdrew from study *and* revoked permission to use any data. **All other data should be missing.**",
  "`status`", "Status at this time point (`Deceased`, `Withdrawn`, or `Survived, in study`)",
  "`died`", "Patient deceased at (or prior to) this time point",
  "`wd`", "Patient withdrew at (or prior to) this time point",
  "`alive_instudy`", "Patient remained alive and in the study at this time point; may or may not have assessment data"
 )
  
 ```

 `status_df` includes one row per enrolled patient per time point (5 time points
 for Study A patients, and 3 time points for Study B patients), and one column
 for each of the following variables (indicators are `TRUE/FALSE`):

 ```{r print_status}
 kable(
  status_info,
  format = "html",
  col.names = c("", "")
 ) %>%
  pack_rows(index = c(" " = 3, "Status Indicators" = 3)) %>%
  kable_styling(bootstrap_options = c("hover"))

 ```

 Patients are only `alive_instudy` if they have neither died nor withdrawn. A
 patient who withdrew could, however, also be deceased, if the patient still
 allowed us to access health records or public information after withdrawal, and
 that patient was found to have died.
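
 The logic in this paragraph can be sketched in a couple of lines. The data
 below are made up for a single time point, and `id` is an assumed column name;
 this is an illustration of the rule, not the derivation code used for the real
 tables.

 ```{r alive_sketch}
 ## Toy rows for one time point; values are invented for illustration
 toy_status <- data.frame(
   id      = c("A001", "A002", "A003", "A004", "A005"),
   wd_data = c(FALSE, FALSE, FALSE, FALSE, TRUE),
   died    = c(FALSE, TRUE, FALSE, TRUE, NA),
   wd      = c(FALSE, FALSE, TRUE, TRUE, NA)
 )

 ## Alive and in study only when neither flag is TRUE; NA inputs stay NA
 toy_status$alive_instudy <- !toy_status$died & !toy_status$wd

 toy_status

 ```

 Patient `A004` shows the overlap case (withdrawn and later found to be
 deceased), and `A005`, who revoked all data, stays `NA` rather than being
 counted as alive.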

 <insert specific example>

 Patients who withdrew and revoked access to all data have `NA` for each
 indicator. Patients who died or withdrew, but allowed continued record access,
 have `FALSE` for each indicator.

 # Follow-Up Status

 At each follow-up assessment point, patients could be fully assessed; partially
 assessed; alive, but not assessed; withdrawn; or deceased. Some tests on the
 assessment battery (..., ..., ...) had to be done in person, whereas other tests
 (...) could be done over the phone; therefore, more patients were able to
 complete the phone-based tests.

 <info about how we tend to handle analysis for patients with incomplete data>

 ```{r followup}
 ################################################################################
 ## Follow-Up Indicators
 ################################################################################

 ## COOOODDDDEEEEE

 fu_info <- tribble(
  ~ indicator, ~ description,
  "`timept`", "Time point (`time1`, `time2`, `time3`, `time4`)",
  "`any_outcomes`", "Patient has data for any this and/or that outcome assessment",
  "`this_outcomes`", "Patient has data for >=1 *this* assessment (...)",
  "`that_outcomes`", "Patient has data for >=1 *that* assessment (...)"
 )
  
 ```

 `fu_df` includes one row per enrolled patient per follow-up time point (time1 +
 time2 for all patients; time3 + time4 for Study A only) and one column for each
 of the following variables (indicators are `TRUE/FALSE`):

 ```{r print_fu}
 kable(
  fu_info,
  format = "html",
  col.names = c("", "")
 ) %>%
  pack_rows(index = c(" " = 1, "Assessment Indicators" = 3)) %>%
  kable_styling(bootstrap_options = c("hover"))

 ```

 Patients who withdrew and revoked access to all data have `NA` for each
 indicator. Patients who died or withdrew, but allowed continued access, have
 `FALSE` for each indicator.
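
 As an illustration of how these indicators relate to each other, the sketch
 below derives `any_outcomes` from the two assessment flags. The rows are
 invented and `id` is an assumed column name; the real derivation code is not
 shown in this document.

 ```{r any_outcomes_sketch}
 ## Toy follow-up rows at a single time point; values are made up
 toy_fu <- data.frame(
   id            = c("A001", "A002", "A003", "A004"),
   timept        = "time1",
   this_outcomes = c(TRUE, FALSE, FALSE, NA),
   that_outcomes = c(TRUE, TRUE, FALSE, NA)
 )

 ## TRUE if the patient has data for either assessment type; NA | NA stays NA
 toy_fu$any_outcomes <- toy_fu$this_outcomes | toy_fu$that_outcomes

 ## Example cohort: patients with any outcome data at this time point
 any_cohort <- toy_fu$id[toy_fu$any_outcomes %in% TRUE]

 ```

 Using `%in% TRUE` when selecting a cohort drops the `NA` rows, so patients who
 revoked all data are never pulled into an analysis by accident.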

 ### Code to Save All Reference Tables

 ```{r save_tables}
 ## Save each table as RDS (preserves column types for use in R)
 walk2(
  .x = list(inhosp_df, status_df, fu_df),
  .y = c("inhosp_cohorts", "overall_status", "fu_cohorts"),
  ~ saveRDS(.x, file = paste0("cohorttables/", .y, ".rds"))
 )

 ## Save the same tables as CSV for use outside R
 walk2(
  .x = list(inhosp_df, status_df, fu_df),
  .y = c("inhosp_cohorts", "overall_status", "fu_cohorts"),
  ~ write.csv(.x, row.names = FALSE, file = paste0("cohorttables/", .y, ".csv"))
 )

 ```
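
 For analysts picking up these files, a minimal round trip is sketched below. It
 writes a toy table to a temporary directory instead of the real
 `cohorttables/` folder, so nothing is overwritten; the `id` column and the
 values are hypothetical.

 ```{r read_tables_sketch}
 ## Work in a temporary directory so no real files are touched
 out_dir <- file.path(tempdir(), "cohorttables")
 dir.create(out_dir, showWarnings = FALSE)

 ## Toy table; `id` is an assumed column name
 toy_tbl <- data.frame(id = c("A001", "B001"), hosp_survivor = c(TRUE, NA))
 saveRDS(toy_tbl, file = file.path(out_dir, "inhosp_cohorts.rds"))
 write.csv(toy_tbl, row.names = FALSE, file = file.path(out_dir, "inhosp_cohorts.csv"))

 ## Read both versions back in
 from_rds <- readRDS(file.path(out_dir, "inhosp_cohorts.rds"))
 from_csv <- read.csv(file.path(out_dir, "inhosp_cohorts.csv"))

 ## RDS returns the object as saved; CSV column types are re-guessed on import
 identical(from_rds, toy_tbl)
 str(from_csv)

 ```

 For work done in R, the RDS files are probably the safer choice, since the
 indicator columns come back as logical without relying on type guessing.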