lwaldron/cmd_healthycontrols.R

lwaldron · 2024-05-03T14:33:14Z

Ran on May 3, 2024 on superstudio. Results at 2024-05-04_fornash.zip

> sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /usr/local/lib/R/lib/libRblas.so 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] dplyr_1.1.4                     curatedMetagenomicData_3.10.0   TreeSummarizedExperiment_2.10.0
 [4] Biostrings_2.70.3               XVector_0.42.0                  SingleCellExperiment_1.24.0    
 [7] SummarizedExperiment_1.32.0     Biobase_2.62.0                  GenomicRanges_1.54.1           
[10] GenomeInfoDb_1.38.8             IRanges_2.36.0                  S4Vectors_0.40.2               
[13] BiocGenerics_0.48.1             MatrixGenerics_1.14.0           matrixStats_1.3.0              

loaded via a namespace (and not attached):
  [1] rstudioapi_0.16.0             jsonlite_1.8.8                MultiAssayExperiment_1.28.0  
  [4] magrittr_2.0.3                ggbeeswarm_0.7.2              fs_1.6.4                     
  [7] zlibbioc_1.48.2               vctrs_0.6.5                   memoise_2.0.1                
 [10] DelayedMatrixStats_1.24.0     RCurl_1.98-1.14               htmltools_0.5.8.1            
 [13] S4Arrays_1.2.1                AnnotationHub_3.10.1          curl_5.2.1                   
 [16] BiocNeighbors_1.20.2          SparseArray_1.2.4             plyr_1.8.9                   
 [19] DECIPHER_2.30.0               cachem_1.0.8                  igraph_2.0.3                 
 [22] mime_0.12                     lifecycle_1.0.4               pkgconfig_2.0.3              
 [25] rsvd_1.0.5                    Matrix_1.6-5                  R6_2.5.1                     
 [28] fastmap_1.1.1                 GenomeInfoDbData_1.2.11       shiny_1.8.1.1                
 [31] digest_0.6.35                 colorspace_2.1-0              AnnotationDbi_1.64.1         
 [34] scater_1.30.1                 irlba_2.3.5.1                 ExperimentHub_2.10.0         
 [37] RSQLite_2.3.6                 vegan_2.6-4                   beachmat_2.18.1              
 [40] filelock_1.0.3                fansi_1.0.6                   httr_1.4.7                   
 [43] abind_1.4-5                   mgcv_1.9-0                    compiler_4.3.2               
 [46] bit64_4.0.5                   withr_3.0.0                   BiocParallel_1.36.0          
 [49] viridis_0.6.5                 DBI_1.2.2                     MASS_7.3-60                  
 [52] rappdirs_0.3.3                DelayedArray_0.28.0           bluster_1.12.0               
 [55] permute_0.9-7                 tools_4.3.2                   vipor_0.4.7                  
 [58] beeswarm_0.4.0                ape_5.8                       interactiveDisplayBase_1.40.0
 [61] httpuv_1.6.15                 glue_1.7.0                    nlme_3.1-163                 
 [64] promises_1.3.0                grid_4.3.2                    mia_1.10.0                   
 [67] cluster_2.1.6                 reshape2_1.4.4                generics_0.1.3               
 [70] gtable_0.3.5                  tidyr_1.3.1                   BiocSingular_1.18.0          
 [73] ScaledMatrix_1.10.0           utf8_1.2.4                    ggrepel_0.9.5                
 [76] BiocVersion_3.18.1            pillar_1.9.0                  stringr_1.5.1                
 [79] yulab.utils_0.1.4             later_1.3.2                   splines_4.3.2                
 [82] BiocFileCache_2.10.2          treeio_1.26.0                 lattice_0.22-6               
 [85] bit_4.0.5                     tidyselect_1.2.1              DirichletMultinomial_1.44.0  
 [88] scuttle_1.12.0                gridExtra_2.3                 stringi_1.8.3                
 [91] lazyeval_0.2.2                yaml_2.3.8                    codetools_0.2-19             
 [94] tibble_3.2.1                  BiocManager_1.30.22           cli_3.6.2                    
 [97] xtable_1.8-4                  munsell_0.5.1                 Rcpp_1.0.12                  
[100] dbplyr_2.5.0                  png_0.1-8                     parallel_4.3.2               
[103] ggplot2_3.5.1                 blob_1.2.4                    sparseMatrixStats_1.14.0     
[106] bitops_1.0-7                  decontam_1.22.0               viridisLite_0.4.2            
[109] tidytree_0.4.6                scales_1.3.0                  purrr_1.0.2                  
[112] crayon_1.5.2                  rlang_1.1.3                   KEGGREST_1.42.0

lwaldron · 2024-05-03T14:36:55Z

This gist provides relative abundance files, with NCBI IDs in columns and observations in rows, and corresponding metadata files for stool specimens from healthy control participants. I divided the files by age category, since the categories will have somewhat different properties:

$ wc -l *relab.csv
    8983 adult_relab.csv
     821 child_relab.csv
    2328 newborn_relab.csv
     229 schoolage_relab.csv
     835 senior_relab.csv
   13196 total

Note that the relative abundances won't always add up to exactly 100%, because some species that could not be mapped to the phylogeny were dropped, but these are rare and low abundance. Note also that there are an additional 1,301 control samples from body sites other than stool that are not included here but are available if you want them. Finally, we'll be re-running these and some (possibly tens of) thousands more specimens through MetaPhlAn4, which will add a large number of Species Genome Bins: putative species, based on high-quality metagenome assemblies, that have not yet been isolated or named (or assigned NCBI identifiers).

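	library(curatedMetagenomicData)
	library(dplyr)

	# Age categories present in the full sample metadata (NA dropped)
	agecats <- unique(sampleMetadata$age_category) |> na.omit()

	# Healthy-control stool samples with a known age category
	sm <- filter(sampleMetadata, study_condition == "control") |>
		filter(disease == "healthy") |>
		filter(body_site == "stool") |>
		filter(!is.na(age_category))

	# Write one relative abundance file and one metadata file per age category
	for (agecat in agecats) {
		sm1 <- filter(sm, age_category == agecat)
		se <- returnSamples(sm1, dataType = "relative_abundance", rownames = "NCBI")
		# Transpose so that samples are in rows and NCBI IDs are in columns
		write.csv(t(assay(se)), file = paste0(agecat, "_relab.csv"))
		write.csv(colData(se), file = paste0(agecat, "_samplemetadata.csv"))
	}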
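
Since some species were dropped, it may be useful to see how far each sample falls short of 100%. The following is a minimal sketch, not part of the original script: `relab_totals` is a hypothetical helper, it assumes the files have the layout written by the loop above (sample IDs in the first column, NCBI IDs as column names) and that abundances are on a percent scale, and the toy matrix is made-up data used only to show the shape of the result.

```r
# Hypothetical helper (not part of the original script): per-sample total
# relative abundance and its shortfall from 100%.
# Assumes a CSV with sample IDs in the first column and NCBI IDs as headers.
relab_totals <- function(path) {
  # check.names = FALSE keeps numeric NCBI IDs as column names (no "X" prefix)
  relab <- read.csv(path, row.names = 1, check.names = FALSE)
  totals <- rowSums(relab)
  data.frame(sample_id = rownames(relab),
             total = unname(totals),
             shortfall = 100 - unname(totals))
}

# Toy data (made up, NOT from curatedMetagenomicData): 2 samples x 3 taxa
toy <- matrix(c(60, 40, 0,
                55, 30, 14.5),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("sampleA", "sampleB"),
                              c("816", "853", "1680")))
toy_path <- tempfile(fileext = ".csv")
write.csv(toy, file = toy_path)

toy_totals <- relab_totals(toy_path)
print(toy_totals)
```

On the real files this would be called as, for example, `relab_totals("adult_relab.csv")`, and `summary()` of the `shortfall` column gives a quick view of how much abundance was lost. Because `write.csv` adds a header row, the number of samples returned should be one less than the corresponding `wc -l` count.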
