Skip to content

Instantly share code, notes, and snippets.

View aofarrel's full-sized avatar
🧬

Ash O'Farrell aofarrel

🧬
  • Genomics Institute, UC Santa Cruz
  • Santa Cruz (via Ireland's Ancient East)
View GitHub Profile
# Notes:
# * Works for SAMN, SAME, and SAMD BioSamples (should also work for SRS/ERS format)
# * Grabs date and location of sample isolation, host, sample source, and strain
# * edirect tools needs to be on $PATH, or you can use my Docker image for a pre-installed version: https://hub.docker.com/r/ashedpotatoes/sranwrp/tags
# * elink is known to randomly fail so this code doesn't use it -- however, without elink, you can't get the SRA reads (SRR/ERR/DRR) that make up a BioSample
import subprocess
biosamples = ["SAMEA6451356","SAMEA6451357","SAMEA6451358","SAMEA6451360","SAMEA6451361","SAMEA6451362","SAMEA6451363","SAMEA6451366","SAMEA6451367","SAMEA6451368","SAMEA6451369","SAMEA6451370","SAMEA6451371","SAMEA6451372","SAMEA6451373","SAMEA6451374","SAMEA6451375","SAMEA6451377"]

If the input variable is an array, we must include an array separator. In WDL 1.0, this is done using the sep= expression placeholder. Every value in the WDL Array[String] will be separated by whatever value is declared via sep. In this example, that is a simple space, as that is one way how to construct a bash variable.

task count_words {
  input {
    Array[String] a_big_sentence
  }
  command <<<
    ARRAY_OF_WORDS=(~{sep=" " a_big_sentence})
 echo ${#ARRAY_OF_FILES[@]} &gt;&gt; length.txt
@aofarrel
aofarrel / TBProf_SAMEA11744634.txt
Created July 14, 2023 17:05
TBProfiler output when run on myco_sra for BioSample SAMEA11744634. TBProfiler was run on the fastqs this time, not the bams.
TBProfiler report
=================
The following report has been generated by TBProfiler.
Summary
-------
ID: SAMEA11744634
Date: Mon Jul 10 17:53:25 2023
Strain: lineage4.2.2.2
@aofarrel
aofarrel / script
Created June 7, 2023 19:28
Pulled from the folder of a run of a decontamination task (the first one listed here: https://github.com/aofarrel/clockwork-wdl/blob/main/tasks/combined_decontamination.wdl) which ran on Terra. The file is known as "script" with no extension, but has a bash shebang.
#!/bin/bash
cd /cromwell_root
tmpDir=$(mkdir -p "/cromwell_root/tmp.65940862" && echo "/cromwell_root/tmp.65940862")
chmod 777 "$tmpDir"
export _JAVA_OPTIONS=-Djava.io.tmpdir="$tmpDir"
export TMPDIR="$tmpDir"
export HOME="$HOME"
(
cd /cromwell_root
import os
import json
lazy_array_of_strings_to_write = []
for file in os.listdir("rand12344"):
if file.endswith(".json"):
sample = file.rstrip("._to_Ref.H37Rv.bam.results.json")
with open(f"rand12344/{file}") as thisjson:
data = json.load(thisjson)
strain = data["sublin"]
lazy_array_of_strings_to_write.append(f"{sample}\t{strain}\n")
@aofarrel
aofarrel / batch_download_by_shard_from_terra.py
Created October 28, 2022 19:13
Download files per shard from Terra.
import os
vcfs = []
bgas = []
diffs = []
for vcf in vcfs:
for bga in bgas:
for diff in diffs:
split = diff.split("/")
@aofarrel
aofarrel / batch_file_rename_example.py
Created October 25, 2022 20:52
Batch renaming + download of several files in google cloud. This was needed because, when downloading with the gcloud CLI (as opposed to the web interface) google just saves the file as "final.vcf" without the full gs:// path. So, trying to batch download a dozen final.vcf files doesn't work.
import os
vcfs = ["gs://your_bucket_here/call-varcall/shard-1/var_call_SRR1049633/final.vcf", "gs://your_bucket_here/call-varcall/shard-2/var_call_SRR1049634/final.vcf", "gs://your_bucket_here/call-varcall/shard-3/var_call_SRR1062904/final.vcf", "gs://your_bucket_here/call-varcall/shard-4/var_call_SRR1140590/final.vcf", "gs://your_bucket_here/call-varcall/shard-5/var_call_SRR1047996/final.vcf", "gs://your_bucket_here/call-varcall/shard-6/var_call_SRR1144759/final.vcf", "gs://your_bucket_here/call-varcall/shard-7/var_call_SRR1165658/final.vcf", "gs://your_bucket_here/call-varcall/shard-8/var_call_SRR1167377/final.vcf", "gs://your_bucket_here/call-varcall/shard-9/var_call_SRR1169242/final.vcf"]
new_vcfs = []
for vcf in vcfs:
split = vcf.split("/")
new_name_start = "/".join(split[:-1])+"/"
new_name_end = split[-2:][0] + ".vcf"
new_name = "".join([new_name_start, new_name_end])
new_vcfs.append(new_name)
@aofarrel
aofarrel / .dockstore.yml
Created January 24, 2022 21:02
Example .dockstore.yml file. This file is embedded into a Medium post, so please don't delete it.
version: 1.2
workflows:
- subclass: WDL
primaryDescriptorPath: goleft_functions.wdl
testParameterFiles:
- goleft_terra.json
@aofarrel
aofarrel / order_in_the_court.wdl
Last active May 27, 2021 23:03
Enforcing order of execution of tasks in Cromwell: proof of concept
version 1.0
# Caveat programmator: Please be sure to read the readme on Github
import "https://raw.githubusercontent.com/DataBiosphere/analysis_pipeline_WDL/implement-merge-gds/ld-pruning-wf.wdl" as test_run_ldpruning
task md5sum {
input {
File test
Array[File] truth
@aofarrel
aofarrel / useful cromwell regular expressions
Created May 26, 2021 23:31
Cromwell is extremely verbose, but you can quickly find the most important parts with the power of regular expressions. These are the regular expressions I use for an iTerm2 profile, setting these to be highlighted when a match is found.
## Match a successful workflow
\[info\] WorkflowExecutionActor(.*) Workflow(.*)complete. Final Outputs:
## Match a failed task (will include the specific error thrown, although in most cases it will simply tell you "problems running X" and to check the stderr file)
\[error\] WorkflowManagerActor Workflow(.*)
## Match status changes
Status change from(.*)