https://gist.github.com/pichuan/64d73bc965300645470eb29a66116593
Here is an example of how to apply the model described in Improving variant calling using population data and deep learning . The example is similar to DeepVariant WGS case study.
I got a machine using a command like this: https://github.com/google/deepvariant/blob/r1.1/docs/deepvariant-details.md#command-for-a-cpu-only-machine-on-google-cloud-platform
Then, follow the instructions in https://github.com/google/deepvariant/blob/r1.1/docs/deepvariant-case-study.md to download the data.
And, because we are going to test the allele frequency feature, we need to
download the file used in --population_vcfs
flag, and also the trained model
files:
The full pVCFs can be found in:
$ gsutil ls -lh gs://brain-genomics-public/research/cohort/1KGP/cohort_dv_glnexus_opt/v3_missing2ref/cohort-*.release_missing2ref.vcf.gz
...
TOTAL: 24 objects, 1002670628843 bytes (933.81 GiB)
We created a set of smaller set of files that contain just the necessary allele frequency information here:
$ gsutil ls -lh gs://brain-genomics-public/research/cohort/1KGP/cohort_dv_glnexus_opt/v3_missing2ref/cohort-*.release_missing2ref.no_calls.vcf.gz
...
TOTAL: 24 objects, 2353577310 bytes (2.19 GiB)
In this tutorial, we'll just run on chr20. So we can download just chr20:
AF_DIR=https://storage.googleapis.com/brain-genomics-public/research/cohort/1KGP/cohort_dv_glnexus_opt/v3_missing2ref
curl ${AF_DIR}/cohort-chr20.release_missing2ref.no_calls.vcf.gz > input/cohort-chr20.release_missing2ref.no_calls.vcf.gz
curl ${AF_DIR}/cohort-chr20.release_missing2ref.no_calls.vcf.gz.tbi > input/cohort-chr20.release_missing2ref.no_calls.vcf.gz.tbi
Model files:
MODEL_DIR=https://storage.googleapis.com/brain-genomics-public/research/allele_frequency/pretrained_model_WGS
curl ${MODEL_DIR}/model.ckpt-2520200.data-00000-of-00001 > input/model.ckpt-2520200.data-00000-of-00001
curl ${MODEL_DIR}/model.ckpt-2520200.index > input/model.ckpt-2520200.index
curl ${MODEL_DIR}/model.ckpt-2520200.meta > input/model.ckpt-2520200.meta
And then in this step: https://github.com/google/deepvariant/blob/r1.1/docs/deepvariant-case-study.md#running-on-a-cpu-only-machine, use this command instead:
mkdir -p output
mkdir -p output/intermediate_results_dir
BIN_VERSION="1.1.0"
sudo docker run \
-v "${PWD}/input":"/input" \
-v "${PWD}/output":"/output" \
-v "${PWD}/reference":"/reference" \
google/deepvariant:"${BIN_VERSION}" \
/opt/deepvariant/bin/run_deepvariant \
--model_type WGS \
--ref /reference/GRCh38_no_alt_analysis_set.fasta \
--reads /input/HG003.novaseq.pcr-free.35x.dedup.grch38_no_alt.chr20.bam \
--output_vcf /output/HG003.output.vcf.gz \
--output_gvcf /output/HG003.output.g.vcf.gz \
--num_shards $(nproc) \
--regions chr20 \
--intermediate_results_dir /output/intermediate_results_dir \
--make_examples_extra_args="use_allele_frequency=true,population_vcfs=/input/cohort-chr20.release_missing2ref.no_calls.vcf.gz" \
--customized_model="/input/model.ckpt-2520200"
Then, follow the rest of the instructions in https://github.com/google/deepvariant/blob/r1.1/docs/deepvariant-case-study.md to finish hap.py evaluation:
Output:
Benchmarking Summary:
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 10634 10581 53 21070 26 10007 19 0.995016 0.997650 0.474941 0.996331 NaN NaN 1.749861 2.304642
INDEL PASS 10634 10581 53 21070 26 10007 19 0.995016 0.997650 0.474941 0.996331 NaN NaN 1.749861 2.304642
SNP ALL 70209 69956 253 86127 73 16068 15 0.996396 0.998958 0.186562 0.997676 2.297347 2.068096 1.884533 1.955580
SNP PASS 70209 69956 253 86127 73 16068 15 0.996396 0.998958 0.186562 0.997676 2.297347 2.068096 1.884533 1.955580
To run on the whole genome (instead of just chr20), the steps will be similar to the ones above. A few changes:
Download the whole genome:
mkdir -p input
gsutil -m cp gs://deepvariant/case-study-testdata/HG003.novaseq.pcr-free.35x.dedup.grch38_no_alt.bam{,.bai} input/
Download all of the population VCF files:
gsutil -m cp gs://brain-genomics-public/research/cohort/1KGP/cohort_dv_glnexus_opt/v3_missing2ref/cohort-*.release_missing2ref.no_calls.vcf.gz{,.tbi} input/
When run DeepVariant, use this command instead:
mkdir -p output
mkdir -p output/intermediate_results_dir
BIN_VERSION="1.1.0"
sudo docker run \
-v "${PWD}/input":"/input" \
-v "${PWD}/output":"/output" \
-v "${PWD}/reference":"/reference" \
google/deepvariant:"${BIN_VERSION}" \
/opt/deepvariant/bin/run_deepvariant \
--model_type WGS \
--ref /reference/GRCh38_no_alt_analysis_set.fasta \
--reads /input/HG003.novaseq.pcr-free.35x.dedup.grch38_no_alt.bam \
--output_vcf /output/HG003.output.vcf.gz \
--output_gvcf /output/HG003.output.g.vcf.gz \
--num_shards $(nproc) \
--intermediate_results_dir /output/intermediate_results_dir \
--make_examples_extra_args='use_allele_frequency=true,population_vcfs="/input/cohort-chr1.release_missing2ref.no_calls.vcf.gz /input/cohort-chr10.release_missing2ref.no_calls.vcf.gz /input/cohort-chr11.release_missing2ref.no_calls.vcf.gz /input/cohort-chr12.release_missing2ref.no_calls.vcf.gz /input/cohort-chr13.release_missing2ref.no_calls.vcf.gz /input/cohort-chr14.release_missing2ref.no_calls.vcf.gz /input/cohort-chr15.release_missing2ref.no_calls.vcf.gz /input/cohort-chr16.release_missing2ref.no_calls.vcf.gz /input/cohort-chr17.release_missing2ref.no_calls.vcf.gz /input/cohort-chr18.release_missing2ref.no_calls.vcf.gz /input/cohort-chr19.release_missing2ref.no_calls.vcf.gz /input/cohort-chr2.release_missing2ref.no_calls.vcf.gz /input/cohort-chr20.release_missing2ref.no_calls.vcf.gz /input/cohort-chr21.release_missing2ref.no_calls.vcf.gz /input/cohort-chr22.release_missing2ref.no_calls.vcf.gz /input/cohort-chr3.release_missing2ref.no_calls.vcf.gz /input/cohort-chr4.release_missing2ref.no_calls.vcf.gz /input/cohort-chr5.release_missing2ref.no_calls.vcf.gz /input/cohort-chr6.release_missing2ref.no_calls.vcf.gz /input/cohort-chr7.release_missing2ref.no_calls.vcf.gz /input/cohort-chr8.release_missing2ref.no_calls.vcf.gz /input/cohort-chr9.release_missing2ref.no_calls.vcf.gz /input/cohort-chrX.release_missing2ref.no_calls.vcf.gz /input/cohort-chrY.release_missing2ref.no_calls.vcf.gz"' \
--customized_model="/input/model.ckpt-2520200"
If you're using BIN_VERSION="1.4.0"
, and stil using the model ckpt above (which was trained without the new "insert_size" channel), please add an extra channel=''
to your make_examples_extra_args. So it looks like:
--make_examples_extra_args="use_allele_frequency=true,population_vcfs=/input/cohort-chr20.release_missing2ref.no_calls.vcf.gz,channels=''"
The semantics here is: "Do not add any extra channel." This will overwrite this flag: https://github.com/google/deepvariant/blob/r1.4/scripts/run_deepvariant.py#L238 which was set to "insert_size" for WGS default.