This gist is a follow up on https://gist.github.com/pichuan/64d73bc965300645470eb29a66116593 (which was trained with v1.1.0 codebase and training data)
In v1.4.0, our WGS
and WES models have a new default insert_size
channel. Based on the requests
from our users, I've trained a WGS and WES model on v1.4.0 codebase and training
data, and added the additional allele frequency channel.
For more details about this work, see Improving variant calling using population data and deep learning
The example is similar to the WGS case study.
Main differences:
- We'll be downloading the WGS AF model (and WES AF model later).
- We'll need to download the pVCF files to supply the population information.
- Flags will need to be added to enable the allele frequency channel.
I got a machine using a command like this: https://github.com/google/deepvariant/blob/r1.4/docs/deepvariant-details.md#command-for-a-cpu-only-machine-on-google-cloud-platform
Then, follow the instructions in https://github.com/google/deepvariant/blob/r1.4/docs/deepvariant-case-study.md to download the data.
And, because we are going to test the allele frequency feature, we need to
download the file used in --population_vcfs
flag, and also the trained model
files:
The full pVCFs can be found in:
$ gsutil ls -lh gs://brain-genomics-public/research/cohort/1KGP/cohort_dv_glnexus_opt/v3_missing2ref/cohort-*.release_missing2ref.vcf.gz
...
TOTAL: 24 objects, 1002670628843 bytes (933.81 GiB)
We created a set of smaller set of files that contain just the necessary allele frequency information here:
$ gsutil ls -lh gs://brain-genomics-public/research/cohort/1KGP/cohort_dv_glnexus_opt/v3_missing2ref/cohort-*.release_missing2ref.no_calls.vcf.gz
...
TOTAL: 24 objects, 2353577310 bytes (2.19 GiB)
In this tutorial, we'll just run on chr20. So we can download just chr20:
AF_DIR=https://storage.googleapis.com/brain-genomics-public/research/cohort/1KGP/cohort_dv_glnexus_opt/v3_missing2ref
curl ${AF_DIR}/cohort-chr20.release_missing2ref.no_calls.vcf.gz > input/cohort-chr20.release_missing2ref.no_calls.vcf.gz
curl ${AF_DIR}/cohort-chr20.release_missing2ref.no_calls.vcf.gz.tbi > input/cohort-chr20.release_missing2ref.no_calls.vcf.gz.tbi
Model files:
MODEL_DIR=https://storage.googleapis.com/brain-genomics-public/research/allele_frequency/pretrained_model_WGS/1.4.0
curl ${MODEL_DIR}/model.ckpt.data-00000-of-00001 > input/wgs_af.model.ckpt.data-00000-of-00001
curl ${MODEL_DIR}/model.ckpt.index > input/wgs_af.model.ckpt.index
curl ${MODEL_DIR}/model.ckpt.meta > input/wgs_af.model.ckpt.meta
curl ${MODEL_DIR}/model.ckpt.example_info.json > input/wgs_af.model.ckpt.example_info.json
And then in this step: https://github.com/google/deepvariant/blob/r1.4/docs/deepvariant-case-study.md#running-on-a-cpu-only-machine, use this command instead:
mkdir -p output
BIN_VERSION="1.4.0"
sudo docker run \
-v "${PWD}/input":"/input" \
-v "${PWD}/output":"/output" \
-v "${PWD}/reference":"/reference" \
google/deepvariant:"${BIN_VERSION}" \
/opt/deepvariant/bin/run_deepvariant \
--model_type WGS \
--ref /reference/GRCh38_no_alt_analysis_set.fasta \
--reads /input/HG003.novaseq.pcr-free.35x.dedup.grch38_no_alt.chr20.bam \
--output_vcf /output/HG003.output.vcf.gz \
--output_gvcf /output/HG003.output.g.vcf.gz \
--num_shards $(nproc) \
--regions chr20 \
--make_examples_extra_args="use_allele_frequency=true,population_vcfs=/input/cohort-chr20.release_missing2ref.no_calls.vcf.gz" \
--customized_model="/input/wgs_af.model.ckpt"
(Note: In the example above, I skipped the --intermediate_results_dir flag. You can add it as needed.)
Then, follow the rest of the instructions in https://github.com/google/deepvariant/blob/r1.4/docs/deepvariant-case-study.md to finish hap.py evaluation:
Output:
Benchmarking Summary:
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 10628 10586 42 21097 22 10036 17 4 0.996048 0.998011 0.475707 0.997029 NaN NaN 1.748961 2.326957
INDEL PASS 10628 10586 42 21097 22 10036 17 4 0.996048 0.998011 0.475707 0.997029 NaN NaN 1.748961 2.326957
SNP ALL 70166 69921 245 85470 53 15464 20 5 0.996508 0.999243 0.180929 0.997874 2.296566 2.078981 1.883951 1.949056
SNP PASS 70166 69921 245 85470 53 15464 20 5 0.996508 0.999243 0.180929 0.997874 2.296566 2.078981 1.883951 1.949056
If you want to evaluate on everything (not just chr20), you can download the BAM
file at gs://deepvariant/case-study-testdata/HG003.novaseq.pcr-free.35x.dedup.grch38_no_alt.bam{,.bai}
.
Then, when you run the command above, give it the full BAM,
remove --regions chr20
, and make sure to provide all the pVCFs for
population_vcfs
. You can see he WES example below on how to use all pVCFs.
The example is similar to the WES case study.
First, follow the case study above to download the test data needed.
For WES, we plan to run on everything (not just chr20). So, donwload all pVCFs:
gsutil -m cp gs://brain-genomics-public/research/cohort/1KGP/cohort_dv_glnexus_opt/v3_missing2ref/cohort-*.release_missing2ref.no_calls.vcf.gz{,.tbi} input/
Then, download the WES AF model:
MODEL_DIR=https://storage.googleapis.com/brain-genomics-public/research/allele_frequency/pretrained_model_WES/1.4.0
curl ${MODEL_DIR}/model.ckpt.data-00000-of-00001 > input/wes_af.model.ckpt.data-00000-of-00001
curl ${MODEL_DIR}/model.ckpt.index > input/wes_af.model.ckpt.index
curl ${MODEL_DIR}/model.ckpt.meta > input/wes_af.model.ckpt.meta
curl ${MODEL_DIR}/model.ckpt.example_info.json > input/wes_af.model.ckpt.example_info.json
Then,
mkdir -p output
BIN_VERSION="1.4.0"
sudo docker run \
-v "${PWD}/input":"/input" \
-v "${PWD}/output":"/output" \
-v "${PWD}/reference":"/reference" \
google/deepvariant:"${BIN_VERSION}" \
/opt/deepvariant/bin/run_deepvariant \
--model_type WES \
--ref /reference/GRCh38_no_alt_analysis_set.fasta \
--reads /input/HG003.novaseq.wes_idt.100x.dedup.bam \
--regions /input/idt_capture_novogene.grch38.bed \
--output_vcf /output/HG003.output.vcf.gz \
--output_gvcf /output/HG003.output.g.vcf.gz \
--num_shards $(nproc) \
--make_examples_extra_args='use_allele_frequency=true,population_vcfs="/input/cohort-chr1.release_missing2ref.no_calls.vcf.gz /input/cohort-chr10.release_missing2ref.no_calls.vcf.gz /input/cohort-chr11.release_missing2ref.no_calls.vcf.gz /input/cohort-chr12.release_missing2ref.no_calls.vcf.gz /input/cohort-chr13.release_missing2ref.no_calls.vcf.gz /input/cohort-chr14.release_missing2ref.no_calls.vcf.gz /input/cohort-chr15.release_missing2ref.no_calls.vcf.gz /input/cohort-chr16.release_missing2ref.no_calls.vcf.gz /input/cohort-chr17.release_missing2ref.no_calls.vcf.gz /input/cohort-chr18.release_missing2ref.no_calls.vcf.gz /input/cohort-chr19.release_missing2ref.no_calls.vcf.gz /input/cohort-chr2.release_missing2ref.no_calls.vcf.gz /input/cohort-chr20.release_missing2ref.no_calls.vcf.gz /input/cohort-chr21.release_missing2ref.no_calls.vcf.gz /input/cohort-chr22.release_missing2ref.no_calls.vcf.gz /input/cohort-chr3.release_missing2ref.no_calls.vcf.gz /input/cohort-chr4.release_missing2ref.no_calls.vcf.gz /input/cohort-chr5.release_missing2ref.no_calls.vcf.gz /input/cohort-chr6.release_missing2ref.no_calls.vcf.gz /input/cohort-chr7.release_missing2ref.no_calls.vcf.gz /input/cohort-chr8.release_missing2ref.no_calls.vcf.gz /input/cohort-chr9.release_missing2ref.no_calls.vcf.gz /input/cohort-chrX.release_missing2ref.no_calls.vcf.gz /input/cohort-chrY.release_missing2ref.no_calls.vcf.gz"' \
--customized_model="/input/wes_af.model.ckpt"
Then, follow the rest of the instructions in https://github.com/google/deepvariant/blob/r1.4/docs/deepvariant-exome-case-study.md to finish hap.py evaluation:
Output:
Benchmarking Summary:
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 1051 1027 24 1458 7 404 5 1 0.977165 0.993359 0.277092 0.985195 NaN NaN 1.747283 1.890909
INDEL PASS 1051 1027 24 1458 7 404 5 1 0.977165 0.993359 0.277092 0.985195 NaN NaN 1.747283 1.890909
SNP ALL 25279 24980 299 27665 41 2642 33 2 0.988172 0.998362 0.095500 0.993241 2.854703 2.757366 1.623027 1.645646
SNP PASS 25279 24980 299 27665 41 2642 33 2 0.988172 0.998362 0.095500 0.993241 2.854703 2.757366 1.623027 1.645646