This gist is a follow-up to https://gist.github.com/pichuan/7ad09bf1fa8f519facf6806eca835ea6 (which was trained with the v1.4.0 codebase and training data).
I've trained a WGS and a WES model on the v1.5.0 codebase and training data, with the additional allele frequency channel.
For more details about this work, see "Improving variant calling using population data and deep learning".
The example is similar to the WGS case study.
Main differences:
- We'll be downloading the WGS AF model (and WES AF model later).
- We'll need to download the pVCF files to supply the population information.
- Flags will need to be added to enable the allele frequency channel.
I got a machine using a command like this: https://github.com/google/deepvariant/blob/r1.5/docs/deepvariant-details.md#command-for-a-cpu-only-machine-on-google-cloud-platform
Then, follow the instructions in https://github.com/google/deepvariant/blob/r1.5/docs/deepvariant-case-study.md to download the data.
And, because we are going to test the allele frequency feature, we need to download the file used in the --population_vcfs flag, as well as the trained model files:
The full pVCFs can be found in:
$ gsutil ls -lh gs://brain-genomics-public/research/cohort/1KGP/cohort_dv_glnexus_opt/v3_missing2ref/cohort-*.release_missing2ref.vcf.gz
...
TOTAL: 24 objects, 1002670628843 bytes (933.81 GiB)
We created a smaller set of files that contain just the necessary allele frequency information:
$ gsutil ls -lh gs://brain-genomics-public/research/cohort/1KGP/cohort_dv_glnexus_opt/v3_missing2ref/cohort-*.release_missing2ref.no_calls.vcf.gz
...
TOTAL: 24 objects, 2353577310 bytes (2.19 GiB)
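The 24 objects correspond to one pVCF per chromosome. Assuming the cohort covers the autosomes chr1–chr22 plus chrX and chrY (consistent with the 24 objects listed above), the per-chromosome filenames can be generated like this:

```shell
# Generate the expected per-chromosome pVCF filenames.
# Assumption: one file per autosome (chr1-chr22) plus chrX and chrY,
# matching the 24 objects in the gsutil listing above.
for chr in {1..22} X Y; do
  echo "cohort-chr${chr}.release_missing2ref.no_calls.vcf.gz"
done
```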
In this tutorial, we'll run on chr20 only, so we just need to download the chr20 pVCF:
AF_DIR=https://storage.googleapis.com/brain-genomics-public/research/cohort/1KGP/cohort_dv_glnexus_opt/v3_missing2ref
curl ${AF_DIR}/cohort-chr20.release_missing2ref.no_calls.vcf.gz > input/cohort-chr20.release_missing2ref.no_calls.vcf.gz
curl ${AF_DIR}/cohort-chr20.release_missing2ref.no_calls.vcf.gz.tbi > input/cohort-chr20.release_missing2ref.no_calls.vcf.gz.tbi
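A failed curl can leave an HTML error page behind in place of the compressed VCF, so it's worth checking that the downloaded file is actually gzip/bgzip data (bgzip files carry the standard gzip magic bytes). The snippet below writes a locally created stand-in file so it runs without network access; with a real download, point it at the file under input/ instead:

```shell
# Sanity-check that a .vcf.gz is actually gzip/bgzip data by inspecting
# its first two bytes (the gzip magic, 1f 8b).
# The file below is a locally created stand-in for the real download.
mkdir -p input
printf '##fileformat=VCFv4.2\n' | gzip -c > input/cohort-chr20.release_missing2ref.no_calls.vcf.gz
magic=$(head -c 2 input/cohort-chr20.release_missing2ref.no_calls.vcf.gz | od -An -tx1 | tr -d ' \n')
[ "$magic" = "1f8b" ] && echo "gzip magic OK"
```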
Model files:
MODEL_DIR=https://storage.googleapis.com/brain-genomics-public/research/allele_frequency/pretrained_model_WGS/1.5.0
curl ${MODEL_DIR}/model.ckpt.data-00000-of-00001 > input/wgs_af.model.ckpt.data-00000-of-00001
curl ${MODEL_DIR}/model.ckpt.index > input/wgs_af.model.ckpt.index
curl ${MODEL_DIR}/model.ckpt.meta > input/wgs_af.model.ckpt.meta
curl ${MODEL_DIR}/model.ckpt.example_info.json > input/wgs_af.model.ckpt.example_info.json
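The --customized_model flag used later is given the checkpoint *prefix* (wgs_af.model.ckpt), so all four downloaded files must share that prefix. A quick check (touch simulates the downloads here so the snippet runs without network access):

```shell
# All four checkpoint files must share the wgs_af.model.ckpt prefix so
# --customized_model=/input/wgs_af.model.ckpt can find them.
# touch stands in for the real curl downloads.
mkdir -p input
touch input/wgs_af.model.ckpt.{data-00000-of-00001,index,meta,example_info.json}
ls input/wgs_af.model.ckpt.* | wc -l   # expect 4
```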
And then in this step: https://github.com/google/deepvariant/blob/r1.5/docs/deepvariant-case-study.md#running-on-a-cpu-only-machine, use this command instead:
mkdir -p output
BIN_VERSION="1.5.0"
sudo docker run \
-v "${PWD}/input":"/input" \
-v "${PWD}/output":"/output" \
-v "${PWD}/reference":"/reference" \
google/deepvariant:"${BIN_VERSION}" \
/opt/deepvariant/bin/run_deepvariant \
--model_type WGS \
--ref /reference/GRCh38_no_alt_analysis_set.fasta \
--reads /input/HG003.novaseq.pcr-free.35x.dedup.grch38_no_alt.chr20.bam \
--output_vcf /output/HG003.output.vcf.gz \
--output_gvcf /output/HG003.output.g.vcf.gz \
--num_shards $(nproc) \
--regions chr20 \
--make_examples_extra_args="use_allele_frequency=true,population_vcfs=/input/cohort-chr20.release_missing2ref.no_calls.vcf.gz" \
--customized_model="/input/wgs_af.model.ckpt"
(Note: In the example above, I skipped the --intermediate_results_dir flag. You can add it as needed.)
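--make_examples_extra_args takes a comma-separated list of flag=value pairs that run_deepvariant forwards to make_examples. Splitting the string (a sketch, using plain bash) shows the two settings the allele frequency channel needs:

```shell
# --make_examples_extra_args is a comma-separated list of flag=value pairs
# forwarded to make_examples. Split it to see the individual settings.
EXTRA_ARGS="use_allele_frequency=true,population_vcfs=/input/cohort-chr20.release_missing2ref.no_calls.vcf.gz"
IFS=',' read -ra pairs <<< "$EXTRA_ARGS"
printf '%s\n' "${pairs[@]}"
```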
Then, follow the rest of the instructions in https://github.com/google/deepvariant/blob/r1.5/docs/deepvariant-case-study.md to finish hap.py evaluation:
Output:
Benchmarking Summary:
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 10628 10592 36 21169 18 10104 13 3 0.996613 0.998373 0.477302 0.997492 NaN NaN 1.748961 2.338576
INDEL PASS 10628 10592 36 21169 18 10104 13 3 0.996613 0.998373 0.477302 0.997492 NaN NaN 1.748961 2.338576
SNP ALL 70166 69923 243 85873 49 15863 13 1 0.996537 0.999300 0.184726 0.997917 2.296566 2.074704 1.883951 1.951011
SNP PASS 70166 69923 243 85873 49 15863 13 1 0.996537 0.999300 0.184726 0.997917 2.296566 2.074704 1.883951 1.951011
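METRIC.F1_Score is the 14th whitespace-separated column of the hap.py summary table. If you want to pull it out programmatically, a small awk sketch over the SNP PASS row above:

```shell
# Extract METRIC.F1_Score (14th column) from a hap.py summary row.
# The row is copied from the benchmarking output above.
row="SNP PASS 70166 69923 243 85873 49 15863 13 1 0.996537 0.999300 0.184726 0.997917 2.296566 2.074704 1.883951 1.951011"
echo "$row" | awk '{print $1, $2, "F1 =", $14}'   # prints: SNP PASS F1 = 0.997917
```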
If you want to evaluate on everything (not just chr20), you can download the full BAM file at gs://deepvariant/case-study-testdata/HG003.novaseq.pcr-free.35x.dedup.grch38_no_alt.bam{,.bai}. Then, when you run the command above, point --reads at the full BAM, remove --regions chr20, and make sure to provide all the pVCFs to population_vcfs. See the WES example below for how to use all pVCFs.
The example is similar to the WES case study.
First, follow the WES case study to download the test data needed.
For WES, we plan to run on everything (not just chr20). So, download all pVCFs:
gcloud storage cp gs://brain-genomics-public/research/cohort/1KGP/cohort_dv_glnexus_opt/v3_missing2ref/cohort-*.release_missing2ref.no_calls.vcf.gz{,.tbi} input/
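The {,.tbi} suffix in the command above is bash brace expansion: it duplicates the pattern once with no suffix and once with .tbi, so the VCFs and their indexes are copied in a single command. A minimal demonstration of the expansion:

```shell
# {,.tbi} expands the pattern twice: once bare, once with the .tbi suffix,
# so each pVCF and its index are named by one expression.
printf '%s\n' cohort-chr20.release_missing2ref.no_calls.vcf.gz{,.tbi}
```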
Then, download the WES AF model:
MODEL_DIR=https://storage.googleapis.com/brain-genomics-public/research/allele_frequency/pretrained_model_WES/1.5.0
curl ${MODEL_DIR}/model.ckpt.data-00000-of-00001 > input/wes_af.model.ckpt.data-00000-of-00001
curl ${MODEL_DIR}/model.ckpt.index > input/wes_af.model.ckpt.index
curl ${MODEL_DIR}/model.ckpt.meta > input/wes_af.model.ckpt.meta
curl ${MODEL_DIR}/model.ckpt.example_info.json > input/wes_af.model.ckpt.example_info.json
Then, run DeepVariant with the WES AF model:
mkdir -p output
BIN_VERSION="1.5.0"
sudo docker run \
-v "${PWD}/input":"/input" \
-v "${PWD}/output":"/output" \
-v "${PWD}/reference":"/reference" \
google/deepvariant:"${BIN_VERSION}" \
/opt/deepvariant/bin/run_deepvariant \
--model_type WES \
--ref /reference/GRCh38_no_alt_analysis_set.fasta \
--reads /input/HG003.novaseq.wes_idt.100x.dedup.bam \
--regions /input/idt_capture_novogene.grch38.bed \
--output_vcf /output/HG003.output.vcf.gz \
--output_gvcf /output/HG003.output.g.vcf.gz \
--num_shards $(nproc) \
--make_examples_extra_args='use_allele_frequency=true,population_vcfs="/input/cohort-chr*.release_missing2ref.no_calls.vcf.gz"' \
--customized_model="/input/wes_af.model.ckpt"
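Note the single quotes around the population_vcfs glob in the command above: they keep the local shell from expanding cohort-chr*, so the literal pattern reaches make_examples inside the container, where the files actually live. A demonstration of the difference, using locally created stand-in files in a scratch directory:

```shell
# Unquoted, the shell expands the glob against local files; quoted, the
# pattern is passed through literally. The demo/ files are stand-ins
# created just for this illustration.
mkdir -p demo
touch demo/cohort-chr{20,21}.release_missing2ref.no_calls.vcf.gz
echo demo/cohort-chr*.release_missing2ref.no_calls.vcf.gz | wc -w    # shell expands: 2 words
echo 'demo/cohort-chr*.release_missing2ref.no_calls.vcf.gz' | wc -w  # literal pattern: 1 word
```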
Then, follow the rest of the instructions in https://github.com/google/deepvariant/blob/r1.5/docs/deepvariant-exome-case-study.md to finish hap.py evaluation:
Output:
Benchmarking Summary:
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 1051 1023 28 1483 13 428 9 3 0.973359 0.987678 0.288604 0.980466 NaN NaN 1.747283 1.888889
INDEL PASS 1051 1023 28 1483 13 428 9 3 0.973359 0.987678 0.288604 0.980466 NaN NaN 1.747283 1.888889
SNP ALL 25279 25001 278 27797 38 2756 22 2 0.989003 0.998482 0.099147 0.993720 2.854703 2.750337 1.623027 1.657673
SNP PASS 25279 25001 278 27797 38 2756 22 2 0.989003 0.998482 0.099147 0.993720 2.854703 2.750337 1.623027 1.657673