DeepVariant allele frequency models (WGS and WES) - trained with v1.4.0 codebase and training data.

This gist is a follow up on https://gist.github.com/pichuan/64d73bc965300645470eb29a66116593 (which was trained with v1.1.0 codebase and training data)

In v1.4.0, our WGS and WES models have a new default insert_size channel. Based on the requests from our users, I've trained a WGS and WES model on v1.4.0 codebase and training data, and added the additional allele frequency channel.

For more details about this work, see Improving variant calling using population data and deep learning

WGS AF case study on chr20

The example is similar to the WGS case study.

Main differences:

We'll be downloading the WGS AF model (and WES AF model later).
We'll need to download the pVCF files to supply the population information.
Flags will need to be added to enable the allele frequency channel.

I got a machine using a command like this: https://github.com/google/deepvariant/blob/r1.4/docs/deepvariant-details.md#command-for-a-cpu-only-machine-on-google-cloud-platform

Then, follow the instructions in https://github.com/google/deepvariant/blob/r1.4/docs/deepvariant-case-study.md to download the data.

And, because we are going to test the allele frequency feature, we need to download the file used in --population_vcfs flag, and also the trained model files:

The full pVCFs can be found in:

$ gsutil ls -lh gs://brain-genomics-public/research/cohort/1KGP/cohort_dv_glnexus_opt/v3_missing2ref/cohort-*.release_missing2ref.vcf.gz
...
TOTAL: 24 objects, 1002670628843 bytes (933.81 GiB)

We created a set of smaller set of files that contain just the necessary allele frequency information here:

$ gsutil ls -lh gs://brain-genomics-public/research/cohort/1KGP/cohort_dv_glnexus_opt/v3_missing2ref/cohort-*.release_missing2ref.no_calls.vcf.gz
...
TOTAL: 24 objects, 2353577310 bytes (2.19 GiB)

In this tutorial, we'll just run on chr20. So we can download just chr20:

AF_DIR=https://storage.googleapis.com/brain-genomics-public/research/cohort/1KGP/cohort_dv_glnexus_opt/v3_missing2ref

curl ${AF_DIR}/cohort-chr20.release_missing2ref.no_calls.vcf.gz > input/cohort-chr20.release_missing2ref.no_calls.vcf.gz
curl ${AF_DIR}/cohort-chr20.release_missing2ref.no_calls.vcf.gz.tbi > input/cohort-chr20.release_missing2ref.no_calls.vcf.gz.tbi

Model files:

MODEL_DIR=https://storage.googleapis.com/brain-genomics-public/research/allele_frequency/pretrained_model_WGS/1.4.0

curl ${MODEL_DIR}/model.ckpt.data-00000-of-00001 > input/wgs_af.model.ckpt.data-00000-of-00001
curl ${MODEL_DIR}/model.ckpt.index > input/wgs_af.model.ckpt.index
curl ${MODEL_DIR}/model.ckpt.meta > input/wgs_af.model.ckpt.meta
curl ${MODEL_DIR}/model.ckpt.example_info.json > input/wgs_af.model.ckpt.example_info.json

And then in this step: https://github.com/google/deepvariant/blob/r1.4/docs/deepvariant-case-study.md#running-on-a-cpu-only-machine, use this command instead:

mkdir -p output

BIN_VERSION="1.4.0"

sudo docker run \
  -v "${PWD}/input":"/input" \
  -v "${PWD}/output":"/output" \
  -v "${PWD}/reference":"/reference" \
  google/deepvariant:"${BIN_VERSION}" \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type WGS \
  --ref /reference/GRCh38_no_alt_analysis_set.fasta \
  --reads /input/HG003.novaseq.pcr-free.35x.dedup.grch38_no_alt.chr20.bam \
  --output_vcf /output/HG003.output.vcf.gz \
  --output_gvcf /output/HG003.output.g.vcf.gz \
  --num_shards $(nproc) \
  --regions chr20 \
  --make_examples_extra_args="use_allele_frequency=true,population_vcfs=/input/cohort-chr20.release_missing2ref.no_calls.vcf.gz" \
  --customized_model="/input/wgs_af.model.ckpt"

(Note: In the example above, I skipped the --intermediate_results_dir flag. You can add it as needed.)

Then, follow the rest of the instructions in https://github.com/google/deepvariant/blob/r1.4/docs/deepvariant-case-study.md to finish hap.py evaluation:

Output:

Benchmarking Summary:
Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL        10628     10586        42        21097        22      10036     17      4       0.996048          0.998011        0.475707         0.997029                     NaN                     NaN                   1.748961                   2.326957
INDEL   PASS        10628     10586        42        21097        22      10036     17      4       0.996048          0.998011        0.475707         0.997029                     NaN                     NaN                   1.748961                   2.326957
  SNP    ALL        70166     69921       245        85470        53      15464     20      5       0.996508          0.999243        0.180929         0.997874                2.296566                2.078981                   1.883951                   1.949056
  SNP   PASS        70166     69921       245        85470        53      15464     20      5       0.996508          0.999243        0.180929         0.997874                2.296566                2.078981                   1.883951                   1.949056

If you want to evaluate on everything (not just chr20), you can download the BAM file at gs://deepvariant/case-study-testdata/HG003.novaseq.pcr-free.35x.dedup.grch38_no_alt.bam{,.bai}.

Then, when you run the command above, give it the full BAM, remove --regions chr20, and make sure to provide all the pVCFs for population_vcfs. You can see he WES example below on how to use all pVCFs.

WES AF case study on all chromosomes

The example is similar to the WES case study.

First, follow the case study above to download the test data needed.

For WES, we plan to run on everything (not just chr20). So, donwload all pVCFs:

gsutil -m cp gs://brain-genomics-public/research/cohort/1KGP/cohort_dv_glnexus_opt/v3_missing2ref/cohort-*.release_missing2ref.no_calls.vcf.gz{,.tbi} input/

Then, download the WES AF model:

MODEL_DIR=https://storage.googleapis.com/brain-genomics-public/research/allele_frequency/pretrained_model_WES/1.4.0

curl ${MODEL_DIR}/model.ckpt.data-00000-of-00001 > input/wes_af.model.ckpt.data-00000-of-00001
curl ${MODEL_DIR}/model.ckpt.index > input/wes_af.model.ckpt.index
curl ${MODEL_DIR}/model.ckpt.meta > input/wes_af.model.ckpt.meta
curl ${MODEL_DIR}/model.ckpt.example_info.json > input/wes_af.model.ckpt.example_info.json

Then,

mkdir -p output

BIN_VERSION="1.4.0"

sudo docker run \
  -v "${PWD}/input":"/input" \
  -v "${PWD}/output":"/output" \
  -v "${PWD}/reference":"/reference" \
  google/deepvariant:"${BIN_VERSION}" \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type WES \
  --ref /reference/GRCh38_no_alt_analysis_set.fasta \
  --reads /input/HG003.novaseq.wes_idt.100x.dedup.bam \
  --regions /input/idt_capture_novogene.grch38.bed \
  --output_vcf /output/HG003.output.vcf.gz \
  --output_gvcf /output/HG003.output.g.vcf.gz \
  --num_shards $(nproc) \
  --make_examples_extra_args='use_allele_frequency=true,population_vcfs="/input/cohort-chr1.release_missing2ref.no_calls.vcf.gz /input/cohort-chr10.release_missing2ref.no_calls.vcf.gz /input/cohort-chr11.release_missing2ref.no_calls.vcf.gz /input/cohort-chr12.release_missing2ref.no_calls.vcf.gz /input/cohort-chr13.release_missing2ref.no_calls.vcf.gz /input/cohort-chr14.release_missing2ref.no_calls.vcf.gz /input/cohort-chr15.release_missing2ref.no_calls.vcf.gz /input/cohort-chr16.release_missing2ref.no_calls.vcf.gz /input/cohort-chr17.release_missing2ref.no_calls.vcf.gz /input/cohort-chr18.release_missing2ref.no_calls.vcf.gz /input/cohort-chr19.release_missing2ref.no_calls.vcf.gz /input/cohort-chr2.release_missing2ref.no_calls.vcf.gz /input/cohort-chr20.release_missing2ref.no_calls.vcf.gz /input/cohort-chr21.release_missing2ref.no_calls.vcf.gz /input/cohort-chr22.release_missing2ref.no_calls.vcf.gz /input/cohort-chr3.release_missing2ref.no_calls.vcf.gz /input/cohort-chr4.release_missing2ref.no_calls.vcf.gz /input/cohort-chr5.release_missing2ref.no_calls.vcf.gz /input/cohort-chr6.release_missing2ref.no_calls.vcf.gz /input/cohort-chr7.release_missing2ref.no_calls.vcf.gz /input/cohort-chr8.release_missing2ref.no_calls.vcf.gz /input/cohort-chr9.release_missing2ref.no_calls.vcf.gz /input/cohort-chrX.release_missing2ref.no_calls.vcf.gz /input/cohort-chrY.release_missing2ref.no_calls.vcf.gz"' \
  --customized_model="/input/wes_af.model.ckpt"

Then, follow the rest of the instructions in https://github.com/google/deepvariant/blob/r1.4/docs/deepvariant-exome-case-study.md to finish hap.py evaluation:

Output:

Benchmarking Summary:
Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL         1051      1027        24         1458         7        404      5      1       0.977165          0.993359        0.277092         0.985195                     NaN                     NaN                   1.747283                   1.890909
INDEL   PASS         1051      1027        24         1458         7        404      5      1       0.977165          0.993359        0.277092         0.985195                     NaN                     NaN                   1.747283                   1.890909
  SNP    ALL        25279     24980       299        27665        41       2642     33      2       0.988172          0.998362        0.095500         0.993241                2.854703                2.757366                   1.623027                   1.645646
  SNP   PASS        25279     24980       299        27665        41       2642     33      2       0.988172          0.998362        0.095500         0.993241                2.854703                2.757366                   1.623027                   1.645646

pichuan/deepvariant-af-v1.4.0.md

DeepVariant allele frequency models (WGS and WES) - trained with v1.4.0 codebase and training data.

WGS AF case study on chr20

WES AF case study on all chromosomes