This gist is a follow-up to https://gist.github.com/pichuan/7ad09bf1fa8f519facf6806eca835ea6 (which was trained with the v1.4.0 codebase and training data).
I've trained a WGS and a WES model on the v1.5.0 codebase and training data, with the additional allele frequency channel.
For more details about this work, see "Improving variant calling using population data and deep learning".
The example is similar to the WGS case study.
Main differences:
- We'll be downloading the WGS AF model (and WES AF model later).
- We'll need to download the pVCF files to supply the population information.
- Flags will need to be added to enable the allele frequency channel.
I got a machine using a command like this: https://github.com/google/deepvariant/blob/r1.5/docs/deepvariant-details.md#command-for-a-cpu-only-machine-on-google-cloud-platform
Then, follow the instructions in https://github.com/google/deepvariant/blob/r1.5/docs/deepvariant-case-study.md to download the data.
And, because we are going to test the allele frequency feature, we need to download the file used in the --population_vcfs flag, as well as the trained model files:
The full pVCFs can be found in:
$ gsutil ls -lh gs://brain-genomics-public/research/cohort/1KGP/cohort_dv_glnexus_opt/v3_missing2ref/cohort-*.release_missing2ref.vcf.gz
...
TOTAL: 24 objects, 1002670628843 bytes (933.81 GiB)
We created a smaller set of files that contain just the necessary allele frequency information:
$ gsutil ls -lh gs://brain-genomics-public/research/cohort/1KGP/cohort_dv_glnexus_opt/v3_missing2ref/cohort-*.release_missing2ref.no_calls.vcf.gz
...
TOTAL: 24 objects, 2353577310 bytes (2.19 GiB)
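The 24 objects correspond to one pVCF per chromosome. Assuming the cohort covers the autosomes chr1–chr22 plus chrX and chrY (consistent with the 24 objects listed above), the per-chromosome filenames can be generated like this:

```shell
# Generate the expected per-chromosome pVCF filenames.
# Assumption: one file per autosome (chr1-chr22) plus chrX and chrY,
# matching the 24 objects in the gsutil listing above.
for chr in {1..22} X Y; do
  echo "cohort-chr${chr}.release_missing2ref.no_calls.vcf.gz"
done
```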
In this tutorial, we'll run on chr20 only, so we just need to download the chr20 pVCF:
AF_DIR=https://storage.googleapis.com/brain-genomics-public/research/cohort/1KGP/cohort_dv_glnexus_opt/v3_missing2ref
curl ${AF_DIR}/cohort-chr20.release_missing2ref.no_calls.vcf.gz > input/cohort-chr20.release_missing2ref.no_calls.vcf.gz
curl ${AF_DIR}/cohort-chr20.release_missing2ref.no_calls.vcf.gz.tbi > input/cohort-chr20.release_missing2ref.no_calls.vcf.gz.tbi
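A failed curl can leave an HTML error page behind in place of the compressed VCF, so it's worth checking that the downloaded file is actually gzip/bgzip data (bgzip files carry the standard gzip magic bytes). The snippet below writes a locally created stand-in file so it runs without network access; with a real download, point it at the file under input/ instead:

```shell
# Sanity-check that a .vcf.gz is actually gzip/bgzip data by inspecting
# its first two bytes (the gzip magic, 1f 8b).
# The file below is a locally created stand-in for the real download.
mkdir -p input
printf '##fileformat=VCFv4.2\n' | gzip -c > input/cohort-chr20.release_missing2ref.no_calls.vcf.gz
magic=$(head -c 2 input/cohort-chr20.release_missing2ref.no_calls.vcf.gz | od -An -tx1 | tr -d ' \n')
[ "$magic" = "1f8b" ] && echo "gzip magic OK"
```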
Model files:
MODEL_DIR=https://storage.googleapis.com/brain-genomics-public/research/allele_frequency/pretrained_model_WGS/1.5.0
curl ${MODEL_DIR}/model.ckpt.data-00000-of-00001 > input/wgs_af.model.ckpt.data-00000-of-00001
curl ${MODEL_DIR}/model.ckpt.index > input/wgs_af.model.ckpt.index
curl ${MODEL_DIR}/model.ckpt.meta > input/wgs_af.model.ckpt.meta
curl ${MODEL_DIR}/model.ckpt.example_info.json > input/wgs_af.model.ckpt.example_info.json
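The --customized_model flag used later is given the checkpoint *prefix* (wgs_af.model.ckpt), so all four downloaded files must share that prefix. A quick check (touch simulates the downloads here so the snippet runs without network access):

```shell
# All four checkpoint files must share the wgs_af.model.ckpt prefix so
# --customized_model=/input/wgs_af.model.ckpt can find them.
# touch stands in for the real curl downloads.
mkdir -p input
touch input/wgs_af.model.ckpt.{data-00000-of-00001,index,meta,example_info.json}
ls input/wgs_af.model.ckpt.* | wc -l   # expect 4
```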
And then in this step: https://github.com/google/deepvariant/blob/r1.5/docs/deepvariant-case-study.md#running-on-a-cpu-only-machine, use this command instead:
mkdir -p output
BIN_VERSION="1.5.0"
sudo docker run \
-v "${PWD}/input":"/input" \
-v "${PWD}/output":"/output" \
-v "${PWD}/reference":"/reference" \
google/deepvariant:"${BIN_VERSION}" \
/opt/deepvariant/bin/run_deepvariant \
--model_type WGS \
--ref /reference/GRCh38_no_alt_analysis_set.fasta \
--reads /input/HG003.novaseq.pcr-free.35x.dedup.grch38_no_alt.chr20.bam \
--output_vcf /output/HG003.output.vcf.gz \
--output_gvcf /output/HG003.output.g.vcf.gz \
--num_shards $(nproc) \
--regions chr20 \
--make_examples_extra_args="use_allele_frequency=true,population_vcfs=/input/cohort-chr20.release_missing2ref.no_calls.vcf.gz" \
--customized_model="/input/wgs_af.model.ckpt"
(Note: In the example above, I skipped the --intermediate_results_dir flag. You can add it as needed.)
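--make_examples_extra_args takes a comma-separated list of flag=value pairs that run_deepvariant forwards to make_examples. Splitting the string (a sketch, using plain bash) shows the two settings the allele frequency channel needs:

```shell
# --make_examples_extra_args is a comma-separated list of flag=value pairs
# forwarded to make_examples. Split it to see the individual settings.
EXTRA_ARGS="use_allele_frequency=true,population_vcfs=/input/cohort-chr20.release_missing2ref.no_calls.vcf.gz"
IFS=',' read -ra pairs <<< "$EXTRA_ARGS"
printf '%s\n' "${pairs[@]}"
```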
Then, follow the rest of the instructions in https://github.com/google/deepvariant/blob/r1.5/docs/deepvariant-case-study.md to finish hap.py evaluation:
Output:
Benchmarking Summary:
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 10628 10592 36 21169 18 10104 13 3 0.996613 0.998373 0.477302 0.997492 NaN NaN 1.748961 2.338576
INDEL PASS 10628 10592 36 21169 18 10104 13 3 0.996613 0.998373 0.477302 0.997492 NaN NaN 1.748961 2.338576
SNP ALL 70166 69923 243 85873 49 15863 13 1 0.996537 0.999300 0.184726 0.997917 2.296566 2.074704 1.883951 1.951011
SNP PASS 70166 69923 243 85873 49 15863 13 1 0.996537 0.999300 0.184726 0.997917 2.296566 2.074704 1.883951 1.951011
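METRIC.F1_Score is the 14th whitespace-separated column of the hap.py summary table. If you want to pull it out programmatically, a small awk sketch over the SNP PASS row above:

```shell
# Extract METRIC.F1_Score (14th column) from a hap.py summary row.
# The row is copied from the benchmarking output above.
row="SNP PASS 70166 69923 243 85873 49 15863 13 1 0.996537 0.999300 0.184726 0.997917 2.296566 2.074704 1.883951 1.951011"
echo "$row" | awk '{print $1, $2, "F1 =", $14}'   # prints: SNP PASS F1 = 0.997917
```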
If you want to evaluate on everything (not just chr20), you can download the full BAM file at gs://deepvariant/case-study-testdata/HG003.novaseq.pcr-free.35x.dedup.grch38_no_alt.bam{,.bai}. Then, when you run the command above, point --reads at the full BAM, remove --regions chr20, and make sure to provide all the pVCFs to population_vcfs. See the WES example below for how to use all pVCFs.
The example is similar to the WES case study.
First, follow the WES case study to download the test data needed.
For WES, we plan to run on everything (not just chr20). So, download all pVCFs:
gcloud storage cp gs://brain-genomics-public/research/cohort/1KGP/cohort_dv_glnexus_opt/v3_missing2ref/cohort-*.release_missing2ref.no_calls.vcf.gz{,.tbi} input/
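The {,.tbi} suffix in the command above is bash brace expansion: it duplicates the pattern once with no suffix and once with .tbi, so the VCFs and their indexes are copied in a single command. A minimal demonstration of the expansion:

```shell
# {,.tbi} expands the pattern twice: once bare, once with the .tbi suffix,
# so each pVCF and its index are named by one expression.
printf '%s\n' cohort-chr20.release_missing2ref.no_calls.vcf.gz{,.tbi}
```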
Then, download the WES AF model:
MODEL_DIR=https://storage.googleapis.com/brain-genomics-public/research/allele_frequency/pretrained_model_WES/1.5.0
curl ${MODEL_DIR}/model.ckpt.data-00000-of-00001 > input/wes_af.model.ckpt.data-00000-of-00001
curl ${MODEL_DIR}/model.ckpt.index > input/wes_af.model.ckpt.index
curl ${MODEL_DIR}/model.ckpt.meta > input/wes_af.model.ckpt.meta
curl ${MODEL_DIR}/model.ckpt.example_info.json > input/wes_af.model.ckpt.example_info.json
Then, run DeepVariant with the WES AF model:
mkdir -p output
BIN_VERSION="1.5.0"
sudo docker run \
-v "${PWD}/input":"/input" \
-v "${PWD}/output":"/output" \
-v "${PWD}/reference":"/reference" \
google/deepvariant:"${BIN_VERSION}" \
/opt/deepvariant/bin/run_deepvariant \
--model_type WES \
--ref /reference/GRCh38_no_alt_analysis_set.fasta \
--reads /input/HG003.novaseq.wes_idt.100x.dedup.bam \
--regions /input/idt_capture_novogene.grch38.bed \
--output_vcf /output/HG003.output.vcf.gz \
--output_gvcf /output/HG003.output.g.vcf.gz \
--num_shards $(nproc) \
--make_examples_extra_args='use_allele_frequency=true,population_vcfs="/input/cohort-chr*.release_missing2ref.no_calls.vcf.gz"' \
--customized_model="/input/wes_af.model.ckpt"
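Note the single quotes around the population_vcfs glob in the command above: they keep the local shell from expanding cohort-chr*, so the literal pattern reaches make_examples inside the container, where the files actually live. A demonstration of the difference, using locally created stand-in files in a scratch directory:

```shell
# Unquoted, the shell expands the glob against local files; quoted, the
# pattern is passed through literally. The demo/ files are stand-ins
# created just for this illustration.
mkdir -p demo
touch demo/cohort-chr{20,21}.release_missing2ref.no_calls.vcf.gz
echo demo/cohort-chr*.release_missing2ref.no_calls.vcf.gz | wc -w    # shell expands: 2 words
echo 'demo/cohort-chr*.release_missing2ref.no_calls.vcf.gz' | wc -w  # literal pattern: 1 word
```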
Then, follow the rest of the instructions in https://github.com/google/deepvariant/blob/r1.5/docs/deepvariant-exome-case-study.md to finish hap.py evaluation:
Output:
Benchmarking Summary:
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 1051 1023 28 1483 13 428 9 3 0.973359 0.987678 0.288604 0.980466 NaN NaN 1.747283 1.888889
INDEL PASS 1051 1023 28 1483 13 428 9 3 0.973359 0.987678 0.288604 0.980466 NaN NaN 1.747283 1.888889
SNP ALL 25279 25001 278 27797 38 2756 22 2 0.989003 0.998482 0.099147 0.993720 2.854703 2.750337 1.623027 1.657673
SNP PASS 25279 25001 278 27797 38 2756 22 2 0.989003 0.998482 0.099147 0.993720 2.854703 2.750337 1.623027 1.657673