Using --variant_caller=vcf_candidate_importer
in make_examples, we can use DeepVariant to act like a genotyper to force call candidates in the VCF file specified in --proposed_variants
.
Before reading this guide, read the WES Case Study and metrics:
- https://github.com/google/deepvariant/blob/r1.1/docs/deepvariant-exome-case-study.md
- https://github.com/google/deepvariant/blob/r1.1/docs/metrics.md
The instructions below will be similar to: https://raw.githubusercontent.com/google/deepvariant/r1.1/scripts/run_wes_case_study_doc ker.sh
But we’ll provide new flags for you to run in force calling mode.
I got a GCE instance following this command: https://github.com/google/deepvariant/blob/r1.1/docs/deepvariant-details.md#command-for-a-cpu-only-machine-on-google-cloud-platform
I used ubuntu-1804-lts
instead of 1604. (Ubuntu 16.04 should work too. I just picked one for this tutorial.)
sudo apt -y update && \
sudo apt -y install docker.io && \
sudo apt -y install bedtools && \
sudo apt -y install tabix
With the files from the WES Case Study, run:
bedtools intersect -header \
-a benchmark/HG003_GRCh38_1_22_v4.2_benchmark.vcf.gz \
-b benchmark/HG003_GRCh38_1_22_v4.2_benchmark.bed \
| bgzip > input/force_calling.candidates.vcf.gz
tabix -p vcf input/force_calling.candidates.vcf.gz
Prepare a VCF file with proposed sites (that we can do force calling on):
Replace the run_deepvariant section in the WES Case Study with this:
mkdir -p output
mkdir -p output/intermediate_results_dir
BIN_VERSION="1.1.0"
MAKE_EXAMPLE_ARGS="variant_caller=vcf_candidate_importer,proposed_variants=/input/force_calling.candidates.vcf.gz,vsc_min_count_snps=0,vsc_min_count_indels=0,vsc_min_fraction_snps=0,vsc_min_fraction_indels=0"
sudo docker run \
-v "${PWD}/input":"/input" \
-v "${PWD}/output":"/output" \
-v "${PWD}/reference":"/reference" \
google/deepvariant:"${BIN_VERSION}" \
/opt/deepvariant/bin/run_deepvariant \
--model_type WES \
--ref /reference/GRCh38_no_alt_analysis_set.fasta \
--reads /input/HG003.novaseq.wes_idt.100x.dedup.bam \
--regions /input/idt_capture_novogene.grch38.bed \
--output_vcf /output/HG003.output.vcf.gz \
--output_gvcf /output/HG003.output.g.vcf.gz \
--num_shards $(nproc) \
--make_examples_extra_args="${MAKE_EXAMPLE_ARGS}" \
--intermediate_results_dir /output/intermediate_results_dir
Callset is generated at output/HG003.output.vcf.gz
.
make_examples
step took:
real 7m34.109s
user 446m27.685s
sys 9m19.618s
call_variants
step took:
real 0m59.444s
user 25m49.371s
sys 1m25.558s
postprocess_variants
step took:
real 0m44.978s
user 0m43.506s
sys 0m8.865s
Baseline (WES metrics):
Type | # TP | # FN | # FP | Recall | Precision | F1_Score |
---|---|---|---|---|---|---|
Indel | 1025 | 28 | 19 | 0.973409 | 0.982126 | 0.977748 |
SNP | 25005 | 319 | 169 | 0.987403 | 0.993287 | 0.990337 |
Force calling on NIST v4.2 truth:
Type | # TP | # FN | # FP | Recall | Precision | F1_Score |
---|---|---|---|---|---|---|
Indel | 1026 | 27 | 7 | 0.974359 | 0.993340 | 0.983758 |
SNP | 25022 | 302 | 48 | 0.988075 | 0.998085 | 0.993055 |
hap.py output:
Benchmarking Summary:
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 1053 1026 27 1051 7 0 7 0.974359 0.993340 0 0.983758 NaN NaN 1.752717 1.833333
INDEL PASS 1053 1026 27 1051 7 0 7 0.974359 0.993340 0 0.983758 NaN NaN 1.752717 1.833333
SNP ALL 25324 25022 302 25071 48 0 48 0.988075 0.998085 0 0.993055 2.856273 2.860508 1.625246 1.625773
SNP PASS 25324 25022 302 25071 48 0 48 0.988075 0.998085 0 0.993055 2.856273 2.860508 1.625246 1.625773