Introduction


This tutorial provides a step-by-step guide for using DRAGMAP (DRAGEN open-source mapper) for read alignment and GATK (Genome Analysis Toolkit) for variant calling from whole genome sequencing (WGS) and whole exome sequencing (WES) data. It covers both somatic and germline variant calling workflows with command examples, filtering strategies, annotation using AI-based tools, and prioritization methods.

What is DRAGMAP?

DRAGMAP is the open-source software implementation of Illumina’s DRAGEN (Dynamic Read Analysis for GENomics) mapper/aligner. It provides fast and accurate alignment of sequencing reads to reference genomes, with support for:

  • Hash table-based alignment: Efficient mapping using pre-built hash tables
  • Decoy sequence support: Reduces false positives from misaligned reads
  • ALT haplotype masking: Minimizes ambiguous mapping in complex genomic regions
  • Pangenome references: Graph-augmented references for improved mapping in difficult regions (DRAGEN v4.3+)
  • Hardware-agnostic: Runs on standard CPU hardware (no DRAGEN FPGA board required)

DRAGMAP is particularly well-suited for:

  • Large-scale sequencing projects requiring fast alignment
  • Somatic variant calling where mapping accuracy is critical
  • Germline cohort studies (can use pangenome references for improved accuracy)
  • Projects requiring DRAGEN-compatible alignment outputs

What is GATK?

GATK (Genome Analysis Toolkit) is a comprehensive suite of tools for variant discovery and genotyping developed by the Broad Institute. Key features include:

  • HaplotypeCaller: Industry-standard tool for germline variant calling
  • Mutect2: Specialized tool for somatic variant calling
  • DRAGEN-GATK features: Software implementation of DRAGEN algorithms for standard hardware since GATK 4.2
  • Variant Quality Score Recalibration (VQSR): Machine learning-based variant filtering
  • Best Practices workflows: Well-established pipelines for variant analysis

When to Use DRAGMAP vs BWA-MEM

Use DRAGMAP when:

  • You need DRAGEN-compatible alignment outputs
  • You’re working with large-scale projects and need fast alignment
  • You want to leverage decoy sequences and ALT masking for improved accuracy
  • You’re performing somatic variant calling where mapping specificity is critical

Use BWA-MEM when:

  • You need a widely-adopted, well-documented aligner
  • You’re working with standard research workflows
  • You prefer a simpler installation and setup process
  • You’re aligning to non-standard reference genomes

When to Use GATK for Somatic vs Germline Variant Calling

For Germline Variant Calling:

  • HaplotypeCaller: Use for single-sample or cohort analysis
  • GVCF workflow: Recommended for cohort studies (more efficient than per-sample VCF)
  • GenotypeGVCFs: Joint genotyping across multiple samples

For Somatic Variant Calling:

  • Mutect2: Primary tool for tumor-normal pairs or tumor-only samples
  • Panel of Normals (PoN): Essential for tumor-only mode
  • FilterMutectCalls: Somatic-specific filtering

Prerequisites and Software Installation

Before starting, ensure you have the following installed:

Required Software:

  • DRAGMAP: Available via Bioconda (conda install dragmap) or build from source
  • GATK: Download from GATK website (version 4.2+ recommended)
  • samtools: For BAM file manipulation (conda install samtools)
  • bcftools: For VCF manipulation (conda install bcftools)
  • VEP (Variant Effect Predictor): For variant annotation (conda install ensembl-vep)
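
All of these tools are available from Bioconda, so they can be installed into a single environment. A minimal sketch (package names as published on the bioconda channel; pin versions as needed for reproducibility):

# Create one environment containing the full toolchain
conda create -n dragmap-gatk -c conda-forge -c bioconda \
  dragmap gatk4 samtools bcftools ensembl-vep

# Activate it before running the commands in this tutorial
conda activate dragmap-gatk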

Reference Genome:

  • GRCh38/hg38: Recommended standard reference genome
  • Decoy sequences: Include hs38d1 or similar decoy sequences
  • ALT masking: Use alt-masked reference for optimal results

System Requirements:

  • Memory: Minimum 32GB RAM (64GB+ recommended for WGS)
  • CPU: Multi-core processor (8+ cores recommended)
  • Storage: Sufficient space for reference genome, hash tables, and intermediate files

Reference Genome Considerations

Why Use Decoyed Reference? Decoy sequences (e.g., hs38d1) help “soak up” reads from repetitive or unplaced regions that would otherwise mis-map to the primary assembly. This reduces false positive variant calls, especially important for:

  • Somatic variant calling (low VAF variants)
  • Structural variant detection
  • Regions with high sequence similarity

Why Use ALT-Masked Reference? ALT haplotype assemblies in GRCh38 represent alternate sequences that differ from the primary assembly. Masking regions of ALT contigs that are highly similar to the primary assembly:

  • Reduces ambiguous mapping
  • Improves mapping quality scores
  • Prevents false variant calls from mapping artifacts
  • Critical for accurate variant calling in complex genomic regions

Reference Consistency: It is critical to use the same reference genome version (decoyed + alt-masked) throughout the entire pipeline:

  • Alignment (DRAGMAP)
  • Variant calling (GATK)
  • Annotation (VEP)
  • Filtering and prioritization

Mixing reference versions can lead to coordinate mismatches, dropped variants, or incorrect annotations.
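
A quick way to catch a reference mismatch early is to compare contig names and lengths between the reference dictionary and a BAM header. A minimal sketch using samtools and standard shell tools (file names are illustrative):

# Contigs (name + length) from the reference dictionary
grep "^@SQ" reference.dict | cut -f2,3 | sort > ref_contigs.txt

# Contigs from an aligned BAM header
samtools view -H sample.sorted.bam | grep "^@SQ" | cut -f2,3 | sort > bam_contigs.txt

# Any output from diff indicates a mismatch that must be resolved before variant calling
diff ref_contigs.txt bam_contigs.txt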

Workflow Overview


The following diagram illustrates the workflow from raw FASTQ files to prioritized variants:

graph TB
    Start["Raw FASTQ Files"] --> Prep["Stage 1: Reference Preparation<br/>Decoyed + Alt-masked Reference<br/>DRAGMAP Hash Table"]
    Prep --> Align["Stage 2: DRAGMAP Alignment<br/>Single/Paired-end Reads<br/>WGS/WES"]
    Align --> Process["Stage 3: Post-Alignment Processing<br/>MarkDuplicates, BQSR"]
    Process --> QC["Stage 4: Quality Control<br/>Coverage Metrics, Alignment QC"]
    QC --> SNVIndel["Stage 5: SNV/Indel Calling<br/>HaplotypeCaller (Germline)<br/>Mutect2 (Somatic)"]
    QC --> SVCNV["Stage 6: SV & CNV Calling<br/>Structural Variants<br/>Copy Number Variants"]
    SNVIndel --> Filter["Stage 7: Variant Filtering<br/>VQSR, FilterMutectCalls<br/>SV/CNV Quality Filters"]
    SVCNV --> Filter
    Filter --> Annotate["Stage 8: Variant Annotation<br/>VEP, AI Tools, ENCODE cCRE<br/>SV/CNV Annotation"]
    Annotate --> Prioritize["Stage 9: Variant Prioritization<br/>Pathway Enrichment, AI Scores<br/>Regulatory Annotation"]
    Prioritize --> End["Prioritized Variants Report"]
    
    style Start fill:#0066cc,stroke:#003366,stroke-width:3px,color:#ffffff
    style End fill:#006600,stroke:#003300,stroke-width:3px,color:#ffffff
    style Prep fill:#ffcc00,stroke:#cc9900,stroke-width:2px,color:#000000
    style QC fill:#ffcc00,stroke:#cc9900,stroke-width:2px,color:#000000
    style SNVIndel fill:#0099cc,stroke:#006699,stroke-width:2px,color:#ffffff
    style SVCNV fill:#cc0066,stroke:#990033,stroke-width:2px,color:#ffffff
    style Filter fill:#ff9900,stroke:#cc6600,stroke-width:2px,color:#000000
    style Annotate fill:#9966cc,stroke:#663399,stroke-width:2px,color:#ffffff
    style Prioritize fill:#9966cc,stroke:#663399,stroke-width:2px,color:#ffffff

Key Workflow Components

  1. Reference Preparation: Build decoyed and alt-masked reference, create DRAGMAP hash table
  2. Alignment: Map reads using DRAGMAP with decoy and alt masking support
  3. Post-Alignment Processing: Mark duplicates, perform base quality score recalibration
  4. Quality Control: Assess alignment quality and coverage metrics
  5. SNV/Indel Variant Calling:
    • Germline: HaplotypeCaller with DRAGEN features
    • Somatic: Mutect2 with DRAGEN features
  6. SV and CNV Calling: Structural variant and copy number variant detection using GATK tools
  7. Variant Filtering: Quality-based filtering for SNVs/indels, SVs, and CNVs
    • Germline: VQSR for SNVs/indels
    • Somatic: FilterMutectCalls for SNVs/indels
    • SV/CNV: Quality and size-based filtering
  8. Variant Annotation:
    • Coding variants: VEP with AI tools (AlphaMissense, AlphaFold, FoldX) and SNPeffect 5.0
    • Non-coding variants: ENCODE cCRE annotation
    • SV annotation: AnnotSV, VEP with SV support
    • CNV annotation: Gene overlap and functional impact
  9. Variant Prioritization: Pathway enrichment, AI-based scoring, regulatory impact assessment

WGS vs WES Considerations

| Aspect | WGS | WES |
| --- | --- | --- |
| Target BED file | Not required | Required for QC and optional variant calling restriction |
| Coverage metrics | CollectWgsMetrics | CollectHsMetrics with BED |
| Mean coverage | 30-60x (germline), 60-100x (somatic) | 100-150x in targets |
| On-target % | N/A | ≥80% recommended |
| Variant calling scope | Genome-wide | Optionally restrict to target regions |
| Non-coding variants | Comprehensive detection | Limited to exonic and nearby regions |
| Reference preparation | Full genome with decoys | Same, but target BED needed for QC |

Stage 1: Reference Genome Preparation


Reference Genome Selection

Recommended: GRCh38/hg38 with decoy sequences and ALT masking

Reference options:

  1. Linear reference with decoys and ALT masking (standard, recommended for most workflows)
  2. Pangenome reference (recommended for large-scale germline cohort studies, see Pangenome Reference Support section below)

Why GRCh38/hg38?

  • Current standard reference genome
  • Best annotation support
  • Compatible with all major databases (gnomAD, ClinVar, COSMIC)
  • Well-supported by DRAGMAP and GATK
  • Prebuilt pangenome references available

Downloading Reference Genome

Option 1: From GATK Resource Bundle (Recommended)

The GATK resource bundle provides pre-formatted reference genomes with decoys. Download from the official GATK resource bundle:

# Download GATK resource bundle using gsutil (Google Cloud SDK)
# First, install gsutil if not already installed
# Then download the reference files:

gsutil -m cp -r gs://genomics-public-data/resources/broad/hg38/v0/ .

# Or download specific files:
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta .
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta.fai .
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dict .

# Alternative: Use wget with the public URL (if accessible)
# Check GATK documentation for current public URLs:
# https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle

Option 2: From NCBI (Build Your Own with Decoys)

# Download primary assembly from NCBI
# https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.39

# Download decoy sequences (hs38d1)
# Combine primary assembly with decoy sequences to create hs38d1 reference

Option 3: From GENCODE or Ensembl

# GENCODE: https://www.gencodegenes.org/human/
# Ensembl: https://www.ensembl.org/Homo_sapiens/Info/Index

# Download GRCh38 primary assembly and add decoy sequences manually

Note: Always verify the reference genome version and ensure it matches your annotation databases. Check the GATK resource bundle documentation for the most current download instructions.
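
If you assemble the FASTA yourself (Options 2 and 3), the index and sequence dictionary required by GATK and samtools can be generated as follows (a sketch; the dictionary step uses GATK's bundled Picard tool):

# FASTA index (.fai)
samtools faidx Homo_sapiens_assembly38.fasta

# Sequence dictionary (.dict), required by GATK tools
gatk CreateSequenceDictionary -R Homo_sapiens_assembly38.fasta

# Index any bgzipped known-sites VCFs (.tbi)
tabix -p vcf Mills_and_1000G_gold_standard.indels.hg38.vcf.gz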

Obtaining ALT Mask BED File

ALT mask BED files specify regions in ALT contigs that should be masked (replaced with Ns) to reduce ambiguous mapping. DRAGMAP can auto-detect and apply default masks for standard references, or you can provide a custom mask file.

Sources for ALT mask BED files:

  • DRAGMAP repository: Check the fasta_mask directory in the DRAGMAP GitHub repository
  • Illumina documentation: DRAGEN reference preparation guides
  • GATK resource bundle: May include alt mask files

Example locations:

# Check DRAGMAP repository for alt mask files
# https://github.com/Illumina/DRAGMAP/tree/master/fasta_mask

# Example: hg38 alt mask file
# fasta_mask/hg38_alt_mask.bed

Building DRAGMAP Hash Table

The hash table is a pre-computed index that DRAGMAP uses for fast alignment. It should be built with the same reference genome (including decoys and alt masking) that you’ll use for alignment.

Basic hash table building:

# Basic hash table building (DRAGMAP will auto-detect decoys if present)
dragen-os --build-hash-table true \
  --ht-reference reference.fasta \
  --output-directory /path/to/reference/ \
  --output-file-prefix dragmap.hg38

Hash table with explicit decoy sequences:

# If decoys are in separate file, specify explicitly
dragen-os --build-hash-table true \
  --ht-reference reference.fasta \
  --ht-decoys decoys.fasta \
  --output-directory /path/to/reference/ \
  --output-file-prefix dragmap.hg38_with_decoys

Hash table with ALT masking:

# Build hash table with ALT mask BED file
dragen-os --build-hash-table true \
  --ht-reference hg38.fa \
  --output-directory /path/to/reference/ \
  --output-file-prefix dragmap.hg38_alt_masked \
  --ht-mask-bed fasta_mask/hg38_alt_mask.bed

Hash table with both decoys and ALT masking:

# Combined: decoys + ALT masking (recommended)
dragen-os --build-hash-table true \
  --ht-reference hg38.fa \
  --ht-decoys hs38d1.fa \
  --output-directory /path/to/reference/ \
  --output-file-prefix dragmap.hg38_decoys_alt_masked \
  --ht-mask-bed fasta_mask/hg38_alt_mask.bed

Automatic detection: For standard references (hg38, hg19), DRAGMAP can automatically detect and apply default decoy sequences and ALT masks if they’re not explicitly specified. However, explicitly providing them ensures consistency.

Pangenome Reference Support

DRAGMAP supports pangenome references! DRAGMAP can use graph-augmented pangenome references that incorporate alternate variant paths from population cohorts, improving mapping accuracy especially in difficult-to-map regions.

What is a Pangenome Reference?

A pangenome reference in DRAGMAP context refers to:

  • A linear reference scaffold (e.g., GRCh38) augmented with alternate variant paths from population VCFs
  • Internally uses a graph structure for multigenome mapping
  • Output remains in standard linear reference coordinates (BAM/VCF compatible with standard tools)
  • DRAGEN v4.3+ includes a prebuilt pangenome based on 128 human assemblies from 26 ancestries

Benefits of Pangenome References

Recommended for:

  • Human germline analyses: Improved mapping in highly polymorphic regions
  • Difficult-to-map regions: Better read placement in structurally complex regions
  • Reducing reference bias: More accurate variant calling across diverse ancestries
  • High coverage in ancestral diversity: Better representation of population variation

Less recommended for:

  • Somatic variant calling: Linear references are typically preferred for somatic workflows
  • Non-human organisms: Prebuilt pangenomes are primarily for human; custom pangenomes can be built

Using Prebuilt Pangenome References

DRAGEN/DRAGMAP provides prebuilt pangenome hash tables for major human assemblies:

  • hg19
  • hs37d5
  • hg38
  • T2T-CHM13v2.0

Download prebuilt pangenome references:

# Check Illumina/DRAGEN documentation for prebuilt pangenome hash tables
# These are large files and require significant storage space
# Prebuilt pangenomes are optimized for human germline analysis

Building Custom Pangenome References

You can build custom pangenome references using population VCFs:

# Build pangenome hash table with multi-sample VCF (msVCF)
dragen-os --build-hash-table true \
  --ht-reference reference.fasta \
  --ht-graph-msvcf-file population_variants.msvcf.gz \
  --ht-graph-extra-kmer-bed graph_regions.bed \
  --ht-mask-bed mask.bed \
  --output-directory /path/to/pangenome_reference/ \
  --output-file-prefix dragmap.pangenome.hg38

Requirements for custom pangenome:

  • Multi-sample VCF (msVCF): Population variants from cohort analysis
  • Phasing information: Phased variants improve pangenome quality (long-read assemblies help)
  • BED files: Optional region inclusion/exclusion or masking
  • Computational resources: Building pangenomes with many samples (>256) requires significant memory and time

Using Pangenome References for Alignment

Once you have a pangenome hash table, use it like a standard reference:

# Alignment with pangenome reference
dragen-os -r /path/to/pangenome_reference/ \
  -1 sample_R1.fastq.gz \
  -2 sample_R2.fastq.gz \
  --output-directory /path/to/output/ \
  --output-file-prefix sample

# DRAGMAP automatically uses multigenome mapper when pangenome hash table is detected
# Output BAM files are in standard linear reference coordinates

Considerations and Limitations

When to use pangenome references:

  • Large-scale germline cohort studies
  • Studies with diverse ancestral backgrounds
  • Projects focusing on difficult-to-map regions
  • When reference bias is a concern

When to use linear references:

  • Somatic variant calling workflows
  • Standard germline analysis (linear references work well)
  • Non-human organisms (unless custom pangenome built)
  • When computational resources are limited

Important notes:

  • Pangenome references are primarily validated for germline workflows
  • Some DRAGEN components may not have fully validated accuracy with pangenomes for somatic workflows
  • Check DRAGEN documentation for component-specific pangenome support
  • Output coordinates remain in linear reference space, ensuring compatibility with standard tools

Preparing Known Sites for BQSR

Base Quality Score Recalibration (BQSR) requires known variant sites to model systematic errors. Download these resources from the GATK resource bundle:

# Create directory for known sites
mkdir -p /path/to/known_sites
cd /path/to/known_sites

# Download using gsutil (Google Cloud SDK)
# Install gsutil if needed: https://cloud.google.com/storage/docs/gsutil_install

# Download dbSNP (common variants)
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf .
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.idx .

# Download Mills and 1000G gold standard indels
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz .
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi .

# Download 1000G phase 1 high-confidence SNPs (optional, for VQSR)
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz .
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi .

# Alternative: Download entire resource bundle directory
gsutil -m cp -r gs://genomics-public-data/resources/broad/hg38/v0/ .

# For current URLs and additional resources, check:
# https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle

WES Target BED File Preparation

For WES data, you’ll need a BED file specifying the exome target regions:

# Download target BED file from capture kit manufacturer
# Example: Illumina Nextera Rapid Capture Exome
# Or use standard exome target files from:
# - Gencode
# - RefSeq
# - Capture kit manufacturer website

# Ensure BED file is in hg38 coordinates
# If in hg19, use liftOver to convert:
liftOver exome_targets.hg19.bed hg19ToHg38.over.chain.gz exome_targets.hg38.bed unmapped.bed
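
After lift-over it is good practice to sort the target intervals and merge any overlaps before using them for QC or -L restriction. A small sketch using coreutils sort and bedtools:

# Sort by chromosome and start position, then merge overlapping intervals
sort -k1,1 -k2,2n exome_targets.hg38.bed > exome_targets.hg38.sorted.bed
bedtools merge -i exome_targets.hg38.sorted.bed > exome_targets.hg38.merged.bed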

Reference Consistency Checklist

Before proceeding, verify:

  • Same reference genome version used for hash table building
  • Decoy sequences included (if using)
  • ALT masking applied (if using)
  • Known sites match reference genome build (hg38)
  • WES target BED file matches reference build (if applicable)
  • All files indexed properly (.fai, .dict, .tbi files present)

Stage 2: DRAGMAP Alignment


Note: Before alignment, perform quality control on raw sequencing data using tools like FastQC or MultiQC to assess read quality, adapter contamination, and sequencing metrics. This helps identify problematic samples and determine if adapter trimming or quality filtering is needed before alignment.
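
A minimal pre-alignment QC sketch using FastQC and MultiQC (both available via Bioconda):

# Per-file read QC
mkdir -p fastqc_out
fastqc -o fastqc_out sample_R1.fastq.gz sample_R2.fastq.gz

# Aggregate the reports
multiqc fastqc_out -o multiqc_out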

Installation

Option 1: Bioconda (Recommended)

# Install via Bioconda
conda install -c bioconda dragmap

# Verify installation
dragen-os --help

Option 2: Build from Source

# Clone DRAGMAP repository
git clone https://github.com/Illumina/DRAGMAP.git
cd DRAGMAP

# Build (requires C++17 compiler, Boost library)
make

# Binary will be in ./build/release/
# Optional: Install to /usr/bin/
make install

Alignment Commands

Paired-end reads (WGS or WES):

# Basic paired-end alignment
dragen-os -r /path/to/reference/ \
  -1 sample_R1.fastq.gz \
  -2 sample_R2.fastq.gz \
  --output-directory /path/to/output/ \
  --output-file-prefix sample

# With read group information
dragen-os -r /path/to/reference/ \
  -1 sample_R1.fastq.gz \
  -2 sample_R2.fastq.gz \
  --RGID sample_lib1 \
  --RGSM sample_name \
  --RGPL ILLUMINA \
  --RGLB lib1 \
  --RGPU unit1 \
  --output-directory /path/to/output/ \
  --output-file-prefix sample

Single-end reads:

# Single-end alignment
dragen-os -r /path/to/reference/ \
  -1 sample_R1.fastq.gz \
  --output-directory /path/to/output/ \
  --output-file-prefix sample

WGS vs WES considerations: Both use the same alignment command. Differences are in downstream QC (target BED file for WES) and optional variant calling restrictions. See the WGS vs WES Considerations table in Workflow Overview.

Post-Alignment Processing

After alignment, convert the DRAGMAP SAM output into a coordinate-sorted, indexed BAM:

# Sort the SAM output and write a coordinate-sorted BAM
samtools sort -@ 8 -o sample.sorted.bam sample.sam

# Index BAM file
samtools index sample.sorted.bam

# Verify BAM file
samtools view -H sample.sorted.bam | head -20

Why Use Decoyed + Alt-Masked Reference?

See the Reference Genome Considerations section in the Introduction for detailed explanation. Key benefits:

  • Reduces false positives from mis-mapped reads
  • Improves mapping specificity in complex regions
  • Critical for somatic variant calling and structural variant detection

Comparison with BWA-MEM

See the When to Use DRAGMAP vs BWA-MEM section in the Introduction for detailed comparison. DRAGMAP offers built-in decoy and ALT masking support, while BWA-MEM requires manual setup.

Stage 3: Post-Alignment Processing


MarkDuplicates

Mark PCR and optical duplicates to avoid false positive variant calls:

# Mark duplicates using GATK
gatk MarkDuplicates \
  -I sample.sorted.bam \
  -O sample.markdup.bam \
  -M sample.markdup.metrics.txt \
  --CREATE_INDEX true

# Verify duplicate marking
samtools view -f 1024 sample.markdup.bam | head -5

Base Quality Score Recalibration (BQSR)

BQSR corrects systematic biases in base quality scores. Note: When using DRAGEN mode in variant calling, BQSR may be bypassed as DRAGEN’s error model already accounts for systematic biases (see BQSR Considerations section below for details).

First pass - Build recalibration table:

# For WGS (no target regions needed)
gatk BaseRecalibrator \
  -R reference.fa \
  -I sample.markdup.bam \
  --known-sites dbsnp.vcf.gz \
  --known-sites mills.vcf.gz \
  --known-sites 1000G_phase1.snps.high_confidence.hg38.vcf.gz \
  -O sample.recal.table

# For WES (optionally restrict to target regions for efficiency)
gatk BaseRecalibrator \
  -R reference.fa \
  -I sample.markdup.bam \
  -L exome_targets.bed \
  --known-sites dbsnp.vcf.gz \
  --known-sites mills.vcf.gz \
  -O sample.recal.table

Apply BQSR:

gatk ApplyBQSR \
  -R reference.fa \
  -I sample.markdup.bam \
  --bqsr-recal-file sample.recal.table \
  -O sample.recal.bam

# An index for the output BAM is created by default

Second pass - Generate recalibration metrics (optional, for QC):

gatk BaseRecalibrator \
  -R reference.fa \
  -I sample.recal.bam \
  --known-sites dbsnp.vcf.gz \
  --known-sites mills.vcf.gz \
  -O sample.recal.table2

# Compare before/after (optional visualization)
gatk AnalyzeCovariates \
  -before sample.recal.table \
  -after sample.recal.table2 \
  -plots sample.recalibration_plots.pdf

BQSR and DRAGEN Mode: See BQSR Considerations section in Understanding DRAGEN-GATK Features for details.

Stage 4: Quality Control


Alignment Metrics Collection

For WGS:

gatk CollectWgsMetrics \
  -R reference.fa \
  -I sample.recal.bam \
  -O sample.wgs_metrics.txt

For WES:

# CollectHsMetrics requires Picard interval_list files rather than plain BED;
# convert the target BED first (see the BedToIntervalList sketch below)
gatk CollectHsMetrics \
  -R reference.fa \
  -I sample.recal.bam \
  -O sample.wes_metrics.txt \
  -TI exome_targets.interval_list \
  -BI exome_targets.interval_list
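
The interval_list files used above can be generated from the exome target BED with GATK's bundled Picard tool (a sketch; reference.dict is the sequence dictionary created during reference preparation):

# Convert the target BED to a Picard interval_list
gatk BedToIntervalList \
  -I exome_targets.bed \
  -SD reference.dict \
  -O exome_targets.interval_list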

Coverage Analysis

Key metrics to check:

| Metric | WGS Target | WES Target |
| --- | --- | --- |
| Mean coverage | 30-60x (germline), 60-100x (somatic) | 100-150x in targets |
| On-target % | N/A | ≥80% |
| Uniformity | >0.9 | >0.8 |
| Duplication rate | <20% | <30% |

Interpretation:

  • Low coverage: May miss variants or have low confidence calls
  • High duplication: May indicate library issues or low input
  • Poor uniformity: Uneven coverage may affect variant detection
  • Low on-target (WES): May indicate capture efficiency issues

Quality Thresholds Before Variant Calling

Pass criteria:

  • Mean coverage meets target (see table above)
  • Duplication rate acceptable
  • Mapping rate >95%
  • Proper read group information present
  • BAM file properly sorted and indexed
  • For WES: On-target percentage ≥80%

Fail samples that don’t meet QC thresholds before proceeding to variant calling.
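
The mapping-rate criterion can be checked quickly with samtools flagstat (a sketch; the "mapped" percentage should exceed 95%):

# Summary of mapped, duplicate, and properly-paired read counts
samtools flagstat sample.recal.bam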

Understanding DRAGEN-GATK Features


What is DRAGEN-GATK?

DRAGEN-GATK is a software implementation of DRAGEN algorithms that runs on standard CPU hardware, without requiring the DRAGEN FPGA platform. It provides:

  • Improved error models: Better handling of sequencing errors
  • Foreign Read Detection (FRD): Identifies reads from contamination
  • Base Quality Dropout (BQD): Models quality score degradation
  • STR Error Correction (STRE): Improved handling of short tandem repeats
  • DRAGstr model: Specialized model for STR regions

Two Operational Modes

1. Functional Equivalence Mode:

  • Produces results equivalent to DRAGEN hardware output
  • Use when you need DRAGEN-compatible results
  • May disable some features for equivalence

2. Maximum Quality Mode:

  • Prioritizes accuracy and variant call richness
  • May produce more variants than DRAGEN hardware
  • Use when accuracy is more important than equivalence

Key Command-Line Flags

| Flag | Description | Typical Values |
| --- | --- | --- |
| --dragen-mode true | Main flag to enable DRAGEN features | true |
| --use-pdhmm true/false | Enables Partially Determined HMM model | true for max quality, false for equivalence |
| --disable-spanning-event-genotyping true/false | Controls spanning event genotyping | true for equivalence, false for max quality |
| --dragstr-params-path | Path to DRAGstr model parameters file | Path to text file (generated by CalibrateDragstrModel) |
| --dragstr-het-hom-ratio | Controls heterozygous/homozygous prior ratio | 2 (default) |
DRAGstr Parameters

DRAGstr parameters are not pre-downloaded files - they must be generated using GATK’s CalibrateDragstrModel tool. These parameters contain tables (GOP, GCP, API) that model short tandem repeat (STR) regions for improved variant calling accuracy in DRAGEN mode.

Step 1: Generate STR table for your reference genome:

# Create STR table from reference genome
gatk ComposeSTRTableFile \
  -R reference.fa \
  -O str_table.tsv

Step 2: Calibrate DRAGstr model:

# Run CalibrateDragstrModel on your BAM file(s)
# This generates the DRAGstr parameters file
gatk CalibrateDragstrModel \
  -R reference.fa \
  -I sample.recal.bam \
  -str str_table.tsv \
  -O dragstr_params.txt

# The output file (dragstr_params.txt) contains:
# - GOP (Gap Open Penalty) table
# - GCP (Gap Continue Penalty) table  
# - API (Allelic Pruning Index) table

Step 3: Use the generated parameters file:

# Use the generated DRAGstr parameters in HaplotypeCaller
gatk HaplotypeCaller \
  -R reference.fa \
  -I sample.recal.bam \
  --dragen-mode true \
  --dragstr-params-path dragstr_params.txt \
  -O output.vcf.gz

Important notes:

  • DRAGstr parameters are reference-specific - generate them for your specific reference genome
  • The parameters file is a text file (not JSON) containing parameter tables
  • You can reuse the same parameters file for multiple samples aligned to the same reference
  • For best results, calibrate using high-quality BAM files from your project

BQSR Considerations

When using DRAGEN mode, DRAGEN’s error model already accounts for systematic biases, so BQSR may be bypassed. However, BQSR is still recommended for:

  • Standard GATK workflows (non-DRAGEN mode)
  • Ensuring compatibility with other tools
  • QC purposes (to assess systematic biases)

Stage 5: SNV/Indel Variant Calling with GATK


This stage covers calling single nucleotide variants (SNVs) and small insertions/deletions (indels) for both germline and somatic analyses.

Germline Variant Calling

HaplotypeCaller - Single Sample Calling

Basic single sample calling:

# For WGS (no target regions needed)
gatk HaplotypeCaller \
  -R reference.fa \
  -I sample.recal.bam \
  -O sample.g.vcf.gz \
  -ERC GVCF

# For WES (optionally restrict to target regions for efficiency)
gatk HaplotypeCaller \
  -R reference.fa \
  -I sample.recal.bam \
  -L exome_targets.bed \
  -O sample.g.vcf.gz \
  -ERC GVCF

Output formats:

  • GVCF: Genomic VCF format (recommended for cohort analysis)
  • VCF: Standard VCF format (for single-sample analysis; see the example below)
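
For a quick single-sample analysis that skips joint genotyping, HaplotypeCaller can emit a final VCF directly by omitting -ERC GVCF (a sketch):

gatk HaplotypeCaller \
  -R reference.fa \
  -I sample.recal.bam \
  -O sample.vcf.gz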

Step 1: Generate GVCF for each sample:

# Generate GVCF for each sample
for sample in sample1 sample2 sample3; do
  gatk HaplotypeCaller \
    -R reference.fa \
    -I ${sample}.recal.bam \
    -O ${sample}.g.vcf.gz \
    -ERC GVCF
done

Step 2: Joint genotyping:

# Create sample map file (sample name <TAB> path to GVCF)
echo -e "sample1\tsample1.g.vcf.gz" > samples.map
echo -e "sample2\tsample2.g.vcf.gz" >> samples.map
echo -e "sample3\tsample3.g.vcf.gz" >> samples.map

# Consolidate GVCFs into a GenomicsDB workspace
# (GenomicsDBImport requires intervals; use your calling intervals or target BED)
gatk GenomicsDBImport \
  --sample-name-map samples.map \
  --genomicsdb-workspace-path cohort_gdb \
  -L calling_intervals.list

# Joint genotyping from the GenomicsDB workspace
gatk GenotypeGVCFs \
  -R reference.fa \
  -V gendb://cohort_gdb \
  -O cohort.vcf.gz

Enabling DRAGEN Features in HaplotypeCaller

Functional Equivalence Mode:

gatk --java-options "-Xmx32G" HaplotypeCaller \
  -R reference.fa \
  -I sample.recal.bam \
  -O sample-dragen-equiv.g.vcf.gz \
  -ERC GVCF \
  --dragen-mode true \
  --use-pdhmm false \
  --disable-spanning-event-genotyping true \
  --dragstr-params-path /path/to/dragstr_params.txt \
  --dragstr-het-hom-ratio 2

Maximum Quality Mode:

gatk --java-options "-Xmx32G" HaplotypeCaller \
  -R reference.fa \
  -I sample.recal.bam \
  -O sample-dragen-maxquality.g.vcf.gz \
  -ERC GVCF \
  --dragen-mode true \
  --use-pdhmm true \
  --disable-spanning-event-genotyping false \
  --dragstr-params-path /path/to/dragstr_params.txt \
  --dragstr-het-hom-ratio 2

Key differences:

  • Functional Equivalence: --use-pdhmm false, --disable-spanning-event-genotyping true
  • Maximum Quality: --use-pdhmm true, --disable-spanning-event-genotyping false

Somatic Variant Calling

Mutect2 - Tumor-Normal Pair

Basic tumor-normal calling:

# For WGS (no target regions needed)
# Sample names are read from the BAM read-group SM tags; -normal names the normal sample
gatk Mutect2 \
  -R reference.fa \
  -I tumor.recal.bam \
  -I normal.recal.bam \
  -normal normal_sample \
  -O mutect2.vcf.gz

# For WES (optionally restrict to target regions for efficiency)
gatk Mutect2 \
  -R reference.fa \
  -I tumor.recal.bam \
  -I normal.recal.bam \
  -normal normal_sample \
  -L exome_targets.bed \
  -O mutect2.vcf.gz

Mutect2 - Tumor-Only Mode

Step 1: Create Panel of Normals (PoN):

# Generate PoN from multiple normal samples
# (each normal is first called in tumor-only mode with Mutect2 --max-mnp-distance 0,
#  then the per-normal VCFs are consolidated with GenomicsDBImport)
gatk GenomicsDBImport -R reference.fa \
  -V normal1.vcf.gz -V normal2.vcf.gz -V normal3.vcf.gz \
  -L calling_intervals.list \
  --genomicsdb-workspace-path pon_db

gatk CreateSomaticPanelOfNormals \
  -R reference.fa \
  -V gendb://pon_db \
  -O pon.vcf.gz

Step 2: Get pileup summaries:

# For tumor sample
gatk GetPileupSummaries \
  -R reference.fa \
  -I tumor.recal.bam \
  -V gnomad.vcf.gz \
  -L exome_targets.bed \
  -O tumor.pileups.table

# For normal samples (if available for contamination estimation)
gatk GetPileupSummaries \
  -R reference.fa \
  -I normal.recal.bam \
  -V gnomad.vcf.gz \
  -L exome_targets.bed \
  -O normal.pileups.table

Step 3: Calculate contamination:

# Tumor only
gatk CalculateContamination \
  -I tumor.pileups.table \
  -O tumor.contamination.table

# tumor-normal pairs
gatk CalculateContamination \
  -I tumor.pileups.table \
  -matched normal.pileups.table \
  -O tumor.contamination.table

Step 4: Run Mutect2 with PoN:

gatk Mutect2 \
  -R reference.fa \
  -I tumor.recal.bam \
  --germline-resource gnomad.vcf.gz \
  --panel-of-normals pon.vcf.gz \
  -O mutect2.vcf.gz

FilterMutectCalls

Apply somatic filtering:

gatk FilterMutectCalls \
  -R reference.fa \
  -V mutect2.vcf.gz \
  -O mutect2.filtered.vcf.gz \
  --contamination-table tumor.contamination.table
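
For libraries prone to orientation artifacts (e.g., FFPE samples), a read-orientation model is commonly added to the filtering step. A sketch, assuming Mutect2 was run with --f1r2-tar-gz f1r2.tar.gz to emit F1R2 counts:

# Learn the read-orientation model from the F1R2 counts
gatk LearnReadOrientationModel \
  -I f1r2.tar.gz \
  -O read-orientation-model.tar.gz

# Supply the model alongside the contamination table
gatk FilterMutectCalls \
  -R reference.fa \
  -V mutect2.vcf.gz \
  --contamination-table tumor.contamination.table \
  --ob-priors read-orientation-model.tar.gz \
  -O mutect2.filtered.vcf.gz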

Enabling DRAGEN Features in Mutect2

With DRAGEN mode:

gatk Mutect2 \
  -R reference.fa \
  -I tumor.recal.bam \
  -I normal.recal.bam \
  -normal normal_sample \
  -O mutect2-dragen.vcf.gz \
  --dragen-mode true \
  --use-pdhmm true \
  --dragstr-params-path /path/to/dragstr_params.txt

Stage 6: Structural Variant (SV) and Copy Number Variant (CNV) Calling


This section covers calling structural variants (SVs) and copy number variants (CNVs) using GATK tools with DRAGMAP-aligned BAM files. SVs include deletions, duplications, inversions, and translocations, while CNVs refer to copy number gains and losses.

Structural Variant (SV) Calling

GATK provides the StructuralVariantDiscoveryPipeline (GATK-SV) for comprehensive SV detection. This pipeline integrates multiple SV callers and evidence types.

GATK-SV Pipeline Overview

GATK-SV is a cloud-based ensemble SV discovery pipeline that combines multiple SV detection methods:

  • Manta: Paired-end and split-read evidence
  • Wham: Read-depth and split-read evidence
  • Scramble: Assembly-based detection

Input requirements:

  • DRAGMAP-aligned BAM files (sorted and indexed)
  • Reference genome (same as used for alignment)
  • Optional: Target intervals for WES/panel data

Running GATK-SV (Basic Workflow)

Note: GATK-SV is typically run using WDL workflows. Here we show the key components:

# GATK-SV is typically run via WDL workflows
# Download GATK-SV workflows from: https://github.com/broadinstitute/gatk-sv

# Basic SV calling workflow (simplified example)
# The full pipeline requires WDL execution, but key steps include:

# 1. Extract evidence (paired-end, split-read, depth)
# 2. Run ensemble callers (Manta, Wham, Scramble)
# 3. Merge and filter SV calls
# 4. Annotate with population frequencies

# For detailed GATK-SV usage, see:
# https://broadinstitute.github.io/gatk-sv/docs/intro/

Alternative: Using Individual SV Callers

If you prefer to use individual SV callers directly:

Manta (recommended for paired-end data):

# Install Manta
# conda install -c bioconda manta

# Configure Manta
configManta.py \
  --bam tumor.recal.bam \
  --referenceFasta reference.fa \
  --runDir manta_workdir

# Run Manta
python manta_workdir/runWorkflow.py

# Output: manta_workdir/results/variants/candidateSV.vcf.gz

Delly (for translocations and complex SVs):

# Install Delly
# conda install -c bioconda delly

# Germline SV calling
delly call \
  -g reference.fa \
  -o delly.bcf \
  sample.recal.bam

# Somatic SV calling (tumor-normal)
delly call \
  -g reference.fa \
  -o delly.bcf \
  -x exclude_regions.bed \
  tumor.recal.bam \
  normal.recal.bam

Lumpy (high sensitivity for complex rearrangements):

# Install Lumpy
# conda install -c bioconda lumpy-sv

# Extract discordant and split-reads
samtools view -b -F 1294 tumor.recal.bam | \
  samtools sort - > tumor.discordants.bam

samtools view -h tumor.recal.bam | \
  extractSplitReads_BwaMem -i stdin | \
  samtools view -Sb - > tumor.splitters.bam

# Run Lumpy
lumpyexpress \
  -B tumor.recal.bam \
  -S tumor.splitters.bam \
  -D tumor.discordants.bam \
  -o lumpy.vcf

Copy Number Variant (CNV) Calling

GATK provides two main approaches for CNV calling:

  1. gCNV: For germline CNV calling (cohort-based)
  2. Somatic CNV pipeline: For tumor CNV calling (tumor-normal or tumor-only)

Germline CNV Calling with gCNV

gCNV is a cohort-based tool that uses a panel of normals (PoN) to detect germline CNVs. It’s particularly useful for WES and targeted panel data.

For WES germline CNV

Step 1: Create Panel of Normals (PoN)

# Collect read counts for each normal sample
# Repeat for all normal samples in your cohort

gatk CollectReadCounts \
  -I normal1.recal.bam \
  -L exome_targets.bed \
  -R reference.fa \
  -format HDF5 \
  -O normal1.counts.hdf5

# Repeat for normal2, normal3, etc.

# Build the read-count panel of normals from all normal samples
gatk CreateReadCountPanelOfNormals \
  -I normal1.counts.hdf5 \
  -I normal2.counts.hdf5 \
  -I normal3.counts.hdf5 \
  -O pon.counts.hdf5

Step 2: Call CNVs in Case Samples

# Collect read counts for case sample
gatk CollectReadCounts \
  -I case.recal.bam \
  -L exome_targets.bed \
  -R reference.fa \
  -format HDF5 \
  -O case.counts.hdf5

# Denoise using PoN
gatk DenoiseReadCounts \
  -I case.counts.hdf5 \
  --count-panel-of-normals pon.counts.hdf5 \
  --standardized-copy-ratios case.standardizedCR.tsv \
  --denoised-copy-ratios case.denoisedCR.tsv

# Model segments (--output is a directory; allelic counts from CollectAllelicCounts
# can optionally be added with --allelic-counts)
gatk ModelSegments \
  --denoised-copy-ratios case.denoisedCR.tsv \
  --output case_segments \
  --output-prefix case

# Call copy ratio segments from the .cr.seg file produced by ModelSegments
gatk CallCopyRatioSegments \
  --input case_segments/case.cr.seg \
  --output case.called.seg

For WGS germline CNV (genome-wide bins instead of a target BED):

# Generate genome-wide bins with PreprocessIntervals, then collect counts over them
gatk PreprocessIntervals \
  -R reference.fa \
  --bin-length 1000 \
  --interval-merging-rule OVERLAPPING_ONLY \
  -O wgs_bins.interval_list

gatk CollectReadCounts \
  -I sample.recal.bam \
  -R reference.fa \
  -L wgs_bins.interval_list \
  --interval-merging-rule OVERLAPPING_ONLY \
  -format HDF5 \
  -O sample.counts.hdf5

# Continue with denoising and calling as above

Somatic CNV Calling

Tumor-Normal Pair:

# Step 1: Collect read counts for tumor and normal
gatk CollectReadCounts \
  -I tumor.recal.bam \
  -L exome_targets.bed \
  -R reference.fa \
  -format HDF5 \
  -O tumor.counts.hdf5

gatk CollectReadCounts \
  -I normal.recal.bam \
  -L exome_targets.bed \
  -R reference.fa \
  -format HDF5 \
  -O normal.counts.hdf5

# Step 2: Denoise tumor read counts against a read-count PoN
# (built as in the germline section; without a PoN only standardization is performed)
gatk DenoiseReadCounts \
  -I tumor.counts.hdf5 \
  --count-panel-of-normals pon.counts.hdf5 \
  --standardized-copy-ratios tumor.standardizedCR.tsv \
  --denoised-copy-ratios tumor.denoisedCR.tsv

# Step 3: Collect allelic counts at common germline SNP sites for tumor and matched normal
# (common_sites.vcf.gz is a population SNP resource, e.g. gnomAD common biallelic SNPs)
gatk CollectAllelicCounts \
  -R reference.fa \
  -I tumor.recal.bam \
  -L common_sites.vcf.gz \
  -O tumor.allelicCounts.tsv

gatk CollectAllelicCounts \
  -R reference.fa \
  -I normal.recal.bam \
  -L common_sites.vcf.gz \
  -O normal.allelicCounts.tsv

# Model segments with copy ratios and B-allele frequencies
gatk ModelSegments \
  --denoised-copy-ratios tumor.denoisedCR.tsv \
  --allelic-counts tumor.allelicCounts.tsv \
  --normal-allelic-counts normal.allelicCounts.tsv \
  --output tumor_segments \
  --output-prefix tumor

# Step 4: Call copy ratio segments
gatk CallCopyRatioSegments \
  --input tumor_segments/tumor.cr.seg \
  --output tumor.called.seg

# Step 5: Plot segments (optional, for visualization)
gatk PlotModeledSegments \
  --denoised-copy-ratios tumor.denoisedCR.tsv \
  --allelic-counts tumor_segments/tumor.hets.tsv \
  --segments tumor_segments/tumor.modelFinal.seg \
  --sequence-dictionary reference.dict \
  --output tumor_plots \
  --output-prefix tumor

Tumor-Only CNV Calling:

# For tumor-only, use a panel of normals (PoN) instead of matched normal
# Create PoN from multiple normal samples (see [Copy Number Variant (CNV) Calling](#copy-number-variant-cnv-calling) section above)

# Collect read counts for tumor
gatk CollectReadCounts \
  -I tumor.recal.bam \
  -L exome_targets.bed \
  -R reference.fa \
  -format HDF5 \
  -O tumor.counts.hdf5

# Denoise using PoN
gatk DenoiseReadCounts \
  -I tumor.counts.hdf5 \
  --count-panel-of-normals pon.counts.hdf5 \
  --standardized-copy-ratios tumor.standardizedCR.tsv \
  --denoised-copy-ratios tumor.denoisedCR.tsv

# Continue with modeling and calling as above

Best Practices for SV and CNV Analysis

  1. Reference consistency: Use the same reference genome for alignment and SV/CNV calling (see Reference Genome Considerations in Introduction)
  2. Target intervals: For WES/panel, ensure target BED files match across samples
  3. Panel of Normals: Use PoN for WES/panel CNV calling (improves accuracy)
  4. Multiple callers: For SVs, use ensemble approaches (GATK-SV) or multiple callers
  5. Quality filtering: Apply appropriate filters based on sequencing depth and type
  6. Validation: Validate high-priority SVs/CNVs with orthogonal methods (PCR, FISH, etc.)
  7. Population frequencies: Check against gnomAD SV and DGV for common variants
  8. Size considerations:
    • WGS: Can detect SVs/CNVs of all sizes
    • WES: Limited to exonic regions, may miss intergenic SVs
    • Panel: Very limited, focus on target regions

Integration with SNV/Indel Results

Combined variant analysis:

# Merge SNV/Indel, SV, and CNV results for comprehensive analysis
# Use custom scripts or tools like SURVIVOR for SV merging

# Example: Create combined variant report
# 1. SNV/Indel: mutect2.filtered.vcf.gz
# 2. SV: sv.filtered.vcf.gz  
# 3. CNV: tumor.called.seg

# Prioritize variants affecting same genes/regions

Stage 7: Variant Filtering


SNV and Small Indel Filtering

Germline Filtering - VQSR

Variant Quality Score Recalibration (VQSR) uses machine learning to identify true variants:

# Build recalibration model for SNPs
gatk VariantRecalibrator \
  -R reference.fa \
  -V cohort.vcf.gz \
  --resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf.gz \
  --resource:omni,known=false,training=true,truth=false,prior=12.0 omni.vcf.gz \
  --resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg38.vcf.gz \
  --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf.gz \
  -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
  -mode SNP \
  -O cohort.snps.recal \
  --tranches-file cohort.snps.tranches

# Apply VQSR to SNPs
gatk ApplyVQSR \
  -R reference.fa \
  -V cohort.vcf.gz \
  -O cohort.snps.vqsr.vcf.gz \
  --recal-file cohort.snps.recal \
  --tranches-file cohort.snps.tranches \
  --truth-sensitivity-filter-level 99.0 \
  -mode SNP

# Similar process for indels (using Mills and 1000G indels as resources)
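
The indel pass mentioned in the last comment follows the same pattern; a sketch using the Mills and dbSNP resources downloaded earlier (fewer annotations and --max-gaussians 4 are commonly used for indels, and the indel recalibration is applied on top of the SNP-recalibrated VCF):

# Build recalibration model for indels
gatk VariantRecalibrator \
  -R reference.fa \
  -V cohort.vcf.gz \
  --resource:mills,known=false,training=true,truth=true,prior=12.0 mills.vcf.gz \
  --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf.gz \
  -an QD -an FS -an SOR -an ReadPosRankSum -an MQRankSum \
  --max-gaussians 4 \
  -mode INDEL \
  -O cohort.indels.recal \
  --tranches-file cohort.indels.tranches

# Apply VQSR to indels on top of the SNP-recalibrated VCF
gatk ApplyVQSR \
  -R reference.fa \
  -V cohort.snps.vqsr.vcf.gz \
  -O cohort.snps_indels.vqsr.vcf.gz \
  --recal-file cohort.indels.recal \
  --tranches-file cohort.indels.tranches \
  --truth-sensitivity-filter-level 99.0 \
  -mode INDEL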

Germline Filtering - Hard Filtering (Alternative)

If VQSR is not feasible (e.g., small sample size), use hard filters:

# VariantFiltration has no SNP/indel mode, so split variant types first,
# then apply type-specific hard filters

# Select and hard-filter SNPs
gatk SelectVariants \
  -R reference.fa \
  -V cohort.vcf.gz \
  -select-type SNP \
  -O cohort.snps.vcf.gz

gatk VariantFiltration \
  -R reference.fa \
  -V cohort.snps.vcf.gz \
  -O cohort.snps.filtered.vcf.gz \
  --filter-expression "QD < 2.0 || MQ < 40.0 || FS > 60.0 || SOR > 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
  --filter-name "SNP_FILTER"

# Select and hard-filter indels
gatk SelectVariants \
  -R reference.fa \
  -V cohort.vcf.gz \
  -select-type INDEL \
  -O cohort.indels.vcf.gz

gatk VariantFiltration \
  -R reference.fa \
  -V cohort.indels.vcf.gz \
  -O cohort.indels.filtered.vcf.gz \
  --filter-expression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0" \
  --filter-name "INDEL_FILTER"

Somatic Filtering

FilterMutectCalls output already applies somatic-specific filters. Additional hard filters may be applied:

# Additional hard filters for somatic variants
# (AF is a per-sample FORMAT field in Mutect2 output; -O sets the output type, -o the file)
bcftools filter -i 'FILTER="PASS" && INFO/DP>=10 && FORMAT/AF>=0.05' \
  mutect2.filtered.vcf.gz \
  -O z -o mutect2.hardfiltered.vcf.gz

VAF thresholds:

  • WGS tissue: 5-10% minimum (adjust for purity)
  • WES tissue: 5-10% minimum
  • ctDNA: 0.1-0.5% minimum (with high depth)

SV and CNV Filtering

Note: VQSR (Variant Quality Score Recalibration) is designed for SNVs and indels, not for structural variants. SV filtering uses different quality metrics and filtering criteria as shown below.

SV Filtering

Quality filters for SV calls:

# Filter by quality score, read support, and size
# (SVLEN is negative for deletions, so filter on its absolute value;
#  INFO field names vary by caller — adjust to your VCF)
bcftools filter \
  -i 'QUAL>20 && ABS(INFO/SVLEN)>50 && ABS(INFO/SVLEN)<1000000 && INFO/IMPRECISE=0' \
  -O z \
  -o sv.filtered.vcf.gz \
  sv.vcf.gz

# Additional filtering on caller-specific support fields (field names are caller-dependent)
bcftools filter \
  -i 'INFO/SUPPORT_READS>5 && INFO/SUPPORT_FRACTION>0.1' \
  -O z \
  -o sv.high_confidence.vcf.gz \
  sv.filtered.vcf.gz

Common SV filtering criteria:

  • Quality score: QUAL > 20-30
  • Read support: Minimum supporting reads (e.g., >5)
  • Size range: Filter very small (<50bp) or very large (>10Mb) SVs
  • Precision: Prefer PRECISE over IMPRECISE calls
  • Population frequency: Filter common SVs (gnomAD SV frequency < 1%)

CNV Filtering

Filter CNV segments:

# Filter segments by copy ratio and size
# Copy ratio thresholds (log2 scale):
# - Deletion: log2 < -0.5 (copy ratio < ~0.7)
# - Amplification: log2 > 0.4 (copy ratio > ~1.3)
# - Neutral: in between

# Using awk (skip header lines; column numbers depend on the .seg layout —
# here $2/$3 are segment start/end and $6 is the log2 copy ratio; adjust as needed)
awk 'BEGIN{FS=OFS="\t"} !/^@/ && $2 ~ /^[0-9]+$/ {
  if ($6 < -0.5 && $3-$2 > 10000) print;      # Deletions > 10kb
  else if ($6 > 0.4 && $3-$2 > 10000) print   # Amplifications > 10kb
}' tumor.called.seg > tumor.filtered.seg

Common CNV filtering criteria:

  • Copy ratio thresholds:
    • Deletion: log2 copy ratio < -0.5
    • Amplification: log2 copy ratio > 0.4
  • Segment size: Minimum segment size (e.g., >10kb for WGS, >5kb for WES)
  • Quality scores: Filter by segment quality metrics
  • Overlap with known CNVs: Check against DGV, gnomAD SV

Stage 8: Variant Annotation


VEP

Ensembl Variant Effect Predictor (VEP) (documentation) is a comprehensive command-line tool developed by EMBL-EBI that predicts and annotates the functional impact of genomic variants (SNPs, indels, structural variants) relative to reference annotations. VEP uses Ensembl and RefSeq transcript annotations, population frequency databases, regulatory element data, and custom annotation sources to assess variant consequences.

Key capabilities:

  • Consequence prediction: Assigns Sequence Ontology (SO) terms (e.g., nonsynonymous_variant, synonymous_variant, splice_region_variant) for all overlapping transcripts
  • Allele frequency annotation: Integrates data from 1000 Genomes, gnomAD, and other population databases
  • Clinical significance: Annotates variants with ClinVar clinical significance classifications
  • Regulatory region overlap: Reports variants intersecting regulatory regions, transcription factor binding sites, and promoter regions
  • HGVS nomenclature: Generates HGVS-compliant coding (HGVSc) and protein (HGVSp) change descriptions
  • Flexible output formats: Supports VCF (with CSQ tag), tab-delimited, JSON, and summary formats
  • Plugin architecture: Extensible through plugins for additional annotations (e.g., CADD, dbNSFP, AlphaMissense, AlphaFold)
  • Transcript selection: Options like --canonical, --pick, --most_severe, and --per_gene for prioritizing transcripts in multi-isoform genes

VEP can operate using local cache files (recommended for performance) or connect to remote Ensembl databases. The tool supports over 100 species and multiple genome assemblies, making it suitable for both human and non-human variant annotation workflows.

VEP Installation and Setup

# Install VEP via conda
conda install -c bioconda ensembl-vep

# Or install manually
# Download and install from: https://github.com/Ensembl/ensembl-vep

# Install VEP cache
vep_install -a cf -s homo_sapiens -y GRCh38 -c /path/to/vep_cache

Basic VEP Annotation

# Basic VEP annotation (includes traditional pathogenicity scores)
vep -i variants.vcf.gz \
  -o variants.vep.vcf \
  --cache --dir_cache /path/to/vep_cache \
  --assembly GRCh38 \
  --format vcf \
  --vcf \
  --everything

# VEP's --everything flag includes:
# - SIFT scores (deleterious if <0.05)
# - PolyPhen-2 scores (damaging if >0.447)
# - Other functional predictions

# With additional annotation sources
vep -i variants.vcf.gz \
  -o variants.vep.vcf \
  --cache --dir_cache /path/to/vep_cache \
  --assembly GRCh38 \
  --format vcf \
  --vcf \
  --custom /path/to/cosmic.vcf.gz,COSMIC,vcf,exact,0 \
  --custom /path/to/gnomad.vcf.gz,gnomAD,vcf,exact,0 \
  --plugin CADD,/path/to/CADD.tsv.gz

Traditional pathogenicity scores available via VEP:

  • SIFT: <0.05 (deleterious), included with --everything
  • PolyPhen-2: >0.447 (possibly damaging), >0.908 (probably damaging), included with --everything
  • CADD: >15-20 (deleterious), requires --plugin CADD
  • PhyloP: >2.0 (conserved, likely functional), available via dbNSFP plugin

VEP with AI Tools

AlphaMissense

What it is: AlphaFold-derived model predicting pathogenicity of all possible single amino acid substitutions in human proteins.

Using with VEP:

# VEP with AlphaMissense plugin
vep -i coding_variants.vcf.gz \
  -o coding_variants.alphamissense.vcf \
  --cache --dir_cache /path/to/vep_cache \
  --assembly GRCh38 \
  --format vcf \
  --vcf \
  --plugin AlphaMissense,/path/to/AlphaMissense_hg38.tsv.gz

# Score interpretation:
# - Pathogenic: score > 0.5
# - Ambiguous: 0.35 < score <= 0.5
# - Benign: score <= 0.35

AlphaFold Structure Integration

Mapping variants to AlphaFold structures:

# VEP with AlphaFold structure annotation
vep -i coding_variants.vcf.gz \
  -o coding_variants.alphafold.vcf \
  --cache --dir_cache /path/to/vep_cache \
  --assembly GRCh38 \
  --format vcf \
  --vcf \
  --plugin AlphaFold

# pLDDT confidence scores:
# - Very high: pLDDT >= 90
# - Confident: pLDDT >= 70 (use for predictions)
# - Low: pLDDT < 70 (less reliable)

Accessing AlphaFold models:

  • AlphaFold DB: https://alphafold.ebi.ac.uk/
  • Download structures for specific proteins
  • Use pLDDT scores to filter reliable regions (≥70)
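
Individual models can be fetched directly from the AlphaFold DB by UniProt accession; a sketch (the URL pattern and model version follow the current AlphaFold DB file-naming convention and may change, here using BRCA1, UniProt P38398):

# Download a predicted structure (PDB format) for one protein
wget https://alphafold.ebi.ac.uk/files/AF-P38398-F1-model_v4.pdb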

EVE (Evolutionary Model of Variant Effect)

What it is: EVE is an evolutionary model-based AI predictor that uses unsupervised machine learning on multiple sequence alignments to predict variant pathogenicity. It can be used independently or integrated with VEP workflows.

Using EVE scores:

# EVE provides precomputed scores for human missense variants
# Download from: https://evemodel.org/

# Annotate with EVE scores
bcftools annotate -a eve_scores.vcf.gz \
  -c INFO/EVE \
  variants.vcf.gz \
  -O z -o variants.eve.vcf.gz

# Score interpretation:
# - Pathogenic: EVE_score > 0.5
# - Benign: EVE_score < 0.5

Other AI Tools

Additional AI-based predictors available via VEP plugins:

  • REVEL: Ensemble predictor combining multiple scores (available via dbNSFP)
  • PrimateAI: Deep learning model trained on primate sequences
  • DANN: Deep neural network for pathogenicity prediction
  • MutPred2: Machine learning predictor of molecular and phenotypic impact

Integration methods:

  • VEP plugins (dbNSFP provides many scores including REVEL, SIFT, PolyPhen-2, PhyloP)
  • Custom annotation with bcftools annotate
  • Database lookups (dbNSFP, dbscSNV)

Note: Traditional scores (SIFT, PolyPhen-2, CADD, PhyloP) are widely used and well-validated. They complement AI-based scores and should be considered together for comprehensive variant assessment.
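
A sketch of pulling several of these scores in a single VEP pass via the dbNSFP plugin (the dbNSFP file path and the exact column names depend on the dbNSFP release you download):

# Annotate with REVEL, SIFT, PolyPhen-2 and PhyloP scores from dbNSFP
vep -i variants.vcf.gz \
  -o variants.dbnsfp.vcf \
  --cache --dir_cache /path/to/vep_cache \
  --assembly GRCh38 \
  --format vcf \
  --vcf \
  --plugin dbNSFP,/path/to/dbNSFP4.4a_grch38.gz,REVEL_score,SIFT_score,Polyphen2_HDIV_score,phyloP100way_vertebrate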

SNPeffect 5.0

SNPeffect is a comprehensive variant effect prediction pipeline (GitHub) that integrates multiple computational tools to assess the functional impact of coding variants. The pipeline processes VCF files through SnpEff annotation, then performs protein structure-based analysis using FoldX for stability predictions, along with sequence-based predictions from SIFT, PROVEAN, AGADIR, and TMHMM. It generates detailed reports including FoldX ΔΔG values, domain annotations (via Gene3D for CATH domain classification), transmembrane predictions, and functional impact scores.

Integrated tools:

  • Structure analysis: FoldX (protein stability ΔΔG), AlphaFold (structure retrieval and pLDDT scoring), Gene3D (CATH domain annotation)
  • Sequence-based predictors: SIFT (pathogenicity), PROVEAN (protein variant effect), AGADIR (helix propensity), TMHMM (transmembrane domain prediction)

SNPeffect 5.0 Pipeline

Integrated AlphaFold + FoldX workflow:

# SNPeffect 5.0 usage
snpeffect5 --variants coding_variants.vcf.gz \
  --output snpeffect_results.tsv \
  --structure_db /path/to/alphafold_db/ \
  --threads 8 \
  --min_plddt 70

# Output includes:
# - ΔΔG (stability change)
# - Domain annotation
# - pLDDT confidence
# - Functional impact predictions

Comparison: VEP vs. SNPeffect 5.0

Both VEP and SNPeffect 5.0 are powerful variant annotation tools, but they serve different purposes and have distinct strengths. Understanding their differences helps in choosing the right tool for your analysis needs.

| Feature | VEP (Ensembl Variant Effect Predictor) | SNPeffect 5.0 |
| --- | --- | --- |
| Primary Focus | Comprehensive variant annotation across all variant types | Deep structural and functional analysis of coding variants |
| Variant Types Supported | SNVs, indels, structural variants, CNVs | Primarily coding variants (missense, nonsense, etc.) |
| Annotation Approach | Transcript-based consequences, population frequencies, regulatory regions | Protein structure-based predictions with biophysical modeling |
| AI-Based Predictors | AlphaMissense (plugin), AlphaFold (plugin), EVE (external), REVEL (dbNSFP plugin), PrimateAI/DANN/MutPred2 (dbNSFP) | AlphaFold (integrated), FoldX (core component for ΔΔG stability predictions) |
| Traditional Predictors | SIFT, PolyPhen-2 (built-in), CADD, PhyloP, PhastCons (via plugins) | SIFT, PROVEAN, AGADIR, TMHMM |
| Structure Analysis | Limited: AlphaFold plugin provides pLDDT scores; no direct stability calculations | Core feature: Integrated FoldX for ΔΔG predictions, automated AlphaFold structure retrieval, Gene3D domain annotation |
| Integration Method | Plugin architecture (extensible), custom annotations via --custom, combinable with external tools | Integrated pipeline (automated), requires pre-built AlphaFold structure database |
| Species Support | 100+ species (mammals, plants, fungi, etc.) | Primarily human (focused on human protein structures) |
| Population Data | Extensive (1000 Genomes, gnomAD, ExAC, etc.) | Limited (relies on external annotation sources) |
| Regulatory Annotation | Built-in support for regulatory regions, TFBS, promoters | Not a primary focus |
| Clinical Data | ClinVar integration, HGMD (via plugins) | Not included |
| Output Formats | VCF (CSQ tag), tab-delimited, JSON, summary | Tab-delimited reports (FoldX reports, sequence analysis) |
| Computational Requirements | Moderate (16-32GB RAM with cache) | High (requires protein structures, FoldX calculations) |
| Ease of Use | Well-documented, widely adopted, plugin ecosystem | Specialized pipeline requiring structure databases |

Key Distinction: VEP excels at annotation aggregation - bringing together multiple AI and traditional predictors through plugins for comprehensive annotation in a single pass. SNPeffect 5.0 focuses on deep structural analysis - integrating AlphaFold structures with FoldX to provide unique biophysical insights (protein stability ΔΔG) that VEP cannot calculate directly.

When to Use VEP:

  • Initial annotation: First-pass annotation of all variants (coding and non-coding)
  • Population genetics: Analyzing allele frequencies across populations
  • Regulatory variants: Identifying variants in regulatory regions, promoters, enhancers
  • Clinical interpretation: Integrating ClinVar and other clinical databases
  • Multi-species analysis: Working with non-human genomes
  • Standard workflows: When following established annotation pipelines

When to Use SNPeffect 5.0:

  • Protein structure analysis: When you need detailed biophysical predictions (ΔΔG values)
  • Stability predictions: Understanding how variants affect protein folding and stability
  • Research focus: Deep analysis of specific genes or protein families
  • Structure-function studies: Investigating relationships between structure and function
  • Complementary analysis: After initial VEP annotation, for prioritized coding variants

Recommended Workflow: Use both tools in a complementary manner:

  1. VEP for initial comprehensive annotation and filtering
  2. SNPeffect 5.0 for detailed structural analysis of prioritized coding variants
  3. Combine results for comprehensive variant interpretation

Integration Example:

# Step 1: VEP for comprehensive annotation (VCF output, bgzip-compressed so
# bcftools can read it in the next step)
vep -i variants.vcf.gz \
  -o variants.vep.vcf.gz \
  --cache --dir_cache /path/to/vep_cache \
  --assembly GRCh38 \
  --format vcf \
  --vcf --compress_output bgzip \
  --everything \
  --plugin AlphaMissense,/path/to/AlphaMissense_hg38.tsv.gz

# Step 2: Extract coding variants of interest
bcftools view -i 'INFO/CSQ~"missense_variant"' variants.vep.vcf.gz \
  > coding_variants.vcf

# Step 3: SNPeffect for structural analysis
snpeffect5 --variants coding_variants.vcf \
  --output snpeffect_results.tsv \
  --structure_db /path/to/alphafold_db/ \
  --min_plddt 70

# Step 4: Merge results for comprehensive interpretation
# (Custom script to combine VEP and SNPeffect outputs)

ENCODE cCRE Annotation for Non-coding Variants

What are cCREs?

cCREs (candidate cis-Regulatory Elements) are genomic intervals identified by ENCODE as candidate regulatory elements based on:

  • Transcription factor ChIP-seq
  • Chromatin accessibility (DNase/ATAC)
  • Histone modifications (H3K4me3, H3K27ac)
  • CTCF binding

cCRE classification:

  • Promoter-like: Near transcription start sites
  • Enhancer-like: Distal regulatory elements
  • CTCF-only: Insulator elements
  • Other: Additional regulatory classes

Downloading cCRE Data

# Download from ENCODE cCRE Registry
# Version 4 contains ~2.37M human cCREs for hg38

# Option 1: SCREEN portal
# https://screen.wenglab.org/downloads
# Download BED files for specific cell types or all cCREs

# Option 2: ENCODE FTP
# Download from ENCODE project FTP site

# Example file: ENCODE_cCREs.hg38.v4.bed

Overlapping Variants with cCREs

Method 1: bedtools intersect:

# Extract variant positions to BED format (0-based start, 1-based end)
bcftools query -f '%CHROM\t%POS0\t%END\n' non_coding_variants.vcf.gz > variants.bed

# Intersect with cCREs
bedtools intersect -a variants.bed \
  -b ENCODE_cCREs.hg38.v4.bed \
  -wa -wb > variants_in_ccres.txt

# Add cCRE information back to VCF
# (requires custom script to merge results)
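
Alternatively, bcftools annotate can copy the cCRE class directly into a new INFO tag without a custom merge script. A minimal sketch: the tag name CCRE_CLASS is an arbitrary choice, and the assumed BED column order (chrom, start, end, class) should be checked against the actual cCRE file.

# Sort, compress, and index the cCRE BED file
sort -k1,1 -k2,2n ENCODE_cCREs.hg38.v4.bed | bgzip > ENCODE_cCREs.hg38.v4.bed.gz
tabix -p bed ENCODE_cCREs.hg38.v4.bed.gz

# Declare the new INFO tag and annotate the VCF
echo '##INFO=<ID=CCRE_CLASS,Number=1,Type=String,Description="ENCODE cCRE class overlapping the variant">' > ccre.hdr
bcftools annotate \
  -a ENCODE_cCREs.hg38.v4.bed.gz \
  -c CHROM,FROM,TO,CCRE_CLASS \
  -h ccre.hdr \
  non_coding_variants.vcf.gz \
  -O z -o non_coding_variants.ccre_annotated.vcf.gz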

Method 2: VEP custom annotation:

# VEP with cCRE custom annotation
vep -i non_coding_variants.vcf.gz \
  -o non_coding_variants.ccre.vcf \
  --cache --dir_cache /path/to/vep_cache \
  --assembly GRCh38 \
  --format vcf \
  --vcf \
  --custom /path/to/ENCODE_cCREs.hg38.v4.bed.gz,ENCODE_cCRE,bed,overlap,0

Method 3: R/Bioconductor:

library(GenomicRanges)
library(VariantAnnotation)

# Read variants and extract their genomic positions
vcf <- readVcf("non_coding_variants.vcf.gz", "hg38")
variants <- rowRanges(vcf)

# Read cCREs (BED import yields a GRanges; the cCRE class is typically stored in the 'name' column)
ccres <- rtracklayer::import("ENCODE_cCREs.hg38.v4.bed")

# Find overlaps
overlaps <- findOverlaps(variants, ccres)

# Annotate variants with cCRE class
annotated <- cbind(
  as.data.frame(variants[queryHits(overlaps)]),
  cCRE_class = mcols(ccres)$name[subjectHits(overlaps)]
)

Tissue-Specific cCRE Activity

Prioritizing variants in tissue-relevant cCREs:

  • Use cell-type specific cCRE annotations when available
  • Prioritize variants in cCREs active in disease-relevant tissues
  • Consider tissue-specific chromatin states
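
As a sketch, cell-type-specific cCRE BED files (for example, downloaded from the SCREEN portal) can be intersected with the variant positions in the same way as the full registry; the file name below is hypothetical:

# Restrict to cCREs called active in a disease-relevant cell type
bedtools intersect -a variants.bed \
  -b liver_cCREs.hg38.bed \
  -wa -wb > variants_in_tissue_ccres.txt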

Integration with Other Regulatory Annotations

Combine cCRE annotation with:

  • TF binding motifs: JASPAR, TRANSFAC databases
  • Chromatin states: ChromHMM, Segway segmentation
  • Histone modifications: H3K4me3 (promoters), H3K27ac (enhancers)
  • Motif disruption: motifbreakR, FIMO tools
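
For motif disruption, one command-line route is to scan short windows around each variant with FIMO against a MEME-format motif library. This is a sketch: the motif file name and the 20 bp window are assumptions, and reference versus alternate alleles still need to be compared separately to call a disrupted motif.

# Build +/-20 bp windows around each variant and extract the sequence
bedtools slop -i variants.bed -g hg38.genome -b 20 > variant_windows.bed
bedtools getfasta -fi hg38.fa -bed variant_windows.bed -fo variant_windows.fa

# Scan for TF motif matches
fimo --oc fimo_out JASPAR_vertebrates.meme variant_windows.fa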

SV and CNV Annotation

SV Annotation

AnnotSV (specialized SV annotation tool):

# Install AnnotSV
# conda install -c bioconda annotsv

# Annotate SVs
AnnotSV \
  -SVinputFile sv.vcf.gz \
  -genomeBuild GRCh38 \
  -outputDir annotsv_output \
  -outputFile sv.annotated.tsv

VEP annotation for SVs:

# VEP can annotate SVs (limited compared to AnnotSV)
vep -i sv.vcf.gz \
  -o sv.vep.vcf \
  --cache --dir_cache /path/to/vep_cache \
  --assembly GRCh38 \
  --format vcf \
  --vcf \
  --check_existing \
  --check_svs

CNV Annotation

Annotate CNV segments with gene information:

# GATK .seg files contain SAM-style (@) header lines plus a column header; convert to BED first
grep -v "^@" tumor.called.seg | awk 'NR > 1 {print $1"\t"($2-1)"\t"$3}' > tumor.called.bed

# Use bedtools to overlap CNV segments with gene annotations
bedtools intersect \
  -a tumor.called.bed \
  -b genes.gtf \
  -wa -wb > cnv.genes.txt

# Or use R/Bioconductor GenomicRanges

Stage 9: Variant Prioritization


Variant prioritization involves ranking variants by their potential biological and clinical significance. Different prioritization strategies are used depending on variant type (coding vs. non-coding) and analysis context (germline vs. somatic). For more details about prioritization strategies, multi-tool consensus approaches, and comprehensive guidelines, see the Guidelines for Sequencing-Based Somatic Variant Analysis.

Coding Variant Prioritization

Coding variants (SNVs and indels in protein-coding regions) can be prioritized using multiple complementary approaches.

Pathogenicity Score Assessment

AI-based scores (recommended for recent analyses):

  • AlphaMissense: >0.5 (pathogenic)
  • EVE: >0.5 (pathogenic)
  • REVEL: >0.5 (pathogenic)

Traditional scores (widely used, well-validated):

  • CADD: >15-20 (deleterious)
  • SIFT: <0.05 (deleterious)
  • PolyPhen-2: >0.447 (possibly damaging), >0.908 (probably damaging)
  • PhyloP: >2.0 (conserved, likely functional)

Comparison: AI-Based vs. Traditional Scoring for Protein Impact

AI-Based Scores (AlphaMissense, EVE, REVEL):

Advantages:

  • Higher accuracy: Trained on large-scale experimental and clinical data, often outperform traditional methods
  • Protein structure-aware: AlphaMissense leverages AlphaFold protein structures, capturing structural context
  • Evolutionary modeling: EVE uses evolutionary information across species to predict functional impact
  • Ensemble approaches: REVEL combines multiple predictors for robust predictions
  • Comprehensive coverage: Pre-computed scores available for all possible missense variants in human proteins
  • Better for rare variants: Less dependent on training data frequency, better generalization

Limitations:

  • Computational requirements: Some models require significant computational resources
  • Interpretability: Deep learning models can be “black boxes” with less transparent decision-making
  • Database dependencies: May require large pre-computed databases (e.g., AlphaMissense database)
  • Newer tools: Less historical validation compared to traditional methods

Traditional Scores (SIFT, PolyPhen-2, CADD, PhyloP):

Advantages:

  • Well-validated: Extensive validation in clinical and research settings over many years
  • Interpretable: Based on clear biological principles (sequence conservation, physicochemical properties)
  • Widely available: Integrated into standard annotation pipelines (VEP, ANNOVAR)
  • Fast computation: Efficient algorithms, suitable for large-scale analyses
  • Established thresholds: Well-characterized score cutoffs with known performance metrics
  • Complementary information: Different methods capture different aspects (conservation, structure, function)

Limitations:

  • Lower accuracy: Generally less accurate than state-of-the-art AI methods
  • Sequence-based: SIFT and PolyPhen-2 primarily use sequence features, less structural context
  • Conservation bias: PhyloP and SIFT heavily rely on evolutionary conservation, may miss species-specific effects
  • Limited for rare variants: Performance may degrade for variants not well-represented in training data

When to Use Each Approach:

Use AI-based scores when:

  • Prioritizing variants for experimental validation or clinical interpretation
  • Analyzing large-scale datasets where accuracy is critical
  • Working with rare or novel variants not well-represented in traditional training sets
  • Need protein structure context for variant impact assessment
  • Have access to pre-computed AI score databases

Use traditional scores when:

  • Need interpretable predictions based on clear biological principles
  • Working with established annotation pipelines (VEP, ANNOVAR)
  • Require fast annotation for large-scale screening
  • Need complementary information from multiple independent methods
  • Comparing results with historical datasets or published studies

Best Practice: Combined Approach:

  • Use both AI and traditional scores for comprehensive assessment
  • Prioritize variants with agreement across multiple predictor types
  • Consider context: AI scores for high-priority variants, traditional scores for initial screening
  • Validate predictions with experimental or clinical evidence when possible

Prioritization strategy:

  • Highest priority: Agreement across multiple predictors (e.g., AlphaMissense >0.5 + CADD >20 + SIFT <0.05)
  • High priority: Multiple predictors (AI or traditional) agree on pathogenicity
  • Moderate priority: Single strong predictor (either AI or traditional)

Structural Impact Assessment

Prioritize variants with:

  • Destabilizing ΔΔG: FoldX ΔΔG > 0.5 kcal/mol
  • High-confidence structure: AlphaFold pLDDT ≥ 70
  • Domain disruption: Variants in functional domains or interaction surfaces

Codon Usage Change Analysis for Synonymous Variants

Why codon usage matters: Synonymous variants can cause disease (including cancer) through expression changes when codon usage is dramatically altered. Codon usage bias affects:

  • mRNA stability: Rare codons can reduce mRNA stability, leading to decreased expression
  • Translation efficiency: Codon usage affects translation speed and efficiency
  • Protein folding: Altered translation kinetics can affect co-translational protein folding
  • Gene expression levels: Dramatic codon usage changes can significantly alter gene expression

Calculating Codon Usage Frequency (CUF) Change:

  1. Obtain codon usage frequencies: Download species-specific codon usage tables from:
    • Codon Usage Database (CUTG, hosted by Kazusa): http://www.kazusa.or.jp/codon/
    • Human codon usage tables from genomic databases
  2. Identify the codon change: For a synonymous variant, determine:
    • Original codon (e.g., TTC encoding Phe)
    • New codon (e.g., TTT encoding Phe)
    • Both encode the same amino acid (synonymous)
  3. Look up frequencies: Find the relative usage of each codon among its synonymous codons:
    • Expressed as a fraction between 0 and 1 (the fractions for all synonymous codons of an amino acid sum to 1)
    • Example: TTC (Phe) might have CUF = 0.8, TTT (Phe) might have CUF = 0.2
  4. Calculate change:
    CUF_change = |CUF_original - CUF_new|
    
    • Example: 0.8 - 0.2 = 0.6 (large change, high priority)

Prioritization criteria for synonymous variants:

  • Large CUF change (>0.4): High priority, especially in cancer genes
  • Moderate CUF change (0.2-0.4): Moderate priority, consider gene context
  • Small CUF change (<0.2): Lower priority
  • Cancer gene context: Higher priority for synonymous variants in oncogenes or tumor suppressors

Tools for codon usage analysis:

  • CodonW: Codon usage analysis software
  • CAI (Codon Adaptation Index): Measures codon usage bias
  • tAI (tRNA Adaptation Index): Considers tRNA abundance
  • Custom scripts: Calculate codon frequency changes from variant annotations

Example calculation (Python):

# Load codon usage table (from the CUTG/Kazusa database)
# NOTE: the frequencies below are illustrative placeholders, not actual human codon usage values
codon_freq = {
    'TTC': 0.8,  # Phe - common codon (illustrative)
    'TTT': 0.2,  # Phe - rare codon (illustrative)
    'GAA': 0.9,  # Glu - common codon (illustrative)
    'GAG': 0.1,  # Glu - rare codon (illustrative)
    # ... more codons
}

def calculate_cuf_change(ref_codon, alt_codon, codon_freq_table):
    """Calculate codon usage frequency change for synonymous variant"""
    ref_freq = codon_freq_table.get(ref_codon, 0.0)
    alt_freq = codon_freq_table.get(alt_codon, 0.0)
    cuf_change = abs(ref_freq - alt_freq)
    return cuf_change

# Example: Synonymous variant TTC > TTT (both encode Phe)
cuf_change = calculate_cuf_change('TTC', 'TTT', codon_freq)
# Result: |0.8 - 0.2| = 0.6 (large change, high priority)

Integration with expression data: If RNA-seq data is available, prioritize synonymous variants with:

  • Large codon usage changes (>0.4)
  • Associated expression changes in the sample
  • Located in cancer-relevant genes

Pathway and Gene Set Enrichment Analysis

Why perform enrichment analysis? Pathway enrichment analysis connects variant-level signals to biological function, revealing disease mechanisms, affected pathways, and gene sets with coordinated variant burden.

Preparation steps (before enrichment analysis):

  1. Filter variants by impact to focus on functionally relevant variants:
    • High impact: Frameshift, stop-gain/loss, splice-site disrupting
    • Moderate impact: Missense (predicted deleterious), in-frame indels
  2. Aggregate variants to gene level for enrichment analysis:
# Step 1: Extract coding variants with moderate or high impact
# Using VEP-annotated VCF
bcftools filter -i 'INFO/CSQ ~ "HIGH" || INFO/CSQ ~ "MODERATE"' \
  variants.vep.vcf.gz \
  -O z -o coding_variants.vcf.gz

# Alternative: Filter by CADD score (e.g., CADD > 15)
bcftools filter -i 'INFO/CADD > 15' \
  variants.vcf.gz \
  -O z -o coding_variants.cadd_filtered.vcf.gz

# Step 2: Count variants per gene and create gene list
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO/CSQ\n' \
  coding_variants.vcf.gz | \
  awk -F'|' '{print $4}' | sort | uniq -c | sort -rn > gene_burden.txt

# Create gene list for enrichment tools (uniq -c output has leading spaces, so use awk)
awk '{print $2}' gene_burden.txt | grep -v "^$" > genes_with_variants.txt

Pathway enrichment tools:

GSEA (Gene Set Enrichment Analysis):

# Prepare ranked gene list (e.g., by variant burden)
# Format: GENE_NAME  RANK_SCORE

# Run GSEA in preranked mode (Java application)
java -Xmx4g -cp gsea.jar \
  xtools.gsea.GseaPreranked \
  -rnk ranked_genes.rnk \
  -gmx pathway.gmt \
  -chip gene_chip.chip \
  -out output_dir

Enrichr (Web-based):

# Prepare gene list
# Upload to: https://maayanlab.cloud/Enrichr/

# Or use R package
library(enrichR)
dbs <- listEnrichrDbs()  # data frame of available gene set libraries
enriched <- enrichr(genes_with_variants, c("GO_Biological_Process_2021", "KEGG_2021_Human"))

clusterProfiler (R):

library(clusterProfiler)
library(org.Hs.eg.db)

# Convert gene symbols to ENTREZ IDs
gene_ids <- bitr(genes_with_variants, 
                  fromType="SYMBOL", 
                  toType="ENTREZID", 
                  OrgDb=org.Hs.eg.db)

# GO enrichment
go_enrich <- enrichGO(gene = gene_ids$ENTREZID,
                      OrgDb = org.Hs.eg.db,
                      ont = "BP",
                      pAdjustMethod = "BH",
                      pvalueCutoff = 0.05,
                      qvalueCutoff = 0.2,
                      readable = TRUE)

# KEGG enrichment
kegg_enrich <- enrichKEGG(gene = gene_ids$ENTREZID,
                         organism = "hsa",
                         pvalueCutoff = 0.05)

# Reactome enrichment
library(ReactomePA)
reactome_enrich <- enrichPathway(gene = gene_ids$ENTREZID,
                                 organism = "human",
                                 pvalueCutoff = 0.05)

Gene set databases:

  • KEGG: Metabolic and signaling pathways
  • Reactome: Comprehensive human pathways
  • MSigDB: Hallmark gene sets, oncogenic signatures, immunologic signatures
  • Enrichr: Web-based enrichment analysis platform with multiple downloadable gene set libraries - https://maayanlab.cloud/Enrichr/
  • GO: Biological process, molecular function, cellular component
  • WikiPathways: Community-curated pathways

Best practices for enrichment analysis:

  1. Multiple testing correction: Use FDR (Benjamini-Hochberg) or Bonferroni
  2. Proper background: Use callable/high-confidence genes as background
  3. Gene size bias: Consider gene length and mutability
  4. Integration: Combine with functional predictions (AI scores, structural impact)
  5. Sample size: Ensure sufficient power for rare variant enrichment

Integration Strategy for Coding Variants

Pathogenicity score integration: Combine multiple predictors for robust prioritization:

AI-based scores (recommended for recent analyses):

  • AlphaMissense: >0.5 (pathogenic)
  • EVE: >0.5 (pathogenic)
  • REVEL: >0.5 (pathogenic)

Traditional scores (widely used, well-validated):

  • CADD: >15-20 (deleterious)
  • SIFT: <0.05 (deleterious)
  • PolyPhen-2: >0.908 (probably damaging) or >0.447 (possibly damaging)
  • PhyloP: >2.0 (conserved, likely functional)

Highest priority variants:

  • Multiple high pathogenicity scores: Agreement across multiple predictors (e.g., AlphaMissense >0.5 + CADD >20 + SIFT <0.05)
  • Destabilizing structural impact: FoldX ΔΔG > 0.5 kcal/mol
  • Located in functional domain: Variants in known functional domains or interaction surfaces
  • In enriched pathway: Variants in genes from significantly enriched pathways
  • Synonymous variants: Large codon usage change (>0.4) in cancer genes with expression evidence
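
One way to apply the score-agreement criteria above programmatically is VEP's filter_vep utility on the annotated VCF. This is a minimal sketch: the input file name is a placeholder, and the field names (am_pathogenicity from the AlphaMissense plugin, CADD_PHRED from the CADD plugin) depend on which plugins were actually run and should be adjusted accordingly.

# Keep variants supported by both an AI-based and a traditional predictor
filter_vep -i variants.annotated.vcf \
  --filter "am_pathogenicity > 0.5 and CADD_PHRED > 20" \
  -o highest_priority_variants.vcf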

Germline Variant Prioritization

Prioritization criteria:

  • Population frequency: gnomAD AF < 0.01% (rare variants)
  • Pathogenicity predictions:
    • AI-based: High AlphaMissense, EVE, REVEL scores (>0.5)
    • Traditional: High CADD (>15-20), SIFT (<0.05), PolyPhen-2 (>0.447), PhyloP (>2.0)
    • Agreement across multiple predictors strengthens confidence
  • Clinical significance: Pathogenic/Likely Pathogenic in ClinVar
  • Inheritance patterns: Match expected inheritance mode
  • Functional impact: High or moderate impact variants (frameshift, stop-gain, missense)
  • Gene-level burden: Variants in genes with known disease associations
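
A sketch of a first-pass germline filter applying the frequency and impact criteria above with filter_vep (assumes the VCF was annotated with --everything so gnomAD allele frequencies are present in CSQ; the exact field name, e.g. gnomADe_AF, varies by VEP version and cache):

# Keep rare (gnomAD AF < 0.01%) or gnomAD-absent variants with high/moderate impact
filter_vep -i sample1.annotated.vcf \
  --filter "(gnomADe_AF < 0.0001 or not gnomADe_AF) and (IMPACT is HIGH or IMPACT is MODERATE)" \
  -o sample1.rare_candidates.vcf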

Non-coding Variant Prioritization

ENCODE cCRE overlap:

  • Highest priority: Promoter-like and enhancer-like cCREs
  • Tissue-specific: Variants in tissue-relevant cCREs
  • High confidence: cCREs with strong experimental support

Regulatory impact assessment:

  • TF binding disruption: motifbreakR, FIMO predictions
  • Active chromatin: H3K4me3 (promoters), H3K27ac (enhancers)
  • eQTL evidence: GTEx, eQTLGen associations
  • Chromatin state: ChromHMM, Segway segmentation

Somatic Variant Prioritization

Cancer-specific criteria:

  • COSMIC: Known cancer-associated variants
  • OncoKB: Therapeutic implications
  • Pathway enrichment: Variants in enriched cancer pathways (TP53, PI3K/AKT, MAPK, DNA repair)
  • Recurrence: Variants recurrent across samples
  • Tier classification: Tier 1 (highest clinical significance) to Tier 4
  • Synonymous variants with codon usage changes:
    • Prioritize synonymous variants in cancer genes with large codon usage frequency changes (>0.4)
    • Consider expression changes associated with codon usage alterations
    • Higher priority for variants in oncogenes or tumor suppressors
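
For the recurrence criterion, a quick sketch using bcftools isec to list sites shared by two or more per-sample somatic VCFs (the file names are hypothetical; all inputs must be bgzipped and indexed):

# Sites present in at least 2 of the 3 samples are written to the recurrent_sites/ directory
bcftools isec -n+2 -p recurrent_sites \
  sample1.somatic.vcf.gz sample2.somatic.vcf.gz sample3.somatic.vcf.gz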

Complete Workflow Examples


Example 1: WGS Germline Variant Calling (Single Sample)

#!/bin/bash
# Complete workflow for WGS germline variant calling

# Setup
REFERENCE="/path/to/hg38.fa"
REF_DIR="/path/to/dragmap_reference"
SAMPLE="sample1"
FASTQ_R1="sample1_R1.fastq.gz"
FASTQ_R2="sample1_R2.fastq.gz"

# Stage 1: Alignment
dragen-os -r ${REF_DIR} \
  -1 ${FASTQ_R1} \
  -2 ${FASTQ_R2} \
  --RGID ${SAMPLE}_lib1 \
  --RGSM ${SAMPLE} \
  --RGPL ILLUMINA \
  --output-directory ./alignment/ \
  --output-file-prefix ${SAMPLE}

# Stage 2: Sort and index
samtools sort -@ 8 ./alignment/${SAMPLE}.bam -o ${SAMPLE}.sorted.bam
samtools index ${SAMPLE}.sorted.bam

# Stage 3: Mark duplicates
gatk MarkDuplicates \
  -I ${SAMPLE}.sorted.bam \
  -O ${SAMPLE}.markdup.bam \
  -M ${SAMPLE}.markdup.metrics.txt \
  --CREATE_INDEX true

# Stage 4: BQSR
gatk BaseRecalibrator \
  -R ${REFERENCE} \
  -I ${SAMPLE}.markdup.bam \
  --known-sites dbsnp.vcf.gz \
  --known-sites mills.vcf.gz \
  -O ${SAMPLE}.recal.table

gatk ApplyBQSR \
  -R ${REFERENCE} \
  -I ${SAMPLE}.markdup.bam \
  -bqsr ${SAMPLE}.recal.table \
  -O ${SAMPLE}.recal.bam \
  --create-output-bam-index true

# Stage 5: SNV/Indel variant calling with DRAGEN mode
gatk HaplotypeCaller \
  -R ${REFERENCE} \
  -I ${SAMPLE}.recal.bam \
  -O ${SAMPLE}.g.vcf.gz \
  -ERC GVCF \
  --dragen-mode true \
  --use-pdhmm true \
  --dragstr-params-path /path/to/dragstr_params.json

# Genotype (if needed as VCF)
gatk GenotypeGVCFs \
  -R ${REFERENCE} \
  -V ${SAMPLE}.g.vcf.gz \
  -O ${SAMPLE}.vcf.gz

# Stage 6: SV and CNV calling (optional)
# See the "Stage 6: Structural Variant (SV) and Copy Number Variant (CNV) Calling" section for SV/CNV calling commands

# Stage 7: Variant filtering (hard filters; use SelectVariants first if SNPs and indels need different thresholds)
gatk VariantFiltration \
  -R ${REFERENCE} \
  -V ${SAMPLE}.vcf.gz \
  -O ${SAMPLE}.filtered.vcf.gz \
  --filter-expression "QD < 2.0 || MQ < 40.0 || FS > 60.0" \
  --filter-name "SNP_FILTER"

# Stage 8: Variant annotation
vep -i ${SAMPLE}.filtered.vcf.gz \
  -o ${SAMPLE}.annotated.vcf \
  --cache --dir_cache /path/to/vep_cache \
  --assembly GRCh38 \
  --format vcf \
  --vcf \
  --plugin AlphaMissense,/path/to/AlphaMissense_hg38.tsv.gz \
  --everything

Example 2: WES Germline Variant Calling (Cohort)

#!/bin/bash
# Cohort analysis with GVCF workflow

REFERENCE="/path/to/hg38.fa"
TARGETS="exome_targets.bed"
SAMPLES=("sample1" "sample2" "sample3")

# Generate GVCF for each sample
for SAMPLE in "${SAMPLES[@]}"; do
  gatk HaplotypeCaller \
    -R ${REFERENCE} \
    -I ${SAMPLE}.recal.bam \
    -L ${TARGETS} \
    -O ${SAMPLE}.g.vcf.gz \
    -ERC GVCF \
    --dragen-mode true
done

# Create sample map
echo -e "sample1\tsample1.g.vcf.gz" > samples.map
echo -e "sample2\tsample2.g.vcf.gz" >> samples.map
echo -e "sample3\tsample3.g.vcf.gz" >> samples.map

# Consolidate GVCFs into a GenomicsDB workspace, then joint-genotype
gatk GenomicsDBImport \
  --sample-name-map samples.map \
  --genomicsdb-workspace-path cohort_db \
  -L ${TARGETS}

gatk GenotypeGVCFs \
  -R ${REFERENCE} \
  -V gendb://cohort_db \
  -O cohort.vcf.gz

# VQSR (if sufficient samples)
# ... VQSR commands ...

# Annotation
vep -i cohort.vcf.gz \
  -o cohort.annotated.vcf \
  --cache --dir_cache /path/to/vep_cache \
  --assembly GRCh38 \
  --format vcf \
  --vcf \
  --everything

Example 3: WGS Somatic Variant Calling (Tumor-Normal Pair)

#!/bin/bash
# Tumor-normal pair somatic calling

REFERENCE="/path/to/hg38.fa"
TUMOR="tumor.recal.bam"
NORMAL="normal.recal.bam"

# Mutect2 calling
# (normal_sample must match the SM read-group tag of the normal BAM;
#  the tumor sample is detected automatically in GATK 4.1+)
gatk Mutect2 \
  -R ${REFERENCE} \
  -I ${TUMOR} \
  -I ${NORMAL} \
  -normal normal_sample \
  -O mutect2.vcf.gz

# Filter
gatk FilterMutectCalls \
  -R ${REFERENCE} \
  -V mutect2.vcf.gz \
  -O mutect2.filtered.vcf.gz

# Additional hard filters (AF is a per-sample FORMAT field in Mutect2 output)
bcftools filter -i 'FILTER="PASS" && INFO/DP>=10 && FORMAT/AF>=0.05' \
  mutect2.filtered.vcf.gz \
  -O z -o mutect2.hardfiltered.vcf.gz

# Annotation
vep -i mutect2.hardfiltered.vcf.gz \
  -o mutect2.annotated.vcf \
  --cache --dir_cache /path/to/vep_cache \
  --assembly GRCh38 \
  --format vcf \
  --vcf \
  --custom /path/to/cosmic.vcf.gz,COSMIC,vcf,exact,0 \
  --plugin AlphaMissense,/path/to/AlphaMissense_hg38.tsv.gz

Example 4: WES Somatic Variant Calling (Tumor-Only with PoN)

#!/bin/bash
# Tumor-only calling with Panel of Normals

REFERENCE="/path/to/hg38.fa"
TARGETS="exome_targets.bed"
TUMOR="tumor.recal.bam"
PON="pon.vcf.gz"
GNOMAD="gnomad.vcf.gz"

# Get pileup summaries
gatk GetPileupSummaries \
  -R ${REFERENCE} \
  -I ${TUMOR} \
  -V ${GNOMAD} \
  -L ${TARGETS} \
  -O tumor.pileups.table

# Calculate contamination
gatk CalculateContamination \
  -I tumor.pileups.table \
  -O tumor.contamination.table

# Mutect2 with PoN (tumor sample is detected automatically in GATK 4.1+)
gatk Mutect2 \
  -R ${REFERENCE} \
  -I ${TUMOR} \
  --germline-resource ${GNOMAD} \
  --panel-of-normals ${PON} \
  -L ${TARGETS} \
  -O mutect2.vcf.gz

# Filter with contamination table
gatk FilterMutectCalls \
  -R ${REFERENCE} \
  -V mutect2.vcf.gz \
  -O mutect2.filtered.vcf.gz \
  --contamination-table tumor.contamination.table

Example 5: Coding Variant Annotation with AI Tools

#!/bin/bash
# Complete AI-based annotation workflow

VCF="coding_variants.vcf.gz"

# Step 1: VEP with AlphaMissense
vep -i ${VCF} \
  -o variants.alphamissense.vcf \
  --cache --dir_cache /path/to/vep_cache \
  --assembly GRCh38 \
  --format vcf \
  --vcf \
  --plugin AlphaMissense,/path/to/AlphaMissense_hg38.tsv.gz

# Step 2: SNPeffect for structural impact
snpeffect5 --variants variants.alphamissense.vcf \
  --output snpeffect_results.tsv \
  --structure_db /path/to/alphafold_db/ \
  --min_plddt 70 \
  --threads 8

# Step 3: Combine results
# (Custom script to merge VEP and SNPeffect outputs)

# Step 4: Prioritize
# Filter for high-confidence pathogenic variants:
# - AlphaMissense > 0.5
# - FoldX ΔΔG > 0.5 (destabilizing)
# - pLDDT >= 70 (high confidence structure)
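
A minimal sketch of that structural filter with awk, assuming snpeffect_results.tsv has a header line and tab-separated columns gene, variant, ddG, pLDDT in that order (the real column layout should be checked and the indices adjusted; the AlphaMissense cutoff would be applied separately to the VEP output, e.g. with filter_vep):

# Keep destabilizing variants on high-confidence structures
awk -F'\t' 'NR > 1 && $3 > 0.5 && $4 >= 70' snpeffect_results.tsv > prioritized_structural.tsv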

Example 6: Pathway Enrichment Analysis

# R script for pathway enrichment

library(clusterProfiler)
library(org.Hs.eg.db)
library(ReactomePA)

# Read genes with variants (moderate/high impact)
genes_with_variants <- read.table("genes_with_variants.txt", 
                                   header=FALSE, 
                                   stringsAsFactors=FALSE)$V1

# Convert to ENTREZ IDs
gene_ids <- bitr(genes_with_variants, 
                  fromType="SYMBOL", 
                  toType="ENTREZID", 
                  OrgDb=org.Hs.eg.db)

# GO enrichment
go_enrich <- enrichGO(gene = gene_ids$ENTREZID,
                      OrgDb = org.Hs.eg.db,
                      ont = "BP",
                      pAdjustMethod = "BH",
                      pvalueCutoff = 0.05,
                      qvalueCutoff = 0.2,
                      readable = TRUE)

# KEGG enrichment
kegg_enrich <- enrichKEGG(gene = gene_ids$ENTREZID,
                          organism = "hsa",
                          pvalueCutoff = 0.05)

# Reactome enrichment
reactome_enrich <- enrichPathway(gene = gene_ids$ENTREZID,
                                 organism = "human",
                                 pvalueCutoff = 0.05)

# Save results
write.csv(as.data.frame(go_enrich), "go_enrichment.csv")
write.csv(as.data.frame(kegg_enrich), "kegg_enrichment.csv")
write.csv(as.data.frame(reactome_enrich), "reactome_enrichment.csv")

Example 7: Non-coding Variant Annotation with ENCODE cCRE

#!/bin/bash
# ENCODE cCRE annotation workflow

VCF="non_coding_variants.vcf.gz"
CCRE_BED="ENCODE_cCREs.hg38.v4.bed.gz"

# Method 1: VEP custom annotation
vep -i ${VCF} \
  -o variants.ccre.vcf \
  --cache --dir_cache /path/to/vep_cache \
  --assembly GRCh38 \
  --format vcf \
  --vcf \
  --custom ${CCRE_BED},ENCODE_cCRE,bed,overlap,0

# Method 2: bedtools intersect (for detailed analysis)
bcftools query -f '%CHROM\t%POS0\t%END\n' ${VCF} > variants.bed

bedtools intersect -a variants.bed \
  -b ${CCRE_BED} \
  -wa -wb > variants_in_ccres.txt

# Filter for promoter-like and enhancer-like cCREs
# (class labels may be encoded as PLS/pELS/dELS depending on the cCRE file)
grep -Ei "PLS|ELS|promoter|enhancer" variants_in_ccres.txt > variants_regulatory.txt

Troubleshooting and Best Practices


Reference Genome Consistency

Critical: Use the same reference genome version (decoyed + alt-masked) throughout the entire pipeline. See Reference Genome Considerations in the Introduction for details.

Common issues:

  • Coordinate mismatches between alignment and variant calling
  • Dropped variants due to reference version differences
  • Incorrect annotations from coordinate mismatches

Decoy and ALT masking: Always use for WGS (critical), recommended for WES (improves accuracy). See Reference Genome Considerations section in the Introduction for detailed rationale.

Common Errors and Solutions

Error: “Hash table not found”

  • Solution: Verify hash table path and that it was built with the correct reference

Error: “Reference mismatch”

  • Solution: Ensure all tools use the same reference genome version

Error: “Low mapping rate”

  • Solution: Check read quality, adapter contamination, reference genome version

Error: “VQSR fails with insufficient samples”

  • Solution: Use hard filtering instead of VQSR for small cohorts (<30 samples)

Performance Optimization

DRAGMAP alignment:

  • Use multiple threads: -t parameter
  • Ensure sufficient memory (64GB+ for WGS)

GATK variant calling:

  • Use --java-options "-Xmx32G" for memory allocation
  • Parallelize by chromosome when possible
  • Use GVCF workflow for cohorts (more efficient)
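
For example, HaplotypeCaller can be scattered per chromosome (a sketch; each iteration can be submitted as a separate cluster job, and the per-chromosome GVCFs are then combined or imported into GenomicsDB before joint genotyping):

for CHR in chr{1..22} chrX chrY; do
  gatk --java-options "-Xmx32G" HaplotypeCaller \
    -R ${REFERENCE} \
    -I ${SAMPLE}.recal.bam \
    -L ${CHR} \
    -O ${SAMPLE}.${CHR}.g.vcf.gz \
    -ERC GVCF
done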

VEP annotation:

  • Use --offline mode with local cache
  • Parallelize with --fork option
  • Use compressed input/output for disk space
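
A sketch combining these options (paths are placeholders):

vep -i variants.vcf.gz \
  -o variants.annotated.vcf.gz \
  --cache --dir_cache /path/to/vep_cache \
  --offline \
  --assembly GRCh38 \
  --format vcf \
  --vcf \
  --fork 8 \
  --compress_output bgzip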

Resource Requirements

Stage Memory CPU Disk
DRAGMAP alignment 32-64GB 8+ cores 100GB+
BQSR 16-32GB 4+ cores 50GB+
HaplotypeCaller 32-64GB 8+ cores 50GB+
Mutect2 32-64GB 8+ cores 50GB+
VEP annotation 16-32GB 4+ cores 100GB+ (with cache)

Quality Control Checkpoints

After alignment:

  • Mapping rate >95%
  • Mean coverage meets target
  • Duplication rate acceptable
  • Proper read group information
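
These can be checked quickly with samtools on the recalibrated BAM, for example:

# Mapping and duplication rates
samtools flagstat sample1.recal.bam

# Per-contig depth and breadth of coverage
samtools coverage sample1.recal.bam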

After variant calling:

  • Variant count reasonable (not too high/low)
  • Ti/Tv ratio ~2.0-2.1 (for germline)
  • PASS filter rate acceptable
  • No obvious technical artifacts
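
bcftools stats reports variant counts and the Ti/Tv ratio directly, for example:

# Summary numbers (SN) and transition/transversion ratio (TSTV)
bcftools stats sample1.filtered.vcf.gz | grep -E "^SN|^TSTV"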

After annotation:

  • Annotation success rate >95%
  • Expected annotation fields present
  • No coordinate mismatches
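
A rough annotation-rate check on the VEP output (a sketch; assumes VCF output with CSQ INFO entries):

total=$(grep -vc "^#" sample1.annotated.vcf)
annotated=$(grep -v "^#" sample1.annotated.vcf | grep -c "CSQ=")
echo "Annotated records: ${annotated}/${total}"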

References and Resources


Official Documentation

  • DRAGMAP: https://github.com/Illumina/DRAGMAP
  • GATK: https://gatk.broadinstitute.org/
  • VEP: https://ensembl.org/info/docs/tools/vep/
  • AlphaFold: https://alphafold.ebi.ac.uk/
  • ENCODE: https://www.encodeproject.org/

AI Variant Effect Prediction Resources

  • AlphaMissense: Database and VEP plugin - https://www.nature.com/articles/s41586-023-06345-5
  • AlphaFold DB: https://alphafold.ebi.ac.uk/
  • FoldX: Protein stability prediction - https://foldxsuite.crg.eu/
  • SNPeffect 5.0: Integrated pipeline - https://github.com/vibbits/snpeffect
  • EVE: Evolutionary model - https://evemodel.org/
  • REVEL: Ensemble predictor - Available via dbNSFP
  • PrimateAI: Deep learning model - Available via dbNSFP

Pathway Enrichment Resources

  • GSEA: https://www.gsea-msigdb.org/gsea/index.jsp
  • Enrichr: https://maayanlab.cloud/Enrichr/
  • DAVID: https://davidbioinformatics.nih.gov/
  • clusterProfiler: Bioconductor R package
  • MSigDB: https://www.gsea-msigdb.org/gsea/msigdb/
  • KEGG: https://www.genome.jp/kegg/
  • Reactome: https://reactome.org/
  • VarSAn: Variant set enrichment analysis tool

ENCODE cCRE Resources

  • ENCODE cCRE Registry: https://www.encodeproject.org/data/annotations/
  • SCREEN portal: https://screen.wenglab.org/
  • ENCODE FTP: For downloading cCRE BED files
  • cCRE documentation: Classification and usage guidelines

Key Papers

  1. DRAGEN/DRAGMAP: Behera, S., Catreux, S., Rossi, M. et al. Comprehensive genome analysis and variant detection at scale using DRAGEN. Nat Biotechnol 43, 1177–1191 (2025). https://doi.org/10.1038/s41587-024-02382-1
  2. GATK Best Practices: Van der Auwera & O’Connor (2020) Genomics in the Cloud
  3. AlphaMissense: Cheng et al. (2023) Nature - AlphaMissense predicts pathogenicity
  4. ENCODE cCRE: Moore et al. (2020) Nature - Expanded ENCODE annotations
  5. Pathway Enrichment: Subramanian et al. (2005) PNAS - GSEA method

Additional Tools and Databases

  • gnomAD: Population frequency database - https://gnomad.broadinstitute.org/
  • ClinVar: Clinical significance database - https://www.ncbi.nlm.nih.gov/clinvar/
  • COSMIC: Cancer mutation database - https://cancer.sanger.ac.uk/cosmic
  • OncoKB: Cancer therapeutic database - https://www.oncokb.org/
  • dbNSFP: Functional prediction database - https://www.dbnsfp.org/
  • Codon Usage Databases:
    • CUTG (Codon Usage Database, hosted by Kazusa): http://www.kazusa.or.jp/codon/ - species-specific codon usage tables
  • Codon Usage Analysis Tools:
    • CodonW: Codon usage analysis software
    • CAI (Codon Adaptation Index): Measures codon usage bias
    • tAI (tRNA Adaptation Index): Considers tRNA abundance in codon optimality

Conclusion


Key takeaways from this tutorial:

  1. Reference preparation is critical: Use decoyed and alt-masked references consistently
  2. DRAGEN features enhance accuracy: Enable DRAGEN mode for improved variant calling
  3. AI tools provide powerful annotation: AlphaMissense, AlphaFold, and FoldX offer structural insights
  4. Pathway enrichment reveals biology: Connect variants to biological function through enrichment analysis
  5. Non-coding variants matter: ENCODE cCRE annotation helps prioritize regulatory variants

For more advanced prioritization strategies and multi-tool consensus approaches, see the comprehensive variant analysis guidelines.