Introduction
This tutorial provides a step-by-step guide for using DRAGMAP (DRAGEN open-source mapper) for read alignment and GATK (Genome Analysis Toolkit) for variant calling from whole genome sequencing (WGS) and whole exome sequencing (WES) data. It covers both somatic and germline variant calling workflows with command examples, filtering strategies, annotation using AI-based tools, and prioritization methods.
What is DRAGMAP?
DRAGMAP is the open-source software implementation of Illumina’s DRAGEN (Dynamic Read Analysis for GENomics) mapper/aligner. It provides fast and accurate alignment of sequencing reads to reference genomes, with support for:
- Hash table-based alignment: Efficient mapping using pre-built hash tables
- Decoy sequence support: Reduces false positives from misaligned reads
- ALT haplotype masking: Minimizes ambiguous mapping in complex genomic regions
- Pangenome references: Graph-augmented references for improved mapping in difficult regions (DRAGEN v4.3+)
- Hardware-agnostic: Runs on standard CPU hardware (unlike the DRAGEN platform, which requires FPGA hardware)
DRAGMAP is particularly well-suited for:
- Large-scale sequencing projects requiring fast alignment
- Somatic variant calling where mapping accuracy is critical
- Germline cohort studies (can use pangenome references for improved accuracy)
- Projects requiring DRAGEN-compatible alignment outputs
What is GATK?
GATK (Genome Analysis Toolkit) is a comprehensive suite of tools for variant discovery and genotyping developed by the Broad Institute. Key features include:
- HaplotypeCaller: Industry-standard tool for germline variant calling
- Mutect2: Specialized tool for somatic variant calling
- DRAGEN-GATK features: Software implementation of DRAGEN algorithms for standard hardware since GATK 4.2
- Variant Quality Score Recalibration (VQSR): Machine learning-based variant filtering
- Best Practices workflows: Well-established pipelines for variant analysis
When to Use DRAGMAP vs BWA-MEM
Use DRAGMAP when:
- You need DRAGEN-compatible alignment outputs
- You’re working with large-scale projects and need fast alignment
- You want to leverage decoy sequences and ALT masking for improved accuracy
- You’re performing somatic variant calling where mapping specificity is critical
Use BWA-MEM when:
- You need a widely-adopted, well-documented aligner
- You’re working with standard research workflows
- You prefer a simpler installation and setup process
- You’re aligning to non-standard reference genomes
When to Use GATK for Somatic vs Germline Variant Calling
For Germline Variant Calling:
- HaplotypeCaller: Use for single-sample or cohort analysis
- GVCF workflow: Recommended for cohort studies (more efficient than per-sample VCF)
- GenotypeGVCFs: Joint genotyping across multiple samples
For Somatic Variant Calling:
- Mutect2: Primary tool for tumor-normal pairs or tumor-only samples
- Panel of Normals (PoN): Essential for tumor-only mode
- FilterMutectCalls: Somatic-specific filtering
Prerequisites and Software Installation
Before starting, ensure you have the following installed:
Required Software:
- DRAGMAP: Available via Bioconda (`conda install dragmap`) or build from source
- GATK: Download from the GATK website (version 4.2+ recommended)
- samtools: For BAM file manipulation (`conda install samtools`)
- bcftools: For VCF manipulation (`conda install bcftools`)
- VEP (Variant Effect Predictor): For variant annotation (`conda install ensembl-vep`)
Reference Genome:
- GRCh38/hg38: Recommended standard reference genome
- Decoy sequences: Include hs38d1 or similar decoy sequences
- ALT masking: Use alt-masked reference for optimal results
System Requirements:
- Memory: Minimum 32GB RAM (64GB+ recommended for WGS)
- CPU: Multi-core processor (8+ cores recommended)
- Storage: Sufficient space for reference genome, hash tables, and intermediate files
Reference Genome Considerations
Why Use Decoyed Reference? Decoy sequences (e.g., hs38d1) help “soak up” reads from repetitive or unplaced regions that would otherwise mis-map to the primary assembly. This reduces false positive variant calls, especially important for:
- Somatic variant calling (low VAF variants)
- Structural variant detection
- Regions with high sequence similarity
Why Use ALT-Masked Reference? ALT haplotype assemblies in GRCh38 represent alternate sequences that differ from the primary assembly. Masking regions of ALT contigs that are highly similar to the primary assembly:
- Reduces ambiguous mapping
- Improves mapping quality scores
- Prevents false variant calls from mapping artifacts
- Critical for accurate variant calling in complex genomic regions
Reference Consistency: It is critical to use the same reference genome version (decoyed + alt-masked) throughout the entire pipeline:
- Alignment (DRAGMAP)
- Variant calling (GATK)
- Annotation (VEP)
- Filtering and prioritization
Mixing reference versions can lead to coordinate mismatches, dropped variants, or incorrect annotations.
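A quick way to catch reference mismatches early is to compare contig names and lengths across the pipeline's inputs. A minimal sketch, assuming the file names used elsewhere in this tutorial (reference.dict, sample.sorted.bam):

# Compare @SQ contig name/length pairs between the reference dict and a BAM header
grep '^@SQ' reference.dict | cut -f2,3 | sort > ref_contigs.txt
samtools view -H sample.sorted.bam | grep '^@SQ' | cut -f2,3 | sort > bam_contigs.txt
diff ref_contigs.txt bam_contigs.txt && echo "BAM header matches reference dict"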
Workflow Overview
The following diagram illustrates the workflow from raw FASTQ files to prioritized variants:
graph TB
Start["Raw FASTQ Files"] --> Prep["Stage 1: Reference Preparation<br/>Decoyed + Alt-masked Reference<br/>DRAGMAP Hash Table"]
Prep --> Align["Stage 2: DRAGMAP Alignment<br/>Single/Paired-end Reads<br/>WGS/WES"]
Align --> Process["Stage 3: Post-Alignment Processing<br/>MarkDuplicates, BQSR"]
Process --> QC["Stage 4: Quality Control<br/>Coverage Metrics, Alignment QC"]
QC --> SNVIndel["Stage 5: SNV/Indel Calling<br/>HaplotypeCaller (Germline)<br/>Mutect2 (Somatic)"]
QC --> SVCNV["Stage 6: SV & CNV Calling<br/>Structural Variants<br/>Copy Number Variants"]
SNVIndel --> Filter["Stage 7: Variant Filtering<br/>VQSR, FilterMutectCalls<br/>SV/CNV Quality Filters"]
SVCNV --> Filter
Filter --> Annotate["Stage 8: Variant Annotation<br/>VEP, AI Tools, ENCODE cCRE<br/>SV/CNV Annotation"]
Annotate --> Prioritize["Stage 9: Variant Prioritization<br/>Pathway Enrichment, AI Scores<br/>Regulatory Annotation"]
Prioritize --> End["Prioritized Variants Report"]
style Start fill:#0066cc,stroke:#003366,stroke-width:3px,color:#ffffff
style End fill:#006600,stroke:#003300,stroke-width:3px,color:#ffffff
style Prep fill:#ffcc00,stroke:#cc9900,stroke-width:2px,color:#000000
style QC fill:#ffcc00,stroke:#cc9900,stroke-width:2px,color:#000000
style SNVIndel fill:#0099cc,stroke:#006699,stroke-width:2px,color:#ffffff
style SVCNV fill:#cc0066,stroke:#990033,stroke-width:2px,color:#ffffff
style Filter fill:#ff9900,stroke:#cc6600,stroke-width:2px,color:#000000
style Annotate fill:#9966cc,stroke:#663399,stroke-width:2px,color:#ffffff
style Prioritize fill:#9966cc,stroke:#663399,stroke-width:2px,color:#ffffff
Key Workflow Components
- Reference Preparation: Build decoyed and alt-masked reference, create DRAGMAP hash table
- Alignment: Map reads using DRAGMAP with decoy and alt masking support
- Post-Alignment Processing: Mark duplicates, perform base quality score recalibration
- Quality Control: Assess alignment quality and coverage metrics
- SNV/Indel Variant Calling:
- Germline: HaplotypeCaller with DRAGEN features
- Somatic: Mutect2 with DRAGEN features
- SV and CNV Calling: Structural variant and copy number variant detection using GATK tools
- Variant Filtering: Quality-based filtering for SNVs/indels, SVs, and CNVs
- Germline: VQSR for SNVs/indels
- Somatic: FilterMutectCalls for SNVs/indels
- SV/CNV: Quality and size-based filtering
- Variant Annotation:
- Coding variants: VEP with AI tools (AlphaMissense, AlphaFold, FoldX) and SNPeffect 5.0
- Non-coding variants: ENCODE cCRE annotation
- SV annotation: AnnotSV, VEP with SV support
- CNV annotation: Gene overlap and functional impact
- Variant Prioritization: Pathway enrichment, AI-based scoring, regulatory impact assessment
WGS vs WES Considerations
| Aspect | WGS | WES |
|---|---|---|
| Target BED file | Not required | Required for QC and optional variant calling restriction |
| Coverage metrics | CollectWgsMetrics | CollectHsMetrics with BED |
| Mean coverage | 30-60x (germline), 60-100x (somatic) | 100-150x in targets |
| On-target % | N/A | ≥80% recommended |
| Variant calling scope | Genome-wide | Optionally restrict to target regions |
| Non-coding variants | Comprehensive detection | Limited to exonic and nearby regions |
| Reference preparation | Full genome with decoys | Same, but target BED needed for QC |
Stage 1: Reference Genome Preparation
Reference Genome Selection
Recommended: GRCh38/hg38 with decoy sequences and ALT masking
Reference options:
- Linear reference with decoys and ALT masking (standard, recommended for most workflows)
- Pangenome reference (recommended for large-scale germline cohort studies, see Pangenome Reference Support section below)
Why GRCh38/hg38?
- Current standard reference genome
- Best annotation support
- Compatible with all major databases (gnomAD, ClinVar, COSMIC)
- Well-supported by DRAGMAP and GATK
- Prebuilt pangenome references available
Downloading Reference Genome
Option 1: From GATK Resource Bundle (Recommended)
The GATK resource bundle provides pre-formatted reference genomes with decoys. Download from the official GATK resource bundle:
# Download GATK resource bundle using gsutil (Google Cloud SDK)
# First, install gsutil if not already installed
# Then download the reference files:
gsutil -m cp -r gs://genomics-public-data/resources/broad/hg38/v0/ .
# Or download specific files:
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta .
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta.fai .
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dict .
# Alternative: Use wget with the public URL (if accessible)
# Check GATK documentation for current public URLs:
# https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle
Option 2: From NCBI (Build Your Own with Decoys)
# Download primary assembly from NCBI
# https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.39
# Download decoy sequences (hs38d1)
# Combine primary assembly with decoy sequences to create hs38d1 reference
Option 3: From GENCODE or Ensembl
# GENCODE: https://www.gencodegenes.org/human/
# Ensembl: https://www.ensembl.org/Homo_sapiens/Info/Index
# Download GRCh38 primary assembly and add decoy sequences manually
Note: Always verify the reference genome version and ensure it matches your annotation databases. Check the GATK resource bundle documentation for the most current download instructions.
Obtaining ALT Mask BED File
ALT mask BED files specify regions in ALT contigs that should be masked (replaced with Ns) to reduce ambiguous mapping. DRAGMAP can auto-detect and apply default masks for standard references, or you can provide a custom mask file.
Sources for ALT mask BED files:
- DRAGMAP repository: Check the `fasta_mask` directory in the DRAGMAP GitHub repository
- Illumina documentation: DRAGEN reference preparation guides
- GATK resource bundle: May include alt mask files
Example locations:
# Check DRAGMAP repository for alt mask files
# https://github.com/Illumina/DRAGMAP/tree/master/fasta_mask
# Example: hg38 alt mask file
# fasta_mask/hg38_alt_mask.bed
Building DRAGMAP Hash Table
The hash table is a pre-computed index that DRAGMAP uses for fast alignment. It should be built with the same reference genome (including decoys and alt masking) that you’ll use for alignment.
Basic hash table building:
# Basic hash table building (DRAGMAP will auto-detect decoys if present)
dragen-os --build-hash-table true \
--ht-reference reference.fasta \
--output-directory /path/to/reference/ \
--output-file-prefix dragmap.hg38
Hash table with explicit decoy sequences:
# If decoys are in separate file, specify explicitly
dragen-os --build-hash-table true \
--ht-reference reference.fasta \
--ht-decoys decoys.fasta \
--output-directory /path/to/reference/ \
--output-file-prefix dragmap.hg38_with_decoys
Hash table with ALT masking:
# Build hash table with ALT mask BED file
dragen-os --build-hash-table true \
--ht-reference hg38.fa \
--output-directory /path/to/reference/ \
--output-file-prefix dragmap.hg38_alt_masked \
--ht-mask-bed fasta_mask/hg38_alt_mask.bed
Hash table with both decoys and ALT masking:
# Combined: decoys + ALT masking (recommended)
dragen-os --build-hash-table true \
--ht-reference hg38.fa \
--ht-decoys hs38d1.fa \
--output-directory /path/to/reference/ \
--output-file-prefix dragmap.hg38_decoys_alt_masked \
--ht-mask-bed fasta_mask/hg38_alt_mask.bed
Automatic detection: For standard references (hg38, hg19), DRAGMAP can automatically detect and apply default decoy sequences and ALT masks if they’re not explicitly specified. However, explicitly providing them ensures consistency.
Pangenome Reference Support
DRAGMAP supports pangenome references! DRAGMAP can use graph-augmented pangenome references that incorporate alternate variant paths from population cohorts, improving mapping accuracy especially in difficult-to-map regions.
What is a Pangenome Reference?
A pangenome reference in DRAGMAP context refers to:
- A linear reference scaffold (e.g., GRCh38) augmented with alternate variant paths from population VCFs
- Internally uses a graph structure for multigenome mapping
- Output remains in standard linear reference coordinates (BAM/VCF compatible with standard tools)
- DRAGEN v4.3+ includes a prebuilt pangenome based on 128 human assemblies from 26 ancestries
Benefits of Pangenome References
Recommended for:
- Human germline analyses: Improved mapping in highly polymorphic regions
- Difficult-to-map regions: Better read placement in structurally complex regions
- Reducing reference bias: More accurate variant calling across diverse ancestries
- High coverage in ancestral diversity: Better representation of population variation
Less recommended for:
- Somatic variant calling: Linear references are typically preferred for somatic workflows
- Non-human organisms: Prebuilt pangenomes are primarily for human; custom pangenomes can be built
Using Prebuilt Pangenome References
DRAGEN/DRAGMAP provides prebuilt pangenome hash tables for major human assemblies:
- hg19
- hs37d5
- hg38
- T2T-CHM13v2.0
Download prebuilt pangenome references:
# Check Illumina/DRAGEN documentation for prebuilt pangenome hash tables
# These are large files and require significant storage space
# Prebuilt pangenomes are optimized for human germline analysis
Building Custom Pangenome References
You can build custom pangenome references using population VCFs:
# Build pangenome hash table with multi-sample VCF (msVCF)
dragen-os --build-hash-table true \
--ht-reference reference.fasta \
--ht-graph-msvcf-file population_variants.msvcf.gz \
--ht-graph-extra-kmer-bed graph_regions.bed \
--ht-mask-bed mask.bed \
--output-directory /path/to/pangenome_reference/ \
--output-file-prefix dragmap.pangenome.hg38
Requirements for custom pangenome:
- Multi-sample VCF (msVCF): Population variants from cohort analysis
- Phasing information: Phased variants improve pangenome quality (long-read assemblies help)
- BED files: Optional region inclusion/exclusion or masking
- Computational resources: Building pangenomes with many samples (>256) requires significant memory and time
Using Pangenome References for Alignment
Once you have a pangenome hash table, use it like a standard reference:
# Alignment with pangenome reference
dragen-os -r /path/to/pangenome_reference/ \
-1 sample_R1.fastq.gz \
-2 sample_R2.fastq.gz \
--output-directory /path/to/output/ \
--output-file-prefix sample
# DRAGMAP automatically uses multigenome mapper when pangenome hash table is detected
# Output BAM files are in standard linear reference coordinates
Considerations and Limitations
When to use pangenome references:
- Large-scale germline cohort studies
- Studies with diverse ancestral backgrounds
- Projects focusing on difficult-to-map regions
- When reference bias is a concern
When to use linear references:
- Somatic variant calling workflows
- Standard germline analysis (linear references work well)
- Non-human organisms (unless custom pangenome built)
- When computational resources are limited
Important notes:
- Pangenome references are primarily validated for germline workflows
- Some DRAGEN components may not have fully validated accuracy with pangenomes for somatic workflows
- Check DRAGEN documentation for component-specific pangenome support
- Output coordinates remain in linear reference space, ensuring compatibility with standard tools
Preparing Known Sites for BQSR
Base Quality Score Recalibration (BQSR) requires known variant sites to model systematic errors. Download these resources from the GATK resource bundle:
# Create directory for known sites
mkdir -p /path/to/known_sites
cd /path/to/known_sites
# Download using gsutil (Google Cloud SDK)
# Install gsutil if needed: https://cloud.google.com/storage/docs/gsutil_install
# Download dbSNP (common variants)
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf .
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.idx .
# Download Mills and 1000G gold standard indels
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz .
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi .
# Download 1000G phase 1 high-confidence SNPs (optional, for VQSR)
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz .
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi .
# Alternative: Download entire resource bundle directory
gsutil -m cp -r gs://genomics-public-data/resources/broad/hg38/v0/ .
# For current URLs and additional resources, check:
# https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle
WES Target BED File Preparation
For WES data, you’ll need a BED file specifying the exome target regions:
# Download target BED file from capture kit manufacturer
# Example: Illumina Nextera Rapid Capture Exome
# Or use standard exome target files from:
# - Gencode
# - RefSeq
# - Capture kit manufacturer website
# Ensure BED file is in hg38 coordinates
# If in hg19, use liftOver to convert:
liftOver exome_targets.hg19.bed hg19ToHg38.over.chain.gz exome_targets.hg38.bed unmapped.bed
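liftOver output can be unsorted and may contain overlapping intervals; as a precaution, sort and merge the lifted BED before using it downstream:

# Sort and merge the lifted target BED
sort -k1,1 -k2,2n exome_targets.hg38.bed | bedtools merge -i - > exome_targets.hg38.merged.bed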
Reference Consistency Checklist
Before proceeding, verify:
- Same reference genome version used for hash table building
- Decoy sequences included (if using)
- ALT masking applied (if using)
- Known sites match reference genome build (hg38)
- WES target BED file matches reference build (if applicable)
- All files indexed properly (.fai, .dict, .tbi files present)
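If any index files are missing, they can be created with standard commands (file names follow the examples above):

# Create missing index/dictionary files
samtools faidx reference.fa                      # writes reference.fa.fai
gatk CreateSequenceDictionary -R reference.fa    # writes reference.dict
tabix -p vcf known_sites.vcf.gz                  # writes the .tbi for a bgzipped VCF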
Stage 2: DRAGMAP Alignment
Note: Before alignment, perform quality control on raw sequencing data using tools like FastQC or MultiQC to assess read quality, adapter contamination, and sequencing metrics. This helps identify problematic samples and determine if adapter trimming or quality filtering is needed before alignment.
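A minimal pre-alignment QC pass might look like the following sketch (output directory names are arbitrary):

# Per-FASTQ quality reports, then an aggregated summary
mkdir -p fastqc_out
fastqc -t 8 -o fastqc_out sample_R1.fastq.gz sample_R2.fastq.gz
multiqc -o multiqc_out fastqc_out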
Installation
Option 1: Bioconda (Recommended)
# Install via Bioconda
conda install -c bioconda dragmap
# Verify installation
dragen-os --help
Option 2: Build from Source
# Clone DRAGMAP repository
git clone https://github.com/Illumina/DRAGMAP.git
cd DRAGMAP
# Build (requires C++17 compiler, Boost library)
make
# Binary will be in ./build/release/
# Optional: Install to /usr/bin/
make install
Alignment Commands
Paired-end reads (WGS or WES):
# Basic paired-end alignment
dragen-os -r /path/to/reference/ \
-1 sample_R1.fastq.gz \
-2 sample_R2.fastq.gz \
--output-directory /path/to/output/ \
--output-file-prefix sample
# With read group information
dragen-os -r /path/to/reference/ \
-1 sample_R1.fastq.gz \
-2 sample_R2.fastq.gz \
--RGID sample_lib1 \
--RGSM sample_name \
--RGPL ILLUMINA \
--RGLB lib1 \
--RGPU unit1 \
--output-directory /path/to/output/ \
--output-file-prefix sample
Single-end reads:
# Single-end alignment
dragen-os -r /path/to/reference/ \
-1 sample_R1.fastq.gz \
--output-directory /path/to/output/ \
--output-file-prefix sample
WGS vs WES considerations: Both use the same alignment command. Differences are in downstream QC (target BED file for WES) and optional variant calling restrictions. See the WGS vs WES Considerations table in Workflow Overview.
Post-Alignment Processing
After alignment, sort and index the output (DRAGMAP emits SAM; samtools sort accepts SAM input and writes BAM):
# Sort alignments into a coordinate-sorted BAM
samtools sort -@ 8 sample.sam -o sample.sorted.bam
# Index BAM file
samtools index sample.sorted.bam
# Verify BAM file
samtools view -H sample.sorted.bam | head -20
Why Use Decoyed + Alt-Masked Reference?
See the Reference Genome Considerations section in the Introduction for detailed explanation. Key benefits:
- Reduces false positives from mis-mapped reads
- Improves mapping specificity in complex regions
- Critical for somatic variant calling and structural variant detection
Comparison with BWA-MEM
See the When to Use DRAGMAP vs BWA-MEM section in the Introduction for detailed comparison. DRAGMAP offers built-in decoy and ALT masking support, while BWA-MEM requires manual setup.
Stage 3: Post-Alignment Processing
MarkDuplicates
Mark PCR and optical duplicates to avoid false positive variant calls:
# Mark duplicates using GATK
gatk MarkDuplicates \
-I sample.sorted.bam \
-O sample.markdup.bam \
-M sample.markdup.metrics.txt \
--CREATE_INDEX true
# Verify duplicate marking
samtools view -f 1024 sample.markdup.bam | head -5
Base Quality Score Recalibration (BQSR)
BQSR corrects systematic biases in base quality scores. Note: When using DRAGEN mode in variant calling, BQSR may be bypassed as DRAGEN’s error model already accounts for systematic biases (see BQSR Considerations section below for details).
First pass - Build recalibration table:
# For WGS (no target regions needed)
gatk BaseRecalibrator \
-R reference.fa \
-I sample.markdup.bam \
--known-sites dbsnp.vcf.gz \
--known-sites mills.vcf.gz \
--known-sites 1000G_phase1.snps.high_confidence.hg38.vcf.gz \
-O sample.recal.table
# For WES (optionally restrict to target regions for efficiency)
gatk BaseRecalibrator \
-R reference.fa \
-I sample.markdup.bam \
-L exome_targets.bed \
--known-sites dbsnp.vcf.gz \
--known-sites mills.vcf.gz \
-O sample.recal.table
Apply BQSR:
gatk ApplyBQSR \
-R reference.fa \
-I sample.markdup.bam \
-bqsr sample.recal.table \
-O sample.recal.bam \
--create-output-bam-index true
Second pass - Generate recalibration metrics (optional, for QC):
gatk BaseRecalibrator \
-R reference.fa \
-I sample.recal.bam \
--known-sites dbsnp.vcf.gz \
--known-sites mills.vcf.gz \
-O sample.recal.table2
# Compare before/after (optional visualization)
gatk AnalyzeCovariates \
-before sample.recal.table \
-after sample.recal.table2 \
-plots sample.recalibration_plots.pdf
BQSR and DRAGEN Mode: See BQSR Considerations section in Understanding DRAGEN-GATK Features for details.
Stage 4: Quality Control
Alignment Metrics Collection
For WGS:
gatk CollectWgsMetrics \
-R reference.fa \
-I sample.recal.bam \
-O sample.wgs_metrics.txt
For WES:
# CollectHsMetrics requires interval_list inputs; convert the target BED first
gatk BedToIntervalList \
-I exome_targets.bed \
-SD reference.dict \
-O exome_targets.interval_list
gatk CollectHsMetrics \
-R reference.fa \
-I sample.recal.bam \
-O sample.wes_metrics.txt \
--TARGET_INTERVALS exome_targets.interval_list \
--BAIT_INTERVALS exome_targets.interval_list
Coverage Analysis
Key metrics to check:
| Metric | WGS Target | WES Target |
|---|---|---|
| Mean coverage | 30-60x (germline), 60-100x (somatic) | 100-150x in targets |
| On-target % | N/A | ≥80% |
| Uniformity | >0.9 | >0.8 |
| Duplication rate | <20% | <30% |
Interpretation:
- Low coverage: May miss variants or have low confidence calls
- High duplication: May indicate library issues or low input
- Poor uniformity: Uneven coverage may affect variant detection
- Low on-target (WES): May indicate capture efficiency issues
Quality Thresholds Before Variant Calling
Pass criteria:
- Mean coverage meets target (see table above)
- Duplication rate acceptable
- Mapping rate >95%
- Proper read group information present
- BAM file properly sorted and indexed
- For WES: On-target percentage ≥80%
Fail samples that don’t meet QC thresholds before proceeding to variant calling.
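The mapping-rate criterion above can be checked quickly with samtools flagstat; the grep pattern below assumes recent samtools output wording:

# Mapping-rate check (should exceed ~95%)
samtools flagstat sample.recal.bam | grep 'primary mapped'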
Understanding DRAGEN-GATK Features
What is DRAGEN-GATK?
DRAGEN-GATK is a software implementation of DRAGEN algorithms that runs on standard hardware (CPU/GPU) without requiring DRAGEN FPGA hardware. It provides:
- Improved error models: Better handling of sequencing errors
- Foreign Read Detection (FRD): Identifies reads from contamination
- Base Quality Dropout (BQD): Models quality score degradation
- STR Error Correction (STRE): Improved handling of short tandem repeats
- DRAGstr model: Specialized model for STR regions
Two Operational Modes
1. Functional Equivalence Mode:
- Produces results equivalent to DRAGEN hardware output
- Use when you need DRAGEN-compatible results
- May disable some features for equivalence
2. Maximum Quality Mode:
- Prioritizes accuracy and variant call richness
- May produce more variants than DRAGEN hardware
- Use when accuracy is more important than equivalence
Key Command-Line Flags
| Flag | Description | Typical Values |
|---|---|---|
| `--dragen-mode true` | Main flag to enable DRAGEN features | `true` |
| `--use-pdhmm true/false` | Enables the Partially Determined HMM model | `true` for max quality, `false` for equivalence |
| `--disable-spanning-event-genotyping true/false` | Controls spanning event genotyping | `true` for equivalence, `false` for max quality |
| `--dragstr-params-path` | Path to DRAGstr model parameters file | Path to text file (generated by CalibrateDragstrModel) |
| `--dragstr-het-hom-ratio` | Controls heterozygous/homozygous prior ratio | 2 (default) |
DRAGstr Parameters
DRAGstr parameters are not pre-downloaded files - they must be generated using GATK’s CalibrateDragstrModel tool. These parameters contain tables (GOP, GCP, API) that model short tandem repeat (STR) regions for improved variant calling accuracy in DRAGEN mode.
Step 1: Generate STR table for your reference genome:
# Create STR table from reference genome
gatk ComposeSTRTableFile \
-R reference.fa \
-O str_table.zip
Step 2: Calibrate DRAGstr model:
# Run CalibrateDragstrModel on your BAM file(s)
# This generates the DRAGstr parameters file
gatk CalibrateDragstrModel \
-R reference.fa \
-I sample.recal.bam \
-str str_table.zip \
-O dragstr_params.txt
# The output file (dragstr_params.txt) contains:
# - GOP (Gap Open Penalty) table
# - GCP (Gap Continue Penalty) table
# - API (Allelic Pruning Index) table
Step 3: Use the generated parameters file:
# Use the generated DRAGstr parameters in HaplotypeCaller
gatk HaplotypeCaller \
-R reference.fa \
-I sample.recal.bam \
--dragen-mode true \
--dragstr-params-path dragstr_params.txt \
-O output.vcf.gz
Important notes:
- DRAGstr parameters are reference-specific - generate them for your specific reference genome
- The parameters file is a text file (not JSON) containing parameter tables
- You can reuse the same parameters file for multiple samples aligned to the same reference
- For best results, calibrate using high-quality BAM files from your project
BQSR Considerations
When using DRAGEN mode, DRAGEN’s error model already accounts for systematic biases, so BQSR may be bypassed. However, BQSR is still recommended for:
- Standard GATK workflows (non-DRAGEN mode)
- Ensuring compatibility with other tools
- QC purposes (to assess systematic biases)
Stage 5: SNV/Indel Variant Calling with GATK
This stage covers calling single nucleotide variants (SNVs) and small insertions/deletions (indels) for both germline and somatic analyses.
Germline Variant Calling
HaplotypeCaller - Single Sample Calling
Basic single sample calling:
# For WGS (no target regions needed)
gatk HaplotypeCaller \
-R reference.fa \
-I sample.recal.bam \
-O sample.g.vcf.gz \
-ERC GVCF
# For WES (optionally restrict to target regions for efficiency)
gatk HaplotypeCaller \
-R reference.fa \
-I sample.recal.bam \
-L exome_targets.bed \
-O sample.g.vcf.gz \
-ERC GVCF
Output formats:
- GVCF: Genomic VCF format (recommended for cohort analysis)
- VCF: Standard VCF format (for single sample)
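If only a final single-sample VCF is needed, the GVCF produced above can be genotyped directly, without joint genotyping; a minimal sketch:

# Single-sample genotyping from a GVCF
gatk GenotypeGVCFs \
-R reference.fa \
-V sample.g.vcf.gz \
-O sample.vcf.gz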
GVCF Workflow (Recommended for Cohorts)
Step 1: Generate GVCF for each sample:
# Generate GVCF for each sample
for sample in sample1 sample2 sample3; do
gatk HaplotypeCaller \
-R reference.fa \
-I ${sample}.recal.bam \
-O ${sample}.g.vcf.gz \
-ERC GVCF
done
Step 2: Joint genotyping:
# Create sample map file
echo -e "sample1\tsample1.g.vcf.gz" > samples.map
echo -e "sample2\tsample2.g.vcf.gz" >> samples.map
echo -e "sample3\tsample3.g.vcf.gz" >> samples.map
# Consolidate GVCFs (GenotypeGVCFs does not accept a sample map directly)
gatk GenomicsDBImport \
--sample-name-map samples.map \
--genomicsdb-workspace-path cohort_db \
-L intervals.list
# Joint genotyping on the consolidated workspace
gatk GenotypeGVCFs \
-R reference.fa \
-V gendb://cohort_db \
-O cohort.vcf.gz
Enabling DRAGEN Features in HaplotypeCaller
Functional Equivalence Mode:
gatk --java-options "-Xmx32G" HaplotypeCaller \
-R reference.fa \
-I sample.recal.bam \
-O sample-dragen-equiv.g.vcf.gz \
-ERC GVCF \
--dragen-mode true \
--use-pdhmm false \
--disable-spanning-event-genotyping true \
--dragstr-params-path /path/to/dragstr_params.txt \
--dragstr-het-hom-ratio 2
Maximum Quality Mode:
gatk --java-options "-Xmx32G" HaplotypeCaller \
-R reference.fa \
-I sample.recal.bam \
-O sample-dragen-maxquality.g.vcf.gz \
-ERC GVCF \
--dragen-mode true \
--use-pdhmm true \
--disable-spanning-event-genotyping false \
--dragstr-params-path /path/to/dragstr_params.txt \
--dragstr-het-hom-ratio 2
Key differences:
- Functional Equivalence: `--use-pdhmm false`, `--disable-spanning-event-genotyping true`
- Maximum Quality: `--use-pdhmm true`, `--disable-spanning-event-genotyping false`
Somatic Variant Calling
Mutect2 - Tumor-Normal Pair
Basic tumor-normal calling:
# For WGS (no target regions needed)
# The tumor sample is inferred from the BAM; -normal names the normal
# sample's SM read-group tag (--tumor-sample was removed in GATK 4.1+)
gatk Mutect2 \
-R reference.fa \
-I tumor.recal.bam \
-I normal.recal.bam \
-normal normal_sample \
-O mutect2.vcf.gz
# For WES (optionally restrict to target regions for efficiency)
gatk Mutect2 \
-R reference.fa \
-I tumor.recal.bam \
-I normal.recal.bam \
-normal normal_sample \
-L exome_targets.bed \
-O mutect2.vcf.gz
Mutect2 - Tumor-Only Mode
Step 1: Create Panel of Normals (PoN):
# Generate PoN: call each normal in tumor-only mode, consolidate with
# GenomicsDBImport, then create the PoN
gatk Mutect2 -R reference.fa -I normal1.recal.bam --max-mnp-distance 0 -O normal1.vcf.gz
# (repeat for normal2, normal3, ...)
gatk GenomicsDBImport -R reference.fa -L intervals.list \
--genomicsdb-workspace-path pon_db \
-V normal1.vcf.gz -V normal2.vcf.gz -V normal3.vcf.gz
gatk CreateSomaticPanelOfNormals -R reference.fa -V gendb://pon_db -O pon.vcf.gz
Step 2: Get pileup summaries:
# For tumor sample
gatk GetPileupSummaries \
-R reference.fa \
-I tumor.recal.bam \
-V gnomad.vcf.gz \
-L exome_targets.bed \
-O tumor.pileups.table
# For normal samples (if available for contamination estimation)
gatk GetPileupSummaries \
-R reference.fa \
-I normal.recal.bam \
-V gnomad.vcf.gz \
-L exome_targets.bed \
-O normal.pileups.table
Step 3: Calculate contamination:
# Tumor only
gatk CalculateContamination \
-I tumor.pileups.table \
-O tumor.contamination.table
# tumor-normal pairs
gatk CalculateContamination \
-I tumor.pileups.table \
-matched normal.pileups.table \
-O tumor.contamination.table
Step 4: Run Mutect2 with PoN:
gatk Mutect2 \
-R reference.fa \
-I tumor.recal.bam \
--germline-resource gnomad.vcf.gz \
--panel-of-normals pon.vcf.gz \
-O mutect2.vcf.gz
FilterMutectCalls
Apply somatic filtering:
gatk FilterMutectCalls \
-R reference.fa \
-V mutect2.vcf.gz \
-O mutect2.filtered.vcf.gz \
--contamination-table tumor.contamination.table
Enabling DRAGEN Features in Mutect2
With DRAGEN mode:
gatk Mutect2 \
-R reference.fa \
-I tumor.recal.bam \
-I normal.recal.bam \
-normal normal_sample \
-O mutect2-dragen.vcf.gz \
--dragen-mode true \
--use-pdhmm true \
--dragstr-params-path /path/to/dragstr_params.txt
Stage 6: Structural Variant (SV) and Copy Number Variant (CNV) Calling
This section covers calling structural variants (SVs) and copy number variants (CNVs) using GATK tools with DRAGMAP-aligned BAM files. SVs include deletions, duplications, inversions, and translocations, while CNVs refer to copy number gains and losses.
Structural Variant (SV) Calling
GATK provides the StructuralVariantDiscoveryPipeline (GATK-SV) for comprehensive SV detection. This pipeline integrates multiple SV callers and evidence types.
GATK-SV Pipeline Overview
GATK-SV is a cloud-based ensemble SV discovery pipeline that combines multiple SV detection methods:
- Manta: Paired-end and split-read evidence
- Wham: Read-depth and split-read evidence
- Scramble: Assembly-based detection
Input requirements:
- DRAGMAP-aligned BAM files (sorted and indexed)
- Reference genome (same as used for alignment)
- Optional: Target intervals for WES/panel data
Running GATK-SV (Basic Workflow)
Note: GATK-SV is typically run using WDL workflows. Here we show the key components:
# GATK-SV is typically run via WDL workflows
# Download GATK-SV workflows from: https://github.com/broadinstitute/gatk-sv
# Basic SV calling workflow (simplified example)
# The full pipeline requires WDL execution, but key steps include:
# 1. Extract evidence (paired-end, split-read, depth)
# 2. Run ensemble callers (Manta, Wham, Scramble)
# 3. Merge and filter SV calls
# 4. Annotate with population frequencies
# For detailed GATK-SV usage, see:
# https://broadinstitute.github.io/gatk-sv/docs/intro/
Alternative: Using Individual SV Callers
If you prefer to use individual SV callers directly:
Manta (recommended for paired-end data):
# Install Manta
# conda install -c bioconda manta
# Configure Manta
configManta.py \
--bam tumor.recal.bam \
--referenceFasta reference.fa \
--runDir manta_workdir
# Run Manta
python manta_workdir/runWorkflow.py
# Output: manta_workdir/results/variants/candidateSV.vcf.gz
Delly (for translocations and complex SVs):
# Install Delly
# conda install -c bioconda delly
# Germline SV calling
delly call \
-g reference.fa \
-o delly.bcf \
sample.recal.bam
# Somatic SV calling (tumor-normal)
delly call \
-g reference.fa \
-o delly.bcf \
-x exclude_regions.bed \
tumor.recal.bam \
normal.recal.bam
Lumpy (high sensitivity for complex rearrangements):
# Install Lumpy
# conda install -c bioconda lumpy-sv
# Extract discordant read pairs (sorted BAM)
samtools view -b -F 1294 tumor.recal.bam | \
samtools sort -o tumor.discordants.bam -
# Extract split reads (extractSplitReads_BwaMem ships with LUMPY)
samtools view -h tumor.recal.bam | \
extractSplitReads_BwaMem -i stdin | \
samtools view -Sb - | \
samtools sort -o tumor.splitters.bam -
# Run Lumpy
lumpyexpress \
-B tumor.recal.bam \
-S tumor.splitters.bam \
-D tumor.discordants.bam \
-o lumpy.vcf
Copy Number Variant (CNV) Calling
GATK provides two main approaches for CNV calling:
- gCNV: For germline CNV calling (cohort-based)
- Somatic CNV pipeline: For tumor CNV calling (tumor-normal or tumor-only)
Germline CNV Calling with gCNV
gCNV is a cohort-based tool that uses a panel of normals (PoN) to detect germline CNVs. It’s particularly useful for WES and targeted panel data.
For WES germline CNV
Step 1: Create Panel of Normals (PoN)
# Collect read counts for each normal sample
# Repeat for all normal samples in your cohort
gatk CollectReadCounts \
-I normal1.recal.bam \
-L exome_targets.bed \
-R reference.fa \
-format HDF5 \
-O normal1.counts.hdf5
# Repeat for normal2, normal3, etc.
# Build the panel of normals from the normal read counts
# (DenoiseReadCounts takes a single case sample; the PoN itself is
# created with CreateReadCountPanelOfNormals)
gatk CreateReadCountPanelOfNormals \
-I normal1.counts.hdf5 \
-I normal2.counts.hdf5 \
-I normal3.counts.hdf5 \
-O pon.counts.hdf5
Step 2: Call CNVs in Case Samples
# Collect read counts for case sample
gatk CollectReadCounts \
-I case.recal.bam \
-L exome_targets.bed \
-R reference.fa \
-format HDF5 \
-O case.counts.hdf5
# Denoise using PoN
gatk DenoiseReadCounts \
-I case.counts.hdf5 \
--count-panel-of-normals pon.counts.hdf5 \
--standardized-copy-ratios case.standardizedCR.tsv \
--denoised-copy-ratios case.denoisedCR.tsv
# Model segments (--output is a directory; add --allelic-counts from
# CollectAllelicCounts to include B-allele frequency evidence)
gatk ModelSegments \
--denoised-copy-ratios case.denoisedCR.tsv \
--output case_segments \
--output-prefix case
# Call copy ratio segments
gatk CallCopyRatioSegments \
--input case_segments/case.cr.seg \
--output case.called.seg
For WGS germline CNV (without target intervals):
# Collect read counts (no -L parameter for WGS)
gatk CollectReadCounts \
-I sample.recal.bam \
-R reference.fa \
-format HDF5 \
-O sample.counts.hdf5
# Continue with denoising and calling as above
Somatic CNV Calling
Tumor-Normal Pair:
# Step 1: Collect read counts for tumor and normal
gatk CollectReadCounts \
-I tumor.recal.bam \
-L exome_targets.bed \
-R reference.fa \
-format HDF5 \
-O tumor.counts.hdf5
gatk CollectReadCounts \
-I normal.recal.bam \
-L exome_targets.bed \
-R reference.fa \
-format HDF5 \
-O normal.counts.hdf5
# Step 2: Denoise tumor counts against a PoN (a matched normal alone
# can serve as a minimal single-sample PoN)
gatk CreateReadCountPanelOfNormals \
-I normal.counts.hdf5 \
-O matched_normal.pon.hdf5
gatk DenoiseReadCounts \
-I tumor.counts.hdf5 \
--count-panel-of-normals matched_normal.pon.hdf5 \
--standardized-copy-ratios tumor.standardizedCR.tsv \
--denoised-copy-ratios tumor.denoisedCR.tsv
# Step 3: Model segments with B-allele frequencies
# Collect allelic counts at common germline sites (ModelSegments takes
# CollectAllelicCounts output, not a VCF)
gatk CollectAllelicCounts \
-I tumor.recal.bam \
-R reference.fa \
-L common_sites.interval_list \
-O tumor.allelicCounts.tsv
gatk CollectAllelicCounts \
-I normal.recal.bam \
-R reference.fa \
-L common_sites.interval_list \
-O normal.allelicCounts.tsv
# Model segments with BAF
gatk ModelSegments \
--denoised-copy-ratios tumor.denoisedCR.tsv \
--allelic-counts tumor.allelicCounts.tsv \
--normal-allelic-counts normal.allelicCounts.tsv \
--output tumor_segments \
--output-prefix tumor
# Step 4: Call copy ratio segments
gatk CallCopyRatioSegments \
--input tumor_segments/tumor.cr.seg \
--output tumor.called.seg
# Step 5: Plot segments (optional, for visualization)
gatk PlotModeledSegments \
--denoised-copy-ratios tumor.denoisedCR.tsv \
--allelic-counts tumor_segments/tumor.hets.tsv \
--segments tumor_segments/tumor.modelFinal.seg \
--sequence-dictionary reference.dict \
--output tumor_plots \
--output-prefix tumor
Tumor-Only CNV Calling:
# For tumor-only, use a panel of normals (PoN) instead of matched normal
# Create PoN from multiple normal samples (see [Copy Number Variant (CNV) Calling](#copy-number-variant-cnv-calling) section above)
# Collect read counts for tumor
gatk CollectReadCounts \
-I tumor.recal.bam \
-L exome_targets.bed \
-R reference.fa \
-format HDF5 \
-O tumor.counts.hdf5
# Denoise using PoN
gatk DenoiseReadCounts \
-I tumor.counts.hdf5 \
--count-panel-of-normals pon.counts.hdf5 \
--standardized-copy-ratios tumor.standardizedCR.tsv \
--denoised-copy-ratios tumor.denoisedCR.tsv
# Continue with modeling and calling as above
Best Practices for SV and CNV Analysis
- Reference consistency: Use the same reference genome for alignment and SV/CNV calling (see Reference Genome Considerations in Introduction)
- Target intervals: For WES/panel, ensure target BED files match across samples
- Panel of Normals: Use PoN for WES/panel CNV calling (improves accuracy)
- Multiple callers: For SVs, use ensemble approaches (GATK-SV) or multiple callers
- Quality filtering: Apply appropriate filters based on sequencing depth and type
- Validation: Validate high-priority SVs/CNVs with orthogonal methods (PCR, FISH, etc.)
- Population frequencies: Check against gnomAD SV and DGV for common variants
- Size considerations:
- WGS: Can detect SVs/CNVs of all sizes
- WES: Limited to exonic regions, may miss intergenic SVs
- Panel: Very limited, focus on target regions
Integration with SNV/Indel Results
Combined variant analysis:
# Merge SNV/Indel, SV, and CNV results for comprehensive analysis
# Use custom scripts or tools like SURVIVOR for SV merging
# Example: Create combined variant report
# 1. SNV/Indel: mutect2.filtered.vcf.gz
# 2. SV: sv.filtered.vcf.gz
# 3. CNV: tumor.called.seg
# Prioritize variants affecting same genes/regions
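As a concrete sketch of the gene-level intersection outlined above: the commands below find genes hit by both a PASS somatic SNV/indel and a filtered CNV segment. Here genes.bed is an assumed 4-column BED (chrom, start, end, gene) and tumor.filtered.seg is assumed to be header-free with BED-like leading columns.

# Genes overlapped by PASS somatic SNVs/indels
bcftools query -i 'FILTER="PASS"' -f '%CHROM\t%POS0\t%END\n' \
mutect2.filtered.vcf.gz > snv_pass.bed
bedtools intersect -a genes.bed -b snv_pass.bed -u > genes_with_snv.bed
# Genes overlapped by filtered CNV segments
bedtools intersect -a genes.bed -b tumor.filtered.seg -u > genes_with_cnv.bed
# Genes hit by both variant classes
comm -12 <(cut -f4 genes_with_snv.bed | sort) <(cut -f4 genes_with_cnv.bed | sort)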
Stage 7: Variant Filtering
SNV and Small Indel Filtering
Germline Filtering - VQSR
Variant Quality Score Recalibration (VQSR) uses machine learning to identify true variants:
# Build recalibration model for SNPs
gatk VariantRecalibrator \
-R reference.fa \
-V cohort.vcf.gz \
--resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf.gz \
--resource:omni,known=false,training=true,truth=false,prior=12.0 omni.vcf.gz \
--resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg38.vcf.gz \
--resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf.gz \
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
-mode SNP \
-O cohort.snps.recal \
--tranches-file cohort.snps.tranches
# Apply VQSR to SNPs
gatk ApplyVQSR \
-R reference.fa \
-V cohort.vcf.gz \
-O cohort.snps.vqsr.vcf.gz \
--recal-file cohort.snps.recal \
--tranches-file cohort.snps.tranches \
--truth-sensitivity-filter-level 99.0 \
-mode SNP
# Similar process for indels (using Mills and 1000G indels as resources)
Germline Filtering - Hard Filtering (Alternative)
If VQSR is not feasible (e.g., small sample size), use hard filters:
# VariantFiltration has no -mode argument; split by variant type first
gatk SelectVariants -R reference.fa -V cohort.vcf.gz \
-select-type SNP -O cohort.snps.vcf.gz
gatk VariantFiltration \
-R reference.fa \
-V cohort.snps.vcf.gz \
-O cohort.snps.filtered.vcf.gz \
--filter-expression "QD < 2.0 || MQ < 40.0 || FS > 60.0 || SOR > 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
--filter-name "SNP_FILTER"
# Hard filter indels
gatk SelectVariants -R reference.fa -V cohort.vcf.gz \
-select-type INDEL -O cohort.indels.vcf.gz
gatk VariantFiltration \
-R reference.fa \
-V cohort.indels.vcf.gz \
-O cohort.indels.filtered.vcf.gz \
--filter-expression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0" \
--filter-name "INDEL_FILTER"
Somatic Filtering
FilterMutectCalls output already applies somatic-specific filters. Additional hard filters may be applied:
# Additional hard filters for somatic variants
# (Mutect2 reports allele fraction per sample in FORMAT/AF;
# in bcftools, -O sets the output type and -o names the output file)
bcftools filter -i 'FILTER="PASS" && INFO/DP>=10 && FORMAT/AF>=0.05' \
-O z -o mutect2.hardfiltered.vcf.gz \
mutect2.filtered.vcf.gz
VAF thresholds:
- WGS tissue: 5-10% minimum (adjust for purity)
- WES tissue: 5-10% minimum
- ctDNA: 0.1-0.5% minimum (with high depth)
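These thresholds can be applied with a single parameterized bcftools expression; a minimal sketch, assuming Mutect2's per-sample FORMAT/AF field:

# Adjust MIN_AF per sample type (e.g., 0.05 for tissue, 0.001-0.005 for ctDNA)
MIN_AF=0.05
bcftools filter -i "FORMAT/AF>=${MIN_AF}" \
-O z -o mutect2.vaf_filtered.vcf.gz \
mutect2.hardfiltered.vcf.gz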
SV and CNV Filtering
Note: VQSR (Variant Quality Score Recalibration) is designed for SNVs and indels, not for structural variants. SV filtering uses different quality metrics and filtering criteria as shown below.
SV Filtering
Quality filters for SV calls:
# Filter by quality score, size, and precision
# (SVLEN is negative for deletions, so compare its absolute value;
# IMPRECISE is a FLAG, tested as =0/=1 in bcftools expressions)
bcftools filter \
-i 'QUAL>20 && ABS(SVLEN)>50 && ABS(SVLEN)<1000000 && INFO/IMPRECISE=0' \
-O z \
-o sv.filtered.vcf.gz \
sv.vcf.gz
# Additional read-support filtering (these INFO tags are caller-specific;
# e.g., Manta encodes support in FORMAT/PR and FORMAT/SR instead)
bcftools filter \
-i 'INFO/SUPPORT_READS>5 && INFO/SUPPORT_FRACTION>0.1' \
-O z \
-o sv.high_confidence.vcf.gz \
sv.filtered.vcf.gz
Common SV filtering criteria:
- Quality score: QUAL > 20-30
- Read support: Minimum supporting reads (e.g., >5)
- Size range: Filter very small (<50bp) or very large (>10Mb) SVs
- Precision: Prefer PRECISE over IMPRECISE calls
- Population frequency: Filter common SVs (gnomAD SV frequency < 1%)
CNV Filtering
Filter CNV segments:
# Filter segments by copy ratio and size
# Copy ratio thresholds:
# - Deletion: < 0.7 (log2 < -0.5)
# - Amplification: > 1.3 (log2 > 0.4)
# - Neutral: 0.7-1.3
# Using awk (skip SAM-style '@' header lines and the column-name row;
# in CallCopyRatioSegments output, column 5 is MEAN_LOG2_COPY_RATIO)
awk 'BEGIN{FS=OFS="\t"} /^@/ || $1=="CONTIG" {next} {
len = $3 - $2
if ($5 < -0.5 && len > 10000) print        # deletions > 10kb
else if ($5 > 0.4 && len > 10000) print    # amplifications > 10kb
}' tumor.called.seg > tumor.filtered.seg
Common CNV filtering criteria:
- Copy ratio thresholds:
- Deletion: log2 copy ratio < -0.5
- Amplification: log2 copy ratio > 0.4
- Segment size: Minimum segment size (e.g., >10kb for WGS, >5kb for WES)
- Quality scores: Filter by segment quality metrics
- Overlap with known CNVs: Check against DGV, gnomAD SV
Stage 8: Variant Annotation
VEP
Ensembl Variant Effect Predictor (VEP) is a comprehensive command-line tool developed by EMBL-EBI that predicts and annotates the functional impact of genomic variants (SNPs, indels, structural variants) relative to reference annotations. VEP uses Ensembl and RefSeq transcript annotations, population frequency databases, regulatory element data, and custom annotation sources to assess variant consequences.
Key capabilities:
- Consequence prediction: Assigns Sequence Ontology (SO) terms (e.g., `missense_variant`, `synonymous_variant`, `splice_region_variant`) for all overlapping transcripts
- Allele frequency annotation: Integrates data from 1000 Genomes, gnomAD, and other population databases
- Clinical significance: Annotates variants with ClinVar clinical significance classifications
- Regulatory region overlap: Reports variants intersecting regulatory regions, transcription factor binding sites, and promoter regions
- HGVS nomenclature: Generates HGVS-compliant coding (`HGVSc`) and protein (`HGVSp`) change descriptions
- Flexible output formats: Supports VCF (with `CSQ` tag), tab-delimited, JSON, and summary formats
- Plugin architecture: Extensible through plugins for additional annotations (e.g., CADD, dbNSFP, AlphaMissense, AlphaFold)
- Transcript selection: Options like `--canonical`, `--pick`, `--most_severe`, and `--per_gene` for prioritizing transcripts in multi-isoform genes
VEP can operate using local cache files (recommended for performance) or connect to remote Ensembl databases. The tool supports over 100 species and multiple genome assemblies, making it suitable for both human and non-human variant annotation workflows.
VEP Installation and Setup
# Install VEP via conda
conda install -c bioconda ensembl-vep
# Or install manually
# Download and install from: https://github.com/Ensembl/ensembl-vep
# Install VEP cache
vep_install -a cf -s homo_sapiens -y GRCh38 -c /path/to/vep_cache
Basic VEP Annotation
# Basic VEP annotation (includes traditional pathogenicity scores)
vep -i variants.vcf.gz \
-o variants.vep.vcf \
--cache --dir_cache /path/to/vep_cache \
--assembly GRCh38 \
--format vcf \
--vcf \
--everything
# VEP's --everything flag includes:
# - SIFT scores (deleterious if <0.05)
# - PolyPhen-2 scores (damaging if >0.447)
# - Other functional predictions
# With additional annotation sources
vep -i variants.vcf.gz \
-o variants.vep.vcf \
--cache --dir_cache /path/to/vep_cache \
--assembly GRCh38 \
--format vcf \
--vcf \
--custom /path/to/cosmic.vcf.gz,COSMIC,vcf,exact,0 \
--custom /path/to/gnomad.vcf.gz,gnomAD,vcf,exact,0 \
--plugin CADD,/path/to/CADD.tsv.gz
Traditional pathogenicity scores available via VEP:
- SIFT: <0.05 (deleterious), included with `--everything`
- PolyPhen-2: >0.447 (possibly damaging), >0.908 (probably damaging), included with `--everything`
- CADD: >15-20 (deleterious), requires `--plugin CADD`
- PhyloP: >2.0 (conserved, likely functional), available via dbNSFP plugin
VEP with AI Tools
AlphaMissense
What it is: AlphaFold-derived model predicting pathogenicity of all possible single amino acid substitutions in human proteins.
Using with VEP:
# VEP with AlphaMissense plugin
vep -i coding_variants.vcf.gz \
-o coding_variants.alphamissense.vcf \
--cache --dir_cache /path/to/vep_cache \
--assembly GRCh38 \
--format vcf \
--vcf \
--plugin AlphaMissense,file=/path/to/AlphaMissense_hg38.tsv.gz
# Score interpretation:
# - Pathogenic: score > 0.5
# - Ambiguous: 0.35 < score <= 0.5
# - Benign: score <= 0.35
AlphaFold Structure Integration
Mapping variants to AlphaFold structures:
# VEP with AlphaFold structure annotation
vep -i coding_variants.vcf.gz \
-o coding_variants.alphafold.vcf \
--cache --dir_cache /path/to/vep_cache \
--assembly GRCh38 \
--format vcf \
--vcf \
--plugin AlphaFold
# pLDDT confidence scores:
# - Very high: pLDDT >= 90
# - Confident: pLDDT >= 70 (use for predictions)
# - Low: pLDDT < 70 (less reliable)
Accessing AlphaFold models:
- AlphaFold DB: https://alphafold.ebi.ac.uk/
- Download structures for specific proteins
- Use pLDDT scores to filter reliable regions (≥70)
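Individual models can be fetched directly from AlphaFold DB by UniProt accession; the example below uses BRCA1 (P38398) and assumes the current file-naming convention (check the site for the latest model version suffix):

# Download one AlphaFold model by UniProt accession (example: BRCA1)
wget https://alphafold.ebi.ac.uk/files/AF-P38398-F1-model_v4.pdb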
EVE (Evolutionary Model of Variant Effect)
What it is: EVE is an evolutionary model-based AI predictor that uses unsupervised machine learning on multiple sequence alignments to predict variant pathogenicity. It can be used independently or integrated with VEP workflows.
Using EVE scores:
# EVE provides precomputed scores for human missense variants
# Download from: https://evemodel.org/
# Annotate with EVE scores
# (in bcftools, -O sets the output type and -o names the output file)
bcftools annotate -a eve_scores.vcf.gz \
-c INFO/EVE \
-O z -o variants.eve.vcf.gz \
variants.vcf.gz
# Score interpretation:
# - Pathogenic: EVE_score > 0.5
# - Benign: EVE_score < 0.5
Other AI Tools
Additional AI-based predictors available via VEP plugins:
- REVEL: Ensemble predictor combining multiple scores (available via dbNSFP)
- PrimateAI: Deep learning model trained on primate sequences
- DANN: Deep neural network for pathogenicity prediction
- MutPred2: Machine learning predictor of molecular and phenotypic impact
Integration methods:
- VEP plugins (dbNSFP provides many scores including REVEL, SIFT, PolyPhen-2, PhyloP)
- Custom annotation with `bcftools annotate`
- Database lookups (dbNSFP, dbscSNV)
Note: Traditional scores (SIFT, PolyPhen-2, CADD, PhyloP) are widely used and well-validated. They complement AI-based scores and should be considered together for comprehensive variant assessment.
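Several of these scores can be pulled in through the dbNSFP plugin in one pass; a sketch, assuming a local dbNSFP build whose column names match your dbNSFP version:

# dbNSFP plugin: list the score columns to extract after the file path
vep -i variants.vcf.gz -o variants.dbnsfp.vcf \
--cache --dir_cache /path/to/vep_cache \
--assembly GRCh38 --vcf \
--plugin dbNSFP,/path/to/dbNSFP.gz,REVEL_score,SIFT_score,Polyphen2_HDIV_score,phyloP100way_vertebrate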
SNPeffect 5.0
SNPeffect is a comprehensive variant effect prediction pipeline (available on GitHub) that integrates multiple computational tools to assess the functional impact of coding variants. The pipeline processes VCF files through SnpEff annotation, then performs protein structure-based analysis using FoldX for stability predictions, along with sequence-based predictions from SIFT, PROVEAN, AGADIR, and TMHMM. It generates detailed reports including FoldX ΔΔG values, domain annotations (via Gene3D for CATH domain classification), transmembrane predictions, and functional impact scores.
Integrated tools:
- Structure analysis: FoldX (protein stability ΔΔG), AlphaFold (structure retrieval and pLDDT scoring), Gene3D (CATH domain annotation)
- Sequence-based predictors: SIFT (pathogenicity), PROVEAN (protein variant effect), AGADIR (helix propensity), TMHMM (transmembrane domain prediction)
SNPeffect 5.0 Pipeline
Integrated AlphaFold + FoldX workflow:
# SNPeffect 5.0 usage
snpeffect5 --variants coding_variants.vcf.gz \
--output snpeffect_results.tsv \
--structure_db /path/to/alphafold_db/ \
--threads 8 \
--min_plddt 70
# Output includes:
# - ΔΔG (stability change)
# - Domain annotation
# - pLDDT confidence
# - Functional impact predictions
Comparison: VEP vs. SNPeffect 5.0
Both VEP and SNPeffect 5.0 are powerful variant annotation tools, but they serve different purposes and have distinct strengths. Understanding their differences helps in choosing the right tool for your analysis needs.
| Feature | VEP (Ensembl Variant Effect Predictor) | SNPeffect 5.0 |
|---|---|---|
| Primary Focus | Comprehensive variant annotation across all variant types | Deep structural and functional analysis of coding variants |
| Variant Types Supported | SNVs, indels, structural variants, CNVs | Primarily coding variants (missense, nonsense, etc.) |
| Annotation Approach | Transcript-based consequences, population frequencies, regulatory regions | Protein structure-based predictions with biophysical modeling |
| AI-Based Predictors | AlphaMissense (plugin), AlphaFold (plugin), EVE (external), REVEL (dbNSFP plugin), PrimateAI/DANN/MutPred2 (dbNSFP) | AlphaFold (integrated), FoldX (core component for ΔΔG stability predictions) |
| Traditional Predictors | SIFT, PolyPhen-2 (built-in), CADD, PhyloP, PhastCons (via plugins) | SIFT, PROVEAN, AGADIR, TMHMM |
| Structure Analysis | Limited: AlphaFold plugin provides pLDDT scores; no direct stability calculations | Core feature: Integrated FoldX for ΔΔG predictions, automated AlphaFold structure retrieval, Gene3D domain annotation |
| Integration Method | Plugin architecture (extensible), custom annotations via `--custom`, combinable with external tools | Integrated pipeline (automated), requires pre-built AlphaFold structure database |
| Species Support | 100+ species (mammals, plants, fungi, etc.) | Primarily human (focused on human protein structures) |
| Population Data | Extensive (1000 Genomes, gnomAD, ExAC, etc.) | Limited (relies on external annotation sources) |
| Regulatory Annotation | Built-in support for regulatory regions, TFBS, promoters | Not a primary focus |
| Clinical Data | ClinVar integration, HGMD (via plugins) | Not included |
| Output Formats | VCF (CSQ tag), tab-delimited, JSON, summary | Tab-delimited reports (FoldX reports, sequence analysis) |
| Computational Requirements | Moderate (16-32GB RAM with cache) | High (requires protein structures, FoldX calculations) |
| Ease of Use | Well-documented, widely adopted, plugin ecosystem | Specialized pipeline requiring structure databases |
Key Distinction: VEP excels at annotation aggregation - bringing together multiple AI and traditional predictors through plugins for comprehensive annotation in a single pass. SNPeffect 5.0 focuses on deep structural analysis - integrating AlphaFold structures with FoldX to provide unique biophysical insights (protein stability ΔΔG) that VEP cannot calculate directly.
When to Use VEP:
- Initial annotation: First-pass annotation of all variants (coding and non-coding)
- Population genetics: Analyzing allele frequencies across populations
- Regulatory variants: Identifying variants in regulatory regions, promoters, enhancers
- Clinical interpretation: Integrating ClinVar and other clinical databases
- Multi-species analysis: Working with non-human genomes
- Standard workflows: When following established annotation pipelines
When to Use SNPeffect 5.0:
- Protein structure analysis: When you need detailed biophysical predictions (ΔΔG values)
- Stability predictions: Understanding how variants affect protein folding and stability
- Research focus: Deep analysis of specific genes or protein families
- Structure-function studies: Investigating relationships between structure and function
- Complementary analysis: After initial VEP annotation, for prioritized coding variants
Recommended Workflow: Use both tools in a complementary manner:
- VEP for initial comprehensive annotation and filtering
- SNPeffect 5.0 for detailed structural analysis of prioritized coding variants
- Combine results for comprehensive variant interpretation
Integration Example:
# Step 1: VEP for comprehensive annotation
vep -i variants.vcf.gz \
-o variants.vep.vcf \
--cache --dir_cache /path/to/vep_cache \
--assembly GRCh38 \
--everything \
--plugin AlphaMissense,file=/path/to/AlphaMissense_hg38.tsv.gz
# Step 2: Extract coding variants of interest
bcftools view -i 'INFO/CSQ~"missense_variant"' variants.vep.vcf \
> coding_variants.vcf
# Step 3: SNPeffect for structural analysis
snpeffect5 --variants coding_variants.vcf \
--output snpeffect_results.tsv \
--structure_db /path/to/alphafold_db/ \
--min_plddt 70
# Step 4: Merge results for comprehensive interpretation
# (Custom script to combine VEP and SNPeffect outputs)
ENCODE cCRE Annotation for Non-coding Variants
What are cCREs?
cCREs (candidate cis-Regulatory Elements) are genomic intervals identified by ENCODE as candidate regulatory elements based on:
- Transcription factor ChIP-seq
- Chromatin accessibility (DNase/ATAC)
- Histone modifications (H3K4me3, H3K27ac)
- CTCF binding
cCRE classification:
- Promoter-like: Near transcription start sites
- Enhancer-like: Distal regulatory elements
- CTCF-only: Insulator elements
- Other: Additional regulatory classes
Downloading cCRE Data
# Download from ENCODE cCRE Registry
# Version 4 contains ~2.37M human cCREs for hg38
# Option 1: SCREEN portal
# https://screen.wenglab.org/downloads
# Download BED files for specific cell types or all cCREs
# Option 2: ENCODE FTP
# Download from ENCODE project FTP site
# Example file: ENCODE_cCREs.hg38.v4.bed
Overlapping Variants with cCREs
Method 1: bedtools intersect:
# Extract variant positions to BED format (BED is 0-based, half-open;
# bcftools provides %POS0 and %END for this)
bcftools query -f '%CHROM\t%POS0\t%END\n' non_coding_variants.vcf.gz > variants.bed
# Intersect with cCREs
bedtools intersect -a variants.bed \
-b ENCODE_cCREs.hg38.v4.bed \
-wa -wb > variants_in_ccres.txt
# Add cCRE information back to VCF
# (requires custom script to merge results)
Method 2: VEP custom annotation:
# VEP with cCRE custom annotation
vep -i non_coding_variants.vcf.gz \
-o non_coding_variants.ccre.vcf \
--cache --dir_cache /path/to/vep_cache \
--assembly GRCh38 \
--format vcf \
--vcf \
--custom /path/to/ENCODE_cCREs.hg38.v4.bed.gz,ENCODE_cCRE,bed,overlap,0
Method 3: R/Bioconductor:
library(GenomicRanges)
library(VariantAnnotation)
library(rtracklayer)
# Read variants and take their genomic ranges
vcf <- readVcf("non_coding_variants.vcf.gz", "hg38")
variants <- rowRanges(vcf)
# Read cCREs (extra BED columns land in mcols(); the column holding the
# cCRE class depends on the downloaded file, often "name")
ccres <- import("ENCODE_cCREs.hg38.v4.bed")
# Find overlaps
overlaps <- findOverlaps(variants, ccres)
# Annotate variants with cCRE type
annotated <- cbind(
as.data.frame(variants[queryHits(overlaps)]),
cCRE_class = mcols(ccres)$name[subjectHits(overlaps)]
)
Tissue-Specific cCRE Activity
Prioritizing variants in tissue-relevant cCREs:
- Use cell-type specific cCRE annotations when available
- Prioritize variants in cCREs active in disease-relevant tissues
- Consider tissue-specific chromatin states
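A minimal sketch of the tissue-specific intersection, assuming a cell-type specific cCRE BED downloaded from SCREEN (the file name here is illustrative):

# Restrict to cCREs active in a disease-relevant tissue (example file name)
bedtools intersect -a variants.bed \
-b cCREs.liver.hg38.bed \
-wa -wb > variants_in_tissue_ccres.txt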
Integration with Other Regulatory Annotations
Combine cCRE annotation with:
- TF binding motifs: JASPAR, TRANSFAC databases
- Chromatin states: ChromHMM, Segway segmentation
- Histone modifications: H3K4me3 (promoters), H3K27ac (enhancers)
- Motif disruption: motifbreakR, FIMO tools
SV and CNV Annotation
SV Annotation
AnnotSV (specialized SV annotation tool):
# Install AnnotSV
# conda install -c bioconda annotsv
# Annotate SVs
AnnotSV \
-SVinputFile sv.vcf.gz \
-genomeBuild GRCh38 \
-outputDir annotsv_output \
-outputFile sv.annotated.tsv
# (AnnotSV writes a tab-separated annotation table, not a VCF)
VEP annotation for SVs:
# VEP can annotate SVs (limited compared to AnnotSV)
vep -i sv.vcf.gz \
-o sv.vep.vcf \
--cache --dir_cache /path/to/vep_cache \
--assembly GRCh38 \
--format vcf \
--vcf \
--check_existing \
--check_svs
CNV Annotation
Annotate CNV segments with gene information:
# Use bedtools to overlap CNV segments with gene annotations
bedtools intersect \
-a tumor.called.seg \
-b genes.gtf \
-wa -wb > cnv.genes.txt
# Or use R/Bioconductor GenomicRanges
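A plain-Python alternative to the R route, assuming GATK-style .seg columns (CONTIG/START/END, header lines prefixed with @) and a 4-column gene BED; adjust names if your files differ:
# cnv_gene_overlap.py - sketch: overlap CNV segments with gene intervals
# in pandas only.
import pandas as pd

segs = pd.read_csv("tumor.called.seg", sep="\t", comment="@")
genes = pd.read_csv("genes.bed", sep="\t",
                    names=["chrom", "start", "end", "gene"])

rows = []
for _, seg in segs.iterrows():
    # Standard half-open interval overlap test per chromosome
    hits = genes[(genes["chrom"] == seg["CONTIG"])
                 & (genes["start"] < seg["END"])
                 & (genes["end"] > seg["START"])]
    for gene in hits["gene"]:
        rows.append({**seg.to_dict(), "gene": gene})

pd.DataFrame(rows).to_csv("cnv_genes.tsv", sep="\t", index=False)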
Stage 9: Variant Prioritization
Variant prioritization involves ranking variants by their potential biological and clinical significance. Different prioritization strategies are used depending on variant type (coding vs. non-coding) and analysis context (germline vs. somatic). For more details about prioritization strategies, multi-tool consensus approaches, and comprehensive guidelines, see the Guidelines for Sequencing-Based Somatic Variant Analysis.
Coding Variant Prioritization
Coding variants (SNVs and indels in protein-coding regions) can be prioritized using multiple complementary approaches.
Pathogenicity Score Assessment
AI-based scores (recommended for recent analyses):
- AlphaMissense: >0.5 (pathogenic)
- EVE: >0.5 (pathogenic)
- REVEL: >0.5 (pathogenic)
Traditional scores (widely used, well-validated):
- CADD: >15-20 (deleterious)
- SIFT: <0.05 (deleterious)
- PolyPhen-2: >0.447 (possibly damaging), >0.908 (probably damaging)
- PhyloP: >2.0 (conserved, likely functional)
Comparison: AI-Based vs. Traditional Scoring for Protein Impact
AI-Based Scores (AlphaMissense, EVE, REVEL):
Advantages:
- Higher accuracy: Trained on large-scale experimental and clinical data, often outperform traditional methods
- Protein structure-aware: AlphaMissense leverages AlphaFold protein structures, capturing structural context
- Evolutionary modeling: EVE uses evolutionary information across species to predict functional impact
- Ensemble approaches: REVEL combines multiple predictors for robust predictions
- Comprehensive coverage: Pre-computed scores available for all possible missense variants in human proteins
- Better for rare variants: Less dependent on training data frequency, better generalization
Limitations:
- Computational requirements: Some models require significant computational resources
- Interpretability: Deep learning models can be “black boxes” with less transparent decision-making
- Database dependencies: May require large pre-computed databases (e.g., AlphaMissense database)
- Newer tools: Less historical validation compared to traditional methods
Traditional Scores (SIFT, PolyPhen-2, CADD, PhyloP):
Advantages:
- Well-validated: Extensive validation in clinical and research settings over many years
- Interpretable: Based on clear biological principles (sequence conservation, physicochemical properties)
- Widely available: Integrated into standard annotation pipelines (VEP, ANNOVAR)
- Fast computation: Efficient algorithms, suitable for large-scale analyses
- Established thresholds: Well-characterized score cutoffs with known performance metrics
- Complementary information: Different methods capture different aspects (conservation, structure, function)
Limitations:
- Lower accuracy: Generally less accurate than state-of-the-art AI methods
- Sequence-based: SIFT and PolyPhen-2 primarily use sequence features, less structural context
- Conservation bias: PhyloP and SIFT heavily rely on evolutionary conservation, may miss species-specific effects
- Limited for rare variants: Performance may degrade for variants not well-represented in training data
When to Use Each Approach:
Use AI-based scores when:
- Prioritizing variants for experimental validation or clinical interpretation
- Analyzing large-scale datasets where accuracy is critical
- Working with rare or novel variants not well-represented in traditional training sets
- Need protein structure context for variant impact assessment
- Have access to pre-computed AI score databases
Use traditional scores when:
- Need interpretable predictions based on clear biological principles
- Working with established annotation pipelines (VEP, ANNOVAR)
- Require fast annotation for large-scale screening
- Need complementary information from multiple independent methods
- Comparing results with historical datasets or published studies
Best Practice: Combined Approach:
- Use both AI and traditional scores for comprehensive assessment
- Prioritize variants with agreement across multiple predictor types
- Consider context: AI scores for high-priority variants, traditional scores for initial screening
- Validate predictions with experimental or clinical evidence when possible
Prioritization strategy:
- Highest priority: Agreement across multiple predictors (e.g., AlphaMissense >0.5 + CADD >20 + SIFT <0.05)
- High priority: Multiple predictors (AI or traditional) agree on pathogenicity
- Moderate priority: Single strong predictor (either AI or traditional)
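To make this tiering concrete, a minimal Python sketch that assigns a priority level from predictor agreement using the thresholds quoted above; scored_variants.tsv is a hypothetical per-variant TSV with one column per score:
# prioritize_coding.py - sketch: rank variants by predictor agreement.
import pandas as pd

df = pd.read_csv("scored_variants.tsv", sep="\t")

# Count how many AI-based and traditional predictors call each variant
ai_hits = (df["AlphaMissense"] > 0.5).astype(int) + (df["REVEL"] > 0.5).astype(int)
trad_hits = ((df["CADD"] > 20).astype(int)
             + (df["SIFT"] < 0.05).astype(int)
             + (df["PhyloP"] > 2.0).astype(int))

def priority(ai, trad):
    if ai >= 1 and trad >= 2:   # agreement across predictor types
        return "highest"
    if ai + trad >= 2:          # multiple predictors agree
        return "high"
    if ai + trad == 1:          # single strong predictor
        return "moderate"
    return "low"

df["priority"] = [priority(a, t) for a, t in zip(ai_hits, trad_hits)]
rank = {"highest": 0, "high": 1, "moderate": 2, "low": 3}
df.sort_values("priority", key=lambda s: s.map(rank)) \
  .to_csv("prioritized_variants.tsv", sep="\t", index=False)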
Structural Impact Assessment
Prioritize variants with:
- Destabilizing ΔΔG: FoldX ΔΔG > 0.5 kcal/mol
- High-confidence structure: AlphaFold pLDDT ≥ 70
- Domain disruption: Variants in functional domains or interaction surfaces
Codon Usage Change Analysis for Synonymous Variants
Why codon usage matters: Synonymous variants can cause disease (including cancer) through expression changes when codon usage is dramatically altered. Codon usage bias affects:
- mRNA stability: Rare codons can reduce mRNA stability, leading to decreased expression
- Translation efficiency: Codon usage affects translation speed and efficiency
- Protein folding: Altered translation kinetics can affect co-translational protein folding
- Gene expression levels: Dramatic codon usage changes can significantly alter gene expression
Calculating Codon Usage Frequency (CUF) Change:
- Obtain codon usage frequencies: Download species-specific codon usage tables from:
- Codon Usage Database (CUTG): http://www.kazusa.or.jp/codon/
- Kazusa Codon Usage Database
- Human codon usage tables from genomic databases
- Identify the codon change: For a synonymous variant, determine:
- Original codon (e.g., TTC encoding Phe)
- New codon (e.g., TTT encoding Phe)
- Both encode the same amino acid (synonymous)
- Look up frequencies: Find the relative frequency of each codon:
- Normalized to 0-1 scale (where 1.0 = most common codon for that amino acid)
- Example: TTC (Phe) might have CUF = 0.8, TTT (Phe) might have CUF = 0.2
- Calculate change: CUF_change = |CUF_original - CUF_new|
- Example: |0.8 - 0.2| = 0.6 (large change, high priority)
Prioritization criteria for synonymous variants:
- Large CUF change (>0.4): High priority, especially in cancer genes
- Moderate CUF change (0.2-0.4): Moderate priority, consider gene context
- Small CUF change (<0.2): Lower priority
- Cancer gene context: Higher priority for synonymous variants in oncogenes or tumor suppressors
Tools for codon usage analysis:
- CodonW: Codon usage analysis software
- CAI (Codon Adaptation Index): Measures codon usage bias
- tAI (tRNA Adaptation Index): Considers tRNA abundance
- Custom scripts: Calculate codon frequency changes from variant annotations
Example calculation (Python):
# Load codon usage table (from CUTG or Kazusa database)
codon_freq = {
'TTC': 0.8, # Phe - common codon
'TTT': 0.2, # Phe - rare codon
'GAA': 0.9, # Glu - common codon
'GAG': 0.1, # Glu - rare codon
# ... more codons
}
def calculate_cuf_change(ref_codon, alt_codon, codon_freq_table):
"""Calculate codon usage frequency change for synonymous variant"""
ref_freq = codon_freq_table.get(ref_codon, 0.0)
alt_freq = codon_freq_table.get(alt_codon, 0.0)
cuf_change = abs(ref_freq - alt_freq)
return cuf_change
# Example: Synonymous variant TTC > TTT (both encode Phe)
cuf_change = calculate_cuf_change('TTC', 'TTT', codon_freq)
# Result: |0.8 - 0.2| = 0.6 (large change, high priority)
Integration with expression data: If RNA-seq data is available, prioritize synonymous variants with:
- Large codon usage changes (>0.4)
- Associated expression changes in the sample
- Located in cancer-relevant genes
Pathway and Gene Set Enrichment Analysis
Why perform enrichment analysis? Pathway enrichment analysis connects variant-level signals to biological function, revealing disease mechanisms, affected pathways, and gene sets with coordinated variant burden.
Preparation steps (before enrichment analysis):
- Filter variants by impact to focus on functionally relevant variants:
- High impact: Frameshift, stop-gain/loss, splice-site disrupting
- Moderate impact: Missense (predicted deleterious), in-frame indels
- Aggregate variants to gene level for enrichment analysis:
# Step 1: Extract coding variants with moderate or high impact
# Using VEP-annotated VCF
bcftools filter -i 'INFO/CSQ ~ "HIGH" || INFO/CSQ ~ "MODERATE"' \
variants.vep.vcf.gz \
-O z -o coding_variants.vcf.gz
# Alternative: Filter by CADD score (e.g., CADD > 15)
bcftools filter -i 'INFO/CADD > 15' \
variants.vcf.gz \
-O z -o coding_variants.cadd_filtered.vcf.gz
# Step 2: Count variants per gene and create gene list
# (field 4 of the default VEP CSQ string is SYMBOL)
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO/CSQ\n' \
coding_variants.vcf.gz | \
awk -F'|' '{print $4}' | sort | uniq -c | sort -rn > gene_burden.txt
# Create gene list for enrichment tools (uniq -c pads with spaces, so use awk)
awk '{print $2}' gene_burden.txt | grep -v "^$" > genes_with_variants.txt
Pathway enrichment tools:
GSEA (Gene Set Enrichment Analysis):
# Prepare ranked gene list (e.g., by variant burden)
# Format: GENE_NAME RANK_SCORE
# Run GSEAPreranked on the ranked list (Java application)
java -Xmx4g -cp gsea.jar \
xtools.gsea.GseaPreranked \
-rnk ranked_genes.rnk \
-gmx pathway.gmt \
-chip gene_chip.chip \
-out output_dir
Enrichr (Web-based):
# Prepare gene list
# Upload to: https://maayanlab.cloud/Enrichr/
# Or use the R package
library(enrichR)
# enrichr() expects a character vector of library names
# (browse the full list with listEnrichrDbs())
dbs <- c("GO_Biological_Process_2023", "KEGG_2021_Human")
enriched <- enrichr(genes_with_variants, dbs)
clusterProfiler (R):
library(clusterProfiler)
library(org.Hs.eg.db)
# Convert gene symbols to ENTREZ IDs
gene_ids <- bitr(genes_with_variants,
fromType="SYMBOL",
toType="ENTREZID",
OrgDb=org.Hs.eg.db)
# GO enrichment
go_enrich <- enrichGO(gene = gene_ids$ENTREZID,
OrgDb = org.Hs.eg.db,
ont = "BP",
pAdjustMethod = "BH",
pvalueCutoff = 0.05,
qvalueCutoff = 0.2,
readable = TRUE)
# KEGG enrichment
kegg_enrich <- enrichKEGG(gene = gene_ids$ENTREZID,
organism = "hsa",
pvalueCutoff = 0.05)
# Reactome enrichment
library(ReactomePA)
reactome_enrich <- enrichPathway(gene = gene_ids$ENTREZID,
organism = "human",
pvalueCutoff = 0.05)
Gene set databases:
- KEGG: Metabolic and signaling pathways
- Reactome: Comprehensive human pathways
- MSigDB: Hallmark gene sets, oncogenic signatures, immunologic signatures
- Enrichr: Web-based enrichment analysis platform with multiple gene set libraries - https://maayanlab.cloud/Enrichr/. Databases can be downloaded.
- GO: Biological process, molecular function, cellular component
- WikiPathways: Community-curated pathways
Best practices for enrichment analysis:
- Multiple testing correction: Use FDR (Benjamini-Hochberg) or Bonferroni
- Proper background: Use callable/high-confidence genes as background
- Gene size bias: Consider gene length and mutability
- Integration: Combine with functional predictions (AI scores, structural impact)
- Sample size: Ensure sufficient power for rare variant enrichment
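To make the "proper background" point concrete, a small Python sketch of a one-sided hypergeometric test for one gene set against a callable-gene background (file names are placeholders; scipy provides the distribution):
# pathway_hypergeom.py - sketch: test one gene set against a custom
# background of callable genes.
from scipy.stats import hypergeom

background = {line.strip() for line in open("callable_genes.txt")}
hits = {line.strip() for line in open("genes_with_variants.txt")} & background
pathway = {line.strip() for line in open("pathway_genes.txt")} & background

N = len(background)      # background size
K = len(pathway)         # pathway genes within the background
n = len(hits)            # genes carrying qualifying variants
k = len(hits & pathway)  # overlap

# P(X >= k) when drawing n genes from N containing K pathway genes
p_value = hypergeom.sf(k - 1, N, K, n)
print(f"overlap={k}/{n}, pathway={K}, background={N}, p={p_value:.3g}")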
Integration Strategy for Coding Variants
Pathogenicity score integration: Combine multiple predictors for robust prioritization:
AI-based scores (recommended for recent analyses):
- AlphaMissense: >0.5 (pathogenic)
- EVE: >0.5 (pathogenic)
- REVEL: >0.5 (pathogenic)
Traditional scores (widely used, well-validated):
- CADD: >15-20 (deleterious)
- SIFT: <0.05 (deleterious)
- PolyPhen-2: >0.908 (probably damaging) or >0.447 (possibly damaging)
- PhyloP: >2.0 (conserved, likely functional)
Highest priority variants:
- Multiple high pathogenicity scores: Agreement across multiple predictors (e.g., AlphaMissense >0.5 + CADD >20 + SIFT <0.05)
- Destabilizing structural impact: FoldX ΔΔG > 0.5 kcal/mol
- Located in functional domain: Variants in known functional domains or interaction surfaces
- In enriched pathway: Variants in genes from significantly enriched pathways
- Synonymous variants: Large codon usage change (>0.4) in cancer genes with expression evidence
Germline Variant Prioritization
Prioritization criteria:
- Population frequency: gnomAD AF < 0.01% (rare variants)
- Pathogenicity predictions:
- AI-based: High AlphaMissense, EVE, REVEL scores (>0.5)
- Traditional: High CADD (>15-20), SIFT (<0.05), PolyPhen-2 (>0.447), PhyloP (>2.0)
- Agreement across multiple predictors strengthens confidence
- Clinical significance: Pathogenic/Likely Pathogenic in ClinVar
- Inheritance patterns: Match expected inheritance mode
- Functional impact: High or moderate impact variants (frameshift, stop-gain, missense)
- Gene-level burden: Variants in genes with known disease associations
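As a sketch, these criteria can be applied to an annotated variant table in a few lines of Python (all file and column names are placeholders for whatever your annotation export uses):
# germline_filter.py - sketch applying the germline criteria above.
import pandas as pd

df = pd.read_csv("germline_annotated.tsv", sep="\t")

rare = df["gnomAD_AF"] < 0.0001                             # AF < 0.01%
damaging = (df["AlphaMissense"] > 0.5) | (df["CADD"] > 20)
clinvar = df["ClinVar"].isin(["Pathogenic", "Likely_pathogenic"])
impactful = df["IMPACT"].isin(["HIGH", "MODERATE"])

# Rare, impactful variants supported by prediction or clinical evidence
candidates = df[rare & impactful & (damaging | clinvar)]
candidates.to_csv("germline_candidates.tsv", sep="\t", index=False)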
Non-coding Variant Prioritization
ENCODE cCRE overlap:
- Highest priority: Promoter-like and enhancer-like cCREs
- Tissue-specific: Variants in tissue-relevant cCREs
- High confidence: cCREs with strong experimental support
Regulatory impact assessment:
- TF binding disruption: motifbreakR, FIMO predictions
- Active chromatin: H3K4me3 (promoters), H3K27ac (enhancers)
- eQTL evidence: GTEx, eQTLGen associations
- Chromatin state: ChromHMM, Segway segmentation
Somatic Variant Prioritization
Cancer-specific criteria:
- COSMIC: Known cancer-associated variants
- OncoKB: Therapeutic implications
- Pathway enrichment: Variants in enriched cancer pathways (TP53, PI3K/AKT, MAPK, DNA repair)
- Recurrence: Variants recurrent across samples
- Tier classification: Tier 1 (highest clinical significance) to Tier 4
- Synonymous variants with codon usage changes:
- Prioritize synonymous variants in cancer genes with large codon usage frequency changes (>0.4)
- Consider expression changes associated with codon usage alterations
- Higher priority for variants in oncogenes or tumor suppressors
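As an illustration only (real tier assignment, e.g. per the AMP/ASCO/CAP guidelines, weighs richer evidence), a Python sketch combining these criteria into tiers; the boolean columns are assumed outputs of your annotation merge:
# somatic_tiers.py - illustrative tier assignment from annotation flags.
import pandas as pd

df = pd.read_csv("somatic_annotated.tsv", sep="\t")

def assign_tier(row):
    if row["oncokb_actionable"]:                    # therapeutic implication
        return "Tier 1"
    if row["in_cosmic"] and row["in_cancer_pathway"]:
        return "Tier 2"
    if row["in_cosmic"] or row["recurrent"]:
        return "Tier 3"
    return "Tier 4"                                 # unknown significance

df["tier"] = df.apply(assign_tier, axis=1)
df.to_csv("somatic_tiered.tsv", sep="\t", index=False)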
Complete Workflow Examples
Example 1: WGS Germline Variant Calling (Single Sample)
#!/bin/bash
# Complete workflow for WGS germline variant calling
# Setup
REFERENCE="/path/to/hg38.fa"
REF_DIR="/path/to/dragmap_reference"
SAMPLE="sample1"
FASTQ_R1="sample1_R1.fastq.gz"
FASTQ_R2="sample1_R2.fastq.gz"
# Stage 1: Alignment
dragen-os -r ${REF_DIR} \
-1 ${FASTQ_R1} \
-2 ${FASTQ_R2} \
--RGID ${SAMPLE}_lib1 \
--RGSM ${SAMPLE} \
--RGPL ILLUMINA \
--output-directory ./alignment/ \
--output-file-prefix ${SAMPLE}
# Stage 2: Sort and index (dragen-os writes SAM output)
samtools sort -@ 8 ./alignment/${SAMPLE}.sam -o ${SAMPLE}.sorted.bam
samtools index ${SAMPLE}.sorted.bam
# Stage 3: Mark duplicates
gatk MarkDuplicates \
-I ${SAMPLE}.sorted.bam \
-O ${SAMPLE}.markdup.bam \
-M ${SAMPLE}.markdup.metrics.txt \
--CREATE_INDEX true
# Stage 4: BQSR
gatk BaseRecalibrator \
-R ${REFERENCE} \
-I ${SAMPLE}.markdup.bam \
--known-sites dbsnp.vcf.gz \
--known-sites mills.vcf.gz \
-O ${SAMPLE}.recal.table
gatk ApplyBQSR \
-R ${REFERENCE} \
-I ${SAMPLE}.markdup.bam \
-bqsr ${SAMPLE}.recal.table \
-O ${SAMPLE}.recal.bam \
--create-output-bam-index true
# Stage 5: SNV/Indel variant calling with DRAGEN mode
gatk HaplotypeCaller \
-R ${REFERENCE} \
-I ${SAMPLE}.recal.bam \
-O ${SAMPLE}.g.vcf.gz \
-ERC GVCF \
--dragen-mode true \
--use-pdhmm true \
--dragstr-params-path /path/to/dragstr_params.json
# Genotype (if needed as VCF)
gatk GenotypeGVCFs \
-R ${REFERENCE} \
-V ${SAMPLE}.g.vcf.gz \
-O ${SAMPLE}.vcf.gz
# Stage 6: SV and CNV calling (optional)
# See [Stage 6: Structural Variant (SV) and Copy Number Variant (CNV) Calling](#stage-6-structural-variant-sv-and-copy-number-variant-cnv-calling) section for SV/CNV calling commands
# Stage 7: Variant filtering
gatk VariantFiltration \
-R ${REFERENCE} \
-V ${SAMPLE}.vcf.gz \
-O ${SAMPLE}.filtered.vcf.gz \
--filter-expression "QD < 2.0 || MQ < 40.0 || FS > 60.0" \
--filter-name "SNP_FILTER"
# (VariantFiltration has no -mode option; to apply SNP- and indel-specific
# thresholds, split first with SelectVariants -select-type SNP/INDEL)
# Stage 8: Variant annotation
vep -i ${SAMPLE}.filtered.vcf.gz \
-o ${SAMPLE}.annotated.vcf \
--cache --dir_cache /path/to/vep_cache \
--assembly GRCh38 \
--format vcf \
--vcf \
--plugin AlphaMissense,/path/to/AlphaMissense_hg38.tsv.gz \
--everything
Example 2: WES Germline Variant Calling (Cohort)
#!/bin/bash
# Cohort analysis with GVCF workflow
REFERENCE="/path/to/hg38.fa"
TARGETS="exome_targets.bed"
SAMPLES=("sample1" "sample2" "sample3")
# Generate GVCF for each sample
for SAMPLE in "${SAMPLES[@]}"; do
gatk HaplotypeCaller \
-R ${REFERENCE} \
-I ${SAMPLE}.recal.bam \
-L ${TARGETS} \
-O ${SAMPLE}.g.vcf.gz \
-ERC GVCF \
--dragen-mode true
done
# Create sample map
echo -e "sample1\tsample1.g.vcf.gz" > samples.map
echo -e "sample2\tsample2.g.vcf.gz" >> samples.map
echo -e "sample3\tsample3.g.vcf.gz" >> samples.map
# Consolidate GVCFs (GenotypeGVCFs cannot read a sample map directly)
gatk GenomicsDBImport \
--sample-name-map samples.map \
--genomicsdb-workspace-path cohort_gdb \
-L ${TARGETS}
# Joint genotyping
gatk GenotypeGVCFs \
-R ${REFERENCE} \
-V gendb://cohort_gdb \
-O cohort.vcf.gz
# VQSR (if sufficient samples)
# ... VQSR commands ...
# Annotation
vep -i cohort.vcf.gz \
-o cohort.annotated.vcf \
--cache --dir_cache /path/to/vep_cache \
--assembly GRCh38 \
--format vcf \
--vcf \
--everything
Example 3: WGS Somatic Variant Calling (Tumor-Normal Pair)
#!/bin/bash
# Tumor-normal pair somatic calling
REFERENCE="/path/to/hg38.fa"
TUMOR="tumor.recal.bam"
NORMAL="normal.recal.bam"
# Mutect2 calling (recent GATK infers the tumor sample automatically, so
# --tumor-sample is not needed; DRAGEN mode applies to HaplotypeCaller,
# not Mutect2)
gatk Mutect2 \
-R ${REFERENCE} \
-I ${TUMOR} \
-I ${NORMAL} \
-normal normal_sample \
-O mutect2.vcf.gz
# Filter
gatk FilterMutectCalls \
-R ${REFERENCE} \
-V mutect2.vcf.gz \
-O mutect2.filtered.vcf.gz
# Additional hard filters (allele fraction is FORMAT/AF in Mutect2 output)
bcftools filter -i 'FILTER="PASS" && INFO/DP>=10 && FORMAT/AF>=0.05' \
mutect2.filtered.vcf.gz \
-O z -o mutect2.hardfiltered.vcf.gz
# Annotation
vep -i mutect2.hardfiltered.vcf.gz \
-o mutect2.annotated.vcf \
--cache --dir_cache /path/to/vep_cache \
--assembly GRCh38 \
--format vcf \
--vcf \
--custom /path/to/cosmic.vcf.gz,COSMIC,vcf,exact,0 \
--plugin AlphaMissense,/path/to/AlphaMissense_hg38.tsv.gz
Example 4: WES Somatic Variant Calling (Tumor-Only with PoN)
#!/bin/bash
# Tumor-only calling with Panel of Normals
REFERENCE="/path/to/hg38.fa"
TARGETS="exome_targets.bed"
TUMOR="tumor.recal.bam"
PON="pon.vcf.gz"
GNOMAD="gnomad.vcf.gz"
# Get pileup summaries
gatk GetPileupSummaries \
-R ${REFERENCE} \
-I ${TUMOR} \
-V ${GNOMAD} \
-L ${TARGETS} \
-O tumor.pileups.table
# Calculate contamination
gatk CalculateContamination \
-I tumor.pileups.table \
-O tumor.contamination.table
# Mutect2 with PoN (tumor-only: no -normal given)
gatk Mutect2 \
-R ${REFERENCE} \
-I ${TUMOR} \
--germline-resource ${GNOMAD} \
--panel-of-normals ${PON} \
-L ${TARGETS} \
-O mutect2.vcf.gz
# Filter with contamination table
gatk FilterMutectCalls \
-R ${REFERENCE} \
-V mutect2.vcf.gz \
-O mutect2.filtered.vcf.gz \
--contamination-table tumor.contamination.table
Example 5: Coding Variant Annotation with AI Tools
#!/bin/bash
# Complete AI-based annotation workflow
VCF="coding_variants.vcf.gz"
# Step 1: VEP with AlphaMissense
vep -i ${VCF} \
-o variants.alphamissense.vcf \
--cache --dir_cache /path/to/vep_cache \
--assembly GRCh38 \
--format vcf \
--vcf \
--plugin AlphaMissense,/path/to/AlphaMissense_hg38.tsv.gz
# Step 2: SNPeffect for structural impact
snpeffect5 --variants variants.alphamissense.vcf \
--output snpeffect_results.tsv \
--structure_db /path/to/alphafold_db/ \
--min_plddt 70 \
--threads 8
# Step 3: Combine results
# (Custom script to merge VEP and SNPeffect outputs)
# Step 4: Prioritize
# Filter for high-confidence pathogenic variants:
# - AlphaMissense > 0.5
# - FoldX ΔΔG > 0.5 (destabilizing)
# - pLDDT >= 70 (high confidence structure)
Example 6: Pathway Enrichment Analysis
# R script for pathway enrichment
library(clusterProfiler)
library(org.Hs.eg.db)
library(ReactomePA)
# Read genes with variants (moderate/high impact)
genes_with_variants <- read.table("genes_with_variants.txt",
header=FALSE,
stringsAsFactors=FALSE)$V1
# Convert to ENTREZ IDs
gene_ids <- bitr(genes_with_variants,
fromType="SYMBOL",
toType="ENTREZID",
OrgDb=org.Hs.eg.db)
# GO enrichment
go_enrich <- enrichGO(gene = gene_ids$ENTREZID,
OrgDb = org.Hs.eg.db,
ont = "BP",
pAdjustMethod = "BH",
pvalueCutoff = 0.05,
qvalueCutoff = 0.2,
readable = TRUE)
# KEGG enrichment
kegg_enrich <- enrichKEGG(gene = gene_ids$ENTREZID,
organism = "hsa",
pvalueCutoff = 0.05)
# Reactome enrichment
reactome_enrich <- enrichPathway(gene = gene_ids$ENTREZID,
organism = "human",
pvalueCutoff = 0.05)
# Save results
write.csv(as.data.frame(go_enrich), "go_enrichment.csv")
write.csv(as.data.frame(kegg_enrich), "kegg_enrichment.csv")
write.csv(as.data.frame(reactome_enrich), "reactome_enrichment.csv")
Example 7: Non-coding Variant Annotation with ENCODE cCRE
#!/bin/bash
# ENCODE cCRE annotation workflow
VCF="non_coding_variants.vcf.gz"
CCRE_BED="ENCODE_cCREs.hg38.v4.bed.gz"
# Method 1: VEP custom annotation
vep -i ${VCF} \
-o variants.ccre.vcf \
--cache --dir_cache /path/to/vep_cache \
--assembly GRCh38 \
--format vcf \
--vcf \
--custom ${CCRE_BED},ENCODE_cCRE,bed,overlap,0
# Method 2: bedtools intersect (for detailed analysis)
bcftools query -f '%CHROM\t%POS0\t%END\n' ${VCF} > variants.bed
bedtools intersect -a variants.bed \
-b ${CCRE_BED} \
-wa -wb > variants_in_ccres.txt
# Filter for promoter-like and enhancer-like cCREs
# (class labels vary by release; SCREEN BED files may use PLS/ELS codes)
grep -E "promoter|enhancer|PLS|ELS" variants_in_ccres.txt > variants_regulatory.txt
Troubleshooting and Best Practices
Reference Genome Consistency
Critical: Use the same reference genome version (decoyed + alt-masked) throughout the entire pipeline. See Reference Genome Considerations in the Introduction for details.
Common issues:
- Coordinate mismatches between alignment and variant calling
- Dropped variants due to reference version differences
- Incorrect annotations from coordinate mismatches
Decoy and ALT masking: Always use for WGS (critical), recommended for WES (improves accuracy). See Reference Genome Considerations section in the Introduction for detailed rationale.
Common Errors and Solutions
Error: “Hash table not found”
- Solution: Verify hash table path and that it was built with the correct reference
Error: “Reference mismatch”
- Solution: Ensure all tools use the same reference genome version
Error: “Low mapping rate”
- Solution: Check read quality, adapter contamination, reference genome version
Error: “VQSR fails with insufficient samples”
- Solution: Use hard filtering instead of VQSR for small cohorts (<30 samples)
Performance Optimization
DRAGMAP alignment:
- Use multiple threads: -t parameter
- Ensure sufficient memory (64GB+ for WGS)
GATK variant calling:
- Use --java-options "-Xmx32G" for memory allocation
- Parallelize by chromosome when possible
- Use GVCF workflow for cohorts (more efficient)
VEP annotation:
- Use --offline mode with local cache
- Parallelize with the --fork option
- Use compressed input/output to save disk space
Resource Requirements
| Stage | Memory | CPU | Disk |
|---|---|---|---|
| DRAGMAP alignment | 32-64GB | 8+ cores | 100GB+ |
| BQSR | 16-32GB | 4+ cores | 50GB+ |
| HaplotypeCaller | 32-64GB | 8+ cores | 50GB+ |
| Mutect2 | 32-64GB | 8+ cores | 50GB+ |
| VEP annotation | 16-32GB | 4+ cores | 100GB+ (with cache) |
Quality Control Checkpoints
After alignment:
- Mapping rate >95%
- Mean coverage meets target
- Duplication rate acceptable
- Proper read group information
After variant calling:
- Variant count reasonable (not too high/low)
- Ti/Tv ratio ~2.0-2.1 (for germline)
- PASS filter rate acceptable
- No obvious technical artifacts
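One way to check the Ti/Tv figure directly, a short pysam sketch counting transitions and transversions over PASS biallelic SNVs (bcftools stats reports the same ratio; the input file name is a placeholder):
# titv_check.py - sketch: Ti/Tv over PASS biallelic SNVs with pysam.
import pysam

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
ti = tv = 0
for rec in pysam.VariantFile("sample1.filtered.vcf.gz"):
    if list(rec.filter) not in ([], ["PASS"]):   # keep unfiltered/PASS sites
        continue
    if rec.alts is None or len(rec.alts) != 1:
        continue                                 # biallelic only
    ref, alt = rec.ref.upper(), rec.alts[0].upper()
    if len(ref) != 1 or len(alt) != 1:
        continue                                 # skip indels/MNVs
    if (ref, alt) in TRANSITIONS:
        ti += 1
    else:
        tv += 1

print(f"Ti={ti} Tv={tv} Ti/Tv={ti / tv:.2f}" if tv else "no transversions found")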
After annotation:
- Annotation success rate >95%
- Expected annotation fields present
- No coordinate mismatches
References and Resources
Official Documentation
- DRAGMAP: https://github.com/Illumina/DRAGMAP
- GATK: https://gatk.broadinstitute.org/
- VEP: https://ensembl.org/info/docs/tools/vep/
- AlphaFold: https://alphafold.ebi.ac.uk/
- ENCODE: https://www.encodeproject.org/
AI Variant Effect Prediction Resources
- AlphaMissense: Database and VEP plugin - https://www.nature.com/articles/s41586-023-06345-5
- AlphaFold DB: https://alphafold.ebi.ac.uk/
- FoldX: Protein stability prediction - https://foldxsuite.crg.eu/
- SNPeffect 5.0: Integrated pipeline - https://github.com/vibbits/snpeffect
- EVE: Evolutionary model - https://evemodel.org/
- REVEL: Ensemble predictor - Available via dbNSFP
- PrimateAI: Deep learning model - Available via dbNSFP
Pathway Enrichment Resources
- GSEA: https://www.gsea-msigdb.org/gsea/index.jsp
- Enrichr: https://maayanlab.cloud/Enrichr/
- DAVID: https://davidbioinformatics.nih.gov/
- clusterProfiler: Bioconductor R package
- MSigDB: https://www.gsea-msigdb.org/gsea/msigdb/
- KEGG: https://www.genome.jp/kegg/
- Reactome: https://reactome.org/
- VarSAn: Variant set enrichment analysis tool
ENCODE cCRE Resources
- ENCODE cCRE Registry: https://www.encodeproject.org/data/annotations/
- SCREEN portal: https://screen.wenglab.org/
- ENCODE FTP: For downloading cCRE BED files
- cCRE documentation: Classification and usage guidelines
Key Papers
- DRAGEN/DRAGMAP: Behera, S., Catreux, S., Rossi, M. et al. Comprehensive genome analysis and variant detection at scale using DRAGEN. Nat Biotechnol 43, 1177–1191 (2025). https://doi.org/10.1038/s41587-024-02382-1
- GATK Best Practices: Van der Auwera & O’Connor (2020) Genomics in the Cloud
- AlphaMissense: Cheng et al. (2023) Nature - AlphaMissense predicts pathogenicity
- ENCODE cCRE: Moore et al. (2020) Nature - Expanded ENCODE annotations
- Pathway Enrichment: Subramanian et al. (2005) PNAS - GSEA method
Additional Tools and Databases
- gnomAD: Population frequency database - https://gnomad.broadinstitute.org/
- ClinVar: Clinical significance database - https://www.ncbi.nlm.nih.gov/clinvar/
- COSMIC: Cancer mutation database - https://cancer.sanger.ac.uk/cosmic
- OncoKB: Cancer therapeutic database - https://www.oncokb.org/
- dbNSFP: Functional prediction database - https://www.dbnsfp.org/
- Codon Usage Databases:
- CUTG (Codon Usage Database): http://www.kazusa.or.jp/codon/
- Kazusa Codon Usage Database: Species-specific codon usage tables
- Codon Usage Analysis Tools:
- CodonW: Codon usage analysis software
- CAI (Codon Adaptation Index): Measures codon usage bias
- tAI (tRNA Adaptation Index): Considers tRNA abundance in codon optimality
Conclusion
Key takeaways from this tutorial:
- Reference preparation is critical: Use decoyed and alt-masked references consistently
- DRAGEN features enhance accuracy: Enable DRAGEN mode for improved variant calling
- AI tools provide powerful annotation: AlphaMissense, AlphaFold, and FoldX offer structural insights
- Pathway enrichment reveals biology: Connect variants to biological function through enrichment analysis
- Non-coding variants matter: ENCODE cCRE annotation helps prioritize regulatory variants
For more advanced prioritization strategies and multi-tool consensus approaches, see the comprehensive variant analysis guidelines.