Introduction
Bulk RNA sequencing (RNA-seq) is a powerful transcriptomics technology that enables researchers to measure RNA abundance, a convenient proxy for gene expression, across the entire transcriptome. Unlike microarrays, RNA-seq provides a comprehensive, unbiased view of the transcriptome with high sensitivity and dynamic range. This technology has become a cornerstone of modern genomics research, enabling discoveries in disease mechanisms, drug development, and basic biology.
Applications of Bulk RNA-seq
Bulk RNA-seq has diverse applications in biological and biomedical research:
- Differential gene expression analysis: Identifying genes that are significantly up- or down-regulated between experimental conditions (e.g., treated vs. control, diseased vs. healthy)
- Transcriptome profiling: Characterizing the complete set of RNA transcripts in a sample, both qualitatively and quantitatively, revealing where and when genes are expressed
- Genome annotation: Identifying and annotating genes, transcripts, and their structures in newly sequenced genomes or improving existing annotations
- Alternative splicing analysis: Detecting and quantifying different transcript isoforms
- Gene fusion detection: Identifying chromosomal rearrangements that create fusion genes
- Expression quantitative trait loci (eQTL) mapping: Discovering genetic variants that influence gene expression
- Biomarker discovery: Finding gene expression signatures associated with disease states or treatment responses
- Pathway and network analysis: Understanding how genes interact and function together
Limitations of Bulk RNA-seq
Despite its power, bulk RNA-seq has important limitations:
- RNA abundance snapshot, not actual gene expression: RNA-seq measures RNA abundance at a specific time point, which represents a snapshot of the transcriptome. This reflects the balance between transcription, RNA processing, and degradation, but does not directly measure the actual rate of gene transcription or protein production. RNA abundance is influenced by both transcriptional activity and post-transcriptional regulation (RNA stability, degradation rates), making it an indirect proxy for gene expression rather than a direct measurement.
- Cell population averaging: Bulk RNA-seq measures average expression across all cells in a sample, losing information about cellular heterogeneity
- Inability to detect rare cell populations: Rare cell types (<5% of the population) may be masked by more abundant cell types
- Limited resolution for cell type-specific expression: Cannot distinguish which cell types express which genes in mixed populations
- Masking of cell-to-cell variation: Individual cell differences are averaged out
- Challenges with mixed cell populations: In tissues with multiple cell types, it’s difficult to attribute expression changes to specific cell populations
Emerging Alternatives
New technologies are addressing some limitations of bulk RNA-seq:
- Single-cell RNA-seq (scRNA-seq): Captures gene expression at the individual cell level, enabling the identification of cellular heterogeneity, rare cell populations, and cell type-specific expression patterns. However, it’s more expensive and technically challenging than bulk RNA-seq.
- Spatial transcriptomics: Preserves the spatial context of gene expression within tissues, allowing researchers to understand how gene expression varies across tissue regions. This is particularly valuable for understanding tissue architecture and cell-cell interactions.
- Direct transcription measurement technologies: While RNA-seq measures RNA abundance (a snapshot), several technologies directly measure transcriptional activity:
- GRO-seq (Global Run-On sequencing): Measures active transcription by sequencing nascent RNA transcripts labeled during transcription run-on assays. Provides direct measurement of transcription rates and identifies active transcription start sites.
- PRO-seq (Precision Run-On sequencing): An improved version of GRO-seq with higher resolution, allowing precise mapping of RNA polymerase positions and transcription dynamics.
- ChRO-seq (Chromatin Run-On sequencing): Combines chromatin accessibility with run-on sequencing to measure transcription while preserving chromatin context, providing insights into the relationship between chromatin state and transcription.
When to choose bulk vs. single-cell vs. spatial approaches:
- Bulk RNA-seq: Best for well-defined, homogeneous samples, large-scale studies, cost-effective screening, and when average expression is sufficient
- Single-cell RNA-seq: Best when cellular heterogeneity is important, rare cell populations need to be identified, or cell type-specific responses are of interest
- Spatial transcriptomics: Best when spatial organization matters, tissue architecture is important, or understanding regional expression patterns is the goal
Experimental Design
Proper experimental design is crucial for obtaining reliable and interpretable RNA-seq results. Several key considerations should be addressed before sample collection and sequencing:
Replication
Biological replication (independent biological samples) is essential for statistical power and generalizability:
- Minimum recommendations: At least 3 biological replicates per condition (2 is an absolute minimum but severely limits statistical power and variance estimation)
- More replicates: Provide greater statistical power and ability to detect smaller effect sizes
- Cost vs. power trade-off: More replicates are generally more valuable than deeper sequencing per sample
Technical replication (same sample sequenced multiple times) is less critical:
- Primarily useful for assessing technical variability
- Generally not necessary if library preparation is consistent
- Resources are better spent on biological replicates
Sample Size and Power Analysis
- Power analysis: Estimate required sample size based on expected effect sizes, variability, and desired statistical power (typically 80% power, α=0.05)
- Recommended tools: RNASeqPower or PROPER (R packages) for RNA-seq-specific power analysis. These tools account for RNA-seq-specific characteristics (overdispersion, count distributions) and can estimate the sample sizes needed to detect differential expression with the desired power
- Effect size considerations: Smaller fold changes require more replicates
- Variability: High biological variability requires more replicates
- Multiple comparisons: Account for the number of genes tested when planning sample size
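For intuition only, a normal-approximation sample-size formula shows how quickly the required replicate count grows as effect size shrinks. This is a back-of-envelope sketch, not a substitute for RNASeqPower/PROPER, which model count data and overdispersion:

```shell
# Approximate n per group for a two-group comparison:
#   n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2,  d = effect size in SD units.
awk 'BEGIN {
  za = 1.960; zb = 0.842;   # alpha = 0.05 (two-sided), power = 0.80
  for (d = 0.5; d <= 2.0; d += 0.5) {
    n = 2 * (za + zb)^2 / (d * d);
    printf "effect %.1f SD -> %d samples per group\n", d, int(n) + 1;
  }
}'
```

A one-standard-deviation effect needs roughly 16 samples per group under these settings, which is why small fold changes against high biological variability quickly become expensive to detect.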
Sequencing Depth Recommendations
Sequencing depth (number of reads per sample) is a critical factor affecting both cost and analysis quality:
- Minimum recommendations:
- Human/mammalian genomes: 20-30 million reads per sample (single-end) or 15-20 million read pairs (paired-end)
- Smaller genomes (e.g., yeast, bacteria): 5-10 million reads per sample
- Lower depth may be sufficient for highly expressed genes but will miss lowly expressed transcripts
- Depth vs. replicates trade-off:
- More replicates > deeper sequencing: Statistical power increases more with additional biological replicates than with deeper sequencing per sample
- Recommended strategy: 3-5 biological replicates at moderate depth (20-30M reads) is generally better than 2 replicates at very high depth (50M+ reads)
- Exception: For detection of very rare transcripts or alternative splicing, deeper sequencing may be necessary
- Depth requirements by analysis type:
- Differential expression: 20-30M reads per sample typically sufficient
- Alternative splicing: 30-50M reads per sample recommended
- Low-abundance transcripts: 50M+ reads may be needed
- Single-cell RNA-seq: Much lower depth per cell (10K-100K reads), but many cells needed
- Cost considerations:
- Sequencing costs scale linearly with depth
- Consider your research question: Do you need to detect lowly expressed genes, or are highly expressed genes sufficient?
- Power analysis tools can help optimize the balance between depth and replicates for your specific experimental goals
Control Groups
- Appropriate controls: Essential for meaningful comparisons (e.g., untreated controls, wild-type controls, baseline time points)
- Negative controls: Include to detect contamination or technical artifacts
- Positive controls: Known differentially expressed genes can validate the experiment
Randomization and Blocking
- Randomization: Randomize sample processing order to avoid systematic biases
- Blocking: Group related samples together (e.g., by batch, date, technician) to account for technical factors and avoid confounding effects
- Avoiding confounding: Ensure that experimental conditions are not perfectly correlated with technical factors (e.g., all control samples in one batch, all treated samples in another). Confounding makes it impossible to distinguish biological effects from technical artifacts.
- Balanced design: Ensure equal representation of conditions across batches/blocks when possible
Design Matrix
Plan your experimental design matrix to:
- Clearly define all experimental factors and their levels
- Account for potential confounders (batch, date, etc.)
- Enable proper statistical modeling in downstream analysis
- Document all experimental variables for inclusion in statistical models
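A design matrix like the one described above can start life as a plain tab-delimited sample sheet. The file name, sample names, and factor levels below are hypothetical:

```shell
# Hypothetical sample sheet: one row per sample, one column per factor.
{
  printf 'sample\tcondition\tbatch\n'
  printf 'ctrl_1\tcontrol\tB1\n'
  printf 'ctrl_2\tcontrol\tB2\n'
  printf 'trt_1\ttreated\tB1\n'
  printf 'trt_2\ttreated\tB2\n'
} > samples.tsv

# Check the design is balanced: every condition/batch combination appears once.
cut -f2,3 samples.tsv | tail -n +2 | sort | uniq -c
```

The same table can later be read into R as the colData for DESeq2, so recording all factors here (including batch) pays off at the modeling stage.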
RNA Sample Preparation and Library Construction
The quality of RNA samples and library construction directly impacts sequencing results and downstream analysis success.
RNA Extraction
Methods vary by sample type:
- Tissue samples: Mechanical disruption (homogenization, bead beating) followed by column-based or phenol-chloroform extraction
- Cell cultures: Direct lysis with appropriate buffers
- Blood/plasma: Specialized kits for low-abundance RNA or cell-free RNA
- Formalin-fixed paraffin-embedded (FFPE) samples: Specialized protocols for degraded RNA
Key considerations:
- Use RNase-free reagents and equipment
- Work quickly to minimize RNA degradation
- Store samples appropriately (typically -80°C for long-term storage)
RNA Quality Assessment
RIN (RNA Integrity Number) is a critical quality metric:
- Scale: 1-10 (10 = perfect integrity)
- Minimum recommendations:
- RIN ≥ 7 for most applications
- RIN ≥ 8 for sensitive analyses (e.g., alternative splicing)
- Lower RIN may be acceptable for degraded samples (FFPE) with appropriate protocols
- Assessment tools: Agilent Bioanalyzer, TapeStation, or similar instruments
Other quality metrics:
- 28S/18S rRNA ratio: Should be ~2:1 for intact total RNA
- DV200 (percentage of RNA fragments >200 nucleotides): Alternative metric for degraded samples
- Concentration: Ensure sufficient RNA for library preparation (typically 100 ng - 1 μg)
Library Types
Note: For some sample types, extremely abundant transcripts (e.g., hemoglobin RNAs in blood) should be removed or blocked so that sequencing coverage is not dominated by a few genes and other genes of interest are adequately covered.
PolyA-selected libraries:
- Enriches for polyadenylated mRNA
- Excludes extremely abundant rRNA (ribosomal RNA), which lacks polyA tails
- Best for: Standard mRNA expression analysis
- Limitations: Misses non-polyadenylated transcripts
rRNA-depleted libraries:
- Removes ribosomal RNA while retaining other RNA species
- Best for: Total RNA analysis, non-coding RNA studies, degraded samples
- Methods: Ribo-Zero, RiboMinus, or similar depletion kits
Total RNA libraries:
- Includes all RNA species
- Best for: Comprehensive transcriptome analysis
- Considerations: High rRNA content requires deeper sequencing
Stranded vs. Unstranded Libraries
Stranded libraries:
- Preserve information about which DNA strand was transcribed
- Advantages:
- Distinguish overlapping genes on opposite strands
- Better for alternative splicing analysis
- More accurate quantification
- Methods: dUTP-based, adaptor-based, or other strand-specific protocols
Unstranded libraries:
- Do not preserve strand information
- Use cases: Simple expression analysis when strand information isn’t critical
- Limitations: Cannot distinguish overlapping genes on opposite strands
Library Preparation Protocols
Common commercial kits:
- Illumina: TruSeq RNA Library Prep, NEBNext Ultra II
- Other platforms: Compatible with various sequencing platforms
Ultra-low Input kits:
- Use case: When RNA quantity is very limited (<100 ng, down to 10 ng or even single cells)
- Specialized protocols: Designed to work with minimal starting material
- Examples:
- SMART-Seq (Switching Mechanism at 5’ End of RNA Template) for single-cell or low-input RNA
- Ovation RNA-Seq System for low-input samples
- Clontech SMARTer kits for ultra-low input
- Considerations:
- May have higher technical variability due to amplification steps
- May introduce amplification biases
- Require careful quality control
- May need deeper sequencing to compensate for lower complexity libraries
Key steps:
- RNA fragmentation: Break RNA into appropriate sizes (typically 300-400 bp for 150 bp reads)
- First-strand cDNA synthesis: Reverse transcription
- Second-strand synthesis: Create double-stranded cDNA
- Adapter ligation: Add sequencing adapters
- PCR amplification: Enrich for adapter-ligated fragments
- Size selection: Remove fragments outside desired size range
- Quality control: Assess library quality before sequencing
Quality Control Before Sequencing
Library QC metrics:
- Concentration: Ensure sufficient library for sequencing
- Size distribution: Should match expected fragment size
- Adapter dimers: Should be minimal (can interfere with sequencing)
- Library complexity: Assessed through pre-sequencing QC
Pre-sequencing validation:
- qPCR or similar methods to verify library quality
- Small test sequencing run (optional but recommended for large studies)
- Check for contamination or adapter issues
Best Practices
- Standardize protocols: Use consistent methods across all samples
- Document everything: Record RNA quality metrics, library preparation dates, and any deviations
- Include controls: Process positive and negative controls alongside samples
- Minimize batch effects: Process samples in balanced batches when possible
- Store appropriately: Keep RNA and libraries at appropriate temperatures
- Track samples: Maintain clear sample tracking from collection through sequencing
Overview of the Analysis Workflow
A typical bulk RNA-seq differential analysis workflow consists of the following steps:
- Raw read quality control (FastQC): Assess the quality of sequencing reads
- Adapter trimming (fastp, Trimmomatic, Trim Galore): Remove adapter sequences if needed
- Read alignment (STAR, HISAT2, etc.): Map reads to a reference genome
- Post-alignment quality control (QoRTs, RSeQC, Qualimap): Assess alignment quality
- Expression quantification (featureCounts): Count reads mapping to genes
- Contamination detection: Identify and address rRNA, genomic DNA, or environmental contamination
- Differential expression analysis (DESeq2, edgeR, limma/voom): Identify significantly changed genes
- Functional enrichment analysis (ORA, GSEA): Understand biological meaning of results
Prerequisites and Software Requirements
This tutorial assumes you have:
- Basic familiarity with command-line interfaces (bash/shell)
- Access to a Unix-like system (Linux or macOS)
- Basic understanding of RNA-seq concepts
- R programming knowledge (for differential expression and enrichment analysis)
- The following software tools installed and in your PATH:
- FastQC
- fastp (or Trimmomatic/Trim Galore)
- STAR (or HISAT2/GSNAP)
- samtools
- featureCounts
- QoRTs (or RSeQC/Qualimap)
- R with Bioconductor packages (DESeq2, fgsea, etc.)
Common File Formats in RNA-seq Analysis
Understanding file formats is essential for RNA-seq analysis. Here’s a quick reference:
| Format | Description | When Used | Tools |
|---|---|---|---|
| FASTQ | Raw sequencing reads with quality scores | Input for QC and alignment | FastQC, fastp, STAR |
| SAM | Text-based alignment format | Human-readable alignment output | Alignment tools, samtools |
| BAM | Binary version of SAM (compressed) | Storage-efficient alignment format | samtools, IGV, featureCounts |
| GTF/GFF3 | Gene annotation files | Reference for alignment and quantification | STAR index, featureCounts |
| BigWig | Compressed coverage track | Visualization in genome browsers | STAR, IGV, UCSC Genome Browser |
| BedGraph | Uncompressed coverage track | Alternative to BigWig | STAR, conversion tools |
| Count matrix | Tab-delimited gene counts | Input for differential expression | DESeq2, edgeR, limma |
| VCF | Variant call format | Variant detection (if applicable) | Variant callers |
Key format details:
- FASTQ quality encoding: Phred scores can be encoded in different formats (Sanger/Illumina 1.8+, Illumina 1.3+, etc.). FastQC automatically detects encoding.
- BAM vs SAM: BAM is compressed and indexed for efficient access. Use samtools view to convert between formats.
- GTF vs GFF3: Both contain gene/transcript annotations. GTF is more commonly used. Ensure your annotation matches your reference genome version.
- Count matrix format: Typically tab-delimited with genes as rows and samples as columns. First column usually contains gene IDs.
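As a minimal sketch of this layout (gene IDs and counts fabricated), the following builds a tiny count matrix and computes each sample's library size with awk:

```shell
# Toy count matrix: genes as rows, samples as columns, gene IDs in column 1.
{
  printf 'gene_id\tctrl_1\tctrl_2\ttrt_1\n'
  printf 'GENE_A\t10\t12\t40\n'
  printf 'GENE_B\t0\t1\t0\n'
  printf 'GENE_C\t100\t90\t110\n'
} > counts.tsv

# Column sums give each sample's library size (total assigned reads).
awk -F'\t' 'NR == 1 { for (i = 2; i <= NF; i++) h[i] = $i; next }
            { for (i = 2; i <= NF; i++) s[i] += $i }
            END { for (i = 2; i <= NF; i++) printf "%s\t%d\n", h[i], s[i] }' counts.tsv
```

A matrix in exactly this shape is what featureCounts emits (minus its extra annotation columns) and what DESeq2/edgeR expect as input.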
Computational Resources and Performance
RNA-seq analysis can be computationally intensive. Here are typical resource requirements:
Memory (RAM) requirements:
- STAR alignment:
- Human genome: ~30-40 GB RAM for indexing, ~16-32 GB for alignment
- Smaller genomes (mouse, etc.): ~16-24 GB for indexing, ~8-16 GB for alignment
- Can be reduced with the --limitGenomeGenerateRAM option
- featureCounts: ~4-8 GB RAM typically sufficient
- DESeq2/R analysis: ~4-16 GB depending on dataset size
- FastQC: Minimal memory requirements (~1-2 GB)
CPU requirements:
- STAR: Highly parallelizable, benefits from multiple cores (8-16 cores recommended)
- featureCounts: Can use multiple threads with the -T option
- Most tools: Benefit from multi-core systems
Storage requirements:
- Raw FASTQ files: ~1-3 GB per sample (compressed)
- BAM files: ~2-5 GB per sample (already compressed; avoid keeping uncompressed SAM)
- Index files: ~30-50 GB for human genome (STAR index)
- Intermediate files: Plan for 2-3x the size of raw data
Performance tips:
- Use compressed formats (BAM instead of SAM, gzipped FASTQ)
- Parallelize when possible (process multiple samples simultaneously)
- Consider cloud computing (AWS, Google Cloud, Azure) for large datasets
- Use cluster computing (SLURM, PBS) for institutional resources
- Monitor disk space regularly - RNA-seq data can accumulate quickly
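A minimal sketch of sample-level parallelization using plain background shell jobs. The sample names and the placeholder command are hypothetical; GNU parallel or a workflow manager (Snakemake, Nextflow) scales this pattern better:

```shell
# Run per-sample steps as parallel background jobs. The echo is a stand-in;
# a real pipeline would call fastqc / fastp / STAR on each sample here.
for s in sample_A sample_B sample_C; do
  (
    echo "processing $s" > "$s.log"   # placeholder for real per-sample work
  ) &
done
wait   # block until every background job has finished
cat sample_A.log sample_B.log sample_C.log
```

Keep the number of concurrent jobs below your core and RAM limits: three simultaneous STAR runs against the human genome need roughly three times the alignment memory quoted above.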
Section 1: Raw Read Quality Control (FastQC)
Quality control is the first and crucial step in any RNA-seq analysis. FastQC is a widely used tool that provides a comprehensive quality assessment of raw sequencing reads before alignment.
Introduction to FastQC
FastQC generates a detailed HTML report with multiple quality metrics, helping you identify potential issues with your sequencing data. It examines various aspects of read quality, including base quality scores, sequence composition, adapter contamination, and more.
Running FastQC on Raw Reads
For single-end reads:
# Basic FastQC command for single-end reads
fastqc sample_R1.fastq.gz -o output_directory/
# With thread and memory options
fastqc sample_R1.fastq.gz -o output_directory/ \
-t 4 \ # Number of threads (default: 1)
--memory 4000 # Memory limit in MB (default: 250)
# Process multiple files
fastqc *.fastq.gz -o qc_reports/ -t 8
For paired-end reads:
# FastQC processes each file separately
fastqc sample_R1.fastq.gz sample_R2.fastq.gz -o output_directory/ \
-t 4 \ # Number of threads
--memory 512 # Memory limit in MB
FastQC will generate an HTML report and a zip file containing the underlying data for each input file.
Interpreting FastQC Reports
FastQC provides several key metrics:
Per Base Sequence Quality
This plot shows the quality scores (Phred scores) at each position along the reads. Quality scores indicate the probability of an incorrect base call:
- Good quality: Scores above 28 (green region), indicating <0.16% error rate
- Warning: Scores between 20-28 (yellow region)
- Fail: Scores below 20 (red region), indicating >1% error rate
What to look for:
- Quality should remain high throughout the read
- Degradation at the 3’ end is common in RNA-seq but should be minimal
- Consistent quality across all positions is ideal
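The Phred thresholds above map to error probabilities via P = 10^(-Q/10); a quick awk check (no external tools assumed) reproduces the percentages quoted for the green and red regions:

```shell
# Error probability from a Phred score: P = 10^(-Q/10).
# Q20 -> 1% error (fail threshold); Q28 -> ~0.16% (good); Q30 -> 0.1%.
for q in 20 28 30 40; do
  awk -v q="$q" 'BEGIN { printf "Q%d -> %.4f%% error\n", q, 100 * 10^(-q / 10) }'
done
```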
Per Sequence Quality Scores
This shows the distribution of mean quality scores per read. Most reads should have high mean quality scores (>28).
What to look for:
- A single peak at high quality indicates good overall sequencing
- Multiple peaks or a peak at low quality suggests problems
Per Base Sequence Content
Shows the proportion of each base (A, T, G, C) at each position. In RNA-seq, you may see some bias at the beginning of reads due to random hexamer priming.
What to look for:
- Base proportions should reflect the GC content of coding regions for your species (after the first few bases, which may show random hexamer priming bias)
- For most species, coding regions have GC content between 40-60%, so you should see roughly balanced base composition (more precisely, A% ≈ T% and G% ≈ C%)
- Strong bias throughout the read that doesn’t match expected species GC content may indicate contamination or adapter issues
Per Sequence GC Content
Shows the distribution of GC content across all reads. This should form a normal distribution centered around the expected GC content for your organism.
What to look for:
- A normal distribution centered on the expected GC content
- Multiple peaks may indicate contamination with sequences from different organisms
- Skewed distributions may indicate technical issues (sequencing artifacts)
Sequence Length Distribution
Shows the distribution of read lengths. All reads should be the same length (unless you’re using variable-length reads).
What to look for:
- Consistent read lengths
- Unexpected length variations may indicate adapter contamination or sequencing issues
Sequence Duplication Levels
Shows the percentage of sequences that appear multiple times. High duplication can indicate:
- PCR over-amplification
- Low complexity libraries
- Over-sequencing (sequencing the same fragments multiple times)
What to look for:
- Some duplication is normal, especially for highly expressed genes
- Very high duplication (>70%) may indicate problems
Overrepresented Sequences
Identifies sequences that appear more frequently than expected. These are often:
- Adapter sequences
- Contamination
- Highly expressed transcripts
What to look for:
- Check if overrepresented sequences match known adapters
Adapter Content
Shows the percentage of adapter sequence at each position. Adapters should only appear at the ends of reads if present.
What to look for:
- Adapter content should be minimal (<5%)
- High adapter content (>20%) indicates the need for trimming
- Adapters appearing in the middle of reads suggest poor quality or very short inserts
Common Quality Issues and Solutions
| Issue | Cause | Solution |
|---|---|---|
| Low quality at 3’ end | RNA degradation | Check RNA integrity (RIN score), use fresh samples |
| High adapter content | Short inserts or adapter contamination | Trim adapters using fastp, Trimmomatic, or Trim Galore |
| High duplication | PCR over-amplification or over-sequencing | May need to reduce PCR cycles or sequence less |
| Uneven base composition | Random hexamer bias (normal) or contamination | First few bases may show bias (normal), check for contamination |
| Multiple GC content peaks | Contamination | Identify and remove contaminating sequences |
Decision Point: When to Trim Adapters
Trim adapters if:
- Adapters are present in >20% of reads and account for a significant portion of read length; trimming can improve mapping rates, although many read aligners soft-clip adapter bases during alignment
- Overrepresented sequences match known adapters
- Read quality is poor at the ends
- You’re working with very short RNA species (e.g., miRNAs)
You may skip trimming if:
- Adapter content is <20%
- Reads are long (>75bp) and quality is good
- No overrepresented adapter sequences are detected
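FastQC's zip output includes a summary.txt with one PASS/WARN/FAIL line per module, so the decision above can be automated by checking the adapter module. The summary content below is fabricated for illustration:

```shell
# Fabricated FastQC summary.txt (real ones live inside sample_fastqc.zip,
# one "STATUS<TAB>Module<TAB>Filename" line per module).
{
  printf 'PASS\tPer base sequence quality\tsample_R1.fastq.gz\n'
  printf 'FAIL\tAdapter Content\tsample_R1.fastq.gz\n'
} > summary.txt

# Trim only when the adapter module warns or fails.
status=$(awk -F'\t' '$2 == "Adapter Content" { print $1 }' summary.txt)
if [ "$status" = "FAIL" ] || [ "$status" = "WARN" ]; then
  echo "trim adapters"
else
  echo "skip trimming"
fi
```

For large cohorts, a gate like this is easy to drop into a pipeline so that the trimming step only runs on samples that need it.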
Section 2: Adapter Trimming (Conditional)
If FastQC indicates high adapter content or other quality issues, adapter trimming becomes necessary. This step removes adapter sequences, low-quality bases, and short reads that could interfere with alignment.
Overview of Trimming Tools
Several tools are available for adapter trimming and quality filtering:
| Tool | Language | Speed | Features | Best For |
|---|---|---|---|---|
| fastp | C++ | Very Fast | Auto adapter detection, quality filtering, deduplication | Large datasets, all-in-one preprocessing |
| Trimmomatic | Java | Fast | Highly configurable, flexible | Custom trimming needs |
| Trim Galore | Perl (wrapper) | Fast | Auto adapter detection, wrapper around Cutadapt | Easy-to-use, automatic detection |
fastp (Detailed Example)
fastp is a fast, all-in-one FASTQ preprocessor that performs adapter trimming, quality filtering, and other preprocessing tasks in a single step. If adapter sequences are not provided, fastp automatically detects and removes them. However, supplying adapter sequences explicitly is recommended to avoid incorrect trimming, especially for miRNA-seq data.
Complete Example with Common Options
For more options, see fastp documentation
fastp -i sample_R1.fastq.gz \
-I sample_R2.fastq.gz \
-o trimmed_R1.fastq.gz \
-O trimmed_R2.fastq.gz \
--detect_adapter_for_pe \ # Explicitly enable adapter detection for paired-end
-h fastp_report.html \
-j fastp_report.json \
-q 20 \ # Per-base quality threshold
-u 30 \ # Max percent of low-quality bases allowed per read
-l 50 \ # Minimum read length after trimming (long form: --length_required)
--trim_poly_g \ # Trim polyG tails (needed for NextSeq/NovaSeq data)
--trim_poly_x \ # Trim polyX tails
--correction \ # Enable base correction in overlapped regions (paired-end)
--overlap_len_require 10 \ # Minimum overlap length for overlap analysis
--overlap_diff_limit 1 \ # Maximum mismatches allowed in the overlap
-w 4 # Number of threads
Output Files Explanation
- trimmed_R1.fastq.gz / trimmed_R2.fastq.gz: Trimmed reads
- fastp_report.html: HTML report with statistics and plots
- fastp_report.json: JSON file with detailed statistics (for programmatic access)
The HTML report includes:
- Before/after quality score distributions
- Adapter trimming statistics
- Length distribution
- Duplication analysis
- Quality filtering statistics
Advanced Options
# Enable base correction for paired-end reads
--correction
# Specify adapter sequences manually
--adapter_sequence AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
--adapter_sequence_r2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
# Disable quality filtering (only trim adapters)
--disable_quality_filtering
# Keep unpaired reads whose mate was filtered out (for paired-end)
--unpaired1 unpaired_R1.fastq.gz
--unpaired2 unpaired_R2.fastq.gz
# Merge overlapping reads
--merge
Brief Mentions of Other Tools
Trimmomatic
Trimmomatic is a Java-based tool offering fine-grained control over trimming parameters:
# list all possible adapter sequences in fasta format in the file: adapters.fa
java -jar trimmomatic.jar PE \
input_R1.fastq.gz input_R2.fastq.gz \
output_R1_paired.fastq.gz output_R1_unpaired.fastq.gz \
output_R2_paired.fastq.gz output_R2_unpaired.fastq.gz \
ILLUMINACLIP:adapters.fa:2:30:10 \
LEADING:3 TRAILING:3 \
SLIDINGWINDOW:4:20 \
MINLEN:50
Key features:
- Highly configurable
- Multiple trimming strategies
- Handles both single-end and paired-end data
Trim Galore
Trim Galore is a wrapper around Cutadapt that provides automatic adapter detection:
trim_galore --paired \
--quality 20 \
--length 50 \
sample_R1.fastq.gz sample_R2.fastq.gz
Key features:
- Automatic adapter detection
- User-friendly interface
- Integrates with FastQC
Quality Control After Trimming
After trimming, always re-run FastQC on the trimmed reads to verify that:
- Adapter content has been reduced
- Quality scores have improved
- Read lengths are appropriate
- No new issues have been introduced
Best Practices and Recommendations
- Always check FastQC before and after trimming to verify improvement
- Be conservative with quality thresholds - too aggressive filtering may remove valid data
- Keep unpaired reads from paired-end trimming if they’re long enough
- Document your trimming parameters for reproducibility
- Consider your read length - very short reads (<25bp) may not align well
Section 3: Read Alignment
After quality control and trimming, reads must be aligned (mapped) to a reference genome or transcriptome. This step determines where each read originated in the genome, enabling downstream quantification and analysis. When necessary, transgene or pathogen sequences can be added to the reference genome, with corresponding entries added to the annotation file.
Reference Genome and Annotation Selection
Before alignment, you need to select appropriate reference genome and annotation files. This is a critical decision that affects all downstream analysis.
Choosing a Reference Genome
For human samples:
- T2T-CHM13 (T2T v2.0): Most complete human genome assembly (2022)
- Telomere-to-telomere assembly with complete chromosome sequences
- Fills gaps present in GRCh38 (centromeres, telomeres, segmental duplications)
- Based on CHM13 cell line (haploid)
- Best for: Research requiring complete genome coverage, studies of repetitive regions, centromeres, telomeres
- Considerations:
- Newer assembly, so some tools/annotations may not yet support it
- Annotation resources (GENCODE, Ensembl) are being developed but may be less mature
- May require more computational resources
- Available from: NCBI, T2T Consortium
- GRCh38 (hg38): Current standard, recommended for most analyses
- GRCh37 (hg19): Legacy version, still used in some studies
- May be required for compatibility with existing datasets
- Less complete than GRCh38 or T2T-CHM13
- Important: Ensure all tools and annotations use the same genome version
- Best for: Compatibility with older datasets and pipelines
For other organisms:
- Check organism-specific databases (Ensembl, NCBI, organism-specific resources)
- Use the most recent, well-annotated assembly
- Consider strain-specific genomes if working with specific strains
Key considerations:
- Version consistency: Genome, annotation (GTF/GFF), and any pre-built indexes must all match
- Primary vs. alternative contigs: Decide if you need alternative haplotypes (usually not necessary for standard RNA-seq)
- Patch versions: Minor updates to assemblies; usually safe to use latest patch
Annotation File Selection
Annotation sources:
- GENCODE (human, mouse): High-quality, comprehensive annotations
- Includes protein-coding and non-coding genes
- Multiple transcript isoforms per gene
- Regular updates
- Recommended for human/mouse studies
- Ensembl: Broad species coverage, regularly updated
- RefSeq: Curated annotations, often more conservative (fewer isoforms)
- NCBI: Official annotations, may lag behind other sources
File format:
- GTF (Gene Transfer Format): More commonly used, compatible with most tools
- GFF3 (General Feature Format version 3): Alternative format, some tools prefer GTF
Version matching:
- Annotation file must match genome assembly version
- Check release dates and version numbers
- Example: GRCh38.p14 genome should use GENCODE release matching GRCh38
Download locations:
- GENCODE: https://www.gencodegenes.org/
- Ensembl: https://www.ensembl.org/info/data/ftp/index.html
- NCBI: https://www.ncbi.nlm.nih.gov/genome/
- UCSC: https://hgdownload.soe.ucsc.edu/downloads.html
Best practices:
- Document the exact genome and annotation versions used
- Download from official sources
- Verify file integrity (checksums)
- Keep a record of download dates and versions for reproducibility
- For multi-species studies, ensure consistent annotation quality across species
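One way to satisfy the integrity-check recommendation above is a checksum comparison. The file and checksum below are fabricated for illustration; in practice the checksum file is published alongside the download:

```shell
# Fabricated one-line GTF standing in for a real downloaded annotation.
printf 'chr1\ttest\tgene\t1\t100\t.\t+\t.\tgene_id "G1";\n' > annotation.gtf

# In practice the .sha256 (or .md5) file comes from the download site;
# here we generate it ourselves just to show the verification step.
sha256sum annotation.gtf > annotation.gtf.sha256
sha256sum -c annotation.gtf.sha256   # prints "annotation.gtf: OK" when intact
```

Keeping the checksum file next to the genome and annotation makes it cheap to re-verify them before every new analysis.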
Overview of Alignment Tools
Multiple alignment tools are available, each with different strengths:
| Tool | Type | Speed | Splice-aware | Best For |
|---|---|---|---|---|
| STAR | Aligner | Fast | Yes | Standard RNA-seq, large genomes |
| HISAT2 | Aligner | Very Fast | Yes | Standard RNA-seq, memory-efficient |
| GSNAP | Aligner | Moderate | Yes | Variant-aware alignment |
| Salmon | Quasi-mapper | Very Fast | No | Quantification-focused, transcriptome |
| Kallisto | Pseudo-aligner | Very Fast | No | Quantification-focused, transcriptome |
| Bowtie2-RSEM | Aligner/Quantifier | Slow | No | Quantification with uncertainty, transcriptome |
Key differences:
- Aligners (STAR, HISAT2, GSNAP): Generate BAM files with genomic alignment coordinates. For a benchmark comparing the three aligners, see here
- Quasi-mappers/Pseudo-aligners (Salmon, Kallisto): Focus on quantification; faster, but produce no BAM output. For best results, use a genome-decoyed transcriptome index
- Quantifiers (RSEM): Can quantify from transcriptome alignments produced by Bowtie2, or run the aligner itself internally
STAR Alignment (Detailed Example)
STAR (Spliced Transcripts Alignment to a Reference) is a popular seed-extension aligner that handles spliced alignments efficiently and accurately.
Building Genome Index
Before alignment, STAR requires a genome index:
STAR --runMode genomeGenerate \
--genomeDir /path/to/genome_index \
--genomeFastaFiles /path/to/genome.fa \
--sjdbGTFfile /path/to/annotations.gtf \
--sjdbOverhang 99 \
--runThreadN 8
Key parameters:
- --genomeDir: Directory where the index will be created
- --genomeFastaFiles: Reference genome FASTA file
- --sjdbGTFfile: Gene annotation file (GTF or GFF)
- --sjdbOverhang: Should be the maximal read length minus 1 (99 for 100 bp reads)
- --runThreadN: Number of threads
Alignment Command
Single-end reads:
STAR --genomeDir /path/to/genome_index \
--readFilesIn sample.fastq.gz \
--readFilesCommand zcat \
--outFileNamePrefix sample_ \
--outSAMtype BAM SortedByCoordinate \
--runThreadN 8
Paired-end reads:
STAR --genomeDir /path/to/genome_index \
--readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \
--readFilesCommand zcat \
--outFileNamePrefix sample_ \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes Standard \
--runThreadN 8
Complete example with common options for mammalian RNA-seq read alignment:
STAR --genomeDir /path/to/genome_index \
--readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \
--readFilesCommand zcat \
--outFileNamePrefix sample_ \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes Standard \
--outFilterType BySJout \
--outFilterMultimapNmax 20 \
--alignSJoverhangMin 8 \
--alignSJDBoverhangMin 1 \
--outFilterMismatchNmax 999 \
--outFilterMismatchNoverReadLmax 0.04 \
--alignIntronMin 20 \
--alignIntronMax 1000000 \
--alignMatesGapMax 1000000 \
--outReadsUnmapped Fastx \
--runThreadN 8
Output Files Explanation
STAR generates several output files:
- sample_Aligned.sortedByCoord.out.bam: Sorted BAM file with alignments
- sample_Log.final.out: Summary statistics (mapping rate, etc.)
- sample_Log.out: Detailed alignment log
- sample_Log.progress.out: Progress information
- sample_SJ.out.tab: Detected splice junctions
- sample_Unmapped.out.mate1/2: Unmapped reads (if requested)
Key statistics in Log.final.out:
- Uniquely mapped reads %: Percentage of reads mapping to a single location
- % of reads mapped to multiple loci: Reads mapping to multiple locations
- % of reads unmapped: Reads that couldn’t be aligned
- % of reads unmapped: too short: Reads too short to align
- % of reads unmapped: other: Other unmapping reasons
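Because Log.final.out uses a simple "metric | value" layout, it is straightforward to parse when aggregating many samples. A minimal Python sketch (the excerpt values below are invented):

```python
def parse_star_log(text):
    """Parse STAR's Log.final.out ('metric | value' lines) into a dict."""
    stats = {}
    for line in text.splitlines():
        if "|" in line:
            key, _, value = line.partition("|")
            stats[key.strip()] = value.strip()
    return stats

# Invented excerpt mimicking the Log.final.out layout
log_excerpt = (
    "Uniquely mapped reads number |\t18500000\n"
    "Uniquely mapped reads % |\t92.50%\n"
    "% of reads mapped to multiple loci |\t4.10%\n"
    "% of reads unmapped: too short |\t2.80%\n"
)

stats = parse_star_log(log_excerpt)
unique_pct = float(stats["Uniquely mapped reads %"].rstrip("%"))
print(unique_pct)  # 92.5
```

A loop over all samples' logs plus a threshold on `unique_pct` makes a quick project-wide sanity check.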
Generating BigWig Coverage Tracks
BigWig files are useful for visualizing coverage in genome browsers like IGV or UCSC Genome Browser.
Using STAR’s built-in wiggle output:
STAR --genomeDir /path/to/genome_index \
--readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \
--readFilesCommand zcat \
--outFileNamePrefix sample_ \
--outSAMtype BAM SortedByCoordinate \
--outWigType wiggle \
--outWigStrand Stranded \
--outWigNorm RPM \
--runThreadN 8
Converting wiggle to bigwig format:
STAR outputs wiggle files, which can be converted to bigwig with the UCSC wigToBigWig utility. Do not sort a wiggle file with sort: the fixedStep/variableStep declaration lines must stay with their data blocks. (If you prefer sortable output, use --outWigType bedGraph and bedGraphToBigWig instead.)
# Convert to bigwig (requires a chromosome sizes file)
wigToBigWig sample_Signal.Unique.str1.out.wig chrom.sizes sample.bw
Alternative: Generate bigwig from BAM file
You can also generate bigwig files directly from BAM files using tools like bamCoverage from deepTools:
# Install deepTools (if not already installed)
# conda install -c bioconda deeptools
# Generate bigwig with RPKM normalization
bamCoverage -b sample_Aligned.sortedByCoord.out.bam \
-o sample_RPKM.bw \
--normalizeUsing RPKM \
--binSize 10 \
--smoothLength 50
# Generate bigwig with CPM normalization
bamCoverage -b sample_Aligned.sortedByCoord.out.bam \
-o sample_CPM.bw \
--normalizeUsing CPM \
--binSize 10
# Generate strand-specific bigwig
bamCoverage -b sample_Aligned.sortedByCoord.out.bam \
-o sample_forward.bw \
--normalizeUsing RPKM \
--filterRNAstrand forward
bamCoverage -b sample_Aligned.sortedByCoord.out.bam \
-o sample_reverse.bw \
--normalizeUsing RPKM \
--filterRNAstrand reverse
Normalization options:
- RPKM (Reads Per Kilobase per Million): Normalizes for sequencing depth and gene length
- CPM (Counts Per Million): Normalizes only for sequencing depth
Uses for visualization:
- IGV (Integrative Genomics Viewer): Load BAM and bigwig files to visualize coverage
- UCSC Genome Browser: Upload bigwig files for web-based visualization
- Publication figures: Export coverage tracks for manuscripts
Parameter Optimization
Key STAR parameters to consider:
- --outFilterMultimapNmax: Maximum number of multiple alignments (default: 10; increase for repetitive regions)
- --outFilterMismatchNmax: Maximum mismatches per read (default: 10)
- --alignIntronMin/--alignIntronMax: Minimum/maximum intron length
- --alignMatesGapMax: Maximum gap between paired-end mates
- --twopassMode Basic: Two-pass alignment for better splice junction detection
- More parameter tuning: see here
Brief Mentions of Other Tools
HISAT2
HISAT2 is a fast, memory-efficient, splice-aware aligner built on hierarchical graph FM indexes:
hisat2 -x genome_index \
-1 sample_R1.fastq.gz -2 sample_R2.fastq.gz \
-S sample.sam \
--threads 8
Best for: Memory-constrained environments, standard RNA-seq
GSNAP
GSNAP, a seed-extension aligner, handles variants and indels well:
# -d: genome index name; -D: directory containing the index
# -v: SNP index name (built beforehand with snpindex)
# -N 1: allow discovery of novel splice sites
# -m: max mismatches; -i: indel penalty; -t: threads
gsnap -d genome_index \
-D /path/to/genome_dir \
-v known_snps \
-N 1 \
-m 5 \
-i 2 \
-t 8 \
-A sam \
--gunzip \
sample_R1.fastq.gz sample_R2.fastq.gz > sample.sam
Best for: Variant-aware alignment, samples with known polymorphisms
Salmon
Salmon is a fast quantification tool that doesn’t produce BAM files:
# download reference genome and transcriptome fasta files
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M23/gencode.vM23.transcripts.fa.gz
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M23/GRCm38.primary_assembly.genome.fa.gz
# get decoy genome fasta headers
grep "^>" <(gunzip -c GRCm38.primary_assembly.genome.fa.gz) | cut -d " " -f 1 > decoys.txt
sed -i.bak -e 's/>//g' decoys.txt
# concatenate genome and transcriptome fasta files
cat gencode.vM23.transcripts.fa.gz GRCm38.primary_assembly.genome.fa.gz > gentrome.fa.gz
# generate Salmon index. option `--gencode` is only needed if fasta files are from Gencode
salmon index -t gentrome.fa.gz -d decoys.txt -p 12 -i salmon_index --gencode
# Salmon quantification, using the decoy-aware index built above
salmon quant -i salmon_index \
-l A \
-1 sample_R1.fastq.gz -2 sample_R2.fastq.gz \
-o salmon_output \
--threads 8
Best for: Fast quantification when BAM files aren’t needed
Kallisto
Kallisto is an ultra-fast pseudo-aligner (build the index first with kallisto index -i transcriptome_index transcripts.fa.gz):
kallisto quant -i transcriptome_index \
-o kallisto_output \
--threads 8 \
sample_R1.fastq.gz sample_R2.fastq.gz
Best for: Very fast quantification, large-scale studies
RSEM
RSEM provides quantification with uncertainty estimates using an EM algorithm. It can take transcriptome BAM output generated by Bowtie2, or run Bowtie2 itself to align reads to the transcriptome:
# reference built beforehand with rsem-prepare-reference
# note: --bowtie2 mode may require uncompressed FASTQ input
rsem-calculate-expression --paired-end \
--bowtie2 \
-p 8 \
sample_R1.fastq sample_R2.fastq \
rsem_index \
sample_output
Best for: Quantification with confidence intervals, uncertainty modeling
Section 4: Post-Alignment Quality Control
After alignment, it’s crucial to assess the quality of your alignments. Post-alignment QC helps identify issues like poor alignment rates, strand-specificity problems, or contamination that may not have been apparent in raw read QC.
Overview of Post-Alignment QC Tools
Several tools are available for post-alignment quality assessment:
| Tool | Features | Output | Best For |
|---|---|---|---|
| QoRTs | Comprehensive RNA-seq metrics | HTML reports, R objects | Detailed RNA-seq-specific QC |
| RSeQC | Modular, many metrics | Text reports, plots | Flexible, customizable QC |
| Qualimap | Interactive reports | HTML reports | User-friendly, comprehensive |
QoRTs (Detailed Example)
QoRTs (Quality of RNA-seq Tool-set) is specifically designed for RNA-seq data and provides comprehensive quality metrics. QoRTs generates detailed quality control reports including junction annotation, read distribution, gene body coverage, and various quality metrics. It’s particularly useful for identifying RNA-seq-specific issues.
Running QoRTs on BAM Files
QoRTs requires a sample file that lists all BAM files to be processed along with their metadata. This allows batch processing and proper organization of QC results.
Sample file format:
QoRTs accepts a tab-delimited file with sample information. The minimal format requires at least a sample ID and BAM file path:
For more comprehensive analysis and better organization, include additional metadata columns (see the example below). Sample file format:
- Tab-delimited, one sample per line
- First line is a header naming the columns (e.g., sampleID, bamFile)
- BAM paths can be absolute or relative
- Empty lines are ignored
Column descriptions:
- sampleID: Unique identifier for each sample (required)
- bamFile: Path to the BAM file, can be relative or absolute (required)
- condition: Experimental condition (e.g., control, treated) - useful for grouping in QC reports
- batch: Sequencing or processing batch - helps identify batch effects
- replicate: Replicate number - tracks biological replicates
- date: Processing date - useful for tracking temporal effects
- technician: Person who processed the sample - identifies technician-specific effects
- notes: Additional comments or metadata - free text for any relevant information
Creating the sample file:
# Create sample file manually or programmatically
cat > samples.txt << EOF
sampleID bamFile condition batch replicate date technician notes
sample1 sample1_Aligned.sortedByCoord.out.bam control batch1 rep1 2025-01-15 tech1 Control group
sample2 sample2_Aligned.sortedByCoord.out.bam control batch1 rep2 2025-01-15 tech1 Control group
sample3 sample3_Aligned.sortedByCoord.out.bam treated batch1 rep1 2025-01-15 tech1 Treated group
sample4 sample4_Aligned.sortedByCoord.out.bam treated batch1 rep2 2025-01-15 tech1 Treated group
EOF
Basic command with sample file:
java -jar QoRTs.jar QC \
samples.txt \
annotations.gtf \
qc_output_directory/
Single BAM file (alternative method):
If processing a single file, you can create a minimal sample file:
# Create sample file for single BAM
cat > single_sample.txt << EOF
sampleID bamFile
sample1 sample1_Aligned.sortedByCoord.out.bam
EOF
java -jar QoRTs.jar QC \
single_sample.txt \
annotations.gtf \
qc_output_directory/
With additional options:
java -jar QoRTs.jar QC \
--stranded \
--singleEnded \
samples.txt \
annotations.gtf \
qc_output_directory/
Interpreting QoRTs Reports
QoRTs generates comprehensive reports with multiple components:
Junction annotation:
- Known vs. novel splice junctions
- Junction quality scores
- Junction coverage
Read distribution:
- Percentage of reads in exons, introns, intergenic regions
- Strand-specificity assessment
- Expected vs. observed distributions
Gene body coverage:
- Coverage uniformity across gene bodies
- 5’ to 3’ bias detection
- Coverage plots for individual genes
Quality metrics:
- Mapping quality scores
- Read length distributions
- Duplication rates
- GC content analysis
RSeQC (Brief Mention)
RSeQC provides modular quality control with many specific analysis modules:
Key features:
- Modular design - run only needed analyses
- Many specialized modules
- Text-based output
Common modules:
# Gene body coverage
geneBody_coverage.py -r annotations.bed -i sample.bam -o output
# Read distribution
read_distribution.py -r annotations.bed -i sample.bam
# Junction saturation
junction_saturation.py -r annotations.bed -i sample.bam -o output
Use cases:
- Custom QC pipelines
- Specific metric extraction
- Batch processing
Qualimap (Brief Mention)
Qualimap provides user-friendly, interactive HTML reports:
Key features:
- Interactive HTML reports
- RNA-seq specific analysis
- Easy to interpret visualizations
Usage:
qualimap rnaseq \
-bam sample.bam \
-gtf annotations.gtf \
-outdir qualimap_output
Use cases:
- Quick quality assessment
- Interactive exploration
- Report sharing
Best Practices for Post-Alignment QC
- Check alignment rates: Should be >70% for most RNA-seq experiments
- Assess strand-specificity: Verify if your library is stranded or unstranded
- Examine gene body coverage: Should be relatively uniform (depending on the library construction protocol)
- Check junction annotation: Verify known vs. novel junctions
- Compare across samples: Look for outliers or batch effects
- Document QC metrics: Keep records for publication and troubleshooting
Section 5: Expression Quantification (featureCounts)
After quality control, the next step is to quantify gene expression by counting reads that map to each gene. This generates a count matrix that serves as input for differential expression analysis.
Introduction to Expression Quantification
Expression quantification involves counting the number of reads (or fragments for paired-end data) that map to each gene or transcript. These counts serve as a proxy for gene expression levels.
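The core idea of feature counting can be illustrated with a toy Python sketch (coordinates and gene names are invented; real counters such as featureCounts additionally handle strand, overlapping features, mapping quality, and paired-end fragments):

```python
# Single-exon genes as (chrom, start, end); aligned reads likewise
genes = {"GeneA": ("chr1", 100, 500), "GeneB": ("chr1", 800, 1200)}
reads = [("chr1", 120, 170), ("chr1", 450, 520),
         ("chr1", 900, 950), ("chr1", 600, 650)]  # last read is intergenic

def count_reads(genes, reads):
    """Assign each read to the first gene whose interval it overlaps."""
    counts = {g: 0 for g in genes}
    for chrom, start, end in reads:
        for gene, (gchrom, gstart, gend) in genes.items():
            # half-open interval overlap test
            if chrom == gchrom and start < gend and end > gstart:
                counts[gene] += 1
                break  # count each read at most once
    return counts

print(count_reads(genes, reads))  # {'GeneA': 2, 'GeneB': 1}
```

The intergenic read is simply left unassigned, which is exactly what featureCounts reports under Unassigned_NoFeatures.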
Overview of Quantification Tools
Two main tools are commonly used for counting reads:
| Tool | Language | Speed | Features |
|---|---|---|---|
| featureCounts | C++ | Very Fast | Multi-threading, flexible, part of Subread |
| HTSeq | Python | Slow | Well-established, Python-based |
Why featureCounts over HTSeq
featureCounts is generally preferred over HTSeq for several reasons:
- Performance: Faster execution, especially for large datasets (often 5-10x faster)
- Built-in multi-threading support: Can utilize multiple CPU cores
- Better handling of multi-mapping reads: More sophisticated assignment strategies
- More flexible counting options: Better overlap resolution, strand-specificity handling
- Simpler command-line interface: Easier to use and script
- Active development and maintenance: Regularly updated with improvements
Introduction to featureCounts
featureCounts is part of the Subread package and is designed for counting reads overlapping genomic features (genes, exons, etc.) defined in a GTF/GFF annotation file, or a SAF file.
Running featureCounts
Basic command for single-end reads:
featureCounts -a annotations.gtf \
-o counts.txt \
sample.bam
For paired-end reads (-p indicates paired-end input; with Subread >= 2.0.2, also add --countReadPairs to count fragments rather than reads):
featureCounts -a annotations.gtf \
-o counts.txt \
-p \
sample.bam
Complete example with common options, processing multiple BAM files:
# -T: threads; -p: paired-end fragment counting
# -s: strandedness (0 = unstranded, 1 = forward, 2 = reverse)
# -t: feature type to count (exon, gene, etc.); -g: attribute to group features
# --primary: count only primary alignments; -Q: minimum mapping quality
featureCounts -a annotations.gtf \
-o counts.txt \
-T 8 \
-p \
-s 1 \
-t exon \
-g gene_id \
--primary \
-Q 10 \
sample1.bam sample2.bam sample3.bam
Understanding Output Format
featureCounts generates two main output files:
- Count matrix file (e.g., counts.txt): Contains gene counts for each sample
- Summary file (e.g., counts.txt.summary): Contains read assignment statistics
Count matrix format:
Geneid Chr Start End Strand Length sample1.bam sample2.bam sample3.bam
Gene1 chr1 1000 5000 + 4000 150 200 180
Gene2 chr1 6000 8000 - 2000 50 45 60
...
Summary file contains:
- Total reads processed
- Successfully assigned reads
- Unassigned reads (with reasons)
- Ambiguity assignments
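The .summary file is easy to parse programmatically, for example to compute per-sample assignment rates across a project. A small Python sketch using an invented single-sample summary:

```python
def assignment_rate(summary_text):
    """Per-sample fraction of reads assigned, from a featureCounts
    .summary table (a Status column plus one column per BAM file)."""
    lines = summary_text.strip().splitlines()
    samples = lines[0].split("\t")[1:]
    totals = [0] * len(samples)
    assigned = [0] * len(samples)
    for line in lines[1:]:
        fields = line.split("\t")
        vals = [int(v) for v in fields[1:]]
        totals = [t + v for t, v in zip(totals, vals)]  # sum all categories
        if fields[0] == "Assigned":
            assigned = vals
    return {s: a / t for s, a, t in zip(samples, assigned, totals)}

# Invented summary: 8M of 10M reads assigned
summary = (
    "Status\tsample1.bam\n"
    "Assigned\t8000000\n"
    "Unassigned_Ambiguity\t500000\n"
    "Unassigned_NoFeatures\t1000000\n"
    "Unassigned_Unmapped\t500000\n"
)
print(assignment_rate(summary))  # {'sample1.bam': 0.8}
```

Samples whose assignment rate is far below the rest of the batch deserve a closer look for contamination or alignment problems.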
Count Matrix Generation
The count matrix from featureCounts can be directly used as input for differential expression analysis tools like DESeq2 or edgeR. The matrix should have:
- Rows: Genes (or features)
- Columns: Samples
- Values: Read counts (or fragment counts for paired-end)
Preparing the matrix for DESeq2:
# Read the count matrix
counts <- read.table("counts.txt", header=TRUE, row.names=1)
# Remove annotation columns, keep only count columns
counts <- counts[, 6:ncol(counts)] # Adjust based on your file structure
# Ensure column names match sample names
colnames(counts) <- gsub(".bam", "", colnames(counts))
Normalization Methods in RNA-seq
Understanding normalization is crucial for interpreting RNA-seq data. Different normalization methods serve different purposes and are used at different stages of analysis.
Why Normalization is Needed
Raw read counts are not directly comparable across samples because:
- Sequencing depth varies: Different samples may have different numbers of total reads
- Gene length varies: Longer genes naturally accumulate more reads
- Compositional differences: Some samples may have higher expression of certain gene classes
Common Normalization Methods
1. RPKM (Reads Per Kilobase per Million mapped reads)
- Formula: (Read count × 10^9) / (Gene length × Total mapped reads)
- Normalizes for: Sequencing depth and gene length
- Use case: Comparing expression levels of different genes within the same sample
- Limitation: Not suitable for comparing across samples (compositional bias)
- When to use: Single-sample analysis, visualization
2. FPKM (Fragments Per Kilobase per Million mapped reads)
- Formula: Similar to RPKM but uses fragments instead of reads (for paired-end data)
- Normalizes for: Sequencing depth and gene length
- Use case: Same as RPKM, but for paired-end sequencing
- Limitation: Same compositional bias issues as RPKM
- When to use: Single-sample analysis with paired-end data
3. TPM (Transcripts Per Million)
- Formula: (Read count × 10^6) / (Gene length × Sum of (read count/gene length) for all genes)
- Normalizes for: Sequencing depth and gene length
- Advantage over RPKM/FPKM: Sum of TPM values across all genes is constant (1 million), making it more comparable across samples
- Use case: Comparing expression levels across samples (better than RPKM/FPKM)
- When to use: Cross-sample comparisons, visualization
4. CPM (Counts Per Million)
- Formula: (Read count × 10^6) / Total mapped reads
- Normalizes for: Sequencing depth only (not gene length)
- Use case: Quick normalization for visualization, not for statistical analysis
- When to use: Preliminary visualization, not for DE analysis
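These formulas are easy to verify in code. A minimal Python sketch with invented counts and gene lengths; note that TPM values always sum to one million per sample:

```python
# Invented toy data: raw counts and gene lengths in bp
counts = {"GeneA": 100, "GeneB": 400, "GeneC": 500}
lengths = {"GeneA": 1000, "GeneB": 2000, "GeneC": 5000}

def cpm(counts):
    """Counts per million: depth normalization only."""
    total = sum(counts.values())
    return {g: c * 1e6 / total for g, c in counts.items()}

def rpkm(counts, lengths):
    """Reads per kilobase per million: depth and length normalization."""
    total = sum(counts.values())
    return {g: c * 1e9 / (lengths[g] * total) for g, c in counts.items()}

def tpm(counts, lengths):
    """Transcripts per million: length-normalize first, then scale."""
    rates = {g: counts[g] / lengths[g] for g in counts}  # reads per base
    denom = sum(rates.values())
    return {g: r * 1e6 / denom for g, r in rates.items()}

print(round(sum(tpm(counts, lengths).values())))  # always 1000000
```

The fixed per-sample sum is exactly why TPM is more comparable across samples than RPKM/FPKM, whose totals vary from sample to sample.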
DESeq2’s Normalization Approach
DESeq2 uses a different normalization strategy that is specifically designed for differential expression analysis:
Size factors (similar to library size normalization):
- Estimates a size factor for each sample based on the median ratio of gene counts to a “reference” sample
- Accounts for differences in sequencing depth
- More robust than simple library size normalization because it uses median ratios (less affected by highly expressed genes)
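The median-of-ratios idea can be sketched in a few lines of Python (a simplified illustration of the principle, not DESeq2's actual implementation; the toy matrix below is invented):

```python
import math
from statistics import median

def size_factors(count_matrix):
    """Median-of-ratios size factors: for each sample, the median ratio of
    its counts to the per-gene geometric mean across samples. Genes with
    any zero count are skipped, as the geometric mean would be zero."""
    n_samples = len(count_matrix[0])
    geo_means = []
    for gene_counts in count_matrix:
        if all(c > 0 for c in gene_counts):
            geo_means.append(
                math.exp(sum(math.log(c) for c in gene_counts) / n_samples))
        else:
            geo_means.append(None)
    factors = []
    for j in range(n_samples):
        ratios = [row[j] / gm for row, gm in zip(count_matrix, geo_means) if gm]
        factors.append(median(ratios))
    return factors

# Rows = genes, columns = samples; sample 2 is sequenced twice as deeply
mat = [[10, 20], [100, 200], [50, 100], [0, 5]]
print(size_factors(mat))  # sample 2's factor is twice sample 1's
```

Because the estimate is a median, a handful of very highly expressed (or truly differential) genes barely moves it, unlike total-count normalization.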
Why DESeq2 doesn’t use RPKM/FPKM/TPM:
- RPKM/FPKM/TPM rescale counts per gene, which destroys the raw count scale required by DESeq2’s negative binomial model
- These within-sample rescalings also do not correct for library composition differences between samples
- DESeq2’s median-of-ratios size factors are robust to composition bias (under the assumption that most genes are not differentially expressed)
When to use normalized counts from DESeq2:
- For visualization (heatmaps, PCA plots)
- For comparing expression levels across samples
- Not for statistical testing (DESeq2 uses raw counts for DE analysis)
Example: Getting normalized counts from DESeq2
# After running DESeq2 analysis
dds <- DESeq(dds)
# Get normalized counts (size-factor normalized)
normalized_counts <- counts(dds, normalized=TRUE)
# These can be used for visualization
# For log transformation (common for visualization):
log_normalized <- log2(normalized_counts + 1)
Choosing the Right Normalization Method
| Purpose | Recommended Method | Tool |
|---|---|---|
| Differential expression analysis | DESeq2 size factors (internal) | DESeq2, edgeR |
| Visualization across samples | DESeq2 normalized counts or TPM | DESeq2, or calculate TPM |
| Single-sample gene comparison | RPKM/FPKM | Calculate manually or use tools |
| Quick visualization | CPM | Calculate manually |
| Cross-study comparisons | TPM (with caution) | Calculate manually |
Key takeaway: For differential expression analysis, use tools like DESeq2 or edgeR that handle normalization internally. Don’t use RPKM/FPKM/TPM-normalized data as input for DE analysis - these tools need raw counts.
Contamination Detection Using featureCounts Summary
Quality control doesn’t end after alignment. It’s crucial to check for various types of contamination that can affect downstream analysis and interpretation.
Importance of Contamination Detection
Contamination can lead to:
- Incorrect expression estimates
- False positive differential expression results
- Misinterpretation of biological findings
- Wasted computational resources
Using featureCounts Summary Statistics to Identify Contamination
The featureCounts summary file provides valuable information for detecting contamination. Key metrics to examine:
- Assigned vs. unassigned reads: High unassigned rates may indicate contamination
- Ambiguity assignments: Reads mapping to multiple features
- Feature type distribution: Unexpected distributions may signal issues
Types of Contamination to Check
rRNA Contamination
What to look for:
- High percentage of reads mapping to rRNA genes (>5-10% is concerning)
- Expected vs. observed rRNA percentages (should be <5% for polyA-selected libraries)
Impact on downstream analysis:
- Reduces usable reads for gene expression analysis
- Can skew normalization
- Wastes sequencing depth
Solutions:
- Prevention: Perform polyA(+)-RNA selection, or rRNA depletion (RiboZero, etc.) to remove rRNA.
- Filtering: Remove rRNA-mapped reads before differential expression analysis (if contamination is detected)
Checking rRNA content with featureCounts:
# Quantify with gene_biotype reported as an extra column
featureCounts -a annotations.gtf \
-o counts_biotype.txt \
--extraAttributes gene_biotype \
sample.bam
# Sum counts over rRNA genes (last column holds the sample's counts)
grep -w "rRNA" counts_biotype.txt | awk '{sum += $NF} END {print sum}'
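Going beyond a single grep, the full biotype composition of a sample can be computed from a featureCounts table generated with --extraAttributes gene_biotype. A small Python sketch (the two-gene in-line table is invented, and columns are located by header name so exact ordering doesn't matter):

```python
def biotype_fractions(counts_text, sample):
    """Fraction of assigned reads per gene_biotype from a featureCounts
    table that includes a gene_biotype column."""
    lines = [l for l in counts_text.strip().splitlines()
             if not l.startswith("#")]  # drop the featureCounts comment line
    header = lines[0].split("\t")
    bio_col = header.index("gene_biotype")
    cnt_col = header.index(sample)
    totals = {}
    for line in lines[1:]:
        fields = line.split("\t")
        bio, n = fields[bio_col], int(fields[cnt_col])
        totals[bio] = totals.get(bio, 0) + n
    grand = sum(totals.values())
    return {b: n / grand for b, n in totals.items()}

# Invented two-gene table: 10% of assigned reads are rRNA
table = (
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tgene_biotype\tsample.bam\n"
    "G1\tchr1\t1\t100\t+\t100\tprotein_coding\t900\n"
    "G2\tchr1\t200\t300\t+\t100\trRNA\t100\n"
)
print(biotype_fractions(table, "sample.bam"))
```

An rRNA fraction above the thresholds discussed below flags a sample for follow-up.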
Host Genomic DNA Contamination
What to look for:
- Reads mapping to intergenic regions (high percentage is concerning)
- High percentage of unassigned reads that align to the genome
Visual inspection using IGV (Integrative Genomics Viewer):
IGV is invaluable for visually confirming DNA contamination:
- Loading BAM files and reference genome:
# IGV can be launched and files loaded through the GUI
# Or use the command line:
igv.sh -g genome.fa sample.bam
- Identifying characteristic patterns of DNA contamination:
- Intron-only reads without exon coverage: Reads mapping entirely to introns without spanning exons
- Reads spanning intron-exon boundaries incorrectly: Reads that don’t follow proper splicing patterns
- Intergenic reads: Reads mapping to regions between genes
- What to look for in IGV: zoom into gene regions and check for:
- Reads mapping only to introns
- Lack of proper exon coverage
- Unusual read patterns that don’t match RNA-seq expectations
Solutions:
- Prevention:
- DNase treatment during RNA extraction
- PolyA(+) RNA selection (removes most DNA)
- Rescue: CleanUpRNAseq package for correcting for genomic DNA contamination
Using CleanUpRNAseq:
When genomic DNA contamination is obvious after IGV visualization, CleanUpRNAseq can rescue the data.
Environmental Microorganism Contamination
What to look for:
- Unexpected alignments to bacterial/fungal genomes
- A high percentage of reads that fail to align to the reference genome but map to non-target organisms
Detection methods:
- BLAST: Blast unaligned reads against NCBI databases
- Kraken2: Taxonomic classification of reads
- Alignment to multiple genomes: Align to host + common contaminants
Solutions:
- Sample preparation improvements: Better sterile technique, contamination controls
Example with Kraken2:
# Classify reads taxonomically
kraken2 --db kraken_db \
--paired sample_R1.fastq.gz sample_R2.fastq.gz \
--output classification.txt \
--report report.txt
# Check for unexpected taxa
grep -v "Homo sapiens" report.txt | head -20
Interpreting featureCounts Summary Statistics
Assigned vs. unassigned reads:
- Assigned: Reads successfully counted for a feature (good)
- Unassigned_Ambiguity: Reads mapping to multiple features (may need filtering)
- Unassigned_NoFeatures: Reads in regions without annotated features (check for contamination)
- Unassigned_Unmapped: Reads that didn’t align (check alignment quality)
Ambiguity assignments:
- High ambiguity may indicate:
- Overlapping genes
- Gene families with high similarity
- Annotation issues
Feature type distribution:
- Should match library type:
- PolyA-selected: Mostly exonic reads
- Total RNA: Mix of exonic, intronic, intergenic
- rRNA-depleted: Mostly exonic, some intronic
Thresholds and Quality Metrics
Acceptable contamination levels:
- rRNA: <5% for polyA libraries, <20% for total RNA (depends on depletion efficiency)
- Genomic DNA: <1-2% for polyA libraries
- Environmental: Should be minimal (<0.1%)
Warning signs:
- Unassigned reads >20%
- High intergenic mapping (>5%)
- Unexpected feature type distributions
- Sample-to-sample variation in assignment rates
When to exclude samples:
- Contamination >50% of reads
- Severe quality issues that can’t be corrected
- Samples that are clear outliers after correction attempts
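As a rough illustration, the thresholds above can be encoded in a small helper. The metric names and cutoffs below simply mirror this section; they are hypothetical defaults to tune per library prep, not a standard API:

```python
def contamination_flags(metrics, polyA=True):
    """Return the names of metrics (as fractions of total reads) that
    exceed the rule-of-thumb thresholds from this section."""
    flags = []
    if metrics.get("rRNA", 0) > (0.05 if polyA else 0.20):
        flags.append("rRNA")           # rRNA depletion/selection failed
    if metrics.get("genomic_dna", 0) > 0.02:
        flags.append("genomic_dna")    # DNase treatment likely incomplete
    if metrics.get("unassigned", 0) > 0.20:
        flags.append("unassigned")     # check alignment and annotation
    if metrics.get("intergenic", 0) > 0.05:
        flags.append("intergenic")     # possible DNA contamination
    return flags

# Invented per-sample metrics for a polyA-selected library
sample = {"rRNA": 0.12, "genomic_dna": 0.01,
          "unassigned": 0.25, "intergenic": 0.03}
print(contamination_flags(sample))  # ['rRNA', 'unassigned']
```

Running this over all samples gives a quick triage list before deciding on filtering, rescue (e.g., CleanUpRNAseq), or exclusion.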
Command-Line Examples for Contamination Analysis
# Count with detailed feature annotation
featureCounts -a annotations.gtf \
-o detailed_counts.txt \
--extraAttributes gene_name,gene_biotype \
sample.bam
# Check summary statistics
cat detailed_counts.txt.summary
# Extract mapped, high-quality reads that fall outside annotated exons
samtools view -b -q 10 -F 4 sample.bam | \
bedtools intersect -v -a - -b exons.bed > intergenic_reads.bam
# Count intergenic reads
samtools view -c intergenic_reads.bam
Best Practices for Contamination Prevention
- Use appropriate library preparation methods for your RNA type
- Include negative controls in your experiment
- Monitor QC metrics at each step
- Document contamination levels in your analysis
- Consider contamination in experimental design (e.g., rRNA depletion for total RNA)
- Re-run featureCounts after cleanup if using rescue methods
Section 6: Differential Expression Analysis
Differential expression analysis identifies genes that are significantly up- or down-regulated between experimental conditions. This is often the primary goal of RNA-seq experiments.
Overview of DE Tools
Three main tools are commonly used for differential expression analysis:
| Tool | Method | Best For | Features |
|---|---|---|---|
| DESeq2 | Negative binomial | Most RNA-seq studies | Robust, well-documented, comprehensive |
| edgeR | Negative binomial | Similar to DESeq2 | Fast, flexible, good for complex designs |
| limma/voom | Linear modeling | Large sample sizes | Fast, powerful for complex designs |
Batch Effect Considerations
Batch effects are systematic non-biological differences between groups of samples that can confound differential expression analysis. They must be identified and addressed to obtain reliable results.
What are Batch Effects and Why They Matter
Batch effects arise from technical variations such as:
- Different library preparation dates
- Different technicians or laboratories
- Different sequencing batches or runs
- Different sequencing instruments or lanes
Why they matter:
- Can create false positive differential expression
- Can mask true biological differences
- Can lead to incorrect biological interpretations
- Can reduce statistical power
Sources of Batch Effects
Common sources include:
- Sequencing batch: Samples sequenced in different batches
- Date: Samples processed on different dates
- Technician: Different people processing samples
- Laboratory: Samples processed in different labs
- Instrument: Different sequencing machines
- Lane effects: Different flow cell lanes
Impact on Differential Expression Results
Batch effects can:
- Create spurious associations between batch and condition
- Reduce power to detect true differences
- Lead to inflated false discovery rates
- Cause clustering by batch instead of condition
Detection Methods
PCA (Principal Component Analysis):
# After creating DESeqDataSet
vsd <- vst(dds, blind=FALSE)
pcaData <- plotPCA(vsd, intgroup=c("condition", "batch"), returnData=TRUE)
ggplot(pcaData, aes(PC1, PC2, color=condition, shape=batch)) + geom_point()
Hierarchical clustering:
# Check if samples cluster by batch
sampleDists <- dist(t(assay(vsd)))
sampleDistMatrix <- as.matrix(sampleDists)
pheatmap(sampleDistMatrix, annotation_col=sample_info)
Batch Effect Correction Tools
ComBat (from sva package)
ComBat uses an empirical Bayes method to adjust for known batch effects. Note: Use adjusted counts (convert back if needed) for exploratory analysis only. For differential analysis, include batch as a term in the model.
library(sva)
# Assuming you have normalized counts
# ComBat expects log-transformed data
log_counts <- log2(counts(dds, normalized=TRUE) + 1)
# Adjust for batch
batch <- colData(dds)$batch
condition <- colData(dds)$condition
modcombat <- model.matrix(~condition, data=colData(dds))
combat_counts <- ComBat(dat=log_counts, batch=batch, mod=modcombat)
When to use: Known batch effects, balanced design
SVAseq (Surrogate Variable Analysis)
SVAseq identifies and removes unknown batch effects:
library(sva)
# Identify surrogate variables
mod <- model.matrix(~condition, colData(dds))
mod0 <- model.matrix(~1, colData(dds))
svseq <- svaseq(counts(dds, normalized=FALSE), mod, mod0)
# Check how many surrogate variables were identified
n.sv <- svseq$n.sv
print(paste("Number of surrogate variables:", n.sv))
# Add surrogate variables to design (adjust based on n.sv)
for(i in 1:n.sv) {
dds[[paste0("SV", i)]] <- svseq$sv[,i]
}
sv_formula <- paste("~condition", paste0("+ SV", 1:n.sv, collapse=""), sep="")
design(dds) <- as.formula(sv_formula)
dds <- DESeq(dds)
When to use: Unknown batch effects, complex confounding
RUV-seq (Remove Unwanted Variation)
RUV-seq uses control genes or replicates to estimate unwanted variation. The k parameter specifies the number of unwanted variation factors to estimate and remove:
library(RUVSeq)
# Method 1: Using control genes (e.g., housekeeping genes)
# Select control genes (genes expected to be non-differentially expressed)
control_genes <- rownames(counts)[1:1000] # Or use known housekeeping genes
# Alternatively, use genes with low variance across samples
# Estimate RUV factors (k can be 1, 2, 3, etc. - choose based on data complexity)
ruv <- RUVg(counts(dds), control_genes, k=2) # k=2 estimates 2 factors
# Method 2: Using replicate samples
# scIdx must be a matrix whose rows give the column indices of each
# replicate group; makeGroups() builds it from the condition factor
replicate_groups <- makeGroups(colData(dds)$condition)
ruv <- RUVs(counts(dds), cIdx=rownames(counts), k=2, scIdx=replicate_groups)
# Choosing k: Start with k=1 and increase if needed
# Can use PCA to estimate number of unwanted variation factors
# Or try different k values and compare results
# Add RUV factors to design
# Determine number of RUV factors (k parameter in RUVg/RUVs)
k <- ncol(ruv$W) # Number of factors estimated
print(paste("Number of RUV factors:", k))
# Dynamically add all RUV factors
for(i in 1:k) {
dds[[paste0("W_", i)]] <- ruv$W[,i]
}
# Create design formula with all RUV factors
ruv_formula <- paste("~condition", paste0("+ W_", 1:k, collapse=""), sep="")
design(dds) <- as.formula(ruv_formula)
dds <- DESeq(dds)
When to use: Control genes available, replicate samples available
Choosing k (number of RUV factors):
- Start with k=1 and increase if batch effects persist
- Use PCA to visualize unwanted variation and estimate k
- Too many factors (high k) may remove biological signal
- Too few factors (low k) may not fully remove unwanted variation
- Typical range: k=1 to k=5, depending on data complexity
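One practical way to compare candidate k values is to inspect relative log expression (RLE) plots after removing different numbers of factors; a sketch, assuming `dds` and `control_genes` from the RUVg example above (plotRLE comes from EDASeq, which loads with RUVSeq):

```r
library(RUVSeq)
# Compare how much unwanted variation remains for each candidate k;
# tighter, centered RLE boxplots suggest better removal
for (k in 1:3) {
  ruv_k <- RUVg(counts(dds), control_genes, k = k)
  plotRLE(ruv_k$normalizedCounts, outline = FALSE,
          main = paste("RLE after RUVg, k =", k))
}
```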
When to Apply Batch Correction
Apply batch correction when:
- PCA shows clear batch clustering
- Batch is confounded with condition
- You have known technical batches
- Statistical tests indicate batch effects
Consider not applying when:
- Batch and condition are perfectly balanced
- No evidence of batch effects in QC
- Correction might remove biological signal
- Small sample sizes (correction can be unstable due to lack of degrees of freedom)
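Before deciding, it can help to quantify whether a principal component tracks batch; a minimal sketch, assuming a variance-stabilized `vsd` object and a `batch` column in the sample metadata:

```r
# Extract PCA coordinates with batch/condition annotations (DESeq2's plotPCA)
pcaData <- plotPCA(vsd, intgroup = c("condition", "batch"), returnData = TRUE)
# ANOVA of PC1 (and PC2) against batch; a small p-value suggests that
# batch explains a substantial share of that component
summary(aov(PC1 ~ batch, data = pcaData))
summary(aov(PC2 ~ batch, data = pcaData))
```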
Integration with DESeq2 Workflow
Batch information can be incorporated into DESeq2 in several ways:
- As a covariate in the design: design(dds) <- ~ batch + condition
- Using surrogate variables: design(dds) <- ~ SV1 + SV2 + condition
DESeq2 Analysis (Detailed Example)
DESeq2 is a popular tool for differential expression analysis that uses a negative binomial model to account for overdispersion in count data.
R Setup and Package Loading
# Install Bioconductor if needed
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
# Install DESeq2
BiocManager::install("DESeq2")
# Load required libraries
library(DESeq2)
library(ggplot2)
library(pheatmap)
library(RColorBrewer)
Loading Count Data
# Read count matrix (from featureCounts)
counts <- read.table("counts_matrix.txt", header=TRUE, row.names=1)
# Remove annotation columns, keep only count columns
# Adjust column indices based on your file structure
counts <- counts[, 6:ncol(counts)]
# Read sample information
sample_info <- read.table("sample_info.txt", header=TRUE, row.names=1)
# Ensure column names match
colnames(counts) <- rownames(sample_info)
Creating DESeqDataSet
Simple two-group design:
# Create DESeqDataSet
dds <- DESeqDataSetFromMatrix(countData = counts,
colData = sample_info,
design = ~ condition)
# Pre-filter low count genes (optional but recommended)
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]
Designs with multiple conditions and covariates:
# Example 1: Multiple conditions (e.g., control, treated1, treated2)
# Ensure condition column has appropriate factor levels
sample_info$condition <- factor(sample_info$condition,
levels=c("control", "treated1", "treated2"))
dds <- DESeqDataSetFromMatrix(countData = counts,
colData = sample_info,
design = ~ condition)
# Example 2: With batch as covariate
sample_info$batch <- factor(sample_info$batch)
dds <- DESeqDataSetFromMatrix(countData = counts,
colData = sample_info,
design = ~ batch + condition)
# Example 3: Multiple factors (condition and time)
sample_info$condition <- factor(sample_info$condition)
sample_info$time <- factor(sample_info$time)
dds <- DESeqDataSetFromMatrix(countData = counts,
colData = sample_info,
design = ~ batch + condition + time + condition:time)
# Example 4: With continuous covariate (e.g., age, dose)
sample_info$age <- as.numeric(sample_info$age)
dds <- DESeqDataSetFromMatrix(countData = counts,
colData = sample_info,
design = ~ age + condition)
# Example 5: Complex design with multiple covariates
dds <- DESeqDataSetFromMatrix(countData = counts,
colData = sample_info,
design = ~ batch + age + condition + time)
Quality Control and Filtering
# Estimate size factors for normalization
dds <- estimateSizeFactors(dds)
# Estimate dispersions
dds <- estimateDispersions(dds)
# Check dispersion plot
plotDispEsts(dds)
# Additional filtering (optional): require at least 10 counts in at least 3 samples
dds <- dds[rowSums(counts(dds) >= 10) >= 3, ]
Batch Effect Correction (if needed)
# Detect batch effects using PCA
vsd <- vst(dds, blind=FALSE)
pcaData <- plotPCA(vsd, intgroup=c("condition", "batch"), returnData=TRUE)
# If batch effects are present, apply correction
# Example with ComBat
library(sva)
log_counts <- log2(counts(dds, normalized=TRUE) + 1)
batch <- colData(dds)$batch
condition <- colData(dds)$condition
modcombat <- model.matrix(~condition, data=colData(dds))
combat_counts <- ComBat(dat=log_counts, batch=batch, mod=modcombat)
# Update DESeqDataSet with corrected counts (convert back from log)
# Note: This is a simplified approach, not recommended; see DESeq2 documentation for alternatives
assay(dds) <- round(pmax(2^combat_counts - 1, 0)) # clamp negatives from the back-transform
# Or incorporate batch in design
design(dds) <- ~ batch + condition
Differential Expression Analysis
# Run DESeq2 analysis
dds <- DESeq(dds)
# Get results - Simple two-group comparison
res <- results(dds, contrast=c("condition", "treated", "control"))
# Examples with multiple conditions and covariates:
# Example 1: Multiple conditions (e.g., control, treated1, treated2)
# Design: ~ condition
# Compare treated1 vs control
res_treated1 <- results(dds, contrast=c("condition", "treated1", "control"))
# Compare treated2 vs control
res_treated2 <- results(dds, contrast=c("condition", "treated2", "control"))
# Compare treated2 vs treated1
res_treated2_vs_1 <- results(dds, contrast=c("condition", "treated2", "treated1"))
# Example 2: With batch as covariate
# Design: ~ batch + condition
# Compare treated vs control (accounting for batch)
res_batch <- results(dds, contrast=c("condition", "treated", "control"))
# Example 3: Multiple factors (condition and time) without interaction
# Design: ~ batch + time + condition
# Compare treated vs control (accounting for time and batch)
res_time <- results(dds, contrast=c("condition", "treated", "control"))
# Example 4: Testing interaction terms (condition:time)
# Design: ~ batch + condition + time + condition:time
# First, check available coefficient names
resultsNames(dds)
# Test if there is a significant interaction between condition and time
# This tests whether the effect of condition differs across time points
# (use the exact coefficient name reported by resultsNames(dds))
res_interaction_test <- results(dds, name="conditiontreated.timetime2")
# Compare treated vs control at a specific time point
# Method 1: Using a contrast list with the interaction coefficient
# At the reference time point (time1), the condition effect is the main effect:
res_time1 <- results(dds, name="condition_treated_vs_control")
# At time2, it is the main effect plus the interaction:
res_time2 <- results(dds,
                     contrast=list(c("condition_treated_vs_control",
                                     "conditiontreated.timetime2")))
# Method 2: Using the name parameter for the condition main effect
# (with an interaction design, this is the effect at the reference time point,
# not an average across time)
res_main <- results(dds, name="condition_treated_vs_control")
# Complete example: Testing interaction between condition and time
# Step 1: Create design with interaction
dds_interaction <- DESeqDataSetFromMatrix(countData = counts,
colData = sample_info,
design = ~ batch + condition + time + condition:time)
# Step 2: Run DESeq
dds_interaction <- DESeq(dds_interaction)
# Step 3: Check coefficient names
resultsNames(dds_interaction)
# Output might be: [1] "Intercept" "batch_batch2_vs_batch1" "condition_treated_vs_control"
# [4] "time_time2_vs_time1" "conditiontreated.timetime2"
# Step 4: Test the interaction term
# This tests if the condition effect differs between time points
res_interaction <- results(dds_interaction, name="conditiontreated.timetime2")
# Step 5: Extract condition effect at each time point
# Condition effect at time1 (baseline)
res_condition_time1 <- results(dds_interaction,
contrast=c("condition", "treated", "control"))
# Condition effect at time2 (main effect + interaction)
res_condition_time2 <- results(dds_interaction,
                               contrast=list(c("condition_treated_vs_control",
                                               "conditiontreated.timetime2")))
#### Time-Course Analysis Specifics
Time-course experiments require special consideration in design and analysis:
**Design considerations:**
- **Time points**: Choose biologically relevant time points
- **Replicates**: Need replicates at each time point (not just overall replicates)
- **Baseline**: Include a time 0 or pre-treatment baseline
- **Balanced design**: Equal replicates across all time points when possible
**Analysis approaches:**
**1. Testing for temporal trends:**
```r
# Design with time as a two-level factor (the coefficient names below assume
# levels time1/time2; for truly continuous time, use as.numeric(sample_info$time),
# in which case the time coefficient is simply named "time")
dds_time <- DESeqDataSetFromMatrix(
  countData = counts,
  colData = sample_info,
  design = ~ batch + condition + time + condition:time
)
dds_time <- DESeq(dds_time)
# Test for interaction (does the treatment effect change over time?)
res_time_interaction <- results(dds_time,
                                name="conditiontreated.timetime2")
# Test for the overall time effect
res_time_effect <- results(dds_time, name="time_time2_vs_time1")
```
**2. Using splines for non-linear trends:**
library(splines)
# Create a natural spline basis for time (allows non-linear trends);
# ns() returns a matrix, stored here as a matrix column of sample_info
# (alternatively, cbind the basis columns into colData individually)
sample_info$time_spline <- ns(sample_info$time, df=3)
# Design with splines
dds_spline <- DESeqDataSetFromMatrix(
countData = counts,
colData = sample_info,
design = ~ batch + condition + time_spline + condition:time_spline
)
dds_spline <- DESeq(dds_spline)
# Test for interaction at each spline term
resultsNames(dds_spline)
3. Pairwise comparisons at each time point:
# Compare treated vs control at each time point separately
# (assumes the interaction design and coefficient names from resultsNames(dds))
# Time point 1 (reference level): the main effect alone
res_t1 <- results(dds, name="condition_treated_vs_control")
# Time point 2: main effect plus the interaction term
res_t2 <- results(dds,
                  contrast=list(c("condition_treated_vs_control",
                                  "conditiontreated.timetime2")))
4. Tools for time-course analysis:
- splineTimeR (R package): Specifically designed for time-course RNA-seq
- maSigPro (R/Bioconductor): Microarray time series analysis adapted for RNA-seq
- ImpulseDE2 (R/Bioconductor): Detects impulse-like expression patterns over time
Example with splineTimeR (its core function, splineDiffExprs(), expects an ExpressionSet whose pData contains Time, Treatment, and Replicate columns; see the package vignette for the exact requirements):
library(splineTimeR)
library(Biobase)
# Build an ExpressionSet from log-scale normalized counts
# (assumes sample_info carries Time/Treatment/Replicate columns)
expr <- log2(counts(dds, normalized=TRUE) + 1)
eset <- ExpressionSet(assayData = as.matrix(expr),
                      phenoData = AnnotatedDataFrame(sample_info))
# Fit spline models and test for differential temporal profiles
diffExprs <- splineDiffExprs(eSetObject = eset, df = 3,
                             cutoff.adj.pVal = 0.05, reference = "control")
Example with maSigPro:
library(maSigPro)
# Prepare data in maSigPro format: make.design.matrix() expects an "edesign"
# matrix with Time, Replicate, and one 0/1 column per experimental group
# (build edesign from sample_info; see the maSigPro vignette)
design <- make.design.matrix(edesign, degree=2) # Quadratic polynomial model
# Fit model
fit <- p.vector(counts, design, Q=0.05, MT.adjust="BH")
# Get significant genes
fit$SELEC
# Get clusters of genes with similar temporal patterns
tstep <- T.fit(fit, step.method="backward", alfa=0.05)
sigs <- get.siggenes(tstep, rsq=0.6, vars="groups")
Visualization for time-course data:
library(ggplot2)
# Plot expression over time for specific genes
gene_of_interest <- "ENSG00000139618" # Example gene ID
# Get normalized counts
norm_counts <- counts(dds, normalized=TRUE)
gene_counts <- norm_counts[gene_of_interest, ]
# Create data frame for plotting
plot_data <- data.frame(
  time = sample_info$time,
  condition = sample_info$condition,
  replicate = sample_info$replicate, # assumes a replicate column in sample_info
  expression = log2(gene_counts + 1)
)
# Create line plot (drop the geom_line layer if replicates are not tracked)
ggplot(plot_data, aes(x=time, y=expression, color=condition)) +
  geom_line(aes(group=interaction(condition, replicate)), alpha=0.5) +
geom_smooth(method="loess", se=TRUE) +
labs(title=paste("Expression over time:", gene_of_interest),
x="Time", y="Log2 Normalized Counts") +
theme_minimal()
Best practices for time-course analysis:
- Plan time points carefully: Choose time points that capture expected biological changes
- Include sufficient replicates: Need replicates at each time point, not just overall
- Consider non-linear trends: Biological processes often have non-linear temporal patterns
- Test for interactions: Determine if treatment effects change over time
- Visualize temporal patterns: Line plots and heatmaps help identify patterns
- Account for multiple testing: Time-course experiments involve many comparisons
# Example 5: With continuous covariate (e.g., age)
# Design: ~ age + condition
# Compare treated vs control (accounting for age)
res_age <- results(dds, contrast=c("condition", "treated", "control"))
# Example 6: With RUV factors or surrogate variables
# Design: ~ condition + W_1 + W_2
# Compare treated vs control (accounting for unwanted variation)
res_ruv <- results(dds, contrast=c("condition", "treated", "control"))
# Example 7: Extracting specific coefficients
# Get results using coefficient name (useful for complex designs)
# First check available coefficients
resultsNames(dds)
# Then extract using name
res_coef <- results(dds, name="condition_treated_vs_control")
# Example 8: Complex design with multiple covariates
# Design: ~ batch + age + condition + time
# Compare treated vs control (accounting for all covariates)
res_complex <- results(dds, contrast=c("condition", "treated", "control"))
Multi-Condition Comparisons
When you have more than two conditions, you need to carefully plan your comparisons:
Example: Three conditions (Control, Treatment A, Treatment B)
# Create sample info with three conditions
sample_info <- data.frame(
condition = factor(c(rep("control", 3),
rep("treatment_A", 3),
rep("treatment_B", 3)),
levels = c("control", "treatment_A", "treatment_B")),
batch = factor(rep(c("batch1", "batch2", "batch3"), 3))
)
# Create DESeqDataSet
dds <- DESeqDataSetFromMatrix(countData = counts,
colData = sample_info,
design = ~ batch + condition)
# Run DESeq
dds <- DESeq(dds)
# Method 1: Pairwise comparisons
# Compare Treatment A vs Control
res_A_vs_control <- results(dds, contrast=c("condition", "treatment_A", "control"))
# Compare Treatment B vs Control
res_B_vs_control <- results(dds, contrast=c("condition", "treatment_B", "control"))
# Compare Treatment A vs Treatment B
res_A_vs_B <- results(dds, contrast=c("condition", "treatment_A", "treatment_B"))
# Method 2: Using resultsNames to see all available comparisons
resultsNames(dds)
# Output: [1] "Intercept" "batch_batch2_vs_batch1" "batch_batch3_vs_batch1"
# [4] "condition_treatment_A_vs_control" "condition_treatment_B_vs_control"
# Extract using coefficient names
res_A_vs_control <- results(dds, name="condition_treatment_A_vs_control")
res_B_vs_control <- results(dds, name="condition_treatment_B_vs_control")
# Method 3: ANOVA-like test (testing if any condition differs)
# This tests the overall effect of condition
dds_anova <- DESeq(dds, test="LRT", reduced=~batch)
res_anova <- results(dds_anova)
# Method 4: Custom contrasts for complex comparisons
# Example: Compare the average of the two treatments vs control
# (listValues = c(numerator weight, denominator weight); here each treatment
# coefficient is weighted 0.5, and there is no denominator list)
res_treatments_vs_control <- results(dds,
    contrast=list(c("condition_treatment_A_vs_control",
                    "condition_treatment_B_vs_control")),
    listValues=c(0.5, -1))
Best practices for multi-condition comparisons:
- Define all comparisons a priori: Decide which comparisons are biologically meaningful before analysis
- Account for multiple testing: When performing multiple pairwise comparisons, be aware that you’re increasing the chance of false positives
- Use appropriate controls: Ensure your control group is appropriate for all treatment comparisons
- Consider interaction effects: If treatments might interact, include interaction terms in the design
- Visualize all conditions: Use PCA plots and heatmaps to visualize relationships between all conditions
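On the multiple-testing point: each results() call adjusts p-values only within its own comparison. One hedged option for controlling the FDR jointly across all pairwise comparisons is to pool the raw p-values and re-adjust (a sketch; object names follow the pairwise example above):

```r
# Pool raw p-values from all three comparisons and re-adjust jointly
all_p <- c(res_A_vs_control$pvalue,
           res_B_vs_control$pvalue,
           res_A_vs_B$pvalue)
all_padj <- p.adjust(all_p, method = "BH")
# Split the joint adjustment back out per comparison
n <- nrow(res_A_vs_control)
padj_A_joint  <- all_padj[1:n]
padj_B_joint  <- all_padj[(n + 1):(2 * n)]
padj_AB_joint <- all_padj[(2 * n + 1):(3 * n)]
```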
Visualization for multi-condition experiments:
# Heatmap with all conditions
library(pheatmap)
library(viridis)
# Get normalized counts
norm_counts <- counts(dds, normalized=TRUE)
# Select top variable genes (rowVars comes from the matrixStats package)
library(matrixStats)
top_genes <- head(order(rowVars(norm_counts), decreasing=TRUE), 50)
# Create annotation for conditions
annotation_col <- data.frame(Condition = sample_info$condition)
rownames(annotation_col) <- colnames(norm_counts)
# Create heatmap
pheatmap(log2(norm_counts[top_genes, ] + 1),
annotation_col = annotation_col,
color = viridis(100),
cluster_rows = TRUE,
cluster_cols = TRUE,
show_rownames = FALSE,
main = "Top 50 Variable Genes Across Conditions")
Sankey diagrams for multi-condition experiments:
library(networkD3)
# Create a data frame of significant genes in each comparison
sig_A <- rownames(res_A_vs_control)[!is.na(res_A_vs_control$padj) &
res_A_vs_control$padj < 0.05]
sig_B <- rownames(res_B_vs_control)[!is.na(res_B_vs_control$padj) &
res_B_vs_control$padj < 0.05]
# Find overlaps
both_sig <- intersect(sig_A, sig_B)
only_A <- setdiff(sig_A, sig_B)
only_B <- setdiff(sig_B, sig_A)
# Create nodes and links for Sankey diagram
# (networkD3 uses 0-based numeric indices into the nodes table)
nodes <- data.frame(name = c("Treatment A", "Treatment B",
                             "DE in A only", "DE in B only", "DE in both"))
# Each treatment flows into its exclusive and shared DE gene categories
links <- data.frame(
  source = c(0, 0, 1, 1),
  target = c(2, 4, 3, 4),
  value = c(length(only_A), length(both_sig),
            length(only_B), length(both_sig))
)
# Create Sankey diagram
sankeyNetwork(Links = links, Nodes = nodes, Source = "source",
Target = "target", Value = "value", NodeID = "name",
fontSize = 12, nodeWidth = 30)
# Shrink log2 fold changes (recommended for visualization)
# For a simple coefficient
resLFC <- lfcShrink(dds, coef="condition_treated_vs_control", type="apeglm")
# For an arbitrary contrast, apeglm is not supported; use type="ashr"
resLFC_contrast <- lfcShrink(dds, contrast=c("condition", "treated", "control"), type="ashr")
# Order by adjusted p-value
resOrdered <- res[order(res$padj),]
# Summary
summary(res)
#### Results Interpretation
Key columns in results:
- **baseMean**: Average normalized count across all samples
- **log2FoldChange**: Log2 fold change between conditions
- **lfcSE**: Standard error of log2 fold change
- **stat**: Wald statistic
- **pvalue**: Raw p-value
- **padj**: Adjusted p-value (FDR, Benjamini-Hochberg)
**Filtering significant genes:**
```r
# Genes with padj < 0.05 and |log2FC| > 1
sig_genes <- res[which(!is.na(res$padj) & res$padj < 0.05 & abs(res$log2FoldChange) > 1), ]
```
Visualization of Differential Expression Results
Visualization is crucial for understanding and communicating your results. Here are common visualization approaches:
Publication-ready figure guidelines:
- Resolution: Use 300 DPI (dots per inch) or higher for publication
- Formats:
- Vector formats (PDF, SVG) for line plots, scatter plots - scale without quality loss
- Raster formats (PNG, TIFF) for heatmaps - use high resolution (300+ DPI)
- Color palettes: Use colorblind-friendly palettes (viridis, ColorBrewer)
- Font sizes: Ensure text is readable when figure is resized (typically 10-12pt minimum)
- Figure dimensions: Common sizes: single column (3.5 inches), double column (7 inches)
- File organization: Save both R code and final figures separately
MA Plots
MA plots show log2 fold change vs. mean expression:
# Basic MA plot
plotMA(res, ylim=c(-2,2))
# Enhanced MA plot with ggplot2
res_df <- as.data.frame(res)
res_df$significant <- ifelse(!is.na(res_df$padj) & res_df$padj < 0.05 &
                               abs(res_df$log2FoldChange) > 1,
                             "Yes", "No")
ggplot(res_df, aes(x=baseMean, y=log2FoldChange, color=significant)) +
geom_point(alpha=0.6) +
scale_x_log10() +
scale_color_manual(values=c("gray", "red")) +
geom_hline(yintercept=c(-1, 1), linetype="dashed") +
labs(x="Mean Expression", y="Log2 Fold Change", title="MA Plot")
Volcano Plots
Volcano plots show statistical significance vs. fold change:
res_df$neg_log10_padj <- -log10(res_df$padj)
ggplot(res_df, aes(x=log2FoldChange, y=neg_log10_padj, color=significant)) +
geom_point(alpha=0.6) +
scale_color_manual(values=c("gray", "red")) +
geom_vline(xintercept=c(-1, 1), linetype="dashed") +
geom_hline(yintercept=-log10(0.05), linetype="dashed") +
labs(x="Log2 Fold Change", y="-Log10 Adjusted P-value", title="Volcano Plot") +
theme_minimal()
# Save publication-ready figure
ggsave("volcano_plot.pdf", width=7, height=6, dpi=300)
ggsave("volcano_plot.png", width=7, height=6, dpi=300) # Also save PNG version
Creating multi-panel figures:
library(gridExtra)
library(cowplot) # Alternative package for arranging plots
# Create multiple plots (reusing res_df from the MA and volcano plot examples)
p1 <- ggplot(res_df, aes(x=baseMean, y=log2FoldChange, color=significant)) +
  geom_point(alpha=0.6) +
  scale_x_log10() +
  scale_color_manual(values=c("gray", "red")) +
  theme_minimal()
p2 <- ggplot(res_df, aes(x=log2FoldChange, y=neg_log10_padj, color=significant)) +
  geom_point(alpha=0.6) +
  scale_color_manual(values=c("gray", "red")) +
  theme_minimal()
# Arrange plots side by side
combined_plot <- plot_grid(p1, p2, labels=c("A", "B"), ncol=2)
# Save combined figure
ggsave("combined_plots.pdf", combined_plot, width=14, height=6, dpi=300)
Color palette recommendations:
# Colorblind-friendly palettes
library(viridis)
library(RColorBrewer)
# Viridis palettes (excellent for continuous data)
scale_color_viridis_c() # Continuous
scale_color_viridis_d() # Discrete
# ColorBrewer palettes (good for categorical data)
scale_color_brewer(palette="Set2") # Qualitative
scale_color_brewer(palette="RdYlBu") # Diverging
# Custom colorblind-friendly palette
cb_palette <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442",
"#0072B2", "#D55E00", "#CC79A7")
Export settings for different journals:
# Nature/Science style (single column, high resolution)
ggsave("figure.pdf", width=3.5, height=3.5, units="in", dpi=300)
# PLoS style (wider figures)
ggsave("figure.pdf", width=7, height=5, units="in", dpi=300)
# General publication (adjust as needed)
ggsave("figure.pdf", width=6, height=4.5, units="in", dpi=300)
Heatmaps (for Multi-Condition Experiments)
Heatmaps are excellent for visualizing expression patterns across multiple conditions:
library(ComplexHeatmap)
library(circlize)
# Select the top 50 genes by adjusted p-value
top_genes <- head(order(res$padj), 50)
# Get normalized counts
vsd <- vst(dds, blind=FALSE)
mat <- assay(vsd)[top_genes, ]
# Z-score normalization for visualization
mat_scaled <- t(scale(t(mat)))
# Create heatmap
Heatmap(mat_scaled,
name = "Z-score",
column_title = "Samples",
row_title = "Genes",
cluster_rows = TRUE,
cluster_columns = TRUE,
show_row_names = FALSE,
col = colorRamp2(c(-2, 0, 2), c("blue", "white", "red")),
top_annotation = HeatmapAnnotation(
Condition = colData(dds)$condition,
Batch = colData(dds)$batch,
col = list(Condition = c("control" = "blue", "treated" = "red"),
Batch = c("batch1" = "gray", "batch2" = "black"))
))
Alternative with pheatmap:
library(pheatmap)
# Prepare annotation
ann_col <- data.frame(
Condition = colData(dds)$condition,
Batch = colData(dds)$batch
)
rownames(ann_col) <- colnames(mat_scaled)
# Create heatmap
pheatmap(mat_scaled,
annotation_col = ann_col,
cluster_rows = TRUE,
cluster_cols = TRUE,
show_rownames = FALSE,
color = colorRampPalette(c("blue", "white", "red"))(100))
Sankey Diagrams (for Multi-Condition Experiments)
Sankey diagrams visualize gene flow and relationships across conditions:
library(networkD3)
library(dplyr)
# Example: Compare DE genes across multiple conditions
# Get DE genes for each comparison (guarding against NA padj values)
de_cond1 <- rownames(res_cond1)[!is.na(res_cond1$padj) & res_cond1$padj < 0.05]
de_cond2 <- rownames(res_cond2)[!is.na(res_cond2$padj) & res_cond2$padj < 0.05]
de_cond3 <- rownames(res_cond3)[!is.na(res_cond3$padj) & res_cond3$padj < 0.05]
# Create nodes and links (networkD3 needs 0-based numeric node indices, not names)
all_de_genes <- unique(c(de_cond1, de_cond2, de_cond3))
nodes <- data.frame(
  name = c("Condition1", "Condition2", "Condition3", all_de_genes)
)
links <- data.frame(
  source = c(rep(0, length(de_cond1)),
             rep(1, length(de_cond2)),
             rep(2, length(de_cond3))),
  target = match(c(de_cond1, de_cond2, de_cond3), nodes$name) - 1,
  value = 1
)
# Create Sankey diagram
sankeyNetwork(Links = links, Nodes = nodes, Source = "source",
Target = "target", Value = "value", NodeID = "name",
units = "genes", fontSize = 12, nodeWidth = 30)
Alternative with ggplot2 and ggalluvial:
library(ggalluvial)
# Prepare data for alluvial plot (illustrative; sampling 100 genes per
# comparison assumes each DE list contains at least 100 genes)
de_data <- data.frame(
  Condition = rep(c("Control", "Treated1", "Treated2"), each=100),
  Gene = c(sample(de_cond1, 100), sample(de_cond2, 100), sample(de_cond3, 100)),
  Direction = sample(c("Up", "Down"), 300, replace=TRUE)
)
ggplot(de_data, aes(x = Condition, stratum = Gene, alluvium = Gene,
fill = Direction)) +
geom_flow() +
geom_stratum() +
theme_minimal()
Brief Mentions of edgeR and limma/voom
edgeR
edgeR is similar to DESeq2 but uses a different normalization method (TMM, trimmed mean of M-values, rather than DESeq2's median-of-ratios):
library(edgeR)
# Create DGEList object
dge <- DGEList(counts=counts, group=sample_info$condition)
# Build the design matrix
design <- model.matrix(~condition, data=sample_info)
# Filter and normalize
keep <- filterByExpr(dge, design)
dge <- dge[keep, , keep.lib.sizes=FALSE]
dge <- calcNormFactors(dge)
dge <- estimateDisp(dge, design)
# Fit model and test
fit <- glmFit(dge, design)
lrt <- glmLRT(fit, coef=2)
topTags(lrt)
Best for: Similar use cases to DESeq2, sometimes faster
limma/voom
limma with voom transformation is powerful for complex designs:
library(limma)
library(edgeR)
# Build design matrix and normalized DGEList
design <- model.matrix(~condition, data=sample_info)
dge <- DGEList(counts=counts)
dge <- calcNormFactors(dge)
# Voom transformation
v <- voom(dge, design, plot=TRUE)
# Fit linear model
fit <- lmFit(v, design)
fit <- eBayes(fit)
# Get results
topTable(fit, coef=2)
Best for: Large sample sizes, complex experimental designs, microarray-like analysis
Section 7: Overrepresentation Analysis (ORA)
Overrepresentation Analysis (ORA) identifies which biological pathways, processes, or functions are overrepresented in a list of differentially expressed genes. It’s a simple but powerful method for interpreting the biological meaning of your results.
Introduction to ORA
ORA tests whether genes in your gene list (e.g., significantly upregulated genes) are enriched for specific biological annotations compared to a background set of genes. It uses statistical tests (typically hypergeometric or Fisher’s exact test) to determine if certain gene sets appear more frequently than expected by chance.
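The underlying test can be sketched by hand; a toy example with hypothetical counts (40 of 500 DE genes fall in a gene set, versus 300 of the remaining 14,500 background genes):

```r
# 2x2 contingency table for one gene set, tested with Fisher's exact test
contingency <- matrix(c(40, 500 - 40,       # DE genes: in set / not in set
                        300, 14500 - 300),  # background: in set / not in set
                      nrow = 2, byrow = TRUE)
ft <- fisher.test(contingency, alternative = "greater")
ft$p.value  # a small p-value indicates overrepresentation among DE genes
```

Here the DE genes hit the set at 8% versus roughly 2% in the background, so the one-sided test is highly significant.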
Background Selection (Critical Importance)
The choice of background gene set is critical and directly affects the results and interpretation of ORA.
What is Background and Why It Matters
The background gene set represents the “universe” of genes from which your gene list was drawn. It defines what genes are considered “available” for enrichment, which directly impacts:
- Statistical significance: Different backgrounds yield different p-values
- Biological interpretation: Results may change dramatically
- False positive/negative rates: Wrong background can create or hide enrichments
Common Background Choices
All genes in the genome: (not recommended)
- Includes all annotated genes
- May include genes not expressed in your experiment
- Can dilute enrichment signals
- Use when: You want the broadest possible comparison
All genes detected/expressed in the experiment:
- Only genes with some expression in your samples
- More appropriate for RNA-seq experiments
- Reduces false positives from non-expressed genes
- Use when: You want experiment-specific context (RECOMMENDED for most RNA-seq)
All genes passing quality filters:
- Genes that passed your filtering criteria (e.g., minimum counts)
- Most conservative and appropriate
- Matches your actual analysis
- Use when: You want to match your DE analysis exactly (OFTEN BEST CHOICE)
All genes in a specific pathway database:
- Only genes annotated in the database you’re testing
- Very conservative
- May miss relevant pathways
- Use when: Testing specific curated pathways
Impact of Background Selection on Results
Statistical significance (p-values):
- Smaller background → More significant results (but may be inflated)
- Larger background → Less significant results
False positive/negative rates:
- Too permissive background → More false positives
- Too restrictive background → More false negatives
Biological interpretation:
- Wrong background can make pathways appear enriched when they’re not
- Can hide true enrichments if background is too broad
Best Practices for Background Selection
- Match background to experimental design:
- Use genes that could realistically be detected in your experiment
- Consider your library type (polyA-selected vs. total RNA)
- Use experiment-specific background (detected genes):
- Most appropriate for RNA-seq
- Reduces false positives from non-expressed genes
- Better reflects your actual experimental context
- Avoid overly restrictive or permissive backgrounds:
- Too restrictive: May miss relevant pathways
- Too permissive: May create false enrichments
- Document background choice:
- Always report which background you used
- Justify your choice
- Consider sensitivity analysis with different backgrounds
Common Pitfalls and Mistakes
- Using genome-wide background for tissue-specific experiments: Should use tissue-expressed genes
- Including non-expressed genes: Creates false enrichments
- Not matching background to filtering criteria: Background should match your DE analysis
- Using different backgrounds for up- and down-regulated genes: Should be consistent
- Not documenting background choice: Makes results irreproducible
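In clusterProfiler, the background is supplied through the `universe` argument; a sketch, assuming `dds`, an `up_genes` vector, and Ensembl gene IDs (adjust `keyType` to your annotation):

```r
library(clusterProfiler)
library(org.Hs.eg.db)
# Use the genes that entered the DE analysis (post-filtering) as background
background_genes <- rownames(dds)
go_enrich_bg <- enrichGO(gene = up_genes,
                         universe = background_genes,
                         OrgDb = org.Hs.eg.db,
                         keyType = "ENSEMBL", # assumption: Ensembl gene IDs
                         ont = "BP",
                         pAdjustMethod = "BH")
```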
Gene List Preparation
Prepare your gene list from differential expression results:
# Get significantly upregulated genes
up_genes <- rownames(res)[!is.na(res$padj) & res$padj < 0.05 & res$log2FoldChange > 1]
# Get significantly downregulated genes
down_genes <- rownames(res)[!is.na(res$padj) & res$padj < 0.05 & res$log2FoldChange < -1]
# Get all significant genes
all_sig_genes <- rownames(res)[!is.na(res$padj) & res$padj < 0.05]
# Convert to appropriate format (e.g., Entrez IDs, gene symbols)
# This depends on your annotation and the tool you're using
Main Sources of Gene Sets
Multiple databases provide gene sets for enrichment analysis:
GO (Gene Ontology) Terms
The Gene Ontology provides structured vocabularies for gene function:
Biological Process (BP): High-level biological objectives (e.g., “cell cycle”, “immune response”)
Molecular Function (MF): Molecular activities of gene products (e.g., “kinase activity”, “DNA binding”)
Cellular Component (CC): Locations where gene products act (e.g., “nucleus”, “mitochondrion”)
Hierarchical structure and relationships:
- GO terms are organized in a directed acyclic graph (DAG)
- Terms have parent-child relationships
- More specific terms are children of more general terms
GO enrichment databases and resources:
- GO database: http://geneontology.org/
- org.Hs.eg.db (Bioconductor): Human GO annotations
- org.Mm.eg.db: Mouse GO annotations
- Other species-specific packages available
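The DAG can be explored programmatically with the GO.db annotation package; a small sketch (the term ID is illustrative):

```r
library(GO.db)
# Direct parents and children of "mitotic cell cycle" (GO:0000278) in BP
as.list(GOBPPARENTS["GO:0000278"])
as.list(GOBPCHILDREN["GO:0000278"])
# Human-readable term name
Term("GO:0000278")
```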
KEGG (Kyoto Encyclopedia of Genes and Genomes) Pathways
KEGG provides curated pathway maps:
Metabolic pathways: Biochemical reactions and pathways
Signaling pathways: Signal transduction cascades
Disease pathways: Disease-related pathways
Organism-specific pathways: Pathways annotated for specific organisms
Resources:
- KEGG database: https://www.genome.jp/kegg/
- KEGGREST (Bioconductor): R interface to KEGG
- KEGG.db (Bioconductor): Local KEGG pathway annotations
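A quick sketch of querying KEGG from R with KEGGREST (these calls hit the live KEGG REST API, so they require internet access):

```r
library(KEGGREST)
# List all human pathways
pathways <- keggList("pathway", "hsa")
# Map human gene IDs to the pathways that contain them
gene2path <- keggLink("pathway", "hsa")
# Retrieve the full record for one pathway (cell cycle)
cc <- keggGet("hsa04110")
```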
REACTOME
REACTOME is a curated pathway database:
Curated pathway database: Expert-curated, peer-reviewed pathways
Human pathways with orthologs: Primarily human, with orthologs in other species
Hierarchical pathway structure: Pathways organized hierarchically
Resources:
- REACTOME database: https://reactome.org/
- reactome.db (Bioconductor): R interface
WikiPathways
WikiPathways is a community-curated pathway resource:
Community-curated pathways: Pathways created and maintained by the research community
Multiple species coverage: Pathways for various organisms
Pathway visualization: Interactive pathway diagrams
Resources:
- WikiPathways: https://www.wikipathways.org/
- rWikiPathways (Bioconductor): R interface
MSigDB (Molecular Signatures Database)
MSigDB provides comprehensive gene set collections:
Hallmark gene sets: 50 well-defined gene sets representing specific biological states
Curated gene sets: Gene sets from various sources (pathways, literature, etc.)
Motif gene sets: Gene sets based on transcription factor binding motifs
Computational gene sets: Gene sets identified through computational analysis
GO gene sets: GO terms organized as gene sets
Oncogenic signatures: Cancer-related gene signatures
Immunologic signatures: Immune system-related gene sets
Resources:
- MSigDB: https://www.gsea-msigdb.org/gsea/msigdb/
- msigdbr (Bioconductor): R package for MSigDB gene sets for 20 species
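MSigDB gene sets can be pulled into R with msigdbr and passed to clusterProfiler's generic enricher(); a sketch (argument names differ between msigdbr releases; older versions use `category`, newer ones `collection`):

```r
library(msigdbr)
library(clusterProfiler)
# Hallmark collection for human
hallmark <- msigdbr(species = "Homo sapiens", category = "H")
# TERM2GENE table: gene set name + member gene symbol
term2gene <- hallmark[, c("gs_name", "gene_symbol")]
enrich_hallmark <- enricher(gene = up_genes, TERM2GENE = term2gene)
```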
Enrichr
Enrichr is a web-based enrichment analysis platform:
Web-based platform: Easy-to-use web interface
Multiple gene set libraries: Integrates many gene set databases
Interactive visualization: Rich visualizations and interactive results
API access: Programmatic access available
Resources:
- Enrichr: https://maayanlab.cloud/Enrichr/
- enrichR (CRAN): R package for Enrichr API
Comparison of Gene Set Sources
| Source | Coverage | Curation | Update Frequency | Species | Best For |
|---|---|---|---|---|---|
| GO | Comprehensive | High | Regular | Many | General functional annotation |
| KEGG | Focused | High | Regular | Many | Metabolic and signaling pathways |
| REACTOME | Comprehensive | Very High | Regular | Primarily human | Detailed pathway analysis |
| WikiPathways | Growing | Community | Continuous | Many | Community knowledge, visualization |
| MSigDB | Very Comprehensive | Variable | Regular | Primarily human/mouse | Comprehensive analysis, cancer research |
| Enrichr | Very Comprehensive | Variable | Regular | Many | Quick analysis, multiple databases |
Use cases and recommendations:
- General functional analysis: GO terms
- Pathway analysis: KEGG, REACTOME, WikiPathways
- Comprehensive analysis: MSigDB
- Quick exploration: Enrichr
- Disease-specific: MSigDB oncogenic/immunologic signatures
- Metabolic focus: KEGG
Tools and Methods
Common tools for ORA:
# Using clusterProfiler (Bioconductor)
library(clusterProfiler)
library(org.Hs.eg.db)
# Convert gene IDs (if needed)
gene_ids <- bitr(up_genes, fromType="SYMBOL", toType="ENTREZID",
OrgDb=org.Hs.eg.db)
# GO enrichment
go_enrich <- enrichGO(gene = gene_ids$ENTREZID,
OrgDb = org.Hs.eg.db,
ont = "BP",
pAdjustMethod = "BH",
pvalueCutoff = 0.05,
qvalueCutoff = 0.2,
readable = TRUE)
# KEGG enrichment
kegg_enrich <- enrichKEGG(gene = gene_ids$ENTREZID,
organism = "hsa",
pvalueCutoff = 0.05)
# REACTOME enrichment (enrichPathway() comes from ReactomePA, not reactome.db)
library(ReactomePA)
reactome_enrich <- enrichPathway(gene = gene_ids$ENTREZID,
                                 organism = "human",
                                 pvalueCutoff = 0.05)
Interpretation of Results
Key metrics to interpret:
- p-value / adjusted p-value: Statistical significance
- Gene ratio: Proportion of genes in your list that are in the gene set
- Background ratio: Proportion of background genes in the gene set
- Gene list: Which of your genes are in the enriched set
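These quantities are easy to reproduce by hand. The base-R sketch below, using made-up counts, shows how the gene ratio, background ratio, and the underlying one-sided hypergeometric p-value relate:

```r
# Illustrative counts (not from a real experiment):
# 200 genes in the DE list, 6,000 genes in the background,
# a gene set of 120 genes, 15 of which are in the DE list
k <- 15; n_list <- 200; n_set <- 120; n_bg <- 6000

gene_ratio <- k / n_list    # GeneRatio, reported as "15/200"
bg_ratio   <- n_set / n_bg  # BgRatio, reported as "120/6000"

# One-sided hypergeometric test: P(X >= k genes overlapping by chance)
p_value <- phyper(k - 1, n_set, n_bg - n_set, n_list, lower.tail = FALSE)
```

Enrichment means the gene ratio is markedly larger than the background ratio; phyper() quantifies how surprising that excess is given the background.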
Examples with Different Background Selections
# Example 1: Genome-wide background
go_genome <- enrichGO(gene = gene_ids$ENTREZID,
universe = NULL, # All genes in database
OrgDb = org.Hs.eg.db,
ont = "BP")
# Example 2: Experiment-specific background (detected genes)
detected_genes <- rownames(counts)[rowSums(counts) > 0]
detected_ids <- bitr(detected_genes, fromType="SYMBOL", toType="ENTREZID",
OrgDb=org.Hs.eg.db)
go_detected <- enrichGO(gene = gene_ids$ENTREZID,
universe = detected_ids$ENTREZID, # Only detected genes
OrgDb = org.Hs.eg.db,
ont = "BP")
# Example 3: Filtered genes background (matches DE analysis)
filtered_genes <- rownames(dds) # Genes that passed filtering
filtered_ids <- bitr(filtered_genes, fromType="SYMBOL", toType="ENTREZID",
OrgDb=org.Hs.eg.db)
go_filtered <- enrichGO(gene = gene_ids$ENTREZID,
universe = filtered_ids$ENTREZID, # Matches DE analysis
OrgDb = org.Hs.eg.db,
ont = "BP")
# Compare results
head(go_genome@result)
head(go_detected@result)
head(go_filtered@result)
Section 8: Gene Set Enrichment Analysis (GSEA)
Gene Set Enrichment Analysis (GSEA) is a powerful method that considers all genes in your dataset, not just those that pass a significance threshold. It’s particularly useful when you have many genes with small but coordinated changes.
Introduction to GSEA and fgsea
GSEA tests whether predefined gene sets show statistically significant, concordant differences between two biological states. Unlike ORA, GSEA:
- Uses all genes, ranked by their association with the phenotype
- Doesn’t require arbitrary significance thresholds
- Can detect subtle but coordinated changes
- Is more sensitive to pathway-level effects
fgsea (Fast Gene Set Enrichment Analysis) is an R implementation that’s faster than the original GSEA software and integrates well with R workflows.
Loading fgsea R Package
# Install if needed
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("fgsea")
# Load library
library(fgsea)
library(data.table)
Preparing Ranked Gene Lists
GSEA requires genes ranked by their association with the phenotype (e.g., log2 fold change, t-statistic):
# Get gene statistics from DESeq2 results
res_stat <- res$stat # or use log2FoldChange
names(res_stat) <- rownames(res)
res_stat <- res_stat[!is.na(res_stat)] # drop genes DESeq2 could not test
# Sort by statistic (decreasing)
ranks <- sort(res_stat, decreasing=TRUE)
# Alternatively, rank by -log10(p-value) * sign(log2FC)
res$ranking_metric <- -log10(res$padj) * sign(res$log2FoldChange)
ranks <- sort(setNames(res$ranking_metric, rownames(res)), decreasing=TRUE)
ranks <- ranks[!is.na(ranks)] # padj is NA for independently filtered genes
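One practical wrinkle: after gene ID conversion, the names in the ranked vector may contain duplicates, which fgsea warns about. A common fix (a sketch, not the only option) keeps the entry with the largest absolute statistic per gene:

```r
# Resolve duplicated gene names, keeping the strongest signal per gene
ranks <- ranks[order(abs(ranks), decreasing = TRUE)]
ranks <- ranks[!duplicated(names(ranks))]
ranks <- sort(ranks, decreasing = TRUE) # restore ranking order
```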
Gene Set Databases
MSigDB (Molecular Signatures Database)
MSigDB is the primary source for GSEA gene sets:
library(msigdbr)
# Get gene sets (example: Hallmark gene sets)
hallmark_sets <- msigdbr(species = "Homo sapiens", category = "H")
hallmark_list <- split(x = hallmark_sets$gene_symbol,
f = hallmark_sets$gs_name)
# Other categories available:
# - "C2": Curated gene sets
# - "C3": Motif gene sets
# - "C5": GO gene sets
# - "C6": Oncogenic signatures
# - "C7": Immunologic signatures
Enrichr
Enrichr gene sets can also be used, though MSigDB is more commonly used for GSEA:
# Enrichr gene sets are typically accessed through the web interface
# or via the enrichR package for specific libraries
library(enrichR)
# List available databases
dbs <- listEnrichrDbs()
Running fgsea Analysis
# Run GSEA
# (newer fgsea versions use an adaptive multilevel procedure when nperm
# is omitted; supplying nperm runs the classic permutation test)
fgsea_res <- fgsea(pathways = hallmark_list,
                   stats = ranks,
                   minSize = 15,   # Minimum gene set size
                   maxSize = 500,  # Maximum gene set size
                   nperm = 10000)  # Number of permutations
# Sort by normalized enrichment score (NES)
fgsea_res <- fgsea_res[order(NES, decreasing=TRUE),]
# Filter significant results
sig_pathways <- fgsea_res[padj < 0.05 & abs(NES) > 1, ]
Key parameters:
- minSize: Minimum number of genes in a gene set (15 is a common choice)
- maxSize: Maximum number of genes in a gene set (500 is a common choice)
- nperm: Number of permutations for the p-value calculation (more = more accurate but slower)
Interpreting Results
Key columns in fgsea results:
- pathway: Name of the gene set
- pval: Nominal p-value
- padj: Adjusted p-value (FDR)
- ES: Enrichment score
- NES: Normalized enrichment score (normalized across gene set sizes)
- size: Number of genes in the gene set
- leadingEdge: Genes that contribute most to the enrichment
Interpretation:
- Positive NES: Gene set is enriched at the top of the ranked list (upregulated)
- Negative NES: Gene set is enriched at the bottom (downregulated)
- |NES| > 1: A common rule-of-thumb cutoff on effect size
- padj < 0.05: Statistically significant after multiple testing correction
Visualization
Enrichment Plots
# Plot the running enrichment score for one pathway
library(ggplot2) # plotEnrichment() returns a ggplot object
plotEnrichment(hallmark_list[["HALLMARK_INTERFERON_ALPHA_RESPONSE"]],
               ranks) +
  labs(title="HALLMARK_INTERFERON_ALPHA_RESPONSE")
# Create a table plot of top results
top_pathways <- head(fgsea_res[order(padj), ], 10)
ggplot(top_pathways, aes(reorder(pathway, NES), NES)) +
geom_col(aes(fill = padj < 0.05)) +
coord_flip() +
labs(x="Pathway", y="Normalized Enrichment Score",
title="GSEA Results") +
theme_minimal()
Pathway Networks
# Visualize leading edge genes
library(ComplexHeatmap)
# Get leading edge genes for top pathways
top_pathway <- fgsea_res[1, pathway]
leading_genes <- fgsea_res[pathway == top_pathway, leadingEdge][[1]]
# Create heatmap of leading edge genes
leading_ranks <- ranks[names(ranks) %in% leading_genes]
mat <- matrix(leading_ranks, nrow=length(leading_ranks), ncol=1)
rownames(mat) <- names(leading_ranks)
Heatmap(mat, name = "Rank", cluster_rows = FALSE)
R Code Examples
Complete GSEA workflow:
library(fgsea)
library(msigdbr)
library(ggplot2)
# 1. Prepare ranked gene list
ranks <- sort(res$stat, decreasing=TRUE)
names(ranks) <- rownames(res)
# 2. Get gene sets
hallmark_sets <- msigdbr(species = "Homo sapiens", category = "H")
hallmark_list <- split(x = hallmark_sets$gene_symbol,
f = hallmark_sets$gs_name)
# 3. Run GSEA
fgsea_res <- fgsea(pathways = hallmark_list,
stats = ranks,
minSize = 15,
maxSize = 500,
nperm = 10000)
# 4. Filter and sort
sig_pathways <- fgsea_res[padj < 0.05 & abs(NES) > 1, ]
sig_pathways <- sig_pathways[order(NES, decreasing=TRUE), ]
# 5. Visualize
ggplot(head(sig_pathways, 20), aes(reorder(pathway, NES), NES)) + # head() avoids NA rows if fewer than 20 hits
geom_col(aes(fill = NES > 0)) +
coord_flip() +
labs(x="Pathway", y="Normalized Enrichment Score") +
scale_fill_manual(values=c("blue", "red"),
labels=c("Down", "Up")) +
theme_minimal()
Section 9: Reproducibility and Data Sharing
Ensuring reproducibility is essential for scientific research. Here are best practices for making your RNA-seq analysis reproducible:
Recording Software Versions
Always document the versions of all software and packages used:
# In R, record package versions
sessionInfo()
# Or create a more detailed report
writeLines(capture.output(sessionInfo()), "R_session_info.txt")
For command-line tools:
# Record versions
echo "STAR version: $(STAR --version)" >> software_versions.txt
echo "featureCounts version: $(featureCounts -v 2>&1)" >> software_versions.txt  # featureCounts prints its version to stderr
echo "FastQC version: $(fastqc --version)" >> software_versions.txt
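The one-off echo lines above can be generalized. The sketch below records whichever tools from a list are actually installed (the tool names are examples; substitute your own), so the same script works across machines with partial installs:

```shell
# Record tool versions; note which tools are missing instead of failing.
: > software_versions.txt
for tool in STAR fastqc samtools; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: $("$tool" --version 2>&1 | head -n 1)" >> software_versions.txt
  else
    echo "$tool: not found" >> software_versions.txt
  fi
done
```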
Using Containers for Reproducibility
Containers (Docker, Singularity) ensure consistent software environments:
Docker example:
# Dockerfile for RNA-seq analysis
FROM continuumio/miniconda3
# featureCounts is distributed in the subread package;
# DESeq2 is packaged on bioconda as bioconductor-deseq2
RUN conda install -y -c bioconda -c conda-forge star subread fastqc
RUN conda install -y -c bioconda -c conda-forge r-base bioconductor-deseq2
Singularity example:
# Build from a Docker image (pin a specific version tag rather than
# "latest" so the build stays reproducible)
singularity build rnaseq_analysis.sif docker://biocontainers/star:latest
Version Control
Use Git to track your analysis code:
# Initialize repository
git init
git add *.R *.sh *.yaml
git commit -m "Initial RNA-seq analysis code"
Best practices:
- Track code, not data (use .gitignore for large files)
- Write clear commit messages
- Use branches for experimental analyses
- Tag releases/versions
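A minimal .gitignore along these lines keeps bulky data and session artifacts out of the repository (the patterns are illustrative; adapt them to your directory layout):

```
# Raw and intermediate data: too large for Git
*.fastq.gz
*.bam
*.bai
counts/
# R session artifacts and rendered reports
.Rhistory
.RData
*.html
```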
Creating Reproducible Reports
R Markdown for integrated analysis and reporting:
# Example R Markdown structure
---
title: "RNA-seq Analysis Report"
author: "Your Name"
date: "`r Sys.Date()`"
output:
  html_document:
    toc: true
    toc_float: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
library(DESeq2)
library(ggplot2)
```

```{r load-data}
counts <- read.table("counts.txt", header=TRUE, row.names=1)
```

# Results

## Differential Expression

```{r de-analysis}
dds <- DESeqDataSetFromMatrix(countData = counts, ...)
dds <- DESeq(dds)
res <- results(dds)
```
Jupyter notebooks (Python-based analysis):
- Good for interactive exploration
- Supports multiple languages (R, Python, bash)
- Easy to share and collaborate
Data Sharing
Where to share data:
- GEO (Gene Expression Omnibus): https://www.ncbi.nlm.nih.gov/geo/
  - For processed data (count matrices, normalized data)
  - Requires metadata and sample information
- SRA (Sequence Read Archive): https://www.ncbi.nlm.nih.gov/sra
  - For raw sequencing data (FASTQ files)
  - Linked to GEO submissions
- Zenodo: https://zenodo.org/
  - For code, intermediate files, and supplementary data
  - Provides DOIs for citations
What to share:
- Raw sequencing data (FASTQ files) - via SRA
- Processed count matrices - via GEO
- Analysis code and scripts - via GitHub/Zenodo
- Metadata and sample information - via GEO
- Key intermediate files (if space allows)
Metadata requirements:
- Sample information (conditions, replicates, etc.)
- Experimental design
- Processing parameters
- Software versions
- Quality control metrics
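Much of this metadata is conveniently collected in a sample sheet. A minimal made-up example of the kind of table submission portals and downstream analysis both expect (column names and values are illustrative):

```
sample_id,condition,replicate,batch,library_prep
ctrl_1,control,1,A,polyA
ctrl_2,control,2,B,polyA
trt_1,treated,1,A,polyA
trt_2,treated,2,B,polyA
```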
Creating Analysis Scripts
Organize your analysis into modular scripts:
# Example workflow script structure
# Example workflow script structure
# 01_qc.sh - Quality control
# 02_trim.sh - Adapter trimming
# 03_align.sh - Read alignment
# 04_quantify.sh - Expression quantification
# 05_de_analysis.R - Differential expression
# 06_enrichment.R - Functional enrichment
Best practices:
- Use relative paths (not absolute paths)
- Include comments explaining each step
- Make scripts executable and well-documented
- Test scripts on a subset of data first
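A small driver script can tie the numbered steps together and stop at the first failure; the sketch below uses the hypothetical step-script names listed above and logs each stage so failures are easy to localize:

```shell
#!/usr/bin/env bash
# Run each stage in order; abort on the first error, fail on unset
# variables, and record progress in pipeline.log.
set -euo pipefail
: > pipeline.log
for step in 01_qc.sh 02_trim.sh 03_align.sh 04_quantify.sh; do
  if [ -x "$step" ]; then
    echo "running $step" >> pipeline.log
    "./$step"
  else
    echo "missing $step (skipped)" >> pipeline.log
  fi
done
```

The missing-script branch also makes dry runs possible on a partial checkout.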
Documentation
Document your analysis thoroughly:
- README files: Explain how to run the analysis
- Comments in code: Explain why, not just what
- Analysis logs: Record parameters and decisions
- Troubleshooting notes: Document issues and solutions
Section 10: Workflow Integration
Putting all the steps together into a cohesive analysis pipeline is crucial for reproducible and reliable results.
Best Practices
- Document everything: Keep detailed notes on parameters, versions, and decisions (see Section 9 for detailed reproducibility guidelines)
- Quality at each step: Don’t proceed if QC fails
- Appropriate controls: Include positive and negative controls
- Statistical rigor: Use appropriate multiple testing correction
- Biological validation: Verify key findings with independent methods
Common Pitfalls
- Skipping QC steps: Always check quality at each stage
- Ignoring batch effects: Can create false results
- Wrong background in ORA: Dramatically affects results
- Not checking for contamination: Can lead to misinterpretation
- Over-interpreting p-values: Consider effect sizes and biological relevance
- Not validating findings: Important results should be confirmed
Troubleshooting Tips
Low alignment rates:
- Check read quality
- Verify genome/annotation versions match
- Check for contamination
- Consider different aligner or parameters
High contamination:
- Review sample preparation
- Use appropriate cleanup tools
- Consider excluding severely contaminated samples
No significant DE genes:
- Check if effect sizes are biologically meaningful
- Verify experimental design and sample sizes
- Check for batch effects masking signal
- Consider less stringent thresholds (with caution)
Unexpected enrichment results:
- Verify background gene set
- Check gene ID conversion
- Review gene set definitions
- Consider alternative databases
Conclusion
This tutorial has covered the complete workflow for bulk RNA-seq data analysis, from raw reads to biological interpretation. The workflow overview is provided in the Introduction section. Key takeaways include the importance of quality control at each step, proper experimental design, appropriate statistical methods, and thorough documentation for reproducibility.
Next Steps
- Explore single-cell RNA-seq for cellular heterogeneity
- Learn spatial transcriptomics for tissue context
- Dive deeper into alternative splicing analysis
- Investigate time-course and multi-condition experimental designs
- Learn about network analysis and pathway modeling
Additional Resources
- Bioconductor: https://www.bioconductor.org/ - R packages for genomic analysis
- Galaxy: https://usegalaxy.org/ - Web-based analysis platform
- nf-core: https://nf-co.re/ - Nextflow pipelines for RNA-seq
- RNA-seq blog: https://www.rna-seqblog.com/ - Latest RNA-seq news and methods
- Key papers:
- Dobin et al. (2013) STAR: ultrafast universal RNA-seq aligner
- Love et al. (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2
- Subramanian et al. (2005) Gene set enrichment analysis
This tutorial provides a comprehensive foundation for bulk RNA-seq analysis. Remember that each experiment is unique, and you may need to adapt these methods to your specific research questions and data characteristics.