Nextflow is a powerful workflow management system that makes it easy to write data-intensive computational pipelines. One of the core concepts in Nextflow is the process definition, which is the building block of any workflow. In this post, we’ll break down the anatomy of a Nextflow process and understand each component.
What is a Nextflow Process?
A process in Nextflow is a self-contained computational unit that:
- Takes inputs from channels
- Executes a script (shell, Python, R, etc.)
- Produces outputs that can be consumed by other processes
- Can be configured with directives for resource management
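Before dissecting each part, here is a minimal sketch of a complete process and the workflow that calls it. The process name, greeting, and channel values are purely illustrative:

process SAY_HELLO {
    input:
    val name

    output:
    stdout

    script:
    """
    echo "Hello, ${name}!"
    """
}

workflow {
    names_ch = Channel.of('world', 'Nextflow')   // a channel of value inputs
    SAY_HELLO(names_ch)
    SAY_HELLO.out.view()                         // print each greeting
}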
Process Structure
A typical Nextflow process follows this structure:
process PROCESS_NAME {
    // Directives (optional)
    tag 'some_tag'
    label 'resource_profile'
    container 'image:tag'

    // Input block (optional, but used by most processes)
    input:
    val x
    path input_file

    // Output block (optional, but used by most processes)
    output:
    path output_file
    val statistics, emit: stats

    // Script block (required)
    script:
    """
    your_command_here
    """
}
Key Components Explained
1. Common Directives
Directives configure how the process runs:
- tag: Identifier for logging and monitoring
- label: Resource profile (CPU, memory requirements)
- container: Docker/Singularity container image
- conda: Conda environment file
- cpus: Number of CPU cores
- memory: Memory requirement (e.g., '8 GB')
- time: Time limit (e.g., '1h')
- publishDir: Where to publish output files
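To make this concrete, here is a rough sketch of several of these directives combined in one process. The label name, container image, and publish path are illustrative placeholders, not fixed conventions:

process COUNT_LINES {
    tag "count-${txt.baseName}"
    label 'process_low'                              // hypothetical label, mapped to resources in the config
    container 'quay.io/biocontainers/coreutils:9.5'  // illustrative image reference
    cpus 1
    memory '2 GB'
    time '30m'
    publishDir 'results/counts', mode: 'copy'

    input:
    path txt

    output:
    path "${txt.baseName}.count"

    script:
    """
    wc -l < ${txt} > ${txt.baseName}.count
    """
}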
2. Input Block
The input block defines what data the process receives:
input:
val x // Value input (strings, numbers)
path input_file // File input (automatically staged)
env VARIABLE_NAME // Environment variable
stdin // Standard input
tuple val(x), path(y) // Multiple inputs grouped together
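In practice, a tuple input is usually fed by a channel whose items are lists, so related values stay grouped per sample. A small sketch, where the sample IDs and file paths are hypothetical:

process SUMMARIZE {
    input:
    tuple val(sample_id), path(reads)

    output:
    path "${sample_id}.summary.txt"

    script:
    """
    echo "Sample: ${sample_id}" > ${sample_id}.summary.txt
    wc -c ${reads} >> ${sample_id}.summary.txt
    """
}

workflow {
    // Each list becomes one (sample_id, reads) tuple
    reads_ch = Channel.of(
        ['sampleA', file('data/sampleA.fastq.gz')],
        ['sampleB', file('data/sampleB.fastq.gz')]
    )
    SUMMARIZE(reads_ch)
}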
3. Output Block
The output block defines what the process produces:
output:
val result // Value output
path "output.txt" // File output
stdout // Standard output
tuple path(x), val(y) // Multiple outputs grouped together
val stats, emit: stats // Named output (for workflow access)
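The emit: names are what the workflow block uses to refer to a process's outputs. A minimal sketch, where the process name and the data/*.txt glob are placeholders:

process LINE_COUNT {
    input:
    path data

    output:
    path "${data}.count", emit: counts

    script:
    """
    wc -l < ${data} > ${data}.count
    """
}

workflow {
    LINE_COUNT(Channel.fromPath('data/*.txt'))          // placeholder input glob
    LINE_COUNT.out.counts.view { "count file: ${it}" }  // access the output by its emit name
}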
4. Script Block
The script block contains the actual commands to execute:
script:
// Optional Groovy section
def tool = 'bwa'
def version = '0.7.17'
"""
# Shell script section
echo "Running ${tool} version ${version}"
bwa mem -t ${task.cpus} ${ref} ${reads} > output.sam
"""
Complete Example
Here’s a complete example of a Nextflow process for sequence alignment:
process ALIGN {
    tag "alignment-${sample_id}"
    label 'cpu_intensive'
    container 'biocontainers/bwa:v0.7.17'
    cpus 4
    memory '8 GB'
    time '1h'

    input:
    val sample_id
    path reads
    path reference

    output:
    path "${sample_id}.sam", emit: alignment
    val sample_id, emit: sample

    script:
    """
    bwa mem -t ${task.cpus} ${reference} ${reads} > ${sample_id}.sam
    """
}
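To show how ALIGN would be invoked, here is a hedged sketch of a workflow block; the file paths are placeholders and not part of the original example:

workflow {
    sample_id = Channel.of('sample1')                                       // placeholder sample ID
    reads     = Channel.fromPath('data/sample1_R{1,2}.fastq.gz').collect()  // placeholder reads, grouped into one item
    reference = file('ref/genome.fa')                                       // placeholder reference genome

    ALIGN(sample_id, reads, reference)

    ALIGN.out.alignment.view { "SAM file: ${it}" }
}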
Best Practices
- Always use tag: Makes logging and debugging much easier
- Use emit for named outputs: Simplifies workflow code
- Specify resource requirements: Helps with scheduling and optimization
- Use containers: Ensures reproducibility across environments
- Escape variables properly: Use \$variable for shell variables and ${variable} for Groovy (see the sketch below)
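As a quick illustration of the escaping rule, the sketch below mixes a Nextflow/Groovy variable with a shell expression that must be escaped; the process itself is made up:

process REPORT {
    input:
    val sample_id

    output:
    path 'report.txt'

    script:
    """
    # Interpolated by Nextflow before the script runs:
    echo "Sample: ${sample_id}" > report.txt

    # Escaped, so it is evaluated by the shell at run time:
    echo "Host: \$(hostname)" >> report.txt
    """
}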
Process Execution Flow
When a process runs:
- Channel items arrive
- The when clause is checked (if present)
- Input files are staged to the work directory
- Environment variables are set (if env inputs are declared)
- The script executes in the work directory
- Output files are collected
- Files are published to publishDir (if specified)
- Outputs are emitted to channels
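Two of these steps are controlled by optional parts of the process definition: the when: clause and the publishDir directive. A small sketch combining them, where params.run_filtering and the paths are illustrative:

process FILTER {
    publishDir 'results/filtered', mode: 'copy'   // where outputs are published after the task completes

    input:
    path vcf

    output:
    path 'filtered.vcf'

    when:
    params.run_filtering                          // hypothetical parameter; the task is skipped when false

    script:
    """
    # Toy example: drop VCF header lines
    awk '!/^#/' ${vcf} > filtered.vcf
    """
}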
Conclusion
Understanding process definitions is crucial for writing effective Nextflow workflows. Each component serves a specific purpose in making your pipeline reproducible, scalable, and maintainable.
In future posts, we'll explore best practices for:
- Channel factories and operators
- Module composition
- nf-test test composition
- Subworkflow composition
- Workflow composition
- Advanced Nextflow patterns
- Conda/mamba environments and containers
- Nextflow configuration
- Groovy reference for Nextflow workflow development
Happy workflow building! 🧬