Nextflow is a powerful workflow management system that makes it easy to write data-intensive computational pipelines. One of the core concepts in Nextflow is the process definition, which is the building block of any workflow. In this post, we’ll break down the anatomy of a Nextflow process and understand each component.

What is a Nextflow Process?

A process in Nextflow is a self-contained computational unit that:

  • Takes inputs from channels
  • Executes a script (shell, Python, R, etc.)
  • Produces outputs that can be consumed by other processes
  • Can be configured with directives for resource management

Process Structure

A typical Nextflow process follows this structure:

process PROCESS_NAME {
    // Directives (optional)
    tag 'some_tag'
    label 'resource_profile'
    container 'image:tag'
    
    // Input block (optional)
    input:
    val x
    path input_file
    
    // Output block (optional)
    output:
    path output_file
    val statistics, emit: stats
    
    // Script block (required)
    script:
    """
    your_command_here
    """
}

Key Components Explained

1. Common Directives

Directives configure how the process runs:

  • tag: Identifier for logging and monitoring
  • label: Annotation used to group processes, typically so that configuration can assign resources (CPU, memory) to every process sharing the label
  • container: Docker/Singularity container image
  • conda: Conda packages or environment file
  • cpus: Number of CPU cores
  • memory: Memory requirement (e.g., '8 GB')
  • time: Time limit (e.g., '1h')
  • publishDir: Where to publish output files
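
A label does not set resources by itself; it is typically matched by a withLabel selector in the pipeline configuration. The following is a minimal sketch of a nextflow.config, assuming a label named cpu_intensive. The values are illustrative, not recommendations.

```nextflow
// nextflow.config (illustrative values only)
process {
    // Applies to every process that declares: label 'cpu_intensive'
    withLabel: 'cpu_intensive' {
        cpus   = 4
        memory = '8 GB'
        time   = '1h'
    }
}
```

Keeping resource settings in configuration rather than in the process body makes it possible to tune them per environment without editing the pipeline code.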

2. Input Block

The input block defines what data the process receives:

input:
val x                    // Value input (strings, numbers)
path input_file          // File input (automatically staged)
env VARIABLE_NAME        // Environment variable
stdin                    // Standard input
tuple val(x), path(y)    // Multiple inputs
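
To show how a tuple input lines up with channel items, here is a small sketch. The process name COUNT_LINES and the file paths are hypothetical, and each element inside the tuple is written with parentheses, as in val(sample_id).

```nextflow
// Hypothetical process: the name and the file paths are illustrative
process COUNT_LINES {
    input:
    tuple val(sample_id), path(reads)

    output:
    stdout

    script:
    """
    echo "${sample_id}: \$(wc -l < ${reads})"
    """
}

workflow {
    // Each channel item is a [sample_id, file] pair matching the tuple declaration
    samples = Channel.of(
        ['sampleA', file('data/sampleA.fastq')],
        ['sampleB', file('data/sampleB.fastq')]
    )
    COUNT_LINES(samples)
    COUNT_LINES.out.view()
}
```

The process runs once per channel item, so the two samples above would produce two independent tasks.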

3. Output Block

The output block defines what the process produces:

output:
val result               // Value output
path "output.txt"        // File output
stdout                   // Standard output
tuple path(x), val(y)    // Multiple outputs
val stats, emit: stats   // Named output (for workflow access)

4. Script Block

The script block contains the actual commands to execute:

script:
// Optional Groovy section
def tool = 'bwa'
def version = '0.7.17'

"""
# Shell script section
echo "Running ${tool} version ${version}"
bwa mem -t ${task.cpus} ${ref} ${reads} > output.sam
"""
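
The script block is not limited to shell commands: a shebang line on the first line selects another interpreter. A minimal sketch, assuming python3 is available on the PATH of the execution environment (the process name PY_HELLO is hypothetical):

```nextflow
// Hypothetical process showing a non-shell script block
process PY_HELLO {
    output:
    stdout

    script:
    """
    #!/usr/bin/env python3
    print("Hello from Python")
    """
}

workflow {
    PY_HELLO()
    PY_HELLO.out.view()
}
```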

Complete Example

Here’s a complete example of a Nextflow process for sequence alignment:

process ALIGN {
    tag "alignment-${sample_id}"
    label 'cpu_intensive'
    container 'biocontainers/bwa:v0.7.17'
    cpus 4
    memory '8 GB'
    time '1h'
    
    input:
    val sample_id
    path reads
    path reference
    
    output:
    path "${sample_id}.sam", emit: alignment
    val sample_id, emit: sample
    
    script:
    """
    bwa mem -t ${task.cpus} ${reference} ${reads} > ${sample_id}.sam
    """
}
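
To run the process, it has to be called from a workflow block. The sketch below assumes the ALIGN process above is defined in the same script; params.reads, params.reference, and the file paths are hypothetical placeholders.

```nextflow
// Assumes the ALIGN process above is defined in the same script.
// The parameter names and default paths are hypothetical.
params.reads     = 'data/sample1.fastq'
params.reference = 'ref/genome.fa'

workflow {
    // Arguments are passed in the same order as the input declarations
    ALIGN('sample1', file(params.reads), file(params.reference))

    // Named outputs are available through their emit names
    ALIGN.out.alignment.view { sam -> "Alignment file: ${sam}" }
    ALIGN.out.sample.view { id -> "Sample: ${id}" }
}
```

Note that bwa mem expects the index files produced by bwa index to sit next to the reference; staging them is outside the scope of this sketch.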

Best Practices

  1. Always use tag: Makes logging and debugging much easier
  2. Use emit for named outputs: Simplifies workflow code
  3. Specify resource requirements: Helps with scheduling and optimization
  4. Use containers: Ensures reproducibility across environments
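  5. Escape variables properly: Inside a script block, use \$variable for shell variables resolved at run time and ${variable} for Nextflow (Groovy) variables interpolated before the script runs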
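
The escaping rule in the last item can be sketched as follows. The process name ESCAPE_DEMO is hypothetical, and HOSTNAME is used only as an example of a variable that Bash sets on its own.

```nextflow
// Hypothetical process illustrating variable escaping
process ESCAPE_DEMO {
    input:
    val sample_id

    output:
    stdout

    script:
    // sample_id is filled in by Nextflow before the script is written out;
    // the escaped HOSTNAME reference reaches Bash unchanged and is resolved at run time.
    """
    echo "Sample ${sample_id} processed on \$HOSTNAME"
    """
}

workflow {
    ESCAPE_DEMO(Channel.of('sampleA', 'sampleB'))
    ESCAPE_DEMO.out.view()
}
```

Forgetting the backslash would make Nextflow look for a Groovy variable named HOSTNAME, which is not defined in the process, instead of leaving the lookup to the shell.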

Process Execution Flow

When a process runs:

  1. Channel items arrive
  2. when clause is checked (if present)
  3. Input files are staged to work directory
  4. Environment variables are set (if env inputs are declared)
  5. Script executes in work directory
  6. Output files are collected
  7. Files are published to publishDir (if specified)
  8. Outputs emitted to channels

Conclusion

Understanding process definitions is crucial for writing effective Nextflow workflows. Each component serves a specific purpose in making your pipeline reproducible, scalable, and maintainable.

In future posts, we’ll explore best practices for:

  • Channel factories and operators
  • Module composition
  • nf-test test composition
  • Subworkflow composition
  • Workflow composition
  • Advanced Nextflow patterns
  • Conda/mamba environment and containers
  • Nextflow configuration
  • Groovy reference for Nextflow workflow development

Happy workflow building! 🧬