Nextflow Workflow Guide¶

This guide covers running py-gbcms as a Nextflow workflow for processing multiple samples in parallel, particularly on HPC clusters.

Overview¶

The Nextflow workflow provides: - Automatic parallelization across samples - SLURM/HPC integration with resource management - Containerization with Docker/Singularity - Resume capability for failed runs - Reproducible pipelines

Prerequisites¶

Nextflow >= 21.10.3
One of:
Docker (for local)
Singularity (for HPC)

Install Nextflow:

curl -s https://get.nextflow.io | bash
mv nextflow ~/bin/  # or any directory in your PATH

Quick Start¶

1. Prepare Samplesheet¶

Create a CSV file with your samples:

sample,bam,bai
sample1,/path/to/sample1.bam,/path/to/sample1.bam.bai
sample2,/path/to/sample2.bam,
sample3,/path/to/sample3.bam,/path/to/sample3.bam.bai

Or with per-sample suffix (for multiple BAM types):

sample,bam,bai,suffix
sample1,/path/to/sample1.duplex.bam,,-duplex
sample1,/path/to/sample1.simplex.bam,,-simplex
sample1,/path/to/sample1.unfiltered.bam,,-unfiltered
sample2,/path/to/sample2.bam,,

Notes: - bai column is optional - will auto-discover <bam>.bai if not provided - suffix column is optional - per-row suffix overrides global --suffix parameter - BAI files must exist or workflow will fail early with clear error

2. Run the Workflow¶

Local with Docker:

nextflow run nextflow/main.nf \
    --input samplesheet.csv \
    --variants variants.vcf \
    --fasta reference.fa \
    --outdir results \
    -profile docker

SLURM cluster with Singularity:

nextflow run nextflow/main.nf \
    --input samplesheet.csv \
    --variants variants.vcf \
    --fasta reference.fa \
    --outdir results \
    -profile slurm

Parameters¶

Required¶

Parameter	Description
`--input`	Path to samplesheet CSV
`--variants`	Path to VCF/MAF variants file
`--fasta`	Reference FASTA (with .fai index)

Output Options¶

Parameter	Default	Description
`--outdir`	`results`	Output directory
`--format`	`vcf`	Output format (`vcf` or `maf`)
`--suffix`	`''`	Suffix for output filenames

Filtering Options¶

Parameter	Default	Description
`--min_mapq`	`20`	Minimum mapping quality
`--min_baseq`	`0`	Minimum base quality
`--filter_duplicates`	`true`	Filter duplicate reads
`--filter_secondary`	`false`	Filter secondary alignments
`--filter_supplementary`	`false`	Filter supplementary alignments
`--filter_qc_failed`	`false`	Filter QC failed reads
`--filter_improper_pair`	`false`	Filter improperly paired reads
`--filter_indel`	`false`	Filter reads with indels

Resource Limits¶

Parameter	Default	Description
`--max_cpus`	`16`	Maximum CPUs per job
`--max_memory`	`128.GB`	Maximum memory per job
`--max_time`	`240.h`	Maximum runtime per job

Execution Profiles¶

Docker (Local)¶

-profile docker

- Uses Docker containers - Best for local development - Requires Docker installed

Singularity (HPC)¶

-profile singularity

- Uses Singularity images - Best for HPC without SLURM - Requires Singularity installed

SLURM (HPC Cluster)¶

-profile slurm

- Submits jobs to SLURM - Uses Singularity containers - Queue: cmobic_cpu (customizable)

Customizing for Your Cluster¶

Edit nextflow/nextflow.config to customize the SLURM profile:

slurm {
    process.executor       = 'slurm'
    process.queue          = 'your_queue_name'  // Change this
    process.clusterOptions = '--account=your_account'  // Add if needed
    singularity.enabled    = true
    singularity.autoMounts = true
}

Common customizations:

process {
    withName: GBCMS_RUN {
        cpus   = 8          // CPUs per sample
        memory = 16.GB      // Memory per sample
        time   = 6.h        // Time limit per sample
    }
}

Output Structure¶

Results are organized in ${outdir}/:

results/
├── gbcms/
│   ├── sample1.vcf        # Or .maf
│   ├── sample2.vcf
│   └── sample3.vcf
└── pipeline_info/
    ├── execution_report.html
    ├── execution_timeline.html
    └── execution_trace.txt

Advanced Usage¶

Resume Failed Runs¶

Nextflow caches completed tasks. Resume from where it failed:

nextflow run nextflow/main.nf \
    --input samplesheet.csv \
    --variants variants.vcf \
    --fasta reference.fa \
    -profile slurm \
    -resume

Custom Suffix¶

Add suffix to output filenames:

--suffix .genotyped
# Output: sample1.genotyped.vcf

MAF Output¶

Generate MAF instead of VCF:

--format maf
# Output: sample1.maf

Strict Filtering¶

Enable all filters for high-quality genotyping:

nextflow run nextflow/main.nf \
    --input samplesheet.csv \
    --variants variants.vcf \
    --fasta reference.fa \
    --filter_duplicates true \
    --filter_secondary true \
    --filter_supplementary true \
    --filter_qc_failed true \
    -profile slurm

Monitoring¶

View Running Jobs¶

# SLURM
squeue -u $USER

# Nextflow
nextflow log

Check Progress¶

Nextflow prints real-time progress:

[c3/a1b2c3] GBCMS_RUN (sample1) [100%] 10 of 10 ✔

Execution Report¶

After completion, view the HTML report:

open results/pipeline_info/execution_report.html

Troubleshooting¶

Job Failed with Error¶

Check the work directory in error message:

cat work/c3/a1b2c3/.command.log

Out of Memory¶

Increase memory in config:

process {
    withName: GBCMS_RUN {
        memory = 32.GB
    }
}

Wrong Queue¶

Update queue name in nextflow/nextflow.config:

process.queue = 'your_queue_name'

Missing Container¶

Pull the container manually:

# Singularity
singularity pull docker://ghcr.io/msk-access/py-gbcms:2.0.0

# Docker
docker pull ghcr.io/msk-access/py-gbcms:2.0.0

Comparison with CLI¶

Feature	CLI	Nextflow
Multiple samples	Sequential	Parallel
Resource management	Manual	Automatic
Retry failed jobs	Manual	Automatic
HPC integration	Manual scripts	Built-in
Resume capability	No	Yes

When to use CLI instead: See Usage Patterns

Next Steps¶

See Usage Patterns for comparison with CLI usage
See nextflow/README.md for additional workflow documentation