gbcms dna¶

Count alleles at variant positions across one or more DNA/cfDNA BAM/CRAM files.

Migrated from gbcms run

The deprecated gbcms run command was removed in v4.1.0. Use gbcms dna instead — all arguments are identical.

Synopsis¶

gbcms dna [OPTIONS] --variants <FILE> --bam <NAME:PATH>... --fasta <FILE>

Required Arguments¶

Option	Description
`--variants`, `-v`	VCF or MAF file with variant positions (`.vcf`, `.vcf.gz`, `.vcf.bgz`, or `.maf`). Unsupported extensions are rejected immediately.
`--bam`, `-b`	BAM or CRAM file path (can repeat). Optionally prefix with `name:` for sample naming, e.g. `--bam tumor:tumor.bam`. If no name given, the filename stem is used.
`--bam-list`, `-L`	File containing BAM/CRAM paths (one per line, optionally `sample_name path`). Alternative to repeated `--bam`.
`--fasta`, `-f`	Reference FASTA file (with .fai index). Required for all runs; also used for CRAM decoding.
`--lenient-bam`	Skip missing `--bam` paths and continue with remaining samples (default: exit immediately on first missing BAM/CRAM). Note: a missing `--bam-list` file always fails regardless.

CRAM Support (v5.3.0+)

--bam accepts both BAM (.bam) and CRAM (.cram) files transparently. The --fasta reference is automatically used for CRAM decoding at both the Rust engine and Python header-validation layers. No additional flags are needed.

Index discovery is automatic: gbcms searches for .bam.bai / .bai (BAM) or .cram.crai / .crai (CRAM) next to the alignment file. You can also provide the index path explicitly via the Nextflow samplesheet bai column.

Output Options¶

Option	Default	Description
`--output-dir`, `-o`	required	Output directory
`--format`	`vcf`	Output format (`vcf` or `maf`)
`--suffix`	`''`	Suffix for output filenames
`--column-prefix`	`''`	Prefix for gbcms count columns in MAF output. Only letters, digits, and underscores (`[A-Za-z0-9_]`) are allowed; invalid characters exit immediately.
`--preserve-barcode`	`false`	Keep original Tumor_Sample_Barcode from input MAF. No-op (with warning) when input is not MAF.
`--show-normalization`	`false`	Append `norm_*` columns showing left-aligned coordinates
`--context-padding`	`5`	Minimum flanking bases for haplotype construction. Range 1–50, enforced at parse time. Auto-increased in repeat regions when `--adaptive-context` is enabled.
`--adaptive-context`	`true`	Dynamically increase context padding in tandem repeat regions
`--threads`	`1`	Total worker-thread budget for this run (minimum 1; `0` is rejected). All parallelism — read classification and any decode threads — stays within this budget, so gbcms never oversubscribes a small allocation. Multi-sample parallelism is the orchestrator's job (e.g. N concurrent Nextflow processes, each pinned to `task.cpus`), not this flag's.

mFSD Options¶

Mutant Fragment Size Distribution (mFSD) analysis compares insert-size distributions for REF- vs ALT-classified fragments at each variant position, enabling detection of short-fragment enrichment associated with tumor-derived cfDNA (see mFSD Metrics).

Option	Default	Description
`--mfsd`	`false`	Enable mFSD analysis. Adds 41 mFSD columns (KS test, LLR, mean sizes, pairwise comparisons, derived metrics) to MAF output and 13 `MFSD_*` INFO fields to VCF.
`--mfsd-parquet`	`false`	Write a companion `<sample>.fsd.parquet` with per-variant raw fragment size arrays (`ref_sizes`, `alt_sizes`). Enables downstream visualizations. Requires `--mfsd`.
`--mfsd-report`	`false`	Generate an interactive HTML report (`<sample>.mfsd_report.html`) with per-variant fragment size distributions, dual-axis histograms, and Fragment Origin Signal classification. Implies `--mfsd` and `--mfsd-parquet` (both are auto-enabled). See mFSD Report.
`--mfsd-report-min-alt`	`3`	Minimum ALT fragment count to include a variant in the HTML report.
`--mfsd-report-max-variants`	`20`	Maximum variants in the HTML report (selected by highest ALT count). Use `-1` for no limit.

Tip

To generate summary statistics, raw Parquet data, and an interactive HTML report:

gbcms dna --mfsd-report --format maf \
    --variants variants.maf --bam tumor:tumor.bam --fasta hg19.fa -o ./results

This produces <sample>.maf (with 41 mFSD columns), <sample>.fsd.parquet (raw fragment sizes), and <sample>.mfsd_report.html (interactive visualization). Using --mfsd-report auto-enables --mfsd and --mfsd-parquet.

Filtering Options¶

Option	Default	Description
`--min-mapq`	`20`	Minimum MAPQ
`--min-baseq`	`20`	Minimum BASEQ
`--filter-duplicates`	`true`	Filter duplicate reads
`--filter-secondary`	`true`	Filter secondary alignments
`--filter-supplementary`	`true`	Filter supplementary alignments
`--filter-qc-failed`	`true`	Filter QC failed reads
`--filter-improper-pair`	`false`	Filter improperly paired reads
`--filter-indel`	`false`	Filter reads with indels
`--fragment-qual-threshold`	`10`	Quality difference threshold for fragment consensus (see Fragment Counting)

BAQ Options¶

Base Alignment Quality (BAQ) heuristically downgrades base qualities near indels and splice junctions to prevent systematic errors from realignment artifacts.

Option	Default	Description
`--apply-baq/--no-baq`	`off`	Enable BAQ quality downgrade near indels and splice junctions

Internal constants (not configurable — derived from Li 2011):

Constant	Value	Description
`BAQ_RADIUS`	5 bp	Bases within 5 bp of an indel/splice boundary are penalized
`BAQ_PENALTY`	20	Subtracted from BQ (clamped to 0)

When to Enable BAQ for DNA

Enable --apply-baq when your upstream pipeline does not include BQSR, fgbio consensus, or any other base quality recalibration. In these "raw BQ" pipelines, bases near indels may retain inflated sequencer-assigned quality scores. BAQ compensates by penalizing BQ within 5 bp of indel boundaries.

Do Not Enable for Pre-Calibrated BAMs

If your DNA pipeline includes BQSR or consensus calling (e.g., fgbio, Marianas), leave BAQ off (the default). Applying BAQ on already-recalibrated BAMs may over-penalize, causing undercounting of legitimate variant evidence.

RNA Mode Default

In gbcms rna, BAQ is on by default because RNA pipelines typically lack BQSR. BAQ also penalizes bases near splice junctions (CIGAR N). See RNA Splice-Junction Handling.

UMI Options¶

Unique Molecular Identifier (UMI) support for molecule-level deduplication.

Option	Default	Description
`--umi-tag`	(none)	BAM tag for UMI barcode (e.g. `RX`). Enables UMI-aware fragment grouping.

UMI-Aware Fragment Counting

When --umi-tag is set, two reads are considered the same fragment only if they share both QNAME and UMI barcode. This prevents UMI-collapsed reads from different original molecules being incorrectly merged into a single fragment, which would deflate fragment-level allele counts.

gbcms dna --umi-tag RX \
    --variants variants.vcf --bam sample.bam --fasta ref.fa -o results/

MNP Rescue Options¶

The rescue pass recovers alt_count for MNP (multi-nucleotide polymorphism) variants where the full block match yields ad=0, by decomposing the MNP into individual SNP positions and re-counting each one independently. See Architecture → MNP Rescue Pass for the full design rationale.

Option	Default	Description
`--rescue-mnp`	`false`	Enable MNP rescue pass. When `ad=0` and the variant is `MNP_RESCUE_ELIGIBLE`, decomposes the MNP into individual SNP positions and re-counts via the Rust engine.
`--rescue-mnp-threshold`	`1.0`	Maximum discriminating/length ratio for MNP rescue eligibility (0.0–1.0). `1.0` = all MNPs are eligible (C++ gbcms compatible, default). `0.5` = conservative sparse-only mode (≤50% discriminating positions). `0.0` = disable rescue eligibility (MNP_DISC_RATIO diagnostics are still emitted). Only used when `--rescue-mnp` is enabled.

Diagnostic Flags

When rescue is enabled, two diagnostic flags are emitted for every MNP variant:

MNP_DISC_RATIO(n/m) — Always emitted. Shows the ratio of discriminating positions to total MNP length.
MNP_RESCUE_ELIGIBLE — Emitted only when disc/len ≤ --rescue-mnp-threshold. Marks the variant as a rescue candidate.

Choosing the threshold

# Permissive (default) — rescue all MNPs, matching C++ gbcms behavior
gbcms dna --rescue-mnp --variants input.maf --bam sample:sample.bam --fasta ref.fa -o out/

# Conservative — rescue only sparse MNPs (≤50% discriminating positions)
gbcms dna --rescue-mnp --rescue-mnp-threshold 0.5 --variants input.maf --bam sample:sample.bam --fasta ref.fa -o out/

Debugging Options¶

Option	Default	Description
`--verbose`, `-V`	`false`	Enable verbose debug logging
`--trace`, `-T`	`false`	Enable per-read Rust trace logging (slow). Implies `--verbose`. Shows detailed per-read classification diagnostics.

Alignment Backend¶

Phase 3 (haplotype-based) classification uses a two-stage pipeline:

WFA fast-path (Wavefront Alignment, wfa2lib-rs) — edit-distance triage against the pangenomic haplotype matrix. Resolves ~70-80% of reads instantly at O(s²) cost where s is the edit distance. If REF and ALT scores differ clearly, the read is classified immediately.
Marginalized PairHMM (escalated only when WFA is ambiguous) — integrates per-base quality probabilities into alignment scoring, producing a log-likelihood ratio (LLR) confidence score. More sensitive in noisy, low-quality, or repeat-dense regions.

Option	Default	Description
`--alignment-backend`	`pairhmm`	Phase 3 backend: `pairhmm` (WFA + PairHMM, default) or `sw` (Smith-Waterman only, no WFA triage). Invalid values are rejected at parse time.
`--llr-threshold`	`2.3`	PairHMM LLR threshold for confident calls (≈ ln(10))
`--gap-open-prob`	`1e-4`	PairHMM gap-open probability for non-repeat regions
`--gap-extend-prob`	`0.1`	PairHMM gap-extend probability for non-repeat regions
`--repeat-gap-open-prob`	`1e-2`	PairHMM gap-open probability for tandem repeat regions
`--repeat-gap-extend-prob`	`0.5`	PairHMM gap-extend probability for tandem repeat regions

pairhmm vs sw

pairhmm (default) uses WFA edit-distance triage first, then escalates to PairHMM only for ambiguous reads. This is faster than running PairHMM on every read and more accurate in low-quality or repeat-dense regions.

sw runs Smith-Waterman on every Phase 3 read (no WFA pre-filter). Use only if you need exact reproducibility with versions <3.0.0.

Examples¶

Single BAMMultiple BAMsWith FilteringWith NormalizationWith UMI Tags

gbcms dna \
    --variants mutations.vcf \
    --bam sample:sample.bam \ # (1)!
    --fasta reference.fa \
    --output-dir results/

The sample: prefix sets the output filename. Without it, the BAM filename stem is used.

gbcms dna \
    --variants mutations.maf \
    --bam tumor:tumor.bam \
    --bam normal:normal.bam \
    --fasta reference.fa \
    --format maf # (1)!

MAF output preserves all input MAF columns and appends gbcms count columns.

gbcms dna \
    --variants mutations.vcf \
    --bam sample:sample.bam \
    --fasta reference.fa \
    --filter-duplicates \
    --min-mapq 30

gbcms dna \
    --variants mutations.maf \
    --bam sample:sample.bam \
    --fasta reference.fa \
    --format maf \
    --show-normalization # (1)!

Appends norm_chrom, norm_pos, norm_ref, norm_alt columns showing left-aligned coordinates.

gbcms dna \
    --variants mutations.vcf \
    --bam sample:umi_deduped.bam \
    --fasta reference.fa \
    --umi-tag RX \ # (1)!
    --output-dir results/

The RX tag is the standard SAM tag for UMI barcodes (fgbio, gencore).

Output¶

See Output Formats for a complete column-level schema reference covering:

VCF output: ##INFO fields, ##FORMAT fields, and annotated examples
MAF output: VCF→MAF vs MAF→MAF column sets, Tumor_Sample_Barcode behaviour, and column prefix options
mFSD columns (with --mfsd): all 41 mFSD fields
Normalization columns (with --show-normalization)

Quick Start — Common patterns
gbcms rna — RNA-seq counting with transcriptome-aware filtering
gbcms normalize — Standalone normalization (no counting)
Nextflow Pipeline — For many samples
Input Formats — VCF/MAF specs
Output Formats — Complete column-level output reference
Variant Counting — How each variant type is counted