gbcms dna
Count alleles at variant positions across one or more DNA/cfDNA BAM files.
Migrating from gbcms run
gbcms run is deprecated and hidden. Replace with gbcms dna — all arguments are identical. gbcms run will be removed in v4.1.0.
Synopsis
Required Arguments
| Option | Description |
|---|---|
| --variants, -v | VCF or MAF file with variant positions (.vcf, .vcf.gz, .vcf.bgz, or .maf). Unsupported extensions are rejected immediately. |
| --bam, -b | BAM file path (can repeat). Optionally prefix with name: for sample naming, e.g. --bam tumor:tumor.bam. If no name is given, the filename stem is used. |
| --bam-list, -L | File containing BAM paths (one per line, optionally sample_name path). Alternative to repeated --bam. |
| --fasta, -f | Reference FASTA file (with .fai index) |
| --lenient-bam | Skip missing --bam paths and continue with remaining samples (default: exit immediately on first missing BAM). Note: a missing --bam-list file always fails regardless. |
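The sample-naming rule for --bam and --bam-list can be sketched in a few lines of Python. This is an illustration of the documented behaviour only, not the tool's actual parser: the function names `parse_bam_spec` and `parse_bam_list_line` are hypothetical, and the handling of edge cases (multiple colons, paths containing spaces) is an assumption.

```python
from pathlib import Path


def parse_bam_spec(spec: str) -> tuple[str, Path]:
    """Split a --bam value into (sample_name, path).

    Assumption: the name is everything before the first ':'.
    The real parser may treat edge cases differently.
    """
    name, sep, path = spec.partition(":")
    if sep and name and path:
        return name, Path(path)
    # No usable name: prefix, so fall back to the filename stem.
    bam = Path(spec)
    return bam.stem, bam


def parse_bam_list_line(line: str) -> tuple[str, Path]:
    """Parse one --bam-list line: 'path' or 'sample_name path'.

    Assumption: fields are whitespace-separated, so paths that
    contain spaces are not handled by this sketch.
    """
    parts = line.split()
    if len(parts) == 2:
        return parts[0], Path(parts[1])
    bam = Path(line.strip())
    return bam.stem, bam
```

With this sketch, `tumor:tumor.bam` yields the sample name `tumor` explicitly, while a bare `/data/normal.bam` yields `normal` from the filename stem.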
Output Options
| Option | Default | Description |
|---|---|---|
| --output-dir, -o | required | Output directory |
| --format | vcf | Output format (vcf or maf) |
| --suffix | '' | Suffix for output filenames |
| --column-prefix | '' | Prefix for gbcms count columns in MAF output. Only letters, digits, and underscores ([A-Za-z0-9_]) are allowed; invalid characters exit immediately. |
| --preserve-barcode | false | Keep original Tumor_Sample_Barcode from input MAF. No-op (with warning) when input is not MAF. |
| --show-normalization | false | Append norm_* columns showing left-aligned coordinates |
| --context-padding | 5 | Minimum flanking bases for haplotype construction. Range 1–50, enforced at parse time. Auto-increased in repeat regions when --adaptive-context is enabled. |
| --adaptive-context | true | Dynamically increase context padding in tandem repeat regions |
| --threads | 1 | Number of threads |
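The --column-prefix character rule is easy to check before launching a long run. The sketch below mirrors the documented character class [A-Za-z0-9_]; the function name `validate_column_prefix` is hypothetical, and accepting the empty string is an assumption based on the documented default of ''.

```python
import re

# Character class taken from the option description above.
_PREFIX_RE = re.compile(r"[A-Za-z0-9_]*")


def validate_column_prefix(prefix: str) -> str:
    """Return the prefix unchanged if it only uses letters, digits, and underscores.

    Raises ValueError otherwise, which stands in for the CLI exiting immediately.
    """
    if not _PREFIX_RE.fullmatch(prefix):
        raise ValueError(f"invalid --column-prefix: {prefix!r}")
    return prefix
```

A value such as `gbcms_` passes, while `my-prefix` is rejected because of the hyphen.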
mFSD Options
Mutant Fragment Size Distribution (mFSD) analysis compares insert-size distributions for REF- vs ALT-classified fragments at each variant position, enabling detection of short-fragment enrichment associated with tumor-derived cfDNA (see mFSD Metrics).
| Option | Default | Description |
|---|---|---|
| --mfsd | false | Enable mFSD analysis. Adds 34 mFSD columns (KS test, LLR, mean sizes, pairwise comparisons, derived metrics) to MAF output and 7 MFSD_* INFO fields to VCF. |
| --mfsd-parquet | false | Write a companion <sample>.fsd.parquet with per-variant raw fragment size arrays (ref_sizes, alt_sizes). Enables downstream visualizations. Requires --mfsd. |
Tip
To generate both summary statistics and raw Parquet data in one run:
gbcms dna --mfsd --mfsd-parquet --format maf \
--variants variants.maf --bam tumor:tumor.bam --fasta hg19.fa -o ./results
This produces <sample>.maf (with 34 mFSD columns) and <sample>.fsd.parquet (raw fragment sizes for visualization).
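To make the REF-vs-ALT comparison concrete, here is a minimal pure-Python sketch of a two-sample Kolmogorov-Smirnov statistic computed on two fragment-size arrays such as ref_sizes and alt_sizes. It illustrates the general technique only; how gbcms computes its KS columns is not shown on this page, and the sizes below are made-up illustrative values, not real data.

```python
from bisect import bisect_right


def ks_statistic(ref_sizes, alt_sizes):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref = sorted(ref_sizes)
    alt = sorted(alt_sizes)
    if not ref or not alt:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # The maximum gap between two step functions occurs at an observed value.
    for x in set(ref) | set(alt):
        f_ref = bisect_right(ref, x) / len(ref)
        f_alt = bisect_right(alt, x) / len(alt)
        d = max(d, abs(f_ref - f_alt))
    return d


# Illustrative insert sizes (bp): ALT fragments skew shorter than REF.
ref_sizes = [166, 170, 175, 180]
alt_sizes = [140, 145, 150, 168]
print(ks_statistic(ref_sizes, alt_sizes))  # 0.75
```

A statistic near 0 means the two size distributions overlap; a value near 1 means they are almost fully separated.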
Filtering Options
| Option | Default | Description |
|---|---|---|
| --min-mapq | 20 | Minimum MAPQ |
| --min-baseq | 20 | Minimum BASEQ |
| --filter-duplicates | true | Filter duplicate reads |
| --filter-secondary | true | Filter secondary alignments |
| --filter-supplementary | true | Filter supplementary alignments |
| --filter-qc-failed | true | Filter QC failed reads |
| --filter-improper-pair | false | Filter improperly paired reads |
| --filter-indel | false | Filter reads with indels |
| --fragment-qual-threshold | 10 | Quality difference threshold for fragment consensus (see Fragment Counting) |
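The read-level filters map naturally onto SAM FLAG bits. The sketch below shows how the default settings could be applied to a read's FLAG and MAPQ. The bit values come from the SAM specification; the function name `passes_default_filters`, the treatment of --min-mapq as an inclusive lower bound, and the omission of --min-baseq and --filter-indel are assumptions of this sketch rather than documented behaviour.

```python
# SAM FLAG bits, as defined in the SAM specification.
FLAG_PROPER_PAIR = 0x2
FLAG_SECONDARY = 0x100
FLAG_QC_FAIL = 0x200
FLAG_DUPLICATE = 0x400
FLAG_SUPPLEMENTARY = 0x800

# Bits rejected by the filters that default to true.
DEFAULT_REJECT = FLAG_DUPLICATE | FLAG_SECONDARY | FLAG_SUPPLEMENTARY | FLAG_QC_FAIL


def passes_default_filters(flag: int, mapq: int, min_mapq: int = 20,
                           filter_improper_pair: bool = False) -> bool:
    """Return True if a read survives the default read-level filters."""
    # Assumption: reads with MAPQ equal to the minimum are kept.
    if mapq < min_mapq:
        return False
    if flag & DEFAULT_REJECT:
        return False
    # Off by default, matching --filter-improper-pair.
    if filter_improper_pair and not flag & FLAG_PROPER_PAIR:
        return False
    return True
```

For example, a read with FLAG 99 and MAPQ 60 passes, while the same read marked as a duplicate (FLAG 1123) does not.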
BAQ Options
Base Alignment Quality (BAQ) heuristically downgrades base qualities near indels to prevent systematic errors from realignment artifacts.
| Option | Default | Description |
|---|---|---|
| --apply-baq/--no-baq | off | Enable BAQ quality downgrade near indels |
When to Enable BAQ
Most modern pipelines (BQSR, fgbio consensus) already recalibrate base qualities. Enable BAQ only for legacy BAMs lacking quality recalibration, where bases near indels may have inflated quality scores that lead to false-positive allele calls.
UMI Options
Unique Molecular Identifier (UMI) support for molecule-level deduplication.
| Option | Default | Description |
|---|---|---|
| --umi-tag | (none) | BAM tag for UMI barcode (e.g. RX). Enables UMI-aware fragment grouping. |
UMI-Aware Fragment Counting
When --umi-tag is set, two reads are considered the same fragment only if they share both QNAME and UMI barcode. This prevents UMI-collapsed reads from different original molecules being incorrectly merged into a single fragment, which would deflate fragment-level allele counts.
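The grouping rule above amounts to changing the fragment key from QNAME alone to the pair (QNAME, UMI). A minimal sketch follows, assuming reads are represented as plain dictionaries with `qname` and `tags` entries; this representation and the function name `group_fragments` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def group_fragments(reads, umi_tag=None):
    """Group reads into fragments keyed by QNAME, or by (QNAME, UMI) when umi_tag is set."""
    fragments = defaultdict(list)
    for read in reads:
        # Without a UMI tag every read of a QNAME lands in the same fragment.
        umi = read["tags"].get(umi_tag) if umi_tag else None
        fragments[(read["qname"], umi)].append(read)
    return dict(fragments)


reads = [
    {"qname": "frag1", "tags": {"RX": "AAC"}},
    {"qname": "frag1", "tags": {"RX": "AAC"}},
    {"qname": "frag1", "tags": {"RX": "GGT"}},
]
print(len(group_fragments(reads)))        # 1
print(len(group_fragments(reads, "RX")))  # 2
```

With the tag set, the read carrying a different barcode is counted as its own fragment instead of being merged.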
Debugging Options
| Option | Default | Description |
|---|---|---|
| --verbose, -V | false | Enable verbose debug logging |
| --trace, -T | false | Enable per-read Rust trace logging (slow). Implies --verbose. Shows detailed per-read classification diagnostics. |
Alignment Backend
Phase 3 (haplotype-based) classification uses a two-stage pipeline:
- WFA fast-path (Wavefront Alignment, wfa2lib-rs): edit-distance triage against the pangenomic haplotype matrix. Resolves ~70-80% of reads instantly at O(s²) cost, where s is the edit distance. If REF and ALT scores differ clearly, the read is classified immediately.
- Marginalized PairHMM (escalated only when WFA is ambiguous): integrates per-base quality probabilities into alignment scoring, producing a log-likelihood ratio (LLR) confidence score. More sensitive in noisy, low-quality, or repeat-dense regions.
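The control flow of the two stages can be sketched as follows. This is a schematic of the escalation logic only, not the Rust implementation: the `min_margin` parameter (what counts as "differ clearly") is a hypothetical placeholder, the convention that a positive LLR favours ALT is an assumption, and the PairHMM itself is passed in as a callable so that it runs only when the fast path is inconclusive.

```python
from typing import Callable


def classify_read(ref_edits: int, alt_edits: int,
                  pairhmm_llr: Callable[[], float],
                  min_margin: int = 2,          # hypothetical "clear difference" margin
                  llr_threshold: float = 2.3) -> str:
    """Classify a read as REF, ALT, or AMBIGUOUS using fast triage, then escalation."""
    # Stage 1: edit-distance triage. A lower edit distance means a better match.
    if abs(ref_edits - alt_edits) >= min_margin:
        return "REF" if ref_edits < alt_edits else "ALT"
    # Stage 2: escalate to the more expensive likelihood model.
    llr = pairhmm_llr()  # assumption: positive values favour ALT
    if llr >= llr_threshold:
        return "ALT"
    if llr <= -llr_threshold:
        return "REF"
    return "AMBIGUOUS"
```

Passing the PairHMM as a callable keeps the expensive step lazy, which reflects the design goal described above: most reads never reach the second stage.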
| Option | Default | Description |
|---|---|---|
| --alignment-backend | pairhmm | Phase 3 backend: pairhmm (WFA + PairHMM, default) or sw (Smith-Waterman only, no WFA triage). Invalid values are rejected at parse time. |
| --llr-threshold | 2.3 | PairHMM LLR threshold for confident calls (≈ ln(10)) |
| --gap-open-prob | 1e-4 | PairHMM gap-open probability for non-repeat regions |
| --gap-extend-prob | 0.1 | PairHMM gap-extend probability for non-repeat regions |
| --repeat-gap-open-prob | 1e-2 | PairHMM gap-open probability for tandem repeat regions |
| --repeat-gap-extend-prob | 0.5 | PairHMM gap-extend probability for tandem repeat regions |
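To interpret the default --llr-threshold, it helps to convert a log-likelihood ratio back into odds. The sketch below assumes the LLR is a natural-log ratio (consistent with the ≈ ln(10) note in the table) and, for the posterior, equal prior probability for the two alleles; both helper names are hypothetical.

```python
import math


def llr_to_odds(llr: float) -> float:
    """Convert a natural-log likelihood ratio into a likelihood ratio (odds)."""
    return math.exp(llr)


def llr_to_posterior(llr: float) -> float:
    """Posterior probability of the favoured allele, assuming equal priors."""
    odds = math.exp(llr)
    return odds / (1.0 + odds)


print(round(math.log(10), 3))          # 2.303
print(round(llr_to_odds(2.3), 2))      # 9.97
print(round(llr_to_posterior(2.3), 3)) # 0.909
```

Under these assumptions, the default threshold requires one allele to be roughly ten times more likely than the other before a read is called confidently.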
pairhmm vs sw
pairhmm (default) uses WFA edit-distance triage first, then escalates to PairHMM only for ambiguous reads. This is faster than running PairHMM on every read and more accurate in low-quality or repeat-dense regions.
sw runs Smith-Waterman on every Phase 3 read (no WFA pre-filter). Use only if you need exact reproducibility with versions <3.0.0.
Examples
gbcms dna \
--variants mutations.vcf \
--bam sample:sample.bam \
--fasta reference.fa \
--output-dir results/
- The sample: prefix sets the output filename. Without it, the BAM filename stem is used.
gbcms dna \
--variants mutations.maf \
--bam tumor:tumor.bam \
--bam normal:normal.bam \
--fasta reference.fa \
--format maf \
--output-dir results/
- MAF output preserves all input MAF columns and appends gbcms count columns.
Output
See Output Formats for a complete column-level schema reference covering:
- VCF output: ##INFO fields, ##FORMAT fields, and annotated examples
- MAF output: VCF→MAF vs MAF→MAF column sets, Tumor_Sample_Barcode behaviour, and column prefix options
- mFSD columns (with --mfsd): all 34 mFSD fields
- Normalization columns (with --show-normalization)
Related
- Quick Start — Common patterns
- gbcms rna — RNA-seq counting with transcriptome-aware filtering
- gbcms normalize — Standalone normalization (no counting)
- Nextflow Pipeline — For many samples
- Input Formats — VCF/MAF specs
- Output Formats — Complete column-level output reference
- Variant Counting — How each variant type is counted