`gbcms merge`¶

Merge per-BAM-type genotyped MAFs (e.g., duplex, simplex) into a single output MAF with type-prefixed count columns.

Polars Batch Engine

This command uses Polars for lazy-evaluated, memory-efficient joins and aggregation. It operates on the complete file (batch mode), unlike the streaming per-variant counting in gbcms dna/rna.

Quick Start¶

# Merge duplex + simplex MAFs with combined columns
gbcms merge \
    --input duplex:sample1-duplex.maf \
    --input simplex:sample1-simplex.maf \
    --output sample1.merged.maf

# Merge 3 BAM types, no combined columns
gbcms merge \
    --input duplex:sample1-duplex.maf \
    --input simplex:sample1-simplex.maf \
    --input standard:sample1-unfiltered.maf \
    --output sample1.merged.maf \
    --no-combined

Arguments¶

Required¶

Flag	Type	Description
`--input`	`TYPE:PATH`	BAM type label and path to its genotyped MAF. Repeat for each input (minimum 2). Labels become column prefixes.
`--output`	`PATH`	Output path for the merged MAF.
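
As a rough illustration of the `TYPE:PATH` convention, the sketch below splits each `--input` value on its first colon and enforces the two-input minimum. The function names are hypothetical, and rejecting duplicate labels is an assumption rather than documented gbcms behavior.

```python
from pathlib import Path


def parse_input_spec(spec: str) -> tuple[str, Path]:
    """Split a 'TYPE:PATH' string into (label, path) on the first colon."""
    label, sep, path = spec.partition(":")
    if not sep or not label or not path:
        raise ValueError(f"expected TYPE:PATH, got {spec!r}")
    return label, Path(path)


def parse_inputs(specs: list[str]) -> dict[str, Path]:
    """Parse repeated --input values into an ordered label -> path mapping."""
    if len(specs) < 2:
        raise ValueError("at least two --input values are required")
    inputs: dict[str, Path] = {}
    for spec in specs:
        label, path = parse_input_spec(spec)
        if label in inputs:
            # Assumption: labels become column prefixes, so they must be unique.
            raise ValueError(f"duplicate label: {label}")
        inputs[label] = path
    return inputs
```

Because only the first colon is treated as the separator, any later colon stays part of the path.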

Optional¶

Flag	Default	Description
`--no-combined`	`false`	Skip computing `simplex_duplex_*` combined columns.
`--legacy-naming`	`false`	Use `t_{metric}_{type}` column naming for `genotype_variants` compatibility.
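
The renaming implied by `--legacy-naming` might look like the sketch below. It assumes that `{type}` is the input label (or `simplex_duplex` for combined columns) and `{metric}` is the rest of the column name; the helper `legacy_name` is hypothetical and the exact mapping used by gbcms may differ.

```python
def legacy_name(column: str, labels: list[str]) -> str:
    """Map '{type}_{metric}' to 't_{metric}_{type}'; leave other columns alone."""
    # Try longer labels first so 'simplex_duplex' wins over 'simplex'.
    for label in sorted(labels, key=len, reverse=True):
        prefix = label + "_"
        if column.startswith(prefix):
            metric = column[len(prefix):]
            return f"t_{metric}_{label}"
    # Not a type-prefixed count column (e.g. a variant key column).
    return column
```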

How It Works¶

flowchart LR
    subgraph Inputs
        D["duplex.maf"]
        S["simplex.maf"]
    end

    subgraph Engine ["Merge Engine (Polars)"]
        Scan["Lazy Scan"] --> Prefix["Auto-prefix\ncount columns"]
        Prefix --> Join["Outer Join\non variant key"]
        Join --> Fill["Null → 0\n(missing counts)"]
        Fill --> Combine["Additive\ncombined columns"]
        Combine --> SB["Fisher Strand Bias\n(Rust)"]
    end

    subgraph Output
        M["merged.maf"]
    end

    D --> Scan
    S --> Scan
    SB --> M


Step-by-Step¶

1. Scan: Each input MAF is lazily scanned with all columns read as strings.
2. Prefix Detection: Columns are checked for existing type prefixes; unprefixed gbcms columns are renamed (e.g., ref_count → duplex_ref_count).
3. Outer Join: Progressive full outer join on the 5-column variant key: Chromosome, Start_Position, End_Position, Reference_Allele, Tumor_Seq_Allele2.
4. Null Fill: Missing count values are filled with "0" and missing metadata values with "" (empty string).
5. Combined Columns: If both simplex and duplex are present and --no-combined is not set:
   - Phase 1: Additive sums (12 columns: read + fragment + strand counts)
   - Phase 2a: Derived totals (total_count, total_count_fragment)
   - Phase 2b: Derived VAFs (vaf, vaf_fragment)
   - Phase 3: Fisher's exact test for strand bias (read + fragment level, via Rust)
6. Write: The materialized DataFrame is written as a tab-separated MAF.
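
The prefix, join, and null-fill steps above can be sketched in plain Python. This is an illustration of the logic only, not the Polars implementation: `COUNT_COLUMNS` is a hypothetical two-column subset, and rows are plain dictionaries of strings.

```python
KEY = ("Chromosome", "Start_Position", "End_Position",
       "Reference_Allele", "Tumor_Seq_Allele2")

# Hypothetical subset of gbcms count columns, for illustration only.
COUNT_COLUMNS = {"ref_count", "alt_count"}


def prefix_row(row: dict[str, str], label: str) -> dict[str, str]:
    """Rename unprefixed count columns to '{label}_{column}'."""
    return {(f"{label}_{k}" if k in COUNT_COLUMNS else k): v
            for k, v in row.items()}


def outer_join(tables: dict[str, list[dict[str, str]]]) -> list[dict[str, str]]:
    """Full outer join of all inputs on the 5-column variant key."""
    merged: dict[tuple, dict[str, str]] = {}
    all_cols: list[str] = []
    count_cols: set[str] = set()
    for label, rows in tables.items():
        for raw in rows:
            row = prefix_row(raw, label)
            key = tuple(row[k] for k in KEY)
            merged.setdefault(key, {}).update(row)
            for col in row:
                if col not in all_cols:
                    all_cols.append(col)
                if col.startswith(label + "_"):
                    count_cols.add(col)
    # Null fill: missing counts become "0", missing metadata becomes "".
    for row in merged.values():
        for col in all_cols:
            row.setdefault(col, "0" if col in count_cols else "")
    return list(merged.values())
```

A variant present in only one input still appears in the output, with zero counts under the other input's prefix.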

Provenance Comment Lines (v5.3.0+)

Starting in v5.3.0, gbcms MAF output includes #gbcms and #command comment lines before the TSV header. The merge engine handles these automatically — Polars scan_csv is configured with comment_prefix="#", which skips all #-prefixed lines. No manual pre-processing is needed.
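
If you read gbcms MAF output with your own tooling rather than through `gbcms merge`, the same skipping behavior is easy to reproduce. The sketch below uses only the Python standard library; `read_maf` is a hypothetical helper, and it assumes no data row starts with `#`.

```python
import csv
import io


def read_maf(handle) -> list[dict[str, str]]:
    """Read a tab-separated MAF, skipping '#'-prefixed provenance lines."""
    data_lines = (line for line in handle if not line.startswith("#"))
    return list(csv.DictReader(data_lines, delimiter="\t"))


# Usage with an in-memory file; the comment line contents are made up.
example = io.StringIO(
    "#gbcms example-version\n"
    "#command example-command\n"
    "Chromosome\tStart_Position\tref_count\n"
    "1\t100\t10\n"
)
rows = read_maf(example)
```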

Combined Columns¶

When both simplex and duplex inputs are present, 20 combined simplex_duplex_* columns are computed (assuming all strand-level counts are present in the input):

Phase	Column	Formula
Sum	`simplex_duplex_ref_count`	`simplex_ref_count + duplex_ref_count`
Sum	`simplex_duplex_alt_count`	`simplex_alt_count + duplex_alt_count`
Sum	`simplex_duplex_ref_count_forward`	Additive
Sum	`simplex_duplex_ref_count_reverse`	Additive
Sum	`simplex_duplex_alt_count_forward`	Additive
Sum	`simplex_duplex_alt_count_reverse`	Additive
Sum	`simplex_duplex_ref_count_fragment`	Additive
Sum	`simplex_duplex_alt_count_fragment`	Additive
Sum	`simplex_duplex_ref_count_fragment_forward`	Additive
Sum	`simplex_duplex_ref_count_fragment_reverse`	Additive
Sum	`simplex_duplex_alt_count_fragment_forward`	Additive
Sum	`simplex_duplex_alt_count_fragment_reverse`	Additive
Derived	`simplex_duplex_total_count`	`ref + alt`
Derived	`simplex_duplex_total_count_fragment`	`ref_fragment + alt_fragment`
Derived	`simplex_duplex_vaf`	`alt / total` (0 when total=0)
Derived	`simplex_duplex_vaf_fragment`	`alt_frag / total_frag`
Fisher	`simplex_duplex_strand_bias_p_value`	Fisher 2×2 (read strand)
Fisher	`simplex_duplex_strand_bias_odds_ratio`	Fisher 2×2 (read strand)
Fisher	`simplex_duplex_fragment_strand_bias_p_value`	Fisher 2×2 (fragment strand)
Fisher	`simplex_duplex_fragment_strand_bias_odds_ratio`	Fisher 2×2 (fragment strand)
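
The read-level rows of the table can be sketched as follows. This is a plain-Python illustration, not the Rust implementation: it assumes the 2×2 table is laid out as [[ref_forward, ref_reverse], [alt_forward, alt_reverse]], uses the common two-sided definition of Fisher's exact test (summing all tables no more probable than the observed one), and picks its own conventions for zero cells. The fragment-level columns would follow the same pattern.

```python
from math import comb


def fisher_exact_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Two-sided Fisher's exact test on [[a, b], [c, d]] -> (p_value, odds_ratio)."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    if n == 0:
        return 1.0, float("nan")  # assumption: empty table is not significant
    denom = comb(n, col1)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Hypergeometric probability of each possible value of the top-left cell.
    probs = {x: comb(row1, x) * comb(row2, col1 - x) / denom
             for x in range(lo, hi + 1)}
    observed = probs[a]
    p = sum(q for q in probs.values() if q <= observed * (1 + 1e-7))
    if b * c == 0:
        odds = float("inf") if a * d > 0 else float("nan")
    else:
        odds = (a * d) / (b * c)
    return min(p, 1.0), odds


def combine_read_level(row: dict[str, str]) -> dict[str, float]:
    """Compute read-level simplex_duplex_* values from one merged row."""
    def add(metric: str) -> int:
        return int(row[f"simplex_{metric}"]) + int(row[f"duplex_{metric}"])

    ref, alt = add("ref_count"), add("alt_count")
    total = ref + alt
    p, odds = fisher_exact_2x2(add("ref_count_forward"), add("ref_count_reverse"),
                               add("alt_count_forward"), add("alt_count_reverse"))
    return {
        "simplex_duplex_ref_count": ref,
        "simplex_duplex_alt_count": alt,
        "simplex_duplex_total_count": total,
        "simplex_duplex_vaf": alt / total if total else 0.0,
        "simplex_duplex_strand_bias_p_value": p,
        "simplex_duplex_strand_bias_odds_ratio": odds,
    }
```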

Schema-Aware

If the input MAFs do not contain strand-level columns (e.g., from an older gbcms version), only the available columns are summed. Missing metrics are logged at INFO level and skipped — the pipeline does not fail.
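
One way to picture this schema check is the sketch below, which keeps only the metrics whose simplex and duplex columns both exist and logs the rest at INFO level. The metric list is taken from the Sum rows of the table above; the function and logger names are hypothetical.

```python
import logging

logger = logging.getLogger("merge_sketch")

# The 12 additive metrics listed in the Combined Columns table.
SUM_METRICS = [
    "ref_count", "alt_count",
    "ref_count_forward", "ref_count_reverse",
    "alt_count_forward", "alt_count_reverse",
    "ref_count_fragment", "alt_count_fragment",
    "ref_count_fragment_forward", "ref_count_fragment_reverse",
    "alt_count_fragment_forward", "alt_count_fragment_reverse",
]


def available_metrics(columns: set[str]) -> list[str]:
    """Return the metrics that can be summed given the merged column names."""
    present = []
    for metric in SUM_METRICS:
        needed = {f"simplex_{metric}", f"duplex_{metric}"}
        if needed <= columns:
            present.append(metric)
        else:
            # Skip instead of failing, mirroring the behavior described above.
            logger.info("skipping simplex_duplex_%s: missing %s",
                        metric, sorted(needed - columns))
    return present
```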

Nextflow Integration¶

When running via the Nextflow pipeline, enable merging with the --merge_counts flag:

nextflow run main.nf \
    --input samplesheet.csv \
    --variants mutations.maf \
    --fasta reference.fa \
    --format maf \
    --merge_counts \
    -profile slurm

The samplesheet should include a bam_type column:

sample,bam,bai,suffix,bam_type
sample1,sample1.duplex.bam,,-duplex,duplex
sample1,sample1.simplex.bam,,-simplex,simplex

When bam_type is set, the DNA module automatically derives --column-prefix from the type label, so counts are pre-prefixed in the per-BAM MAF output.
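
To show how the samplesheet relates to the merge inputs, the sketch below groups rows by sample and builds the `--input` arguments that an equivalent manual `gbcms merge` call would take. The per-BAM MAF file name (`{sample}{suffix}.maf`) is an assumption based on the Quick Start examples, not a documented pipeline output path, and `merge_inputs` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict


def merge_inputs(samplesheet_text: str) -> dict[str, list[str]]:
    """Build per-sample '--input TYPE:PATH' argument lists from a samplesheet."""
    per_sample: dict[str, list[str]] = defaultdict(list)
    for row in csv.DictReader(io.StringIO(samplesheet_text)):
        bam_type = row.get("bam_type")
        if not bam_type:
            continue  # rows without a type label cannot be merged by type
        # Assumed naming convention for the per-BAM genotyped MAF.
        maf = f"{row['sample']}{row['suffix']}.maf"
        per_sample[row["sample"]] += ["--input", f"{bam_type}:{maf}"]
    return dict(per_sample)


sheet = (
    "sample,bam,bai,suffix,bam_type\n"
    "sample1,sample1.duplex.bam,,-duplex,duplex\n"
    "sample1,sample1.simplex.bam,,-simplex,simplex\n"
)
args = merge_inputs(sheet)
```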

Output Formats — Column schema reference
Architecture — System design
gbcms dna — Per-BAM DNA counting
