ArchitectureΒΆ
gbcms uses a hybrid Python/Rust architecture for maximum performance.
System OverviewΒΆ
flowchart TB
subgraph Python [π Python Layer]
CLI[CLI - cli.py] -->|dna / rna| Pipeline[Orchestration - pipeline.py]
CLI -->|merge| Merge[Merge Engine - merge.py]
Pipeline --> Reader[Input Adapters]
Pipeline --> Writer[Output Writers]
Merge --> BatchIO[Batch I/O - io/batch.py]
end
subgraph Rust [π¦ Rust Layer]
Counter[count_bam_binned - counting/engine.rs] --> CIGAR[CIGAR Parser]
Counter --> RNA[RNA Filters - rna.rs]
Counter --> Pangenome[Haplotype Matrix - pangenome.rs]
Counter --> WFA[WFA Fast Path - wfa_router.rs]
Counter --> PairHMM[PairHMM Backend - pairhmm.rs]
Counter --> Stats[Strand Bias - stats.rs]
Counter --> Annot[Annotation Index - annotation/]:::annot
ParquetWriter[write_fsd_parquet - parquet_writer.rs]
end
Pipeline -->|PyO3| Counter
Counter -->|BaseCounts| Pipeline
Pipeline -->|mfsd-parquet| ParquetWriter
Merge -->|fisher_exact_2x2| Stats
classDef pythonStyle fill:#3776ab,color:#fff,stroke:#2c5f8a,stroke-width:2px;
classDef rustStyle fill:#dea584,color:#000,stroke:#c48a6a,stroke-width:2px;
classDef annot fill:#27ae60,color:#fff,stroke:#1e8449,stroke-width:2px;
class Python pythonStyle;
class Rust rustStyle;
Data FlowΒΆ
flowchart LR
subgraph Input
VCF["VCF/MAF"]
BAM["BAM/CRAM Files"]
FASTA["Reference FASTA"]
end
subgraph Process
Load["Load Variants"]
Prepare["Prepare\n(validate + left-align + decomp detect)"]
Count["Count Reads"]
Diag["Diagnostics\n(gbcms_diagnostic flags)"]
Rescue["Rescue\n(--rescue-mnp)"]:::optional
end
subgraph Output ["Required Output"]
Result["VCF/MAF + Counts"]
end
subgraph OptOut ["Optional Output (--mfsd-parquet)"]
Parquet[".fsd.parquet"]
end
VCF --> Load --> Prepare
FASTA --> Prepare
Prepare --> Count
BAM --> Count
Count --> Diag
Diag -.->|"--rescue-mnp"| Rescue
Diag --> Result
Rescue --> Result
Count -.->|"--mfsd-parquet"| Parquet
classDef optional fill:#f39c12,color:#fff,stroke:#d68910,stroke-width:2px;
Genomic BinningΒΆ
To minimise BAM I/O, the Rust engine groups co-located variants into genomic bins before counting. A single bam.fetch() is issued per bin, and reads fetched for that region are classified against all variants in the bin in one pass. For clustered inputs (e.g. a MAF with hundreds of variants on the same gene) this reduces BAM seeks from O(N) per-variant to O(B) per-bin β typically 5-20Γ fewer I/O operations.
This design mirrors the original C++ GBCMS --max_block_size / --max_block_dist architecture, re-implemented as a pure Rust streaming algorithm.
flowchart LR
Variants(["π Variants (any order)"]):::start --> Sort["Sort indices\nby chrom + pos"]
Sort --> NewBin["Open new bin\nat variant pos"]
NewBin --> Extend{"Next variant\nwithin window\n& same chrom?"}
Extend -->|"Yes"| MaxCheck{"β₯ 200 variants\nin bin?"}
MaxCheck -->|"No β extend"| GrowBin["Add variant\nextend bin_end"]
GrowBin --> Extend
MaxCheck -->|"Yes β split"| NewBin
Extend -->|"No / new chrom"| Pad["Pad bin:\nmax(repeat_span+2, 5)bp"]
Pad --> Fetch["bam.fetch(bin_start, bin_end)\nβ 1 I/O per bin β‘"]:::io
Fetch --> Classify["Classify each read\nagainst all variants in bin\n(Rayon parallel across bins)"]:::parallel
Classify --> Out(["π BaseCounts per variant"]):::pass
classDef start fill:#9b59b6,color:#fff,stroke:#7d3c98,stroke-width:2px;
classDef io fill:#3498db,color:#fff,stroke:#2471a3,stroke-width:2px;
classDef parallel fill:#27ae60,color:#fff,stroke:#1e8449,stroke-width:2px;
classDef pass fill:#27ae60,color:#fff,stroke:#1e8449,stroke-width:2px;
Binning ParametersΒΆ
| Parameter | Value | Notes |
|---|---|---|
BIN_WINDOW |
10,000 bp | Maximum span of a single bin. Variants beyond this distance start a new bin. |
BIN_MAX_VARIANTS |
200 | Maximum variants per bin. Split enforced to prevent O(V Γ R) blowup in the shared-read classification loop. |
| Bin padding | max(repeat_span + 2, 5) bp |
Ensures reads overlapping bin edges are captured. Uses each variant's detected tandem-repeat span. |
| Parallelism | Rayon par_iter() over bins |
Each thread owns its own BamReader handle; no locking required across bins. |
| Output order | Preserved | Variants are sorted by index internally; results are written back in original input order. |
Performance Implication
For a MAF with 500 variants on TP53 (a 19kb gene), the engine produces ~2-3 bins instead of 500 individual bam.fetch() calls. On high-depth targeted panels this can reduce wall-clock counting time by 5-20Γ.
Not a CLI flag
BIN_WINDOW and BIN_MAX_VARIANTS are internal performance constants β they do not affect output values. Parity testing (D1 regression suite) validates that binned and per-variant paths produce identical BaseCounts.
Coordinate SystemΒΆ
All coordinates normalized to 0-based, half-open internally:
flowchart LR
VCF["VCF (1-based)"] -->|"-1 (e.g. VCF:100 β internal:99)"| Internal["Internal (0-based)"]
MAF["MAF (1-based)"] -->|"-1"| Internal
Internal -->|"β Rust engine"| Rust["gbcms._rs"]
Rust -->|"+1 (e.g. internal:99 β output:100)"| Output["Output (1-based)"]
| Format | System | Example |
|---|---|---|
| VCF input | 1-based | chr1:100 |
| Internal | 0-based | chr1:99 |
| Output | 1-based | chr1:100 |
FormulasΒΆ
Variant Allele Frequency (VAF)ΒΆ
Where: - AD = Alternate allele read count - RD = Reference allele read count
Strand Bias (Fisher's Exact Test)ΒΆ
| Forward Reverse |
-----+--------------------+
Ref | a b |
Alt | c d |
-----+--------------------+
p-value = Fisher's exact test on 2Γ2 contingency table
Low p-value (< 0.05) indicates potential strand bias artifact.
Structural InvariantsΒΆ
The Rust counting engine (BaseCounts) maintains these invariants at all times:
| Invariant | Formula | Purpose |
|---|---|---|
| ALT decomposition | any_alt = AD + partial_alt |
any_alt decomposes into full ALT matches and partial-only matches |
| ALT monotonicity | any_alt >= AD |
Reads with any ALT evidence β₯ reads with full ALT match |
| Depth decomposition | DP >= RD + AD + partial_alt + n_count |
DP includes all read categories; gap = reads classified as neither |
Diagnostic Output FieldsΒΆ
Three diagnostic fields are always present in VCF (FORMAT tags) and MAF output:
| Field | VCF Tag | Description |
|---|---|---|
any_alt |
AAD |
Reads with ALT evidence at β₯1 discriminating position |
partial_alt |
PAD |
Reads matching ALT at some but not all positions β now populated for all variant types including INDELs (via structural evidence propagation) |
n_count |
NAD |
Reads with N base at β₯1 discriminating position (duplex masking QC) |
N bases are strictly uninformative β they increment n_count for QC monitoring but are excluded from RD, AD, any_alt, and partial_alt. See Counting Metrics for full details.
Module StructureΒΆ
src/gbcms/
βββ cli.py # Typer CLI (dna, rna, merge, normalize commands)
βββ pipeline.py # Orchestration (~450 LOC)
βββ merge.py # Multi-BAM MAF merge engine (Polars lazy joins)
βββ normalize.py # Standalone normalization workflow
βββ core/
β βββ kernel.py # Coordinate normalization
βββ io/
β βββ input.py # VcfReader, MafReader (streaming)
β βββ output.py # VcfWriter, MafWriter (mode-aware: DNA/RNA columns)
β βββ batch.py # Polars batch I/O (read/scan/write MAF, read Parquet)
βββ models/
β βββ core.py # Pydantic configs (GbcmsDnaConfig, GbcmsRnaConfig, MergeConfig)
βββ utils/
βββ logging.py # Structured logging
rust/src/
βββ lib.rs # PyO3 module exports
βββ annotation/ # v5.0.0: GTF annotation index (COITree, splice masks)
β βββ mod.rs # AnnotationIndex struct, parse_gtf()
β βββ types.rs # ExonRecord, TranscriptInfo
βββ counting/
β βββ mod.rs # Submodule re-exports
β βββ engine.rs # Main loop, genomic binning, BAQ, UMI
β βββ variant_checks.rs # check_snp/mnp/ins/del/complex + Phase 3 dispatch
β βββ alignment.rs # Smith-Waterman implementation
β βββ pairhmm.rs # PairHMM alignment backend (marginalized)
β βββ pangenome.rs # Haplotype matrix construction
β βββ wfa_router.rs # WFA2 fast-path alignment
β βββ rna.rs # RNA validation, splicing, editing site lookup
β βββ fragment.rs # Re-export of shared::fragment (backward compat)
β βββ mfsd.rs # Mutant Fragment Size Distribution analysis
β βββ parquet_writer.rs # write_fsd_parquet() β native Parquet via ZSTD
β βββ utils.rs # Helpers, reconstruction, soft-clip
βββ normalize/
β βββ mod.rs # Submodule re-exports
β βββ engine.rs # Normalization pipeline
β βββ left_align.rs # bcftools-style left-alignment
β βββ decomp.rs # Homopolymer decomposition
β βββ fasta.rs # Reference sequence fetcher
β βββ repeat.rs # Tandem repeat detection
β βββ types.rs # NormResult enum
βββ shared/
β βββ mod.rs # Submodule re-exports
β βββ fragment.rs # FragmentEvidence, quality consensus, UMI grouping
β βββ stats.rs # Fisher's exact test (strand bias), BH correction
β βββ filters.rs # ReadFilter struct, FilterCounts (BAM flag checks)
β βββ baq.rs # BAQ heuristic (Li 2011)
β βββ bam_utils.rs # median_qual, find_read_pos
βββ types.rs # Variant, BaseCounts, PyO3 bindings
ConfigurationΒΆ
All settings via mode-specific Pydantic models:
flowchart TB
Base["GbcmsBaseConfig"]:::base --> DnaConfig["GbcmsDnaConfig"]
Base --> RnaConfig["GbcmsRnaConfig"]
DnaConfig --> DnaD["mode=dna\nMAPQ=20"]
RnaConfig --> RnaD["mode=rna, MAPQ=1\npairhmm relaxed RT gaps"]
RnaConfig --> RnaX["enforce_strandedness\nrna_editing_db\nlibrary_type\ngtf"]
Base --> Filters["ReadFilters"]
Base --> Quality["QualityThresholds"]
Base --> Output2["OutputConfig"]
Base --> Align["AlignmentConfig"]
Filters --> FD["secondary Β· duplicates\nsupplementary Β· qc_failed\nimproper_pair Β· indel"]
Quality --> QD["min_mapq Β· min_baseq\nfragment_qual_threshold"]
Output2 --> OD["output_dir Β· format Β· suffix\ncolumn_prefix Β· mfsd Β· mfsd_parquet"]
Align --> AD["backend: pairhmm|sw\nllr_threshold Β· gap_*_prob"]
classDef base fill:#9b59b6,color:#fff,stroke:#7d3c98,stroke-width:2px;
See models/core.py for definitions.
Full Pipeline: End-to-End ExampleΒΆ
Here's how a single variant is processed through the complete pipeline:
sequenceDiagram
participant CLI as CLI (Python)
participant Pipeline as Pipeline
participant Reader as VCF/MAF Reader
participant Kernel as Coordinate Kernel
participant Rust as Rust Engine
participant BAM as BAM File
CLI->>Pipeline: run(config)
Pipeline->>Reader: load variants
Reader->>Kernel: normalize coordinates
Kernel-->>Reader: 0-based Variant objects
Reader-->>Pipeline: List[Variant]
loop For each BAM sample
Pipeline->>Rust: count_bam_binned(bam, variants, config)
loop For each genomic bin (parallel via Rayon)
Rust->>BAM: fetch(chrom, posβ5, pos+ref_len+5)
BAM-->>Rust: Iterator of reads
loop For each read
Note over Rust: Apply filter cascade
Note over Rust: Dispatch to type checker
Note over Rust: Phase 1+2 haplotype reconstruction
Note over Rust: Phase 2.5 edit-distance fast-path
Note over Rust: Phase 3 WFA triage β PairHMM LLR
Note over Rust: Update read + fragment counts
end
Note over Rust: Compute Fisher strand bias
end
Rust-->>Pipeline: Vec[BaseCounts]
Pipeline->>Pipeline: _compute_diagnostics()
opt --rescue-mnp enabled
Pipeline->>Pipeline: _rescue_mnp_pass() β identify candidates
Pipeline->>Rust: count_bam_binned(synthetic SNPs)
Rust-->>Pipeline: Vec[BaseCounts] for SNPs
Pipeline->>Pipeline: Update ad, populate gbcms_rescue
end
end
Pipeline->>Pipeline: Write output (VCF/MAF)
MNP Rescue Pass (--rescue-mnp, v4.3.0)ΒΆ
The rescue pass recovers alt_count for MNP variants where ad == 0 by decomposing
the MNP into individual SNP positions and re-counting each one independently.
Why Python, Not Rust?ΒΆ
The original design spec proposed implementing rescue in Rust. The final implementation uses Python orchestration calling the existing Rust counting engine. This was a deliberate architectural decision:
flowchart LR
subgraph Python ["π Python β Orchestration"]
Filter["Filter candidates\n(PASS + ad==0 + MNP_RESCUE_ELIGIBLE)"]
Build["Build synthetic SNPs\nfrom disc positions"]
Map["Map counts back\npick best rescue"]
Audit["Populate gbcms_rescue\naudit trail"]
end
subgraph Rust ["π¦ Rust β Counting"]
Prepare["prepare_variants()\nREF validation"]
Count["count_bam_binned()\nBAM I/O + read classification"]
end
Filter --> Build --> Prepare
Prepare --> Count
Count --> Map --> Audit
classDef pythonStyle fill:#3776ab,color:#fff,stroke:#2c5f8a,stroke-width:2px;
classDef rustStyle fill:#dea584,color:#000,stroke:#c48a6a,stroke-width:2px;
class Python pythonStyle;
class Rust rustStyle;
| Concern | If implemented in Rust | Actual (Python + Rust) |
|---|---|---|
| BAM I/O | Rust (fast) | Rust via count_bam_binned() β same speed |
| Read classification | Rust (fast) | Rust β unchanged |
| Candidate filtering | Would duplicate Python diagnostic logic | Python β direct access to gbcms_diagnostic |
| Synthetic variant creation | New Rust API surface needed | Python β uses existing rs.Variant() |
| Config access | Entire config struct across FFI boundary | Python β self.config.* already available |
| Audit trail formatting | Awkward string manipulation in Rust | Python β natural |
| Unit testing | Hard to mock BAM interactions | Python β SimpleNamespace mocks |
| Performance impact | Negligible improvement | < 1 second added to 30β120s pipeline |
Key Insight
The rescue engine is an orchestration layer, not a counting engine. The expensive
work (BAM I/O, read classification, allele counting) is still 100% Rust. Python
decides what to re-count and what to do with the results β typically 5β50
synthetic SNPs per sample, handled in a single batched count_bam_binned() call.
Rescue Candidate CriteriaΒΆ
A variant qualifies for rescue when all four conditions are met:
| # | Condition | Rationale |
|---|---|---|
| 1 | gbcms_status starts with PASS |
FAIL variants have unreliable coordinates |
| 2 | ad == 0 |
Variants with confirmed ALT reads don't need rescue |
| 3 | MNP_RESCUE_ELIGIBLE in gbcms_diagnostic |
Disc/len ratio β€ --rescue-mnp-threshold (default 1.0 = all MNPs). Set to 0.5 for conservative sparse-only mode. |
| 4 | ref_len == alt_len > 1 (MNP) |
SNPs and INDELs are never MNP rescue candidates |
Rescue Strategy: Decomposed SNP CountingΒΆ
MNP: GAGGG β AAGGA (5bp, positions 0-4)
Disc: pos 0 (GβA) β, pos 4 (GβA) β β 2/5 discriminating
Decompose into:
SNPβ: GβA at chr5:1295250 β count_bam_binned() β ad=108
SNPβ: GβA at chr5:1295254 β count_bam_binned() β ad=0
Best rescue: ad=108 (from SNPβ)
Audit: method=decomposed;original_alt=0;positions=chr5:1295251(G>A):108,chr5:1295255(G>A):0
Invariant ImpactΒΆ
After rescue, Invariant 1 (any_alt = ad + partial_alt) intentionally breaks:
| Field | Before Rescue | After Rescue | Source |
|---|---|---|---|
ad (alt_count) |
0 | 108 (updated) | Rescue engine |
partial_alt |
108 | 108 (unchanged) | Original MNP check β forensic evidence |
any_alt |
108 | 108 (unchanged) | Original MNP check β forensic evidence |
gbcms_rescue |
(empty) | method=decomposed;original_alt=0;... |
Rescue audit trail |
The gbcms_rescue audit trail preserves the original_alt=0 value and documents
the rescue provenance, enabling downstream users to reconcile the invariant breakage.
Comparison with Original GBCMSΒΆ
| Feature | Original GBCMS | gbcms |
|---|---|---|
| Counting algorithm | Region-based chunking, position matching | Per-variant CIGAR traversal |
| Indel detection | Exact position match only | Windowed scan (Β±5bp) with 3-layer safeguards: sequence identity, closest match, reference context validation |
| Complex variants | Optional via --generic_counting |
Always uses haplotype reconstruction |
| Complex quality handling | Exact match only (no quality awareness) | Masked comparison β bases below --min-baseq are masked out, ambiguity detection prevents false positives |
| Base quality filtering | No base quality threshold | Default --min-baseq 20 (Phred Q20) |
| MNP handling | Not explicit | Dedicated check_mnp with contiguity check |
| Fragment counting | Optional (--fragment_count), majority-rule |
Always computed, quality-weighted consensus with INDEL structural priority |
| Positive strand counts | Optional (--positive_count) |
Always computed |
| Strand bias | Not computed | Fisher's exact test (read + fragment level) |
| Fractional depth | --fragment_fractional_weight |
Not implemented |
| Parallelism | OpenMP block-based | Rayon per-variant |
RelatedΒΆ
- Allele Classification β How each variant type is counted
- RNA Annotation β GTF annotation, per-transcript counting, ASJD detection
- RNA Splice-Junction Handling β How gbcms handles splice-boundary artifacts vs GATK SplitNCigarReads
- Variant Normalization β How variants are prepared before counting
- Input Formats β VCF and MAF specifications
- Glossary β Term definitions
abbreviations