# Architecture
gbcms uses a hybrid Python/Rust architecture: Python handles the CLI, orchestration, and I/O, while a Rust engine performs the performance-critical read counting.
## System Overview

```mermaid
flowchart TB
    subgraph Python [🐍 Python Layer]
        CLI[CLI - cli.py] -->|dna / rna| Pipeline[Orchestration - pipeline.py]
        Pipeline --> Reader[Input Adapters]
        Pipeline --> Writer[Output Writers]
    end
    subgraph Rust [🦀 Rust Layer]
        Counter[count_bam_binned - counting/engine.rs] --> CIGAR[CIGAR Parser]
        Counter --> RNA[RNA Filters - rna.rs]
        Counter --> Pangenome[Haplotype Matrix - pangenome.rs]
        Counter --> WFA[WFA Fast Path - wfa_router.rs]
        Counter --> PairHMM[PairHMM Backend - pairhmm.rs]
        Counter --> Stats[Strand Bias - stats.rs]
        ParquetWriter[write_fsd_parquet - parquet_writer.rs]
    end
    Pipeline -->|PyO3| Counter
    Counter -->|BaseCounts| Pipeline
    Pipeline -->|mfsd-parquet| ParquetWriter
    classDef pythonStyle fill:#3776ab,color:#fff,stroke:#2c5f8a,stroke-width:2px;
    classDef rustStyle fill:#dea584,color:#000,stroke:#c48a6a,stroke-width:2px;
    class Python pythonStyle;
    class Rust rustStyle;
```
## Data Flow

```mermaid
flowchart LR
    subgraph Input
        VCF["VCF/MAF"]
        BAM["BAM Files"]
        FASTA["Reference FASTA"]
    end
    subgraph Process
        Load["Load Variants"]
        Prepare["Prepare\n(validate + left-align + decomp detect)"]
        Count["Count Reads"]
    end
    subgraph Output ["Required Output"]
        Result["VCF/MAF + Counts"]
    end
    subgraph OptOut ["Optional Output (--mfsd-parquet)"]
        Parquet[".fsd.parquet"]
    end
    VCF --> Load --> Prepare
    FASTA --> Prepare
    Prepare --> Count
    BAM --> Count
    Count --> Result
    Count -.->|"--mfsd-parquet"| Parquet
```
## Genomic Binning

To minimise BAM I/O, the Rust engine groups co-located variants into genomic bins before counting. A single `bam.fetch()` is issued per bin, and reads fetched for that region are classified against all variants in the bin in one pass. For clustered inputs (e.g. a MAF with hundreds of variants on the same gene) this reduces BAM seeks from O(N) per-variant to O(B) per-bin, typically 5-20× fewer I/O operations.

This design mirrors the original C++ GBCMS `--max_block_size` / `--max_block_dist` architecture, re-implemented as a pure Rust streaming algorithm.
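The grouping pass above can be sketched in a few lines of Python. This is a simplified illustration of the binning logic (the real implementation lives in Rust, `counting/engine.rs`); the function name `build_bins` and the tuple representation of variants are invented for this sketch, while the two constants come from the parameters documented below.

```python
# Simplified sketch of the genomic binning pass. Variants are sorted by
# (chrom, pos), then greedily grouped: a bin grows until the next variant
# is on another chromosome, falls outside the 10 kb window, or the bin
# already holds 200 variants.
BIN_WINDOW = 10_000       # max span of a single bin (bp)
BIN_MAX_VARIANTS = 200    # max variants per bin

def build_bins(variants):
    """variants: list of (chrom, pos) tuples, in any order.
    Returns a list of bins, each a list of indices into `variants`
    (so results can be written back in original input order)."""
    order = sorted(range(len(variants)), key=lambda i: variants[i])
    bins = []
    for i in order:
        chrom, pos = variants[i]
        if bins:
            first_chrom, first_pos = variants[bins[-1][0]]
            same_chrom = first_chrom == chrom
            in_window = pos - first_pos <= BIN_WINDOW
            has_room = len(bins[-1]) < BIN_MAX_VARIANTS
            if same_chrom and in_window and has_room:
                bins[-1].append(i)
                continue
        bins.append([i])  # open a new bin at this variant
    return bins
```

Because each bin is independent, the Rust engine can hand bins to Rayon worker threads without any shared state beyond the read-only variant list.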
```mermaid
flowchart LR
    Variants(["Variants (any order)"]):::start --> Sort["Sort indices\nby chrom + pos"]
    Sort --> NewBin["Open new bin\nat variant pos"]
    NewBin --> Extend{"Next variant\nwithin window\n& same chrom?"}
    Extend -->|"Yes"| MaxCheck{"≥ 200 variants\nin bin?"}
    MaxCheck -->|"No - extend"| GrowBin["Add variant\nextend bin_end"]
    GrowBin --> Extend
    MaxCheck -->|"Yes - split"| NewBin
    Extend -->|"No / new chrom"| Pad["Pad bin:\nmax(repeat_span+2, 5)bp"]
    Pad --> Fetch["bam.fetch(bin_start, bin_end)\n1 I/O per bin ⚡"]:::io
    Fetch --> Classify["Classify each read\nagainst all variants in bin\n(Rayon parallel across bins)"]:::parallel
    Classify --> Out(["BaseCounts per variant"]):::pass
    classDef start fill:#9b59b6,color:#fff,stroke:#7d3c98,stroke-width:2px;
    classDef io fill:#3498db,color:#fff,stroke:#2471a3,stroke-width:2px;
    classDef parallel fill:#27ae60,color:#fff,stroke:#1e8449,stroke-width:2px;
    classDef pass fill:#27ae60,color:#fff,stroke:#1e8449,stroke-width:2px;
```
### Binning Parameters

| Parameter | Value | Notes |
|---|---|---|
| `BIN_WINDOW` | 10,000 bp | Maximum span of a single bin. Variants beyond this distance start a new bin. |
| `BIN_MAX_VARIANTS` | 200 | Maximum variants per bin. Split enforced to prevent O(V × R) blowup in the shared-read classification loop. |
| Bin padding | `max(repeat_span + 2, 5)` bp | Ensures reads overlapping bin edges are captured. Uses each variant's detected tandem-repeat span. |
| Parallelism | Rayon `par_iter()` over bins | Each thread owns its own `BamReader` handle; no locking required across bins. |
| Output order | Preserved | Variants are sorted by index internally; results are written back in original input order. |
**Performance implication:** For a MAF with 500 variants on TP53 (a 19 kb gene), the engine produces ~2-3 bins instead of 500 individual `bam.fetch()` calls. On high-depth targeted panels this can reduce wall-clock counting time by 5-20×.

**Not a CLI flag:** `BIN_WINDOW` and `BIN_MAX_VARIANTS` are internal performance constants; they do not affect output values. Parity testing (D1 regression suite) validates that binned and per-variant paths produce identical BaseCounts.
## Coordinate System

All coordinates are normalized to 0-based, half-open intervals internally:
```mermaid
flowchart LR
    VCF["VCF (1-based)"] -->|"-1 (e.g. VCF:100 → internal:99)"| Internal["Internal (0-based)"]
    MAF["MAF (1-based)"] -->|"-1"| Internal
    Internal -->|"→ Rust engine"| Rust["gbcms._rs"]
    Rust -->|"+1 (e.g. internal:99 → output:100)"| Output["Output (1-based)"]
```
| Format | System | Example |
|---|---|---|
| VCF input | 1-based | chr1:100 |
| Internal | 0-based | chr1:99 |
| Output | 1-based | chr1:100 |
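The conversion is a one-line offset in each direction. A minimal sketch (function names are illustrative, not the actual `core/kernel.py` API):

```python
def vcf_to_internal(pos_1based: int, ref: str) -> tuple:
    """Convert a 1-based VCF POS to the internal 0-based, half-open
    interval covering the reference allele."""
    start = pos_1based - 1          # VCF:100 -> internal:99
    end = start + len(ref)          # half-open: end is exclusive
    return start, end

def internal_to_output(start_0based: int) -> int:
    """Convert an internal 0-based start back to 1-based output."""
    return start_0based + 1         # internal:99 -> output:100
```

The half-open convention means `end - start` is always the reference allele length, which avoids the off-by-one adjustments that closed intervals require.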
## Formulas

### Variant Allele Frequency (VAF)

$$\mathrm{VAF} = \frac{AD}{AD + RD}$$

Where:

- AD = alternate allele read count
- RD = reference allele read count
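In code (a trivial helper, guarding against zero total depth; the function name is illustrative):

```python
def vaf(ad: int, rd: int) -> float:
    """Variant allele frequency: ALT reads over total informative reads."""
    total = ad + rd
    return ad / total if total else 0.0  # 0.0 at sites with no coverage
```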
### Strand Bias (Fisher's Exact Test)

```
          Forward   Reverse
    Ref      a         b
    Alt      c         d
```

p-value = Fisher's exact test on the 2×2 contingency table.

A low p-value (< 0.05) indicates a potential strand-bias artifact.
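The engine's implementation lives in `shared/stats.rs`; for reference, here is a self-contained Python version of the two-sided test on the table above, built from the hypergeometric distribution (this is an illustration, not the Rust code):

```python
from math import comb

def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test on the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probability of every table (with the same
    margins) that is no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def p_table(x: int) -> float:
        # Probability of x in the top-left cell given fixed margins
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = p_table(a)
    lo, hi = max(0, col1 - row2), min(col1, row1)
    return sum(p_table(x) for x in range(lo, hi + 1)
               if p_table(x) <= p_obs + 1e-12)
```

For example, a table with 8 forward / 2 reverse REF reads against 1 forward / 5 reverse ALT reads yields p ≈ 0.035, which would be flagged as potential strand bias under the 0.05 threshold.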
## Module Structure

```
src/gbcms/
├── cli.py              # Typer CLI (dna, rna, normalize, run commands)
├── pipeline.py         # Orchestration (~450 LOC)
├── normalize.py        # Standalone normalization workflow
├── core/
│   └── kernel.py       # Coordinate normalization
├── io/
│   ├── input.py        # VcfReader, MafReader
│   └── output.py       # VcfWriter, MafWriter (mode-aware: DNA/RNA columns)
├── models/
│   └── core.py         # Pydantic configs (GbcmsDnaConfig, GbcmsRnaConfig, AlignmentConfig)
└── utils/
    └── logging.py      # Structured logging
```

```
rust/src/
├── lib.rs                  # PyO3 module exports
├── counting/
│   ├── mod.rs              # Submodule re-exports
│   ├── engine.rs           # Main loop, genomic binning, BAQ, UMI
│   ├── variant_checks.rs   # check_snp/mnp/ins/del/complex + Phase 3 dispatch
│   ├── alignment.rs        # Smith-Waterman implementation
│   ├── pairhmm.rs          # PairHMM alignment backend (marginalized)
│   ├── pangenome.rs        # Haplotype matrix construction
│   ├── wfa_router.rs       # WFA2 fast-path alignment
│   ├── rna.rs              # RNA validation, splicing, editing site lookup
│   ├── fragment.rs         # Re-export of shared::fragment (backward compat)
│   ├── mfsd.rs             # Mutant Fragment Size Distribution analysis
│   ├── parquet_writer.rs   # write_fsd_parquet() -> native Parquet via ZSTD
│   └── utils.rs            # Helpers, reconstruction, soft-clip
├── normalize/
│   ├── mod.rs              # Submodule re-exports
│   ├── engine.rs           # Normalization pipeline
│   ├── left_align.rs       # bcftools-style left-alignment
│   ├── decomp.rs           # Homopolymer decomposition
│   ├── fasta.rs            # Reference sequence fetcher
│   ├── repeat.rs           # Tandem repeat detection
│   └── types.rs            # NormResult enum
├── shared/
│   ├── mod.rs              # Submodule re-exports
│   ├── fragment.rs         # FragmentEvidence, quality consensus, UMI grouping
│   ├── stats.rs            # Fisher's exact test (strand bias)
│   ├── filters.rs          # ReadFilter struct, FilterCounts (BAM flag checks)
│   ├── baq.rs              # BAQ heuristic (Li 2011)
│   └── bam_utils.rs        # median_qual, find_read_pos
└── types.rs                # Variant, BaseCounts, PyO3 bindings
```
## Configuration

All settings are supplied via mode-specific Pydantic models:
```mermaid
flowchart TB
    Base["GbcmsBaseConfig"]:::base --> DnaConfig["GbcmsDnaConfig"]
    Base --> RnaConfig["GbcmsRnaConfig"]
    DnaConfig --> DnaD["mode=dna\nMAPQ=20"]
    RnaConfig --> RnaD["mode=rna, MAPQ=1\npairhmm relaxed RT gaps"]
    RnaConfig --> RnaX["enforce_strandedness\nrna_editing_db"]
    Base --> Filters["ReadFilters"]
    Base --> Quality["QualityThresholds"]
    Base --> Output2["OutputConfig"]
    Base --> Align["AlignmentConfig"]
    Filters --> FD["secondary · duplicates\nsupplementary · qc_failed\nimproper_pair · indel"]
    Quality --> QD["min_mapq · min_baseq\nfragment_qual_threshold"]
    Output2 --> OD["output_dir · format · suffix\ncolumn_prefix · mfsd · mfsd_parquet"]
    Align --> AD["backend: pairhmm|sw\nllr_threshold · gap_*_prob"]
    classDef base fill:#9b59b6,color:#fff,stroke:#7d3c98,stroke-width:2px;
```
See `models/core.py` for definitions.
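The shape of the hierarchy can be sketched with stdlib dataclasses standing in for Pydantic (the real models live in `models/core.py`; class names match the diagram, but the field set shown here is a minimal illustrative subset, and only the MAPQ defaults are taken from the diagram):

```python
from dataclasses import dataclass, field

@dataclass
class QualityThresholds:
    min_mapq: int = 20      # DNA default from the diagram
    min_baseq: int = 20

@dataclass
class GbcmsBaseConfig:
    # Shared settings: filters, quality, output, alignment (subset shown)
    quality: QualityThresholds = field(default_factory=QualityThresholds)

@dataclass
class GbcmsDnaConfig(GbcmsBaseConfig):
    mode: str = "dna"

@dataclass
class GbcmsRnaConfig(GbcmsBaseConfig):
    mode: str = "rna"
    enforce_strandedness: bool = False  # illustrative default

    def __post_init__(self):
        # RNA mode relaxes the mapping-quality floor (MAPQ=1)
        self.quality.min_mapq = 1
```

The point of the split is that DNA and RNA modes share one validated base config, and each mode overrides only the thresholds that differ.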
## Full Pipeline: End-to-End Example

Here's how a single variant is processed through the complete pipeline:
```mermaid
sequenceDiagram
    participant CLI as CLI (Python)
    participant Pipeline as Pipeline
    participant Reader as VCF/MAF Reader
    participant Kernel as Coordinate Kernel
    participant Rust as Rust Engine
    participant BAM as BAM File
    CLI->>Pipeline: run(config)
    Pipeline->>Reader: load variants
    Reader->>Kernel: normalize coordinates
    Kernel-->>Reader: 0-based Variant objects
    Reader-->>Pipeline: List[Variant]
    loop For each BAM sample
        Pipeline->>Rust: count_bam_binned(bam, variants, config)
        loop For each genomic bin (parallel via Rayon)
            Rust->>BAM: fetch(chrom, pos-5, pos+ref_len+5)
            BAM-->>Rust: Iterator of reads
            loop For each read
                Note over Rust: Apply filter cascade
                Note over Rust: Dispatch to type checker
                Note over Rust: Phase 1+2 haplotype reconstruction
                Note over Rust: Phase 2.5 edit-distance fast-path
                Note over Rust: Phase 3 WFA triage → PairHMM LLR
                Note over Rust: Update read + fragment counts
            end
            Note over Rust: Compute Fisher strand bias
        end
        Rust-->>Pipeline: Vec[BaseCounts]
    end
    Pipeline->>Pipeline: Write output (VCF/MAF)
```
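The "filter cascade" step can be illustrated with the standard SAM flag bits. This is a simplified Python sketch of the checks described for `shared/filters.rs` (the function name and signature are invented for illustration; the flag constants are from the SAM specification, and the filter set matches the ReadFilters node in the configuration diagram):

```python
# Standard SAM flag bits (SAM spec, column 2 of a BAM record)
FLAG_PROPER_PAIR   = 0x2
FLAG_SECONDARY     = 0x100
FLAG_QC_FAIL       = 0x200
FLAG_DUPLICATE     = 0x400
FLAG_SUPPLEMENTARY = 0x800

def passes_filters(flag: int, mapq: int, min_mapq: int = 20,
                   require_proper_pair: bool = False) -> bool:
    """Return True if a read survives the filter cascade:
    secondary / supplementary / duplicate / QC-failed alignments are
    dropped, then (optionally) improper pairs, then low-MAPQ reads."""
    if flag & (FLAG_SECONDARY | FLAG_SUPPLEMENTARY |
               FLAG_DUPLICATE | FLAG_QC_FAIL):
        return False
    if require_proper_pair and not (flag & FLAG_PROPER_PAIR):
        return False
    return mapq >= min_mapq
```

Cheap bitwise checks run first so that expensive per-read work (haplotype reconstruction, PairHMM) is only reached by reads that can actually contribute counts.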
## Comparison with Original GBCMS

| Feature | Original GBCMS | gbcms |
|---|---|---|
| Counting algorithm | Region-based chunking, position matching | Per-variant CIGAR traversal |
| Indel detection | Exact position match only | Windowed scan (±5 bp) with 3-layer safeguards: sequence identity, closest match, reference context validation |
| Complex variants | Optional via `--generic_counting` | Always uses haplotype reconstruction |
| Complex quality handling | Exact match only (no quality awareness) | Masked comparison: bases below `--min-baseq` are masked out; ambiguity detection prevents false positives |
| Base quality filtering | No base quality threshold | Default `--min-baseq 20` (Phred Q20) |
| MNP handling | Not explicit | Dedicated `check_mnp` with contiguity check |
| Fragment counting | Optional (`--fragment_count`), majority-rule | Always computed, quality-weighted consensus with discard |
| Positive strand counts | Optional (`--positive_count`) | Always computed |
| Strand bias | Not computed | Fisher's exact test (read + fragment level) |
| Fractional depth | `--fragment_fractional_weight` | Not implemented |
| Parallelism | OpenMP block-based | Rayon per-bin |
## Related

- Allele Classification – How each variant type is counted
- Variant Normalization – How variants are prepared before counting
- Input Formats – VCF and MAF specifications
- Glossary – Term definitions