Architecture¶

gbcms uses a hybrid Python/Rust architecture for maximum performance.

System Overview¶

flowchart TB
    subgraph Python [🐍 Python Layer]
        CLI[CLI - cli.py] -->|dna / rna| Pipeline[Orchestration - pipeline.py]
        Pipeline --> Reader[Input Adapters]
        Pipeline --> Writer[Output Writers]
    end

    subgraph Rust [🦀 Rust Layer]
        Counter[count_bam_binned - counting/engine.rs] --> CIGAR[CIGAR Parser]
        Counter --> RNA[RNA Filters - rna.rs]
        Counter --> Pangenome[Haplotype Matrix - pangenome.rs]
        Counter --> WFA[WFA Fast Path - wfa_router.rs]
        Counter --> PairHMM[PairHMM Backend - pairhmm.rs]
        Counter --> Stats[Strand Bias - stats.rs]
        ParquetWriter[write_fsd_parquet - parquet_writer.rs]
    end

    Pipeline -->|PyO3| Counter
    Counter -->|BaseCounts| Pipeline
    Pipeline -->|mfsd-parquet| ParquetWriter

    classDef pythonStyle fill:#3776ab,color:#fff,stroke:#2c5f8a,stroke-width:2px;
    classDef rustStyle fill:#dea584,color:#000,stroke:#c48a6a,stroke-width:2px;
    class Python pythonStyle;
    class Rust rustStyle;

Use mouse to pan and zoom

Data Flow¶

flowchart LR
    subgraph Input
        VCF["VCF/MAF"]
        BAM["BAM Files"]
        FASTA["Reference FASTA"]
    end

    subgraph Process
        Load["Load Variants"]
        Prepare["Prepare\n(validate + left-align + decomp detect)"]
        Count["Count Reads"]
    end

    subgraph Output ["Required Output"]
        Result["VCF/MAF + Counts"]
    end

    subgraph OptOut ["Optional Output (--mfsd-parquet)"]
        Parquet[".fsd.parquet"]
    end

    VCF --> Load --> Prepare
    FASTA --> Prepare
    Prepare --> Count
    BAM --> Count
    Count --> Result
    Count -.->|"--mfsd-parquet"| Parquet

Use mouse to pan and zoom

Genomic Binning¶

To minimise BAM I/O, the Rust engine groups co-located variants into genomic bins before counting. A single bam.fetch() is issued per bin, and reads fetched for that region are classified against all variants in the bin in one pass. For clustered inputs (e.g. a MAF with hundreds of variants on the same gene) this reduces BAM seeks from O(N) per-variant to O(B) per-bin — typically 5-20× fewer I/O operations.

This design mirrors the original C++ GBCMS --max_block_size / --max_block_dist architecture, re-implemented as a pure Rust streaming algorithm.

flowchart LR
    Variants(["📋 Variants (any order)"]):::start --> Sort["Sort indices\nby chrom + pos"]
    Sort --> NewBin["Open new bin\nat variant pos"]
    NewBin --> Extend{"Next variant\nwithin window\n& same chrom?"}
    Extend -->|"Yes"| MaxCheck{"≥ 200 variants\nin bin?"}
    MaxCheck -->|"No — extend"| GrowBin["Add variant\nextend bin_end"]
    GrowBin --> Extend
    MaxCheck -->|"Yes — split"| NewBin
    Extend -->|"No / new chrom"| Pad["Pad bin:\nmax(repeat_span+2, 5)bp"]
    Pad --> Fetch["bam.fetch(bin_start, bin_end)\n— 1 I/O per bin ⚡"]:::io
    Fetch --> Classify["Classify each read\nagainst all variants in bin\n(Rayon parallel across bins)"]:::parallel
    Classify --> Out(["📊 BaseCounts per variant"]):::pass

    classDef start fill:#9b59b6,color:#fff,stroke:#7d3c98,stroke-width:2px;
    classDef io fill:#3498db,color:#fff,stroke:#2471a3,stroke-width:2px;
    classDef parallel fill:#27ae60,color:#fff,stroke:#1e8449,stroke-width:2px;
    classDef pass fill:#27ae60,color:#fff,stroke:#1e8449,stroke-width:2px;

Use mouse to pan and zoom

Binning Parameters¶

Parameter	Value	Notes
`BIN_WINDOW`	10,000 bp	Maximum span of a single bin. Variants beyond this distance start a new bin.
`BIN_MAX_VARIANTS`	200	Maximum variants per bin. Split enforced to prevent O(V × R) blowup in the shared-read classification loop.
Bin padding	`max(repeat_span + 2, 5)` bp	Ensures reads overlapping bin edges are captured. Uses each variant's detected tandem-repeat span.
Parallelism	Rayon `par_iter()` over bins	Each thread owns its own `BamReader` handle; no locking required across bins.
Output order	Preserved	Variants are sorted by index internally; results are written back in original input order.

Performance Implication

For a MAF with 500 variants on TP53 (a 19kb gene), the engine produces ~2-3 bins instead of 500 individual bam.fetch() calls. On high-depth targeted panels this can reduce wall-clock counting time by 5-20×.

Not a CLI flag

BIN_WINDOW and BIN_MAX_VARIANTS are internal performance constants — they do not affect output values. Parity testing (D1 regression suite) validates that binned and per-variant paths produce identical BaseCounts.

Coordinate System¶

All coordinates normalized to 0-based, half-open internally:

flowchart LR
    VCF["VCF (1-based)"] -->|"-1 (e.g. VCF:100 → internal:99)"| Internal["Internal (0-based)"]
    MAF["MAF (1-based)"] -->|"-1"| Internal
    Internal -->|"→ Rust engine"| Rust["gbcms._rs"]
    Rust -->|"+1 (e.g. internal:99 → output:100)"| Output["Output (1-based)"]

Use mouse to pan and zoom

Format	System	Example
VCF input	1-based	chr1:100
Internal	0-based	chr1:99
Output	1-based	chr1:100

Formulas¶

Variant Allele Frequency (VAF)¶

VAF = AD / (RD + AD)

Where: - AD = Alternate allele read count - RD = Reference allele read count

Strand Bias (Fisher's Exact Test)¶

         |  Forward  Reverse  |
    -----+--------------------+
    Ref  |    a        b      |
    Alt  |    c        d      |
    -----+--------------------+

    p-value = Fisher's exact test on 2×2 contingency table

Low p-value (< 0.05) indicates potential strand bias artifact.

Module Structure¶

src/gbcms/
├── cli.py           # Typer CLI (dna, rna, normalize, run commands)
├── pipeline.py      # Orchestration (~450 LOC)
├── normalize.py     # Standalone normalization workflow
├── core/
│   └── kernel.py    # Coordinate normalization
├── io/
│   ├── input.py     # VcfReader, MafReader
│   └── output.py    # VcfWriter, MafWriter (mode-aware: DNA/RNA columns)
├── models/
│   └── core.py      # Pydantic configs (GbcmsDnaConfig, GbcmsRnaConfig, AlignmentConfig)
└── utils/
    └── logging.py   # Structured logging

rust/src/
├── lib.rs                    # PyO3 module exports
├── counting/
│   ├── mod.rs                # Submodule re-exports
│   ├── engine.rs             # Main loop, genomic binning, BAQ, UMI
│   ├── variant_checks.rs     # check_snp/mnp/ins/del/complex + Phase 3 dispatch
│   ├── alignment.rs          # Smith-Waterman implementation
│   ├── pairhmm.rs            # PairHMM alignment backend (marginalized)
│   ├── pangenome.rs          # Haplotype matrix construction
│   ├── wfa_router.rs         # WFA2 fast-path alignment
│   ├── rna.rs                # RNA validation, splicing, editing site lookup
│   ├── fragment.rs           # Re-export of shared::fragment (backward compat)
│   ├── mfsd.rs               # Mutant Fragment Size Distribution analysis
│   ├── parquet_writer.rs     # write_fsd_parquet() — native Parquet via ZSTD
│   └── utils.rs              # Helpers, reconstruction, soft-clip
├── normalize/
│   ├── mod.rs                # Submodule re-exports
│   ├── engine.rs             # Normalization pipeline
│   ├── left_align.rs         # bcftools-style left-alignment
│   ├── decomp.rs             # Homopolymer decomposition
│   ├── fasta.rs              # Reference sequence fetcher
│   ├── repeat.rs             # Tandem repeat detection
│   └── types.rs              # NormResult enum
├── shared/
│   ├── mod.rs                # Submodule re-exports
│   ├── fragment.rs           # FragmentEvidence, quality consensus, UMI grouping
│   ├── stats.rs              # Fisher's exact test (strand bias)
│   ├── filters.rs            # ReadFilter struct, FilterCounts (BAM flag checks)
│   ├── baq.rs                # BAQ heuristic (Li 2011)
│   └── bam_utils.rs          # median_qual, find_read_pos
└── types.rs                  # Variant, BaseCounts, PyO3 bindings

Configuration¶

All settings via mode-specific Pydantic models:

flowchart TB
    Base["GbcmsBaseConfig"]:::base --> DnaConfig["GbcmsDnaConfig"]
    Base --> RnaConfig["GbcmsRnaConfig"]

    DnaConfig --> DnaD["mode=dna\nMAPQ=20"]
    RnaConfig --> RnaD["mode=rna, MAPQ=1\npairhmm relaxed RT gaps"]
    RnaConfig --> RnaX["enforce_strandedness\nrna_editing_db"]

    Base --> Filters["ReadFilters"]
    Base --> Quality["QualityThresholds"]
    Base --> Output2["OutputConfig"]
    Base --> Align["AlignmentConfig"]

    Filters --> FD["secondary · duplicates\nsupplementary · qc_failed\nimproper_pair · indel"]
    Quality --> QD["min_mapq · min_baseq\nfragment_qual_threshold"]
    Output2 --> OD["output_dir · format · suffix\ncolumn_prefix · mfsd · mfsd_parquet"]
    Align --> AD["backend: pairhmm|sw\nllr_threshold · gap_*_prob"]

    classDef base fill:#9b59b6,color:#fff,stroke:#7d3c98,stroke-width:2px;

Use mouse to pan and zoom

See models/core.py for definitions.

Full Pipeline: End-to-End Example¶

Here's how a single variant is processed through the complete pipeline:

sequenceDiagram
    participant CLI as CLI (Python)
    participant Pipeline as Pipeline
    participant Reader as VCF/MAF Reader
    participant Kernel as Coordinate Kernel
    participant Rust as Rust Engine
    participant BAM as BAM File

    CLI->>Pipeline: run(config)
    Pipeline->>Reader: load variants
    Reader->>Kernel: normalize coordinates
    Kernel-->>Reader: 0-based Variant objects
    Reader-->>Pipeline: List[Variant]

    loop For each BAM sample
        Pipeline->>Rust: count_bam_binned(bam, variants, config)
        loop For each genomic bin (parallel via Rayon)
            Rust->>BAM: fetch(chrom, pos−5, pos+ref_len+5)
            BAM-->>Rust: Iterator of reads
            loop For each read
                Note over Rust: Apply filter cascade
                Note over Rust: Dispatch to type checker
                Note over Rust: Phase 1+2 haplotype reconstruction
                Note over Rust: Phase 2.5 edit-distance fast-path
                Note over Rust: Phase 3 WFA triage → PairHMM LLR
                Note over Rust: Update read + fragment counts
            end
            Note over Rust: Compute Fisher strand bias
        end
        Rust-->>Pipeline: Vec[BaseCounts]
    end

    Pipeline->>Pipeline: Write output (VCF/MAF)

Use mouse to pan and zoom

Comparison with Original GBCMS¶

Feature	Original GBCMS	gbcms
Counting algorithm	Region-based chunking, position matching	Per-variant CIGAR traversal
Indel detection	Exact position match only	Windowed scan (±5bp) with 3-layer safeguards: sequence identity, closest match, reference context validation
Complex variants	Optional via `--generic_counting`	Always uses haplotype reconstruction
Complex quality handling	Exact match only (no quality awareness)	Masked comparison — bases below `--min-baseq` are masked out, ambiguity detection prevents false positives
Base quality filtering	No base quality threshold	Default `--min-baseq 20` (Phred Q20)
MNP handling	Not explicit	Dedicated `check_mnp` with contiguity check
Fragment counting	Optional (`--fragment_count`), majority-rule	Always computed, quality-weighted consensus with discard
Positive strand counts	Optional (`--positive_count`)	Always computed
Strand bias	Not computed	Fisher's exact test (read + fragment level)
Fractional depth	`--fragment_fractional_weight`	Not implemented
Parallelism	OpenMP block-based	Rayon per-variant

Allele Classification — How each variant type is counted
Variant Normalization — How variants are prepared before counting
Input Formats — VCF and MAF specifications
Glossary — Term definitions

Architecture¶

System Overview¶

Data Flow¶

Genomic Binning¶

Binning Parameters¶

Coordinate System¶

Formulas¶

Variant Allele Frequency (VAF)¶

Strand Bias (Fisher's Exact Test)¶

Module Structure¶

Configuration¶

Full Pipeline: End-to-End Example¶

Comparison with Original GBCMS¶

Related¶