
Architecture

gbcms uses a hybrid Python/Rust architecture: a Python layer handles the CLI, orchestration, and I/O, while a Rust engine performs the performance-critical read counting.

System Overview

flowchart TB
    subgraph Python [🐍 Python Layer]
        CLI[CLI - cli.py] -->|dna / rna| Pipeline[Orchestration - pipeline.py]
        Pipeline --> Reader[Input Adapters]
        Pipeline --> Writer[Output Writers]
    end

    subgraph Rust [🦀 Rust Layer]
        Counter[count_bam_binned - counting/engine.rs] --> CIGAR[CIGAR Parser]
        Counter --> RNA[RNA Filters - rna.rs]
        Counter --> Pangenome[Haplotype Matrix - pangenome.rs]
        Counter --> WFA[WFA Fast Path - wfa_router.rs]
        Counter --> PairHMM[PairHMM Backend - pairhmm.rs]
        Counter --> Stats[Strand Bias - stats.rs]
        ParquetWriter[write_fsd_parquet - parquet_writer.rs]
    end

    Pipeline -->|PyO3| Counter
    Counter -->|BaseCounts| Pipeline
    Pipeline -->|"--mfsd-parquet"| ParquetWriter

    classDef pythonStyle fill:#3776ab,color:#fff,stroke:#2c5f8a,stroke-width:2px;
    classDef rustStyle fill:#dea584,color:#000,stroke:#c48a6a,stroke-width:2px;
    class Python pythonStyle;
    class Rust rustStyle;

Data Flow

flowchart LR
    subgraph Input
        VCF["VCF/MAF"]
        BAM["BAM Files"]
        FASTA["Reference FASTA"]
    end

    subgraph Process
        Load["Load Variants"]
        Prepare["Prepare\n(validate + left-align + decomp detect)"]
        Count["Count Reads"]
    end

    subgraph Output ["Required Output"]
        Result["VCF/MAF + Counts"]
    end

    subgraph OptOut ["Optional Output (--mfsd-parquet)"]
        Parquet[".fsd.parquet"]
    end

    VCF --> Load --> Prepare
    FASTA --> Prepare
    Prepare --> Count
    BAM --> Count
    Count --> Result
    Count -.->|"--mfsd-parquet"| Parquet

Genomic Binning

To minimise BAM I/O, the Rust engine groups co-located variants into genomic bins before counting. A single bam.fetch() is issued per bin, and the reads fetched for that region are classified against all variants in the bin in one pass. For clustered inputs (e.g. a MAF with hundreds of variants on the same gene) this reduces BAM seeks from O(N) per-variant to O(B) per-bin, typically 5-20× fewer I/O operations.

This design mirrors the original C++ GBCMS --max_block_size / --max_block_dist architecture, re-implemented as a pure Rust streaming algorithm.
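The bin-building pass above can be sketched in Python. This is an illustrative model only (the real implementation is Rust, in counting/engine.rs); the function and variable names are hypothetical, and bin padding is omitted for brevity. The constants mirror the documented BIN_WINDOW / BIN_MAX_VARIANTS values.

```python
BIN_WINDOW = 10_000      # maximum span of a single bin, in bp
BIN_MAX_VARIANTS = 200   # split the bin beyond this many variants

def make_bins(variants):
    """Group (chrom, pos) variants into bins of original-input indices.

    Variants are sorted by chrom + pos first, so arbitrarily ordered
    input still produces compact bins; results can be written back in
    the original input order using the returned indices.
    """
    order = sorted(range(len(variants)), key=lambda i: variants[i])
    bins, current = [], []
    for i in order:
        chrom, pos = variants[i]
        if current:
            start_chrom, start_pos = variants[current[0]]
            same_chrom = chrom == start_chrom
            in_window = pos - start_pos <= BIN_WINDOW
            if not (same_chrom and in_window) or len(current) >= BIN_MAX_VARIANTS:
                bins.append(current)   # close the bin: new chrom, too far, or full
                current = []
        current.append(i)
    if current:
        bins.append(current)
    return bins
```

Each resulting bin corresponds to one bam.fetch() call; in the real engine the bins are then processed in parallel with Rayon.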

flowchart LR
    Variants(["📋 Variants (any order)"]):::start --> Sort["Sort indices\nby chrom + pos"]
    Sort --> NewBin["Open new bin\nat variant pos"]
    NewBin --> Extend{"Next variant\nwithin window\n& same chrom?"}
    Extend -->|"Yes"| MaxCheck{"≥ 200 variants\nin bin?"}
    MaxCheck -->|"No: extend"| GrowBin["Add variant\nextend bin_end"]
    GrowBin --> Extend
    MaxCheck -->|"Yes: split"| NewBin
    Extend -->|"No / new chrom"| Pad["Pad bin:\nmax(repeat_span+2, 5) bp"]
    Pad --> Fetch["bam.fetch(bin_start, bin_end)\n(1 I/O per bin ⚡)"]:::io
    Fetch --> Classify["Classify each read\nagainst all variants in bin\n(Rayon parallel across bins)"]:::parallel
    Classify --> Out(["📊 BaseCounts per variant"]):::pass

    classDef start fill:#9b59b6,color:#fff,stroke:#7d3c98,stroke-width:2px;
    classDef io fill:#3498db,color:#fff,stroke:#2471a3,stroke-width:2px;
    classDef parallel fill:#27ae60,color:#fff,stroke:#1e8449,stroke-width:2px;
    classDef pass fill:#27ae60,color:#fff,stroke:#1e8449,stroke-width:2px;

Binning Parameters

| Parameter | Value | Notes |
|-----------|-------|-------|
| BIN_WINDOW | 10,000 bp | Maximum span of a single bin. Variants beyond this distance start a new bin. |
| BIN_MAX_VARIANTS | 200 | Maximum variants per bin. The split is enforced to prevent O(V × R) blowup in the shared-read classification loop. |
| Bin padding | max(repeat_span + 2, 5) bp | Ensures reads overlapping bin edges are captured. Uses each variant's detected tandem-repeat span. |
| Parallelism | Rayon par_iter() over bins | Each thread owns its own BamReader handle; no locking required across bins. |
| Output order | Preserved | Variants are sorted by index internally; results are written back in original input order. |

Performance Implication

For a MAF with 500 variants on TP53 (a 19 kb gene), the engine produces ~2-3 bins instead of 500 individual bam.fetch() calls. On high-depth targeted panels this can reduce wall-clock counting time by 5-20×.

Not a CLI flag

BIN_WINDOW and BIN_MAX_VARIANTS are internal performance constants; they do not affect output values. Parity testing (the D1 regression suite) validates that the binned and per-variant paths produce identical BaseCounts.


Coordinate System

All coordinates are normalized to a 0-based, half-open convention internally:

flowchart LR
    VCF["VCF (1-based)"] -->|"-1 (e.g. VCF:100 → internal:99)"| Internal["Internal (0-based)"]
    MAF["MAF (1-based)"] -->|"-1"| Internal
    Internal -->|"to Rust engine"| Rust["gbcms._rs"]
    Rust -->|"+1 (e.g. internal:99 → output:100)"| Output["Output (1-based)"]
| Format | System | Example |
|-----------|---------|----------|
| VCF input | 1-based | chr1:100 |
| Internal | 0-based | chr1:99 |
| Output | 1-based | chr1:100 |
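The conversion can be expressed as a pair of helpers. This is an illustrative sketch; the function names are hypothetical, not the actual kernel.py API:

```python
def to_internal(pos_1based: int) -> int:
    """1-based (VCF/MAF) position -> 0-based internal coordinate."""
    return pos_1based - 1

def to_output(pos_0based: int) -> int:
    """0-based internal coordinate -> 1-based output position."""
    return pos_0based + 1
```

The two functions are inverses, so a position survives a round trip through the pipeline unchanged: to_output(to_internal(100)) == 100.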

Formulas

Variant Allele Frequency (VAF)

VAF = AD / (RD + AD)

Where:

- AD = alternate allele read count
- RD = reference allele read count
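As a minimal sketch (the function name is illustrative, not the pipeline's API), with a guard for zero-depth sites:

```python
def vaf(ad: int, rd: int) -> float:
    """Variant allele frequency: ALT reads over total (REF + ALT) reads."""
    total = rd + ad
    return ad / total if total else 0.0
```

For example, 10 ALT reads against 30 REF reads gives a VAF of 10 / 40 = 0.25.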

Strand Bias (Fisher's Exact Test)

         |  Forward  Reverse  |
    -----+--------------------+
    Ref  |    a        b      |
    Alt  |    c        d      |
    -----+--------------------+

    p-value = Fisher's exact test on the 2×2 contingency table

A low p-value (< 0.05) indicates a potential strand-bias artifact.
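A two-sided Fisher's exact test on this table can be sketched from the hypergeometric distribution. This is an illustrative re-derivation, not the Rust code in shared/stats.rs; the p-value is the total probability of all tables with the same margins that are at least as extreme as the observed one:

```python
from math import comb

def fisher_exact_2x2(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test on the table [[a, b], [c, d]].

    Rows are ref/alt counts; columns are forward/reverse strand.
    """
    n = a + b + c + d
    row1, col1 = a + b, a + c

    def pmf(k: int) -> float:
        # Hypergeometric probability of a table with top-left cell k,
        # given fixed row and column sums.
        return comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)

    p_obs = pmf(a)
    k_min = max(0, row1 + col1 - n)
    k_max = min(row1, col1)
    eps = 1e-12  # tolerance for floating-point ties
    return sum(pmf(k) for k in range(k_min, k_max + 1)
               if pmf(k) <= p_obs * (1 + eps))
```

A perfectly balanced table (e.g. 5/5 on each strand for both alleles) yields p = 1.0, while ref reads only on the forward strand and alt reads only on the reverse strand yields a very small p-value, flagging the site.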


Module Structure

src/gbcms/
├── cli.py           # Typer CLI (dna, rna, normalize, run commands)
├── pipeline.py      # Orchestration (~450 LOC)
├── normalize.py     # Standalone normalization workflow
├── core/
│   └── kernel.py    # Coordinate normalization
├── io/
│   ├── input.py     # VcfReader, MafReader
│   └── output.py    # VcfWriter, MafWriter (mode-aware: DNA/RNA columns)
├── models/
│   └── core.py      # Pydantic configs (GbcmsDnaConfig, GbcmsRnaConfig, AlignmentConfig)
└── utils/
    └── logging.py   # Structured logging

rust/src/
├── lib.rs                    # PyO3 module exports
├── counting/
│   ├── mod.rs                # Submodule re-exports
│   ├── engine.rs             # Main loop, genomic binning, BAQ, UMI
│   ├── variant_checks.rs     # check_snp/mnp/ins/del/complex + Phase 3 dispatch
│   ├── alignment.rs          # Smith-Waterman implementation
│   ├── pairhmm.rs            # PairHMM alignment backend (marginalized)
│   ├── pangenome.rs          # Haplotype matrix construction
│   ├── wfa_router.rs         # WFA2 fast-path alignment
│   ├── rna.rs                # RNA validation, splicing, editing site lookup
│   ├── fragment.rs           # Re-export of shared::fragment (backward compat)
│   ├── mfsd.rs               # Mutant Fragment Size Distribution analysis
│   ├── parquet_writer.rs     # write_fsd_parquet(): native Parquet via ZSTD
│   └── utils.rs              # Helpers, reconstruction, soft-clip
├── normalize/
│   ├── mod.rs                # Submodule re-exports
│   ├── engine.rs             # Normalization pipeline
│   ├── left_align.rs         # bcftools-style left-alignment
│   ├── decomp.rs             # Homopolymer decomposition
│   ├── fasta.rs              # Reference sequence fetcher
│   ├── repeat.rs             # Tandem repeat detection
│   └── types.rs              # NormResult enum
├── shared/
│   ├── mod.rs                # Submodule re-exports
│   ├── fragment.rs           # FragmentEvidence, quality consensus, UMI grouping
│   ├── stats.rs              # Fisher's exact test (strand bias)
│   ├── filters.rs            # ReadFilter struct, FilterCounts (BAM flag checks)
│   ├── baq.rs                # BAQ heuristic (Li 2011)
│   └── bam_utils.rs          # median_qual, find_read_pos
└── types.rs                  # Variant, BaseCounts, PyO3 bindings

Configuration

All settings are provided via mode-specific Pydantic models:

flowchart TB
    Base["GbcmsBaseConfig"]:::base --> DnaConfig["GbcmsDnaConfig"]
    Base --> RnaConfig["GbcmsRnaConfig"]

    DnaConfig --> DnaD["mode=dna\nMAPQ=20"]
    RnaConfig --> RnaD["mode=rna, MAPQ=1\npairhmm relaxed RT gaps"]
    RnaConfig --> RnaX["enforce_strandedness\nrna_editing_db"]

    Base --> Filters["ReadFilters"]
    Base --> Quality["QualityThresholds"]
    Base --> Output2["OutputConfig"]
    Base --> Align["AlignmentConfig"]

    Filters --> FD["secondary · duplicates\nsupplementary · qc_failed\nimproper_pair · indel"]
    Quality --> QD["min_mapq · min_baseq\nfragment_qual_threshold"]
    Output2 --> OD["output_dir · format · suffix\ncolumn_prefix · mfsd · mfsd_parquet"]
    Align --> AD["backend: pairhmm|sw\nllr_threshold · gap_*_prob"]

    classDef base fill:#9b59b6,color:#fff,stroke:#7d3c98,stroke-width:2px;

See models/core.py for definitions.
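The layered pattern in the diagram can be sketched with plain dataclasses. This is an illustrative model only: the real classes in models/core.py are Pydantic, and everything here beyond the class names, mode values, and MAPQ defaults shown in the diagram is a hypothetical simplification.

```python
from dataclasses import dataclass, field

@dataclass
class QualityThresholds:
    # DNA defaults per the docs; RNA overrides min_mapq below.
    min_mapq: int = 20
    min_baseq: int = 20

@dataclass
class GbcmsBaseConfig:
    # Shared settings live on the base; mode-specific models extend it.
    quality: QualityThresholds = field(default_factory=QualityThresholds)

@dataclass
class GbcmsDnaConfig(GbcmsBaseConfig):
    mode: str = "dna"

@dataclass
class GbcmsRnaConfig(GbcmsBaseConfig):
    mode: str = "rna"
    # RNA mode relaxes MAPQ to 1, as shown in the diagram above.
    quality: QualityThresholds = field(
        default_factory=lambda: QualityThresholds(min_mapq=1)
    )
```

The inheritance means DNA- and RNA-specific defaults diverge in one place while filters, quality, output, and alignment settings stay shared on the base.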


Full Pipeline: End-to-End Example

Here's how a single variant is processed through the complete pipeline:

sequenceDiagram
    participant CLI as CLI (Python)
    participant Pipeline as Pipeline
    participant Reader as VCF/MAF Reader
    participant Kernel as Coordinate Kernel
    participant Rust as Rust Engine
    participant BAM as BAM File

    CLI->>Pipeline: run(config)
    Pipeline->>Reader: load variants
    Reader->>Kernel: normalize coordinates
    Kernel-->>Reader: 0-based Variant objects
    Reader-->>Pipeline: List[Variant]

    loop For each BAM sample
        Pipeline->>Rust: count_bam_binned(bam, variants, config)
        loop For each genomic bin (parallel via Rayon)
            Rust->>BAM: fetch(chrom, pos-5, pos+ref_len+5)
            BAM-->>Rust: Iterator of reads
            loop For each read
                Note over Rust: Apply filter cascade
                Note over Rust: Dispatch to type checker
                Note over Rust: Phase 1+2 haplotype reconstruction
                Note over Rust: Phase 2.5 edit-distance fast-path
                Note over Rust: Phase 3 WFA triage → PairHMM LLR
                Note over Rust: Update read + fragment counts
            end
            Note over Rust: Compute Fisher strand bias
        end
        Rust-->>Pipeline: Vec[BaseCounts]
    end

    Pipeline->>Pipeline: Write output (VCF/MAF)
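The outer per-sample loop from the sequence diagram can be sketched as follows. This is a hypothetical simplification of pipeline.py; the argument names and return shape are assumptions, and count_fn stands in for the Rust entry point (gbcms._rs.count_bam_binned) so the sketch is self-contained:

```python
def run_counting(bams, variants, config, count_fn):
    """Invoke the counting entry point once per BAM sample.

    bams: mapping of sample name -> BAM path.
    Returns a mapping of sample name -> per-variant counts.
    """
    results = {}
    for sample, bam_path in bams.items():
        # One call per sample; binning and Rayon parallelism
        # happen inside the (Rust) counting function.
        results[sample] = count_fn(bam_path, variants, config)
    return results
```

This keeps the Python side a thin orchestration layer: all read-level work happens behind the single PyO3 call per sample.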

Comparison with Original GBCMS

| Feature | Original GBCMS | gbcms |
|---------|----------------|-------|
| Counting algorithm | Region-based chunking, position matching | Per-variant CIGAR traversal |
| Indel detection | Exact position match only | Windowed scan (±5 bp) with 3-layer safeguards: sequence identity, closest match, reference context validation |
| Complex variants | Optional via --generic_counting | Always uses haplotype reconstruction |
| Complex quality handling | Exact match only (no quality awareness) | Masked comparison: bases below --min-baseq are masked out; ambiguity detection prevents false positives |
| Base quality filtering | No base quality threshold | Default --min-baseq 20 (Phred Q20) |
| MNP handling | Not explicit | Dedicated check_mnp with contiguity check |
| Fragment counting | Optional (--fragment_count), majority-rule | Always computed, quality-weighted consensus with discard |
| Positive strand counts | Optional (--positive_count) | Always computed |
| Strand bias | Not computed | Fisher's exact test (read + fragment level) |
| Fractional depth | --fragment_fractional_weight | Not implemented |
| Parallelism | OpenMP block-based | Rayon over genomic bins |