Skip to content

ArchitectureΒΆ

gbcms uses a hybrid Python/Rust architecture for maximum performance.

System OverviewΒΆ

flowchart TB
    subgraph Python [🐍 Python Layer]
        CLI[CLI - cli.py] --> Pipeline[Orchestration - pipeline.py]
        Pipeline --> Reader[Input Adapters]
        Pipeline --> Writer[Output Writers]
    end

    subgraph Rust [πŸ¦€ Rust Layer]
        Counter[count_bam - counting/engine.rs] --> CIGAR[CIGAR Parser]
        Counter --> Stats[Strand Bias - stats.rs]
        ParquetWriter[write_fsd_parquet - parquet_writer.rs]
    end

    Pipeline -->|PyO3| Counter
    Counter -->|BaseCounts| Pipeline
    Pipeline -->|mfsd-parquet| ParquetWriter

    classDef pythonStyle fill:#3776ab,color:#fff,stroke:#2c5f8a,stroke-width:2px;
    classDef rustStyle fill:#dea584,color:#000,stroke:#c48a6a,stroke-width:2px;
    class Python pythonStyle;
    class Rust rustStyle;
Use mouse to pan and zoom

Data FlowΒΆ

flowchart LR
    subgraph Input
        VCF[VCF/MAF]
        BAM[BAM Files]
        FASTA[Reference]
    end

    subgraph Process
        Load[Load Variants]
        Prepare["Prepare (validate + left-align + decomp detect)"]
        Count[Count Reads]
    end

    subgraph Output
        Result[VCF/MAF + Counts]
        Parquet[".fsd.parquet (optional, --mfsd-parquet)"]
    end

    VCF --> Load --> Prepare
    FASTA --> Prepare
    Prepare --> Count
    BAM --> Count
    Count --> Result
    Count -.->|"--mfsd-parquet"| Parquet
Use mouse to pan and zoom

Coordinate SystemΒΆ

All coordinates normalized to 0-based, half-open internally:

flowchart LR
    VCF["VCF (1-based)"] -->|"-1"| Internal["Internal (0-based)"]
    MAF["MAF (1-based)"] -->|"-1"| Internal
    Internal -->|"to Rust"| Rust["gbcms._rs"]
    Rust -->|"+1"| Output["Output (1-based)"]
Use mouse to pan and zoom
Format System Example
VCF input 1-based chr1:100
Internal 0-based chr1:99
Output 1-based chr1:100

FormulasΒΆ

Variant Allele Frequency (VAF)ΒΆ

VAF = AD / (RD + AD)

Where: - AD = Alternate allele read count - RD = Reference allele read count

Strand Bias (Fisher's Exact Test)ΒΆ

         |  Forward  Reverse  |
    -----+--------------------+
    Ref  |    a        b      |
    Alt  |    c        d      |
    -----+--------------------+

    p-value = Fisher's exact test on 2Γ—2 contingency table

Low p-value (< 0.05) indicates potential strand bias artifact.


Module StructureΒΆ

src/gbcms/
β”œβ”€β”€ cli.py           # Typer CLI (~350 LOC)
β”œβ”€β”€ pipeline.py      # Orchestration (~450 LOC)
β”œβ”€β”€ normalize.py     # Standalone normalization workflow
β”œβ”€β”€ core/
β”‚   └── kernel.py    # Coordinate normalization
β”œβ”€β”€ io/
β”‚   β”œβ”€β”€ input.py     # VcfReader, MafReader
β”‚   └── output.py    # VcfWriter, MafWriter
β”œβ”€β”€ models/
β”‚   └── core.py      # Pydantic config (GbcmsConfig, AlignmentConfig)
└── utils/
    └── logging.py   # Structured logging

rust/src/
β”œβ”€β”€ lib.rs                    # PyO3 module exports
β”œβ”€β”€ parquet_writer.rs         # write_fsd_parquet() β€” native Parquet via ZSTD (--mfsd-parquet)
β”œβ”€β”€ counting/
β”‚   β”œβ”€β”€ mod.rs                # Submodule re-exports
β”‚   β”œβ”€β”€ engine.rs             # Main loop, DP gating, read iteration
β”‚   β”œβ”€β”€ variant_checks.rs     # check_snp/mnp/ins/del/complex
β”‚   β”œβ”€β”€ alignment.rs          # Smith-Waterman implementation
β”‚   β”œβ”€β”€ pairhmm.rs            # PairHMM alignment backend
β”‚   β”œβ”€β”€ fragment.rs           # FragmentEvidence, quality consensus
β”‚   β”œβ”€β”€ mfsd.rs               # Mutant Fragment Size Distribution analysis
β”‚   └── utils.rs              # Helpers, reconstruction, soft-clip
β”œβ”€β”€ normalize/
β”‚   β”œβ”€β”€ mod.rs                # Submodule re-exports
β”‚   β”œβ”€β”€ engine.rs             # Normalization pipeline
β”‚   β”œβ”€β”€ left_align.rs         # bcftools-style left-alignment
β”‚   β”œβ”€β”€ decomp.rs             # Homopolymer decomposition
β”‚   β”œβ”€β”€ fasta.rs              # Reference sequence fetcher
β”‚   β”œβ”€β”€ repeat.rs             # Tandem repeat detection
β”‚   └── types.rs              # NormResult enum
β”œβ”€β”€ stats.rs                  # Fisher's exact test
└── types.rs                  # Variant, BaseCounts, PyO3 bindings

ConfigurationΒΆ

All settings via GbcmsConfig (Pydantic model):

flowchart TB
    GbcmsConfig --> OutputConfig[Output Settings]
    GbcmsConfig --> ReadFilters[Read Filters]
    GbcmsConfig --> QualityThresholds[Quality Thresholds]
    GbcmsConfig --> AlignmentConfig[Alignment Backend]

    OutputConfig --> D1["output_dir, format, suffix, column_prefix, mfsd, mfsd_parquet"]
    ReadFilters --> D2[exclude_secondary, exclude_duplicates]
    QualityThresholds --> D3["min_mapq, min_baseq, fragment_qual_threshold"]
    AlignmentConfig --> D4["backend: sw|hmm, llr_threshold, gap_*_prob"]
Use mouse to pan and zoom

See models/core.py for definitions.


Full Pipeline: End-to-End ExampleΒΆ

Here's how a single variant is processed through the complete pipeline:

sequenceDiagram
    participant CLI as CLI (Python)
    participant Pipeline as Pipeline
    participant Reader as VCF/MAF Reader
    participant Kernel as Coordinate Kernel
    participant Rust as Rust Engine
    participant BAM as BAM File

    CLI->>Pipeline: run(config)
    Pipeline->>Reader: load variants
    Reader->>Kernel: normalize coordinates
    Kernel-->>Reader: 0-based Variant objects
    Reader-->>Pipeline: List[Variant]

    loop For each BAM sample
        Pipeline->>Rust: count_bam(bam, variants, filters)
        loop For each variant (parallel via Rayon)
            Rust->>BAM: fetch(chrom, posβˆ’5, pos+ref_len+5)
            BAM-->>Rust: Iterator of reads
            loop For each read
                Note over Rust: Apply filter cascade
                Note over Rust: Dispatch to type checker
                Note over Rust: Update read + fragment counts
            end
            Note over Rust: Compute Fisher strand bias
        end
        Rust-->>Pipeline: Vec[BaseCounts]
    end

    Pipeline->>Pipeline: Write output (VCF/MAF)
Use mouse to pan and zoom

Comparison with Original GBCMSΒΆ

Feature Original GBCMS gbcms
Counting algorithm Region-based chunking, position matching Per-variant CIGAR traversal
Indel detection Exact position match only Windowed scan (Β±5bp) with 3-layer safeguards: sequence identity, closest match, reference context validation
Complex variants Optional via --generic_counting Always uses haplotype reconstruction
Complex quality handling Exact match only (no quality awareness) Masked comparison β€” bases below --min-baseq are masked out, ambiguity detection prevents false positives
Base quality filtering No base quality threshold Default --min-baseq 20 (Phred Q20)
MNP handling Not explicit Dedicated check_mnp with contiguity check
Fragment counting Optional (--fragment_count), majority-rule Always computed, quality-weighted consensus with discard
Positive strand counts Optional (--positive_count) Always computed
Strand bias Not computed Fisher's exact test (read + fragment level)
Fractional depth --fragment_fractional_weight Not implemented
Parallelism OpenMP block-based Rayon per-variant