
Architecture

py-gbcms uses a hybrid Python/Rust architecture: Python handles the CLI, orchestration, and file I/O, while the performance-critical read counting runs in Rust.

System Overview

flowchart TB
    subgraph Python["🐍 Python Layer"]
        CLI[CLI<br/>cli.py] --> Pipeline[Orchestration<br/>pipeline.py]
        Pipeline --> Reader[Input Adapters<br/>VcfReader, MafReader]
        Pipeline --> Writer[Output Writers<br/>VcfWriter, MafWriter]
    end

    subgraph Rust["🦀 Rust Layer (gbcms._rs)"]
        Counter[count_bam<br/>counting.rs] --> CIGAR[CIGAR Parser]
        Counter --> Stats[Strand Bias<br/>stats.rs]
    end

    Pipeline -->|"PyO3"| Counter
    Counter -->|"BaseCounts"| Pipeline

    style Python fill:#3776ab,color:#fff
    style Rust fill:#dea584,color:#000

Data Flow

flowchart LR
    subgraph Input
        VCF[VCF/MAF]
        BAM[BAM Files]
        FASTA[Reference]
    end

    subgraph Process
        Load[Load Variants]
        Validate[Validate vs Ref]
        Count[Count Reads]
    end

    subgraph Output
        Result[VCF/MAF + Counts]
    end

    VCF --> Load --> Validate
    FASTA --> Validate
    Validate --> Count
    BAM --> Count
    Count --> Result
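
The load, validate, and count stages above can be outlined in a few lines of Python. This is only an illustrative sketch: `Variant`, `validate_against_reference`, and the in-memory reference dictionary are hypothetical stand-ins rather than names from py-gbcms, and the real read counting happens in the Rust layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical stand-in for a loaded variant (0-based position)."""
    chrom: str
    pos: int  # 0-based start of the REF allele
    ref: str
    alt: str


def validate_against_reference(variant: Variant, reference: dict[str, str]) -> bool:
    """Return True if the variant's REF allele matches the reference sequence."""
    seq = reference.get(variant.chrom)
    if seq is None:
        return False  # unknown contig
    end = variant.pos + len(variant.ref)
    return seq[variant.pos:end].upper() == variant.ref.upper()


# Toy reference standing in for a FASTA file.
reference = {"chr1": "ACGTACGT"}
variants = [
    Variant("chr1", 2, "G", "A"),  # reference base at index 2 is G: passes
    Variant("chr1", 3, "C", "T"),  # reference base at index 3 is T: rejected
]

# Only validated variants would be handed on to the counting stage.
valid = [v for v in variants if validate_against_reference(v, reference)]
```

Validating before counting means a mismatched REF allele is caught early, before any BAM file is touched.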

Coordinate System

All coordinates are normalized to 0-based, half-open intervals internally:

flowchart LR
    VCF["VCF (1-based)"] -->|"-1"| Internal["Internal (0-based)"]
    MAF["MAF (1-based)"] -->|"-1"| Internal
    Internal -->|"to Rust"| Rust["gbcms._rs"]
    Rust -->|"+1"| Output["Output (1-based)"]

| Format    | System  | Example  |
|-----------|---------|----------|
| VCF input | 1-based | chr1:100 |
| Internal  | 0-based | chr1:99  |
| Output    | 1-based | chr1:100 |
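
As a minimal sketch of the conversion, the helpers below shift positions between the two systems. The function names are hypothetical and chosen for illustration; the project's own normalization lives in `core/kernel.py` and may be structured differently.

```python
def to_internal(pos_1based: int) -> int:
    """Convert a 1-based VCF/MAF position to the 0-based internal form."""
    if pos_1based < 1:
        raise ValueError("1-based positions start at 1")
    return pos_1based - 1


def to_output(pos_0based: int) -> int:
    """Convert a 0-based internal position back to 1-based for output."""
    if pos_0based < 0:
        raise ValueError("0-based positions start at 0")
    return pos_0based + 1


def internal_interval(pos_1based: int, ref: str) -> tuple[int, int]:
    """Return the 0-based, half-open [start, end) interval covering REF."""
    start = to_internal(pos_1based)
    return start, start + len(ref)
```

With a half-open interval, the length of the REF allele is simply `end - start`, which avoids off-by-one adjustments when slicing the reference.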

Formulas

Variant Allele Frequency (VAF)

VAF = AD / (RD + AD)

Where:

- AD = Alternate allele read count
- RD = Reference allele read count
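
A small sketch of the formula in Python follows. The function name is hypothetical, and returning 0.0 when no reads cover the site is an assumption made here for illustration; the tool's own handling of zero depth is not stated above.

```python
def vaf(ad: int, rd: int) -> float:
    """Variant allele frequency: AD / (RD + AD)."""
    depth = rd + ad
    if depth == 0:
        # Assumption: report 0.0 rather than dividing by zero.
        return 0.0
    return ad / depth
```

For example, 5 alternate reads and 15 reference reads give a VAF of 5 / 20 = 0.25.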

Strand Bias (Fisher's Exact Test)

         |  Forward  Reverse  |
    -----+--------------------+
    Ref  |    a        b      |
    Alt  |    c        d      |
    -----+--------------------+

    p-value = Fisher's exact test on 2×2 contingency table

A low p-value (< 0.05) indicates a potential strand-bias artifact.
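
To make the test concrete, here is a pure-Python sketch of a two-sided Fisher's exact test on the table above. It is not the implementation in `stats.rs`, which may differ (for example in sidedness or numerical approach); the function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d  # ref reads, alt reads
    col1 = a + c               # forward-strand reads
    n = row1 + row2
    if n == 0:
        return 1.0  # no reads, no evidence of bias
    denom = comb(n, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x ref-forward reads with fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    probs = [prob(x) for x in range(lo, hi + 1)]
    # Sum every table that is no more probable than the observed one;
    # the small tolerance guards against floating-point ties.
    p = sum(q for q in probs if q <= p_obs * (1 + 1e-7))
    return min(1.0, p)
```

A balanced table such as a = b = c = d = 10 yields a p-value of 1, while a table where all reference reads are forward and all alternate reads are reverse (a = 10, b = 0, c = 0, d = 10) yields a p-value far below 0.05.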


Module Structure

src/gbcms/
├── cli.py           # Typer CLI
├── pipeline.py      # Orchestration
├── core/
│   └── kernel.py    # Coordinate normalization
├── io/
│   ├── input.py     # VcfReader, MafReader
│   └── output.py    # VcfWriter, MafWriter
├── models/
│   └── core.py      # Pydantic config
└── utils/
    └── logging.py   # Structured logging

rust/src/
├── lib.rs           # PyO3 module (_rs)
├── counting.rs      # BAM processing
├── stats.rs         # Fisher's exact test
└── types.rs         # Variant, BaseCounts

Configuration

All settings are defined on GbcmsConfig (a Pydantic model):

flowchart TB
    GbcmsConfig --> OutputConfig[Output Settings]
    GbcmsConfig --> ReadFilters[Read Filters]
    GbcmsConfig --> QualityThresholds[Quality Thresholds]

    OutputConfig --> D1[output_dir, format, suffix]
    ReadFilters --> D2[exclude_secondary, exclude_duplicates]
    QualityThresholds --> D3[min_mapq, min_baseq]

See models/core.py for definitions.
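
The grouping in the diagram could look roughly like the sketch below. It uses standard-library dataclasses instead of Pydantic so that it runs without dependencies. The field names come from the diagram, but the types, the default values, and the attribute names `output`, `filters`, and `quality` are placeholders assumed for illustration, not the project's actual definitions.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class OutputConfig:
    output_dir: Path = Path(".")  # placeholder default
    format: str = "vcf"           # placeholder default
    suffix: str = ""              # placeholder default


@dataclass
class ReadFilters:
    exclude_secondary: bool = False   # placeholder default
    exclude_duplicates: bool = False  # placeholder default


@dataclass
class QualityThresholds:
    min_mapq: int = 0   # placeholder default
    min_baseq: int = 0  # placeholder default

    def __post_init__(self) -> None:
        # Reject nonsensical thresholds at construction time.
        if self.min_mapq < 0 or self.min_baseq < 0:
            raise ValueError("quality thresholds must be non-negative")


@dataclass
class GbcmsConfig:
    output: OutputConfig = field(default_factory=OutputConfig)
    filters: ReadFilters = field(default_factory=ReadFilters)
    quality: QualityThresholds = field(default_factory=QualityThresholds)


# Override only the thresholds; everything else keeps its default.
config = GbcmsConfig(quality=QualityThresholds(min_mapq=20, min_baseq=20))
```

Grouping related settings into nested objects keeps each concern (output, read filtering, quality) independently overridable.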