ArchitectureΒΆ
gbcms uses a hybrid Python/Rust architecture for maximum performance.
System OverviewΒΆ
flowchart TB
subgraph Python [π Python Layer]
CLI[CLI - cli.py] --> Pipeline[Orchestration - pipeline.py]
Pipeline --> Reader[Input Adapters]
Pipeline --> Writer[Output Writers]
end
subgraph Rust [π¦ Rust Layer]
Counter[count_bam - counting/engine.rs] --> CIGAR[CIGAR Parser]
Counter --> Stats[Strand Bias - stats.rs]
ParquetWriter[write_fsd_parquet - parquet_writer.rs]
end
Pipeline -->|PyO3| Counter
Counter -->|BaseCounts| Pipeline
Pipeline -->|mfsd-parquet| ParquetWriter
classDef pythonStyle fill:#3776ab,color:#fff,stroke:#2c5f8a,stroke-width:2px;
classDef rustStyle fill:#dea584,color:#000,stroke:#c48a6a,stroke-width:2px;
class Python pythonStyle;
class Rust rustStyle;
Use mouse to pan and zoom
Data FlowΒΆ
flowchart LR
subgraph Input
VCF[VCF/MAF]
BAM[BAM Files]
FASTA[Reference]
end
subgraph Process
Load[Load Variants]
Prepare["Prepare (validate + left-align + decomp detect)"]
Count[Count Reads]
end
subgraph Output
Result[VCF/MAF + Counts]
Parquet[".fsd.parquet (optional, --mfsd-parquet)"]
end
VCF --> Load --> Prepare
FASTA --> Prepare
Prepare --> Count
BAM --> Count
Count --> Result
Count -.->|"--mfsd-parquet"| Parquet
Use mouse to pan and zoom
Coordinate SystemΒΆ
All coordinates normalized to 0-based, half-open internally:
flowchart LR
VCF["VCF (1-based)"] -->|"-1"| Internal["Internal (0-based)"]
MAF["MAF (1-based)"] -->|"-1"| Internal
Internal -->|"to Rust"| Rust["gbcms._rs"]
Rust -->|"+1"| Output["Output (1-based)"]
Use mouse to pan and zoom
| Format | System | Example |
|---|---|---|
| VCF input | 1-based | chr1:100 |
| Internal | 0-based | chr1:99 |
| Output | 1-based | chr1:100 |
FormulasΒΆ
Variant Allele Frequency (VAF)ΒΆ
Where: - AD = Alternate allele read count - RD = Reference allele read count
Strand Bias (Fisher's Exact Test)ΒΆ
| Forward Reverse |
-----+--------------------+
Ref | a b |
Alt | c d |
-----+--------------------+
p-value = Fisher's exact test on 2Γ2 contingency table
Low p-value (< 0.05) indicates potential strand bias artifact.
Module StructureΒΆ
src/gbcms/
βββ cli.py # Typer CLI (~350 LOC)
βββ pipeline.py # Orchestration (~450 LOC)
βββ normalize.py # Standalone normalization workflow
βββ core/
β βββ kernel.py # Coordinate normalization
βββ io/
β βββ input.py # VcfReader, MafReader
β βββ output.py # VcfWriter, MafWriter
βββ models/
β βββ core.py # Pydantic config (GbcmsConfig, AlignmentConfig)
βββ utils/
βββ logging.py # Structured logging
rust/src/
βββ lib.rs # PyO3 module exports
βββ parquet_writer.rs # write_fsd_parquet() β native Parquet via ZSTD (--mfsd-parquet)
βββ counting/
β βββ mod.rs # Submodule re-exports
β βββ engine.rs # Main loop, DP gating, read iteration
β βββ variant_checks.rs # check_snp/mnp/ins/del/complex
β βββ alignment.rs # Smith-Waterman implementation
β βββ pairhmm.rs # PairHMM alignment backend
β βββ fragment.rs # FragmentEvidence, quality consensus
β βββ mfsd.rs # Mutant Fragment Size Distribution analysis
β βββ utils.rs # Helpers, reconstruction, soft-clip
βββ normalize/
β βββ mod.rs # Submodule re-exports
β βββ engine.rs # Normalization pipeline
β βββ left_align.rs # bcftools-style left-alignment
β βββ decomp.rs # Homopolymer decomposition
β βββ fasta.rs # Reference sequence fetcher
β βββ repeat.rs # Tandem repeat detection
β βββ types.rs # NormResult enum
βββ stats.rs # Fisher's exact test
βββ types.rs # Variant, BaseCounts, PyO3 bindings
ConfigurationΒΆ
All settings via GbcmsConfig (Pydantic model):
flowchart TB
GbcmsConfig --> OutputConfig[Output Settings]
GbcmsConfig --> ReadFilters[Read Filters]
GbcmsConfig --> QualityThresholds[Quality Thresholds]
GbcmsConfig --> AlignmentConfig[Alignment Backend]
OutputConfig --> D1["output_dir, format, suffix, column_prefix, mfsd, mfsd_parquet"]
ReadFilters --> D2[exclude_secondary, exclude_duplicates]
QualityThresholds --> D3["min_mapq, min_baseq, fragment_qual_threshold"]
AlignmentConfig --> D4["backend: sw|hmm, llr_threshold, gap_*_prob"]
Use mouse to pan and zoom
See models/core.py for definitions.
Full Pipeline: End-to-End ExampleΒΆ
Here's how a single variant is processed through the complete pipeline:
sequenceDiagram
participant CLI as CLI (Python)
participant Pipeline as Pipeline
participant Reader as VCF/MAF Reader
participant Kernel as Coordinate Kernel
participant Rust as Rust Engine
participant BAM as BAM File
CLI->>Pipeline: run(config)
Pipeline->>Reader: load variants
Reader->>Kernel: normalize coordinates
Kernel-->>Reader: 0-based Variant objects
Reader-->>Pipeline: List[Variant]
loop For each BAM sample
Pipeline->>Rust: count_bam(bam, variants, filters)
loop For each variant (parallel via Rayon)
Rust->>BAM: fetch(chrom, posβ5, pos+ref_len+5)
BAM-->>Rust: Iterator of reads
loop For each read
Note over Rust: Apply filter cascade
Note over Rust: Dispatch to type checker
Note over Rust: Update read + fragment counts
end
Note over Rust: Compute Fisher strand bias
end
Rust-->>Pipeline: Vec[BaseCounts]
end
Pipeline->>Pipeline: Write output (VCF/MAF)
Use mouse to pan and zoom
Comparison with Original GBCMSΒΆ
| Feature | Original GBCMS | gbcms |
|---|---|---|
| Counting algorithm | Region-based chunking, position matching | Per-variant CIGAR traversal |
| Indel detection | Exact position match only | Windowed scan (Β±5bp) with 3-layer safeguards: sequence identity, closest match, reference context validation |
| Complex variants | Optional via --generic_counting |
Always uses haplotype reconstruction |
| Complex quality handling | Exact match only (no quality awareness) | Masked comparison β bases below --min-baseq are masked out, ambiguity detection prevents false positives |
| Base quality filtering | No base quality threshold | Default --min-baseq 20 (Phred Q20) |
| MNP handling | Not explicit | Dedicated check_mnp with contiguity check |
| Fragment counting | Optional (--fragment_count), majority-rule |
Always computed, quality-weighted consensus with discard |
| Positive strand counts | Optional (--positive_count) |
Always computed |
| Strand bias | Not computed | Fisher's exact test (read + fragment level) |
| Fractional depth | --fragment_fractional_weight |
Not implemented |
| Parallelism | OpenMP block-based | Rayon per-variant |
RelatedΒΆ
- Allele Classification β How each variant type is counted
- Variant Normalization β How variants are prepared before counting
- Input Formats β VCF and MAF specifications
- Glossary β Term definitions
abbreviations