Developer Guide¶
Guide for contributing to gbcms.
Setup¶
# Clone
git clone https://github.com/msk-access/gbcms.git
cd gbcms
# Virtual environment
python -m venv .venv
source .venv/bin/activate
# Install Rust (if not installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Install maturin, then build and install the Rust extension
pip install maturin
maturin develop --release
# Verify
gbcms --version
# Alternative setup with conda
# Clone
git clone https://github.com/msk-access/gbcms.git
cd gbcms
# Create conda environment with build dependencies
# Note: clangdev (not clang) provides headers needed by bindgen
conda create -n gbcms-dev python=3.11 clangdev rust -c conda-forge
conda activate gbcms-dev
# Set libclang path for the Rust build
export LIBCLANG_PATH=$CONDA_PREFIX/lib
# Install maturin and build
pip install maturin
maturin develop --release
# Verify
gbcms --version
Project Structure¶
flowchart LR
subgraph Python["src/gbcms/"]
CLI["cli.py"] --> Pipeline["pipeline.py"]
CLI --> Normalize["normalize.py"]
Pipeline --> IO["io/"]
Pipeline --> Models["models/"]
end
subgraph Rust["rust/src/"]
Lib["lib.rs"] --> Count["counting/"]
Lib --> Norm["normalize/"]
Lib --> Shared["shared/"]
Count --> Eng["engine.rs"]
Count --> VC["variant_checks.rs"]
Count --> PH["pairhmm.rs"]
Count --> WFA["wfa_router.rs"]
Count --> RNA["rna.rs"]
Count --> Parquet["parquet_writer.rs"]
Norm --> LA["left_align.rs"]
Norm --> DC["decomp.rs"]
Shared --> Frag["fragment.rs"]
Shared --> Stats["stats.rs"]
Shared --> Filters["filters.rs"]
Shared --> BAQ["baq.rs"]
Shared --> BamU["bam_utils.rs"]
end
Pipeline --> Rust
Normalize --> Rust
Performance: Genomic Binning
The counting engine (counting/engine.rs) groups variants into ~10kb genomic bins before BAM traversal. Each bin issues one bam.fetch() call instead of one per variant, reducing I/O dramatically for clustered inputs (e.g., a MAF with hundreds of variants on the same gene). See Architecture → Genomic Binning.
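The binning idea can be sketched in a few lines of Python. This is only an illustration: it assumes fixed-grid bins keyed by `pos // 10_000` and a hypothetical `Variant` tuple, and the actual grouping logic in counting/engine.rs may differ.

```python
from collections import defaultdict
from typing import NamedTuple

BIN_SIZE = 10_000  # ~10kb as described above; the engine's exact value is an assumption


class Variant(NamedTuple):
    """Hypothetical minimal variant record (not the gbcms model)."""
    chrom: str
    pos: int


def bin_variants(variants, bin_size=BIN_SIZE):
    """Group variants by (chrom, pos // bin_size)."""
    bins = defaultdict(list)
    for v in variants:
        bins[(v.chrom, v.pos // bin_size)].append(v)
    return dict(bins)


def fetch_regions(bins):
    """Return one (chrom, start, end) window per bin, spanning its variants."""
    regions = []
    for (chrom, _index), members in sorted(bins.items()):
        start = min(v.pos for v in members)
        end = max(v.pos for v in members) + 1
        regions.append((chrom, start, end))
    return regions


variants = [
    Variant("chr17", 7_674_220),
    Variant("chr17", 7_674_894),
    Variant("chr17", 7_675_088),
    Variant("chr9", 133_738_363),
]
regions = fetch_regions(bin_variants(variants))
print(f"{len(variants)} variants -> {len(regions)} fetch calls")
```

Here the three clustered chr17 variants share one bin, so four variants need only two fetch windows instead of four.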
To observe binning at runtime:
Build Commands¶
# Development (fast)
maturin develop
# Release (optimized)
maturin develop --release
# Build wheel
maturin build --release --out dist
Regression Testing¶
Python Integration Tests (pytest)¶
The primary test suite covers CIGAR parsing, windowed indel detection, SNP/MNP/insertion/deletion/complex variant classification, shifted indel handling, and CLI validation:
pytest -v # Full suite
pytest tests/test_shifted_indels.py -v # Windowed + shifted indel-specific
pytest tests/test_variant_checks.py -v # Core allele-classification
Rust Unit Tests (cargo test)¶
Normalization engine tests cover left-alignment, repeat detection, adaptive padding, and homopolymer decomposition:
cd rust && cargo test
BAM Slice Regression Suite¶
For changes to counting logic (counting/engine.rs, counting/variant_checks.rs),
run against the BAM slice test set (54 BAM slices × 2 MAFs — see ~/Downloads/bam_slice_v2_8_0/):
# Rebuild first
pip install -e . --no-build-isolation
# Run on indels MAF
gbcms dna \
--variants filtered_indels_io.maf \
--bam-list bam_list.txt \
--fasta ref.fa \
--format maf \
--output-dir /tmp/regression_out/
# Diff against baseline
python scripts/concordance.py /tmp/regression_baseline/ /tmp/regression_out/
Key invariants to check:
| Check | Expected |
|---|---|
| ref_count + alt_count ≤ total_count | Must hold for every row (INVARIANT) |
| Δalt for own-sample covered variants | 0 vs sign-out t_alt (except known MAF annotation issues) |
| Non-own-sample ref_count changes | ≤1 count (non-determinism from parallel windowed scan) |
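The first and third checks can be automated with a small helper. The sketch below is not scripts/concordance.py; it assumes each output row has already been loaded into a dict with `ref_count`, `alt_count`, and `total_count` keys, and that rows are keyed by a hypothetical variant identifier.

```python
def invariant_violations(rows):
    """Return rows where ref_count + alt_count exceeds total_count."""
    return [
        r for r in rows
        if r["ref_count"] + r["alt_count"] > r["total_count"]
    ]


def ref_count_drift(baseline, current, tolerance=1):
    """Return keys whose ref_count moved by more than `tolerance` between runs."""
    shared = baseline.keys() & current.keys()
    return sorted(
        key for key in shared
        if abs(baseline[key]["ref_count"] - current[key]["ref_count"]) > tolerance
    )


# Illustrative data only; the keys and counts are made up.
baseline = {
    "v1": {"ref_count": 90, "alt_count": 10, "total_count": 100},
    "v2": {"ref_count": 50, "alt_count": 5, "total_count": 60},
}
current = {
    "v1": {"ref_count": 91, "alt_count": 10, "total_count": 101},
    "v2": {"ref_count": 58, "alt_count": 5, "total_count": 60},
}

bad_rows = invariant_violations(current.values())
drifted = ref_count_drift(baseline, current)
print("invariant violations:", len(bad_rows))
print("ref_count drift beyond tolerance:", drifted)
```

In this made-up data, `v2` fails both checks: its counts sum to more than `total_count`, and its `ref_count` moved by 8.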
Variant-Type-Specific Investigation¶
When debugging specific variant types, use targeted BAM slices:
| Variant Type | Key Examples | What to Check |
|---|---|---|
| Del+SNV (complex) | SOX9 GC→T, ABL1 AG→T | Routes to check_complex, not check_deletion; alt > 0 |
| Large deletion, REF=0 | NF2 ~100bp DEL | M-block REF fallback in check_complex; ref > 0 |
| Shifted large deletion | TP53 GACCGTGCAAGT→- | has_nearby_length_match Phase 3; alt matches sign-out |
| MNP/DNP | TERT (5bp), BRCA2 (2bp) | ALT recovery vs sign-out |
| Shifted insertion | JAK1 65306997 | Multi-allelic isolation, windowed INS scan |
Code Standards¶
Python¶
| Standard | Requirement |
|---|---|
| Type hints | All public functions |
| Docstrings | Google style |
| Exports | __all__ in every module |
| Logging | Use logging, not print() |
| Config | Pydantic models |
CLI Validation Standards¶
New CLI options must follow this four-layer validation order:
- Parse-time (Typer): Use Enum for constrained choices and min=/max= for numeric ranges. Typer rejects invalid values before the command body runs.
- Pre-model (cli.py command body): File extension checks, cross-option semantics (e.g. --preserve-barcode + non-MAF input), and charset validation. Log at ERROR and raise typer.Exit(code=1).
- Model-time (Pydantic): Business-logic constraints in models/core.py via Field(ge=..., le=...) and @field_validator.
- No silent skips: Missing inputs must fail fast or require an explicit opt-out (e.g. --lenient-bam). WARNING-level silent skips are unacceptable for missing required inputs.
The module-level docstring in cli.py documents this order and must be kept in sync when adding new options.
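The ordering can be illustrated with a dependency-free sketch. It deliberately swaps in stdlib stand-ins: an `Enum` lookup for Typer's parse-time check and a dataclass `__post_init__` for the Pydantic model. The `min_mapq` option, the `vcf` choice, and the assumption that `--lenient-bam` skips missing BAMs are hypothetical; only the layer order follows the list above.

```python
from dataclasses import dataclass
from enum import Enum
from pathlib import Path


class CliValidationError(Exception):
    """Raised where cli.py would log at ERROR and exit with code 1."""


class OutputFormat(str, Enum):  # layer 1 stand-in: constrained choices
    MAF = "maf"
    VCF = "vcf"  # hypothetical second choice


@dataclass
class RunConfig:  # layer 3 stand-in for a Pydantic model
    min_mapq: int  # hypothetical option

    def __post_init__(self):
        if not 0 <= self.min_mapq <= 60:
            raise CliValidationError(f"min_mapq out of range: {self.min_mapq}")


def validate(fmt, variants, bams, min_mapq,
             preserve_barcode=False, lenient_bam=False):
    # 1. Parse-time: unknown choices fail immediately (ValueError here).
    output_format = OutputFormat(fmt)
    # 2. Pre-model: cross-option semantics.
    if preserve_barcode and not variants.endswith(".maf"):
        raise CliValidationError("--preserve-barcode requires MAF input")
    # 3. Model-time: business-logic constraints.
    config = RunConfig(min_mapq=min_mapq)
    # 4. No silent skips: missing inputs fail unless explicitly opted out.
    missing = [b for b in bams if not Path(b).exists()]
    if missing and not lenient_bam:
        raise CliValidationError(f"Missing BAM files: {missing}")
    kept = [b for b in bams if b not in missing]
    return output_format, config, kept
```

Calling `validate("maf", "in.maf", ["/nonexistent/a.bam"], 20)` raises, while passing `lenient_bam=True` returns an empty BAM list, so the skip is always an explicit choice.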
Rust¶
| Standard | Requirement |
|---|---|
| Docs | /// on public items |
| Errors | anyhow::Result |
| Logging | log crate |
Git Workflow (git-flow)¶
gitGraph
commit id: "main"
branch develop
commit id: "develop"
branch feature/new-thing
commit id: "work"
checkout develop
merge feature/new-thing
branch release/X.Y.Z
commit id: "bump"
checkout main
merge release/X.Y.Z tag: "X.Y.Z"
checkout develop
merge release/X.Y.Z
checkout main
branch hotfix/urgent-fix
commit id: "critical fix"
checkout main
merge hotfix/urgent-fix tag: "X.Y.Z+1"
checkout develop
merge hotfix/urgent-fix
| Branch | Purpose |
|---|---|
| main | Production releases |
| develop | Integration |
| feature/* | New features |
| release/* | Release candidates |
| hotfix/* | Production fixes |
Pre-Commit Checklist¶
Before committing, verify all checks pass:
- ruff check src/ tests/ (Python linting)
- black --check src/ tests/ (Python formatting)
- mypy src/ (type checking, no errors)
- cd rust && cargo clippy --all-targets -- -D warnings (Rust linting, strict, warnings-as-errors)
- cd rust && cargo test (Rust unit tests)
- pytest -v (Python/integration tests)
- Type hints complete on all new public functions
- Docstrings added (Google style for Python; /// on public Rust items)
- No dead code (removed, not commented out)
- No silent failures: missing inputs must fail fast or require explicit opt-out
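To run all Python checks in one pass:
ruff check src/ tests/ && black --check src/ tests/ && mypy src/ && pytest -v
To run all Rust checks:
cd rust && cargo clippy --all-targets -- -D warnings && cargo test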
Environment Variables¶
| Variable | Default | Description |
|---|---|---|
| GBCMS_LOG_LEVEL | INFO | Logging level |
| RUST_LOG | (none) | Rust logging |
Generating a PDF¶
The docs include a combined print page (via mkdocs-print-site-plugin) that can be printed to PDF directly from Chrome — no extra tools required.
Steps:
- Start the docs server (for example, with mkdocs serve):
- Open http://127.0.0.1:8000/gbcms/print_page/ in Google Chrome and wait ~10s for Mermaid diagrams to render.
- Open the DevTools console (F12 → Console tab) and paste:
- Press Cmd+P → Save as PDF (A4, default margins, ✅ background graphics).
Automated PDF
For a headless/automated version, see ~/Downloads/gbcms-pdf-generator/ (local archive, not in repo) — generates site/documentation.pdf via node generate_pdf.mjs.
Related¶
- Testing Guide — Running tests and adding new test cases
- Release Guide — Release process and versioning
- Contributing — Contribution workflow and code standards
- Architecture — System design and module structure