Developer Guide

Guide for contributing to gbcms.

Setup

Modern Linux (Ubuntu 22.04+, RHEL 9+):

# Clone
git clone https://github.com/msk-access/gbcms.git
cd gbcms

# Virtual environment
python -m venv .venv
source .venv/bin/activate

# Install Rust (if not installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Install (builds Rust extension)
maturin develop --release

# Verify
gbcms --version

Legacy Linux (RHEL 8 / HPC):

# Clone
git clone https://github.com/msk-access/gbcms.git
cd gbcms

# Create conda environment with build dependencies
# Note: clangdev (not clang) provides headers needed by bindgen
conda create -n gbcms-dev python=3.11 clangdev rust -c conda-forge
conda activate gbcms-dev

# Set libclang path for the Rust build
export LIBCLANG_PATH=$CONDA_PREFIX/lib

# Install maturin and build
pip install maturin
maturin develop --release

# Verify
gbcms --version

macOS:

# Clone
git clone https://github.com/msk-access/gbcms.git
cd gbcms

# Install Rust (if not installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Virtual environment
python -m venv .venv
source .venv/bin/activate

# Install (builds Rust extension)
maturin develop --release

# Verify
gbcms --version

Project Structure

flowchart LR
    subgraph Python["src/gbcms/"]
        CLI["cli.py"] --> Pipeline["pipeline.py"]
        CLI --> Normalize["normalize.py"]
        Pipeline --> IO["io/"]
        Pipeline --> Models["models/"]
    end

    subgraph Rust["rust/src/"]
        Lib["lib.rs"] --> Count["counting/"]
        Lib --> Norm["normalize/"]
        Lib --> Shared["shared/"]
        Count --> Eng["engine.rs"]
        Count --> VC["variant_checks.rs"]
        Count --> PH["pairhmm.rs"]
        Count --> WFA["wfa_router.rs"]
        Count --> RNA["rna.rs"]
        Count --> Parquet["parquet_writer.rs"]
        Norm --> LA["left_align.rs"]
        Norm --> DC["decomp.rs"]
        Shared --> Frag["fragment.rs"]
        Shared --> Stats["stats.rs"]
        Shared --> Filters["filters.rs"]
        Shared --> BAQ["baq.rs"]
        Shared --> BamU["bam_utils.rs"]
    end

    Pipeline --> Rust
    Normalize --> Rust


Performance: Genomic Binning

The counting engine (counting/engine.rs) groups variants into ~10kb genomic bins before BAM traversal. Each bin issues one bam.fetch() call instead of one per variant, reducing I/O dramatically for clustered inputs (e.g., a MAF with hundreds of variants on the same gene). See Architecture → Genomic Binning.

To observe binning at runtime:

RUST_LOG=info gbcms dna ... 2>&1 | grep "Built.*bins"
# Example: "Built 42 genomic bins from 1247 variants (window=10000bp)"
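
The grouping idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the real engine is written in Rust, and the fixed-width window keyed by `pos // window` is one plausible binning scheme rather than the confirmed one. The names `Variant`, `bin_variants`, and `fetch_regions` are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Variant(NamedTuple):
    """Hypothetical minimal variant record (not the gbcms model)."""
    chrom: str
    pos: int  # start position


def bin_variants(variants, window=10_000):
    """Group variants by (chromosome, fixed-width window index).

    Assumption: bins are fixed windows; the Rust engine may cluster differently.
    """
    bins = defaultdict(list)
    for v in variants:
        bins[(v.chrom, v.pos // window)].append(v)
    return dict(bins)


def fetch_regions(bins):
    """Return one (chrom, start, end) region per bin, spanning its variants."""
    regions = []
    for (chrom, _), members in sorted(bins.items()):
        start = min(v.pos for v in members)
        end = max(v.pos for v in members)
        regions.append((chrom, start, end))
    return regions


variants = [
    Variant("chr17", 7_674_220),
    Variant("chr17", 7_675_088),
    Variant("chr9", 133_738_300),
]
bins = bin_variants(variants)
print(f"Built {len(bins)} genomic bins from {len(variants)} variants")
```

Here the two chr17 variants fall into the same window, so three variants need only two fetches. The saving grows with how tightly the input variants cluster.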

Build Commands

# Development (fast)
maturin develop

# Release (optimized)
maturin develop --release

# Build wheel
maturin build --release --out dist

Regression Testing

Python Integration Tests (pytest)

The primary test suite covers CIGAR parsing, windowed indel detection, SNP/MNP/insertion/deletion/complex variant classification, shifted indel handling, and CLI validation:

pytest -v                                    # Full suite
pytest tests/test_shifted_indels.py -v      # Windowed + shifted indel-specific
pytest tests/test_variant_checks.py -v      # Core allele-classification

Rust Unit Tests (cargo test)

The Rust unit tests cover the normalization engine: left-alignment, repeat detection, adaptive padding, and homopolymer decomposition.

cd rust && cargo test

BAM Slice Regression Suite

For changes to counting logic (counting/engine.rs, counting/variant_checks.rs), run against the BAM slice test set (54 BAM slices × 2 MAFs — see ~/Downloads/bam_slice_v2_8_0/):

# Rebuild first
pip install -e . --no-build-isolation

# Run on indels MAF
gbcms dna \
    --variants filtered_indels_io.maf \
    --bam-list bam_list.txt \
    --fasta ref.fa \
    --format maf \
    --output-dir /tmp/regression_out/

# Diff against baseline
python scripts/concordance.py /tmp/regression_baseline/ /tmp/regression_out/

Key invariants to check:

| Check | Expected |
| --- | --- |
| `ref_count + alt_count ≤ total_count` | Must hold for every row (INVARIANT) |
| Δalt for own-sample covered variants | 0 vs sign-out `t_alt` (except known MAF annotation issues) |
| Non-own-sample ref_count changes | ≤1 count (non-determinism from parallel windowed scan) |
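
The first invariant is easy to check mechanically. The sketch below is a minimal example, assuming the output is tab-separated and has columns named `ref_count`, `alt_count`, and `total_count`, as in the table above. The real output headers may differ, and `find_violations` is a hypothetical helper, not part of `scripts/concordance.py`.

```python
import csv
import io


def find_violations(rows):
    """Return the rows where ref_count + alt_count exceeds total_count."""
    bad = []
    for row in rows:
        ref, alt, total = (
            int(row[key]) for key in ("ref_count", "alt_count", "total_count")
        )
        if ref + alt > total:
            bad.append(row)
    return bad


# Tiny in-memory TSV standing in for one output file (values are made up).
sample = io.StringIO(
    "variant\tref_count\talt_count\ttotal_count\n"
    "v1\t90\t10\t100\n"
    "v2\t60\t45\t100\n"
)
rows = list(csv.DictReader(sample, delimiter="\t"))
violations = find_violations(rows)
print([row["variant"] for row in violations])  # prints ['v2']
```

To run this against real output, replace the `io.StringIO` object with an open file handle for each file in the output directory.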

Variant-Type-Specific Investigation

When debugging specific variant types, use targeted BAM slices:

| Variant Type | Key Examples | What to Check |
| --- | --- | --- |
| Del+SNV (complex) | SOX9 `GC→T`, ABL1 `AG→T` | Routes to `check_complex`, not `check_deletion`; alt > 0 |
| Large deletion, REF=0 | NF2 ~100bp DEL | M-block REF fallback in `check_complex`; ref > 0 |
| Shifted large deletion | TP53 `GACCGTGCAAGT→-` | `has_nearby_length_match` Phase 3; alt matches sign-out |
| MNP/DNP | TERT (5bp), BRCA2 (2bp) | ALT recovery vs sign-out |
| Shifted insertion | JAK1 `65306997` | Multi-allelic isolation, windowed INS scan |
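
To see why a Del+SNV such as `GC→T` should not be treated as a plain deletion, it can help to classify the REF/ALT pair by hand. The function below is a rough sketch of one common classification rule. It is an assumption for illustration only and does not reproduce the routing logic in `counting/variant_checks.rs`; `classify` is a hypothetical name.

```python
def classify(ref: str, alt: str) -> str:
    """Classify a variant from its REF and ALT strings (illustrative rule only)."""
    # MAF-style "-" means an empty allele.
    ref = "" if ref == "-" else ref
    alt = "" if alt == "-" else alt
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if ref and len(ref) == len(alt):
        return "MNP"
    if len(alt) > len(ref) and alt.startswith(ref):
        return "insertion"
    if len(ref) > len(alt) and ref.startswith(alt):
        return "deletion"
    # Length change plus a base substitution: neither pure insertion nor deletion.
    return "complex"


for ref, alt in [("GC", "T"), ("AG", "T"), ("GACCGTGCAAGT", "-"), ("CC", "TT")]:
    print(f"{ref}→{alt}: {classify(ref, alt)}")
```

Under this rule, `GC→T` and `AG→T` come out as complex because the shorter allele is not a prefix of the longer one, while `GACCGTGCAAGT→-` is a pure deletion.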

Code Standards

Python

| Standard | Requirement |
| --- | --- |
| Type hints | All public functions |
| Docstrings | Google style |
| Exports | `__all__` in every module |
| Logging | Use `logging`, not `print()` |
| Config | Pydantic models |

CLI Validation Standards

New CLI options must follow this four-layer validation order:

1. Parse-time (Typer): Use Enum for constrained choices, min=/max= for numeric ranges. Typer rejects invalid values before any Python code runs.
2. Pre-model (cli.py command body): File extension checks, cross-option semantics (e.g. --preserve-barcode + non-MAF input), charset validation. Log at ERROR and raise typer.Exit(code=1).
3. Model-time (Pydantic): Business-logic constraints in models/core.py via Field(ge=..., le=...) and @field_validator.
4. No silent skips: Missing inputs must fail-fast or require an explicit opt-out (e.g. --lenient-bam). WARNING-level silent skips are unacceptable for missing required inputs.

The module-level docstring in cli.py documents this order and must be kept in sync when adding new options.
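
The pre-model layer can be sketched as a pure function that collects error messages, which keeps it testable without Typer. This is a sketch under assumptions, not the code in cli.py: the helper name `pre_model_errors`, the extension map, the `.vcf` entry, and the rule that `--preserve-barcode` is only valid with MAF input are all assumptions made for illustration.

```python
from pathlib import Path

# Hypothetical map from file extension to input format.
KNOWN_EXTENSIONS = {".maf": "maf", ".vcf": "vcf"}


def pre_model_errors(variants: Path, preserve_barcode: bool) -> list[str]:
    """Collect pre-model validation errors for a set of CLI options."""
    errors = []
    fmt = KNOWN_EXTENSIONS.get(variants.suffix.lower())
    # File extension check.
    if fmt is None:
        errors.append(f"Unsupported variants file extension: {variants.suffix!r}")
    # Cross-option check (assumed rule: barcodes only exist in MAF input).
    if preserve_barcode and fmt != "maf":
        errors.append("--preserve-barcode requires MAF input")
    return errors


print(pre_model_errors(Path("calls.vcf"), preserve_barcode=True))
```

In the command body, each returned message would be logged at ERROR and followed by `raise typer.Exit(code=1)`, as the second layer above requires.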

Rust

| Standard | Requirement |
| --- | --- |
| Docs | `///` on public items |
| Errors | `anyhow::Result` |
| Logging | `log` crate |

Git Workflow (git-flow)

gitGraph
    commit id: "main"
    branch develop
    commit id: "develop"
    branch feature/new-thing
    commit id: "work"
    checkout develop
    merge feature/new-thing
    branch release/X.Y.Z
    commit id: "bump"
    checkout main
    merge release/X.Y.Z tag: "X.Y.Z"
    checkout develop
    merge release/X.Y.Z
    checkout main
    branch hotfix/urgent-fix
    commit id: "critical fix"
    checkout main
    merge hotfix/urgent-fix tag: "X.Y.Z+1"
    checkout develop
    merge hotfix/urgent-fix


| Branch | Purpose |
| --- | --- |
| `main` | Production releases |
| `develop` | Integration |
| `feature/*` | New features |
| `release/*` | Release candidates |
| `hotfix/*` | Production fixes |

Pre-Commit Checklist

Before committing, verify that all of the checks below pass.

To run all Python checks in one pass:

ruff check src/ tests/ && \
black --check src/ tests/ && \
mypy src/ && \
pytest -v

To run all Rust checks:

cd rust && cargo clippy --all-targets -- -D warnings && cargo test

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| `GBCMS_LOG_LEVEL` | INFO | Logging level |
| `RUST_LOG` | — | Rust logging |

GBCMS_LOG_LEVEL=DEBUG RUST_LOG=debug gbcms dna ...
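
For code that needs to honor `GBCMS_LOG_LEVEL`, one possible approach is sketched below. How gbcms itself reads the variable is not shown in this guide, so treat this as an assumption. `resolve_log_level` is a hypothetical helper that falls back to the documented default of INFO when the variable is unset or unrecognized.

```python
import logging
import os


def resolve_log_level(env=None) -> int:
    """Map GBCMS_LOG_LEVEL to a numeric logging level, defaulting to INFO."""
    env = os.environ if env is None else env
    name = env.get("GBCMS_LOG_LEVEL", "INFO").upper()
    # getLevelName returns an int for known names and a string otherwise.
    level = logging.getLevelName(name)
    return level if isinstance(level, int) else logging.INFO


logging.basicConfig(level=resolve_log_level())
logging.getLogger(__name__).info("Logging configured")
```

Passing a plain dictionary as `env` makes the helper easy to exercise in tests without touching the real environment.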

Generating a PDF

The docs include a combined print page (via mkdocs-print-site-plugin) that can be printed to PDF directly from Chrome — no extra tools required.

Steps:

1. Start the docs server:
```
mkdocs serve
```
2. Open http://127.0.0.1:8000/gbcms/print_page/ in Google Chrome and wait ~10s for Mermaid diagrams to render.

3. Open the DevTools console (F12 → Console tab) and paste:

document.querySelectorAll(
  '.md-sidebar,.md-header,.md-footer,.md-tabs,.md-source-file'
).forEach(e => e.remove());
document.querySelectorAll('.md-content').forEach(e => {
  e.style.maxWidth = '100%'; e.style.padding = '0 1cm';
});

4. Press Cmd+P (Ctrl+P on Windows/Linux) → Save as PDF (A4, default margins, ✅ background graphics).

Automated PDF

For a headless/automated version, see ~/Downloads/gbcms-pdf-generator/ (local archive, not in repo) — generates site/documentation.pdf via node generate_pdf.mjs.

Related

Testing Guide: Running tests and adding new test cases
Release Guide: Release process and versioning
Contributing: Contribution workflow and code standards
Architecture: System design and module structure
