Changelog¶

All notable changes to gbcms are documented here.

Full History

See GitHub Releases for complete release notes.

Changelog¶

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[3.0.0] - 2026-03-05¶

⚠️ Breaking Changes¶

Package renamed: py-gbcms → gbcms. Update your dependencies:
```
pip uninstall py-gbcms
pip install gbcms
```
A final py-gbcms==3.0.0 deprecation stub on PyPI re-exports gbcms and issues a DeprecationWarning for smooth migration.

✨ Added¶

mFSD native integration: --mfsd and --mfsd-parquet flags output a Parquet file with 31 MAF columns + 7 VCF INFO fields via the Rust native Parquet writer. Compression: ZSTD level 1.
CLI validation hardening: 12 validation gaps resolved — fail-fast BAM accessibility check, --lenient-bam flag for permissive mode, .vcf.bgz accepted as a valid variant input extension.
py-gbcms deprecation stub: compat/py-gbcms/ shim published to PyPI as py-gbcms==3.0.0 for backwards compatibility during migration.

🔧 Fixed¶

Parquet output compression switched from SNAPPY to ZSTD(1).
Resolved mypy AlignedSegment.cigar attribute errors in test suite.
Lint and stale-comment cleanup post Phase 6 audit.

📚 Documentation¶

Full documentation sync with codebase after Phase 6 (mFSD) merge.
PDF generation guide added to developer documentation.
.agent/rules/ directory added; stale .antigravity docs removed.

🏗️ CI¶

Added mkdocs-print-site-plugin to CI docs install step and pyproject.toml.

[2.8.0] - 2026-02-23¶

✨ Added¶

PairHMM alignment backend: Alternative Phase 3 alignment via --alignment-backend hmm with probabilistic scoring using base quality probabilities. Configurable LLR threshold (default 2.3 ≈ ln(10)) and gap probabilities for repeat/non-repeat regions. 6 new CLI options. Exposed as first-class Nextflow params in nextflow.config
MNP min-BQ-across-block quality strategy: MNP quality now assessed using min(BQ) across the entire block, matching C++ GBCMS baseCountDNP. Low-quality MNP reads now fall through to check_complex for masked comparison instead of being silently skipped
Per-phase ClassifyResult counters: Diagnostic counters track how many reads are resolved in each classification phase (Phase 1/2/2.5/3)
--trace flag: Two-tier Rust logging — --verbose for debug, --trace for per-read classification diagnostics via pyo3-log

🔧 Fixed¶

Phase 2/2.5 overcounting: Complex variants with REF >> ALT now skip Phase 2 Case B and Phase 2.5 when ref_len > 2 × alt_len — short ALT trivially matches, edit distance is biased toward shorter allele
Phase 3 bypass for pure DEL/INS: Removed is_worth_realignment prefilter — CIGAR structure is ground truth for pure deletions/insertions, prefilter was overcounting
DP anchor overlap: Now uses single-position check (read_start ≤ variant.pos), matching Mutect2, VarDictJava, and samtools mpileup standard
MNP LowQuality routing: LowQuality reads now fall through to check_complex for masked comparison instead of being skipped entirely
S3 underflow guard: Guards against negative ctx_offset in deletion S3 validation

🏗️ Refactored¶

Rust module structure: Split counting.rs (1904 LOC) and normalize.rs (1221 LOC) into idiomatic module directories: counting/ (7 modules) and normalize/ (7 modules)
AlignmentBackend threading: AlignmentBackend enum threaded through all Phase 3 call sites

📚 Documentation¶

Comprehensive audit: 28 fixes across 23 files — all GitBook URLs → MkDocs, version templating (X.Y.Z), undocumented CLI options, 5 mermaid diagrams updated for post-2.7.0 logic, architecture module tree refreshed, PairHMM documented end-to-end
Deleted stale nextflow/CHANGES.md

🧹 Chores¶

Fix all cargo clippy warnings
Fix ruff lint issues, black formatting
Update test expectations for new behavior
CI: skip test workflow for docs-only changes

🧪 Tests¶

8 Python alignment backend integration tests
5 SW-vs-PairHMM concordance tests
10 MNP unit tests
Multi-allelic isolation, DP/neither, fragment consensus, normalization tests

[2.7.0] - 2026-02-19¶

✨ Added¶

Phase 2.5 edit distance fallback: When read reconstruction length matches neither REF nor ALT (e.g., incomplete MAF definition), Levenshtein distance discriminates the closest allele with >1 edit margin safety guard
Phase 3 local SW fallback: Complex variants where semiglobal alignment produces confident-but-wrong calls (e.g., EPHA7 TCC→CT) are now rescued via local Smith-Waterman that soft-clips mismatched flanks. Dual-trigger requires both score reversal and ≥2-point margin

🔧 Fixed¶

Allele-based dispatch (check_allele_with_qual): Routes by ref_len × alt_len instead of unreliable variant_type string labels. SNP (1×1), insertion (1×N), deletion (N×1), MNP (N×N equal), complex (N×M unequal) — eliminates misrouting when callers emit inconsistent type annotations
SW semiglobal argument order: Fixed ref_hap/alt_hap argument swap in semiglobal alignment that was scoring reads against the wrong haplotype
Haplotype trimming removed: Eliminated shared symmetric trim that caused slice index starts at 7 but ends at 6 panics on asymmetric indels; replaced with validated per-haplotype bounds
MNP fallback: MNP reads now correctly fall through to SW alignment instead of silently returning "neither" on partial mismatches
Dual-count guard: Prevents a single read from being counted as both REF and ALT when SW scores are exactly equal
Soft-clip restriction: Soft-clipped bases no longer incorrectly contribute to variant region reconstruction
Strand bias orientation: Strand bias (Fisher's exact) now couples to the winning allele, not the raw alignment orientation
Interior REF quality proxy: Reads falling entirely within a large deletion (>50bp) now use median base quality instead of 0
Interior REF guard removed: Eliminated the has_large_cigar_del guard that massively overcounted REF for large deletions by misclassifying ALT-supporting reads

🧹 Chores¶

Clippy: Removed unused has_large_cigar_del variable
Tests: Updated test_fuzzy_complex::TestLengthMismatch expectation to reflect Phase 3 local SW fallback behavior

[2.6.1] - 2026-02-19¶

🔧 Fixed¶

Per-haplotype trimming: Fixed slice index starts at 7 but ends at 6 panic in counting.rs on asymmetric indels. Replaced shared symmetric trim with independent per-haplotype trim_haplotype() function that calculates bounds safely for each allele

✨ Added¶

Tolerant REF validation: Variants with ≥90% REF match against the FASTA are now counted (status PASS_WARN_REF_CORRECTED) instead of being silently rejected. The FASTA REF is used for haplotype construction. Variants with <90% match are still rejected as REF_MISMATCH

📚 Documentation¶

Visual posters: Added overview, normalization, and read-filter/counting-metrics posters (JPG) to reference documentation pages with lightbox support
Embedded PDFs: Added inline PDF viewer for allele classification guide and detailed overview presentation via mkdocs-pdf plugin
Variant normalization: Updated REF validation docs with 3-tier flowchart, PASS_WARN_REF_CORRECTED status, and EGFR exon 19 real-world example

🔧 CI¶

deploy-docs.yml: Added mkdocs-pdf to docs CI pip install dependencies

[2.6.0] - 2026-02-18¶

✨ Added¶

Adaptive context padding: Dynamically increases ref_context flanking in tandem repeat regions (homopolymer through hexanucleotide). Formula: max(default, repeat_span/2 + 3), capped at 50bp. Enabled by default (--adaptive-context/--no-adaptive-context)
gbcms normalize command: Standalone variant normalization (left-align + REF validate) without counting, outputs TSV with original and normalized coordinates
Nextflow parameters: fragment_qual_threshold, context_padding, show_normalization, adaptive_context now configurable in nextflow.config
Docs restructure: Split monolithic variant-counting.md into 4 focused pages: Variant Normalization, Allele Classification, Counting Metrics, Read Filters
HPC install docs: Micromamba-based source install with Python 3.13

🔧 Fixed¶

Interior REF guard for large deletions (>50bp): Reads falling entirely within a deleted region are now correctly classified as REF instead of ALT by Smith-Waterman
Windowed reciprocal overlap: Improved shifted indel detection using bidirectional overlap scoring
Complex variant counting (EPHA7 TCC→CT): Fixed base quality extraction for all variant-type handlers (check_insertion, check_deletion, check_mnp, check_complex)
MAF VCF-style conversion: Corrected complex variant handling in MAF→internal coordinate conversion
Lint: Fixed ruff I001/E402/B905 and black formatting in pipeline.py

🔄 Changed¶

Dead code removed: GenomicInterval class, Variant.interval property, fragment_counting config field
fetch_single_base() refactored: Delegates to fetch_region(), removing 33 lines of duplicated chr-prefix retry logic
Release guide: Updated version locations table with exact line numbers and verification command
Nextflow pipeline diagram added to docs index

[2.5.0] - 2026-02-12¶

✨ Added¶

--preserve-barcode flag: Keeps original Tumor_Sample_Barcode from input MAF instead of overriding with BAM sample name (MAF→MAF workflows)
--column-prefix parameter: Controls prefix for gbcms count columns in MAF output (default: none; use --column-prefix t_ for legacy compatibility) ⚠️
CoordinateKernel: Centralized MAF↔internal 0-based coordinate conversion with variant-type-aware logic for SNP, insertion, deletion, and complex variants
Nextflow FILTER_MAF module: Per-sample MAF variant filtering by Tumor_Sample_Barcode supporting exact match, regex, and multi-select (comma-separated) modes
Nextflow PIPELINE_SUMMARY module: Aggregated per-sample filtering statistics with formatted console output
Nextflow --filter_by_sample parameter and samplesheet tsb column for multi-sample MAF workflows
Nextflow documentation: Samplesheet tsb column guide, --filter_by_sample parameter reference, multi-sample MAF filtering examples

🔧 Fixed¶

Fragment quality extraction (critical): All variant-type handlers (check_insertion, check_deletion, check_mnp, check_complex) now return actual base_qual from CIGAR walk instead of 0 — fixes systematic ALT undercount for indels in fragment-level consensus
FILTER_MAF heredoc conflict: Restructured script from Python shebang to python3 << 'PYEOF' pattern, resolving SyntaxError from bash syntax in Python context
FILTER_MAF string quoting: Changed to single-quoted Python strings for Nextflow variable interpolation to prevent CSV-parsed double-quote conflicts
splitCsv quote handling: Added quote:'"' parameter for correct RFC 4180 parsing of comma-separated TSB values within quoted CSV fields
mypy no-redef error: Removed redundant type annotation in output.py else branch

🔄 Changed¶

MAF output column prefix default: Changed from t_ to empty string (no prefix). Use --column-prefix t_ for legacy t_ref_count / t_alt_count style columns ⚠️
MafWriter refactored: MAF→MAF path preserves all original columns verbatim; VCF→MAF path builds row from GDC fieldnames with CoordinateKernel coordinate conversion
Nextflow GBCMS_RUN input: Variants bundled into sample tuple (meta, bam, bai, variants) instead of separate channel
Nextflow GBCMS workflow: Simplified to 2-channel interface (ch_samples, ch_fasta) from 3 channels
Nextflow config: Added column_prefix, preserve_barcode, filter_by_sample parameters

[2.4.0] - 2026-02-10¶

✨ Added¶

Fragment Consensus Engine: FragmentEvidence struct with u64 QNAME hashing and quality-weighted R1/R2 consensus; ambiguous fragments are discarded (not assigned to REF)
--fragment-qual-threshold: New CLI option (default 10) controlling consensus quality difference for fragment conflict resolution
Windowed Indel Detection: ±5bp positional scan with 3-layer safeguards (sequence identity, closest match, reference context validation)
Quality-Aware Complex Matching: Masked comparison that ignores bases below --min-baseq; 3-case comparison (equal-length, ALT-only, REF-only) with ambiguity detection
Variant Counting Guide: New docs/reference/variant-counting.md with algorithm diagrams for all variant types (~700 lines)
MAF Normalization Docs: Added indel normalization and coordinate handling to docs/reference/input-formats.md
47 Tests: Up from 16 — added test_shifted_indels.py (15), test_fuzzy_complex.py (14), test_fragment_consensus.py (4)

🔄 Changed¶

--min-baseq default: 0 → 20 (Phred Q20) — activates quality masking by default for improved accuracy on low-coverage samples ⚠️
--version flag: Added to CLI (gbcms --version)
Deploy-Docs Workflow: Replaced mkdocs gh-deploy with mike for multi-version documentation; deploys stable (tagged version) from main and dev from develop branch; added extra.version.provider: mike to mkdocs.yml with version switcher widget

🔧 Fixed¶

Fragment double-counting bug: R1+R2 pairs previously counted as two independent observations; now collapsed via quality-weighted consensus
MAF Input Hardening: Graceful handling of missing/malformed fields with warnings instead of crashes
CI Release Pipeline: Stabilized manylinux builds — migrated from manylinux_2_28 to manylinux_2_34, resolved OpenSSL/CURL vendor conflicts via docker-options pattern
Type stubs: _rs.pyi and gbcms_rs.pyi synced with Rust bindings (added ref_context, ref_context_start, fragment_qual_threshold)
Linting: All files pass black, ruff, and mypy

📚 Documentation¶

Architecture comparison table updated with windowed indels and masked comparison
Nextflow config and docs updated with new min_baseq default
Testing guide expanded with Phase 2a/2b test files
HPC/RHEL 8 installation instructions updated with clangdev header management
Release guide updated with docs version locations
.antigravity project files updated with current Rust LOC (~1270) and test counts (47)

[2.3.0] - 2026-02-06¶

✨ Added¶

Nextflow BAI Auto-Discovery: Checks .bam.bai and .bai extensions automatically
Documentation Modernization: Hierarchical navigation, glightbox, panzoom, abbreviations
Performance Benchmarks: cfDNA duplex sample metrics in documentation
RHEL 8 Installation Guide: Conda-based source installation for legacy Linux

🔄 Changed¶

Dockerfile: Added procps, bash, OCI labels, maturin[patchelf], selective COPY
Nextflow Config: --platform linux/amd64, shell config, local profile, observability (trace/report/timeline/dag)
MkDocs: Switched to navigation.sections, 20+ abbreviations with hover tooltips
GitHub Actions: Consolidated deploy-docs workflows, added caching and PR validation
CI Wheels: Migrated from manylinux_2_28 to manylinux_2_34 (AlmaLinux 9 with OpenSSL 3.0+)

🔧 Fixed¶

Nextflow: Empty --suffix argument no longer causes failures
Admonitions: Converted GitHub-style alerts to MkDocs syntax
CI Build: Resolved curl-sys OpenSSL version conflict by switching to manylinux_2_34

[2.2.0] - 2026-02-04¶

✨ Added¶

Multi-platform Wheel Publishing: Maturin-based CI builds for Linux (x86_64, aarch64), macOS (Intel, Apple Silicon), and Windows
Structured Logging: New utils/logging.py module with Rich console output, timing utilities, and log file support
Mermaid Diagrams: Architecture documentation with interactive flowcharts
Release Guide: Comprehensive docs/RELEASE.md with git-flow workflow

🔄 Changed¶

Folder Restructure: Moved Rust code to rust/ (bundled as gbcms._rs)
Config Hierarchy: Nested Pydantic models (ReadFilters, QualityThresholds, OutputConfig) for better organization
Code Quality: Added __all__ exports, docstrings, and type hints across all modules
StrEnum: Modern enum pattern with Python 3.10 backport

📚 Documentation¶

New docs/ARCHITECTURE.md with system diagrams
New docs/DEVELOPMENT.md (developer guide)
New docs/TESTING.md (testing guide)
Updated MkDocs with mermaid2 plugin and snippet includes

[2.1.2] - 2025-11-25¶

🔧 Fixed¶

PyPI Distribution: Fixed source distribution size issue by correctly excluding large files (tests, docs, etc.) via pyproject.toml configuration.

[2.1.1] - 2025-11-25 [YANKED]¶

Yanked Release

This release was yanked from PyPI due to a source distribution size limit error. Use 2.1.2 instead.

🔧 Fixed¶

PyPI Distribution: Added MANIFEST.in (failed to work with Hatchling) to reduce source distribution size
Documentation: Added comprehensive Installation guide
Documentation: Unified Contributing guide (merged code + docs contributions)
Documentation: Added Changelog to documentation navigation

[2.1.0] - 2025-11-25¶

✨ Added¶

Nextflow Workflow¶

Production-ready Nextflow workflow for processing multiple samples in parallel
SLURM cluster support with customizable queue configuration
Per-sample suffix support via optional suffix column in samplesheet
Docker and Singularity profiles for containerized execution
Automatic BAI index discovery with validation
Resume capability for failed workflow runs
Resource management with automatic retry and scaling
Comprehensive documentation in docs/NEXTFLOW.md and nextflow/README.md

Documentation¶

Usage pattern comparison guide (docs/WORKFLOWS.md) for choosing between CLI and Nextflow
MkDocs integration for beautiful GitHub Pages documentation
Local documentation preview with live reload (mkdocs serve)
Staging deployment from develop branch for testing docs
Production deployment from main branch
Reorganized documentation structure with clear CLI vs Nextflow separation
CLI Quick Start guide (docs/quick-start.md)

🔧 Changed¶

Documentation workflow: docs now live on main branch with automated deployment
GitBook integration: configured to read from main branch
Nextflow module: improved parameter passing with meta.suffix support

📝 Documentation¶

Complete Nextflow workflow guide with SLURM examples
Per-sample suffix usage examples
Git-flow documentation workflow guide
Local preview instructions
Updated README with clear usage pattern separation

[2.0.0] - 2025-11-21¶

🚀 Major Rewrite¶

Version 2.0.0 represents a complete rewrite of py-gbcms with a focus on performance, correctness, and modern architecture.

✨ Added¶

Core Features¶

Rust-based Counting Engine: Hybrid Python/Rust architecture for 20x+ performance improvement
Strand Bias Statistics: Fisher's exact test p-values and odds ratios for both reads (SB_PVAL, SB_OR) and fragments (FSB_PVAL, FSB_OR)
Fragment-Level Counting: Majority-rule fragment counting with strand-specific counts (RDF, ADF)
Variant Allele Fractions: Read-level (VAF) and fragment-level (FAF) allele fraction calculations
Thread Control: Explicit control over parallelism via --threads argument (default: 1)

Input/Output¶

VCF Output Format: Standard VCF with comprehensive INFO and FORMAT fields
MAF Output Format: Extended MAF with custom columns for strand counts and statistics
Column Preservation: Input MAF columns are preserved in output
Multiple BAM Support: Process multiple samples via --bam-list or repeated --bam arguments
Sample ID Override: Explicit sample naming via --bam sample_id:path syntax

Filters¶

--filter-duplicates: Filter duplicate reads (default: enabled)
--filter-secondary: Filter secondary alignments
--filter-supplementary: Filter supplementary alignments
--filter-qc-failed: Filter reads that failed QC
--filter-improper-pair: Filter improperly paired reads
--filter-indel: Filter reads with indels in CIGAR

CLI & Usability¶

Modern CLI: Built with Typer and Rich for beautiful terminal output
Progress Tracking: Real-time progress bars and status indicators
Direct Invocation: Use gbcms run instead of python -m gbcms.cli
Output Customization: --suffix flag for output filename customization
Flexible Input: Support for both VCF and MAF input formats

Infrastructure¶

Docker Support: Production-ready multi-stage Dockerfile with optimized layers
Type Safety: Full type annotations with mypy support
Type Stubs: Provided .pyi stub file for Rust extension
Comprehensive Tests: Extended test suite with accuracy and filter validation
CI/CD: GitHub Actions workflows for testing, linting, and releases

🔄 Changed¶

Architecture¶

Migrated from pure Python to hybrid Python/Rust architecture
Core counting logic implemented in Rust using rust-htslib
Data parallelism over variants with per-thread BAM readers

Output Formats¶

VCF FORMAT fields: Strand-specific counts now use comma-separated values (e.g., RD=5,3 for forward,reverse)
MAF columns: Standardized column names (t_ref_count_forward, t_alt_count_reverse, etc.)
Coordinate System: Internal 0-based indexing with correct conversion for VCF (1-based) and MAF output

Performance¶

Speed: 20x+ faster than v1.x on typical datasets
Memory: Efficient per-thread BAM readers with minimal overhead
Scalability: Configurable thread pool for optimal resource usage

Dependencies¶

Python: Updated to require Python ≥3.10
Rust: pyo3 0.27.1, rust-htslib 0.51.0, statrs 0.18.0
Python Packages: pysam ≥0.21.0, typer ≥0.9.0, rich ≥13.0.0, pydantic ≥2.0.0

🗑️ Removed¶

Legacy Python Counting: Pure Python implementation removed in favor of Rust
Old CLI: Deprecated python -m gbcms.cli entry point
Unused Dependencies: Removed cyvcf2 and numba (no longer needed)
Pre-commit Hooks: Removed in favor of explicit linting in CI

🐛 Fixed¶

Correct handling of complex variants (MNPs, DelIns)
Proper strand assignment for fragment counting
Reference validation against FASTA for all variant types
Thread-safe BAM access with per-thread readers

📚 Documentation¶

Complete rewrite of all documentation
New guides: INSTALLATION.md, CLI_FEATURES.md, INPUT_OUTPUT.md
Comprehensive API documentation
Docker usage examples
Contributing guidelines updated

🔧 Technical Details¶

Rust Components¶

gbcms._rs: PyO3-based Rust extension (bundled in wheel)
Fisher's exact test via statrs crate
Rayon-based parallelism with configurable thread pools
Safe memory management with Rust's ownership model

Testing¶

16 comprehensive test cases
Accuracy validation with synthetic BAM files
Filter validation for all read flag combinations
Integration tests with real-world data

⚠️ Breaking Changes¶

Version 2.0.0 is not backward compatible with 1.x. Key breaking changes:

CLI syntax: Use gbcms run instead of python -m gbcms.cli
Output format: VCF/MAF column structures have changed
Default behavior: Only duplicate filtering enabled by default (was: all filters)
Dependencies: Requires Rust toolchain for installation from source
Python version: Minimum Python 3.10 (was: 3.8)

📦 Installation¶

# From PyPI (includes pre-built wheels)
pip install gbcms

# From source (requires Rust)
pip install git+https://github.com/msk-access/gbcms.git

# Docker
docker pull ghcr.io/msk-access/gbcms:2.0.0

🙏 Acknowledgments¶

This rewrite was designed and implemented with a focus on correctness, performance, and modern best practices in bioinformatics software development.

[1.x] - Legacy¶

Previous versions (1.x) used a pure Python implementation. See git history for details.