Skip to content

Changelog

All notable changes to gbcms are documented here.

Full History

See GitHub Releases for complete release notes.

Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[3.0.0] - 2026-03-05

⚠️ Breaking Changes

  • Package renamed: py-gbcmsgbcms. Update your dependencies:
    pip uninstall py-gbcms
    pip install gbcms
    
    A final py-gbcms==3.0.0 deprecation stub on PyPI re-exports gbcms and issues a DeprecationWarning for smooth migration.

✨ Added

  • mFSD native integration: --mfsd and --mfsd-parquet flags output a Parquet file with 31 MAF columns + 7 VCF INFO fields via the Rust native Parquet writer. Compression: ZSTD level 1.
  • CLI validation hardening: 12 validation gaps resolved — fail-fast BAM accessibility check, --lenient-bam flag for permissive mode, .vcf.bgz accepted as a valid variant input extension.
  • py-gbcms deprecation stub: compat/py-gbcms/ shim published to PyPI as py-gbcms==3.0.0 for backwards compatibility during migration.

🔧 Fixed

  • Parquet output compression switched from SNAPPY to ZSTD(1).
  • Resolved mypy AlignedSegment.cigar attribute errors in test suite.
  • Lint and stale-comment cleanup post Phase 6 audit.

📚 Documentation

  • Full documentation sync with codebase after Phase 6 (mFSD) merge.
  • PDF generation guide added to developer documentation.
  • .agent/rules/ directory added; stale .antigravity docs removed.

🏗️ CI

  • Added mkdocs-print-site-plugin to CI docs install step and pyproject.toml.

[2.8.0] - 2026-02-23

✨ Added

  • PairHMM alignment backend: Alternative Phase 3 alignment via --alignment-backend hmm with probabilistic scoring using base quality probabilities. Configurable LLR threshold (default 2.3 ≈ ln(10)) and gap probabilities for repeat/non-repeat regions. 6 new CLI options. Exposed as first-class Nextflow params in nextflow.config
  • MNP min-BQ-across-block quality strategy: MNP quality now assessed using min(BQ) across the entire block, matching C++ GBCMS baseCountDNP. Low-quality MNP reads now fall through to check_complex for masked comparison instead of being silently skipped
  • Per-phase ClassifyResult counters: Diagnostic counters track how many reads are resolved in each classification phase (Phase 1/2/2.5/3)
  • --trace flag: Two-tier Rust logging — --verbose for debug, --trace for per-read classification diagnostics via pyo3-log

🔧 Fixed

  • Phase 2/2.5 overcounting: Complex variants with REF >> ALT now skip Phase 2 Case B and Phase 2.5 when ref_len > 2 × alt_len — short ALT trivially matches, edit distance is biased toward shorter allele
  • Phase 3 bypass for pure DEL/INS: Removed is_worth_realignment prefilter — CIGAR structure is ground truth for pure deletions/insertions, prefilter was overcounting
  • DP anchor overlap: Now uses single-position check (read_start ≤ variant.pos), matching Mutect2, VarDictJava, and samtools mpileup standard
  • MNP LowQuality routing: LowQuality reads now fall through to check_complex for masked comparison instead of being skipped entirely
  • S3 underflow guard: Guards against negative ctx_offset in deletion S3 validation

🏗️ Refactored

  • Rust module structure: Split counting.rs (1904 LOC) and normalize.rs (1221 LOC) into idiomatic module directories: counting/ (7 modules) and normalize/ (7 modules)
  • AlignmentBackend threading: AlignmentBackend enum threaded through all Phase 3 call sites

📚 Documentation

  • Comprehensive audit: 28 fixes across 23 files — all GitBook URLs → MkDocs, version templating (X.Y.Z), undocumented CLI options, 5 mermaid diagrams updated for post-2.7.0 logic, architecture module tree refreshed, PairHMM documented end-to-end
  • Deleted stale nextflow/CHANGES.md

🧹 Chores

  • Fix all cargo clippy warnings
  • Fix ruff lint issues, black formatting
  • Update test expectations for new behavior
  • CI: skip test workflow for docs-only changes

🧪 Tests

  • 8 Python alignment backend integration tests
  • 5 SW-vs-PairHMM concordance tests
  • 10 MNP unit tests
  • Multi-allelic isolation, DP/neither, fragment consensus, normalization tests

[2.7.0] - 2026-02-19

✨ Added

  • Phase 2.5 edit distance fallback: When read reconstruction length matches neither REF nor ALT (e.g., incomplete MAF definition), Levenshtein distance discriminates the closest allele with >1 edit margin safety guard
  • Phase 3 local SW fallback: Complex variants where semiglobal alignment produces confident-but-wrong calls (e.g., EPHA7 TCC→CT) are now rescued via local Smith-Waterman that soft-clips mismatched flanks. Dual-trigger requires both score reversal and ≥2-point margin

🔧 Fixed

  • Allele-based dispatch (check_allele_with_qual): Routes by ref_len × alt_len instead of unreliable variant_type string labels. SNP (1×1), insertion (1×N), deletion (N×1), MNP (N×N equal), complex (N×M unequal) — eliminates misrouting when callers emit inconsistent type annotations
  • SW semiglobal argument order: Fixed ref_hap/alt_hap argument swap in semiglobal alignment that was scoring reads against the wrong haplotype
  • Haplotype trimming removed: Eliminated shared symmetric trim that caused slice index starts at 7 but ends at 6 panics on asymmetric indels; replaced with validated per-haplotype bounds
  • MNP fallback: MNP reads now correctly fall through to SW alignment instead of silently returning "neither" on partial mismatches
  • Dual-count guard: Prevents a single read from being counted as both REF and ALT when SW scores are exactly equal
  • Soft-clip restriction: Soft-clipped bases no longer incorrectly contribute to variant region reconstruction
  • Strand bias orientation: Strand bias (Fisher's exact) now couples to the winning allele, not the raw alignment orientation
  • Interior REF quality proxy: Reads falling entirely within a large deletion (>50bp) now use median base quality instead of 0
  • Interior REF guard removed: Eliminated the has_large_cigar_del guard that massively overcounted REF for large deletions by misclassifying ALT-supporting reads

🧹 Chores

  • Clippy: Removed unused has_large_cigar_del variable
  • Tests: Updated test_fuzzy_complex::TestLengthMismatch expectation to reflect Phase 3 local SW fallback behavior

[2.6.1] - 2026-02-19

🔧 Fixed

  • Per-haplotype trimming: Fixed slice index starts at 7 but ends at 6 panic in counting.rs on asymmetric indels. Replaced shared symmetric trim with independent per-haplotype trim_haplotype() function that calculates bounds safely for each allele

✨ Added

  • Tolerant REF validation: Variants with ≥90% REF match against the FASTA are now counted (status PASS_WARN_REF_CORRECTED) instead of being silently rejected. The FASTA REF is used for haplotype construction. Variants with <90% match are still rejected as REF_MISMATCH

📚 Documentation

  • Visual posters: Added overview, normalization, and read-filter/counting-metrics posters (JPG) to reference documentation pages with lightbox support
  • Embedded PDFs: Added inline PDF viewer for allele classification guide and detailed overview presentation via mkdocs-pdf plugin
  • Variant normalization: Updated REF validation docs with 3-tier flowchart, PASS_WARN_REF_CORRECTED status, and EGFR exon 19 real-world example

🔧 CI

  • deploy-docs.yml: Added mkdocs-pdf to docs CI pip install dependencies

[2.6.0] - 2026-02-18

✨ Added

  • Adaptive context padding: Dynamically increases ref_context flanking in tandem repeat regions (homopolymer through hexanucleotide). Formula: max(default, repeat_span/2 + 3), capped at 50bp. Enabled by default (--adaptive-context/--no-adaptive-context)
  • gbcms normalize command: Standalone variant normalization (left-align + REF validate) without counting, outputs TSV with original and normalized coordinates
  • Nextflow parameters: fragment_qual_threshold, context_padding, show_normalization, adaptive_context now configurable in nextflow.config
  • Docs restructure: Split monolithic variant-counting.md into 4 focused pages: Variant Normalization, Allele Classification, Counting Metrics, Read Filters
  • HPC install docs: Micromamba-based source install with Python 3.13

🔧 Fixed

  • Interior REF guard for large deletions (>50bp): Reads falling entirely within a deleted region are now correctly classified as REF instead of ALT by Smith-Waterman
  • Windowed reciprocal overlap: Improved shifted indel detection using bidirectional overlap scoring
  • Complex variant counting (EPHA7 TCC→CT): Fixed base quality extraction for all variant-type handlers (check_insertion, check_deletion, check_mnp, check_complex)
  • MAF VCF-style conversion: Corrected complex variant handling in MAF→internal coordinate conversion
  • Lint: Fixed ruff I001/E402/B905 and black formatting in pipeline.py

🔄 Changed

  • Dead code removed: GenomicInterval class, Variant.interval property, fragment_counting config field
  • fetch_single_base() refactored: Delegates to fetch_region(), removing 33 lines of duplicated chr-prefix retry logic
  • Release guide: Updated version locations table with exact line numbers and verification command
  • Nextflow pipeline diagram added to docs index

[2.5.0] - 2026-02-12

✨ Added

  • --preserve-barcode flag: Keeps original Tumor_Sample_Barcode from input MAF instead of overriding with BAM sample name (MAFMAF workflows)
  • --column-prefix parameter: Controls prefix for gbcms count columns in MAF output (default: none; use --column-prefix t_ for legacy compatibility) ⚠️
  • CoordinateKernel: Centralized MAF↔internal 0-based coordinate conversion with variant-type-aware logic for SNP, insertion, deletion, and complex variants
  • Nextflow FILTER_MAF module: Per-sample MAF variant filtering by Tumor_Sample_Barcode supporting exact match, regex, and multi-select (comma-separated) modes
  • Nextflow PIPELINE_SUMMARY module: Aggregated per-sample filtering statistics with formatted console output
  • Nextflow --filter_by_sample parameter and samplesheet tsb column for multi-sample MAF workflows
  • Nextflow documentation: Samplesheet tsb column guide, --filter_by_sample parameter reference, multi-sample MAF filtering examples

🔧 Fixed

  • Fragment quality extraction (critical): All variant-type handlers (check_insertion, check_deletion, check_mnp, check_complex) now return actual base_qual from CIGAR walk instead of 0 — fixes systematic ALT undercount for indels in fragment-level consensus
  • FILTER_MAF heredoc conflict: Restructured script from Python shebang to python3 << 'PYEOF' pattern, resolving SyntaxError from bash syntax in Python context
  • FILTER_MAF string quoting: Changed to single-quoted Python strings for Nextflow variable interpolation to prevent CSV-parsed double-quote conflicts
  • splitCsv quote handling: Added quote:'"' parameter for correct RFC 4180 parsing of comma-separated TSB values within quoted CSV fields
  • mypy no-redef error: Removed redundant type annotation in output.py else branch

🔄 Changed

  • MAF output column prefix default: Changed from t_ to empty string (no prefix). Use --column-prefix t_ for legacy t_ref_count / t_alt_count style columns ⚠️
  • MafWriter refactored: MAFMAF path preserves all original columns verbatim; VCFMAF path builds row from GDC fieldnames with CoordinateKernel coordinate conversion
  • Nextflow GBCMS_RUN input: Variants bundled into sample tuple (meta, bam, bai, variants) instead of separate channel
  • Nextflow GBCMS workflow: Simplified to 2-channel interface (ch_samples, ch_fasta) from 3 channels
  • Nextflow config: Added column_prefix, preserve_barcode, filter_by_sample parameters

[2.4.0] - 2026-02-10

✨ Added

  • Fragment Consensus Engine: FragmentEvidence struct with u64 QNAME hashing and quality-weighted R1/R2 consensus; ambiguous fragments are discarded (not assigned to REF)
  • --fragment-qual-threshold: New CLI option (default 10) controlling consensus quality difference for fragment conflict resolution
  • Windowed Indel Detection: ±5bp positional scan with 3-layer safeguards (sequence identity, closest match, reference context validation)
  • Quality-Aware Complex Matching: Masked comparison that ignores bases below --min-baseq; 3-case comparison (equal-length, ALT-only, REF-only) with ambiguity detection
  • Variant Counting Guide: New docs/reference/variant-counting.md with algorithm diagrams for all variant types (~700 lines)
  • MAF Normalization Docs: Added indel normalization and coordinate handling to docs/reference/input-formats.md
  • 47 Tests: Up from 16 — added test_shifted_indels.py (15), test_fuzzy_complex.py (14), test_fragment_consensus.py (4)

🔄 Changed

  • --min-baseq default: 020 (Phred Q20) — activates quality masking by default for improved accuracy on low-coverage samples ⚠️
  • --version flag: Added to CLI (gbcms --version)
  • Deploy-Docs Workflow: Replaced mkdocs gh-deploy with mike for multi-version documentation; deploys stable (tagged version) from main and dev from develop branch; added extra.version.provider: mike to mkdocs.yml with version switcher widget

🔧 Fixed

  • Fragment double-counting bug: R1+R2 pairs previously counted as two independent observations; now collapsed via quality-weighted consensus
  • MAF Input Hardening: Graceful handling of missing/malformed fields with warnings instead of crashes
  • CI Release Pipeline: Stabilized manylinux builds — migrated from manylinux_2_28 to manylinux_2_34, resolved OpenSSL/CURL vendor conflicts via docker-options pattern
  • Type stubs: _rs.pyi and gbcms_rs.pyi synced with Rust bindings (added ref_context, ref_context_start, fragment_qual_threshold)
  • Linting: All files pass black, ruff, and mypy

📚 Documentation

  • Architecture comparison table updated with windowed indels and masked comparison
  • Nextflow config and docs updated with new min_baseq default
  • Testing guide expanded with Phase 2a/2b test files
  • HPC/RHEL 8 installation instructions updated with clangdev header management
  • Release guide updated with docs version locations
  • .antigravity project files updated with current Rust LOC (~1270) and test counts (47)

[2.3.0] - 2026-02-06

✨ Added

  • Nextflow BAI Auto-Discovery: Checks .bam.bai and .bai extensions automatically
  • Documentation Modernization: Hierarchical navigation, glightbox, panzoom, abbreviations
  • Performance Benchmarks: cfDNA duplex sample metrics in documentation
  • RHEL 8 Installation Guide: Conda-based source installation for legacy Linux

🔄 Changed

  • Dockerfile: Added procps, bash, OCI labels, maturin[patchelf], selective COPY
  • Nextflow Config: --platform linux/amd64, shell config, local profile, observability (trace/report/timeline/dag)
  • MkDocs: Switched to navigation.sections, 20+ abbreviations with hover tooltips
  • GitHub Actions: Consolidated deploy-docs workflows, added caching and PR validation
  • CI Wheels: Migrated from manylinux_2_28 to manylinux_2_34 (AlmaLinux 9 with OpenSSL 3.0+)

🔧 Fixed

  • Nextflow: Empty --suffix argument no longer causes failures
  • Admonitions: Converted GitHub-style alerts to MkDocs syntax
  • CI Build: Resolved curl-sys OpenSSL version conflict by switching to manylinux_2_34

[2.2.0] - 2026-02-04

✨ Added

  • Multi-platform Wheel Publishing: Maturin-based CI builds for Linux (x86_64, aarch64), macOS (Intel, Apple Silicon), and Windows
  • Structured Logging: New utils/logging.py module with Rich console output, timing utilities, and log file support
  • Mermaid Diagrams: Architecture documentation with interactive flowcharts
  • Release Guide: Comprehensive docs/RELEASE.md with git-flow workflow

🔄 Changed

  • Folder Restructure: Moved Rust code to rust/ (bundled as gbcms._rs)
  • Config Hierarchy: Nested Pydantic models (ReadFilters, QualityThresholds, OutputConfig) for better organization
  • Code Quality: Added __all__ exports, docstrings, and type hints across all modules
  • StrEnum: Modern enum pattern with Python 3.10 backport

📚 Documentation

  • New docs/ARCHITECTURE.md with system diagrams
  • New docs/DEVELOPMENT.md (developer guide)
  • New docs/TESTING.md (testing guide)
  • Updated MkDocs with mermaid2 plugin and snippet includes

[2.1.2] - 2025-11-25

🔧 Fixed

  • PyPI Distribution: Fixed source distribution size issue by correctly excluding large files (tests, docs, etc.) via pyproject.toml configuration.

[2.1.1] - 2025-11-25 [YANKED]

Yanked Release

This release was yanked from PyPI due to a source distribution size limit error. Use 2.1.2 instead.

🔧 Fixed

  • PyPI Distribution: Added MANIFEST.in (failed to work with Hatchling) to reduce source distribution size
  • Documentation: Added comprehensive Installation guide
  • Documentation: Unified Contributing guide (merged code + docs contributions)
  • Documentation: Added Changelog to documentation navigation

[2.1.0] - 2025-11-25

✨ Added

Nextflow Workflow

  • Production-ready Nextflow workflow for processing multiple samples in parallel
  • SLURM cluster support with customizable queue configuration
  • Per-sample suffix support via optional suffix column in samplesheet
  • Docker and Singularity profiles for containerized execution
  • Automatic BAI index discovery with validation
  • Resume capability for failed workflow runs
  • Resource management with automatic retry and scaling
  • Comprehensive documentation in docs/NEXTFLOW.md and nextflow/README.md

Documentation

  • Usage pattern comparison guide (docs/WORKFLOWS.md) for choosing between CLI and Nextflow
  • MkDocs integration for beautiful GitHub Pages documentation
  • Local documentation preview with live reload (mkdocs serve)
  • Staging deployment from develop branch for testing docs
  • Production deployment from main branch
  • Reorganized documentation structure with clear CLI vs Nextflow separation
  • CLI Quick Start guide (docs/quick-start.md)

🔧 Changed

  • Documentation workflow: docs now live on main branch with automated deployment
  • GitBook integration: configured to read from main branch
  • Nextflow module: improved parameter passing with meta.suffix support

📝 Documentation

  • Complete Nextflow workflow guide with SLURM examples
  • Per-sample suffix usage examples
  • Git-flow documentation workflow guide
  • Local preview instructions
  • Updated README with clear usage pattern separation

[2.0.0] - 2025-11-21

🚀 Major Rewrite

Version 2.0.0 represents a complete rewrite of py-gbcms with a focus on performance, correctness, and modern architecture.

✨ Added

Core Features

  • Rust-based Counting Engine: Hybrid Python/Rust architecture for 20x+ performance improvement
  • Strand Bias Statistics: Fisher's exact test p-values and odds ratios for both reads (SB_PVAL, SB_OR) and fragments (FSB_PVAL, FSB_OR)
  • Fragment-Level Counting: Majority-rule fragment counting with strand-specific counts (RDF, ADF)
  • Variant Allele Fractions: Read-level (VAF) and fragment-level (FAF) allele fraction calculations
  • Thread Control: Explicit control over parallelism via --threads argument (default: 1)

Input/Output

  • VCF Output Format: Standard VCF with comprehensive INFO and FORMAT fields
  • MAF Output Format: Extended MAF with custom columns for strand counts and statistics
  • Column Preservation: Input MAF columns are preserved in output
  • Multiple BAM Support: Process multiple samples via --bam-list or repeated --bam arguments
  • Sample ID Override: Explicit sample naming via --bam sample_id:path syntax

Filters

  • --filter-duplicates: Filter duplicate reads (default: enabled)
  • --filter-secondary: Filter secondary alignments
  • --filter-supplementary: Filter supplementary alignments
  • --filter-qc-failed: Filter reads that failed QC
  • --filter-improper-pair: Filter improperly paired reads
  • --filter-indel: Filter reads with indels in CIGAR

CLI & Usability

  • Modern CLI: Built with Typer and Rich for beautiful terminal output
  • Progress Tracking: Real-time progress bars and status indicators
  • Direct Invocation: Use gbcms run instead of python -m gbcms.cli
  • Output Customization: --suffix flag for output filename customization
  • Flexible Input: Support for both VCF and MAF input formats

Infrastructure

  • Docker Support: Production-ready multi-stage Dockerfile with optimized layers
  • Type Safety: Full type annotations with mypy support
  • Type Stubs: Provided .pyi stub file for Rust extension
  • Comprehensive Tests: Extended test suite with accuracy and filter validation
  • CI/CD: GitHub Actions workflows for testing, linting, and releases

🔄 Changed

Architecture

  • Migrated from pure Python to hybrid Python/Rust architecture
  • Core counting logic implemented in Rust using rust-htslib
  • Data parallelism over variants with per-thread BAM readers

Output Formats

  • VCF FORMAT fields: Strand-specific counts now use comma-separated values (e.g., RD=5,3 for forward,reverse)
  • MAF columns: Standardized column names (t_ref_count_forward, t_alt_count_reverse, etc.)
  • Coordinate System: Internal 0-based indexing with correct conversion for VCF (1-based) and MAF output

Performance

  • Speed: 20x+ faster than v1.x on typical datasets
  • Memory: Efficient per-thread BAM readers with minimal overhead
  • Scalability: Configurable thread pool for optimal resource usage

Dependencies

  • Python: Updated to require Python ≥3.10
  • Rust: pyo3 0.27.1, rust-htslib 0.51.0, statrs 0.18.0
  • Python Packages: pysam ≥0.21.0, typer ≥0.9.0, rich ≥13.0.0, pydantic ≥2.0.0

🗑️ Removed

  • Legacy Python Counting: Pure Python implementation removed in favor of Rust
  • Old CLI: Deprecated python -m gbcms.cli entry point
  • Unused Dependencies: Removed cyvcf2 and numba (no longer needed)
  • Pre-commit Hooks: Removed in favor of explicit linting in CI

🐛 Fixed

  • Correct handling of complex variants (MNPs, DelIns)
  • Proper strand assignment for fragment counting
  • Reference validation against FASTA for all variant types
  • Thread-safe BAM access with per-thread readers

📚 Documentation

  • Complete rewrite of all documentation
  • New guides: INSTALLATION.md, CLI_FEATURES.md, INPUT_OUTPUT.md
  • Comprehensive API documentation
  • Docker usage examples
  • Contributing guidelines updated

🔧 Technical Details

Rust Components

  • gbcms._rs: PyO3-based Rust extension (bundled in wheel)
  • Fisher's exact test via statrs crate
  • Rayon-based parallelism with configurable thread pools
  • Safe memory management with Rust's ownership model

Testing

  • 16 comprehensive test cases
  • Accuracy validation with synthetic BAM files
  • Filter validation for all read flag combinations
  • Integration tests with real-world data

⚠️ Breaking Changes

Version 2.0.0 is not backward compatible with 1.x. Key breaking changes:

  1. CLI syntax: Use gbcms run instead of python -m gbcms.cli
  2. Output format: VCF/MAF column structures have changed
  3. Default behavior: Only duplicate filtering enabled by default (was: all filters)
  4. Dependencies: Requires Rust toolchain for installation from source
  5. Python version: Minimum Python 3.10 (was: 3.8)

📦 Installation

# From PyPI (includes pre-built wheels)
pip install gbcms

# From source (requires Rust)
pip install git+https://github.com/msk-access/gbcms.git

# Docker
docker pull ghcr.io/msk-access/gbcms:2.0.0

🙏 Acknowledgments

This rewrite was designed and implemented with a focus on correctness, performance, and modern best practices in bioinformatics software development.


[1.x] - Legacy

Previous versions (1.x) used a pure Python implementation. See git history for details.