Changelog¶
All notable changes to gbcms are documented here.
Full History
See GitHub Releases for complete release notes.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
[3.0.0] - 2026-03-05¶
⚠️ Breaking Changes¶
- Package renamed: `py-gbcms` → `gbcms`. Update your dependencies. A final `py-gbcms==3.0.0` deprecation stub on PyPI re-exports `gbcms` and issues a `DeprecationWarning` for smooth migration.
✨ Added¶
- mFSD native integration: `--mfsd` and `--mfsd-parquet` flags output a Parquet file with 31 MAF columns + 7 VCF INFO fields via the Rust native Parquet writer. Compression: ZSTD level 1.
- CLI validation hardening: 12 validation gaps resolved, including a fail-fast BAM accessibility check, a `--lenient-bam` flag for permissive mode, and `.vcf.bgz` accepted as a valid variant input extension.
- `py-gbcms` deprecation stub: `compat/py-gbcms/` shim published to PyPI as `py-gbcms==3.0.0` for backwards compatibility during migration.
🔧 Fixed¶
- Parquet output compression switched from SNAPPY to ZSTD(1).
- Resolved `mypy` `AlignedSegment.cigar` attribute errors in test suite.
- Lint and stale-comment cleanup post Phase 6 audit.
📚 Documentation¶
- Full documentation sync with codebase after Phase 6 (mFSD) merge.
- PDF generation guide added to developer documentation.
- `.agent/rules/` directory added; stale `.antigravity` docs removed.
🏗️ CI¶
- Added `mkdocs-print-site-plugin` to CI docs install step and `pyproject.toml`.
[2.8.0] - 2026-02-23¶
✨ Added¶
- PairHMM alignment backend: Alternative Phase 3 alignment via `--alignment-backend hmm` with probabilistic scoring using base quality probabilities. Configurable LLR threshold (default 2.3 ≈ ln(10)) and gap probabilities for repeat/non-repeat regions. 6 new CLI options. Exposed as first-class Nextflow params in `nextflow.config`
- MNP min-BQ-across-block quality strategy: MNP quality now assessed using min(BQ) across the entire block, matching C++ GBCMS `baseCountDNP`. Low-quality MNP reads now fall through to `check_complex` for masked comparison instead of being silently skipped
- Per-phase ClassifyResult counters: Diagnostic counters track how many reads are resolved in each classification phase (Phase 1/2/2.5/3)
- `--trace` flag: Two-tier Rust logging, with `--verbose` for debug and `--trace` for per-read classification diagnostics via pyo3-log
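The default LLR threshold can be read as an odds requirement: 2.3 ≈ ln(10) means one haplotype must be roughly ten times more likely than the other before a read is assigned. The following is a minimal sketch, assuming the LLR is the natural-log likelihood of ALT minus that of REF and that the threshold applies symmetrically. `classify_by_llr` and its return labels are hypothetical and are not the gbcms API.

```python
import math


def classify_by_llr(loglik_ref: float, loglik_alt: float, threshold: float = 2.3) -> str:
    """Assign a read from natural-log haplotype likelihoods (hypothetical helper)."""
    llr = loglik_alt - loglik_ref  # > 0 favors ALT, < 0 favors REF
    if llr >= threshold:
        return "ALT"
    if llr <= -threshold:
        return "REF"
    return "NEITHER"  # evidence too weak in either direction


# ALT is 20x more likely than REF here, so the read clears the ~10x bar.
call = classify_by_llr(math.log(0.01), math.log(0.2))
```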
🔧 Fixed¶
- Phase 2/2.5 overcounting: Complex variants with `REF >> ALT` now skip Phase 2 Case B and Phase 2.5 when `ref_len > 2 × alt_len`, because a short ALT trivially matches and edit distance is biased toward the shorter allele
- Phase 3 bypass for pure DEL/INS: Removed `is_worth_realignment` prefilter; CIGAR structure is ground truth for pure deletions/insertions, and the prefilter was overcounting
- DP anchor overlap: Now uses single-position check (`read_start ≤ variant.pos`), matching Mutect2, VarDictJava, and samtools mpileup standard
- MNP LowQuality routing: LowQuality reads now fall through to `check_complex` for masked comparison instead of being skipped entirely
- S3 underflow guard: Guards against negative `ctx_offset` in deletion S3 validation
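The Phase 2/2.5 skip condition reduces to a length comparison on the two alleles. A small sketch of that check follows; the function name is hypothetical, and only the `ref_len > 2 × alt_len` rule comes from the entry above.

```python
def skips_phase2_fallbacks(ref: str, alt: str) -> bool:
    """True when REF is more than twice as long as ALT (hypothetical helper)."""
    return len(ref) > 2 * len(alt)


# A 7-base REF against a 2-base ALT is skipped; a 3-base REF is not.
long_ref_skipped = skips_phase2_fallbacks("ACGTACG", "CT")
short_ref_skipped = skips_phase2_fallbacks("TCC", "CT")
```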
🏗️ Refactored¶
- Rust module structure: Split
counting.rs(1904 LOC) andnormalize.rs(1221 LOC) into idiomatic module directories:counting/(7 modules) andnormalize/(7 modules) - AlignmentBackend threading:
AlignmentBackendenum threaded through all Phase 3 call sites
📚 Documentation¶
- Comprehensive audit: 28 fixes across 23 files — all GitBook URLs → MkDocs, version templating (X.Y.Z), undocumented CLI options, 5 mermaid diagrams updated for post-2.7.0 logic, architecture module tree refreshed, PairHMM documented end-to-end
- Deleted stale `nextflow/CHANGES.md`
🧹 Chores¶
- Fix all cargo clippy warnings
- Fix ruff lint issues, black formatting
- Update test expectations for new behavior
- CI: skip test workflow for docs-only changes
🧪 Tests¶
- 8 Python alignment backend integration tests
- 5 SW-vs-PairHMM concordance tests
- 10 MNP unit tests
- Multi-allelic isolation, DP/neither, fragment consensus, normalization tests
[2.7.0] - 2026-02-19¶
✨ Added¶
- Phase 2.5 edit distance fallback: When read reconstruction length matches neither REF nor ALT (e.g., incomplete MAF definition), Levenshtein distance discriminates the closest allele with >1 edit margin safety guard
- Phase 3 local SW fallback: Complex variants where semiglobal alignment produces confident-but-wrong calls (e.g., EPHA7 `TCC→CT`) are now rescued via local Smith-Waterman that soft-clips mismatched flanks. Dual-trigger requires both score reversal and ≥2-point margin
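The Phase 2.5 idea (pick the allele with the smaller Levenshtein distance, but only when it wins by more than one edit) can be sketched as below. This is an illustration under stated assumptions, not the Rust implementation: `closest_allele`, its arguments, and the `"NEITHER"` label are hypothetical, and the sequences are assumed to be the reconstructed read and the two candidate haplotypes.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def closest_allele(read_seq: str, ref_seq: str, alt_seq: str, margin: int = 1) -> str:
    """Return the nearer allele only if it wins by more than `margin` edits."""
    d_ref = levenshtein(read_seq, ref_seq)
    d_alt = levenshtein(read_seq, alt_seq)
    if d_ref + margin < d_alt:
        return "REF"
    if d_alt + margin < d_ref:
        return "ALT"
    return "NEITHER"  # too close to call safely


call = closest_allele("AACGTT", ref_seq="AACCGGTT", alt_seq="AACGTT")
```

With a margin of 1, a read that sits one edit from each allele stays unassigned rather than being forced to one side.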
🔧 Fixed¶
- Allele-based dispatch (`check_allele_with_qual`): Routes by `ref_len × alt_len` instead of unreliable `variant_type` string labels. SNP (1×1), insertion (1×N), deletion (N×1), MNP (N×N equal), complex (N×M unequal). This eliminates misrouting when callers emit inconsistent type annotations
- SW semiglobal argument order: Fixed `ref_hap`/`alt_hap` argument swap in semiglobal alignment that was scoring reads against the wrong haplotype
- Haplotype trimming removed: Eliminated shared symmetric trim that caused `slice index starts at 7 but ends at 6` panics on asymmetric indels; replaced with validated per-haplotype bounds
- MNP fallback: MNP reads now correctly fall through to SW alignment instead of silently returning "neither" on partial mismatches
- Dual-count guard: Prevents a single read from being counted as both REF and ALT when SW scores are exactly equal
- Soft-clip restriction: Soft-clipped bases no longer incorrectly contribute to variant region reconstruction
- Strand bias orientation: Strand bias (Fisher's exact) now couples to the winning allele, not the raw alignment orientation
- Interior REF quality proxy: Reads falling entirely within a large deletion (>50bp) now use median base quality instead of 0
- Interior REF guard removed: Eliminated the `has_large_cigar_del` guard that massively overcounted REF for large deletions by misclassifying ALT-supporting reads
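The length-based routing described for `check_allele_with_qual` can be illustrated with a short sketch. It assumes VCF-style, non-empty alleles; `dispatch_by_allele_length` and its lowercase labels are hypothetical names chosen for the example, and only the length table (1×1, 1×N, N×1, N×N, N×M) comes from the entry above.

```python
def dispatch_by_allele_length(ref: str, alt: str) -> str:
    """Pick a handler category from allele lengths alone (hypothetical helper)."""
    ref_len, alt_len = len(ref), len(alt)
    if ref_len == 1 and alt_len == 1:
        return "snp"        # 1×1
    if ref_len == 1:
        return "insertion"  # 1×N
    if alt_len == 1:
        return "deletion"   # N×1
    if ref_len == alt_len:
        return "mnp"        # N×N, equal lengths
    return "complex"        # N×M, unequal lengths


# The EPHA7 example from this release routes to the complex handler.
category = dispatch_by_allele_length("TCC", "CT")
```

Because the category is derived from the alleles themselves, a mislabeled `variant_type` string from an upstream caller cannot change the route.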
🧹 Chores¶
- Clippy: Removed unused `has_large_cigar_del` variable
- Tests: Updated `test_fuzzy_complex::TestLengthMismatch` expectation to reflect Phase 3 local SW fallback behavior
[2.6.1] - 2026-02-19¶
🔧 Fixed¶
- Per-haplotype trimming: Fixed `slice index starts at 7 but ends at 6` panic in `counting.rs` on asymmetric indels. Replaced shared symmetric trim with independent per-haplotype `trim_haplotype()` function that calculates bounds safely for each allele
✨ Added¶
- Tolerant REF validation: Variants with ≥90% REF match against the FASTA are now counted (status `PASS_WARN_REF_CORRECTED`) instead of being silently rejected. The FASTA REF is used for haplotype construction. Variants with <90% match are still rejected as `REF_MISMATCH`
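A rough sketch of the tolerant REF check follows. It assumes the match percentage is per-base identity between equal-length sequences, that a length mismatch is rejected, and that an exact match is reported as `PASS`; those three points and the function name are assumptions, while the 90% cutoff and the `PASS_WARN_REF_CORRECTED` and `REF_MISMATCH` statuses come from the entry above.

```python
def ref_validation_status(input_ref: str, fasta_ref: str, min_identity: float = 0.90) -> str:
    """Classify a variant's REF allele against the FASTA (illustrative only)."""
    if not input_ref or len(input_ref) != len(fasta_ref):
        return "REF_MISMATCH"  # assumption: unequal lengths cannot be corrected
    matches = sum(a == b for a, b in zip(input_ref.upper(), fasta_ref.upper()))
    identity = matches / len(input_ref)
    if identity == 1.0:
        return "PASS"  # assumed label for an exact match
    if identity >= min_identity:
        return "PASS_WARN_REF_CORRECTED"  # counted, with the FASTA REF used downstream
    return "REF_MISMATCH"


# One mismatch in ten bases is exactly 90% identity, which is still accepted.
status = ref_validation_status("ACGTACGTAC", "ACGTACGTAA")
```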
📚 Documentation¶
- Visual posters: Added overview, normalization, and read-filter/counting-metrics posters (JPG) to reference documentation pages with lightbox support
- Embedded PDFs: Added inline PDF viewer for allele classification guide and detailed overview presentation via `mkdocs-pdf` plugin
- Variant normalization: Updated REF validation docs with 3-tier flowchart, `PASS_WARN_REF_CORRECTED` status, and EGFR exon 19 real-world example
🔧 CI¶
- `deploy-docs.yml`: Added `mkdocs-pdf` to docs CI pip install dependencies
[2.6.0] - 2026-02-18¶
✨ Added¶
- Adaptive context padding: Dynamically increases `ref_context` flanking in tandem repeat regions (homopolymer through hexanucleotide). Formula: `max(default, repeat_span/2 + 3)`, capped at 50bp. Enabled by default (`--adaptive-context`/`--no-adaptive-context`)
- `gbcms normalize` command: Standalone variant normalization (left-align + REF validate) without counting, outputs TSV with original and normalized coordinates
- Nextflow parameters: `fragment_qual_threshold`, `context_padding`, `show_normalization`, `adaptive_context` now configurable in `nextflow.config`
- Docs restructure: Split monolithic `variant-counting.md` into 4 focused pages: Variant Normalization, Allele Classification, Counting Metrics, Read Filters
- HPC install docs: Micromamba-based source install with Python 3.13
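The adaptive padding formula, `max(default, repeat_span/2 + 3)` capped at 50bp, works out as in the sketch below. Integer division is an assumption, the function and parameter names are hypothetical, and the default padding of 5 in the usage lines is only an example value, not a documented default.

```python
def adaptive_padding(default_padding: int, repeat_span: int, cap: int = 50) -> int:
    """Flank size in bp for a variant inside a tandem repeat (illustrative only)."""
    # Assumption: repeat_span/2 is integer division.
    return min(cap, max(default_padding, repeat_span // 2 + 3))


no_repeat = adaptive_padding(5, 0)      # formula gives 3, so the default wins
short_repeat = adaptive_padding(5, 20)  # 20 // 2 + 3 = 13
long_repeat = adaptive_padding(5, 200)  # 103 exceeds the cap, so 50
```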
🔧 Fixed¶
- Interior REF guard for large deletions (>50bp): Reads falling entirely within a deleted region are now correctly classified as REF instead of ALT by Smith-Waterman
- Windowed reciprocal overlap: Improved shifted indel detection using bidirectional overlap scoring
- Complex variant counting (EPHA7 `TCC→CT`): Fixed base quality extraction for all variant-type handlers (`check_insertion`, `check_deletion`, `check_mnp`, `check_complex`)
- MAF VCF-style conversion: Corrected complex variant handling in MAF→internal coordinate conversion
- Lint: Fixed ruff I001/E402/B905 and black formatting in `pipeline.py`
🔄 Changed¶
- Dead code removed: `GenomicInterval` class, `Variant.interval` property, `fragment_counting` config field
- `fetch_single_base()` refactored: Delegates to `fetch_region()`, removing 33 lines of duplicated chr-prefix retry logic
- Release guide: Updated version locations table with exact line numbers and verification command
- Nextflow pipeline diagram added to docs index
[2.5.0] - 2026-02-12¶
✨ Added¶
- `--preserve-barcode` flag: Keeps original `Tumor_Sample_Barcode` from input MAF instead of overriding with BAM sample name (MAF→MAF workflows)
- `--column-prefix` parameter: Controls prefix for gbcms count columns in MAF output (default: none; use `--column-prefix t_` for legacy compatibility) ⚠️
- `CoordinateKernel`: Centralized MAF↔internal 0-based coordinate conversion with variant-type-aware logic for SNP, insertion, deletion, and complex variants
- Nextflow `FILTER_MAF` module: Per-sample MAF variant filtering by `Tumor_Sample_Barcode` supporting exact match, regex, and multi-select (comma-separated) modes
- Nextflow `PIPELINE_SUMMARY` module: Aggregated per-sample filtering statistics with formatted console output
- Nextflow `--filter_by_sample` parameter and samplesheet `tsb` column for multi-sample MAF workflows
- Nextflow documentation: Samplesheet `tsb` column guide, `--filter_by_sample` parameter reference, multi-sample MAF filtering examples
🔧 Fixed¶
- Fragment quality extraction (critical): All variant-type handlers (`check_insertion`, `check_deletion`, `check_mnp`, `check_complex`) now return actual `base_qual` from CIGAR walk instead of 0, which fixes systematic ALT undercount for indels in fragment-level consensus
- FILTER_MAF heredoc conflict: Restructured script from Python shebang to `python3 << 'PYEOF'` pattern, resolving `SyntaxError` from bash syntax in Python context
- FILTER_MAF string quoting: Changed to single-quoted Python strings for Nextflow variable interpolation to prevent CSV-parsed double-quote conflicts
- `splitCsv` quote handling: Added `quote:'"'` parameter for correct RFC 4180 parsing of comma-separated TSB values within quoted CSV fields
- mypy `no-redef` error: Removed redundant type annotation in `output.py` else branch
🔄 Changed¶
- MAF output column prefix default: Changed from `t_` to empty string (no prefix). Use `--column-prefix t_` for legacy `t_ref_count`/`t_alt_count` style columns ⚠️
- `MafWriter` refactored: MAF→MAF path preserves all original columns verbatim; VCF→MAF path builds row from GDC fieldnames with `CoordinateKernel` coordinate conversion
- Nextflow `GBCMS_RUN` input: Variants bundled into sample tuple `(meta, bam, bai, variants)` instead of separate channel
- Nextflow `GBCMS` workflow: Simplified to 2-channel interface (`ch_samples`, `ch_fasta`) from 3 channels
- Nextflow config: Added `column_prefix`, `preserve_barcode`, `filter_by_sample` parameters
[2.4.0] - 2026-02-10¶
✨ Added¶
- Fragment Consensus Engine: `FragmentEvidence` struct with u64 QNAME hashing and quality-weighted R1/R2 consensus; ambiguous fragments are discarded (not assigned to REF)
- `--fragment-qual-threshold`: New CLI option (default 10) controlling consensus quality difference for fragment conflict resolution
- Windowed Indel Detection: ±5bp positional scan with 3-layer safeguards (sequence identity, closest match, reference context validation)
- Quality-Aware Complex Matching: Masked comparison that ignores bases below `--min-baseq`; 3-case comparison (equal-length, ALT-only, REF-only) with ambiguity detection
- Variant Counting Guide: New `docs/reference/variant-counting.md` with algorithm diagrams for all variant types (~700 lines)
- MAF Normalization Docs: Added indel normalization and coordinate handling to `docs/reference/input-formats.md`
- 47 Tests: Up from 16; added `test_shifted_indels.py` (15), `test_fuzzy_complex.py` (14), `test_fragment_consensus.py` (4)
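The R1/R2 conflict rule behind the fragment consensus can be sketched roughly as follows. This is an illustration under stated assumptions rather than the `FragmentEvidence` logic itself: both mates are assumed to cover the variant, the quality gap is compared with `>=`, and `fragment_call` is a hypothetical name. Only the default threshold of 10 and the rule that ambiguous fragments are discarded come from the entries above.

```python
from typing import Optional


def fragment_call(r1_allele: str, r1_qual: int,
                  r2_allele: str, r2_qual: int,
                  qual_threshold: int = 10) -> Optional[str]:
    """Collapse two mates into one fragment-level call, or None if ambiguous."""
    if r1_allele == r2_allele:
        return r1_allele  # mates agree, so the fragment counts once
    if abs(r1_qual - r2_qual) >= qual_threshold:
        # Mates disagree, but one is clearly higher quality.
        return r1_allele if r1_qual > r2_qual else r2_allele
    return None  # ambiguous: discarded rather than defaulted to REF


agree = fragment_call("REF", 30, "REF", 12)
resolved = fragment_call("ALT", 37, "REF", 20)
ambiguous = fragment_call("ALT", 30, "REF", 25)
```

Returning `None` for the ambiguous case keeps a conflicting pair from inflating either allele count.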
🔄 Changed¶
- `--min-baseq` default: `0` → `20` (Phred Q20), which activates quality masking by default for improved accuracy on low-coverage samples ⚠️
- `--version` flag: Added to CLI (`gbcms --version`)
- Deploy-Docs Workflow: Replaced `mkdocs gh-deploy` with `mike` for multi-version documentation; deploys `stable` (tagged version) from main and `dev` from develop branch; added `extra.version.provider: mike` to `mkdocs.yml` with version switcher widget
🔧 Fixed¶
- Fragment double-counting bug: R1+R2 pairs previously counted as two independent observations; now collapsed via quality-weighted consensus
- MAF Input Hardening: Graceful handling of missing/malformed fields with warnings instead of crashes
- CI Release Pipeline: Stabilized manylinux builds by migrating from manylinux_2_28 to manylinux_2_34; resolved OpenSSL/CURL vendor conflicts via `docker-options` pattern
- Type stubs: `_rs.pyi` and `gbcms_rs.pyi` synced with Rust bindings (added `ref_context`, `ref_context_start`, `fragment_qual_threshold`)
- Linting: All files pass black, ruff, and mypy
📚 Documentation¶
- Architecture comparison table updated with windowed indels and masked comparison
- Nextflow config and docs updated with new `min_baseq` default
- Testing guide expanded with Phase 2a/2b test files
- HPC/RHEL 8 installation instructions updated with `clangdev` header management
- Release guide updated with docs version locations
- `.antigravity` project files updated with current Rust LOC (~1270) and test counts (47)
[2.3.0] - 2026-02-06¶
✨ Added¶
- Nextflow BAI Auto-Discovery: Checks `.bam.bai` and `.bai` extensions automatically
- Documentation Modernization: Hierarchical navigation, glightbox, panzoom, abbreviations
- Performance Benchmarks: cfDNA duplex sample metrics in documentation
- RHEL 8 Installation Guide: Conda-based source installation for legacy Linux
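The BAI auto-discovery rule (try `.bam.bai`, then `.bai`) lives in the Nextflow workflow; the Python sketch below only illustrates the lookup order. The function name and the preference for `.bam.bai` over `.bai` are assumptions.

```python
from pathlib import Path
from typing import Optional


def find_bai(bam: Path) -> Optional[Path]:
    """Return the first existing index next to a BAM, or None (illustrative only)."""
    candidates = [
        Path(str(bam) + ".bai"),  # sample.bam -> sample.bam.bai
        bam.with_suffix(".bai"),  # sample.bam -> sample.bai
    ]
    for candidate in candidates:
        if candidate.exists():
            return candidate
    return None
```

A caller would typically treat a `None` result as a hard error before counting starts.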
🔄 Changed¶
- Dockerfile: Added `procps`, `bash`, OCI labels, `maturin[patchelf]`, selective COPY
- Nextflow Config: `--platform linux/amd64`, shell config, local profile, observability (trace/report/timeline/dag)
- MkDocs: Switched to `navigation.sections`, 20+ abbreviations with hover tooltips
- GitHub Actions: Consolidated deploy-docs workflows, added caching and PR validation
- CI Wheels: Migrated from `manylinux_2_28` to `manylinux_2_34` (AlmaLinux 9 with OpenSSL 3.0+)
🔧 Fixed¶
- Nextflow: Empty `--suffix` argument no longer causes failures
- Admonitions: Converted GitHub-style alerts to MkDocs syntax
- CI Build: Resolved `curl-sys` OpenSSL version conflict by switching to manylinux_2_34
[2.2.0] - 2026-02-04¶
✨ Added¶
- Multi-platform Wheel Publishing: Maturin-based CI builds for Linux (x86_64, aarch64), macOS (Intel, Apple Silicon), and Windows
- Structured Logging: New `utils/logging.py` module with Rich console output, timing utilities, and log file support
- Mermaid Diagrams: Architecture documentation with interactive flowcharts
- Release Guide: Comprehensive `docs/RELEASE.md` with git-flow workflow
🔄 Changed¶
- Folder Restructure: Moved Rust code to `rust/` (bundled as `gbcms._rs`)
- Config Hierarchy: Nested Pydantic models (`ReadFilters`, `QualityThresholds`, `OutputConfig`) for better organization
- Code Quality: Added `__all__` exports, docstrings, and type hints across all modules
- StrEnum: Modern enum pattern with Python 3.10 backport
📚 Documentation¶
- New `docs/ARCHITECTURE.md` with system diagrams
- New `docs/DEVELOPMENT.md` (developer guide)
- New `docs/TESTING.md` (testing guide)
- Updated MkDocs with mermaid2 plugin and snippet includes
[2.1.2] - 2025-11-25¶
🔧 Fixed¶
- PyPI Distribution: Fixed source distribution size issue by correctly excluding large files (tests, docs, etc.) via `pyproject.toml` configuration.
[2.1.1] - 2025-11-25 [YANKED]¶
Yanked Release
This release was yanked from PyPI due to a source distribution size limit error. Use 2.1.2 instead.
🔧 Fixed¶
- PyPI Distribution: Added MANIFEST.in (failed to work with Hatchling) to reduce source distribution size
- Documentation: Added comprehensive Installation guide
- Documentation: Unified Contributing guide (merged code + docs contributions)
- Documentation: Added Changelog to documentation navigation
[2.1.0] - 2025-11-25¶
✨ Added¶
Nextflow Workflow¶
- Production-ready Nextflow workflow for processing multiple samples in parallel
- SLURM cluster support with customizable queue configuration
- Per-sample suffix support via optional `suffix` column in samplesheet
- Docker and Singularity profiles for containerized execution
- Automatic BAI index discovery with validation
- Resume capability for failed workflow runs
- Resource management with automatic retry and scaling
- Comprehensive documentation in `docs/NEXTFLOW.md` and `nextflow/README.md`
Documentation¶
- Usage pattern comparison guide (`docs/WORKFLOWS.md`) for choosing between CLI and Nextflow
- MkDocs integration for beautiful GitHub Pages documentation
- Local documentation preview with live reload (`mkdocs serve`)
- Staging deployment from `develop` branch for testing docs
- Production deployment from `main` branch
- Reorganized documentation structure with clear CLI vs Nextflow separation
- CLI Quick Start guide (`docs/quick-start.md`)
🔧 Changed¶
- Documentation workflow: docs now live on `main` branch with automated deployment
- GitBook integration: configured to read from `main` branch
- Nextflow module: improved parameter passing with meta.suffix support
📝 Documentation¶
- Complete Nextflow workflow guide with SLURM examples
- Per-sample suffix usage examples
- Git-flow documentation workflow guide
- Local preview instructions
- Updated README with clear usage pattern separation
[2.0.0] - 2025-11-21¶
🚀 Major Rewrite¶
Version 2.0.0 represents a complete rewrite of py-gbcms with a focus on performance, correctness, and modern architecture.
✨ Added¶
Core Features¶
- Rust-based Counting Engine: Hybrid Python/Rust architecture for 20x+ performance improvement
- Strand Bias Statistics: Fisher's exact test p-values and odds ratios for both reads (`SB_PVAL`, `SB_OR`) and fragments (`FSB_PVAL`, `FSB_OR`)
- Fragment-Level Counting: Majority-rule fragment counting with strand-specific counts (`RDF`, `ADF`)
- Variant Allele Fractions: Read-level (`VAF`) and fragment-level (`FAF`) allele fraction calculations
- Thread Control: Explicit control over parallelism via `--threads` argument (default: 1)
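For readers unfamiliar with the strand-bias statistic, a two-sided Fisher's exact test on the 2×2 table of REF/ALT by forward/reverse counts can be sketched in pure Python as below. gbcms computes this in Rust via the `statrs` crate; this sketch is an independent illustration, its function and argument names are hypothetical, and its two-sided convention (summing all tables no more likely than the observed one) is an assumption that may differ in detail from the library's.

```python
from math import comb


def fisher_exact_two_sided(ref_fwd: int, ref_rev: int, alt_fwd: int, alt_rev: int) -> float:
    """Two-sided Fisher's exact p-value for a 2x2 strand table (illustrative)."""
    row_ref = ref_fwd + ref_rev
    row_alt = alt_fwd + alt_rev
    col_fwd = ref_fwd + alt_fwd
    denom = comb(row_ref + row_alt, col_fwd)

    def prob(x: int) -> float:
        # Hypergeometric probability of x forward-strand REF reads
        # given the fixed row and column totals.
        return comb(row_ref, x) * comb(row_alt, col_fwd - x) / denom

    p_obs = prob(ref_fwd)
    lo = max(0, col_fwd - row_alt)
    hi = min(row_ref, col_fwd)
    # Sum every table that is no more likely than the observed one.
    p = sum(q for q in (prob(x) for x in range(lo, hi + 1)) if q <= p_obs * (1 + 1e-7))
    return min(1.0, p)


# REF reads mostly forward (8 vs 2), ALT reads mostly reverse (1 vs 5).
p_value = fisher_exact_two_sided(8, 2, 1, 5)
```

A small p-value here indicates that the two alleles are distributed differently across strands, which is the pattern a strand-bias filter looks for.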
Input/Output¶
- VCF Output Format: Standard VCF with comprehensive INFO and FORMAT fields
- MAF Output Format: Extended MAF with custom columns for strand counts and statistics
- Column Preservation: Input MAF columns are preserved in output
- Multiple BAM Support: Process multiple samples via `--bam-list` or repeated `--bam` arguments
- Sample ID Override: Explicit sample naming via `--bam sample_id:path` syntax
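The `sample_id:path` form of `--bam` can be parsed with a single split on the first colon. The sketch below is hypothetical: the helper name is invented, and falling back to the file stem when no ID is given is an assumption made for the example rather than documented gbcms behavior. It also does not handle Windows drive-letter paths.

```python
from pathlib import Path
from typing import Tuple


def parse_bam_arg(value: str) -> Tuple[str, Path]:
    """Split 'sample_id:path' into its parts (hypothetical helper)."""
    sample_id, sep, path = value.partition(":")
    if sep and sample_id and path:
        return sample_id, Path(path)
    bam = Path(value)
    return bam.stem, bam  # assumption: fall back to the file name without extension


named = parse_bam_arg("tumor01:/data/t.bam")
unnamed = parse_bam_arg("/data/normal.bam")
```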
Filters¶
- `--filter-duplicates`: Filter duplicate reads (default: enabled)
- `--filter-secondary`: Filter secondary alignments
- `--filter-supplementary`: Filter supplementary alignments
- `--filter-qc-failed`: Filter reads that failed QC
- `--filter-improper-pair`: Filter improperly paired reads
- `--filter-indel`: Filter reads with indels in CIGAR
CLI & Usability¶
- Modern CLI: Built with Typer and Rich for beautiful terminal output
- Progress Tracking: Real-time progress bars and status indicators
- Direct Invocation: Use `gbcms run` instead of `python -m gbcms.cli`
- Output Customization: `--suffix` flag for output filename customization
- Flexible Input: Support for both VCF and MAF input formats
Infrastructure¶
- Docker Support: Production-ready multi-stage Dockerfile with optimized layers
- Type Safety: Full type annotations with mypy support
- Type Stubs: Provided `.pyi` stub file for Rust extension
- Comprehensive Tests: Extended test suite with accuracy and filter validation
- CI/CD: GitHub Actions workflows for testing, linting, and releases
🔄 Changed¶
Architecture¶
- Migrated from pure Python to hybrid Python/Rust architecture
- Core counting logic implemented in Rust using `rust-htslib`
- Data parallelism over variants with per-thread BAM readers
Output Formats¶
- VCF FORMAT fields: Strand-specific counts now use comma-separated values (e.g., `RD=5,3` for forward,reverse)
- MAF columns: Standardized column names (`t_ref_count_forward`, `t_alt_count_reverse`, etc.)
- Coordinate System: Internal 0-based indexing with correct conversion for VCF (1-based) and MAF output
Performance¶
- Speed: 20x+ faster than v1.x on typical datasets
- Memory: Efficient per-thread BAM readers with minimal overhead
- Scalability: Configurable thread pool for optimal resource usage
Dependencies¶
- Python: Updated to require Python ≥3.10
- Rust: pyo3 0.27.1, rust-htslib 0.51.0, statrs 0.18.0
- Python Packages: pysam ≥0.21.0, typer ≥0.9.0, rich ≥13.0.0, pydantic ≥2.0.0
🗑️ Removed¶
- Legacy Python Counting: Pure Python implementation removed in favor of Rust
- Old CLI: Deprecated `python -m gbcms.cli` entry point
- Unused Dependencies: Removed `cyvcf2` and `numba` (no longer needed)
- Pre-commit Hooks: Removed in favor of explicit linting in CI
🐛 Fixed¶
- Correct handling of complex variants (MNPs, DelIns)
- Proper strand assignment for fragment counting
- Reference validation against FASTA for all variant types
- Thread-safe BAM access with per-thread readers
📚 Documentation¶
- Complete rewrite of all documentation
- New guides: `INSTALLATION.md`, `CLI_FEATURES.md`, `INPUT_OUTPUT.md`
- Comprehensive API documentation
- Docker usage examples
- Contributing guidelines updated
🔧 Technical Details¶
Rust Components¶
- `gbcms._rs`: PyO3-based Rust extension (bundled in wheel)
- Fisher's exact test via `statrs` crate
- Rayon-based parallelism with configurable thread pools
- Safe memory management with Rust's ownership model
Testing¶
- 16 comprehensive test cases
- Accuracy validation with synthetic BAM files
- Filter validation for all read flag combinations
- Integration tests with real-world data
⚠️ Breaking Changes¶
Version 2.0.0 is not backward compatible with 1.x. Key breaking changes:
- CLI syntax: Use `gbcms run` instead of `python -m gbcms.cli`
- Output format: VCF/MAF column structures have changed
- Default behavior: Only duplicate filtering enabled by default (was: all filters)
- Dependencies: Requires Rust toolchain for installation from source
- Python version: Minimum Python 3.10 (was: 3.8)
📦 Installation¶
# From PyPI (includes pre-built wheels)
pip install gbcms
# From source (requires Rust)
pip install git+https://github.com/msk-access/gbcms.git
# Docker
docker pull ghcr.io/msk-access/gbcms:2.0.0
🙏 Acknowledgments¶
This rewrite was designed and implemented with a focus on correctness, performance, and modern best practices in bioinformatics software development.
[1.x] - Legacy¶
Previous versions (1.x) used a pure Python implementation. See git history for details.