Developer Guide

This guide covers the Krewlyzer codebase architecture for contributors.

Repository Structure

krewlyzer/
├── src/krewlyzer/          # Python package
│   ├── cli.py              # Typer CLI entry point
│   ├── wrapper.py          # run-all orchestration
│   ├── assets.py           # AssetManager for bundled data
│   ├── extract.py          # BAM → BED extraction
│   ├── fsc.py              # Fragment size coverage
│   ├── fsd.py              # Fragment size distribution
│   ├── fsr.py              # Fragment size ratio
│   ├── wps.py              # Windowed protection score
│   ├── ocf.py              # Orientation-aware fragmentation
│   ├── motif.py            # End motif analysis
│   ├── mfsd.py             # Mutant fragment size distribution
│   ├── region_entropy.py   # TFBS/ATAC region entropy
│   ├── region_mds.py       # Per-gene MDS
│   ├── uxm.py              # Fragment-level methylation
│   ├── build_gc_reference.py # GC reference generation
│   ├── core/               # Shared processors
│   │   ├── asset_resolution.py  # Target/PON resolution logic
│   │   ├── asset_validation.py  # Asset validation checks
│   │   ├── bam_utils.py         # BAM utilities
│   │   ├── feature_serializer.py # JSON output
│   │   ├── fsc_processor.py     # FSC post-processing
│   │   ├── fsd_processor.py     # FSD post-processing
│   │   ├── fsr_processor.py     # FSR post-processing
│   │   ├── gc_assets.py         # GC resolution helper
│   │   ├── gene_bed.py          # Gene BED parsing
│   │   ├── logging.py           # Startup banner and logging
│   │   ├── motif_processor.py   # Motif post-processing
│   │   ├── ocf_processor.py     # OCF post-processing
│   │   ├── pon_integration.py   # PON post-processing
│   │   ├── region_entropy_processor.py # TFBS/ATAC processor
│   │   ├── sample_processor.py  # Per-sample orchestration
│   │   ├── unified_processor.py # Unified pipeline Python layer
│   │   ├── wps_processor.py     # WPS post-processing
│   │   └── utils.py, resource_utils.py
│   ├── pon/                # PON model code
│   │   ├── model.py        # PonModel dataclass
│   │   └── build.py        # PON building logic
│   └── data/               # Bundled assets (Git LFS)
├── rust/                   # Rust backend (19 modules)
│   ├── Cargo.toml
│   └── src/
│       ├── lib.rs          # PyO3 module exports
│       ├── pipeline.rs     # Unified pipeline entry
│       ├── engine.rs       # Core engine utilities
│       ├── bed.rs          # BGZF/gzip BED reader
│       ├── filters.rs      # Fragment filtering logic
│       └── (feature modules: see table below)
├── tests/                  # Test suite (244 tests, 4 skipped)
├── docs/                   # MkDocs documentation
└── nextflow/               # Nextflow pipeline
    ├── main.nf
    ├── nextflow.config
    ├── modules/local/       # Per-tool NF modules
    └── subworkflows/local/  # Subworkflows

Rust Backend Architecture

Module Structure

lib.rs
├── extract_motif     # BAM extraction module
├── gc                # GC correction module  
├── run_unified_pipeline  # Main pipeline function
└── configure_threads # Thread pool setup

Key Rust Modules

| Module | Purpose |
|--------|---------|
| lib.rs | PyO3 module exports and thread config |
| pipeline.rs | Unified pipeline coordination |
| engine.rs | Core engine utilities |
| bed.rs | BGZF/gzip BED reader |
| extract_motif.rs | BAM parsing, fragment + motif extraction |
| motif_utils.rs | Shared 4-mer encoding, MDS, GC utils |
| fsc.rs | Fragment size coverage + gene aggregation |
| fsd.rs | Per-arm size distribution + PON log-ratio |
| wps.rs | Dual-stream WPS, FFT, smoothing |
| ocf.rs | Orientation-aware fragmentation + PON z-score |
| mfsd.rs | Mutant fragment size distribution |
| region_entropy.rs | TFBS/ATAC entropy + PON z-score |
| region_mds.rs | Per-gene MDS at exon boundaries |
| uxm.rs | Fragment-level methylation (UXM) |
| gc_correction.rs | LOESS GC bias correction |
| gc_reference.rs | Pre-computed GC reference generation |
| pon_model.rs | PON model loading and hybrid correction |
| pon_builder.rs | PON model construction |
| filters.rs | Fragment filtering logic |

Unified Pipeline

All feature computation goes through run_unified_pipeline:

pub fn run_unified_pipeline(
    _py: Python,
    bed_path: PathBuf,
    // GC Correction
    gc_ref_path: Option<PathBuf>,
    valid_regions_path: Option<PathBuf>,
    correction_out_path: Option<PathBuf>,
    correction_input_path: Option<PathBuf>,
    // FSC
    fsc_bins: Option<PathBuf>, fsc_output: Option<PathBuf>,
    // WPS Foreground (TSS/CTCF anchors)
    wps_regions: Option<PathBuf>, wps_output: Option<PathBuf>,
    // WPS Background (Alu stacking)
    wps_background_regions: Option<PathBuf>, wps_background_output: Option<PathBuf>,
    wps_empty: bool,
    // FSD
    fsd_arms: Option<PathBuf>, fsd_output: Option<PathBuf>,
    // OCF
    ocf_regions: Option<PathBuf>, ocf_output: Option<PathBuf>,
    // Target regions for on/off-target split (panel mode)
    target_regions_path: Option<PathBuf>,
    bait_padding: u64,
    // Output format: "tsv", "parquet", or "both"
    output_format: &str,
    // Gzip-compress TSV outputs
    compress: bool,
    silent: bool,
) -> PyResult<()>
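
Most parameters pair an input path with an output path for one feature, and every `Option<PathBuf>` accepts `None` from Python. The sketch below shows one way a caller might assemble those paired arguments. It is an illustration, not Krewlyzer's actual wrapper code: `FEATURE_ARGS` and `build_pipeline_kwargs` are hypothetical names, and it assumes that the Python keyword names match the Rust parameter names and that leaving a pair as `None` skips that feature. The remaining parameters (GC correction, target regions, `bait_padding`, `wps_empty`, `compress`, `silent`) would still need to be supplied.

```python
from pathlib import Path
from typing import Dict, Optional, Tuple, Union

PathLike = Union[str, Path]

# Hypothetical lookup table: feature name -> (input parameter, output parameter).
# The parameter names are copied from the Rust signature above.
FEATURE_ARGS = {
    "fsc": ("fsc_bins", "fsc_output"),
    "wps": ("wps_regions", "wps_output"),
    "wps_background": ("wps_background_regions", "wps_background_output"),
    "fsd": ("fsd_arms", "fsd_output"),
    "ocf": ("ocf_regions", "ocf_output"),
}


def build_pipeline_kwargs(
    bed_path: PathLike,
    features: Dict[str, Tuple[PathLike, PathLike]],
    output_format: str = "tsv",
) -> Dict[str, Optional[str]]:
    """Build the paired input/output keyword arguments for the pipeline call.

    `features` maps a feature name to its (input_path, output_path) pair.
    Features that are not listed get None for both parameters.
    """
    if output_format not in {"tsv", "parquet", "both"}:
        raise ValueError(f"unsupported output_format: {output_format!r}")

    unknown = set(features) - set(FEATURE_ARGS)
    if unknown:
        raise ValueError(f"unknown features: {sorted(unknown)}")

    kwargs: Dict[str, Optional[str]] = {
        "bed_path": str(bed_path),
        "output_format": output_format,
    }
    for name, (in_arg, out_arg) in FEATURE_ARGS.items():
        if name in features:
            in_path, out_path = features[name]
            kwargs[in_arg] = str(in_path)
            kwargs[out_arg] = str(out_path)
        else:
            # Assumption: None for both paths means the feature is skipped.
            kwargs[in_arg] = None
            kwargs[out_arg] = None
    return kwargs


kwargs = build_pipeline_kwargs(
    "sample.bed.gz",
    {"fsc": ("bins.bed", "sample.FSC.tsv")},
)
```

Keeping the parameter names in one table means a renamed Rust parameter only has to be updated in a single place on the Python side.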

Adding a New Feature

1. Create the Rust module (rust/src/new_feature.rs)
2. Add it to the lib.rs module exports
3. Create the Python wrapper (src/krewlyzer/new_feature.py)
4. Add the command to the CLI in cli.py
5. Integrate with run_unified_pipeline if applicable
6. Add the feature to FeatureSerializer for JSON output
7. Write tests in tests/unit/ and tests/integration/

Python Architecture

Tool Pattern

All standalone tools follow this pattern:

def tool_name(
    input: Path = typer.Option(...),
    output: Path = typer.Option(...),
    # ... further options, including genome, sample_name and gc_correct used below
):
    # 1. Initialize AssetManager
    assets = AssetManager(genome)

    # 2. Resolve GC assets (use helper)
    from .core.gc_assets import resolve_gc_assets
    gc = resolve_gc_assets(assets, output, sample_name, input, gc_correct, genome)

    # 3. Call Rust backend
    _core.run_unified_pipeline(...)

    # 4. Post-process (Python)
    from .core.tool_processor import process_tool
    process_tool(output_file, ...)

AssetManager

Centralizes access to bundled data:

assets = AssetManager("hg19")

# Properties
assets.gc_reference      # GC ref parquet
assets.valid_regions     # Valid regions BED
assets.bins_100kb        # Default bins
assets.wps_anchors       # WPS anchors BED

# Assay-aware methods
assets.get_gene_bed("xs2")     # Panel gene BED
assets.get_wps_anchors("xs2")  # Panel WPS anchors
assets.get_pon("xs2")          # Bundled PON

FeatureSerializer

Collects all features into one unified JSON document:

serializer = FeatureSerializer(sample_id, version="X.Y.Z")
serializer.add_fsc(fsc_df)
serializer.add_fsc_e1(fsc_e1_df)
serializer.add_wps(wps_df)
serializer.add_motif(edm_df, bpm_df, mds, mds_z)
serializer.add_ocf(ocf_df)
serializer.add_mfsd(mfsd_df)
serializer.add_uxm(uxm_df)
serializer.add_qc("total_fragments", 1234567)
serializer.save(output_dir)

# Or load from existing outputs
serializer = FeatureSerializer.from_outputs(
    sample_id=sample_id,
    output_dir=output_dir,
    version="X.Y.Z"
)

Testing

Krewlyzer has 244 tests (4 skipped) across unit, integration, and e2e categories.

→ Testing Guide for complete documentation, including:

- Feature → test file mapping
- Fixture reference
- Test writing examples

Quick Commands

# All tests
pytest tests/

# Fast unit tests only
pytest tests/unit/

# Specific feature
pytest tests/unit/test_fsc.py

# With coverage
pytest tests/ --cov=krewlyzer --cov-report=html

Tip

Modifying a feature? Check the Feature → Test Map to find which test file to update.

Rust Development

Building

Always build from the project root, not rust/

Running maturin develop from rust/ builds a krewlyzer_core wheel that does NOT update the .so at src/krewlyzer/_core.cpython-*.so. Running it from the project root builds the krewlyzer wheel, which correctly installs the extension module.

# Debug build (fast compile, slow run)
maturin develop

# Release build (slow compile, fast run)
maturin develop --release

# Verify the installed build timestamp
python -c "import krewlyzer._core as c; import os, datetime; print(datetime.datetime.fromtimestamp(os.path.getmtime(c.__file__)))"
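
The timestamp check above can also be automated. The sketch below compares the extension's modification time with the newest `.rs` source file, which is one way to notice that a build did not update the installed `.so`. `newest_mtime` and `extension_is_stale` are hypothetical helper names, not part of Krewlyzer, and the approach assumes file modification times are reliable on your system (a fresh checkout or `touch` can skew them).

```python
from pathlib import Path
from typing import Iterable, Optional


def newest_mtime(paths: Iterable[Path]) -> Optional[float]:
    """Return the latest modification time among paths, or None if there are none."""
    mtimes = [p.stat().st_mtime for p in paths]
    return max(mtimes) if mtimes else None


def extension_is_stale(extension_file: Path, rust_src_dir: Path) -> bool:
    """Return True if any .rs file under rust_src_dir is newer than extension_file."""
    newest_source = newest_mtime(rust_src_dir.rglob("*.rs"))
    if newest_source is None:
        # No Rust sources found, so there is nothing to compare against.
        return False
    return newest_source > extension_file.stat().st_mtime
```

From the project root, a call such as `extension_is_stale(Path(krewlyzer._core.__file__), Path("rust/src"))` would then report whether a rebuild is likely needed.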

Testing Rust

# Run Rust tests (currently minimal)
cargo test

# Check before committing
cargo clippy -- -D warnings
cargo fmt --check

Debugging

# Enable verbose logging
RUST_LOG=debug krewlyzer run-all -i sample.bam ...

# Or in Python
python -c "
import logging
logging.basicConfig(level=logging.DEBUG)
from krewlyzer import _core
# ... your test code
"

Code Style

Python

Use typer for CLIs
Use rich for logging/progress
Type hints for all public functions
Docstrings (Google style)

Rust

Run cargo fmt before committing
Address cargo clippy warnings
Use anyhow::Result for error handling
Log with log::info!, log::debug!

Known Gotchas

`Path.with_suffix()` and Compound Extensions

NEVER use Path.with_suffix() on paths with compound dot-separated names

Krewlyzer output files use compound names like sample.MDS.exon, sample.EndMotif.ontarget, sample.FSC.regions.e1only. Python's Path.with_suffix() replaces only the last dot-segment, which silently corrupts these paths:

Expression	Expected	Actual
`Path("P-XXX.MDS.exon").with_suffix(".tsv")`	`P-XXX.MDS.exon.tsv`	`P-XXX.MDS.tsv` ❌
`Path("P-XXX.MDS.ontarget").with_suffix(".tsv")`	`P-XXX.MDS.ontarget.tsv`	`P-XXX.MDS.tsv` ❌
`Path("P-XXX.EndMotif").with_suffix(".tsv")`	`P-XXX.EndMotif.tsv`	`P-XXX.tsv` ❌

Safe pattern — always use string concatenation:

# ✅ CORRECT: preserves all dot-segments
output_path = base.parent / (base.name + ".tsv")

# ❌ WRONG: replaces last dot-segment
output_path = base.with_suffix(".tsv")

Compound names in krewlyzer (all vulnerable to with_suffix()): MDS.exon, MDS.gene, MDS.ontarget, EndMotif, EndMotif.ontarget, BreakPointMotif, EndMotif1mer, FSC.gene, FSC.regions, FSC.regions.e1only, OCF.sync, TFBS.sync, ATAC.sync.

This gotcha caused a silent data loss bug in v0.8.0 where MDS.exon.tsv and MDS.gene.tsv were never generated. Test coverage: tests/unit/test_compound_extension.py.
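
As a small illustration of the safe pattern, the helper below wraps the concatenation in a function and contrasts it with with_suffix(). `append_suffix` is a hypothetical name used here for illustration, not an existing Krewlyzer function, and `PurePosixPath` is used only so the printed paths look the same on every platform.

```python
from pathlib import PurePath, PurePosixPath


def append_suffix(base: PurePath, suffix: str) -> PurePath:
    """Append suffix to the full file name, keeping every existing dot-segment."""
    return base.parent / (base.name + suffix)


base = PurePosixPath("out/P-XXX.MDS.exon")
safe = append_suffix(base, ".tsv")
unsafe = base.with_suffix(".tsv")

print(safe)    # out/P-XXX.MDS.exon.tsv
print(unsafe)  # out/P-XXX.MDS.tsv
```

Routing every output path through one helper like this keeps the rule in a single place, so a stray with_suffix() call is easier to spot in review.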

`_core.pyi` — Rust Extension Stub Maintenance

src/krewlyzer/_core.pyi is a type stub for the compiled Rust/PyO3 extension (krewlyzer._core). It tells mypy what functions and submodules exist without inspecting the binary .so — following the same pattern as py-gbcms _rs.pyi.

Update this file whenever you:

Add a new #[pyfunction] to any rust/src/*.rs file
Add a new sub-PyModule registered in rust/src/lib.rs
Change the signature (parameters or return type) of an existing exported function

# After updating rust/src/*.rs, update the stub and verify:
python -m mypy src/krewlyzer/ --ignore-missing-imports --no-error-summary
# Must exit 0 with no output. Commit .rs and .pyi changes together.

Stub drift causes CI failures

If _core.pyi is out of sync, mypy will report attr-defined errors in Python files that call the Rust functions, even if the code runs correctly at runtime.
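
To catch drift before mypy does, the names declared in the stub can be compared with the names the compiled module exports at runtime. The sketch below is one possible approach, not an existing Krewlyzer script: `stub_names`, `runtime_names` and `stub_drift` are hypothetical helpers, and it only compares top-level functions and classes, so submodules and changed signatures still need the mypy check above.

```python
import ast
from types import SimpleNamespace
from typing import Dict, Set


def stub_names(stub_source: str) -> Set[str]:
    """Collect top-level function and class names declared in a .pyi stub."""
    tree = ast.parse(stub_source)
    wanted = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    return {node.name for node in tree.body if isinstance(node, wanted)}


def runtime_names(module: object) -> Set[str]:
    """Collect public attribute names exposed by a module-like object."""
    return {name for name in dir(module) if not name.startswith("_")}


def stub_drift(stub_source: str, module: object) -> Dict[str, Set[str]]:
    """Report names present on one side but not the other."""
    declared = stub_names(stub_source)
    exported = runtime_names(module)
    return {
        "missing_from_stub": exported - declared,
        "missing_from_runtime": declared - exported,
    }


# Stand-in data for illustration; a real check would read _core.pyi
# and pass the imported krewlyzer._core module instead.
example_stub = (
    "def run_unified_pipeline(bed_path: str) -> None: ...\n"
    "def configure_threads(n: int) -> None: ...\n"
)
example_module = SimpleNamespace(
    run_unified_pipeline=lambda: None,
    new_function=lambda: None,
)
drift = stub_drift(example_stub, example_module)
```

Here `drift["missing_from_stub"]` contains `new_function`, the kind of export that would otherwise surface as an attr-defined error.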

Contributing Checklist

Code follows existing patterns
Added/updated tests
Updated documentation
Ran pytest tests/ — 244 pass, 4 skipped
If Rust functions changed: updated src/krewlyzer/_core.pyi stub

Ran Python lint (matches CI lint job):

ruff check src/krewlyzer/
black --check src/krewlyzer/
python -m mypy src/krewlyzer/
python scripts/check_output_format.py

Ran Rust lint:

cargo fmt && cargo clippy -- -D warnings

Updated CHANGELOG.md

See CONTRIBUTING.md for full guidelines.