Developer Guide
This guide covers the Krewlyzer codebase architecture for contributors.
Repository Structure
krewlyzer/
├── src/krewlyzer/ # Python package
│ ├── cli.py # Typer CLI entry point
│ ├── wrapper.py # run-all orchestration
│ ├── assets.py # AssetManager for bundled data
│ ├── extract.py # BAM → BED extraction
│ ├── fsc.py # Fragment size coverage
│ ├── fsd.py # Fragment size distribution
│ ├── fsr.py # Fragment size ratio
│ ├── wps.py # Windowed protection score
│ ├── ocf.py # Orientation-aware fragmentation
│ ├── motif.py # End motif analysis
│ ├── mfsd.py # Mutant fragment size distribution
│ ├── region_entropy.py # TFBS/ATAC region entropy
│ ├── region_mds.py # Per-gene MDS
│ ├── uxm.py # Fragment-level methylation
│ ├── build_gc_reference.py # GC reference generation
│ ├── core/ # Shared processors
│ │ ├── asset_resolution.py # Target/PON resolution logic
│ │ ├── asset_validation.py # Asset validation checks
│ │ ├── bam_utils.py # BAM utilities
│ │ ├── feature_serializer.py # JSON output
│ │ ├── fsc_processor.py # FSC post-processing
│ │ ├── fsd_processor.py # FSD post-processing
│ │ ├── fsr_processor.py # FSR post-processing
│ │ ├── gc_assets.py # GC resolution helper
│ │ ├── gene_bed.py # Gene BED parsing
│ │ ├── logging.py # Startup banner and logging
│ │ ├── motif_processor.py # Motif post-processing
│ │ ├── ocf_processor.py # OCF post-processing
│ │ ├── pon_integration.py # PON post-processing
│ │ ├── region_entropy_processor.py # TFBS/ATAC processor
│ │ ├── sample_processor.py # Per-sample orchestration
│ │ ├── unified_processor.py # Unified pipeline Python layer
│ │ ├── wps_processor.py # WPS post-processing
│ │ └── utils.py, resource_utils.py
│ ├── pon/ # PON model code
│ │ ├── model.py # PonModel dataclass
│ │ └── build.py # PON building logic
│ └── data/ # Bundled assets (Git LFS)
├── rust/ # Rust backend (19 modules)
│ ├── Cargo.toml
│ └── src/
│ ├── lib.rs # PyO3 module exports
│ ├── pipeline.rs # Unified pipeline entry
│ ├── engine.rs # Core engine utilities
│ ├── bed.rs # BGZF/gzip BED reader
│ ├── filters.rs # Fragment filtering logic
│ └── (feature modules: see table below)
├── tests/ # Test suite (244 tests, 4 skipped)
├── docs/ # MkDocs documentation
└── nextflow/ # Nextflow pipeline
  ├── main.nf
  ├── nextflow.config
  ├── modules/local/ # Per-tool NF modules
  └── subworkflows/local/ # Subworkflows
Rust Backend Architecture
Module Structure
lib.rs
├── extract_motif # BAM extraction module
├── gc # GC correction module
├── run_unified_pipeline # Main pipeline function
└── configure_threads # Thread pool setup
Key Rust Modules
| Module | Purpose |
|--------|---------|
| lib.rs | PyO3 module exports and thread config |
| pipeline.rs | Unified pipeline coordination |
| engine.rs | Core engine utilities |
| bed.rs | BGZF/gzip BED reader |
| extract_motif.rs | BAM parsing, fragment + motif extraction |
| motif_utils.rs | Shared 4-mer encoding, MDS, GC utils |
| fsc.rs | Fragment size coverage + gene aggregation |
| fsd.rs | Per-arm size distribution + PON log-ratio |
| wps.rs | Dual-stream WPS, FFT, smoothing |
| ocf.rs | Orientation-aware fragmentation + PON z-score |
| mfsd.rs | Mutant fragment size distribution |
| region_entropy.rs | TFBS/ATAC entropy + PON z-score |
| region_mds.rs | Per-gene MDS at exon boundaries |
| uxm.rs | Fragment-level methylation (UXM) |
| gc_correction.rs | LOESS GC bias correction |
| gc_reference.rs | Pre-computed GC reference generation |
| pon_model.rs | PON model loading and hybrid correction |
| pon_builder.rs | PON model construction |
| filters.rs | Fragment filtering logic |
Unified Pipeline
All feature computation goes through run_unified_pipeline:
pub fn run_unified_pipeline(
    _py: Python,
    bed_path: PathBuf,
    // GC Correction
    gc_ref_path: Option<PathBuf>,
    valid_regions_path: Option<PathBuf>,
    correction_out_path: Option<PathBuf>,
    correction_input_path: Option<PathBuf>,
    // FSC
    fsc_bins: Option<PathBuf>, fsc_output: Option<PathBuf>,
    // WPS Foreground (TSS/CTCF anchors)
    wps_regions: Option<PathBuf>, wps_output: Option<PathBuf>,
    // WPS Background (Alu stacking)
    wps_background_regions: Option<PathBuf>, wps_background_output: Option<PathBuf>,
    wps_empty: bool,
    // FSD
    fsd_arms: Option<PathBuf>, fsd_output: Option<PathBuf>,
    // OCF
    ocf_regions: Option<PathBuf>, ocf_output: Option<PathBuf>,
    // Target regions for on/off-target split (panel mode)
    target_regions_path: Option<PathBuf>,
    bait_padding: u64,
    // Output format: "tsv", "parquet", or "both"
    output_format: &str,
    // Gzip-compress TSV outputs
    compress: bool,
    silent: bool,
) -> PyResult<()>
Adding a New Feature
- Create Rust module (rust/src/new_feature.rs)
- Add to lib.rs module exports
- Create Python wrapper (src/krewlyzer/new_feature.py)
- Add to CLI in cli.py
- Integrate with run_unified_pipeline if applicable
- Add to FeatureSerializer for JSON output
- Write tests in tests/unit/ and tests/integration/
Python Architecture
Tool Pattern
All standalone tools follow this pattern:
def tool_name(
    input: Path = typer.Option(...),
    output: Path = typer.Option(...),
    # ... options
):
    # 1. Initialize AssetManager
    assets = AssetManager(genome)
    # 2. Resolve GC assets (use helper)
    from .core.gc_assets import resolve_gc_assets
    gc = resolve_gc_assets(assets, output, sample_name, input, gc_correct, genome)
    # 3. Call Rust backend
    _core.run_unified_pipeline(...)
    # 4. Post-process (Python)
    from .core.tool_processor import process_tool
    process_tool(output_file, ...)
AssetManager
Centralizes access to bundled data:
assets = AssetManager("hg19")
# Properties
assets.gc_reference # GC ref parquet
assets.valid_regions # Valid regions BED
assets.bins_100kb # Default bins
assets.wps_anchors # WPS anchors BED
# Assay-aware methods
assets.get_gene_bed("xs2") # Panel gene BED
assets.get_wps_anchors("xs2") # Panel WPS anchors
assets.get_pon("xs2") # Bundled PON
FeatureSerializer
Collects all features into unified JSON:
serializer = FeatureSerializer(sample_id, version="X.Y.Z")
serializer.add_fsc(fsc_df)
serializer.add_fsc_e1(fsc_e1_df)
serializer.add_wps(wps_df)
serializer.add_motif(edm_df, bpm_df, mds, mds_z)
serializer.add_ocf(ocf_df)
serializer.add_mfsd(mfsd_df)
serializer.add_uxm(uxm_df)
serializer.add_qc("total_fragments", 1234567)
serializer.save(output_dir)
# Or load from existing outputs
serializer = FeatureSerializer.from_outputs(
    sample_id=sample_id,
    output_dir=output_dir,
    version="X.Y.Z"
)
Testing
Krewlyzer has 244 tests (4 skipped) across unit, integration, and e2e categories.
See the Testing Guide for complete documentation, including:
- Feature → test file mapping
- Fixture reference
- Test writing examples
Quick Commands
# All tests
pytest tests/
# Fast unit tests only
pytest tests/unit/
# Specific feature
pytest tests/unit/test_fsc.py
# With coverage
pytest tests/ --cov=krewlyzer --cov-report=html
Tip
Modifying a feature? Check the Feature → Test Map to find which test file to update.
Rust Development
Building
Always build from project root, not rust/
Running maturin develop from rust/ builds a krewlyzer_core wheel that
does NOT update the .so at src/krewlyzer/_core.cpython-*.so.
Running from the project root builds the krewlyzer wheel which correctly
installs the extension module.
# Debug build (fast compile, slow run)
maturin develop
# Release build (slow compile, fast run)
maturin develop --release
# Verify the installed build timestamp
python -c "import krewlyzer._core as c; import os, datetime; print(datetime.datetime.fromtimestamp(os.path.getmtime(c.__file__)))"
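The timestamp check can also be automated. The following is a minimal sketch and not part of Krewlyzer: `is_stale` is a hypothetical helper that reports whether any source file is newer than a built artifact, which hints that `maturin develop` should be rerun from the project root. Modification times are only a heuristic, since a fresh checkout or a `touch` can change them.

```python
from pathlib import Path
from typing import Iterable


def is_stale(artifact: Path, sources: Iterable[Path]) -> bool:
    """Return True if any source file was modified after the artifact.

    Hypothetical helper for illustration; compares file modification times only.
    """
    built_at = artifact.stat().st_mtime
    return any(src.stat().st_mtime > built_at for src in sources)
```

With Krewlyzer installed, it could be called from the project root as `is_stale(Path(krewlyzer._core.__file__), Path("rust/src").glob("*.rs"))`.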
Testing Rust
# Run Rust tests (currently minimal)
cargo test
# Check before committing
cargo clippy -- -D warnings
cargo fmt --check
Debugging
# Enable verbose logging
RUST_LOG=debug krewlyzer run-all -i sample.bam ...
# Or in Python
python -c "
import logging
logging.basicConfig(level=logging.DEBUG)
from krewlyzer import _core
# ... your test code
"
Code Style
Python
- Use typer for CLIs
- Use rich for logging/progress
- Type hints for all public functions
- Docstrings (Google style)
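To illustrate the type-hint and docstring conventions, here is a minimal sketch of a public function written in that style. The function `fragment_length` and its half-open coordinate convention are hypothetical, chosen only for the example, and are not part of the Krewlyzer API.

```python
def fragment_length(start: int, end: int) -> int:
    """Return the length of a fragment given half-open coordinates.

    Hypothetical helper used only to illustrate the docstring style.

    Args:
        start: 0-based start position (inclusive).
        end: End position (exclusive); must not be less than ``start``.

    Returns:
        The fragment length in base pairs.

    Raises:
        ValueError: If ``end`` is less than ``start``.
    """
    if end < start:
        raise ValueError(f"end ({end}) is less than start ({start})")
    return end - start
```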
Rust
- Run cargo fmt before committing
- Address cargo clippy warnings
- Use anyhow::Result for error handling
- Log with log::info!, log::debug!
Known Gotchas
Path.with_suffix() and Compound Extensions
NEVER use Path.with_suffix() on paths with compound dot-separated names
Krewlyzer output files use compound names like sample.MDS.exon, sample.EndMotif.ontarget,
sample.FSC.regions.e1only. Python's Path.with_suffix() replaces only the last dot-segment,
which silently corrupts these paths:
| Expression | Expected | Actual |
|---|---|---|
| Path("P-XXX.MDS.exon").with_suffix(".tsv") | P-XXX.MDS.exon.tsv | P-XXX.MDS.tsv ❌ |
| Path("P-XXX.MDS.ontarget").with_suffix(".tsv") | P-XXX.MDS.ontarget.tsv | P-XXX.MDS.tsv ❌ |
| Path("P-XXX.EndMotif").with_suffix(".tsv") | P-XXX.EndMotif.tsv | P-XXX.tsv ❌ |
Safe pattern — always use string concatenation:
# ✅ CORRECT: preserves all dot-segments
output_path = base.parent / (base.name + ".tsv")
# ❌ WRONG: replaces last dot-segment
output_path = base.with_suffix(".tsv")
Compound names in krewlyzer (all vulnerable to with_suffix()):
MDS.exon, MDS.gene, MDS.ontarget, EndMotif, EndMotif.ontarget,
BreakPointMotif, EndMotif1mer, FSC.gene, FSC.regions,
FSC.regions.e1only, OCF.sync, TFBS.sync, ATAC.sync.
This gotcha caused a silent data loss bug in v0.8.0 where MDS.exon.tsv and
MDS.gene.tsv were never generated. Test coverage: tests/unit/test_compound_extension.py.
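The safe pattern can be wrapped in a small helper so that call sites do not repeat the concatenation. The sketch below is illustrative only; `append_suffix` is a hypothetical name, not an existing Krewlyzer function, and the `out` directory is made up for the demo.

```python
from pathlib import Path


def append_suffix(base: Path, suffix: str) -> Path:
    """Append ``suffix`` to the full file name, keeping every dot-segment."""
    return base.parent / (base.name + suffix)


base = Path("out") / "P-XXX.MDS.exon"

# with_suffix() treats ".exon" as the extension and replaces it.
unsafe = base.with_suffix(".tsv")

# Concatenation keeps "MDS.exon" intact and adds ".tsv" at the end.
safe = append_suffix(base, ".tsv")

print(unsafe.name)  # P-XXX.MDS.tsv
print(safe.name)    # P-XXX.MDS.exon.tsv
```

Both `exon` and `ontarget` inputs collapse to the same `P-XXX.MDS.tsv` under `with_suffix()`, so one output can silently overwrite the other; the helper keeps them distinct.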
_core.pyi — Rust Extension Stub Maintenance
src/krewlyzer/_core.pyi is a type stub for the compiled Rust/PyO3 extension
(krewlyzer._core). It tells mypy what functions and submodules exist without
inspecting the binary .so — following the same pattern as
py-gbcms _rs.pyi.
Update this file whenever you:
- Add a new #[pyfunction] to any rust/src/*.rs file
- Add a new sub-PyModule registered in rust/src/lib.rs
- Change the signature (parameters or return type) of an existing exported function
# After updating rust/src/*.rs, update the stub and verify:
python -m mypy src/krewlyzer/ --ignore-missing-imports --no-error-summary
# Must exit 0 with no output. Commit .rs and .pyi changes together.
Stub drift causes CI failures
If _core.pyi is out of sync, mypy will report attr-defined errors in
Python files that call the Rust functions, even if the code runs correctly at runtime.
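For orientation, a stub entry for run_unified_pipeline might look like the sketch below. It is derived from the Rust signature shown in the Unified Pipeline section, assuming the usual PyO3 conversions (PathBuf accepts path-like values, Option<T> becomes Optional, u64 becomes int) and that the `_py: Python` token is not exposed to Python callers. The `PathLike` alias is introduced here for brevity. The real _core.pyi may differ in defaults, keyword-only markers, or parameter order, so treat the Rust source as authoritative.

```python
import os
from typing import Optional, Union

# Alias introduced for this sketch only; not necessarily present in _core.pyi.
PathLike = Union[str, "os.PathLike[str]"]


def run_unified_pipeline(
    bed_path: PathLike,
    # GC correction
    gc_ref_path: Optional[PathLike],
    valid_regions_path: Optional[PathLike],
    correction_out_path: Optional[PathLike],
    correction_input_path: Optional[PathLike],
    # FSC
    fsc_bins: Optional[PathLike],
    fsc_output: Optional[PathLike],
    # WPS foreground and background
    wps_regions: Optional[PathLike],
    wps_output: Optional[PathLike],
    wps_background_regions: Optional[PathLike],
    wps_background_output: Optional[PathLike],
    wps_empty: bool,
    # FSD
    fsd_arms: Optional[PathLike],
    fsd_output: Optional[PathLike],
    # OCF
    ocf_regions: Optional[PathLike],
    ocf_output: Optional[PathLike],
    # Panel mode
    target_regions_path: Optional[PathLike],
    bait_padding: int,
    # Output options
    output_format: str,
    compress: bool,
    silent: bool,
) -> None: ...
```

Keeping the parameter names and order identical to the Rust function makes drift easy to spot in review, since a changed .rs signature without a matching .pyi change stands out in the diff.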
Contributing Checklist
- Code follows existing patterns
- Added/updated tests
- Updated documentation
- Ran pytest tests/ (244 pass, 4 skipped)
- If Rust functions changed: updated src/krewlyzer/_core.pyi stub
- Ran Python lint (matches CI lint job)
- Ran Rust lint (see Testing Rust)
- Updated CHANGELOG.md
See CONTRIBUTING.md for full guidelines.