Developer Guide
This guide covers the Krewlyzer codebase architecture for contributors.
Repository Structure
krewlyzer/
├── src/krewlyzer/ # Python package
│ ├── cli.py # Typer CLI entry point
│ ├── wrapper.py # run-all orchestration (750 lines)
│ ├── assets.py # AssetManager for bundled data
│ ├── extract.py # BAM → BED extraction
│ ├── fsc.py, fsd.py, ... # Standalone tools
│ ├── core/ # Shared processors
│ │ ├── asset_resolution.py # Target/PON resolution logic
│ │ ├── logging.py # Startup banner and logging
│ │ ├── gc_assets.py # GC resolution helper
│ │ ├── fsc_processor.py
│ │ ├── wps_processor.py
│ │ └── feature_serializer.py # JSON output
│ ├── pon/ # PON model code
│ │ ├── model.py # PonModel dataclass
│ │ └── build.py # PON building logic
│ └── data/ # Bundled assets
├── rust/ # Rust backend
│ ├── Cargo.toml
│ └── src/
│ ├── lib.rs # PyO3 module exports
│ ├── pipeline.rs # Unified pipeline entry
│ ├── fsc.rs, wps.rs # Feature modules
│ └── gc_correction.rs
├── tests/ # Test suite
├── docs/ # MkDocs documentation
└── nextflow/ # Nextflow pipeline
Rust Backend Architecture
Module Structure
lib.rs
├── extract_motif # BAM extraction module
├── gc # GC correction module
├── run_unified_pipeline # Main pipeline function
└── configure_threads # Thread pool setup
Key Rust Modules
| Module | Lines | Purpose |
|--------|------:|---------|
| wps.rs | 2000+ | Dual-stream WPS, FFT, smoothing |
| gc_correction.rs | 590 | LOESS GC bias correction |
| gc_reference.rs | 658 | Asset generation (once per genome) |
| fsc.rs | 500+ | 5-bin fragment counting |
| fsd.rs | 600+ | Per-arm size distribution |
| pipeline.rs | 400+ | Unified pipeline coordination |
| extract_motif.rs | 1200+ | BAM reading, motif extraction |
| motif_utils.rs | 176 | Shared 4-mer encoding, MDS, GC utils |
| region_mds.rs | 800+ | Per-gene MDS at exon boundaries |
| region_entropy.rs | 600+ | TFBS/ATAC size entropy |
Unified Pipeline
All feature computation goes through run_unified_pipeline:
pub fn run_unified_pipeline(
    bedgz_path: &str,
    gc_ref_path: Option<&str>,
    valid_regions_path: Option<&str>,
    gc_factors_out: Option<&str>,
    gc_factors_in: Option<&str>,
    fsc_bins_path: Option<&str>,
    fsc_out_path: Option<&str>,
    wps_anchors_path: Option<&str>,
    wps_out_path: Option<&str>,
    wps_bg_path: Option<&str>,
    wps_bg_out_path: Option<&str>,
    include_empty: bool,
    fsd_arms_path: Option<&str>,
    fsd_out_path: Option<&str>,
    ocf_regions_path: Option<&str>,
    ocf_out_path: Option<&str>,
    target_regions_path: Option<&str>,
    bait_padding: u32,
    silent: bool,
) -> PyResult<()>
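Most parameters in this signature are optional paths, which suggests that a feature is requested by supplying its input and output paths and skipped by leaving them unset. The following is a minimal sketch of a Python helper that assembles such an argument set. It assumes the PyO3 binding accepts keyword arguments named exactly like the Rust parameters, which this guide does not confirm. The helper build_pipeline_kwargs, the file names, and the default values for include_empty, bait_padding, and silent are hypothetical.

```python
# Parameter names copied from the Rust signature above. Whether the Python
# binding accepts them as keyword arguments is an assumption.
PIPELINE_PARAMS = (
    "bedgz_path", "gc_ref_path", "valid_regions_path",
    "gc_factors_out", "gc_factors_in",
    "fsc_bins_path", "fsc_out_path",
    "wps_anchors_path", "wps_out_path", "wps_bg_path", "wps_bg_out_path",
    "include_empty",
    "fsd_arms_path", "fsd_out_path",
    "ocf_regions_path", "ocf_out_path",
    "target_regions_path", "bait_padding", "silent",
)


def build_pipeline_kwargs(bedgz_path: str, **overrides) -> dict:
    """Hypothetical helper: start with every optional path unset, then apply overrides."""
    kwargs = {name: None for name in PIPELINE_PARAMS}
    kwargs.update(
        bedgz_path=bedgz_path,
        include_empty=False,  # illustrative default, not taken from the source
        bait_padding=0,       # illustrative default, not taken from the source
        silent=False,         # illustrative default, not taken from the source
    )
    # Reject misspelled names early instead of passing them to the backend.
    unknown = set(overrides) - set(PIPELINE_PARAMS)
    if unknown:
        raise TypeError(f"unknown pipeline parameter(s): {sorted(unknown)}")
    kwargs.update(overrides)
    return kwargs


# Request FSC only: every other feature path stays None.
fsc_only = build_pipeline_kwargs(
    "sample.bed.gz",
    fsc_bins_path="bins.bed",
    fsc_out_path="sample.FSC.tsv",
)
```

Keeping the parameter names in one tuple means a typo in a feature path fails in Python with a readable message rather than at the language boundary.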
Adding a New Feature
- Create Rust module (rust/src/new_feature.rs)
- Add to lib.rs module exports
- Create Python wrapper (src/krewlyzer/new_feature.py)
- Add to CLI in cli.py
- Integrate with run_unified_pipeline if applicable
- Add to FeatureSerializer for JSON output
- Write tests in tests/unit/ and tests/integration/
Python Architecture
Tool Pattern
All standalone tools follow this pattern:
def tool_name(
    input: Path = typer.Option(...),
    output: Path = typer.Option(...),
    # ... options
):
    # 1. Initialize AssetManager
    assets = AssetManager(genome)
    # 2. Resolve GC assets (use helper)
    from .core.gc_assets import resolve_gc_assets
    gc = resolve_gc_assets(assets, output, sample_name, input, gc_correct, genome)
    # 3. Call Rust backend
    _core.run_unified_pipeline(...)
    # 4. Post-process (Python)
    from .core.tool_processor import process_tool
    process_tool(output_file, ...)
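The order of these steps can be exercised without the compiled backend by passing each step in as a callable. This is a minimal sketch under that assumption, not how Krewlyzer is structured: run_tool, GcInputs, the output file naming, and the stand-in lambdas are all hypothetical, and asset initialization (step 1) is folded into the resolve_gc callable for brevity.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional


@dataclass
class GcInputs:
    """Hypothetical container for the GC paths a tool hands to the backend."""
    gc_ref_path: Optional[str]
    valid_regions_path: Optional[str]


def run_tool(
    input_path: Path,
    output_dir: Path,
    resolve_gc: Callable[[bool], GcInputs],
    backend: Callable[..., None],
    postprocess: Callable[[Path], None],
    gc_correct: bool = True,
) -> Path:
    """Resolve assets, call the backend, then post-process its output."""
    gc = resolve_gc(gc_correct)  # steps 1-2: assets and GC resolution
    out_file = output_dir / f"{input_path.stem}.tool.tsv"  # hypothetical naming
    backend(  # step 3: stands in for the compiled backend call
        bedgz_path=str(input_path),
        gc_ref_path=gc.gc_ref_path,
        valid_regions_path=gc.valid_regions_path,
    )
    postprocess(out_file)  # step 4: Python-side post-processing
    return out_file


# Stand-ins that only record what they were called with.
calls = []
result = run_tool(
    Path("sample.bed.gz"),
    Path("out"),
    resolve_gc=lambda on: GcInputs(
        "gc.parquet" if on else None,
        "valid.bed" if on else None,
    ),
    backend=lambda **kw: calls.append(("backend", kw)),
    postprocess=lambda p: calls.append(("postprocess", p)),
)
```

Injecting the steps this way lets a unit test check the call order and the arguments without building the Rust extension.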
AssetManager
Centralizes access to bundled data:
assets = AssetManager("hg19")
# Properties
assets.gc_reference # GC ref parquet
assets.valid_regions # Valid regions BED
assets.bins_100kb # Default bins
assets.wps_anchors # WPS anchors BED
# Assay-aware methods
assets.get_gene_bed("xs2") # Panel gene BED
assets.get_wps_anchors("xs2") # Panel WPS anchors
assets.get_pon("xs2") # Bundled PON
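The split between properties and assay-aware methods can be illustrated with a toy class: properties depend only on the genome build, while methods take a panel name as well. This is a sketch only. ToyAssetManager, the data directory, and every file name in it are hypothetical, since this guide does not show how the real AssetManager lays out its bundled data.

```python
from pathlib import Path


class ToyAssetManager:
    """Hypothetical stand-in; not the real AssetManager."""

    def __init__(self, genome: str, data_dir: Path = Path("data")) -> None:
        self.genome = genome
        self.root = data_dir / genome

    @property
    def valid_regions(self) -> Path:
        # Genome-wide asset: depends only on the genome build.
        return self.root / "valid_regions.bed"  # hypothetical file name

    def get_wps_anchors(self, assay: str) -> Path:
        # Assay-aware asset: depends on the genome build and the panel name.
        return self.root / assay / "wps_anchors.bed"  # hypothetical file name


toy = ToyAssetManager("hg19")
genome_wide = toy.valid_regions
panel_specific = toy.get_wps_anchors("xs2")
```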
FeatureSerializer
Collects all features into unified JSON:
serializer = FeatureSerializer(sample_id)
serializer.add_fsc(fsc_df)
serializer.add_wps(wps_df)
serializer.add_motif(edm_dict, bpm_dict, mds)
# ...
serializer.save(output_file)
# Or load from existing outputs
serializer = FeatureSerializer.from_outputs(output_dir, sample_id)
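The collect-then-save flow can be sketched with the standard library alone. The class below is a toy, not the real FeatureSerializer: ToySerializer, its JSON layout (a sample_id plus a features mapping), the output file name, and the record values are all hypothetical and chosen only to show the shape of the workflow.

```python
import json
import tempfile
from pathlib import Path


class ToySerializer:
    """Hypothetical stand-in that gathers per-feature records into one JSON file."""

    def __init__(self, sample_id: str) -> None:
        self.payload = {"sample_id": sample_id, "features": {}}

    def add(self, name: str, records: list) -> "ToySerializer":
        # Store one feature's records under its name; returns self for chaining.
        self.payload["features"][name] = records
        return self

    def save(self, path: Path) -> None:
        # Write the whole payload as a single JSON document.
        Path(path).write_text(json.dumps(self.payload, indent=2))


serializer = ToySerializer("sample01")
serializer.add("fsc", [{"region": "chr1:0-100000", "count": 42}])  # made-up values

# Write to a temporary directory and read the file back.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "sample01.features.json"
    serializer.save(out)
    restored = json.loads(out.read_text())
```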
Testing
Krewlyzer has 239 tests across unit, integration, and e2e categories.
See the Testing Guide for complete documentation, including:
- Feature → test file mapping
- Fixture reference
- Test writing examples
Quick Commands
# All tests
pytest tests/
# Fast unit tests only
pytest tests/unit/
# Specific feature
pytest tests/unit/test_fsc.py
# With coverage
pytest tests/ --cov=krewlyzer --cov-report=html
Tip
Modifying a feature? Check the Feature → Test Map to find which test file to update.
Rust Development
Building
cd rust
# Debug build (fast compile, slow run)
maturin develop
# Release build (slow compile, fast run)
maturin develop --release
Testing Rust
# Run Rust tests (currently minimal)
cargo test
# Check before committing
cargo clippy
cargo fmt --check
Debugging
# Enable verbose logging
RUST_LOG=debug krewlyzer run-all -i sample.bam ...
# Or in Python
python -c "
import logging
logging.basicConfig(level=logging.DEBUG)
from krewlyzer import _core
# ... your test code
"
Code Style
Python
- Use typer for CLIs
- Use rich for logging/progress
- Type hints for all public functions
- Docstrings (Google style)
Rust
- Run cargo fmt before committing
- Address cargo clippy warnings
- Use anyhow::Result for error handling
- Log with log::info!, log::debug!
Contributing Checklist
- Code follows existing patterns
- Added/updated tests
- Updated documentation
- Ran pytest tests/
- Ran cargo fmt && cargo clippy
- Updated CHANGELOG.md
See CONTRIBUTING.md for full guidelines.