Developer Guide

This guide covers the Krewlyzer codebase architecture for contributors.

Repository Structure

krewlyzer/
├── src/krewlyzer/          # Python package
│   ├── cli.py              # Typer CLI entry point
│   ├── wrapper.py          # run-all orchestration (750 lines)
│   ├── assets.py           # AssetManager for bundled data
│   ├── extract.py          # BAM → BED extraction
│   ├── fsc.py, fsd.py, ... # Standalone tools
│   ├── core/               # Shared processors
│   │   ├── asset_resolution.py  # Target/PON resolution logic
│   │   ├── logging.py      # Startup banner and logging
│   │   ├── gc_assets.py    # GC resolution helper
│   │   ├── fsc_processor.py
│   │   ├── wps_processor.py
│   │   └── feature_serializer.py  # JSON output
│   ├── pon/                # PON model code
│   │   ├── model.py        # PonModel dataclass
│   │   └── build.py        # PON building logic
│   └── data/               # Bundled assets
├── rust/                   # Rust backend
│   ├── Cargo.toml
│   └── src/
│       ├── lib.rs          # PyO3 module exports
│       ├── pipeline.rs     # Unified pipeline entry
│       ├── fsc.rs, wps.rs  # Feature modules
│       └── gc_correction.rs
├── tests/                  # Test suite
├── docs/                   # MkDocs documentation
└── nextflow/               # Nextflow pipeline

Rust Backend Architecture

Module Structure

lib.rs
├── extract_motif     # BAM extraction module
├── gc                # GC correction module  
├── run_unified_pipeline  # Main pipeline function
└── configure_threads # Thread pool setup

Key Rust Modules

| Module | Lines | Purpose |
|--------|------:|---------|
| wps.rs | 2000+ | Dual-stream WPS, FFT, smoothing |
| gc_correction.rs | 590 | LOESS GC bias correction |
| gc_reference.rs | 658 | Asset generation (once per genome) |
| fsc.rs | 500+ | 5-bin fragment counting |
| fsd.rs | 600+ | Per-arm size distribution |
| pipeline.rs | 400+ | Unified pipeline coordination |
| extract_motif.rs | 1200+ | BAM reading, motif extraction |
| motif_utils.rs | 176 | Shared 4-mer encoding, MDS, GC utils |
| region_mds.rs | 800+ | Per-gene MDS at exon boundaries |
| region_entropy.rs | 600+ | TFBS/ATAC size entropy |

Unified Pipeline

All feature computation goes through run_unified_pipeline:

pub fn run_unified_pipeline(
    bedgz_path: &str,
    gc_ref_path: Option<&str>,
    valid_regions_path: Option<&str>,
    gc_factors_out: Option<&str>,
    gc_factors_in: Option<&str>,
    fsc_bins_path: Option<&str>,
    fsc_out_path: Option<&str>,
    wps_anchors_path: Option<&str>,
    wps_out_path: Option<&str>,
    wps_bg_path: Option<&str>,
    wps_bg_out_path: Option<&str>,
    include_empty: bool,
    fsd_arms_path: Option<&str>,
    fsd_out_path: Option<&str>,
    ocf_regions_path: Option<&str>,
    ocf_out_path: Option<&str>,
    target_regions_path: Option<&str>,
    bait_padding: u32,
    silent: bool,
) -> PyResult<()>
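
Most callers need only a few of these outputs, so it can help to build the argument set in one place. The sketch below is illustrative only: `build_pipeline_args`, the default values, and the file names are hypothetical and not part of Krewlyzer. It assumes that passing `None` for a path leaves the corresponding feature disabled, which should be checked against pipeline.rs.

```python
# Parameter names copied from the run_unified_pipeline signature, in order.
PIPELINE_PARAMS = (
    "bedgz_path", "gc_ref_path", "valid_regions_path",
    "gc_factors_out", "gc_factors_in",
    "fsc_bins_path", "fsc_out_path",
    "wps_anchors_path", "wps_out_path",
    "wps_bg_path", "wps_bg_out_path", "include_empty",
    "fsd_arms_path", "fsd_out_path",
    "ocf_regions_path", "ocf_out_path",
    "target_regions_path", "bait_padding", "silent",
)

# Assumed placeholder values for the three non-optional, non-path parameters.
NON_PATH_DEFAULTS = {"include_empty": False, "bait_padding": 0, "silent": False}


def build_pipeline_args(bedgz_path: str, **overrides) -> dict:
    """Return every pipeline argument, in signature order, as a dict."""
    unknown = set(overrides) - set(PIPELINE_PARAMS)
    if unknown:
        raise TypeError(f"unknown pipeline parameter(s): {sorted(unknown)}")
    # Start with every parameter unset, then fill in defaults and overrides.
    args = {name: None for name in PIPELINE_PARAMS}
    args.update(NON_PATH_DEFAULTS)
    args["bedgz_path"] = bedgz_path
    args.update(overrides)
    return args


# Request FSC output only (hypothetical file names).
args = build_pipeline_args(
    "sample.bed.gz",
    fsc_bins_path="bins.bed",
    fsc_out_path="sample.FSC.tsv",
)
# A positional call would preserve the Rust parameter order:
# _core.run_unified_pipeline(*args.values())
```

Rejecting unknown names early catches typos in a parameter list that is too long to check by eye.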

Adding a New Feature

1. Create Rust module (rust/src/new_feature.rs)
2. Add to lib.rs module exports
3. Create Python wrapper (src/krewlyzer/new_feature.py)
4. Add to CLI in cli.py
5. Integrate with run_unified_pipeline if applicable
6. Add to FeatureSerializer for JSON output
7. Write tests in tests/unit/ and tests/integration/
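
As a rough illustration of the Python wrapper step, the sketch below validates the input, derives an output path, and delegates to a backend callable. Everything here is hypothetical: `run_new_feature`, the output naming, and `fake_backend` (which stands in for the Rust function exported through lib.rs) are assumptions, not Krewlyzer code.

```python
import tempfile
from pathlib import Path
from typing import Callable


def run_new_feature(
    bedgz: Path,
    output_dir: Path,
    sample_name: str,
    backend: Callable[[str, str], None],
) -> Path:
    """Validate inputs, derive the output path, and delegate to the backend."""
    if not bedgz.is_file():
        raise FileNotFoundError(f"input not found: {bedgz}")
    output_dir.mkdir(parents=True, exist_ok=True)
    out_path = output_dir / f"{sample_name}.new_feature.tsv"  # hypothetical naming
    backend(str(bedgz), str(out_path))
    return out_path


def fake_backend(bed_path: str, out_path: str) -> None:
    """Stand-in for the Rust function; writes a header-only TSV."""
    Path(out_path).write_text("region\tvalue\n")


# Exercise the wrapper against a temporary directory.
workdir = Path(tempfile.mkdtemp())
bed = workdir / "sample.bed.gz"
bed.touch()
result = run_new_feature(bed, workdir / "out", "sample", fake_backend)
```

Passing the backend in as a parameter is a design choice of this sketch: it lets the unit tests cover the wrapper without a compiled Rust extension.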

Python Architecture

Tool Pattern

All standalone tools follow this pattern:

def tool_name(
    input: Path = typer.Option(...),
    output: Path = typer.Option(...),
    # ... further options, including genome, sample_name, and gc_correct used below
):
    # 1. Initialize AssetManager
    assets = AssetManager(genome)

    # 2. Resolve GC assets (use helper)
    from .core.gc_assets import resolve_gc_assets
    gc = resolve_gc_assets(assets, output, sample_name, input, gc_correct, genome)

    # 3. Call Rust backend
    _core.run_unified_pipeline(...)

    # 4. Post-process (Python)
    from .core.tool_processor import process_tool
    process_tool(output_file, ...)

AssetManager

Centralizes access to bundled data:

assets = AssetManager("hg19")

# Properties
assets.gc_reference      # GC ref parquet
assets.valid_regions     # Valid regions BED
assets.bins_100kb        # Default bins
assets.wps_anchors       # WPS anchors BED

# Assay-aware methods
assets.get_gene_bed("xs2")     # Panel gene BED
assets.get_wps_anchors("xs2")  # Panel WPS anchors
assets.get_pon("xs2")          # Bundled PON
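
To make the access pattern concrete, here is a minimal stand-in showing genome-scoped properties alongside assay-aware methods. It is not Krewlyzer's AssetManager: the class name `SketchAssetManager`, the directory layout, and the file names are all assumptions chosen for illustration.

```python
from pathlib import Path


class SketchAssetManager:
    """Illustrative stand-in; not Krewlyzer's AssetManager."""

    def __init__(self, genome: str, data_dir: Path = Path("data")):
        self.genome = genome
        self.root = data_dir / genome  # hypothetical layout: data/<genome>/

    @property
    def gc_reference(self) -> Path:
        # Genome-wide asset, exposed as a read-only property.
        return self.root / "gc_reference.parquet"  # hypothetical file name

    @property
    def wps_anchors(self) -> Path:
        return self.root / "wps_anchors.bed.gz"  # hypothetical file name

    def get_wps_anchors(self, assay: str) -> Path:
        # Assay-aware lookup: same asset, scoped to a panel subdirectory.
        return self.root / assay / "wps_anchors.bed.gz"


sketch = SketchAssetManager("hg19")
default_anchors = sketch.wps_anchors
panel_anchors = sketch.get_wps_anchors("xs2")
```

Tools written against this kind of interface ask for an asset by role and never build paths themselves, so the bundled data layout can change in one place.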

FeatureSerializer

Collects all features into one unified JSON file:

serializer = FeatureSerializer(sample_id)
serializer.add_fsc(fsc_df)
serializer.add_wps(wps_df)
serializer.add_motif(edm_dict, bpm_dict, mds)
# ...
serializer.save(output_file)

# Or load from existing outputs
serializer = FeatureSerializer.from_outputs(output_dir, sample_id)
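
The collect-then-save pattern can be sketched with the standard library alone. This is a simplified stand-in, not the real FeatureSerializer: `SketchSerializer`, the generic `add` method, the JSON layout, and the sample values are assumptions, and plain lists of dicts take the place of DataFrames.

```python
import json
import tempfile
from pathlib import Path


class SketchSerializer:
    """Illustrative stand-in; not Krewlyzer's FeatureSerializer."""

    def __init__(self, sample_id: str):
        self.sample_id = sample_id
        self.features: dict = {}

    def add(self, name: str, records: list) -> "SketchSerializer":
        # Store one feature table under its name; returns self for chaining.
        self.features[name] = records
        return self

    def save(self, path: Path) -> None:
        # Write every collected feature into a single JSON document.
        payload = {"sample_id": self.sample_id, "features": self.features}
        path.write_text(json.dumps(payload, indent=2))


serializer = SketchSerializer("sample01")
serializer.add("fsc", [{"bin": 0, "short": 12, "long": 30}])  # made-up values
out = Path(tempfile.mkdtemp()) / "sample01.features.json"
serializer.save(out)
loaded = json.loads(out.read_text())
```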

Testing

Krewlyzer has 239 tests across unit, integration, and e2e categories.

See the Testing Guide for complete documentation, including:

- Feature → test file mapping
- Fixture reference
- Test writing examples

Quick Commands

# All tests
pytest tests/

# Fast unit tests only
pytest tests/unit/

# Specific feature
pytest tests/unit/test_fsc.py

# With coverage
pytest tests/ --cov=krewlyzer --cov-report=html

Tip

Modifying a feature? Check the Feature → Test Map to find which test file to update.

Rust Development

Building

cd rust

# Debug build (fast compile, slow run)
maturin develop

# Release build (slow compile, fast run)
maturin develop --release

Testing Rust

# Run Rust tests (currently minimal)
cargo test

# Check before committing
cargo clippy
cargo fmt --check

Debugging

# Enable verbose logging
RUST_LOG=debug krewlyzer run-all -i sample.bam ...

# Or in Python
python -c "
import logging
logging.basicConfig(level=logging.DEBUG)
from krewlyzer import _core
# ... your test code
"

Code Style

Python

Use typer for CLIs
Use rich for logging/progress
Type hints for all public functions
Docstrings (Google style)
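
For reference, a public function that follows the last two conventions might look like the sketch below. The function `gc_fraction` is a made-up example and does not exist in the codebase.

```python
def gc_fraction(sequence: str) -> float:
    """Compute the GC fraction of a DNA sequence.

    Args:
        sequence: DNA string, case-insensitive. Characters other than
            G and C (including N) count toward the length only.

    Returns:
        Fraction of bases that are G or C, between 0.0 and 1.0.

    Raises:
        ValueError: If ``sequence`` is empty.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)
```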

Rust

Run cargo fmt before committing
Address cargo clippy warnings
Use anyhow::Result for error handling
Log with log::info!, log::debug!

Contributing Checklist

See CONTRIBUTING.md for full guidelines.