Architecture

Krewlyzer uses a hybrid Python/Rust architecture for optimal performance and usability.

Overview

flowchart TB
    subgraph CLI["Python CLI (Typer)"]
        cli[krewlyzer CLI]
        wrapper[wrapper.py]
    end

    subgraph Python["Python Layer"]
        extract[extract.py]
        motif[motif.py]
        fsc[fsc.py]
        fsr[fsr.py]
        wps[wps.py]
        ocf[ocf.py]
        mfsd[mfsd.py]
        uxm[uxm.py]
        region_mds[region_mds.py]
    end

    subgraph Rust["Rust Core (_core)"]
        lib[lib.rs]
        extract_motif[extract_motif.rs]
        motif_utils[motif_utils.rs]
        region_mds_rs[region_mds.rs]
        pipeline[pipeline.rs]
        gc[gc_correction.rs]
        pon[pon_model.rs]
        engine[engine.rs]
    end

    cli --> wrapper
    wrapper --> Python
    Python --> Rust

Use mouse to pan and zoom

Rust Core (`krewlyzer._core`)

The performance-critical functions are implemented in Rust and exposed to Python via PyO3.

Module Structure

Module	Size	Purpose
`lib.rs`	4KB	PyO3 module definition, thread config
`bed.rs`	4KB	BGZF-first file reader - auto-detects BGZF vs gzip
`extract_motif.rs`	17KB	BAM parsing, fragment extraction, motif counting
`motif_utils.rs`	4KB	Shared 4-mer encoding and MDS calculation
`region_mds.rs`	18KB	Per-region MDS analysis (Helzer et al.)
`region_entropy.rs`	15KB	TFBS/ATAC region entropy with panel support
`pipeline.rs`	10KB	Unified FSC/FSD/WPS/OCF pipeline
`fsc.rs`	11KB	Fragment size coverage by bins
`fsd.rs`	7KB	Size distribution per arm
`wps.rs`	18KB	Windowed protection score
`ocf.rs`	10KB	Orientation-aware fragmentation
`mfsd.rs`	35KB	Mutant fragment size analysis
`gc_correction.rs`	20KB	LOESS-based GC bias correction
`pon_model.rs`	7KB	PON model loading and hybrid correction
`gc_reference.rs`	20KB	Pre-computed GC reference generation

BGZF-First File Reader (`bed.rs`)

The get_reader() function provides smart compressed file handling:

flowchart LR
    A[".bed.gz file"] --> B{Is BGZF?}
    B -->|Yes| C["noodles::bgzf::Reader"]
    B -->|No| D["flate2::MultiGzDecoder"]
    C --> E["BufReader"]
    D --> E

Use mouse to pan and zoom

BGZF detection: Examines file header for BGZF magic bytes (0x1f 0x8b + BC subfield)
Primary: Uses noodles::bgzf::io::Reader for BGZF files (tabix-indexed)
Fallback: Uses flate2::MultiGzDecoder for standard gzip files

This is critical because fragment BED files with .tbi indexes use BGZF compression, which contains multiple gzip blocks. Standard GzDecoder only reads the first block.

Shared Utilities: `motif_utils.rs`

The motif_utils.rs module provides shared DNA sequence manipulation functions used across multiple feature modules:

Function	Signature	Description
`reverse_complement`	`fn(seq: &[u8]) -> Vec<u8>`	Reverses and complements DNA sequence (A↔T, G↔C)
`kmer4_to_index`	`fn(kmer: &[u8]) -> Option<usize>`	Converts 4-mer to index 0-255 using 2-bit encoding
`calculate_mds`	`fn(counts: &[u64; 256]) -> f64`	Shannon entropy of 4-mer histogram, normalized to [0,1]
`calculate_gc`	`fn(seq: &[u8]) -> f64`	GC content as fraction of valid ACGT bases

Used by: - extract_motif.rs - Global MDS calculation from BAM - region_mds.rs - Per-gene MDS at exon boundaries

Example usage (Rust):

use crate::motif_utils::{kmer4_to_index, calculate_mds};

// Count 4-mers
let mut counts = [0u64; 256];
if let Some(idx) = kmer4_to_index(b"ACGT") {
    counts[idx] += 1;
}

// Calculate MDS
let mds = calculate_mds(&counts);  // Range: 0.0 (uniform) to 1.0 (max diversity)

Key Functions Exposed

from krewlyzer import _core

# Thread configuration
_core.configure_threads(num_threads=8)

# Fragment extraction
_core.extract_motif.process_bam_parallel(
    bam_path, reference_path, filters, ...
)

# Unified pipeline
_core.run_unified_pipeline(
    bed_path, gc_ref, valid_regions, gc_factors_out,
    gc_factors_in, bin_file, fsc_out, wps_anchors, wps_out,
    wps_bg_anchors, wps_bg_out, wps_bg_flip,
    arms_file, fsd_out, ocr_file, ocf_out_dir,
    target_regions, bait_padding, silent
)

# GC correction
_core.gc.compute_and_write_gc_factors(...)

run_unified_pipeline Parameters

Parameter	Type	Description
`bed_path`	str	Input .bed.gz file path
`gc_ref`	str/None	GC reference parquet for computing factors
`valid_regions`	str/None	Valid regions BED for GC
`gc_factors_out`	str/None	Output path for computed GC factors
`gc_factors_in`	str/None	Pre-computed GC factors CSV
`bin_file`	str/None	FSC/FSR bins BED
`fsc_out`	str/None	FSC output TSV path
`wps_anchors`	str/None	WPS foreground anchors BED
`wps_out`	str/None	WPS foreground output parquet
`wps_bg_anchors`	str/None	WPS background (Alu) BED
`wps_bg_out`	str/None	WPS background output parquet
`wps_bg_flip`	bool	Flip WPS vectors for strand awareness
`arms_file`	str/None	Chromosome arms BED for FSD
`fsd_out`	str/None	FSD output TSV path
`ocr_file`	str/None	Open chromatin regions for OCF
`ocf_out_dir`	str/None	OCF output directory
`target_regions`	str/None	Panel target BED (on/off split)
`bait_padding`	int	Bait edge padding in bp (default: 50)
`silent`	bool	Suppress progress output

Python Layer

The Python layer provides:

CLI Interface (cli.py) - Typer-based commands
Orchestration (wrapper.py) - run-all coordination
Asset Management (assets.py) - Bundled data files
Feature Modules - Per-tool logic (fsc.py, fsr.py, etc.)
PON Integration (pon/) - Model loading and building
Core Utilities (core/) - Shared helpers

Core Module (`core/`)

File	Purpose
`sample_processor.py`	Unified BAM extraction - `extract_sample()`, `write_motif_outputs()`, `write_extraction_outputs()`
`unified_processor.py`	Central feature runner - single-pass FSC/FSR/FSD/WPS/OCF via `run_features()`
`asset_resolution.py`	Centralized asset resolution - `resolve_target_regions()`, `resolve_pon_model()` for auto-loading
`logging.py`	Logging utilities - `log_startup_banner()`, `ResolvedAsset` dataclass
`gc_assets.py`	Centralized GC asset resolution
`fsc_processor.py`	FSC z-score computation
`fsr_processor.py`	FSR ratio calculation
`wps_processor.py`	WPS post-processing
`wps_anchor_filter.py`	Panel-specific anchor filtering
`motif_processor.py`	Motif file writing (EDM, BPM, MDS)
`gene_bed.py`	Gene BED parsing for FSC gene aggregation
`feature_serializer.py`	Unified JSON output generation

Architecture Note: - sample_processor.py provides extract_sample() for unified BAM extraction used by extract.py, motif.py, wrapper.py, and build-pon - unified_processor.py provides run_features() for unified feature extraction used by all CLI tools and wrapper.py

Recent Improvements (2024-2025)

Change	Files	Benefit
Unified Extraction	`core/sample_processor.py`	Single `extract_sample()` for all BAM extraction, ~150 lines reduced
Unified Processor	`core/unified_processor.py`	Single entry point for all features, ~350 lines reduced
GC Asset Helper	`core/gc_assets.py`	Eliminated 100+ lines duplication
Dual WPS Output	`wps.py`, `wrapper.py`	Panel + genome-wide WPS
JSON Serializer	`core/feature_serializer.py`	Unified ML output
Assay Support	`assets.py`, `wrapper.py`	MSK-ACCESS v1/v2 auto-resolution

Asset Management

The AssetManager class in assets.py provides centralized resolution of bundled data files based on genome build.

Initialization

from krewlyzer.assets import AssetManager

assets = AssetManager("hg19")  # or "hg38", "GRCh37", "GRCh38"
print(assets.genome_dir)       # "GRCh37"
print(assets.file_prefix)      # "hg19"

Asset Properties

Property	Path Pattern	Used By
`bins_100kb`	`ChromosomeBins/{genome}/{prefix}_window_100kb.bed.gz`	FSC, FSR
`arms`	`ChromosomeArms/{genome}/{prefix}.arms.bed.gz`	FSD
`wps_anchors`	`WpsAnchors/{genome}/{prefix}.wps_anchors.bed.gz`	WPS
`wps_background`	`WpsBackground/{genome}/{prefix}.alu_consensus.bed.gz`	WPS
`ocf_regions`	`OpenChromatinRegion/{genome}/7specificTissue.all.OC.bed.gz`	OCF
`gc_ref`	`gc/{genome}/ref_genome_GC_{prefix}.parquet`	GC correction
`valid_regions`	`gc/{genome}/valid_regions_{prefix}.bed.gz`	GC correction
`exclude_regions`	`exclude-regions/{genome}/{prefix}-blacklist.v2.bed.gz`	Extract

Assay-Specific Assets

For MSK-ACCESS panels, AssetManager provides assay-aware resolution:

# Get panel-specific WPS anchors
xs2_anchors = assets.get_wps_anchors("xs2")  
# → WpsAnchors/GRCh37/xs2.wps_anchors.bed.gz

# Get panel targets
xs2_targets = assets.get_targets("xs2")
# → targets/GRCh37/xs2.targets.bed

# Get panel PON
xs2_pon = assets.get_pon("xs2")
# → pon/GRCh37/xs2.all_unique.pon.parquet

Auto-Loading with `--assay`

All CLI tools support auto-loading bundled assets via --assay:

# Auto-loads PON and target regions for xs2
krewlyzer run-all -i sample.bam -o out/ --assay xs2

Asset resolution follows this priority: 1. Explicit path (--target-regions, --pon-model) - highest 2. Skip flag (--skip-target-regions, --skip-pon) 3. Bundled asset (auto-loaded from --assay) 4. None - WGS mode

Assay	WPS Anchors	Target Genes	PON
`xs1` (MSK-ACCESS v1)	1,611	128	✓
`xs2` (MSK-ACCESS v2)	1,820	146	✓

Data Folder Structure

data/
├── ChromosomeArms/{GRCh37,GRCh38}/
├── ChromosomeBins/{GRCh37,GRCh38}/
├── WpsAnchors/{GRCh37,GRCh38}/
│   ├── {hg19,hg38}.wps_anchors.bed.gz    # Genome-wide
│   ├── xs1.wps_anchors.bed.gz             # MSK-ACCESS v1
│   └── xs2.wps_anchors.bed.gz             # MSK-ACCESS v2
├── WpsBackground/{GRCh37,GRCh38}/
├── OpenChromatinRegion/{GRCh37}/          # hg19 only
├── gc/{GRCh37,GRCh38}/
├── genes/{GRCh37}/                         # Gene BEDs per assay
├── targets/{GRCh37}/                       # Target BEDs per assay
└── pon/{GRCh37}/                           # Pre-built PON models

Python vs Rust Responsibilities

By Feature

Feature	Rust (Fast Path)	Python (Orchestration)
Extract	BAM parsing, fragment filtering, BED writing	File I/O orchestration
Motif	k-mer counting, GC observation	File writing, MDS calculation
FSC	Fragment counting per bin, GC correction	PON z-score overlay
FSR	Fragment ratio calculation	PON z-score overlay
FSD	Size histogram per arm, on/off-target split	PON log-ratio normalization
WPS	Protection score, Savitzky-Golay, FFT, NRL	PON z-score subtraction
OCF	Strand asymmetry calculation	PON z-score overlay
mFSD	Variant-level size profiles	Output formatting
UXM	Methylation state extraction	Output formatting

By Operation Type

Operation	Language	File
BAM reading	Rust	`extract_motif.rs`
Fragment extraction	Rust	`extract_motif.rs`
Target region intersection	Rust	`fsd.rs`, `extract_motif.rs`
GC observation collection	Rust	`extract_motif.rs`
GC LOESS fitting	Rust	`gc_correction.rs`
Correction factor application	Rust	Consumers in `pipeline.rs`
FSC/FSD/WPS/OCF counting	Rust	`pipeline.rs`
Savitzky-Golay smoothing	Rust	`wps.rs`
FFT periodicity (NRL)	Rust	`wps.rs`
PON z-score (FSD, WPS, OCF, TFBS/ATAC)	Rust	`fsd.rs`, `wps.rs`, `ocf.rs`, `region_entropy.rs`
Gene FSC aggregation	Rust	`fsc.rs`
PON baseline loading	Rust	via Parquet in z-score functions
PON building	Python	`pon/build.py`
File I/O coordination	Python	`wrapper.py`, feature modules
CLI parsing	Python	`cli.py`, Typer

Completed Rust Migrations ✅

Component	Function	Speedup
FSD log-ratio normalization	`fsd.apply_pon_logratio`	10-50x
WPS PON z-score subtraction	`wps.apply_pon_zscore`	5-20x
GC bias aggregation	`pon_builder.compute_gc_bias_model`	3-10x
FSD baseline aggregation	`pon_builder.compute_fsd_baseline`	3-10x
WPS baseline aggregation	`pon_builder.compute_wps_baseline`	3-10x
OCF PON z-score	`ocf.apply_pon_zscore`	5-20x
TFBS/ATAC PON z-score	`region_entropy.apply_pon_zscore`	5-20x
Gene FSC aggregation	`fsc.aggregate_by_gene`	3-10x

Rust-First Fallback Strategy

All performance-critical functions use Rust-first with Python fallback:

# Pattern used in fsd_processor.py, wps_processor.py, build.py
try:
    from krewlyzer import _core
    result = _core.module.function(args)  # Rust (10-50x faster)
    if result:
        return result
except Exception as e:
    logger.debug(f"Rust failed: {e}")

# Python fallback (slower but always available)
return python_implementation(args)

Benefits: - ✅ Performance: Rust path is 3-50x faster - ✅ Reliability: Python fallback if Rust extension fails - ✅ Debugging: Python code is easier to step through - ✅ Accuracy: Both paths produce identical results

Execution Flow

flowchart TB
    subgraph "1. CLI Layer"
        CLI["krewlyzer run-all"]
        CLI --> WRAPPER["wrapper.py"]
    end

    subgraph "2. Asset Resolution"
        WRAPPER --> ASSETS["AssetManager(genome)"]
        ASSETS --> |"bins_100kb, arms, wps_anchors..."| CONFIG["Tool Configuration"]
    end

    subgraph "3. Extraction (Rust)"
        CONFIG --> EXTRACT["_core.extract_motif.process_bam_parallel()"]
        EXTRACT --> BED["sample.bed.gz"]
        EXTRACT --> MOTIF["EndMotif.tsv, MDS.tsv"]
        EXTRACT --> GC_OBS["GC observations"]
    end

    subgraph "4. GC Correction (Rust)"
        GC_OBS --> GC_LOESS["_core.gc.compute_gc_factors()"]
        GC_LOESS --> GC_FACTORS["correction_factors.tsv"]
    end

    subgraph "5. Unified Pipeline (Rust)"
        BED --> UNIFIED["_core.run_unified_pipeline()"]
        GC_FACTORS --> UNIFIED
        UNIFIED --> FSC_RAW["fsc_counts.tsv"]
        UNIFIED --> FSD_RAW["FSD.tsv"]
        UNIFIED --> WPS_RAW["WPS.parquet"]
        UNIFIED --> OCF_RAW["OCF.tsv"]
    end

    subgraph "6. Post-Processing (Python)"
        FSC_RAW --> FSC_PROC["fsc_processor.py"]
        FSC_RAW --> FSR_PROC["fsr_processor.py"]
        FSD_RAW --> FSD_PROC["fsd_processor.py"]
        WPS_RAW --> WPS_PROC["wps_processor.py"]

        PON["PON Model"] --> FSC_PROC & FSR_PROC & FSD_PROC & WPS_PROC

        FSC_PROC --> FSC_OUT["FSC.tsv (z-scores)"]
        FSR_PROC --> FSR_OUT["FSR.tsv (ratios)"]
        FSD_PROC --> FSD_OUT["FSD.tsv (logR)"]
        WPS_PROC --> WPS_OUT["WPS.parquet (smoothed)"]
    end

Use mouse to pan and zoom

Key Points:

Single BAM Read: Extraction reads BAM once, outputs BED.gz + motifs
GC Correction First: LOESS-based correction computed before features
Single BED Pass: run_unified_pipeline() processes BED.gz once for all features
Python Post-processing: PON normalization and z-scores added in Python
Parallel Features: FSC, FSD, WPS, OCF computed simultaneously in Rust

Performance Characteristics

Operation	Engine	Parallelism
BAM reading	Rust (htslib)	Multi-threaded
Fragment extraction	Rust	Rayon parallel
Motif counting	Rust	Rayon parallel
FSC/FSD/WPS/OCF	Rust pipeline	Single-pass I/O
GC correction	Rust LOESS	Per-fragment-type
mFSD	Rust	Per-variant parallel
UXM	Rust	Per-region parallel

Speedup vs Pure Python

3-4x faster for large BAM files (>100M reads)
Single-pass I/O for unified pipeline (vs 4 separate passes)
Rayon parallelism scales with CPU cores

Building from Source

Requirements

Python 3.10+
Rust toolchain (via rustup)
C compiler (clang recommended)
htslib development headers

Build Steps

# Clone and enter
git clone https://github.com/msk-access/krewlyzer.git
cd krewlyzer

# Create environment
uv venv .venv && source .venv/bin/activate

# Build Rust extension
cd rust && maturin develop --release
cd ..

# Install Python package
uv pip install -e ".[dev,test]"

# Verify
python -c "from krewlyzer import _core; print(_core.version())"

Architecture

Overview

Rust Core (`krewlyzer._core`)

Module Structure

BGZF-First File Reader (`bed.rs`)

Shared Utilities: `motif_utils.rs`

Key Functions Exposed

run_unified_pipeline Parameters

Python Layer

Core Module (`core/`)

Recent Improvements (2024-2025)

Asset Management

Initialization

Asset Properties

Assay-Specific Assets

Auto-Loading with `--assay`

Data Folder Structure

Python vs Rust Responsibilities

By Feature

By Operation Type

Completed Rust Migrations ✅

Rust-First Fallback Strategy

Execution Flow

Performance Characteristics

Speedup vs Pure Python

Building from Source

Requirements

Build Steps

Extending Krewlyzer

Adding a New Rust Function

Adding a New Feature Tool

Architecture

Overview

Rust Core (krewlyzer._core)

Module Structure

BGZF-First File Reader (bed.rs)

Shared Utilities: motif_utils.rs

Key Functions Exposed

run_unified_pipeline Parameters

Python Layer

Core Module (core/)

Recent Improvements (2024-2025)

Asset Management

Initialization

Asset Properties

Assay-Specific Assets

Auto-Loading with --assay

Data Folder Structure

Python vs Rust Responsibilities

By Feature

By Operation Type

Completed Rust Migrations ✅

Rust-First Fallback Strategy

Execution Flow

Performance Characteristics

Speedup vs Pure Python

Building from Source

Requirements

Build Steps

Extending Krewlyzer

Adding a New Rust Function

Adding a New Feature Tool

Rust Core (`krewlyzer._core`)

BGZF-First File Reader (`bed.rs`)

Shared Utilities: `motif_utils.rs`

Core Module (`core/`)

Auto-Loading with `--assay`