Panel of Normals (PON)

The Panel of Normals (PON) is a unified model built from healthy plasma samples that enables:

GC bias correction - Per-fragment correction for GC content bias
Z-score normalization - Detect deviations from healthy baseline for all features
Panel mode support - Dual on/off-target baselines for capture panels

Quick Start

# Build PON from healthy samples
krewlyzer build-pon samples.txt --assay msk-access-v2 -r hg19.fa -o pon.parquet

# Use PON for sample processing
krewlyzer run-all -i sample.bam -r hg19.fa -o out/ -P pon.parquet

Auto-PON Loading

When you specify an assay with -A, krewlyzer automatically loads the bundled PON:

# Auto-loads bundled PON for xs2 assay
krewlyzer run-all -i sample.bam -r hg19.fa -o out/ -A xs2 -G hg19

This is equivalent to explicitly passing -P with the bundled PON path.

Skipping Z-Score Normalization (`--skip-pon`)

For ML training workflows where PON samples are used as true negatives, use --skip-pon to output raw features without z-score normalization:

# Process PON samples as ML negatives (no z-scores)
krewlyzer run-all -i pon_sample.bam -r hg19.fa -o out/ -A xs2 --skip-pon

Warning

-P and --skip-pon are mutually exclusive. If you specify an explicit PON model, you want z-scores applied. Use --skip-pon only with -A (assay) for the ML negatives workflow.

The --skip-pon flag:

- Works with -A/--assay (auto-loads bundled PON but skips z-scores)
- Available on all tools: run-all, fsc, fsd, fsr, wps, ocf, region-entropy, motif
- Logs which tools are skipping normalization

PON Variant Selection (`--pon-variant`)

For duplex sequencing workflows (fgbio/Marianas), use --pon-variant duplex to select PONs built from duplex consensus reads:

# Default: all_unique PON (maximum coverage for standard cfDNA)
krewlyzer run-all -i sample.bam -r hg19.fa -o out/ -A xs2

# Duplex PON (highest accuracy for duplex sequencing data)
krewlyzer run-all -i sample.bam -r hg19.fa -o out/ -A xs2 --pon-variant duplex

Variant	Description	Best For
`all_unique`	Built from all unique reads	Standard cfDNA (default)
`duplex`	Built from duplex consensus reads	Duplex sequencing workflows

Tip

The --pon-variant flag is independent of the --duplex flag for mFSD. Use --duplex for mFSD weighting (enables cD tag usage), and --pon-variant for PON selection across all tools.

The --pon-variant flag:

- Defaults to all_unique (maximum coverage PON)
- Available on all PON-using tools: run-all, fsc, fsd, fsr, wps, ocf, motif, region-entropy, region-mds
- File structure: pon/{genome}/{variant}/{assay}.{variant}.pon.parquet
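The file structure pattern above can be expanded mechanically. The following is a minimal sketch, not krewlyzer's own loader: the helper name `bundled_pon_path` is hypothetical, and the genome value `hg19` is taken from the `-G hg19` example earlier on this page.

```python
from pathlib import PurePosixPath

# Variants documented on this page.
KNOWN_VARIANTS = {"all_unique", "duplex"}

def bundled_pon_path(genome: str, assay: str, variant: str = "all_unique") -> PurePosixPath:
    """Hypothetical helper: fill in pon/{genome}/{variant}/{assay}.{variant}.pon.parquet."""
    if variant not in KNOWN_VARIANTS:
        raise ValueError(f"unknown PON variant: {variant!r}")
    return PurePosixPath("pon") / genome / variant / f"{assay}.{variant}.pon.parquet"

print(bundled_pon_path("hg19", "xs2", "duplex"))
# pon/hg19/duplex/xs2.duplex.pon.parquet
```

The path is relative to wherever the bundled PONs are installed, which this page does not specify.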

PON Components

Component	Description	Used By
GC Bias Model	Expected coverage by GC content per fragment type	FSC, FSR, WPS
FSD Baseline	Size distribution per chromosome arm	FSD
WPS Baseline	WPS mean/std per transcript region	WPS
OCF Baseline	Open chromatin scores per region	OCF
MDS Baseline	k-mer frequencies and motif diversity	Motif
TFBS Baseline	Per-TF entropy mean/std	Region Entropy
ATAC Baseline	Per-cancer-type entropy mean/std	Region Entropy
Region MDS Baseline	Per-gene MDS mean/std for E1	Region MDS
FSC Gene Baseline	Per-gene normalized depth mean/std	FSC Gene
FSC Region Baseline	Per-exon normalized depth mean/std	FSC Region

Panel Mode

For capture panels (such as MSK-ACCESS), pass the target BED file with --target-regions (-T) when building the PON:

krewlyzer build-pon samples.txt -a msk-access-v2 -r hg19.fa -T targets.bed -o pon.parquet

This enables:

GC model trained on off-target only - Avoids capture bias
Separate on/off-target baselines - For features that differ in captured regions
Panel mode detection - Sample processing auto-detects matching PON mode

Building a PON

See build-pon CLI for detailed options.

Requirements:

- Minimum 10 healthy samples recommended
- Same assay/panel as samples to be processed
- Same reference genome

Using PON in Processing

When --pon-model (-P) is provided to run-all:

PON is loaded once and passed to all processors
Each feature computes z-scores against healthy baseline
Output includes both raw values and PON-normalized columns

Output Columns

With PON, additional columns are added to outputs:

Feature	PON Column(s)	Description
FSC	`*_log2`	Log2 ratio vs PON expected
FSC Gene	`depth_zscore`	Gene-level depth z-score
FSC Region	`depth_zscore`	Exon-level depth z-score
FSD	`ratio_log2`	Size distribution log ratio
WPS	`wps_zscore`	Z-score vs region baseline
OCF	`ocf_zscore`	Z-score vs OCF baseline
Motif	`mds_z`	Z-score for MDS
TFBS	`entropy_z`	Z-score per TF
ATAC	`entropy_z`	Z-score per cancer type
Region MDS	`mds_z`, `mds_e1_z`	Gene-level and E1 z-scores

API Reference

from krewlyzer.pon.model import PonModel

# Load existing PON
pon = PonModel.load("path/to/pon.parquet")

# Access components
gc_expected = pon.get_mean("short")  # Expected at median GC
variance = pon.get_variance("short")  # For reliability scoring

# Check panel mode
if pon.panel_mode:
    print(f"Built with: {pon.target_regions_file}")

PON Baselines in Detail

GC Bias Model (`gc_bias`)

Stores expected fragment coverage per GC content (0-100%) for each fragment type:

Fragment Type	Size Range	Purpose
`short`	65-149bp	Short fragment correction
`intermediate`	150-259bp	Mono-nucleosomal
`long`	260-400bp	Di-nucleosomal
`wps_long`	120-180bp	WPS nucleosomal
`wps_short`	35-80bp	WPS TF footprint
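Because the WPS ranges overlap the FSC ranges, one fragment can belong to more than one fragment type. The sketch below illustrates this using the size ranges from the table; it assumes both bounds are inclusive, and the function name `fragment_types` is hypothetical rather than part of krewlyzer.

```python
# Size ranges (bp) copied from the table above; bounds assumed inclusive.
FRAGMENT_TYPES = {
    "short": (65, 149),
    "intermediate": (150, 259),
    "long": (260, 400),
    "wps_long": (120, 180),
    "wps_short": (35, 80),
}

def fragment_types(length_bp: int) -> list[str]:
    """Return every fragment type whose size range contains length_bp."""
    return [name for name, (lo, hi) in FRAGMENT_TYPES.items() if lo <= length_bp <= hi]

print(fragment_types(167))  # ['intermediate', 'wps_long']
print(fragment_types(70))   # ['short', 'wps_short']
print(fragment_types(500))  # []
```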

FSD Baseline (`fsd_baseline`)

Size distribution per chromosome arm (46 arms):

- expected: Mean proportion at each size bin
- std: Standard deviation across PON samples
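The `ratio_log2` output column is described as a size distribution log ratio. One plausible reading is a per-bin log2 ratio of the sample's proportion to the PON `expected` proportion. The sketch below follows that reading; the pseudocount and the function name `size_bin_log2_ratios` are assumptions for illustration and may differ from what krewlyzer computes.

```python
import math

def size_bin_log2_ratios(observed, expected, pseudocount=1e-6):
    """Per-bin log2(observed / expected), with a small pseudocount to avoid log(0)."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same number of size bins")
    return [
        math.log2((obs + pseudocount) / (exp + pseudocount))
        for obs, exp in zip(observed, expected)
    ]

# Toy proportions for three size bins on one chromosome arm.
sample_props = [0.20, 0.50, 0.30]
pon_expected = [0.10, 0.50, 0.40]
print([round(r, 2) for r in size_bin_log2_ratios(sample_props, pon_expected)])
```

A positive value means the sample has a larger share of fragments in that bin than the healthy baseline.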

WPS Baseline (`wps_baseline`)

Per-region nucleosome positioning metrics.

Schema v1.0 (Scalar):

- wps_long_mean/std: Single nucleosomal WPS value per region
- wps_short_mean/std: Single TF footprint value per region

Schema v2.0 (Vector):

- wps_nuc_mean/std: 200-element vector (nucleosomal footprint)
- wps_tf_mean/std: 200-element vector (TF footprint)

Tip

v2.0 enables position-specific z-scores and Shape Correlation Score for cancer detection.

Shape Score Interpretation

Score	Interpretation
0.9-1.0	Healthy nucleosome positioning
0.5-0.9	Mild chromatin disorganization
<0.5	Significant disruption (cancer signal)

See WPS Features for output column details.
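This page does not define how the Shape Correlation Score is computed. A minimal sketch, assuming it is the Pearson correlation between a sample's 200-element WPS profile and the PON mean profile, is shown below together with the interpretation bands from the table. Both function names are hypothetical, and the boundary value 0.9 is assigned to the healthy band as an assumption.

```python
import math

def shape_correlation(sample_profile, pon_mean_profile):
    """Pearson correlation between two equal-length profiles (NaN if either is flat)."""
    n = len(sample_profile)
    if n != len(pon_mean_profile) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(sample_profile) / n
    mean_y = sum(pon_mean_profile) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(sample_profile, pon_mean_profile))
    var_x = sum((x - mean_x) ** 2 for x in sample_profile)
    var_y = sum((y - mean_y) ** 2 for y in pon_mean_profile)
    if var_x == 0 or var_y == 0:
        return math.nan
    return cov / math.sqrt(var_x * var_y)

def interpret_shape_score(score):
    """Map a score to the interpretation bands in the table above."""
    if math.isnan(score):
        return "Undefined"
    if score >= 0.9:
        return "Healthy nucleosome positioning"
    if score >= 0.5:
        return "Mild chromatin disorganization"
    return "Significant disruption (cancer signal)"

# Toy 200-element profile compared against itself.
pon_profile = [math.sin(i / 10) for i in range(200)]
print(interpret_shape_score(shape_correlation(pon_profile, pon_profile)))
```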

OCF Baseline (`ocf_baseline`)

Per-region open chromatin footprint:

- ocf_mean/std: OCF score baseline
- sync_mean/std: Synchronization score baseline

MDS Baseline (`mds_baseline`)

Motif diversity expectations:

- kmer_expected: 256 4-mer frequencies from healthy samples
- kmer_std: Variability per k-mer
- mds_mean/std: Expected Motif Diversity Score
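For orientation, the Motif Diversity Score is commonly defined in the cfDNA literature as the Shannon entropy of the 256 end-motif (4-mer) frequencies, normalized to the range 0 to 1. The sketch below uses that definition. Whether krewlyzer applies exactly this formula is an assumption, and the function name is hypothetical.

```python
import math

def motif_diversity_score(kmer_freqs):
    """Normalized Shannon entropy of k-mer frequencies (1.0 = perfectly uniform)."""
    n_motifs = len(kmer_freqs)
    total = sum(kmer_freqs)
    if n_motifs < 2 or total <= 0:
        raise ValueError("need at least two motifs and a positive total frequency")
    # Zero-frequency motifs contribute nothing to the entropy.
    probs = [f / total for f in kmer_freqs if f > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(n_motifs)

uniform = [1.0] * 256                # every 4-mer equally frequent
single = [1.0] + [0.0] * 255         # a single 4-mer dominates completely
print(round(motif_diversity_score(uniform), 6))
print(round(abs(motif_diversity_score(single)), 6))
```

A uniform distribution gives a score of 1, and a distribution concentrated on one motif gives 0.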

TFBS Baseline (`tfbs_baseline`)

Per-TF size entropy:

- label_stats: Mean/std entropy per TF (808 transcription factors)
- Enables z-score per TF for detailed regulatory analysis

ATAC Baseline (`atac_baseline`)

Per-cancer-type size entropy:

- label_stats: Mean/std entropy per cancer type (23 types)
- Enables tissue-of-origin scoring

Region MDS Baseline (`region_mds`)

Per-gene MDS expectations:

- gene_baseline: Dict of gene → {mds_mean, mds_std, mds_e1_mean, mds_e1_std}
- Enables gene-level anomaly detection
- E1 (first exon) tracked separately for promoter-proximal sensitivity

FSC Gene Baseline (`fsc_gene_baseline`)

Per-gene normalized depth baseline (panel mode only):

- data: Dict of gene → (mean_depth, std_depth, n_samples)
- Requires minimum 3 samples for reliable statistics
- Clinical use: z-score >> 0 = amplification, z-score << 0 = deletion
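The sketch below shows how a gene-level depth z-score could be derived from a baseline shaped like the `data` dict above. It is an illustration rather than krewlyzer's code: the function name, the gene names, and the numbers are made up, and returning NaN for genes with fewer than 3 PON samples is one possible way to honor the minimum-sample requirement.

```python
import math

MIN_SAMPLES = 3  # minimum PON samples for reliable statistics, per the list above

def gene_depth_zscore(gene, depth, baseline):
    """Z-score of a normalized depth against a gene -> (mean, std, n_samples) baseline."""
    entry = baseline.get(gene)
    if entry is None:
        return math.nan
    mean_depth, std_depth, n_samples = entry
    if n_samples < MIN_SAMPLES or not std_depth > 0:
        return math.nan
    return (depth - mean_depth) / std_depth

# Illustrative baseline; values are not from a real PON.
baseline = {"GENE_A": (1.0, 0.1, 12), "GENE_B": (1.0, 0.1, 2)}
print(gene_depth_zscore("GENE_A", 1.5, baseline))  # strongly positive: possible amplification
print(gene_depth_zscore("GENE_B", 1.5, baseline))  # nan: too few PON samples
```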

FSC Region Baseline (`fsc_region_baseline`)

Per-exon/probe normalized depth baseline (panel mode only):

- data: Dict of region_id → (mean_depth, std_depth, n_samples)
- Region IDs formatted as "chrom:start-end"
- Covers all exons (no filtering by variance)
- Enables detection of focal copy number changes affecting single exons
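When joining sample output against this baseline, region IDs need to be built and split consistently. A small sketch of the "chrom:start-end" format follows; the helper names are hypothetical and the coordinates are illustrative only.

```python
def format_region_id(chrom: str, start: int, end: int) -> str:
    """Build a region ID in the chrom:start-end form."""
    return f"{chrom}:{start}-{end}"

def parse_region_id(region_id: str):
    """Split a chrom:start-end region ID into (chrom, start, end)."""
    # rpartition keeps any ':' inside the contig name attached to chrom.
    chrom, sep, span = region_id.rpartition(":")
    start, dash, end = span.partition("-")
    if not sep or not dash:
        raise ValueError(f"not a chrom:start-end region ID: {region_id!r}")
    return chrom, int(start), int(end)

region_id = format_region_id("chr17", 1000, 2000)
print(region_id)                   # chr17:1000-2000
print(parse_region_id(region_id))  # ('chr17', 1000, 2000)
```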

Interpreting Z-Scores

Z-scores measure how many standard deviations a sample differs from the healthy PON baseline:

\[ z = \frac{x_{\text{sample}} - \mu_{\text{PON}}}{\sigma_{\text{PON}}} \]
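As a worked instance of the formula, the sketch below computes a single z-score and returns NaN when the PON standard deviation is zero or missing. That guard is an assumption about sensible behavior, not a documented krewlyzer rule, and the function name is hypothetical.

```python
import math

def pon_zscore(sample_value: float, pon_mean: float, pon_std: float) -> float:
    """Return (x_sample - mu_PON) / sigma_PON, or NaN if sigma_PON is unusable."""
    if not pon_std > 0:  # catches zero, negative, and NaN
        return math.nan
    return (sample_value - pon_mean) / pon_std

# Toy numbers: sample value 0.26 against a PON mean of 0.20 and std of 0.02.
z = pon_zscore(0.26, 0.20, 0.02)
print(round(z, 2))  # about 3 standard deviations above the healthy mean
```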

Clinical Interpretation

Z-Score Range	Interpretation	Action
-2 to +2	Normal range	Within healthy variation
**|z| = 2-3**
**|z| > 3**
**|z| > 5**

Per-Feature Z-Score Meaning

Feature	Z-Score Column	Positive Z Means	Negative Z Means
FSC	`z_core_short`	More short fragments	Fewer short fragments
FSD	-	Shifted size distribution	-
WPS	`wps_nuc_z`	Stronger nucleosome signal	Disrupted nucleosomes
OCF	`ocf_z`	More open chromatin	Less accessible
MDS	`mds_z`	More diverse motifs	Less diverse
TFBS	`entropy_z`	Higher entropy (diverse sizes)	Lower entropy (restricted)
ATAC	`entropy_z`	Higher entropy	Lower entropy
Region MDS	`mds_z`, `mds_e1_z`	More diverse at gene	Restricted motifs (aberrant)

ML Feature Usage

# Extract z-score features for classification
features = {
    "fsc_short_z": sample_fsc["z_core_short"].mean(),
    "wps_nuc_z": sample_wps["wps_nuc_z"].mean(),
    "mds_z": sample_motif["mds_z"],
}

# Higher |z| = more likely to be tumor
combined_signal = sum(abs(z) for z in features.values())

Tip

Combine z-scores across features - Single extreme values may be noise, but consistent deviations across FSC, WPS, and MDS are highly indicative of ctDNA.
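One simple way to act on this tip is to count how many features fall outside the normal range rather than summing magnitudes, so that a single extreme value cannot dominate. The sketch below is illustrative only: the threshold of 2 mirrors the normal range in the table above, while `min_features` and the function name are hypothetical choices.

```python
def consistent_deviation(z_scores, threshold=2.0, min_features=2):
    """Return (is_consistent, flagged_names) for a dict of feature name -> z-score."""
    # NaN z-scores compare False and are therefore never flagged.
    flagged = sorted(name for name, z in z_scores.items() if abs(z) > threshold)
    return len(flagged) >= min_features, flagged

# Deviations in two features count as consistent ...
print(consistent_deviation({"fsc_short_z": 3.1, "wps_nuc_z": -2.6, "mds_z": 0.4}))
# ... while one extreme value on its own does not.
print(consistent_deviation({"fsc_short_z": 6.0, "wps_nuc_z": 0.1, "mds_z": -0.3}))
```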