Panel of Normals (PON)
The Panel of Normals (PON) is a unified model built from healthy plasma samples that enables:
- GC bias correction - Per-fragment correction for GC content bias
- Z-score normalization - Detect deviations from healthy baseline for all features
- Panel mode support - Dual on/off-target baselines for capture panels
Quick Start
# Build PON from healthy samples
krewlyzer build-pon samples.txt --assay msk-access-v2 -r hg19.fa -o pon.parquet
# Use PON for sample processing
krewlyzer run-all -i sample.bam -r hg19.fa -o out/ -P pon.parquet
Auto-PON Loading
When you specify an assay with -A, krewlyzer automatically loads the bundled PON:
# Auto-loads bundled PON for xs2 assay
krewlyzer run-all -i sample.bam -r hg19.fa -o out/ -A xs2 -G hg19
This is equivalent to explicitly passing -P with the bundled PON path.
Skipping Z-Score Normalization (--skip-pon)
For ML training workflows where PON samples are used as true negatives, use --skip-pon to output raw features without z-score normalization:
# Process PON samples as ML negatives (no z-scores)
krewlyzer run-all -i pon_sample.bam -r hg19.fa -o out/ -A xs2 --skip-pon
Warning
-P and --skip-pon are mutually exclusive. If you specify an explicit PON model, you want z-scores applied. Use --skip-pon only with -A (assay) for the ML negatives workflow.
The --skip-pon flag:
- Works with -A/--assay (auto-loads bundled PON but skips z-scores)
- Available on all tools: run-all, fsc, fsd, fsr, wps, ocf, region-entropy, motif
- Logs which tools are skipping normalization
PON Variant Selection (--pon-variant)
For duplex sequencing workflows (fgbio/Marianas), use --pon-variant duplex to select PONs built from duplex consensus reads:
# Default: all_unique PON (maximum coverage for standard cfDNA)
krewlyzer run-all -i sample.bam -r hg19.fa -o out/ -A xs2
# Duplex PON (highest accuracy for duplex sequencing data)
krewlyzer run-all -i sample.bam -r hg19.fa -o out/ -A xs2 --pon-variant duplex
| Variant | Description | Best For |
|---|---|---|
all_unique |
Built from all unique reads | Standard cfDNA (default) |
duplex |
Built from duplex consensus reads | Duplex sequencing workflows |
Tip
The --pon-variant flag is independent of the --duplex flag for mFSD. Use --duplex for mFSD weighting (enables cD tag usage), and --pon-variant for PON selection across all tools.
The --pon-variant flag:
- Defaults to all_unique (maximum coverage PON)
- Available on all PON-using tools: run-all, fsc, fsd, fsr, wps, ocf, motif, region-entropy, region-mds
- File structure: pon/{genome}/{variant}/{assay}.{variant}.pon.parquet
PON Components
| Component | Description | Used By |
|---|---|---|
| GC Bias Model | Expected coverage by GC content per fragment type | FSC, FSR, WPS |
| FSD Baseline | Size distribution per chromosome arm | FSD |
| WPS Baseline | WPS mean/std per transcript region | WPS |
| OCF Baseline | Open chromatin scores per region | OCF |
| MDS Baseline | k-mer frequencies and motif diversity | Motif |
| TFBS Baseline | Per-TF entropy mean/std | Region Entropy |
| ATAC Baseline | Per-cancer-type entropy mean/std | Region Entropy |
| Region MDS Baseline | Per-gene MDS mean/std for E1 | Region MDS |
| FSC Gene Baseline | Per-gene normalized depth mean/std | FSC Gene |
| FSC Region Baseline | Per-exon normalized depth mean/std | FSC Region |
Panel Mode
For capture panels (like MSK-ACCESS), use --target-regions when building the PON:
This enables:
- GC model trained on off-target only - Avoids capture bias
- Separate on/off-target baselines - For features that differ in captured regions
- Panel mode detection - Sample processing auto-detects matching PON mode
Building a PON
See build-pon CLI for detailed options.
Requirements: - Minimum 10 healthy samples recommended - Same assay/panel as samples to be processed - Same reference genome
Using PON in Processing
When --pon-model is provided to run-all:
- PON is loaded once and passed to all processors
- Each feature computes z-scores against healthy baseline
- Output includes both raw values and PON-normalized columns
Output Columns
With PON, additional columns are added to outputs:
| Feature | PON Column(s) | Description |
|---|---|---|
| FSC | *_log2 |
Log2 ratio vs PON expected |
| FSC Gene | depth_zscore |
Gene-level depth z-score |
| FSC Region | depth_zscore |
Exon-level depth z-score |
| FSD | ratio_log2 |
Size distribution log ratio |
| WPS | wps_zscore |
Z-score vs region baseline |
| OCF | ocf_zscore |
Z-score vs OCF baseline |
| Motif | mds_z |
Z-score for MDS |
| TFBS | entropy_z |
Z-score per TF |
| ATAC | entropy_z |
Z-score per cancer type |
| Region MDS | mds_z, mds_e1_z |
Gene-level and E1 z-scores |
API Reference
from krewlyzer.pon.model import PonModel
# Load existing PON
pon = PonModel.load("path/to/pon.parquet")
# Access components
gc_expected = pon.get_mean("short") # Expected at median GC
variance = pon.get_variance("short") # For reliability scoring
# Check panel mode
if pon.panel_mode:
print(f"Built with: {pon.target_regions_file}")
PON Baselines in Detail
GC Bias Model (gc_bias)
Stores expected fragment coverage per GC content (0-100%) for each fragment type:
| Fragment Type | Size Range | Purpose |
|---|---|---|
short |
65-149bp | Short fragment correction |
intermediate |
150-259bp | Mono-nucleosomal |
long |
260-400bp | Di-nucleosomal |
wps_long |
120-180bp | WPS nucleosomal |
wps_short |
35-80bp | WPS TF footprint |
FSD Baseline (fsd_baseline)
Size distribution per chromosome arm (46 arms):
- expected: Mean proportion at each size bin
- std: Standard deviation across PON samples
WPS Baseline (wps_baseline)
Per-region nucleosome positioning metrics.
Schema v1.0 (Scalar):
- wps_long_mean/std: Single nucleosomal WPS value per region
- wps_short_mean/std: Single TF footprint value per region
Schema v2.0 (Vector):
- wps_nuc_mean/std: 200-element vector (nucleosomal footprint)
- wps_tf_mean/std: 200-element vector (TF footprint)
Tip
v2.0 enables position-specific z-scores and Shape Correlation Score for cancer detection.
Shape Score Interpretation
| Score | Interpretation |
|---|---|
| 0.9-1.0 | Healthy nucleosome positioning |
| 0.5-0.9 | Mild chromatin disorganization |
| <0.5 | Significant disruption (cancer signal) |
See WPS Features for output column details.
OCF Baseline (ocf_baseline)
Per-region open chromatin footprint:
- ocf_mean/std: OCF score baseline
- sync_mean/std: Synchronization score baseline
MDS Baseline (mds_baseline)
Motif diversity expectations:
- kmer_expected: 256 4-mer frequencies from healthy samples
- kmer_std: Variability per k-mer
- mds_mean/std: Expected Motif Diversity Score
TFBS Baseline (tfbs_baseline)
Per-TF size entropy:
- label_stats: Mean/std entropy per TF (808 transcription factors)
- Enables z-score per TF for detailed regulatory analysis
ATAC Baseline (atac_baseline)
Per-cancer-type size entropy:
- label_stats: Mean/std entropy per cancer type (23 types)
- Enables tissue-of-origin scoring
Region MDS Baseline (region_mds)
Per-gene MDS expectations:
- gene_baseline: Dict of gene → {mds_mean, mds_std, mds_e1_mean, mds_e1_std}
- Enables gene-level anomaly detection
- E1 (first exon) tracked separately for promoter-proximal sensitivity
FSC Gene Baseline (fsc_gene_baseline)
Per-gene normalized depth baseline (panel mode only):
- data: Dict of gene → (mean_depth, std_depth, n_samples)
- Requires minimum 3 samples for reliable statistics
- Clinical use: z-score >> 0 = amplification, z-score << 0 = deletion
FSC Region Baseline (fsc_region_baseline)
Per-exon/probe normalized depth baseline (panel mode only):
- data: Dict of region_id → (mean_depth, std_depth, n_samples)
- Region IDs formatted as "chrom:start-end"
- Covers all exons (no filtering by variance)
- Enables detection of focal copy number changes affecting single exons
Interpreting Z-Scores
Z-scores measure how many standard deviations a sample differs from the healthy PON baseline:
Clinical Interpretation
| Z-Score Range | Interpretation | Action |
|---|---|---|
| -2 to +2 | Normal range | Within healthy variation |
| ** | z | = 2-3** |
| ** | z | > 3** |
| ** | z | > 5** |
Per-Feature Z-Score Meaning
| Feature | Z-Score Column | Positive Z Means | Negative Z Means |
|---|---|---|---|
| FSC | z_core_short |
More short fragments | Fewer short fragments |
| FSD | - | Shifted size distribution | - |
| WPS | wps_nuc_z |
Stronger nucleosome signal | Disrupted nucleosomes |
| OCF | ocf_z |
More open chromatin | Less accessible |
| MDS | mds_z |
More diverse motifs | Less diverse |
| TFBS | entropy_z |
Higher entropy (diverse sizes) | Lower entropy (restricted) |
| ATAC | entropy_z |
Higher entropy | Lower entropy |
| Region MDS | mds_z, mds_e1_z |
More diverse at gene | Restricted motifs (aberrant) |
ML Feature Usage
# Extract z-score features for classification
features = {
"fsc_short_z": sample_fsc["z_core_short"].mean(),
"wps_nuc_z": sample_wps["wps_nuc_z"].mean(),
"mds_z": sample_motif["mds_z"],
}
# Higher |z| = more likely to be tumor
combined_signal = sum(abs(z) for z in features.values())
Tip
Combine z-scores across features - Single extreme values may be noise, but consistent deviations across FSC, WPS, and MDS are highly indicative of ctDNA.