build-pon

Build a unified Panel of Normals (PON) model from healthy plasma samples.

Synopsis

krewlyzer build-pon SAMPLE_LIST --assay NAME -r REFERENCE -o OUTPUT [OPTIONS]

Description

Creates a PON model containing all baselines needed for cfDNA analysis from a cohort of healthy samples:

GC Bias Model - For coverage correction
FSD Baseline - Fragment size distributions per arm
WPS Baseline - Nucleosome protection per region
OCF Baseline - Open chromatin footprinting per region
MDS Baseline - Motif diversity and k-mer frequencies
TFBS Baseline - Transcription factor binding site entropy (808 TFs)
ATAC Baseline - ATAC-seq peak entropy (23 cancer types)

This model is used for bias correction and z-score normalization during sample processing.

Arguments

Argument	Description
`SAMPLE_LIST`	Text file with paths to BAM or BED.gz files (one per line)

Required Options

Option	Description
`-a, --assay`	Assay name (e.g., `msk-access-v2`)
`-r, --reference`	Reference FASTA file (indexed)
`-o, --output`	Output PON model file (`.pon.parquet`)

Optional Parameters

Option	Default	Description
`-G, --genome`	hg19	Genome build (hg19/GRCh37/hg38/GRCh38)
`-T, --target-regions`	None	BED file with target regions (panel mode)
`--skip-target-regions`	False	Force WGS mode (ignore bundled targets from --assay)
`-W, --wps-anchors`	Built-in	WPS anchors BED.gz (merged TSS+CTCF)
`-b, --bin-file`	Built-in	Bin file for FSC/FSR
`--temp-dir`	System temp	Directory for temporary files
`-p, --threads`	4	Total threads (divided among parallel samples)
`-P, --parallel-samples`	1	Number of samples to process in parallel
`--memory-per-sample`	12	Expected memory per sample in GB (panel: 12-20, WGS: 4-8)
`--sample-timeout`	3600	Max seconds per sample (0=no timeout)
`--allow-failures`	False	Continue if a sample fails
`--require-proper-pair`	False	Only properly paired reads
`-v, --verbose`	False	Verbose output

Pipeline Flow

flowchart TB
    subgraph Input
        SL["sample_list.txt"]
        BAM["BAM/CRAM files"]
        BED["BED.gz files"]
    end

    subgraph "Per-Sample Processing (Parallel with -P)"
        EXT["Extract fragments"]
        GC["GC observations"]
        FSD["FSD per arm"]
        WPS["WPS per anchor"]
        OCF["OCF per region"]
        MDS["MDS k-mers"]
        FSC_G["FSC gene/region"]
        TFBS["TFBS entropy"]
        ATAC["ATAC entropy"]
        RMDS["Region MDS"]
    end

    subgraph "Baseline Aggregation"
        AGG_GC["GC bias curves\nmean/std per bin"]
        AGG_FSD["FSD baseline\nmean/std per arm"]
        AGG_WPS["WPS baseline\nmean/std per region"]
        AGG_OCF["OCF baseline\nmean/std per region"]
        AGG_MDS["MDS baseline\nk-mer mean/std"]
        AGG_FSC["FSC gene/region\nmean/std per gene"]
        AGG_TFBS["TFBS baseline\nmean/std per TF"]
        AGG_ATAC["ATAC baseline\nmean/std per type"]
        AGG_RMDS["Region MDS\nmean/std per gene"]
    end

    subgraph Output
        PON["PON.parquet"]
    end

    SL --> BAM & BED
    BAM --> EXT --> GC & FSD & WPS & OCF & MDS & FSC_G & TFBS & ATAC & RMDS
    BED --> GC & FSD & WPS & OCF

    GC --> AGG_GC --> PON
    FSD --> AGG_FSD --> PON
    WPS --> AGG_WPS --> PON
    OCF --> AGG_OCF --> PON
    MDS --> AGG_MDS --> PON
    FSC_G --> AGG_FSC --> PON
    TFBS --> AGG_TFBS --> PON
    ATAC --> AGG_ATAC --> PON
    RMDS --> AGG_RMDS --> PON

Use mouse to pan and zoom

Examples

WGS Mode:

krewlyzer build-pon healthy_samples.txt \
    --assay wgs-hg19 \
    --reference hg19.fa \
    --output wgs.pon.parquet

Panel Mode (MSK-ACCESS):

krewlyzer build-pon healthy_samples.txt \
    --assay msk-access-v2 \
    --reference hg19.fa \
    --target-regions msk_access_targets.bed \
    --output msk-access.pon.parquet \
    --threads 16

HPC with custom temp directory:

krewlyzer build-pon healthy_samples.txt \
    --assay xs1 \
    --reference hg19.fa \
    --target-regions msk_access_v1_targets.bed \
    --temp-dir /scratch/$USER/pon_tmp \
    --output xs1.pon.parquet \
    --threads 16 \
    --verbose

Input Formats

Sample list file (samples.txt):

/path/to/sample1.bam
/path/to/sample2.bam
/path/to/sample3.bed.gz

Both BAM/CRAM and BED.gz inputs are supported:

Input Type	Extraction	MDS Baseline	Speed
BAM/CRAM	Full	✓ Included	Slower
BED.gz	Skip	✗ Not available	Faster

Note

MDS baseline requires BAM/CRAM input because it needs fragment end sequences for k-mer extraction. BED.gz files only contain coordinates.

Output

The output is a Parquet file containing:

Component	Description	Used By
Metadata	assay, build_date, n_samples, reference, panel_mode	All
GC Bias	Expected coverage by GC bin for short/intermediate/long fragments	FSC, FSR
FSD Baseline	Mean/std size proportions per chromosome arm	FSD
WPS Baseline	Mean/std WPS per transcript region	WPS
OCF Baseline	Mean/std OCF per open chromatin region	OCF
MDS Baseline	K-mer frequencies and MDS mean/std	Motif
TFBS Baseline	Mean/std entropy per TF (808 factors)	Region Entropy
ATAC Baseline	Mean/std entropy per cancer type (23 types)	Region Entropy
Region MDS Baseline	Per-gene MDS mean/std for E1	Region MDS
FSC Gene Baseline	Per-gene normalized depth mean/std	FSC Gene
FSC Region Baseline	Per-exon normalized depth mean/std	FSC Region

In panel mode, additional on-target baselines are included:

Component	Description
GC Bias (ontarget)	On-target GC correction model
FSD Baseline (ontarget)	On-target FSD stats
TFBS Baseline (ontarget)	Panel-specific TF entropy
ATAC Baseline (ontarget)	Panel-specific ATAC entropy

Panel Mode

When --target-regions is provided:

GC model trained on off-target fragments only (unbiased by capture)
FSD/WPS include separate on-target statistics
Output model includes panel_mode=true in metadata

Recommendations

Minimum samples: 10+ for stable baselines
FSC gene/region: Requires minimum 3 samples per gene for statistics
Same assay: All samples must be from the same assay
Same reference: Must match reference used for processing
Healthy samples: Use confirmed non-cancer samples only

MDS Baseline

The MDS (Motif Diversity Score) baseline is computed from k-mer frequencies at fragment ends:

Metric	Description
`kmer_expected`	Mean frequency per 4-mer across samples
`kmer_std`	Standard deviation per 4-mer
`mds_mean`	Mean MDS (Shannon entropy)
`mds_std`	MDS standard deviation

Important

MDS baseline requires BAM/CRAM input because it needs fragment end sequences. BED.gz files cannot provide this data.

Z-score interpretation:

mds_z	Interpretation
-2 to +2	Normal range
< -2	Abnormally low diversity
> +2	Rare, check data quality

HPC Usage (SLURM)

For running build-pon on HPC clusters with SLURM, use one of these approaches:

Resource Planning

Parallel Samples	Memory	CPUs	Est. Time (50 samples)
2	50 GB	16	~48 hours
4	100 GB	32	~24 hours
6	150 GB	48	~12 hours

Tip

High-coverage panel data (e.g., MSK-ACCESS) typically uses 15-20 GB per sample during peak extraction. WGS data may use less. Start conservative and adjust based on memory logs.

sbatch Script (Recommended)

Create run_pon.sh:

#!/bin/bash
#SBATCH --partition=your_partition
#SBATCH --job-name=build_pon
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=48
#SBATCH --mem=150G
#SBATCH --time=12:00:00
#SBATCH --output=pon_build_%j.out
#SBATCH --error=pon_build_%j.err

# Create temp directory
mkdir -p ./pon_temp

krewlyzer build-pon samples.txt \
  --assay your-assay \
  -r /path/to/reference.fasta \
  -o output.pon.parquet \
  --temp-dir ./pon_temp \
  --threads 48 \
  -P 6 \
  --memory-per-sample 20 \
  --sample-timeout 7200 \
  --allow-failures \
  -v

Submit with:

sbatch run_pon.sh

srun One-liner

mkdir -p ./pon_temp && srun \
  --partition=your_partition \
  --job-name=build_pon \
  --nodes=1 \
  --ntasks=1 \
  --cpus-per-task=48 \
  --mem=150G \
  --time=12:00:00 \
  krewlyzer build-pon samples.txt \
    --assay your-assay \
    -r /path/to/reference.fasta \
    -o output.pon.parquet \
    --temp-dir ./pon_temp \
    --threads 48 -P 6 --memory-per-sample 20 \
    --sample-timeout 7200 --allow-failures -v \
  2>&1 | tee pon_build.log

Key SLURM Parameters

Parameter	Value	Explanation
`--nodes=1`	1	Single node (Python multiprocessing doesn't span nodes)
`--ntasks=1`	1	One main process (parallelism handled internally)
`--cpus-per-task`	48	All CPUs available to the process
`--mem`	150G	6 samples × 20 GB + buffer

Key krewlyzer Parameters

Parameter	Example	Description
`-P` / `--parallel-samples`	6	Concurrent samples (internal workers)
`--threads`	48	Total threads (divided among workers)
`--memory-per-sample`	20	Memory hint for auto-mode (panel: 15-20 GB)
`--sample-timeout`	7200	Per-sample timeout in seconds (2 hours)
`--temp-dir`	./pon_temp	Local scratch directory
`--allow-failures`	-	Continue if a sample fails

Warning

If jobs are OOM-killed, reduce -P or increase --mem. The new memory monitoring will log usage at each processing stage.