build-pon
Build a unified Panel of Normals (PON) model from healthy plasma samples.
Synopsis
Description
Creates a PON model containing all baselines needed for cfDNA analysis from a cohort of healthy samples:
- GC Bias Model - For coverage correction
- FSD Baseline - Fragment size distributions per arm
- WPS Baseline - Nucleosome protection per region
- OCF Baseline - Open chromatin footprinting per region
- MDS Baseline - Motif diversity and k-mer frequencies
- TFBS Baseline - Transcription factor binding site entropy (808 TFs)
- ATAC Baseline - ATAC-seq peak entropy (23 cancer types)
This model is used for bias correction and z-score normalization during sample processing.
Arguments
| Argument | Description |
|---|---|
SAMPLE_LIST |
Text file with paths to BAM or BED.gz files (one per line) |
Required Options
| Option | Description |
|---|---|
-a, --assay |
Assay name (e.g., msk-access-v2) |
-r, --reference |
Reference FASTA file (indexed) |
-o, --output |
Output PON model file (.pon.parquet) |
Optional Parameters
| Option | Default | Description |
|---|---|---|
-G, --genome |
hg19 | Genome build (hg19/GRCh37/hg38/GRCh38) |
-T, --target-regions |
None | BED file with target regions (panel mode) |
--skip-target-regions |
False | Force WGS mode (ignore bundled targets from --assay) |
-W, --wps-anchors |
Built-in | WPS anchors BED.gz (merged TSS+CTCF) |
-b, --bin-file |
Built-in | Bin file for FSC/FSR |
--temp-dir |
System temp | Directory for temporary files |
-p, --threads |
4 | Total threads (divided among parallel samples) |
-P, --parallel-samples |
1 | Number of samples to process in parallel |
--memory-per-sample |
12 | Expected memory per sample in GB (panel: 12-20, WGS: 4-8) |
--sample-timeout |
3600 | Max seconds per sample (0=no timeout) |
--allow-failures |
False | Continue if a sample fails |
--require-proper-pair |
False | Only properly paired reads |
-v, --verbose |
False | Verbose output |
Pipeline Flow
flowchart TB
subgraph Input
SL["sample_list.txt"]
BAM["BAM/CRAM files"]
BED["BED.gz files"]
end
subgraph "Per-Sample Processing (Parallel with -P)"
EXT["Extract fragments"]
GC["GC observations"]
FSD["FSD per arm"]
WPS["WPS per anchor"]
OCF["OCF per region"]
MDS["MDS k-mers"]
FSC_G["FSC gene/region"]
TFBS["TFBS entropy"]
ATAC["ATAC entropy"]
RMDS["Region MDS"]
end
subgraph "Baseline Aggregation"
AGG_GC["GC bias curves\nmean/std per bin"]
AGG_FSD["FSD baseline\nmean/std per arm"]
AGG_WPS["WPS baseline\nmean/std per region"]
AGG_OCF["OCF baseline\nmean/std per region"]
AGG_MDS["MDS baseline\nk-mer mean/std"]
AGG_FSC["FSC gene/region\nmean/std per gene"]
AGG_TFBS["TFBS baseline\nmean/std per TF"]
AGG_ATAC["ATAC baseline\nmean/std per type"]
AGG_RMDS["Region MDS\nmean/std per gene"]
end
subgraph Output
PON["PON.parquet"]
end
SL --> BAM & BED
BAM --> EXT --> GC & FSD & WPS & OCF & MDS & FSC_G & TFBS & ATAC & RMDS
BED --> GC & FSD & WPS & OCF
GC --> AGG_GC --> PON
FSD --> AGG_FSD --> PON
WPS --> AGG_WPS --> PON
OCF --> AGG_OCF --> PON
MDS --> AGG_MDS --> PON
FSC_G --> AGG_FSC --> PON
TFBS --> AGG_TFBS --> PON
ATAC --> AGG_ATAC --> PON
RMDS --> AGG_RMDS --> PON
Examples
WGS Mode:
krewlyzer build-pon healthy_samples.txt \
--assay wgs-hg19 \
--reference hg19.fa \
--output wgs.pon.parquet
Panel Mode (MSK-ACCESS):
krewlyzer build-pon healthy_samples.txt \
--assay msk-access-v2 \
--reference hg19.fa \
--target-regions msk_access_targets.bed \
--output msk-access.pon.parquet \
--threads 16
HPC with custom temp directory:
krewlyzer build-pon healthy_samples.txt \
--assay xs1 \
--reference hg19.fa \
--target-regions msk_access_v1_targets.bed \
--temp-dir /scratch/$USER/pon_tmp \
--output xs1.pon.parquet \
--threads 16 \
--verbose
Input Formats
Sample list file (samples.txt):
Both BAM/CRAM and BED.gz inputs are supported:
| Input Type | Extraction | MDS Baseline | Speed |
|---|---|---|---|
| BAM/CRAM | Full | ✓ Included | Slower |
| BED.gz | Skip | ✗ Not available | Faster |
Note
MDS baseline requires BAM/CRAM input because it needs fragment end sequences for k-mer extraction. BED.gz files only contain coordinates.
Output
The output is a Parquet file containing:
| Component | Description | Used By |
|---|---|---|
| Metadata | assay, build_date, n_samples, reference, panel_mode | All |
| GC Bias | Expected coverage by GC bin for short/intermediate/long fragments | FSC, FSR |
| FSD Baseline | Mean/std size proportions per chromosome arm | FSD |
| WPS Baseline | Mean/std WPS per transcript region | WPS |
| OCF Baseline | Mean/std OCF per open chromatin region | OCF |
| MDS Baseline | K-mer frequencies and MDS mean/std | Motif |
| TFBS Baseline | Mean/std entropy per TF (808 factors) | Region Entropy |
| ATAC Baseline | Mean/std entropy per cancer type (23 types) | Region Entropy |
| Region MDS Baseline | Per-gene MDS mean/std for E1 | Region MDS |
| FSC Gene Baseline | Per-gene normalized depth mean/std | FSC Gene |
| FSC Region Baseline | Per-exon normalized depth mean/std | FSC Region |
In panel mode, additional on-target baselines are included:
| Component | Description |
|---|---|
| GC Bias (ontarget) | On-target GC correction model |
| FSD Baseline (ontarget) | On-target FSD stats |
| TFBS Baseline (ontarget) | Panel-specific TF entropy |
| ATAC Baseline (ontarget) | Panel-specific ATAC entropy |
Panel Mode
When --target-regions is provided:
- GC model trained on off-target fragments only (unbiased by capture)
- FSD/WPS include separate on-target statistics
- Output model includes
panel_mode=truein metadata
Recommendations
- Minimum samples: 10+ for stable baselines
- FSC gene/region: Requires minimum 3 samples per gene for statistics
- Same assay: All samples must be from the same assay
- Same reference: Must match reference used for processing
- Healthy samples: Use confirmed non-cancer samples only
MDS Baseline
The MDS (Motif Diversity Score) baseline is computed from k-mer frequencies at fragment ends:
| Metric | Description |
|---|---|
kmer_expected |
Mean frequency per 4-mer across samples |
kmer_std |
Standard deviation per 4-mer |
mds_mean |
Mean MDS (Shannon entropy) |
mds_std |
MDS standard deviation |
Important
MDS baseline requires BAM/CRAM input because it needs fragment end sequences. BED.gz files cannot provide this data.
Z-score interpretation:
| mds_z | Interpretation |
|---|---|
| -2 to +2 | Normal range |
| < -2 | Abnormally low diversity |
| > +2 | Rare, check data quality |
HPC Usage (SLURM)
For running build-pon on HPC clusters with SLURM, use one of these approaches:
Resource Planning
| Parallel Samples | Memory | CPUs | Est. Time (50 samples) |
|---|---|---|---|
| 2 | 50 GB | 16 | ~48 hours |
| 4 | 100 GB | 32 | ~24 hours |
| 6 | 150 GB | 48 | ~12 hours |
Tip
High-coverage panel data (e.g., MSK-ACCESS) typically uses 15-20 GB per sample during peak extraction. WGS data may use less. Start conservative and adjust based on memory logs.
sbatch Script (Recommended)
Create run_pon.sh:
#!/bin/bash
#SBATCH --partition=your_partition
#SBATCH --job-name=build_pon
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=48
#SBATCH --mem=150G
#SBATCH --time=12:00:00
#SBATCH --output=pon_build_%j.out
#SBATCH --error=pon_build_%j.err
# Create temp directory
mkdir -p ./pon_temp
krewlyzer build-pon samples.txt \
--assay your-assay \
-r /path/to/reference.fasta \
-o output.pon.parquet \
--temp-dir ./pon_temp \
--threads 48 \
-P 6 \
--memory-per-sample 20 \
--sample-timeout 7200 \
--allow-failures \
-v
Submit with:
srun One-liner
mkdir -p ./pon_temp && srun \
--partition=your_partition \
--job-name=build_pon \
--nodes=1 \
--ntasks=1 \
--cpus-per-task=48 \
--mem=150G \
--time=12:00:00 \
krewlyzer build-pon samples.txt \
--assay your-assay \
-r /path/to/reference.fasta \
-o output.pon.parquet \
--temp-dir ./pon_temp \
--threads 48 -P 6 --memory-per-sample 20 \
--sample-timeout 7200 --allow-failures -v \
2>&1 | tee pon_build.log
Key SLURM Parameters
| Parameter | Value | Explanation |
|---|---|---|
--nodes=1 |
1 | Single node (Python multiprocessing doesn't span nodes) |
--ntasks=1 |
1 | One main process (parallelism handled internally) |
--cpus-per-task |
48 | All CPUs available to the process |
--mem |
150G | 6 samples × 20 GB + buffer |
Key krewlyzer Parameters
| Parameter | Example | Description |
|---|---|---|
-P / --parallel-samples |
6 | Concurrent samples (internal workers) |
--threads |
48 | Total threads (divided among workers) |
--memory-per-sample |
20 | Memory hint for auto-mode (panel: 15-20 GB) |
--sample-timeout |
7200 | Per-sample timeout in seconds (2 hours) |
--temp-dir |
./pon_temp | Local scratch directory |
--allow-failures |
- | Continue if a sample fails |
Warning
If jobs are OOM-killed, reduce -P or increase --mem. The new memory monitoring will log usage at each processing stage.