Output Files Reference

Complete reference for every file Krewlyzer produces — what it contains, what each column means, when to use it, and how to apply it in ML models.

Quick Reference

File	Feature	Resolution	ML Signal
`{s}.FSD.tsv`	FSD	Per chr arm	Arm-level fragmentation shift
`{s}.FSR.tsv`	FSR	5 Mb / 100 kb	PON-normalized short/long ratio
`{s}.FSC.tsv`	FSC	5 Mb window	Multi-channel size coverage
`{s}.FSC.gene.tsv`	FSC	Per gene	Gene fragmentation composition
`{s}.FSC.regions.tsv`	FSC	Per exon/target	Exon fragmentation composition
`{s}.FSC.regions.e1only.tsv`	FSC-E1	Per gene (E1)	Promoter-proximal signal
`{s}.fsc_counts.tsv`	FSC-raw	Per GC bin	GC correction diagnostics
`{s}.WPS.parquet`	WPS	Per anchor	Nucleosome positioning
`{s}.WPS.panel.parquet`	WPS	Panel anchors	Gene-level nucleosome
`{s}.WPS_background.parquet`	WPS-bg	Alu stacks	Global chromatin state
`{s}.EndMotif.tsv`	Motif	Global (1 row)	256 4-mer end frequencies
`{s}.EndMotif1mer.tsv`	Motif	Global (4 rows)	Base composition at ends
`{s}.BreakPointMotif.tsv`	Motif	Global (1 row)	256 4-mer break frequencies
`{s}.MDS.tsv`	MDS	Global (1 row)	Motif diversity scalar
`{s}.MDS.exon.tsv`	Region-MDS	Per exon	Per-exon motif diversity
`{s}.MDS.gene.tsv`	Region-MDS	Per gene	Per-gene MDS + E1
`{s}.OCF.tsv`	OCF	Per tissue	Tissue-of-origin score
`{s}.OCF.sync.tsv`	OCF	Positional	Strand-phased profiles
`{s}.TFBS.tsv`	TFBS	Per TF	TF footprint entropy
`{s}.TFBS.sync.tsv`	TFBS	Per TF × size	Size-resolved TF profiles
`{s}.ATAC.tsv`	ATAC	Per tissue	ATAC region entropy
`{s}.ATAC.sync.tsv`	ATAC	Per tissue × size	Size-resolved ATAC profiles
`{s}.mFSD.tsv`	mFSD	Per variant	Variant fragment size metrics
`{s}.mFSD.distributions.tsv`	mFSD	Per variant × size	Raw size histograms
`{s}.UXM.tsv`	UXM	Per region	U/X/M methylation fractions
`{s}.correction_factors.tsv`	GC	Per (len, GC) bin	GC bias weights
`{s}.metadata.tsv`	Meta	Global	Run parameters + QC
`{s}.features.json`	All	All	Unified ML feature export

{s} = sample name.

Output Format Options

Tabular outputs (except WPS, which is always Parquet) can be written in three formats, selected with --output-format:

Flag	Files produced	How to read
`--output-format tsv` (default)	`{s}.FSD.tsv`, `{s}.FSC.tsv`, etc.	`pd.read_csv(path, sep="\t")`
`--output-format parquet`	`{s}.FSD.parquet`, `{s}.FSC.parquet`, etc.	`pd.read_parquet(path)`
`--output-format both`	Both `.tsv` and `.parquet` for every tabular file	Either reader

The file content (columns, rows, values) is identical across formats — only the encoding changes.

TSV Compression (`--compress`)

When --compress is combined with --output-format tsv or --output-format both, every TSV file is gzip-compressed and given a .tsv.gz extension:

{s}.FSD.tsv.gz    →  pd.read_csv(path, sep="\t", compression="gzip")
{s}.FSC.tsv.gz    →  pd.read_csv(path, sep="\t", compression="gzip")
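
Because the same feature can land on disk as `.parquet`, `.tsv.gz`, or `.tsv`, downstream code often needs to work out which one exists. The following is a minimal sketch of such a lookup; `pick_output` is a hypothetical helper (not part of Krewlyzer), and the Parquet-first preference order is an assumption.

```python
def pick_output(filenames, sample, feature):
    """Return the file name to load for one feature of one sample.

    `filenames` is a listing of the output directory (e.g. os.listdir(out_dir)).
    Hypothetical helper: prefers Parquet, then gzipped TSV, then plain TSV.
    """
    available = set(filenames)
    for suffix in (".parquet", ".tsv.gz", ".tsv"):
        name = f"{sample}.{feature}{suffix}"
        if name in available:
            return name
    raise FileNotFoundError(f"no {feature} output found for sample {sample}")


# Example listing produced by --output-format tsv --compress
listing = ["s1.FSD.tsv.gz", "s1.FSC.tsv.gz", "s1.WPS.parquet"]
chosen = pick_output(listing, "s1", "FSD")
```

The returned name can then be passed to `pd.read_parquet` when it ends in `.parquet`, or to `pd.read_csv(path, sep="\t")` otherwise.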

WPS — Always Parquet

*.WPS.parquet, *.WPS_background.parquet, and *.WPS.panel.parquet are always Parquet regardless of --output-format. WPS stores thousands of 200-point per-anchor profiles — TSV at that scale would be hundreds of MB and functionally unusable. Use pd.read_parquet() for all WPS files.

Unified JSON (`--generate-json`)

The --generate-json flag produces {s}.features.json in addition to the standard TSV/Parquet outputs. It aggregates every feature above into a single file for ML pipelines. JSON generation is independent of --output-format — you can use both together.

# Example: Parquet outputs + unified JSON
krewlyzer run-all sample.bam -r hg19.fa -o out/ \
    --output-format parquet \
    --generate-json
# Produces: *.FSD.parquet, *.FSC.parquet ... AND sample.features.json

See JSON Export Reference for the complete JSON schema.

On-Target vs Off-Target Files

Most fragmentomics features generate two parallel outputs in panel mode: a standard file (off-target reads) and an .ontarget.tsv variant (on-target reads). Understanding the difference is critical for choosing the right input to any ML model.

What "On-Target" and "Off-Target" Mean

In panel sequencing (e.g. MSK-ACCESS), reads fall into two categories:

Category	Reads	Depth	GC bias
Off-target	Do NOT overlap capture bait regions	Low (~1–5×) but genome-wide	Unbiased — not affected by probe GC
On-target	Overlap the capture bait regions	High (~500–2000×) but only at panel genes	Biased — capture efficiency varies by probe GC content

Why the Split Matters

          All cfDNA fragments in BAM
                    │
         ┌──────────┴──────────┐
    Off-target             On-target
  (genome-wide)         (panel genes only)
         │                    │
   FSC.tsv, FSR.tsv      FSC.ontarget.tsv
   MDS.tsv, OCF.tsv      FSR.ontarget.tsv
   FSD.tsv, ...          FSD.ontarget.tsv

The off-target pool is sampled uniformly across the genome, unaffected by probe GC, making it the gold standard for fragmentation features (FSR, FSC, FSD, MDS, OCF). The GC correction model is also trained exclusively on off-target fragments.

The on-target pool has high depth at panel loci but is confounded by capture efficiency — probes with higher GC content capture more efficiently, making fragment size distributions at those loci appear artificially different. On-target reads are primarily useful for gene-level copy number and region-specific analysis, not genome-wide fragmentation.

Which to Use

Task	Use	Reason
Pan-cancer ML features (FSR, FSC, FSD, MDS, OCF)	Off-target (`.tsv`)	Unbiased, genome-wide, GC-corrected with unbiased model
Gene-level copy number (FSC.gene.tsv)	On-target internally	Gene FSC already uses on-target correction factors
Gene fragmentation composition (FSC.regions, E1)	On-target reads, but captured in gene/region TSVs	Not the raw `.ontarget.tsv` — use FSC.gene / FSC.regions
Motif features for tissue-of-origin (MDS, EDM, BPM)	Off-target (`.tsv`)	On-target motifs are biased by probe sequence
MDS on-target (when off-target too sparse)	`.ontarget.tsv`	Lower depth but gene-anchored signal
OCF tissue-of-origin	`OCF.tsv` (WGS) or `OCF.offtarget.tsv` (panel)	`OCF.tsv` = all reads (WGS has no target split); in panel mode `OCF.offtarget.tsv` is the unbiased off-target score — use that instead of `OCF.tsv` to avoid contamination from capture-biased on-target reads
Building a PON	Off-target only	Must match what the sample uses

Do not mix off-target and on-target in the same model

Features from FSR.tsv (off-target) and FSR.ontarget.tsv (on-target) are not on the same scale. The on-target pool has different GC bias, different fragment size distributions, and different effective depth. Always use the same variant consistently across all samples in a cohort.
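
One way to enforce this rule is to check the file names of a cohort before building a feature matrix. The sketch below infers the read pool from the `.ontarget` / `.offtarget` name component described in this section; `pool_of` and `check_single_pool` are hypothetical helpers, not Krewlyzer functions.

```python
def pool_of(filename):
    """Classify an output file name as 'ontarget', 'offtarget', or 'base'."""
    parts = filename.split(".")
    if "ontarget" in parts:
        return "ontarget"
    if "offtarget" in parts:
        return "offtarget"
    return "base"


def check_single_pool(filenames):
    """Raise ValueError if a cohort mixes read pools; otherwise return the pool."""
    if not filenames:
        raise ValueError("empty cohort")
    pools = {pool_of(name) for name in filenames}
    if len(pools) > 1:
        raise ValueError(f"mixed read pools in cohort: {sorted(pools)}")
    return pools.pop()


cohort_pool = check_single_pool(["a.FSR.tsv", "b.FSR.tsv", "c.FSR.tsv"])
```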

Complete On-Target / Off-Target File Inventory

Base file	On-target variant	Off-target variant
`{s}.FSD.tsv`	`{s}.FSD.ontarget.tsv`	— (base is off-target)
`{s}.FSR.tsv`	`{s}.FSR.ontarget.tsv`	—
`{s}.FSC.tsv`	`{s}.FSC.ontarget.tsv`	—
`{s}.EndMotif.tsv`	`{s}.EndMotif.ontarget.tsv`	—
`{s}.BreakPointMotif.tsv`	`{s}.BreakPointMotif.ontarget.tsv`	—
`{s}.MDS.tsv`	`{s}.MDS.ontarget.tsv`	—
`{s}.OCF.tsv`	`{s}.OCF.ontarget.tsv`	`{s}.OCF.offtarget.tsv` ⚠️
`{s}.OCF.sync.tsv`	`{s}.OCF.ontarget.sync.tsv`	`{s}.OCF.offtarget.sync.tsv`
`{s}.TFBS.tsv`	`{s}.TFBS.ontarget.tsv`	—
`{s}.TFBS.sync.tsv`	`{s}.TFBS.ontarget.sync.tsv`	—
`{s}.ATAC.tsv`	`{s}.ATAC.ontarget.tsv`	—
`{s}.ATAC.sync.tsv`	`{s}.ATAC.ontarget.sync.tsv`	—
`{s}.correction_factors.tsv`	`{s}.correction_factors.ontarget.tsv`	—

⚠️ OCF is a special case: it can produce up to three output variants:

File	Contains	When generated
`{s}.OCF.tsv`	All reads (on + off combined)	Always
`{s}.OCF.ontarget.tsv`	On-target reads only	Panel mode
`{s}.OCF.offtarget.tsv`	Off-target reads only	Panel mode

In WGS mode, OCF.tsv = all reads ≈ off-target (no target split exists). In panel mode, OCF.tsv mixes on-target (capture-biased) and off-target reads — for unbiased tissue-of-origin signal, use OCF.offtarget.tsv instead.

GC Correction and On-Target

The primary GC correction model is trained on off-target reads only; on-target reads get their own, separately fitted model:

Off-target reads → GC model training → correction_factors.tsv
On-target reads  → correction_factors.ontarget.tsv (separate model)

The on-target GC model accounts for probe-specific capture bias. It is used internally when generating FSC.gene.tsv and FSC.regions.tsv — you do not need to apply it manually.

Core Fragmentomics

FSD (Fragment Size Distribution)

File: {sample}.FSD.tsv / {sample}.FSD.ontarget.tsv

FSD measures how many fragments of each size (65–400 bp, 5 bp bins) come from each chromosomal arm. Each row is one arm.

Columns

Column	Type	Description
`region`	str	Arm coordinate range, e.g. `chr1:10001-121535433`, `chr17:25263006-81195210`
`65-69`, `70-74`, … `395-399`	float	GC-corrected fragment count in that 5 bp size bin
`total`	float	Total GC-corrected fragment count for this arm

Purpose & Use Cases

Arm-level fragmentation fingerprint: Each arm's histogram reflects the chromatin state of that chromosomal region
Aneuploidy / CNV detection: Arms with deletions or amplifications show altered absolute counts in total
Tumor-specific size shift: Cancer plasma shows a systematic shift toward shorter fragments genome-wide

ML Use Case

import pandas as pd
import numpy as np

df = pd.read_csv("sample.FSD.tsv", sep="\t", index_col="region")
size_cols = [c for c in df.columns if c != "total"]

# Feature vector: proportion of each size class per arm
props = df[size_cols].div(df["total"] + 1e-9, axis=0)

# Reduce to 3-class: short (65-149), mono (150-259), long (260-400)
props["short_frac"] = props[[c for c in size_cols if int(c.split("-")[0]) < 150]].sum(axis=1)
props["mono_frac"]  = props[[c for c in size_cols if 150 <= int(c.split("-")[0]) < 260]].sum(axis=1)
props["long_frac"]  = props[[c for c in size_cols if int(c.split("-")[0]) >= 260]].sum(axis=1)

Best for: Arm-level short_frac as 46-feature input (2 arms × 23 chromosomes) per sample for pan-cancer models.

On-target variant

FSD.ontarget.tsv uses only reads overlapping target capture regions. Useful for panel-specific copy number analysis, but capture-biased — prefer off-target for fragmentation ML features.

FSR (Fragment Size Ratio)

File: {sample}.FSR.tsv / {sample}.FSR.ontarget.tsv

FSR computes the PON-normalized short-to-long ratio per window (5 Mb in WGS mode, 100 kb in panel mode). This is the primary genome-wide tumor fraction biomarker.

Columns

Column	Type	Description
`region`	str	Window region, e.g. `chr1:0-5000000` (WGS) or `chr1:0-100000` (panel)
`short_count`	int	Count of short frags: ultra_short + core_short (65–149 bp)
`long_count`	int	Count of long frags: di_nucl + long (221–400 bp)
`total_count`	int	Total fragment count in window
`short_norm`	float	`short_count / PON_short_mean` — PON-normalized short
`long_norm`	float	`long_count / PON_long_mean` — PON-normalized long
`short_long_ratio`	float	`short_norm / long_norm` — primary biomarker
`short_long_log2`	float	`log2(short_long_ratio)` — ML-ready signed metric
`short_frac`	float	`short_count / total_count` — raw proportion
`long_frac`	float	`long_count / total_count` — raw proportion

Purpose & Use Cases

Tumor fraction estimation: short_long_ratio tends to increase with ctDNA fraction
Genome-wide cancer screen: Profile of 500+ windows per sample captures focal and arm-level alterations
PON comparison: PON normalization (short_norm, long_norm) removes batch effects before ratio — critical for cross-batch comparison

Why not just use the raw short_frac?

short_frac is a raw proportion — it conflates true biology with library prep and GC bias. short_long_ratio divides PON-normalized values, canceling technical noise. Use FSR for any cross-sample comparison.
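
The column definitions in the table above can be traced with a small worked example. This is a sketch of the arithmetic only: `fsr_columns` and the PON means passed to it are hypothetical, and Krewlyzer's internal implementation may differ (for example in how it handles zero counts).

```python
import math


def fsr_columns(short_count, long_count, total_count, pon_short_mean, pon_long_mean):
    """Derive the FSR columns for one window from raw counts and PON means."""
    short_norm = short_count / pon_short_mean
    long_norm = long_count / pon_long_mean
    ratio = short_norm / long_norm
    return {
        "short_norm": short_norm,
        "long_norm": long_norm,
        "short_long_ratio": ratio,
        "short_long_log2": math.log2(ratio),
        "short_frac": short_count / total_count,   # raw, not PON-normalized
        "long_frac": long_count / total_count,     # raw, not PON-normalized
    }


# Illustrative numbers: 1.5x more short fragments than the PON expects
row = fsr_columns(short_count=300, long_count=100, total_count=1000,
                  pon_short_mean=200.0, pon_long_mean=100.0)
```

Here `short_long_ratio` is 1.5 while `short_frac` is 0.3: the first is expressed relative to the PON, the second is only a proportion within the sample.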

ML Use Case

df = pd.read_csv("sample.FSR.tsv", sep="\t")

# Primary feature vector: ~500 windows × 1 scalar
feature_vec = df["short_long_log2"].values  # signed, mean ~0 in healthy

# Genome-wide statistics as compact features
features = {
    "fsr_mean":   df["short_long_log2"].mean(),
    "fsr_std":    df["short_long_log2"].std(),
    "fsr_q90":    df["short_long_log2"].quantile(0.9),
    "fsr_n_elevated": (df["short_long_log2"] > 0.3).sum(),
}

Typical range: healthy ~0.0 ± 0.15; high ctDNA > +0.4 (more short frags than PON)

FSC (Fragment Size Coverage)

File: {sample}.FSC.tsv / {sample}.FSC.ontarget.tsv

FSC counts fragments in 6 non-overlapping size channels across 5 Mb windows — the foundational multi-channel coverage feature.

Columns

Column	Type	Description
`chrom`	str	Chromosome
`start`	int	Window start (0-based)
`end`	int	Window end
`ultra_short`	float	GC-corrected count, 65–100 bp
`core_short`	float	GC-corrected count, 101–149 bp
`mono_nucl`	float	GC-corrected count, 150–220 bp
`di_nucl`	float	GC-corrected count, 221–260 bp
`long`	float	GC-corrected count, 261–400 bp
`ultra_long`	float	GC-corrected count, 401–1000 bp
`total`	float	GC-corrected total, 65–1000 bp
`mean_gc`	float	Mean GC fraction of fragments in window
`*_log2`	float	log₂(channel / PON_mean), with `--pon-model`
`*_reliability`	float	1/(PON_variance + k) — weight for PON columns

Purpose & Use Cases

Multi-channel fragmentation profile: Each channel represents fragments sharing a biological origin (nucleosomal, sub-nucleosomal, apoptotic)
CNV proxy: total counts per window reflect coverage depth and, summed per arm, help detect large-scale copy number events
PON log2 ratios: core_short_log2, mono_nucl_log2 etc. are analogous to CNV log-ratio tracks
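
The `*_log2` and `*_reliability` formulas from the column table can be sketched as follows. The value of the constant `k` is not given in this reference, so `k=1.0` below is an assumption, and `pon_columns` / `weighted_mean` are hypothetical helper names. The second function shows one way the reliability column could serve as a weight when averaging log2 ratios across windows.

```python
import math


def pon_columns(channel_count, pon_mean, pon_variance, k=1.0):
    """Return (log2 ratio, reliability) for one channel in one window.

    k is a stabilizing constant; its real value is an assumption here.
    """
    log2_ratio = math.log2(channel_count / pon_mean)
    reliability = 1.0 / (pon_variance + k)
    return log2_ratio, reliability


def weighted_mean(values, weights):
    """Average values, down-weighting windows where the PON is noisy."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


log2_ratio, reliability = pon_columns(200.0, 100.0, pon_variance=3.0, k=1.0)
summary = weighted_mean([1.0, 0.0], [0.25, 0.75])
```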

ML Use Case

df = pd.read_csv("sample.FSC.tsv", sep="\t")

# 6-channel feature matrix: windows × channels
channels = ["ultra_short", "core_short", "mono_nucl", "di_nucl", "long", "ultra_long"]
X = df[channels].values  # shape: (n_windows, 6)

# Normalize to proportions (remove depth variation)
X_prop = X / (df["total"].values[:, None] + 1e-9)

# With PON: use log2 ratios directly (already depth-normalized)
log2_cols = [c + "_log2" for c in channels if c + "_log2" in df.columns]
X_pon = df[log2_cols].values  # shape: (n_windows, 6) — centered near 0 in healthy

Best for: Input to CNV callers, genome-wide fragmentation classifiers, and tumor fraction regression.

FSC Gene-Level

File: {sample}.FSC.gene.tsv
Requires: --assay xs2 (or other assay code) / run-all

Aggregates FSC across all exons for each panel gene. Rows = genes.

Columns

Column	Type	Description
`gene`	str	HGNC symbol (e.g. `ATM`, `TP53`)
`n_regions`	int	Number of exons/targets captured
`total_bp`	int	Total base pairs covered
`ultra_short` … `long`	float	GC-corrected count per size class
`total`	float	Total GC-corrected count
`ultra_short_ratio` … `long_ratio`	float	`channel / total` — size composition
`normalized_depth`	float	RPKM-like: `(total × 10⁹) / (total_bp × total_frags)`

Purpose & Use Cases

Gene-level copy number: normalized_depth enables comparing coverage across genes in the same sample
Gene fragmentation composition: *_ratio columns show whether a gene's reads are enriched for short (tumor) or long (normal) fragments
Panel-level feature matrix: 146 genes × 5 channels = 730 features per sample
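
The `normalized_depth` formula from the column table, written out as a sketch. `total_frags` is taken here to mean the total fragment count of the sample, which is an assumption; the function name is hypothetical.

```python
def normalized_depth(total, total_bp, total_frags):
    """RPKM-like depth: fragments per kilobase of target per million fragments."""
    return (total * 1e9) / (total_bp * total_frags)


# A gene with 500 corrected fragments over 2,000 bp, in a 1M-fragment sample
depth = normalized_depth(total=500.0, total_bp=2000, total_frags=1_000_000)
```

Dividing by both the covered length and the sample's fragment total is what makes the value comparable across genes of different size and across samples of different depth.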

ML Use Case

df = pd.read_csv("sample.FSC.gene.tsv", sep="\t").set_index("gene")

# Gene-level short enrichment
df["tumor_signal"] = df["ultra_short_ratio"] + df["core_short_ratio"]

# Full channel composition feature matrix: 146 genes × 5 ratios
ratio_cols = ["ultra_short_ratio", "core_short_ratio", "mono_nucl_ratio", "di_nucl_ratio", "long_ratio"]
X = df[ratio_cols].values  # shape: (146, 5) per sample

# Normalized depth for CNV
cnv_proxy = df["normalized_depth"]  # pivot across samples → CNV log-ratio

Best for: Gene-specific models, tissue-of-origin (which genes show altered fragmentation), copy number inference.

FSC Region-Level

File: {sample}.FSC.regions.tsv
Requires: --assay or run-all

Per-exon/target fragment size coverage. Most granular FSC output.

Columns

Column	Type	Description
`chrom`	str	Chromosome
`start` / `end`	int	Exon/target coordinates
`gene`	str	Gene symbol
`region_name`	str	Unique exon/target identifier
`region_bp`	int	Region size in bp
`ultra_short` … `long`	float	GC-corrected counts
`total`	float	Total count
`ultra_short_ratio` … `long_ratio`	float	`channel / total`
`normalized_depth`	float	RPKM-like depth

Purpose & Use Cases

Exon-level resolution: Useful when only specific exons (e.g., hotspot exons in TP53) show altered fragmentation
Fine-grained CNV: Detect focal amplifications or deletions at single-exon scale
Input for PON building: Use to compute exon-level expected depth profiles

ML Use Case

df = pd.read_csv("sample.FSC.regions.tsv", sep="\t")

# Pivot: regions × channels, one row per region
pivot = df.pivot_table(index="region_name", values=["core_short_ratio", "mono_nucl_ratio"])

# Filter to well-covered regions
well_covered = df[df["total"] > 50]["region_name"]
df_filtered = df[df["region_name"].isin(well_covered)]

FSC E1-Only

File: {sample}.FSC.regions.e1only.tsv
Requires: --assay or run-all (disable with --disable-e1-aggregation)

First exon (E1) per gene only. Same columns as FSC.regions.tsv.

Purpose & Use Cases

Promoter-proximal fragmentation: E1 = first exon = nucleosome-depleted region (NDR) near TSS. NDRs have the most cancer-specific fragmentation patterns (Helzer et al. 2025)
Highest cancer signal: E1 consistently outperforms whole-gene FSC in early cancer detection tasks
Compact feature set: 146 genes × 1 exon = compact, interpretable vector

ML Use Case

df = pd.read_csv("sample.FSC.regions.e1only.tsv", sep="\t").set_index("gene")

# Primary cancer signal: promoter short enrichment
df["promoter_short"] = df["ultra_short_ratio"] + df["core_short_ratio"]

# 146-gene feature vector — best single FSC feature for early detection
X = df["promoter_short"].values

Use E1 over gene-level for early detection models

e1only typically achieves higher AUC for early-stage cancer than FSC.gene.tsv because E1 captures NDR-specific fragmentation that is washed out by whole-gene averaging.

FSC Counts (Pre-Correction)

File: {sample}.fsc_counts.tsv / {sample}.correction_factors.ontarget.tsv

Raw bin-level fragment counts before GC correction — used internally for GC model training.

Purpose & Use Cases

GC bias diagnostics: Compare observed vs expected counts per (length, GC) bin
Batch QC: Libraries with systematic GC bias show large correction factors in specific bins
Not for ML features: GC-corrected values in FSC.tsv are the right input; fsc_counts is pre-correction

Nucleosome Positioning

WPS (Windowed Protection Score)

File: {sample}.WPS.parquet

WPS is always Parquet

WPS outputs (*.WPS.parquet, *.WPS_background.parquet, *.WPS.panel.parquet) are always written as Parquet, regardless of the --output-format flag. WPS vectors are thousands of 200-point profiles — TSV at that scale would be hundreds of MB and functionally unusable. Use pd.read_parquet() to load WPS files.

Per-anchor nucleosome protection profiles. Each row is one genomic anchor (gene TSS or CTCF site). WPS measures how strongly a region is protected, computed separately for two fragment-size classes: nucleosomal (120–180 bp) and TF footprint (35–80 bp).

Columns (Parquet)

Column	Type	Description
`region_id`	str	Anchor identifier (e.g. `ENSG00000142611_TSS`)
`chrom`	str	Chromosome
`center`	int	Anchor midpoint
`strand`	str	`+` / `-`
`region_type`	str	`TSS`, `CTCF`, etc.
`wps_nuc`	float	Raw nucleosomal WPS (120–180 bp fragments)
`wps_tf`	float	Raw TF footprint WPS (35–80 bp fragments)
`wps_nuc_smooth`	float	Savitzky-Golay smoothed nucleosomal WPS
`wps_tf_smooth`	float	Savitzky-Golay smoothed TF WPS
`wps_nuc_mean`	float	Mean WPS across anchor window
`wps_tf_mean`	float	Mean TF WPS across anchor window
`prot_frac_nuc`	float	Fraction of window with WPS > 0 (nucleosome-covered)
`prot_frac_tf`	float	Fraction of window with TF WPS > 0
`wps_nuc_z`	float	Z-score vs PON baseline (with `--pon-model`)
`wps_tf_z`	float	TF WPS Z-score vs PON

Purpose & Use Cases

Nucleosome positioning: wps_nuc_smooth detects nucleosome phasing at TSS/CTCF anchors
TF accessibility: wps_tf / prot_frac_tf reflects accessible chromatin at TF binding sites
Cancer signal: Tumor DNA shows flattened/disrupted WPS profiles at TSS of active genes
NRL (Nucleosome Repeat Length): Computed in WPS_background.parquet from Alu stacking
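
For readers unfamiliar with the score itself, the sketch below follows the commonly used WPS definition: at each position, count fragments that fully span a window centered there and subtract fragments with an endpoint inside it. The 120 bp window and the inclusive coordinates are assumptions for illustration; this reference does not state the window size Krewlyzer uses, and the function names are hypothetical.

```python
def wps_at(position, fragments, window=120, size_range=(120, 180)):
    """Windowed protection score at one position.

    fragments: list of (start, end) pairs, inclusive coordinates.
    Only fragments whose length falls in size_range are considered.
    """
    half = window // 2
    lo, hi = position - half, position + half
    score = 0
    for start, end in fragments:
        length = end - start + 1
        if not (size_range[0] <= length <= size_range[1]):
            continue
        if start < lo and end > hi:
            score += 1   # fragment fully spans the window (protected)
        elif lo <= start <= hi or lo <= end <= hi:
            score -= 1   # fragment ends inside the window (cut site)
    return score


def protected_fraction(profile):
    """Fraction of positions with WPS > 0, as in prot_frac_nuc."""
    return sum(1 for value in profile if value > 0) / len(profile)


frags = [(0, 166), (10, 176), (90, 250)]
score = wps_at(85, frags)
frac = protected_fraction([1, -1, 0, 2])
```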

ML Use Case

import pandas as pd

df = pd.read_parquet("sample.WPS.parquet")

# Feature vector per sample: mean WPS across all anchors
features = {
    "wps_nuc_global_mean": df["wps_nuc_mean"].mean(),
    "wps_nuc_global_std":  df["wps_nuc_mean"].std(),
    "prot_frac_nuc_mean":  df["prot_frac_nuc"].mean(),
    "prot_frac_tf_mean":   df["prot_frac_tf"].mean(),
}

# Per-anchor feature matrix for gene-level models
X = df[["wps_nuc_mean", "wps_tf_mean", "prot_frac_nuc", "prot_frac_tf"]].values
# shape: (~15,000 anchors, 4) — one row per TSS/CTCF

Best for: Deep learning input (WPS profiles as 1D signals), nucleosome periodicity score, TSS accessibility classifier.

WPS Panel

File: {sample}.WPS.panel.parquet
Requires: --assay

Same columns as WPS.parquet, filtered to panel gene anchors (~1,820 for xs2). Rows are panel-gene TSS and CTCF sites.

ML Use Case

df = pd.read_parquet("sample.WPS.panel.parquet")

# Compact panel feature: 1820 anchors × 4 = 7280 features
X = df[["wps_nuc_mean", "wps_tf_mean", "prot_frac_nuc", "prot_frac_tf"]].values

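# Gene-indexed lookup. The key must match a region_id present in the file;
# "TP53_TSS" is illustrative (IDs may be Ensembl-style, e.g. ENSG..._TSS).
df_gene = df.set_index("region_id")
tp53_wps = df_gene.loc["TP53_TSS", "wps_nuc_mean"]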

WPS Background

File: {sample}.WPS_background.parquet

Hierarchical Alu element stacking analysis capturing global chromatin state and nucleosome repeat length (NRL).

Columns

Column	Type	Description
`group_id`	str	`Global_All`, `Family_AluY/S/J/Other`, `Chr{N}_All`
`stacked_wps_nuc`	float[]	30-position binned stacked WPS (nucleosomal)
`stacked_wps_tf`	float[]	30-position binned stacked WPS (TF)
`alu_count`	int	Number of Alu elements in this group
`mean_wps_nuc`	float	Mean WPS amplitude
`nrl_bp`	float	Estimated Nucleosome Repeat Length in bp (~190 in healthy)
`nrl_deviation_bp`	float	Deviation from expected 190 bp NRL
`periodicity_score`	float	Signal-to-noise ratio of periodicity (0–1)
`adjusted_score`	float	Periodicity score penalized by NRL deviation
`fragment_ratio`	float	Ratio of short/long fragments at Alu sites

Purpose & Use Cases

Global chromatin compaction: nrl_bp shortens in cancer (chromatin opens globally)
NRL as tumor biomarker: Healthy plasma NRL ~190 bp; cancer < 185 bp
Background correction: Used to compute "global_pon" baseline for WPS z-scores

ML Use Case

df = pd.read_parquet("sample.WPS_background.parquet")
global_row = df[df["group_id"] == "Global_All"].iloc[0]

features = {
    "nrl_bp": global_row["nrl_bp"],
    "periodicity_score": global_row["periodicity_score"],
    "adjusted_score": global_row["adjusted_score"],
    "fragment_ratio_bg": global_row["fragment_ratio"],
}

Best for: Global chromatin state features, cancer screening, NRL as continuous tumor fraction predictor.

Motif & Tissue-of-Origin

EndMotif

File: {sample}.EndMotif.tsv / {sample}.EndMotif.ontarget.tsv

4-mer frequencies at fragment 5′ ends. One row per sample, 256 columns (one per AAAA→TTTT 4-mer).

Columns

Column	Description
`AAAA` … `TTTT`	Frequency of that 4-mer at fragment ends (sums to 1.0)

Purpose & Use Cases

Tissue-of-origin: Different tissues have distinct end-motif preferences based on DNASE1L3 activity
Cancer detection: DNASE1L3 is suppressed in cancer, producing a flattened, less-specific motif profile
MDS input: Raw material for Motif Diversity Score calculation
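
To make the 256-column layout concrete, here is a sketch of how such a frequency vector could be built from a list of fragment-end sequences. It illustrates the data structure only and is not Krewlyzer's extraction code; `end_motif_frequencies` is a hypothetical name, and skipping motifs that contain `N` is an assumption.

```python
from collections import Counter
from itertools import product

# All 256 4-mers in lexicographic order: AAAA, AAAC, ..., TTTT
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def end_motif_frequencies(end_sequences):
    """Map each 4-mer to its frequency among the first 4 bases of each read end."""
    counts = Counter(seq[:4].upper() for seq in end_sequences)
    total = sum(counts[k] for k in KMERS)  # motifs with N or < 4 bases are skipped
    if total == 0:
        raise ValueError("no valid 4-mer ends found")
    return {k: counts[k] / total for k in KMERS}


freqs = end_motif_frequencies(["CCCAGT", "CCCATT", "AAAATG", "NNNNAA"])
```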

ML Use Case

df = pd.read_csv("sample.EndMotif.tsv", sep="\t")
# Single row: 256 4-mer frequencies
X = df.iloc[0].values  # 256-dimensional feature vector

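# Reduce by GC content (256 → 5 groups). One possible grouping, shown as an
# illustration: bucket 4-mers by how many G/C bases they contain (0 to 4).
motifs = [m for m in df.columns if len(m) == 4 and set(m) <= set("ACGT")]
gc_groups = {k: [m for m in motifs if m.count("G") + m.count("C") == k] for k in range(5)}
X_gc = [df[cols].iloc[0].sum() for cols in gc_groups.values()]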

EndMotif 1-mer

File: {sample}.EndMotif1mer.tsv

Single-base (A/C/G/T) composition at fragment ends.

Columns: `base`, `fraction`

Purpose & Use Cases

GC bias QC: Should be roughly balanced; extreme GC skew indicates library quality issues
DNASE1L3 proxy: Healthy cfDNA has distinct strand-asymmetric base preferences

BreakPointMotif

File: {sample}.BreakPointMotif.tsv / {sample}.BreakPointMotif.ontarget.tsv

4-mer frequencies at internal fragment breakpoints (rather than ends). Same 256-column format as EndMotif.

Purpose vs EndMotif

	EndMotif	BreakPointMotif
Measures	DNASE1L3 cutting preference	Mechanical fragmentation patterns
Cancer signal	DNASE1L3 suppression	Chromatin compaction / MNase-like cleavage
Correlation	r ~ 0.6 with BPM	Complementary signal

Best for: Combining both in multimodal ML models — they capture distinct biological processes.

MDS (Motif Diversity Score)

File: {sample}.MDS.tsv / {sample}.MDS.ontarget.tsv

Single-number summary of end-motif randomness (Shannon entropy of 256 4-mers).

Columns

Column	Type	Description
`Sample`	str	Sample name
`MDS`	float	Motif Diversity Score (Shannon entropy of 256 4-mers)
`mds_z`	float	Z-score vs PON on-target baseline (`.ontarget` variant only, with `--pon-model`)

Range: Healthy plasma ~0.80–0.85; cancer (DNASE1L3 suppressed) < 0.75
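
A sketch of the entropy calculation behind a score on this 0 to 1 scale. Dividing by log(256) so that the result falls between 0 and 1 is an assumption inferred from the range quoted above; `motif_diversity_score` is a hypothetical name.

```python
import math


def motif_diversity_score(frequencies):
    """Shannon entropy of motif frequencies, normalized to [0, 1].

    frequencies: iterable of non-negative values summing to 1
    (e.g. the 256 columns of EndMotif.tsv).
    """
    freqs = list(frequencies)
    entropy = -sum(p * math.log(p) for p in freqs if p > 0)
    return entropy / math.log(len(freqs))


uniform = motif_diversity_score([1 / 256] * 256)          # every motif equally likely
single = motif_diversity_score([1.0] + [0.0] * 255)       # one motif only
```

A perfectly uniform motif profile scores 1 and a profile concentrated on a single motif scores 0.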

ML Use Case

mds = float(pd.read_csv("sample.MDS.tsv", sep="\t")["MDS"].iloc[0])
# Single scalar — highly interpretable cancer feature

MDS Exon-Level

File: {sample}.MDS.exon.tsv
Requires: region-mds command or run-all

MDS calculated per exon/target from BAM reads overlapping that region.

Columns

Column	Description
`gene`	Gene symbol
`name`	Exon identifier (`gene:exonN` for WGS; target name for panel)
`chrom`	Chromosome
`start` / `end`	Exon coordinates
`strand`	Strand
`n_fragments`	Fragments overlapping this exon
`mds`	Motif Diversity Score for this exon

ML Use Case

df = pd.read_csv("sample.MDS.exon.tsv", sep="\t")

# Filter low-coverage exons
df_high = df[df["n_fragments"] >= 20]

# Per-exon MDS as feature matrix (rows = exons)
X = df_high[["mds"]].values  # or pivot into gene × exon matrix

MDS Gene-Level

File: {sample}.MDS.gene.tsv
Requires: region-mds command or run-all

Gene-level aggregation of per-exon MDS, plus E1 (first exon) MDS as the promoter-proximal signal.

Columns

Column	Description
`gene`	Gene symbol
`n_exons`	Number of exons with data
`n_fragments`	Total fragments across all exons
`mds_mean`	Mean MDS across all exons
`mds_e1`	MDS of E1 (first exon) only
`mds_std`	Standard deviation of per-exon MDS
`mds_z`	Z-score vs PON (with `--pon-model`)
`mds_e1_z`	E1 MDS z-score vs PON

ML Use Case

df = pd.read_csv("sample.MDS.gene.tsv", sep="\t").set_index("gene")

# E1 MDS is highest-signal feature (promoter-proximal NDR)
X_e1 = df["mds_e1"].values           # 146 features for panel
X_z  = df["mds_e1_z"].fillna(0).values  # PON-normalized — zero-centered in healthy

Best for: Gene-level cancer classifiers, promoter aberration detection, combining with FSC-E1.

OCF (Orientation-aware cfDNA Fragmentation)

File: {sample}.OCF.tsv / {sample}.OCF.ontarget.tsv / {sample}.OCF.offtarget.tsv

Three OCF variants

OCF produces up to three output files:

OCF.tsv — All reads (on + off combined). In WGS mode this is the only file and equals the off-target signal. In panel mode it mixes capture-biased on-target reads with off-target reads — use OCF.offtarget.tsv for unbiased tissue-of-origin in panel mode.
OCF.ontarget.tsv — On-target reads only (panel mode). Useful for gene-anchored OCF but biased by capture efficiency at tissue-specific loci.
OCF.offtarget.tsv — Off-target reads only (panel mode). Preferred for ML features as it is unbiased by capture probe GC content.

Tissue-of-origin scores based on strand asymmetry of fragment ends at tissue-specific open chromatin regions.

Columns

Column	Description
`tissue`	Tissue type (Liver, Lung, Colon, Placenta, etc.)
`OCF`	Raw score = `(U - D) / (U + D)` — upstream/downstream asymmetry
`ocf_z`	OCF z-score vs PON (with `--pon-model`)

Purpose & Use Cases

Tissue-of-origin: Which tissue contributed most cfDNA — useful for cancer site-of-origin
Multi-tissue mixture deconvolution: Multiple elevated ocf_z values suggest multi-tissue contribution
ctDNA fraction proxy: Overall OCF magnitude correlates with cfDNA purity
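
The two OCF columns can be illustrated with a few lines of arithmetic. The raw score follows the `(U - D) / (U + D)` formula from the table; the z-score form `(value - PON mean) / PON std` is the standard definition and is assumed here rather than confirmed by this reference. Both function names are hypothetical.

```python
def ocf_score(upstream_ends, downstream_ends):
    """Strand asymmetry of fragment ends: (U - D) / (U + D)."""
    total = upstream_ends + downstream_ends
    if total == 0:
        return 0.0  # assumption: report no asymmetry when there is no signal
    return (upstream_ends - downstream_ends) / total


def z_score(value, pon_mean, pon_std):
    """Standard z-score against a panel-of-normals baseline."""
    return (value - pon_mean) / pon_std


liver_ocf = ocf_score(upstream_ends=60, downstream_ends=40)
liver_z = z_score(liver_ocf, pon_mean=0.1, pon_std=0.05)
```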

ML Use Case

df = pd.read_csv("sample.OCF.tsv", sep="\t").set_index("tissue")

# OCF score vector across tissues (e.g. 10 tissues = 10 features)
X_ocf = df["OCF"].values
X_z   = df["ocf_z"].fillna(0).values  # zero-centered in healthy

# Tissue with max signal
top_tissue = df["ocf_z"].idxmax()

Best for: Primary tumor site-of-origin classification, multi-class tissue deconvolution.

OCF Sync

File: {sample}.OCF.sync.tsv

Positional strand-specific protection profiles — detailed positional data underlying the OCF summary score.

Columns: `label`, `count`, `mean_size`, `entropy`

Use: Raw data for visualizing strand phasing; input to advanced nucleosome positioning models.

TFBS (Transcription Factor Binding Site Entropy)

File: {sample}.TFBS.tsv / {sample}.TFBS.ontarget.tsv

Fragment size entropy at TFBS regions for 808 transcription factors. Reflects chromatin accessibility and TF binding.

Columns

Column	Description
`label`	TF name (e.g. `CTCF`, `SP1`, `E2F1`)
`count`	Fragment count
`mean_size`	Mean fragment size at this TF's sites
`entropy`	Shannon entropy of fragment size distribution

Purpose & Use Cases

TF accessibility: Low entropy = dominated by one size class (nucleosomal); high entropy = mixed, accessible
TF-specific cancer signal: Some TFs (E2F family, SP1) show altered fragmentation in cancer
808-feature vector: One entropy value per TF = large, rich feature set

ML Use Case

df = pd.read_csv("sample.TFBS.tsv", sep="\t").set_index("label")

# 808 TF entropy values — high-dimensional feature vector
X_tfbs = df["entropy"].values

# Mean size as complementary feature
X_size = df["mean_size"].values

# Combine
X = np.stack([X_tfbs, X_size], axis=1)  # shape: (808, 2)

TFBS Sync

File: {sample}.TFBS.sync.tsv

Per-TF × per-size distribution — the raw size histogram for each TF.

Columns: `label`, `size`, `count`, `proportion`

Use: Size-resolved TF footprinting; input for detecting nucleosome-footprint transitions.
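
The relationship between the sync histogram and the summary columns in TFBS.tsv (`count`, `mean_size`, `entropy`) can be sketched as below. The log base is an assumption (bits are used here), and `summarize_sync` is a hypothetical helper, not the tool's own code.

```python
import math
from collections import defaultdict


def summarize_sync(rows):
    """Collapse (label, size, count) rows into per-label count, mean size, entropy."""
    hist = defaultdict(dict)
    for label, size, count in rows:
        hist[label][size] = hist[label].get(size, 0) + count

    summary = {}
    for label, sizes in hist.items():
        total = sum(sizes.values())
        if total == 0:
            continue  # nothing observed for this label
        probs = [c / total for c in sizes.values() if c > 0]
        summary[label] = {
            "count": total,
            "mean_size": sum(s * c for s, c in sizes.items()) / total,
            "entropy": -sum(p * math.log2(p) for p in probs),
        }
    return summary


rows = [("CTCF", 150, 2), ("CTCF", 170, 2), ("SP1", 166, 4)]
summary = summarize_sync(rows)
```

With pandas, the rows can be supplied as `df[["label", "size", "count"]].itertuples(index=False, name=None)`.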

ATAC (Chromatin Accessibility Entropy)

File: {sample}.ATAC.tsv / {sample}.ATAC.ontarget.tsv

Fragment size entropy at ATAC-seq accessible regions for 23 cancer-relevant tissue types.

Columns

Same as TFBS: label (tissue type), count, mean_size, entropy

Purpose & Use Cases

Tissue-specific accessibility: Which tissue's ATAC peaks show altered fragmentation
Cancer type inference: Different cancer types show tissue-specific ATAC entropy patterns
Complementary to OCF: OCF uses strand asymmetry; ATAC uses size entropy — different signal axis

ML Use Case

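import numpy as np
import pandas as pd

df = pd.read_csv("sample.ATAC.tsv", sep="\t").set_index("label")

# 23 tissue entropy values: compact tissue-of-origin feature
X_atac = df["entropy"].values

# Combine OCF + ATAC tissue vectors for multimodal tissue classifier
# (in panel mode, read OCF.offtarget.tsv instead of OCF.tsv)
ocf_scores = pd.read_csv("sample.OCF.tsv", sep="\t")["OCF"].values
X_combined = np.concatenate([ocf_scores, X_atac])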

ATAC Sync

File: {sample}.ATAC.sync.tsv

Per-tissue × per-size fragment distributions.

Columns: `label`, `size`, `count`, `proportion`

Variant-Level

mFSD (Mutant Fragment Size Distribution)

File: {sample}.mFSD.tsv
Requires: --maf / run-all with MAF input

Per-variant fragment size analysis. Compares ALT-bearing fragments vs REF, NonREF, and N (uncertain) allele classes.

Column Groups (46 total)

Group	Columns	Description
Variant (5)	`Chrom`, `Pos`, `Ref`, `Alt`, `VarType`	Variant coordinates and type
Raw Counts (5)	`REF_Count`, `ALT_Count`, `NonREF_Count`, `N_Count`, `Total_Count`	Fragments per allele class
GC-Weighted (5)	`REF_Weighted`, `ALT_Weighted`, `NonREF_Weighted`, `N_Weighted`, `VAF_GC_Corrected`	GC-corrected counts and VAF
Log-Likelihood (2)	`ALT_LLR`, `REF_LLR`	For low-N variants (ALT_Count < 5)
Mean Sizes (4)	`REF_MeanSize`, `ALT_MeanSize`, `NonREF_MeanSize`, `N_MeanSize`	Mean fragment size per class
KS Tests (18)	`Delta_/KS_/KS_Pval_*` × 6 pairings	KS distance + p-value for each allele pair
Derived (5)	`VAF_Proxy`, `Error_Rate`, `N_Rate`, `Size_Ratio`, `Quality_Score`	Summary biomarkers
Flags (2)	`ALT_Confidence`, `KS_Valid`	Quality indicators

Purpose & Use Cases

MRD detection: VAF_GC_Corrected + KS_Pval_ALT_REF identify true somatic mutations vs noise
Fragment size as orthogonal evidence: A negative Delta_ALT_REF means ALT fragments are shorter than REF fragments, which is consistent with tumor-derived ctDNA
Duplex support: ALT_LLR provides statistically valid evidence even at 1–2 fragment counts
Multi-variant aggregation: Combine evidence across many variants to estimate ctDNA fraction

ML Use Case

import pandas as pd

df = pd.read_csv("sample.mFSD.tsv", sep="\t")

# Filter to high-quality variants
df_hq = df[(df["ALT_Confidence"] == "HIGH") & (df["KS_Valid"] == True)]

# Per-variant feature vector
features = df_hq[["VAF_GC_Corrected", "Delta_ALT_REF", "KS_ALT_REF",
                   "Size_Ratio", "Quality_Score", "ALT_LLR"]].values

# Sample-level aggregation: Quality_Score-weighted mean VAF across variants
# (NaN when no variant passes the filter or all weights are zero)
weights = df_hq["Quality_Score"]
sample_vaf = ((df_hq["VAF_GC_Corrected"] * weights).sum() / weights.sum()
              if weights.sum() > 0 else float("nan"))

Best for: MRD detection models, ctDNA fraction estimation from targeted panels, variant-level cancer classifiers.

mFSD Distributions

File: {sample}.mFSD.distributions.tsv
Requires: --output-distributions flag

Per-variant raw size histograms for manual inspection.

Columns: `Chrom`, `Pos`, `Ref`, `Alt`, `Category`, `Size`, `Count`

Use: Visualizing the ALT vs REF size shift for individual variants; QC and debugging.
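
A quick way to inspect the shift for one variant is to collapse the histogram rows into a count-weighted mean size per allele category. The following is a sketch assuming the columns listed above and that `Category` holds allele-class labels such as `ALT` and `REF`. The exact label strings are an assumption, and the rows are invented.

```python
import io

import pandas as pd

# Invented rows; in practice: pd.read_csv("sample.mFSD.distributions.tsv", sep="\t")
example_tsv = (
    "Chrom\tPos\tRef\tAlt\tCategory\tSize\tCount\n"
    "chr1\t1000\tA\tT\tALT\t140\t3\n"
    "chr1\t1000\tA\tT\tALT\t150\t1\n"
    "chr1\t1000\tA\tT\tREF\t160\t2\n"
    "chr1\t1000\tA\tT\tREF\t170\t2\n"
)
dist = pd.read_csv(io.StringIO(example_tsv), sep="\t")

# Count-weighted mean size per variant and allele category
dist["weighted"] = dist["Size"] * dist["Count"]
grouped = dist.groupby(["Chrom", "Pos", "Ref", "Alt", "Category"])[["weighted", "Count"]].sum()
mean_size = (grouped["weighted"] / grouped["Count"]).unstack("Category")

# Negative values mean ALT fragments are shorter than REF for that variant
shift = mean_size["ALT"] - mean_size["REF"]
```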

Methylation

UXM (Methylation)

File: {sample}.UXM.tsv
Requires: Methylation-enabled BAM input

CpG methylation classification per genomic region. Each fragment is classified as Unmethylated (U), Partially-methylated (X), or fully Methylated (M).

Columns

Column	Description
`region`	Genomic region identifier
`U`	Fraction of unmethylated fragments
`X`	Fraction of partially methylated fragments
`M`	Fraction of fully methylated fragments
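
The classification step can be pictured as thresholding each fragment's methylated-CpG fraction and then tallying the three classes per region. In this minimal sketch, the 25% / 75% cut-offs and the 3-CpG minimum are illustrative placeholders rather than Krewlyzer's documented settings, and `classify_fragment` and `region_fractions` are hypothetical names.

```python
from collections import Counter

# Illustrative placeholders, not Krewlyzer's documented parameters
U_MAX = 0.25   # methylated fraction at or below this -> U
M_MIN = 0.75   # methylated fraction at or above this -> M
MIN_CPGS = 3   # fragments covering fewer CpGs are skipped


def classify_fragment(cpg_calls):
    """cpg_calls: list of booleans (True = methylated CpG). Returns 'U', 'X', 'M', or None."""
    if len(cpg_calls) < MIN_CPGS:
        return None
    frac = sum(cpg_calls) / len(cpg_calls)
    if frac <= U_MAX:
        return "U"
    if frac >= M_MIN:
        return "M"
    return "X"


def region_fractions(fragments):
    """Fraction of classified fragments per class (all zeros if none qualify)."""
    labels = [c for c in map(classify_fragment, fragments) if c is not None]
    counts = Counter(labels)
    total = len(labels)
    return {k: (counts[k] / total if total else 0.0) for k in ("U", "X", "M")}


fragments = [
    [False, False, False, False],  # U
    [True, True, True, True],      # M
    [True, False, True, False],    # X
    [True, True],                  # too few CpGs, skipped
]
fractions = region_fractions(fragments)
```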

Purpose & Use Cases

Tumor DNA methylation: Cancer shows global hypomethylation (↑U) and focal hypermethylation (↑M at tumor suppressor genes)
Tissue-of-origin: Tissue-specific methylation patterns enable ctDNA source attribution
Multi-modal fusion: Combine with FSR and MDS to improve sensitivity in early detection

ML Use Case

import pandas as pd

df = pd.read_csv("sample.UXM.tsv", sep="\t").set_index("region")

# 3-class methylation state per region
X = df[["U", "X", "M"]].values  # shape: (n_regions, 3); rows sum to 1.0

# Sample-level statistics
features = {
    "global_U": df["U"].mean(),   # high = hypomethylation (cancer signal)
    "global_M": df["M"].mean(),   # varies by tissue
    "U_std": df["U"].std(),       # heterogeneity across regions
}

Diagnostic / QC Outputs

GC Correction Factors

File: {sample}.correction_factors.tsv / {sample}.correction_factors.ontarget.tsv

GC-bias correction weights per (fragment length bin, GC content) pair.

Columns

Column	Description
`length_bin_min`	Fragment length bin lower bound (bp)
`length_bin_max`	Fragment length bin upper bound (bp)
`gc_percent`	GC content percentage (0–100)
`observed`	Observed fragment count
`expected`	Expected count from GC model
`correction_factor`	`expected / observed` — multiply raw counts by this

Purpose & Use Cases

Library QC: Values far from 1.0 indicate GC bias in sequencing/capture
Batch effect debugging: Compare correction factors across runs to detect systematic issues
Input to GC model PON: Used to build PON correction factor baselines
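
As a QC sketch, the factor can be recomputed from `observed` and `expected`, and bins that deviate strongly from 1.0 can be flagged. The 0.5 to 2.0 acceptance band below is an arbitrary illustrative choice, and the rows are invented.

```python
import io

import pandas as pd

# Invented rows; in practice: pd.read_csv("sample.correction_factors.tsv", sep="\t")
example_tsv = (
    "length_bin_min\tlength_bin_max\tgc_percent\tobserved\texpected\tcorrection_factor\n"
    "100\t120\t30\t200\t100\t0.5\n"
    "100\t120\t45\t100\t100\t1.0\n"
    "100\t120\t70\t20\t100\t5.0\n"
)
cf = pd.read_csv(io.StringIO(example_tsv), sep="\t")

# correction_factor = expected / observed, so observed * factor recovers expected
recomputed = cf["expected"] / cf["observed"]
corrected = cf["observed"] * cf["correction_factor"]

# Flag bins outside an arbitrary illustrative band around 1.0
LOW, HIGH = 0.5, 2.0
flagged = cf[(cf["correction_factor"] < LOW) | (cf["correction_factor"] > HIGH)]
```

Comparing the `flagged` bins between runs is one simple way to spot the batch effects mentioned above.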

Not for ML features

GC correction is already applied to FSC, FSR, FSD, and WPS outputs. Use those corrected values. correction_factors.tsv is diagnostic data.

Metadata TSV

File: {sample}.metadata.tsv

Run parameters, QC metrics, and processing provenance — written as a single-row tabular file for easy ingestion into PON pipelines and pandas workflows.

import pandas as pd
meta = pd.read_csv("sample.metadata.tsv", sep="\t").iloc[0].to_dict()
print(meta["total_fragments"], meta["on_target_rate"])

Column	Type	Description
`sample_id`	str	Sample identifier
`krewlyzer_version`	str	Version string
`genome`	str	Genome build used (`hg19`, `hg38`)
`assay`	str	Assay name (or empty for WGS)
`total_fragments`	int	Total fragments extracted
`on_target_rate`	float	Fraction of fragments overlapping targets
`mean_fragment_size`	float	Mean fragment length (bp)
`duplication_rate`	float	Estimated duplicate fraction
`processing_time_s`	float	Wall-clock processing time in seconds

Use: Filter samples by QC thresholds before ML training (e.g. total_fragments > 5M, on_target_rate > 0.3).
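
A cohort-level QC filter might look like the sketch below. The three rows are invented. In practice the table would be built by concatenating each sample's single-row metadata file, and the thresholds are the example values quoted above rather than recommended defaults.

```python
import pandas as pd

# Invented metadata rows; in practice, concatenate the single-row
# {sample}.metadata.tsv files of a cohort with pd.concat
cohort = pd.DataFrame(
    {
        "sample_id": ["S1", "S2", "S3"],
        "total_fragments": [12_000_000, 3_000_000, 8_000_000],
        "on_target_rate": [0.55, 0.60, 0.10],
    }
)

MIN_FRAGMENTS = 5_000_000   # example threshold
MIN_ON_TARGET = 0.3         # example threshold

passed = cohort[
    (cohort["total_fragments"] > MIN_FRAGMENTS)
    & (cohort["on_target_rate"] > MIN_ON_TARGET)
]
keep_ids = passed["sample_id"].tolist()  # samples retained for training
```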

Features JSON

File: {sample}.features.json
Requires: --generate-json

Unified export of all features above in a single JSON file for ML pipelines. See JSON Output Reference for the complete schema.

import json
with open("sample.features.json") as f:
    features = json.load(f)

# Access any output without reading individual TSVs
fsr_windows = features["fsr"]["off_target"]
mds = features["motif"]["mds"]
region_mds = features["region_mds"]["gene"]
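
Many ML libraries expect a flat numeric vector rather than nested JSON. The helper below is a generic sketch that flattens any nested dict or list into dotted keys. It makes no assumption about the actual schema, `flatten` is a hypothetical name, and the `example` dict is invented rather than real Krewlyzer output.

```python
def flatten(obj, prefix=""):
    """Recursively flatten nested dicts/lists into {dotted_key: float}.

    Non-numeric leaves (strings, booleans, None) are dropped.
    """
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            flat.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            flat.update(flatten(value, f"{prefix}{i}."))
    elif isinstance(obj, (int, float)) and not isinstance(obj, bool):
        flat[prefix[:-1]] = float(obj)  # drop the trailing dot
    return flat


# Invented structure for illustration; real files follow the JSON Output Reference
example = {
    "motif": {"mds": 0.91},
    "fsr": {"off_target": [{"ratio": 1.2}, {"ratio": 0.8}]},
    "sample": "S1",
}
flat = flatten(example)
```

Sorting the resulting keys before building a vector keeps feature order stable across samples.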

Feature Selection Guide for ML Models

Model type	Recommended features	File(s)
Pan-cancer screen (WGS)	FSR `short_long_log2`, MDS, WPS `prot_frac_nuc_mean`, FSD `short_frac`	FSR, MDS, WPS_background, FSD
Pan-cancer screen (panel)	FSC-E1 `promoter_short`, MDS-E1, OCF `ocf_z`, WPS panel	FSC.e1only, MDS.gene, OCF, WPS.panel
Tumor fraction regression	FSR `short_long_log2` genome-wide, WPS `nrl_bp`	FSR, WPS_background
Cancer type / site-of-origin	OCF tissue vector, ATAC tissue vector, TFBS vector	OCF, ATAC, TFBS
Gene-level amplification	FSC gene `normalized_depth`, FSD arm `total`	FSC.gene, FSD
MRD (residual disease)	mFSD `VAF_GC_Corrected`, `Quality_Score`, `Delta_ALT_REF`	mFSD
Promoter accessibility	WPS E1 `wps_nuc_mean`, MDS E1, FSC-E1 ratios	WPS.panel, MDS.gene, FSC.e1only
Methylation-augmented	UXM `U`/`M` + FSR + MDS	UXM, FSR, MDS

File	Contains	When generated
`{s}.OCF.tsv`	All reads (on + off combined)	Always
`{s}.OCF.ontarget.tsv`	On-target reads only	Panel mode
`{s}.OCF.offtarget.tsv`	Off-target reads only	Panel mode
