JSON Feature Output

The --generate-json flag produces a single JSON file containing all features for ML integration.

krewlyzer run-all -i sample.bam -r hg19.fa -o out/ \
    --assay xs2 \
    --generate-json

This generates {sample}.features.json alongside the standard TSV/Parquet outputs.


Top-Level Structure

{
  "schema_version": "1.0",
  "sample_id": "sample_001",
  "krewlyzer_version": "0.6.0",
  "timestamp": "2026-02-28T19:00:00",
  "metadata": {
    "genome": "hg19",
    "assay": "xs2",
    "panel_mode": true,
    "on_target_rate": 0.45
  },
  "features": {
    "fsc": { ... },
    "fsc_gene": [ ... ],
    "fsc_region": [ ... ],
    "fsc_region_e1": [ ... ],
    "fsc_counts": [ ... ],
    "fsr": { ... },
    "fsd": { ... },
    "wps": { ... },
    "wps_panel": { ... },
    "wps_background": { ... },
    "motif": { ... },
    "ocf": { ... },
    "tfbs": { ... },
    "atac": { ... },
    "gc_factors": { ... },
    "mfsd": { ... },
    "region_mds": { ... },
    "uxm": { ... }
  },
  "qc": { ... }
}
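
As a minimal sketch of reading this top-level layout, the snippet below parses a trimmed-down document and lists the feature blocks it contains. The helper name `summarize_document` and the inline sample are illustrative assumptions, not part of Krewlyzer.

```python
import json

# Hypothetical, trimmed-down document that follows the layout shown above.
SAMPLE_JSON = """
{
  "schema_version": "1.0",
  "sample_id": "sample_001",
  "metadata": {"genome": "hg19", "assay": "xs2", "panel_mode": true},
  "features": {"fsc": {}, "fsr": {}, "mfsd": {"enabled": false}},
  "qc": {}
}
"""

def summarize_document(doc):
    """Return the sample ID, schema version, panel flag, and sorted feature block names."""
    return {
        "sample_id": doc["sample_id"],
        "schema_version": doc["schema_version"],
        "panel_mode": doc.get("metadata", {}).get("panel_mode", False),
        "feature_blocks": sorted(doc["features"]),
    }

summary = summarize_document(json.loads(SAMPLE_JSON))
print(summary["feature_blocks"])
```

For a real run, replace `json.loads(SAMPLE_JSON)` with `json.load` on the opened {sample}.features.json file.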

Feature Schemas

FSC (Fragment Size Coverage)

Window-based fragment size counts. Split into off_target (genome-wide) and on_target (panel capture regions).

"fsc": {
  "off_target": [
    {
      "region": "chr1:0-500000",
      "ultra_short": 123,
      "core_short": 4567,
      "mono_nucl": 8901,
      "di_nucl": 2345,
      "long": 678,
      "total": 16614,
      "log_ratio_core_short": -0.15,
      "zscore_core_short": -1.23
    }
  ],
  "on_target": [ ... ]
}

| Field | Type | Description |
|---|---|---|
| region | string | Genomic window coordinates |
| ultra_short | int | Fragments 65–99 bp |
| core_short | int | Fragments 100–149 bp |
| mono_nucl | int | Fragments 150–259 bp |
| di_nucl | int | Fragments 260–399 bp |
| long | int | Fragments 400+ bp |
| log_ratio_* | float | Log₂(observed/expected) vs PON |
| zscore_* | float | Z-score vs PON (if PON provided) |
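
Raw counts scale with sequencing depth, so one common preprocessing step is to convert each window to size-class fractions. The sketch below does this for the example record above. The helper `size_class_fractions` is hypothetical and is not a Krewlyzer function.

```python
SIZE_CLASSES = ["ultra_short", "core_short", "mono_nucl", "di_nucl", "long"]

def size_class_fractions(window):
    """Divide each size-class count by the window total; empty windows give 0.0."""
    total = window["total"]
    if total == 0:
        return {name: 0.0 for name in SIZE_CLASSES}
    return {name: window[name] / total for name in SIZE_CLASSES}

# The example record from the FSC block above.
window = {
    "region": "chr1:0-500000",
    "ultra_short": 123, "core_short": 4567, "mono_nucl": 8901,
    "di_nucl": 2345, "long": 678, "total": 16614,
}
fractions = size_class_fractions(window)
print({name: round(value, 3) for name, value in fractions.items()})
```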

FSC Gene (Panel Mode Only)

Gene-level fragment size aggregation. Only present with --assay.

"fsc_gene": [
  {
    "gene": "ATM",
    "n_regions": 62,
    "total_bp": 8432,
    "ultra_short": 1234,
    "core_short": 5678,
    "mono_nucl": 9012,
    "core_short_ratio": 0.282,
    "normalized_depth": 1245.67,
    "z_core_short": -0.45
  }
]

FSC Region (Panel Mode Only)

Per-exon/target fragment size data. More granular than gene-level.

"fsc_region": [
  {
    "chrom": "1",
    "start": 11168235,
    "end": 11168345,
    "gene": "MTOR",
    "region_name": "MTOR_target_02",
    "region_bp": 110,
    "ultra_short": 8.0,
    "core_short": 229.0,
    "mono_nucl": 804.0,
    "di_nucl": 88.0,
    "total": 1129.0,
    "normalized_depth": 1272.71
  }
]

FSC Region E1 (Panel Mode Only)

First exon (E1) per gene. E1 is the promoter-proximal region, which is reported to carry a stronger cancer signal (Helzer et al. 2025).

"fsc_region_e1": [
  {
    "chrom": "14",
    "start": 105238685,
    "end": 105238805,
    "gene": "AKT1",
    "region_name": "exon_AKT1_15a_1",
    "region_bp": 120,
    "ultra_short": 19.0,
    "mono_nucl": 635.0,
    "total": 1082.0,
    "normalized_depth": 3039.77
  }
]

Tip

Use fsc_region_e1 for early cancer detection models where promoter fragmentation changes are a primary signal.


FSC Counts (Raw GC-Binned Counts)

Raw per-GC-bin, per-size-class fragment counts. Source: {sample}.fsc_counts.tsv.

"fsc_counts": [
  {
    "gc_bin": 0.40,
    "len_bin": 150,
    "count": 1234,
    "expected": 1012.5,
    "correction_factor": 1.218
  }
]

Note

fsc_counts contains pre-correction data. The GC bias correction is already applied to fsc, fsr, and fsd values.


FSR (Fragment Size Ratios)

PON-normalized short/long ratios for tumor detection. Split into off_target and on_target.

"fsr": {
  "off_target": [
    {
      "region": "chr1:0-5000000",
      "short_count": 12450,
      "long_count": 8200,
      "total_count": 28614,
      "short_norm": 1.23,
      "long_norm": 0.98,
      "short_long_ratio": 1.26,
      "short_long_log2": 0.33,
      "short_frac": 0.435,
      "long_frac": 0.287
    }
  ],
  "on_target": [ ... ]
}

| Field | Type | Description |
|---|---|---|
| region | string | Genomic window (chr:start-end) |
| short_count | int | Short fragments: ultra_short + core_short (65–149 bp) |
| long_count | int | Long fragments: di_nucl + long (221–400+ bp) |
| total_count | int | Total fragment count |
| short_norm | float | short_count / PON_short_mean |
| long_norm | float | long_count / PON_long_mean |
| short_long_ratio | float | short_norm / long_norm (primary cancer biomarker) |
| short_long_log2 | float | log2(short_long_ratio) (ML-ready signed metric) |
| short_frac | float | short_count / total_count |
| long_frac | float | long_count / total_count |
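
The derived FSR fields follow directly from the formulas in the table. As a worked check, the sketch below recomputes them from the example record. `fsr_derived` is a hypothetical helper, and the rounding is only there to compare against the two- and three-decimal values shown in the example.

```python
import math

def fsr_derived(short_count, long_count, total_count, short_norm, long_norm):
    """Recompute the ratio, log2 ratio, and fractions from counts and PON-normalized values."""
    ratio = short_norm / long_norm
    return {
        "short_long_ratio": ratio,
        "short_long_log2": math.log2(ratio),
        "short_frac": short_count / total_count,
        "long_frac": long_count / total_count,
    }

# Values taken from the example FSR record above.
derived = fsr_derived(12450, 8200, 28614, short_norm=1.23, long_norm=0.98)
print({name: round(value, 3) for name, value in derived.items()})
```

Because short_long_log2 is signed, windows with a short-fragment excess come out positive and windows with a long-fragment excess come out negative.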

FSD (Fragment Size Distribution)

Per-arm size distribution profiles. Split into off_target and on_target.

"fsd": {
  "off_target": {
    "arms": ["1p", "1q", "2p", "..."],
    "size_bins": ["65-69", "70-74", "...", "395-399"],
    "counts": [[123, 456, ...], [...]],
    "total": [12345, 13456, ...]
  },
  "on_target": { ... }
}

| Field | Type | Description |
|---|---|---|
| arms | string[] | Chromosomal arm labels |
| size_bins | string[] | Fragment size range labels (bp) |
| counts | int[][] | Count matrix: arms × size bins |
| total | int[] | Total fragment count per arm |
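
To compare arms with different coverage, the count matrix can be turned into per-arm frequency profiles. The sketch below normalizes each row by its own sum. It uses a made-up miniature block (2 arms × 3 bins), and `normalize_fsd` is a hypothetical helper. Whether the `total` array always equals the row sums is not stated here, so the row sum is used instead.

```python
def normalize_fsd(block):
    """Return {arm: [fraction per size bin]}, normalizing each arm by its row sum."""
    profiles = {}
    for arm, row in zip(block["arms"], block["counts"]):
        row_sum = sum(row)
        profiles[arm] = [count / row_sum if row_sum else 0.0 for count in row]
    return profiles

# Hypothetical miniature block; real output has many more arms and size bins.
fsd_block = {
    "arms": ["1p", "1q"],
    "size_bins": ["65-69", "70-74", "75-79"],
    "counts": [[10, 30, 60], [5, 5, 10]],
    "total": [100, 20],
}
profiles = normalize_fsd(fsd_block)
print(profiles["1q"])
```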

WPS (Windowed Protection Score)

Nucleosome positioning profiles. Stored as columnar arrays (one value per region), not row records.

"wps": {
  "regions": ["ENSG00000142611_TSS", "ENSG00000157764_TSS", "..."],
  "chrom": ["1", "7", "..."],
  "center": [11166302, 140453136, "..."],
  "wps_nuc": [24.5, 18.2, "..."],
  "wps_tf": [-3.2, -1.8, "..."],
  "wps_nuc_smooth": [23.1, 17.9, "..."],
  "wps_tf_smooth": [-3.0, -1.7, "..."],
  "wps_nuc_mean": [22.8, 17.5, "..."],
  "wps_tf_mean": [-2.9, -1.6, "..."],
  "wps_nuc_z": [1.2, 0.8, "..."],
  "wps_tf_z": [-0.4, -0.2, "..."],
  "prot_frac_nuc": [0.62, 0.55, "..."],
  "prot_frac_tf": [0.41, 0.38, "..."]
}

| Field | Type | Description |
|---|---|---|
| regions | string[] | Region IDs (one entry per anchor) |
| wps_nuc | float[] | Nucleosomal WPS (120–180 bp fragments) |
| wps_tf | float[] | TF footprint WPS (35–80 bp fragments) |
| wps_*_smooth | float[] | Savitzky-Golay smoothed profiles |
| wps_*_z | float[] | Z-scores vs PON baseline |
| prot_frac_* | float[] | Protected fraction (values > 0) |

Reconstructing a DataFrame

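import pandas as pd
# `features` is the "features" object of the loaded JSON; all WPS fields are equal-length arrays
wps = features["wps"]
cols = ["wps_nuc", "wps_tf", "wps_nuc_z", "wps_tf_z"]  # extend with other array fields as needed
df = pd.DataFrame({"region_id": wps["regions"], **{c: wps[c] for c in cols}})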

WPS Panel (Panel Mode Only)

Same schema as standard TSV/Parquet WPS output, but stored as row records and filtered to panel gene anchors.

"wps_panel": {
  "n_anchors": 1820,
  "data": [
    {
      "region_id": "ATM_TSS",
      "chrom": "11",
      "center": 108093559,
      "strand": "+",
      "wps_nuc": 22.1,
      "wps_tf": -2.8,
      "prot_frac_nuc": 0.59,
      "prot_frac_tf": 0.38,
      "wps_nuc_z": 0.9,
      "wps_tf_z": -0.3
    }
  ]
}
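
Because wps_panel is stored as row records while wps is columnar, it can help to pivot the panel rows into the same column-oriented shape before combining them. This is a minimal sketch. `records_to_columns` is a hypothetical helper and the second record is invented for illustration.

```python
def records_to_columns(records, fields):
    """Pivot a list of row dicts into {field: [values...]}; missing fields become None."""
    return {field: [record.get(field) for record in records] for field in fields}

wps_panel = {
    "n_anchors": 2,
    "data": [
        # First record is abbreviated from the example above.
        {"region_id": "ATM_TSS", "chrom": "11", "wps_nuc": 22.1, "wps_tf": -2.8},
        # Second record is made up for illustration.
        {"region_id": "GENE2_TSS", "chrom": "7", "wps_nuc": 18.0, "wps_tf": -1.5},
    ],
}
columns = records_to_columns(wps_panel["data"], ["region_id", "wps_nuc", "wps_tf"])
print(columns["region_id"])
```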

WPS Background

Alu element hierarchical stacking profiles (global, family, per-chromosome groups).

"wps_background": {
  "n_elements": 27,
  "data": [
    {
      "group_id": "Global_All",
      "stacked_wps_nuc": [0.12, 0.08, "..."],
      "stacked_wps_tf": [-0.03, -0.01, "..."],
      "alu_count": 142567,
      "mean_wps_nuc": 0.11,
      "mean_wps_tf": -0.02,
      "nrl_bp": 192.4,
      "nrl_deviation_bp": 2.4,
      "periodicity_score": 0.78,
      "adjusted_score": 0.69,
      "fragment_ratio": 0.31
    }
  ]
}

| Field | Type | Description |
|---|---|---|
| group_id | string | Global_All, Family_AluY/S/J/Other, Chr{N}_All |
| stacked_wps_nuc | float[] | 30-bin binned WPS nucleosomal profile |
| nrl_bp | float | Nucleosome Repeat Length in bp (expected ~190 bp) |
| periodicity_score | float | SNR-based quality 0–1 |
| adjusted_score | float | Periodicity score penalized by NRL deviation |

Motif

End motif (EDM) and breakpoint motif (BPM) 4-mer frequencies, plus MDS scores for off-target and on-target.

"motif": {
  "edm": { "AAAA": 0.0042, "AAAC": 0.0039, "...": "..." },
  "bpm": { "AAAA": 0.0038, "AAAC": 0.0041, "...": "..." },
  "edm_1mer": { "A": 0.197, "C": 0.345, "G": 0.321, "T": 0.138 },
  "mds": 0.8234,
  "mds_z": -1.23,
  "edm_on_target": { "AAAA": 0.0051, "...": "..." },
  "bpm_on_target": { "AAAA": 0.0043, "...": "..." },
  "mds_on_target": 0.7918,
  "mds_z_on_target": -0.91
}

| Field | Type | Description |
|---|---|---|
| edm | dict | 256 4-mer frequencies at fragment ends (off-target) |
| bpm | dict | 256 4-mer frequencies at breakpoints (off-target) |
| edm_1mer | dict | Single-base composition at fragment ends (A/C/G/T) |
| mds | float | Motif Diversity Score (off-target) |
| mds_z | float | MDS z-score vs PON (only with --pon-model) |
| edm_on_target | dict | On-target EDM frequencies (panel mode only) |
| bpm_on_target | dict | On-target BPM frequencies (panel mode only) |
| mds_on_target | float | On-target MDS (panel mode only) |
| mds_z_on_target | float | On-target MDS z-score vs PON (panel + --pon-model) |
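
In the cfDNA literature, the Motif Diversity Score is commonly defined as the Shannon entropy of the motif frequencies divided by its maximum, so that 0 means a single motif and 1 means a uniform distribution. The sketch below implements that definition. Whether Krewlyzer uses exactly this normalization is an assumption, and `motif_diversity_score` is a hypothetical helper.

```python
import math
from itertools import product

def motif_diversity_score(freqs, n_motifs=None):
    """Normalized Shannon entropy: 0.0 for a single motif, 1.0 for a uniform distribution."""
    n = n_motifs if n_motifs is not None else len(freqs)
    total = sum(freqs.values())
    if n < 2 or total <= 0:
        return 0.0
    # Renormalize so the result does not depend on frequencies summing exactly to 1.
    probs = [value / total for value in freqs.values() if value > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(n)

# Sanity checks on synthetic inputs (not Krewlyzer output).
all_4mers = ["".join(kmer) for kmer in product("ACGT", repeat=4)]
uniform = {kmer: 1 / 256 for kmer in all_4mers}
single = {kmer: 0.0 for kmer in all_4mers}
single["CCCA"] = 1.0

mds_uniform = motif_diversity_score(uniform)
mds_single = motif_diversity_score(single)
# The 1-mer composition from the example above, scored over 4 possible bases.
mds_1mer = motif_diversity_score({"A": 0.197, "C": 0.345, "G": 0.321, "T": 0.138})
```

If a frequency dict omits zero-count motifs, pass `n_motifs=256` explicitly so the normalization still refers to all possible 4-mers.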

OCF (Orientation-aware cfDNA Fragmentation)

Open chromatin footprint scores by tissue type. Includes off-target, on-target, and positional sync profiles.

"ocf": {
  "off_target": [ { "tissue": "Liver", "score": 0.42, "n_fragments": 12345 } ],
  "on_target":  [ { "tissue": "Liver", "score": 0.51, "n_fragments": 3421 } ],
  "offtarget":  [ { "tissue": "Liver", "score": 0.39, "..." } ],
  "sync":           [ { "pos": -150, "strand_ratio": 0.61, "..." } ],
  "sync_offtarget": [ { ... } ],
  "sync_ontarget":  [ { ... } ]
}

| Sub-key | When present | Description |
|---|---|---|
| off_target | Always | Genome-wide OCF scores |
| on_target | Panel mode | On-target capture region OCF |
| offtarget | Panel mode | Panel-specific off-target scores |
| sync | Always | Positional strand-specific profiles |
| sync_offtarget | Panel mode | Sync profiles for off-target |
| sync_ontarget | Panel mode | Sync profiles for on-target |

TFBS (Transcription Factor Binding Site Entropy)

Fragment size entropy at TFBS regions. Includes sync (per-TF × per-size) profiles.

"tfbs": {
  "off_target": [
    { "region": "CTCF_chr1_12345", "entropy": 3.45, "n_fragments": 234, "mean_size": 167.5 }
  ],
  "on_target": [ ... ],
  "sync": [ { "tf": "CTCF", "size_bin": 150, "count": 234, "fraction": 0.052 } ],
  "sync_ontarget": [ ... ]
}
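
The sync rows are in long format (one row per TF and size bin). For modeling, a per-TF size profile is often more convenient. The sketch below groups rows into a nested mapping. `pivot_sync` is a hypothetical helper, and only the first row comes from the example above; the others are invented.

```python
from collections import defaultdict

def pivot_sync(rows, key="tf"):
    """Group long-format sync rows into {key value: {size_bin: fraction}}."""
    matrix = defaultdict(dict)
    for row in rows:
        matrix[row[key]][row["size_bin"]] = row["fraction"]
    return dict(matrix)

sync_rows = [
    {"tf": "CTCF", "size_bin": 150, "count": 234, "fraction": 0.052},   # from the example
    {"tf": "CTCF", "size_bin": 160, "count": 310, "fraction": 0.069},   # made up
    {"tf": "FOXA1", "size_bin": 150, "count": 120, "fraction": 0.047},  # made up
]
tf_profiles = pivot_sync(sync_rows)
print(tf_profiles["CTCF"])
```

The same function applies to sync rows keyed by another field by passing, for example, `key="tissue"`.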

ATAC (Chromatin Accessibility Regions)

Fragment size entropy at ATAC-seq accessible regions. Includes sync (per-tissue × per-size) profiles.

"atac": {
  "off_target": [
    { "region": "peak_chr1_23456", "entropy": 3.21, "n_fragments": 189, "mean_size": 145.2 }
  ],
  "on_target": [ ... ],
  "sync": [ { "tissue": "BRCA", "size_bin": 150, "count": 189, "fraction": 0.041 } ],
  "sync_ontarget": [ ... ]
}

GC Factors (Diagnostic)

GC bias correction factors used internally during processing.

Note

Not recommended for ML features. The GC correction is already applied to FSC/FSR/FSD. Use those corrected values instead. GC factors are useful for QC, batch effect detection, and panel development.

"gc_factors": {
  "off_target": [
    { "len_bin": 100, "gc_pct": 45, "correction_factor": 1.12 }
  ],
  "on_target": [ ... ]
}

mFSD (Mutant Fragment Size Distribution)

Per-variant fragment size distribution metrics. Only present when a MAF file is provided.

"mfsd": {
  "enabled": true,
  "n_variants": 47,
  "variants": [
    {
      "Chrom": "17", "Pos": 7577548, "Ref": "C", "Alt": "T", "VarType": "SNP",
      "REF_Count": 234, "ALT_Count": 12, "NonREF_Count": 8, "N_Count": 45, "Total_Count": 299,
      "REF_Weighted": 221.4, "ALT_Weighted": 11.3, "NonREF_Weighted": 7.6, "N_Weighted": 42.8, "VAF_GC_Corrected": 0.048,
      "ALT_LLR": 3.41, "REF_LLR": -3.41,
      "REF_MeanSize": 168.3, "ALT_MeanSize": 142.7, "NonREF_MeanSize": 155.1, "N_MeanSize": 171.2,
      "Delta_ALT_REF": -25.6, "KS_ALT_REF": 0.31, "KS_Pval_ALT_REF": 0.003,
      "Delta_ALT_NonREF": -12.4, "KS_ALT_NonREF": 0.18, "KS_Pval_ALT_NonREF": 0.041,
      "Delta_REF_NonREF": 13.2, "KS_REF_NonREF": 0.15, "KS_Pval_REF_NonREF": 0.089,
      "Delta_ALT_N": -28.5, "KS_ALT_N": 0.34, "KS_Pval_ALT_N": 0.001,
      "Delta_REF_N": -3.1, "KS_REF_N": 0.08, "KS_Pval_REF_N": 0.621,
      "Delta_NonREF_N": -16.1, "KS_NonREF_N": 0.19, "KS_Pval_NonREF_N": 0.033,
      "VAF_Proxy": 0.049, "Error_Rate": 0.027, "N_Rate": 0.150, "Size_Ratio": 0.848, "Quality_Score": 0.731,
      "ALT_Confidence": "HIGH", "KS_Valid": true
    }
  ]
}

When no MAF is provided: "mfsd": { "enabled": false }

| Field Group | Columns | Description |
|---|---|---|
| Variant info | Chrom, Pos, Ref, Alt, VarType | Variant coordinates and type |
| Counts | REF_Count, ALT_Count, NonREF_Count, N_Count, Total_Count | Raw fragment counts per allele class |
| GC-Weighted | REF_Weighted, ALT_Weighted, NonREF_Weighted, N_Weighted, VAF_GC_Corrected | GC-bias corrected counts/VAF |
| Log-Likelihood | ALT_LLR, REF_LLR | Log-likelihood ratios (duplex/low-N) |
| Mean Sizes | REF_MeanSize, ALT_MeanSize, NonREF_MeanSize, N_MeanSize | Mean fragment size per allele class |
| KS Tests | Delta_*/KS_*/KS_Pval_* | Kolmogorov–Smirnov distance + p-value (6 pairings) |
| Derived | VAF_Proxy, Error_Rate, N_Rate, Size_Ratio, Quality_Score | Summary biomarker values |
| Flags | ALT_Confidence, KS_Valid | Quality flags |
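
One way to use the quality flags is to keep only variants whose ALT-vs-REF size comparison is both valid and significant. The sketch below shows such a filter. `select_variants` and its thresholds (`alpha`, `min_alt`) are illustrative assumptions, not recommended cut-offs.

```python
def select_variants(mfsd, alpha=0.05, min_alt=5):
    """Keep variants with a valid KS test, enough ALT fragments, and a significant ALT-vs-REF shift."""
    if not mfsd.get("enabled"):
        return []
    return [
        variant for variant in mfsd.get("variants", [])
        if variant.get("KS_Valid")
        and variant.get("ALT_Count", 0) >= min_alt
        and variant.get("KS_Pval_ALT_REF", 1.0) < alpha
    ]

mfsd = {
    "enabled": True,
    "n_variants": 2,
    "variants": [
        # Abbreviated from the example record above.
        {"Chrom": "17", "Pos": 7577548, "ALT_Count": 12,
         "Delta_ALT_REF": -25.6, "KS_Pval_ALT_REF": 0.003, "KS_Valid": True},
        # Made-up record with too few ALT fragments for a valid test.
        {"Chrom": "1", "Pos": 1000, "ALT_Count": 2,
         "Delta_ALT_REF": -1.0, "KS_Pval_ALT_REF": 0.9, "KS_Valid": False},
    ],
}
selected = select_variants(mfsd)
print([(v["Chrom"], v["Pos"], v["Delta_ALT_REF"]) for v in selected])
```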

Region MDS (Per-Exon/Gene Motif Diversity Score)

Per-exon and gene-level MDS from region-mds command. Present when --region-mds is run.

"region_mds": {
  "n_exons": 1820,
  "mds_exon_mean": 0.512,
  "mds_exon_std": 0.041,
  "exon": [
    {
      "gene": "ATM",
      "name": "ATM:exon1",
      "chrom": "11",
      "start": 108093558,
      "end": 108093795,
      "strand": "+",
      "n_fragments": 234,
      "mds": 0.531
    }
  ],
  "n_genes": 146,
  "mds_e1_mean": 0.504,
  "gene": [
    {
      "gene": "ATM",
      "n_exons": 23,
      "n_fragments": 5210,
      "mds_mean": 0.519,
      "mds_e1": 0.531,
      "mds_std": 0.038
    }
  ]
}

| Field | Level | Description |
|---|---|---|
| exon[] | Exon | Per-exon records from {sample}.MDS.exon.tsv |
| exon[].mds | Exon | MDS for this exon/target region |
| exon[].name | Exon | Exon identifier (gene:exonN for WGS, target name for panel) |
| mds_exon_mean | Summary | Mean MDS across all exons |
| mds_exon_std | Summary | Std dev of per-exon MDS |
| gene[] | Gene | Per-gene records from {sample}.MDS.gene.tsv |
| gene[].mds_mean | Gene | Mean MDS across all exons of this gene |
| gene[].mds_e1 | Gene | MDS of the first exon (E1), a promoter proxy |
| mds_e1_mean | Summary | Mean E1 MDS across all genes |
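
A simple derived feature from the gene-level records is the difference between the first-exon MDS and the gene-wide mean. The sketch below computes it per gene. `e1_shift_by_gene` is a hypothetical helper, and using this difference as a feature is an illustration rather than a documented Krewlyzer output.

```python
def e1_shift_by_gene(region_mds):
    """Return {gene: mds_e1 - mds_mean} for gene records that report both values."""
    shifts = {}
    for record in region_mds.get("gene", []):
        e1, mean = record.get("mds_e1"), record.get("mds_mean")
        if e1 is not None and mean is not None:
            shifts[record["gene"]] = e1 - mean
    return shifts

region_mds = {
    "gene": [
        # From the example above.
        {"gene": "ATM", "n_exons": 23, "mds_mean": 0.519, "mds_e1": 0.531},
        # Made-up record with no E1 value; it is skipped.
        {"gene": "GENE2", "n_exons": 4, "mds_mean": 0.500, "mds_e1": None},
    ]
}
shifts = e1_shift_by_gene(region_mds)
print({gene: round(value, 3) for gene, value in shifts.items()})
```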

UXM (Methylation)

CpG methylation features. Only present when methylation BAM input is provided.

"uxm": {
  "enabled": true,
  "data": [
    { "region": "chr1:0-500000", "U_fraction": 0.82, "X_fraction": 0.05, "M_fraction": 0.13 }
  ]
}

When no methylation input: "uxm": { "enabled": false }
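
Since mfsd and uxm report `"enabled": false` when their input is missing, and other blocks may be absent altogether, downstream code benefits from a single guard. This is a minimal sketch; `optional_block` is a hypothetical helper.

```python
def optional_block(features, name):
    """Return the named feature block, or None if it is absent or marked as disabled."""
    block = features.get(name)
    if not isinstance(block, dict) or block.get("enabled") is False:
        return None
    return block

# Hypothetical features object combining the cases described above.
features = {
    "fsc": {"off_target": []},
    "mfsd": {"enabled": False},
    "uxm": {
        "enabled": True,
        "data": [{"region": "chr1:0-500000", "U_fraction": 0.82,
                  "X_fraction": 0.05, "M_fraction": 0.13}],
    },
}

uxm = optional_block(features, "uxm")
if uxm is not None:
    print(f"UXM regions: {len(uxm['data'])}")
```

Blocks without an `enabled` key, such as fsc, are returned unchanged.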


ML Integration Example

import json
import pandas as pd

with open("sample.features.json") as f:
    doc = json.load(f)

# Feature blocks live under the top-level "features" key
features = doc["features"]

# FSC off-target (genome-wide)
df_fsc = pd.DataFrame(features["fsc"]["off_target"])

# WPS — reconstruct DataFrame from columnar arrays
wps = features["wps"]
df_wps = pd.DataFrame({"region_id": wps["regions"], "wps_nuc": wps["wps_nuc"], "wps_tf": wps["wps_tf"]})
print(f"WPS anchors: {len(df_wps)}")

# Region MDS — per-gene E1 MDS
df_mds_gene = pd.DataFrame(features["region_mds"]["gene"])
print(f"Mean E1 MDS: {features['region_mds']['mds_e1_mean']:.3f}")

# mFSD — variant-level fragment size shift
if features.get("mfsd", {}).get("enabled"):
    df_mfsd = pd.DataFrame(features["mfsd"]["variants"])
    print(f"mFSD variants: {features['mfsd']['n_variants']}")

# Motif diversity
mds_z = features["motif"].get("mds_z")  # None unless a PON model was supplied (--pon-model)
print(f"MDS: {features['motif']['mds']:.4f} (z-score: {mds_z})")

JSON Generation with Panel Mode

For MSK-ACCESS panels:

krewlyzer run-all -i sample.bam -r hg19.fa -o out/ \
    --assay xs2 \
    --target-regions targets.bed \
    --pon-model xs2.pon.parquet \
    --generate-json

Panel mode adds these features to the JSON:

- fsc_gene: 146 genes (gene-level FSC)
- fsc_region: per-exon FSC
- fsc_region_e1: first-exon FSC
- wps_panel: 1,820 anchors
- wps_background: Alu stacking profiles
- ocf.on_target, tfbs.on_target, atac.on_target: panel-specific entropy
- PON z-scores across all features
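
Before training on panel features, it can be useful to confirm that a given JSON actually contains the panel-mode additions listed above. The sketch below checks for each key, including nested ones. `missing_panel_features` and the dotted-path convention are assumptions made for this example.

```python
# Panel-mode keys named above, written as dotted paths inside the "features" object.
PANEL_PATHS = [
    "fsc_gene", "fsc_region", "fsc_region_e1", "wps_panel", "wps_background",
    "ocf.on_target", "tfbs.on_target", "atac.on_target",
]

def missing_panel_features(features, paths=PANEL_PATHS):
    """Return the dotted paths that cannot be resolved in the features object."""
    missing = []
    for path in paths:
        node = features
        for part in path.split("."):
            if not isinstance(node, dict) or part not in node:
                missing.append(path)
                break
            node = node[part]
    return missing

# Hypothetical features object from a run without full panel output.
features = {
    "fsc_gene": [], "fsc_region": [], "fsc_region_e1": [],
    "wps_panel": {"n_anchors": 0, "data": []},
    "ocf": {"off_target": [], "on_target": []},
    "tfbs": {"off_target": []},
}
missing = missing_panel_features(features)
print(missing)
```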


Output File Structure

Krewlyzer generates TSV/Parquet files alongside the optional unified JSON:

Core Fragmentomics

out/
├── sample.FSD.tsv                   # Fragment size distribution (arm-level)
├── sample.FSD.ontarget.tsv          # Panel mode: on-target FSD
├── sample.FSR.tsv                   # Fragment size ratio (short/long)
├── sample.FSR.ontarget.tsv          # Panel mode: on-target FSR
├── sample.FSC.tsv                   # Fragment size coverage (bin-level)
├── sample.FSC.ontarget.tsv          # Panel mode: on-target FSC
├── sample.FSC.gene.tsv              # Gene-level FSC (with --assay)
├── sample.FSC.regions.tsv           # Exon-level FSC (aggregate_by='region')
├── sample.FSC.regions.e1only.tsv    # E1-only FSC (first exon per gene)
├── sample.fsc_counts.tsv            # Raw GC-binned fragment counts (pre-correction)
└── sample.correction_factors.tsv    # GC correction factors

WPS (Windowed Protection Score)

out/
├── sample.WPS.parquet               # Per-region WPS profiles (foreground)
├── sample.WPS.panel.parquet         # Panel-specific anchors (with --assay)
└── sample.WPS_background.parquet    # Alu stacking profiles (background)

Motif & Tissue-of-Origin

out/
├── sample.EndMotif.tsv              # 4-mer end motif frequencies (off-target)
├── sample.EndMotif.ontarget.tsv     # 4-mer end motif frequencies (on-target)
├── sample.EndMotif1mer.tsv          # Single-base end compositions (A/C/G/T)
├── sample.BreakPointMotif.tsv       # 4-mer breakpoint motif frequencies
├── sample.BreakPointMotif.ontarget.tsv
├── sample.MDS.tsv                   # Motif Diversity Score (off-target)
├── sample.MDS.ontarget.tsv          # Motif Diversity Score (on-target)
├── sample.MDS.exon.tsv              # Per-exon MDS (region-mds command)
├── sample.MDS.gene.tsv              # Per-gene aggregated MDS (region-mds command)
├── sample.OCF.tsv                   # Orientation-aware fragmentation
├── sample.OCF.ontarget.tsv          # Panel mode: on-target OCF
├── sample.OCF.offtarget.tsv         # Panel mode: off-target OCF
└── sample.OCF.sync.tsv              # Positional strand-specific profiles

Region Entropy (TFBS/ATAC)

out/
├── sample.TFBS.tsv                  # TF binding site entropy (808 factors)
├── sample.TFBS.ontarget.tsv         # Panel mode: on-target TFBS
├── sample.TFBS.sync.tsv             # Per-TF × per-size profiles
├── sample.TFBS.ontarget.sync.tsv
├── sample.ATAC.tsv                  # ATAC-seq peak entropy (23 cancer types)
├── sample.ATAC.ontarget.tsv         # Panel mode: on-target ATAC
├── sample.ATAC.sync.tsv             # Per-tissue × per-size profiles
└── sample.ATAC.ontarget.sync.tsv

Variant (mFSD)

out/
├── sample.mFSD.tsv                  # Per-variant fragment size metrics (46 columns)
└── sample.mFSD.distributions.tsv    # Per-variant size histograms (optional)

Methylation (UXM)

out/
└── sample.UXM.tsv                   # CpG methylation U/X/M fractions

Unified Output

out/
├── sample.metadata.tsv              # Run metadata and QC metrics (tabular)
└── sample.features.json             # All features (with --generate-json)

Note

The --generate-json flag produces the unified JSON in addition to the standard TSV/Parquet outputs.


See Also