Unified JSON Output

The --generate-json flag produces a single JSON file containing all features for ML integration.

Enabling JSON Output

krewlyzer run-all -i sample.bam -r hg19.fa -o out/ \
    --generate-json

This generates {sample}.features.json alongside the standard TSV/Parquet outputs.

Output Structure

{
  "sample_id": "sample_001",
  "metadata": {
    "genome": "hg19",
    "assay": "xs2",
    "panel_mode": true,
    "on_target_rate": 0.45,
    "timestamp": "2024-01-20T00:00:00"
  },
  "fsc": { ... },
  "fsc_gene": { ... },
  "fsc_region": { ... },
  "fsc_region_e1": { ... },
  "fsr": { ... },
  "fsd": { ... },
  "wps": { ... },
  "wps_panel": { ... },
  "wps_background": { ... },
  "motif": { ... },
  "ocf": { ... },
  "tfbs": { ... },
  "atac": { ... },
  "gc_factors": { ... }
}
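Since the panel-only sections (fsc_gene, fsc_region, fsc_region_e1, wps_panel) are absent outside panel mode, it can help to check which top-level keys a file actually contains before reading them. A minimal sketch, assuming the key names listed above; the helper name present_sections is illustrative and not part of krewlyzer:

```python
import json

# Top-level feature keys as listed in the output structure above.
FEATURE_KEYS = [
    "fsc", "fsc_gene", "fsc_region", "fsc_region_e1", "fsr", "fsd",
    "wps", "wps_panel", "wps_background", "motif", "ocf", "tfbs",
    "atac", "gc_factors",
]

def present_sections(features: dict) -> list:
    """Return the feature sections present in a loaded features dict."""
    return [key for key in FEATURE_KEYS if key in features]

# Stand-in for json.load() on a real {sample}.features.json file.
example = json.loads('{"sample_id": "sample_001", "fsc": {}, "motif": {}}')
print(present_sections(example))
```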

Feature Schemas

FSC (Fragment Size Coverage)

Window-based fragment size counts with z-scores.

"fsc": {
  "n_windows": 2534,
  "data": [
    {
      "region": "chr1:0-500000",
      "ultra_short": 123,
      "core_short": 4567,
      "mono_nucl": 8901,
      "di_nucl": 2345,
      "long": 678,
      "total": 16614,
      "log_ratio_core_short": -0.15,
      "log_ratio_mono_nucl": 0.02,
      "zscore_core_short": -1.23
    }
  ]
}

Field	Type	Description
`region`	string	Genomic window coordinates
`ultra_short`	int	Fragments 65-99bp
`core_short`	int	Fragments 100-149bp
`mono_nucl`	int	Fragments 150-259bp
`di_nucl`	int	Fragments 260-399bp
`long`	int	Fragments 400+bp
`log_ratio_*`	float	Log2(observed/expected) vs PON
`zscore_*`	float	Z-score vs PON (if PON provided)
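The two PON-derived fields follow standard formulas. The sketch below shows a generic log2 ratio and z-score; how krewlyzer derives the expected value, mean, and standard deviation from the PON, and which quantity it standardizes, is not specified here, so the inputs are hypothetical:

```python
import math

def log2_ratio(observed: float, expected: float) -> float:
    """Log2 fold change of an observed count relative to an expected count."""
    return math.log2(observed / expected)

def zscore(value: float, pon_mean: float, pon_std: float) -> float:
    """Standard z-score of a value against a reference mean and std."""
    return (value - pon_mean) / pon_std

# Hypothetical numbers: half as many fragments as the PON expects.
ratio = log2_ratio(50, 100)
z = zscore(ratio, pon_mean=0.0, pon_std=0.5)
print(ratio, z)
```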

FSC Gene (Panel Mode Only)

Gene-level fragment size aggregation. Only present with --assay.

"fsc_gene": [
  {
    "gene": "ATM",
    "n_regions": 62,
    "total_bp": 8432,
    "ultra_short": 1234,
    "core_short": 5678,
    "mono_nucl": 9012,
    "core_short_ratio": 0.282,
    "normalized_depth": 1245.67,
    "z_core_short": -0.45
  }
]

FSC Region (Panel Mode Only)

Per-exon/target fragment size data. More granular than gene-level.

"fsc_region": [
  {
    "chrom": "1",
    "start": 11168235,
    "end": 11168345,
    "gene": "MTOR",
    "region_name": "MTOR_target_02",
    "region_bp": 110,
    "ultra_short": 8.0,
    "core_short": 229.0,
    "mono_nucl": 804.0,
    "di_nucl": 88.0,
    "total": 1129.0,
    "normalized_depth": 1272.71
  }
]

Field	Type	Description
`region_name`	string	Unique exon/target identifier
`normalized_depth`	float	RPKM-like depth: (count × 10⁹) / (bp × total_frags)
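The normalized_depth formula can be written out directly. This sketch assumes that count is the region's total fragment count and that total_frags is the library-wide fragment count; neither is confirmed by the table above, and the inputs are made up:

```python
def normalized_depth(count: float, region_bp: int, total_frags: int) -> float:
    """RPKM-like depth: (count * 1e9) / (region_bp * total_frags)."""
    return (count * 1e9) / (region_bp * total_frags)

# Hypothetical inputs: 1,000 fragments over a 100 bp target
# in a library of 10 million fragments.
depth = normalized_depth(1000, 100, 10_000_000)
print(depth)
```

Scaling by region length and library size is what makes values comparable across targets of different sizes and samples of different depth.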

FSC Region E1 (Panel Mode Only)

First exon (E1) per gene, filtered from fsc_region. E1 serves as a promoter-proximal proxy with stronger cancer signal (Helzer et al. 2025).

"fsc_region_e1": [
  {
    "chrom": "14",
    "start": 105238685,
    "end": 105238805,
    "gene": "AKT1",
    "region_name": "exon_AKT1_15a_1",
    "region_bp": 120,
    "ultra_short": 19.0,
    "mono_nucl": 635.0,
    "total": 1082.0,
    "normalized_depth": 3039.77
  }
]

Tip

Use fsc_region_e1 for early cancer detection models where promoter fragmentation changes are a primary signal.
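Because fsc_region_e1 holds one first-exon record per gene, it maps naturally onto one feature per gene. A minimal sketch, assuming the field names shown above; e1_depth_by_gene is an illustrative helper, not a krewlyzer function:

```python
def e1_depth_by_gene(features: dict) -> dict:
    """Map each gene to the normalized depth of its first exon (E1)."""
    return {
        row["gene"]: row["normalized_depth"]
        for row in features.get("fsc_region_e1", [])
    }

# Toy input mirroring the AKT1 record above.
features = {
    "fsc_region_e1": [
        {"gene": "AKT1", "region_name": "exon_AKT1_15a_1", "normalized_depth": 3039.77}
    ]
}
print(e1_depth_by_gene(features))
```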

FSR (Fragment Size Ratios)

Biomarker ratios for tumor detection.

"fsr": {
  "n_windows": 2534,
  "data": [
    {
      "region": "chr1:0-500000",
      "ultra_short_ratio": 0.0074,
      "core_short_ratio": 0.275,
      "mono_nucl_ratio": 0.536,
      "di_nucl_ratio": 0.141,
      "long_ratio": 0.041,
      "core_short_long_ratio": 6.73
    }
  ]
}

Field	Type	Description
`*_ratio`	float	Fraction of total fragments
`core_short_long_ratio`	float	Ratio of core_short to long fragments; primary cancer biomarker (higher = more tumor)
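The FSR example values are consistent with simple ratios of the FSC counts for the same window (chr1:0-500000). The sketch below reproduces them under that assumption; the relationship is inferred from the example numbers, not from krewlyzer's source, and size_ratios is an illustrative name:

```python
SIZE_CLASSES = ["ultra_short", "core_short", "mono_nucl", "di_nucl", "long"]

def size_ratios(counts: dict) -> dict:
    """Fraction of fragments per size class, plus core_short / long."""
    total = sum(counts[name] for name in SIZE_CLASSES)
    ratios = {f"{name}_ratio": counts[name] / total for name in SIZE_CLASSES}
    ratios["core_short_long_ratio"] = counts["core_short"] / counts["long"]
    return ratios

# Counts taken from the FSC example window.
fsc_window = {
    "ultra_short": 123, "core_short": 4567, "mono_nucl": 8901,
    "di_nucl": 2345, "long": 678,
}
ratios = size_ratios(fsc_window)
for name, value in ratios.items():
    print(name, value)
```

The results agree with the documented example values to within rounding.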

FSD (Fragment Size Distribution)

Per-arm size distribution profiles.

"fsd": {
  "arms": ["1p", "1q", "2p", ...],
  "size_bins": [65, 70, 75, ..., 395, 400],
  "data": {
    "1p": {
      "counts": [123, 456, 789, ...],
      "proportions": [0.001, 0.004, 0.007, ...]
    }
  }
}
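For modelling, the per-arm profiles are usually stacked into an arms × bins matrix. A minimal sketch, assuming proportions are each arm's counts divided by that arm's total (consistent with the example above, but not confirmed); the toy input omits size_bins:

```python
def to_proportions(counts: list) -> list:
    """Normalize counts so they sum to 1 (all zeros if the arm is empty)."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [c / total for c in counts]

def fsd_matrix(fsd: dict) -> list:
    """One row of proportions per arm, in the order given by fsd['arms']."""
    return [to_proportions(fsd["data"][arm]["counts"]) for arm in fsd["arms"]]

# Toy FSD section with two arms and three size bins.
fsd = {
    "arms": ["1p", "1q"],
    "data": {"1p": {"counts": [1, 1, 2]}, "1q": {"counts": [0, 3, 1]}},
}
matrix = fsd_matrix(fsd)
print(matrix)
```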

WPS (Windowed Protection Score)

Nucleosome positioning profiles around gene TSS/CTCF sites.

"wps": {
  "n_anchors": 15234,
  "columns": ["region_id", "chrom", "start", "end", 
              "wps_nuc_mean", "wps_tf_mean", "prot_frac_nuc", "prot_frac_tf",
              "wps_nuc_z", "wps_tf_z", "ndr_depth"],
  "data": [
    {
      "region_id": "ENSG00000142611_TSS",
      "chrom": "chr1",
      "start": 11166102,
      "end": 11166502,
      "wps_nuc_mean": 24.5,
      "wps_tf_mean": -3.2,
      "prot_frac_nuc": 0.62,
      "prot_frac_tf": 0.41,
      "wps_nuc_z": 1.2,
      "ndr_depth": 15.3
    }
  ]
}

Field	Type	Description
`wps_nuc_mean`	float	Mean nucleosomal WPS (120-180bp fragments)
`wps_tf_mean`	float	Mean TF footprint WPS (35-80bp fragments)
`prot_frac_*`	float	Protected fraction (values > 0)
`wps_*_z`	float	Z-score vs PON
`ndr_depth`	float	Nucleosome-depleted region depth
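As background for these fields, the sketch below follows the usual WPS definition: fragments spanning a window centered on a position count +1, and fragments with an endpoint inside the window count -1. It assumes fragments are already filtered to the relevant size range and uses a 120 bp window for the nucleosomal score; krewlyzer's actual implementation may differ, and both function names are illustrative:

```python
def wps_at(pos: int, fragments: list, window: int = 120) -> int:
    """WPS at one position from (start, end) fragments, inclusive coordinates."""
    half = window // 2
    w_start, w_end = pos - half, pos + half
    score = 0
    for start, end in fragments:
        if start <= w_start and end >= w_end:
            score += 1  # fragment fully spans the window: protection
        elif w_start <= start <= w_end or w_start <= end <= w_end:
            score -= 1  # fragment ends inside the window: cleavage
    return score

def protected_fraction(wps_values: list) -> float:
    """Fraction of positions whose WPS is strictly positive."""
    if not wps_values:
        return 0.0
    return sum(1 for v in wps_values if v > 0) / len(wps_values)

# Three hypothetical fragments of roughly 160 bp around position 200.
fragments = [(130, 290), (135, 300), (200, 360)]
score = wps_at(200, fragments)
frac = protected_fraction([1, -2, 0, 3])
print(score, frac)
```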

WPS Panel (Panel Mode Only)

Same schema as wps, but filtered to panel gene anchors.

"wps_panel": {
  "n_anchors": 1820,
  "data": [ ... ]
}

WPS Background

Alu element stacking scores for global fragmentation.

"wps_background": {
  "n_elements": 142567,
  "data": [
    {
      "region_id": "AluSx_chr1_12345",
      "stacking_score": 0.78,
      "coverage": 45.2
    }
  ]
}

Motif

End motif (EDM) k-mer frequencies and diversity score.

"motif": {
  "end_motif": {
    "AAAA": 0.0042,
    "AAAC": 0.0039,
    ...
  },
  "breakpoint_motif": {
    "AAAA": 0.0038,
    ...
  },
  "mds": 0.8234,
  "mds_z": -1.23
}

Field	Type	Description
`end_motif`	dict	256 4-mer frequencies (fragment ends)
`breakpoint_motif`	dict	256 4-mer frequencies (breakpoints)
`mds`	float	Motif Diversity Score
`mds_z`	float	MDS z-score vs PON (if PON with MDS baseline)
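The Motif Diversity Score is commonly defined as the Shannon entropy of the 256 end-motif frequencies, normalized by log(256) so that it ranges from 0 (one motif only) to 1 (all motifs equally frequent). The sketch below uses that definition; whether krewlyzer applies exactly this normalization is an assumption:

```python
import math
from itertools import product

def motif_diversity_score(freqs: dict, n_motifs: int = 256) -> float:
    """Normalized Shannon entropy of motif frequencies (0 to 1)."""
    entropy = -sum(p * math.log(p) for p in freqs.values() if p > 0)
    return entropy / math.log(n_motifs)

# Uniform distribution over all 256 possible 4-mers.
all_4mers = ["".join(kmer) for kmer in product("ACGT", repeat=4)]
uniform = {kmer: 1 / 256 for kmer in all_4mers}
mds_uniform = motif_diversity_score(uniform)

# Degenerate case: every fragment ends with the same motif.
mds_single = motif_diversity_score({"CCCA": 1.0})
print(mds_uniform, mds_single)
```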

OCF (Orientation-aware cfDNA Fragmentation)

Open chromatin footprint scores by tissue type.

"ocf": {
  "tissues": ["Liver", "Lung", "Colon", "Placenta", ...],
  "scores": {
    "Liver": 0.42,
    "Lung": 0.31,
    "Colon": 0.28
  }
}

TFBS (Transcription Factor Binding Site Entropy)

Fragment size entropy at TFBS regions.

"tfbs": {
  "off_target": [
    {
      "region": "CTCF_chr1_12345",
      "entropy": 3.45,
      "n_fragments": 234,
      "mean_size": 167.5
    }
  ],
  "on_target": [ ... ]
}
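Fragment size entropy can be illustrated with plain Shannon entropy over the sizes observed in a region. This is a sketch only: the log base (2 here) and the use of exact sizes rather than binned sizes are assumptions, and size_entropy is not a krewlyzer function:

```python
import math
from collections import Counter

def size_entropy(fragment_sizes: list) -> float:
    """Shannon entropy (in bits) of the fragment size distribution."""
    n = len(fragment_sizes)
    if n == 0:
        return 0.0
    counts = Counter(fragment_sizes)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Two equally frequent sizes carry exactly one bit of entropy.
two_sizes = size_entropy([150, 150, 167, 167])
# A single repeated size carries none.
one_size = size_entropy([167, 167, 167])
print(two_sizes, one_size)
```

Higher entropy means a broader, less nucleosome-dominated size distribution in that region.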

ATAC (Chromatin Accessibility Regions)

Fragment size entropy at ATAC-seq accessible regions.

"atac": {
  "off_target": [
    {
      "region": "peak_chr1_23456",
      "entropy": 3.21,
      "n_fragments": 189,
      "mean_size": 145.2
    }
  ],
  "on_target": [ ... ]
}

GC Factors (Diagnostic)

GC bias correction factors used internally during processing.

Note

Not recommended for ML features. These are intermediate diagnostic data, not predictive features. The GC correction is already applied to FSC/FSR/FSD counts. Use those corrected values instead.

"gc_factors": {
  "off_target": [
    {
      "len_bin": 100,
      "gc_pct": 45,
      "correction_factor": 1.12
    }
  ],
  "on_target": [ ... ]
}

When to use GC factors:
- QC/Diagnostics: Visualize library prep bias, capture efficiency
- Batch Effect Detection: Compare correction factors across runs
- Panel Development: Characterize probe GC performance

When NOT to use:
- ML models: Skip these; use GC-corrected FSC/FSR/FSD instead
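For the QC use case, one simple check is to flag bins whose correction factor is far from 1. The sketch below assumes that a factor of 1.0 means no correction and uses an arbitrary tolerance of 0.25; both are assumptions, and all rows after the first are invented for illustration:

```python
def extreme_gc_bins(rows: list, tolerance: float = 0.25) -> list:
    """Return (len_bin, gc_pct, factor) for bins far from a factor of 1.0."""
    return [
        (r["len_bin"], r["gc_pct"], r["correction_factor"])
        for r in rows
        if abs(r["correction_factor"] - 1.0) > tolerance
    ]

# First row matches the example above; the others are hypothetical.
rows = [
    {"len_bin": 100, "gc_pct": 45, "correction_factor": 1.12},
    {"len_bin": 100, "gc_pct": 70, "correction_factor": 1.85},
    {"len_bin": 160, "gc_pct": 20, "correction_factor": 0.60},
]
flagged = extreme_gc_bins(rows)
print(flagged)
```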

ML Integration Example

import json
import pandas as pd

# Load features
with open("sample.features.json") as f:
    features = json.load(f)

# Extract FSC gene-level for panel analysis
if "fsc_gene" in features:
    df_genes = pd.DataFrame(features["fsc_gene"])
    print(f"Gene FSC: {len(df_genes)} genes")

# Extract WPS for nucleosome signature
wps_data = pd.DataFrame(features["wps"]["data"])
print(f"WPS anchors: {len(wps_data)}")

# Use motif MDS z-score as feature (absent unless the PON has an MDS baseline)
mds_z = features["motif"].get("mds_z")
if mds_z is not None:
    print(f"MDS z-score: {mds_z:.2f}")
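When combining many samples, each file is typically flattened into one row of scalar features. A minimal sketch using only the standard library; the column naming scheme and the choice of features are illustrative, and missing sections are tolerated so that panel and non-panel samples share one code path:

```python
def sample_row(features: dict) -> dict:
    """Flatten selected scalar features of one sample into a single row."""
    row = {"sample_id": features["sample_id"]}
    motif = features.get("motif", {})
    row["mds"] = motif.get("mds")
    row["mds_z"] = motif.get("mds_z")  # None when no PON MDS baseline exists
    for tissue, score in features.get("ocf", {}).get("scores", {}).items():
        row[f"ocf_{tissue}"] = score
    for window in features.get("fsr", {}).get("data", []):
        row[f"fsr_csl_{window['region']}"] = window["core_short_long_ratio"]
    return row

# Toy sample built from the example values in this document.
features = {
    "sample_id": "sample_001",
    "motif": {"mds": 0.8234},
    "ocf": {"scores": {"Liver": 0.42}},
    "fsr": {"data": [{"region": "chr1:0-500000", "core_short_long_ratio": 6.73}]},
}
row = sample_row(features)
print(row)
```

A list of such rows can be passed directly to pd.DataFrame to build a samples × features table.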

JSON Generation with Panel Mode

For MSK-ACCESS panels:

krewlyzer run-all -i sample.bam -r hg19.fa -o out/ \
    --assay xs2 \
    --target-regions targets.bed \
    --pon-model xs2.pon.parquet \
    --generate-json

This produces JSON with all panel-specific features:
- fsc_gene: 146 genes
- wps_panel: 1,820 anchors
- wps_background: Alu stacking
- PON z-scores across all features

Output File Structure

Krewlyzer generates TSV/Parquet files alongside the optional unified JSON:

Core Fragmentomics

out/
├── sample.FSD.tsv                   # Fragment size distribution (arm-level)
├── sample.FSD.ontarget.tsv          # Panel mode: on-target FSD
├── sample.FSR.tsv                   # Fragment size ratio (short/long)
├── sample.FSR.ontarget.tsv          # Panel mode: on-target FSR
├── sample.FSC.tsv                   # Fragment size coverage (bin-level)
├── sample.FSC.ontarget.tsv          # Panel mode: on-target FSC
├── sample.FSC.gene.tsv              # Gene-level FSC (with --assay)
├── sample.FSC.regions.tsv           # Exon-level FSC (aggregate_by='region')
├── sample.FSC.regions.e1only.tsv    # E1-only FSC (first exon per gene)
└── sample.correction_factors.tsv    # GC correction factors

WPS (Windowed Protection Score)

out/
├── sample.WPS.parquet               # Per-region WPS profiles (foreground)
├── sample.WPS.panel.parquet         # Panel-specific anchors (with --assay)
└── sample.WPS_background.parquet    # Alu stacking profiles (background)

Motif & Tissue-of-Origin

out/
├── sample.EndMotif.tsv              # 4-mer end motif frequencies
├── sample.MDS.tsv                   # Motif diversity score
├── sample.OCF.tsv                   # Orientation-aware fragmentation
├── sample.OCF.ontarget.tsv          # Panel mode: on-target OCF
└── sample.OCF.sync.tsv              # OCF sync scores

Region Entropy (TFBS/ATAC)

out/
├── sample.TFBS.tsv                  # TF binding site entropy (808 factors)
├── sample.TFBS.ontarget.tsv         # Panel mode: on-target TFBS
├── sample.ATAC.tsv                  # ATAC-seq peak entropy (23 cancer types)
└── sample.ATAC.ontarget.tsv         # Panel mode: on-target ATAC

Unified Output

out/
├── sample.metadata.json             # Run metadata and QC metrics
└── sample.features.json             # All features (with --generate-json)

Note

The --generate-json flag produces the unified JSON in addition to the standard TSV/Parquet outputs.
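After a run, a quick check that the expected files exist can catch incomplete output early. The sketch below uses file names from the listings above; the particular subset checked is an assumption, since which files are always produced depends on the options used, and missing_outputs is an illustrative helper:

```python
import tempfile
from pathlib import Path

# Subset of outputs to check; adjust to the options used for the run.
EXPECTED_SUFFIXES = ["FSD.tsv", "FSR.tsv", "FSC.tsv", "metadata.json", "features.json"]

def missing_outputs(out_dir, sample: str, suffixes=EXPECTED_SUFFIXES) -> list:
    """Return the expected file names that are not present in out_dir."""
    out = Path(out_dir)
    return [
        f"{sample}.{suffix}"
        for suffix in suffixes
        if not (out / f"{sample}.{suffix}").is_file()
    ]

# Demonstration on a temporary directory holding only three TSV files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("sample.FSD.tsv", "sample.FSR.tsv", "sample.FSC.tsv"):
        (Path(tmp) / name).touch()
    missing = missing_outputs(tmp, "sample")
print(missing)
```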
