Region Entropy (TFBS/ATAC Size Entropy)

Command: krewlyzer region-entropy

Plain English

Region Entropy calculates the diversity of fragment sizes at regulatory regions. A high entropy value indicates many different fragment sizes; low entropy indicates uniform sizes.

Use case: cancer detection and subtyping. Tumor cfDNA shows altered nucleosome positioning at specific regulatory elements.

Note

This feature is based on Helzer KT, et al. (2025) "Analysis of cfDNA fragmentomics metrics and commercial targeted sequencing panels" published in Nature Communications.

Purpose

Calculates Shannon entropy of fragment size distributions at:

- TFBS: Transcription Factor Binding Sites (808 factors from GTRD)
- ATAC: Cancer-specific ATAC-seq peaks (23 cancer types from TCGA)

These metrics enable cancer phenotyping from targeted sequencing panels without requiring whole genome sequencing.

Scientific Background

From Helzer et al. 2025

Fragmentomics-based analysis of cell-free DNA (cfDNA) has emerged as a method to infer epigenetic and transcriptional data. While many reports analyze whole genome sequencing (WGS), targeted exon panels can be similarly employed for cancer phenotyping with minimal decrease in performance despite their smaller genomic coverage.

The study assessed 13 fragmentomics metrics including:

- Fragment length proportions (small fragments, Shannon entropy)
- Normalized fragment read depth
- End motif diversity score (MDS)
- TFBS entropy: fragments overlapping transcription factor binding sites
- ATAC entropy: fragments overlapping cancer-specific open chromatin regions

Key findings relevant to TFBS/ATAC entropy:

- Diversity metrics like Shannon entropy measure the spread of fragment sizes in a region
- TFBS and ATAC entropy work well for cancer detection and subtyping
- These metrics can be applied to commercial targeted sequencing panels

Biological Mechanism

Fragment size distributions at regulatory regions reflect nucleosome positioning:

Nucleosome-bound DNA: ~147bp core + ~20bp linker = ~167bp
Open chromatin (active TF binding): Variable sizes due to transcription factor binding
Tumor alterations: Aberrant nucleosome positioning → altered size distributions

Cancer cells exhibit:

- Epigenetic dysregulation → Changed TFBS accessibility
- Altered enhancer usage → Different ATAC peak patterns
- Tissue-specific signatures → Cancer type identification

Processing Flowchart

flowchart LR
    BED["sample.bed.gz"] --> RUST["Rust Backend<br/>(par_iter)"]
    TFBS["TFBS regions"] --> RUST
    ATAC["ATAC regions"] --> RUST

    RUST --> ENTROPY["Entropy Calculation"]

    subgraph "Per Region Label (Parallel)"
        ENTROPY --> COUNT["Fragment count"]
        ENTROPY --> SIZES["Size distribution"]
        SIZES --> SHANNON["Shannon Entropy"]
    end

    SHANNON --> TSV["TFBS.tsv / ATAC.tsv"]
    TSV --> PON["PON Z-score"]


Performance

Region-level parallelization via Rayon par_iter() enables efficient multi-core processing of TFBS/ATAC regions.

Shannon Entropy Formula

\[ H = -\sum_{i} p_i \log_2(p_i) \]

Where:

- \(p_i\) = Proportion of fragments with size \(i\)
- High entropy = Many equally represented sizes (diverse)
- Low entropy = One dominant size (uniform)

As described in Helzer et al.: "Shannon entropy was calculated on the frequency of the fragment lengths... This yielded a single entropy value for each [TF/cancer type] in each sample."
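The formula can be checked by hand on small inputs. The following is a minimal Python sketch, not the tool's Rust implementation; the function name `shannon_entropy` and the choice to return 0 for empty input are assumptions made for illustration.

```python
import math
from collections import Counter


def shannon_entropy(fragment_sizes):
    """Return the Shannon entropy (in bits) of a list of fragment sizes."""
    counts = Counter(fragment_sizes)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # assumption: no fragments is reported as 0 bits
    entropy = 0.0
    for n in counts.values():
        p = n / total                # p_i: proportion of fragments with this size
        entropy -= p * math.log2(p)  # accumulate -p_i * log2(p_i)
    return entropy


# One dominant size versus four equally represented sizes
uniform = shannon_entropy([167] * 100)
diverse = shannon_entropy([150, 160, 167, 180] * 25)
print(uniform, diverse)
```

A single size gives 0 bits, and four equally frequent sizes give log2(4) = 2 bits, matching the "uniform" and "diverse" cases described above.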

Usage

# Basic usage (computes both TFBS and ATAC)
krewlyzer region-entropy -i sample.bed.gz -o output_dir/

# TFBS only
krewlyzer region-entropy -i sample.bed.gz -o output/ --no-atac

# ATAC only with PON normalization
krewlyzer region-entropy -i sample.bed.gz -o output/ \
    --no-tfbs --pon-model healthy.pon.parquet

# Via run-all (automatic when assets available)
krewlyzer run-all -i sample.bam -r hg19.fa -o output/

CLI Options

Option	Short	Type	Default	Description
`--input`	`-i`	PATH	required	Input .bed.gz file (from extract)
`--output`	`-o`	PATH	required	Output directory
`--sample-name`	`-s`	TEXT		Override sample name
`--tfbs/--no-tfbs`		FLAG	`--tfbs`	Enable/disable TFBS entropy
`--atac/--no-atac`		FLAG	`--atac`	Enable/disable ATAC entropy
`--tfbs-regions`		PATH		Custom TFBS regions BED.gz
`--atac-regions`		PATH		Custom ATAC regions BED.gz
`--genome`	`-G`	TEXT	hg19	Genome build (hg19/GRCh37/hg38/GRCh38)
`--gc-factors`	`-F`	PATH		GC correction factors TSV
`--pon-model`	`-P`	PATH		PON model for z-score normalization
`--pon-variant`		TEXT	all_unique	PON variant: `all_unique` or `duplex`
`--skip-pon`		FLAG		Skip PON z-score normalization (for ML negatives)
`--target-regions`	`-T`	PATH		Target regions BED (panel mode: generates .ontarget.tsv)
`--skip-target-regions`		FLAG		Force WGS mode (ignore bundled targets from --assay)
`--threads`	`-t`	INT	0	Number of threads (0 = all cores)
`--verbose`	`-v`	FLAG		Enable verbose logging

Assets

TFBS Regions

Source: Gene Transcription Regulation Database (GTRD) v19.10

As described in Helzer et al.: "A collection of consensus Homo sapiens TFBSs was downloaded from the Gene Transcription Regulation Database (GTRD, v19.10). For each TF, the top 5000 sites with the greatest amount of experimental support were used for analysis. TFs with fewer than 5000 sites were discarded, leaving a total of 808 TFs used for the analysis."

Genome	File	TF Count	Sites per TF
GRCh37	`TFBS.GRCh37.bed.gz`	808	5,000
GRCh38	`TFBS.GRCh38.bed.gz`	808	5,000
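The bundled files already reflect the selection rule quoted from Helzer et al. To show what that rule does, here is a rough Python sketch. The `support` field is a hypothetical numeric stand-in for "amount of experimental support", and `select_top_sites` is an illustrative name, not part of krewlyzer or GTRD.

```python
from collections import defaultdict

SITES_PER_TF = 5000  # threshold from the Helzer et al. description quoted above


def select_top_sites(sites, per_tf=SITES_PER_TF):
    """Keep the best-supported `per_tf` sites for each TF.

    sites: iterable of (chrom, start, end, tf, support) tuples, where
    `support` is a hypothetical score (higher = more experimental support).
    """
    by_tf = defaultdict(list)
    for site in sites:
        by_tf[site[3]].append(site)

    selected = {}
    for tf, tf_sites in by_tf.items():
        if len(tf_sites) < per_tf:
            continue  # TFs with fewer than `per_tf` sites are discarded
        tf_sites.sort(key=lambda s: s[4], reverse=True)
        selected[tf] = tf_sites[:per_tf]
    return selected


# Toy input with a threshold of 2 instead of 5000
sites = [
    ("chr1", 100, 200, "CTCF", 9),
    ("chr1", 300, 400, "CTCF", 7),
    ("chr1", 500, 600, "CTCF", 8),
    ("chr2", 100, 200, "FOXA1", 5),
]
top = select_top_sites(sites, per_tf=2)
print(top)
```

In the toy input, FOXA1 has only one site and is dropped, while CTCF keeps its two best-supported sites.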

Format:

chr1  10000  10500  CTCF
chr1  15000  15200  FOXA1

ATAC Regions

Source: TCGA ATAC-seq Pan-Cancer Atlas

As described in Helzer et al.: "Consensus genomic regions from Assay for Transposase Accessible Chromatin with sequencing (ATAC-seq) data was downloaded from The Cancer Genome Atlas (TCGA) for 23 different cancer types."

Genome	File	Cancer Types
GRCh37	`ATAC.GRCh37.bed.gz`	23
GRCh38	`ATAC.GRCh38.bed.gz`	23

Cancer Types: ACC, BLCA, BRCA, CESC, CHOL, COAD, ESCA, GBM, HNSC, KIRC, KIRP, LAML, LGG, LIHC, LUAD, LUSC, MESO, PCPG, PRAD, SKCM, STAD, TGCT, THCA, UCEC

Format:

chr1  10000  10500  BRCA
chr1  15000  15200  LUAD

Data Source: Region files are from Zhao-Lab-UW-DHO/fragmentomics_metrics

Panel Mode (Dual Output)

For targeted sequencing panels (like MSK-ACCESS), krewlyzer generates dual output:

Genome-wide (.tsv): All fragments across all TFBS/ATAC regions → WGS-comparable baseline
Panel-specific (.ontarget.tsv): Uses pre-intersected panel regions → panel-specific signal

flowchart LR
    subgraph "Genome-wide Output"
        GW_TFBS["All TFBS regions"]
        GW_ATAC["All ATAC regions"]
    end

    subgraph "Panel-specific Output"
        PS_TFBS["xs1/xs2 TFBS regions"]
        PS_ATAC["xs1/xs2 ATAC regions"]
    end

    GW_TFBS --> TSV1["sample.TFBS.tsv"]
    GW_ATAC --> TSV2["sample.ATAC.tsv"]
    PS_TFBS --> TSV3["sample.TFBS.ontarget.tsv"]
    PS_ATAC --> TSV4["sample.ATAC.ontarget.tsv"]


Usage

# With --assay → auto-loads panel-specific region files
krewlyzer run-all -i sample.bam -r hg19.fa -o out/ --assay xs2

# Or with target regions → enables panel mode
krewlyzer run-all -i sample.bam -r hg19.fa -o out/ \
    -T msk-access-v2.targets.bed

Output Files (Panel Mode)

File	Description	GC Correction
`{sample}.TFBS.tsv`	All 808 TFs (genome-wide)	Off-target GC model
`{sample}.TFBS.ontarget.tsv`	TFs overlapping panel	On-target GC model
`{sample}.TFBS.sync.tsv`	Detailed size distributions	-
`{sample}.TFBS.ontarget.sync.tsv`	Panel size distributions	-
`{sample}.ATAC.tsv`	All 23 cancer types	Off-target GC model
`{sample}.ATAC.ontarget.tsv`	Cancer types in panel	On-target GC model

Note

On-target outputs use on-target GC correction factors when available, providing better accuracy for capture-biased data.
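How the panel-specific region files are pre-intersected is not described on this page. One plausible reading, sketched in Python below, keeps every labelled region that overlaps a panel target by at least 1 bp. The function name and tuple layout are hypothetical, and the bundled files may instead clip regions to the target boundaries.

```python
def intersect_with_targets(regions, targets):
    """Keep labelled regions that overlap any target interval by >= 1 bp.

    regions: list of (chrom, start, end, label)
    targets: list of (chrom, start, end)
    Coordinates are assumed to be half-open, as in BED.
    """
    kept = []
    for chrom, start, end, label in regions:
        if any(
            t_chrom == chrom and start < t_end and t_start < end
            for t_chrom, t_start, t_end in targets
        ):
            kept.append((chrom, start, end, label))
    return kept


regions = [("chr1", 10000, 10500, "CTCF"), ("chr1", 15000, 15200, "FOXA1")]
targets = [("chr1", 10400, 10600)]
panel_regions = intersect_with_targets(regions, targets)
print(panel_regions)
```

The nested scan is quadratic and only suitable for illustration; a sorted sweep or interval tree would be the usual choice at the scale of millions of regions.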

PON Normalization

With a PON model, raw entropy is converted to Z-scores:

\[ Z = \frac{\text{entropy} - \mu_{\text{PON}}}{\sigma_{\text{PON}}} \]
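As a worked example of the formula, the Python sketch below applies it per label. The baseline numbers are invented for illustration, and returning 0 when the PON standard deviation is 0 is an assumption, not documented krewlyzer behaviour.

```python
def pon_z_score(entropy, mean, std):
    """Z-score of a sample entropy against a PON mean and standard deviation."""
    if std == 0:
        return 0.0  # assumption: avoid division by zero for a degenerate baseline
    return (entropy - mean) / std


# Hypothetical per-TF baseline: label -> (mean, std)
tfbs_baseline = {"CTCF": (5.00, 0.20), "FOXA1": (5.10, 0.25)}
sample_entropy = {"CTCF": 5.30, "FOXA1": 5.00}

z_scores = {
    label: pon_z_score(h, *tfbs_baseline[label])
    for label, h in sample_entropy.items()
}
print(z_scores)
```

With these made-up values, CTCF sits about 1.5 standard deviations above the baseline and FOXA1 about 0.4 below it.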

Building PON with TFBS/ATAC

krewlyzer build-pon -i samples.txt -r hg19.fa -o healthy.pon.parquet

The PON model stores:

- tfbs_baseline: Per-TF mean/std entropy from healthy samples
- atac_baseline: Per-cancer-type mean/std entropy from healthy samples
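To illustrate what a per-label mean/std baseline looks like, the Python sketch below derives one from a handful of healthy entropy values. It uses the sample standard deviation and skips labels with fewer than two samples; both choices are assumptions, and `build_baseline` is a hypothetical helper rather than the `build-pon` implementation.

```python
import statistics


def build_baseline(healthy_entropies):
    """Map each label to (mean, std) of its entropy across healthy samples.

    healthy_entropies: {label: [entropy value per healthy sample]}
    """
    baseline = {}
    for label, values in healthy_entropies.items():
        if len(values) < 2:
            continue  # a sample standard deviation needs at least two values
        baseline[label] = (statistics.mean(values), statistics.stdev(values))
    return baseline


# Invented entropy values for three healthy samples (CTCF) and one (FOXA1)
healthy = {"CTCF": [5.0, 5.2, 5.4], "FOXA1": [4.9]}
baseline = build_baseline(healthy)
print(baseline)
```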

Applying PON

krewlyzer region-entropy -i sample.bed.gz -o out/ \
    -P healthy.pon.parquet

Output with PON:

label   count   mean_size   entropy   z_score
CTCF    1234    167.2       5.23      1.45
FOXA1   892     165.8       4.98      -0.32

Output Format

TFBS Output: `{sample}.TFBS.tsv`

Column	Type	Description
`label`	TEXT	Transcription factor name (e.g., CTCF, FOXA1)
`count`	INT	Number of fragments overlapping TF regions
`mean_size`	FLOAT	Mean fragment size at these regions
`entropy`	FLOAT	Shannon entropy of size distribution (bits)
`z_score`	FLOAT	PON-normalized z-score (0 if no PON)

ATAC Output: `{sample}.ATAC.tsv`

Column	Type	Description
`label`	TEXT	Cancer type (e.g., BRCA, LUAD, COAD)
`count`	INT	Number of fragments overlapping cancer peaks
`mean_size`	FLOAT	Mean fragment size at these regions
`entropy`	FLOAT	Shannon entropy of size distribution (bits)
`z_score`	FLOAT	PON-normalized z-score (0 if no PON)
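Both outputs share the same five columns, so one reader covers TFBS and ATAC files. The Python sketch below assumes a tab-delimited file with a header row, as in the PON example output shown earlier; `read_entropy_table` is an illustrative name.

```python
import csv
import io

# Stand-in for open("sample.TFBS.tsv"); rows copied from the example output
tsv_text = (
    "label\tcount\tmean_size\tentropy\tz_score\n"
    "CTCF\t1234\t167.2\t5.23\t1.45\n"
    "FOXA1\t892\t165.8\t4.98\t-0.32\n"
)


def read_entropy_table(handle):
    """Parse a region-entropy TSV into a list of typed dictionaries."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "label": row["label"],
            "count": int(row["count"]),
            "mean_size": float(row["mean_size"]),
            "entropy": float(row["entropy"]),
            "z_score": float(row["z_score"]),
        })
    return rows


rows = read_entropy_table(io.StringIO(tsv_text))
# Rank labels by how far they deviate from the PON baseline
ranked = sorted(rows, key=lambda r: abs(r["z_score"]), reverse=True)
print([r["label"] for r in ranked])
```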

Interpretation

Entropy Values

Range (bits)	Interpretation	Clinical Significance
0-2	Very low entropy	Single dominant fragment size
2-4	Low entropy	Few distinct sizes
4-6	Moderate entropy	Normal healthy range
6-8	High entropy	Many distinct sizes
> 8	Very high entropy	Possible tumor signal
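
The size window used for the histograms (20-500bp, see Algorithm Details) bounds the entropy from above. Assuming both ends of that window are inclusive and sizes are binned at 1bp resolution, which is not stated explicitly here, the maximum is reached when all \(N\) sizes are equally frequent:

\[ H_{\max} = \log_2(N) = \log_2(500 - 20 + 1) = \log_2(481) \approx 8.91 \text{ bits} \]

Under that assumption, values in the "> 8" row sit close to the theoretical ceiling.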

Z-Scores

Z-Score	Interpretation
-2 to +2	Within normal range
+2 to +3	Elevated (possible cancer signal)
> +3	Significantly elevated
< -2	Significantly reduced
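
The table can be turned into a small lookup. In this Python sketch, how the boundary values -2, +2, and +3 are assigned is an assumption, since the table does not say which side they belong to; `interpret_z` is a hypothetical helper.

```python
def interpret_z(z):
    """Map a PON z-score to the interpretation bands in the table above."""
    if z < -2:
        return "Significantly reduced"
    if z <= 2:
        return "Within normal range"  # assumption: -2 and +2 count as normal
    if z <= 3:
        return "Elevated (possible cancer signal)"
    return "Significantly elevated"


for z in (1.45, -0.32, 2.5, 3.2, -2.4):
    print(z, interpret_z(z))
```

Because `z_score` is written as 0 when no PON is supplied, a 0 in that column reads as "within normal range" here; check that a PON model was actually applied before interpreting it.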

Algorithm Details

Rust Backend (`region_entropy.rs`)

Following the methodology from Helzer et al.:

Load regions: Read BED.gz with label in 4th column
Intersect fragments: For each region, collect fragments with minimum 1bp overlap
Build histograms: Count fragments per size (20-500bp) per label
Compute entropy: Shannon entropy from normalized histogram
Output: TSV with label, count, mean_size, entropy
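
The steps above can be sketched in plain Python. This is an illustrative brute-force version, not the Rust code: it assumes half-open BED coordinates, an inclusive 20-500bp size window, and that a fragment overlapping several regions with the same label is counted once per region. All names are hypothetical.

```python
import math
from collections import Counter, defaultdict

MIN_SIZE, MAX_SIZE = 20, 500  # size window from the histogram step


def overlaps(f_start, f_end, r_start, r_end):
    """True when two half-open intervals share at least 1 bp."""
    return f_start < r_end and r_start < f_end


def region_entropy(fragments, regions):
    """fragments: (chrom, start, end); regions: (chrom, start, end, label)."""
    histograms = defaultdict(Counter)  # label -> {fragment size: count}
    for r_chrom, r_start, r_end, label in regions:
        for f_chrom, f_start, f_end in fragments:
            size = f_end - f_start
            if (f_chrom == r_chrom
                    and MIN_SIZE <= size <= MAX_SIZE
                    and overlaps(f_start, f_end, r_start, r_end)):
                histograms[label][size] += 1

    results = {}
    for label, hist in histograms.items():
        count = sum(hist.values())
        mean_size = sum(s * n for s, n in hist.items()) / count
        entropy = sum(-(n / count) * math.log2(n / count) for n in hist.values())
        results[label] = (count, mean_size, entropy)
    return results


fragments = [
    ("chr1", 10100, 10267),  # 167 bp, overlaps the CTCF region
    ("chr1", 10200, 10350),  # 150 bp, overlaps the CTCF region
    ("chr1", 15050, 15217),  # 167 bp, overlaps the FOXA1 region
    ("chr2", 100, 300),      # no region on chr2
]
regions = [("chr1", 10000, 10500, "CTCF"), ("chr1", 15000, 15200, "FOXA1")]
result = region_entropy(fragments, regions)
print(result)
```

The nested loop keeps the logic visible but scales as regions times fragments. The documented backend parallelizes over regions with Rayon, and a production version would also need an indexed overlap search.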

Python Processing (`region_entropy_processor.py`)

Load raw output: Read Rust-generated TSV
Apply PON baseline: If model provided, compute Z-scores
Write final output: TSV with z_score column

Performance

Dataset	Regions	Time	Memory
TFBS (808 TFs)	~4M regions	~30s	~500MB
ATAC (23 types)	~700K regions	~20s	~400MB

Citation

If you use this feature, please cite:

Reference

Helzer KT, Sharifi MN, Sperger JM, et al. Analysis of cfDNA fragmentomics metrics and commercial targeted sequencing panels. Nat Commun 16, 9122 (2025). https://doi.org/10.1038/s41467-025-64153-z

Data source: Zhao-Lab-UW-DHO/fragmentomics_metrics (GitHub)
