
Region Entropy (TFBS/ATAC Size Entropy)

Command: krewlyzer region-entropy

Plain English

Region Entropy measures how diverse fragment sizes are at regulatory regions. High entropy means fragments are spread across many sizes; low entropy means one or a few sizes dominate.

Use case: Cancer detection and subtyping - tumor cfDNA shows altered nucleosome positioning at specific regulatory elements.

Note

This feature is based on Helzer KT, et al. (2025) "Analysis of cfDNA fragmentomics metrics and commercial targeted sequencing panels" published in Nature Communications.


Purpose

Calculates Shannon entropy of fragment size distributions at:

  • TFBS: Transcription Factor Binding Sites (808 factors from GTRD)
  • ATAC: Cancer-specific ATAC-seq peaks (23 cancer types from TCGA)

These metrics enable cancer phenotyping from targeted sequencing panels without requiring whole genome sequencing.


Scientific Background

From Helzer et al. 2025

Fragmentomics-based analysis of cell-free DNA (cfDNA) has emerged as a method to infer epigenetic and transcriptional data. While many reports analyze whole genome sequencing (WGS), targeted exon panels can be similarly employed for cancer phenotyping with minimal decrease in performance despite their smaller genomic coverage.

The study assessed 13 fragmentomics metrics including:

  • Fragment length proportions (small fragments, Shannon entropy)
  • Normalized fragment read depth
  • End motif diversity score (MDS)
  • TFBS entropy: fragments overlapping transcription factor binding sites
  • ATAC entropy: fragments overlapping cancer-specific open chromatin regions

Key findings relevant to TFBS/ATAC entropy:

  • Diversity metrics like Shannon entropy measure the spread of fragment sizes in a region
  • TFBS and ATAC entropy work well for cancer detection and subtyping
  • These metrics can be applied to commercial targeted sequencing panels

Biological Mechanism

Fragment size distributions at regulatory regions reflect nucleosome positioning:

  • Nucleosome-bound DNA: ~147bp core + ~20bp linker = ~167bp
  • Open chromatin (active TF binding): Variable sizes due to transcription factor binding
  • Tumor alterations: Aberrant nucleosome positioning → altered size distributions

Cancer cells exhibit:

  • Epigenetic dysregulation → Changed TFBS accessibility
  • Altered enhancer usage → Different ATAC peak patterns
  • Tissue-specific signatures → Cancer type identification


Processing Flowchart

flowchart LR
    BED["sample.bed.gz"] --> RUST["Rust Backend (par_iter)"]
    TFBS["TFBS regions"] --> RUST
    ATAC["ATAC regions"] --> RUST
    RUST --> ENTROPY["Entropy Calculation"]

    subgraph "Per Region Label (Parallel)"
        ENTROPY --> COUNT["Fragment count"]
        ENTROPY --> SIZES["Size distribution"]
        SIZES --> SHANNON["Shannon Entropy"]
    end

    SHANNON --> TSV["TFBS.tsv / ATAC.tsv"]
    TSV --> PON["PON Z-score"]

Performance

Region-level parallelization via Rayon par_iter() enables efficient multi-core processing of TFBS/ATAC regions.


Shannon Entropy Formula

\[ H = -\sum_{i} p_i \log_2(p_i) \]

Where:

  • \(p_i\) = Proportion of fragments with size \(i\)
  • High entropy = Many equally represented sizes (diverse)
  • Low entropy = One dominant size (uniform)

As described in Helzer et al.: "Shannon entropy was calculated on the frequency of the fragment lengths... This yielded a single entropy value for each [TF/cancer type] in each sample."
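
To make the formula concrete, here is a minimal Python sketch of the calculation. It is an illustration only, not the Rust backend's code; the function name `shannon_entropy` and the example sizes are made up for this page. It also shows the upper bound: if every size from 20 to 500 bp (assumed inclusive, 481 sizes) were equally common, entropy would reach log2(481), roughly 8.9 bits.

```python
import math
from collections import Counter


def shannon_entropy(sizes):
    """Shannon entropy (in bits) of a collection of fragment sizes.

    Hypothetical helper for illustration; returns 0.0 for empty input.
    """
    counts = Counter(sizes)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Sum -p_i * log2(p_i) over the observed sizes only (p_i > 0).
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


uniform = shannon_entropy([150, 160, 170, 180])  # four equally common sizes
dominant = shannon_entropy([167] * 10)           # a single size
upper_bound = shannon_entropy(range(20, 501))    # every size 20..500 once

print(uniform, dominant, round(upper_bound, 2))
```

Four equally represented sizes give exactly 2 bits, while a single repeated size gives 0 bits, matching the "diverse" and "uniform" cases above.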


Usage

# Basic usage (computes both TFBS and ATAC)
krewlyzer region-entropy -i sample.bed.gz -o output_dir/

# TFBS only
krewlyzer region-entropy -i sample.bed.gz -o output/ --no-atac

# ATAC only with PON normalization
krewlyzer region-entropy -i sample.bed.gz -o output/ \
    --no-tfbs --pon-model healthy.pon.parquet

# Via run-all (automatic when assets available)
krewlyzer run-all -i sample.bam -r hg19.fa -o output/

CLI Options

Option                 Short  Type  Default     Description
--input                -i     PATH  required    Input .bed.gz file (from extract)
--output               -o     PATH  required    Output directory
--sample-name          -s     TEXT  -           Override sample name
--tfbs/--no-tfbs       -      FLAG  --tfbs      Enable/disable TFBS entropy
--atac/--no-atac       -      FLAG  --atac      Enable/disable ATAC entropy
--tfbs-regions         -      PATH  -           Custom TFBS regions BED.gz
--atac-regions         -      PATH  -           Custom ATAC regions BED.gz
--genome               -G     TEXT  hg19        Genome build (hg19/GRCh37/hg38/GRCh38)
--gc-factors           -F     PATH  -           GC correction factors TSV
--pon-model            -P     PATH  -           PON model for z-score normalization
--pon-variant          -      TEXT  all_unique  PON variant: all_unique or duplex
--skip-pon             -      FLAG  -           Skip PON z-score normalization (for ML negatives)
--target-regions       -T     PATH  -           Target regions BED (panel mode: generates .ontarget.tsv)
--skip-target-regions  -      FLAG  -           Force WGS mode (ignore bundled targets from --assay)
--threads              -t     INT   0           Number of threads (0 = all cores)
--verbose              -v     FLAG  -           Enable verbose logging

Assets

TFBS Regions

Source: Gene Transcription Regulation Database (GTRD) v19.10

As described in Helzer et al.: "A collection of consensus Homo sapiens TFBSs was downloaded from the Gene Transcription Regulation Database (GTRD, v19.10). For each TF, the top 5000 sites with the greatest amount of experimental support were used for analysis. TFs with fewer than 5000 sites were discarded, leaving a total of 808 TFs used for the analysis."

Genome File TF Count Sites per TF
GRCh37 TFBS.GRCh37.bed.gz 808 5,000
GRCh38 TFBS.GRCh38.bed.gz 808 5,000

Format:

chr1  10000  10500  CTCF
chr1  15000  15200  FOXA1

ATAC Regions

Source: TCGA ATAC-seq Pan-Cancer Atlas

As described in Helzer et al.: "Consensus genomic regions from Assay for Transposase Accessible Chromatin with sequencing (ATAC-seq) data was downloaded from The Cancer Genome Atlas (TCGA) for 23 different cancer types."

Genome File Cancer Types
GRCh37 ATAC.GRCh37.bed.gz 23
GRCh38 ATAC.GRCh38.bed.gz 23

Cancer Types: ACC, BLCA, BRCA, CESC, CHOL, COAD, ESCA, GBM, HNSC, KIRC, KIRP, LAML, LGG, LIHC, LUAD, LUSC, MESO, PCPG, PRAD, SKCM, STAD, TGCT, THCA, UCEC

Format:

chr1  10000  10500  BRCA
chr1  15000  15200  LUAD

Data Source: Region files are from Zhao-Lab-UW-DHO/fragmentomics_metrics


Panel Mode (Dual Output)

For targeted sequencing panels (like MSK-ACCESS), krewlyzer generates dual output:

  1. Genome-wide (.tsv): All fragments across all TFBS/ATAC regions → WGS-comparable baseline
  2. Panel-specific (.ontarget.tsv): Uses pre-intersected panel regions → panel-specific signal
flowchart LR
    subgraph "Genome-wide Output"
        GW_TFBS["All TFBS regions"]
        GW_ATAC["All ATAC regions"]
    end

    subgraph "Panel-specific Output"
        PS_TFBS["xs1/xs2 TFBS regions"]
        PS_ATAC["xs1/xs2 ATAC regions"]
    end

    GW_TFBS --> TSV1["sample.TFBS.tsv"]
    GW_ATAC --> TSV2["sample.ATAC.tsv"]
    PS_TFBS --> TSV3["sample.TFBS.ontarget.tsv"]
    PS_ATAC --> TSV4["sample.ATAC.ontarget.tsv"]

Usage

# With --assay → auto-loads panel-specific region files
krewlyzer run-all -i sample.bam -r hg19.fa -o out/ --assay xs2

# Or with target regions → enables panel mode
krewlyzer run-all -i sample.bam -r hg19.fa -o out/ \
    -T msk-access-v2.targets.bed

Output Files (Panel Mode)

File                             Description                  GC Correction
{sample}.TFBS.tsv                All 808 TFs (genome-wide)    Off-target GC model
{sample}.TFBS.ontarget.tsv       TFs overlapping panel        On-target GC model
{sample}.TFBS.sync.tsv           Detailed size distributions  -
{sample}.TFBS.ontarget.sync.tsv  Panel size distributions     -
{sample}.ATAC.tsv                All 23 cancer types          Off-target GC model
{sample}.ATAC.ontarget.tsv       Cancer types in panel        On-target GC model

Note

On-target outputs use on-target GC correction factors when available, providing better accuracy for capture-biased data.


PON Normalization

With a PON model, raw entropy is converted to Z-scores:

\[ Z = \frac{\text{entropy} - \mu_{\text{PON}}}{\sigma_{\text{PON}}} \]
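
As a rough illustration of this formula, the sketch below computes a z-score from a per-label baseline. The function name `pon_z_score`, the dictionary layout, and the numeric baseline values are hypothetical and chosen only for the example; the real baselines are stored in the PON model. Returning 0.0 for a non-positive standard deviation is likewise an assumption of this sketch, not documented krewlyzer behavior.

```python
def pon_z_score(entropy, pon_mean, pon_std):
    """Z-score of a sample entropy against a healthy baseline.

    Hypothetical helper; falls back to 0.0 when the baseline spread
    is not usable (an assumption made for this sketch).
    """
    if pon_std <= 0:
        return 0.0
    return (entropy - pon_mean) / pon_std


# Made-up baseline values for illustration only: label -> (mean, std).
tfbs_baseline = {"CTCF": (5.0, 0.25)}

mean, std = tfbs_baseline["CTCF"]
z = pon_z_score(5.5, mean, std)  # (5.5 - 5.0) / 0.25
print(z)
```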

Building PON with TFBS/ATAC

krewlyzer build-pon -i samples.txt -r hg19.fa -o healthy.pon.parquet

The PON model stores:

  • tfbs_baseline: Per-TF mean/std entropy from healthy samples
  • atac_baseline: Per-cancer-type mean/std entropy from healthy samples

Applying PON

krewlyzer region-entropy -i sample.bed.gz -o out/ \
    -P healthy.pon.parquet

Output with PON:

label   count   mean_size   entropy   z_score
CTCF    1234    167.2       5.23      1.45
FOXA1   892     165.8       4.98      -0.32


Output Format

TFBS Output: {sample}.TFBS.tsv

Column Type Description
label TEXT Transcription factor name (e.g., CTCF, FOXA1)
count INT Number of fragments overlapping TF regions
mean_size FLOAT Mean fragment size at these regions
entropy FLOAT Shannon entropy of size distribution (bits)
z_score FLOAT PON-normalized z-score (0 if no PON)

ATAC Output: {sample}.ATAC.tsv

Column Type Description
label TEXT Cancer type (e.g., BRCA, LUAD, COAD)
count INT Number of fragments overlapping cancer peaks
mean_size FLOAT Mean fragment size at these regions
entropy FLOAT Shannon entropy of size distribution (bits)
z_score FLOAT PON-normalized z-score (0 if no PON)

Interpretation

Entropy Values

Range  Interpretation     Clinical Significance
0-2    Very low entropy   Single dominant fragment size
2-4    Low entropy        Few distinct sizes
4-6    Moderate entropy   Normal healthy range
6-8    High entropy       Many distinct sizes
> 8    Very high entropy  Possible tumor signal

Z-Scores

Z-Score   Interpretation
-2 to +2  Within normal range
+2 to +3  Elevated (possible cancer signal)
> +3      Significantly elevated
< -2      Significantly reduced
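
The z-score bands can be applied to an output file with a few lines of Python. This is a sketch only: `interpret_z` is a hypothetical helper, the handling of the exact boundary values (2 and 3) is an assumption because the table does not specify it, and the rows are the sample values from the "Output with PON" example, assumed to be tab-delimited. Without a PON model the z_score column is 0, so this labelling is only meaningful for PON-normalized output.

```python
import csv
import io


def interpret_z(z):
    """Map a z-score to the bands in the table above.

    Boundary handling (<= 2, <= 3) is an assumption of this sketch.
    """
    if z < -2:
        return "significantly reduced"
    if z <= 2:
        return "within normal range"
    if z <= 3:
        return "elevated"
    return "significantly elevated"


# Sample rows taken from the "Output with PON" example on this page.
tsv_text = (
    "label\tcount\tmean_size\tentropy\tz_score\n"
    "CTCF\t1234\t167.2\t5.23\t1.45\n"
    "FOXA1\t892\t165.8\t4.98\t-0.32\n"
)

calls = {
    row["label"]: interpret_z(float(row["z_score"]))
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
}
print(calls)
```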

Algorithm Details

Rust Backend (region_entropy.rs)

Following the methodology from Helzer et al.:

  1. Load regions: Read BED.gz with label in 4th column
  2. Intersect fragments: For each region, collect fragments with minimum 1bp overlap
  3. Build histograms: Count fragments per size (20-500bp) per label
  4. Compute entropy: Shannon entropy from normalized histogram
  5. Output: TSV with label, count, mean_size, entropy
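
The five steps above can be sketched in Python as follows. This is a simplified, brute-force illustration rather than the Rust implementation, which parallelizes over regions. Several details are assumptions of the sketch: coordinates are treated as half-open BED intervals, fragment size is taken as end minus start, the 20-500 bp range is treated as inclusive, and a fragment overlapping several regions of the same label is counted once per region. The names `overlaps` and `region_entropy` and the fragment coordinates are hypothetical.

```python
import math
from collections import Counter, defaultdict

MIN_SIZE, MAX_SIZE = 20, 500  # size range from step 3, assumed inclusive


def overlaps(f_start, f_end, r_start, r_end):
    """True if two half-open intervals share at least 1 bp."""
    return f_start < r_end and r_start < f_end


def region_entropy(regions, fragments):
    """Return (label, count, mean_size, entropy) rows, sorted by label.

    regions:   iterable of (chrom, start, end, label)
    fragments: iterable of (chrom, start, end)
    """
    hist = defaultdict(Counter)  # label -> {fragment size: count}
    for chrom, r_start, r_end, label in regions:
        for f_chrom, f_start, f_end in fragments:
            size = f_end - f_start
            if (f_chrom == chrom and MIN_SIZE <= size <= MAX_SIZE
                    and overlaps(f_start, f_end, r_start, r_end)):
                hist[label][size] += 1

    rows = []
    for label, counts in sorted(hist.items()):
        n = sum(counts.values())
        mean_size = sum(s * c for s, c in counts.items()) / n
        entropy = sum(-(c / n) * math.log2(c / n) for c in counts.values())
        rows.append((label, n, mean_size, entropy))
    return rows


# Regions follow the BED format shown under Assets; fragments are made up.
regions = [("chr1", 10000, 10500, "CTCF"), ("chr1", 15000, 15200, "FOXA1")]
fragments = [
    ("chr1", 9900, 10067), ("chr1", 10100, 10267), ("chr1", 10400, 10550),
    ("chr1", 15100, 15266), ("chr2", 10000, 10167),
]
rows = region_entropy(regions, fragments)
for row in rows:
    print(*row, sep="\t")
```

The nested loop is quadratic and only suitable for toy inputs; a real implementation would use an interval index or sorted sweep to find overlaps.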

Python Processing (region_entropy_processor.py)

  1. Load raw output: Read Rust-generated TSV
  2. Apply PON baseline: If model provided, compute Z-scores
  3. Write final output: TSV with z_score column

Performance

Dataset          Regions        Time  Memory
TFBS (808 TFs)   ~4M regions    ~30s  ~500MB
ATAC (23 types)  ~700K regions  ~20s  ~400MB

Citation

If you use this feature, please cite:

Reference

Helzer KT, Sharifi MN, Sperger JM, et al. Analysis of cfDNA fragmentomics metrics and commercial targeted sequencing panels. Nat Commun 16, 9122 (2025). https://doi.org/10.1038/s41467-025-64153-z

Data source:

  • GitHub: Zhao-Lab-UW-DHO/fragmentomics_metrics


See Also

  • Extract – Generate input .bed.gz files
  • Build PON – Create PON models with TFBS/ATAC baselines
  • OCF – Related open chromatin feature
  • Citation – All scientific references