Region Entropy (TFBS/ATAC Size Entropy)
Command: krewlyzer region-entropy
Plain English
Region Entropy calculates the diversity of fragment sizes at regulatory regions. A high entropy value indicates many different fragment sizes; low entropy indicates uniform sizes.
Use case: Cancer detection and subtyping - tumor cfDNA shows altered nucleosome positioning at specific regulatory elements.
Note
This feature is based on Helzer KT, et al. (2025) "Analysis of cfDNA fragmentomics metrics and commercial targeted sequencing panels" published in Nature Communications.
Purpose
Calculates Shannon entropy of fragment size distributions at:
- TFBS: Transcription Factor Binding Sites (808 factors from GTRD)
- ATAC: Cancer-specific ATAC-seq peaks (23 cancer types from TCGA)
These metrics enable cancer phenotyping from targeted sequencing panels without requiring whole genome sequencing.
Scientific Background
From Helzer et al. 2025
Fragmentomics-based analysis of cell-free DNA (cfDNA) has emerged as a method to infer epigenetic and transcriptional data. While many reports analyze whole genome sequencing (WGS), targeted exon panels can be similarly employed for cancer phenotyping with minimal decrease in performance despite their smaller genomic coverage.
The study assessed 13 fragmentomics metrics including:
- Fragment length proportions (small fragments, Shannon entropy)
- Normalized fragment read depth
- End motif diversity score (MDS)
- TFBS entropy: fragments overlapping transcription factor binding sites
- ATAC entropy: fragments overlapping cancer-specific open chromatin regions

Key findings relevant to TFBS/ATAC entropy:
- Diversity metrics like Shannon entropy measure the spread of fragment sizes in a region
- TFBS and ATAC entropy work well for cancer detection and subtyping
- These metrics can be applied to commercial targeted sequencing panels
Biological Mechanism
Fragment size distributions at regulatory regions reflect nucleosome positioning:
- Nucleosome-bound DNA: ~147bp core + ~20bp linker = ~167bp
- Open chromatin (active TF binding): Variable sizes due to transcription factor binding
- Tumor alterations: Aberrant nucleosome positioning → altered size distributions
Cancer cells exhibit:
- Epigenetic dysregulation → Changed TFBS accessibility
- Altered enhancer usage → Different ATAC peak patterns
- Tissue-specific signatures → Cancer type identification
Processing Flowchart
flowchart LR
BED["sample.bed.gz"] --> RUST["Rust Backend<br/>(par_iter)"]
TFBS["TFBS regions"] --> RUST
ATAC["ATAC regions"] --> RUST
RUST --> ENTROPY["Entropy Calculation"]
subgraph "Per Region Label (Parallel)"
ENTROPY --> COUNT["Fragment count"]
ENTROPY --> SIZES["Size distribution"]
SIZES --> SHANNON["Shannon Entropy"]
end
SHANNON --> TSV["TFBS.tsv / ATAC.tsv"]
TSV --> PON["PON Z-score"]
Performance
Region-level parallelization via Rayon par_iter() enables efficient multi-core processing of TFBS/ATAC regions.
Shannon Entropy Formula
\[ H = -\sum_{i} p_i \log_2 p_i \]

Where:
- \(p_i\) = Proportion of fragments with size \(i\)
- High entropy = Many equally represented sizes (diverse)
- Low entropy = One dominant size (uniform)
As described in Helzer et al.: "Shannon entropy was calculated on the frequency of the fragment lengths... This yielded a single entropy value for each [TF/cancer type] in each sample."
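To make the definition concrete, here is a minimal Python sketch of Shannon entropy in bits over a list of fragment lengths. The function name `shannon_entropy_bits` and the example sizes are illustrative assumptions; this is not the krewlyzer Rust implementation.

```python
import math
from collections import Counter


def shannon_entropy_bits(fragment_sizes):
    """Shannon entropy (base 2) of the fragment-length frequencies."""
    counts = Counter(fragment_sizes)   # size -> number of fragments
    total = sum(counts.values())
    if total == 0:
        return 0.0                     # assumption: empty input reports 0
    entropy = 0.0
    for n in counts.values():
        p = n / total                  # proportion of fragments with this size
        entropy -= p * math.log2(p)
    return entropy


# Four equally represented sizes: maximal diversity for four categories.
uniform = shannon_entropy_bits([150, 160, 170, 180])
# A single dominant size: no diversity.
single = shannon_entropy_bits([167] * 10)
print(uniform, single)
```

Four equally frequent sizes give 2 bits, while a sample containing only one size gives 0 bits, which matches the high/low entropy reading above.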
Usage
# Basic usage (computes both TFBS and ATAC)
krewlyzer region-entropy -i sample.bed.gz -o output_dir/
# TFBS only
krewlyzer region-entropy -i sample.bed.gz -o output/ --no-atac
# ATAC only with PON normalization
krewlyzer region-entropy -i sample.bed.gz -o output/ \
--no-tfbs --pon-model healthy.pon.parquet
# Via run-all (automatic when assets available)
krewlyzer run-all -i sample.bam -r hg19.fa -o output/
CLI Options
| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| --input | -i | PATH | required | Input .bed.gz file (from extract) |
| --output | -o | PATH | required | Output directory |
| --sample-name | -s | TEXT | | Override sample name |
| --tfbs/--no-tfbs | | FLAG | --tfbs | Enable/disable TFBS entropy |
| --atac/--no-atac | | FLAG | --atac | Enable/disable ATAC entropy |
| --tfbs-regions | | PATH | | Custom TFBS regions BED.gz |
| --atac-regions | | PATH | | Custom ATAC regions BED.gz |
| --genome | -G | TEXT | hg19 | Genome build (hg19/GRCh37/hg38/GRCh38) |
| --gc-factors | -F | PATH | | GC correction factors TSV |
| --pon-model | -P | PATH | | PON model for z-score normalization |
| --pon-variant | | TEXT | all_unique | PON variant: all_unique or duplex |
| --skip-pon | | FLAG | | Skip PON z-score normalization (for ML negatives) |
| --target-regions | -T | PATH | | Target regions BED (panel mode: generates .ontarget.tsv) |
| --skip-target-regions | | FLAG | | Force WGS mode (ignore bundled targets from --assay) |
| --threads | -t | INT | 0 | Number of threads (0 = all cores) |
| --verbose | -v | FLAG | | Enable verbose logging |
Assets
TFBS Regions
Source: Gene Transcription Regulation Database (GTRD) v19.10
As described in Helzer et al.: "A collection of consensus Homo sapiens TFBSs was downloaded from the Gene Transcription Regulation Database (GTRD, v19.10). For each TF, the top 5000 sites with the greatest amount of experimental support were used for analysis. TFs with fewer than 5000 sites were discarded, leaving a total of 808 TFs used for the analysis."
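The selection rule in the quote (keep the top 5000 sites per TF, discard TFs with fewer) can be sketched as follows. The tuple layout and the numeric `support` field are assumptions made for illustration; the actual GTRD columns and the authors' scripts may differ.

```python
from collections import defaultdict


def select_top_sites(sites, n_top=5000):
    """Keep the n_top best-supported sites per TF; drop TFs with fewer sites.

    sites: iterable of (tf_name, chrom, start, end, support) tuples, where
    `support` is a hypothetical numeric measure of experimental support.
    """
    by_tf = defaultdict(list)
    for site in sites:
        by_tf[site[0]].append(site)

    selected = {}
    for tf, tf_sites in by_tf.items():
        if len(tf_sites) < n_top:
            continue                                   # too few sites: discard TF
        tf_sites.sort(key=lambda s: s[4], reverse=True)
        selected[tf] = tf_sites[:n_top]                # highest support first
    return selected


# Toy data with n_top=2 so the behaviour is visible on a few rows.
sites = [
    ("CTCF", "chr1", 100, 120, 9),
    ("CTCF", "chr1", 500, 520, 3),
    ("CTCF", "chr2", 50, 70, 7),
    ("FOXA1", "chr3", 10, 30, 4),
]
top = select_top_sites(sites, n_top=2)
print(sorted(top), [s[4] for s in top["CTCF"]])
```

In the toy run, CTCF keeps its two best-supported sites and FOXA1 is dropped because it has fewer than `n_top` sites.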
| Genome | File | TF Count | Sites per TF |
|---|---|---|---|
| GRCh37 | TFBS.GRCh37.bed.gz | 808 | 5,000 |
| GRCh38 | TFBS.GRCh38.bed.gz | 808 | 5,000 |
Format:
ATAC Regions
Source: TCGA ATAC-seq Pan-Cancer Atlas
As described in Helzer et al.: "Consensus genomic regions from Assay for Transposase Accessible Chromatin with sequencing (ATAC-seq) data was downloaded from The Cancer Genome Atlas (TCGA) for 23 different cancer types."
| Genome | File | Cancer Types |
|---|---|---|
| GRCh37 | ATAC.GRCh37.bed.gz | 23 |
| GRCh38 | ATAC.GRCh38.bed.gz | 23 |
Cancer Types: ACC, BLCA, BRCA, CESC, CHOL, COAD, ESCA, GBM, HNSC, KIRC, KIRP, LAML, LGG, LIHC, LUAD, LUSC, MESO, PCPG, PRAD, SKCM, STAD, TGCT, THCA, UCEC
Format:
Data Source: Region files are from Zhao-Lab-UW-DHO/fragmentomics_metrics
Panel Mode (Dual Output)
For targeted sequencing panels (like MSK-ACCESS), krewlyzer generates dual output:
- Genome-wide (.tsv): All fragments across all TFBS/ATAC regions → WGS-comparable baseline
- Panel-specific (.ontarget.tsv): Uses pre-intersected panel regions → panel-specific signal
flowchart LR
subgraph "Genome-wide Output"
GW_TFBS["All TFBS regions"]
GW_ATAC["All ATAC regions"]
end
subgraph "Panel-specific Output"
PS_TFBS["xs1/xs2 TFBS regions"]
PS_ATAC["xs1/xs2 ATAC regions"]
end
GW_TFBS --> TSV1["sample.TFBS.tsv"]
GW_ATAC --> TSV2["sample.ATAC.tsv"]
PS_TFBS --> TSV3["sample.TFBS.ontarget.tsv"]
PS_ATAC --> TSV4["sample.ATAC.ontarget.tsv"]
Usage
# With --assay → auto-loads panel-specific region files
krewlyzer run-all -i sample.bam -r hg19.fa -o out/ --assay xs2
# Or with target regions → enables panel mode
krewlyzer run-all -i sample.bam -r hg19.fa -o out/ \
-T msk-access-v2.targets.bed
Output Files (Panel Mode)
| File | Description | GC Correction |
|---|---|---|
| {sample}.TFBS.tsv | All 808 TFs (genome-wide) | Off-target GC model |
| {sample}.TFBS.ontarget.tsv | TFs overlapping panel | On-target GC model |
| {sample}.TFBS.sync.tsv | Detailed size distributions | - |
| {sample}.TFBS.ontarget.sync.tsv | Panel size distributions | - |
| {sample}.ATAC.tsv | All 23 cancer types | Off-target GC model |
| {sample}.ATAC.ontarget.tsv | Cancer types in panel | On-target GC model |
Note
On-target outputs use on-target GC correction factors when available, providing better accuracy for capture-biased data.
PON Normalization
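With a PON model, raw entropy is converted to Z-scores using the per-label mean and standard deviation stored in the baseline:

\[ z = \frac{H_{\text{sample}} - \mu_{\text{PON}}}{\sigma_{\text{PON}}} \]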
Building PON with TFBS/ATAC
The PON model stores:
- tfbs_baseline: Per-TF mean/std entropy from healthy samples
- atac_baseline: Per-cancer-type mean/std entropy from healthy samples
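As a rough illustration of how a per-label mean/std baseline turns raw entropy into a z-score, consider the sketch below. The dictionary layout, the function name `entropy_z_scores`, and the fallback to 0 for labels missing from the baseline are assumptions; they do not describe the actual PON file schema.

```python
def entropy_z_scores(entropy_by_label, baseline):
    """Convert raw entropy values to z-scores against a healthy baseline.

    entropy_by_label: label -> raw entropy for one sample
    baseline:         label -> (mean, std) from healthy samples (hypothetical layout)
    """
    z_scores = {}
    for label, entropy in entropy_by_label.items():
        stats = baseline.get(label)
        if stats is None or stats[1] == 0:
            # Assumption: no usable baseline for this label, so report 0.
            z_scores[label] = 0.0
            continue
        mean, std = stats
        z_scores[label] = (entropy - mean) / std
    return z_scores


# Illustrative numbers only, not real PON values.
tfbs_baseline = {"CTCF": (5.0, 0.5)}
sample_entropy = {"CTCF": 6.0, "FOXA1": 5.2}
z = entropy_z_scores(sample_entropy, tfbs_baseline)
print(z)
```

Here CTCF sits two standard deviations above the healthy mean, while FOXA1 has no baseline entry and falls back to 0.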
Applying PON
Output with PON:
Output Format
TFBS Output: {sample}.TFBS.tsv
| Column | Type | Description |
|---|---|---|
| label | TEXT | Transcription factor name (e.g., CTCF, FOXA1) |
| count | INT | Number of fragments overlapping TF regions |
| mean_size | FLOAT | Mean fragment size at these regions |
| entropy | FLOAT | Shannon entropy of size distribution (bits) |
| z_score | FLOAT | PON-normalized z-score (0 if no PON) |
ATAC Output: {sample}.ATAC.tsv
| Column | Type | Description |
|---|---|---|
| label | TEXT | Cancer type (e.g., BRCA, LUAD, COAD) |
| count | INT | Number of fragments overlapping cancer peaks |
| mean_size | FLOAT | Mean fragment size at these regions |
| entropy | FLOAT | Shannon entropy of size distribution (bits) |
| z_score | FLOAT | PON-normalized z-score (0 if no PON) |
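For downstream use, the output tables can be read with the Python standard library. The sketch below assumes a tab-separated file with a header row naming the five columns listed above; the sample rows are made-up values for illustration, not real krewlyzer output.

```python
import csv
import io

# Made-up example content mimicking {sample}.TFBS.tsv (assumed header row).
EXAMPLE_TSV = (
    "label\tcount\tmean_size\tentropy\tz_score\n"
    "CTCF\t1200\t168.4\t5.31\t0.42\n"
    "FOXA1\t950\t171.0\t6.02\t2.75\n"
)


def load_entropy_table(handle):
    """Parse a TFBS/ATAC entropy TSV into typed dictionaries."""
    rows = []
    for record in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "label": record["label"],
            "count": int(record["count"]),
            "mean_size": float(record["mean_size"]),
            "entropy": float(record["entropy"]),
            "z_score": float(record["z_score"]),
        })
    return rows


rows = load_entropy_table(io.StringIO(EXAMPLE_TSV))
# Labels whose z-score exceeds +2.
elevated = [row["label"] for row in rows if row["z_score"] > 2.0]
print(elevated)
```

For a real file, pass an open file handle (for example `open("sample.TFBS.tsv", newline="")`) in place of the `io.StringIO` object.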
Interpretation
Entropy Values
| Range | Interpretation | Clinical Significance |
|---|---|---|
| 0-2 | Very low entropy | Single dominant fragment size |
| 2-4 | Low entropy | Few distinct sizes |
| 4-6 | Moderate entropy | Normal healthy range |
| 6-8 | High entropy | Many distinct sizes |
| > 8 | Very high entropy | Possible tumor signal |
Z-Scores
| Z-Score | Interpretation |
|---|---|
| -2 to +2 | Within normal range |
| +2 to +3 | Elevated (possible cancer signal) |
| > +3 | Significantly elevated |
| < -2 | Significantly reduced |
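The z-score table can be expressed as a small helper. The table does not say which category the exact boundary values (-2, +2, +3) belong to, so the handling below is an assumption, and `interpret_z` is a hypothetical name.

```python
def interpret_z(z):
    """Map a PON z-score to the interpretation categories in the table."""
    if z > 3:
        return "Significantly elevated"
    if z > 2:
        return "Elevated (possible cancer signal)"
    if z < -2:
        return "Significantly reduced"
    # Assumption: exactly -2 and +2 count as within the normal range.
    return "Within normal range"


labels = [interpret_z(z) for z in (0.4, 2.5, 3.2, -2.6)]
print(labels)
```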
Algorithm Details
Rust Backend (region_entropy.rs)
Following the methodology from Helzer et al.:
- Load regions: Read BED.gz with label in 4th column
- Intersect fragments: For each region, collect fragments with minimum 1bp overlap
- Build histograms: Count fragments per size (20-500bp) per label
- Compute entropy: Shannon entropy from normalized histogram
- Output: TSV with label, count, mean_size, entropy
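The steps above can be sketched in plain Python. This is a simplified, unoptimized illustration and not the Rust code: the tuple layouts, the half-open BED coordinate convention, and the choice to count a fragment once per region it overlaps are all assumptions.

```python
import math
from collections import defaultdict

MIN_SIZE, MAX_SIZE = 20, 500  # fragment size range used for the histogram


def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least 1 bp."""
    return a_start < b_end and b_start < a_end


def region_entropy(regions, fragments):
    """regions: (chrom, start, end, label); fragments: (chrom, start, end).

    Returns label -> (count, mean_size, entropy_bits).
    """
    hist = defaultdict(lambda: defaultdict(int))   # label -> size -> count
    for r_chrom, r_start, r_end, label in regions:
        for f_chrom, f_start, f_end in fragments:
            size = f_end - f_start
            if f_chrom != r_chrom or not (MIN_SIZE <= size <= MAX_SIZE):
                continue
            if overlaps(f_start, f_end, r_start, r_end):
                hist[label][size] += 1

    results = {}
    for label, sizes in hist.items():
        count = sum(sizes.values())
        mean_size = sum(s * n for s, n in sizes.items()) / count
        entropy = -sum((n / count) * math.log2(n / count) for n in sizes.values())
        results[label] = (count, mean_size, entropy)
    return results


regions = [("chr1", 1000, 1100, "CTCF"), ("chr1", 5000, 5100, "CTCF")]
fragments = [
    ("chr1", 950, 1117),    # 167 bp, overlaps first region
    ("chr1", 1050, 1200),   # 150 bp, overlaps first region
    ("chr1", 4990, 5157),   # 167 bp, overlaps second region
    ("chr1", 2000, 2167),   # 167 bp, no overlap
    ("chr1", 1090, 1100),   # 10 bp, outside the size range
]
count, mean_size, entropy = region_entropy(regions, fragments)["CTCF"]
print(count, round(mean_size, 2), round(entropy, 4))
```

With sizes restricted to 20-500 bp there are 481 possible values, so entropy in bits is bounded above by log2(481) ≈ 8.91. A production implementation would use an interval index rather than the nested loop shown here.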
Python Processing (region_entropy_processor.py)
- Load raw output: Read Rust-generated TSV
- Apply PON baseline: If model provided, compute Z-scores
- Write final output: TSV with z_score column
Performance
| Dataset | Regions | Time | Memory |
|---|---|---|---|
| TFBS (808 TFs) | ~4M regions | ~30s | ~500MB |
| ATAC (23 types) | ~700K regions | ~20s | ~400MB |
Citation
If you use this feature, please cite:
Reference
Helzer KT, Sharifi MN, Sperger JM, et al. Analysis of cfDNA fragmentomics metrics and commercial targeted sequencing panels. Nat Commun 16, 9122 (2025). https://doi.org/10.1038/s41467-025-64153-z
Data source:
- GitHub: Zhao-Lab-UW-DHO/fragmentomics_metrics