Fragment Size Distribution (FSD)

Command: krewlyzer fsd

Plain English

FSD builds a histogram of fragment sizes for each chromosome arm. Healthy samples peak at ~166bp, while cancer samples often show a left-shifted peak (~145bp).

Use case: Detect aneuploidy and copy number changes by comparing arm-level size distributions.


Purpose

Computes high-resolution (5bp bins) fragment length distributions per chromosome arm. Produces ML-ready features with log-ratio normalization and on/off-target split for panel data.


Processing Flowchart

flowchart LR
    BED["sample.bed.gz"] --> PIPELINE["Rust Pipeline"]
    ARMS["Chromosome Arms"] --> PIPELINE
    GC["GC Correction"] --> PIPELINE
    PIPELINE --> FSD["FSD.tsv"]

    subgraph "With --pon-model"
        FSD --> PON["PON Normalization"]
        PON --> LOGR["FSD.tsv + _logR columns"]
    end

    subgraph "With --target-regions"
        PIPELINE --> FSD_ON["FSD.ontarget.tsv"]
    end

Python/Rust Architecture

flowchart TB
    subgraph "Python (CLI)"
        CLI["fsd.py"] --> UP["unified_processor.py"]
        UP --> ASSETS["AssetManager"]
    end

    subgraph "Rust Backend"
        UP --> RUST["_core.run_unified_pipeline()"]
        RUST --> GC["GC correction"]
        GC --> HIST["187-bin histogram per arm (65-999bp)"]
    end

    subgraph "Python (Post-processing)"
        HIST --> PROC["fsd_processor.py"]
        PROC --> PON["PON log-ratio"]
        PON --> OUT["FSD.tsv"]
    end

Biological Context

Why Fragment Sizes Matter

cfDNA fragment sizes reflect nucleosome positioning and chromatin state in source cells:

| Fragment Size | Source | Biological Significance |
|---|---|---|
| ~145bp | Core nucleosome | Minimal DNA protection |
| ~166bp | Mono-nucleosome + linker | "Classic" cfDNA peak |
| ~334bp | Di-nucleosome | Stable chromatin regions |
| 10bp periodicity | DNA helical pitch | Rotational phasing |

Cancer Signature

| Signal | Healthy Plasma | Cancer (ctDNA) |
|---|---|---|
| Modal peak | ~166bp | Left-shifted (~145bp) |
| 10bp periodicity | Clear | Often disrupted |
| Arm-level variation | Minimal | Increased (correlates with CNAs) |

Why arm-level?

Chromosome arms have distinct chromatin environments. Tumor-derived cfDNA shows arm-specific fragmentation shifts that correlate with copy number alterations.


Usage

# Basic usage
krewlyzer fsd -i sample.bed.gz -o output_dir/ --genome hg19

# With PON for log-ratio normalization
krewlyzer fsd -i sample.bed.gz -o output_dir/ -P msk-access.pon.parquet

# Panel data (MSK-ACCESS) with on/off-target split
krewlyzer run-all -i sample.bam -r ref.fa -o out/ \
    --target-regions panel_targets.bed

Options

| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| --input | -i | PATH | required | Input .bed.gz file (from extract) |
| --output | -o | PATH | required | Output directory |
| --sample-name | -s | TEXT | | Override sample name |
| --arms-file | -a | PATH | | Chromosome arms BED file |
| --target-regions | -T | PATH | | Target BED (enables on/off split) |
| --skip-target-regions | | FLAG | | Force WGS mode (ignore bundled targets) |
| --assay | -A | TEXT | | Assay code (xs1/xs2) for bundled assets |
| --genome | -G | TEXT | hg19 | Genome build (hg19/hg38) |
| --pon-model | -P | PATH | | PON model for log-ratio normalization |
| --pon-variant | | TEXT | all_unique | PON variant: all_unique or duplex |
| --skip-pon | | FLAG | | Skip PON normalization |
| --gc-correct | | FLAG | True | Apply GC bias correction |
| --threads | -t | INT | 0 | Threads (0=all cores) |
| --verbose | -v | FLAG | | Enable verbose logging |

Output Files

{sample}.FSD.tsv (Off-Target / Default)

| Column | Type | Description |
|---|---|---|
| region | str | Chromosome arm (e.g., "chr1:0-125000000") |
| 65-69, 70-74, ..., 995-999 | float | GC-weighted counts in 187 bins (5bp steps, 65-999bp) |
| total | float | Sum of all bins |
| 65-69_logR, ... | float | log2((sample + 1) / (PoN_expected + 1)) (with -P) |
| pon_stability | float | 1 / (variance + 0.01) (with -P) |
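
As a rough illustration of how this schema might be consumed, the sketch below parses a tab-separated table and reports the modal size bin for each arm. The helper name `modal_bins` and the three-bin example text are hypothetical. A real file carries all 187 bin columns, and the numbers here are made up for the example.

```python
import csv
import io
import re

# Matches plain bin columns such as "165-169" but not "165-169_logR".
BIN_COLUMN = re.compile(r"^\d+-\d+$")


def modal_bins(tsv_text):
    """Return {region: label of the bin with the highest count}."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    result = {}
    for row in reader:
        counts = {
            name: float(value)
            for name, value in row.items()
            if BIN_COLUMN.match(name)
        }
        result[row["region"]] = max(counts, key=counts.get)
    return result


# Hypothetical, heavily truncated example (3 of 187 bins, invented counts).
EXAMPLE_TSV = (
    "region\t160-164\t165-169\t170-174\ttotal\n"
    "chr1:0-125000000\t10.5\t42.0\t12.5\t65.0\n"
)

print(modal_bins(EXAMPLE_TSV))
```

For an actual output file, the same function could be fed the file contents, for example `modal_bins(open(path).read())`, where `path` points at a `{sample}.FSD.tsv`.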

{sample}.FSD.ontarget.tsv (Panel Mode Only)

Same schema as above, but for fragments overlapping target regions.

Important

Off-target = unbiased (preferred for biomarkers)
On-target = capture-biased (use cautiously for local analysis only)


GC Correction

When --gc-correct is enabled (default):

Normalization Order:

1. GC-weighting (Rust): raw_count × gc_correction_factor
2. PoN log-ratio (Python): log2((sample + 1) / (pon + 1))

| GC Option | Effect |
|---|---|
| Enabled | Corrects for PCR/capture GC bias |
| Disabled (--no-gc-correct) | Raw counts (faster, biased) |
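
As a worked example of the normalization order, using made-up numbers rather than values from any real sample: a bin with a raw count of 120 and a GC correction factor of 0.9, compared against a PoN expected value of 90, would pass through the two steps as follows.

\[ 120 \times 0.9 = 108, \qquad \log_2 \left( \frac{108 + 1}{90 + 1} \right) = \log_2 \left( \frac{109}{91} \right) \approx 0.26 \]

The GC-weighted count (108), not the raw count, is what enters the log-ratio.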

Tip

See GC Correction Details for the LOESS algorithm.


PON Integration

With --pon-model, FSD outputs include log-ratio normalization:

| Column | Formula | Interpretation |
|---|---|---|
| {bin}_logR | log2((sample + 1) / (PoN_expected + 1)) | > 0 = above normal |
| pon_stability | 1 / (variance + 0.01) | Higher = more reliable |

Formulas:

Log-Ratio:

\[ \text{logR} = \log_2 \left( \frac{\text{sample\_count} + 1}{\text{PoN\_expected} + 1} \right) \]

PON Stability:

\[ \text{stability} = \frac{1}{\text{variance} + 0.01} \]

Algorithm:

1. For each arm and size bin, retrieve the PoN expected value.
2. Compute the log-ratio with a pseudocount (+1) for zero-handling.
3. Calculate stability from the PoN variance (inverse weighting).
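
The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the code in `fsd_processor.py`: the function and constant names are hypothetical, and only the pseudocount (+1) and the 0.01 offset are taken from the formulas in this section.

```python
import math

PSEUDOCOUNT = 1.0          # "+1" in the log-ratio formula
STABILITY_OFFSET = 0.01    # "+0.01" in the stability formula


def log_ratio(sample_count, pon_expected):
    """log2((sample + 1) / (PoN_expected + 1)); defined even when both are 0."""
    return math.log2((sample_count + PSEUDOCOUNT) / (pon_expected + PSEUDOCOUNT))


def pon_stability(pon_variance):
    """1 / (variance + 0.01); low-variance bins receive a larger weight."""
    return 1.0 / (pon_variance + STABILITY_OFFSET)


# A bin with equal sample and PoN values has logR = 0.
print(log_ratio(50.0, 50.0))
# A bin observed at 3 where the PoN expects 1 gives log2(4 / 2) = 1.
print(log_ratio(3.0, 1.0))
```

The pseudocount keeps empty bins finite: a bin with zero counts in both the sample and the PoN yields a log-ratio of 0 rather than an undefined value.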

Tip

See PON Models for model structure and building.


Clinical Interpretation

Interpreting Log-Ratio Values

| *_logR Value | Meaning | Possible Cause |
|---|---|---|
| ~0 | Normal | No deviation from healthy |
| > 0.5 | Elevated short fragments | Tumor-derived cfDNA |
| < -0.5 | Depleted | Copy number loss? |

Arm-Level Variation

| Pattern | Interpretation |
|---|---|
| Uniform across arms | Healthy profile |
| Single arm deviation | Focal CNA or arm-level event |
| Multiple arm deviations | High tumor fraction (aneuploidy) |
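
One way to turn these patterns into a quick screen is sketched below. It rests on assumptions that do not come from krewlyzer itself: each arm is summarized by the median of its `*_logR` values, the ±0.5 cut-off is borrowed from the log-ratio table above, and the arm labels and function names are hypothetical.

```python
from statistics import median

LOGR_THRESHOLD = 0.5  # borrowed from the +/-0.5 guidance above (assumption)


def deviating_arms(arm_logr):
    """Return the arms whose median logR lies outside +/- LOGR_THRESHOLD."""
    return sorted(
        arm
        for arm, values in arm_logr.items()
        if abs(median(values)) > LOGR_THRESHOLD
    )


def describe_pattern(arm_logr):
    """Map the number of deviating arms onto the patterns in the table."""
    n = len(deviating_arms(arm_logr))
    if n == 0:
        return "uniform across arms"
    if n == 1:
        return "single arm deviation"
    return "multiple arm deviations"


# Hypothetical per-arm logR values (three bins per arm, for brevity).
example = {
    "arm_A": [0.0, 0.1, -0.1],
    "arm_B": [0.7, 0.9, 0.6],
}
print(deviating_arms(example), describe_pattern(example))
```

A median is used here only because it is robust to a few noisy bins. A weighted mean using `pon_stability` as the weight would be another reasonable choice.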

187-Bin Structure (65-999bp, 5bp steps)

| Bin Range | Size Range | Description |
|---|---|---|
| 0 | 65-69bp | Ultra-short (sub-nucleosomal) |
| 1-6 | 70-99bp | Short fragments |
| 7-16 | 100-149bp | Core mono-nucleosomal |
| 17-30 | 150-219bp | Peak mono-nucleosomal |
| 31-50 | 220-319bp | Di-nucleosomal |
| 51-86 | 320-499bp | Multi-nucleosomal |
| 87-186 | 500-999bp | Extended range |

Note

The pipeline uses 5bp bins from 65bp to 999bp, yielding 187 bin columns. Extending the range to 999bp captures multi-nucleosomal fragments for comprehensive FSD analysis.
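
The bin layout follows directly from those three numbers (65bp start, 999bp end, 5bp width). The sketch below reproduces the column labels and maps a fragment length to its bin index. The function names are hypothetical, and the actual binning happens in the Rust backend.

```python
FSD_MIN_BP = 65    # lower edge of bin 0
FSD_MAX_BP = 999   # upper edge of bin 186
BIN_WIDTH_BP = 5


def bin_labels():
    """Column labels "65-69", "70-74", ..., "995-999"."""
    return [
        f"{start}-{start + BIN_WIDTH_BP - 1}"
        for start in range(FSD_MIN_BP, FSD_MAX_BP + 1, BIN_WIDTH_BP)
    ]


def bin_index(fragment_length):
    """Return the 0-based bin index, or None if the length is out of range."""
    if not FSD_MIN_BP <= fragment_length <= FSD_MAX_BP:
        return None
    return (fragment_length - FSD_MIN_BP) // BIN_WIDTH_BP


labels = bin_labels()
print(len(labels), labels[0], labels[-1])
# A 166bp fragment falls in the "Peak mono-nucleosomal" range (bins 17-30).
print(bin_index(166), labels[bin_index(166)])
```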


Panel Data Mode

For targeted sequencing (MSK-ACCESS):

krewlyzer fsd -i sample.bed.gz -o output/ \
    --target-regions MSK-ACCESS-v2_targets.bed

| Output | Contents | Use Case |
|---|---|---|
| .FSD.tsv | Off-target fragments | Unbiased arm-level biomarkers |
| .FSD.ontarget.tsv | On-target fragments | Local context (capture-biased) |

Warning

On-target FSD has capture bias and should not be used for global fragmentomics.


See Also