Mutant Fragment Size Distribution (mFSD)

Command: krewlyzer mfsd

Plain English: mFSD compares fragment sizes at known mutation sites. Fragments carrying the mutation (ALT) tend to be shorter than healthy fragments (REF).

Use case: MRD monitoring - track tumor DNA by comparing mutant vs. wild-type fragment sizes.

Purpose

Compares the size distribution of mutant vs. wild-type fragments at variant sites, with support for all small variant types and 4-way fragment classification.

Processing Flowchart

flowchart LR
    BAM["sample.bam"] --> RUST["Rust Backend"]
    VCF["variants.vcf/maf"] --> RUST
    REF["Reference FASTA"] --> RUST

    RUST --> MFSD["mFSD.tsv"]

    subgraph "Per Variant"
        RUST --> CLASS{"Classify Reads"}
        CLASS --> REF_F["REF fragments"]
        CLASS --> ALT_F["ALT fragments"]
        CLASS --> NONREF["NonREF"]
        CLASS --> N_F["N fragments"]
    end

    subgraph "Statistics"
        REF_F --> KS["KS Test"]
        ALT_F --> KS
        KS --> PVAL["p-value, Delta"]
    end

Biological Context

Mutant ctDNA fragments are typically shorter (~145bp) than wild-type cfDNA (~166bp) due to:

- Altered nucleosome positioning in tumor cells
- Different chromatin accessibility
- Enhanced apoptosis patterns

This module quantifies this difference using high-depth targeted sequencing data.

Variant Types Supported

Type	Example	Description
SNV	A>T	Single nucleotide variant
MNV	AT>GC	Multi-nucleotide variant
Insertion	A>ATG	Pure insertion
Deletion	ATG>A	Pure deletion
Complex	ATG>CT	Mixed substitution + indel

Usage

# Basic usage with VCF
krewlyzer mfsd -i sample.bam -V variants.vcf -o output_dir/

# With MAF file and GC correction
krewlyzer mfsd -i sample.bam -V variants.maf -o output/ \
    -r hg19.fa --correction-factors factors.csv

# With per-variant distributions
krewlyzer mfsd -i sample.bam -V variants.vcf -o output/ \
    --output-distributions

Dual BAM Support

For optimal results with duplex sequencing panels (e.g., MSK-ACCESS), use separate BAMs:

BAM Type	Use Case	Why
`all_unique`	FSC, FSD, WPS, OCF	Maximum coverage for background features
`duplex`	mFSD	Highest accuracy for variant detection

Via run-all CLI

# Provide dedicated duplex BAM for mFSD
krewlyzer run-all -i sample.all_unique.bam --mfsd-bam sample.duplex.bam \
    -r hg19.fa -o out/ --assay xs2 --variants sample.maf

Note

When --mfsd-bam is provided, duplex weighting is auto-enabled. If no duplex tags (cD/Marianas) are found, a warning is logged but processing continues with weight=1.0.

Via Nextflow Samplesheet

sample,bam,mfsd_bam,meth_bam,vcf,bed,maf,single_sample_maf,assay,pon,targets
ACCESS_001,/path/to/sample.all_unique.bam,/path/to/sample.duplex.bam,,,,/path/to/variants.maf,true,XS2,,

CLI Options

Option	Short	Type	Default	Description
`--input`	`-i`	PATH	required	Input BAM file
`--variants`	`-V`	PATH	required	VCF or MAF file with variants
`--output`	`-o`	PATH	required	Output directory
`--sample-name`	`-s`	TEXT		Override sample name
`--reference`	`-r`	PATH		Reference FASTA for GC correction
`--correction-factors`	`-F`	PATH		Pre-computed correction factors CSV
`--duplex`	`-D`	FLAG		Enable duplex consensus weighting (fgbio/Marianas)
`--mapq`	`-q`	INT	20	Minimum mapping quality
`--minlen`		INT	65	Minimum fragment length
`--maxlen`		INT	1000	Maximum fragment length
`--skip-duplicates`		FLAG	True	Skip duplicate reads (always enabled in backend)
`--require-proper-pair`		FLAG	False	Require proper pairs (disable for duplex BAMs)
`--output-distributions`	`-d`	FLAG		Output per-variant size distributions
`--verbose`	`-v`	FLAG		Enable verbose logging
`--threads`	`-t`	INT	0	Number of threads (0=all)

Duplex Sequencing Support

mFSD supports duplex consensus sequencing for ultra-sensitive variant detection. When --duplex is enabled, fragments from high-confidence duplex families receive higher weight in statistical calculations.

Supported Duplex Formats

XS2: fgbio/Picard

Uses SAM auxiliary tags set by fgbio CallDuplexConsensusReads:

Tag	Meaning	Use in mFSD
`aD`	Max depth of strand A single-strand consensus	SS-A depth
`bD`	Max depth of strand B single-strand consensus	SS-B depth
`cD`	Max depth of duplex consensus	Primary weight source
`aM/bM/cM`	Min depth at any point	Quality check
`aE/bE/cE`	Error rate	Quality filter

Example SAM record:

read1   0   chr1   1000   60   100M   *   0   150   ACGT...   IIII...   cD:i:5   aD:i:8   bD:i:7
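As a rough illustration of how the depth tags in such a record can be read, the sketch below splits a tab-delimited SAM line and collects its integer-typed optional fields. The helper name `sam_int_tags` is hypothetical and is not part of krewlyzer or fgbio; the sequence and quality strings are shortened placeholders.

```python
def sam_int_tags(sam_line: str) -> dict:
    """Return the integer-typed optional tags (TAG:i:VALUE) of one SAM record."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    # The first 11 columns are the mandatory SAM fields; tags start at column 12.
    for field in fields[11:]:
        tag, value_type, value = field.split(":", 2)
        if value_type == "i":
            tags[tag] = int(value)
    return tags


# Record modeled on the example above (SEQ/QUAL shortened for illustration).
record = "\t".join([
    "read1", "0", "chr1", "1000", "60", "100M", "*", "0", "150",
    "ACGT", "IIII", "cD:i:5", "aD:i:8", "bD:i:7",
])

tags = sam_int_tags(record)
print(tags["cD"], tags["aD"], tags["bD"])
```

In practice a BAM library would normally expose these tags directly; the manual split is only meant to show where `cD`, `aD`, and `bD` sit in the record.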

XS1: Marianas

Encodes family size in the read name per Marianas documentation:

Marianas:UMI1+UMI2:contig:start:posCount:negCount:contig2:start2:pos2:neg2

Position	Field	Example
0	`Marianas`	Marianas
1	UMI	ACT+TTA
2	Read1 contig	2
3	Read1 start	48033828
4	Read1 (+) count	4
5	Read1 (-) count	3

Family size = posCount + negCount (e.g., 4 + 3 = 7)
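The family-size rule above can be sketched as a small parser. This assumes the colon-delimited layout shown in the table; the function name `marianas_family_size` and the read-2 fields in the example name are illustrative assumptions, not values taken from Marianas or krewlyzer.

```python
def marianas_family_size(read_name: str) -> int:
    """Family size = posCount + negCount, read from fields 4 and 5 of the name."""
    fields = read_name.split(":")
    if len(fields) < 6 or fields[0] != "Marianas":
        raise ValueError(f"not a Marianas-style read name: {read_name!r}")
    pos_count = int(fields[4])  # Read1 (+) strand count
    neg_count = int(fields[5])  # Read1 (-) strand count
    return pos_count + neg_count


# First six fields follow the table above; the remaining four are made up.
name = "Marianas:ACT+TTA:2:48033828:4:3:2:48033899:4:3"
print(marianas_family_size(name))  # 4 + 3 = 7
```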

Duplex Weighting Formula

When --duplex is enabled, each fragment receives a weight based on its duplex family size:

\[ \text{weight} = \max(\ln(\text{family\_size}), 1.0) \]

Family Size	Weight	Interpretation
1	1.0	Single read (no UMI collapse)
2	1.0	Minimum duplex (capped)
3	1.1	Low-confidence duplex
5	1.6	Moderate confidence
10	2.3	High confidence
50	3.9	Very high confidence
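The table can be reproduced directly from the formula. The following is a minimal sketch, not the Rust backend's code; `duplex_weight` and `weighted_count` are hypothetical helper names.

```python
import math


def duplex_weight(family_size: int) -> float:
    """weight = max(ln(family_size), 1.0), floored at 1.0 for small families."""
    if family_size < 1:
        raise ValueError("family_size must be >= 1")
    return max(math.log(family_size), 1.0)


def weighted_count(family_sizes) -> float:
    """Sum of per-fragment weights, the kind of total a *_Weighted column holds."""
    return sum(duplex_weight(fs) for fs in family_sizes)


# Reproduces the Weight column: 1.0, 1.0, 1.1, 1.6, 2.3, 3.9
for fs in (1, 2, 3, 5, 10, 50):
    print(fs, round(duplex_weight(fs), 1))
```

Because of the floor at 1.0, a fragment never counts for less than one raw read, so a weighted total is always at least the raw count.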

Interpretation in Duplex Mode

Important

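When --duplex is enabled, ALT_Weighted and REF_Weighted can exceed the raw counts, because every fragment from a family of 3 or more reads carries a weight above 1.0. A weighted-to-raw ratio of ~1.6 corresponds to a typical duplex family size of about 5 (ln 5 ≈ 1.6).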

Column	Without Duplex	With Duplex
`ALT_Count`	Raw fragment count	Raw fragment count
`ALT_Weighted`	= ALT_Count	= Σ(weight × fragment)
`VAF_GC_Corrected`	Based on raw	Based on weighted counts

Example:

Variant at chr1:156845927
  ALT_Count    = 393
  ALT_Weighted = 691  (ratio = 1.76)
  → Indicates high-confidence duplex families

Log-Likelihood Ratio (LLR) Scoring

For low fragment count scenarios (common in duplex/panel sequencing), traditional KS tests are unreliable. mFSD provides LLR scoring as a probabilistic alternative.

LLR Formula

\[ \text{LLR} = \sum_{i=1}^{n} \left[ \log P(\text{size}_i | \text{Tumor}) - \log P(\text{size}_i | \text{Healthy}) \right] \]

Using Gaussian models:

- Healthy: μ = 167bp, σ = 35bp (nucleosomal periodicity)
- Tumor: μ = 145bp, σ = 25bp (sub-nucleosomal fragments)
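To make the formula concrete, here is a minimal sketch that evaluates it with the Gaussian parameters listed above. It assumes natural logarithms and plain, untruncated normal densities; the backend may differ in such details, and the names `gaussian_logpdf` and `llr` are hypothetical.

```python
import math

HEALTHY = (167.0, 35.0)  # (mu, sigma) in bp, from the text above
TUMOR = (145.0, 25.0)


def gaussian_logpdf(x: float, mu: float, sigma: float) -> float:
    """Natural log of the normal density N(mu, sigma^2) evaluated at x."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)


def llr(sizes, tumor=TUMOR, healthy=HEALTHY) -> float:
    """Sum over fragments of log P(size | Tumor) - log P(size | Healthy)."""
    return sum(
        gaussian_logpdf(s, *tumor) - gaussian_logpdf(s, *healthy)
        for s in sizes
    )


short_fragments = [140, 145, 150]  # illustrative sizes, not real data
long_fragments = [165, 170, 200]
print(llr(short_fragments) > 0)  # short fragments favor the tumor model
print(llr(long_fragments) < 0)   # long fragments favor the healthy model
```

Since the score is a sum over fragments, its magnitude grows with the number of fragments as well as with how strongly each one favors a model.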

LLR Output Columns

Column	Range	Interpretation
`ALT_LLR`	any	Log-likelihood ratio for ALT fragments
`REF_LLR`	any	Log-likelihood ratio for REF fragments

LLR Interpretation Guide

LLR Value	Interpretation	Action
> 5	Strong tumor signal	High confidence in tumor-derived fragments
0 to 5	Weak tumor signal	Possible tumor, verify with other evidence
-5 to 0	Weak healthy signal	Likely healthy, low tumor content
< -5	Strong healthy signal	Consistent with healthy cfDNA
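The guide can be expressed as a small lookup. The table does not say which band the boundary values 5, 0, and -5 belong to, so the choices below are assumptions; `interpret_llr` is a hypothetical helper, not a krewlyzer function.

```python
def interpret_llr(llr_value: float) -> str:
    """Map an LLR value to the interpretation bands in the table above."""
    if llr_value > 5:
        return "Strong tumor signal"
    if llr_value >= 0:   # assumption: 0 and 5 fall in the weak-tumor band
        return "Weak tumor signal"
    if llr_value >= -5:  # assumption: -5 falls in the weak-healthy band
        return "Weak healthy signal"
    return "Strong healthy signal"


print(interpret_llr(4.2))
print(interpret_llr(-89.5))
```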

Tip

For low-N variants (ALT_Count < 5), use ALT_LLR instead of KS_Pval_ALT_REF. The LLR is robust with even 1-2 fragments, while KS tests require ≥5.

Clinical Example: MRD Detection

Variant at TP53:chr17:7577539
  ALT_Count = 3           # Too few for KS test
  KS_Valid  = FALSE       # KS test unreliable
  ALT_LLR   = 4.2         # Positive = tumor-like fragments
  REF_LLR   = -89.5       # Negative = healthy-like REF population

  → Interpretation: ALT fragments show tumor signature despite low count

Cross-Species and Assay Support

The LLR model uses Gaussian distributions for healthy and tumor fragment length peaks. These parameters can be customized for different species or library preparations.

Built-in Presets

Preset	Healthy μ	Healthy σ	Tumor μ	Tumor σ	Use Case
human (default)	167bp	35bp	145bp	25bp	Human cfDNA
canine	153bp	30bp	135bp	22bp	Canine cfDNA
ssdna	160bp	40bp	140bp	30bp	Single-stranded library prep

Biological Rationale

Fragment length peaks vary across species due to:

- Nucleosome spacing differences: canine nucleosomes are more tightly packed
- Chromatin structure: different histone modifications
- Library preparation: ssDNA preps capture shorter fragments

flowchart LR
    subgraph "Human cfDNA"
        H_HEALTHY["Healthy: ~167bp"] 
        H_TUMOR["Tumor: ~145bp"]
    end

    subgraph "Canine cfDNA"
        C_HEALTHY["Healthy: ~153bp"]
        C_TUMOR["Tumor: ~135bp"]
    end

    H_HEALTHY --> DELTA1["Δ = -22bp"]
    H_TUMOR --> DELTA1
    C_HEALTHY --> DELTA2["Δ = -18bp"]
    C_TUMOR --> DELTA2

Python API Access

from krewlyzer._core import LLRModelParams

# Use built-in presets
human_params = LLRModelParams.human()
canine_params = LLRModelParams.canine()
ssdna_params = LLRModelParams.ssdna()

# Custom parameters
custom = LLRModelParams(
    healthy_mu=160.0,
    healthy_sigma=32.0,
    tumor_mu=140.0,
    tumor_sigma=20.0
)

Note

CLI support for preset selection is planned for a future release. Currently, the Rust backend defaults to human parameters.

Fragment Classification

Fragments are classified into 4 categories:

flowchart TB
    READ["Fragment at variant site"] --> CHECK{"Base at variant?"}
    CHECK -->|"Matches REF"| REF["REF category"]
    CHECK -->|"Matches ALT"| ALT["ALT category"]
    CHECK -->|"Other base"| NONREF["NonREF category"]
    CHECK -->|"N base"| N["N category"]

Category	Description	Interpretation
REF	Supports reference allele	Healthy cfDNA
ALT	Supports alternate allele	Tumor signal
NonREF	Other base (not REF/ALT/N)	Sequencing errors
N	Contains N at variant	Low quality
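For a single-base variant, the four-way decision can be sketched as below. This is a simplification limited to SNVs: MNVs, insertions, deletions, and complex variants need an alignment-aware comparison that is not shown here. `classify_base` is a hypothetical name.

```python
def classify_base(base: str, ref: str, alt: str) -> str:
    """Assign an observed base at an SNV site to REF, ALT, NonREF, or N."""
    base, ref, alt = base.upper(), ref.upper(), alt.upper()
    if base == "N":
        return "N"       # no-call at the variant position
    if base == ref:
        return "REF"     # supports the reference allele
    if base == alt:
        return "ALT"     # supports the alternate allele
    return "NonREF"      # a third base, e.g. a sequencing error


# Observed bases at a hypothetical A>T site
print([classify_base(b, "A", "T") for b in "ATGN"])
```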

Formulas

KS Test (Kolmogorov-Smirnov)

\[ \text{KS statistic} = \max |F_1(x) - F_2(x)| \]

Where:

- \(F_1(x)\) = CDF of ALT fragment sizes
- \(F_2(x)\) = CDF of REF fragment sizes
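The statistic can be computed from two lists of fragment sizes by comparing their empirical CDFs at every observed size. This is a plain-Python sketch of the statistic only; it does not compute the p-value, and `ks_statistic` is a hypothetical name rather than the backend's function.

```python
from bisect import bisect_right


def ks_statistic(alt_sizes, ref_sizes) -> float:
    """Two-sample KS statistic: max |F_1(x) - F_2(x)| over all observed sizes."""
    if not alt_sizes or not ref_sizes:
        raise ValueError("both samples must be non-empty")
    alt = sorted(alt_sizes)
    ref = sorted(ref_sizes)
    d_max = 0.0
    for x in sorted(set(alt) | set(ref)):
        f1 = bisect_right(alt, x) / len(alt)  # empirical CDF of ALT at x
        f2 = bisect_right(ref, x) / len(ref)  # empirical CDF of REF at x
        d_max = max(d_max, abs(f1 - f2))
    return d_max


# Illustrative sizes only: fully separated samples give the maximum value 1.0
print(ks_statistic([140, 142, 150], [160, 166, 170]))
```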

Size Delta

\[ \Delta_{\text{ALT-REF}} = \text{ALT\_MeanSize} - \text{REF\_MeanSize} \]

Expected:

- Healthy: \(\approx 0\)
- Cancer: \(< 0\) (ALT shorter)

VAF Proxy

\[ \text{VAF\_Proxy} = \frac{\text{ALT\_Count}}{\text{REF\_Count} + \text{ALT\_Count}} \]
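The size delta and VAF proxy formulas above amount to a few lines of arithmetic. The sketch below uses hypothetical helper names, and returning NaN when no REF or ALT fragments are present is an assumption, not documented krewlyzer behavior.

```python
import math
from statistics import mean


def size_delta(alt_sizes, ref_sizes) -> float:
    """Delta_ALT-REF = mean ALT size - mean REF size (negative if ALT is shorter)."""
    return mean(alt_sizes) - mean(ref_sizes)


def vaf_proxy(alt_count: int, ref_count: int) -> float:
    """VAF_Proxy = ALT_Count / (REF_Count + ALT_Count)."""
    total = alt_count + ref_count
    if total == 0:
        return math.nan  # assumption: undefined when the site has no coverage
    return alt_count / total


# Illustrative values only
print(size_delta([140, 150], [160, 170]))  # -20
print(vaf_proxy(1, 3))                     # 0.25
```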

Output Format

Main Output: `{sample}.mFSD.tsv`

Column Group	Columns	Description
Variant Info	Chrom, Pos, Ref, Alt, VarType	Variant details
Counts	REF/ALT/NonREF/N_Count	Fragment counts per category
Mean Sizes	REF/ALT/NonREF/N_MeanSize	Average fragment size
KS Tests	Delta_*, KS_*, KS_Pval_*	Pairwise comparisons
Derived	VAF_Proxy, Size_Ratio, Quality_Score	Computed metrics
Flags	ALT_Confidence, KS_Valid	Quality indicators

Optional: `{sample}.mFSD.distributions.tsv`

With --output-distributions:

Chrom  Pos    Ref  Alt  Category  Size  Count
chr1   12345  A    T    REF       145   3
chr1   12345  A    T    REF       166   12
chr1   12345  A    T    ALT       142   2
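A distributions file of this shape can be collapsed back into per-category mean sizes. The sketch assumes the file is tab-delimited with exactly the header shown above, and embeds the three example rows instead of reading a real file; `mean_size_by_category` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# The example rows above, written out as tab-separated text.
EXAMPLE = (
    "Chrom\tPos\tRef\tAlt\tCategory\tSize\tCount\n"
    "chr1\t12345\tA\tT\tREF\t145\t3\n"
    "chr1\t12345\tA\tT\tREF\t166\t12\n"
    "chr1\t12345\tA\tT\tALT\t142\t2\n"
)


def mean_size_by_category(handle) -> dict:
    """Count-weighted mean fragment size per (variant, category)."""
    sums = defaultdict(lambda: [0, 0])  # key -> [sum of size * count, sum of count]
    for row in csv.DictReader(handle, delimiter="\t"):
        key = (row["Chrom"], row["Pos"], row["Ref"], row["Alt"], row["Category"])
        count = int(row["Count"])
        sums[key][0] += int(row["Size"]) * count
        sums[key][1] += count
    return {key: total / n for key, (total, n) in sums.items() if n}


means = mean_size_by_category(io.StringIO(EXAMPLE))
for key, value in means.items():
    print(key, round(value, 1))
```

For a real run, the `io.StringIO(EXAMPLE)` argument would be replaced by an open file handle on the distributions TSV.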

Clinical Interpretation

Metric	Healthy	Cancer (ctDNA)
`Delta_ALT_REF`	~0	Negative (ALT shorter)
`Size_Ratio`	~1.0	< 1.0
`VAF_Proxy`	0	> 0 (correlates with TF)

MRD Settings

- Low fragment counts (1-2) produce NA
- ALT_Confidence: HIGH (≥5), LOW (1-4), NONE (0)
- KS_Valid: TRUE if REF and ALT each have ≥2 fragments
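The ALT_Confidence thresholds listed above can be written as a short function. This is only a sketch of the stated cut-offs, and `alt_confidence` is a hypothetical name.

```python
def alt_confidence(alt_count: int) -> str:
    """HIGH for >= 5 ALT fragments, LOW for 1-4, NONE for 0."""
    if alt_count >= 5:
        return "HIGH"
    if alt_count >= 1:
        return "LOW"
    return "NONE"


print([alt_confidence(n) for n in (0, 3, 5)])
```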
