Counting & Metrics¶

How allele classifications become counts — read-level metrics, fragment counting, strand bias, and output columns.

Visual Overview

Read filter and counting metrics poster — The read-filter cascade and counting metrics — click to enlarge

Overview¶

After each read is classified as REF, ALT, or neither, gbcms accumulates counts at two levels: individual reads and collapsed fragments. It then computes strand bias statistics using Fisher's exact test.

flowchart LR
    Class([🧬 Allele Classification]) --> ReadCount([📖 Read-Level Counts])
    Class --> FragTrack([🔗 Fragment Tracking])
    FragTrack --> Consensus([⚖️ Quality-Weighted Consensus])
    Consensus --> FragCount([🧬 Fragment-Level Counts])
    ReadCount --> SB([📊 Strand Bias])
    FragCount --> FSB([📊 Fragment Strand Bias])

    classDef metrics fill:#3498db,color:#fff,stroke:#2471a3,stroke-width:2px;
    class ReadCount,FragCount,SB,FSB metrics;

Use mouse to pan and zoom

Read-Level Counting¶

Each read passing filters is counted independently. No deduplication is applied at this level.

flowchart TD
    Fetch(["📥 Fetch reads from
±5bp window"]) --> Filters["🔍 Apply read filters
(MAPQ, dup, etc.)"]
    Filters --> Classify["🧬 Allele classification
(check_snp / check_insertion / etc.)"]
    Classify --> Anchor{"Read start ≤
variant POS?"}
    Anchor -->|"Yes"| DP["✅ Count in DP
(REF, ALT, or neither)"]
    Anchor -->|"No"| ClassCheck{"Classified as
REF or ALT?"}
    ClassCheck -->|"Yes (shifted indel)"| DP
    ClassCheck -->|"No"| Skip(["⏭️ Skip — outside
variant footprint"]):::skip
    DP --> Allele{"Allele?"}
    Allele -->|REF| RD["RD += 1"]:::ref
    Allele -->|ALT| AD["AD += 1"]:::alt
    Allele -->|Neither| Neither["DP only
(no RD/AD)"]:::neither

    classDef ref fill:#27ae60,color:#fff,stroke:#1e8449,stroke-width:2px;
    classDef alt fill:#e74c3c,color:#fff,stroke:#c0392b,stroke-width:2px;
    classDef neither fill:#95a5a6,color:#fff,stroke:#7f8c8d,stroke-width:2px;
    classDef skip fill:#bdc3c7,color:#000,stroke:#95a5a6;

Use mouse to pan and zoom

Anchor Overlap Standard

DP gating uses single-position anchor overlap: read_start ≤ variant.pos. This matches the depth definition used by Mutect2, VarDictJava, and samtools mpileup — depth is measured at the variant position, not across the entire REF allele span. Reads fetched from the wider ±5bp window that don't overlap the anchor are excluded from DP unless classified as REF/ALT via shifted indel detection.

Read Metrics¶

Metric	Description
DP	Total depth — all mapped, quality-filtered reads whose alignment start ≤ variant POS, regardless of allele classification. Includes reads that are neither REF nor ALT (e.g., third alleles at multi-allelic sites). `DP ≥ RD + AD`.
RD / AD	Reference / Alternate read counts
DP_fwd / DP_rev	Strand-specific total depth
RD_fwd / RD_rev	Strand-specific reference counts
AD_fwd / AD_rev	Strand-specific alternate counts

Fragment Counting¶

Fragment counting collapses read pairs into a single observation per fragment. This is critical for cfDNA sequencing (like MSK-ACCESS) where the same DNA fragment is sequenced from both ends — counting reads would double-count each fragment.

Fragment Tracking¶

Fragments are tracked using QNAME hashing — each read's QNAME is hashed to a u64 key for memory-efficient lookup. The FragmentEvidence struct records the best base quality seen for REF and ALT across both reads in the pair.

flowchart TD
    Start([For each read passing filters]):::start
    Start --> Hash["Hash QNAME → u64 key"]
    Hash --> Observe["Record allele + base quality into FragmentEvidence"]
    Observe --> Done[After all reads processed]
    Done --> ForEach[For each unique fragment]
    ForEach --> HasBoth{Both REF and ALT evidence?}
    HasBoth -->|No| Direct[Assign to whichever allele was seen]
    HasBoth -->|Yes| QualCheck{"Quality difference > threshold?"}
    QualCheck -->|"REF qual >> ALT qual"| Ref([✅ Count as REF]):::ref
    QualCheck -->|"ALT qual >> REF qual"| Alt([🔴 Count as ALT]):::alt
    QualCheck -->|"Within threshold"| Discard([⚪ Discard — ambiguous]):::discard
    Direct --> Count
    Ref --> Count
    Alt --> Count
    Discard --> DPF["Still counted in DPF"]:::info
    Count[Update RDF / ADF / DPF + strand counts]

    classDef start fill:#3498db,color:#fff,stroke:#2471a3,stroke-width:2px;
    classDef ref fill:#27ae60,color:#fff,stroke:#1e8449,stroke-width:2px;
    classDef alt fill:#e74c3c,color:#fff,stroke:#c0392b,stroke-width:2px;
    classDef discard fill:#95a5a6,color:#fff,stroke:#7f8c8d,stroke-width:2px;
    classDef info fill:#3498db15,stroke:#3498db;

Use mouse to pan and zoom

Quality-Weighted Consensus¶

When R1 and R2 of a fragment disagree (one supports REF, the other ALT), the engine resolves the conflict using base quality scores:

Scenario	Condition	Result
REF wins	`best_ref_qual > best_alt_qual + threshold`	Count as REF
ALT wins	`best_alt_qual > best_ref_qual + threshold`	Count as ALT
Ambiguous	Quality difference ≤ threshold	Discard (neither REF nor ALT)

The threshold is configurable via --fragment-qual-threshold (default: 10).

Why Discard Instead of Defaulting to REF?

Assigning ambiguous fragments to REF would systematically deflate VAF by inflating the reference count. In cfDNA sequencing where true variants can be at 0.1–1% VAF, this bias could mask real mutations. Discarding preserves an unbiased VAF estimate at the cost of slightly reduced power.

Fragment Orientation¶

Fragment strand is determined by read 1 orientation (preferred) or read 2 if read 1 is not available. This is consistent with standard library preparation conventions.

Fragment Metrics¶

Metric	Description
DPF	Fragment depth — all unique fragments (including discarded)
RDF / ADF	Reference / Alternate fragment counts (resolved only)
RDF_fwd / RDF_rev	Strand-specific reference fragment counts
ADF_fwd / ADF_rev	Strand-specific alternate fragment counts

Quality Signal: DPF − (RDF + ADF)

Discarded fragments are counted in DPF but not in RDF or ADF. The gap DPF − (RDF + ADF) reveals how many ambiguous fragments exist — a useful quality metric. A large gap suggests a noisy or error-prone site.

VAF Calculation¶

VAF = AD / (RD + AD)

Where AD and RD are the read-level alternate and reference counts. Fragment-level VAF can be similarly computed as ADF / (RDF + ADF).

Strand Bias (Fisher's Exact Test)¶

Strand bias detects when an allele is disproportionately supported by reads on one strand — a common sequencing artifact.

flowchart LR
    subgraph Table ["2×2 Contingency Table"]
        direction TB
        Header["| · | Fwd | Rev |"]
        RefRow["| REF | a | b |"]
        AltRow["| ALT | c | d |"]
    end

    Table --> Fisher([Fisher Exact Test]):::test
    Fisher --> Pval[SB_pval]
    Fisher --> OR[SB_OR]

    Pval --> Interpret{p < 0.05?}
    Interpret -->|Yes| Bias{{⚠️ Possible Strand Bias}}:::warn
    Interpret -->|No| NoBias([✅ No Strand Bias]):::pass

    classDef test fill:#3498db,color:#fff,stroke:#2471a3,stroke-width:2px;
    classDef warn fill:#e74c3c,color:#fff,stroke:#c0392b,stroke-width:2px;
    classDef pass fill:#27ae60,color:#fff,stroke:#1e8449,stroke-width:2px;

Use mouse to pan and zoom

Computed at both levels:

Metric	Level	Description
SB_pval / SB_OR	Read	Strand bias from individual reads
FSB_pval / FSB_OR	Fragment	Strand bias from collapsed fragments

Paired-End Data: Use FSB, Not SB

For paired-end sequencing (e.g., MSK-ACCESS), R1 and R2 from the same fragment are not independent observations. Read-level SB (SB_pval) artificially doubles the sample size N in the Fisher's test contingency table, producing deflated p-values that can falsely flag true variants as strand bias artifacts. Clinical filtering pipelines should use FSB_pval (fragment-level), which correctly treats each physical fragment as a single independent observation.

Strand Bias Example

If a variant has AD_fwd=15, AD_rev=1, that's suspicious — almost all ALT-supporting reads are on the forward strand. Fisher's test would yield a low p-value, flagging this as a potential artifact.

Complete Output Column Reference¶

All fields in the BaseCounts struct returned by count_bam():

Column	Type	Description
`dp`	u32	Total read depth — reads overlapping the variant anchor position, including 'neither' reads
`rd`	u32	Reference read count
`ad`	u32	Alternate read count
`dp_fwd` / `dp_rev`	u32	Strand-specific total depth
`rd_fwd` / `rd_rev`	u32	Strand-specific reference counts
`ad_fwd` / `ad_rev`	u32	Strand-specific alternate counts
`dpf`	u32	Fragment depth (all fragments)
`rdf`	u32	Reference fragment count
`adf`	u32	Alternate fragment count
`rdf_fwd` / `rdf_rev`	u32	Strand-specific reference fragment counts
`adf_fwd` / `adf_rev`	u32	Strand-specific alternate fragment counts
`sb_pval`	f64	Read-level strand bias p-value
`sb_or`	f64	Read-level strand bias odds ratio
`fsb_pval`	f64	Fragment-level strand bias p-value
`fsb_or`	f64	Fragment-level strand bias odds ratio
`used_decomposed`	bool	True if corrected homopolymer allele was used

mFSD — Mutant Fragment Size Distribution¶

mFSD compares insert-size distributions for REF-classified vs ALT-classified fragments at each variant position. Short-fragment enrichment in the ALT class indicates tumor-derived cfDNA.

Enabled with --mfsd. All 34 columns (and their VCF INFO equivalents) are absent from output without this flag.

Fragment Classes¶

Class	Definition
`REF`	Fragment supporting the reference allele, valid insert size (50–1000 bp)
`ALT`	Fragment supporting the alternate allele, valid insert size
`NonREF`	Fragment supporting a third allele (neither REF nor ALT)
`N`	Fragment where the base at the variant position was called `N`

MAF Columns (34 total)¶

Raw Counts¶

Column	Description
`mfsd_ref_count`	Fragments classified REF with valid insert size
`mfsd_alt_count`	Fragments classified ALT with valid insert size
`mfsd_nonref_count`	Fragments classified as a third allele
`mfsd_n_count`	Fragments with base `N` at the variant position

Log-Likelihood Ratios¶

Column	Description
`mfsd_alt_llr`	LLR for ALT fragments: Σ log(P_tumor/P_healthy). Positive = tumor-like (short fragments).
`mfsd_ref_llr`	LLR for REF fragments

Mean Fragment Sizes¶

Column	Description
`mfsd_ref_mean`	Mean insert size (bp) for REF fragments. `NA` when class is empty.
`mfsd_alt_mean`	Mean insert size (bp) for ALT fragments
`mfsd_nonref_mean`	Mean insert size (bp) for NonREF fragments
`mfsd_n_mean`	Mean insert size (bp) for N fragments

Pairwise KS Statistics (6 pairs × 3 values = 18 columns)¶

Each pair yields: delta (mean difference in bp), ks (KS D-statistic), pval (KS p-value). Values are NA when either class has fewer than 5 fragments (mfsd_ks_valid = False).

Pairs
ALT vs REF: `mfsd_delta_alt_ref`, `mfsd_ks_alt_ref`, `mfsd_pval_alt_ref`
ALT vs NonREF: `mfsd_delta_alt_nonref`, `mfsd_ks_alt_nonref`, `mfsd_pval_alt_nonref`
REF vs NonREF: `mfsd_delta_ref_nonref`, `mfsd_ks_ref_nonref`, `mfsd_pval_ref_nonref`
ALT vs N: `mfsd_delta_alt_n`, `mfsd_ks_alt_n`, `mfsd_pval_alt_n`
REF vs N: `mfsd_delta_ref_n`, `mfsd_ks_ref_n`, `mfsd_pval_ref_n`
NonREF vs N: `mfsd_delta_nonref_n`, `mfsd_ks_nonref_n`, `mfsd_pval_nonref_n`

Derived Metrics¶

Column	Description
`mfsd_error_rate`	NonREF / total_mFSD fragments. `NA` when total = 0.
`mfsd_n_rate`	N / total_mFSD fragments. `NA` when total = 0.
`mfsd_size_ratio`	mean(ALT) / mean(REF). `NA` when REF mean = 0 or ALT count = 0.
`mfsd_quality_score`	1 − error_rate − n_rate. `NA` when either rate is `NA`.
`mfsd_alt_confidence`	`HIGH` (≥5 ALT frags), `LOW` (1–4), `NONE` (0)
`mfsd_ks_valid`	`True` when both ALT and REF have ≥5 fragments

VCF INFO Fields (7 total)¶

Added to ##INFO header and per-variant INFO column when --mfsd is set.

INFO key	Type	Description
`MFSD_DELTA_ALT_REF`	Float	mean(ALT) − mean(REF) in bp
`MFSD_KS_ALT_REF`	Float	KS D-statistic (ALT vs REF)
`MFSD_PVAL_ALT_REF`	Float	KS p-value (ALT vs REF)
`MFSD_ALT_LLR`	Float	LLR for ALT fragments
`MFSD_REF_LLR`	Float	LLR for REF fragments
`MFSD_ALT_COUNT`	Integer	ALT-classified fragment count
`MFSD_REF_COUNT`	Integer	REF-classified fragment count

Parquet Output (--mfsd-parquet)¶

When --mfsd-parquet is also set, a companion <sample>.fsd.parquet is written with raw arrays:

Column	Type	Description
`chrom`	String	Chromosome
`pos`	Int64	1-based position
`ref`	String	Reference allele
`alt`	String	Alternate allele
`ref_sizes`	List\<Int32>	Insert sizes (bp) of all REF fragments
`alt_sizes`	List\<Int32>	Insert sizes (bp) of all ALT fragments

Written natively by the Rust engine (no pyarrow dependency).

Comparison with Original GBCMS¶

Feature	Original GBCMS	gbcms
Counting algorithm	Region-based chunking, position matching	Per-variant CIGAR traversal
Indel detection	Exact position match only	Windowed scan (±5bp) with 3-layer safeguards
Complex variants	Optional via `--generic_counting`	Always uses haplotype reconstruction
Complex quality handling	Exact match only	Masked comparison — unreliable bases excluded
Base quality filtering	No threshold	Default `--min-baseq 20`
MNP handling	Not explicit	Dedicated `check_mnp` with contiguity check
Fragment counting	Optional (`--fragment_count`), majority-rule	Always computed, quality-weighted consensus
Positive strand counts	Optional (`--positive_count`)	Always computed
Strand bias	Not computed	Fisher's exact test (read + fragment level)
Fractional depth	`--fragment_fractional_weight`	Not implemented
Parallelism	OpenMP block-based	Rayon per-variant

Allele Classification — How reads are classified
Read Filters — Which reads reach counting
Architecture — System design
Glossary — Term definitions

Counting & Metrics¶

Overview¶

Read-Level Counting¶

Read Metrics¶

Fragment Counting¶

Fragment Tracking¶

Quality-Weighted Consensus¶

Fragment Orientation¶

Fragment Metrics¶

VAF Calculation¶

Strand Bias (Fisher's Exact Test)¶

Complete Output Column Reference¶

mFSD — Mutant Fragment Size Distribution¶

Fragment Classes¶

MAF Columns (34 total)¶

Raw Counts¶

Log-Likelihood Ratios¶

Mean Fragment Sizes¶

Pairwise KS Statistics (6 pairs × 3 values = 18 columns)¶

Derived Metrics¶

VCF INFO Fields (7 total)¶

Parquet Output (--mfsd-parquet)¶

Comparison with Original GBCMS¶

Related¶