Skip to content

Counting & Metrics

How allele classifications become counts — read-level metrics, fragment counting, strand bias, and output columns.

Visual Overview

Read filter and counting metrics poster
The read-filter cascade and counting metrics — click to enlarge

Overview

After each read is classified as REF, ALT, or neither, gbcms accumulates counts at two levels: individual reads and collapsed fragments. It then computes strand bias statistics using Fisher's exact test.

flowchart LR
    Class([🧬 Allele Classification]) --> ReadCount([📖 Read-Level Counts])
    Class --> FragTrack([🔗 Fragment Tracking])
    FragTrack --> Consensus([⚖️ Quality-Weighted Consensus])
    Consensus --> FragCount([🧬 Fragment-Level Counts])
    ReadCount --> SB([📊 Strand Bias])
    FragCount --> FSB([📊 Fragment Strand Bias])

    classDef metrics fill:#3498db,color:#fff,stroke:#2471a3,stroke-width:2px;
    class ReadCount,FragCount,SB,FSB metrics;
Use mouse to pan and zoom

Read-Level Counting

Each read passing filters is counted independently. No deduplication is applied at this level.

flowchart TD
    Fetch(["📥 Fetch reads from
±5bp window"]) --> Filters["🔍 Apply read filters
(MAPQ, dup, etc.)"] Filters --> Classify["🧬 Allele classification
(check_snp / check_insertion / etc.)"] Classify --> Anchor{"Read start ≤
variant POS?"} Anchor -->|"Yes"| DP["✅ Count in DP
(REF, ALT, or neither)"] Anchor -->|"No"| ClassCheck{"Classified as
REF or ALT?"} ClassCheck -->|"Yes (shifted indel)"| DP ClassCheck -->|"No"| Skip(["⏭️ Skip — outside
variant footprint"]):::skip DP --> Allele{"Allele?"} Allele -->|REF| RD["RD += 1"]:::ref Allele -->|ALT| AD["AD += 1"]:::alt Allele -->|Neither| Neither["DP only
(no RD/AD)"]:::neither classDef ref fill:#27ae60,color:#fff,stroke:#1e8449,stroke-width:2px; classDef alt fill:#e74c3c,color:#fff,stroke:#c0392b,stroke-width:2px; classDef neither fill:#95a5a6,color:#fff,stroke:#7f8c8d,stroke-width:2px; classDef skip fill:#bdc3c7,color:#000,stroke:#95a5a6;
Use mouse to pan and zoom

Anchor Overlap Standard

DP gating uses single-position anchor overlap: read_start ≤ variant.pos. This matches the depth definition used by Mutect2, VarDictJava, and samtools mpileup — depth is measured at the variant position, not across the entire REF allele span. Reads fetched from the wider ±5bp window that don't overlap the anchor are excluded from DP unless classified as REF/ALT via shifted indel detection.

Read Metrics

Metric Description
DP Total depth — all mapped, quality-filtered reads whose alignment start ≤ variant POS, regardless of allele classification. Includes reads that are neither REF nor ALT (e.g., third alleles at multi-allelic sites). DP ≥ RD + AD.
RD / AD Reference / Alternate read counts
DP_fwd / DP_rev Strand-specific total depth
RD_fwd / RD_rev Strand-specific reference counts
AD_fwd / AD_rev Strand-specific alternate counts

Fragment Counting

Fragment counting collapses read pairs into a single observation per fragment. This is critical for cfDNA sequencing (like MSK-ACCESS) where the same DNA fragment is sequenced from both ends — counting reads would double-count each fragment.

Fragment Tracking

Fragments are tracked using QNAME hashing — each read's QNAME is hashed to a u64 key for memory-efficient lookup. The FragmentEvidence struct records the best base quality seen for REF and ALT across both reads in the pair.

flowchart TD
    Start([For each read passing filters]):::start
    Start --> Hash["Hash QNAME → u64 key"]
    Hash --> Observe["Record allele + base quality into FragmentEvidence"]
    Observe --> Done[After all reads processed]
    Done --> ForEach[For each unique fragment]
    ForEach --> HasBoth{Both REF and ALT evidence?}
    HasBoth -->|No| Direct[Assign to whichever allele was seen]
    HasBoth -->|Yes| QualCheck{"Quality difference > threshold?"}
    QualCheck -->|"REF qual >> ALT qual"| Ref([✅ Count as REF]):::ref
    QualCheck -->|"ALT qual >> REF qual"| Alt([🔴 Count as ALT]):::alt
    QualCheck -->|"Within threshold"| Discard([⚪ Discard — ambiguous]):::discard
    Direct --> Count
    Ref --> Count
    Alt --> Count
    Discard --> DPF["Still counted in DPF"]:::info
    Count[Update RDF / ADF / DPF + strand counts]

    classDef start fill:#3498db,color:#fff,stroke:#2471a3,stroke-width:2px;
    classDef ref fill:#27ae60,color:#fff,stroke:#1e8449,stroke-width:2px;
    classDef alt fill:#e74c3c,color:#fff,stroke:#c0392b,stroke-width:2px;
    classDef discard fill:#95a5a6,color:#fff,stroke:#7f8c8d,stroke-width:2px;
    classDef info fill:#3498db15,stroke:#3498db;
Use mouse to pan and zoom

Quality-Weighted Consensus

When R1 and R2 of a fragment disagree (one supports REF, the other ALT), the engine resolves the conflict using base quality scores:

Scenario Condition Result
REF wins best_ref_qual > best_alt_qual + threshold Count as REF
ALT wins best_alt_qual > best_ref_qual + threshold Count as ALT
Ambiguous Quality difference ≤ threshold Discard (neither REF nor ALT)

The threshold is configurable via --fragment-qual-threshold (default: 10).

Why Discard Instead of Defaulting to REF?

Assigning ambiguous fragments to REF would systematically deflate VAF by inflating the reference count. In cfDNA sequencing where true variants can be at 0.1–1% VAF, this bias could mask real mutations. Discarding preserves an unbiased VAF estimate at the cost of slightly reduced power.

Fragment Orientation

Fragment strand is determined by read 1 orientation (preferred) or read 2 if read 1 is not available. This is consistent with standard library preparation conventions.

Fragment Metrics

Metric Description
DPF Fragment depth — all unique fragments (including discarded)
RDF / ADF Reference / Alternate fragment counts (resolved only)
RDF_fwd / RDF_rev Strand-specific reference fragment counts
ADF_fwd / ADF_rev Strand-specific alternate fragment counts

Quality Signal: DPF − (RDF + ADF)

Discarded fragments are counted in DPF but not in RDF or ADF. The gap DPF − (RDF + ADF) reveals how many ambiguous fragments exist — a useful quality metric. A large gap suggests a noisy or error-prone site.


VAF Calculation

VAF = AD / (RD + AD)

Where AD and RD are the read-level alternate and reference counts. Fragment-level VAF can be similarly computed as ADF / (RDF + ADF).


Strand Bias (Fisher's Exact Test)

Strand bias detects when an allele is disproportionately supported by reads on one strand — a common sequencing artifact.

flowchart LR
    subgraph Table ["2×2 Contingency Table"]
        direction TB
        Header["| · | Fwd | Rev |"]
        RefRow["| REF | a | b |"]
        AltRow["| ALT | c | d |"]
    end

    Table --> Fisher([Fisher Exact Test]):::test
    Fisher --> Pval[SB_pval]
    Fisher --> OR[SB_OR]

    Pval --> Interpret{p < 0.05?}
    Interpret -->|Yes| Bias{{⚠️ Possible Strand Bias}}:::warn
    Interpret -->|No| NoBias([✅ No Strand Bias]):::pass

    classDef test fill:#3498db,color:#fff,stroke:#2471a3,stroke-width:2px;
    classDef warn fill:#e74c3c,color:#fff,stroke:#c0392b,stroke-width:2px;
    classDef pass fill:#27ae60,color:#fff,stroke:#1e8449,stroke-width:2px;
Use mouse to pan and zoom

Computed at both levels:

Metric Level Description
SB_pval / SB_OR Read Strand bias from individual reads
FSB_pval / FSB_OR Fragment Strand bias from collapsed fragments

Paired-End Data: Use FSB, Not SB

For paired-end sequencing (e.g., MSK-ACCESS), R1 and R2 from the same fragment are not independent observations. Read-level SB (SB_pval) artificially doubles the sample size N in the Fisher's test contingency table, producing deflated p-values that can falsely flag true variants as strand bias artifacts. Clinical filtering pipelines should use FSB_pval (fragment-level), which correctly treats each physical fragment as a single independent observation.

Strand Bias Example

If a variant has AD_fwd=15, AD_rev=1, that's suspicious — almost all ALT-supporting reads are on the forward strand. Fisher's test would yield a low p-value, flagging this as a potential artifact.


Complete Output Column Reference

All fields in the BaseCounts struct returned by count_bam():

Column Type Description
dp u32 Total read depth — reads overlapping the variant anchor position, including 'neither' reads
rd u32 Reference read count
ad u32 Alternate read count
dp_fwd / dp_rev u32 Strand-specific total depth
rd_fwd / rd_rev u32 Strand-specific reference counts
ad_fwd / ad_rev u32 Strand-specific alternate counts
dpf u32 Fragment depth (all fragments)
rdf u32 Reference fragment count
adf u32 Alternate fragment count
rdf_fwd / rdf_rev u32 Strand-specific reference fragment counts
adf_fwd / adf_rev u32 Strand-specific alternate fragment counts
sb_pval f64 Read-level strand bias p-value
sb_or f64 Read-level strand bias odds ratio
fsb_pval f64 Fragment-level strand bias p-value
fsb_or f64 Fragment-level strand bias odds ratio
used_decomposed bool True if corrected homopolymer allele was used

mFSD — Mutant Fragment Size Distribution

mFSD compares insert-size distributions for REF-classified vs ALT-classified fragments at each variant position. Short-fragment enrichment in the ALT class indicates tumor-derived cfDNA.

Enabled with --mfsd. All 34 columns (and their VCF INFO equivalents) are absent from output without this flag.

Fragment Classes

Class Definition
REF Fragment supporting the reference allele, valid insert size (50–1000 bp)
ALT Fragment supporting the alternate allele, valid insert size
NonREF Fragment supporting a third allele (neither REF nor ALT)
N Fragment where the base at the variant position was called N

MAF Columns (34 total)

Raw Counts

Column Description
mfsd_ref_count Fragments classified REF with valid insert size
mfsd_alt_count Fragments classified ALT with valid insert size
mfsd_nonref_count Fragments classified as a third allele
mfsd_n_count Fragments with base N at the variant position

Log-Likelihood Ratios

Column Description
mfsd_alt_llr LLR for ALT fragments: Σ log(P_tumor/P_healthy). Positive = tumor-like (short fragments).
mfsd_ref_llr LLR for REF fragments

Mean Fragment Sizes

Column Description
mfsd_ref_mean Mean insert size (bp) for REF fragments. NA when class is empty.
mfsd_alt_mean Mean insert size (bp) for ALT fragments
mfsd_nonref_mean Mean insert size (bp) for NonREF fragments
mfsd_n_mean Mean insert size (bp) for N fragments

Pairwise KS Statistics (6 pairs × 3 values = 18 columns)

Each pair yields: delta (mean difference in bp), ks (KS D-statistic), pval (KS p-value). Values are NA when either class has fewer than 5 fragments (mfsd_ks_valid = False).

Pairs
ALT vs REF: mfsd_delta_alt_ref, mfsd_ks_alt_ref, mfsd_pval_alt_ref
ALT vs NonREF: mfsd_delta_alt_nonref, mfsd_ks_alt_nonref, mfsd_pval_alt_nonref
REF vs NonREF: mfsd_delta_ref_nonref, mfsd_ks_ref_nonref, mfsd_pval_ref_nonref
ALT vs N: mfsd_delta_alt_n, mfsd_ks_alt_n, mfsd_pval_alt_n
REF vs N: mfsd_delta_ref_n, mfsd_ks_ref_n, mfsd_pval_ref_n
NonREF vs N: mfsd_delta_nonref_n, mfsd_ks_nonref_n, mfsd_pval_nonref_n

Derived Metrics

Column Description
mfsd_error_rate NonREF / total_mFSD fragments. NA when total = 0.
mfsd_n_rate N / total_mFSD fragments. NA when total = 0.
mfsd_size_ratio mean(ALT) / mean(REF). NA when REF mean = 0 or ALT count = 0.
mfsd_quality_score 1 − error_rate − n_rate. NA when either rate is NA.
mfsd_alt_confidence HIGH (≥5 ALT frags), LOW (1–4), NONE (0)
mfsd_ks_valid True when both ALT and REF have ≥5 fragments

VCF INFO Fields (7 total)

Added to ##INFO header and per-variant INFO column when --mfsd is set.

INFO key Type Description
MFSD_DELTA_ALT_REF Float mean(ALT) − mean(REF) in bp
MFSD_KS_ALT_REF Float KS D-statistic (ALT vs REF)
MFSD_PVAL_ALT_REF Float KS p-value (ALT vs REF)
MFSD_ALT_LLR Float LLR for ALT fragments
MFSD_REF_LLR Float LLR for REF fragments
MFSD_ALT_COUNT Integer ALT-classified fragment count
MFSD_REF_COUNT Integer REF-classified fragment count

Parquet Output (--mfsd-parquet)

When --mfsd-parquet is also set, a companion <sample>.fsd.parquet is written with raw arrays:

Column Type Description
chrom String Chromosome
pos Int64 1-based position
ref String Reference allele
alt String Alternate allele
ref_sizes List\<Int32> Insert sizes (bp) of all REF fragments
alt_sizes List\<Int32> Insert sizes (bp) of all ALT fragments

Written natively by the Rust engine (no pyarrow dependency).


Comparison with Original GBCMS

Feature Original GBCMS gbcms
Counting algorithm Region-based chunking, position matching Per-variant CIGAR traversal
Indel detection Exact position match only Windowed scan (±5bp) with 3-layer safeguards
Complex variants Optional via --generic_counting Always uses haplotype reconstruction
Complex quality handling Exact match only Masked comparison — unreliable bases excluded
Base quality filtering No threshold Default --min-baseq 20
MNP handling Not explicit Dedicated check_mnp with contiguity check
Fragment counting Optional (--fragment_count), majority-rule Always computed, quality-weighted consensus
Positive strand counts Optional (--positive_count) Always computed
Strand bias Not computed Fisher's exact test (read + fragment level)
Fractional depth --fragment_fractional_weight Not implemented
Parallelism OpenMP block-based Rayon per-variant