Allele Classification¶

How gbcms classifies each read as supporting the reference allele, the alternate allele, or neither.

Detailed Visual Reference (PDF)

Dispatch¶

After passing read filters, each read is dispatched to a type-specific allele checker based on the variant's allele lengths (ref_len × alt_len), not on the variant_type string. This makes dispatch robust against inconsistent upstream type labels:

flowchart LR
    Fetch([📥 Fetch Reads]):::fetch --> Filter([🔍 Apply Filters]):::filter
    Filter --> Dispatch{"ref_len × alt_len?"}
    Dispatch -->|"1 × 1"| SNP[check_snp]
    Dispatch -->|"N × N"| MNP[check_mnp]
    Dispatch -->|"1 × N"| INS[check_insertion]
    Dispatch -->|"N × 1"| DEL[check_deletion]
    Dispatch -->|"else"| CPX[check_complex]
    MNP -.->|"inconclusive"| CPX
    INS -.->|"Phase 3 fallback"| CPX
    DEL -.->|"fallback (CIGAR mismatch)"| CPX
    SNP --> Count([📊 Update Counts]):::count
    INS --> Count
    DEL --> Count
    MNP --> Count
    CPX --> Count

    classDef fetch fill:#4a90d9,color:#fff,stroke:#2c6fad,stroke-width:2px;
    classDef filter fill:#e67e22,color:#fff,stroke:#bf6516,stroke-width:2px;
    classDef count fill:#27ae60,color:#fff,stroke:#1e8449,stroke-width:2px;

Use mouse to pan and zoom

Allele-Length Routing

Dispatch uses ref_allele.len() and alt_allele.len() to determine the variant class. This means even if upstream callers emit inconsistent type strings (e.g., "COMPLEX" for what is really a pure deletion after normalization), the correct checker is always selected.

SNP (Single Nucleotide Polymorphism)¶

A single base substitution — the simplest and most common variant type.

Property	Value
Detection	`len(REF) == 1 && len(ALT) == 1`
Position	0-based index of the substituted base
Quality check	Base quality at the position must meet `--min-baseq`

Algorithm¶

flowchart TD
    Start([🧬 SNP Check]):::start --> Walk[Walk CIGAR to variant pos]
    Walk --> Found{Position found?}
    Found -->|No| Neither1([Neither]):::neither
    Found -->|Yes| BQ{Base quality ≥ min_baseq?}
    BQ -->|No| Neither2([Neither]):::neither
    BQ -->|Yes| Compare[Compare base to REF and ALT]
    Compare --> IsRef{base == REF?}
    IsRef -->|Yes| Ref([✅ REF]):::ref
    IsRef -->|No| IsAlt{base == ALT?}
    IsAlt -->|Yes| Alt([🔴 ALT]):::alt
    IsAlt -->|No| Neither3([Neither]):::neither

    classDef start fill:#9b59b6,color:#fff,stroke:#7d3c98,stroke-width:2px;
    classDef ref fill:#27ae60,color:#fff,stroke:#1e8449,stroke-width:2px;
    classDef alt fill:#e74c3c,color:#fff,stroke:#c0392b,stroke-width:2px;
    classDef neither fill:#95a5a6,color:#fff,stroke:#7f8c8d,stroke-width:2px;

Use mouse to pan and zoom

Visual Example¶

Variant: chr1:100 C→T (SNP)

Reference: 5'─ ...G  A  T  C  G  A  T  C  G  A... ─3'
                          98 99 100 101
                               ▲
                           variant pos

Read 1:    5'─ ...G  A  T [T] G  A  T  C... ─3'   → ALT ✅
                               ↑
                          base=T matches ALT

Read 2:    5'─ ...G  A  T [C] G  A  T  C... ─3'   → REF ✅
                               ↑
                          base=C matches REF

Read 3:    5'─ ...G  A  T [A] G  A  T  C... ─3'   → Neither
                               ↑
                          base=A ≠ REF or ALT

Insertion¶

Bases inserted after an anchor position. The anchor is the last reference base before the inserted sequence.

Property	Value
Detection	`len(REF) == 1 && len(ALT) > 1`
Position	0-based index of the anchor base
Quality check	Quality-masked sequence comparison

Algorithm¶

The insertion check uses a single CIGAR walk with four detection strategies:

flowchart TD
    Start([🧬 Insertion Check]):::start --> Walk[Walk CIGAR left → right]
    Walk --> Found{Match block contains anchor?}
    Found -->|No| WindowCheck
    Found -->|Yes| Backward{Anchor at start of block AND prev op is Ins?}
    Backward -->|Yes| BWMatch{Length + seq match? quality-masked}
    BWMatch -->|Yes| BWAlt([🔴 ALT - backward]):::alt
    BWMatch -->|No| AtEnd
    Backward -->|No| AtEnd{Anchor at end of block?}
    AtEnd -->|No| RefCov[Mark ref coverage]
    AtEnd -->|Yes| NextOp{Next op is Ins?}
    NextOp -->|No| RefCov
    NextOp -->|Yes| Strict{"Length + seq match? (quality-masked)"}
    Strict -->|Yes| StrictAlt([🔴 ALT - strict]):::alt
    Strict -->|No| RefCov
    RefCov --> WindowCheck

    WindowCheck{Ins within ±5bp window?}
    WindowCheck -->|No| Continue[Continue CIGAR walk]
    WindowCheck -->|Yes| S1{"S1: Seq matches? (quality-masked)"}
    S1 -->|No| LenMatch{"Same length?"}
    LenMatch -->|Yes| FlagP3["Flag for Phase 3"]:::fallback
    LenMatch -->|No| Continue
    S1 -->|Yes| S3{S3: Anchor base matches ref?}
    S3 -->|No| Continue
    S3 -->|Yes| S2[S2: Track closest match]
    S2 --> Continue

    Continue --> MoreOps{More CIGAR ops?}
    MoreOps -->|Yes| Walk
    MoreOps -->|No| Eval{Windowed candidate found?}
    Eval -->|Yes| WinAlt([🔴 ALT - windowed]):::alt
    Eval -->|No| P3Check{"Phase 3 flagged
AND ref coverage?"}
    P3Check -->|Yes| Complex([🔄 check_complex
Phase 3 SW]):::fallback
    P3Check -->|No| HasRef{Read covered anchor?}
    HasRef -->|Yes| Ref([✅ REF]):::ref
    HasRef -->|No| Neither([Neither]):::neither

    classDef start fill:#9b59b6,color:#fff,stroke:#7d3c98,stroke-width:2px;
    classDef ref fill:#27ae60,color:#fff,stroke:#1e8449,stroke-width:2px;
    classDef alt fill:#e74c3c,color:#fff,stroke:#c0392b,stroke-width:2px;
    classDef neither fill:#95a5a6,color:#fff,stroke:#7f8c8d,stroke-width:2px;
    classDef fallback fill:#f39c12,color:#fff,stroke:#d68910,stroke-width:2px;

Use mouse to pan and zoom

Windowed Scan Safeguards¶

Three layers of validation prevent false-positive windowed matches:

Safeguard	Check	Purpose
S1	Inserted sequence matches expected ALT bases (quality-masked)	Prevents matching unrelated insertions
S2	Closest match wins (minimum distance from anchor)	When multiple candidates exist, picks the most likely
S3	Reference base at shifted anchor matches original anchor base	Ensures the shifted position is biologically equivalent

Phase 3 Haplotype Fallback

When the windowed scan finds an insertion that matches in length but not sequence (e.g., aligner represents the biological event with shifted bases in a repeat), the engine flags has_nearby_length_match and falls back to check_complex for Smith-Waterman arbitration. This ensures ambiguous cases are resolved by haplotype comparison rather than strict sequence matching.

Visual Example¶

Variant: chr1:100 A→ATG (insertion of TG after anchor A)

Reference:     5'─ ...C  G  A ── C  G  T  A... ─3'
                          99 100  101 102
                               ▲
                          anchor pos

Read 1 (ALT, strict):  CIGAR = 5M 2I 5M
               5'─ ...C  G  A [T  G] C  G  T  A... ─3'
                               └──┘
                          inserted bases at anchor → ALT ✅ (strict)

Read 2 (ALT, windowed): CIGAR = 7M 2I 3M
               5'─ ...C  G  A  C  G [T  G] T  A... ─3'
                                          └──┘
                    insertion shifted +2bp, same seq → ALT ✅ (windowed)

Read 3 (REF):  CIGAR = 10M
               5'─ ...C  G  A  C  G  T  A... ─3'
                               ↑
                          no insertion after anchor → REF ✅

Deletion¶

Bases deleted after an anchor position. Mirrors insertion but looks for Del CIGAR operations.

Property	Value
Detection	`len(REF) > 1 && len(ALT) == 1`
Position	0-based index of the anchor base
Quality check	Quality-masked ref-context comparison

Algorithm¶

Same single-walk strategy as insertion, with three additional features:

Reciprocal overlap matching — For large deletions (≥50bp), if the CIGAR shows a deletion at the anchor but with a different length, gbcms accepts it if the reciprocal overlap is ≥50% (SV-caller standard, used by SURVIVOR):
```
reciprocal_overlap = min(expected, found) / max(expected, found)
Accept if expected_del ≥ 50bp AND reciprocal_overlap ≥ 0.50
```
Interior REF guard — For large deletions (≥50bp), reads that start inside the deleted region are classified as REF directly, without falling back to Smith-Waterman. This prevents overcounting caused by the SW aligner's bias toward shorter ALT haplotypes.
Haplotype fallback — When no CIGAR match is found and the read doesn't cover the anchor, falls back to check_complex for haplotype-based comparison.

flowchart TD
    Start([🧬 Deletion Check]):::start --> Walk[Walk CIGAR left → right]
    Walk --> Found{Match block contains anchor?}
    Found -->|No| WindowCheck
    Found -->|Yes| AtEnd{Anchor at end of block?}
    AtEnd -->|No| RefCov[Mark ref coverage]
    AtEnd -->|Yes| NextOp{Next op is Del?}
    NextOp -->|No| RefCov
    NextOp -->|Yes| LenCheck{Length matches?}
    LenCheck -->|Yes| StrictAlt([🔴 ALT - strict]):::alt
    LenCheck -->|No| RecipCheck{"≥50bp AND overlap ≥50%?"}
    RecipCheck -->|Yes| TolAlt([🔴 ALT - tolerant]):::alt
    RecipCheck -->|No| Fallback1([🔄 check_complex]):::fallback
    RefCov --> WindowCheck

    WindowCheck{Del within ±5bp window?}
    WindowCheck -->|No| Continue[Continue walk]
    WindowCheck -->|Yes| S1{S1: Length check}
    S1 -->|Match| S3{"S3: Ref bases match?"}
    S1 -->|"≥50bp + overlap ≥50%"| S2Track[S2: Track closest]
    S1 -->|No match| Continue
    S3 -->|Yes| S2[S2: Track closest]
    S3 -->|No| Continue
    S2Track --> Continue
    S2 --> Continue

    Continue --> MoreOps{More ops?}
    MoreOps -->|Yes| Walk
    MoreOps -->|No| Eval{Windowed match?}
    Eval -->|Yes| WinAlt([🔴 ALT]):::alt
    Eval -->|No| Interior{"Interior read?
(≥50bp del, read starts inside span)"}
    Interior -->|Yes| IntRef([✅ REF - interior]):::ref
    Interior -->|No| HasRef{Anchor covered?}
    HasRef -->|Yes| Ref([✅ REF]):::ref
    HasRef -->|No| Fallback2([🔄 check_complex]):::fallback

    classDef start fill:#9b59b6,color:#fff,stroke:#7d3c98,stroke-width:2px;
    classDef ref fill:#27ae60,color:#fff,stroke:#1e8449,stroke-width:2px;
    classDef alt fill:#e74c3c,color:#fff,stroke:#c0392b,stroke-width:2px;
    classDef neither fill:#95a5a6,color:#fff,stroke:#7f8c8d,stroke-width:2px;
    classDef fallback fill:#f39c12,color:#fff,stroke:#d68910,stroke-width:2px;

Use mouse to pan and zoom

Interior REF Guard

For deletions >50bp, reads that fall entirely within the deleted region would be incorrectly classified as ALT by Smith-Waterman (because the short ALT haplotype aligns better than the long REF haplotype). The interior REF guard catches these reads early and classifies them as REF before they reach Phase 3.

Visual Example¶

Variant: chr1:100 ACG→A (deletion of CG after anchor A)

Reference:     5'─ ...T  G  A  C  G  T  A  C... ─3'
                          99 100 101 102
                               ▲
                          anchor pos

Read 1 (ALT, strict):  CIGAR = 5M 2D 5M
               5'─ ...T  G  A  ──  ──  T  A  C... ─3'
                               └─────┘
                          2bp deletion at anchor → ALT ✅ (strict)

Read 2 (ALT, windowed): CIGAR = 7M 2D 3M
               5'─ ...T  G  A  C  G  T  ──  ──  A  C... ─3'
                                              └─────┘
                  deletion shifted +2bp, same length, ref matches → ALT ✅ (windowed)

Read 3 (REF):  CIGAR = 12M
               5'─ ...T  G  A  C  G  T  A  C... ─3'
                               ↑
                          no deletion after anchor → REF ✅

MNP (Multi-Nucleotide Polymorphism)¶

Multiple adjacent bases substituted simultaneously.

Property	Value
Detection	`len(REF) == len(ALT) && len(REF) > 1`
Position	0-based index of the first substituted base
Quality check	Minimum base quality across the MNP block must meet `--min-baseq`

Algorithm¶

flowchart TD
    Start([🧬 MNP Check]):::start --> Find[Find read position of first base]
    Find --> Found{Position found?}
    Found -->|No| Structural1([🔄 Structural → check_complex]):::fallback
    Found -->|Yes| Cover{Read covers entire MNP region?}
    Cover -->|No| Structural2([🔄 Structural → check_complex]):::fallback
    Cover -->|Yes| Contig{No indels within MNP block?}

    Contig -->|Indel found| Structural3([🔄 Structural → check_complex]):::fallback
    Contig -->|Contiguous| MinBQ{"min(BQ across block) ≥ threshold?"}

    MinBQ -->|No| LowQ(["🔄 LowQuality → check_complex"]):::fallback
    MinBQ -->|Yes| Compare[Compare all bases to REF and ALT]

    Compare --> Final{All bases match?}
    Final -->|All match ALT| Alt([🔴 ALT]):::alt
    Final -->|All match REF| Ref([✅ REF]):::ref
    Final -->|Mixed / Neither| Third([⬜ ThirdAllele — skip read]):::neither

    classDef start fill:#9b59b6,color:#fff,stroke:#7d3c98,stroke-width:2px;
    classDef ref fill:#27ae60,color:#fff,stroke:#1e8449,stroke-width:2px;
    classDef alt fill:#e74c3c,color:#fff,stroke:#c0392b,stroke-width:2px;
    classDef neither fill:#95a5a6,color:#fff,stroke:#7f8c8d,stroke-width:2px;
    classDef fallback fill:#f39c12,color:#fff,stroke:#e67e22,stroke-width:2px;

Use mouse to pan and zoom

Min-BQ-Across-Block Strategy

MNP quality is assessed using the minimum base quality across the entire MNP block, matching C++ GBCMS baseCountDNP behavior. If min(BQ) < threshold, the read is skipped entirely — it contributes to neither REF, ALT, nor DP. This replaces the previous per-base rejection which routed low-quality reads to Phase 3 Smith-Waterman, where MNP haplotypes (differing at only 2-3 of ~20 positions) generated ties ~95% of the time, causing catastrophic ALT loss.

Phase 3 Fallback Conditions

check_complex is invoked for both structural issues and quality issues:

Read position not found in CIGAR walk
Read doesn't cover the entire MNP region
Indel detected within the MNP block (contiguity check)
Low quality: min(BQ across block) < threshold — Phase 2 masked comparison can still classify using only reliable bases

The first three indicate a complex variant misannotated as an MNP; the last leverages quality-masking to rescue low-quality reads that strict MNP matching would discard.

Contiguity Check¶

The contiguity check is performed first (before quality or sequence comparison) as a fail-fast for structural issues. gbcms compares the read positions of the first and last MNP base — if the distance doesn't equal len - 1, an indel exists within the block and the read is routed to check_complex for haplotype-based resolution.

Complex (Indel + Substitution)¶

Variants where REF and ALT differ in both sequence and length. Uses a sophisticated three-phase algorithm with quality-aware matching and Smith-Waterman fallback.

Property	Value
Detection	Fallback for all other combinations
Position	0-based index of the first reference base
Quality check	Masked comparison — bases below `--min-baseq` are masked

Three-Phase Algorithm¶

flowchart TD
    Start([🧬 Complex Check]):::start
    Start --> Phase1

    subgraph Phase1 [Phase 1: Haplotype Reconstruction]
        direction TB
        Walk[Walk CIGAR to reconstruct
read sequence for variant region]
        Walk --> Ops{CIGAR op type?}
        Ops -->|M/=/X| Match[Append overlapping bases + quals]
        Ops -->|I| Ins[Append inserted bases + quals]
        Ops -->|D/N| Del[Advance ref_pos only]
        Ops -->|S| SC{Overlaps variant?}
        SC -->|Yes| SCAdd[Append soft-clipped bases]
        SC -->|No| SCSkip[Skip]
    end

    Phase1 --> LargeGuard{"ref_len > 50 AND
recon < 10% of ref?"}
    LargeGuard -->|Yes| SkipP2[Skip Phase 2
unreliable]:::neither
    LargeGuard -->|No| LenRatio{"ref_len > 2 × alt_len?"}
    LenRatio -->|Yes| SkipP2B[Skip Phase 2 Case B
+ Phase 2.5]:::neither
    LenRatio -->|No| Phase2

    subgraph Phase2 ["Phase 2: Masked Comparison"]
        direction TB
        LenCheck{Recon length matches?}
        LenCheck -->|"ALT and REF"| CaseA["Case A: Dual compare
+ ambiguity detection"]
        LenCheck -->|"ALT only"| CaseB["Case B: ALT-only compare"]
        LenCheck -->|"REF only"| CaseC["Case C: REF-only compare"]
        LenCheck -->|Neither| NoMatch[No match]
    end

    CaseA --> Ambig{"Reliable bases
match BOTH?"}
    Ambig -->|Yes| Neither1([Neither - ambiguous]):::neither
    Ambig -->|No| Matches

    Matches{Which allele matches?}
    Matches -->|ALT| Alt1([🔴 ALT]):::alt
    Matches -->|REF| Ref1([✅ REF]):::ref
    Matches -->|Neither| NoMatch

    CaseB --> AltMatch{"0 mismatches on
reliable bases?"}
    AltMatch -->|Yes| Alt2([🔴 ALT]):::alt
    AltMatch -->|No| NoMatch

    CaseC --> RefMatch{"0 mismatches on
reliable bases?"}
    RefMatch -->|Yes| Ref2([✅ REF]):::ref
    RefMatch -->|No| NoMatch

    SkipP2 --> Phase25
    SkipP2B --> Phase3
    NoMatch --> Phase25

    subgraph Phase25 ["Phase 2.5: Edit Distance"]
        direction TB
        ReconLen{"recon_len ≥ 2?"}
        ReconLen -->|No| SkipED[Skip to Phase 3]
        ReconLen -->|Yes| RatioGuard{"ref_len > 2 × alt_len?"}
        RatioGuard -->|Yes| SkipED
        RatioGuard -->|No| EditDist["Compute Levenshtein distance
to REF and ALT alleles"]
        EditDist --> EDMargin{">1 edit margin?"}
        EDMargin -->|"ALT closer"| Alt25([🔴 ALT]):::alt
        EDMargin -->|"REF closer"| Ref25([✅ REF]):::ref
        EDMargin -->|"Ambiguous"| SkipED
    end

    SkipED --> Phase3

    subgraph Phase3 ["Phase 3: Alignment Fallback"]
        direction TB
        IsMNP{"Is cleanly MNP?
(ref == alt > 1)"}

        IsMNP -->|Yes| Extract[Extract raw read window]
        IsMNP -->|No| Extract
        Extract --> Backend{"--alignment-backend?"}
        Backend -->|"sw (default)"| SW["Semiglobal SW alignment
vs REF and ALT haplotypes"]
        Backend -->|"hmm"| HMM["PairHMM likelihood
vs REF and ALT haplotypes"]
        SW --> Margin{"Score difference ≥ 2?"}
        HMM --> LLR{"LLR > threshold?"}
        LLR -->|"Confident"| ConfResult2
        LLR -->|"Below threshold"| Neither4([Neither]):::neither
        Margin -->|"Confident call"| ConfCheck
        Margin -->|"Within margin"| Neither3([Neither]):::neither

        ConfCheck{"Dual trigger?
(borderline OR poor)"}
        ConfCheck -->|"No"| ConfResult
        ConfCheck -->|"Yes"| LocalSW["Local SW alignment
(soft-clips bad flanks)"]
        LocalSW --> LocalMargin{"Score difference ≥ 2?"}
        LocalMargin -->|ALT wins| Alt3b([🔴 ALT]):::alt
        LocalMargin -->|REF wins| Ref3b([✅ REF]):::ref
        LocalMargin -->|"Ambiguous Tie"| Tie2([Neither]):::neither

        ConfResult -->|ALT| Alt3([🔴 ALT]):::alt
        ConfResult -->|REF| Ref3([✅ REF]):::ref
        ConfResult2 -->|ALT| Alt3c([🔴 ALT]):::alt
        ConfResult2 -->|REF| Ref3c([✅ REF]):::ref
        Margin -->|"Ambiguous Tie"| Tie1([Neither]):::neither
    end

    classDef start fill:#9b59b6,color:#fff,stroke:#7d3c98,stroke-width:2px;
    classDef ref fill:#27ae60,color:#fff,stroke:#1e8449,stroke-width:2px;
    classDef alt fill:#e74c3c,color:#fff,stroke:#c0392b,stroke-width:2px;
    classDef neither fill:#95a5a6,color:#fff,stroke:#7f8c8d,stroke-width:2px;

Use mouse to pan and zoom

Phase 1: Haplotype Reconstruction¶

Walks the CIGAR to rebuild what the read shows for the genomic region [pos, pos + ref_len). Each CIGAR operation contributes differently:

CIGAR Op	Action	Example
M / = / X	Append overlapping bases and qualities	Standard aligned bases
I	Append inserted bases if `ref_pos` is within variant region	Captures insertions
D / N	Advance `ref_pos` without appending	Deletions skip
S	Append if `ref_pos` overlaps variant window	Recovers soft-clipped evidence
H / P	No action	Hard clips have no sequence

Phase 2: Masked Comparison¶

Instead of exact matching, bases with quality below --min-baseq are masked out — they cannot vote for either allele. Three cases based on reconstructed sequence length:

Case	Condition	Behavior
A	`recon == alt == ref` length	Dual compare + ambiguity detection
B	`recon == alt` length only	ALT-only masked compare
C	`recon == ref` length only	REF-only masked compare

Ambiguity Detection (Case A)

When reliable bases match both REF and ALT (possible when they differ only at masked positions), the read is discarded rather than guessed. This prevents false calls at positions where low-quality bases happen to match one allele.

Large-REF Guard

For variants with REF >50bp, if the reconstruction is <10% of the REF length (e.g., 1bp recon for a 1024bp deletion), Phase 2 is skipped entirely. A tiny reconstruction would trivially match a short ALT allele, causing overcounting. Phase 3 SW handles these correctly.

Phase 2.5: Edit Distance¶

When Phase 2's strict length comparison fails (Case A/B/C don't match), gbcms measures the Levenshtein edit distance between the reconstruction and each allele. This catches cases where the reconstruction is off by 1-2 bases due to an incomplete variant definition:

Parameter	Value	Rationale
Margin	>1 edit	Prevents noise on very short strings
Min recon length	2 bases	Single-base reconstructions are too short for reliable edit distance

Phase 2.5 is Supplementary

Phase 2.5 helps edge cases with longer reconstructions. For variants like EPHA7 where the strict reconstruction is only 1bp ("C"), the recon_len >= 2 guard skips Phase 2.5 entirely — the fix comes from Phase 3's local fallback instead.

Phase 3: Alignment Fallback¶

When Phase 2 and Phase 2.5 both fail, the engine expands to the full ref_context window and performs dual-haplotype alignment using one of two backends:

Build haplotypes: REF_hap = left_ctx + REF + right_ctx, ALT_hap = left_ctx + ALT + right_ctx
Mask: Replace low-quality bases with N (scores 0 against any base)
Align: Semiglobal alignment — read (query) is fully consumed, haplotype (text) has free overhangs
Score: ALT wins if alt_score ≥ ref_score + 2; REF wins if ref_score ≥ alt_score + 2
Dual-trigger check: For indel/complex variants, if the semiglobal result is low-confidence, retry with local alignment

Parameter	Value	Rationale
Scoring	+1 match, −1 mismatch	Standard
N vs anything	0	Low-quality bases don't bias
Gap open	−5	Affine gap penalties
Gap extend	−1	Affine gap penalties
Score margin	≥2	Prevents ambiguous calls
Min usable bases	3	Reads with <3 usable bases are skipped

Alternative: PairHMM Backend

Pass --alignment-backend hmm to use a PairHMM (pair hidden Markov model) instead of Smith-Waterman. PairHMM integrates base quality probabilities directly into the alignment score, using Log-Likelihood Ratio (LLR) instead of edit-distance scoring. Default LLR threshold: 2.3 (≈ ln(10), i.e., 10:1 odds). Gap probabilities are tunable separately for repeat vs non-repeat regions. See CLI Reference.

Dual-Trigger Local Fallback¶

For complex variants (both alleles > 1bp with different lengths), semiglobal alignment can produce confident but incorrect calls when the MAF/VCF definition is incomplete (e.g., a complex variant missing an adjacent SNV). The ALT haplotype then has a "frameshifted flank" — right-context bases that don't match the biological read. Semiglobal forces gap penalties through this invalid flank.

[!NOTE] The dual-trigger only applies to complex variants (e.g., TCC→CT, ATGA→CATG), not to pure insertions or deletions. Pure indels are well-handled by semiglobal alignment, and applying local fallback would risk false positives in homopolymer regions.

Two conditions detect low-confidence semiglobal results:

Trigger	Condition	Purpose
Borderline	`abs(alt_score - ref_score) ≤ margin + 1`	Score difference barely decisive
Poor quality	`max(scores) < read_len / 2`	Both haplotypes heavily penalized

When either trigger fires, the engine retries with local alignment (Aligner::local()), which soft-clips the bad flank and finds the best matching substring without penalizing overhangs on either side.

Performance: Aligner Reuse

SW aligners are created once per variant in count_single_variant() and reused for all reads. The bio::alignment::pairwise::Aligner reuses internal DP buffers, avoiding repeated heap allocation.

Raw Read Window Extraction

Phase 3 uses extract_raw_read_window() instead of CIGAR-projected extraction. For complex variants (e.g., TCC→CT represented as DEL+INS in CIGAR), CIGAR projection produces a hybrid sequence matching neither haplotype. Raw extraction returns the contiguous read bases that SW can correctly classify.

Ambiguous Tie — Unbiased VAF Preservation¶

For small Complex and MNP variants, biological reads heavily affected by surrounding genetic polymorphism can result in 50% partial matches against the Alternate array. Mathematically, this scores an exact numerical tie between the REF and ALT haplotypes (alt_score = 11, ref_score = 11).

These ambiguous reads are routed to neither (is_ref = false, is_alt = false). They still contribute to physical Total Depth (DP) via the anchor overlap gate, but do not inflate RD or AD. This preserves an unbiased VAF = AD / (RD + AD) — critical for low-VAF cfDNA detection where even small RD inflation can push a variant below the clinical Limit of Detection.

Multi-Allelic Behavior¶

When multiple variants have overlapping REF spans at the same locus, reads carrying one variant's ALT allele could be incorrectly counted as REF for another variant. The engine addresses this with a two-phase approach:

Phase 1: Annotation¶

During normalization, assign_multi_allelic_groups() identifies overlapping variants using a sweep-line algorithm over sorted (chrom, pos) coordinates. Variants whose REF spans intersect receive a shared multi_allelic_group ID, and their validation_status is appended with _MULTI_ALLELIC (e.g., PASS_MULTI_ALLELIC).

Phase 2: Sibling ALT Exclusion¶

During counting, reads classified as REF for a variant are additionally checked against all sibling variants in the same group. For each sibling, the full check_allele_with_qual() pipeline (including CIGAR reconstruction and SW alignment) determines if the read actually carries the sibling's ALT allele. If so, the read is excluded from REF for the current variant.

This prevents systematic REF inflation at multi-allelic loci, preserving unbiased VAF estimation.

Dynamic SW Gap Penalties¶

Phase 3's Smith-Waterman aligners use adaptive affine gap penalties tuned by the local repeat context:

Context	`gap_extend`	Rationale
Stable DNA (`repeat_span < 10bp`)	-1	Standard penalty — prevents spurious gap extension
Tandem repeat (`repeat_span ≥ 10bp`)	0 (free)	Absorbs polymerase slippage noise — prevents undercounting in MSI regions

The repeat_span is computed during normalization using find_tandem_repeat() and stored on the Variant struct. gap_open remains fixed at -5 in all cases.

MSI-High Tumors

In microsatellite-unstable tumors, insertions/deletions within long homopolymer or dinucleotide repeats are common. The free gap extension ensures these slippage-like variants get correctly classified rather than artificially rejected by high gap penalties.

Limitations¶

Windowed scan range — Indels shifted beyond the context padding from their expected position won't be detected by the CIGAR-based check. Phase 3 SW can catch some of these via ref_context, but only if the read shows evidence (indels/clips) near the variant. Adaptive context padding (enabled by default) utilizes a multi-anchor footprint sweep to dramatically widen the padded detection bounds natively for INDEL clusters.
Score margin ≥ 2 — The SW margin is fixed at 2 points to prevent ambiguous calls. Reads failing to achieve definitive spacing are routed to neither — they contribute to DP but not to RD or AD, preserving unbiased VAF.
Soft-clip recovery — Phase 1 includes soft-clipped bases that overlap the variant window, but only when ref_pos is within the variant region. Soft clips at the edge of reads far from the variant are not considered.
MNP strict matching — MNP strict matching is all-or-nothing, but inconclusive reads fall back to the complex variant classification chain (Phase 2 → 2.5 → 3). The fallback rescues reads with low-quality bases or partial matches that strict matching would reject.
Incomplete variant definitions — When the MAF/VCF represents a complex event incompletely (e.g., TCC→CT omitting an adjacent SNV), reads may carry a different CIGAR signature than expected. The dual-trigger local fallback in Phase 3 mitigates this by soft-clipping "frameshifted flanks" caused by the definition mismatch.

Variant Normalization — How variants are prepared before classification
Read Filters — Which reads reach the allele checker
Counting & Metrics — How classifications become counts
Glossary — Term definitions

Allele Classification¶

Dispatch¶

SNP (Single Nucleotide Polymorphism)¶

Algorithm¶

Visual Example¶

Insertion¶

Algorithm¶

Windowed Scan Safeguards¶

Visual Example¶

Deletion¶

Algorithm¶

Visual Example¶

MNP (Multi-Nucleotide Polymorphism)¶

Algorithm¶

Contiguity Check¶

Complex (Indel + Substitution)¶

Three-Phase Algorithm¶

Phase 1: Haplotype Reconstruction¶

Phase 2: Masked Comparison¶

Phase 2.5: Edit Distance¶

Phase 3: Alignment Fallback¶

Dual-Trigger Local Fallback¶

Ambiguous Tie — Unbiased VAF Preservation¶

Multi-Allelic Behavior¶

Phase 1: Annotation¶

Phase 2: Sibling ALT Exclusion¶

Dynamic SW Gap Penalties¶

Limitations¶

Related¶