Skip to content

Allele Classification

How gbcms classifies each read as supporting the reference allele, the alternate allele, or neither.

Detailed Visual Reference (PDF)

Dispatch

After passing read filters, each read is dispatched to a type-specific allele checker based on the variant's allele lengths (ref_len × alt_len), not on the variant_type string. This makes dispatch robust against inconsistent upstream type labels:

flowchart LR
    Fetch([📥 Fetch Reads]):::fetch --> Filter([🔍 Apply Filters]):::filter
    Filter --> Dispatch{"ref_len × alt_len?"}
    Dispatch -->|"1 × 1"| SNP[check_snp]
    Dispatch -->|"N × N"| MNP[check_mnp]
    Dispatch -->|"1 × N"| INS[check_insertion]
    Dispatch -->|"N × 1"| DEL[check_deletion]
    Dispatch -->|"else"| CPX[check_complex]
    MNP -.->|"inconclusive"| CPX
    INS -.->|"Phase 3 fallback"| CPX
    DEL -.->|"fallback (CIGAR mismatch)"| CPX
    SNP --> Count([📊 Update Counts]):::count
    INS --> Count
    DEL --> Count
    MNP --> Count
    CPX --> Count

    classDef fetch fill:#4a90d9,color:#fff,stroke:#2c6fad,stroke-width:2px;
    classDef filter fill:#e67e22,color:#fff,stroke:#bf6516,stroke-width:2px;
    classDef count fill:#27ae60,color:#fff,stroke:#1e8449,stroke-width:2px;
Use mouse to pan and zoom

Allele-Length Routing

Dispatch uses ref_allele.len() and alt_allele.len() to determine the variant class. This means even if upstream callers emit inconsistent type strings (e.g., "COMPLEX" for what is really a pure deletion after normalization), the correct checker is always selected.


SNP (Single Nucleotide Polymorphism)

A single base substitution — the simplest and most common variant type.

Property Value
Detection len(REF) == 1 && len(ALT) == 1
Position 0-based index of the substituted base
Quality check Base quality at the position must meet --min-baseq

Algorithm

flowchart TD
    Start([🧬 SNP Check]):::start --> Walk[Walk CIGAR to variant pos]
    Walk --> Found{Position found?}
    Found -->|No| Neither1([Neither]):::neither
    Found -->|Yes| BQ{Base quality ≥ min_baseq?}
    BQ -->|No| Neither2([Neither]):::neither
    BQ -->|Yes| Compare[Compare base to REF and ALT]
    Compare --> IsRef{base == REF?}
    IsRef -->|Yes| Ref([✅ REF]):::ref
    IsRef -->|No| IsAlt{base == ALT?}
    IsAlt -->|Yes| Alt([🔴 ALT]):::alt
    IsAlt -->|No| Neither3([Neither]):::neither

    classDef start fill:#9b59b6,color:#fff,stroke:#7d3c98,stroke-width:2px;
    classDef ref fill:#27ae60,color:#fff,stroke:#1e8449,stroke-width:2px;
    classDef alt fill:#e74c3c,color:#fff,stroke:#c0392b,stroke-width:2px;
    classDef neither fill:#95a5a6,color:#fff,stroke:#7f8c8d,stroke-width:2px;
Use mouse to pan and zoom

Visual Example

Variant: chr1:100 C→T (SNP)

Reference: 5'─ ...G  A  T  C  G  A  T  C  G  A... ─3'
                          98 99 100 101
                           variant pos

Read 1:    5'─ ...G  A  T [T] G  A  T  C... ─3'   → ALT ✅
                          base=T matches ALT

Read 2:    5'─ ...G  A  T [C] G  A  T  C... ─3'   → REF ✅
                          base=C matches REF

Read 3:    5'─ ...G  A  T [A] G  A  T  C... ─3'   → Neither
                          base=A ≠ REF or ALT

Insertion

Bases inserted after an anchor position. The anchor is the last reference base before the inserted sequence.

Property Value
Detection len(REF) == 1 && len(ALT) > 1
Position 0-based index of the anchor base
Quality check Quality-masked sequence comparison

Algorithm

The insertion check uses a single CIGAR walk with four detection strategies:

flowchart TD
    Start([🧬 Insertion Check]):::start --> Walk[Walk CIGAR left → right]
    Walk --> Found{Match block contains anchor?}
    Found -->|No| WindowCheck
    Found -->|Yes| Backward{Anchor at start of block AND prev op is Ins?}
    Backward -->|Yes| BWMatch{Length + seq match? quality-masked}
    BWMatch -->|Yes| BWAlt([🔴 ALT - backward]):::alt
    BWMatch -->|No| AtEnd
    Backward -->|No| AtEnd{Anchor at end of block?}
    AtEnd -->|No| RefCov[Mark ref coverage]
    AtEnd -->|Yes| NextOp{Next op is Ins?}
    NextOp -->|No| RefCov
    NextOp -->|Yes| Strict{"Length + seq match? (quality-masked)"}
    Strict -->|Yes| StrictAlt([🔴 ALT - strict]):::alt
    Strict -->|No| RefCov
    RefCov --> WindowCheck

    WindowCheck{Ins within ±5bp window?}
    WindowCheck -->|No| Continue[Continue CIGAR walk]
    WindowCheck -->|Yes| S1{"S1: Seq matches? (quality-masked)"}
    S1 -->|No| LenMatch{"Same length?"}
    LenMatch -->|Yes| FlagP3["Flag for Phase 3"]:::fallback
    LenMatch -->|No| Continue
    S1 -->|Yes| S3{S3: Anchor base matches ref?}
    S3 -->|No| Continue
    S3 -->|Yes| S2[S2: Track closest match]
    S2 --> Continue

    Continue --> MoreOps{More CIGAR ops?}
    MoreOps -->|Yes| Walk
    MoreOps -->|No| Eval{Windowed candidate found?}
    Eval -->|Yes| WinAlt([🔴 ALT - windowed]):::alt
    Eval -->|No| P3Check{"Phase 3 flagged
AND ref coverage?"} P3Check -->|Yes| Complex([🔄 check_complex
Phase 3 SW]):::fallback P3Check -->|No| HasRef{Read covered anchor?} HasRef -->|Yes| Ref([✅ REF]):::ref HasRef -->|No| Neither([Neither]):::neither classDef start fill:#9b59b6,color:#fff,stroke:#7d3c98,stroke-width:2px; classDef ref fill:#27ae60,color:#fff,stroke:#1e8449,stroke-width:2px; classDef alt fill:#e74c3c,color:#fff,stroke:#c0392b,stroke-width:2px; classDef neither fill:#95a5a6,color:#fff,stroke:#7f8c8d,stroke-width:2px; classDef fallback fill:#f39c12,color:#fff,stroke:#d68910,stroke-width:2px;
Use mouse to pan and zoom

Windowed Scan Safeguards

Three layers of validation prevent false-positive windowed matches:

Safeguard Check Purpose
S1 Inserted sequence matches expected ALT bases (quality-masked) Prevents matching unrelated insertions
S2 Closest match wins (minimum distance from anchor) When multiple candidates exist, picks the most likely
S3 Reference base at shifted anchor matches original anchor base Ensures the shifted position is biologically equivalent

Phase 3 Haplotype Fallback

When the windowed scan finds an insertion that matches in length but not sequence (e.g., aligner represents the biological event with shifted bases in a repeat), the engine flags has_nearby_length_match and falls back to check_complex for Smith-Waterman arbitration. This ensures ambiguous cases are resolved by haplotype comparison rather than strict sequence matching.

Visual Example

Variant: chr1:100 A→ATG (insertion of TG after anchor A)

Reference:     5'─ ...C  G  A ── C  G  T  A... ─3'
                          99 100  101 102
                          anchor pos

Read 1 (ALT, strict):  CIGAR = 5M 2I 5M
               5'─ ...C  G  A [T  G] C  G  T  A... ─3'
                               └──┘
                          inserted bases at anchor → ALT ✅ (strict)

Read 2 (ALT, windowed): CIGAR = 7M 2I 3M
               5'─ ...C  G  A  C  G [T  G] T  A... ─3'
                                          └──┘
                    insertion shifted +2bp, same seq → ALT ✅ (windowed)

Read 3 (REF):  CIGAR = 10M
               5'─ ...C  G  A  C  G  T  A... ─3'
                          no insertion after anchor → REF ✅

Deletion

Bases deleted after an anchor position. Mirrors insertion but looks for Del CIGAR operations.

Property Value
Detection len(REF) > 1 && len(ALT) == 1
Position 0-based index of the anchor base
Quality check Quality-masked ref-context comparison

Algorithm

Same single-walk strategy as insertion, with three additional features:

  1. Reciprocal overlap matching — For large deletions (≥50bp), if the CIGAR shows a deletion at the anchor but with a different length, gbcms accepts it if the reciprocal overlap is ≥50% (SV-caller standard, used by SURVIVOR):

    reciprocal_overlap = min(expected, found) / max(expected, found)
    Accept if expected_del ≥ 50bp AND reciprocal_overlap ≥ 0.50
    
  2. Interior REF guard — For large deletions (≥50bp), reads that start inside the deleted region are classified as REF directly, without falling back to Smith-Waterman. This prevents overcounting caused by the SW aligner's bias toward shorter ALT haplotypes.

  3. Haplotype fallback — When no CIGAR match is found and the read doesn't cover the anchor, falls back to check_complex for haplotype-based comparison.

flowchart TD
    Start([🧬 Deletion Check]):::start --> Walk[Walk CIGAR left → right]
    Walk --> Found{Match block contains anchor?}
    Found -->|No| WindowCheck
    Found -->|Yes| AtEnd{Anchor at end of block?}
    AtEnd -->|No| RefCov[Mark ref coverage]
    AtEnd -->|Yes| NextOp{Next op is Del?}
    NextOp -->|No| RefCov
    NextOp -->|Yes| LenCheck{Length matches?}
    LenCheck -->|Yes| StrictAlt([🔴 ALT - strict]):::alt
    LenCheck -->|No| RecipCheck{"≥50bp AND overlap ≥50%?"}
    RecipCheck -->|Yes| TolAlt([🔴 ALT - tolerant]):::alt
    RecipCheck -->|No| Fallback1([🔄 check_complex]):::fallback
    RefCov --> WindowCheck

    WindowCheck{Del within ±5bp window?}
    WindowCheck -->|No| Continue[Continue walk]
    WindowCheck -->|Yes| S1{S1: Length check}
    S1 -->|Match| S3{"S3: Ref bases match?"}
    S1 -->|"≥50bp + overlap ≥50%"| S2Track[S2: Track closest]
    S1 -->|No match| Continue
    S3 -->|Yes| S2[S2: Track closest]
    S3 -->|No| Continue
    S2Track --> Continue
    S2 --> Continue

    Continue --> MoreOps{More ops?}
    MoreOps -->|Yes| Walk
    MoreOps -->|No| Eval{Windowed match?}
    Eval -->|Yes| WinAlt([🔴 ALT]):::alt
    Eval -->|No| Interior{"Interior read?
(≥50bp del, read starts inside span)"} Interior -->|Yes| IntRef([✅ REF - interior]):::ref Interior -->|No| HasRef{Anchor covered?} HasRef -->|Yes| Ref([✅ REF]):::ref HasRef -->|No| Fallback2([🔄 check_complex]):::fallback classDef start fill:#9b59b6,color:#fff,stroke:#7d3c98,stroke-width:2px; classDef ref fill:#27ae60,color:#fff,stroke:#1e8449,stroke-width:2px; classDef alt fill:#e74c3c,color:#fff,stroke:#c0392b,stroke-width:2px; classDef neither fill:#95a5a6,color:#fff,stroke:#7f8c8d,stroke-width:2px; classDef fallback fill:#f39c12,color:#fff,stroke:#d68910,stroke-width:2px;
Use mouse to pan and zoom

Interior REF Guard

For deletions >50bp, reads that fall entirely within the deleted region would be incorrectly classified as ALT by Smith-Waterman (because the short ALT haplotype aligns better than the long REF haplotype). The interior REF guard catches these reads early and classifies them as REF before they reach Phase 3.

Visual Example

Variant: chr1:100 ACG→A (deletion of CG after anchor A)

Reference:     5'─ ...T  G  A  C  G  T  A  C... ─3'
                          99 100 101 102
                          anchor pos

Read 1 (ALT, strict):  CIGAR = 5M 2D 5M
               5'─ ...T  G  A  ──  ──  T  A  C... ─3'
                               └─────┘
                          2bp deletion at anchor → ALT ✅ (strict)

Read 2 (ALT, windowed): CIGAR = 7M 2D 3M
               5'─ ...T  G  A  C  G  T  ──  ──  A  C... ─3'
                                              └─────┘
                  deletion shifted +2bp, same length, ref matches → ALT ✅ (windowed)

Read 3 (REF):  CIGAR = 12M
               5'─ ...T  G  A  C  G  T  A  C... ─3'
                          no deletion after anchor → REF ✅

MNP (Multi-Nucleotide Polymorphism)

Multiple adjacent bases substituted simultaneously.

Property Value
Detection len(REF) == len(ALT) && len(REF) > 1
Position 0-based index of the first substituted base
Quality check Minimum base quality across the MNP block must meet --min-baseq

Algorithm

flowchart TD
    Start([🧬 MNP Check]):::start --> Find[Find read position of first base]
    Find --> Found{Position found?}
    Found -->|No| Structural1([🔄 Structural → check_complex]):::fallback
    Found -->|Yes| Cover{Read covers entire MNP region?}
    Cover -->|No| Structural2([🔄 Structural → check_complex]):::fallback
    Cover -->|Yes| Contig{No indels within MNP block?}

    Contig -->|Indel found| Structural3([🔄 Structural → check_complex]):::fallback
    Contig -->|Contiguous| MinBQ{"min(BQ across block) ≥ threshold?"}

    MinBQ -->|No| LowQ(["🔄 LowQuality → check_complex"]):::fallback
    MinBQ -->|Yes| Compare[Compare all bases to REF and ALT]

    Compare --> Final{All bases match?}
    Final -->|All match ALT| Alt([🔴 ALT]):::alt
    Final -->|All match REF| Ref([✅ REF]):::ref
    Final -->|Mixed / Neither| Third([⬜ ThirdAllele — skip read]):::neither

    classDef start fill:#9b59b6,color:#fff,stroke:#7d3c98,stroke-width:2px;
    classDef ref fill:#27ae60,color:#fff,stroke:#1e8449,stroke-width:2px;
    classDef alt fill:#e74c3c,color:#fff,stroke:#c0392b,stroke-width:2px;
    classDef neither fill:#95a5a6,color:#fff,stroke:#7f8c8d,stroke-width:2px;
    classDef fallback fill:#f39c12,color:#fff,stroke:#e67e22,stroke-width:2px;
Use mouse to pan and zoom

Min-BQ-Across-Block Strategy

MNP quality is assessed using the minimum base quality across the entire MNP block, matching C++ GBCMS baseCountDNP behavior. If min(BQ) < threshold, the read is skipped entirely — it contributes to neither REF, ALT, nor DP. This replaces the previous per-base rejection which routed low-quality reads to Phase 3 Smith-Waterman, where MNP haplotypes (differing at only 2-3 of ~20 positions) generated ties ~95% of the time, causing catastrophic ALT loss.

Phase 3 Fallback Conditions

check_complex is invoked for both structural issues and quality issues:

  • Read position not found in CIGAR walk
  • Read doesn't cover the entire MNP region
  • Indel detected within the MNP block (contiguity check)
  • Low quality: min(BQ across block) < threshold — Phase 2 masked comparison can still classify using only reliable bases

The first three indicate a complex variant misannotated as an MNP; the last leverages quality-masking to rescue low-quality reads that strict MNP matching would discard.

Contiguity Check

The contiguity check is performed first (before quality or sequence comparison) as a fail-fast for structural issues. gbcms compares the read positions of the first and last MNP base — if the distance doesn't equal len - 1, an indel exists within the block and the read is routed to check_complex for haplotype-based resolution.


Complex (Indel + Substitution)

Variants where REF and ALT differ in both sequence and length. Uses a sophisticated three-phase algorithm with quality-aware matching and Smith-Waterman fallback.

Property Value
Detection Fallback for all other combinations
Position 0-based index of the first reference base
Quality check Masked comparison — bases below --min-baseq are masked

Three-Phase Algorithm

flowchart TD
    Start([🧬 Complex Check]):::start
    Start --> Phase1

    subgraph Phase1 [Phase 1: Haplotype Reconstruction]
        direction TB
        Walk[Walk CIGAR to reconstruct
read sequence for variant region] Walk --> Ops{CIGAR op type?} Ops -->|M/=/X| Match[Append overlapping bases + quals] Ops -->|I| Ins[Append inserted bases + quals] Ops -->|D/N| Del[Advance ref_pos only] Ops -->|S| SC{Overlaps variant?} SC -->|Yes| SCAdd[Append soft-clipped bases] SC -->|No| SCSkip[Skip] end Phase1 --> LargeGuard{"ref_len > 50 AND
recon < 10% of ref?"} LargeGuard -->|Yes| SkipP2[Skip Phase 2
unreliable]:::neither LargeGuard -->|No| LenRatio{"ref_len > 2 × alt_len?"} LenRatio -->|Yes| SkipP2B[Skip Phase 2 Case B
+ Phase 2.5]:::neither LenRatio -->|No| Phase2 subgraph Phase2 ["Phase 2: Masked Comparison"] direction TB LenCheck{Recon length matches?} LenCheck -->|"ALT and REF"| CaseA["Case A: Dual compare
+ ambiguity detection"] LenCheck -->|"ALT only"| CaseB["Case B: ALT-only compare"] LenCheck -->|"REF only"| CaseC["Case C: REF-only compare"] LenCheck -->|Neither| NoMatch[No match] end CaseA --> Ambig{"Reliable bases
match BOTH?"} Ambig -->|Yes| Neither1([Neither - ambiguous]):::neither Ambig -->|No| Matches Matches{Which allele matches?} Matches -->|ALT| Alt1([🔴 ALT]):::alt Matches -->|REF| Ref1([✅ REF]):::ref Matches -->|Neither| NoMatch CaseB --> AltMatch{"0 mismatches on
reliable bases?"} AltMatch -->|Yes| Alt2([🔴 ALT]):::alt AltMatch -->|No| NoMatch CaseC --> RefMatch{"0 mismatches on
reliable bases?"} RefMatch -->|Yes| Ref2([✅ REF]):::ref RefMatch -->|No| NoMatch SkipP2 --> Phase25 SkipP2B --> Phase3 NoMatch --> Phase25 subgraph Phase25 ["Phase 2.5: Edit Distance"] direction TB ReconLen{"recon_len ≥ 2?"} ReconLen -->|No| SkipED[Skip to Phase 3] ReconLen -->|Yes| RatioGuard{"ref_len > 2 × alt_len?"} RatioGuard -->|Yes| SkipED RatioGuard -->|No| EditDist["Compute Levenshtein distance
to REF and ALT alleles"] EditDist --> EDMargin{">1 edit margin?"} EDMargin -->|"ALT closer"| Alt25([🔴 ALT]):::alt EDMargin -->|"REF closer"| Ref25([✅ REF]):::ref EDMargin -->|"Ambiguous"| SkipED end SkipED --> Phase3 subgraph Phase3 ["Phase 3: Alignment Fallback"] direction TB IsMNP{"Is cleanly MNP?
(ref == alt > 1)"} IsMNP -->|Yes| Extract[Extract raw read window] IsMNP -->|No| Extract Extract --> Backend{"--alignment-backend?"} Backend -->|"sw (default)"| SW["Semiglobal SW alignment
vs REF and ALT haplotypes"] Backend -->|"hmm"| HMM["PairHMM likelihood
vs REF and ALT haplotypes"] SW --> Margin{"Score difference ≥ 2?"} HMM --> LLR{"LLR > threshold?"} LLR -->|"Confident"| ConfResult2 LLR -->|"Below threshold"| Neither4([Neither]):::neither Margin -->|"Confident call"| ConfCheck Margin -->|"Within margin"| Neither3([Neither]):::neither ConfCheck{"Dual trigger?
(borderline OR poor)"} ConfCheck -->|"No"| ConfResult ConfCheck -->|"Yes"| LocalSW["Local SW alignment
(soft-clips bad flanks)"] LocalSW --> LocalMargin{"Score difference ≥ 2?"} LocalMargin -->|ALT wins| Alt3b([🔴 ALT]):::alt LocalMargin -->|REF wins| Ref3b([✅ REF]):::ref LocalMargin -->|"Ambiguous Tie"| Tie2([Neither]):::neither ConfResult -->|ALT| Alt3([🔴 ALT]):::alt ConfResult -->|REF| Ref3([✅ REF]):::ref ConfResult2 -->|ALT| Alt3c([🔴 ALT]):::alt ConfResult2 -->|REF| Ref3c([✅ REF]):::ref Margin -->|"Ambiguous Tie"| Tie1([Neither]):::neither end classDef start fill:#9b59b6,color:#fff,stroke:#7d3c98,stroke-width:2px; classDef ref fill:#27ae60,color:#fff,stroke:#1e8449,stroke-width:2px; classDef alt fill:#e74c3c,color:#fff,stroke:#c0392b,stroke-width:2px; classDef neither fill:#95a5a6,color:#fff,stroke:#7f8c8d,stroke-width:2px;
Use mouse to pan and zoom

Phase 1: Haplotype Reconstruction

Walks the CIGAR to rebuild what the read shows for the genomic region [pos, pos + ref_len). Each CIGAR operation contributes differently:

CIGAR Op Action Example
M / = / X Append overlapping bases and qualities Standard aligned bases
I Append inserted bases if ref_pos is within variant region Captures insertions
D / N Advance ref_pos without appending Deletions skip
S Append if ref_pos overlaps variant window Recovers soft-clipped evidence
H / P No action Hard clips have no sequence

Phase 2: Masked Comparison

Instead of exact matching, bases with quality below --min-baseq are masked out — they cannot vote for either allele. Three cases based on reconstructed sequence length:

Case Condition Behavior
A recon == alt == ref length Dual compare + ambiguity detection
B recon == alt length only ALT-only masked compare
C recon == ref length only REF-only masked compare

Ambiguity Detection (Case A)

When reliable bases match both REF and ALT (possible when they differ only at masked positions), the read is discarded rather than guessed. This prevents false calls at positions where low-quality bases happen to match one allele.

Large-REF Guard

For variants with REF >50bp, if the reconstruction is <10% of the REF length (e.g., 1bp recon for a 1024bp deletion), Phase 2 is skipped entirely. A tiny reconstruction would trivially match a short ALT allele, causing overcounting. Phase 3 SW handles these correctly.

Phase 2.5: Edit Distance

When Phase 2's strict length comparison fails (Case A/B/C don't match), gbcms measures the Levenshtein edit distance between the reconstruction and each allele. This catches cases where the reconstruction is off by 1-2 bases due to an incomplete variant definition:

Parameter Value Rationale
Margin >1 edit Prevents noise on very short strings
Min recon length 2 bases Single-base reconstructions are too short for reliable edit distance

Phase 2.5 is Supplementary

Phase 2.5 helps edge cases with longer reconstructions. For variants like EPHA7 where the strict reconstruction is only 1bp ("C"), the recon_len >= 2 guard skips Phase 2.5 entirely — the fix comes from Phase 3's local fallback instead.

Phase 3: Alignment Fallback

When Phase 2 and Phase 2.5 both fail, the engine expands to the full ref_context window and performs dual-haplotype alignment using one of two backends:

  1. Build haplotypes: REF_hap = left_ctx + REF + right_ctx, ALT_hap = left_ctx + ALT + right_ctx
  2. Mask: Replace low-quality bases with N (scores 0 against any base)
  3. Align: Semiglobal alignment — read (query) is fully consumed, haplotype (text) has free overhangs
  4. Score: ALT wins if alt_score ≥ ref_score + 2; REF wins if ref_score ≥ alt_score + 2
  5. Dual-trigger check: For indel/complex variants, if the semiglobal result is low-confidence, retry with local alignment
Parameter Value Rationale
Scoring +1 match, −1 mismatch Standard
N vs anything 0 Low-quality bases don't bias
Gap open −5 Affine gap penalties
Gap extend −1 Affine gap penalties
Score margin ≥2 Prevents ambiguous calls
Min usable bases 3 Reads with <3 usable bases are skipped

Alternative: PairHMM Backend

Pass --alignment-backend hmm to use a PairHMM (pair hidden Markov model) instead of Smith-Waterman. PairHMM integrates base quality probabilities directly into the alignment score, using Log-Likelihood Ratio (LLR) instead of edit-distance scoring. Default LLR threshold: 2.3 (≈ ln(10), i.e., 10:1 odds). Gap probabilities are tunable separately for repeat vs non-repeat regions. See CLI Reference.

Dual-Trigger Local Fallback

For complex variants (both alleles > 1bp with different lengths), semiglobal alignment can produce confident but incorrect calls when the MAF/VCF definition is incomplete (e.g., a complex variant missing an adjacent SNV). The ALT haplotype then has a "frameshifted flank" — right-context bases that don't match the biological read. Semiglobal forces gap penalties through this invalid flank.

[!NOTE] The dual-trigger only applies to complex variants (e.g., TCC→CT, ATGA→CATG), not to pure insertions or deletions. Pure indels are well-handled by semiglobal alignment, and applying local fallback would risk false positives in homopolymer regions.

Two conditions detect low-confidence semiglobal results:

Trigger Condition Purpose
Borderline abs(alt_score - ref_score) ≤ margin + 1 Score difference barely decisive
Poor quality max(scores) < read_len / 2 Both haplotypes heavily penalized

When either trigger fires, the engine retries with local alignment (Aligner::local()), which soft-clips the bad flank and finds the best matching substring without penalizing overhangs on either side.

Performance: Aligner Reuse

SW aligners are created once per variant in count_single_variant() and reused for all reads. The bio::alignment::pairwise::Aligner reuses internal DP buffers, avoiding repeated heap allocation.

Raw Read Window Extraction

Phase 3 uses extract_raw_read_window() instead of CIGAR-projected extraction. For complex variants (e.g., TCC→CT represented as DEL+INS in CIGAR), CIGAR projection produces a hybrid sequence matching neither haplotype. Raw extraction returns the contiguous read bases that SW can correctly classify.

Ambiguous Tie — Unbiased VAF Preservation

For small Complex and MNP variants, biological reads heavily affected by surrounding genetic polymorphism can result in 50% partial matches against the Alternate array. Mathematically, this scores an exact numerical tie between the REF and ALT haplotypes (alt_score = 11, ref_score = 11).

These ambiguous reads are routed to neither (is_ref = false, is_alt = false). They still contribute to physical Total Depth (DP) via the anchor overlap gate, but do not inflate RD or AD. This preserves an unbiased VAF = AD / (RD + AD) — critical for low-VAF cfDNA detection where even small RD inflation can push a variant below the clinical Limit of Detection.


Multi-Allelic Behavior

When multiple variants have overlapping REF spans at the same locus, reads carrying one variant's ALT allele could be incorrectly counted as REF for another variant. The engine addresses this with a two-phase approach:

Phase 1: Annotation

During normalization, assign_multi_allelic_groups() identifies overlapping variants using a sweep-line algorithm over sorted (chrom, pos) coordinates. Variants whose REF spans intersect receive a shared multi_allelic_group ID, and their validation_status is appended with _MULTI_ALLELIC (e.g., PASS_MULTI_ALLELIC).

Phase 2: Sibling ALT Exclusion

During counting, reads classified as REF for a variant are additionally checked against all sibling variants in the same group. For each sibling, the full check_allele_with_qual() pipeline (including CIGAR reconstruction and SW alignment) determines if the read actually carries the sibling's ALT allele. If so, the read is excluded from REF for the current variant.

This prevents systematic REF inflation at multi-allelic loci, preserving unbiased VAF estimation.


Dynamic SW Gap Penalties

Phase 3's Smith-Waterman aligners use adaptive affine gap penalties tuned by the local repeat context:

Context gap_extend Rationale
Stable DNA (repeat_span < 10bp) -1 Standard penalty — prevents spurious gap extension
Tandem repeat (repeat_span ≥ 10bp) 0 (free) Absorbs polymerase slippage noise — prevents undercounting in MSI regions

The repeat_span is computed during normalization using find_tandem_repeat() and stored on the Variant struct. gap_open remains fixed at -5 in all cases.

MSI-High Tumors

In microsatellite-unstable tumors, insertions/deletions within long homopolymer or dinucleotide repeats are common. The free gap extension ensures these slippage-like variants get correctly classified rather than artificially rejected by high gap penalties.


Limitations

  1. Windowed scan range — Indels shifted beyond the context padding from their expected position won't be detected by the CIGAR-based check. Phase 3 SW can catch some of these via ref_context, but only if the read shows evidence (indels/clips) near the variant. Adaptive context padding (enabled by default) utilizes a multi-anchor footprint sweep to dramatically widen the padded detection bounds natively for INDEL clusters.

  2. Score margin ≥ 2 — The SW margin is fixed at 2 points to prevent ambiguous calls. Reads failing to achieve definitive spacing are routed to neither — they contribute to DP but not to RD or AD, preserving unbiased VAF.

  3. Soft-clip recovery — Phase 1 includes soft-clipped bases that overlap the variant window, but only when ref_pos is within the variant region. Soft clips at the edge of reads far from the variant are not considered.

  4. MNP strict matching — MNP strict matching is all-or-nothing, but inconclusive reads fall back to the complex variant classification chain (Phase 2 → 2.5 → 3). The fallback rescues reads with low-quality bases or partial matches that strict matching would reject.

  5. Incomplete variant definitions — When the MAF/VCF represents a complex event incompletely (e.g., TCC→CT omitting an adjacent SNV), reads may carry a different CIGAR signature than expected. The dual-trigger local fallback in Phase 3 mitigates this by soft-clipping "frameshifted flanks" caused by the definition mismatch.