Variant Normalization¶

How gbcms prepares variants before counting — validation, left-alignment, and homopolymer decomposition.

Visual Overview

Normalization pipeline poster — The 5-step variant normalization pipeline — click to enlarge

Overview¶

Before counting any reads, every variant passes through the preparation pipeline in prepare_variants(). This ensures consistent, biologically correct coordinates regardless of how the input was generated.

flowchart LR
    Input([📄 Input Variant]) --> S1[MAF Anchor Resolution]
    S1 --> S2[REF Validation]
    S2 --> S3[Left-Alignment]
    S3 --> S4[ref_context Fetch]
    S4 --> S5[Homopolymer Detection]
    S5 --> Output([✅ PreparedVariant])

    classDef step fill:#3498db,color:#fff,stroke:#2471a3,stroke-width:2px;
    class S1,S2,S3,S4,S5 step;

Use mouse to pan and zoom

Each input variant produces exactly one PreparedVariant, even if validation fails — this guarantees output row count always matches input.

Step 1: MAF Anchor Resolution¶

MAF files represent indels using - dashes. gbcms converts them to VCF-style anchor-based coordinates, which requires fetching the anchor base from the reference FASTA.

Indel Type	MAF REF	MAF ALT	VCF REF	VCF ALT	Anchor Source
Insertion	`-`	`TG`	`A`	`ATG`	Fetch base at `Start_Position`
Deletion	`CG`	`-`	`ACG`	`A`	Fetch base at `Start_Position − 1`

When Is This Step Triggered?

Only for MAF input (is_maf=true) when REF or ALT is -, or when REF and ALT have different lengths (complex indels). VCF input skips this step entirely.

FETCH_FAILED

If the anchor base cannot be fetched (e.g., chromosome not in FASTA), the variant's validation_status is set to FETCH_FAILED and it is excluded from counting.

Step 2: REF Validation¶

The REF allele is compared against the reference genome at the stated position. An exact match passes immediately. If the exact match fails, a similarity score is computed — variants with ≥90% base match are corrected to the FASTA REF and proceed to counting.

flowchart TD
    Fetch["Fetch ref bases at pos"] --> Exact{"Exact match?"}
    Exact -- "Yes" --> Pass([✅ PASS]):::pass
    Exact -- "No" --> Sim["Compute similarity\n(matching bases / max length)"]
    Sim --> Thresh{"≥ 90%?"}
    Thresh -- "Yes" --> Corrected(["⚠️ PASS_WARN_REF_CORRECTED\n(use FASTA REF)"]):::warn
    Thresh -- "No" --> Fail([❌ REF_MISMATCH]):::fail

    classDef pass fill:#27ae60,color:#fff;
    classDef warn fill:#f39c12,color:#fff;
    classDef fail fill:#e74c3c,color:#fff;

Use mouse to pan and zoom

When a variant receives PASS_WARN_REF_CORRECTED, the MAF's REF allele is replaced with the FASTA sequence for all downstream steps (left-alignment, haplotype construction). The original MAF REF is preserved in original_ref for auditing.

Real-World Example: EGFR Exon 19 Deletion

A 27bp complex EGFR deletion had 26/27 bases matching the FASTA (96%), with only the last base differing due to a MAF annotation artifact. Without tolerance, this variant was silently rejected with zero counts. With tolerant validation, it receives PASS_WARN_REF_CORRECTED and produces valid counts.

Common Causes of REF_MISMATCH

Wrong reference genome version (GRCh37 vs GRCh38)
Chromosome naming mismatch (chr1 vs 1)
Variant was called against a different reference build
Upstream normalization changed coordinates incorrectly
MAF annotation artifact (trailing base error) — now handled by tolerant validation

Step 3: Left-Alignment¶

For indels and complex variants, gbcms applies bcftools-style left-alignment to ensure consistent positioning in repetitive regions.

flowchart TD
    Input["Indel: chr1:105 ACG→A
(in a CGCG repeat)"]
    Fetch["Fetch ±100bp flanking sequence"]
    Align["Shift left through repeat"]
    Result["chr1:101 ACG→A
(leftmost position)"]

    Input --> Fetch --> Align --> Result

Use mouse to pan and zoom

Parameter	Value	Description
`norm_window`	100bp initial, up to 2500bp	Initial window for leftward search. Doubles on edge-hit (exponential retry).
`max_norm_window`	2500bp	Safety cap for centromeric/telomeric regions
Trigger	`ref_len ≠ alt_len` or both >1bp	SNPs are never left-aligned

Dynamic Window Expansion

If a variant shifts all the way to the window edge during left-alignment, it may not have fully converged. The engine automatically doubles the window (100 → 200 → 400 → ... → 2500bp) and retries. This ensures correct normalization even for variants in massive tandem repeats (e.g., centromeric regions) without penalizing the common case.

After alignment, the variant type is re-detected based on the new allele lengths:

Condition	Assigned Type
`ref_len == 1 && alt_len == 1`	SNP
`ref_len == 1 && alt_len > 1`	INSERTION
`ref_len > 1 && alt_len == 1`	DELETION
Otherwise	COMPLEX

Debugging Normalization

Use gbcms normalize to see exactly how each variant was transformed. The output TSV shows original and normalized coordinates side by side, plus whether the variant was modified (was_normalized=true).

Step 4: ref_context Fetch (Adaptive)¶

For indels and complex variants, a flanking reference sequence is fetched around the (possibly normalized) position. This context serves two purposes:

Windowed indel detection — Safeguard 3 uses ref_context to verify that shifted indels delete/insert the expected reference bases
Phase 3 Smith-Waterman alignment — the context is used to build REF and ALT haplotypes for dual-haplotype alignment

Adaptive Context Padding¶

By default, gbcms dynamically adjusts the context padding based on nearby tandem repeats. This prevents ambiguous SW alignments in repetitive regions where a fixed padding might consist entirely of the repeat motif.

Multi-Anchor Footprint Sweep¶

Historically, scanning only the variant's strict anchor base caused miscalculations for left-aligned INDELs, as the trailing INDEL mutations generated by BWA mapping often fell outside the default 5bp padding window.

The engine now performs a Multi-Anchor Footprint Sweep. It scans across the entire variant footprint (pos through pos + ref_len) looking for tandem repeats, and extracts the maximum necessary padding to clear the repeat entirely.

flowchart LR
    Scan["Scan Footprint\n(pos to pos+ref_len)"] --> Detect{"Repeat\nfound?"}
    Detect -- "Yes (span=S)" --> Formula["padding = max(default, S/2 + 3)\ncapped at 50bp"]
    Detect -- "No" --> Default["padding = default (5bp)"]
    Formula --> Fetch["Fetch ref_context"]
    Default --> Fetch

Use mouse to pan and zoom

Parameter	CLI Flag	Default	Description
`context_padding`	`--context-padding`	5	Minimum flanking bases on each side (range: 1–50)
`adaptive_context`	`--adaptive-context`	✅ enabled	Dynamically increase padding across the footprint in repeat regions

Example: Poly-A Insertion¶

Variant: chr7:100 A→AAAAAAAAAAAAA (12bp poly-A insertion)
Footprint: chr7:100-101
Flanking: ...AAAAAAAAAA[anchor]AAAAAAAAAA...

Sweep Detected: homopolymer max span=20bp → padding = max(5, 20/2+3) = 13bp

Fixed padding=5:     AAAAA + REF/ALT + AAAAA  (all A's — ambiguous, BWA alignments drop)
Adaptive padding=13: GCTTAAAAA... + REF/ALT + ...AAAAATTGAC  (anchored)

Supported Repeat Types¶

Type	Example	Motif Size	Detection Range
Homopolymer	`AAAAAAA`	1bp	≥2 copies
Dinucleotide	`CACACACA`	2bp	≥2 copies
Trinucleotide	`CAGCAGCAG`	3bp	≥2 copies
Up to hexanucleotide	`GGCCCCGGCCCC`	4–6bp	≥2 copies

Disabling Adaptive Context

Use --no-adaptive-context to fall back to fixed padding. This may be useful for strict reproducibility against older versions.

SNPs Don't Need Context

SNPs are checked by a simple single-base comparison, so ref_context is not fetched for them. This saves unnecessary FASTA lookups.

Step 5: Homopolymer Decomposition Detection¶

Some variant callers merge nearby events (e.g., a 1bp deletion + SNV in a homopolymer) into a single complex variant with an inflated deletion size. For example, CCCCCC→T when the reads actually show CCCCCC→CCCCT (a D(1) + C→T).

Detection Criteria¶

All four conditions must be met:

flowchart TD
    C1{"REF is homopolymer?
(all same base, ≥3bp)"} -- Yes --> C2{"Net deletion?
(ref_len > alt_len)"}
    C2 -- Yes --> C3{"ALT last base ==
next ref base after REF span?"}
    C3 -- Yes --> C4{"ALT last base ≠
homopolymer base?"}
    C4 -- Yes --> Detected([🔍 Decomposition detected]):::detect

    C1 -- No --> Skip([Skip]):::skip
    C2 -- No --> Skip
    C3 -- No --> Skip
    C4 -- No --> Skip

    classDef detect fill:#f39c12,color:#fff,stroke:#d68910,stroke-width:2px;
    classDef skip fill:#95a5a6,color:#fff,stroke:#7f8c8d,stroke-width:2px;

Use mouse to pan and zoom

Corrected Allele Construction¶

When detected, a corrected ALT allele is built by keeping most of the homopolymer intact and appending the SNV base:

Example:  REF = CCCCCC, ALT = T, next_ref_base = T

Step 1: homopolymer base = C
Step 2: Confirmed: alt_last (T) == next_ref_base (T), T ≠ C
Step 3: corrected_alt = C × (ref_len - 1) + alt_last = CCCCC + T = CCCCT

Dual-Counting Flow¶

The corrected variant is stored in PreparedVariant.decomposed_variant. During counting, both the original and corrected alleles are counted independently. The result with the higher alt_count wins:

flowchart TD
    CountOrig["count_bam(original: CCCCCC→T)"]
    CountDecomp["count_bam(corrected: CCCCCC→CCCCT)"]
    Compare{"corrected AD > original AD?"}
    UseDecomp["Return corrected counts"]
    UseOrig["Return original counts"]
    FlagWarn["validation_status =
PASS_WARN_HOMOPOLYMER_DECOMP"]
    FlagPass["validation_status = PASS"]

    CountOrig --> Compare
    CountDecomp --> Compare
    Compare -- "Yes" --> UseDecomp --> FlagWarn
    Compare -- "No" --> UseOrig --> FlagPass

Use mouse to pan and zoom

Self-Validating

The warning flag only appears when reads actually support the decomposed representation better. If the original variant gets more ALT support, it's used as-is with a normal PASS status. This makes the approach safe — it never blindly corrects variants.

Real-World: SOX2

SOX2 at chr17:181430901: CCCCCC→T (6bp→1bp, net −5bp). Original count: alt=3. Corrected CCCCCC→CCCCT count: alt=79. Corrected wins → validation_status = PASS_WARN_HOMOPOLYMER_DECOMP.

Validation Status Reference¶

Every PreparedVariant carries a validation_status string:

Status	Meaning	Counted?
`PASS`	REF matches FASTA exactly	✅
`PASS_WARN_REF_CORRECTED`	REF ≥90% match; corrected to FASTA REF	✅
`PASS_WARN_HOMOPOLYMER_DECOMP`	Passed, but corrected allele was used (see above)	✅
`PASS_MULTI_ALLELIC`	Passed, variant overlaps another variant at the same locus (sibling ALT exclusion active)	✅
`REF_MISMATCH`	REF allele <90% match against reference genome	❌
`FETCH_FAILED`	Could not fetch reference region	❌

Filtering Behavior

The pipeline filters on validation_status.startswith("PASS"), so all PASS and PASS_WARN_* variants proceed to counting. REF_MISMATCH and FETCH_FAILED variants are logged as rejected.

Limitations¶

Left-alignment only — gbcms left-aligns but does not right-align. If your input is right-aligned, the windowed scan (±5bp) may still catch it, but left-alignment is recommended.
Dynamic normalization window — Starting at 100bp, the window automatically doubles (up to 2500bp) when a variant shifts to the window edge. In practice, this covers even variants in centromeric/telomeric repeats. Variants shifted beyond 2500bp won't be corrected.
Homopolymer detection is conservative — only triggers for homopolymer REF ≥3bp with a clear D(n)+SNV pattern. Other miscollapsed events are not detected.

gbcms normalize — CLI command for standalone normalization
Input Formats — VCF/MAF coordinate conventions
Allele Classification — How reads are classified after normalization

Variant Normalization¶

Overview¶

Step 1: MAF Anchor Resolution¶

Step 2: REF Validation¶

Step 3: Left-Alignment¶

Step 4: ref_context Fetch (Adaptive)¶

Adaptive Context Padding¶

Multi-Anchor Footprint Sweep¶

Example: Poly-A Insertion¶

Supported Repeat Types¶

Step 5: Homopolymer Decomposition Detection¶

Detection Criteria¶

Corrected Allele Construction¶

Dual-Counting Flow¶

Validation Status Reference¶

Limitations¶

Related¶