RNA Splice-Junction Handling¶
The Problem: Alignment Artifacts at Exon Boundaries¶
RNA-seq aligners (STAR, HISAT2) produce splice-aware alignments with
CIGAR N (RefSkip) operations spanning introns. Bases near these splice
junctions may be misaligned — the aligner sometimes extends a few bases
into the intron ("overhang"), creating false variant calls at exon
boundaries.
Reference: ...EXON1=====][INTRON...intron...INTRON][=====EXON2...
Read: ...EXON1=====XX YYYY=====EXON2...
^^ ^^^^
overhang (2bp) overhang (4bp)
may be misaligned may be misaligned
Community Approach: GATK SplitNCigarReads¶
GATK addresses this by physically modifying the BAM before variant calling:
- Splits reads at each CIGAR
Nboundary into separate alignment records (e.g.,45M 5000N 55M→ Fragment 145M+ Fragment 255M). - Hard-clips overhangs if they contain >1 mismatch within a ≤40bp window.
- Reassigns MAPQ 255 → 60 for downstream caller compatibility.
This produces a clean BAM where no read contains N operators, at the
cost of losing the original read structure.
gbcms Approach: Dual-Mechanism (No BAM Modification)¶
gbcms handles splice-junction artifacts without modifying the BAM, using two complementary mechanisms:
1. Consensus Intron Snipping (automatic in RNA mode)¶
apply_consensus_splicing() in rna.rs modifies the reference
haplotype, not the reads:
- Examines local read pileup for consensus introns (>50% threshold)
- Removes intronic bases from the reference haplotype used by PairHMM
- PairHMM then evaluates reads against a mature mRNA-like reference
This is arguably more elegant than SplitNCigarReads because reads retain their original structure while the reference adapts to the biology.
2. Heuristic BAQ Near Splice Junctions (--apply-baq)¶
apply_heuristic_baq() in baq.rs penalizes base qualities within
5bp of CIGAR N/RefSkip operations:
- Subtracts 20 from BQ at bases near splice boundaries
- Bases that drop below
--min-baseq(default 20) are effectively filtered from variant evidence - This mimics GATK SplitNCigarReads' overhang clipping without removing bases — a softer approach that preserves information
Default behavior:
| Mode | --apply-baq default |
Rationale |
|---|---|---|
gbcms dna |
False (off) |
DNA BAMs typically go through BQSR or consensus calling (fgbio/Marianas), which already recalibrates BQ |
gbcms rna |
True (on) |
RNA BAMs typically do not go through BQSR or consensus calling; raw sequencer BQ values are preserved |
When to Use --no-baq in RNA Mode¶
Disable BAQ if your upstream workflow already handles splice-boundary quality adjustment:
- SplitNCigarReads: If you run GATK SplitNCigarReads before gbcms,
the overhang problem is already resolved. Use
--no-baqto avoid double-penalization. - BQSR: If your pipeline runs GATK Base Quality Score Recalibration, base qualities are already machine-learning-calibrated. BAQ on top may over-penalize.
- Consensus-based deduplication: If your pipeline uses fgbio or similar consensus calling (not typical for RNA), BQ values reflect consensus agreement. BAQ is unnecessary.
When BAQ Is Most Valuable¶
BAQ provides the most benefit when the upstream workflow does not run:
- GATK SplitNCigarReads (overhang clipping)
- GATK BQSR (quality recalibration)
- Consensus-based deduplication (e.g., fgbio)
In these pipelines, raw sequencer BQ values are preserved all the way
to gbcms. A base near a splice boundary may have BQ 30+ even if it is
a misalignment artifact. The -20 BAQ penalty brings it below the
min_baseq=20 threshold, effectively filtering it from variant
evidence.
Technical Details¶
BAQ Parameters¶
| Parameter | Value | Description |
|---|---|---|
BAQ_RADIUS |
5 bp | Bases within 5bp of indel/splice boundary are penalized |
BAQ_PENALTY |
20 | Subtracted from BQ (clamped to 0, never negative) |
Performance¶
BAQ is applied lazily — apply_heuristic_baq() returns None for
reads without indels or splice junctions (no allocation). For RNA BAMs
where most reads have splice junctions, the cost is one CIGAR walk per
read plus a quality vector copy, which is negligible compared to PairHMM
alignment.
Trace Logging¶
When --trace is enabled, BAQ logs per-read adjustments:
Defense in Depth: Splice Bleed Protection¶
"Splice bleed" occurs when a read has only 2–3 bases on one side of a splice junction. The aligner may force those bases to align to the wrong exon, creating false variant evidence at exon boundaries.
Worst-Case Example¶
A known SNP (G→A) sits at the first base of Exon 2:
Exon boundary
│
│ Known SNP: G→A
│ ↓
Genome: ...INTRON────┤──[G]CGTACGTACGT...
│ Exon 2
Read A: ════╗══[A]═══════════════ → ALT ✅ real variant
╚ (CIGAR: 10M 3000N 40M)
Read B: ════════════─╗──[A]── → ALT 🟡 splice bleed!
╚ (CIGAR: 30M 3000N 3M)
↑
Only 3 bases in Exon 2.
The [A] may be from intronic
sequence, not a real ALT.
Read C: ═════════════╗══[G]═══════════════ → REF ✅
╚ (CIGAR: 20M 3000N 30M)
How gbcms Catches Read B: 5 Defense Layers¶
For a splice-bled base to produce a false positive, all 5 layers must fail simultaneously — which is exceedingly rare:
┌────────────────────────────────────────────────────────────────────┐
│ Defense Layers │
├────────────────────────────────────────────────────────────────────┤
│ │
│ Layer 1: MAPQ filter (--min-mapq, default 1 for RNA) │
│ └── STAR gives MAPQ=0 to reads it can't confidently place. │
│ Reads with very short exonic overlap are often multi-mapped. │
│ │
│ Layer 2: Base quality filter (--min-baseq, default 20) │
│ └── Bled bases typically receive low BQ from the aligner because │
│ the alignment evidence is ambiguous near the splice site. │
│ │
│ Layer 3: PairHMM quality-weighted classification │
│ └── Even if a bled base passes the BQ filter, PairHMM evaluates │
│ the FULL read against REF/ALT haplotypes. The 2–3 uncertain │
│ bases are outweighed by 30+ properly aligned bases on the │
│ other side. LLR typically falls below threshold → NEITHER. │
│ │
│ Layer 4: splice_spanning_count (tracked in output) │
│ └── Downstream QC can flag variants where ALL ALT evidence comes │
│ from splice-spanning reads (rna_splice_spanning in MAF/VCF). │
│ │
│ Layer 5: BAQ splice-proximity penalty (--apply-baq, on for RNA) │
│ └── Subtracts 20 from BQ within 5bp of splice junctions. A base │
│ with BQ 30 near a splice site becomes BQ 10, which is below │
│ --min-baseq 20 → excluded from allele evidence entirely. │
│ │
└────────────────────────────────────────────────────────────────────┘
Why All 5 Layers Must Fail¶
For a splice-bled base to cause a false ALT call:
- The variant must sit within 1–2 bp of the exon boundary
- The aligner must give the bled base MAPQ ≥ 1 (Layer 1 pass)
- The aligner must give the bled base BQ ≥ 20 (Layer 2 pass)
- BAQ must fail to bring BQ below 20 — but BAQ subtracts 20 from bases within 5bp of the splice junction, so a base needs BQ ≥ 40 to survive (Layer 5 pass)
- PairHMM must give the read LLR ≥ 2.3 despite only 2–3 bases of evidence on the bled side (Layer 3 pass)
- The wrong base must happen to match the known ALT allele (1/3 chance for SNPs)
In practice, Layers 2 and 5 together require BQ ≥ 40 at the bled position, which is exceptionally unlikely for a 2–3 base overhang.
abbreviations