Skip to content

Read FiltersΒΆ

Which reads are excluded before allele classification β€” the filter cascade, CLI flags, and defaults.

Visual Overview

Read filter and counting metrics poster
The read-filter cascade and counting metrics β€” click to enlarge

Filter CascadeΒΆ

Every read from the BAM passes through a filter cascade before being checked for allele support. Reads failing any enabled filter are discarded. The order matches the Rust engine implementation.

flowchart TD
    Start([πŸ“– Read from BAM]):::process

    subgraph Filters [Quality & Alignment Filters]
        direction TB
        F1{Duplicate?}:::decision
        F2{Secondary?}:::decision
        F3{Supplementary?}:::decision
        F4{QC Failed?}:::decision
        F5{Improper pair?}:::decision
        F6{Contains indel?}:::decision
        F7{MAPQ β‰₯ threshold?}:::decision
    end

    Drop{{❌ Discard Read}}:::discard
    Pass([βœ… Pass to Allele Checker]):::success

    Start --> F1

    F1 -- "Yes" --> Drop
    F1 -- "No" --> F2

    F2 -- "Yes" --> Drop
    F2 -- "No" --> F3

    F3 -- "Yes" --> Drop
    F3 -- "No" --> F4

    F4 -- "Yes" --> Drop
    F4 -- "No" --> F5

    F5 -- "Yes" --> Drop
    F5 -- "No" --> F6

    F6 -- "Yes" --> Drop
    F6 -- "No" --> F7

    F7 -- "Below threshold" --> Drop
    F7 -- "Pass" --> Pass

    classDef process fill:#9b59b6,color:#fff,stroke:#7d3c98,stroke-width:2px;
    classDef decision fill:#f39c12,color:#fff,stroke:#d68910,stroke-width:2px;
    classDef discard fill:#e74c3c,color:#fff,stroke:#c0392b,stroke-width:2px;
    classDef success fill:#27ae60,color:#fff,stroke:#1e8449,stroke-width:2px;
Use mouse to pan and zoom

Filter OptionsΒΆ

Filter CLI Flag Default SAM Flag Description
Duplicates --filter-duplicates On 0x400 PCR/optical duplicates
Secondary --filter-secondary Off 0x100 Secondary alignments
Supplementary --filter-supplementary Off 0x800 Chimeric/split alignments
QC Failed --filter-qc-failed Off 0x200 Platform QC failures
Improper Pair --filter-improper-pair Off 0x2 (inverted) Reads not properly paired
Indel reads --filter-indel Off CIGAR-based Any Ins or Del in CIGAR
MAPQ threshold --min-mapq 20 β€” Minimum mapping quality

Quality ThresholdsΒΆ

Beyond the read-level filters above, gbcms also applies base-level quality thresholds during allele classification:

Parameter CLI Flag Default Description
Min base quality --min-baseq 20 Bases below this are masked or rejected
Fragment consensus threshold --fragment-qual-threshold 10 Quality margin for R1/R2 conflict resolution

How Base Quality Is Used

The effect of --min-baseq varies by variant type:

  • SNP: Read is rejected if the base at the variant position has quality < threshold
  • MNP: Read is rejected if any base in the MNP region has quality < threshold
  • Insertion/Deletion (windowed scan): Inserted/deleted bases below threshold are masked (treated as wildcards) rather than rejecting the entire read
  • Complex (Phase 2): Low-quality bases are masked β€” they cannot vote for either allele
  • Complex (Phase 3 SW): Low-quality bases are replaced with N, which scores 0 against any base

Comparison with Original GBCMSΒΆ

Filter Non-Primary

The original GBCMS has a single --filter_non_primary flag. gbcms splits this into --filter-secondary and --filter-supplementary for finer control. Both default to off, matching the original behavior.

Feature Original GBCMS gbcms
Base quality filtering No threshold Default --min-baseq 20
Duplicate filtering Optional On by default
Non-primary filter Single flag Split: secondary + supplementary
Indel read filter Optional Optional (off by default)

Fetch WindowΒΆ

gbcms fetches reads from a dynamic window around each variant position β€” not just at the exact position. The window defaults to Β±5bp but scales with the variant's repeat span to capture indels that aligners shift beyond 5bp in long homopolymers and microsatellite regions. The formula is max(5, repeat_span + 2).

Variant: chr1:100 A→ATG (insertion, repeat_span=0)
BAM fetch window: [95, 106)    ← Β±5bp (default)

Variant: chr1:200 A→AT (insertion in 15bp poly-A, repeat_span=15)
BAM fetch window: [183, 206)   ← Β±17bp (repeat_span + 2)

Anchor Overlap Gate

Although reads are fetched from the wider window, only reads that overlap the variant's anchor position (or are classified as REF/ALT via shifted indel matching) contribute to the DP count. This prevents DP inflation from reads that were brought in by the window padding but don't actually cover the variant.

Read A: [90─────100]        β†’ overlaps anchor at 100 β†’ counted in DP βœ…
Read B: [92──97]            β†’ doesn't overlap 100, not classified β†’ skipped ❌
Read C: [95──99] + shifted  β†’ doesn't overlap 100, but windowed scan found ALT β†’ counted βœ