Input Formats¶
gbcms accepts VCF and MAF files as variant input.
VCF (Variant Call Format)¶
Standard VCF format. The required fields are listed under Requirements.
Requirements¶
- Tab-separated
- #CHROM, POS, REF, ALT columns required
- 1-based positions
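The requirements above can be checked with a few lines of Python. This is a minimal sketch, not gbcms's own parser: the function name `read_vcf_variants` is hypothetical, and the sketch assumes the column header line starts with #CHROM and that meta lines start with ##.

```python
from typing import Iterable, Iterator, Tuple

REQUIRED_VCF_COLUMNS = ("#CHROM", "POS", "REF", "ALT")


def read_vcf_variants(lines: Iterable[str]) -> Iterator[Tuple[str, int, str, str]]:
    """Yield (chrom, pos, ref, alt) from VCF text lines (hypothetical helper)."""
    index = None
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("##"):
            continue  # blank line or meta-information line
        if line.startswith("#CHROM"):
            # Splitting on tabs only: a space-separated header fails the check.
            columns = line.split("\t")
            missing = [c for c in REQUIRED_VCF_COLUMNS if c not in columns]
            if missing:
                raise ValueError(f"missing required VCF columns: {missing}")
            index = {c: columns.index(c) for c in REQUIRED_VCF_COLUMNS}
            continue
        if index is None:
            raise ValueError("data line found before the #CHROM header line")
        fields = line.split("\t")
        pos = int(fields[index["POS"]])
        if pos < 1:
            raise ValueError(f"POS must be 1-based, got {pos}")
        yield fields[index["#CHROM"]], pos, fields[index["REF"]], fields[index["ALT"]]
```

A space-separated header is rejected because no required column name survives the tab split.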
MAF (Mutation Annotation Format)¶
Standard MAF format with required columns:
Hugo_Symbol Chromosome Start_Position End_Position Reference_Allele Tumor_Seq_Allele2
TP53 chr17 7577120 7577120 C T
KRAS chr12 25398284 25398284 G A
Required Columns¶
| Column | Description |
|---|---|
| Chromosome | Chromosome name |
| Start_Position | 1-based start position |
| Reference_Allele | Reference allele |
| Tumor_Seq_Allele2 | Alternate allele |
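As an illustration of the required columns, the following sketch reads them from tab-separated MAF text with Python's standard `csv` module. The helper name `read_maf_rows` is hypothetical, and skipping lines that start with # (such as a version line) is an assumption about the input, not documented gbcms behavior.

```python
import csv
import io

REQUIRED_MAF_COLUMNS = (
    "Chromosome",
    "Start_Position",
    "Reference_Allele",
    "Tumor_Seq_Allele2",
)


def read_maf_rows(handle):
    """Return a list of (chrom, start, ref, alt) tuples (hypothetical helper)."""
    # Assumption: comment lines begin with "#" and can be ignored.
    data_lines = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t")
    missing = [c for c in REQUIRED_MAF_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required MAF columns: {missing}")
    return [
        (
            row["Chromosome"],
            int(row["Start_Position"]),  # 1-based start position
            row["Reference_Allele"],
            row["Tumor_Seq_Allele2"],
        )
        for row in reader
    ]


# Usage with the two example rows from this section.
example = io.StringIO(
    "Hugo_Symbol\tChromosome\tStart_Position\tEnd_Position\t"
    "Reference_Allele\tTumor_Seq_Allele2\n"
    "TP53\tchr17\t7577120\t7577120\tC\tT\n"
    "KRAS\tchr12\t25398284\t25398284\tG\tA\n"
)
rows = read_maf_rows(example)
```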
MAF Indel Normalization¶
MAF represents indels with a - (dash) in place of the missing allele, while gbcms internally uses VCF-style anchor-based coordinates. When a MAF file contains insertions (Reference_Allele = -) or deletions (Tumor_Seq_Allele2 = -), gbcms automatically converts them at input time.
Reference FASTA Required
MAF indel conversion requires --fasta to fetch the anchor base from the reference genome. Without it, indel variants cannot be normalized and will be skipped.
flowchart TD
MAF([📄 MAF Row]):::start --> Check{REF or ALT is '-'?}
Check -->|No: SNP/MNP| Direct[Use Start_Position as VCF POS]
Check -->|Yes: Indel| Type{Which is '-'?}
Type -->|"REF = '-'"| Ins[Insertion]
Type -->|"ALT = '-'"| Del[Deletion]
subgraph Insertion
Ins --> InsAnchor["Anchor = Start_Position"]
InsAnchor --> InsFetch["Fetch anchor base from FASTA"]
InsFetch --> InsResult["REF = anchor base\nALT = anchor + inserted seq"]
end
subgraph Deletion
Del --> DelAnchor["Anchor = Start_Position − 1"]
DelAnchor --> DelFetch["Fetch anchor base from FASTA"]
DelFetch --> DelResult["REF = anchor + deleted seq\nALT = anchor base"]
end
Direct --> Out([🧬 Internal Variant]):::success
InsResult --> Out
DelResult --> Out
classDef start fill:#9b59b6,color:#fff,stroke:#7d3c98,stroke-width:2px;
classDef success fill:#27ae60,color:#fff,stroke:#1e8449,stroke-width:2px;
Insertion Example¶
Insert TG after chr1:100 (where the reference base at position 100 is A):
| Field | MAF | VCF (internal) |
|---|---|---|
| Position | Start_Position = 100 | POS = 100 |
| REF | - | A (fetched from FASTA) |
| ALT | TG | ATG (anchor + inserted seq) |
Deletion Example¶
Delete CG at chr1:101–102 (where the reference base at position 100 is A):
| Field | MAF | VCF (internal) |
|---|---|---|
| Position | Start_Position = 101 (first deleted base) | POS = 100 (anchor) |
| REF | CG | ACG (anchor + deleted seq) |
| ALT | - | A (anchor only) |
Position Shift for Deletions
For insertions, Start_Position already points to the anchor base. For deletions, Start_Position points to the first deleted base, so gbcms shifts back by one position to find the anchor.
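The conversion rules in this section can be expressed as a short Python sketch. It is an illustration of the documented behavior, not the gbcms implementation: `maf_to_vcf` and the `fetch_base` callback are hypothetical names, and a toy in-memory reference stands in for the FASTA. Returning `None` models an indel being skipped when no reference is available.

```python
from typing import Callable, Optional, Tuple

# Hypothetical callback type: (chromosome, 1-based position) -> reference base.
FetchBase = Callable[[str, int], str]


def maf_to_vcf(
    chrom: str,
    start_position: int,
    ref: str,
    alt: str,
    fetch_base: Optional[FetchBase] = None,
) -> Optional[Tuple[int, str, str]]:
    """Convert MAF alleles to anchor-based (pos, ref, alt); None means skipped."""
    if ref == "-" and alt == "-":
        raise ValueError("REF and ALT cannot both be '-'")
    if ref != "-" and alt != "-":
        # SNP/MNP: Start_Position is used directly as the VCF POS.
        return start_position, ref, alt
    if fetch_base is None:
        # No reference available: the indel cannot be normalized.
        return None
    if ref == "-":
        # Insertion: Start_Position already points at the anchor base.
        anchor_pos = start_position
        anchor = fetch_base(chrom, anchor_pos)
        return anchor_pos, anchor, anchor + alt
    # Deletion: Start_Position is the first deleted base, so shift back by one.
    anchor_pos = start_position - 1
    anchor = fetch_base(chrom, anchor_pos)
    return anchor_pos, anchor + ref, anchor


# Toy reference: position 100 is A, followed by C and G at 101 and 102.
toy_reference = {"chr1": "N" * 99 + "ACG"}


def fetch_from_toy_reference(chrom: str, pos: int) -> str:
    return toy_reference[chrom][pos - 1]  # convert 1-based to 0-based


insertion = maf_to_vcf("chr1", 100, "-", "TG", fetch_from_toy_reference)
deletion = maf_to_vcf("chr1", 101, "CG", "-", fetch_from_toy_reference)
```

With the toy reference, `insertion` and `deletion` reproduce the two worked examples above: (100, "A", "ATG") and (100, "ACG", "A").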
Variant Left-Normalization¶
gbcms automatically left-aligns indels and complex variants during the preparation step. For full details on the normalization algorithm, homopolymer decomposition detection, and REF validation, see Variant Normalization.
Left-Align Your Variants
Inconsistently normalized variants reduce the effectiveness of windowed indel detection. While the ±5bp window will catch most aligner-shifted indels, left-alignment ensures the anchor position is consistent with standard conventions.
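For readers who want to left-align variants before passing them to gbcms, the sketch below shows the widely used trim-and-extend approach to left-alignment. It is not taken from gbcms; the function name `left_align` is hypothetical, and the reference is assumed to be available as a plain string for one chromosome.

```python
from typing import Tuple


def left_align(ref_seq: str, pos: int, ref: str, alt: str) -> Tuple[int, str, str]:
    """Left-align one variant; pos is 1-based, ref_seq is the chromosome sequence."""
    if ref == alt:
        raise ValueError("REF and ALT are identical")
    changed = True
    while changed:
        changed = False
        # Trim a shared trailing base from both alleles.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both one base to the left.
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot extend left of the first reference base")
            pos -= 1
            base = ref_seq[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Trim shared leading bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# A CA deletion inside the repeat GCACACAT, initially anchored at position 5.
shifted = left_align("GCACACAT", 5, "ACA", "A")
```

In this example the deletion slides through the CA repeat and ends up anchored on the G at position 1.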
Reference FASTA¶
- Must have corresponding .fai index
- Chromosome names must match VCF/MAF
BAM Requirements¶
- Must have corresponding .bai index
- Coordinate-sorted
- Chromosome names must match reference
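The index and naming requirements for the reference FASTA and BAM files can be checked before a run. The sketch below uses only the standard library; all function names are hypothetical. It assumes the FASTA index sits next to the FASTA as `<fasta>.fai` with the contig name in the first tab-separated column, and that a BAM index may be named either `sample.bam.bai` or `sample.bai`. Whether gbcms accepts both index names is not stated here, so treat that as an assumption. Checking sort order and BAM header names would need a BAM reader and is left out.

```python
from pathlib import Path
from typing import Iterable, List, Optional


def fai_contigs(fasta_path: str) -> List[str]:
    """Return contig names from the .fai index next to a FASTA file."""
    fai = Path(str(fasta_path) + ".fai")
    if not fai.exists():
        raise FileNotFoundError(f"missing FASTA index: {fai}")
    with fai.open() as handle:
        # The first tab-separated column of each .fai line is the contig name.
        return [line.split("\t")[0] for line in handle if line.strip()]


def find_bam_index(bam_path: str) -> Optional[Path]:
    """Return the BAM index path if one exists, else None."""
    bam = Path(bam_path)
    candidates = [Path(str(bam) + ".bai"), bam.with_suffix(".bai")]
    for candidate in candidates:
        if candidate.exists():
            return candidate
    return None


def unmatched_chromosomes(variant_chroms: Iterable[str], contigs: Iterable[str]) -> List[str]:
    """Return variant chromosome names that are absent from the reference."""
    return sorted(set(variant_chroms) - set(contigs))
```

A common mismatch this catches is a variant file using 17 while the reference uses chr17.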
Related¶
- CLI Run Command — Usage examples
- Variant Normalization — How variants are prepared
- Allele Classification — How counting works
- Glossary — Term definitions