Input Formats¶
gbcms accepts VCF and MAF files as variant input.
VCF (Variant Call Format)¶
Standard VCF format with required fields:
Requirements¶
- Tab-separated
- #CHROM, POS, REF, ALT columns required
- 1-based positions
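The requirements above can be checked before handing a file to gbcms. The following is a minimal sketch, assuming an uncompressed VCF read as text; the function name check_vcf_header is hypothetical and is not part of gbcms.

```python
REQUIRED_VCF_COLUMNS = ("#CHROM", "POS", "REF", "ALT")


def check_vcf_header(lines):
    """Return the required columns missing from the VCF column header line.

    `lines` is any iterable of text lines, for example an open file handle.
    """
    for line in lines:
        if line.startswith("##"):
            continue  # meta-information lines come before the column header
        if line.startswith("#CHROM"):
            # Splitting on tabs only: a space-separated header will not match.
            columns = line.rstrip("\r\n").split("\t")
            return [c for c in REQUIRED_VCF_COLUMNS if c not in columns]
        break  # the first non-meta line is not a column header
    return list(REQUIRED_VCF_COLUMNS)


example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr17\t7577120\t.\tC\tT\t.\t.\t.\n",
]
print(check_vcf_header(example))
```

An empty list means all four required columns were found in a tab-separated header.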
MAF (Mutation Annotation Format)¶
Standard MAF format with required columns:
Hugo_Symbol Chromosome Start_Position End_Position Reference_Allele Tumor_Seq_Allele2
TP53 chr17 7577120 7577120 C T
KRAS chr12 25398284 25398284 G A
Required Columns¶
| Column | Description |
|---|---|
| Chromosome | Chromosome name |
| Start_Position | 1-based start position |
| Reference_Allele | Reference allele |
| Tumor_Seq_Allele2 | Alternate allele |
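To illustrate how the required columns map onto a variant record, here is a minimal sketch that reads a tab-separated MAF with Python's standard csv module. The function name read_maf is hypothetical, and this is not how gbcms itself parses MAF input.

```python
import csv
import io

REQUIRED_MAF_COLUMNS = (
    "Chromosome",
    "Start_Position",
    "Reference_Allele",
    "Tumor_Seq_Allele2",
)


def read_maf(handle):
    """Yield (chrom, start, ref, alt) tuples from a tab-separated MAF handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_MAF_COLUMNS if c not in header]
    if missing:
        raise ValueError(f"MAF is missing required columns: {missing}")
    for row in reader:
        yield (
            row["Chromosome"],
            int(row["Start_Position"]),  # 1-based start position
            row["Reference_Allele"],
            row["Tumor_Seq_Allele2"],
        )


maf_text = (
    "Hugo_Symbol\tChromosome\tStart_Position\tEnd_Position\t"
    "Reference_Allele\tTumor_Seq_Allele2\n"
    "TP53\tchr17\t7577120\t7577120\tC\tT\n"
    "KRAS\tchr12\t25398284\t25398284\tG\tA\n"
)
for record in read_maf(io.StringIO(maf_text)):
    print(record)
```

Extra columns such as Hugo_Symbol and End_Position are simply ignored by this sketch.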
MAF Indel Normalization¶
MAF represents indels by writing a dash (-) in place of the empty allele, while gbcms internally uses VCF-style anchor-based coordinates. When a MAF file contains insertions (Reference_Allele = -) or deletions (Tumor_Seq_Allele2 = -), gbcms automatically converts them at input time.
Reference FASTA Required
MAF indel conversion requires --fasta to fetch the anchor base from the reference genome. Without it, indel variants cannot be normalized and will be skipped.
flowchart TD
MAF(["📄 MAF Row"]):::start --> Check{"REF or ALT is '-'?"}
Check -->|"No — SNP/MNP"| Direct["VCF POS = Start_Position"]
Check -->|"Yes — Indel"| Type{"Which is '-'?"}
Type -->|"REF = '-' (Insertion)"| InsResult["POS = Start_Position\nAnchor @ Start_Position (from FASTA)\nREF = anchor\nALT = anchor + inserted seq"]
Type -->|"ALT = '-' (Deletion)\n-1 for anchor"| DelResult["POS = Start_Position − 1\nAnchor @ Start_Position−1 (from FASTA)\nREF = anchor + deleted seq\nALT = anchor"]
Direct --> Out(["🧬 Internal Variant"]):::pass
InsResult --> Out
DelResult --> Out
classDef start fill:#9b59b6,color:#fff,stroke:#7d3c98,stroke-width:2px;
classDef pass fill:#27ae60,color:#fff,stroke:#1e8449,stroke-width:2px;
Insertion Example¶
Insert TG after chr1:100 (where the reference base at position 100 is A):
| Field | MAF | VCF (internal) |
|---|---|---|
| Position | Start_Position = 100 | POS = 100 |
| REF | - | A (fetched from FASTA) |
| ALT | TG | ATG (anchor + inserted seq) |
Deletion Example¶
Delete CG at chr1:101–102 (where the reference base at position 100 is A):
| Field | MAF | VCF (internal) |
|---|---|---|
| Position | Start_Position = 101 (first deleted base) | POS = 100 (anchor) |
| REF | CG | ACG (anchor + deleted seq) |
| ALT | - | A (anchor only) |
Position Shift for Deletions
For insertions, Start_Position already points to the anchor base. For deletions, Start_Position points to the first deleted base, so gbcms shifts back by one position to find the anchor.
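The conversion rules above can be sketched in a few lines of Python. This is an illustration of the documented rules, not gbcms's own code: the function name maf_to_vcf_style and the fetch_base callable are hypothetical, the chromosome is omitted for brevity, and a small dictionary stands in for the reference FASTA.

```python
def maf_to_vcf_style(start, ref, alt, fetch_base):
    """Convert MAF-style alleles to anchor-based (pos, ref, alt).

    fetch_base(pos) is a hypothetical callable returning the reference
    base at a 1-based position on the variant's chromosome.
    """
    if ref == "-":
        # Insertion: Start_Position already points at the anchor base.
        anchor = fetch_base(start)
        return start, anchor, anchor + alt
    if alt == "-":
        # Deletion: Start_Position is the first deleted base, so the
        # anchor is one position to the left.
        pos = start - 1
        anchor = fetch_base(pos)
        return pos, anchor + ref, anchor
    # SNP/MNP: coordinates and alleles pass through unchanged.
    return start, ref, alt


# Toy reference for chr1:99-102, matching the examples above (100 = A).
toy_reference = {99: "T", 100: "A", 101: "C", 102: "G"}
fetch = toy_reference.__getitem__

print(maf_to_vcf_style(100, "-", "TG", fetch))  # insertion example
print(maf_to_vcf_style(101, "CG", "-", fetch))  # deletion example
```

The two calls reproduce the insertion and deletion tables: (100, 'A', 'ATG') and (100, 'ACG', 'A').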
Variant Left-Normalization¶
gbcms automatically left-aligns indels and complex variants during the preparation step. For full details on the normalization algorithm, homopolymer decomposition detection, and REF validation, see Variant Normalization.
Left-Align Your Variants
Inconsistently normalized variants reduce the effectiveness of windowed indel detection. While the ±5bp window will catch most aligner-shifted indels, left-alignment ensures the anchor position is consistent with standard conventions.
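For readers who want to left-align variants before submitting them, the following sketch shows the widely used trim-and-extend approach to left-normalization. It is a generic illustration under stated assumptions, not the algorithm gbcms implements (see Variant Normalization for that); the function name left_align and the fetch_base callable are hypothetical.

```python
def left_align(pos, ref, alt, fetch_base):
    """Left-align an anchor-based variant (1-based pos, ref, alt).

    fetch_base(p) is a hypothetical callable returning the reference
    base at 1-based position p.
    """
    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base from both alleles.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both one base to the left.
        if not ref or not alt:
            if pos <= 1:
                raise ValueError("cannot extend past the start of the contig")
            pos -= 1
            base = fetch_base(pos)
            ref, alt = base + ref, base + alt
            changed = True
    # Trim shared leading bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Toy contig "GGCACACAT": the CA repeat spans positions 3-8.
contig = "GGCACACAT"


def fetch(p):
    return contig[p - 1]


# Deleting the last CA copy (anchor at 6) shifts to the first copy (anchor at 2).
print(left_align(6, "ACA", "A", fetch))
```

In a repeat, every placement of the deletion describes the same resulting sequence; left-alignment picks the leftmost one so that equivalent calls share a single anchor position.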
Reference FASTA¶
- Must have corresponding .fai index
- Chromosome names must match VCF/MAF
BAM Requirements¶
- Must have corresponding .bai index
- Coordinate-sorted
- Chromosome names must match reference
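A quick pre-flight check for the index files can catch a missing .fai or .bai before a run starts. This is a minimal sketch using only the standard library; the function name missing_indexes is hypothetical. It looks for the index next to the data file, accepting either sample.bam.bai or sample.bai for BAM. Whether gbcms accepts both BAM index namings is an assumption here.

```python
from pathlib import Path


def missing_indexes(fasta, bam):
    """Return a list of messages describing index files that were not found."""
    fasta, bam = Path(fasta), Path(bam)
    problems = []

    # FASTA index convention: ref.fa -> ref.fa.fai
    fai = Path(str(fasta) + ".fai")
    if not fai.exists():
        problems.append(f"missing FASTA index: {fai}")

    # BAM index conventions: sample.bam.bai or sample.bai
    # (assumption: either naming is acceptable)
    bai_candidates = [Path(str(bam) + ".bai"), bam.with_suffix(".bai")]
    if not any(p.exists() for p in bai_candidates):
        problems.append(f"missing BAM index: {bai_candidates[0]}")

    return problems


for message in missing_indexes("reference.fa", "sample.bam"):
    print(message)
```

The sketch does not verify coordinate sorting or chromosome-name agreement; those need to be checked against the BAM header and the FASTA index contents.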
Related¶
- DNA CLI Reference — Usage examples
- Variant Normalization — How variants are prepared
- Allele Classification — How counting works
- Glossary — Term definitions