Quick Start Guide¶
This guide shows you how to get started with py-gbcms using the standalone CLI for processing one or a few samples.
Processing many samples? Use the Nextflow Workflow instead for automatic parallelization on HPC clusters.
Prerequisites¶
- Python >= 3.10
- Rust toolchain (for installation from source)
- BAM files with index (.bai)
- Reference FASTA with index (.fai)
- Variants file (VCF or MAF)
Install via pip install py-gbcms or see the project README for detailed setup instructions.
Basic Usage¶
Single Sample¶
Count variants for one sample:
gbcms run \
--variants variants.vcf \
--bam sample1.bam \
--fasta reference.fa \
--output-dir results/
Output: results/sample1.vcf
Multiple Samples¶
Process multiple samples sequentially:
# Sample 1
gbcms run --variants variants.vcf --bam sample1.bam --fasta ref.fa --output-dir results/
# Sample 2
gbcms run --variants variants.vcf --bam sample2.bam --fasta ref.fa --output-dir results/
# Sample 3
gbcms run --variants variants.vcf --bam sample3.bam --fasta ref.fa --output-dir results/
Or use a BAM list file:
# Create bam_list.txt with:
# sample1 /path/to/sample1.bam
# sample2 /path/to/sample2.bam
gbcms run \
--variants variants.vcf \
--bam-list bam_list.txt \
--fasta reference.fa \
--output-dir results/
Note: The CLI processes samples sequentially. For parallel processing of many samples, use the Nextflow Workflow.
Common Options¶
Output Format¶
VCF (default):
gbcms run --variants variants.vcf --bam sample.bam --fasta ref.fa --output-dir results/ --format vcf
MAF:
gbcms run --variants variants.maf --bam sample.bam --fasta ref.fa --output-dir results/ --format maf
Custom Sample IDs¶
Override the sample name:
gbcms run \
--variants variants.vcf \
--bam MySampleID:sample.bam \
--fasta reference.fa \
--output-dir results/
Output: results/MySampleID.vcf
Output Suffix¶
Add suffix to output filenames:
gbcms run \
--variants variants.vcf \
--bam sample.bam \
--fasta reference.fa \
--output-dir results/ \
--suffix .genotyped
Output: results/sample.genotyped.vcf
Threading¶
Use multiple threads for processing:
gbcms run \
--variants variants.vcf \
--bam sample.bam \
--fasta reference.fa \
--output-dir results/ \
--threads 4
Quality Filters¶
Minimum mapping quality:
gbcms run --variants variants.vcf --bam sample.bam --fasta ref.fa --output-dir results/ --min-mapq 30
Minimum base quality:
gbcms run --variants variants.vcf --bam sample.bam --fasta ref.fa --output-dir results/ --min-baseq 20
Filter duplicates (default: enabled):
gbcms run --variants variants.vcf --bam sample.bam --fasta ref.fa --output-dir results/ --filter-duplicates
Filter secondary alignments:
gbcms run --variants variants.vcf --bam sample.bam --fasta ref.fa --output-dir results/ --filter-secondary
Complete Example¶
Process a sample with strict filtering:
gbcms run \
--variants variants.vcf \
--bam TumorSample:tumor.bam \
--fasta hg19.fa \
--output-dir genotyped_results/ \
--format vcf \
--suffix .genotyped \
--threads 8 \
--min-mapq 30 \
--min-baseq 20 \
--filter-duplicates \
--filter-secondary \
--filter-supplementary
Output: genotyped_results/TumorSample.genotyped.vcf
Using Docker¶
Run via Docker container:
docker run --rm -v $(pwd):/data ghcr.io/msk-access/py-gbcms:2.0.0 \
gbcms run \
--variants /data/variants.vcf \
--bam /data/sample.bam \
--fasta /data/reference.fa \
--output-dir /data/results/
Next Steps¶
- Many samples on HPC: See Nextflow Workflow
- Usage patterns: See Usage Overview