Skip to content

Fragment Extraction

Command: krewlyzer extract

Plain English

Extract converts your BAM file into cfDNA fragments with quality information. It's the first step for most analyses—think of it as "preprocessing" your sequencing data.

Key output: .bed.gz file containing fragment coordinates and GC correction factors.


Purpose

The extract module serves as the entry point for most analysis workflows. It processes a BAM file to extract valid cell-free DNA (cfDNA) fragments and saves them in a standardized, compressed BED format with GC content.


Processing Flowchart

flowchart LR
    BAM[BAM File] --> RUST[Rust Backend]
    REF[Reference FASTA] --> RUST
    GC_REF[GC Reference] --> RUST

    RUST --> BED["sample.bed.gz"]
    RUST --> META["metadata.json"]
    RUST --> FACTORS["correction_factors.csv"]

    subgraph "With --target-regions"
        TARGETS[Target BED] --> RUST
        RUST --> GC_OFF["GC model from OFF-target only"]
    end
Use mouse to pan and zoom

Biological Context

Raw sequencing data (BAM) contains reads that must be paired and filtered to reconstruct physical DNA fragments. This step standardizes the data, removing PCR duplicates and low-quality mappings, ensuring downstream analysis focuses on high-confidence unique molecules.


Usage

krewlyzer extract -i sample.bam -r hg19.fa -o output_dir/ [options]

Options

Option Short Type Default Description
--input -i PATH required Input BAM file
--reference -r PATH required Reference genome FASTA (indexed)
--output -o PATH required Output directory
--genome -G TEXT hg19 Genome build for GC assets
--assay -A TEXT Assay code (xs1/xs2) for auto-loading bundled targets
--target-regions -T PATH Target BED (panel mode: GC from off-target)
--skip-target-regions FLAG Disable panel mode even when --assay is specified
--gc-correct FLAG True Compute GC correction factors
--exclude-regions -x PATH BED file of regions to exclude
--mapq -q INT 20 Minimum mapping quality
--minlen INT 65 Minimum fragment length
--maxlen INT 1000 Maximum fragment length
--skip-duplicates FLAG True Skip duplicate reads
--require-proper-pair FLAG True Require proper read pairs
--chromosomes TEXT Comma-separated chromosomes to process
--sample-name -s TEXT Override sample name
--verbose -v FLAG Enable verbose logging
--threads -t INT 0 Number of threads (0=all)

Output Files

File Description
{sample}.bed.gz Block-gzipped BED with fragment coordinates + GC
{sample}.bed.gz.tbi Tabix index for random access
{sample}.metadata.json Run statistics and configuration
{sample}.correction_factors.csv GC correction factors (with --gc-correct)

GC Correction Factors Format

Column Description
gc_bin GC content bin (0.00-1.00)
short_factor Correction for short fragments (65-149bp)
intermediate_factor Correction for intermediate (150-220bp)
long_factor Correction for long fragments (221-400bp)

Panel Mode (--assay or --target-regions)

For targeted sequencing panels (MSK-ACCESS), use --assay to auto-load bundled targets, or --target-regions for a custom target file:

# Recommended: Auto-load bundled targets for xs2 assay
krewlyzer extract -i sample.bam -r hg19.fa -o output/ --assay xs2

# Alternative: Explicit target file
krewlyzer extract -i sample.bam -r hg19.fa -o output/ \
    --target-regions MSK-ACCESS_targets.bed

Why This Matters

flowchart TB
    subgraph "Without --target-regions"
        ALL[All Fragments] --> GC_ALL["GC model biased by capture"]
        GC_ALL --> BAD["❌ Incorrect correction"]
    end

    subgraph "With --target-regions"
        OFF[Off-Target Only] --> GC_OFF["GC model from unbiased reads"]
        GC_OFF --> GOOD["✅ Accurate correction"]
    end
Use mouse to pan and zoom

Problem: On-target fragments have GC bias from hybridization probes. Using all fragments for GC modeling contaminates the correction factors.

Solution: With --target-regions, only off-target fragments are used to compute GC correction factors, producing unbiased normalization for downstream analysis.

Outputs in Panel Mode

Output Description
correction_factors.tsv GC factors from off-target only (unbiased)
correction_factors.ontarget.csv GC factors from on-target only (for mFSD)
sample.bed.gz All fragments (on + off-target)

Tip

Use .correction_factors.ontarget.csv with krewlyzer mfsd --correction-factors for panel variant calling—it's trained on the same capture regions as your variants.

Important

The BED file contains all fragments. Target filtering happens per-tool using the same --target-regions flag in FSC, FSD, WPS, etc.


BAM Compatibility

Duplex/Consensus BAMs

If you're using duplex or consensus BAMs, disable the proper pair filter:

krewlyzer extract -i duplex.bam -r hg19.fa -o output/ \
    --no-require-proper-pair
BAM Type Proper Pairs? Flag Needed?
Standard WGS ✅ Yes Default works
Standard Panel ✅ Yes Default works
Duplex/UMI ❌ No --no-require-proper-pair
Consensus ❌ No --no-require-proper-pair

Auto-Detection

The run-all command automatically detects BAM compatibility issues and suggests the correct flags.


See Also