GC Bias Correction

GC content bias is a major source of variability in cfDNA sequencing. Krewlyzer implements LOESS-based correction in Rust for high performance.

Why GC Correction?

DNA fragments with extreme GC content (very AT-rich or GC-rich) are under-represented due to:

PCR amplification bias - GC-rich regions amplify less efficiently
Sequencing bias - Certain GC ranges have lower quality scores
Capture bias - Hybridization efficiency varies with GC

Without correction, fragment counts are confounded by GC content, obscuring biological signal.

How It Works

LOESS Regression

Krewlyzer uses Locally Estimated Scatterplot Smoothing (LOESS) to model the relationship between GC content and fragment count:

For each fragment type (short, intermediate, long):
1. Bin fragments by GC content (0.00-1.00 in 0.01 steps)
2. Fit LOESS curve: count = f(gc)
3. Compute correction factor: factor[gc] = median(count) / loess_fit[gc]
4. Apply: corrected_count = raw_count × factor[gc]

Configuration

Parameter	Default	Description
LOESS fraction	0.3	Fraction of data for local fitting
LOESS iterations	3	Robustness iterations
Delta	0.01	Smoothing delta

Fragment Length Bins

GC correction is applied per fragment length bin (17 bins, 20bp width):

Bin	Range	Fragment Type
0	60-79bp	Ultra-short
1	80-99bp	Ultra-short
2-4	100-159bp	Short
5-7	160-219bp	Intermediate
8-16	220-400bp	Long

Usage

Automatic (Default)

GC correction is enabled by default for most tools:

krewlyzer extract -i sample.bam -r hg19.fa -o output/
# Generates: sample.correction_factors.csv

Disable GC Correction

krewlyzer fsc -i sample.bed.gz --no-gc-correct -o output/

Using Pre-computed Factors

# mFSD can use factors from extract
krewlyzer mfsd -i sample.bam -V variants.vcf \
    --correction-factors output/sample.correction_factors.csv \
    -o output/

Per-Tool GC Correction

Tool	GC Option	Source	Notes
extract	`--gc-correct`	Computes factors	Generates `.correction_factors.csv`
FSC	`--gc-correct`	From extract	Via `run_unified_pipeline`
FSR	`--gc-correct`	From extract	Via `run_unified_pipeline`
FSD	`--gc-correct`	From extract	Via `run_unified_pipeline`
WPS	`--gc-correct`	From extract	Via `run_unified_pipeline`
OCF	`--gc-correct`	From extract	Via `run_unified_pipeline`
mFSD	`--correction-factors`	Manual input	Uses pre-computed CSV
motif	N/A	N/A	No GC correction
UXM	N/A	N/A	No GC correction

Panel Data (--target-regions)

When --target-regions is provided to extract: - GC model is built from off-target fragments only - Avoids capture bias contamination - Generates both .correction_factors.csv (off-target) and .correction_factors.ontarget.csv (on-target)

Building GC Reference Assets

The build-gc-reference command pre-computes reference GC data:

# Standard (WGS) mode
krewlyzer build-gc-reference hg19.fa -o data/gc/ -e hg19-blacklist.bed

# Panel mode (generates both WGS and on-target assets)
krewlyzer build-gc-reference hg19.fa -o data/gc/ -T msk_targets.bed

Output Files

Mode	File	Description
Standard	`valid_regions_{genome}.bed.gz`	100kb bins for GC estimation
Standard	`ref_genome_GC_{genome}.parquet`	Expected fragment counts
Panel	`valid_regions_{genome}.ontarget.bed.gz`	Bins overlapping targets
Panel	`ref_genome_GC_{genome}.ontarget.parquet`	On-target expected counts

Options

Option	Description
`-o, --output`	Output directory (required)
`-e, --exclude-regions`	Blacklist BED to exclude
`-T, --target-regions`	Target BED for panel mode
`--bin-size`	Bin size in bp (default: 100kb)
`-n, --genome-name`	Genome name (default: from FASTA)

Correction Factors File

The extract command generates {sample}.correction_factors.csv:

len_bin,gc_bin,factor,observed,expected,n_fragments
0,0.30,1.23,1234,1003,50000
0,0.31,1.21,1256,1038,51234
...

Column	Description
`len_bin`	Fragment length bin (0-16)
`gc_bin`	GC content (0.00-1.00)
`factor`	Correction multiplier
`observed`	Raw fragment count
`expected`	LOESS-predicted count
`n_fragments`	Number of fragments in bin

PON-based Hybrid Correction

When a PON model is provided, correction uses a hybrid approach:

flowchart LR
    obs[Observed Counts] --> pon[PON Correction]
    pon --> |"observed / pon_expected"| res[Residual LOESS]
    res --> |"pon_corrected / residual"| final[Final Counts]

Use mouse to pan and zoom

Algorithm: 1. PON correction: Divide by assay-specific expected coverage 2. Residual LOESS: Fit sample-specific residual bias 3. Final: Divide PON-corrected counts by residual

This removes both assay-wide and sample-specific GC effects.

Implementation Details

Rust Module: `gc_correction.rs`

Key structures:

/// Fragment length bins (17 bins, 60-400bp)
pub struct LengthBin(u8);

/// GC correction factors lookup
pub struct CorrectionFactors {
    factors: HashMap<(LengthBin, u8), f64>,
    stats: HashMap<(LengthBin, u8), CorrectionBinStats>,
}

/// Apply LOESS-based correction
pub fn correct_gc_bias(
    gc_values: &[f64],
    counts: &[f64],
    config: Option<GcCorrectionConfig>
) -> Result<Vec<f64>>

Python Interface

from krewlyzer import _core

# Compute and save correction factors
_core.gc.compute_and_write_gc_factors(
    bed_path="sample.bed.gz",
    gc_reference_path="gc_reference.parquet",
    output_path="correction_factors.csv"
)

When to Disable

Disable GC correction (--no-gc-correct) when:

Comparing raw signal across samples with known GC differences
Validating uncorrected patterns
Working with very small regions where LOESS may be unstable

For most applications, keep correction enabled (default).