Core Architecture

The kreview.core module defines the configuration dataclasses (Paths, LabelConfig, EvalRun), DuckDB connection management, and a thread-throttled parquet loading engine that retries failed reads with exponential backoff.
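As an illustration of the throttled, retrying load described above, here is a minimal sketch. The function name `read_with_retry`, the semaphore size, the delay values, and the choice to retry on `OSError` are all assumptions made for this example and are not taken from kreview.core.

```python
import random
import threading
import time

# Hypothetical cap on concurrent reads; the real limit is not documented here.
_READ_SLOTS = threading.BoundedSemaphore(4)


def read_with_retry(read_fn, path, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call read_fn(path), retrying on OSError with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            with _READ_SLOTS:  # throttle: at most 4 reads in flight at once
                return read_fn(path)
        except OSError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the last error
            # Delay doubles each attempt (0.5 s, 1 s, 2 s, ...) plus up to 10% jitter.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
```

The `sleep` parameter exists only so the delay can be replaced in tests; the semaphore is released before sleeping, so a waiting retry does not occupy a read slot.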

kreview.core

LabelConfig dataclass

Configuration for the ctDNA labeling engine.

Paths dataclass

All input paths for the labeling pipeline.

__post_init__()

Coerce all path fields from str to Path so that the / operator can be used to join paths safely.
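The coercion pattern can be sketched as follows. `ExamplePaths` and its two fields are hypothetical stand-ins; the actual Paths dataclass defines its own fields.

```python
from dataclasses import dataclass, fields
from pathlib import Path


@dataclass
class ExamplePaths:
    # Hypothetical fields, used only to illustrate the pattern.
    samplesheet: Path
    results_dir: Path

    def __post_init__(self):
        # Coerce every str value to Path so callers may pass either type.
        for f in fields(self):
            value = getattr(self, f.name)
            if isinstance(value, str):
                setattr(self, f.name, Path(value))


paths = ExamplePaths(samplesheet="samples.csv", results_dir="/data/results")
# The / operator works because results_dir is now a Path, not a str.
metadata_file = paths.results_dir / "S1.metadata.parquet"
```

Without the coercion, `"/data/results" / "S1.metadata.parquet"` would raise a TypeError, since str does not support the / operator.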

EvalRun dataclass

Records the exact conditions of an evaluation run.

load_samplesheet(path)

Load a krewlyzer samplesheet CSV.

get_sample_ids(samplesheet_path)

Extract unique sample IDs from a samplesheet.

load_maf(path) cached

Load MAF file with only the columns needed for labeling.

load_sv(path) cached

Load structural variant file.

load_cna(path) cached

Load discrete CNA matrix (wide format). Index = Hugo_Symbol.

load_clinical_sample(path) cached

Load sample-level clinical data (skips 4-line cBioPortal # header).

load_clinical_patient(path) cached

Load patient-level clinical data (skips 4-line cBioPortal # header).
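A minimal sketch of the header skipping, using only the standard library. The function name `read_clinical` and the sample rows are made up for this example; the real loaders may use a different parser and return type.

```python
import csv
import io


def read_clinical(handle, header_lines=4):
    """Skip the leading '#' metadata lines, then parse the tab-separated table."""
    for _ in range(header_lines):
        next(handle)
    return list(csv.DictReader(handle, delimiter="\t"))


# Illustrative file content: four '#' lines, then the real column header.
example = io.StringIO(
    "#Sample Identifier\tCancer Type\n"
    "#Sample Identifier\tCancer Type\n"
    "#STRING\tSTRING\n"
    "#1\t1\n"
    "SAMPLE_ID\tCANCER_TYPE\n"
    "S1\tBreast Cancer\n"
)
rows = read_clinical(example)
```

With pandas, a comparable effect can be had from `pandas.read_csv(path, sep="\t", skiprows=4)`; whether the module does exactly this is not stated here.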

clear_cbioportal_caches()

Explicitly clear all cBioPortal LRU caches to prevent memory leaks in long-running processes.
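Assuming the cached loaders are wrapped with `functools.lru_cache` (the "cached" labels above suggest this, but it is an assumption), clearing them could look like the sketch below. `load_table` and `clear_caches` are hypothetical names.

```python
from functools import lru_cache


@lru_cache(maxsize=4)
def load_table(path):
    # Stand-in for an expensive file read.
    return {"path": path}


# Every cached loader is listed once so a single call clears them all.
_CACHED_LOADERS = (load_table,)


def clear_caches():
    for loader in _CACHED_LOADERS:
        loader.cache_clear()


load_table("a.txt")
load_table("a.txt")  # second call is served from the cache
hits_before = load_table.cache_info().hits
clear_caches()
size_after = load_table.cache_info().currsize
```

Calling the clearing function between cohorts releases the cached tables, which matters when a single process handles many large input files.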

get_duckdb_conn()

Create a thread-local DuckDB connection. The connection is configured for 4 threads and a 4 GB memory limit to avoid out-of-memory failures on HPC nodes.
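One way to obtain a per-thread connection with these limits is sketched below. The function name `get_conn` and its `connect` parameter are hypothetical (the parameter is an injection point added for testing); only `duckdb.connect` and the `threads` and `memory_limit` settings are DuckDB's own.

```python
import threading

_local = threading.local()


def get_conn(connect=None):
    """Return this thread's connection, creating and configuring it on first use."""
    if connect is None:
        import duckdb  # imported lazily so the module loads without DuckDB installed
        connect = duckdb.connect
    conn = getattr(_local, "conn", None)
    if conn is None:
        conn = connect()
        conn.execute("SET threads = 4")
        conn.execute("SET memory_limit = '4GB'")
        _local.conn = conn  # cached per thread, never shared across threads
    return conn
```

Keeping one connection per thread avoids sharing a connection object between worker threads while still paying the setup cost only once per thread.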

discover_available_samples(results_dirs, required_suffix='.metadata.parquet')

Discover which samples have completed krewlyzer processing across multiple input directories.
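A rough sketch of such a discovery step, assuming a flat layout in which each completed sample leaves a file named `<sample_id><required_suffix>` directly inside a results directory. The layout, the function name `discover_samples`, and the first-directory-wins rule are assumptions for illustration.

```python
from pathlib import Path


def discover_samples(results_dirs, required_suffix=".metadata.parquet"):
    """Map sample ID -> directory holding its marker file (first directory wins)."""
    found = {}
    for results_dir in map(Path, results_dirs):
        for marker in sorted(results_dir.glob(f"*{required_suffix}")):
            # Strip the suffix from the file name to recover the sample ID.
            sample_id = marker.name[: -len(required_suffix)]
            found.setdefault(sample_id, results_dir)
    return found
```

Each directory is listed once, and the resulting mapping can then be reused to build file paths for any feature type without touching the filesystem again.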

load_feature_cohort(feature_suffix, results_dirs, sample_ids=None, conn=None, chunk_size=500)

Load one feature type across available samples using an explicit file list.

ARCHITECTURE NOTE: We build an explicit file list from discovered samples instead of using a glob pattern. This avoids DuckDB scanning thousands of directories over network mounts (SFTP/NFS), which causes multi-minute stalls.
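The explicit-file-list approach from the note above might be sketched like this. The helper names, the shape of `sample_dirs`, and the file naming scheme are hypothetical; the only DuckDB behavior relied on is that `read_parquet` accepts a list of file paths.

```python
from pathlib import Path


def build_file_list(sample_dirs, feature_suffix):
    """sample_dirs maps sample ID -> results directory (hypothetical shape)."""
    return [
        str(Path(directory) / f"{sample_id}{feature_suffix}")
        for sample_id, directory in sorted(sample_dirs.items())
    ]


def chunked(items, chunk_size=500):
    """Yield successive slices so each query names at most chunk_size files."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def chunk_query(files):
    # Quote each path (doubling any single quotes) and pass the list to read_parquet.
    quoted = ", ".join("'" + f.replace("'", "''") + "'" for f in files)
    return f"SELECT * FROM read_parquet([{quoted}])"
```

Because every path is named outright, DuckDB has no directories to enumerate, and chunking keeps each statement to a bounded number of files.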

load_sample_feature(sample_id, feature_suffix, results_dir)

Load a single parquet feature file for one sample. Useful for single-sample logic.

load_sample_metadata(sample_id, results_dir)

Load the metadata.parquet for a sample and return it as a dict of QC values.

make_variant_key(df)

Create a hashable variant key from genomic coordinates.
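To illustrate what a hashable variant key can look like, the sketch below builds a tuple from one record. It operates on a single dict-like row rather than a DataFrame, and it assumes the standard MAF column names Chromosome, Start_Position, Reference_Allele, and Tumor_Seq_Allele2; which columns make_variant_key actually uses is not stated here.

```python
def variant_key(record):
    """Build a hashable (chrom, pos, ref, alt) tuple from one MAF-style record."""
    return (
        str(record["Chromosome"]),
        int(record["Start_Position"]),
        record["Reference_Allele"],
        record["Tumor_Seq_Allele2"],
    )


# Illustrative records only; the values carry no biological meaning.
a = {"Chromosome": 7, "Start_Position": "100", "Reference_Allele": "A", "Tumor_Seq_Allele2": "T"}
b = {"Chromosome": "7", "Start_Position": 100, "Reference_Allele": "A", "Tumor_Seq_Allele2": "T"}
seen = {variant_key(a)}
is_duplicate = variant_key(b) in seen
```

Normalizing the chromosome to str and the position to int means the same variant read from differently typed sources maps to the same key.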