Core Architecture
The kreview.core module defines the dataclass configurations (Paths, LabelConfig, EvalRun), manages DuckDB connections, and provides the thread-throttled parquet loading engine, which retries failed reads with exponential backoff.
kreview.core
LabelConfig
dataclass
Configuration for the ctDNA labeling engine.
Paths
dataclass
All input paths for the labeling pipeline.
__post_init__()
Coerce all path fields from str to Path so that the / operator can be used safely to join paths.
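A minimal sketch of this coercion pattern, not the actual Paths definition. The class name ExamplePaths and its two fields are hypothetical; only the idea of converting str fields to Path in __post_init__ comes from the description above.

```python
from dataclasses import dataclass, fields
from pathlib import Path
from typing import Union


@dataclass
class ExamplePaths:
    # Hypothetical field names; the real Paths dataclass may define others.
    maf: Union[str, Path]
    results_dir: Union[str, Path]

    def __post_init__(self) -> None:
        # Coerce every str field to Path so the `/` operator works downstream.
        for f in fields(self):
            value = getattr(self, f.name)
            if isinstance(value, str):
                setattr(self, f.name, Path(value))


paths = ExamplePaths(maf="data/mutations.maf", results_dir="results")
# Joining with `/` would raise TypeError if results_dir were still a str.
metadata_file = paths.results_dir / "S1" / "S1.metadata.parquet"
```

Looping over fields() keeps the coercion in one place, so a newly added path field needs no extra conversion code.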
EvalRun
dataclass
Records the exact conditions of an evaluation run.
load_samplesheet(path)
Load a krewlyzer samplesheet CSV.
get_sample_ids(samplesheet_path)
Extract unique sample IDs from a samplesheet.
load_maf(path)
cached
Load MAF file with only the columns needed for labeling.
load_sv(path)
cached
Load structural variant file.
load_cna(path)
cached
Load discrete CNA matrix (wide format). Index = Hugo_Symbol.
load_clinical_sample(path)
cached
Load sample-level clinical data (skips 4-line cBioPortal # header).
load_clinical_patient(path)
cached
Load patient-level clinical data (skips 4-line cBioPortal # header).
clear_cbioportal_caches()
Explicitly clear all cBioPortal LRU caches to prevent memory leaks in long-running processes.
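The cached loaders and the cache-clearing helper can be illustrated with a standard-library sketch. It assumes the caches are built with functools.lru_cache, which the "LRU" wording suggests but the source does not confirm. The csv module stands in for whatever reader the real loaders use, and load_clinical_rows and clear_example_caches are hypothetical names.

```python
import csv
import tempfile
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=4)
def load_clinical_rows(path: str) -> tuple:
    """Read a tab-separated clinical file, skipping the 4-line '#' header."""
    with open(path, newline="") as handle:
        for _ in range(4):  # cBioPortal metadata lines start with '#'
            next(handle)
        return tuple(csv.DictReader(handle, delimiter="\t"))


def clear_example_caches() -> None:
    # Hypothetical counterpart to clear_cbioportal_caches().
    load_clinical_rows.cache_clear()


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data_clinical_sample.txt"
    path.write_text(
        "#Sample Identifier\tCancer Type\n"
        "#Sample ID\tCancer type\n"
        "#STRING\tSTRING\n"
        "#1\t1\n"
        "SAMPLE_ID\tCANCER_TYPE\n"
        "S1\tBreast Cancer\n"
    )
    first = load_clinical_rows(str(path))
    second = load_clinical_rows(str(path))  # served from the cache
    hits_before_clear = load_clinical_rows.cache_info().hits
    clear_example_caches()
    size_after_clear = load_clinical_rows.cache_info().currsize
```

Because a cached loader returns the same object on every hit, callers should treat the result as read-only or copy it before modifying it.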
get_duckdb_conn()
Create a thread-local DuckDB connection, configured for 4 threads and a 4 GB memory limit to prevent out-of-memory (OOM) failures on HPC nodes.
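One way to obtain a per-thread connection is threading.local, sketched below. This is not the actual get_duckdb_conn implementation. The connect parameter is a hypothetical hook so the pattern can be exercised without DuckDB installed, and the in-memory database is an assumption. The threads and memory_limit settings mirror the values stated above.

```python
import threading

_local = threading.local()


def get_thread_conn(connect=None):
    """Return this thread's connection, creating it on first use."""
    conn = getattr(_local, "conn", None)
    if conn is None:
        if connect is None:
            import duckdb  # third-party; only needed for the default factory
            connect = duckdb.connect
        conn = connect(":memory:")  # assumption: an in-memory database
        conn.execute("SET threads = 4")
        conn.execute("SET memory_limit = '4GB'")
        _local.conn = conn  # visible only to the current thread
    return conn
```

Each thread that calls get_thread_conn() gets its own connection, and repeated calls within one thread reuse it, so the settings are applied once per thread.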
discover_available_samples(results_dirs, required_suffix='.metadata.parquet')
Discover which samples have completed krewlyzer processing across multiple input directories.
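A rough sketch of such a discovery pass, using only pathlib. The directory layout (one subdirectory per sample, containing a file named after the sample ID plus the required suffix) is an assumption, as is the rule that the first directory containing a sample wins. discover_samples is a hypothetical stand-in, not the real function.

```python
import tempfile
from pathlib import Path
from typing import Dict, Iterable


def discover_samples(results_dirs: Iterable[Path],
                     required_suffix: str = ".metadata.parquet") -> Dict[str, Path]:
    """Map sample ID -> directory holding its marker file (first match wins)."""
    found: Dict[str, Path] = {}
    for results_dir in results_dirs:
        root = Path(results_dir)
        if not root.is_dir():
            continue  # tolerate missing or unmounted directories
        # Assumed layout: <results_dir>/<sample_id>/<sample_id><required_suffix>
        for sample_dir in sorted(p for p in root.iterdir() if p.is_dir()):
            marker = sample_dir / f"{sample_dir.name}{required_suffix}"
            if marker.is_file():
                found.setdefault(sample_dir.name, sample_dir)
    return found


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "S1").mkdir()
    (root / "S1" / "S1.metadata.parquet").touch()
    (root / "S2").mkdir()  # no marker file: processing incomplete
    available_ids = sorted(discover_samples([root, root / "missing"]))
```

Treating the metadata file as a completion marker means partially written sample directories are skipped rather than failing later during loading.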
load_feature_cohort(feature_suffix, results_dirs, sample_ids=None, conn=None, chunk_size=500)
Load one feature type across available samples using an explicit file list.
ARCHITECTURE NOTE: We build an explicit file list from discovered samples instead of using a glob pattern. This avoids DuckDB scanning thousands of directories over network mounts (SFTP/NFS), which causes multi-minute stalls.
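The explicit-file-list approach could look roughly like the sketch below. The file naming scheme, the ".features.parquet" suffix, and the helper names are hypothetical. The one DuckDB fact relied on is that read_parquet accepts a list of file paths, so no glob expansion is needed. Only the SQL text is built here; nothing is executed.

```python
from pathlib import Path
from typing import Dict, Iterator, List


def build_file_list(sample_dirs: Dict[str, Path], feature_suffix: str) -> List[str]:
    # Assumed layout: <sample_dir>/<sample_id><feature_suffix>
    return [str(d / f"{sid}{feature_suffix}") for sid, d in sorted(sample_dirs.items())]


def chunked(items: List[str], chunk_size: int = 500) -> Iterator[List[str]]:
    # Yield consecutive slices of at most chunk_size paths.
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def chunk_query(files: List[str]) -> str:
    # Quote each path as a SQL string literal and pass the list to read_parquet.
    quoted = ", ".join("'" + f.replace("'", "''") + "'" for f in files)
    return f"SELECT * FROM read_parquet([{quoted}])"


# 1200 hypothetical samples under a single results directory.
sample_dirs = {f"S{i:04d}": Path("results") / f"S{i:04d}" for i in range(1200)}
files = build_file_list(sample_dirs, ".features.parquet")
chunks = list(chunked(files, chunk_size=500))
queries = [chunk_query(chunk) for chunk in chunks]
```

Splitting the list into chunks bounds the size of each query and limits how many files are opened at once.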
load_sample_feature(sample_id, feature_suffix, results_dir)
Load a single parquet feature file for one sample. Useful for single-sample logic.
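The module overview mentions thread-throttled loading with exponential backoff retry, but the source does not show that code. The sketch below is one plausible shape for it. The helper names, the retry count, the base delay, the jitter, and the choice to retry only on OSError are all assumptions.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def with_backoff(load: Callable[[], T], retries: int = 4,
                 base_delay: float = 0.5, sleep=time.sleep) -> T:
    """Call load(), retrying on OSError with exponentially growing delays."""
    for attempt in range(retries + 1):
        try:
            return load()
        except OSError:
            if attempt == retries:
                raise  # out of retries: surface the last error
            # The delay doubles each attempt; jitter spreads out retries across threads.
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))


def load_all(paths: Iterable[str], load_one: Callable[[str], T],
             max_workers: int = 4) -> List[T]:
    # max_workers caps the number of concurrent reads (the "throttle").
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda p: with_backoff(lambda: load_one(p)), paths))
```

Here load_one would be whatever function reads a single file, for example a wrapper around the single-sample loader above. Results are returned in input order because pool.map preserves it.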
load_sample_metadata(sample_id, results_dir)
Load the metadata.parquet for a sample and return a dict of QC values.
make_variant_key(df)
Create a hashable variant key from genomic coordinates.
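As an illustration of what a hashable variant key can look like, the sketch below builds a tuple from standard MAF coordinate columns. Plain dicts stand in for DataFrame rows. Which columns the real make_variant_key combines, and the shape of the key, are assumptions.

```python
from typing import Mapping, Tuple

VariantKey = Tuple[str, int, str, str]


def variant_key(row: Mapping[str, object]) -> VariantKey:
    # Column names follow the MAF convention; the real key may use other fields.
    return (
        str(row["Chromosome"]),
        int(row["Start_Position"]),
        str(row["Reference_Allele"]),
        str(row["Tumor_Seq_Allele2"]),
    )


tumor_call = {"Chromosome": "7", "Start_Position": 140453136,
              "Reference_Allele": "A", "Tumor_Seq_Allele2": "T"}
# The same variant from another source, with the position read as text.
plasma_call = {"Chromosome": "7", "Start_Position": "140453136",
               "Reference_Allele": "A", "Tumor_Seq_Allele2": "T"}

# Tuples are hashable, so keys can be compared with set operations.
shared = {variant_key(tumor_call)} & {variant_key(plasma_call)}
```

Normalizing types inside the key function keeps the same variant from producing two different keys when one source stores positions as integers and another as strings.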