Core Architecture

The kreview.core module defines the configuration dataclasses (Paths, LabelConfig, EvalRun), manages DuckDB connections, and provides the thread-throttled parquet loading engine, which retries failed reads with exponential backoff.
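As a rough illustration of retrying with exponential backoff, the sketch below wraps an arbitrary read function. The name `retry_with_backoff`, its parameters, and the choice of `OSError` as the retryable error are assumptions made for this example; they are not taken from kreview.core, and thread throttling is not shown.

```python
import random
import time


def retry_with_backoff(func, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call func(); on OSError wait base_delay * 2**attempt (plus jitter) and retry.

    Hypothetical helper: names and defaults are illustrative only.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except OSError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)  # 0.5s, 1s, 2s, ...
            # Up to 10% jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real waiting.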

`kreview.core`

`LabelConfig` `dataclass`

Configuration for the ctDNA labeling engine.

`Paths` `dataclass`

All input paths for the labeling pipeline.

`__post_init__()`

Coerce all path fields from str to Path so that joining with the `/` operator is safe.
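A minimal sketch of this coercion pattern follows. The class `ExamplePaths` and its field names are hypothetical stand-ins; the actual fields of `Paths` are not listed on this page.

```python
from dataclasses import dataclass, fields
from pathlib import Path


@dataclass
class ExamplePaths:
    # Hypothetical fields, chosen only for illustration.
    samplesheet: Path
    results_dir: Path

    def __post_init__(self):
        # Convert any field passed as a plain string into a Path.
        for f in fields(self):
            value = getattr(self, f.name)
            if isinstance(value, str):
                setattr(self, f.name, Path(value))


paths = ExamplePaths(samplesheet="inputs/samplesheet.csv", results_dir="results")
# Works because results_dir is now a Path, not a str.
metadata_file = paths.results_dir / "S1.metadata.parquet"
```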

`EvalRun` `dataclass`

Records the exact conditions of an evaluation run.

`load_samplesheet(path)`

Load a krewlyzer samplesheet CSV.

`get_sample_ids(samplesheet_path)`

Extract unique sample IDs from a samplesheet.

`load_maf(path)` `cached`

Load MAF file with only the columns needed for labeling.

`load_sv(path)` `cached`

Load structural variant file.

`load_cna(path)` `cached`

Load discrete CNA matrix (wide format). Index = Hugo_Symbol.

`load_clinical_sample(path)` `cached`

Load sample-level clinical data (skips 4-line cBioPortal # header).

`load_clinical_patient(path)` `cached`

Load patient-level clinical data (skips 4-line cBioPortal # header).
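To show what skipping the header involves, here is a standard-library sketch that drops the four leading `#` lines of a cBioPortal-style clinical file before parsing the tab-separated table. The function name `read_clinical_rows` and the sample content are made up for the example; the real loaders may be implemented differently.

```python
import csv
import io
from itertools import islice


def read_clinical_rows(handle, n_header_lines=4):
    """Skip the leading metadata lines, then parse the tab-separated table."""
    data_lines = islice(handle, n_header_lines, None)
    return list(csv.DictReader(data_lines, delimiter="\t"))


# Illustrative file content: four '#' lines, a column header row, one record.
example = io.StringIO(
    "#Sample Identifier\tCancer Type\n"
    "#Sample identifier\tCancer type\n"
    "#STRING\tSTRING\n"
    "#1\t1\n"
    "SAMPLE_ID\tCANCER_TYPE\n"
    "S1\tBreast Cancer\n"
)
rows = read_clinical_rows(example)
```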

`clear_cbioportal_caches()`

Explicitly clear all cBioPortal LRU caches to prevent memory leaks in long-running processes.
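The general pattern can be sketched with `functools.lru_cache`, whose wrapped functions expose `cache_clear()`. Both `load_table` and `clear_caches` below are hypothetical names; which decorator the cached loaders actually use is an assumption here.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def load_table(path):
    # Stand-in for an expensive file read; returns a placeholder object.
    return [path, "parsed contents"]


def clear_caches():
    # Drop every cached result so the memory can be reclaimed.
    for cached_loader in (load_table,):
        cached_loader.cache_clear()
```

Repeated calls with the same path return the cached object until `clear_caches()` is called, after which the next call reloads it.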

`get_duckdb_conn()`

Create a thread-local DuckDB connection configured for 4 threads and a 4 GB memory limit, to avoid out-of-memory failures on HPC nodes.
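One way to read "thread-local" is that each thread creates its connection once and then reuses it. The sketch below follows that reading with `threading.local`. The helper names and the injectable `factory` parameter are assumptions for illustration; the `SET threads` and `SET memory_limit` statements are standard DuckDB settings.

```python
import threading

_local = threading.local()


def _connect_duckdb():
    import duckdb  # third-party package, imported lazily

    conn = duckdb.connect(database=":memory:")
    conn.execute("SET threads TO 4")
    conn.execute("SET memory_limit = '4GB'")
    return conn


def get_thread_conn(factory=_connect_duckdb):
    """Return this thread's connection, creating it on first use."""
    conn = getattr(_local, "conn", None)
    if conn is None:
        conn = factory()
        _local.conn = conn
    return conn
```

Because the connection lives on `threading.local`, two worker threads never share one connection object.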

`discover_available_samples(results_dirs, required_suffix='.metadata.parquet')`

Discover which samples have completed krewlyzer processing across multiple input directories.
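A minimal sketch of such a discovery step is shown below. It assumes a flat layout in which each results directory holds files named `<sample_id><required_suffix>`, and it returns a dict from sample ID to directory. Both the layout and the return type are assumptions, not documented behavior.

```python
from pathlib import Path


def discover_samples(results_dirs, required_suffix=".metadata.parquet"):
    """Map sample_id -> first directory containing its marker file."""
    found = {}
    for results_dir in map(Path, results_dirs):
        if not results_dir.is_dir():
            continue  # tolerate missing or unmounted directories
        for entry in sorted(results_dir.iterdir()):
            if entry.is_file() and entry.name.endswith(required_suffix):
                sample_id = entry.name[: -len(required_suffix)]
                found.setdefault(sample_id, results_dir)  # first directory wins
    return found
```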

`load_feature_cohort(feature_suffix, results_dirs, sample_ids=None, conn=None, chunk_size=500)`

Load one feature type across available samples using explicit file list.

ARCHITECTURE NOTE: We build an explicit file list from discovered samples instead of using a glob pattern. This avoids DuckDB scanning thousands of directories over network mounts (SFTP/NFS), which causes multi-minute stalls.
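The idea in the note can be sketched as follows: derive each file path directly from the discovered samples, split the list into chunks, and hand DuckDB's `read_parquet` an explicit list of paths per chunk. The helper names, the `<sample_id><feature_suffix>` naming scheme, and the example suffix are assumptions for illustration.

```python
from pathlib import Path


def build_file_list(sample_dirs, feature_suffix):
    """sample_dirs maps sample_id -> directory (assumed layout)."""
    return [
        (Path(directory) / f"{sample_id}{feature_suffix}").as_posix()
        for sample_id, directory in sorted(sample_dirs.items())
    ]


def chunked(items, chunk_size=500):
    """Yield consecutive slices of at most chunk_size items."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def chunk_query(files):
    # read_parquet accepts a list of paths, so no directory scan is needed.
    quoted = ", ".join("'" + f.replace("'", "''") + "'" for f in files)
    return f"SELECT * FROM read_parquet([{quoted}])"


files = build_file_list({"S2": "/r/b", "S1": "/r/a"}, ".FSC.parquet")
queries = [chunk_query(chunk) for chunk in chunked(files, chunk_size=500)]
```

Each query touches only the files named in it, so the cost scales with the number of samples rather than with the size of the directory tree.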

`load_sample_feature(sample_id, feature_suffix, results_dir)`

Load a single parquet feature file for one sample. (Useful for single-sample logic)

`load_sample_metadata(sample_id, results_dir)`

Load the metadata.parquet for a sample and return a dict of QC values.

`make_variant_key(df)`

Create a hashable variant key from genomic coordinates.
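For intuition, a variant key can be a tuple of chromosome, position, reference allele, and alternate allele. The sketch below works on plain dict rows using standard MAF column names. The real function takes a DataFrame, and the columns and key format it uses are not stated here, so treat both as assumptions.

```python
def variant_key(row):
    """Build a hashable (chrom, pos, ref, alt) tuple from one MAF-style row."""
    return (
        str(row["Chromosome"]),
        int(row["Start_Position"]),
        row["Reference_Allele"],
        row["Tumor_Seq_Allele2"],
    )


# Illustrative rows: the same variant recorded with different value types.
row_a = {"Chromosome": 7, "Start_Position": "140453136",
         "Reference_Allele": "A", "Tumor_Seq_Allele2": "T"}
row_b = {"Chromosome": "7", "Start_Position": 140453136,
         "Reference_Allele": "A", "Tumor_Seq_Allele2": "T"}
unique_variants = {variant_key(row_a), variant_key(row_b)}
```

Normalizing types before building the tuple lets the same variant from two sources collapse to one entry in a set or dict.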
