
Core Architecture

The kreview.core module defines the dataclass configurations (Paths, LabelConfig, EvalRun), DuckDB connection management, and the thread-throttled parquet loading engine, which retries failed loads with exponential backoff.

For conceptual explanations, see:


kreview.core

LabelConfig dataclass

Configuration for the ctDNA labeling engine.

Paths dataclass

All input paths for the labeling pipeline.

__post_init__()

Coerce all path fields from str to Path so that the / operator can be used safely to join paths.
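A minimal sketch of this coercion pattern, assuming a dataclass whose fields are all paths. The class name `ExamplePaths` and its field names are hypothetical and are not taken from kreview.

```python
from dataclasses import dataclass, fields
from pathlib import Path


@dataclass
class ExamplePaths:
    # Hypothetical field names, for illustration only.
    samplesheet: Path
    results_dir: Path

    def __post_init__(self) -> None:
        # Coerce every field to Path so that str arguments also support "/".
        for f in fields(self):
            setattr(self, f.name, Path(getattr(self, f.name)))


# Plain strings are accepted and converted on construction.
paths = ExamplePaths(samplesheet="data/samplesheet.csv", results_dir="results")
report = paths.results_dir / "report.html"
```

Without the coercion, `"results" / "report.html"` would raise a TypeError, because str does not implement the / operator.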

EvalRun dataclass

Records the exact conditions of an evaluation run.

load_samplesheet(path)

Load a krewlyzer samplesheet CSV.

get_sample_ids(samplesheet_path)

Extract unique sample IDs from a samplesheet.

load_maf(path) cached

Load MAF file with only the columns needed for labeling.

load_sv(path) cached

Load structural variant file.

load_cna(path) cached

Load the discrete CNA matrix (wide format). The index is Hugo_Symbol.

load_clinical_sample(path) cached

Load sample-level clinical data (skips 4-line cBioPortal # header).

load_clinical_patient(path) cached

Load patient-level clinical data (skips 4-line cBioPortal # header).
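The header skipping used by the two clinical loaders can be sketched with the standard library alone. This is an illustration under assumptions: the in-memory file content below is made up, and the real loaders may well use a different reader.

```python
import csv
import io

# A tiny in-memory stand-in for a cBioPortal clinical file (made-up values).
RAW = (
    "#Sample Identifier\tCancer Type\n"
    "#Sample identifier\tCancer type description\n"
    "#STRING\tSTRING\n"
    "#1\t1\n"
    "SAMPLE_ID\tCANCER_TYPE\n"
    "S1\tBreast\n"
    "S2\tLung\n"
)


def read_clinical(handle, n_header_lines=4):
    """Skip the leading '#' metadata lines, then parse the rest as TSV."""
    for _ in range(n_header_lines):
        next(handle)
    # The first remaining line holds the column names.
    return list(csv.DictReader(handle, delimiter="\t"))


rows = read_clinical(io.StringIO(RAW))
```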

clear_cbioportal_caches()

Explicitly clear all cBioPortal LRU caches to prevent memory leaks in long-running processes.
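As a rough sketch of how such a cache reset can work, assuming the cached loaders are wrapped with `functools.lru_cache` (the docs only say "LRU caches"). The names `load_table` and `clear_caches` are hypothetical stand-ins.

```python
from functools import lru_cache

calls = {"n": 0}  # counts how often the loader body actually runs


@lru_cache(maxsize=None)
def load_table(path: str) -> tuple:
    # Stand-in for an expensive file parse.
    calls["n"] += 1
    return (path, "parsed")


def clear_caches() -> None:
    # Each lru_cache-wrapped function exposes cache_clear().
    for fn in (load_table,):
        fn.cache_clear()


load_table("a.txt")
load_table("a.txt")  # served from the cache, the body does not run again
hits_before = load_table.cache_info().hits
clear_caches()
size_after = load_table.cache_info().currsize
```

Cached results keep their return values alive for as long as the cache holds them, which is why an explicit reset matters in a process that loads many different files.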

get_duckdb_conn()

Create a thread-local DuckDB connection configured for 4 threads and a 4 GB memory limit, which prevents out-of-memory failures on HPC nodes.

discover_available_samples(results_dirs, required_suffix='.metadata.parquet')

Discover which samples have completed krewlyzer processing across multiple input directories.

iter_feature_chunks(feature_suffix, results_dirs, sample_ids=None, conn=None, chunk_size='auto')

Stream feature data as chunks without accumulating in memory.

This is the memory-safe alternative to load_feature_cohort. Each chunk contains complete samples (one parquet file = one sample), so chunk boundaries never split a sample's data.

Parameters:

feature_suffix (str, required): Parquet file suffix (e.g. '.FSC.ontarget.parquet').

results_dirs (list[Path], required): Krewlyzer output directories to scan.

sample_ids (list[str] | set[str] | None, default None): Optional subset of samples to include.

conn (DuckDBPyConnection | None, default None): Optional DuckDB connection (created if not provided).

chunk_size (int | str, default 'auto'): Number of files per batch. 'auto' probes the first 5 files to estimate rows per sample and tunes the batch size to target ~15M rows per chunk.

Yields:

(chunk_df, chunk_idx, n_chunks): A tuple of the DataFrame chunk, the zero-based chunk index, and the total number of chunks.

Raises:

ValueError: If results_dirs is empty.
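The chunking contract described above (whole files per chunk, zero-based index, total chunk count) can be sketched without DuckDB. `iter_file_batches` is a hypothetical helper that yields file batches rather than DataFrames. It is not the kreview implementation.

```python
from pathlib import Path
from typing import Iterator


def iter_file_batches(
    files: list[Path], chunk_size: int
) -> Iterator[tuple[list[Path], int, int]]:
    """Yield (batch, chunk_idx, n_chunks) with whole files per batch."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    n_chunks = -(-len(files) // chunk_size)  # ceiling division
    for idx in range(n_chunks):
        start = idx * chunk_size
        yield files[start:start + chunk_size], idx, n_chunks


files = [Path(f"S{i}.FSC.ontarget.parquet") for i in range(7)]
batches = list(iter_file_batches(files, chunk_size=3))
```

Because the unit of batching is a file and one file holds one sample, no sample can straddle two chunks.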

load_feature_cohort(feature_suffix, results_dirs, sample_ids=None, conn=None, chunk_size='auto')

Load one feature type across available samples into a single DataFrame.

This is the backward-compatible batch loader. For memory-constrained environments (e.g., HPC with 64GB RAM), use iter_feature_chunks instead to process data in a streaming fashion.

Parameters:

chunk_size (int | str, default 'auto'): 'auto' probes parquet row density at runtime. Pass an integer to override (e.g. 500 for large features).

ARCHITECTURE NOTE: We build an explicit file list from discovered samples instead of using a glob pattern. This avoids DuckDB scanning thousands of directories over network mounts (SFTP/NFS), which causes multi-minute stalls.
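To illustrate the idea in the note, here is a sketch that constructs each candidate path directly and checks it, instead of globbing. The directory layout `<results_dir>/<sample_id>/<sample_id><suffix>` and the helper name `build_file_list` are assumptions made for this example only.

```python
import tempfile
from pathlib import Path


def build_file_list(results_dirs, sample_ids, feature_suffix):
    """Return existing feature files by building each path directly (no glob)."""
    found = []
    for sample_id in sorted(sample_ids):
        for results_dir in results_dirs:
            # Hypothetical layout: <results_dir>/<sample_id>/<sample_id><suffix>
            candidate = Path(results_dir) / sample_id / f"{sample_id}{feature_suffix}"
            if candidate.is_file():
                found.append(candidate)
                break  # the first directory that has the sample wins
    return found


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for sid in ("S1", "S2"):
        (root / sid).mkdir()
        (root / sid / f"{sid}.FSC.ontarget.parquet").touch()
    # S3 has no output on disk, so it is silently left out.
    file_list = build_file_list([root], {"S1", "S2", "S3"}, ".FSC.ontarget.parquet")
    names = [p.name for p in file_list]
```

This costs one existence check per expected file, whereas a recursive glob has to list every directory under each root.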

load_metadata_cohort(results_dirs, sample_ids=None, chunk_size='auto')

Load metadata for all samples. Thin wrapper around load_feature_cohort.

Metadata files are ~1 row per sample, so 'auto' will resolve to chunk_size=15_000 (single-sweep load) on first probe.
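One way 'auto' could arrive at these numbers is sketched below. The ~15M-row target is the one documented for iter_feature_chunks. The 15_000-file cap is only inferred from the metadata case above, and the function name and exact formula are assumptions.

```python
from statistics import mean

TARGET_ROWS_PER_CHUNK = 15_000_000  # target stated for iter_feature_chunks
MAX_FILES_PER_CHUNK = 15_000        # assumed cap, inferred from the metadata case


def resolve_chunk_size(probed_row_counts):
    """Estimate files per chunk from the row counts of a few probed files."""
    rows_per_sample = max(1.0, mean(probed_row_counts))
    by_target = int(TARGET_ROWS_PER_CHUNK // rows_per_sample)
    return max(1, min(MAX_FILES_PER_CHUNK, by_target))


metadata_size = resolve_chunk_size([1, 1, 1, 1, 1])  # ~1 row per sample
dense_size = resolve_chunk_size([30_000] * 5)        # a large feature
```

Under these assumptions the metadata case hits the cap of 15_000 files, while a feature with 30,000 rows per sample resolves to 500 files per chunk.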

run_feature_sql(sql_query, feature_suffix, results_dirs, sample_ids=None, conn=None)

Execute a full-cohort SQL extraction in a single DuckDB pass.

This is the SQL pushdown alternative to iter_feature_chunks + per-sample extract(). It runs the evaluator's SQL query against all discovered parquet files at once, returning one row per sample.

The query must contain a read_parquet(?, ...) placeholder that accepts the file path list.

Parameters:

sql_query (str, required): DuckDB SQL with ? placeholder for file paths.

feature_suffix (str, required): Parquet suffix (e.g. '.MDS.ontarget.parquet').

results_dirs (list[Path], required): Krewlyzer output directories to scan.

sample_ids (list[str] | set[str] | None, default None): Optional subset to filter to.

conn (DuckDBPyConnection | None, default None): Optional DuckDB connection.

Returns:

DataFrame: One row per sample, or an empty DataFrame on failure.
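For orientation, a query of the expected shape might look like the sketch below. The column name `fragment_length` and the helper `count_placeholders` are hypothetical. Grouping by the `filename` column that `read_parquet` can emit is one way to get one row per file, and therefore one row per sample.

```python
# Hypothetical evaluator query; `fragment_length` is an assumed column name.
FEATURE_SQL = """
SELECT
    filename,
    avg(fragment_length) AS mean_fragment_length
FROM read_parquet(?, filename = true)
GROUP BY filename
"""


def count_placeholders(sql: str) -> int:
    """Count '?' placeholders so a malformed query can be rejected early."""
    return sql.count("?")


n_placeholders = count_placeholders(FEATURE_SQL)
```

The single ? is where the list of discovered parquet paths is bound at execution time.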

load_sample_feature(sample_id, feature_suffix, results_dir)

Load a single parquet feature file for one sample. Useful for single-sample logic.

load_sample_metadata(sample_id, results_dir)

Load the metadata.parquet for a sample and return it as a dict of QC values.

make_variant_key(df)

Create a hashable variant key from genomic coordinates.
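A sketch of what a hashable variant key can look like, shown on plain dict rows rather than a DataFrame. The column names are standard MAF fields, but which columns the real function uses is an assumption. The rows are illustrative.

```python
def variant_key(row: dict) -> tuple:
    """Build a hashable key from genomic coordinates and alleles."""
    # Normalise types so that "7" and 7 produce the same key.
    return (
        str(row["Chromosome"]),
        int(row["Start_Position"]),
        row["Reference_Allele"],
        row["Tumor_Seq_Allele2"],
    )


rows = [
    {"Chromosome": "7", "Start_Position": 55191822,
     "Reference_Allele": "T", "Tumor_Seq_Allele2": "G"},
    # Same variant with differently typed fields.
    {"Chromosome": 7, "Start_Position": "55191822",
     "Reference_Allele": "T", "Tumor_Seq_Allele2": "G"},
    {"Chromosome": "12", "Start_Position": 25245350,
     "Reference_Allele": "C", "Tumor_Seq_Allele2": "T"},
]
unique_keys = {variant_key(r) for r in rows}
```

Tuples of strings and integers are hashable, so the keys can be used directly in sets and as dictionary keys when matching variants across tables.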

fuse_matrices(output_dir, *, min_evaluators=1, drop_low_variance=True, variance_threshold=0.0, output_name='super_matrix.parquet')

Discover per-evaluator matrices and fuse into a single super-matrix.

Each *_matrix.parquet file in output_dir contains label/metadata columns plus evaluator-specific feature columns. This function:

  1. Discovers all *_matrix.parquet files (skips super_matrix.parquet and scoreboard_* files).
  2. Extracts metadata columns once from the first matrix.
  3. Prefixes each evaluator's feature columns with the evaluator name to prevent collisions (e.g., FSCOnTarget__ratio_short).
  4. Outer-joins all feature DataFrames on SAMPLE_ID.
  5. Filters to samples appearing in at least min_evaluators matrices.
  6. Drops low-variance features (cross-evaluator QC) when enabled.
  7. Writes the result to output_dir / output_name.
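Steps 3 to 6 above can be sketched on plain dictionaries. This is an illustration only: it skips file discovery, metadata handling, and the parquet write. The evaluator names in the example data are hypothetical, and both the use of population variance and the handling of missing cells are assumptions.

```python
from statistics import pvariance


def fuse(matrices, min_evaluators=1, drop_low_variance=True, variance_threshold=0.0):
    """matrices maps evaluator name -> {sample_id: {feature: value}}."""
    fused, seen = {}, {}
    for evaluator, rows in matrices.items():
        for sample_id, feats in rows.items():
            seen[sample_id] = seen.get(sample_id, 0) + 1
            target = fused.setdefault(sample_id, {})  # outer join on sample ID
            for name, value in feats.items():
                # Prefix with the evaluator name to avoid column collisions.
                target[f"{evaluator}__{name}"] = value
    # Keep samples that appear in at least min_evaluators matrices.
    fused = {s: f for s, f in fused.items() if seen[s] >= min_evaluators}
    if drop_low_variance:
        columns = {c for f in fused.values() for c in f}
        for col in columns:
            # Missing cells from the outer join are ignored here (an assumption).
            values = [f[col] for f in fused.values() if col in f]
            if pvariance(values) <= variance_threshold:
                for f in fused.values():
                    f.pop(col, None)
    return fused


matrices = {
    "FSCOnTarget": {"S1": {"ratio_short": 0.2, "flag": 1.0},
                    "S2": {"ratio_short": 0.5, "flag": 1.0}},
    "MDS": {"S1": {"score": 3.0}, "S2": {"score": 4.0}, "S3": {"score": 9.0}},
}
result = fuse(matrices, min_evaluators=2)
```

In this example S3 is removed because it appears in only one matrix, and FSCOnTarget__flag is removed because it is constant across the remaining samples.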

Parameters:

output_dir (str | Path, required): Directory containing *_matrix.parquet files.

min_evaluators (int, default 1): Minimum number of evaluators a sample must appear in.

drop_low_variance (bool, default True): If True, drop features with variance at or below variance_threshold after fusion. Eliminates constant columns that carry no discriminative signal (step 3A.3 of the plan).

variance_threshold (float, default 0.0): Features with variance <= this value are dropped. The default of 0.0 drops only exactly-constant features.

output_name (str, default 'super_matrix.parquet'): Filename for the output parquet.

Returns:

DataFrame: The fused DataFrame.