Core Architecture
The kreview.core module defines the configuration dataclasses (Paths, LabelConfig, EvalRun), manages DuckDB connections, and provides the thread-throttled parquet loading engine with exponential-backoff retry.
kreview.core
LabelConfig
dataclass
Configuration for the ctDNA labeling engine.
Paths
dataclass
All input paths for the labeling pipeline.
__post_init__()
Coerce all path fields from str to Path for safe / operations.
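As a minimal sketch of this coercion pattern (the class and field names below are hypothetical, not the real Paths fields), a dataclass can loop over its own fields in `__post_init__`:

```python
from dataclasses import dataclass, fields
from pathlib import Path


@dataclass
class ExamplePaths:
    # Hypothetical field names; the real Paths dataclass defines its own.
    samplesheet: Path
    maf: Path

    def __post_init__(self) -> None:
        # Coerce every field to Path so callers may pass plain strings.
        for f in fields(self):
            setattr(self, f.name, Path(getattr(self, f.name)))


paths = ExamplePaths(samplesheet="data/samples.csv", maf="data/mutations.maf")
sibling = paths.maf.parent / "sv.txt"  # the / operator now works
```

This sketch assumes every field is required; optional fields holding None would need a guard before the `Path(...)` call.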
EvalRun
dataclass
Records the exact conditions of an evaluation run.
load_samplesheet(path)
Load a krewlyzer samplesheet CSV.
get_sample_ids(samplesheet_path)
Extract unique sample IDs from a samplesheet.
load_maf(path)
cached
Load MAF file with only the columns needed for labeling.
load_sv(path)
cached
Load structural variant file.
load_cna(path)
cached
Load discrete CNA matrix (wide format). Index = Hugo_Symbol.
load_clinical_sample(path)
cached
Load sample-level clinical data (skips 4-line cBioPortal # header).
load_clinical_patient(path)
cached
Load patient-level clinical data (skips 4-line cBioPortal # header).
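To illustrate the header skipping described for both clinical loaders, here is a standard-library sketch. The file content is made up, and `read_clinical` is a hypothetical helper rather than the module's implementation:

```python
import csv
import io

# Tiny in-memory stand-in for a cBioPortal clinical file: four '#' metadata
# lines, then the column header, then tab-separated rows. Values are made up.
EXAMPLE = (
    "#Sample Identifier\tCancer Type\n"
    "#Sample ID\tCancer type\n"
    "#STRING\tSTRING\n"
    "#1\t1\n"
    "SAMPLE_ID\tCANCER_TYPE\n"
    "S1\tBreast\n"
    "S2\tLung\n"
)


def read_clinical(handle, n_header_lines=4):
    """Skip the leading metadata lines, then parse the rest as TSV."""
    for _ in range(n_header_lines):
        next(handle)
    return list(csv.DictReader(handle, delimiter="\t"))


rows = read_clinical(io.StringIO(EXAMPLE))
```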
clear_cbioportal_caches()
Explicitly clear all cBioPortal LRU caches to prevent memory leaks in long-running processes.
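The sketch below shows how cache clearing of this kind can work, assuming the loaders are wrapped with `functools.lru_cache`. The names `load_table`, `CACHED_LOADERS`, and `clear_caches` are illustrative only:

```python
from functools import lru_cache

calls = []


@lru_cache(maxsize=None)
def load_table(path: str) -> tuple:
    # Stand-in for an expensive file read; records each real invocation.
    calls.append(path)
    return (path, "parsed")


CACHED_LOADERS = [load_table]


def clear_caches() -> None:
    # cache_clear() is provided by functools.lru_cache on every wrapped function.
    for fn in CACHED_LOADERS:
        fn.cache_clear()


load_table("a.maf")
load_table("a.maf")  # served from cache, no second read
clear_caches()
load_table("a.maf")  # read again after clearing
```

Cached results hold references to the loaded data, so a long-running process that loads many different files may want to clear them between runs.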
get_duckdb_conn()
Create a thread-local DuckDB connection, configured for 4 threads and a 4 GB memory limit to prevent out-of-memory failures on HPC nodes.
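A thread-local connection can be sketched with `threading.local`. To keep the example dependency-free, the connection is produced by a generic factory (here plain `object`); `get_conn` is a hypothetical stand-in, not the module's function:

```python
import threading

_local = threading.local()


def get_conn(factory):
    """Return this thread's connection, creating it on first use."""
    conn = getattr(_local, "conn", None)
    if conn is None:
        # With DuckDB, the factory would open the connection and apply
        # settings such as SET threads = 4 and SET memory_limit = '4GB'.
        conn = factory()
        _local.conn = conn
    return conn


main_a = get_conn(object)
main_b = get_conn(object)  # same thread: the cached connection is reused

worker_result = []
t = threading.Thread(target=lambda: worker_result.append(get_conn(object)))
t.start()
t.join()
```

Each thread sees its own `_local.conn`, so connections are never shared across threads.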
discover_available_samples(results_dirs, required_suffix='.metadata.parquet')
Discover which samples have completed krewlyzer processing across multiple input directories.
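One plausible way to implement this discovery is to treat the presence of a file ending in `required_suffix` as the completion marker. The directory layout, the return type, and the helper name `discover` are assumptions made for illustration:

```python
import tempfile
from pathlib import Path


def discover(results_dirs, required_suffix=".metadata.parquet"):
    """Map sample ID -> directory holding its marker file (first match wins)."""
    found = {}
    for d in results_dirs:
        for p in sorted(Path(d).rglob(f"*{required_suffix}")):
            sample_id = p.name[: -len(required_suffix)]
            found.setdefault(sample_id, p.parent)
    return found


with tempfile.TemporaryDirectory() as tmp:
    run1, run2 = Path(tmp, "run1"), Path(tmp, "run2")
    (run1 / "S1").mkdir(parents=True)
    (run2 / "S2").mkdir(parents=True)
    (run1 / "S1" / "S1.metadata.parquet").touch()
    (run2 / "S2" / "S2.metadata.parquet").touch()
    (run2 / "S2" / "S3.FSC.ontarget.parquet").touch()  # no metadata: not complete
    samples = sorted(discover([run1, run2]))
```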
iter_feature_chunks(feature_suffix, results_dirs, sample_ids=None, conn=None, chunk_size='auto')
Stream feature data as chunks without accumulating in memory.
This is the memory-safe alternative to load_feature_cohort. Each chunk
contains complete samples (one parquet file = one sample), so chunk
boundaries never split a sample's data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| feature_suffix | str | Parquet file suffix (e.g. '.FSC.ontarget.parquet'). | required |
| results_dirs | list[Path] | Krewlyzer output directories to scan. | required |
| sample_ids | list[str] \| set[str] \| None | Optional subset of samples to include. | None |
| conn | DuckDBPyConnection \| None | Optional DuckDB connection (created if not provided). | None |
| chunk_size | int \| str | Number of files per batch. 'auto' (default) probes the first 5 files to estimate rows-per-sample and self-tunes the batch size to target ~15M rows per chunk. | 'auto' |
Yields:
| Type | Description |
|---|---|
| (chunk_df, chunk_idx, n_chunks) | A tuple of the DataFrame chunk, the zero-based chunk index, and the total number of chunks. |
Raises:
| Type | Description |
|---|---|
| ValueError | If results_dirs is empty. |
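The chunking contract can be sketched without any parquet I/O. The helpers below are hypothetical: `auto_chunk_size` derives a files-per-chunk value from probed row counts and the ~15M-row target, and `iter_chunks` yields whole files per chunk along with the index and total. The real 'auto' heuristic may differ in its details.

```python
import math


def auto_chunk_size(probe_row_counts, target_rows=15_000_000):
    """Files per chunk so that one chunk holds roughly target_rows rows."""
    rows_per_file = max(1, sum(probe_row_counts) // len(probe_row_counts))
    return max(1, target_rows // rows_per_file)


def iter_chunks(files, chunk_size):
    """Yield (files_in_chunk, chunk_idx, n_chunks); files are never split."""
    n_chunks = math.ceil(len(files) / chunk_size)
    for idx in range(n_chunks):
        yield files[idx * chunk_size:(idx + 1) * chunk_size], idx, n_chunks


files = [f"S{i}.FSC.ontarget.parquet" for i in range(7)]
size = auto_chunk_size([5_000_000] * 5)  # 15M // 5M = 3 files per chunk
chunks = list(iter_chunks(files, size))
```

Because one file corresponds to one sample, slicing the file list is enough to guarantee that a sample never straddles two chunks.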
load_feature_cohort(feature_suffix, results_dirs, sample_ids=None, conn=None, chunk_size='auto')
Load one feature type across available samples into a single DataFrame.
This is the backward-compatible batch loader. For memory-constrained
environments (e.g., HPC with 64GB RAM), use iter_feature_chunks instead
to process data in a streaming fashion.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| chunk_size | int \| str | 'auto' (default) probes parquet row density at runtime. Pass an integer to override (e.g. 500 for large features). | 'auto' |
ARCHITECTURE NOTE: We build an explicit file list from discovered samples instead of using a glob pattern. This avoids DuckDB scanning thousands of directories over network mounts (SFTP/NFS), which causes multi-minute stalls.
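Reads over network mounts can also fail transiently, which is where the exponential backoff retry mentioned in the module overview comes in. The following is a generic sketch of that technique. The function name, the retry count, the base delay, and the choice of `OSError` as the retryable error are all assumptions:

```python
import time


def with_backoff(fn, *, retries=4, base_delay=0.5, sleep=time.sleep):
    """Call fn(); on OSError wait base_delay * 2**attempt and try again."""
    for attempt in range(retries):
        try:
            return fn()
        except OSError:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Demo with a fake reader that fails twice, and a recording sleep
# so the example runs instantly.
attempts = {"n": 0}
delays = []


def flaky_read():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise OSError("stale NFS handle")
    return "ok"


result = with_backoff(flaky_read, sleep=delays.append)
```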
load_metadata_cohort(results_dirs, sample_ids=None, chunk_size='auto')
Load metadata for all samples. Thin wrapper around load_feature_cohort.
Metadata files are ~1 row per sample, so 'auto' will resolve to chunk_size=15_000 (single-sweep load) on first probe.
run_feature_sql(sql_query, feature_suffix, results_dirs, sample_ids=None, conn=None)
Execute a full-cohort SQL extraction in a single DuckDB pass.
This is the SQL pushdown alternative to iter_feature_chunks +
per-sample extract(). It runs the evaluator's SQL query against
all discovered parquet files at once, returning one row per sample.
The query must contain a read_parquet(?, ...) placeholder that
accepts the file path list.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sql_query | str | DuckDB SQL with | required |
| feature_suffix | str | Parquet suffix (e.g. '.MDS.ontarget.parquet'). | required |
| results_dirs | list[Path] | Krewlyzer output directories to scan. | required |
| sample_ids | list[str] \| set[str] \| None | Optional subset to filter to. | None |
| conn | DuckDBPyConnection \| None | Optional DuckDB connection. | None |
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with one row per sample, or empty DataFrame on failure. |
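For illustration, a query in the shape this function expects might look like the one below. The query itself is hypothetical (it only counts rows per file), and `has_file_placeholder` is an assumed helper that checks for the `read_parquet(?, ...)` placeholder:

```python
import re

# Hypothetical evaluator query: one output row per parquet file.
# filename = true asks DuckDB to add a 'filename' column to each row.
EXAMPLE_QUERY = """
SELECT filename, count(*) AS n_rows
FROM read_parquet(?, filename = true)
GROUP BY filename
"""

_PLACEHOLDER = re.compile(r"read_parquet\(\s*\?", re.IGNORECASE)


def has_file_placeholder(sql: str) -> bool:
    """True if the query takes its file list through read_parquet(?, ...)."""
    return bool(_PLACEHOLDER.search(sql))


ok = has_file_placeholder(EXAMPLE_QUERY)
bad = has_file_placeholder("SELECT * FROM read_parquet('*.parquet')")
```

With DuckDB's Python client, a query like this should be executable by binding the file list as the single parameter, for example `conn.execute(EXAMPLE_QUERY, [file_list])`. How run_feature_sql binds the parameter and derives sample IDs is not shown here.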
load_sample_feature(sample_id, feature_suffix, results_dir)
Load a single parquet feature file for one sample. (Useful for single-sample logic)
load_sample_metadata(sample_id, results_dir)
Load the metadata.parquet for a sample and return it as a dict of QC values.
make_variant_key(df)
Create a hashable variant key from genomic coordinates.
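A hashable key of this kind could be built as a tuple of coordinates and alleles. The sketch works on a single row represented as a dict, whereas the real function takes a DataFrame. The column names follow the MAF convention and are an assumption about which fields are used:

```python
def variant_key(row: dict) -> tuple:
    """Build a hashable key; the MAF-style column names are assumed."""
    return (
        str(row["Chromosome"]),
        int(row["Start_Position"]),
        row["Reference_Allele"],
        row["Tumor_Seq_Allele2"],
    )


# The same variant with different value types normalizes to one key.
a = {"Chromosome": 7, "Start_Position": "140453136",
     "Reference_Allele": "A", "Tumor_Seq_Allele2": "T"}
b = {"Chromosome": "7", "Start_Position": 140453136,
     "Reference_Allele": "A", "Tumor_Seq_Allele2": "T"}
unique = {variant_key(a), variant_key(b)}
```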
fuse_matrices(output_dir, *, min_evaluators=1, drop_low_variance=True, variance_threshold=0.0, output_name='super_matrix.parquet')
Discover per-evaluator matrices and fuse into a single super-matrix.
Each *_matrix.parquet file in output_dir contains label/metadata
columns plus evaluator-specific feature columns. This function:
- Discovers all *_matrix.parquet files (skips super_matrix.parquet and scoreboard_* files).
- Extracts metadata columns once from the first matrix.
- Prefixes each evaluator's feature columns with the evaluator name to prevent collisions (e.g., FSCOnTarget__ratio_short).
- Outer-joins all feature DataFrames on SAMPLE_ID.
- Filters to samples appearing in at least min_evaluators matrices.
- Drops low-variance features (cross-evaluator QC) when enabled.
- Writes the result to output_dir / output_name.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| output_dir | str \| Path | Directory containing | required |
| min_evaluators | int | Minimum number of evaluators a sample must appear in. | 1 |
| drop_low_variance | bool | If True, drop features with variance at or below | True |
| variance_threshold | float | Features with variance <= this value are dropped. Default 0.0 drops only exactly-constant features. | 0.0 |
| output_name | str | Filename for the output parquet. | 'super_matrix.parquet' |
Returns:
| Type | Description |
|---|---|
| DataFrame | The fused DataFrame. |
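The fusion steps (prefixing, outer join, the min_evaluators filter, and the variance filter) can be sketched in plain Python with nested dicts standing in for DataFrames. This is an illustrative reimplementation under stated assumptions, not the function's actual code. It omits file discovery, metadata columns, and parquet output, and it uses population variance, which agrees with sample variance at the default threshold of 0.0.

```python
from statistics import pvariance


def fuse(matrices, min_evaluators=1, variance_threshold=0.0):
    """matrices: {evaluator: {sample_id: {feature: value}}} -> {sample_id: row}."""
    fused, seen = {}, {}
    for evaluator, rows in matrices.items():
        for sample_id, feats in rows.items():
            seen[sample_id] = seen.get(sample_id, 0) + 1
            row = fused.setdefault(sample_id, {})
            for name, value in feats.items():
                row[f"{evaluator}__{name}"] = value  # prefix avoids collisions

    # Keep samples present in at least min_evaluators matrices.
    fused = {s: r for s, r in fused.items() if seen[s] >= min_evaluators}

    # Outer join: a sample missing from an evaluator gets None for its columns.
    columns = sorted({c for r in fused.values() for c in r})
    for r in fused.values():
        for c in columns:
            r.setdefault(c, None)

    # Drop columns whose variance over observed values is <= the threshold.
    for c in columns:
        observed = [r[c] for r in fused.values() if r[c] is not None]
        if len(observed) < 2 or pvariance(observed) <= variance_threshold:
            for r in fused.values():
                del r[c]
    return fused


matrices = {
    "FSCOnTarget": {"S1": {"ratio_short": 0.2, "const": 1.0},
                    "S2": {"ratio_short": 0.4, "const": 1.0}},
    "MDS": {"S1": {"score": 3.0}, "S2": {"score": 5.0}, "S3": {"score": 4.0}},
}
fused = fuse(matrices, min_evaluators=2)
```

In this example S3 is removed because it appears in only one evaluator, and the constant column `FSCOnTarget__const` is removed by the variance filter.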