Evaluation Engine API Reference
The kreview.eval_engine module contains the statistical testing functions, ML model training, visualization generators, and clinical utility computations.
For conceptual explanations, see:
- Statistical Tests
- Models & Metrics
- Decision Curve Analysis
- Dashboard Interpretation Guide
- Feature Cards
kreview.eval_engine
FeatureEvaluator
Base class for all feature evaluators. Defines the extraction contract that transforms raw DuckDB queries into 1D arrays.
Subclasses must implement extract() for per-sample Python extraction.
Optionally, override extract_sql() to return a DuckDB SQL query that
performs full-cohort extraction in a single pass (no Python loop).
supports_sql (property)
True if this evaluator provides a SQL pushdown query.
extract(df)
Transform the loaded raw dataframe into meaningful scalar metrics. Called per sample-group or per sample.
extract_sql()
Return a DuckDB SQL query for full-cohort extraction, or None.
If implemented, the query should:
- Accept a read_parquet(?, ...) placeholder for file paths
- GROUP BY sample_id to produce one row per sample
- SELECT all extracted feature columns with their final names
Returning None (the default) means this evaluator does not
support SQL pushdown and will use the chunked Python path.
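The contract above can be illustrated with a minimal sketch. FeatureEvaluatorSketch is a hypothetical stand-in written for this example, not the kreview base class. The MeanFragmentLength evaluator, the fragment_length column, and the way supports_sql is derived from extract_sql() are all assumptions.

```python
from typing import Optional


class FeatureEvaluatorSketch:
    """Hypothetical stand-in for the base class (not the kreview implementation)."""

    def extract(self, df):
        raise NotImplementedError("subclasses must implement extract()")

    def extract_sql(self) -> Optional[str]:
        # Default: no SQL pushdown, so the chunked Python path is used.
        return None

    @property
    def supports_sql(self) -> bool:
        # Assumption: pushdown support is inferred from a non-None query.
        return self.extract_sql() is not None


class MeanFragmentLength(FeatureEvaluatorSketch):
    """Hypothetical evaluator; the column names are invented for illustration."""

    def extract(self, df):
        # Per-sample Python path. Here df is assumed to be a list of row dicts.
        lengths = [row["fragment_length"] for row in df]
        return {"mean_fragment_length": sum(lengths) / len(lengths)}

    def extract_sql(self) -> Optional[str]:
        # Full-cohort path: placeholder for file paths, one row per sample,
        # feature columns selected under their final names.
        return (
            "SELECT sample_id, AVG(fragment_length) AS mean_fragment_length "
            "FROM read_parquet(?) "
            "GROUP BY sample_id"
        )
```

An evaluator that leaves extract_sql() at its default reports supports_sql as False and is handled by the Python path only.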
parse_array(s)
Parse a string-encoded numeric array into a list of floats.
Handles formats like '[1.0 2.0 3.0]' from parquet serialization. Returns an empty list on any parse failure (no silent corruption).
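A minimal sketch of this behavior, assuming whitespace-separated values inside square brackets. The function name parse_array_sketch is hypothetical, and accepting commas as separators is an assumption that the text above does not confirm.

```python
def parse_array_sketch(s):
    """Parse '[1.0 2.0 3.0]' into [1.0, 2.0, 3.0]; return [] on any failure."""
    try:
        inner = s.strip().strip("[]")
        # Assumption: commas are tolerated as separators alongside whitespace.
        return [float(tok) for tok in inner.replace(",", " ").split()]
    except (AttributeError, ValueError):
        # Non-string input or a non-numeric token: discard the whole array
        # rather than returning a partially parsed one.
        return []
```

Because the conversion happens inside a single list comprehension, one bad token invalidates the entire result, which matches the all-or-nothing behavior described above.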
univariate_auc(feature_col, y, n_folds=5, random_state=42)
Compute cross-validated AUC for a single feature using univariate LR.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| feature_col | | pandas Series or array-like of a single feature. | required |
| y | | binary label array (0/1). | required |
| n_folds | int | number of CV folds. | 5 |
| random_state | int | random seed. | 42 |
Returns:
| Type | Description |
|---|---|
| float | Cross-validated AUC (float). Returns 0.5 if the feature is constant, there are too few samples per class, or CV fails. |
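The 0.5 fallback corresponds to chance-level discrimination. The sketch below is a plain rank-based AUC, not the cross-validated logistic-regression procedure this function uses, and the name rank_auc is hypothetical. It is included only to show why a constant feature scores exactly 0.5: every positive/negative pair is a tie.

```python
def rank_auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return 0.5  # AUC is undefined with a single class; fall back to chance
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```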
mutual_info_score(feature_col, y, random_state=42)
Compute mutual information between a single feature and binary target.
Uses sklearn's mutual_info_classif with k=3 nearest neighbors to estimate the non-linear dependency between a feature and the label. Unlike AUC, mutual information captures arbitrary (non-monotonic) relationships.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| feature_col | | pandas Series or array-like of a single feature. | required |
| y | | binary label array (0/1). | required |
| random_state | int | random seed for reproducibility. | 42 |
Returns:
| Type | Description |
|---|---|
| float | Mutual information score (float, >= 0). Higher means more informative. Returns 0.0 if the feature is constant or computation fails. |
set_theme(cvd_safe=False)
Updates the global label and model colors at runtime according to the color-vision-deficiency (CVD) preference.
evaluate_feature(feature_values, labels, total_fragments=None, max_vaf=None)
Run all statistical tests for a single feature in one stratum. Outputs metrics directly to the scoring dict.
plot_violin(df, feature_col, label_col='label', title='')
Four-group violin plot with an overlaid box plot and individual points for small groups.
plot_density(df, feature_col, label_col='label', title='')
Overlaid density curves per group, showing differences in distribution shape.
plot_feature_vs_vaf(df, feature_col, vaf_col='max_vaf', label_col='label', title='')
Continuous relationship between feature and tumor burden (VAF proxy).
plot_roc_curves(y_true_dict, y_score_dict, title='')
Overlay ROC curves for multiple comparisons.
plot_feature_importance(importances, title='')
Bar plot of RF feature importances.
decision_curve_analysis(y_true, y_prob, thresholds=None)
Compute Decision Curve Analysis (DCA) net benefit data.
For each threshold, calculates the net benefit of using the model vs treating all or treating none. This helps clinicians choose an operating threshold that balances false positives against missed detections.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y_true | ndarray | Binary ground truth labels (0/1). | required |
| y_prob | ndarray | Predicted probabilities for positive class. | required |
| thresholds | ndarray \| None | Array of decision thresholds to evaluate. Defaults to | None |
Returns:
| Type | Description |
|---|---|
| dict | Dictionary with keys |
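The net benefit comparison can be sketched with the standard DCA formula: at threshold pt, net benefit is TP/n minus FP/n weighted by pt / (1 - pt). The function name net_benefit_sketch and the output keys ("threshold", "model", "treat_all", "treat_none") are hypothetical and are not the keys returned by decision_curve_analysis().

```python
def net_benefit_sketch(y_true, y_prob, thresholds):
    """Net benefit of the model, treat-all, and treat-none at each threshold.

    Thresholds must lie strictly between 0 and 1.
    """
    n = len(y_true)
    prevalence = sum(y_true) / n
    rows = []
    for pt in thresholds:
        # Weight of one false positive relative to one true positive.
        odds = pt / (1.0 - pt)
        tp = sum(1 for y, p in zip(y_true, y_prob) if p >= pt and y == 1)
        fp = sum(1 for y, p in zip(y_true, y_prob) if p >= pt and y == 0)
        rows.append({
            "threshold": pt,
            "model": tp / n - fp / n * odds,
            # Treat-all flags everyone: TP rate is the prevalence.
            "treat_all": prevalence - (1.0 - prevalence) * odds,
            "treat_none": 0.0,
        })
    return rows
```

A model is useful at a given threshold when its net benefit exceeds both the treat-all and treat-none values at that threshold.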
evaluate_model(model, X, y, cv, name, feature_names=None, refit=True, compute_shap=False, shap_samples=500, random_state=42)
Evaluate any sklearn-compatible model via stratified cross-validation.
Shared primitive used by cpu_models(), gpu_models(), and multimodal_eval(). The model must implement fit() and predict_proba().
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | | sklearn-compatible estimator (Pipeline, RF, XGB, TabPFN, etc.). | required |
| X | ndarray | Feature matrix, shape (n_samples, n_features). | required |
| y | ndarray | Binary labels (0/1), shape (n_samples,). | required |
| cv | StratifiedKFold | Pre-configured StratifiedKFold splitter. | required |
| name | str | Prefix for all result keys (e.g. "lr", "rf", "tabpfn"). | required |
| feature_names | list[str] \| None | Optional feature names for importance extraction. | None |
| refit | bool | If True, refit model on full data after CV. | True |
| compute_shap | bool | If True, compute SHAP values (requires refit=True). | False |
| shap_samples | int | Max samples for SHAP computation. | 500 |
Returns:
| Type | Description |
|---|---|
| tuple[dict, object \| None] | (result_dict, fitted_model_or_None). result_dict keys are prefixed with |
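The idea of scoring each sample with a model that never saw it during training can be sketched in plain Python. This is not the evaluate_model() implementation: the real function takes a pre-configured StratifiedKFold and an sklearn-compatible estimator, whereas the sketch builds its own round-robin stratified folds. PrevalenceModel, stratified_folds, and oof_predictions are hypothetical names.

```python
import random


class PrevalenceModel:
    """Toy estimator exposing the fit()/predict_proba() interface."""

    def fit(self, X, y):
        self.p_ = sum(y) / len(y)  # predict the training prevalence for everyone
        return self

    def predict_proba(self, X):
        return [[1.0 - self.p_, self.p_] for _ in X]


def stratified_folds(y, n_folds, seed):
    """Assign each sample index to a fold, class by class, round-robin."""
    rng = random.Random(seed)
    fold_of = [0] * len(y)
    for cls in sorted(set(y)):
        idx = [i for i, label in enumerate(y) if label == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            fold_of[i] = pos % n_folds
    return fold_of


def oof_predictions(model_factory, X, y, n_folds=5, seed=42):
    """Out-of-fold positive-class probabilities, one per sample."""
    fold_of = stratified_folds(y, n_folds, seed)
    oof = [None] * len(y)
    for k in range(n_folds):
        train = [i for i in range(len(y)) if fold_of[i] != k]
        test = [i for i in range(len(y)) if fold_of[i] == k]
        # A fresh model per fold, trained only on the other folds.
        model = model_factory().fit([X[i] for i in train], [y[i] for i in train])
        for i, row in zip(test, model.predict_proba([X[i] for i in test])):
            oof[i] = row[1]
    return oof
```

Metrics computed on these out-of-fold probabilities are not inflated by training-set memorization, which is the property the shared primitive relies on.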
cpu_models(X, y, feature_names=None, cancer_types=None, assays=None, n_folds=5, random_state=42, compute_shap=False, shap_samples=500)
Train LR, RF, and XGB on a feature matrix with stratified CV.
Delegates per-model evaluation to evaluate_model(), then adds
cross-model diagnostics (AUC deltas, top features, threshold sweep,
DCA, feature stability, subgroup analysis).
Returns (results_dict, lr_pipeline, rf_model, xgb_model).
Audit fixes preserved
- C-01: LR uses Pipeline(scaler+lr) to prevent data leakage
- C-02: Subgroup metrics use out-of-fold predictions (unbiased)
- H-01: LR has class_weight="balanced", XGB has scale_pos_weight
- M-02: Bootstrap 95% CI on AUC values
gpu_models(X, y, feature_names=None, cancer_types=None, assays=None, n_folds=5, random_state=42, models=('tabpfn',), device='cuda', finetune=True, finetune_epochs=30, finetune_lr=1e-05, compute_shap=False, shap_samples=500)
Train GPU foundation models (TabPFN, TabICL) on a feature matrix.
Same output schema as cpu_models() for each model, using the shared
evaluate_model() primitive. Fine-tuning is ON by default.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | ndarray | Feature matrix, shape (n_samples, n_features). | required |
| y | ndarray | Binary labels (0/1). | required |
| feature_names | list[str] \| None | Optional feature names. | None |
| cancer_types | ndarray \| None | Optional cancer type array for subgroup analysis. | None |
| assays | ndarray \| None | Optional assay array for subgroup analysis. | None |
| n_folds | int | Number of CV folds. | 5 |
| random_state | int | Random seed. | 42 |
| models | tuple[str, ...] | Tuple of GPU model names ("tabpfn", "tabicl"). | ('tabpfn',) |
| device | str | PyTorch device ("cuda", "cpu"). | 'cuda' |
| finetune | bool | If True (default), use fine-tuned variants. | True |
| finetune_epochs | int | Epochs for fine-tuning. | 30 |
| finetune_lr | float | Learning rate for fine-tuning. | 1e-05 |
| compute_shap | bool | If True, compute SHAP values. | False |
| shap_samples | int | Max samples for SHAP computation. | 500 |
Returns:
| Type | Description |
|---|---|
| tuple[dict, dict[str, object]] | (results_dict, fitted_models_dict). fitted_models_dict maps model name to fitted model object. |
load_model_results(directory, evaluator_name)
Load and merge CPU + GPU model results for a single evaluator.
Looks for {evaluator_name}_model_results.json (CPU) and
{evaluator_name}_gpu_model_results.json (GPU). If both exist,
GPU keys are merged into the CPU dict: GPU keys take precedence
for overlapping model keys, while metadata keys like 'evaluator' are
kept from CPU.
Used by report templates where the evaluator name is known.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| directory | Path | Directory containing the JSON result files. | required |
| evaluator_name | str | Evaluator name (e.g., | required |
Returns:
| Type | Description |
|---|---|
| dict \| None | Merged dict, or None if no JSON exists for this evaluator. |
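The merge precedence described above can be sketched as follows. The function name merge_results_sketch and the METADATA_KEYS set are hypothetical. Only 'evaluator' is named in the text as a metadata key, so the set contains just that one entry, and the result keys in the usage note are invented for illustration.

```python
METADATA_KEYS = {"evaluator"}  # assumption: the only metadata key named in the text


def merge_results_sketch(cpu, gpu):
    """Merge GPU results into CPU results; return None when neither exists."""
    if cpu is None and gpu is None:
        return None
    merged = dict(cpu or {})
    for key, value in (gpu or {}).items():
        # Metadata already supplied by the CPU file is kept as-is.
        if key in METADATA_KEYS and key in merged:
            continue
        # Every other GPU key overrides or extends the CPU entries.
        merged[key] = value
    return merged
```

For example, merging {"evaluator": "a", "auc": 0.8} with {"evaluator": "b", "auc": 0.9} keeps "a" as the evaluator and takes 0.9 as the AUC.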
load_all_model_results(directory)
Load and merge all CPU + GPU model results from a directory.
Scans for *_model_results.json and *_gpu_model_results.json,
groups by evaluator name, and merges GPU keys into CPU dicts.
Used by scoreboard and multimodal engine for directory scanning.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| directory | Path | Directory containing model result JSON files. | required |
Returns:
| Type | Description |
|---|---|
| dict[str, dict] | Dict keyed by evaluator name → merged model results dict. Empty dict if no results found. |
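One detail of directory scanning is worth illustrating: the pattern *_model_results.json also matches GPU files, so the longer GPU suffix has to be checked first when recovering the evaluator name. The sketch below is an assumption about how such grouping could work, not the library's code. The function group_result_files and the evaluator names "fsc" and "wps" are hypothetical.

```python
import tempfile
from pathlib import Path

GPU_SUFFIX = "_gpu_model_results.json"
CPU_SUFFIX = "_model_results.json"


def group_result_files(directory):
    """Map evaluator name -> {'cpu': Path, 'gpu': Path} for files in directory."""
    groups = {}
    for path in sorted(Path(directory).glob("*" + CPU_SUFFIX)):
        # The CPU glob also matches GPU files, so test the longer suffix first.
        if path.name.endswith(GPU_SUFFIX):
            name, kind = path.name[: -len(GPU_SUFFIX)], "gpu"
        else:
            name, kind = path.name[: -len(CPU_SUFFIX)], "cpu"
        groups.setdefault(name, {})[kind] = path
    return groups


# Usage: build a throwaway directory with two evaluators, one of which has GPU results.
with tempfile.TemporaryDirectory() as tmp:
    for fname in (
        "fsc_model_results.json",
        "fsc_gpu_model_results.json",
        "wps_model_results.json",
    ):
        (Path(tmp) / fname).write_text("{}")
    found = {name: sorted(kinds) for name, kinds in group_result_files(tmp).items()}
```

With this suffix-based approach, an evaluator whose own name ends in "_gpu" would be ambiguous; how the library handles that case is not stated here.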
multimodal_eval(results_dir, super_matrix_path=None, *, models=('rf', 'xgb'), gpu_models=(), device='cuda', finetune=True, finetune_epochs=30, finetune_lr=1e-05, n_folds=5, top_percentile=10.0, random_state=42, multimodal_selection='mi')
Cross-evaluator multimodal evaluation.
Implements three complementary strategies:
- Stacking: Meta-learner trained on OOF predictions from all per-evaluator models. Each column is one evaluator's OOF probability. This measures how much combining multiple fragmentomics signals improves classification.
- Raw features (optional): If super_matrix_path is provided, trains directly on the fused feature matrix with multimodal_selection-based feature selection (MI or Boruta-SHAP).
- Ablation: Leave-one-evaluator-out analysis on the stacking matrix, showing each evaluator's marginal contribution.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| results_dir | str \| Path | Directory with | required |
| super_matrix_path | str \| Path \| None | Optional path to | None |
| models | tuple[str, ...] | CPU model names to use ( | ('rf', 'xgb') |
| gpu_models | tuple[str, ...] | GPU model names ( | () |
| device | str | PyTorch device string for GPU models. | 'cuda' |
| finetune | bool | If True, use fine-tuned GPU variants. | True |
| finetune_epochs | int | Epochs for GPU model fine-tuning. | 30 |
| finetune_lr | float | Learning rate for GPU model fine-tuning. | 1e-05 |
| n_folds | int | Cross-validation folds. | 5 |
| top_percentile | float | Top N% features for feature selection. | 10.0 |
| random_state | int | Reproducibility seed. | 42 |
| multimodal_selection | str | Feature selection strategy ('mi' or 'boruta_shap'). | 'mi' |
Returns:
| Type | Description |
|---|---|
| dict | A comprehensive results dict with keys for each strategy. |
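The leave-one-evaluator-out ablation can be illustrated with a small sketch. It deliberately swaps the trained meta-learner for a simple row-wise mean of the per-evaluator OOF probabilities, and uses a rank-based AUC, so it shows the shape of the analysis rather than the library's method. All function names and the evaluator names in the usage example are hypothetical.

```python
def rank_auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_combiner(columns):
    """Stand-in meta-learner: average the per-evaluator probabilities row by row."""
    return [sum(row) / len(row) for row in zip(*columns)]


def ablation_sketch(oof_by_evaluator, labels):
    """Return {evaluator: AUC drop when that evaluator's column is left out}.

    Requires at least two evaluators and both classes present in labels.
    """
    full = rank_auc(mean_combiner(list(oof_by_evaluator.values())), labels)
    drops = {}
    for name in oof_by_evaluator:
        rest = [col for other, col in oof_by_evaluator.items() if other != name]
        drops[name] = full - rank_auc(mean_combiner(rest), labels)
    return drops


# Usage: one informative evaluator and one uninformative evaluator.
labels = [0, 0, 1, 1]
stacking_matrix = {
    "informative": [0.1, 0.2, 0.8, 0.9],
    "flat": [0.5, 0.5, 0.5, 0.5],
}
drops = ablation_sketch(stacking_matrix, labels)
```

A large drop marks an evaluator whose signal the others cannot replace, while a drop near zero suggests its contribution is redundant in the combined model.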