Evaluation Engine API Reference

The kreview.eval_engine module contains the statistical testing functions, ML model training, visualization generators, and clinical utility computations.

For conceptual explanations, see:

`kreview.eval_engine`

`FeatureEvaluator`

Base class for all feature evaluators. Defines the extraction contract that transforms raw DuckDB queries into 1D arrays.

Subclasses must implement extract() for per-sample Python extraction. Optionally, override extract_sql() to return a DuckDB SQL query that performs full-cohort extraction in a single pass (no Python loop).

`supports_sql` `property`

True if this evaluator provides a SQL pushdown query.

`extract(df)`

Transform the loaded raw dataframe into meaningful scalar metrics. Called per sample-group or per sample.

`extract_sql()`

Return a DuckDB SQL query for full-cohort extraction, or None.

If implemented, the query should:

- Accept a read_parquet(?, ...) placeholder for file paths
- GROUP BY sample_id to produce one row per sample
- SELECT all extracted feature columns with their final names

Returning None (the default) means this evaluator does not support SQL pushdown and will use the chunked Python path.
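The contract above can be sketched as follows. This is a minimal illustration, not the library's code: the base class here is a stand-in reduced to the three documented members, `supports_sql` is assumed to be derived from whether `extract_sql()` returns a query, and the `MeanFragmentLength` evaluator and its `fragment_length` column are hypothetical. The extra `read_parquet` options elided as `...` in the text are left out.

```python
from typing import Optional


class FeatureEvaluator:
    """Stand-in for the documented base class (assumption, not the real one)."""

    def extract(self, df):
        raise NotImplementedError

    def extract_sql(self) -> Optional[str]:
        # Default: no SQL pushdown, so the chunked Python path is used.
        return None

    @property
    def supports_sql(self) -> bool:
        # Assumed rule: SQL is supported when a query string is provided.
        return self.extract_sql() is not None


class MeanFragmentLength(FeatureEvaluator):
    """Hypothetical evaluator computing the mean fragment length per sample."""

    def extract(self, df):
        # Per-sample Python path; works with a DataFrame column or a dict of lists.
        lengths = df["fragment_length"]
        return {"mean_fragment_length": sum(lengths) / len(lengths)}

    def extract_sql(self) -> str:
        # Full-cohort path: one row per sample, columns carry their final names.
        return (
            "SELECT sample_id, AVG(fragment_length) AS mean_fragment_length "
            "FROM read_parquet(?) "
            "GROUP BY sample_id"
        )
```

Keeping both paths in one class lets the caller pick the SQL route when `supports_sql` is true and fall back to `extract()` otherwise.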

`parse_array(s)`

Parse a string-encoded numeric array into a list of floats.

Handles formats like '[1.0 2.0 3.0]' from parquet serialization. Returns empty list on any parse failure (no silent corruption).
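A parser with this behavior might look like the sketch below. The function name `parse_array_sketch` is hypothetical, and accepting commas as separators is an assumption that goes beyond the space-separated format mentioned above.

```python
def parse_array_sketch(s):
    """Parse strings such as '[1.0 2.0 3.0]' into a list of floats."""
    try:
        # Drop surrounding whitespace and brackets, then split on whitespace.
        body = s.strip().strip("[]")
        # Assumption: commas are treated like spaces.
        return [float(token) for token in body.replace(",", " ").split()]
    except (AttributeError, ValueError):
        # Non-string input or a non-numeric token: return an empty list
        # instead of a partially parsed array.
        return []
```

Returning an empty list for the whole input, rather than skipping the bad token, keeps a malformed cell from producing a shorter array that looks valid.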

`univariate_auc(feature_col, y, n_folds=5, random_state=42)`

Compute cross-validated AUC for a single feature using univariate LR.

Parameters:

Name	Type	Description	Default
`feature_col`		pandas Series or array-like of a single feature.	required
`y`		binary label array (0/1).	required
`n_folds`	`int`	number of CV folds.	`5`
`random_state`	`int`	random seed.	`42`

Returns:

Type	Description
`float`	Cross-validated AUC. Returns 0.5 if the feature is constant, there are too few samples per class, or CV fails.
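One way to obtain such a score with scikit-learn is sketched below. It is an approximation under stated assumptions, not the module's implementation: the AUC is computed on pooled out-of-fold probabilities, the feature is standardized inside the pipeline, and "too few samples per class" is taken to mean fewer than `n_folds` samples in the smaller class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def univariate_auc_sketch(feature_col, y, n_folds=5, random_state=42):
    x = np.asarray(feature_col, dtype=float).reshape(-1, 1)
    y = np.asarray(y).astype(int)
    # Documented fallbacks: constant feature or too few samples per class.
    if np.nanstd(x) == 0 or np.bincount(y, minlength=2).min() < n_folds:
        return 0.5
    try:
        cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=random_state)
        model = make_pipeline(StandardScaler(), LogisticRegression())
        # Out-of-fold probability of the positive class for every sample.
        oof = cross_val_predict(model, x, y, cv=cv, method="predict_proba")[:, 1]
        return float(roc_auc_score(y, oof))
    except Exception:
        # Any CV failure (for example NaNs in the feature) maps to chance level.
        return 0.5
```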

`mutual_info_score(feature_col, y, random_state=42)`

Compute mutual information between a single feature and binary target.

Uses sklearn's mutual_info_classif with k=3 nearest neighbors to estimate the non-linear dependency between a feature and the label. Unlike AUC, mutual information captures arbitrary (non-monotonic) relationships.

Parameters:

Name	Type	Description	Default
`feature_col`		pandas Series or array-like of a single feature.	required
`y`		binary label array (0/1).	required
`random_state`	`int`	random seed for reproducibility.	`42`

Returns:

Type	Description
`float`	Mutual information score (>= 0). Higher means more informative. Returns 0.0 if the feature is constant or computation fails.
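The sketch below wraps scikit-learn's `mutual_info_classif` with the documented `n_neighbors=3` setting and fallbacks. The wrapper name is hypothetical and the constant-feature check is an assumed implementation of the behavior described above. The usage lines build a U-shaped relationship, where the label depends on the magnitude of the feature rather than its sign, which is the kind of non-monotonic dependency a rank-based metric like AUC would miss.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif


def mutual_info_sketch(feature_col, y, random_state=42):
    x = np.asarray(feature_col, dtype=float).reshape(-1, 1)
    y = np.asarray(y)
    # A constant feature carries no information about the label.
    if np.nanstd(x) == 0:
        return 0.0
    try:
        mi = mutual_info_classif(x, y, n_neighbors=3, random_state=random_state)
        return float(max(mi[0], 0.0))
    except Exception:
        return 0.0


# Synthetic, non-monotonic example: positives sit in both tails of the feature.
rng = np.random.default_rng(0)
feature = rng.uniform(-1.0, 1.0, size=400)
label = (np.abs(feature) > 0.5).astype(int)
score = mutual_info_sketch(feature, label)
```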

`set_theme(cvd_safe=False)`

Updates the global label and model color palettes. Pass cvd_safe=True to switch to a palette that is safe for color vision deficiency (CVD).

`evaluate_feature(feature_values, labels, total_fragments=None, max_vaf=None)`

Run all statistical tests for a single feature in one stratum. Outputs metrics directly to scoring dict.

`plot_violin(df, feature_col, label_col='label', title='')`

Four-group violin plot with an overlaid box plot, plus individual points for small groups.

`plot_density(df, feature_col, label_col='label', title='')`

Overlaid density curves per group, showing differences in distribution shape.

`plot_feature_vs_vaf(df, feature_col, vaf_col='max_vaf', label_col='label', title='')`

Continuous relationship between feature and tumor burden (VAF proxy).

`plot_roc_curves(y_true_dict, y_score_dict, title='')`

Overlay ROC curves for multiple comparisons.

`plot_feature_importance(importances, title='')`

Bar plot of RF feature importances.

`decision_curve_analysis(y_true, y_prob, thresholds=None)`

Compute Decision Curve Analysis (DCA) net benefit data.

For each threshold, calculates the net benefit of using the model vs treating all or treating none. This helps clinicians choose an operating threshold that balances false positives against missed detections.

Parameters:

Name	Type	Description	Default
`y_true`	`ndarray`	Binary ground truth labels (0/1).	required
`y_prob`	`ndarray`	Predicted probabilities for positive class.	required
`thresholds`	`ndarray \| None`	Array of decision thresholds to evaluate. Defaults to `np.linspace(0.01, 0.99, 99)`.	`None`

Returns:

Type	Description
`dict`	Dictionary with keys `thresholds`, `net_benefit_model`, and `net_benefit_treat_all`.
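The net benefit calculation can be sketched with the standard decision-curve formula: at threshold p_t, net benefit is TP/n minus FP/n weighted by the odds p_t / (1 - p_t), and the treat-all strategy has net benefit prevalence - (1 - prevalence) * p_t / (1 - p_t). Treating none has net benefit 0 by definition. The function name and the `>=` thresholding rule are assumptions; only the returned keys and the default threshold grid come from the reference above.

```python
import numpy as np


def decision_curve_sketch(y_true, y_prob, thresholds=None):
    y_true = np.asarray(y_true).astype(int)
    y_prob = np.asarray(y_prob, dtype=float)
    if thresholds is None:
        thresholds = np.linspace(0.01, 0.99, 99)
    thresholds = np.asarray(thresholds, dtype=float)

    n = len(y_true)
    prevalence = y_true.mean()
    nb_model, nb_all = [], []
    for pt in thresholds:
        odds = pt / (1.0 - pt)  # harm of a false positive relative to a true positive
        predicted_pos = y_prob >= pt
        tp = np.sum(predicted_pos & (y_true == 1))
        fp = np.sum(predicted_pos & (y_true == 0))
        nb_model.append(tp / n - fp / n * odds)
        # Treat-all: every positive is a TP and every negative is an FP.
        nb_all.append(prevalence - (1.0 - prevalence) * odds)

    return {
        "thresholds": thresholds,
        "net_benefit_model": np.array(nb_model),
        "net_benefit_treat_all": np.array(nb_all),
    }
```

A model is worth using at a given threshold only where its curve lies above both the treat-all curve and zero.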

`evaluate_model(model, X, y, cv, name, feature_names=None, refit=True, compute_shap=False, shap_samples=500, random_state=42)`

Evaluate any sklearn-compatible model via stratified cross-validation.

Shared primitive used by cpu_models(), gpu_models(), and multimodal_eval(). The model must implement fit() and predict_proba().

Parameters:

Name	Type	Description	Default
`model`		sklearn-compatible estimator (Pipeline, RF, XGB, TabPFN, etc.).	required
`X`	`ndarray`	Feature matrix, shape (n_samples, n_features).	required
`y`	`ndarray`	Binary labels (0/1), shape (n_samples,).	required
`cv`	`StratifiedKFold`	Pre-configured StratifiedKFold splitter.	required
`name`	`str`	Prefix for all result keys (e.g. "lr", "rf", "tabpfn").	required
`feature_names`	`list[str] \| None`	Optional feature names for importance extraction.	`None`
`refit`	`bool`	If True, refit model on full data after CV.	`True`
`compute_shap`	`bool`	If True, compute SHAP values (requires refit=True).	`False`
`shap_samples`	`int`	Max samples for SHAP computation.	`500`
`random_state`	`int`	Random seed.	`42`

Returns:

Type	Description
`tuple[dict, object \| None]`	(result_dict, fitted_model_or_None). result_dict keys are prefixed with `name` (e.g. `auc_rf`, `rf_oof_probs`, `rf_fold_aucs`).
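The cross-validation loop behind a function like this can be sketched as below. It is a reduced illustration under assumptions: only the three result keys named above are produced, the overall AUC is computed on pooled out-of-fold probabilities, and feature importances, SHAP, and confidence intervals are omitted. The synthetic data and the `"lr"` pipeline in the usage lines are made up for the example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def evaluate_model_sketch(model, X, y, cv, name, refit=True):
    oof = np.full(len(y), np.nan)
    fold_aucs = []
    for train_idx, test_idx in cv.split(X, y):
        # A fresh clone per fold, so nothing fitted on test rows leaks in.
        fold_model = clone(model).fit(X[train_idx], y[train_idx])
        oof[test_idx] = fold_model.predict_proba(X[test_idx])[:, 1]
        fold_aucs.append(float(roc_auc_score(y[test_idx], oof[test_idx])))
    results = {
        f"auc_{name}": float(roc_auc_score(y, oof)),
        f"{name}_oof_probs": oof,
        f"{name}_fold_aucs": fold_aucs,
    }
    fitted = clone(model).fit(X, y) if refit else None
    return results, fitted


# Usage on synthetic data (hypothetical values).
rng = np.random.default_rng(0)
y = np.array([0] * 40 + [1] * 40)
X = rng.normal(size=(80, 3)) + y[:, None]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = make_pipeline(StandardScaler(), LogisticRegression(class_weight="balanced"))
results, fitted = evaluate_model_sketch(model, X, y, cv, "lr")
```

Putting the scaler inside the pipeline means it is refit on each training fold, which is the leakage concern a scaler-plus-LR pipeline addresses.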

`cpu_models(X, y, feature_names=None, cancer_types=None, assays=None, n_folds=5, random_state=42, compute_shap=False, shap_samples=500)`

Train LR, RF, and XGB on a feature matrix with stratified CV.

Delegates per-model evaluation to evaluate_model(), then adds cross-model diagnostics (AUC deltas, top features, threshold sweep, DCA, feature stability, subgroup analysis).

Returns (results_dict, lr_pipeline, rf_model, xgb_model).

Audit fixes preserved

C-01: LR uses Pipeline(scaler+lr) to prevent data leakage
C-02: Subgroup metrics use out-of-fold predictions (unbiased)
H-01: LR has class_weight="balanced", XGB has scale_pos_weight
M-02: Bootstrap 95% CI on AUC values

`gpu_models(X, y, feature_names=None, cancer_types=None, assays=None, n_folds=5, random_state=42, models=('tabpfn',), device='cuda', finetune=True, finetune_epochs=30, finetune_lr=1e-05, compute_shap=False, shap_samples=500)`

Train GPU foundation models (TabPFN, TabICL) on a feature matrix.

Same output schema as cpu_models() for each model, using the shared evaluate_model() primitive. Fine-tuning is ON by default.

Parameters:

Name	Type	Description	Default
`X`	`ndarray`	Feature matrix, shape (n_samples, n_features).	required
`y`	`ndarray`	Binary labels (0/1).	required
`feature_names`	`list[str] \| None`	Optional feature names.	`None`
`cancer_types`	`ndarray \| None`	Optional cancer type array for subgroup analysis.	`None`
`assays`	`ndarray \| None`	Optional assay array for subgroup analysis.	`None`
`n_folds`	`int`	Number of CV folds.	`5`
`random_state`	`int`	Random seed.	`42`
`models`	`tuple[str, ...]`	Tuple of GPU model names ("tabpfn", "tabicl").	`('tabpfn',)`
`device`	`str`	PyTorch device ("cuda", "cpu").	`'cuda'`
`finetune`	`bool`	If True (default), use fine-tuned variants.	`True`
`finetune_epochs`	`int`	Epochs for fine-tuning.	`30`
`finetune_lr`	`float`	Learning rate for fine-tuning.	`1e-05`
`compute_shap`	`bool`	If True, compute SHAP values.	`False`
`shap_samples`	`int`	Max samples for SHAP computation.	`500`

Returns:

Type	Description
`tuple[dict, dict[str, object]]`	(results_dict, fitted_models_dict). fitted_models_dict maps model name to fitted model object.

`load_model_results(directory, evaluator_name)`

Load and merge CPU + GPU model results for a single evaluator.

Looks for {evaluator_name}_model_results.json (CPU) and {evaluator_name}_gpu_model_results.json (GPU). If both exist, GPU keys are merged into the CPU dict: GPU values take precedence for overlapping model keys, while metadata keys such as 'evaluator' are kept from the CPU file.

Used by report templates where the evaluator name is known.

Parameters:

Name	Type	Description	Default
`directory`	`Path`	Directory containing the JSON result files.	required
`evaluator_name`	`str`	Evaluator name (e.g., `"FSCOnTarget"`).	required

Returns:

Type	Description
`dict \| None`	Merged dict, or None if no JSON exists for this evaluator.
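The merge rule can be sketched as follows. The function name, the `METADATA_KEYS` set, and the behavior when only the GPU file exists (returning it as-is) are assumptions; the file names and precedence rule follow the description above.

```python
import json
from pathlib import Path

# Assumption: 'evaluator' is the only metadata key that must come from the CPU file.
METADATA_KEYS = {"evaluator"}


def load_model_results_sketch(directory, evaluator_name):
    directory = Path(directory)
    cpu_path = directory / f"{evaluator_name}_model_results.json"
    gpu_path = directory / f"{evaluator_name}_gpu_model_results.json"
    cpu = json.loads(cpu_path.read_text()) if cpu_path.exists() else None
    gpu = json.loads(gpu_path.read_text()) if gpu_path.exists() else None

    if cpu is None and gpu is None:
        return None
    if cpu is None:
        return gpu
    if gpu is None:
        return cpu

    merged = dict(cpu)
    for key, value in gpu.items():
        if key in METADATA_KEYS and key in cpu:
            continue  # keep the CPU metadata
        merged[key] = value  # GPU wins for model keys
    return merged
```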

`load_all_model_results(directory)`

Load and merge all CPU + GPU model results from a directory.

Scans for *_model_results.json and *_gpu_model_results.json, groups by evaluator name, and merges GPU keys into CPU dicts.

Used by scoreboard and multimodal engine for directory scanning.

Parameters:

Name	Type	Description	Default
`directory`	`Path`	Directory containing model result JSON files.	required

Returns:

Type	Description
`dict[str, dict]`	Dict mapping each evaluator name to its merged model results dict. Empty dict if no results are found.
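The directory scan has one subtlety worth illustrating: every `*_gpu_model_results.json` file also matches `*_model_results.json`, so the GPU suffix has to be checked first when recovering the evaluator name. The sketch below covers only the grouping step; the helper name and the returned structure are hypothetical.

```python
from pathlib import Path

CPU_SUFFIX = "_model_results.json"
GPU_SUFFIX = "_gpu_model_results.json"


def group_result_files(directory):
    """Map evaluator name -> {'cpu': path, 'gpu': path} for the files found."""
    groups = {}
    for path in sorted(Path(directory).glob("*_model_results.json")):
        fname = path.name
        # GPU file names also end with the CPU suffix, so test the GPU suffix first.
        if fname.endswith(GPU_SUFFIX):
            kind, evaluator = "gpu", fname[: -len(GPU_SUFFIX)]
        else:
            kind, evaluator = "cpu", fname[: -len(CPU_SUFFIX)]
        groups.setdefault(evaluator, {})[kind] = path
    return groups
```

Under this naming scheme an evaluator whose own name ends in `_gpu` would be ambiguous, so the sketch assumes no evaluator is named that way.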

`multimodal_eval(results_dir, super_matrix_path=None, *, models=('rf', 'xgb'), gpu_models=(), device='cuda', finetune=True, finetune_epochs=30, finetune_lr=1e-05, n_folds=5, top_percentile=10.0, random_state=42, multimodal_selection='mi')`

Cross-evaluator multimodal evaluation.

Implements three complementary strategies:

Stacking: Meta-learner trained on OOF predictions from all per-evaluator models. Each column is one evaluator's OOF probability. This measures how much combining multiple fragmentomics signals improves classification.
Raw features (optional): If super_matrix_path is provided, trains directly on the fused feature matrix with multimodal_selection-based feature selection (MI or Boruta-SHAP).
Ablation: Leave-one-evaluator-out analysis on the stacking matrix, showing each evaluator's marginal contribution.
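The stacking and ablation strategies can be sketched together: each evaluator contributes one column of out-of-fold probabilities, a meta-learner is cross-validated on that matrix, and each column is dropped in turn to measure the change in AUC. Using logistic regression as the meta-learner is an assumption made for brevity, and the evaluator names and synthetic probabilities in the usage lines are invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def stacked_auc(oof_matrix, y, n_folds=5, random_state=42):
    """Cross-validated AUC of a meta-learner trained on per-evaluator OOF columns."""
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=random_state)
    probs = cross_val_predict(
        LogisticRegression(), oof_matrix, y, cv=cv, method="predict_proba"
    )[:, 1]
    return float(roc_auc_score(y, probs))


def leave_one_out_ablation(oof_by_evaluator, y):
    """Return the full stacked AUC and the AUC lost when each evaluator is removed."""
    names = list(oof_by_evaluator)
    full = np.column_stack([oof_by_evaluator[n] for n in names])
    full_auc = stacked_auc(full, y)
    deltas = {}
    for i, evaluator in enumerate(names):
        reduced = np.delete(full, i, axis=1)  # drop one evaluator's column
        deltas[evaluator] = full_auc - stacked_auc(reduced, y)
    return full_auc, deltas


# Usage with synthetic OOF probabilities: one informative evaluator, two noise ones.
rng = np.random.default_rng(0)
y = np.array([0] * 100 + [1] * 100)
oof = {
    "strong": np.clip(0.5 + 0.3 * (2 * y - 1) + rng.normal(0, 0.1, 200), 0, 1),
    "noise_a": rng.uniform(0, 1, 200),
    "noise_b": rng.uniform(0, 1, 200),
}
full_auc, deltas = leave_one_out_ablation(oof, y)
```

A large positive delta marks an evaluator whose signal the others cannot replace; a delta near zero suggests redundancy or noise.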

Parameters:

Name	Type	Description	Default
`results_dir`	`str \| Path`	Directory with `*_model_results.json` files.	required
`super_matrix_path`	`str \| Path \| None`	Optional path to `super_matrix.parquet`.	`None`
`models`	`tuple[str, ...]`	CPU model names to use (`rf`, `xgb`, `lr`).	`('rf', 'xgb')`
`gpu_models`	`tuple[str, ...]`	GPU model names (`tabpfn`, `tabicl`). Empty = CPU only.	`()`
`device`	`str`	PyTorch device string for GPU models.	`'cuda'`
`finetune`	`bool`	If True, use fine-tuned GPU variants.	`True`
`finetune_epochs`	`int`	Epochs for GPU model fine-tuning.	`30`
`finetune_lr`	`float`	Learning rate for GPU model fine-tuning.	`1e-05`
`n_folds`	`int`	Cross-validation folds.	`5`
`top_percentile`	`float`	Top N% features for feature selection.	`10.0`
`random_state`	`int`	Reproducibility seed.	`42`
`multimodal_selection`	`str`	Feature selection strategy ('mi' or 'boruta_shap').	`'mi'`

Returns:

Type	Description
`dict`	A comprehensive results dict with keys for each strategy.
