Evaluation Engine API Reference

The kreview.eval_engine module contains the statistical testing functions, ML model training, visualization generators, and clinical utility computations.


kreview.eval_engine

FeatureEvaluator

Base class for all feature evaluators. Defines the extraction contract that transforms raw DuckDB query results into 1D arrays.

Subclasses must implement extract() for per-sample Python extraction. Optionally, override extract_sql() to return a DuckDB SQL query that performs full-cohort extraction in a single pass (no Python loop).

supports_sql property

True if this evaluator provides a SQL pushdown query.

extract(df)

Transform the loaded raw dataframe into meaningful scalar metrics. Called per sample-group or per sample.

extract_sql()

Return a DuckDB SQL query for full-cohort extraction, or None.

If implemented, the query should:
  • Accept a read_parquet(?, ...) placeholder for file paths
  • GROUP BY sample_id to produce one row per sample
  • SELECT all extracted feature columns with their final names

Returning None (the default) means this evaluator does not support SQL pushdown and will use the chunked Python path.
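The contract above can be sketched as follows. The base class here is a stand-in written for this example, not the real kreview.eval_engine.FeatureEvaluator; the subclass name, the fragment_length column, and the rule used to derive supports_sql are all assumptions made for illustration.

```python
from statistics import mean


class FeatureEvaluator:
    """Stand-in for the documented base class (assumption, not the real code)."""

    @property
    def supports_sql(self):
        # Assumed rule: pushdown is available when extract_sql() returns a query.
        return self.extract_sql() is not None

    def extract(self, df):
        raise NotImplementedError

    def extract_sql(self):
        # Default: no SQL pushdown, so the chunked Python path would be used.
        return None


class MeanFragmentLength(FeatureEvaluator):
    """Hypothetical evaluator: mean fragment length per sample."""

    def extract(self, df):
        # Accepts any column-addressable table, e.g. a DataFrame or a dict of lists.
        return {"mean_fragment_length": mean(df["fragment_length"])}

    def extract_sql(self):
        # One row per sample, with the feature column under its final name.
        return (
            "SELECT sample_id, AVG(fragment_length) AS mean_fragment_length "
            "FROM read_parquet(?) "
            "GROUP BY sample_id"
        )


evaluator = MeanFragmentLength()
print(evaluator.supports_sql)
print(evaluator.extract({"fragment_length": [150, 160, 170]}))
```

An evaluator that leaves extract_sql() at its default keeps supports_sql False in this sketch, which corresponds to the chunked Python path described above.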

parse_array(s)

Parse a string-encoded numeric array into a list of floats.

Handles formats like '[1.0 2.0 3.0]' from parquet serialization. Returns empty list on any parse failure (no silent corruption).
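A minimal sketch of the documented behaviour, assuming whitespace-separated values inside square brackets. The function name parse_array_sketch and the tolerance for commas are assumptions; the real parse_array may accept other formats.

```python
def parse_array_sketch(s):
    """Parse strings like '[1.0 2.0 3.0]' into a list of floats (illustrative only)."""
    try:
        inner = s.strip().strip("[]")
        # Assumption: commas are treated like whitespace separators.
        return [float(tok) for tok in inner.replace(",", " ").split()]
    except (AttributeError, ValueError):
        # Non-string input or a non-numeric token: report failure as an empty list.
        return []


print(parse_array_sketch("[1.0 2.0 3.0]"))   # [1.0, 2.0, 3.0]
print(parse_array_sketch("[1.0 oops 3.0]"))  # []
print(parse_array_sketch(None))              # []
```

Returning an empty list for the whole input, rather than dropping only the bad token, keeps a partially parsed array from being mistaken for a complete one.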

univariate_auc(feature_col, y, n_folds=5, random_state=42)

Compute cross-validated AUC for a single feature using univariate LR.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| feature_col | | pandas Series or array-like of a single feature. | required |
| y | | Binary label array (0/1). | required |
| n_folds | int | Number of CV folds. | 5 |
| random_state | int | Random seed. | 42 |

Returns:

| Type | Description |
|---|---|
| float | Cross-validated AUC (float). Returns 0.5 if the feature is constant, there are too few samples per class, or CV fails. |
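To make the 0.5 fallback concrete, the sketch below computes plain AUC from its pairwise definition: the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half. It does not reproduce the logistic regression or cross-validation steps of univariate_auc, and pairwise_auc is a hypothetical helper name.

```python
def pairwise_auc(values, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    if not pos or not neg:
        # AUC is undefined without both classes; use the chance-level value.
        return 0.5
    # A correctly ordered pair counts 1, a tie counts 0.5, a wrong order counts 0.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
print(pairwise_auc([0.1, 0.4, 0.35, 0.8], labels))  # 0.75
print(pairwise_auc([1.0, 1.0, 1.0, 1.0], labels))   # 0.5
```

A constant feature ties every pair, so its AUC is exactly 0.5, the same chance-level value the function returns for its failure cases.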

mutual_info_score(feature_col, y, random_state=42)

Compute mutual information between a single feature and binary target.

Uses scikit-learn's mutual_info_classif with 3 nearest neighbors (n_neighbors=3) to estimate the dependency between a feature and the label, including non-linear dependency. Unlike AUC, mutual information captures arbitrary (non-monotonic) relationships.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| feature_col | | pandas Series or array-like of a single feature. | required |
| y | | Binary label array (0/1). | required |
| random_state | int | Random seed for reproducibility. | 42 |

Returns:

| Type | Description |
|---|---|
| float | Mutual information score (float, >= 0). Higher means more informative. Returns 0.0 if the feature is constant or computation fails. |

set_theme(cvd_safe=False)

Dynamically updates the global label and model colors based on the CVD (color vision deficiency) preference.

evaluate_feature(feature_values, labels, total_fragments=None, max_vaf=None)

Run all statistical tests for a single feature in one stratum. Outputs metrics directly to scoring dict.

plot_violin(df, feature_col, label_col='label', title='')

4-group violin with overlaid box plot and individual points for small groups.

plot_density(df, feature_col, label_col='label', title='')

Overlaid density curves per group, showing differences in distribution shape.

plot_feature_vs_vaf(df, feature_col, vaf_col='max_vaf', label_col='label', title='')

Continuous relationship between feature and tumor burden (VAF proxy).

plot_roc_curves(y_true_dict, y_score_dict, title='')

Overlay ROC curves for multiple comparisons.

plot_feature_importance(importances, title='')

Bar plot of RF feature importances.

decision_curve_analysis(y_true, y_prob, thresholds=None)

Compute Decision Curve Analysis (DCA) net benefit data.

For each threshold, calculates the net benefit of using the model vs treating all or treating none. This helps clinicians choose an operating threshold that balances false positives against missed detections.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| y_true | ndarray | Binary ground truth labels (0/1). | required |
| y_prob | ndarray | Predicted probabilities for positive class. | required |
| thresholds | ndarray \| None | Array of decision thresholds to evaluate. Defaults to np.linspace(0.01, 0.99, 99). | None |

Returns:

| Type | Description |
|---|---|
| dict | Dictionary with keys thresholds, net_benefit_model, and net_benefit_treat_all. |
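The net benefit calculation can be sketched in pure Python, assuming the function follows the standard decision-curve definition: at threshold t, net benefit is TP/N minus FP/N weighted by t / (1 - t), and "treat all" classifies every sample as positive. The helper name net_benefit_sketch and the p >= t decision rule are assumptions; the actual implementation may differ in detail.

```python
def net_benefit_sketch(y_true, y_prob, thresholds):
    """Net benefit of the model and of treating everyone, per threshold."""
    n = len(y_true)
    prevalence = sum(y_true) / n
    model, treat_all = [], []
    for t in thresholds:
        odds = t / (1.0 - t)  # weight of a false positive relative to a true positive
        tp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 1)
        fp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 0)
        model.append(tp / n - fp / n * odds)
        # Treating everyone makes all positives TP and all negatives FP.
        treat_all.append(prevalence - (1.0 - prevalence) * odds)
    return {
        "thresholds": list(thresholds),
        "net_benefit_model": model,
        "net_benefit_treat_all": treat_all,
    }


# Dummy labels and probabilities, for illustration only.
y_true = [1, 1, 0, 0]
y_prob = [0.9, 0.6, 0.4, 0.2]
curves = net_benefit_sketch(y_true, y_prob, [0.25, 0.5])
print(curves["net_benefit_model"])
print(curves["net_benefit_treat_all"])
```

Treating none has a net benefit of 0 at every threshold, so a model is useful at a given threshold only if its curve lies above both 0 and the treat-all curve.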

evaluate_model(model, X, y, cv, name, feature_names=None, refit=True, compute_shap=False, shap_samples=500, random_state=42)

Evaluate any sklearn-compatible model via stratified cross-validation.

Shared primitive used by cpu_models(), gpu_models(), and multimodal_eval(). The model must implement fit() and predict_proba().

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model | | sklearn-compatible estimator (Pipeline, RF, XGB, TabPFN, etc.). | required |
| X | ndarray | Feature matrix, shape (n_samples, n_features). | required |
| y | ndarray | Binary labels (0/1), shape (n_samples,). | required |
| cv | StratifiedKFold | Pre-configured StratifiedKFold splitter. | required |
| name | str | Prefix for all result keys (e.g. "lr", "rf", "tabpfn"). | required |
| feature_names | list[str] \| None | Optional feature names for importance extraction. | None |
| refit | bool | If True, refit model on full data after CV. | True |
| compute_shap | bool | If True, compute SHAP values (requires refit=True). | False |
| shap_samples | int | Max samples for SHAP computation. | 500 |

Returns:

| Type | Description |
|---|---|
| tuple[dict, object \| None] | (result_dict, fitted_model_or_None). result_dict keys incorporate name (e.g. auc_rf, rf_oof_probs, rf_fold_aucs). |

cpu_models(X, y, feature_names=None, cancer_types=None, assays=None, n_folds=5, random_state=42, compute_shap=False, shap_samples=500)

Train LR, RF, and XGB on a feature matrix with stratified CV.

Delegates per-model evaluation to evaluate_model(), then adds cross-model diagnostics (AUC deltas, top features, threshold sweep, DCA, feature stability, subgroup analysis).

Returns (results_dict, lr_pipeline, rf_model, xgb_model).

Audit fixes preserved
  • C-01: LR uses Pipeline(scaler+lr) to prevent data leakage
  • C-02: Subgroup metrics use out-of-fold predictions (unbiased)
  • H-01: LR has class_weight="balanced", XGB has scale_pos_weight
  • M-02: Bootstrap 95% CI on AUC values

gpu_models(X, y, feature_names=None, cancer_types=None, assays=None, n_folds=5, random_state=42, models=('tabpfn',), device='cuda', finetune=True, finetune_epochs=30, finetune_lr=1e-05, compute_shap=False, shap_samples=500)

Train GPU foundation models (TabPFN, TabICL) on a feature matrix.

Same output schema as cpu_models() for each model, using the shared evaluate_model() primitive. Fine-tuning is ON by default.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| X | ndarray | Feature matrix, shape (n_samples, n_features). | required |
| y | ndarray | Binary labels (0/1). | required |
| feature_names | list[str] \| None | Optional feature names. | None |
| cancer_types | ndarray \| None | Optional cancer type array for subgroup analysis. | None |
| assays | ndarray \| None | Optional assay array for subgroup analysis. | None |
| n_folds | int | Number of CV folds. | 5 |
| random_state | int | Random seed. | 42 |
| models | tuple[str, ...] | Tuple of GPU model names ("tabpfn", "tabicl"). | ('tabpfn',) |
| device | str | PyTorch device ("cuda", "cpu"). | 'cuda' |
| finetune | bool | If True (default), use fine-tuned variants. | True |
| finetune_epochs | int | Epochs for fine-tuning. | 30 |
| finetune_lr | float | Learning rate for fine-tuning. | 1e-05 |
| compute_shap | bool | If True, compute SHAP values. | False |
| shap_samples | int | Max samples for SHAP computation. | 500 |

Returns:

| Type | Description |
|---|---|
| tuple[dict, dict[str, object]] | (results_dict, fitted_models_dict). fitted_models_dict maps model name to fitted model object. |

load_model_results(directory, evaluator_name)

Load and merge CPU + GPU model results for a single evaluator.

Looks for {evaluator_name}_model_results.json (CPU) and {evaluator_name}_gpu_model_results.json (GPU). If both exist, GPU keys are merged into the CPU dict: GPU values take precedence for overlapping model keys, while metadata keys such as 'evaluator' are kept from the CPU file.

Used by report templates where the evaluator name is known.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| directory | Path | Directory containing the JSON result files. | required |
| evaluator_name | str | Evaluator name (e.g., "FSCOnTarget"). | required |

Returns:

| Type | Description |
|---|---|
| dict \| None | Merged dict, or None if no JSON exists for this evaluator. |
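The merge rule can be sketched with the standard library alone. This is an illustration under assumptions, not the library's code: the set of metadata keys beyond 'evaluator' is unknown, the behaviour when only the GPU file exists is a guess, and the key auc_tabpfn and all numeric values are dummy data.

```python
import json
import tempfile
from pathlib import Path

# Assumption: only 'evaluator' is known to be a metadata key.
METADATA_KEYS = {"evaluator"}


def load_model_results_sketch(directory, evaluator_name):
    directory = Path(directory)
    cpu_path = directory / f"{evaluator_name}_model_results.json"
    gpu_path = directory / f"{evaluator_name}_gpu_model_results.json"
    cpu = json.loads(cpu_path.read_text()) if cpu_path.exists() else None
    gpu = json.loads(gpu_path.read_text()) if gpu_path.exists() else None
    if cpu is None and gpu is None:
        return None
    if cpu is None or gpu is None:
        # Assumption: a single available file is returned unchanged.
        return cpu if cpu is not None else gpu
    merged = dict(cpu)
    for key, value in gpu.items():
        if key in METADATA_KEYS and key in cpu:
            continue  # metadata stays as recorded by the CPU run
        merged[key] = value  # GPU value wins for overlapping model keys
    return merged


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "FSCOnTarget_model_results.json").write_text(
        json.dumps({"evaluator": "FSCOnTarget", "auc_rf": 0.81}))
    (tmp / "FSCOnTarget_gpu_model_results.json").write_text(
        json.dumps({"evaluator": "FSCOnTarget (gpu)", "auc_tabpfn": 0.84}))
    merged = load_model_results_sketch(tmp, "FSCOnTarget")
    missing = load_model_results_sketch(tmp, "NoSuchEvaluator")

print(merged)
print(missing)
```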

load_all_model_results(directory)

Load and merge all CPU + GPU model results from a directory.

Scans for *_model_results.json and *_gpu_model_results.json, groups by evaluator name, and merges GPU keys into CPU dicts.

Used by scoreboard and multimodal engine for directory scanning.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| directory | Path | Directory containing model result JSON files. | required |

Returns:

| Type | Description |
|---|---|
| dict[str, dict] | Dict mapping each evaluator name to its merged model results dict. Empty dict if no results are found. |

multimodal_eval(results_dir, super_matrix_path=None, *, models=('rf', 'xgb'), gpu_models=(), device='cuda', finetune=True, finetune_epochs=30, finetune_lr=1e-05, n_folds=5, top_percentile=10.0, random_state=42, multimodal_selection='mi')

Cross-evaluator multimodal evaluation.

Implements three complementary strategies:

  1. Stacking: Meta-learner trained on OOF predictions from all per-evaluator models. Each column is one evaluator's OOF probability. This measures how much combining multiple fragmentomics signals improves classification.

  2. Raw features (optional): If super_matrix_path is provided, trains directly on the fused feature matrix with multimodal_selection-based feature selection (MI or Boruta-SHAP).

  3. Ablation: Leave-one-evaluator-out analysis on the stacking matrix, showing each evaluator's marginal contribution.
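The stacking and ablation strategies both operate on a matrix with one column per evaluator. The sketch below shows how such a matrix and its leave-one-evaluator-out variants could be assembled. The input layout (a dict of out-of-fold probability lists), the helper names, and the evaluator name EvaluatorB are hypothetical, and the meta-learner itself is omitted.

```python
def build_stacking_matrix(oof_by_evaluator):
    """Return (column_names, rows): one column per evaluator, one row per sample."""
    names = sorted(oof_by_evaluator)
    lengths = {len(oof_by_evaluator[name]) for name in names}
    if len(lengths) != 1:
        raise ValueError("all evaluators must cover the same samples")
    rows = [list(row) for row in zip(*(oof_by_evaluator[name] for name in names))]
    return names, rows


def leave_one_out_matrices(names, rows):
    """Yield (dropped_name, reduced_rows) with one evaluator column removed."""
    for i, dropped in enumerate(names):
        yield dropped, [row[:i] + row[i + 1:] for row in rows]


# Dummy out-of-fold probabilities for three samples (illustrative values only).
oof = {
    "FSCOnTarget": [0.9, 0.2, 0.6],
    "EvaluatorB": [0.8, 0.3, 0.4],  # hypothetical evaluator name
}
names, rows = build_stacking_matrix(oof)
ablations = dict(leave_one_out_matrices(names, rows))
print(names)
print(rows)
print(ablations)
```

In an ablation, the meta-learner would be re-scored on each reduced matrix; the drop relative to the full-matrix score is that evaluator's marginal contribution.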

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| results_dir | str \| Path | Directory with *_model_results.json files. | required |
| super_matrix_path | str \| Path \| None | Optional path to super_matrix.parquet. | None |
| models | tuple[str, ...] | CPU model names to use (rf, xgb, lr). | ('rf', 'xgb') |
| gpu_models | tuple[str, ...] | GPU model names (tabpfn, tabicl). Empty = CPU only. | () |
| device | str | PyTorch device string for GPU models. | 'cuda' |
| finetune | bool | If True, use fine-tuned GPU variants. | True |
| finetune_epochs | int | Epochs for GPU model fine-tuning. | 30 |
| finetune_lr | float | Learning rate for GPU model fine-tuning. | 1e-05 |
| n_folds | int | Cross-validation folds. | 5 |
| top_percentile | float | Top N% features for feature selection. | 10.0 |
| random_state | int | Reproducibility seed. | 42 |
| multimodal_selection | str | Feature selection strategy ('mi' or 'boruta_shap'). | 'mi' |

Returns:

| Type | Description |
|---|---|
| dict | A comprehensive results dict with keys for each strategy. |