Evaluation Engine API Reference

The kreview.eval_engine module contains the statistical testing functions, ML model training routines, visualization generators, and clinical-utility computations.

For conceptual explanations, see:


kreview.eval_engine

FeatureEvaluator

Base class for all feature evaluators. Defines the extraction contract that transforms raw DuckDB queries into 1D arrays.

extract(df)

Transform the loaded raw dataframe into meaningful scalar metrics. Called per sample-group or per sample.
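A minimal subclass can illustrate this contract. Everything below other than the `extract(df)` signature is an illustrative assumption: the `FragmentLengthEvaluator` name, the `sample_id`/`length` columns, and the median metric are not part of the library.

```python
import numpy as np
import pandas as pd


class FeatureEvaluator:
    """Base contract: raw rows in, 1D array of scalar metrics out."""

    def extract(self, df: pd.DataFrame) -> np.ndarray:
        raise NotImplementedError


class FragmentLengthEvaluator(FeatureEvaluator):
    """Hypothetical evaluator: one scalar per sample group."""

    def extract(self, df: pd.DataFrame) -> np.ndarray:
        # Reduce each sample's rows to a single scalar metric
        # (here, the median fragment length).
        return df.groupby("sample_id")["length"].median().to_numpy()
```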

parse_array(s)

Parse a string-encoded numeric array into a list of floats.

Handles formats like '[1.0 2.0 3.0]' from parquet serialization. Returns an empty list on any parse failure (no silent corruption).
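Based only on the documented behavior (space-separated values inside brackets, empty list on any failure), a sketch implementation could look like this; it is not the library source:

```python
def parse_array(s):
    """Parse a string-encoded numeric array like '[1.0 2.0 3.0]'.

    Sketch of the documented contract: returns [] on any parse
    failure rather than a partially parsed list.
    """
    try:
        # Strip the surrounding brackets, then split on whitespace.
        return [float(tok) for tok in str(s).strip("[]").split()]
    except (ValueError, AttributeError):
        return []
```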

univariate_auc(feature_col, y, n_folds=5, random_state=42)

Compute cross-validated AUC for a single feature using univariate logistic regression (LR).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `feature_col` | | pandas Series or array-like of a single feature. | *required* |
| `y` | | Binary label array (0/1). | *required* |
| `n_folds` | `int` | Number of CV folds. | `5` |
| `random_state` | `int` | Random seed. | `42` |

Returns:

| Type | Description |
| --- | --- |
| `float` | Cross-validated AUC. Returns 0.5 if the feature is constant, there are too few samples per class, or CV fails. |
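The documented fallbacks can be sketched without the full CV loop. The rank-based (Mann-Whitney) AUC below is a stand-in for the cross-validated LR AUC, and the "too few samples per class" guard is assumed to compare against `n_folds`; treat this as an approximation, not the library implementation.

```python
import numpy as np


def univariate_auc_sketch(feature_col, y, n_folds=5, random_state=42):
    """Sketch of univariate_auc's documented guards and output range."""
    x = np.asarray(feature_col, dtype=float)
    y = np.asarray(y)
    # Documented fallbacks: constant feature, or too few samples per class.
    if np.all(x == x[0]) or min((y == 0).sum(), (y == 1).sum()) < n_folds:
        return 0.5
    pos, neg = x[y == 1], x[y == 0]
    # Rank AUC = P(pos > neg) + 0.5 * P(tie).
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties
```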

set_theme(cvd_safe=False)

Dynamically updates the global label and model colors based on the color-vision-deficiency (CVD) preference.

evaluate_feature(feature_values, labels, total_fragments=None, max_vaf=None)

Run all statistical tests for a single feature in one stratum and write the resulting metrics directly into the scoring dict.

plot_violin(df, feature_col, label_col='label', title='')

4-group violin with overlaid box plot and individual points for small groups.

plot_density(df, feature_col, label_col='label', title='')

Overlaid density curves per group — shows distribution shape differences.

plot_feature_vs_vaf(df, feature_col, vaf_col='max_vaf', label_col='label', title='')

Continuous relationship between feature and tumor burden (VAF proxy).

plot_roc_curves(y_true_dict, y_score_dict, title='')

Overlay ROC curves for multiple comparisons.

plot_feature_importance(importances, title='')

Bar plot of RF feature importances.

plot_threshold_sensitivity(results_df, title='')

Show how label counts shift with VAF/min_variants thresholds.

decision_curve_analysis(y_true, y_prob, thresholds=None)

Compute Decision Curve Analysis (DCA) net benefit data.

For each threshold, calculates the net benefit of using the model vs treating all or treating none. This helps clinicians choose an operating threshold that balances false positives against missed detections.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `y_true` | `ndarray` | Binary ground truth labels (0/1). | *required* |
| `y_prob` | `ndarray` | Predicted probabilities for the positive class. | *required* |
| `thresholds` | `ndarray \| None` | Array of decision thresholds to evaluate. Defaults to `np.linspace(0.01, 0.99, 99)`. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `dict` | Dictionary with keys `thresholds`, `net_benefit_model`, and `net_benefit_treat_all`. |
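Assuming the standard net-benefit formula NB(t) = TP/N - FP/N * t/(1 - t), the documented return contract can be sketched as follows (the real function may differ in details such as vectorization):

```python
import numpy as np


def decision_curve_analysis(y_true, y_prob, thresholds=None):
    """Sketch of DCA net benefit under the standard formula."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    if thresholds is None:
        thresholds = np.linspace(0.01, 0.99, 99)
    n = len(y_true)
    prevalence = y_true.mean()
    nb_model, nb_all = [], []
    for t in thresholds:
        pred = y_prob >= t
        tp = np.sum(pred & (y_true == 1)) / n
        fp = np.sum(pred & (y_true == 0)) / n
        odds = t / (1.0 - t)  # weight on false positives at threshold t
        nb_model.append(tp - fp * odds)
        # Treat-all baseline: TP rate = prevalence, FP rate = 1 - prevalence.
        nb_all.append(prevalence - (1.0 - prevalence) * odds)
    return {
        "thresholds": np.asarray(thresholds),
        "net_benefit_model": np.asarray(nb_model),
        "net_benefit_treat_all": np.asarray(nb_all),
    }
```

A model is clinically useful at threshold t when its net benefit exceeds both the treat-all curve and zero (treat none).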

single_feature_model(X, y, feature_names=None, cancer_types=None, assays=None, n_folds=5, random_state=42)

Train logistic regression (LR), random forest (RF), and XGBoost (XGB) models on a feature matrix with stratified CV.

Returns (results_dict, lr_pipeline, rf_model, xgb_model).
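One of the audit fixes, M-02 (bootstrap 95% CI on AUC), can be sketched in isolation. The function name, the percentile-bootstrap approach, and the rank-based AUC below are illustrative assumptions; the library's own CV and model-fitting code is omitted.

```python
import numpy as np


def bootstrap_auc_ci(y_true, y_score, n_boot=1000, seed=42):
    """Percentile bootstrap 95% CI for AUC (sketch of fix M-02)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    rng = np.random.default_rng(seed)

    def auc(y, s):
        pos, neg = s[y == 1], s[y == 0]
        if len(pos) == 0 or len(neg) == 0:
            return np.nan  # resample lost a class; skip it
        return ((pos[:, None] > neg[None, :]).mean()
                + 0.5 * (pos[:, None] == neg[None, :]).mean())

    stats = []
    for _ in range(n_boot):
        # Resample with replacement, keeping (label, score) pairs together.
        idx = rng.integers(0, len(y_true), len(y_true))
        a = auc(y_true[idx], y_score[idx])
        if not np.isnan(a):
            stats.append(a)
    if not stats:
        return np.array([np.nan, np.nan])
    return np.percentile(stats, [2.5, 97.5])
```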

Fixes applied (audit v3): - C-01: LR uses Pipeline(scaler+lr) to prevent data leakage - C-02: Subgroup metrics use out-of-fold predictions (unbiased) - H-01: LR has class_weight="balanced", XGB has scale_pos_weight - H-07: Bare except replaced with Exception - M-02: Bootstrap 95% CI on AUC values