Running the Pipeline
kreview is driven by a modular Typer CLI that ties the independent pipeline stages together.
Available Commands
kreview provides a modular CLI where each pipeline stage can run independently or together via kreview run:
| Command | Purpose | Pipeline Order |
|---|---|---|
| kreview label | Generate ctDNA labels only | 1 |
| kreview extract | Label + extract feature matrices per evaluator | 2 |
| kreview select | Score features (AUC/MI) + mRMR or hybrid union selection | 3 |
| kreview eval cpu | CPU model evaluation (LR, RF, XGB) | 4a (parallel) |
| kreview eval gpu | GPU model evaluation (TabPFN, TabICL) | 4b (parallel) |
| kreview fuse | Fuse per-evaluator matrices → super-matrix | 4c (parallel) |
| kreview eval multimodal | Cross-evaluator stacking + ablation | 5 (needs 4a+4b+4c) |
| kreview report | Re-generate HTML dashboards | 6 |
| kreview run | Full pipeline: all of the above in sequence | n/a |
| kreview features-list | List registered evaluators | n/a |
Steps 4a, 4b, and 4c are independent
After feature selection, eval cpu, eval gpu, and fuse can run in parallel. They converge at eval multimodal, which needs the out-of-fold (OOF) predictions from the eval stages and the super-matrix from fuse.
Basic Execution
Disable Python Buffering
When running under nohup, through a pipe, or under SLURM, set PYTHONUNBUFFERED=1 so that the structlog progress output streams in real time instead of arriving in buffered chunks.
PYTHONUNBUFFERED=1 kreview run \
--cancer-samplesheet "/path/to/samplesheet.csv" \
--healthy-xs1-samplesheet "/path/to/healthy1.csv" \
--healthy-xs2-samplesheet "/path/to/healthy2.csv" \
--cbioportal-dir "/path/to/msk_solid_heme_cbioportal" \
--krewlyzer-dir "/path/to/feature_parquets" \
--output output/ \
--strategy mrmr \
--ch-hotspot-maf /path/to/ch_hotspots.maf \
--export-duckdb
--export-duckdb automatically writes a persistent SQL-queryable kreview_lake.duckdb after processing.
If your Krewlyzer results span multiple directories, create a manifest.txt listing one directory per line, then pass it down:
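A minimal sketch of building such a manifest. The two directories under krewlyzer_demo/ are placeholders created only so the snippet runs end to end. Which option accepts the manifest is not shown here; passing it as --krewlyzer-dir manifest.txt is an assumption to verify against kreview run --help.

```shell
# Placeholder directories so the sketch runs end to end; substitute the real
# Krewlyzer result directories.
mkdir -p krewlyzer_demo/batch_a krewlyzer_demo/batch_b

# One directory per line, written as absolute paths.
printf '%s\n' \
  "$PWD/krewlyzer_demo/batch_a" \
  "$PWD/krewlyzer_demo/batch_b" > manifest.txt

# Sanity check: count entries that do not point at an existing directory.
missing=0
while IFS= read -r dir; do
  [ -d "$dir" ] || { echo "not a directory: $dir" >&2; missing=$((missing + 1)); }
done < manifest.txt
echo "manifest entries: $(awk 'END { print NR }' manifest.txt), missing: $missing"
```

Checking the manifest before submitting a long job catches typos in paths early, which is cheaper than discovering them from a failed run.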
Control statistical parameters and cross-validation:
kreview run \
--cancer-samplesheet ... \
--krewlyzer-dir ... \
--cv-folds 5 \
--impute-strategy median
--impute-strategy accepts median, mean, or zero. --cv-folds must be between 3 and 20.
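When jobs are queued on a cluster, it can help to check these limits in the submit script before the job starts. The helper below is a hypothetical pre-flight check, not part of kreview; it only encodes the limits stated above (folds 3-20; median, mean, or zero).

```shell
# Hypothetical pre-flight helper: returns 0 when both values are within the
# documented limits, 1 otherwise.
check_cv_args() {
  folds=$1
  strategy=$2
  case "$folds" in
    ''|*[!0-9]*) echo "cv-folds must be an integer, got '$folds'" >&2; return 1 ;;
  esac
  if [ "$folds" -lt 3 ] || [ "$folds" -gt 20 ]; then
    echo "cv-folds must be between 3 and 20, got $folds" >&2
    return 1
  fi
  case "$strategy" in
    median|mean|zero) ;;
    *) echo "impute-strategy must be median, mean, or zero, got '$strategy'" >&2; return 1 ;;
  esac
  return 0
}

check_cv_args 5 median && echo "ok: --cv-folds 5 --impute-strategy median"
```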
Control SHAP explainability and display:
kreview run \
--cancer-samplesheet ... \
--krewlyzer-dir ... \
--shap-samples 5000 \
--shap-features 10 \
--top-percentile 20
| Flag | Default | Description |
|---|---|---|
| --shap-samples | 500 | Max samples for SHAP computation (higher = slower but more stable) |
| --shap-features | 10 | Number of features to show in SHAP beeswarm/waterfall |
| --top-percentile | 10.0 | Top X% of features per metric (AUC, MI). Union of both sets feeds models. |
The --cvd-safe flag switches all dashboard visualizations to the Okabe-Ito color palette, which is accessible for red-green colorblindness. By default, kreview uses a curated neon palette optimized for dark backgrounds.
Feature selection uses mRMR (Minimum Redundancy Maximum Relevance) by default. It selects features that are strongly associated with the target but mutually dissimilar, which reduces multicollinearity among the selected features. A legacy Hybrid Union strategy (top X% by AUC ∪ top X% by MI) is also available.
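The Hybrid Union rule (top X% by AUC ∪ top X% by MI) can be illustrated on toy data with standard shell tools. This is a sketch of the set logic only, not kreview's implementation: the scores are made up, the file name feature_scores.txt is hypothetical, and rounding the cutoff up to a whole number of features is an assumption.

```shell
# Toy scores, made up for illustration: feature name, univariate AUC, MI.
cat > feature_scores.txt <<'EOF'
f01 0.91 0.02
f02 0.55 0.30
f03 0.80 0.25
f04 0.60 0.05
f05 0.52 0.01
f06 0.58 0.03
f07 0.70 0.10
f08 0.51 0.04
f09 0.65 0.08
f10 0.50 0.06
EOF

pct=20
n=$(awk 'END { print NR }' feature_scores.txt)
k=$(( (n * pct + 99) / 100 ))   # ceil(n * pct / 100); the rounding rule is an assumption

# Top-k feature names by AUC (column 2) and by MI (column 3).
top_auc=$(LC_ALL=C sort -k2,2nr feature_scores.txt | head -n "$k" | awk '{ print $1 }')
top_mi=$(LC_ALL=C sort -k3,3nr feature_scores.txt | head -n "$k" | awk '{ print $1 }')

# Union of the two sets, de-duplicated.
selected=$(printf '%s\n%s\n' "$top_auc" "$top_mi" | LC_ALL=C sort -u)
echo "$selected"
```

With 10 features and a 20% cutoff, each metric contributes 2 features. f03 ranks in the top 2 on both metrics, so the union holds 3 features (f01, f02, f03) rather than 4.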
| Flag | Default | Description |
|---|---|---|
| --strategy | mrmr | Feature selection strategy: mrmr or hybrid_union |
| --top-percentile | 10.0 | Percentile cutoff. For mRMR, this controls K features to select. |
| --compute-univariate-auc | True | Compute per-feature LR AUC (required for hybrid selection). |
| --no-compute-univariate-auc | n/a | Opt-out: degrades selection to MI-only with a warning. |
Deprecated: --top-n
The --top-n flag is deprecated since v0.0.9 and will be removed in v0.1.0. Use --top-percentile instead.
Targeted Execution
Use the --features flag with a comma-separated list to run only specific evaluators. The registered evaluator names are printed by kreview features-list.
If you are running in headless CI environments and don't need HTML dashboards, pass --skip-report.
Note: When --skip-report is omitted, kreview generates both interactive Plotly dashboards (output/reports/*.html) and a static_plots/ subdirectory containing 2x-scaled .png versions of every chart.
Modular Pipeline (HPC / Nextflow)
The same pipeline can be run step-by-step for HPC parallelization or debugging:
# Step 1: Label (run once, share across all extractors)
kreview label \
--cancer-samplesheet samplesheet.csv \
--healthy-xs1-samplesheet healthy1.csv \
--healthy-xs2-samplesheet healthy2.csv \
--cbioportal-dir /path/to/cbioportal/ \
--ch-hotspot-maf /path/to/ch_hotspots.maf \
--output labels.parquet
# Step 2: Extract matrices (parallelizable per evaluator)
# Use --labels to skip re-labeling in each extract job
kreview extract --cancer-samplesheet samplesheet.csv \
--healthy-xs1-samplesheet healthy1.csv \
--healthy-xs2-samplesheet healthy2.csv \
--cbioportal-dir /path/to/cbioportal/ \
--krewlyzer-dir /path/to/features/ \
--labels labels.parquet \
--output output/
# Step 3: Feature selection (mRMR is default)
kreview select --matrices-dir output/ --top-percentile 50 --strategy mrmr --output selected/
# Or overwrite in-place:
# kreview select --matrices-dir output/ --top-percentile 50 --overwrite
# Steps 4a/4b/4c can run in PARALLEL
# 4a: CPU model evaluation
kreview eval cpu --matrices-dir selected/ --output results/
# 4b: GPU model evaluation (optional)
kreview eval gpu --matrices-dir selected/ --output results/
# 4c: Fuse selected matrices → super-matrix
kreview fuse --output-dir selected/
# Step 5: Multimodal evaluation (needs OOF probs + super_matrix)
kreview eval multimodal \
--results-dir results/ \
--super-matrix selected/super_matrix.parquet \
--multimodal-selection boruta_shap \
--output results/
# Step 6: Report
kreview report --results-dir results/
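On a single machine, the three parallel stages (eval cpu, eval gpu, fuse) can be launched as background jobs and joined with wait before the multimodal step. The sketch below reuses only the flags shown above. The kreview_cmd wrapper, the DRY_RUN variable, and the log file names are hypothetical helpers introduced here; DRY_RUN defaults to 1, which prints the commands instead of running them, so the control flow can be tried without kreview installed.

```shell
# Hypothetical wrapper: prints the command when DRY_RUN=1 (the default here),
# runs the real CLI when DRY_RUN=0.
kreview_cmd() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "kreview $*"
  else
    kreview "$@"
  fi
}

# Launch the three independent stages in the background, one log per stage.
kreview_cmd eval cpu --matrices-dir selected/ --output results/ > eval_cpu.log 2>&1 &
pid_cpu=$!
kreview_cmd eval gpu --matrices-dir selected/ --output results/ > eval_gpu.log 2>&1 &
pid_gpu=$!
kreview_cmd fuse --output-dir selected/ > fuse.log 2>&1 &
pid_fuse=$!

# Join all three and remember whether any of them failed.
failed=0
for pid in "$pid_cpu" "$pid_gpu" "$pid_fuse"; do
  wait "$pid" || failed=1
done

# Continue only when every stage that feeds the multimodal step succeeded.
if [ "$failed" -eq 0 ]; then
  kreview_cmd eval multimodal \
    --results-dir results/ \
    --super-matrix selected/super_matrix.parquet \
    --output results/ > eval_multimodal.log 2>&1
else
  echo "a parallel stage failed; see eval_cpu.log, eval_gpu.log, fuse.log" >&2
fi
```

Waiting on each PID individually, rather than calling a bare wait, preserves the exit status of every stage, so a failed GPU job cannot be masked by a successful fuse. On a scheduler such as SLURM the same shape is usually expressed with job dependencies instead of background jobs.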
Inspecting Parquet Outputs
Use parq-cli to quickly inspect parquet files directly from the terminal:
kreview select options
| Flag | Default | Description |
|---|---|---|
| --matrices-dir | required | Directory with *_matrix.parquet from extract |
| --top-percentile | 50 | Top N% per metric for selection |
| --strategy | mrmr | Feature selection strategy: mrmr or hybrid_union |
| --cv-folds | 5 | Folds for univariate AUC scoring |
| --impute-strategy | median | Imputation for missing values |
| --output | output/ | Output directory for selected matrices |
| --overwrite | false | Overwrite originals instead of separate output |
Labels Only
If you only need to generate the ctDNA truth labels without running feature evaluation:
kreview label \
--cancer-samplesheet "/path/to/samplesheet.csv" \
--healthy-xs1-samplesheet "/path/to/healthy1.csv" \
--healthy-xs2-samplesheet "/path/to/healthy2.csv" \
--cbioportal-dir "/path/to/cBioPortal/" \
--output labels.parquet
This produces a single Parquet file with sample IDs, clinical metadata, and the assigned 5-tier labels.
Data Lake Integration
Every time kreview run executes, feature matrices are loaded into memory, evaluated, and then discarded. For a persistent output for downstream analysis, add --export-duckdb to the run command.
This creates kreview_lake.duckdb in your output/ directory, or merges new results into it if the file already exists. Downstream researchers can then query it directly with DuckDB or Pandas without re-running the pipeline.
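Since the table layout of the lake is not described here, a reasonable first step is to list what it contains. The helper below is a hypothetical sketch that assumes the DuckDB command-line client is installed; it makes no assumption about table names and opens the file read-only.

```shell
# Hypothetical helper: list the tables in a lake file without assuming its schema.
# Requires the DuckDB command-line client on PATH.
list_lake_tables() {
  lake=$1
  if [ ! -f "$lake" ]; then
    echo "lake file not found: $lake" >&2
    return 1
  fi
  if ! command -v duckdb >/dev/null 2>&1; then
    echo "duckdb CLI not installed" >&2
    return 1
  fi
  # -readonly opens the database in read-only mode, so the lake cannot be modified.
  duckdb -readonly "$lake" -c "SHOW TABLES;"
}

list_lake_tables output/kreview_lake.duckdb || echo "nothing to list yet"
```

Once the table names are known, the same duckdb -readonly ... -c "..." form can run any SELECT against them.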
Re-generating Reports
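If you have existing evaluation results (stats.json, *_matrix.parquet) and want to regenerate the HTML dashboard, run kreview report with --results-dir pointing at the results directory (for example, kreview report --results-dir results/).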
For a guide to interpreting the generated dashboard, see the Dashboard Interpretation Guide.