# Running the Pipeline
The backbone of kreview is a modular Typer CLI that wires the independent stages of the pipeline together.
## Available Commands
kreview exposes four subcommands:
| Command | Purpose |
|---|---|
| `kreview run` | Full pipeline: label → extract → evaluate → report |
| `kreview label` | Generate ctDNA labels only (no feature evaluation) |
| `kreview features-list` | List all registered feature evaluators |
| `kreview report` | Re-generate the HTML dashboard from existing results |
## Basic Execution

### Disable Python Buffering
When running under nohup, behind a pipe, or on a SLURM cluster, set `PYTHONUNBUFFERED=1` so that the structlog progress output streams in real time instead of being block-buffered.
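You can sanity-check the effect without kreview at all (this assumes `python3` is on your PATH): Python's `sys.stdout.write_through` reflects whether the unbuffered text layer is active.

```shell
# With PYTHONUNBUFFERED=1, the stdout text layer writes through
# immediately even when piped:
PYTHONUNBUFFERED=1 python3 -c 'import sys; print(sys.stdout.write_through)' | cat   # True
# Without it, piped stdout is block-buffered:
env -u PYTHONUNBUFFERED python3 -c 'import sys; print(sys.stdout.write_through)' | cat   # False
```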
```shell
PYTHONUNBUFFERED=1 kreview run \
  --cancer-samplesheet "/path/to/samplesheet.csv" \
  --healthy-xs1-samplesheet "/path/to/healthy1.csv" \
  --healthy-xs2-samplesheet "/path/to/healthy2.csv" \
  --cbioportal-dir "/path/to/msk_solid_heme_cbioportal" \
  --krewlyzer-dir "/path/to/feature_parquets" \
  --output output/ \
  --workers 4 \
  --export-duckdb
```
`--export-duckdb` automatically writes a persistent, SQL-queryable `kreview_lake.duckdb` after processing.
If your Krewlyzer results span multiple directories, create a `manifest.txt` listing one directory per line, then pass the manifest instead:
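For example, a manifest can be built with standard shell tools (the paths below are placeholders):

```shell
# manifest.txt: one Krewlyzer results directory per line
printf '%s\n' \
  /data/krewlyzer/batch1 \
  /data/krewlyzer/batch2 \
  /data/krewlyzer/batch3 > manifest.txt

cat manifest.txt
```

How the manifest is passed depends on your kreview version; the run examples in this guide use `--krewlyzer-dir`, so check `kreview run --help` for the option that accepts a manifest path.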
Control statistical parameters and cross-validation:
```shell
kreview run \
  --cancer-samplesheet ... \
  --krewlyzer-dir ... \
  --cv-folds 5 \
  --impute-strategy median
```
Valid `--impute-strategy` values are `median`, `mean`, and `zero`; `--cv-folds` must be between 3 and 20.
Control SHAP explainability and display:
```shell
kreview run \
  --cancer-samplesheet ... \
  --krewlyzer-dir ... \
  --shap-samples 5000 \
  --shap-features 10 \
  --top-n 50
```
| Flag | Default | Description |
|---|---|---|
| `--shap-samples` | 500 | Max samples for SHAP computation (higher = slower but more stable) |
| `--shap-features` | 10 | Number of features to show in SHAP beeswarm/waterfall plots |
| `--top-n` | 50 | Number of top features (by importance) to include in model training |
The `--cvd-safe` flag switches all dashboard visualizations to the Okabe-Ito color palette, which is accessible for red-green colorblindness. By default, kreview uses a curated neon palette optimized for dark backgrounds.
The `--compute-univariate-auc` flag computes a standalone ROC-AUC for each individual feature column (not just the ensemble model). This is computationally expensive but enables:
- Univariate AUC badges in the dashboard's statistical ledger
- Marker size encoding in the volcano plot (larger = better univariate AUC)
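Both display-oriented flags can be appended to any run; a sketch, with the other options elided as in the examples above:

```shell
kreview run \
  --cancer-samplesheet ... \
  --krewlyzer-dir ... \
  --cvd-safe \
  --compute-univariate-auc
```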
## Targeted Execution
Use the `--features` flag with a comma-separated list to run only specific evaluators:
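For example (the evaluator names below are placeholders; list the real ones with `kreview features-list`):

```shell
kreview features-list

kreview run \
  --cancer-samplesheet ... \
  --krewlyzer-dir ... \
  --features evaluator_a,evaluator_b
```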
If you are running in a headless CI environment and don't need HTML dashboards, pass `--skip-report`:
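For example:

```shell
kreview run \
  --cancer-samplesheet ... \
  --krewlyzer-dir ... \
  --skip-report
```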
Note: when `--skip-report` is omitted, kreview generates both interactive Plotly dashboards under `output/reports/*.html` and a `static_plots/` subdirectory containing 2x-scaled `.png` versions of every chart.
## Labels Only
If you only need to generate the ctDNA truth labels without running feature evaluation:
```shell
kreview label \
  --cancer-samplesheet "/path/to/samplesheet.csv" \
  --healthy-xs1-samplesheet "/path/to/healthy1.csv" \
  --healthy-xs2-samplesheet "/path/to/healthy2.csv" \
  --cbioportal-dir "/path/to/cBioPortal/" \
  --output labels.parquet
```
This produces a single Parquet file with sample IDs, clinical metadata, and the assigned 5-tier labels.
## Data Lake Integration
Each `kreview run` loads the feature matrices into memory, evaluates them, and discards them. To keep a persistent output for downstream analysis, pass `--export-duckdb`.
This creates (or merges into) a `kreview_lake.duckdb` file in your `output/` directory. Downstream researchers can then query it directly with DuckDB or pandas without re-running the pipeline.
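For example, with the DuckDB CLI installed (the table names written by kreview are not documented here, so list them first):

```shell
duckdb output/kreview_lake.duckdb "SHOW TABLES;"
```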
## Re-generating Reports
If you have existing evaluation results (`stats.json`, `*_matrix.parquet`) and want to regenerate the HTML dashboard:
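A sketch; the exact options accepted by `kreview report` are not shown in this guide (the `--output` flag is assumed by analogy with `kreview run`), so consult `kreview report --help`:

```shell
kreview report --output output/
```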
For a guide to interpreting the generated dashboard, see the Dashboard Interpretation Guide.