
Running the Pipeline

kreview is driven by a modular Typer CLI that connects the independent stages of the pipeline.


Available Commands

kreview exposes four subcommands:

| Command | Purpose |
| --- | --- |
| kreview run | Full pipeline: label → extract → evaluate → report |
| kreview label | Generate ctDNA labels only (no feature evaluation) |
| kreview features-list | List all registered feature evaluators |
| kreview report | Re-generate HTML dashboard from existing results |

Basic Execution

Disable Python Buffering

When output is redirected (e.g. under nohup, a pipe, or SLURM), run Python with PYTHONUNBUFFERED=1 so that the structlog progress output streams in real time.

PYTHONUNBUFFERED=1 kreview run \
  --cancer-samplesheet "/path/to/samplesheet.csv" \
  --healthy-xs1-samplesheet "/path/to/healthy1.csv" \
  --healthy-xs2-samplesheet "/path/to/healthy2.csv" \
  --cbioportal-dir "/path/to/msk_solid_heme_cbioportal" \
  --krewlyzer-dir "/path/to/feature_parquets" \
  --output output/ \
  --workers 4 \
  --export-duckdb
Note: --export-duckdb automatically writes a persistent SQL-queryable kreview_lake.duckdb after processing.

If your Krewlyzer results span multiple directories, create a manifest.txt listing one directory per line, then pass the manifest in place of a directory:

kreview run \
  --cancer-samplesheet ... \
  --krewlyzer-dir manifest.txt \
  --output output/
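
A manifest.txt is just a plain list of directories, one per line. The paths below are illustrative only:

```text
/scratch/project/krewlyzer_batch_01/
/scratch/project/krewlyzer_batch_02/
/data/archive/krewlyzer_2023/
```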

Control statistical parameters and cross-validation:

kreview run \
  --cancer-samplesheet ... \
  --krewlyzer-dir ... \
  --cv-folds 5 \
  --impute-strategy median
Valid imputation strategies: median, mean, zero. --cv-folds must be between 3 and 20.

Control SHAP explainability and display:

kreview run \
  --cancer-samplesheet ... \
  --krewlyzer-dir ... \
  --shap-samples 5000 \
  --shap-features 10 \
  --top-n 50
| Flag | Default | Description |
| --- | --- | --- |
| --shap-samples | 500 | Max samples for SHAP computation (higher = slower but more stable) |
| --shap-features | 10 | Number of features to show in SHAP beeswarm/waterfall |
| --top-n | 50 | Number of top features (by importance) to include in model training |
kreview run \
  --cancer-samplesheet ... \
  --krewlyzer-dir ... \
  --cvd-safe

The --cvd-safe flag switches all dashboard visualizations to the Okabe-Ito color palette, which is accessible for red-green colorblindness. By default, kreview uses a curated neon palette optimized for dark backgrounds.

kreview run \
  --cancer-samplesheet ... \
  --krewlyzer-dir ... \
  --compute-univariate-auc

The --compute-univariate-auc flag computes a standalone ROC-AUC for each individual feature column (not just the ensemble model). This is computationally expensive but enables:

  • Univariate AUC badges in the dashboard's statistical ledger
  • Marker size encoding in the volcano plot (larger = better univariate AUC)
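
The idea behind a univariate AUC is simple: score every sample by a single feature column and measure how well that ranking alone separates the two classes. A minimal sketch with scikit-learn, using synthetic data (the column names here are illustrative, not kreview's):

```python
# One standalone ROC-AUC per feature column, independent of any model.
# Labels and feature values below are synthetic stand-ins.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)  # 0 = control, 1 = case
features = {
    "informative": labels + rng.normal(0, 0.5, size=200),  # tracks the label
    "noise": rng.normal(0, 1, size=200),                   # pure noise
}

for name, values in features.items():
    # roc_auc_score treats the raw feature values as a ranking score
    auc = roc_auc_score(labels, values)
    print(f"{name}: AUC = {auc:.3f}")
```

An informative column lands well above 0.5, while a noise column hovers near it; this is the per-feature number kreview surfaces as badges and marker sizes.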

If you hit a PermissionError while loading a large cohort, reduce the Parquet chunk size:

kreview run \
  --cancer-samplesheet ... \
  --krewlyzer-dir ... \
  --chunk-size 100

Targeted Execution

Use the --features flag with a comma-separated list to run only specific evaluators:

kreview run \
  ...
  --features "BreakPointMotifOnTarget,EndMotifOnTarget"

Run only Tier 1 (fragment size) or Tier 2 (nucleosome/motif) features:

kreview run \
  ...
  --tier 1

If you are running in headless CI environments and don't need HTML dashboards:

kreview run \
  ...
  --skip-report
Note: When --skip-report is omitted, kreview generates both interactive Plotly output/reports/*.html dashboards and a static_plots/ subdirectory containing 2x-scaled .png versions of every chart.

Skip evaluators that already have model results (useful for incremental HPC re-runs):

kreview run \
  ...
  --resume
This checks for existing *_model_results.json files and skips evaluators that have already completed.


Labels Only

If you only need to generate the ctDNA truth labels without running feature evaluation:

kreview label \
  --cancer-samplesheet "/path/to/samplesheet.csv" \
  --healthy-xs1-samplesheet "/path/to/healthy1.csv" \
  --healthy-xs2-samplesheet "/path/to/healthy2.csv" \
  --cbioportal-dir "/path/to/cBioPortal/" \
  --output labels.parquet

This produces a single Parquet file with sample IDs, clinical metadata, and the assigned 5-tier labels.


Data Lake Integration

Each kreview run loads feature matrices into memory, evaluates them, and then discards them. If you want a persistent artifact for downstream analysis:

kreview run \
  ...
  --export-duckdb

This creates, or merges new results into, a persistent kreview_lake.duckdb file in your output/ directory. Downstream researchers can then query it directly with DuckDB or pandas without re-running the pipeline.


Re-generating Reports

If you have existing evaluation results (stats.json, *_matrix.parquet) and want to regenerate the HTML dashboard:

kreview report \
  --results-dir output/BreakPointMotifOnTarget/ \
  --output-dir output/reports/

For a guide to interpreting the generated dashboard, see the Dashboard Interpretation Guide.