Pipeline Architecture
This document describes the decomposed pipeline architecture of kreview, how commands share code through the nbdev notebook-first workflow, and how to add a new pipeline stage.
Pipeline Overview
The kreview pipeline is decomposed into independent stages that can run either sequentially (kreview run) or in parallel (Nextflow / HPC):
graph LR
classDef step fill:#8b5cf6,stroke:#5b21b6,color:#fff;
A["Label"]:::step --> B["Extract ×N"]:::step
B --> C[Select]:::step
C --> D["Eval CPU"]:::step
C --> E["Eval GPU"]:::step
C --> F[Fuse]:::step
D --> S[Scoreboard]:::step
E --> S
D --> G["Eval Multimodal"]:::step
E --> G
F --> G
S --> H[Report]:::step
D --> H
E --> H
G --> I["Report Multimodal"]:::step
Parallel execution
After feature selection, eval cpu, eval gpu, and fuse run in parallel.
After eval, scoreboard, report, and eval multimodal run in parallel — Scoreboard needs all JSONs; Report needs matrices + model results + scoreboard; Multimodal needs super_matrix + OOF probs.
Command → Module → Notebook Mapping
Every module in kreview is auto-generated from an nbdev notebook in nbs/. Never manually edit the .py files (see nbdev Workflow).
| CLI Command | Module | Source Notebook | Purpose |
|---|---|---|---|
kreview label |
cli.py:label() |
nbs/90_cli.ipynb |
Generate ctDNA labels |
kreview extract |
cli.py:extract() |
nbs/90_cli.ipynb |
Label + extract feature matrices |
kreview select |
cli_select.py:select() |
nbs/92_cli_select.ipynb |
Score features + mRMR/hybrid-union selection |
kreview eval cpu |
cli_eval.py:eval_cpu() |
nbs/91_cli_eval.ipynb |
CPU model evaluation (LR, RF, XGB) |
kreview eval gpu |
cli_eval.py:eval_gpu() |
nbs/91_cli_eval.ipynb |
GPU model evaluation (TabPFN, TabICL) |
kreview fuse |
cli.py:fuse() |
nbs/90_cli.ipynb |
Fuse per-evaluator matrices → super-matrix |
kreview eval multimodal |
cli_eval.py:eval_multimodal() |
nbs/91_cli_eval.ipynb |
Cross-evaluator stacking + ablation |
kreview report |
cli.py:report() |
nbs/90_cli.ipynb |
Re-generate HTML dashboards |
kreview run |
cli.py:run() |
nbs/90_cli.ipynb |
Full pipeline orchestrator |
kreview features-list |
cli.py:features_list() |
nbs/90_cli.ipynb |
List registered evaluators |
Shared Libraries
| Module | Source Notebook | Functions | Used By |
|---|---|---|---|
selection.py |
nbs/04_selection.ipynb |
score_features(), select_features(), build_binary_target() |
kreview run, kreview select |
eval_engine.py |
nbs/02_eval_engine.ipynb |
cpu_models(), gpu_models(), univariate_auc(), mutual_info_score(), load_model_results(), load_all_model_results() |
kreview run, kreview eval cpu/gpu, selection.py, scoreboard.py, report templates |
scoreboard.py |
Standalone | build_scoreboard() |
kreview report, KREVIEW_SCOREBOARD |
core.py |
nbs/00_core.ipynb |
LABEL_META_COLS, Paths, LabelConfig |
All commands |
registry.py |
nbs/03_registry.ipynb |
get_all_evaluators() |
kreview run, kreview extract, kreview features-list |
Shared Code Principle
kreview run is an orchestrator — it calls the same shared functions as the standalone commands. This guarantees that local runs and HPC runs produce identical results given the same inputs.
# selection.py — used by BOTH kreview run AND kreview select
from kreview.selection import score_features, select_features
# eval_engine.py — used by BOTH kreview run AND kreview eval cpu/gpu
from kreview.eval_engine import cpu_models, gpu_models
When editing these shared functions, always edit the source notebook (nbs/*.ipynb) and run nbdev_export to regenerate the .py modules.
Data Flow
Each stage communicates through parquet files on disk:
Label: → labels.parquet (5-tier ctDNA labels)
Extract: → {evaluator}_matrix.parquet (full features)
Select: → {evaluator}_matrix.parquet (selected features, overwrites)
→ {evaluator}_eval_stats.parquet (per-feature scores for ALL features)
→ {evaluator}_selection_qc.json (selection audit trail)
Eval CPU: → {evaluator}_model_results.json (AUCs, OOF probs)
→ {evaluator}_{model}_model.joblib (trained models)
Eval GPU: → {evaluator}_gpu_model_results.json (GPU AUCs, OOF probs)
Fuse: → super_matrix.parquet (wide join on SAMPLE_ID)
Scoreboard: → scoreboard_combined__all.parquet (cross-evaluator rankings)
Multimodal: → multimodal_results.json (stacking + ablation results)
Report: → reports/{evaluator}.html (interactive dashboards)
Nextflow Parallelism
See Nextflow Integration for full HPC execution docs.
In Nextflow multistage mode (params.pipeline_mode = 'multistage'), the DAG executes as:
LABEL (1 job) → EXTRACT ×N → SELECT ×N ──┬── EVAL_CPU ──┬── EVAL_MULTIMODAL
├── EVAL_GPU ──┤
└── FUSE ──────┘
└── REPORT (parallel)
Each stage is a separate Nextflow process in nextflow/modules/local/kreview/:
| Process | Module | Input | Output | publishDir |
|---|---|---|---|---|
KREVIEW_LABEL |
label.nf |
Samplesheets + cBioPortal | labels.parquet |
outdir/labels/ |
KREVIEW_EXTRACT |
extract.nf |
Samplesheets + labels.parquet | *_matrix.parquet |
outdir/matrices/raw/ |
KREVIEW_SELECT_SINGLE |
select_single.nf |
Raw matrix | Selected matrix + stats + QC | outdir/matrices/selected/ |
KREVIEW_EVAL_CPU_SINGLE |
eval_cpu_single.nf |
Selected matrix | *_model_results.json + *.joblib |
outdir/models/cpu/ |
KREVIEW_EVAL_GPU_SINGLE |
eval_gpu_single.nf |
Selected matrix | *_gpu_model_results.json + *.joblib |
outdir/models/gpu/ |
KREVIEW_FUSE |
fuse.nf |
All selected matrices | super_matrix.parquet |
outdir/matrices/fused/ |
KREVIEW_SCOREBOARD |
scoreboard.nf |
Collected CPU + GPU JSONs | scoreboard_combined__all.parquet |
outdir/ |
KREVIEW_EVAL_MULTIMODAL |
eval_multimodal.nf |
Fuse + eval results | Multimodal results | outdir/models/multimodal/ |
KREVIEW_REPORT |
report.nf |
Matrices + JSONs + stats + QC + joblib + scoreboard | HTML dashboards | outdir/reports/ |
KREVIEW_REPORT_MULTIMODAL |
report_multimodal.nf |
Multimodal JSON + super_matrix | Multimodal dashboard | outdir/reports/ |
Adding a New Pipeline Stage
- Create the source notebook (e.g.,
nbs/04_new_stage.ipynb) with the shared logic functions - Run
nbdev_exportto generatekreview/new_stage.py - Create a CLI notebook (e.g.,
nbs/93_cli_new_stage.ipynb) with the thin CLI wrapper - Run
nbdev_exportto generatekreview/cli_new_stage.py - Register in
nbs/90_cli.ipynbviaapp.command()orapp.add_typer() - Update
kreview runin the same notebook to call the shared functions - Create a Nextflow process in
nextflow/modules/local/kreview/new_stage.nf - Wire into
nextflow/workflows/kreview_eval.nf - Add tests in
tests/test_new_stage.py - Update this document and Pipeline CLI
Remember the Golden Rule
Always edit the notebook first, then nbdev_export. See nbdev Workflow for details.