Pipeline Architecture

This document describes the decomposed pipeline architecture of kreview, how commands share code through the nbdev notebook-first workflow, and how to add a new pipeline stage.

Pipeline Overview

The kreview pipeline is decomposed into independent stages that can run either sequentially (kreview run) or in parallel (Nextflow / HPC):

graph LR
    classDef step fill:#8b5cf6,stroke:#5b21b6,color:#fff;
    A["Label"]:::step --> B["Extract ×N"]:::step
    B --> C[Select]:::step
    C --> D["Eval CPU"]:::step
    C --> E["Eval GPU"]:::step
    C --> F[Fuse]:::step
    D --> S[Scoreboard]:::step
    E --> S
    D --> G["Eval Multimodal"]:::step
    E --> G
    F --> G
    S --> H[Report]:::step
    D --> H
    E --> H
    G --> I["Report Multimodal"]:::step

Use mouse to pan and zoom

Parallel execution

After feature selection, eval cpu, eval gpu, and fuse run in parallel. After eval, scoreboard, report, and eval multimodal run in parallel — Scoreboard needs all JSONs; Report needs matrices + model results + scoreboard; Multimodal needs super_matrix + OOF probs.

Command → Module → Notebook Mapping

Every module in kreview is auto-generated from an nbdev notebook in nbs/. Never manually edit the .py files (see nbdev Workflow).

CLI Command	Module	Source Notebook	Purpose
`kreview label`	`cli.py:label()`	`nbs/90_cli.ipynb`	Generate ctDNA labels
`kreview extract`	`cli.py:extract()`	`nbs/90_cli.ipynb`	Label + extract feature matrices
`kreview select`	`cli_select.py:select()`	`nbs/92_cli_select.ipynb`	Score features + mRMR/hybrid-union selection
`kreview eval cpu`	`cli_eval.py:eval_cpu()`	`nbs/91_cli_eval.ipynb`	CPU model evaluation (LR, RF, XGB)
`kreview eval gpu`	`cli_eval.py:eval_gpu()`	`nbs/91_cli_eval.ipynb`	GPU model evaluation (TabPFN, TabICL)
`kreview fuse`	`cli.py:fuse()`	`nbs/90_cli.ipynb`	Fuse per-evaluator matrices → super-matrix
`kreview eval multimodal`	`cli_eval.py:eval_multimodal()`	`nbs/91_cli_eval.ipynb`	Cross-evaluator stacking + ablation
`kreview report`	`cli.py:report()`	`nbs/90_cli.ipynb`	Re-generate HTML dashboards
`kreview run`	`cli.py:run()`	`nbs/90_cli.ipynb`	Full pipeline orchestrator
`kreview features-list`	`cli.py:features_list()`	`nbs/90_cli.ipynb`	List registered evaluators

Shared Libraries

Module	Source Notebook	Functions	Used By
`selection.py`	`nbs/04_selection.ipynb`	`score_features()`, `select_features()`, `build_binary_target()`	`kreview run`, `kreview select`
`eval_engine.py`	`nbs/02_eval_engine.ipynb`	`cpu_models()`, `gpu_models()`, `univariate_auc()`, `mutual_info_score()`, `load_model_results()`, `load_all_model_results()`	`kreview run`, `kreview eval cpu/gpu`, `selection.py`, `scoreboard.py`, report templates
`scoreboard.py`	Standalone	`build_scoreboard()`	`kreview report`, `KREVIEW_SCOREBOARD`
`core.py`	`nbs/00_core.ipynb`	`LABEL_META_COLS`, `Paths`, `LabelConfig`	All commands
`registry.py`	`nbs/03_registry.ipynb`	`get_all_evaluators()`	`kreview run`, `kreview extract`, `kreview features-list`

Shared Code Principle

kreview run is an orchestrator — it calls the same shared functions as the standalone commands. This guarantees that local runs and HPC runs produce identical results given the same inputs.

# selection.py — used by BOTH kreview run AND kreview select
from kreview.selection import score_features, select_features

# eval_engine.py — used by BOTH kreview run AND kreview eval cpu/gpu
from kreview.eval_engine import cpu_models, gpu_models

When editing these shared functions, always edit the source notebook (nbs/*.ipynb) and run nbdev_export to regenerate the .py modules.

Data Flow

Each stage communicates through parquet files on disk:

Label:    → labels.parquet  (5-tier ctDNA labels)
Extract:  → {evaluator}_matrix.parquet  (full features)
Select:   → {evaluator}_matrix.parquet  (selected features, overwrites)
          → {evaluator}_eval_stats.parquet  (per-feature scores for ALL features)
          → {evaluator}_selection_qc.json  (selection audit trail)
Eval CPU: → {evaluator}_model_results.json  (AUCs, OOF probs)
          → {evaluator}_{model}_model.joblib  (trained models)
Eval GPU: → {evaluator}_gpu_model_results.json  (GPU AUCs, OOF probs)
Fuse:     → super_matrix.parquet  (wide join on SAMPLE_ID)
Scoreboard: → scoreboard_combined__all.parquet  (cross-evaluator rankings)
Multimodal: → multimodal_results.json  (stacking + ablation results)
Report:   → reports/{evaluator}.html  (interactive dashboards)

Nextflow Parallelism

See Nextflow Integration for full HPC execution docs.

In Nextflow multistage mode (params.pipeline_mode = 'multistage'), the DAG executes as:

LABEL (1 job) → EXTRACT ×N → SELECT ×N ──┬── EVAL_CPU ──┬── EVAL_MULTIMODAL
                                           ├── EVAL_GPU ──┤
                                           └── FUSE ──────┘
                                                          └── REPORT (parallel)

Each stage is a separate Nextflow process in nextflow/modules/local/kreview/:

Process	Module	Input	Output	publishDir
`KREVIEW_LABEL`	`label.nf`	Samplesheets + cBioPortal	`labels.parquet`	`outdir/labels/`
`KREVIEW_EXTRACT`	`extract.nf`	Samplesheets + labels.parquet	`*_matrix.parquet`	`outdir/matrices/raw/`
`KREVIEW_SELECT_SINGLE`	`select_single.nf`	Raw matrix	Selected matrix + stats + QC	`outdir/matrices/selected/`
`KREVIEW_EVAL_CPU_SINGLE`	`eval_cpu_single.nf`	Selected matrix	`_model_results.json` + `.joblib`	`outdir/models/cpu/`
`KREVIEW_EVAL_GPU_SINGLE`	`eval_gpu_single.nf`	Selected matrix	`_gpu_model_results.json` + `.joblib`	`outdir/models/gpu/`
`KREVIEW_FUSE`	`fuse.nf`	All selected matrices	`super_matrix.parquet`	`outdir/matrices/fused/`
`KREVIEW_SCOREBOARD`	`scoreboard.nf`	Collected CPU + GPU JSONs	`scoreboard_combined__all.parquet`	`outdir/`
`KREVIEW_EVAL_MULTIMODAL`	`eval_multimodal.nf`	Fuse + eval results	Multimodal results	`outdir/models/multimodal/`
`KREVIEW_REPORT`	`report.nf`	Matrices + JSONs + stats + QC + joblib + scoreboard	HTML dashboards	`outdir/reports/`
`KREVIEW_REPORT_MULTIMODAL`	`report_multimodal.nf`	Multimodal JSON + super_matrix	Multimodal dashboard	`outdir/reports/`

Adding a New Pipeline Stage

Create the source notebook (e.g., nbs/04_new_stage.ipynb) with the shared logic functions
Run nbdev_export to generate kreview/new_stage.py
Create a CLI notebook (e.g., nbs/93_cli_new_stage.ipynb) with the thin CLI wrapper
Run nbdev_export to generate kreview/cli_new_stage.py
Register in nbs/90_cli.ipynb via app.command() or app.add_typer()
Update kreview run in the same notebook to call the shared functions
Create a Nextflow process in nextflow/modules/local/kreview/new_stage.nf
Wire into nextflow/workflows/kreview_eval.nf
Add tests in tests/test_new_stage.py
Update this document and Pipeline CLI

Remember the Golden Rule

Always edit the notebook first, then nbdev_export. See nbdev Workflow for details.