Kreview: Scalable cfDNA Fragmentomics Evaluation

CI PyPI License: AGPL v3

Welcome to the kreview evaluation intelligence platform. Built at MSKCC, kreview accelerates the downstream analysis of cell-free DNA (cfDNA) fragmentomics metrics generated by the Krewlyzer Rust pipeline.

Fragmentomics relies on subtle biological properties, such as where nucleases like DNASE1 cleave apoptotic DNA, or the different fragment-length patterns of tumor-derived versus healthy hematopoietic cfDNA. kreview manages high-throughput evaluation of these physical signals across cohorts of thousands of samples.


🏗️ Execution Architecture

At its core, kreview is an orchestration and machine learning engine. It connects the Parquet artifacts of the upstream pipeline to a DuckDB data lake and to statistical modeling.

flowchart TD
    %% Define Styles
    classDef extract fill:#3b82f6,stroke:#1e40af,color:#fff;
    classDef label fill:#10b981,stroke:#047857,color:#fff;
    classDef ml fill:#8b5cf6,stroke:#5b21b6,color:#fff;
    classDef persist fill:#f59e0b,stroke:#b45309,color:#fff;

    subgraph Data Sources
        cBioPortal["MSK-IMPACT Genotypes\n(MAF, SV, CNA)"] 
        Krewlyzer["Krewlyzer Artifacts\n(Parquet)"]
    end

    subgraph kreview Engine
        A["DuckDB Engine\nChunked I/O with Retry"]:::extract
        B["5-Tier ctDNA Labeling\nAssigns Ground Truth"]:::label
        C["26 Feature Evaluators\n(FSC, EndMotifs, WPS, MDS)"]:::extract
        D["Sklearn ML Pipeline\n(RF, XGB, LR, CV AUC)"]:::ml
    end

    subgraph Outputs
        DB[("kreview_lake.duckdb\nData Lake")]:::persist
        Dash["HTML Plotly\nDashboards"]:::persist
        Stats["stats.json"]:::persist
    end

    cBioPortal --> B
    Krewlyzer --> A
    B -.-> C
    A --> C
    C --> D

    D --> Dash
    D --> Stats
    C -.-> DB

What happens in a run?

  1. Ingest & Chunking: kreview loads Parquet outputs from the upstream Krewlyzer pipeline. It uses throttled DuckDB queries, retried with exponential backoff, to parse millions of rows reliably without exhausting memory or socket limits.
  2. Gold Standard Labeling: It accesses clinical MSK-IMPACT files to generate 5-tier truth labels (e.g., verifying if a somatic variant in cfDNA was also detected in the patient's matched solid tissue).
  3. Statistical Modeling: It loads fragmentomics features dynamically, evaluating them against the ground truth using non-parametric group testing and ensemble ML evaluation (Random Forest, XGBoost, Logistic Regression).
  4. Interactive Insight: It generates 6-page HTML dashboards with progressive disclosure, from executive summary to SHAP explainability, so researchers can inspect diagnostic performance, clinical utility (decision curve analysis, DCA), and feature importance. See the Dashboard Guide for details.
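
The chunked, retried I/O described in step 1 can be illustrated with a minimal standard-library sketch. This is not kreview's actual code: `retry_with_backoff`, `chunked`, and all parameter defaults are hypothetical names and values chosen for illustration, and the `task` callable stands in for whatever DuckDB query a real run would issue.

```python
import random
import time
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items` holding at most `size` elements."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def retry_with_backoff(
    task: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run task(), retrying on failure with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(max_attempts):
        try:
            return task()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # The delay doubles on each attempt (0.5s, 1s, 2s, ...), capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% random jitter keeps concurrent workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

In a real run, each chunk of sample files would be wrapped in its own `task`, so a transient failure only repeats that chunk rather than the whole ingest. The `sleep` parameter is injectable so the retry logic can be tested without waiting.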

🔬 Notebook-First Development

kreview is built using the nbdev notebook-first framework. All source code lives in Jupyter Notebooks (nbs/) and is automatically compiled into the Python package.

flowchart LR
    NB["nbs/*.ipynb\n(Source of Truth)"] -->|nbdev-export| PY["kreview/*.py\n(Auto-generated)"]
    PY -->|pip install -e .| CLI["kreview CLI\n(typer)"]
    NB -->|nbdev-test| TEST["Unit Tests"]

Golden Rule

Never manually edit the kreview/*.py files. They are auto-generated from the notebooks, so manual changes are overwritten on the next export. See the nbdev Workflow guide.


🧬 Why Fragmentomics?

Traditional tumor profiling focuses heavily on single nucleotide variants (SNVs). Fragmentomics adds orthogonal layers of diagnostic signal that are independent of genetic mutation status.

With kreview, we systematically evaluate:

  • Fragment Size Distribution (FSD & FSC): Using fragment lengths (e.g., the share of fragments under 150 bp) as a signal of tumor-derived cfDNA.
  • Nucleosome Protection (WPS): Tracing nucleosome-protected versus accessible regions around transcription factor binding sites.
  • Cleavage Signatures (EndMotif & Breakpoints): Profiling the fragment-end signatures left by circulating nucleases.
  • Chromatin Accessibility (ATAC): Evaluating openness at regulatory regions.
  • Motif Divergence (MDS): Measuring shifts in end-motif distributions from healthy baselines.
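
As a rough illustration of how one such feature can be scored against labels, the sketch below computes a short-fragment fraction per sample and a rank-based AUC, which equals the Mann-Whitney U statistic divided by the number of positive-negative pairs. The function names and the toy fragment lengths are hypothetical and invented for illustration; only the 150 bp cutoff comes from the list above, and this is not kreview's evaluator code.

```python
from typing import Sequence


def short_fragment_fraction(lengths: Sequence[int], cutoff: int = 150) -> float:
    """Fraction of fragments strictly shorter than `cutoff` bp."""
    if not lengths:
        raise ValueError("at least one fragment length is required")
    return sum(1 for n in lengths if n < cutoff) / len(lengths)


def rank_auc(positives: Sequence[float], negatives: Sequence[float]) -> float:
    """AUC as the Mann-Whitney U statistic divided by n_pos * n_neg.

    Counts, over all (positive, negative) pairs, how often the positive
    sample scores higher; ties count as half a win.
    """
    if not positives or not negatives:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy fragment lengths (bp), made up for illustration only.
tumor_like = [[120, 135, 142, 167, 145], [110, 148, 166, 170, 139]]
healthy_like = [[166, 167, 170, 145, 180], [165, 168, 172, 175, 160]]

tumor_scores = [short_fragment_fraction(s) for s in tumor_like]
healthy_scores = [short_fragment_fraction(s) for s in healthy_like]
auc = rank_auc(tumor_scores, healthy_scores)
```

The pairwise loop is quadratic in cohort size, which is fine for a sketch; a cohort-scale evaluation would use a sort-based rank computation or a library routine instead.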

➡️ Ready to start? Jump into our Installation Guide or explore the CLI Pipeline.