Research Process Manager
A system for tracking, documenting, and visualizing multi-step scientific workflows — built for researchers who work in notebooks.
The Vision
Scientific analysis pipelines are messy. Data flows through notebooks, parameters change between experiments, and six months later nobody remembers which threshold produced which result.
This system solves that with three lightweight tools that work together: automatic documentation of what your code does, a relational database that tracks every run with full lineage, and a visual canvas that lets you see the whole pipeline at a glance, configure runs, and launch execution.
User Stories
Design and run a pipeline step
Open the Workflow Canvas. Select a registered method, configure parameters, pick the upstream data source, and launch. The system tracks everything to the database and chains it to the upstream run.
Browse available upstream data
In the Canvas, click on a method's input port to see all completed runs from compatible upstream modules. Select one — the Canvas wires the connection and validates the output contract matches.
Fan-out analysis
Multiple downstream analyses (pRB labeling, OPP labeling, elastic net, etc.) all load from the same upstream run. Each gets its own tracked run with a clear parent linkage in the database.
Run a pipeline for multiple samples
Design a workflow template in the Canvas, parameterize it with three cell lines and two ploidy thresholds, preview the 6-run tree, and submit. Nextflow executes all paths in parallel. The Canvas shows live progress and groups all runs under one execution plan.
Visualize the full pipeline
Open the Workflow Canvas. It reads all runs from the database, follows parent linkages, and renders the full DAG. Click a node to see parameters, metrics, and artifacts.
Verify results before publication
Run pm verify --plan 42. Every artifact from the paper's pipeline is checked against its checksum manifest. All clean — no post-generation modifications.
The Three Tools
1. dFlow — Workflow Documentation
Generates structured documentation from annotated notebooks and Python modules. Two modes:
- Comment-based: Parses `# Step N:` markers with PURPOSE/INPUTS/OUTPUTS annotations
- Decorator-based: `@workflow` and `@task` decorators + AST scanning build a knowledge graph via `WorkflowRegistry`
Outputs .workflow.md files with quick-reference tables, function signatures, and Mermaid diagrams. Agents read these to understand notebook structure without parsing implementation code.
2. Process Manager Database
A Postgres database that stores the complete state of the system: registered modules, methods with versioned code pointers, tracked function schemas, parameter sets, run history with full lineage, and artifact manifests. All operations are ACID-transactional.
The database is the single source of truth for what methods exist, what has been run, and how runs connect. The filesystem stores the actual code (in a git-tracked monorepo) and the actual artifacts (in the managed .runs/ store).
3. Workflow Canvas
A visual frontend (FastAPI + LiteGraph.js) that reads from and writes to the Postgres database. It serves as the primary interface for:
- Registering methods — A 4-step wizard that parses code and writes to the DB
- Building pipelines — Drag-and-drop DAG construction with automatic connection validation
- Configuring runs — Parameter forms generated from tracked function schemas
- Monitoring execution — Live status of execution plans and individual runs
- Browsing results — Navigate the run lineage graph, inspect artifacts and metrics
Core Data Model
Module → Method → Run
The core data model organizes work into three levels:
- Module — A pipeline stage or domain of analysis (e.g., `data_labeling`, `ploidy_filtering`). Defines the output contract that all its methods must satisfy.
- Method — A specific analysis approach within a module (e.g., `mIF_pRB_labeling`, `mIF_ploidy_filtering`). Versioned, with tracked functions and parameter schemas.
- Run — A single execution of a method version with specific parameters and data inputs.
There is one global Postgres instance. All modules, methods, and runs live in the same database. There is no per-experiment partitioning — the module/method hierarchy is the organizational structure.
Database Schema
The database uses 15 tables. No migrations are needed when methods, functions, or input types change — the schema handles everything with rows, not columns.
Table Responsibilities
- modules / methods / method_versions — The hierarchy and version history. Each version captures a git commit hash pointing to an immutable code snapshot.
- tracked_functions / param_defs — The parameter schema: which functions are tracked and what parameters they accept. Defined at registration, stored as rows.
- param_sets / param_set_values — Deduplicated parameter combinations. 50 runs with the same defaults share one `param_set` row.
- method_inputs — What data a method needs (upstream modules or user-supplied values).
- runs / run_inputs / run_outputs — Execution history: what happened, what data went in, what came out. Each run links to a specific `method_version` and `param_set`.
- run_logs — Structured log entries (level, message, timestamp) for each run, queryable from the Canvas.
- output_contracts — Per-module output shape requirements.
- workflow_templates — Reusable pipeline DAG definitions stored as JSONB. Created in the Canvas builder, referenced by execution plans.
- execution_plans — Template instantiations with data sources, parameter branches, Nextflow session IDs, and lifecycle status.
Parameter Tracking & Deduplication
Parameter Storage
Parameters are stored as (function_name, param_name, value) triples, scoped to tracked functions. This handles duplicate parameter names across functions (e.g., two functions both having a threshold parameter).
Fingerprint-Based Deduplication
When a run is created, the system computes a fingerprint from all parameter values — a sorted, concatenated hash like sha256("filter_cells.threshold=0.5|train_clf.model_type=logistic|train_clf.n_iter=100"). If a param_set with that fingerprint already exists, the run reuses it. This means 50 runs with identical defaults produce 50 rows in runs but only 1 row in param_sets.
Fingerprints exist only on param_sets. Runs do not have their own fingerprint — duplicate detection is done by querying for matching param_set_id + run_inputs combinations (see Duplicate Detection).
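A minimal sketch of the fingerprint computation, assuming values are canonicalized with str(); the real system's canonicalization rules may differ:

```python
import hashlib

def param_fingerprint(params: dict) -> str:
    """Fingerprint a parameter set for deduplication.

    `params` maps function name -> {param name -> value}. Entries are
    sorted so identical parameter sets always hash identically.
    """
    parts = [
        f"{func}.{name}={value}"
        for func, kwargs in sorted(params.items())
        for name, value in sorted(kwargs.items())
    ]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()

# param_fingerprint({"filter_cells": {"threshold": 0.5},
#                    "train_clf": {"model_type": "logistic", "n_iter": 100}})
```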
Data Inputs
Data inputs are fundamentally different from parameters. They come in several forms:
| Input Type | Example | Storage |
|---|---|---|
| Upstream run | Filtered data from preprocessing run #47 | FK to runs table via source_run_id |
| Scalar | sample_id = "Pa16c" | JSONB value |
| List | features = ["CycD1", "p27"] | JSONB array |
| File | feature_set.txt | Copied into run directory, path in JSONB |
All data inputs use the run_inputs table with a JSONB value column, avoiding the need for separate tables per input type. Upstream inputs additionally set source_run_id as a foreign key for lineage tracking.
Adding Tracked Functions Later
Because parameters are stored as rows (not columns), adding a new tracked function to an existing method is just INSERT operations into tracked_functions and param_defs. No schema migration needed. Old runs retain their original parameter sets; new runs include the additional function's parameters.
Output Contracts
Every method in a module must produce output that conforms to the module's output contract. Contracts define required columns, types, and constraints for the output data.
For example, all methods in the data_preprocessing module must output a table with cell_id (str), condition (str), and quality_score (float). This is validated when a run saves its results.
Contracts serve two purposes:
- Validation — Catch bad output before it propagates downstream. If a preprocessing method drops the `cell_id` column, the save fails immediately (see the validation sketch below).
- Compatibility — When the Canvas wires modules together, it checks that the upstream module's output contract matches what the downstream method expects. This enables automatic connection suggestions in the visual DAG.
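A minimal sketch of contract validation with pandas; the contract representation here (column name mapped to a dtype string) is an assumption:

```python
import pandas as pd

# Hypothetical contract for the data_preprocessing module.
PREPROCESSING_CONTRACT = {
    "cell_id": "object",         # str columns surface as object dtype
    "condition": "object",
    "quality_score": "float64",
}

def validate_contract(df: pd.DataFrame, contract: dict) -> list[str]:
    """Return a list of violations; an empty list means the output conforms."""
    errors = []
    for col, dtype in contract.items():
        if col not in df.columns:
            errors.append(f"missing required column: {col}")
        elif str(df[col].dtype) != dtype:
            errors.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    return errors
```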
Method Lifecycle
Registration Wizard (4-Step)
The Workflow Canvas provides a guided wizard for registering methods. The user never writes config files manually — the Canvas parses the code and lets them point-and-click.
Step 1: Select Files
User selects helper.py, run.py, and picks a pixi environment. The Canvas validates the files are parseable Python.
Step 2: Pick Tracked Functions
The Canvas parses both files with AST, extracts all function signatures (names, parameters, types, defaults), and presents them with checkboxes. The user selects which functions matter for tracking.
Step 3: Define Outputs & Inputs
The user declares what the method produces (output artifacts with names and types) and what data it consumes. Data inputs can be:
- Upstream module — data from a run of another module (typed, validated against output contracts)
- User-supplied — scalars, lists, files, or identifiers provided at run time (e.g., sample ID, feature set)
Step 4: Review & Register
The Canvas shows a summary: tracked functions, extracted parameters with defaults, declared outputs, and data inputs. The user can go back to any step before confirming. On "Register," the Canvas writes to the database and copies the pixi environment into the method directory.
Versioning & Git Integration
Methods evolve. A cell cycle phase annotation method might start with a GMM, switch to an SVM, then move to XGBoost with new features. The system tracks every version with full traceability.
Version Lifecycle
When a user modifies run.py or helper.py and re-registers through the Canvas wizard, the version increments automatically. Each version captures:
- The exact source code (via git commit hash)
- The tracked functions and their parameter schemas
- The frozen dependencies (`pixi.lock`)
- A human-readable description of what changed
The Methods Monorepo
All methods live in a git-tracked monorepo. Each method is a subdirectory under its module:
methods/
data_preprocessing/
mIF_ploidy_filtering/
helper.py
run.py
pixi.toml
pixi.lock
feature_selection/
gas_brake_feature_sets/
...
classification/
calibrated_classifier/
...
Auto-Commit on Registration
When the Canvas registers a new version, it auto-commits:
- Writes updated files to the method directory
- Stages only that directory: `git add methods/classification/calibrated_classifier/`
- Commits with a structured message: `method: calibrated_classifier v3 - XGBoost + CycD1 feature`
- Tags the commit: `git tag calibrated_classifier/v3`
- Stores the git hash and tag in `method_versions`
Git Policy
- Tagged commits on `main` (not per-method branches) keep history linear
- `git log -- methods/classification/calibrated_classifier/` gives the full version history of any method
- No model artifacts or large binary files in the repo — those go in `.runs/`
- A `.gitignore` at the repo root excludes `*.joblib`, `*.h5`, `*.tif`, `data/`, and other heavyweight patterns
- No source code is stored in the database — git is the source of truth for code, the DB stores the pointer
Version Selection in the Canvas
When wiring up a pipeline in the Canvas, each method node has a version selector. Different pipeline configurations can use different versions of the same method — enabling side-by-side comparison of results from "the old way" vs. "the new way."
Structured Diffs
The Canvas provides two layers of diff when comparing versions:
- Code diff — from `git diff tag/v2 tag/v3 -- method_dir/`. Raw line-level changes.
- Semantic diff — from the database. Shows parameter changes (added/removed/modified), tracked function changes, and dependency changes in a structured, human-readable format.
A researcher can glance at the semantic diff and immediately see "v3 switched from SVM to XGBoost and added CycD1 as a feature" without parsing code.
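A sketch of the parameter half of that semantic diff, assuming each version's param_defs rows are flattened into a "function.param" → default mapping (the real diff also covers tracked functions and dependencies):

```python
def semantic_param_diff(v_old: dict, v_new: dict) -> dict:
    """Structured parameter diff between two method versions.

    Inputs map 'function.param' -> default value, as loaded from param_defs.
    """
    added = {k: v_new[k] for k in v_new.keys() - v_old.keys()}
    removed = {k: v_old[k] for k in v_old.keys() - v_new.keys()}
    modified = {k: (v_old[k], v_new[k])
                for k in v_old.keys() & v_new.keys() if v_old[k] != v_new[k]}
    return {"added": added, "removed": removed, "modified": modified}

# semantic_param_diff({"train.model": "svm"}, {"train.model": "xgboost",
#                      "train.features": ["CycD1"]})
# -> {'added': {'train.features': ['CycD1']}, 'removed': {},
#     'modified': {'train.model': ('svm', 'xgboost')}}
```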
Execution
Execution Plans & Templates
A workflow template defines what methods to run and how they connect. An execution plan defines which samples to run them on and with which parameter branches. The Canvas builds templates; the user parameterizes them into execution plans; Nextflow runs them.
Workflow Templates
Templates are the DAG definitions created in the Canvas builder — stored as JSONB in the workflow_templates table. Each template captures the steps, connections, and method versions.
The Canvas lists available templates by querying the database. Templates are versioned via a content_hash — if the definition changes, the hash changes, and execution plans that reference the old version retain their original template snapshot.
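A sketch of the content hash, assuming the template definition is canonicalized as sorted-key JSON before hashing:

```python
import hashlib
import json

def template_content_hash(definition: dict) -> str:
    """Hash a template's JSONB definition; any change yields a new hash.

    Sorting keys makes the hash independent of key order. The exact
    canonicalization rules are an assumption.
    """
    canonical = json.dumps(definition, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()
```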
Parameterization & Fan-Out
From a template, the user creates an execution plan by specifying:
- Data sources — which samples to process (e.g., `["Pa16c", "CFPAC", "HPAC"]`). Each sample creates an independent path through the pipeline.
- Parameter branches per step — one or more parameter sets for each step in the template. Multiple param sets at a step create a fork in the pipeline.
The total run count is the product of all branches: 2 samples × 1 QC config × 2 ploidy thresholds × 1 labeling config × 2 analysis configs = 8 runs.
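The arithmetic is a Cartesian product; a sketch with illustrative step names and parameter values (all hypothetical):

```python
from itertools import product

samples = ["Pa16c", "CFPAC"]                     # data sources
branches = {                                     # param sets per step (illustrative)
    "qc":       [{"mode": "strict"}],
    "ploidy":   [{"dna_norm_max": 2.0}, {"dna_norm_max": 2.5}],
    "labeling": [{"marker": "pRB"}],
    "analysis": [{"model": "logistic"}, {"model": "xgboost"}],
}

# Every (sample, param, ...) combination is one path through the pipeline.
paths = list(product(samples, *branches.values()))
assert len(paths) == 8  # 2 samples × 1 × 2 × 1 × 2 param branches
```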
Run Tree Preview
Before launching, the Canvas shows the full run tree — every (sample, param) combination that will be executed. The preview includes:
- Total run count
- Estimated time (from `CRITICAL` markers in method workflow docs)
- Duplicate detection — runs with matching param_set + inputs that already completed
- A visual tree of all paths with parameters at each fork
Execution Plan Lifecycle
| Status | Meaning | Canvas Action |
|---|---|---|
| draft | Template + params chosen, not yet submitted | Edit parameters, preview run tree |
| submitted | Nextflow workflow generated and launched | View progress |
| running | Nextflow processes executing | Live status updates per run |
| completed | All runs finished successfully | Browse results, compare branches |
| partial | Some runs completed, some failed | Inspect failures, Resume |
| failed | Pipeline-level failure | View error, Resume or Re-run |
Resume works by calling nextflow run pipeline.nf -resume with the stored session ID. Nextflow skips completed processes natively.
Nextflow Integration
Each registered method directory is a self-contained executable unit that maps directly to a Nextflow process:
| Method Artifact | Nextflow Mapping |
|---|---|
| Method directory | Nextflow process |
| run.py | Process script (python run.py --args) |
| helper.py | Imported by run.py (comes along for the ride) |
| pixi.toml | Conda/pixi directive (local dev) |
| Docker image | Container directive (production/HPC) |
| Canvas DAG edges | Nextflow channel connections |
The Workflow Canvas exports the visual DAG as a Nextflow .nf file. Each node becomes a process, edges become channels, and the Canvas parameterization UI maps to Nextflow params.
The pm CLI Bridge
Each Nextflow process calls the pm CLI at two points — start and finish — to register run metadata in the database and record artifacts:
# At process START — register the run, get back a run ID
RUN_ID=$(python -m pm.register_run \
--plan-id 42 \
--method mIF_ploidy_filtering \
--sample Pa16c \
--branch "Pa16c/QC_strict/G0G1_tight" \
--parent-run-id 17 \
--params '{"dna_norm_max": 2.0}')
# Run the actual method
python methods/data_preprocessing/mIF_ploidy_filtering/run.py \
--input upstream_output.parquet --output output.parquet
# At process END — record completion, artifacts, metrics
python -m pm.complete_run \
--run-id $RUN_ID \
--status completed \
--output output.parquet \
--metrics '{"n_cells": 9102, "pct_filtered": 18.3}'
For local development, each process runs inside its method's pixi environment (`pixi run python run.py`). For production, the Canvas exports the full pipeline as a Nextflow workflow that runs on HPC or cloud infrastructure.
Duplicate Run Detection
Before executing, the system checks: does a completed run exist with the same method_version_id, param_set_id, and identical run_inputs? If so, the user gets a warning with the option to:
- Skip — reuse the existing run (default for batch execution)
- Re-run — create a new run anyway (for reproducibility checks)
This is a "detect and warn" model, not "lock and block." Two users could independently start execution plans that overlap — the second will see a warning that some runs already exist and can choose to skip them.
Exploratory Mode
The Canvas is rigid by design — structured pipelines with tracked parameters and validated outputs. But researchers need to poke at data interactively. Exploratory Mode bridges the gap.
How It Works
From any method node in the Canvas, click "Explore". The Canvas generates a pre-wired notebook (Jupyter or marimo — user's choice) with:
- All helper functions imported and available
- Data loaded from the selected upstream run
- The current `run.py` main function body, editable
- An output preview / validation cell
The user tinkers — trying different thresholds, visualizing intermediate results, iterating on the pipeline. When they're done:
- Save as new version — The Canvas registers the modified code as v(N+1) of the method, auto-commits to git, and updates the database.
- Discard — No changes. The exploration was just for understanding.
Exploratory Mode is a convenience, not a requirement — users can always edit run.py directly and re-register through the wizard.
Storage & Integrity
Global Managed Store
Run outputs — DataFrames, images, TIF stacks, model files — are stored on the filesystem in a single global managed directory. The database stores metadata and pointers; the filesystem stores the actual data.
All run artifacts live under a global .runs/ directory, indexed by run ID:
.runs/ # Global managed store
00000001/ # Run ID (auto-incrementing)
meta.json # Cached run metadata (mirrors DB)
.manifest.json # Checksums for integrity verification
run.log # stdout/stderr captured during execution
output.parquet # Primary output
plots/
umap.png
00000002/
meta.json
.manifest.json
run.log
exported_tiles/
tile_001.tif
...tile_3000.tif
00000003/
meta.json
.manifest.json
run.log
classifier.joblib
This structure is flat by design. No nested module/method/sample directories to disagree about. Run IDs are unique (auto-incrementing integers from Postgres), so collisions are structurally impossible. If you want to browse by module or sample, use the views (below) or the Canvas.
The execution engine creates each run directory automatically (pm complete_run handles the mkdir for output directories).
The meta.json File
Each run directory includes a meta.json that mirrors the database record. This makes the store self-describing — even without the database, you can inspect any run directory and understand what it is:
{
"run_id": 2,
"method": "export_tiles",
"method_version": 1,
"params": {"lif_file": "experiment.lif", "channel": "DAPI", "tile_size": 512},
"parent_run_id": 1,
"status": "completed",
"started_at": "2026-02-14T10:23:00Z",
"finished_at": "2026-02-14T10:25:30Z",
"outputs": ["exported_tiles/tile_001.tif", "exported_tiles/tile_002.tif"]
}
The .runs/ directory is portable — zip it, move it to another machine, and the meta.json files can reconstruct the database entries.
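A sketch of that reconstruction, assuming the layout above; inserting in run-ID order means each parent_run_id FK already resolves:

```python
import json
from pathlib import Path

def scan_store(store: Path = Path(".runs")):
    """Yield the metadata of every run in the managed store.

    Because each run directory carries a meta.json mirror, the store can
    be walked without the database, e.g. to re-seed it on a new machine.
    """
    for meta_file in sorted(store.glob("*/meta.json")):
        yield json.loads(meta_file.read_text())

# for meta in sorted(scan_store(), key=lambda m: m["run_id"]):
#     insert_run(meta)   # hypothetical DB insert helper
```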
Views & Symlinks
The views/ directory provides human-navigable projections of the managed store, generated entirely from the database:
views/
by_sample/
Pa16c/
mIF_multiround_qc_QC_strict → ../../.runs/00000001
mIF_ploidy_filtering_G0G1_tight → ../../.runs/00000002
CFPAC/
mIF_multiround_qc_QC_strict → ../../.runs/00000004
by_method/
mIF_multiround_qc/
Pa16c_QC_strict → ../../.runs/00000001
CFPAC_QC_strict → ../../.runs/00000004
by_plan/
2026-02-14_full_comparison/
Pa16c_QC_strict_G0G1_tight → ../../.runs/00000002
latest/
mIF_multiround_qc → ../../.runs/00000007
mIF_ploidy_filtering → ../../.runs/00000008
Views are disposable. Delete the entire views/ directory and regenerate it from the database in seconds. The managed store is unaffected. Custom view generators can be added without changing the storage model.
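A sketch of one view generator, assuming the database query returns (run_id, sample, method, branch) rows; the query shape is an assumption:

```python
import os
from pathlib import Path

def rebuild_by_sample_view(rows, views: Path = Path("views")) -> None:
    """Regenerate the by_sample/ projection from a database query.

    Views are disposable, so existing links are simply replaced.
    """
    for run_id, sample, method, branch in rows:
        link = views / "by_sample" / sample / f"{method}_{branch}"
        link.parent.mkdir(parents=True, exist_ok=True)
        link.unlink(missing_ok=True)
        target = Path(".runs") / f"{run_id:08d}"
        # Compute the relative path from the link's directory to the store.
        link.symlink_to(os.path.relpath(target, link.parent))
```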
Per-Run Logging
Every run captures its stdout/stderr to a run.log file in the run directory. When a run fails, click the node in the Canvas to see the full traceback.
Logging is implemented at two levels:
- File-based — `.runs/{id}/run.log` captures everything the process printed. Always available, zero overhead.
- Structured (DB) — The `run_logs` table stores parsed log entries (level, message, timestamp). The Canvas can filter by level, search across runs, and show warnings/errors inline on DAG nodes.
The execution engine pipes stdout/stderr to the log file and optionally parses Python logging-formatted lines into the run_logs table.
Tiered Checksumming
Data inside the managed store has a verified chain of custody. Once it leaves the system (via export or manual copy), it's the user's responsibility. The system doesn't prevent tampering — it detects it.
Three Trust Zones
| Zone | Location | Protection |
|---|---|---|
| Verified | .runs/ | Read-only permissions + checksum manifests. Only the execution engine writes here. |
| Browsable | views/ | Symlinks to read-only targets. Can't modify through symlinks. |
| User | Exported copies | User-owned. Includes provenance manifest for optional verification. |
Read-Only After Completion
When a run finishes, the engine locks the run directory:
- Hash output files and write `.manifest.json`
- Set all files to read-only (`chmod 444` / NTFS deny-write ACL)
- Update the database with artifact paths, sizes, and checksums
Tiered Hashing
Not all outputs are created equal. A single output.parquet file is cheap to hash. A directory of 3,000 TIF tiles is not. Three tiers:
| Tier | What It Checks | Speed (3K TIFs) | When Used |
|---|---|---|---|
| Directory fingerprint | File count + names + sizes | < 1 second | Every time: finalization, upstream load, verify |
| Spot-check sampling | Full SHA-256 of N random files | ~5–10 seconds (N=10) | At finalization + thorough verify |
| Full content hash | SHA-256 of every file | ~5–15 minutes | Only on full verify or before publication |
The threshold: ≤50 files → full hash everything. >50 files → directory fingerprint + 10 random spot-checks.
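A sketch of the tier selection at finalization time, assuming the thresholds above; helper and manifest field names are hypothetical:

```python
import hashlib
import random
from pathlib import Path

SPOT_CHECK_N, FULL_HASH_MAX = 10, 50

def sha256_file(p: Path) -> str:
    h = hashlib.sha256()
    with p.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def fingerprint_dir(d: Path) -> dict:
    """Cheap fingerprint always; full hashes only for small outputs,
    random spot-checks otherwise."""
    files = sorted(p for p in d.rglob("*") if p.is_file())
    entry = {
        "count": len(files),
        "listing": [(str(p.relative_to(d)), p.stat().st_size) for p in files],
    }
    if len(files) <= FULL_HASH_MAX:
        entry["hashes"] = {str(p.relative_to(d)): sha256_file(p) for p in files}
    else:
        sample = random.sample(files, SPOT_CHECK_N)
        entry["spot_checks"] = {str(p.relative_to(d)): sha256_file(p) for p in sample}
    return entry
```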
Upstream Verification
The execution engine verifies upstream data before every run. If a checksum fails, the run refuses to start:
ERROR: Upstream run 00000003 has been tampered with.
File: output.parquet
Expected checksum: sha256:a3f7c2...
Actual checksum: sha256:9b1e44...
Options:
1. Re-run upstream step to regenerate clean data
2. Override with reason (logged in DB)
Error Handling
The system adheres to ACID transactions — database operations either fully commit or fully roll back. The main risk is a split between the filesystem and the database: the DB commits but the artifact write fails, or vice versa.
Write-First Strategy
The execution engine follows a strict order:
- Write artifacts to `.runs/{id}/` — the filesystem operation happens first
- Hash and lock the artifacts — create `.manifest.json`, set read-only
- Commit to the database — insert `run_outputs`, update `runs.status = 'completed'`
If the DB commit fails after artifacts are written, the orphan files in .runs/ are harmless — they can be detected and cleaned up (run directories with no matching DB record). This is simpler and safer than the reverse, where a DB record points to missing files.
Failed Runs
If run.py crashes mid-execution:
- The run directory may contain partial outputs
- The DB record stays at `status = 'running'`
- The `run.log` file contains the traceback
- No `.manifest.json` is created (finalization didn't happen)
- A cleanup sweep can detect stale `running` records with no heartbeat and mark them `failed` (see the sketch below)
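A sketch of that sweep, assuming psycopg; the fixed age cutoff stands in for a real heartbeat check, which isn't specified here:

```python
import psycopg

def sweep_stale_runs(conninfo: str = "dbname=pm", max_age_hours: int = 24) -> int:
    """Mark 'running' records older than the cutoff as failed."""
    with psycopg.connect(conninfo) as conn:
        cur = conn.execute(
            "UPDATE runs SET status = 'failed' "
            "WHERE status = 'running' "
            "AND started_at < now() - %s * interval '1 hour'",
            (max_age_hours,),
        )
        return cur.rowcount   # number of runs swept
```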
Architecture
Now that all the pieces are described, here's how they connect:
MVP Prototype
The MVP proves one thing: define a branched pipeline in the Canvas, fan it out across samples, execute it via Nextflow, and trace any result back through the full lineage tree. Everything else is polish.
Scope & Non-Goals
The demo
3 Python scripts (same pixi environment), 2 samples. The Canvas shows a 3-step pipeline. User hits Run, provides the sample list. The system generates a .nf file, Nextflow executes all 6 runs (3 steps × 2 samples), artifacts land in .runs/, the DB records everything, and the Canvas shows a lineage tree where clicking run 6 traces back through runs 4 → 2.
In scope
- Postgres DB (reduced tables)
- Nextflow `.nf` file generation from Canvas DAG
- `pm` CLI bridge (register_run / complete_run)
- Canvas: DAG builder, parameterization, execution monitor, lineage view
- `.runs/` flat artifact store with `meta.json`
- Methods seeded manually (SQL insert or seed script)
Out of scope
- Registration wizard (manual seeding instead)
- dFlow / workflow documentation
- Method versioning (hardcoded v1)
- Parameter deduplication / fingerprints
- Output contracts / validation
- Views / symlinks
- Integrity / checksumming
- Exploratory mode
- Duplicate run detection
- Git auto-commit
Reduced Schema (7 tables)
Just enough to record methods, runs, and lineage:
MVP Simplifications
- No `param_sets` — params stored directly as JSONB on `runs`. Deduplication can be added later.
- No `method_versions` — implicit v1. The FK can be added later without changing the rest.
- No `execution_plans` or `workflow_templates` — the Canvas stores the DAG in memory and generates the `.nf` directly. Templates saved to DB come later.
- `runs.sample` — denormalized sample name so lineage queries are simple. No separate samples table.
- `runs.nf_process_name` — links the DB record to the Nextflow process for log correlation.
Nextflow Generation
The system generates a concrete, disposable .nf file for each execution. No Nextflow params, no config files, no reusable templates. Everything is hardcoded. The .nf file is a throwaway build artifact; the DB is the record of truth.
Generation Strategy
The Canvas DAG has N method nodes with edges. The user provides a sample list. The generator produces a DSL2 .nf file where:
- Each method → one Nextflow `process`
- The sample list → a `Channel.fromList()`
- Each edge in the DAG → a channel connecting output to input
- Params for each process are hardcoded in the script block
- Each process script calls the `pm` CLI at start and end
Example: Generated .nf file
Given a Canvas DAG of preprocess → filter → label with samples ["Pa16c", "CFPAC"]:
nextflow.enable.dsl = 2
process preprocess {
input:
val sample
output:
tuple val(sample), path("output.parquet")
script:
"""
RUN_ID=\$(python -m pm.register_run \\
--method preprocess \\
--sample ${sample})
pixi run python methods/data_preprocessing/preprocess/run.py \\
--sample ${sample} \\
--output output.parquet \\
--threshold 0.5
python -m pm.complete_run \\
--run-id \$RUN_ID \\
--status completed \\
--output output.parquet
"""
}
process filter_cells {
input:
tuple val(sample), path(input_file)
output:
tuple val(sample), path("filtered.parquet")
script:
"""
PARENT_ID=\$(python -m pm.lookup_run \\
--method preprocess --sample ${sample})
RUN_ID=\$(python -m pm.register_run \\
--method filter_cells \\
--sample ${sample} \\
--parent-run-id \$PARENT_ID)
pixi run python methods/data_preprocessing/filter_cells/run.py \\
--input ${input_file} \\
--output filtered.parquet \\
--min_quality 0.8
python -m pm.complete_run \\
--run-id \$RUN_ID \\
--status completed \\
--output filtered.parquet
"""
}
process label {
input:
tuple val(sample), path(input_file)
output:
tuple val(sample), path("labeled.parquet")
script:
"""
PARENT_ID=\$(python -m pm.lookup_run \\
--method filter_cells --sample ${sample})
RUN_ID=\$(python -m pm.register_run \\
--method label \\
--sample ${sample} \\
--parent-run-id \$PARENT_ID)
pixi run python methods/data_labeling/label/run.py \\
--input ${input_file} \\
--output labeled.parquet
python -m pm.complete_run \\
--run-id \$RUN_ID \\
--status completed \\
--output labeled.parquet
"""
}
workflow {
samples_ch = Channel.fromList(["Pa16c", "CFPAC"])
preprocess_out = preprocess(samples_ch)
filter_out = filter_cells(preprocess_out)
label(filter_out)
}
DAG Shapes Handled
| Shape | Example | Nextflow Mapping |
|---|---|---|
| Linear chain | A → B → C | Channels pipe output to input sequentially |
| Fan-out | A → B, A → C | A's output channel is consumed by both B and C (native in DSL2 — a channel can feed multiple processes) |
| Fan-in | B → D, C → D | D takes a merged channel from B and C (via mix or join) |
| Diamond | A → B → D, A → C → D | Combination of fan-out and fan-in |
One fresh .nf per execution. No attempt at reuse. If the user changes one parameter and re-runs, a brand new .nf is generated. Old .nf files are kept in .runs/plans/{timestamp}/pipeline.nf for debugging but are never re-executed.
pm CLI Bridge
A minimal Python CLI with 3 commands, called by Nextflow process scripts:
| Command | What It Does | Returns |
|---|---|---|
| pm register_run | Inserts a runs row with status='running', writes started_at | Run ID (printed to stdout) |
| pm complete_run | Updates runs.status, writes finished_at, inserts run_outputs, copies artifacts to .runs/{id}/, writes meta.json | Nothing (exit 0 on success) |
| pm lookup_run | Finds the most recent completed run for a given method + sample | Run ID (for wiring parent_run_id) |
Artifact Routing
When pm complete_run is called:
- Creates the `.runs/{run_id}/` directory
- Copies the output file(s) into it
- Writes `meta.json` with run metadata
- Captures `run.log` (Nextflow already captures stdout/stderr per process via `.command.log` — copy it)
- Inserts `run_outputs` rows with artifact paths
- Inserts a `run_inputs` row with `source_run_id` pointing to the parent run
The run_inputs.source_run_id FK is what creates the lineage chain. Every run knows which upstream run produced its input data.
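A minimal sketch of what pm register_run might look like, assuming psycopg and the reduced MVP schema; the table and column names here are assumptions, not the real implementation:

```python
import argparse
import psycopg

def main() -> None:
    ap = argparse.ArgumentParser()
    ap.add_argument("--method", required=True)
    ap.add_argument("--sample", required=True)
    ap.add_argument("--parent-run-id", type=int, default=None)
    args = ap.parse_args()

    with psycopg.connect("dbname=pm") as conn:
        # Insert the run row, resolving the method name to its ID.
        run_id = conn.execute(
            "INSERT INTO runs (method_id, sample, status, started_at) "
            "SELECT m.id, %s, 'running', now() FROM methods m WHERE m.name = %s "
            "RETURNING id",
            (args.sample, args.method),
        ).fetchone()[0]
        if args.parent_run_id is not None:   # wire the lineage edge
            conn.execute(
                "INSERT INTO run_inputs (run_id, source_run_id) VALUES (%s, %s)",
                (run_id, args.parent_run_id),
            )
    print(run_id)  # captured by the Nextflow script as $RUN_ID

if __name__ == "__main__":
    main()
```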
Canvas MVP
The Canvas for the MVP has 4 views, all reading from the same Postgres database:
1. Method Palette
Sidebar listing all registered methods (from GET /methods). Shows method name, module, and parameter list. No registration wizard — methods are seeded into the DB manually or via seed script.
2. DAG Builder
LiteGraph.js canvas where the user drags methods from the palette, places them as nodes, and draws edges to connect outputs to inputs. Each node shows its parameter fields (from param_defs) with editable values. A "Samples" input at the top lets the user type a comma-separated list of sample names.
3. Execution Monitor
After hitting "Run":
- The Canvas sends the DAG + params + samples to `POST /execute`
- The backend generates the `.nf` file and launches `nextflow run pipeline.nf`
- The Canvas polls `GET /runs?execution_id=X` and updates node badges: ⏳ pending → 🔄 running → ✅ completed / ❌ failed
- Clicking a node shows its `run.log` and output file listing
4. Lineage View
The key view. Click any completed run and see its full ancestry:
Clicking Run 5 highlights the chain: Run 5 ← Run 3 ← Run 1. The panel shows: sample = Pa16c, method = label, params = {…}, input came from Run 3 (filter_cells), which came from Run 1 (preprocess). Full provenance in one click.
The lineage query is a recursive CTE on run_inputs.source_run_id:
WITH RECURSIVE lineage AS (
-- Start from the clicked run
SELECT r.id, r.method_id, r.sample, r.status, ri.source_run_id
FROM runs r
LEFT JOIN run_inputs ri ON ri.run_id = r.id
WHERE r.id = 5 -- clicked run
UNION ALL
-- Walk up the chain
SELECT r.id, r.method_id, r.sample, r.status, ri.source_run_id
FROM runs r
LEFT JOIN run_inputs ri ON ri.run_id = r.id
JOIN lineage l ON l.source_run_id = r.id
)
SELECT * FROM lineage;
Lineage: The Core Feature
Lineage isn't a feature — it's THE feature. Everything else (the Canvas, Nextflow, the DB) exists to make this one thing work: click a result, see exactly how it was produced.
For each run in the lineage chain, the panel shows:
- Method — what code ran
- Sample — what data it processed
- Parameters — every setting that was used
- Input — which upstream run produced its input (clickable link)
- Output — what files it produced (viewable/downloadable)
- Timestamps — when it started and finished
- Status — completed / failed
- Log — full stdout/stderr
This is what you show your coworkers. Not the schema, not the Nextflow DSL — the lineage view. "I clicked this result and I can see every step and parameter that produced it."
Build Phases
| Phase | What | Deliverable |
|---|---|---|
| 1. DB + pm CLI | Postgres schema (7 tables), SQLAlchemy models, pm register_run / complete_run / lookup_run, seed script with 3 methods | Run pm register_run --method preprocess --sample Pa16c from terminal, see the row in the DB |
| 2. Nextflow gen + execution | Python function that takes a DAG definition (JSON) + sample list + params → generates .nf file. Run it manually: nextflow run pipeline.nf. pm CLI populates DB. | 3 scripts × 2 samples = 6 runs in the DB with full lineage chain. Query with the recursive CTE and see the tree. |
| 3. FastAPI + Canvas | REST endpoints for methods, runs, lineage. Canvas frontend: drag methods, wire them, set params, type samples, hit Run. Poll for status. Click node → see lineage. | The full demo: build a pipeline visually, run it, trace any result. |
What Exists Today
| Component | Status | Notes |
|---|---|---|
| Pipeline notebooks | Done | Regionprops → QC → Ploidy → Labeling → Analysis |
| dFlow comment-based | Done | Generates .workflow.md from Step markers |
| dFlow decorators | WIP | @workflow, @task, WorkflowRegistry, AST scanning |
| Workflow Canvas frontend | WIP | LiteGraph.js + FastAPI, basic DAG rendering |
| Registration wizard (Canvas) | Planned | 4-step guided flow: files → functions → outputs → register |
| Postgres database | Planned | 15-table schema: modules, methods, method_versions, runs, etc. |
| Parameter deduplication | Planned | Fingerprint-based param_sets, duplicate run detection |
| Output contracts | Planned | Per-module output shape validation |
| Run parameterization (Canvas) | Planned | Configure runs in the Canvas UI with upstream data selection |
| Nextflow export | Planned | Canvas DAG → .nf file, pixi/Docker execution |
| Method versioning + git | Planned | Auto-commit per version, git tags, structured diffs |
| Exploratory Mode | Planned | Spawn pre-wired notebooks from Canvas for interactive editing |
| Execution plans | Planned | Template + data sources + param branches → Nextflow run tree |
| Workflow templates | Planned | Canvas DAG stored as JSONB in workflow_templates table |
| pm CLI bridge | Planned | register_run / complete_run called by each Nextflow process |
| Artifact storage (.runs/) | Planned | Global managed flat store with meta.json, views/ symlinks |
| Data integrity | Planned | Tiered checksumming, read-only lock, upstream verification |
| Per-run logging | Planned | run.log files + structured run_logs table |
Comparisons
This system isn't trying to replace any single tool — it fills the gaps between them.
| Tool | What It Does Well | Gap This Fills |
|---|---|---|
| W&B | Beautiful dashboards, team collaboration | Cloud-only, no notebook-native workflow, no lineage chains |
| Airflow / Prefect | Production orchestration, scheduling | Too heavy for exploratory research, DAGs defined in code not visually |
| Nextflow | HPC pipeline execution, reproducibility | Doesn't help with interactive exploration or parameter discovery — we use it as the execution engine, not the brain |
| KNIME | Visual pipeline building, no-code | Can't integrate with custom Python notebooks, limited extensibility |
| CellProfiler | Image analysis pipelines | Domain-specific, no general-purpose downstream tracking |
Future Features
Tests Per Method
Attach test suites directly to methods in the registry. When a method runs, its tests run automatically. Three levels:
- Contract tests — Does the output match the declared schema?
- Sanity tests — Row count > 0, no NaN in label columns, no duplicate cell IDs
- Domain tests — Proliferation rate between 5% and 80%, intensity values in expected range
This turns every pipeline step into a validated, testable unit without requiring researchers to write test frameworks from scratch.
CLI Interface
An optional command-line interface for power users who want to script operations outside the Canvas GUI. This is not a priority — the Canvas is the primary interface — but could be added later for automation and scripting.