Research Process Manager

A system for tracking, documenting, and visualizing multi-step scientific workflows — built for researchers who work in notebooks.

The Vision

Scientific analysis pipelines are messy. Data flows through notebooks, parameters change between experiments, and six months later nobody remembers which threshold produced which result.

This system solves that with three lightweight tools that work together: automatic documentation of what your code does, a relational database that tracks every run with full lineage, and a visual canvas that lets you see the whole pipeline at a glance, configure runs, and launch execution.

User Stories

Design and run a pipeline step

Open the Workflow Canvas. Select a registered method, configure parameters, pick the upstream data source, and launch. The system records everything in the database and chains the new run to its upstream run.

Browse available upstream data

In the Canvas, click on a method's input port to see all completed runs from compatible upstream modules. Select one — the Canvas wires the connection and validates that the output contract matches.

Fan-out analysis

Multiple downstream analyses (pRB labeling, OPP labeling, elastic net, etc.) all load from the same upstream run. Each gets its own tracked run with a clear parent linkage in the database.

Run a pipeline for multiple samples

Design a workflow template in the Canvas, parameterize it with three cell lines and two ploidy thresholds, preview the 6-run tree, and submit. Nextflow executes all paths in parallel. The Canvas shows live progress and groups all runs under one execution plan.

Visualize the full pipeline

Open the Workflow Canvas. It reads all runs from the database, follows parent linkages, and renders the full DAG. Click a node to see parameters, metrics, and artifacts.

Verify results before publication

Run pm verify --plan 42. Every artifact from the paper's pipeline is checked against its checksum manifest. All clean — no post-generation modifications.

The Three Tools

1. dFlow — Workflow Documentation

Generates structured documentation from annotated notebooks and Python modules. Two modes:

  • Comment-based: Parses # Step N: markers with PURPOSE/INPUTS/OUTPUTS annotations
  • Decorator-based: @workflow and @task decorators + AST scanning build a knowledge graph via WorkflowRegistry

Outputs .workflow.md files with quick-reference tables, function signatures, and Mermaid diagrams. Agents read these to understand notebook structure without parsing implementation code.
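
For illustration, a comment-annotated step might look like the following (a hypothetical example of the marker shape; filter_by_ploidy and the exact annotation grammar are placeholders, not dFlow's specification):

# Step 2: Filter cells by ploidy
# PURPOSE: Remove cells outside the G0/G1 DNA-content window
# INPUTS: regionprops table (parquet)
# OUTPUTS: filtered table (parquet)
adata = filter_by_ploidy(adata, dna_norm_max=2.0)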

2. Process Manager Database

A Postgres database that stores the complete state of the system: registered modules, methods with versioned code pointers, tracked function schemas, parameter sets, run history with full lineage, and artifact manifests. All operations are ACID-transactional.

The database is the single source of truth for what methods exist, what has been run, and how runs connect. The filesystem stores the actual code (in a git-tracked monorepo) and the actual artifacts (in the managed .runs/ store).

3. Workflow Canvas

A visual frontend (FastAPI + LiteGraph.js) that reads from and writes to the Postgres database. It serves as the primary interface for:

  • Registering methods — A 4-step wizard that parses code and writes to the DB
  • Building pipelines — Drag-and-drop DAG construction with automatic connection validation
  • Configuring runs — Parameter forms generated from tracked function schemas
  • Monitoring execution — Live status of execution plans and individual runs
  • Browsing results — Navigate the run lineage graph, inspect artifacts and metrics

Core Data Model

Module → Method → Run

The core data model organizes work into three levels:

  • Module — A pipeline stage or domain of analysis (e.g., data_labeling, ploidy_filtering). Defines the output contract that all its methods must satisfy.
  • Method — A specific analysis approach within a module (e.g., mIF_pRB_labeling, mIF_ploidy_filtering). Versioned, with tracked functions and parameter schemas.
  • Run — A single execution of a method version with specific parameters and data inputs.

There is one global Postgres instance. All modules, methods, and runs live in the same database. There is no per-experiment partitioning — the module/method hierarchy is the organizational structure.

Database Schema

The database uses 15 tables. No migrations are needed when methods, functions, or input types change — the schema handles everything with rows, not columns.

erDiagram
    modules ||--o{ methods : contains
    modules ||--o{ output_contracts : defines
    methods ||--o{ method_versions : "versioned as"
    methods ||--o{ tracked_functions : has
    methods ||--o{ method_inputs : requires
    tracked_functions ||--o{ param_defs : has
    param_defs ||--o{ param_set_values : "value in"
    param_sets ||--o{ param_set_values : contains
    methods ||--o{ runs : executes
    method_versions ||--o{ runs : "code snapshot"
    param_sets ||--o{ runs : "used by"
    runs ||--o{ run_inputs : consumes
    runs ||--o{ run_outputs : produces
    runs ||--o{ run_logs : "logged by"
    method_inputs ||--o{ run_inputs : "filled by"
    workflow_templates ||--o{ execution_plans : "instantiated as"
    execution_plans ||--o{ runs : spawns

    modules {
        int id PK
        text name UK
        text description
    }
    methods {
        int id PK
        int module_id FK
        text name
    }
    method_versions {
        int id PK
        int method_id FK
        int version
        text description
        text git_commit_hash
        text git_tag
        text pixi_lock_hash
        timestamp created_at
        text created_by
    }
    tracked_functions {
        int id PK
        int method_id FK
        text function_name
        text source_file
        int ordinal
    }
    param_defs {
        int id PK
        int tracked_function_id FK
        text param_name
        text param_type
        text default_value
    }
    param_sets {
        int id PK
        int method_id FK
        text fingerprint UK
    }
    param_set_values {
        int param_set_id FK
        int param_def_id FK
        text value
    }
    method_inputs {
        int id PK
        int method_id FK
        text input_name
        text input_type
        int source_module_id FK
        bool required
    }
    runs {
        int id PK
        int method_id FK
        int method_version_id FK
        int param_set_id FK
        int execution_plan_id FK
        text plan_step
        text plan_branch
        text status
        timestamp started_at
        timestamp finished_at
    }
    run_inputs {
        int id PK
        int run_id FK
        int method_input_id FK
        int source_run_id FK
        jsonb value
    }
    run_outputs {
        int id PK
        int run_id FK
        text output_name
        text artifact_path
        text artifact_type
        bigint size_bytes
        text checksum
    }
    run_logs {
        int id PK
        int run_id FK
        text level
        text message
        timestamp logged_at
    }
    output_contracts {
        int id PK
        int module_id FK
        text field_name
        text field_type
        bool required
    }
    workflow_templates {
        int id PK
        text name UK
        text description
        jsonb definition
        text content_hash
        timestamp created_at
        timestamp updated_at
    }
    execution_plans {
        int id PK
        int template_id FK
        text name
        jsonb data_sources
        jsonb branch_params
        text status
        text nextflow_run_id
        text nextflow_session_id
        timestamp created_at
    }

Table Responsibilities

  • modules / methods / method_versions — The hierarchy and version history. Each version captures a git commit hash pointing to an immutable code snapshot.
  • tracked_functions / param_defs — The parameter schema: which functions are tracked and what parameters they accept. Defined at registration, stored as rows.
  • param_sets / param_set_values — Deduplicated parameter combinations. 50 runs with the same defaults share one param_set row.
  • method_inputs — What data a method needs (upstream modules or user-supplied values).
  • runs / run_inputs / run_outputs — Execution history: what happened, what data went in, what came out. Each run links to a specific method_version and param_set.
  • run_logs — Structured log entries (level, message, timestamp) for each run, queryable from the Canvas.
  • output_contracts — Per-module output shape requirements.
  • workflow_templates — Reusable pipeline DAG definitions stored as JSONB. Created in the Canvas builder, referenced by execution plans.
  • execution_plans — Template instantiations with data sources, parameter branches, Nextflow session IDs, and lifecycle status.

Parameter Tracking & Deduplication

Parameter Storage

Parameters are stored as (function_name, param_name, value) triples, scoped to tracked functions. This handles duplicate parameter names across functions (e.g., two functions both having a threshold parameter).

Fingerprint-Based Deduplication

When a run is created, the system computes a fingerprint from all parameter values — a sorted, concatenated hash like sha256("filter_cells.threshold=0.5|train_clf.model_type=logistic|train_clf.n_iter=100"). If a param_set with that fingerprint already exists, the run reuses it. This means 50 runs with identical defaults produce 50 rows in runs but only 1 row in param_sets.

Fingerprints exist only on param_sets. Runs do not have their own fingerprint — duplicate detection is done by querying for matching param_set_id + run_inputs combinations (see Duplicate Detection).
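
A minimal sketch of the fingerprint computation (the exact canonicalization and value serialization are implementation details; param_set_fingerprint is a hypothetical name):

import hashlib

def param_set_fingerprint(params: dict[str, dict[str, object]]) -> str:
    # params maps tracked function name -> {param name: value}
    parts = sorted(
        f"{func}.{name}={value}"
        for func, kwargs in params.items()
        for name, value in kwargs.items()
    )
    return hashlib.sha256("|".join(parts).encode()).hexdigest()

# Identical defaults always hash to the same string, so 50 such runs
# share a single param_sets row.
fp = param_set_fingerprint({
    "filter_cells": {"threshold": 0.5},
    "train_clf": {"model_type": "logistic", "n_iter": 100},
})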

Data Inputs

Data inputs are fundamentally different from parameters. They come in several forms:

| Input Type | Example | Storage |
|---|---|---|
| Upstream run | Filtered data from preprocessing run #47 | FK to runs table via source_run_id |
| Scalar | sample_id = "Pa16c" | JSONB value |
| List | features = ["CycD1", "p27"] | JSONB array |
| File | feature_set.txt | Copied into run directory, path in JSONB |

All data inputs use the run_inputs table with a JSONB value column, avoiding the need for separate tables per input type. Upstream inputs additionally set source_run_id as a foreign key for lineage tracking.

Adding Tracked Functions Later

Because parameters are stored as rows (not columns), adding a new tracked function to an existing method is just INSERT operations into tracked_functions and param_defs. No schema migration needed. Old runs retain their original parameter sets; new runs include the additional function's parameters.
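
A sketch of what that looks like in practice, assuming psycopg 3 and the table and column names from the schema above:

import psycopg  # assumes psycopg 3 and a reachable Postgres instance

def add_tracked_function(conn, method_id, function_name, source_file, ordinal, params):
    # params: iterable of (param_name, param_type, default_value)
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO tracked_functions (method_id, function_name, source_file, ordinal)"
            " VALUES (%s, %s, %s, %s) RETURNING id",
            (method_id, function_name, source_file, ordinal),
        )
        tf_id = cur.fetchone()[0]
        for name, ptype, default in params:
            cur.execute(
                "INSERT INTO param_defs (tracked_function_id, param_name, param_type, default_value)"
                " VALUES (%s, %s, %s, %s)",
                (tf_id, name, ptype, default),
            )
    conn.commit()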

Output Contracts

Every method in a module must produce output that conforms to the module's output contract. Contracts define required columns, types, and constraints for the output data.

For example, all methods in the data_preprocessing module must output a table with cell_id (str), condition (str), and quality_score (float). This is validated when a run saves its results.
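
One way such a check could be implemented (a sketch assuming pandas output and a contract loaded from output_contracts rows; validate_output is a hypothetical helper):

import pandas as pd

# Mirrors output_contracts rows for the data_preprocessing module
CONTRACT = [
    ("cell_id", "str", True),
    ("condition", "str", True),
    ("quality_score", "float", True),
]

TYPE_CHECKS = {
    "str": pd.api.types.is_string_dtype,
    "float": pd.api.types.is_float_dtype,
}

def validate_output(df: pd.DataFrame) -> list[str]:
    errors = []
    for field, ftype, required in CONTRACT:
        if field not in df.columns:
            if required:
                errors.append(f"missing required column: {field}")
        elif not TYPE_CHECKS[ftype](df[field]):
            errors.append(f"column {field} is not of type {ftype}")
    return errors  # a non-empty list means the save fails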

Contracts serve two purposes:

  • Validation — Catch bad output before it propagates downstream. If a preprocessing method drops the cell_id column, the save fails immediately.
  • Compatibility — When the Canvas wires modules together, it checks that the upstream module's output contract matches what the downstream method expects. This enables automatic connection suggestions in the visual DAG.

Method Lifecycle

Authoring Contract

Users write methods as plain Python — no special notebook format required. Any editor works (VS Code, Jupyter, vim, marimo). The contract is simple: provide two files and an environment.

The Method Directory

Each method lives in a self-contained directory:

  • helper.py — Reusable functions (data loaders, filters, model builders, save logic). Plain Python module, importable normally.
  • run.py — One main function that orchestrates the pipeline by calling helpers.
  • pixi.toml + pixi.lock — Environment specification. Copied from the user's selected environment at registration time.

Example

import helper

def run(input_path: str, output_dir: str = "./output"):
    adata = helper.load_adata(input_path)

    # Parameters for tracked functions are captured by the system
    filtered = helper.filter_cells(adata, threshold=0.5)
    model = helper.train_classifier(filtered.X, filtered.obs["label"],
                                     model_type="logistic", n_iter=100)

    helper.save_results(model, filtered, output_dir=output_dir)
    return {"n_cells": len(filtered), "accuracy": model.score}

The code is plain Python. Which functions are tracked (and which of their parameters are captured) is determined during registration in the Canvas wizard — not in the code itself.

Why not require annotations in code? The value is in the registration wizard + the database — not in how the user writes code. Users develop however they want, then register via the Canvas. The wizard does AST scanning to extract function signatures automatically.

Registration Wizard (4-Step)

The Workflow Canvas provides a guided wizard for registering methods. The user never writes config files manually — the Canvas parses the code and lets them point-and-click.

Step 1: Select Files

User selects helper.py, run.py, and picks a pixi environment. The Canvas validates the files are parseable Python.

Step 2: Pick Tracked Functions

The Canvas parses both files with AST, extracts all function signatures (names, parameters, types, defaults), and presents them with checkboxes. The user selects which functions matter for tracking.
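
The extraction itself needs only the standard library; a minimal sketch (extract_signatures is a hypothetical name, and the real wizard presumably also handles keyword-only arguments and docstrings):

import ast

def extract_signatures(path: str):
    """Return (function_name, [(param, annotation, default)]) for each top-level function."""
    with open(path) as fh:
        tree = ast.parse(fh.read(), filename=path)
    sigs = []
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            args = node.args.args
            defaults = [None] * (len(args) - len(node.args.defaults)) + list(node.args.defaults)
            params = [
                (a.arg,
                 ast.unparse(a.annotation) if a.annotation else None,
                 ast.unparse(d) if d is not None else None)
                for a, d in zip(args, defaults)
            ]
            sigs.append((node.name, params))
    return sigs

# extract_signatures("helper.py") might yield:
# [("filter_cells", [("adata", None, None), ("threshold", "float", "0.5")]), ...]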

Step 3: Define Outputs & Inputs

The user declares what the method produces (output artifacts with names and types) and what data it consumes. Data inputs can be:

  • Upstream module — data from a run of another module (typed, validated against output contracts)
  • User-supplied — scalars, lists, files, or identifiers provided at run time (e.g., sample ID, feature set)

Step 4: Review & Register

The Canvas shows a summary: tracked functions, extracted parameters with defaults, declared outputs, and data inputs. The user can go back to any step before confirming. On "Register," the Canvas writes to the database and copies the pixi environment into the method directory.

No parameters are needed at registration. The wizard extracts everything from code. Parameters are supplied later when configuring a run in the Canvas.

Versioning & Git Integration

Methods evolve. A cell cycle phase annotation might start with a GMM model, switch to SVM, then move to XGBoost with new features. The system tracks every version with full traceability.

Version Lifecycle

When a user modifies run.py or helper.py and re-registers through the Canvas wizard, the version increments automatically. Each version captures:

  • The exact source code (via git commit hash)
  • The tracked functions and their parameter schemas
  • The frozen dependencies (pixi.lock)
  • A human-readable description of what changed

The Methods Monorepo

All methods live in a git-tracked monorepo. Each method is a subdirectory under its module:

methods/
  data_preprocessing/
    mIF_ploidy_filtering/
      helper.py
      run.py
      pixi.toml
      pixi.lock
  feature_selection/
    gas_brake_feature_sets/
      ...
  classification/
    calibrated_classifier/
      ...

Auto-Commit on Registration

When the Canvas registers a new version, it auto-commits:

  1. Writes updated files to the method directory
  2. Stages only that directory: git add methods/classification/calibrated_classifier/
  3. Commits with a structured message: method: calibrated_classifier v3 - XGBoost + CycD1 feature
  4. Tags the commit: git tag calibrated_classifier/v3
  5. Stores the git hash and tag in method_versions

Why auto-commit? If users commit manually, they might forget, or bundle unrelated changes. Auto-commits ensure every registered version has a clean, atomic, 1:1 commit. Users who want to manually edit code use Exploratory Mode — the Canvas handles git when they save a new version.

Git Policy

  • Tagged commits on main (not per-method branches) keep history linear
  • git log -- methods/classification/calibrated_classifier/ gives the full version history of any method
  • No model artifacts or large binary files in the repo — those go in .runs/
  • A .gitignore at the repo root excludes *.joblib, *.h5, *.tif, data/, and other heavyweight patterns
  • No source code is stored in the database — git is the source of truth for code, the DB stores the pointer

Version Selection in the Canvas

When wiring up a pipeline in the Canvas, each method node has a version selector. Different pipeline configurations can use different versions of the same method — enabling side-by-side comparison of results from "the old way" vs. "the new way."

Structured Diffs

The Canvas provides two layers of diff when comparing versions:

  • Code diff — from git diff tag/v2 tag/v3 -- method_dir/. Raw line-level changes.
  • Semantic diff — from the database. Shows parameter changes (added/removed/modified), tracked function changes, and dependency changes in a structured, human-readable format.

A researcher can glance at the semantic diff and immediately see "v3 switched from SVM to XGBoost and added CycD1 as a feature" without parsing code.
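
The semantic diff falls out of the row-based parameter schema; a sketch, assuming "function.param" -> default maps built from the param_defs rows of the two versions:

def semantic_diff(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "modified": sorted(k for k in set(old) & set(new) if old[k] != new[k]),
    }

# v2 -> v3 of the classifier example:
print(semantic_diff(
    {"train_clf.model_type": "svm"},
    {"train_clf.model_type": "xgboost", "train_clf.features": "CycD1"},
))
# {'added': ['train_clf.features'], 'removed': [], 'modified': ['train_clf.model_type']}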

Execution

Execution Plans & Templates

A workflow template defines what methods to run and how they connect. An execution plan defines which samples to run them on and with which parameter branches. The Canvas builds templates; the user parameterizes them into execution plans; Nextflow runs them.

Workflow Templates

Templates are the DAG definitions created in the Canvas builder — stored as JSONB in the workflow_templates table. Each template captures the steps, connections, and method versions.

The Canvas lists available templates by querying the database. Templates are versioned via a content_hash — if the definition changes, the hash changes, and execution plans that reference the old version retain their original template snapshot.
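
One plausible way to derive content_hash from the JSONB definition (a sketch, not a mandated canonicalization):

import hashlib
import json

def template_content_hash(definition: dict) -> str:
    # Canonical JSON (sorted keys, no whitespace) so logically identical
    # definitions always hash to the same value
    canonical = json.dumps(definition, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()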

Parameterization & Fan-Out

From a template, the user creates an execution plan by specifying:

  • Data sources — which samples to process (e.g., ["Pa16c", "CFPAC", "HPAC"]). Each sample creates an independent path through the pipeline.
  • Parameter branches per step — one or more parameter sets for each step in the template. Multiple param sets at a step create a fork in the pipeline.

The total run count is the product of all branches: 2 samples × 1 QC config × 2 ploidy thresholds × 1 labeling config × 2 analysis configs = 8 runs.
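
Enumerating the fan-out is a Cartesian product; a sketch with hypothetical step names and parameter branches:

from itertools import product

samples = ["Pa16c", "CFPAC"]
branches = {  # one or more param sets per step
    "qc": [{"strictness": "strict"}],
    "ploidy": [{"dna_norm_max": 2.0}, {"dna_norm_max": 2.5}],
    "labeling": [{"marker": "pRB"}],
    "analysis": [{"model": "logistic"}, {"model": "elastic_net"}],
}

runs = [
    {"sample": s, **dict(zip(branches, combo))}
    for s in samples
    for combo in product(*branches.values())
]
assert len(runs) == 8  # 2 samples × 1 × 2 × 1 × 2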

Run Tree Preview

Before launching, the Canvas shows the full run tree — every (sample, param) combination that will be executed. The preview includes:

  • Total run count
  • Estimated time (from CRITICAL markers in method workflow docs)
  • Duplicate detection — runs with matching param_set + inputs that already completed
  • A visual tree of all paths with parameters at each fork

Execution Plan Lifecycle

| Status | Meaning | Canvas Action |
|---|---|---|
| draft | Template + params chosen, not yet submitted | Edit parameters, preview run tree |
| submitted | Nextflow workflow generated and launched | View progress |
| running | Nextflow processes executing | Live status updates per run |
| completed | All runs finished successfully | Browse results, compare branches |
| partial | Some runs completed, some failed | Inspect failures, Resume |
| failed | Pipeline-level failure | View error, Resume or Re-run |

Resume works by calling nextflow run pipeline.nf -resume with the stored session ID. Nextflow skips completed processes natively.

Nextflow Integration

Each registered method directory is a self-contained executable unit that maps directly to a Nextflow process:

| Method Artifact | Nextflow Mapping |
|---|---|
| Method directory | Nextflow process |
| run.py | Process script (python run.py --args) |
| helper.py | Imported by run.py (comes along for the ride) |
| pixi.toml | Conda/pixi directive (local dev) |
| Docker image | Container directive (production/HPC) |
| Canvas DAG edges | Nextflow channel connections |

The Workflow Canvas exports the visual DAG as a Nextflow .nf file. Each node becomes a process, edges become channels, and the Canvas parameterization UI maps to Nextflow params.

The pm CLI Bridge

Each Nextflow process calls the pm CLI at two points — start and finish — to register run metadata in the database and record artifacts:

# At process START — register the run, get back a run ID
RUN_ID=$(python -m pm.register_run \
    --plan-id 42 \
    --method mIF_ploidy_filtering \
    --sample Pa16c \
    --branch "Pa16c/QC_strict/G0G1_tight" \
    --parent-run-id 17 \
    --params '{"dna_norm_max": 2.0}')

# Run the actual method
python methods/data_preprocessing/mIF_ploidy_filtering/run.py \
    --input upstream_output.parquet --output output.parquet

# At process END — record completion, artifacts, metrics
python -m pm.complete_run \
    --run-id $RUN_ID \
    --status completed \
    --output output.parquet \
    --metrics '{"n_cells": 9102, "pct_filtered": 18.3}'

Nextflow handles execution; pm handles bookkeeping. Nextflow manages fan-out, parallelism, resumability, and dependency ordering. The pm CLI registers each run in the database so the Canvas can display lineage, compare parameters, and track provenance.

Two execution modes: During development, users run methods interactively via the Canvas (which calls pixi run python run.py). For production, the Canvas exports the full pipeline as a Nextflow workflow that runs on HPC or cloud infrastructure.

Duplicate Run Detection

Before executing, the system checks: does a completed run exist with the same method_version_id, param_set_id, and identical run_inputs? If so, the user gets a warning with the option to:

  • Skip — reuse the existing run (default for batch execution)
  • Re-run — create a new run anyway (for reproducibility checks)

This is a "detect and warn" model, not "lock and block." Two users could independently start execution plans that overlap — the second will see a warning that some runs already exist and can choose to skip them.

Exploratory Mode

The Canvas is rigid by design — structured pipelines with tracked parameters and validated outputs. But researchers need to poke at data interactively. Exploratory Mode bridges the gap.

How It Works

From any method node in the Canvas, click "Explore". The Canvas generates a pre-wired notebook (Jupyter or marimo — user's choice) with:

  • All helper functions imported and available
  • Data loaded from the selected upstream run
  • The current run.py main function body, editable
  • An output preview / validation cell

The user tinkers — trying different thresholds, visualizing intermediate results, iterating on the pipeline. When they're done:

  • Save as new version — The Canvas registers the modified code as v(N+1) of the method, auto-commits to git, and updates the database.
  • Discard — No changes. The exploration was just for understanding.

No editor lock-in. Users can develop in any environment (VS Code, Jupyter, marimo, vim). Exploratory Mode is a convenience — a pre-wired notebook is faster than manually loading upstream data and importing helpers. But users can also just edit run.py directly and re-register through the wizard.

Storage & Integrity

Global Managed Store

Run outputs — DataFrames, images, TIF stacks, model files — are stored on the filesystem in a single global managed directory. The database stores metadata and pointers; the filesystem stores the actual data.

All run artifacts live under a global .runs/ directory, indexed by run ID:

.runs/                                # Global managed store
  00000001/                           # Run ID (auto-incrementing)
    meta.json                         # Cached run metadata (mirrors DB)
    .manifest.json                    # Checksums for integrity verification
    run.log                           # stdout/stderr captured during execution
    output.parquet                    # Primary output
    plots/
      umap.png
  00000002/
    meta.json
    .manifest.json
    run.log
    exported_tiles/
      tile_001.tif
      ...tile_3000.tif
  00000003/
    meta.json
    .manifest.json
    run.log
    classifier.joblib

This structure is flat by design. No nested module/method/sample directories to disagree about. Run IDs are unique (auto-incrementing integers from Postgres), so collisions are structurally impossible. If you want to browse by module or sample, use the views (below) or the Canvas.

Why not user-defined directory structures? This was a primary design goal — eliminate the 50-different-naming-conventions problem. The system owns the directory structure. Users never run mkdir for output directories.

The meta.json File

Each run directory includes a meta.json that mirrors the database record. This makes the store self-describing — even without the database, you can inspect any run directory and understand what it is:

{
  "run_id": 2,
  "method": "export_tiles",
  "method_version": 1,
  "params": {"lif_file": "experiment.lif", "channel": "DAPI", "tile_size": 512},
  "parent_run_id": 1,
  "status": "completed",
  "started_at": "2026-02-14T10:23:00Z",
  "finished_at": "2026-02-14T10:25:30Z",
  "outputs": ["exported_tiles/tile_001.tif", "exported_tiles/tile_002.tif"]
}

The .runs/ directory is portable — zip it, move it to another machine, and the meta.json files can reconstruct the database entries.
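
Reconstruction only needs a walk over the store; a minimal sketch (scan_store is a hypothetical name):

import json
from pathlib import Path

def scan_store(store: Path):
    """Yield the cached metadata for every run directory in the store."""
    for meta_path in sorted(store.glob("*/meta.json")):
        yield json.loads(meta_path.read_text())

# Each meta dict carries enough to rebuild runs, run_inputs, and run_outputs rows.
for meta in scan_store(Path(".runs")):
    print(meta["run_id"], meta["method"], meta["status"])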

Views & Symlinks

The views/ directory provides human-navigable projections of the managed store, generated entirely from the database:

views/
  by_sample/
    Pa16c/
      mIF_multiround_qc_QC_strict → ../../.runs/00000001
      mIF_ploidy_filtering_G0G1_tight → ../../.runs/00000002
    CFPAC/
      mIF_multiround_qc_QC_strict → ../../.runs/00000004
  by_method/
    mIF_multiround_qc/
      Pa16c_QC_strict → ../../.runs/00000001
      CFPAC_QC_strict → ../../.runs/00000004
  by_plan/
    2026-02-14_full_comparison/
      Pa16c_QC_strict_G0G1_tight → ../../.runs/00000002
  latest/
    mIF_multiround_qc → ../../.runs/00000007
    mIF_ploidy_filtering → ../../.runs/00000008

Views are disposable. Delete the entire views/ directory and regenerate it from the database in seconds. The managed store is unaffected. Custom view generators can be added without changing the storage model.
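
A sketch of one view generator, assuming (sample, method, branch, run_id) rows queried from the database (rebuild_by_sample is a hypothetical name):

from pathlib import Path

def rebuild_by_sample(rows, views: Path):
    for sample, method, branch, run_id in rows:
        link = views / "by_sample" / sample / f"{method}_{branch}"
        link.parent.mkdir(parents=True, exist_ok=True)
        if link.is_symlink():
            link.unlink()
        # Relative target; three levels up when views/ and .runs/ are siblings
        link.symlink_to(Path("../../../.runs") / f"{run_id:08d}")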

Per-Run Logging

Every run captures its stdout/stderr to a run.log file in the run directory. When a run fails, click the node in the Canvas to see the full traceback.

Logging is implemented at two levels:

  • File-based — .runs/{id}/run.log captures everything the process printed. Always available, zero overhead.
  • Structured (DB) — The run_logs table stores parsed log entries (level, message, timestamp). The Canvas can filter by level, search across runs, and show warnings/errors inline on DAG nodes.

The execution engine pipes stdout/stderr to the log file and optionally parses Python logging-formatted lines into the run_logs table.
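
A sketch of the optional parsing step (the regex assumes a timestamped logging format and should be adjusted to the actual formatter in use):

import re

LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}\S*)\s+"
    r"(?P<level>DEBUG|INFO|WARNING|ERROR|CRITICAL)\s+(?P<msg>.*)$"
)

def parse_run_log(path):
    entries = []
    with open(path) as fh:
        for line in fh:
            m = LINE.match(line.rstrip("\n"))
            if m:  # unparseable lines stay file-only
                entries.append((m["ts"], m["level"], m["msg"]))
    return entries  # one run_logs row per tuple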

Tiered Checksumming

Data inside the managed store has a verified chain of custody. Once it leaves the system (via export or manual copy), it's the user's responsibility. The system doesn't prevent tampering — it detects it.

Three Trust Zones

| Zone | Location | Protection |
|---|---|---|
| Verified | .runs/ | Read-only permissions + checksum manifests. Only the execution engine writes here. |
| Browsable | views/ | Symlinks to read-only targets. Can't modify through symlinks. |
| User | Exported copies | User-owned. Includes provenance manifest for optional verification. |

Read-Only After Completion

When a run finishes, the engine locks the run directory:

  1. Hash output files and write .manifest.json
  2. Set all files to read-only (chmod 444 / NTFS deny-write ACL)
  3. Update the database with artifact paths, sizes, and checksums

Tiered Hashing

Not all outputs are created equal. A single output.parquet file is cheap to hash. A directory of 3,000 TIF tiles is not. Three tiers:

| Tier | What It Checks | Speed (3K TIFs) | When Used |
|---|---|---|---|
| Directory fingerprint | File count + names + sizes | < 1 second | Every time: finalization, upstream load, verify |
| Spot-check sampling | Full SHA-256 of N random files | ~5–10 seconds (N=10) | At finalization + thorough verify |
| Full content hash | SHA-256 of every file | ~5–15 minutes | Only on full verify or before publication |

The threshold: ≤50 files → full hash everything. >50 files → directory fingerprint + 10 random spot-checks.
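
A sketch of the first two tiers (directory_fingerprint and spot_check are hypothetical names; the manifest maps relative paths to SHA-256 digests):

import hashlib
import random
from pathlib import Path

def directory_fingerprint(root: Path) -> str:
    # Tier 1: file count + names + sizes, no content reads
    entries = sorted(
        f"{p.relative_to(root)}|{p.stat().st_size}"
        for p in root.rglob("*") if p.is_file()
    )
    return hashlib.sha256("\n".join(entries).encode()).hexdigest()

def spot_check(root: Path, manifest: dict, n: int = 10) -> list:
    # Tier 2: full SHA-256 of n randomly sampled files from .manifest.json
    failures = []
    for rel in random.sample(sorted(manifest), min(n, len(manifest))):
        digest = hashlib.sha256((root / rel).read_bytes()).hexdigest()
        if digest != manifest[rel]:
            failures.append(rel)
    return failures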

Upstream Verification

The execution engine verifies upstream data before every run. If a checksum fails, the run refuses to start:

ERROR: Upstream run 00000003 has been tampered with.
  File: output.parquet
  Expected checksum: sha256:a3f7c2...
  Actual checksum:   sha256:9b1e44...

  Options:
    1. Re-run upstream step to regenerate clean data
    2. Override with reason (logged in DB)

Override with audit trail. The system isn't a prison. Researchers can bypass integrity checks with a documented override, which logs the reason, the user, and the timestamp in the database. Every result generated through the normal flow has a verified chain of custody; overrides are clearly marked.

Error Handling

Database operations are ACID-transactional: they either fully commit or fully roll back. The main risk is a split between the filesystem and the database, where the DB commits but the artifact write fails, or vice versa.

Write-First Strategy

The execution engine follows a strict order:

  1. Write artifacts to .runs/{id}/ — the filesystem operation happens first
  2. Hash and lock the artifacts — create .manifest.json, set read-only
  3. Commit to the database — insert run_outputs, update runs.status = 'completed'

If the DB commit fails after artifacts are written, the orphan files in .runs/ are harmless — they can be detected and cleaned up (run directories with no matching DB record). This is simpler and safer than the reverse, where a DB record points to missing files.

Failed Runs

If run.py crashes mid-execution:

  • The run directory may contain partial outputs
  • The DB record stays at status = 'running'
  • The run.log file contains the traceback
  • No .manifest.json is created (finalization didn't happen)
  • A cleanup sweep can detect stale running records with no heartbeat and mark them failed

Architecture

Now that all the pieces are described, here's how they connect:

graph TB
    subgraph Authoring["Method Authoring"]
        HP[helper.py — functions]
        RP[run.py — main function]
        PX[pixi.toml — environment]
    end
    subgraph Frontend["Workflow Canvas"]
        WIZ[Registration Wizard]
        WC[Visual DAG Builder]
        PARAM[Run Parameterization]
        MON[Execution Monitor]
    end
    subgraph Registration["Registration Wizard Steps"]
        S1["1. Select Files"]
        S2["2. Pick Tracked Functions"]
        S3["3. Define Outputs"]
        S4["4. Review & Register"]
        S1 --> S2 --> S3 --> S4
    end
    subgraph DB["Postgres (ACID)"]
        MOD[modules]
        MTH[methods]
        MV[method_versions]
        TF[tracked_functions]
        PD[param_defs]
        PS[param_sets]
        RN[runs]
        RI[run_inputs]
        RO[run_outputs]
        RL[run_logs]
        OC[output_contracts]
        WT[workflow_templates]
        EP[execution_plans]
    end
    subgraph Execution
        NF[Nextflow]
        CLI["pm CLI bridge"]
        PIX[Pixi environments]
        DK[Docker containers]
    end
    subgraph Storage["Artifact Storage"]
        RS[".runs/ global store"]
        VW["views/ symlinks"]
        MF[".manifest.json checksums"]
        LOG["run.log per run"]
    end
    subgraph Docs
        DF[dFlow]
    end

    HP & RP & PX --> WIZ
    WIZ --> S1
    S4 -->|writes| MOD & MTH & MV & TF & PD & OC
    WC -->|reads/writes| WT
    PARAM -->|creates| EP
    MON -->|reads| RN & RL
    EP -->|generates| NF
    NF --> PIX & DK
    NF -->|each process calls| CLI
    CLI -->|registers & completes| RN
    CLI -->|writes artifacts| RS
    RS ---|integrity| MF
    RS -->|generates| VW
    RS ---|captures| LOG
    HP & RP -.->|scans| DF
    DF -->|generates| WMD[".workflow.md"]

    style DB fill:#198754,color:#fff
    style NF fill:#4361ee,color:#fff
    style Storage fill:#fd7e14,color:#fff

MVP Prototype

The MVP proves one thing: define a branched pipeline in the Canvas, fan it out across samples, execute it via Nextflow, and trace any result back through the full lineage tree. Everything else is polish.

Scope & Non-Goals

The demo

3 Python scripts (same pixi environment), 2 samples. The Canvas shows a 3-step pipeline. The user hits Run and provides the sample list. The system generates a .nf file, Nextflow executes all 6 runs (3 steps × 2 samples), artifacts land in .runs/, the DB records everything, and the Canvas shows a lineage tree where clicking run 6 traces back through runs 4 and 2.

In scope

  • Postgres DB (reduced tables)
  • Nextflow .nf file generation from Canvas DAG
  • pm CLI bridge (register_run / complete_run)
  • Canvas: DAG builder, parameterization, execution monitor, lineage view
  • .runs/ flat artifact store with meta.json
  • Methods seeded manually (SQL insert or seed script)

Out of scope

  • Registration wizard (manual seeding instead)
  • dFlow / workflow documentation
  • Method versioning (hardcoded v1)
  • Parameter deduplication / fingerprints
  • Output contracts / validation
  • Views / symlinks
  • Integrity / checksumming
  • Exploratory mode
  • Duplicate run detection
  • Git auto-commit

Reduced Schema (7 tables)

Just enough to record methods, runs, and lineage:

erDiagram
    modules ||--o{ methods : contains
    methods ||--o{ tracked_functions : has
    tracked_functions ||--o{ param_defs : has
    methods ||--o{ runs : executes
    runs ||--o{ run_inputs : consumes
    runs ||--o{ run_outputs : produces

    modules {
        int id PK
        text name UK
        text description
    }
    methods {
        int id PK
        int module_id FK
        text name
        text script_path
    }
    tracked_functions {
        int id PK
        int method_id FK
        text function_name
        int ordinal
    }
    param_defs {
        int id PK
        int tracked_function_id FK
        text param_name
        text param_type
        text default_value
    }
    runs {
        int id PK
        int method_id FK
        jsonb params
        text sample
        text status
        text nf_process_name
        timestamp started_at
        timestamp finished_at
    }
    run_inputs {
        int id PK
        int run_id FK
        int source_run_id FK
        text input_name
        text artifact_path
    }
    run_outputs {
        int id PK
        int run_id FK
        text output_name
        text artifact_path
        text artifact_type
    }

MVP Simplifications

  • No param_sets — params stored directly as JSONB on runs. Deduplication can be added later.
  • No method_versions — implicit v1. The FK can be added later without changing the rest.
  • No execution_plans or workflow_templates — the Canvas stores the DAG in memory and generates the .nf directly. Templates saved to DB come later.
  • runs.sample — denormalized sample name so lineage queries are simple. No separate samples table.
  • runs.nf_process_name — links the DB record to the Nextflow process for log correlation.

Nextflow Generation

The system generates a concrete, disposable .nf file for each execution. No Nextflow params, no config files, no reusable templates. Everything is hardcoded. The .nf file is a throwaway build artifact; the DB is the record of truth.

Generation Strategy

The Canvas DAG has N method nodes with edges. The user provides a sample list. The generator produces a DSL2 .nf file where:

  1. Each method → one Nextflow process
  2. The sample list → a Channel.fromList()
  3. Each edge in the DAG → a channel connecting output to input
  4. Params for each process are hardcoded in the script block
  5. Each process script calls the pm CLI at start and end

Example: Generated .nf file

Given a Canvas DAG of preprocess → filter → label with samples ["Pa16c", "CFPAC"]:

nextflow.enable.dsl = 2

process preprocess {
    input:
        val sample
    output:
        tuple val(sample), path("output.parquet")
    script:
    """
    RUN_ID=\$(python -m pm.register_run \\
        --method preprocess \\
        --sample ${sample})

    pixi run python methods/data_preprocessing/preprocess/run.py \\
        --sample ${sample} \\
        --output output.parquet \\
        --threshold 0.5

    python -m pm.complete_run \\
        --run-id \$RUN_ID \\
        --status completed \\
        --output output.parquet
    """
}

process filter_cells {
    input:
        tuple val(sample), path(input_file)
    output:
        tuple val(sample), path("filtered.parquet")
    script:
    """
    PARENT_ID=\$(python -m pm.lookup_run \\
        --method preprocess --sample ${sample})

    RUN_ID=\$(python -m pm.register_run \\
        --method filter_cells \\
        --sample ${sample} \\
        --parent-run-id \$PARENT_ID)

    pixi run python methods/data_preprocessing/filter_cells/run.py \\
        --input ${input_file} \\
        --output filtered.parquet \\
        --min_quality 0.8

    python -m pm.complete_run \\
        --run-id \$RUN_ID \\
        --status completed \\
        --output filtered.parquet
    """
}

process label {
    input:
        tuple val(sample), path(input_file)
    output:
        tuple val(sample), path("labeled.parquet")
    script:
    """
    PARENT_ID=\$(python -m pm.lookup_run \\
        --method filter_cells --sample ${sample})

    RUN_ID=\$(python -m pm.register_run \\
        --method label \\
        --sample ${sample} \\
        --parent-run-id \$PARENT_ID)

    pixi run python methods/data_labeling/label/run.py \\
        --input ${input_file} \\
        --output labeled.parquet

    python -m pm.complete_run \\
        --run-id \$RUN_ID \\
        --status completed \\
        --output labeled.parquet
    """
}

workflow {
    samples_ch = Channel.fromList(["Pa16c", "CFPAC"])
    preprocess_out = preprocess(samples_ch)
    filter_out = filter_cells(preprocess_out)
    label(filter_out)
}

DAG Shapes Handled

| Shape | Example | Nextflow Mapping |
|---|---|---|
| Linear chain | A → B → C | Channels pipe output to input sequentially |
| Fan-out | A → B, A → C | A's output channel feeds both B and C (in DSL2 a process output can be consumed by multiple downstream processes) |
| Fan-in | B → D, C → D | D takes a merged channel from B and C (via mix or join) |
| Diamond | A → B → D, A → C → D | Combination of fan-out and fan-in |

One .nf per execution. No attempt at reuse. If the user changes one parameter and re-runs, a brand new .nf is generated. Old .nf files are kept in .runs/plans/{timestamp}/pipeline.nf for debugging but are never re-executed.

pm CLI Bridge

A minimal Python CLI with 3 commands, called by Nextflow process scripts:

| Command | What It Does | Returns |
|---|---|---|
| pm register_run | Inserts a runs row with status='running', writes started_at | Run ID (printed to stdout) |
| pm complete_run | Updates runs.status, writes finished_at, inserts run_outputs, copies artifacts to .runs/{id}/, writes meta.json | Nothing (exit 0 on success) |
| pm lookup_run | Finds the most recent completed run for a given method + sample | Run ID (for wiring parent_run_id) |

Artifact Routing

When pm complete_run is called:

  1. Creates .runs/{run_id}/ directory
  2. Copies the output file(s) into it
  3. Writes meta.json with run metadata
  4. Captures run.log (Nextflow already captures stdout/stderr per process via .command.log — copy it)
  5. Inserts run_outputs rows with artifact paths
  6. Inserts run_inputs row with source_run_id pointing to the parent run

The run_inputs.source_run_id FK is what creates the lineage chain. Every run knows which upstream run produced its input data.
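
A minimal sketch of pm.register_run against the reduced schema (the DSN, the input_name value, and error handling are placeholders):

import argparse
import psycopg  # psycopg 3

def main():
    p = argparse.ArgumentParser()
    p.add_argument("--method", required=True)
    p.add_argument("--sample", required=True)
    p.add_argument("--parent-run-id", type=int)
    p.add_argument("--params", default="{}")
    args = p.parse_args()

    with psycopg.connect("dbname=pm") as conn, conn.cursor() as cur:
        cur.execute("SELECT id FROM methods WHERE name = %s", (args.method,))
        (method_id,) = cur.fetchone()
        cur.execute(
            "INSERT INTO runs (method_id, params, sample, status, started_at)"
            " VALUES (%s, %s::jsonb, %s, 'running', now()) RETURNING id",
            (method_id, args.params, args.sample),
        )
        (run_id,) = cur.fetchone()
        if args.parent_run_id:
            cur.execute(
                "INSERT INTO run_inputs (run_id, source_run_id, input_name)"
                " VALUES (%s, %s, 'upstream')",
                (run_id, args.parent_run_id),
            )
    print(run_id)  # captured by the Nextflow script as RUN_ID

if __name__ == "__main__":
    main()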

Canvas MVP

The Canvas for the MVP has 4 views, all reading from the same Postgres database:

1. Method Palette

Sidebar listing all registered methods (from GET /methods). Shows method name, module, and parameter list. No registration wizard — methods are seeded into the DB manually or via seed script.

2. DAG Builder

LiteGraph.js canvas where the user drags methods from the palette, places them as nodes, and draws edges to connect outputs to inputs. Each node shows its parameter fields (from param_defs) with editable values. A "Samples" input at the top lets the user type a comma-separated list of sample names.

3. Execution Monitor

After hitting "Run":

  1. The Canvas sends the DAG + params + samples to POST /execute
  2. The backend generates the .nf file and launches nextflow run pipeline.nf
  3. The Canvas polls GET /runs?execution_id=X and updates node badges: ⏳ pending → 🔄 running → ✅ completed / ❌ failed
  4. Clicking a node shows its run.log and output file listing

4. Lineage View

The key view. Click any completed run and see its full ancestry:

graph LR
    R1["Run 1\npreprocess\nPa16c"] --> R3["Run 3\nfilter_cells\nPa16c"]
    R3 --> R5["Run 5\nlabel\nPa16c"]
    R2["Run 2\npreprocess\nCFPAC"] --> R4["Run 4\nfilter_cells\nCFPAC"]
    R4 --> R6["Run 6\nlabel\nCFPAC"]

    style R5 fill:#198754,color:#fff,stroke-width:3px
    style R6 fill:#198754,color:#fff,stroke-width:3px

Clicking Run 5 highlights the chain: Run 5 ← Run 3 ← Run 1. The panel shows: sample = Pa16c, method = label, params = {…}, input came from Run 3 (filter_cells), which came from Run 1 (preprocess). Full provenance in one click.

The lineage query is a recursive CTE on run_inputs.source_run_id:

WITH RECURSIVE lineage AS (
    -- Start from the clicked run
    SELECT r.id, r.method_id, r.sample, r.status, ri.source_run_id
    FROM runs r
    LEFT JOIN run_inputs ri ON ri.run_id = r.id
    WHERE r.id = 5   -- clicked run

    UNION ALL

    -- Walk up the chain
    SELECT r.id, r.method_id, r.sample, r.status, ri.source_run_id
    FROM runs r
    LEFT JOIN run_inputs ri ON ri.run_id = r.id
    JOIN lineage l ON l.source_run_id = r.id
)
SELECT * FROM lineage;

Lineage: The Core Feature

Lineage isn't a feature — it's THE feature. Everything else (the Canvas, Nextflow, the DB) exists to make this one thing work: click a result, see exactly how it was produced.

For each run in the lineage chain, the panel shows:

  • Method — what code ran
  • Sample — what data it processed
  • Parameters — every setting that was used
  • Input — which upstream run produced its input (clickable link)
  • Output — what files it produced (viewable/downloadable)
  • Timestamps — when it started and finished
  • Status — completed / failed
  • Log — full stdout/stderr

This is what you show your coworkers. Not the schema, not the Nextflow DSL — the lineage view. "I clicked this result and I can see every step and parameter that produced it."

Build Phases

| Phase | What | Deliverable |
|---|---|---|
| 1. DB + pm CLI | Postgres schema (7 tables), SQLAlchemy models, pm register_run / complete_run / lookup_run, seed script with 3 methods | Run pm register_run --method preprocess --sample Pa16c from the terminal, see the row in the DB |
| 2. Nextflow gen + execution | Python function that takes a DAG definition (JSON) + sample list + params and generates a .nf file; run it manually with nextflow run pipeline.nf while the pm CLI populates the DB | 3 scripts × 2 samples = 6 runs in the DB with a full lineage chain; query with the recursive CTE and see the tree |
| 3. FastAPI + Canvas | REST endpoints for methods, runs, lineage; Canvas frontend: drag methods, wire them, set params, type samples, hit Run, poll for status, click a node to see lineage | The full demo: build a pipeline visually, run it, trace any result |

Phase 1 and 2 have no UI. They're backend-only and testable from the terminal. This means you can validate that the DB schema, pm CLI, and Nextflow generation all work correctly before touching any frontend code. The Canvas (Phase 3) is a skin over a working system.

What Exists Today

| Component | Status | Notes |
|---|---|---|
| Pipeline notebooks | Done | Regionprops → QC → Ploidy → Labeling → Analysis |
| dFlow comment-based | Done | Generates .workflow.md from Step markers |
| dFlow decorators | WIP | @workflow, @task, WorkflowRegistry, AST scanning |
| Workflow Canvas frontend | WIP | LiteGraph.js + FastAPI, basic DAG rendering |
| Registration wizard (Canvas) | Planned | 4-step guided flow: files → functions → outputs → register |
| Postgres database | Planned | 15-table schema: modules, methods, method_versions, runs, etc. |
| Parameter deduplication | Planned | Fingerprint-based param_sets, duplicate run detection |
| Output contracts | Planned | Per-module output shape validation |
| Run parameterization (Canvas) | Planned | Configure runs in the Canvas UI with upstream data selection |
| Nextflow export | Planned | Canvas DAG → .nf file, pixi/Docker execution |
| Method versioning + git | Planned | Auto-commit per version, git tags, structured diffs |
| Exploratory Mode | Planned | Spawn pre-wired notebooks from Canvas for interactive editing |
| Execution plans | Planned | Template + data sources + param branches → Nextflow run tree |
| Workflow templates | Planned | Canvas DAG stored as JSONB in workflow_templates table |
| pm CLI bridge | Planned | register_run / complete_run called by each Nextflow process |
| Artifact storage (.runs/) | Planned | Global managed flat store with meta.json, views/ symlinks |
| Data integrity | Planned | Tiered checksumming, read-only lock, upstream verification |
| Per-run logging | Planned | run.log files + structured run_logs table |

Comparisons

This system isn't trying to replace any single tool — it fills the gaps between them.

| Tool | What It Does Well | Gap This Fills |
|---|---|---|
| W&B | Beautiful dashboards, team collaboration | Cloud-only, no notebook-native workflow, no lineage chains |
| Airflow / Prefect | Production orchestration, scheduling | Too heavy for exploratory research; DAGs defined in code, not visually |
| Nextflow | HPC pipeline execution, reproducibility | Doesn't help with interactive exploration or parameter discovery — we use it as the execution engine, not the brain |
| KNIME | Visual pipeline building, no-code | Can't integrate with custom Python notebooks, limited extensibility |
| CellProfiler | Image analysis pipelines | Domain-specific, no general-purpose downstream tracking |

Key differentiator: This system is built around the researcher's existing workflow. You write plain Python, register via a visual wizard, and the tracking happens automatically. The Canvas is the brain (design, parameterize, monitor); Nextflow is the muscle (execute, parallelize, resume); Postgres is the memory (lineage, parameters, integrity).

Future Features

Tests Per Method

Attach test suites directly to methods in the registry. When a method runs, its tests run automatically. Three levels:

  • Contract tests — Does the output match the declared schema?
  • Sanity tests — Row count > 0, no NaN in label columns, no duplicate cell IDs
  • Domain tests — Proliferation rate between 5% and 80%, intensity values in expected range

This turns every pipeline step into a validated, testable unit without requiring researchers to write test frameworks from scratch.

CLI Interface

An optional command-line interface for power users who want to script operations outside the Canvas GUI. This is not a priority — the Canvas is the primary interface — but could be added later for automation and scripting.