Concepts
Five Node Classes
Every node in the DAG is classified into one of five semantic roles:
Class |
Role |
Examples |
|---|---|---|
Source |
Data acquisition scripts |
|
Input |
Raw data and configuration |
|
Processing |
Transform and analysis scripts |
|
Output |
Intermediate and final data products |
|
Claim |
Manuscript assertions tied to evidence |
|
This classification turns the DAG into a navigable map of the research project. The key operation is backpropagation from claims to sources: starting from a manuscript assertion (claim), Clew traces backward through outputs, processing scripts, and inputs to the original raw data — verifying every hash along the way.
Three Verification Modes
Mode |
Scope |
API |
Description |
|---|---|---|---|
Project |
Entire pipeline |
|
Verifies every session in the database in topological order. “Is the whole project intact?” |
Files |
Specific outputs |
|
Traces backward from target files through their dependency chain. “Can I trust this specific file?” |
Claims |
Manuscript assertions |
|
Verifies claims linked to source sessions. “Is this figure still backed by the data?” |
DAG as Research Logic
Beyond verification, the DAG itself is valuable: it is the simplest formal representation of a project’s research logic. Each node is a concrete artifact (script, data file, or claim) and each edge is a recorded dependency. This skeleton structure lets you:
Understand a project at a glance — which scripts produce which outputs
Navigate from any claim back to the raw data that supports it
Communicate project structure to collaborators and reviewers
Enable AI agents to reason about the research pipeline programmatically
The Mermaid diagram (clew.mermaid()) renders this logic as a visual flowchart,
making the implicit structure of any research project explicit and inspectable.
Figure 2. The DAG is both a verification tool and a structural map of research logic — from raw data through processing to manuscript claims.
How It Works
Recording: During a session, Clew computes SHA-256 hashes of all input and output files, storing them alongside session metadata in a local SQLite database.
Verification: At any point, you can verify a session by recomputing hashes and comparing them to the recorded values. If any file has changed, the verification fails.
DAG Construction: Sessions are linked by shared files — when one session’s output is another session’s input, Clew infers a dependency edge. This builds a complete DAG of the research project.
Provenance Tracing: Given any file, Clew can trace its complete lineage — which sessions produced it, which inputs were consumed, and whether the chain is intact.
Claim Verification: Manuscript assertions (e.g., “Figure 1 shows p < 0.05”) are linked to the sessions that produced the evidence. Verification ensures the claim is still supported by the data.
Architecture
┌─────────────────────────────────────────────────┐
│ scitex-clew │
├─────────────────────────────────────────────────┤
│ Python API (19 functions) │
│ status, run, chain, dag, rerun, mermaid, ... │
├─────────────────────────────────────────────────┤
│ CLI (Click) │ MCP (FastMCP) │
│ clew status │ clew mcp start │
│ clew verify ... │ 9 tools │
├─────────────────────────────────────────────────┤
│ Core Engine │
│ _hash.py _chain.py _dag.py _claim.py │
│ _tracker.py _stamp.py _rerun.py │
├─────────────────────────────────────────────────┤
│ Storage: SQLite (clew.db) │
│ runs, file_hashes, session_parents, claims │
└─────────────────────────────────────────────────┘
Zero dependencies — pure stdlib + sqlite3. Optional extras: click (CLI),
fastmcp (MCP server), sphinx (docs).