pcq experiment contracts

Apache-2.0 · Python · MCP-ready

Your AI agent needs evidence.
Your experiments need a contract.

pcq is the contract. cq.yaml declares the run; your training code stays yours. pcq turns every run into structured evidence — config, metrics, manifest, validation, lineage, run record — and exposes 14 Model Context Protocol tools so Claude Code, Codex, or any MCP-aware agent can read it without scraping stdout.

pcq agent install --target both --mcp

Connect

One command. Both runtimes. Identical surface.

Wire-up is the same for Claude Code, Codex, and any MCP-aware runtime — only the --target flag differs. Existing .mcp.json entries are preserved; only the pcq server is merged.

Claude Code · Codex · any MCP runtime

stdio · sse

Install & serve

uv add 'pcq[mcp]'
pcq agent install --target both --mcp     # or: --target claude / --target codex
pcq mcp serve                             # stdio (default); --transport sse --port 8765 for HTTP

.mcp.json (project root)

{
  "mcpServers": {
    "pcq": {
      "command": "pcq",
      "args": ["mcp", "serve"]
    }
  }
}

First call: inspect_project

{
  "schema_version": 1,
  "project_root": "/path/to/my-exp",
  "selected_yaml": "cq.yaml",
  "name": "sklearn-baseline",
  "cmd": "uv run python train.py",
  "metrics": ["epoch", "eval_acc"],
  "output_dir": "output",
  "artifacts": ["output/"]
}
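
These fields mirror what cq.yaml declares. As a sketch only (key names inferred from the inspect_project response above, not a canonical schema; the real minimal cq.yaml ships in examples/), the contract for this project might look like:

# cq.yaml - hypothetical sketch; see examples/ for the canonical file
name: sklearn-baseline
cmd: uv run python train.py
output_dir: output
metrics:
  - epoch
  - eval_acc
artifacts:
  - output/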

--target both writes Model Context Protocol wire-up for Claude Code and Codex in one call. Any other MCP-aware runtime works too — point it at pcq mcp serve. All 14 read-only and mutating tools are listed in the agent manifest and inlined under Tools →.

Tool responses · real

What your agent receives.

Below are four canonical MCP tool responses captured from a real run of examples/contract_sklearn. Volatile fields (timestamps, git SHA, env hashes, absolute paths) are elided as "..."; structure and values are unmodified. JSON contracts →

describe_run

read-only

Call

pcq describe-run output --json
# MCP: mcp__pcq__describe_run({"path": "output"})

Response

{
  "schema_version": 1,
  "run_id": "run_20260510_001107_c87d7d",
  "name": "sklearn-iris-contract",
  "status": "completed",
  "output_dir": "...",
  "cmd": "uv run python train.py",
  "target_metric": "eval_acc",
  "mode": "max",
  "best":  { "epoch": 0, "value": 1.0, "metrics": {"eval_acc": 1.0}, "checkpoint": "best.ckpt" },
  "best_value": 1.0,
  "best_epoch": 0,
  "last":  { "epoch": 0, "value": 1.0, "metrics": {"eval_acc": 1.0}, "checkpoint": "last.ckpt" },
  "epochs_completed": 1,
  "partial": false,
  "last_updated_at": "...",
  "git_sha": "...",
  "dirty": true,
  "python": "3.12.10",
  "platform": "Darwin-arm64",
  "metrics_declared": [{"name": "epoch"}, {"name": "eval_acc"}],
  "artifacts": [
    {"path": "config.json",      "kind": "config",  "sha256": "...", "size_bytes": 250,    "created_at": "..."},
    {"path": "metrics.json",     "kind": "metrics", "sha256": "...", "size_bytes": 74,     "created_at": "..."},
    {"path": "model.pkl",        "kind": "model",   "sha256": "...", "size_bytes": 186929, "created_at": "..."},
    {"path": "run_summary.json", "kind": "summary", "sha256": "...", "size_bytes": 586,    "created_at": "..."}
  ],
  "artifacts_summary": {"config": 1, "metrics": 1, "model": 1, "summary": 1},
  "validation_status": "pass",
  "decision_facts": {
    "run_completed": true,
    "validation_passed": true,
    "has_target_metric": true,
    "has_best": true,
    "has_parent": false,
    "artifact_count": 4,
    "metric_count": 2,
    "dirty_source": true,
    "has_lockfile": true
  }
}

compare_runs

read-only

Call

pcq compare-runs output_a output_b --json
# MCP: mcp__pcq__compare_runs({"a": "output_a", "b": "output_b"})

Response

{
  "schema_version": 1,
  "a_run_id": "run_20260510_001106_26fbda",
  "b_run_id": "run_20260510_001107_c87d7d",
  "target_metric": "eval_acc",
  "mode": "min",
  "best":  { "a": 1.0, "b": 1.0, "delta": 0.0, "direction": "tied", "epoch_a": 0, "epoch_b": 0 },
  "metric_delta": 0.0,
  "metric_direction": "tied",
  "last":  { "a": 1.0, "b": 1.0, "delta": 0.0, "direction": "tied", "epoch_a": 0, "epoch_b": 0 },
  "validation": { "a": "fail", "b": "pass", "same": false },
  "artifacts": { "a_count": 6, "b_count": 4 },
  "source":    { "same_git_sha": true, "same_cq_yaml_sha256": true, "dirty_changed": false },
  "a_status": "completed",
  "b_status": "completed",
  "a_is_ancestor_of_b": false,
  "notes": [
    "both runs picked epoch 0 as best — likely same initial weights (seed) and no improvement during training. agent should consider this a 'no learning' signal."
  ],
  "decision_facts": {
    "comparable": true,
    "best_improved": false,
    "best_tied": true,
    "candidate_completed": true,
    "candidate_validated": true,
    "config_changed": false,
    "has_lineage_relation": false
  }
}

validate_run

read-only

Call

pcq validate-run output --strictness 3 --json
# MCP: mcp__pcq__validate_run({"path": "output", "strictness": 3})

Response

{
  "schema_version": 1,
  "status": "pass",
  "strictness": 3,
  "strictness_name": "reproducible",
  "checks": [
    { "id": "manifest_evidence",           "status": "pass", "detail": "schema v2, 4 entries verified" },
    { "id": "metrics_well_formed",         "status": "pass", "detail": "1 epoch(s) recorded" },
    { "id": "summary_metrics_consistent",  "status": "pass", "detail": "run_summary best/last align with metrics history" },
    { "id": "run_record_complete",         "status": "pass", "detail": "run_record schema v1, all required keys present" },
    { "id": "run_finalized",               "status": "pass", "detail": "run finalized with status='completed'" },
    { "id": "source_reproducibility",      "status": "pass", "detail": "git_sha=..., dirty=True" },
    { "id": "environment_reproducibility", "status": "pass", "detail": "python/platform environment evidence recorded" },
    { "id": "lockfile_evidence",           "status": "pass", "detail": "lockfile recorded: uv.lock" },
    { "id": "seed_evidence",               "status": "pass", "detail": "seed recorded: 42" },
    { "id": "metrics_schema_evidence",     "status": "pass", "detail": "2 metric declaration(s) recorded" }
  ],
  "blocking_count": 0,
  "warning_count": 0
}

lineage_chain

read-only

Call

pcq lineage output --json
# MCP: mcp__pcq__lineage_chain({"path": "output"})

Response

{
  "schema_version": 1,
  "chain": [
    {
      "run_id": "run_20260510_001107_c87d7d",
      "output_dir": "...",
      "depth": 0,
      "name": "sklearn-iris-contract",
      "status": "completed",
      "target_metric": "eval_acc",
      "best_value": 1.0
    }
  ],
  "truncated": false,
  "notes": []
}

Capture procedure (reproducible): run pcq run --path examples/_sklearn_run --json twice (renaming output → output_a in between), then call the four tools against the resulting directories. Volatile fields are elided on this page only — the raw JSON your agent receives keeps every field.

Compare

How pcq sits next to MLflow, W&B, and Neptune.

These are different tools for different jobs. The table below states what each owns and exposes — not what is "better". Use whichever matches your workflow; pcq is designed to coexist (CQ service can ingest run records produced alongside any of them).

Tool | Training-loop ownership | MCP support | Self-host vs SaaS | Agent-readable surfaces
pcq | does not own — your stack stays | built-in (14 tools, v4.1.0) | self-host (OSS, Apache-2.0) | JSON / JSONL / MCP, llms.txt, agent-manifest.json
MLflow | does not own — wraps any framework | not built-in | self-host (OSS, Apache-2.0) | REST API, Python SDK
Weights & Biases | does not own — SDK callbacks per framework | not built-in | SaaS (self-host on Enterprise) | REST API, Python SDK, GraphQL
Neptune.ai | does not own — SDK integrations | not built-in | SaaS (self-host on Enterprise) | REST API, Python SDK

The row that differs most is "MCP support" — pcq is the only one in this set with first-class Model Context Protocol tools. That is the design center, not a value judgment about the others. Need full experiment dashboards or team RBAC? MLflow and W&B both ship those today; pcq does not.

Case studies · Production dogfoods

Real runs that exercised the contract end-to-end.

All four are production dogfoods — pcq's own validation cycle on real ML workloads — not external customer references. Each documents setup, friction observed, and concrete fixes that made it back into the library.

MNIST Dogfood

2026-05-08 · pcq v2.11 · MNIST · Claude Code

First end-to-end dogfood. 9 fresh agent generations, ML→DL evolution, eval_acc 0.9583 → 1.0. Surfaced the first round of friction that drove v2.12 fixes.

Read case study →

Tabular Dogfood

2026-05-09 · pcq 3.0.1 · breast-cancer · TabPFN/PyCaret/FLAML/XGBoost/sklearn

First post-PyPI install path (no git URL workaround). Different domain, framework diversity test. Validated the fresh-user uv add pcq entry point.

Read case study →

MCP Dogfood

2026-05-10 · pcq[mcp] 4.1.0 · Claude Code MCP

First v4.1.0 MCP loop end-to-end — agent operates the experiment via mcp__pcq__* tools instead of subprocess CLI. 3 sequential generations, fresh-context per gen.

Read case study →

CQ Worker Dogfood

2026-05-10 · pcq[mcp] 4.2.0 · CQ Go worker (RTX 5080)

First production CQ Go service worker dispatch end-to-end. Verified cq.yaml + CQ_CONFIG_JSON + 6-artifact protocol on real production infrastructure.

Read case study →

Roadmap

Where pcq is going (direction, not dates).

The thesis stays the same: pcq does not compete with the means of training. The work ahead strengthens the framework-neutral evidence and control layer — broader real-world contract coverage, deeper validation/lineage facts, more machine-readable surfaces for agent runtimes, and tighter integration with the CQ managed consumer. Built-in models, losses, datasets, and per-framework adapter matrices remain deliberately out of scope.

Recent (2026-05-12): pcq is now verified on Glama — stdio MCP server builds, starts, and exposes all 14 tools to the catalog. Listed on awesome-mcp-servers (PR pending).
2026-05-10: spec/ foundation landed — contract spec moved from docs/ to spec/ with versioning + conformance policies and auto-exported JSON Schemas. First brick of the 1·2·3 (spec separation · schema versioning · conformance) arc.

Why pcq

Three reasons your agent prefers pcq.

Framework-neutral

PyTorch, HF Trainer, Lightning, sklearn, XGBoost, shell, or custom Python. pcq standardizes the evidence around the run, not the training loop.

Agent-readable

JSON/JSONL CLI surfaces and 14 MCP tools — facts, not prose. --events persists live evidence for post-hoc audits.

MCP-ready

pcq[mcp] wires into Claude Code, Codex, and any MCP-aware runtime. Frozen JSON contracts since v2.13.

Examples

Same contract, any framework.

Training code stays yours. pcq only needs three calls: pcq.config(), pcq.log(...), pcq.save_all(...). The full set, including the minimal cq.yaml, a NumPy example, and the post-run command list, lives in examples/.

sklearn — RandomForest on Iris

No adapter, no Trainer subclass. Three pcq calls.

# train.py
import pcq
import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

cfg = pcq.config()
out = pcq.output_dir()
seed = cfg.get("seed", 42)
pcq.seed_everything(seed)

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=seed)

model = RandomForestClassifier(n_estimators=cfg.get("n_estimators", 100))
model.fit(X_tr, y_tr)
acc = float(model.score(X_te, y_te))

pcq.log(epoch=0, eval_acc=acc)
joblib.dump(model, out / "model.pkl")
pcq.save_all(history=[{"epoch": 0, "eval_acc": acc}],
             artifacts={"model": "model.pkl"})

PyTorch — training loop

Your own model, optimizer, dataloader. pcq sits at the boundary.

# train.py
import pcq, torch
from torch import nn

cfg = pcq.config()
out = pcq.output_dir()
pcq.seed_everything(cfg.get("seed", 42))

model = nn.Linear(cfg["in_dim"], cfg["out_dim"])
opt = torch.optim.Adam(model.parameters(), lr=cfg["lr"])

history = []
for epoch in range(cfg["epochs"]):
    train_loss = train_one_epoch(model, opt)        # your code
    val_acc = evaluate(model)                       # your code
    pcq.log(epoch=epoch, train_loss=train_loss, val_acc=val_acc)
    history.append({"epoch": epoch,
                    "train_loss": train_loss,
                    "val_acc": val_acc})

torch.save(model.state_dict(), out / "model.pt")
pcq.save_all(history=history, artifacts={"model": "model.pt"})

Agent-operable by design

Enough structure for an agent to inspect, run, validate, and decide.

pcq gives agents stable machine-readable surfaces while leaving policy to the agent or service. The library reports facts: what ran, what changed, what passed, what failed, which artifacts exist, and how a candidate compares with its parent.

resolve · inspect · validate · run --json · run --jsonl · validate-run · describe-run · compare-runs · lineage · apply-plan · agent install · mcp serve
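
As an illustration of that split (and not part of pcq itself), here is a minimal sketch of the consuming side: a script that shells out to the documented CLI, parses the JSON shown above, and applies its own acceptance policy on top of the reported facts.

# decide.py - illustrative sketch only; the policy below is the caller's, not pcq's
import json
import subprocess

def pcq_json(*args: str) -> dict:
    """Run a pcq subcommand with --json and parse its stdout (error handling omitted)."""
    proc = subprocess.run(["pcq", *args, "--json"], capture_output=True, text=True)
    return json.loads(proc.stdout)

validation = pcq_json("validate-run", "output", "--strictness", "3")
run = pcq_json("describe-run", "output")
facts = run["decision_facts"]

# Example policy: require a finalized run that passed validation and has the target metric.
accept = (
    validation["status"] == "pass"
    and facts["run_completed"]
    and facts["has_target_metric"]
)
print("accept" if accept else "reject")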

Quickstart

Create a contract project and produce a run record.

uv add 'pcq[mcp]'
pcq init-experiment --style script --output ./my-exp --with-pyproject --agent claude
cd ./my-exp
uv sync
pcq agent install --target claude --path . --mcp
pcq run --json
pcq validate-run output --json
pcq describe-run output --json

FAQ

Three things people actually ask.

The longer FAQ, JSON contracts, and strictness reference live in docs/.

Does pcq replace PyTorch / HF Trainer / Lightning / sklearn?

No. pcq does not own the training loop. Your project keeps any framework. pcq only standardizes the surrounding evidence: cq.yaml, metric emission, artifact layout, validation, comparison, lineage, and the final run_record.json.

How does pcq integrate with Claude Code or Codex?

Install pcq[mcp], run pcq agent install --target both --path . --mcp to write .mcp.json, and start pcq mcp serve. Both runtimes then see 14 mcp__pcq__* tools and call them with structured arguments instead of parsing stdout. See tool responses →

How does an agent decide whether a run passed?

Use pcq validate-run output --strictness 3 --json for pass/warn/fail facts and pcq describe-run output --json for decision facts. pcq deliberately reports facts; the agent or service chooses policy.

Relationship with CQ

pcq is the open contract. CQ is one managed consumer.

pcq

Open-source authoring, validation, artifact, and run-evidence library. Apache-2.0. Useful standalone.

CQ

Managed execution, queueing, artifact collection, dashboards, and agent loops. Consumes the same contract.