Metadata-Version: 2.4
Name: touchstone-eval
Version: 0.1.0
Summary: Personal eval benchmark: compare model outcomes across swappable CLI-agent harnesses on custom tasks.
Project-URL: Homepage, https://github.com/krimvp/touchstone
Project-URL: Repository, https://github.com/krimvp/touchstone
Project-URL: Issues, https://github.com/krimvp/touchstone/issues
Author-email: krimvp <anton.balboa@gmail.com>
License-Expression: MIT
License-File: LICENSE
Keywords: acp,agent,benchmark,claude-code,cli,eval,evaluation,llm
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.10
Requires-Dist: pydantic>=2.6
Requires-Dist: pyyaml>=6.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == 'dev'
Provides-Extra: judge
Requires-Dist: anthropic>=0.40; extra == 'judge'
Provides-Extra: langfuse
Requires-Dist: langfuse>=2.0; extra == 'langfuse'
Description-Content-Type: text/markdown

# touchstone

> A *touchstone* is the dark stone jewelers rub gold against to read its purity from the
> streak it leaves — telling true gold from convincing fakes. That is this benchmark's whole
> job: telling apart models that look identical on paper, by the marks they leave on real work.

A personal eval benchmark for answering one question: **for my use cases, which model
works best?**

Each eval (a *case*) bundles its own task, its own input source files, its own AI
artifacts (skills / commands / plugins / MCP), and its own definition of a correct
outcome. A *run* executes a **matrix** of cells — one cell per
`(case × harness × model × trial)` — fully isolated and persisted independently, then
aggregates everything into a single report.

## Core model

```
Case (one eval)            Matrix axes            Cell (unit of work + persistence)
  task / prompt      ×   harnesses[]        =     sandbox + transcript + output
  source/ files          models[]                 + grader scores + metrics + status
  artifacts/             trials (k)
  graders[]
```
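
As a rough illustration of how the matrix axes multiply into cells, here is a minimal sketch. The `Cell` dataclass and `expand_matrix` function are hypothetical names for this example, not the framework's actual types.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Cell:
    """Hypothetical stand-in for one unit of work in the matrix."""
    case: str
    harness: str
    model: str
    trial: int


def expand_matrix(case, harnesses, models, trials):
    """Return one Cell per (harness, model, trial) combination for a case."""
    return [
        Cell(case, harness, model, trial)
        for harness, model, trial in product(harnesses, models, range(1, trials + 1))
    ]


cells = expand_matrix("my-case", ["claude-code"], ["opus", "sonnet", "haiku"], trials=3)
print(len(cells))  # 1 harness x 3 models x 3 trials = 9 cells
```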

- **Harness** — the swappable thing that turns a task into an output, behind one interface
  (`harness/base.py`). `echo` (fake) and `claude-code` are output-only. For *rich* runs
  (a Trace of tool calls / tokens / cost) there are two paths: **`claude-code-stream`** drives
  Claude natively over `--output-format stream-json` (no ACP, no Node; Tracing-only,
  autonomous via skip-permissions — see `docs/adr/0006`), and the **ACP adapter** drives any
  Agent Client Protocol agent (droid, gemini, codex, claude-acp, devin-cli) with full
  observation **and** bidirectional interaction. ACP is one rich path, not the only one — the
  Trace is the contract.
- **Graders** — `command` (run tests/build), `files` (expected files / grep patterns),
  `model_judge` (LLM-as-judge), and `trace` (assert over observed tool usage / token &
  cost budgets), plus the partial-credit `pytest` and `tests` graders covered below. Every
  grader runs, and the results are combined per the case's `expect.pass_threshold`.
- **Observation & interaction** (opt-in per case via `observe:`) — capture a normalized
  **Trace** (tool calls, tokens, cost, permission events) and answer the agent's mid-run
  requests with an **Interaction Policy** (`auto-approve`/`auto-deny`/`scripted`/
  `llm-based`/`manual`). See `CONTEXT.md` + `docs/adr/`.
- **Resumability & parallelism** — each cell's `result.json` is the source of truth (the
  manifest is a derived index), so cells run in parallel (`--workers`) without contention
  and `run --resume <id>` continues after a crash.
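
The resumability idea can be sketched as follows: because each cell's `result.json` is authoritative, a resume only needs to scan the cell directories and re-run whatever has no finished result. This is a sketch under assumptions: the `cells/<cell-id>/result.json` layout, the `status` field, and its values are hypothetical, and `pending_cells` is not the framework's function.

```python
import json
import tempfile
from pathlib import Path


def pending_cells(run_dir, planned_ids):
    """Return planned cell ids with no finished result.json under run_dir/cells/."""
    done = set()
    for result_file in Path(run_dir, "cells").glob("*/result.json"):
        try:
            result = json.loads(result_file.read_text())
        except json.JSONDecodeError:
            continue  # a half-written file from a crash counts as not done
        if result.get("status") in {"passed", "failed"}:  # hypothetical statuses
            done.add(result_file.parent.name)
    return [cell_id for cell_id in planned_ids if cell_id not in done]


with tempfile.TemporaryDirectory() as run_dir:
    finished = Path(run_dir, "cells", "cell-a")
    finished.mkdir(parents=True)
    (finished / "result.json").write_text(json.dumps({"status": "passed"}))
    remaining = pending_cells(run_dir, ["cell-a", "cell-b"])
    print(remaining)  # ['cell-b']
```

Since no shared file is written while cells execute, parallel workers in this sketch never contend for the same path; an index can be rebuilt from the per-cell files at any time.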

## Install

The published package is `touchstone-eval`; the command it installs is `touchstone`
(the bare `touchstone` name on PyPI belongs to an unrelated, abandoned project).

```bash
uvx touchstone-eval --help          # run without installing (recommended)
pipx install touchstone-eval        # or install as an isolated tool
pip install touchstone-eval         # or into the current environment
```

Add the optional extras when you need them: `touchstone-eval[judge]` (Anthropic SDK for
`model_judge`), `[langfuse]` (export), `[dev]` (pytest).

For local development from a checkout:

```bash
pip install -e ".[judge,dev]"   # judge = Anthropic SDK for model_judge; dev = pytest
```

## Usage

```bash
touchstone validate                 # schema-check every evals/<case>/case.yaml
touchstone list                     # list cases and past runs
touchstone run                      # run the whole evals/ suite
touchstone run --eval example-case --harness echo --trials 2
touchstone run --harness droid --with-model A --with-model B  # compare models, same harness
touchstone run --workers 4          # run cells in parallel
touchstone run --resume <run_id>    # continue an interrupted run
touchstone report <run_id>          # (re)generate runs/<run_id>/report.md
touchstone export <run_id> [--push] # write runs/<run_id>/langfuse.json (and optionally push)
```

### Comparing models on the same harness

The matrix is what answers "which model for my use cases?" Distinct models become
distinct cells, and the report ranks them in a per-case matrix plus a leaderboard (score,
cost, time, tools, tokens). A case can declare the models inline
(`matrix.models` / `matrix.entries[].models`), or you can hold a harness fixed and push
models through it at run time without editing the case:

```bash
# Run these models on droid even if the cases declared only one — they replace the
# case's models for that harness. Each becomes its own row in the comparison.
touchstone run --harness droid \
  --with-model custom:glm-5.1:cloud-0 --with-model custom:glm-4.6:cloud
```

`--with-model` *replaces* the declared models (so you can introduce new ones); `--model`
only *filters* the models a case already declares. Models are agent-specific opaque
strings, so prefix `HARNESS=` (`--with-model droid=A`) to scope an override to one harness
when a run spans several.
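
The replace-versus-filter distinction might look like this in code. This is only a sketch: `resolve_models` is a hypothetical helper, and it assumes that `--with-model` takes precedence when both flags are given and that model strings themselves contain no `=`.

```python
def resolve_models(harness, declared, with_models=(), model_filter=()):
    """Pick the models to run for one harness.

    with_models: values of --with-model, optionally prefixed "HARNESS=".
    model_filter: values of --model; keeps only models the case declares.
    """
    overrides = []
    for spec in with_models:
        scope, sep, model = spec.partition("=")
        if not sep:
            overrides.append(spec)   # unscoped: applies to every harness
        elif scope == harness:
            overrides.append(model)  # scoped to this harness only
    if overrides:
        return overrides             # --with-model replaces the declared list
    if model_filter:
        return [m for m in declared if m in model_filter]  # --model only filters
    return list(declared)


print(resolve_models("droid", ["A"], with_models=["B", "C"]))       # ['B', 'C']
print(resolve_models("droid", ["A", "B"], model_filter=["B", "Z"]))  # ['B']
print(resolve_models("gemini", ["A"], with_models=["droid=X"]))      # ['A']
```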

ACP agents are configured in `acp_agents.yaml` (see `acp_agents.yaml.example`); the
built-in profiles (`droid`, `gemini`, `codex`, `claude-acp`, `devin-cli`) work out of the
box once the agent's CLI is on `PATH`. `evals/observed-droid/` is a worked example of a
fully observed, interactive, multi-turn case.

Real harnesses (e.g. `claude-code`) cost money and require their CLI on `PATH`.
The built-in `echo` harness runs the full loop with no network/API spend — use it for
testing the framework itself.

## Defining a case

See `evals/example-case/case.yaml` for a worked example. Schema:

```yaml
id: my-case
description: ...
task:
  prompt: |
    What the model/agent must accomplish.
source:                  # optional; copied fresh into every cell sandbox
  path: ./source         # ...or  {repo: owner/name, commit: <sha>}  (pinned clone)
  # repo form may add `subdir: <dir>` to use just one sub-directory of the clone as the
  # sandbox — lets one fixtures repo hold many cases (see "Source fixtures repo" below).
artifacts:               # optional AI artifacts injected into the harness
  skills:   [./artifacts/skills/foo]
  commands: [./artifacts/commands/bar.md]
  mcp:      ./artifacts/.mcp.json
environment:             # optional per-cell dependency setup (the "broader sandbox")
  kind: pip-venv               # pip-venv (default) | uv | command  — how deps are provisioned
  requirements: [markupsafe]   # (pip-venv/uv) installed into an isolated venv per cell
  install: editable            # (pip-venv/uv) `pip install -e .` (src-layout pkg + its deps)
  # kind: command → run shell installs for project-local ecosystems, e.g.
  #   commands: ["npm ci"]      # node_modules / target/ etc. live in the sandbox
setup:                   # optional; introduce the task state after clone, before the agent
  stub: [{file: pkg/mod.py, function: target}]   # blank a fn body -> NotImplementedError
  run:  ["rm -rf .git"]                           # shell commands in the sandbox
matrix:
  harnesses: [claude-code]
  models:    [opus, sonnet, haiku]
  trials:    3
graders:
  - {type: command, cmd: "pytest -q", weight: 1.0}
  - {type: files, patterns: ["retry", "backoff"]}
  - {type: model_judge, rubric: ./graders/rubric.md, model: opus, pass_threshold: 0.8}
expect:
  pass_threshold: 1.0
```

### Source fixtures repo

A case's bulky, hand-written assets — synthetic codebases to debug **and** the `hidden/`
oracle test suites — live **out of this repo**, in a separate fixtures repo
([`krimvp/touchstone-eval-fixtures`](https://github.com/krimvp/touchstone-eval-fixtures)), so they
don't pollute the runner/eval tree. The eval repo keeps only the *contract* (task, graders,
expectations); the fixtures repo holds the code. Each case has one directory there, split by
**visibility**:

```
<case-id>/
  source/   # agent-VISIBLE input  → promoted to the sandbox before the agent runs
  hidden/   # grader ORACLE         → injected at grade time only; the agent never sees it
```

A case wires the two halves with two independent pins (both default-pinned by commit):

```yaml
source:    {repo: krimvp/touchstone-eval-fixtures, commit: <sha>, subdir: <case-id>/source}
fixtures:  {repo: krimvp/touchstone-eval-fixtures, commit: <sha>}   # subdir defaults to <case-id>
graders:
  - {type: pytest, inject: ["./hidden/test_x.py"]}   # resolved under <case-id>/hidden/
```

- **`source`** clones the repo, checks out the commit, and promotes `<case-id>/source/` into
  the sandbox (no `.git`, like `copy`). SWE-bench-style cases point `source` at the *real
  upstream repo* instead, so they have only a `hidden/` in the fixtures repo (no `source/`).
- **`fixtures`** names the repo that graders resolve `inject:` paths against — `Case.asset()`
  pulls each hidden file from a host-cached clone (`src/touchstone/fixtures.py`), at grade
  time, *after* the agent has stopped. Because `source/` and `hidden/` are sibling directories
  and only `source/` is promoted, the oracle can never leak into the agent's sandbox.
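
To make the visibility split concrete, here is a minimal sketch of the two steps on a local checkout. `promote_source` and `inject_hidden` are hypothetical helpers written for this example; the real logic lives in the framework and may differ.

```python
import shutil
import tempfile
from pathlib import Path


def promote_source(checkout, case_id, sandbox):
    """Copy only <case-id>/source/ into the sandbox; hidden/ is never touched."""
    shutil.copytree(Path(checkout, case_id, "source"), sandbox, dirs_exist_ok=True)


def inject_hidden(checkout, case_id, rel_path, sandbox):
    """Copy one oracle file to the sandbox root; call this only at grade time."""
    src = Path(checkout, case_id, rel_path)
    dest = Path(sandbox, Path(rel_path).name)
    shutil.copy2(src, dest)
    return dest


with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny fake fixtures checkout with one case named "demo".
    checkout = Path(tmp, "fixtures")
    (checkout / "demo" / "source").mkdir(parents=True)
    (checkout / "demo" / "hidden").mkdir()
    (checkout / "demo" / "source" / "mod.py").write_text("def f(): ...\n")
    (checkout / "demo" / "hidden" / "test_x.py").write_text("def test_f(): ...\n")
    sandbox = Path(tmp, "sandbox")

    promote_source(checkout, "demo", sandbox)
    visible_to_agent = sorted(p.name for p in sandbox.iterdir())

    inject_hidden(checkout, "demo", "hidden/test_x.py", sandbox)
    visible_to_grader = sorted(p.name for p in sandbox.iterdir())

print(visible_to_agent)   # ['mod.py']
print(visible_to_grader)  # ['mod.py', 'test_x.py']
```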

Keep the fixtures repo **private** for the anti-memorization cases. `evals/example-case/`
stays local (`source: path`) as the offline worked example / integration fixture.

### Real-repo (SWE-bench-style) cases

A case can pin a real GitHub repo at a commit (`source: {repo, commit}`), `setup.stub` a
function to blank its body, and inject **hidden tests** (oracle = the real function) only
at grade time — so the agent reimplements real library code and the `pytest` grader scores
the fraction of FAIL→PASS tests. See `evals/repo-*-droid/`.

When a repo needs third-party dependencies or isn't importable from its root (a `src/`
layout), declare an **`environment`**: each cell gets its own throwaway virtualenv, into
which `requirements` are pip-installed and — with `install: editable` — the repo itself
(`pip install -e .`, which resolves a src-layout package and pulls its deps). Every
subprocess the cell spawns (harness, setup, and the `command`/`pytest` graders) runs under
that venv via an explicit env, so dependency-bearing cases stay reproducible and
parallel-safe (no shared site-packages). Worked examples: `repo-smarttruncate-droid`
(a `requirements` dep) and `repo-securefilename-droid` (`install: editable`, src-layout).
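
Running a subprocess "under that venv via an explicit env" can be approximated without an activation script by passing an adjusted environment. The sketch below shows one common way to do it; `venv_env` is a hypothetical helper, not the framework's API.

```python
import os
from pathlib import Path


def venv_env(venv_dir, base_env=None):
    """Return an environment dict that makes subprocesses use venv_dir."""
    env = dict(os.environ if base_env is None else base_env)
    bin_dir = Path(venv_dir) / ("Scripts" if os.name == "nt" else "bin")
    env["VIRTUAL_ENV"] = str(venv_dir)
    # Put the venv's executables first so `python`, `pip`, `pytest` resolve there.
    env["PATH"] = str(bin_dir) + os.pathsep + env.get("PATH", "")
    env.pop("PYTHONHOME", None)  # activation scripts unset this as well
    return env


env = venv_env("/tmp/cell/.venv", {"PATH": "/usr/bin", "PYTHONHOME": "/x"})
print(env["VIRTUAL_ENV"])
```

A caller would then pass the result to something like `subprocess.run(["pytest", "-q"], cwd=sandbox, env=env)`, so each cell's processes see only that cell's venv.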

### Non-Python projects

Cases aren't Python-specific. The `command`, `files`, `model_judge`, and `trace` graders
are language-agnostic, and the **`tests`** grader gives the same partial-credit scoring as
`pytest` for any runner whose results it can read. Two substrates, **XML primary with a
console fallback**:

- **JUnit XML** (`junit_xml: <glob>`) — the universal report format every framework/build
  tool can emit (Maven Surefire, Gradle, pytest `--junitxml`, vitest/jest/mocha reporters,
  `go-junit-report`, `cargo2junit`). Deterministic, exact per-test counts, framework-agnostic.
- **Console summary** (`_parse_counts`) — scraped when no XML report is produced: pytest/
  unittest, `node --test`/TAP, Maven Surefire, **`go test -v`** (`--- PASS:`/`--- FAIL:`), and
  **`cargo test`** (`test result: … N passed; M failed`).
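
The two substrates can be sketched with the standard library alone. The function names below are hypothetical (this is not the framework's `_parse_counts`), and the console fallback covers only the `go test -v` and `cargo test` shapes quoted above.

```python
import re
import xml.etree.ElementTree as ET


def counts_from_junit(xml_text):
    """Return (passed, failed) from a JUnit XML report by walking <testcase> elements."""
    root = ET.fromstring(xml_text)
    passed = failed = 0
    for case in root.iter("testcase"):
        if case.find("skipped") is not None:
            continue  # skipped tests count as neither passed nor failed
        if case.find("failure") is not None or case.find("error") is not None:
            failed += 1
        else:
            passed += 1
    return passed, failed


def counts_from_console(output):
    """Fallback: scrape `cargo test` or `go test -v` output for (passed, failed)."""
    # cargo prints one "test result:" line per test binary, so sum them all.
    cargo = re.findall(r"test result: \S+ (\d+) passed; (\d+) failed", output)
    if cargo:
        return sum(int(p) for p, _ in cargo), sum(int(f) for _, f in cargo)
    # go test -v prints one "--- PASS:" / "--- FAIL:" line per test (subtests indented).
    passed = len(re.findall(r"^\s*--- PASS:", output, flags=re.MULTILINE))
    failed = len(re.findall(r"^\s*--- FAIL:", output, flags=re.MULTILINE))
    return passed, failed


report = (
    '<testsuite><testcase name="a"/>'
    '<testcase name="b"><failure/></testcase>'
    '<testcase name="c"><skipped/></testcase></testsuite>'
)
print(counts_from_junit(report))  # (1, 1)
print(counts_from_console("test result: FAILED. 2 passed; 1 failed; 0 ignored"))  # (2, 1)
```

The XML path is preferred in this sketch for the same reason the text gives: it yields exact per-test counts, whereas console scraping depends on each runner's output format.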

A `tests` grader with `gate: true` is a validity **gate**: it never adds credit and zeroes the
cell's score on failure. Use it to mirror SWE-bench's PASS_TO_PASS regression gate in any
language. `inject` takes either a bare filename (dropped at the sandbox root) or `{src, dest}`
to place a hidden test at a runner-specific path (e.g. Maven's `src/test/java/...`). Use
`setup.run` to blank the function (the AST-based `setup.stub` is Python-only); the
`implemented` gate works on any language when pointed at explicit `files`. Worked examples:
`repo-js-wordwrap-droid` (CommonJS, `node --test`), `repo-java-camelcase-droid` (Maven,
Surefire), and the `repo-swebench-*` battery — real recent GitHub issues across Python, Go
(`go test`), Java (Surefire + JUnit XML), JS/TS (mocha/ava/TAP), and Rust (`cargo test`).
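
One way to picture how a gate interacts with weighted graders is the sketch below. It assumes the non-gate graders are combined as a weighted mean, which the text does not confirm; `cell_score` and the result dictionaries are hypothetical.

```python
def cell_score(results):
    """Combine grader results into one score in [0, 1].

    Each result is a dict with "score" (0..1), "weight", and an optional "gate" flag.
    """
    for r in results:
        if r.get("gate") and r["score"] < 1.0:
            return 0.0  # a failed gate disqualifies the whole cell
    scored = [r for r in results if not r.get("gate")]  # gates never add credit
    total_weight = sum(r["weight"] for r in scored)
    if total_weight == 0:
        return 0.0
    return sum(r["score"] * r["weight"] for r in scored) / total_weight


fail_to_pass = {"score": 0.5, "weight": 1.0}
passing_gate = {"score": 1.0, "weight": 1.0, "gate": True}
failing_gate = {"score": 0.0, "weight": 1.0, "gate": True}

print(cell_score([fail_to_pass, passing_gate]))  # 0.5
print(cell_score([fail_to_pass, failing_gate]))  # 0.0
```

A passing gate leaves the score untouched at 0.5 rather than pulling it toward 1.0, which is the "never adds credit" property.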

**Dependencies aren't special — how they're *isolated* is.** Real projects have
dependencies; the question is only whether installing them safely needs the `environment`
venv. It depends on where the ecosystem puts deps:

| Ecosystem | Where deps go | Isolation | How to declare |
| --- | --- | --- | --- |
| Python | shared `site-packages` (mutable) | needs the per-cell venv | `environment:` `kind: pip-venv` (or `uv`) + `requirements` / `install: editable` |
| Node / Rust / Go | project-local (`node_modules`, `target/`, build cache) | per-cell for free | `environment:` `kind: command` + `commands: ["npm ci"]` etc. |
| Java / Maven | shared `~/.m2` (versioned, immutable artifacts) | safe to share across cells | resolved by the build (`mvn test`) |

The `environment.kind` is the one declarative knob (mirroring the Sandbox's Isolation Mode):
`pip-venv` and `uv` build an isolated venv and install into it; `command` runs your install
commands for ecosystems whose deps are project-local.

### OS-level isolation + OS packages (containers)

For cases that need OS packages or a pinned, reproducible build/grade environment, declare a
**`container`**: provisioning, `setup.run`, and the `command`/`tests`/`pytest` graders then run
inside it (via `docker exec`), with the cell bind-mounted at its same path.

```yaml
container:
  image: python:3.12-slim          # pin by digest (…@sha256:…) for full reproducibility
  setup: ["apt-get update -qq", "apt-get install -y -qq libxml2"]   # OS packages, once at start
  caches: [".cache/pip"]           # share the host's cache so cells don't re-download deps
environment:
  kind: pip-venv                   # the venv is now built *inside* the container
  requirements: [lxml, pytest]
graders:
  - {type: pytest, inject: ["./hidden/test_x.py"], weight: 4.0}     # runs in the container
```

`caches` mounts a home-relative dir (e.g. `.cache/pip`, `.m2`) shared with the host and
across cells, so a fresh container per cell reuses already-downloaded dependencies instead
of re-fetching them — the same shared-cache benefit the host's `~/.m2` gives today. The
suite uses this on its dependency-bearing cases: `repo-js-wordwrap` (`node:20-slim`,
zero-dep), `repo-smarttruncate` / `repo-securefilename` (`python:3.12-slim` + pip cache),
and `repo-java-camelcase` (`maven:3.9-eclipse-temurin-21` + shared `~/.m2`).

Every provisioner and grader runs through the Cell's **Executor**: `LocalExecutor` (host
subprocess) by default, `ContainerExecutor` when a `container` is declared (which needs a
running Docker daemon). The same recipe therefore runs under either backend. The Harness (the
agent under test) still runs on the host against the bind-mounted Sandbox; running the agent
itself in-container is future work. See `docs/adr/0005`.
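
The Executor idea, one recipe and two backends, can be sketched as two classes that turn the same shell command into different argument vectors. The class names echo the ones above, but the `argv` method and constructor are assumptions made for this illustration; only `docker exec --workdir` is a real Docker CLI option.

```python
import shlex


class LocalExecutor:
    """Builds the argv for running a shell command directly on the host."""

    def argv(self, command, cwd):
        return ["sh", "-c", f"cd {shlex.quote(cwd)} && {command}"]


class ContainerExecutor:
    """Builds the argv for running the same command inside a running container."""

    def __init__(self, container_name):
        self.container_name = container_name  # hypothetical: one container per cell

    def argv(self, command, cwd):
        # The cell is bind-mounted at the same path, so cwd is valid in both backends.
        return ["docker", "exec", "--workdir", cwd, self.container_name, "sh", "-c", command]


for executor in (LocalExecutor(), ContainerExecutor("cell-1")):
    print(executor.argv("pytest -q", "/work/cell"))
```

Either result can be handed to `subprocess.run`; the caller does not need to know which backend produced it.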

The zero-dep examples above (e.g. `repo-js-wordwrap`) were picked to keep the *demo* offline,
not because deps are rare. `repo-java-camelcase-droid` is a genuinely dependency-bearing
non-Python case: commons-text's source needs `commons-lang3`, which Maven resolves from Maven
Central.

## Bring your own private repos (reachability & fallback)

`touchstone` is an **engine + a public sample battery**. The verdict you can actually trust for
"which model is best **for me**" comes from *your own* tasks, so the design is built to pull
case material from external git repos you own — both the agent-visible `source: {repo, commit}`
and the hidden oracle in `fixtures: {repo, commit}` — some of them private. Auth is just your
normal git credentials (SSH agent / `gh` / a credential helper); nothing extra to configure.

Because a given host may not have access to every referenced repo (a teammate's private
fixtures, a CI box without keys, an offline laptop), a run **probes each case's external repos
before doing any work** (`git ls-remote`, cached per URL) and applies a policy:

```bash
touchstone run                                  # default: FAIL FAST if any required repo is unreachable
touchstone run --on-unavailable skip            # degrade: skip unreachable cases, run the rest
touchstone validate --check-access              # preflight only: report what a run would skip/fail on
```

- **Fail by default.** A missing repo on a host you expected to be complete is a *loud, early*
  error — never a silently smaller benchmark (which would corrupt cross-model comparisons).
- **`--on-unavailable skip`** degrades the unreachable cases to a `skipped` status: excluded
  from every score and the leaderboard, surfaced in a "Skipped (unavailable)" report section,
  and **not** counted as failures. Resume re-probes, so a transient outage is retried.
- **Per-case `availability: optional`** marks a case that may reference a repo you might not
  have — it degrades to `skipped` even under the default fail mode.
- Only *access* failures (no auth / no network / not found) are degradable; a bad commit or
  schema error is a defect and still fails loudly.
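
The probe-then-policy flow described above might be sketched like this. `repo_reachable` and `plan_case` are hypothetical names, the 30-second timeout is an arbitrary choice, and the sketch does not distinguish access failures from defects such as a bad commit.

```python
import os
import subprocess
from functools import lru_cache


@lru_cache(maxsize=None)
def repo_reachable(url):
    """Probe a git remote once per URL; True if `git ls-remote` succeeds."""
    env = {**os.environ, "GIT_TERMINAL_PROMPT": "0"}  # never block on a password prompt
    try:
        proc = subprocess.run(
            ["git", "ls-remote", url], capture_output=True, timeout=30, env=env
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return proc.returncode == 0


def plan_case(repos, availability="required", on_unavailable="fail", probe=repo_reachable):
    """Return "run" or "skip", or raise, following the policy described above."""
    missing = [url for url in repos if not probe(url)]
    if not missing:
        return "run"
    if availability == "optional" or on_unavailable == "skip":
        return "skip"  # degraded: excluded from scores, not counted as a failure
    raise RuntimeError(f"unreachable repos: {', '.join(missing)}")


# A fake probe keeps the example offline.
offline = lambda url: False
print(plan_case(["git@example.com:me/private.git"], on_unavailable="skip", probe=offline))  # skip
```

Injecting the probe as a parameter is a design choice of this sketch so the policy can be exercised without network access.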

A fork can repoint the default hidden-fixtures repo to its own private one without editing every
case by setting `TOUCHSTONE_FIXTURES_REPO=owner/my-fixtures`. Your fully-private held-out suite
lives in `evals-private/` (gitignored) and runs with `--evals-dir evals-private` — see its
README. Design: `docs/adr/0008-reachability-and-availability-policy.md`.

## Layout

```
evals/<case>/        the benchmark suite (one dir per case)
src/touchstone/      the framework (config, harness/, grader/, runner, report, cli)
runs/<run_id>/       results (gitignored): manifest.json + cells/ + report.md
```
