Metadata-Version: 2.4
Name: state-trace
Version: 0.3.3
Summary: Graph-native working memory for coding agents with causal retrieval and bounded capacity.
Author: Razroo
License: MIT License
        
        Copyright (c) 2026 Razroo
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Homepage, https://github.com/razroo/state-trace
Project-URL: Repository, https://github.com/razroo/state-trace
Project-URL: Issues, https://github.com/razroo/state-trace/issues
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: fastapi>=0.115.0
Requires-Dist: networkx>=3.3
Requires-Dist: numpy>=1.26.0
Requires-Dist: pydantic>=2.8.0
Provides-Extra: api
Requires-Dist: uvicorn>=0.30.0; extra == "api"
Provides-Extra: bench
Requires-Dist: graphiti-core[kuzu]>=0.28.2; extra == "bench"
Requires-Dist: datasets>=3.0.0; extra == "bench"
Provides-Extra: llm
Requires-Dist: openai>=2.0.0; extra == "llm"
Provides-Extra: mcp
Requires-Dist: mcp>=1.2.0; extra == "mcp"
Provides-Extra: adapters
Requires-Dist: langchain-core>=0.3.0; extra == "adapters"
Requires-Dist: llama-index-core>=0.11.0; extra == "adapters"
Dynamic: license-file

# state-trace

<!-- mcp-name: io.github.razroo/state-trace -->

> Graph-native working memory for coding agents: typed memories, causal retrieval, bounded capacity, and compact briefs for small models.

`state-trace` is a bounded working-memory layer for coding and debugging agents that need the right file, failure, and next action under tight token budgets. It is not a replacement for a general-purpose temporal knowledge graph like Graphiti — see [ARCHITECTURE.md](./ARCHITECTURE.md) for the honest comparison.

What it is optimized for:

- artifact-first retrieval for coding agents
- current-vs-stale task state (`engine.current_state()`, `engine.failed_hypotheses()`)
- compact harness-facing briefs for smaller models
- online agent loops and post-hoc trajectory ingestion
- bounded memory with decay, compression, and lifecycle retention
- MCP-mountable, local-first deployment

## Headline: SWE-bench-Verified localization — n=500

The credibility benchmark: cold-start artifact localization on the full SWE-bench-Verified test split. Given only the GitHub issue text and hints (no trajectory), how often does the correct patch file rank first (Artifact@1) or within the top five (Artifact@5)?

```bash
pip install -e ".[bench]"
python3 examples/swebench_verified_eval.py --limit 500 --backends no_memory bm25 state_trace graphiti
```

<!-- BENCHMARK:SWEBENCH_N500:START -->
| backend | n | Artifact@1 | Artifact@1 CI | Artifact@5 | Artifact@5 CI | AvgLatencyMs |
| --- | ---: | ---: | :---: | ---: | :---: | ---: |
| no_memory | 500 | 0.000 | [0.000, 0.000] | 0.000 | [0.000, 0.000] | 0.00 |
| bm25 | 500 | 0.176 | [0.144, 0.208] | 0.300 | [0.262, 0.338] | 0.10 |
| **state_trace** | 500 | **0.254** | [0.218, 0.290] | **0.376** | [0.336, 0.414] | 15.04 |
| graphiti | 500 | 0.098 | [0.072, 0.126] | 0.216 | [0.182, 0.254] | 4851.46 |
<!-- BENCHMARK:SWEBENCH_N500:END -->

What this says, plainly:

- **state_trace leads on both Artifact@1 and Artifact@5 against every baseline. The 95% CIs are non-overlapping in every comparison except Artifact@5 vs. BM25, where they overlap marginally.**
- **vs. Graphiti:** a wide, definitive gap (A@1 0.254 vs 0.098; A@5 0.376 vs 0.216). Non-overlapping CIs on both metrics. On the same input with the same deterministic embedder/reranker stub, the typed coding-agent ontology + cold-start lexical fallback localizes the right file and puts it in the top 5 meaningfully more often.
- **vs. BM25:** a consistent win on point estimates, with non-overlapping CIs on A@1 (state_trace lower bound 0.218 > BM25 upper bound 0.208). On A@5 the intervals overlap slightly (state_trace lower bound 0.336 < BM25 upper bound 0.338), so the CIs alone do not separate the two on that metric. BM25's pure file-token lexical search is a strong baseline; state_trace's coding-agent ontology + module-to-path translation + GitHub-URL extraction beats it clearly on A@1 for cold-start localization.
- **Latency:** state_trace retrieves in ~15ms vs BM25's ~0.1ms vs Graphiti's ~4,850ms. For per-action memory lookups in an agent loop, the ~320× delta over Graphiti compounds meaningfully over a long session.
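
For readers unfamiliar with the metric, here is a minimal sketch of how Artifact@k in the table above can be computed. It assumes an instance counts as a hit when any gold patch file appears in the top-k ranked paths. The function names and toy data are illustrative and are not taken from `examples/swebench_verified_eval.py`.

```python
def artifact_at_k(ranked_files: list[str], gold_files: set[str], k: int) -> float:
    """Return 1.0 if any gold patch file appears in the top-k ranked files, else 0.0."""
    return 1.0 if any(path in gold_files for path in ranked_files[:k]) else 0.0


def mean_artifact_at_k(predictions: list[list[str]], golds: list[set[str]], k: int) -> float:
    """Average the per-instance hit indicator over all instances."""
    hits = [artifact_at_k(ranked, gold, k) for ranked, gold in zip(predictions, golds)]
    return sum(hits) / len(hits)


# Two toy instances (illustrative data, not benchmark output).
predictions = [
    ["pkg/a.py", "pkg/b.py", "pkg/c.py"],  # gold file ranked first
    ["pkg/x.py", "pkg/y.py", "pkg/z.py"],  # gold file ranked third
]
golds = [{"pkg/a.py"}, {"pkg/z.py"}]

print(mean_artifact_at_k(predictions, golds, k=1))  # 0.5
print(mean_artifact_at_k(predictions, golds, k=5))  # 1.0
```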

v0.3.0 landed a module-to-path translator in `retrieve_brief`'s lexical fallback: dotted Python module references in issue text (`astropy.modeling.separable_matrix`) now resolve to file path candidates (`astropy/modeling/separable.py`), which pushed A@1 from 0.216 → 0.254 on n=500.
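
The general idea of module-to-path translation can be sketched as follows. This is a simplified illustration, not the translator shipped in `retrieve_brief`: the helper names are hypothetical, and it only does exact prefix expansion (treating trailing components as possible symbols), whereas the real fallback may match more loosely.

```python
import re

# Matches dotted identifiers such as "pkg.sub.mod.func" (hypothetical pattern).
DOTTED = re.compile(r"\b[A-Za-z_][A-Za-z0-9_]*(?:\.[A-Za-z_][A-Za-z0-9_]*)+\b")


def module_path_candidates(dotted: str) -> list[str]:
    """Expand a dotted reference into candidate file paths, longest prefix first."""
    parts = dotted.split(".")
    candidates = []
    # Trailing components may be functions or classes rather than modules,
    # so try every prefix as both a module file and a package.
    for end in range(len(parts), 0, -1):
        base = "/".join(parts[:end])
        candidates.append(f"{base}.py")
        candidates.append(f"{base}/__init__.py")
    return candidates


def candidates_from_issue(text: str) -> list[str]:
    """Collect de-duplicated path candidates for every dotted reference in the text."""
    seen: dict[str, None] = {}  # dict preserves insertion order
    for match in DOTTED.findall(text):
        for path in module_path_candidates(match):
            seen.setdefault(path, None)
    return list(seen)


issue = "Calling astropy.modeling.separable.separability_matrix fails."
print(candidates_from_issue(issue)[:4])
```

The candidates are only guesses; a retriever would still need to intersect them with paths that actually exist in the repository.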

### Caveats

- Graphiti is run with a deterministic hash-embedder and BM25 + cosine + BFS → RRF reranker (no LLM entity extraction). That's the same simplification `graphiti_head_to_head_eval.py` uses for reproducibility without API keys. A full Graphiti pipeline with GPT-4-class extraction might close some of the gap, at materially higher cost per ingest.
- Cold-start localization from issue text is only one axis. Trajectory-informed retrieval (BENCHMARKS.md) is where state_trace's larger advantage lives.

## Live solve-rate — n=20 with Codex CLI + swebench docker harness

A localization lead only matters if it converts into downstream solve wins. The table below comes from running the actual swebench test suite on patches that Codex CLI produced with and without a state-trace brief:

| arm | resolved | unresolved | errored | solve-rate |
| --- | ---: | ---: | ---: | ---: |
| state_trace | 7 | 3 | 10 | 7/20 = 35% |
| no_memory | 7 | 2 | 11 | 7/20 = 35% |

Same aggregate solve-rate. But **the two arms solve different instances**:

- Both arms solve: 5 instances (astropy-12907, -13453, -14309, -14995, -7671)
- state_trace only: 2 instances (astropy-14598, -7166)
- no_memory only: 2 instances (astropy-14508, -7336)
- Union (at least one arm solves): 9/20 = 45%

Honest read:

- **At this sample size and with Codex CLI as the downstream model, state_trace's n=500 retrieval advantage does not translate into an aggregate solve-rate advantage.** The file-level proxy predicted this — Codex already localizes files near-ceiling from issue text, so the retrieval win has nowhere to compound.
- **state_trace does change Codex's behavior** — different instances resolve under each arm. Net-zero at n=20 could be noise or could be a genuine redirect-sideways effect. Larger N (50-100) would resolve which.
- **The errors are mostly patch-apply failures** — Codex produces diffs with wrong line numbers or malformed hunks, and the harness rejects them before running tests. Same pattern across both arms. That's a downstream-model problem, not a memory-layer problem.
- **Union of 9/20 = 45%** means routing-by-oracle between the two arms would beat either arm alone by 10 points. Suggests state_trace's context is genuinely orthogonal to Codex's baseline knowledge, just not uniformly in the correct direction.
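
The overlap arithmetic can be checked directly from the instance lists above. The snippet uses the shortened instance labels from this README; nothing here calls state-trace or the swebench harness.

```python
# Instance labels as listed above.
both = {"astropy-12907", "astropy-13453", "astropy-14309", "astropy-14995", "astropy-7671"}
state_trace_only = {"astropy-14598", "astropy-7166"}
no_memory_only = {"astropy-14508", "astropy-7336"}

state_trace_solved = both | state_trace_only
no_memory_solved = both | no_memory_only
union = state_trace_solved | no_memory_solved

total = 20
print(len(state_trace_solved), len(no_memory_solved), len(union))  # 7 7 9
print(f"{len(union) / total:.0%}")  # 45%

# Gain of an oracle router over either single arm, in percentage points.
oracle_gain = (len(union) - len(state_trace_solved)) / total
print(f"{oracle_gain:.0%}")  # 10%
```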

### Solve-rate caveats

- n=20 is too small for confident conclusions about the *direction* of state_trace's effect on solve-rate. What we can say: **no big win, no big loss, identical aggregate.**
- This was run against the first 20 SWE-bench-Verified instances (mostly astropy). A harder subset could shift the result either way.
- Codex CLI is a substantially stronger downstream model than a raw LLM call. Results with a smaller/weaker agent (free-tier OpenRouter, a small local model) would likely show a larger gap — in one direction or the other — because retrieval-quality wins matter more when the downstream model can't compensate.
- Reproducing this: see [BENCHMARKS.md](./BENCHMARKS.md) for the exact harness commands.

## What makes the architecture different

Typed coding-agent ontology, not generic Entity/Edge:

- **Nodes:** `task`, `observation`, `decision`, `file`, `goal`, `session`, `command`, `test`, `symbol`, `patch_hunk`, `error_signature`, `episode`
- **Edges:** `patches_file`, `fails_in`, `verified_by`, `rejected_by`, `supersedes`, `contradicts`, `solves`, `derived_from`, `precedes`, `motivates`, and more
- **Intent routing:** the retrieval scorer re-prioritizes edge types per query intent (`locate_file`, `failure_analysis`, `history`, `general`).
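
To make intent routing concrete, here is a toy sketch of re-weighting edge types per query intent. The intent and edge names come from the lists above; the weights, the default, and both function names are invented for illustration and do not reflect state-trace's actual scorer.

```python
# Hypothetical weights, chosen only for illustration.
INTENT_EDGE_WEIGHTS = {
    "locate_file": {"patches_file": 1.0, "fails_in": 0.6, "verified_by": 0.4},
    "failure_analysis": {"fails_in": 1.0, "rejected_by": 0.8, "contradicts": 0.6},
    "history": {"precedes": 1.0, "supersedes": 0.8, "derived_from": 0.5},
    "general": {},
}
DEFAULT_WEIGHT = 0.2  # applied to edge types an intent does not prioritize


def edge_score(intent: str, edge_type: str, base_score: float) -> float:
    """Scale a base relevance score by the intent-specific edge weight."""
    weights = INTENT_EDGE_WEIGHTS.get(intent, {})
    return base_score * weights.get(edge_type, DEFAULT_WEIGHT)


def rank_edges(intent: str, edges: list[tuple[str, str, float]]) -> list[tuple[str, str, float]]:
    """Sort (edge_type, target, base_score) tuples by intent-weighted score."""
    return sorted(edges, key=lambda e: edge_score(intent, e[0], e[2]), reverse=True)


edges = [("precedes", "step-3", 0.9), ("patches_file", "auth.ts", 0.5)]
print(rank_edges("locate_file", edges)[0][1])  # auth.ts
print(rank_edges("history", edges)[0][1])      # step-3
```

The same candidate edges produce different top results depending on intent, which is the behavior the bullet describes.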

Bounded working memory as a first-class constraint:

- `enforce_capacity()` runs decay, compression, and summarization on every step.
- `current_state(session)` answers "what's live right now" directly — cheap for state-trace, expensive for a general-purpose knowledge graph.
- `failed_hypotheses(session)` returns invalidated, superseded, or unrecovered-error nodes — the "don't propose this again" signal.
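
As a rough mental model of a capacity bound with decay, consider the toy sketch below. It only evicts the lowest-retention items; the real `enforce_capacity()` also compresses and summarizes, which this sketch does not attempt. The `Item` fields, the exponential decay, and the function name are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Item:
    id: str
    cost: float        # units charged against the capacity budget
    importance: float  # 0..1, as in the `importance` metadata field
    last_used: int     # step index of the most recent access


def enforce_capacity_sketch(items: list[Item], limit: float, now: int, decay: float = 0.9):
    """Evict lowest-retention items until the total cost fits within the limit."""
    def retention(item: Item) -> float:
        # Importance decays exponentially with the number of steps since last use.
        return item.importance * decay ** (now - item.last_used)

    kept = sorted(items, key=retention, reverse=True)
    evicted = []
    while kept and sum(i.cost for i in kept) > limit:
        evicted.append(kept.pop())  # drop the weakest remaining item
    return kept, evicted


items = [
    Item("recent-important", cost=10, importance=0.9, last_used=9),
    Item("stale-important", cost=10, importance=0.9, last_used=0),
    Item("fresh-minor", cost=10, importance=0.5, last_used=10),
]
kept, evicted = enforce_capacity_sketch(items, limit=20, now=10)
print([i.id for i in kept])     # ['recent-important', 'fresh-minor']
print([i.id for i in evicted])  # ['stale-important']
```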

Local-first, MCP-mountable:

- Hot graph is an in-process `networkx.MultiDiGraph`. Cold storage is WAL SQLite+FTS5.
- `state-trace-mcp` is a stdio MCP server you can mount in Claude Code / Cursor / Codex CLI.

See [ARCHITECTURE.md](./ARCHITECTURE.md) for why these choices matter vs. Graphiti, and [BENCHMARKS.md](./BENCHMARKS.md) for the smaller repo-local benchmarks.

## vs. Graphiti

Graphiti is the stronger general-purpose temporal knowledge graph for AI agents. `state-trace` is narrower: working memory for one coding/debugging session at a time. We're not claiming to replace Graphiti — we're claiming a specific lane where the tradeoffs land differently.

Each row below is a concrete, measured axis, not a vibe.

| Axis | state-trace | Graphiti | Winner for coding agents |
| --- | --- | --- | --- |
| **Artifact@1** on SWE-bench-Verified, n=500 | **0.254** [0.218, 0.290] | 0.098 [0.072, 0.126] | **state-trace** — non-overlapping 95% CIs |
| **Artifact@5** on SWE-bench-Verified, n=500 | **0.376** [0.336, 0.414] | 0.216 [0.182, 0.254] | **state-trace** — non-overlapping 95% CIs |
| **Per-retrieval latency** (same benchmark) | **15 ms** | 4,851 ms | **state-trace** — ~320× faster |
| **Write path per agent step** | Typed insert, zero LLM calls | `add_episode` → LLM entity extraction each step | **state-trace** — cheaper, deterministic, no API key |
| **Default deploy** | Pure Python + local SQLite/JSON; `state-trace-mcp` stdio binary | Neo4j / Kuzu / FalkorDB graph DB + embedder + LLM | **state-trace** — local-first, no external services |
| **Coding-agent ontology** | Typed: `file`, `patch_hunk`, `error_signature`, `test`, `command`, `symbol`, `observation`, `decision`, `task`, `goal`, `session`, `episode` | Generic `EntityNode` / `EntityEdge` / `EpisodicNode` | **state-trace** — retrieval scorer routes on these types |
| **"What's true right now in this session?"** | `engine.current_state(session)` — direct O(graph) query | Inferred from temporal facts via Cypher or LLM | **state-trace** — first-class API |
| **"What have I already tried and rejected?"** | `engine.failed_hypotheses(session)` — direct query returning `invalid_at` + superseded + unrecovered-error nodes | Has to be inferred from `invalid_at` + contradictions | **state-trace** — first-class API |
| **Working-memory capacity bound** | `enforce_capacity` with decay + compression + lifecycle retention. Long-horizon pressure benchmark: Artifact@1 0.771 *while* staying within a 96-unit budget 100% of the time | Unbounded by design; relies on the graph DB to scale | **state-trace** for long debugging sessions that need a memory ceiling |
| **Small-model brief** | `retrieve_brief` produces ~220-token structured brief (`patch_file`, `rerun_command`, `tests_to_rerun`, `failed_attempts`, `recommended_actions`, …) that fits a tight budget | Returns raw nodes/facts; caller compresses | **state-trace** — built for small-model harnesses |
| **MCP-mountable** | `state-trace-mcp` stdio server in the `[mcp]` extra — 11 tools exposed, drop into `~/.claude/settings.json` | No official MCP server; library-first | **state-trace** — plug straight into Claude Code / Cursor / Codex / opencode |
| **Long-lived temporal knowledge across weeks** | Scoped to a session or repo namespace; no cross-namespace fact merging | First-class; bi-temporal validity, contradiction resolution, fact supersession across episodes | **Graphiti** |
| **Multi-tenant SaaS scale** | Single-writer process model; authoritative graph is in-process networkx | Built for it on Neo4j/Kuzu substrate | **Graphiti** |
| **Cross-session learning about users / orgs / policies** | Out of scope | First-class | **Graphiti** |

### When to pick which

Use **state-trace** when:

- Your agent is editing code in a single debugging or refactoring session.
- You talk to an MCP client (Claude Code, Cursor, Codex CLI, opencode) and want working memory without standing up a graph DB.
- Per-action latency matters — you're calling memory on every tool invocation in an agent loop.
- You run on small models where a 220-token structured brief beats a 1,000-token raw dump.
- You need "what file should I patch / what did I already try" to be a direct query, not inferred.

Use **Graphiti** when:

- You need a knowledge graph of facts about the world, users, or an organization that evolves across weeks.
- Multi-tenant, multi-agent shared memory is part of the design.
- You're willing to run Neo4j/Kuzu and pay the LLM-extraction cost per ingest for the ontological payoff.
- Your retrieval patterns are richer than "which file, which test, which failed hypothesis."

They solve adjacent problems. The only reason a comparison is even interesting is that both ship as "memory for AI agents" — the honest answer is they're different products that happen to live on the same shelf.

## Installation

```bash
pip install "state-trace"      # library
pip install "state-trace[mcp]" # stdio MCP server for Claude Code / Cursor / Codex CLI

uv sync                       # repo development
pip install -e ".[mcp]"       # editable MCP install for local development
pip install -e ".[bench]"     # graphiti-core[kuzu] + datasets (for the headline benchmark)
pip install -e ".[llm]"       # OpenAI-backed live benchmarks + LLM ingestion
pip install -e ".[adapters]"  # LangGraph / LlamaIndex adapter shims
pip install -e ".[api]"       # FastAPI app
```

Distribution name: `state-trace`. Python import path: `state_trace`.

## Quickstart

```python
from state_trace import MemoryEngine

engine = MemoryEngine(capacity_limit=24.0, storage_path="memory.json")

task = engine.store(
    "Fix login by tracing the refresh token path",
    {"type": "task", "session": "auth-debug", "goal": "restore login", "file": "auth.ts", "importance": 0.92},
)
engine.store(
    "Login still returns 401 after refresh token exchange",
    {"type": "observation", "session": "auth-debug", "goal": "restore login", "file": "auth.ts",
     "blocks": [task.id], "importance": 0.88},
)
engine.store(
    "Authorization header is dropped before the retry request reaches auth.ts",
    {"type": "decision", "session": "auth-debug", "goal": "restore login",
     "related_to": [task.id], "file": "auth.ts", "importance": 0.91},
)

result = engine.retrieve("Why is login still broken?", {"session": "auth-debug", "goal": "restore login"})
```

## Current state, live hypotheses, failed attempts

The architectural wedge. These APIs return a live view of the session without re-ranking:

```python
state = engine.current_state(session="auth-debug", goal="restore login")
# → {"active_task": ..., "latest_observation": ..., "active_files": [...], ...}

failures = engine.failed_hypotheses(session="auth-debug")
# → [{"id": ..., "reason": ["superseded"], "content": "Login still returns 401 ..."}, ...]
```

`current_state` filters out invalidated and superseded nodes; `failed_hypotheses` surfaces them as "do not propose again" context. A general-purpose temporal graph has to infer this from fact updates; here it's a direct query.
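
The split between the two views can be pictured with a small stand-alone sketch. It assumes each node carries an `invalid_at` timestamp and a hypothetical `superseded_by` pointer; the real node schema and the full set of failure reasons (for example unrecovered errors) are not modeled here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    id: str
    session: str
    content: str
    invalid_at: Optional[float] = None   # set when the node is invalidated
    superseded_by: Optional[str] = None  # hypothetical field: id of the replacing node


def live_nodes(nodes: list[Node], session: str) -> list[Node]:
    """Nodes that are neither invalidated nor superseded (the 'current state' view)."""
    return [
        n for n in nodes
        if n.session == session and n.invalid_at is None and n.superseded_by is None
    ]


def failed_nodes(nodes: list[Node], session: str) -> list[dict]:
    """Invalidated or superseded nodes, with the reasons attached."""
    out = []
    for n in nodes:
        if n.session != session:
            continue
        reasons = []
        if n.invalid_at is not None:
            reasons.append("invalidated")
        if n.superseded_by is not None:
            reasons.append("superseded")
        if reasons:
            out.append({"id": n.id, "reason": reasons, "content": n.content})
    return out


nodes = [
    Node("n1", "auth-debug", "Login still returns 401", superseded_by="n2"),
    Node("n2", "auth-debug", "Authorization header is dropped before retry"),
]
print([n.id for n in live_nodes(nodes, "auth-debug")])  # ['n2']
print(failed_nodes(nodes, "auth-debug")[0]["reason"])   # ['superseded']
```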

## MCP Server

Install the official PyPI package:

```bash
python3 -m pip install --upgrade "state-trace[mcp]"
state-trace-mcp
```

For project-scoped MCP installs, generate a local `.mcp.json` from the installed package. The generated config uses an absolute `state-trace-mcp` path so clients with a minimal PATH can still start it.

```bash
cd /path/to/your/repo
state-trace-mcp-config --namespace "$(basename "$PWD")" > .mcp.json
# If your shell cannot find the console script:
python3 -m state_trace.mcp_config --namespace "$(basename "$PWD")" > .mcp.json
```

Environment config:

- `STATE_TRACE_STORAGE_PATH` — durable path; `.db`/`.sqlite` uses the SQLite backend. Default: `~/.state-trace/memory.db`.
- `STATE_TRACE_NAMESPACE` — default namespace (e.g. the repo slug).
- `STATE_TRACE_CAPACITY_LIMIT` — working-memory budget (default `256`).
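
A minimal sketch of how these variables could be resolved with the documented defaults. `load_settings` is a hypothetical helper, not part of the `state_trace` package, and treating an unset namespace as `None` is an assumption.

```python
import os
from pathlib import Path


def load_settings(env=None) -> dict:
    """Resolve storage path, namespace, and capacity from environment variables."""
    env = os.environ if env is None else env
    storage = Path(env.get("STATE_TRACE_STORAGE_PATH", "~/.state-trace/memory.db")).expanduser()
    return {
        "storage_path": storage,
        "namespace": env.get("STATE_TRACE_NAMESPACE"),  # None when unset (assumption)
        "capacity_limit": float(env.get("STATE_TRACE_CAPACITY_LIMIT", "256")),
    }


settings = load_settings({"STATE_TRACE_NAMESPACE": "repo-x", "STATE_TRACE_CAPACITY_LIMIT": "96"})
print(settings["namespace"], settings["capacity_limit"])  # repo-x 96.0
```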

Tools exposed: `store`, `retrieve`, `retrieve_brief`, `record_action`, `record_observation`, `record_test_result`, `ingest_agent_log_file`, `current_state`, `failed_hypotheses`, `list_namespaces`, `graph_snapshot`.

### OpenAI Codex

Recommended per-repo install:

```bash
python3 -m pip install --upgrade "state-trace[mcp]"
cd /path/to/your/repo
state-trace-mcp-config --namespace "$(basename "$PWD")" > .mcp.json
# If your shell cannot find the console script:
python3 -m state_trace.mcp_config --namespace "$(basename "$PWD")" > .mcp.json
```

Manual project-level `.mcp.json`:

```json
{
  "mcpServers": {
    "state-trace": {
      "command": "/absolute/path/to/state-trace-mcp",
      "env": {
        "STATE_TRACE_STORAGE_PATH": "/Users/me/.state-trace/repo-x.db",
        "STATE_TRACE_NAMESPACE": "repo-x",
        "STATE_TRACE_CAPACITY_LIMIT": "256"
      }
    }
  }
}
```

Restart Codex after adding or changing `.mcp.json`.

Codex also supports global MCP config:

```bash
STATE_TRACE_MCP="$(python3 -c 'from state_trace.mcp_config import resolve_entrypoint; print(resolve_entrypoint())')"
codex mcp add state-trace \
  --env STATE_TRACE_NAMESPACE=repo-x \
  --env STATE_TRACE_STORAGE_PATH=/Users/me/.state-trace/repo-x.db \
  -- "$STATE_TRACE_MCP"
```

Use the project `.mcp.json` path when you want state-trace mounted one repo at a time.

### Claude Code

Project-level `.mcp.json` is the same as Codex. For a global install:

```bash
claude mcp add state-trace -- state-trace-mcp
```

To uninstall:

```bash
claude mcp remove state-trace
```

### Cursor

Open Settings → MCP → Add new MCP server, or add the same `mcpServers.state-trace` entry to `.cursor/mcp.json`.

### Other MCP clients

Any MCP client that supports stdio transport can run:

```json
{
  "command": "/absolute/path/to/state-trace-mcp"
}
```

Use `python3 -m state_trace.mcp_config` to print a complete config with storage, namespace, and capacity env vars.

## Online agent loop

```python
from state_trace import MemoryEngine

engine = MemoryEngine(capacity_limit=256.0)
ctx = {"session": "auth-debug", "goal": "restore login", "repo": "example/auth-service"}

engine.record_action('open "src/auth.ts"', {**ctx, "files": ["src/auth.ts"]})
engine.record_observation(
    "AttributeError: login still fails with a 401 in src/auth.ts",
    {**ctx, "files": ["src/auth.ts"], "status": "error"},
)
engine.record_action('edit "src/auth.ts"', {**ctx, "files": ["src/auth.ts"], "action_kind": "edit"})
engine.record_test_result(
    "pytest tests/test_auth.py::test_refresh_retry",
    "tests/test_auth.py::test_refresh_retry PASSED",
    {**ctx, "files": ["src/auth.ts", "tests/test_auth.py::test_refresh_retry"]},
)

brief = engine.retrieve_brief(
    "Which file should I patch and what test should I rerun?",
    {"session": "auth-debug", "goal": "restore login"},
    mode="small_model",
)
```

The brief fields: `patch_file`, `rerun_command`, `target_files`, `tests_to_rerun`, `current_state`, `failed_attempts`, `recommended_actions`, `evidence`, `symbols`, `patch_hints`, `confidence`, `token_estimate`.

## Trajectory ingestion

```python
from state_trace import MemoryEngine

engine = MemoryEngine(capacity_limit=256.0)
engine.store_agent_log_file("examples/data/agent_logs/marshmallow__marshmallow-1867.json")
```

Supported inputs: normalized `agent_log` JSON, raw SWE-agent `.traj` files, raw OpenHands event JSON logs.

### From iso-trace (Claude Code / Cursor / Codex / opencode sessions)

If you've accumulated session history with [`@razroo/iso-trace`](https://www.npmjs.com/package/@razroo/iso-trace), feed it directly:

```bash
# Export a session via iso-trace's CLI
npx @razroo/iso-trace export <session-id> --json --out session.json
```

```python
from state_trace import MemoryEngine
from state_trace.iso_trace_adapter import ingest_iso_trace_session

engine = MemoryEngine(capacity_limit=256.0, namespace="my-repo")
ingest_iso_trace_session(engine, "session.json")
```

The adapter reads iso-trace's documented Session → Turn → Event[] JSON and converts it to state-trace's `agent_log` format — typed nodes for files, edits, tests, errors. Months of accumulated harness history become queryable working memory without re-running the agent.

## Live solve-rate (next credibility step)

`examples/swebench_verified_solve_rate.py` scaffolds end-to-end solve-rate measurement: state-trace brief → LLM patch proposal → SWE-bench-Verified prediction JSONL. It does not run the swebench docker harness; that step is documented in the script's header.

```bash
python3 examples/swebench_verified_solve_rate.py --limit 5 --model gpt-5.1-mini --dry-run
```

## Storage backends

`MemoryEngine(storage_path=...)` picks the backend from the file extension:

- `.db` / `.sqlite` / `.sqlite3` — durable SQLite with WAL journal + FTS5 seed index. Recommended for long-running agent harnesses.
- any other path — JSON blob (simple, single-writer, fine for benchmarks).
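
The extension rule above amounts to a small dispatch. The sketch below is illustrative only: `backend_for` is a hypothetical name, and the case-insensitive comparison is an assumption rather than documented behavior.

```python
from pathlib import Path

SQLITE_SUFFIXES = {".db", ".sqlite", ".sqlite3"}


def backend_for(storage_path: str) -> str:
    """Pick 'sqlite' for known SQLite extensions, 'json' for anything else."""
    # Lower-casing the suffix is an assumption made for this sketch.
    return "sqlite" if Path(storage_path).suffix.lower() in SQLITE_SUFFIXES else "json"


print(backend_for("memory.db"))    # sqlite
print(backend_for("memory.json"))  # json
```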

See [ARCHITECTURE.md](./ARCHITECTURE.md) for the "why networkx + SQLite, not Neo4j" explainer.

## Namespaces

```python
engine = MemoryEngine(storage_path="memory.db", namespace="payments-api")
engine.retrieve("why is login broken?")  # scoped to payments-api by default
engine.retrieve("...", include_all_namespaces=True)  # opt out
```

Nodes without a namespace remain visible in every view so pre-namespace data is not lost.
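
The visibility rule can be stated as a small predicate. This is a sketch of the behavior described above, with a hypothetical function name; it is not the engine's actual filter.

```python
from typing import Optional


def is_visible(node_namespace: Optional[str], view_namespace: str,
               include_all_namespaces: bool = False) -> bool:
    """Decide whether a node shows up in a namespace-scoped view."""
    # Nodes with no namespace (pre-namespace data) are visible in every view.
    if include_all_namespaces or node_namespace is None:
        return True
    return node_namespace == view_namespace


print(is_visible("payments-api", "payments-api"))  # True
print(is_visible("billing", "payments-api"))       # False
print(is_visible(None, "payments-api"))            # True
```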

## Framework adapters

```python
from state_trace.adapters import StateTraceLangGraphMemory, StateTraceLlamaIndexMemory

lg_memory = StateTraceLangGraphMemory(default_session="coding-session")
li_memory = StateTraceLlamaIndexMemory(session_id="agent-session")
```

Neither adapter imports the host framework; they satisfy the duck-typed memory contract used by each.

## FastAPI

```python
from state_trace.api import app  # POST /store, /retrieve, /retrieve_brief, GET /graph
```

Pass `"explain": true` on retrieve to include per-node score breakdowns.

## Tests

```bash
python3 -m pytest -q
```

## Benchmarks

The full set of repo-local benchmarks, with their caveats, lives in [BENCHMARKS.md](./BENCHMARKS.md). The SWE-bench-Verified row above is the only one at a scale worth citing externally.

## Positioning

See [**vs. Graphiti**](#vs-graphiti) above for the head-to-head comparison and [ARCHITECTURE.md](./ARCHITECTURE.md) for the architecture tradeoffs in detail. tl;dr: different products, adjacent problems — `state-trace` owns the narrow coding-agent working-memory lane; Graphiti owns weeks-of-history temporal knowledge graphs.
