Metadata-Version: 2.4
Name: benchd-harness
Version: 0.1.0
Summary: The neutral benchmark harness for AI memory systems. Run LongMemEval, LOCOMO, and more against any memory system.
Author-email: Bench'd <hello@benchd.ai>
License: MIT
Project-URL: Homepage, https://benchd.ai
Project-URL: Documentation, https://benchd.ai/docs
Project-URL: Repository, https://github.com/benchdai/harness
Project-URL: Issues, https://github.com/benchdai/harness/issues
Project-URL: Leaderboard, https://benchd.ai/leaderboard
Keywords: ai,memory,benchmark,llm,agent,longmemeval,locomo,mcp,evaluation
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: click>=8.0
Requires-Dist: pynacl>=1.5.0
Requires-Dist: requests>=2.28.0
Provides-Extra: mem0
Requires-Dist: mem0ai>=0.1.0; extra == "mem0"
Provides-Extra: langchain
Requires-Dist: langchain>=1.0; extra == "langchain"
Requires-Dist: langchain-openai>=1.0; extra == "langchain"
Requires-Dist: langchain-core>=1.0; extra == "langchain"
Provides-Extra: llamaindex
Requires-Dist: llama-index>=0.11; extra == "llamaindex"
Requires-Dist: llama-index-llms-openai>=0.3; extra == "llamaindex"
Provides-Extra: all
Requires-Dist: mem0ai>=0.1.0; extra == "all"
Requires-Dist: langchain>=1.0; extra == "all"
Requires-Dist: langchain-openai>=1.0; extra == "all"
Requires-Dist: langchain-core>=1.0; extra == "all"
Requires-Dist: llama-index>=0.11; extra == "all"
Requires-Dist: llama-index-llms-openai>=0.3; extra == "all"
Requires-Dist: openai>=1.0; extra == "all"
Requires-Dist: litellm>=1.0; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Dynamic: license-file

# Bench'd Harness

The neutral benchmark harness for AI memory systems. Every score is independently run, cryptographically signed, and verifiable by anyone.

**[Leaderboard](https://benchd.ai/leaderboard)** | **[Docs](https://benchd.ai/docs)** | **[Methodology](https://benchd.ai/methodology)** | **[Submit Results](https://benchd.ai/submit)**

## Quick Start

```bash
pip install benchd-harness

# Generate signing keys
benchd keys generate --out ./keys

# Set your LLM API key (for the judge)
export OPENROUTER_API_KEY=sk-or-...

# Run LongMemEval against your MCP-compatible memory system
benchd run -a mcp -b longmemeval-v1 --judge --key ./keys/private.key \
  --adapter-config '{"endpoint": "http://localhost:3000/mcp"}'

# Submit results to the leaderboard
benchd submit ./runs/run_xxx/manifest.signed.json
```

## MCP Systems: Zero-Code Testing

If your memory system exposes an MCP server with `ingest` and `query` tools, you don't need to write any adapter code:

```bash
benchd run -a mcp -b longmemeval-v1 --judge \
  --adapter-config '{"endpoint": "http://localhost:3000/mcp"}'
```

The MCP adapter auto-discovers your tools and maps them to Bench'd's interface.

## Available Benchmarks

| Benchmark | Slug | Questions | What it tests |
|-----------|------|-----------|---------------|
| LongMemEval | `longmemeval-v1` | 500 | Recall, temporal reasoning, knowledge updates |
| LoCoMo | `locomo-v1` | 1,540 | Multi-session conversational memory |
| Smoke | `smoke-memory-v0` | 10 | Quick sanity check |

## Built-in Adapters

| Adapter | System | Install |
|---------|--------|---------|
| `mcp` | Any MCP server | Built-in |
| `mem0-local` | Mem0 OSS | `pip install benchd-harness[mem0]` |
| `langchain-memory` | LangChain | `pip install benchd-harness[langchain]` |
| `llamaindex-memory` | LlamaIndex | `pip install benchd-harness[llamaindex]` |
| `llm-baseline` | Raw LLM (no memory) | `pip install openai` |
| `echo` | Test adapter | Built-in |

## Writing a Custom Adapter

```python
from benchd_harness.adapters.base import BaseAdapter

class MyAdapter(BaseAdapter):
    @property
    def name(self) -> str:
        return "my-system"

    def setup(self) -> None:
        self.client = MyMemoryClient()

    def ingest(self, turns: list[dict]) -> None:
        for turn in turns:
            self.client.add(role=turn["role"], content=turn["content"])

    def recall(self, query: str) -> str:
        return self.client.search(query).text

    def reset(self) -> None:
        self.client.clear()
```

Register in `benchd_harness/adapters/__init__.py` and run with `benchd run -a my-system`.

## Commands

| Command | Description |
|---------|-------------|
| `benchd run` | Run a benchmark against a memory system |
| `benchd submit` | Submit signed results to benchd.ai |
| `benchd verify` | Verify a signed manifest |
| `benchd keys generate` | Generate Ed25519 signing keys |
| `benchd list` | List available adapters and benchmarks |

## Signing & Verification

Every run produces an Ed25519-signed manifest containing all inputs, outputs, scores, and failure traces. Anyone can verify:

```bash
benchd verify ./runs/run_xxx/manifest.signed.json
```

## Current Results (May 2026)

| # | System | LongMemEval | Status |
|---|--------|-------------|--------|
| 1 | LlamaIndex | 59.0% | Verified |
| 1 | LangChain | 59.0% | Verified |
| 3 | LLM Baseline | 57.6% | Verified |
| 4 | Mem0 OSS | 32.4% | Verified |

Full results at [benchd.ai/leaderboard](https://benchd.ai/leaderboard).

## Links

- **Website**: [benchd.ai](https://benchd.ai)
- **Leaderboard**: [benchd.ai/leaderboard](https://benchd.ai/leaderboard)
- **Docs**: [benchd.ai/docs](https://benchd.ai/docs)
- **Submit**: [benchd.ai/submit](https://benchd.ai/submit)
