Metadata-Version: 2.4
Name: june-bench
Version: 0.0.4
Summary: Reproducible benchmark suite for memory/QA systems — June + pluggable competitors.
Author: Junemind
License: MIT
Project-URL: Homepage, https://github.com/Junemind
Keywords: benchmark,rag,memory,qa,retrieval,evaluation,llm,june
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: License :: OSI Approved :: MIT License
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: june-api
Requires-Dist: httpx>=0.24; extra == "june-api"
Provides-Extra: june-local
Requires-Dist: june>=0.1.0; extra == "june-local"
Provides-Extra: cognee
Requires-Dist: cognee[evals]; extra == "cognee"
Provides-Extra: all
Requires-Dist: httpx>=0.24; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=8; extra == "dev"
Dynamic: license-file

# june-bench

A pip-installable, **reproducible** benchmark suite for memory / QA systems — **June + pluggable
competitors** — over LoCoMo, LongMemEval, HotpotQA/2Wiki/MuSiQue, and FinanceBench, with the same
data and the same scorer.

```bash
pip install june-bench
june-bench list
june-bench run --system echo --dataset smoke --split smoke    # offline, no key, no download
```

A benchmark is `run(system, dataset) → records → score`. Two typed ports are the only extension
points:

* **`System`** — the thing benchmarked. `JuneApiSystem` (default; a thin HTTP client to June's
  `/v1/answer`, so **no June source is shipped**), `JuneLocalSystem` (`[june-local]` extra; a
  source-protected compiled wheel), `CogneeSystem` (`[cognee]` extra), or any future system as one
  adapter.
* **`Dataset`** — what it runs on. The four benchmarks behind a registry.

The scorer is the canonical SQuAD/HotpotQA EM/F1 + selective-accuracy/coverage/cost — Cognee-comparable.
Tiny **smoke fixtures ship in the wheel** (offline wiring proof); full splits are **fetched, sha-verified,
from a pinned release**. No score is ever baked into the package — every result row records
dataset + scorer + system + model + cost, so a published number is reproducible by a stranger.

Status: **SB0** (contracts + no-deps smoke + skeleton). Datasets, June/Cognee systems, and the full CLI
land in SB1–SB6.
