Metadata-Version: 2.4
Name: chunkbench-rag
Version: 0.1.0
Summary: A chunk-source-agnostic evaluation harness for RAG chunking strategies
Project-URL: Homepage, https://github.com/ghassenov/chunkbench
Author: Ghassen Naouar
License-Expression: MIT
License-File: LICENSE
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Typing :: Typed
Requires-Python: >=3.12
Requires-Dist: numpy>=1.26
Requires-Dist: pydantic>=2.7
Requires-Dist: pyyaml>=6.0
Provides-Extra: dev
Requires-Dist: build>=1.2; extra == 'dev'
Requires-Dist: mypy>=1.11; extra == 'dev'
Requires-Dist: openai>=1.30; extra == 'dev'
Requires-Dist: pre-commit>=3.7; extra == 'dev'
Requires-Dist: pytest-cov>=5.0; extra == 'dev'
Requires-Dist: pytest>=8.2; extra == 'dev'
Requires-Dist: ruff>=0.6; extra == 'dev'
Requires-Dist: twine>=5.1; extra == 'dev'
Requires-Dist: types-pyyaml>=6.0; extra == 'dev'
Provides-Extra: openai
Requires-Dist: openai>=1.30; extra == 'openai'
Description-Content-Type: text/markdown

<p align="center">
  <img src="docs/assets/logo.svg" alt="chunkbench" width="480">
</p>

<p align="center">
  <a href="https://pypi.org/project/chunkbench-rag/"><img src="https://img.shields.io/pypi/v/chunkbench-rag.svg" alt="PyPI version"></a>
  <a href="https://github.com/ghassenov/chunkbench/blob/main/LICENSE"><img src="https://img.shields.io/badge/license-MIT-blue.svg" alt="MIT license"></a>
  <img src="https://img.shields.io/badge/python-3.12%20%7C%203.13-blue.svg" alt="Python 3.12 | 3.13">
  <img src="https://img.shields.io/badge/types-strict%20mypy-informational.svg" alt="mypy --strict">
</p>

<p align="center"><em>Every RAG tutorial picks a chunk size, shrugs, and moves on. chunkbench is what happens after you stop shrugging.</em></p>

---

## The problem, in one sentence

You split your documents into chunks somehow — fixed size, paragraphs, a chunking library's default recipe, vibes — and that one decision quietly determines whether your retriever can ever find the right answer. Most teams never measure it. They just ship the first thing that seemed to work on three test queries and hope.

chunkbench replaces the hoping with a number. Feed it a corpus, a handful of chunking strategies, and a set of real questions with known-correct answers, and it tells you — with recall, precision, and cost figures side by side — which strategy actually retrieves the right information, instead of which one merely *feels* right.

It does **not** chunk your documents. It does **not** pick your embedding model. It does **not** talk you into using its favorite LLM. Those are your calls, made with your tools — chunkbench just tells you, honestly, whether the call you made was any good. Think of it less as a library and more as the friend who actually reads the whole receipt before saying "yeah, that seems fair."

Full design rationale — why golden questions live at the section level, what each metric actually measures, and the exact list of things chunkbench deliberately refuses to do — lives in [`docs/chunkbench.md`](docs/chunkbench.md).
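
To make the two headline metrics concrete, here is a minimal sketch of section-level recall@k and precision@k. It is an illustration, not chunkbench's implementation: the function names are hypothetical, and it assumes each retrieved chunk carries a section label that is compared against the golden sections for a question. The design doc has the definitions chunkbench actually uses.

```python
def recall_at_k(retrieved_sections: list[str], golden_sections: set[str], k: int) -> float:
    """Fraction of golden sections that appear among the top-k retrieved chunks."""
    if not golden_sections:
        return 0.0
    top_k = set(retrieved_sections[:k])
    return len(top_k & golden_sections) / len(golden_sections)


def precision_at_k(retrieved_sections: list[str], golden_sections: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that come from a golden section."""
    top_k = retrieved_sections[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for section in top_k if section in golden_sections)
    return hits / len(top_k)


# Hypothetical ranking: the section each retrieved chunk came from, best first.
ranking = ["install", "pricing", "install", "faq"]
golden = {"install", "faq"}

print(recall_at_k(ranking, golden, k=2))     # 0.5
print(precision_at_k(ranking, golden, k=2))  # 0.5
```

The two numbers pull in different directions: smaller chunks tend to raise precision while making it easier to miss a golden section entirely, which is why they are worth reading side by side.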

## Install

```bash
pip install chunkbench-rag
```

(The PyPI distribution is `chunkbench-rag` — `chunkbench` alone was too close to an existing project's name — but the import and the CLI command are both still plain `chunkbench`.)

The core install pulls in three direct dependencies (`pydantic`, `pyyaml`, `numpy`) and nothing else: no embedding SDK, no LLM SDK, no chunking library, because chunkbench isn't going to make those decisions for you. One convenience extra ships with it:

```bash
# Quoted so shells like zsh don't try to glob the square brackets.
pip install "chunkbench-rag[openai]"   # adds chunkbench.embedding.providers.openai
                                       # and chunkbench.generation.providers.openai
```

Using something else — chonkie, Gemini, Cohere, a model you trained in your garage — see [Bring your own everything](#bring-your-own-everything) below. No extra required; it's a ~15-line function either way.

## 60-second quickstart

```python
from chunkbench import run_comparison
from chunkbench.corpus import directory_corpus_loader

report = run_comparison(
    corpus=directory_corpus_loader("examples/quickstart/corpus", extensions=(".md",)),
    embedder=toy_embedder,               # any Embedder — see below
    golden_set="examples/quickstart/golden_qa.yaml",
    chunk_sources={
        "whole_section": whole_section_chunker,
        "paragraph": paragraph_chunker,
    },
    k=2,
)

report.to_markdown("report.md")
report.to_json("report.json")
```

`toy_embedder`, `whole_section_chunker`, and `paragraph_chunker` are not part of the package; they are tiny example functions defined in [`examples/quickstart/quickstart.py`](examples/quickstart/quickstart.py). That script runs this same comparison as-is, with no API key and no network call:

```bash
python examples/quickstart/quickstart.py
```

```
whole_section: recall@2=1.00
paragraph: recall@2=1.00
Wrote examples/quickstart/report.md and examples/quickstart/report.json
```

The embedder there is a dependency-free hashing stand-in, good for proving the plumbing works and not much else. Swap it for something real before trusting the numbers.
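
For a sense of what a hashing stand-in of this kind can look like, here is a rough sketch. It is not the `toy_embedder` from the quickstart script; the name `hashing_embedder`, the `dim` parameter, and the assumption that a `Vector` is a plain list of floats are all illustrative.

```python
import hashlib
import math

Vector = list[float]  # assumption: stands in for chunkbench's Vector type


def hashing_embedder(dim: int = 64):
    """Return an embed(texts) callable that hashes tokens into a fixed-size vector."""

    def embed(texts: list[str]) -> list[Vector]:
        vectors: list[Vector] = []
        for text in texts:
            vec = [0.0] * dim
            for token in text.lower().split():
                # A stable hash (unlike built-in hash()) keeps runs reproducible.
                digest = hashlib.sha256(token.encode("utf-8")).digest()
                bucket = int.from_bytes(digest[:4], "big") % dim
                vec[bucket] += 1.0
            # L2-normalise so a dot product behaves like cosine similarity.
            norm = math.sqrt(sum(x * x for x in vec)) or 1.0
            vectors.append([x / norm for x in vec])
        return vectors

    return embed
```

Texts that share words land in the same buckets and so score as similar, but nothing here understands synonyms or word order. That is enough to exercise the pipeline end to end and no more.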

## Bring your own everything

There is exactly one base class in chunkbench you're required to inherit from: none. `ChunkSource`, `Embedder`, `Generator`, and `Judge` are all plain function shapes (`Callable[...]`) — wrap whatever you already use and hand it over.

**Chunking, with [chonkie](https://docs.chonkie.ai):**

```python
from chonkie import RecursiveChunker
from chunkbench import Chunk, Document

def chonkie_chunker(document: Document) -> list[Chunk]:
    chunker = RecursiveChunker()
    chunks = []
    for slug, section_text in _sections(document.content):   # your own section splitter
        for i, piece in enumerate(chunker(section_text)):
            chunks.append(Chunk(
                id=f"{document.id}-{slug}-{i}", doc_id=document.id,
                section=slug, text=piece.text,
            ))
    return chunks
```

**Embedding, with Gemini (`gemini-embedding-001`):**

```python
from google import genai
from chunkbench import Embedder, Vector

def gemini_embedder(model: str = "gemini-embedding-001") -> Embedder:
    client = genai.Client()
    def embed(texts: list[str]) -> list[Vector]:
        return [e.values for e in client.models.embed_content(model=model, contents=texts).embeddings]
    return embed
```

```python
from chunkbench import run_comparison

report = run_comparison(
    corpus=my_corpus_loader,
    embedder=gemini_embedder(),
    chunk_sources={"chonkie_recursive": chonkie_chunker},
    golden_set="golden_qa.yaml",
    k=5,
)
```

Neither `chonkie` nor `google-genai` is a chunkbench dependency — install what you need yourself. Full runnable versions, plus the same pattern applied to a judge model, live in [`docs/providers.md`](docs/providers.md) and [`examples/providers/`](examples/providers/). Swap in Cohere, Voyage, `sentence-transformers`, or an in-house model gateway the same way — chunkbench genuinely does not care.
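
Because an embedder is just a callable, wrappers compose the same way. Many hosted embedding APIs cap how many inputs a single request may carry, so a small batching wrapper is a common addition. The sketch below is hypothetical: `batched` and its default `batch_size` are made up for illustration, the type aliases restate the callable shape locally instead of importing it, and whether chunkbench already batches calls internally is not something this README states.

```python
from collections.abc import Callable

Vector = list[float]
Embedder = Callable[[list[str]], list[Vector]]  # local restatement of the callable shape


def batched(embed: Embedder, batch_size: int = 100) -> Embedder:
    """Wrap an embedder so it never receives more than batch_size texts per call."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    def embed_in_batches(texts: list[str]) -> list[Vector]:
        vectors: list[Vector] = []
        for start in range(0, len(texts), batch_size):
            # Output order follows input order, one vector per text.
            vectors.extend(embed(texts[start:start + batch_size]))
        return vectors

    return embed_in_batches
```

With the Gemini example above this would be passed as `embedder=batched(gemini_embedder(), batch_size=...)`, using whatever limit your provider documents.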

## The composable API

For finer control — running only part of the pipeline, or scoring a custom metric:

```python
from chunkbench import Pipeline, registry

@registry.metric("my_custom_metric")
class MyMetric:
    def score(self, retrieved, golden) -> float:
        ...

pipeline = Pipeline(embedder=my_embed_function, golden_set=my_golden_set)
chunks = pipeline.run_chunking(corpus, chunk_source=my_semantic_chunker)
results = pipeline.run_retrieval(chunks, k=5)
scores = pipeline.score(results, metrics=["recall", "precision", "my_custom_metric"])
```

[`docs/api-stability.md`](docs/api-stability.md) names exactly which extension points (chunk-source contract, metric registry, embedder/vector-store interfaces) carry a semver stability guarantee — the short version: the things listed above, forever; the internals, whenever we find a better way.
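
As an example of what might go inside a custom metric's `score` method, here is a reciprocal-rank sketch. The `score(self, retrieved, golden)` signature comes from the snippet above; everything about the argument types is an assumption. `FakeChunk`, `FakeGolden`, and the `sections` field are hypothetical stand-ins so the example runs on its own, and the registry decorator is left off for the same reason.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeChunk:
    """Stand-in for a retrieved chunk; only the section label matters here."""
    section: str


@dataclass(frozen=True)
class FakeGolden:
    """Stand-in for a golden question; `sections` is an assumed field name."""
    sections: frozenset[str]


class ReciprocalRank:
    """1 / rank of the first retrieved chunk from a golden section, else 0.0."""

    def score(self, retrieved: list[FakeChunk], golden: FakeGolden) -> float:
        for rank, chunk in enumerate(retrieved, start=1):
            if chunk.section in golden.sections:
                return 1.0 / rank
        return 0.0


metric = ReciprocalRank()
hit_second = metric.score(
    [FakeChunk("pricing"), FakeChunk("install")],
    FakeGolden(frozenset({"install"})),
)
print(hit_second)  # 0.5
```

Unlike recall, this rewards a strategy for putting the right section first, which matters when only the top result or two reaches the generator. To use it for real, adapt the attribute access to the actual chunk and golden-question types and register the class with `@registry.metric(...)` as shown above.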

## CLI

```bash
# Config-file-driven — see docs/chunkbench.md for the full schema.
chunkbench run --config chunkbench.yaml

# Flag-driven, for one-off use. --chunkers/--embedder/--generator/--judge
# all take 'module:attribute' import strings — chunkbench doesn't ship
# chunking algorithms or provider integrations, so these point at your
# own code, importable from wherever you run the command.
chunkbench run \
  --corpus ./docs \
  --golden golden_qa.yaml \
  --chunkers whole_section=mypkg.chunkers:whole_section,semantic=mypkg.chunkers:semantic \
  --embedder mypkg.providers:gemini_embedder \
  --k 5

# Re-render a previous run's results.json in another format.
chunkbench report --from results.json --format html
```

A `regression_gate` section in the config file makes `chunkbench run` exit non-zero when a metric drops below a threshold ("fail if `recall_at_k` for `semantic` drops below 0.8") — drop it into CI as a quality gate on chunking changes instead of finding out in production.
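
The logic behind such a gate is small. The sketch below shows the general idea only: the nested-dict shapes for `scores` and `thresholds` are hypothetical and do not reflect chunkbench's config schema or its `results.json` layout, both of which are documented separately.

```python
def failed_gates(
    scores: dict[str, dict[str, float]],
    thresholds: dict[str, dict[str, float]],
) -> list[str]:
    """Return one message per (approach, metric) whose score is below its minimum."""
    failures: list[str] = []
    for approach, minimums in thresholds.items():
        for metric, minimum in minimums.items():
            actual = scores.get(approach, {}).get(metric)
            if actual is None:
                # A gated metric that was never computed counts as a failure.
                failures.append(f"{approach}.{metric}: missing from results")
            elif actual < minimum:
                failures.append(f"{approach}.{metric}: {actual:.2f} < {minimum:.2f}")
    return failures


# Hypothetical numbers, for illustration only.
scores = {"semantic": {"recall_at_k": 0.74}, "whole_section": {"recall_at_k": 0.91}}
thresholds = {"semantic": {"recall_at_k": 0.8}}

failures = failed_gates(scores, thresholds)
exit_code = 1 if failures else 0
print(failures, exit_code)
```

Treating a missing metric as a failure is a deliberate choice in this sketch: a gate that silently passes when the metric disappears from the results is not much of a gate.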

## What you get back

A `Report`, in three flavors: a Python object (iterable/indexable per approach and per question), Markdown (drop into a PR description), and JSON (the stable integration point — schema pinned in [`docs/report-schema.json`](docs/report-schema.json)).

## Documentation

- [`docs/chunkbench.md`](docs/chunkbench.md) — the full design doc: core idea, what's measured and why, pipeline stages, what chunkbench deliberately doesn't do.
- [`docs/providers.md`](docs/providers.md) — wiring in chonkie, Gemini, or any other chunking/embedding/LLM provider.
- [`docs/api-stability.md`](docs/api-stability.md) — which extension points carry a semver guarantee.
- [`docs/report-schema.json`](docs/report-schema.json) — JSON Schema for `results.json`.
- [`CONTRIBUTING.md`](CONTRIBUTING.md) — dev setup, checks, and code style.
- [`CHANGELOG.md`](CHANGELOG.md) — release history.

## License

MIT — see [`LICENSE`](LICENSE).
