an evaluation harness for RAG chunking

Stop guessing where your documents should be cut.

Every RAG pipeline splits documents into chunks before it can retrieve anything — and that one decision, made once and rarely revisited, quietly caps how good retrieval can ever be. chunkbench runs your candidate chunking strategies against the same corpus and the same real questions, and reports back which one actually finds the right answer — recall, precision, and cost, side by side, not vibes.

$ pip install chunkbench-rag

Example comparison, recall per approach:
whole_section: × recall 0.75
paragraph: × recall 0.75
chonkie_recursive: recall 1.00

Legend: ✓ question answered from a retrieved chunk; × answer never retrieved

The problem

Most teams pick a chunk size once and never test it again.

Split too coarse and a retriever pulls back noise along with the answer. Split too fine and the answer gets severed from the context that made it true. Somewhere between those failure modes is a boundary that actually works for your documents and your users' questions — and the only way to find it is to measure, not guess.

chunkbench doesn't chunk your documents, pick your embedding model, or talk you into its favorite LLM. It takes chunks from whatever you're already using or considering, runs them through the same corpus and the same golden questions, and tells you — honestly — which one is actually working.

How it thinks

Four ideas the whole tool is built on

01

A golden question set is the ground truth

Nothing is measurable without real questions, known-correct answers, and known-correct source sections. Building this set well matters more than any setting in the tool.

02

Relevance lives above the chunk

Different strategies draw chunk boundaries in different places, so golden questions are annotated at the section level — chunkbench resolves which chunks cover a relevant section at scoring time, whatever produced them.

03

Quality without cost is a misleading number

A strategy that scores marginally higher but costs three times as much to produce isn't automatically better. Every report pairs quality with time and dollar cost, in the same table.

04

Every dependency is swappable

The corpus, the embedding model, the chunking approach, the judging LLM — all things you bring. chunkbench supplies the methodology, not an opinion about which of those choices is correct.
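
Idea 02 can be made concrete with a small sketch. It assumes each retrieved chunk carries the label of the section it was cut from, and each golden question lists the sections that hold its answer. Recall at k is then counted over sections, so the score stays comparable wherever a strategy drew its boundaries. `RetrievedChunk` and `section_recall_at_k` are illustrative names, not chunkbench's own API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievedChunk:
    """Hypothetical stand-in for a retrieved chunk; not chunkbench's own type."""
    id: str
    section: str  # label of the section this chunk was cut from


def section_recall_at_k(
    retrieved: list[RetrievedChunk],
    relevant_sections: set[str],
    k: int,
) -> float:
    """Fraction of relevant sections covered by at least one of the top-k chunks."""
    if not relevant_sections:
        return 0.0
    covered = {chunk.section for chunk in retrieved[:k]}
    return len(covered & relevant_sections) / len(relevant_sections)


# Two strategies cut the same document differently, yet are scored the same way.
coarse = [
    RetrievedChunk("doc-install-0", "install"),
    RetrievedChunk("doc-faq-0", "faq"),
]
fine = [
    RetrievedChunk("doc-faq-3", "faq"),
    RetrievedChunk("doc-faq-1", "faq"),
    RetrievedChunk("doc-install-2", "install"),
]
relevant = {"install"}

coarse_score = section_recall_at_k(coarse, relevant, k=2)  # "install" is in the top 2
fine_score = section_recall_at_k(fine, relevant, k=2)      # "install" is ranked third
```

Scoring on section labels rather than chunk ids is what lets one golden set serve every chunking strategy without re-annotation.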

Under the hood

Six cacheable stages, each keyed to disk

01

chunk

every configured strategy runs against the corpus

02

embed

chunks embedded with your embedding function

03

retrieve

golden questions matched against each approach's index

04

generate

optional — an answer produced from retrieved context

05

score

every metric computed per approach, per question

06

report

scored results rendered as a Python object, Markdown, or JSON
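
One way to picture "cacheable, keyed to disk" is a content-addressed cache: hash a stage's name and inputs, and reuse the stored result when the same key appears again. The sketch below is a generic illustration under that assumption. `cache_key`, `cached_stage`, and the directory layout are hypothetical and say nothing about how chunkbench actually stores its cache.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


def cache_key(stage: str, inputs: dict) -> str:
    """Hash the stage name plus its JSON-serializable inputs into a stable key."""
    payload = json.dumps({"stage": stage, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_stage(cache_dir: Path, stage: str, inputs: dict, compute: Callable[[], object]):
    """Return the stored result for (stage, inputs); compute and store it on a miss."""
    path = cache_dir / stage / f"{cache_key(stage, inputs)}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result), encoding="utf-8")
    return result


calls = []


def fake_embed():
    """Pretend embedding call that records how often it really ran."""
    calls.append(1)
    return [[0.1, 0.2], [0.3, 0.4]]


with tempfile.TemporaryDirectory() as tmp:
    inputs = {"model": "toy", "texts": ["alpha", "beta"]}
    first = cached_stage(Path(tmp), "embed", inputs, fake_embed)
    second = cached_stage(Path(tmp), "embed", inputs, fake_embed)  # served from disk
    # Changing any input changes the key, so this one is recomputed.
    third = cached_stage(Path(tmp), "embed", {**inputs, "model": "other"}, fake_embed)
```

Keying each stage on its inputs means that changing only the chunker would not force a re-run of embeddings for strategies whose chunks did not change.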

Bring your own everything

No base class. Wrap a function, hand it over.

There is exactly one base class chunkbench requires you to inherit from: none. Every extension point is a plain function shape.

Contract | Shape | Plugs in
ChunkSource | (Document) -> list[Chunk] | any chunking library: chonkie, your own splitter, an LLM-based chunker
Embedder | (list[str]) -> list[Vector] | OpenAI, Gemini, Cohere, Voyage, sentence-transformers, an in-house model
Generator | (str, Sequence[str]) -> str | whatever LLM you already call to answer a question from context
Judge | (str, str) -> float | an LLM-as-judge function, or the built-in dependency-free default

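
Written out as Python type aliases, the four shapes look roughly like the sketch below. The `Document` and `Chunk` classes here are local stand-ins that carry only the fields used elsewhere on this page, and `Vector` is assumed to be a plain list of floats. The real types come from `chunkbench` and may differ. `paragraph_split` and `first_context` are hypothetical examples of functions that satisfy two of the shapes.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Vector = list[float]  # assumption: a vector is a plain list of floats


@dataclass
class Document:
    """Stand-in with the two attributes the examples on this page read."""
    id: str
    content: str


@dataclass
class Chunk:
    """Stand-in with the four fields the examples on this page set."""
    id: str
    doc_id: str
    section: str
    text: str


# The four contracts, written as type aliases.
ChunkSource = Callable[[Document], list[Chunk]]
Embedder = Callable[[list[str]], list[Vector]]
Generator = Callable[[str, Sequence[str]], str]
Judge = Callable[[str, str], float]


def paragraph_split(document: Document) -> list[Chunk]:
    """A ChunkSource: one chunk per blank-line-separated paragraph."""
    paragraphs = [p.strip() for p in document.content.split("\n\n") if p.strip()]
    return [
        Chunk(id=f"{document.id}-{i}", doc_id=document.id, section="body", text=p)
        for i, p in enumerate(paragraphs)
    ]


def first_context(question: str, contexts: Sequence[str]) -> str:
    """A Generator: no LLM involved, it just echoes the top retrieved context."""
    return contexts[0] if contexts else ""


source: ChunkSource = paragraph_split
generator: Generator = first_context

chunks = source(Document(id="guide", content="Install it.\n\nThen run it."))
answer = generator("How do I start?", [c.text for c in chunks])
```

Because each contract is only a call signature, a closure, a bound method, or a `functools.partial` would fit just as well as a top-level function.
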
chunking, with chonkie
from chonkie import RecursiveChunker
from chunkbench import Chunk, Document

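def chonkie_chunker(document: Document) -> list[Chunk]:
    chunker = RecursiveChunker()
    chunks = []
    # _sections is a helper not shown here; it is expected to yield
    # (slug, text) pairs, one per section of the document.
    for slug, text in _sections(document.content):
        # Chunk each section on its own so every chunk keeps its section label.
        for i, piece in enumerate(chunker(text)):
            chunks.append(Chunk(
                id=f"{document.id}-{slug}-{i}",
                doc_id=document.id,
                section=slug,
                text=piece.text,
            ))
    return chunks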
embedding, with Gemini
from google import genai
from chunkbench import Embedder, Vector

def gemini_embedder(
    model: str = "gemini-embedding-001",
) -> Embedder:
    client = genai.Client()

    def embed(texts: list[str]) -> list[Vector]:
        r = client.models.embed_content(
            model=model, contents=texts
        )
        return [e.values for e in r.embeddings]

    return embed

Neither chonkie nor google-genai is a chunkbench dependency — install what you actually use. Full runnable versions, plus a Gemini judge, live in docs/providers.md.
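
For trying the contracts without any provider at all, an embedder and a judge can be written with the standard library alone. The sketch below is one hypothetical way to do that: a hashed bag-of-words embedder and a token-overlap F1 judge. Neither is claimed to match chunkbench's built-in default, and the judge's argument order (answer, then reference) is an assumption.

```python
import hashlib
import math
import re
from collections import Counter

Vector = list[float]  # assumption: a vector is a plain list of floats


def _tokens(text: str) -> list[str]:
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def hashing_embedder(dim: int = 64):
    """Build an Embedder-shaped function: hashed bag-of-words, L2-normalized."""
    def embed(texts: list[str]) -> list[Vector]:
        vectors = []
        for text in texts:
            vec = [0.0] * dim
            for token in _tokens(text):
                # sha256 gives a bucket that is stable across runs, unlike hash().
                digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
                vec[int(digest, 16) % dim] += 1.0
            norm = math.sqrt(sum(v * v for v in vec)) or 1.0
            vectors.append([v / norm for v in vec])
        return vectors
    return embed


def token_f1_judge(answer: str, reference: str) -> float:
    """A Judge-shaped function: token-overlap F1 between answer and reference."""
    a, r = Counter(_tokens(answer)), Counter(_tokens(reference))
    overlap = sum((a & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(a.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


embed = hashing_embedder(dim=32)
vectors = embed(["chunk size matters", "chunk size matters", ""])
score_same = token_f1_judge("Paris is the capital", "paris is the capital")
score_none = token_f1_judge("Berlin", "paris is the capital")
```

A toy embedder like this is only useful for wiring and smoke tests; retrieval quality measured with it says little about what a real embedding model would retrieve.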

60-second quickstart

One call, a report back

python
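# from examples/quickstart/quickstart.py; the full file runs unmodified,
# no API key, no network call
# toy_embedder, whole_section_chunker, and paragraph_chunker are defined
# or imported in that file and omitted from this excerpt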
from chunkbench import run_comparison
from chunkbench.corpus import directory_corpus_loader

report = run_comparison(
    corpus=directory_corpus_loader("corpus", extensions=(".md",)),
    embedder=toy_embedder,          # any Embedder
    golden_set="golden_qa.yaml",
    chunk_sources={
        "whole_section": whole_section_chunker,
        "paragraph": paragraph_chunker,
    },
    k=2,
)

report.to_markdown("report.md")
report.to_json("report.json")
output
$ python examples/quickstart/quickstart.py
whole_section: recall@2=1.00
paragraph: recall@2=1.00
Wrote examples/quickstart/report.md and examples/quickstart/report.json

For finer control — running only part of the pipeline, or scoring a custom metric — Pipeline and the metric registry are the composable, lower-level API. See docs/api-stability.md for exactly which extension points carry a semver guarantee.
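
As an illustration of what a custom metric could compute, here is mean reciprocal rank over section labels, written as plain functions. This is only the scoring logic. How such a function is registered with the metric registry is not shown on this page, so the sketch does not attempt it, and the function names and input shapes are assumptions.

```python
def reciprocal_rank(retrieved_sections: list[str], relevant_sections: set[str]) -> float:
    """1 / rank of the first relevant hit, or 0.0 if nothing relevant was retrieved."""
    for rank, section in enumerate(retrieved_sections, start=1):
        if section in relevant_sections:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(per_question: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank across questions; 0.0 for an empty set."""
    if not per_question:
        return 0.0
    total = sum(reciprocal_rank(got, want) for got, want in per_question)
    return total / len(per_question)


# Each entry: (sections of the retrieved chunks in rank order, relevant sections).
runs = [
    (["faq", "install"], {"install"}),  # first hit at rank 2
    (["install", "faq"], {"install"}),  # first hit at rank 1
    (["faq", "api"], {"install"}),      # no hit
]
mrr = mean_reciprocal_rank(runs)
```

Unlike recall at k, this rewards a strategy for ranking the right section first rather than merely including it somewhere in the top k.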

Command line

The same pipeline, wired into CI

bash
# config-file-driven
chunkbench run --config chunkbench.yaml

# flag-driven, for one-off use — --chunkers/--embedder/--generator/--judge
# take 'module:attribute' import strings, since chunkbench ships no
# chunking algorithms or provider integrations itself
chunkbench run \
  --corpus ./docs --golden golden_qa.yaml \
  --chunkers whole_section=mypkg:whole_section,semantic=mypkg:semantic \
  --embedder mypkg.providers:gemini_embedder --k 5

# re-render a previous run's results.json in another format
chunkbench report --from results.json --format html

A regression_gate section in the config file makes the chunkbench run command exit with a non-zero status when a metric drops below a threshold, for example "fail if recall_at_k for semantic drops below 0.8." Drop it into CI as a quality gate on chunking changes instead of finding out in production.
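
The logic behind such a gate is small enough to sketch. The version below assumes results shaped as a mapping from approach to metric to value, and gates given as dictionaries with `approach`, `metric`, and `min` keys. Both shapes are invented for illustration and are not chunkbench's results.json schema or config format.

```python
def evaluate_gates(results: dict, gates: list[dict]) -> list[str]:
    """Return one message per violated gate; an empty list means every gate passed."""
    failures = []
    for gate in gates:
        value = results[gate["approach"]][gate["metric"]]
        if value < gate["min"]:
            failures.append(
                f"{gate['approach']}.{gate['metric']} = {value:.2f} "
                f"is below {gate['min']:.2f}"
            )
    return failures


# Hypothetical numbers, chosen only to trigger the gate.
results = {
    "whole_section": {"recall_at_k": 0.91},
    "semantic": {"recall_at_k": 0.76},
}
gates = [{"approach": "semantic", "metric": "recall_at_k", "min": 0.8}]

failures = evaluate_gates(results, gates)
exit_code = 1 if failures else 0  # a CI wrapper would pass this to sys.exit()
```

Returning messages rather than exiting inside the check keeps the function testable and lets the caller decide how to report a failure.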

What you get back

One report, three forms

Python object

Iterable and indexable per approach and per question — feed it straight into your own tooling or analysis.

Markdown

A human-readable comparison table, ready to paste into a design doc or a pull request description.

JSON

The stable integration point — versioned schema, meant for a regression gate, a dashboard, or any downstream tool.
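
Since every report pairs quality with cost, one thing a downstream tool might do with the data is drop any approach that another approach beats on both axes. The sketch below assumes rows with `approach`, `recall`, and `cost_usd` keys. That row shape and the numbers are made up for illustration and are not the report's actual schema.

```python
def pareto_front(rows: list[dict]) -> list[str]:
    """Names of approaches that no other approach beats on both recall and cost."""
    front = []
    for row in rows:
        dominated = any(
            other["recall"] >= row["recall"]
            and other["cost_usd"] <= row["cost_usd"]
            and (other["recall"] > row["recall"] or other["cost_usd"] < row["cost_usd"])
            for other in rows
        )
        if not dominated:
            front.append(row["approach"])
    return front


# Invented rows: strategy_b matches strategy_a on recall but costs more.
rows = [
    {"approach": "strategy_a", "recall": 0.75, "cost_usd": 0.10},
    {"approach": "strategy_b", "recall": 0.75, "cost_usd": 0.12},
    {"approach": "strategy_c", "recall": 0.80, "cost_usd": 0.36},
]
front = pareto_front(rows)
```

What remains is the set of real trade-offs, here a cheaper option and a more accurate one, which is a decision for a person rather than a metric.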

Documentation

Where the rest of this lives