an evaluation harness for RAG chunking
Every RAG pipeline splits documents into chunks before it can retrieve anything — and that one decision, made once and rarely revisited, quietly caps how good retrieval can ever be. chunkbench runs your candidate chunking strategies against the same corpus and the same real questions, and reports back which one actually finds the right answer — recall, precision, and cost, side by side, not vibes.
✓ question answered from a retrieved chunk
× answer never retrieved
The problem
Split too coarse and a retriever pulls back noise along with the answer. Split too fine and the answer gets severed from the context that made it true. Somewhere between those failure modes is a boundary that actually works for your documents and your users' questions — and the only way to find it is to measure, not guess.
chunkbench doesn't chunk your documents, pick your embedding model, or talk you into its favorite LLM. It takes chunks from whatever you're already using or considering, runs them through the same corpus and the same golden questions, and tells you — honestly — which one is actually working.
How it thinks
Nothing is measurable without real questions, known-correct answers, and known-correct source sections. Building this set well matters more than any setting in the tool.
Different strategies draw chunk boundaries in different places, so golden questions are annotated at the section level — chunkbench resolves which chunks cover a relevant section at scoring time, whatever produced them.
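A minimal sketch of how section-level annotation can be resolved at scoring time, assuming each chunk records the document and section it was cut from. The `Chunk` dataclass and `recall_at_k` function here are illustrative stand-ins written for this example, not chunkbench's own implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Illustrative stand-in; not chunkbench's real Chunk class."""
    id: str
    doc_id: str
    section: str
    text: str


def recall_at_k(
    retrieved: list[Chunk],
    relevant_sections: set[tuple[str, str]],
    k: int,
) -> float:
    """Fraction of relevant (doc_id, section) pairs covered by the top-k chunks."""
    if not relevant_sections:
        return 0.0
    covered = {(c.doc_id, c.section) for c in retrieved[:k]}
    return len(covered & relevant_sections) / len(relevant_sections)


# Two strategies cut the same section differently, yet both are scored
# against the same section-level annotation.
gold = {("handbook", "refunds")}
coarse = [Chunk("handbook-refunds-0", "handbook", "refunds", "...")]
fine = [
    Chunk("handbook-intro-0", "handbook", "intro", "..."),
    Chunk("handbook-refunds-3", "handbook", "refunds", "..."),
]
print(recall_at_k(coarse, gold, k=1))  # 1.0
print(recall_at_k(fine, gold, k=1))    # 0.0: the relevant chunk is ranked second
print(recall_at_k(fine, gold, k=2))    # 1.0
```

Because the gold label names a section and not a chunk id, the golden set never has to be re-annotated when a new chunking strategy is added.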
A strategy that scores marginally higher but costs three times as much to produce isn't automatically better. Every report pairs quality with time and dollar cost, in the same table.
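As a rough illustration of what pairing quality with cost involves, the sketch below times a step and estimates an embedding bill. Both helpers are hypothetical, the word-count token approximation is crude, and the per-token price is a made-up number; none of this is taken from chunkbench itself.

```python
import time
from typing import Callable


def timed(fn: Callable[[], object]) -> tuple[object, float]:
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def embedding_cost_usd(texts: list[str], usd_per_1k_tokens: float) -> float:
    """Rough dollar estimate; approximates tokens as whitespace-separated words."""
    tokens = sum(len(t.split()) for t in texts)
    return tokens / 1000 * usd_per_1k_tokens


chunks = ["refunds are accepted within thirty days"] * 500  # 6 words each
_, seconds = timed(lambda: [c.upper() for c in chunks])     # stand-in for real work
cost = embedding_cost_usd(chunks, usd_per_1k_tokens=0.02)   # hypothetical price
print(f"{seconds:.4f}s  ${cost:.4f}")
```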
The corpus, the embedding model, the chunking approach, the judging LLM — all things you bring. chunkbench supplies the methodology, not an opinion about which of those choices is correct.
Under the hood
1. every configured strategy runs against the corpus
2. chunks embedded with your embedding function
3. golden questions matched against each approach's index
4. optional: an answer produced from retrieved context
5. every metric computed per approach, per question
6. scored results rendered to Python, Markdown, or JSON
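The retrieval stage in the middle of that pipeline can be pictured as a nearest-neighbour lookup over the embedded chunks. The sketch below assumes cosine similarity over plain Python lists and a dictionary as the index; chunkbench's actual index and similarity measure are not described here and may differ.

```python
import math

Vector = list[float]  # stand-in alias for this sketch


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Vector, index: dict[str, Vector], k: int) -> list[str]:
    """Return the ids of the k chunks most similar to the query vector."""
    ranked = sorted(index, key=lambda cid: cosine(query, index[cid]), reverse=True)
    return ranked[:k]


index = {"a": [1.0, 0.0], "b": [0.7, 0.7], "c": [0.0, 1.0]}
print(top_k([1.0, 0.1], index, k=2))  # ['a', 'b']
```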
Bring your own everything
There is exactly one base class chunkbench requires you to inherit from: none. Every extension point is a plain function shape.
| Contract | Shape | Plugs in |
|---|---|---|
| ChunkSource | (Document) -> list[Chunk] | any chunking library — chonkie, your own splitter, an LLM-based chunker |
| Embedder | (list[str]) -> list[Vector] | OpenAI, Gemini, Cohere, Voyage, sentence-transformers, an in-house model |
| Generator | (str, Sequence[str]) -> str | whatever LLM you already call to answer a question from context |
| Judge | (str, str) -> float | an LLM-as-judge function, or the built-in dependency-free default |
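To make the "plain function shape" idea concrete, here is a hedged sketch of a `Judge` and a `Generator` written as ordinary functions. The type aliases restate the shapes from the table; the token-overlap scoring and the argument order `(answer, reference)` are assumptions for illustration, and this is not a description of chunkbench's built-in default judge.

```python
from collections import Counter
from typing import Callable, Sequence

# Shapes restated from the contract table; defined locally for this sketch.
Judge = Callable[[str, str], float]
Generator = Callable[[str, Sequence[str]], str]


def token_f1_judge(answer: str, reference: str) -> float:
    """Score an answer against a reference by token-overlap F1, in [0, 1]."""
    a, r = answer.lower().split(), reference.lower().split()
    overlap = sum((Counter(a) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(a), overlap / len(r)
    return 2 * precision * recall / (precision + recall)


def first_chunk_generator(question: str, context: Sequence[str]) -> str:
    """Trivial Generator: echo the top retrieved chunk as the answer."""
    return context[0] if context else ""


judge: Judge = token_f1_judge
generator: Generator = first_chunk_generator
answer = generator("How long is the refund window?", ["Refunds within 30 days"])
print(judge(answer, "refunds within 30 days"))  # 1.0
```

Nothing here inherits from anything: any callable with a matching signature satisfies the contract.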
from chonkie import RecursiveChunker
from chunkbench import Chunk, Document


def chonkie_chunker(document: Document) -> list[Chunk]:
    chunker = RecursiveChunker()
    chunks = []
    # _sections is a helper (not shown here) that yields (slug, text) pairs,
    # one per section of the document.
    for slug, text in _sections(document.content):
        for i, piece in enumerate(chunker(text)):
            chunks.append(Chunk(
                id=f"{document.id}-{slug}-{i}",
                doc_id=document.id,
                section=slug,
                text=piece.text,
            ))
    return chunks

from google import genai
from chunkbench import Embedder, Vector


def gemini_embedder(
    model: str = "gemini-embedding-001",
) -> Embedder:
    client = genai.Client()

    def embed(texts: list[str]) -> list[Vector]:
        r = client.models.embed_content(
            model=model, contents=texts
        )
        return [e.values for e in r.embeddings]

    return embed

Neither chonkie nor google-genai is a chunkbench dependency; install what you actually use. Full runnable versions, plus a Gemini judge, live in docs/providers.md.
60-second quickstart
# from examples/quickstart/quickstart.py: runs unmodified,
# no API key, no network call
# (toy_embedder and the two chunkers come from elsewhere in that file
# and are omitted from this excerpt)
from chunkbench import run_comparison
from chunkbench.corpus import directory_corpus_loader

report = run_comparison(
    corpus=directory_corpus_loader("corpus", extensions=(".md",)),
    embedder=toy_embedder,  # any Embedder
    golden_set="golden_qa.yaml",
    chunk_sources={
        "whole_section": whole_section_chunker,
        "paragraph": paragraph_chunker,
    },
    k=2,
)
report.to_markdown("report.md")
report.to_json("report.json")

$ python examples/quickstart/quickstart.py
whole_section: recall@2=1.00
paragraph: recall@2=1.00
Wrote examples/quickstart/report.md and examples/quickstart/report.json
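For a sense of what a dependency-free chunker and embedder could look like, here is a hedged sketch. The names `blank_line_chunker` and `hashing_embedder` are hypothetical and are not the helpers used in quickstart.py; the `Document` and `Chunk` dataclasses are local stand-ins so the block runs on its own.

```python
import hashlib
import re
from dataclasses import dataclass


@dataclass
class Document:  # stand-in for chunkbench.Document
    id: str
    content: str


@dataclass
class Chunk:  # stand-in for chunkbench.Chunk
    id: str
    doc_id: str
    section: str
    text: str


def blank_line_chunker(document: Document) -> list[Chunk]:
    """Split on blank lines; the whole document is treated as one section."""
    parts = [p.strip() for p in re.split(r"\n\s*\n", document.content)]
    return [
        Chunk(id=f"{document.id}-p{i}", doc_id=document.id, section="body", text=p)
        for i, p in enumerate(p for p in parts if p)
    ]


def hashing_embedder(texts: list[str], dims: int = 64) -> list[list[float]]:
    """Hashed bag-of-words: deterministic and offline, only suitable for demos."""
    vectors = []
    for text in texts:
        vec = [0.0] * dims
        for token in re.findall(r"\w+", text.lower()):
            # md5 is used only as a stable hash, not for security.
            bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dims
            vec[bucket] += 1.0
        vectors.append(vec)
    return vectors


doc = Document(id="faq", content="Refunds take 30 days.\n\nShipping is free.")
chunks = blank_line_chunker(doc)
vectors = hashing_embedder([c.text for c in chunks])
print([c.id for c in chunks], len(vectors[0]))  # ['faq-p0', 'faq-p1'] 64
```

An embedder like this carries no semantic signal beyond shared words, so it is useful for wiring up a run offline, not for drawing conclusions about a chunking strategy.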
For finer control (running only part of the pipeline, or scoring a custom metric), Pipeline and the metric registry are the composable, lower-level API. See docs/api-stability.md for exactly which extension points carry a semver guarantee.
Command line
# config-file-driven
chunkbench run --config chunkbench.yaml

# flag-driven, for one-off use; --chunkers/--embedder/--generator/--judge
# take 'module:attribute' import strings, since chunkbench ships no
# chunking algorithms or provider integrations itself
chunkbench run \
  --corpus ./docs --golden golden_qa.yaml \
  --chunkers whole_section=mypkg:whole_section,semantic=mypkg:semantic \
  --embedder mypkg.providers:gemini_embedder --k 5

# re-render a previous run's results.json in another format
chunkbench report --from results.json --format html
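The 'module:attribute' import strings follow a convention common to Python command-line tools. The sketch below shows one way such strings can be resolved with the standard library's `importlib`; the `resolve` and `parse_chunkers` helpers are hypothetical and chunkbench's own resolver may behave differently.

```python
import importlib
from typing import Any


def resolve(spec: str) -> Any:
    """Resolve 'package.module:attribute' to the object it names."""
    module_name, sep, attribute = spec.partition(":")
    if not sep or not module_name or not attribute:
        raise ValueError(f"expected 'module:attribute', got {spec!r}")
    obj = importlib.import_module(module_name)
    for part in attribute.split("."):  # allow dotted attributes such as 'Cls.method'
        obj = getattr(obj, part)
    return obj


def parse_chunkers(arg: str) -> dict[str, Any]:
    """Parse 'name=module:attr,name2=module:attr2' into a name -> object map."""
    pairs = (item.split("=", 1) for item in arg.split(","))
    return {name: resolve(spec) for name, spec in pairs}


# Standard-library targets stand in for user modules such as mypkg.
wrap = resolve("textwrap:wrap")
print(wrap("split me into short lines", width=10))
print(sorted(parse_chunkers("wrap=textwrap:wrap,split=re:split")))  # ['split', 'wrap']
```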
A regression_gate section in the config file makes chunkbench run exit non-zero when a metric drops below a threshold, for example "fail if recall_at_k for semantic drops below 0.8." Drop it into CI as a quality gate on chunking changes instead of finding out in production.
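The logic of such a gate is simple enough to sketch. The version below assumes metrics arrive as a nested dictionary keyed by approach and metric name; that layout, the `gate` function, and the threshold format are all assumptions for illustration and do not reflect the actual regression_gate config schema or results format.

```python
def gate(
    metrics: dict[str, dict[str, float]],
    thresholds: dict[tuple[str, str], float],
) -> int:
    """Return a process exit code: 0 if every gated metric meets its floor, else 1."""
    failures = []
    for (approach, metric), floor in thresholds.items():
        value = metrics[approach][metric]
        if value < floor:
            failures.append(f"{approach}.{metric} = {value:.2f} is below {floor:.2f}")
    for line in failures:
        print("FAIL:", line)
    return 1 if failures else 0


# Hypothetical results from a run, and the example rule quoted above.
metrics = {
    "semantic": {"recall_at_k": 0.75},
    "whole_section": {"recall_at_k": 0.90},
}
thresholds = {("semantic", "recall_at_k"): 0.8}
exit_code = gate(metrics, thresholds)
print(exit_code)  # 1
```

A CI job would pass the returned code to `sys.exit`, so a drop in recall fails the build.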
What you get back
Python: iterable and indexable per approach and per question, so you can feed it straight into your own tooling or analysis.
Markdown: a human-readable comparison table, ready to paste into a design doc or a pull request description.
JSON: the stable integration point, with a versioned schema, meant for a regression gate, a dashboard, or any downstream tool.
Documentation