jeevesagent.loader.chunking¶
Chunking strategies for splitting documents into LLM-friendly pieces.
Four strategies, picked by what your downstream RAG / context window needs:
- RecursiveChunker — the production default. Splits on a hierarchy of separators (paragraph → line → sentence → word) so semantic boundaries survive when possible. The same algorithm LangChain's RecursiveCharacterTextSplitter uses; widely recommended in Anthropic's RAG cookbook.
- MarkdownChunker — splits on heading boundaries (#, ##, ###, …). Each chunk's metadata records the trail of parent headers, so retrieval surfaces section context. Use this for the markdown produced by the PDF / DOCX / Excel loaders.
- SentenceChunker — sentence-boundary chunks. Use for QA-style RAG where each chunk should answer one short question.
- TokenChunker — chunks by token count via tiktoken (lazy import). Use when you need tight control over context-window fit.
Defaults¶
All chunkers default to chunk_size=800 characters with chunk_overlap=100 (12.5% overlap) — the values Anthropic recommends in their RAG documentation. Override per-chunker as needed.
Convenience factory: chunk() picks a strategy by name:
from jeevesagent.loader import chunk
pieces = chunk(text, strategy="recursive", chunk_size=800)
pieces = chunk(text, strategy="markdown")
pieces = chunk(text, strategy="sentence", chunk_size=400)
pieces = chunk(text, strategy="token", chunk_size=512)
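To override the defaults per call site, instantiate a chunker class directly and call its split() method (see the Chunker protocol below). A minimal sketch using only parameters documented in this module:
from jeevesagent.loader.chunking import RecursiveChunker

text = "A paragraph about chunking.\n\n" * 200  # any loaded document text

# Larger chunks and proportionally larger overlap than the 800 / 100 defaults.
chunker = RecursiveChunker(chunk_size=1200, chunk_overlap=150)
pieces = chunker.split(text)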
Attributes¶
DEFAULT_CHUNK_SIZE
DEFAULT_CHUNK_OVERLAP
Classes¶
Chunker | Anything with a split(text) -> list[Chunk] method.
MarkdownChunker | Split markdown on heading boundaries.
RecursiveChunker | Recursive character splitter — the production workhorse.
SentenceChunker | Sentence-boundary chunks.
TokenChunker | Chunk by exact token count using tiktoken.
Functions¶
chunk() | One-liner chunking: pick a strategy by name and split.
Module Contents¶
- class jeevesagent.loader.chunking.Chunker[source]¶
Bases: Protocol
Anything with a split(text) -> list[Chunk] method.
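Because Chunker is a Protocol, any class exposing a matching split method satisfies it. A minimal sketch of a custom implementation; the Chunk(text=..., metadata=...) constructor call is an assumption about jeevesagent.loader.base.Chunk, not its documented signature:
from jeevesagent.loader.base import Chunk

class ParagraphChunker:
    """One chunk per blank-line-separated paragraph (illustrative only)."""

    def split(self, text: str) -> list[Chunk]:
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        # Chunk(text=..., metadata=...) is an assumed constructor signature.
        return [Chunk(text=p, metadata={"index": i}) for i, p in enumerate(paragraphs)]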
- class jeevesagent.loader.chunking.MarkdownChunker(chunk_size: int = DEFAULT_CHUNK_SIZE, chunk_overlap: int = DEFAULT_CHUNK_OVERLAP)[source]¶
Split markdown on heading boundaries.
Each chunk corresponds to one section: the heading line plus its content up to (but not including) the next heading at the same OR shallower depth. Long sections are further split via RecursiveChunker so no chunk exceeds chunk_size.
Each chunk's metadata records the trail of parent headers (the path from the document root to this section), letting the retriever show users where each chunk came from.
Use this for markdown produced by the PDF / DOCX / Excel loaders — it preserves the document’s hierarchy.
- chunk_overlap = 100¶
- chunk_size = 800¶
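A usage sketch. The chunk attributes (text, metadata) and the metadata key holding the header trail are assumptions to verify against jeevesagent.loader.base.Chunk:
from jeevesagent.loader.chunking import MarkdownChunker

md = "# Guide\n\n## Install\n\npip install jeevesagent\n\n## Usage\n\nCall chunk() on your text.\n"

for c in MarkdownChunker(chunk_size=400, chunk_overlap=50).split(md):
    # "headers" is an assumed metadata key for the parent-header trail.
    print(c.metadata.get("headers"), "->", repr(c.text[:40]))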
- class jeevesagent.loader.chunking.RecursiveChunker(chunk_size: int = DEFAULT_CHUNK_SIZE, chunk_overlap: int = DEFAULT_CHUNK_OVERLAP, separators: collections.abc.Sequence[str] = _DEFAULT_SEPARATORS)[source]¶
Recursive character splitter — the production workhorse.
Aims for chunks of chunk_size characters with chunk_overlap chars of overlap. Splits on a hierarchy of separators (paragraph → line → sentence → word → char), trying to preserve semantic boundaries.
This is the algorithm LangChain calls RecursiveCharacterTextSplitter and the one most production RAG pipelines default to. Anthropic's cookbook specifically recommends it for general text.
- chunk_overlap = 100¶
- chunk_size = 800¶
- separators¶
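To make the recursion concrete, here is a self-contained, illustrative sketch of the strategy (overlap handling omitted); it is not the library's implementation:
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split on the coarsest separator first, recursing to finer ones as needed."""
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # No separators left: hard-cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    pieces: list[str] = []
    for part in text.split(sep):
        pieces.extend([part] if len(part) <= chunk_size else recursive_split(part, finer, chunk_size))
    # Greedily merge neighbouring pieces back together, up to chunk_size.
    merged: list[str] = []
    for piece in pieces:
        if merged and len(merged[-1]) + len(sep) + len(piece) <= chunk_size:
            merged[-1] += sep + piece
        else:
            merged.append(piece)
    return merged

print(recursive_split("one two three four five six seven eight", ["\n\n", " "], 12))
# ['one two', 'three four', 'five six', 'seven eight']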
- class jeevesagent.loader.chunking.SentenceChunker(chunk_size: int = DEFAULT_CHUNK_SIZE, chunk_overlap: int = DEFAULT_CHUNK_OVERLAP)[source]¶
Sentence-boundary chunks.
Splits on sentence terminators (., !, ?) followed by whitespace and a capital letter. Greedily packs sentences up to chunk_size characters; adds chunk_overlap chars between chunks (rounded to the nearest sentence boundary).
Best for QA-style RAG where each chunk should answer one short question.
- chunk_overlap = 100¶
- chunk_size = 800¶
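The boundary rule can be expressed as a regular expression; the pattern below is illustrative, not necessarily the library's exact one:
import re

# Split after ., ! or ? when followed by whitespace and a capital letter.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

print(SENTENCE_BOUNDARY.split("First point. Second point! Is this the third? Yes."))
# ['First point.', 'Second point!', 'Is this the third?', 'Yes.']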
- class jeevesagent.loader.chunking.TokenChunker(chunk_size: int = 512, chunk_overlap: int = 64, encoding: str = 'cl100k_base')[source]¶
Chunk by exact token count using tiktoken.
Each chunk is at most chunk_size tokens (not characters) with chunk_overlap tokens of overlap. Use this when you need tight control over context-window fit (embedding models have hard token limits — text-embedding-3-large is 8191).
Requires tiktoken: pip install 'jeevesagent[loader]'.
- chunk_overlap = 64¶
- chunk_size = 512¶
- encoding_name = 'cl100k_base'¶
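A sketch showing how tiktoken counts tokens with the same encoding, followed by a TokenChunker call using the parameters documented above; split() comes from the Chunker protocol:
import tiktoken
from jeevesagent.loader.chunking import TokenChunker

enc = tiktoken.get_encoding("cl100k_base")
text = "Token limits are about tokens, not characters. " * 200
print(len(text), "chars ->", len(enc.encode(text)), "tokens")

# Every chunk stays within 512 tokens, well under an 8191-token embedding limit.
chunks = TokenChunker(chunk_size=512, chunk_overlap=64, encoding="cl100k_base").split(text)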
- jeevesagent.loader.chunking.chunk(text: str, *, strategy: str = 'recursive', chunk_size: int = DEFAULT_CHUNK_SIZE, chunk_overlap: int = DEFAULT_CHUNK_OVERLAP, source: str = '', separators: collections.abc.Sequence[str] | None = None, encoding: str = 'cl100k_base') list[jeevesagent.loader.base.Chunk][source]¶
One-liner chunking: pick a strategy by name and split.
- strategy="recursive" (default) — char-level recursive split. Honours separators (list of separator strings, tried in order). Default separators: paragraphs, sentences, words, characters.
- strategy="markdown" — splits on heading boundaries; preserves the header trail in each chunk's metadata.
- strategy="sentence" — splits on sentence boundaries.
- strategy="token" — chunks by exact token count via tiktoken (requires the loader-token extra). Honours encoding (default "cl100k_base" for GPT-4 / 4o / 4.1).
Strategy-specific kwargs are silently ignored when not applicable (e.g. encoding= is harmless if you use strategy="markdown").
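For example (the sample text below is only for illustration), separators passes straight through to the recursive strategy, while unrelated kwargs are ignored by the others:
from jeevesagent.loader import chunk

report = "## Section A\n\nSome prose here.\n\n## Section B\n\nMore prose here.\n"

# Custom separator hierarchy for the recursive strategy.
pieces = chunk(report, strategy="recursive", chunk_size=200, separators=["\n## ", "\n\n", " "])

# encoding= is ignored by the markdown strategy, so this call is harmless.
pieces = chunk(report, strategy="markdown", encoding="cl100k_base")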
- jeevesagent.loader.chunking.DEFAULT_CHUNK_OVERLAP = 100¶
- jeevesagent.loader.chunking.DEFAULT_CHUNK_SIZE = 800¶