jeevesagent.loader.chunking

Chunking strategies for splitting documents into LLM-friendly pieces.

Four strategies, picked by what your downstream RAG / context window needs:

  • RecursiveChunker — the production default. Splits on a hierarchy of separators (paragraph → line → sentence → word) so semantic boundaries survive when possible. The same algorithm LangChain’s RecursiveCharacterTextSplitter uses; widely recommended in Anthropic’s RAG cookbook.

  • MarkdownChunker — splits on heading boundaries (#, ##, ###, …). Each chunk’s metadata records the trail of parent headers, so retrieval surfaces section context. Use this for the markdown produced by the PDF / DOCX / Excel loaders.

  • SentenceChunker — sentence-boundary chunks. Use for QA-style RAG where each chunk should answer one short question.

  • TokenChunker — chunk by token count via tiktoken (lazy import). Use when you need tight control over context-window fit.

Defaults

All chunkers default to chunk_size=800 characters with chunk_overlap=100 (12.5% overlap) — the values Anthropic recommends in their RAG documentation. Override per-chunker as needed.

Convenience factory: chunk() picks a strategy by name:

from jeevesagent.loader import chunk

pieces = chunk(text, strategy="recursive", chunk_size=800)
pieces = chunk(text, strategy="markdown")
pieces = chunk(text, strategy="sentence", chunk_size=400)
pieces = chunk(text, strategy="token", chunk_size=512)

Attributes

DEFAULT_CHUNK_SIZE

DEFAULT_CHUNK_OVERLAP

Classes

Chunker

Anything with a split(text) -> list[Chunk] method.

MarkdownChunker

Split markdown on heading boundaries.

RecursiveChunker

Recursive character splitter — the production workhorse.

SentenceChunker

Sentence-boundary chunks.

TokenChunker

Chunk by exact token count using tiktoken.

Functions

chunk(...) → list[jeevesagent.loader.base.Chunk]

One-liner chunking: pick a strategy by name and split.

Module Contents

class jeevesagent.loader.chunking.Chunker[source]

Bases: Protocol

Anything with a split(text) -> list[Chunk] method.

split(text: str, *, source: str = '') → list[jeevesagent.loader.base.Chunk][source]
class jeevesagent.loader.chunking.MarkdownChunker(chunk_size: int = DEFAULT_CHUNK_SIZE, chunk_overlap: int = DEFAULT_CHUNK_OVERLAP)[source]

Split markdown on heading boundaries.

Each chunk corresponds to one section: the heading line plus its content up to (but not including) the next heading at the same OR shallower depth. Long sections are further split via RecursiveChunker so no chunk exceeds chunk_size.

Each chunk’s metadata records the trail of parent headers (the path from the document root to this section), letting the retriever show users where each chunk came from.

Use this for markdown produced by the PDF / DOCX / Excel loaders — it preserves the document’s hierarchy.

split(text: str, *, source: str = '') → list[jeevesagent.loader.base.Chunk][source]
chunk_overlap = 100
chunk_size = 800
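The header-trail bookkeeping described above can be sketched with a small heading-stack walk. This is an illustrative stand-in, not the library's implementation — the real MarkdownChunker also re-splits oversized sections via RecursiveChunker and honours chunk_size / chunk_overlap, which this sketch omits:

```python
import re

def split_markdown_sections(text: str) -> list[dict]:
    """Split markdown into per-section chunks, recording the parent-header trail."""
    heading = re.compile(r"^(#{1,6})\s+(.*)$")
    trail: list[tuple[int, str]] = []   # (depth, title) stack from root to current section
    sections: list[dict] = []
    current_lines: list[str] = []

    def flush() -> None:
        if current_lines:
            sections.append({
                "text": "\n".join(current_lines).strip(),
                "headers": [title for _, title in trail],
            })
            current_lines.clear()

    for line in text.splitlines():
        m = heading.match(line)
        if m:
            flush()  # close the previous section under the old trail
            depth = len(m.group(1))
            # Pop headers at the same or deeper level, then push this one.
            while trail and trail[-1][0] >= depth:
                trail.pop()
            trail.append((depth, m.group(2)))
        current_lines.append(line)
    flush()
    return sections
```

Note how a `##` section under `# A` carries the trail `["A", "B"]` — exactly the "path from the document root" the metadata records.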
class jeevesagent.loader.chunking.RecursiveChunker(chunk_size: int = DEFAULT_CHUNK_SIZE, chunk_overlap: int = DEFAULT_CHUNK_OVERLAP, separators: collections.abc.Sequence[str] = _DEFAULT_SEPARATORS)[source]

Recursive character splitter — the production workhorse.

Aims for chunks of chunk_size characters with chunk_overlap chars of overlap. Splits on a hierarchy of separators (paragraph → line → sentence → word → char), trying to preserve semantic boundaries.

This is the algorithm LangChain calls RecursiveCharacterTextSplitter and the one most production RAG pipelines default to. Anthropic’s cookbook specifically recommends it for general text.

split(text: str, *, source: str = '') → list[jeevesagent.loader.base.Chunk][source]
chunk_overlap = 100
chunk_size = 800
separators
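The core recursion can be sketched as follows. This is a simplified stand-in (recursive_split is illustrative, not the library API) and omits the chunk_overlap handling the real class performs — it just shows the separator hierarchy descending from coarse to fine:

```python
def recursive_split(text: str, chunk_size: int = 800,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Recursively split on the coarsest separator that yields small-enough pieces."""
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # No separators left: hard-cut at chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, *rest = separators
    chunks: list[str] = []
    buf = ""
    for piece in text.split(sep):
        candidate = buf + sep + piece if buf else piece
        if len(candidate) <= chunk_size:
            buf = candidate
        else:
            if buf:
                chunks.append(buf)
            if len(piece) > chunk_size:
                # Piece itself too big: recurse with the next, finer separator.
                chunks.extend(recursive_split(piece, chunk_size, tuple(rest)))
                buf = ""
            else:
                buf = piece
    if buf:
        chunks.append(buf)
    return chunks
```

Because splitting is tried paragraph-first, a well-formed document mostly breaks at blank lines, and only pathological runs fall through to word- or character-level cuts.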
class jeevesagent.loader.chunking.SentenceChunker(chunk_size: int = DEFAULT_CHUNK_SIZE, chunk_overlap: int = DEFAULT_CHUNK_OVERLAP)[source]

Sentence-boundary chunks.

Splits on sentence terminators (., !, ?) followed by whitespace and a capital letter. Greedily packs sentences up to chunk_size characters; adds chunk_overlap chars between chunks (rounded to the nearest sentence boundary).

Best for QA-style RAG where each chunk should answer one short question.

split(text: str, *, source: str = '') → list[jeevesagent.loader.base.Chunk][source]
chunk_overlap = 100
chunk_size = 800
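The boundary rule above maps naturally onto a lookbehind/lookahead regex plus greedy packing. A minimal sketch (sentence_chunks is illustrative, not the library API; the overlap-rounding step is omitted):

```python
import re

# Split after ., ! or ? when followed by whitespace and a capital letter,
# mirroring the sentence-terminator rule described above.
_SENT_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def sentence_chunks(text: str, chunk_size: int = 400) -> list[str]:
    """Greedily pack whole sentences into chunks of at most chunk_size characters."""
    chunks: list[str] = []
    buf = ""
    for sent in _SENT_RE.split(text.strip()):
        candidate = f"{buf} {sent}".strip()
        if buf and len(candidate) > chunk_size:
            chunks.append(buf)   # current chunk is full; start a new one
            buf = sent
        else:
            buf = candidate
    if buf:
        chunks.append(buf)
    return chunks
```

The lookahead for a capital letter keeps abbreviations like "e.g. this" from triggering a false split, at the cost of missing sentences that start lowercase.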
class jeevesagent.loader.chunking.TokenChunker(chunk_size: int = 512, chunk_overlap: int = 64, encoding: str = 'cl100k_base')[source]

Chunk by exact token count using tiktoken.

Each chunk is at most chunk_size TOKENS (not characters) with chunk_overlap tokens of overlap. Use this when you need tight control over context-window fit (embedding models have hard token limits — text-embedding-3-large is 8191).

Requires tiktoken: pip install 'jeevesagent[loader]'.

split(text: str, *, source: str = '') → list[jeevesagent.loader.base.Chunk][source]
chunk_overlap = 64
chunk_size = 512
encoding_name = 'cl100k_base'
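The sliding token window is easy to picture once the text is a token sequence. In this sketch plain integers stand in for tiktoken token ids (the real class would encode with the lazily imported encoding and decode each window back to text); token_windows is illustrative, not the library API:

```python
def token_windows(tokens: list[int], chunk_size: int = 512,
                  chunk_overlap: int = 64) -> list[list[int]]:
    """Slide a chunk_size window over a token sequence, stepping by size minus overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    stride = chunk_size - chunk_overlap
    windows: list[list[int]] = []
    for start in range(0, len(tokens), stride):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the sequence
    return windows
```

With the defaults (512 tokens, 64 overlap) each window advances by 448 tokens, so every token except the first and last 64 appears in exactly two chunks.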
jeevesagent.loader.chunking.chunk(text: str, *, strategy: str = 'recursive', chunk_size: int = DEFAULT_CHUNK_SIZE, chunk_overlap: int = DEFAULT_CHUNK_OVERLAP, source: str = '', separators: collections.abc.Sequence[str] | None = None, encoding: str = 'cl100k_base') → list[jeevesagent.loader.base.Chunk][source]

One-liner chunking: pick a strategy by name and split.

  • strategy="recursive" (default) — char-level recursive split. Honours separators (list of separator strings, tried in order). Default separators: paragraphs, sentences, words, characters.

  • strategy="markdown" — splits on heading boundaries; preserves the header trail in each chunk’s metadata.

  • strategy="sentence" — splits on sentence boundaries.

  • strategy="token" — chunks by exact token count via tiktoken (requires the loader-token extra). Honours encoding (default "cl100k_base" for GPT-4 / 4o / 4.1).

Strategy-specific kwargs are silently ignored when not applicable (e.g. encoding= is harmless if you use strategy="markdown").
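One plausible way to get that "silently ignored" behavior is to filter the kwargs against the chosen chunker's signature before calling it — an illustrative sketch, not necessarily how chunk() is implemented:

```python
import inspect

def call_with_supported_kwargs(fn, /, **kwargs):
    """Forward only the keyword arguments fn actually declares; drop the rest."""
    accepted = set(inspect.signature(fn).parameters)
    return fn(**{k: v for k, v in kwargs.items() if k in accepted})

# Hypothetical factory standing in for a chunker constructor:
def make_chunker(chunk_size=800, chunk_overlap=100):
    return (chunk_size, chunk_overlap)
```

Here `call_with_supported_kwargs(make_chunker, chunk_size=512, encoding="cl100k_base")` passes chunk_size through and drops encoding without raising.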

jeevesagent.loader.chunking.DEFAULT_CHUNK_OVERLAP = 100
jeevesagent.loader.chunking.DEFAULT_CHUNK_SIZE = 800