jeevesagent.loader
Document loaders + chunking.
Reads .pdf, .docx, .xlsx, .csv, .tsv, .md,
.txt, and .html files into a normalized Document
whose content is markdown text. From there, four chunking
strategies break the document into LLM-friendly pieces:
- RecursiveChunker: the production workhorse (LangChain-compatible behaviour).
- MarkdownChunker: splits on # heading boundaries; preserves the header trail in chunk metadata. Best for the markdown produced by the PDF / DOCX / Excel loaders.
- SentenceChunker: sentence-boundary chunks for QA-style RAG.
- TokenChunker: chunks by token count via tiktoken (lazy import).
One-liner usage:
from jeevesagent.loader import load, chunk
doc = load("research.pdf") # auto-detect format
chunks = chunk(doc.content) # default: RecursiveChunker
Or pick the loader and chunker explicitly:
from jeevesagent.loader import load_pdf, MarkdownChunker
doc = load_pdf("research.pdf")
chunker = MarkdownChunker(chunk_size=800, chunk_overlap=100)
chunks = chunker.split(doc.content)
Optional dependencies
The PDF / DOCX / Excel / HTML loaders are gated behind extras so the framework’s base install stays lean:
pip install 'jeevesagent[loader]' # all four
pip install 'jeevesagent[loader-pdf]' # just pypdf
pip install 'jeevesagent[loader-docx]' # just python-docx
pip install 'jeevesagent[loader-excel]' # just openpyxl
pip install 'jeevesagent[loader-html]' # just beautifulsoup4
Each loader raises a helpful ImportError if its dependency
is missing.
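The gating described above can be sketched as a lazy import that re-raises with an install hint. This is a minimal illustration, not jeevesagent's actual code: the helper name `require_optional` and the exact message wording are assumptions.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import module_name on first use, or raise an ImportError naming the extra.

    Hypothetical helper for illustration; not part of jeevesagent's API.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise with an actionable hint instead of a bare "No module named ..."
        raise ImportError(
            f"{module_name!r} is required for this loader. "
            f"Install it with: pip install 'jeevesagent[{extra}]'"
        ) from exc
```

Because the import happens inside the function, a base install can import the package without the optional dependency present; the error only appears when the gated loader is actually called.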
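Classes
| Class | Description |
| --- | --- |
| Chunk | One piece of a chunked document. |
| Document | A loaded document, normalized to markdown. |
| MarkdownChunker | Split markdown on heading boundaries. |
| RecursiveChunker | Recursive character splitter: the production workhorse. |
| SentenceChunker | Sentence-boundary chunks. |
| TokenChunker | Chunk by exact token count using tiktoken. |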
Functions
| Function | Description |
| --- | --- |
| chunk | One-liner chunking: pick a strategy by name and split. |
| load | Load a document by auto-detecting its format from the file extension. |
| load_csv | Load a comma-separated file → markdown table. |
| load_docx | Load a .docx file → markdown. |
| load_excel | Load an Excel workbook → markdown. |
| load_html | Load an HTML file → markdown. |
| load_markdown | Load a markdown file. Just reads UTF-8 text. |
| load_pdf | Load a PDF, convert to markdown. |
| load_text | Load a plain-text file. Wraps content in markdown by adding a # {filename} heading. |
| load_tsv | Load a tab-separated file → markdown table. |
Package Contents
- class jeevesagent.loader.Chunk
  One piece of a chunked document.
  content is a substring of the source document’s content (with possible cleanup: trimmed whitespace, etc.).
  metadata carries:
  - source: pass-through from the parent Document
  - index: zero-based chunk index in the source
  - chunk_size: actual length of content (chars)
  - Strategy-specific keys (e.g. headers from MarkdownChunker, token_count from TokenChunker).
- class jeevesagent.loader.Document
  A loaded document, normalized to markdown.
  content
    The full markdown text. Loaders produce reasonable markdown: PDF / DOCX preserve headings + paragraphs; Excel / CSV become markdown tables; HTML preserves heading + paragraph + list structure.
  metadata
    Free-form dict with at least:
    - source: the source file path (str)
    - format: the source format ("pdf", "docx", "xlsx", "csv", "tsv", "md", "txt", "html")
    Format-specific keys may be present ("page_count" for PDFs, "sheet_names" for Excel, "row_count" for CSV, etc.).
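To make the two metadata contracts concrete, here is a small sketch that models Document and Chunk as plain dataclasses and derives a chunk's metadata from its parent. The dataclass layout and the `make_chunk` helper are assumptions for illustration; the library's real classes may be defined differently.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Document:
    """Stand-in for a loaded document: markdown text plus free-form metadata."""
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass
class Chunk:
    """Stand-in for one piece of a chunked document."""
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def make_chunk(doc: Document, text: str, index: int, **extra: Any) -> Chunk:
    """Hypothetical helper: build the metadata keys described above."""
    metadata = {
        "source": doc.metadata.get("source", ""),  # pass-through from the parent
        "index": index,                            # zero-based position in the source
        "chunk_size": len(text),                   # actual length in characters
        **extra,                                   # strategy-specific keys, e.g. headers
    }
    return Chunk(content=text, metadata=metadata)


doc = Document(
    content="# Title\n\nBody text.",
    metadata={"source": "notes.md", "format": "md"},
)
first = make_chunk(doc, "Body text.", 0, headers=["Title"])
```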
- class jeevesagent.loader.MarkdownChunker(chunk_size: int = DEFAULT_CHUNK_SIZE, chunk_overlap: int = DEFAULT_CHUNK_OVERLAP)
  Split markdown on heading boundaries.
  Each chunk corresponds to one section: the heading line plus its content up to (but not including) the next heading at the same or shallower depth. Long sections are further split via RecursiveChunker so no chunk exceeds chunk_size.
  Each chunk’s metadata records the trail of parent headers (the path from the document root to this section), letting the retriever show users where each chunk came from.
  Use this for markdown produced by the PDF / DOCX / Excel loaders: it preserves the document’s hierarchy.
  - chunk_overlap = 100
  - chunk_size = 800
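The header trail can be tracked with a simple stack of (level, title) pairs. The sketch below is a simplified stand-in rather than MarkdownChunker's implementation: it cuts at every heading, includes the section's own heading in the trail, ignores chunk_size, and would misread `#` lines inside fenced code blocks. The function name `split_by_headings` is hypothetical.

```python
import re

# One to six '#' characters, whitespace, then the heading title.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_headings(markdown: str) -> list[dict]:
    """Cut markdown at each heading and record the header trail per section."""
    sections: list[dict] = []
    trail: list[tuple[int, str]] = []  # stack of (level, title) from root to here
    lines: list[str] = []
    headers: list[str] = []

    def emit(section_lines: list[str], section_headers: list[str]) -> None:
        text = "\n".join(section_lines).strip()
        if text:
            sections.append({"content": text, "headers": list(section_headers)})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match is None:
            lines.append(line)
            continue
        emit(lines, headers)  # close the section that was open
        level = len(match.group(1))
        # Pop headings at the same or deeper level: they are no longer ancestors.
        while trail and trail[-1][0] >= level:
            trail.pop()
        trail.append((level, match.group(2)))
        headers = [title for _, title in trail]
        lines = [line]
    emit(lines, headers)
    return sections
```

For input with `# A`, `## B`, `## C`, `# D`, the trails come out as `["A"]`, `["A", "B"]`, `["A", "C"]`, and `["D"]`, which is the kind of breadcrumb a retriever can display next to a hit.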
- class jeevesagent.loader.RecursiveChunker(chunk_size: int = DEFAULT_CHUNK_SIZE, chunk_overlap: int = DEFAULT_CHUNK_OVERLAP, separators: collections.abc.Sequence[str] = _DEFAULT_SEPARATORS)
  Recursive character splitter: the production workhorse.
  Aims for chunks of chunk_size characters with chunk_overlap chars of overlap. Splits on a hierarchy of separators (paragraph → line → sentence → word → char), trying to preserve semantic boundaries.
  This is the algorithm LangChain calls RecursiveCharacterTextSplitter and the one most production RAG pipelines default to. Anthropic’s cookbook specifically recommends it for general text.
  - chunk_overlap = 100
  - chunk_size = 800
  - separators
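A rough sketch of the separator hierarchy follows. It is not RecursiveChunker's code: overlap is left out for brevity, the separator tuple is an assumed stand-in for _DEFAULT_SEPARATORS (whose actual value is not shown here), and `recursive_split` is a hypothetical name.

```python
def recursive_split(
    text: str,
    chunk_size: int = 800,
    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ", ""),
) -> list[str]:
    """Split text into pieces of at most chunk_size characters (no overlap)."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: fixed-width character windows.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    if len(parts) == 1:
        # This separator does not occur; try the next, finer one.
        return recursive_split(text, chunk_size, finer)

    chunks: list[str] = []
    current = ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing parts into the open chunk
            continue
        if current:
            chunks.append(current)
        if len(part) <= chunk_size:
            current = part
        else:
            # A single part is still too long: recurse with finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks
```

The coarsest separator that keeps pieces under the limit wins, so paragraphs stay intact when they fit and only oversized ones are broken at lines, sentences, words, and finally raw characters.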
- class jeevesagent.loader.SentenceChunker(chunk_size: int = DEFAULT_CHUNK_SIZE, chunk_overlap: int = DEFAULT_CHUNK_OVERLAP)
  Sentence-boundary chunks.
  Splits on sentence terminators (., !, ?) followed by whitespace and a capital letter. Greedily packs sentences up to chunk_size characters; adds chunk_overlap chars between chunks (rounded to the nearest sentence boundary).
  Best for QA-style RAG where each chunk should answer one short question.
  - chunk_overlap = 100
  - chunk_size = 800
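The boundary rule and greedy packing can be approximated with one regular expression. This is an illustrative sketch only: it omits overlap, keeps a sentence longer than chunk_size whole, and the name `sentence_chunks` is an assumption rather than part of the library.

```python
import re

# A terminator (., !, ?) followed by whitespace and then a capital letter.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def sentence_chunks(text: str, chunk_size: int = 800) -> list[str]:
    """Greedily pack whole sentences into chunks of at most chunk_size chars."""
    sentences = [s.strip() for s in SENTENCE_BOUNDARY.split(text) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if not current or len(candidate) <= chunk_size:
            current = candidate  # an oversized single sentence is kept whole
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Note that the capital-letter lookahead means "Is it? yes." stays one sentence, while abbreviations such as "Dr. Smith" would be split; a production splitter usually needs extra handling for those cases.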
- class jeevesagent.loader.TokenChunker(chunk_size: int = 512, chunk_overlap: int = 64, encoding: str = 'cl100k_base')
  Chunk by exact token count using tiktoken.
  Each chunk is at most chunk_size tokens (not characters) with chunk_overlap tokens of overlap. Use this when you need tight control over context-window fit: embedding models have hard token limits (text-embedding-3-large, for example, accepts at most 8191 tokens).
  Requires tiktoken: pip install 'jeevesagent[loader]'.
  - chunk_overlap = 64
  - chunk_size = 512
  - encoding_name = 'cl100k_base'
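Token chunking amounts to sliding a fixed-size window over the token ids with a stride of chunk_size minus chunk_overlap. The sketch below works on any sequence of ids so it runs without tiktoken; the function name `token_windows` and the ValueError on a too-large overlap are assumptions, not documented behaviour of TokenChunker.

```python
from collections.abc import Sequence


def token_windows(
    tokens: Sequence[int],
    chunk_size: int = 512,
    chunk_overlap: int = 64,
) -> list[list[int]]:
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows: list[list[int]] = []
    for start in range(0, len(tokens), step):
        windows.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return windows
```

With tiktoken installed, the ids would typically come from `tiktoken.get_encoding("cl100k_base").encode(text)`, and each window can be turned back into text with the same encoding's `decode` method.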
- jeevesagent.loader.chunk(text: str, *, strategy: str = 'recursive', chunk_size: int = DEFAULT_CHUNK_SIZE, chunk_overlap: int = DEFAULT_CHUNK_OVERLAP, source: str = '', separators: collections.abc.Sequence[str] | None = None, encoding: str = 'cl100k_base') → list[jeevesagent.loader.base.Chunk]
  One-liner chunking: pick a strategy by name and split.
  - strategy="recursive" (default): char-level recursive split. Honours separators (list of separator strings, tried in order). Default separators: paragraphs, sentences, words, characters.
  - strategy="markdown": splits on heading boundaries; preserves the header trail in each chunk’s metadata.
  - strategy="sentence": splits on sentence boundaries.
  - strategy="token": chunks by exact token count via tiktoken (requires the loader-token extra). Honours encoding (default "cl100k_base", the encoding used by GPT-4 and GPT-3.5-turbo; GPT-4o and newer models use "o200k_base").
  Strategy-specific kwargs are silently ignored when not applicable (e.g. encoding= is harmless if you use strategy="markdown").
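Dispatching by strategy name while tolerating irrelevant keyword arguments is commonly done with a small registry. The sketch below shows only that pattern: the two toy strategies, the `chunk_by_name` function, and the ValueError for unknown names are all assumptions, not the behaviour of jeevesagent.loader.chunk.

```python
from collections.abc import Callable


def _by_paragraph(text: str, **_ignored: object) -> list[str]:
    """Toy strategy: split on blank lines; extra kwargs are accepted and ignored."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def _by_line(text: str, **_ignored: object) -> list[str]:
    """Toy strategy: split on newlines; extra kwargs are accepted and ignored."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Hypothetical registry mapping strategy names to splitter callables.
STRATEGIES: dict[str, Callable[..., list[str]]] = {
    "paragraph": _by_paragraph,
    "line": _by_line,
}


def chunk_by_name(text: str, *, strategy: str = "paragraph", **kwargs: object) -> list[str]:
    """Look up a splitter by name and forward every keyword argument to it."""
    try:
        splitter = STRATEGIES[strategy]
    except KeyError:
        raise ValueError(f"Unknown strategy: {strategy!r}") from None
    return splitter(text, **kwargs)
```

Because each splitter swallows unknown keywords, passing `encoding="cl100k_base"` to a strategy that has no use for it has no effect, which mirrors the "silently ignored" rule described above.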
- jeevesagent.loader.load(path: str | pathlib.Path) → jeevesagent.loader.base.Document
  Load a document by auto-detecting its format from the file extension. Supported: .pdf, .docx, .xlsx, .xlsm, .csv, .tsv, .md, .markdown, .txt, .html, .htm.
  Raises ValueError for unknown extensions.
- jeevesagent.loader.load_csv(path: str | pathlib.Path) → jeevesagent.loader.base.Document
  Load a comma-separated file → markdown table.
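One plausible way to turn delimited text into a markdown table uses the standard csv module. This sketch treats the first row as the header, pads short rows, and escapes pipe characters; those choices and the name `csv_to_markdown` are assumptions, since the loader's exact output format is not specified here.

```python
import csv
import io


def csv_to_markdown(text: str, delimiter: str = ",") -> str:
    """Render delimited text as a markdown table; the first row is the header."""
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row: list[str]) -> str:
        # Escape pipes so cell text cannot break the table, then pad short rows.
        cells = [cell.strip().replace("|", "\\|") for cell in row]
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [fmt(header), "| " + " | ".join(["---"] * width) + " |"]
    lines += [fmt(row) for row in body]
    return "\n".join(lines)
```

Passing `delimiter="\t"` handles tab-separated input with the same code path.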
- jeevesagent.loader.load_docx(path: str | pathlib.Path) → jeevesagent.loader.base.Document
  Load a .docx file → markdown.
  Requires python-docx: pip install 'jeevesagent[loader-docx]'.
- jeevesagent.loader.load_excel(path: str | pathlib.Path) → jeevesagent.loader.base.Document
  Load an Excel workbook → markdown.
  Each sheet becomes ## {sheet_name} with the cell grid as a markdown table. Formula cells return their cached values (data_only=True).
  Requires openpyxl: pip install 'jeevesagent[loader-excel]'.
- jeevesagent.loader.load_html(path: str | pathlib.Path) → jeevesagent.loader.base.Document
  Load an HTML file → markdown.
  Requires beautifulsoup4: pip install 'jeevesagent[loader-html]'.
- jeevesagent.loader.load_markdown(path: str | pathlib.Path) → jeevesagent.loader.base.Document
  Load a markdown file. Just reads UTF-8 text.
- jeevesagent.loader.load_pdf(path: str | pathlib.Path) → jeevesagent.loader.base.Document
  Load a PDF, convert to markdown.
  Each page becomes ## Page N followed by the extracted text.
  Requires pypdf: pip install 'jeevesagent[loader-pdf]'.
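The per-page layout can be illustrated without a PDF library by starting from already extracted page texts. The helper name `pages_to_markdown` and the blank-line spacing are assumptions; only the "## Page N" heading convention comes from the description above.

```python
def pages_to_markdown(pages: list[str]) -> str:
    """Join extracted page texts under '## Page N' headings (1-based)."""
    sections = []
    for number, text in enumerate(pages, start=1):
        # Strip stray whitespace that text extraction often leaves around a page.
        sections.append(f"## Page {number}\n\n{text.strip()}")
    return "\n\n".join(sections)
```

With pypdf, the input list would typically be built from `page.extract_text()` for each page in `PdfReader(path).pages`. Because every page starts with a level-two heading, MarkdownChunker can then attribute each chunk to its page through the header trail.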
- jeevesagent.loader.load_text(path: str | pathlib.Path) → jeevesagent.loader.base.Document
  Load a plain-text file. Wraps content in markdown by adding a # {filename} heading so downstream chunkers / consumers see consistent markdown.
- jeevesagent.loader.load_tsv(path: str | pathlib.Path) → jeevesagent.loader.base.Document
  Load a tab-separated file → markdown table.