jeevesagent.loader

Document loaders + chunking.

Reads .pdf, .docx, .xlsx, .csv, .tsv, .md, .txt, and .html files into a normalized Document whose content is markdown text. From there, four chunking strategies break the document into LLM-friendly pieces:

  • RecursiveChunker — the production workhorse (LangChain-compatible behaviour)

  • MarkdownChunker — splits on # heading boundaries; preserves the header trail in chunk metadata. Best for the markdown produced by the PDF / DOCX / Excel loaders.

  • SentenceChunker — sentence-boundary chunks for QA-style RAG.

  • TokenChunker — chunk by token count via tiktoken (lazy import).

One-liner usage:

from jeevesagent.loader import load, chunk

doc = load("research.pdf")              # auto-detect format
chunks = chunk(doc.content)             # default: RecursiveChunker

Or pick the loader and chunker explicitly:

from jeevesagent.loader import load_pdf, MarkdownChunker

doc = load_pdf("research.pdf")
chunker = MarkdownChunker(chunk_size=800, chunk_overlap=100)
chunks = chunker.split(doc.content)

Optional dependencies

The PDF / DOCX / Excel / HTML loaders are gated behind extras so the framework’s base install stays lean:

pip install 'jeevesagent[loader]'              # all four
pip install 'jeevesagent[loader-pdf]'          # just pypdf
pip install 'jeevesagent[loader-docx]'         # just python-docx
pip install 'jeevesagent[loader-excel]'        # just openpyxl
pip install 'jeevesagent[loader-html]'         # just beautifulsoup4

Each loader raises a helpful ImportError if its dependency is missing.
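The "helpful ImportError" pattern can be sketched as follows. This is a minimal illustration, not the library's code: the helper name `require` and the exact message wording are assumptions.

```python
import importlib
from types import ModuleType


def require(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency, or fail with an install hint.

    Hypothetical helper: the name and message format are illustrative.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this loader. "
            f"Install it with: pip install 'jeevesagent[{extra}]'"
        ) from exc


# A module that is present imports normally.
json_module = require("json", "loader")

# A missing module produces the actionable message.
hint = ""
try:
    require("surely_not_an_installed_module", "loader-pdf")
except ImportError as err:
    hint = str(err)
```

Deferring the import to call time is what lets the base install stay lean: the dependency is only needed once the corresponding loader is actually used.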

Classes

Chunk

One piece of a chunked document.

Document

A loaded document, normalized to markdown.

MarkdownChunker

Split markdown on heading boundaries.

RecursiveChunker

Recursive character splitter — the production workhorse.

SentenceChunker

Sentence-boundary chunks.

TokenChunker

Chunk by exact token count using tiktoken.

Functions

chunk(→ list[jeevesagent.loader.base.Chunk])

One-liner chunking: pick a strategy by name and split.

load(→ jeevesagent.loader.base.Document)

Load a document by auto-detecting its format from the file extension.

load_csv(→ jeevesagent.loader.base.Document)

Load a comma-separated file → markdown table.

load_docx(→ jeevesagent.loader.base.Document)

Load a .docx file → markdown.

load_excel(→ jeevesagent.loader.base.Document)

Load an Excel workbook → markdown.

load_html(→ jeevesagent.loader.base.Document)

Load an HTML file → markdown.

load_markdown(→ jeevesagent.loader.base.Document)

Load a markdown file. Just reads UTF-8 text.

load_pdf(→ jeevesagent.loader.base.Document)

Load a PDF, convert to markdown.

load_text(→ jeevesagent.loader.base.Document)

Load a plain-text file. Wraps content in markdown by adding a # {filename} heading.

load_tsv(→ jeevesagent.loader.base.Document)

Load a tab-separated file → markdown table.

Package Contents

class jeevesagent.loader.Chunk

One piece of a chunked document.

content is a substring of the source document’s content (with possible cleanup — trimmed whitespace, etc.). metadata carries:

  • source — pass-through from the parent Document

  • index — zero-based chunk index in the source

  • chunk_size — actual length of content (chars)

  • Strategy-specific keys (e.g. headers from MarkdownChunker, token_count from TokenChunker).

content: str
metadata: dict[str, Any]
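To make the metadata layout concrete, here is a small stand-in for the shape described above. `ChunkSketch` and `make_chunks` are hypothetical names; only the `source`, `index`, and `chunk_size` keys come from the description, and the whitespace trimming is an assumed form of "cleanup".

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ChunkSketch:
    """Illustrative stand-in for Chunk, not the library class."""
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def make_chunks(pieces: list[str], source: str) -> list[ChunkSketch]:
    """Attach the common metadata keys to already-split pieces."""
    chunks = []
    for index, piece in enumerate(pieces):
        text = piece.strip()  # assumed cleanup: trim surrounding whitespace
        chunks.append(
            ChunkSketch(
                content=text,
                metadata={
                    "source": source,         # pass-through from the document
                    "index": index,           # zero-based position
                    "chunk_size": len(text),  # actual length in characters
                },
            )
        )
    return chunks
```

Note that `chunk_size` here is the measured length of the chunk, not the target size passed to a chunker.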
class jeevesagent.loader.Document

A loaded document, normalized to markdown.

content

The full markdown text. Loaders produce reasonable markdown: PDF / DOCX preserve headings + paragraphs; Excel / CSV become markdown tables; HTML preserves heading + paragraph + list structure.

metadata

Free-form dict with at least:

  • source — the source file path (str)

  • format — the source format ("pdf", "docx", "xlsx", "csv", "tsv", "md", "txt", "html")

Format-specific keys may be present ("page_count" for PDFs, "sheet_names" for Excel, "row_count" for CSV, etc.).

content: str
metadata: dict[str, Any]
class jeevesagent.loader.MarkdownChunker(chunk_size: int = DEFAULT_CHUNK_SIZE, chunk_overlap: int = DEFAULT_CHUNK_OVERLAP)

Split markdown on heading boundaries.

Each chunk corresponds to one section: the heading line plus its content up to (but not including) the next heading at the same OR shallower depth. Long sections are further split via RecursiveChunker so no chunk exceeds chunk_size.

Each chunk’s metadata records the trail of parent headers (the path from the document root to this section), letting the retriever show users where each chunk came from.

Use this for markdown produced by the PDF / DOCX / Excel loaders — it preserves the document’s hierarchy.
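The header trail can be illustrated with a short sketch. It is a simplification under stated assumptions: it starts a new section at every heading, ignores chunk_size, and does not special-case `#` lines inside fenced code. The function name `split_by_heading` is hypothetical; the `headers` key mirrors the metadata key mentioned for MarkdownChunker.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_heading(text: str) -> list[dict]:
    """Return one record per heading section, with its parent-header trail."""
    sections: list[dict] = []
    trail: list[tuple[int, str]] = []  # (depth, title) from root to current
    current = None
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            depth, title = len(match.group(1)), match.group(2).strip()
            # Pop headings at the same or deeper level so the trail
            # holds only ancestors before the new heading is added.
            while trail and trail[-1][0] >= depth:
                trail.pop()
            trail.append((depth, title))
            current = {"headers": [t for _, t in trail], "lines": [line]}
            sections.append(current)
        elif current is not None:
            current["lines"].append(line)
        else:
            # Text before the first heading has an empty trail.
            current = {"headers": [], "lines": [line]}
            sections.append(current)
    return [
        {"headers": s["headers"], "content": "\n".join(s["lines"]).strip()}
        for s in sections
    ]
```

For a document with `# Guide` containing `## Setup` and `## Usage`, the `## Usage` section would carry the trail `["Guide", "Usage"]`, which is the kind of path a retriever can display next to a hit.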

split(text: str, *, source: str = '') → list[jeevesagent.loader.base.Chunk]
chunk_overlap = 100
chunk_size = 800
class jeevesagent.loader.RecursiveChunker(chunk_size: int = DEFAULT_CHUNK_SIZE, chunk_overlap: int = DEFAULT_CHUNK_OVERLAP, separators: collections.abc.Sequence[str] = _DEFAULT_SEPARATORS)

Recursive character splitter — the production workhorse.

Aims for chunks of chunk_size characters with chunk_overlap chars of overlap. Splits on a hierarchy of separators (paragraph → line → sentence → word → char), trying to preserve semantic boundaries.

This is the algorithm LangChain calls RecursiveCharacterTextSplitter and the one most production RAG pipelines default to. Anthropic’s cookbook specifically recommends it for general text.
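The separator hierarchy can be sketched in a few lines. This is a simplified illustration of the recursive idea, not the library's or LangChain's implementation: it omits chunk_overlap, and the separator tuple shown is an assumed stand-in for the real defaults.

```python
def recursive_split(
    text: str,
    chunk_size: int,
    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ", ""),
) -> list[str]:
    """Split text into pieces of at most chunk_size characters,
    preferring the coarsest separator that works."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    parts = [p for p in text.split(sep) if p]
    if len(parts) <= 1:
        # This separator does not help; try the next, finer one.
        return recursive_split(text, chunk_size, finer)

    chunks: list[str] = []
    buffer = ""
    for part in parts:
        candidate = part if not buffer else buffer + sep + part
        if len(candidate) <= chunk_size:
            buffer = candidate  # keep packing parts into the current chunk
            continue
        if buffer:
            chunks.append(buffer)
        if len(part) <= chunk_size:
            buffer = part
        else:
            # A single part is still too long: recurse with finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
            buffer = ""
    if buffer:
        chunks.append(buffer)
    return chunks
```

Paragraph breaks are tried first, so short paragraphs stay whole; only an oversized paragraph falls through to line, sentence, word, and finally character splitting.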

split(text: str, *, source: str = '') → list[jeevesagent.loader.base.Chunk]
chunk_overlap = 100
chunk_size = 800
separators
class jeevesagent.loader.SentenceChunker(chunk_size: int = DEFAULT_CHUNK_SIZE, chunk_overlap: int = DEFAULT_CHUNK_OVERLAP)

Sentence-boundary chunks.

Splits on sentence terminators (., !, ?) followed by whitespace and a capital letter. Greedily packs sentences up to chunk_size characters; adds chunk_overlap chars between chunks (rounded to the nearest sentence boundary).

Best for QA-style RAG where each chunk should answer one short question.
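The boundary rule and greedy packing described above might look like the sketch below. The regular expression is one plausible reading of "terminator, then whitespace, then a capital letter"; the function name `pack_sentences` is hypothetical, and overlap is left out to keep the example short.

```python
import re

# A sentence ends at ., ! or ? when followed by whitespace and a capital letter.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def pack_sentences(text: str, chunk_size: int) -> list[str]:
    """Greedily pack whole sentences into chunks of up to chunk_size chars."""
    sentences = [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > chunk_size:
            chunks.append(current)  # current chunk is full; start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Because the rule requires a following capital letter, a lowercase continuation after "?" stays in the same sentence, while abbreviations followed by a capitalised word would still be split; a single sentence longer than chunk_size is emitted as one oversized chunk in this sketch.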

split(text: str, *, source: str = '') → list[jeevesagent.loader.base.Chunk]
chunk_overlap = 100
chunk_size = 800
class jeevesagent.loader.TokenChunker(chunk_size: int = 512, chunk_overlap: int = 64, encoding: str = 'cl100k_base')

Chunk by exact token count using tiktoken.

Each chunk is at most chunk_size tokens (not characters), with chunk_overlap tokens of overlap. Use this when you need tight control over context-window fit: embedding models have hard token limits (text-embedding-3-large, for example, accepts at most 8191 input tokens).

Requires tiktoken: pip install 'jeevesagent[loader]'.
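The windowing arithmetic behind token chunking is sketched below on a plain list of token ids, so it runs without tiktoken. With tiktoken installed, the ids would typically come from `tiktoken.get_encoding("cl100k_base").encode(text)` and each window would be turned back into text with `.decode(...)`. The function name `token_windows` and the error on oversized overlap are assumptions.

```python
def token_windows(
    tokens: list[int], chunk_size: int, chunk_overlap: int
) -> list[list[int]]:
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        # Assumed guard: a non-positive step would never advance.
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows: list[list[int]] = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return windows
```

With the defaults of 512 and 64, each window starts 448 tokens after the previous one, so consecutive chunks share their boundary tokens.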

split(text: str, *, source: str = '') → list[jeevesagent.loader.base.Chunk]
chunk_overlap = 64
chunk_size = 512
encoding_name = 'cl100k_base'
jeevesagent.loader.chunk(text: str, *, strategy: str = 'recursive', chunk_size: int = DEFAULT_CHUNK_SIZE, chunk_overlap: int = DEFAULT_CHUNK_OVERLAP, source: str = '', separators: collections.abc.Sequence[str] | None = None, encoding: str = 'cl100k_base') → list[jeevesagent.loader.base.Chunk]

One-liner chunking: pick a strategy by name and split.

  • strategy="recursive" (default) — char-level recursive split. Honours separators (list of separator strings, tried in order). Default separators: paragraphs, sentences, words, characters.

  • strategy="markdown" — splits on heading boundaries; preserves the header trail in each chunk’s metadata.

  • strategy="sentence" — splits on sentence boundaries.

  • strategy="token" — chunks by exact token count via tiktoken (requires the loader-token extra). Honours encoding (default "cl100k_base", the encoding used by GPT-4 and GPT-3.5 Turbo; GPT-4o and GPT-4.1 use "o200k_base").

Strategy-specific kwargs are silently ignored when not applicable (e.g. encoding= is harmless if you use strategy="markdown").
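One way to get "pick by name, ignore inapplicable kwargs" behaviour is a small registry whose splitters accept and discard extra keyword arguments. Everything here is hypothetical (the toy strategies `chars` and `words`, the name `chunk_sketch`, and the ValueError for an unknown name); it only illustrates the dispatch idea.

```python
def by_chars(text: str, *, chunk_size: int, **_ignored) -> list[str]:
    """Toy strategy: fixed-width character slices."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def by_words(text: str, *, chunk_size: int, **_ignored) -> list[str]:
    """Toy strategy: chunk_size counts words instead of characters."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


STRATEGIES = {"chars": by_chars, "words": by_words}


def chunk_sketch(text: str, *, strategy: str = "chars", **options) -> list[str]:
    """Look up a splitter by name and forward every option to it."""
    try:
        splitter = STRATEGIES[strategy]
    except KeyError:
        # Assumed behaviour: reject unknown names with a clear message.
        raise ValueError(
            f"unknown strategy {strategy!r}; expected one of {sorted(STRATEGIES)}"
        ) from None
    # Options a splitter does not use land in **_ignored and are dropped.
    return splitter(text, **options)
```

Silently dropping unused options keeps call sites uniform across strategies, at the cost of not catching a misspelled keyword.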

jeevesagent.loader.load(path: str | pathlib.Path) → jeevesagent.loader.base.Document

Load a document by auto-detecting its format from the file extension. Supported: .pdf, .docx, .xlsx, .xlsm, .csv, .tsv, .md, .markdown, .txt, .html, .htm.

Raises ValueError for unknown extensions.
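Extension-based detection can be sketched with a lookup table. The extensions are the ones listed above; the mapping of `.xlsm`, `.markdown`, and `.htm` onto format names, the case-insensitive match, and the name `detect_format` are assumptions made for illustration.

```python
from pathlib import Path

# Extension -> format name. The alias targets are assumed, not documented.
FORMAT_BY_EXTENSION = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".xlsx": "xlsx",
    ".xlsm": "xlsx",
    ".csv": "csv",
    ".tsv": "tsv",
    ".md": "md",
    ".markdown": "md",
    ".txt": "txt",
    ".html": "html",
    ".htm": "html",
}


def detect_format(path: "str | Path") -> str:
    """Return the format name for a path, or raise ValueError if unknown."""
    suffix = Path(path).suffix.lower()  # assumed: matching ignores case
    try:
        return FORMAT_BY_EXTENSION[suffix]
    except KeyError:
        raise ValueError(f"unsupported file extension: {suffix!r}") from None
```

A real dispatcher would map each format to its loader function rather than a name, but the lookup-then-ValueError shape is the same.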

jeevesagent.loader.load_csv(path: str | pathlib.Path) → jeevesagent.loader.base.Document

Load a comma-separated file → markdown table.
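As an illustration of the CSV-to-table conversion, the sketch below uses the standard `csv` module and treats the first row as the header. The function names, the pipe escaping, and the padding of short rows are assumptions; the loader's actual output format may differ.

```python
import csv
import io


def rows_to_markdown_table(rows: list[list[str]]) -> str:
    """Render rows as a markdown table, using the first row as the header."""
    if not rows:
        return ""
    header, *body = rows
    width = len(header)

    def fmt(row: list[str]) -> str:
        # Escape pipes so cell text cannot break the table layout.
        cells = [cell.replace("|", "\\|") for cell in row]
        cells += [""] * (width - len(cells))  # pad short rows
        return "| " + " | ".join(cells[:width]) + " |"  # drop extra cells

    lines = [fmt(header), "| " + " | ".join(["---"] * width) + " |"]
    lines += [fmt(row) for row in body]
    return "\n".join(lines)


def csv_text_to_markdown(text: str, delimiter: str = ",") -> str:
    """Parse delimited text and convert it to a markdown table."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    return rows_to_markdown_table(rows)
```

Passing `delimiter="\t"` handles tab-separated input with the same code path.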

jeevesagent.loader.load_docx(path: str | pathlib.Path) → jeevesagent.loader.base.Document

Load a .docx file → markdown.

Requires python-docx: pip install 'jeevesagent[loader-docx]'.

jeevesagent.loader.load_excel(path: str | pathlib.Path) → jeevesagent.loader.base.Document

Load an Excel workbook → markdown.

Each sheet becomes ## {sheet_name} with the cell grid as a markdown table. Formula cells return their cached values (data_only=True).

Requires openpyxl: pip install 'jeevesagent[loader-excel]'.

jeevesagent.loader.load_html(path: str | pathlib.Path) → jeevesagent.loader.base.Document

Load an HTML file → markdown.

Requires beautifulsoup4: pip install 'jeevesagent[loader-html]'.

jeevesagent.loader.load_markdown(path: str | pathlib.Path) → jeevesagent.loader.base.Document

Load a markdown file. Just reads UTF-8 text.

jeevesagent.loader.load_pdf(path: str | pathlib.Path) → jeevesagent.loader.base.Document

Load a PDF, convert to markdown.

Each page becomes ## Page N followed by the extracted text. Requires pypdf: pip install 'jeevesagent[loader-pdf]'.
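The per-page layout can be shown with already-extracted page strings, which keeps the sketch independent of pypdf. With pypdf, such strings would typically come from `page.extract_text()` for each page in `PdfReader(path).pages`. The function name and the blank-line spacing are assumptions.

```python
def pages_to_markdown(page_texts: list[str]) -> str:
    """Join extracted page texts under '## Page N' headings (1-based)."""
    sections = [
        f"## Page {number}\n\n{text.strip()}"
        for number, text in enumerate(page_texts, start=1)
    ]
    return "\n\n".join(sections)
```

Because every page gets its own `##` heading, a heading-aware chunker can later report which page a chunk came from.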

jeevesagent.loader.load_text(path: str | pathlib.Path) → jeevesagent.loader.base.Document

Load a plain-text file. Wraps content in markdown by adding a # {filename} heading so downstream chunkers / consumers see consistent markdown.
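A minimal sketch of the heading wrap, assuming `{filename}` means the final path component including its extension (the source does not say whether the extension is kept); `wrap_plain_text` is a hypothetical name.

```python
from pathlib import Path


def wrap_plain_text(path: "str | Path", text: str) -> str:
    """Prefix plain text with a '# {filename}' heading."""
    filename = Path(path).name  # assumed: keeps the extension, drops directories
    return f"# {filename}\n\n{text.strip()}\n"
```

The added heading gives plain-text files at least one section, so heading-based chunking has something to anchor on.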

jeevesagent.loader.load_tsv(path: str | pathlib.Path) → jeevesagent.loader.base.Document

Load a tab-separated file → markdown table.