jeevesagent.loader.base

Core types for the loader: Document and Chunk.

Every loader normalizes its source format to a Document whose content is markdown text and whose metadata carries provenance (source path, MIME type, page / sheet count, etc.). The chunkers in jeevesagent.loader.chunking consume the content and produce Chunk objects with their own metadata pointing back at the source.

Classes

Chunk

One piece of a chunked document.

Document

A loaded document, normalized to markdown.

Module Contents

class jeevesagent.loader.base.Chunk

One piece of a chunked document.

content is a substring of the source document’s content, possibly after light cleanup such as trimming whitespace. metadata carries:

  • source — pass-through from the parent Document

  • index — zero-based chunk index in the source

  • chunk_size — actual length of content (chars)

  • Strategy-specific keys (e.g. headers from MarkdownChunker, token_count from TokenChunker).

content: str
metadata: dict[str, Any]
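
As a rough illustration of the metadata keys listed above, the sketch below fills them in with a naive fixed-size splitter. The `Chunk` dataclass here is a local stand-in that only mirrors the two documented fields, and `fixed_size_chunks` is a hypothetical helper; neither is the library's own implementation, and the real chunkers in jeevesagent.loader.chunking may behave differently.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Chunk:
    """Local stand-in mirroring the documented fields (not the library class)."""
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def fixed_size_chunks(text: str, source: str, size: int) -> list[Chunk]:
    """Hypothetical splitter: cut text every `size` chars and trim whitespace."""
    chunks: list[Chunk] = []
    for start in range(0, len(text), size):
        piece = text[start:start + size].strip()
        if not piece:
            continue  # skip windows that were only whitespace
        chunks.append(
            Chunk(
                content=piece,
                metadata={
                    "source": source,            # pass-through from the parent document
                    "index": len(chunks),        # zero-based position among emitted chunks
                    "chunk_size": len(piece),    # length of the cleaned content, in chars
                },
            )
        )
    return chunks


chunks = fixed_size_chunks("# Title\n\nSome body text.", source="notes.md", size=10)
for chunk in chunks:
    print(chunk.metadata["index"], chunk.metadata["chunk_size"], repr(chunk.content))
```

Note that chunk_size is computed after trimming, which matches the description of it as the actual length of content rather than the requested window size.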
class jeevesagent.loader.base.Document

A loaded document, normalized to markdown.

content

The full markdown text. Loaders aim to keep the source structure: PDF and DOCX preserve headings and paragraphs; Excel and CSV become markdown tables; HTML preserves heading, paragraph, and list structure.

metadata

Free-form dict with at least:

  • source — the source file path (str)

  • format — the source format ("pdf", "docx", "xlsx", "csv", "tsv", "md", "txt", "html")

Format-specific keys may be present ("page_count" for PDFs, "sheet_names" for Excel, "row_count" for CSV, etc.).

content: str
metadata: dict[str, Any]
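
The sketch below shows what a loaded CSV might look like under the contract described above, and how a caller could check for the two guaranteed metadata keys. The `Document` dataclass is a local stand-in with the same two fields, and `missing_keys`, the file path, and the table contents are all hypothetical; the actual class may be constructed differently.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Document:
    """Local stand-in mirroring the documented fields (not the library class)."""
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)


# Keys the documentation says are always present in metadata.
REQUIRED_KEYS = ("source", "format")


def missing_keys(doc: Document) -> list[str]:
    """Hypothetical helper: list the guaranteed keys absent from doc.metadata."""
    return [key for key in REQUIRED_KEYS if key not in doc.metadata]


# A CSV source normalized to a markdown table, with a format-specific key.
doc = Document(
    content="| name | qty |\n| --- | --- |\n| apple | 3 |",
    metadata={"source": "data/fruit.csv", "format": "csv", "row_count": 1},
)

assert missing_keys(doc) == []
```

Because metadata is a free-form dict, format-specific keys such as "row_count" are best read with `doc.metadata.get(...)` rather than assumed to exist.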