jeevesagent.loader
==================

.. py:module:: jeevesagent.loader

.. autoapi-nested-parse::

   Document loaders + chunking.

   Reads ``.pdf``, ``.docx``, ``.xlsx``, ``.csv``, ``.tsv``, ``.md``,
   ``.txt``, and ``.html`` files into a normalized :class:`Document`
   whose ``content`` is markdown text. From there, four chunking
   strategies break the document into LLM-friendly pieces:

   * :class:`RecursiveChunker` — the production workhorse (LangChain-
     compatible behaviour)
   * :class:`MarkdownChunker` — splits on ``#`` heading boundaries;
     preserves the header trail in chunk metadata. Best for the
     markdown produced by the PDF / DOCX / Excel loaders.
   * :class:`SentenceChunker` — sentence-boundary chunks for QA-style
     RAG.
   * :class:`TokenChunker` — chunk by token count via ``tiktoken``
     (lazy import).

   One-liner usage::

       from jeevesagent.loader import load, chunk

       doc = load("research.pdf")              # auto-detect format
       chunks = chunk(doc.content)             # default: RecursiveChunker

   Or pick the loader and chunker explicitly::

       from jeevesagent.loader import load_pdf, MarkdownChunker

       doc = load_pdf("research.pdf")
       chunker = MarkdownChunker(chunk_size=800, chunk_overlap=100)
       chunks = chunker.split(doc.content)

   Optional dependencies
   ---------------------

   The PDF / DOCX / Excel / HTML loaders are gated behind extras so
   the framework's base install stays lean::

       pip install 'jeevesagent[loader]'              # all four
       pip install 'jeevesagent[loader-pdf]'          # just pypdf
       pip install 'jeevesagent[loader-docx]'         # just python-docx
       pip install 'jeevesagent[loader-excel]'        # just openpyxl
       pip install 'jeevesagent[loader-html]'         # just beautifulsoup4

   Each loader raises a helpful :class:`ImportError` if its dependency
   is missing.



Submodules
----------

.. toctree::
   :maxdepth: 1

   /api/jeevesagent/loader/base/index
   /api/jeevesagent/loader/chunking/index
   /api/jeevesagent/loader/csv/index
   /api/jeevesagent/loader/dispatch/index
   /api/jeevesagent/loader/docx/index
   /api/jeevesagent/loader/excel/index
   /api/jeevesagent/loader/html/index
   /api/jeevesagent/loader/pdf/index
   /api/jeevesagent/loader/text/index


Classes
-------

.. autoapisummary::

   jeevesagent.loader.Chunk
   jeevesagent.loader.Document
   jeevesagent.loader.MarkdownChunker
   jeevesagent.loader.RecursiveChunker
   jeevesagent.loader.SentenceChunker
   jeevesagent.loader.TokenChunker


Functions
---------

.. autoapisummary::

   jeevesagent.loader.chunk
   jeevesagent.loader.load
   jeevesagent.loader.load_csv
   jeevesagent.loader.load_docx
   jeevesagent.loader.load_excel
   jeevesagent.loader.load_html
   jeevesagent.loader.load_markdown
   jeevesagent.loader.load_pdf
   jeevesagent.loader.load_text
   jeevesagent.loader.load_tsv


Package Contents
----------------

.. py:class:: Chunk

   One piece of a chunked document.

   ``content`` is a substring of the source document's content
   (with possible cleanup — trimmed whitespace, etc.).
   ``metadata`` carries:

   * ``source`` — pass-through from the parent :class:`Document`
   * ``index`` — zero-based chunk index in the source
   * ``chunk_size`` — actual length of ``content`` (chars)
   * Strategy-specific keys (e.g. ``headers`` from
     :class:`MarkdownChunker`, ``token_count`` from
     :class:`TokenChunker`).


   .. py:attribute:: content
      :type:  str


   .. py:attribute:: metadata
      :type:  dict[str, Any]


.. py:class:: Document

   A loaded document, normalized to markdown.

   ``content``
       The full markdown text. Loaders produce reasonable
       markdown: PDF / DOCX preserve headings + paragraphs; Excel
       / CSV become markdown tables; HTML preserves heading +
       paragraph + list structure.

   ``metadata``
       Free-form dict with at least:

       * ``source`` — the source file path (str)
       * ``format`` — the source format (``"pdf"``, ``"docx"``,
           ``"xlsx"``, ``"csv"``, ``"tsv"``, ``"md"``, ``"txt"``,
           ``"html"``)

       Format-specific keys may be present (``"page_count"`` for
       PDFs, ``"sheet_names"`` for Excel, ``"row_count"`` for CSV,
       etc.).


   .. py:attribute:: content
      :type:  str


   .. py:attribute:: metadata
      :type:  dict[str, Any]


.. py:class:: MarkdownChunker(chunk_size: int = DEFAULT_CHUNK_SIZE, chunk_overlap: int = DEFAULT_CHUNK_OVERLAP)

   Split markdown on heading boundaries.

   Each chunk corresponds to one section: the heading line plus
   its content up to (but not including) the next heading at the
   same OR shallower depth. Long sections are further split via
   :class:`RecursiveChunker` so no chunk exceeds ``chunk_size``.

   Each chunk's metadata records the trail of parent headers
   (the path from the document root to this section), letting the
   retriever show users where each chunk came from.

   Use this for markdown produced by the PDF / DOCX / Excel
   loaders — it preserves the document's hierarchy.


   .. py:method:: split(text: str, *, source: str = '') -> list[jeevesagent.loader.base.Chunk]


   .. py:attribute:: chunk_overlap
      :value: 100



   .. py:attribute:: chunk_size
      :value: 800



.. py:class:: RecursiveChunker(chunk_size: int = DEFAULT_CHUNK_SIZE, chunk_overlap: int = DEFAULT_CHUNK_OVERLAP, separators: collections.abc.Sequence[str] = _DEFAULT_SEPARATORS)

   Recursive character splitter — the production workhorse.

   Aims for chunks of ``chunk_size`` characters with
   ``chunk_overlap`` chars of overlap. Splits on a hierarchy of
   separators (paragraph → line → sentence → word → char), trying
   to preserve semantic boundaries.

   This is the algorithm LangChain calls
   ``RecursiveCharacterTextSplitter`` and the one most production
   RAG pipelines default to. Anthropic's cookbook specifically
   recommends it for general text.


   .. py:method:: split(text: str, *, source: str = '') -> list[jeevesagent.loader.base.Chunk]


   .. py:attribute:: chunk_overlap
      :value: 100



   .. py:attribute:: chunk_size
      :value: 800



   .. py:attribute:: separators


.. py:class:: SentenceChunker(chunk_size: int = DEFAULT_CHUNK_SIZE, chunk_overlap: int = DEFAULT_CHUNK_OVERLAP)

   Sentence-boundary chunks.

   Splits on sentence terminators (``.``, ``!``, ``?``) followed by
   whitespace and a capital letter. Greedily packs sentences up to
   ``chunk_size`` characters; adds ``chunk_overlap`` chars between
   chunks (rounded to the nearest sentence boundary).

   Best for QA-style RAG where each chunk should answer one short
   question.


   .. py:method:: split(text: str, *, source: str = '') -> list[jeevesagent.loader.base.Chunk]


   .. py:attribute:: chunk_overlap
      :value: 100



   .. py:attribute:: chunk_size
      :value: 800



.. py:class:: TokenChunker(chunk_size: int = 512, chunk_overlap: int = 64, encoding: str = 'cl100k_base')

   Chunk by exact token count using ``tiktoken``.

   Each chunk is at most ``chunk_size`` TOKENS (not characters)
   with ``chunk_overlap`` tokens of overlap. Use this when you
   need tight control over context-window fit (embedding models
   have hard token limits — text-embedding-3-large is 8191).

   Requires ``tiktoken``: ``pip install 'jeevesagent[loader]'``.


   .. py:method:: split(text: str, *, source: str = '') -> list[jeevesagent.loader.base.Chunk]


   .. py:attribute:: chunk_overlap
      :value: 64



   .. py:attribute:: chunk_size
      :value: 512



   .. py:attribute:: encoding_name
      :value: 'cl100k_base'



.. py:function:: chunk(text: str, *, strategy: str = 'recursive', chunk_size: int = DEFAULT_CHUNK_SIZE, chunk_overlap: int = DEFAULT_CHUNK_OVERLAP, source: str = '', separators: collections.abc.Sequence[str] | None = None, encoding: str = 'cl100k_base') -> list[jeevesagent.loader.base.Chunk]

   One-liner chunking: pick a strategy by name and split.

   * ``strategy="recursive"`` (default) — char-level recursive
     split. Honours ``separators`` (list of separator strings,
     tried in order). Default separators: paragraphs, sentences,
     words, characters.
   * ``strategy="markdown"`` — splits on heading boundaries;
     preserves the header trail in each chunk's metadata.
   * ``strategy="sentence"`` — splits on sentence boundaries.
   * ``strategy="token"`` — chunks by exact token count via
     ``tiktoken`` (requires the ``loader-token`` extra). Honours
     ``encoding`` (default ``"cl100k_base"`` for GPT-4 / 4o / 4.1).

   Strategy-specific kwargs are silently ignored when not
   applicable (e.g. ``encoding=`` is harmless if you use
   ``strategy="markdown"``).


.. py:function:: load(path: str | pathlib.Path) -> jeevesagent.loader.base.Document

   Load a document by auto-detecting its format from the file
   extension. Supported: ``.pdf``, ``.docx``, ``.xlsx``, ``.xlsm``,
   ``.csv``, ``.tsv``, ``.md``, ``.markdown``, ``.txt``, ``.html``,
   ``.htm``.

   Raises :class:`ValueError` for unknown extensions.


.. py:function:: load_csv(path: str | pathlib.Path) -> jeevesagent.loader.base.Document

   Load a comma-separated file → markdown table.


.. py:function:: load_docx(path: str | pathlib.Path) -> jeevesagent.loader.base.Document

   Load a ``.docx`` file → markdown.

   Requires ``python-docx``:
   ``pip install 'jeevesagent[loader-docx]'``.


.. py:function:: load_excel(path: str | pathlib.Path) -> jeevesagent.loader.base.Document

   Load an Excel workbook → markdown.

   Each sheet becomes ``## {sheet_name}`` with the cell grid as a
   markdown table. Formula cells return their cached values
   (data_only=True).

   Requires ``openpyxl``:
   ``pip install 'jeevesagent[loader-excel]'``.


.. py:function:: load_html(path: str | pathlib.Path) -> jeevesagent.loader.base.Document

   Load an HTML file → markdown.

   Requires ``beautifulsoup4``:
   ``pip install 'jeevesagent[loader-html]'``.


.. py:function:: load_markdown(path: str | pathlib.Path) -> jeevesagent.loader.base.Document

   Load a markdown file. Just reads UTF-8 text.


.. py:function:: load_pdf(path: str | pathlib.Path) -> jeevesagent.loader.base.Document

   Load a PDF, convert to markdown.

   Each page becomes ``## Page N`` followed by the extracted text.
   Requires ``pypdf``: ``pip install 'jeevesagent[loader-pdf]'``.


.. py:function:: load_text(path: str | pathlib.Path) -> jeevesagent.loader.base.Document

   Load a plain-text file. Wraps content in markdown by
   adding a ``# {filename}`` heading so downstream chunkers /
   consumers see consistent markdown.


.. py:function:: load_tsv(path: str | pathlib.Path) -> jeevesagent.loader.base.Document

   Load a tab-separated file → markdown table.


