jeevesagent.loader.chunking
===========================

.. py:module:: jeevesagent.loader.chunking

.. autoapi-nested-parse::

   Chunking strategies for splitting documents into LLM-friendly pieces.

   Four strategies, picked by what your downstream RAG / context
   window needs:

   * :class:`RecursiveChunker` — the production default. Splits on a
     hierarchy of separators (paragraph → line → sentence → word →
     char) so semantic boundaries survive when possible. The same
     algorithm as LangChain's ``RecursiveCharacterTextSplitter``;
     widely recommended in Anthropic's RAG cookbook.
   * :class:`MarkdownChunker` — splits on heading boundaries (``#``,
     ``##``, ``###``, …). Each chunk's metadata records the trail of
     parent headers, so retrieval surfaces section context. Use this
     for the markdown produced by the PDF / DOCX / Excel loaders.
   * :class:`SentenceChunker` — sentence-boundary chunks. Use for
     QA-style RAG where each chunk should answer one short question.
   * :class:`TokenChunker` — chunk by token count via ``tiktoken``
     (lazy import). Use when you need tight control over context-
     window fit.

   Defaults
   --------

   All chunkers default to ``chunk_size=800`` characters with
   ``chunk_overlap=100`` (12.5% overlap) — the values Anthropic
   recommends in their RAG documentation. Override per-chunker as
   needed.

   Convenience factory: :func:`chunk` picks a strategy by name::

       from jeevesagent.loader import chunk

       pieces = chunk(text, strategy="recursive", chunk_size=800)
       pieces = chunk(text, strategy="markdown")
       pieces = chunk(text, strategy="sentence", chunk_size=400)
       pieces = chunk(text, strategy="token", chunk_size=512)



Attributes
----------

.. autoapisummary::

   jeevesagent.loader.chunking.DEFAULT_CHUNK_OVERLAP
   jeevesagent.loader.chunking.DEFAULT_CHUNK_SIZE


Classes
-------

.. autoapisummary::

   jeevesagent.loader.chunking.Chunker
   jeevesagent.loader.chunking.MarkdownChunker
   jeevesagent.loader.chunking.RecursiveChunker
   jeevesagent.loader.chunking.SentenceChunker
   jeevesagent.loader.chunking.TokenChunker


Functions
---------

.. autoapisummary::

   jeevesagent.loader.chunking.chunk


Module Contents
---------------

.. py:class:: Chunker

   Bases: :py:obj:`Protocol`


   Anything with a ``split(text) -> list[Chunk]`` method.


   .. py:method:: split(text: str, *, source: str = '') -> list[jeevesagent.loader.base.Chunk]


.. py:class:: MarkdownChunker(chunk_size: int = DEFAULT_CHUNK_SIZE, chunk_overlap: int = DEFAULT_CHUNK_OVERLAP)

   Split markdown on heading boundaries.

   Each chunk corresponds to one section: the heading line plus
   its content up to (but not including) the next heading at the
   same OR shallower depth. Long sections are further split via
   :class:`RecursiveChunker` so no chunk exceeds ``chunk_size``.

   Each chunk's metadata records the trail of parent headers
   (the path from the document root to this section), letting the
   retriever show users where each chunk came from.

   Use this for markdown produced by the PDF / DOCX / Excel
   loaders — it preserves the document's hierarchy.


   .. py:method:: split(text: str, *, source: str = '') -> list[jeevesagent.loader.base.Chunk]


   .. py:attribute:: chunk_overlap
      :value: 100



   .. py:attribute:: chunk_size
      :value: 800
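
The header-trail bookkeeping described above can be pictured with a short
standalone sketch. This is illustrative only, not the library's
implementation: ``markdown_sections`` and the ``(trail, body)`` return shape
are invented here, the preamble before the first heading is ignored, and the
real chunker additionally re-splits long sections and attaches metadata::

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def markdown_sections(text):
    """Return (header_trail, body) pairs, one per heading-delimited section."""
    stack = []      # open headings on the path from the root: (depth, title)
    body = []
    sections = []

    def flush():
        sections.append((tuple(title for _, title in stack),
                         "\n".join(body).strip()))

    for line in text.splitlines():
        m = HEADING.match(line)
        if not m:
            body.append(line)
            continue
        if stack:
            flush()
        depth = len(m.group(1))
        # Drop siblings and deeper headings; keep only ancestors.
        while stack and stack[-1][0] >= depth:
            stack.pop()
        stack.append((depth, m.group(2).strip()))
        body = []
    if stack:
        flush()
    return sections
```

Popping the stack down to the first shallower heading is what makes a ``##``
close any open ``###`` sections while keeping the enclosing ``#`` in the trail.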



.. py:class:: RecursiveChunker(chunk_size: int = DEFAULT_CHUNK_SIZE, chunk_overlap: int = DEFAULT_CHUNK_OVERLAP, separators: collections.abc.Sequence[str] = _DEFAULT_SEPARATORS)

   Recursive character splitter — the production workhorse.

   Aims for chunks of ``chunk_size`` characters with
   ``chunk_overlap`` chars of overlap. Splits on a hierarchy of
   separators (paragraph → line → sentence → word → char), trying
   to preserve semantic boundaries.

   This is the algorithm LangChain calls
   ``RecursiveCharacterTextSplitter`` and the one most production
   RAG pipelines default to. Anthropic's cookbook specifically
   recommends it for general text.


   .. py:method:: split(text: str, *, source: str = '') -> list[jeevesagent.loader.base.Chunk]


   .. py:attribute:: chunk_overlap
      :value: 100



   .. py:attribute:: chunk_size
      :value: 800



   .. py:attribute:: separators
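
The core of the recursive strategy can be sketched in standalone Python. This
is illustrative only, not the library's implementation: ``recursive_split`` is
an invented name, and chunk overlap is omitted for brevity::

```python
def recursive_split(text, chunk_size=800,
                    separators=("\n\n", "\n", ". ", " ", "")):
    """Split text into pieces of at most chunk_size characters,
    preferring the coarsest separator that keeps pieces under the limit."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, *finer = separators
    if sep == "":
        # Last resort: hard character cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    pieces, buf = [], ""
    for part in text.split(sep):
        candidate = buf + sep + part if buf else part
        if len(candidate) <= chunk_size:
            buf = candidate
            continue
        if buf:
            pieces.append(buf)
        if len(part) > chunk_size:
            # This piece alone is too big: recurse with finer separators.
            pieces.extend(recursive_split(part, chunk_size, tuple(finer)))
            buf = ""
        else:
            buf = part
    if buf:
        pieces.append(buf)
    return pieces
```

The recursion is why paragraph boundaries survive when they can: a finer
separator is only consulted for the pieces that a coarser one could not fit.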


.. py:class:: SentenceChunker(chunk_size: int = DEFAULT_CHUNK_SIZE, chunk_overlap: int = DEFAULT_CHUNK_OVERLAP)

   Sentence-boundary chunks.

   Splits on sentence terminators (``.``, ``!``, ``?``) followed by
   whitespace and a capital letter. Greedily packs sentences up to
   ``chunk_size`` characters; adds ``chunk_overlap`` chars between
   chunks (rounded to the nearest sentence boundary).

   Best for QA-style RAG where each chunk should answer one short
   question.


   .. py:method:: split(text: str, *, source: str = '') -> list[jeevesagent.loader.base.Chunk]


   .. py:attribute:: chunk_overlap
      :value: 100



   .. py:attribute:: chunk_size
      :value: 800
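
The boundary rule above amounts to one regex plus a greedy packing loop. A
minimal sketch (``sentence_chunks`` is an invented name; overlap handling is
omitted for brevity)::

```python
import re

# A sentence terminator, then whitespace, then a capital letter.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def sentence_chunks(text, chunk_size=800):
    """Greedily pack whole sentences into chunks of <= chunk_size chars."""
    chunks, buf = [], ""
    for sentence in BOUNDARY.split(text.strip()):
        candidate = f"{buf} {sentence}" if buf else sentence
        if len(candidate) <= chunk_size or not buf:
            # A single over-long sentence becomes its own (oversized) chunk.
            buf = candidate
        else:
            chunks.append(buf)
            buf = sentence
    if buf:
        chunks.append(buf)
    return chunks
```

Because sentences are never split mid-way, an individual chunk can exceed
``chunk_size`` when one sentence alone is longer than the limit.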



.. py:class:: TokenChunker(chunk_size: int = 512, chunk_overlap: int = 64, encoding: str = 'cl100k_base')

   Chunk by exact token count using ``tiktoken``.

   Each chunk is at most ``chunk_size`` TOKENS (not characters)
   with ``chunk_overlap`` tokens of overlap. Use this when you
   need tight control over context-window fit (embedding models
   have hard token limits — text-embedding-3-large is 8191).

   Requires ``tiktoken``: ``pip install 'jeevesagent[loader]'``.


   .. py:method:: split(text: str, *, source: str = '') -> list[jeevesagent.loader.base.Chunk]


   .. py:attribute:: chunk_overlap
      :value: 64



   .. py:attribute:: chunk_size
      :value: 512



   .. py:attribute:: encoding_name
      :value: 'cl100k_base'
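
The windowing itself is independent of the tokenizer and can be sketched over
a plain list of token ids (``token_windows`` is an invented name; the real
chunker first encodes with ``tiktoken`` and decodes each window back to
text)::

```python
def token_windows(token_ids, chunk_size=512, chunk_overlap=64):
    """Slice a token-id sequence into overlapping fixed-size windows."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not token_ids:
        return []
    step = chunk_size - chunk_overlap
    windows, start = [], 0
    while True:
        windows.append(token_ids[start:start + chunk_size])
        if start + chunk_size >= len(token_ids):
            return windows
        start += step
```

Each window starts ``chunk_size - chunk_overlap`` tokens after the previous
one, so the last ``chunk_overlap`` tokens of a window reappear at the start of
the next.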



.. py:function:: chunk(text: str, *, strategy: str = 'recursive', chunk_size: int = DEFAULT_CHUNK_SIZE, chunk_overlap: int = DEFAULT_CHUNK_OVERLAP, source: str = '', separators: collections.abc.Sequence[str] | None = None, encoding: str = 'cl100k_base') -> list[jeevesagent.loader.base.Chunk]

   One-liner chunking: pick a strategy by name and split.

   * ``strategy="recursive"`` (default) — char-level recursive
     split. Honours ``separators`` (list of separator strings,
     tried in order). Default separators: paragraphs, sentences,
     words, characters.
   * ``strategy="markdown"`` — splits on heading boundaries;
     preserves the header trail in each chunk's metadata.
   * ``strategy="sentence"`` — splits on sentence boundaries.
   * ``strategy="token"`` — chunks by exact token count via
     ``tiktoken`` (requires the ``loader-token`` extra). Honours
     ``encoding``. The default ``"cl100k_base"`` is the GPT-4
     encoding; GPT-4o and GPT-4.1 use ``"o200k_base"``.

   Strategy-specific kwargs are silently ignored when not
   applicable (e.g. ``encoding=`` is harmless if you use
   ``strategy="markdown"``).


.. py:data:: DEFAULT_CHUNK_OVERLAP
   :value: 100


.. py:data:: DEFAULT_CHUNK_SIZE
   :value: 800


