Metadata-Version: 2.4
Name: entropy-chunker
Version: 0.1.0
Summary: Adaptive Information-Aware Chunking for RAG and Agentic Systems, driven by information density instead of fixed token counts.
Author-email: Akshat Tulsani <a97tulsani@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/Tulsani/entropy-chunker
Project-URL: Repository, https://github.com/Tulsani/entropy-chunker
Project-URL: Issues, https://github.com/Tulsani/entropy-chunker/issues
Keywords: rag,chunking,retrieval-augmented-generation,nlp,information-theory,text-splitting,embeddings
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: tokens
Requires-Dist: tiktoken>=0.5; extra == "tokens"
Provides-Extra: eval
Requires-Dist: backoff>=2.0; extra == "eval"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: matplotlib>=3.5; extra == "dev"
Dynamic: license-file

# entropy-chunker

**Adaptive Information-Aware Chunking for RAG and Agentic Systems.** Chunk boundaries
follow information density instead of a fixed token count. No embeddings and
no language model are involved anywhere in the chunking decision.

```bash
pip install entropy-chunker
```

```python
from entropy_chunker import InfoTheoreticChunker

chunker = InfoTheoreticChunker()
chunks = chunker.split_text(my_document_text)
```

## Why

Most chunkers assume **equal token count ≈ equal information content**.
That assumption is usually false: under a fixed 512-token splitter, a legal
contract's definitions section, packed with entities referenced throughout
the document, gets split as arbitrarily as its boilerplate governing-law clause.

`entropy-chunker` instead scores each sentence on three signals, then
walks through the document and emits a chunk boundary when *accumulated
information*, not accumulated tokens, crosses a threshold:

| Signal | What it measures | How |
|---|---|---|
| **Compression** | redundancy vs. recent context | marginal gzip-compressed size |
| **Lexical novelty** | new vocabulary introduced | running vocabulary set |
| **Word-frequency entropy** | rarity vs. the rest of this document | classical Shannon surprisal |

All three are closed-form statistics over the text itself — no vector
embeddings, no pretrained model, no API calls. This also means it's fast
(the ~740K-character Finance benchmark corpus chunks in well under a
second) and fully deterministic.
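
As a rough illustration of how statistics like these can be computed with the standard library alone, here is a minimal sketch. It is not the package's implementation: the function names, the tokenizer regex, and the way each signal is normalized are assumptions made for this example.

```python
import gzip
import math
import re
from collections import Counter


def words(text):
    """Lowercased word tokens (hypothetical tokenizer for this sketch)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def compression_signal(context, sentence):
    """Extra gzip bytes the sentence adds to the context, per raw sentence byte."""
    base = len(gzip.compress(context.encode("utf-8")))
    joined = len(gzip.compress((context + " " + sentence).encode("utf-8")))
    return max(joined - base, 0) / max(len(sentence.encode("utf-8")), 1)


def novelty_signal(seen, sentence):
    """Fraction of the sentence's distinct words not yet in the running vocabulary."""
    distinct = set(words(sentence))
    if not distinct:
        return 0.0
    return len(distinct - seen) / len(distinct)


def surprisal_signal(doc_counts, sentence):
    """Mean Shannon surprisal (bits) of the sentence's words under document frequencies."""
    total = sum(doc_counts.values())
    toks = [t for t in words(sentence) if doc_counts.get(t, 0) > 0]
    if not toks or total == 0:
        return 0.0
    return sum(-math.log2(doc_counts[t] / total) for t in toks) / len(toks)


sentences = ["The cat sat.", "The cat sat.", "Quantum flux capacitors resonate."]
doc_counts = Counter(words(" ".join(sentences)))
seen = set(words(sentences[0]))
for s in sentences[1:]:
    print(s, compression_signal(sentences[0], s),
          novelty_signal(seen, s), surprisal_signal(doc_counts, s))
```

A repeated sentence should score low on all three signals, while a sentence full of rare, unseen words should score high.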

## Where it outshines standard and embedding-based chunking

Benchmarked against [Chroma's chunking evaluation](https://www.trychroma.com/research/evaluating-chunking)
methodology (472 real queries, 5 corpora), using the same embedding model
as the paper's primary table:

| Metric | entropy-chunker | Best paper baseline |
|---|---|---|
| Precision | **8.93** | 7.0 (Recursive-200) |
| Precision-Ω | **37.68** | 29.9 (Recursive-200) |
| IoU | **8.84** | 6.9 (Recursive-200) |
| Recall | 83.9 | 91.9 (LLM-GPT4o) |

**Precision, Precision-Ω, and IoU all beat every baseline tested**,
including `ClusterSemanticChunker` (which uses embeddings directly to pick
boundaries) and `LLMSemanticChunker` (which prompts GPT-4o). Recall trails
the best baseline by 8 points (83.9 vs. 91.9): smaller, more targeted chunks
retrieve precisely but are slightly more likely to split a long excerpt
across a boundary. That tradeoff is the honest cost of the precision gain,
not a bug.

Chunk sizes also vary **4-5x more** than a fixed-size baseline across
every corpus tested, which is direct evidence that boundaries track real
information density rather than token count.
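
This README does not say which statistic the size-variation comparison uses. One plausible way to check it on your own corpus is the coefficient of variation of chunk lengths, sketched below; the helper name `size_spread` and the choice of statistic are assumptions for illustration.

```python
import statistics


def size_spread(chunks):
    """Coefficient of variation (population std / mean) of chunk lengths in characters."""
    sizes = [len(c) for c in chunks]
    if not sizes:
        return 0.0
    mean = statistics.fmean(sizes)
    return statistics.pstdev(sizes) / mean if mean else 0.0


# Toy inputs: near-uniform chunks vs. chunks of very different lengths.
fixed_like = ["x" * 500, "x" * 510, "x" * 495]
adaptive_like = ["x" * 120, "x" * 900, "x" * 400]
print(size_spread(fixed_like), size_spread(adaptive_like))
```

Comparing `size_spread` for the output of a fixed-size splitter against the output of `InfoTheoreticChunker().split_text(...)` on the same text gives a ratio you can compare against the figure above.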

Full methodology, per-corpus breakdowns, and the weight-sensitivity
analysis behind the presets below: **[BENCHMARKS.md](BENCHMARKS.md)**.

## Presets

Equal weighting is a safe default, but benchmark sweeps found it is never
actually optimal. Three tuned presets, backed by that sweep data, sit
alongside the `balanced` default:

```python
InfoTheoreticChunker(preset="precise")         # best IoU/precision in benchmarks
InfoTheoreticChunker(preset="recall_focused")  # trades some IoU for recall
InfoTheoreticChunker(preset="tabular")         # for tables, boilerplate-heavy docs
InfoTheoreticChunker(preset="balanced")        # equal weighting -- this is the package default
```

Omitting the `preset` argument is equivalent to `preset="balanced"`. If you
don't know your corpus's structure in advance, `precise` is the better
starting point for most prose/document use cases. `balanced` remains the
default only because it was never the worst option in any benchmark
corpus, which makes it the safer, unbiased choice when nothing else is known.

Or set weights directly: `InfoTheoreticChunker(w_compression=0.1, w_novelty=0.0, w_entropy=0.9)`.
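
To make the effect of the weights concrete, here is a hedged sketch of a weighted combination of the three per-sentence signals. The function `combine_signals` is hypothetical, and whether the package normalizes the weights this way is an assumption, not something this README states.

```python
def combine_signals(compression, novelty, entropy,
                    w_compression=1 / 3, w_novelty=1 / 3, w_entropy=1 / 3):
    """Weighted average of three signal values (assumed to be on comparable scales)."""
    total = w_compression + w_novelty + w_entropy
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    weighted = (w_compression * compression
                + w_novelty * novelty
                + w_entropy * entropy)
    return weighted / total


# With w_novelty=0.0 the novelty signal has no influence on the score.
score_a = combine_signals(0.2, 0.1, 0.8, w_compression=0.1, w_novelty=0.0, w_entropy=0.9)
score_b = combine_signals(0.2, 0.9, 0.8, w_compression=0.1, w_novelty=0.0, w_entropy=0.9)
print(score_a, score_b)
```

Under this reading, a weight of `0.0` switches a signal off entirely, and the remaining weights set how strongly each signal pulls the score.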

## Tunable parameters

```python
InfoTheoreticChunker(
    info_threshold=1.0,   # cumulative info score that triggers a boundary
    max_tokens=800,       # hard ceiling, regardless of info score
    preset="precise",     # or set w_compression/w_novelty/w_entropy directly
)
```
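
How `info_threshold` and `max_tokens` might interact can be sketched as a single pass over scored sentences. This is an illustration under stated assumptions, not the package's code: `walk_boundaries` is a hypothetical helper, the per-sentence scores are taken as given, and tokens are estimated with the chars/4 fallback described in this README.

```python
def walk_boundaries(sentences, scores, info_threshold=1.0, max_tokens=800):
    """Group sentences into chunks by accumulated score, with a hard token ceiling."""
    chunks, current, info, tokens = [], [], 0.0, 0

    def flush():
        nonlocal current, info, tokens
        if current:
            chunks.append(" ".join(current))
        current, info, tokens = [], 0.0, 0

    for sentence, score in zip(sentences, scores):
        sent_tokens = max(len(sentence) // 4, 1)  # assumed chars/4 estimate
        # Hard ceiling: close the chunk before this sentence would exceed it.
        if current and tokens + sent_tokens > max_tokens:
            flush()
        current.append(sentence)
        info += score
        tokens += sent_tokens
        # Information trigger: close the chunk once enough has accumulated.
        if info >= info_threshold:
            flush()
    flush()  # emit any trailing sentences
    return chunks


print(walk_boundaries(["a.", "b.", "c.", "d."], [0.6, 0.6, 0.2, 0.2]))
```

Dense sentences reach the threshold quickly and yield short chunks; low-scoring boilerplate keeps accumulating until the token ceiling forces a cut.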

## Honest limitations

- Regex-based sentence splitting (by design — no model in the pipeline),
  which can misfire on unusual punctuation or line-wrap conventions.
- Recall trails embedding- and LLM-based chunkers by several points; see
  [BENCHMARKS.md](BENCHMARKS.md) for the full tradeoff discussion.
- Validated on prose/structured-text corpora; code as a domain hasn't
  been separately benchmarked.
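
To see why regex-based sentence splitting can misfire, consider a deliberately simple splitter. The pattern below is an assumption chosen for illustration and is not the regex the package uses.

```python
import re

# Split on whitespace that follows ., ! or ? (a common minimal heuristic).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Naive sentence splitter: breaks after terminal punctuation plus whitespace."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


print(split_sentences("It works. Really well!"))
print(split_sentences("Dr. Smith arrived. He left."))  # the abbreviation is split off
```

The second call breaks after `Dr.` because the pattern cannot tell an abbreviation from a sentence end, which is the kind of failure the first limitation above refers to.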

## Installation extras

```bash
pip install "entropy-chunker[tokens]"  # exact token counting via tiktoken
pip install "entropy-chunker[eval]"    # for running the benchmark yourself
pip install git+https://github.com/brandonstarxel/chunking_evaluation.git  # required for [eval]; not on PyPI
```

Without `[tokens]`, token counting falls back to a chars/4 approximation
— this only affects the precision of the `max_tokens` ceiling, not where
chunk boundaries are placed.
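
A fallback of this kind can be sketched as follows. The helper names and the `cl100k_base` encoding are assumptions for the example; this README does not say which tiktoken encoding the package selects or how it rounds the chars/4 estimate.

```python
def approx_token_count(text):
    """Rough token estimate: about one token per four characters."""
    if not text:
        return 0
    return max(1, len(text) // 4)


def count_tokens(text, encoding_name="cl100k_base"):
    """Use tiktoken when it is installed, otherwise fall back to chars/4."""
    try:
        import tiktoken  # optional dependency, installed via the [tokens] extra
    except ImportError:
        return approx_token_count(text)
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))


print(approx_token_count("a" * 400))
```

Because the estimate only feeds the `max_tokens` ceiling, an error of a few tokens shifts where an oversized chunk is cut but leaves information-triggered boundaries unchanged.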

## License

MIT
