Metadata-Version: 2.4
Name: chunktuner
Version: 0.1.2
Summary: Auto chunking tuner and MCP server for RAG pipelines
Project-URL: Homepage, https://github.com/shantanu-deshmukh/chunktuner
Project-URL: Documentation, https://shantanu-deshmukh.github.io/chunktuner/
Project-URL: Repository, https://github.com/shantanu-deshmukh/chunktuner
Project-URL: Issues, https://github.com/shantanu-deshmukh/chunktuner/issues
Project-URL: Changelog, https://github.com/shantanu-deshmukh/chunktuner/blob/main/CHANGELOG.md
License: MIT
License-File: LICENSE
Keywords: chunking,embeddings,llm,mcp,nlp,rag,retrieval
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: fastapi<1,>=0.111
Requires-Dist: httpx<1,>=0.28.1
Requires-Dist: litellm<2,>=1.40
Requires-Dist: numpy<3,>=1.26
Requires-Dist: pydantic<3,>=2.0
Requires-Dist: pyyaml<7,>=6.0.3
Requires-Dist: rich<14,>=13.0
Requires-Dist: tenacity<9,>=8.0
Requires-Dist: tiktoken<1,>=0.7
Requires-Dist: typer<1,>=0.12
Requires-Dist: uvicorn<1,>=0.29
Provides-Extra: all
Requires-Dist: docling>=2.0; extra == 'all'
Requires-Dist: mcp[cli]>=1.7; extra == 'all'
Requires-Dist: ragas<1,>=0.1.0; extra == 'all'
Requires-Dist: semchunk>=1.0; extra == 'all'
Requires-Dist: tree-sitter-javascript; extra == 'all'
Requires-Dist: tree-sitter-python; extra == 'all'
Requires-Dist: tree-sitter>=0.22; extra == 'all'
Provides-Extra: code
Requires-Dist: tree-sitter-javascript; extra == 'code'
Requires-Dist: tree-sitter-python; extra == 'code'
Requires-Dist: tree-sitter>=0.22; extra == 'code'
Provides-Extra: docling
Requires-Dist: docling>=2.0; extra == 'docling'
Provides-Extra: docs
Requires-Dist: mkdocs-gen-files>=0.5; extra == 'docs'
Requires-Dist: mkdocs-literate-nav>=0.6; extra == 'docs'
Requires-Dist: mkdocs>=1.6; extra == 'docs'
Requires-Dist: mkdocstrings[python]>=0.25; extra == 'docs'
Requires-Dist: pymdown-extensions>=10.0; extra == 'docs'
Provides-Extra: mcp
Requires-Dist: mcp[cli]>=1.7; extra == 'mcp'
Provides-Extra: ragas
Requires-Dist: ragas<1,>=0.1.0; extra == 'ragas'
Provides-Extra: semantic
Requires-Dist: semchunk>=1.0; extra == 'semantic'
Description-Content-Type: text/markdown

# chunktuner

[![PyPI version](https://img.shields.io/pypi/v/chunktuner.svg)](https://pypi.org/project/chunktuner/)
[![Python versions](https://img.shields.io/pypi/pyversions/chunktuner.svg)](https://pypi.org/project/chunktuner/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
[![CI](https://github.com/shantanu-deshmukh/chunktuner/actions/workflows/ci.yml/badge.svg)](https://github.com/shantanu-deshmukh/chunktuner/actions/workflows/ci.yml)
[![Docs](https://img.shields.io/badge/docs-GitHub%20Pages-blue)](https://shantanu-deshmukh.github.io/chunktuner/)

Auto chunking tuner and MCP server for RAG pipelines.

Give it your documents. It tries multiple chunking strategies, measures which one lets an AI answer questions most accurately, and tells you the winner.

![chunktuner project flow: documents through strategies, evaluation, to a recommended configuration](https://raw.githubusercontent.com/shantanu-deshmukh/chunktuner/main/docs/assets/project-flow.svg)

---

## What it does

When building a RAG pipeline, how you split documents into chunks directly affects retrieval quality. `chunktuner` automates the search for the best-scoring chunking strategy for your specific corpus, embedding model, and use case.

It benchmarks strategies such as fixed-token windows, recursive character splitting, semantic splitting, PDF structural chunking, and AST-based code chunking. It then scores each one against retrieval metrics (token recall, mean reciprocal rank (MRR), and normalized discounted cumulative gain (NDCG)) and optional generation metrics (RAGAS faithfulness and answer relevancy).
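
To make the ranking metrics concrete, here is a minimal sketch of MRR and NDCG using their textbook definitions. It is not chunktuner's implementation: the function names and the relevance-list input format are assumptions made for illustration, and the NDCG here normalizes against the best ordering of the retrieved list only.

```python
import math
from typing import Sequence


def reciprocal_rank(relevance: Sequence[float]) -> float:
    """Return 1/rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, rel in enumerate(relevance, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Sequence[float]]) -> float:
    """Average the reciprocal rank over all queries."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(run) for run in runs) / len(runs)


def dcg(relevance: Sequence[float]) -> float:
    """Discounted cumulative gain with a log2 position discount."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevance, start=1))


def ndcg(relevance: Sequence[float]) -> float:
    """DCG divided by the DCG of the same results sorted best-first."""
    ideal = dcg(sorted(relevance, reverse=True))
    return dcg(relevance) / ideal if ideal > 0 else 0.0


# Hypothetical data: one list per query, one entry per retrieved chunk in
# rank order, 1 if the chunk was relevant and 0 otherwise.
runs = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
mrr = mean_reciprocal_rank(runs)  # (1/2 + 1 + 0) / 3
score = ndcg([0, 1, 1])           # below 1.0 because a miss is ranked first
```

Both metrics reward a chunking strategy for placing relevant chunks near the top of the retrieved list. MRR looks only at the first hit, while NDCG credits every relevant chunk with a position discount.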

---

## Interfaces

- **Python library** — programmatic integration into your pipeline
- **CLI** (`chunk-tune`) — human-driven tuning from the terminal
- **MCP server** — use directly from Claude Desktop or any MCP host

---

## Quickstart

```bash
# Install
uv tool install chunktuner

# Initialize workspace
chunk-tune init --provider openai

# See cost estimate before running anything
chunk-tune estimate ./my_docs --use-case rag_qa

# Get a recommendation
chunk-tune recommend ./my_docs --use-case rag_qa
```

**Python API:**

```python
from pathlib import Path
from chunktuner import FileIngestor, LiteLLMEmbeddingFunction, AutoTuner
from chunktuner import default_registry, Evaluator, ScoreCalculator

# Ingest the documents under ./my_docs
docs = FileIngestor().ingest_dir(Path("./my_docs"))
# Embeddings are requested through LiteLLM
embedding_fn = LiteLLMEmbeddingFunction("text-embedding-3-small")
# Wire the strategy registry, evaluator, and use-case scorer into the tuner
tuner = AutoTuner(
    strategies=default_registry,
    evaluator=Evaluator(embedding_fn),
    scorer=ScoreCalculator(use_case="rag_qa"),
)
# Evaluate the strategies and print the best configuration
result = tuner.recommend(docs, use_case="rag_qa")
print(result.best.config)

---

## Supported strategies

| Strategy              | Best for                              |
| --------------------- | ------------------------------------- |
| `fixed_tokens`        | Baseline; uniform token windows       |
| `recursive_character` | General prose and documentation       |
| `semantic`            | Theme-heavy articles                  |
| `markdown_semantic`   | Structured Markdown docs              |
| `pdf_structural`      | PDFs with layout regions and tables   |
| `structural_semantic` | PDF/DOCX with mixed layout and text   |
| `late_chunking`       | Long docs with dense cross-references |
| `agentic`             | High-value narrative documents        |
| `code_ast`            | Code repos (Python, JavaScript)       |
| `code_window`         | Code baseline (sliding window)        |
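
As a rough illustration of what a fixed-window baseline such as `fixed_tokens` does, the sketch below splits text into overlapping windows. It is an assumption-laden stand-in, not the library's code: it counts whitespace-separated words rather than tokenizer tokens, and the function name and the `window` and `overlap` parameters are hypothetical.

```python
from typing import List


def fixed_window_chunks(text: str, window: int = 8, overlap: int = 2) -> List[str]:
    """Split text into windows of `window` words that share `overlap` words."""
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    tokens = text.split()  # assumption: whitespace words stand in for tokens
    step = window - overlap
    chunks: List[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


chunks = fixed_window_chunks("a b c d e f g h i j", window=4, overlap=1)
# Consecutive chunks share one word: "d" and then "g".
```

Window size and overlap are the kind of parameters a tuner can sweep, since larger windows keep more context per chunk and smaller ones make retrieval more precise.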

---

## MCP server (Claude Desktop)

The server is built on Python **FastMCP** and runs over stdio through the `chunk-tune-mcp` command, so there is no Node.js build step. See `docs/mcp_setup.md` for setup details.

Add to your `.mcp.json`:

```json
{
  "mcpServers": {
    "chunktuner": {
      "command": "uvx",
      "args": ["--from", "chunktuner[mcp]", "chunk-tune-mcp"],
      "env": {
        "CHUNK_TUNER_BASE_DIR": "/path/to/your/corpus"
      }
    }
  }
}
```

Tools available: `list_strategies`, `preview_chunks`, `evaluate_chunking`, `recommend_config`.
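
If you already have a `.mcp.json` with other servers, the entry above can be merged in from a script instead of edited by hand. The sketch below uses only the standard library. The helper name `add_chunktuner_server` and the `other-server` entry are hypothetical, and the demo writes to a temporary directory rather than a real config file.

```python
import json
import tempfile
from pathlib import Path


def add_chunktuner_server(config_path: Path, corpus_dir: str) -> dict:
    """Merge the chunktuner entry into an MCP config file, keeping other servers."""
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    servers = config.setdefault("mcpServers", {})
    servers["chunktuner"] = {
        "command": "uvx",
        "args": ["--from", "chunktuner[mcp]", "chunk-tune-mcp"],
        "env": {"CHUNK_TUNER_BASE_DIR": corpus_dir},
    }
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config


# Demo against a throwaway file that already lists a (hypothetical) server.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / ".mcp.json"
    path.write_text(json.dumps({"mcpServers": {"other": {"command": "other-server"}}}))
    merged = add_chunktuner_server(path, "/path/to/your/corpus")
    on_disk = json.loads(path.read_text())
```

Pass the path of your real config file and your corpus directory to apply the same merge outside the demo.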

---

## CLI reference

```
chunk-tune init       Bootstrap workspace config
chunk-tune analyze    Quick structural scan (no API cost)
chunk-tune estimate   Dry-run cost/token estimate
chunk-tune evaluate   Full evaluation across strategies
chunk-tune recommend  Evaluation + best config recommendation
chunk-tune compare    Side-by-side comparison of specific strategies
chunk-tune preview    Inspect how a strategy splits a document
chunk-tune cache      Manage embedding and chunk cache
```

---

## Installation options

```bash
uv add chunktuner                    # library
uv tool install chunktuner           # global CLI
uvx --from chunktuner chunk-tune …   # ephemeral CLI (no install)

# With optional extras
uv add "chunktuner[docling]"         # PDF/DOCX support
uv add "chunktuner[ragas]"           # generation metrics
uv add "chunktuner[semantic]"        # semantic chunking
uv add "chunktuner[code]"            # AST code chunking
uv add "chunktuner[all]"             # everything
```

---

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md).

---

## 👨🏻‍💻 Author

[Shantanu Deshmukh](https://shantanudeshmukh.com)

Full-stack developer with experience building end-to-end AI applications.

[LinkedIn](https://www.linkedin.com/in/shantanud/) / [Twitter](https://twitter.com/askshantanu) / [AngelList](https://angel.co/u/dshantanu)
