Metadata-Version: 2.4
Name: omnichunk
Version: 0.9.0
Summary: Structure-aware deterministic chunking for code, prose, and markup.
Project-URL: Homepage, https://github.com/oguzhankir/omnichunk
Project-URL: Documentation, https://github.com/oguzhankir/omnichunk#readme
Project-URL: Issues, https://github.com/oguzhankir/omnichunk/issues
Project-URL: Source, https://github.com/oguzhankir/omnichunk
Project-URL: Changelog, https://github.com/oguzhankir/omnichunk/blob/main/CHANGELOG.md
License: MIT
License-File: LICENSE
Keywords: chunking,code-analysis,embedding,llm,rag,semantic-search,tree-sitter
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: numpy>=1.24.0
Requires-Dist: tree-sitter-go>=0.23.0
Requires-Dist: tree-sitter-java>=0.23.0
Requires-Dist: tree-sitter-javascript>=0.23.0
Requires-Dist: tree-sitter-python>=0.23.0
Requires-Dist: tree-sitter-rust>=0.23.0
Requires-Dist: tree-sitter-typescript>=0.23.0
Requires-Dist: tree-sitter>=0.23.0
Provides-Extra: all-languages
Requires-Dist: tree-sitter-c-sharp>=0.23.0; extra == 'all-languages'
Requires-Dist: tree-sitter-c>=0.23.0; extra == 'all-languages'
Requires-Dist: tree-sitter-cpp>=0.23.0; extra == 'all-languages'
Requires-Dist: tree-sitter-kotlin>=0.23.0; extra == 'all-languages'
Requires-Dist: tree-sitter-php>=0.23.0; extra == 'all-languages'
Requires-Dist: tree-sitter-ruby>=0.23.0; extra == 'all-languages'
Requires-Dist: tree-sitter-swift>=0.0.1; extra == 'all-languages'
Provides-Extra: dev
Requires-Dist: maturin>=1.7.0; extra == 'dev'
Requires-Dist: mypy>=1.0.0; extra == 'dev'
Requires-Dist: nltk>=3.8.1; extra == 'dev'
Requires-Dist: pre-commit>=3.7.0; extra == 'dev'
Requires-Dist: pypdf>=3.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Requires-Dist: python-docx>=1.0; extra == 'dev'
Requires-Dist: ruff>=0.4.0; extra == 'dev'
Requires-Dist: semchunk>=2.0.0; extra == 'dev'
Requires-Dist: tiktoken>=0.5.0; extra == 'dev'
Requires-Dist: tree-sitter-c-sharp>=0.23.0; extra == 'dev'
Requires-Dist: tree-sitter-c>=0.23.0; extra == 'dev'
Requires-Dist: tree-sitter-cpp>=0.23.0; extra == 'dev'
Requires-Dist: tree-sitter-kotlin>=0.23.0; extra == 'dev'
Requires-Dist: tree-sitter-php>=0.23.0; extra == 'dev'
Requires-Dist: tree-sitter-ruby>=0.23.0; extra == 'dev'
Requires-Dist: tree-sitter-swift>=0.0.1; extra == 'dev'
Provides-Extra: docx
Requires-Dist: python-docx>=1.0; extra == 'docx'
Provides-Extra: formats
Requires-Dist: pypdf>=3.0; extra == 'formats'
Requires-Dist: python-docx>=1.0; extra == 'formats'
Provides-Extra: graph
Provides-Extra: langchain
Requires-Dist: langchain-core>=0.2.0; extra == 'langchain'
Provides-Extra: llamaindex
Requires-Dist: llama-index-core>=0.10.0; extra == 'llamaindex'
Provides-Extra: pdf
Requires-Dist: pypdf>=3.0; extra == 'pdf'
Provides-Extra: pinecone
Provides-Extra: profiling
Requires-Dist: line-profiler>=4.1.0; extra == 'profiling'
Requires-Dist: py-spy>=0.3.14; extra == 'profiling'
Provides-Extra: rust
Requires-Dist: maturin>=1.7.0; extra == 'rust'
Provides-Extra: semantic
Provides-Extra: supabase
Provides-Extra: tiktoken
Requires-Dist: tiktoken>=0.5.0; extra == 'tiktoken'
Provides-Extra: transformers
Requires-Dist: transformers>=4.30.0; extra == 'transformers'
Provides-Extra: vectordb
Provides-Extra: weaviate
Description-Content-Type: text/markdown

<div align="center">
  <img src="https://raw.githubusercontent.com/oguzhankir/omnichunk/main/assets/omnichunk-logo.png" alt="omnichunk" width="360">
  <br><br>
  <a href="https://pypi.org/project/omnichunk/"><img src="https://img.shields.io/pypi/v/omnichunk?v=3" alt="PyPI"></a>
  <a href="https://github.com/oguzhankir/omnichunk/actions/workflows/ci.yml"><img src="https://github.com/oguzhankir/omnichunk/actions/workflows/ci.yml/badge.svg?v=3" alt="CI"></a>
  <a href="https://pypi.org/project/omnichunk/"><img src="https://img.shields.io/pypi/pyversions/omnichunk?v=3" alt="Python"></a>
  <a href="https://github.com/oguzhankir/omnichunk/blob/main/LICENSE"><img src="https://img.shields.io/pypi/l/omnichunk?v=3" alt="License"></a>
</div>

Chunk code, prose, and markup files with structure awareness.

`omnichunk` is a Python library that splits files into smaller pieces while keeping useful context:

- **Code**: respects function/class boundaries, includes scope and import information
- **Markdown**: respects headings and sections
- **JSON/YAML/TOML**: splits by top-level keys/sections
- **HTML/XML**: splits by elements
- **Mixed files**: handles notebooks and Python files with long docstrings

Each chunk includes:
- The original text slice
- Byte and line ranges for lossless reconstruction
- Context (scope, entities, headings, imports, siblings)
- Optional `contextualized_text` for embeddings

The library is deterministic and works without external APIs.

## Installation

```bash
pip install omnichunk
```

Optional extras:

```bash
pip install omnichunk[tiktoken]        # tiktoken tokenizer support
pip install omnichunk[transformers]    # HuggingFace tokenizer support
pip install omnichunk[all-languages]   # Extended language grammars
pip install omnichunk[langchain]       # LangChain Document export support
pip install omnichunk[llamaindex]      # LlamaIndex Document export support
pip install omnichunk[profiling]       # py-spy / line-profiler helpers
pip install omnichunk[rust]            # maturin tooling for Rust backend PoC
pip install omnichunk[dev]             # Development tools
pip install omnichunk[pinecone]        # Vector DB adapter extra (no client lib)
pip install omnichunk[weaviate]        # Vector DB adapter extra (no client lib)
pip install omnichunk[supabase]        # Vector DB adapter extra (no client lib)
pip install omnichunk[vectordb]        # Meta-group for all vector export extras (empty deps)
pip install omnichunk[semantic]        # Marker extra (semantic stack uses core numpy only)
pip install omnichunk[graph]           # Marker extra (GraphRAG uses existing chunk entities)
pip install omnichunk[pdf]             # PDF text extraction (pypdf)
pip install omnichunk[docx]            # Word documents (python-docx)
pip install omnichunk[formats]         # pdf + docx
```

**v0.9** adds multi-format chunking (`.ipynb`, `.tex`, optional `.pdf` / `.docx`), near-duplicate removal (`dedup_chunks`), and offline evaluation (`evaluate_chunks`). Jupyter and LaTeX need no extra packages; PDF and DOCX use the extras above. `Chunker.chunk_file()` picks a loader by extension. Evaluate saved JSONL with:

```bash
omnichunk eval ./chunks.jsonl --metrics all --source ./original.txt
```
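
The same workflow from Python, as a minimal sketch. The names `Chunker.chunk_file`, `dedup_chunks`, and `evaluate_chunks` come from this release; the top-level import path and the keyword arguments to `evaluate_chunks` are assumptions mirroring the CLI flags above, so check the API reference before relying on them:

```python
from omnichunk import Chunker, dedup_chunks, evaluate_chunks  # import path assumed

chunker = Chunker(max_chunk_size=512, size_unit="chars")

# chunk_file() selects a loader by extension (.ipynb, .tex, .pdf, .docx, ...)
chunks = chunker.chunk_file("notebook.ipynb")

# Remove near-duplicate chunks before embedding
chunks = dedup_chunks(chunks)

# Offline evaluation against the original source; keyword names here are
# assumptions based on `omnichunk eval --metrics all --source ...`
report = evaluate_chunks(chunks, metrics="all", source=original_text)
```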

## Examples

Runnable scripts and Jupyter notebooks live under [`examples/`](examples/). They cover chunking, hierarchical trees, incremental diffs, token budgets, semantic boundaries, GraphRAG, vector export shapes, and the plugin API. See [`examples/README.md`](examples/README.md) for how to run them.

## CLI

```bash
omnichunk ./src --glob "**/*.py" --max-size 512 --size-unit chars --format jsonl > chunks.jsonl
omnichunk app.py --max-size 256 --size-unit chars --stats
omnichunk app.py --max-size 256 --size-unit chars --nws-backend python
omnichunk README.md --format csv --output chunks.csv
```

## Quick start

### One-shot API

```python
from omnichunk import chunk

code = """
import os

def hello(name: str) -> str:
    return f"hello {name}"
"""

chunks = chunk("example.py", code, max_chunk_size=128, size_unit="chars")

for c in chunks:
    print(c.index, c.byte_range, c.context.breadcrumb)
    print(c.contextualized_text)
```

### Reusable `Chunker`

```python
from omnichunk import Chunker

chunker = Chunker(
    max_chunk_size=1024,
    min_chunk_size=80,
    tokenizer="cl100k_base",
    context_mode="full",
    overlap=0.1,
    overlap_lines=1,
)

chunks = chunker.chunk("api.py", source_code)

for c in chunker.stream("large.py", large_source):
    consume(c)
```

### Async API

```python
import asyncio
from omnichunk import Chunker

chunker = Chunker(max_chunk_size=1024, size_unit="tokens")

# Single file async
chunks = asyncio.run(chunker.achunk("api.py", source_code))

# Async streaming
async def process():
    async for chunk in chunker.astream("large.py", large_source):
        consume(chunk)

asyncio.run(process())

# Async batch (concurrent)
results = asyncio.run(chunker.abatch(
    [
        {"filepath": "a.py", "code": code_a},
        {"filepath": "b.ts", "code": code_b},
    ],
    concurrency=8,
))
```

```python
batch_results = chunker.batch(
    [
        {"filepath": "a.py", "code": code_a},
        {"filepath": "b.ts", "code": code_b},
        {"filepath": "README.md", "code": readme_md},
    ],
    concurrency=8,
)

directory_results = chunker.chunk_directory(
    "./src",
    glob="**/*.py",
    exclude=["**/tests/**"],
    concurrency=8,
)

all_chunks = [chunk for result in directory_results for chunk in result.chunks]

jsonl_payload = chunker.to_jsonl(all_chunks)
csv_payload = chunker.to_csv(all_chunks)

stats = chunker.chunk_stats(all_chunks, size_unit="chars")
quality = chunker.quality_scores(
    all_chunks,
    min_chunk_size=80,
    max_chunk_size=1024,
    size_unit="chars",
)

langchain_docs = chunker.to_langchain_docs(all_chunks)
llamaindex_docs = chunker.to_llamaindex_docs(all_chunks)

# Vector DB–ready rows (you compute embeddings elsewhere)
from omnichunk import chunks_to_pinecone_vectors, chunks_to_supabase_rows

emb = [[0.1, 0.2, 0.3] for _ in all_chunks]  # same length as chunks
pinecone_batch = chunks_to_pinecone_vectors(all_chunks, emb, namespace="my_ns")
weaviate_batch = chunker.to_weaviate_objects(all_chunks, emb, class_name="Doc")
supabase_rows = chunks_to_supabase_rows(all_chunks, emb)
```

### Semantic chunking

Embedding boundaries are **user-supplied** (`semantic_embed_fn`). Omnichunk never calls an external API.

```python
import numpy as np
from omnichunk import Chunker

def embed(texts):
    # Replace with your actual embedding model
    return np.random.default_rng(0).standard_normal((len(texts), 384))

chunker = Chunker(max_chunk_size=512, size_unit="tokens")
essay = "Your prose here…"
chunks = chunker.semantic_chunk("essay.md", essay, embed_fn=embed)
```

For code and other non-prose content types, structural engines are used even if `semantic=True`.

### Topic shift detection

```python
from omnichunk.semantic import detect_topic_shifts, split_sentences

text = "Your document…"
sentences_with_offsets = split_sentences(text)
sentences = [s for s, _, _ in sentences_with_offsets]
shifts = detect_topic_shifts(sentences, window=5, threshold=0.4)
```

### GraphRAG: entity-chunk graph

```python
from omnichunk import Chunker, build_chunk_graph

source = "class MyClass:\n    pass\n"
chunks = Chunker().chunk("repo.py", source)
graph = build_chunk_graph(chunks)
print(graph.entity_chunks("MyClass"))       # chunk indices containing MyClass
print(graph.chunk_neighbors(0))             # chunks sharing entities with chunk 0
data = graph.to_dict()                      # JSON-serializable
```

### Hierarchical chunking (multi-level RAG)

```python
from omnichunk import Chunker

chunker = Chunker(size_unit="chars")
source = "..."  # your file contents
tree = chunker.hierarchical_chunk(
    "service.py", source,
    levels=[64, 256, 1024],   # leaf → root
)

small_chunks = tree.leaves()   # embed and index these
large_chunks = tree.roots()    # pass these to LLM as context
parent = tree.parent(small_chunks[0])  # navigate up
```

### Incremental / differential chunking

```python
from omnichunk import Chunker

chunker = Chunker(max_chunk_size=512, size_unit="chars")
new_source = "..."  # updated file contents
diff = chunker.chunk_diff(
    "api.py",
    new_source,
    previous_chunks=old_chunks,
)
# diff.added        → upsert to vector DB
# diff.removed_ids  → delete from vector DB
# diff.unchanged    → skip re-embedding
```

### Token budget optimizer

```python
from omnichunk.budget import TokenBudgetOptimizer

optimizer = TokenBudgetOptimizer(budget=4096, strategy="greedy")
result = optimizer.select(retrieved_chunks, scores=relevance_scores)
# result.selected → pass to LLM
```

## Vector database export (serialization)

Adapters produce plain dicts and lists; **no** Pinecone, Weaviate, or Supabase client is installed by these extras. You compute embeddings yourself and pass parallel lists (see the sketch after this list):

- `chunks_to_pinecone_vectors` / `Chunker.to_pinecone_vectors` — `id`, `values`, `metadata` (+ optional `namespace` per row)
- `chunks_to_weaviate_objects` / `Chunker.to_weaviate_objects` — `class`, `vector`, `properties`
- `chunks_to_supabase_rows` / `Chunker.to_supabase_rows` — `content`, `embedding`, plus flat metadata columns
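
Because the outputs are plain Python objects, the shapes are easy to inspect before upserting. A small sketch using the field names from the list above; the module-level `chunks_to_weaviate_objects` is assumed to mirror the `Chunker.to_weaviate_objects` signature shown earlier:

```python
from omnichunk import (
    chunks_to_pinecone_vectors,
    chunks_to_supabase_rows,
    chunks_to_weaviate_objects,
)

emb = [[0.1, 0.2, 0.3] for _ in chunks]  # placeholder embeddings, parallel to chunks

vectors = chunks_to_pinecone_vectors(chunks, emb, namespace="my_ns")
print(vectors[0]["id"], vectors[0]["values"], vectors[0]["metadata"])

objects = chunks_to_weaviate_objects(chunks, emb, class_name="Doc")
print(objects[0]["class"], objects[0]["vector"], objects[0]["properties"])

rows = chunks_to_supabase_rows(chunks, emb)
print(rows[0]["content"][:40], rows[0]["embedding"])
```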

## Plugin API

Register custom parsers or formatters at import time (no edits to omnichunk core):

```python
from omnichunk import register_parser, register_formatter, Chunker

def my_parse(filepath: str, content: str):
    # Return a tree-sitter-like tree, or None to use the built-in parser.
    return None

register_parser("python", my_parse, overwrite=True)

def my_fmt(chunks):
    return str(len(chunks))

register_formatter("count", my_fmt)
```

### File API

```python
from omnichunk import chunk_file

chunks = chunk_file("path/to/file.py")
```

### Directory API

```python
from omnichunk import chunk_directory

results = chunk_directory("./src", glob="**/*.py", max_chunk_size=512, size_unit="chars")

for result in results:
    if result.error:
        print("error", result.filepath, result.error)
    else:
        print(result.filepath, len(result.chunks))
```

## Chunk model

Every `Chunk` includes raw content, exact offsets, and rich context (see the sketch after this list):

- `text`: exact source slice (lossless reconstruction)
- `contextualized_text`: embedding-ready representation
- `byte_range`, `line_range`
- `context`: scope, entities, siblings, imports, headings, section metadata
- `token_count`, `char_count`, `nws_count`
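
A quick sketch that prints these fields for a small file; the attribute names follow the list above and the earlier examples:

```python
from omnichunk import chunk

chunks = chunk("example.py", "def f():\n    return 1\n", max_chunk_size=64, size_unit="chars")

c = chunks[0]
print(repr(c.text))                          # exact source slice
print(c.byte_range.start, c.byte_range.end)  # offsets for lossless reconstruction
print(c.line_range)
print(c.token_count, c.char_count, c.nws_count)
print(c.context.breadcrumb)                  # scope path, as in the quick start
print(c.contextualized_text)                 # embedding-ready representation
```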

## Supported content

### Code

- Python
- JavaScript / TypeScript
- Rust
- Go
- Java
- C / C++ / C#
- Ruby / PHP / Kotlin / Swift (grammar-dependent)

### Prose

- Markdown
- Plaintext

Markdown fenced blocks are delegated by language (see the sketch after this list):

- fenced code (`python`, `ts`, etc.) routes to `CodeEngine`
- fenced markup (`json`, `yaml`, `toml`, `html`, `xml`) routes to `MarkupEngine`
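
A small sketch of the routing. The fence string is assembled at runtime only so this README's own code fences stay intact; the engine routing itself is automatic, so there is nothing extra to configure:

```python
from omnichunk import chunk

fence = "```"
doc = (
    "# Usage\n\n"
    "Prose goes to the prose engine.\n\n"
    f"{fence}python\n"
    "def f():\n"
    "    return 1\n"
    f"{fence}\n"
)

chunks = chunk("guide.md", doc, max_chunk_size=96, size_unit="chars")
for c in chunks:
    print(c.line_range, repr(c.text[:30]))
```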

### Markup

- JSON
- YAML
- TOML
- HTML / XML

### Hybrid

- Python with heavy docstrings
- Notebook-style `# %%` cell files (see the sketch below)
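
A sketch of chunking a `# %%` cell file; exact boundaries depend on the hybrid engine and your size settings:

```python
from omnichunk import chunk

cells = (
    "# %%\n"
    "import math\n"
    "\n"
    "# %%\n"
    "print(math.pi)\n"
)

chunks = chunk("analysis.py", cells, max_chunk_size=64, size_unit="chars")
for c in chunks:
    print(c.index, c.line_range)
```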

## Architecture

```
src/omnichunk/
├── chunker.py
├── cli.py
├── quality.py
├── serialization.py
├── types.py
├── engine/
│   ├── router.py
│   ├── code_engine.py
│   ├── prose_engine.py
│   ├── markup_engine.py
│   └── hybrid_engine.py
├── parser/
│   ├── tree_sitter.py
│   ├── markdown_parser.py
│   ├── html_parser.py
│   └── languages.py
├── context/
│   ├── entities.py
│   ├── scope.py
│   ├── siblings.py
│   ├── imports.py
│   └── format.py
├── sizing/
│   ├── nws.py
│   ├── tokenizers.py
│   └── counter.py
└── windowing/
    ├── greedy.py
    ├── merge.py
    ├── split.py
    └── overlap.py
```

## Determinism & integrity guarantees

`omnichunk` is built to preserve source fidelity:

- Chunk boundaries are deterministic
- Empty/whitespace-only chunks are dropped
- Chunks are contiguous and non-overlapping in source order
- Byte range integrity is validated in tests:

```python
original_bytes = source.encode("utf-8")
for chunk in chunks:
    assert original_bytes[chunk.byte_range.start:chunk.byte_range.end].decode("utf-8") == chunk.text
```
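
A complementary ordering check (a sketch, not part of the test suite): because empty and whitespace-only chunks are dropped, it asserts source order and non-overlap rather than strict byte adjacency.

```python
for prev, nxt in zip(chunks, chunks[1:]):
    assert prev.byte_range.end <= nxt.byte_range.start  # in order, no overlap
```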

## Testing

Run the test suite:

```bash
pytest -q
```

Run benchmark scenarios:

```bash
python benchmarks/run_benchmarks.py
python benchmarks/run_comparisons.py
python benchmarks/run_quality_report.py
python benchmarks/run_large_corpus.py --mode mega-python --repeat 120
python benchmarks/run_hotspot_profile.py --mode mega-python --repeat 120 --limit 30
```

Run repository checks:

```bash
python scripts/check_ai_rules_sync.py
python scripts/check_benchmarks.py
python scripts/check_benchmarks.py --run-quality
```

Current suite covers:

- API usage (`chunk`, `chunk_file`, `Chunker`)
- Code/prose/markup/hybrid behavior
- Context metadata (imports, siblings, scope, headings)
- Sizing/tokenization/NWS logic
- Overlap behavior
- Edge cases (empty input, unicode, malformed syntax, range contiguity)

## Contributing

Contribution and project process files:

- `CONTRIBUTING.md`
- `CODE_OF_CONDUCT.md`
- `SECURITY.md`
- `GOVERNANCE.md`
- `MAINTAINERS.md`
- `ROADMAP.md`
- `ARCHITECTURE.md`

Install dev tooling and run pre-commit hooks:

```bash
pip install -e .[dev]
pre-commit install
pre-commit run --all-files
```

## Notes

- Tree-sitter grammars are resolved dynamically and cached per language.
- If a parser is unavailable, the system degrades gracefully with fallback heuristics.
- `contextualized_text` is optimized for embedding quality while preserving raw `text` separately.