Metadata-Version: 2.4
Name: chunking-ai
Version: 0.0.1
Summary: A fast and light-weight library for ingesting and chunking files
Author-email: trducng <trungduc1992@gmail.com>, tadashi <taprosoft@gmail.com>
Project-URL: Homepage, https://github.com/chunking-ai/chunking
Project-URL: Repository, https://github.com/chunking-ai/chunking
Classifier: Programming Language :: Python :: 3
Classifier: Development Status :: 2 - Pre-Alpha
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: llm
Requires-Dist: click
Requires-Dist: numpy
Requires-Dist: pypandoc
Requires-Dist: tree-sitter
Requires-Dist: tqdm
Requires-Dist: img2table
Requires-Dist: pillow
Requires-Dist: rapid-layout
Requires-Dist: chunking-pdftext
Requires-Dist: opencv-python
Requires-Dist: rapidocr
Requires-Dist: python-pptx
Requires-Dist: pypdfium2
Requires-Dist: openpyxl
Requires-Dist: beautifulsoup4
Requires-Dist: tree-sitter-markdown
Requires-Dist: openai-whisper
Requires-Dist: pydub
Provides-Extra: dev
Requires-Dist: coverage; extra == "dev"
Requires-Dist: pytest; extra == "dev"
Requires-Dist: black; extra == "dev"
Requires-Dist: flake8; extra == "dev"
Requires-Dist: mypy; extra == "dev"
Requires-Dist: isort; extra == "dev"
Requires-Dist: pre-commit; extra == "dev"
Requires-Dist: notebook; extra == "dev"
Requires-Dist: ipython; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"
Provides-Extra: pdf-others
Requires-Dist: docling; extra == "pdf-others"
Requires-Dist: unstructured[pdf]; extra == "pdf-others"
Requires-Dist: sycamore; extra == "pdf-others"
Dynamic: license-file

## Installation

```bash
pip install ...
```

Install the following system dependencies for the chunking library:

- `pandoc`: for parsing markup language (e.g. epub, html, rtf, rst, docx...)
- `libreoffice`: for file conversion (e.g. doc, ppt, xls to docx, pptx, xlsx...)

## Features

Library responsibility:

Graph:
Files ----> Parse ----> Chunk ----> Index ----> Search ----> Represent retrieval context -----> LLM
We are responsible for:
- Parse, Chunk
- Represent retrieval context

**How `chunking` is different?** We chunk the document structure rather than the text. This is to take the document structure into account to create more sensible chunks.

- Maintain sectional layout structure during parsing and chunking (*).
- Supported formats (refer ... for suitable parsers for each format):
    - Text: html, md, txt, epub, latex, org, rtf, rst
    - Office documents: pdf, docx, pptx, xlsx
    - Images: jpg, png
    - Audio: wav, mp3
    - Video: mp4
    - Code: ipynb, (coming soon: py, js)
    - Data interchange: csv, json yaml, toml
- Content linking across files.
- Suppport LLM integration for correct content parsing.
- Task description for agent-oriented RAG strategies.
- Fast.
    - At least 100MB/s parsing
    - At least 100MB/s splitting
- Extensible.
    - Easy to add new strategy
    - Easy to change configure of the current strategy
- Traceable: trace from chunk to source.
- Developer-friendly
    - Evaluation
    - Benchmark
    - Config selector
    - Docker
    - CLI to chunk
- Complete.
    - All common file types
    - All chunking strategies

(*) Due to the complexity and variety of how structures can be represented in different file types, there can be errors. `chunking` treats this as best effort. Refer xxx for difficult cases. File an issue if you encounter a problem.

## Usage

### Add LLM support

By default, `chunking` uses the `llm` ([repo](https://github.com/simonw/llm)) with
alias `chunking-llm` to interact with LLM. Please setup the desired LLM
provider according to their docs, and set the alias `chunking-llm` to that
model. Example, using Gemini model (as of April 2025):

```bash
# Install the LLM gemini
$ llm install llm-gemini

# Set the Gemini API key
$ llm keys set gemini

# Alias LLM to 'chunking-llm' (you can see other model ids by running `llm models`)
$ llm aliases set chunking-llm gemini-2.5-flash-preview-04-17

# Check the LLM is working correctly
$ llm -m chunking-llm "Explain quantum mechanics in 100 words"
```

## Cookbook

TBD.

- Application in agent-oriented RAG strategies.

## Examples

TBD. Code snippet of prominent features:

- Drop-in replacement for file parsing.
- Show case of maintaining sectional layout structure. Show case the `Chunk` interface.
- Show case of a notable chunking strategy.
- Use as tool for agent.

## Contributing

Ensure that you have `git` and `git-lfs` installed. `git` will be used for version control and `git-lfs` will be used for test data.

```bash
# Clone the repository
git clone git@github.com:chunking-ai/chunking.git
cd chunking

# Fetch the test data
git submodule update --init --recursive

# Install development dependnecy
pip install -e ".[dev]"

# Initialize pre-commit hooks
pre-commit install
```

## License

Apache 2.0.
