Metadata-Version: 2.1
Name: delos-unichunking
Version: 0.6.0
Summary: Universal chunking functions to extract LLM-friendly chunks from any file type.
Author: AlexandreBertinDelos
Author-email: alexandrebertin@delosintelligence.fr
Requires-Python: >=3.9,<3.13
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: delos-llmax (>=0.10.3,<0.11.0)
Requires-Dist: loguru (>=0.7.2,<0.8.0)
Requires-Dist: nbformat (>=5.10.4,<6.0.0)
Requires-Dist: openpyxl (>=3.1.5,<4.0.0)
Requires-Dist: pydantic-settings (>=2.4.0,<3.0.0)
Requires-Dist: pymupdf (>=1.24.9,<2.0.0)
Requires-Dist: python-docx (>=1.1.2,<2.0.0)
Requires-Dist: python-pptx (>=1.0.2,<2.0.0)
Requires-Dist: unstructured (>=0.15.7,<0.16.0)
Requires-Dist: xlcalculator (>=0.5.0,<0.6.0)
Description-Content-Type: text/markdown

# Unichunking

Extract LLM-friendly chunks from any file type.

Supported file types are:

 - DOCX & DOCX-like (DOC, ODT)
 - PPTX & PPTX-like (PPT, ODP)
 - XLSX & XLSX-like (XLS, ODS)
 - TXT, MD, CSV
 - IPYNB

# Installation

To install, run the following command:

```bash
python3 -m pip install delos-unichunking
```

# How to use

The main functions are:

 - `extract_subchunks` returns a list of all the text particles in the file.
 - `split_chunks_with_overlap` transforms a list of subchunks on a given page into a list of chunks following default or specified parameters for minimum/maximum token size and overlap.
 - `build_chunked_pages` returns a list of "pages", each of which is a list of formatted chunks, following the structure of the document.
 - `compute_pages` approximates the pagination of a file that does not have a native pagination system (such as DOCX) by comparing it to a PDF version.
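
To make the idea of overlapping chunks concrete, here is a minimal standalone sketch. It does not call the package and does not reproduce its signatures: the function name `sketch_split_with_overlap`, its parameters, and the whitespace-based token count are all assumptions made for illustration, and the minimum-size handling mentioned above is left out.

```python
from typing import List


def sketch_split_with_overlap(
    subchunks: List[str], max_tokens: int = 8, overlap: int = 2
) -> List[str]:
    """Group tokens into windows of at most `max_tokens` tokens.

    Hypothetical helper, not the package's implementation. Tokens are
    approximated by whitespace-separated words. Each window repeats the
    last `overlap` tokens of the previous one.
    """
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    # Flatten all subchunks of the page into a single token stream.
    tokens = [tok for sub in subchunks for tok in sub.split()]

    step = max_tokens - overlap  # how far each new window advances
    chunks: List[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        # Stop once a window has reached the end of the token stream.
        if start + max_tokens >= len(tokens):
            break
    return chunks


example = sketch_split_with_overlap(["a b c", "d e f g"], max_tokens=4, overlap=1)
```

With these inputs the two resulting chunks share the token `d`, which is the kind of continuity an overlap parameter is meant to provide between neighbouring chunks.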

# Specificities

Please note that the package requires a LibreOffice installation, because it runs `soffice` commands during file conversions. For instance, DOC and ODT files are first converted to DOCX and then processed as DOCX.
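
Before processing DOC, ODT, or similar files, it can help to check that `soffice` is reachable. The sketch below uses only the standard library; the helper names are hypothetical, and the command it builds relies on the usual LibreOffice command-line flags (`--headless`, `--convert-to`, `--outdir`). Whether the package invokes `soffice` in exactly this way is an assumption.

```python
import shutil
from pathlib import Path
from typing import List, Optional


def find_soffice() -> Optional[str]:
    """Return the path of the `soffice` executable, or None if it is not on PATH."""
    return shutil.which("soffice")


def build_convert_command(source: Path, target_format: str, outdir: Path) -> List[str]:
    """Build (but do not run) a LibreOffice conversion command.

    Hypothetical helper for illustration: it only assembles the argument
    list, so it can be inspected or passed to subprocess.run by the caller.
    """
    return [
        "soffice",
        "--headless",
        "--convert-to",
        target_format,
        "--outdir",
        str(outdir),
        str(source),
    ]


soffice_path = find_soffice()
command = build_convert_command(Path("report.doc"), "docx", Path("converted"))
```

If `soffice_path` is `None`, LibreOffice is either not installed or not on the `PATH`, and conversions of DOC-like, PPT-like, and XLS-like files are likely to fail.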

The page numbers computed for DOCX files are an approximation and can be off by a few pages for large files.

Artificial page numbers are used for page-less structures such as TXT files, and to split large XLSX sheets into multiple pages so that each page stays within a tokens-per-page limit.
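
The following sketch shows one way artificial pages could be assigned under a tokens-per-page limit. It is an illustration only, not the package's algorithm: the name `paginate_by_tokens`, the `tokens_per_page` parameter, and the whitespace-based token count are assumptions.

```python
from typing import List


def paginate_by_tokens(lines: List[str], tokens_per_page: int = 500) -> List[List[str]]:
    """Group lines into artificial pages of at most `tokens_per_page` tokens.

    Hypothetical helper. Tokens are approximated by whitespace-separated
    words. A single line longer than the limit still gets its own page
    rather than being split.
    """
    pages: List[List[str]] = []
    current: List[str] = []
    count = 0
    for line in lines:
        n_tokens = len(line.split())
        # Start a new page when adding this line would exceed the limit.
        if current and count + n_tokens > tokens_per_page:
            pages.append(current)
            current, count = [], 0
        current.append(line)
        count += n_tokens
    if current:
        pages.append(current)
    return pages


pages = paginate_by_tokens(["a b c", "d e", "f g h i", "j"], tokens_per_page=5)
# Artificial page numbers start at 1.
numbered = list(enumerate(pages, start=1))
```

Here the first two lines fill page 1 exactly (five tokens), so the third line opens page 2.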

