Metadata-Version: 2.4
Name: openingestion
Version: 0.1.1
Summary: RAG ingestion pipeline — Chef → Chunker → Refinery → Porter
Author: openingestion contributors
License: MIT
Project-URL: Homepage, https://github.com/Isopope/openIngestion.git
Project-URL: Bug Tracker, https://github.com/Isopope/openIngestion/issues
Keywords: rag,ingestion,chunking,pdf,nlp,llm,mineru,docling,vector-store
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing
Classifier: Topic :: Text Processing :: Indexing
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: loguru
Requires-Dist: chonkie-core
Provides-Extra: mineru
Requires-Dist: mineru[pipeline]>=2.7.6; extra == "mineru"
Provides-Extra: docling
Requires-Dist: docling; extra == "docling"
Provides-Extra: semantic
Requires-Dist: sentence-transformers; extra == "semantic"
Requires-Dist: numpy; extra == "semantic"
Requires-Dist: scipy; extra == "semantic"
Provides-Extra: web
Requires-Dist: playwright; extra == "web"
Provides-Extra: sharepoint
Requires-Dist: office365-rest-python-client; extra == "sharepoint"
Requires-Dist: msal; extra == "sharepoint"
Provides-Extra: slumber
Requires-Dist: openai; extra == "slumber"
Requires-Dist: pydantic; extra == "slumber"
Requires-Dist: tenacity; extra == "slumber"
Requires-Dist: tqdm; extra == "slumber"
Provides-Extra: tiktoken
Requires-Dist: tiktoken; extra == "tiktoken"
Provides-Extra: hf-tokenizers
Requires-Dist: tokenizers; extra == "hf-tokenizers"
Provides-Extra: transformers
Requires-Dist: transformers; extra == "transformers"
Provides-Extra: langchain
Requires-Dist: langchain-core; extra == "langchain"
Provides-Extra: llamaindex
Requires-Dist: llama-index-core; extra == "llamaindex"
Provides-Extra: cpu
Requires-Dist: openingestion[docling]; extra == "cpu"
Requires-Dist: openingestion[semantic]; extra == "cpu"
Requires-Dist: openingestion[tiktoken]; extra == "cpu"
Provides-Extra: gpu
Requires-Dist: openingestion[semantic]; extra == "gpu"
Requires-Dist: openingestion[tiktoken]; extra == "gpu"
Provides-Extra: all
Requires-Dist: openingestion[docling]; extra == "all"
Requires-Dist: openingestion[semantic]; extra == "all"
Requires-Dist: openingestion[slumber]; extra == "all"
Requires-Dist: openingestion[tiktoken]; extra == "all"
Requires-Dist: openingestion[hf-tokenizers]; extra == "all"
Requires-Dist: openingestion[transformers]; extra == "all"
Requires-Dist: openingestion[langchain]; extra == "all"
Requires-Dist: openingestion[llamaindex]; extra == "all"
Requires-Dist: openingestion[web]; extra == "all"
Requires-Dist: openingestion[sharepoint]; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=8; extra == "dev"
Requires-Dist: pytest-mock; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Requires-Dist: mypy; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"
Dynamic: license-file

# openingestion

RAG ingestion pipeline built on top of MinerU / Docling.

```
Fetcher → Chef → Chunker → Refinery → Porter
```

## Installation

### 1. Clone and install in editable mode

```bash
git clone <repo-url>
cd openingestion
pip install -e .
```

> The editable install (`-e`) is **required** for `from openingestion import …`
> imports to resolve correctly from scripts and notebooks, because the repository
> root *is* the Python package.

### 1b. Windows / PowerShell setup

The project requires `Python >= 3.10`. On Windows, a simple setup looks like:

```powershell
py -3.12 -m venv .venv
. .\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
python -m pip install -e .
```

For a first CPU-only run (no GPU), add Docling:

```powershell
python -m pip install -e ".[docling]"
```

### 2. Optional extras

```bash
# MinerU parser (GPU recommended)
pip install -e ".[mineru]"

# Docling parser (CPU, no GPU required)
pip install -e ".[docling]"

# Semantic chunking (sentence-transformers + scipy)
pip install -e ".[semantic]"

# LLM-guided chunking: SlumberChunker + OpenAIGenie
pip install -e ".[slumber]"

# Exact OpenAI tokenizer (cl100k_base, o200k_base…)
pip install -e ".[tiktoken]"

# Fast HuggingFace tokenizer (Rust, BPE/WordPiece…)
pip install -e ".[hf-tokenizers]"

# HuggingFace AutoTokenizer (full transformers)
pip install -e ".[transformers]"

# Everything at once
pip install -e ".[mineru,docling,semantic,slumber,tiktoken]"
```
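To check which optional extras are actually importable in the current environment, a small stdlib-only probe can help. The extra-to-module mapping below is an assumption derived from the dependency list above; import names can differ from distribution names:

```python
from importlib.util import find_spec

# Assumed mapping from extra name to the top-level module its
# main dependency provides (derived from the dependency list above).
EXTRA_MODULES = {
    "docling": "docling",
    "semantic": "sentence_transformers",
    "slumber": "openai",
    "tiktoken": "tiktoken",
    "hf-tokenizers": "tokenizers",
    "transformers": "transformers",
}

def available_extras(mapping=EXTRA_MODULES):
    """Return the extras whose key module can be imported."""
    return {extra for extra, module in mapping.items()
            if find_spec(module) is not None}

print(sorted(available_extras()))
```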

## Quick usage

```python
from openingestion import ingest

# From a raw PDF (MinerU runs in the background)
chunks = ingest("rapport.pdf")

# From an existing MinerU output directory (no re-parsing)
chunks = ingest("./output/rapport/auto/")

# With Docling (CPU, no GPU)
chunks = ingest("rapport.pdf", parser="docling", strategy="by_token")

# LangChain format
docs = ingest("rapport.pdf", output_format="langchain")
```

## Architecture

| Stage | Class | Role |
|---|---|---|
| Chef | `MinerUChef`, `DoclingChef` | Parses the document → `ContentBlock[]` |
| Chunker | `TokenChunker`, `SentenceChunker`, `SemanticChunker`… | Groups blocks → `RagChunk[]` |
| Refinery | `RagRefinery`, `ContextualRagRefinery` | Enriches chunks (tokens, hash, images, LLM context) |
| Porter | `JSONPorter`, `to_langchain`, `to_llamaindex` | Exports to the target format |
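The dataflow between the stages can be sketched with stand-in types. The dataclasses, field names, and chunking logic below are hypothetical simplifications for illustration only; the real `ContentBlock`, `RagChunk`, and chunkers are richer:

```python
from dataclasses import dataclass, field

@dataclass
class ContentBlock:   # hypothetical stand-in for the Chef's output
    text: str

@dataclass
class RagChunk:       # hypothetical stand-in for the Chunker's output
    text: str
    metadata: dict = field(default_factory=dict)

def chunk(blocks, max_chars=100):
    """Toy Chunker: greedily group blocks until max_chars is reached."""
    chunks, buf = [], ""
    for block in blocks:
        if buf and len(buf) + len(block.text) > max_chars:
            chunks.append(RagChunk(text=buf))
            buf = ""
        buf = (buf + " " + block.text).strip()
    if buf:
        chunks.append(RagChunk(text=buf))
    return chunks

def refine(chunks):
    """Toy Refinery: enrich each chunk with a character count."""
    for c in chunks:
        c.metadata["n_chars"] = len(c.text)
    return chunks

blocks = [ContentBlock("First paragraph."), ContentBlock("Second paragraph.")]
chunks = refine(chunk(blocks, max_chars=20))
```

A Porter would then serialize `chunks` (e.g. to JSON) or convert them to LangChain / LlamaIndex document objects.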

See [specv3.md](specv3.md) for the detailed technical specifications.
