Metadata-Version: 2.4
Name: donkit-read-engine
Version: 0.5.3
Summary: Document loading helpers for Donkit RagOps
License: MIT
Author: Donkit AI
Author-email: opensource@donkit.ai
Requires-Python: >=3.12,<3.14
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Dist: aioboto3 (>=15.4.0,<16.0.0)
Requires-Dist: asyncio (>=3.4.3,<4.0.0)
Requires-Dist: docling (>=2.75.0,<3.0.0)
Requires-Dist: donkit-llm (>=0.1.4,<0.2.0)
Requires-Dist: json-repair (>=0.52.3,<0.53.0)
Requires-Dist: loguru (>=0.7.2,<0.8.0)
Requires-Dist: pandas (>=2.2.3,<3.0.0)
Requires-Dist: pillow (>=10.2.0,<11.0.0)
Requires-Dist: pymupdf (>=1.26.4,<2.0.0)
Requires-Dist: pypdf (>=6.1.3,<7.0.0)
Requires-Dist: python-docx (>=1.1.0,<2.0.0)
Requires-Dist: python-dotenv (>=1.1.0,<2.0.0)
Requires-Dist: python-pptx (>=1.0.2,<2.0.0)
Requires-Dist: unstructured[pdf] (>=0.18.15,<0.19.0)
Description-Content-Type: text/markdown

# donkit-read-engine

Document extraction library for [Donkit RagOps](https://donkit.ai). Reads 20+ file formats and produces structured, page-level JSON — with tables as Markdown, headings preserved, and optional LLM-powered image descriptions. Images extracted from documents can be saved locally or uploaded to S3 for use in downstream retrieval.

- **PyPI:** `pip install donkit-read-engine`
- **Python:** 3.12 – 3.13
- **License:** MIT

---

## Features

- Extracts text, tables, headings, captions, and code blocks from PDF, DOCX, PPTX, XLSX, HTML, Markdown, LaTeX, images, and more
- Three processing pipelines: Docling-only, Docling + LLM, or pure LLM vision
- Extracts and saves document images (PNG) with page association — ready for multimodal retrieval
- S3 support: uploads the JSON result and images to S3 with a single `s3_output_prefix` parameter
- Fully async (`aread_document`) with a sync wrapper (`read_document`)
- LLM token usage tracking per document

---

## Installation

```bash
pip install donkit-read-engine
```

---

## Quick Start

### Text only (no LLM)

```python
from donkit.read_engine.read_engine import DonkitReader

reader = DonkitReader(reading_pipeline="docling")
result = reader.read_document("report.pdf")

print(result.output_path)   # ./processed/report.json
print(result.page_count)    # 42
```

### With LLM image descriptions + save images locally

```python
from donkit.llm import ModelFactory
from donkit.read_engine.read_engine import DonkitReader

llm_model = ModelFactory.create_model("openai", "gpt-4o-mini", {"api_key": "sk-..."})

reader = DonkitReader(
    llm_model=llm_model,
    reading_pipeline="docling_llm",  # default
)

result = reader.read_document("report.pdf", output_dir="./output")
# ./output/report.json
# ./output/images_report/page0001_img0000.png
# ./output/images_report/page0003_img0001.png

print(result.images_dir)              # ./output/images_report
print(result.total_llm_requests)      # 5
print(result.total_prompt_tokens)     # 3200
```
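
The file names shown above suggest that each saved image encodes its page number and an image index (`page0001_img0000.png`). Assuming that pattern holds for every saved image, which this README only shows by example, a small helper can group the files in `images_dir` by page. The name `group_images_by_page` is hypothetical and not part of the library.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed pattern, inferred from the example names: page<digits>_img<digits>.png
IMAGE_NAME = re.compile(r"page(\d+)_img(\d+)\.png$")


def group_images_by_page(images_dir: str) -> dict[int, list[str]]:
    """Map each page number to the sorted image file names found for it."""
    pages: dict[int, list[str]] = defaultdict(list)
    for path in sorted(Path(images_dir).glob("*.png")):
        match = IMAGE_NAME.match(path.name)
        if match:  # files that do not follow the assumed pattern are skipped
            pages[int(match.group(1))].append(path.name)
    return dict(pages)
```

Calling `group_images_by_page(result.images_dir)` would then give a page-to-files mapping for a local run; it does not apply when `images_dir` is an S3 prefix.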

### Async

```python
# Inside a coroutine, or wherever top-level await is available (e.g. a notebook)
result = await reader.aread_document("report.pdf", output_dir="./output")
```
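
In a plain script, `await` needs a running event loop. The sketch below shows one way to read several files concurrently with `asyncio.gather`, using a semaphore to cap how many are in flight. To stay runnable without the library, `fake_aread_document` is a stand-in for `reader.aread_document`, and the limit of 4 is an arbitrary assumption.

```python
import asyncio


async def fake_aread_document(file_path: str) -> str:
    """Stand-in for reader.aread_document; returns a made-up output path."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return file_path.rsplit(".", 1)[0] + ".json"


async def read_many(paths: list[str], limit: int = 4) -> list[str]:
    """Read documents concurrently, with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def read_one(path: str) -> str:
        async with semaphore:
            return await fake_aread_document(path)

    # gather returns results in the same order as the input paths
    return await asyncio.gather(*(read_one(p) for p in paths))


results = asyncio.run(read_many(["a.pdf", "b.docx", "c.pptx"]))
print(results)
```

With the real reader, the stand-in call would be replaced by `reader.aread_document(path, output_dir=...)`.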

### With S3 output

Pass `s3_service` at construction and `s3_output_prefix` per call. The JSON result and all extracted images are uploaded; the local staging directory is cleaned up automatically.

```python
from donkit.read_engine.utils.s3 import S3Credentials, S3Service
from donkit.read_engine.read_engine import DonkitReader

s3 = S3Service(S3Credentials(
    access_key_id="...",
    secret_access_key="...",
    region_name="us-east-1",
    endpoint_url="https://s3.amazonaws.com",
    bucket_name="my-bucket",
))

# llm_model is created as in the LLM example above
reader = DonkitReader(
    llm_model=llm_model,
    reading_pipeline="docling_llm",
    s3_service=s3,
)

# Call from inside a coroutine; top-level await is not valid in a plain script
result = await reader.aread_document(
    "report.pdf",
    s3_output_prefix="experiments/exp123/reading",
)
# result.output_path  = "experiments/exp123/reading/report.json"   (S3 key)
# result.images_dir   = "experiments/exp123/reading/images_report" (S3 prefix)
```

S3 layout:
```
experiments/exp123/reading/
  report.json
  images_report/
    page0001_img0000.png
    page0003_img0001.png
```

> **Note:** when S3 is used, `result.output_path` and `result.images_dir` contain S3 keys, not local paths.
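
Going by the layout above, the S3 keys can be predicted from the prefix and the source file name. The helper below is a sketch of that naming rule as inferred from the single example (`<prefix>/<stem>.json` and `<prefix>/images_<stem>`); `expected_s3_keys` is a hypothetical name, and the library may treat edge cases differently.

```python
from pathlib import PurePosixPath


def expected_s3_keys(file_path: str, s3_output_prefix: str) -> tuple[str, str]:
    """Return (JSON key, images prefix) following the layout shown above."""
    stem = PurePosixPath(file_path).stem  # "report.pdf" -> "report"
    prefix = s3_output_prefix.rstrip("/")  # tolerate a trailing slash
    return f"{prefix}/{stem}.json", f"{prefix}/images_{stem}"


json_key, images_prefix = expected_s3_keys("report.pdf", "experiments/exp123/reading")
print(json_key)
print(images_prefix)
```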

---

## Pipelines

| Pipeline | Description | Requires `llm_model` | Saves images |
|---|---|---|---|
| `docling_llm` | Docling parses the document; LLM describes each image (up to 15 concurrent calls). Images saved to disk / S3. | Yes | Yes |
| `docling` | Docling only; built-in VLM describes images. No images saved to disk. | No | No |
| `llm` | PDF pages are rasterized and sent whole to the LLM's vision input; non-PDF formats are processed via Docling. No images saved. | Yes | No |

> Image saving (`images_dir`) is only supported in the `docling_llm` pipeline.
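
The table can be turned into a small pre-flight check that runs before a reader is constructed. This is a sketch: the values are copied from the table, while `check_pipeline` is a hypothetical helper. Whether the library itself raises on a missing `llm_model` is not stated here.

```python
# Values copied from the pipelines table: (requires llm_model, saves images)
PIPELINES: dict[str, tuple[bool, bool]] = {
    "docling_llm": (True, True),
    "docling": (False, False),
    "llm": (True, False),
}


def check_pipeline(reading_pipeline: str, has_llm_model: bool) -> bool:
    """Validate a configuration; return True if images will be saved."""
    if reading_pipeline not in PIPELINES:
        raise ValueError(f"unknown pipeline: {reading_pipeline!r}")
    needs_llm, saves_images = PIPELINES[reading_pipeline]
    if needs_llm and not has_llm_model:
        raise ValueError(f"{reading_pipeline!r} requires llm_model")
    return saves_images
```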

---

## Supported Formats

### Via Docling (primary engine)

| Extension | Format |
|---|---|
| `.pdf` | PDF |
| `.docx` | Word |
| `.pptx` | PowerPoint |
| `.xlsx`, `.csv` | Excel / CSV |
| `.html` | HTML |
| `.md`, `.txt` | Markdown / plain text |
| `.tex` | LaTeX |
| `.asciidoc` | AsciiDoc |
| `.vtt` | WebVTT subtitles |
| `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.tiff` | Images |

### Legacy readers

| Extension | Format |
|---|---|
| `.json` | JSON documents |
| `.xls` | Legacy Excel (pre-2007) |

---

## Output Format

Every pipeline writes a single JSON file:

```json
{
  "content": [
    {
      "page": 1,
      "type": "Text",
      "content": "# Title\n\nParagraph text...\n\n| Col1 | Col2 |\n|---|---|\n| A | B |",
      "images": [
        "./output/images_report/page0001_img0000.png"
      ]
    },
    {
      "page": 2,
      "type": "Text",
      "content": "More text on page two."
    }
  ]
}
```

- **`page`** — 1-based page number from the source document
- **`content`** — Markdown-formatted text: headings, tables, lists, code blocks, captions
- **`images`** — list of saved image paths (local paths or S3 keys); only present when images were extracted

Page headers, footers, and decorative images (those the LLM answers `SKIP` for) are excluded from `content`; the image files themselves are still saved.
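
A downstream consumer only needs the standard `json` module to read this file. The sketch below parses an inline sample shaped like the output above, joins the page text, and collects image paths per page. The sample data and the helper name `pages_with_images` are illustrative, not part of the library.

```python
import json

# Inline sample shaped like the output above; a real file would use json.load
sample = """
{
  "content": [
    {"page": 1, "type": "Text", "content": "# Title",
     "images": ["img/page0001_img0000.png"]},
    {"page": 2, "type": "Text", "content": "More text on page two."}
  ]
}
"""


def pages_with_images(document: dict) -> dict[int, list[str]]:
    """Map page number to its image paths, skipping entries without images."""
    result: dict[int, list[str]] = {}
    for entry in document["content"]:
        images = entry.get("images", [])  # key is absent when nothing was extracted
        if images:
            result.setdefault(entry["page"], []).extend(images)
    return result


document = json.loads(sample)
full_text = "\n\n".join(entry["content"] for entry in document["content"])
print(pages_with_images(document))
```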

---

## API Reference

### `DonkitReader`

```python
class DonkitReader:
    def __init__(
        self,
        output_format: Literal["json", "text", "md"] = "json",
        progress_callback: Callable[[int, int, str | None], None] | None = None,
        llm_model: LLMModelAbstract | None = None,
        reading_pipeline: str = "docling_llm",
        s3_service: S3Service | None = None,
    ) -> None
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| `output_format` | `"json"` / `"text"` / `"md"` | `"json"` | Format hint passed to the image analysis service |
| `progress_callback` | `Callable` | `None` | `(current, total, message)` — called per page in `llm` pipeline |
| `llm_model` | `LLMModelAbstract` | `None` | LLM for image descriptions. Required for `docling_llm` and `llm` |
| `reading_pipeline` | `str` | `"docling_llm"` | `"docling_llm"`, `"docling"`, or `"llm"` |
| `s3_service` | `S3Service` | `None` | If set, results are uploaded to S3 when `s3_output_prefix` is provided |

```python
def read_document(
    self,
    file_path: str,
    output_dir: str | None = None,
    s3_output_prefix: str | None = None,
) -> ReadDocumentResult

async def aread_document(
    self,
    file_path: str,
    output_dir: str | None = None,
    s3_output_prefix: str | None = None,
) -> ReadDocumentResult
```

| Parameter | Description |
|---|---|
| `file_path` | Path to the source document |
| `output_dir` | Local directory for results. Defaults to `processed/` next to the source file. When S3 is used and `output_dir` is `None`, a temporary directory is created and cleaned up automatically |
| `s3_output_prefix` | S3 key prefix for all output artifacts (JSON + images). Requires `s3_service` set at construction |

### `ReadDocumentResult`

```python
@dataclass
class ReadDocumentResult:
    output_path: str                           # Local path to JSON, or S3 key when S3 is used
    page_count: int                            # Number of unique pages
    images_dir: str | None                     # Local images dir, or S3 prefix, or None
    total_llm_requests: int
    total_prompt_tokens: int
    total_completion_tokens: int
    page_split_duration_ms: int | None
    reading_duration_ms: int | None
    gc_duration_ms: int | None
    json_serialize_duration_ms: int | None
    rasterize_duration_ms: int | None
```

### `S3Service` / `S3Credentials`

```python
@dataclass
class S3Credentials:
    access_key_id: str
    secret_access_key: str
    region_name: str
    endpoint_url: str
    bucket_name: str

class S3Service:
    def __init__(self, credentials: S3Credentials) -> None
    async def download_file(self, s3_path: str, local_path: str) -> None
    async def upload_file(self, local_path: str, s3_path: str) -> None
    async def upload_content(self, s3_path: str, content: bytes) -> None
```

---

## CLI

```bash
# Single file
donkit-read-engine report.pdf

# Directory (recursive)
donkit-read-engine ./documents/

# With OCR settings (unstructured backend)
donkit-read-engine scan.pdf --pdf-strategy hi_res --ocr-lang rus+eng
```

| Argument | Values | Default |
|---|---|---|
| `file_path` | file or directory | — |
| `--output-type` | `text`, `json`, `markdown` | `json` |
| `--pdf-strategy` | `fast`, `hi_res`, `ocr_only`, `auto` | — |
| `--ocr-lang` | e.g. `rus+eng` | — |

The CLI uses the `docling` pipeline (no LLM).

---

## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `UNSTRUCTURED_STRATEGY` | `hi_res` | PDF OCR strategy for the unstructured backend |
| `UNSTRUCTURED_OCR_LANG` | `rus+eng` | OCR language codes |

LLM credentials are not read from the environment. Pass `llm_model` explicitly.

---

## Dependencies

| Package | Purpose |
|---|---|
| `docling` | Primary document conversion engine |
| `pymupdf` | PDF rasterization (`llm` pipeline) |
| `unstructured[pdf]` | PDF OCR with Tesseract |
| `python-docx`, `python-pptx` | Office format parsing |
| `pandas` | Excel / CSV processing |
| `pillow` | Image extraction and encoding |
| `donkit-llm` | LLM provider abstraction |
| `aioboto3` | Async S3 client |
| `json-repair` | Fix malformed JSON from LLM output |
| `loguru` | Structured logging |

