Metadata-Version: 2.4
Name: llama-index-readers-docdigitizer
Version: 0.1.0
Summary: LlamaIndex reader for the DocDigitizer document processing API
Project-URL: Homepage, https://github.com/DocDigitizer/dd-v3-integrations
Project-URL: Repository, https://github.com/DocDigitizer/dd-v3-integrations
Author-email: DocDigitizer <support@docdigitizer.com>
License-Expression: MIT
Keywords: docdigitizer,extraction,llama-index,llamaindex,ocr,pdf,reader
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: docdigitizer>=0.1.0
Requires-Dist: llama-index-core>=0.11.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: pyyaml>=6.0; extra == 'dev'
Description-Content-Type: text/markdown

# llama-index-readers-docdigitizer

LlamaIndex reader for the [DocDigitizer](https://docdigitizer.com) document processing API.

## Installation

```bash
pip install llama-index-readers-docdigitizer
```

## Usage

```python
from llama_index.readers.docdigitizer import DocDigitizerReader

# Load a single PDF
reader = DocDigitizerReader(api_key="dd_live_...")
documents = reader.load_data(file_path="invoice.pdf")

print(documents[0].text)          # JSON with extracted fields
print(documents[0].metadata)      # document_type, confidence, etc.

# Load all PDFs from a directory
documents = reader.load_data(file_path="invoices/")

# Use in a RAG pipeline
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("What is the invoice total?")
```

## Configuration

| Parameter | Environment Variable | Default |
|-----------|---------------------|---------|
| `api_key` | `DOCDIGITIZER_API_KEY` | — |
| `base_url` | `DOCDIGITIZER_BASE_URL` | `https://apix.docdigitizer.com/sync` |
| `timeout` | `DOCDIGITIZER_TIMEOUT` | `300` |
| `max_retries` | — | `3` |
| `pipeline` | — | `None` |
| `content_format` | — | `"json"` |
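
The table lists defaults and environment variables but does not say which source wins when both are present. The sketch below shows one plausible resolution order (explicit argument, then environment variable, then default). It is an assumption for illustration, not the reader's confirmed behavior, and `resolve_settings` is a hypothetical helper that is not part of the package.

```python
import os

# Defaults copied from the configuration table.
DEFAULT_BASE_URL = "https://apix.docdigitizer.com/sync"
DEFAULT_TIMEOUT = 300


def resolve_settings(api_key=None, base_url=None, timeout=None, env=None):
    """Hypothetical helper: argument first, then environment variable, then default.

    The precedence order is an assumption; the package may resolve settings differently.
    """
    env = os.environ if env is None else env

    resolved_key = api_key or env.get("DOCDIGITIZER_API_KEY")
    if not resolved_key:
        # The table lists no default for api_key, so treat it as required.
        raise ValueError("Pass api_key or set DOCDIGITIZER_API_KEY")

    resolved_url = base_url or env.get("DOCDIGITIZER_BASE_URL", DEFAULT_BASE_URL)

    if timeout is None:
        # Environment values are strings, so convert before use.
        timeout = int(env.get("DOCDIGITIZER_TIMEOUT", DEFAULT_TIMEOUT))

    return {"api_key": resolved_key, "base_url": resolved_url, "timeout": timeout}


# Placeholder key and a fake environment mapping, so the example does not
# depend on the real process environment.
settings = resolve_settings(
    env={"DOCDIGITIZER_API_KEY": "example-key", "DOCDIGITIZER_TIMEOUT": "60"}
)
print(settings)
```

Passing `env` explicitly keeps the helper easy to test. In real use the same values would come from `os.environ`.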

### Content Formats

- `"json"` (default): Document text is a JSON string of the extracted fields
- `"text"`: Document text is `key: value` pairs, one per line
- `"kv"`: Document text is `key=value` pairs, one per line
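
To make the difference between the three formats concrete, here is a minimal sketch that renders the same fields in each style. `render_fields` is a hypothetical function written for this illustration and the field names are invented. The reader's actual output may differ in details such as key order, nesting, or JSON indentation.

```python
import json


def render_fields(fields, content_format="json"):
    """Illustrative only: format a flat dict of fields in one of the three styles."""
    if content_format == "json":
        return json.dumps(fields)
    if content_format == "text":
        return "\n".join(f"{key}: {value}" for key, value in fields.items())
    if content_format == "kv":
        return "\n".join(f"{key}={value}" for key, value in fields.items())
    raise ValueError(f"Unknown content_format: {content_format!r}")


# Made-up fields; real field names depend on the document type.
fields = {"invoice_number": "INV-001", "total": 123.45}

for fmt in ("json", "text", "kv"):
    print(f"--- {fmt} ---")
    print(render_fields(fields, fmt))
```

The line-oriented formats tend to suit embedding and keyword search, while `"json"` is easier to parse back into structured data.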

## Document Metadata

Each returned LlamaIndex `Document` includes the following metadata:

| Field | Type | Description |
|-------|------|-------------|
| `source` | `str` | File path of the processed PDF |
| `document_type` | `str` | Detected document type (e.g., "Invoice") |
| `confidence` | `float` | Classification confidence (0 to 1) |
| `country_code` | `str` | Detected country code (e.g., "PT") |
| `pages` | `list[int]` | Page numbers where the document was found |
| `trace_id` | `str` | Unique trace identifier |
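
One way to use this metadata is to drop low-confidence or off-type documents before indexing. The sketch below assumes only that each document exposes a `metadata` dict with the fields listed above. It uses `SimpleNamespace` stand-ins instead of real `Document` objects so it runs without the library. `filter_documents`, the `"Receipt"` type, and all sample values are hypothetical.

```python
from types import SimpleNamespace


def filter_documents(documents, document_type=None, min_confidence=0.0):
    """Keep documents that match the type (if given) and meet the confidence threshold."""
    kept = []
    for doc in documents:
        meta = doc.metadata
        if document_type is not None and meta.get("document_type") != document_type:
            continue
        # Treat a missing confidence as 0.0 so unscored documents are filtered out
        # whenever a positive threshold is set.
        if meta.get("confidence", 0.0) < min_confidence:
            continue
        kept.append(doc)
    return kept


# Stand-ins for LlamaIndex Document objects; the metadata values are made up.
sample_documents = [
    SimpleNamespace(metadata={"source": "a.pdf", "document_type": "Invoice", "confidence": 0.97}),
    SimpleNamespace(metadata={"source": "b.pdf", "document_type": "Invoice", "confidence": 0.42}),
    SimpleNamespace(metadata={"source": "c.pdf", "document_type": "Receipt", "confidence": 0.91}),
]

confident_invoices = filter_documents(
    sample_documents, document_type="Invoice", min_confidence=0.8
)
print([doc.metadata["source"] for doc in confident_invoices])  # ['a.pdf']
```

The same function should work on the list returned by `reader.load_data(...)`, provided the documents carry the metadata fields in the table.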

## License

MIT
