Metadata-Version: 2.4
Name: strutex
Version: 1.3.7
Summary: Structured AI document processing with robust fallback strategies.
License: GPL-3.0-or-later
License-File: LICENSE
Keywords: pdf,ai,llm,extraction,json
Author: aquiles
Author-email: achillezongo07@gmail.com
Requires-Python: >=3.12,<4.0
Classifier: License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Provides-Extra: all
Provides-Extra: anthropic
Provides-Extra: cli
Provides-Extra: fallback
Provides-Extra: gliner
Provides-Extra: google
Provides-Extra: groq
Provides-Extra: guardrails
Provides-Extra: haystack
Provides-Extra: langchain
Provides-Extra: llamaindex
Provides-Extra: ocr
Provides-Extra: ollama
Provides-Extra: openai
Provides-Extra: rag
Provides-Extra: server
Requires-Dist: anthropic (>=0.75.0,<0.76.0) ; extra == "anthropic" or extra == "all"
Requires-Dist: click (>=8.1.0,<9.0.0) ; extra == "cli" or extra == "all"
Requires-Dist: fastapi (>=0.128.0,<0.129.0) ; extra == "server" or extra == "all"
Requires-Dist: fastembed (>=0.7.4,<0.8.0) ; extra == "rag" or extra == "all"
Requires-Dist: gliner (>=0.2.0,<0.3.0) ; extra == "gliner" or extra == "all"
Requires-Dist: google-genai (>=1.53.0,<2.0.0) ; extra == "google" or extra == "all"
Requires-Dist: groq (>=1.0.0,<2.0.0) ; extra == "groq" or extra == "all"
Requires-Dist: haystack-ai (>=2.0.0,<3.0.0) ; extra == "haystack" or extra == "all"
Requires-Dist: langchain (>=1.2.1,<2.0.0) ; extra == "rag" or extra == "all"
Requires-Dist: langchain-community (>=0.4.1,<0.5.0) ; extra == "rag" or extra == "all"
Requires-Dist: langchain-core (>=1.2.5,<2.0.0) ; extra == "langchain" or extra == "all"
Requires-Dist: langchain-google-genai (>=4.1.3,<5.0.0) ; extra == "rag" or extra == "all"
Requires-Dist: langchain-openai (>=1.1.6,<2.0.0) ; extra == "rag" or extra == "all"
Requires-Dist: langgraph (>=1.0.5,<2.0.0) ; extra == "rag" or extra == "all"
Requires-Dist: llama-index-core (>=0.14.0,<0.15.0) ; extra == "llamaindex" or extra == "all"
Requires-Dist: numpy (>=2.1.0,<3.0.0) ; extra == "guardrails" or extra == "rag" or extra == "all"
Requires-Dist: ollama (>=0.6.1,<0.7.0) ; extra == "ollama" or extra == "all"
Requires-Dist: openai (>=2.8.1,<3.0.0) ; extra == "openai" or extra == "all"
Requires-Dist: openpyxl (>=3.1.5,<4.0.0)
Requires-Dist: pandas (>=2.3.3,<3.0.0)
Requires-Dist: pdf2image (>=1.17.0,<2.0.0) ; extra == "ocr" or extra == "all"
Requires-Dist: pdfplumber (>=0.11.8,<0.12.0)
Requires-Dist: pluggy (>=1.5.0,<2.0.0)
Requires-Dist: pydantic (>=2.12.5,<3.0.0)
Requires-Dist: pypdf (>=6.4.0,<7.0.0)
Requires-Dist: pytesseract (>=0.3.10,<0.4.0) ; extra == "ocr" or extra == "all"
Requires-Dist: python-dotenv (>=1.2.1,<2.0.0)
Requires-Dist: python-multipart (>=0.0.21,<0.0.22) ; extra == "server" or extra == "all"
Requires-Dist: qdrant-client (>=1.16.2,<2.0.0) ; extra == "rag" or extra == "all"
Requires-Dist: sentence-transformers (>=3.3.0,<4.0.0) ; extra == "guardrails" or extra == "all"
Requires-Dist: unstructured (>=0.11.0,<0.12.0) ; extra == "fallback" or extra == "all"
Requires-Dist: uvicorn (>=0.40.0,<0.41.0) ; extra == "server" or extra == "all"
Project-URL: Homepage, https://github.com/Aquilesorei/strutex
Project-URL: Repository, https://github.com/Aquilesorei/strutex
Description-Content-Type: text/markdown

# strutex

> **Stru**ctured **T**ext **Ex**traction — Extract structured JSON from documents using LLMs

[![CI](https://github.com/Aquilesorei/strutex/actions/workflows/ci.yml/badge.svg)](https://github.com/Aquilesorei/strutex/actions/workflows/ci.yml)
[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)
[![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
[![PyPI](https://img.shields.io/pypi/v/strutex.svg)](https://pypi.org/project/strutex/)

```bash
pip install strutex
```

## The Simplest Example

```python
from strutex import extract
from strutex.schemas import INVOICE_US

invoice = extract("invoice.pdf", model=INVOICE_US)
print(invoice.invoice_number, invoice.total)
```

**That's it.** Two imports and two statements, with no custom schema to write.

> **Schemas are required** — but you have options:
>
> - **Built-in schemas** — 10+ ready-to-use (invoices, receipts, shipping docs, resumes)
> - **Native types** — `Object`, `String`, `Number`, `Array` (lightweight, no dependencies)
> - **Pydantic models** — Full type safety and validation

---

## What You Can Do

| Level             | Features                  | When to use                        |
| ----------------- | ------------------------- | ---------------------------------- |
| **Basic**         | `extract()`, schemas      | Most use cases — just extract data |
| **Reliability**   | `verify=True`, validators | Production — ensure accuracy       |
| **Scale**         | caching, async, batch     | High volume — reduce costs         |
| **Architecture**  | agentic, router, fallback | Complex reasoning & robustness     |
| **Extensibility** | plugins, hooks, CLI       | Advanced — extend anything         |

> **Most users only need Level 1.** The rest is there when you need it.

---

## Level 1: Basic Extraction

### With Pydantic (recommended)

```python
import strutex
from pydantic import BaseModel

class Receipt(BaseModel):
    store: str
    date: str
    total: float

receipt = strutex.extract("receipt.jpg", model=Receipt)
```

### With Native Schema

```python
from strutex import extract, Object, String, Number

schema = Object(properties={
    "invoice_number": String,
    "total": Number,
})

result = extract("invoice.pdf", schema=schema)
```

### With Built-in Schemas

```python
from strutex import extract
from strutex.schemas import INVOICE_US, BILL_OF_LADING

invoice = extract("invoice.pdf", model=INVOICE_US)
bol = extract("bl.pdf", model=BILL_OF_LADING)
```

Available: `INVOICE_GENERIC`, `INVOICE_US`, `INVOICE_EU`, `RECEIPT`, `PURCHASE_ORDER`, `BILL_OF_LADING`, `RESUME`, `BANK_STATEMENT`, etc.

---

## Level 2: Reliability Features

### Optional Double-Check

Ask the LLM to review its own output after extraction. This step is optional and is intended to improve accuracy:

```python
import strutex

# ContractSchema stands for your own Pydantic model (see Level 1)
result = strutex.extract(
    "contract.pdf",
    model=ContractSchema,
    verify=True  # LLM reviews its own output
)
```

### Choosing a Provider

Create a provider instance for full control over API keys and configuration:

```python
from strutex import DocumentProcessor
from strutex import GeminiProvider, OpenAIProvider, AnthropicProvider, OllamaProvider
from strutex.schemas import INVOICE_US

# Google Gemini
processor = DocumentProcessor(provider=GeminiProvider(api_key="your-key"))

# OpenAI
processor = DocumentProcessor(provider=OpenAIProvider(api_key="your-key", model="gpt-4o"))

# Anthropic Claude
processor = DocumentProcessor(provider=AnthropicProvider(api_key="your-key"))

# Local with Ollama (no API key needed)
processor = DocumentProcessor(provider=OllamaProvider(model="llama3"))

result = processor.process("doc.pdf", "Extract data", model=INVOICE_US)
```

> **Note:** String providers like `provider="gemini"` are convenience shortcuts that assume the required environment variables are already set. For production, explicit provider instances are recommended.

---

## Level 3: Scale Features

### Caching (reduce API costs)

```python
from strutex import DocumentProcessor
from strutex.cache import SQLiteCache

processor = DocumentProcessor(
    provider="gemini",
    cache=SQLiteCache("cache.db")
)
```
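
The idea behind a result cache can be sketched with the standard library alone. The snippet below is not strutex's `SQLiteCache` implementation; `TinyCache`, `make_key`, and the key layout are hypothetical names chosen for illustration. It assumes an entry is keyed by the document bytes plus the prompt, so a changed file or prompt misses the cache.

```python
import hashlib
import json
import sqlite3


def make_key(file_bytes: bytes, prompt: str) -> str:
    """Hash the document bytes and the prompt into one cache key."""
    digest = hashlib.sha256()
    digest.update(file_bytes)
    digest.update(b"\x00")  # separator so bytes and prompt cannot run together
    digest.update(prompt.encode("utf-8"))
    return digest.hexdigest()


class TinyCache:
    """Hypothetical key/value cache stored in a SQLite table."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT)"
        )

    def get(self, key: str):
        row = self.conn.execute(
            "SELECT value FROM cache WHERE key = ?", (key,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def set(self, key: str, value: dict) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?)",
            (key, json.dumps(value)),
        )
        self.conn.commit()


cache = TinyCache()
key = make_key(b"%PDF-1.7 fake bytes", "Extract data")
if cache.get(key) is None:   # cache miss: the LLM call would happen here
    cache.set(key, {"total": 42.0})
cached = cache.get(key)      # cache hit: no second call needed
```

Keying on content rather than on the file name means a renamed but unchanged document still hits the cache.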

### Async Processing

```python
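import asyncio
from strutex import DocumentProcessor, Object, String, Number

# Same native schema as in Level 1
schema = Object(properties={
    "invoice_number": String,
    "total": Number,
})

async def main():
    processor = DocumentProcessor(provider="anthropic")
    results = await asyncio.gather(
        processor.aprocess("doc1.pdf", "Extract", schema),
        processor.aprocess("doc2.pdf", "Extract", schema)
    )
    return results

results = asyncio.run(main())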
```
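
With many documents, launching every request at once can run into provider rate limits. A minimal sketch of bounded concurrency follows, using only `asyncio`. `fake_extract` is a hypothetical stand-in for an async extraction call, and `extract_all` and its `limit` parameter are illustrative names, not part of strutex.

```python
import asyncio


async def fake_extract(path: str) -> dict:
    """Stand-in for an async extraction call (hypothetical)."""
    await asyncio.sleep(0.01)  # simulate network latency
    return {"file": path}


async def extract_all(paths: list[str], limit: int = 2) -> list[dict]:
    """Run extractions concurrently, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(path: str) -> dict:
        async with semaphore:
            return await fake_extract(path)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(bounded(p) for p in paths))


results = asyncio.run(extract_all(["doc1.pdf", "doc2.pdf", "doc3.pdf"]))
```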

---

## Level 4: Advanced Architectures

Move beyond simple extraction with specialized processors for complex workflows.

### Agentic RAG (Self-Correcting)

The `AgenticProcessor` uses a planner-actor-optimizer loop to solve complex queries by actively searching, reading, and correcting itself.

```python
import asyncio
from strutex import AgenticProcessor

async def main():
    processor = AgenticProcessor()
    # Automatically plans, searches, and compiles the answer
    return await processor.aprocess(
        file_path="handbook.pdf",
        prompt="What is the policy for jury duty based on the employee handbook?"
    )

result = asyncio.run(main())
```

### Specialized Processors

Compose robust pipelines using built-in strategies:

- **`FallbackProcessor`**: Switch providers if the primary fails.
- **`RouterProcessor`**: Route to different models based on document type.
- **`EnsembleProcessor`**: Query multiple models and vote on the best answer.
- **`PrivacyProcessor`**: Redact PII locally before sending to cloud LLMs.

See [Advanced Processors Documentation](docs/advanced-processors.md) for details.
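
The fallback strategy can be illustrated without any LLM at all. The sketch below is a generic pattern and not the `FallbackProcessor` API; `with_fallback`, `flaky_primary`, and `local_backup` are hypothetical names, and each "provider" is assumed to be a plain callable that takes a file path.

```python
from typing import Callable


def with_fallback(providers: list[Callable[[str], dict]], file_path: str) -> dict:
    """Try each provider in order and return the first successful result."""
    errors: list[Exception] = []
    for provider in providers:
        try:
            return provider(file_path)
        except Exception as exc:  # a real pipeline would catch narrower errors
            errors.append(exc)
    raise RuntimeError(f"all {len(providers)} providers failed: {errors}")


def flaky_primary(file_path: str) -> dict:
    """Simulates a primary provider that is currently unavailable."""
    raise TimeoutError("primary provider timed out")


def local_backup(file_path: str) -> dict:
    """Simulates a backup provider that succeeds."""
    return {"source": "backup", "file": file_path}


result = with_fallback([flaky_primary, local_backup], "doc.pdf")
```

Collecting the individual errors keeps the final exception informative when every provider fails.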

---

## Level 5: Extensibility

### Plugin System

Everything is pluggable. Just inherit from a base class:

| Type             | Purpose                 | Examples                          |
| ---------------- | ----------------------- | --------------------------------- |
| `Provider`       | LLM backends            | Gemini, OpenAI, Claude, Ollama    |
| `Extractor`      | Document parsing        | PDF, Image OCR, Excel             |
| `Validator`      | Output validation       | Schema, sum checks, date formats  |
| `SecurityPlugin` | Input/output protection | Injection detection, sanitization |
| `Postprocessor`  | Data transformation     | Date/number normalization         |

```python
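from strutex.plugins import Provider, Extractor, Validator
# Assumption: ValidationResult is exported from the same module; adjust if it lives elsewhere
from strutex.plugins import ValidationResult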

# Custom LLM Provider
class MyProvider(Provider):
    """Auto-registered as 'myprovider'"""
    def process(self, file_path, prompt, schema, mime_type, **kwargs):
        # Call your LLM API
        ...

# Custom Document Extractor
class WordExtractor(Extractor, name="word"):
    """Handle .docx files"""
    mime_types = ["application/vnd.openxmlformats-officedocument.wordprocessingml.document"]

    def extract(self, file_path: str) -> str:
        # Parse .docx and return text
        ...

# Custom Validator
class TotalValidator(Validator):
    """Verify line items sum to total"""
    def validate(self, data, schema, context):
        items_sum = sum(item["amount"] for item in data.get("items", []))
        return ValidationResult(
            valid=abs(items_sum - data["total"]) < 0.01,
            message="Line items must sum to total"
        )
```

### CLI Commands

```bash
strutex plugins list              # List all plugins
strutex plugins list --type provider
strutex plugins info gemini --type provider
```

### For Distributable Packages

```toml
# pyproject.toml
[project.entry-points."strutex.providers"]
my_provider = "my_package:MyProvider"
```

### Hooks System

Inject logic at any point in the processing pipeline:

```python
from strutex import DocumentProcessor

processor = DocumentProcessor(provider="gemini")

@processor.on_pre_process
def add_instructions(file_path, prompt, schema, mime_type, context):
    """Modify prompt before sending to LLM"""
    return {"prompt": prompt + "\nBe precise and thorough."}

@processor.on_post_process
def normalize_dates(result, context):
    """Transform output after extraction"""
    if "date" in result:
        # parse_date is a placeholder for your own date-parsing helper
        result["date"] = parse_date(result["date"])
    return result

@processor.on_error
def handle_rate_limit(error, file_path, context):
    """Custom error handling"""
    if "rate limit" in str(error).lower():
        return {"error": "Rate limited, please retry"}
    return None  # Propagate other errors
```
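
To make the data flow concrete, here is a minimal sketch of how decorator-registered hooks can chain. `MiniPipeline` is a hypothetical toy and not `DocumentProcessor`; it assumes, as in the example above, that a pre-process hook returns a dict of overrides and a post-process hook returns the transformed result. The hook signatures are simplified to a single argument.

```python
class MiniPipeline:
    """Hypothetical pipeline showing how registered hooks can chain."""

    def __init__(self) -> None:
        self.pre_hooks = []
        self.post_hooks = []

    def on_pre_process(self, func):
        self.pre_hooks.append(func)
        return func  # return the function so the decorated name stays usable

    def on_post_process(self, func):
        self.post_hooks.append(func)
        return func

    def run(self, prompt: str) -> dict:
        for hook in self.pre_hooks:
            updates = hook(prompt) or {}
            prompt = updates.get("prompt", prompt)  # hooks may override the prompt
        result = {"prompt_used": prompt}            # stand-in for the LLM call
        for hook in self.post_hooks:
            result = hook(result)
        return result


pipeline = MiniPipeline()


@pipeline.on_pre_process
def add_instructions(prompt):
    return {"prompt": prompt + "\nBe precise and thorough."}


@pipeline.on_post_process
def tag_result(result):
    result["checked"] = True
    return result


output = pipeline.run("Extract data")
```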

---

## Optional Extras

```bash
pip install strutex[cli]          # CLI commands
pip install strutex[ocr]          # OCR support
pip install strutex[rag]          # RAG support (Qdrant, FastEmbed)
pip install strutex[langchain]    # LangChain integration
pip install strutex[llamaindex]   # LlamaIndex integration
pip install strutex[all]          # Everything
```

---

## Supported Formats

| Format | Extensions              | Method                              |
| ------ | ----------------------- | ----------------------------------- |
| PDF    | `.pdf`                  | Text extraction with fallback chain |
| Images | `.png`, `.jpg`, `.tiff` | Direct vision or OCR                |
| Excel  | `.xlsx`, `.xls`         | Converted to structured text        |
| Text   | `.txt`, `.csv`          | Direct input                        |
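
The table above amounts to a lookup by file extension. A minimal sketch of such a dispatch follows; `METHODS` and `pick_method` are hypothetical names for illustration, and strutex may well select extractors differently (for example by MIME type).

```python
from pathlib import Path

# Extension-to-method table copied from the formats listed above
METHODS = {
    ".pdf": "text extraction with fallback chain",
    ".png": "direct vision or OCR",
    ".jpg": "direct vision or OCR",
    ".tiff": "direct vision or OCR",
    ".xlsx": "converted to structured text",
    ".xls": "converted to structured text",
    ".txt": "direct input",
    ".csv": "direct input",
}


def pick_method(file_path: str) -> str:
    """Return the handling method for a file, based on its extension."""
    suffix = Path(file_path).suffix.lower()  # case-insensitive match
    try:
        return METHODS[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix or '(none)'}") from None
```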

---

## Full Feature List

<details>
<summary>Click to expand all features</summary>

- **Plugin System v2** — Auto-registration via inheritance, lazy loading, entry points
- **Hooks** — Callbacks and decorators for pre/post processing pipeline
- **CLI Tooling** — `strutex plugins list|info|refresh` commands
- **Multi-Provider LLM Support** — Gemini, OpenAI, Anthropic, Ollama, Groq, Langdock
- **Universal Document Support** — PDFs, images, Excel, and custom formats
- **Schema-Driven Extraction** — Define your output structure, get consistent JSON
- **Verification & Self-Correction** — Built-in audit loop for high accuracy
- **Security First** — Built-in input sanitization and output validation
- **RAG Capabilities** — Built-in Retrieval-Augmented Generation with Qdrant and FastEmbed
- **Framework Integrations** — LangChain, LlamaIndex, Haystack compatibility
- **Caching** — Memory, SQLite, and file-based caching
- **Async & Batch** — Process multiple documents in parallel
- **Streaming** — Real-time extraction feedback

</details>

---

## Documentation

📚 **[Read the Docs](https://aquilesorei.github.io/strutex/latest/)**

---

## Roadmap

See [ROADMAP.md](ROADMAP.md) for the full development plan.

**Recent releases:**

- [x] v0.1.0 — Core functionality
- [x] v0.2.0 — Plugin registry + Security layer
- [x] v0.3.0 — Plugin System v2
- [x] v0.6.0 — Built-in Schemas & Logging
- [x] v0.7.0 — Providers & Retries
- [x] v0.8.0 — Async, Batch, Cache, Verification
- [x] v0.8.1 — Documentation & Coverage Fixes
- [x] v1.3.7 — Agentic RAG, Advanced Processors, & Full Async Support

---

## License

This project is licensed under the **GNU General Public License v3.0** — see [LICENSE](LICENSE) for details.

---

## Contributing

Contributions welcome! Priority areas:

1. **New plugins** — Providers, extractors, validators
2. **Documentation** — Examples and tutorials
3. **Testing** — Expand test coverage

