Metadata-Version: 2.4
Name: mag-file-handler
Version: 0.1.0
Summary: Universal document parser: PDF / Office / email / images / HTML — tiered routing for cost & accuracy
Author-email: Magure <aman.p@magureinc.com>
Maintainer-email: Magure <aman.p@magureinc.com>
License: Apache-2.0
Project-URL: Homepage, https://github.com/Magure-Tech/magoneai-file-handler
Project-URL: Source, https://github.com/Magure-Tech/magoneai-file-handler
Project-URL: Issues, https://github.com/Magure-Tech/magoneai-file-handler/issues
Project-URL: Changelog, https://github.com/Magure-Tech/magoneai-file-handler/blob/main/file_handler/CHANGELOG.md
Project-URL: Documentation, https://github.com/Magure-Tech/magoneai-file-handler/blob/main/file_handler/README.md
Keywords: pdf,ocr,document,parser,extraction,tika,extractous,vision-llm,office,docx
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: MacOS
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Topic :: Office/Business
Classifier: Topic :: Text Processing
Classifier: Topic :: Text Processing :: General
Classifier: Topic :: Text Processing :: Markup
Classifier: Typing :: Typed
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: NOTICE
Requires-Dist: pypdfium2>=4.30
Requires-Dist: Pillow>=10.0
Requires-Dist: numpy>=1.26
Requires-Dist: filetype>=1.2
Requires-Dist: extractous>=0.3
Provides-Extra: email
Requires-Dist: extract-msg>=0.55; extra == "email"
Provides-Extra: vision
Provides-Extra: all
Requires-Dist: extract-msg>=0.55; extra == "all"
Provides-Extra: test
Requires-Dist: pytest>=7; extra == "test"
Requires-Dist: python-docx>=1.0; extra == "test"
Requires-Dist: python-pptx>=0.6; extra == "test"
Requires-Dist: openpyxl>=3.1; extra == "test"
Dynamic: license-file

# mag-file-handler

Universal document parser for any file format your product receives.

```python
from file_handler import parse

result = parse("inbound/email.eml")
print(result.text)            # extracted plain text
print(result.format)          # "eml"
print(result.engine)          # "email"
print(result.extra)           # engine-specific provenance
```

## Why this exists

Different document types need different tools — and getting it wrong is
expensive. Born-digital PDFs should hit a free text-layer extractor; scanned
newspapers should hit Tika; complex slides and textbooks need a vision LLM.
This library routes each document to the right engine, automatically.

## System requirements

- **Python 3.10, 3.11, 3.12, or 3.13** (CPython).
- **OS / arch:** Linux x86-64 (glibc ≥ 2.28), macOS x86-64 (≥ 10.12),
  macOS ARM64 (≥ 11), Windows x86-64. Linux ARM64 (Graviton, RPi)
  and Windows ARM64 are **not** currently supported because the
  `extractous` dependency does not ship wheels for those targets.
- **No JVM is required.** `extractous` ships native binaries built with
  GraalVM AOT compilation, so installation and runtime are pure-native
  even though Apache Tika is used internally for the long tail of formats.
- **Optional:** a `VisionClient` implementation (your own, or one backed by
  your platform's LLM gateway) for image OCR and scanned-PDF Tier 2 routing.
  See [Vision is pluggable](#vision-is-pluggable--and-there-is-no-default).

## Install

```bash
pip install mag-file-handler              # core (PDF router, Office, HTML, txt, EML)
pip install "mag-file-handler[email]"     # + Outlook .msg support (quoted so zsh does not glob the brackets)
pip install "mag-file-handler[all]"       # everything optional
```

> `[vision]` is a documentation-only marker extra — the library ships
> zero LLM SDK dependencies by design. To use vision OCR, pass a
> `VisionClient` you implement (or one your platform provides) into
> `parse(path, vision_client=...)`. See below for the protocol.

## Usage

### Library

```python
from file_handler import parse

result = parse("/path/to/file")
if result.ok:
    process(result.text)
else:
    log.warning("parse failed: %s", result.error)

# Engine-specific provenance lives in result.extra
if result.engine == "pdf_router":
    print(result.extra["tier_counts"])    # {"tier0_pdfium": 5, "tier1_extractous": 0, "tier2_claude": 2}
    print(result.extra["vision_usage"])   # {"input_tokens": ..., "output_tokens": ...}
```

### CLI

```bash
file-handler parse  document.pdf            # prints extracted text
file-handler parse  document.pdf --json     # full result as JSON
file-handler detect document.pdf            # format detection only
file-handler info                           # version + which engines are available
```

## How routing works

```
                     FORMAT DETECTION
                     (magic bytes + ext)
                            │
        ┌───────────────────┼───────────────────────┐
        ▼                   ▼                       ▼
       PDF              Image (jpg/png/…)       Email (eml/msg)
        │                   │                       │
        ▼                   ▼                       ▼
   PDF Router          Claude Vision         parse + recurse on
   (per-page tiers)                          each attachment
   ┌──────────────┐
   │ Tier 0  pdfium native text-layer       free, ms-fast
   │ Tier 1  Extractous (Tika)              free, ~2 s
   │ Tier 2  Claude Haiku 4.5 vision        ~$0.008, ~15 s
   └──────────────┘

   Office, HTML, MD, TXT, …  →  Extractous (Tika handles them natively)
```
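
The detection step at the top of the diagram can be sketched roughly as follows. This is a minimal illustration, not the library's detector: `detect_format` and `MAGIC_PREFIXES` are hypothetical names, and only a handful of well-known signatures are listed. The confidence labels reuse the `"magic"` / `"ext"` values documented for `detection_confidence`.

```python
from pathlib import Path

# Hypothetical signature table: a few well-known magic-byte prefixes only.
MAGIC_PREFIXES = {
    b"%PDF-": "pdf",
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"GIF8": "gif",
}


def detect_format(path: str, head: bytes) -> tuple[str, str]:
    """Return (format_id, detection_confidence) from a file's first bytes."""
    for prefix, format_id in MAGIC_PREFIXES.items():
        if head.startswith(prefix):
            return format_id, "magic"
    # No signature matched: fall back to the extension, which is weaker evidence.
    ext = Path(path).suffix.lower().lstrip(".")
    return (ext or "unknown"), "ext"


print(detect_format("scan.bin", b"%PDF-1.7\n"))   # content wins over a misleading extension
print(detect_format("notes.md", b"# Title"))      # plain text has no signature, so the extension decides
```

Checking content before the extension means a mislabelled upload (a PDF saved as `.bin`, say) still reaches the PDF router.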

For PDFs, the per-page decision tree picks:

| Signal | Decision |
|---|---|
| `text_layer_chars >= 100` | Tier 0 — free, instant |
| `is_broadsheet` (long edge ≥ 1500 pt) | Tier 1 — Extractous wins on dense newspapers |
| `clean_columned` (2–3 uniform cols) | Tier 1 — Extractous wins on structured columns |
| else | Tier 2 — LLM vision (slides, mixed cols, textbooks) |

If Tier 0 / Tier 1 returns near-empty text on a page that visibly has content,
the engine falls back to Tier 2 automatically (conservative — only on empty
output, not on questionable quality).
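
The table and fallback rule above can be read as a small pure function. The sketch below is an assumption-laden illustration: `PageFeatures`, `decide_tier`, `needs_vision_fallback`, the `"tier2_vision"` label, and the `NEAR_EMPTY_CHARS` cut-off are all hypothetical, and the check that a page "visibly has content" is left out for brevity. Only the 100-character threshold and the order of the signals come from the table.

```python
from dataclasses import dataclass

MIN_TEXT_LAYER_CHARS = 100   # threshold from the table above
NEAR_EMPTY_CHARS = 10        # assumed cut-off for "near-empty" output


@dataclass(frozen=True)
class PageFeatures:
    """Hypothetical stand-in for the router's per-page features."""
    text_layer_chars: int
    is_broadsheet: bool
    clean_columned: bool


def decide_tier(f: PageFeatures) -> tuple[str, str]:
    """Return (tier, reason), checking the signals in table order."""
    if f.text_layer_chars >= MIN_TEXT_LAYER_CHARS:
        return "tier0_pdfium", "text layer present"
    if f.is_broadsheet:
        return "tier1_extractous", "broadsheet page"
    if f.clean_columned:
        return "tier1_extractous", "clean columns"
    return "tier2_vision", "no cheaper signal matched"


def needs_vision_fallback(tier: str, text: str) -> bool:
    """Escalate only when a cheaper tier produced near-empty text."""
    return tier != "tier2_vision" and len(text.strip()) < NEAR_EMPTY_CHARS


scanned_slide = PageFeatures(text_layer_chars=0, is_broadsheet=False, clean_columned=False)
print(decide_tier(scanned_slide))
```

Because the first matching signal wins, a broadsheet page that already has a usable text layer never leaves Tier 0.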

## Returned types

```python
from dataclasses import dataclass


@dataclass
class ParseResult:
    text: str                      # extracted plain text
    format: str                    # "pdf", "docx", "eml", …
    engine: str                    # which engine handled it
    mime: str
    detection_confidence: str      # "magic" / "ext" / "content"
    path: str
    error: str | None              # set if parsing failed
    page_count: int | None
    extra: dict                    # engine-specific provenance

    @property
    def ok(self) -> bool:          # derived, not a stored field
        return self.error is None
```

`extra` always carries enough to audit a parse:

| Engine | Notable `extra` keys |
|---|---|
| `pdf_router` | `tier_counts`, `vision_usage`, `routing` (per-page) |
| `vision` | `vision_usage`, `vision_client` |
| `email` | `attachments` (list of sub-ParseResults) |
| `extractous` | `metadata_keys_count` |
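
Since the `email` engine nests sub-results under `extra["attachments"]`, auditing a message usually means walking that tree. The sketch below shows one way to do it. `Result` is a hypothetical stand-in carrying only the fields used here, `walk` is not part of the library, and it assumes attachments are objects with the same attributes as their parent.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Result:
    """Minimal stand-in for ParseResult; only the fields used below."""
    path: str
    text: str
    engine: str
    extra: dict = field(default_factory=dict)


def walk(result: Result, depth: int = 0) -> Iterator[tuple[int, Result]]:
    """Yield (depth, result) for a result and all of its nested attachments."""
    yield depth, result
    for child in result.extra.get("attachments", []):
        yield from walk(child, depth + 1)


invoice = Result("invoice.pdf", "Total: 42", "pdf_router")
email = Result("mail.eml", "See attached.", "email", {"attachments": [invoice]})

for depth, r in walk(email):
    print("  " * depth + f"{r.path} [{r.engine}] {len(r.text)} chars")
```

The same traversal works for summing `vision_usage` across attachments or for collecting every `error` in one pass.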

## Vision is pluggable — and there is no default

The library ships ZERO LLM SDK dependencies. Vision (image OCR + Tier 2 of
the PDF router) requires the caller to inject a `VisionClient`. If none is
provided, vision-needing operations return an error in `result.error`
instead of falling back to a default provider.

```python
from typing import Any, Protocol


class VisionClient(Protocol):
    def ocr_image(
        self,
        image_bytes: bytes,
        media_type: str,
        *,
        prompt: str | None = None,
        max_tokens: int = 16384,
    ) -> tuple[str, dict[str, Any]]:
        """Return (extracted_text, usage_metadata)."""
```

This is intentional: production deployments use a per-org LLM gateway
(model selection, access control, cost tracking, secrets management) that
the library has no business knowing about. The `vision_client` argument of
`parse(path, vision_client=...)` is the integration point, and the same
client is applied recursively to email attachments.

## Temporal integration — the canonical production pattern

In production, `file_handler.parse()` is **not** called from inside a single
Temporal Activity. Instead, magoneai's workflow composes file_handler's
per-tier engines as separate activities, with vision as its own activity
backed by `LLMGateway`. This gives Temporal-native retries, observability,
and rate-limiting per tier — without any `asyncio.run` bridging.

```python
# magoneai/temporal/file_handler/activities.py

from pathlib import Path

from temporalio import activity

from be.core.database import get_async_session
from be.llm.gateway import LLMGateway, LoadedImage, build_vision_message

# file_handler exposes building blocks; activities orchestrate them.
import file_handler
from file_handler.engines import extractous_engine, pdfium_engine, pdf_router
from file_handler.engines._ocr_helpers import (
    render_image_file_for_ocr,
    render_pdf_page_for_ocr,
)
from file_handler.engines.page_features import extract_features


@activity.defn
async def detect_format_activity(file_uri: str) -> dict:
    fmt = file_handler.detect(file_uri)
    return {"format_id": fmt.format_id.value, "mime": fmt.mime}


@activity.defn
async def extract_via_extractous_activity(file_uri: str) -> dict:
    """Tika-based extraction for DOCX/PPTX/XLSX/HTML/MD/TXT and PDF Tier 1."""
    return extractous_engine.parse_file(Path(file_uri))


@activity.defn
async def plan_pdf_route_activity(file_uri: str) -> list[dict]:
    """Per-page tier plan, made by the workflow before dispatch."""
    import pypdfium2 as pdfium
    pdf = pdfium.PdfDocument(file_uri)
    plan = []
    for i, page in enumerate(pdf):
        f = extract_features(page, i)
        tier, reason = pdf_router._decide_tier(f)
        plan.append({"page": i, "tier": tier, "reason": reason})
    return plan


@activity.defn
async def pdfium_text_layer_activity(file_uri: str, page_idx: int) -> str:
    import pypdfium2 as pdfium
    pdf = pdfium.PdfDocument(file_uri)
    return pdfium_engine.extract_text(pdf[page_idx])


@activity.defn
async def extractous_page_activity(file_uri: str, page_idx: int) -> dict:
    return extractous_engine.parse_pdf_page(Path(file_uri), page_idx)


@activity.defn
async def vision_ocr_activity(
    file_uri: str,
    page_idx: int | None,        # None = whole-file image; int = PDF page
    project_id: str,
    llm_config_id: str,
) -> dict:
    """Vision OCR via the magoneai LLM gateway. Native async — no asyncio.run."""
    import base64

    if page_idx is None:
        image_bytes, media_type, _ = render_image_file_for_ocr(Path(file_uri))
    else:
        import pypdfium2 as pdfium
        pdf = pdfium.PdfDocument(file_uri)
        image_bytes, media_type, _ = render_pdf_page_for_ocr(pdf[page_idx])

    async with get_async_session() as session:
        gateway = LLMGateway(session)
        loaded = LoadedImage(
            base64_data=base64.standard_b64encode(image_bytes).decode("ascii"),
            mime_type=media_type,
            file_id=f"file_handler-{file_uri}:{page_idx}",
            size_bytes=len(image_bytes),
        )
        response = await gateway.complete(
            project_id=project_id,
            llm_config_id=llm_config_id,
            messages=[build_vision_message(text="Transcribe this page.", images=[loaded])],
            parameters={"max_tokens": 16384},
            images=[loaded],
            source_type="file_handler",
        )
    return {
        "text": response.content,
        "input_tokens": response.usage.tokens_in,
        "output_tokens": response.usage.tokens_out,
        "request_id": response.usage.request_id,
    }
```

```python
# magoneai/temporal/file_handler/workflows.py

from datetime import timedelta
from temporalio import workflow

from .activities import (
    detect_format_activity,
    extract_via_extractous_activity,
    plan_pdf_route_activity,
    pdfium_text_layer_activity,
    extractous_page_activity,
    vision_ocr_activity,
)


IMAGE_FORMATS = {"jpeg", "png", "tiff", "gif", "webp", "bmp"}


@workflow.defn
class ParseDocumentWorkflow:
    @workflow.run
    async def run(self, file_uri: str, project_id: str, llm_config_id: str) -> dict:
        fmt = await workflow.execute_activity(
            detect_format_activity, file_uri,
            start_to_close_timeout=timedelta(seconds=10),
        )

        if fmt["format_id"] == "pdf":
            # Temporal takes multiple arguments via args=[...], not positionally.
            return await workflow.execute_child_workflow(
                ParsePdfWorkflow.run,
                args=[file_uri, project_id, llm_config_id],
            )

        if fmt["format_id"] in IMAGE_FORMATS:
            res = await workflow.execute_activity(
                vision_ocr_activity,
                args=[file_uri, None, project_id, llm_config_id],
                start_to_close_timeout=timedelta(minutes=3),
            )
            return {"text": res["text"], "engine": "vision", "format": fmt["format_id"]}

        # Office, HTML, MD, TXT, EML, etc. → Tika handles them, no vision.
        res = await workflow.execute_activity(
            extract_via_extractous_activity, file_uri,
            start_to_close_timeout=timedelta(minutes=2),
        )
        return {**res, "format": fmt["format_id"]}


@workflow.defn
class ParsePdfWorkflow:
    @workflow.run
    async def run(self, file_uri: str, project_id: str, llm_config_id: str) -> dict:
        plan = await workflow.execute_activity(
            plan_pdf_route_activity, file_uri,
            start_to_close_timeout=timedelta(seconds=30),
        )

        import asyncio  # stdlib; used for the fan-out below

        # Fan-out: each page runs its chosen tier as an independent activity.
        async def run_tier(page_decision: dict) -> str:
            i = page_decision["page"]
            tier = page_decision["tier"]
            if tier == "tier0_pdfium":
                return await workflow.execute_activity(
                    pdfium_text_layer_activity,
                    args=[file_uri, i],
                    start_to_close_timeout=timedelta(seconds=30),
                )
            if tier == "tier1_extractous":
                res = await workflow.execute_activity(
                    extractous_page_activity,
                    args=[file_uri, i],
                    start_to_close_timeout=timedelta(minutes=2),
                )
                return res.get("text", "")
            # Anything else is Tier 2: vision OCR through the gateway.
            res = await workflow.execute_activity(
                vision_ocr_activity,
                args=[file_uri, i, project_id, llm_config_id],
                start_to_close_timeout=timedelta(minutes=3),
            )
            return res["text"]

        # All pages are scheduled at once; worker concurrency caps how many
        # activities are in flight. asyncio.gather preserves page order.
        page_texts = await asyncio.gather(*(run_tier(p) for p in plan))
        return {
            "text": "\n\n".join(t for t in page_texts if t),
            "engine": "pdf_router",
            "format": "pdf",
            "page_count": len(plan),
            "per_page_decisions": plan,
        }
```

Why this shape:

- **Each tier is its own Temporal Activity.** Vision-rate-limit retries don't
  block fast pdfium calls. Extractous subprocess crashes retry independently.
  Per-tier timeouts and policies live where they belong.
- **The vision activity is async-native.** It calls `LLMGateway.complete()`
  directly with `await`. No `asyncio.run`, no thread bridge, no nested loops.
- **Routing decisions are visible in the workflow's event history.** You can
  query "what tier did page 7 use?" directly from Temporal.
- **Per-page fan-out is parallel.** A 50-page PDF runs N pages concurrently
  (worker concurrency caps total in-flight). file_handler's sync `parse()`
  would have processed them serially.
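
If worker concurrency alone is too coarse a limit for the gateway, the fan-out can also be capped in the calling code. The sketch below uses a plain `asyncio.Semaphore` with a fake per-page call. `MAX_IN_FLIGHT`, `fake_page_call`, and `run_bounded` are hypothetical names, and `asyncio.run` appears only to make the demo executable on its own (inside a workflow the coroutine would simply be awaited).

```python
import asyncio

MAX_IN_FLIGHT = 4  # assumed limit; tune to the gateway's rate limit


async def fake_page_call(page: int, state: dict) -> str:
    """Stand-in for one per-page activity call; records peak concurrency."""
    state["now"] += 1
    state["peak"] = max(state["peak"], state["now"])
    await asyncio.sleep(0.01)
    state["now"] -= 1
    return f"page {page}"


async def run_bounded(pages: list[int], limit: int, state: dict) -> list[str]:
    sem = asyncio.Semaphore(limit)

    async def guarded(page: int) -> str:
        async with sem:  # at most `limit` calls hold the semaphore at once
            return await fake_page_call(page, state)

    # gather preserves input order, so texts line up with page numbers.
    return await asyncio.gather(*(guarded(p) for p in pages))


state = {"now": 0, "peak": 0}
texts = asyncio.run(run_bounded(list(range(10)), MAX_IN_FLIGHT, state))
print(len(texts), "pages, peak concurrency", state["peak"])
```

The semaphore bounds only this one fan-out. Limits shared across workflows still belong in worker or task-queue configuration.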

For local dev, scripts, and the benchmark, `file_handler.parse(path,
vision_client=YourClient())` is still the simplest path — pass any object
implementing `VisionClient`.

## Limitations / known issues

- **Legacy Office** (.doc / .ppt / .xls): falls back to Tika best-effort.
  A libreoffice-based pre-converter is on the roadmap.
- **Tables**: neither Extractous nor Claude returns structured tables.
  pdfplumber-based table extraction is on the roadmap.
- **No CJK OCR**: Tika's default Tesseract is English-only. We have not
  invested in Chinese / Japanese / Korean OCR (out of scope).
- **Dense tabloid pages**: the router's column detector occasionally
  mis-routes a tabloid (single dominant photo + scattered text) to Tier 2.
  Watch this if your traffic is heavy on tabloid layouts.

## Development

```bash
pip install -e ".[all,test]"
pytest
```
