Metadata-Version: 2.4
Name: psyduck
Version: 0.1.0
Summary: Agent-safe PDF extraction SDK with structured output and quality reports.
Project-URL: Repository, https://github.com/motosan-dev/psyduck
Project-URL: Issues, https://github.com/motosan-dev/psyduck/issues
Project-URL: Documentation, https://github.com/motosan-dev/psyduck#readme
Author: Psyduck Contributors
License-Expression: MIT
License-File: LICENSE
Keywords: agents,document-extraction,pdf,pymupdf,sdk
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Markup
Requires-Python: >=3.11
Requires-Dist: pydantic>=2.7
Requires-Dist: pymupdf>=1.24
Provides-Extra: dev
Requires-Dist: hatchling>=1.25; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: reportlab>=4.0; extra == 'dev'
Requires-Dist: ruff>=0.6; extra == 'dev'
Description-Content-Type: text/markdown

# Psyduck

Psyduck is a small Python SDK for turning PDFs into agent-ready structured documents.

It is intentionally SDK-only: import it from Python, run `Psyduck().process(...)`, and inspect the returned document, profile, quality report, and exported artifacts.

## Install

```bash
pip install psyduck
```

Development install:

```bash
pip install -e ".[dev]"
```

## Python API

```python
from psyduck import Psyduck

duck = Psyduck()
result = duck.process("report.pdf", goal="rag", mode="balanced", return_content=True)

if result.quality.needs_review:
    for warning in result.quality.warnings:
        print(warning.code, warning.message)

for block in result.document.blocks:
    print(block.page, block.text)
```

## Output Directory

Each `process()` run writes its artifacts to its own directory:

```text
output/report-<timestamp>-<id>/
  run.json
  profile.json
  document.md
  document.json
  quality.json
  tables/
  assets/
  pages/
```

## SDK Contract

- Default extraction uses PyMuPDF.
- `process()` always writes Markdown (`document.md`), JSON (`document.json`), profile (`profile.json`), quality (`quality.json`), and run metadata (`run.json`).
- `return_content=False` keeps large document content out of the immediate result.
- `load_result(output_dir)` reloads a previous SDK run.
- Custom extractors can be supplied through `extractor_registry` and requested with `process(..., extractors=[...])`.
- `tables="force"` and `ocr="force"` report `needs_table_adapter` / `needs_ocr_adapter` warnings unless a caller-provided extractor handles that work.
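
The adapter warnings named in the last point can be picked out of a warning list by code. This is a hedged sketch: `missing_adapters` is a hypothetical helper, the `SimpleNamespace` objects stand in for the items of `result.quality.warnings` (assumed only to expose `.code`, as in the Python API example), and `"some_other_warning"` is an invented code used purely for illustration.

```python
from types import SimpleNamespace

# Warning codes named in the SDK contract above.
ADAPTER_WARNING_CODES = {"needs_table_adapter", "needs_ocr_adapter"}


def missing_adapters(warnings) -> list[str]:
    """Return the sorted, de-duplicated adapter warning codes in `warnings`."""
    return sorted({w.code for w in warnings if w.code in ADAPTER_WARNING_CODES})


# Stand-in for result.quality.warnings; the second code is hypothetical.
warnings = [
    SimpleNamespace(code="needs_ocr_adapter", message="placeholder"),
    SimpleNamespace(code="some_other_warning", message="placeholder"),
]
print(missing_adapters(warnings))  # prints ['needs_ocr_adapter']
```

A non-empty result suggests that the run was asked to force tables or OCR without a caller-provided extractor that handles that work.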

## Custom Extractors

```python
from psyduck import Psyduck
from psyduck.extractors.base import ExtractorOutput
from psyduck.extractors.pymupdf import PyMuPDFExtractor


class MyExtractor:
    # Placeholder extractor: returns an output that carries only its source name.
    def extract(self, file_path, pages=None):
        return ExtractorOutput(source="my_extractor")


duck = Psyduck(
    # Each registry value is a zero-argument callable that builds an extractor.
    extractor_registry={
        "pymupdf": PyMuPDFExtractor,
        "my_extractor": lambda: MyExtractor(),
    }
)

result = duck.process("report.pdf", extractors=["pymupdf", "my_extractor"])
```

## Agent Policy

1. Always call `process()` before answering questions about PDF contents.
2. Check `result.quality` before using extracted content.
3. Use the Markdown output for summaries and the JSON output for page-aware answers.
4. Do not claim complete extraction when `result.quality.needs_review` is true.
