# Psyduck

> Python SDK for agent-safe PDF extraction. Psyduck turns PDFs into structured text, Markdown, JSON, page previews, assets, profiles, and quality reports.

- Version: 0.1.0
- GitHub: https://github.com/motosan-dev/psyduck
- Package: `psyduck`
- Runtime: Python 3.11+
- Default extractor: PyMuPDF only

## Install

```bash
pip install psyduck
```

For local development:

```bash
python -m venv .venv
.venv/bin/python -m pip install -e ".[dev]"
```

## Minimal SDK Usage

```python
from psyduck import Psyduck

client = Psyduck()
result = client.process(
    "document.pdf",
    goal="rag",
    output_dir="output",
    return_content=True,
)

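print(result.status)               # "completed" or "needs_review"
print(result.outputs["markdown"])  # path to the generated Markdown artifact
print(result.quality.warnings)     # check these before trusting the output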
```

`process()` is the primary API. Psyduck is intentionally SDK-only and does not ship a CLI.

## Public API

```python
from psyduck import (
    Psyduck,
    PsyduckBlock,
    PsyduckConfig,
    PsyduckDocument,
    PsyduckProfile,
    PsyduckQualityReport,
    PsyduckResult,
    PsyduckTable,
    PsyduckWarning,
)
```

### Psyduck.process()

Important keyword arguments:

| Argument | Purpose |
|----------|---------|
| `goal` | `auto`, `rag`, `summary`, `tables`, `archive` |
| `mode` | `fast`, `balanced`, `high_quality` |
| `pages` | Optional original PDF page numbers to process |
| `ocr` | `auto`, `force`, `off` |
| `tables` | `auto`, `force`, `off` |
| `extractors` | Optional extractor names from the registry |
| `output_dir` | Directory for run artifacts |
| `return_content` | Include `PsyduckDocument` in the returned object |
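
A minimal sketch of checking option values against the table above before calling `process()`. The helper `build_process_kwargs` is hypothetical and not part of Psyduck, and passing `pages` as a list of integers is an assumption about the accepted type.

```python
# Allowed values copied from the keyword-argument table above.
ALLOWED = {
    "goal": {"auto", "rag", "summary", "tables", "archive"},
    "mode": {"fast", "balanced", "high_quality"},
    "ocr": {"auto", "force", "off"},
    "tables": {"auto", "force", "off"},
}


def build_process_kwargs(**options):
    """Hypothetical helper: reject values that the table does not list."""
    for key, value in options.items():
        allowed = ALLOWED.get(key)
        if allowed is not None and value not in allowed:
            raise ValueError(f"{key}={value!r} is not one of {sorted(allowed)}")
    return dict(options)


# `pages` as a list of ints is an assumption; the table only says
# "original PDF page numbers".
kwargs = build_process_kwargs(
    goal="tables", mode="high_quality", ocr="off", pages=[1, 2]
)

# With psyduck installed, the result could be passed along as:
#   Psyduck().process("document.pdf", **kwargs)
```

Validating up front keeps a typo such as `mode="high-quality"` from reaching the SDK, whose behavior on unknown values is not documented here.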

### Result Shape

`PsyduckResult` contains:

| Field | Meaning |
|-------|---------|
| `run_id` | Stable run directory name |
| `status` | `completed` or `needs_review` |
| `document` | Structured document, or omitted when `return_content=False` |
| `profile` | PDF page/profile metadata |
| `quality` | Warnings, score, and suggested actions |
| `outputs` | Paths to generated artifacts |

Generated artifacts commonly include:

- `document.md`
- `document.json`
- `quality.json`
- `profile.json`
- `run.json`
- `pages/` for rendered page previews in high-quality mode
- `assets/` for archive mode
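
As a rough sketch, an agent could confirm that the core files listed above exist before reading them. The helper `missing_artifacts` is hypothetical, and the caller is assumed to already know the run directory; how that directory is derived from `output_dir` and `run_id` is not specified here.

```python
from pathlib import Path

# File names taken from the artifact list above. `pages/` and `assets/`
# are left out because they only appear in specific modes.
EXPECTED_FILES = [
    "document.md",
    "document.json",
    "quality.json",
    "profile.json",
    "run.json",
]


def missing_artifacts(run_dir):
    """Return the expected file names that are absent from run_dir."""
    run_path = Path(run_dir)
    return [name for name in EXPECTED_FILES if not (run_path / name).is_file()]
```

An empty list means every core artifact is present; anything else is a reason to re-run or inspect the run before consuming its output.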

## Quality Warnings

Agents should inspect `result.status` and `result.quality.warnings` before trusting output.

Common warning codes:

| Code | Meaning |
|------|---------|
| `EMPTY_PAGE` | A processed page has no extracted text |
| `OCR_NEEDED` | The PDF looks scan-heavy |
| `OCR_ADAPTER_NEEDED` | OCR was requested but the base SDK has no OCR adapter |
| `NO_TABLES_EXTRACTED` | Table goal requested but no tables were found |
| `TABLE_ADAPTER_NEEDED` | Table extraction was forced but no table adapter produced tables |
| `UNKNOWN_EXTRACTOR` | Requested extractor name is not registered |
| `EXTRACTOR_UNAVAILABLE` | Extractor dependency or runtime requirement is unavailable |
| `EXTRACTOR_FAILED` | Extractor raised an unexpected exception |

For agent workflows, treat `needs_review` as a routing signal: escalate to a stronger extractor, an OCR pipeline, or human review, or retry with a narrower page range.
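
One way an agent might turn the status and warning codes into a next step is sketched below. The function `route` and the action names it returns are hypothetical; only the status values and warning codes come from the tables above. The function takes plain code strings, since how a code is read from a `PsyduckWarning` object is not shown here.

```python
def route(status, warning_codes):
    """Map a run status and its warning codes to a hypothetical next action."""
    codes = set(warning_codes)

    # Clean run: nothing to escalate.
    if status == "completed" and not codes:
        return "accept"

    # Scan-heavy input, or OCR was requested without an adapter.
    if codes & {"OCR_NEEDED", "OCR_ADAPTER_NEEDED"}:
        return "ocr_pipeline"

    # Tables were wanted but none were produced.
    if codes & {"NO_TABLES_EXTRACTED", "TABLE_ADAPTER_NEEDED"}:
        return "table_extractor"

    # The requested extractor is missing, unavailable, or crashed.
    if codes & {"UNKNOWN_EXTRACTOR", "EXTRACTOR_UNAVAILABLE", "EXTRACTOR_FAILED"}:
        return "fix_extractor_setup"

    # Anything else (for example EMPTY_PAGE) goes to a person.
    return "human_review"
```

The fallback is deliberately conservative: any warning the function does not recognize is sent to human review instead of being accepted.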

## Custom Extractors

Psyduck ships with PyMuPDF only. Integrators can register their own extractors:

```python
from psyduck import Psyduck
from psyduck.extractors.base import ExtractorOutput
from psyduck.schema import PsyduckBlock


class MyExtractor:
    name = "my-extractor"

    def extract(self, file_path, pages=None):
        return ExtractorOutput(
            source=self.name,
            blocks=[
                PsyduckBlock(
                    id="my_b1",
                    page=1,
                    type="paragraph",
                    text="Extracted content",
                    source=self.name,
                    confidence=0.9,
                )
            ]
        )


client = Psyduck(extractor_registry={"my-extractor": MyExtractor})
result = client.process("document.pdf", extractors=["my-extractor"])
```

An extractor should return `ExtractorOutput(source=..., blocks=..., tables=..., warnings=...)`.
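
The shape of an extractor in the example above (a `name` attribute plus an `extract(file_path, pages=None)` method) can be written down as a structural type. This is only a sketch: `ExtractorLike` is a hypothetical name, Psyduck is not known to export such a protocol, and the parameter types are assumptions.

```python
from typing import Any, Optional, Protocol, Sequence, runtime_checkable


@runtime_checkable
class ExtractorLike(Protocol):
    """Hypothetical structural type mirroring the MyExtractor example."""

    name: str

    def extract(
        self, file_path: str, pages: Optional[Sequence[int]] = None
    ) -> Any: ...


class StubExtractor:
    name = "stub"

    def extract(self, file_path, pages=None):
        # A real extractor would return an ExtractorOutput here.
        return None


# isinstance() only checks that the members exist, not their signatures.
is_valid = isinstance(StubExtractor(), ExtractorLike)
```

A check like this can catch a missing `name` or `extract` before registration, but it does not verify what `extract` returns.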

## Design Rules

- Psyduck is SDK-only. Do not add CLI behavior unless the product direction changes.
- Keep the base install small: `pydantic` and `pymupdf`.
- Do not reintroduce built-in Docling, Camelot, pdfplumber, or OCR adapters without a design review.
- Optional extraction engines belong behind custom extractor registration.
- Keep page numbers as original PDF page numbers.
- Quality warnings should be conservative and machine-readable.

## Development Commands

```bash
.venv/bin/python -m ruff format --check .
.venv/bin/python -m ruff check .
.venv/bin/python -m pytest -q
.venv/bin/python -m build --sdist --wheel --outdir dist
.venv/bin/python -m twine check dist/*
```

## Release Checklist

Before publishing:

1. Update `pyproject.toml` version.
2. Update `CHANGELOG.md`.
3. Update `README.md`, `docs/API.md`, `docs/EXTENDING.md`, `llms.txt`, and `skills/psyduck/SKILL.md` if API behavior changed.
4. Run format, lint, tests, build, and `twine check`.
5. Tag the release, for example `v0.1.0`.
6. Upload `dist/*` to PyPI with `twine upload`.
