Metadata-Version: 2.4
Name: poveglia
Version: 1.0.0
Summary: Quarantine your imports — configurable content classification pipeline
Author-email: Christophe Pettus <christophe.pettus@pgexperts.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/Xof/poveglia
Project-URL: Repository, https://github.com/Xof/poveglia
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: aiohttp>=3.9
Provides-Extra: vision
Requires-Dist: pillow>=10.0; extra == "vision"
Requires-Dist: onnxruntime>=1.17; extra == "vision"
Provides-Extra: clamav
Requires-Dist: pyclamd>=0.4; extra == "clamav"
Provides-Extra: storage
Requires-Dist: boto3>=1.34; extra == "storage"
Provides-Extra: all
Requires-Dist: poveglia[clamav,storage,vision]; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.23; extra == "dev"
Requires-Dist: pytest-cov>=5.0; extra == "dev"
Requires-Dist: ruff>=0.6; extra == "dev"
Requires-Dist: mypy>=1.11; extra == "dev"
Requires-Dist: pip-audit>=2.7; extra == "dev"
Dynamic: license-file

# Poveglia

*Quarantine your imports.*

Poveglia is a Python library that provides a configurable pipeline of content classifiers for scanning uploaded files. It covers virus scanning, explicit content detection, CSAM reporting, zip bomb detection, AI-generated image detection, and more, all through a single async API.

> For how the system is built internally, see [ARCHITECTURE.md](ARCHITECTURE.md); for the design rationale and tradeoffs, see [THEORY.md](THEORY.md).

## Quick Start

```bash
pip install poveglia
```

```python
import asyncio
from poveglia import classify, Status

result = asyncio.run(classify({
    "url": "s3://my-bucket/uploads/photo.jpg",
    "classifiers": ["virus", "explicit", "csam", "policy"],
    "classifier_config": {
        "explicit": {"api_callable": my_vision_api, "threshold": 0.7},
        "csam": {"api_callable": my_csam_api, "callback": my_csam_reporter},
        "policy": {"max_size_bytes": 50_000_000, "forbidden_mimetypes": ["video/*"]},
    },
    "metadata": {"user_id": "u_123", "upload_id": "up_456"},
}))

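if result.status in (Status.FORBID, Status.MANDATORY_ACTION):
    reject_upload(result)
elif result.status == Status.REVIEW:
    queue_for_human_review(result)
elif not result.is_clean:
    # status is ALLOW but a classifier raised; inspect result.errors
    handle_classifier_errors(result)  # placeholder, like the helpers above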
```

Or use the sync wrapper:

```python
from poveglia import classify_sync

result = classify_sync({...})
```

## How It Works

Poveglia runs classifiers **in series**, in the order you specify. Each classifier returns one of four statuses:

| Status | Meaning | Pipeline behavior |
|---|---|---|
| `allow` | Content passes | Continue to next classifier |
| `review` | Uncertain — flag for human review | Continue to next classifier |
| `forbid` | Content fails | **Stop pipeline** |
| `mandatory_action` | Content fails, action required | Execute callback, then **stop pipeline** |

The result includes a top-level status (the most severe status returned by any classifier that ran), per-classifier details, any actions taken, and your metadata, passed through untouched.

### Scoring Mode

To run every classifier even after one returns `forbid` or `mandatory_action` (for ranking rather than gating), enable `scoring_mode`:

```python
result = await classify({
    ...
    "scoring_mode": True,
})
# result.status is still the worst, but nothing was short-circuited
```
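
The gating and scoring behaviors described above can be sketched without the library. This is a minimal illustration, not Poveglia's implementation: the `Status` stand-in, its numeric ordering, `run_pipeline`, and `fixed` are all hypothetical, and the `mandatory_action` callback step is left out.

```python
import asyncio
from enum import IntEnum


class Status(IntEnum):
    # Stand-in for poveglia.Status. The numeric ordering is an assumption,
    # used here only so that max() picks the most severe status.
    ALLOW = 0
    REVIEW = 1
    FORBID = 2
    MANDATORY_ACTION = 3


async def run_pipeline(classifiers, scoring_mode=False):
    """Run (name, async_fn) pairs in series; return (worst, per_classifier)."""
    results = {}
    for name, fn in classifiers:
        status = await fn()
        results[name] = status
        # forbid and mandatory_action stop the run unless scoring_mode is on.
        if status >= Status.FORBID and not scoring_mode:
            break
    worst = max(results.values(), default=Status.ALLOW)
    return worst, results


def fixed(status):
    """Build a hypothetical classifier that always returns `status`."""
    async def classifier():
        return status
    return classifier


steps = [
    ("virus", fixed(Status.ALLOW)),
    ("explicit", fixed(Status.FORBID)),
    ("policy", fixed(Status.REVIEW)),
]

gated = asyncio.run(run_pipeline(steps))
scored = asyncio.run(run_pipeline(steps, scoring_mode=True))
```

In the gated run, `policy` never executes because `explicit` returned `forbid`. In the scored run all three execute, and the top-level status is still `FORBID`.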

## Bundled Classifiers

### Detection

| Name | What it detects | Optional deps |
|---|---|---|
| `virus` | Malware via ClamAV | `poveglia[clamav]` |
| `zip_bomb` | Zip bombs (compression ratio, nesting depth) | none |
| `explicit` | Nudity, gore, violence, suggestive content | `poveglia[vision]` |
| `csam` | CSAM — returns `mandatory_action` on high-confidence hits | `poveglia[vision]` |
| `generated` | AI-generated imagery | `poveglia[vision]` |
| `identifiable` | Identifiable people (faces) | `poveglia[vision]` |
| `policy` | File size, MIME type (extension-based) | none |

### Actions

These run in the pipeline like any classifier, but are also available as standalone API calls:

| Name | What it does | Standalone API |
|---|---|---|
| `reporting` | Submits reports when classifier scores exceed thresholds | `poveglia.reporting.submit()` |
| `legal_hold` | Places objects on legal hold in storage | `poveglia.legal_hold.apply()` |
| `metadata` | Writes classification metadata to object store | `poveglia.metadata.upload()` |

## The Input Control Structure

```python
{
    # Required
    "url": "s3://bucket/uploads/file.jpg",
    "classifiers": ["virus", "zip_bomb", "explicit", "csam",
                     "identifiable", "policy", "reporting", "metadata"],

    # Per-classifier configuration
    "classifier_config": {
        "explicit": {
            "api_callable": my_vision_api,  # async callable
            "threshold": 0.7,               # forbid above this
            "review_threshold": 0.4,        # review above this
        },
        "csam": {
            "api_callable": my_csam_api,
            "callback": my_csam_handler,    # fires on mandatory_action
            "threshold": 0.8,
        },
        "reporting": {
            "triggers": {"csam": 0.8, "explicit": 0.95},
            "handler": my_report_handler,
        },
        "policy": {
            "max_size_bytes": 52428800,
            "forbidden_mimetypes": ["video/*"],
            "allowed_mimetypes": ["image/*"],
        },
        "metadata": {
            "backend": my_metadata_writer,
        },
    },

    # Skip downloading — use a local copy instead
    "local_path": "/tmp/staged/file.jpg",

    # Cap bytes pulled from a remote URL (DoS guard); omit or None for no cap.
    # Exceeding it raises ContentTooLargeError, recorded in result.errors.
    "max_download_bytes": 52428800,

    # Run all classifiers, never short-circuit
    "scoring_mode": False,

    # Where transformation classifiers write output. Exposed to classifiers as
    # content.output_url; a transforming classifier writes there and returns it
    # as ClassifierResult.transformed_url (surfaced on result.transformed_url).
    "output_url": "s3://bucket/transformed/file.jpg",

    # Passed through untouched to the result
    "metadata": {"user_id": "u_123", "upload_id": "up_456"},
}
```

The `classifiers` list controls both **which** classifiers run and **in what order**. Order matters — classifiers can share results through the blackboard (see below).
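
As one illustration of how the `policy` settings above could be evaluated, the sketch below guesses a MIME type from the file extension and matches it against wildcard patterns. It is not Poveglia's code: `check_policy` is a hypothetical helper, and two behaviors are assumptions made for the example, namely that forbidden patterns take precedence over allowed ones and that unknown extensions map to `application/octet-stream`.

```python
import mimetypes
from fnmatch import fnmatchcase


def check_policy(filename, size_bytes, config):
    """Return "allow" or "forbid" for a file under a policy-style config."""
    max_size = config.get("max_size_bytes")
    if max_size is not None and size_bytes > max_size:
        return "forbid"

    # Extension-based guess; guess_type() returns None for unknown extensions.
    mime, _encoding = mimetypes.guess_type(filename)
    mime = mime or "application/octet-stream"  # assumed default

    forbidden = config.get("forbidden_mimetypes", [])
    if any(fnmatchcase(mime, pattern) for pattern in forbidden):
        return "forbid"

    allowed = config.get("allowed_mimetypes")
    if allowed and not any(fnmatchcase(mime, pattern) for pattern in allowed):
        return "forbid"
    return "allow"


policy_config = {
    "max_size_bytes": 52428800,
    "forbidden_mimetypes": ["video/*"],
    "allowed_mimetypes": ["image/*"],
}
```

Because the guess relies on the extension alone, a renamed file passes this kind of check; content-based detection would need a separate classifier.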

## The Result Object

```python
result.status             # Status.ALLOW / REVIEW / FORBID / MANDATORY_ACTION
result.is_clean           # True only if status == ALLOW AND errors is empty
result.classifiers        # {"virus": ClassifierResult(...), "explicit": ClassifierResult(...)}
result.actions_taken      # [ActionRecord(classifier="reporting", action="callback", result={...})]
result.errors             # [ErrorRecord(classifier="generated", error="ServiceUnavailable", ...)]
result.transformed_url    # "s3://..." if a transformation classifier produced output
result.metadata           # {"user_id": "u_123"} — your passthrough data
```

> **Important:** `result.status` alone is *not* a "safe to ship" signal. Classifier exceptions are recorded in `result.errors` and do **not** raise the aggregate status — a run where every classifier raised yields `Status.ALLOW` with populated `errors`. Use `result.is_clean` as the binary pass/fail predicate, or check `result.errors` explicitly alongside `result.status`.
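
One way to turn the warning above into a gate is to treat a populated `errors` list as an incomplete verdict. The sketch below uses a stand-in `Status` enum and `SimpleNamespace` objects in place of real results so that it runs without the library; `decide` and its four return values are hypothetical.

```python
from enum import Enum
from types import SimpleNamespace


class Status(Enum):
    # Stand-in for poveglia.Status so the sketch runs on its own.
    ALLOW = "allow"
    REVIEW = "review"
    FORBID = "forbid"
    MANDATORY_ACTION = "mandatory_action"


def decide(result):
    """Map a result to a hypothetical decision: accept, review, retry, reject."""
    if result.status in (Status.FORBID, Status.MANDATORY_ACTION):
        return "reject"
    if result.errors:
        # At least one classifier raised, so the verdict is incomplete.
        return "retry"
    if result.status is Status.REVIEW:
        return "review"
    return "accept"


# A run where every classifier raised: ALLOW status, but errors is populated.
all_failed = SimpleNamespace(status=Status.ALLOW, errors=["ServiceUnavailable"])
clean = SimpleNamespace(status=Status.ALLOW, errors=[])
```

Here `decide(all_failed)` yields `"retry"` rather than `"accept"`, which is the case a bare status check would miss.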

## Content Access

Poveglia accesses files through a lazy content resolver. Some classifiers need only the URL (to pass to external APIs); others need the raw bytes or a local file path.

**The resolver downloads only when needed**, and caches the result — so if three classifiers call `.bytes()`, the file is downloaded once.

To avoid the download entirely, provide a `local_path` in the control structure pointing to a locally-staged copy.
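
The download-once behavior can be pictured as an async cache guarded by a lock. This is a sketch of the general technique under stated assumptions, not `ContentResolver` itself: `LazyContent`, `fetch`, and `fake_download` are hypothetical names.

```python
import asyncio


class LazyContent:
    """Illustrative lazy resolver: fetches at most once, then serves a cache."""

    def __init__(self, fetch):
        self._fetch = fetch  # hypothetical async callable returning bytes
        self._data = None
        self._lock = asyncio.Lock()

    async def bytes(self):
        # The lock makes concurrent callers wait for the first fetch
        # instead of each starting their own download.
        async with self._lock:
            if self._data is None:
                self._data = await self._fetch()
            return self._data


async def demo():
    calls = 0

    async def fake_download():
        nonlocal calls
        calls += 1
        await asyncio.sleep(0)  # yield control, as a real download would
        return b"payload"

    content = LazyContent(fake_download)
    results = await asyncio.gather(
        content.bytes(), content.bytes(), content.bytes()
    )
    return calls, results
```

Running `asyncio.run(demo())` shows three callers receiving the same bytes from a single call to `fake_download`.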

### Memory footprint

`ContentResolver.bytes()` holds the full content in memory for the resolver's lifetime. For small uploads (images, documents) this is fine and avoids redundant I/O. For **large files** (video, archives, disk images) prefer `local_path()` in your classifier — it materializes a temp file once and hands out paths instead of keeping bytes resident. Classifiers that shell out to external binaries (ClamAV, ffmpeg, etc.) should always use `local_path()` regardless of size.

## The Blackboard

Classifiers can share intermediate results through a shared context dict, avoiding redundant API calls.

For example, if `explicit` calls a vision API that also returns face detection data, `identifiable` can reuse it instead of making a second call:

```python
# explicit classifier writes to the blackboard:
context["explicit.faces"] = [{"confidence": 0.85}, ...]

# identifiable classifier checks the blackboard first:
faces = context.get("explicit.faces")
if faces is not None:
    ...  # reuse the shared data; no API call needed
```

Keys follow the convention `<classifier_name>.<key>`. Classifiers must always work standalone if the blackboard is empty — the optimization is never a hard dependency.
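
A runnable version of that pattern, including the standalone fallback, might look like the sketch below. The plain-function signatures and the shape of the vision API response (`explicit_score`, `faces`) are assumptions made for the example, and `make_fake_api` is a hypothetical test double.

```python
import asyncio


async def explicit(context, vision_api):
    response = await vision_api()
    # Publish the face data under the "<classifier_name>.<key>" convention.
    context["explicit.faces"] = response["faces"]
    return response["explicit_score"]


async def identifiable(context, vision_api):
    faces = context.get("explicit.faces")
    if faces is None:
        # Blackboard miss: work standalone by making our own API call.
        faces = (await vision_api())["faces"]
    return len(faces) > 0


def make_fake_api():
    """Return a fake vision API plus a list that records each call."""
    calls = []

    async def vision_api():
        calls.append(1)
        return {"explicit_score": 0.1, "faces": [{"confidence": 0.85}]}

    return vision_api, calls


async def demo():
    api, calls = make_fake_api()
    context = {}
    await explicit(context, api)
    found = await identifiable(context, api)
    return found, len(calls)
```

When both run in order, the API is called once. When `identifiable` runs against an empty context, it makes its own call and still returns a result.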

## Writing Custom Classifiers

```python
from poveglia import Classifier, ClassifierResult, Status

class MyClassifier(Classifier):
    name = "my_check"

    async def classify(self, content, config, context):
        data = await content.bytes()

        if looks_bad(data):
            return ClassifierResult(
                status=Status.FORBID,
                detail={"reason": "failed my_check"},
            )

        return ClassifierResult(
            status=Status.ALLOW,
            detail={"clean": True},
        )
```

Register it as an entry point in your package's `pyproject.toml`:

```toml
[project.entry-points."poveglia.classifiers"]
my_check = "my_package.classifiers:MyClassifier"
```

Then reference it by name: `"classifiers": ["virus", "my_check", "policy"]`.

## CSAM Handling

The CSAM classifier returns `mandatory_action` on high-confidence hits. This means:

1. The pipeline short-circuits (no further classifiers run)
2. The callback you provided in `classifier_config.csam.callback` fires automatically
3. The callback result is recorded in `result.actions_taken`

If no callback is configured, the classifier falls back to `forbid` — the content is still rejected, but no automatic reporting occurs. A warning is emitted on the `poveglia.classifiers.csam` logger whenever this fallback fires; route that logger at `WARNING` or above to your alerting channel.

For deployments where missing the callback is a compliance violation (not merely a dev-mode inconvenience), set `require_callback: True` in the csam config. With that flag on, a high-confidence detection without a callback raises — the misconfiguration lands in `result.errors` instead of silently rejecting the content.

Poveglia ships a reporting utility (`poveglia.reporting.submit()`) and a legal hold utility (`poveglia.legal_hold.apply()`) that you can wire up as callbacks. **You are responsible for configuring and using these** — Poveglia provides the tools, not the compliance.
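
Routing the fallback warning can be done with the standard `logging` module. In the sketch below, `AlertHandler` is a hypothetical handler that collects messages in a list; in production its `emit()` would forward to your alerting channel. The warning text is invented for the demo and is not the message Poveglia emits.

```python
import logging


class AlertHandler(logging.Handler):
    """Hypothetical handler; replace emit() with a call to your pager."""

    def __init__(self):
        # Only records at WARNING or above reach emit().
        super().__init__(level=logging.WARNING)
        self.alerts = []

    def emit(self, record):
        self.alerts.append(self.format(record))


csam_logger = logging.getLogger("poveglia.classifiers.csam")
alert_handler = AlertHandler()
csam_logger.addHandler(alert_handler)

# Simulate the fallback warning to check the wiring.
csam_logger.warning("no callback configured; falling back to forbid")
csam_logger.debug("this record is below WARNING and is not collected")
```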

## Error Handling

If a classifier raises an exception, the pipeline **catches it and continues**. The error is recorded in `result.errors`, but it doesn't stop other classifiers from running and doesn't affect the top-level status.

A failed mandatory callback (e.g., a CSAM report that couldn't be submitted) is recorded in `result.actions_taken` with error detail — surface this loudly so you can retry.

**Principle:** fail open in the pipeline, fail loud in the results.

The one exception is *configuration* errors. An unknown classifier name in `classifiers` is **not** caught — `classify()` / `classify_sync()` raises `KeyError` before any classifier runs (and before any download), so a typo'd name fails fast rather than silently producing an incomplete result. This is deliberate: a missing classifier is a programming error, not a content verdict.
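
The two behaviors above, failing open on classifier exceptions and failing fast on unknown names, can be sketched together. This is an illustration only: `run_fail_open`, the `registry` dict, and the error-record shape are hypothetical and simpler than the real `ErrorRecord`.

```python
import asyncio


async def run_fail_open(names, registry):
    """Run classifiers by name, recording exceptions instead of raising."""
    # Configuration errors fail fast: an unknown name raises KeyError here,
    # before any classifier has run.
    classifiers = [(name, registry[name]) for name in names]

    statuses, errors = {}, []
    for name, fn in classifiers:
        try:
            statuses[name] = await fn()
        except Exception as exc:
            # Fail open: record the failure and continue with the next one.
            errors.append({"classifier": name, "error": type(exc).__name__})
    return statuses, errors


async def ok():
    return "allow"


async def broken():
    raise ConnectionError("vision API unreachable")


registry = {"virus": ok, "generated": broken, "policy": ok}
statuses, errors = asyncio.run(
    run_fail_open(["virus", "generated", "policy"], registry)
)
```

The failing `generated` step leaves no entry in `statuses` but does appear in `errors`, while `policy` still runs after it. Passing a misspelled name raises `KeyError` before anything executes.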

## Installation

```bash
# Core + all classifiers (light deps only)
pip install poveglia

# With vision classifier dependencies
pip install poveglia[vision]

# With ClamAV support
pip install poveglia[clamav]

# With object storage support (metadata, legal_hold)
pip install poveglia[storage]

# Everything
pip install poveglia[all]
```

## Requirements

- Python 3.11+
- A running ClamAV daemon (for the `virus` classifier)
- Vision/CSAM API credentials (for `explicit`, `csam`, `generated`, `identifiable`)

## Development

```bash
# Editable install with the dev toolchain
pip install -e '.[dev]'

# Run the test suite (the "integration" marker is reserved for real-service
# tests; none exist yet, so this currently runs everything)
pytest -m "not integration"

# Lint and type-check — the same gates CI enforces
ruff check poveglia tests
mypy poveglia
```

CI runs lint, type-check, and tests on Python 3.11, 3.12, and 3.13 for every push and pull request; a `pip-audit` dependency scan runs report-only.

## Releasing

Releases publish to PyPI via GitHub Actions **OIDC trusted publishing** — no API token is stored anywhere. Publishing a GitHub Release triggers [`.github/workflows/publish.yml`](.github/workflows/publish.yml), which builds the sdist + wheel and uploads them with attestations.

One-time setup (PyPI side): add a Trusted Publisher for project `poveglia` → owner `Xof`, repo `poveglia`, workflow `publish.yml`, environment `pypi`.

## License

MIT
