Metadata-Version: 2.4
Name: pdfcanon
Version: 1.0.1
Summary: Official Python SDK for the PDFCanon API
Author-email: "Napzoom Inc." <admin@napzoom.com>
License: MIT
Project-URL: Homepage, https://pdfcanon.com
Project-URL: Documentation, https://docs.pdfcanon.com
Project-URL: Repository, https://github.com/PDFCanon
Project-URL: Bug Tracker, https://github.com/PDFCanon/issues
Keywords: pdf,normalize,pdfcanon,hash,sanitize
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-asyncio; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"

# PDFCanon Python SDK

Official Python SDK for the [PDFCanon](https://pdfcanon.com) API: normalize, sanitize, and validate PDF files at scale. It has no third-party dependencies and requires Python 3.9 or later.

## Requirements

- Python 3.9 or later
- No external dependencies (uses only the standard library)

## Installation

```bash
pip install pdfcanon
```

## Authentication

Obtain an API key from the [PDFCanon portal](https://pdfcanon.com/portal/api-keys). Pass it to the client as `api_key`, or set the `PDFCANON_API_KEY` environment variable:

```bash
export PDFCANON_API_KEY="pdfn_your_key_here"
```

The SDK sends the key as `X-Api-Key: pdfn_…` in every request. If you call the REST API directly, use the same header:

```bash
curl -H "X-Api-Key: pdfn_your_key_here" https://api.pdfcanon.com/api/submissions
```
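
If you prefer to call the REST API from Python without the SDK, the same header can be attached with the standard library alone. The following is a minimal sketch that only builds the request and does not send it. The helper name `build_request` is hypothetical, and the response format of the `submissions` endpoint is not covered here.

```python
import os
import urllib.request
from typing import Optional

API_BASE = "https://api.pdfcanon.com/api"  # base URL taken from the curl example


def build_request(path: str, api_key: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request carrying the X-Api-Key header (hypothetical helper)."""
    # Fall back to the environment variable when no key is passed explicitly.
    key = api_key or os.environ.get("PDFCANON_API_KEY")
    if not key:
        raise ValueError("No API key: pass api_key or set PDFCANON_API_KEY")
    return urllib.request.Request(
        f"{API_BASE}/{path.lstrip('/')}",
        headers={"X-Api-Key": key},
        method="GET",
    )


req = build_request("submissions", api_key="pdfn_your_key_here")
# Sending it would look like: urllib.request.urlopen(req, timeout=10)
```

Raising early when no key is available keeps a missing environment variable from surfacing later as a 401.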

## Quickstart (Synchronous)

```python
import pdfcanon

# Pass the key explicitly, or omit api_key to read PDFCANON_API_KEY from the environment
client = pdfcanon.Client(api_key="pdfn_your_key_here")

with open("/path/to/document.pdf", "rb") as f:
    response = client.normalize(f, file_name="document.pdf")

print(f"Status: {response.status}")
print(f"Submission ID: {response.submission_id}")
```

## Quickstart (Async)

```python
import asyncio
import pdfcanon

async def main():
    client = pdfcanon.AsyncClient(api_key="pdfn_your_key_here")

    # Built-in open() is not an async context manager, so use a plain "with"
    with open("/path/to/document.pdf", "rb") as f:
        response = await client.normalize(f, file_name="document.pdf")

    print(f"Status: {response.status}")
    print(f"Submission ID: {response.submission_id}")

asyncio.run(main())
```

## Async / Poll Flow

Large PDFs are processed asynchronously. Use `wait_for_completion` to poll until done:

```python
import pdfcanon

client = pdfcanon.Client()  # reads PDFCANON_API_KEY from environment

with open("/path/to/document.pdf", "rb") as f:
    initial = client.normalize(f, file_name="document.pdf")

# Poll until processing completes (up to 120 seconds)
result = client.wait_for_completion(initial.submission_id, timeout=120.0)

if result.status == "SUCCESS":
    # Download the normalized PDF
    pdf_bytes = client.download_artifact(result.normalized.sha256)
    with open("/path/to/normalized.pdf", "wb") as out:
        out.write(pdf_bytes)

    print(f"Original size:   {result.original.size_bytes:,} bytes")
    print(f"Normalized size: {result.normalized.size_bytes:,} bytes")
    print(f"JavaScript removed: {result.security.javascript_removed}")
else:
    print(f"Failed: [{result.failure.code}] {result.failure.message}")
```
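
To make the polling pattern concrete, here is a rough sketch of a poll-with-timeout loop. It is not the SDK's implementation of `wait_for_completion`. The function `poll_until_done`, its parameters, and the doubling delay are illustrative assumptions. The only statuses treated as "still running" are `PENDING` and `IN_PROGRESS`, which this README names.

```python
import time
from typing import Callable

IN_FLIGHT = {"PENDING", "IN_PROGRESS"}  # non-terminal statuses named in this README


def poll_until_done(
    fetch_status: Callable[[], str],
    timeout: float = 120.0,
    interval: float = 1.0,
    max_interval: float = 10.0,
) -> str:
    """Call fetch_status() until it returns a terminal status or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status not in IN_FLIGHT:
            return status
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(f"still {status} after {timeout} seconds")
        # Never sleep past the deadline, and wait longer between later polls.
        time.sleep(min(interval, remaining))
        interval = min(interval * 2, max_interval)


# Usage with a stand-in status source instead of a real API call
statuses = iter(["PENDING", "IN_PROGRESS", "SUCCESS"])
final = poll_until_done(lambda: next(statuses), timeout=5.0, interval=0.01)
```

Using `time.monotonic()` for the deadline keeps the timeout correct even if the system clock is adjusted while waiting.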

## Webhook Flow

For production use, register a webhook endpoint instead of polling:

```python
client = pdfcanon.Client()

with open("/path/to/document.pdf", "rb") as f:
    response = client.normalize(
        f,
        file_name="document.pdf",
        webhook_url="https://your-app.example.com/webhooks/pdfcanon",
        remove_annotations=True,
        idempotency_key="unique-key-per-document",
    )
# Returns a response with status PENDING or IN_PROGRESS;
# webhook fires when processing completes.
print(f"Queued with submission ID: {response.submission_id}")
```

## Webhook Signature Verification

Verify the signature of every incoming webhook before trusting its payload:

```python
# Flask example
from flask import Flask, request, abort
from pdfcanon.webhooks import verify_signature, InvalidSignatureError
import os

app = Flask(__name__)
WEBHOOK_SECRET = os.environ["PDFCANON_WEBHOOK_SECRET"]

@app.post("/webhooks/pdfcanon")
def handle_pdfcanon_webhook():
    raw_body = request.get_data(as_text=True)
    signature = request.headers.get("X-PDFCanon-Signature", "")

    try:
        verify_signature(raw_body, signature, WEBHOOK_SECRET)
    except InvalidSignatureError:
        abort(401, "Invalid webhook signature")

    event = request.get_json(force=True)
    event_type = event.get("event_type")

    if event_type == "pdf.normalized":
        sha256 = event["normalized_sha256"]
        print(f"PDF ready: {sha256}")
        # Download and store the normalized PDF...
    elif event_type == "pdf.failed":
        print(f"PDF failed: {event['failure']['message']}")

    return {"ok": True}
```
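
For readers curious about what a signature check of this kind usually involves, the sketch below shows a generic HMAC verification using only the standard library. It assumes the signature is a hex-encoded HMAC-SHA256 of the raw request body. That scheme is an assumption and is not confirmed for PDFCanon, so keep using `verify_signature` from the SDK in real code. The names `sign` and `is_valid` and the example secret are hypothetical.

```python
import hashlib
import hmac


def sign(body: bytes, secret: str) -> str:
    """Hex HMAC-SHA256 of the body (ASSUMED scheme, not confirmed for PDFCanon)."""
    return hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()


def is_valid(body: bytes, signature: str, secret: str) -> bool:
    """Compare in constant time so timing does not leak the expected value."""
    expected = sign(body, secret).encode("utf-8")
    return hmac.compare_digest(expected, signature.encode("utf-8"))


body = b'{"event_type": "pdf.normalized"}'
good = sign(body, "example_webhook_secret")  # placeholder secret for illustration
```

The check runs over the exact bytes received, because re-serializing parsed JSON can change whitespace or key order and produce a different digest.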

## Configuration

```python
import pdfcanon

client = pdfcanon.Client(
    api_key="pdfn_your_key_here",
    base_url="https://api.pdfcanon.com/api",  # Default
    connect_timeout=5.0,                      # Seconds; default: 5
    read_timeout=120.0,                       # Seconds; default: 120
    max_retries=3,                            # Default: 3
)
```

## Error Handling

```python
import pdfcanon
from pdfcanon import (
    AuthenticationError,
    PolicyRejectionError,
    RateLimitError,
    ToolchainError,
    NetworkError,
    PDFCanonError,
)

client = pdfcanon.Client()  # reads PDFCANON_API_KEY from environment

try:
    with open("/path/to/document.pdf", "rb") as f:
        result = client.normalize(f)
except AuthenticationError:
    print("Invalid API key or expired token")
except PolicyRejectionError as e:
    # 422: the PDF violates intake policy (encrypted, too large, etc.)
    print(f"PDF rejected: {e}")
except RateLimitError as e:
    # 429: monthly quota or rate limit exceeded
    print(f"Rate limited. Retry after {e.retry_after} seconds")
except ToolchainError as e:
    # 5xx: server-side processing failure
    print(f"Server error: {e}")
except NetworkError as e:
    # Timeout, DNS failure, etc.
    print(f"Network error: {e}")
except PDFCanonError as e:
    # Base class: catches any other SDK error
    print(f"Unexpected SDK error: {e}")
```
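
One common way to act on a rate-limit error is to wait for the suggested delay and try again. The sketch below illustrates that pattern with a local stand-in exception so it runs on its own. `RetryableError` and `call_with_retry` are hypothetical names. In real code you would catch `RateLimitError` instead, assuming its `retry_after` is a number of seconds as the example above prints it. Since the client also accepts `max_retries`, the SDK may already retry some failures itself.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Stand-in for an error that carries a retry_after hint (hypothetical)."""

    def __init__(self, retry_after: float):
        super().__init__(f"retry after {retry_after}s")
        self.retry_after = retry_after


def call_with_retry(fn: Callable[[], T], max_attempts: int = 3, max_wait: float = 60.0) -> T:
    """Run fn(), sleeping for the error's retry_after between failed attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RetryableError as exc:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Cap the wait so a very large hint cannot stall the caller.
            time.sleep(min(exc.retry_after, max_wait))
    raise ValueError("max_attempts must be at least 1")


# Usage: a function that fails twice, then succeeds
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RetryableError(retry_after=0.01)
    return "ok"


outcome = call_with_retry(flaky)
```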

## Error Reference

| Exception | HTTP Status | When |
|---|---|---|
| `AuthenticationError` | 401 | Invalid or missing API key |
| `PolicyRejectionError` | 422 | PDF rejected by intake policy |
| `RateLimitError` | 429 | Monthly quota or rate limit exceeded |
| `ToolchainError` | 5xx | Server-side processing failure |
| `NetworkError` | — | Network / timeout error |
| `PDFCanonError` | — | Base class for all SDK errors |

## Models Reference

| Model | Key Fields |
|---|---|
| `NormalizeResponse` | `status`, `submission_id`, `original`, `normalized`, `security`, `validation`, `warnings`, `failure` |
| `OriginalInfo` | `sha256`, `size_bytes` |
| `NormalizedInfo` | `sha256`, `size_bytes`, `pdf_version`, `linearized`, `download_url` |
| `SecurityInfo` | `javascript_removed`, `open_actions_removed`, `embedded_files_removed`, ... |
| `ValidationInfo` | `xref_rebuilt`, `object_streams_regenerated`, `pdfa_compliant`, ... |
| `WarningInfo` | `code`, `message` |
| `FailureInfo` | `code`, `message`, `stage` |
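
The `sha256` and `size_bytes` fields make it possible to check a downloaded artifact before storing it. The sketch below assumes that `sha256` is the hex SHA-256 digest of the file's bytes and that `size_bytes` is the file's length. Both are plausible readings of the field names but are not confirmed here. The helper `matches_digest` is hypothetical.

```python
import hashlib


def matches_digest(data: bytes, expected_sha256: str, expected_size: int) -> bool:
    """Check downloaded bytes against the reported size and SHA-256 digest."""
    # The cheap length check runs first; hashing only happens if it passes.
    if len(data) != expected_size:
        return False
    return hashlib.sha256(data).hexdigest() == expected_sha256.lower()


# Stand-ins for pdf_bytes, result.normalized.sha256 and result.normalized.size_bytes
sample = b"%PDF-1.7\n%%EOF\n"
reported_sha256 = hashlib.sha256(sample).hexdigest()
reported_size = len(sample)

intact = matches_digest(sample, reported_sha256, reported_size)
```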

## Further Reading

- [Full API Reference](https://docs.pdfcanon.com/api)
- [OpenAPI Spec](../../openapi/openapi.json)
- [Postman Collection](../../docs/PDFCanon.postman_collection.json)
- [SDK Automation Plan](../../docs/sdk-automation.md)
