Metadata-Version: 2.4
Name: symagedocs
Version: 1.0.4
Summary: Python SDK for the SymageDocs synthetic data API
Project-URL: Homepage, https://symagedocs.ai
Project-URL: Documentation, https://symagedocs.ai/docs/api/
Project-URL: Repository, https://github.com/GeiselSoftware/paperlives
Project-URL: Changelog, https://symagedocs.ai/docs/api/changelog.html
Author-email: Geisel Software <support@symagedocs.ai>
License-Expression: MIT
License-File: LICENSE
Keywords: compliance,document-generation,ml-training,synthetic-data
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Typing :: Typed
Requires-Python: >=3.9
Requires-Dist: httpx>=0.25
Requires-Dist: pydantic>=2.0
Provides-Extra: dev
Requires-Dist: mypy>=1.5; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
Requires-Dist: pytest-httpx>=0.30; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff>=0.1; extra == 'dev'
Provides-Extra: progress
Requires-Dist: tqdm>=4.60; extra == 'progress'
Description-Content-Type: text/markdown

# SymageDocs Python SDK

Generate synthetic documents, identities, and tabular datasets for testing, ML training, and compliance.

## Installation

```bash
pip install symagedocs
```

For progress bars during long jobs:

```bash
pip install "symagedocs[progress]"
```

## Quick Start

```python
from symagedocs import Client

client = Client(api_key="sk_live_...")

# List available forms
forms = client.forms.list()
for f in forms:
    print(f"{f.id}: {f.name} ({f.credit_cost} credits)")

# Generate 100 W-2 documents
# JSON ground truth and CSV are always included in the bundle — no need to request them.
job = client.generate.create(
    "irs_w2_2025",
    quantity=100,
    output_formats=["pdf_typed"],
    # Augmentation knobs. `degradation_profile` affects credit cost —
    # `scanned`/`faxed` add 20%, `photographed` 30%, `mixed` 25% (`clean` = no surcharge).
    # `coherence_mode` controls cross-form identity correlation in multi-form jobs.
    degradation_profile="scanned",
    coherence_mode="coherent",
)
result = client.generate.wait(job.job_id)  # polls until complete
client.generate.download(job.job_id, "bundle", "./w2_documents.zip")

# Per-item training data
job = client.generate.create(
    form_id="irs_w2_2025",
    quantity=10,
    output_formats=["pdf_typed", "bio"],
    idempotency_key="my-retry-safe-key",
)
client.generate.wait(job.job_id)
for example in client.generate.iter_training_examples(job.job_id, format="bio"):
    print(example.item_id, len(example.bio.tokens))

# Generate tabular data from a description
schema = client.tabular.parse("name, age, SSN, city, state, annual income")
tab_job = client.tabular.generate(columns=schema.columns, quantity=5000)
client.tabular.wait(tab_job.job_id)
client.tabular.download(tab_job.job_id, "csv", "./dataset.csv")

# Check credit balance
balance = client.account.balance()
print(f"Credits used: {balance.credits_used}")
```

## Authentication

Get your API key at [symagedocs.ai/account?tab=api](https://symagedocs.ai/account?tab=api).

```python
# Pass directly
client = Client(api_key="sk_live_...")

# Or set environment variable
# export SYMAGEDOCS_API_KEY=sk_live_...
client = Client()  # reads from env
```
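
If you rely on the environment variable, it can help to fail fast with a clear message before constructing the client. The sketch below is not part of the SDK: `resolve_api_key` is a hypothetical helper, and only the variable name `SYMAGEDOCS_API_KEY` comes from this README.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "SYMAGEDOCS_API_KEY"  # variable name documented above


def resolve_api_key(explicit: Optional[str] = None,
                    env: Mapping[str, str] = os.environ) -> str:
    """Hypothetical helper: prefer an explicit key, else fall back to the environment."""
    key = (explicit or env.get(ENV_VAR, "")).strip()
    if not key:
        raise RuntimeError(f"No API key: pass api_key=... or set {ENV_VAR}")
    return key
```

You could then write `client = Client(api_key=resolve_api_key())`, so a missing key surfaces at startup instead of on the first request.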

## Async Support

```python
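import asyncio
from symagedocs import AsyncClient

async def main() -> None:
    async with AsyncClient(api_key="sk_live_...") as client:
        forms = await client.forms.list()
        job = await client.generate.create("irs_w2_2025", quantity=10)
        result = await client.generate.wait(job.job_id)

# `async with` must run inside a coroutine, so drive it with asyncio.run().
asyncio.run(main())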
```

## Configuration

```python
client = Client(
    api_key="sk_live_...",
    base_url="https://symagedocs.ai",  # custom server
    timeout=30.0,                       # request timeout (seconds)
    max_retries=3,                      # retry on 429/5xx
)
```
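
`max_retries` applies to `429` and `5xx` responses. This README does not document how long the SDK waits between attempts, so the following is only an illustrative sketch of one common policy (capped exponential backoff). `is_retryable` and `backoff_delays` are hypothetical names, and the delay values are assumptions, not the SDK's actual schedule.

```python
from typing import List


def is_retryable(status_code: int) -> bool:
    """True for the statuses the SDK is documented to retry: 429 and 5xx."""
    return status_code == 429 or 500 <= status_code <= 599


def backoff_delays(max_retries: int = 3, base: float = 0.5, cap: float = 8.0) -> List[float]:
    """Assumed policy: the delay doubles per attempt and is capped at `cap` seconds."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(max_retries)]
```

Under these assumed defaults, three retries would wait 0.5, 1.0, and 2.0 seconds. A sketch like this can help when choosing a `timeout` and `max_retries` combination that fits your own latency budget.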

## Method Reference

### Forms

| Method                      | Description                                           |
| --------------------------- | ----------------------------------------------------- |
| `forms.list(category=None)` | List available forms, optionally filtered by category |
| `forms.get(form_id)`        | Get detailed form info including field definitions    |

### Generation

| Method                                                                                                              | Description                                                                                                                                  |
| ------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- |
| `generate.create(form_id=None, *, form_ids=None, quantity=1, output_formats=["pdf_typed"], config=None, seed=None, webhook_url=None, ink_color=None, ink_color_distribution=None, writer_consistency=None, degradation_profile=None, coherence_mode=None, idempotency_key=None)` | Create an async generation job. Pass either `form_id` (single form) or `form_ids` (coherent multi-form generation across the same identity). `ink_color_distribution` (when set) must be a per-color weight map summing to exactly `100`. `degradation_profile` and `coherence_mode` are typed kwargs over what used to live inside `config={...}` — see the [augmentation knobs](#augmentation-knobs) section. `idempotency_key` attaches an `Idempotency-Key` header so retries within 24 hours return the original `job_id` and don't double-charge. The deprecated `realism_level` API field is intentionally not exposed; call the REST API directly if you need it. |
| `generate.list_jobs(limit=50, cursor=None, status=None)`                                                            | List generation jobs (cursor-paginated)                                                                                                      |
| `generate.get_job(job_id)`                                                                                          | Get full job status and progress                                                                                                             |
| `generate.list_downloads(job_id)`                                                                                   | List per-artifact presigned download URLs for a completed job                                                                                |
| `generate.download(job_id, format, path)`                                                                           | Download job output to a local file. Allowed for terminal-but-not-completed jobs (CANCELED / FAILED / EXPIRED) so partial output is recoverable. |
| `generate.wait(job_id, poll_interval=3.0)`                                                                          | Poll until job completes or fails                                                                                                            |
| `generate.cancel(job_id)`                                                                                           | Cancel a running job. Idempotent. Items rendered before the cancel was observed remain downloadable via `download(format="bundle")`. |
| `generate.list_items(job_id, limit=50, cursor=None)`                                                                | List per-item records for a job. Cursor-paginated; each item carries its presigned download URLs. |
| `generate.download_item(job_id, item_id)`                                                                           | Presigned S3 URLs for one item's files.                                                                              |
| `generate.get_bio_labels(job_id, item_id)`                                                                          | Client-side helper: fetches the item's `_bio.json` sidecar and returns a parsed `BioDataset`.                        |
| `generate.get_word_annotations(job_id, item_id)`                                                                    | Client-side helper: fetches the item's `_words.json` sidecar and returns parsed `WordAnnotations`.                   |
| `generate.iter_training_examples(job_id, format="bio")`                                                             | Client-side helper: iterates all items, yielding training examples in the chosen format (`"bio"` (default), `"funsd"`, `"donut"`). |
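
`list_jobs` and `list_items` are cursor-paginated. A generic loop for draining such an endpoint might look like the sketch below. The response attribute names `items` and `next_cursor` are assumptions (this README does not show the page shape), and `iter_pages` is a hypothetical helper, so check the API reference before relying on it.

```python
from typing import Any, Callable, Iterator, Optional


def iter_pages(fetch: Callable[..., Any], *, limit: int = 50,
               items_attr: str = "items",
               cursor_attr: str = "next_cursor") -> Iterator[Any]:
    """Yield every record from a cursor-paginated call (attribute names are assumed)."""
    cursor: Optional[str] = None
    while True:
        page = fetch(limit=limit, cursor=cursor)
        yield from getattr(page, items_attr)
        cursor = getattr(page, cursor_attr, None)
        if not cursor:  # no further cursor means the last page was reached
            break
```

With a real client this could be called as `iter_pages(lambda **kw: client.generate.list_items(job_id, **kw))`, assuming the page object exposes those two attributes.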

> **`client.generation` alias.** `client.generation` and `client.generate` reference the same resource — use whichever name you prefer.
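
Because `ink_color_distribution` must sum to exactly `100`, a local check before calling `generate.create` can save a round-trip. The sketch below is a hypothetical helper, not an SDK function. The non-empty and non-negative checks are extra assumptions, and the color names in the usage note are placeholders rather than a list of supported values.

```python
from typing import Mapping


def check_ink_distribution(weights: Mapping[str, float]) -> None:
    """Raise ValueError unless the weights are non-negative and sum to exactly 100."""
    if not weights:
        raise ValueError("ink_color_distribution must not be empty")
    negative = [color for color, w in weights.items() if w < 0]
    if negative:
        raise ValueError(f"negative weights for: {negative}")
    total = sum(weights.values())
    if total != 100:
        raise ValueError(f"weights sum to {total}, expected exactly 100")
```

For instance, `check_ink_distribution({"blue": 60, "black": 40})` passes silently, while `{"blue": 60, "black": 30}` raises. Integer weights avoid floating-point rounding surprises in the exact-sum comparison.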

### Identities

| Method                                                    | Description                               |
| --------------------------------------------------------- | ----------------------------------------- |
| `identities.generate(quantity=1, config=None, seed=None)` | Generate raw synthetic identities as JSON |

### Tabular

| Method                                                                       | Description                                               |
| ---------------------------------------------------------------------------- | --------------------------------------------------------- |
| `tabular.parse(prompt)`                                                      | Convert natural language to a column schema (LLM-powered) |
| `tabular.generate(columns, quantity=100, output_formats=["csv"], seed=None)` | Create a tabular generation job                           |
| `tabular.status(job_id)`                                                     | Get tabular job progress and ETA                          |
| `tabular.download(job_id, format, path)`                                     | Download tabular output to a local file                   |
| `tabular.wait(job_id, poll_interval=2.0)`                                    | Poll until tabular job completes or fails                 |
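
The `wait` helpers take a `poll_interval`, but this README does not list a timeout parameter. If you need an upper bound, one option is to poll the status call yourself. In this sketch `wait_with_deadline` is a hypothetical helper, and the status strings are assumptions modelled on the terminal states named in the Generation table. Adjust them to whatever `tabular.status` actually returns.

```python
import time
from typing import Callable, FrozenSet

# Assumed terminal status names; verify against the API reference.
TERMINAL: FrozenSet[str] = frozenset({"COMPLETED", "FAILED", "CANCELED", "EXPIRED"})


def wait_with_deadline(get_status: Callable[[], str], *, poll_interval: float = 2.0,
                       timeout: float = 600.0,
                       sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll `get_status` until it returns a terminal value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status().upper()
        if status in TERMINAL:
            return status
        if time.monotonic() + poll_interval > deadline:
            raise TimeoutError(f"job still {status} after {timeout}s")
        sleep(poll_interval)
```

A call might look like `wait_with_deadline(lambda: client.tabular.status(job_id).status, timeout=300)`, where the `.status` attribute on the returned object is also an assumption.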

### Account

| Method                   | Description                                              |
| ------------------------ | -------------------------------------------------------- |
| `account.balance()`      | Get credit balance (`credits_used`, `credits_allocated`) |
| `account.usage(days=30)` | Get usage summary for the specified period               |

### Pricing

The pricing endpoints are public/unauthenticated on the backend, but the SDK still requires an API key at construction time for consistency; the auth header is sent and ignored by these routes.

| Method                                                                                              | Description                                                                                          |
| --------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------- |
| `pricing.rates()`                                                                                   | Get the current credit rate constants (CSV per-row rate, PDF base + surcharge bands, multipliers, …) |
| `pricing.estimate(*, field_count, output_formats, record_count, degradation_profile=None)`         | Estimate the credit cost of a hypothetical job before submitting it                                  |

### Health

| Method            | Description                                                                                          |
| ----------------- | ---------------------------------------------------------------------------------------------------- |
| `client.health()` | Lightweight reachability probe (`GET /api/v1/health`). Returns the parsed JSON body. Works on both `Client` and `AsyncClient`. |

## Augmentation knobs

Two of the most-used keys in the freeform `config={...}` dict on
`generate.create` are also exposed as typed kwargs:

- `degradation_profile: Literal["clean", "scanned", "faxed", "photographed", "mixed"] | None`
- `coherence_mode: Literal["coherent", "shuffled", "random"] | None`

Why bother? Two reasons:

1. **`degradation_profile` affects credit cost.** Non-`clean`
   profiles need extra rendering work (rasterization, noise, paper
   warp), so the billing engine applies a multiplier: `scanned`/`faxed`
   are billed at 1.2×, `mixed` at 1.25×, and `photographed` at 1.3×.
   A typo in the freeform `config={...}` form (for example, a
   misspelled key) silently falls back to the default 1.0×
   multiplier: you don't get the degradation you asked for, and
   nothing flags the mistake until you notice that the artifacts are
   missing (or don't). The typed kwarg form catches typos at
   type-check time.
2. **Pre-flight validation.** The Literal types fence off unknown
   values at edit time in any IDE that supports type checking. The
   backend also rejects unknown values with `400` for both knobs, so
   even untyped callers get a fast failure — but the typed form
   catches the mistake before the network round-trip.
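
As a worked example of the surcharge arithmetic, the sketch below applies the multipliers from item 1 to a base cost. Treating the base as `credit_cost × quantity` is an assumption, as is the rounding, so `estimated_credits` is only a hypothetical back-of-the-envelope helper. Use `pricing.estimate` for the authoritative figure.

```python
# Multipliers as stated above; "clean" carries no surcharge.
DEGRADATION_MULTIPLIER = {
    "clean": 1.0,
    "scanned": 1.2,
    "faxed": 1.2,
    "mixed": 1.25,
    "photographed": 1.3,
}


def estimated_credits(credit_cost: float, quantity: int, profile: str = "clean") -> float:
    """Base cost times the profile multiplier; unknown profiles raise KeyError."""
    return round(credit_cost * quantity * DEGRADATION_MULTIPLIER[profile], 2)
```

For a form with a hypothetical cost of 2 credits, 100 `scanned` documents come to 2 × 100 × 1.2 = 240 credits, against 200 for `clean`.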

The SDK exports the canonical value tuples too:

```python
from symagedocs import DEGRADATION_PROFILES, COHERENCE_MODES

assert "scanned" in DEGRADATION_PROFILES
assert "coherent" in COHERENCE_MODES
```

If you pass a value via both forms (e.g. `config={"degradation_profile":
"X"}` AND `degradation_profile="Y"`), the value in `config` wins and a
`RuntimeWarning` is emitted so the conflict isn't silent.

```python
# Typed kwarg form — recommended.
job = client.generate.create(
    "irs_w2_2025",
    quantity=100,
    degradation_profile="scanned",   # billed at 1.2× — see above
    coherence_mode="coherent",
)

# Equivalent freeform form — still supported, but typos cost money.
job = client.generate.create(
    "irs_w2_2025",
    quantity=100,
    config={"degradation_profile": "scanned", "coherence_mode": "coherent"},
)
```
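
The precedence rule for conflicting values described above can be pictured with a small stand-in. This is not the SDK's source: `merge_knob` is a hypothetical function that mimics the documented behaviour, where the `config` value wins and a `RuntimeWarning` is emitted.

```python
import warnings
from typing import Any, Dict, Optional


def merge_knob(config: Optional[Dict[str, Any]], name: str,
               kwarg: Optional[str]) -> Dict[str, Any]:
    """Return a config dict with `name` resolved; on conflict the config value wins."""
    merged = dict(config or {})
    if kwarg is None:
        return merged
    if name in merged and merged[name] != kwarg:
        warnings.warn(
            f"{name}={kwarg!r} ignored; config[{name!r}]={merged[name]!r} takes precedence",
            RuntimeWarning,
            stacklevel=2,
        )
        return merged
    merged[name] = kwarg
    return merged
```

If you want such conflicts to fail loudly in tests, Python's standard `warnings.simplefilter("error", RuntimeWarning)` turns the warning into an exception.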

## Error Handling

The SDK raises typed exceptions for API errors and retries automatically on `429` and `5xx`:

```python
from symagedocs import Client, AuthenticationError, RateLimitError, NotFoundError

client = Client()  # reads SYMAGEDOCS_API_KEY from the environment

try:
    forms = client.forms.list()
except AuthenticationError:
    print("Invalid API key")
except RateLimitError:
    print("Still rate limited after the SDK's automatic retries")
except NotFoundError:
    print("Resource not found")
```

**All error classes:**

| Exception                  | HTTP Code | Description                                                     |
| -------------------------- | --------- | --------------------------------------------------------------- |
| `SymageDocsError`          | —         | Base exception for all SDK errors                               |
| `AuthenticationError`      | 401       | Invalid or revoked API key                                      |
| `PermissionDeniedError`    | 403       | Key missing required scope                                      |
| `NotFoundError`            | 404       | Resource not found                                              |
| `ValidationError`          | 400       | Invalid request parameters                                      |
| `InsufficientCreditsError` | 402       | Not enough credits for the operation                            |
| `ConflictError`            | 409       | Resource in unexpected state (e.g., downloading incomplete job) |
| `RateLimitError`           | 429       | Rate limit exceeded (SDK retries automatically)                 |
| `ServerError`              | 5xx       | Server-side error (SDK retries automatically)                   |

## Examples

See `examples/` in the downloaded SDK for complete working scripts:

- `list_forms.py` — Browse available forms and credit costs
- `generate_w2s.py` — Full pipeline: create job, wait, download PDF + JSON
- `tabular_dataset.py` — Parse NL description, generate 5k rows, download CSV
- `train_kie_model.py` — Create job with NIST3 labels, iterate training examples with BIO labels and spatial annotations

## Documentation

- [API User Manual](/docs/api-user-manual) — long-form guide with worked examples
- [API Explorer](/api/v1/docs) — interactive Swagger UI
- [API Reference](/api/v1/redoc) — three-panel ReDoc reference

## License

MIT
