# Pareta SDK — full documentation

> Pareta is a marketplace + control plane for open-weights models. The `pareta` Python SDK lets you deploy task-specific open-weights endpoints (Pareta picks the GPU), run metered OpenAI-compatible inference, browse a per-task benchmark catalog, and evaluate models on your own data — then deploy the winner. Install with `pip install pareta`; authenticate with a `pareta_sk_` key from the dashboard or `PARETA_API_KEY`.

This file concatenates the entire Pareta SDK documentation (guide + examples + reference) for single-read agent consumption. Source: sdk/docs/ in the repo; browsable at https://docs.pareta.ai.



---

<!-- guide/installation.md -->

# Installation & authentication

The `pareta` package is the Python client for [Pareta](https://pareta.ai). It deploys open-weights endpoints, runs metered OpenAI-compatible inference, browses the benchmark catalog, and evaluates models on your own data — all from code. This page gets you installed, authenticated, and making a first call.

A few platform truths to know up front, because they shape the whole API:

- **GPUs are hidden.** You never pass a hardware knob. `endpoints.deploy()` takes a task and a model; Pareta resolves the serving class.
- **Models are per-task aliases.** Open-weights ids are masked to public aliases. Real ids never cross the SDK boundary.
- **Inference and evals are metered against your org balance.** A successful call debits credit; an empty balance raises `InsufficientCreditsError`. Top-up is browser-only — the SDK never touches billing.
- **Inference is OpenAI-compatible.** A deployed endpoint speaks the OpenAI chat-completions wire format, so you can use this SDK or the stock `openai` client interchangeably.

## Install

`pareta` requires Python 3.10+ and depends only on `httpx`. Install it with whichever tool you already use:

```bash
pip install pareta
```

```bash
uv add pareta
```

```bash
poetry add pareta
```

The package ships type hints (`py.typed`), so editors and `mypy` get full autocomplete on every method and response model.

## Authenticate

Every request is authenticated with a `pareta_sk_` secret key sent as a Bearer token. You mint keys in the [dashboard](https://pareta.ai) — key management is browser-only, and the SDK only ever *consumes* a key. It never creates, lists, or revokes them.

### Recommended: `from_env()`

The cleanest path is to put your key in the environment and let the client read it. `from_env()` reads `PARETA_API_KEY` and the optional `PARETA_BASE_URL`:

```bash
export PARETA_API_KEY="pareta_sk_..."
```

```python
from pareta import Pareta

pa = Pareta.from_env()                       # reads PARETA_API_KEY (+ PARETA_BASE_URL)

# List the deployed endpoints your org can call.
for model in pa.models.list():
    print(model.id, model.owned_by)
```

Keeping the key out of source is the point — `from_env()` means your code carries no secret.
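
If you want a misconfigured environment to fail at startup rather than on the first request, a small preflight check can help. The sketch below is not part of the SDK: `check_pareta_env` is a hypothetical helper, and it relies only on the `PARETA_API_KEY` variable name and the `pareta_sk_` prefix described above.

```python
import os


def check_pareta_env(env=None):
    """Return the key from the environment, or raise with a readable message.

    Hypothetical helper for your own app startup; not an SDK function.
    """
    env = os.environ if env is None else env
    key = env.get("PARETA_API_KEY", "").strip()
    if not key:
        raise RuntimeError("PARETA_API_KEY is not set; mint a key in the dashboard.")
    if not key.startswith("pareta_sk_"):
        raise RuntimeError("PARETA_API_KEY does not look like a pareta_sk_ key.")
    return key
```

Call it once before `Pareta.from_env()`; passing a plain dict as `env` makes the check easy to unit-test.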

### Explicit key

You can also pass the key directly. The constructor is keyword-only:

```python
from pareta import Pareta

pa = Pareta(api_key="pareta_sk_...")
```

If `api_key` is falsy and `PARETA_API_KEY` is unset, the client raises `ParetaError` at construction time with a message pointing you to mint a key:

```python
import pareta

try:
    pa = pareta.Pareta(api_key=None)         # and PARETA_API_KEY unset
except pareta.ParetaError as e:
    print(e)  # missing API key. Pass api_key=… or set PARETA_API_KEY (mint a pareta_sk_ key in the dashboard).
```

## Constructor options

```python
Pareta(
    api_key: str | None = None,              # pareta_sk_ key; falls back to nothing (from_env reads the env)
    base_url: str | None = None,             # defaults to "https://api.pareta.ai"
    timeout=None,                            # defaults to httpx.Timeout(60.0, connect=10.0)
    max_retries: int = 2,                    # retries on 408/409/429/500/502/503/504
    http_client: httpx.Client | None = None, # bring your own httpx.Client
)
```

- **`base_url`** defaults to the production API, `https://api.pareta.ai`, and is normalized (trailing slash stripped). Override it only to point at a non-prod environment; set `PARETA_BASE_URL` to do the same via `from_env()`.
- **`max_retries`** (default 2) retries idempotent failures and rate limits with exponential backoff that honors a `Retry-After` header. See [Errors & retries](errors-and-retries.md).
- **`http_client`** lets you supply a pre-configured `httpx.Client` (custom proxies, connection limits, transport). When you pass one, the SDK does not own it and `close()` will not shut it down.
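
To make the retry behavior concrete, here is a minimal sketch of exponential backoff that honors `Retry-After`. It is an illustration only: the SDK's real schedule is not documented on this page, so `backoff_delay` and its `base`, `cap`, and `jitter` values are assumptions, and it handles only the seconds form of `Retry-After`.

```python
import random


def backoff_delay(attempt, retry_after=None, base=0.5, cap=8.0, jitter=0.0):
    """Seconds to sleep before retry number `attempt` (0-based).

    Assumed values: `base`, `cap`, and `jitter` are illustrative, not the SDK's.
    """
    if retry_after is not None:
        # A server-provided Retry-After (in seconds) takes precedence.
        return max(0.0, float(retry_after))
    delay = min(cap, base * (2 ** attempt))
    return delay + random.uniform(0.0, jitter)


# With max_retries=2 there are at most two sleeps: attempts 0 and 1.
print([backoff_delay(a) for a in range(2)])   # [0.5, 1.0]
```

Under these assumed values, the default `max_retries=2` adds at most 1.5 seconds of waiting unless the server asks for longer.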

## Manage the connection

The client holds a pooled HTTP connection. Use it as a context manager so the pool is released cleanly:

```python
from pareta import Pareta

with Pareta.from_env() as pa:
    resp = pa.chat.completions.create(
        model="ep_invoice_extract",          # an endpoint id from pa.models.list() or endpoints.deploy()
        messages=[{"role": "user", "content": "Extract the total from this invoice: ..."}],
    )
    print(resp.choices[0].message.content)
```

Outside a `with` block, call `pa.close()` when you are done. (`close()` is a no-op when you supplied your own `http_client`.)

## Async client

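`AsyncPareta` mirrors `Pareta`, apart from `tasks.recommended()` and `tasks.leaderboard()`, which are sync-only for now. It has the same constructor, the same `from_env()`, and the same resource namespaces, with awaitable methods and `aclose()` / `async with`: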

```python
import asyncio
from pareta import AsyncPareta

async def main():
    async with AsyncPareta.from_env() as pa:
        resp = await pa.chat.completions.create(
            model="ep_invoice_extract",
            messages=[{"role": "user", "content": "Summarize this contract clause: ..."}],
        )
        print(resp.choices[0].message.content)

asyncio.run(main())
```

## Your first metered call

Inference debits your org balance on success. If the balance is empty, the call raises `InsufficientCreditsError` (402) — top up in the dashboard, which is the only place billing lives:

```python
from pareta import Pareta, InsufficientCreditsError

pa = Pareta.from_env()

try:
    resp = pa.chat.completions.create(
        model="ep_invoice_extract",
        messages=[{"role": "user", "content": "What is the invoice number?"}],
        temperature=0,                       # extra OpenAI params pass straight through
    )
    print(resp.choices[0].message.content)
    print(resp.usage.total_tokens, "tokens")
except InsufficientCreditsError:
    print("Org out of credit — top up in the dashboard.")
```

The `model` is an endpoint id — anything from `pa.models.list()` or returned by `endpoints.deploy()`. See [Inference](./inference.md) for streaming and the full chat-completions surface.

## Zero-install alternative for inference

You do not need this SDK to *call* a deployed endpoint. Because inference is OpenAI-compatible, you can point the stock `openai` client at Pareta's `base_url` and use the same `pareta_sk_` key:

```python
from openai import OpenAI

client = OpenAI(api_key="pareta_sk_...", base_url="https://api.pareta.ai/v1")

resp = client.chat.completions.create(
    model="ep_invoice_extract",
    messages=[{"role": "user", "content": "What is the invoice number?"}],
)
print(resp.choices[0].message.content)
```

This is handy for inference-only workloads or dropping Pareta into an existing OpenAI codebase. The `pareta` SDK's distinct value is the **control plane** that the OpenAI client cannot reach: deploying and operating endpoints, browsing the benchmark catalog, and running evals against your own data.

## Next steps

- [Inference](./inference.md) — chat completions, streaming, and metering.
- [Deploying endpoints](deploying-endpoints.md) — `endpoints.deploy()`, lifecycle, and metrics (no GPU knob).
- [Tasks & the catalog](discovery.md) — discover benchmark tasks and the `recommended` model alias.
- [Evals](evaluation.md) — build eval sets and run open vs. frontier comparisons.
- [Errors & retries](errors-and-retries.md) — the typed exception hierarchy and retry policy.



---

<!-- guide/quickstart.md -->

# Quickstart

Deploy the recommended open-weights model for a task and run inference against
it, end to end, in about a dozen lines. Pareta picks the GPU and serving config
for you, so there is no hardware to choose. Inference is OpenAI-compatible and
metered against your org's balance.

## Install

```bash
pip install pareta        # or: uv add pareta / poetry add pareta
```

## Authenticate

Mint a `pareta_sk_` key in the dashboard (key management is browser-only) and
export it. `Pareta.from_env()` reads `PARETA_API_KEY` (and an optional
`PARETA_BASE_URL`).

```bash
export PARETA_API_KEY="pareta_sk_..."
```

The SDK only ever consumes a key. It never creates, lists, or revokes them, and
it never exposes your balance or payment methods. Topping up credit is
browser-only.

## Deploy and run inference

This is the whole loop: name a task, let Pareta pick the recommended model,
deploy it, and send a request. The `wait=True` flag blocks through the deploy
SSE stream and hands you back a live `Endpoint`.

```python
from pareta import Pareta

pa = Pareta.from_env()                                  # reads PARETA_API_KEY

task = "contract-key-fields"                             # a subtask id from the catalog

# Inspect the model deploy(model="recommended") will pick (a per-task alias).
print("recommended:", pa.tasks.recommended(task))       # e.g. "qwen-vl-2"

# Deploy it. No GPU, quantization, or parallelism knob — Pareta resolves all of it.
ep = pa.endpoints.deploy(task=task, model="recommended", wait=True)
print("live endpoint:", ep.id, ep.status)               # e.g. "ep_a1b2c3" "live"

# Run OpenAI-compatible inference against the endpoint id.
resp = pa.chat.completions.create(
    model=ep.id,                                         # the endpoint id, not the alias
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)
print(resp.choices[0].message.content)
print("tokens:", resp.usage.total_tokens)
```

Example output (the generated text and token count will vary):

```
recommended: qwen-vl-2
live endpoint: ep_a1b2c3 live
Hello, it is good to meet you.
tokens: 27
```

A few things worth pinning down:

- **`task`** is a subtask id (for example `"contract-key-fields"`). Discover
  ids with `pa.tasks.list()`, or turn a sentence into a task with
  `pa.tasks.match("extract fields from contracts")`. See
  [Tasks](discovery.md).
- **`model="recommended"`** (the default) resolves server-side to the task's
  curated pick, falling back to the top open model on the leaderboard. You can
  also pass a specific per-task alias. Real model ids never reach the client.
- **`ep.id`** is what you pass to `chat.completions.create(model=...)`. That is
  the deployed endpoint id, distinct from `ep.model`, which is the per-task
  public alias the endpoint serves.
- **No hardware knob.** `deploy()` takes only `task`, `model`, and an optional
  `name`. Pareta selects the GPU and serving class from its registry.

## Stream the response

Pass `stream=True` to get an iterator of `ChatCompletionChunk`. The incremental
text lives on `chunk.choices[0].delta.content` (it can be `None` on the first
and last chunks, so guard it).

```python
for chunk in pa.chat.completions.create(
    model=ep.id,
    messages=[{"role": "user", "content": "Write a haiku about invoices."}],
    stream=True,
):
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()
```
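
If you need the full text after streaming rather than printing as you go, you can accumulate the deltas. The sketch below is a hypothetical helper, not an SDK function; it relies only on the `chunk.choices[0].delta.content` shape described above and uses stand-in objects so it runs without a live endpoint. Skipping chunks with no `choices` is a defensive assumption.

```python
from types import SimpleNamespace


def collect_text(chunks):
    """Join the text deltas of a stream into one string, skipping empty chunks."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:          # defensive assumption: tolerate empty chunks
            continue
        parts.append(chunk.choices[0].delta.content or "")
    return "".join(parts)


def _chunk(text):
    """Stand-in with the same attribute shape as a ChatCompletionChunk."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


fake_stream = [_chunk(None), _chunk("Hel"), _chunk("lo"), _chunk(None)]
print(collect_text(fake_stream))   # Hello
```

With a real client you would pass the iterator returned by `pa.chat.completions.create(..., stream=True)` in place of `fake_stream`.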

Extra OpenAI parameters (`temperature`, `max_tokens`, `top_p`, and so on) pass
straight through as keyword arguments.

## Cost and credit

Every successful completion debits your org's balance. If the balance is empty,
the call raises `InsufficientCreditsError` (HTTP 402). Top-up is browser-only.

```python
from pareta import InsufficientCreditsError

try:
    resp = pa.chat.completions.create(model=ep.id, messages=[
        {"role": "user", "content": "ping"},
    ])
except InsufficientCreditsError:
    print("Out of credit — top up in the dashboard.")
```

Evaluation runs are metered the same way (open plus frontier compute). An
`EvalRun` reports its billed total on `run.cost`, a `Decimal` in dollars floored
to whole cents (so a sub-cent run reads `Decimal("0.00")`); the raw value is on
`run.cost_micro_usd`. See [Evals](evaluation.md).

## Clean up

Stop the endpoint when you are done so it stops accruing cost, and close the
client (or use it as a context manager).

```python
pa.endpoints.stop(ep.id)        # later: pa.endpoints.start(ep.id) / pa.endpoints.delete(ep.id)
pa.close()
```

```python
# Context-manager form closes the HTTP client for you.
with Pareta.from_env() as pa:
    resp = pa.chat.completions.create(model=ep.id, messages=[
        {"role": "user", "content": "hi"},
    ])
```

## List what you can call

`models.list()` returns the OpenAI-compatible subset: deployed endpoints with a
live URL. Each `id` is usable directly in `chat.completions.create(model=...)`.

```python
for m in pa.models.list():
    print(m.id, m.owned_by)
```

## Async

`AsyncPareta` mirrors the sync client; resource methods are `async def` and
streams are async iterators.

```python
import asyncio
from pareta import AsyncPareta

async def main():
    async with AsyncPareta.from_env() as pa:
        ep = await pa.endpoints.deploy(
            task="contract-key-fields", model="recommended", wait=True,
        )
        resp = await pa.chat.completions.create(
            model=ep.id,
            messages=[{"role": "user", "content": "Say hello."}],
        )
        print(resp.choices[0].message.content)

asyncio.run(main())
```

One async difference to note: `tasks.recommended()` and `tasks.leaderboard()`
are sync-only for now.
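
One way to reach a sync-only method from async code is to run it in a worker thread with `asyncio.to_thread`. This is a sketch under assumptions: `recommended_async` is a hypothetical helper, it expects you to keep a separate sync `Pareta` client next to the async one, and it assumes calling that client from a worker thread is acceptable for your workload. A stand-in client keeps the example runnable offline.

```python
import asyncio
from types import SimpleNamespace


async def recommended_async(sync_client, task):
    """Run the sync-only tasks.recommended() in a worker thread."""
    return await asyncio.to_thread(sync_client.tasks.recommended, task)


# Stand-in for a sync Pareta client, so the sketch runs without network access.
fake_client = SimpleNamespace(
    tasks=SimpleNamespace(recommended=lambda task: f"alias-for-{task}")
)
print(asyncio.run(recommended_async(fake_client, "contract-key-fields")))
```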

## Already using the OpenAI SDK?

You do not need this SDK just to call a deployed endpoint. Point the `openai`
client at Pareta's `/v1` base URL and authenticate with your `pareta_sk_` key:

```python
from openai import OpenAI

client = OpenAI(api_key="pareta_sk_...", base_url="https://api.pareta.ai/v1")
resp = client.chat.completions.create(
    model="ep_a1b2c3",
    messages=[{"role": "user", "content": "hi"}],
)
```

This SDK's unique value is the control plane: deploy, operate, and eval models
from code.

## Next steps

- [Tasks](discovery.md) — browse the benchmark catalog, match intent to a task,
  and read leaderboards.
- [Endpoints](deploying-endpoints.md) — deploy, operate, and read endpoint metrics.
- [Evals](evaluation.md) — score candidate models on your own data before you
  deploy.
- [Errors](errors-and-retries.md) — the `ParetaError` hierarchy and retry behavior.



---

<!-- guide/core-concepts.md -->

# Core concepts

Pareta deploys open-weights models as endpoints, lets you evaluate them on your
own data, and serves OpenAI-compatible inference. This page covers the handful
of ideas the rest of the SDK assumes you understand: **tasks** (the benchmark
catalog), **open vs frontier** models, **per-task aliases**, why **hardware is
hidden**, how **metering** works, and the **discovery funnel** that ties them
together (match a task, read its leaderboard, eval candidates on your data,
deploy the winner).

The code blocks below share state: each assumes the client created here, and
some reuse an `ep` or `run` returned by a deploy or eval call. They all start
from a client:

```python
from pareta import Pareta

pa = Pareta.from_env()   # reads PARETA_API_KEY (+ optional PARETA_BASE_URL)
```

`from_env()` is the path you want in almost every case. The explicit form is
`Pareta(api_key="pareta_sk_...", base_url="https://api.pareta.ai")`; arguments
are keyword-only. See [Authentication](installation.md) for key minting
(browser-only) and [The client](../reference/client.md) for timeouts, retries, and the
async `AsyncPareta` mirror.

## Tasks: the benchmark catalog

A **task** is a concrete, benchmarked job: "extract the key fields from a
contract," "classify a support ticket," "moderate a comment." Pareta has
measured open and frontier models against each task on real data, so a task is
the unit you pick a model *for*, evaluate *against*, and deploy *into*.

Every task has a stable `id` (e.g. `"contract-key-fields"`), a
`default_scorer` (the function that grades a model's output for that task), and
a `has_blob_input` flag (true when the task takes documents or images, not just
text).

```python
for task in pa.tasks.list():
    print(task.id, task.default_scorer, "blob" if task.has_blob_input else "text")

# Fetch one task, optionally with sample rows to see its input shape
t = pa.tasks.retrieve("contract-key-fields", examples_n=3)
print(t.id, t.default_scorer, t.has_blob_input)
```

If you do not know the task id, describe the job in plain English and let the
matcher rank candidates:

```python
m = pa.tasks.match("pull totals and dates out of vendor invoices", top_k=5)
if m.matched and m.chosen:
    print("best:", m.chosen.task_id, m.chosen.score, m.chosen.confidence)
else:
    for c in m.candidates:          # ranked alternates to choose from
        print(c.task_id, c.score, c.confidence)
print("ambiguous?", m.ambiguous, "via", m.matcher)
```

`match()` raises `ValueError` on an empty query. The matcher is a deterministic
keyword scorer today; `m.matcher` tells you which strategy answered.
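
The branching above can be folded into a small helper that always yields a task id when one is available. This is a sketch, not SDK code: `pick_task_id` and its `min_score` threshold are hypothetical, the task ids and scores in the stand-in data are made up, and only the `matched`, `chosen`, `candidates`, `task_id`, and `score` fields come from the text above.

```python
from types import SimpleNamespace


def pick_task_id(m, min_score=0.0):
    """Return the chosen task id, else the top-ranked candidate, else None.

    `min_score` is a hypothetical threshold, not an SDK parameter.
    """
    if m.matched and m.chosen:
        return m.chosen.task_id
    for c in m.candidates:             # candidates arrive ranked, best first
        if c.score >= min_score:
            return c.task_id
    return None


# Made-up match result with no confident pick, only ranked alternates.
ambiguous = SimpleNamespace(
    matched=False,
    chosen=None,
    candidates=[
        SimpleNamespace(task_id="invoice-fields", score=0.61),
        SimpleNamespace(task_id="contract-key-fields", score=0.58),
    ],
)
print(pick_task_id(ambiguous))   # invoice-fields
```

Falling back to the top candidate silently is a design choice; when `m.ambiguous` is true you may prefer to show the candidates to a person instead.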

## Open vs frontier models

Pareta ranks two kinds of model against every task:

- **Open** models are open-weights models Pareta can deploy and serve for you.
  These are the models you deploy and call.
- **Frontier** models are hosted vendor models (OpenAI, Anthropic, and so on).
  You do not deploy these. They exist as the **baseline** you measure against.
  The whole point of Pareta is showing that a cheaper open model matches or
  beats the frontier on *your* task.

A task's leaderboard ranks the open models by quality and cost and carries a
single `frontier` entry as the savings baseline. The `recommended` field is the
deployable model the platform would pick for you.

```python
lb = pa.tasks.leaderboard("contract-key-fields")
print("recommended:", lb.recommended, "metric:", lb.metric, "unit:", lb.cost_unit)

for e in lb.models:                 # ranked open candidates
    print(e.name, e.kind, e.quality, e.cost_per_request_micro_usd, f"{e.context_k}k ctx")

if lb.frontier:                     # the vendor baseline to beat
    print("baseline:", lb.frontier.name, lb.frontier.quality,
          lb.frontier.cost_per_request_micro_usd)

# Convenience: just the deployable pick (what deploy(model="recommended") resolves to)
print(pa.tasks.recommended("contract-key-fields"))
```
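
To turn the leaderboard into a savings figure, compare each open model's per-request cost with the frontier baseline. The helper below is a sketch, not an SDK function: `savings_vs_frontier` is a hypothetical name, the numbers are made up, and only the `models`, `frontier`, `name`, and `cost_per_request_micro_usd` fields come from the example above.

```python
from types import SimpleNamespace


def savings_vs_frontier(lb):
    """Map each open model name to its fractional cost saving against the frontier."""
    if lb.frontier is None or not lb.frontier.cost_per_request_micro_usd:
        return {}                      # no baseline to compare against
    base = lb.frontier.cost_per_request_micro_usd
    return {
        e.name: 1 - e.cost_per_request_micro_usd / base
        for e in lb.models
        if e.cost_per_request_micro_usd is not None
    }


# Made-up numbers with the same attribute names as a leaderboard response.
fake_lb = SimpleNamespace(
    models=[SimpleNamespace(name="open-a", cost_per_request_micro_usd=250)],
    frontier=SimpleNamespace(name="vendor-x", cost_per_request_micro_usd=1000),
)
print(savings_vs_frontier(fake_lb))   # {'open-a': 0.75}
```

In real code you would pass the object returned by `pa.tasks.leaderboard(task)`. Cost is only half the comparison, so read the result next to each entry's `quality`.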

To enumerate the frontier roster you can evaluate against, annotated for a
given task, use `evals.frontier_models`:

```python
for fm in pa.evals.frontier_models(task="contract-key-fields"):
    print(fm.id, fm.vendor, "vision" if fm.vision else "text",
          "(on leaderboard)" if fm.benchmarked else "")
```

Passing `task=` annotates each model's `benchmarked` flag and filters the
roster by capability (for example, only vision-capable models are returned for
document tasks). Feed the `id` values into an eval run's `frontier=` list.

## Per-task aliases: real model ids stay hidden

Open-weights model identities are never exposed. Across the entire SDK surface
(leaderboard rows, `Endpoint.model`, eval `result.model_id`, and the `model=`
argument you pass to `endpoints.deploy()`) open models appear as **per-task
public aliases** (a stable name scoped to the task), not their underlying
repo/checkpoint ids. Frontier (vendor) ids are shown in the clear, since those
are public products.

This matters in practice for two reasons:

1. The string you read off a leaderboard entry or a recommendation is exactly
   the string you pass back into `deploy(model=...)` or an eval's `models=[...]`.
   You never translate ids yourself.
2. Do not hard-code an alias from one task and reuse it on another. Aliases are
   per-task; always source them from that task's leaderboard or recommendation.

```python
task = "contract-key-fields"
pick = pa.tasks.recommended(task)          # a per-task alias, e.g. "qwen-1"
ep = pa.endpoints.deploy(task=task, model=pick, wait=True)
print(ep.model)                            # the same alias, echoed back
```

## Hardware is hidden

You never choose a GPU, tensor-parallel degree, quantization scheme, or serving
mode. `endpoints.deploy()` takes a `task` and a `model` (alias, real-callable
id, or the literal `"recommended"`) and nothing about hardware. Pareta resolves
the serving class from its registry.

```python
# No hardware knobs. task + model is the whole decision.
ep = pa.endpoints.deploy(task="contract-key-fields", model="recommended", wait=True)
print(ep.id, ep.status, ep.url)            # ep.id is what you call for inference
```

`deploy()` streams progress. With `wait=True` it blocks and returns the live
`Endpoint` (raising `ParetaError` if the deploy fails). With `wait=False`
(the default) it returns an iterator of `{"event", "data"}` progress events so
you can render a progress bar:

```python
for evt in pa.endpoints.deploy(task="contract-key-fields", model="recommended"):
    if evt["event"] == "progress":
        print(evt["data"])                 # stage status
    elif evt["event"] == "complete":
        ep = evt["data"]["endpoint"]
        print("live:", ep["id"])
    elif evt["event"] == "error":
        # the SDK raises ParetaError on this event when wait=True
        print("failed:", evt["data"])
```
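
If you want progress reporting and a returned endpoint in one call, you can wrap the event loop yourself. The sketch below is a hypothetical helper, not SDK code; it assumes only the `{"event", "data"}` shape and the `progress` / `complete` / `error` event names shown above. The `stage` value in the stand-in events is invented for the demo.

```python
def wait_for_endpoint(events, on_progress=print):
    """Consume deploy events; return the endpoint dict or raise on failure."""
    for evt in events:
        kind, data = evt["event"], evt["data"]
        if kind == "progress":
            on_progress(data)
        elif kind == "complete":
            return data["endpoint"]
        elif kind == "error":
            raise RuntimeError(f"deploy failed: {data}")
    raise RuntimeError("deploy stream ended without a 'complete' event")


# Stand-in events; with a real client, pass pa.endpoints.deploy(task=..., model=...).
fake_events = [
    {"event": "progress", "data": {"stage": "provisioning"}},
    {"event": "complete", "data": {"endpoint": {"id": "ep_a1b2c3"}}},
]
endpoint = wait_for_endpoint(fake_events, on_progress=lambda data: None)
print(endpoint["id"])   # ep_a1b2c3
```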

Operate and inspect endpoints with `list`, `retrieve`, `start`, `stop`,
`delete`, and `metrics`. See [Deploying endpoints](deploying-endpoints.md) for the full
lifecycle.

```python
for ep in pa.endpoints.list():
    print(ep.id, ep.task, ep.status, "LIVE" if ep.is_live else "")

perf = pa.endpoints.metrics(ep.id).performance()   # p50/p95/p99 latency (raw JSON)
```

## Inference is OpenAI-compatible

Once an endpoint is live, call it through `chat.completions.create`. The
endpoint id (`ep.id`) is the `model`. The request and response match the OpenAI
chat schema, so the official `openai` client works against the same base URL
and key.

```python
resp = pa.chat.completions.create(
    model=ep.id,
    messages=[{"role": "user", "content": "Extract the contract effective date."}],
    temperature=0,                          # extra OpenAI params pass straight through
)
print(resp.choices[0].message.content)
print(resp.usage.total_tokens)
```

Streaming yields `ChatCompletionChunk` objects; the incremental text is on
`chunk.choices[0].delta.content`:

```python
msgs = [{"role": "user", "content": "Summarize the termination clause."}]
for chunk in pa.chat.completions.create(model=ep.id, messages=msgs, stream=True):
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()
```

`create()` raises `ValueError` up front if `model` or `messages` is empty. See
[Running inference](./inference.md) for streaming details and the async
iterator form.

## Metering and billing

Both inference and evals are **metered against your organization's balance**.

- **Inference:** a successful `chat.completions.create()` debits the org
  balance.
- **Evals:** `evals.runs.create()` debits for the compute it spends: both the
  open candidates and any frontier baselines you include.
- **Empty balance:** either path raises `InsufficientCreditsError` (HTTP 402).

```python
from pareta import InsufficientCreditsError

try:
    resp = pa.chat.completions.create(model=ep.id, messages=[{"role": "user", "content": "hi"}])
except InsufficientCreditsError:
    print("Top up the org balance in the dashboard, then retry.")
```

Topping up is **browser-only**. The SDK never exposes the balance, payment
methods, or top-up. It only consumes credit and surfaces the 402 when there is
none.

### Reading cost off an eval run

An eval run reports what it cost. The SDK follows one money convention
(`SDK_PLAN` §6): the **billed total is floored to whole cents** so the SDK never
overstates a charge, while sub-cent precision stays available in micro-USD.

- `run.cost` is a `Decimal` in dollars, floored to cents. A 5 µUSD run reads
  `Decimal("0.00")`.
- `run.cost_micro_usd` is the raw integer (`1_000_000` = `$1.00`).
- Per-item unit rates such as `result.mean_cost_micro_usd` and a leaderboard
  entry's `cost_per_request_micro_usd` stay in **micro-USD**. Flooring them to
  cents would erase the open-vs-frontier comparison that is the whole point.

```python
print(run.cost)               # Decimal("0.42"): billed dollars, floored to cents
print(run.cost_micro_usd)     # 420715: raw micro-USD
```
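
The relationship between the two fields can be reproduced with the standard `decimal` module. This is a worked sketch of the stated convention (micro-USD divided by one million, floored to cents), not the SDK's own code; `billed_dollars` is a hypothetical name.

```python
from decimal import Decimal, ROUND_FLOOR


def billed_dollars(micro_usd: int) -> Decimal:
    """Convert integer micro-USD to dollars, floored to whole cents."""
    dollars = Decimal(micro_usd) / Decimal(1_000_000)
    return dollars.quantize(Decimal("0.01"), rounding=ROUND_FLOOR)


print(billed_dollars(420_715))     # 0.42
print(billed_dollars(5))           # 0.00
print(billed_dollars(1_000_000))   # 1.00
```

Flooring rather than rounding is what keeps a 0.429999-dollar run at `0.42` instead of `0.43`, which is why the displayed total never overstates the charge.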

## The discovery funnel

The pieces above compose into one path from "I have a job" to "I have a cheaper
endpoint running it." This is the recommended flow:

```
match  ->  leaderboard  ->  eval on YOUR data  ->  deploy the winner
```

1. **Match** your intent to a task.
2. Read the task's **leaderboard** to see ranked open candidates and the
   frontier baseline.
3. **Eval** the top candidates (plus the frontier baseline) on *your own* data.
   Public benchmarks are a starting point; your rows are the deciding vote.
4. **Deploy** the model that wins on your data.

```python
from pareta import Pareta

pa = Pareta.from_env()

# 1. Match free-text intent to a task
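match = pa.tasks.match("extract key fields from contracts")
if not (match.matched and match.chosen):
    raise SystemExit("no confident task match; inspect match.candidates")
task = match.chosen.task_id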

# 2. See how open models rank against the frontier baseline
lb = pa.tasks.leaderboard(task)
candidates = [e.name for e in lb.models[:3]]      # top-3 open aliases

# 3. Evaluate those candidates + the benchmarked frontier on YOUR rows.
#    Pass task + items to create the eval set inline, or use an existing set id.
run = pa.evals.runs.create(
    task=task,
    items=[
        {"input": "...your contract text...", "expected": {"effective_date": "2026-01-01"}},
        # ...more rows...
    ],
    models=candidates,            # open candidates (per-task aliases)
    frontier="benchmarked",       # baselines on this task's leaderboard
    wait=True,                    # block until the run is terminal
)

# 4. Read results (quality + cost), then deploy the model that won on your data
ranked = sorted(run.results, key=lambda r: (r.quality_mean or 0), reverse=True)
for r in ranked:
    print(r.model_id, r.kind, r.quality_mean, r.mean_cost_micro_usd, f"n={r.n_succeeded}")

print("eval cost:", run.cost)     # Decimal dollars, floored to cents

# Frontier baselines cannot be deployed, so take the best-scoring open candidate.
winner = next(r.model_id for r in ranked if r.model_id in candidates)
ep = pa.endpoints.deploy(task=task, model=winner, wait=True)
print("serving:", ep.id, ep.url)
```

A few notes on the eval call:

- Provide **either** `eval_set=<id>` (an existing set) **or** `task=... +
  items=...` to create one inline. With neither, `create()` raises `ValueError`.
- `frontier=` accepts `None`/`"none"` (no baselines), an explicit list of
  frontier ids, `"all"` (every frontier model for the task), or `"benchmarked"`
  (only those on the task's leaderboard, vision-filtered for document tasks).
  Keyword resolution needs to know the task; with `eval_set=`, the SDK looks the
  task up for you.
- `wait=True` polls until the run reaches `"completed"` or `"failed"`
  (`run.is_terminal`), then returns the final `EvalRun`. For document tasks,
  attach binaries with `evals.sets.upload_document(...)` before running.
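
If you start a run without `wait=True` and want to block later, the polling behavior can be approximated by a loop like the one below. This is a generic sketch, not the SDK's implementation: `poll_until_terminal` is hypothetical, `fetch` stands for whatever call returns the latest run in your code, and the interval and timeout are illustrative. Only `is_terminal` comes from the text above.

```python
import time
from types import SimpleNamespace


def poll_until_terminal(fetch, interval=2.0, timeout=600.0, sleep=time.sleep):
    """Call `fetch()` until the returned run reports is_terminal, or time out.

    `fetch` is a hypothetical zero-argument callable that returns the latest run.
    """
    deadline = time.monotonic() + timeout
    while True:
        run = fetch()
        if run.is_terminal:
            return run
        if time.monotonic() >= deadline:
            raise TimeoutError("eval run did not finish in time")
        sleep(interval)


# Stand-in runs: the first poll is still in flight, the second is terminal.
states = iter([SimpleNamespace(is_terminal=False), SimpleNamespace(is_terminal=True)])
final = poll_until_terminal(lambda: next(states), sleep=lambda seconds: None)
print(final.is_terminal)   # True
```

Injecting `sleep` keeps the loop testable without real waiting.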

For the full eval API (building sets, attaching documents, inline vs. existing
sets, and polling semantics) see [Evaluating models](evaluation.md). For the
discovery primitives in depth, see [Finding the right model](discovery.md).

## Errors at a glance

Every SDK error subclasses `ParetaError`. The status-mapped subclasses let you
branch on what went wrong without inspecting status codes:

| Exception | Status | When |
|---|---|---|
| `AuthenticationError` | 401 | bad or missing key |
| `InsufficientCreditsError` | 402 | org out of credit (top up in the dashboard) |
| `PermissionDeniedError` | 403 | the user lacks permission |
| `NotFoundError` | 404 | unknown task, endpoint, or run |
| `ConflictError` | 409 | seed/legacy endpoint or transient contention |
| `RateLimitError` | 429 | throttled (auto-retried) |
| `EndpointNotReadyError` | 503 | endpoint stopped, cold, or provider down |
| `BadRequestError` | 400/422 | malformed request |
| `APIConnectionError` / `APITimeoutError` | n/a | transport failure (auto-retried) |

```python
import pareta

try:
    resp = pa.chat.completions.create(model=ep.id, messages=[{"role": "user", "content": "hi"}])
except pareta.EndpointNotReadyError:
    pa.endpoints.start(ep.id)            # wake a stopped endpoint, then retry
except pareta.InsufficientCreditsError:
    print("Out of credit. Top up in the dashboard.")
except pareta.ParetaError as e:
    print("request failed:", e)
```
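
The `EndpointNotReadyError` branch above starts the endpoint but leaves the retry to you. One way to close that loop is a small wrapper like the sketch below. It is hypothetical, not SDK code: the exception type is passed in so the example runs without the SDK installed, and the attempt count and delay are illustrative. In real code you would pass `pareta.EndpointNotReadyError`.

```python
import time


def call_with_wake(call, wake, not_ready_exc, attempts=3, delay=5.0, sleep=time.sleep):
    """Run `call()`; on `not_ready_exc`, run `wake()` once, then retry after a pause."""
    woke = False
    for attempt in range(attempts):
        try:
            return call()
        except not_ready_exc:
            if attempt == attempts - 1:
                raise                  # out of attempts: surface the error
            if not woke:
                wake()
                woke = True
            sleep(delay)


class FakeNotReady(Exception):
    """Stand-in for pareta.EndpointNotReadyError."""


state = {"calls": 0, "wakes": 0}


def flaky_call():
    state["calls"] += 1
    if state["calls"] < 3:             # not ready on the first two tries
        raise FakeNotReady()
    return "ok"


def fake_wake():
    state["wakes"] += 1


result = call_with_wake(flaky_call, fake_wake, FakeNotReady, sleep=lambda seconds: None)
print(result, state)   # ok {'calls': 3, 'wakes': 1}
```

With a real client, `call` would wrap `pa.chat.completions.create(...)` and `wake` would wrap `pa.endpoints.start(ep.id)`. A cold endpoint may need a longer pause than the SDK's built-in retries allow, which is the gap this wrapper is meant to cover.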

See [Error handling](errors-and-retries.md) for the full hierarchy, the `request_id`
attribute for support, and the retry policy.



---

<!-- guide/inference.md -->

# Running inference

Once you have a live endpoint, you call it through `chat.completions.create`, which has the same shape as the OpenAI chat completions API. Pass the endpoint id as `model`, a list of messages, and you get a `ChatCompletion` back. Set `stream=True` and you get an iterator of token deltas instead.

Pareta is OpenAI-compatible on the wire, so you can run inference with this SDK, with the `openai` package, or with raw HTTP, whichever fits your stack. This SDK's extra value is the control plane (deploy, eval, discover); for plain inference the two are interchangeable.
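
For the raw-HTTP route, the request is an ordinary OpenAI-style POST. The sketch below builds one with the standard library and does not send it. It rests on one assumption: the `/chat/completions` path is inferred from the OpenAI convention and the `/v1` base URL used with the `openai` client, not confirmed by this page. `build_chat_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_chat_request(api_key, endpoint_id, messages, base_url="https://api.pareta.ai/v1"):
    """Build (but do not send) a chat-completions POST request."""
    body = {"model": endpoint_id, "messages": messages}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",   # assumed OpenAI-style path
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request("pareta_sk_...", "ep_a1b2c3", [{"role": "user", "content": "hi"}])
print(req.get_method(), req.full_url)
# Sending it would be urllib.request.urlopen(req), followed by json.load() on the response.
```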

A few platform truths that shape this page:

- **Models are per-task aliases.** The `model` you pass is an endpoint id from [deploy](deploying-endpoints.md), or a callable model alias. Real open-weights model ids never reach you; the backend resolves them. You never pick a GPU.
- **Inference is metered against your org balance.** A successful completion debits your balance. If the balance is empty, the call raises `InsufficientCreditsError` (402). Top-up is browser-only; the SDK has no balance or payment surface.

## Setup

Mint a `pareta_sk_` key in the dashboard, export it, and build the client from the environment:

```bash
export PARETA_API_KEY=pareta_sk_...
```

```python
from pareta import Pareta

pa = Pareta.from_env()   # reads PARETA_API_KEY (+ optional PARETA_BASE_URL)
```

`from_env()` is the recommended path. You can also pass the key explicitly: `Pareta(api_key="pareta_sk_...")`. The client is a context manager, so `with Pareta.from_env() as pa:` cleans up the HTTP connection for you.

## A basic completion

Pass an endpoint id as `model` and a non-empty `messages` list in OpenAI format. You get back a `ChatCompletion`.

```python
from pareta import Pareta

with Pareta.from_env() as pa:
    resp = pa.chat.completions.create(
        model="ep_invoice_extract",  # an endpoint id from endpoints.deploy()
        messages=[
            {"role": "system", "content": "You extract structured fields from documents."},
            {"role": "user", "content": "What is the invoice total?\n\nINVOICE\nTotal due: $4,210.00"},
        ],
    )

    print(resp.choices[0].message.content)
    print(resp.usage.total_tokens, "tokens")
```

Where does the `model` value come from? Three sources, all interchangeable here:

- An endpoint id you deployed. See [Deploying endpoints](deploying-endpoints.md).
- Any id returned by `pa.models.list()` (see [Listing models](#listing-callable-models) below).
- A per-task model alias. The recommended pick for a task is `pa.tasks.recommended(task_id)`; see [Discovering tasks](discovery.md).

`model` and `messages` are both required. The SDK raises `ValueError` before sending if `model` is falsy or `messages` is empty, so a malformed call fails fast without burning a request.

## The ChatCompletion shape

`create()` returns a `ChatCompletion`. The fields mirror OpenAI:

```python
resp.id                              # str | None
resp.model                           # str | None: the alias that served the call
resp.created                         # int | None: Unix timestamp
resp.choices                         # list[Choice]
resp.choices[0].index                # int | None
resp.choices[0].finish_reason        # "stop", "length", ...
resp.choices[0].message.role         # "assistant"
resp.choices[0].message.content      # str | None: the generated text
resp.usage.prompt_tokens             # int | None
resp.usage.completion_tokens         # int | None
resp.usage.total_tokens              # int | None
```

Every response object keeps the raw server JSON. If a field isn't surfaced as a typed property, reach it with `resp.to_dict()` or `resp["..."]`. Nothing the API returns is lost behind the typed layer.
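One way to picture the typed layer is a thin wrapper that keeps the raw payload and exposes a few fields as properties. The class below is a hypothetical stand-in written for illustration (it is not the SDK's `ChatCompletion`), and the `system_fingerprint` key is only an example of a field the typed layer might not surface.

```python
class RawBacked:
    """Hypothetical stand-in: typed properties over a retained raw payload."""

    def __init__(self, raw: dict):
        self._raw = raw

    @property
    def id(self):
        return self._raw.get("id")          # a field surfaced as a typed property

    def to_dict(self) -> dict:
        return dict(self._raw)              # shallow copy of the raw server JSON

    def __getitem__(self, key):
        return self._raw[key]               # dict-style access to any raw field


resp = RawBacked({"id": "cmpl_1", "system_fingerprint": "fp_example"})
print(resp.id)                              # typed access
print(resp["system_fingerprint"])           # not a property, still reachable
print(resp.to_dict()["system_fingerprint"])
```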

## Passthrough parameters

Any extra keyword you pass goes straight into the request body, so the full OpenAI parameter set is available without the SDK enumerating it:

```python
resp = pa.chat.completions.create(
    model="ep_invoice_xtract",
    messages=[{"role": "user", "content": "Summarize this contract clause: ..."}],
    temperature=0.2,
    max_tokens=512,
    top_p=0.9,
)
```

`temperature`, `max_tokens`, `top_p`, `stop`, `seed`, and friends all pass through unchanged.

## Streaming

Set `stream=True` and `create()` returns an iterator of `ChatCompletionChunk` objects instead of a single `ChatCompletion`. Each chunk carries a `delta` (not a `message`); the incremental text is at `chunk.choices[0].delta.content`.

```python
with Pareta.from_env() as pa:
    stream = pa.chat.completions.create(
        model="ep_invoice_xtract",
        messages=[{"role": "user", "content": "Draft a one-paragraph status update."}],
        stream=True,
    )
    for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="", flush=True)
    print()
```

`ChatCompletionChunk` has the same schema as `ChatCompletion`; it exists as a distinct type only for hinting. Guard `delta.content` with `or ""`: the first and last chunks of a stream often carry role or finish metadata with no text.

The stream is data-only SSE and always terminates on a `[DONE]` sentinel, which the SDK consumes for you, so the iterator simply ends. Note that retries only cover the initial handshake. Once tokens are flowing, a mid-stream drop raises immediately rather than silently resuming.
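To make the wire format concrete, here is a minimal sketch of reading a data-only SSE stream by hand, assuming each event is a single `data: <json>` line and the final line is `data: [DONE]`. `iter_sse_chunks` and the sample lines are hypothetical; the SDK does this parsing for you.

```python
import json


def iter_sse_chunks(lines):
    """Yield decoded JSON chunks from data-only SSE lines, stopping at [DONE]."""
    for line in lines:
        if not line.startswith("data:"):
            continue                      # skip blank separators and other lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return                        # sentinel: end the iterator cleanly
        yield json.loads(payload)


# Hypothetical sample stream: a role-only chunk, two text chunks, the sentinel.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": " world"}}]}',
    "",
    "data: [DONE]",
]

text = "".join(
    chunk["choices"][0]["delta"].get("content") or "" for chunk in iter_sse_chunks(sample)
)
print(text)
```

The role-only first chunk has no `content` key, which is why the join falls back to an empty string.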

## Listing callable models

`models.list()` returns the OpenAI-compatible model list: only your deployed endpoints that have an inference URL. Use it to discover ids you can pass to `create(model=...)`.

```python
with Pareta.from_env() as pa:
    models = pa.models.list()         # ModelList
    print(len(models))                # number of callable endpoints
    for m in models:                  # iterates Model objects
        print(m.id, m.owned_by)       # m.id is usable as chat.completions.create(model=...)
```

`ModelList` is iterable and has a length. Each `Model` exposes `.id` (the callable endpoint id), `.owned_by` (`"pareta"` or a vendor name), and `.created`. This is the inference-time view; to manage endpoint lifecycle (start, stop, metrics) use the [endpoints](deploying-endpoints.md) namespace.

## Async

`AsyncPareta` mirrors the sync client. Methods are `async def`; for streaming you `await` the call once, then `async for` over the chunks.

```python
import asyncio
from pareta import AsyncPareta

async def main():
    async with AsyncPareta.from_env() as pa:
        # Non-streaming
        resp = await pa.chat.completions.create(
            model="ep_invoice_xtract",
            messages=[{"role": "user", "content": "What is the invoice total?"}],
        )
        print(resp.choices[0].message.content)

        # Streaming
        stream = await pa.chat.completions.create(
            model="ep_invoice_xtract",
            messages=[{"role": "user", "content": "Stream me a haiku about ledgers."}],
            stream=True,
        )
        async for chunk in stream:
            print(chunk.choices[0].delta.content or "", end="", flush=True)
        print()

asyncio.run(main())
```

## Handling metering and not-ready errors

Two error cases are specific to running inference. Both subclass `ParetaError`, so a single `except ParetaError` is a fine catch-all; the specific classes let you branch.

```python
from pareta import (
    Pareta,
    InsufficientCreditsError,   # 402: org balance empty
    EndpointNotReadyError,      # 503: endpoint stopped / cold / provider down
)

with Pareta.from_env() as pa:
    try:
        resp = pa.chat.completions.create(
            model="ep_invoice_xtract",
            messages=[{"role": "user", "content": "Hello"}],
        )
        print(resp.choices[0].message.content)
    except InsufficientCreditsError:
        # Balance hit zero. Top up in the dashboard (billing is browser-only);
        # the SDK exposes no balance or payment surface.
        print("Out of credit. Top up in the dashboard, then retry.")
    except EndpointNotReadyError:
        # The endpoint is stopped or cold-starting. Start it and wait for live.
        pa.endpoints.start("ep_invoice_xtract")
        print("Endpoint was not ready; started it, retry shortly.")
```

Transient failures (429 rate limits, 5xx, connection timeouts) are retried automatically with exponential backoff, `max_retries` times (default 2). You only see `RateLimitError` or `APITimeoutError` after retries are exhausted. See [Errors](errors-and-retries.md) for the full hierarchy.
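For intuition, automatic retry with exponential backoff looks roughly like the loop below. This is a generic sketch, not the SDK's implementation: `TransientError`, `call_with_retries`, the 0.5-second base delay, and the doubling schedule are all assumptions chosen for illustration.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure (429, 5xx, timeout)."""


def call_with_retries(fn, max_retries=2, base_delay=0.5, sleep=time.sleep):
    """Call fn(); on TransientError retry up to max_retries times, doubling the delay."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_retries:
                raise                           # retries exhausted: surface the error
            sleep(base_delay * (2 ** attempt))  # 0.5s, then 1s, then 2s, ...


# Demo: fail twice, then succeed on the third attempt (no real sleeping).
attempts = []
delays = []


def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientError("try again")
    return "ok"


result = call_with_retries(flaky, max_retries=2, sleep=delays.append)
print(result, delays)
```

With `max_retries=2` the function makes at most three attempts in total, which matches the "retried `max_retries` times" wording above.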

## Using the OpenAI SDK instead

Because the endpoint is OpenAI-compatible, you don't need this SDK to *call* it. Point the `openai` client at Pareta's base URL with your `pareta_sk_` key. Note the `/v1` suffix the OpenAI client expects:

```python
from openai import OpenAI

client = OpenAI(api_key="pareta_sk_...", base_url="https://api.pareta.ai/v1")

resp = client.chat.completions.create(
    model="ep_invoice_xtract",
    messages=[{"role": "user", "content": "What is the invoice total?"}],
)
print(resp.choices[0].message.content)
```

Streaming, `temperature`, `max_tokens`, and the rest work exactly as they do against OpenAI. Metering still applies: a zero balance returns a 402, which the `openai` client surfaces as its own status error. Reach for the Pareta SDK when you want the control plane: [deploying endpoints](deploying-endpoints.md), [discovering tasks](discovery.md), and [running evals](evaluation.md).



---

<!-- guide/deploying-endpoints.md -->

# Deploying & operating endpoints

`client.endpoints` is the control plane for serving open-weights models. You
hand it a task and a model; it deploys an OpenAI-compatible inference endpoint,
hands you back a live `Endpoint`, and lets you start, stop, delete, and measure
it from code. No infrastructure to reason about.

Three platform truths shape this whole page:

- **GPUs are hidden.** `deploy()` takes a task and a model, nothing else. There
  is no GPU, tensor-parallel, or quantization knob. Pareta resolves the serving
  class from its registry.
- **Models are per-task aliases.** The `model` you deploy and the `Endpoint.model`
  you read back are public per-task aliases (`{family}-{rank}`), not raw
  open-weights ids. Real ids never cross into the SDK.
- **Inference is metered against your org balance.** Once an endpoint is live,
  every `chat.completions.create()` debits your org balance and raises
  [`InsufficientCreditsError`](errors-and-retries.md) (402) on an empty balance.
  Top-up is browser-only.

```python
from pareta import Pareta

pa = Pareta.from_env()   # reads PARETA_API_KEY (+ optional PARETA_BASE_URL)
```

## Deploy an endpoint

```python
ep = pa.endpoints.deploy(
    task="contract-key-fields",
    model="recommended",   # default — Pareta picks the task's best open model
    wait=True,
)
print(ep.id, ep.status, ep.url)   # e.g. "ep_a1b2c3 live https://…"
```

Signature:

```python
endpoints.deploy(
    *,
    task: str,                 # required: a subtask id, e.g. "contract-key-fields"
    model: str = "recommended",
    name: str | None = None,   # auto-generated if omitted
    wait: bool = False,
    **extra,                   # passed through to the backend
) -> Iterator[dict] | Endpoint
```

- `task` (required) is a catalog subtask id. Discover one with
  [`tasks.match()` / `tasks.list()`](discovery.md).
- `model` is a per-task public alias, an explicit real-id-equivalent alias, or
  `"recommended"` (the default — the task's curated pick, else the
  leaderboard's top open model). To see what `"recommended"` resolves to before
  you deploy, read `pa.tasks.recommended(task)`.
- `name` is optional. Leave it off and Pareta names the endpoint for you.
- The return type depends entirely on `wait`. See the next two sections.

You never pass hardware. Pareta resolves the GPU and serving config for the
chosen model.

### `wait=True` — block and get the live `Endpoint`

The simplest path. `deploy(wait=True)` consumes the deploy stream internally,
blocks until the endpoint is live, and returns the `Endpoint`. If the deploy
fails, it raises `ParetaError`.

```python
ep = pa.endpoints.deploy(task="contract-key-fields", wait=True)

assert ep.is_live              # status == "live"
print(ep.id)                   # pass this to chat.completions.create(model=…)
print(ep.model)                # per-task public alias that got deployed
print(ep.url)                  # OpenAI-compatible inference URL

# Use it immediately — metered against your org balance.
resp = pa.chat.completions.create(
    model=ep.id,
    messages=[{"role": "user", "content": "Extract the parties and effective date."}],
)
print(resp.choices[0].message.content)
```

### `wait=False` — stream deploy progress (default)

With `wait=False` (the default), `deploy()` returns an iterator of named
progress events so you can render a deploy UI or log stages. Each event is a
`{"event": str, "data": dict}` dict.

```python
endpoint = None
for ev in pa.endpoints.deploy(task="contract-key-fields"):
    if ev["event"] == "progress":
        # data carries the deploy stage status, e.g. {"stage": "pulling weights", "pct": 45}
        print("progress:", ev["data"])
    elif ev["event"] == "complete":
        endpoint = ev["data"]["endpoint"]   # the live endpoint payload (dict)
        print("live:", endpoint)
    elif ev["event"] == "error":
        # The SDK raises ParetaError for you on wait=True; here you handle it.
        raise RuntimeError(ev["data"].get("message", "deploy failed"))
```

The terminal event is `"complete"` (its `data.endpoint` is the live endpoint)
or `"error"`. A healthy stream ends on one of them; if the stream ends without
a `"complete"`, the SDK raises `ParetaError`.

Pick `wait=True` for scripts and notebooks; pick `wait=False` only when you
want to surface live progress.
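If you consume the event stream yourself but still want a single return value, the loop can be folded into a helper. The sketch below works on any iterable of `{"event": ..., "data": ...}` dicts; `endpoint_from_events` and the sample events are hypothetical, and the `"message"` key on error events is an assumption.

```python
def endpoint_from_events(events, on_progress=None):
    """Drain deploy events; return the endpoint payload or raise on failure."""
    for ev in events:
        if ev["event"] == "progress" and on_progress is not None:
            on_progress(ev["data"])
        elif ev["event"] == "complete":
            return ev["data"]["endpoint"]
        elif ev["event"] == "error":
            raise RuntimeError(ev["data"].get("message", "deploy failed"))
    raise RuntimeError("deploy stream ended without a 'complete' event")


# Hypothetical events standing in for pa.endpoints.deploy(task=...).
fake_events = [
    {"event": "progress", "data": {"stage": "pulling weights", "pct": 45}},
    {"event": "complete", "data": {"endpoint": {"id": "ep_a1b2c3", "status": "live"}}},
]
seen = []
endpoint = endpoint_from_events(fake_events, on_progress=seen.append)
print(endpoint["id"], seen)
```

Passing a progress callback keeps rendering concerns out of the helper, so the same function can serve a CLI spinner or a log line.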

## List, retrieve, and address endpoints

```python
# Every endpoint your org can access.
for ep in pa.endpoints.list():
    print(ep.id, ep.status, ep.task, ep.model)

# One endpoint by id.
ep = pa.endpoints.retrieve("ep_a1b2c3")
print(ep.is_live, ep.url)
```

`Endpoint` fields:

| Field | Type | Meaning |
|---|---|---|
| `id` | `str \| None` | Endpoint id — pass as `chat.completions.create(model=…)` |
| `name` | `str \| None` | Display name |
| `model` | `str \| None` | Per-task public alias serving here |
| `status` | `str \| None` | `"live"`, `"starting"`, `"stopped"`, … |
| `task` | `str \| None` | Task name |
| `url` | `str \| None` | OpenAI-compatible inference URL |
| `is_live` | `bool` | `status == "live"` |

`endpoints.list()` returns every endpoint the org can access. For the
OpenAI-compatible subset (only deployed endpoints that have an inference URL,
shaped as `Model` objects), use [`pa.models.list()`](inference.md) instead.

## Start, stop, and delete

A stopped endpoint costs nothing to keep but cannot serve. Stop it to pause
spend, start it to resume, delete it to remove it for good.

```python
pa.endpoints.stop("ep_a1b2c3")      # pause a live endpoint
pa.endpoints.start("ep_a1b2c3")     # resume a stopped one
pa.endpoints.delete("ep_a1b2c3")    # remove it (returns None)
```

While an endpoint is stopped or still cold, inference calls against it raise
[`EndpointNotReadyError`](errors-and-retries.md) (503). Call `start()` and wait
for `retrieve(id).is_live` before sending traffic.

```python
import time

pa.endpoints.start("ep_a1b2c3")
while not pa.endpoints.retrieve("ep_a1b2c3").is_live:
    time.sleep(3)
```
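A bare polling loop waits forever if the endpoint never comes up. A bounded variant is sketched below under a few assumptions: `wait_until_live` is a hypothetical helper, `retrieve` is any callable returning an object with `is_live` (for example `pa.endpoints.retrieve`), and the timeout and interval defaults are arbitrary.

```python
import time
from types import SimpleNamespace


def wait_until_live(retrieve, endpoint_id, timeout=300.0, interval=3.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll retrieve(endpoint_id) until .is_live, or raise TimeoutError."""
    deadline = clock() + timeout
    while True:
        ep = retrieve(endpoint_id)
        if ep.is_live:
            return ep
        if clock() >= deadline:
            raise TimeoutError(f"{endpoint_id} not live after {timeout}s")
        sleep(interval)


# Demo with a fake retrieve that reports live on the third poll.
polls = []


def fake_retrieve(endpoint_id):
    polls.append(endpoint_id)
    return SimpleNamespace(id=endpoint_id, is_live=len(polls) >= 3)


ep = wait_until_live(fake_retrieve, "ep_a1b2c3", sleep=lambda s: None)
print(ep.id, "live after", len(polls), "polls")
```

Injecting `sleep` and `clock` is a testing convenience; in normal use you would leave both at their defaults.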

## Measure an endpoint

`endpoints.metrics(id)` returns a `Metrics` handle with one method per
observability dimension. Each method returns raw metric JSON (typed models are
planned for a later release) and accepts arbitrary query params as keyword arguments.

```python
m = pa.endpoints.metrics("ep_a1b2c3")

m.performance()   # p50/p95/p99 latency
m.uptime()        # availability
m.cost()          # per-endpoint spend + vs-frontier savings
m.quality()       # judge windows
m.activity()      # usage stats

# Params pass straight through to the query string:
m.performance(window="24h")
m.cost(group_by="day")
```

| Method | Returns |
|---|---|
| `performance(**params)` | p50/p95/p99 latency |
| `uptime(**params)` | availability metrics |
| `cost(**params)` | per-endpoint spend and savings versus the frontier baseline |
| `quality(**params)` | judge-window quality scores |
| `activity(**params)` | usage stats |

`metrics(id).cost()` is per-endpoint **observability** — it tells you what this
endpoint spent and how much you saved against a frontier vendor. It is not your
account balance. Balance and top-up live in the dashboard only.

## End to end

Discover a task, deploy its recommended model, serve traffic, then tear down.

```python
from pareta import Pareta

with Pareta.from_env() as pa:
    # 1. Find a task for your intent.
    match = pa.tasks.match("pull key fields out of contracts")
    task_id = match.chosen.task_id          # e.g. "contract-key-fields"

    # 2. Deploy the recommended open model (no GPU knob).
    ep = pa.endpoints.deploy(task=task_id, model="recommended", wait=True)
    print(f"live: {ep.id} serving {ep.model}")

    # 3. Run metered inference (debits the org balance).
    resp = pa.chat.completions.create(
        model=ep.id,
        messages=[{"role": "user", "content": "Extract the governing-law clause."}],
    )
    print(resp.choices[0].message.content)

    # 4. Check what it cost and how it performed.
    print(pa.endpoints.metrics(ep.id).cost())

    # 5. Stop it to pause spend (or delete it to remove it).
    pa.endpoints.stop(ep.id)
```

## Async

`AsyncPareta` mirrors the sync surface. `deploy()`, `list()`, `retrieve()`,
`start()`, `stop()`, and `delete()` are `async def`. `metrics(id)` returns an
`AsyncMetrics` handle synchronously (it is not a coroutine), and its dimension
methods are awaitable.

```python
import asyncio
from pareta import AsyncPareta

async def main():
    async with AsyncPareta.from_env() as pa:
        # wait=True awaits the deploy and returns the live Endpoint.
        ep = await pa.endpoints.deploy(task="contract-key-fields", wait=True)

        # wait=False returns an async progress-event iterator.
        async for ev in await pa.endpoints.deploy(task="contract-key-fields"):
            if ev["event"] == "complete":
                print("live:", ev["data"]["endpoint"])

        m = pa.endpoints.metrics(ep.id)         # NOT awaited: returns the handle
        print(await m.performance())            # the dimension call IS awaited

        await pa.endpoints.stop(ep.id)

asyncio.run(main())
```

## Errors you will hit here

| Exception | Status | When |
|---|---|---|
| `BadRequestError` | 400 / 422 | Unknown task, malformed deploy params |
| `InsufficientCreditsError` | 402 | Org balance empty when you run inference against the endpoint |
| `NotFoundError` | 404 | Unknown `endpoint_id` |
| `ConflictError` | 409 | Seed/legacy endpoint conflict, or transient contention |
| `EndpointNotReadyError` | 503 | Endpoint stopped, cold-starting, or provider down |
| `ParetaError` | — | A deploy stream emitted an `"error"` event, or ended without `"complete"` |

`deploy(task="")` raises `ValueError` before any request goes out. See
[Errors & retries](errors-and-retries.md) for the full hierarchy and the
automatic retry policy.

## Related

- [Discovering tasks](discovery.md) — find a `task` id and inspect the recommended model.
- [Running inference](inference.md) — call `chat.completions.create()` against a deployed endpoint id.
- [Evaluating models](evaluation.md) — pick the right open model for a task before you deploy it.
- [Errors & retries](errors-and-retries.md) — the exception hierarchy and retry behavior.



---

<!-- guide/discovery.md -->

# Finding the right model

Before you deploy anything, you pick a **task** and a **model**. Pareta does both
for you from the SDK:

1. **Match** a free-text description of what you want to do to a benchmark task
   (`tasks.match`).
2. **Rank** the models on that task by quality and cost, and read off the
   recommended pick (`tasks.leaderboard`, `tasks.recommended`).
3. **List** the frontier (vendor) baselines you can measure that pick against
   (`evals.frontier_models`).

This is the discovery loop: intent -> task -> recommended open model + frontier
baseline. From there you either deploy the recommended model
([Deploying endpoints](deploying-endpoints.md)) or run it head to head against the
frontier on your own data ([Evaluating models](evaluation.md)).

Two platform facts shape everything below:

- **Models are per-task aliases.** Leaderboard rows, recommended picks, and
  result `model_id`s are public aliases like `qwen-1` or `recommended`, never the
  underlying open-weights ids. You pass the alias straight back into
  `endpoints.deploy(model=...)` or `evals.runs.create(models=[...])` - Pareta
  resolves the real model and the hardware. There is no GPU or quantization knob
  anywhere in this flow.
- **Frontier (vendor) ids are in the clear.** OpenAI/Anthropic/etc. model ids
  come back as real ids, because they are the baseline you are paying to beat.

All snippets assume:

```python
from pareta import Pareta

pa = Pareta.from_env()   # reads PARETA_API_KEY (and optional PARETA_BASE_URL)
```

---

## 1. Match intent to a task

`tasks.match(query, top_k=5)` turns a plain-English description into ranked
candidate tasks. The matcher is a deterministic keyword scorer (with a semantic
backstop on the backend), so the same query returns the same ranking.

```python
match = pa.tasks.match("pull line items and totals out of vendor invoices")

if match.matched:
    task_id = match.chosen.task_id          # the best task
    print(f"matched {task_id} via {match.matcher} "
          f"(confidence={match.chosen.confidence})")
else:
    # No high-confidence hit - show the user the ranked alternates.
    for cand in match.candidates:
        print(f"  {cand.task_id}  score={cand.score:.2f}  {cand.confidence}")
```

`match` returns a [`TaskMatch`](#taskmatch):

- `matched: bool` - a high-confidence task was found.
- `chosen: TaskMatchCandidate | None` - the best candidate, or `None` if nothing
  cleared the bar.
- `candidates: list[TaskMatchCandidate]` - the top-`top_k` ranked alternates
  (each has `task_id`, `score` in `[0, 1]`, and `confidence` of
  `"high"`/`"medium"`/`"low"`).
- `ambiguous: bool` - `True` when the top two scores are close. Treat it as a cue
  to ask the user to disambiguate.
- `matcher: str | None` - which matcher fired (`"keyword"` or `"semantic"`).

A robust pattern handles the no-match and ambiguous cases instead of blindly
trusting `chosen`:

```python
match = pa.tasks.match("classify support tickets by urgency")

if not match.matched:
    raise SystemExit(f"no task matched; closest: "
                     f"{[c.task_id for c in match.candidates]}")
if match.ambiguous:
    print("ambiguous - top candidates:",
          [(c.task_id, round(c.score or 0, 2)) for c in match.candidates[:2]])

task_id = match.chosen.task_id
```

`match` raises `ValueError` if `query` is empty or whitespace.

### Inspecting the task

Once you have a `task_id`, `tasks.retrieve` gives you the task's schema. The key
field is `has_blob_input`: `True` means the task takes documents or images (PDFs,
scans), which determines how you build eval sets and which frontier models can
run it.

```python
task = pa.tasks.retrieve(task_id, examples_n=3)
print(task.id, task.default_scorer, "blob_input=", task.has_blob_input)
```

- `default_scorer: str | None` - the scorer used to grade outputs on this task.
- `has_blob_input: bool` - the task handles documents/images.
- `examples_n` (optional) - ask for N example items so you can see the input
  shape; read them from the raw record via `task.to_dict()`.

To browse the whole catalog instead of matching, use `pa.tasks.list()`, which
returns `list[Task]`.

---

## 2. Rank the models on a task

`tasks.leaderboard(task_id)` returns the models scored on a task, ranked by
quality, with the per-request cost for each. This is how you choose between open
models and see, concretely, how far below the frontier the cost sits.

```python
board = pa.tasks.leaderboard(task_id)

print(f"metric={board.metric}  cost_unit={board.cost_unit}")
print(f"recommended: {board.recommended}")

for entry in board.models:
    cost = entry.cost_per_request_micro_usd or 0
    print(f"  {entry.name:<16} {entry.kind:<8} "
          f"quality={entry.quality:.3f}  "
          f"${cost / 1_000_000:.6f}/req  ctx={entry.context_k}k")

if board.frontier:
    f = board.frontier
    print(f"frontier baseline: {f.name}  quality={f.quality:.3f}  "
          f"${(f.cost_per_request_micro_usd or 0) / 1_000_000:.6f}/req")
```

`leaderboard` returns a [`Leaderboard`](#leaderboard):

- `recommended: str | None` - the deployable model alias Pareta recommends for
  this task. This is exactly what `endpoints.deploy(model="recommended")`
  resolves to. Pass it straight to `deploy(model=...)`.
- `models: list[LeaderboardEntry]` - the ranked entries. Each `LeaderboardEntry`
  has `name` (the alias / id), `kind` (`"open"` or `"frontier"`),
  `quality` in `[0, 1]`, `cost_per_request_micro_usd` (raw micro-USD,
  **not** floored), and `context_k` (context window in thousands).
- `frontier: LeaderboardEntry | None` - the vendor baseline this task is measured
  against, so you can read the open-vs-frontier gap directly.
- `metric` / `cost_unit` - what `quality` and the cost are measured in (e.g.
  `"quality"` and `"per_request"`).
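Because `cost_per_request_micro_usd` is a raw integer (1,000,000 micro-USD = $1.00), a short worked example helps show what would be lost if per-request rates were floored to whole cents. The two rates below are made-up numbers for illustration, not real leaderboard values, and the helper names are hypothetical.

```python
from decimal import Decimal, ROUND_FLOOR

MICRO_PER_USD = Decimal(1_000_000)


def micro_to_usd(micro_usd: int) -> Decimal:
    """Convert an integer micro-USD amount to exact Decimal dollars."""
    return Decimal(micro_usd) / MICRO_PER_USD


def floor_to_cents(usd: Decimal) -> Decimal:
    """Floor a dollar amount to whole cents."""
    return usd.quantize(Decimal("0.01"), rounding=ROUND_FLOOR)


open_cost = 180        # hypothetical open model: 180 micro-USD per request
frontier_cost = 4_200  # hypothetical frontier model: 4,200 micro-USD per request

print(micro_to_usd(open_cost), micro_to_usd(frontier_cost))
print(floor_to_cents(micro_to_usd(open_cost)), floor_to_cents(micro_to_usd(frontier_cost)))
print(Decimal(frontier_cost) / Decimal(open_cost))
```

Both hypothetical rates floor to zero cents, while the raw integers show the frontier rate is roughly 23 times the open rate.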

> **Cost is in micro-USD here, on purpose.** Per-request rates are sub-cent, so
> the leaderboard keeps the raw `cost_per_request_micro_usd` integer
> (1,000,000 micro-USD = $1.00). Flooring to whole cents - which is how billed
> **totals** like `run.cost` work, see [Evaluating models](evaluation.md) - would
> erase the open-vs-frontier comparison. Divide by 1,000,000 to display dollars.

### The shortcut: `recommended`

If you only want the deployable pick and don't need the full ranking,
`tasks.recommended(task_id)` is a convenience wrapper over
`leaderboard(task_id).recommended`:

```python
model = pa.tasks.recommended(task_id)        # e.g. "qwen-1" or "recommended"
ep = pa.endpoints.deploy(task=task_id, model=model, wait=True)
print(ep.id, ep.status)
```

Passing `model="recommended"` to `deploy` does the same resolution server-side,
so `pa.tasks.recommended(task_id)` is mainly useful when you want to **see** the
pick (log it, show it, gate on it) before committing to a deploy.

> **Sync only, for now.** `leaderboard` and `recommended` live on the sync
> `Tasks` resource. `AsyncTasks` has `list`, `retrieve`, and `match`; the ranking
> methods land for async in a later slice. From async code, either call them on a
> short-lived sync `Pareta` or run them in a thread.

---

## 3. List the frontier baselines to eval against

Picking the recommended open model is the start; the point of Pareta is showing
it holds up against the frontier at a fraction of the cost. `evals.frontier_models`
returns the vendor roster you can put in an eval run as baselines.

```python
roster = pa.evals.frontier_models(task=task_id)

for fm in roster:
    flags = []
    if fm.vision:
        flags.append("vision")
    if fm.benchmarked:
        flags.append("benchmarked")
    print(f"  {fm.id:<28} {fm.vendor:<10} {' '.join(flags)}")
```

Each entry is a [`FrontierModel`](#frontiermodel):

- `id: str | None` - the real vendor model id. Pass these into
  `evals.runs.create(frontier=[...])`.
- `vendor: str | None` - `"openai"`, `"anthropic"`, etc.
- `vision: bool` - the model can take images/documents.
- `benchmarked: bool` - the model sits on this task's leaderboard. Only populated
  when you pass `task=`.

**Passing `task=` matters.** Without it you get the full roster, unannotated.
With it, Pareta annotates `benchmarked` and filters the roster by capability.
For a document task (`has_blob_input == True`) that means vision-capable models
only, so you won't pick a baseline that cannot read the input.

```python
# All frontier models, no task context (no benchmarked flag, no filtering)
everything = pa.evals.frontier_models()

# Scoped to a document task: vision-filtered + benchmarked-annotated
for_task = pa.evals.frontier_models(task=task_id)
```

### Feeding the roster into a run

You can pass explicit frontier ids, or let the SDK resolve a roster keyword for
you. These two are equivalent when the keyword is `"benchmarked"`:

```python
# Explicit: filter the roster yourself
ids = [fm.id for fm in pa.evals.frontier_models(task=task_id) if fm.benchmarked]
run = pa.evals.runs.create(
    eval_set="es_…",
    models=[pa.tasks.recommended(task_id)],   # the open candidate(s)
    frontier=ids,                             # explicit list of vendor ids
    wait=True,
)

# Keyword: the SDK fetches + filters the roster for you
run = pa.evals.runs.create(
    eval_set="es_…",
    models=[pa.tasks.recommended(task_id)],
    frontier="benchmarked",                   # "all" | "benchmarked" | "none" | [ids]
    wait=True,
)
```

The `frontier=` keyword resolves SDK-side before the request is sent:

| Value | Resolves to |
|---|---|
| `None` or `"none"` | no baselines (`[]`) |
| `["id1", "id2"]` | the explicit list, passed through |
| `"all"` | every model from `frontier_models(task=...)` |
| `"benchmarked"` | only roster models with `benchmarked == True` |

When you use `"all"`/`"benchmarked"` the SDK needs to know the task: it uses the
`task=` you passed to `runs.create`, else looks it up from the `eval_set`'s task.
If it can't determine the task it raises `ValueError`; an unrecognized string
keyword raises `ValueError` too. See [Evaluating models](evaluation.md) for the full
run lifecycle, results, and cost.
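The resolution rules in the table can be sketched as a small function. This is an illustration of the documented behavior, not the SDK's code: `resolve_frontier` is hypothetical, it takes an already-fetched roster instead of calling the API, and the roster entries are invented.

```python
from types import SimpleNamespace


def resolve_frontier(frontier, roster):
    """Map a frontier= value to a list of vendor ids (hypothetical sketch)."""
    if frontier is None or frontier == "none":
        return []
    if isinstance(frontier, (list, tuple)):
        return list(frontier)                      # explicit ids pass through
    if frontier == "all":
        return [fm.id for fm in roster]
    if frontier == "benchmarked":
        return [fm.id for fm in roster if fm.benchmarked]
    raise ValueError(f"unrecognized frontier keyword: {frontier!r}")


# Invented roster standing in for pa.evals.frontier_models(task=...).
roster = [
    SimpleNamespace(id="vendor-model-a", benchmarked=True),
    SimpleNamespace(id="vendor-model-b", benchmarked=False),
]
print(resolve_frontier("benchmarked", roster))
print(resolve_frontier("all", roster))
print(resolve_frontier(None, roster))
```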

---

## A full discovery pass

End to end: intent in, recommended open model + a benchmarked frontier baseline
out, ready to hand to a deploy or an eval.

```python
from pareta import Pareta

pa = Pareta.from_env()

# 1. intent -> task
match = pa.tasks.match("extract key fields from contracts")
if not match.matched:
    raise SystemExit(f"no task matched: {[c.task_id for c in match.candidates]}")
task_id = match.chosen.task_id

# 2. task -> recommended open model + the frontier baseline's quality
board = pa.tasks.leaderboard(task_id)
pick = board.recommended
frontier_quality = board.frontier.quality if board.frontier else None
print(f"task={task_id}  recommend={pick}  frontier_quality={frontier_quality}")

# 3. the vendor baselines worth measuring against (vision-filtered, annotated)
baselines = [fm.id for fm in pa.evals.frontier_models(task=task_id) if fm.benchmarked]
print(f"baselines: {baselines}")

# now: deploy `pick`, or eval `pick` vs `baselines` on your own data.
```

Metering note: discovery itself (`match`, `leaderboard`, `recommended`,
`frontier_models`) is free - these are catalog reads. The meter starts when you
actually run compute: inference via `chat.completions.create` and eval runs via
`evals.runs.create` are debited against your org balance, and both raise
`InsufficientCreditsError` (402) on an empty balance. Top-up is browser-only; the
SDK never exposes balance or payment.

---

## Reference

### `tasks.match(query, *, top_k=5) -> TaskMatch`
Free-text intent to ranked candidate tasks. Raises `ValueError` on an empty
query. Deterministic keyword matcher (semantic backstop on the backend).

### `tasks.retrieve(task_id, *, examples_n=None) -> Task`
A task's schema: `id`, `default_scorer`, `has_blob_input`. `examples_n` requests
N example items (read via `task.to_dict()`).

### `tasks.leaderboard(task_id) -> Leaderboard`
Models ranked by quality/cost for a task, plus the `recommended` deployable alias
and the `frontier` baseline. Sync only.

### `tasks.recommended(task_id) -> str | None`
Convenience for `leaderboard(task_id).recommended` - the model alias to pass to
`endpoints.deploy(model=...)`. Sync only.

### `evals.frontier_models(task=None) -> list[FrontierModel]`
The vendor frontier roster. With `task=`, each entry is `benchmarked`-annotated
and the roster is capability-filtered (vision-only for document tasks). Feed `id`s
into `evals.runs.create(frontier=[...])`.

#### `TaskMatch`
`query`, `matched: bool`, `chosen: TaskMatchCandidate | None`,
`candidates: list[TaskMatchCandidate]`, `ambiguous: bool`, `matcher: str | None`.
Each `TaskMatchCandidate` has `task_id`, `score` (`[0, 1]`), `confidence`
(`"high"`/`"medium"`/`"low"`).

#### `Leaderboard`
`task_id`, `metric`, `cost_unit`, `recommended: str | None`,
`models: list[LeaderboardEntry]`, `frontier: LeaderboardEntry | None`. Each
`LeaderboardEntry`: `name`, `kind` (`"open"`/`"frontier"`), `quality` (`[0, 1]`),
`cost_per_request_micro_usd` (raw, not floored), `context_k`.

#### `FrontierModel`
`id`, `vendor`, `vision: bool`, `benchmarked: bool`.

Every response object keeps the raw server JSON: call `.to_dict()` (or index it
like a dict) to reach any field the typed layer doesn't surface yet.

---

See also: [Deploying endpoints](deploying-endpoints.md) · [Evaluating models](evaluation.md)
· [Running inference](inference.md) · [Errors and retries](errors-and-retries.md)



---

<!-- guide/evaluation.md -->

# Evaluating on your own data

Benchmarks tell you which model wins on someone else's data. This page is about the only number that matters: how the candidates score on *your* rows.

You give Pareta a list of items for a task, point it at a set of open models (and, optionally, frontier baselines to beat), and get back per-model quality with confidence intervals and cost. The platform scores everything with the task's scorer, runs the open candidates and the frontier baselines on the same items, and meters the compute against your org balance. No GPUs to size, no scorer to wire up, no judge to host.

The shape is always the same:

1. Turn your rows into an **eval set** (`evals.sets.create`), or pass them inline.
2. Kick off an **eval run** over a list of models (`evals.runs.create`), optionally waiting for it to finish.
3. Read `run.results` to compare quality and cost; read `run.cost` for the bill.

## A complete run, top to bottom

```python
from pareta import Pareta

pa = Pareta.from_env()  # reads PARETA_API_KEY (and optional PARETA_BASE_URL)

run = pa.evals.runs.create(
    task="contract-key-fields",
    items=[
        {"input": "Effective as of January 1, 2026, ...", "expected": {"effective_date": "2026-01-01"}},
        {"input": "This Agreement terminates on 2027-12-31 ...", "expected": {"termination_date": "2027-12-31"}},
    ],
    models=["llama-1", "qwen-2"],   # per-task open aliases
    frontier="benchmarked",          # baselines already on this task's leaderboard
    wait=True,                       # block until the run is terminal
)

print(run.status)          # "completed"
print(f"billed ${run.cost}")  # Decimal dollars, floored to cents

for r in run.results:
    print(f"{r.model_id:16} {r.kind:8} q={r.quality_mean:.3f} "
          f"[{r.quality_ci_low:.3f}, {r.quality_ci_high:.3f}]  "
          f"~{r.mean_cost_micro_usd} uUSD/item  "
          f"({r.n_succeeded} ok, {r.error_count} err)")
```

That single call created the eval set inline, started the run, polled it to completion, and returned aggregates per model. Everything below unpacks the pieces so you can vary them.
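One way to act on `run.results` is to pick the cheapest open model whose quality interval stays within a tolerance of the best frontier baseline. The selection rule, the `pick_candidate` helper, the 0.02 tolerance, and the sample rows below are all assumptions for illustration; only the field names come from the example above. The sketch also assumes `kind` takes the values `"open"` and `"frontier"`.

```python
from types import SimpleNamespace


def pick_candidate(results, tolerance=0.02):
    """Cheapest open model whose CI lower bound is near the best frontier mean."""
    frontier = [r for r in results if r.kind == "frontier"]
    opens = [r for r in results if r.kind == "open"]
    if not frontier:
        # No baseline to compare against: fall back to the best open mean quality.
        return max(opens, key=lambda r: r.quality_mean, default=None)
    bar = max(r.quality_mean for r in frontier) - tolerance
    good = [r for r in opens if r.quality_ci_low >= bar]
    return min(good, key=lambda r: r.mean_cost_micro_usd, default=None)


# Invented rows shaped like run.results entries.
rows = [
    SimpleNamespace(model_id="llama-1", kind="open", quality_mean=0.91,
                    quality_ci_low=0.88, mean_cost_micro_usd=210),
    SimpleNamespace(model_id="qwen-2", kind="open", quality_mean=0.86,
                    quality_ci_low=0.80, mean_cost_micro_usd=150),
    SimpleNamespace(model_id="vendor-model-a", kind="frontier", quality_mean=0.89,
                    quality_ci_low=0.86, mean_cost_micro_usd=4200),
]
best = pick_candidate(rows)
print(best.model_id if best else "no open model cleared the bar")
```

Comparing the lower bound of the interval, rather than the mean, is a deliberately conservative choice: a model only qualifies if even its pessimistic estimate is close to the baseline.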

The model ids in `models=` are **per-task public aliases** (`{family}-{rank}`), not raw model names. They come from a task's leaderboard. Frontier (vendor) ids are in the clear. See [Models and aliases](inference.md) for why and how to discover them.

## Step 1: build an eval set

An eval set is your rows bound to a task. Create one explicitly when you want to reuse it across several runs.

```python
eval_set = pa.evals.sets.create(
    task="contract-key-fields",
    items=[
        {"input": "Effective as of January 1, 2026, ...", "expected": {"effective_date": "2026-01-01"}},
        {"input": "This Agreement terminates on 2027-12-31 ...", "expected": {"termination_date": "2027-12-31"}},
    ],
    name="Q2 contracts sample",   # optional; defaults to "sdk eval set (N items)"
)

print(eval_set.id)               # pass this to runs.create(eval_set=...)
print(eval_set.task_id)          # "contract-key-fields"
print(eval_set.item_count)       # 2
print(eval_set.scoring_strategy) # e.g. "extraction" — how this task is scored
```

`items` is required and must be non-empty (the SDK raises `ValueError` otherwise). Each item is a row in the task's input schema; the rows go up as JSONL on the wire. The exact field names (`input`, `expected`, and any others) follow the task you chose. To inspect a task's schema and pull sample items before you format yours, use `tasks.retrieve(task_id, examples_n=...)` — see [Discovering tasks](discovery.md).
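
As a rough illustration of the row format, the sketch below builds items from plain Python records and serializes them as JSONL, one JSON object per line. The `input`/`expected` field names mirror the example above and are an assumption about this particular task; `records`, `to_items`, and `to_jsonl` are hypothetical helpers, not SDK functions, and the SDK does its own serialization on the wire.

```python
import json

# Hypothetical source rows; in practice these might come from a CSV or a database.
records = [
    ("Effective as of January 1, 2026, ...", {"effective_date": "2026-01-01"}),
    ("This Agreement terminates on 2027-12-31 ...", {"termination_date": "2027-12-31"}),
]


def to_items(rows):
    """Map (text, expected) pairs to item dicts; the field names are assumed."""
    items = [{"input": text, "expected": expected} for text, expected in rows]
    if not items:
        # Mirrors the pre-flight check described above.
        raise ValueError("items must be non-empty")
    return items


def to_jsonl(items):
    """Serialize items as JSONL: one single-line JSON object per row."""
    return "\n".join(json.dumps(item, ensure_ascii=False) for item in items)


items = to_items(records)
print(to_jsonl(items))
```

Building the list in one place makes it easy to validate rows against the task schema before any upload happens.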

Manage sets like any other resource:

```python
pa.evals.sets.list()                  # -> list[EvalSet]
pa.evals.sets.retrieve(eval_set.id)   # -> EvalSet
pa.evals.sets.delete(eval_set.id)     # -> None
```

### Document and image tasks

Some tasks score over documents (PDFs, scanned invoices, images) rather than plain text. A task tells you this via `task.has_blob_input == True`. For those, each row references a binary that you attach after creating the set, one field at a time:

```python
eval_set = pa.evals.sets.create(
    task="invoice-extraction",
    items=[
        {"expected": {"total": "1240.00", "vendor": "Katana ML"}},   # the doc is attached next
        {"expected": {"total": "89.50", "vendor": "Acme"}},
    ],
)

# Attach the PDF for row 0's `document` field.
pa.evals.sets.upload_document(
    eval_set.id,
    "invoices/katana-0001.pdf",   # path, raw bytes, or a binary file-like object
    idx=0,                        # 0-based row index
    field_name="document",        # the blob input field on this task
)

pa.evals.sets.upload_document(eval_set.id, "invoices/acme-0002.pdf", idx=1, field_name="document")
```

`upload_document` collapses the whole upload dance into one call. Files under 5 MiB go up inline; larger files get a signed URL and stream straight to storage. It accepts a path (`str`/`Path`), raw `bytes`, or any object with `.read()`; anything else raises `TypeError`. The MIME type is guessed from the filename and can be overridden with `mime=`:

```python
with open("invoices/scan.tiff", "rb") as f:
    pa.evals.sets.upload_document(eval_set.id, f, idx=2, field_name="document", mime="image/tiff")
```
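
To make the size threshold and MIME guessing concrete, here is a minimal sketch of the decision described above, using only the standard library. `plan_upload` and its return shape are hypothetical, and the SDK may guess types differently; only the 5 MiB cutoff, the filename-based guess, and the `mime=` override come from the text.

```python
import mimetypes

INLINE_LIMIT = 5 * 1024 * 1024  # 5 MiB, the cutoff stated above


def plan_upload(filename, size_bytes, mime=None):
    """Return (strategy, mime_type) for a file. Hypothetical helper, not SDK API."""
    guessed, _encoding = mimetypes.guess_type(filename)
    # An explicit mime wins; fall back to a generic binary type if nothing is guessed.
    mime_type = mime or guessed or "application/octet-stream"
    # Files strictly under the limit go inline; everything else uses a signed URL.
    strategy = "inline" if size_bytes < INLINE_LIMIT else "signed_url"
    return strategy, mime_type


print(plan_upload("invoices/katana-0001.pdf", 120_000))
print(plan_upload("invoices/scan.bin", 12 * 1024 * 1024, mime="image/tiff"))
```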

Frontier baselines on document tasks are automatically vision-filtered — you never accidentally score a contract scan against a text-only model.

## Step 2: run the eval

`evals.runs.create` is the workhorse. You can drive an existing set, or create one inline in the same call.

```python
# Against an existing set
run = pa.evals.runs.create(eval_set=eval_set.id, models=["llama-1", "qwen-2"], wait=True)

# Inline: create the set and run it in one shot
run = pa.evals.runs.create(
    task="contract-key-fields",
    items=[{"input": "...", "expected": {...}}],
    models=["llama-1", "qwen-2"],
    wait=True,
)
```

You must pass **either** `eval_set=<id>` **or** `task=… + items=…`; the SDK raises `ValueError` if you give neither. `models` is required — it's the list of open candidate aliases to evaluate. Each run is **metered**: the org balance is debited for the compute across open and frontier models. If the balance is empty, `create` raises `InsufficientCreditsError` (402). Top-up is browser-only — the SDK never exposes balance or payment methods. See [Errors and metering](errors-and-retries.md).

### Choosing frontier baselines

`frontier=` controls which vendor models get scored alongside your open candidates, so the report shows you exactly how much quality (and cost) you're trading. It accepts a keyword or an explicit list, resolved SDK-side:

| `frontier=` | Baselines scored |
| --- | --- |
| `None` or `"none"` (default `None`) | none — open candidates only |
| `"all"` | every frontier model available for the task |
| `"benchmarked"` | frontier models already on the task's leaderboard (vision-filtered for document tasks) |
| `["gpt-4o", "claude-..."]` | exactly these frontier model ids |

```python
# Just the open candidates, no baseline
run = pa.evals.runs.create(eval_set=eval_set.id, models=["llama-1"], frontier="none", wait=True)

# Everything in the frontier pool for the task
run = pa.evals.runs.create(eval_set=eval_set.id, models=["llama-1"], frontier="all", wait=True)

# A hand-picked baseline
run = pa.evals.runs.create(eval_set=eval_set.id, models=["llama-1"], frontier=["gpt-4o"], wait=True)
```

The `"all"` and `"benchmarked"` keywords need to know the task. When you create inline (`task=…`) the SDK already has it; when you pass `eval_set=…` it looks the task up from the set. If it still can't resolve a task it raises `ValueError`, and an unrecognized keyword (anything other than `"all"`/`"benchmarked"`/`"none"`) raises `ValueError` too.
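
The resolution rules in the table can be sketched as a small pure function. This illustrates the documented behavior and is not the SDK's implementation: `resolve_frontier` and the `FrontierModel` stand-in are hypothetical, `vendor-x` is a made-up id, and a real roster would come from `evals.frontier_models(task=...)`.

```python
from dataclasses import dataclass


@dataclass
class FrontierModel:
    """Stand-in for a roster entry; only the fields used here."""
    id: str
    benchmarked: bool


def resolve_frontier(frontier, roster):
    """Turn a frontier= value into a list of model ids (illustrative only)."""
    if frontier is None or frontier == "none":
        return []
    if isinstance(frontier, str):
        if frontier == "all":
            return [m.id for m in roster]
        if frontier == "benchmarked":
            return [m.id for m in roster if m.benchmarked]
        raise ValueError(f"unrecognized frontier keyword: {frontier!r}")
    return list(frontier)  # an explicit list is passed through unchanged


roster = [FrontierModel("gpt-4o", True), FrontierModel("vendor-x", False)]
print(resolve_frontier("benchmarked", roster))
```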

To see and pin the roster yourself, list it first:

```python
roster = pa.evals.frontier_models(task="contract-key-fields")
for m in roster:
    print(m.id, m.vendor, "vision" if m.vision else "text", "benchmarked" if m.benchmarked else "-")

# Pin two of them explicitly
ids = [m.id for m in roster if m.benchmarked][:2]
run = pa.evals.runs.create(eval_set=eval_set.id, models=["llama-1"], frontier=ids, wait=True)
```

`frontier_models()` annotates `benchmarked` and applies the capability filter only when you pass `task=`. Without a task it returns the full roster, unannotated.

### Waiting, or not

By default `create` returns as soon as the run is queued, so you can poll on your own schedule:

```python
import time

run = pa.evals.runs.create(eval_set=eval_set.id, models=["llama-1"])
print(run.status)   # "running" (or queued)

while not run.is_terminal:
    time.sleep(3.0)                        # pause between polls instead of hammering the API
    run = pa.evals.runs.retrieve(run.id)   # refetch full state

Or let the SDK block for you. `wait=True` polls `runs.retrieve` every `poll_interval` seconds (default 3.0) until the run is terminal, up to `timeout` seconds (default 900.0), then returns the final `EvalRun`. If the deadline passes first it raises `ParetaError`. You can also poll an already-started run with the same semantics:

```python
run = pa.evals.runs.create(eval_set=eval_set.id, models=["llama-1"])
run = pa.evals.runs.wait(run.id, poll_interval=5.0, timeout=1800.0)
```

`is_terminal` is true when `status` is `"completed"` or `"failed"`. On failure, read `run.error_detail` for the message.
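
The waiting behavior described above amounts to a deadline-bounded poll loop. The sketch below shows that pattern against a fake run object so it runs without network access. `wait_until_terminal`, `FakeRun`, the injectable `sleep`, and the use of `TimeoutError` in place of `ParetaError` are all assumptions for illustration; only the defaults (3.0s interval, 900.0s timeout) and the terminal statuses come from the text.

```python
import time
from dataclasses import dataclass


@dataclass
class FakeRun:
    """Stand-in for EvalRun with just the field the loop needs."""
    status: str

    @property
    def is_terminal(self):
        return self.status in ("completed", "failed")


def wait_until_terminal(retrieve, poll_interval=3.0, timeout=900.0, sleep=time.sleep):
    """Call retrieve() until the run is terminal or the deadline passes."""
    deadline = time.monotonic() + timeout
    while True:
        run = retrieve()
        if run.is_terminal:
            return run
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run did not finish within {timeout}s")
        sleep(poll_interval)


# Simulate a run that is still running for two polls, then completes.
states = iter(["running", "running", "completed"])
run = wait_until_terminal(lambda: FakeRun(next(states)), poll_interval=0.01, timeout=5.0)
print(run.status)
```

Note that a `"failed"` run is returned like any other terminal state; only the deadline raises.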

## Step 3: read the results

A terminal `EvalRun` carries one `EvalResult` per model in `run.results`, plus the bill.

```python
run = pa.evals.runs.retrieve(run_id)

if run.status == "failed":
    print("run failed:", run.error_detail)
else:
    # Best open model by mean quality
    open_models = [r for r in run.results if r.kind == "open"]
    best = max(open_models, key=lambda r: r.quality_mean or 0.0)
    print(f"best open: {best.model_id} @ {best.quality_mean:.3f}")

    for r in run.results:
        print(r.model_id, r.kind, r.quality_mean,
              r.quality_ci_low, r.quality_ci_high,
              r.mean_cost_micro_usd, r.n_succeeded, r.error_count)
```

Each `EvalResult` has:

- `model_id` — the per-task public alias (open) or vendor id (frontier).
- `kind` — `"open"` or `"frontier"`. Filter on this to separate candidates from baselines.
- `quality_mean`, `quality_ci_low`, `quality_ci_high` — mean score in `[0, 1]` with a 95% confidence interval. Use the interval: two models whose CIs overlap are not meaningfully different on this sample, so add rows before declaring a winner.
- `mean_cost_micro_usd` — average cost per item in **micro-USD** (1,000,000 = $1.00). This stays in micro-USD on purpose: flooring sub-cent unit rates to whole cents would erase the open-vs-frontier cost gap that the whole exercise is about.
- `n_succeeded`, `error_count` — how many items scored vs. errored for that model.
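
The advice about overlapping intervals can be turned into a quick check. The sketch below uses a stand-in dataclass with the same field names as `EvalResult`; `Result`, `intervals_overlap`, and `clear_winner` are hypothetical helpers, and the scores are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Stand-in for EvalResult; only the quality fields."""
    model_id: str
    quality_mean: float
    quality_ci_low: float
    quality_ci_high: float


def intervals_overlap(a, b):
    """True when the two confidence intervals share at least one point."""
    return a.quality_ci_low <= b.quality_ci_high and b.quality_ci_low <= a.quality_ci_high


def clear_winner(results):
    """Return the top model only if its CI is separated from every other model's."""
    best = max(results, key=lambda r: r.quality_mean)
    others = [r for r in results if r is not best]
    if any(intervals_overlap(best, r) for r in others):
        return None
    return best


results = [
    Result("llama-1", 0.91, 0.88, 0.94),
    Result("qwen-2", 0.84, 0.80, 0.87),
    Result("gpt-4o", 0.86, 0.83, 0.89),
]
winner = clear_winner(results)
print(winner.model_id if winner else "no clear winner; add more rows")
```

Interval overlap is a conservative rule of thumb rather than a formal significance test, which is why the fallback is to gather more rows instead of picking a model.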

### What the run cost

The run total comes back two ways. `run.cost` is what you're billed — a `Decimal` in dollars, **floored to whole cents** (the SDK never rounds a charge up). `run.cost_micro_usd` is the raw integer for precise accounting.

```python
print(run.cost)             # Decimal('0.07')  -> dollars, floored to cents
print(run.cost_micro_usd)   # 74211            -> raw micro-USD
```

A run that costs less than a cent reads `Decimal("0.00")` on `run.cost` while still carrying its true micro-USD value on `run.cost_micro_usd`. The same money convention applies everywhere in the SDK; see [Errors and metering](errors-and-retries.md) for the full picture.
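
The relationship between the two cost fields can be reproduced with the standard `decimal` module. This is a sketch of the convention as described, not the SDK's code; `billed_dollars` is a hypothetical helper and it assumes non-negative amounts, for which rounding down equals flooring.

```python
from decimal import Decimal, ROUND_DOWN

MICRO_PER_USD = Decimal(1_000_000)


def billed_dollars(cost_micro_usd):
    """Convert integer micro-USD to dollars, floored to whole cents."""
    dollars = Decimal(cost_micro_usd) / MICRO_PER_USD
    return dollars.quantize(Decimal("0.01"), rounding=ROUND_DOWN)


print(billed_dollars(74211))   # 0.07
print(billed_dollars(9999))    # 0.00
```

Keeping the arithmetic in `Decimal` avoids the binary floating-point drift that `float` division would introduce when many small charges are summed.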

Every response object also keeps the raw server JSON: `run.to_dict()`, `result.to_dict()`, and `eval_set.to_dict()` give you lossless access to anything not yet surfaced as a typed field.

## Async

Every method has an async twin on `AsyncPareta` with the same signatures. The inline run from the top of this page looks like this in async form:

```python
import asyncio
from pareta import AsyncPareta

async def main():
    async with AsyncPareta.from_env() as pa:
        run = await pa.evals.runs.create(
            task="contract-key-fields",
            items=[{"input": "...", "expected": {...}}],
            models=["llama-1", "qwen-2"],
            frontier="benchmarked",
            wait=True,
        )
        for r in run.results:
            print(r.model_id, r.kind, r.quality_mean)
        print("billed", run.cost)

asyncio.run(main())
```

`await pa.evals.runs.wait(run_id)` and `await pa.evals.frontier_models(task=...)` work the same way. Document uploads are async too: `await pa.evals.sets.upload_document(...)`.

## From eval to production

Once the results pick a winner, deploy it for that task and serve inference against it:

```python
ep = pa.endpoints.deploy(task="contract-key-fields", model=best.model_id, wait=True)
print(ep.id, ep.is_live)

resp = pa.chat.completions.create(
    model=ep.id,
    messages=[{"role": "user", "content": "Extract the effective date from: ..."}],
)
print(resp.choices[0].message.content)
```

Deploy takes only a task and a model — Pareta resolves the hardware, so there is no GPU or quantization knob. Inference is OpenAI-compatible and metered the same way evals are. See [Deploying endpoints](deploying-endpoints.md) and [Running inference](./inference.md) to go further.

## See also

- [Discovering tasks](discovery.md) — find the right task id, inspect its schema, pull example rows.
- [Models and aliases](inference.md) — where the per-task open aliases come from and why real ids are hidden.
- [Deploying endpoints](deploying-endpoints.md) — turn a winning model into a live endpoint.
- [Running inference](./inference.md) — the OpenAI-compatible chat surface.
- [Errors and metering](errors-and-retries.md) — `InsufficientCreditsError`, the money convention, and the exception hierarchy.



---

<!-- guide/errors-and-retries.md -->

# Errors, retries & timeouts

Every failure the SDK can raise is a subclass of `ParetaError`, so one `except`
clause catches everything, and a more specific clause catches exactly the case
you care about. The client also retries transient failures for you (network
blips, 429s, 5xx) with exponential backoff before giving up. This page is the
map: which exception means what, what is retried automatically, and how to tune
the timeout and retry budget.

Import the exceptions straight from the package:

```python
from pareta import (
    Pareta,
    ParetaError,                # base class for everything below
    APIConnectionError,         # never reached the server (DNS/TCP/TLS)
    APITimeoutError,            # subclass of APIConnectionError
    APIStatusError,             # any non-2xx from the server
    BadRequestError,            # 400, 422
    AuthenticationError,        # 401
    PermissionDeniedError,      # 403
    InsufficientCreditsError,   # 402 — org out of balance
    NotFoundError,              # 404
    ConflictError,              # 409
    RateLimitError,             # 429
    EndpointNotReadyError,      # 503 — endpoint stopped/cold/provider down
)
```

## The hierarchy

```
ParetaError
├── APIConnectionError          request never reached the server
│   └── APITimeoutError         timed out before any response
└── APIStatusError              server returned a non-2xx status
    ├── BadRequestError         400, 422
    ├── AuthenticationError     401
    ├── InsufficientCreditsError 402
    ├── PermissionDeniedError   403
    ├── NotFoundError           404
    ├── ConflictError           409
    ├── RateLimitError          429
    └── EndpointNotReadyError   503
```

`ParetaError` is also raised directly (not as an `APIStatusError`) in a few
non-HTTP cases: constructing a client with no API key, an `evals.runs.wait()`
poll loop that exceeds its `timeout`, and an `endpoints.deploy(wait=True)` call
that receives a deploy `"error"` event. See [Timeouts](#timeouts) below.

## Status code to exception

The server is FastAPI, so error bodies are `{"detail": "<message>"}` with an HTTP
status. The SDK maps the status to the most specific subclass so you catch by
meaning, not by sniffing integers.

| Status | Exception | What it means |
|--------|-----------|---------------|
| 400, 422 | `BadRequestError` | Request validation failed (bad params, malformed body) |
| 401 | `AuthenticationError` | API key missing or invalid |
| 402 | `InsufficientCreditsError` | Org is out of balance; top up in the dashboard |
| 403 | `PermissionDeniedError` | Authenticated, but not allowed to do this |
| 404 | `NotFoundError` | Endpoint / eval set / run / task id does not exist |
| 409 | `ConflictError` | Conflict (seed endpoint, transient lock/contention) |
| 429 | `RateLimitError` | Rate limited; honor `Retry-After` |
| 503 | `EndpointNotReadyError` | Target endpoint is stopped, cold-starting, or its provider is down |
| other 5xx | `APIStatusError` | Generic server error |
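
The table reads naturally as a lookup with a generic fallback. The sketch below uses local stand-in classes that mirror the documented names so it runs without the SDK installed; `STATUS_TO_EXCEPTION` and `error_for_response` are hypothetical and say nothing about how the SDK is actually written.

```python
class ParetaError(Exception):
    """Stand-in base class mirroring the hierarchy above."""


class APIStatusError(ParetaError):
    def __init__(self, status_code, detail=None):
        self.status_code = status_code
        self.detail = detail
        # str(e) is the detail when present, otherwise "HTTP <code>".
        super().__init__(detail or f"HTTP {status_code}")


class BadRequestError(APIStatusError): pass
class AuthenticationError(APIStatusError): pass
class InsufficientCreditsError(APIStatusError): pass
class PermissionDeniedError(APIStatusError): pass
class NotFoundError(APIStatusError): pass
class ConflictError(APIStatusError): pass
class RateLimitError(APIStatusError): pass
class EndpointNotReadyError(APIStatusError): pass


STATUS_TO_EXCEPTION = {
    400: BadRequestError,
    422: BadRequestError,
    401: AuthenticationError,
    402: InsufficientCreditsError,
    403: PermissionDeniedError,
    404: NotFoundError,
    409: ConflictError,
    429: RateLimitError,
    503: EndpointNotReadyError,
}


def error_for_response(status_code, detail=None):
    """Pick the most specific class, falling back to APIStatusError."""
    cls = STATUS_TO_EXCEPTION.get(status_code, APIStatusError)
    return cls(status_code, detail)


err = error_for_response(402, "insufficient credits")
print(type(err).__name__, err.status_code, str(err))
```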

## Reading an `APIStatusError`

Every `APIStatusError` carries the fields you need to log and debug. `request_id`
comes from the `x-request-id` response header and is the fastest thing to quote
in a support thread.

```python
from pareta import Pareta, APIStatusError

with Pareta.from_env() as pa:
    try:
        pa.endpoints.retrieve("ep_does_not_exist")
    except APIStatusError as e:
        print(e.status_code)   # 404
        print(e.detail)        # server's `detail` string (or raw body)
        print(e.request_id)    # "req_…" — quote this in bug reports
        print(e.response)      # the underlying httpx.Response, for advanced use
```

`str(e)` is the server's `detail` message when present, otherwise `HTTP <code>`.

## The errors worth catching

Most code only needs to handle a handful of these explicitly. The rest are fine
to let bubble up to a top-level `except ParetaError`.

### `InsufficientCreditsError` (402) — out of balance

Both inference and evals are metered against your org's balance. A successful
[`chat.completions.create()`](./inference.md) debits the balance; an
[`evals.runs.create()`](evaluation.md) debits for the open and frontier compute it
runs. When the balance can't cover the call, you get a 402. Top-up is
browser-only — the SDK exposes no balance or payment surface — so the right move
is to surface a clear message pointing at the dashboard.

```python
from pareta import Pareta, InsufficientCreditsError

with Pareta.from_env() as pa:
    try:
        resp = pa.chat.completions.create(
            model="ep_contract_kie",
            messages=[{"role": "user", "content": "Extract the parties."}],
        )
    except InsufficientCreditsError:
        raise SystemExit("Org balance is empty. Top up in the dashboard at https://pareta.ai.")
```

### `NotFoundError` (404) — wrong id

A stale or mistyped endpoint id, eval set id, run id, or task id.

```python
from pareta import Pareta, NotFoundError

with Pareta.from_env() as pa:
    try:
        ep = pa.endpoints.retrieve("ep_maybe_deleted")
    except NotFoundError:
        ep = pa.endpoints.deploy(task="contract-key-fields", wait=True)  # redeploy
```

### `EndpointNotReadyError` (503) — endpoint not serving

The endpoint exists but isn't serving yet: it's stopped, cold-starting, or the
provider is briefly unavailable. The SDK already retries 503 a couple of times
(see [Automatic retries](#automatic-retries)); if it still surfaces, start the
endpoint and retry your call. Remember that hardware is fully managed —
[`start()`](deploying-endpoints.md) takes no GPU knob, just the endpoint id.

```python
from pareta import Pareta, EndpointNotReadyError

with Pareta.from_env() as pa:
    try:
        resp = pa.chat.completions.create(
            model="ep_contract_kie",
            messages=[{"role": "user", "content": "ping"}],
        )
    except EndpointNotReadyError:
        pa.endpoints.start("ep_contract_kie")          # warm it back up
        ep = pa.endpoints.retrieve("ep_contract_kie")  # poll until ep.is_live, then retry
```

### `RateLimitError` (429) — slow down

Already retried automatically, honoring the server's `Retry-After`. You only see
it after the retry budget is exhausted. Back off and try again later.

```python
from pareta import Pareta, RateLimitError

with Pareta.from_env() as pa:
    try:
        pa.chat.completions.create(
            model="ep_contract_kie",
            messages=[{"role": "user", "content": "hi"}],
        )
    except RateLimitError as e:
        print(f"Still rate limited after retries (request {e.request_id}); back off.")
```

### `AuthenticationError` (401) vs missing key

A 401 means the key reached the server and was rejected (wrong or revoked). That
is distinct from constructing a client with *no* key at all, which fails fast
client-side with a plain `ParetaError` before any request goes out:

```python
import pareta

try:
    pa = pareta.Pareta(api_key="")   # or PARETA_API_KEY unset with from_env()
except pareta.ParetaError as e:
    print(e)  # "missing API key. Pass api_key=… or set PARETA_API_KEY …"
```

## Pre-flight `ValueError` / `TypeError`

Some mistakes never become an HTTP call. The SDK validates the obvious ones up
front and raises the standard Python exception — not a `ParetaError` — because
they are programming errors, not server responses:

- [`chat.completions.create()`](./inference.md) raises `ValueError` if `model`
  or `messages` is empty.
- [`tasks.match()`](discovery.md) raises `ValueError` if `query` is empty.
- [`evals.sets.create()`](evaluation.md) raises `ValueError` if `items` is empty.
- [`evals.runs.create()`](evaluation.md) raises `ValueError` if neither
  `eval_set=` nor `task=`+`items=` is supplied, and `ValueError`/`TypeError` if
  `frontier=` is an unparseable keyword or a frontier keyword can't be resolved
  to a task.
- [`evals.sets.upload_document()`](evaluation.md) raises `TypeError` if `file` is
  not a path, bytes, or a binary file-like object.

These are fine to let crash in development; they signal a bug in the call, not a
runtime condition to recover from.

## Automatic retries

The client retries transient failures for you before raising. You usually do not
need a retry loop of your own.

**What is retried:** status codes `408, 409, 429, 500, 502, 503, 504`, plus
connection-level errors that happen *between* attempts. The default budget is
`max_retries=2` (so up to three attempts total).

**Backoff:** if the server sent a `Retry-After` header, the SDK waits that many
seconds (capped at 30s). Otherwise it uses exponential backoff with jitter:
`min(0.5 * 2**attempt, 8.0) + random(0, 0.25)` seconds, so roughly 0.5s, then
1s, capped at 8s.
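
To see what those delays look like in practice, the formula can be evaluated directly. This is a sketch of the documented schedule, not the SDK's internals: `backoff_seconds` is a hypothetical helper, `attempt` is assumed to be zero-based (0 for the first retry), and the jitter is assumed to be uniform.

```python
import random

RETRY_AFTER_CAP = 30.0  # seconds, the cap stated above


def backoff_seconds(attempt, retry_after=None):
    """Delay before the next attempt; attempt is 0 for the first retry."""
    if retry_after is not None:
        # A server-provided Retry-After wins, but never waits longer than the cap.
        return min(float(retry_after), RETRY_AFTER_CAP)
    return min(0.5 * 2 ** attempt, 8.0) + random.uniform(0, 0.25)


for attempt in range(5):
    print(attempt, round(backoff_seconds(attempt), 2))
```

With the default `max_retries=2` only attempts 0 and 1 occur, so the 8s ceiling matters only when the budget is raised.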

**What is not retried:** stable 4xx (400, 401, 402, 403, 404, 422) raise
immediately, because retrying a bad request or an empty balance won't help.
Connection errors, even on the very first attempt, are retried within the same
budget and surface as `APIConnectionError` / `APITimeoutError` only once it is
exhausted.

Tune the budget per client. Set `max_retries=0` to disable retries entirely:

```python
from pareta import Pareta

# More aggressive: up to 6 attempts on transient failures.
pa = Pareta.from_env(max_retries=5)

# No retries — fail fast and handle it yourself.
strict = Pareta.from_env(max_retries=0)
```

### Streaming and retries

Retries apply only to the initial handshake (connect and status line). Once SSE
bytes are flowing — token chunks from a streamed
[chat completion](./inference.md), or `progress`/`complete` events from a
streamed [deploy](deploying-endpoints.md) — a mid-stream drop raises immediately, because
the stream cannot be safely resumed. Catch it and restart the request from the
top if you need to.

```python
from pareta import Pareta, APIConnectionError

with Pareta.from_env() as pa:
    try:
        for chunk in pa.chat.completions.create(
            model="ep_contract_kie",
            messages=[{"role": "user", "content": "Summarize the contract."}],
            stream=True,
        ):
            piece = chunk.choices[0].delta.content
            if piece:
                print(piece, end="", flush=True)
    except APIConnectionError:
        print("\n[stream dropped — re-issue the request to retry]")
```

## Timeouts

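The default per-request timeout is `httpx.Timeout(60.0, connect=10.0)`: 10s to
establish the connection and 60s for each of the other phases (read, write, and
pool acquisition). httpx applies these limits per phase, not as one wall-clock
budget for the whole request. A request that exceeds a limit raises
`APITimeoutError` (a subclass of `APIConnectionError`) after the retry budget is
spent. Override it with any `httpx.Timeout` (or a bare float):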

```python
import httpx
from pareta import Pareta, APITimeoutError

# 120s per read/write/pool phase, 5s to connect; handy for long generations.
pa = Pareta.from_env(timeout=httpx.Timeout(120.0, connect=5.0))

with pa:
    try:
        pa.chat.completions.create(
            model="ep_contract_kie",
            messages=[{"role": "user", "content": "Write a long summary."}],
            max_tokens=4096,
        )
    except APITimeoutError:
        print("Request timed out; consider streaming or a larger timeout.")
```

### Eval-run wait timeout

[`evals.runs.create(wait=True)`](evaluation.md) and `evals.runs.wait()` are
different: they poll the run to completion. The `timeout` parameter there bounds
the *whole poll loop* (default 900s), not a single HTTP request. If the run
hasn't reached a terminal status (`completed` or `failed`) by the deadline, the
poll helper raises a plain `ParetaError` — the run keeps going server-side, so
you can re-`retrieve()` it later by id.

```python
from pareta import Pareta, ParetaError

with Pareta.from_env() as pa:
    try:
        run = pa.evals.runs.create(
            task="contract-key-fields",
            items=[{"input": "...", "expected": "..."}],
            models=["contract-1", "contract-2"],
            frontier="benchmarked",
            wait=True,
            timeout=600.0,      # give up waiting after 10 minutes
            poll_interval=5.0,
        )
        print(run.status, run.cost)        # e.g. "completed" Decimal("0.42")
    except ParetaError as e:
        print(e)  # "eval run … did not finish within 600s" — poll later with runs.retrieve(id)
```

Note that a run finishing with `status == "failed"` is *not* an exception — it's
a terminal state you read off the returned `EvalRun` (`run.is_terminal` is True,
`run.error_detail` carries the message). Only the wait *timeout* raises.

## Async

`AsyncPareta` raises the exact same exception classes; wrap `await` calls in the
same `try`/`except`. Retries, backoff, and timeouts behave identically — backoff
just uses `asyncio.sleep` under the hood.

```python
import asyncio
from pareta import AsyncPareta, InsufficientCreditsError, EndpointNotReadyError

async def main():
    async with AsyncPareta.from_env() as pa:
        try:
            resp = await pa.chat.completions.create(
                model="ep_contract_kie",
                messages=[{"role": "user", "content": "Extract the parties."}],
            )
            print(resp.choices[0].message.content)
        except InsufficientCreditsError:
            print("Top up your org balance in the dashboard.")
        except EndpointNotReadyError:
            await pa.endpoints.start("ep_contract_kie")

asyncio.run(main())
```

## A layered handler

A practical pattern: catch the few cases you can act on, then fall back to the
base class so nothing escapes unhandled.

```python
from pareta import (
    Pareta,
    InsufficientCreditsError,
    EndpointNotReadyError,
    RateLimitError,
    APITimeoutError,
    ParetaError,
)

with Pareta.from_env() as pa:
    try:
        resp = pa.chat.completions.create(
            model="ep_contract_kie",
            messages=[{"role": "user", "content": "Extract the parties."}],
        )
        print(resp.choices[0].message.content)
    except InsufficientCreditsError:
        print("Out of balance — top up in the dashboard.")
    except EndpointNotReadyError:
        pa.endpoints.start("ep_contract_kie")  # warm it, then retry
    except RateLimitError:
        print("Rate limited after retries — back off and try again.")
    except APITimeoutError:
        print("Timed out — raise the timeout or stream the response.")
    except ParetaError as e:
        print(f"Unexpected SDK error: {e}")  # request_id is on APIStatusError subclasses
```

## See also

- [Inference](./inference.md) — OpenAI-compatible chat completions and streaming
- [Endpoints](deploying-endpoints.md) — deploy, start, stop, and the `is_live` check
- [Evals](evaluation.md) — eval sets, runs, `wait`, and `run.cost`
- [Tasks](discovery.md) — the benchmark catalog and `match()`



---

<!-- guide/async.md -->

# Async usage

`AsyncPareta` is the asyncio-native client. It mirrors the synchronous [`Pareta`](./quickstart.md) client method-for-method: same constructor, same resource namespaces (`chat`, `models`, `endpoints`, `tasks`, `evals`), same return types. The difference is that request methods are coroutines you `await`, streams are async iterators you drive with `async for`, and many independent calls can run concurrently under one event loop instead of blocking one after another.

Reach for it when you are inside an async app (FastAPI, an aiohttp worker, a Discord bot) or when you want to fan out work: score ten models in parallel, deploy several endpoints at once, or run inference against a batch of inputs without waiting on each round trip.

## The client

Build it from the environment, exactly like the sync client. `from_env()` reads `PARETA_API_KEY` and the optional `PARETA_BASE_URL`.

```python
import asyncio
from pareta import AsyncPareta


async def main():
    pa = AsyncPareta.from_env()  # reads PARETA_API_KEY
    try:
        models = await pa.models.list()
        for m in models:
            print(m.id, m.owned_by)
    finally:
        await pa.aclose()


asyncio.run(main())
```

`models.list()` returns the same `ModelList` as the sync path: only deployed, url-bearing endpoints, OpenAI-compatible. The `id` of each is what you pass to `chat.completions.create(model=...)`.

### Lifecycle: prefer `async with`

The client owns an `httpx.AsyncClient` and you must release it. Use `async with` and cleanup is automatic; otherwise call `await pa.aclose()` in a `finally`.

```python
import asyncio
from pareta import AsyncPareta


async def main():
    async with AsyncPareta.from_env() as pa:
        models = await pa.models.list()
        print(len(models), "endpoints")
    # the underlying HTTP client is closed here


asyncio.run(main())
```

The async client is released either with `await pa.aclose()` or by `async with` (which calls `__aenter__` / `__aexit__`). There is no sync `close()` on the async client. If you pass your own `http_client=httpx.AsyncClient(...)`, the SDK will not close it for you; that one is yours to manage.

You can also pass `api_key=`, `base_url=`, `timeout=`, and `max_retries=` directly, same as the sync client:

```python
from pareta import AsyncPareta

pa = AsyncPareta(api_key="pareta_sk_...", max_retries=4)
```

## Await every request method

Every resource method that hits the API is a coroutine. Await it.

```python
import asyncio
from pareta import AsyncPareta


async def main():
    async with AsyncPareta.from_env() as pa:
        # inference (OpenAI-compatible)
        completion = await pa.chat.completions.create(
            model="ep_contract_kie_01",
            messages=[{"role": "user", "content": "Extract the total due."}],
            temperature=0,
        )
        print(completion.choices[0].message.content)
        print(completion.usage.total_tokens, "tokens")

        # catalog discovery
        match = await pa.tasks.match("pull key fields out of contracts")
        if match.matched:
            print("task:", match.chosen.task_id, match.chosen.confidence)

        # eval roster
        frontier = await pa.evals.frontier_models(task="contract-key-fields")
        print([f.id for f in frontier])


asyncio.run(main())
```

`chat.completions.create()` is metered: a successful completion debits your org balance. If the balance is empty it raises `InsufficientCreditsError` (402). Top-up is browser-only; the SDK does not expose balance or payment. See [Errors](errors-and-retries.md) and [Billing](core-concepts.md).

Note one shape difference inside `endpoints`: `pa.endpoints.metrics(endpoint_id)` is **not** a coroutine. It returns an `AsyncMetrics` object synchronously; the dimension methods on it are what you await.

```python
m = pa.endpoints.metrics("ep_contract_kie_01")   # no await here
cost = await m.cost()                              # await the dimension call
print(cost)
```

## Streaming with `async for`

Streaming chat works in two steps. First `await` the `create(stream=True)` call to get the async iterator, then drive it with `async for`. Each chunk is a `ChatCompletionChunk`; the incremental text is `chunk.choices[0].delta.content` (which can be `None` on non-content frames, so guard it).

```python
import asyncio
from pareta import AsyncPareta


async def main():
    async with AsyncPareta.from_env() as pa:
        stream = await pa.chat.completions.create(
            model="ep_contract_kie_01",
            messages=[{"role": "user", "content": "Summarize this clause."}],
            stream=True,
        )
        async for chunk in stream:
            delta = chunk.choices[0].delta.content
            if delta:
                print(delta, end="", flush=True)
        print()


asyncio.run(main())
```

The stream ends on the wire's `[DONE]` sentinel; the async iterator simply stops. Retries apply only to the initial handshake. Once bytes are flowing, a mid-stream drop raises immediately rather than silently resuming.

### Deploying with progress events

`endpoints.deploy()` takes a task and a model alias and nothing about hardware. Pareta hides GPUs entirely: there is no GPU, tensor-parallel, or quantization knob. `model` defaults to `"recommended"`, the task's curated or top-open pick. Models are addressed by per-task public aliases, never raw weights ids.

With `wait=False` (the default), `await` the call to get an async iterator of `{"event": str, "data": dict}` progress events; the terminal event is `"complete"` (with `data["endpoint"]`) or `"error"`.

```python
import asyncio
from pareta import AsyncPareta


async def main():
    async with AsyncPareta.from_env() as pa:
        stream = await pa.endpoints.deploy(task="contract-key-fields", model="recommended")
        async for event in stream:
            if event["event"] == "progress":
                print("stage:", event["data"])
            elif event["event"] == "complete":
                ep = event["data"]["endpoint"]
                print("live:", ep["id"], ep["url"])
            elif event["event"] == "error":
                print("failed:", event["data"])


asyncio.run(main())
```

If you do not want to watch progress, pass `wait=True`. The SDK consumes the stream internally and returns the live `Endpoint` once it is up, raising `ParetaError` on a deploy `"error"` event.

```python
ep = await pa.endpoints.deploy(task="contract-key-fields", model="recommended", wait=True)
print(ep.id, ep.is_live, ep.url)

# then call it
completion = await pa.chat.completions.create(
    model=ep.id,
    messages=[{"role": "user", "content": "Extract the parties."}],
)
```
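
To make the relationship between the two modes concrete, here is a rough sketch of draining an event stream down to its terminal event by hand. It is not the SDK's implementation: `wait_for_endpoint`, `DeployFailed` (standing in for `ParetaError`), and the simulated `fake_deploy` events are all hypothetical, and the payload fields are illustrative only.

```python
import asyncio
from typing import AsyncIterator


class DeployFailed(Exception):
    """Hypothetical stand-in for the ParetaError raised on a failed deploy."""


async def wait_for_endpoint(events: AsyncIterator[dict]) -> dict:
    """Drain a deploy event stream and return the endpoint from the terminal event."""
    async for event in events:
        if event["event"] == "complete":
            return event["data"]["endpoint"]
        if event["event"] == "error":
            raise DeployFailed(str(event["data"]))
        # Anything else (for example "progress") is informational; keep reading.
    raise DeployFailed("stream ended without a terminal event")


async def fake_deploy():
    # Simulated events; the payload fields here are illustrative only.
    yield {"event": "progress", "data": {"stage": "pulling weights"}}
    yield {"event": "progress", "data": {"stage": "warming up"}}
    yield {"event": "complete", "data": {"endpoint": {"id": "ep_demo", "status": "live"}}}


endpoint = asyncio.run(wait_for_endpoint(fake_deploy()))
print(endpoint["id"], endpoint["status"])
```

In practice `wait=True` is the simpler choice; a hand-rolled loop like this is only worth it when you also want to log or display the intermediate events.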

## Running many calls concurrently

This is the reason to go async. Independent calls can run at the same time under one event loop with `asyncio.gather`, instead of serializing on each network round trip. Reuse one client across all of them so they share the connection pool.

### Fan out inference over a batch

```python
import asyncio
from pareta import AsyncPareta

PROMPTS = [
    "Extract the invoice total.",
    "Extract the due date.",
    "Extract the vendor name.",
    "Extract the PO number.",
]


async def classify(pa: AsyncPareta, model: str, prompt: str) -> str:
    completion = await pa.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return completion.choices[0].message.content


async def main():
    async with AsyncPareta.from_env() as pa:
        results = await asyncio.gather(
            *(classify(pa, "ep_contract_kie_01", p) for p in PROMPTS)
        )
        for prompt, answer in zip(PROMPTS, results):
            print(prompt, "->", answer)


asyncio.run(main())
```

Each of those `create()` calls is metered independently and debits the org balance on success. If your balance runs out mid-batch, the in-flight calls that have not yet been billed raise `InsufficientCreditsError`. With plain `gather`, the first exception propagates to the caller, but the other calls are not cancelled: they keep running (and can still be billed) in the background. Pass `return_exceptions=True` if you would rather collect partial results and inspect failures per item.

```python
results = await asyncio.gather(
    *(classify(pa, "ep_contract_kie_01", p) for p in PROMPTS),
    return_exceptions=True,
)
for prompt, result in zip(PROMPTS, results):
    if isinstance(result, Exception):
        print(prompt, "FAILED:", result)
    else:
        print(prompt, "->", result)
```
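
When a batch is large, it can help to split the mixed result list into successes and failures so that only the failed prompts are retried. The following is a self-contained sketch of that pattern; `call` is a hypothetical stand-in for an inference request and `partition` is an illustrative helper, not part of the SDK.

```python
import asyncio


async def call(prompt: str) -> str:
    # Hypothetical stand-in for a metered inference call; one prompt fails on purpose.
    await asyncio.sleep(0)
    if "bad" in prompt:
        raise RuntimeError(f"failed: {prompt}")
    return prompt.upper()


def partition(prompts, results):
    """Split gather(return_exceptions=True) output into successes and failures."""
    ok, failed = {}, {}
    for prompt, result in zip(prompts, results):
        if isinstance(result, BaseException):
            failed[prompt] = result
        else:
            ok[prompt] = result
    return ok, failed


async def main():
    prompts = ["total", "bad date", "vendor"]
    results = await asyncio.gather(*(call(p) for p in prompts), return_exceptions=True)
    return partition(prompts, results)


ok, failed = asyncio.run(main())
print(ok)
print(list(failed))
```

`gather` returns results in the order the awaitables were passed, which is why zipping them back against the prompt list is safe.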

### Deploy several endpoints at once

Calling `deploy(..., wait=True)` returns a coroutine, so a list of deploys parallelizes cleanly with `gather`.

```python
import asyncio
from pareta import AsyncPareta

TASKS = ["contract-key-fields", "invoice-extraction", "doc-classification"]


async def main():
    async with AsyncPareta.from_env() as pa:
        endpoints = await asyncio.gather(
            *(pa.endpoints.deploy(task=t, model="recommended", wait=True) for t in TASKS)
        )
        for ep in endpoints:
            print(ep.task, ep.id, ep.is_live)


asyncio.run(main())
```

### Run several eval runs in parallel

`evals.runs.create(..., wait=True)` polls `runs.retrieve()` until the run is terminal using `asyncio.sleep`, so it never blocks the loop. That makes a leaderboard sweep, one run per candidate set, a natural `gather`.

```python
import asyncio
from pareta import AsyncPareta

ITEMS = [
    {"input": "Acme Corp agrees to pay $5,000 net 30.", "expected": {"amount": "5000"}},
    {"input": "Total due: $1,200 by 2026-07-01.", "expected": {"amount": "1200"}},
]


async def main():
    async with AsyncPareta.from_env() as pa:
        # create one shared eval set, then sweep candidate model lists against it
        eval_set = await pa.evals.sets.create(task="contract-key-fields", items=ITEMS)
        print("eval set:", eval_set.id, eval_set.item_count, "items")

        candidate_lists = [
            ["contract-key-fields-open-1"],
            ["contract-key-fields-open-2"],
        ]
        runs = await asyncio.gather(
            *(
                pa.evals.runs.create(eval_set=eval_set.id, models=models, wait=True)
                for models in candidate_lists
            )
        )
        for run in runs:
            print(run.id, run.status, "cost", run.cost)  # run.cost is a Decimal in dollars
            for r in run.results:
                print(" ", r.model_id, r.kind, r.quality_mean)


asyncio.run(main())
```

Eval runs are metered against the org balance for the compute used (open candidates plus any frontier baselines), and raise `InsufficientCreditsError` on an empty balance. `run.cost` is a `Decimal` in dollars, floored to whole cents (so a sub-cent run reads `Decimal("0.00")`); `run.cost_micro_usd` is the raw integer micro-USD if you need the exact figure. See [Evals](evaluation.md) and [Billing](core-concepts.md).
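
The relationship between the two cost fields is plain arithmetic: one dollar is 1,000,000 micro-USD, and the dollar figure is rounded down to cents. The sketch below reproduces that conversion with the standard `decimal` module; the function name is hypothetical and the code is an illustration of the rounding rule, not the SDK's own code.

```python
from decimal import Decimal, ROUND_FLOOR

MICRO_PER_USD = 1_000_000


def dollars_floored_to_cents(micro_usd: int) -> Decimal:
    """Convert integer micro-USD to dollars, rounding down to whole cents."""
    exact = Decimal(micro_usd) / MICRO_PER_USD
    return exact.quantize(Decimal("0.01"), rounding=ROUND_FLOOR)


print(dollars_floored_to_cents(12_345_678))  # 12.34
print(dollars_floored_to_cents(9_999))       # 0.00
```

Because of the flooring, summing `run.cost` across many small runs can understate total spend; sum the micro-USD integers and convert once if you need an exact total.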

To add vendor baselines, pass `frontier=`. In the async client, `"all"` and `"benchmarked"` resolve the roster by awaiting `evals.frontier_models()` SDK-side, so they need a task to resolve against (taken from `task=` or looked up from the eval set):

```python
run = await pa.evals.runs.create(
    eval_set=eval_set.id,
    models=["contract-key-fields-open-1"],
    frontier="benchmarked",   # or "all", or an explicit list of frontier ids, or None
    wait=True,
)
```

### Polling a run yourself

If you started a run with `wait=False`, await `runs.wait()` later, or poll `runs.retrieve()` on your own schedule. `wait()` accepts `poll_interval` (default 3.0s) and `timeout` (default 900s), and raises `ParetaError` if the run does not reach a terminal status in time.

```python
run = await pa.evals.runs.create(eval_set=eval_set.id, models=["contract-key-fields-open-1"])
print("queued:", run.id, run.status)
# ... do other work ...
final = await pa.evals.runs.wait(run.id, poll_interval=5.0, timeout=600.0)
print(final.status, final.is_terminal, final.cost)
```
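
If you poll on your own schedule, the loop looks roughly like the sketch below. It is an approximation for illustration, not the SDK's implementation: `wait_until_terminal` and `make_fake_retrieve` are hypothetical, the status strings are made up, and the sketch raises the built-in `TimeoutError` where the SDK raises `ParetaError`.

```python
import asyncio
import time
from types import SimpleNamespace


async def wait_until_terminal(retrieve, run_id, poll_interval=3.0, timeout=900.0):
    """Poll `retrieve(run_id)` until the run reports is_terminal, or time out."""
    deadline = time.monotonic() + timeout
    while True:
        run = await retrieve(run_id)
        if run.is_terminal:
            return run
        if time.monotonic() + poll_interval > deadline:
            raise TimeoutError(f"run {run_id} not terminal after {timeout}s")
        await asyncio.sleep(poll_interval)  # yields to the event loop between polls


def make_fake_retrieve(statuses):
    # Hypothetical stand-in for runs.retrieve(): returns each status in turn.
    it = iter(statuses)

    async def retrieve(run_id):
        status = next(it)
        return SimpleNamespace(id=run_id, status=status, is_terminal=status == "completed")

    return retrieve


final = asyncio.run(
    wait_until_terminal(
        make_fake_retrieve(["queued", "running", "completed"]),
        "run_demo",
        poll_interval=0.01,
        timeout=1.0,
    )
)
print(final.status)
```

Using `asyncio.sleep` rather than `time.sleep` is what lets other coroutines make progress while a run is still in flight.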

## Bounding concurrency

`gather` launches everything at once. For large batches, cap the in-flight count with an `asyncio.Semaphore` so you do not overwhelm a single endpoint or trip rate limits (which surface as `RateLimitError`, 429; the client already retries those with backoff up to `max_retries`).

```python
import asyncio
from pareta import AsyncPareta


async def main():
    sem = asyncio.Semaphore(5)  # at most 5 concurrent requests

    async with AsyncPareta.from_env() as pa:
        async def one(prompt: str) -> str:
            async with sem:
                completion = await pa.chat.completions.create(
                    model="ep_contract_kie_01",
                    messages=[{"role": "user", "content": prompt}],
                )
                return completion.choices[0].message.content

        prompts = [f"Extract field {i}." for i in range(100)]
        answers = await asyncio.gather(*(one(p) for p in prompts))
        print(len(answers), "done")


asyncio.run(main())
```
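
The semaphore pattern still creates every coroutine up front. For very large batches, an alternative is a fixed pool of workers pulling from a queue, so only a handful of calls exist at any moment. The sketch below shows that design with a hypothetical `run_pool` helper and a `fake_call` stand-in for an inference request; it runs without the SDK.

```python
import asyncio


async def run_pool(items, worker_fn, workers: int = 5):
    """Process items with a fixed number of workers; results keep input order."""
    queue: asyncio.Queue = asyncio.Queue()
    for index, item in enumerate(items):
        queue.put_nowait((index, item))
    results = [None] * len(items)

    async def worker():
        while True:
            try:
                index, item = queue.get_nowait()
            except asyncio.QueueEmpty:
                return  # nothing left to do
            results[index] = await worker_fn(item)

    await asyncio.gather(*(worker() for _ in range(workers)))
    return results


async def main():
    in_flight = 0
    peak = 0

    async def fake_call(prompt: str) -> str:
        # Hypothetical stand-in for one inference request; tracks concurrency.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return prompt.upper()

    prompts = [f"field {i}" for i in range(20)]
    answers = await run_pool(prompts, fake_call, workers=5)
    return answers, peak


answers, peak = asyncio.run(main())
print(len(answers), "done; peak in flight:", peak)
```

In this sketch an exception in one worker propagates out of `gather` while the other workers keep draining the queue, so wrap `worker_fn` in your own error handling if you need per-item failures.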

## Errors

The async client raises the exact same exception hierarchy as the sync client; the only difference is that errors surface out of an awaited call or an `async for`. Catch them the usual way.

```python
from pareta import (
    AsyncPareta,
    InsufficientCreditsError,
    EndpointNotReadyError,
    RateLimitError,
    ParetaError,
)


async def safe_call(pa: AsyncPareta):
    try:
        return await pa.chat.completions.create(
            model="ep_contract_kie_01",
            messages=[{"role": "user", "content": "hi"}],
        )
    except InsufficientCreditsError:
        print("org balance is empty; top up in the dashboard")
    except EndpointNotReadyError:
        print("endpoint is cold or stopped; start it and retry")
    except RateLimitError:
        print("rate limited even after retries")
    except ParetaError as e:
        print("pareta error:", e)
```

Pre-flight validation (empty `model`/`messages`, empty `items`, an unparseable `frontier`) raises `ValueError`/`TypeError` when you `await` the call — the check runs at the top of the coroutine, before any network I/O (not when the coroutine object is first created). See [Errors](errors-and-retries.md) for the full mapping.
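
The timing distinction is a general property of Python coroutines and can be seen without the SDK. In this sketch, `create` is a hypothetical stand-in that validates its arguments at the top of the coroutine body; building the coroutine object does not raise, awaiting it does.

```python
import asyncio


async def create(model: str, messages: list) -> str:
    # Hypothetical stand-in: validate first, before any (simulated) network I/O.
    if not model or not messages:
        raise ValueError("model and messages are required")
    await asyncio.sleep(0)
    return "ok"


async def main():
    coro = create(model="", messages=[])  # no error yet: the body has not started
    try:
        await coro                        # the check runs here
    except ValueError as exc:
        return f"raised on await: {exc}"
    return "no error"


outcome = asyncio.run(main())
print(outcome)
```

The practical consequence is that the `try`/`except` must wrap the `await` (or the `gather`), not the line that merely builds the coroutine.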

## Sync and async, side by side

| Concern | `Pareta` (sync) | `AsyncPareta` (async) |
|---|---|---|
| Build | `Pareta.from_env()` | `AsyncPareta.from_env()` |
| Cleanup | `pa.close()` / `with pa:` | `await pa.aclose()` / `async with pa:` |
| Request method | `pa.models.list()` | `await pa.models.list()` |
| Streaming chat | `for chunk in pa.chat.completions.create(stream=True)` | `stream = await pa...create(stream=True)` then `async for chunk in stream` |
| Deploy events | `for ev in pa.endpoints.deploy(...)` | `stream = await pa.endpoints.deploy(...)` then `async for ev in stream` |
| Deploy and block | `pa.endpoints.deploy(..., wait=True)` | `await pa.endpoints.deploy(..., wait=True)` |
| Wait on a run | `pa.evals.runs.wait(run_id)` | `await pa.evals.runs.wait(run_id)` |
| Metrics handle | `pa.endpoints.metrics(id)` (sync) | `pa.endpoints.metrics(id)` (sync, returns `AsyncMetrics`) |
| Metrics dimension | `m.cost()` | `await m.cost()` |
| Concurrency | thread pool / one at a time | `asyncio.gather`, one event loop |

Same metering, same aliases, same OpenAI-compatible inference, same hidden GPUs. Once you have the sync flow in [Quickstart](./quickstart.md), the async version is the same calls with `await` in front and `async for` over the streams.



---

<!-- guide/configuration.md -->

# Configuration

Every Pareta call goes through one client object. Configuration is just how you build that client: which API key it sends, which environment it points at, how patient it is on slow or flaky requests, and (optionally) what HTTP stack it rides on. This page covers all of it for both `Pareta` (sync) and `AsyncPareta` (async).

The short version: set `PARETA_API_KEY` and use `Pareta.from_env()`. Everything below is for when the defaults are not enough.

```python
from pareta import Pareta

pa = Pareta.from_env()
print(pa.models.list())
```

## The fast path: `from_env()`

`from_env()` reads two environment variables and builds the client for you:

- `PARETA_API_KEY` — your `pareta_sk_…` key (required)
- `PARETA_BASE_URL` — optional environment override (defaults to production)

```bash
export PARETA_API_KEY="pareta_sk_live_…"
```

```python
from pareta import Pareta

pa = Pareta.from_env()
```

`from_env()` forwards any extra keyword arguments straight to the constructor, so you can keep the key in the environment while overriding everything else in code:

```python
pa = Pareta.from_env(max_retries=5, timeout=120.0)
```

The async client works the same way:

```python
from pareta import AsyncPareta

pa = AsyncPareta.from_env()
```

Prefer `from_env()` over hardcoding keys. It keeps `pareta_sk_…` secrets out of source control and lets the same code run against staging or production by flipping one environment variable.
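
As a mental model, the lookup can be pictured as the small function below. This is a hedged sketch rather than the SDK's code: `resolve_settings` is a hypothetical name, it raises `RuntimeError` where the SDK raises `ParetaError`, and it takes the environment as a mapping so it can be exercised without touching real variables.

```python
import os
from typing import Mapping, Optional, Tuple

DEFAULT_BASE_URL = "https://api.pareta.ai"


def resolve_settings(env: Optional[Mapping[str, str]] = None) -> Tuple[str, str]:
    """Return (api_key, base_url) from an environment mapping."""
    env = os.environ if env is None else env
    api_key = env.get("PARETA_API_KEY", "")
    if not api_key:
        raise RuntimeError("missing API key: set PARETA_API_KEY")
    base_url = env.get("PARETA_BASE_URL") or DEFAULT_BASE_URL
    return api_key, base_url.rstrip("/")  # trailing slash stripped


# Placeholder key for illustration only.
print(resolve_settings({"PARETA_API_KEY": "pareta_sk_test_example"}))
```

Passing a mapping instead of reading `os.environ` directly also makes this kind of logic easy to unit test.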

## Constructor parameters

Both clients take the same arguments:

```python
from pareta import Pareta

pa = Pareta(
    api_key="pareta_sk_live_…",
    base_url="https://api.pareta.ai",
    timeout=60.0,
    max_retries=2,
    http_client=None,
)
```

| Parameter | Type | Default | What it does |
|-----------|------|---------|--------------|
| `api_key` | `str \| None` | `None` | Your `pareta_sk_…` key. Sent as a Bearer token. Required. |
| `base_url` | `str \| None` | `"https://api.pareta.ai"` | API root. Use the staging URL to point at the staging environment. |
| `timeout` | `httpx.Timeout \| float \| None` | `httpx.Timeout(60.0, connect=10.0)` | Per-request timeout. |
| `max_retries` | `int` | `2` | Automatic retries on transient failures. |
| `http_client` | `httpx.Client \| httpx.AsyncClient \| None` | `None` | Bring your own httpx client (proxies, custom transports, connection pools). |

`AsyncPareta` is identical except `http_client` takes an `httpx.AsyncClient`.

## `api_key`

The key is the one required piece of configuration. Pass it explicitly or via `PARETA_API_KEY`; the SDK sends it as `Authorization: Bearer <key>` on every request.

```python
from pareta import Pareta

pa = Pareta(api_key="pareta_sk_live_…")
```

If the key is missing or empty (and the env var is unset when using `from_env()`), the constructor raises `ParetaError` before any network call:

```python
from pareta import Pareta, ParetaError

try:
    pa = Pareta(api_key="")
except ParetaError as e:
    print(e)
    # missing API key. Pass api_key=… or set PARETA_API_KEY
    # (mint a pareta_sk_ key in the dashboard).
```

Mint keys in the dashboard. If the key is present but rejected by the server, you get a `401` as `AuthenticationError` on the first request, not at construction time. See [Errors](errors-and-retries.md) for the full exception hierarchy.

## `base_url` (production vs staging)

`base_url` selects the environment. It defaults to production, and any trailing slash is stripped, so `https://api.pareta.ai/` and `https://api.pareta.ai` behave identically.

| Environment | `base_url` |
|-------------|------------|
| Production (default) | `https://api.pareta.ai` |
| Staging | `https://api-staging.pareta.ai` |

```python
from pareta import Pareta

# Production — base_url omitted, defaults applied.
prod = Pareta(api_key="pareta_sk_live_…")

# Staging — pass it explicitly, or set PARETA_BASE_URL.
staging = Pareta(
    api_key="pareta_sk_test_…",
    base_url="https://api-staging.pareta.ai",
)
```

Via the environment, no code change needed:

```bash
export PARETA_API_KEY="pareta_sk_test_…"
export PARETA_BASE_URL="https://api-staging.pareta.ai"
```

```python
from pareta import Pareta

pa = Pareta.from_env()   # now talks to staging
```

Keys are environment-scoped: a production key will not authenticate against staging and vice versa. Pair each `base_url` with a key minted for that environment.

## `timeout`

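`timeout` caps how long a single request may wait. The default is `httpx.Timeout(60.0, connect=10.0)`: up to 10 seconds to establish the connection and 60 seconds for each of the other phases (read, write, and acquiring a pooled connection). These are httpx's per-phase limits rather than one wall-clock deadline for the whole request. A bare float applies the same limit to every phase, connect included.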

```python
import httpx
from pareta import Pareta

# Simple: one number for everything.
pa = Pareta(api_key="pareta_sk_live_…", timeout=120.0)

# Granular: long read budget for big generations, short connect budget.
pa = Pareta(
    api_key="pareta_sk_live_…",
    timeout=httpx.Timeout(120.0, connect=10.0),
)
```

When to raise it:

- **Long completions.** A 4096-token generation can run well past 60 seconds. Either raise `timeout` or stream the response so tokens arrive incrementally (see [Inference](inference.md)).
- **Long eval runs.** `evals.runs.create(..., wait=True)` does its own polling and has a separate `timeout` argument (default `900.0` seconds) that governs the wait loop, independent of the per-request HTTP timeout. See [Evals](evaluation.md).

A request that exceeds the timeout raises `APITimeoutError` (a subclass of `APIConnectionError`) after retries are exhausted.

## `max_retries`

The SDK automatically retries transient failures. The default is `2` (so up to 3 attempts total). Values below zero are clamped to `0`.

Among HTTP responses, only these status codes trigger a retry:

```
408  Request Timeout
409  Conflict (transient lock/contention)
429  Too Many Requests
500  Internal Server Error
502  Bad Gateway
503  Service Unavailable
504  Gateway Timeout
```

Backoff is exponential with jitter, capped at 8 seconds: `min(0.5 * 2 ** attempt, 8.0)` plus a small random jitter. When the server sends a `Retry-After` header, the SDK honors it (capped at 30 seconds) instead of computing its own delay.

```python
from pareta import Pareta

# More patient: handy for batch jobs against a busy environment.
pa = Pareta(api_key="pareta_sk_live_…", max_retries=5)

# Disable retries entirely: every failure surfaces immediately.
pa = Pareta(api_key="pareta_sk_live_…", max_retries=0)
```
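
To get a feel for how long a retried request can take, the backoff rule described above can be written out directly. This is a sketch under stated assumptions, not the SDK's code: the function names are hypothetical, `attempt` is assumed to count from zero, and the jitter is left as a caller-supplied bound because its size is not documented.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, retry_after: Optional[float] = None, jitter: float = 0.0) -> float:
    """Delay in seconds before retry number `attempt` (0-based, an assumption)."""
    if retry_after is not None:
        return min(retry_after, 30.0)          # server hint wins, capped at 30 s
    base = min(0.5 * 2 ** attempt, 8.0)        # exponential, capped at 8 s
    return base + random.uniform(0.0, jitter)  # jitter size is an assumption


def total_attempts(max_retries: int) -> int:
    """One initial attempt plus the retries; negative values clamp to zero."""
    return 1 + max(max_retries, 0)


print([backoff_delay(a) for a in range(6)])
print(backoff_delay(0, retry_after=120.0), total_attempts(2), total_attempts(-3))
```

Under these assumptions, the default `max_retries=2` adds at most about 1.5 seconds of backoff (0.5 + 1.0, before jitter) on top of the request time itself, while `max_retries=5` can add over 15 seconds.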

What is *not* retried:

- **4xx errors other than 408/409/429** — these are your request, not a transient blip. A `402 InsufficientCreditsError`, `401 AuthenticationError`, or `404 NotFoundError` raises on the first attempt.
- **Connection errors, once the retry budget is spent.** Failures on initial connect (DNS, TCP, TLS refusal) draw on the handshake's retry budget and raise as soon as it is exhausted.
- **Mid-stream drops.** Streaming calls (`chat.completions.create(stream=True)`, `endpoints.deploy()`) retry only the initial handshake. Once SSE bytes are flowing, a drop raises immediately, because a partial stream cannot be safely resumed.

A `409` is worth a note: it is in the retry set because some backends use it for transient lock contention. Pareta's own stable `409` (for example, a seed or legacy endpoint) simply exhausts the retries and then raises `ConflictError`, so you see the right error either way. See [Errors](errors-and-retries.md).

## `http_client` (bring your own httpx)

By default the client constructs its own httpx client, configured with your `timeout`, and closes it for you. Pass `http_client=` when you need control over the transport layer: an outbound proxy, a custom transport, mTLS, shared connection pools, or test doubles.

```python
import httpx
from pareta import Pareta

# Route through a corporate proxy with a tuned connection pool.
my_client = httpx.Client(
    proxy="http://proxy.internal:8080",
    limits=httpx.Limits(max_connections=50, max_keepalive_connections=10),
    timeout=httpx.Timeout(120.0, connect=10.0),
)

pa = Pareta(api_key="pareta_sk_live_…", http_client=my_client)
```

The async client takes an `httpx.AsyncClient`:

```python
import httpx
from pareta import AsyncPareta

my_async_client = httpx.AsyncClient(proxy="http://proxy.internal:8080")
pa = AsyncPareta(api_key="pareta_sk_live_…", http_client=my_async_client)
```

**Ownership matters.** When you inject a client, you own its lifecycle. `pa.close()` (or `await pa.aclose()`) will *not* close a client you passed in. Close it yourself:

```python
my_client.close()           # you opened it, you close it
```

When the SDK creates the client (the default), `close()` / `aclose()` does close it. The context-manager forms below rely on this.

One caveat: an injected client carries its own timeout configuration. The constructor's `timeout` argument is applied to the SDK-owned client only, so set the timeout on your own client when you bring one.
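
The ownership rule is a common pattern, and a toy version makes the behavior easy to see. The classes below are hypothetical stand-ins (no httpx, no SDK): the wrapper records whether it created the client and only closes the client in that case.

```python
class FakeHTTPClient:
    """Hypothetical stand-in for an httpx client; only tracks whether it was closed."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class OwningWrapper:
    """Closes the HTTP client only if the wrapper created it."""

    def __init__(self, http_client=None):
        self._owns_client = http_client is None
        self._client = FakeHTTPClient() if http_client is None else http_client

    def close(self):
        if self._owns_client:
            self._client.close()

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()


mine = FakeHTTPClient()
with OwningWrapper(http_client=mine):
    pass
print("injected client closed by wrapper:", mine.closed)

mine.close()  # the caller's responsibility
print("after closing it yourself:", mine.closed)
```

The design keeps a shared client safe to reuse across several wrappers, at the cost of making the caller responsible for the final `close()`.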

## Lifecycle and cleanup

Each client owns an HTTP connection pool. Release it when you are done.

### Sync

Use the context manager so cleanup happens automatically:

```python
from pareta import Pareta

with Pareta.from_env() as pa:
    completion = pa.chat.completions.create(
        model="ep_contract_kie_qwen",
        messages=[{"role": "user", "content": "Extract the parties."}],
    )
    print(completion.choices[0].message.content)
# HTTP client closed on exit
```

Or close it explicitly:

```python
pa = Pareta.from_env()
try:
    pa.models.list()
finally:
    pa.close()
```

### Async

```python
from pareta import AsyncPareta

async def main():
    async with AsyncPareta.from_env() as pa:
        models = await pa.models.list()
        print(models)
    # HTTP client closed on exit
```

Or close it explicitly with `await pa.aclose()`.

Remember the ownership rule: if you passed `http_client=`, neither `close()` nor exiting the context manager touches it. Close your own client.

## Platform truths worth knowing

These hold no matter how you configure the client. They are why there is no GPU knob, no balance API, and no per-environment model catalog to wire up.

- **GPUs are hidden.** You configure a key, a URL, timeouts, and retries — never hardware. `endpoints.deploy(task=…, model=…)` takes a task and a model alias; Pareta resolves the GPU, tensor-parallelism, and quantization from its registry. There is no hardware parameter anywhere in the SDK. See [Deploy endpoints](deploying-endpoints.md).
- **Models are per-task aliases.** Every model id you see or pass — in `deploy(model=…)`, on leaderboard rows, in `run.results[].model_id`, in `endpoints.list()[].model` — is a per-task public alias like `qwen-vl-2`. Real internal ids never cross into the SDK. See [Tasks and the catalog](discovery.md).
- **Inference and evals are metered against your org balance.** A successful `chat.completions.create()` debits your balance; `evals.runs.create()` debits for both open and frontier compute. `run.cost` reports the billed total as a `Decimal` in dollars (floored to whole cents), and `run.cost_micro_usd` the raw micro-USD. When the balance hits zero, both paths raise `InsufficientCreditsError` (402). Top-up is browser-only — the SDK exposes neither balance nor payment methods, by design. See [Evals](evaluation.md).
- **Inference is OpenAI-compatible.** `base_url` plus your `pareta_sk_…` key is a drop-in OpenAI endpoint. You can point the `openai` SDK at the same `base_url` to call a deployed endpoint; Pareta's SDK adds the control plane (deploy, eval, discovery) that `openai` cannot do. See [Inference](inference.md).

## Configuration cookbook

A few complete, runnable setups.

**Production, defaults, env-driven** — the recommended baseline:

```python
from pareta import Pareta

with Pareta.from_env() as pa:
    print(pa.models.list())
```

**Staging, patient retries, long timeout** — for a batch job against a busy environment:

```python
import httpx
from pareta import Pareta

pa = Pareta(
    api_key="pareta_sk_test_…",
    base_url="https://api-staging.pareta.ai",
    timeout=httpx.Timeout(180.0, connect=10.0),
    max_retries=5,
)
```

**Fail fast** — no retries, surface every error on the first attempt (good for tests):

```python
from pareta import Pareta

pa = Pareta(api_key="pareta_sk_test_…", max_retries=0)
```

**Async, custom transport** — own the httpx client, own the cleanup:

```python
import asyncio
import httpx
from pareta import AsyncPareta

async def main():
    client = httpx.AsyncClient(proxy="http://proxy.internal:8080")
    pa = AsyncPareta.from_env(http_client=client)
    try:
        print(await pa.models.list())
    finally:
        await client.aclose()   # you opened it, you close it

asyncio.run(main())
```

## See also

- [Inference](inference.md) — OpenAI-compatible chat completions, streaming, and metering.
- [Deploy endpoints](deploying-endpoints.md) — deploy a model to a task and operate it.
- [Tasks and the catalog](discovery.md) — discover tasks, match intent, read leaderboards.
- [Evals](evaluation.md) — score models on your own data, including `run.cost`.
- [Errors](errors-and-retries.md) — the full exception hierarchy and how retries interact with it.



---

<!-- examples/deploy-and-infer.md -->

# Deploy a model and call it

This is the shortest path from "I have a task" to "I'm getting completions back": pick a task, deploy the recommended open-weights model for it, then call the live endpoint with OpenAI-compatible inference. Pareta picks the GPU and serving config for you, so deploy takes a task and a model alias and nothing about hardware.

The whole flow is two calls:

```python
from pareta import Pareta

pa = Pareta.from_env()  # reads PARETA_API_KEY (+ optional PARETA_BASE_URL)

# 1. Deploy the recommended model for a task and block until it is live.
endpoint = pa.endpoints.deploy(
    task="contract-key-fields",
    model="recommended",
    wait=True,
)

# 2. Call it. The endpoint id is what you pass as `model`.
resp = pa.chat.completions.create(
    model=endpoint.id,
    messages=[
        {"role": "user", "content": "Extract the parties and effective date from this clause: ..."},
    ],
)
print(resp.choices[0].message.content)
```

That is the entire happy path. The rest of this page unpacks each step and the platform facts you should know before you run it for real.

## Before you start

You need a `pareta_sk_` API key. Mint one in the dashboard (key management is browser-only) and either pass it as `api_key=` or set `PARETA_API_KEY`. `Pareta.from_env()` reads `PARETA_API_KEY` and the optional `PARETA_BASE_URL`; the base URL defaults to `https://api.pareta.ai`.

```python
from pareta import Pareta

# Preferred: pull credentials from the environment.
pa = Pareta.from_env()

# Or pass them explicitly.
pa = Pareta(api_key="pareta_sk_...", base_url="https://api.pareta.ai")
```

The client is a context manager, which is the clean way to release the HTTP connection when you are done:

```python
with Pareta.from_env() as pa:
    endpoint = pa.endpoints.deploy(task="contract-key-fields", wait=True)
    resp = pa.chat.completions.create(
        model=endpoint.id,
        messages=[{"role": "user", "content": "..."}],
    )
```

## Step 1: Deploy

```python
endpoint = pa.endpoints.deploy(
    task="contract-key-fields",   # required: a subtask id from the catalog
    model="recommended",          # default; resolves to the task's curated pick
    wait=True,                    # block until the endpoint is live
)
```

`deploy()` takes a `task` and a `model`. There is no GPU, tensor-parallel, quantization, or run-mode knob. Pareta resolves the serving class from its registry, so you describe what you want to run, not the hardware to run it on.

A few things worth knowing:

- **`task` is required** and is a subtask id (for example `"contract-key-fields"`). Omitting it raises `ValueError`. Browse the catalog or turn free-text intent into a task id with the [tasks resource](../guide/discovery.md).
- **`model` defaults to `"recommended"`**, which resolves server-side to the task's curated pick (or the leaderboard's top open model). You can also pass a specific per-task model alias. To see what `"recommended"` will resolve to before you commit, call `pa.tasks.recommended(task_id)`.
- **`model` is always a per-task public alias**, never a raw open-weights model id. Real ids never cross into the SDK. The same is true of `endpoint.model` on the object you get back and of every model id in [evals](evaluate-on-your-data.md) and leaderboards.
- **`name` is optional.** Pareta auto-generates one if you do not pass it.

### wait=True versus the progress stream

With `wait=True` the call blocks through the deploy and returns the live `Endpoint`. If the deploy fails, it raises `ParetaError` with the backend's message.

If you want to surface progress (pulling weights, warming up, and so on), drop `wait` and iterate the event stream instead. Each event is a plain `{"event": str, "data": dict}` dict:

```python
endpoint_id = None
for event in pa.endpoints.deploy(task="contract-key-fields", model="recommended"):
    if event["event"] == "progress":
        print("deploying:", event["data"])
    elif event["event"] == "complete":
        endpoint_id = event["data"]["endpoint"]["id"]
        print("live:", endpoint_id)
    elif event["event"] == "error":
        raise RuntimeError(event["data"].get("message"))
```

`wait=True` consumes this exact stream for you and returns the `Endpoint` parsed from the `complete` event, so reach for the raw iterator only when you actually want to render progress.

### What you get back

The `Endpoint` object carries everything you need to call and operate the endpoint:

```python
endpoint.id        # the id you pass as `model` to chat.completions.create
endpoint.name      # display name
endpoint.model     # per-task public alias that was deployed
endpoint.status    # "live", "starting", "stopped", ...
endpoint.task      # task name
endpoint.url       # raw inference URL (OpenAI-compatible)
endpoint.is_live   # True when status == "live"
endpoint.to_dict() # the full raw record, nothing dropped
```

Note that `endpoint.id`, not `endpoint.model` (the per-task alias), is the value `chat.completions.create(model=...)` expects. After a `wait=True` deploy, `endpoint.is_live` is `True`.

## Step 2: Call the endpoint

Inference is OpenAI-compatible. Pass the endpoint id as `model` and a non-empty list of message dicts:

```python
resp = pa.chat.completions.create(
    model=endpoint.id,
    messages=[
        {"role": "system", "content": "You extract structured fields from contracts."},
        {"role": "user", "content": "Extract the parties and effective date: ..."},
    ],
    temperature=0,   # extra OpenAI params pass straight through
    max_tokens=512,
)

choice = resp.choices[0]
print(choice.message.content)
print(choice.finish_reason)          # "stop", "length", ...
print(resp.usage.total_tokens)       # prompt_tokens + completion_tokens
```

`model` and `messages` are both required; an empty or missing value for either raises `ValueError` before any request goes out. Any extra OpenAI keyword arguments (`temperature`, `max_tokens`, `top_p`, and so on) pass through as body fields untouched.

### Streaming

Set `stream=True` to get an iterator of `ChatCompletionChunk` objects. The incremental text lives on `chunk.choices[0].delta.content`:

```python
for chunk in pa.chat.completions.create(
    model=endpoint.id,
    messages=[{"role": "user", "content": "Summarize this clause: ..."}],
    stream=True,
):
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```

The stream ends on its own (the SDK consumes the terminal `[DONE]`). Retries cover only the initial connection; once tokens are flowing, a mid-stream drop raises immediately because it cannot be safely resumed.

### You do not strictly need this SDK to call the endpoint

Because inference is OpenAI-compatible, you can point the `openai` client at your Pareta base URL and key:

```python
from openai import OpenAI

client = OpenAI(api_key="pareta_sk_...", base_url="https://api.pareta.ai/v1")
resp = client.chat.completions.create(model=endpoint.id, messages=[...])
```

The Pareta SDK's value is the control plane around inference: deploy, operate, discover, and eval. Use whichever inference client you like once an endpoint is live.

## Cost and billing

Both deploy-driven inference and evals are metered against your org balance.

- **Inference debits on success.** Each completed `chat.completions.create()` call debits your org balance.
- **A zero balance raises `InsufficientCreditsError` (402)** on both inference and eval paths. Catch it and tell the user to top up.
- **Top-up is browser-only.** The SDK consumes credit; it never exposes balance, payment methods, or top-up. There is no API call to add funds.

```python
from pareta import InsufficientCreditsError

try:
    resp = pa.chat.completions.create(model=endpoint.id, messages=[...])
except InsufficientCreditsError:
    print("Org balance is empty. Top up in the dashboard, then retry.")
```

When you run an [eval](evaluate-on-your-data.md), the resulting `EvalRun` reports spend in dollars on `run.cost` (a `Decimal`, floored to whole cents) with the raw value on `run.cost_micro_usd`. Inference cost is metered server-side rather than returned on the completion object.

## Errors worth catching

Every SDK error subclasses `ParetaError`. The ones you will hit most around this flow:

```python
from pareta import (
    ParetaError,               # base class; also raised on a failed deploy
    AuthenticationError,       # 401 - bad or missing key
    InsufficientCreditsError,  # 402 - org out of credit; top up in the dashboard
    NotFoundError,             # 404 - unknown endpoint or task
    EndpointNotReadyError,     # 503 - endpoint stopped, cold, or provider down
    RateLimitError,            # 429 - throttled (auto-retried)
    BadRequestError,           # 400/422 - malformed request
)
```

If you call an endpoint that is stopped or still cold, you get `EndpointNotReadyError` (503). Start a stopped endpoint with `pa.endpoints.start(endpoint.id)`. The client auto-retries 429s and transient 5xx/timeouts with exponential backoff (`max_retries`, default 2).

## Operating the endpoint afterward

Once deployed, the endpoint persists. You can stop it to save spend and start it again later:

```python
pa.endpoints.stop(endpoint.id)    # take it offline
pa.endpoints.start(endpoint.id)   # bring it back
pa.endpoints.delete(endpoint.id)  # remove it entirely

for ep in pa.endpoints.list():    # everything your org can access
    print(ep.id, ep.status)
```

To see the endpoints you can call right now (the OpenAI-compatible subset with a live URL), use `models.list()`:

```python
for m in pa.models.list():
    print(m.id, m.owned_by)
```

Latency, uptime, cost-versus-frontier, and judge-quality readouts live under `pa.endpoints.metrics(endpoint.id)`. See [operate and measure endpoints](../guide/deploying-endpoints.md).

## Async

Everything above is mirrored on `AsyncPareta`. The shapes match; methods are `async def` and streams are async iterators.

```python
import asyncio
from pareta import AsyncPareta

async def main():
    async with AsyncPareta.from_env() as pa:
        endpoint = await pa.endpoints.deploy(
            task="contract-key-fields",
            model="recommended",
            wait=True,
        )
        resp = await pa.chat.completions.create(
            model=endpoint.id,
            messages=[{"role": "user", "content": "Extract the total: ..."}],
        )
        print(resp.choices[0].message.content)

        # Streaming: await create() once, then `async for` the chunks.
        stream = await pa.chat.completions.create(
            model=endpoint.id,
            messages=[{"role": "user", "content": "Summarize: ..."}],
            stream=True,
        )
        async for chunk in stream:
            print(chunk.choices[0].delta.content or "", end="")

asyncio.run(main())
```

For the async deploy progress stream (without `wait=True`), `deploy()` returns an async iterator you consume with `async for`, yielding the same `{"event", "data"}` dicts as the sync version.
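
The consumption pattern can be sketched without a live deploy. In the block below, `fake_deploy_events` is a hypothetical stand-in for the iterator that `deploy()` returns, and the `"stage"` key inside the progress payload is invented for illustration; only the `{"event", "data"}` envelope and the `data["endpoint"]["id"]` path come from this documentation.

```python
import asyncio
from typing import Any, AsyncIterator


async def fake_deploy_events() -> AsyncIterator[dict[str, Any]]:
    """Hypothetical stand-in for the async deploy progress stream."""
    yield {"event": "progress", "data": {"stage": "provisioning"}}
    yield {"event": "progress", "data": {"stage": "loading weights"}}
    yield {"event": "complete", "data": {"endpoint": {"id": "contracts-prod"}}}


async def follow_deploy(events: AsyncIterator[dict[str, Any]]) -> dict[str, Any]:
    """Drain the stream; return the final event's data or raise on error."""
    async for event in events:
        if event["event"] == "progress":
            print("...", event["data"])
        elif event["event"] == "complete":
            return event["data"]
        elif event["event"] == "error":
            raise RuntimeError(f"deploy failed: {event['data']}")
    raise RuntimeError("stream ended without a terminal event")


final = asyncio.run(follow_deploy(fake_deploy_events()))
print(final["endpoint"]["id"])
```

Treating a stream that ends without `"complete"` or `"error"` as a failure is a defensive choice in this sketch, not documented SDK behavior.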

## Next steps

- [Discover tasks](../guide/discovery.md): find the right task id from free-text intent before you deploy.
- [Evaluate models](evaluate-on-your-data.md): compare open models against frontier baselines on your own data before committing.
- [Operate and measure endpoints](../guide/deploying-endpoints.md): start, stop, and read latency, cost, and quality metrics.



---

<!-- examples/find-and-deploy-best-model.md -->

# From a sentence to a deployed winner

You have a job to do ("pull the key fields out of these contracts") and a pile of
your own examples. You want the cheapest open-weights model that actually does
the job well, serving live inference, without ever touching a GPU console.

This page walks the whole funnel, end to end:

1. **`tasks.match`** turns your plain-English intent into a benchmark task id.
2. **`tasks.leaderboard`** shows you which models lead that task and what the
   recommended pick is.
3. **`evals.runs.create`** scores a shortlist on *your* data, with frontier
   baselines for context.
4. You **pick the best open model** from the results (`kind == "open"`).
5. **`endpoints.deploy`** stands it up. Pareta resolves the hardware.
6. **`chat.completions.create`** runs OpenAI-compatible inference against it.

A few platform truths that shape the code below:

- **GPUs are hidden.** `endpoints.deploy(task=, model=)` takes a task and a model,
  never a GPU, tensor-parallel degree, or quantization knob. Pareta resolves the
  serving class.
- **Models are per-task aliases.** Every open-model id you see (`leaderboard`
  rows, `run.results[].model_id`, `endpoint.model`) is a public per-task alias.
  The real weights id never crosses into the SDK. Pass the alias straight back to
  `deploy(model=...)`.
- **Evals and inference are metered against your org balance.** An eval run debits
  for the open *and* frontier compute it used; `run.cost` is the billed total in
  dollars. A successful completion debits too. An empty balance raises
  `InsufficientCreditsError` (402). Top-up is browser-only; the SDK never exposes
  balance or payment.
- **Inference is OpenAI-compatible.** Once deployed, the endpoint behaves like any
  OpenAI chat endpoint.

## Setup

```python
from pareta import Pareta

pa = Pareta.from_env()   # reads PARETA_API_KEY (+ optional PARETA_BASE_URL)
```

`from_env()` is the preferred constructor. To pass the key explicitly:
`Pareta(api_key="pareta_sk_...")`.

## Step 1: Intent to task with `tasks.match`

`match` takes free text and returns a ranked `TaskMatch`. Use `.matched` to gate
on a confident hit and `.chosen` for the best candidate.

```python
m = pa.tasks.match("extract the key fields from a contract", top_k=5)

if not m.matched:
    # No confident hit. Inspect the alternates and pick one, or rephrase.
    for c in m.candidates:
        print(f"{c.task_id:30}  score={c.score:.2f}  ({c.confidence})")
    raise SystemExit("No confident task match. Pick a candidate above.")

task_id = m.chosen.task_id
print(f"matched {task_id}  (score={m.chosen.score:.2f}, via {m.matcher})")
if m.ambiguous:
    print("Heads up: the top two candidates scored close together.")
```

`TaskMatch` fields: `.query`, `.matched` (bool), `.chosen`
(`TaskMatchCandidate | None`), `.candidates` (ranked list), `.ambiguous`,
`.matcher` (`"keyword"` or `"semantic"`). Each `TaskMatchCandidate` has
`.task_id`, `.score` (0 to 1), and `.confidence` (`"high"` / `"medium"` /
`"low"`). `match` raises `ValueError` if the query is empty.

If you already know the task id, skip straight to step 2. Browse the full catalog
with `pa.tasks.list()` (each `Task` has `.id`, `.default_scorer`, and
`.has_blob_input`, where the last tells you whether the task takes documents or
images).

## Step 2: See who leads with `tasks.leaderboard`

The leaderboard ranks models for a task by quality and cost, names the
`recommended` deployable pick, and includes a `frontier` baseline so you know what
you are saving against.

```python
lb = pa.tasks.leaderboard(task_id)

print(f"recommended: {lb.recommended}")
print(f"ranked by {lb.metric}, cost per {lb.cost_unit}\n")

for e in lb.models:
    cost_usd = (e.cost_per_request_micro_usd or 0) / 1_000_000
    print(f"{e.name:24} {e.kind:8} q={e.quality:.3f}  ${cost_usd:.6f}/req  {e.context_k}k ctx")

if lb.frontier:
    f = lb.frontier
    print(f"\nfrontier baseline: {f.name}  q={f.quality:.3f}")
```

`Leaderboard` fields: `.task_id`, `.metric`, `.cost_unit`, `.recommended`
(deployable model alias, or `None`), `.models` (ranked `LeaderboardEntry` list),
and `.frontier` (a single baseline entry, or `None`). Each `LeaderboardEntry` has
`.name`, `.kind` (`"open"` or `"frontier"`), `.quality` (0 to 1),
`.cost_per_request_micro_usd` (raw micro-USD, **not** floored), `.context_k`
(context window in thousands), and `.run_mode`.

`cost_per_request_micro_usd` is raw micro-USD: 1,000,000 micro-USD = $1.00. The
SDK keeps sub-cent unit rates in micro-USD on purpose. Flooring them to whole
cents would erase the open-vs-frontier gap that makes the comparison worth doing.
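
A short worked example shows why the rates stay in micro-USD. The two per-request rates below are made-up numbers chosen for the arithmetic, not real catalog values.

```python
MICRO_PER_USD = 1_000_000
MICRO_PER_CENT = 10_000

# Hypothetical per-request rates in micro-USD (illustrative only).
open_rate = 180        # $0.000180 per request
frontier_rate = 4_200  # $0.004200 per request

# Whole cents per request: integer division floors, so both become 0
# and the comparison disappears.
open_cents = open_rate // MICRO_PER_CENT
frontier_cents = frontier_rate // MICRO_PER_CENT

# In raw micro-USD the ratio and the projected saving are still visible.
ratio = frontier_rate / open_rate
saving_usd_per_million = (frontier_rate - open_rate) * 1_000_000 / MICRO_PER_USD

print(open_cents, frontier_cents)
print(f"{ratio:.1f}x, ${saving_usd_per_million:,.2f} saved per million requests")
```

Under these assumed rates, both floor to zero cents per request even though the frontier model costs roughly 23 times more.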

To get just the deployable pick without the full board, call `tasks.recommended`:

```python
pick = pa.tasks.recommended(task_id)   # -> str | None, the recommended alias
```

This is exactly what `endpoints.deploy(model="recommended")` resolves to under the
hood. Inspect it here before you commit.

## Step 3: Prove it on your data with `evals.runs.create`

The leaderboard is the catalog's published view. Your contracts are not the
catalog's contracts. Run a real eval on *your* rows before you deploy anything.

Build a shortlist from the leaderboard's open entries, then score them against
your data. You can create the eval set inline in the same call by passing
`task=` + `items=`.

```python
# Shortlist: top open models off the leaderboard (these are deployable aliases).
candidates = [e.name for e in lb.models if e.kind == "open"][:3]

# Your data: one dict per row. Shape depends on the task's scorer; for an
# extraction task each row carries the input plus the expected fields.
items = [
    {"input": "MASTER SERVICES AGREEMENT ... Term: 24 months ... Fee: $48,000",
     "expected": {"term_months": 24, "annual_fee_usd": 48000}},
    {"input": "STATEMENT OF WORK ... Term: 12 months ... Fee: $9,500",
     "expected": {"term_months": 12, "annual_fee_usd": 9500}},
    # ... more rows. More rows means tighter confidence intervals.
]

run = pa.evals.runs.create(
    task=task_id,
    items=items,
    models=candidates,        # open candidates to score
    frontier="benchmarked",   # baselines on this task's leaderboard, for context
    name="contracts shortlist v1",
    wait=True,                # block until terminal, then return the final run
)
```

`evals.runs.create` parameters:

- Provide **either** `eval_set=<id>` (an existing set) **or** `task=` + `items=`
  to create one inline. `models=` is required and is the list of open candidate
  aliases to score.
- `frontier=` controls the vendor baselines, resolved SDK-side:
  - `None` or `"none"` -> no baselines.
  - `"all"` -> every frontier model for the task.
  - `"benchmarked"` -> only the frontier models on the task's leaderboard
    (vision-filtered for document tasks).
  - an explicit list of frontier ids -> passed through as-is.
- `wait=True` polls until the run is terminal (`"completed"` or `"failed"`), every
  `poll_interval` seconds (default 3.0), up to `timeout` seconds (default 900.0),
  then returns the final `EvalRun`. It raises `ParetaError` if the timeout is hit.
  `wait=False` returns immediately with a `"running"`/queued run; poll it yourself
  with `pa.evals.runs.wait(run.id)` or `pa.evals.runs.retrieve(run.id)`.

`create` raises `ValueError` if neither `eval_set` nor `task`+`items` is given,
and `ValueError` if `items` is empty.

This call is metered. The org balance is debited for the open and frontier compute
the run used. If the balance is empty it raises `InsufficientCreditsError`.

```python
from pareta import InsufficientCreditsError

try:
    run = pa.evals.runs.create(task=task_id, items=items, models=candidates,
                               frontier="benchmarked", wait=True)
except InsufficientCreditsError:
    raise SystemExit("Org balance is empty. Top up in the dashboard (browser-only).")
```

### Document and image tasks

If `task.has_blob_input` is true, the rows reference binary documents (PDFs,
scans). Create the set first, attach each file to its row, then start the run
against the set id:

```python
es = pa.evals.sets.create(task=task_id, items=items, name="contracts with PDFs")

# Attach a PDF to row 0's "document" blob field. Files under 5 MiB go inline;
# larger ones use a signed-URL upload. The SDK picks the path for you.
pa.evals.sets.upload_document(es.id, "contracts/0001.pdf", idx=0, field_name="document")

run = pa.evals.runs.create(eval_set=es.id, models=candidates,
                           frontier="benchmarked", wait=True)
```

`upload_document` accepts a path (`str`/`Path`), raw `bytes`, or a binary
file-like object; the MIME type is guessed from the filename unless you pass
`mime=`. `EvalSet` exposes `.id`, `.task_id`, `.name`, `.item_count`, and
`.scoring_strategy`.

## Step 4: Read the results, pick the best open model

A terminal `EvalRun` carries per-model aggregates in `.results`. Each `EvalResult`
has `.model_id` (the per-task alias), `.kind` (`"open"` or `"frontier"`),
`.quality_mean`, `.quality_ci_low` / `.quality_ci_high` (95% CI),
`.mean_cost_micro_usd` (raw average cost per item), `.n_succeeded`, and
`.error_count`.

```python
print(f"run {run.id}: {run.status}")
print(f"billed: ${run.cost} ({run.cost_micro_usd} micro-USD)\n")

for r in sorted(run.results, key=lambda r: (r.quality_mean or 0), reverse=True):
    cost_usd = (r.mean_cost_micro_usd or 0) / 1_000_000
    print(f"{r.model_id:24} {r.kind:8} "
          f"q={r.quality_mean:.3f} [{r.quality_ci_low:.3f}, {r.quality_ci_high:.3f}]  "
          f"${cost_usd:.6f}/item  ({r.n_succeeded} ok, {r.error_count} err)")

# The winner: the highest-quality OPEN model. Frontier rows are baselines only:
# they are vendor APIs, not something you deploy here.
open_results = [r for r in run.results if r.kind == "open"]
if not open_results:
    raise SystemExit("No open candidates succeeded. Widen the shortlist.")

winner = max(open_results, key=lambda r: (r.quality_mean or 0))
print(f"\nwinner: {winner.model_id}  (quality {winner.quality_mean:.3f})")
```

Three money fields, each with its own purpose:

- `run.cost` is a `Decimal`, the **billed total in dollars, floored to whole
  cents** (the SDK never rounds a charge up). A run that cost 5 micro-USD reads
  `Decimal("0.00")`.
- `run.cost_micro_usd` is the raw integer total in micro-USD when you need exact
  precision.
- Per-model `mean_cost_micro_usd` stays in raw micro-USD for the same reason the
  leaderboard rates do: flooring sub-cent unit costs would collapse the
  open-vs-frontier comparison.
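
The relationship between the raw total and the floored dollar figure can be reproduced with the standard library. This is a sketch of the documented behavior, not the SDK's own code; the function name is hypothetical.

```python
from decimal import Decimal, ROUND_FLOOR


def micro_usd_to_billed_dollars(micro_usd: int) -> Decimal:
    """Dollars as a Decimal, rounded down to whole cents."""
    dollars = Decimal(micro_usd) / Decimal(1_000_000)
    return dollars.quantize(Decimal("0.01"), rounding=ROUND_FLOOR)


print(micro_usd_to_billed_dollars(5))          # a sub-cent run
print(micro_usd_to_billed_dollars(1_239_999))  # just under $1.24
```

`ROUND_FLOOR` is what guarantees the dollar figure never exceeds the raw micro-USD total.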

The frontier rows are there to answer "how much quality am I giving up, and how
much am I saving?" You deploy the open winner, not the frontier baseline.
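
If the goal is the cheapest open model that is still good enough, rather than the single highest score, one possible selection rule is sketched below. The `Result` dataclass is a hypothetical stand-in carrying only the `EvalResult` fields used here, and the 0.02 quality tolerance and the model ids are arbitrary assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    """Hypothetical stand-in for the EvalResult fields used below."""
    model_id: str
    kind: str
    quality_mean: Optional[float]
    mean_cost_micro_usd: Optional[int]


def cheapest_good_enough(results: list[Result], tolerance: float = 0.02) -> Result:
    """Cheapest open model within `tolerance` of the best open quality."""
    scored = [r for r in results if r.kind == "open" and r.quality_mean is not None]
    if not scored:
        raise ValueError("no scored open candidates")
    best = max(r.quality_mean for r in scored)
    close = [r for r in scored if best - r.quality_mean <= tolerance]
    # Unknown cost sorts last, so a priced model wins over an unpriced one.
    return min(close, key=lambda r: (r.mean_cost_micro_usd is None,
                                     r.mean_cost_micro_usd or 0))


results = [
    Result("alias-a", "open", 0.910, 420),
    Result("alias-b", "open", 0.902, 150),
    Result("alias-c", "open", 0.840, 90),
    Result("vendor-x", "frontier", 0.935, 4_200),
]
pick = cheapest_good_enough(results)
print(pick.model_id)
```

Setting `tolerance=0.0` reduces this to the plain highest-quality rule used in the step above.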

## Step 5: Deploy the winner with `endpoints.deploy`

Hand the winning alias straight to `deploy`. No hardware knob: Pareta resolves the
serving class for the task and model. With `wait=True`, the call blocks through the
deploy and returns the live `Endpoint`.

```python
ep = pa.endpoints.deploy(
    task=task_id,
    model=winner.model_id,   # the open alias from the eval, deployed as-is
    name="contracts-prod",   # optional; auto-generated if omitted
    wait=True,
)

print(f"endpoint {ep.id}  status={ep.status}  live={ep.is_live}  url={ep.url}")
```

`Endpoint` fields: `.id` (the value you pass to `chat.completions.create(model=...)`),
`.name`, `.model` (the per-task alias), `.status` (`"live"`, `"starting"`,
`"stopped"`, ...), `.task`, `.url`, and `.is_live` (`status == "live"`).

To deploy the leaderboard's recommended pick instead of an eval winner, pass
`model="recommended"`. That is the default, so you can also omit the `model`
argument entirely.

### Watching deploy progress

With `wait=False`, `deploy` returns an iterator of progress events. Each event is
a `{"event": str, "data": dict}` dict. The stream ends with a `"complete"` event
(its `data` carries the `Endpoint`) or an `"error"` event.

```python
ep = None
for event in pa.endpoints.deploy(task=task_id, model=winner.model_id):
    if event["event"] == "progress":
        print("...", event["data"])
    elif event["event"] == "complete":
        ep = pa.endpoints.retrieve(event["data"]["endpoint"]["id"])
    elif event["event"] == "error":
        raise SystemExit(f"deploy failed: {event['data']}")
```

With `wait=True` the SDK consumes this stream internally and raises `ParetaError`
on an `"error"` event. `deploy` raises `ValueError` if `task` is missing.

## Step 6: Inference with `chat.completions.create`

The deployed endpoint is OpenAI-compatible. Pass `ep.id` as the `model`:

```python
resp = pa.chat.completions.create(
    model=ep.id,
    messages=[
        {"role": "system", "content": "Extract term_months and annual_fee_usd as JSON."},
        {"role": "user", "content": "MASTER SERVICES AGREEMENT ... Term: 36 months ... Fee: $72,000"},
    ],
    temperature=0,   # any OpenAI chat param passes straight through
)
print(resp.choices[0].message.content)
print(resp.usage.total_tokens, "tokens")
```

`create` returns a `ChatCompletion` with `.id`, `.model`, `.created`, `.choices`
(each `Choice` has `.index`, `.finish_reason`, `.message`), and `.usage`
(`.prompt_tokens`, `.completion_tokens`, `.total_tokens`). It raises `ValueError`
if `model` or `messages` is empty, and (like the eval run) debits the org balance
on success, raising `InsufficientCreditsError` if the balance is empty.

### Streaming

Pass `stream=True` for an iterator of `ChatCompletionChunk`. Each chunk's
incremental text is at `.choices[0].delta.content`:

```python
for chunk in pa.chat.completions.create(model=ep.id, messages=[...], stream=True):
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```
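
When you need the full text rather than a live printout, the chunks can be joined as they arrive. The sketch below uses `SimpleNamespace` objects as hypothetical stand-ins for `ChatCompletionChunk`, so it runs without an endpoint; the empty-`choices` guard is a defensive assumption, since some OpenAI-style streams emit such chunks.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Join each chunk's delta text into the full completion string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # defensive: skip chunks that carry no choices
            continue
        parts.append(chunk.choices[0].delta.content or "")
    return "".join(parts)


def fake_chunk(text):
    """Hypothetical stand-in shaped like a ChatCompletionChunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


stream = [fake_chunk('{"term_months": '), fake_chunk(None), fake_chunk("36}")]
text = collect_stream(stream)
print(text)
```

The `or ""` mirrors the loop above: a chunk whose `delta.content` is `None` contributes nothing to the result.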

You do not need this SDK to *call* the endpoint: point the `openai` client at the
same `base_url` with your `pareta_sk_` key. The SDK's value is the control plane
you just walked: match, leaderboard, eval, deploy.

## The whole funnel

```python
from pareta import Pareta, InsufficientCreditsError

pa = Pareta.from_env()

# 1. intent -> task
m = pa.tasks.match("extract the key fields from a contract")
assert m.matched, "no confident task match"
task_id = m.chosen.task_id

# 2. who leads this task
lb = pa.tasks.leaderboard(task_id)
candidates = [e.name for e in lb.models if e.kind == "open"][:3]

# 3. prove it on your data (open candidates + benchmarked frontier baselines)
items = [{"input": "...", "expected": {...}}]  # your rows
try:
    run = pa.evals.runs.create(task=task_id, items=items, models=candidates,
                               frontier="benchmarked", wait=True)
except InsufficientCreditsError:
    raise SystemExit("Top up the org balance in the dashboard (browser-only).")

# 4. pick the best OPEN model
winner = max((r for r in run.results if r.kind == "open"),
             key=lambda r: (r.quality_mean or 0))

# 5. deploy it (Pareta resolves the hardware)
ep = pa.endpoints.deploy(task=task_id, model=winner.model_id, wait=True)

# 6. infer (OpenAI-compatible)
resp = pa.chat.completions.create(
    model=ep.id,
    messages=[{"role": "user", "content": "Extract fields from: ..."}],
)
print(resp.choices[0].message.content)
```

## Operating and measuring the live endpoint

Once it is serving, operate it from code: `pa.endpoints.list()`,
`pa.endpoints.retrieve(ep.id)`, `pa.endpoints.stop(ep.id)`,
`pa.endpoints.start(ep.id)`, `pa.endpoints.delete(ep.id)`. Read its dimensions via
`pa.endpoints.metrics(ep.id).performance()` (and `.uptime()`, `.cost()`,
`.quality()`, `.activity()`). The `.cost()` dimension reports per-endpoint spend
and savings versus the frontier baseline.

## Async

Every step except the leaderboard lookup (see the note after the example) has an
async twin on `AsyncPareta`, with the same names and arguments, all `await`-ed.
With `wait=True` a call returns an awaitable; `deploy(wait=False)` returns an
async iterator rather than a sync one.

```python
import asyncio
from pareta import AsyncPareta

async def main():
    async with AsyncPareta.from_env() as pa:
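        m = await pa.tasks.match("extract the key fields from a contract")
        if not m.matched:   # gate on a confident hit, as in step 1
            raise SystemExit("No confident task match.")
        task_id = m.chosen.task_id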
        run = await pa.evals.runs.create(
            task=task_id, items=[...], models=[...], frontier="benchmarked", wait=True)
        winner = max((r for r in run.results if r.kind == "open"),
                     key=lambda r: (r.quality_mean or 0))
        ep = await pa.endpoints.deploy(task=task_id, model=winner.model_id, wait=True)
        resp = await pa.chat.completions.create(
            model=ep.id, messages=[{"role": "user", "content": "..."}])
        print(resp.choices[0].message.content)

asyncio.run(main())
```

Note: async `tasks` does not yet expose `leaderboard()` / `recommended()`. Fetch
the catalog or leaderboard with the sync client (or hardcode the shortlist) when
running the async path.

## Related

- [Run an eval on your own data](evaluate-on-your-data.md): the eval set and run
  surface in depth, including document uploads and confidence intervals.
- [Deploy and operate an endpoint](../guide/deploying-endpoints.md): start/stop,
  metrics, and the deploy progress stream.
- [OpenAI-compatible inference](migrate-from-openai.md): streaming, usage,
  and pointing the `openai` client at Pareta.
- [Money and metering](../guide/core-concepts.md): how `run.cost`, micro-USD rates, and
  `InsufficientCreditsError` work.



---

<!-- examples/evaluate-on-your-data.md -->

# Benchmark models on your own data

A public leaderboard tells you which model wins on someone else's data. It does not tell you which model wins on *yours*. This page shows how to take your own labeled rows, score a slate of open-weights candidates against a frontier baseline, and read back a ranked, cost-annotated result, all in one `evals.runs.create(...)` call.

The shape is always the same:

1. Pick a task (it carries the scorer and the input schema).
2. Build an eval set from your rows.
3. Run open candidates against `frontier="benchmarked"`.
4. Read the ranked results and the dollar cost of the run.

Evals are metered: the org balance is debited for the compute you ran (open candidates **and** frontier baselines). `run.cost` is the billed total in dollars; an empty balance raises `InsufficientCreditsError`. Top-up is browser-only.

## Setup

```python
from pareta import Pareta

pa = Pareta.from_env()  # reads PARETA_API_KEY (and optional PARETA_BASE_URL)
```

`from_env()` is the path you want; it keeps the key out of your source. See [Authentication](../guide/installation.md) for the constructor form and key formats.

## 1. Pick a task

A task defines what gets scored and how. Every eval set, run, and result is anchored to one task id. The task also owns the `default_scorer` (the metric your candidates are judged on) and tells you, via `has_blob_input`, whether rows carry documents or images.

If you already know the id, skip ahead. Otherwise, match free text against the catalog:

```python
match = pa.tasks.match("extract key fields from a contract", top_k=5)

if match.matched:
    task_id = match.chosen.task_id          # best candidate
    print(task_id, match.chosen.confidence)  # e.g. "contract-key-fields" "high"
else:
    # nothing landed with confidence; inspect the ranked alternates
    for c in match.candidates:
        print(c.task_id, round(c.score, 3), c.confidence)
    raise SystemExit("refine the query")
```

`match.ambiguous` is `True` when the top two scores are close, which is worth surfacing to a human before committing. Confirm the scorer and input schema before you build a set:

```python
task = pa.tasks.retrieve(task_id)
print(task.default_scorer)   # the metric your run will report (e.g. "macro_joint_f1")
print(task.has_blob_input)   # True → rows attach PDFs/images (see step 2b)
```

See [Discover tasks](../guide/discovery.md) for the full matching and catalog walkthrough.

## 2. Build an eval set from your rows

An eval set is your labeled data, stored server-side and reusable across runs. Each row is a dict whose fields match the task schema. The exact keys are task-specific, but the universal shape is **inputs the model sees** plus a **target** (the gold label the scorer compares against).

```python
items = [
    {
        "text": "This Agreement is made on 3 March 2026 between Acme Corp and Globex LLC...",
        "target": {"effective_date": "2026-03-03", "parties": ["Acme Corp", "Globex LLC"]},
    },
    {
        "text": "Master Services Agreement, dated January 12, 2026, by and between Initech and Hooli...",
        "target": {"effective_date": "2026-01-12", "parties": ["Initech", "Hooli"]},
    },
    # ... more rows. A few dozen labeled rows already give you a usable signal.
]

eval_set = pa.evals.sets.create(task=task_id, items=items)

print(eval_set.id)                # use this in runs.create(eval_set=...)
print(eval_set.item_count)        # 2
print(eval_set.scoring_strategy)  # e.g. "extraction"
```

`items` must be non-empty (an empty list raises `ValueError` before any request goes out). If you omit `name`, the set is labeled `"sdk eval set (N items)"`.
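
If your labeled rows live in a JSONL file, a small loader can check them before any request goes out. This is a sketch under assumptions: the `target` key follows the example above (real keys are task-specific), and `rows.jsonl` in the comment is a hypothetical filename.

```python
import io
import json
from typing import IO, Any


def load_items(fh: IO[str], required: tuple[str, ...] = ("target",)) -> list[dict[str, Any]]:
    """Parse one JSON object per non-blank line and check required keys."""
    items = []
    for lineno, line in enumerate(fh, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)
        missing = [k for k in required if k not in row]
        if missing:
            raise ValueError(f"line {lineno}: missing {missing}")
        items.append(row)
    if not items:
        raise ValueError("no rows found")  # fail locally on an empty file
    return items


# Usage with an in-memory file; swap in open("rows.jsonl") for a real one.
sample = io.StringIO(
    '{"text": "Agreement dated 3 March 2026 ...", "target": {"effective_date": "2026-03-03"}}\n'
    "\n"
    '{"text": "MSA dated January 12, 2026 ...", "target": {"effective_date": "2026-01-12"}}\n'
)
items = load_items(sample)
print(len(items))
```

The resulting list can be passed as `items=` to `pa.evals.sets.create`.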

Reuse a set across many runs, or list and prune as you iterate:

```python
for s in pa.evals.sets.list():
    print(s.id, s.task_id, s.item_count, s.name)

# pa.evals.sets.delete(eval_set.id)   # when you are done with it
```

### 2b. Document tasks: attach the file to each row

When `task.has_blob_input` is `True`, the row carries a binary document. Create the set with the row's text/label fields and a placeholder for the blob, then attach the file to that row by index:

```python
doc_task = "invoice-extraction"   # a has_blob_input task

eval_set = pa.evals.sets.create(
    task=doc_task,
    items=[
        {"target": {"invoice_number": "INV-7781", "total": "1240.00"}},
        {"target": {"invoice_number": "INV-7782", "total": "98.50"}},
    ],
)

# Attach one PDF per row. idx is the 0-based row; field_name is the blob input
# field from the task schema. MIME is auto-detected from the filename.
pa.evals.sets.upload_document(eval_set.id, "invoices/7781.pdf", idx=0, field_name="document")
pa.evals.sets.upload_document(eval_set.id, "invoices/7782.pdf", idx=1, field_name="document")
```

`upload_document` accepts a path (`str`/`Path`), raw `bytes`, or any binary file-like object; anything else raises `TypeError`. Files under 5 MiB upload inline; larger ones go through a signed-URL direct-to-storage flow. Either way the call returns the completion response dict. Pass `mime="application/pdf"` to override detection.

## 3. Run open candidates against a frontier baseline

This is the core call. You name the open-weights candidates (per-task public aliases) and let `frontier="benchmarked"` pull the vendor baselines that sit on this task's leaderboard. The run scores everything on the same rows with the same scorer, so the numbers are directly comparable.

```python
run = pa.evals.runs.create(
    eval_set=eval_set.id,
    models=["contract-kie-1", "contract-kie-2"],  # open candidates (aliases)
    frontier="benchmarked",                        # vendor baselines on this leaderboard
    wait=True,                                      # block until the run is terminal
)

print(run.status)  # "completed"
```

The `models` list is the open candidates you want to rank; it is required. `frontier` controls the baselines:

| `frontier=` | Evaluates against |
|---|---|
| `None` or `"none"` | nothing (open candidates only) |
| `"benchmarked"` | frontier models on this task's leaderboard (vision-filtered for document tasks) |
| `"all"` | every frontier model in the eval pool for the task |
| `["gpt-4o", "claude-..."]` | exactly these frontier ids |

The `"benchmarked"` and `"all"` keywords need to know the task. With `eval_set=...` the SDK looks it up from the set; if you pass an explicit list of ids it skips the lookup entirely.

GPUs and serving hardware never enter this call. There is no GPU, quantization, or run-mode knob. You name a task and models; Pareta resolves the rest. Open-weights model ids are per-task aliases, and frontier ids are the vendor names in the clear.

### Inline create (skip step 2)

If you do not need a reusable set, hand the rows straight to the run. Pass `task=` and `items=` instead of `eval_set=`, and the SDK creates the set for you:

```python
run = pa.evals.runs.create(
    task=task_id,
    items=items,
    models=["contract-kie-1", "contract-kie-2"],
    frontier="benchmarked",
    wait=True,
)
```

You must pass either `eval_set=<id>` or both `task=` and `items=`; anything else raises `ValueError`.

### Picking candidates from the leaderboard

If you want the curated pick rather than hand-naming aliases, read the leaderboard and feed its `recommended` id into the run:

```python
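lb = pa.tasks.leaderboard(task_id)
print(lb.recommended)                # the deployable alias Pareta curates (or None)
if lb.frontier:                      # the baseline entry can be None
    print(lb.frontier.name)          # the savings baseline

# Recommended pick first, then the top two open entries; drop None and duplicates.
top_open = [m.name for m in lb.models if m.kind == "open"][:2]
candidates = list(dict.fromkeys(n for n in [lb.recommended, *top_open] if n))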
run = pa.evals.runs.create(eval_set=eval_set.id, models=candidates,
                           frontier="benchmarked", wait=True)
```

To enumerate the frontier roster directly (for example, to build an explicit `frontier=[...]` list), use `pa.evals.frontier_models(task=task_id)`; each entry exposes `.id`, `.vendor`, `.vision`, and `.benchmarked`.
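
Building that explicit list is a plain filter over the roster. The sketch below uses a hypothetical `FrontierModel` dataclass with the four documented fields and invented ids, so it runs offline; with the SDK you would pass the entries returned by `frontier_models` instead.

```python
from dataclasses import dataclass


@dataclass
class FrontierModel:
    """Hypothetical stand-in with the documented roster fields."""
    id: str
    vendor: str
    vision: bool
    benchmarked: bool


def explicit_frontier(roster, *, need_vision=False, vendors=None, benchmarked_only=True):
    """Return frontier ids filtered by leaderboard presence, vision, and vendor."""
    ids = []
    for m in roster:
        if benchmarked_only and not m.benchmarked:
            continue
        if need_vision and not m.vision:
            continue
        if vendors is not None and m.vendor not in vendors:
            continue
        ids.append(m.id)
    return ids


# Invented roster entries, for illustration only.
roster = [
    FrontierModel("vendor-a-large", "vendor-a", vision=True, benchmarked=True),
    FrontierModel("vendor-a-small", "vendor-a", vision=False, benchmarked=True),
    FrontierModel("vendor-b-large", "vendor-b", vision=True, benchmarked=False),
]
frontier_ids = explicit_frontier(roster, need_vision=True)
print(frontier_ids)
```

The returned list is what you would hand to `runs.create(frontier=...)`.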

## 4. Read the ranked results

A terminal run carries one `EvalResult` per model. Sort by `quality_mean` to get the ranking, and read `run.cost` to see what the run cost you:

```python
ranked = sorted(run.results, key=lambda r: r.quality_mean or 0, reverse=True)

for r in ranked:
    cost_per_item = (r.mean_cost_micro_usd or 0) / 1_000_000  # micro-USD → dollars
    print(
        f"{r.model_id:24}  "
        f"quality={r.quality_mean:.3f}  "
        f"[{r.quality_ci_low:.3f}, {r.quality_ci_high:.3f}]  "
        f"${cost_per_item:.6f}/item  "
        f"ok={r.n_succeeded} err={r.error_count}"
    )

print(f"\nrun cost: ${run.cost}")          # Decimal dollars, floored to cents
print(f"raw micro-USD: {run.cost_micro_usd}")
```

What the fields mean:

- **`quality_mean`**: the model's mean score on the task's scorer, in `[0, 1]`. This is your ranking key.
- **`quality_ci_low` / `quality_ci_high`**: the 95% confidence interval. If two models' intervals overlap heavily, your eval set is too small to separate them, so add rows.
- **`mean_cost_micro_usd`**: average cost per item, kept in micro-USD (not floored). This is where the open-vs-frontier comparison lives, so sub-cent precision is preserved: a cheaper open model that matches frontier quality is the whole point.
- **`n_succeeded` / `error_count`**: how many rows scored cleanly. A high `error_count` on one model usually means malformed output, not a bad model, so inspect before trusting its quality number.
- **`model_id`**: the per-task alias (open) or vendor id (frontier). `kind` distinguishes `"open"` from `"frontier"` where the backend populates it.
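
One rough way to act on the confidence-interval advice is to measure how much two intervals overlap. The helpers and the interval values below are illustrative assumptions; overlap is a quick screen, not a formal significance test.

```python
def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two closed intervals (low, high) share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def overlap_fraction(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Shared width divided by the narrower interval's width (0.0 to 1.0)."""
    shared = min(a[1], b[1]) - max(a[0], b[0])
    narrower = min(a[1] - a[0], b[1] - b[0])
    if shared <= 0 or narrower <= 0:
        return 0.0
    return shared / narrower


# Hypothetical (quality_ci_low, quality_ci_high) pairs for three models.
model_a = (0.86, 0.94)
model_b = (0.84, 0.90)
model_c = (0.70, 0.78)

print(intervals_overlap(model_a, model_b), f"{overlap_fraction(model_a, model_b):.2f}")
print(intervals_overlap(model_a, model_c), f"{overlap_fraction(model_a, model_c):.2f}")
```

Here `model_a` and `model_b` share about two thirds of the narrower interval, a sign that more rows are needed, while `model_c` is cleanly separated.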

### A note on money

`run.cost` is a `Decimal` of dollars, floored to whole cents, so the SDK never overstates a charge and a sub-cent run reads `Decimal("0.00")`. For the exact figure use `run.cost_micro_usd` (an integer, where `1_000_000` micro-USD is `$1.00`). The same convention is why per-item rates like `mean_cost_micro_usd` stay in micro-USD: flooring them to cents would erase the open-vs-frontier difference you ran the eval to find.

## Not blocking on the run

`wait=True` polls until the run reaches `"completed"` or `"failed"`, then returns. For long sets, tune the cadence and ceiling:

```python
run = pa.evals.runs.create(
    eval_set=eval_set.id,
    models=["contract-kie-1", "contract-kie-2"],
    frontier="benchmarked",
    wait=True,
    poll_interval=5.0,   # seconds between polls (default 3.0)
    timeout=1800.0,      # give up after 30 min (default 900.0); raises ParetaError on timeout
)
```

Or fire and poll yourself. `wait=False` returns immediately with a run you can retrieve later:

```python
run = pa.evals.runs.create(eval_set=eval_set.id,
                           models=["contract-kie-1"], frontier="benchmarked")
run_id = run.id
# ... later, from anywhere ...
run = pa.evals.runs.retrieve(run_id)
if run.is_terminal:
    print(run.status, run.results)

# equivalently, block on an already-started run:
run = pa.evals.runs.wait(run_id, timeout=1800.0)
```
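
The polling contract (check, sleep `poll_interval`, give up after `timeout`) can be sketched in a few lines. This is not the SDK's implementation: `fetch_status` is a hypothetical callable standing in for `runs.retrieve(...).status`, and the sketch raises the built-in `TimeoutError` where the SDK raises `ParetaError`.

```python
import time
from typing import Callable

TERMINAL = {"completed", "failed"}


def wait_until_terminal(
    fetch_status: Callable[[], str],
    poll_interval: float = 3.0,
    timeout: float = 900.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll fetch_status until it returns a terminal status or time runs out."""
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status in TERMINAL:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout}s")
        sleep(poll_interval)


# Usage with a scripted status sequence and a recording no-op sleep.
statuses = iter(["queued", "running", "running", "completed"])
naps: list[float] = []
final_status = wait_until_terminal(lambda: next(statuses), sleep=naps.append)
print(final_status, naps)
```

Checking the status before the deadline means a run that finishes on the last poll is still returned rather than reported as a timeout.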

## Handling an empty balance

Both the open and frontier compute are metered. If the org balance cannot cover the run, `create` raises before any work is billed:

```python
from pareta import InsufficientCreditsError

try:
    run = pa.evals.runs.create(eval_set=eval_set.id,
                               models=["contract-kie-1"], frontier="benchmarked", wait=True)
except InsufficientCreditsError:
    print("Out of credit. Top up in the dashboard (billing is browser-only).")
```

`InsufficientCreditsError` is a subclass of `APIStatusError` (status 402). Like every SDK error it also derives from `ParetaError`, so you can catch the broader `ParetaError` if you want one handler for every SDK failure.

## Full example

```python
from pareta import Pareta, InsufficientCreditsError

pa = Pareta.from_env()

# 1. Pick the task.
task_id = "contract-key-fields"
task = pa.tasks.retrieve(task_id)
print("scoring on:", task.default_scorer)

# 2. Build the eval set from your rows.
items = [
    {"text": "This Agreement is made on 3 March 2026 between Acme Corp and Globex LLC...",
     "target": {"effective_date": "2026-03-03", "parties": ["Acme Corp", "Globex LLC"]}},
    {"text": "Master Services Agreement, dated January 12, 2026, by and between Initech and Hooli...",
     "target": {"effective_date": "2026-01-12", "parties": ["Initech", "Hooli"]}},
]
eval_set = pa.evals.sets.create(task=task_id, items=items, name="contract fields v1")

# 3. Run open candidates against the benchmarked frontier baselines.
try:
    run = pa.evals.runs.create(
        eval_set=eval_set.id,
        models=["contract-kie-1", "contract-kie-2"],
        frontier="benchmarked",
        wait=True,
    )
except InsufficientCreditsError:
    raise SystemExit("Out of credit. Top up in the dashboard.")

# 4. Read the ranked results.
for r in sorted(run.results, key=lambda r: r.quality_mean or 0, reverse=True):
    print(f"{r.model_id:24} {(r.quality_mean or 0):.3f}  ${(r.mean_cost_micro_usd or 0)/1e6:.6f}/item")

print("run cost:", run.cost)  # Decimal dollars, floored to cents
```

## Async

Every call here has an `async` twin on `AsyncPareta`. The signatures match; the methods are coroutines (`wait` included).

```python
import asyncio
from pareta import AsyncPareta

async def main():
    async with AsyncPareta.from_env() as pa:
        # `items` is the list of rows built in the full example above.
        eval_set = await pa.evals.sets.create(task="contract-key-fields", items=items)
        run = await pa.evals.runs.create(
            eval_set=eval_set.id,
            models=["contract-kie-1", "contract-kie-2"],
            frontier="benchmarked",
            wait=True,
        )
        for r in run.results:
            print(r.model_id, r.quality_mean)
        print("run cost:", run.cost)

asyncio.run(main())
```

## Next steps

- [Deploy an endpoint](deploy-and-infer.md): take the winner of your eval to a live, OpenAI-compatible endpoint.
- [Run inference](../guide/inference.md): call your deployed model; inference is metered the same way evals are.
- [Discover tasks](../guide/discovery.md): match intent to tasks and read leaderboards in depth.
- [Errors and retries](../guide/errors-and-retries.md): the full exception hierarchy behind `InsufficientCreditsError` and friends.



---

<!-- examples/document-extraction.md -->

# Document extraction (PDF/image)

Pull structured fields out of PDFs and scanned images, then serve the model that does it best for the least money.

This page walks the full loop for a document task end to end:

1. Find the blob task and check what it expects.
2. Build an eval set from your own documents (one JSONL row per document, with the PDF/image attached to each row).
3. Run the eval against a few open-weights candidates plus frontier (vision) baselines.
4. Read the per-model quality and cost results, pick the winner.
5. Deploy that model and call it with OpenAI-compatible inference.

Why do it this way: a document task is a *blob* task. The model reads pixels, not just text, so picking by gut is a bad idea. Running an eval on your real documents tells you, in dollars and quality points, which open model matches the frontier closely enough to be worth running yourself. Both evals and inference are metered against your org balance, so the eval also tells you the bill before you commit.

Throughout, `model` ids are per-task public aliases. You never see or pass real open-weights ids, and you never pass hardware. Pareta hides the GPU; `deploy` takes a task and a model, nothing else.

## Setup

```python
from pareta import Pareta

pa = Pareta.from_env()   # reads PARETA_API_KEY (and optional PARETA_BASE_URL)
```

`from_env()` is the path you want in real code. See [Inference](../guide/inference.md) for the OpenAI-compatible alternative.

## 1. Find the document task

Document tasks carry binary inputs, surfaced as `task.has_blob_input == True`. If you know the task id (here, `invoice-extraction`), retrieve it directly. If you only know your intent in words, let the matcher rank candidates.

```python
# By id
task = pa.tasks.retrieve("invoice-extraction")
print(task.id, task.default_scorer, task.has_blob_input)
# invoice-extraction  field_f1  True

# Or by intent
m = pa.tasks.match("extract totals and line items from vendor invoices")
if m.matched:
    print("chose:", m.chosen.task_id, m.chosen.confidence)   # 'high' | 'medium' | 'low'
for c in m.candidates:
    print(f"  {c.task_id}  score={c.score:.2f}  {c.confidence}")
```

`task.default_scorer` is the scorer the eval run applies (for a field-extraction task that is typically a field-level F1). You do not invoke it yourself; the run scores each model against the expected output you provide in step 2.

To see which open models are even in the running for this task, and what the frontier baseline is, read the leaderboard. `recommended` is the deployable alias `deploy(model="recommended")` resolves to.

```python
lb = pa.tasks.leaderboard("invoice-extraction")
print("recommended:", lb.recommended)
for e in lb.models:                      # ranked, best first
    print(f"  {e.name:18}  q={e.quality:.3f}  {e.cost_per_request_micro_usd} uUSD/req  {e.kind}")
if lb.frontier:
    print("frontier baseline:", lb.frontier.name, lb.frontier.quality)
```

See [Discovering tasks](../guide/discovery.md) for the full catalog and matching reference.

## 2. Build an eval set from your documents

A document eval set is one JSONL row per document. Each row holds the *expected* extraction (what a correct answer looks like) plus a placeholder for the document blob. You attach the actual PDF/image to each row in a second step with `upload_document`.

Create the set first. `items` must be non-empty. The blob field (here `document`) is the input field the document attaches to; the rest of each row is the gold/expected output the scorer grades against.

```python
items = [
    {
        "document": None,                       # filled by upload_document below
        "expected": {
            "invoice_number": "INV-4471",
            "invoice_date": "2026-03-14",
            "total": "1284.50",
            "currency": "USD",
            "vendor": "Katana ML",
        },
    },
    {
        "document": None,
        "expected": {
            "invoice_number": "INV-4472",
            "invoice_date": "2026-03-15",
            "total": "962.00",
            "currency": "USD",
            "vendor": "Katana ML",
        },
    },
]

eval_set = pa.evals.sets.create(
    task="invoice-extraction",
    items=items,
    name="Q1 vendor invoices",             # optional; auto-named if omitted
)
print(eval_set.id, eval_set.item_count, eval_set.scoring_strategy)
# es_…  2  extraction
```
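
If your gold labels already live in a JSONL file, the `items` list can be built from it rather than typed by hand. The sketch below is a minimal loader, assuming a hypothetical labels file in which each line is a JSON object of expected fields. `load_items` and the file layout are illustrative and not part of the SDK; only the `document`/`expected` row shape comes from the example above.

```python
import io
import json


def load_items(lines, blob_field="document"):
    """Turn JSONL label lines into eval-set rows with an empty blob slot."""
    items = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:                      # tolerate blank lines
            continue
        try:
            expected = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {lineno}: invalid JSON ({exc.msg})") from exc
        if not isinstance(expected, dict):
            raise ValueError(f"line {lineno}: expected a JSON object")
        # The blob slot stays None until upload_document fills it.
        items.append({blob_field: None, "expected": expected})
    if not items:
        raise ValueError("no rows found; an eval set needs at least one item")
    return items


# Hypothetical labels; in practice pass an open file handle instead.
labels = io.StringIO(
    '{"invoice_number": "INV-4471", "total": "1284.50"}\n'
    '{"invoice_number": "INV-4472", "total": "962.00"}\n'
)
items = load_items(labels)
print(len(items), items[0]["expected"]["invoice_number"])
```

Row order in the file becomes the `idx` you pass to `upload_document`, so keep the labels and the documents sorted the same way.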

Now attach the document for each row. `idx` is the 0-based row index and `field_name` is the blob field you left as `None` above. `file` accepts a path (`str`/`Path`), raw `bytes`, or any binary file-like object; the MIME type is guessed from the filename and can be overridden with `mime=`.

```python
from pathlib import Path

invoices = [Path("invoices/INV-4471.pdf"), Path("invoices/INV-4472.png")]

for idx, path in enumerate(invoices):
    pa.evals.sets.upload_document(
        eval_set.id,
        path,
        idx=idx,
        field_name="document",
    )
```

The upload is one call regardless of file size. Files under 5 MiB go up inline; larger files use a signed-URL direct-to-storage flow under the hood. You can also pass bytes or a handle:

```python
with open("invoices/INV-4471.pdf", "rb") as fh:
    pa.evals.sets.upload_document(eval_set.id, fh, idx=0, field_name="document")

raw = Path("scan.tiff").read_bytes()
pa.evals.sets.upload_document(
    eval_set.id, raw, idx=1, field_name="document", mime="image/tiff",
)
```
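
Before uploading a folder of mixed files, it can help to check which MIME type each filename is likely to map to. The sketch below uses the standard library's `mimetypes` module as a stand-in; whether the SDK's own guess matches it exactly is an assumption, and `plan_uploads` is a hypothetical helper. Files with no recognizable type are flagged so you can pass `mime=` explicitly.

```python
import mimetypes
from pathlib import Path


def plan_uploads(paths):
    """Return (idx, path, guessed_mime) rows; guessed_mime is None when unknown."""
    plan = []
    for idx, path in enumerate(paths):
        path = Path(path)
        guessed, _encoding = mimetypes.guess_type(path.name)
        plan.append((idx, path, guessed))
    return plan


# Illustrative filenames; the last one has an extension mimetypes does not know.
plan = plan_uploads([
    "invoices/INV-4471.pdf",
    "invoices/INV-4472.png",
    "scans/page.raw-scan",
])
for idx, path, mime in plan:
    note = mime or "unknown: pass mime= explicitly"
    print(idx, path.name, note)
```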

## 3. Run the eval

Pass the open-weights candidates you want to compare in `models` (aliases from the leaderboard), and choose frontier baselines with `frontier=`. For a document task you want vision-capable baselines; `frontier="benchmarked"` resolves to the frontier models already on this task's leaderboard (vision-filtered for document tasks), so you compare against the right roster automatically.

```python
run = pa.evals.runs.create(
    eval_set=eval_set.id,
    models=["qwen2.5-vl-1", "internvl-1"],   # open candidates (per-task aliases)
    frontier="benchmarked",                  # vision frontier baselines on this task
    wait=True,                               # block until the run is terminal
)
print(run.status, run.id)                    # 'completed'  run_…
```

`wait=True` polls until the run reaches `completed` or `failed` (default `poll_interval=3.0`s, `timeout=900.0`s), then returns the final run. To fire and poll yourself, leave `wait=False` and call `runs.wait(run.id)` or `runs.retrieve(run.id)` later:

```python
run = pa.evals.runs.create(eval_set=eval_set.id, models=["qwen2.5-vl-1"], frontier="benchmarked")
# ... do other work ...
run = pa.evals.runs.wait(run.id, poll_interval=5.0, timeout=1200.0)
```
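
The polling that `wait=True` and `runs.wait` describe follows a familiar pattern. As a rough sketch (not the SDK's implementation), a loop with the same `poll_interval` and `timeout` knobs could look like the following. `poll_until_terminal`, `fetch`, and `is_terminal` are hypothetical names, the lambda stands in for `runs.retrieve`, and the built-in `TimeoutError` is a placeholder for whatever the SDK actually raises.

```python
import time


def poll_until_terminal(fetch, is_terminal, poll_interval=3.0, timeout=900.0):
    """Call fetch() every poll_interval seconds until is_terminal(result) or timeout."""
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if is_terminal(result):
            return result
        # Give up if the next sleep would carry us past the deadline.
        if time.monotonic() + poll_interval > deadline:
            raise TimeoutError(f"not terminal after {timeout}s")
        time.sleep(poll_interval)


# Stand-in for runs.retrieve(run_id): reports "running" twice, then "completed".
states = iter(["running", "running", "completed"])
final = poll_until_terminal(
    fetch=lambda: next(states),
    is_terminal=lambda status: status in ("completed", "failed"),
    poll_interval=0.01,
    timeout=1.0,
)
print(final)
```

A longer `poll_interval` means fewer status requests at the price of noticing completion later, which is the trade-off behind passing `poll_interval=5.0` above.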

If you would rather not pre-create the set, `runs.create` accepts `task=…` and `items=…` inline and creates the set for you. Document blobs still have to be attached before the run starts, and the inline path leaves no step in which to do that, so for document tasks the explicit `sets.create` + `upload_document` path above is the one to use.

### What `frontier=` accepts

| Value | Resolves to |
|---|---|
| `None` or `"none"` | no baselines |
| `"all"` | every frontier model for the task (from `pa.evals.frontier_models(task=…)`) |
| `"benchmarked"` | frontier models on the task's leaderboard (vision-filtered for document tasks) |
| `["gpt-4o", "claude-…"]` | exactly those frontier ids |

Frontier (vendor) ids are in the clear, so you can name them explicitly. To see the roster first:

```python
for fm in pa.evals.frontier_models(task="invoice-extraction"):
    print(fm.id, fm.vendor, "vision" if fm.vision else "text", "benchmarked" if fm.benchmarked else "")
```

### Metering

The run debits your org balance for the compute it consumes, open candidates and frontier baselines alike. If the balance is empty, `runs.create` raises `InsufficientCreditsError` (402). Top-up is browser-only; the SDK never exposes balance or payment methods.

```python
from pareta import InsufficientCreditsError

try:
    run = pa.evals.runs.create(eval_set=eval_set.id, models=["qwen2.5-vl-1"], frontier="benchmarked", wait=True)
except InsufficientCreditsError:
    print("Org out of credit — top up in the dashboard, then re-run.")
```

## 4. Read the results and pick a winner

`run.results` is one `EvalResult` per evaluated model: the open candidates and the frontier baselines, each with mean quality, a 95% confidence interval, and the average cost per item. `run.cost` is the billed total for the whole run.

```python
if run.status == "failed":
    raise RuntimeError(run.error_detail)

print(f"run cost: ${run.cost}")              # Decimal dollars, floored to cents

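for r in sorted(run.results, key=lambda r: (r.quality_mean or 0), reverse=True):
    if r.quality_mean is None:               # nothing scored for this model
        print(f"{r.model_id:18} {r.kind:8} no score  err={r.error_count}")
        continue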
    print(
        f"{r.model_id:18} {r.kind:8} "
        f"q={r.quality_mean:.3f} "
        f"[{r.quality_ci_low:.3f}, {r.quality_ci_high:.3f}]  "
        f"{r.mean_cost_micro_usd} uUSD/item  "
        f"ok={r.n_succeeded} err={r.error_count}"
    )
```

```
gpt-4o-vision      frontier q=0.946 [0.921, 0.968]  41200 uUSD/item  ok=10 err=0
qwen2.5-vl-1       open     q=0.921 [0.889, 0.948]   3100 uUSD/item  ok=10 err=0
internvl-1         open     q=0.870 [0.831, 0.905]   2750 uUSD/item  ok=10 err=0
```

Reading it: `qwen2.5-vl-1` lands about 2.5 quality points below the frontier baseline (0.921 vs 0.946) at under a tenth of the per-item cost (3,100 vs 41,200 micro-USD), and the two confidence intervals overlap. That is the open model worth serving. Pick the winning alias:

```python
ranked = sorted(
    (r for r in run.results if r.kind == "open"),
    key=lambda r: (r.quality_mean or 0),
    reverse=True,
)
winner = ranked[0].model_id
print("winner:", winner)   # 'qwen2.5-vl-1'
```
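
Highest quality is one selection rule; another is "cheapest open model that stays close enough to the frontier". The sketch below illustrates that rule on plain data. `Row` is a hypothetical stand-in that carries only the `EvalResult` fields used on this page, and `cheapest_within` and its `tolerance` parameter are illustrative names, not SDK features.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Row:
    """Hypothetical stand-in carrying the EvalResult fields used on this page."""
    model_id: str
    kind: str                              # "open" or "frontier"
    quality_mean: Optional[float]
    mean_cost_micro_usd: Optional[int]


def cheapest_within(results, tolerance=0.03):
    """Cheapest open model whose quality is within `tolerance` of the best frontier score."""
    frontier = [r.quality_mean for r in results
                if r.kind == "frontier" and r.quality_mean is not None]
    if not frontier:
        raise ValueError("no scored frontier baseline to compare against")
    floor = max(frontier) - tolerance
    eligible = [r for r in results
                if r.kind == "open"
                and r.quality_mean is not None
                and r.mean_cost_micro_usd is not None
                and r.quality_mean >= floor]
    if not eligible:
        return None                        # no open model is close enough
    return min(eligible, key=lambda r: r.mean_cost_micro_usd)


# The three rows from the sample output above.
rows = [
    Row("gpt-4o-vision", "frontier", 0.946, 41200),
    Row("qwen2.5-vl-1", "open", 0.921, 3100),
    Row("internvl-1", "open", 0.870, 2750),
]
pick = cheapest_within(rows, tolerance=0.03)
print(pick.model_id if pick else "no open model is close enough")
```

With a tolerance of 0.03 only `qwen2.5-vl-1` clears the bar; loosening it to 0.10 would admit `internvl-1`, which is cheaper. The right tolerance depends on how much a missed field costs you downstream.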

### A note on money

`run.cost` is a `Decimal` in dollars, floored to whole cents (the SDK never rounds a charge up). The raw integer is on `run.cost_micro_usd` (1,000,000 = $1.00) if you need sub-cent precision. Per-item rates like `result.mean_cost_micro_usd` stay in micro-USD on purpose; flooring them to cents would erase the open-vs-frontier comparison that just earned its keep above.

```python
print(run.cost)             # Decimal('0.07')
print(run.cost_micro_usd)   # 72500
```
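
The relationship between the two fields can be reproduced with the standard library. This is a sketch of the arithmetic only, assuming the flooring is plain truncation to two decimal places; `to_dollars` and `projected_micro_usd` are hypothetical helpers, and the 100,000-item volume is an invented figure for illustration.

```python
from decimal import Decimal, ROUND_DOWN

MICRO_PER_DOLLAR = 1_000_000


def to_dollars(micro_usd: int) -> Decimal:
    """Convert integer micro-USD to dollars, truncated to whole cents."""
    exact = Decimal(micro_usd) / MICRO_PER_DOLLAR
    # ROUND_DOWN truncates toward zero, which equals flooring for non-negative amounts.
    return exact.quantize(Decimal("0.01"), rounding=ROUND_DOWN)


def projected_micro_usd(mean_cost_micro_usd: int, n_items: int) -> int:
    """Scale a per-item rate to a volume while staying in integer micro-USD."""
    return mean_cost_micro_usd * n_items


print(to_dollars(72_500))                                  # 0.07
print(to_dollars(projected_micro_usd(3_100, 100_000)))     # 310.00
print(to_dollars(projected_micro_usd(41_200, 100_000)))    # 4120.00
```

Multiplying in micro-USD first and converting once at the end avoids compounding the per-item truncation across a large batch.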

## 5. Deploy the winner

`deploy` takes the task and the model alias. No GPU, no quantization, no tensor-parallel knob; Pareta resolves the serving class. With `wait=True` it blocks through provisioning and returns the live `Endpoint`.

```python
ep = pa.endpoints.deploy(
    task="invoice-extraction",
    model=winner,            # the alias your eval chose
    wait=True,
)
print(ep.id, ep.status, ep.is_live, ep.url)
# ep_…  live  True  https://…
```

Prefer to watch progress instead of blocking? Leave `wait=False` (the default) and iterate the deploy event stream:

```python
for ev in pa.endpoints.deploy(task="invoice-extraction", model=winner):
    if ev["event"] == "progress":
        print("...", ev["data"].get("stage"))
    elif ev["event"] == "complete":
        ep = ev["data"]["endpoint"]
    elif ev["event"] == "error":
        raise RuntimeError(ev["data"].get("message"))
```

You can also let Pareta pick: `deploy(task="invoice-extraction", model="recommended")` resolves to `pa.tasks.recommended("invoice-extraction")`. The eval above is how you decide whether the recommended pick is actually the right call for *your* documents.

See [Deploying and operating endpoints](../guide/deploying-endpoints.md) for `list`, `start`, `stop`, `delete`, and the `metrics(...)` dimensions.

## 6. Run inference against the endpoint

The endpoint is OpenAI-compatible. Pass `ep.id` as `model`. For a vision document task, send the image in the standard OpenAI content-parts shape; PDFs are typically handed in as page images or a data URL, matching whatever the task expects.

```python
import base64

with open("invoices/new-INV.png", "rb") as fh:   # close the file once it is read
    img_b64 = base64.b64encode(fh.read()).decode()

resp = pa.chat.completions.create(
    model=ep.id,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract invoice_number, invoice_date, total, currency, vendor as JSON."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
            ],
        }
    ],
    temperature=0,
    max_tokens=512,
)
print(resp.choices[0].message.content)
print(resp.usage.total_tokens)
```
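
When a document arrives as several page images, building the content parts by hand gets repetitive. The helpers below are a minimal sketch of the same data-URL shape used above; `image_part` and `user_message` are hypothetical names, the MIME type is guessed with the standard library, and whether a given task accepts more than one image per message is an assumption to verify against the task.

```python
import base64
import mimetypes
from pathlib import Path


def image_part(path):
    """Build an OpenAI-style image_url content part holding the file as a data URL."""
    path = Path(path)
    mime, _encoding = mimetypes.guess_type(path.name)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"{path.name}: not a recognizable image type")
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}}


def user_message(prompt, *image_paths):
    """One user message: the text prompt followed by one part per page image."""
    parts = [{"type": "text", "text": prompt}]
    parts.extend(image_part(p) for p in image_paths)
    return {"role": "user", "content": parts}
```

The result drops into the call above as `messages=[user_message("Extract the invoice fields as JSON.", "page-1.png", "page-2.png")]`, where the page filenames are placeholders.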

Inference is metered the same way the eval was: a successful completion debits the org balance, and a zero balance raises `InsufficientCreditsError` (402). To stream tokens as they generate:

```python
for chunk in pa.chat.completions.create(model=ep.id, messages=[...], stream=True):
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```

## Async

Every step is mirrored on `AsyncPareta`. `deploy` and `runs.wait` are awaitable; deploy event streams and chat streams are `async for`.

```python
import asyncio
from pareta import AsyncPareta

async def main():
    # `items` is the list of rows built in step 2.
    async with AsyncPareta.from_env() as pa:
        es = await pa.evals.sets.create(task="invoice-extraction", items=items)
        # Attach one document per row; only row 0 is shown here.
        await pa.evals.sets.upload_document(es.id, "invoices/INV-4471.pdf", idx=0, field_name="document")

        run = await pa.evals.runs.create(
            eval_set=es.id, models=["qwen2.5-vl-1"], frontier="benchmarked", wait=True,
        )
        winner = max(
            (r for r in run.results if r.kind == "open"),
            key=lambda r: (r.quality_mean or 0),
        ).model_id

        ep = await pa.endpoints.deploy(task="invoice-extraction", model=winner, wait=True)

        resp = await pa.chat.completions.create(
            model=ep.id,
            messages=[{"role": "user", "content": "Extract the total as JSON."}],
        )
        print(resp.choices[0].message.content)

asyncio.run(main())
```

## The whole loop

```python
from pareta import Pareta

pa = Pareta.from_env()
TASK = "invoice-extraction"

# 1. eval set from your documents (`items` and `invoices` as built in step 2)
es = pa.evals.sets.create(task=TASK, items=items, name="vendor invoices")
for idx, path in enumerate(invoices):
    pa.evals.sets.upload_document(es.id, path, idx=idx, field_name="document")

# 2. compare open candidates against vision frontier baselines
run = pa.evals.runs.create(
    eval_set=es.id,
    models=["qwen2.5-vl-1", "internvl-1"],
    frontier="benchmarked",
    wait=True,
)
print(f"eval cost ${run.cost}")

# 3. pick the best open model
winner = max(
    (r for r in run.results if r.kind == "open"),
    key=lambda r: (r.quality_mean or 0),
).model_id

# 4. deploy it and extract
ep = pa.endpoints.deploy(task=TASK, model=winner, wait=True)
resp = pa.chat.completions.create(
    model=ep.id,
    messages=[{"role": "user", "content": "Extract the invoice fields as JSON."}],
)
print(resp.choices[0].message.content)
```

## See also

- [Inference (OpenAI-compatible)](../guide/inference.md) — calling endpoints, streaming, using the `openai` client.
- [Discovering tasks](../guide/discovery.md) — the catalog, `match`, leaderboards, and the `recommended` alias.
- [Evaluating models on your data](../guide/evaluation.md) — eval sets, runs, frontier baselines, and metering in depth.
- [Deploying and operating endpoints](../guide/deploying-endpoints.md) — deploy events, lifecycle, and per-endpoint metrics.



---

<!-- examples/streaming-chat.md -->

# Streaming chat completions

Stream tokens as the model generates them instead of waiting for the whole
response. Pass `stream=True` to `chat.completions.create(...)` and you get an
iterator of `ChatCompletionChunk` objects, each carrying one incremental piece
of text on `chunk.choices[0].delta.content`. Use this for chat UIs, agent
loops, long generations, and anywhere a first-token-fast experience matters.

Inference on Pareta is OpenAI-compatible, so the streaming shape here is the
same vLLM-style data-only SSE the `openai` SDK consumes. The model id you pass
is an endpoint id from [deploying an endpoint](../guide/deploying-endpoints.md), and
streamed inference is metered against your org balance exactly like a
non-streaming call.

## Quickstart

```python
from pareta import Pareta

pa = Pareta.from_env()  # reads PARETA_API_KEY (+ optional PARETA_BASE_URL)

stream = pa.chat.completions.create(
    model="ep_contract_kie",  # an endpoint id from endpoints.deploy(...)
    messages=[{"role": "user", "content": "Write a haiku about throughput."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```

`stream=True` changes the return type: instead of a single `ChatCompletion`,
`create(...)` returns an `Iterator[ChatCompletionChunk]`. Nothing is sent until
you start iterating, and the connection stays open for the life of the loop.

## Reading a chunk

A streaming chunk has the same schema as a `ChatCompletion`, but each choice
carries a `delta` (the incremental token) instead of a full `message`:

```python
chunk.choices[0].delta.content   # str | None — the new text in this chunk
chunk.choices[0].delta.role      # str | None — usually only set on the first chunk
chunk.choices[0].finish_reason   # str | None — "stop" / "length" on the last chunk
chunk.id                         # str | None
chunk.model                      # str | None
```

`delta.content` is `None` on chunks that carry no text (for example the opening
role chunk, or a final chunk that only sets `finish_reason`), so always guard
with an `if delta:` check before printing or appending. The stream ends when the
server sends `[DONE]`; the SDK consumes that sentinel and stops the iterator for
you, so a plain `for` loop terminates cleanly.
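
To make the wire format concrete, here is a minimal sketch of reading data-only SSE lines by hand. It is not the SDK's parser: `iter_deltas` is a hypothetical helper, and the sample lines are hand-written in the OpenAI chunk shape rather than captured from a real endpoint.

```python
import json


def iter_deltas(lines):
    """Yield delta.content strings from data-only SSE lines, stopping at [DONE]."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):   # skip blank separator lines and comments
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":            # sentinel: end of stream
            return
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        content = choices[0].get("delta", {}).get("content")
        if content:                        # None or "" carries no text
            yield content


# Hand-written sample (illustrative, not captured output).
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}, "finish_reason": null}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}, "finish_reason": null}]}',
    'data: {"choices": [{"delta": {"content": "lo"}, "finish_reason": null}]}',
    'data: {"choices": [{"delta": {}, "finish_reason": "stop"}]}',
    "data: [DONE]",
    'data: {"choices": [{"delta": {"content": "ignored"}}]}',
]
print("".join(iter_deltas(sample)))
```

The opening role chunk and the closing `finish_reason` chunk both pass through without yielding text, which is the same reason the `if delta:` guard matters when you use the typed client.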

Need the raw server JSON for a field the typed layer does not surface? Every
response object keeps it: `chunk.to_dict()` returns the untouched payload.

## Accumulating the full text

Collect the deltas into a buffer to reconstruct the complete message:

```python
from pareta import Pareta

pa = Pareta.from_env()

chunks = pa.chat.completions.create(
    model="ep_contract_kie",
    messages=[
        {"role": "system", "content": "You are concise."},
        {"role": "user", "content": "Summarize what an invoice number is."},
    ],
    stream=True,
    temperature=0.2,   # extra OpenAI params pass straight through
    max_tokens=256,
)

parts = []
finish_reason = None
for chunk in chunks:
    choice = chunk.choices[0]
    if choice.delta.content:
        parts.append(choice.delta.content)
    if choice.finish_reason:
        finish_reason = choice.finish_reason

full_text = "".join(parts)
print(full_text)
print("finish_reason:", finish_reason)  # e.g. "stop" or "length"
```

A `finish_reason` of `"length"` means the model hit `max_tokens` before it was
done; raise `max_tokens` if you need the full answer.

Note: token usage is not reliably populated on streamed chunks. If you need the
`usage` counts (`prompt_tokens` / `completion_tokens` / `total_tokens`), make
the same call with `stream=False` and read `completion.usage`.

## Extra parameters

Any OpenAI chat parameter you pass as a keyword argument is forwarded verbatim
in the request body: `temperature`, `max_tokens`, `top_p`, `stop`,
`frequency_penalty`, and so on. There is no hardware knob — GPUs, quantization,
and tensor-parallelism are resolved by Pareta when you deploy the endpoint, so
the only model selector here is the endpoint id you pass to `model`.

```python
stream = pa.chat.completions.create(
    model="ep_contract_kie",
    messages=[{"role": "user", "content": "List three GPU-free wins."}],
    stream=True,
    top_p=0.9,
    stop=["\n\n"],
)
```

## Async streaming

`AsyncPareta` mirrors the sync client. `create(...)` is a coroutine, so
`await` it once to get the async iterator, then drive it with `async for`:

```python
import asyncio
from pareta import AsyncPareta


async def main():
    async with AsyncPareta.from_env() as pa:
        stream = await pa.chat.completions.create(
            model="ep_contract_kie",
            messages=[{"role": "user", "content": "Stream me a limerick."}],
            stream=True,
        )
        async for chunk in stream:
            delta = chunk.choices[0].delta.content
            if delta:
                print(delta, end="", flush=True)
        print()


asyncio.run(main())
```

The `async with` block calls `aclose()` for you when the block exits, releasing
the HTTP client. The chunk shape is identical to the sync path:
`chunk.choices[0].delta.content` is the incremental text.

## Metering and errors

Streamed inference debits your org balance on success, the same as a
non-streaming completion. Top-ups are browser-only; the SDK does not expose
balance or payment methods. If the balance is empty, the call raises
`InsufficientCreditsError` (HTTP 402) before any tokens flow:

```python
from pareta import Pareta
from pareta import InsufficientCreditsError, EndpointNotReadyError

pa = Pareta.from_env()

try:
    stream = pa.chat.completions.create(
        model="ep_contract_kie",
        messages=[{"role": "user", "content": "Hello"}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
    print()
except InsufficientCreditsError:
    print("Out of credit — top up in the dashboard.")
except EndpointNotReadyError:
    print("Endpoint is cold or stopped — start it and retry.")
```

A few things to know about how the stream behaves under failure:

- **`model` / `messages` validation is local.** Passing an empty `model` or
  empty `messages` raises `ValueError` immediately, before any network call.
- **Errors surface before the first byte.** Non-2xx responses (402, 401, 404,
  503, and so on) are raised as the matching `ParetaError` subclass when the
  stream starts, not mid-loop. A stopped or cold endpoint raises
  `EndpointNotReadyError` (503).
- **Mid-stream drops are not retried.** Retries cover only the initial
  connect/handshake. Once SSE bytes are flowing, a dropped connection raises
  (`APIConnectionError` / `APITimeoutError`) rather than silently resuming,
  because a partial generation cannot be safely continued. Wrap the loop and
  re-issue the request if you need at-least-once delivery.
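
The "wrap the loop and re-issue" advice can be sketched as a small helper. This is an illustration under assumptions, not SDK behaviour: `collect_with_restart` is a hypothetical name, the fake stream stands in for a `create(..., stream=True)` call that yields `delta.content`, and the built-in `ConnectionError` and `TimeoutError` stand in for `APIConnectionError` and `APITimeoutError`.

```python
def collect_with_restart(make_stream, max_attempts=3,
                         retry_on=(ConnectionError, TimeoutError)):
    """Drain a stream of text deltas; on a mid-stream drop, discard the partial text and start over."""
    last_error = None
    for _attempt in range(max_attempts):
        parts = []                          # fresh buffer: never splice two generations
        try:
            for delta in make_stream():
                parts.append(delta)
            return "".join(parts)
        except retry_on as exc:
            last_error = exc
    raise RuntimeError(f"stream failed {max_attempts} times") from last_error


# Fake stream for illustration: drops once mid-generation, then succeeds.
calls = {"n": 0}


def flaky_stream():
    calls["n"] += 1
    yield "The total "
    if calls["n"] == 1:
        raise ConnectionError("connection dropped mid-stream")
    yield "is 1284.50."


text = collect_with_restart(flaky_stream)
print(text, "| attempts:", calls["n"])
```

Each attempt is a brand-new request, so with a non-zero `temperature` the second generation may differ from the first. That is why the partial text is thrown away rather than extended.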

See [error handling](../guide/errors-and-retries.md) for the full exception hierarchy.

## Related

- [Deploying an endpoint](../guide/deploying-endpoints.md) — get the `model` id you
  pass here.
- [Listing models](../guide/inference.md) — `models.list()` returns your deployed,
  callable endpoints.
- [Non-streaming completions](../guide/inference.md) — `stream=False` returns a
  single `ChatCompletion` with `usage` populated.
- [Running evals](../guide/evaluation.md) — compare models on your own data, also metered
  against the org balance.



---

<!-- examples/concurrent-async.md -->

# Concurrent calls with AsyncPareta

`AsyncPareta` lets you fire many requests at once instead of one at a time. When
you have a batch of inference prompts to score, or several eval runs to kick off,
running them concurrently turns a wall of sequential round-trips into a single
`asyncio.gather`. It exposes the same surface as the sync
[`Pareta`](../reference/client.md) client, with every resource method an
`async def` and the streaming iterators async.

This page shows how to:

- run a batch of `chat.completions` concurrently and collect results
- bound concurrency so you do not hammer an endpoint (backpressure)
- handle errors per task so one failure does not sink the batch
- launch and await several eval runs at once

One `AsyncPareta` instance wraps a single pooled `httpx.AsyncClient`. Build it
once, share it across all your coroutines, and close it once. Do not make a
client per request.

```python
import asyncio
from pareta import AsyncPareta

async def main():
    async with AsyncPareta.from_env() as pa:   # reads PARETA_API_KEY
        resp = await pa.chat.completions.create(
            model="ep_invoice_extract",
            messages=[{"role": "user", "content": "Extract the total."}],
        )
        print(resp.choices[0].message.content)

asyncio.run(main())
```

Inference is OpenAI-compatible and metered: each successful completion debits
your org balance, and a zero balance raises `InsufficientCreditsError` (402).
Top-up is browser-only, so the SDK never exposes balance or payment. `model` is
an endpoint id from [`endpoints.deploy()`](../guide/deploying-endpoints.md) (or any model
id your org can reach); Pareta hides the hardware, so there is no GPU knob to
pass.

## Fan out a batch of completions

`asyncio.gather` runs every coroutine concurrently and returns results in input
order. Because all calls share the same client, httpx pools and reuses
connections for you.

```python
import asyncio
from pareta import AsyncPareta

PROMPTS = [
    "Extract the invoice total.",
    "Extract the vendor name.",
    "Extract the due date.",
    "Extract the line-item count.",
]

async def classify_one(pa, prompt, document):
    resp = await pa.chat.completions.create(
        model="ep_invoice_extract",
        messages=[
            {"role": "system", "content": "You are an invoice parser."},
            {"role": "user", "content": f"{prompt}\n\n{document}"},
        ],
        temperature=0,
        max_tokens=64,
    )
    return resp.choices[0].message.content

async def main():
    document = "INVOICE #4471 ... TOTAL $1,240.00 ..."
    async with AsyncPareta.from_env() as pa:
        answers = await asyncio.gather(
            *(classify_one(pa, p, document) for p in PROMPTS)
        )
    for prompt, answer in zip(PROMPTS, answers):
        print(f"{prompt} -> {answer}")

asyncio.run(main())
```

If any coroutine raises, `gather` propagates the first exception to the caller
straight away. The other calls are not cancelled, but their results never reach
you. That is rarely what you want for a batch. The next two sections fix both
halves of the problem: capacity (backpressure) and partial failure.

## Bound concurrency with a semaphore

Firing 5,000 prompts at `gather` starts all 5,000 calls at once, overruns the
connection pool, and is the fastest way to earn a `RateLimitError` (429) or push
a cold endpoint into `EndpointNotReadyError` (503). An `asyncio.Semaphore` caps
how many calls are in flight at any moment. The rest queue and drain as slots
free up.

```python
import asyncio
from pareta import AsyncPareta

MAX_IN_FLIGHT = 16

async def complete(pa, sem, messages):
    async with sem:                       # acquire a slot; release on exit
        resp = await pa.chat.completions.create(
            model="ep_invoice_extract",
            messages=messages,
            temperature=0,
        )
        return resp.choices[0].message.content

async def run_batch(documents):
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)
    async with AsyncPareta.from_env() as pa:
        tasks = [
            complete(pa, sem, [{"role": "user", "content": f"Summarize:\n{d}"}])
            for d in documents
        ]
        return await asyncio.gather(*tasks)

# 1,000 docs, but never more than 16 concurrent requests
results = asyncio.run(run_batch([f"doc {i}" for i in range(1000)]))
print(len(results))
```

Pick `MAX_IN_FLIGHT` to match what the endpoint can sustain. 8 to 32 is a sane
starting band; tune it against the endpoint's latency from
[`endpoints.metrics()`](cost-and-metrics.md). The SDK already retries `429`,
`503`, and `5xx` with exponential backoff (`max_retries`, default 2), so the
semaphore is your first line of defense and retries are the backstop.
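
To check that a chosen limit actually holds, the semaphore pattern can be exercised without touching an endpoint. The sketch below tracks the peak number of calls in flight; `run_bounded` and `fake_call` are hypothetical, and the short sleep stands in for a network round-trip.

```python
import asyncio


async def run_bounded(worker, inputs, max_in_flight):
    """Run worker(x) for every input with at most max_in_flight active at once."""
    sem = asyncio.Semaphore(max_in_flight)
    active = 0
    peak = 0

    async def guarded(x):
        nonlocal active, peak
        async with sem:                    # wait here until a slot frees up
            active += 1
            peak = max(peak, active)
            try:
                return await worker(x)
            finally:
                active -= 1

    results = await asyncio.gather(*(guarded(x) for x in inputs))
    return results, peak


async def fake_call(i):
    await asyncio.sleep(0.01)              # stand-in for a network round-trip
    return i * 2


results, peak = asyncio.run(run_bounded(fake_call, range(50), max_in_flight=8))
print(len(results), "results, peak in flight:", peak)
```

Swapping `fake_call` for a function that awaits `chat.completions.create` gives the same guarantee against a real endpoint, and `gather` still returns results in input order.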

## Handle errors per task

Pass `return_exceptions=True` to `gather` and every coroutine resolves to either
its result or the exception it raised, in order. The batch always completes; you
decide what to do with the failures. This is the right default for fan-out work.

```python
import asyncio
from pareta import (
    AsyncPareta,
    ParetaError,
    InsufficientCreditsError,
    EndpointNotReadyError,
    RateLimitError,
    APITimeoutError,
)

MAX_IN_FLIGHT = 16

async def complete(pa, sem, doc):
    async with sem:
        resp = await pa.chat.completions.create(
            model="ep_invoice_extract",
            messages=[{"role": "user", "content": f"Extract the total from:\n{doc}"}],
            temperature=0,
        )
        return resp.choices[0].message.content

async def main(documents):
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)
    async with AsyncPareta.from_env() as pa:
        outcomes = await asyncio.gather(
            *(complete(pa, sem, d) for d in documents),
            return_exceptions=True,
        )

    ok, failed = [], []
    for doc, outcome in zip(documents, outcomes):
        if isinstance(outcome, InsufficientCreditsError):
            # Org balance hit zero mid-batch. Nothing else will succeed —
            # stop and top up in the dashboard.
            raise outcome
        if isinstance(outcome, BaseException):
            failed.append((doc, outcome))
        else:
            ok.append((doc, outcome))

    print(f"{len(ok)} succeeded, {len(failed)} failed")
    for doc, err in failed:
        if isinstance(err, EndpointNotReadyError):
            reason = "endpoint cold/stopped"      # 503
        elif isinstance(err, RateLimitError):
            reason = "rate limited after retries"  # 429
        elif isinstance(err, APITimeoutError):
            reason = "timed out"
        elif isinstance(err, ParetaError):
            reason = str(err)
        else:
            reason = repr(err)
        print(f"  retry {doc!r}: {reason}")
    return ok, failed

asyncio.run(main([f"doc {i}" for i in range(50)]))
```

Notes on the error types (all subclass `ParetaError`):

- **`InsufficientCreditsError` (402)** is fatal for the whole batch, not just one
  task. The balance is shared across the org, so once it hits zero every
  remaining call fails the same way. Stop early and top up.
- **`EndpointNotReadyError` (503)** means the endpoint is stopped, cold-starting,
  or its provider is down. Often transient; safe to retry the failed subset after
  a `start()` or a short wait.
- **`RateLimitError` (429)** surfaces only after the SDK exhausts its own
  retries. If you see these, lower `MAX_IN_FLIGHT`.
- **`APITimeoutError`** is raised once the SDK has exhausted `max_retries`. Long
  generations may need a larger `timeout=` on the client (default is 60s, 10s
  connect).

Because `return_exceptions=True` lets every call run to completion and records
each outcome, you can re-run just `failed` on the next pass.
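
A retry pass over the failed subset can be sketched as a loop around `gather`. This is an illustration, not an SDK feature: `run_with_retry_passes` and `flaky` are hypothetical, inputs are assumed to be hashable, and the built-in `TimeoutError` stands in for `APITimeoutError`. In real use you would pass `fatal=(InsufficientCreditsError,)` so an empty balance stops the whole batch.

```python
import asyncio


async def run_with_retry_passes(worker, inputs, max_passes=3, fatal=()):
    """gather with return_exceptions=True, then re-run only the failures."""
    pending = list(inputs)
    done = {}
    errors = {}
    for _ in range(max_passes):
        if not pending:
            break
        outcomes = await asyncio.gather(
            *(worker(x) for x in pending), return_exceptions=True
        )
        errors = {}
        for x, outcome in zip(pending, outcomes):
            if isinstance(outcome, fatal):      # nothing else can succeed: stop
                raise outcome
            if isinstance(outcome, BaseException):
                errors[x] = outcome
            else:
                done[x] = outcome
        pending = list(errors)                  # next pass retries only these
    return done, errors


attempts = {}


async def flaky(doc):
    attempts[doc] = attempts.get(doc, 0) + 1
    await asyncio.sleep(0)
    if doc.endswith("3") and attempts[doc] == 1:
        raise TimeoutError("first try timed out")   # stand-in for APITimeoutError
    return doc.upper()


done, errors = asyncio.run(
    run_with_retry_passes(flaky, [f"doc {i}" for i in range(5)])
)
print(len(done), "succeeded,", len(errors), "still failing")
```

For transient failures such as a cold endpoint, a short `await asyncio.sleep(...)` between passes gives the endpoint time to recover before the retry.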

## Streaming under concurrency

Async streaming mirrors the sync path with one twist: `create(...)` is a
coroutine, so you `await` it — and because `stream=True`, the awaited result is
an async iterator you then `async for` over. (Non-streaming `create` is awaited
too, returning the `ChatCompletion`.)

```python
import asyncio
from pareta import AsyncPareta

async def stream_into(pa, prompt, sink):
    stream = await pa.chat.completions.create(   # await → returns the async iterator
        model="ep_invoice_extract",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    async for chunk in stream:
        sink.append(chunk.choices[0].delta.content or "")

async def main():
    sinks = {p: [] for p in ("Summarize doc A.", "Summarize doc B.")}
    async with AsyncPareta.from_env() as pa:
        await asyncio.gather(
            *(stream_into(pa, p, sink) for p, sink in sinks.items())
        )
    for prompt, parts in sinks.items():
        print(prompt, "->", "".join(parts))

asyncio.run(main())
```

`chunk.choices[0].delta.content` is the incremental text. Streams end on
`[DONE]`; the SDK closes them for you. Retries only cover the initial handshake,
so a mid-stream drop raises immediately rather than silently resuming.
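If you want to keep whatever text arrived before a drop, one option is to catch the error around the `async for` and return the partial text alongside it. The sketch below uses a hypothetical `fake_stream` async generator in place of a real stream and yields plain strings rather than chunk objects; `collect` is an illustrative helper, not part of the SDK.

```python
import asyncio

async def fake_stream(parts, fail_after=None):
    """Hypothetical stand-in for a stream: yields text pieces, optionally dropping mid-way."""
    for i, part in enumerate(parts):
        if fail_after is not None and i == fail_after:
            raise ConnectionError("stream dropped")
        yield part

async def collect(stream):
    """Gather streamed text; on a mid-stream error return the partial text and the error."""
    pieces = []
    try:
        async for piece in stream:
            pieces.append(piece)
    except Exception as err:  # narrow this to the SDK's error types in real code
        return "".join(pieces), err
    return "".join(pieces), None

text, err = asyncio.run(collect(fake_stream(["Hel", "lo ", "world"], fail_after=2)))
print(repr(text), type(err).__name__)
```

Whether to show the partial text, discard it, or re-issue the whole request is an application decision; the stream itself cannot be resumed.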

## Concurrent eval runs

The same pattern launches several [eval runs](../guide/evaluation.md) at once. With
`wait=True`, each `runs.create(...)` polls the run to completion using
`asyncio.sleep` under the hood, so the coroutines yield the event loop while they
wait. That makes a fan-out of `wait=True` runs genuinely concurrent.

```python
import asyncio
from pareta import AsyncPareta

# Compare candidate aliases on three tasks, each against its frontier baselines.
JOBS = [
    ("contract-key-fields",   ["qwen-1", "llama-2"]),
    ("invoice-extraction",    ["qwen-1", "pixtral-1"]),
    ("doc-classification",    ["llama-1", "qwen-2"]),
]

async def eval_task(pa, eval_set_id, models):
    run = await pa.evals.runs.create(
        eval_set=eval_set_id,
        models=models,            # per-task open-model aliases
        frontier="benchmarked",   # frontier models on this task's leaderboard
        wait=True,                # polls until terminal (completed/failed)
        timeout=1200.0,
    )
    return run

async def main(eval_set_ids):
    async with AsyncPareta.from_env() as pa:
        runs = await asyncio.gather(
            *(eval_task(pa, sid, models)
              for sid, (_, models) in zip(eval_set_ids, JOBS)),
            return_exceptions=True,
        )

    for outcome in runs:
        if isinstance(outcome, BaseException):
            print("run failed to launch/finish:", outcome)
            continue
        run = outcome
        if run.status == "failed":
            print(f"{run.id}: failed — {run.error_detail}")
            continue
        print(f"{run.id}: {run.status}  cost ${run.cost}")  # Decimal dollars
        for r in run.results:
            print(f"  {r.model_id} ({r.kind}): quality={r.quality_mean}")

# eval_set_ids from earlier pa.evals.sets.create(...) calls, in the same order as JOBS
asyncio.run(main(["es_abc", "es_def", "es_ghi"]))
```

Eval runs are metered too: the org balance is debited for the compute (open
candidates plus any frontier baselines), and an empty balance raises
`InsufficientCreditsError` (402). `run.cost` is the billed total as `Decimal`
dollars floored to cents; `run.cost_micro_usd` keeps the raw micro-USD integer if
you need sub-cent precision. Result `model_id`s are per-task public aliases, not
real model ids.

If you do not want to block on completion, drop `wait=True` and the call returns
immediately with a queued `EvalRun`; await `pa.evals.runs.wait(run.id)` later, or
poll `pa.evals.runs.retrieve(run.id)` yourself.
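A hand-rolled polling loop could be sketched as follows. The `retrieve` argument stands for any async callable with the shape of `pa.evals.runs.retrieve`; the `fake_retrieve` function, the `queued` and `running` status names, and the `interval` default are assumptions made for the demo, while the two terminal statuses come from the example above.

```python
import asyncio
import time
from types import SimpleNamespace

TERMINAL = {"completed", "failed"}  # terminal statuses named in the example above

async def poll_until_terminal(retrieve, run_id, interval=2.0, timeout=1200.0):
    """Call the async retrieve(run_id) until the run is terminal or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        run = await retrieve(run_id)
        if run.status in TERMINAL:
            return run
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run {run_id} still {run.status!r} after {timeout}s")
        await asyncio.sleep(interval)  # yields the event loop to other coroutines

# Hypothetical stand-in for pa.evals.runs.retrieve: completes on the third poll.
_statuses = iter(["queued", "running", "completed"])

async def fake_retrieve(run_id):
    return SimpleNamespace(id=run_id, status=next(_statuses))

run = asyncio.run(poll_until_terminal(fake_retrieve, "run_demo", interval=0.01))
print(run.id, run.status)
```

Sleeping with `asyncio.sleep` rather than `time.sleep` is what lets several of these loops share one event loop.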

## Checklist

- One `AsyncPareta` per process, shared across coroutines. `async with` (or
  `await pa.aclose()`) to release the pool.
- `asyncio.gather(*tasks)` to fan out; `return_exceptions=True` so one failure
  does not cancel the batch.
- `asyncio.Semaphore(N)` to bound in-flight calls — your backpressure valve.
- Treat `InsufficientCreditsError` as batch-fatal; retry `EndpointNotReadyError`
  and the residual `RateLimitError` subset.
- Always `await create()`. For `stream=True` the awaited result is the async
  iterator you `async for` over; for non-streaming it is the `ChatCompletion`.

## See also

- [The client](../reference/client.md) — constructor, `from_env`, retries, timeouts
- [Chat completions](../guide/inference.md) — full inference surface and streaming
- [Endpoints](../guide/deploying-endpoints.md) — deploy, start/stop, and operate endpoints
- [Evals](../guide/evaluation.md) — eval sets, runs, and frontier baselines
- [Errors](../guide/errors-and-retries.md) — the full `ParetaError` hierarchy



---

<!-- examples/cost-and-metrics.md -->

# Cost & quality monitoring

Every dollar you spend on Pareta runs through one org balance, and every model you serve gets watched for drift. This page is about reading both: what a call or an eval run actually cost, how the open model you deployed stacks up against the frontier baseline it replaced, and how to watch a live endpoint's spend and quality over time so you catch a regression before your users do.

Two things to keep straight up front, because they shape every number below:

- **Money is metered against your org balance.** Inference (`chat.completions.create`) and evals (`evals.runs.create`) both debit the balance on success. An empty balance raises `InsufficientCreditsError` (402). The SDK never exposes balance or payment methods — top-up is browser-only, in the dashboard.
- **GPUs are hidden and models are aliases.** You never priced a GPU-hour or picked a quantization; Pareta did. So cost shows up as a flat per-request rate or a run total, and the open models in every cost report are per-task public aliases, not raw model names. Frontier (vendor) ids are in the clear.

```python
from pareta import Pareta

pa = Pareta.from_env()  # reads PARETA_API_KEY (+ optional PARETA_BASE_URL)
```

## The money convention: dollars are floored to cents

You are billed in whole cents, and the SDK **floors** to cents so it never overstates a charge. That rule shows up in two complementary fields on anything that carries a total:

- `cost: Decimal` — the billed total in dollars, floored to whole cents. A run that truly cost a third of a cent reads `Decimal("0.00")`.
- `cost_micro_usd: int` — the raw integer in micro-USD, where `1_000_000` == `$1.00`. This is the precise number for your own accounting.

```python
run = pa.evals.runs.retrieve(run_id)

print(run.cost)            # Decimal('0.07')  — billed dollars, floored to cents
print(run.cost_micro_usd)  # 74211            — raw micro-USD (74,211 uUSD)
```

The flooring is one-directional on purpose: a sub-cent total bills as `$0.00` but keeps its true value on `cost_micro_usd`, so nothing is lost. **Per-unit rates stay in micro-USD** and are never floored — flooring a sub-cent unit rate to whole cents would erase the open-vs-frontier comparison that the whole exercise is about. You will see this on `result.mean_cost_micro_usd` below.
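The SDK does this conversion for you, but the arithmetic is easy to reproduce when you aggregate `cost_micro_usd` values yourself. A minimal sketch, where `billed_dollars` and `MICRO_PER_DOLLAR` are illustrative names and not SDK API:

```python
from decimal import Decimal, ROUND_FLOOR

MICRO_PER_DOLLAR = 1_000_000  # 1_000_000 micro-USD == $1.00

def billed_dollars(micro_usd: int) -> Decimal:
    """Convert raw micro-USD to dollars, floored to whole cents."""
    dollars = Decimal(micro_usd) / MICRO_PER_DOLLAR
    return dollars.quantize(Decimal("0.01"), rounding=ROUND_FLOOR)

print(billed_dollars(74_211))  # 0.07
print(billed_dollars(3_333))   # 0.00
```

Sum the micro-USD integers first and floor once at the end; flooring each item and then summing would drop the sub-cent remainders.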

## What an eval run cost

An eval run is the densest cost signal you get, because it prices several models on the same rows in one shot. The run carries the bill; each `EvalResult` carries that model's per-item rate.

```python
run = pa.evals.runs.create(
    task="contract-key-fields",
    items=[
        {"input": "Effective as of January 1, 2026, ...", "expected": {"effective_date": "2026-01-01"}},
        {"input": "This Agreement terminates on 2027-12-31 ...", "expected": {"termination_date": "2027-12-31"}},
    ],
    models=["llama-1", "qwen-2"],   # per-task open aliases
    frontier="benchmarked",          # baselines already on this task's leaderboard
    wait=True,                       # block until the run is terminal
)

print(f"run {run.id}: {run.status}")
print(f"billed ${run.cost} ({run.cost_micro_usd} uUSD)")  # open + frontier compute

for r in run.results:
    print(f"{r.model_id:16} {r.kind:8} "
          f"q={r.quality_mean:.3f} [{r.quality_ci_low:.3f}, {r.quality_ci_high:.3f}]  "
          f"~{r.mean_cost_micro_usd} uUSD/item  "
          f"({r.n_succeeded} ok, {r.error_count} err)")
```

`run.cost` / `run.cost_micro_usd` is the **total** for the run, across both the open candidates and any frontier baselines — both are metered against your balance. Each `EvalResult` reports `mean_cost_micro_usd`, the average cost per item for that model in micro-USD. That field is the heart of a cost comparison, so it deliberately stays in raw micro-USD: a 700-uUSD frontier item and a 90-uUSD open item both floor to `$0.00`, and the gap between them is exactly the thing you came to measure.

If the balance is empty, `create` raises `InsufficientCreditsError` (402) before any compute runs. See [Errors, retries & timeouts](../guide/errors-and-retries.md).

### Quality vs. cost, the actual trade

The point of running open candidates next to a frontier baseline is to read both axes at once: how much quality you give up, and how much money you save. Split the results by `kind` and compare.

```python
run = pa.evals.runs.retrieve(run_id)

frontier = next((r for r in run.results if r.kind == "frontier"), None)
open_models = [r for r in run.results if r.kind == "open"]

for r in sorted(open_models, key=lambda r: r.quality_mean or 0.0, reverse=True):
    q = r.quality_mean or 0.0  # same None guard as the sort key
    line = f"{r.model_id:16} q={q:.3f}  {r.mean_cost_micro_usd} uUSD/item"
    if frontier and frontier.mean_cost_micro_usd and r.mean_cost_micro_usd:
        # micro-USD ratio — never compute savings off the floored dollar field
        cheaper = frontier.mean_cost_micro_usd / r.mean_cost_micro_usd
        dq = (r.quality_mean or 0.0) - (frontier.quality_mean or 0.0)
        line += f"  ({cheaper:.1f}x cheaper than {frontier.model_id}, dq={dq:+.3f})"
    print(line)
```

Two rules when you read this:

- **Compute savings from `mean_cost_micro_usd`, never from `cost`.** The dollar field is floored to cents and a per-item rate is almost always sub-cent, so a ratio built on it would divide by zero or lie. Stay in micro-USD for any per-unit math.
- **Respect the confidence interval.** `quality_mean` comes with `quality_ci_low` / `quality_ci_high` (a 95% CI). Two models whose intervals overlap are not meaningfully different on this sample — add rows before you call one the winner on a hair's-width quality edge.
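Both rules reduce to a few lines of arithmetic. The helpers below are a sketch with made-up numbers; `intervals_overlap` and `cost_ratio` are illustrative names, not SDK functions.

```python
def intervals_overlap(a_low, a_high, b_low, b_high):
    """True when two closed intervals share at least one point."""
    return a_low <= b_high and b_low <= a_high

def cost_ratio(frontier_micro_usd, open_micro_usd):
    """How many times cheaper the open model is per item; None if a rate is missing or zero."""
    if not frontier_micro_usd or not open_micro_usd:
        return None
    return frontier_micro_usd / open_micro_usd

# Made-up numbers for illustration only.
print(intervals_overlap(0.91, 0.95, 0.94, 0.97))  # True: not separable on this sample
print(intervals_overlap(0.80, 0.85, 0.90, 0.95))  # False
print(round(cost_ratio(700, 90), 1))              # 7.8
```

Pass `quality_ci_low` / `quality_ci_high` from two `EvalResult` objects to the first helper and the two `mean_cost_micro_usd` values to the second.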

Full eval mechanics (building sets, frontier roster selection, document tasks, async) live in [Evaluating on your own data](./evaluate-on-your-data.md) and the [Evaluation guide](../guide/evaluation.md).

## What an inference call cost

Inference is OpenAI-compatible, so `chat.completions.create` returns a `ChatCompletion` with a `usage` block. Use it for token accounting; the dollar cost of that traffic lands on the endpoint's cost metric (next section), since pricing is per-request at the endpoint, not returned inline per call.

```python
resp = pa.chat.completions.create(
    model="ep-contract-key-fields",   # an endpoint id from endpoints.deploy()
    messages=[{"role": "user", "content": "Extract the effective date from: ..."}],
)

u = resp.usage
print(u.prompt_tokens, u.completion_tokens, u.total_tokens)
print(resp.choices[0].message.content)
```

Each successful call debits your org balance. An empty balance raises `InsufficientCreditsError` (402) here too. The inference surface — streaming, kwargs pass-through, the OpenAI compatibility contract — is covered in [Running inference](../guide/inference.md).

## Monitoring a live endpoint

Once a model is serving, `endpoints.metrics(endpoint_id)` is your window into its spend, quality, latency, and uptime over time. It returns a `Metrics` handle with one method per dimension. Each method takes free-form `**params` that become the query string (e.g. a time window or a grouping), and each returns the raw metric JSON for that dimension — shapes vary by dimension, and typed models arrive with the OpenAPI generation later.

```python
m = pa.endpoints.metrics("ep-contract-key-fields")

cost      = m.cost()           # per-endpoint spend + vs-frontier savings
quality   = m.quality()        # judge windows over time
perf      = m.performance()    # p50/p95/p99 latency
uptime    = m.uptime()         # availability
activity  = m.activity()       # usage stats

# Narrow with params — they pass straight through as the query string.
last_day  = m.cost(window="24h")
by_day    = m.cost(group_by="day")
```

`endpoints.metrics(id)` is a cheap local handle — it does no I/O until you call a dimension. So you can hold one handle and query several dimensions off it.

### Cost and the vs-frontier savings framing

`m.cost()` is the per-endpoint counterpart to a run's total: it reports what the endpoint has spent and frames it against the frontier baseline the open model stands in for. That "vs-frontier savings" framing is the whole pitch of serving an open model — the metric tells you, in production, how much cheaper this endpoint is than calling the vendor model would have been. Because the dimension returns raw JSON, read it with the dict accessors:

```python
cost = pa.endpoints.metrics("ep-contract-key-fields").cost(window="7d")

# raw JSON dict — use the keys the dimension returns
print(cost.get("total_micro_usd"))
print(cost.get("frontier_baseline_micro_usd"))
print(cost.get("savings_micro_usd"))
```

The exact keys are owned by the backend and may grow; treat the dict as the source of truth and pull what you need. The money convention still holds — anything labeled `micro_usd` is raw micro-USD (`1_000_000` == `$1.00`), and you floor to cents yourself only when you want a billed-dollar figure.

### Quality monitoring (judge windows)

`m.quality()` reports the endpoint's quality over rolling windows, scored by the platform's judge — the same scoring machinery evals use, run continuously against live traffic so you catch drift without launching a run. Poll it on a schedule and alert when a window dips below your bar.

```python
q = pa.endpoints.metrics("ep-contract-key-fields").quality(window="24h")

score = q.get("quality_mean")
if score is not None and score < 0.90:
    print(f"quality slipped to {score:.3f} on the last window — investigate")
```

Latency (`performance`) and `uptime` round out the operational picture; `activity` reports usage volume. They are all the same call shape: pass a window or grouping, read the returned dict.

## Async

Every method here has an async twin on `AsyncPareta` with the same signatures. Note one shape detail: `endpoints.metrics(id)` itself is **not** a coroutine even on the async client — it returns an `AsyncMetrics` handle synchronously — but the dimension methods on that handle are `async def`.

```python
import asyncio
from pareta import AsyncPareta

async def main():
    async with AsyncPareta.from_env() as pa:
        run = await pa.evals.runs.retrieve(run_id)
        print("billed", run.cost, "/", run.cost_micro_usd, "uUSD")

        m = pa.endpoints.metrics("ep-contract-key-fields")  # sync handle, no await
        cost, quality = await asyncio.gather(
            m.cost(window="7d"),
            m.quality(window="24h"),
        )
        print(cost.get("savings_micro_usd"), quality.get("quality_mean"))

asyncio.run(main())
```

## Lossless access

Every response object keeps the raw server JSON. `run.to_dict()`, `result.to_dict()`, and the dicts returned by the metric dimensions give you everything the API sent, including fields not yet surfaced as typed properties. When a metric grows a new key before the SDK grows a property for it, `to_dict()` (or plain `.get()`) is your escape hatch.

## See also

- [Evaluating on your own data](./evaluate-on-your-data.md) — build eval sets, pick frontier baselines, read per-model results.
- [Deploy a model and call it](./deploy-and-infer.md) — task to live endpoint in two calls.
- [Concurrent & async](./concurrent-async.md) — fan-out inference and parallel eval runs.
- [Deploying endpoints](../guide/deploying-endpoints.md) — the full `endpoints` surface, including the `metrics` dimension table.
- [Running inference](../guide/inference.md) — the OpenAI-compatible chat surface and streaming.
- [Errors, retries & timeouts](../guide/errors-and-retries.md) — `InsufficientCreditsError`, the money convention, and the exception hierarchy.



---

<!-- examples/migrate-from-openai.md -->

# Migrating from the OpenAI SDK

Pareta inference is OpenAI-compatible. If you already call `chat.completions.create(...)` through the `openai` SDK, you do not have to rewrite that code to run on Pareta. Point the OpenAI client at your Pareta base URL with a `pareta_sk_` key and your existing inference keeps working against a Pareta endpoint.

This page covers two things:

1. **Keep using `openai` for inference**, the smallest possible diff: change `base_url` and `api_key`, pass an endpoint id as `model`.
2. **Switch to the `pareta` SDK** for the things OpenAI does not do: deploying endpoints, evaluating models against frontier baselines on your own data, and discovering tasks.

The mental model: OpenAI gives you one client for one purpose (inference). Pareta splits that into a data plane (inference, which is OpenAI-compatible) and a control plane (deploy / eval / discover, which is Pareta-native). You migrate the data plane by changing three strings (`api_key`, `base_url`, and `model`); you adopt the control plane when you want it.

## The one-diff migration

You have this today:

```python
from openai import OpenAI

client = OpenAI(api_key="sk-...")  # talks to api.openai.com
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Extract the invoice total: ..."}],
)
print(resp.choices[0].message.content)
```

Change the client construction and the `model`:

```python
from openai import OpenAI

client = OpenAI(
    api_key="pareta_sk_...",                 # a Pareta key, not an OpenAI key
    base_url="https://api.pareta.ai/v1",     # note the /v1 suffix
)
resp = client.chat.completions.create(
    model="ep_contract_kie",                 # a Pareta endpoint id, not "gpt-4o-mini"
    messages=[{"role": "user", "content": "Extract the invoice total: ..."}],
)
print(resp.choices[0].message.content)
```

Three things changed, nothing else:

- **`api_key`** is a `pareta_sk_...` key (mint it in the dashboard; key management is browser-only). It rides in the same `Authorization: Bearer` header the OpenAI client already sends.
- **`base_url`** is `https://api.pareta.ai/v1`. The OpenAI client appends `/chat/completions` to whatever base URL you give it, and Pareta serves the route at `/v1/chat/completions`, so the base URL must include the `/v1` suffix.
- **`model`** is a Pareta endpoint id (for example `ep_contract_kie`), the value you get back from `endpoints.deploy(...).id`. It is not an OpenAI model name. See [Deploy a model and call it](./deploy-and-infer.md) for how to get one.
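The `/v1` requirement is easy to get wrong, so a quick sanity check can help. The sketch below approximates the URL joining described above with plain string handling; it is not the `openai` library's actual code, and both helper names are hypothetical.

```python
def chat_completions_url(base_url: str) -> str:
    """Approximate how an OpenAI-style client builds the request URL from base_url."""
    return base_url.rstrip("/") + "/chat/completions"

def has_v1_suffix(base_url: str) -> bool:
    """Check that a base URL ends in /v1, with or without a trailing slash."""
    return base_url.rstrip("/").endswith("/v1")

good = "https://api.pareta.ai/v1"
bad = "https://api.pareta.ai"
print(chat_completions_url(good))  # https://api.pareta.ai/v1/chat/completions
print(chat_completions_url(bad))   # https://api.pareta.ai/chat/completions
```

Only the first URL matches the route Pareta serves, which is why the base URL passed to the OpenAI client needs the suffix.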

Streaming, `temperature`, `max_tokens`, `top_p`, `stop`, system messages, and the response shape (`resp.choices[0].message.content`, `resp.usage`) all behave exactly as they do against OpenAI, because the wire format is the same. Your existing response-parsing code does not change.

### Why this works

Pareta serves inference in the vLLM OpenAI-compatible format: data-only SSE for streams, the same request body, the same `ChatCompletion` / `ChatCompletionChunk` JSON shapes. The OpenAI SDK cannot tell the difference. The only Pareta-specific facts that leak through are the key prefix and the fact that `model` names an endpoint you deployed rather than a hosted vendor model.
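To make "data-only SSE" concrete, here is a minimal parser sketch for that framing: each event is a single `data:` line carrying a JSON chunk, and the stream ends with `data: [DONE]`. The sample frames are hand-written for illustration, and `iter_sse_chunks` is a hypothetical helper; both SDKs do this parsing for you.

```python
import json

def iter_sse_chunks(lines):
    """Yield decoded JSON chunks from data-only SSE lines, stopping at [DONE]."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separators and anything that is not a data line
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)

# Hand-written sample frames in the ChatCompletionChunk shape, for illustration only.
sample = [
    'data: {"choices": [{"index": 0, "delta": {"content": "Hel"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "lo"}}]}',
    "",
    "data: [DONE]",
]
text = "".join(
    chunk["choices"][0]["delta"].get("content") or "" for chunk in iter_sse_chunks(sample)
)
print(text)  # Hello
```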

## Where the OpenAI SDK stops

The OpenAI SDK is built around calling models that already exist on a vendor's servers. Pareta's reason to exist is the opposite: you bring a task, Pareta stands up an open-weights endpoint to serve it, and you measure it against frontier models on your own data before you commit. None of that has an OpenAI-SDK equivalent:

| You want to... | OpenAI SDK | Pareta SDK |
| --- | --- | --- |
| Call a model | `client.chat.completions.create(...)` | works as-is (OpenAI-compatible) |
| Stand up your own serving endpoint | not available | `pa.endpoints.deploy(task=..., model=...)` |
| Compare candidate models on your data | not available | `pa.evals.runs.create(...)` |
| Compare open models vs frontier baselines | not available | `frontier=` on the eval run |
| Find the right task / model for an intent | not available | `pa.tasks.match(...)`, `pa.tasks.leaderboard(...)` |
| List your callable endpoints | `client.models.list()` (vendor catalog) | `pa.models.list()` (your endpoints) |

For every row after the first, install and use the `pareta` SDK. It also speaks OpenAI-compatible inference through `pa.chat.completions.create(...)`, so once you adopt it you can drop the separate `openai` client entirely and use one library for both planes.

## Switching to the `pareta` SDK

Install it and construct the client from the environment. `Pareta.from_env()` reads `PARETA_API_KEY` and the optional `PARETA_BASE_URL` (default `https://api.pareta.ai` with no `/v1` suffix; the SDK adds route prefixes itself):

```python
from pareta import Pareta

pa = Pareta.from_env()  # reads PARETA_API_KEY (+ optional PARETA_BASE_URL)
```

The client is a context manager, which releases the HTTP connection cleanly:

```python
with Pareta.from_env() as pa:
    resp = pa.chat.completions.create(
        model="ep_contract_kie",
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(resp.choices[0].message.content)
```

### Inference looks the same, with one rename

The OpenAI call maps one-to-one onto the Pareta call. The arguments and the response shape are identical:

```python
# OpenAI:
resp = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "..."}],
    temperature=0,
    max_tokens=512,
)

# Pareta:
resp = pa.chat.completions.create(
    model="ep_contract_kie",   # endpoint id instead of a vendor model name
    messages=[{"role": "user", "content": "..."}],
    temperature=0,             # extra OpenAI params pass straight through
    max_tokens=512,
)

choice = resp.choices[0]
print(choice.message.content)
print(choice.finish_reason)       # "stop", "length", ...
print(resp.usage.total_tokens)    # prompt_tokens + completion_tokens
```

`model` and `messages` are both required; passing either falsy raises `ValueError` before any request goes out. Any extra OpenAI keyword argument (`temperature`, `max_tokens`, `top_p`, `stop`, `frequency_penalty`, ...) is forwarded verbatim as a request-body field.
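The validate-then-forward behavior can be pictured with a small sketch. This is an illustration of the rule, not the SDK's implementation: `build_request_body` is a hypothetical name, and including `stream` in the body only when it is true is an assumption.

```python
def build_request_body(model, messages, stream=False, **extra):
    """Validate required arguments, then merge extra keyword arguments into the body."""
    if not model:
        raise ValueError("model is required")
    if not messages:
        raise ValueError("messages is required")
    body = {"model": model, "messages": messages, **extra}
    if stream:
        body["stream"] = True  # assumption: only sent when streaming is requested
    return body

body = build_request_body(
    "ep_contract_kie",
    [{"role": "user", "content": "..."}],
    temperature=0,
    max_tokens=512,
)
print(sorted(body))  # ['max_tokens', 'messages', 'model', 'temperature']
```

Because the extra arguments are forwarded untouched, a misspelled parameter is not caught client-side; it reaches the server as an unknown body field.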

Streaming is the same shape as OpenAI too. `stream=True` returns an iterator of `ChatCompletionChunk`, and the incremental text is on `chunk.choices[0].delta.content`:

```python
for chunk in pa.chat.completions.create(
    model="ep_contract_kie",
    messages=[{"role": "user", "content": "Summarize this clause: ..."}],
    stream=True,
):
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```

See [Streaming chat completions](./streaming-chat.md) for the full streaming details and the async variant.

### Listing models means something different

In the OpenAI SDK, `client.models.list()` returns the vendor's hosted catalog. In Pareta, `pa.models.list()` returns **your org's deployed, callable endpoints**, the OpenAI-compatible subset with a live `url`. Each `Model.id` is an endpoint id you can pass straight to `chat.completions.create(model=...)`:

```python
for m in pa.models.list():
    print(m.id, m.owned_by)   # m.id is callable as `model=...`
```

If you want full endpoint records (status, task, the deployed model alias) rather than the OpenAI-compatible subset, use `pa.endpoints.list()` instead, which returns `Endpoint` objects.

## Three platform facts that have no OpenAI equivalent

These are the differences that matter once you are past the inference call. They are not gotchas; they are the point of the platform.

### 1. You deploy your own endpoint, and GPUs are hidden

There is no "pick gpt-4o" step. You pick a **task** and let Pareta serve an open-weights model for it. `deploy()` takes a task and a model and nothing about hardware: there is no GPU, tensor-parallel, quantization, or run-mode knob. Pareta resolves the serving class from its registry.

```python
# Deploy the recommended open model for a task, block until it is live.
endpoint = pa.endpoints.deploy(
    task="contract-key-fields",   # required: a subtask id from the catalog
    model="recommended",          # default; resolves to the task's curated pick
    wait=True,                    # block until live and return the Endpoint
)
print(endpoint.id)        # the value you pass as `model` to chat.completions.create
print(endpoint.is_live)   # True after a wait=True deploy
```

`endpoint.id` (the name) is what `chat.completions.create(model=...)` expects, not `endpoint.model`, which is the per-task alias of the weights that were deployed. Full walkthrough in [Deploy a model and call it](./deploy-and-infer.md).

### 2. Models are per-task aliases, not raw weight ids

OpenAI model names are global and stable (`gpt-4o-mini`). Pareta open-weights models are exposed as **per-task public aliases**; the real open-weights ids never cross into the SDK. So `endpoints.deploy(model=...)`, the rows in `pa.tasks.leaderboard(task_id)`, `endpoint.model`, and `result.model_id` on an eval run are all aliases. Frontier (vendor) ids (OpenAI, Anthropic, and so on) are passed in the clear, because those are public vendor model names. The practical upshot: pass `model="recommended"` (or a task's alias) to deploy, and let Pareta resolve it; do not try to pass a Hugging Face repo id.

### 3. Inference and evals are metered against your org balance

OpenAI bills the account behind the key out of band; you never see a price on the response. On Pareta, the same key debits a shared **org balance**, and the eval path surfaces the cost back to you in dollars.

- **Inference debits on success.** Each completed `chat.completions.create()` call debits the org balance. Cost is metered server-side, not returned on the completion object.
- **Evals debit for open + frontier compute**, and the run reports its spend: `run.cost` is a `Decimal` in dollars, floored to whole cents per Pareta's billing convention (for example, 5 micro-USD reads `Decimal("0.00")`), with the raw integer on `run.cost_micro_usd`.
- **A zero balance raises `InsufficientCreditsError` (402)** on both the inference and eval paths.
- **Top-up is browser-only.** The SDK consumes credit; it never exposes balance, payment methods, or a way to add funds. There is no API call for it.

```python
from pareta import InsufficientCreditsError

try:
    resp = pa.chat.completions.create(
        model="ep_contract_kie",
        messages=[{"role": "user", "content": "..."}],
    )
except InsufficientCreditsError:
    print("Org balance is empty. Top up in the dashboard, then retry.")
```

Note that this error reaches the OpenAI client too: if you stayed on the one-diff `openai`-SDK path, a 402 surfaces there as an OpenAI status error rather than as `pareta.InsufficientCreditsError`. Mapping it to a typed exception is one more reason to adopt the `pareta` SDK.

## Evaluate before you commit (the OpenAI SDK can't do this)

The biggest reason to reach for the `pareta` SDK rather than the bare `openai` client: before you wire a model into production, run your own data through a set of candidate open models and frontier baselines and read back quality and cost. There is no OpenAI-SDK analog.

```python
from pareta import Pareta

pa = Pareta.from_env()

run = pa.evals.runs.create(
    task="contract-key-fields",
    items=[
        {"input": "...", "expected": "..."},
        {"input": "...", "expected": "..."},
    ],
    models=["contract-kie-1", "contract-kie-2"],  # per-task open-model aliases
    frontier="benchmarked",                       # frontier baselines on this task's leaderboard
    wait=True,                                     # poll until terminal, then return
)

print(run.status)            # "completed" or "failed"
print(run.cost)              # Decimal dollars (floored to cents)

for r in run.results:
    print(r.model_id, r.kind, r.quality_mean, r.mean_cost_micro_usd)
```

`frontier=` accepts `None`/`"none"` (no baselines), `"all"` (every frontier model for the task), `"benchmarked"` (the frontier models on the task's leaderboard), or an explicit list of frontier ids. You can pull the roster with `pa.evals.frontier_models(task="contract-key-fields")` to see what is available, including which entries are `vision`-capable and which are `benchmarked` on that task. See [Evaluating on your own data](./evaluate-on-your-data.md) for eval sets, document uploads, and the async path.

## Discovery: turning intent into a task

If you do not yet know which task id to deploy or eval against, the `tasks` resource maps free-text intent onto the catalog. Again, no OpenAI equivalent:

```python
match = pa.tasks.match("pull key fields out of vendor contracts", top_k=5)
if match.matched and match.chosen:
    task_id = match.chosen.task_id
    print("best task:", task_id, "confidence:", match.chosen.confidence)

# Once you have a task id, see what "recommended" will deploy and how models rank:
print(pa.tasks.recommended(task_id))      # the deployable alias deploy(model=...) will use
board = pa.tasks.leaderboard(task_id)
for entry in board.models:
    print(entry.name, entry.kind, entry.quality, entry.cost_per_request_micro_usd)
```

`match()` raises `ValueError` on an empty query. `recommended()` and `leaderboard()` are available on the synchronous `tasks` resource.

## Errors: from OpenAI exceptions to Pareta exceptions

If you keep the `openai` client, you keep OpenAI's exception types. If you adopt the `pareta` SDK, errors become Pareta exceptions, all subclasses of `ParetaError`, mapped per HTTP status:

```python
from pareta import (
    ParetaError,               # base class; also raised on a failed deploy
    AuthenticationError,       # 401 - bad or missing key
    InsufficientCreditsError,  # 402 - org out of credit; top up in the dashboard
    PermissionDeniedError,     # 403
    NotFoundError,             # 404 - unknown endpoint or task
    RateLimitError,            # 429 - throttled (auto-retried)
    EndpointNotReadyError,     # 503 - endpoint stopped, cold, or provider down
    BadRequestError,           # 400/422 - malformed request
)
```

The rough correspondence to OpenAI: `AuthenticationError` ↔ 401, `RateLimitError` ↔ 429, `BadRequestError` ↔ 400/422, `NotFoundError` ↔ 404. The two without OpenAI analogs are `InsufficientCreditsError` (402, the org-balance gate above) and `EndpointNotReadyError` (503, raised when you call an endpoint that is stopped or still cold; start it with `pa.endpoints.start(endpoint_id)` and retry). The client auto-retries 429s and transient 5xx/timeouts with exponential backoff (`max_retries`, default 2).
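For intuition about what exponential backoff means for wall-clock time, the sketch below computes a delay schedule. Only the `max_retries` default of 2 comes from this page; the `base`, `factor`, and `cap` values are hypothetical, and the SDK's real schedule (including any jitter) may differ.

```python
def backoff_delays(max_retries=2, base=0.5, factor=2.0, cap=8.0):
    """Delay in seconds before each retry: base * factor**attempt, capped at cap."""
    return [min(cap, base * factor ** attempt) for attempt in range(max_retries)]

print(backoff_delays())               # [0.5, 1.0]
print(backoff_delays(max_retries=6))  # [0.5, 1.0, 2.0, 4.0, 8.0, 8.0]
```

The delays add to each attempt's own timeout, so raising `max_retries` lengthens the worst-case time before an error surfaces.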

## Async

`AsyncPareta` mirrors the sync client exactly (same arguments, same response shapes), with `async def` methods and async iterators for streams:

```python
import asyncio
from pareta import AsyncPareta


async def main():
    async with AsyncPareta.from_env() as pa:
        resp = await pa.chat.completions.create(
            model="ep_contract_kie",
            messages=[{"role": "user", "content": "Extract the total: ..."}],
        )
        print(resp.choices[0].message.content)

        # Streaming: await create() once to get the async iterator, then `async for`.
        stream = await pa.chat.completions.create(
            model="ep_contract_kie",
            messages=[{"role": "user", "content": "Summarize: ..."}],
            stream=True,
        )
        async for chunk in stream:
            print(chunk.choices[0].delta.content or "", end="")
        print()


asyncio.run(main())
```

This is the same async shape the `openai` SDK uses (`AsyncOpenAI`, `await create(...)`, `async for chunk`), so async migrations are as small as the sync ones.

## Migration checklist

- [ ] Mint a `pareta_sk_` key in the dashboard.
- [ ] Deploy an endpoint for your task and grab `endpoint.id` (see [Deploy a model and call it](./deploy-and-infer.md)).
- [ ] **Staying on `openai`?** Set `base_url="https://api.pareta.ai/v1"`, `api_key="pareta_sk_..."`, and `model=<endpoint id>`. Done.
- [ ] **Adopting `pareta`?** Swap `OpenAI(...)` for `Pareta.from_env()` and `client.chat...` for `pa.chat...`. Inference args and response shapes are unchanged.
- [ ] Map your error handling: 402 becomes `InsufficientCreditsError`, 503 becomes `EndpointNotReadyError`.
- [ ] Keep your org balance funded (top-up is browser-only); a zero balance stops both inference and evals.

## Next steps

- [Deploy a model and call it](./deploy-and-infer.md): get the endpoint id you pass as `model`.
- [Streaming chat completions](./streaming-chat.md): the full streaming and async story.



---

<!-- reference/client.md -->

# Client (`Pareta`, `AsyncPareta`)

The client is the one object you build and the only thing that talks to the network. It holds your API key, the environment URL, the timeout and retry policy, and an HTTP connection pool. Every call you make goes through it: deploying endpoints, running inference, browsing the catalog, evaluating models. There are two of them and they are mirror images: `Pareta` is synchronous, `AsyncPareta` is `async`/`await`. Pick one, build it once, reuse it.

```python
from pareta import Pareta

with Pareta.from_env() as pa:                 # reads PARETA_API_KEY
    print(pa.models.list())
```

Nothing else in the SDK is constructed directly. Resources like `pa.chat`, `pa.endpoints`, and `pa.evals` are attributes that hang off the client; you never instantiate them yourself.

## Build it from the environment

`from_env()` is the recommended constructor. It reads `PARETA_API_KEY` and an optional `PARETA_BASE_URL`, then builds the client for you. It keeps `pareta_sk_…` secrets out of source and lets the same code run against production or staging by flipping one environment variable.

```bash
export PARETA_API_KEY="pareta_sk_live_…"
```

```python
from pareta import Pareta, AsyncPareta

pa = Pareta.from_env()             # sync
apa = AsyncPareta.from_env()       # async — same call, async client
```

```python
@classmethod
Pareta.from_env(**kwargs) -> Pareta
AsyncPareta.from_env(**kwargs) -> AsyncPareta
```

`from_env()` forwards any extra keyword arguments straight to the constructor, so you can keep the key in the environment and still override the rest in code:

```python
pa = Pareta.from_env(max_retries=5, timeout=120.0)
```

An explicit `api_key=` or `base_url=` passed to `from_env()` wins over the environment variable of the same name.
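
The precedence rule can be pictured with a small stdlib-only sketch. It is an illustration of the documented behavior, not the SDK's actual code: `resolve_config` and `DEFAULT_BASE_URL` are hypothetical names, and dropping `None` overrides is an assumption made for the example.

```python
from __future__ import annotations

import os
from typing import Any, Mapping

DEFAULT_BASE_URL = "https://api.pareta.ai"  # the documented production default


def resolve_config(env: Mapping[str, str] | None = None, **kwargs: Any) -> dict[str, Any]:
    """Merge environment values with explicit keyword overrides."""
    env = os.environ if env is None else env
    config: dict[str, Any] = {
        "api_key": env.get("PARETA_API_KEY"),
        "base_url": env.get("PARETA_BASE_URL") or DEFAULT_BASE_URL,
    }
    # Explicit keyword arguments win over the environment variable of the
    # same name; None overrides are ignored (an assumption of this sketch).
    config.update({k: v for k, v in kwargs.items() if v is not None})
    return config
```

Passing `env` explicitly keeps the sketch easy to test; the real `from_env()` reads the process environment.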

## Construct it directly

When you are not driving config from the environment, call the constructor. Both clients take the same arguments; they differ only in the type of `http_client`.

```python
from pareta import Pareta

pa = Pareta(
    api_key="pareta_sk_live_…",
    base_url="https://api.pareta.ai",
    timeout=60.0,
    max_retries=2,
    http_client=None,
)
```

```python
Pareta(
    api_key: str | None = None,
    base_url: str | None = None,
    timeout=None,
    max_retries: int = 2,            # DEFAULT_MAX_RETRIES
    http_client: httpx.Client | None = None,
)

AsyncPareta(
    api_key: str | None = None,
    base_url: str | None = None,
    timeout=None,
    max_retries: int = 2,
    http_client: httpx.AsyncClient | None = None,
)
```

| Parameter | Type | Default | What it does |
|-----------|------|---------|--------------|
| `api_key` | `str \| None` | `None` | Your `pareta_sk_…` key. Sent as `Authorization: Bearer <key>`. Required (raises `ParetaError` if missing). |
| `base_url` | `str \| None` | `"https://api.pareta.ai"` | API root. Normalized with `rstrip("/")`. Pass the staging URL to point at staging. |
| `timeout` | `httpx.Timeout \| float \| None` | `httpx.Timeout(60.0, connect=10.0)` | Per-request HTTP timeout. |
| `max_retries` | `int` | `2` | Automatic retries on transient failures. Clamped to `>= 0`. |
| `http_client` | `httpx.Client \| httpx.AsyncClient \| None` | `None` | Bring your own httpx client (proxies, custom transports, pools). |

### `api_key`

The key is the one piece of config you cannot skip. The SDK sends it as a Bearer token on every request. Mint keys in the dashboard; key management is browser-only and the SDK only ever consumes a key.

If the key is falsy (and `PARETA_API_KEY` is unset when using `from_env()`), the constructor raises `ParetaError` before any network call:

```python
from pareta import Pareta, ParetaError

try:
    pa = Pareta(api_key="")
except ParetaError as e:
    print(e)
    # missing API key. Pass api_key=… or set PARETA_API_KEY
    # (mint a pareta_sk_ key in the dashboard).
```

A key that is present but rejected by the server surfaces as `AuthenticationError` (401) on the first request, not at construction time.

### `base_url`

`base_url` selects the environment. It defaults to production and is normalized with a trailing-slash strip, so `https://api.pareta.ai/` and `https://api.pareta.ai` behave identically. Keys are environment-scoped: pair each `base_url` with a key minted for that environment.

```python
prod    = Pareta(api_key="pareta_sk_live_…")                                  # default
staging = Pareta(api_key="pareta_sk_test_…", base_url="https://api-staging.pareta.ai")
```

### `timeout`

Sets the HTTP timeouts applied to each request. The default `httpx.Timeout(60.0, connect=10.0)` allows up to 10 seconds to connect and 60 seconds for each of the other phases (read, write, and pool acquisition); in httpx these limits apply per phase rather than to the request as a whole. A bare float applies the same limit to every phase, connect included. Raise it for long completions, or stream the response so tokens arrive incrementally (see [Inference](../guide/inference.md)). Note that `evals.runs.create(..., wait=True)` has its own `timeout` argument governing the poll loop, separate from this HTTP timeout (see [Evals](../guide/evaluation.md)).

```python
import httpx
from pareta import Pareta

pa = Pareta(api_key="pareta_sk_live_…", timeout=httpx.Timeout(120.0, connect=10.0))
```

### `max_retries`

The SDK automatically retries transient failures: HTTP `408, 409, 429, 500, 502, 503, 504`. The default is `2` (up to 3 attempts). Backoff is exponential with jitter, capped at 8 seconds, and honors a server `Retry-After` header when present. Non-transient errors (`401`, `402`, `404`, and so on) raise on the first attempt. Once a stream's bytes are flowing, a mid-stream drop raises immediately and is not retried. See [Errors and retries](../guide/errors-and-retries.md).

```python
pa = Pareta(api_key="pareta_sk_live_…", max_retries=5)   # patient batch job
pa = Pareta(api_key="pareta_sk_live_…", max_retries=0)   # fail fast (tests)
```
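
To make the retry policy concrete, here is a minimal sketch of exponential backoff with jitter, an 8-second cap, and `Retry-After` precedence. It is not the SDK's implementation: the 0.5-second base, the "full jitter" strategy, and the helper names `backoff_delay` and `should_retry` are assumptions chosen for illustration.

```python
from __future__ import annotations

import random

RETRYABLE_STATUSES = frozenset({408, 409, 429, 500, 502, 503, 504})
MAX_BACKOFF_SECONDS = 8.0  # the documented cap


def backoff_delay(
    attempt: int,
    retry_after: float | None = None,
    base: float = 0.5,  # assumed base delay, not documented
    rng: random.Random | None = None,
) -> float:
    """Seconds to sleep before retry number `attempt` (1-based)."""
    if retry_after is not None:
        return max(0.0, retry_after)  # a server hint takes precedence
    rng = rng or random.Random()
    ceiling = min(MAX_BACKOFF_SECONDS, base * 2 ** (attempt - 1))
    return rng.uniform(0.0, ceiling)  # "full jitter": anywhere up to the ceiling


def should_retry(status: int, attempt: int, max_retries: int) -> bool:
    """True if `status` is transient and retry number `attempt` is allowed."""
    return status in RETRYABLE_STATUSES and attempt <= max(0, max_retries)
```

With `max_retries=2`, retries 1 and 2 are allowed and a third failure surfaces as an exception, which matches the "up to 3 attempts" wording above.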

### `http_client`

By default the client constructs its own httpx client (configured with your `timeout`) and closes it for you. Pass `http_client=` to control the transport layer: an outbound proxy, mTLS, shared connection pools, or test doubles.

```python
import httpx
from pareta import Pareta

my_client = httpx.Client(
    proxy="http://proxy.internal:8080",
    limits=httpx.Limits(max_connections=50, max_keepalive_connections=10),
    timeout=httpx.Timeout(120.0, connect=10.0),
)
pa = Pareta(api_key="pareta_sk_live_…", http_client=my_client)
```

When you inject a client, you own its lifecycle and its timeout. `pa.close()` will not close a client you passed in, and the constructor's `timeout=` applies only to an SDK-owned client. Set the timeout on your own client, and close it yourself.

## Lifecycle and cleanup

Each client owns an HTTP connection pool. Release it when you are done. The idiomatic way is the context manager, which closes the pool on exit.

### Sync

```python
close() -> None          # close the HTTP client (only if the SDK owns it)
__enter__() -> Pareta
__exit__(*exc) -> None
```

```python
from pareta import Pareta

with Pareta.from_env() as pa:
    completion = pa.chat.completions.create(
        model="ep_a1b2c3",
        messages=[{"role": "user", "content": "Extract the parties."}],
    )
    print(completion.choices[0].message.content)
# HTTP client closed on exit
```

Or close it explicitly:

```python
pa = Pareta.from_env()
try:
    pa.models.list()
finally:
    pa.close()
```

### Async

```python
async aclose() -> None
async __aenter__() -> AsyncPareta
async __aexit__(*exc) -> None
```

```python
import asyncio
from pareta import AsyncPareta

async def main():
    async with AsyncPareta.from_env() as pa:
        models = await pa.models.list()
        print(models)
    # HTTP client closed on exit

asyncio.run(main())
```

The ownership rule holds in both: if you passed `http_client=`, neither `close()`/`aclose()` nor exiting the context manager touches it. Close your own client.
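
The ownership rule is easy to see in a toy model. The sketch below is a stand-in, not the SDK: `FakeHTTPClient` replaces `httpx.Client` so the block runs without dependencies, and `OwnershipDemoClient` and its `_owns_client` flag are hypothetical names.

```python
from __future__ import annotations


class FakeHTTPClient:
    """Stand-in for httpx.Client; records whether close() was called."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


class OwnershipDemoClient:
    """Closes its HTTP client only if it created that client itself."""

    def __init__(self, http_client: FakeHTTPClient | None = None) -> None:
        self._owns_client = http_client is None
        self._http = http_client if http_client is not None else FakeHTTPClient()

    def close(self) -> None:
        if self._owns_client:  # an injected client is left untouched
            self._http.close()

    def __enter__(self) -> "OwnershipDemoClient":
        return self

    def __exit__(self, *exc: object) -> None:
        self.close()
```

Exiting the `with` block closes an SDK-style owned client but leaves an injected one open, so the caller stays responsible for closing it.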

## Resource namespaces

The client is a namespace router. Every capability hangs off it as an attribute. The sync client exposes the sync resources; the async client exposes the async mirrors. The method shapes match one-to-one: async methods are `async def`, and streaming methods return async iterators on the async client.

| Namespace | Sync type | Async type | What it does | Reference |
|-----------|-----------|------------|--------------|-----------|
| `chat` | `Chat` | `AsyncChat` | OpenAI-compatible inference via `chat.completions.create(...)`. Metered. | [chat](./chat.md) |
| `models` | `Models` | `AsyncModels` | `models.list()` — the deployed, callable endpoints (OpenAI-compatible subset). | [models](./models.md) |
| `endpoints` | `Endpoints` | `AsyncEndpoints` | `deploy`, `list`, `retrieve`, `start`, `stop`, `delete`, and `metrics(id)`. | [endpoints](./endpoints.md) |
| `tasks` | `Tasks` | `AsyncTasks` | Browse the benchmark catalog: `list`, `retrieve`, `match`, `leaderboard`, `recommended`. | [tasks](./tasks.md) |
| `evals` | `Evals` | `AsyncEvals` | `evals.sets`, `evals.runs`, and `evals.frontier_models(...)`. Metered. | [evals](./evals.md) |

A tour of all five against one client:

```python
from pareta import Pareta

with Pareta.from_env() as pa:
    # tasks — discover what to deploy
    task = "contract-key-fields"
    print("recommended:", pa.tasks.recommended(task))      # a per-task alias

    # endpoints — deploy it (wait=True blocks through the deploy SSE stream)
    ep = pa.endpoints.deploy(task=task, model="recommended", wait=True)
    print("live:", ep.id, ep.status)

    # chat — OpenAI-compatible inference against the endpoint id
    resp = pa.chat.completions.create(
        model=ep.id,
        messages=[{"role": "user", "content": "Say hello."}],
    )
    print(resp.choices[0].message.content)

    # models — the OpenAI-compatible list of callable endpoints
    for m in pa.models.list():
        print(m.id, m.owned_by)

    # evals — score candidates on your own data, billed total in dollars
    run = pa.evals.runs.create(
        task=task,
        items=[{"input": "…", "expected": "…"}],
        models=["qwen-vl-2"],
        frontier="benchmarked",
        wait=True,
    )
    print("run cost:", run.cost)        # Decimal dollars, floored to cents
```

The same code on the async client, with `await` and the async context manager:

```python
import asyncio
from pareta import AsyncPareta

async def main():
    async with AsyncPareta.from_env() as pa:
        ep = await pa.endpoints.deploy(
            task="contract-key-fields", model="recommended", wait=True,
        )
        resp = await pa.chat.completions.create(
            model=ep.id,
            messages=[{"role": "user", "content": "Say hello."}],
        )
        print(resp.choices[0].message.content)

asyncio.run(main())
```

Two async differences worth pinning down: `pa.endpoints.metrics(id)` returns an `AsyncMetrics` object directly (not a coroutine), and its dimension methods are the things you await. See [Async](../guide/async.md) for the full sync-vs-async mapping.

## Platform truths the client makes concrete

These hold no matter how you build the client. They are why there is no GPU knob, no balance API, and no real model ids in the SDK.

- **GPUs are hidden.** You configure a key, a URL, timeouts, and retries — never hardware. `endpoints.deploy(task=…, model=…)` takes a task and a model alias; Pareta resolves the GPU, tensor-parallelism, and quantization from its registry. There is no hardware parameter anywhere in the SDK. See [Deploy endpoints](../guide/deploying-endpoints.md).
- **Models are per-task aliases.** Every model id you pass or read — in `deploy(model=…)`, on `tasks.leaderboard()` rows, in `run.results[].model_id`, in `endpoints.list()[].model` — is a per-task public alias like `qwen-vl-2`. Real internal ids never cross into the SDK. Frontier (vendor) ids are in the clear. See [Discovery](../guide/discovery.md).
- **Inference and evals are metered against your org balance.** A successful `pa.chat.completions.create()` debits your balance; `pa.evals.runs.create()` debits for both open and frontier compute. An `EvalRun` reports its billed total on `run.cost` (a `Decimal` in dollars, floored to whole cents, so a sub-cent run reads `Decimal("0.00")`) and the raw value on `run.cost_micro_usd`. When the balance hits zero, both paths raise `InsufficientCreditsError` (402). Top-up is browser-only; the SDK exposes neither balance nor payment methods.

  ```python
  from pareta import InsufficientCreditsError

  try:
      pa.chat.completions.create(model=ep.id, messages=[{"role": "user", "content": "ping"}])
  except InsufficientCreditsError:
      print("Out of credit — top up in the dashboard.")
  ```

- **Inference is OpenAI-compatible.** `base_url` plus your `pareta_sk_…` key is a drop-in OpenAI endpoint. You can point the `openai` SDK at the same `base_url` to call a deployed endpoint; this SDK adds the control plane (deploy, eval, discovery) the `openai` client cannot do. See [Inference](../guide/inference.md).
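
The `run.cost` flooring rule from the metering bullet can be reproduced with `decimal`. This is a sketch under one assumption: that `run.cost_micro_usd` is an integer count of millionths of a dollar, as the name suggests. The helper name `dollars_from_micro_usd` is hypothetical.

```python
from decimal import ROUND_DOWN, Decimal


def dollars_from_micro_usd(micro_usd: int) -> Decimal:
    """Convert micro-dollars to dollars, floored to whole cents."""
    dollars = Decimal(micro_usd) / Decimal(1_000_000)
    return dollars.quantize(Decimal("0.01"), rounding=ROUND_DOWN)
```

Under that assumption, a run that cost 9,999 micro-dollars reads as `Decimal("0.00")`, while the raw value remains available for exact accounting.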

## See also

- [Configuration](../guide/configuration.md) — the full configuration guide: `from_env`, `base_url`, timeouts, retries, custom transports, and the configuration cookbook.
- [Inference](../guide/inference.md) — `chat.completions.create`, streaming, and metering.
- [Deploy endpoints](../guide/deploying-endpoints.md) — deploy a model to a task and operate it.
- [Discovery](../guide/discovery.md) — browse the catalog, match intent, read leaderboards.
- [Evaluation](../guide/evaluation.md) — score models on your own data, including `run.cost`.
- [Errors and retries](../guide/errors-and-retries.md) — the `ParetaError` hierarchy and retry behavior.
- [Async](../guide/async.md) — the sync-vs-async mapping for every resource.



---

<!-- reference/chat.md -->

# chat.completions

Run inference against a deployed endpoint. `chat.completions.create(...)` is the one call you make to get tokens out of a model you deployed on Pareta. It has the same shape as the OpenAI chat completions API: pass a `model`, a list of `messages`, and you get a `ChatCompletion` back. Set `stream=True` and you get an iterator of token deltas instead.

Inference on Pareta is OpenAI-compatible on the wire (vLLM-style SSE), so this exact surface works whether you call it through this SDK, the `openai` package, or raw HTTP. This SDK's added value is the control plane around it (deploy, eval, discover); for plain inference the clients are interchangeable.

Two platform truths shape this page:

- **Models are per-task aliases, and GPUs are hidden.** The `model` you pass is an endpoint id from [`endpoints.deploy(...)`](./endpoints.md), or any callable model id your org can reach. Real open-weights model ids never cross to you, and you never pick a GPU, quantization, or tensor-parallel setting. The backend resolves all of that.
- **Inference is metered against your org balance.** A successful completion debits your balance in dollars. If the balance is empty, the call raises [`InsufficientCreditsError`](exceptions.md) (402). Top-up is browser-only; the SDK exposes no balance or payment surface.

**Route:** `POST /v1/chat/completions`

## Signature

```python
class Completions:
    def create(
        self,
        *,
        model: str,
        messages: list[dict[str, Any]],
        stream: bool = False,
        **kwargs: Any,
    ) -> ChatCompletion | Iterator[ChatCompletionChunk]
```

All arguments are keyword-only.

| Parameter | Type | Default | Notes |
|-----------|------|---------|-------|
| `model` | `str` | required | An endpoint id from [`endpoints.deploy(...)`](./endpoints.md), an id from [`models.list()`](./models.md), or a per-task alias. Validated server-side at call time. |
| `messages` | `list[dict]` | required | Non-empty list of OpenAI-format message dicts (`{"role": ..., "content": ...}`). |
| `stream` | `bool` | `False` | `False` returns a `ChatCompletion`; `True` returns an iterator of `ChatCompletionChunk`. |
| `**kwargs` | `Any` | — | Any extra OpenAI body field (`temperature`, `max_tokens`, `top_p`, `stop`, `seed`, ...) passes through unchanged. |

`model` and `messages` are both required. The SDK raises `ValueError` before sending if `model` is falsy or `messages` is empty, so a malformed call fails fast without burning a request or a charge.

## Basic completion

```python
from pareta import Pareta

with Pareta.from_env() as pa:   # reads PARETA_API_KEY (+ optional PARETA_BASE_URL)
    resp = pa.chat.completions.create(
        model="ep_invoice_xtract",   # an endpoint id from endpoints.deploy()
        messages=[
            {"role": "system", "content": "You extract structured fields from documents."},
            {"role": "user", "content": "What is the invoice total?\n\nINVOICE\nTotal due: $4,210.00"},
        ],
    )

    print(resp.choices[0].message.content)
    print(resp.usage.total_tokens, "tokens")
```

`Pareta.from_env()` is the recommended constructor; it reads `PARETA_API_KEY` and the optional `PARETA_BASE_URL`. You can also pass the key explicitly with `Pareta(api_key="pareta_sk_...")`. The client is a context manager, so `with` closes the HTTP connection pool for you.

### Where `model` comes from

Three interchangeable sources:

- An endpoint id you deployed. See [`endpoints.deploy`](./endpoints.md).
- Any id returned by [`models.list()`](./models.md) (your deployed, callable endpoints).
- A per-task model alias. The deployable recommended pick for a task is `pa.tasks.recommended(task_id)`; see [`tasks`](./tasks.md).

## Return type: ChatCompletion

With `stream=False` (the default), `create(...)` returns a `ChatCompletion`. Fields mirror OpenAI:

```python
resp.id                            # str | None
resp.model                         # str | None — the alias that served the call
resp.created                       # int | None — Unix timestamp
resp.choices                       # list[Choice]
resp.choices[0].index              # int | None
resp.choices[0].finish_reason      # str | None — "stop", "length", ...
resp.choices[0].message.role       # str | None — "assistant"
resp.choices[0].message.content    # str | None — the generated text
resp.usage.prompt_tokens           # int | None
resp.usage.completion_tokens       # int | None
resp.usage.total_tokens            # int | None
```

Every response object keeps the untouched server JSON. If a field is not surfaced as a typed property, reach it with `resp.to_dict()` or `resp["..."]`. Nothing the API returns is lost behind the typed layer.

## Passthrough parameters

Any extra keyword goes straight into the request body, so the full OpenAI parameter set is available without the SDK enumerating it:

```python
resp = pa.chat.completions.create(
    model="ep_invoice_xtract",
    messages=[{"role": "user", "content": "Summarize this contract clause: ..."}],
    temperature=0.2,
    max_tokens=512,
    top_p=0.9,
    stop=["\n\n"],
    seed=7,
)
```

These fields are not validated SDK-side; they are forwarded as-is and validated by the serving model. An unsupported field comes back as a `BadRequestError` (400/422).
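
A rough picture of how passthrough could work: validate the two required fields, then merge every extra keyword into the JSON body untouched. This is a sketch, not the SDK's code; `build_chat_body` is a hypothetical name, and omitting `stream` from the body when it is `False` is an assumption.

```python
from typing import Any


def build_chat_body(
    model: str,
    messages: list[dict[str, Any]],
    stream: bool = False,
    **kwargs: Any,
) -> dict[str, Any]:
    """Assemble the POST /v1/chat/completions body for one call."""
    # Fail fast on the two required fields, before any request is sent.
    if not model:
        raise ValueError("model is required")
    if not messages:
        raise ValueError("messages must be a non-empty list")

    body: dict[str, Any] = {"model": model, "messages": messages}
    if stream:
        body["stream"] = True
    # temperature, max_tokens, top_p, stop, seed, ... are forwarded as-is;
    # whether they are supported is decided server-side.
    body.update(kwargs)
    return body
```

Because nothing in `kwargs` is inspected, a typo such as `temprature=0.2` would only be caught by the server, which is why unsupported fields surface as `BadRequestError` rather than a local exception.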

## Streaming

Set `stream=True` and `create(...)` returns an `Iterator[ChatCompletionChunk]` instead of a single `ChatCompletion`. Each chunk carries a `delta` (not a `message`); the incremental text is at `chunk.choices[0].delta.content`.

```python
with Pareta.from_env() as pa:
    stream = pa.chat.completions.create(
        model="ep_invoice_xtract",
        messages=[{"role": "user", "content": "Draft a one-paragraph status update."}],
        stream=True,
    )
    for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="", flush=True)
    print()
```

A chunk has the same envelope as a `ChatCompletion` (`id`, `model`, `choices`), with `delta` where a completion has `message`. `ChatCompletionChunk` exists as a distinct type only for hinting:

```python
chunk.choices[0].delta.content    # str | None — the new text in this chunk
chunk.choices[0].delta.role       # str | None — usually only set on the first chunk
chunk.choices[0].finish_reason    # str | None — "stop" / "length" on the last chunk
chunk.id                          # str | None
chunk.model                       # str | None
```

Guard `delta.content` with `or ""` (or an `if delta:` check): the opening role chunk and the final `finish_reason` chunk carry no text, so `delta.content` is `None` there. To accumulate the full text:

```python
text = "".join(c.choices[0].delta.content or "" for c in stream)
```

The stream is data-only SSE and always terminates on a `[DONE]` sentinel, which the SDK consumes for you, so the iterator simply ends and a plain `for` loop exits cleanly.
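
For readers calling the route over raw HTTP, the framing described above can be parsed in a few lines. The sketch below assumes the usual OpenAI-style `data: <json>` lines and the `[DONE]` sentinel; `iter_sse_json`, `collect_text`, and the sample payloads are illustrative, not captured server output.

```python
import json
from typing import Any, Iterable, Iterator


def iter_sse_json(lines: Iterable[str]) -> Iterator[dict[str, Any]]:
    """Yield decoded JSON chunks from data-only SSE lines until [DONE]."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separator lines carry no payload
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # sentinel: the stream is finished
        yield json.loads(payload)


def collect_text(chunks: Iterable[dict[str, Any]]) -> str:
    """Join delta.content across chunks, treating missing text as empty."""
    return "".join(
        c["choices"][0]["delta"].get("content") or ""
        for c in chunks
        if c.get("choices")
    )


SAMPLE_LINES = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "",
    'data: {"choices": [{"delta": {}, "finish_reason": "stop"}]}',
    "",
    "data: [DONE]",
]

print(collect_text(iter_sse_json(SAMPLE_LINES)))  # Hello
```

The SDK does this work for you; the sketch only shows why the opening role chunk and the final `finish_reason` chunk contribute no text.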

**Mid-stream behavior:** retries (see below) cover only the initial connect and status-line handshake. Once tokens are flowing, a mid-stream connection drop raises immediately rather than silently resuming. Nothing is sent until you start iterating, and the connection stays open for the life of the loop.

## Async

`AsyncPareta` mirrors the sync client. `create(...)` is `async def`. For streaming you `await` the call once, then `async for` over the chunks.

```python
import asyncio
from pareta import AsyncPareta

async def main():
    async with AsyncPareta.from_env() as pa:
        # Non-streaming: await returns a ChatCompletion
        resp = await pa.chat.completions.create(
            model="ep_invoice_xtract",
            messages=[{"role": "user", "content": "What is the invoice total?"}],
        )
        print(resp.choices[0].message.content)

        # Streaming: await once, then async-for the chunks
        stream = await pa.chat.completions.create(
            model="ep_invoice_xtract",
            messages=[{"role": "user", "content": "Stream me a haiku about ledgers."}],
            stream=True,
        )
        async for chunk in stream:
            print(chunk.choices[0].delta.content or "", end="", flush=True)
        print()

asyncio.run(main())
```

The async return type is `ChatCompletion | AsyncIterator[ChatCompletionChunk]`. The async client also exposes `aclose()` and works as an `async with` context manager.
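
Accumulating an async stream follows the same pattern as the sync one-liner. The sketch below uses a fake async generator in place of a real stream so it runs without the SDK or network; `fake_stream` and `collect_async` are hypothetical names, and the chunk objects are `SimpleNamespace` stand-ins for `ChatCompletionChunk`.

```python
import asyncio
from types import SimpleNamespace
from typing import AsyncIterator, Optional, Sequence


async def fake_stream(parts: Sequence[Optional[str]]) -> AsyncIterator[SimpleNamespace]:
    """Yield chunk-shaped objects, one per part (None means no text)."""
    for part in parts:
        await asyncio.sleep(0)  # hand control back to the event loop
        delta = SimpleNamespace(content=part)
        yield SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


async def collect_async(stream: AsyncIterator[SimpleNamespace]) -> str:
    """Join delta.content across an async stream, guarding None."""
    pieces = []
    async for chunk in stream:
        pieces.append(chunk.choices[0].delta.content or "")
    return "".join(pieces)


if __name__ == "__main__":
    print(asyncio.run(collect_async(fake_stream([None, "led", "ger", None]))))
```

With the real client you would pass the object returned by `await pa.chat.completions.create(..., stream=True)` to a helper like `collect_async`.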

## Metering

Every successful completion (streaming or not) debits your org balance. The `chat.completions` surface does not return a per-call cost field; spend is summarized per endpoint via [`endpoints.metrics(...).cost(...)`](./endpoints.md). If the org balance is empty, the call raises `InsufficientCreditsError` (402, a subclass of `ParetaError`). Top up in the dashboard; billing is browser-only and the SDK has no balance or payment surface.

## Errors

`create(...)` raises specific subclasses of `ParetaError`. A single `except ParetaError` is a fine catch-all; the named classes let you branch. The two cases specific to running inference are an empty balance and a not-ready endpoint:

```python
from pareta import (
    Pareta,
    InsufficientCreditsError,   # 402: org balance empty
    EndpointNotReadyError,      # 503: endpoint stopped / cold / provider down
)

with Pareta.from_env() as pa:
    try:
        resp = pa.chat.completions.create(
            model="ep_invoice_xtract",
            messages=[{"role": "user", "content": "Hello"}],
        )
        print(resp.choices[0].message.content)
    except InsufficientCreditsError:
        # Balance hit zero. Top up in the dashboard (billing is browser-only).
        print("Out of credit — top up in the dashboard, then retry.")
    except EndpointNotReadyError:
        # The endpoint is stopped or cold-starting. Start it, wait for live, retry.
        pa.endpoints.start("ep_invoice_xtract")
        print("Endpoint was not ready — started it; retry shortly.")
```

| Raised | Status | When |
|--------|--------|------|
| `ValueError` | — | `model` falsy or `messages` empty (SDK-side, before sending) |
| `BadRequestError` | 400 / 422 | Malformed request or unsupported passthrough field |
| `AuthenticationError` | 401 | Invalid or missing API key |
| `InsufficientCreditsError` | 402 | Org balance empty |
| `PermissionDeniedError` | 403 | Caller lacks permission for the endpoint |
| `NotFoundError` | 404 | `model` is not a callable endpoint or model id |
| `RateLimitError` | 429 | Rate limited (after retries) |
| `EndpointNotReadyError` | 503 | Endpoint stopped, cold-starting, or provider down |
| `APITimeoutError` | — | No response within the client timeout (after retries) |
| `APIConnectionError` | — | DNS, TCP, or TLS failure |

`APIStatusError` subclasses expose `status_code`, `detail`, `request_id` (the `x-request-id` header), and the raw `response` for debugging. See [Errors](exceptions.md) for the full hierarchy.

### Retries

Transient failures (408, 409, 429, 500, 502, 503, 504) are retried automatically with exponential backoff and jitter, up to `max_retries` times (default 2), honoring a `Retry-After` header when present. You only see `RateLimitError`, `EndpointNotReadyError`, or `APITimeoutError` after retries are exhausted. For streaming, retries cover only the initial handshake, never a mid-stream drop.

## Using the OpenAI SDK instead

Because the endpoint is OpenAI-compatible, you do not need this SDK to call it. Point the `openai` client at Pareta's base URL with your `pareta_sk_` key. Note the `/v1` suffix the OpenAI client expects:

```python
from openai import OpenAI

client = OpenAI(api_key="pareta_sk_...", base_url="https://api.pareta.ai/v1")

resp = client.chat.completions.create(
    model="ep_invoice_xtract",
    messages=[{"role": "user", "content": "What is the invoice total?"}],
)
print(resp.choices[0].message.content)
```

Streaming, `temperature`, `max_tokens`, and the rest work exactly as they do against OpenAI. Metering still applies; a zero balance returns a 402, which the `openai` client surfaces as its own status error. Reach for the Pareta SDK when you want the control plane: [deploying endpoints](./endpoints.md), [discovering tasks](./tasks.md), and [running evals](./evals.md).

## See also

- [`models`](./models.md) — list the callable endpoint ids you can pass as `model`.
- [`endpoints`](./endpoints.md) — deploy, start, stop, and read cost metrics for the endpoint you infer against.
- [`tasks`](./tasks.md) — discover tasks and the recommended deployable model for each.
- [`evals`](./evals.md) — evaluate candidate models on your own data before deploying.
- [Errors](exceptions.md) — the full exception hierarchy and retry policy.



---

<!-- reference/models.md -->

# models

`pa.models` lists the models you can call right now. It is the OpenAI-compatible model index: `GET /v1/models`, returning only your deployed, URL-bearing endpoints. Use it to discover the ids you pass to [`chat.completions.create(model=...)`](../guide/inference.md).

This is the inference-time view of your fleet. It deliberately shows less than [`endpoints.list()`](../guide/deploying-endpoints.md): only live, callable endpoints, and only the three fields the OpenAI `/v1/models` contract defines. When you want lifecycle and operations (deploy, start, stop, metrics), use the [endpoints](../guide/deploying-endpoints.md) namespace instead.

Two platform truths show up here:

- **Models are per-task aliases.** A `Model.id` is a callable endpoint id; the underlying open-weights model id never reaches you. The backend resolves it. You never see or pick a GPU.
- **Calling a model is metered.** Listing is free, but each completion against an id from this list debits your org balance. An empty balance raises `InsufficientCreditsError` (402) at call time. Top-up is browser-only.

## list

```python
def list(self) -> ModelList
```

**Route:** `GET /v1/models`

Returns a [`ModelList`](#modellist) of every deployed endpoint that has a live inference URL. Endpoints that are stopped, cold, or still deploying are omitted, because they have no `url` and so cannot be called. There are no parameters and no pagination.

```python
from pareta import Pareta

with Pareta.from_env() as pa:          # reads PARETA_API_KEY (+ optional PARETA_BASE_URL)
    models = pa.models.list()          # ModelList

    print(len(models), "callable models")
    for m in models:                   # ModelList is directly iterable
        print(m.id, "·", m.owned_by)
```

`m.id` is exactly what you feed to inference. Listing and calling compose directly:

```python
with Pareta.from_env() as pa:
    models = pa.models.list()
    if len(models) == 0:
        raise SystemExit("No live endpoints. Deploy one first: pa.endpoints.deploy(task=...)")

    first = models.data[0]             # a Model
    resp = pa.chat.completions.create(
        model=first.id,                # the callable endpoint id
        messages=[{"role": "user", "content": "What is the invoice total?"}],
    )
    print(resp.choices[0].message.content)
```

### Async

`AsyncModels.list` is the same call, awaited:

```python
import asyncio
from pareta import AsyncPareta

async def main():
    async with AsyncPareta.from_env() as pa:
        models = await pa.models.list()
        for m in models:
            print(m.id, m.owned_by)

asyncio.run(main())
```

## ModelList

The return value of `list()`. It wraps the raw `{"data": [...]}` payload and behaves like a lightweight collection.

| Member | Type | Description |
| --- | --- | --- |
| `data` | `list[Model]` | The deployed, callable models. |
| `__iter__()` | `Iterable[Model]` | Iterate models directly: `for m in models`. |
| `__len__()` | `int` | Number of callable models: `len(models)`. |

```python
models = pa.models.list()

len(models)          # int: how many endpoints are live and callable
models.data          # list[Model]: the underlying list
list(models)         # same elements, via __iter__
[m.id for m in models]
```

`ModelList` is not indexable directly. To grab one element, go through `.data` (`models.data[0]`) or iterate.

Like every Pareta response object, it keeps the raw server JSON. Reach anything not surfaced as a property with `models.to_dict()` or `models["data"]`.
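
The collection behavior described above (iterable, sized, keyed access to the raw JSON, but no positional indexing) can be modeled in a few lines. This is a sketch of one way such a wrapper could work, not the SDK's classes: `ModelRecord` and `ModelListSketch` are hypothetical, and raising `KeyError` on an integer index is a property of this sketch only.

```python
from typing import Any, Iterator


class ModelRecord:
    """Typed view over one raw model dict."""

    def __init__(self, raw: dict[str, Any]) -> None:
        self._raw = raw

    @property
    def id(self) -> Any:
        return self._raw.get("id")

    def to_dict(self) -> dict[str, Any]:
        return dict(self._raw)


class ModelListSketch:
    """Wraps a raw {"data": [...]} payload without discarding anything."""

    def __init__(self, raw: dict[str, Any]) -> None:
        self._raw = raw

    @property
    def data(self) -> list[ModelRecord]:
        return [ModelRecord(m) for m in self._raw.get("data", [])]

    def __iter__(self) -> Iterator[ModelRecord]:
        return iter(self.data)

    def __len__(self) -> int:
        return len(self._raw.get("data", []))

    def __getitem__(self, key: str) -> Any:
        # Keyed access reaches the raw JSON; an integer is not a valid key.
        return self._raw[key]

    def to_dict(self) -> dict[str, Any]:
        return dict(self._raw)
```

In this model `models["data"]` returns the raw list of dicts, while `models.data` returns typed records, which mirrors the split between the typed layer and the untouched server JSON.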

## Model

One element of `ModelList.data`. It is the OpenAI-compatible model record, so it carries only three fields.

| Property | Type | Description |
| --- | --- | --- |
| `id` | `str \| None` | The endpoint id. Pass it as `chat.completions.create(model=...)`. |
| `owned_by` | `str \| None` | `"pareta"` for your deployed open-weights endpoints, or a vendor name. |
| `created` | `int \| None` | Unix timestamp (seconds) when the endpoint was created. |

```python
for m in pa.models.list():
    print(m.id)         # str | None: usable as the `model` arg in inference
    print(m.owned_by)   # str | None: "pareta" or a vendor name
    print(m.created)    # int | None: Unix seconds

    m.to_dict()         # full raw record, nothing lost behind the typed layer
```

`Model.id` is a callable endpoint id, not the real open-weights model id. That is by design: the underlying model id never crosses into the SDK. You deploy with a task and an alias and you call with the resulting endpoint id; hardware is resolved for you. See [Core concepts](../guide/core-concepts.md) for the aliasing and GPU-hiding model.

## How this differs from `endpoints.list()`

Both list your fleet, but they answer different questions.

| | `models.list()` | `endpoints.list()` |
| --- | --- | --- |
| Returns | `ModelList` of `Model` | `list[Endpoint]` |
| Includes | Only live, url-bearing endpoints | All endpoints the org can access (any status) |
| Fields | `id`, `owned_by`, `created` | `id`, `name`, `model`, `status`, `task`, `url`, `is_live` |
| Use for | "What can I call right now?" | Deploy, start, stop, delete, inspect, metrics |
| Shape | OpenAI-compatible | Pareta-native |

If `models.list()` returns fewer entries than you expect, an endpoint is probably not live. Check its status with [`endpoints.list()`](../guide/deploying-endpoints.md) or `endpoints.retrieve(endpoint_id)`, and call `endpoints.start(endpoint_id)` if it is stopped.
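
To see which endpoints are missing from the callable list, you can diff the two listings. This is a minimal sketch, assuming (as the `Model` table states) that `Model.id` is the endpoint id; `not_callable` is a hypothetical helper name.

```python
def not_callable(endpoints, models):
    """Return (id, status) pairs for endpoints absent from the models listing.

    Assumes Model.id equals Endpoint.id. Hypothetical helper, not an SDK method.
    """
    callable_ids = {m.id for m in models}
    return [(ep.id, ep.status) for ep in endpoints if ep.id not in callable_ids]
```

Calling `not_callable(pa.endpoints.list(), pa.models.list())` would then list the stopped or still-starting endpoints along with their status.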

## Errors

`list()` makes a plain authenticated GET, so the failure modes are the standard ones. A bad or missing key raises `AuthenticationError` (401); transient 429/5xx and connection timeouts are retried automatically (`max_retries`, default 2) before surfacing as `RateLimitError`, `APIStatusError`, or `APITimeoutError`. All inherit from `ParetaError`.

```python
from pareta import Pareta, AuthenticationError, ParetaError

try:
    with Pareta.from_env() as pa:
        models = pa.models.list()
except AuthenticationError:
    print("Check PARETA_API_KEY (it should start with pareta_sk_).")
except ParetaError as e:
    print("Listing failed:", e)
```

`InsufficientCreditsError` (402) does **not** fire here. Listing is free; metering happens when you call a model. See [Errors and retries](../guide/errors-and-retries.md) for the full hierarchy.

## See also

- [Running inference](../guide/inference.md) — pass a `Model.id` to `chat.completions.create`.
- [Deploying endpoints](../guide/deploying-endpoints.md) — create the endpoints that show up here.
- [Discovering tasks](../guide/discovery.md) — find a task and its recommended model before you deploy.
- [Core concepts](../guide/core-concepts.md) — per-task aliases, hidden GPUs, and org-balance metering.



---

<!-- reference/endpoints.md -->

# `endpoints`

`client.endpoints` is the control plane for serving open-weights models. Hand it a task and a model; it deploys an OpenAI-compatible inference endpoint, hands you back a live `Endpoint`, and lets you start, stop, delete, and measure it from code. There is no infrastructure to reason about.

Three platform truths shape this whole namespace:

- **GPUs are hidden.** `deploy()` takes a task and a model, never a hardware setting. There is no GPU, tensor-parallel, quantization, or run-mode knob. Pareta resolves the serving class from its registry.
- **Models are per-task aliases.** The `model` you deploy and the `Endpoint.model` you read back are public per-task aliases (`{family}-{rank}`), not raw open-weights ids. Real ids never cross into the SDK.
- **Inference is metered against your org balance.** Once an endpoint is live, every [`chat.completions.create()`](chat.md) debits your org balance and raises `InsufficientCreditsError` (402) on an empty balance. Top-up is browser-only; the SDK never exposes balance or payment methods.

```python
from pareta import Pareta

pa = Pareta.from_env()   # reads PARETA_API_KEY (+ optional PARETA_BASE_URL)
```

## `deploy()`

```python
endpoints.deploy(
    *,
    task: str,
    model: str = "recommended",
    name: str | None = None,
    wait: bool = False,
    **extra,
) -> Iterator[dict] | Endpoint
```

**Route:** `POST /v1/endpoints` (named-event SSE)

Deploys a model for a task and brings the endpoint live.

- `task` (required) is a catalog subtask id, e.g. `"contract-key-fields"`. Discover one with [`tasks.match()` / `tasks.list()`](tasks.md). Passing an empty `task` raises `ValueError` before any request goes out.
- `model` is a per-task public alias, an explicit real-id-equivalent alias, or `"recommended"` (the default — the task's curated pick, else the leaderboard's top open model). To see what `"recommended"` resolves to before you deploy, read [`pa.tasks.recommended(task)`](tasks.md).
- `name` is optional. Leave it off and Pareta names the endpoint for you.
- `wait` controls the return type (see below).
- `**extra` is passed straight to the backend (e.g. `cost_per_request_micro_usd`, `frontier_cost_per_request_micro_usd`, `region`, `provider`, `quality`, `run_mode`, `taskDisplay`). You never pass hardware.

The return type depends entirely on `wait`.

### `wait=True` — block and get the live `Endpoint`

The simplest path. `deploy(wait=True)` consumes the deploy stream internally, blocks until the endpoint is live, and returns the `Endpoint`. If the deploy emits an `"error"` event (or the stream ends without a `"complete"` event), it raises `ParetaError`.

```python
ep = pa.endpoints.deploy(
    task="contract-key-fields",
    model="recommended",   # Pareta picks the task's best open model
    wait=True,
)

assert ep.is_live              # status == "live"
print(ep.id)                   # pass this to chat.completions.create(model=…)
print(ep.model)                # per-task public alias that got deployed
print(ep.url)                  # OpenAI-compatible inference URL

# Use it immediately — metered against your org balance.
resp = pa.chat.completions.create(
    model=ep.id,
    messages=[{"role": "user", "content": "Extract the parties and effective date."}],
)
print(resp.choices[0].message.content)
```

### `wait=False` — stream deploy progress (default)

With `wait=False` (the default), `deploy()` returns an iterator of named progress events so you can render a deploy UI or log stages. Each event is a `{"event": str, "data": dict}` dict.

```python
endpoint = None
for ev in pa.endpoints.deploy(task="contract-key-fields"):
    if ev["event"] == "progress":
        # data carries the deploy stage status, e.g. {"stage": "pulling weights", "pct": 45}
        print("progress:", ev["data"])
    elif ev["event"] == "complete":
        endpoint = ev["data"]["endpoint"]   # the live endpoint payload (dict)
        print("live:", endpoint)
    elif ev["event"] == "error":
        # wait=True raises ParetaError for you; with wait=False you handle it.
        raise RuntimeError(ev["data"].get("message", "deploy failed"))
```

The terminal event is `"complete"` (its `data.endpoint` is the live endpoint) or `"error"`. Pick `wait=True` for scripts and notebooks; pick `wait=False` only when you want to surface live progress.
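
If you stream with `wait=False` but still want a single result at the end, you can fold the events yourself. The sketch below relies only on the event shapes described above (`"progress"`, `"complete"`, `"error"`, each with a `data` dict). `consume_deploy` and `on_progress` are hypothetical names, and it raises `RuntimeError` rather than the SDK's `ParetaError` so that it stays import-free.

```python
def consume_deploy(events, on_progress=None):
    """Drain a deploy event stream and return the live endpoint payload (dict).

    Hypothetical helper: raises RuntimeError on an "error" event or when the
    stream ends without a "complete" event.
    """
    for ev in events:
        kind = ev["event"]
        data = ev.get("data") or {}
        if kind == "progress":
            if on_progress is not None:
                on_progress(data)          # e.g. print or update a progress bar
        elif kind == "complete":
            return data["endpoint"]        # terminal: the live endpoint payload
        elif kind == "error":
            raise RuntimeError(data.get("message", "deploy failed"))
    # The iterator was exhausted without a terminal event.
    raise RuntimeError("deploy stream ended without a 'complete' event")
```

Usage would look like `endpoint = consume_deploy(pa.endpoints.deploy(task="contract-key-fields"), on_progress=print)`.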

## `list()`

```python
endpoints.list() -> list[Endpoint]
```

**Route:** `GET /v1/endpoints`

Returns every endpoint your org can access, in any status.

```python
for ep in pa.endpoints.list():
    print(ep.id, ep.status, ep.task, ep.model)
```

For the OpenAI-compatible subset — only deployed, url-bearing endpoints, shaped as `Model` objects — use [`pa.models.list()`](models.md) instead.

## `retrieve()`

```python
endpoints.retrieve(endpoint_id: str) -> Endpoint
```

**Route:** `GET /v1/endpoints/{endpoint_id}`

Fetches one endpoint by id. Raises `NotFoundError` (404) for an unknown id.

```python
ep = pa.endpoints.retrieve("ep_a1b2c3")
print(ep.is_live, ep.url)
```

## `start()` / `stop()`

```python
endpoints.start(endpoint_id: str)   # POST /v1/endpoints/{id}/start
endpoints.stop(endpoint_id: str)    # POST /v1/endpoints/{id}/stop
```

A stopped endpoint costs nothing to keep but cannot serve. `stop()` pauses spend; `start()` resumes a stopped endpoint.

```python
pa.endpoints.stop("ep_a1b2c3")      # pause a live endpoint
pa.endpoints.start("ep_a1b2c3")     # resume a stopped one
```

While an endpoint is stopped or still cold, inference calls against it raise `EndpointNotReadyError` (503). After `start()`, poll `retrieve(id).is_live` before sending traffic.

```python
import time

pa.endpoints.start("ep_a1b2c3")
while not pa.endpoints.retrieve("ep_a1b2c3").is_live:
    time.sleep(3)
```
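
The loop above polls forever if the endpoint never comes up. A bounded variant is sketched below. `wait_until_live`, its parameters, and the choice of `TimeoutError` are assumptions for illustration, not SDK API. It takes the `retrieve` callable as an argument so it can be handed `pa.endpoints.retrieve`.

```python
import time

def wait_until_live(retrieve, endpoint_id, timeout=300.0, interval=3.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll `retrieve(endpoint_id)` until `.is_live`, or raise TimeoutError.

    Hypothetical helper; `sleep` and `clock` are injectable for testing.
    """
    deadline = clock() + timeout
    while True:
        ep = retrieve(endpoint_id)
        if ep.is_live:
            return ep
        if clock() >= deadline:
            raise TimeoutError(f"{endpoint_id} not live after {timeout}s")
        sleep(interval)
```

After `pa.endpoints.start("ep_a1b2c3")`, a call such as `wait_until_live(pa.endpoints.retrieve, "ep_a1b2c3")` would return the live `Endpoint` or give up after five minutes.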

## `delete()`

```python
endpoints.delete(endpoint_id: str) -> None
```

**Route:** `DELETE /v1/endpoints/{endpoint_id}`

Removes an endpoint for good. Returns `None`.

```python
pa.endpoints.delete("ep_a1b2c3")
```

## `metrics()`

```python
endpoints.metrics(endpoint_id: str) -> Metrics
```

Returns a `Metrics` handle for querying one endpoint's observability dimensions. This call does not hit the network — each dimension method does.

```python
class Metrics:
    def performance(self, **params) -> dict: ...   # GET /v1/endpoints/{id}/performance
    def uptime(self, **params) -> dict: ...        # GET /v1/endpoints/{id}/uptime
    def cost(self, **params) -> dict: ...          # GET /v1/endpoints/{id}/cost
    def quality(self, **params) -> dict: ...       # GET /v1/endpoints/{id}/quality
    def activity(self, **params) -> dict: ...      # GET /v1/endpoints/{id}/activity
```

Each method returns raw metric JSON and accepts arbitrary query params as keyword arguments, passed straight through to the query string. Shapes vary by dimension; typed models are not available yet and are slated to arrive with the OpenAPI generation.

```python
m = pa.endpoints.metrics("ep_a1b2c3")

m.performance()   # p50/p95/p99 latency
m.uptime()        # availability
m.cost()          # per-endpoint spend + vs-frontier savings
m.quality()       # judge windows
m.activity()      # usage stats

# Params pass through to the query string:
m.performance(window="24h")
m.cost(group_by="day")
```

| Method | Returns |
|---|---|
| `performance(**params)` | p50/p95/p99 latency |
| `uptime(**params)` | availability metrics |
| `cost(**params)` | per-endpoint spend and savings versus the frontier baseline |
| `quality(**params)` | judge-window quality scores |
| `activity(**params)` | usage stats |

`metrics(id).cost()` is per-endpoint **observability** — it reports what this endpoint spent and how much you saved against a frontier vendor. It is not your account balance. Balance and top-up live in the dashboard only.
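
To capture all five dimensions in one pass (for a log line or a dashboard tile), you can call each method on the handle in turn. This is a minimal sketch: `metrics_snapshot` is a hypothetical helper, and it assumes every dimension accepts the same pass-through keyword arguments, which this page does not guarantee.

```python
DIMENSIONS = ("performance", "uptime", "cost", "quality", "activity")

def metrics_snapshot(metrics, **params):
    """Call every dimension method on a Metrics handle.

    Returns {dimension_name: raw_json_dict}. Each entry costs one network
    request. Hypothetical helper, not an SDK method.
    """
    return {name: getattr(metrics, name)(**params) for name in DIMENSIONS}
```

For example, `metrics_snapshot(pa.endpoints.metrics("ep_a1b2c3"), window="24h")` would issue five GETs and return one dict keyed by dimension.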

## The `Endpoint` object

Returned by `deploy(wait=True)`, `retrieve()`, and each element of `list()`.

| Field | Type | Meaning |
|---|---|---|
| `id` | `str \| None` | Endpoint id (== name) — pass as [`chat.completions.create(model=…)`](chat.md) |
| `name` | `str \| None` | Display name |
| `model` | `str \| None` | Per-task public alias serving here |
| `status` | `str \| None` | `"live"`, `"starting"`, `"stopped"`, … |
| `task` | `str \| None` | Task name |
| `url` | `str \| None` | OpenAI-compatible inference URL |
| `is_live` | `bool` | `status == "live"` |

Every `Endpoint` keeps the full server record on `to_dict()`, so nothing the API returns is lost behind the typed fields.

## End to end

Discover a task, deploy its recommended model, serve traffic, check cost, then tear down.

```python
from pareta import Pareta

with Pareta.from_env() as pa:
    # 1. Find a task for your intent.
    match = pa.tasks.match("pull key fields out of contracts")
    task_id = match.chosen.task_id          # e.g. "contract-key-fields"

    # 2. Deploy the recommended open model (no GPU knob).
    ep = pa.endpoints.deploy(task=task_id, model="recommended", wait=True)
    print(f"live: {ep.id} serving {ep.model}")

    # 3. Run metered inference (debits the org balance).
    resp = pa.chat.completions.create(
        model=ep.id,
        messages=[{"role": "user", "content": "Extract the governing-law clause."}],
    )
    print(resp.choices[0].message.content)

    # 4. Check what it cost and how it performed.
    print(pa.endpoints.metrics(ep.id).cost())

    # 5. Stop it to pause spend (or delete it to remove it).
    pa.endpoints.stop(ep.id)
```

## Async

`AsyncPareta` mirrors the sync surface. `deploy()`, `list()`, `retrieve()`, `start()`, `stop()`, and `delete()` are `async def`. `metrics(id)` returns an `AsyncMetrics` handle synchronously (it is not a coroutine), and its dimension methods are awaitable.

```python
import asyncio
from pareta import AsyncPareta

async def main():
    async with AsyncPareta.from_env() as pa:
        # wait=True awaits the deploy and returns the live Endpoint.
        ep = await pa.endpoints.deploy(task="contract-key-fields", wait=True)

        # wait=False returns an async progress-event iterator.
        async for ev in await pa.endpoints.deploy(task="contract-key-fields"):
            if ev["event"] == "complete":
                print("live:", ev["data"]["endpoint"])

        m = pa.endpoints.metrics(ep.id)         # NOT awaited: returns the handle
        print(await m.performance())            # the dimension call IS awaited

        await pa.endpoints.stop(ep.id)

asyncio.run(main())
```

## Errors

| Exception | Status | When |
|---|---|---|
| `BadRequestError` | 400 / 422 | Unknown task, malformed deploy params |
| `InsufficientCreditsError` | 402 | Org balance empty when you run inference against the endpoint |
| `NotFoundError` | 404 | Unknown `endpoint_id` |
| `ConflictError` | 409 | Seed/legacy endpoint conflict, or transient contention |
| `EndpointNotReadyError` | 503 | Endpoint stopped, cold-starting, or provider down |
| `ParetaError` | — | A deploy stream emitted an `"error"` event, or ended without `"complete"` |

`deploy(task="")` raises `ValueError` before any request leaves the process.

## Related

- [`tasks`](tasks.md) — find a `task` id and read its recommended model before you deploy.
- [`chat`](chat.md) — call `chat.completions.create()` against a deployed endpoint id.
- [`models`](models.md) — list the OpenAI-compatible subset of deployed endpoints.
- [`evals`](evals.md) — pick the right open model for a task before you deploy it.
- [Deploying & operating endpoints](../guide/deploying-endpoints.md) — the narrative guide.
- [Errors & retries](../guide/errors-and-retries.md) — the exception hierarchy and retry policy.



---

<!-- reference/tasks.md -->

# tasks

`client.tasks` is the catalog layer. Before you deploy or evaluate anything you
need two things: a **task** (which benchmark you are solving) and a **model**
(which model to deploy or measure). `tasks` resolves both for you:

- `list` / `retrieve` browse the benchmark catalog and a task's schema.
- `match` turns a plain-English description of your job into ranked candidate
  tasks.
- `leaderboard` / `recommended` rank the models scored on a task and hand you the
  deployable pick.

Two platform facts run through everything here:

- **Models are per-task aliases.** Leaderboard rows, the `recommended` pick, and
  eval result `model_id`s are public aliases like `qwen-1` or `recommended`,
  never the underlying open-weights ids. You pass the alias straight back into
  [`endpoints.deploy(model=...)`](./endpoints.md) or
  [`evals.runs.create(models=[...])`](./evals.md), and Pareta resolves the real
  model and the hardware. There is no GPU, quantization, or run-mode knob
  anywhere in this flow.
- **Catalog reads are free.** `list`, `retrieve`, `match`, `leaderboard`, and
  `recommended` are not metered. The meter only starts when you run compute
  (inference and eval runs), which is debited against your org balance. See
  [Errors](exceptions.md) for `InsufficientCreditsError`.

All snippets assume:

```python
from pareta import Pareta

pa = Pareta.from_env()   # reads PARETA_API_KEY (and optional PARETA_BASE_URL)
```

---

## tasks.match

```python
def match(self, query: str, *, top_k: int = 5) -> TaskMatch
```

**Route:** `POST /v1/tasks/match`

Turns a free-text description of your intent into ranked candidate tasks. The
matcher is a deterministic keyword scorer (with a semantic backstop on the
backend), so the same query returns the same ranking.

- `query` (required): free-text intent. Raises `ValueError` if empty or
  whitespace-only.
- `top_k` (default `5`): how many ranked candidates to return.

Returns a [`TaskMatch`](#taskmatch).

```python
match = pa.tasks.match("pull line items and totals out of vendor invoices")

if match.matched:
    task_id = match.chosen.task_id          # the best task
    print(f"matched {task_id} via {match.matcher} "
          f"(confidence={match.chosen.confidence})")
else:
    # No high-confidence hit: show the ranked alternates instead of guessing.
    for cand in match.candidates:
        print(f"  {cand.task_id}  score={cand.score:.2f}  {cand.confidence}")
```

A robust pattern handles both the no-match and the ambiguous cases rather than
blindly trusting `chosen`:

```python
match = pa.tasks.match("classify support tickets by urgency")

if not match.matched:
    raise SystemExit(f"no task matched; closest: "
                     f"{[c.task_id for c in match.candidates]}")
if match.ambiguous:
    # Top two scores are close: a good moment to ask the user to disambiguate.
    print("ambiguous, top candidates:",
          [(c.task_id, round(c.score or 0, 2)) for c in match.candidates[:2]])

task_id = match.chosen.task_id
```

---

## tasks.retrieve

```python
def retrieve(self, task_id: str, *, examples_n: int | None = None) -> Task
```

**Route:** `GET /v1/tasks/{task_id}`

Fetches a single task's schema. The field that matters most is `has_blob_input`:
`True` means the task takes documents or images (PDFs, scans), which determines
how you build eval sets and which frontier models can run it (vision-capable
only).

- `examples_n` (optional): request N example items from the task. The typed layer
  surfaces `id`, `default_scorer`, and `has_blob_input`; reach the examples
  through the raw record with `task.to_dict()`.

Returns a [`Task`](#task).

```python
task = pa.tasks.retrieve(task_id, examples_n=3)
print(task.id, task.default_scorer, "blob_input=", task.has_blob_input)

# examples come back on the raw record:
examples = task.to_dict().get("examples", [])
```

---

## tasks.list

```python
def list(self) -> list[Task]
```

**Route:** `GET /v1/tasks`

Returns every benchmark task in the catalog as a `list[Task]`. Use this to browse
when you do not have a free-text query to `match`.

```python
for task in pa.tasks.list():
    kind = "document" if task.has_blob_input else "text"
    print(f"  {task.id:<28} {kind:<10} scorer={task.default_scorer}")
```

---

## tasks.leaderboard

```python
def leaderboard(self, task_id: str) -> Leaderboard
```

**Route:** `GET /v1/tasks/{task_id}/leaderboard`

Returns the models scored on a task, ranked by quality, with each model's
per-request cost. Use it to choose between open models and to see exactly how far
below the frontier baseline each model's cost sits.

Returns a [`Leaderboard`](#leaderboard).

```python
board = pa.tasks.leaderboard(task_id)

print(f"metric={board.metric}  cost_unit={board.cost_unit}")
print(f"recommended: {board.recommended}")

for entry in board.models:
    cost = entry.cost_per_request_micro_usd or 0
    print(f"  {entry.name:<16} {entry.kind:<8} "
          f"quality={entry.quality:.3f}  "
          f"${cost / 1_000_000:.6f}/req  ctx={entry.context_k}k")

if board.frontier:
    f = board.frontier
    print(f"frontier baseline: {f.name}  quality={f.quality:.3f}  "
          f"${(f.cost_per_request_micro_usd or 0) / 1_000_000:.6f}/req")
```

`board.recommended` is exactly what `endpoints.deploy(model="recommended")`
resolves to: the curated pick, or the top-ranked open model if there is no
curated one. Pass it straight to `deploy(model=...)`.

> **Cost is in micro-USD here, on purpose.** Per-request rates are sub-cent, so
> the leaderboard keeps the raw `cost_per_request_micro_usd` integer
> (1,000,000 micro-USD = $1.00). Flooring to whole cents, which is how billed
> **totals** like `run.cost` behave (see [evals](./evals.md)), would erase the
> open-vs-frontier comparison. Divide by 1,000,000 to display dollars.
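
A worked example of that arithmetic, using made-up rates (the two micro-USD values below are hypothetical, not catalog data). It uses `Decimal` so the conversion is exact, and shows how flooring to cents would flatten the comparison.

```python
from decimal import Decimal, ROUND_FLOOR

MICRO_PER_USD = Decimal(1_000_000)

def micro_usd_to_usd(micro_usd):
    """Exact dollars for an integer micro-USD amount (no rounding)."""
    return Decimal(micro_usd) / MICRO_PER_USD

def floor_to_cents(usd):
    """Floor a Decimal dollar amount to whole cents, as billed totals are."""
    return usd.quantize(Decimal("0.01"), rounding=ROUND_FLOOR)

open_cost = micro_usd_to_usd(1_200)        # hypothetical open-model rate
frontier_cost = micro_usd_to_usd(45_000)   # hypothetical frontier rate

print(open_cost, frontier_cost)                                   # 0.0012 0.045
print(floor_to_cents(open_cost), floor_to_cents(frontier_cost))   # 0.00 0.04
```

At full precision the open model is visibly cheaper per request; floored to cents it reads as free, which is why the leaderboard keeps the raw integer.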

---

## tasks.recommended

```python
def recommended(self, task_id: str) -> str | None
```

Convenience wrapper over `leaderboard(task_id).recommended`. Returns the
deployable model alias Pareta recommends for the task (or `None` if the task has
no ranked models yet).

```python
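model = pa.tasks.recommended(task_id)        # e.g. "qwen-1" or "recommended"
if model is None:                            # the task has no ranked models yet
    raise SystemExit(f"no recommended model for {task_id}")
ep = pa.endpoints.deploy(task=task_id, model=model, wait=True)
print(ep.id, ep.status)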
```

Passing `model="recommended"` to `deploy` does the same resolution server-side,
so `recommended` is mainly useful when you want to **see** the pick (log it, show
it, gate on it) before committing to a deploy.

> **Sync only, for now.** `leaderboard` and `recommended` live on the sync `Tasks`
> resource. `AsyncTasks` has `list`, `retrieve`, and `match`; the ranking methods
> are planned for async in a later release. From async code, either call them on a
> short-lived sync `Pareta` or run them in a thread.

---

## A full discovery pass

End to end: intent in, recommended open model plus the frontier gap out, ready to
hand to a deploy or an eval.

```python
from pareta import Pareta

pa = Pareta.from_env()

# 1. intent -> task
match = pa.tasks.match("extract key fields from contracts")
if not match.matched:
    raise SystemExit(f"no task matched: {[c.task_id for c in match.candidates]}")
task_id = match.chosen.task_id

# 2. inspect the task (document task? which scorer?)
task = pa.tasks.retrieve(task_id)
print(f"task={task.id}  scorer={task.default_scorer}  blob={task.has_blob_input}")

# 3. task -> recommended open model + the open-vs-frontier quality gap
board = pa.tasks.leaderboard(task_id)
pick = board.recommended
frontier_q = board.frontier.quality if board.frontier else None
print(f"recommend={pick}  frontier_quality={frontier_q}")

# now: deploy `pick` (endpoints.deploy), or eval it vs the frontier on your data.
```

From here you either deploy the recommended model
([endpoints](./endpoints.md)) or run it head to head against the frontier on your
own data ([evals](./evals.md)). To pick the vendor baselines to measure against,
see `evals.frontier_models` in [evals](./evals.md).

---

## Async

`AsyncTasks` mirrors the sync surface for `list`, `retrieve`, and `match`. Each is
`async def` and must be awaited:

```python
import asyncio
from pareta import AsyncPareta

async def main():
    async with AsyncPareta.from_env() as pa:
        match = await pa.tasks.match("extract key fields from contracts")
        if not match.matched:
            return
        task = await pa.tasks.retrieve(match.chosen.task_id)
        print(task.id, task.default_scorer, task.has_blob_input)

        catalog = await pa.tasks.list()
        print(f"{len(catalog)} tasks in the catalog")

asyncio.run(main())
```

`AsyncTasks` does **not** expose `leaderboard` or `recommended` yet. Call those on
a sync `Pareta` (they are quick, unmetered catalog reads, but the call still blocks
the event loop while it runs) or run them in a thread. See
the [async guide](../guide/async.md) for the full sync-vs-async story.
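
Running the sync call in a worker thread keeps the event loop free. A minimal sketch using the standard library's `asyncio.to_thread`: `recommended_async` is a hypothetical wrapper, and `sync_client` stands for a regular `Pareta` instance that you create and close yourself. Whether one sync client may be shared across threads is not covered here, so the sketch leaves client lifetime to the caller.

```python
import asyncio

async def recommended_async(sync_client, task_id):
    """Run the blocking `tasks.recommended` call in a worker thread.

    Hypothetical wrapper; returns whatever the sync method returns
    (a model alias string, or None).
    """
    return await asyncio.to_thread(sync_client.tasks.recommended, task_id)
```

Inside a coroutine this would be awaited as `pick = await recommended_async(sync_pa, task_id)`.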

---

## Response models

Every response object keeps the raw server JSON: call `.to_dict()` (or index it
like a dict) to reach any field the typed layer does not surface yet.

### Task

From `GET /v1/tasks` and `GET /v1/tasks/{id}`.

| Field | Type | Notes |
|---|---|---|
| `id` | `str \| None` | Task id, e.g. `"contract-key-fields"` |
| `default_scorer` | `str \| None` | The scorer used to grade outputs on this task |
| `has_blob_input` | `bool` | `True` if the task takes documents/images (vision tasks) |

### TaskMatch

From `POST /v1/tasks/match`.

| Field | Type | Notes |
|---|---|---|
| `query` | `str \| None` | The echoed query |
| `matched` | `bool` | A high-confidence task was found |
| `chosen` | `TaskMatchCandidate \| None` | The best candidate, or `None` if nothing cleared the bar |
| `candidates` | `list[TaskMatchCandidate]` | The top-`top_k` ranked alternates |
| `ambiguous` | `bool` | `True` when the top two scores are close |
| `matcher` | `str \| None` | Which matcher fired: `"keyword"` or `"semantic"` |

### TaskMatchCandidate

| Field | Type | Notes |
|---|---|---|
| `task_id` | `str \| None` | The candidate task id |
| `score` | `float \| None` | Match score in `[0, 1]` |
| `confidence` | `str \| None` | `"high"`, `"medium"`, or `"low"` |

### Leaderboard

From `GET /v1/tasks/{id}/leaderboard`.

| Field | Type | Notes |
|---|---|---|
| `task_id` | `str \| None` | The task this board ranks |
| `metric` | `str \| None` | What `quality` measures, e.g. `"quality"` |
| `cost_unit` | `str \| None` | Cost unit, e.g. `"per_request"` |
| `recommended` | `str \| None` | The deployable model alias to pass to `deploy(model=...)` |
| `models` | `list[LeaderboardEntry]` | The ranked entries |
| `frontier` | `LeaderboardEntry \| None` | The vendor baseline this task is measured against |

### LeaderboardEntry

| Field | Type | Notes |
|---|---|---|
| `name` | `str \| None` | Model name / alias |
| `kind` | `str \| None` | `"open"` or `"frontier"` |
| `quality` | `float \| None` | Quality score in `[0, 1]` |
| `cost_per_request_micro_usd` | `int \| None` | Raw unit cost in micro-USD (not floored) |
| `context_k` | `int \| None` | Context window in thousands of tokens |
| `run_mode` | `str \| None` | Backend-provided context (`"rte"` / `"twostage"`); not a user knob |
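
These fields are enough to apply your own selection rule instead of taking `recommended` as given. The sketch below picks the cheapest open entry whose quality is within a tolerance of the frontier baseline. `cheapest_near_frontier` and its `tolerance` parameter are hypothetical, and entries with a missing quality or cost are skipped.

```python
def cheapest_near_frontier(entries, frontier_quality, tolerance=0.02):
    """Cheapest open entry with quality >= frontier_quality - tolerance.

    Returns None when no open entry qualifies. Hypothetical helper that reads
    only the documented LeaderboardEntry fields.
    """
    eligible = [
        e for e in entries
        if e.kind == "open"
        and e.quality is not None
        and e.cost_per_request_micro_usd is not None
        and e.quality >= frontier_quality - tolerance
    ]
    return min(eligible, key=lambda e: e.cost_per_request_micro_usd, default=None)
```

With a `board` from `pa.tasks.leaderboard(task_id)` that has a frontier entry, `cheapest_near_frontier(board.models, board.frontier.quality)` would return the entry whose `name` you pass to `deploy(model=...)`.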

---

See also: [endpoints](./endpoints.md) · [evals](./evals.md) ·
[inference](../guide/inference.md) · [errors](exceptions.md) ·
[discovery guide](../guide/discovery.md)



---

<!-- reference/evals.md -->

# `evals`: evaluate models on your own data

`client.evals` runs the only benchmark that matters: how candidate models score on **your** rows. You hand Pareta a task and a list of labeled items, name a slate of open-weights candidates (and optionally frontier baselines to beat), and get back per-model quality with 95% confidence intervals and per-item cost. The platform scores everything with the task's scorer, runs the open candidates and the frontier baselines on the same items, and meters the compute against your org balance. No GPUs to size, no scorer to wire up, no judge to host.

The namespace has three parts:

- [`evals.sets`](#evalssets-evaluation-datasets): turn your rows into a reusable eval set (and attach documents for blob tasks).
- [`evals.runs`](#evalsruns-evaluation-runs): run candidates over a set and read aggregated results.
- [`evals.frontier_models`](#evalsfrontier_models-frontier-baseline-roster): list the vendor baselines you can evaluate against.

All examples use the synchronous `Pareta` client. Every method has an `async` twin with the same signature on `AsyncPareta`; see [Async](#async).

```python
from pareta import Pareta

pa = Pareta.from_env()  # reads PARETA_API_KEY (and optional PARETA_BASE_URL)
```

## The shape of an eval

1. Turn your rows into an **eval set** (`evals.sets.create`), or pass them inline to the run.
2. Kick off an **eval run** over a list of models (`evals.runs.create`), optionally blocking until it finishes.
3. Read `run.results` to compare quality and cost; read `run.cost` for the bill.

```python
run = pa.evals.runs.create(
    task="contract-key-fields",
    items=[
        {"input": "Effective as of January 1, 2026, ...", "expected": {"effective_date": "2026-01-01"}},
        {"input": "This Agreement terminates on 2027-12-31 ...", "expected": {"termination_date": "2027-12-31"}},
    ],
    models=["contract-kie-1", "contract-kie-2"],  # per-task open aliases
    frontier="benchmarked",                        # vendor baselines on this leaderboard
    wait=True,                                     # block until the run is terminal
)

print(run.status)             # "completed"
print(f"billed ${run.cost}")  # Decimal dollars, floored to cents

for r in run.results:
    print(f"{r.model_id:16} {r.kind:8} q={r.quality_mean:.3f} "
          f"[{r.quality_ci_low:.3f}, {r.quality_ci_high:.3f}]  "
          f"~{r.mean_cost_micro_usd} uUSD/item  ({r.n_succeeded} ok, {r.error_count} err)")
```

That single call created the eval set inline, started the run, polled it to completion, and returned an `EvalRun` with one aggregate per model. The sections below unpack each piece.

The ids in `models=` are **per-task public aliases** (`{family}-{rank}`), not raw model names; they come from a task's leaderboard. Frontier (vendor) ids are in the clear. See [`tasks`](./tasks.md) for where the aliases come from and [`models`](./models.md) for why real ids never cross the SDK boundary.
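
Once a run is terminal, choosing what to deploy comes down to comparing the aggregates. The sketch below finds the best open candidate and the best frontier baseline by `quality_mean` and reports whether their confidence intervals overlap. `summarize_results` is a hypothetical helper; it assumes `r.kind` takes the values `"open"` and `"frontier"` (as on the leaderboard), which this page does not state for eval results.

```python
def summarize_results(results):
    """Return (best_open, best_frontier, overlaps) from per-model aggregates.

    `overlaps` is True/False when both sides have CI bounds, else None.
    Hypothetical helper; assumes kind is "open" or "frontier".
    """
    scored = [r for r in results if r.quality_mean is not None]
    best_open = max((r for r in scored if r.kind == "open"),
                    key=lambda r: r.quality_mean, default=None)
    best_frontier = max((r for r in scored if r.kind == "frontier"),
                        key=lambda r: r.quality_mean, default=None)
    overlaps = None
    if best_open is not None and best_frontier is not None:
        bounds = (best_open.quality_ci_low, best_open.quality_ci_high,
                  best_frontier.quality_ci_low, best_frontier.quality_ci_high)
        if None not in bounds:
            # Two intervals overlap when each one starts before the other ends.
            overlaps = bounds[0] <= bounds[3] and bounds[2] <= bounds[1]
    return best_open, best_frontier, overlaps
```

As a rough reading, overlapping intervals suggest this sample does not clearly separate the two models, so the cheaper one may be the better deploy; disjoint intervals point to a real quality gap.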

## `evals.sets`: evaluation datasets

An eval set is your rows bound to a task, stored server-side and reusable across runs. Create one explicitly when you want to reuse it; otherwise pass `task=` + `items=` straight to `runs.create` (see [inline create](#inline-create)).

### `sets.create`

```python
def create(self, *, task: str, items: list[dict], name: str | None = None) -> EvalSet
```

`POST /v1/eval-sets`

- `task` (required): the task id. Carries the scorer and the input schema.
- `items` (required, non-empty): your evaluation rows. Each is a dict in the task's input schema; the SDK serializes them to JSONL on the wire. An empty list raises `ValueError` before any request goes out.
- `name` (optional): defaults to `f"sdk eval set ({len(items)} items)"`.

```python
eval_set = pa.evals.sets.create(
    task="contract-key-fields",
    items=[
        {"input": "Effective as of January 1, 2026, ...", "expected": {"effective_date": "2026-01-01"}},
        {"input": "This Agreement terminates on 2027-12-31 ...", "expected": {"termination_date": "2027-12-31"}},
    ],
    name="Q2 contracts sample",
)

print(eval_set.id)               # pass this to runs.create(eval_set=...)
print(eval_set.task_id)          # "contract-key-fields"
print(eval_set.item_count)       # 2
print(eval_set.scoring_strategy) # e.g. "extraction": how this task is scored
```

The exact row fields (`input`, `expected`, and any others) follow the task you chose. To inspect a task's schema and pull sample rows before formatting yours, use `pa.tasks.retrieve(task_id, examples_n=...)`. See [`tasks`](./tasks.md).

Returns an [`EvalSet`](#evalset).
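
For intuition about the wire format mentioned above: JSONL is one JSON object per line. The sketch below is not the SDK's serializer, only an illustration of the documented behavior (a non-empty `items` list, one row per line); `items_to_jsonl` is a hypothetical name.

```python
import json

def items_to_jsonl(items):
    """Serialize eval rows as JSONL: one JSON object per line.

    Illustrative only; mirrors the documented ValueError on an empty list.
    """
    if not items:
        raise ValueError("items must be a non-empty list")
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in items) + "\n"

rows = [
    {"input": "Effective as of January 1, 2026, ...", "expected": {"effective_date": "2026-01-01"}},
    {"input": "This Agreement terminates on 2027-12-31 ...", "expected": {"termination_date": "2027-12-31"}},
]
print(items_to_jsonl(rows), end="")
```

Previewing your rows this way can help catch values that are not JSON-serializable (dates, `Decimal`, custom objects) before you call `sets.create`.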

### `sets.list`

```python
def list(self) -> list[EvalSet]
```

`GET /v1/eval-sets`: every eval set the org can access.

```python
for s in pa.evals.sets.list():
    print(s.id, s.task_id, s.item_count, s.name)
```

### `sets.retrieve`

```python
def retrieve(self, eval_set_id: str) -> EvalSet
```

`GET /v1/eval-sets/{eval_set_id}`: one set by id.

```python
eval_set = pa.evals.sets.retrieve("evalset_abc123")
```

### `sets.delete`

```python
def delete(self, eval_set_id: str) -> None
```

`DELETE /v1/eval-sets/{eval_set_id}`: remove a set.

```python
pa.evals.sets.delete(eval_set.id)
```

### `sets.upload_document`

```python
def upload_document(
    self,
    eval_set_id: str,
    file,
    *,
    idx: int,
    field_name: str,
    mime: str | None = None,
) -> dict
```

Attaches a binary document (PDF, image, scan) to one row's blob field. Use this for tasks where `task.has_blob_input == True`: create the set with each row's labels (and a placeholder for the blob), then attach the file to that row by index.

- `eval_set_id`: the set to attach to.
- `file`: a path (`str` / `pathlib.Path`), raw `bytes`/`bytearray`, or any binary file-like object with `.read()`. Anything else raises `TypeError`.
- `idx` (required): 0-based row index.
- `field_name` (required): the blob input field on the task schema.
- `mime` (optional): MIME type; guessed from the filename when omitted, falling back to `application/octet-stream`.

```python
eval_set = pa.evals.sets.create(
    task="invoice-extraction",
    items=[
        {"expected": {"total": "1240.00", "vendor": "Katana ML"}},  # doc attached next
        {"expected": {"total": "89.50", "vendor": "Acme"}},
    ],
)

# Attach the PDF for row 0's `document` field.
pa.evals.sets.upload_document(
    eval_set.id,
    "invoices/katana-0001.pdf",  # path, bytes, or binary file-like
    idx=0,
    field_name="document",
)

# Bytes or a file handle work too; override the guessed MIME when needed.
with open("invoices/scan.tiff", "rb") as f:
    pa.evals.sets.upload_document(eval_set.id, f, idx=1, field_name="document", mime="image/tiff")
```

`upload_document` collapses the upload into one call. Files under 5 MiB go up inline via `attach-blob`; larger files mint a signed URL (`blob-upload-url`), stream straight to storage with a `PUT`, then confirm (`blob-upload-complete`). Either way it returns the completion endpoint's response dict. A failed storage `PUT` raises `ParetaError`.
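
The routing rule and the MIME fallback described above can be sketched as follows. This illustrates the documented behavior and is not the SDK's code: `pick_upload_route` and `guess_mime` are hypothetical names, and sending a file of exactly 5 MiB down the signed-URL path is an assumption, since the text only says files under 5 MiB go inline.

```python
import mimetypes

INLINE_LIMIT = 5 * 1024 * 1024  # 5 MiB, in bytes

def pick_upload_route(size_bytes):
    """Name the first backend step for a file of this size (illustrative)."""
    return "attach-blob" if size_bytes < INLINE_LIMIT else "blob-upload-url"

def guess_mime(filename, mime=None):
    """Explicit MIME wins; else guess from the filename; else octet-stream."""
    if mime:
        return mime
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or "application/octet-stream"
```

Under these assumptions a 2 MiB PDF would go up inline as `application/pdf`, while a 40 MiB scan would take the signed-URL path.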

Frontier baselines on document tasks are automatically vision-filtered, so you never accidentally score a scan against a text-only model.

## `evals.runs`: evaluation runs

A run evaluates a list of models over an eval set and returns per-model aggregates.

### `runs.create`

```python
def create(
    self,
    *,
    eval_set: str | None = None,
    task: str | None = None,
    items: list[dict] | None = None,
    models,
    frontier=None,
    name: str | None = None,
    wait: bool = False,
    poll_interval: float = 3.0,
    timeout: float = 900.0,
) -> EvalRun
```

`POST /v1/eval-runs`

You drive it one of two ways. Pass **`eval_set=<id>`** to run against an existing set, **or** pass **`task=...` + `items=...`** to create a set inline in the same call. Passing neither raises `ValueError`.

- `models` (required): the list of open candidate aliases to evaluate. Required even when `frontier` is set; an empty `models` with no frontier ids raises `ValueError`.
- `frontier` (default `None`): the vendor baselines to score alongside your candidates. Keyword or explicit list, [resolved SDK-side](#frontier-resolution).
- `name` (optional): run label; also used as the inline set's name.
- `wait` (default `False`): when `False`, returns as soon as the run is queued, with a non-terminal status such as `"running"`. When `True`, blocks via [`runs.wait`](#runswait) and returns the terminal run.
- `poll_interval` (default `3.0`): seconds between polls when `wait=True`.
- `timeout` (default `900.0`): max seconds to wait; exceeding it raises `ParetaError`.

```python
# Against an existing set
run = pa.evals.runs.create(eval_set=eval_set.id, models=["contract-kie-1", "contract-kie-2"], wait=True)
```

<a id="inline-create"></a>
**Inline create**: skip `sets.create` entirely and hand the rows to the run; the SDK creates the set for you:

```python
run = pa.evals.runs.create(
    task="contract-key-fields",
    items=[{"input": "...", "expected": {"effective_date": "2026-01-01"}}],
    models=["contract-kie-1", "contract-kie-2"],
    frontier="benchmarked",
    wait=True,
)
```

**Metering.** Each run is metered: the org balance is debited for the compute across **open and frontier** models. If the balance cannot cover the run, `create` raises `InsufficientCreditsError` (402) before any work is billed. Top-up is browser-only; the SDK never exposes balance or payment methods.

```python
from pareta import InsufficientCreditsError

try:
    run = pa.evals.runs.create(eval_set=eval_set.id, models=["contract-kie-1"],
                               frontier="benchmarked", wait=True)
except InsufficientCreditsError:
    raise SystemExit("Out of credit. Top up in the dashboard (billing is browser-only).")
```

`InsufficientCreditsError` subclasses `APIStatusError`; catch `ParetaError` for one handler over every SDK failure. See [Errors and metering](exceptions.md).

Returns an [`EvalRun`](#evalrun).

#### `frontier=` resolution

`frontier=` controls which vendor models get scored alongside your open candidates, so the report shows exactly how much quality and cost you are trading. The SDK resolves the keyword to a concrete list of ids before sending the run:

| `frontier=` | Baselines scored |
| --- | --- |
| `None` or `"none"` (default) | none, open candidates only (`[]`) |
| `"all"` | every frontier model available for the task |
| `"benchmarked"` | frontier models on the task's leaderboard (vision-filtered for document tasks) |
| `["gpt-4o", "claude-..."]` | exactly these frontier ids, passed through as-is |

```python
# Open candidates only
run = pa.evals.runs.create(eval_set=eval_set.id, models=["contract-kie-1"], frontier="none", wait=True)

# Everything in the frontier pool for the task
run = pa.evals.runs.create(eval_set=eval_set.id, models=["contract-kie-1"], frontier="all", wait=True)

# A hand-picked baseline
run = pa.evals.runs.create(eval_set=eval_set.id, models=["contract-kie-1"], frontier=["gpt-4o"], wait=True)
```

The `"all"` and `"benchmarked"` keywords need the task to fetch the roster. When you create inline (`task=...`) the SDK already has it; when you pass `eval_set=...` it looks the task up from the set. If it still cannot resolve a task it raises `ValueError`. An unrecognized keyword (anything other than `"all"` / `"benchmarked"` / `"none"`) raises `ValueError`, and a `frontier` that is not `None`, a list/tuple, or a string raises `TypeError`. An explicit list skips the roster lookup entirely.
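
The validation rules in the paragraph above can be sketched as a small classifier. This is an illustration under assumptions, not the SDK's internals: `classify_frontier` and the `"explicit"` mode label are hypothetical names, and the roster lookup that `"all"` / `"benchmarked"` need is left out.

```python
from __future__ import annotations

KEYWORDS = {"all", "benchmarked", "none"}


def classify_frontier(frontier) -> tuple[str, list[str]]:
    """Return (mode, explicit_ids) for a `frontier=` argument."""
    if frontier is None:
        return "none", []
    if isinstance(frontier, str):  # check str first: a str is not an id list
        if frontier not in KEYWORDS:
            raise ValueError(f"unknown frontier keyword: {frontier!r}")
        return frontier, []
    if isinstance(frontier, (list, tuple)):
        return "explicit", list(frontier)  # passed through, no roster lookup
    raise TypeError(
        f"frontier must be None, a str, or a list/tuple, got {type(frontier).__name__}"
    )


for value in (None, "benchmarked", ["gpt-4o"], "everything", 42):
    try:
        print(repr(value), "->", classify_frontier(value))
    except (ValueError, TypeError) as exc:
        print(repr(value), "->", type(exc).__name__)
```

Only the `"all"` and `"benchmarked"` modes would go on to fetch the roster, which is why they are the only ones that need a resolvable task.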

To enumerate and pin the roster yourself, see [`frontier_models`](#evalsfrontier_models-frontier-baseline-roster).

### `runs.retrieve`

```python
def retrieve(self, run_id: str) -> EvalRun
```

`GET /v1/eval-runs/{run_id}`: full run state, including `results` once the run is terminal.

```python
run = pa.evals.runs.retrieve("evalrun_xyz789")
if run.is_terminal:
    print(run.status, run.results)
```

### `runs.wait`

```python
def wait(self, run_id: str, *, poll_interval: float = 3.0, timeout: float = 900.0) -> EvalRun
```

Polls `runs.retrieve(run_id)` every `poll_interval` seconds until `run.is_terminal` (status `"completed"` or `"failed"`), then returns the final `EvalRun`. Raises `ParetaError` if `timeout` seconds elapse first. This is exactly what `create(..., wait=True)` calls internally, so you can fire a run and block on it later:

```python
run = pa.evals.runs.create(eval_set=eval_set.id, models=["contract-kie-1"])  # returns immediately
run = pa.evals.runs.wait(run.id, poll_interval=5.0, timeout=1800.0)           # block on it
```

Or poll on your own schedule without `wait`:

```python
import time

run = pa.evals.runs.create(eval_set=eval_set.id, models=["contract-kie-1"])
while not run.is_terminal:
    time.sleep(5.0)                       # pause between polls; avoid a tight loop
    run = pa.evals.runs.retrieve(run.id)
```
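
For intuition, the poll-until-terminal contract that `runs.wait` describes can be sketched without the SDK. Everything here is a stand-in: `wait_until_terminal` is a hypothetical helper, `fetch_status` replaces `runs.retrieve`, and the built-in `TimeoutError` stands in for `ParetaError`.

```python
from __future__ import annotations

import time
from typing import Callable

TERMINAL = {"completed", "failed"}  # the two terminal statuses named above


def wait_until_terminal(
    fetch_status: Callable[[], str],
    *,
    poll_interval: float = 3.0,
    timeout: float = 900.0,
) -> str:
    """Poll `fetch_status` until it returns a terminal status or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in TERMINAL:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout}s")
        time.sleep(poll_interval)


# Simulated run that needs three polls to finish.
statuses = iter(["running", "evaluating", "completed"])
final_status = wait_until_terminal(lambda: next(statuses), poll_interval=0.0, timeout=5.0)
print(final_status)
```

Using a monotonic clock for the deadline keeps the timeout correct even if the wall clock is adjusted while the run is in flight.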

### Reading results

A terminal `EvalRun` carries one [`EvalResult`](#evalresult) per model in `run.results`, plus the bill.

```python
run = pa.evals.runs.retrieve(run_id)

if run.status == "failed":
    print("run failed:", run.error_detail)
else:
    def fmt(v):  # quality fields are `float | None`; avoid formatting None
        return "n/a" if v is None else f"{v:.3f}"

    ranked = sorted(run.results, key=lambda r: r.quality_mean or 0.0, reverse=True)
    for r in ranked:
        cost_per_item = (r.mean_cost_micro_usd or 0) / 1_000_000  # micro-USD to dollars
        print(f"{r.model_id or '?':24} {r.kind or '?':8} q={fmt(r.quality_mean)} "
              f"[{fmt(r.quality_ci_low)}, {fmt(r.quality_ci_high)}]  "
              f"${cost_per_item:.6f}/item  ok={r.n_succeeded} err={r.error_count}")

    print(f"run cost: ${run.cost}")          # Decimal dollars, floored to cents
    print(f"raw micro-USD: {run.cost_micro_usd}")
```

Use the confidence interval: two models whose CIs overlap are not meaningfully different on this sample, so add rows before declaring a winner. A high `error_count` on one model usually means malformed output, not a bad model, so inspect before trusting its quality number.
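
One way to apply the overlap rule mechanically is sketched below. `Score` is a hypothetical stand-in for the quality fields of an `EvalResult`, the numbers are made up for illustration, and treating CI overlap as "no clear winner" is the conservative reading of the advice above rather than a formal significance test.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Score:  # hypothetical stand-in for one model's quality fields
    model_id: str
    quality_mean: float
    quality_ci_low: float
    quality_ci_high: float


def cis_overlap(a: Score, b: Score) -> bool:
    """True when the two confidence intervals share at least one point."""
    return a.quality_ci_low <= b.quality_ci_high and b.quality_ci_low <= a.quality_ci_high


def clear_winner(scores: list[Score]) -> Score | None:
    """Return the top-ranked model only if its CI clears every other CI."""
    ranked = sorted(scores, key=lambda s: s.quality_mean, reverse=True)
    best, rest = ranked[0], ranked[1:]
    return None if any(cis_overlap(best, other) for other in rest) else best


close_race = [Score("model-a", 0.91, 0.88, 0.94), Score("model-b", 0.89, 0.85, 0.92)]
wide_gap = [Score("model-a", 0.91, 0.88, 0.94), Score("model-c", 0.70, 0.66, 0.74)]

print(clear_winner(close_race))          # intervals overlap, so no winner yet
print(clear_winner(wide_gap).model_id)   # intervals are disjoint
```

When the function returns `None`, the practical next step is the one already suggested: add rows to the eval set and rerun.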

**On money.** `run.cost` is a `Decimal` in dollars, **floored to whole cents** (the SDK never rounds a charge up), so a sub-cent run reads `Decimal("0.00")`. `run.cost_micro_usd` is the raw integer (`1_000_000` micro-USD = `$1.00`) for exact accounting. Per-item rates like `result.mean_cost_micro_usd` stay in micro-USD on purpose: flooring sub-cent unit rates to whole cents would erase the open-vs-frontier cost gap the eval exists to find. Same convention SDK-wide; see [Errors and metering](exceptions.md).
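
The arithmetic behind that convention can be reproduced with `decimal`. This is a sketch of the stated rule (divide by `1_000_000`, floor to cents), not the SDK's code; `micro_usd_to_cost` is a hypothetical name and the per-item rates are invented for illustration.

```python
from decimal import ROUND_FLOOR, Decimal

MICRO_PER_USD = 1_000_000  # 1_000_000 micro-USD = $1.00


def micro_usd_to_cost(micro_usd: int) -> Decimal:
    """Convert integer micro-USD to dollars, floored to whole cents."""
    dollars = Decimal(micro_usd) / MICRO_PER_USD
    return dollars.quantize(Decimal("0.01"), rounding=ROUND_FLOOR)


print(micro_usd_to_cost(1_239_999))  # 1.23 (never rounded up to 1.24)
print(micro_usd_to_cost(4_200))      # 0.00 (a sub-cent total)

# Why per-item rates stay in micro-USD: both floor to $0.00,
# yet one is more than ten times the other.
open_rate, frontier_rate = 310, 4_200  # made-up per-item rates
print(micro_usd_to_cost(open_rate), micro_usd_to_cost(frontier_rate))
print(round(frontier_rate / open_rate, 1))
```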

## `evals.frontier_models`: frontier baseline roster

```python
def frontier_models(self, task: str | None = None) -> list[FrontierModel]
```

`GET /v1/eval/frontier-models`: the vendor (frontier) models you can evaluate against. Feed the `.id`s into `runs.create(frontier=[...])`.

- `task` (optional): when given, each entry is annotated `benchmarked` (on that task's leaderboard) and the roster is vision-filtered for document tasks. Without a task the full roster comes back unannotated.

```python
roster = pa.evals.frontier_models(task="contract-key-fields")
for m in roster:
    print(m.id, m.vendor, "vision" if m.vision else "text",
          "benchmarked" if m.benchmarked else "-")

# Pin two benchmarked baselines explicitly
ids = [m.id for m in roster if m.benchmarked][:2]
run = pa.evals.runs.create(eval_set=eval_set.id, models=["contract-kie-1"], frontier=ids, wait=True)
```

Returns a list of [`FrontierModel`](#frontiermodel).

## Async

Every method above has an `async` twin on `AsyncPareta` with an identical signature; the methods are coroutines (`wait` included). Document uploads are async too.

```python
import asyncio
from pareta import AsyncPareta

async def main():
    async with AsyncPareta.from_env() as pa:
        eval_set = await pa.evals.sets.create(
            task="contract-key-fields",
            items=[{"input": "...", "expected": {"effective_date": "2026-01-01"}}],
        )
        run = await pa.evals.runs.create(
            eval_set=eval_set.id,
            models=["contract-kie-1", "contract-kie-2"],
            frontier="benchmarked",
            wait=True,
        )
        for r in run.results:
            print(r.model_id, r.kind, r.quality_mean)
        print("billed", run.cost)

asyncio.run(main())
```

`await pa.evals.runs.wait(run_id)`, `await pa.evals.frontier_models(task=...)`, and `await pa.evals.sets.upload_document(...)` all work the same way.

## Response objects

Every object keeps the raw server JSON: call `.to_dict()` for lossless access to anything not yet surfaced as a typed field, or index the object dict-style (`run["..."]`) as an escape hatch.

### `EvalSet`

From `sets.create`, `sets.list`, `sets.retrieve`.

| Field | Type | Notes |
| --- | --- | --- |
| `id` | `str \| None` | Pass to `runs.create(eval_set=...)` |
| `task_id` | `str \| None` | The task this set is bound to |
| `name` | `str \| None` | Label |
| `item_count` | `int \| None` | Number of rows |
| `scoring_strategy` | `str \| None` | How the task is scored (e.g. `"extraction"`) |

### `EvalRun`

From `runs.create`, `runs.retrieve`, `runs.wait`. Wraps the `{"run": {...}, "results": [...]}` envelope.

| Field | Type | Notes |
| --- | --- | --- |
| `id` | `str \| None` | Run id; pass to `runs.retrieve` / `runs.wait` |
| `eval_set_id` | `str \| None` | The set evaluated |
| `status` | `str \| None` | `"running"`, `"evaluating"`, `"completed"`, `"failed"` |
| `is_terminal` | `bool` | `True` when status is `"completed"` or `"failed"` |
| `candidate_models` | `list[str]` | Model ids evaluated (open + frontier) |
| `error_detail` | `str \| None` | Error message when `status == "failed"` |
| `cost` | `Decimal` | Billed total in dollars, floored to cents |
| `cost_micro_usd` | `int` | Raw total cost in micro-USD (`1_000_000` = `$1.00`) |
| `results` | `list[EvalResult]` | One aggregate per model (populated once terminal) |

### `EvalResult`

One model's aggregate on a run; from `run.results`.

| Field | Type | Notes |
| --- | --- | --- |
| `model_id` | `str \| None` | Per-task alias (open) or vendor id (frontier) |
| `kind` | `str \| None` | `"open"` or `"frontier"` |
| `quality_mean` | `float \| None` | Mean score in `[0, 1]`, your ranking key |
| `quality_ci_low` | `float \| None` | 95% CI lower bound |
| `quality_ci_high` | `float \| None` | 95% CI upper bound |
| `mean_cost_micro_usd` | `int \| None` | Avg cost per item in micro-USD (not floored) |
| `n_succeeded` | `int \| None` | Rows that scored cleanly |
| `error_count` | `int \| None` | Rows that errored |

### `FrontierModel`

A vendor baseline; from `frontier_models`.

| Field | Type | Notes |
| --- | --- | --- |
| `id` | `str \| None` | Pass to `runs.create(frontier=[...])` |
| `vendor` | `str \| None` | `"openai"`, `"anthropic"`, etc. |
| `vision` | `bool` | `True` if vision-capable |
| `benchmarked` | `bool` | `True` if on the task's leaderboard (only set when `task=` is given) |

## See also

- [`tasks`](./tasks.md): match intent to a task, inspect its schema, pull example rows, read leaderboards.
- [`models`](./models.md): where per-task open aliases come from and why real ids are hidden.
- [`endpoints`](./endpoints.md): deploy the winning model to a live endpoint (task + model only; no hardware knob).
- [`chat`](./chat.md): the OpenAI-compatible inference surface, metered the same way evals are.
- [Errors and metering](exceptions.md): `InsufficientCreditsError`, the money convention, and the full exception hierarchy.



---

<!-- reference/exceptions.md -->

# Exceptions

Every error the Pareta SDK raises is a subclass of `ParetaError`. That single
base class is the contract: one `except ParetaError` catches anything the SDK
can throw, and a narrower `except InsufficientCreditsError` catches exactly the
case you care about. Server errors carry the HTTP `status_code`, the server's
`detail` message, and a `request_id` you can quote in a support ticket.

This page is the class-by-class reference: the full hierarchy, the
status-code-to-class mapping, and the attributes on each error. For the
narrative version (what gets retried automatically, how to tune timeouts and
the retry budget), see [Errors, retries & timeouts](../guide/errors-and-retries.md).

## Import

All exception classes are exported from the top-level package. Import the ones
you handle directly.

```python
from pareta import (
    ParetaError,                # base class for everything below
    APIConnectionError,         # never reached the server (DNS/TCP/TLS)
    APITimeoutError,            # subclass of APIConnectionError
    APIStatusError,             # any non-2xx from the server
    BadRequestError,            # 400, 422
    AuthenticationError,        # 401
    InsufficientCreditsError,   # 402 — org out of balance
    PermissionDeniedError,      # 403
    NotFoundError,              # 404
    ConflictError,              # 409
    RateLimitError,             # 429
    EndpointNotReadyError,      # 503 — endpoint stopped/cold/provider down
)
```

## The hierarchy

```
Exception
└── ParetaError                      base class for every SDK error
    ├── APIConnectionError           request never reached the server
    │   └── APITimeoutError          timed out before any response
    └── APIStatusError               server returned a non-2xx status
        ├── BadRequestError          400, 422
        ├── AuthenticationError      401
        ├── InsufficientCreditsError 402
        ├── PermissionDeniedError    403
        ├── NotFoundError            404
        ├── ConflictError            409
        ├── RateLimitError           429
        └── EndpointNotReadyError    503
```

Two facts that fall out of this tree and are worth holding onto:

- `APITimeoutError` is a subclass of `APIConnectionError`, so catching
  `APIConnectionError` also catches timeouts.
- Every status-mapped class (`BadRequestError`, `InsufficientCreditsError`, and
  the rest) is a subclass of `APIStatusError`, so catching `APIStatusError`
  catches all of them and gives you `.status_code`, `.detail`, and `.request_id`.

## Status code mapping

When the server returns a non-2xx response, the SDK builds the most specific
`APIStatusError` subclass for that status. Anything not in the table below
(other 5xx, unexpected codes) surfaces as a plain `APIStatusError` carrying the
raw `status_code`.

| Status | Exception | When |
| --- | --- | --- |
| 400 | `BadRequestError` | Request rejected by the server |
| 401 | `AuthenticationError` | Missing or invalid `pareta_sk_` API key |
| 402 | `InsufficientCreditsError` | Org balance is empty — top up in the dashboard |
| 403 | `PermissionDeniedError` | Authenticated, but not allowed |
| 404 | `NotFoundError` | Endpoint, task, eval set, or run id does not exist |
| 409 | `ConflictError` | Conflict (e.g. seed endpoint, transient lock/contention) |
| 422 | `BadRequestError` | FastAPI request validation failed |
| 429 | `RateLimitError` | Rate limited; honors `Retry-After` |
| 503 | `EndpointNotReadyError` | Endpoint stopped, cold-starting, or provider down |
| other 5xx | `APIStatusError` | Generic server error |

Note that `400` and `422` both map to `BadRequestError`, so a single clause
covers both generic bad-request and FastAPI-validation rejections.
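
The lookup-with-fallback rule can be sketched as a dictionary plus a default.
The classes below are local stand-ins that reuse the SDK's names for
readability, and `error_for_status` is a hypothetical helper; the real
constructors may take different arguments.

```python
class ParetaError(Exception):
    """Local stand-in for the SDK base class."""


class APIStatusError(ParetaError):
    """Local stand-in: carries the HTTP status and the server's detail."""

    def __init__(self, status_code: int, detail: str = "") -> None:
        super().__init__(f"HTTP {status_code}: {detail}")
        self.status_code = status_code
        self.detail = detail


class BadRequestError(APIStatusError): pass
class AuthenticationError(APIStatusError): pass
class InsufficientCreditsError(APIStatusError): pass
class PermissionDeniedError(APIStatusError): pass
class NotFoundError(APIStatusError): pass
class ConflictError(APIStatusError): pass
class RateLimitError(APIStatusError): pass
class EndpointNotReadyError(APIStatusError): pass


STATUS_TO_ERROR = {
    400: BadRequestError, 422: BadRequestError,  # two statuses, one class
    401: AuthenticationError, 402: InsufficientCreditsError,
    403: PermissionDeniedError, 404: NotFoundError,
    409: ConflictError, 429: RateLimitError, 503: EndpointNotReadyError,
}


def error_for_status(status_code: int, detail: str = "") -> APIStatusError:
    """Build the most specific error; unmapped statuses get the parent class."""
    cls = STATUS_TO_ERROR.get(status_code, APIStatusError)
    return cls(status_code, detail)


for code in (402, 422, 500):
    print(code, type(error_for_status(code)).__name__)
```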

## Class reference

### `ParetaError`

The base class for every error the SDK raises. Catch this to handle any SDK
failure with one clause. It is also raised directly in one non-HTTP case:
constructing a client with no API key.

```python
from pareta import Pareta, ParetaError

try:
    pa = Pareta()            # no api_key arg, PARETA_API_KEY unset
except ParetaError as e:
    print(e)                 # "missing API key. Pass api_key=… or set PARETA_API_KEY …"
```

### `APIConnectionError(ParetaError)`

The request never reached the server: DNS failure, TCP refusal, TLS error, or a
dropped connection. Connection failures on the initial handshake are retried up
to `max_retries` times; this is raised only after the retry budget is spent.

```python
APIConnectionError(message: str = "connection error", *, cause: BaseException | None = None)
```

The underlying `httpx` exception is attached as `.__cause__`, so a traceback
shows the original network error.

### `APITimeoutError(APIConnectionError)`

The request did not complete within the client `timeout` (default
`httpx.Timeout(60.0, connect=10.0)`). Because it subclasses
`APIConnectionError`, an `except APIConnectionError` clause also catches it.

```python
APITimeoutError(message: str = "request timed out", *, cause: BaseException | None = None)
```

### `APIStatusError(ParetaError)`

The server returned a non-2xx status. This is the parent of every
status-mapped class below, and is also raised directly for any status not in
the [mapping table](#status-code-mapping).

**Attributes**

```python
status_code: int                  # the HTTP status
detail: Any                       # server's `detail` message, or the raw body
request_id: str | None            # value of the x-request-id response header
response: httpx.Response | None   # the full response, for advanced use
```

`detail` is the FastAPI `{"detail": "..."}` message when the body is JSON;
otherwise it falls back to the raw response text. `request_id` comes from the
`x-request-id` header and is the value to quote when reporting a problem.

```python
from pareta import Pareta, APIStatusError

pa = Pareta.from_env()
try:
    pa.endpoints.retrieve("ep_does_not_exist")
except APIStatusError as e:
    print(e.status_code)     # 404
    print(e.detail)          # server's explanation
    print(e.request_id)      # quote this in a support ticket
```

### `BadRequestError(APIStatusError)` — 400, 422

The request was rejected by the server, either as a bad request (400) or a
FastAPI validation failure (422). Inspect `.detail` for the specific field or
reason.

### `AuthenticationError(APIStatusError)` — 401

The `pareta_sk_` API key is missing or invalid. Mint a fresh key in the
dashboard and pass it via `Pareta.from_env()` (reads `PARETA_API_KEY`) or
`Pareta(api_key="pareta_sk_…")`.

### `InsufficientCreditsError(APIStatusError)` — 402

The org balance is empty. Both metered paths raise this: inference
(`chat.completions.create()`) and eval runs (`evals.runs.create()`, billed for
open and frontier compute combined). Top-up is browser-only — the SDK never
exposes balance or payment methods, so the fix is to add credit in the
dashboard and retry.

```python
from pareta import Pareta, InsufficientCreditsError

pa = Pareta.from_env()
try:
    pa.chat.completions.create(
        model="ep_contract_kie",
        messages=[{"role": "user", "content": "Extract the parties."}],
    )
except InsufficientCreditsError:
    print("Org is out of credit. Top up in the dashboard (https://pareta.ai), then retry.")
```

### `PermissionDeniedError(APIStatusError)` — 403

The key is valid but the org or user lacks permission for the requested
resource or action.

### `NotFoundError(APIStatusError)` — 404

The referenced resource does not exist: an unknown endpoint id, task id, eval
set id, or run id.

### `ConflictError(APIStatusError)` — 409

A conflict on the server — for example a seed/legacy endpoint that is not
deployable, or a transient lock or contention. A 409 is retried automatically
(see [the retry list below](#what-gets-retried)); a stable 409 surfaces here
after the retry budget is spent.

### `RateLimitError(APIStatusError)` — 429

Too many requests. The client retries 429s automatically, honoring the server's
`Retry-After` header when present; this is raised only after retries are
exhausted.

### `EndpointNotReadyError(APIStatusError)` — 503

The target endpoint is not serving yet: stopped, cold-starting, or its provider
is temporarily unavailable. Start it with `endpoints.start(endpoint_id)`, wait
for `endpoint.is_live`, then retry. 503 is in the automatic retry set, so a
brief cold start often resolves before this is raised.

```python
from pareta import Pareta, EndpointNotReadyError

pa = Pareta.from_env()
try:
    pa.chat.completions.create(
        model="ep_contract_kie",
        messages=[{"role": "user", "content": "ping"}],
    )
except EndpointNotReadyError:
    pa.endpoints.start("ep_contract_kie")   # wake it, then retry
```

## What gets retried

The client retries transient failures before raising, so most of the errors
above only surface after the retry budget (`max_retries`, default `2`) is spent.

- **Retried automatically:** connection and timeout errors on the initial
  handshake, and HTTP statuses **408, 409, 429, 500, 502, 503, 504**. Backoff is
  exponential with jitter, on a base delay of `min(0.5 * 2**attempt, 8.0)`
  seconds, and honors a `Retry-After` header when the server sends one.
- **Never retried:** `400`, `401`, `402`, `403`, `404`, `422`, and any other
  status not in the list. These are deterministic — retrying would not change the
  outcome — so they raise immediately.
- **Mid-stream drops are not retried.** Retries apply only to the initial
  handshake; once SSE bytes are flowing (a streaming chat completion or a deploy
  progress stream) a dropped connection raises immediately, since the stream
  cannot be safely resumed.

Tune the budget per client:

```python
from pareta import Pareta

pa = Pareta.from_env(max_retries=5)   # default is 2; set 0 to disable retries
```
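
To make the retry rules in this section concrete, here is a minimal sketch of
the decision and the base delay schedule. It rests on assumptions: `attempt` is
taken to start at `0`, a `Retry-After` value is assumed to replace the computed
delay, and jitter is left out because its exact form is not specified here. The
helper names are hypothetical.

```python
from __future__ import annotations

RETRYABLE_STATUSES = frozenset({408, 409, 429, 500, 502, 503, 504})


def should_retry(status_code: int, attempt: int, max_retries: int = 2) -> bool:
    """Retry only listed statuses, and only while budget remains."""
    return attempt < max_retries and status_code in RETRYABLE_STATUSES


def base_delay(attempt: int) -> float:
    """Exponential base delay in seconds, capped at 8.0 (jitter not modelled)."""
    return min(0.5 * 2**attempt, 8.0)


def next_delay(attempt: int, retry_after: float | None = None) -> float:
    # Assumption: a server-sent Retry-After (seconds) overrides the base delay.
    return retry_after if retry_after is not None else base_delay(attempt)


print([base_delay(a) for a in range(6)])   # [0.5, 1.0, 2.0, 4.0, 8.0, 8.0]
print(should_retry(503, attempt=0))        # True: transient, budget left
print(should_retry(402, attempt=0))        # False: deterministic failure
print(should_retry(503, attempt=2))        # False: default budget spent
```

With the default budget of `2`, the base delays add up to 1.5 seconds before
the error surfaces; raising `max_retries` to `5` reaches the 8-second cap.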

## Handling errors

Order `except` clauses from most specific to least specific. The base
`ParetaError` is the safety net at the bottom.

```python
from pareta import (
    Pareta,
    InsufficientCreditsError,
    EndpointNotReadyError,
    RateLimitError,
    APIStatusError,
    APIConnectionError,
    ParetaError,
)

pa = Pareta.from_env()

try:
    completion = pa.chat.completions.create(
        model="ep_contract_kie",
        messages=[{"role": "user", "content": "Summarize this contract."}],
    )
    print(completion.choices[0].message.content)
except InsufficientCreditsError:
    print("Out of credit — top up in the dashboard, then retry.")
except EndpointNotReadyError:
    pa.endpoints.start("ep_contract_kie")          # wake a cold/stopped endpoint
except RateLimitError as e:
    print(f"Rate limited; the client already retried. request_id={e.request_id}")
except APIStatusError as e:
    print(f"Server returned {e.status_code}: {e.detail} (request_id={e.request_id})")
except APIConnectionError:
    print("Could not reach the API after retries — check the network.")
except ParetaError as e:
    print(f"Unexpected SDK error: {e}")
```

The same classes and hierarchy apply to the async client — `await`ed
`AsyncPareta` calls raise the exact same exception types, so error handling code
is identical across sync and async.

```python
import asyncio
from pareta import AsyncPareta, APIStatusError

async def main():
    async with AsyncPareta.from_env() as pa:
        try:
            await pa.endpoints.retrieve("ep_does_not_exist")
        except APIStatusError as e:
            print(e.status_code, e.detail)

asyncio.run(main())
```

## Pre-flight `ValueError`

A few methods validate arguments before any network call and raise the
standard-library `ValueError` (not a `ParetaError`) when an argument is plainly
unusable. These are programming errors caught early, not server responses:

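- `chat.completions.create()`: empty `model` or `messages`
- `tasks.match()`: empty `query`
- `evals.sets.create()`: empty `items`
- `evals.runs.create()`: neither `eval_set` nor `task` + `items`, an empty
  `models` with no frontier ids, or a `frontier` keyword that is unrecognized or
  cannot be resolved to a task

Wrong argument types raise the standard-library `TypeError` in the same
pre-flight way: a `frontier` that is not `None`, a string, or a list/tuple, and
a `file` passed to `evals.sets.upload_document()` that is not a path, bytes, or
a binary file-like object.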

## See also

- [Errors, retries & timeouts](../guide/errors-and-retries.md) — the narrative
  guide: retry behavior, backoff, and tuning the timeout and retry budget.
- [Configuration](../guide/configuration.md) — setting `api_key`, `base_url`,
  `timeout`, and `max_retries`.
- [Inference](../guide/inference.md) — the metered chat path that raises
  `InsufficientCreditsError`.
- [Deploying endpoints](../guide/deploying-endpoints.md) — starting and stopping
  endpoints behind `EndpointNotReadyError`.
- [Evaluation](../guide/evaluation.md) — eval runs, also metered against the org
  balance.



---

<!-- reference/types.md -->

# Response types

Every method that talks to the API hands you back a typed object, not a bare
dict. These objects give you attribute access and autocomplete over the shapes
the API returns: a chat completion's `choices`, an endpoint's `status`, an eval
run's `cost`. They are thin: each one wraps the raw server JSON and exposes the
fields you actually use as properties.

This page is the field-by-field reference for those objects. For how the methods
that return them work, see [Running inference](../guide/inference.md),
[Deploying endpoints](../guide/deploying-endpoints.md),
[Finding the right model](../guide/discovery.md), and
[Evaluating models](../guide/evaluation.md).

## The shared base: every object keeps the raw JSON

All response objects inherit from one base. Two things are true of every object
on this page:

- `.to_dict()` returns the exact JSON the server sent, losslessly. The typed
  properties are a convenience layer over it; nothing is dropped.
- `obj["some_key"]` and `obj.get("some_key", default)` read raw fields directly.
  This is the escape hatch for any field the platform adds before the typed layer
  catches up.

```python
from pareta import Pareta

pa = Pareta.from_env()   # reads PARETA_API_KEY (+ optional PARETA_BASE_URL)

resp = pa.chat.completions.create(
    model="ep_contract_qwen",
    messages=[{"role": "user", "content": "ping"}],
)

resp.choices[0].message.content   # typed access
resp.to_dict()                    # the full raw JSON, lossless
resp["id"]                        # raw-key access for anything not yet typed
```

Properties return `None` (or an empty list) when a field is absent rather than
raising, so reading an optional field is always safe.
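
A minimal sketch of that pattern, assuming nothing about the SDK's actual
classes: `RawBacked`, `UsageSketch`, and `CompletionSketch` are hypothetical
names that only illustrate how a thin typed layer can sit over the raw JSON
without dropping keys or raising on absent fields.

```python
from __future__ import annotations


class RawBacked:
    """Hypothetical base: wraps the raw JSON and never drops a key."""

    def __init__(self, raw: dict | None = None) -> None:
        self._raw = dict(raw or {})

    def to_dict(self) -> dict:
        return dict(self._raw)  # shallow copy so callers cannot mutate our state

    def __getitem__(self, key: str):
        return self._raw[key]

    def get(self, key: str, default=None):
        return self._raw.get(key, default)


class UsageSketch(RawBacked):
    @property
    def total_tokens(self) -> int | None:
        return self._raw.get("total_tokens")  # None when absent, no KeyError


class CompletionSketch(RawBacked):
    @property
    def id(self) -> str | None:
        return self._raw.get("id")

    @property
    def usage(self) -> UsageSketch:
        return UsageSketch(self._raw.get("usage"))  # empty wrapper when absent


payload = {"id": "cmpl_123", "x_new_field": "added server-side"}
resp = CompletionSketch(payload)
print(resp.id)                  # typed access
print(resp["x_new_field"])      # raw-key access for a field with no property
print(resp.usage.total_tokens)  # None: the field is absent, nothing raises
```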

## Inference types

These come back from `chat.completions.create` (route `POST /v1/chat/completions`).
Inference is OpenAI-compatible, so the schema matches the OpenAI chat objects.

### ChatCompletion

The non-streaming result of `chat.completions.create(...)`.

| Property | Type | Notes |
|---|---|---|
| `id` | `str \| None` | Completion id |
| `model` | `str \| None` | The model/endpoint id that served the request |
| `created` | `int \| None` | Unix timestamp |
| `choices` | `list[Choice]` | One entry per generated choice |
| `usage` | `Usage` | Token counts |

```python
resp = pa.chat.completions.create(
    model="ep_contract_qwen",
    messages=[{"role": "user", "content": "Extract the effective date."}],
    temperature=0,
)
print(resp.choices[0].message.content)
print(resp.usage.total_tokens)
```

### ChatCompletionChunk

One delta from a streaming completion. The stream yields one chunk per SSE event
when you pass `stream=True`. It has the same schema as `ChatCompletion`; it is a
distinct type purely for type hinting. The incremental text lives on
`choices[0].delta.content`, not `choices[0].message.content`.

```python
for chunk in pa.chat.completions.create(
    model="ep_contract_qwen",
    messages=[{"role": "user", "content": "Summarize this contract."}],
    stream=True,
):
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```

The `or ""` guard matters: the first and last chunks often carry no content (role
preamble, finish marker), so `delta.content` can be `None` on a healthy stream.
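
To see the guard at work without a live endpoint, the sketch below reassembles
text from hand-written chunk payloads. The dicts imitate the OpenAI-style chunk
shape and are illustrative only; `collect_text` is a hypothetical helper, not
part of the SDK.

```python
from __future__ import annotations


def collect_text(chunks: list[dict]) -> str:
    """Join the incremental `delta.content` pieces, skipping empty deltas."""
    parts = []
    for chunk in chunks:
        delta = chunk["choices"][0].get("delta", {})
        parts.append(delta.get("content") or "")  # None or missing becomes ""
    return "".join(parts)


sample_chunks = [
    {"choices": [{"delta": {"role": "assistant"}}]},              # role preamble
    {"choices": [{"delta": {"content": "Hello"}}]},
    {"choices": [{"delta": {"content": ", world"}}]},
    {"choices": [{"delta": {}, "finish_reason": "stop"}]},        # finish marker
]

print(collect_text(sample_chunks))  # Hello, world
```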

### Choice

One element of `completion.choices`.

| Property | Type | Notes |
|---|---|---|
| `index` | `int \| None` | Position in the choices list |
| `finish_reason` | `str \| None` | `"stop"`, `"length"`, etc. |
| `message` | `Message` | The full message. Populated on **non-streaming** results |
| `delta` | `Message` | The incremental content. Populated on **streaming** chunks |

`message` and `delta` always return a `Message` (empty if absent), so reading
`choice.delta.content` on a non-streaming result, or vice versa, returns `None`
rather than blowing up.

### Message

The content of a `Choice`.

| Property | Type | Notes |
|---|---|---|
| `role` | `str \| None` | `"assistant"`, `"user"`, etc. |
| `content` | `str \| None` | The text |

### Usage

Token accounting on a `ChatCompletion`.

| Property | Type |
|---|---|
| `prompt_tokens` | `int \| None` |
| `completion_tokens` | `int \| None` |
| `total_tokens` | `int \| None` |

## Model listing types

Returned from `models.list()` (route `GET /v1/models`). This is the
OpenAI-compatible model listing: it returns only your deployed endpoints that
have a live inference URL, so you can point any OpenAI-style tooling at Pareta and
get a sensible `/models` response.

### ModelList

| Property | Type |
|---|---|
| `data` | `list[Model]` |

`ModelList` is directly iterable and has a length, so you usually skip `.data`:

```python
models = pa.models.list()
print(len(models))
for m in models:                       # iterates m in models.data
    print(m.id, m.owned_by)
```

### Model

One element of a `ModelList`.

| Property | Type | Notes |
|---|---|---|
| `id` | `str \| None` | Endpoint id. Pass straight into `chat.completions.create(model=...)` |
| `owned_by` | `str \| None` | `"pareta"` or a vendor name |
| `created` | `int \| None` | Unix timestamp |

## Endpoint

Returned from `endpoints.deploy(..., wait=True)`, `endpoints.list()`, and
`endpoints.retrieve(id)`. A deployed open-weights model serving inference.

| Property | Type | Notes |
|---|---|---|
| `id` | `str \| None` | The endpoint id (== `name`). This is what you pass to `chat.completions.create(model=...)` |
| `name` | `str \| None` | Endpoint name |
| `model` | `str \| None` | The **per-task public alias** of the served model, never the real id |
| `status` | `str \| None` | `"live"`, `"starting"`, `"stopped"`, etc. |
| `task` | `str \| None` | The task this endpoint serves |
| `url` | `str \| None` | Inference URL (set once live) |
| `is_live` | `bool` | Convenience: `status == "live"` |

Two platform truths show up here. There is no GPU, quantization, or
tensor-parallel field, because hardware is hidden: you deploy with a task and a
model and Pareta resolves the serving class. And `model` is the per-task alias,
not the underlying checkpoint id; the real id never crosses into the SDK.

```python
ep = pa.endpoints.deploy(task="contract-key-fields", model="recommended", wait=True)
print(ep.id, ep.model, ep.status, ep.url)

if ep.is_live:
    pa.chat.completions.create(model=ep.id, messages=[{"role": "user", "content": "hi"}])
```

`endpoints.metrics(id)` returns a `Metrics` object (not a response type) whose
`.performance()`, `.uptime()`, `.cost()`, `.quality()`, and `.activity()` methods
return raw JSON dicts. Typed wrappers are planned but not yet shipped.

## Discovery types

These come from the `tasks` namespace and power the match-and-rank funnel.

### Task

Returned from `tasks.list()` and `tasks.retrieve(id)`. One benchmarked job.

| Property | Type | Notes |
|---|---|---|
| `id` | `str \| None` | Stable task id, e.g. `"contract-key-fields"` |
| `default_scorer` | `str \| None` | The function that grades model output for this task |
| `has_blob_input` | `bool` | `True` when the task takes documents or images, not just text |

```python
for t in pa.tasks.list():
    print(t.id, t.default_scorer, "doc" if t.has_blob_input else "text")
```

`has_blob_input` tells you whether you will need
`evals.sets.upload_document(...)` to attach binaries when you evaluate on this
task.

### TaskMatch

Returned from `tasks.match(query, top_k=...)`. The ranked result of matching
free-text intent to a task.

| Property | Type | Notes |
|---|---|---|
| `query` | `str \| None` | The query, echoed back |
| `matched` | `bool` | `True` when a high-confidence match was found |
| `chosen` | `TaskMatchCandidate \| None` | The best candidate, or `None` if nothing matched confidently |
| `candidates` | `list[TaskMatchCandidate]` | Top-K ranked alternates |
| `ambiguous` | `bool` | `True` when the top two scores are close |
| `matcher` | `str \| None` | Which strategy answered: `"keyword"` or `"semantic"` |

```python
m = pa.tasks.match("pull totals and dates out of vendor invoices", top_k=5)
if m.matched and m.chosen:
    print("best:", m.chosen.task_id, m.chosen.score, m.chosen.confidence)
else:
    for c in m.candidates:                  # fall back to ranked alternates
        print(c.task_id, c.score, c.confidence)
print("ambiguous?", m.ambiguous, "via", m.matcher)
```

### TaskMatchCandidate

An element of `match.candidates` (and the type of `match.chosen`).

| Property | Type | Notes |
|---|---|---|
| `task_id` | `str \| None` | The candidate task's id |
| `score` | `float \| None` | Match score in `[0, 1]` |
| `confidence` | `str \| None` | `"high"`, `"medium"`, or `"low"` |

### Leaderboard

Returned from `tasks.leaderboard(task_id)`. Open models ranked for a task, with a
single frontier baseline to beat.

| Property | Type | Notes |
|---|---|---|
| `task_id` | `str \| None` | The task |
| `metric` | `str \| None` | What the ranking optimizes, e.g. `"quality"` |
| `cost_unit` | `str \| None` | Cost basis, e.g. `"per_request"` |
| `recommended` | `str \| None` | The deployable model alias Pareta would pick. Pass straight to `endpoints.deploy(model=...)` |
| `models` | `list[LeaderboardEntry]` | Ranked **open** candidates |
| `frontier` | `LeaderboardEntry \| None` | The vendor baseline (savings reference) |

`recommended` is exactly what `endpoints.deploy(model="recommended")` resolves to,
and `tasks.recommended(task_id)` is a shortcut for `leaderboard(task_id).recommended`.

```python
lb = pa.tasks.leaderboard("contract-key-fields")
print("recommended:", lb.recommended, "| metric:", lb.metric, "| unit:", lb.cost_unit)

for e in lb.models:                         # ranked open aliases
    print(e.name, e.kind, e.quality, e.cost_per_request_micro_usd, f"{e.context_k}k")

if lb.frontier:
    print("baseline:", lb.frontier.name, lb.frontier.quality)
```

`tasks.leaderboard()` and `tasks.recommended()` are sync-only today; the async
`AsyncTasks` namespace does not expose them yet.

### LeaderboardEntry

A row of `leaderboard.models` (and the type of `leaderboard.frontier`).

| Property | Type | Notes |
|---|---|---|
| `name` | `str \| None` | Model name. For open models this is the **per-task alias**; for the frontier row it is the vendor id |
| `kind` | `str \| None` | `"open"` or `"frontier"` |
| `quality` | `float \| None` | Quality score in `[0, 1]` |
| `cost_per_request_micro_usd` | `int \| None` | Unit cost in **micro-USD** (not floored). See [money](#money-cost-vs-cost_micro_usd) |
| `context_k` | `int \| None` | Context window, in thousands of tokens |
| `run_mode` | `str \| None` | Backend benchmark context (`"rte"`, `"twostage"`); not a knob you set |

`name` for an open entry is the alias you feed back into `deploy(model=...)` or an
eval run's `models=[...]`. You never translate ids yourself.
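
Because unit costs arrive as raw micro-USD integers, comparing an open row against the frontier row is plain arithmetic. The sketch below is illustrative only: `savings_vs_frontier` is a hypothetical helper, not an SDK method, and the two costs are made-up sample values.

```python
from typing import Optional


def savings_vs_frontier(
    open_micro_usd: Optional[int], frontier_micro_usd: Optional[int]
) -> Optional[float]:
    """Fraction saved per request by the open model (hypothetical helper).

    Returns None when either cost is missing or the baseline is not positive,
    since the ratio would be meaningless.
    """
    if open_micro_usd is None or frontier_micro_usd is None or frontier_micro_usd <= 0:
        return None
    return 1.0 - open_micro_usd / frontier_micro_usd


# Sample values only: 850 micro-USD (open) vs 8,500 micro-USD (frontier).
saving = savings_vs_frontier(850, 8_500)
print(f"{saving:.0%} cheaper per request")  # 90% cheaper per request
```

In practice the two inputs would come from `entry.cost_per_request_micro_usd` and `lb.frontier.cost_per_request_micro_usd`, both of which may be `None`.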

### FrontierModel

Returned from `evals.frontier_models(task=...)`. A vendor model you can evaluate
against (the baseline, never something you deploy).

| Property | Type | Notes |
|---|---|---|
| `id` | `str \| None` | Vendor model id. Feed into `evals.runs.create(frontier=[...])` |
| `vendor` | `str \| None` | `"openai"`, `"anthropic"`, etc. |
| `vision` | `bool` | `True` if vision-capable |
| `benchmarked` | `bool` | `True` if it is on the task's leaderboard. Only meaningful when you passed `task=` |

Frontier ids are shown in the clear because they are public products. Open-model
aliases are not.

```python
for fm in pa.evals.frontier_models(task="contract-key-fields"):
    flag = "vision" if fm.vision else "text"
    note = " (on leaderboard)" if fm.benchmarked else ""
    print(fm.id, fm.vendor, flag, note)
```

## Evaluation types

These come from the `evals` namespace and carry the cost numbers you compare
open against frontier with.

### EvalSet

Returned from `evals.sets.create(...)`, `evals.sets.list()`, and
`evals.sets.retrieve(id)`. A reusable evaluation dataset.

| Property | Type | Notes |
|---|---|---|
| `id` | `str \| None` | Eval set id. Pass to `evals.runs.create(eval_set=...)` |
| `task_id` | `str \| None` | The task this set is graded against |
| `name` | `str \| None` | Label (auto-generated if you did not pass one) |
| `item_count` | `int \| None` | Number of rows |
| `scoring_strategy` | `str \| None` | The strategy used to grade rows, e.g. `"extraction"`, `"classification"` |

```python
es = pa.evals.sets.create(
    task="contract-key-fields",
    items=[{"input": "...contract...", "expected": {"effective_date": "2026-01-01"}}],
    name="my contracts v1",
)
print(es.id, es.task_id, es.item_count, es.scoring_strategy)
```

### EvalRun

Returned from `evals.runs.create(...)`, `evals.runs.retrieve(id)`, and
`evals.runs.wait(id)`. The state of an evaluation, including per-model results
once it is terminal. The object wraps the server's `{"run": {...}, "results":
[...]}` envelope and flattens it for you.

| Property | Type | Notes |
|---|---|---|
| `id` | `str \| None` | Run id |
| `eval_set_id` | `str \| None` | The set being evaluated |
| `status` | `str \| None` | `"running"`, `"evaluating"`, `"completed"`, `"failed"` |
| `is_terminal` | `bool` | `True` when `status` is `"completed"` or `"failed"` |
| `candidate_models` | `list[str]` | The model ids (aliases) that were evaluated |
| `error_detail` | `str \| None` | Failure message when `status == "failed"` |
| `cost_micro_usd` | `int` | Raw total cost in **micro-USD** |
| `cost` | `Decimal` | Billed total in **dollars, floored to cents**. See [money](#money-cost-vs-cost_micro_usd) |
| `results` | `list[EvalResult]` | Per-model aggregates (populated once terminal) |

```python
run = pa.evals.runs.create(
    task="contract-key-fields",
    items=[{"input": "...", "expected": {"effective_date": "2026-01-01"}}],
    models=["qwen-1", "mistral-2"],   # open candidates (per-task aliases)
    frontier="benchmarked",           # baselines on this task's leaderboard
    wait=True,                        # block until terminal
)

if run.status == "failed":
    print("eval failed:", run.error_detail)
else:
    for r in sorted(run.results, key=lambda r: r.quality_mean or 0, reverse=True):
        print(r.model_id, r.kind, r.quality_mean, r.mean_cost_micro_usd, f"n={r.n_succeeded}")
    print("billed:", run.cost, "| raw µUSD:", run.cost_micro_usd)
```

Eval compute is metered against your org balance (both the open candidates and
any frontier baselines). An empty balance raises `InsufficientCreditsError` (402);
top up in the browser, since the SDK never exposes balance or payment.

### EvalResult

One element of `run.results`: a single model's aggregate over the run.

| Property | Type | Notes |
|---|---|---|
| `model_id` | `str \| None` | The model evaluated. Open models appear as **per-task aliases** |
| `kind` | `str \| None` | `"open"` or `"frontier"` |
| `quality_mean` | `float \| None` | Mean score in `[0, 1]` |
| `quality_ci_low` | `float \| None` | 95% CI lower bound |
| `quality_ci_high` | `float \| None` | 95% CI upper bound |
| `mean_cost_micro_usd` | `int \| None` | Average per-item cost in **micro-USD** (not floored) |
| `n_succeeded` | `int \| None` | Rows that scored without error |
| `error_count` | `int \| None` | Rows that errored |

The point of a result row is the comparison: read `quality_mean` against the
confidence interval to know whether a cheaper open model genuinely matches the
frontier on your data, and `mean_cost_micro_usd` to see what each call costs.

```python
def fmt(x):                                 # None-safe: score fields may be unset
    return "n/a" if x is None else f"{x:.3f}"

for r in run.results:
    # Open model whose 95% CI lower bound clears a 0.9 quality bar.
    clears_bar = (
        r.kind == "open"
        and r.quality_ci_low is not None
        and r.quality_ci_low >= 0.9
    )
    print(r.model_id, r.kind, fmt(r.quality_mean),
          f"[{fmt(r.quality_ci_low)}, {fmt(r.quality_ci_high)}]",
          r.mean_cost_micro_usd, "<-- candidate" if clears_bar else "")
```
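
One way to turn that comparison into a decision rule is sketched below. It is an assumption, not Pareta's own logic: an open model "matches" when its CI lower bound is within a chosen `tolerance` of the best frontier mean, and it must also be cheaper per item. `Row` is a hypothetical stand-in for `EvalResult`, and the sample numbers (including the `frontier-x` id) are invented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Row:
    """Hypothetical stand-in carrying the EvalResult fields used here."""
    model_id: str
    kind: str
    quality_mean: Optional[float]
    quality_ci_low: Optional[float]
    quality_ci_high: Optional[float]
    mean_cost_micro_usd: Optional[int]


def cheaper_matches(rows: List[Row], tolerance: float = 0.0) -> List[Row]:
    """Open rows that are cheaper than the best frontier row and whose CI
    lower bound is at least (frontier mean - tolerance). Cheapest first."""
    frontier = [r for r in rows
                if r.kind == "frontier" and r.quality_mean is not None
                and r.mean_cost_micro_usd is not None]
    if not frontier:
        return []  # nothing to compare against
    best = max(frontier, key=lambda r: r.quality_mean)
    bar = best.quality_mean - tolerance
    picked = [r for r in rows
              if r.kind == "open"
              and r.quality_ci_low is not None and r.quality_ci_low >= bar
              and r.mean_cost_micro_usd is not None
              and r.mean_cost_micro_usd < best.mean_cost_micro_usd]
    return sorted(picked, key=lambda r: r.mean_cost_micro_usd)


rows = [
    Row("frontier-x", "frontier", 0.93, 0.90, 0.96, 9000),
    Row("qwen-1", "open", 0.94, 0.92, 0.96, 850),
    Row("mistral-2", "open", 0.85, 0.80, 0.90, 400),
]
print([r.model_id for r in cheaper_matches(rows, tolerance=0.02)])  # ['qwen-1']
```

Using the lower bound rather than the mean is deliberately conservative: a model with a high mean but a wide interval is not accepted on a small eval set.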

## Money: `.cost` vs `.cost_micro_usd`

Money on these objects follows one convention (SDK_PLAN §6): the **billed total
is floored to whole cents**, while sub-cent unit rates stay in micro-USD. The SDK
floors rather than rounds, so it never overstates a charge.

Three fields, two representations:

- `run.cost` is a `Decimal` in **dollars, floored to cents**. A 5 µUSD run reads
  `Decimal("0.00")`; a 420,715 µUSD run reads `Decimal("0.42")`. This is what the
  org is billed.
- `run.cost_micro_usd` is the **raw integer** in micro-USD. `1_000_000` = `$1.00`.
  Use it when you need the exact charge below cent precision.
- Per-item and per-request **unit rates** stay in micro-USD on purpose:
  `result.mean_cost_micro_usd` and `entry.cost_per_request_micro_usd`. Flooring a
  fraction-of-a-cent unit rate to whole cents would collapse it to zero and erase
  the open-vs-frontier comparison that is the whole reason you ran the eval.

```python
from decimal import Decimal

print(run.cost)                       # 0.42 (a Decimal): billed dollars, floored
print(run.cost_micro_usd)             # 420715 — raw micro-USD
assert run.cost == Decimal("0.42")

# Convert any micro-USD unit rate to dollars yourself when you want to display it:
mean = run.results[0].mean_cost_micro_usd        # e.g. 850 µUSD per item
print(f"${mean / 1_000_000:.6f} per item")       # $0.000850 per item
```
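
The flooring convention itself can be reproduced with the standard `decimal` module. This is a minimal sketch of the arithmetic described above, not the SDK's source; `billed_dollars` is a hypothetical name.

```python
from decimal import Decimal, ROUND_FLOOR

MICRO_PER_DOLLAR = 1_000_000


def billed_dollars(cost_micro_usd: int) -> Decimal:
    """Convert raw micro-USD to dollars, floored (never rounded) to whole cents."""
    dollars = Decimal(cost_micro_usd) / Decimal(MICRO_PER_DOLLAR)
    return dollars.quantize(Decimal("0.01"), rounding=ROUND_FLOOR)


print(billed_dollars(5))        # 0.00
print(billed_dollars(420_715))  # 0.42
print(billed_dollars(999_999))  # 0.99
```

The last case shows why flooring matters: rounding would report `1.00` for a run that cost just under a dollar.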

Both inference and evals debit the org balance on success; an empty balance
raises `InsufficientCreditsError`. The SDK only ever consumes credit and surfaces
the 402; topping up is browser-only.

## See also

- [Running inference](../guide/inference.md) — `ChatCompletion`, streaming chunks, and the async iterator form
- [Deploying endpoints](../guide/deploying-endpoints.md) — the `Endpoint` lifecycle and deploy progress events
- [Finding the right model](../guide/discovery.md) — `Task`, `TaskMatch`, and `Leaderboard` in depth
- [Evaluating models](../guide/evaluation.md) — building `EvalSet`s, running evals, and reading `EvalRun` cost
- [Core concepts](../guide/core-concepts.md) — aliases, hidden hardware, and metering, end to end



---

<!-- reference/http-api.md -->

# Underlying HTTP API

The Pareta Python SDK is a thin, typed wrapper over a plain JSON-over-HTTPS API
served at `https://api.pareta.ai` under the `/v1/` prefix. Almost every method
maps to exactly one route; a couple of ergonomic helpers fan out to two or
three. This page is the lookup table: for each SDK method, the HTTP method,
path, request shape, and response shape it wraps.

Reach for it when you are debugging a request in a proxy log, calling Pareta from
a language without an SDK, or you just want to know what goes over the wire.
Everywhere else, prefer the SDK: it handles auth, retries, SSE parsing, the cost
flooring convention, and the per-task model aliasing for you.

A few platform truths shape every route below:

- **GPUs are hidden.** `POST /v1/endpoints` takes a `task` and a `model`; it
  never takes a GPU, tensor-parallel, or quantization knob. Pareta resolves the
  serving hardware server-side.
- **Models are per-task aliases.** Open-weights model ids are masked to
  per-task public aliases on the way out. Real ids never cross this boundary.
  Frontier (vendor) ids are in the clear.
- **Inference and evals are metered against your org balance.**
  `POST /v1/chat/completions` debits on success; `POST /v1/eval-runs` debits for
  the open and frontier compute it runs. An empty balance returns `402`. Top-up
  is browser-only; there is no balance or payment route.
- **Inference is OpenAI-compatible.** `/v1/chat/completions` and `/v1/models`
  speak the OpenAI wire format, so existing OpenAI clients point at Pareta by
  swapping the base URL and key.

## Base URL and versioning

| | |
|---|---|
| Base URL | `https://api.pareta.ai` (override with `PARETA_BASE_URL`) |
| Prefix | `/v1/` |
| Content type | `application/json` (JSON bodies); `multipart/form-data` for uploads |
| Streaming | `text/event-stream` (chat streaming, deploy progress) |

The SDK normalizes the base URL with `rstrip("/")`, so a trailing slash is
harmless.
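
If you are building URLs by hand, the same normalization is a one-liner. A minimal sketch, assuming a hypothetical `build_url` helper (the SDK's internal function name is not documented here):

```python
def build_url(base_url: str, path: str) -> str:
    """Join a base URL and a route path without doubling the slash."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


print(build_url("https://api.pareta.ai/", "/v1/models"))
# https://api.pareta.ai/v1/models
```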

## Authentication

Every request carries a bearer token in the `Authorization` header. The token is
your `pareta_sk_…` secret key, minted in the dashboard.

```
Authorization: Bearer pareta_sk_…
User-Agent: pareta-python/<version>
Accept: application/json            # or text/event-stream for streaming routes
Content-Type: application/json      # JSON bodies only; multipart sets its own
```

The SDK reads the key from the `api_key=` argument or the `PARETA_API_KEY`
environment variable. Prefer `Pareta.from_env()`, which reads both
`PARETA_API_KEY` and the optional `PARETA_BASE_URL`:

```python
from pareta import Pareta

# Reads PARETA_API_KEY (+ optional PARETA_BASE_URL) from the environment.
with Pareta.from_env() as pa:
    print([m.id for m in pa.models.list()])
```

A raw `curl` against the same route:

```bash
curl https://api.pareta.ai/v1/models \
  -H "Authorization: Bearer $PARETA_API_KEY"
```

Constructing a client with no key raises `ParetaError` before any request goes
out. A key that reaches the server and is rejected returns `401`
(`AuthenticationError`). See [Errors, retries & timeouts](../guide/errors-and-retries.md).

## Route map at a glance

| SDK call | Method | Path |
|---|---|---|
| `chat.completions.create(...)` | `POST` | `/v1/chat/completions` |
| `models.list()` | `GET` | `/v1/models` |
| `endpoints.deploy(...)` | `POST` | `/v1/endpoints` (SSE) |
| `endpoints.list()` | `GET` | `/v1/endpoints` |
| `endpoints.retrieve(id)` | `GET` | `/v1/endpoints/{id}` |
| `endpoints.start(id)` | `POST` | `/v1/endpoints/{id}/start` |
| `endpoints.stop(id)` | `POST` | `/v1/endpoints/{id}/stop` |
| `endpoints.delete(id)` | `DELETE` | `/v1/endpoints/{id}` |
| `endpoints.metrics(id).performance(...)` | `GET` | `/v1/endpoints/{id}/performance` |
| `endpoints.metrics(id).uptime(...)` | `GET` | `/v1/endpoints/{id}/uptime` |
| `endpoints.metrics(id).cost(...)` | `GET` | `/v1/endpoints/{id}/cost` |
| `endpoints.metrics(id).quality(...)` | `GET` | `/v1/endpoints/{id}/quality` |
| `endpoints.metrics(id).activity(...)` | `GET` | `/v1/endpoints/{id}/activity` |
| `tasks.list()` | `GET` | `/v1/tasks` |
| `tasks.retrieve(id)` | `GET` | `/v1/tasks/{id}` |
| `tasks.match(query)` | `POST` | `/v1/tasks/match` |
| `tasks.leaderboard(id)` / `tasks.recommended(id)` | `GET` | `/v1/tasks/{id}/leaderboard` |
| `evals.frontier_models(task)` | `GET` | `/v1/eval/frontier-models` |
| `evals.sets.create(...)` | `POST` | `/v1/eval-sets` |
| `evals.sets.list()` | `GET` | `/v1/eval-sets` |
| `evals.sets.retrieve(id)` | `GET` | `/v1/eval-sets/{id}` |
| `evals.sets.delete(id)` | `DELETE` | `/v1/eval-sets/{id}` |
| `evals.sets.upload_document(...)` | `POST` | `/v1/eval-sets/{id}/attach-blob` (small) or `/blob-upload-url` + `PUT` + `/blob-upload-complete` (large) |
| `evals.runs.create(...)` | `POST` | `/v1/eval-runs` |
| `evals.runs.retrieve(id)` / `evals.runs.wait(id)` | `GET` | `/v1/eval-runs/{id}` |

## Inference: chat completions

### `POST /v1/chat/completions`

OpenAI-compatible chat completions. Wrapped by
[`chat.completions.create()`](../guide/inference.md). Metered: a successful
completion debits the org balance, and an empty balance returns `402`
(`InsufficientCreditsError`).

`model` is an endpoint id from a deploy (or any model id your org can reach).
Extra OpenAI fields (`temperature`, `max_tokens`, `top_p`, ...) pass straight
through as body fields.

Request body:

```json
{
  "model": "ep_contract_kie",
  "messages": [{"role": "user", "content": "Extract the parties."}],
  "temperature": 0.0
}
```

```python
from pareta import Pareta

with Pareta.from_env() as pa:
    resp = pa.chat.completions.create(
        model="ep_contract_kie",
        messages=[{"role": "user", "content": "Extract the parties."}],
        temperature=0.0,
    )
    print(resp.choices[0].message.content)   # ChatCompletion -> Choice -> Message
    print(resp.usage.total_tokens)           # Usage
```

The same request as `curl`:

```bash
curl https://api.pareta.ai/v1/chat/completions \
  -H "Authorization: Bearer $PARETA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ep_contract_kie",
    "messages": [{"role": "user", "content": "Extract the parties."}]
  }'
```

#### Streaming

Set `"stream": true`. The response is a data-only SSE stream in vLLM format:
each `data:` line is one JSON chunk, and the stream ends with `data: [DONE]`.

```
data: {"choices": [{"delta": {"content": "The"}}]}
data: {"choices": [{"delta": {"content": " parties"}}]}
data: [DONE]
```

The SDK yields `ChatCompletionChunk` objects;
`chunk.choices[0].delta.content` is the incremental text.

```python
from pareta import Pareta

with Pareta.from_env() as pa:
    for chunk in pa.chat.completions.create(
        model="ep_contract_kie",
        messages=[{"role": "user", "content": "Summarize the contract."}],
        stream=True,
    ):
        piece = chunk.choices[0].delta.content
        if piece:
            print(piece, end="", flush=True)
```

Retries cover only the initial handshake. Once SSE bytes are flowing, a
mid-stream drop raises `APIConnectionError` immediately and cannot be resumed.
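
If you are consuming the stream without the SDK, the data-only framing can be parsed in a few lines. The sketch below works on an iterable of already-decoded text lines; `iter_chat_chunks` and `collect_text` are hypothetical helper names, and the sample lines mirror the wire example above.

```python
import json
from typing import Iterable, Iterator


def iter_chat_chunks(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON chunk per `data:` line; stop at `data: [DONE]`."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and comment/keep-alive lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)


def collect_text(lines: Iterable[str]) -> str:
    """Concatenate the incremental `delta.content` pieces into one string."""
    parts = []
    for chunk in iter_chat_chunks(lines):
        choices = chunk.get("choices") or []
        if not choices:
            continue
        piece = (choices[0].get("delta") or {}).get("content")
        if piece:
            parts.append(piece)
    return "".join(parts)


sample = [
    'data: {"choices": [{"delta": {"content": "The"}}]}',
    'data: {"choices": [{"delta": {"content": " parties"}}]}',
    "data: [DONE]",
]
print(collect_text(sample))  # The parties
```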

### `GET /v1/models`

OpenAI-compatible model listing. Wrapped by `models.list()`. Returns only
deployed, url-bearing endpoints (the OpenAI-compatible subset), shaped as
`{"data": [{"id", "owned_by", "created"}, ...]}`. Each `id` is usable as
`chat.completions.create(model=...)`.

```python
from pareta import Pareta

with Pareta.from_env() as pa:
    models = pa.models.list()          # ModelList (iterable, has len)
    for m in models:
        print(m.id, m.owned_by)        # Model
```

## Endpoints

### `POST /v1/endpoints` (SSE)

Deploy a model for a task. Wrapped by
[`endpoints.deploy()`](../guide/deploying-endpoints.md). No hardware knob:
the body is `{task, model, ...}` and Pareta resolves the serving class. `model`
defaults to `"recommended"` (the task's curated or leaderboard-top open pick);
you may also pass a per-task alias or a real id.

Request body:

```json
{"task": "contract-key-fields", "model": "recommended"}
```

The response is a **named-event** SSE stream (distinct from the chat stream's
data-only format):

```
event: progress
data: {"stage": "pulling weights", "pct": 45}

event: complete
data: {"endpoint": {"id": "ep_...", "status": "live", "url": "https://..."}}

event: error
data: {"message": "out of memory"}
```

With `wait=False` (default) the SDK yields `{"event": str, "data": dict}` dicts
so you can drive a progress bar. With `wait=True` it consumes the stream
internally and returns the live `Endpoint` on the `complete` event, raising
`ParetaError` on `error`.

```python
from pareta import Pareta

with Pareta.from_env() as pa:
    # Stream progress yourself:
    for ev in pa.endpoints.deploy(task="contract-key-fields"):
        if ev["event"] == "progress":
            print(ev["data"])

    # Or block until live:
    ep = pa.endpoints.deploy(task="contract-key-fields", model="recommended", wait=True)
    print(ep.id, ep.is_live, ep.url)   # Endpoint
```
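
For a client without the SDK, the named-event framing needs a slightly different parser from the chat stream, because the event name and the payload arrive on separate lines and a blank line closes each event. A minimal sketch follows; the helper names are hypothetical, `RuntimeError` stands in for the SDK's `ParetaError`, and the `ep_demo` id and URL are placeholders.

```python
import json
from typing import Iterable, Iterator


def iter_named_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield {"event": name, "data": parsed_json} for each SSE event block."""
    event, data_lines = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
        elif line == "":  # a blank line ends the current event
            if data_lines:
                yield {"event": event, "data": json.loads("\n".join(data_lines))}
            event, data_lines = "message", []
    if data_lines:  # stream ended without a trailing blank line
        yield {"event": event, "data": json.loads("\n".join(data_lines))}


def wait_until_live(events: Iterable[dict]) -> dict:
    """Mimic wait=True: return the endpoint on `complete`, raise on `error`."""
    for ev in events:
        if ev["event"] == "complete":
            return ev["data"]["endpoint"]
        if ev["event"] == "error":
            raise RuntimeError(ev["data"].get("message", "deploy failed"))
    raise RuntimeError("stream ended before a complete event")


sample = [
    "event: progress",
    'data: {"stage": "pulling weights", "pct": 45}',
    "",
    "event: complete",
    'data: {"endpoint": {"id": "ep_demo", "status": "live", "url": "https://example.invalid"}}',
    "",
]
endpoint = wait_until_live(iter_named_events(sample))
print(endpoint["id"], endpoint["status"])  # ep_demo live
```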

Extra deploy parameters (`cost_per_request_micro_usd`,
`frontier_cost_per_request_micro_usd`, `region`, `provider`, `quality`,
`run_mode`, `taskDisplay`) pass through as body fields when present.

### `GET /v1/endpoints`

List every endpoint your org can access. Wrapped by `endpoints.list()`. Returns
a bare JSON array of endpoint records; the SDK maps each to an `Endpoint`. The
`model` field on each is the per-task public alias.

```python
from pareta import Pareta

with Pareta.from_env() as pa:
    for ep in pa.endpoints.list():
        print(ep.id, ep.status, ep.model)   # Endpoint
```

### `GET /v1/endpoints/{endpoint_id}`

Retrieve one endpoint. Wrapped by `endpoints.retrieve(endpoint_id)`. Returns the
endpoint record as an `Endpoint`. A wrong id returns `404` (`NotFoundError`).

```python
from pareta import Pareta

with Pareta.from_env() as pa:
    ep = pa.endpoints.retrieve("ep_contract_kie")
    print(ep.is_live)   # status == "live"
```

### `POST /v1/endpoints/{endpoint_id}/start` and `/stop`

Start a stopped endpoint, or stop a live one. Wrapped by
`endpoints.start(endpoint_id)` and `endpoints.stop(endpoint_id)`. Both take only
the endpoint id (no GPU knob) and return the raw JSON status body.

```python
from pareta import Pareta

with Pareta.from_env() as pa:
    pa.endpoints.start("ep_contract_kie")   # warm a cold endpoint
    pa.endpoints.stop("ep_contract_kie")    # scale to zero
```

### `DELETE /v1/endpoints/{endpoint_id}`

Remove an endpoint. Wrapped by `endpoints.delete(endpoint_id)`, which returns
`None`.

```python
from pareta import Pareta

with Pareta.from_env() as pa:
    pa.endpoints.delete("ep_contract_kie")
```

### Endpoint metrics

Five read-only dimensions hang off `endpoints.metrics(endpoint_id)`. Each method
issues a `GET` and returns the raw metric JSON (typed models are forthcoming).
All accept arbitrary query params via `**params`, which become the query string.

| SDK call | Method | Path | What it returns |
|---|---|---|---|
| `.performance(**params)` | `GET` | `/v1/endpoints/{id}/performance` | p50/p95/p99 latency |
| `.uptime(**params)` | `GET` | `/v1/endpoints/{id}/uptime` | availability metrics |
| `.cost(**params)` | `GET` | `/v1/endpoints/{id}/cost` | per-endpoint spend + vs-frontier savings |
| `.quality(**params)` | `GET` | `/v1/endpoints/{id}/quality` | judge windows |
| `.activity(**params)` | `GET` | `/v1/endpoints/{id}/activity` | usage stats |

```python
from pareta import Pareta

with Pareta.from_env() as pa:
    m = pa.endpoints.metrics("ep_contract_kie")
    print(m.performance())          # GET /v1/endpoints/ep_contract_kie/performance
    print(m.cost(window="7d"))      # ?window=7d
```

## Tasks (benchmark catalog)

### `GET /v1/tasks`

List the benchmark catalog. Wrapped by `tasks.list()`. The server returns
`{"tasks": [...]}`; the SDK maps each to a `Task` (`id`, `default_scorer`,
`has_blob_input`).

```python
from pareta import Pareta

with Pareta.from_env() as pa:
    for t in pa.tasks.list():
        print(t.id, t.default_scorer, t.has_blob_input)
```

### `GET /v1/tasks/{task_id}`

Retrieve one task's schema and default scorer. Wrapped by
`tasks.retrieve(task_id, examples_n=None)`. The optional `examples_n` query param
requests N example items when available.

```python
from pareta import Pareta

with Pareta.from_env() as pa:
    task = pa.tasks.retrieve("contract-key-fields", examples_n=3)
    print(task.id, task.has_blob_input)
```

### `POST /v1/tasks/match`

Map free-text intent to ranked candidate tasks. Wrapped by
`tasks.match(query, top_k=5)`. The matcher is a deterministic keyword scorer with
an optional semantic backstop. An empty `query` raises `ValueError` client-side.

Request body:

```json
{"query": "pull key fields out of vendor contracts", "top_k": 5}
```

```python
from pareta import Pareta

with Pareta.from_env() as pa:
    match = pa.tasks.match("pull key fields out of vendor contracts")
    if match.matched and match.chosen:
        print(match.chosen.task_id, match.chosen.confidence)
    for c in match.candidates:        # ranked alternates
        print(c.task_id, c.score)
```

### `GET /v1/tasks/{task_id}/leaderboard`

Models ranked by quality and cost for a task, plus the `recommended` alias and a
`frontier` baseline entry. Wrapped by two sync methods:

- `tasks.leaderboard(task_id)` returns the full `Leaderboard`.
- `tasks.recommended(task_id)` is a convenience that returns
  `leaderboard(task_id).recommended` (the deployable model id to pass to
  `endpoints.deploy(model=...)`).

Leaderboard rows carry `cost_per_request_micro_usd` as raw micro-USD (not floored
to cents). Open-model rows are aliases; the `frontier` baseline is a vendor id.

```python
from pareta import Pareta

with Pareta.from_env() as pa:
    lb = pa.tasks.leaderboard("contract-key-fields")
    print(lb.recommended, lb.metric, lb.cost_unit)
    for entry in lb.models:           # LeaderboardEntry
        print(entry.name, entry.kind, entry.quality, entry.cost_per_request_micro_usd)

    best = pa.tasks.recommended("contract-key-fields")
    ep = pa.endpoints.deploy(task="contract-key-fields", model=best, wait=True)
```

> `tasks.leaderboard()` and `tasks.recommended()` exist on the sync client only;
> the async `AsyncTasks` has `list`, `retrieve`, and `match`.

## Evals

### `GET /v1/eval/frontier-models`

The vendor frontier roster you can evaluate against. Wrapped by
`evals.frontier_models(task=None)`. The server returns
`{"frontier_models": [...]}`; the SDK maps each to a `FrontierModel`
(`id`, `vendor`, `vision`, `benchmarked`). Pass `task` to populate `benchmarked`
(whether the model is on that task's leaderboard) and to filter the roster to
vision-capable models for document tasks. Feed the ids into
`evals.runs.create(frontier=[...])`.

```python
from pareta import Pareta

with Pareta.from_env() as pa:
    roster = pa.evals.frontier_models(task="contract-key-fields")
    for fm in roster:
        print(fm.id, fm.vendor, fm.vision, fm.benchmarked)
```

### `POST /v1/eval-sets`

Create an eval set from your rows. Wrapped by
[`evals.sets.create(task=..., items=...)`](../guide/evaluation.md). The rows go
over the wire as **JSONL** inside a `multipart/form-data` body (`items` file part
plus `task_id` and `name` form fields), not as a JSON array. An empty `items`
raises `ValueError`. The server returns `{"eval_set": {...}}`; the SDK maps it to
an `EvalSet` (`id`, `task_id`, `name`, `item_count`, `scoring_strategy`).

```python
from pareta import Pareta

with Pareta.from_env() as pa:
    eval_set = pa.evals.sets.create(
        task="contract-key-fields",
        items=[
            {"input": "Agreement between A and B...", "expected": {"parties": ["A", "B"]}},
            {"input": "This SOW is by C for D...",     "expected": {"parties": ["C", "D"]}},
        ],
    )
    print(eval_set.id, eval_set.item_count, eval_set.scoring_strategy)
```
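
If you are posting to this route without the SDK, the `items` part is JSON Lines: one JSON object per line. The sketch below shows only the serialization step. `items_to_jsonl` is a hypothetical helper, and the trailing newline, part filename, and part content type are assumptions, since the reference above does not specify them.

```python
import json
from typing import Any, Mapping, Sequence


def items_to_jsonl(items: Sequence[Mapping[str, Any]]) -> bytes:
    """Serialize rows as UTF-8 JSON Lines, one object per line."""
    if not items:
        raise ValueError("items must not be empty")  # mirrors the client-side check
    lines = [json.dumps(row, ensure_ascii=False) for row in items]
    return ("\n".join(lines) + "\n").encode("utf-8")


rows = [
    {"input": "Agreement between A and B...", "expected": {"parties": ["A", "B"]}},
    {"input": "This SOW is by C for D...", "expected": {"parties": ["C", "D"]}},
]
payload = items_to_jsonl(rows)                      # goes in the `items` file part
form_fields = {"task_id": "contract-key-fields",    # sent as plain form fields
               "name": "my contracts v1"}
print(payload.decode("utf-8"), end="")
```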

### `GET /v1/eval-sets` and `GET /v1/eval-sets/{eval_set_id}`

List your eval sets, or retrieve one. Wrapped by `evals.sets.list()` (server
returns `{"eval_sets": [...]}`) and `evals.sets.retrieve(eval_set_id)` (server
returns `{"eval_set": {...}}`). Both map to `EvalSet`.

```python
from pareta import Pareta

with Pareta.from_env() as pa:
    for es in pa.evals.sets.list():
        print(es.id, es.name, es.item_count)
    one = pa.evals.sets.retrieve("evset_123")
```

### `DELETE /v1/eval-sets/{eval_set_id}`

Delete an eval set. Wrapped by `evals.sets.delete(eval_set_id)`, which returns
`None`.

```python
from pareta import Pareta

with Pareta.from_env() as pa:
    pa.evals.sets.delete("evset_123")
```

### Uploading documents to a row (3 routes)

For document/image tasks, attach a binary blob to one row's input field. The SDK
collapses two upload paths into a single
`evals.sets.upload_document(eval_set_id, file, *, idx, field_name, mime=None)`
call. `file` may be a path, raw `bytes`, or a binary file-like; anything else
raises `TypeError`. `idx` is the 0-based row, `field_name` the blob input field.

The SDK picks the path by size:

- **Files under 5 MiB** go inline through
  `POST /v1/eval-sets/{id}/attach-blob` (`multipart/form-data`: the `file` part
  plus `idx`, `field_name`, `mime` form fields).
- **Larger files** use the signed-URL flow: mint a URL with
  `POST /v1/eval-sets/{id}/blob-upload-url`, `PUT` the bytes directly to storage
  (GCS), then confirm with `POST /v1/eval-sets/{id}/blob-upload-complete`.

Either way the method returns the response dict from the terminal call.

```python
from pareta import Pareta

with Pareta.from_env() as pa:
    eval_set = pa.evals.sets.create(
        task="document-extraction",
        items=[{"expected": {"invoice_total": "1240.00"}}],
    )
    # Attach the PDF that row 0's blob field expects.
    pa.evals.sets.upload_document(
        eval_set.id, "invoice.pdf", idx=0, field_name="document"
    )
```
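
The size-based routing described above can be sketched as a pure function that returns the calls a client would make. This is an illustration under assumptions: `blob_size` and `plan_upload` are hypothetical names, the signed storage URL is shown as a placeholder, and a file of exactly 5 MiB is treated as "large" because the text only says files *under* 5 MiB go inline.

```python
import io
import os
from typing import List, Tuple

INLINE_LIMIT_BYTES = 5 * 1024 * 1024  # 5 MiB


def blob_size(file) -> int:
    """Size in bytes of a path, raw bytes, or seekable binary file-like."""
    if isinstance(file, (bytes, bytearray)):
        return len(file)
    if isinstance(file, (str, os.PathLike)):
        return os.path.getsize(file)
    if hasattr(file, "seek") and hasattr(file, "tell"):
        start = file.tell()
        file.seek(0, os.SEEK_END)
        size = file.tell()
        file.seek(start)  # restore the caller's position
        return size
    raise TypeError(f"unsupported file type: {type(file).__name__}")


def plan_upload(eval_set_id: str, size: int) -> List[Tuple[str, str]]:
    """Return the (method, target) sequence for an upload of `size` bytes."""
    base = f"/v1/eval-sets/{eval_set_id}"
    if size < INLINE_LIMIT_BYTES:
        return [("POST", f"{base}/attach-blob")]
    return [
        ("POST", f"{base}/blob-upload-url"),       # mint the signed URL
        ("PUT", "<signed storage URL>"),           # bytes go straight to storage
        ("POST", f"{base}/blob-upload-complete"),  # confirm the upload
    ]


print(plan_upload("evset_123", blob_size(io.BytesIO(b"small pdf bytes"))))
```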

### `POST /v1/eval-runs`

Start an eval run. Wrapped by
[`evals.runs.create(...)`](../guide/evaluation.md). Pass either an existing
`eval_set=<id>` or an inline `task=...` + `items=...` (which the SDK turns into an
eval set first). `models` is the list of open-candidate aliases to evaluate;
`frontier` adds vendor baselines.

The SDK resolves `frontier` to a list of ids before sending, then posts
`{"eval_set_id": ..., "candidate_model_ids": [...open..., ...frontier...]}`:

| `frontier=` value | Resolves to |
|---|---|
| `None` or `"none"` | `[]` (no baselines) |
| list of ids | the list, as-is |
| `"all"` | every id from `GET /v1/eval/frontier-models?task=...` |
| `"benchmarked"` | frontier models on the task's leaderboard |

A keyword (`"all"` / `"benchmarked"`) needs the task; if you passed `eval_set=`
only, the SDK looks up its `task_id` to resolve the roster, and raises
`ValueError` if the task is unknown. Metered: the org balance is debited for open
and frontier compute, and an empty balance returns `402`.
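
The resolution table above can be sketched as a small function. This is a hedged illustration, not the SDK's source: `resolve_frontier` and `build_run_body` are hypothetical names, `fetch_roster` stands in for the `GET /v1/eval/frontier-models?task=...` call, filtering on the `benchmarked` flag is an assumption about how `"benchmarked"` is resolved, and the `vendor-a` / `vendor-b` ids are invented.

```python
from typing import Callable, List, Optional, Sequence, Union

Roster = List[dict]


def resolve_frontier(
    frontier: Union[None, str, Sequence[str]],
    task: Optional[str],
    fetch_roster: Callable[[str], Roster],
) -> List[str]:
    """Turn the `frontier=` argument into a concrete list of vendor ids."""
    if frontier is None or frontier == "none":
        return []
    if isinstance(frontier, str):
        if frontier not in ("all", "benchmarked"):
            raise ValueError(f"unknown frontier keyword: {frontier!r}")
        if not task:
            raise ValueError("a frontier keyword needs a known task")
        roster = fetch_roster(task)
        if frontier == "benchmarked":
            roster = [m for m in roster if m.get("benchmarked")]
        return [m["id"] for m in roster]
    return list(frontier)  # an explicit list passes through as-is


def build_run_body(eval_set_id: str, models: Sequence[str], frontier_ids: Sequence[str]) -> dict:
    """Open aliases first, then frontier ids, in one candidate list."""
    return {"eval_set_id": eval_set_id,
            "candidate_model_ids": [*models, *frontier_ids]}


def fake_roster(task: str) -> Roster:
    return [{"id": "vendor-a", "benchmarked": True},
            {"id": "vendor-b", "benchmarked": False}]


ids = resolve_frontier("benchmarked", "contract-key-fields", fake_roster)
print(build_run_body("evset_123", ["contract-1", "contract-2"], ids))
```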

The server responds with `{"run_id": ..., "status": ...}`. With `wait=False`
the SDK returns an `EvalRun` in its initial (running/queued) state. With
`wait=True` it polls `GET /v1/eval-runs/{run_id}` every `poll_interval` seconds
(default 3.0) until terminal, up to `timeout` seconds (default 900.0), then
returns the final `EvalRun`; exceeding the deadline raises `ParetaError` while the
run keeps going server-side.

```python
from pareta import Pareta

with Pareta.from_env() as pa:
    run = pa.evals.runs.create(
        task="contract-key-fields",
        items=[{"input": "Agreement between A and B...", "expected": {"parties": ["A", "B"]}}],
        models=["contract-1", "contract-2"],   # open-model aliases
        frontier="benchmarked",                 # vendor baselines on the leaderboard
        wait=True,
    )
    print(run.status, run.cost)                 # e.g. completed 0.42
    for r in run.results:                       # EvalResult per model
        print(r.model_id, r.kind, r.quality_mean, r.mean_cost_micro_usd)
```
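
The `wait=True` behavior (poll until terminal, give up at a deadline) can be approximated by a client that only has `GET /v1/eval-runs/{run_id}`. A minimal sketch, with assumptions: `wait_for_run` is a hypothetical helper, `fetch_run` stands in for the HTTP call and returns a dict with a `status` key, and the built-in `TimeoutError` stands in for the SDK's `ParetaError`. The `sleep` and `clock` parameters exist only so the loop can be exercised without real delays.

```python
import time
from typing import Callable

TERMINAL = {"completed", "failed"}


def wait_for_run(
    fetch_run: Callable[[], dict],
    poll_interval: float = 3.0,
    timeout: float = 900.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll `fetch_run` until the run is terminal or the deadline passes."""
    deadline = clock() + timeout
    while True:
        run = fetch_run()
        if run["status"] in TERMINAL:
            return run
        if clock() >= deadline:
            # The run itself keeps going server-side; only the wait gives up.
            raise TimeoutError(f"run not terminal after {timeout} seconds")
        sleep(poll_interval)


# Simulated status sequence, with sleeping disabled for the demo.
statuses = iter(["running", "evaluating", "completed"])
final = wait_for_run(lambda: {"status": next(statuses)}, sleep=lambda s: None)
print(final["status"])  # completed
```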

### `GET /v1/eval-runs/{run_id}`

Retrieve full run state, including per-model results once terminal. Wrapped by
`evals.runs.retrieve(run_id)` and the `evals.runs.wait(run_id, ...)` poll helper
(same semantics as `create(..., wait=True)`). The server returns an envelope
`{"run": {...}, "results": [...]}` that the SDK maps to an `EvalRun`.

`EvalRun.cost` is the billed total as `Decimal` dollars **floored to cents**
(never rounded up), while `EvalRun.cost_micro_usd` keeps the raw integer
micro-USD value. A 5 micro-USD run reads `Decimal("0.00")`. Per-item unit rates
such as `EvalResult.mean_cost_micro_usd` stay in micro-USD so the open-vs-frontier
comparison is not erased by flooring.

```python
from pareta import Pareta

with Pareta.from_env() as pa:
    run = pa.evals.runs.retrieve("run_456")
    if run.is_terminal:                         # status in ("completed", "failed")
        print(run.cost, run.cost_micro_usd)
        if run.status == "failed":
            print(run.error_detail)
    else:
        run = pa.evals.runs.wait("run_456", poll_interval=5.0, timeout=600.0)
```

## Status codes

The server is built on FastAPI, so error bodies have the shape
`{"detail": "<message>"}` alongside a standard HTTP status. The SDK maps each
status to a specific `ParetaError` subclass, so you can catch errors by meaning
rather than by status code.

| Status | Exception | When |
|---|---|---|
| 400, 422 | `BadRequestError` | request validation failed |
| 401 | `AuthenticationError` | invalid or missing API key |
| 402 | `InsufficientCreditsError` | org out of balance (top up in the dashboard) |
| 403 | `PermissionDeniedError` | authenticated, not allowed |
| 404 | `NotFoundError` | endpoint / eval set / run / task id not found |
| 409 | `ConflictError` | seed endpoint, transient lock/contention |
| 429 | `RateLimitError` | rate limited |
| 503 | `EndpointNotReadyError` | endpoint stopped, cold, or provider down |
| other 5xx | `APIStatusError` | generic server error |
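
As a rough illustration of how a status and a FastAPI-style body turn into an exception name and message, here is a minimal standalone sketch. It is not the SDK's internals: `classify_error` and `STATUS_TO_ERROR` are hypothetical names, and falling back to `APIStatusError` for every unlisted status is an assumption (the table only states it for other 5xx).

```python
import json

# Status -> exception name, copied from the table above.
STATUS_TO_ERROR = {
    400: "BadRequestError",
    422: "BadRequestError",
    401: "AuthenticationError",
    402: "InsufficientCreditsError",
    403: "PermissionDeniedError",
    404: "NotFoundError",
    409: "ConflictError",
    429: "RateLimitError",
    503: "EndpointNotReadyError",
}


def classify_error(status_code: int, body: bytes) -> tuple[str, str]:
    """Return (exception name, detail message) for an error response."""
    # Assumption: anything not listed falls back to the generic class.
    name = STATUS_TO_ERROR.get(status_code, "APIStatusError")
    try:
        payload = json.loads(body)
        detail = payload.get("detail", "") if isinstance(payload, dict) else ""
    except ValueError:
        detail = ""  # Non-JSON body (for example from a proxy): no detail.
    return name, str(detail)


print(classify_error(402, b'{"detail": "insufficient credits"}'))
```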

Each `APIStatusError` exposes `status_code`, `detail`, `request_id` (from the
`x-request-id` response header), and the underlying `response`. The SDK
automatically retries responses with status `408`, `409`, `429`, `500`, `502`,
`503`, or `504`, using exponential backoff that honors `Retry-After`. See
[Errors, retries & timeouts](../guide/errors-and-retries.md) for the full
treatment.

## Async over the same routes

`AsyncPareta` hits the identical routes with awaitable methods. Streaming routes
return async iterators; `evals.runs.wait()` is a coroutine. The wire format,
auth, status mapping, and retry policy are the same.

```python
import asyncio
from pareta import AsyncPareta

async def main():
    async with AsyncPareta.from_env() as pa:
        models = await pa.models.list()                 # GET /v1/models
        async for chunk in await pa.chat.completions.create(  # POST /v1/chat/completions
            model="ep_contract_kie",
            messages=[{"role": "user", "content": "Extract the parties."}],
            stream=True,
        ):
            piece = chunk.choices[0].delta.content
            if piece:
                print(piece, end="", flush=True)

asyncio.run(main())
```

## See also

- [Inference](../guide/inference.md) — OpenAI-compatible chat completions and streaming
- [Deploying endpoints](../guide/deploying-endpoints.md) — `deploy`, `start`/`stop`, and `is_live`
- [Evaluation](../guide/evaluation.md) — eval sets, runs, `wait`, and `run.cost`
- [Discovery](../guide/discovery.md) — the benchmark catalog, `match()`, and leaderboards
- [Errors, retries & timeouts](../guide/errors-and-retries.md) — the full exception hierarchy
- [Async](../guide/async.md) — the `AsyncPareta` client end to end
- [Configuration](../guide/configuration.md) — base URL, keys, timeout, and retry budget

