Metadata-Version: 2.4
Name: flexinference
Version: 1.3.0
Summary: Official Python SDK for FlexInference - a deadline-aware, OpenAI-compatible inference router.
Project-URL: Homepage, https://flexinference.com
Project-URL: Documentation, https://flexinference.mintlify.app
Author-email: Aditya Perswal <adityaperswal@gmail.com>
License: MIT
License-File: LICENSE
Keywords: ai,flexinference,gpt,inference,llm,openai
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Typing :: Typed
Requires-Python: >=3.12
Requires-Dist: httpx>=0.27
Requires-Dist: pydantic>=2
Requires-Dist: typing-extensions>=4.12
Provides-Extra: dev
Requires-Dist: mypy>=1.13; extra == 'dev'
Requires-Dist: pytest>=8; extra == 'dev'
Requires-Dist: ruff>=0.7; extra == 'dev'
Description-Content-Type: text/markdown

# FlexInference (Python)

The official Python SDK for [FlexInference](https://flexinference.com) - a deadline-aware inference router across **OpenAI, Google Gemini, and Anthropic**. Send the OpenAI requests you already send, bring your own provider key, and set one required field - `start_within` - to trade latency for cost. Four caller formats are supported: `responses`, `chat.completions`, `interactions` (Gemini shape), and `messages` (Anthropic shape) - any of them reaches any provider.

```bash
pip install flexinference
```

## Quickstart

```python
from flexinference import FlexInference, output_text

client = FlexInference(api_key="flex_live_...")

res = client.responses.create({
    "model": "gpt-5.5",
    "input": "Write a haiku about cheap GPUs.",
    "start_within": "00h-00m-30s",
})

print(output_text(res))
```

Responses come back as the **raw OpenAI JSON** (we never reshape the body), so there is no `output_text` field on the wire - OpenAI's own SDKs compute that field client-side. `output_text(res)` pulls the assistant's text out of either a response or a chat completion for you.
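
If you're curious what that extraction involves, here is a minimal sketch over the raw JSON shapes. It is not the SDK's implementation: `extract_text` is a hypothetical name, and the sample bodies are trimmed to the fields the sketch reads.

```python
from typing import Any


def extract_text(body: dict[str, Any]) -> str:
    """Collect assistant text from a raw Responses or Chat Completions body."""
    # Chat Completions shape: choices[0].message.content
    if "choices" in body:
        choices = body["choices"]
        if not choices:
            return ""
        return choices[0].get("message", {}).get("content") or ""

    # Responses shape: output[] -> items of type "message" -> "output_text" parts
    parts: list[str] = []
    for item in body.get("output", []):
        if item.get("type") != "message":
            continue  # skip reasoning items, tool calls, and so on
        for part in item.get("content", []):
            if part.get("type") == "output_text":
                parts.append(part.get("text", ""))
    return "".join(parts)


# Hand-written sample bodies (not captured from a live request).
response_body = {
    "output": [
        {"type": "reasoning", "summary": []},
        {
            "type": "message",
            "role": "assistant",
            "content": [{"type": "output_text", "text": "Hello"}],
        },
    ]
}
chat_body = {"choices": [{"message": {"role": "assistant", "content": "Hi there"}}]}

print(extract_text(response_body))  # prints: Hello
print(extract_text(chat_body))      # prints: Hi there
```

In real code, prefer the SDK's `output_text(res)`; the sketch only shows why the text is not a top-level field.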

`start_within` is **required** on every request. It takes `"default"`, `"priority"`, `"auto"`, or a duration `"HHh-MMm-SSs"` between 5s and 10m. A duration races the flex tier on a flex-capable model and falls back to the standard tier if the request can't start in time; `"default"`, `"priority"`, and `"auto"` map to those OpenAI service tiers and proxy any model. See the [docs](https://flexinference.mintlify.app/deadline-routing).
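
If your deadlines live in code as plain seconds, a small helper can build the duration string. This is a sketch, not part of the SDK: `format_start_within` is a hypothetical name, and it assumes the 5s and 10m bounds are inclusive.

```python
def format_start_within(seconds: int) -> str:
    """Format a deadline in seconds as an "HHh-MMm-SSs" duration string."""
    # Assumption: the documented 5s-10m range includes both endpoints.
    if not 5 <= seconds <= 600:
        raise ValueError("start_within duration must be between 5s and 10m")
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}h-{minutes:02d}m-{secs:02d}s"


print(format_start_within(30))   # prints: 00h-00m-30s
print(format_start_within(90))   # prints: 00h-01m-30s
print(format_start_within(600))  # prints: 00h-10m-00s
```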

## Providers (OpenAI, Gemini, and Anthropic)

FlexInference routes to **OpenAI**, **Google Gemini**, and **Anthropic**. Send the same OpenAI-shaped request and pass whichever model id you want - `gpt-5.5`, `o4-mini`, `gemini-3.5-flash`, `claude-opus-4-8`, and so on. We translate Gemini and Anthropic to and from the OpenAI shape, so your code is identical for all three.

- **OpenAI:** `default` (standard tier), `priority`, `auto`, and the flex race (a duration) on flex-capable models.
- **Gemini:** `default` maps to Gemini's **standard** tier, plus `priority` and the flex race on the Gemini flex models (`gemini-3.5-flash`, `gemini-3.1-flash-lite`, `gemini-3.1-pro-preview`, `gemini-3-flash-preview`, `gemini-2.5-pro`, `gemini-2.5-flash`, `gemini-2.5-flash-lite`). Gemini has no `auto` tier, so `start_within="auto"` on a Gemini model returns `400`.
- **Anthropic (Claude):** proxy-only. `default`, `priority`, and `auto` work; there is **no flex race**, so a duration `start_within` on a `claude-*` model returns `400 flex_unsupported_for_anthropic`. Anthropic requires a token cap, so set `max_output_tokens` (`max_completion_tokens` on Chat, `max_tokens` on Messages) or you get `400 missing_max_tokens`. You keep the unified API and tier control, and draw down your own Anthropic credits.

Add the provider key you'll use (OpenAI, Gemini, and/or Anthropic) in the [dashboard](https://www.flexinference.com/dashboard). Text, streaming, structured outputs, function calling, image input, and web search work across providers (send a Responses `web_search` tool; we map it to Gemini's `google_search`).

Don't send `service_tier` - the router controls the tier from `start_within` and rejects a caller-supplied `service_tier` with `400 service_tier_not_allowed`.
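
The provider rules above can be sketched as a local pre-flight check. Treat this as an illustration under assumptions rather than the router's logic: `predict_rejection` is a hypothetical helper, matching Gemini models by a `gemini-` prefix is a guess, the order of checks is arbitrary, and the Gemini `auto` case returns a plain description because no error code is documented for it.

```python
import re
from typing import Any, Optional

# Matches the "HHh-MMm-SSs" duration form of start_within.
_DURATION = re.compile(r"^\d{2}h-\d{2}m-\d{2}s$")
_TOKEN_CAPS = ("max_output_tokens", "max_completion_tokens", "max_tokens")


def predict_rejection(body: dict[str, Any]) -> Optional[str]:
    """Return the documented 400 a request would likely hit, or None."""
    model = str(body.get("model", ""))
    start_within = str(body.get("start_within", ""))
    is_duration = _DURATION.match(start_within) is not None

    if "service_tier" in body:
        return "service_tier_not_allowed"
    if model.startswith("claude-"):
        if is_duration:
            return "flex_unsupported_for_anthropic"
        if not any(key in body for key in _TOKEN_CAPS):
            return "missing_max_tokens"
    # Assumption: Gemini model ids start with "gemini-".
    if model.startswith("gemini-") and start_within == "auto":
        return "400 (Gemini has no auto tier)"
    return None


print(predict_rejection(
    {"model": "claude-opus-4-8", "input": "hi", "start_within": "00h-00m-30s"}
))  # prints: flex_unsupported_for_anthropic
```

A check like this is useful in tests or request builders; the router remains the source of truth for what is accepted.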

## Streaming

```python
stream = client.responses.create(
    {"model": "gpt-5-nano", "input": "Count to ten.", "start_within": "00h-00m-20s"},
    stream=True,
)
for event in stream:
    if event.get("type") == "response.output_text.delta":
        print(event["delta"], end="")
```
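
If you want the full text once the stream ends rather than printing as you go, you can accumulate the deltas. A minimal sketch: `collect_text` is a hypothetical helper, and the event list is a hand-written stand-in for a live stream so the example runs without a network call.

```python
from typing import Any, Iterable


def collect_text(events: Iterable[dict[str, Any]]) -> str:
    """Join every output-text delta from a stream of event dicts."""
    chunks: list[str] = []
    for event in events:
        if event.get("type") == "response.output_text.delta":
            chunks.append(event["delta"])
    return "".join(chunks)


# Stand-in events; a real stream yields many more event types.
fake_stream = [
    {"type": "response.created"},
    {"type": "response.output_text.delta", "delta": "1, 2, "},
    {"type": "response.output_text.delta", "delta": "3"},
    {"type": "response.completed"},
]
print(collect_text(fake_stream))  # prints: 1, 2, 3
```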

## Chat Completions

```python
res = client.chat.completions.create({
    "model": "gpt-5.5",
    "messages": [{"role": "user", "content": "Hello!"}],
    "start_within": "default",
})
print(res["choices"][0]["message"]["content"])
```

## Interactions (Gemini shape)

Speak Google's Interactions shape and reach any model. `interaction_output_text(res)` pulls the assistant text out of the interaction's `steps`.

```python
from flexinference import interaction_output_text

res = client.interactions.create({
    "model": "gemini-3.5-flash",
    "input": "Summarize this contract.",
    "start_within": "00h-01m-00s",
})
print(interaction_output_text(res))
```

## Messages (Anthropic shape)

Speak Anthropic's Messages shape and reach any model. `max_tokens` is required (Anthropic requires it). `message_output_text(res)` pulls the assistant text out of the message `content`.

```python
from flexinference import message_output_text

res = client.messages.create({
    "model": "claude-opus-4-8",
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": "Summarize this contract."}],
    "start_within": "default",
})
print(message_output_text(res))
```

## Closing the client

The client holds a pooled `httpx.Client`, so close it when you're done to release connections. Use it as a context manager:

```python
with FlexInference(api_key="flex_live_...") as client:
    res = client.responses.create({"model": "gpt-5.5", "input": "Hi.", "start_within": "default"})
    print(output_text(res))
# connections are released on exit
```

Or close it yourself:

```python
client = FlexInference(api_key="flex_live_...")
try:
    ...
finally:
    client.close()
```

## Request validation

Before a request leaves your machine, the SDK validates the parts it owns. `start_within` is **required** and must be `"default"`, `"priority"`, `"auto"`, or a duration `"HHh-MMm-SSs"` between 5s and 10m; `model` and `input`/`messages` must be present. A missing or bad value raises a `ValueError` locally, so you don't pay for a round trip that ends in a provider 400:

```python
client.responses.create({"model": "gpt-5.5", "input": "hi"})
# ValueError: Invalid request body:
#   Missing required parameter: `start_within`. Set it to "default", "priority", "auto", or a duration "HHh-MMm-SSs".
```

Validation is request-only. Unknown fields pass straight through to the provider (so new OpenAI parameters keep working), and responses are never validated or reshaped.
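
To make those rules concrete, here is a rough sketch of the same checks as a standalone function. It is not the SDK's validator: `find_problems` and its messages are hypothetical, and it assumes the 5s and 10m bounds are inclusive.

```python
import re
from typing import Any

_DURATION = re.compile(r"^(\d{2})h-(\d{2})m-(\d{2})s$")
_TIERS = ("default", "priority", "auto")


def find_problems(body: dict[str, Any]) -> list[str]:
    """List rule violations in a request body; unknown fields are ignored."""
    problems: list[str] = []
    if not body.get("model"):
        problems.append("missing `model`")
    if "input" not in body and "messages" not in body:
        problems.append("missing `input` or `messages`")

    start_within = body.get("start_within")
    if start_within is None:
        problems.append("missing `start_within`")
    elif start_within not in _TIERS:
        match = _DURATION.match(str(start_within))
        if match is None:
            problems.append("`start_within` is not a tier or an HHh-MMm-SSs duration")
        else:
            hours, minutes, seconds = (int(group) for group in match.groups())
            total = hours * 3600 + minutes * 60 + seconds
            # Assumption: both ends of the 5s-10m range are allowed.
            if not 5 <= total <= 600:
                problems.append("`start_within` duration must be between 5s and 10m")
    return problems


print(find_problems({"model": "gpt-5.5", "input": "hi"}))
# prints: ['missing `start_within`']
```

Note that the extra key in `{"model": ..., "input": ..., "start_within": ..., "some_new_param": 1}` produces no problem, which mirrors the pass-through behavior described above.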

## Errors

Non-2xx responses raise `FlexInferenceError`, carrying the OpenAI-shaped `status`, `type`, `code`, and `param`:

```python
from flexinference import FlexInferenceError

try:
    client.responses.create({"model": "gpt-5.5", "input": "hi", "start_within": "priority"})
except FlexInferenceError as err:
    if err.code == "no_byok_key":
        print("Add your OpenAI key in the dashboard.")
    else:
        raise
```
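
For logging, it can help to flatten those four attributes into one line. A small sketch: `describe_error` is a hypothetical helper, it reads the attributes with `getattr` so it makes no further assumptions about the error class, and the sample object is a stand-in with made-up values rather than a real `FlexInferenceError`.

```python
from types import SimpleNamespace


def describe_error(err: object) -> str:
    """Format the status/type/code/param attributes that are set on err."""
    fields = []
    for name in ("status", "type", "code", "param"):
        value = getattr(err, name, None)
        if value is not None:
            fields.append(f"{name}={value}")
    return ", ".join(fields) or "no error details"


# Stand-in object carrying the same attribute names as the SDK error.
fake_err = SimpleNamespace(
    status=400, type="invalid_request_error", code="missing_max_tokens", param=None
)
print(describe_error(fake_err))
# prints: status=400, type=invalid_request_error, code=missing_max_tokens
```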

## Billing / 402

If your account's billing is past due, the router pauses **billable flex** and returns
`402 Payment Required` on those requests; free routing keeps working. The SDK raises a
typed `PaymentRequiredError` (a subclass of `FlexInferenceError`) for HTTP 402, so you
can catch it on its own and prompt the user to update payment while letting other errors
propagate:

```python
from flexinference import FlexInferenceError, PaymentRequiredError

try:
    client.responses.create({"model": "gpt-5.5", "input": "hi", "start_within": "00h-00m-30s"})
except PaymentRequiredError:
    print("Billing is past due - update payment in the dashboard to resume flex.")
except FlexInferenceError:
    raise
```

Because `PaymentRequiredError` subclasses `FlexInferenceError`, existing
`except FlexInferenceError` handlers keep catching 402s too.
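
One consequence of the subclass relationship is that handler order matters: the more specific `except` must come first, or the general one swallows the 402. The sketch below demonstrates this with stand-in classes (`StandInFlexError` and `StandInPaymentError` are hypothetical and only mirror the hierarchy described above) so it runs without the SDK.

```python
class StandInFlexError(Exception):
    """Stand-in for the SDK's base error."""


class StandInPaymentError(StandInFlexError):
    """Stand-in for the SDK's 402 error, a subclass of the base error."""


def classify(exc: Exception) -> str:
    """Route an exception the way ordered except clauses would."""
    try:
        raise exc
    except StandInPaymentError:
        # Specific handler first: only billing errors land here.
        return "billing"
    except StandInFlexError:
        # General handler second: every other SDK error lands here.
        return "other"


print(classify(StandInPaymentError()))  # prints: billing
print(classify(StandInFlexError()))     # prints: other
```

If the two `except` clauses were swapped, both calls would return `"other"`, because the base-class handler also matches the subclass.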

## Configuration

| Argument | Default | Description |
| --- | --- | --- |
| `api_key` | (required) | Your `flex_live_` key. |
| `base_url` | `https://api.flexinference.com/v1` | Override the router endpoint. |
| `client` | `httpx.Client` (600s read, 10s connect) | Provide your own `httpx.Client`. |

## License

MIT
