Metadata-Version: 2.4
Name: flexinference
Version: 1.4.1
Summary: Official Python SDK for FlexInference - a deadline-aware, OpenAI-compatible inference router.
Project-URL: Homepage, https://flexinference.com
Project-URL: Documentation, https://flexinference.mintlify.app
Author-email: Aditya Perswal <adityaperswal@gmail.com>
License: MIT
License-File: LICENSE
Keywords: ai,flexinference,gpt,inference,llm,openai
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Typing :: Typed
Requires-Python: >=3.12
Requires-Dist: httpx>=0.27
Requires-Dist: pydantic>=2
Requires-Dist: typing-extensions>=4.12
Provides-Extra: dev
Requires-Dist: mypy>=1.13; extra == 'dev'
Requires-Dist: pytest>=8; extra == 'dev'
Requires-Dist: ruff>=0.7; extra == 'dev'
Description-Content-Type: text/markdown

# FlexInference (Python)

The official Python SDK for [FlexInference](https://flexinference.com). FlexInference is an inference router that works with **OpenAI, Google Gemini, and Anthropic**. Keep sending the OpenAI-shaped requests you already send, bring your own provider key, and set one required field, `start_within`: how long you will wait for the request to start. We try a cheaper tier first to lower your bill, and if it cannot start in time we fall back to your standard tier so the request still runs. The SDK speaks four caller formats: `responses`, `chat.completions`, `interactions` (Gemini shape), and `messages` (Anthropic shape). Any of them reaches any provider.

```bash
pip install flexinference
```

## Quickstart

```python
from flexinference import FlexInference, output_text

client = FlexInference(api_key="flex_live_...")

res = client.responses.create({
    "model": "gpt-5.5",
    "input": "Write a haiku about cheap GPUs.",
    "start_within": "00h-00m-30s",
})

print(output_text(res))
```

Responses come back as the **raw OpenAI JSON**; we never reshape the body. That means there is no `output_text` field on the wire: OpenAI's own SDKs compute that field client-side, and the provider never sends it. `output_text(res)` pulls the assistant's text out of either a response or a chat completion for you.
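To make the raw shape concrete, here is a minimal sketch of what pulling text out of those bodies can look like. `extract_text` is a hypothetical helper written for illustration, not the SDK's `output_text` implementation; it assumes the usual OpenAI layouts (`output[].content[]` parts of type `output_text` for Responses, `choices[0].message.content` for Chat Completions).

```python
from typing import Any


def extract_text(res: dict[str, Any]) -> str:
    """Collect assistant text from a raw Responses or Chat Completions body.

    Illustrative only: the SDK's own output_text() may handle more cases.
    """
    # Chat Completions shape: choices[0].message.content
    if "choices" in res:
        choices = res["choices"]
        if not choices:
            return ""
        content = choices[0].get("message", {}).get("content")
        return content if isinstance(content, str) else ""

    # Responses shape: output[] -> message items -> content[] -> output_text parts
    parts: list[str] = []
    for item in res.get("output", []):
        if item.get("type") != "message":
            continue  # skip reasoning items, tool calls, and so on
        for part in item.get("content", []):
            if part.get("type") == "output_text":
                parts.append(part.get("text", ""))
    return "".join(parts)
```

In real code, prefer the SDK's `output_text(res)`; the sketch only shows why a helper is needed when the body is passed through untouched.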

`start_within` is **required** on every request. Set it to `"default"`, `"priority"`, `"auto"`, or a duration written as `"HHh-MMm-SSs"` from 5s to 10m. A duration is how long you will wait for the request to start running. On a flex-capable model we try the provider's cheaper flex tier first, and that is where your savings come from. If flex cannot start inside your window, we switch to your normal standard tier so the request still completes. The words `"default"`, `"priority"`, and `"auto"` map straight to the service tiers of the same name and work on any model whose provider offers that tier (Gemini, for example, has no `auto` tier). See the [docs](https://flexinference.mintlify.app/deadline-routing).

This fallback is your safety net. You never lose a request just because the cheap tier was busy. Your standard tier always finishes the job. It runs the same model either way, so you trade a little waiting for a lower bill and keep the result you would have gotten anyway.

## Providers (OpenAI, Gemini, and Anthropic)

FlexInference routes to **OpenAI**, **Google Gemini**, and **Anthropic**. Send the same OpenAI-shaped request and pass whichever model id you want, such as `gpt-5.5`, `o4-mini`, `gemini-3.5-flash`, or `claude-opus-4-8`. We translate Gemini and Anthropic to and from the OpenAI shape, so your code is identical for all three.

- **OpenAI:** `default` (standard tier), `priority`, `auto`, and the flex race (a duration) on flex-capable models.
- **Gemini:** `default` maps to Gemini's **standard** tier, plus `priority` and the flex race on the Gemini flex models (`gemini-3.5-flash`, `gemini-3.1-flash-lite`, `gemini-3.1-pro-preview`, `gemini-3-flash-preview`, `gemini-2.5-pro`, `gemini-2.5-flash`, `gemini-2.5-flash-lite`). Gemini has no `auto` tier, so `start_within="auto"` on a Gemini model returns `400`.
- **Anthropic (Claude):** proxy-only. `default`, `priority`, and `auto` work; there is **no flex race**, so a duration `start_within` on a `claude-*` model returns `400 flex_unsupported_for_anthropic`. Anthropic requires a token cap, so set `max_output_tokens` (`max_completion_tokens` on Chat, `max_tokens` on Messages) or you get `400 missing_max_tokens`. You keep the unified API and tier control, and draw down your own Anthropic credits.

Add the provider key you'll use (OpenAI, Gemini, and/or Anthropic) in the [dashboard](https://www.flexinference.com/dashboard). Text, streaming, structured outputs, function calling, image input, and web search work across providers (send a Responses `web_search` tool; we map it to Gemini's `google_search`).

Don't send `service_tier`. The router picks the tier from `start_within`, so a request that sets its own `service_tier` fails fast with `400 service_tier_not_allowed`.

## Streaming

```python
stream = client.responses.create(
    {"model": "gpt-5-nano", "input": "Count to ten.", "start_within": "00h-00m-20s"},
    stream=True,
)
for event in stream:
    if event.get("type") == "response.output_text.delta":
        print(event["delta"], end="")
```
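If you want the full text once the stream ends rather than printing each delta, you can accumulate them. A minimal sketch, assuming events are plain dicts as in the loop above; `collect_text` is a hypothetical helper, not an SDK function.

```python
from collections.abc import Iterable
from typing import Any


def collect_text(events: Iterable[dict[str, Any]]) -> str:
    """Join every output-text delta in a stream into one string."""
    chunks: list[str] = []
    for event in events:
        # Ignore lifecycle events; only text deltas carry assistant output.
        if event.get("type") == "response.output_text.delta":
            chunks.append(event["delta"])
    return "".join(chunks)
```

Passing the `stream` from the example above to `collect_text(stream)` would consume it and return the concatenated text.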

## Chat Completions

```python
res = client.chat.completions.create({
    "model": "gpt-5.5",
    "messages": [{"role": "user", "content": "Hello!"}],
    "start_within": "default",
})
print(res["choices"][0]["message"]["content"])
```

## Interactions (Gemini shape)

Speak Google's Interactions shape and reach any model. `interaction_output_text(res)` pulls the assistant text out of the interaction's `steps`.

```python
from flexinference import interaction_output_text

res = client.interactions.create({
    "model": "gemini-3.5-flash",
    "input": "Summarize this contract.",
    "start_within": "00h-01m-00s",
})
print(interaction_output_text(res))
```

## Messages (Anthropic shape)

Speak Anthropic's Messages shape and reach any model. `max_tokens` is required, as it is in Anthropic's own API. `message_output_text(res)` pulls the assistant text out of the message `content`.

```python
from flexinference import message_output_text

res = client.messages.create({
    "model": "claude-opus-4-8",
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": "Summarize this contract."}],
    "start_within": "default",
})
print(message_output_text(res))
```

## Closing the client

The client holds a pooled `httpx.Client`, so close it when you're done to release connections. Use it as a context manager:

```python
with FlexInference(api_key="flex_live_...") as client:
    res = client.responses.create({"model": "gpt-5.5", "input": "Hi.", "start_within": "default"})
    print(output_text(res))
# connections are released on exit
```

Or close it yourself:

```python
client = FlexInference(api_key="flex_live_...")
try:
    ...
finally:
    client.close()
```

## Request validation

Before a request leaves your machine, the SDK validates the parts it owns. `start_within` is **required** and must be `"default"`, `"priority"`, `"auto"`, or a duration `"HHh-MMm-SSs"` between 5s and 10m; `model` and `input`/`messages` must be present. A missing or bad value raises a `ValueError` locally, so you skip a round trip that would only end in a provider `400`:

```python
client.responses.create({"model": "gpt-5.5", "input": "hi"})
# ValueError: Invalid request body:
#   Missing required parameter: `start_within`. Set it to "default", "priority", "auto", or a duration "HHh-MMm-SSs".
```

Validation is request-only. Unknown fields pass straight through to the provider (so new OpenAI parameters keep working), and responses are never validated or reshaped.
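The rules above can be approximated in a few lines. This is an illustrative sketch rather than the SDK's validator: `validation_problems` and its messages are hypothetical, and the duration bounds are assumed inclusive. Note that unknown fields are never inspected, which mirrors the pass-through behavior.

```python
import re
from typing import Any

NAMED_TIERS = {"default", "priority", "auto"}
DURATION = re.compile(r"^(\d{2})h-(\d{2})m-(\d{2})s$")


def validation_problems(body: dict[str, Any]) -> list[str]:
    """List problems with the fields the SDK owns; an empty list means it looks valid."""
    problems: list[str] = []
    if "model" not in body:
        problems.append("missing `model`")
    if "input" not in body and "messages" not in body:
        problems.append("missing `input` or `messages`")

    value = body.get("start_within")
    if value is None:
        problems.append("missing `start_within`")
    elif not isinstance(value, str):
        problems.append("`start_within` must be a string")
    elif value not in NAMED_TIERS:
        match = DURATION.match(value)
        if match is None:
            problems.append("`start_within` is not a named tier or HHh-MMm-SSs duration")
        else:
            hours, minutes, seconds = (int(group) for group in match.groups())
            total = hours * 3600 + minutes * 60 + seconds
            if not 5 <= total <= 600:
                problems.append("`start_within` duration is outside 5s to 10m")
    return problems
```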

## Errors

Non-2xx responses raise `FlexInferenceError`, carrying the OpenAI-shaped `status`, `type`, `code`, and `param`:

```python
from flexinference import FlexInferenceError

try:
    client.responses.create({"model": "gpt-5.5", "input": "hi", "start_within": "priority"})
except FlexInferenceError as err:
    if err.code == "no_byok_key":
        print("Add your OpenAI key in the dashboard.")
    else:
        raise
```

Every FlexInference error tells you the same four things. It says what went wrong, why it went wrong, how to fix it, and it shows an example of a request that works. This is built for agents as much as for people. An agent can read the message and correct the call instead of guessing and burning tokens. When the error comes from the provider instead of from us, we pass it straight through with its status and body intact, so you always see the real cause.

## Billing / 402

Standard routing is always free. Flex routing is the part you pay for, and you only pay a share of the money it saves you. If your billing is past due, the router pauses flex and returns `402 Payment Required` on those flex requests, and your free standard routing keeps working. The SDK raises a typed `PaymentRequiredError` (a subclass of `FlexInferenceError`) for HTTP 402, so you can catch it on its own and prompt the user to update payment while letting other errors propagate:

```python
from flexinference import FlexInferenceError, PaymentRequiredError

try:
    client.responses.create({"model": "gpt-5.5", "input": "hi", "start_within": "00h-00m-30s"})
except PaymentRequiredError:
    print("Billing is past due - update payment in the dashboard to resume flex.")
except FlexInferenceError:
    raise
```

Because `PaymentRequiredError` subclasses `FlexInferenceError`, existing `except FlexInferenceError` handlers keep catching 402s too.
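Since standard routing keeps working while flex is paused, one option is to retry a rejected flex request on the standard tier. The sketch below is an assumption about how you might wire that up, not SDK behavior: `create_with_standard_fallback` is a hypothetical helper that takes the create function and the error class as arguments, so it has no dependency on the SDK itself.

```python
from collections.abc import Callable
from typing import Any


def create_with_standard_fallback(
    create: Callable[[dict[str, Any]], Any],
    body: dict[str, Any],
    payment_error: type[Exception],
) -> Any:
    """Try the request as given; on a payment error, retry once on the standard tier."""
    try:
        return create(body)
    except payment_error:
        # Flex is paused for billing. Retry with start_within="default",
        # which this README maps to the standard tier.
        return create({**body, "start_within": "default"})
```

With the SDK this might be called as `create_with_standard_fallback(client.responses.create, body, PaymentRequiredError)`. Whether a silent fallback is right depends on your application; you may prefer to surface the billing problem to the user as well.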

## Configuration

| Argument | Default | Description |
| --- | --- | --- |
| `api_key` | (required) | Your `flex_live_` key. |
| `base_url` | `https://api.flexinference.com/v1` | Override the router endpoint. |
| `client` | `httpx.Client` (600s read, 10s connect) | Provide your own `httpx.Client`. |

## License

MIT
