Metadata-Version: 2.4
Name: flexinference
Version: 1.1.0
Summary: Official Python SDK for FlexInference - a deadline-aware, OpenAI-compatible inference router.
Project-URL: Homepage, https://flexinference.com
Project-URL: Documentation, https://flexinference.mintlify.app
Author-email: Aditya Perswal <adityaperswal@gmail.com>
License: MIT
License-File: LICENSE
Keywords: ai,flexinference,gpt,inference,llm,openai
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Typing :: Typed
Requires-Python: >=3.12
Requires-Dist: httpx>=0.27
Requires-Dist: pydantic>=2
Requires-Dist: typing-extensions>=4.12
Provides-Extra: dev
Requires-Dist: mypy>=1.13; extra == 'dev'
Requires-Dist: pytest>=8; extra == 'dev'
Requires-Dist: ruff>=0.7; extra == 'dev'
Description-Content-Type: text/markdown

# FlexInference (Python)

The official Python SDK for [FlexInference](https://flexinference.com) - a deadline-aware, OpenAI-compatible inference router. Send the OpenAI requests you already send, bring your own OpenAI key, and set one required field - `start_within` - to trade latency for cost.

```bash
pip install flexinference
```

## Quickstart

```python
from flexinference import FlexInference, output_text

client = FlexInference(api_key="flex_live_...")

res = client.responses.create({
    "model": "gpt-5.5",
    "input": "Write a haiku about cheap GPUs.",
    "start_within": "00h-00m-30s",
})

print(output_text(res))
```

Responses come back as the **raw OpenAI JSON** (we never reshape the body), so there is no `output_text` field on the wire - OpenAI's own SDKs compute that property client-side. `output_text(res)` pulls the assistant's text out of either a response or a chat completion for you.

`start_within` is **required** on every request. It takes `"default"`, `"priority"`, `"auto"`, or a duration `"HHh-MMm-SSs"` between 5s and 10m. A duration races OpenAI's flex tier on a flex-capable model and falls back to the standard tier if the request can't start in time; `"default"`, `"priority"`, and `"auto"` map to the OpenAI service tiers of the same name and are proxied for any model. See the [docs](https://flexinference.mintlify.app/deadline-routing).
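If you track deadlines as a number of seconds, a small helper can build the duration string for you. This is a minimal sketch, not part of the SDK: `format_start_within` is a hypothetical name, and the zero-padded two-digit fields are assumed from the examples in this README.

```python
def format_start_within(seconds: int) -> str:
    """Format whole seconds as an "HHh-MMm-SSs" duration string (hypothetical helper)."""
    # The 5s-10m bounds come from this README; rejecting out-of-range
    # values here is this sketch's choice.
    if not 5 <= seconds <= 600:
        raise ValueError("start_within durations must be between 5s and 10m")
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    # Two-digit zero padding is assumed from examples such as "00h-00m-30s".
    return f"{hours:02d}h-{minutes:02d}m-{secs:02d}s"


print(format_start_within(30))  # 00h-00m-30s
print(format_start_within(90))  # 00h-01m-30s
```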

## Providers (OpenAI and Gemini)

FlexInference routes to **OpenAI** and **Google Gemini**. Send the same OpenAI-shaped request and pass whichever model id you want - `gpt-5.5`, `o4-mini`, `gemini-3.5-flash`, and so on. We run Gemini through its Interactions API and translate it back to the OpenAI shape, so your code is identical for both.

- **OpenAI:** `default` (standard tier), `priority`, `auto`, and the flex race (a duration) on flex-capable models.
- **Gemini:** `default` (Gemini's **standard** tier), `priority`, and the flex race (a duration) on the Gemini flex models (`gemini-3.5-flash`, `gemini-3.1-flash-lite`, `gemini-3.1-pro-preview`, `gemini-3-flash-preview`, `gemini-2.5-pro`, `gemini-2.5-flash`, `gemini-2.5-flash-lite`). Gemini has no `auto` tier, so `start_within="auto"` on a Gemini model returns `400`.

Add the provider key you'll use (OpenAI and/or Gemini) in the [dashboard](https://www.flexinference.com/dashboard). Text, streaming, structured outputs, function calling, image input, and web search work on both providers (send a Responses `web_search` tool; we map it to Gemini's `google_search`).

Don't send `service_tier` - the router controls the tier from `start_within` and rejects a caller-supplied `service_tier` with `400 service_tier_not_allowed`.
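If you are adapting request bodies that were written for OpenAI directly, one way to follow this rule is to strip `service_tier` and add `start_within` before sending. The sketch below is illustrative only; `to_flex_request` is a hypothetical helper, not an SDK function.

```python
from typing import Any


def to_flex_request(openai_body: dict[str, Any], start_within: str) -> dict[str, Any]:
    """Return a copy of an OpenAI request body adapted for the router (hypothetical helper)."""
    # Drop service_tier: per this README the router rejects it with
    # 400 service_tier_not_allowed.
    body = {key: value for key, value in openai_body.items() if key != "service_tier"}
    # The router picks the tier from start_within instead.
    body["start_within"] = start_within
    return body


legacy = {"model": "gpt-5.5", "input": "Hi.", "service_tier": "flex"}
print(to_flex_request(legacy, "00h-00m-30s"))
```

The helper copies the body rather than mutating it, so the original dict can still be sent to OpenAI unchanged.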

## Streaming

```python
stream = client.responses.create(
    {"model": "gpt-5-nano", "input": "Count to ten.", "start_within": "00h-00m-20s"},
    stream=True,
)
for event in stream:
    if event.get("type") == "response.output_text.delta":
        print(event["delta"], end="")
```
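If you want the full text rather than printing deltas as they arrive, you can accumulate them. This is a minimal sketch that assumes events are dict-like, as in the loop above; `collect_output_text` is a hypothetical helper, and the sample events are hand-written stand-ins rather than real router output.

```python
from collections.abc import Iterable, Mapping
from typing import Any


def collect_output_text(events: Iterable[Mapping[str, Any]]) -> str:
    """Join the text deltas from a stream of Responses events (hypothetical helper)."""
    parts: list[str] = []
    for event in events:
        # Only text delta events carry output text; other event types are skipped.
        if event.get("type") == "response.output_text.delta":
            parts.append(event["delta"])
    return "".join(parts)


# Hand-written stand-in events, shaped like the ones the loop above reads.
sample_events = [
    {"type": "response.created"},
    {"type": "response.output_text.delta", "delta": "1, 2"},
    {"type": "response.output_text.delta", "delta": ", 3"},
    {"type": "response.completed"},
]
print(collect_output_text(sample_events))  # 1, 2, 3
```

With a live stream you would pass the object returned by `client.responses.create(..., stream=True)` in place of `sample_events`.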

## Chat Completions

```python
res = client.chat.completions.create({
    "model": "gpt-5.5",
    "messages": [{"role": "user", "content": "Hello!"}],
    "start_within": "default",
})
print(res["choices"][0]["message"]["content"])
```

## Closing the client

The client holds a pooled `httpx.Client`, so close it when you're done to release connections. Use it as a context manager:

```python
with FlexInference(api_key="flex_live_...") as client:
    res = client.responses.create({"model": "gpt-5.5", "input": "Hi.", "start_within": "default"})
    print(output_text(res))
# connections are released on exit
```

Or close it yourself:

```python
client = FlexInference(api_key="flex_live_...")
try:
    ...
finally:
    client.close()
```

## Request validation

Before a request leaves your machine, the SDK validates the parts it owns. `start_within` is **required** and must be `"default"`, `"priority"`, `"auto"`, or a duration `"HHh-MMm-SSs"` between 5s and 10m; `model` and `input`/`messages` must be present. A missing or invalid value raises a `ValueError` locally, so you don't spend a round trip just to get a provider `400` back:

```python
client.responses.create({"model": "gpt-5.5", "input": "hi"})
# ValueError: Invalid request body:
#   Missing required parameter: `start_within`. Set it to "default", "priority", "auto", or a duration "HHh-MMm-SSs".
```

Validation is request-only. Unknown fields pass straight through to the provider (so new OpenAI parameters keep working), and responses are never validated or reshaped.
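To make the rules concrete, here is a rough sketch of an equivalent pre-flight check. It is not the SDK's implementation: `check_request`, the regular expression, and the message wording are all assumptions, and the SDK may be stricter or more lenient about the duration format.

```python
import re
from typing import Any

# Assumed format: two-digit fields, as in "00h-00m-30s".
_DURATION = re.compile(r"(\d{2})h-(\d{2})m-(\d{2})s")
_TIERS = {"default", "priority", "auto"}


def check_request(body: dict[str, Any]) -> list[str]:
    """Return a list of problems with a request body (hypothetical helper)."""
    problems: list[str] = []
    if "model" not in body:
        problems.append("missing model")
    if "input" not in body and "messages" not in body:
        problems.append("missing input/messages")

    value = body.get("start_within")
    if value is None:
        problems.append("missing start_within")
    elif not isinstance(value, str):
        problems.append("start_within must be a string")
    elif value not in _TIERS:
        match = _DURATION.fullmatch(value)
        if match is None:
            problems.append("start_within is not a tier or an HHh-MMm-SSs duration")
        else:
            hours, minutes, seconds = (int(group) for group in match.groups())
            total = hours * 3600 + minutes * 60 + seconds
            if not 5 <= total <= 600:
                problems.append("start_within duration must be between 5s and 10m")
    # Unknown fields are deliberately not inspected, mirroring the pass-through rule.
    return problems


print(check_request({"model": "gpt-5.5", "input": "hi"}))  # ['missing start_within']
```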

## Errors

Non-2xx responses raise `FlexInferenceError`, carrying the OpenAI-shaped `status`, `type`, `code`, and `param`:

```python
from flexinference import FlexInferenceError

try:
    client.responses.create({"model": "gpt-5.5", "input": "hi", "start_within": "priority"})
except FlexInferenceError as err:
    if err.code == "no_byok_key":
        print("Add your OpenAI key in the dashboard.")
    else:
        raise
```
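When several codes need handling, a lookup table keeps the `except` block short. The sketch below is an assumption-laden illustration: `describe_error` and `StandInError` are hypothetical, the hint texts are invented, and it assumes `service_tier_not_allowed` arrives in `err.code`. It reads only the `status` and `code` attributes described above, so it works with any exception that carries them.

```python
# Hint texts are this sketch's own wording, not messages from the router.
HINTS = {
    "no_byok_key": "Add your provider key in the dashboard.",
    "service_tier_not_allowed": "Remove `service_tier`; set `start_within` instead.",
}


def describe_error(err: Exception) -> str:
    """Build a one-line summary from an error's status and code (hypothetical helper)."""
    status = getattr(err, "status", None)
    code = getattr(err, "code", None)
    hint = HINTS.get(code, "No hint available.")
    return f"[{status}] {code}: {hint}"


class StandInError(Exception):
    """Stand-in with the same `status` and `code` attributes, for demonstration only."""

    def __init__(self, status: int, code: str) -> None:
        super().__init__(code)
        self.status = status
        self.code = code


print(describe_error(StandInError(400, "service_tier_not_allowed")))
```

In real code you would call `describe_error(err)` inside `except FlexInferenceError as err:` and re-raise anything you cannot act on.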

## Configuration

| Argument | Default | Description |
| --- | --- | --- |
| `api_key` | (required) | Your `flex_live_` key. |
| `base_url` | `https://api.flexinference.com/v1` | Override the router endpoint. |
| `client` | `httpx.Client` (600s read, 10s connect) | Provide your own `httpx.Client`. |

## License

MIT
