Metadata-Version: 2.4
Name: flexinference
Version: 0.1.0
Summary: Official Python SDK for FlexInference - a deadline-aware, OpenAI-compatible inference router.
Project-URL: Homepage, https://flexinference.com
Project-URL: Documentation, https://flexinference.com/docs
Author-email: Aditya Perswal <adityaperswal@gmail.com>
License: MIT
License-File: LICENSE
Keywords: ai,flexinference,gpt,inference,llm,openai
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Typing :: Typed
Requires-Python: >=3.12
Requires-Dist: httpx>=0.27
Requires-Dist: typing-extensions>=4.12
Provides-Extra: dev
Requires-Dist: mypy>=1.13; extra == 'dev'
Requires-Dist: pytest>=8; extra == 'dev'
Requires-Dist: ruff>=0.7; extra == 'dev'
Description-Content-Type: text/markdown

# FlexInference (Python)

The official Python SDK for [FlexInference](https://flexinference.com) - a deadline-aware, OpenAI-compatible inference router. Send the OpenAI requests you already send, bring your own OpenAI key, and add one field - `start_within` - to trade latency for cost.

```bash
pip install flexinference
```

## Quickstart

```python
from flexinference import FlexInference

client = FlexInference(api_key="flex_live_...")

res = client.responses.create({
    "model": "gpt-5.5",
    "input": "Write a haiku about cheap GPUs.",
    "start_within": "00h-00m-30s",
})

print(res["output_text"])
```

`start_within` accepts `"priority"`, `"standard"`, or a duration of the form `"HHh-MMm-SSs"` (between 5 seconds and 10 minutes). With a duration, the router races OpenAI's flex tier and falls back to the standard tier if the request can't start within that window. See the [docs](https://flexinference.com/docs/deadline-routing).
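
If you compute deadlines at runtime, a small helper can build the duration string for you. The sketch below is not part of the SDK: `format_start_within` is a hypothetical name, and it assumes the 5s-10m bounds are inclusive and that the server expects zero-padded, two-digit fields as in the example above.

```python
def format_start_within(seconds: int) -> str:
    """Render a whole number of seconds as an "HHh-MMm-SSs" string.

    Hypothetical helper, not part of the flexinference package.
    """
    # The 5s-10m range comes from the README; treating both ends as
    # inclusive is an assumption of this sketch.
    if not 5 <= seconds <= 600:
        raise ValueError("start_within duration must be between 5s and 10m")
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}h-{minutes:02d}m-{secs:02d}s"


print(format_start_within(90))  # 00h-01m-30s
```

The result can be passed directly as the `start_within` value in a request body.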

## Streaming

```python
stream = client.responses.create(
    {"model": "gpt-5-nano", "input": "Count to ten.", "start_within": "00h-00m-20s"},
    stream=True,
)
for event in stream:
    if event.get("type") == "response.output_text.delta":
        print(event["delta"], end="")
```
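
If you want the full text rather than incremental printing, you can accumulate the deltas yourself. The following is a minimal sketch that assumes events are dict-like, as in the loop above; `collect_output_text` is a hypothetical helper, and the `fake_events` list is a stand-in for a real stream so the example runs without network access.

```python
from collections.abc import Iterable, Mapping
from typing import Any


def collect_output_text(events: Iterable[Mapping[str, Any]]) -> str:
    """Join every output-text delta in an event stream into one string.

    Hypothetical helper, not part of the flexinference package.
    """
    parts: list[str] = []
    for event in events:
        # Ignore everything except text deltas (lifecycle events, etc.).
        if event.get("type") == "response.output_text.delta":
            parts.append(event["delta"])
    return "".join(parts)


# Stand-in events for illustration; a real stream would come from
# client.responses.create(..., stream=True).
fake_events = [
    {"type": "response.created"},
    {"type": "response.output_text.delta", "delta": "1, "},
    {"type": "response.output_text.delta", "delta": "2, 3"},
    {"type": "response.completed"},
]
print(collect_output_text(fake_events))  # 1, 2, 3
```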

## Chat Completions

```python
res = client.chat.completions.create({
    "model": "gpt-5.5",
    "messages": [{"role": "user", "content": "Hello!"}],
    "start_within": "standard",
})
print(res["choices"][0]["message"]["content"])
```

## Closing the client

The client holds a pooled `httpx.Client`, so close it when you're done to release connections. Use it as a context manager:

```python
with FlexInference(api_key="flex_live_...") as client:
    res = client.responses.create({"model": "gpt-5.5", "input": "Hi."})
    print(res["output_text"])
# connections are released on exit
```

Or close it yourself:

```python
client = FlexInference(api_key="flex_live_...")
try:
    ...
finally:
    client.close()
```

## Errors

Non-2xx responses raise `FlexInferenceError`, carrying the OpenAI-shaped `status`, `type`, `code`, and `param`:

```python
from flexinference import FlexInferenceError

try:
    client.responses.create({"model": "gpt-5.5", "input": "hi", "start_within": "priority"})
except FlexInferenceError as err:
    if err.code == "no_byok_key":
        print("Add your OpenAI key in the dashboard.")
    else:
        raise
```

## Configuration

| Argument | Default | Description |
| --- | --- | --- |
| `api_key` | (required) | Your `flex_live_` key. |
| `base_url` | `https://api.flexinference.com/v1` | Override the router endpoint. |
| `client` | `httpx.Client` with a 600s read timeout | Provide your own `httpx.Client`. |

## License

MIT
