Metadata-Version: 2.4
Name: mlx-gpt-oss
Version: 1.0.3
Summary: Minimal OpenAI-compatible server for GPT-OSS models on Apple Silicon with MLX
Author: Hayssam Keilany
License-Expression: MIT
Project-URL: Homepage, https://github.com/icelaglace/mlx-gpt-oss
Project-URL: Repository, https://github.com/icelaglace/mlx-gpt-oss
Project-URL: Issues, https://github.com/icelaglace/mlx-gpt-oss/issues
Keywords: gpt-oss,harmony,mlx,fastapi,openai-compatible,minimal,mlx-lm
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Internet :: WWW/HTTP :: WSGI :: Application
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: fastapi<1,>=0.115
Requires-Dist: mlx-lm<1,>=0.30
Requires-Dist: openai-harmony<1,>=0.0.8
Requires-Dist: loguru<1,>=0.7
Requires-Dist: pydantic<3,>=2
Requires-Dist: uvicorn<1,>=0.35
Provides-Extra: dev
Requires-Dist: build<2,>=1.2; extra == "dev"
Requires-Dist: httpx<1,>=0.27; extra == "dev"
Requires-Dist: pytest<10,>=8; extra == "dev"
Requires-Dist: twine<7,>=5; extra == "dev"
Dynamic: license-file

# MLX GPT-OSS Server

Minimal OpenAI-compatible server for GPT-OSS/Harmony models on Apple Silicon.  
Built with `mlx-lm` (inference), `openai-harmony` (prompt formatting), and FastAPI (HTTP API).

## Feature List

- OpenAI-style `/v1/chat/completions` endpoint
- OpenAI-style `/v1/responses` endpoint
- Streaming (`SSE`) and non-streaming responses
- Harmony `reasoning_effort` support (`low`, `medium`, `high`)
- OpenAI tool-calling response format
- Responses API function-calling and `previous_response_id` support
- Robust Harmony tool-calling parser and stream recovery paths
- Usage token counts in responses
- `/health` queue stats and `/v1/models` compatibility endpoint
- Single-model runtime with FIFO request queueing

## Requirements

- macOS on Apple Silicon
- Python `>=3.11`

## Quick Start

```bash
pip install mlx-gpt-oss
mlx-gpt-oss --model mlx-community/gpt-oss-20b-MXFP4-Q8
```

Default bind: `http://0.0.0.0:8000`

## Install From Source

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
mlx-gpt-oss --model mlx-community/gpt-oss-20b-MXFP4-Q8
```

## API Endpoints

| Endpoint | Method | Purpose |
| --- | --- | --- |
| `/health` | `GET` | Server health + active/queued request counts |
| `/v1/models` | `GET` | Loaded model metadata |
| `/v1/chat/completions` | `POST` | OpenAI-compatible chat completion |
| `/v1/responses` | `POST` | OpenAI-compatible Responses API create |
| `/v1/responses/{response_id}` | `GET` | Retrieve stored response |
| `/v1/responses/{response_id}` | `DELETE` | Delete stored response |
| `/v1/responses/{response_id}/input_items` | `GET` | Retrieve stored request input items |
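As a sketch of the request shape, here is a minimal non-streaming payload for `/v1/chat/completions` (field names follow the OpenAI chat API; the model name is the one from Quick Start, and the server always answers with whichever single model it loaded at startup):

```python
import json

# Minimal chat-completion request body. `model` is required for
# compatibility but does not select a model at request time.
payload = {
    "model": "mlx-community/gpt-oss-20b-MXFP4-Q8",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": False,
}
body = json.dumps(payload)

# To send it with the server running:
#   curl -X POST http://127.0.0.1:8000/v1/chat/completions \
#        -H "Content-Type: application/json" \
#        -d '{"model": "...", "messages": [...]}'
```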

## Chat Completions Notes

- `model` is required for compatibility, but the server always uses the single model loaded at startup.
- Supports OpenAI-style `messages`, `stream`, `tools`, `tool_choice`, `stop`, and common sampling params.
- `top_k` is accepted but generation remains pinned to `top_k=0` for GPT-OSS behavior.
- `reasoning_effort` can be set directly, or via `chat_template_kwargs.reasoning_effort`.
- Streaming returns `chat.completion.chunk` events and ends with `[DONE]`.
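A streaming client consumes `data:` lines carrying `chat.completion.chunk` JSON and stops at the `[DONE]` sentinel. The sketch below shows the parsing loop against illustrative chunks (shape only; real chunks carry additional fields such as `id` and `usage`):

```python
import json

def collect_stream_text(sse_lines):
    """Accumulate assistant text from chat.completion.chunk SSE lines.

    Each event arrives as `data: <json>`; the stream terminates with the
    literal sentinel `data: [DONE]`.
    """
    text = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines between events
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            text.append(delta["content"])
    return "".join(text)

# Illustrative event stream:
sample = [
    'data: {"object": "chat.completion.chunk", "choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"object": "chat.completion.chunk", "choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"object": "chat.completion.chunk", "choices": [{"delta": {"content": "lo"}}]}',
    'data: [DONE]',
]
print(collect_stream_text(sample))  # Hello
```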

## Responses API Notes

- Supported input types are text message items, replayed `function_call` items, and `function_call_output` items.
- Supported tools are custom `function` tools only.
- Stored responses are process-local, in-memory, and bounded by LRU eviction.
- `previous_response_id` reuses the stored conversation transcript but does not carry forward prior `instructions`.
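A function-calling round trip with the Responses API then looks like the two payloads below. The tool name, `resp_123`, and `call_123` are hypothetical placeholders; in practice you take the ids from the first response's `function_call` output item:

```python
import json

# First request: a custom function tool (the only supported tool type).
first = {
    "model": "mlx-community/gpt-oss-20b-MXFP4-Q8",
    "input": "What is the weather in Paris?",
    "tools": [{
        "type": "function",
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }],
}

# Follow-up: return the tool result via a function_call_output item, chained
# with previous_response_id. Prior `instructions` are not carried forward,
# so restate them here if the follow-up still needs them.
followup = {
    "model": "mlx-community/gpt-oss-20b-MXFP4-Q8",
    "previous_response_id": "resp_123",   # id from the first response (placeholder)
    "input": [{
        "type": "function_call_output",
        "call_id": "call_123",            # call_id from the function_call item (placeholder)
        "output": json.dumps({"temp_c": 18}),
    }],
}
```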

## Responses API Limits

- No multimodal inputs (`image`, `audio`, `file`, etc.)
- No hosted OpenAI tools such as `web_search`, `file_search`, or `code_interpreter`
- No structured output / non-plain-text `text.format`
- No `parallel_tool_calls=false`
- No named/required tool forcing; `tool_choice` supports `auto` and `none`

## Tool Calling Reliability

- Uses official Harmony assistant-action stop tokens from `openai-harmony` (no hardcoded token IDs).
- Handles streaming edge cases: unfinished tool-call endings, buffered fallback dedupe, and repeated identical tool calls.
- Addresses a class of tool-calling failures seen in other MLX servers.

## CLI Options

| Flag | Default | Description |
| --- | --- | --- |
| `--model` | required | Model path or Hugging Face ID |
| `--host` | `0.0.0.0` | Bind address |
| `--port` | `8000` | Bind port |
| `--context-length` | `8196` | Max KV cache context length |
| `--log-level` | `INFO` | `DEBUG`, `INFO`, `WARNING`, `ERROR` |
| `--log-file` | disabled | Optional rotating file log output |
| `--debug-raw-preview-chars` | `0` | In `DEBUG`, preview N chars of prompts/output |
| `--http-access-log` | `False` | Emit one access log line per HTTP request |
| `--responses-store-max-items` | `256` | Max stored `/v1/responses` records kept in memory |
| `--responses-store-max-bytes` | `67108864` | Approximate max in-memory bytes for stored responses |

## Security

- No built-in auth or API key checks; securing access is your responsibility.
- Default host is `0.0.0.0` for local/LAN self-hosting.
- CORS is permissive (`*`, credentials disabled).
- Use `--host 127.0.0.1` for local-only access.
