Metadata-Version: 2.4
Name: mlproxy-py
Version: 0.1.1
Summary: SLA/QoS-aware reverse proxy for ML inference workloads (batching, routing, latency metrics).
Author: Kubenew
License: MIT
License-File: LICENSE
Keywords: asyncio,batching,inference,llm,ml,qos,reverse-proxy
Classifier: Development Status :: 3 - Alpha
Classifier: Framework :: AsyncIO
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Internet :: WWW/HTTP :: HTTP Servers
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: fastapi>=0.111.0
Requires-Dist: httpx>=0.27.0
Requires-Dist: prometheus-client>=0.20.0
Requires-Dist: pydantic>=2.7.0
Requires-Dist: pyyaml>=6.0.1
Requires-Dist: typer>=0.12.3
Requires-Dist: uvicorn>=0.30.0
Provides-Extra: dev
Requires-Dist: httpx>=0.27.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.21; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff>=0.1; extra == 'dev'
Provides-Extra: test
Requires-Dist: httpx>=0.27.0; extra == 'test'
Requires-Dist: pytest-asyncio>=0.21; extra == 'test'
Requires-Dist: pytest>=7.0; extra == 'test'
Description-Content-Type: text/markdown

# mlproxy-py

[![PyPI](https://img.shields.io/pypi/v/mlproxy-py)](https://pypi.org/project/mlproxy-py/)
[![Python Versions](https://img.shields.io/pypi/pyversions/mlproxy-py)](https://pypi.org/project/mlproxy-py/)
[![License](https://img.shields.io/pypi/l/mlproxy-py)](https://github.com/Kubenew/mlproxy-py/blob/main/LICENSE)
[![GitHub stars](https://img.shields.io/github/stars/Kubenew/mlproxy-py?style=flat&logo=github)](https://github.com/Kubenew/mlproxy-py)
[![Downloads](https://img.shields.io/pepy.tech/dt/mlproxy-py)](https://pepy.tech/project/mlproxy-py)

**mlproxy-py** is a minimal ML inference reverse proxy with QoS-aware routing.

It is designed for LLM and ML inference workloads where routing decisions should be based on latency, SLA targets, backend health, queue depth, and batching potential.

## Features

- Reverse proxy for JSON inference requests
- Backends grouped into model pools
- SLA-aware routing (choose the backend with the lowest score: observed latency plus a penalty per active request)
- Optional micro-batching (collect requests for N ms)
- Concurrent health checks with connection pooling
- Prometheus metrics (request count, latency, backend latency)
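
As a rough illustration of the micro-batching idea (collect requests for N ms), here is a minimal asyncio sketch. It is not the package's `BatchQueue` implementation; `collect_batch`, `window_ms`, and `max_size` are hypothetical names chosen for this example.

```python
import asyncio


async def collect_batch(queue: asyncio.Queue, window_ms: float, max_size: int) -> list:
    """Wait for one item, then keep collecting until the window closes or the batch is full."""
    loop = asyncio.get_running_loop()
    batch = [await queue.get()]  # block until the first request arrives
    deadline = loop.time() + window_ms / 1000
    while len(batch) < max_size:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break
        try:
            # Wait for the next request, but no longer than the time left in the window.
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break
    return batch


async def demo() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    for i in range(3):
        queue.put_nowait({"text": f"req-{i}"})
    # Three requests are already queued, so they end up in one batch.
    return await collect_batch(queue, window_ms=20, max_size=8)


batch = asyncio.run(demo())
```

The window starts when the first request arrives, so an idle queue adds no latency; a busy one trades up to `window_ms` of delay for larger batches.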

## Quickstart

### Install

```bash
pip install mlproxy-py
```

### Run proxy

```bash
mlproxy run -c examples/config.yml
```

### Send request

```bash
curl -X POST http://localhost:7000/infer/modelA \
  -H "Content-Type: application/json" \
  -d '{"text":"hello"}'
```
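
The same request can be sent from Python. The sketch below uses only the standard library and mirrors the `curl` call above; `build_infer_request` and `infer` are hypothetical helper names, and it assumes the backend replies with a JSON body.

```python
import json
import urllib.request


def build_infer_request(base_url: str, model: str, payload: dict) -> urllib.request.Request:
    """Build a POST /infer/{model} request with a JSON body."""
    return urllib.request.Request(
        url=f"{base_url}/infer/{model}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def infer(base_url: str, model: str, payload: dict, timeout: float = 10.0) -> dict:
    """Send the request and decode the response (assumed to be JSON)."""
    req = build_infer_request(base_url, model, payload)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


req = build_infer_request("http://localhost:7000", "modelA", {"text": "hello"})
# With the proxy running locally:
# result = infer("http://localhost:7000", "modelA", {"text": "hello"})
```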

## Architecture

```
Client ──POST /infer/{model}──► FastAPI
                                    │
                          ┌─────────▼──────────┐
                          │  ModelRouter       │
                          │  choose_backend()  │
                          │  (score = latency  │
                          │   + active_req*5)  │
                          └─────────┬──────────┘
                                    │ backend URL
                          ┌─────────▼──────────┐
                          │  forward_json()    │
                          │  (httpx conn pool) │
                          └─────────┬──────────┘
                                    ▼
                            Backend ML server

       ┌──────────────────┐    ┌──────────────────┐
       │  BatchQueue      │    │  Healthcheck     │
       │  (optional per   │    │  (concurrent,    │
       │   model pool)    │    │   per-backend)   │
       └──────────────────┘    └──────────────────┘
```
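
The scoring rule in the diagram (`score = latency + active_req*5`) can be sketched as follows. This is an illustration under stated assumptions, not the actual `ModelRouter` code: the `Backend` dataclass, its field names, the millisecond unit, and the filtering of unhealthy backends are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    # Hypothetical record; field names are illustrative, not mlproxy-py's own.
    url: str
    latency_ms: float      # recently observed latency (unit assumed)
    active_requests: int   # requests currently in flight
    healthy: bool = True


def score(b: Backend) -> float:
    # Formula from the diagram: latency + active_req * 5
    return b.latency_ms + b.active_requests * 5


def choose_backend(pool: list[Backend]) -> Backend:
    # Assumption: unhealthy backends are skipped before scoring.
    candidates = [b for b in pool if b.healthy]
    if not candidates:
        raise RuntimeError("no healthy backend in pool")
    return min(candidates, key=score)


pool = [
    Backend("http://a:8000", latency_ms=40.0, active_requests=4),  # score 60
    Backend("http://b:8000", latency_ms=55.0, active_requests=0),  # score 55
    Backend("http://c:8000", latency_ms=10.0, active_requests=0, healthy=False),
]
chosen = choose_backend(pool)
```

Here backend `b` wins even though `a` has lower raw latency, because each in-flight request on `a` adds a penalty of 5 to its score.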

## Config

See `examples/config.yml`.

## Changelog

### 0.1.1

- **Lifespan pattern**: Migrated from deprecated `@app.on_event("startup")` to FastAPI `lifespan` context manager.
- **Graceful shutdown**: Batch workers and healthcheck loop are properly cancelled on shutdown.
- **Connection pooling**: Shared `httpx.AsyncClient` singletons for proxy and healthcheck (previously a new client was created per request/check).
- **Concurrent health checks**: Backends are checked in parallel via `asyncio.gather` (previously sequential).
- **Logging**: Added structured `logging` throughout; `--log-level` CLI option.
- **Bare except fixes**: All `except Exception` blocks re-raise `asyncio.CancelledError`.
- **Deprecated API fixes**: Replaced `asyncio.get_event_loop()` with `asyncio.get_running_loop()` in batching module.
- **Build system**: Migrated from `setuptools` to `hatchling`. Added classifiers, keywords, optional dev/test deps, ruff/pytest config.
- **Tests**: Expanded from 1 test to 15+ tests covering config, router, batching, proxy, healthcheck, and backends.

### 0.1.0

- Initial release: JSON inference proxy, model pools, SLA-aware routing, micro-batching, health checks, Prometheus metrics.

## License

MIT
