Metadata-Version: 2.4
Name: forge-ml-serve
Version: 0.1.0
Summary: A lightweight ML inference server with dynamic batching, hot-swapping, and Prometheus metrics
License: MIT
Keywords: batching,inference,machine-learning,pytorch,serving
Requires-Python: >=3.10
Requires-Dist: fastapi>=0.111.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: prometheus-client>=0.20.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: torch>=2.0.0
Requires-Dist: uvicorn[standard]>=0.29.0
Provides-Extra: bench
Requires-Dist: httpx>=0.27.0; extra == 'bench'
Requires-Dist: matplotlib>=3.8.0; extra == 'bench'
Requires-Dist: tqdm>=4.66.0; extra == 'bench'
Provides-Extra: dev
Requires-Dist: httpx>=0.27.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23.0; extra == 'dev'
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Description-Content-Type: text/markdown

# Forge

A lightweight ML inference server with dynamic batching, model hot-swapping, a multi-model registry and Prometheus metrics -- built from scratch in Python.

This is not `model.predict()` behind Flask. Forge implements the core mechanics of what TorchServe and Triton do, from first principles.

---

## Features

| Feature | Description |
|---|---|
| Dynamic batching | Accumulates concurrent requests over a configurable time window then stacks them into a single tensor operation. Typically 4-8x throughput vs. sequential. |
| Backpressure queue | asyncio.Queue with a hard depth cap. Returns HTTP 503 immediately when saturated rather than blocking the event loop. |
| Model hot-swapping | Load a new checkpoint with zero downtime. In-flight requests finish on the old model; new requests use the new one. |
| Model registry | Serve multiple models simultaneously at independent endpoints each with its own queue and batching config. |
| Prometheus metrics | P50/P95/P99 latency histograms, batch size distribution, queue depth, timeout counters and swap duration. |

---

## Architecture

```
  HTTP Request
       |
       v
  +-------------------------------------+
  |  FastAPI  POST /v1/{model}/predict  |
  +------------------+------------------+
                     | asyncio.Future
                     v
  +-------------------------------------+
  |  RequestQueue  (backpressure cap)   |
  +------------------+------------------+
                     | blocking get / nowait drain
                     v
  +-------------------------------------+
  |  BatchScheduler                     |
  |  +-------------------------------+  |
  |  | Collect for batch_window_ms   |  |
  |  |   OR until max_batch_size     |  |  <- whichever fires first
  |  +--------------+----------------+  |
  |                 | run_in_executor   |
  |  +--------------v----------------+  |
  |  |  torch.no_grad() forward pass |  |
  |  +--------------+----------------+  |
  |                 | scatter results   |
  +-----------------|-------------------+
                    |
                    v
         Future.set_result(tensor)
                    |
                    v
         HTTP Response to caller
```

---

## Quickstart

### Install

```bash
git clone https://github.com/verz0/Forge.git
cd Forge
pip install -e ".[dev]"
```

### Serve a dummy model (no GPU needed)

```bash
forge serve examples/serve_dummy.py
```

```bash
curl -X POST http://localhost:8000/v1/dummy/predict \
     -H "Content-Type: application/json" \
     -d '{"input": [1.0, 2.0, 3.0]}'
# {"model":"dummy","output":[2.0,4.0,6.0],"latency_ms":1.2}
```

### Serve ResNet-50

```bash
pip install torchvision
forge serve examples/serve_resnet.py
```

### Check metrics (Prometheus format)

```bash
curl http://localhost:8000/metrics
```

### Hot-swap a model

```bash
curl -X POST http://localhost:8000/v1/resnet50/reload \
     -H "Content-Type: application/json" \
     -d '{"model_path": "/path/to/new_resnet.pt"}'
```

---

## Configuration

```python
from forge import ModelConfig

config = ModelConfig(
    batch_window_ms=50.0,   # Collect requests for 50ms before dispatching
    max_batch_size=32,       # Early flush if 32 requests accumulate first
    max_queue_depth=256,     # Return 503 if more than 256 requests are pending
    request_timeout_s=30.0,  # Fail requests waiting longer than 30s
    device="cuda",           # "cpu", "cuda", "cuda:0" or "mps"
    num_threads=4,           # PyTorch intraop threads (CPU only)
)
```

---

## Benchmark

```bash
# Start the server
forge serve examples/serve_dummy.py

# Run the sweep in another terminal
python benchmarks/bench_throughput.py --concurrency 1,5,10,25,50,100

# With chart output
python benchmarks/bench_throughput.py --plot
```

**Sample results** (CPU, dummy model, 128-float input):

| Concurrency | RPS    | P50 (ms) | P95 (ms) | P99 (ms) |
|-------------|--------|----------|----------|----------|
| 1           | 420    | 2.1      | 2.8      | 3.1      |
| 10          | 1,850  | 4.9      | 8.2      | 11.4     |
| 50          | 3,200  | 14.8     | 28.3     | 41.7     |
| 100         | 3,400  | 28.1     | 52.6     | 71.2     |

Key insight: at concurrency 50, batching yields roughly 7.6x the throughput of a naive sequential server at the cost of approximately 15ms added latency (the batch window).

---

## Running Tests

```bash
pytest tests/ -v
```

---

## API Reference

| Endpoint | Method | Description |
|---|---|---|
| `/v1/{model}/predict` | POST | Submit tensor for inference |
| `/v1/{model}/reload` | POST | Hot-swap to new checkpoint |
| `/v1/models` | GET | List models and queue depths |
| `/metrics` | GET | Prometheus metrics |
| `/health` | GET | Liveness and readiness |
| `/docs` | GET | Interactive API docs (Swagger) |

---

## Project Structure

```
forge/
├── forge/
│   ├── config.py      # ModelConfig and ServerConfig
│   ├── queue.py       # RequestQueue and InferenceRequest
│   ├── batcher.py     # BatchScheduler (core engine)
│   ├── worker.py      # ModelWorker and hot-swap
│   ├── registry.py    # ModelRegistry
│   ├── metrics.py     # Prometheus metrics
│   ├── server.py      # FastAPI app
│   └── cli.py         # forge serve CLI
├── tests/
├── benchmarks/
└── examples/
```

---

## License

MIT
