Metadata-Version: 2.4
Name: aquila
Version: 0.3.3
Summary: Single control plane for multi-node vLLM inference — deploy, serve, and manage LLMs across a GPU cluster without Kubernetes.
Author-email: Marc Schlichting <mschl@stanford.edu>
License: MIT
Project-URL: Homepage, https://github.com/sisl/aquila
Keywords: vllm,llm,gpu,cluster,inference,deployment,openai,serving
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown

# Aquila

Single control plane for multi-node vLLM inference. Point-and-click deployments, an OpenAI-compatible gateway, warm caching, live GPU monitoring, and a full deployment lifecycle — without Kubernetes or a managed platform.

## Quick start

```bash
uv venv && source .venv/bin/activate
uv pip install aquila
```

**Host (management server):**

```bash
aquila host up --host-ip 0.0.0.0 --host-frontend-port 5173 --host-discover-port 11400
```

**Client (each GPU node):**

```bash
aquila client up --host-ip <host-ip> --host-discover-port 11400
```

Open `http://<host-ip>:5173` — client nodes appear within seconds. Add `--service` for persistent systemd services.

## Features

- **Deploy and manage models** across GPU nodes via Docker or rootless Podman — each runs in the official `vllm/vllm-openai` container with a specific version, nightly build, or commit hash.
- **OpenAI-compatible gateway** (`/v1`) with stable URLs across node moves, API key auth with per-deployment scoping, and auto-expiring snippet keys.
- **Warm cache** — pause idle models to RAM and resume on demand; LRU auto-eviction frees GPU VRAM while keeping weights ready for near-instant restart.
- **Local checkpoints and LoRA adapters** — upload from the browser (streamed) or pull from a URL directly onto a node.
- **Live monitoring** — GPU utilization, disk usage, deployment status, per-deployment usage metrics, and 48-hour metric history charts.
- **Usage tracking** — lifetime tokens, request counts, and average prefill/generation speeds from vLLM's own metrics.
- **Reproducibility manifests** — export model, HF revision, seed, vLLM version, image digest, and full config per deployment.
- **Notifications** — Slack/webhook alerts when deployments become ready, fail, or are about to expire.
- **Per-GPU maintenance** — cordon individual GPUs while the rest of the node keeps serving; optionally drain affected deployments.
- **Extra packages and plugins** — install pip packages and upload vLLM plugins per deployment via cached derived images.
- **Reverse proxy support** — deploy behind nginx at any sub-path with `--base-path`.

## Best for

- Research labs and university clusters
- Teams sharing GPUs across projects
- Self-hosted multi-model inference

## Supported platforms

- Python 3.10–3.14, Node.js ≥ 23 (host only)
- Ubuntu 22.04 and 24.04
- NVIDIA GPUs (H100, A100, L40, RTX 4090, DGX Spark)

[Full documentation](https://sisl.github.io/aquila/)
