Metadata-Version: 2.4
Name: local-engine-router
Version: 0.5.0
Summary: A single-port OpenAI- and Ollama-compatible reverse proxy that swaps the GPU between local LLM engines on demand.
Author: rxxusp
License: MIT
Project-URL: Homepage, https://github.com/rxxusp/local-engine-router
Project-URL: Repository, https://github.com/rxxusp/local-engine-router
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: fastapi>=0.115
Requires-Dist: uvicorn[standard]>=0.30
Requires-Dist: httpx>=0.27
Requires-Dist: pyyaml>=6.0
Requires-Dist: psutil>=5.9
Provides-Extra: dev
Requires-Dist: pytest>=8; extra == "dev"
Requires-Dist: pytest-asyncio>=0.23; extra == "dev"
Dynamic: license-file

# local-engine-router

[![CI](https://github.com/rxxusp/local-engine-router/actions/workflows/ci.yml/badge.svg)](https://github.com/rxxusp/local-engine-router/actions/workflows/ci.yml)

**On memory-constrained, unified-memory hardware, only one heavy LLM engine can
hold the GPU at a time.** local-engine-router is a single-port, OpenAI- and
Ollama-compatible reverse proxy that reads each request's `model` field, figures
out which local engine owns it, and **swaps engines on demand** so your clients
never have to know which backend is currently active. The proxy itself is **pure
Python and uses no GPU**.

Built and verified on a DGX Spark (GB10, 128 GB unified CPU+GPU memory), where
DeepSeek-V4-Flash alone uses ~81 GB and running two heavy engines simultaneously
causes OOM failures.

---

## Install

Pick whichever path fits. The script and pip/pipx paths leave the same
`local-engine-router` and `routerctl` commands on your `PATH`; the Docker path
runs the router inside a container instead. The router needs Python >= 3.10 and
no GPU.

### Option 1: one-line script (recommended)

```bash
curl -fsSL https://raw.githubusercontent.com/rxxusp/local-engine-router/main/install.sh | bash
```

This creates an isolated virtualenv, installs the package and its dependencies
into it, links `local-engine-router` and `routerctl` into `~/.local/bin`, writes
a starter config if you do not have one, and offers to install and enable the
systemd `--user` service. It is idempotent, so re-run it any time to upgrade.

Pass flags through the pipe with `-s --`:

```bash
curl -fsSL .../install.sh | bash -s -- --yes         # non-interactive; enable the service
curl -fsSL .../install.sh | bash -s -- --no-service  # skip systemd
```

The installer is parameterised by environment variables (`LER_VENV`,
`LER_CONFIG`, `LER_BIN`, `LER_UNIT_DIR`) and supports `--dry-run`,
`--print-unit`, and `--uninstall`. Run `install.sh --help` for the full list.

### Option 2: pip / pipx

```bash
# Once published to PyPI:
pipx install local-engine-router        # isolated, recommended
pip install local-engine-router         # into the current environment

# Until then (or to track main), install straight from GitHub:
pipx install "git+https://github.com/rxxusp/local-engine-router.git"
```

The package installs both console scripts, `local-engine-router` and
`routerctl`.

### Option 3: Docker

```bash
docker run --rm -p 8077:8077 \
  -v "$PWD/config.yaml:/app/config.yaml" \
  --add-host host.docker.internal:host-gateway \
  ghcr.io/rxxusp/local-engine-router:latest
```

Or with the bundled compose file:

```bash
docker compose up -d      # uses docker-compose.yml in this repo
```

The image is pure Python (no CUDA) and is published for `linux/amd64` and
`linux/arm64` on every `v*` tag. Inside a container, point each engine's
`base_url` at `http://host.docker.internal:<port>` instead of `127.0.0.1`, and
note that only `api_swap` and `ollama` engines work from a container (see
[Container limitation](#container-limitation)).

> **Maintainers:** the PyPI path goes live once a PyPI Trusted Publisher is
> configured and the `ENABLE_PYPI_PUBLISH` repository variable is set to `true`.
> See [`.github/workflows/pypi-publish.yml`](.github/workflows/pypi-publish.yml)
> for the exact one-time setup.


## Quickstart

From zero to a running router in three steps:

```bash
# 1. Install (any path above). For example:
curl -fsSL https://raw.githubusercontent.com/rxxusp/local-engine-router/main/install.sh | bash

# 2. Detect your running engines and write them into the config the service reads.
#    The installer prints this path as `conf:`; for the one-line install it is:
local-engine-router init --config ~/.config/local-engine-router/config.yaml

# 3. Restart so the router picks up the new config, then list its models:
routerctl restart
curl http://127.0.0.1:8077/v1/models
```

> Point `init` at the config the router actually loads. A script install reads
> `~/.config/local-engine-router/config.yaml` (the `conf:` path the installer
> prints). A pip/manual install has no service, so run `init` with no `--config`
> (it writes `./config.yaml`) and start the router directly with
> `local-engine-router --config ./config.yaml`.

### The `init` wizard

`local-engine-router init` (also `routerctl init`) probes the well-known
localhost ports of every supported backend (Ollama, llama.cpp, vLLM, SGLang, LM
Studio, TabbyAPI, KoboldCpp), confirms what is actually listening, fetches each
engine's live model list, and scaffolds a working `config.yaml` from the
matching presets. It asks only what it cannot infer: bind host, an optional API
key, and which detected engines to include.

It is suggest-and-confirm: a port that is open but does not confirm as a known
backend is never added without an explicit yes, so the router never routes to an
unmanaged port silently. The generated config is validated through the real
loader before it is written.

```bash
local-engine-router init                 # interactive
local-engine-router init --yes           # non-interactive; include all confirmed engines
local-engine-router init --detect-only   # just report what is running, write nothing
local-engine-router init --example       # write a commented starter, no probing
```

Engines the router drives purely over HTTP (Ollama, LM Studio, TabbyAPI) work
with no further editing. For engines the router launches itself (llama.cpp,
vLLM, SGLang, KoboldCpp), fill in `start_cmd` in the generated config so the
router can restart them after a swap; the wizard prints which ones still need it.

### Your first request

```bash
curl http://127.0.0.1:8077/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"<a-model-id-from-/v1/models>","messages":[{"role":"user","content":"hello"}],"stream":false}'
```

The router logs `SWAP begin:` / `SWAP done:` lines as it switches engines. Check
`GET /health` for liveness and `GET /status` (or `routerctl status`) for full
engine state. Validate a config without starting:

```bash
python3 -m router --check-config --config config.yaml
```

### Run it as a service (Linux)

The one-line installer offers to set up a systemd `--user` service pointing at
its virtualenv. If you installed from a git checkout instead, use the
checkout-based installer:

```bash
bash deploy/install.sh        # copies the unit, enables lingering, starts it
```

Or wire the unit up by hand:

```bash
mkdir -p ~/.config/systemd/user
cp deploy/local-engine-router.service ~/.config/systemd/user/
sudo loginctl enable-linger "$USER"    # boot-start without a login session
systemctl --user daemon-reload
systemctl --user enable --now local-engine-router
```

The checkout unit uses `%h` (systemd's home placeholder) for all paths, so it
works for any user without editing. macOS and Windows have no systemd; run
`local-engine-router --config config.yaml` directly or under your own supervisor.

---

## How the GPU swap works

This is the core mechanic that distinguishes local-engine-router from simpler
proxies. When a request arrives for a model that belongs to a different engine
than the one currently holding the GPU, the router performs a full swap before
forwarding the request:

```
  client request (model: llama3.1:8b)
        │
        ▼
  resolve model → engine: ollama
        │
        ├── already active? ──yes──► forward immediately
        │
       no
        │
        ▼
  acquire _swap_lock
        │
        ▼
  [1] DRAIN in-flight requests on every non-target engine
      (wait up to drain_timeout_s, default 30 s)
        │
        ▼
  [2] FREE VRAM on every non-target engine
      (systemctl stop / keep_alive:0 / SIGTERM, etc.)
        │
        ▼
  [3] WAIT FOR OS MEMORY RECLAIM  ← THE DIFFERENTIATOR
      poll MemAvailable until it plateaus (~1 GiB between samples)
      or swap_memory_settle_timeout_s (default 25 s) elapses
        │
        │   WHY: after a heavy engine (e.g. an ~81 GB model) stops,
        │   the kernel may not reclaim those pages for several
        │   seconds. If the next model starts while they are still
        │   resident, its pre-flight memory check sees less free
        │   memory than actually exists and fails with OOM.
        │   The memory-settle wait eliminates that race.
        │
        ▼
  [4] ensure_started() on the target engine
      poll readiness until HTTP 200 (up to start_timeout_s, default 240 s)
        │
        │   ┌────────────────────────────────────────────┐
        │   │  while waiting (streaming clients only):   │
        │   │  /v1/* SSE streams:  ": keepalive\n\n"    │
        │   │  /api/* NDJSON:      "\n" (bare newline)  │
        │   │  emitted every swap_keepalive_interval_s  │
        │   └────────────────────────────────────────────┘
        │
        ▼
  [5] active_engine = target; release _swap_lock
        │
        ▼
  increment in-flight counter; forward request to target engine
        │
        ▼
  response complete → release() decrements in-flight counter
```

**Keep-alive and disconnect safety.** Streaming responses start immediately; the
engine acquire runs inside the response generator, so a long cold start never
blocks with zero bytes sent to the client. A shielded asyncio task guarantees
`release()` is called even if the client disconnects mid-swap, preventing GPU
leaks. On client disconnect during a swap, the pending acquire is cancelled
cleanly on a normal control-flow path (not under `CancelledError`).

**Non-streaming requests** cannot carry keep-alive frames and block for the
entire swap. See [Sharp edges](#sharp-edges).
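
The streaming keep-alive behaviour can be sketched with plain `asyncio`. This is a minimal illustration, not the router's actual code: `stream_with_keepalive`, `acquire`, and `upstream` are hypothetical names, and the sketch leaves out the shielded `release()` handling described above.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable

SSE_KEEPALIVE = b": keepalive\n\n"  # SSE comment frame; SSE clients ignore it


async def stream_with_keepalive(
    acquire: Callable[[], Awaitable[None]],       # hypothetical: waits for the swap
    upstream: Callable[[], AsyncIterator[bytes]],  # hypothetical: relays the engine
    interval_s: float = 5.0,
) -> AsyncIterator[bytes]:
    """Yield keep-alive frames until acquire() finishes, then relay upstream."""
    task = asyncio.create_task(acquire())
    try:
        while True:
            # asyncio.wait() does not cancel the task when the timeout expires.
            done, _ = await asyncio.wait({task}, timeout=interval_s)
            if done:
                task.result()  # re-raise an acquire failure, if any
                break
            yield SSE_KEEPALIVE
        async for chunk in upstream():
            yield chunk
    finally:
        # If the consumer goes away early, stop waiting for the engine.
        if not task.done():
            task.cancel()
```

Because the generator yields a frame on every timeout, the first bytes reach the client after at most `interval_s` seconds, however long the cold start takes.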


## Platform support

| Platform | Router runs | Memory-settle | systemd unit |
|----------|------------|---------------|--------------|
| Linux    | yes        | yes (`/proc/meminfo` fast path) | yes (`deploy/local-engine-router.service`) |
| macOS    | yes        | yes (via `psutil`) | no |
| Windows  | yes        | yes (via `psutil`) | no |

`psutil` is a required dependency, installed automatically by `pip install .`
(or `pip install -r requirements.txt`), so the cross-platform memory-settle
wait and process control work out of the box on Linux, macOS, and Windows. On
Linux the router reads `/proc/meminfo` directly as a fast path; everywhere else
it reads available memory through `psutil`.


## What makes this different

The general space (llama-swap, LocalAI, GPUStack, ...) is crowded; this targets
the **memory-constrained unified-memory** niche (GB10 / Apple Silicon) and does
four things no maintained tool does today:

1. **Explicit kernel memory-settle wait.** After freeing an engine the router
   polls `MemAvailable` until it plateaus *before* starting the next engine, so
   the incoming model's pre-flight memory check doesn't fail on pages the kernel
   hasn't reclaimed yet. On a GB10 with an ~81 GB model, this takes a few
   seconds and is the difference between a clean swap and an OOM failure.
2. **Manages engines it didn't spawn.** It can drive a `systemctl --user` unit
   with `Restart=always` (a plain SIGTERM would just respawn) -- structurally
   impossible with a pure `cmd:`-launches-the-process model.
3. **Native Ollama `/api/*` on a swap proxy.** Both the OpenAI `/v1/*` surface
   and Ollama-native `/api/*` are first-class and trigger swaps.
4. **Upstream-independent keep-alive** during long cold starts -- on both
   `/v1/*` SSE streams and `/api/*` NDJSON streams.
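
The memory-settle wait in item 1 can be sketched as a small polling loop. This is an illustration under stated assumptions, not the router's implementation: `wait_for_memory_settle` and its parameters are hypothetical, and "plateau" is read here as "changed by less than about 1 GiB between two consecutive samples".

```python
import time
from typing import Callable, Optional

GIB = 1024 ** 3


def wait_for_memory_settle(
    read_available: Callable[[], Optional[int]],  # returns bytes, or None if unreadable
    timeout_s: float = 25.0,       # mirrors swap_memory_settle_timeout_s
    interval_s: float = 1.0,       # assumption: sampling interval
    plateau_bytes: int = GIB,      # assumption: plateau threshold
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once available memory plateaus, False on timeout or if unreadable."""
    deadline = clock() + timeout_s
    previous = read_available()
    if previous is None:
        return False  # memory unreadable: the caller skips the wait
    while clock() < deadline:
        sleep(interval_s)
        current = read_available()
        if current is None:
            return False
        if abs(current - previous) < plateau_bytes:
            return True  # change between samples is under the threshold
        previous = current
    return False
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real delays.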


## Sharp edges

### Non-streaming requests block for the entire swap

A single JSON response body has nowhere to embed a keep-alive frame.
Non-streaming requests (`stream: false`) block from the moment they arrive until
the swap completes and the upstream returns. On a cold ds4 start that is up to
`start_timeout_s` (default 240 s). **Set your client read-timeout above the
worst-case swap** -- 300 s is a safe ceiling for most setups.
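
As a rough budget, the timed stages of a swap can be added up from the documented defaults. The sketch assumes the stages run back to back and each one hits its timeout; it leaves out the time to free the outgoing engine and the upstream's own generation time.

```python
# Defaults taken from the config reference in this README.
drain_timeout_s = 30.0                # step [1]: drain in-flight requests
swap_memory_settle_timeout_s = 25.0   # step [3]: wait for memory reclaim
start_timeout_s = 240.0               # step [4]: wait for target readiness

worst_case_swap_s = drain_timeout_s + swap_memory_settle_timeout_s + start_timeout_s
print(f"timed stages, worst case: {worst_case_swap_s:.0f} s")  # 295 s with these defaults
```

If you raise any of these values (the example engine config below uses `start_timeout_s: 300`), raise the client read-timeout with them.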

Streaming clients (`stream: true`) are not affected: they receive keep-alive
frames and never see a multi-second silence.
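
A hand-rolled streaming client has to skip those frames when parsing. The helpers below are a minimal sketch with hypothetical names; most SSE client libraries already ignore comment lines.

```python
import json
from typing import Iterable, Iterator


def sse_data_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield the payload of each `data:` line, skipping comments and blank lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith(":"):
            continue  # blank separator, or a comment frame such as ": keepalive"
        if line.startswith("data:"):
            payload = line[len("data:"):]
            # Per the SSE format, a single leading space after the colon is dropped.
            yield payload[1:] if payload.startswith(" ") else payload


def ndjson_objects(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed object per non-empty line, skipping bare-newline keep-alives."""
    for line in lines:
        if line.strip():
            yield json.loads(line)
```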

### Only one engine holds the GPU at a time

This is by design. The single-active invariant is the whole point of the router
on unified-memory hardware. There is no "run two engines in parallel" mode.

### Container limitation

The Docker image runs the router process only -- it cannot reach into the host's
process tree or `systemctl --user` namespace. Engines of type `generic_process`
(the router launches the server) and `ds4` (systemd-user lifecycle) do **not**
work inside the container. Only `api_swap` and `ollama` engines work from the
container, because they communicate over HTTP to servers already running on the
host.

Run the router on the host directly (pip install / systemd) if you need
process-control engines.

### Memory-settle uses /proc/meminfo on Linux, psutil elsewhere

The router reads `/proc/meminfo` directly on Linux as a fast path and falls back
to `psutil` (a required dependency) on macOS, Windows, and any other host, so the
memory-settle wait works on every platform. In the rare case memory cannot be
read at all, the wait is skipped (logged as a warning), which on a heavily loaded
system can let the incoming model's pre-flight memory check fail with OOM.
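
The fast-path-then-fallback read can be sketched as follows. The function names are hypothetical and the router's real code may differ; the sketch only assumes the `MemAvailable` field of `/proc/meminfo` and `psutil.virtual_memory().available`.

```python
import logging
from typing import Optional

log = logging.getLogger("memory_settle_sketch")


def parse_mem_available(meminfo_text: str) -> Optional[int]:
    """Return MemAvailable in bytes from /proc/meminfo text, or None if absent."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) * 1024  # the kernel reports the value in kB
    return None


def read_available_bytes() -> Optional[int]:
    """Linux fast path first, then psutil; None means the wait is skipped."""
    try:
        with open("/proc/meminfo", encoding="ascii") as fh:
            value = parse_mem_available(fh.read())
        if value is not None:
            return value
    except (OSError, ValueError, IndexError):
        pass  # not Linux, /proc unavailable, or an unexpected format
    try:
        import psutil  # listed as a required dependency of the router
        return int(psutil.virtual_memory().available)
    except Exception:
        log.warning("available memory unreadable; skipping memory-settle wait")
        return None
```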


## Authentication and binding

By default the router binds `127.0.0.1` (localhost only).

- **Bind:** set `host: 0.0.0.0` only if you need off-localhost access (e.g.
  Open WebUI in Docker reaching the host via `host.docker.internal`).
- **Auth:** set `api_keys` in `config.yaml`. When set, every request except
  `GET /health` (and `GET /metrics`, which the endpoint reference lists as
  unauthenticated) must present a key via `Authorization: Bearer <key>` or
  `X-API-Key: <key>`. Keys are compared in constant time.
- If the router is bound off-localhost with no `api_keys`, it logs a security
  warning on startup.
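
A constant-time key check of the kind described above might look like the sketch below. It is an illustration using the standard library's `hmac.compare_digest`, with hypothetical function names, and it omits the `GET /health` exemption.

```python
import hmac
from typing import Mapping, Optional, Sequence


def extract_key(headers: Mapping[str, str]) -> Optional[str]:
    """Pull a key from `Authorization: Bearer <key>` or `X-API-Key: <key>`."""
    lowered = {name.lower(): value for name, value in headers.items()}
    auth = lowered.get("authorization", "")
    if auth.lower().startswith("bearer "):
        return auth[len("bearer "):].strip()
    return lowered.get("x-api-key")


def is_authorized(headers: Mapping[str, str], api_keys: Sequence[str]) -> bool:
    if not api_keys:
        return True  # no keys configured: auth is off
    presented = extract_key(headers)
    if presented is None:
        return False
    presented_b = presented.encode()
    # Compare against every key so timing does not reveal which one matched.
    matches = [hmac.compare_digest(presented_b, key.encode()) for key in api_keys]
    return any(matches)
```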


## Backend presets

Ready-to-paste `engines:` + `models:` blocks for common backends live in
[`presets/`](presets/). Copy the relevant file into your `config.yaml` and fill
in the `<ANGLE_BRACKET>` placeholders.

| Backend | Preset file | Engine type |
|---------|-------------|-------------|
| llama.cpp (llama-server) | [`presets/llamacpp.yaml`](presets/llamacpp.yaml) | `generic_process` |
| vLLM | [`presets/vllm.yaml`](presets/vllm.yaml) | `generic_process` |
| SGLang | [`presets/sglang.yaml`](presets/sglang.yaml) | `generic_process` |
| KoboldCpp | [`presets/koboldcpp.yaml`](presets/koboldcpp.yaml) | `generic_process` |
| MLX (mlx-lm) | [`presets/mlx.yaml`](presets/mlx.yaml) | `generic_process` |
| TabbyAPI | [`presets/tabbyapi.yaml`](presets/tabbyapi.yaml) | `api_swap` |
| LM Studio | [`presets/lmstudio.yaml`](presets/lmstudio.yaml) | `api_swap` |
| LocalAI | [`presets/localai.yaml`](presets/localai.yaml) | `generic_process` |
| ramalama | [`presets/ramalama.yaml`](presets/ramalama.yaml) | `generic_process` |
| MAX (Modular) | [`presets/max.yaml`](presets/max.yaml) | `generic_process` |

See [`presets/README.md`](presets/README.md) for gotchas per backend (e.g. vLLM
reports a false `/health` 200 before the model is actually servable; the preset
uses `ready_path: /v1/models` + `ready_check: "model:<id>"` to work around it).


## Integrations

- **Open WebUI**: [`deploy/openwebui-wiring.md`](deploy/openwebui-wiring.md) --
  route all Open WebUI model requests through the router via Admin Panel
  (recommended, zero risk) or a `docker run` recreate.
- **OpenCode**: [`deploy/opencode.snippet.md`](deploy/opencode.snippet.md) --
  point both OpenCode providers at the router by changing two `baseURL` values
  in `~/.config/opencode/opencode.json`.


## Architecture

```
  OpenCode / curl / any OpenAI client
        │
        │  POST /v1/chat/completions (model: "llama3.1:8b")
        ▼
  ┌──────────────────────────────────────┐
  │        local-engine-router :8077     │
  │                                      │
  │  /v1/chat /v1/completions           │
  │  /v1/embeddings /v1/messages        │  ← OpenAI-compatible
  │  /v1/responses                      │
  │                                      │
  │  /api/chat /api/generate            │  ← Ollama-native
  │  /api/embeddings /api/embed         │
  │                                      │
  │  reads "model", resolves engine,    │
  │  swaps if needed, proxies           │
  └──────────┬───────────────┬──────────┘
             │               │
   ┌─────────▼────┐   ┌──────▼──────────┐
   │llamacpp :8080│   │  Ollama :11434   │
   │(or vLLM etc) │   │  (various sizes) │
   └──────────────┘   └─────────────────┘
         ← only ONE active at a time →
         (single-GPU unified memory pool)

  swap: drain → free VRAM → wait for OS memory reclaim → start target
  streaming clients stay alive: SSE comments on /v1/*, bare newlines on /api/*
```


## Config reference

Copy `config.example.yaml` to `config.yaml` and edit for your machine. A
[JSON Schema (`config.schema.json`)](config.schema.json) ships in the repo;
point your editor's YAML language server at it for inline validation.

Validate without starting:

```bash
python3 -m router --check-config --config config.yaml
python3 -m router --print-schema    # print the JSON Schema
```

### Top-level keys

| Key | Default | Description |
|-----|---------|-------------|
| `host` | `127.0.0.1` | Bind address. Use `0.0.0.0` to expose off-localhost (pair with `api_keys`). |
| `port` | `8077` | Listen port. |
| `api_keys` | `[]` | When non-empty, require a key on all requests except `GET /health`. |
| `allow_destructive_ollama_api` | `false` | Allow `/api/delete`, `/api/create`, `/api/copy`, `/api/push`, `/api/blobs` (refused with 403 when false). |
| `log_level` | `INFO` | Python log level. |
| `log_file` | `logs/router.log` | Rotating log (5 MB x 3 backups). |
| `state_file` | `state.json` | Persisted active-engine snapshot (re-probed on startup). |
| `swap_keepalive_enabled` | `true` | Emit keep-alive frames to streaming clients during swaps. |
| `swap_keepalive_interval_s` | `5.0` | Seconds between keep-alive frames. |
| `drain_timeout_s` | `30.0` | Max wait for in-flight requests before stopping an engine. |
| `swap_memory_settle_timeout_s` | `25.0` | Max wait for freed memory to plateau before starting the next engine. |
| `upstream_connect_timeout_s` | `15.0` | Connect timeout to backends (read timeout is unbounded). |

### Generic `engines:` table

Use this to add any number of engines with config only -- no Python needed.
`type` is one of `generic_process`, `api_swap`, `ollama` (and `ds4`, an advanced
escape hatch for systemd-managed servers; see `config.example.yaml`). When
`engines:` is present it is the **sole** source of engines (the legacy top-level
`ds4:`/`ollama:` sections are ignored).

```yaml
engines:
  # Local server the router launches + supervises (llama.cpp, vLLM, SGLang, ...)
  llamacpp:
    type: generic_process
    base_url: http://127.0.0.1:8080
    start_cmd: ["/usr/local/bin/llama-server", "-m", "/models/foo.gguf", "--port", "8080"]
    ready_path: /health
    start_timeout_s: 300

  # Engine whose models load/unload over HTTP (TabbyAPI, LM Studio, ...)
  tabby:
    type: api_swap
    base_url: http://127.0.0.1:5000
    health_path: /v1/model
    unload_path: /v1/model/unload
    loaded_path: /v1/model
    loaded_models_key: data
    loaded_name_key: id

models:
  - { id: qwen2.5-7b-instruct, engine: llamacpp }
  - { id: my-tabby-model,       engine: tabby }
```

See `config.example.yaml` and `config.schema.json` for the full key reference
on `generic_process`, `api_swap`, `ds4`, and `ollama` engine types.

### Model aliases

Map a fixed client-side name to a real model id. The router rewrites the
request body so the upstream always sees the real id.

```yaml
aliases:
  gpt-4o-mini: qwen2.5-7b-instruct
  claude-3-5-sonnet: llama3.1:8b
```
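
The rewrite amounts to a dictionary lookup on the `model` field before forwarding. A minimal sketch, with `apply_alias` as a hypothetical name:

```python
from typing import Any, Dict, Mapping


def apply_alias(body: Mapping[str, Any], aliases: Mapping[str, str]) -> Dict[str, Any]:
    """Return a copy of the request body with `model` replaced by its real id."""
    rewritten = dict(body)
    model = rewritten.get("model")
    if isinstance(model, str) and model in aliases:
        rewritten["model"] = aliases[model]
    return rewritten
```

Working on a copy leaves the client's original body untouched for logging.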


## Model auto-discovery

Auto-discovery is **opt-in and off by default.** When `discover.enabled` is
false (or the `discover:` block is absent entirely), the router behaves exactly
like a build without the feature. Discovery only activates when you set
`enabled: true` under `discover:` in `config.yaml`.

### Global discover block

```yaml
discover:
  enabled: false            # opt-in; false is the safe default
  collision: config_order   # how to resolve engine conflicts (only mode today)
  port_probe:
    enabled: false          # reserved for future use; parse-validated, not yet active
```

`collision: config_order` means the first engine in declaration order wins when
two engines both claim the same model id. A one-time WARNING is logged naming
both engines.
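
First-declared-wins resolution can be sketched as below. `build_index` is a hypothetical helper, and the sketch assumes the warning is emitted once per contested model id.

```python
import logging
from typing import Dict, Iterable, List, Set, Tuple

log = logging.getLogger("discovery_sketch")


def build_index(engines_in_order: Iterable[Tuple[str, List[str]]]) -> Dict[str, str]:
    """Map model id -> engine key; the first engine to claim an id keeps it."""
    index: Dict[str, str] = {}
    warned: Set[str] = set()
    for engine, model_ids in engines_in_order:
        for model_id in model_ids:
            owner = index.setdefault(model_id, engine)
            if owner != engine and model_id not in warned:
                warned.add(model_id)
                log.warning(
                    "model %r claimed by %r and %r; keeping %r",
                    model_id, owner, engine, owner,
                )
    return index
```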

### Per-engine discovery fields (generic_process only)

On any `generic_process` engine you can set:

```yaml
engines:
  llamacpp:
    type: generic_process
    start_cmd: ["/usr/local/bin/llama-server", "-m", "/models/my-model.gguf", "--port", "8080"]
    base_url: http://127.0.0.1:8080
    ready_path: /health
    discover_models: true          # opt this engine into discovery
    served_models:                 # optional extra hint ids (augments start_cmd parse)
      - my-model
    tags_cache_ttl_s: 30.0        # TTL for the /v1/models cache used during discovery
```

| Field | Default | Description |
|-------|---------|-------------|
| `discover_models` | `false` | Opt this engine into the discovery index. Required for the engine to participate. |
| `served_models` | `[]` | Extra model ids to register regardless of what the engine advertises live. Useful when the engine is typically stopped. |
| `tags_cache_ttl_s` | `30.0` | Seconds to cache the `/v1/models` response from this engine when building the discovery index. |

### How discovery augments the static registry

Discovery **augments** the static `models:` list. It never overrides it.

- Static `models:` entries always win. If a model id appears in both the static
  list and the discovery index, the static entry takes precedence.
- Newly-pulled Ollama tags and live `api_swap` model ids are picked up
  automatically without a restart.
- For a **stopped** `generic_process` engine, the router resolves models from
  three sources in union: the `start_cmd` argv (parsing `--served-model-name`,
  `--model`/`-m`, and `.gguf` basenames), a self-healing last-seen cache
  populated while the engine ran, and the explicit `served_models` hint list.
  This means a model belonging to a stopped engine still appears in
  `GET /v1/models` and routes correctly when requested (triggering a start).

The last-seen cache is persisted in `state.json` under `seen_models` and
reloaded on startup, so discovery survives router restarts.
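
The three-source union for a stopped engine can be sketched like this. The parsing rules are assumptions for illustration only: the sketch treats a `.gguf` path as contributing its basename without the extension, uses other flag values verbatim, and handles POSIX paths only. The router's real parser may differ.

```python
from pathlib import PurePosixPath
from typing import Iterable, List, Sequence

MODEL_FLAGS = ("--served-model-name", "--model", "-m")


def _model_id(value: str) -> str:
    # Assumption: "/models/foo.gguf" is advertised as "foo".
    path = PurePosixPath(value)
    return path.stem if path.suffix == ".gguf" else value


def models_from_argv(argv: Sequence[str]) -> List[str]:
    """Collect candidate model ids from a start_cmd argv list."""
    found: List[str] = []
    for i, arg in enumerate(argv):
        value = None
        if arg in MODEL_FLAGS and i + 1 < len(argv):
            value = argv[i + 1]
        elif arg.startswith(("--served-model-name=", "--model=")):
            value = arg.split("=", 1)[1]
        elif arg.endswith(".gguf"):
            value = arg
        if value is not None:
            model_id = _model_id(value)
            if model_id not in found:
                found.append(model_id)
    return found


def stopped_engine_models(
    argv: Sequence[str], last_seen: Iterable[str], served_models: Iterable[str]
) -> List[str]:
    """Union of argv-derived ids, the last-seen cache, and the served_models hints."""
    return sorted(set(models_from_argv(argv)) | set(last_seen) | set(served_models))
```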

### Per-model thinking guard

Any model in the static `models:` list can carry a per-model field:

```yaml
models:
  - id: qwen3-30b
    engine: vllm
    disable_thinking_below_max_tokens: 1000
```

When `disable_thinking_below_max_tokens` is set on a model, the router
intercepts `POST /v1/chat/completions` requests for that model and, if the
request's `max_tokens` is below the threshold AND the client has not explicitly
set `enable_thinking`, injects `enable_thinking: false` into the request body
before forwarding. This prevents models from allocating a thinking budget that
exceeds the available token budget, which can produce empty responses.

If the client already set `enable_thinking` explicitly, the router leaves it
untouched.
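
The guard's decision logic can be sketched as follows. `apply_thinking_guard` is a hypothetical name, and the sketch assumes `enable_thinking` is a top-level body field and that a request without `max_tokens` is left alone.

```python
from typing import Any, Dict, Mapping, Optional


def apply_thinking_guard(body: Mapping[str, Any], threshold: Optional[int]) -> Dict[str, Any]:
    """Inject enable_thinking=False when max_tokens is below the model's threshold."""
    out = dict(body)
    max_tokens = out.get("max_tokens")
    if (
        threshold is not None                 # guard configured for this model
        and isinstance(max_tokens, int)       # assumption: missing max_tokens is untouched
        and max_tokens < threshold
        and "enable_thinking" not in out      # an explicit client choice always wins
    ):
        out["enable_thinking"] = False
    return out
```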

### Triggering a discovery scan

```bash
# Via HTTP (auth-gated the same as /admin/swap)
curl -X POST http://127.0.0.1:8077/admin/discover \
  -H "Authorization: Bearer <key>" \
  -H "Content-Type: application/json" -d '{}'

# Via routerctl
routerctl discover
```

The response is a JSON object mapping each engine key to the sorted list of
model ids it advertised. Stopped-engine entries from the discovery index (parsed
from `start_cmd`, last-seen cache, and `served_models`) are merged in.

`POST /admin/discover` is a scan-and-report call. It does not change routing; it
is intended for inspection and debugging.


## Endpoint reference

| Method | Path | Behaviour |
|--------|------|-----------|
| GET | `/` | HTML status page |
| GET | `/health` | `{"status":"ok"}` -- liveness; never triggers a swap |
| GET | `/metrics` | Prometheus text exposition. Unauthenticated even when `api_keys` are set. |
| GET | `/status` | Full status: active engine, last swap, per-engine state, model list |
| GET | `/v1/models` | OpenAI model list: static registry + live engine tags + stopped-engine discovered ids, deduplicated. Discovery entries only appear when `discover.enabled` is true. No swap. |
| POST | `/v1/chat/completions` | OpenAI chat; routed by `body.model` |
| POST | `/v1/completions` | OpenAI legacy completions; routed by `body.model` |
| POST | `/v1/embeddings` | OpenAI embeddings; routed by `body.model` |
| POST | `/v1/messages` | Anthropic messages format; routed by `body.model` |
| POST | `/v1/responses` | Responses API; routed by `body.model` |
| POST | `/api/chat` | Ollama-native chat; routed by `body.model` |
| POST | `/api/generate` | Ollama-native generate; routed by `body.model` |
| POST | `/api/embeddings` | Ollama-native embeddings; routed by `body.model` |
| POST | `/api/embed` | Ollama-native embed; routed by `body.model` |
| GET/POST | `/api/tags`, `/api/ps`, `/api/version`, `/api/show`, `/api/pull`, `/api/*` | Passthrough to Ollama, no swap. Destructive endpoints refused with 403 unless `allow_destructive_ollama_api: true`. |
| POST | `/admin/swap` | Body: `{"model":"<id>"}` or `{"engine":"<key>"}`. Proactive swap without a user request. |
| POST | `/admin/discover` | Scan all engines for discoverable model ids and return a per-engine summary. Auth-gated the same as `/admin/swap`. |


## Metrics

`GET /metrics` exposes Prometheus text (format v0.0.4). No `prometheus_client`
dependency -- the exposition is hand-rolled.

| Series | Type | Meaning |
|--------|------|---------|
| `swap_duration_seconds` | histogram | Wall-clock duration of a full engine swap |
| `memory_settle_seconds` | histogram | Time spent waiting for memory to plateau |
| `in_flight_at_swap_start` | histogram | In-flight requests being drained at swap start |
| `swap_total{from,to,result}` | counter | Count of swaps by transition and result (`ok`/`error`) |
| `engine_uptime_seconds{engine}` | gauge | Seconds the active engine has been active |


## routerctl

`routerctl` is a thin CLI for inspecting and controlling the running router.

```bash
routerctl status                    # active engine, in-flight, last swap
routerctl models                    # list all known models
routerctl use llamacpp              # swap to a specific engine now
routerctl use qwen2.5-7b-instruct   # or name a model; swaps to its owning engine
routerctl discover                  # POST /admin/discover: scan engines, print per-engine model ids
routerctl logs                      # tail the service journal
routerctl restart                   # restart the service
```


## Operations

### Start / stop / restart

```bash
systemctl --user start   local-engine-router
systemctl --user stop    local-engine-router
systemctl --user restart local-engine-router
systemctl --user status  local-engine-router
```

### Logs

```bash
# Rotating file
tail -f ~/local-engine-router/logs/router.log

# systemd journal
journalctl --user -u local-engine-router -f
journalctl --user -u local-engine-router --since "1 hour ago"
```

### State file

`state.json` is written after every swap. It is a snapshot only; the router
re-probes reality on startup rather than trusting it.

```json
{"active_engine": "ollama", "last_swap": {"from": "ds4", "to": "ollama", "duration_s": 52.3, "ok": true}}
```


## Development and tests

The test suite is hermetic -- no GPU and no network (engines are replaced by a
mock backend), so it runs anywhere CI does:

```bash
pip install '.[dev]'      # adds pytest and pytest-asyncio
python3 -m pytest -q
```

CI runs the same suite on every push (see
[`.github/workflows/ci.yml`](.github/workflows/ci.yml)).


## Troubleshooting

### Non-streaming request timed out

The client read-timeout is shorter than the swap. See [Sharp edges: non-streaming
requests block for the entire swap](#non-streaming-requests-block-for-the-entire-swap).
Set the client timeout above `start_timeout_s` (default 240 s); 300 s is a safe
ceiling.

### Swap to next engine fails with "more system memory than is available"

The outgoing model's memory hasn't been reclaimed yet. The router waits up to
`swap_memory_settle_timeout_s` (default 25 s). If you still hit this, the model
genuinely doesn't fit in available memory -- check `free -g` against the model
size. If memory cannot be read at all (very unusual, since `psutil` ships as a
dependency), the wait is skipped and this race is more likely.

### ds4 won't stop during a swap

```bash
systemctl --user stop ds4.service
pgrep -f ds4/ds4-server             # any leftover process?
kill -9 <pid>                       # last resort
```

Then retry via `routerctl use ollama` or restart the router.

### Open WebUI model picker is empty

The container is missing `--add-host=host.docker.internal:host-gateway`. Without
it the container cannot resolve `host.docker.internal` and the picker shows
nothing. See [`deploy/openwebui-wiring.md`](deploy/openwebui-wiring.md) for the
safe recreate command.

### Port 8077 busy

```bash
ss -tlnp | grep 8077
systemctl --user stop local-engine-router
# kill the offending pid, then restart
```

### Ollama won't unload a model

The router sends `keep_alive: 0` then falls back to `ollama stop <name>`. If
models remain after `unload_timeout_s` (60 s), the router logs a warning and
proceeds. Unload manually:

```bash
ollama list
ollama stop <name>
```


## License

MIT. Attribution to `rxxusp`. See [`LICENSE`](LICENSE).

> The Python package is `router`; the console scripts are `local-engine-router`
> and `routerctl`; the project/repo is **local-engine-router**.
