Metadata-Version: 2.4
Name: llm-autotune
Version: 1.3.0
Summary: 39% faster TTFT, 67% less KV cache, zero config — autotune optimises local LLMs on Ollama, LM Studio, and MLX
Project-URL: Homepage, https://github.com/tanavc1/local-llm-autotune
Project-URL: Repository, https://github.com/tanavc1/local-llm-autotune
Project-URL: Bug Tracker, https://github.com/tanavc1/local-llm-autotune/issues
Project-URL: Changelog, https://github.com/tanavc1/local-llm-autotune/releases
Author-email: Tanav Chinthapatla <tanavc1@users.noreply.github.com>
License: MIT
License-File: LICENSE
Keywords: apple-silicon,gemma,inference,inference-server,kv-cache,latency,llama,llm,local-ai,local-llm,memory-optimization,mlx,ollama,openai-compatible,optimization,qwen,ttft
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: End Users/Desktop
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: MacOS
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: System :: Hardware
Classifier: Topic :: System :: Systems Administration
Classifier: Topic :: Utilities
Requires-Python: >=3.9
Requires-Dist: click>=8.1
Requires-Dist: fastapi>=0.110
Requires-Dist: httpx>=0.27
Requires-Dist: packaging>=21.0
Requires-Dist: psutil>=5.9
Requires-Dist: py-cpuinfo>=9.0
Requires-Dist: pydantic>=2.0
Requires-Dist: rich>=13.0
Requires-Dist: uvicorn[standard]>=0.29
Provides-Extra: dev
Requires-Dist: fastapi[all]>=0.110; extra == 'dev'
Requires-Dist: httpx>=0.27; extra == 'dev'
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
Requires-Dist: pytest-cov>=4.1; extra == 'dev'
Requires-Dist: pytest>=7.4; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Requires-Dist: scipy>=1.11; extra == 'dev'
Requires-Dist: types-psutil>=5.9; extra == 'dev'
Provides-Extra: mlx
Requires-Dist: mlx-lm>=0.21; extra == 'mlx'
Description-Content-Type: text/markdown

# autotune — Local LLM Inference Optimizer

[![PyPI](https://img.shields.io/pypi/v/llm-autotune)](https://pypi.org/project/llm-autotune/)
[![Python](https://img.shields.io/badge/python-3.9--3.13-blue)](https://pypi.org/project/llm-autotune/)
[![Docker](https://img.shields.io/docker/v/tanavc1/llm-autotune?label=docker)](https://hub.docker.com/r/tanavc1/llm-autotune)
[![CI](https://github.com/tanavc1/local-llm-autotune/actions/workflows/test.yml/badge.svg)](https://github.com/tanavc1/local-llm-autotune/actions)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
[![Product Hunt](https://img.shields.io/badge/Product%20Hunt-Featured-DA552F?logo=producthunt&logoColor=white)](https://www.producthunt.com/products/autotune-llm)

**Website & install guide → [autotunellm.com](https://autotunellm.com)**

**39% faster time-to-first-word. 3× less KV cache. Drop-in for Ollama, LM Studio, and MLX.**

autotune is a middleware layer that makes your local LLMs noticeably faster and lighter — without changing your code or workflow. It computes the exact KV cache each request needs, pins your system prompt in memory, and manages context windows automatically.

```bash
pip install llm-autotune                        # macOS / Windows
pipx install llm-autotune                       # Linux (recommended — see install notes below)
brew install tanavc1/autotune/llm-autotune      # Homebrew (macOS)
docker pull tanavc1/llm-autotune                # Docker
autotune chat --model qwen3:8b                  # that's it
```

Works with **Ollama**, **LM Studio**, and **MLX** (Apple Silicon native) out of the box.

---

## What autotune actually improves

Benchmarked on Apple M2 16 GB using Ollama's own nanosecond-precision internal timers — not Python wall-clock estimates. Results are means across 3 runs × 5 prompt types, with Wilcoxon signed-rank statistical testing and Cohen's d effect sizes.

| Metric | llama3.2:3b | gemma4:e2b | qwen3:8b | Average |
|--------|:-----------:|:----------:|:--------:|:-------:|
| **Time to first word (TTFT)** | −35% | −29% | **−53%** | **−39%** |
| **KV prefill time** | −66% | −64% | **−72%** | **−67%** |
| **KV cache RAM** | −66% | **−69%** | −66% | **−67%** |
| **Generation speed (tok/s)** | ±2% | ±0.2% | ±2.4% | **unchanged** |

> **Timing source:** `prompt_eval_duration`, `load_duration`, and `total_duration` from Ollama's Go runtime. Token counts (`prompt_eval_count`) are identical in both conditions — autotune right-sizes the buffer, not the content.

### What the numbers mean

**You wait 39% less for the first word.** On qwen3:8b the wait drops by 53%, and on a long-context prompt by up to 89%. You feel this on every message.

**KV cache shrinks 3×.** Raw Ollama allocates a fixed 4,096-token KV buffer regardless of prompt length. autotune computes the exact size each request needs — for a typical chat message that frees 300–400 MB before inference even starts.

**Generation speed is unchanged.** Token generation on Apple Silicon is Metal GPU-bound. The ±2% variance in the data is measurement noise. autotune is transparent about this.

**122,778 KV buffer slots freed** across all benchmark runs — slots Ollama would have allocated, zeroed, and initialized for nothing.

### Verify it yourself

```bash
# Quick 45-second check on any model you have:
autotune proof --model qwen3:8b

# Full statistical benchmark with Wilcoxon p-values and Cohen's d:
autotune proof-suite --model qwen3:8b --runs 3
```

`autotune proof` runs two scenarios: a standard multi-turn session and a long-context code-review prompt where TTFT and KV allocation differences are most visible. Results are saved as JSON alongside your terminal output.

---

## Quickstart

### 1. Install Ollama

**macOS**
```bash
brew install ollama
```
Or download the desktop app from [https://ollama.com/download](https://ollama.com/download).

**Linux**
```bash
curl -fsSL https://ollama.com/install.sh | sh
```

**Windows** — download the installer from [https://ollama.com/download](https://ollama.com/download).

Once Ollama is installed, pull a model (this uses the `autotune` CLI, which you install in step 2):

```bash
autotune pull qwen3:8b         # 5.2 GB — best general model for 16 GB machines
```

autotune starts Ollama in the background automatically — no separate `ollama serve` needed.

Not sure which model to use? Run `autotune recommend` after installing and it will pick the best model for your exact hardware.

### 2. Install autotune

**macOS / Windows**
```bash
pip install llm-autotune
```

**Linux**

Modern Linux distros (Ubuntu 23.04+, Debian 12+, Fedora 38+) block `pip install` to the system Python by default (PEP 668). Use `pipx` — it's the correct tool for CLI apps and keeps autotune isolated from your system packages:

```bash
# 1. Install pipx if you don't have it
sudo apt install pipx          # Debian / Ubuntu
sudo dnf install pipx          # Fedora
# or without sudo: pip install --user pipx

# 2. Install autotune
pipx install llm-autotune

# 3. Add ~/.local/bin to PATH (only needed once)
pipx ensurepath

# 4. Reload your shell — no need to open a new terminal
exec $SHELL
```

> **"`autotune: command not found`" right after installing?** This means `~/.local/bin` wasn't in your PATH before. Steps 3 and 4 above fix it — `exec $SHELL` reloads your shell in place without opening a new window.

> If you'd rather not use `pipx`: `pip install llm-autotune --break-system-packages` works but may conflict with your distro's system packages. Not recommended.

**Requirements:** Python 3.9+ and a local Ollama install (autotune starts the Ollama server in the background for you).

```bash
# Apple Silicon acceleration (native Metal GPU kernels):
pip install "llm-autotune[mlx]"

# Development install:
git clone https://github.com/tanavc1/local-llm-autotune.git
cd local-llm-autotune && pip install -e ".[dev]"
```

### 3. Get a model recommendation for your hardware

```bash
autotune recommend
```

Profiles your CPU, RAM, and GPU, then scores every model in the registry against your hardware and recommends the best option with an exact `autotune pull` command to run.

### 4. Start chatting

```bash
autotune chat --model qwen3:8b                   # optimized chat, default profile
autotune chat --model qwen3:8b --profile fast    # minimum latency
autotune chat --model qwen3:8b --profile quality # largest context window
autotune chat --model qwen3:8b --no-swap         # guarantee no macOS swap
autotune chat --model qwen3:8b --system "You are a concise coding assistant."
```

### 5. Check what's running

```bash
autotune ps        # all models in memory — RAM, context, quant, age
autotune hardware  # CPU, RAM, GPU backend, and effective memory budget
autotune ls        # every locally installed model scored against your hardware
```

---

## API server (OpenAI-compatible)

```bash
autotune serve
# → Listening at http://127.0.0.1:8765/v1
```

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8765/v1", api_key="local")
response = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Hello!"}],
)
```

### Per-request headers

```
X-Autotune-Profile: fast          # override profile for this request
X-Conversation-Id: a3f92c1b       # attach to a persistent conversation
```

### Endpoints

| Endpoint | Description |
|----------|-------------|
| `POST /v1/chat/completions` | OpenAI-compatible, streaming or non-streaming |
| `GET /v1/models` | All available models across all backends |
| `GET /health` | Server status, queue depth, memory pressure |
| `GET /api/hardware` | Live hardware snapshot |
| `GET /api/profiles` | Profile definitions |
| `GET /api/running_models` | Models in memory with RAM, context, quant, age |
| `POST/GET/DELETE /api/conversations` | Persistent conversation CRUD |
| `GET /api/conversations/{id}/export` | Export as Markdown |

### API key authentication

By default the server accepts all requests. To enforce API keys on all `/v1/*` routes:

```bash
export AUTOTUNE_ADMIN_KEY="your-secret-admin-key"
export AUTOTUNE_REQUIRE_API_KEY=1
autotune serve
```

**Create and manage keys** via the admin API (requires `Authorization: Bearer $AUTOTUNE_ADMIN_KEY`):

```bash
# Create a key
curl -s -X POST http://localhost:8765/admin/keys \
  -H "Authorization: Bearer $AUTOTUNE_ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{"name": "my-app", "label": "Production"}' | jq .

# List all keys
curl -s http://localhost:8765/admin/keys \
  -H "Authorization: Bearer $AUTOTUNE_ADMIN_KEY" | jq .

# Per-key usage (last 30 days)
curl -s "http://localhost:8765/admin/usage/summary?days=30" \
  -H "Authorization: Bearer $AUTOTUNE_ADMIN_KEY" | jq .

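# Revoke a key (set KEY_ID to the key's id; a literal {id} would be treated as a curl glob)
curl -s -X DELETE "http://localhost:8765/admin/keys/$KEY_ID" \
  -H "Authorization: Bearer $AUTOTUNE_ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{"reason": "rotated"}'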
```

Keys use the format `sk-at-<token>`. The plaintext is returned once on creation — only the SHA-256 hash is stored. Usage is logged per key per day to local SQLite with an optional Supabase mirror.

| Admin endpoint | Description |
|----------------|-------------|
| `POST /admin/keys` | Create key — returns plaintext once |
| `GET /admin/keys` | List all keys |
| `GET /admin/keys/{id}` | Single key + 30-day usage |
| `DELETE /admin/keys/{id}` | Revoke (soft delete) |
| `GET /admin/usage` | Per-key/day/model breakdown |
| `GET /admin/usage/summary` | Aggregate totals per key |

`/health` and `/api/*` are always exempt from key enforcement. `/admin/*` is also exempt from API-key checks, but it always requires the admin bearer token.

### Concurrency

```bash
AUTOTUNE_MAX_CONCURRENT=1    # parallel inference slots (default: 1)
AUTOTUNE_MAX_QUEUED=8        # max requests waiting (default: 8)
AUTOTUNE_WAIT_TIMEOUT=120    # seconds before a queued request gets 429 (default: 120)
```
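
To make the three settings concrete, here is a minimal sketch of an admission gate with the same knobs. It is not autotune's implementation: `InferenceGate`, `QueueFull`, and `demo` are hypothetical names, and a real server would translate `QueueFull` into an HTTP 429 response.

```python
import asyncio


class QueueFull(Exception):
    """Raised when a request should be answered with HTTP 429."""


class InferenceGate:
    """Hypothetical gate: N concurrent slots, a bounded wait queue, a wait timeout."""

    def __init__(self, max_concurrent: int = 1, max_queued: int = 8,
                 wait_timeout: float = 120.0) -> None:
        self._slots = asyncio.Semaphore(max_concurrent)
        self._max_queued = max_queued
        self._wait_timeout = wait_timeout
        self._waiting = 0

    async def acquire(self) -> None:
        # Reject immediately if every slot is busy and the queue is already full.
        if self._slots.locked() and self._waiting >= self._max_queued:
            raise QueueFull("queue is full")
        self._waiting += 1
        try:
            await asyncio.wait_for(self._slots.acquire(), self._wait_timeout)
        except asyncio.TimeoutError:
            raise QueueFull("timed out waiting for a slot") from None
        finally:
            self._waiting -= 1

    def release(self) -> None:
        self._slots.release()


async def demo() -> list:
    gate = InferenceGate(max_concurrent=1, max_queued=1, wait_timeout=1.0)
    await gate.acquire()                          # takes the only slot
    queued = asyncio.create_task(gate.acquire())  # second request waits in the queue
    await asyncio.sleep(0)                        # let the queued task start waiting
    outcomes = []
    try:
        await gate.acquire()                      # third request: queue already full
    except QueueFull:
        outcomes.append("rejected")
    gate.release()                                # hand the slot to the queued request
    await queued
    outcomes.append("admitted")
    return outcomes
```

Running `asyncio.run(demo())` rejects the third request straight away and admits the queued one as soon as the slot is released.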

---

## Model recommendations by hardware

| RAM | Recommended model | Pull command | Why |
|-----|------------------|--------------|-----|
| 8 GB | `qwen3:4b` | `autotune pull qwen3:4b` | Best 4B available; hybrid thinking mode |
| 16 GB | `qwen3:8b` | `autotune pull qwen3:8b` | Near-frontier quality; best 8B as of 2026 |
| 16 GB (coding) | `qwen2.5-coder:7b` | `autotune pull qwen2.5-coder:7b` | Near GPT-4o on HumanEval at 7B |
| 24 GB | `qwen3:14b` | `autotune pull qwen3:14b` | Excellent reasoning; comfortable headroom |
| 24 GB (coding) | `qwen2.5-coder:14b` | `autotune pull qwen2.5-coder:14b` | Best open coding model at this size |
| 32 GB | `qwen3:30b-a3b` | `autotune pull qwen3:30b-a3b` | MoE: flagship quality with only ~3B parameters active per token |
| 64 GB+ | `qwen3:32b` | `autotune pull qwen3:32b` | Top dense open model |
| Reasoning | `deepseek-r1:14b` | `autotune pull deepseek-r1:14b` | Chain-of-thought; strong math and logic |

Run `autotune recommend` to get a personalised pick with scores for your exact hardware configuration.

---

## Features

| Feature | What happens |
|---------|-------------|
| **Dynamic KV sizing** | Computes the exact `num_ctx` each request needs; in the benchmarks above this averaged about 3× less KV cache than Ollama's fixed 4,096-token default |
| **KV prefix caching** | Pins system-prompt tokens via `num_keep` so they're never re-evaluated each turn |
| **Model keep-alive** | Sets `keep_alive=-1` so the model stays loaded between conversations — eliminates reload latency |
| **Adaptive KV precision** | Automatically downgrades F16 → Q8 under memory pressure before any slowdown occurs |
| **Flash attention** | Enables `flash_attn=true` on every request — reduces peak KV activation memory |
| **Prefill batching** | Sets `num_batch=1024` (2× Ollama default) — fewer Metal kernel dispatches for long prompts |
| **Context management** | Trims conversation history at token budget thresholds, always at sentence/paragraph boundaries |
| **Inference queue** | FIFO queue (1 concurrent, 8 waiting) with HTTP 429 back-pressure — prevents memory thrashing |
| **OpenAI-compatible API** | Drop-in server at `localhost:8765/v1` — works with any OpenAI SDK |
| **MLX backend** | On M-series Macs, routes inference to MLX-LM for native Metal GPU kernels |
| **Persistent memory** | Every conversation saved to SQLite; semantically searches past sessions at startup |
| **NoSwapGuard** | Exact-math pre-flight check using model architecture — computes precise KV bytes and reduces context + KV precision until the allocation fits in available RAM |

---

## Agentic workloads

Raw Ollama's fixed `num_ctx=4096` hurts most inside agent loops — where tool calls, observations, and reasoning steps accumulate. autotune sizes the session context once before the loop begins, holds it constant across all turns, and uses `num_keep` prefix caching so the system prompt is never re-evaluated after turn 1.

**Measured on `llama3.2:3b`, multi-turn tool-calling agent task:**

| Metric | Raw Ollama | autotune |
|--------|:----------:|:--------:|
| **Agent wall time** | **74 s** | **40 s (−46%)** |
| Model reloads per session | 0–1 | ~0 |
| Swap events | 1 of 3 trials | 0 |
| Tool call errors | 1 avg | 0 |
| Context tokens at session end | 3,043 | 1,946 (−36%) |
| TTFT trend per turn | grows | shrinks (prefix cache) |

For sessions with 3+ turns, prefix caching compounds — TTFT per turn falls as the conversation grows. Full methodology and raw data: [AGENT_BENCHMARK.md](AGENT_BENCHMARK.md)

---

## Chat commands

| Command | What it does |
|---------|-------------|
| `/help` | Show available commands |
| `/new` | Start a new conversation |
| `/history` | Show full conversation history |
| `/profile fast\|balanced\|quality` | Switch profile mid-conversation |
| `/model <id>` | Switch to a different model |
| `/system <text>` | Set or replace the system prompt |
| `/export` | Export conversation to Markdown |
| `/metrics` | Session stats: tok/s, TTFT, request count |
| `/recall` | Browse past conversations |
| `/recall search <query>` | Semantic search across all past sessions |
| `/pull <model>` | Pull a model from Ollama without leaving chat |
| `/quit` | Exit (also Ctrl-C) |

---

## Profiles

| Profile | Context | Temperature | KV precision | Best for |
|---------|--------:|:-----------:|:------------:|---------|
| `fast` ⚡ | 2,048 | 0.1 | Q8 | Quick lookups, autocomplete |
| `balanced` ⚖️ | 8,192 | 0.7 | F16 | General chat, coding |
| `quality` ✨ | 32,768 | 0.8 | F16 | Long documents, analysis |

---

## Apple Silicon (MLX)

```bash
pip install "llm-autotune[mlx]"
autotune mlx pull qwen3:8b        # download MLX-quantized model
autotune chat --model qwen3:8b    # automatically routes to MLX
autotune mlx list                 # show locally cached MLX models
```

MLX activates automatically on Apple Silicon — no configuration needed. Use Ollama-backed models when you need structured tool calls in agentic workflows.

---

## Docker — Ollama + autotune bundled

autotune ships pre-built images to Docker Hub for `linux/amd64` and `linux/arm64`. No local Python, no separate Ollama install.

### Quickstart from Docker Hub

```bash
# autotune server only — point it at your existing Ollama instance
docker run -p 8765:8765 \
  -e AUTOTUNE_OLLAMA_URL=http://host.docker.internal:11434 \
  tanavc1/llm-autotune:latest

# pin to a specific version
docker pull tanavc1/llm-autotune:1.2.0
```

### Bundled image — Ollama + autotune in one container

```bash
git clone https://github.com/tanavc1/local-llm-autotune.git
cd local-llm-autotune
docker compose --profile single up
```

Docker builds the bundled image (Ollama + autotune), starts both services, and exposes the API at `http://localhost:8765/v1`. Point any OpenAI-compatible client there.

### Auto-pull a model on first boot

```bash
OLLAMA_MODEL=qwen3:8b docker compose --profile single up
```

The container pulls `qwen3:8b` from Ollama's registry on first start, then begins serving. Subsequent runs skip the pull because the model is cached in the volume.

### Raw Docker (without Compose)

```bash
docker build -t autotune .
docker run -p 8765:8765 -v ollama_models:/root/.ollama autotune

# With a model:
docker run -p 8765:8765 -v ollama_models:/root/.ollama -e OLLAMA_MODEL=qwen3:8b autotune

# Also expose Ollama directly:
docker run -p 8765:8765 -p 11434:11434 -v ollama_models:/root/.ollama autotune
```

### docker-compose — separate Ollama + autotune services

```bash
docker compose --profile multi up
```

In this mode, Ollama and autotune run as separate services. autotune receives `AUTOTUNE_OLLAMA_URL=http://ollama:11434` so it routes to the Ollama service by name. The autotune service is built from a separate `Dockerfile.autotune` that contains only Python (~200 MB vs ~2 GB for the bundled image).

### Environment variables

| Variable | Default | Purpose |
|----------|---------|---------|
| `OLLAMA_MODEL` | _(empty)_ | Model to auto-pull on first container start |
| `AUTOTUNE_PORT` | `8765` | Port autotune binds inside the container |
| `OLLAMA_HOST` | `0.0.0.0` | Bind address passed to `ollama serve` inside the container |
| `AUTOTUNE_OLLAMA_URL` | `http://localhost:11434` | Where autotune reaches Ollama — set to `http://ollama:11434` for multi-container mode |
| `AUTOTUNE_REQUIRE_API_KEY` | `0` | Set to `1` to enforce API key auth on all `/v1/*` routes |
| `AUTOTUNE_ADMIN_KEY` | _(unset)_ | Bearer token for all `/admin/*` endpoints — 503 if unset when accessed |

### GPU support

The bundled image is built on `ollama/ollama:latest` which includes CUDA and ROCm layers. Mount the appropriate devices:

```bash
# NVIDIA GPU
docker run --gpus all -p 8765:8765 -v ollama_models:/root/.ollama autotune

# AMD GPU (ROCm)
docker run --device /dev/kfd --device /dev/dri -p 8765:8765 \
  -v ollama_models:/root/.ollama autotune
```

---

## Embedding autotune in your application

```python
import autotune
from openai import OpenAI

autotune.start()                             # spawns server if not running; blocks until ready
client = OpenAI(**autotune.client_kwargs())  # {"base_url": "http://localhost:8765/v1", "api_key": "local"}

response = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Hello"}],
)
```

`start()` checks `/health` first and returns immediately if the server is already running.

### Options

```python
autotune.start(
    host="localhost",
    port=8765,
    timeout=30.0,       # raise TimeoutError if server isn't ready within this many seconds
    profile="balanced", # "fast" | "balanced" | "quality"
    use_mlx=False,      # True = MLX on Apple Silicon (faster, no tool calls)
    log_level="warning",
)
```

### Error handling

```python
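from openai import APIStatusError

# `client` is the OpenAI client created in the snippet above.
try:
    response = client.chat.completions.create(
        model="qwen3:8b",
        messages=[{"role": "user", "content": "Hello"}],
    )
except APIStatusError as e:  # raised for non-2xx responses; carries the HTTP response
    # Structured error details live under "detail" in the JSON body.
    error = e.response.json().get("detail", {})
    kind = error.get("type")
    # if/elif instead of match so this also runs on Python 3.9
    if kind == "model_not_found":
        print(f"Run: autotune pull {error['model']}")
    elif kind == "memory_pressure":
        print("Not enough RAM. Try a smaller model or --profile fast.")
    elif kind == "backend_error":
        print(f"Backend error: {error['message']}\nSuggestion: {error['suggestion']}")
    else:
        raise  # unknown error type: let it propagate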

### Server RAM footprint

| Mode | Server RAM | Tool calling | Notes |
|------|-----------|:------------:|-------|
| `autotune.start()` (default) | ~94 MB | ✓ | Ollama-backed |
| `autotune.start(use_mlx=True)` | ~470 MB | ✗ | 10–40% faster on Apple Silicon |

---

## Agentic frameworks

autotune's OpenAI-compatible server is a drop-in local LLM backend for any framework that accepts a custom base URL.

```bash
autotune serve
```

### OpenClaw

```yaml
# openclaw/config.yaml
providers:
  - name: autotune-local
    api: openai-responses
    baseUrl: http://localhost:8765/v1
    apiKey: sk-local
    model: qwen3:8b
    supportsTools: true
```

### Hermes Agent

```yaml
# ~/.hermes/config.yaml
model:
  provider: custom
  base_url: http://localhost:8765/v1
  api_key: sk-local
  name: qwen3:8b
```

**Models confirmed for tool calling via Ollama:** `qwen3:8b`, `qwen3:14b`, `llama3.1:8b`, `qwen2.5-coder:14b`, `hermes3`

---

## How it works — all 14 optimizations

autotune sits between your code and Ollama as a transparent middleware layer. Every request passes through a stack of optimizations. Here's every one, explained plainly.

> **Full explanations with examples:** see below, or visit the [GitHub repo](https://github.com/tanavc1/local-llm-autotune#how-it-works--all-14-optimizations)

---

### The KV cache — the central concept

When an LLM generates text, every new token needs to "attend to" every previous token. To make that possible, the model computes two vectors for each token in each layer, called **K (keys)** and **V (values)**, and caches them in RAM so they don't have to be recomputed. This is the KV cache.

Its size is mathematically exact:
```
2 × n_layers × kv_heads × head_dim × num_ctx × bytes_per_element
```

For qwen3:8b at 4,096 context: **576 MB**. At 1,536 context: **216 MB**. The KV cache scales linearly with context length — that's the big lever.
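
As a quick sanity check, you can evaluate the formula directly. The sketch below assumes qwen3:8b has 36 layers, 8 KV heads, and a head dimension of 128 (these values are not stated in this README, so treat them as assumptions), stored as F16 at 2 bytes per element. `kv_cache_bytes` is an illustrative helper, not part of autotune's API.

```python
def kv_cache_bytes(n_layers: int, kv_heads: int, head_dim: int,
                   num_ctx: int, bytes_per_element: int = 2) -> int:
    """KV cache size in bytes: one K and one V tensor per layer, per context slot."""
    return 2 * n_layers * kv_heads * head_dim * num_ctx * bytes_per_element


# Assumed qwen3:8b architecture (not taken from this README).
QWEN3_8B = dict(n_layers=36, kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for ctx in (4096, 1536):
        mib = kv_cache_bytes(num_ctx=ctx, **QWEN3_8B) / 2**20
        print(f"num_ctx={ctx}: {mib:.0f} MiB")
```

Under those assumptions the helper reproduces the 576 MB and 216 MB figures above (counted in units of 2^20 bytes), and shows why halving `num_ctx` halves the cache.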

---

### Memory optimizations

**1. Dynamic context sizing** — *every request*

Ollama allocates the full KV cache before generating the first token, using whatever `num_ctx` you've configured — even if your actual prompt is 50 words. autotune computes the minimum context each request actually needs:

```
num_ctx = clamp(input_tokens + max_new_tokens + 256, 512, profile_max)
```

A typical balanced-profile message (22-token prompt + 1,024 reply + 256 buffer = 1,302 tokens) needs ~183 MB of KV cache instead of ~576 MB on qwen3:8b, or ~216 MB once the context is snapped to the 1,536-token bucket (optimization 6). No tokens are dropped: the context window grows naturally as the conversation grows.
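
The clamp is small enough to write out. This is a sketch of the formula as stated, with a hypothetical function name; the 256-token buffer and 512-token floor come from the formula above, and `profile_max` is the profile's context ceiling.

```python
def required_num_ctx(input_tokens: int, max_new_tokens: int, profile_max: int,
                     buffer: int = 256, floor: int = 512) -> int:
    """Smallest context that fits prompt + reply + buffer, clamped to [floor, profile_max]."""
    needed = input_tokens + max_new_tokens + buffer
    return max(floor, min(needed, profile_max))


if __name__ == "__main__":
    # 22-token prompt, 1,024-token reply budget, balanced profile (8,192 max)
    print(required_num_ctx(22, 1024, profile_max=8192))
```

Very short requests are lifted to the 512-token floor, and oversized ones are capped at the profile maximum.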

**2. KV cache precision control** — *per profile, adaptive*

KV elements can be stored as F16 (2 bytes each) or Q8 (1 byte each). Q8 halves the entire KV cache footprint with negligible quality impact. This is separate from model quantization — it only affects the temporary computation cache, not the model weights.

- `fast` profile: always Q8
- `balanced` / `quality`: F16 by default, Q8 under memory pressure

**3. NoSwapGuard — exact-math pre-flight guarantee** — *every request*

NoSwapGuard is autotune's hard guarantee against swap — and it works completely differently from the Live pressure system below. Where Live pressure uses RAM percentage as a heuristic, NoSwapGuard queries the actual model architecture from Ollama and computes the precise bytes the KV allocation will need:

```
kv_bytes = 2 × n_layers × kv_heads × head_dim × num_ctx × precision_bytes
```

If `kv_bytes + 1.5 GB safety margin > available RAM`, it fires — not because RAM looks "80% full", but because the exact number of bytes won't fit. It reduces in levels, applied in order until the math clears:

| Level | Action |
|-------|--------|
| 0 | Fits — no change |
| 1 | Trim context 25% |
| 2 | Halve context |
| 3 | Halve context + Q8 KV (saves ~50% more) |
| 4 | Quarter context + Q8 |
| 5 | Minimum (512 tokens) + Q8 — emergency floor |

The model's architecture (layers, KV heads, head dimension) is queried from Ollama's `/api/show` once and cached — every calculation is exact, not estimated. This is the fundamental difference from Live pressure's percentage tiers: NoSwapGuard knows precisely how many bytes are needed.
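
The level ladder can be sketched as a loop that tries each reduction in order and stops at the first one that fits. This is an illustration of the table, not autotune's source: the function names are hypothetical, the starting precision is assumed to be F16 (2 bytes), and "GB" is taken as 2^30 bytes.

```python
from typing import Tuple

SAFETY_MARGIN_BYTES = int(1.5 * 2**30)  # the 1.5 GB safety margin from the text
MIN_CTX = 512                           # emergency floor


def kv_bytes(n_layers: int, kv_heads: int, head_dim: int,
             num_ctx: int, precision_bytes: int) -> int:
    return 2 * n_layers * kv_heads * head_dim * num_ctx * precision_bytes


def no_swap_guard(num_ctx: int, available_bytes: int, *, n_layers: int,
                  kv_heads: int, head_dim: int) -> Tuple[int, int, int]:
    """Return (level, num_ctx, precision_bytes) for the first level that fits."""
    ladder = [
        (num_ctx, 2),           # 0: fits as requested, F16
        (num_ctx * 3 // 4, 2),  # 1: trim context 25%
        (num_ctx // 2, 2),      # 2: halve context
        (num_ctx // 2, 1),      # 3: halve context + Q8
        (num_ctx // 4, 1),      # 4: quarter context + Q8
        (MIN_CTX, 1),           # 5: minimum context + Q8
    ]
    for level, (ctx, precision) in enumerate(ladder):
        ctx = max(ctx, MIN_CTX)
        needed = kv_bytes(n_layers, kv_heads, head_dim, ctx, precision)
        if needed + SAFETY_MARGIN_BYTES <= available_bytes:
            return level, ctx, precision
    # Nothing fits: fall back to the emergency floor anyway.
    return 5, MIN_CTX, 1
```

For example, with an assumed 36-layer, 8-KV-head, 128-dim model and 2 GB available, an 8,192-token request lands on level 3 (4,096 tokens at Q8).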

**4. Live memory pressure response** — *every request, real-time*

Live pressure is autotune's proactive, heuristic RAM tier system — entirely separate from NoSwapGuard's exact-math approach above. It reads the OS's RAM utilization percentage before every request and adjusts context + KV precision according to fixed thresholds, firing well before any swap risk:

| RAM usage | Context | KV precision |
|-----------|---------|--------------|
| < 80% | full | profile default |
| 80–88% | −10% | profile default |
| 88–93% | −25% | F16 → Q8 |
| > 93% | halved | forced Q8 |

At 80% utilization there is still 20% RAM free — no swap danger — but autotune starts backing off preemptively. If Live pressure has already trimmed enough headroom, NoSwapGuard may not need to fire at all. Changes are reported in the chat interface. No user action needed.
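
The tier table maps directly onto a small lookup. In this sketch the function name is hypothetical, and the handling of the exact boundaries is an assumption: 80% and 88% are placed in the higher tier, and 93% stays in the third tier because the last row reads "> 93%".

```python
from typing import Tuple


def live_pressure_adjust(ram_percent: float, num_ctx: int,
                         profile_precision: str) -> Tuple[int, str]:
    """Return (adjusted num_ctx, KV precision) for the current RAM utilization."""
    if ram_percent < 80:
        return num_ctx, profile_precision              # full context, profile default
    if ram_percent < 88:
        return num_ctx * 90 // 100, profile_precision  # context -10%
    if ram_percent <= 93:
        return num_ctx * 75 // 100, "Q8"               # context -25%, F16 -> Q8
    return num_ctx // 2, "Q8"                          # context halved, forced Q8
```

Integer arithmetic keeps the result a whole number of tokens; a real implementation might round differently.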

**5. Pre-flight model fit analysis** — *before loading*

Before a model is loaded, autotune calculates whether it will fit: `model_weights + kv_cache(context, precision) + runtime_overhead`. It classifies the result as SAFE / MARGINAL / SWAP_RISK / OOM and sets a safe context ceiling. If the model is too heavy, it recommends a lighter quantization with the exact `autotune pull` command to run.

---

### Speed optimizations

**6. Context bucket snapping** — *every request*

After computing the minimum context, autotune snaps it to the nearest bucket from a fixed list: `[512, 768, 1024, 1536, 2048, 3072, 4096, 6144, 8192, 12288, 16384, 32768]`.

Why: Ollama caches the KV buffer for the most recently used context length. If `num_ctx` changes request-to-request (1,286 → 1,157 → 1,308), Ollama reallocates the Metal buffer on every call — even with the model already loaded. This "KV thrashing" adds 100–300 ms per request. Buckets eliminate it: prompts of 50–200 tokens all map to 1,536, Ollama allocates it once and reuses it forever.
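
A minimal sketch of the snapping step, assuming "snap" means rounding up to the smallest bucket that still fits the request (so no tokens are dropped) and capping at the largest bucket. `snap_to_bucket` is a hypothetical name.

```python
from bisect import bisect_left

CONTEXT_BUCKETS = [512, 768, 1024, 1536, 2048, 3072, 4096, 6144, 8192, 12288, 16384, 32768]


def snap_to_bucket(num_ctx: int) -> int:
    """Round up to the smallest bucket >= num_ctx, capped at the largest bucket."""
    index = bisect_left(CONTEXT_BUCKETS, num_ctx)
    return CONTEXT_BUCKETS[min(index, len(CONTEXT_BUCKETS) - 1)]


if __name__ == "__main__":
    # The three drifting sizes from the example above all share one bucket.
    print([snap_to_bucket(n) for n in (1286, 1157, 1308)])
```

Under that rounding-up assumption, the three drifting sizes in the example all resolve to 1,536, so the buffer would be allocated once.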

**7. System prompt prefix caching** — *multi-turn conversations*

Ollama re-processes the system prompt from scratch on every turn. autotune pins the system prompt tokens in the KV cache via `num_keep` — they're evaluated once at the start and never again. In agentic sessions with 10+ turns, this compounding effect means TTFT actually *falls* as the session grows.

**8. Model keep-alive** — *between sessions*

Ollama unloads a model after 5 minutes of idle time by default. autotune sets `keep_alive=-1` (keep loaded indefinitely) on every request. The model stays in RAM between conversations, eliminating the 1–4 second cold-reload cost you'd otherwise pay every time a session goes idle. This doesn't cost more RAM: the weights were already loaded, and it just keeps them committed.

**9. Flash attention** — *every request*

Passes `flash_attn: true` to Ollama. Flash attention computes attention in tiles rather than materializing the full N² attention matrix, dramatically reducing the peak activation memory spike during prefill. Zero quality impact — it's mathematically identical to standard attention. Models that don't support it silently ignore the flag.

**10. Larger prefill batch size** — *long prompts*

Sets `num_batch=1024` (Ollama default: 512). During prefill (processing your prompt), tokens are fed through the model in chunks. A 700-token prompt with the default takes 2 GPU passes; with 1024, it takes 1. Fewer passes = fewer Metal kernel dispatches = lower TTFT for any prompt over 512 tokens. Short prompts are unaffected.
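
Put together, optimizations 7 to 10 amount to a handful of fields on the request sent to Ollama. The sketch below only builds the request body; `build_chat_payload` is a hypothetical helper, and the exact way autotune assembles its requests is an assumption. `num_ctx`, `num_keep`, and `num_batch` are Ollama request options and `keep_alive` is a top-level request field; `flash_attn` is placed under `options` following the description above.

```python
from typing import Any, Dict, List


def build_chat_payload(model: str, messages: List[Dict[str, str]], *,
                       num_ctx: int, system_prompt_tokens: int) -> Dict[str, Any]:
    """Assemble an Ollama /api/chat request body with the options described above."""
    return {
        "model": model,
        "messages": messages,
        "stream": True,
        "keep_alive": -1,                      # 8: keep the model loaded between sessions
        "options": {
            "num_ctx": num_ctx,                # context sized for this request
            "num_keep": system_prompt_tokens,  # 7: pin the system-prompt prefix
            "num_batch": 1024,                 # 10: larger prefill batch
            "flash_attn": True,                # 9: as described in the text
        },
    }
```

You could POST the resulting dict as JSON to `http://localhost:11434/api/chat` to try the same settings against a local Ollama by hand.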

---

### Adaptive intelligence

**11. Hardware tuner** — *around each inference call*

Makes real OS-level changes before inference and restores them after:

- **macOS QoS class:** Sets the thread to `USER_INTERACTIVE`, the highest scheduling priority on macOS (same class as UI scrolling animations). The process gets more CPU time than background tasks.
- **Process priority (nice):** Raises the autotune and Ollama process priorities on macOS/Linux for better CPU scheduling.
- **Python GC disabled:** Python's garbage collector causes "stop the world" pauses of up to tens of milliseconds. Disabling it during inference eliminates hitches in streamed output.
- **Linux CPU governor:** Attempts to set the CPU to `performance` mode (full clock speed) during inference (requires root; silently skipped otherwise).
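
The "change before, restore after" pattern is easiest to see with the garbage-collector step. This is a minimal sketch using only the standard library; `gc_paused` is a hypothetical name, and the QoS, nice, and governor changes are left out.

```python
import gc
from contextlib import contextmanager


@contextmanager
def gc_paused():
    """Disable Python's cyclic garbage collector for the duration of the block."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        # Restore the previous state, even if inference raised.
        if was_enabled:
            gc.enable()
```

Wrapping a streaming call in `with gc_paused(): ...` keeps collection pauses out of the token stream and restores the previous setting afterwards.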

**12. Adaptive session advisor** — *live monitoring*

Continuously watches RAM%, swap activity, tokens/sec, and TTFT. Computes a 0–100 health score every 30 seconds. When the score drops below thresholds, takes the least-disruptive available action from an ordered list:

1. Reduce concurrency
2. Reduce context window
3. Lower KV precision (F16 → Q8)
4. Enable prompt caching
5. Disable speculative decoding
6. Lower quantization
7. Suggest switching to a smaller model

There's a 20-second cooldown between actions and a 90-second stability window before scale-up. The advisor attributes events — it knows whether a RAM spike was caused by loading a model, KV growth, or a background application.
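
The escalation logic can be sketched as an ordered list plus a cooldown timer. Everything here is illustrative: `SessionAdvisor` and the action identifiers are hypothetical names, the trigger score of 60 is an invented placeholder (the actual thresholds aren't documented here), and the 90-second scale-up window and event attribution are not modeled.

```python
import time
from typing import Callable, Optional

ACTIONS = [
    "reduce_concurrency",
    "reduce_context",
    "lower_kv_precision",
    "enable_prompt_caching",
    "disable_speculative_decoding",
    "lower_quantization",
    "suggest_smaller_model",
]


class SessionAdvisor:
    def __init__(self, threshold: float = 60.0, cooldown_s: float = 20.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold    # hypothetical trigger score, not from the README
        self.cooldown_s = cooldown_s  # 20-second cooldown between actions
        self.clock = clock
        self._next_action = 0
        self._last_action_at: Optional[float] = None

    def observe(self, health_score: float) -> Optional[str]:
        """Return the next action, or None if healthy, cooling down, or out of actions."""
        now = self.clock()
        if health_score >= self.threshold:
            return None
        if self._last_action_at is not None and now - self._last_action_at < self.cooldown_s:
            return None
        if self._next_action >= len(ACTIONS):
            return None
        action = ACTIONS[self._next_action]
        self._next_action += 1
        self._last_action_at = now
        return action
```

Injecting the clock makes the cooldown easy to test without sleeping.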

---

### Context & conversation

**13. Context compressor** — *long sessions*

As conversation history grows toward the context limit, autotune compresses older messages in four tiers:

```
< 55%  FULL          — all turns verbatim
55–75% RECENT+FACTS  — last 8 turns + structured facts for older
75–90% COMPRESSED    — last 6 turns (lightly compressed) + compact summary
> 90%  EMERGENCY     — last 4 turns (compressed) + one-line summary
```

Compression strategies (lightest first): strip noise → compress JSON blobs → shorten tool output (head + tail) → trim assistant messages (keep first paragraph + code blocks + last paragraph) → trim user messages (preserve intent). Low-value chatter is dropped first; code blocks and stack traces are always preserved. All cuts happen at sentence boundaries. Facts extraction is deterministic — no extra LLM call required.
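
Tier selection itself is a simple threshold check on how full the context is. A sketch with a hypothetical function name; treatment of the exact boundary values (55%, 75%, 90%) is an assumption.

```python
def compression_tier(used_tokens: int, context_limit: int) -> str:
    """Pick a compression tier from the fraction of the context window in use."""
    usage = used_tokens / context_limit
    if usage < 0.55:
        return "FULL"
    if usage < 0.75:
        return "RECENT+FACTS"
    if usage <= 0.90:
        return "COMPRESSED"
    return "EMERGENCY"
```

The compression strategies described above would then run only for tiers other than `FULL`.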

**14. Conversation memory & recall** — *across sessions*

Every conversation is saved to a local SQLite database (`~/.autotune/recall.db`). At the start of each new conversation, autotune searches your history for semantically relevant past context and quietly injects it as a system note.

- **Vector search (primary):** Uses `nomic-embed-text` (local, ~274 MB, runs in Ollama) to find semantically similar past exchanges — even if they use different words.
- **FTS5 keyword fallback:** Full-text search across all stored conversations when the embedding model isn't available.
- **Injection threshold:** Only injects if cosine similarity > 0.38 — conservative by design. Better to show nothing than irrelevant noise. Up to 3 memories injected, capped at 1,200 characters total.
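
The injection rule can be sketched as a filter over already-scored candidates. `select_memories` is a hypothetical helper; the 0.38 threshold, 3-item limit, and 1,200-character cap come from the list above, while truncating the last memory to fit the cap is an assumption.

```python
from typing import List, Tuple


def select_memories(scored: List[Tuple[float, str]], *, min_similarity: float = 0.38,
                    max_items: int = 3, max_chars: int = 1200) -> List[str]:
    """Pick the most similar memories that clear the threshold and fit the budget."""
    picked: List[str] = []
    budget = max_chars
    for score, text in sorted(scored, key=lambda item: item[0], reverse=True):
        if score <= min_similarity or len(picked) == max_items:
            break
        snippet = text[:budget]  # assumption: truncate the last memory to fit the cap
        if not snippet:
            break
        picked.append(snippet)
        budget -= len(snippet)
    return picked
```

If nothing clears the threshold the function returns an empty list, which matches the "better to show nothing" design.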

All data is local. Nothing is sent to any server.

---

### What doesn't change

- **Generation speed (tok/s):** Metal GPU-bound on Apple Silicon. autotune doesn't touch the generation loop. Benchmarks show ±2% variance — measurement noise.
- **Output quality:** Model weights, sampling parameters, and temperature are unchanged. `prompt_eval_count` is identical — no tokens are dropped or skipped.
- **Turn 1 in agentic sessions:** Pre-allocating a full session KV window makes turn 1 ~80% slower. From turn 2 onward, prefix-cache savings compound and total wall time comes out ~46% lower.

---

## Conversation memory

Every conversation is saved to a local SQLite database with full-text and vector similarity search. No flags required.

- **Automatic context injection** — at session start, autotune surfaces relevant facts from past conversations as a silent system note.
- **Session resume** — use `--conv-id <id>` to continue an exact past session with full context.
- **In-chat recall** — `/recall` to browse sessions; `/recall search <topic>` for semantic search.

| Path | Contents |
|------|----------|
| `~/.autotune/recall.db` | FTS5 + float32 vectors; turns, extracted facts |
| `~/Library/Application Support/autotune/autotune.db` | Hardware telemetry, run observations (macOS) |
| `~/.local/share/autotune/autotune.db` | Same (Linux) |
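
Because these are ordinary SQLite files, you can inspect them with Python's standard library. The sketch below only lists table names from `sqlite_master` and opens the file read-only; it makes no assumption about autotune's schema, and `list_tables` is a hypothetical helper name.

```python
import sqlite3
from pathlib import Path
from typing import List


def list_tables(db_path: Path) -> List[str]:
    """Return the table names in a SQLite file, opened read-only."""
    uri = db_path.expanduser().resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


# Example (only meaningful once autotune has created the file):
# print(list_tables(Path("~/.autotune/recall.db")))
```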

---

## Telemetry

```bash
autotune telemetry                    # last 20 inference runs
autotune telemetry --events           # notable events: swap spikes, OOMs
autotune telemetry --model qwen3:8b   # filter by model
```

**Anonymous cloud telemetry is opt-in and off by default:**

```bash
autotune telemetry --status    # check opt-in status
autotune telemetry --enable    # opt in
autotune telemetry --disable   # opt out
```

What is sent when opted in: CPU architecture, RAM size, GPU backend, tokens/sec, TTFT, context size, quantization label, session start/stop events. No hostnames, usernames, IP addresses, or conversation content. Data goes to a private Supabase instance and is never sold or shared.

The Supabase anon key embedded in the package is a public client token (INSERT-only, row-level security enforced). See [SECURITY.md](SECURITY.md) for a full explanation.

---

## Troubleshooting

**"error: externally-managed-environment" (Linux)**
→ Your Linux distro blocks `pip install` to the system Python (PEP 668). Install via `pipx` instead:
```bash
sudo apt install pipx          # Debian/Ubuntu; on other distros use the system package manager (e.g. sudo dnf install pipx)
pipx install llm-autotune
pipx ensurepath && exec $SHELL
```

**"`autotune: command not found`" after `pipx install`**
→ `~/.local/bin` was just added to your PATH, but the current shell session hasn't picked up the change yet. There's no need to open a new terminal; just run:
```bash
exec $SHELL
```
If that doesn't work, run `pipx ensurepath` first, then `exec $SHELL`.

**"Ollama is not running."**
→ autotune tries to start Ollama automatically. If that still fails, Ollama is probably not installed. Install it:
```bash
# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh
```
Or download the desktop app from [https://ollama.com/download](https://ollama.com/download).

**"No models found."**
→ Pull a model: `autotune pull qwen3:8b` or run `autotune recommend` for a hardware-matched suggestion.

**"Memory pressure — context 8192→6144 tokens"**
→ RAM is 88%+ full. Close other apps or switch to a smaller model.

**HTTP 429 — queue full**
→ Too many concurrent requests. Increase `AUTOTUNE_MAX_QUEUED` or wait for one to finish.
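
On the client side, a 429 can also be handled by retrying with exponential backoff. The helper below is a generic sketch, not part of autotune: `call_with_retry` is a hypothetical name, and `send` stands for whatever function performs your request and returns a `(status_code, body)` pair.

```python
import time
from typing import Callable, Tuple


def call_with_retry(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call `send` until it returns something other than HTTP 429."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 0.5 s, 1 s, 2 s, ...
    return status, body  # still 429 after the final attempt
```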

**First message is slow**
→ Expected. The model loads and the KV buffer initializes on the first request, so later messages skip that startup cost and respond much faster.

---

## CLI command reference

### Get started

| Command | What it does |
|---------|-------------|
| `autotune run <model>` | Pre-flight RAM check + chat in one step. Best first command for any new model. |
| `autotune chat --model <id>` | Start an optimized chat session with a model already installed. |
| `autotune hardware` | Scan CPU/RAM/GPU, show which models fit, and suggest apps to close for more RAM. |
| `autotune recommend` | Profile your hardware and recommend the best model+settings. Prints exact `autotune pull` commands. |

### Manage models

| Command | What it does |
|---------|-------------|
| `autotune ls` | List downloaded models with fit scores, safe context window, and recommended profile. |
| `autotune ps` | Show every model currently loaded in RAM across Ollama, MLX, and LM Studio. |
| `autotune pull [model]` | Download an Ollama model. Omit the name to browse hardware-aware recommendations. |
| `autotune models` | List local models with size, architecture, and quality tier. `--registry` shows autotune's full catalog. |
| `autotune unload [model]` | Release a model from memory immediately. Interactive picker if no model specified. |

### Deploy & integrate

| Command | What it does |
|---------|-------------|
| `autotune serve` | Start an OpenAI-compatible API server on `localhost:8765`. All optimizations applied automatically. |
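
Once the server is running, any OpenAI-style client should be able to talk to it. The sketch below uses only the standard library and assumes the server exposes the usual OpenAI path `/v1/chat/completions` and response shape; the helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8765/v1"  # default address from `autotune serve`


def build_chat_request(model: str, prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request (endpoint path is an assumption)."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(model, prompt), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example (requires `autotune serve` to be running and the model to be installed):
# print(chat("qwen3:8b", "Say hello in one sentence."))
```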

### Benchmarking & proof

| Command | Duration | What it does |
|---------|----------|-------------|
| `autotune proof -m <model>` | ~30 s | Quick head-to-head: raw Ollama vs autotune. Shows TTFT, KV RAM, swap events, RAM headroom. |
| `autotune proof-suite -m <model>` | ~10 min | 5-prompt statistical suite. Wilcoxon signed-rank + Cohen's d + 95% CI across multiple models. |
| `autotune bench -m <model>` | ~15 min | Intensive multi-prompt benchmark with `--duel`, `--raw`, and `--compare` modes. |
| `autotune user-bench -m <model>` | ~30 min | Real-world UX benchmark: swap events, TTFT consistency, CPU spikes, RAM headroom, 0–100 score. |
| `autotune agent-bench` | ~1–2 h | Agentic multi-turn benchmark across 5 tasks. Shows how TTFT grows from turn to turn as the session gets longer. |

```bash
# Typical proof workflow
autotune proof -m qwen3:8b                    # quick check (~30s)
autotune proof-suite -m qwen3:8b --runs 5     # statistical confirmation
autotune user-bench -m qwen3:8b --quick       # does it feel better?
```

**Key flags for `autotune proof`:**

| Flag | Default | Description |
|------|---------|-------------|
| `--model, -m` | auto | Ollama model ID. Auto-selects if omitted. |
| `--runs, -r` | `2` | Runs per condition. 3+ gives stabler numbers. |
| `--profile, -p` | `balanced` | autotune profile to test. |
| `--output, -o` | `proof_<model>.json` | Save JSON results. |
| `--list-models` | — | Print installed models and exit. |

### Conversation memory

| Command | What it does |
|---------|-------------|
| `autotune memory search "<query>"` | Search past conversations by meaning (vector) or keyword (FTS5 fallback). |
| `autotune memory list` | List recently stored memory chunks with timestamps and model names. |
| `autotune memory stats` | Show total chunks, vector coverage, DB size, date range, and per-model counts. |
| `autotune memory forget <id>` | Delete a specific memory chunk. `--all` wipes everything (with confirmation). |
| `autotune memory setup` | Pull `nomic-embed-text` (~274 MB) to enable semantic vector search. |

```bash
autotune memory setup                          # one-time: enable semantic search
autotune memory search "FastAPI auth"          # find relevant past sessions
autotune memory list --days 7                  # recent memories
autotune memory forget 42                      # remove a specific chunk
```

### Apple Silicon (MLX)

| Command | What it does |
|---------|-------------|
| `autotune mlx list` | List MLX models already cached locally. |
| `autotune mlx pull <model>` | Download MLX-quantized model from mlx-community on HuggingFace. Accepts Ollama names. |
| `autotune mlx resolve <model>` | Show which HuggingFace MLX model ID would be used for a given Ollama name. |

MLX is typically 10–40% faster than Ollama on the same model on Apple Silicon, because it is built natively around Apple's unified memory and Metal GPU kernels.

```bash
autotune mlx pull qwen3:8b                     # download 4-bit MLX version
autotune mlx pull qwen2.5-coder:14b --quant 8bit
autotune serve --mlx                           # start API server using MLX backend
```

### Settings & diagnostics

| Command | What it does |
|---------|-------------|
| `autotune telemetry` | View recent inference runs (TTFT, tok/s, RAM, swap, CPU). |
| `autotune telemetry --enable` | Opt in to anonymous telemetry (hardware fingerprint + perf data). |
| `autotune telemetry --disable` | Opt out. No further data sent. |
| `autotune telemetry --status` | Show current consent status. |
| `autotune storage on\|off\|status` | Enable/disable local SQLite storage of performance observations. |
| `autotune doctor` | Full health check: Python, packages, Ollama connectivity, RAM/swap, DB health. |

---

## Architecture

```
autotune/
├── ttft/          ← TTFT optimisation (start here for latency work)
│   └── optimizer.py    TTFTOptimizer: dynamic num_ctx + keep_alive + num_keep
│
├── api/           ← Inference pipeline
│   ├── server.py       FastAPI server — OpenAI-compatible /v1 + FIFO queue
│   ├── chat.py         Terminal REPL with adaptive RAM + live stats
│   ├── kv_manager.py   KV options builder: flash_attn, num_batch, pressure tiers
│   ├── model_selector.py   Pre-flight fit analysis
│   └── backends/       Ollama, MLX, LM Studio, HuggingFace Inference API
│
├── context/       ← Context window management
│   ├── window.py       ContextWindow orchestrator
│   ├── budget.py       Tier thresholds (FULL → RECENT+FACTS → COMPRESSED → EMERGENCY)
│   ├── classifier.py   Message value scoring
│   ├── compressor.py   Tool output + long-content compression
│   └── extractor.py    Deterministic fact extraction
│
├── recall/        ← Conversation memory
│   ├── store.py        SQLite WAL: FTS5 full-text + float32 cosine vectors
│   └── manager.py      save / search / list conversations
│
├── db/            ← Persistence
│   └── store.py        SQLite: models, hardware, run_observations, telemetry_events
│
├── hardware/      ← Hardware detection
│   ├── profiler.py     CPU/GPU/RAM detection
│   └── ram_advisor.py  Real-time RAM pressure advice
│
├── memory/        ← Memory estimation + no-swap guarantee
│   ├── estimator.py    Model weights + KV + runtime overhead
│   └── noswap.py       NoSwapGuard: adjusts num_ctx to guarantee no swap
│
└── cli.py         ← Entry point (Click)
```
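
To give a feel for what a memory estimator and a no-swap guard have to compute, here is the standard back-of-the-envelope formula for KV cache size. It is a sketch only: the function names are hypothetical, the model dimensions in the comments are example values rather than any specific model's, and autotune's `estimator.py` may account for additional overheads.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    num_ctx: int,
    bytes_per_element: float = 2.0,  # 2.0 for F16, roughly 1.0 for Q8
) -> int:
    """Approximate KV cache size: one K and one V tensor per layer, per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_element
    return int(per_token * num_ctx)


def max_ctx_for_budget(
    budget_bytes: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_element: float = 2.0,
) -> int:
    """Largest num_ctx whose KV cache fits in the given memory budget."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_element
    return int(budget_bytes // per_token)


# Example with hypothetical dimensions: 32 layers, 8 KV heads, head_dim 128, F16.
# Each token costs 128 KiB, so an 8192-token window needs about 1 GiB of KV cache.
```

Under this formula, shrinking `num_ctx` or halving `bytes_per_element` reduces KV memory proportionally, which is why context size and KV precision are the natural levers when RAM is tight.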

---

## Support development

[![GitHub Sponsors](https://img.shields.io/github/sponsors/tanavc1?label=Sponsor&logo=GitHub&color=EA4AAA)](https://github.com/sponsors/tanavc1)

autotune is free and MIT-licensed. If it saves you time or compute costs, sponsoring helps keep it maintained and worth improving.

**What sponsorships actually cover:**

| Cost | Monthly |
|------|---------|
| Supabase (anonymous telemetry DB) | $0 — free tier |
| Vercel (autotunellm.com hosting) | $0 — hobby plan |
| Domain (autotunellm.com) | $1.25 |
| **Total infrastructure** | **~$1.25 / month** |

The real cost is developer time. All development is done on evenings and weekends. Sponsorships are a direct signal that the project is worth that time.

**Suggested tiers** (set up at [github.com/sponsors/tanavc1](https://github.com/sponsors/tanavc1)):

| | What it means |
|-|--------------|
| **$3 / month** — Supporter | You use it and it works. Thank you. |
| **$10 / month** — Regular | Your name in the README sponsors list. |
| **$25 / month** — Patron | Priority issue responses + your name in the README. |

No features are locked behind sponsorship. Everything in this repo is and will remain free.

---

## Contributing & support

Bug reports and pull requests welcome. Open an issue on GitHub or email [autotunellm@gmail.com](mailto:autotunellm@gmail.com).

For security vulnerabilities, see [SECURITY.md](SECURITY.md) — please do not open a public issue.

## License

MIT
