Metadata-Version: 2.4
Name: diskllm
Version: 0.1.0
Summary: Ollama-style GGUF runner with SSD-backed llama.cpp KV cache
Author: DiskLLM contributors
License-Expression: MIT
Project-URL: Homepage, https://github.com/shivnathtathe/diskllm
Project-URL: Repository, https://github.com/shivnathtathe/diskllm
Project-URL: Issues, https://github.com/shivnathtathe/diskllm/issues
Keywords: llm,gguf,llama.cpp,ssd,kv-cache,openai-compatible
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"
Provides-Extra: examples
Requires-Dist: openai>=1.0.0; extra == "examples"
Dynamic: license-file
Dynamic: requires-python

# DiskLLM

Run a 7B LLM at 65K context in 258 MB of private RAM.
Stock llama.cpp needs about 6,000 MB.
DiskLLM keeps the KV cache on your SSD instead.

![DiskLLM architecture](docs/architecture.png)

DiskLLM is an Ollama-style Python CLI and patched llama.cpp backend that keeps the KV cache on SSD through `mmap` instead of allocating the full cache in private RAM. It launches an OpenAI-compatible `llama-server`, downloads GGUF models from HuggingFace, and persists context sessions across restarts.

## Results

| Model | Context | Stock RAM | DiskLLM RAM | Reduction | tok/s |
|-------|---------|-----------|-------------|-----------|-------|
| Qwen2.5 3B | 65K | ~2,000 MB | ~200 MB | 10x | 8.9 |
| Qwen2.5 7B | 65K | ~6,000 MB | 258 MB | 23x | 4.5 |
| Qwen2.5 7B | 256K | 14,900 MB | 258 MB | **57x** | 2.5 |
| LFM2.5 8B | 128K | 1,845 MB | 172 MB | 10.7x | 10.0 |
| Qwen2.5 14B | 65K | ~10,000 MB | 289 MB | 34x | 1.7 |
| Qwen2.5 32B | 65K | ~20,000 MB | 424 MB | 47x | 0.1 |

| Metric | Value |
|--------|-------|
| Max RAM reduction | **57x** (256K context) |
| Min private RAM | **172 MB** (LFM2.5 8B) |
| Best tok/s | **10.0** (LFM2.5 8B) |
| Largest model run | **32B** on 2.5 GB free RAM |
| KV cache on SSD | **112 MB** (vs GBs in RAM) |
| Session file size | **7.33 MB** (persists across restarts) |

## Quick Start

```powershell
pip install -e .
diskllm pull qwen2.5:7b
diskllm run qwen2.5:7b --session demo --headless
diskllm chat qwen2.5:7b --session demo --no-start
```

The server exposes an OpenAI-compatible API at `http://127.0.0.1:8080/v1`.

## How It Works

Stock llama.cpp allocates KV tensors in RAM. At long context, that KV allocation can dominate system memory even when model weights are memory-mapped.
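
A rough estimate shows how quickly this grows. The sketch below computes KV cache size as 2 (keys and values) x layers x KV heads x head dimension x bytes per element x context length. The architecture numbers are assumptions taken from the public Qwen2.5-7B config (28 layers, 4 KV heads, head dimension 128) with an f16 cache. They are not DiskLLM measurements, and a real process also allocates compute and I/O buffers.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Estimate KV cache size: keys and values for every layer and position."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * n_ctx

# Assumed Qwen2.5-7B shape (from the public model config), f16 cache.
QWEN25_7B = dict(n_layers=28, n_kv_heads=4, head_dim=128)

for n_ctx in (65_536, 262_144):
    mib = kv_cache_bytes(n_ctx=n_ctx, **QWEN25_7B) / 2**20
    print(f"{n_ctx:>7} tokens -> {mib:,.0f} MiB of KV cache")
```

Under these assumptions the cache alone is 3,584 MiB at 65K tokens and 14,336 MiB at 256K. It scales linearly with context while the weight size stays fixed.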

DiskLLM adds a patched llama.cpp `--kv-backend ssd` mode. The backend creates a CPU-addressable `mmap` file under `~/.diskllm/kv_cache`, keeps only the active attention window hot, and lets the OS page the rest through NVMe storage. The patch is wired through dense KV, SWA, hybrid, and iSWA cache paths.
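
The underlying idea can be illustrated with Python's standard `mmap` module. This is a toy sketch, not the patched llama.cpp backend (which is C++): the file name, `SLOT_BYTES`, and the helper functions are hypothetical. It shows a file-backed buffer that is addressed like memory while the OS decides which pages stay resident.

```python
import mmap
import os
import tempfile

SLOT_BYTES = 4096  # hypothetical size of one token's KV slot

def open_kv_file(path, n_slots):
    """Create a file of n_slots * SLOT_BYTES and map it into memory."""
    with open(path, "wb") as f:
        f.truncate(n_slots * SLOT_BYTES)  # sparse where the filesystem allows
    f = open(path, "r+b")
    return f, mmap.mmap(f.fileno(), 0)

def write_slot(buf, index, payload):
    """Store payload at the start of slot `index`."""
    start = index * SLOT_BYTES
    buf[start:start + len(payload)] = payload

def read_slot(buf, index, size):
    """Read `size` bytes from the start of slot `index`."""
    start = index * SLOT_BYTES
    return bytes(buf[start:start + size])

path = os.path.join(tempfile.mkdtemp(), "kv_demo.mmap")
f, buf = open_kv_file(path, n_slots=1024)
write_slot(buf, 1000, b"token-1000")
buf.flush()  # ask the OS to write dirty pages back to the file
roundtrip = read_slot(buf, 1000, 10)
buf.close()
f.close()
print(roundtrip)
```

Only the pages that are actually touched need physical RAM. The rest of the 4 MiB file lives on disk until it is accessed.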

`diskllm run` always starts the patched server with:

```text
-ngl 0
--kv-backend ssd
--kv-path ~/.diskllm/kv_cache
--kv-window 2048
--host 0.0.0.0
--port 8080
--no-repack
```
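
For readers scripting around the CLI, the flag list above could be assembled like this. This is a minimal sketch: `build_server_args` is a hypothetical helper, not DiskLLM's actual code, and the `-m` model flag is an assumption based on the standard llama.cpp command line.

```python
from pathlib import Path

def build_server_args(model_path, home=None):
    """Return the llama-server argument list described above (illustrative)."""
    home = Path(home) if home else Path.home() / ".diskllm"
    return [
        "-m", str(model_path),   # assumption: standard llama.cpp model flag
        "-ngl", "0",             # no layers offloaded to a GPU
        "--kv-backend", "ssd",
        "--kv-path", str(home / "kv_cache"),
        "--kv-window", "2048",
        "--host", "0.0.0.0",
        "--port", "8080",
        "--no-repack",
    ]

args = build_server_args("model.gguf")
print(" ".join(args))
```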

When `--session NAME` is used, DiskLLM also passes `--slot-save-path ~/.diskllm/sessions` and uses llama-server slot save/restore APIs to persist context as `~/.diskllm/sessions/<name>.mmap`.
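
The slot save/restore calls can be sketched with the standard library. This assumes the upstream llama-server endpoint shape (`POST /slots/{id}?action=save|restore` with a JSON `filename`). That the DiskLLM CLI uses slot 0, and `slot_request` itself, are assumptions for illustration.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080"

def slot_request(action, session, slot_id=0, base_url=BASE_URL):
    """Build a POST for llama-server's slot save/restore endpoint."""
    if action not in ("save", "restore"):
        raise ValueError(f"unsupported action: {action}")
    # The filename is resolved relative to --slot-save-path on the server.
    body = json.dumps({"filename": f"{session}.mmap"}).encode()
    return urllib.request.Request(
        f"{base_url}/slots/{slot_id}?action={action}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = slot_request("save", "demo")
print(req.get_method(), req.full_url)
# To send it against a running server:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```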

Read more in [docs/how-it-works.md](docs/how-it-works.md).

## Benchmarks

Benchmarks were run CPU-only on Windows 11 with an Intel i5-12450H, 16 GB of RAM, and an NVMe SSD.

| Test | Result |
|------|--------|
| Qwen2.5 7B, 65K context, headless startup | 258 MB private RAM |
| Qwen2.5 7B, restored session startup | 3.46 s to healthy, plus restore time |
| Qwen2.5 7B, restored prompt reuse | 34 cached prompt tokens |
| Qwen2.5 7B, restored generation | 7.13 tok/s |

See [docs/benchmarks.md](docs/benchmarks.md) for the full tables and hardware notes.

## Model Registry

```text
qwen2.5:3b   bartowski/Qwen2.5-3B-Instruct-GGUF Q4_K_M
qwen2.5:7b   bartowski/Qwen2.5-7B-Instruct-GGUF Q4_K_M
qwen2.5:14b  bartowski/Qwen2.5-14B-Instruct-GGUF Q4_K_M
lfm2.5:8b    LiquidAI/LFM2.5-8B-A1B-GGUF Q4_K_M
```

DiskLLM can also pull any HuggingFace GGUF repo. Without a suffix it selects a quant automatically; append `:QUANT` to request a specific one:

```powershell
diskllm pull bartowski/Qwen2.5-7B-Instruct-GGUF
diskllm pull bartowski/Qwen2.5-7B-Instruct-GGUF:Q5_K_M
```
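
The two naming styles (registry alias and `repo:QUANT`) can be told apart with a small resolver. This is a hypothetical sketch of how such parsing might work, not DiskLLM's implementation. The registry entries are copied from the table above, and `None` stands for "let the tool pick a quant".

```python
REGISTRY = {
    "qwen2.5:3b": ("bartowski/Qwen2.5-3B-Instruct-GGUF", "Q4_K_M"),
    "qwen2.5:7b": ("bartowski/Qwen2.5-7B-Instruct-GGUF", "Q4_K_M"),
    "qwen2.5:14b": ("bartowski/Qwen2.5-14B-Instruct-GGUF", "Q4_K_M"),
    "lfm2.5:8b": ("LiquidAI/LFM2.5-8B-A1B-GGUF", "Q4_K_M"),
}

def resolve_model(ref):
    """Return (repo, quant); quant is None when it should be auto-selected."""
    if ref in REGISTRY:  # aliases are checked first because they also contain ':'
        return REGISTRY[ref]
    repo, sep, quant = ref.partition(":")
    return repo, (quant if sep else None)

for ref in ("qwen2.5:7b",
            "bartowski/Qwen2.5-7B-Instruct-GGUF",
            "bartowski/Qwen2.5-7B-Instruct-GGUF:Q5_K_M"):
    print(ref, "->", resolve_model(ref))
```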

## Storage

```text
~/.diskllm/models/      downloaded GGUF files
~/.diskllm/kv_cache/    SSD KV mmap files
~/.diskllm/sessions/    persistent slot KV sessions
~/.diskllm/config.json  settings and installed model index
~/.diskllm/logs/        background server logs
~/.diskllm/bin/         patched llama-server binary
```

Override the home directory with `DISKLLM_HOME`.
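
A script that needs the same paths could resolve them as follows. This sketch assumes `DISKLLM_HOME` is read as an environment variable and that the layout matches the listing above. `diskllm_home` and `storage_layout` are hypothetical helpers.

```python
import os
from pathlib import Path

SUBDIRS = ("models", "kv_cache", "sessions", "logs", "bin")

def diskllm_home(env=os.environ):
    """Use DISKLLM_HOME when set, otherwise ~/.diskllm."""
    override = env.get("DISKLLM_HOME")
    return Path(override) if override else Path.home() / ".diskllm"

def storage_layout(env=os.environ):
    """Map each storage name to its path under the resolved home."""
    home = diskllm_home(env)
    paths = {name: home / name for name in SUBDIRS}
    paths["config"] = home / "config.json"
    return paths

for name, path in storage_layout().items():
    print(f"{name:<9} {path}")
```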

## Patched llama.cpp

The `llama.cpp/` directory is tracked as a git submodule containing the patched fork. DiskLLM finds the patched server binary in this order:

1. `~/.diskllm/bin/llama-server.exe`
2. `DISKLLM_LLAMA_SERVER`
3. `llama_server` in `~/.diskllm/config.json`
4. bundled package path `diskllm/bin/windows/llama-server.exe`
5. development checkout path `llama.cpp/build/bin/llama-server.exe`

If a binary is found outside `~/.diskllm`, DiskLLM copies it into `~/.diskllm/bin` and runs that copy.
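
The lookup order above amounts to a first-match search over candidate paths. The sketch below is illustrative only: `find_llama_server` and its parameters are hypothetical, `DISKLLM_LLAMA_SERVER` is assumed to be an environment variable, and the copy step into `~/.diskllm/bin` is omitted.

```python
import json
import os
import tempfile
from pathlib import Path

def find_llama_server(home, package_dir, repo_dir, env=os.environ):
    """Return the first existing candidate, in the documented order, or None."""
    home = Path(home)
    candidates = [home / "bin" / "llama-server.exe"]
    if env.get("DISKLLM_LLAMA_SERVER"):
        candidates.append(Path(env["DISKLLM_LLAMA_SERVER"]))
    config = home / "config.json"
    if config.is_file():
        configured = json.loads(config.read_text()).get("llama_server")
        if configured:
            candidates.append(Path(configured))
    # package_dir is the installed `diskllm` package directory.
    candidates.append(Path(package_dir) / "bin" / "windows" / "llama-server.exe")
    candidates.append(Path(repo_dir) / "llama.cpp" / "build" / "bin" / "llama-server.exe")
    return next((p for p in candidates if p.is_file()), None)

# Demo in a throwaway directory: only the env-var candidate exists.
root = Path(tempfile.mkdtemp())
exe = root / "custom-llama-server.exe"
exe.touch()
found = find_llama_server(root / "home", root / "pkg", root,
                          env={"DISKLLM_LLAMA_SERVER": str(exe)})
print(found)
```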

## OpenAI Client

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="diskllm")
reply = client.chat.completions.create(
    model="diskllm",
    messages=[{"role": "user", "content": "What is DiskLLM?"}],
)
print(reply.choices[0].message.content)
```

More examples are in [examples/](examples/).

## Contributing

Contributions are welcome. Start with [CONTRIBUTING.md](CONTRIBUTING.md), keep changes small, and include benchmark or test evidence for runtime changes.

## License

DiskLLM is released under the MIT License. See [LICENSE](LICENSE).
