Metadata-Version: 2.4
Name: clanker-gguf
Version: 1.1.0
Summary: LLM Turbo-Optimizer CLI. detect hardware, parse GGUF models, generate optimized llama-server commands.
Author: Clanker Contributors
License: MIT
Project-URL: Homepage, https://github.com/Stavros-alt/clanker
Keywords: llm,gguf,llama.cpp,nvidia,cli
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: psutil>=5.9
Requires-Dist: huggingface_hub>=0.20
Requires-Dist: gguf>=0.10
Requires-Dist: nvidia-ml-py>=12.535
Provides-Extra: dev
Requires-Dist: ruff; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"

# clanker

generates llama-server commands so you don't have to guess how many layers
fit on your gpu.  reads the gguf header, does the math, prints the command.

i got tired of trial-and-error vram tuning for moe models on consumer gpus.

## Setup

requires **python 3.10+** and `pip`.

```bash
pip install clanker-gguf
```

or install from source:
```bash
pip install git+https://github.com/Stavros-alt/clanker.git
pip install "git+https://github.com/Stavros-alt/clanker.git#egg=clanker-gguf[all]"
```

or clone and install locally:
```bash
git clone https://github.com/Stavros-alt/clanker.git
cd clanker
pip install -e ".[all]"
```

**you also need:**
- `nvidia-smi` on PATH (cpu-only mode works without it)
- `llama-server` from [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp)
  (build it with `clanker build`, or install manually)
- windows? good luck.  this probably works on wsl.

## Usage

### `clanker run <model>`

```bash
clanker run ~/models/qwen.gguf          # direct path
clanker run qwen                         # fuzzy cache search
clanker run qwen.gguf --preset speed     # apply a preset
clanker run qwen.gguf --context 65536    # override context
clanker run qwen.gguf --execute          # run it
```

defaults to 8K context, q8_0 KV cache, no flash attention.
pass `--preset big-brain` for the optimized 128k config.

### `clanker download <repo>`

```bash
clanker download unsloth/Qwen3.6-35B-A3B-GGUF            # scan only
clanker download unsloth/Qwen3.6-35B-A3B-GGUF --execute   # download
clanker download unsloth/Qwen3.6-35B-A3B-GGUF -c 32768    # budget for 32k ctx
```

picks the highest-quality quant that fits your ram, with preference for
models whose backbone fits on gpu.  `--fit` handles the offloading at
runtime.

**quality warnings:** if the best available quant is below IQ4_XS
(the kneedle knee point from KLD benchmarks), clanker aborts and
tells you to use a smaller model or a different repo.  accuracy
drops off a cliff below IQ4_XS (78.4% top-1 @ 16.4 gb).

**unsloth _XL warning:** some unsloth _XL quants contain f16 tensors
that crash ik_llama.cpp.  clanker warns and lets you filter out the
_risky quant to pick the next best.

### Other commands

```bash
clanker discover         # scan hf cache for gguf models
clanker discover --json  # json output
clanker ls               # alias for discover
clanker search           # open hf search with hardware-tuned bounds
clanker presets          # list all presets with their settings
clanker build            # clone and compile ik_llama.cpp
```

`clanker search` caps results by vram budget at IQ4_XS quality.
if your kv cache eats all the vram (128k ctx on 12gb cards),
it warns that models will offload entirely to ram.

## Presets

all presets enable mtp speculative decoding when the model supports it.

| Name | Context | KV Cache | Notes |
|------|---------|----------|-------|
| `big-brain` | 128K | k=q8_0, v=q5_1 | mlock, no-mmap, flash-attn |
| `speed` | 32K | k=q6_0, v=q5_0 | mlock, flash-attn |
| `infinite` | 512K | k=q5_0, v=q4_1 | no-mmap, flash-attn |
| `coding` | 64K | k=q8_0, v=q5_1 | mlock, flash-attn, temp=0.2 |

kv cache quants are from KLD benchmarks (anbeeld).  mixed k/v pairs
outperform symmetric ones on the pareto frontier.

## How it works

1. reads your gpu vram via nvidia-smi
2. detects physical cpu cores (uses cores - 1 for thread count)
3. parses the gguf header for architecture, layers, experts, quant type
4. figures out layer split, kv cache size, fit-margin based on available vram
5. spits out a llama-server command with thread tuning, ubatch tuning, and mlock

## Notes

- `--fit --fit-margin N` means "keep N MB of vram free".  base is 1664 for
  ik_llama, 4608 for mainline.  mtp adds 2048 mb overhead.
- thread count is `physical_cores - 1` (min 1).  keeps one core free for
  gpu scheduling and os tasks.
- `-ub 2048` sets the micro-batch size for max memory-bandwidth utilization
  during prompt prefill.
- mtp flags are only added when the model file has "mtp" in its name.
- the downloader tries `llama-cli -hf` first, falls back to
  `huggingface_hub`.  if llama-cli downloads then ooms, clanker detects
  the existing file and skips re-download.
- quantization detection from gguf metadata is unreliable (returns int
  enums).  filenames are more accurate, so clanker parses the filename.
- combined expert tensors (qwen3, deepseek) are handled by dividing total
  size by expert_count from metadata.
- `--backend llama` switches to mainline llama.cpp flags (spec-type,
  spec-draft-n-max, etc).  default is ik_llama.
