{% extends "_base.html" %} {% block content %}
Settings · Server

Model server

Pick a local OptiQ quant and (optionally) MTP speculation, click Apply, and the Lab swaps the running model without restarting. Switching takes ~5-30 seconds depending on model size.

Currently loaded

Model
API port{{ api_port }}
Server status
Spec decoding
Prompt cache budget
Sampler (from model defaults)

Switch model

{# Source tabs: Local / Published / Custom #}
{# Source: local #} {# Source: published #} {# Source: custom #}
Depth 1 is optimal on Apple Silicon. Deeper hurts because Metal's K-token verify scales near-linearly with K (see the MTP guide).
Auto-suggested for the selected Gemma-4 target. Loads a separate small -assistant drafter alongside the host model, γ=1 greedy. Leave blank to skip, or type any other HF id to override. See the MTP guide.
Caps mlx-lm's LRUPromptCache to prevent long-context OOMs. Higher = more prefix-reuse hits at the cost of RAM; lower = safer on small machines. Default {{ default_pc_gb }} GB (15 % of system RAM).
Streams the active experts of a mixture-of-experts quant from SSD instead of holding every expert resident, so models like Qwen3.6-35B-A3B (21 GB) run on a 24 GB Mac at a few GB. Decode is slower (~3 tok/s); only affects MoE models. Auto kicks in only when the model won't fit.
temp
top_p
top_k
min_p
Model defaults from generation_config.json apply for blank fields. Use temp=0 for greedy / deterministic output (gives the largest MTP speedup).
Pick from local adapters discovered under models/, paste a path from elsewhere, or add multiple. One adapter routes through mlx-lm's classic --adapter-path boot; two or more activate OptiQ's mounted-LoRA mode where the base model stays loaded and clients pick an adapter per request via the adapters field in the body (adapter name = the directory's basename). Switching is instant — one base in RAM, ~30 MB per extra adapter, gated by a ContextVar in the forward pass.
{% endblock %}