{% extends "_base.html" %} {% block content %}
Settings · Server

Model server

Pick a local OptIQ quant and (optionally) MTP speculation, click Apply, and the Lab swaps the running model without restarting. Switching takes ~5-30 seconds depending on model size.

Currently loaded

Model
API port {{ api_port }}
Server status
Spec decoding
Prompt cache budget
Sampler (from model defaults)

Switch model

{# Source tabs: Local / Published / Custom #}
{# Source: local #} {# Source: published #} {# Source: custom #}
Depth 1 is optimal on Apple Silicon. Deeper speculation hurts throughput because the cost of Metal's K-token verify step scales near-linearly with K (see the MTP guide).
Auto-suggested for the selected Gemma-4 target. Loads a separate small -assistant drafter alongside the host model and drafts greedily with γ=1. Leave blank to skip, or type any other HF id to override. See the MTP guide.
Caps mlx-lm's LRUPromptCache to prevent out-of-memory failures on long contexts. A higher budget gives more prefix-reuse hits at the cost of RAM; a lower one is safer on small machines. Default: {{ default_pc_gb }} GB (15% of system RAM).
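For reference, the 15% default can be reproduced by hand. This is a minimal sketch, not the Lab's actual code: the function names are hypothetical, and it assumes "GB" means GiB and that the value is rounded to one decimal place.

```python
import os

GIB = 1024 ** 3  # assumption: the budget is expressed in GiB


def system_ram_bytes() -> int:
    """Total physical RAM via POSIX sysconf (available on Linux and macOS)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def default_prompt_cache_gb(total_ram_bytes: int, fraction: float = 0.15) -> float:
    """Hypothetical helper: the share of system RAM used as the default budget."""
    return round(total_ram_bytes / GIB * fraction, 1)


if __name__ == "__main__":
    print(f"Suggested budget: {default_prompt_cache_gb(system_ram_bytes())} GB")
```

On a 64 GB machine this suggests 9.6 GB; on a 16 GB machine, 2.4 GB.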
temp
top_p
top_k
min_p
Blank fields fall back to the model defaults from generation_config.json. Use temp=0 for greedy, deterministic output (this gives the largest MTP speedup).
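The blank-field fallback can be pictured roughly as follows. This is an illustrative sketch only: resolve_sampler is a hypothetical name, and it assumes the model defaults have already been read from generation_config.json and keyed by the field names shown above.

```python
def resolve_sampler(form: dict, model_defaults: dict) -> dict:
    """Overlay non-blank form fields on top of the model's default sampler values."""
    resolved = dict(model_defaults)  # start from the model defaults
    for key, raw in form.items():
        if raw is None or str(raw).strip() == "":
            continue  # blank field: keep the model default
        # top_k is an integer count; the other fields are floats
        resolved[key] = int(raw) if key == "top_k" else float(raw)
    return resolved


defaults = {"temp": 1.0, "top_p": 0.95, "top_k": 64}
print(resolve_sampler({"temp": "0", "top_p": "", "top_k": "40"}, defaults))
```

Here temp and top_k are overridden, while the blank top_p keeps the model default of 0.95.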
Pick from local adapters discovered under models/, paste a path from elsewhere, or add several. A single adapter boots through mlx-lm's classic --adapter-path route. Two or more activate OptIQ's mounted-LoRA mode: the base model stays loaded, and clients pick an adapter per request via the adapters field in the request body (an adapter's name is its directory's basename). Switching is instant: one base model in RAM, ~30 MB per extra adapter, gated by a ContextVar in the forward pass.
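To illustrate the ContextVar gating idea, here is a toy sketch in plain Python. It is not OptIQ's implementation: ACTIVE_ADAPTER, MountedLoRAScalar, and handle_request are hypothetical names, and a single float stands in for both the base weight matrix and each low-rank update.

```python
from contextvars import ContextVar
from typing import Dict, Optional

# Hypothetical: the adapter selected by the request currently being served.
ACTIVE_ADAPTER: ContextVar[Optional[str]] = ContextVar("active_adapter", default=None)


class MountedLoRAScalar:
    """Toy stand-in for a layer with one shared base weight and several mounted deltas."""

    def __init__(self, base_weight: float) -> None:
        self.base_weight = base_weight
        self.deltas: Dict[str, float] = {}  # adapter name -> update (a scalar here)

    def mount(self, name: str, delta: float) -> None:
        self.deltas[name] = delta

    def __call__(self, x: float) -> float:
        name = ACTIVE_ADAPTER.get()
        # No adapter selected (or unknown name): fall through to the base weight.
        delta = self.deltas.get(name, 0.0) if name else 0.0
        return (self.base_weight + delta) * x


def handle_request(layer: MountedLoRAScalar, x: float, adapter: Optional[str] = None) -> float:
    """Select an adapter for this call only, then restore the previous value."""
    token = ACTIVE_ADAPTER.set(adapter)
    try:
        return layer(x)
    finally:
        ACTIVE_ADAPTER.reset(token)


layer = MountedLoRAScalar(base_weight=2.0)
layer.mount("math", 0.5)
print(handle_request(layer, 4.0))          # base only
print(handle_request(layer, 4.0, "math"))  # base plus the "math" delta
```

Because the selection lives in a ContextVar rather than on the layer, concurrent requests can pick different adapters without reloading or mutating the shared base.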
{% endblock %}