{{ t('modal.model_settings.section_label') }}

{{ t('modal.model_settings.profiles.section_label') }}

No presets available

{{ t('modal.model_settings.basic_label') }}

{{ t('modal.model_settings.empty_hint') }}

{{ t('modal.model_settings.advanced_label') }}

{{ t('modal.model_settings.enable_thinking') }}

{{ t('modal.model_settings.enable_thinking_hint') }}

{{ t('modal.model_settings.thinking_budget') }}

{{ t('modal.model_settings.thinking_budget_hint') }}

{{ t('modal.model_settings.guided_grammar') }}

{{ t('modal.model_settings.guided_grammar_hint') }}

{{ t('modal.model_settings.limit_tool_result') }}

{{ t('modal.model_settings.limit_tool_result_hint') }}

{{ t('modal.model_settings.force_sampling') }}

{{ t('modal.model_settings.force_sampling_hint') }}

{{ t('modal.model_settings.trust_remote_code') }}

{{ t('modal.model_settings.trust_remote_code_hint') }}

{{ t('modal.model_settings.chat_template_kwargs') }}

{{ t('modal.model_settings.chat_template_kwargs_hint') }}

{{ t('modal.model_settings.no_kwargs') }}

{{ t('modal.model_settings.experimental_label') }}

{{ t('modal.model_settings.turboquant_kv') }}

{{ t('modal.model_settings.turboquant_kv_hint') }}

SpecPrefill

Attention-based sparse prefill for MoE/hybrid models. (Paper) (HuggingFace)

Small model that shares its tokenizer with the target model (e.g. Qwen3.5-0.8B for a 35B target)

Minimum prompt length, in tokens, that triggers SpecPrefill (shorter prompts use full prefill)

DFlash

Block diffusion speculative decoding for 3-4x faster generation. Supports Qwen (3, 3.5, 3.6) and Gemma4 model families. Requires a DFlash draft model checkpoint.
Single-stream only: requests run one at a time.
* MLX implementation by bstnxbt (GitHub)

DFlash draft checkpoint (e.g. z-lab/Qwen3-4B-DFlash-b16, z-lab/gemma-4-26B-A4B-it-DFlash). Note: use only checkpoints with the -DFlash suffix; -assistant variants are for MTP.

Quantization

Enable quantization for the draft model (weight bits, activation bits, and group size).

Prompts at or above this token count switch to BatchedEngine. Leave empty for unlimited.
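
The threshold rule above can be read as a simple routing decision. The following is a minimal sketch, not the actual implementation: the function name `select_engine`, the parameter `max_ctx`, and the label `"DFlash"` for the single-stream path are hypothetical; only the "at or above" comparison and the empty-means-unlimited behaviour come from the description.

```python
from typing import Optional


def select_engine(prompt_tokens: int, max_ctx: Optional[int] = None) -> str:
    """Pick an engine label for a prompt (hypothetical helper).

    max_ctx=None models an empty field, i.e. no limit on the DFlash path.
    """
    # Prompts at or above the threshold are routed to BatchedEngine.
    if max_ctx is not None and prompt_tokens >= max_ctx:
        return "BatchedEngine"
    return "DFlash"
```

For example, with `max_ctx=8192` a prompt of exactly 8192 tokens would be routed to BatchedEngine, while one of 8191 tokens would stay on the DFlash path.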

Long-context tuning

Draft model sliding-attention window. Helps stabilise acceptance on long contexts. Leave empty for dflash default (1024).

Attention-sink tokens always kept regardless of window. Leave empty for dflash default (64).
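
To illustrate how the window and sink settings might combine, here is a small sketch that lists which prompt positions remain visible to the draft model. It assumes the usual sliding-window-plus-sink pattern (the first `sink` tokens are always kept, plus the most recent `window` tokens); the function `kept_positions` is hypothetical, and the real dflash masking may differ in detail.

```python
def kept_positions(seq_len: int, window: int = 1024, sink: int = 64) -> list[int]:
    """Return the token positions kept under a sink + sliding-window scheme.

    Defaults mirror the documented dflash defaults (window=1024, sink=64).
    """
    # Sink tokens: the first `sink` positions, always retained.
    sink_end = min(sink, seq_len)
    # Window tokens: the last `window` positions, without double-counting sinks.
    window_start = max(sink_end, seq_len - window)
    return list(range(sink_end)) + list(range(window_start, seq_len))
```

Under these assumptions, a 5000-token context with the defaults keeps 64 + 1024 = 1088 positions, and contexts shorter than the window are kept in full.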

Verifier algorithm. "adaptive" shrinks block size when acceptance drops; "off" disables speculative verify.
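
As a rough illustration of what "adaptive" could mean, the sketch below halves the block size when the observed acceptance rate falls below a threshold. Everything here is an assumption made for illustration: the function name, the thresholds `low` and `high`, the bounds, and the growth step when acceptance recovers are not taken from dflash.

```python
def next_block_size(
    current: int,
    acceptance_rate: float,
    *,
    min_block: int = 2,    # hypothetical lower bound
    max_block: int = 16,   # hypothetical upper bound
    low: float = 0.4,      # hypothetical "acceptance dropped" threshold
    high: float = 0.8,     # hypothetical "acceptance recovered" threshold
) -> int:
    """Return the block size to use for the next speculative step."""
    if acceptance_rate < low:
        # Few drafted tokens are accepted: draft less per step.
        return max(min_block, current // 2)
    if acceptance_rate > high:
        # Assumed recovery rule: grow back toward the maximum.
        return min(max_block, current * 2)
    return current
```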

In-memory cache

DFlash L1 prefix snapshot cache in RAM. Speeds up multi-turn chats with shared prefixes.

Maximum number of prefix snapshots kept in L1 cache. Each entry stores KV + draft GDN state for one conversation prefix.

Byte budget for L1 snapshots; LRU evicts when exceeded.
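
The two L1 limits (entry count and byte budget) can be pictured as a single LRU structure. The sketch below is a generic LRU cache, not the DFlash implementation: the class `SnapshotCache` and its methods are hypothetical, and snapshots are modelled as plain `bytes` rather than KV and draft state.

```python
from collections import OrderedDict
from typing import List, Optional, Tuple


class SnapshotCache:
    """LRU cache bounded by both entry count and total bytes (illustrative)."""

    def __init__(self, max_entries: int, max_bytes: int) -> None:
        self.max_entries = max_entries
        self.max_bytes = max_bytes
        self.total_bytes = 0
        self._entries: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, key: str) -> Optional[bytes]:
        if key not in self._entries:
            return None
        # A hit marks the entry as most recently used.
        self._entries.move_to_end(key)
        return self._entries[key]

    def put(self, key: str, blob: bytes) -> List[Tuple[str, bytes]]:
        """Store a snapshot and return the entries evicted to make room."""
        if key in self._entries:
            self.total_bytes -= len(self._entries.pop(key))
        self._entries[key] = blob
        self.total_bytes += len(blob)
        evicted = []
        # Evict least recently used entries until both limits hold.
        while self._entries and (
            len(self._entries) > self.max_entries
            or self.total_bytes > self.max_bytes
        ):
            old_key, old_blob = self._entries.popitem(last=False)
            self.total_bytes -= len(old_blob)
            evicted.append((old_key, old_blob))
        return evicted
```

Returning the evicted entries from `put` is a design choice made here so that a caller could hand them to a slower tier. Note that in this sketch a single snapshot larger than the byte budget is evicted immediately.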

SSD cache

L2 spill of evicted L1 entries to disk. Uses the oMLX paged SSD cache directory (dflash_l2/).

Enable oMLX paged SSD cache first (--paged-ssd-cache-dir).

Requires in-memory cache to be enabled.

Disk budget for L2 spill; oldest entries are evicted when exceeded.

{{ t('modal.model_settings.mtp') }}

{{ t('modal.model_settings.mtp_hint') | safe }}

{{ t('modal.model_settings.mtp_conflict') }}

{{ t('modal.model_settings.vlm_mtp') }}

{{ t('modal.model_settings.vlm_mtp_hint') | safe }}

{{ t('modal.model_settings.vlm_mtp_conflict') }}