{{ t('modal.model_settings.empty_hint') }}
{{ t('modal.model_settings.enable_thinking_hint') }}
{{ t('modal.model_settings.thinking_budget_hint') }}
{{ t('modal.model_settings.guided_grammar_hint') }}
{{ t('modal.model_settings.limit_tool_result_hint') }}
{{ t('modal.model_settings.force_sampling_hint') }}
{{ t('modal.model_settings.trust_remote_code_hint') }}
{{ t('modal.model_settings.chat_template_kwargs_hint') }}
{{ t('modal.model_settings.no_kwargs') }}
{{ t('modal.model_settings.turboquant_kv_hint') }}
{{ t('modal.model_settings.index_cache_hint') }} (GitHub)
Attention-based sparse prefill for MoE/hybrid models. (Paper) (HuggingFace)
Small model that shares a tokenizer with the target model (e.g. Qwen3.5-0.8B for a 35B target)
Minimum prompt length in tokens to trigger sparse prefill (shorter prompts use full prefill)
Block diffusion speculative decoding for 3-4x faster generation. Supports Qwen (3, 3.5, 3.6) and Gemma4 model families. Requires a DFlash draft model checkpoint.
Single-stream only: requests run one at a time.
* MLX implementation by bstnxbt (GitHub)
DFlash draft checkpoint (e.g. z-lab/Qwen3-4B-DFlash-b16, z-lab/gemma-4-26B-A4B-it-DFlash). Note: -DFlash suffix only; -assistant variants are for MTP.
Enable quantization for the draft model (weight bits, activation bits, and group size).
Prompts at or above this token count switch to BatchedEngine. Leave empty for unlimited.
Long-context tuning
Draft model sliding-attention window. Helps stabilise acceptance on long contexts. Leave empty for dflash default (1024).
Attention-sink tokens always kept regardless of window. Leave empty for dflash default (64).
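The interaction between the sliding window and the attention-sink tokens can be illustrated with a minimal sketch. It is not the dflash implementation: the function name `kept_positions` is hypothetical, and it assumes the sink tokens are counted separately from the window. The defaults mirror the values quoted above (1024 and 64).

```python
def kept_positions(seq_len: int, window: int = 1024, sinks: int = 64) -> list[int]:
    """Return the token positions that stay visible to the newest token.

    Assumption (illustrative only): the first `sinks` positions are always
    kept, plus the most recent `window` positions, without double counting.
    """
    # Start of the sliding window; clamps to 0 for short sequences.
    window_start = max(seq_len - window, 0)
    # Sink tokens that fall inside the window are already covered by it.
    sink_end = min(sinks, window_start)
    return list(range(sink_end)) + list(range(window_start, seq_len))


if __name__ == "__main__":
    # 10 tokens, window of 4, 2 sink tokens: positions 0-1 and 6-9 remain.
    print(kept_positions(10, window=4, sinks=2))
```

Under these assumptions, a context shorter than the window keeps every position, so the two settings only matter once the context grows past the window size.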
Verifier algorithm. "adaptive" shrinks block size when acceptance drops; "off" disables speculative verify.
DFlash L1 prefix snapshot cache in RAM. Speeds up multi-turn chats with shared prefixes.
Maximum number of prefix snapshots kept in L1 cache. Each entry stores KV + draft GDN state for one conversation prefix.
Byte budget for L1 snapshots; LRU evicts when exceeded.
L2 spill of evicted L1 entries to disk. Uses the oMLX paged SSD cache directory (dflash_l2/).
Enable the oMLX paged SSD cache first (--paged-ssd-cache-dir).
Requires in-memory cache to be enabled.
Disk budget for L2 spill; oldest entries are evicted when exceeded.
{{ t('modal.model_settings.mtp_hint') | safe }}
{{ t('modal.model_settings.mtp_conflict') }}
{{ t('modal.model_settings.vlm_mtp_hint') | safe }}
{{ t('modal.model_settings.vlm_mtp_conflict') }}