{{ t('modal.model_settings.empty_hint') }}
{{ t('modal.model_settings.enable_thinking_hint') }}
{{ t('modal.model_settings.thinking_budget_hint') }}
{{ t('modal.model_settings.limit_tool_result_hint') }}
{{ t('modal.model_settings.force_sampling_hint') }}
{{ t('modal.model_settings.trust_remote_code_hint') }}
{{ t('modal.model_settings.chat_template_kwargs_hint') }}
{{ t('modal.model_settings.no_kwargs') }}
{{ t('modal.model_settings.turboquant_kv_hint') }}
{{ t('modal.model_settings.index_cache_hint') }} (GitHub)
Attention-based sparse prefill for MoE/hybrid models. (Paper) (HuggingFace)
Small model that shares its tokenizer with the target model (e.g. Qwen3.5-0.8B for a 35B target)
Minimum prompt length in tokens to trigger sparse prefill (shorter prompts use full prefill)
Block-diffusion speculative decoding for 3-4x faster generation. Supports the Qwen (3, 3.5, 3.6) and Gemma4 model families. Requires a DFlash draft model checkpoint.
Single-stream only: requests run one at a time.
* MLX implementation by bstnxbt (GitHub)
DFlash draft checkpoint (e.g. z-lab/Qwen3-4B-DFlash-b16, z-lab/gemma-4-26B-A4B-it-DFlash). Note: use only checkpoints with the -DFlash suffix; -assistant variants are for MTP.
Enable quantization for the draft model (weight bits, activation bits, and group size).
{{ t('modal.model_settings.dflash_max_ctx_help') }}
{{ t('modal.model_settings.dflash_max_concurrent_help') }}
{{ t('modal.model_settings.dflash_l1_cache_hint') }}
{{ t('modal.model_settings.dflash_l1_max_entries_help') }}
{{ t('modal.model_settings.dflash_l1_max_gib_help') }}
{{ t('modal.model_settings.dflash_l2_cache_hint') | safe }}
{{ t('modal.model_settings.dflash_l2_unavailable') | safe }}
{{ t('modal.model_settings.dflash_l2_requires_l1') }}
{{ t('modal.model_settings.mtp_hint') | safe }}
{{ t('modal.model_settings.mtp_conflict') }}
{{ t('modal.model_settings.vlm_mtp_hint') | safe }}
{{ t('modal.model_settings.vlm_mtp_conflict') }}