{{ t('models.manager.section_label') }}

{{ t('models.manager.heading') }}

{{ t('models.manager.description') }}

{{ t('models.manager.local_models') }}

{{ t('models.downloader.section_label') }}

{{ t('models.downloader.heading') }}

{{ t('models.downloader.description') }}

{{ t('models.downloader.download_section') }}
{{ t('models.search.section_label') }}
{{ t('models.search.recent_searches') }}
{{ t('models.browse.section_label') }}

{{ t('models.browse.loading') }}

{{ t('models.browse.load_prompt') }}

{{ t('models.browse.searching') }}

{{ t('models.browse.search_prompt') }}

{{ t('models.oq.section_label') }}

{{ t('models.oq.heading') }}

{{ t('models.oq.description') }}

{{ t('models.oq.form_section') }}
Text Only

Excludes vision encoder weights. Output is a text-only model (~2-3% smaller).

{{ t('models.oq.preserve_mtp') }}

{{ t('models.oq.preserve_mtp_help') }}

{{ t('models.oq.preserve_mtp_unavailable') }}

{{ t('models.oq.dtype_label') }}

{{ t('models.oq.dtype_help') }}

{{ t('models.oq.no_models') }}

About oQ Quantization

oQ: oMLX Universal Dynamic Quantization

Quantization should not be exclusive to any particular inference server.

oQ produces standard mlx-lm models that work everywhere: oMLX, mlx-lm, LM Studio, and any app that supports the MLX safetensors format.

No custom loader required.

oQ measures each layer's quantization sensitivity through calibration (relative MSE vs float16) and builds a byte-budgeted mixed-precision plan that allocates bits where the data says they matter most. Calibration data is built in: 600 samples across code, multilingual, reasoning, and tool calling. Every model gets a unique bit allocation tuned to its architecture.
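To make the sensitivity measurement concrete, here is a minimal sketch of relative MSE for a single linear layer. It is not oQ's actual code: the function names (`quantize_row`, `relative_mse`), the random stand-in weights and calibration samples, and the use of Python floats as the reference in place of float16 are all assumptions made for illustration.

```python
import random

def quantize_row(row, bits, group_size=64):
    """Round-trip one weight row through group-wise affine quantization."""
    levels = (1 << bits) - 1
    out = []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        lo, hi = min(group), max(group)
        if hi == lo:  # constant group: nothing to quantize
            out.extend(group)
            continue
        scale = (hi - lo) / levels
        out.extend(lo + round((w - lo) / scale) * scale for w in group)
    return out

def matvec(weight, x):
    """Plain matrix-vector product, standing in for a linear layer."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def relative_mse(weight, samples, bits, group_size=64):
    """Output error of the quantized layer, normalised by reference output energy."""
    q_weight = [quantize_row(row, bits, group_size) for row in weight]
    err = ref = 0.0
    for x in samples:
        y = matvec(weight, x)
        y_q = matvec(q_weight, x)
        err += sum((a - b) ** 2 for a, b in zip(y, y_q))
        ref += sum(a * a for a in y)
    return err / ref if ref else 0.0

# Hypothetical layer and calibration inputs (random data, fixed seed).
rng = random.Random(0)
weight = [[rng.gauss(0, 1) for _ in range(128)] for _ in range(16)]
samples = [[rng.gauss(0, 1) for _ in range(128)] for _ in range(8)]
mse_by_bits = {bits: relative_mse(weight, samples, bits) for bits in (2, 4, 8)}
```

Running the same measurement per layer at a fixed bit width gives one score per layer, and layers with higher scores are the ones that lose more from quantization.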

Key features
Data-driven sensitivity: Per-layer relative MSE is measured against float16. Layers are ranked by quantization impact; the top 25% are marked as sensitive and receive priority in bit allocation.
Budget-constrained allocation: A 4-phase greedy plan within a hard bpw cap. Consensus tensors (lm_head, embeddings) get 8-bit first. Then attention and shared-expert layers get a protection floor. Remaining budget goes to the highest-sensitivity tensors. Routed experts (93-98% of MoE params) stay at base bits so the budget concentrates on the layers that actually move output quality.
Model-aware quantization: For MoE models, the router stays fp16, routed experts stay at base bits, and shared experts are protected. For VLMs, the vision encoder stays fp16. For SSMs, state tensors stay fp32. Models with 512+ experts get a larger group size automatically.
Streaming: Tensor-by-tensor quantization from safetensors. A sensitivity model (a pre-quantized proxy) can be used to measure layer ranking at ~4x less memory than float16 calibration.
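The budget-constrained allocation described above can be sketched as a small greedy planner. This is an illustration under stated assumptions, not oQ's implementation: the `Tensor` record, the `plan_bits` function, the 6-bit protection floor, the candidate bit widths, and the example tensors are all hypothetical. Scale and offset overhead is ignored, and sensitive tensors get priority through sort order instead of an explicit top-25% flag.

```python
from dataclasses import dataclass

@dataclass
class Tensor:
    name: str
    params: int
    sensitivity: float  # e.g. relative MSE from calibration
    kind: str           # "consensus", "protected", "routed_expert", or "other"

def plan_bits(tensors, base_bits=4, target_bpw=4.5, floor_bits=6):
    """Greedy bit allocation under a hard bits-per-weight cap (illustrative)."""
    total = sum(t.params for t in tensors)
    budget = target_bpw * total           # hard cap, in bits
    plan = {t.name: base_bits for t in tensors}
    used = base_bits * total              # everything starts at base bits

    def try_raise(t, bits):
        nonlocal used
        extra = (bits - plan[t.name]) * t.params
        if extra > 0 and used + extra <= budget:
            plan[t.name] = bits
            used += extra
            return True
        return False

    # Consensus tensors (lm_head, embeddings) get 8-bit first.
    for t in tensors:
        if t.kind == "consensus":
            try_raise(t, 8)
    # Attention and shared-expert tensors get a protection floor.
    for t in tensors:
        if t.kind == "protected":
            try_raise(t, floor_bits)
    # Remaining budget goes to the most sensitive tensors; routed experts stay at base.
    ranked = sorted((t for t in tensors if t.kind != "routed_expert"),
                    key=lambda t: t.sensitivity, reverse=True)
    for t in ranked:
        for bits in (8, 6, 5):            # assumed candidate widths, highest first
            if try_raise(t, bits):
                break
    return plan, used / total

# Hypothetical MoE-like model: routed experts hold 94% of the parameters.
tensors = [
    Tensor("lm_head", 1000, 0.9, "consensus"),
    Tensor("attn.q_proj", 2000, 0.5, "protected"),
    Tensor("mlp.down_proj", 3000, 0.7, "other"),
    Tensor("experts.0.up_proj", 94000, 0.2, "routed_expert"),
]
plan, bpw = plan_bits(tensors, base_bits=4, target_bpw=4.15)
```

With these example numbers, `lm_head` ends at 8 bits, `attn.q_proj` and `mlp.down_proj` at 6 bits, and the routed expert stays at 4 bits, for 4.14 bpw overall. Because the routed experts dominate the parameter count, leaving them at base bits is what lets a small budget margin raise every other tensor.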
Comparison with other methods
| | oQ | GGUF K-quant | GGUF IQ | unsloth Dynamic | AWQ |
|---|---|---|---|---|---|
| Format | MLX safetensors | GGUF | GGUF | GGUF | safetensors |
| Mixed precision | Data-driven sensitivity (per-layer MSE) | ~15 type rules | imatrix per-weight | Per-layer (proprietary) | Per-channel scaling |
| Hybrid modes | Affine (group_size=64) | K-quant types | E8 lattice | Proprietary | Affine only |
| MoE support | Router fp16, expert-aware budget | Basic | Basic | Router fp16 | Limited |
| VLM / SSM | Vision fp16, SSM F32 state | Separate mmproj, F32 state | Separate mmproj, F32 state | Supported | Not tested |
| Calibration | Built-in 600 samples (code + multilingual + reasoning) | None | imatrix required | Proprietary data | Activation data |
| Memory | Streaming (~5-7GB) | 1× model | 1× model | Proprietary | 1.5× model |
| Apple Silicon | Native (MLX affine) | Via llama.cpp | Via llama.cpp | GGUF only | MLX (affine) |

{{ t('models.detail.loading') }}