{% extends "_base.html" %} {% block content %}
Workflow · quantize

Quantize a model

Pick a Hugging Face model and choose a target average bits-per-weight; OptIQ then assigns per-layer precision based on a sensitivity sweep.

{# Step indicator #}
{# Step 1: Source #}

Source model

Any model on huggingface.co. Qwen3.5 / 3.6 and Gemma-4 are explicitly supported; other architectures may work but are untested.
{# Step 2: Configure #}

Configure

3.0 bpw = small & lossy · 5.0 bpw = balanced default · 8.0 bpw = near-lossless
The pool of samples that the per-layer KL probe draws from. A bigger pool gives more diverse coverage; OptIQ ships a curated pool of 40 samples to keep total runtime small.
How many samples from the 40-sample pool to actually use. Each layer is run forward on every sample at every candidate bit-width, so the cost is n_layers × n_bits × n_samples forward passes. The default is 8 (a good signal/cost balance); 16 reduces variance on noisy layers but doubles the wait, and larger values lengthen it proportionally.
Comma-separated list of candidate bit-widths. The knapsack picks one per layer from this set. The lowest bit-width becomes the model's name suffix (e.g. OptiQ-4bit), following the llama.cpp / Unsloth convention.
{# Step 3: Run #}

Running


  
{# Step 4: Save / Push #}

Done

OptIQ quant saved to:



    

Push to Hugging Face

Uses the token from Settings → Hugging Face. Your Lab password is required to decrypt it.

Quantize another
{% endblock %}