Metadata-Version: 2.4
Name: auto-round-lib
Version: 0.13.2
Summary: Auto Round Kernel binary package
Home-page: https://github.com/intel/auto-round/auto_round_extension/ark
Author-email: yu.luo@intel.com
License: Apache 2.0
Keywords: quantization,auto-around,LLM,kernel
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: License :: OSI Approved :: Apache Software License
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.10.0
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: license
Dynamic: license-file
Dynamic: requires-dist
Dynamic: summary

## What is AutoRound Kernel?
AutoRound Kernel is a low-bit acceleration library for Intel platform. 

The kernels are optimized for the following CPUs:
* Intel Xeon Scalable processor (formerly Sapphire Rapids, and Emerald Rapids)
* Intel Xeon 6 processors (formerly Sierra Forest and Granite Rapids)

The kernels are optimized for the following GPUs:
* Intel Arc B-Series Graphics and Intel Arc Pro B-Series Graphics
  (formerly Battlemage)

## Key Features
AutoRound Kernel provides weight-only linear computational capabilities for LLM inference. Specifically, the weight-only-quantization configs we support are given in the table below:
### CPU 
| Weight dtype     |          Compute dtype           |    Scale dtype    | Algorithm<sup>[1]</sup> |
|------------------|:--------------------------------:|:-----------------:|:-----------------------:|
| INT8             | INT8<sup>[2]</sup> / BF16 / FP32 |    BF16 / FP32    |       sym / asym        |
| INT4             |        INT8 / BF16 / FP32        |    BF16 / FP32    |       sym / asym        |
| INT3             |        INT8 / BF16 / FP32        |    BF16 / FP32    |       sym / asym        |
| INT2             |        INT8 / BF16 / FP32        |    BF16 / FP32    |       sym / asym        |
| INT5             |        INT8 / BF16 / FP32        |    BF16 / FP32    |       sym / asym        |
| INT6             |        INT8 / BF16 / FP32        |    BF16 / FP32    |       sym / asym        |
| INT7             |        INT8 / BF16 / FP32        |    BF16 / FP32    |       sym / asym        |
| INT1             |        INT8 / BF16 / FP32        |    BF16 / FP32    |       sym / asym        |
| FP8 (E4M3, E5M2) |           BF16 / FP32            | FP32 / FP8 (E8M0) |           NA            |
| FP4 (E2M1)       |           BF16 / FP32            |    BF16 / FP32    |           NA            |

### XPU 
| Weight dtype     |  Compute dtype |     Scale dtype   |  Algorithm |
|------------------|:--------------:|:-----------------:|:----------:|
| INT8             |  INT8 / FP16   |       FP16        |    sym     |
| INT4             |  INT8 / FP16   |       FP16        |    sym     |
| FP8 (E4M3, E5M2) |      FP16      | FP16 / FP8 (E8M0) |     NA     |

<sup>[1]</sup>: Quantization algorithms for integer types: symmetric or asymmetric.  
<sup>[2]</sup>: Includes dynamic activation quantization; results are dequantized to floating-point formats.  
  

## Installation
### 1. Install via pip
```bash
pip install auto-round-lib
```

### 2. Install from Source
```bash
pip install . --no-build-isolation
# or
python setup.py bdist_wheel;pip install dist/*
```

### Validated Hardware Environment
#### CPU based on [Intel 64 architecture or compatible processors](https://en.wikipedia.org/wiki/X86-64):
* Intel Xeon Scalable processor (Granite Rapids)
#### GPU built on Intel's Xe architecture:
* Intel Arc B-Series Graphics (Battlemage)

### Resources

#### QuantLinear API
  ARK exposes a unified weight-only linear interface through QuantLinear, QuantLinearGPTQ, QuantLinearAWQ, and QuantLinearFP8. Please refer to the [QLinear](auto_round_kernel/qlinear.py) for more integration details.

  The expected lifecycle is: create the module, load quantized tensors from the checkpoint, call post_init() once to repack weights into the ARK-friendly layout, and then call forward() during inference.

   Minimal usage:
```python
from auto_round_kernel.qlinear import QuantLinear

qlinear = QuantLinear(
    bits=4,
    group_size=128,
    sym=True,
    in_features=in_features,
    out_features=out_features,
    bias=bias is not None,
    weight_dtype=weight_dtype,
)
# Load qweight, qzeros, scales, and bias from checkpoint.
qlinear.post_init()

# Run inference
y = qlinear(x)
```

#### A Weight-Only Example
  A runnable end-to-end example is available in [test_weightonly.py](test/test_weightonly.py). It demonstrates how to prepare quantized weights and scales, call repack_quantized_weight to build ARK-packed weights, verify correctness with unpack_weight, and run woqgemm on CPU and XPU.

#### Replace torch SDPA and run lm-eval
  ARK also exposes an XPU SDPA kernel through `ARK.sdpa(...)`. If you want to replace `torch.nn.functional.scaled_dot_product_attention` globally for evaluation without editing model code, use the helper launcher in [tools/lm_eval_with_ark_sdpa.py](tools/lm_eval_with_ark_sdpa.py).

  Example:
```bash
cd /path/to/auto_round_extension/ark
PYTHONPATH=$PWD python tools/lm_eval_with_ark_sdpa.py \
  --model hf \
  --model_args pretrained=/path/to/model,trust_remote_code=True,dtype=bfloat16 \
  --tasks hellaswag,piqa,winogrande \
  --device xpu:0 \
  --batch_size 1
```

  Notes:
  * The patch only routes calls to ARK on XPU when the inputs match ARK kernel constraints; otherwise it falls back to the original torch SDPA.
  * Supported Q/K/V dtypes are FP16 and BF16.
  * Supported head dimensions are 64, 96, 128, and 192.
  * `dropout_p` must be 0.0 for the ARK path.
  * Additive masks are supported when they can be normalized to `[B, 1, Sq, Skv]`; boolean masks fall back to torch.
