Metadata-Version: 2.4
Name: grillyinference
Version: 0.1.0
Summary: Native fp16 inference engine for Llama models — optional grilly extension
Author-email: Nicolas Cloutier <ncloutier@grillcheeseai.com>
License: MIT
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: grilly>=0.4.0
Requires-Dist: numpy
Requires-Dist: safetensors
Provides-Extra: hf
Requires-Dist: huggingface_hub; extra == "hf"
Requires-Dist: transformers; extra == "hf"
Dynamic: license-file

# GrillyInference

Native fp16 inference engine for Llama-family models — optional [grilly](https://github.com/grillcheese/grilly) extension.

## Features

- **Native fp16 inference** — runs Llama 3.2 3B in ~6.4 GB VRAM with no quantization loss
- **Paged KV-Cache** — 256-token SRAM pages with LRU eviction, 4x context extension
- **H2O Eviction** — exponential decay on old KV-cache entries, 32k context on 12GB
- **VSA Multi-Scale Summaries** — hypervector bind/bundle for 128k effective context
- **SmoothQuant INT8** — per-group-64 weight quantization, <1% perplexity increase
- **4-bit Block Quantization** — run 100B models on 12GB VRAM with layer offloading
- **Llama 3.2 Instruct** — chat template, streaming generation, top-k/top-p sampling
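
The paged KV-cache above can be pictured as a small LRU map from page IDs to 256-token KV blocks. The sketch below is illustrative only — the class and method names are hypothetical, not GrillyInference's actual API:

```python
from collections import OrderedDict

import numpy as np

PAGE_TOKENS = 256  # page size from the feature list above


class ToyPagedKVCache:
    """Toy LRU page cache; names are illustrative, not the engine's API."""

    def __init__(self, max_pages: int):
        self.max_pages = max_pages
        self.pages: OrderedDict[int, np.ndarray] = OrderedDict()

    def get(self, page_id: int):
        page = self.pages.get(page_id)
        if page is not None:
            self.pages.move_to_end(page_id)  # mark as most recently used
        return page

    def put(self, page_id: int, kv_block: np.ndarray) -> None:
        if page_id in self.pages:
            self.pages.move_to_end(page_id)
        elif len(self.pages) >= self.max_pages:
            self.pages.popitem(last=False)  # evict least-recently-used page
        self.pages[page_id] = kv_block


cache = ToyPagedKVCache(max_pages=2)
cache.put(0, np.zeros((PAGE_TOKENS, 8)))
cache.put(1, np.ones((PAGE_TOKENS, 8)))
cache.get(0)                              # touch page 0 so it is most recent
cache.put(2, np.ones((PAGE_TOKENS, 8)))   # page 1 is now the LRU entry and is evicted
```

The real engine keeps hot pages in SRAM and evicts cold ones the same way; the `OrderedDict` stands in for that page table.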

## Quick Start

```bash
pip install grillyinference
```

```python
from grillyinference import LlamaForCausalLM, TextGenerator
from transformers import AutoTokenizer

model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
gen = TextGenerator(model, tokenizer)

# Simple generation
response = gen.generate("What is the meaning of life?", max_tokens=256)
print(response)

# Chat
response = gen.chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain transformers in 3 sentences."},
])
print(response)

# Streaming
for token in gen.generate("Once upon a time", stream=True):
    print(token, end="", flush=True)
```

## Context Extension (12GB VRAM)

| Context | Decode speed | PPL increase | Technique |
|---------|--------------|--------------|-----------|
| 2k | 9 t/s | 0% | Baseline |
| 8k | 8 t/s | 0% | PagedAttention |
| 32k | 7 t/s | 1.5% | + H2O eviction |
| 128k | 6 t/s | 3% | + VSA summaries |
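
The VSA summaries in the last row rely on the standard hypervector bind/bundle operations. A toy sketch under bipolar (±1) assumptions — the dimensionality and helper names are illustrative, not the library's internals:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4096  # hypervector dimensionality (illustrative)


def rand_hv():
    """Random bipolar hypervector."""
    return rng.choice([-1, 1], size=D).astype(np.int32)


def bind(a, b):
    return a * b  # elementwise multiply: invertible, output dissimilar to both inputs


def bundle(*hvs):
    return np.sign(np.sum(hvs, axis=0)).astype(np.int32)  # componentwise majority vote


key, value, noise1, noise2 = rand_hv(), rand_hv(), rand_hv(), rand_hv()
summary = bundle(bind(key, value), noise1, noise2)  # odd count avoids zero ties
recovered = bind(summary, key)                      # unbind with the same key
similarity = float((recovered * value).mean())      # well above chance (~0.5 expected)
```

Binding key/value pairs and bundling them into one vector is what lets a single summary hypervector stand in for many evicted tokens.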

```python
from grillyinference import KVCache, LlamaConfig

config = LlamaConfig.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
kv = KVCache(config, raw_window=2048, h2o_lambda=0.0002, enable_vsa=True)
```
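
The `h2o_lambda` knob above controls how fast old cache entries lose their eviction score. A minimal sketch of that scoring idea, assuming H2O keeps the entries with the highest age-decayed accumulated attention mass (function and argument names are hypothetical):

```python
import numpy as np


def h2o_keep_mask(attn_mass: np.ndarray, ages: np.ndarray,
                  h2o_lambda: float, keep: int) -> np.ndarray:
    """Keep the `keep` entries with the highest decayed attention mass."""
    scores = attn_mass * np.exp(-h2o_lambda * ages)
    keep_idx = np.argsort(scores)[-keep:]
    mask = np.zeros(scores.shape[0], dtype=bool)
    mask[keep_idx] = True
    return mask


# four cached entries: a recent heavy hitter, a very old heavy hitter, two light ones
attn_mass = np.array([5.0, 5.0, 0.1, 0.2])
ages = np.array([10.0, 20000.0, 5.0, 5.0])
mask = h2o_keep_mask(attn_mass, ages, h2o_lambda=2e-4, keep=2)
```

Note how the old heavy hitter is dropped despite its large raw attention mass — with `h2o_lambda=2e-4`, 20k tokens of age decays its score by a factor of e⁴.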

## SmoothQuant INT8

```python
from grillyinference.inference.quantize import SmoothQuantCalibrator, SmoothQuantizer

calibrator = SmoothQuantCalibrator(model, tokenizer)
stats = calibrator.calibrate()
quantizer = SmoothQuantizer(group_size=64)
quantized = quantizer.smooth_and_quantize(model._weights, stats)
```
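
The per-group-64 step above can be sketched with plain NumPy: symmetric int8 with one scale per 64-weight group. This is a simplified stand-in for `SmoothQuantizer` (it omits the activation-smoothing pass), not the library's implementation:

```python
import numpy as np

GROUP = 64  # group size from the SmoothQuant description above


def quantize_groups(w: np.ndarray):
    """Symmetric int8 quantization with one fp32 scale per 64-weight group."""
    g = w.reshape(-1, GROUP)
    scales = np.abs(g).max(axis=1, keepdims=True) / 127.0
    scales = np.where(scales == 0.0, 1.0, scales)  # avoid divide-by-zero on all-zero groups
    q = np.clip(np.round(g / scales), -127, 127).astype(np.int8)
    return q, scales


def dequantize_groups(q: np.ndarray, scales: np.ndarray, shape) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(shape)


rng = np.random.default_rng(0)
w = rng.standard_normal((128, 64)).astype(np.float32)
q, scales = quantize_groups(w)
w_hat = dequantize_groups(q, scales, w.shape)
max_err = float(np.abs(w - w_hat).max())  # bounded by half a quantization step
```

Small groups keep each scale close to its weights' local range, which is where the <1% perplexity figure comes from.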

## Requirements

- Python 3.12+
- grilly >= 0.4.0
- numpy, safetensors
- Optional: huggingface_hub, transformers (for `from_pretrained`)

## License

MIT
