Metadata-Version: 2.4
Name: anvil-eval
Version: 0.2.0
Summary: A research-first, evaluation-first inference library.
Project-URL: Homepage, https://github.com/bishoymoussa/anvil
Project-URL: Issues, https://github.com/bishoymoussa/anvil/issues
Author: Anvil contributors
License-Expression: Apache-2.0
License-File: LICENSE
Keywords: evaluation,inference,llm,reproducibility
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.11
Requires-Dist: accelerate>=1.0
Requires-Dist: datasets>=3.0
Requires-Dist: fastapi>=0.115
Requires-Dist: filelock>=3.13
Requires-Dist: huggingface-hub>=0.26
Requires-Dist: jinja2>=3.1
Requires-Dist: numpy<3,>=1.26
Requires-Dist: pillow>=10.0
Requires-Dist: pydantic<3,>=2.7
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.7
Requires-Dist: safetensors>=0.4
Requires-Dist: tokenizers>=0.20
Requires-Dist: torch<3,>=2.4
Requires-Dist: tqdm>=4.66
Requires-Dist: transformers<5,>=4.45
Requires-Dist: typer>=0.12
Requires-Dist: uvicorn[standard]>=0.30
Provides-Extra: dev
Requires-Dist: hypothesis>=6.100; extra == 'dev'
Requires-Dist: import-linter>=2.0; extra == 'dev'
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: pre-commit>=3.7; extra == 'dev'
Requires-Dist: pytest-cov>=5.0; extra == 'dev'
Requires-Dist: pytest-xdist>=3.5; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.6; extra == 'dev'
Requires-Dist: types-pyyaml; extra == 'dev'
Provides-Extra: flash-attn
Requires-Dist: flash-attn>=2.6; extra == 'flash-attn'
Provides-Extra: multimodal
Requires-Dist: av>=12.0; extra == 'multimodal'
Requires-Dist: decord>=0.6; extra == 'multimodal'
Requires-Dist: librosa>=0.10; extra == 'multimodal'
Provides-Extra: outlines
Requires-Dist: outlines>=0.1; extra == 'outlines'
Provides-Extra: rocm
Requires-Dist: torch<3,>=2.4; extra == 'rocm'
Provides-Extra: vllm
Requires-Dist: vllm==0.20.1; extra == 'vllm'
Provides-Extra: xgrammar
Requires-Dist: xgrammar>=0.1.10; extra == 'xgrammar'
Provides-Extra: xpu
Requires-Dist: torch<3,>=2.4; extra == 'xpu'
Description-Content-Type: text/markdown

<p align="center">
  <img src="docs/assets/anvil_logo_name.png" alt="anvil" width="420" />
</p>

<p align="center">
  <em>A research-first, evaluation-first inference library.</em>
</p>

<p align="center">
  <a href="docs/design.md">Design&nbsp;manuscript</a> ·
  <a href="#install">Install</a> ·
  <a href="#quickstart">Quickstart</a> ·
  <a href="#milestones">Milestones</a>
</p>

---

> **Status: alpha (v0.2.0).** All seven v0 milestones (M0–M6) are implemented and passing. Per-request logits processors (vLLM + HF), real dataset SHAs in manifests, and CI on Python 3.11/3.12 are live. Built per the design manuscript in [`docs/design.md`](docs/design.md).
>
> **Not yet in alpha:** multi-turn fewshot, `Classify` request type, DoLa (v0.5). CaaS LLM tier is v1.

## What this is

Anvil is **not** trying to be the fastest inference engine. vLLM and SGLang win throughput. Anvil's identity is correctness, reproducibility, and research ergonomics:

- Every run produces a content-hashed [`Manifest`](src/anvil/manifest/schema.py): two runs with the same manifest must produce identical numbers, byte-for-byte.
- Every chat template, tokenization, sampler, and image input is a versioned, hashed object — not a string loaded from a file at runtime.
- Day-zero new-model coverage via a transformers slow path; popular architectures graduate to a fast path.
- Per-request logits processors and hidden-state extraction are stable public APIs (the V0-vLLM API, restored).
- A preflight CaaS agent (rule engine + curated 15-entry KB) runs before every major run, catches silent failures (missing chat template, misconfigured EOS, OOM from a bad config), and either fixes them or refuses to publish a manifest for the affected run.

See [`docs/design.md`](docs/design.md) for the full design rationale.
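
The content-hashing idea in the first bullet can be illustrated with the standard library alone. This is a minimal sketch, not Anvil's implementation: the field names (`model_sha`, `dataset_sha`, `sampler`) and the helper functions are hypothetical, and the real schema lives in `src/anvil/manifest/schema.py`. It shows why a canonical serialization matters: two manifests with the same content hash to the same digest regardless of key order, while any changed field changes the digest.

```python
import hashlib
import json


def canonical_json(obj) -> bytes:
    """Serialize with sorted keys and fixed separators so equal content yields equal bytes."""
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return text.encode("utf-8")


def content_hash(obj) -> str:
    """SHA-256 hex digest of the canonical serialization."""
    return hashlib.sha256(canonical_json(obj)).hexdigest()


# Hypothetical manifest fields, for illustration only.
manifest_a = {
    "model_sha": "abc123",
    "dataset_sha": "def456",
    "sampler": {"temperature": 0.0, "top_p": 1.0},
}
# Same content as manifest_a, written in a different key order.
manifest_b = {
    "sampler": {"top_p": 1.0, "temperature": 0.0},
    "dataset_sha": "def456",
    "model_sha": "abc123",
}
# One sampler parameter changed.
manifest_c = {**manifest_a, "sampler": {"temperature": 0.7, "top_p": 1.0}}

print(content_hash(manifest_a) == content_hash(manifest_b))  # True
print(content_hash(manifest_a) == content_hash(manifest_c))  # False
```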

## Install

```bash
uv pip install anvil-eval
```

Wheels ship for `cu121`, `cu128`, `cu130`, plus a CPU fallback. The pure-Python install works with any torch ≥ 2.4 and CUDA ≥ 12.1.

> **Import name:** the Python package is still `import anvil` — only the PyPI distribution name is `anvil-eval`.

For development:

```bash
uv venv .venv --python 3.11
source .venv/bin/activate
uv pip install -e ".[dev]"
```

Optional extras: `.[vllm]` for the vLLM backend, `.[multimodal]` for video/audio, `.[xgrammar]` for tool calling. The package metadata also declares `flash-attn`, `outlines`, `rocm`, and `xpu` extras.

## Quickstart

```python
import anvil

result = anvil.eval(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tasks=["mmlu", "gsm8k", "humaneval"],
)
print(result.scores)               # {"mmlu": {...}, "gsm8k": {...}, ...}
result.manifest.save("run.json")
```

```bash
# CLI equivalent
anvil eval --model meta-llama/Llama-3.1-8B-Instruct \
           --tasks mmlu,gsm8k,humaneval \
           --output ./run.json

# Verify reproducibility
anvil manifest verify run.json

# Diff two runs to find which fields explain a score gap
anvil manifest diff run.json other.json
```
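
To make the idea behind `anvil manifest diff` concrete, here is a rough standard-library sketch of a field-level diff between two manifest-like JSON documents. It assumes only that a saved manifest is JSON. The helpers `flatten` and `diff_fields` and the example fields are hypothetical and are not part of Anvil's API.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested dicts into {"a.b.c": value}; lists are treated as leaf values."""
    if isinstance(obj, dict):
        flat = {}
        for key, value in obj.items():
            flat.update(flatten(value, f"{prefix}{key}."))
        return flat
    return {prefix.rstrip("."): obj}


def diff_fields(left, right):
    """Return {path: (left_value, right_value)} for every path whose value differs."""
    a, b = flatten(left), flatten(right)
    missing = "<missing>"
    return {
        path: (a.get(path, missing), b.get(path, missing))
        for path in sorted(a.keys() | b.keys())
        if a.get(path, missing) != b.get(path, missing)
    }


# Hypothetical manifest contents, for illustration only.
run = json.loads('{"model": {"sha": "abc"}, "sampler": {"temperature": 0.0, "seed": 1}}')
other = json.loads('{"model": {"sha": "abc"}, "sampler": {"temperature": 0.7}}')

for path, (old, new) in diff_fields(run, other).items():
    print(f"{path}: {old!r} -> {new!r}")
```

Fields that match (here `model.sha`) are omitted, so the remaining entries are the candidates for explaining a score gap.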

## Multimodal

```python
from PIL import Image
import anvil

m = anvil.load("Qwen/Qwen2.5-VL-7B-Instruct")
out = m.generate(messages=[{"role": "user", "content": [
    {"type": "image", "image": Image.open("cat.png")},
    {"type": "text",  "text": "What is in this image?"},
]}])
print(out.text)
print(out.image_token_counts)      # per-image vision-token counts
```

## Custom modalities (RNA, audio, embeddings, anything)

```python
from transformers import AutoModel
import anvil

model = anvil.load_custom(
    model_id="multimolecule/rnafm",
    model_class=AutoModel,
)

@anvil.register_task
class RNAFunctionRegression(anvil.Task):
    name = "rna_function_v1"
    dataset = "myorg/rna-function-set"

    def doc_to_request(self, doc):
        return anvil.Embed(input=doc["sequence"], pool="mean", layer=-1)

    def request_to_prediction(self, response, doc):
        return response.embedding

    def aggregate(self, predictions, docs):
        # your metric, your call — Spearman, Ridge probe, anything
        ...

result = anvil.eval(model=model, tasks=["rna_function_v1"])
```
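
The `aggregate` body above is left open on purpose. As one possible way to fill it in, the sketch below fits a closed-form ridge probe on embeddings and scores held-out predictions with Spearman correlation, using only NumPy. Everything here is an assumption for illustration: the function names, the train/test split, the absence of an intercept term, and the idea that each doc carries a numeric label are not taken from Anvil.

```python
import numpy as np


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        mask = values == v
        ranks[mask] = ranks[mask].mean()
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])


def ridge_probe_spearman(embeddings, labels, alpha=1.0, train_fraction=0.8):
    """Fit ridge on the first part of the data, report Spearman on the rest."""
    X = np.asarray(embeddings, dtype=float)
    y = np.asarray(labels, dtype=float)
    split = int(len(X) * train_fraction)
    X_train, X_test = X[:split], X[split:]
    y_train, y_test = y[:split], y[split:]
    # Closed-form ridge without intercept: w = (X^T X + alpha * I)^-1 X^T y
    gram = X_train.T @ X_train + alpha * np.eye(X.shape[1])
    w = np.linalg.solve(gram, X_train.T @ y_train)
    return spearman(X_test @ w, y_test)


# Synthetic stand-ins for per-document embeddings and labels.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(200, 8))
fake_labels = fake_embeddings @ rng.normal(size=8)
score = ridge_probe_spearman(fake_embeddings, fake_labels)
print(f"held-out Spearman: {score:.3f}")
```

Inside a task, `aggregate` could call such a helper with `predictions` as the embeddings and a label column pulled from `docs`.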

## Migrating from lm-evaluation-harness

```bash
# Before:
lm_eval --model vllm \
    --model_args pretrained=Qwen/Qwen2.5-7B-Instruct \
    --tasks mmlu_pro,arc_challenge \
    --apply_chat_template \
    --num_fewshot 5 \
    --output_path ./out

# After:
anvil eval --model Qwen/Qwen2.5-7B-Instruct \
    --lm-eval-tasks mmlu_pro.yaml,arc_challenge.yaml \
    --n-fewshot 5 \
    --output ./run.json

# Validate the migration:
anvil eval --model Qwen/Qwen2.5-7B-Instruct \
    --lm-eval-tasks arc_challenge.yaml \
    --compare-with-lm-eval \
    --output ./run.json
```

## OpenAI-compatible server

```bash
anvil serve --model Qwen/Qwen2.5-7B-Instruct --port 8000
```

```python
# Drop-in replacement for the OpenAI client:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-checked")
resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
)
```

Tool calling is constrained-decoding-driven (one grammar; no per-model `--tool-call-parser` flag matrix).

## Diagnosing your environment

```bash
anvil doctor
# anvil         ok    anvil 0.2.0
# python        ok    Python 3.11.15
# cuda          warn  CUDA not available — torch wheel may not match driver
# transformers  ok    transformers 4.57.6
# vllm          warn  vLLM is not installed
# hf_token      warn  HF_TOKEN is not set
# ...

anvil doctor --json    # machine-readable for CI
```

## Design pillars

1. **Research as a first-class user.** Per-request logits processors, hidden-state extraction, structured output, and custom decoding strategies are stable, versioned public APIs.
2. **Datasets and benchmarks integrate in five lines.** A versioned task spec, batched evaluation primitives that drive the engine at full throughput, and a built-in library of the benchmarks that actually matter.
3. **Day-zero model support, by default.** New HuggingFace architectures load via the transformers backend the day they drop. The top architectures have a fast path.
4. **Reproducibility by construction.** Every run produces a manifest with the model SHA, dataset SHA, chat-template hash, sampler params, library version, and tokenizer version. Two runs with the same manifest produce identical numbers.
5. **CaaS preflight agent.** A small rule engine + curated known-issue database runs a smoke test before any major run, catches silent failures, and surfaces them as a diff for review.
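
The preflight pillar can be pictured as a list of small rule functions run against a run configuration. The sketch below is a toy version under stated assumptions: the `Finding` type, the rule names, and the config keys (`apply_chat_template`, `chat_template`, `eos_token_id`, `stop_token_ids`) are all hypothetical and do not describe Anvil's rule engine or knowledge base. It shows the two outcomes named in this README: a rule either proposes a fix or only reports the problem.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Finding:
    rule: str
    message: str
    fix: Optional[dict] = None  # proposed config changes; None means no automatic fix


Rule = Callable[[dict], Optional[Finding]]


def missing_chat_template(config: dict) -> Optional[Finding]:
    """Flag runs that request a chat template but have none configured."""
    if config.get("apply_chat_template") and not config.get("chat_template"):
        return Finding("missing_chat_template", "chat template requested but none is set")
    return None


def eos_not_in_stop_ids(config: dict) -> Optional[Finding]:
    """Flag an EOS id that is not a stop token, and propose adding it."""
    eos = config.get("eos_token_id")
    stops = config.get("stop_token_ids", [])
    if eos is not None and eos not in stops:
        return Finding(
            "eos_not_in_stop_ids",
            f"EOS id {eos} is not a stop token",
            fix={"stop_token_ids": [*stops, eos]},
        )
    return None


RULES: list[Rule] = [missing_chat_template, eos_not_in_stop_ids]


def preflight(config: dict, rules: list[Rule]) -> list[Finding]:
    """Run every rule and collect the findings, in rule order."""
    return [f for rule in rules if (f := rule(config)) is not None]


def apply_fixes(config: dict, findings: list[Finding]) -> dict:
    """Return a patched copy of the config; the input is left untouched."""
    patched = dict(config)
    for finding in findings:
        if finding.fix:
            patched.update(finding.fix)
    return patched


config = {"apply_chat_template": True, "chat_template": None,
          "eos_token_id": 2, "stop_token_ids": [7]}
findings = preflight(config, RULES)
patched = apply_fixes(config, findings)
for finding in findings:
    status = "auto-fixable" if finding.fix else "needs review"
    print(f"{finding.rule}: {finding.message} ({status})")
```

Returning a patched copy rather than mutating the config keeps the before and after states available, which fits the "surface them as a diff for review" step.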

## Built-in benchmarks (v0)

Built in: GSM8K (M0), MMLU and HumanEval+ (M1), and MMMU (M4). The rest of the catalog is available through Tier 2 lm-evaluation-harness imports, and Tier 3 custom tasks cover any modality.

## Milestones

<p align="left">
  <img src="docs/assets/anvil_logo_symbol.png" alt="" width="40" align="right" />
</p>

The build proceeded milestone-by-milestone (`docs/design.md` §16.10), all green:

- **M0** — HF slow path, GSM8K, manifest emitted.
- **M1** — vLLM wrapper + ChatTemplate canonicalization + MMLU/HumanEval+.
- **M2** — Manifest canonical JSON + sign/verify/diff/replay/strip-caas.
- **M3** — CaaS rule engine + 15-entry KB + 10-case test corpus (70% auto-resolve, 0% false positive).
- **M4** — Multimodal (Qwen2.5-VL fast-path marker + MMMU + VLM-aware preflight).
- **M5** — lm-eval-harness shim + custom non-text modality (RNA example).
- **M6** — uv wheels (cu121/cu128/cu130), 5 fast paths, OpenAI-compatible serve, `anvil doctor`.

## License

Apache-2.0. See [`LICENSE`](LICENSE).

---

<p align="center">
  <img src="docs/assets/anvil_logo_symbol.png" alt="anvil" width="48" />
  <br />
  <sub><em>Anvil — the same manifest produces the same number, today, tomorrow, and on someone else's machine.</em></sub>
</p>
