Metadata-Version: 2.4
Name: llm-annotator
Version: 0.7.2
Summary: An easy-to-extend LLM annotator for robust, resumable data annotation.
Project-URL: Homepage, https://github.com/BramVanroy/llm-annotator
Project-URL: Documentation, https://BramVanroy.github.io/llm-annotator
Project-URL: Repository, https://github.com/BramVanroy/llm-annotator
Project-URL: Issues, https://github.com/BramVanroy/llm-annotator/issues
Author-email: Bram Vanroy <2779410+BramVanroy@users.noreply.github.com>
License-Expression: Apache-2.0
License-File: LICENSE
Classifier: Development Status :: 4 - Beta
Classifier: Framework :: MkDocs
Classifier: Framework :: Pytest
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Build Tools
Classifier: Typing :: Typed
Requires-Python: <3.14,>=3.12
Requires-Dist: colorama<1,>=0.4.6
Requires-Dist: datasets<5,>=4.8.5
Provides-Extra: anthropic
Requires-Dist: anthropic<1,>=0.102.0; extra == 'anthropic'
Provides-Extra: gemini
Requires-Dist: google-genai<3,>=2.3.0; extra == 'gemini'
Provides-Extra: openai
Requires-Dist: openai<3,>=2.36.0; extra == 'openai'
Provides-Extra: vllm
Requires-Dist: mistral-common<2,>=1.11.2; extra == 'vllm'
Requires-Dist: vllm==0.21.0; extra == 'vllm'
Description-Content-Type: text/markdown

# Robust, resumable LLM dataset annotation

[![CI](https://github.com/BramVanroy/llm-annotator/actions/workflows/ci.yml/badge.svg)](https://github.com/BramVanroy/llm-annotator/actions/workflows/ci.yml)
[![codecov](https://codecov.io/gh/BramVanroy/llm-annotator/branch/main/graph/badge.svg)](https://codecov.io/gh/BramVanroy/llm-annotator)
[![PyPI version](https://badge.fury.io/py/llm-annotator.svg)](https://badge.fury.io/py/llm-annotator)
[![Python versions](https://img.shields.io/pypi/pyversions/llm-annotator.svg)](https://pypi.org/project/llm-annotator/)
[![License](https://img.shields.io/github/license/BramVanroy/llm-annotator)](LICENSE)
![GitHub tag](https://img.shields.io/github/v/tag/BramVanroy/llm-annotator)


`llm-annotator` is a Python library (Python 3.12 and 3.13) for robust,
resumable LLM-driven dataset annotation and generation.

It supports multiple providers through pluggable clients:

- vLLM offline inference: `VLLMOfflineClient`
- vLLM server API: `VLLMClient`
- OpenAI API: `OpenAIClient`
- Anthropic API: `ClaudeClient`
- Gemini API: `GeminiClient`

Key capabilities:

- Resumable processing with JSONL checkpoints.
- Annotation of existing datasets and generation from scratch.
- Structured outputs via JSON schema.
- Retry and validation hooks for robust pipelines.
- Optional Hugging Face Hub upload cadence.
- Context-manager cleanup of client resources.
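
To illustrate what resumable processing with JSONL checkpoints means in general, here is a minimal standalone sketch. It is not the library's implementation: the helper names (`load_done_ids`, `annotate_resumable`), the `id`/`text`/`output` fields, and the file layout are all assumptions made for this example.

```python
import json
import tempfile
from pathlib import Path


def load_done_ids(checkpoint: Path) -> set[int]:
    """Collect the ids of rows already recorded in the JSONL checkpoint."""
    if not checkpoint.exists():
        return set()
    done: set[int] = set()
    with checkpoint.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)["id"])
            except (json.JSONDecodeError, KeyError):
                # Skip lines that cannot be parsed; those rows are annotated again.
                continue
    return done


def annotate_resumable(rows, checkpoint: Path, annotate_fn) -> int:
    """Annotate only rows missing from the checkpoint; return how many were processed."""
    done = load_done_ids(checkpoint)
    processed = 0
    with checkpoint.open("a", encoding="utf-8") as fh:
        for row in rows:
            if row["id"] in done:
                continue
            # Hypothetical record layout: one JSON object per line.
            record = {"id": row["id"], "output": annotate_fn(row["text"])}
            fh.write(json.dumps(record) + "\n")
            fh.flush()  # hand each result to the OS right away
            processed += 1
    return processed


if __name__ == "__main__":
    rows = [{"id": i, "text": f"review {i}"} for i in range(5)]
    with tempfile.TemporaryDirectory() as tmp:
        ckpt = Path(tmp) / "checkpoint.jsonl"
        first = annotate_resumable(rows[:3], ckpt, str.upper)  # interrupted run
        second = annotate_resumable(rows, ckpt, str.upper)  # resumed run
        print(first, second)  # prints "3 2"
```

Because each result is appended as its own line, an interrupted run loses at most the row that was in flight, and a rerun only pays for the rows that are still missing.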

## Documentation

Read the full documentation at
[bramvanroy.github.io/llm-annotator](https://bramvanroy.github.io/llm-annotator/).

Provider setup reference:
[docs/provider-info.md](docs/provider-info.md)

## Installation

Recommended:

```sh
uv add llm-annotator
```

or

```sh
pip install llm-annotator
```

Install provider extras as needed:

```sh
uv add "llm-annotator[vllm]"
uv add "llm-annotator[openai]"
uv add "llm-annotator[anthropic]"
uv add "llm-annotator[gemini]"
```

See [docs/provider-info.md](docs/provider-info.md) for auth environment
variables and provider-specific setup notes.

For local vLLM runs, install the FlashInfer packages that match your CUDA version:

```sh
uv pip install flashinfer-python flashinfer-cubin
# JIT cache package (replace cu128 with your CUDA variant)
uv pip install flashinfer-jit-cache --index-url https://flashinfer.ai/whl/cu128
```

## Usage

Annotate an existing dataset:

```python
from llm_annotator import Annotator, VLLMOfflineClient

# Use a local vLLM model
client = VLLMOfflineClient(
    model="meta-llama/Llama-3.2-3B-Instruct",
    max_model_len=4096,
)

with Annotator(client=client, verbose=True) as anno:
    ds = anno.annotate_dataset(
        output_dir="outputs/sentiment",
        prompt_template="Classify the sentiment of this text: {text}",
        dataset_name="stanfordnlp/imdb",
        dataset_split="test",
        max_num_samples=100,
    )
```
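
The `{text}` placeholder in `prompt_template` above matches the `text` column of `stanfordnlp/imdb`. The sketch below shows one way such a template could be filled from a dataset row, assuming placeholders are resolved against column names. That mapping is an assumption of this example rather than documented behavior, and `template_fields` and `render_prompt` are hypothetical helpers, not part of the library.

```python
import string


def template_fields(template: str) -> set[str]:
    """Return the placeholder names used in a str.format-style template."""
    return {name for _, name, _, _ in string.Formatter().parse(template) if name}


def render_prompt(template: str, row: dict) -> str:
    """Fill the template from a row, failing early if a column is missing."""
    missing = template_fields(template) - set(row)
    if missing:
        raise KeyError(f"Row is missing columns for placeholders: {sorted(missing)}")
    # Extra columns in the row (e.g. "label") are simply ignored by str.format.
    return template.format(**row)


if __name__ == "__main__":
    template = "Classify the sentiment of this text: {text}"
    row = {"text": "Great film!", "label": 1}
    print(render_prompt(template, row))
    # prints "Classify the sentiment of this text: Great film!"
```

Checking the placeholders against the row before formatting turns a mismatch between template and dataset columns into an explicit error on the first sample instead of a failure deep into a long run.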

Generate a dataset from scratch:

```python
from llm_annotator import Annotator, OpenAIClient

client = OpenAIClient(model="gpt-4o-mini")

with Annotator(client=client) as anno:
    ds = anno.generate_dataset(
        output_dir="outputs/generated-qa",
        prompts="Write a short geography quiz question with answer.",
        max_num_samples=200,
    )
```

See the documentation for more examples, including:

- Structured output with JSON schemas
- Custom validation and post-processing
- Large-scale streaming annotation
- Generating datasets from scratch
- Multi-GPU support

Or check out the [examples/](examples/) directory for complete working examples.
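
As a rough picture of what retrying with validation can look like, here is a minimal standalone sketch. It does not use the library's hook API: `call_with_validation`, `validate_sentiment`, and the `label` field are hypothetical names chosen for this example, and `generate` stands in for any function that returns a raw model response as a string.

```python
import json
from typing import Callable


def validate_sentiment(parsed: object) -> None:
    """Hypothetical validator: require a dict with a known sentiment label."""
    if not isinstance(parsed, dict) or parsed.get("label") not in {"positive", "negative"}:
        raise ValueError(f"Unexpected output: {parsed!r}")


def call_with_validation(
    generate: Callable[[], str],
    validate: Callable[[object], None],
    max_retries: int = 3,
) -> object:
    """Call generate() until its output parses as JSON and passes validate()."""
    last_error: Exception | None = None
    for _ in range(max_retries):
        raw = generate()
        try:
            parsed = json.loads(raw)
            validate(parsed)
            return parsed
        except (json.JSONDecodeError, ValueError) as exc:
            # Remember why this attempt failed, then try again.
            last_error = exc
    raise RuntimeError(f"No valid output after {max_retries} attempts") from last_error


if __name__ == "__main__":
    # Simulated model responses: two bad outputs followed by a valid one.
    responses = iter(["not json", '{"label": "maybe"}', '{"label": "positive"}'])
    result = call_with_validation(lambda: next(responses), validate_sentiment)
    print(result)  # prints "{'label': 'positive'}"
```

Keeping parsing and validation inside the retry loop means malformed JSON and schema violations are handled the same way, and the final error still carries the last underlying cause.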


## Testing

Install development dependencies first:

```sh
uv sync --dev
```

Run the default checks:

```sh
make style
make quality
make test
make typecheck
```

Pytest marker targets:

```sh
# Fast tests (same as `make test`)
make test-fast

# Slow tests only
make test-slow

# Integration tests only
make test-integration

# Entire suite (fast + slow)
make test-all
```

You can also run markers directly with pytest:

```sh
uv run pytest -m "not slow"
uv run pytest -m "slow"
uv run pytest -m "integration"
```

Slow and integration tests may load local models, require more runtime, or depend on optional components.

## Building documentation

Preview the versioned docs locally (uses mike on a temporary local branch):

```sh
make serve-docs
```

Override version metadata when needed:

```sh
make serve-docs DOCS_VERSION=0.4.0 DOCS_ALIAS=latest DOCS_SOURCE_REF=v0.4.0
```

Docs are published with mike on release tags through
`.github/workflows/docs.yml`.
