Metadata-Version: 2.4
Name: llm-infer
Version: 0.3.0
Summary: A readable LLM inference server implementing paged attention and continuous batching
Author-email: LLM Works LLC <info@llm-works.ai>
License-Expression: Apache-2.0
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: appinfra[fastapi]<0.6.0,>=0.5.0
Requires-Dist: httpx<1.0.0,>=0.27.0
Requires-Dist: aiohttp<4.0.0,>=3.13.3
Requires-Dist: filelock<4.0.0,>=3.20.1
Requires-Dist: urllib3<3.0.0,>=2.6.3
Requires-Dist: werkzeug<4.0.0,>=3.1.5
Provides-Extra: runtime
Requires-Dist: torch<3.0.0,>=2.0.0; extra == "runtime"
Requires-Dist: transformers<5.0.0,>=4.30.0; extra == "runtime"
Requires-Dist: safetensors<1.0.0,>=0.4.0; extra == "runtime"
Provides-Extra: anthropic
Requires-Dist: anthropic<1.0.0,>=0.47.0; extra == "anthropic"
Provides-Extra: saia
Requires-Dist: llm-saia<0.3.0,>=0.2.0; extra == "saia"
Provides-Extra: cuda
Requires-Dist: vllm<1.0.0,>=0.7.0; extra == "cuda"
Requires-Dist: flashinfer-python<1.0.0,>=0.2.0; extra == "cuda"
Requires-Dist: pynvml<13.0.0,>=11.0.0; extra == "cuda"
Provides-Extra: dev
Requires-Dist: coverage<8.0.0,>=7.0.0; extra == "dev"
Requires-Dist: ruff<1.0.0,>=0.1.0; extra == "dev"
Requires-Dist: mypy<2.0.0,>=1.0.0; extra == "dev"
Requires-Dist: types-PyYAML>=6.0.0; extra == "dev"
Requires-Dist: pytest<10.0.0,>=7.0.0; extra == "dev"
Requires-Dist: pytest-asyncio<2.0.0,>=0.21.0; extra == "dev"
Requires-Dist: pytest-cov<8.0.0,>=4.0.0; extra == "dev"
Requires-Dist: pytest-xdist<4.0.0,>=3.0.0; extra == "dev"
Dynamic: license-file

# llm-infer

![Python](https://img.shields.io/badge/python-3.11+-blue.svg)
![Coverage](https://img.shields.io/badge/coverage-53%25-yellow.svg)
[![Typed](https://img.shields.io/badge/typed-PEP%20561-brightgreen.svg)](https://peps.python.org/pep-0561/)
[![Linting: Ruff](https://img.shields.io/badge/linting-ruff-brightgreen)](https://github.com/astral-sh/ruff)
[![CI](https://github.com/llm-works/llm-infer/actions/workflows/ci.yml/badge.svg)](https://github.com/llm-works/llm-infer/actions/workflows/ci.yml)
[![PyPI](https://img.shields.io/pypi/v/llm-infer.svg)](https://pypi.org/project/llm-infer/)
![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)

Unified CLI and client library for local LLM inference. Wraps Ollama, vLLM, and a native engine
behind a single interface.

**Components:**

- **CLI & Server** - Single command to serve models via Ollama, vLLM, or the native torch engine
- **Client Package** - Standard interface to multiple LLM backends (OpenAI, Anthropic, local servers)
- **Native Engine** - Custom torch implementation for learning and experimentation

## Quick Start

```bash
pip install llm-infer

# With Ollama (https://ollama.com)
ollama pull qwen2.5:0.5b
llm-infer serve --model qwen2.5:0.5b

# Query
llm-infer query "What is the capital of France?"
```

## Client Package

`llm_infer.client` is a Python client library for LLM inference with a unified interface across
backends. Built for autonomous agents and production use:

- **Multiple backends** - OpenAI, Anthropic, and any OpenAI-compatible API
- **Sync, async, streaming** - All execution modes supported
- **Rate limiting** - Per-backend request throttling
- **Retry with backoff** - Configurable exponential backoff on failures
- **Model routing** - Route requests to backends by model name
- **Extensible** - Register custom backends via `Factory.register()`

```python
from appinfra.log import Logger
from llm_infer.client import Factory

lg = Logger("my-app")
factory = Factory(lg)

with factory.openai(base_url="http://localhost:8000/v1") as client:
    response = client.chat(
        messages=[{"role": "user", "content": "Hello!"}],
        system="You are a helpful assistant.",
    )
    print(response.content)

# Streaming
with factory.openai(base_url="http://localhost:8000/v1") as client:
    messages = [{"role": "user", "content": "Hello!"}]
    for token in client.chat_stream(messages):
        print(token, end="", flush=True)

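# Async (`async with` and `await` must run inside a coroutine)
import asyncio

async def main() -> None:
    async with factory.openai(base_url="http://localhost:8000/v1") as client:
        messages = [{"role": "user", "content": "Hello!"}]
        response = await client.chat_async(messages)
        print(response.content)

asyncio.run(main())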
```
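
The retry-with-backoff behaviour listed above can be illustrated with a minimal sketch. This is not the library's implementation: `backoff_delays` and `call_with_backoff` are hypothetical helpers, and the base delay, growth factor, cap, and full-jitter strategy are assumptions chosen for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int, base: float = 0.5, factor: float = 2.0,
                   cap: float = 30.0) -> list[float]:
    """Return the nominal (un-jittered) delay before each retry, capped at `cap`."""
    return [min(cap, base * factor**n) for n in range(retries)]


def call_with_backoff(fn: Callable[[], T], retries: int = 3, base: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `fn`, retrying on any exception with jittered exponential delays."""
    delays = backoff_delays(retries, base)
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            # Full jitter: sleep a random fraction of the nominal delay.
            sleep(random.uniform(0, delays[attempt]))
    raise AssertionError("unreachable")
```

With these defaults the nominal delays grow as 0.5 s, 1 s, 2 s, and so on. The jitter spreads retries out so that many clients hitting the same backend do not retry in lockstep.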

### Protocol Extensions

The server extends the OpenAI chat completions API:

**Request** - adds `think` and `adapter` fields:
```json
{
  "model": "default",
  "messages": [{"role": "user", "content": "What is 15 * 23?"}],
  "think": true,
  "adapter": "my-lora-adapter"
}
```

**Response** - adds `thinking` in message and `adapter` metadata:
```json
{
  "id": "chatcmpl-123",
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "345",
      "thinking": "Let me calculate step by step..."
    }
  }],
  "adapter": {
    "requested": "my-lora-adapter",
    "actual": "my-lora-adapter",
    "fallback": false
  }
}
```

The client library exposes these as keyword arguments:

```python
response = client.chat(messages, think=True, adapter="my-adapter")
print(response.thinking)  # Reasoning content
print(response.content)   # Final answer
```
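
When calling the server without the client library, the extension fields are ordinary JSON. The sketch below builds a request body and reads the extra response fields using only the standard library. `build_chat_request` and `parse_chat_response` are hypothetical helper names, and treating `thinking` and `adapter` as optional in the response is an assumption.

```python
import json
from typing import Any, Optional


def build_chat_request(prompt: str, *, model: str = "default", think: bool = False,
                       adapter: Optional[str] = None) -> dict[str, Any]:
    """Build a chat completions body, adding `think`/`adapter` only when requested."""
    body: dict[str, Any] = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    if think:
        body["think"] = True
    if adapter is not None:
        body["adapter"] = adapter
    return body


def parse_chat_response(raw: str) -> dict[str, Any]:
    """Extract the answer, the optional thinking text, and the adapter metadata."""
    data = json.loads(raw)
    message = data["choices"][0]["message"]
    adapter = data.get("adapter") or {}  # assumption: may be missing
    return {
        "content": message["content"],
        "thinking": message.get("thinking"),  # assumption: may be missing
        "adapter": adapter.get("actual"),
        "fallback": adapter.get("fallback", False),
    }
```

Checking `fallback` lets a caller detect when the server answered with a different adapter than the one requested.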

### Multiple Backends

```python
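import asyncio

# Reuses `factory` from the Client Package example.
messages = [{"role": "user", "content": "Hello!"}]

# Anthropic (requires the `anthropic` extra)
async def ask_anthropic() -> None:
    async with factory.anthropic(model="claude-sonnet-4-20250514") as client:
        response = await client.chat_async(messages)

asyncio.run(ask_anthropic())

# OpenAI
with factory.openai(base_url="https://api.openai.com/v1", api_key="sk-...") as client:
    response = client.chat(messages)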
```

## Engines

| Engine | Description | Install |
|--------|-------------|---------|
| `ollama` (default) | Wraps Ollama server | [ollama.com](https://ollama.com) |
| `vllm` | vLLM Python API | `pip install vllm` |
| `vllm-server` | vLLM HTTP subprocess | `pip install vllm` |
| `native` | Custom torch implementation | `pip install "llm-infer[runtime]"` |

```bash
llm-infer serve --model qwen2.5:7b                          # Ollama
llm-infer serve --engine vllm --model-path /path/to/model   # vLLM
llm-infer serve --engine native --model-path /path/to/model # Native
```

### Native Engine

The native engine is a from-scratch torch implementation with PagedAttention and FlashInfer. Useful
for learning how LLM inference works or experimenting with custom modifications.

```bash
pip install "llm-infer[runtime]"   # quotes keep zsh from globbing the brackets
llm-infer serve --engine native --model-path /path/to/model
```

## Configuration

```yaml
# etc/llm-infer.yaml
backends:
  engine: ollama

models:
  locations:
    - /path/to/models
  selection:
    generate:
      default: qwen2.5-7b
    embed:
      default: bge-small-en-v1.5

api:
  host: 0.0.0.0
  port: 8000
```

Per-model overrides go in `etc/models.yaml`:

```yaml
models:
  qwen2.5-7b:
    max_model_len: 8192
    vllm:
      enforce_eager: true

  qwen2.5:7b:
    ollama: qwen2.5:7b  # Ollama model name mapping
```

## API Endpoints

| Endpoint | Description |
|----------|-------------|
| `POST /v1/chat/completions` | Chat completion (OpenAI-compatible) |
| `POST /v1/completions` | Text completion (OpenAI-compatible) |
| `GET /v1/models` | List available models |
| `GET /health` | Health check |
| `GET /metrics` | Prometheus metrics |
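
The endpoints above can be exercised from Python with only the standard library. This is a minimal sketch: `BASE_URL` assumes a server on the local machine using port 8000 from the configuration example, the `"default"` model name is taken from the request example earlier, and `chat_request`, `health_request`, and `send` are hypothetical helper names. The response bodies are returned as raw bytes because their exact shape is not specified here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: local server on the configured port


def chat_request(prompt: str, model: str = "default") -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/chat/completions."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def health_request() -> urllib.request.Request:
    """Build (but do not send) a GET to /health."""
    return urllib.request.Request(f"{BASE_URL}/health", method="GET")


def send(request: urllib.request.Request, timeout: float = 10.0) -> tuple[int, bytes]:
    """Send a prepared request and return (HTTP status, raw response body)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status, resp.read()
```

With a server running, `send(health_request())` performs the health check and `send(chat_request("Hello!"))` returns the chat completion as JSON bytes.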

## Installation

```bash
pip install llm-infer                # Base install (CLI, server, and client)
pip install "llm-infer[anthropic]"   # With Anthropic support
pip install "llm-infer[saia]"        # With llm-saia integration
pip install "llm-infer[runtime]"     # With native engine (torch)
```

## License

Apache License 2.0
