Metadata-Version: 2.4
Name: ai5labs-relay
Version: 0.3.0
Summary: Relay — production-grade multi-provider LLM client. One YAML, one interface, every model.
Keywords: llm,ai,machine-learning,openai,anthropic,claude,gemini,bedrock,vertex,azure-openai,groq,mistral,deepseek,cohere,ollama,vllm,router,gateway,relay,byok,mcp,model-context-protocol,streaming,tool-calling,structured-output,cost-tracking,observability,opentelemetry,yaml,async
Author: ai5labs Research OPC Pvt Ltd
Author-email: ai5labs Research OPC Pvt Ltd <engineering@ai5labs.com>
License-Expression: Apache-2.0
License-File: LICENSE
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Framework :: AsyncIO
Classifier: Typing :: Typed
Requires-Dist: httpx[http2]>=0.27
Requires-Dist: pydantic>=2.7
Requires-Dist: pyyaml>=6.0
Requires-Dist: orjson>=3.10
Requires-Dist: typing-extensions>=4.12
Requires-Dist: jsonschema>=4.21
Requires-Dist: ai5labs-relay[otel,aws,gcp,mcp] ; extra == 'all'
Requires-Dist: boto3>=1.34 ; extra == 'aws'
Requires-Dist: google-auth>=2.30 ; extra == 'gcp'
Requires-Dist: mcp>=1.0 ; extra == 'mcp'
Requires-Dist: opentelemetry-api>=1.27 ; extra == 'otel'
Requires-Dist: opentelemetry-sdk>=1.27 ; extra == 'otel'
Requires-Python: >=3.10
Project-URL: Homepage, https://github.com/ai5labs/relay-llm
Project-URL: Documentation, https://github.com/ai5labs/relay-llm#readme
Project-URL: Repository, https://github.com/ai5labs/relay-llm
Project-URL: Issues, https://github.com/ai5labs/relay-llm/issues
Project-URL: Changelog, https://github.com/ai5labs/relay-llm/blob/main/CHANGELOG.md
Provides-Extra: all
Provides-Extra: aws
Provides-Extra: gcp
Provides-Extra: mcp
Provides-Extra: otel
Description-Content-Type: text/markdown

# Relay

> The fastest, lightest open-source BYOK relay for every LLM.

[![CI](https://github.com/ai5labs/relay-llm/actions/workflows/ci.yml/badge.svg)](https://github.com/ai5labs/relay-llm/actions/workflows/ci.yml)
[![Apache-2.0](https://img.shields.io/badge/license-Apache--2.0-blue.svg)](LICENSE)
[![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue.svg)](https://python.org)

A Python library that gives you one interface to every major LLM — chat, streaming, tool calls, structured output, batch, MCP — defined in a YAML file you check into your repo. Production-grade, enterprise-ready, OSS.

**~5–19× faster cold start than LiteLLM**, **~20% faster streaming TTFT**, and tied at the median on chat overhead with more consistent tails ([reproducible benchmarks](BENCHMARKS.md)).

```bash
pip install ai5labs-relay
```

```python
import asyncio

from relay import Hub


async def main() -> None:
    async with Hub.from_yaml("models.yaml") as hub:
        resp = await hub.chat(
            "fast-cheap",
            messages=[{"role": "user", "content": "What is 2+2?"}],
        )
        print(resp.text)
        print(resp.cost_usd, resp.cost.source)


asyncio.run(main())
```

## Why Relay

| | LiteLLM | LangChain | **Relay** |
|---|---|---|---|
| YAML model catalog | ✓ | — | ✓ |
| Built-in pricing snapshot with provenance | partial | — | ✓ |
| Live pricing (Bedrock, Azure, OpenRouter) | — | — | **✓** |
| Tool-call streaming deltas keyed by `index` (not `id`) | bug ([#20711](https://github.com/BerriAI/litellm/issues/20711)) | n/a | **✓** |
| **MCP universal tool layer** (any MCP server → any provider) | — | — | **✓** |
| **Cross-provider tool-schema compiler** with Mastra-style fallback | — | — | **✓** |
| **Pydantic structured output** (compiles per-provider, not text-coerced) | — | partial | **✓** |
| **Hub-level cache + Anthropic prompt-cache passthrough** | partial | — | **✓** |
| **Circuit breakers** with cooldown + half-open probes | — | — | **✓** |
| **OpenTelemetry GenAI semantic conventions** (opt-in) | — | — | **✓** |
| **Reasoning budget unification** across OpenAI/Anthropic/Gemini | — | — | **✓** |
| **OpenAI Responses API** opt-in (alongside Chat Completions) | — | — | **✓** |
| **Batch API wrapper** (OpenAI Batch + Anthropic Message Batches, ~50% off) | — | — | **✓** |
| **Native Bedrock / Azure / Gemini / Vertex / Cohere adapters** | OpenAI-compat shims | partial | **✓** native |
| **PII redaction pipeline** (regex + Presidio hooks) | — | — | **✓** |
| **Audit logging** (OTel-aligned schema, pluggable sinks) | enterprise SKU | — | **✓** |
| **Pre/post guardrails** (max-input, blocked-keywords, plugin-able) | enterprise SKU | — | **✓** |
| Anthropic `thinking` blocks preserved | flattened | flattened | **✓** |
| Typed errors (rate-limit / context-window / content-policy distinct) | partial | — | **✓** |
| `mypy --strict` (3 codes opted-out, see `pyproject.toml`) | — | — | ✓ |
| Apache-2.0 with explicit patent grant | MIT | MIT | ✓ |

## Quickstart

### 1. Define your models

Create `models.yaml`:

```yaml
# yaml-language-server: $schema=./relay.schema.json
# (generate the schema file once with: `relay schema --out relay.schema.json`)
version: 1

models:
  fast-cheap:
    target: groq/llama-3.3-70b-versatile
    credential: $env.GROQ_API_KEY

  smart:
    target: anthropic/claude-sonnet-4-5
    credential: $env.ANTHROPIC_API_KEY
    params:
      max_tokens: 4096

  cheap-vision:
    target: openai/gpt-4o-mini
    credential: $env.OPENAI_API_KEY

groups:
  default:
    strategy: fallback
    members: [smart, fast-cheap]    # try smart first, fall back to fast-cheap
```

The `yaml-language-server` comment on the first line of `models.yaml` points your editor at the generated schema file. With the [Red Hat YAML extension](https://marketplace.visualstudio.com/items?itemName=redhat.vscode-yaml) for VS Code, that gives you autocomplete and inline validation while editing.
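
To make the `fallback` strategy of the `default` group concrete, here is a minimal, library-independent sketch of the idea: try each member in order and return the first success. The `call_with_fallback` helper and the `fake_call` stand-in are hypothetical names used for illustration, not part of Relay's API, and Relay's own implementation may differ (for example, in which error types trigger a fallback).

```python
import asyncio
from collections.abc import Awaitable, Callable


async def call_with_fallback(
    members: list[str],
    call: Callable[[str], Awaitable[str]],
) -> tuple[str, str]:
    """Try each alias in order; return (alias, result) for the first success."""
    errors: list[Exception] = []
    for alias in members:
        try:
            return alias, await call(alias)
        except Exception as exc:  # a real client would catch only fallback-worthy errors
            errors.append(exc)
    raise RuntimeError(f"all members failed: {errors!r}")


async def fake_call(alias: str) -> str:
    # Stand-in for a provider call: pretend "smart" is currently unavailable.
    if alias == "smart":
        raise ConnectionError("smart is down")
    return f"answer from {alias}"


# Same member order as the `default` group above.
result = asyncio.run(call_with_fallback(["smart", "fast-cheap"], fake_call))
print(result)
```

Because the members are tried strictly in list order, the order in `members:` doubles as the priority order.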

### 2. Use it

```python
import asyncio

from relay import Hub

prompts = ["Hello", "What is 2+2?"]  # example inputs for the hot loop below


async def main() -> None:
    async with Hub.from_yaml("models.yaml") as hub:
        # Single model
        resp = await hub.chat("fast-cheap", messages=[
            {"role": "user", "content": "Hello"}
        ])

        # Group with fallback
        resp = await hub.chat("default", messages=[...])

        # Streaming
        async for ev in hub.stream("smart", messages=[...]):
            if ev.type == "text_delta":
                print(ev.text, end="", flush=True)
            elif ev.type == "thinking_delta":     # Anthropic extended thinking
                ...
            elif ev.type == "end":
                print(f"\nDone in {ev.response.latency_ms:.0f}ms, "
                      f"${ev.response.cost_usd:.4f}")

        # Bound handle for hot loops
        model = hub.get("fast-cheap")
        for prompt in prompts:
            resp = await model.chat(messages=[{"role": "user", "content": prompt}])


asyncio.run(main())
```

### 3. CLI

```bash
relay schema --out relay.schema.json     # JSON Schema for editors / docs
relay validate models.yaml               # validate config
relay models list                        # list configured aliases
relay models inspect smart               # show one alias's full config + catalog row
relay models compare sonnet 4o flash     # side-by-side: price, speed, MMLU, GPQA, HumanEval...
relay models recommend --task code --budget cheap --needs tools  # which model for the job?
relay catalog list --provider anthropic  # browse the built-in catalog
relay providers                          # list all supported providers
```

## Supported providers

**OpenAI-compatible** (one adapter): OpenAI, Groq, Together, DeepSeek, xAI, Mistral, Fireworks, Perplexity, OpenRouter, Ollama, vLLM, LM Studio.

**Native** (proper, lossless adapters): Anthropic, Azure OpenAI, AWS Bedrock, Cohere, Google Gemini direct, Vertex AI.

## Routing

`relay.routing` is the public extension point for picking a model per call. Two implementations ship with v0.2:

- `RuleBasedRouter` — deterministic, constraint-driven, in-process. Same scoring logic as `relay models recommend`, free.
- `SemanticRouter` — HTTP client for the hosted semantic router (paid, optional). Wire protocol documented in [`docs/routing/api-spec.md`](docs/routing/api-spec.md).

Attach a router and call `chat_routed` instead of `chat` — Relay picks the alias, falls back through alternates on error, and stamps the decision onto `response.metadata["routing"]`. Custom routers satisfying the `Router` Protocol are accepted. See [`docs/routing/usage.md`](docs/routing/usage.md) for examples.
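
As a rough illustration of what a deterministic, constraint-driven router does, the sketch below filters candidates by hard constraints and ranks the survivors by price. Everything in it is an assumption made for the example: `SketchRouter`, `Candidate`, `CheapestFirstRouter`, the `route` signature, and the prices are hypothetical and do not reflect the actual `Router` Protocol, which is documented in `docs/routing/usage.md`.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Candidate:
    """Hypothetical catalog row: an alias plus the facts a rule can check."""
    alias: str
    cost_per_1m_usd: float  # illustrative numbers, not real prices
    supports_tools: bool


class SketchRouter(Protocol):
    """Hypothetical stand-in for a router protocol: rank aliases for one call."""

    def route(self, *, needs_tools: bool, max_cost_per_1m_usd: float) -> list[str]: ...


class CheapestFirstRouter:
    """Drop candidates that violate a constraint, then sort the rest by price."""

    def __init__(self, candidates: list[Candidate]) -> None:
        self._candidates = candidates

    def route(self, *, needs_tools: bool, max_cost_per_1m_usd: float) -> list[str]:
        eligible = [
            c for c in self._candidates
            if c.cost_per_1m_usd <= max_cost_per_1m_usd
            and (c.supports_tools or not needs_tools)
        ]
        return [c.alias for c in sorted(eligible, key=lambda c: c.cost_per_1m_usd)]


router: SketchRouter = CheapestFirstRouter([
    Candidate("smart", cost_per_1m_usd=3.00, supports_tools=True),
    Candidate("fast-cheap", cost_per_1m_usd=0.60, supports_tools=True),
    Candidate("tiny", cost_per_1m_usd=0.10, supports_tools=False),
])
ranked = router.route(needs_tools=True, max_cost_per_1m_usd=5.00)
print(ranked)  # first entry is the pick; the rest are fallback alternates
```

Returning a ranked list rather than a single alias is what makes "fall back through alternates on error" possible without consulting the router a second time.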

## Pricing & cost tracking

Every response carries a `Cost` object with full provenance:

```python
resp.cost.total_usd        # 0.00234
resp.cost.source           # "live_api" | "snapshot" | "user_override" | "estimated" | "unknown"
resp.cost.confidence       # "exact" | "list_price" | "estimated" | "unknown"
resp.cost.fetched_at       # ISO 8601 timestamp (when fetched live)
```

**Tier order** (first match wins):

1. **User override** — explicit `cost:` block on a model entry, or a `pricing_profile`.
2. **Live APIs** (cached 6h in-process):
   - AWS Pricing API for Bedrock
   - Azure Retail Prices API for Azure OpenAI
   - OpenRouter `/api/v1/models` for ~400 models from OpenAI, Anthropic, Google, Groq, etc. at list price
3. **Snapshot** — JSON shipped with each release, regenerated weekly via CI.
4. **Unknown** — `cost_usd = None`, never wrong-by-default.
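
The "first match wins" rule can be sketched in a few lines. This is an illustration of the tier order only, not Relay's internal code: `Price`, `resolve_price`, `cost_usd`, and the sample prices are hypothetical, and the three dictionaries stand in for the override, live, and snapshot tiers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Price:
    input_per_1m: float   # USD per 1M input tokens
    output_per_1m: float  # USD per 1M output tokens
    source: str           # "user_override" | "live_api" | "snapshot"


def resolve_price(
    model: str,
    overrides: dict[str, Price],
    live: dict[str, Price],
    snapshot: dict[str, Price],
) -> Optional[Price]:
    """Return the price from the first tier that knows the model, else None."""
    for tier in (overrides, live, snapshot):
        if model in tier:
            return tier[model]
    return None


def cost_usd(price: Optional[Price], input_tokens: int, output_tokens: int) -> Optional[float]:
    """Unknown price means unknown cost: return None rather than guess."""
    if price is None:
        return None
    total = input_tokens * price.input_per_1m + output_tokens * price.output_per_1m
    return total / 1_000_000


# Illustrative numbers only.
snapshot = {"openai/gpt-4o": Price(2.50, 10.00, "snapshot")}
live = {"openai/gpt-4o": Price(2.40, 9.60, "live_api")}

price = resolve_price("openai/gpt-4o", {}, live, snapshot)
print(price.source, cost_usd(price, input_tokens=1_000, output_tokens=500))
print(cost_usd(resolve_price("unknown/model", {}, live, snapshot), 10, 10))
```

Here the live tier shadows the snapshot because it comes earlier in the order, and a model that no tier knows yields `None` instead of a made-up figure.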

### Negotiated rates

No public API exposes enterprise discounts (AWS EDP, Azure committed-use, OpenAI custom tiers). Configure them yourself:

```yaml
pricing_profiles:
  acme-aws-prod:
    description: "15% EDP discount"
    input_multiplier: 0.85
    output_multiplier: 0.85

  openai-team-tier:
    fixed_overrides:
      openai/gpt-4o:
        input_per_1m: 1.25
        output_per_1m: 5.00

models:
  bedrock-sonnet:
    target: bedrock/anthropic.claude-sonnet-4-5-20250929-v1:0
    region: us-east-1
    credential: { type: aws_profile, profile: prod }
    pricing_profile: acme-aws-prod
```
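
As a worked example of the arithmetic, the sketch below applies a profile to a list price. It assumes, without confirmation from the Relay source, that a fixed override replaces the list price outright and that multipliers apply otherwise. `Rate`, `PricingProfile`, and `effective_rate` are hypothetical names, and the list price used is illustrative.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rate:
    input_per_1m: float
    output_per_1m: float


@dataclass(frozen=True)
class PricingProfile:
    input_multiplier: float = 1.0
    output_multiplier: float = 1.0
    fixed_overrides: dict[str, Rate] = field(default_factory=dict)


def effective_rate(target: str, list_price: Rate, profile: PricingProfile) -> Rate:
    """Assumed precedence: a fixed override wins; otherwise multipliers scale the list price."""
    if target in profile.fixed_overrides:
        return profile.fixed_overrides[target]
    return Rate(
        input_per_1m=list_price.input_per_1m * profile.input_multiplier,
        output_per_1m=list_price.output_per_1m * profile.output_multiplier,
    )


# Mirrors the two profiles in the YAML above.
acme_aws_prod = PricingProfile(input_multiplier=0.85, output_multiplier=0.85)
openai_team_tier = PricingProfile(
    fixed_overrides={"openai/gpt-4o": Rate(input_per_1m=1.25, output_per_1m=5.00)}
)

list_price = Rate(input_per_1m=3.00, output_per_1m=15.00)  # illustrative list price
discounted = effective_rate(
    "bedrock/anthropic.claude-sonnet-4-5-20250929-v1:0", list_price, acme_aws_prod
)
overridden = effective_rate("openai/gpt-4o", Rate(2.50, 10.00), openai_team_tier)
print(discounted, overridden)
```

With a 0.85 multiplier, an illustrative $3.00 / $15.00 list price becomes roughly $2.55 / $12.75 per 1M tokens, while the fixed override ignores the list price entirely.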

## Production-grade design

- **Connection pooling**: one `httpx.AsyncClient` per `(provider, base_url)`, HTTP/2 enabled, keep-alive tuned for streaming workloads.
- **Lazy SDK imports**: `boto3` and other heavy dependencies are imported only on first use.
- **Streaming hot path** uses `orjson` and dicts — no Pydantic validation per-token. Pydantic only runs on the final assembled response.
- **Tool-call delta merging keyed by `index`**, not `id`. (LiteLLM keys by `id` and drops ~90% of argument deltas — issue [#20711](https://github.com/BerriAI/litellm/issues/20711).)
- **Provider-specific blocks preserved**: Anthropic `thinking`, Gemini `grounding`, citations — emitted as typed events, not flattened.
- **Classified errors**: `RateLimitError`, `ContextWindowError`, `ContentPolicyError`, `AuthenticationError` are distinct types — fall back vs retry vs fail-fast can be decided automatically.
- **OpenTelemetry GenAI semantic conventions** (opt-in): emits `gen_ai.*` spans + metrics that Datadog, Honeycomb, Langfuse, and Arize all consume.
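
The reason for keying tool-call deltas by `index` is easiest to see in code. In OpenAI-style Chat Completions streams, only the first fragment of a tool call carries its `id` and function name; later fragments carry just the `index` and a slice of the arguments string. The sketch below is a simplified illustration of the technique, not Relay's implementation, and `merge_tool_call_deltas` is a hypothetical name.

```python
import json


def merge_tool_call_deltas(deltas: list[dict]) -> list[dict]:
    """Assemble streamed tool-call fragments into complete calls, keyed by `index`."""
    calls: dict[int, dict] = {}
    for delta in deltas:
        call = calls.setdefault(delta["index"], {"id": None, "name": "", "arguments": ""})
        if delta.get("id"):
            call["id"] = delta["id"]
        fn = delta.get("function") or {}
        call["name"] += fn.get("name") or ""
        call["arguments"] += fn.get("arguments") or ""
    return [calls[i] for i in sorted(calls)]


# Continuation fragments have no `id`, so a merger keyed by `id` could not
# attach them to the call they belong to.
deltas = [
    {"index": 0, "id": "call_a", "function": {"name": "get_weather", "arguments": ""}},
    {"index": 0, "function": {"arguments": '{"city": '}},
    {"index": 1, "id": "call_b", "function": {"name": "get_time", "arguments": '{"tz": "UTC"}'}},
    {"index": 0, "function": {"arguments": '"Paris"}'}},
]
merged = merge_tool_call_deltas(deltas)
for call in merged:
    print(call["id"], call["name"], json.loads(call["arguments"]))
```

The arguments string is only parsed as JSON once the stream has finished, since intermediate fragments are not valid JSON on their own.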

## Security

- **Keys never inline in YAML** — credentials are reified objects (env var, AWS Secrets Manager, GCP Secret Manager, Vault).
- **Library, not a hosted proxy** by default. Your API keys stay in your process. (Compare: the LiteLLM proxy PyPI compromise of March 2026 leaked keys from every centralized deployment.)
- Releases will be **Sigstore-signed** via OIDC Trusted Publishing.
- See [SECURITY.md](SECURITY.md) for vulnerability reporting.
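
To illustrate the idea behind `$env.NAME` credential references, here is a minimal sketch of resolving one at runtime so that the secret itself never appears in the YAML file. `resolve_env_credential` is a hypothetical helper written for this example; Relay's own credential objects and their error handling may look different.

```python
import os
from collections.abc import Mapping
from typing import Optional


def resolve_env_credential(ref: str, environ: Optional[Mapping[str, str]] = None) -> str:
    """Resolve a `$env.NAME` reference to the value of environment variable NAME."""
    env = os.environ if environ is None else environ
    prefix = "$env."
    if not ref.startswith(prefix):
        raise ValueError(f"not an env reference: {ref!r}")
    name = ref[len(prefix):]
    try:
        return env[name]
    except KeyError:
        # Report the variable name only; never echo a secret value in errors or logs.
        raise LookupError(f"environment variable {name} is not set") from None


# A plain dict stands in for os.environ so the example has no side effects.
key = resolve_env_credential("$env.GROQ_API_KEY", {"GROQ_API_KEY": "gsk-example"})
print("resolved a key of length", len(key))
```

Resolving at call time rather than at config-parse time also means a rotated key is picked up without editing `models.yaml`.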

## Status

**v0.3.0 (alpha):** chat, streaming, tool calls, structured output, batch (OpenAI Batch + Anthropic Message Batches), MCP, Hub-level cache + provider-cache passthrough, PII redaction, audit logging, pre/post guardrails, OpenTelemetry GenAI semantic conventions, cost tracking with live pricing, 12 OpenAI-compatible providers + 6 native adapters (Anthropic, Azure OpenAI, AWS Bedrock, Cohere, Google Gemini direct, Vertex AI), plus opt-in OpenAI Responses API.

The public API surface is stable; anything under `_internal/` or in `_*` modules is not.

## Development

```bash
uv sync --all-groups
uv run pytest
uv run ruff check
uv run mypy
uv run pyright
```

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md). Please read [CODE_OF_CONDUCT.md](CODE_OF_CONDUCT.md) before opening a PR.

## Support

Relay is free, Apache-2.0, and actively maintained by [ai5labs Research OPC Pvt Ltd](https://ai5labs.com). If your team uses it in production, please consider:

- ⭐ **[Star the repo](https://github.com/ai5labs/relay-llm)** — actually helps a lot at this stage
- 🤝 **[Become a design partner](SUPPORT.md#3-become-a-design-partner)** — direct line to maintainers, roadmap influence, free for the program duration
- 🏢 **Enterprise support (planned for v0.3, Q3 2026)** — SLAs, custom features, VPC deployment, SOC 2, BAA/DPA on the roadmap. Email **engineering@ai5labs.com** to be a design partner.

See [SUPPORT.md](SUPPORT.md) for full details.

## License

Apache-2.0. See [LICENSE](LICENSE). Copyright © 2026 ai5labs Research OPC Pvt Ltd.
