Metadata-Version: 2.4
Name: llm-cache-router
Version: 0.2.3
Summary: Semantic cache, multi-provider LLM router and cost tracker (OpenAI, Anthropic, Gemini, Ollama, MiniMax, Qwen)
Author-email: Alexander Valenchits <valenchits@icloud.com>
Maintainer-email: Alexander Valenchits <valenchits@icloud.com>
License: MIT
Project-URL: Homepage, https://github.com/svalench/llm-cache-router
Project-URL: Repository, https://github.com/svalench/llm-cache-router
Project-URL: Issues, https://github.com/svalench/llm-cache-router/issues
Project-URL: Changelog, https://github.com/svalench/llm-cache-router/releases
Project-URL: Documentation, https://github.com/svalench/llm-cache-router#readme
Keywords: llm,cache,semantic-cache,router,openai,anthropic,gemini,ollama,minimax,qwen,cost-tracking,fastapi,prometheus,ai,async
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Typing :: Typed
Classifier: Framework :: AsyncIO
Classifier: Framework :: FastAPI
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pydantic>=2.0
Requires-Dist: httpx>=0.27
Requires-Dist: sentence-transformers>=3.0
Requires-Dist: faiss-cpu>=1.8
Requires-Dist: numpy>=1.26
Provides-Extra: redis
Requires-Dist: redis>=5.0; extra == "redis"
Provides-Extra: qdrant
Requires-Dist: qdrant-client>=1.9; extra == "qdrant"
Provides-Extra: fastapi
Requires-Dist: fastapi>=0.111; extra == "fastapi"
Requires-Dist: starlette>=0.37; extra == "fastapi"
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.23; extra == "dev"
Requires-Dist: pytest-cov>=5.0; extra == "dev"
Requires-Dist: ruff>=0.4; extra == "dev"
Requires-Dist: mypy>=1.10; extra == "dev"
Requires-Dist: build>=1.2; extra == "dev"
Requires-Dist: twine>=5.0; extra == "dev"
Provides-Extra: all
Requires-Dist: llm-cache-router[fastapi,qdrant,redis]; extra == "all"
Dynamic: license-file

# llm-cache-router

[![PyPI version](https://badge.fury.io/py/llm-cache-router.svg)](https://pypi.org/project/llm-cache-router/)
[![Python versions](https://img.shields.io/pypi/pyversions/llm-cache-router.svg)](https://pypi.org/project/llm-cache-router/)
[![PyPI Downloads](https://img.shields.io/pypi/dm/llm-cache-router.svg)](https://pypi.org/project/llm-cache-router/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![CI](https://github.com/svalench/llm-cache-router/actions/workflows/ci.yml/badge.svg)](https://github.com/svalench/llm-cache-router/actions/workflows/ci.yml)
[![Code style: ruff](https://img.shields.io/badge/code%20style-ruff-000000.svg)](https://github.com/astral-sh/ruff)

> A lightweight, production-ready Python library that combines **semantic caching**, **multi-provider LLM routing**, and **cost tracking** in a single async-first API. Cut your LLM bill, ship faster, and never hardcode a single provider again.

---

## Table of Contents

- [Why llm-cache-router](#why-llm-cache-router)
- [Features](#features)
- [Installation](#installation)
- [Quickstart](#quickstart)
- [Streaming](#streaming)
- [Cache Warmup](#cache-warmup)
- [Routing Strategies](#routing-strategies)
- [Cache Backends](#cache-backends)
- [Budget and Cost Tracking](#budget-and-cost-tracking)
- [FastAPI Integration](#fastapi-integration)
- [Async Context Manager](#async-context-manager)
- [Supported Providers](#supported-providers)
- [Architecture](#architecture)
- [Development](#development)
- [Roadmap](#roadmap)
- [Contributing](#contributing)
- [License](#license)

---

## Why llm-cache-router

Calling LLMs directly is expensive, slow, and locks you into a single vendor. This library solves all three problems at once:

- **Save money** — a semantic cache returns answers for near-duplicate queries without re-calling the provider, typically cutting spend by 30–70% on production workloads.
- **Stay resilient** — swap providers on the fly, use fallback chains, and never take a full outage because one vendor is down.
- **Control cost** — built-in daily/monthly budget guardrails with Prometheus metrics for every request.

One dependency. Six providers. Three cache backends. Full async support.

## Features

- **Semantic cache** — vector-similarity matching via `sentence-transformers`, not just exact string hashing.
- **Multi-provider routing** across OpenAI, Anthropic, Google Gemini, Ollama, MiniMax and Qwen (Dashscope).
- **Three routing strategies**: `CHEAPEST_FIRST`, `FASTEST_FIRST`, `FALLBACK_CHAIN`.
- **Pluggable cache backends**: in-memory (FAISS), Redis, Qdrant.
- **Streaming** — native async SSE streaming for every provider, transparent to the cache layer.
- **Cost tracker** with per-model pricing, daily/monthly budget limits and savings accounting.
- **Cache warmup** with controlled concurrency for pre-production pre-loading.
- **FastAPI middleware** + Prometheus metrics endpoint out of the box.
- **Typed** — Pydantic v2 models everywhere, fully typed public API.
- **Tested** — 10 test modules covering router, cache, providers, retry, warmup, and HTTP middleware.

## Installation

```bash
pip install llm-cache-router
```

Optional extras:

```bash
pip install "llm-cache-router[redis]"     # Redis cache backend
pip install "llm-cache-router[qdrant]"    # Qdrant vector cache backend
pip install "llm-cache-router[fastapi]"   # FastAPI middleware + Prometheus
pip install "llm-cache-router[all]"       # everything above
pip install "llm-cache-router[dev]"       # tests, ruff, mypy
```

Requires **Python 3.11+**.

## Quickstart

```python
import asyncio
from llm_cache_router import CacheConfig, LLMRouter, RoutingStrategy


async def main() -> None:
    router = LLMRouter(
        providers={
            "openai":    {"api_key": "sk-...",           "models": ["gpt-4o-mini"]},
            "anthropic": {"api_key": "sk-ant-...",       "models": ["claude-3-5-sonnet"]},
            "gemini":    {"api_key": "AIza...",          "models": ["gemini-1.5-flash"]},
            "ollama":    {"base_url": "http://localhost:11434", "models": ["llama3.2"]},
        },
        cache=CacheConfig(
            backend="memory",
            threshold=0.92,       # cosine similarity threshold
            ttl=3600,             # cache TTL in seconds
            max_entries=10_000,
        ),
        strategy=RoutingStrategy.CHEAPEST_FIRST,
        budget={"daily_usd": 5.0, "monthly_usd": 50.0},
    )

    response = await router.complete(
        messages=[{"role": "user", "content": "What is a semantic cache?"}],
        model="gpt-4o-mini",
    )
    print(response.content)
    print(f"cache_hit={response.cache_hit} cost=${response.cost_usd:.6f}")


asyncio.run(main())
```

## Streaming

All providers (OpenAI, Anthropic, Gemini, Ollama, MiniMax, Qwen) support native SSE streaming. The cache layer is transparent: on a cache hit you receive a single final chunk; on a miss you get a real streaming response that is also written to the cache once complete.

```python
async for chunk in router.stream(
    messages=[{"role": "user", "content": "Explain async/await in Python"}],
    model="gpt-4o-mini",
):
    print(chunk.delta, end="", flush=True)
    if chunk.is_final:
        print(f"\nprovider={chunk.provider_used} cost=${chunk.cost_usd:.6f}")
```

## Cache Warmup

Pre-load the cache with known queries before traffic hits production:

```python
from llm_cache_router.models import WarmupEntry

results = await router.warmup(
    entries=[
        WarmupEntry(
            messages=[{"role": "user", "content": "What is RAG?"}],
            model="gpt-4o-mini",
        ),
        WarmupEntry(
            messages=[{"role": "user", "content": "Explain vector databases"}],
            model="gpt-4o-mini",
        ),
    ],
    concurrency=5,
    skip_cached=True,
)
print(results)  # {"warmed": 2, "skipped": 0, "failed": 0}
```

## Routing Strategies

| Strategy | Description |
|---|---|
| `CHEAPEST_FIRST` | Picks the cheapest provider/model for each call based on per-model pricing. |
| `FASTEST_FIRST` | Picks the provider with the lowest observed latency, tracked as an exponential moving average (EMA); see the sketch further below. |
| `FALLBACK_CHAIN` | Tries providers in the configured order and falls back to the next on error or timeout. |

```python
router = LLMRouter(
    providers={
        "openai":    {"api_key": "sk-...",     "models": ["gpt-4o"]},
        "anthropic": {"api_key": "sk-ant-...", "models": ["claude-3-5-sonnet"]},
    },
    strategy=RoutingStrategy.FALLBACK_CHAIN,
    fallback_chain=["openai/gpt-4o", "anthropic/claude-3-5-sonnet"],
)
```
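
For intuition, the latency tracking behind `FASTEST_FIRST` can be sketched as a plain exponential moving average. This is a generic illustration, not the library's internals; the smoothing factor `alpha` is an assumption:

```python
class LatencyEMA:
    """Generic EMA over observed latencies (illustrative only)."""

    def __init__(self, alpha: float = 0.2) -> None:
        self.alpha = alpha          # weight of the newest sample
        self.value: float | None = None

    def update(self, latency_sec: float) -> float:
        # EMA: new = alpha * sample + (1 - alpha) * previous
        if self.value is None:
            self.value = latency_sec
        else:
            self.value = self.alpha * latency_sec + (1 - self.alpha) * self.value
        return self.value
```

A lower EMA means a provider has recently been answering faster, so it gets picked first.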

## Cache Backends

### In-memory (FAISS)

Default. Zero dependencies beyond the core install. Best for single-process apps and tests.

```python
cache=CacheConfig(backend="memory", threshold=0.92, ttl=3600, max_entries=10_000)
```
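
To build intuition for the `threshold` parameter: a lookup is a hit when the cosine similarity between the new query's embedding and a cached query's embedding meets the threshold. A conceptual sketch using only the core dependencies (not the library's internals; the model name here is illustrative):

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative model choice
cached = "What is a semantic cache?"
query = "Explain semantic caching"
a, b = model.encode([cached, query], normalize_embeddings=True)
similarity = float(np.dot(a, b))                  # cosine similarity of unit vectors
print(similarity, similarity >= 0.92)             # hit if similarity >= threshold
```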

### Redis

Production-grade distributed cache with LRU eviction, configurable timeouts, retry/backoff and bounded candidate set for vector search.

```python
cache=CacheConfig(
    backend="redis",
    redis_url="redis://localhost:6379/0",
    redis_namespace="llm_cache_router_prod",
    threshold=0.92,
    ttl=3600,
    max_entries=50_000,
    redis_command_timeout_sec=1.5,
    redis_retry_attempts=3,
    redis_retry_backoff_sec=0.2,
    redis_candidate_k=256,
)
```

### Qdrant

Native vector database for very large caches (millions of entries) and cross-service deployments.

```bash
pip install "llm-cache-router[qdrant]"
```

```python
cache=CacheConfig(
    backend="qdrant",
    qdrant_url="http://localhost:6333",
    qdrant_api_key=None,           # set this when using Qdrant Cloud
    qdrant_collection="llm_cache",
    threshold=0.92,
    ttl=3600,
    max_entries=100_000,
)
```

## Budget and Cost Tracking

Set per-day and per-month USD limits — requests that would exceed the budget are rejected before hitting the provider.

```python
router = LLMRouter(
    providers={...},
    budget={"daily_usd": 5.0, "monthly_usd": 50.0},
)

stats = router.stats()
print(stats.total_cost_usd)           # total spent since start
print(stats.saved_cost_usd)           # saved via cache hits
print(stats.daily_spend_usd)
print(stats.budget_remaining_usd)     # None if no limit is set
print(stats.cache_hit_rate)           # 0.0–1.0
```
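
The router enforces these limits on its own, but you can also pre-check the remaining budget using the documented stats fields. A usage sketch:

```python
stats = router.stats()
if stats.budget_remaining_usd is not None and stats.budget_remaining_usd <= 0.0:
    # Budget exhausted: skip the call or serve a degraded/cached answer.
    response = None
else:
    response = await router.complete(
        messages=[{"role": "user", "content": "ping"}],
        model="gpt-4o-mini",
    )
```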

## FastAPI Integration

```bash
pip install "llm-cache-router[fastapi]"
```

```python
from fastapi import FastAPI
from llm_cache_router.middleware.fastapi import (
    add_http_metrics_middleware,
    mount_metrics_endpoint,
)

app = FastAPI()
add_http_metrics_middleware(app=app)
# `router` is the LLMRouter instance from the Quickstart
mount_metrics_endpoint(app=app, router=router, path="/metrics")
```

Exposed Prometheus metrics:

- `llm_router_http_requests_total{method,path,status}`
- `llm_router_http_request_duration_seconds_*` (histogram)
- `llm_router_cache_hits_total`, `llm_router_cache_misses_total`
- `llm_router_cost_usd_total`, `llm_router_saved_cost_usd_total`
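
A quick way to confirm the endpoint is live (a sketch; the host and port are assumptions about your deployment):

```python
import httpx

text = httpx.get("http://localhost:8000/metrics").text  # assumed host/port
print([line for line in text.splitlines() if line.startswith("llm_router_")][:5])
```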

## Async Context Manager

```python
async with LLMRouter(providers={...}) as router:
    response = await router.complete(messages=[...], model="gpt-4o-mini")
# close() is called automatically — closes provider clients and cache connections
```

## Supported Providers

| Provider | Streaming | Notes |
|---|---|---|
| OpenAI | yes | `gpt-4o`, `gpt-4o-mini`, `o1-*`, etc. |
| Anthropic | yes | Claude 3.5 Sonnet/Haiku, Opus |
| Google Gemini | yes | 1.5 Flash, 1.5 Pro |
| Ollama | yes | Any locally-served model |
| MiniMax | yes | `MiniMax-Text-01` and others |
| Qwen (Dashscope) | yes | `qwen-plus`, `qwen-max`, etc. |

To add a new provider, subclass `LLMProvider` and register it with `@register_provider("name")`; see `llm_cache_router/providers/base.py` and the sketch below.
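
A minimal sketch, assuming both names are importable from `llm_cache_router.providers.base` as the README implies; the `complete` method and its signature are assumptions about the interface, so check `base.py` for the actual abstract methods:

```python
from llm_cache_router.providers.base import LLMProvider, register_provider


@register_provider("my_provider")
class MyProvider(LLMProvider):
    async def complete(self, messages, model, **kwargs):  # assumed signature
        # Call your backend here and adapt its reply to the library's
        # response model (see models.py).
        ...
```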

## Architecture

```text
llm_cache_router/
  cache/          # memory (FAISS) / redis / qdrant backends
  providers/      # openai, anthropic, gemini, ollama, minimax, qwen
  strategies/     # cheapest, fastest, fallback
  embeddings/     # SentenceEncoder, HashingEncoder
  cost/           # CostTracker with daily/monthly budgets
  middleware/     # FastAPI middleware
  observability/  # Prometheus metrics
  models.py       # Pydantic models (LLMResponse, LLMStreamChunk, ...)
  router.py       # LLMRouter — public entrypoint
  retry.py        # RetryConfig + exponential backoff
  warmup.py       # async warmup helper
```
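
For intuition on the exponential backoff that `retry.py` provides: delays typically grow as `base * 2**attempt`, often with jitter. A generic illustration, not `RetryConfig`'s actual fields:

```python
import random


def backoff_delays(attempts: int, base: float = 0.2, cap: float = 10.0) -> list[float]:
    # Full-jitter exponential backoff: delay_i ~ U(0, min(cap, base * 2**i)).
    return [random.uniform(0.0, min(cap, base * 2**i)) for i in range(attempts)]


print(backoff_delays(5))  # e.g. [0.13, 0.31, 0.65, 1.4, 2.8]
```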

## Development

```bash
git clone https://github.com/svalench/llm-cache-router.git
cd llm-cache-router

# using uv (recommended)
uv sync --all-extras
uv run pytest

# or plain pip
pip install -e ".[all,dev]"
pytest
```

Code quality is enforced in CI via:

- `ruff check` (lint) and `ruff format --check` (style)
- `mypy --ignore-missing-imports` (type check)
- `pytest` on Python 3.11, 3.12, 3.13 with coverage

## Roadmap

- **v0.3** — Django helpers and middleware.
- **v0.4** — Streaming retry (reconnect on SSE drop).
- **v0.5** — Request tracing hooks (OpenTelemetry).
- **v1.0** — Full OTel spans, pluggable pricing providers, cache invalidation API.

## Contributing

Pull requests are welcome. Please:

1. Open an issue first for anything larger than a small bug fix.
2. Add tests for new behaviour.
3. Run `ruff check`, `ruff format`, `mypy` and `pytest` before pushing.

## License

MIT — see [LICENSE](LICENSE) for details.

---

## 🇷🇺 Brief Description (translated from Russian)

**llm-cache-router** is a lightweight, production-ready Python library for semantic caching of LLM requests, multi-provider routing, and budget control. It saves 30–70% of LLM costs via a vector cache, switches between providers (OpenAI, Anthropic, Gemini, Ollama, MiniMax, Qwen) without changes to application code, and includes built-in cost tracking with daily/monthly limits. It supports three cache backends (in-memory / Redis / Qdrant), native streaming for all providers, and FastAPI middleware with Prometheus metrics.

**Installation:**

```bash
pip install llm-cache-router

# with optional backends
pip install "llm-cache-router[redis]"
pip install "llm-cache-router[qdrant]"
pip install "llm-cache-router[fastapi]"
pip install "llm-cache-router[all]"
```

Requires **Python 3.11+**. Full documentation and examples are above.
