Metadata-Version: 2.4
Name: cacheback-ai
Version: 0.2.0
Summary: Universal semantic cache for AI APIs — text, image, voice. Drop-in wrapper for OpenAI/Anthropic SDKs.
Project-URL: Homepage, https://cacheback.ai
Project-URL: Repository, https://github.com/bgml-ai/cacheback
Project-URL: Documentation, https://cacheback.ai/docs
Author-email: Bogumił Jankiewicz <bogumil@bgml.ai>
License: Apache-2.0
License-File: LICENSE
Keywords: ai,anthropic,cache,clip,embeddings,llm,multimodal,openai,semantic
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: hnswlib>=0.8.0
Requires-Dist: huggingface-hub>=0.20.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: onnxruntime>=1.17.0
Requires-Dist: tokenizers>=0.15.0
Provides-Extra: all
Requires-Dist: anthropic>=0.20.0; extra == 'all'
Requires-Dist: openai>=1.0.0; extra == 'all'
Requires-Dist: pillow>=10.0.0; extra == 'all'
Requires-Dist: soundfile>=0.12.0; extra == 'all'
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.20.0; extra == 'anthropic'
Provides-Extra: dev
Requires-Dist: httpx>=0.27.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.24.0; extra == 'dev'
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Provides-Extra: image
Requires-Dist: pillow>=10.0.0; extra == 'image'
Provides-Extra: openai
Requires-Dist: openai>=1.0.0; extra == 'openai'
Provides-Extra: proxy
Requires-Dist: fastapi>=0.110.0; extra == 'proxy'
Requires-Dist: httpx>=0.27.0; extra == 'proxy'
Requires-Dist: pydantic>=2.0.0; extra == 'proxy'
Requires-Dist: uvicorn[standard]>=0.27.0; extra == 'proxy'
Provides-Extra: voice
Requires-Dist: soundfile>=0.12.0; extra == 'voice'
Description-Content-Type: text/markdown

# cacheback

**Universal semantic cache for AI APIs.** Drop-in wrapper for OpenAI and Anthropic SDKs with three-tier response: verbatim cache, synthesis, upstream.

Cache semantically similar queries and return responses in under 10ms. When no cached entry is close enough for a verbatim hit, synthesize from cached knowledge (~300ms, ~$0.002). Save 30-90% on API costs, depending on how often queries repeat.

[![PyPI](https://img.shields.io/pypi/v/cacheback-ai)](https://pypi.org/project/cacheback-ai/)
[![License](https://img.shields.io/badge/license-Apache%202.0-blue)](LICENSE)
[![Python](https://img.shields.io/pypi/pyversions/cacheback-ai)](https://pypi.org/project/cacheback-ai/)
[![Tests](https://github.com/bgml-ai/cacheback/actions/workflows/ci.yml/badge.svg)](https://github.com/bgml-ai/cacheback/actions)

## Install

```bash
pip install cacheback-ai              # core
pip install cacheback-ai[openai]      # + OpenAI wrapper
pip install cacheback-ai[anthropic]   # + Anthropic wrapper
pip install cacheback-ai[proxy]       # + proxy server (FastAPI)
pip install cacheback-ai[all]         # everything
```

## Quick Start

### OpenAI (drop-in, zero code change)

```python
from cacheback import CachedOpenAI

client = CachedOpenAI(api_key="sk-...")

# First call: ~500ms (API + cache populate)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)

# Second call with similar query: ~5ms (cache hit)
response2 = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "capital of France?"}],
)
print(response2.cacheback_hit)  # True
```

### Anthropic

```python
from cacheback import CachedAnthropic

client = CachedAnthropic(api_key="sk-ant-...")
message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "What is Python?"}],
)
print(message.cacheback_hit)  # True on cache hit
```

### Streaming

Streaming works transparently. Cache misses buffer and store the response; cache hits replay as a synthetic stream.

```python
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
```
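
The buffer-and-replay behavior can be pictured with plain generators. This is a minimal sketch, not cacheback's actual implementation: `record_stream`, `replay_stream`, and the dict-based `store` are hypothetical names, and chunks are plain strings rather than SDK chunk objects.

```python
from typing import Dict, Iterable, Iterator, List


def record_stream(key: str, upstream: Iterable[str], store: Dict[str, str]) -> Iterator[str]:
    """Pass chunks through to the caller while buffering them for the cache."""
    buffer: List[str] = []
    for chunk in upstream:
        buffer.append(chunk)
        yield chunk
    # Reached only after the stream has been fully consumed.
    store[key] = "".join(buffer)


def replay_stream(text: str, chunk_size: int = 8) -> Iterator[str]:
    """Replay a cached response as a synthetic stream of fixed-size chunks."""
    for start in range(0, len(text), chunk_size):
        yield text[start:start + chunk_size]


store: Dict[str, str] = {}
live = "".join(record_stream("q1", ["Quantum ", "computing ", "uses qubits."], store))
replayed = "".join(replay_stream(store["q1"]))
```

In this sketch a stream that the caller abandons halfway is never stored, which avoids caching truncated responses.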

### Cache-Augmented Synthesis (CAS)

When a query is similar to cached entries but not close enough for a verbatim hit, CAS synthesizes a fresh response from cached knowledge using a cheap LLM instead of calling the expensive upstream API.

```python
from cacheback import CachedOpenAI

client = CachedOpenAI(
    synthesis_mode="auto",  # enable three-tier response
    # Uses Gemini Flash Lite via OpenRouter by default (~$0.002/synthesis)
    # Or point to local llama-cpp: synthesis_model="local/phi-4-mini"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain photosynthesis"}],
)

if response.cacheback_hit:
    print("Verbatim cache hit (<10ms, $0.00)")
elif response.cacheback_synthesized:
    print("Synthesized from cache (~300ms, ~$0.002)")
else:
    print("Upstream API call (~500ms, ~$0.03)")
```

```
Three-tier response:

  Query  -->  Embed  -->  HNSW search
                |
    sim >= 0.92 |  VERBATIM HIT   -->  Return cached response     <10ms   $0.00
    sim >= 0.80 |  SYNTHESIS       -->  Top-K cached Q&A + LLM    ~300ms  ~$0.002
    sim <  0.80 |  UPSTREAM MISS   -->  Call API, cache response   ~500ms  ~$0.03
```
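
The tier decision in the diagram reduces to two threshold comparisons, checked from highest to lowest. A minimal sketch, assuming the default thresholds shown above; `route` is a hypothetical helper, not part of the cacheback API.

```python
def route(
    similarity: float,
    verbatim_threshold: float = 0.92,
    synthesis_threshold: float = 0.80,
) -> str:
    """Map the best cosine similarity to one of the three response tiers."""
    if similarity >= verbatim_threshold:
        return "verbatim"   # return the cached response as-is
    if similarity >= synthesis_threshold:
        return "synthesis"  # build an answer from the top-K cached Q&A pairs
    return "upstream"       # call the API and cache the result
```

Because the checks run in order, the synthesis tier effectively covers `0.80 <= sim < 0.92`.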

Validated on a 100-question benchmark across 5 domains: **0.892 mean quality ratio** vs. direct API responses.
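
To estimate what the three tiers mean for a given workload, weight the approximate per-tier costs quoted above by the share of traffic each tier serves. The traffic mix below is a made-up illustration, not a benchmark result, and `expected_cost` is a hypothetical helper.

```python
# Approximate per-query costs from the three-tier diagram (USD).
COST_PER_TIER = {"verbatim": 0.0, "synthesis": 0.002, "upstream": 0.03}


def expected_cost(mix: dict[str, float]) -> float:
    """Average cost per query for a traffic mix whose fractions sum to 1."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("tier fractions must sum to 1")
    return sum(COST_PER_TIER[tier] * share for tier, share in mix.items())


# Hypothetical mix: half verbatim hits, a fifth synthesized, the rest upstream.
mix = {"verbatim": 0.5, "synthesis": 0.2, "upstream": 0.3}
cost = expected_cost(mix)
savings = 1 - cost / COST_PER_TIER["upstream"]
print(f"${cost:.4f} per query, {savings:.0%} saved")  # $0.0094 per query, 69% saved
```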

### Proxy Mode (zero code change)

Run cacheback as a standalone proxy server. No SDK integration needed — just change `base_url`:

```bash
# Docker (recommended)
docker run -e OPENAI_API_KEY=sk-... -p 8080:8080 cacheback/proxy

# Or pip
pip install cacheback-ai[proxy]
cacheback-proxy  # starts on :8080
```

Then point your existing code at the proxy:

```python
from openai import OpenAI

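# Note: the OpenAI SDK still needs an api_key (argument or OPENAI_API_KEY env var) to construct a client
client = OpenAI(base_url="http://localhost:8080/v1")  # that's it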
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is Python?"}],
)
# Cache headers: X-Cacheback-Hit, X-Cacheback-Synthesized
```

Works with any OpenAI-compatible client (curl, LangChain, LiteLLM, etc.). Configure via environment variables:

| Variable | Default | Description |
|----------|---------|-------------|
| `OPENAI_API_KEY` | — | API key for upstream provider |
| `CACHEBACK_SIMILARITY_THRESHOLD` | `0.92` | Cache hit threshold |
| `CACHEBACK_SYNTHESIS_MODE` | `off` | `off` / `auto` / `always` |
| `CACHEBACK_TTL` | `86400` | Cache TTL in seconds |
| `CACHEBACK_PORT` | `8080` | Server port |
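
If you launch the proxy from your own script, it can help to validate these variables before starting it. The sketch below mirrors the defaults in the table; `ProxySettings` and `load_settings` are hypothetical names for illustration and are not how the proxy itself parses its environment.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ProxySettings:
    similarity_threshold: float
    synthesis_mode: str
    ttl_seconds: int
    port: int


def load_settings(env=None) -> ProxySettings:
    """Read the CACHEBACK_* variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    mode = env.get("CACHEBACK_SYNTHESIS_MODE", "off")
    if mode not in {"off", "auto", "always"}:
        raise ValueError(f"unsupported synthesis mode: {mode!r}")
    return ProxySettings(
        similarity_threshold=float(env.get("CACHEBACK_SIMILARITY_THRESHOLD", "0.92")),
        synthesis_mode=mode,
        ttl_seconds=int(env.get("CACHEBACK_TTL", "86400")),
        port=int(env.get("CACHEBACK_PORT", "8080")),
    )
```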

### Async

```python
import asyncio

from cacheback import AsyncCachedOpenAI, AsyncCachedAnthropic


async def main():
    async_client = AsyncCachedOpenAI()
    response = await async_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello"}],
    )
    return response

asyncio.run(main())
```

### Standalone Cache

Use `SemanticCache` directly for any embedding-based caching:

```python
from cacheback import SemanticCache

cache = SemanticCache(
    similarity_threshold=0.92,
    cache_ttl=86400,  # 24 hours
)

cache.populate("What is Python?", "Python is a programming language...")
result = cache.lookup("Tell me about Python")  # a hit if similarity >= 0.92
```

### Negative Cache (blocklist)

Block known-bad query patterns before they hit the API:

```python
# Block a query pattern
client.cache.negative.add(
    "What is the airspeed of an unladen swallow?",
    reason="hallucination",
)

# Similar queries are now blocked
client.cache.negative.check("airspeed of swallows")  # returns match info

# Manage the blocklist
client.cache.negative.list(limit=50)
client.cache.negative.remove(entry_id=42)
client.cache.negative.report_false_positive(entry_id=42)
```

## Configuration

```python
from cacheback import CachedOpenAI

client = CachedOpenAI(
    # Cache settings
    cache_dir="~/.cacheback",        # where to store cache data
    similarity_threshold=0.92,        # cosine similarity for cache hit (0-1)
    negative_threshold=0.85,          # threshold for negative cache
    cache_ttl=86400,                  # TTL in seconds (24h default)
    cache_max_entries=100_000,        # max entries before LRU eviction
    cache_enabled=True,               # set False to disable
    on_negative_hit="raise",          # "raise" | "skip" | callable

    # Synthesis settings (CAS)
    synthesis_mode="off",             # "off" | "auto" | "always"
    synthesis_model="google/gemini-2.0-flash-lite-001",  # any OpenAI-compatible model
    synthesis_model_base_url=None,    # auto-detected from OPENROUTER_API_KEY
    synthesis_model_api_key=None,     # auto-detected from env
    synthesis_threshold=0.80,         # min similarity for synthesis candidates
    synthesis_top_k=5,                # number of cached Q&A pairs for synthesis

    # OpenAI settings (passthrough)
    api_key="sk-...",
)
```
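
The `cache_ttl` and `cache_max_entries` settings combine two eviction rules: entries expire after a fixed time, and the least recently used entry is dropped once the cache is full. The sketch below illustrates that combination with exact-match keys; `TTLLRUCache` is a hypothetical class and does not reflect how cacheback's SQLite store is implemented.

```python
import time
from collections import OrderedDict


class TTLLRUCache:
    """Exact-match cache with time-based expiry and LRU eviction."""

    def __init__(self, max_entries: int, ttl: float, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl = ttl
        self.clock = clock
        self._items = OrderedDict()  # key -> (stored_at, value), oldest first

    def put(self, key, value):
        self._items[key] = (self.clock(), value)
        self._items.move_to_end(key)
        while len(self._items) > self.max_entries:
            self._items.popitem(last=False)  # evict the least recently used entry

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self.clock() - stored_at > self.ttl:
            del self._items[key]  # entry has expired
            return None
        self._items.move_to_end(key)  # mark as recently used
        return value
```

The injectable `clock` argument exists only so the expiry logic can be exercised without waiting for real time to pass.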

## How It Works

```
Query --> Embed (MiniLM-L6, 384-dim) --> Search HNSW index
  |-- VERBATIM HIT  (sim >= 0.92) --> Return cached response (<10ms)
  |-- SYNTHESIS      (sim >= 0.80) --> Top-K cached Q&A + cheap LLM (~300ms)
  '-- MISS           (sim <  0.80) --> Call upstream API, cache response (~500ms)
```

- **Embedder**: ONNX MiniLM-L6-v2 (90MB, runs locally, no API calls)
- **Index**: hnswlib HNSW for fast approximate nearest neighbor search
- **Store**: SQLite with WAL mode for concurrent access
- **Fallback**: numpy brute-force if hnswlib is unavailable
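
The lookup step amounts to a nearest-neighbor search under cosine similarity. A brute-force version, in the spirit of the numpy fallback, can be sketched as follows; `best_match` is a hypothetical helper and the 3-dimensional vectors stand in for real 384-dimensional embeddings.

```python
import numpy as np


def best_match(query: np.ndarray, cached: np.ndarray) -> tuple[int, float]:
    """Return (row index, cosine similarity) of the cached vector closest to query."""
    q = query / np.linalg.norm(query)
    c = cached / np.linalg.norm(cached, axis=1, keepdims=True)
    sims = c @ q  # dot products of unit vectors are cosine similarities
    idx = int(np.argmax(sims))
    return idx, float(sims[idx])


cached = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]])
idx, sim = best_match(np.array([0.9, 0.1, 0.0]), cached)
```

Brute force scans every cached vector on each query, which is why an approximate index such as HNSW becomes preferable as the cache grows.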

## CLI

```bash
cacheback stats          # Show cache statistics
cacheback entries        # List cached entries
cacheback evict          # Remove expired entries
cacheback clear          # Clear all entries
cacheback lookup "query" # Test a cache lookup
```

## Custom Embedders

Register your own embedder for any modality:

```python
import numpy as np

from cacheback import SemanticCache
from cacheback.embedders import BaseEmbedder, register_embedder

class MyEmbedder(BaseEmbedder):
    dim = 256
    modality = "custom"

    def encode(self, input_data) -> np.ndarray:
        # Your embedding logic here
        ...

register_embedder("my-embedder", MyEmbedder)
cache = SemanticCache(embedder="my-embedder")
```
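
For a concrete idea of what `encode` might return, here is a toy hashed bag-of-words embedder with the same `dim`, `modality`, and `encode` shape as the class above. It deliberately does not subclass `BaseEmbedder` so that it runs without cacheback installed; in real use you would inherit from `BaseEmbedder` as shown. `HashingEmbedder` is a hypothetical name, and returning an L2-normalized `float32` vector is an assumption about what the cache expects.

```python
import hashlib

import numpy as np


class HashingEmbedder:
    """Toy embedder: hashed bag-of-words, L2-normalized."""

    dim = 256
    modality = "custom"

    def encode(self, input_data: str) -> np.ndarray:
        vec = np.zeros(self.dim, dtype=np.float32)
        for token in input_data.lower().split():
            # Use a stable hash; Python's built-in hash() varies between runs.
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            vec[int.from_bytes(digest[:4], "little") % self.dim] += 1.0
        norm = np.linalg.norm(vec)
        return vec / norm if norm > 0 else vec
```

This captures word overlap rather than meaning, so it is useful for testing the plumbing, not for production-quality semantic matching.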

Built-in embedders: `minilm` (text), `clip` (image, coming soon), `clap` (voice, coming soon).

## Comparison

| Feature | cacheback | GPTCache | LiteLLM | Redis LangCache |
|---------|-----------|----------|---------|-----------------|
| Semantic similarity | Yes | Yes | Exact only | Yes |
| Cache-Augmented Synthesis | Yes | No | No | No |
| OpenAI drop-in | Yes | Partial | Yes | No |
| Anthropic drop-in | Yes | No | Yes | No |
| Streaming support | Yes | No | No | No |
| Negative cache | Yes | No | No | No |
| Multimodal (planned) | Yes | No | No | No |
| Async | Yes | No | Yes | No |
| Zero config | Yes | No | No | No |
| Proxy mode (Docker) | Yes | No | Yes | No |
| Local (no server) | Yes | Yes | No | No |
| License | Apache 2.0 | MIT | MIT | Redis |

## License

Apache 2.0 — see [LICENSE](LICENSE).

Built by [BGML.ai](https://bgml.ai) / [Fundacja BLOOM](https://bloom.foundation).
