v0.1.1 — Now with Cache-Augmented Synthesis

Save 30–90% on AI costs with a three-tier response pipeline

Universal semantic cache for OpenAI & Anthropic.
Verbatim cache • Synthesis from cache • Upstream fallback.
SDK, proxy, or Docker — zero config.

$ pip install "cacheback-ai[openai]"
app.py
# Before
from openai import OpenAI
client = OpenAI()

# After — just change the import
from cacheback import CachedOpenAI
client = CachedOpenAI()

# Everything else stays the same
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is Python?"}],
    stream=True,  # streaming works too
)

response.cacheback_hit  # True on cache hit, <10ms

<10ms: verbatim hit
~300ms: synthesis from cache
30–90%: cost savings
0.89 quality: CAS benchmark score

OpenAI • Anthropic • Async • Streaming • Proxy • Image

Everything you need.
Nothing you don't.
One import. Full API coverage. No infrastructure to manage.

Semantic Matching

ONNX-powered MiniLM embeddings match similar queries regardless of wording. Tunable threshold, 0.92 default.
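
To make the threshold concrete, here is a minimal sketch of similarity matching using only the standard library. The toy 3-dimensional vectors and the `best_match` helper are assumptions for illustration and are not part of cacheback's API; real embeddings come from the MiniLM model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def best_match(query_vec, cached, threshold=0.92):
    """Return (key, similarity) for the closest cached vector, or None below threshold."""
    if not cached:
        return None
    key, sim = max(
        ((k, cosine(query_vec, v)) for k, v in cached.items()),
        key=lambda kv: kv[1],
    )
    return (key, sim) if sim >= threshold else None

# Toy "embeddings" (hypothetical values, not real model output).
cached = {
    "What is Python?": [0.9, 0.1, 0.0],
    "How do I bake bread?": [0.0, 0.2, 0.9],
}
hit = best_match([0.88, 0.15, 0.02], cached)   # close to the Python entry
miss = best_match([0.5, 0.5, 0.5], cached)     # not close enough to anything
```

Raising the threshold trades hit rate for precision: fewer queries match, but the ones that do are closer paraphrases.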

Streaming & Async

Buffer-and-replay streaming for cache hits. Full async/await with AsyncCachedOpenAI & AsyncCachedAnthropic. Your code doesn't change.

Cache-Augmented Synthesis

No verbatim hit? The top-5 cached Q&A pairs are synthesized into a fresh answer by a cheap LLM. ~300ms, ~$0.002. Validated at a 0.89 quality score.
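
The retrieval half of CAS can be sketched as "pick the k most similar cached pairs and pack them into a prompt". The prompt wording and the `build_synthesis_prompt` helper below are assumptions for illustration; the source does not show the prompt cacheback actually sends.

```python
def build_synthesis_prompt(query, scored_pairs, k=5):
    """Build a prompt from the k most similar cached Q&A pairs.

    scored_pairs: list of (similarity, question, answer) tuples.
    """
    top = sorted(scored_pairs, key=lambda p: p[0], reverse=True)[:k]
    context = "\n\n".join(f"Q: {q}\nA: {a}" for _, q, a in top)
    return (
        "Answer the new question using only the cached Q&A below.\n\n"
        f"{context}\n\nNew question: {query}"
    )

# Six hypothetical cached pairs; only the five most similar should be used.
pairs = [(0.80 + i / 100, f"question {i}", f"answer {i}") for i in range(6)]
prompt = build_synthesis_prompt("Explain photosynthesis", pairs)
```

The resulting string would then be sent to the inexpensive synthesis model instead of the full upstream model.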

Negative Cache

Blocklist known-bad queries. Entries auto-remove after 5 false-positive reports. No other cache has this.
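
The auto-removal rule is easy to picture as a per-entry counter. The `ToyNegativeCache` class below is a hypothetical stand-in that shows only the counting logic; it does no semantic matching and is not cacheback's implementation.

```python
class ToyNegativeCache:
    """Blocklist entries that drop out after too many false-positive reports."""

    def __init__(self, max_false_positives=5):
        self.max_false_positives = max_false_positives
        self._entries = {}  # entry_id -> {"pattern", "reason", "false_positives"}
        self._next_id = 1

    def add(self, pattern, reason):
        entry_id = self._next_id
        self._next_id += 1
        self._entries[entry_id] = {
            "pattern": pattern,
            "reason": reason,
            "false_positives": 0,
        }
        return entry_id

    def report_false_positive(self, entry_id):
        """Count a report; remove the entry at the limit. Returns True if removed."""
        entry = self._entries.get(entry_id)
        if entry is None:
            return False
        entry["false_positives"] += 1
        if entry["false_positives"] >= self.max_false_positives:
            del self._entries[entry_id]
            return True
        return False

    def __contains__(self, entry_id):
        return entry_id in self._entries

neg = ToyNegativeCache()
eid = neg.add("How to hack into systems", reason="refusal")
removed = [neg.report_false_positive(eid) for _ in range(5)]
```

The first four reports leave the entry in place; the fifth removes it, so an over-eager blocklist entry corrects itself.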

Multimodal Embedders

Text (MiniLM), image (CLIP ViT-B/32, 512-dim), voice (CLAP/Whisper) — or bring your own. Registry with lazy loading.
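
A registry with lazy loading typically stores factories and builds each embedder on first use. The sketch below shows that pattern with a trivial stand-in embedder; `EmbedderRegistry` and its methods are illustrative names, not cacheback's actual registry API.

```python
class EmbedderRegistry:
    """Map names to embedder factories; construct each embedder on first use only."""

    def __init__(self):
        self._factories = {}
        self._instances = {}

    def register(self, name, factory):
        self._factories[name] = factory

    def get(self, name):
        if name not in self._instances:
            # Raises KeyError for unknown names; the model loads here, not at register().
            self._instances[name] = self._factories[name]()
        return self._instances[name]

loads = []

def make_text_embedder():
    """Stand-in for loading a real model; records that a load happened."""
    loads.append("text")
    return lambda text: [float(len(text)), 0.0]

registry = EmbedderRegistry()
registry.register("text", make_text_embedder)
loads_before_use = list(loads)       # nothing loaded yet
embed = registry.get("text")         # first use triggers the load
embed_again = registry.get("text")   # second use reuses the instance
```

Registering a custom embedder is then a matter of passing a different factory, which keeps heavy models out of memory until a modality is actually used.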

Proxy Mode

OpenAI-compatible server. Point base_url at the proxy and leave the rest of your code unchanged. Docker or pip. SSE streaming.

Graceful Degradation

Cache fails? Passthrough. ONNX missing? Numpy fallback. Corrupt DB? Fresh start. Never blocks your app.
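
The "cache fails, pass through" behavior amounts to treating any cache error as a miss. Here is a minimal sketch of that idea; `call_with_cache` and the callables it takes are hypothetical names chosen for the example.

```python
def call_with_cache(query, cache_lookup, upstream):
    """Return (answer, served_from_cache); a failing cache degrades to an upstream call."""
    try:
        cached = cache_lookup(query)
    except Exception:
        cached = None  # cache failure is treated as a miss, never as an app error
    if cached is not None:
        return cached, True
    return upstream(query), False

def broken_lookup(query):
    """Simulates a corrupt or unreachable cache."""
    raise OSError("cache database is unreadable")

answer, from_cache = call_with_cache(
    "What is Python?",
    broken_lookup,
    lambda q: f"upstream answer to: {q}",
)
```

The caller sees a normal response either way; only the latency and cost differ.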

Built-in Observability

Hit rate, entries, embedder info — all via cache.stats. CLI tools included.

Fully Local, Zero Config

SQLite + HNSW on disk. No Redis, no cloud. Defaults: 0.92 similarity threshold, 24h TTL, LRU eviction. Works offline. Everything is configurable.
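
To show how a 24h TTL and LRU eviction interact, here is a small in-memory sketch built on `collections.OrderedDict`. It is an illustration only: cacheback stores entries in SQLite, and the `TTLLRUCache` class and its injectable `clock` parameter are assumptions made so the example can run without waiting.

```python
from collections import OrderedDict
import time

class TTLLRUCache:
    """Entries expire after `ttl` seconds; the least recently used entry is evicted when full."""

    def __init__(self, max_entries=1000, ttl=86400, clock=time.time):
        self.max_entries = max_entries
        self.ttl = ttl
        self.clock = clock
        self._data = OrderedDict()  # key -> (stored_at, value), least recently used first

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        stored_at, value = item
        if self.clock() - stored_at > self.ttl:
            del self._data[key]        # expired
            return None
        self._data.move_to_end(key)    # mark as most recently used
        return value

    def put(self, key, value):
        self._data[key] = (self.clock(), value)
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict least recently used

now = [0.0]  # fake clock so the example does not need to sleep
cache = TTLLRUCache(max_entries=2, ttl=86400, clock=lambda: now[0])
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")              # "a" becomes the most recently used entry
cache.put("c", 3)           # cache is full, so "b" is evicted
evicted_b = cache.get("b")
now[0] = 86401.0            # one second past the 24h TTL
expired_a = cache.get("a")
```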

Three-tier response pipeline
Every API call is embedded, searched against the cache, and routed to one of three response tiers; queries that match the negative cache are blocked instead. Transparent. Deterministic.

Verbatim Hit

sim ≥ 0.92: closest cached response returned verbatim

<10ms • $0.00

Synthesis

0.80 ≤ sim < 0.92: top-5 cached Q&A pairs synthesized by a cheap LLM

~300ms • ~$0.002

Upstream

sim < 0.80 — full API call, response cached for next time

~500ms • ~$0.03

Blocked

Negative cache match — known-bad query pattern rejected

0ms • $0.00
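
The routing rules above reduce to a few comparisons. The sketch below encodes the documented thresholds (0.92 and 0.80); the `choose_tier` function and its `blocked` flag are illustrative names, and checking the negative cache first is an assumption about ordering.

```python
def choose_tier(similarity, blocked=False,
                verbatim_threshold=0.92, synthesis_threshold=0.80):
    """Pick a response tier from the best cache similarity score."""
    if blocked:
        return "blocked"      # negative cache match
    if similarity >= verbatim_threshold:
        return "verbatim"     # serve the cached response as-is
    if similarity >= synthesis_threshold:
        return "synthesis"    # synthesize from top-5 cached pairs
    return "upstream"         # full API call, then cache the result

tiers = [choose_tier(s) for s in (0.95, 0.92, 0.85, 0.80, 0.42)]
blocked_tier = choose_tier(0.99, blocked=True)
```

Both boundaries are inclusive on the lower side, matching the `sim ≥` notation used for the tiers.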
from cacheback import CachedOpenAI

client = CachedOpenAI(
    similarity_threshold=0.92,  # tune for your use case
    cache_ttl=86400,              # 24 hours
)

# Sync
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
)
print(response.cacheback_hit)  # False on a cold cache, True on repeat calls

# Streaming — works transparently
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")  # content is None on the final chunk

from cacheback import CachedAnthropic

client = CachedAnthropic()

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)

print(message.cacheback_hit)  # True when served from cache
print(message.content[0].text)

# Async (await must run inside a coroutine)
import asyncio
from cacheback import AsyncCachedAnthropic

async def main():
    client = AsyncCachedAnthropic()
    response = await client.messages.create(...)
asyncio.run(main())

from cacheback import CachedOpenAI

# Enable three-tier response
client = CachedOpenAI(synthesis_mode="auto")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain photosynthesis"}],
)

response.cacheback_hit          # True if verbatim (sim ≥ 0.92)
response.cacheback_synthesized  # True if CAS (0.80 ≤ sim < 0.92)
# Both False = upstream API call

# Terminal: start the proxy
# $ docker run -e OPENAI_API_KEY=sk-... -p 8080:8080 cacheback/proxy

# Your app: just change base_url
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)
# Works with curl, LangChain, LiteLLM, anything OpenAI-compatible

from cacheback import CachedOpenAI

client = CachedOpenAI()

# Blocklist known-bad patterns
client.cache.negative.add(
    "How to hack into systems",
    reason="refusal",
    category="safety",
    severity=5,
)

# Check before calling API
result = client.cache.negative.check("How do I hack computers")
# result = {"reason": "refusal", "entry_id": 1, ...}

# False positive? Report it
client.cache.negative.report_false_positive(entry_id=1)
# Auto-removes after 5 false positives

# List & manage
entries = client.cache.negative.list(category="safety")
client.cache.negative.remove(entry_id=1)

from cacheback import SemanticCache

# Use as a general-purpose semantic cache
cache = SemanticCache(
    cache_dir="~/.cacheback/my-app",
    similarity_threshold=0.90,
)

# Cache any string query → response pair
cache.populate("What is Python?", "Python is a programming language...")

# Lookup by semantic similarity
result = cache.lookup("Tell me about Python")  # hits!

# Stats & management
print(cache.stats)   # {"hits": 1, "misses": 0, ...}
cache.evict_expired() # clean up old entries

# Context manager
with SemanticCache() as cache:
    result = cache.lookup("query")
Don't just cache. Synthesize.
When a query is similar but not identical, CAS retrieves the top-5 cached Q&A pairs and synthesizes a fresh response with a cheap LLM, at more than 10× lower cost than a full API call (~$0.002 vs. ~$0.03).
Verbatim: <10ms • $0.00
Synthesis: ~300ms • ~$0.002
Upstream: ~500ms • ~$0.03
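
How much you save depends on how traffic splits across the tiers. The sketch below computes a blended per-query cost from the prices quoted above; the 40/30/30 traffic mix is a hypothetical example, not a measured result.

```python
# Per-query prices quoted on this page.
COST = {"verbatim": 0.00, "synthesis": 0.002, "upstream": 0.03}

def blended_cost(mix):
    """mix maps tier -> fraction of traffic; fractions must sum to 1."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic fractions must sum to 1")
    return sum(COST[tier] * share for tier, share in mix.items())

def savings(mix):
    """Fraction saved relative to sending every query upstream."""
    return 1.0 - blended_cost(mix) / COST["upstream"]

# Hypothetical traffic mix: 40% verbatim, 30% synthesis, 30% upstream.
mix = {"verbatim": 0.40, "synthesis": 0.30, "upstream": 0.30}
cost_per_query = blended_cost(mix)   # 0.4*0 + 0.3*0.002 + 0.3*0.03
saved = savings(mix)
```

With that mix the blended cost is about $0.0096 per query against $0.03 upstream, a saving of roughly 68%.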
cas_example.py
from cacheback import CachedOpenAI

client = CachedOpenAI(
    synthesis_mode="auto",  # enable three-tier response
    # Uses Gemini Flash Lite via OpenRouter (~$0.002/synthesis)
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain photosynthesis"}],
)

if response.cacheback_hit:
    print("Verbatim hit — <10ms, $0.00")
elif response.cacheback_synthesized:
    print("Synthesized from cache — ~300ms, ~$0.002")
else:
    print("Upstream API — ~500ms, ~$0.03")

Validated with a 100-question benchmark across 5 domains: 0.892 mean quality ratio vs. direct API responses.

Zero code change. Just change the URL.
Run cacheback as a standalone proxy. Any OpenAI-compatible client works — curl, LangChain, LiteLLM, your existing app.
terminal
# Docker (recommended)
$ docker run -e OPENAI_API_KEY=sk-... \
    -p 8080:8080 cacheback/proxy

# Or pip
$ pip install "cacheback-ai[proxy]"
$ cacheback-proxy
your_app.py
from openai import OpenAI

# Just change base_url — that's it
client = OpenAI(
    base_url="http://localhost:8080/v1"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)
# Cache headers in response:
# X-Cacheback-Hit: true
# X-Cacheback-Synthesized: false
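
If you want to log which tier served a proxied request, the two headers can be mapped to a tier name. This is a small sketch: the `tier_from_headers` helper is hypothetical, and treating "both false or absent" as an upstream call is an assumption based on how the SDK flags behave.

```python
def tier_from_headers(headers):
    """Map the proxy's cache headers to a tier name (header names are case-insensitive)."""
    lowered = {k.lower(): str(v).strip().lower() for k, v in headers.items()}
    if lowered.get("x-cacheback-hit") == "true":
        return "verbatim"
    if lowered.get("x-cacheback-synthesized") == "true":
        return "synthesis"
    return "upstream"

tier = tier_from_headers({
    "X-Cacheback-Hit": "true",
    "X-Cacheback-Synthesized": "false",
})
```

With the official OpenAI Python SDK, response headers are available through `client.chat.completions.with_raw_response.create(...)`, whose result exposes `.headers` and `.parse()`.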

Environment Variables

OPENAI_API_KEY — upstream provider key
CACHEBACK_SIMILARITY_THRESHOLD — 0.92
CACHEBACK_SYNTHESIS_MODE — off / auto / always
CACHEBACK_TTL — 86400 (24h)
CACHEBACK_PORT — 8080
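
For reference, here is one way such variables could be read and validated at startup. It is a sketch, not the proxy's actual code: the `load_proxy_config` function is hypothetical, and the `off` default for synthesis mode is an assumption because the list above does not state one.

```python
import os

def load_proxy_config(env=None):
    """Read proxy settings from environment variables, falling back to documented defaults."""
    env = os.environ if env is None else env
    mode = env.get("CACHEBACK_SYNTHESIS_MODE", "off")  # default is assumed, not documented
    if mode not in ("off", "auto", "always"):
        raise ValueError(f"invalid CACHEBACK_SYNTHESIS_MODE: {mode!r}")
    return {
        "similarity_threshold": float(env.get("CACHEBACK_SIMILARITY_THRESHOLD", "0.92")),
        "synthesis_mode": mode,
        "ttl": int(env.get("CACHEBACK_TTL", "86400")),
        "port": int(env.get("CACHEBACK_PORT", "8080")),
    }

defaults = load_proxy_config({})
custom = load_proxy_config({
    "CACHEBACK_PORT": "9000",
    "CACHEBACK_SYNTHESIS_MODE": "auto",
})
```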

Endpoints

POST /v1/chat/completions
GET  /v1/cache/stats
GET  /health

Features

SSE streaming with buffer-and-cache
Cache hit/synthesis response headers
Three-tier response (verbatim/CAS/upstream)
Works with any OpenAI-compatible client
The only cache that does it all
We built what was missing.
Feature | cacheback | GPTCache | LiteLLM | Redis LangCache
Semantic similarity
Cache-Augmented Synthesis
OpenAI drop-in (Partial)
Anthropic drop-in
Streaming support
Proxy mode (Docker)
Negative cache
Multimodal (image)
Async support
Zero config / local
License | Apache 2.0 | MIT | MIT | MIT

Ship faster. Spend less.

SDK, proxy, or Docker. Three-tier response catches what a binary hit-or-miss cache misses. Savings on every query served from cache.

$ pip install "cacheback-ai[all]"