Universal semantic cache for OpenAI & Anthropic.
Verbatim cache • Synthesis from cache • Upstream fallback.
SDK, proxy, or Docker — zero config.
# Before
from openai import OpenAI
client = OpenAI()
# After — just change the import
from cacheback import CachedOpenAI
client = CachedOpenAI()
# Everything else stays the same
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is Python?"}],
    stream=True,  # streaming works too
)
response.cacheback_hit  # True on cache hit, <10ms
ONNX-powered MiniLM embeddings match similar queries regardless of wording. Tunable threshold, 0.92 default.
Buffer-and-replay streaming for cache hits. Full async/await with AsyncCachedOpenAI & AsyncCachedAnthropic. Your code doesn't change.
No verbatim match? The top-5 cached Q&A pairs are synthesized into an answer by a cheap LLM. ~300ms, ~$0.002 per synthesis. Validated at a 0.892 mean quality ratio versus direct API responses.
Blocklist known-bad queries. An entry is removed automatically after 5 false-positive reports. None of the other caches in the comparison table has this.
Text (MiniLM), image (CLIP ViT-B/32, 512-dim), voice (CLAP/Whisper) — or bring your own. Registry with lazy loading.
OpenAI-compatible server. Point base_url at the proxy; no other code changes. Docker or pip. SSE streaming.
Cache fails? Passthrough. ONNX missing? Numpy fallback. Corrupt DB? Fresh start. Never blocks your app.
Hit rate, entries, embedder info — all via cache.stats. CLI tools included.
SQLite + HNSW on disk. No Redis, no cloud. Defaults: 0.92 similarity threshold, 24h TTL, LRU eviction. Works offline. Every default is configurable.
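The "Cache fails? Passthrough." behavior described above can be pictured as a fail-open wrapper. The following is a minimal sketch, not cacheback's actual implementation: `fail_open`, `cache_lookup`, and `upstream_call` are hypothetical names, and the only idea taken from the text is that a cache error is treated as a miss.

```python
from typing import Callable, Optional, Tuple


def fail_open(
    query: str,
    cache_lookup: Callable[[str], Optional[str]],
    upstream_call: Callable[[str], str],
) -> Tuple[str, bool]:
    """Return (response, served_from_cache); cache errors count as misses."""
    try:
        cached = cache_lookup(query)
    except Exception:
        cached = None  # a broken cache must never block the request
    if cached is not None:
        return cached, True
    return upstream_call(query), False


def broken_cache(query: str) -> Optional[str]:
    # Stand-in for a corrupt or unavailable cache backend.
    raise RuntimeError("corrupt DB")


print(fail_open("What is Python?", broken_cache, lambda q: f"upstream answer to {q!r}"))
```

The second element of the returned tuple plays the role that `cacheback_hit` plays in the SDK examples: it is `False` whenever the answer came from upstream.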
sim ≥ 0.92: verbatim response served from cache
0.80 ≤ sim < 0.92: answer synthesized from the top-5 cached Q&A pairs by a cheap LLM
sim < 0.80: full API call, response cached for next time
Negative cache match: known-bad query pattern rejected
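The routing rules above can be sketched as a small pure function. This is an illustration only: `Tier` and `route` are hypothetical names, and checking the negative cache before the similarity tiers is an assumption. Only the 0.92 and 0.80 thresholds come from the text.

```python
from enum import Enum


class Tier(Enum):
    REJECTED = "rejected"    # negative cache match
    VERBATIM = "verbatim"    # sim >= 0.92
    SYNTHESIS = "synthesis"  # 0.80 <= sim < 0.92
    UPSTREAM = "upstream"    # sim < 0.80


def route(
    best_similarity: float,
    negative_match: bool = False,
    verbatim_threshold: float = 0.92,
    synthesis_threshold: float = 0.80,
) -> Tier:
    """Pick a response tier from the best cache similarity (hypothetical helper)."""
    if negative_match:  # assumption: the blocklist is consulted first
        return Tier.REJECTED
    if best_similarity >= verbatim_threshold:
        return Tier.VERBATIM
    if best_similarity >= synthesis_threshold:
        return Tier.SYNTHESIS
    return Tier.UPSTREAM


for sim in (0.95, 0.85, 0.50):
    print(sim, route(sim).value)
```

Raising `verbatim_threshold` trades hit rate for precision: fewer queries are answered verbatim, and more fall through to synthesis or upstream.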
from cacheback import CachedOpenAI
client = CachedOpenAI(
    similarity_threshold=0.92,  # tune for your use case
    cache_ttl=86400,  # 24 hours
)
# Sync
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
)
print(response.cacheback_hit) # True on 2nd call
# Streaming — works transparently
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True,
)
for chunk in stream:
    # delta.content can be None (e.g. on the final chunk), so fall back to ""
    print(chunk.choices[0].delta.content or "", end="")
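For readers curious how a cached answer can be streamed at all, here is a minimal sketch of the buffer-and-replay idea: a complete cached string is re-emitted in small pieces through a generator. `replay_cached` and `chunk_size` are hypothetical names for illustration; cacheback's actual chunk format is not described here.

```python
from typing import Iterator


def replay_cached(text: str, chunk_size: int = 16) -> Iterator[str]:
    """Yield a fully buffered cached response in fixed-size pieces."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(text), chunk_size):
        yield text[start:start + chunk_size]


cached = "Quantum computers use qubits instead of bits."
for piece in replay_cached(cached, chunk_size=10):
    print(piece, end="")
print()
```

Because the consumer only sees an iterator of pieces, the same `for chunk in stream` loop works whether the text comes from the cache or from a live upstream stream.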
from cacheback import CachedAnthropic
client = CachedAnthropic()
message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(message.cacheback_hit) # True when served from cache
print(message.content[0].text)
# Async
import asyncio
from cacheback import AsyncCachedAnthropic
async def main():
    client = AsyncCachedAnthropic()
    response = await client.messages.create(...)
asyncio.run(main())
from cacheback import CachedOpenAI
# Enable three-tier response
client = CachedOpenAI(synthesis_mode="auto")
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain photosynthesis"}],
)
response.cacheback_hit # True if verbatim (sim ≥ 0.92)
response.cacheback_synthesized  # True if Cache-Augmented Synthesis (0.80 ≤ sim < 0.92)
# Both False = upstream API call
# Terminal: start the proxy
# $ docker run -e OPENAI_API_KEY=sk-... -p 8080:8080 cacheback/proxy
# Your app: just change base_url
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1")
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)
# Works with curl, LangChain, LiteLLM, anything OpenAI-compatible
# Blocklist known-bad patterns
client.cache.negative.add(
    "How to hack into systems",
    reason="refusal",
    category="safety",
    severity=5,
)
# Check before calling API
result = client.cache.negative.check("How do I hack computers")
# result = {"reason": "refusal", "entry_id": 1, ...}
# False positive? Report it
client.cache.negative.report_false_positive(entry_id=1)
# Auto-removes after 5 false positives
# List & manage
entries = client.cache.negative.list(category="safety")
client.cache.negative.remove(entry_id=1)
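The auto-removal rule mentioned above can be sketched with a small in-memory blocklist. This is a toy model, not cacheback's storage layer: `NegativeList` and `NegativeEntry` are hypothetical names, and removing the entry on the fifth report (rather than on the sixth) is an assumption about what "after 5 false positives" means.

```python
from dataclasses import dataclass


@dataclass
class NegativeEntry:
    pattern: str
    reason: str
    false_positives: int = 0


class NegativeList:
    """Toy blocklist: entries are dropped after too many false-positive reports."""

    def __init__(self, max_false_positives: int = 5):
        self.max_false_positives = max_false_positives
        self._entries = {}  # entry_id -> NegativeEntry
        self._next_id = 1

    def add(self, pattern: str, reason: str) -> int:
        entry_id = self._next_id
        self._entries[entry_id] = NegativeEntry(pattern, reason)
        self._next_id += 1
        return entry_id

    def report_false_positive(self, entry_id: int) -> bool:
        """Record a report; return True if the entry was removed as a result."""
        entry = self._entries.get(entry_id)
        if entry is None:
            return False
        entry.false_positives += 1
        if entry.false_positives >= self.max_false_positives:
            del self._entries[entry_id]
            return True
        return False

    def __contains__(self, entry_id: int) -> bool:
        return entry_id in self._entries


blocklist = NegativeList()
entry_id = blocklist.add("How to hack into systems", reason="refusal")
removed = [blocklist.report_false_positive(entry_id) for _ in range(5)]
print(removed, entry_id in blocklist)
```

Counting reports per entry keeps a single mistaken report from deleting a useful rule, while a rule that repeatedly blocks legitimate queries eventually retires itself.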
from cacheback import SemanticCache
# Use as a general-purpose semantic cache
cache = SemanticCache(
    cache_dir="~/.cacheback/my-app",
    similarity_threshold=0.90,
)
# Cache any string query → response pair
cache.populate("What is Python?", "Python is a programming language...")
# Lookup by semantic similarity
result = cache.lookup("Tell me about Python") # hits!
# Stats & management
print(cache.stats) # {"hits": 1, "misses": 0, ...}
cache.evict_expired() # clean up old entries
# Context manager
with SemanticCache() as cache:
    result = cache.lookup("query")
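To make "lookup by semantic similarity" concrete, here is a minimal sketch of a thresholded nearest-match search over embedding vectors. It is deliberately simplified: it uses a linear scan with cosine similarity instead of an HNSW index, the three-dimensional vectors are made-up stand-ins for real MiniLM embeddings, and `cosine` and `lookup` are hypothetical helpers, not the `SemanticCache` API.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def lookup(query_vec, entries, threshold=0.90):
    """Return (response, similarity) for the best entry at or above threshold, else None."""
    best = None
    for vec, response in entries:
        sim = cosine(query_vec, vec)
        if sim >= threshold and (best is None or sim > best[1]):
            best = (response, sim)
    return best


# Made-up vectors standing in for embeddings of two cached questions.
entries = [
    ([1.0, 0.0, 0.0], "Python is a programming language..."),
    ([0.0, 1.0, 0.0], "Paris is the capital of France."),
]
hit = lookup([0.9, 0.1, 0.0], entries)   # close to the first entry
miss = lookup([0.5, 0.5, 0.7], entries)  # not close enough to either
print(hit, miss)
```

An approximate index such as HNSW answers the same question without scanning every entry, which matters once the cache holds many thousands of vectors.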
from cacheback import CachedOpenAI
client = CachedOpenAI(
    synthesis_mode="auto",  # enable three-tier response
    # Uses Gemini Flash Lite via OpenRouter (~$0.002/synthesis)
)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain photosynthesis"}],
)
if response.cacheback_hit:
    print("Verbatim hit: <10ms, $0.00")
elif response.cacheback_synthesized:
    print("Synthesized from cache: ~300ms, ~$0.002")
else:
    print("Upstream API: ~500ms, ~$0.03")
Validated on a 100-question benchmark across 5 domains: 0.892 mean quality ratio versus direct API responses.
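The per-tier figures quoted above ($0.00 verbatim, ~$0.002 synthesized, ~$0.03 upstream) can be combined into a blended cost per query. The sketch below does that arithmetic; `expected_cost_per_query` is a hypothetical helper, and the 30% / 20% traffic mix is an invented example, not a measured hit rate.

```python
def expected_cost_per_query(
    p_verbatim: float,
    p_synthesized: float,
    cost_synthesized: float = 0.002,  # ~$0.002 per synthesis (from the text)
    cost_upstream: float = 0.03,      # ~$0.03 per upstream call (from the text)
) -> float:
    """Blend the per-tier costs by how often each tier serves a query."""
    if min(p_verbatim, p_synthesized) < 0 or p_verbatim + p_synthesized > 1:
        raise ValueError("tier shares must be non-negative and sum to at most 1")
    p_upstream = 1.0 - p_verbatim - p_synthesized
    # Verbatim hits cost $0.00, so they contribute nothing to the sum.
    return p_synthesized * cost_synthesized + p_upstream * cost_upstream


# Hypothetical traffic mix, not a measured result: 30% verbatim, 20% synthesized.
blended = expected_cost_per_query(0.30, 0.20)
print(f"${blended:.4f} per query, versus $0.0300 with no cache")
```

With that assumed mix, the upstream share dominates the bill, so raising the combined cache-served share matters far more than the small per-synthesis cost.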
# Docker (recommended)
$ docker run -e OPENAI_API_KEY=sk-... \
    -p 8080:8080 cacheback/proxy
# Or pip (quotes keep shells like zsh from globbing the brackets)
$ pip install "cacheback-ai[proxy]"
$ cacheback-proxy
from openai import OpenAI
# Just change base_url — that's it
client = OpenAI(
    base_url="http://localhost:8080/v1"
)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)
# Cache headers in response:
# X-Cacheback-Hit: true
# X-Cacheback-Synthesized: false
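A client can turn the two headers listed above into a single status label. The helper below is a sketch under assumptions: `cache_status` is a hypothetical name, and it assumes the proxy sends the literal strings `true` and `false`. With the OpenAI Python SDK, response headers are reachable through `client.chat.completions.with_raw_response.create(...)`, whose result exposes `.headers`; here a plain mapping stands in for them so the example runs without a proxy.

```python
from typing import Mapping


def cache_status(headers: Mapping[str, str]) -> str:
    """Classify a proxy response from its X-Cacheback-* headers."""
    # HTTP header names are case-insensitive, so normalize before comparing.
    lowered = {key.lower(): value.strip().lower() for key, value in headers.items()}
    if lowered.get("x-cacheback-hit") == "true":
        return "verbatim"
    if lowered.get("x-cacheback-synthesized") == "true":
        return "synthesized"
    return "upstream"


example_headers = {"X-Cacheback-Hit": "true", "X-Cacheback-Synthesized": "false"}
print(cache_status(example_headers))
```

Treating a missing header as "upstream" keeps the helper safe to use against a plain OpenAI endpoint that sends neither header.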
| Feature | cacheback | GPTCache | LiteLLM | Redis LangCache |
|---|---|---|---|---|
| Semantic similarity | ✓ | ✓ | — | ✓ |
| Cache-Augmented Synthesis | ✓ | — | — | — |
| OpenAI drop-in | ✓ | Partial | ✓ | — |
| Anthropic drop-in | ✓ | — | ✓ | — |
| Streaming support | ✓ | — | — | — |
| Proxy mode (Docker) | ✓ | — | ✓ | — |
| Negative cache | ✓ | — | — | — |
| Multimodal (image) | ✓ | — | — | — |
| Async support | ✓ | — | ✓ | — |
| Zero config / local | ✓ | ✓ | — | — |
| License | Apache 2.0 | MIT | MIT | MIT |
SDK, proxy, or Docker. The three-tier response catches what a binary hit-or-miss cache misses. Savings apply to every query served verbatim or synthesized from cache.