Semantic cache layer for LLM apps. Exact-match in microseconds. Near-match in milliseconds. PII scrubbed before anything touches your database.
Every query cascades through three layers before ever hitting your LLM provider. Most never make it past the second.
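In sketch form, the lookup path looks like this. It's a minimal, runnable illustration with in-memory stand-ins; the names and helpers are hypothetical, not ThriftLM's internals, and tiers two and three are collapsed into one loop:

```python
import hashlib

import numpy as np

# Hypothetical in-memory stand-ins for the three tiers; production uses
# Redis, an in-process numpy index, and Supabase pgvector.
exact_store: dict[str, str] = {}
entries: list[tuple[np.ndarray, str]] = []

def embed(text: str) -> np.ndarray:
    """Toy embedding so the sketch runs; the real cache uses all-MiniLM-L6-v2."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

def lookup(query: str, threshold: float = 0.82) -> str | None:
    # Tier 1: exact match on a hash of the query (microseconds in Redis)
    key = hashlib.sha256(query.encode()).hexdigest()
    if key in exact_store:
        return exact_store[key]
    # Tiers 2-3: cosine similarity against stored embeddings. The local numpy
    # index serves hot entries; pgvector's HNSW index covers the full table.
    q = embed(query)
    for vec, response in entries:
        if float(q @ vec) >= threshold:
            return response
    return None  # full miss: call the LLM, then store the result
```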
Your existing LLM function stays exactly as-is. One wrapper replaces every direct call.
```bash
pip install thriftlm

# For the local dashboard + API server:
pip install "thriftlm[api]"
```
```python
from thriftlm import SemanticCache
import openai

# Initialize once per process: bulk-loads embeddings into a local numpy index
cache = SemanticCache(threshold=0.82, api_key="your-key")

def call_llm(query: str) -> str:
    res = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": query}],
    )
    return res.choices[0].message.content

# Drop-in: handles the cache check + LLM fallback automatically
response = cache.get_or_call("Explain semantic caching", call_llm)

# Near-duplicate query → instant hit, no LLM called
response2 = cache.get_or_call("What is semantic caching?", call_llm)
```
```
SUPABASE_URL=https://your-project.supabase.co
SUPABASE_KEY=your-anon-key
REDIS_URL=redis://localhost:6379
OPENAI_API_KEY=sk-...
```
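If you keep these in a `.env` file, python-dotenv can load them before the cache is constructed. One assumption here: ThriftLM picks up the Supabase and Redis settings from the environment, since only `threshold` and `api_key` appear in the quickstart above.

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

from thriftlm import SemanticCache

# Loads SUPABASE_URL, SUPABASE_KEY, REDIS_URL, OPENAI_API_KEY into os.environ
load_dotenv()

# Assumption: the Supabase/Redis settings are read from the environment;
# the quickstart only documents threshold and api_key.
cache = SemanticCache(threshold=0.82, api_key=os.environ["OPENAI_API_KEY"])
```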
Create a project at supabase.com and run `supabase/setup.sql` to create the HNSW-indexed cache table.
Run `docker compose up -d` for local Redis, or point `REDIS_URL` at Upstash for a managed option.
Replace your direct LLM call with `cache.get_or_call()`. That is the entire integration.
Run `thriftlm serve --api-key your-key`. Opens a live dashboard at localhost:8000; bundled in the package, nothing extra to deploy.
```bash
# See live hit rate, tokens saved, and top cached queries
thriftlm serve --api-key your-key
# → ThriftLM dashboard → http://localhost:8000
# → Opens in browser automatically
```
Open-source, self-hostable, no vendor lock-in. Your data stays on your infrastructure.
Benchmarked on 200 duplicate question pairs from the Quora Question Pairs dataset: `question1` stored, `question2` used for lookup. The threshold controls the precision/recall tradeoff.
| Threshold | Hit Rate | Hits / 200 |
|---|---|---|
| 0.70 | 92.5% | 185 |
| 0.75 | 86.0% | 172 |
| 0.80 | 78.0% | 156 |
| 0.82 (default) | 73.5% | 147 |
| 0.85 | 62.5% | 125 |
| 0.90 | 40.0% | 80 |
all-MiniLM-L6-v2 · mean similarity 0.859 · HNSW index (Supabase pgvector)
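A minimal sketch of reproducing a sweep like this with the `sentence-transformers` package. It scores each stored/lookup pair directly; a real lookup searches all stored entries via the HNSW index, so treat this as an approximation:

```python
from sentence_transformers import SentenceTransformer

# The real benchmark uses 200 Quora duplicate pairs; two shown for illustration
pairs = [
    ("What is semantic caching?", "Explain semantic caching"),
    ("How do I reverse a list in Python?", "Reversing a Python list, how?"),
]

model = SentenceTransformer("all-MiniLM-L6-v2")
stored = model.encode([q1 for q1, _ in pairs], normalize_embeddings=True)
lookups = model.encode([q2 for _, q2 in pairs], normalize_embeddings=True)

# With normalized embeddings, the row-wise dot product is cosine similarity
sims = (lookups * stored).sum(axis=1)

for threshold in (0.70, 0.75, 0.80, 0.82, 0.85, 0.90):
    hits = int((sims >= threshold).sum())
    print(f"{threshold:.2f}: {hits}/{len(pairs)} hits ({100 * hits / len(pairs):.1f}%)")
```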
V1 is the cache primitive. V2 is where it gets interesting.
Redis → LocalIndex → HNSW three-tier pipeline. Presidio PII scrubbing. Multi-tenant FastAPI. `thriftlm serve` bundled dashboard CLI. `pip install thriftlm`.
Cache entire multi-step agent plans, not just responses. When intent repeats in a loop, skip re-planning entirely. Built for Claude Code SDK and long-running agent workflows.
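Since this is a roadmap item, the sketch below is purely hypothetical: none of the names are ThriftLM API, and a plain dict stands in for the semantic lookup. The idea is to key the cache on intent and store the whole plan, so a recurring intent skips straight to execution:

```python
from typing import Callable

# Hypothetical: exact-match dict standing in for the semantic plan cache
plan_cache: dict[str, list[str]] = {}

def run_task(intent: str,
             plan_fn: Callable[[str], list[str]],
             execute_fn: Callable[[str], str]) -> list[str]:
    plan = plan_cache.get(intent)     # V2: semantic near-match, as with V1 responses
    if plan is None:
        plan = plan_fn(intent)        # expensive LLM planning call
        plan_cache[intent] = plan     # cache the whole plan, not a single response
    return [execute_fn(step) for step in plan]  # steps still execute fresh
```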
Publish as an OpenClaw skill: drop-in for any Claude Code loop via a single slash command.
Per-model token pricing table. Real-time dollar savings tracked on the dashboard.
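A sketch of the accounting; the per-token prices below are illustrative placeholders, not a maintained table:

```python
# Placeholder per-1M-token prices; check your provider's current pricing.
# Savings = tokens the cache kept away from the API.
PRICE_PER_M_INPUT = {"gpt-4o-mini": 0.15}   # USD per 1M input tokens (example)
PRICE_PER_M_OUTPUT = {"gpt-4o-mini": 0.60}  # USD per 1M output tokens (example)

def dollars_saved(model: str, input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1e6) * PRICE_PER_M_INPUT[model] + \
           (output_tokens / 1e6) * PRICE_PER_M_OUTPUT[model]

# e.g. 147 cache hits averaging 40 input + 250 output tokens each
print(f"${dollars_saved('gpt-4o-mini', 147 * 40, 147 * 250):.4f} saved")
```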
Three lines of Python. Open-source. Your data never leaves your own infrastructure.