LLM cost infrastructure

Stop paying for the same prompt twice.

A local proxy for LLM APIs. Exact-match and semantic deduplication intercept redundant calls before they cost you a token. One env var. No gateway.

View on GitHub

Open source. Runs on your machine.

setup.shshell
1
2
3
4
5
6
7
8
9
pip install "inferencache[embed,serve]"
inferencache serve
 
# Claude Code / Cursor
export ANTHROPIC_BASE_URL=http://localhost:8080
 
# repeat call → cache hit 0ms $0.00
# near-match → semantic hit 4ms $0.00
# new prompt → api call 820ms $0.0031

01how it works

Two-tier cache, one check

Every call runs an exact match in SQLite, then a semantic search in Qdrant. Hit either layer and the response returns immediately — no network, no tokens, no cost. Miss both and the real call goes through and writes back.

02integration

Proxy-first, nothing to rearchitect

Point Cursor or Claude Code at localhost:8080 with one env var. No decorator, no gateway, no changed call signatures — your existing SDK calls just get cheaper.

03vs alternatives

GPTCache is abandoned. Gateways are overhead.

Every other caching layer either stopped shipping or routes traffic through infrastructure you don't own. inferencache is just a library — pip install inferencache. Your prompts stay on your machine.

04MCP server

Your editor can see the cache too

A read-only MCP server ships with the library. Cursor and Claude Code can inspect hit rates, cost savings, and cache state without leaving the editor.