LLM cost infrastructure
Stop paying for the same prompt twice.
A local proxy for LLM APIs. Exact-match and semantic deduplication intercept redundant calls before they cost you a token. One env var. No gateway.
Open source. Runs on your machine.
01 how it works
Every call runs an exact match in SQLite, then a semantic search in Qdrant. Hit either layer and the response returns immediately: no network, no tokens, no cost. Miss both and the real call goes through, and its response is written back to the cache for next time.
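The lookup order described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not inferencache's actual code: the exact layer uses the standard-library sqlite3 module, while the semantic layer swaps Qdrant and a real embedding model for an in-memory list and a toy bag-of-words cosine similarity. The names TwoLayerCache, embed, cosine, and SIMILARITY_THRESHOLD are all hypothetical.

```python
import hashlib
import math
import sqlite3

SIMILARITY_THRESHOLD = 0.9  # hypothetical cutoff for a semantic hit


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(value * b.get(word, 0) for word, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TwoLayerCache:
    def __init__(self, call_model):
        self.call_model = call_model  # the real (paid) LLM call
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE cache (key TEXT PRIMARY KEY, response TEXT)"
        )
        self.vectors = []  # (embedding, response); stand-in for a vector store

    def complete(self, prompt):
        # Layer 1: exact match on a hash of the prompt.
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        row = self.db.execute(
            "SELECT response FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return row[0], "exact"

        # Layer 2: nearest stored prompt by cosine similarity.
        query = embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.vectors:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        if best_response is not None and best_score >= SIMILARITY_THRESHOLD:
            return best_response, "semantic"

        # Miss on both layers: make the real call and write back.
        response = self.call_model(prompt)
        self.db.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?)", (key, response)
        )
        self.db.commit()
        self.vectors.append((query, response))
        return response, "miss"
```

In this sketch, complete returns the response together with the layer that served it, so a caller can count hits per layer. The threshold is the main design knob: set it too low and unrelated prompts share answers, set it too high and the semantic layer rarely fires.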
02 integration
Point Cursor or Claude Code at localhost:8080 with one env var. No decorator, no gateway, no changed call signatures. Your existing SDK calls just get cheaper.
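As a rough illustration of the one-variable setup, the Python sketch below builds an environment in which a client's base URL points at the local proxy. The text does not name the variable, so the choice here is an assumption: OPENAI_BASE_URL and ANTHROPIC_BASE_URL are the base-URL overrides read by the official OpenAI and Anthropic SDKs, and the helper proxied_env is hypothetical.

```python
import os

PROXY_URL = "http://localhost:8080"  # proxy address given in the text


def proxied_env(base_env, var_name="OPENAI_BASE_URL"):
    """Return a copy of base_env with the SDK base URL aimed at the proxy.

    var_name is an assumption: use whichever base-URL variable your
    client reads (for example ANTHROPIC_BASE_URL for Anthropic clients).
    """
    env = dict(base_env)
    env[var_name] = PROXY_URL
    return env


# Build an environment for a child process without mutating our own.
child_env = proxied_env(os.environ)
```

Passing child_env to subprocess.run(..., env=child_env) would route only that process through the proxy and leave the rest of the shell untouched.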
03 vs alternatives
Every other caching layer either stopped shipping or routes traffic through infrastructure you don't own. inferencache is just a library — pip install inferencache. Your prompts stay on your machine.
04 MCP server
A read-only MCP server ships with the library. Cursor and Claude Code can inspect hit rates, cost savings, and cache state without leaving the editor.