Reduce Your LLM API Costs by 70–95%

Last updated: June 2026  ·  Apache-2.0  ·  Local-first  ·  No code changes

If you build with Claude, GPT-4, or Gemini, you're almost certainly paying 5–10× more than you need to. The problem isn't the model pricing — it's what you're sending.

70–95% fewer input tokens sent
$0 hallucination guard cost
30s setup, no code changes

Entroly is an open-source local proxy and context engine that cuts your Claude, OpenAI, and Gemini API bills by 70–95% through three mechanisms: smart context compression, provider cache alignment, and query-relevant file selection. It runs entirely on your machine: the only place your code goes is the LLM provider your tool was already calling.

Try it in 30 seconds — no API key needed:

pip install entroly && entroly verify-claims

Measures token reduction on your own repo and writes a JSON report. No outbound calls.

Why Your LLM Bills Are Higher Than They Should Be

1. You're Sending the Wrong Files

When a coding agent reads a repo, it often sends every file it knows about, not just the files relevant to your specific query. In a 400-file codebase, a "fix the login bug" request sends auth.py, sure, but also 200 unrelated files. You pay for every token of every file.

2. Provider Cache Discounts Aren't Applying

Anthropic offers a 90% discount on cached token prefixes. OpenAI offers 50%. But this only works if the prefix bytes are identical across requests. Most tools dynamically rebuild context on every turn — adding timestamps, session IDs, updated file contents — which breaks the prefix and busts the cache. You pay full price every time.
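To see why a single timestamp is enough to bust the cache, here is a minimal sketch that compares how much of two consecutive prompts is identical from the first byte. It is an illustration only: providers match on tokens, not characters, and the prompt layout and the helper name shared_prefix_len are assumptions made for this example.

```python
from os.path import commonprefix


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    return len(commonprefix([a, b]))


# Hypothetical file context that is identical on both turns.
files = "### auth.py\ndef login(user, password): ...\n"

# Volatile data first: the prompts diverge inside the timestamp,
# so none of the file context is part of the shared prefix.
unstable_1 = "time=2026-06-01T10:00:00Z\n" + files + "Q: fix the login bug"
unstable_2 = "time=2026-06-01T10:00:07Z\n" + files + "Q: add a test"

# Stable content first, volatile data last: the whole file context
# sits inside the shared prefix and stays eligible for caching.
stable_1 = files + "time=2026-06-01T10:00:00Z\nQ: fix the login bug"
stable_2 = files + "time=2026-06-01T10:00:07Z\nQ: add a test"

unstable_shared = shared_prefix_len(unstable_1, unstable_2)
stable_shared = shared_prefix_len(stable_1, stable_2)
print(unstable_shared, stable_shared, len(files))
```

The content is the same in both layouts. Only the ordering decides whether the expensive part of the prompt falls inside the reusable prefix.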

3. Compressed Context Still Contains Junk

Even if you're compressing, most compression tools shrink whatever you give them. If you gave them the wrong 200 files, you're just sending smaller versions of the wrong files. You need to select first, then compress.

How Entroly Cuts Your LLM API Bills

Context Compression (Knapsack Optimizer)

Entroly's core is an information-theoretic context selector. It ranks every file and fragment in your repo by: BM25 lexical relevance to the query, Shannon entropy density (information per token), SimHash deduplication, and dependency graph proximity. Then a knapsack solver picks the mathematically optimal subset under your token budget.
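The selection step can be pictured as a classic 0/1 knapsack. The sketch below is not Entroly's implementation: the Fragment class, the single combined score, and the example numbers are all assumptions, and the real ranking signals (BM25, entropy density, SimHash, graph proximity) are taken as already folded into that score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    path: str
    tokens: int   # cost of including this fragment in the prompt
    score: float  # hypothetical combined relevance score


def select_fragments(fragments, budget):
    """0/1 knapsack: maximize total score with total tokens <= budget."""
    # best[b] = (best score, chosen indices) using at most b tokens
    best = [(0.0, ())] * (budget + 1)
    for i, frag in enumerate(fragments):
        # Walk budgets downward so each fragment is picked at most once.
        for b in range(budget, frag.tokens - 1, -1):
            candidate = best[b - frag.tokens][0] + frag.score
            if candidate > best[b][0]:
                best[b] = (candidate, best[b - frag.tokens][1] + (i,))
    return [fragments[i] for i in best[budget][1]]


# Made-up fragments for illustration.
repo = [
    Fragment("auth.py", 600, 9.0),
    Fragment("session.py", 500, 5.0),
    Fragment("utils.py", 400, 4.5),
    Fragment("README.md", 300, 1.0),
]
chosen = select_fragments(repo, budget=1000)
print([f.path for f in chosen])
```

The dynamic-programming table costs time proportional to the number of fragments times the budget, so a production solver over a large repo would likely bucket token counts or use an approximation. That part is left out here.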

Critical files go in at full resolution. Supporting files become function signatures. Everything else becomes a reference path the model can ask to expand. Result: the model sees broader coverage of your codebase in a smaller prompt.
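For a feel of what "function signatures" resolution means, here is a rough sketch for Python sources using the standard ast module. The skeleton function and its output format are assumptions for illustration. Entroly's actual skeleton format and language coverage are not described here.

```python
import ast


def skeleton(source: str) -> str:
    """Return one signature line per function or class, in source order."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind = "async def" if isinstance(node, ast.AsyncFunctionDef) else "def"
            args = ast.unparse(node.args)  # requires Python 3.9+
            found.append((node.lineno, f"{kind} {node.name}({args}): ..."))
        elif isinstance(node, ast.ClassDef):
            found.append((node.lineno, f"class {node.name}: ..."))
    return "\n".join(text for _, text in sorted(found))


SOURCE = '''
def login(user, password):
    token = check(user, password)
    return token

class Session:
    def refresh(self):
        return None
'''

print(skeleton(SOURCE))
```

The bodies are dropped and only the callable surface remains, which is usually enough for the model to know what exists and decide whether it needs the full file.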

Cache Aligner (Prefix Stability)

Entroly's Cache Aligner tracks the injected context prefix per client. When a new request arrives and the context hasn't materially changed (Jaccard similarity > 90%), it reuses the previous prefix verbatim — byte-for-byte. This means Anthropic's 90% cache discount and OpenAI's 50% discount actually apply consistently.
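The reuse rule can be sketched in a few lines. This is a simplified model, not Entroly's code: it assumes the Jaccard similarity is computed over sets of fragment identifiers, and the PrefixCache class, prefix_for method, and render callback are hypothetical names.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


class PrefixCache:
    """Hypothetical per-client prefix store (not Entroly's actual class)."""

    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self._last = {}  # client_id -> (set of fragment ids, rendered prefix)

    def prefix_for(self, client_id, fragment_ids, render):
        ids = set(fragment_ids)
        previous = self._last.get(client_id)
        if previous is not None and jaccard(previous[0], ids) > self.threshold:
            return previous[1]  # reuse the earlier prefix byte-for-byte
        prefix = render(sorted(ids))  # context changed materially: rebuild
        self._last[client_id] = (ids, prefix)
        return prefix


cache = PrefixCache()
files = [f"f{i}.py" for i in range(20)]
first = cache.prefix_for("client-1", files, "\n".join)
second = cache.prefix_for("client-1", files + ["extra.py"], "\n".join)
print(first == second)
```

In this sketch the stored fragment set is only updated on a rebuild, so drift is always measured against the prefix that was actually sent rather than against the previous request.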

On agent loops where the same files are relevant across 10+ turns, this alone can cut the cost of the repeated context by 80–90% when the provider's cache discount is 90%.
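A back-of-the-envelope check of that figure, as a sketch under simplifying assumptions: the first turn pays full price for the prefix, every later turn is a cache hit, and cache-write surcharges, cache expiry, and output tokens are ignored. The function name is hypothetical.

```python
def prefix_cost_reduction(turns: int, cached_discount: float) -> float:
    """Fraction saved on a repeated prefix versus paying full price each turn.

    Assumes turn 1 is billed at full price and turns 2..N are cache hits
    billed at (1 - cached_discount) of the normal input price.
    """
    if turns < 1:
        raise ValueError("turns must be at least 1")
    paid = 1.0 + (turns - 1) * (1.0 - cached_discount)
    return 1.0 - paid / turns


for discount in (0.9, 0.5):
    saved = prefix_cost_reduction(turns=10, cached_discount=discount)
    print(f"discount {discount:.0%}: {saved:.0%} saved over 10 turns")
```

Under these assumptions, ten turns at a 90% discount save about 81% on the repeated prefix, and the figure approaches 90% as the loop gets longer. At a 50% discount the same ten turns save about 45%.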

Content-Compressed Retrieval (CCR — Lossless)

When fragments are compressed to skeletons, Entroly stores the full originals in a local Content-Compressed Retrieval (CCR) store. If the model needs more detail on a compressed fragment, it calls entroly_retrieve via MCP to get the full content back. Nothing is permanently lost — compression is fully reversible.
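The reversibility idea can be sketched as a keyed store: the prompt carries a short stub plus a key, and the key maps back to the untouched original. Everything below is a hypothetical stand-in. The CCRStore class, the in-memory dict, the key format, and the stub layout are assumptions, and the real entroly_retrieve MCP call is not modeled.

```python
import hashlib


class CCRStore:
    """Hypothetical in-memory stand-in for a local store of originals."""

    def __init__(self):
        self._originals = {}

    def put(self, original: str) -> str:
        # A content hash gives a stable key for identical content.
        key = hashlib.sha256(original.encode("utf-8")).hexdigest()[:16]
        self._originals[key] = original
        return key

    def retrieve(self, key: str) -> str:
        return self._originals[key]


def compress(store: CCRStore, path: str, original: str):
    """Return (stub, key): a one-line stub for the prompt and a lookup key."""
    key = store.put(original)
    first_line = original.splitlines()[0] if original else ""
    return f"{path} [ccr:{key}] {first_line} ...", key


store = CCRStore()
source = "def login(user, password):\n    token = check(user, password)\n    return token\n"
stub, key = compress(store, "auth.py", source)
print(stub)
print(store.retrieve(key) == source)
```

Because the original is stored verbatim and only the prompt-side view is shortened, expanding a fragment returns exactly what was on disk when it was compressed.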

WITNESS — $0 Hallucination Guard

Every response is checked by WITNESS, a local deterministic NLI verifier that audits the model's answer against the evidence it was given. It flags unsupported claims before they reach you. Cost: $0. Latency: ~3ms. Accuracy: 0.844 AUROC on HaluEval-QA, statistically equivalent to a GPT-4o-mini judge at zero marginal cost.

Measured Token Savings

All numbers below are from committed JSON artifacts in the repo — not screenshots. Reproduce them with entroly verify-claims.

Token Budget                Token Reduction
8K tokens                   99.1%
32K tokens                  96.7%
Average across workloads    87.0%

Accuracy Is Preserved

Cutting tokens is only useful if the answers stay correct. Measured with gpt-4o-mini on standard benchmarks:

Benchmark                    Token Savings    Accuracy Retention
NeedleInAHaystack            99.5%            100%
LongBench (HotpotQA)         85.3%            103% (better)
Berkeley Function Calling    79.3%            100%
SQuAD 2.0                    43.8%            90%

Reduce Costs Across Every Provider

Cut Claude / Anthropic API Bills

Entroly intercepts Anthropic API calls, compresses the context, and stabilizes the prefix so the 90% prompt caching discount applies. Works with Claude Code, Claude Sonnet, Claude Opus, and all Anthropic models.

ANTHROPIC_BASE_URL=http://localhost:9377 claude

Cut OpenAI / GPT-4 API Bills

The same approach applies to OpenAI: context compression plus prefix stability, so OpenAI's 50% cached-prefix discount applies. Works with GPT-4o, GPT-4o-mini, and all OpenAI-compatible APIs.

OPENAI_BASE_URL=http://localhost:9377/v1 your-app
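Why this needs no code changes: clients that honor OPENAI_BASE_URL build every request URL from it, so changing the variable reroutes traffic through the proxy. The sketch below mimics that resolution with the standard library only. The function name is hypothetical, and real SDKs have their own lookup logic.

```python
import os


def chat_completions_url(env=os.environ) -> str:
    """Resolve the chat completions endpoint the way a base-URL-aware client might."""
    base = env.get("OPENAI_BASE_URL", "https://api.openai.com/v1").rstrip("/")
    return f"{base}/chat/completions"


# With the override set, requests go to the local proxy instead of the provider.
print(chat_completions_url({"OPENAI_BASE_URL": "http://localhost:9377/v1"}))
print(chat_completions_url({}))
```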

Cut Gemini API Bills

GEMINI_BASE_URL=http://localhost:9377 your-app

Works With Cursor, Aider, Claude Code + 34 More

One command wraps your IDE or coding agent:

entroly wrap claude   # Claude Code
entroly wrap cursor   # Cursor
entroly wrap aider    # Aider
entroly go            # auto-detect your tool

Quick Start — Reduce Your Bills in 30 Seconds

# Install
pip install entroly

# Test savings on your own repo (no API key, no outbound calls)
cd /your/repo && entroly verify-claims

# Start the proxy
entroly proxy                          # http://localhost:9377

# Or auto-detect and wrap your coding tool
entroly go

Stop Overpaying for LLM API Calls

Open-source. Local-first. Apache-2.0. Your code stays on your machine.

pip install entroly && entroly go


Frequently Asked Questions

How much can I actually save on my Claude API bill?

It depends on your repo size and query patterns. Run entroly demo in your repo for a before/after estimate with no API calls. On large repos (300+ files) with agent loops, most users see 70–90% reduction. On small repos or single-file prompts, savings are lower.

Does it work with my existing tool?

If your tool supports a custom OPENAI_BASE_URL or ANTHROPIC_BASE_URL, it works via the proxy. Entroly also has native MCP server support for Cursor, Claude Code, VS Code, Windsurf, Zed, and Cline.

Is my code safe? Does it send my data anywhere?

Yes, your code is safe, and no, Entroly does not send it anywhere. Entroly runs entirely on your machine. Context selection, compression, and the CCR store are all local. The only outbound call is the one your tool was already making to the LLM provider. No outbound analytics by default.

What's the difference between Entroly and other compression tools?

Most compression tools shrink whatever you give them. Entroly selects first (knapsack optimizer over the full repo), then compresses. You get the right context at the right resolution — not just a smaller version of everything. The Cache Aligner is also unique: it actively stabilizes the prefix to keep provider discounts applying.