Prompt Compression & Context Compression for LLMs

Last updated: June 2026

Prompt compression — reducing the number of tokens sent to an LLM without losing the information the model needs to answer correctly — is one of the highest-leverage optimizations in AI engineering. Entroly is an open-source prompt and context compression engine that achieves 70–95% token reduction on large-repo workloads while preserving answer accuracy.

What makes Entroly different: Most prompt compression tools shrink whatever text you give them. Entroly selects the right content first (from your entire codebase or document set), then compresses. You get the right context at the right resolution — not just a smaller version of everything.

What Is Prompt Compression?

Prompt compression is the process of reducing the token count of the input sent to an LLM while preserving the information needed to answer the query. It sits between your application and the LLM provider, transforming a large, raw context into a compact, information-dense one.

There are three levels of prompt compression:

Entroly uses all three, applied in a mathematically optimal order.

Entroly's Prompt Compression Pipeline

your prompt ──► Entroly (local, <10ms) ──► LLM provider
                    │
                    ├─ 1. Ingest + index repo (BM25 + entropy + dep-graph)
                    ├─ 2. Rank fragments by query relevance
                    ├─ 3. Select optimal subset (knapsack solver)
                    ├─ 4. Compress: full → skeleton → reference
                    ├─ 5. Store originals in CCR store (lossless)
                    ├─ 6. Align prefix for provider cache hit (Cache Aligner)
                    └─ 7. Verify response against evidence (WITNESS)
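Step 2 of the pipeline ranks fragments by query relevance, and step 1 names BM25 as one of the signals. The sketch below shows plain BM25 ranking over text fragments, to illustrate the idea only. The function names, the regex tokenizer, and the `k1`/`b` defaults are assumptions made for this example, not Entroly's actual implementation, which also combines entropy and dependency-graph signals.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric/underscore tokens (illustrative tokenizer)."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def bm25_rank(query, fragments, k1=1.5, b=0.75):
    """Return fragment indices ordered by BM25 score against the query, best first."""
    docs = [tokenize(fragment) for fragment in fragments]
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs if n_docs else 0.0
    # Document frequency: how many fragments contain each term at least once.
    df = Counter(term for doc in docs for term in set(doc))
    query_terms = set(tokenize(query))

    def score(doc):
        tf = Counter(doc)
        total = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization penalizes long fragments that mention a term in passing.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            total += idf * tf[term] * (k1 + 1) / norm
        return total

    scores = [score(doc) for doc in docs]
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
```

A ranking like this only orders candidates; the selection step still has to decide how many of them fit in the token budget.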

The 9 Compressors

Entroly ships 9 content-type-aware compressors, each optimized for a specific input type:

Input Type | Compressor | Technique | Typical Savings
Source code | Code Skeletonizer | AST-based, extracts signatures + docstrings | 60–90%
Shell output / logs | Shell Codec | Entropy-based deduplication | 60–95%
JSON / API responses | JSON Compressor | Schema extraction + value sampling | 70–90%
Prose / docs | Semantic Pruner | Sentence-level relevance scoring | 40–70%
Diffs / patches | Diff Compressor | Context-window selection around changed hunks | 50–80%
Test output | Test Codec | Failure extraction, pass collapsing | 70–90%
CSV / tabular | Table Compressor | Schema + sample rows | 80–95%
Images (base64) | Image Optimizer | Resolution reduction | 40–60%
Conversation history | Entropic Pruner | Low-entropy turn removal | 30–60%
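To make the first row concrete, here is a minimal sketch of AST-based skeletonization for Python source using only the standard library `ast` module (Python 3.9+ for `ast.unparse`). It keeps signatures and the first docstring line and replaces bodies with `...`. The `skeletonize` function is a hypothetical name for this illustration; Entroly's Code Skeletonizer may work differently and covers more languages.

```python
import ast


def skeletonize(source):
    """Keep function/class signatures and first docstring lines; drop bodies."""
    tree = ast.parse(source)
    lines = []

    def visit(node, indent):
        pad = "    " * indent
        for child in node.body:
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                keyword = "async def" if isinstance(child, ast.AsyncFunctionDef) else "def"
                returns = f" -> {ast.unparse(child.returns)}" if child.returns else ""
                header = f"{keyword} {child.name}({ast.unparse(child.args)}){returns}:"
            elif isinstance(child, ast.ClassDef):
                header = f"class {child.name}:"
            else:
                continue  # statements other than defs and classes are dropped
            lines.append(pad + header)
            doc = ast.get_docstring(child)
            if doc:
                lines.append(f'{pad}    """{doc.splitlines()[0]}"""')
            if isinstance(child, ast.ClassDef):
                before = len(lines)
                visit(child, indent + 1)
                if len(lines) == before:
                    lines.append(pad + "    ...")
            else:
                lines.append(pad + "    ...")

    visit(tree, 0)
    return "\n".join(lines)
```

The savings come from the fact that function bodies usually dominate the token count, while the signature and docstring are often enough for the model to decide whether it needs the full definition.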

Content-Compressed Retrieval (CCR) — Lossless Compression

Entroly's Content-Compressed Retrieval (CCR) solves the fundamental problem with lossy compression: what happens when the model needs the detail you compressed away?

When Entroly compresses a fragment to a skeleton or reference, it stores the original full content in a local CCR store (content-addressed, LRU-evicted, ~5MB for 500 fragments). The compressed fragment contains a retrieval handle. If the model needs more detail, it calls entroly_retrieve via MCP or GET /retrieve?source=file:src/auth.py via the proxy endpoint — and gets the full original back immediately.

Nothing is permanently lost while the original remains in the CCR store, so compression is reversible.

# List all retrievable compressed fragments
curl localhost:9377/retrieve

# Retrieve a specific file's full original
curl "localhost:9377/retrieve?source=file:src/auth.py"
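The store described above is content-addressed and LRU-evicted. A minimal in-memory sketch of that data structure follows. The class name `OriginalStore`, the truncated SHA-256 handle, and the item-count limit are assumptions for illustration; the actual CCR store's handle format and eviction policy details are not specified here.

```python
import hashlib
from collections import OrderedDict


class OriginalStore:
    """Content-addressed store with least-recently-used eviction (illustrative)."""

    def __init__(self, max_items=500):
        self.max_items = max_items
        self._items = OrderedDict()  # handle -> original text, oldest first

    def put(self, text):
        """Store the original and return a handle derived from its content."""
        handle = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
        self._items[handle] = text
        self._items.move_to_end(handle)
        while len(self._items) > self.max_items:
            self._items.popitem(last=False)  # evict the least recently used entry
        return handle

    def get(self, handle):
        """Return the original text, or None if it was never stored or was evicted."""
        text = self._items.get(handle)
        if text is not None:
            self._items.move_to_end(handle)  # a read counts as a use
        return text
```

Because handles are derived from content, storing the same fragment twice yields the same handle. A lookup for an evicted handle returns `None`, so a caller should be prepared to re-read the source file in that case.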

Cache Aligner — Free 90% Discount From Anthropic

Anthropic gives a 90% discount on tokens read from a cached prompt prefix. OpenAI gives 50%. To get this discount, the prefix must be identical across requests. Standard prompt compression changes the context on every call, which busts the cache.

Entroly's Cache Aligner solves this: it tracks the injected context prefix per client and uses Jaccard token similarity to detect when the context hasn't materially changed. When similarity exceeds the threshold (default: 90%), it reuses the previous prefix verbatim — byte-for-byte — preserving the provider's cache hit.
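The reuse rule can be sketched in a few lines. This is an illustration of Jaccard-based prefix reuse under simple assumptions (whitespace tokenization, one remembered prefix per client); the `PrefixAligner` class and its method names are hypothetical and not Entroly's API.

```python
def jaccard(a_tokens, b_tokens):
    """Jaccard similarity of two token collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a_tokens), set(b_tokens)
    if not a and not b:
        return 1.0  # two empty contexts are treated as identical
    return len(a & b) / len(a | b)


class PrefixAligner:
    """Reuse the previous prefix verbatim when the new one is similar enough."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self._last = {}  # client_id -> prefix string sent on the previous call

    def align(self, client_id, new_prefix):
        previous = self._last.get(client_id)
        if previous is not None:
            similarity = jaccard(previous.split(), new_prefix.split())
            if similarity > self.threshold:
                return previous  # byte-identical to the last call, so the cache can hit
        self._last[client_id] = new_prefix
        return new_prefix
```

The trade-off is freshness versus cost: while the old prefix is being reused, small changes in the newly selected context are not sent to the model, which is why the threshold is set high.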

On agent loops making 10+ calls with similar context, this alone can reduce the effective per-token cost of the cached input prefix by 80–90%.
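A rough way to estimate the saving for your own loop length is to average the price multiplier over all calls. The sketch below assumes the first call pays a cache-write surcharge and every later call is a cache hit. The default multipliers (0.10 for reads, 1.25 for writes) are assumptions to replace with your provider's current price sheet, and the figure covers only the shared input prefix, not output tokens or uncached input.

```python
def effective_prefix_cost(calls, read_multiplier=0.10, write_multiplier=1.25):
    """Average price multiplier for the shared prefix across `calls` requests.

    Assumes call 1 writes the cache and calls 2..n read from it.
    """
    if calls < 1:
        raise ValueError("calls must be at least 1")
    total = write_multiplier + (calls - 1) * read_multiplier
    return total / calls


def prefix_savings(calls, **multipliers):
    """Fraction saved on the prefix relative to paying full price on every call."""
    return 1 - effective_prefix_cost(calls, **multipliers)
```

Under these assumed multipliers, a 10-call loop averages about 0.215 of the base price for the prefix, and the saving approaches 90% as the loop gets longer.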

Knapsack-Optimal Selection

Before any compression happens, Entroly solves an optimization problem: given your token budget and a query, which fragments from your entire repository or document set maximize information value?

Each fragment gets a score combining:

The knapsack solver then selects a subset of fragments under the budget. Finding the exact optimum is NP-hard in general, so Entroly uses a greedy approximation with provable bounds that runs in under 10ms in the Rust core.
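For readers unfamiliar with greedy knapsack approximations, here is a minimal Python sketch of one textbook variant: sort by score per token, fill the budget, then compare against the single best fragment. With non-negative scores this variant is guaranteed to reach at least half of the optimal total score. The `Fragment` type and `select_fragments` function are hypothetical names for this example; Entroly's Rust solver and its exact bound may differ.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    name: str
    tokens: int   # cost of including this fragment
    score: float  # non-negative information value for the query


def select_fragments(fragments, budget):
    """Greedy 0/1 knapsack: density-ordered fill, with a best-single-item fallback."""
    fitting = [f for f in fragments if 0 < f.tokens <= budget]
    by_density = sorted(fitting, key=lambda f: f.score / f.tokens, reverse=True)

    chosen, used = [], 0
    for fragment in by_density:
        if used + fragment.tokens <= budget:
            chosen.append(fragment)
            used += fragment.tokens

    # Density ordering can be fooled by one large, valuable fragment that no
    # longer fits; taking the better of the two answers gives the 1/2 bound.
    best_single = max(fitting, key=lambda f: f.score, default=None)
    if best_single is not None and best_single.score > sum(f.score for f in chosen):
        return [best_single]
    return chosen
```

Sorting dominates the running time, so the sketch is O(n log n) in the number of fragments, which is what makes a millisecond-scale budget plausible even for large repositories.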

Measured Compression Results

Token Budget | Token Reduction | Source
8K | 99.1% | benchmarks/results/
32K | 96.7% | benchmarks/results/
Average | 87.0% | entroly verify-claims

Reproduce on your own repo: pip install entroly && entroly verify-claims — no API key needed.

Accuracy Benchmarks

Benchmark | Token Savings | Accuracy Retention
NeedleInAHaystack | 99.5% | 100%
LongBench (HotpotQA) | 85.3% | 103%
Berkeley Function Calling | 79.3% | 100%
SQuAD 2.0 | 43.8% | 90%
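The two columns can be read as simple ratios. The definitions below are an assumption about how such figures are typically computed, not a statement of how Entroly's benchmark harness computes them: token savings compares compressed to original token counts, and accuracy retention compares accuracy with compressed context to accuracy with the full context.

```python
def token_savings(original_tokens, compressed_tokens):
    """Fraction of input tokens removed by compression."""
    return 1 - compressed_tokens / original_tokens


def accuracy_retention(compressed_accuracy, baseline_accuracy):
    """Accuracy with compressed context relative to the uncompressed baseline."""
    return compressed_accuracy / baseline_accuracy
```

Under this reading, a retention above 100% means the model scored higher with the compressed context than with the full one, which can happen when compression removes distracting material.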

Use Entroly's Prompt Compression

As a Transparent Proxy (Zero Code Changes)

pip install entroly
entroly proxy                         # http://localhost:9377
# Point your app at the proxy; set the variable that matches your provider
ANTHROPIC_BASE_URL=http://localhost:9377 your-app
OPENAI_BASE_URL=http://localhost:9377/v1 your-app

As a Python Library

from entroly import compress, compress_messages, optimize

# Compress a list of chat messages
compressed = compress_messages(messages, budget=30000)

# Compress arbitrary content (auto-detects type)
compressed = compress(log_output, budget=2000)

# Task-conditioned selection from repo fragments
context = optimize(fragments, budget=8000, query="fix the login bug")

As an MCP Server

entroly serve   # starts MCP server
# then add to Cursor, Claude Code, VS Code config via: entroly init

Start Compressing Your Prompts

Open-source. Local-first. Apache-2.0. Runs on your machine.

pip install entroly && entroly verify-claims
