Prompt Compression & Context Compression for LLMs
Last updated: June 2026
Prompt compression — reducing the number of tokens sent to an LLM without losing the information the model needs to answer correctly — is one of the highest-leverage optimizations in AI engineering. Entroly is an open-source prompt and context compression engine that achieves 70–95% token reduction on large-repo workloads while preserving answer accuracy.
What Is Prompt Compression?
Prompt compression is the process of reducing the token count of the input sent to an LLM while preserving the information needed to answer the query. It sits between your application and the LLM provider, transforming a large, raw context into a compact, information-dense one.
There are three levels of prompt compression:
- Selection — choosing which documents, files, or fragments to include at all
- Summarization — replacing long passages with shorter summaries (lossy)
- Skeletonization — replacing code with function signatures and type annotations (structure-preserving)
Entroly uses all three: selection runs first, and compression is then applied to the fragments that were selected.
Entroly's Prompt Compression Pipeline
The 9 Compressors
Entroly ships 9 content-type-aware compressors, each optimized for a specific input type:
| Input Type | Compressor | Technique | Typical Savings |
|---|---|---|---|
| Source code | Code Skeletonizer | AST-based, extracts signatures + docstrings | 60–90% |
| Shell output / logs | Shell Codec | Entropy-based deduplication | 60–95% |
| JSON / API responses | JSON Compressor | Schema extraction + value sampling | 70–90% |
| Prose / docs | Semantic Pruner | Sentence-level relevance scoring | 40–70% |
| Diffs / patches | Diff Compressor | Context-window selection around changed hunks | 50–80% |
| Test output | Test Codec | Failure extraction, pass collapsing | 70–90% |
| CSV / tabular | Table Compressor | Schema + sample rows | 80–95% |
| Images (base64) | Image Optimizer | Resolution reduction | 40–60% |
| Conversation history | Entropic Pruner | Low-entropy turn removal | 30–60% |
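As one illustration of a row in the table above, the following sketch shows the general idea behind "schema extraction + value sampling" for JSON. It is an assumption about how such a compressor could work, not Entroly's JSON Compressor; `compress_json`, `max_items`, and `max_str` are hypothetical names.

```python
import json
from typing import Any


def compress_json(value: Any, max_items: int = 2, max_str: int = 40) -> Any:
    """Keep the key structure, sample the first few array items, trim long strings."""
    if isinstance(value, dict):
        return {k: compress_json(v, max_items, max_str) for k, v in value.items()}
    if isinstance(value, list):
        sample = [compress_json(v, max_items, max_str) for v in value[:max_items]]
        omitted = len(value) - len(sample)
        if omitted > 0:
            # Leave a marker so the reader knows the array was truncated.
            sample.append(f"<{omitted} more items omitted>")
        return sample
    if isinstance(value, str) and len(value) > max_str:
        return value[:max_str] + "..."
    return value


payload = {"users": [{"id": i, "name": f"user{i}"} for i in range(100)], "total": 100}
print(json.dumps(compress_json(payload), indent=2))
```

Large API responses are often long arrays of same-shaped objects, so a couple of samples plus a count conveys the schema at a fraction of the tokens.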
Content-Compressed Retrieval (CCR) — Lossless Compression
Entroly's Content-Compressed Retrieval (CCR) solves the fundamental problem with lossy compression: what happens when the model needs the detail you compressed away?
When Entroly compresses a fragment to a skeleton or reference, it stores the original full content in a local CCR store (content-addressed, LRU-evicted, ~5 MB for 500 fragments). The compressed fragment contains a retrieval handle. If the model needs more detail, it calls entroly_retrieve via MCP, or GET /retrieve?source=file:src/auth.py via the proxy endpoint, and gets the full original back immediately.
Compression is reversible for any fragment still held in the CCR store, so detail the model turns out to need is not lost.
# List all retrievable compressed fragments
curl localhost:9377/retrieve
# Retrieve a specific file's full original (quoted so the shell does not glob the "?")
curl 'localhost:9377/retrieve?source=file:src/auth.py'
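The store described above (content-addressed, LRU-evicted) can be sketched in a few lines. This is a minimal in-memory illustration under stated assumptions, not Entroly's implementation: the class name `RetrievalStore`, the truncated SHA-256 handle, and the fragment-count limit are all hypothetical choices.

```python
import hashlib
from collections import OrderedDict
from typing import Optional


class RetrievalStore:
    """Content-addressed store that evicts the least recently used fragment."""

    def __init__(self, max_fragments: int = 500) -> None:
        self._items: "OrderedDict[str, str]" = OrderedDict()
        self._max = max_fragments

    def put(self, content: str) -> str:
        # The handle is derived from the content, so identical content
        # always maps to the same handle.
        handle = hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]
        self._items[handle] = content
        self._items.move_to_end(handle)  # mark as most recently used
        while len(self._items) > self._max:
            self._items.popitem(last=False)  # evict least recently used
        return handle

    def get(self, handle: str) -> Optional[str]:
        content = self._items.get(handle)
        if content is not None:
            self._items.move_to_end(handle)  # a read also refreshes recency
        return content
```

Note that `get` returns `None` for an evicted handle, which is why retrieval is only guaranteed while the fragment is still in the store.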
Cache Aligner — Free 90% Discount From Anthropic
Anthropic gives a 90% discount on cached token prefixes; OpenAI gives 50%. To get this discount, the prefix bytes must be identical across requests. Standard prompt compression changes the context on every call, which busts the cache.
Entroly's Cache Aligner solves this: it tracks the injected context prefix per client and uses Jaccard token similarity to detect when the context hasn't materially changed. When similarity exceeds the threshold (default: 90%), it reuses the previous prefix verbatim — byte-for-byte — preserving the provider's cache hit.
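The reuse decision can be sketched as follows. This is a simplified illustration, not Entroly's Cache Aligner: it assumes whitespace-separated words stand in for tokens, and the names `jaccard` and `PrefixAligner` are hypothetical.

```python
from typing import Dict


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two token sets: |A & B| / |A | B|."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0  # two empty prefixes are treated as identical
    return len(sa & sb) / len(sa | sb)


class PrefixAligner:
    """Reuse the previous prefix verbatim when the new one is similar enough."""

    def __init__(self, threshold: float = 0.90) -> None:
        self.threshold = threshold
        self._last: Dict[str, str] = {}  # last prefix sent, per client

    def align(self, client_id: str, new_prefix: str) -> str:
        previous = self._last.get(client_id)
        if previous is not None and jaccard(previous, new_prefix) > self.threshold:
            return previous  # byte-identical to the last request
        self._last[client_id] = new_prefix
        return new_prefix
```

The trade-off is that the model sees a slightly stale prefix in exchange for the cache hit, which is why the threshold is set high.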
On agent loops making 10+ calls with similar context, this alone can reduce the effective cost of the shared input tokens by 80–90%.
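A rough back-of-the-envelope check of that range, assuming every call shares one cached prefix, the first call pays full price, and later calls pay the discounted rate. The function and its `write_premium` parameter (a hypothetical knob for providers that bill cache writes above the normal rate) are illustrative, not part of Entroly.

```python
def effective_input_cost(
    calls: int, read_discount: float = 0.90, write_premium: float = 0.0
) -> float:
    """Fraction of the uncached prefix cost paid across `calls` requests."""
    if calls < 1:
        raise ValueError("calls must be at least 1")
    first = 1.0 + write_premium                     # first call populates the cache
    later = (calls - 1) * (1.0 - read_discount)     # remaining calls hit the cache
    return (first + later) / calls


for n in (10, 50):
    saved = 1.0 - effective_input_cost(n)
    print(f"{n} calls: about {saved:.0%} saved on the shared prefix")
```

Under these assumptions the savings start at roughly 81% for 10 calls and approach the full 90% discount as the number of calls grows.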
Knapsack-Optimal Selection
Before any compression happens, Entroly solves an optimization problem: given your token budget and a query, which fragments from your entire repository or document set maximize information value?
Each fragment gets a score combining:
- BM25 lexical relevance to the query
- Shannon entropy density (information per token)
- SimHash deduplication score (penalizes near-duplicates)
- Dependency graph distance (rewards files referenced by high-relevance files)
- PRISM learned weights (improves over time from local feedback)
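Of the signals listed above, entropy density is the easiest to show in isolation. The sketch below computes Shannon entropy in bits per token, assuming whitespace-separated words as tokens; how Entroly tokenizes and how it weights this signal against the others is not specified here, and `entropy_per_token` is a hypothetical name.

```python
import math
from collections import Counter


def entropy_per_token(text: str) -> float:
    """Shannon entropy of the token distribution, in bits per token."""
    tokens = text.split()
    if not tokens:
        return 0.0
    total = len(tokens)
    counts = Counter(tokens)
    # H = -sum(p * log2(p)) over the observed token frequencies
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


print(entropy_per_token("ok ok ok ok ok ok"))           # repetitive log line
print(entropy_per_token("validate token before login"))  # varied content
```

Repetitive text scores low and varied text scores higher, which is the intuition behind preferring information-dense fragments.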
The knapsack solver then selects a near-optimal subset under the budget. The underlying 0/1 knapsack problem is NP-hard in general, so rather than searching for the exact optimum, Entroly uses a greedy approximation with provable bounds that runs in under 10 ms via the Rust core.
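For readers unfamiliar with greedy knapsack approximations, here is a textbook variant: rank fragments by score per token, fill the budget greedily, then fall back to the single best fragment if it beats the greedy set. This variant is known to reach at least half of the optimal value. It is an assumption that Entroly's solver resembles it; `Fragment` and `select_fragments` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Fragment:
    name: str
    tokens: int   # cost of including the fragment
    score: float  # combined relevance score (higher is better)


def select_fragments(fragments: List[Fragment], budget: int) -> List[Fragment]:
    """Greedy-by-density selection with a best-single-item fallback."""
    fitting = [f for f in fragments if 0 < f.tokens <= budget]
    ranked = sorted(fitting, key=lambda f: f.score / f.tokens, reverse=True)

    chosen: List[Fragment] = []
    used = 0
    for frag in ranked:
        if used + frag.tokens <= budget:
            chosen.append(frag)
            used += frag.tokens

    # The fallback is what gives the half-of-optimal guarantee: greedy alone
    # can be arbitrarily bad when one large fragment carries most of the value.
    best_single = max(fitting, key=lambda f: f.score, default=None)
    if best_single is not None and best_single.score > sum(f.score for f in chosen):
        return [best_single]
    return chosen
```

Sorting dominates the cost, so the selection is O(n log n) in the number of fragments, which is consistent with a solver that has to run on every request.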
Measured Compression Results
| Token Budget | Token Reduction | Source |
|---|---|---|
| 8K | 99.1% | benchmarks/results/ |
| 32K | 96.7% | benchmarks/results/ |
| Average | 87.0% | entroly verify-claims |
Reproduce on your own repo: pip install entroly && entroly verify-claims — no API key needed.
Accuracy Benchmarks
| Benchmark | Token Savings | Accuracy Retention |
|---|---|---|
| NeedleInAHaystack | 99.5% | 100% |
| LongBench (HotpotQA) | 85.3% | 103% |
| Berkeley Function Calling | 79.3% | 100% |
| SQuAD 2.0 | 43.8% | 90% |
Use Entroly's Prompt Compression
As a Transparent Proxy (Zero Code Changes)
pip install entroly
entroly proxy  # http://localhost:9377
ANTHROPIC_BASE_URL=http://localhost:9377 your-app
OPENAI_BASE_URL=http://localhost:9377/v1 your-app
As a Python Library
from entroly import compress, compress_messages, optimize

# Compress a list of chat messages
compressed = compress_messages(messages, budget=30000)

# Compress arbitrary content (auto-detects type)
compressed = compress(log_output, budget=2000)

# Task-conditioned selection from repo fragments
context = optimize(fragments, budget=8000, query="fix the login bug")
As an MCP Server
entroly serve  # starts MCP server
# then add to Cursor, Claude Code, VS Code config via:
entroly init
Start Compressing Your Prompts
Open-source. Local-first. Apache-2.0. Runs on your machine.
pip install entroly && entroly verify-claims