Metadata-Version: 2.4
Name: contextpack-ai
Version: 0.1.0
Summary: Negotiated session codebook compression for LLMs — cut 20-60% of tokens losslessly
Project-URL: Homepage, https://github.com/surya16122114/contextpack
Project-URL: Repository, https://github.com/surya16122114/contextpack
Author: Chinnasurya Prasad Vulavala
License: MIT
License-File: LICENSE
Keywords: anthropic,claude,compression,llm,mcp,openai,proxy,tokens
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: General
Requires-Python: >=3.11
Requires-Dist: aiosqlite>=0.20.0
Requires-Dist: anthropic>=0.40.0
Requires-Dist: fastapi>=0.115.0
Requires-Dist: httpx>=0.27.0
Requires-Dist: mcp>=1.0.0
Requires-Dist: openai>=1.54.0
Requires-Dist: pydantic-settings>=2.6.0
Requires-Dist: pydantic>=2.9.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: rich>=13.9.0
Requires-Dist: tiktoken>=0.8.0
Requires-Dist: typer>=0.13.0
Requires-Dist: uvicorn[standard]>=0.32.0
Requires-Dist: xxhash>=3.5.0
Provides-Extra: evals
Requires-Dist: datasets>=2.0.0; extra == 'evals'
Description-Content-Type: text/markdown

# ContextPack

**Negotiated session codebook compression for LLMs — cut tokens, keep answers.**

ContextPack is an OpenAI-compatible proxy and library that compresses the context you send to an LLM (OpenAI, Anthropic, or any OpenAI-compatible API). Unlike one-sided compressors that simply throw bytes away, its signature feature is a **negotiated session codebook**: it negotiates a shared abbreviation dictionary *with the model*, and the model confirms each symbol before it is used, so the substitution is lossless. ContextPack's own code is pure Python (compiled dependencies such as `tiktoken` typically install as prebuilt wheels), there are no ML models to download, and it works out of the box.

---

## Why ContextPack

- **Negotiated codebook (lossless).** ContextPack proposes `[CP_1] = <big chunk of context>`, the model acknowledges it, and every later turn sends the symbol instead of the chunk. Because the model confirmed the mapping, nothing is lost. This negotiation step is what sets ContextPack apart from one-sided compressors.
- **Content-aware compression.** Separate, format-specific compressors for JSON, code, logs, stacktraces, and query-aware prose — each strips redundancy the way that format allows.
- **Lazy references.** Huge blobs (over a configurable token threshold) are replaced with a reference; the model retrieves the full content on demand instead of re-sending it every turn.
- **Token budget optimizer + semantic dedup.** Fit a conversation into a target budget and drop near-duplicate content automatically.
- **4 ways to use it:** HTTP proxy, Python library, CLI, or MCP server.
- **Live analytics dashboard.** Watch token savings accumulate in real time at `/dashboard`.
- **Bring-your-own-key (BYOK).** Each request can carry its own upstream key, so every user pays their own bill.
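- **Pure Python.** ContextPack's own code has no native extensions, no downloaded ML models, and no GPU requirement; compiled dependencies such as `tiktoken` and `pydantic` typically install as prebuilt wheels. `pip install -e .` and go.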
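
To make the lazy-reference bullet concrete, here is a minimal sketch of the idea. It is not ContextPack's actual implementation: `LazyRefStore`, the `[REF:...]` placeholder format, and the whitespace-based token estimate are all assumptions made for illustration. The 100-token threshold mirrors the `REF_THRESHOLD_TOKENS` default listed under Configuration.

```python
import hashlib


class LazyRefStore:
    """Toy store: large blobs are swapped for a short reference and kept for later retrieval."""

    def __init__(self, threshold_tokens: int = 100) -> None:
        self.threshold_tokens = threshold_tokens
        self._blobs: dict[str, str] = {}

    @staticmethod
    def estimate_tokens(text: str) -> int:
        # Crude stand-in for a real tokenizer: count whitespace-separated words.
        return len(text.split())

    def maybe_reference(self, text: str) -> str:
        """Return the text unchanged if it is small, else a placeholder that points at it."""
        if self.estimate_tokens(text) <= self.threshold_tokens:
            return text
        ref_id = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
        self._blobs[ref_id] = text
        return f"[REF:{ref_id}]"

    def resolve(self, ref_id: str) -> str:
        """Fetch the full content when it is requested by reference."""
        return self._blobs[ref_id]


store = LazyRefStore(threshold_tokens=100)
big = "log line " * 200  # 400 words, well above the threshold

placeholder = store.maybe_reference(big)
ref_id = placeholder[len("[REF:"):-1]
print(placeholder, len(store.resolve(ref_id)))
```

The placeholder is derived from a content hash, so sending the same blob twice yields the same reference and the store keeps a single copy.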

---

## Benchmarks

The core claim — **compression doesn't change the model's answers** — is tested two ways against real datasets (GSM8K, SQuAD v2, TruthfulQA), with deterministic sampling (`seed=42`) and Wilson/normal confidence intervals. The two runs answer different questions and are reported separately (never blended):

- **Scale** — does it hold across *thousands* of diverse inputs? (`gpt-4o-mini`, full datasets)
- **Strength** — does it hold on a *stronger* model? (`gpt-4o`, N=100)
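
For readers unfamiliar with the Wilson interval mentioned above, the sketch below shows the standard formula for a binomial proportion. It is a generic illustration, not the code used by the evals suite; the function name `wilson_interval` and the example counts are assumptions.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Made-up counts for illustration: 50 correct answers out of 100 cases.
low, high = wilson_interval(successes=50, n=100)
print(f"{low:.3f} to {high:.3f}")
```

Unlike the normal approximation, the Wilson interval stays inside [0, 1] even when accuracy is at or near 100%, which matters for rows such as the codebook benchmark.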

### Scale — `gpt-4o-mini`, 6,557 cases (full datasets)

| Benchmark | N | Baseline | Compressed | Δ | Compression | Tokens saved |
|---|---|---|---|---|---|---|
| **Codebook** | 21 | 100% | 100% | **±0.0%** | **57%** | 6,852 |
| **Workload** (code/JSON/log) | 26 | 100% | 100% | **±0.0%** | 24% | 499 |
| **SQuAD v2** (prose) | 5,236 | 46.2% | 46.7% | **+0.6%** | 24% | **228,976** |
| GSM8K | 1,029 | 79.6% | 79.6% | **±0.0%** | 0%¹ | 0 |
| TruthfulQA | 245 | 48.6% | 48.6% | **±0.0%** | 0%¹ | 0 |

### Strength — `gpt-4o`, N=100

| Benchmark | N | Baseline | Compressed | Δ | Compression |
|---|---|---|---|---|---|
| **Codebook** | 21 | 100% | 100% | **±0.0%** | **57%** |
| **Workload** | 26 | 100% | 100% | **±0.0%** | 24% |
| **SQuAD v2** | 100 | 70.7% | 70.5% | **-0.2%** | 20% |
| GSM8K | 100 | 88.0% | 88.0% | **±0.0%** | 0%¹ |
| TruthfulQA | 100 | 56.0% | 56.0% | **±0.0%** | 0%¹ |

**Codebook, per scenario** (the unique angle — lossless by construction, the model confirms every symbol):

| Scenario | Turns | Accuracy | Tokens saved |
|---|---|---|---|
| `auth_service` | 6 | 100% | **41–44%** |
| `data_schema`  | 7 | 100% | **58–59%** |
| `api_spec`     | 8 | 100% | **60%** |

**Verdict: compression preserves accuracy** — every delta is within ±0.6%, and the codebook path is exactly lossless on both models. On `gpt-4o`, GSM8K (88%) lands in the same league as published baselines (~87%), confirming the setup is sound.

¹ GSM8K/TruthfulQA are short prose with nothing to compress, so compression is a deliberate no-op — those rows prove **non-interference**, not savings.

> Honest notes: the scale run hit the account's rate limit at full concurrency, so 1,527 cases exhausted retries (SQuAD landed at 5,236 of 5,928 and TruthfulQA at 245 of 790; the remaining 290 match GSM8K finishing 1,029 of its 1,319 test problems). The completed cases are valid and the SQuAD CI is still tight [46–48%]. The two runs use different N by design (scale vs. strength); they are reported as separate tables with their own confidence intervals and never averaged together.

Reproduce:

```bash
python -m evals.suite --tier 3 --n 200                 # smaller, cheaper approximation of the scale run (gpt-4o-mini)
python -m evals.suite --tier 3 --n 100 --model gpt-4o  # strength
```

---

## Quick start

```bash
git clone https://github.com/surya16122114/contextpack
cd contextpack
pip install -e .
cp .env.example .env        # add your upstream key (or use bring-your-own-key per request)
contextpack serve           # starts the proxy on :8000
```

Then point the OpenAI SDK at the proxy — no other code changes:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="sk-...",   # your real upstream key; passed through as BYOK
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize the attached spec..."}],
    extra_headers={"x-session-id": "my-session"},   # reuse a session to build a codebook
)
print(resp.choices[0].message.content)
```

Every response includes `X-ContextPack-*` headers (`Original-Tokens`, `Compressed-Tokens`, `Savings`, `Codebook-Size`) so you can see exactly what was saved.
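
As a rough sketch of how a client might read those headers, the helper below collects every `X-ContextPack-*` header from a response. The helper name `contextpack_headers` and the example values are hypothetical, and treating the values as plain integers is an assumption about the header format.

```python
from collections.abc import Mapping

PREFIX = "x-contextpack-"


def contextpack_headers(headers: Mapping[str, str]) -> dict[str, str]:
    """Collect X-ContextPack-* headers, keyed by lowercase suffix (e.g. 'original-tokens')."""
    return {
        name.lower()[len(PREFIX):]: value
        for name, value in headers.items()
        if name.lower().startswith(PREFIX)
    }


# Hypothetical header values, for illustration only.
example = {
    "Content-Type": "application/json",
    "X-ContextPack-Original-Tokens": "1200",
    "X-ContextPack-Compressed-Tokens": "700",
}
stats = contextpack_headers(example)
saved = int(stats["original-tokens"]) - int(stats["compressed-tokens"])
print(stats, saved)
```

With the OpenAI Python SDK, raw headers are available by calling `client.chat.completions.with_raw_response.create(...)`; the returned object exposes `.headers` and a `.parse()` method that yields the usual completion.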

---

## The 4 usage modes

### 1. HTTP Proxy

Drop-in OpenAI-compatible endpoint. Change `base_url` and you're done — works with Cursor, the OpenAI SDK, LangChain, or any OpenAI client.

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="sk-your-upstream-key",   # BYOK: this key is used for the upstream call
)
```

The `Authorization: Bearer <key>` header is treated as bring-your-own-key — ContextPack forwards it to the upstream provider instead of using the server's own key. Pass `x-session-id` to keep building the same codebook across calls.
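
The precedence described above can be sketched as a small function. This is an illustration of the rule, not the proxy's actual code; `resolve_upstream_key` is a hypothetical name, and falling back to the server key when the header is missing or malformed is an assumption based on `UPSTREAM_API_KEY` being described as the default key.

```python
from typing import Optional


def resolve_upstream_key(authorization: Optional[str], server_key: str) -> str:
    """Prefer the caller's bearer token; fall back to the server's configured key."""
    if authorization:
        scheme, _, token = authorization.partition(" ")
        if scheme.lower() == "bearer" and token.strip():
            return token.strip()
    return server_key


# The caller's key wins when it is present.
print(resolve_upstream_key("Bearer sk-user", server_key="sk-server"))
# Without a header, the server's default key is used.
print(resolve_upstream_key(None, server_key="sk-server"))
```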

### 2. Python library

Use the compression pipeline in-process, no server required:

```python
from contextpack import ContextPackClient

client = ContextPackClient(
    upstream_provider="openai",       # or "anthropic"
    upstream_api_key="sk-...",
    session_id="my-session",          # optional; auto-generated if omitted
)

response = client.chat(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "..."}],
)

print("Response:", response.content)
print("Tokens saved:", response.tokens_saved)
print("Codebook size:", response.codebook_size)
```

### 3. CLI

```bash
contextpack serve                    # start the proxy (--port, --host, --reload)
contextpack stats                    # global compression stats
contextpack stats <session-id>       # per-session stats
contextpack codebook <session-id>    # show the negotiated codebook for a session
contextpack mcp-install              # auto-configure the MCP server in your clients
```

### 4. MCP server

Expose ContextPack's compression as tools (`compress_text`, `analyze_tokens`, `get_stats`) to any MCP client.

Let ContextPack configure it for you:

```bash
contextpack mcp-install                       # configures Claude Desktop, Cursor, and Claude Code
contextpack mcp-install --client cursor       # just one client
contextpack mcp-install --dry-run             # preview without writing anything
```

Or add it by hand to your client's MCP config (e.g. Claude Desktop / Cursor):

```json
{
  "mcpServers": {
    "contextpack": {
      "command": "python",
      "args": ["-m", "contextpack.mcp_server"]
    }
  }
}
```

`mcp-install` writes this block, with `command` set to the active Python interpreter, to:

- **Claude Desktop** — `~/Library/Application Support/Claude/claude_desktop_config.json` (macOS), `%APPDATA%/Claude/claude_desktop_config.json` (Windows), `~/.config/Claude/claude_desktop_config.json` (Linux)
- **Cursor** — `~/.cursor/mcp.json`
- **Claude Code** — via `claude mcp add contextpack -- <python> -m contextpack.mcp_server` (if the `claude` CLI is on your PATH)

---

## How it works

```
┌────────┐      ┌──────────────────────────────────────────────────────────┐      ┌──────────────┐
│        │      │                       ContextPack                         │      │              │
│ Client │─────▶│  ContentRouter ─▶ Compressors ─▶ Codebook Negotiator      │─────▶│ Upstream LLM │
│ (SDK / │      │  (JSON/code/log/  (format-aware)  (negotiates [CP_n] with  │      │ (OpenAI /    │
│  Cursor│      │   stacktrace/                       the model)             │      │  Anthropic)  │
│  /any) │◀─────│   prose)        ─▶ Lazy Refs ─▶ Budget Optimizer          │◀─────│              │
│        │      │                                                           │      │              │
└────────┘      │   ◀── response decompress (symbols → original content)    │      └──────────────┘
                └──────────────────────────────────────────────────────────┘
```

**The negotiated codebook.** When a chunk of context recurs (or is large enough to be worth it), ContextPack injects a one-time system message establishing a mapping — `[CP_1] = <the full content>` — and asks the model to confirm it. Once the model acknowledges, every subsequent turn sends just `[CP_1]` instead of the full chunk. The mapping lives for the session, so the savings compound the longer the conversation runs. On the way back, any symbols in the response are expanded to their original content before the client sees them. Because the dictionary is *agreed with the model*, this is lossless — the model knows exactly what each symbol means.
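
The propose, acknowledge, substitute, and expand steps described above can be illustrated with a toy class. This is a simplified sketch under stated assumptions, not ContextPack's negotiator: `ToyCodebook` and its method names are hypothetical, matching is exact-string only, and the acknowledgement is simulated by a direct call rather than parsed from a model reply.

```python
class ToyCodebook:
    """Illustrative session codebook: maps [CP_n] symbols to full chunks."""

    def __init__(self) -> None:
        self.pending: dict[str, str] = {}    # proposed, not yet acknowledged
        self.confirmed: dict[str, str] = {}  # acknowledged by the model

    def propose(self, chunk: str) -> str:
        """Register a chunk and return the definition line to send to the model."""
        symbol = f"[CP_{len(self.pending) + len(self.confirmed) + 1}]"
        self.pending[symbol] = chunk
        return f"{symbol} = {chunk}"

    def acknowledge(self, symbol: str) -> None:
        """Mark a symbol as usable once the model has acknowledged it."""
        self.confirmed[symbol] = self.pending.pop(symbol)

    def compress(self, text: str) -> str:
        """Replace confirmed chunks with their symbols (outgoing messages)."""
        for symbol, chunk in self.confirmed.items():
            text = text.replace(chunk, symbol)
        return text

    def expand(self, text: str) -> str:
        """Replace symbols with their original chunks (incoming responses)."""
        for symbol, chunk in self.confirmed.items():
            text = text.replace(symbol, chunk)
        return text


spec = "POST /v1/login returns a JWT valid for 15 minutes"
book = ToyCodebook()
definition = book.propose(spec)  # sent once, as the mapping to confirm
book.acknowledge("[CP_1]")       # simulated model acknowledgement

outgoing = book.compress(f"Given {spec}, how do I refresh?")
print(outgoing)
print(book.expand(outgoing))
```

Keeping pending and confirmed entries separate reflects the point of the negotiation: a symbol is only substituted into outgoing text after the model has acknowledged its definition.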

---

## Configuration

Set these in `.env` (see `.env.example`) or as environment variables.

| Variable | Default | Description |
|---|---|---|
| `UPSTREAM_PROVIDER` | `anthropic` | `anthropic` or `openai` |
| `UPSTREAM_API_KEY` | `""` | Default upstream key (overridable per-request via BYOK) |
| `UPSTREAM_BASE_URL` | `https://api.anthropic.com` | Upstream API base URL |
| `PROXY_PORT` | `8000` | Port the proxy listens on |
| `PROXY_HOST` | `0.0.0.0` | Host the proxy binds to |
| `DB_PATH` | `~/.contextpack/contextpack.db` | SQLite store for sessions, codebooks, analytics |
| `CODEBOOK_MIN_FREQ` | `3` | Times a pattern must recur before it's a codebook candidate |
| `CODEBOOK_NEGOTIATE_AFTER` | `2` | Recurrences after which negotiation is triggered |
| `CODEBOOK_MAX_ENTRIES` | `50` | Max codebook entries per session |
| `CODEBOOK_MIN_TOKEN_SAVINGS` | `10` | Minimum net token savings for an entry to be worth it |
| `ENABLE_CROSS_SESSION` | `true` | Allow codebook reuse across sessions |
| `REF_THRESHOLD_TOKENS` | `100` | Token size above which content becomes a lazy reference |
| `ENABLE_LAZY_REFS` | `true` | Enable lazy reference loading |
| `ENABLE_SUMMARIZER` | `false` | Auto-summarize long content (off — costs upstream tokens) |
| `SUMMARIZE_THRESHOLD` | `500` | Token size above which auto-summarize kicks in |
| `TOKEN_BUDGET` | `8000` | Default token budget for the optimizer |
| `LOG_LEVEL` | `INFO` | Logging level |
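
To show how a few of these variables and their defaults fit together, here is a standard-library sketch of an environment loader. It is an assumption-laden illustration: the package depends on `pydantic-settings`, so its real loader likely looks different, and `SketchSettings`, `env_int`, and `env_bool` are hypothetical names. Only a subset of the table is covered.

```python
import os
from dataclasses import dataclass

_TRUE = {"1", "true", "yes", "on"}


def env_bool(name: str, default: bool) -> bool:
    """Read a boolean environment variable, falling back to the default when unset."""
    raw = os.environ.get(name)
    return default if raw is None else raw.strip().lower() in _TRUE


def env_int(name: str, default: int) -> int:
    """Read an integer environment variable, falling back to the default when unset."""
    raw = os.environ.get(name)
    return default if raw is None else int(raw)


@dataclass(frozen=True)
class SketchSettings:
    upstream_provider: str
    proxy_port: int
    codebook_max_entries: int
    ref_threshold_tokens: int
    enable_lazy_refs: bool
    token_budget: int

    @classmethod
    def from_env(cls) -> "SketchSettings":
        # Defaults mirror the table above.
        return cls(
            upstream_provider=os.environ.get("UPSTREAM_PROVIDER", "anthropic"),
            proxy_port=env_int("PROXY_PORT", 8000),
            codebook_max_entries=env_int("CODEBOOK_MAX_ENTRIES", 50),
            ref_threshold_tokens=env_int("REF_THRESHOLD_TOKENS", 100),
            enable_lazy_refs=env_bool("ENABLE_LAZY_REFS", True),
            token_budget=env_int("TOKEN_BUDGET", 8000),
        )


settings = SketchSettings.from_env()
print(settings)
```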

---

## Dashboard

While the proxy is running, open **`http://localhost:8000/dashboard`** for a live view of token savings, per-session compression ratios, and active codebooks. Append `?session_id=<id>` to focus on a single session.

---

## How it compares

| | Doing nothing | Generic compressor | **ContextPack** |
|---|---|---|---|
| Token savings on repeated context | 0% | partial | **41–60% (codebook)** |
| Lossless | n/a | no — drops bytes one-sidedly | **yes — model confirms the dictionary** |
| Content-aware (JSON/code/logs) | no | sometimes | **yes** |
| Lazy references for huge blobs | no | rarely | **yes** |
| Drop-in OpenAI-compatible proxy | n/a | varies | **yes** |
| Library / CLI / MCP server | n/a | varies | **all three** |
| Runtime footprint | none | often Rust/ML deps | **pure Python** |

The honest summary: generic compressors decide unilaterally what to throw away. ContextPack's **negotiated codebook** is the unique angle — it reaches an explicit agreement with the model about what each symbol means, which is why accuracy stays at 100% on the codebook benchmark while still saving up to 60% of tokens.

---

## Development / running evals

```bash
pip install -e .                  # core install
pip install -e ".[evals]"         # optional: pull datasets via Hugging Face instead of canonical URLs

# Tiered benchmark suite (datasets auto-download and cache under ~/.contextpack/eval_cache/)
python -m evals.suite --tier 1            # workload + codebook (fast)
python -m evals.suite --tier 2 --n 50     # + SQuAD context compression
python -m evals.suite --tier 3 --n 100    # full suite (SQuAD + GSM8K + TruthfulQA)
```

Results are printed as a rich table and written to `evals/RESULTS.md`.

---

## License

MIT © 2026 Surya Vulavala. See [LICENSE](LICENSE).
