Metadata-Version: 2.4
Name: tokenai
Version: 0.1.0
Summary: LLM context management — token counting, rolling summarization, dollar savings reports
Project-URL: Homepage, https://github.com/Ksumanth-hub/tokenai
Project-URL: Repository, https://github.com/Ksumanth-hub/tokenai
License: MIT License
        
        Copyright (c) 2026 Sumanth Koppina
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Keywords: anthropic,context,llm,openai,summarization,tiktoken,tokens
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Requires-Dist: anthropic>=0.100.0
Requires-Dist: tiktoken>=0.7.0
Description-Content-Type: text/markdown

# ctxmgr

LLM context management for Python — token counting, rolling summarization, and dollar savings reports.

LLM APIs charge per token. A 20-turn support conversation can exceed 800 tokens of history before the user even asks their real question. `ctxmgr` compresses that history with Claude Haiku (cheap and fast), so you send fewer tokens to your expensive model, and it reports an estimated dollar amount saved per call.

```
pip install tokenai
```

---

## Quick start — 2 lines

```python
from ctxmgr import compress

result = compress(messages, max_tokens=4000, model="claude-sonnet-4-6")
print(f"Saved {result.saved_tokens} tokens — ${result.estimated_savings_usd:.4f} per call")
```

---

## Before / After — real numbers

The numbers below come from `tests/test_20turn_review.py`, a 20-turn conversation about token economics (871 tokens). With a 600-token budget, ctxmgr compresses it end to end via a live Claude Haiku call:

```
ORIGINAL  : 31 messages, 871 tokens
COMPRESSED:  9 messages, 544 tokens
BUDGET    : 600 tokens
REDUCTION : 327 tokens (37.5%)
```

The compressed result keeps:
- the **system prompt** (assistant persona, always pinned)
- a **single summary message** covering the 14 oldest turns
- the **last 3 user/assistant pairs verbatim** (most recent context, always pinned)

Savings per call at different model tiers:

| Model | Tokens saved | $/call saved |
|---|---|---|
| claude-haiku-4-5 | 327 | $0.000327 |
| claude-sonnet-4-6 | 327 | $0.000981 |
| claude-opus-4-8 | 327 | $0.001635 |

At 10,000 calls/day on Sonnet 4.6 that is **~$9.81/day, or ~$294/month** (30 days) saved.
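
The savings figures are plain arithmetic: saved tokens times the model's per-token input price, scaled by call volume. A minimal sketch, assuming the per-million-token input prices implied by the table above ($1, $3 and $5) and a 30-day month; `PRICE_PER_MTOK`, `savings_per_call` and `savings_per_month` are illustrative names, not part of ctxmgr:

```python
# Input prices in USD per million tokens, as implied by the table above
# (assumption: these match the target models' list prices).
PRICE_PER_MTOK = {
    "claude-haiku-4-5": 1.00,
    "claude-sonnet-4-6": 3.00,
    "claude-opus-4-8": 5.00,
}


def savings_per_call(saved_tokens: int, model: str) -> float:
    """Dollar savings for one call: saved tokens x per-token input price."""
    return saved_tokens * PRICE_PER_MTOK[model] / 1_000_000


def savings_per_month(saved_tokens: int, model: str,
                      calls_per_day: int, days: int = 30) -> float:
    """Scale the per-call savings to a month of traffic (30 days assumed)."""
    return savings_per_call(saved_tokens, model) * calls_per_day * days


per_call = savings_per_call(327, "claude-sonnet-4-6")
monthly = savings_per_month(327, "claude-sonnet-4-6", calls_per_day=10_000)
print(f"${per_call:.6f}/call, ${monthly:,.2f}/month")
```

Running it prints the per-call figure from the Sonnet row and the corresponding 30-day total.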

---

## Three conversation types

Benchmarked on realistic fixtures with medium aggressiveness (3 pinned pairs):

| Conversation type | Original (tokens) | Compressed (tokens) | Reduction | $/call saved (Sonnet 4.6) |
|---|---|---|---|---|
| Support chat | 426 | 237 | 44.4% | $0.000567 |
| Coding assistant | 726 | 515 | 29.1% | $0.000633 |
| RAG Q&A | 541 | 251 | 53.6% | $0.000870 |

---

## API

### `compress(messages, max_tokens, model, aggressiveness)`

```python
from ctxmgr import compress, CompressionResult

result: CompressionResult = compress(
    messages,                      # list of {"role": ..., "content": ...}
    max_tokens=4000,               # token budget for the result
    model="claude-sonnet-4-6",     # used only to calculate dollar savings
    aggressiveness="medium",       # "light" | "medium" | "aggressive"
)
```

**`CompressionResult` fields:**

| Field | Type | Description |
|---|---|---|
| `messages` | `list[dict]` | Compressed conversation |
| `original_tokens` | `int` | Token count before compression |
| `compressed_tokens` | `int` | Token count after compression |
| `saved_tokens` | `int` | `original_tokens - compressed_tokens` |
| `ratio` | `float` | `compressed_tokens / original_tokens` (lower = more compression) |
| `estimated_savings_usd` | `float` | `saved_tokens ×` the target model's per-token input price |
| `aggressiveness` | `str` | Level used |
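
The derived fields are simple functions of the two token counts. The dataclass below is a hypothetical sketch of those relationships, not ctxmgr's actual `CompressionResult`; passing the price explicitly as `price_per_token` and returning `ratio = 1.0` for an empty history are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class CompressionResultSketch:
    """Hypothetical stand-in for CompressionResult, showing how fields relate."""
    messages: list[dict]
    original_tokens: int
    compressed_tokens: int
    price_per_token: float            # USD per input token of the target model (assumed input)
    aggressiveness: str = "medium"
    saved_tokens: int = field(init=False)
    ratio: float = field(init=False)
    estimated_savings_usd: float = field(init=False)

    def __post_init__(self) -> None:
        self.saved_tokens = self.original_tokens - self.compressed_tokens
        # Avoid dividing by zero for an empty history (assumed behaviour).
        self.ratio = (self.compressed_tokens / self.original_tokens
                      if self.original_tokens else 1.0)
        self.estimated_savings_usd = self.saved_tokens * self.price_per_token


r = CompressionResultSketch(messages=[], original_tokens=871,
                            compressed_tokens=544, price_per_token=3e-6)
print(r.saved_tokens, round(r.ratio, 3), f"{r.estimated_savings_usd:.6f}")
```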

**Aggressiveness levels** — controls how many recent user/assistant pairs are pinned (never summarized):

| Level | Pinned pairs | Use when |
|---|---|---|
| `"light"` | 5 pairs | Long coding sessions, high coherence needed |
| `"medium"` | 3 pairs | General-purpose (default) |
| `"aggressive"` | 1 pair | Support chats, RAG lookups, cost is critical |
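
To make the pinning rule concrete, here is a sketch of splitting a history into the pinned system prompt, the older turns that would be summarized, and the pinned tail. It assumes leading `system` messages are pinned and that a "pair" starts at a user message and includes whatever follows it; `PINNED_PAIRS` and `split_history` are hypothetical names, not ctxmgr internals.

```python
# Pinned pairs per level, taken from the table above.
PINNED_PAIRS = {"light": 5, "medium": 3, "aggressive": 1}


def split_history(messages: list[dict], aggressiveness: str = "medium"
                  ) -> tuple[list[dict], list[dict], list[dict]]:
    """Split into (system, old, tail); `old` is what would be summarized."""
    n_pairs = PINNED_PAIRS[aggressiveness]
    # Leading system messages are always pinned.
    i = 0
    while i < len(messages) and messages[i].get("role") == "system":
        i += 1
    system, rest = messages[:i], messages[i:]
    # Walk backwards until n_pairs user turns have been seen; the tail starts
    # at that user message and includes every message after it.
    cut = len(rest)
    users_seen = 0
    for j in range(len(rest) - 1, -1, -1):
        if rest[j].get("role") == "user":
            users_seen += 1
            cut = j
            if users_seen == n_pairs:
                break
    if users_seen < n_pairs:
        cut = 0  # fewer turns than pinned pairs: keep everything, summarize nothing
    return system, rest[:cut], rest[cut:]


history = [{"role": "system", "content": "You are helpful."}]
for k in range(6):
    history.append({"role": "user", "content": f"question {k}"})
    history.append({"role": "assistant", "content": f"answer {k}"})

system, old, tail = split_history(history, "medium")
print(len(system), len(old), len(tail))  # 1 6 6
```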

### `TokenCounter`

```python
from ctxmgr import TokenCounter

counter = TokenCounter("claude-sonnet-4-6")
print(counter.count("Hello, world!"))        # 4
print(counter.count_messages(messages))      # full conversation estimate
```

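Supported models: all `claude-*` and `gpt-*` variants. Unknown models fall back to `cl100k_base`. Because tiktoken does not include Claude's own tokenizer, counts for `claude-*` models are approximations.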

Accepts both plain-string content and list-of-blocks format (OpenAI tool calls, Anthropic multi-modal).

### `RollingSummarizer`

Lower-level class if you need more control:

```python
from ctxmgr import RollingSummarizer

summarizer = RollingSummarizer(
    model="claude-haiku-4-5-20251001",  # summarization model
    token_budget=4000,
    pin_last_pairs=3,
)
compressed = summarizer.compress(messages)
```
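
The summarization step itself is a single call to a cheap model. A minimal sketch using the Anthropic Python SDK's `client.messages.create`; the prompt wording, the `max_tokens=600` headroom, the `role: text` transcript format, the role of the returned summary message, and the helper names `build_transcript` and `summarize_old_turns` are all assumptions, not what `RollingSummarizer` sends internally.

```python
def build_transcript(old_turns: list[dict]) -> str:
    """Render old turns as plain 'role: text' lines for the summarizer."""
    lines = []
    for m in old_turns:
        content = m.get("content") or ""
        if not isinstance(content, str):
            # Keep only text blocks; non-text blocks are skipped in this sketch.
            content = " ".join(b.get("text", "") for b in content
                               if b.get("type") == "text")
        lines.append(f"{m['role']}: {content}")
    return "\n".join(lines)


def summarize_old_turns(old_turns: list[dict],
                        model: str = "claude-haiku-4-5-20251001") -> dict:
    """Ask a cheap model for a short summary; needs ANTHROPIC_API_KEY."""
    import anthropic  # imported lazily so build_transcript works without the SDK

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model=model,
        max_tokens=600,  # assumed headroom for a summary of under 300 words
        system="Summarize this conversation in under 300 words. "
               "Keep facts, decisions, and open questions.",
        messages=[{"role": "user", "content": build_transcript(old_turns)}],
    )
    summary = response.content[0].text
    # One message replaces all summarized turns (the "user" role is an assumption).
    return {"role": "user", "content": f"Summary of earlier conversation:\n{summary}"}
```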

---

## Message format support

Both **Anthropic** and **OpenAI** message formats work:

```python
# Plain strings (both APIs)
{"role": "user", "content": "What is a token?"}

# OpenAI list-of-blocks (vision, tool calls)
{"role": "user", "content": [
    {"type": "text", "text": "Describe this image."},
    {"type": "image_url", "image_url": {"url": "https://..."}},
]}

# Anthropic list-of-blocks (tool use)
{"role": "assistant", "content": [
    {"type": "text", "text": "I'll look that up."},
    {"type": "tool_use", "id": "tu_01", "name": "search", "input": {"q": "tokens"}},
]}
```

Images and tool-use blocks are counted as short placeholders (`[image]`, `[tool:name]`) so token estimates stay meaningful.
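
A hypothetical sketch of that normalization: every content shape is reduced to plain text before counting, with `None` treated as an empty string. The `[image]` and `[tool:name]` placeholders follow the text above; joining blocks with spaces and ignoring other block types are assumptions.

```python
def flatten_content(content) -> str:
    """Reduce any supported content shape to plain text for token counting."""
    if content is None:
        return ""                          # content=None counts as an empty string
    if isinstance(content, str):
        return content                     # plain-string content (both APIs)
    parts = []
    for block in content:                  # list-of-blocks (OpenAI or Anthropic)
        kind = block.get("type")
        if kind == "text":
            parts.append(block.get("text", ""))
        elif kind in ("image", "image_url"):
            parts.append("[image]")        # short placeholder instead of image data
        elif kind == "tool_use":
            parts.append(f"[tool:{block.get('name', '?')}]")
        # Other block types are ignored in this sketch (assumption).
    return " ".join(parts)


openai_msg = [
    {"type": "text", "text": "Describe this image."},
    {"type": "image_url", "image_url": {"url": "https://..."}},
]
print(flatten_content(openai_msg))  # Describe this image. [image]
```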

---

## Edge cases

| Scenario | Behaviour |
|---|---|
| Empty history `[]` | Returns `[]`, `saved_tokens=0` |
| Single-turn (no assistant reply) | Returns unchanged — nothing to summarize |
| Single message larger than budget | Returns unchanged — cannot split a single message |
| Already under budget | Returns unchanged, no API call made |
| `content=None` | Treated as empty string |

---

## How it works

1. **Count** — `TokenCounter` uses tiktoken (`cl100k_base` for Claude, `o200k_base` for GPT-4o) to estimate the token count of the full conversation.
2. **Split** — the system prompt and last N user/assistant pairs are pinned. Everything older is passed to the summarizer.
3. **Summarize** — Claude Haiku receives the old turns and returns a single summary message in under 300 words.
4. **Reassemble** — `[system prompt] + [summary] + [pinned tail]` replaces the original history.
5. **Report** — `CompressionResult` returns the before/after token counts (tiktoken estimates) and the estimated dollar savings at the target model's input price.
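
The five steps, together with the edge cases listed earlier, can be sketched end to end. This is an illustration under stated assumptions, not ctxmgr's implementation: `word_count` stands in for tiktoken and `stub_summarizer` stands in for the Haiku call so the sketch runs offline, and `compress_sketch` is a hypothetical name.

```python
from typing import Callable

Message = dict
Counter = Callable[[list[Message]], int]
Summarizer = Callable[[list[Message]], Message]


def word_count(messages: list[Message]) -> int:
    """Stand-in token counter: whitespace-separated words (assumption)."""
    return sum(len(str(m.get("content") or "").split()) for m in messages)


def stub_summarizer(old: list[Message]) -> Message:
    """Stand-in for the Haiku call: just notes how many messages were folded."""
    return {"role": "user", "content": f"[summary of {len(old)} earlier messages]"}


def compress_sketch(messages: list[Message], max_tokens: int, pin_pairs: int = 3,
                    count: Counter = word_count,
                    summarize: Summarizer = stub_summarizer) -> list[Message]:
    # 1. Count: under budget (including an empty history) -> unchanged, no API call.
    if count(messages) <= max_tokens:
        return messages
    # 2. Split: leading system messages and the last `pin_pairs` (>= 1) user turns are pinned.
    i = 0
    while i < len(messages) and messages[i].get("role") == "system":
        i += 1
    system, rest = messages[:i], messages[i:]
    user_idx = [j for j, m in enumerate(rest) if m.get("role") == "user"]
    cut = user_idx[-pin_pairs] if len(user_idx) >= pin_pairs else 0
    old, tail = rest[:cut], rest[cut:]
    # Nothing older than the pinned tail (single turn, one oversized message): unchanged.
    if not old:
        return messages
    # 3. Summarize the old turns into one message, then 4. reassemble.
    return system + [summarize(old)] + tail


history = [{"role": "system", "content": "You are a support agent."}]
for k in range(8):
    history.append({"role": "user", "content": f"question number {k} about billing"})
    history.append({"role": "assistant", "content": f"answer number {k} with details"})

out = compress_sketch(history, max_tokens=50)
# 5. Report: before/after counts under the stand-in counter.
print(len(history), "->", len(out), "messages;",
      word_count(history) - word_count(out), "words saved")
```

Replacing `count` and `summarize` with a tiktoken-based counter and a real model call is the intended extension point of this sketch; note that it does not re-check the budget after summarizing.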

---

## Requirements

- Python 3.10+
- `anthropic >= 0.100.0`
- `tiktoken >= 0.7.0`
- `ANTHROPIC_API_KEY` env variable (used only when compression actually runs; token counting is fully local)

---

## License

MIT
