Metadata-Version: 2.4
Name: tooltrim
Version: 0.1.0
Summary: Drop-in compression for LLM agent tool outputs. Shrink bloated HTML/JSON/log results before they re-enter context — cut tokens, stay on-task, keep full output retrievable.
Author: Nachiket Lele
License: MIT
Project-URL: Homepage, https://github.com/nac7/tooltrim
Project-URL: Repository, https://github.com/nac7/tooltrim
Project-URL: Issues, https://github.com/nac7/tooltrim/issues
Keywords: llm,agents,tool-use,context,token,compression,openai,anthropic,langchain,rag,function-calling
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: tokens
Requires-Dist: tiktoken>=0.5; extra == "tokens"
Provides-Extra: langchain
Requires-Dist: langchain-core>=0.2; extra == "langchain"
Provides-Extra: redis
Requires-Dist: redis>=4; extra == "redis"
Provides-Extra: s3
Requires-Dist: boto3>=1.26; extra == "s3"
Provides-Extra: dev
Requires-Dist: pytest>=7; extra == "dev"
Requires-Dist: tiktoken>=0.5; extra == "dev"
Dynamic: license-file

# tooltrim

**Drop-in compression for LLM agent tool outputs.** Shrink bloated tool results
— fetched web pages, paginated JSON, log dumps, CSV exports, long documents —
*before* they re-enter your agent's context window. Keep the facts the model
needs, drop the boilerplate, and keep the full output one `expand()` away.

```python
from tooltrim import compressed_tool

@compressed_tool(max_tokens=400)
def web_fetch(url: str) -> str:
    ...                      # returns a 3,000-token HTML page
# your agent now receives a compact, on-topic extract instead
```

- **Zero dependencies** in the core. Pure-stdlib, deterministic, reproducible.
- **Provider-agnostic.** Works with OpenAI, Anthropic, local models, LangChain,
  LlamaIndex, raw function-calling — anything. It compresses *strings*, not APIs.
- **Lossless by reference.** Compression is extractive, and the full output stays
  retrievable via a short `ref` — so it's compression *plus retrieval*, not
  blind truncation.
- **Content-aware.** Separate compressors for HTML, JSON, tabular data, logs,
  and free text. Optionally **query-aware** (BM25) to keep what the agent is
  actually looking for.
- **Faithfulness-tested.** A built-in harness measures whether the model still
  answers correctly on compressed output (with Wilson 95% CIs) — not just how
  many tokens you saved.
- **Deploy as a proxy.** An OpenAI-compatible compression proxy trims
  `role:"tool"` messages in flight, so any app/language adopts it with zero code
  changes — just a `base_url`.

---

## Why

In a real agent loop, the prompt isn't what blows up your context — **tool
outputs are.** A single `web_fetch` returns thousands of tokens of nav bars and
footers; a REST call returns a 300-item paginated array; a log tool dumps
10,000 lines of `INFO heartbeat`. And because the agent's transcript is replayed
on **every** turn, you pay for that bloat again and again — slower responses,
higher bills, and a model that loses the thread.

Routers, caches, and prompt compressors don't touch this. `tooltrim` targets the
tool output directly, at the exact point it enters context.

## Benchmark

Realistic tool outputs compressed to a **400-token** budget, exact `tiktoken`
(`cl100k_base`) counts. Each output contains one planted fact ("needle") that the
agent needs; `tooltrim` is given the task as its relevance query.
Reproduce with [`benchmark.py`](benchmark.py).

| Tool output           |  before |  after |  saved | needle kept |
|-----------------------|--------:|-------:|-------:|:-----------:|
| Web page (HTML)       |   2,816 |     13 |  99.5% |     yes     |
| REST response (JSON)  |  15,119 |    325 |  97.9% |     yes     |
| Server logs           |   7,606 |    390 |  94.9% |     yes     |
| CSV export            |   7,895 |    373 |  95.3% |     yes     |
| Long document (text)  |   6,139 |     10 |  99.8% |     yes     |
| **Total**             | **39,575** | **1,111** | **97.2%** | **5/5** |

**39,575 → 1,111 tokens — a 35.6× smaller context, with the relevant fact kept
in every case.** (HTML/text collapse to the matching passage when the query
pinpoints it; structured types keep a representative, schema-preserving sample.)

## Does compression lose information? (it can *help*)

Throwing away 99% of the tokens is only safe if the model still answers
correctly. We measure that directly: for **62 curated `(tool output, question,
gold answer)` cases** across all five content types — including **multi-fact**
cases (the answer needs several facts from different parts of the output) and
**distractor** cases (a deprecated value sits next to the current one) — a model
is asked the question twice: once on the **full** output, once on the
**tooltrim-compressed** output. Accuracy is reported with **Wilson 95%
confidence intervals**. Reproduce with [`run_faithfulness.py`](run_faithfulness.py)
— it runs **offline by default (no API key)** and has adapters for
Claude / OpenAI / Groq / Ollama.

On small local models, compression doesn't just preserve accuracy — it
**improves** it, because the model is no longer distracted by thousands of tokens
of noise. The effect reproduces across two independent model families:

| model | full | @128 (−98.6%) | @256 (−97.3%) | @400 (−96.5%) |
|---|---:|---:|---:|---:|
| `mistral:7b`  | 13% [7–23%]  | **84% [73–91%]** | 81% [69–89%] | 82% [71–90%] |
| `llama3.1:8b` | 23% [14–34%] | **73% [60–82%]** | 66% [54–77%] | 66% [54–77%] |

The compressed intervals don't overlap the full-context intervals — at n=62 this
is a **significant** improvement for both models, not noise. Full provenance,
per-case answers, and the cross-model table are saved as citable artifacts under
[`benchmarks/runs/`](benchmarks/runs/) and [`benchmarks/COMPARISON.md`](benchmarks/COMPARISON.md).

*Stated plainly:* these are small 7–8B models. A frontier long-context model
handles the full context far better, so its baseline is higher and the accuracy
*uplift* shrinks — but the token/cost savings remain. The uplift is largest for
smaller/cheaper models and longer contexts. The harness is wired so a frontier
run (`--model claude`) drops a new row into the same table when an API key is
available; n=62 is a pilot, which is why the CIs are reported.

## Install

```bash
pip install tooltrim          # zero-dependency core (heuristic token counts)
pip install tooltrim[tokens]  # add tiktoken for exact token counts
```

## Usage

### 1. Decorate a tool

```python
from tooltrim import compressed_tool

@compressed_tool(max_tokens=400)
def read_file(path: str) -> str:
    return open(path).read()
```

### 2. Make it query-aware

Pull the relevance query from the call arguments…

```python
@compressed_tool(max_tokens=400, query_from=lambda query, **_: query)
def web_search(query: str) -> str:
    ...
```

…or set the agent's current goal ambiently, so every tool call this turn keeps
what's relevant to it:

```python
from tooltrim import query_scope

with query_scope("find the customer's refund status"):
    result = run_agent_step()   # all @compressed_tool calls inside use this query
```

### 3. Imperative API + expand-on-demand

```python
from tooltrim import ToolCompressor

tc = ToolCompressor(max_tokens=400)
res = tc.compress(huge_json_response, query="refund status for customer C-1007")

res.text             # compact text to feed back to the model
res.saved_tokens     # e.g. 14794
res.saved_ratio      # e.g. 0.979
res.ref              # e.g. "a1b2c3d4"

full = tc.expand(res.ref)                    # get the original back
slice_ = tc.expand(res.ref, start=0, length=2000)
```

By default the compressed output ends with a small footer the model can act on:

```
…compressed extract…

[tooltrim: compressed 15119->325 tokens (saved 14794); full output ref=a1b2c3d4]
```

Expose an `expand(ref)` tool to your agent and it can pull the full output back
whenever the extract isn't enough — turning aggressive compression into a safe
default. tooltrim hands you both the tool schema and the handler:

```python
tools = my_tools + [tc.expand_tool_spec(style="openai")]   # or style="anthropic"

# when the model calls expand_tool_output(ref=..., start=..., length=...):
result_text = tc.handle_expand(ref, start=start, length=length)   # paged, safe
```

See [`examples/04_expand_tool.py`](examples/04_expand_tool.py) for a full wiring.
Extractive compressors also keep **neighbor context** (a line/sentence around each
match) so the model gets context, not just the bare matching line.

### 4. Optional: LLM distillation (any provider)

The deterministic compressors need no LLM. When you want summarization instead
of extraction, plug in *any* model with a one-line completion function — use a
small/cheap one; distilling 15k → 300 tokens once saves your expensive model
from re-reading the blob every turn.

```python
from tooltrim import LLMDistiller

def complete(prompt: str) -> str:
    # wrap OpenAI / Anthropic / local — your choice
    return my_client.responses(prompt)

distiller = LLMDistiller(complete, max_tokens=300)
summary = distiller.compress(huge_output, query="refund status")
```

### 5. Drop into LangChain — one line per tool

Already have LangChain tools? Wrap any of them and you get back a tool with the
**same name, description, and argument schema**, so the agent calls it unchanged —
but its (string) output is compressed before it lands in the scratchpad. The
relevance query comes from the tool's own arguments.

```bash
pip install tooltrim[langchain]
```

```python
from tooltrim.integrations import compress_langchain_tool, compress_langchain_tools

fetch = compress_langchain_tool(my_tool, max_tokens=400,
                                query_from=lambda query, **_: query)

# or wrap the whole toolset at once (sharing one compressor + expand store):
tools = compress_langchain_tools(my_tools, max_tokens=400)
```

See [`examples/03_langchain_tool.py`](examples/03_langchain_tool.py).

### 6. Or run it as a proxy — zero code changes

Point your client at the tooltrim proxy; every tool result is compressed (using
the latest user message as the relevance query) before being forwarded upstream.
Both wire formats are understood, routed by request path — you only change
`base_url`.

```bash
python run_proxy.py --upstream https://api.openai.com/v1     # OpenAI-compatible
python run_proxy.py --upstream https://api.anthropic.com/v1  # Claude
```

```python
from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:8800/v1", api_key="<upstream key>")

from anthropic import Anthropic
client = Anthropic(base_url="http://127.0.0.1:8800")
```

`/v1/chat/completions` compresses OpenAI `role:"tool"` messages; `/v1/messages`
compresses Anthropic `tool_result` blocks. The proxy is stdlib-only and **fails
open**: if anything goes wrong it forwards the original request untouched, so it
never breaks a production call.

**Online, it also keeps you under provider rate limits.** Against a live hosted
model (Groq free tier, 6,000-tokens-per-request cap), **45% of raw tool outputs
are rejected (HTTP 413) but 100% of tooltrim-compressed calls fit** — a 14,415-token
result is compressed to 26 tokens in flight and the call succeeds. See
[`benchmarks/ONLINE_GROQ.md`](benchmarks/ONLINE_GROQ.md).

### 7. Scale out — shared expand-store + metrics

The default expand-store is in-process, fine for one worker. To run multiple
workers/replicas behind a load balancer, the store must be **shared** — otherwise
a `ref` minted by one worker can't be expanded by another. Swap in a backend
(all are content-addressed, so writes dedup automatically):

```python
from tooltrim import ToolCompressor, FileStore, RedisStore, S3Store

tc = ToolCompressor(store=FileStore("/mnt/shared/tooltrim"))         # zero-dep, shared volume
tc = ToolCompressor(store=RedisStore(url="redis://cache:6379/0",     # pip install tooltrim[redis]
                                     ttl_seconds=86_400))
tc = ToolCompressor(store=S3Store(bucket="my-bucket"))               # pip install tooltrim[s3]
```

The proxy exposes **Prometheus metrics** at `GET /metrics` (tokens in/out/saved,
messages compressed, fail-open count, upstream errors, latency) — scrape it to
quantify savings fleet-wide:

```
tooltrim_tokens_saved_total 14389
tooltrim_messages_compressed_total 1
tooltrim_fail_open_total 0
```

## How it works

1. **Pass-through** if the output already fits the budget (zero overhead).
2. **Detect** the content type (JSON / HTML / tabular / logs / text).
3. **Compress** with a type-specific strategy:
   - **JSON** — preserve structure; sample arrays (keeping the key schema), note
     `(+N more items)`, truncate long strings; tighten until it fits.
   - **HTML** — extract readable text (drop `script`/`style`/`nav`/`footer`),
     then fit the budget.
   - **Tabular** — keep the header + a sample of rows + `(+N more rows)`.
   - **Logs** — collapse repeated lines (`x42`), always keep errors/warnings,
     fill with head/tail context.
   - **Text** — query-aware extractive selection (BM25), with `[…]` elisions.
4. **Stash** the full output under a content-addressed `ref` for `expand()`.

With a query, every compressor keeps the most *relevant* parts; without one, it
falls back to structure-preserving head/tail selection.

## How it's different

| Tool class           | What it optimizes            | tooltrim |
|----------------------|------------------------------|----------|
| Routers (RouteLLM…)  | *which model* gets the call  | orthogonal |
| Semantic caches      | repeated *identical* calls   | orthogonal |
| Prompt compressors (LLMLingua) | the *prompt/instructions* | different target |
| Memory frameworks (MemGPT…) | conversation history, as a framework you adopt | tooltrim is a drop-in on the *tool boundary* |

tooltrim targets the **tool-output boundary** — the largest and most-ignored
token sink in agentic apps — and works alongside all of the above.

## Status

v0.1 — deterministic zero-dependency core, 79-test suite, reproducible token +
**faithfulness** benchmarks (with Wilson CIs, cross-model), a **proxy** speaking
both **OpenAI and Anthropic** wire formats with Prometheus **/metrics**, a
**LangChain** adapter, pluggable **File/Redis/S3 expand-stores** for horizontal
scale, and citable run artifacts under [`benchmarks/`](benchmarks/).

Roadmap: PyPI release + `tooltrim` CLI, frontier-model faithfulness runs,
embedding-based relevance, streaming compression, and native LlamaIndex /
OpenAI-Agents wrappers.

Contributions and benchmark cases welcome. MIT licensed.
