Metadata-Version: 2.4
Name: burnless
Version: 0.3.0
Summary: Stop burning tokens on repeated context.
Author: Roberto Monticelli Wydra
License: MIT
License-File: LICENSE
Requires-Python: >=3.10
Requires-Dist: anthropic>=0.40
Requires-Dist: prompt-toolkit>=3.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.0
Description-Content-Type: text/markdown

# Burnless

**A maestro that orchestrates any LLM from any vendor. Multi-turn agent loops cost O(N²) — Burnless makes them O(N).**

Burnless is a vendor-agnostic orchestration layer for multi-agent workflows. You pick the model that **conducts** the orchestra (Maestro / Brain) — Claude, GPT, Gemini, Mistral, a local Llama, anything — and the models that **execute** each task (Workers). Tiers are roles, not vendors: `gold`/`silver`/`bronze`/`diamond` map to whatever CLI you put in `config.yaml`. Mix providers freely. Run encoder and decoder on a local Ollama model for zero marginal cost on the cheap stages.

On top of that independence, Burnless flips the cost curve. Every turn in a standalone agent loop replays the full conversation as input — token cost on turn `N` is proportional to `N`, so total cost across `N` turns is `Θ(N²)`. Burnless keeps only short capsules in history and shares a cached system prompt across Maestro and Workers. History stays linear; the persistent prefix is billed once per cache window instead of once per turn.

The asymmetry is mechanical, not heuristic. Any provider that charges per input token is subject to the same arithmetic — Anthropic, OpenAI, Google, Mistral, anyone. The reference numbers below use Anthropic's pricing because their cache read/write spread is published and the cheapest to verify (`$15/MTok` fresh input vs `$0.15/MTok` cache read — a 100× spread). The mechanism reproduces wherever a provider exposes prompt caching.

## Four things, in this order

1. **Independence.** Any model as Maestro. Any model as Worker. Switch providers in one line.
2. **User-enforced rules, not LLM goodwill.** You write the routing keywords, the per-tier `allowedTools`, and the cost budgets in `.burnless/config.yaml`. With `routing.hardcore_filter: true` (or `BURNLESS_HARDCORE=1`), the Maestro **cannot escape** to a higher tier than the keyword router resolved — no quiet upgrades to Opus for tasks the rules said belong to Haiku. `allowedTools` is enforced by the worker CLI itself, not hinted at in the prompt: when bronze ships with `Read,Bash`, it physically cannot `Edit`. Bypass requires an explicit `--force` from the human.
3. **Three compression layers.** Deterministic minifier (regex, zero cost), semantic encoder (small model, ~$0.001/turn), optional LLMLingua-2 (CPU-only, no API). Each layer is independent and additive.
4. **Math, not marketing.** 88% cheaper at turn 10 by arithmetic on the published pricing pages. Verify with `python bench/run.py --turns 8` and your own API key.

## The numbers

Reference run against `claude-opus-4-7` (8 turns, 23k-token system prompt, 600 output tok/turn, 80-char capsules). Reproduce the same shape with any provider that supports prompt caching:

| Turns | Standalone | Burnless | Savings |
|------:|-----------:|---------:|--------:|
|     2 |      $0.80 |    $0.14 |   82.7% |
|     5 |      $2.06 |    $0.29 |   86.1% |
|     8 |      $3.40 |    $0.44 |   87.2% |
|    10 |      $4.34 |    $0.54 |   87.6% |
|    20 |      $9.59 |    $1.07 |   88.9% |
|    50 |     $30.72 |    $2.83 |   90.8% |

![Burnless cost chart](docs/cost_chart.png)

## Design decisions

The 88% number is an outcome. These are the calls that produced it, in the order they were made.

**1. Treat the cost curve as math, not engineering.** Multi-turn agents replay full history every turn. Tokens billed across `N` turns sum to `Θ(N²)` — that is arithmetic on the pricing page, not a property of any SDK. Once the problem is stated as O(N²), the only useful question is what to truncate. Everything else follows.

**2. Brain stores capsules, not transcripts.** The Brain's conversation history holds ~80-char single-line summaries of each prior turn, not the raw exchange. Full output stays on disk, read on demand. This is the single change that flips the curve to O(N) — every other layer compounds on top of an already-linear baseline.

**3. Shared prefix cache across models.** If two models from the same provider see a byte-identical system prompt with `cache_control` set, they hit the same prefix cache. Switching Opus → Sonnet mid-session does not invalidate it. Brain and Worker can be different models and still amortize the 23k-token system prompt at read price ($0.15/MTok) instead of write price ($15/MTok). The 100× spread is the lever.

**4. Tiers are roles, not models.** `gold`/`silver`/`bronze`/`diamond` map to commands in `config.yaml`, not to Opus/Sonnet/Haiku. Any model can be Brain. Any model can be Worker. GPT-4o as Brain delegating to Codex workers is a one-line config change. Hardcoding tier→model would have made the orchestration layer a single-vendor wrapper instead of a pattern.

**5. Determinism before LLMs.** Layer 1 of the compression stack is pure regex plus a glossary — no model call, zero latency, zero cost. Filler phrases, normalized whitespace, abbreviations applied before the encoder ever sees the text. A semantic compressor running on already-clean input is cheaper and more stable than one fighting prose. Cheap stages run first for a reason.

**6. Pluggable encoder.** The semantic compression layer accepts Haiku, LLMLingua-2 (Microsoft Research, XLM-RoBERTa, CPU-only), or any local model — selected per-invocation via `--encoder`. Coupling the architecture to one specific compressor would have tied savings to one model's pricing. Keeping each layer swappable means the stack improves automatically as local models get better, with no API change.

**7. The benchmark is the proof.** `bench/run.py` is short, dependency-light, hits the Anthropic SDK directly with no mocks, and writes raw `response.usage` to JSON. Anyone can rerun it, contest the numbers, and open an issue with their own results file. We did not write a marketing page about savings; we wrote a script that produces them and invited disagreement. That is the only honest way to publish a cost claim.

## Install

```bash
git clone https://github.com/rudekwydra/burnless.git
cd burnless && pip install -e .
burnless setup               # one-time, detects local agents and keys
burnless init                # inside any project directory
```

Python 3.10+. Tiers map to whatever models you configure — mix providers freely.

PyPI release coming soon. Until then, install from source.

## Any model. Any role. Full control.

Tiers are **roles**, not models. You decide what runs each role — and any model can be the Brain.

```yaml
# .burnless/config.yaml — example: GPT-4o as Brain, Sonnet as executor, Codex for code
agents:
  gold:    { command: "openai api chat.completions.create -m gpt-4o" }
  silver:  { command: "claude --model claude-sonnet-4-6 -p --allowedTools Read,Edit,Write,Bash" }
  bronze:  { command: "claude --model claude-haiku-4-5 -p --allowedTools Read,Bash" }
  diamond: { command: "codex exec --sandbox workspace-write" }
```

Or flip it — Sonnet as Brain delegating to Codex workers:

```yaml
agents:
  gold:    { command: "claude --model claude-sonnet-4-6 -p" }   # Brain
  diamond: { command: "codex exec --sandbox workspace-write" }   # code execution
  bronze:  { command: "ollama run llama3" }                      # local model, cheap tasks
```

Each tier gets its own `allowedTools`, routing keywords, and cost budget. The routing layer reads your task description and picks the right tier automatically — or you override it explicitly.

The O(N²) → O(N) math applies to any provider that charges per input token. Burnless is the **orchestration and caching layer**, not a wrapper for one API.

Taking it further: the encoder and decoder — the models that compress user messages into capsules and expand capsules back into natural language — can run on a **local model at zero marginal cost**:

```yaml
agents:
  bronze: { command: "ollama run llama3.2" }   # capsule encoder/decoder — $0
  silver: { command: "claude --model claude-haiku-4-5 -p" }
  gold:   { command: "claude --model claude-sonnet-4-6 -p" }   # Brain
  diamond: { command: "codex exec --sandbox workspace-write" }
```

As local models improve, more tiers move to zero cost. The expensive models (Opus, GPT-4o) handle only what requires genuine reasoning — and they do it with a cached prefix and a linear history.

## Three compression layers

Each layer is independent and additive:

| Layer | What it does | Cost | When it fires |
|-------|-------------|------|--------------|
| **1. Deterministic minifier** | Strips filler phrases, normalizes whitespace, applies glossary abbreviations | Zero — pure regex | Every turn, before encoder |
| **2. Semantic encoder** | A small model (Haiku, GPT-4o-mini, local Llama) compresses prose into structured capsule format | ~$0.001/turn | Every turn |
| **3. LLMLingua-2 (optional)** | XLM-RoBERTa token classification (Microsoft Research, GPT-4 distilled) | Local CPU, no API | Long inputs, `--encoder llmlingua2` |

The 88% cost reduction in the benchmark comes primarily from Layer 3 of the *architecture* — shared prefix cache + linear capsule history. Layers 1 and 2 compound on top of that.

## How it works

**Brain.** A thin orchestrator — any model you configure — that holds the plan, decides what to delegate, and reasons over results. Its conversation history contains only capsules — single-line summaries of past turns, ~80 characters each.

**Worker.** A delegated execution (any tier, any provider — local Ollama, Codex, Claude, GPT, Gemini) that receives one task, the cached system prompt, and the relevant capsules. It runs, returns a compact result, and exits. Raw output is written to `.burnless/logs/dNNN.log`, never replayed into the Brain.

**Capsule.** The compact handoff between turns. The Brain reads the capsule; the full log stays on disk and is read on demand. This is what flips the cost curve from quadratic to linear.

**Shared cache.** Brain and Worker use a byte-identical system prompt with the provider's prompt-caching directive (e.g. Anthropic's `cache_control: {"type": "ephemeral", "ttl": "1h"}`). Both sides hit the same prefix cache, so the persistent context is paid for at write price once per cache window and read price afterwards.

## Benchmark

The benchmark in `bench/run.py` is the source of truth for the table above. Three scenarios run through a real provider SDK directly with no mocks; costs come from `response.usage` exactly. The reference run uses Anthropic because their cache pricing is published and easiest to reproduce — adapters for OpenAI and Gemini are tracked in the issues.

- **A** — standalone, no cache, full history each turn
- **B** — standalone, system prompt cached, full history each turn
- **C** — Burnless Maestro: cached system prompt + capsule history

Reproduce the math without an API key:

```bash
python bench/run.py --project 50
```

Reproduce empirically (real API calls, ~$5 for 8 turns):

```bash
ANTHROPIC_API_KEY=sk-ant-... python bench/run.py --turns 8
```

Raw results land in `bench/results/run_<timestamp>.json` for inspection.

## CLI

```bash
burnless                     # interactive shell (Brain)
burnless plan "<objective>"  # write a plan to .burnless/maestro.md
burnless delegate "<task>"   # create a delegation, route to a tier
burnless run d001            # execute it (worker streams to live panel)
burnless status              # current plan + open delegations
burnless metrics             # token counter + audit ledger
```

State lives entirely under `.burnless/` in your project. No hosted backend.

## Contributing

Issues, PRs, and benchmark contestation are all welcome. The benchmark script is intentionally short and dependency-light so you can read it end-to-end and disagree with concrete numbers. If your workload produces a different ratio, open an issue with the JSON from `bench/results/` — that is exactly the conversation worth having.

## Status — what works today, what's roadmap

The architecture is provider-agnostic by design. Current implementation status:

- ✅ **Workers**: shell out to **any CLI** (`claude`, `codex`, `openai`, `gemini`, `ollama`, anything). Configure per tier in `config.yaml`. Works today.
- ✅ **Routing, capsules, exec_log, three compression layers, shared system prompt**: provider-neutral, work today.
- ✅ **Reference benchmark**: uses Anthropic SDK because their cache pricing is published and easiest to reproduce. The math reproduces wherever a provider exposes prompt caching.
- ⚠️ **`burnless brain` interactive command**: uses the Anthropic SDK in-process today. OpenAI, Gemini, and OpenRouter adapters are tracked in v0.4. If you want to skip the in-process Brain, `burnless delegate` + `burnless run` already cover the full Brain→Worker loop using whatever CLI you configured.
- ⚠️ **PyPI release**: `pip install burnless` is not live yet. Install from source today; PyPI in next release.

Honest about gaps. PRs welcome — especially for the OpenAI/Gemini Brain adapter.

## License

MIT. See `LICENSE`.

---

Repo: `github.com/rudekwydra/burnless`
