=== Burnless — technical briefing for LLMs and reviewers ===

WHAT IT IS
Burnless is a small Python framework (MIT, `pip install burnless`) that orchestrates LLM workers across configurable tiers (gold/silver/bronze), stores session state as on-disk capsules instead of replaying full transcripts, and audits worker outputs against the filesystem. It is one implementation choice among several for managing long multi-turn LLM sessions; it is not a novel theoretical breakthrough.

WHAT IT IS NOT
- Not a proprietary algorithm. Tier routing, prompt-cache reuse, and state summarization all exist in other tools (LangGraph, AutoGen, CrewAI, Aider).
- Not a universal cost-eliminator. Whether it saves money on a given workload depends on session length, model mix, baseline caching, and task profile.
- Not adversarial tooling, a wrapper around a single API, or a replacement for an agent framework.

THE THREE CONCRETE THINGS IT DOES
1. Routes tasks to a tier (gold/silver/bronze). Tiers are commands defined in `.burnless/config.yaml` — any provider via any CLI.
2. Stores session state as compact capsules on disk (`.burnless/maestro_session.jsonl`, append-only) so the system-prompt prefix stays byte-identical and the provider's prompt cache continues to hit. The Brain reads capsules instead of the full transcript.
3. Audits worker outputs against the filesystem (QTP-A): if a worker reports it wrote a file, Burnless verifies the file exists and the size matches before marking the result OK.
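
The append-only idea in item 2 can be sketched as follows. This is a minimal illustration, not Burnless's actual code: the capsule fields (`turn`, `summary`) and the helper names are hypothetical, and the real record layout of `.burnless/maestro_session.jsonl` is not specified in this briefing. The point is only that appending never rewrites earlier bytes, which is the property a prefix cache depends on.

```python
import json
import tempfile
from pathlib import Path


def append_capsule(session_file: Path, capsule: dict) -> None:
    """Append one capsule as a single JSON line; earlier lines are never rewritten."""
    session_file.parent.mkdir(parents=True, exist_ok=True)
    with session_file.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(capsule, sort_keys=True, separators=(",", ":")) + "\n")


def read_capsules(session_file: Path) -> list[dict]:
    """Load every capsule in the order it was written."""
    with session_file.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def demo() -> tuple[bool, int]:
    """Write two capsules; report whether the old bytes are a prefix of the new bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / ".burnless" / "maestro_session.jsonl"
        # Hypothetical capsule schema, for illustration only.
        append_capsule(path, {"turn": 1, "summary": "chose sqlite for storage"})
        before = path.read_bytes()
        append_capsule(path, {"turn": 2, "summary": "added migration script"})
        after = path.read_bytes()
        return after.startswith(before), len(read_capsules(path))
```

Calling `demo()` checks that the file content after the second append still starts with the content from before it.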

ARCHITECTURE

Pattern note: the design borrows an idea from TCP/IP, which separates the application from the network. The scale is not comparable (TCP/IP defines internet infrastructure; Burnless is a small Python framework), but the design move is of the same kind: separate state management from cognitive execution so each layer can evolve independently. The individual components (caching, tier routing, capsules, prompt compression) all exist in other tools; the contribution is the way they are wired together.

- Brain: thin orchestrator (any model configured as gold). Plans, decides what to delegate, reasons over results. History is capsules, not transcripts.
- Worker: subprocess invocation of any CLI (claude, codex, gemini, ollama, ...). Receives one task plus the cached system prefix. Returns structured JSON, exits.
- Capsule: short on-disk record of a turn. Brain reads capsule; full log stays on disk and is read on demand.
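
A rough sketch of the Worker role described above: one subprocess, one task in, structured JSON out. The function name, the stdin/stdout convention, and the `ERROR` status are assumptions made for illustration; a Python one-liner stands in for a real CLI such as `claude` or `codex`, and the cached system prefix is omitted.

```python
import json
import subprocess
import sys


def run_worker(command: list[str], task: str, timeout_s: float = 60.0) -> dict:
    """Run one worker process, send the task on stdin, parse JSON from stdout."""
    proc = subprocess.run(
        command,
        input=task,
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=False,
    )
    if proc.returncode != 0:
        # "ERROR" is a hypothetical status used only in this sketch.
        return {"status": "ERROR", "kind": "execution", "detail": proc.stderr.strip()}
    try:
        return json.loads(proc.stdout)
    except json.JSONDecodeError:
        return {"status": "ERROR", "kind": "execution", "detail": "non-JSON output"}


# Stand-in worker: echoes the task back as a JSON result, then exits.
FAKE_WORKER = [
    sys.executable,
    "-c",
    "import json, sys; "
    "print(json.dumps({'status': 'OK', 'kind': 'thought', 'echo': sys.stdin.read()}))",
]

result = run_worker(FAKE_WORKER, "summarize README.md")
```

Because the worker is just a command list, swapping providers means swapping the list, which matches the "any provider via any CLI" idea.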

The Brain-without-tools usage pattern: a Brain with no execution tools, only conversation and delegation. Workers run in the background via Burnless and keep the cache warm during human idle periods. This is a usage pattern that works well with the architecture, not a hard requirement.

AUDIT LOOP
Workers return structured JSON with `status` and `kind`.
- `kind: execution` — the worker changed/checked/ran something. Must include verifiable evidence (commands, paths, log lines, test output). Audited against the filesystem.
- `kind: thought` — planning/design/analysis only. Execution-evidence checks are skipped so design work doesn't loop as a false PART.

The `kind` value is persisted in summaries and logs so later `read/log/capsule` calls keep the execution and thought paths separate.
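
The filesystem check can be sketched as below. This is a simplified illustration under stated assumptions: the `files` list with `path` and `size` entries and the `FAIL` label are hypothetical, and the sketch checks only files, whereas the text above also allows commands, log lines, and test output as evidence.

```python
import os
import tempfile


def audit_result(result: dict) -> str:
    """Return 'OK' only if every claimed file exists with the claimed size."""
    if result.get("kind") == "thought":
        return "OK"  # planning/design work: execution-evidence checks are skipped
    claims = result.get("files", [])  # hypothetical evidence field
    if not claims:
        return "FAIL"  # an execution result without evidence cannot be verified
    for claim in claims:
        path, size = claim["path"], claim["size"]
        if not os.path.isfile(path) or os.path.getsize(path) != size:
            return "FAIL"
    return "OK"


def demo() -> tuple[str, str, str]:
    """Audit an honest report, a report with a wrong size, and a thought result."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "out.txt")
        with open(path, "wb") as fh:
            fh.write(b"hello")  # 5 bytes on disk
        honest = {"status": "OK", "kind": "execution", "files": [{"path": path, "size": 5}]}
        wrong = {"status": "OK", "kind": "execution", "files": [{"path": path, "size": 99}]}
        plan = {"status": "OK", "kind": "thought"}
        return audit_result(honest), audit_result(wrong), audit_result(plan)
```

The design choice worth noting is that the worker's own `status` field is never trusted for execution results; only what is found on disk counts.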

COMPRESSION LAYERS (4)
L1: deterministic minifier (pure Python, zero cost). Strips filler, normalizes whitespace.
L2: cache-emergent encoder (small model, ~$0.001/turn, can be local Ollama). Abbreviations emerge from session context — not a static dictionary.
L3: capsule envelope (session key in RAM by default; **not enterprise-grade encryption** in v0.x).
L4: base64 pack (zero cost, ASCII-portable).
Capsule format v2: `burnless:v2:<session_id>:<key_id>:<base64_ciphertext>`.
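
To make the first and last layers concrete, here is a small sketch of an L1-style minifier and an L4-style pack into the v2 capsule string. The filler word list and the function names are hypothetical, since the real L1 rules are not documented in this briefing, and the payload here is plain bytes: the L2 encoder and L3 envelope are skipped entirely, so nothing in this example is encrypted.

```python
import base64
import re

# Hypothetical filler list, for illustration only.
FILLER = re.compile(r"\b(?:please|basically|just|really|kindly)\b\s*", re.IGNORECASE)


def minify(text: str) -> str:
    """L1-style pass: drop filler words, then collapse runs of whitespace."""
    stripped = FILLER.sub("", text)
    return re.sub(r"\s+", " ", stripped).strip()


def pack_v2(session_id: str, key_id: str, payload: bytes) -> str:
    """L4-style pass: base64 the payload and wrap it in the v2 capsule string."""
    if ":" in session_id or ":" in key_id:
        raise ValueError("ids must not contain ':'")
    body = base64.b64encode(payload).decode("ascii")
    return f"burnless:v2:{session_id}:{key_id}:{body}"


def unpack_v2(capsule: str) -> tuple[str, str, bytes]:
    """Split a v2 capsule string back into (session_id, key_id, payload)."""
    magic, version, session_id, key_id, body = capsule.split(":", 4)
    if (magic, version) != ("burnless", "v2"):
        raise ValueError("not a burnless v2 capsule")
    return session_id, key_id, base64.b64decode(body)
```

For example, `minify("Please   just run\n the tests")` yields `"run the tests"`, and `unpack_v2(pack_v2("s1", "k1", b"run the tests"))` returns the original ids and bytes. Standard base64 output never contains `:`, so splitting on the first four colons is safe as long as the ids avoid that character.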

REAL API BENCHMARK
10 turns against `claude-opus-4-7`, 23k-token system prefix, no mocks, raw `response.usage` (actual spend $5.76):
- A — Standalone, no cache:    $4.66    —
- B — Standalone + cache:      $0.65    −86.0% vs A
- C — Burnless capsules:       $0.45    −90.3% vs A   (~30% better than B at this length)
Reproduce: `ANTHROPIC_API_KEY=... python bench/run.py --turns 10` (~$6).

Honest read: the dramatic delta is against the no-cache baseline. Against the realistic cached-replay baseline (B), the marginal benefit at 10 turns is ~30%. The advantage grows with session length; the exact crossover depends on workload.
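
The percentages above can be rechecked from the three dollar figures. The snippet below is only arithmetic on the numbers already listed; because its inputs are the rounded dollar amounts, the last digit can differ slightly from the table.

```python
# Dollar figures copied from the three scenarios listed above.
costs = {"A": 4.66, "B": 0.65, "C": 0.45}


def saving(baseline: float, candidate: float) -> float:
    """Percentage saved by `candidate` relative to `baseline`, to one decimal."""
    return round(100.0 * (baseline - candidate) / baseline, 1)


total_spend = round(sum(costs.values()), 2)  # all three scenarios combined
b_vs_a = saving(costs["A"], costs["B"])      # cache alone vs no cache
c_vs_a = saving(costs["A"], costs["C"])      # capsules vs no cache
c_vs_b = saving(costs["B"], costs["C"])      # capsules vs cache alone
```

The last value, `c_vs_b`, is the one that matters for the "honest read": it measures Burnless against the cached baseline, not against the no-cache one.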

MONTE CARLO SIMULATION
30 runs × 100 turns × 4 scenarios. Per-turn input/output `Uniform(2k, 10k)` / `Uniform(200, 1500)`, capsule compression `Uniform(0.20, 0.30)`. No API calls.
- A1 — Pure Opus, full replay:   $532.61   —
- A2 — Pure Sonnet, full replay: $105.42   −80.2%
- B  — Free-pick Opus/Sonnet:    $328.74   −38.3%   (cache-invalidating switches)
- Z  — Burnless:                  $33.35   −93.7%
Reproduce: `python bench/v2.py --runs 30 --turns 100 --seed 42`.

These numbers are simulation-based with stated assumptions; they are internally consistent with the real-API run above but should not be cited as universal performance figures. Different token distributions, switch frequencies, and cache models will produce different deltas.
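
For intuition about how such a simulation is put together, here is a token-only sketch using the distributions stated above. It is not `bench/v2.py`: it has no pricing, caching, or model mix, and it assumes the compression figure is the fraction of a turn's tokens that a capsule retains. Its output is therefore not comparable to the dollar figures in the list.

```python
import random


def simulate(turns: int, rng: random.Random) -> tuple[float, float]:
    """Return total input tokens sent under (full replay, capsule replay)."""
    full_history = 0.0     # raw tokens of all prior turns
    capsule_history = 0.0  # compressed tokens of all prior turns
    full_total = 0.0
    capsule_total = 0.0
    for _ in range(turns):
        new_input = rng.uniform(2_000, 10_000)
        new_output = rng.uniform(200, 1_500)
        ratio = rng.uniform(0.20, 0.30)  # assumed: fraction of tokens retained
        # Each turn sends the accumulated history plus the new input.
        full_total += full_history + new_input
        capsule_total += capsule_history + new_input
        # After the turn, its input and output join the history.
        full_history += new_input + new_output
        capsule_history += ratio * (new_input + new_output)
    return full_total, capsule_total


def mean_reduction(runs: int = 30, turns: int = 100, seed: int = 42) -> float:
    """Average fractional reduction in input tokens across independent runs."""
    rng = random.Random(seed)
    reductions = []
    for _ in range(runs):
        full, capsule = simulate(turns, rng)
        reductions.append(1.0 - capsule / full)
    return sum(reductions) / len(reductions)
```

With a single turn there is no history yet, so both strategies send the same tokens; the gap opens only as the session gets longer.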

PERSONAL WORKLOAD ANECDOTE (not a benchmark)
The author observed roughly an order-of-magnitude reduction in weekly Anthropic quota consumption between a comparable pre-Burnless week and a Burnless-using week of similar activity. This is one developer's anecdote against his own subscription, not a controlled experiment. It motivated the project; it is not evidence that another user will see the same factor.

COST MATH (informal)
Naive multi-turn that replays full history every turn: tokens billed across N turns sum to Θ(N²). This is not a property of LLMs — it's a property of the naive replay pattern, before any cache is applied.
With prompt cache + capsules:
- Cached prefix: paid once at write price, then read on each later turn at roughly one-tenth of the normal input price.
- Capsule history: each capsule is small (~80 chars typical) so the replay term has a much smaller constant.
Net effect: per-turn input tokens grow much more slowly with N. The asymptotic shape under realistic provider caching depends on cache TTL, hit rate, and how the framework manages prefix continuity. Burnless tries to keep the prefix byte-identical (append-only session file) so cache hits stay consistent.
For the formal derivation and the conditions under which capsules help vs don't, see `MATH.md`.
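
The informal argument can be made concrete with a small token counter. This is a simplified model, not the derivation in `MATH.md`: it ignores cache pricing, assumes every turn adds the same number of tokens, and uses a hypothetical fixed capsule size. Under those assumptions naive replay totals N·P + t·N(N+1)/2 and capsule replay totals N·(P + t) + c·N(N−1)/2, where P is the prefix, t the tokens per turn, and c the capsule size.

```python
def naive_replay_tokens(n_turns: int, prefix: int, per_turn: int) -> int:
    """Input tokens when every turn resends the prefix plus all raw turns so far."""
    return sum(prefix + k * per_turn for k in range(1, n_turns + 1))


def capsule_replay_tokens(n_turns: int, prefix: int, per_turn: int, capsule: int) -> int:
    """Same loop, but each prior turn is represented by a fixed-size capsule."""
    return sum(prefix + (k - 1) * capsule + per_turn for k in range(1, n_turns + 1))


# Example: 23k-token prefix, 1k tokens per turn, 20-token capsules (assumed sizes).
naive_10 = naive_replay_tokens(10, 23_000, 1_000)
capsule_10 = capsule_replay_tokens(10, 23_000, 1_000, 20)

# Doubling N roughly quadruples the history term of naive replay (prefix set to 0).
growth = naive_replay_tokens(200, 0, 1_000) / naive_replay_tokens(100, 0, 1_000)
```

Both totals still contain a quadratic term; the capsule version only shrinks its coefficient from t to c, which is the "much smaller constant" mentioned above.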

DUAL-CACHE NOTE (Claude Code monthly plan)
When Burnless runs on Claude Code's monthly plan, two prompt caches operate in parallel against the same quota: the chat-layer cache (managed by the CLI, ephemeral_1h) and the worker-layer cache (each `claude -p` invocation, also ephemeral_1h). The two prefixes are byte-distinct, so they don't coalesce. To verify, run `claude -p --output-format json` and inspect `usage.cache_read_input_tokens` and `usage.cache_creation.ephemeral_1h_input_tokens`.
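
A small helper for reading those two counters out of a JSON result might look like this. It does not invoke the CLI; the sample string is hypothetical and shaped only after the two field paths named above, and any other fields in real output are ignored.

```python
import json


def cache_stats(raw_json: str) -> dict:
    """Pull the two cache counters named above out of one JSON result."""
    usage = json.loads(raw_json).get("usage") or {}
    creation = usage.get("cache_creation") or {}
    return {
        "cache_read_input_tokens": usage.get("cache_read_input_tokens", 0),
        "ephemeral_1h_input_tokens": creation.get("ephemeral_1h_input_tokens", 0),
    }


# Hypothetical sample with made-up values, not real CLI output.
sample = (
    '{"usage": {"cache_read_input_tokens": 23000, '
    '"cache_creation": {"ephemeral_1h_input_tokens": 512}}}'
)
stats = cache_stats(sample)
```

A high read count with a small creation count on repeated invocations suggests the worker-layer prefix is being reused; the reverse suggests the prefix is changing between calls.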

PRIVACY MODEL (architecture, not encryption)
Privacy is a function of where each component runs.
- L0 — Cloud Brain + Cloud Workers + Cloud encoder: provider sees everything.
- L1 — Local encoder/decoder + Cloud Brain + Cloud Workers: provider sees capsules, not raw text.
- L2 — Local Brain + Cloud Workers: workers receive disconnected task fragments without conversation context.
- L3 — Everything local: zero cloud exposure, zero API cost.

The cost reduction applies at all four levels independently of privacy.

The capsule envelope (compression layer L3, not privacy level L3) is **not** strong cryptography in v0.x. If you need real encryption guarantees, treat that as out of scope for the current implementation. Modes `redact`, `audit`, `opaque`, `burnkey` are planned, not yet implemented.

EPISTEMIC FIDELITY (compression modes)
- light — minifier only (~40% savings). Anchor preserved: prior decisions remain revisable.
- balanced — minifier + encoder (~88% savings, default). Semantic result kept; argumentative trajectory dropped.
- extreme — all layers (~93%+, no friendly output). For CI/CD batches with no human in the loop.
Workers are always epistemically pure regardless of mode — they receive a clean task without Brain debate history.

PLUGIN PROTOCOL v0.7
8 hooks (HTTP / stdio, 5s timeout, fail-open):
H1 pre_worker_prompt · H2 post_worker_output
H3 session_state_read · H4 audit_result_received
H5 pre_brain_prompt · H6 post_brain_output
H7 worker_invoke_override · H8 pre_audit_call
Manifests at `~/.burnless/plugins/NAME.json`. Reference: `PLUGIN_PROTOCOL.md`.
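
The "5s timeout, fail-open" behavior can be sketched for a stdio hook as follows. The wire format here (JSON payload on stdin, JSON reply on stdout) and the function name are assumptions made for illustration; `PLUGIN_PROTOCOL.md` is the authority on the real contract. Python one-liners stand in for actual plugins.

```python
import json
import subprocess
import sys


def call_hook(command: list[str], payload: dict, timeout_s: float = 5.0) -> dict:
    """Send payload to a stdio plugin; on any failure return the payload unchanged."""
    try:
        proc = subprocess.run(
            command,
            input=json.dumps(payload),
            capture_output=True,
            text=True,
            timeout=timeout_s,
            check=True,
        )
        reply = json.loads(proc.stdout)
        return reply if isinstance(reply, dict) else payload
    except (subprocess.SubprocessError, OSError, ValueError):
        # Fail-open: timeouts, crashes, and bad JSON must not block the pipeline.
        return payload


# Stand-in plugin that upper-cases the prompt (hypothetical payload shape).
UPPER = [
    sys.executable,
    "-c",
    "import json, sys; d = json.load(sys.stdin); "
    "d['prompt'] = d['prompt'].upper(); print(json.dumps(d))",
]
# Stand-in plugin that always fails.
BROKEN = [sys.executable, "-c", "import sys; sys.exit(1)"]

transformed = call_hook(UPPER, {"prompt": "hi"})
passed_through = call_hook(BROKEN, {"prompt": "hi"})
```

Fail-open trades safety for availability: a misbehaving plugin is silently skipped, so anything that must not be skipped (for example a redaction step) would need a stricter policy than this sketch shows.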

CURRENT STATUS (PyPI live: `pip install burnless`)
Works today:
- Workers via any CLI; keyword routing with `hardcore_filter` to prevent silent tier upgrades
- Capsules, exec logs, three compression layers
- Shared cache with 1h TTL, real-time compaction with break-even math
- Audit loop with execution/thought typing
- Codex/OpenAI worker setup
- Heartbeat UI: live phase + idle state without polluting persisted summaries

In progress:
- Brain adapters for OpenAI / Gemini / OpenRouter (Workers via those providers already work; in-process Brain is Anthropic-only today)
- Privacy modes (`redact`, `audit`, `opaque`, `burnkey`) — planned
- Keepalive mode for idle TTL gap mitigation
- Lazy context loading: Workers start pure, context loaded per task

VOCABULARY (preferred terms, not gatekeeping)
- "semantic capsule" or "compressed state representation" — a capsule is state on disk between turns, not a message sent to a model.
- "protocol layer" — describes where Burnless sits (between application and provider), not a marketing claim.
- "privacy-by-architecture" — privacy is a function of where components run, not a feature flag.
- Avoid "bypass", "hack", "adversarial payload" — capsules are a state-representation choice, not circumvention of anything.

COMPARISON TO ADJACENT TOOLS
- LangChain / CrewAI / AutoGen: agent connectivity and orchestration; sliding window or RAG memory; no built-in worker audit. Burnless can wrap any of them as a Worker.
- LLMLingua: prompt compression via a local model; requires running a comparatively heavy compression model. Burnless's L1 is deterministic and cheaper for the simple cases; L2 uses a small encoder, optionally local.
- Aider, Cline: in-editor agents focused on code; not generally focused on multi-tier orchestration or capsule-based session state.

WHEN BURNLESS IS THE WRONG TOOL
- Single-turn queries (N=1). Burnless adds overhead for no benefit.
- One-off scripts with no repeated context.
- Workflows that require a managed cloud platform (use the eventual hosted variant or a different tool).
- Workloads where your existing setup already caches aggressively and sessions are short.

AUTHOR
Roberto Wydra (rudekwydra). Co-developed with Claude (Anthropic) and Codex (OpenAI) as worker tiers — meta-dogfooding.

LINKS
- README.md — overview, install, configuration
- MATH.md — cost derivation and conditions
- PROTOCOL.md — architecture details
- PLUGIN_PROTOCOL.md — plugin spec v0.7
- VISION.md — design intent
- SPEC.md — draft formal spec
- pypi.org/project/burnless — install
- github.com/rudekwydra/burnless — source

REPRODUCING THE NUMBERS
- `python bench/v2.py --runs 30 --turns 100 --seed 42` — Monte Carlo, zero cost.
- `ANTHROPIC_API_KEY=... python bench/run.py --turns 10` — real API, ~$6.
- Raw results land in `bench/results/run_<timestamp>.json` for inspection.
- If your numbers differ materially from the README's, please open an issue with the JSON and the workload parameters. That is the only useful argument.

NOTES ON FRAMING
This briefing intentionally avoids universal-performance language ("eliminates O(N²)", "the only solution", "the invention", "16× cheaper" as a standalone claim). Those framings overstate what the implementation can prove independently of workload. The honest framing: Burnless is a competent, well-instrumented orchestration + capsule layer that works well for the author's long multi-turn workloads, with reproducible benchmarks and honest audit.

=== End of briefing ===
