The Full Architecture

Five layers — each one solving a specific problem from the research. Read top to bottom: your agent talks to the top, Chrome runs at the bottom.

Your Agent Talks To This
Developer API
browser.run("do task") page.snapshot() page.click("button") page.extract({...})
Layer 1
Compression Engine
Transforms raw browser data into a compact summary before the LLM ever sees it. Element cache means we never re-read the same thing twice. Differential updates send only what changed.
Token-first Element cache Snapshot versioning Diff updates
Layer 2
Security & Sanitization
Every page passes through this before reaching the LLM. Strips hidden elements (invisible text that could hijack your agent), detects prompt injection patterns, scopes context to only the relevant part of the page.
Injection guard Hidden element filter Context scoping
Layer 3
Structured Error Recovery
Instead of returning a vague error string, this layer classifies exactly what went wrong and suggests how to recover. The LLM gets a structured object: error type, confidence, and recommended next action.
ElementStale → re-snapshot + retry
AntiBotBlock → rotate identity
AuthRequired → ask human
NetworkTimeout → exponential backoff
Layer 4
CDP Engine (Chrome DevTools Protocol)
Direct WebSocket connection to Chrome. Full Shadow DOM access. Surgical data extraction — only pulls exactly what we need. Vision fallback for canvas apps (Google Sheets, Figma). Zero intermediary overhead.
Full Shadow DOM Vision fallback Surgical extraction No overhead
Runs Here
Chrome / Chromium (local or cloud)
Runs across all layers
Observability Trace
Every action is recorded: what was seen, what was decided, what was executed, what the page looked like after. Machine-queryable. Used for automatic retry classification and can be exported for LLM fine-tuning.
Does this architecture look right? Let me know in the terminal and we'll move to naming + core features.