Aulinx: The Semantic Compositor

What no one has built — compositor-level AI understanding

Current Industry: External Observers

How every AI desktop agent works today (2026)
1. Screenshot: take a full-screen capture (lossy, ~500 KB).
2. Upload: send the pixels to a vision model (1,200-5,000 tokens).
3. Interpret: the model guesses what's on screen from the pixels.
4. Act: click at the guessed (x, y) coordinates.
5. Repeat: re-perceive everything from scratch, with no memory.
Claude, Operator, Mariner, and UFO3 all work this way.
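
The five steps above can be sketched as a loop. This is a minimal illustration, not any vendor's actual agent code: `run_screenshot_loop`, `Action`, and the `capture`, `interpret`, and `click` callables are hypothetical names, and the per-screenshot cost uses the upper end of the 1,200-5,000 token range quoted in step 2.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Assumption: upper end of the 1,200-5,000 token range quoted above.
TOKENS_PER_SCREENSHOT = 5_000


@dataclass
class Action:
    """A click target guessed by the vision model (hypothetical type)."""
    x: int
    y: int


def run_screenshot_loop(
    capture: Callable[[], bytes],
    interpret: Callable[[bytes], Optional[Action]],
    click: Callable[[int, int], None],
    max_steps: int = 10,
) -> int:
    """Run the perceive-act loop and return the perception tokens spent."""
    tokens_spent = 0
    for _ in range(max_steps):
        pixels = capture()                      # 1. Screenshot
        tokens_spent += TOKENS_PER_SCREENSHOT   # 2. Upload
        action = interpret(pixels)              # 3. Interpret
        if action is None:                      # model decides the task is done
            break
        click(action.x, action.y)               # 4. Act
        # 5. Repeat: nothing is carried over to the next iteration.
    return tokens_spent


def demo() -> Tuple[int, List[Tuple[int, int]]]:
    """Drive the loop with stubs: two clicks, then the model stops."""
    clicks: List[Tuple[int, int]] = []
    pending = [Action(10, 20), Action(30, 40)]

    def fake_interpret(_pixels: bytes) -> Optional[Action]:
        return pending.pop(0) if pending else None

    tokens = run_screenshot_loop(
        capture=lambda: b"\x00" * 16,
        interpret=fake_interpret,
        click=lambda x, y: clicks.append((x, y)),
    )
    return tokens, clicks
```

In this stubbed run the loop pays for three full screenshots to perform two clicks, because the final "nothing left to do" check also requires a fresh capture.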

The Aulinx Approach: Semantic Compositor

Traditional Compositor        Semantic Compositor
Receives pixel buffers        Receives pixel buffers + semantic data
Arranges rectangles           Maintains live scene graph
Routes input events           Emits semantic events
Understands nothing           Understands everything
AI must screenshot to "see"   AI queries meaning directly
Architecture
Semantic Scene Graph (THE NEW LAYER)
Live State: every element, label, and state, updated in real time
Semantic Diffs: "button changed state", "dialog appeared at Y"
Cross-App Context: clipboard flows, drag operations, multi-window workflows
Newton Protocol: apps push a11y data to the compositor (native)
AT-SPI Bridge: legacy apps via the existing a11y tree
Vision Fallback: Screen2AX for apps with no a11y support
XDG Shell
Layer Shell
XWayland
Input / Render
▼ Semantic IPC — scene graph queries + event stream ▼
Python AI Agent
Queries scene graph (50 tokens) instead of screenshots (5,000 tokens)

Why This Is Novel

Up to 100x cheaper perception
A scene graph query costs about 50 tokens, versus 1,200-5,000 tokens for a screenshot sent to a vision model: a 24x to 100x reduction. No API calls are needed for perception.
Real-time, not polling
Event-driven semantic updates instead of screenshot loops. The AI knows the instant something changes.
Ground truth, not guessing
The compositor knows exactly what's on screen. No OCR errors, no misidentified buttons, no coordinate guessing.
No one has done this
Searching for compositor-level AI integration turned up zero results. GNOME's Newton is the closest concept, but it targets screen readers, not AI agents.
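
The "real-time, not polling" point can be illustrated with a small publish/subscribe sketch: the agent registers a handler once and is called only when something relevant changes, instead of re-capturing the screen on a timer. `SemanticEventStream`, the event dictionaries, and the event kind names are hypothetical; the actual Semantic IPC format is not specified here.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Event = Dict[str, str]
Handler = Callable[[Event], None]


class SemanticEventStream:
    """In-process stand-in for the compositor's event stream (hypothetical)."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, kind: str, handler: Handler) -> None:
        """Register a handler for one kind of semantic event."""
        self._handlers[kind].append(handler)

    def emit(self, event: Event) -> None:
        """Deliver an event to the handlers subscribed to its kind."""
        for handler in self._handlers.get(event["kind"], []):
            handler(event)


stream = SemanticEventStream()
seen: List[Event] = []

# The agent only cares about new dialogs; other events never reach it.
stream.subscribe("dialog-appeared", seen.append)

stream.emit({"kind": "button-changed", "label": "Save", "state": "enabled"})
stream.emit({"kind": "dialog-appeared", "label": "Save changes?"})
```

After both events are emitted, `seen` contains only the dialog event, so the agent does no work for changes it did not subscribe to.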