1. Screenshot: Take full-screen capture (lossy, ~500KB)
2. Upload: Send pixels to vision model (1,200-5,000 tokens)
3. Interpret: Model guesses what's on screen from pixels
4. Act: Click at guessed (x, y) coordinates
5. Repeat: Re-perceive everything from scratch. No memory.
Claude, Operator, Mariner, and UFO3 all do this.
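
The five steps above can be sketched as a stateless loop. This is only an illustration, not any vendor's actual implementation: `Action`, `run_pixel_loop`, and the `capture`, `interpret`, and `click` callbacks are hypothetical names, and the "vision model" in the demo is a stub that matches on fake frame bytes. The point it tries to show is that each iteration hands the interpreter nothing but the current pixels.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Action:
    """Hypothetical result of one interpretation step."""
    x: int
    y: int
    done: bool = False  # True when the interpreter believes the task is finished


def run_pixel_loop(
    capture: Callable[[], bytes],
    interpret: Callable[[bytes], Action],
    click: Callable[[int, int], None],
    max_steps: int = 10,
) -> int:
    """Run screenshot -> upload -> interpret -> act; return the number of perceptions."""
    steps = 0
    for _ in range(max_steps):
        pixels = capture()           # 1. Screenshot
        action = interpret(pixels)   # 2-3. Upload and interpret: only current pixels are passed
        steps += 1
        if action.done:
            break
        click(action.x, action.y)    # 4. Act at the guessed coordinates
        # 5. Repeat: nothing from this iteration is carried into the next one
    return steps


def demo() -> Tuple[int, List[Tuple[int, int]]]:
    """Toy run with two fake 'screens' and a stubbed interpreter."""
    frames = [b"login-page", b"dashboard"]
    state = {"i": 0}
    clicks: List[Tuple[int, int]] = []

    def capture() -> bytes:
        return frames[min(state["i"], len(frames) - 1)]

    def interpret(pixels: bytes) -> Action:
        # Stand-in for a vision model: decides from the raw bytes alone.
        if pixels == b"dashboard":
            return Action(0, 0, done=True)
        return Action(120, 340)

    def click(x: int, y: int) -> None:
        clicks.append((x, y))
        state["i"] += 1  # the fake UI advances to the next screen after a click

    return run_pixel_loop(capture, interpret, click), clicks


if __name__ == "__main__":
    print(demo())
```

In this toy run the loop perceives the screen twice and clicks once. In a real agent, each perception would cost a fresh screenshot upload and a fresh round of vision tokens.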