wavi
Read full chat histories. Capture audio. Transcribe voice messages. Without touching the DOM.
Philosophy
WhatsApp Web is a black box. Its DOM changes with every deploy, its CSS selectors rotate without warning, and its JavaScript API is private. Any scraper that reads the DOM is brittle by design.
wavi takes a different bet: what a human sees on screen is stable even when the internals change. Chat bubbles look like chat bubbles. Timestamps look like timestamps. Voice messages look like voice messages. That rendering contract is anchored to human expectation — it barely changes across WA versions.
Vision-based extraction is slower than DOM parsing but far more resilient to changes in WhatsApp's internals. The investment in OCR and image analysis pays off every time WhatsApp ships a breaking DOM change, because wavi does not depend on it.
Why this matters beyond WhatsApp
Every technique in this codebase — bubble detection, OCR timestamp extraction, blob URL interception, DOM anchor deduplication — is a reusable building block. The same approach applies to Telegram Web, Slack, or any chat UI that follows a similar visual grammar.
We are building a vocabulary for visual automation. Each solved problem here is a solved problem everywhere.
Speed is not the enemy
Chrome headless + Playwright CDP + Apple Vision OCR is fast enough for any real workflow: a 300-message chat captures in under 2 minutes, and a 1000-message chat in under 10.
The system prioritizes correctness over raw speed: DOM anchor deduplication ensures no message appears twice even across overlapping scroll windows; DOM IDs take priority over OCR text for dedup keys because OCR is imperfect. Every piece has a fallback.
Quick start
# Isolated global install via pipx
pipx install --editable .
# Or dev install in the project venv
pip install -e ".[dev]"
wavi connect --open # first-time: opens QR page in browser
wavi connect # subsequent: restores session headlessly
wavi status # verify: daemon=running, session=restored
Without GROQ_API_KEY, audio messages are captured but never converted to text.
echo "GROQ_API_KEY=gsk_..." >> .env
1. Go to console.groq.com/keys (free account, no credit card required)
2. Sign up or log in → click Create API Key
3. Copy the key (starts with gsk_) and paste it into your .env
Files are written to output/<contact>/.
wavi get default "Ana García"
# → output/ana_garcía/history_bubbles.json
# → output/ana_garcía/iter_*/screenshot.png
# → output/ana_garcía/iter_*/audio_*.ogg
wavi send default "Ana García" "Hola!"
wavi send default "+54 9 11 5561 2767" "test" --screenshot-out sent.png
Saves contacts_list.json and screenshot.png to the output folder, overwriting on each run.
wavi list-contacts
# → output/contacts/contacts_list.json
# → output/contacts/screenshot.png
Commands
All commands accept an optional SESSION argument (default: "default"). Multiple sessions can run simultaneously on different CDP ports.
Start or recover the Chrome daemon. Tries headless first — no visible Chrome window ever opens. If QR is needed, captures it headlessly and writes data/qr.html with a live countdown. The connect session closes when the QR is scanned (→ daemon stays up) or expires (→ run again).
--open: open the QR page automatically in the default browser
--new: skip any existing session, force a fresh QR scan; session folder is named after the detected phone number
Scroll the full chat history with CONTACT. Deduplicates messages across scroll iterations. Downloads and transcribes audio messages if GROQ_API_KEY is set.
Open CONTACT's chat and send MESSAGE. Use your own phone number as CONTACT for self-message testing.
Check if the Chrome daemon is running and whether WhatsApp is authenticated. Returns session=restored when ready.
Gracefully shut down the Chrome daemon. Navigates to about:blank first so WhatsApp flushes IndexedDB before SIGTERM. Never use kill -9 directly.
Show the current operation status for a session. Prints idle or details of the running operation (type, contact, PID, elapsed time).
Run the vision pipeline on a saved screenshot file. No browser needed. Useful for debugging OCR and bubble detection on a specific frame.
Open this page in the default browser.
Open WhatsApp's "New chat" panel and extract all contacts (DOM-based). Scrolls the virtualized list to the bottom with anchor-based overlap dedup (85% steps), so contacts outside the initial viewport are captured too. Saves contacts_list.json and screenshot.png to the output folder, overwriting on every run.
Check the WhatsApp sidebar for new inbound messages. Snapshots every visible chat row via DOM ({name, last_message, timestamp, direction}) and compares it against the previous saved state. A chat is reported only when its last message changed and is inbound (no delivery-tick icon) — outgoing messages and re-reads never trigger an update.
Writes two files to output/<session>/last-updates/:
- updates.json: {status, checked_at, contacts, new_inbound} (also the previous-state baseline for the next run)
- snapshot_current.png: sidebar screenshot from this run (debugging)
List Contacts — output
Running wavi list-contacts opens the WhatsApp "New chat" panel, scrolls the full virtualized list extracting contacts via DOM ([role="listitem"]), and writes two files to output/<session>/contacts/ — overwriting on every run:
- contacts_list.json: array of {name, subtitle} objects
- screenshot.png: full 1280×1920 browser viewport to confirm the panel state
Browser viewport — 1280 px wide (ADR-002)
The screenshot below is the actual Chrome viewport captured during the last run. The "New chat" panel occupies the left ~580 px (same as the sidebar used by the message capture pipeline). The right panel shows the previously open chat — irrelevant to contact extraction.
[Screenshots: contact list panel (left 580 px) and full viewport (1280 px)]
The panel uses WA's virtual list renderer — contacts outside the viewport are not in the DOM. _scroll_all_contacts() iterates scrollTop in ~85% viewport steps (guaranteed overlap) and dedups by anchoring on the last contact of the previous page, the same strategy capture_full_history uses for messages. Stops at the bottom or after 3 consecutive scroll stalls.
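The anchor-based overlap dedup described above can be illustrated with a small pure function. This is only a sketch under assumptions: `merge_pages` and the `(name, subtitle)` tuple shape are hypothetical names chosen for illustration, not the actual `_scroll_all_contacts()` internals.

```python
from typing import Iterable

Contact = tuple[str, str]  # (name, subtitle); hypothetical shape for this sketch


def merge_pages(pages: Iterable[list[Contact]]) -> list[Contact]:
    """Merge overlapping pages of a virtualized list into one ordered list.

    Each page is what was visible after one scroll step. Because steps are
    smaller than the viewport, consecutive pages overlap, and the last contact
    already collected (the anchor) should reappear in the next page.
    """
    merged: list[Contact] = []
    for page in pages:
        if not page:
            continue
        new_items = page
        if merged:
            anchor = merged[-1]
            # Search from the end so a repeated entry resolves to its last position.
            idx = next(
                (i for i in range(len(page) - 1, -1, -1) if page[i] == anchor),
                None,
            )
            if idx is not None:
                # Keep only what comes after the anchor.
                new_items = page[idx + 1:]
            # If the anchor is missing, the pages did not overlap and the whole
            # page is appended as-is.
        merged.extend(new_items)
    return merged


pages = [
    [("Ana", ""), ("Bruno", ""), ("Carla", "")],
    [("Carla", ""), ("Dario", "")],
    [("Dario", "")],                 # scroll stall: nothing new
    [("Dario", ""), ("Elena", "")],
]
print([name for name, _ in merge_pages(pages)])
```

Anchoring on the last known item, rather than on a global set of names, keeps two different contacts with the same display name from collapsing into one entry.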
Output structure
output/contacts/ # overwritten on every run
contacts_list.json # [{"name": "...", "subtitle": "..."}, ...]
screenshot.png # 1280×1920 browser viewport
Check Updates — output
Running wavi check-updates inspects the WhatsApp sidebar for new inbound messages. Detection is DOM-state based: every visible chat row is captured as {name, last_message, timestamp, direction} and compared against the previous run's saved state in updates.json.
Algorithm
1. ensure_chat_list(): close the New Chat panel if open, press Escape ×3 to clear search/overlays, wait for the chat list.
2. extract_sidebar_updates(): snapshot every visible chat row via DOM. direction is outbound when a delivery-tick icon (msg-check, msg-dbl-check, …) is present in the preview row, inbound otherwise.
3. No previous state (or --reset) → status=first_run, save baseline and return.
4. Compare row-by-row against the previous state: any inbound row whose last_message or timestamp changed → status=updates with those rows in new_inbound. No changes → status=no_updates.
5. Write updates.json (new state, replaces the old baseline) and snapshot_current.png.
Limitation: only the last visible message per chat is tracked. If several messages arrive between two checks, only the most recent one is reported per contact — use wavi get afterwards to retrieve the full history.
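The comparison step of the algorithm can be sketched as a pure function over two snapshots. The function name `diff_sidebar` is hypothetical, rows are assumed to be matched by `name`, and treating a chat that is absent from the previous state as changed is an assumption of this sketch, not documented behavior.

```python
from typing import Optional

Row = dict  # {name, last_message, timestamp, direction}


def diff_sidebar(previous: Optional[list[Row]], current: list[Row],
                 reset: bool = False) -> dict:
    """Return {status, new_inbound} following the status rules described above."""
    if previous is None or reset:
        # No baseline to compare against: the current snapshot becomes the baseline.
        return {"status": "first_run", "new_inbound": []}

    prev_by_name = {row["name"]: row for row in previous}
    new_inbound = []
    for row in current:
        if row["direction"] != "inbound":
            continue  # outgoing previews never trigger an update
        old = prev_by_name.get(row["name"])
        changed = old is None or (
            (old["last_message"], old["timestamp"])
            != (row["last_message"], row["timestamp"])
        )
        if changed:
            new_inbound.append(row)

    status = "updates" if new_inbound else "no_updates"
    return {"status": status, "new_inbound": new_inbound}


before = [{"name": "Ana", "last_message": "Hola", "timestamp": "10:01", "direction": "inbound"}]
after = [{"name": "Ana", "last_message": "¿Estás?", "timestamp": "10:05", "direction": "inbound"}]
print(diff_sidebar(before, after)["status"])
```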
Output structure
output/<session>/last-updates/
snapshot_current.png # sidebar at this run (debugging)
updates.json # {status, checked_at, contacts: [{name, last_message, timestamp, direction}], new_inbound}
Status values
| status | meaning |
|---|---|
| first_run | No previous state (or --reset); current sidebar saved as baseline. |
| no_updates | No inbound row changed its last message since the previous run. |
| updates | At least one inbound row changed; see new_inbound. |
Sequence diagram
Architecture
System overview
wavi is a thin CLI on top of a three-layer stack: session management (Chrome CDP via Playwright), vision (pure image processing), and orchestration (WARunner ties them together with scroll logic and audio capture).
Module dependency graph
Command sequences
One sequence diagram per command. Variants show only the delta from the base flow. Solid arrows (→→) = action/request; dashed arrows (- -→→) = response/return.
wavi get — full history capture
wavi get --newest — incremental update
Loads the existing history_bubbles.json and stops scrolling as soon as the first already-known message is found. New messages are prepended and IDs renumbered.
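A minimal sketch of that merge, assuming the history is stored newest-first, that each message carries a `dom_id`, and that IDs are 1-based integers in an `id` field. All three are assumptions made for illustration; the real layout of history_bubbles.json may differ.

```python
def merge_newest(existing: list[dict], captured: list[dict]) -> list[dict]:
    """Prepend messages captured before the first already-known one, then renumber."""
    known = {m["dom_id"] for m in existing if m.get("dom_id")}
    fresh = []
    for msg in captured:  # captured is assumed to be ordered newest-first
        if msg.get("dom_id") in known:
            break  # first known message reached: everything older is already stored
        fresh.append(msg)
    merged = fresh + existing
    # Renumber so IDs stay contiguous after prepending.
    return [{**m, "id": i} for i, m in enumerate(merged, start=1)]


existing = [
    {"id": 1, "dom_id": "b", "text": "second"},
    {"id": 2, "dom_id": "a", "text": "first"},
]
captured = [
    {"dom_id": "d", "text": "fourth"},
    {"dom_id": "c", "text": "third"},
    {"dom_id": "b", "text": "second"},
]
for m in merge_newest(existing, captured):
    print(m["id"], m["dom_id"])
```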
wavi get --from YYYY-MM-DD — date-bounded capture
Stops scrolling when the oldest visible day-separator pill is earlier than the requested date. Bubbles older than from_date are dropped from that iteration.
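The date bound can be sketched as a filter applied to each iteration. This assumes every bubble already carries a `date` field as a `datetime.date` (attached from the day-separator pills); the function name and field name are hypothetical.

```python
from datetime import date


def apply_from_date(bubbles: list[dict], from_date: date) -> tuple[list[dict], bool]:
    """Drop bubbles older than from_date and report whether scrolling should stop."""
    kept = [b for b in bubbles if b["date"] >= from_date]
    oldest = min((b["date"] for b in bubbles), default=None)
    # Stop once this iteration has reached a day earlier than the requested bound.
    should_stop = oldest is not None and oldest < from_date
    return kept, should_stop


frame = [
    {"text": "old", "date": date(2024, 5, 1)},
    {"text": "new", "date": date(2024, 5, 3)},
]
kept, stop = apply_from_date(frame, date(2024, 5, 2))
print([b["text"] for b in kept], stop)
```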
wavi get --grow — paged history capture
Loads history_bubbles.json and grow_checkpoint.json, fast-forwards past known messages, then captures --max-iter N new-content iterations toward the past. Each run extends the history by one block. When scrollTop reaches 0 the checkpoint is marked complete and future runs exit immediately. Incompatible with --newest.
wavi send
wavi connect — session already active (fast path)
wavi connect — QR scan required
Optimistic headless probe first. If QR is needed, captures it headlessly and writes a local HTML file — no visible Chrome window ever opens. Session dir is renamed to the phone number once scanned. QR expiry detected via data-ref change.
wavi stop
wavi status
wavi queue
wavi bubbles — offline vision pipeline
No browser. Runs entirely from a saved PNG file. Useful for debugging detection and OCR on a specific frame.
wavi list-contacts
DOM-based extraction — no vision pipeline. Opens the "New chat" panel, reads [role="listitem"] elements while scrolling the virtualized list to the bottom (anchor-based overlap dedup), writes contacts_list.json + screenshot.png to output/<session>/contacts/ (overwritten on every run).
Key classes
Browser interaction layer
WASession interacts with WhatsApp Web using physical coordinates and keyboard events, not DOM selectors or page.locator() calls. This is a deliberate choice: WA's DOM is obfuscated and changes with every deploy, but the visual layout is stable. If a button gets a new data-testid, wavi doesn't care — it clicks wherever the button appears on screen.
Coordinate translation — vision pipeline works in crop-panel pixels (sidebar and header stripped). WASession maps them back to viewport pixels before any click:
viewport_x = crop_x + SIDEBAR_X (580 px)
viewport_y = crop_y + HEADER_Y ( 60 px)
→ page.mouse.click(viewport_x, viewport_y) ← always a coordinate click
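The mapping above is a fixed offset, which a few lines of Python make concrete. The offsets come from the text; the function names `crop_to_viewport` and `viewport_to_crop` are hypothetical and only illustrate the arithmetic.

```python
SIDEBAR_X = 580  # width of the sidebar stripped from the crop, in px
HEADER_Y = 60    # height of the header stripped from the crop, in px


def crop_to_viewport(crop_x: int, crop_y: int) -> tuple[int, int]:
    """Translate crop-panel pixels into viewport pixels for a mouse click."""
    return crop_x + SIDEBAR_X, crop_y + HEADER_Y


def viewport_to_crop(viewport_x: int, viewport_y: int) -> tuple[int, int]:
    """Inverse mapping, handy when checking a click against a saved crop."""
    return viewport_x - SIDEBAR_X, viewport_y - HEADER_Y


print(crop_to_viewport(100, 200))
```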
DOM queries are used only when there is no alternative: to obtain the coordinates of a dynamic element (compose box), to read scroll state, to read DOM data-id attributes for deduplication, and to capture blob URLs via JS injection.
| Method | Interaction type | Mechanism |
|---|---|---|
| click(crop_x, crop_y) | Coordinate | page.mouse.click(x+580, y+60) |
| navigate_to_contact | Coordinate + keyboard | mouse.click(317, 80) → keyboard.type() → ArrowDown + Enter |
| send_message | JS→coords + keyboard | JS locates compose box → mouse.click(x, y) → keyboard.type() → Enter |
| scroll_chat_up/down | DOM scrollTop | el.scrollTop -= pixels (no visual alternative) |
| get_visible_message_ids | DOM read | querySelectorAll('[data-id]') (only for dedup keys) |
| install_blob_monitor | JS injection | Hooks HTMLMediaElement.src, URL.createObjectURL, play() |
| fetch_blob | JS evaluate | fetch(blobUrl) → base64 → Python bytes |
| navigate_to_new_chat | JS click | span[data-icon="new-chat-outline"].closest('button').click() (icon-stable, locale-agnostic) |
| extract_contacts | DOM read | querySelectorAll('[role="listitem"]') → [role="gridcell"] text per item |
| close_new_chat | JS click + keyboard | span[data-icon="back-refreshed"].click(), fallback Escape |
| extract_sidebar_updates | DOM read | querySelectorAll chat cells → name + last_message + timestamp; direction via msg-check tick icons |
| ensure_chat_list | keyboard | Escape → wait_for_selector('[data-testid="chat-list"]') |
navigate_to_contact — how it works
send_message — how it works
Vision pipeline
The vision pipeline is pure — no browser, no network, no side effects. It takes a PNG file and returns a list of classified Bubbles. This makes it fully testable without mocking anything real.
1. element_detector.py uses RGB channel masks to isolate bubble pixels: outgoing ("me") = light green (G>200, G−R>15, G−B>15); incoming ("other") = near-white (R,G,B>248). Morphological closing fills intra-bubble gaps, then scipy.ndimage.label() finds connected components. Each component is filtered by size, aspect ratio, and color density. Each surviving region is a bounding box.
2. A tiled OCR scan (ocr_tiled) serves two layout helpers: detecting WA date-separator pills ("Hoy", "Ayer", etc.) to attach dates to bubbles, and detecting timestamps inside color regions that span multiple messages so they can be split. This scan is not the source of each bubble's text; it is structural metadata only. The canonical text comes from the per-bubble OCR in step 3.
3. Each bubble is then OCR'd on its own by swift/ocr_vision.swift, which calls macOS's Vision framework (VNRecognizeTextRequest). Returns a JSON array of text blocks with fractional coordinates. This is the authoritative text for the element. Fast, accurate, offline.
4. classify_msg_type() inspects the OCR blocks: duration patterns (0:21) → audio; file extensions or sizes → file; all-blank blocks → media (likely a photo/video); anything else → text. Sender is inferred from x-position (right = "me", left = "other").
5. Timestamps are parsed from the OCR text (e.g. 2:15 p. m.). Special handling for Cyrillic OCR artifacts where Apple Vision misreads Latin "p" as Cyrillic "р" in some locales.
6. Each bubble gets a dom_id assigned by matching its y-center (crop coords → CSS viewport via DPR) against WA's data-id attributes queried from the DOM. DOM ID is the primary dedup key, stable even when OCR produces different text for the same message.
7. For audio, injected JS hooks intercept HTMLMediaElement.src, URL.createObjectURL, and play() calls. When wavi clicks a play button, the hook captures the blob URL. The blob is then fetched as binary and saved as audio_N.ogg.
Vision pipeline — flowchart
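The classification rules of the pipeline (duration → audio, file extension or size → file, all-blank → media, otherwise text) can be sketched with the standard library alone. This is not the project's classify_msg_type(): the regexes, the extension list, the plain-string block format, and the `infer_sender` midpoint threshold are all assumptions made for illustration, and the sketch assumes the timestamp block has already been removed so that a 24-hour time is not mistaken for a duration.

```python
import re

# A bare m:ss or mm:ss block, such as the 0:21 shown on a voice message.
DURATION_RE = re.compile(r"^\d{1,2}:\d{2}$")
# A few common file extensions, or a size such as "12 MB" (illustrative list).
FILE_RE = re.compile(
    r"\.(?:pdf|docx?|xlsx?|zip|txt)\b|\b\d+(?:[.,]\d+)?\s?(?:kb|mb|gb)\b",
    re.IGNORECASE,
)


def classify_msg_type(blocks: list[str]) -> str:
    """Classify a bubble from its OCR text blocks."""
    texts = [b.strip() for b in blocks]
    if not any(texts):
        return "media"  # nothing readable: likely a photo or video
    if any(DURATION_RE.match(t) for t in texts):
        return "audio"
    if any(FILE_RE.search(t) for t in texts):
        return "file"
    return "text"


def infer_sender(x_center: float, panel_width: float) -> str:
    """Right half of the chat panel is "me", left half is "other"."""
    return "me" if x_center > panel_width / 2 else "other"


print(classify_msg_type(["0:21"]))
print(classify_msg_type(["informe.pdf", "2 páginas"]))
print(classify_msg_type(["", "  "]))
print(classify_msg_type(["Hola!"]))
print(infer_sender(600, 700))
```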
Test suite
All tests run offline — no browser, no real WA account, no API calls. Browser methods are mocked with AsyncMock. The vision tests use synthetic images generated in-memory with Pillow + numpy.
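As an illustration of the AsyncMock approach, the sketch below tests a small coroutine against a fully mocked session. `send_greeting` is a hypothetical helper written for this example, and the single-argument signatures used for `navigate_to_contact` and `send_message` are assumptions, not the real WASession API.

```python
import asyncio
from unittest.mock import AsyncMock, call


async def send_greeting(session, contact: str, text: str) -> None:
    """Hypothetical orchestration helper: open a chat, then send one message."""
    await session.navigate_to_contact(contact)
    await session.send_message(text)


def test_send_greeting_calls_session_in_order() -> None:
    # AsyncMock makes every attribute an awaitable mock, so no browser is needed.
    session = AsyncMock()
    asyncio.run(send_greeting(session, "Ana García", "Hola!"))

    session.navigate_to_contact.assert_awaited_once_with("Ana García")
    session.send_message.assert_awaited_once_with("Hola!")
    # The parent mock records child calls in order.
    assert session.mock_calls == [
        call.navigate_to_contact("Ana García"),
        call.send_message("Hola!"),
    ]


test_send_greeting_calls_session_in_order()
```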
Design decisions
Key architectural choices and the reasoning behind them.
Why: WhatsApp Web's DOM is minified, obfuscated, and changes with every deploy. CSS selectors and JS APIs that work today silently break tomorrow. Visual rendering is stable because it is anchored to human perception.
Trade-off: slower than DOM parsing, but resilient to any non-visual WA change.
Pass --force-device-scale-factor=1 to Chrome.
Why: on macOS Retina, Chrome defaults to DPR=2. With a 1280×1920 window spec, the effective CSS viewport becomes 640×960, half the size in each dimension. Screenshots show far fewer messages and bounding box math breaks. Forcing DPR=1 maps --window-size 1:1 to CSS pixels.
Test guard: TestViewportRegression ensures this never regresses.
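The arithmetic behind this decision is short enough to check by hand. The helper below is a hypothetical illustration of the relationship stated above (CSS size = window size divided by DPR), not code from the project.

```python
def css_viewport(window_w: int, window_h: int, dpr: float) -> tuple[int, int]:
    """Effective CSS viewport for a given window size and device pixel ratio."""
    return round(window_w / dpr), round(window_h / dpr)


# Retina default: half the CSS size in each dimension.
print(css_viewport(1280, 1920, dpr=2))
# Forced DPR=1: the window size maps 1:1 to CSS pixels.
print(css_viewport(1280, 1920, dpr=1))
```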
wavi connect starts Chrome once; all other commands connect via CDP and never kill Chrome.
Why: WhatsApp stores its auth state in IndexedDB inside the Chrome user-data-dir. Killing Chrome mid-session corrupts IndexedDB and forces re-authentication. CDP connect/disconnect is cheap; Chrome restart is expensive and lossy.
Shutdown: always via wavi stop, which navigates to about:blank first.
Primary dedup key = dom_id (WA's data-id); fallback = OCR content key.
Why: scroll regions overlap by ~15%. A message visible at the bottom of frame N appears again at the top of frame N+1. OCR text alone can vary between captures (noise, punctuation). The DOM ID is immutable for a given message.
Fallback: when DOM ID assignment fails (element off-screen), OCR key (sender + type + text[:80] + timestamp) handles it.
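The two-tier key can be sketched as follows. The field names (`dom_id`, `sender`, `type`, `text`, `timestamp`) mirror the description above, but the dict layout and the function names are assumptions for illustration only.

```python
def dedup_key(bubble: dict) -> tuple:
    """Prefer the DOM ID; fall back to an OCR-derived content key."""
    if bubble.get("dom_id"):
        return ("dom", bubble["dom_id"])
    # Fallback: sender + type + first 80 characters of text + timestamp.
    return ("ocr", bubble["sender"], bubble["type"],
            bubble["text"][:80], bubble["timestamp"])


def dedup(bubbles: list[dict]) -> list[dict]:
    """Keep the first occurrence of each message, preserving order."""
    seen: set[tuple] = set()
    unique = []
    for bubble in bubbles:
        key = dedup_key(bubble)
        if key not in seen:
            seen.add(key)
            unique.append(bubble)
    return unique


frames = [
    {"dom_id": "m1", "sender": "me", "type": "text", "text": "Hola!", "timestamp": "2:15 p. m."},
    # Same message seen in the next frame, with slightly different OCR text.
    {"dom_id": "m1", "sender": "me", "type": "text", "text": "Hola", "timestamp": "2:15 p. m."},
    {"dom_id": None, "sender": "other", "type": "text", "text": "Bien", "timestamp": "2:16 p. m."},
    {"dom_id": None, "sender": "other", "type": "text", "text": "Bien", "timestamp": "2:16 p. m."},
]
print(len(dedup(frames)))
```

The second frame shows why the DOM ID comes first: the OCR text differs by one character, so a content key alone would keep both copies.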
Launch Chrome with arch -arm64.
Why: on Apple Silicon, Chrome launched without explicit arch may run under Rosetta (x86 emulation). WhatsApp Web's service worker and IndexedDB behave inconsistently under Rosetta: the session appears authenticated but the chat list never loads (frozen). ARM64 native is stable.
Memory: this is saved in project memory to prevent regression.
Try headless first in wavi connect; only open a visible window if QR is needed.
Why: once authenticated, the session is stored in the Chrome user-data-dir. Headless restores it silently. Opening a visible window unnecessarily is disruptive and slower.
Strategy: headless → check WA auth → if QR needed, kill headless, open visible → scan → kill visible → restart headless.
transcribe_history_audios() runs after runner.close(), not during the scroll loop.
Why: Groq API calls can take 1-3 seconds each. Running them inside the Playwright event loop while Chrome is open risks timeouts and interleaves async operations in hard-to-predict ways. Closing the browser first, then transcribing serially, is safe and deterministic.
Claim CDP ports from the society daemon (http://localhost:8700); fall back to local socket scan.
Why: multiple wavi sessions (and other local agents) share the 9200–9249 port range. Atomic POST /ports/claim prevents race conditions when two sessions start simultaneously. Three-tier fallback (claim → free+register → local scan) ensures the tool works even without the society daemon.
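Only the last tier, the local scan, is sketched here, since the daemon's request and response format is not described in this document. The function name `find_free_port` is hypothetical, and probing with a throwaway bind is one common way to implement such a scan, not necessarily the one wavi uses.

```python
import socket
from typing import Iterable, Optional

PORT_RANGE = range(9200, 9250)  # 9200–9249 inclusive, as stated above


def find_free_port(ports: Iterable[int] = PORT_RANGE,
                   host: str = "127.0.0.1") -> Optional[int]:
    """Return the first port that can be bound locally, or None if all are taken."""
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # already in use, try the next one
            return port  # the probe socket is closed on leaving the with-block
    return None


print(find_free_port())
```

The probe releases the port before the caller uses it, so two processes scanning at the same moment can pick the same port. That window is the race an atomic claim avoids, which is why the scan sits last in the fallback chain.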