WhatsApp automation via vision

wavi

Read full chat histories. Capture audio. Transcribe voice messages. Without touching the DOM.

Python 3.11+ · macOS arm64 · Playwright CDP · Apple Vision OCR · Groq Whisper · headless Chrome

$ wavi connect   # authenticates once, stays headless
$ wavi get default "Ana García" --from 2025-01-01
✓ 247 mensajes capturados, 12 audios transcritos → output/ana_garcía/history_bubbles.json
$ wavi send default "Ana García" "Hola, te paso el resumen"
✓ Mensaje enviado

Philosophy

WhatsApp Web is a black box. Its DOM changes with every deploy, its CSS selectors rotate without warning, and its JavaScript API is private. Any scraper that reads the DOM is brittle by design.

wavi takes a different bet: what a human sees on screen is stable even when the internals change. Chat bubbles look like chat bubbles. Timestamps look like timestamps. Voice messages look like voice messages. That rendering contract is anchored to human expectation — it barely changes across WA versions.

The core thesis

Vision-based extraction is slower than DOM parsing but orders of magnitude more resilient. The investment in OCR and image analysis pays dividends every time WhatsApp ships a breaking DOM change — because wavi doesn't care.

Why this matters beyond WhatsApp

Every technique in this codebase — bubble detection, OCR timestamp extraction, blob URL interception, DOM anchor deduplication — is a reusable building block. The same approach applies to Telegram Web, Slack, or any chat UI that follows a similar visual grammar.

We are building a vocabulary for visual automation. Each solved problem here is a solved problem everywhere.

Speed is not the enemy

Chrome headless + Playwright CDP + Apple Vision OCR is fast enough for real workflows: a 300-message chat captures in under 2 minutes, and a 1000-message chat in under 10.

The system prioritizes correctness over raw speed: DOM anchor deduplication ensures no message appears twice even across overlapping scroll windows; DOM IDs take priority over OCR text for dedup keys because OCR is imperfect. Every piece has a fallback.
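The dedup-key rule can be illustrated with a minimal sketch. It assumes a simplified Bubble with only the fields the key needs; the `dom:`/`ocr:` prefixes and the `dedup` helper are hypothetical and only show the "DOM id first, OCR fallback" idea, not wavi's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Bubble:
    # Simplified, hypothetical subset of the real Bubble fields.
    sender: str
    msg_type: str
    text: str
    timestamp: Optional[str] = None
    dom_id: Optional[str] = None


def bubble_key(b: Bubble) -> str:
    """Prefer the DOM id; fall back to a composite key built from OCR output."""
    if b.dom_id:
        return f"dom:{b.dom_id}"
    return f"ocr:{b.sender}|{b.msg_type}|{b.text}|{b.timestamp}"


def dedup(bubbles: list) -> list:
    """Keep the first occurrence of each key, preserving order."""
    seen = set()
    unique = []
    for b in bubbles:
        key = bubble_key(b)
        if key not in seen:
            seen.add(key)
            unique.append(b)
    return unique


# Two overlapping scroll windows: OCR read the same message differently,
# but the shared dom_id still collapses it to a single entry.
window_a = [Bubble("me", "text", "Hola, te paso el resumen", "2:15 p. m.", dom_id="A1")]
window_b = [Bubble("me", "text", "Hola, te paso el resurnen", "2:15 p. m.", dom_id="A1")]
merged = dedup(window_a + window_b)
```

With an OCR-only key, the two readings above would count as different messages, which is why the DOM id takes priority whenever it is available.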

Quick start

1
Install
Requires Python 3.11+, pipx, and Google Chrome for macOS (arm64).
# Isolated global install via pipx
pipx install --editable .

# Or dev install in the project venv
pip install -e ".[dev]"
2
Connect once
Launches Chrome headless. If QR is needed, captures it headlessly and writes a local HTML file with a countdown timer. Scan with your phone from that page — no visible Chrome window ever opens. The session persists between reboots.
wavi connect --open  # first-time: opens QR page in browser
wavi connect         # subsequent: restores session headlessly
wavi status          # verify: daemon=running, session=restored
3
Set GROQ_API_KEY (required)
Mandatory. wavi cannot transcribe voice messages without this key, and transcription is a core feature, not an optional extra. Without GROQ_API_KEY, audio messages are captured but never converted to text.
echo "GROQ_API_KEY=gsk_..." >> .env
How to get your key:
1. Go to console.groq.com/keys — free account, no credit card required
2. Sign up or log in → click Create API Key
3. Copy the key (starts with gsk_) and paste it into your .env
4
Capture a chat
Scrolls the full history, captures every bubble, downloads audio files, transcribes them. Output in output/<contact>/.
wavi get default "Ana García"
# → output/ana_garcía/history_bubbles.json
# → output/ana_garcía/iter_*/screenshot.png
# → output/ana_garcía/iter_*/audio_*.ogg
5
Send a message
Navigates to the contact, types the message, presses Enter. Optionally saves a screenshot of the chat after sending.
wavi send default "Ana García" "Hola!"
wavi send default "+54 9 11 5561 2767" "test" --screenshot-out sent.png
6
List contacts
Opens the "New chat" panel and scrolls the full virtualized list, collecting every contact with anchor-based overlap dedup. Saves contacts_list.json and screenshot.png to the output folder, overwriting on each run.
wavi list-contacts
# → output/contacts/contacts_list.json
# → output/contacts/screenshot.png

Commands

All commands accept an optional SESSION argument (default: "default"). Multiple sessions can run simultaneously on different CDP ports.

wavi connect [SESSION]

Start or recover the Chrome daemon. Tries headless first — no visible Chrome window ever opens. If QR is needed, captures it headlessly and writes data/qr.html with a live countdown. The connect session closes when the QR is scanned (→ daemon stays up) or expires (→ run again).

--open: open the QR page automatically in the default browser
--new: skip any existing session, force a fresh QR scan; session folder is named after the detected phone number

wavi get [SESSION] CONTACT

Scroll the full chat history with CONTACT. Deduplicates messages across scroll iterations. Downloads and transcribes audio messages if GROQ_API_KEY is set.

--from YYYY-MM-DD --newest --grow --max-iter N --assets DIR --json-out

wavi send [SESSION] CONTACT MESSAGE

Open CONTACT's chat and send MESSAGE. Use your own phone number as CONTACT for self-message testing.

--screenshot-out FILE

wavi status [SESSION]

Check if the Chrome daemon is running and whether WhatsApp is authenticated. Returns session=restored when ready.

wavi stop [SESSION]

Gracefully shut down the Chrome daemon. Navigates to about:blank first so WhatsApp flushes IndexedDB before SIGTERM. Never use kill -9 directly.

wavi queue [SESSION]

Show the current operation status for a session. Prints idle or details of the running operation (type, contact, PID, elapsed time).

--json-out

wavi bubbles SCREENSHOT

Run the vision pipeline on a saved screenshot file. No browser needed. Useful for debugging OCR and bubble detection on a specific frame.

--assets DIR --json-out --debug/--no-debug

wavi boarding

Open this page in the default browser.

wavi list-contacts [SESSION]

Open WhatsApp's "New chat" panel and extract all contacts (DOM-based). Scrolls the virtualized list to the bottom with anchor-based overlap dedup (85% steps), so contacts outside the initial viewport are captured too. Saves contacts_list.json and screenshot.png to the output folder, overwriting on every run.

--assets DIR  (default: output/contacts) --json-out --no-headless

wavi check-updates [SESSION]

Check the WhatsApp sidebar for new inbound messages. Snapshots every visible chat row via DOM ({name, last_message, timestamp, direction}) and compares it against the previous saved state. A chat is reported only when its last message changed and is inbound (no delivery-tick icon) — outgoing messages and re-reads never trigger an update.

Writes two files to output/<session>/last-updates/:

  • updates.json: {status, checked_at, contacts, new_inbound} (also the previous-state baseline for the next run)
  • snapshot_current.png: sidebar screenshot from this run (debugging)

--assets DIR --reset  (force first-run, ignoring previous snapshot)

List Contacts — output

Running wavi list-contacts opens the WhatsApp "New chat" panel, scrolls the full virtualized list extracting contacts via DOM ([role="listitem"]), and writes two files to output/<session>/contacts/ — overwriting on every run:

Browser viewport — 1280 px wide (ADR-002)

The screenshot below is the actual Chrome viewport captured during the last run. The "New chat" panel occupies the left ~580 px (same as the sidebar used by the message capture pipeline). The right panel shows the previously open chat — irrelevant to contact extraction.

[Figure: contact list panel (left 580 px), showing the WA "New chat" contact list]

[Figure: full Chrome viewport (1280 px)]

Full-list scroll (implemented)

The panel uses WA's virtual list renderer — contacts outside the viewport are not in the DOM. _scroll_all_contacts() iterates scrollTop in ~85% viewport steps (guaranteed overlap) and dedups by anchoring on the last contact of the previous page, the same strategy capture_full_history uses for messages. Stops at the bottom or after 3 consecutive scroll stalls.
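The anchor-based overlap dedup described above can be sketched as follows. This is a simplified model, not the actual _scroll_all_contacts(): each "page" is the list of contacts visible after one scroll step, contacts are plain (name, subtitle) tuples, and a "stall" is assumed to mean a page that adds nothing new. `merge_page` and `collect_all` are hypothetical names.

```python
def merge_page(collected: list, page: list) -> int:
    """Append the rows of `page` that come after the anchor.

    The anchor is the last contact collected so far. Consecutive pages
    overlap, so the anchor normally reappears in the new page and
    everything after it is new. Returns the number of rows added.
    """
    if not collected:
        collected.extend(page)
        return len(page)
    anchor = collected[-1]
    if anchor in page:
        new_rows = page[page.index(anchor) + 1:]
    else:
        # Anchor not visible (no overlap): keep the whole page rather than lose rows.
        new_rows = list(page)
    collected.extend(new_rows)
    return len(new_rows)


def collect_all(pages, max_stalls: int = 3) -> list:
    """Merge pages in order; stop after `max_stalls` consecutive empty merges."""
    collected = []
    stalls = 0
    for page in pages:
        added = merge_page(collected, page)
        stalls = 0 if added else stalls + 1
        if stalls >= max_stalls:
            break
    return collected


# Four overlapping viewports of a five-contact list; the last one is a stall.
pages = [
    [("Ana", ""), ("Bob", ""), ("Carla", "")],
    [("Bob", ""), ("Carla", ""), ("Dario", "")],
    [("Carla", ""), ("Dario", ""), ("Eva", "")],
    [("Carla", ""), ("Dario", ""), ("Eva", "")],
]
contacts = collect_all(pages)
```

Anchoring on the previous page's last row avoids comparing every row against everything seen so far, at the cost of assuming that the list order is stable while scrolling.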

Output structure

output/contacts/          # overwritten on every run
  contacts_list.json      # [{"name": "...", "subtitle": "..."}, ...]
  screenshot.png          # 1280×1920 browser viewport

Check Updates — output

Running wavi check-updates inspects the WhatsApp sidebar for new inbound messages. Detection is DOM-state based: every visible chat row is captured as {name, last_message, timestamp, direction} and compared against the previous run's saved state in updates.json.

Algorithm

  1. ensure_chat_list() — close the New Chat panel if open, press Escape ×3 to clear search/overlays, wait for the chat list.
  2. extract_sidebar_updates() — snapshot every visible chat row via DOM. direction is outbound when a delivery-tick icon (msg-check, msg-dbl-check, …) is present in the preview row, inbound otherwise.
  3. No previous state (or --reset) → status=first_run, save baseline and return.
  4. Compare row-by-row against the previous state: any inbound row whose last_message or timestamp changed → status=updates with those rows in new_inbound. No changes → status=no_updates.
  5. Write updates.json (new state, replaces the old baseline) and snapshot_current.png.

Limitation: only the last visible message per chat is tracked. If several messages arrive between two checks, only the most recent one is reported per contact — use wavi get afterwards to retrieve the full history.
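Steps 3 and 4 of the algorithm can be sketched as a pure function over two sidebar snapshots. The sketch assumes rows are matched by name and that a chat missing from the previous state counts as changed; neither detail is confirmed by the description above, and `diff_sidebar` is a hypothetical name. The checked_at field is omitted for brevity.

```python
def diff_sidebar(previous, current):
    """Compare two sidebar snapshots.

    `previous` is the saved list of row dicts, or None on the first run
    (or with --reset). `current` is the list of rows from this run.
    """
    if previous is None:
        return {"status": "first_run", "contacts": current, "new_inbound": []}

    prev_by_name = {row["name"]: row for row in previous}
    new_inbound = []
    for row in current:
        if row["direction"] != "inbound":
            continue  # outgoing messages never trigger an update
        old = prev_by_name.get(row["name"])
        # Assumption: a chat that was not in the previous state counts as changed.
        changed = old is None or (
            (old["last_message"], old["timestamp"])
            != (row["last_message"], row["timestamp"])
        )
        if changed:
            new_inbound.append(row)

    status = "updates" if new_inbound else "no_updates"
    return {"status": status, "contacts": current, "new_inbound": new_inbound}


before = [
    {"name": "Ana", "last_message": "ok", "timestamp": "10:01", "direction": "inbound"},
    {"name": "Bob", "last_message": "hola", "timestamp": "09:30", "direction": "outbound"},
]
after = [
    {"name": "Ana", "last_message": "llegaste?", "timestamp": "10:15", "direction": "inbound"},
    {"name": "Bob", "last_message": "chau", "timestamp": "10:20", "direction": "outbound"},
]
result = diff_sidebar(before, after)
```

Here only Ana's row is reported: Bob's last message changed too, but it is outbound.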

Output structure

output/<session>/last-updates/
  snapshot_current.png  # sidebar at this run (debugging)
  updates.json          # {status, checked_at, contacts: [{name, last_message, timestamp, direction}], new_inbound}

Status values

status      | meaning
first_run   | No previous state (or --reset); current sidebar saved as baseline.
no_updates  | No inbound row changed its last message since the previous run.
updates     | At least one inbound row changed; see new_inbound.

Sequence diagram

sequenceDiagram
    participant U as User
    participant CLI as cli.py
    participant R as WARunner
    participant S as WASession
    participant C as Chrome CDP
    participant FS as last-updates/
    U->>CLI: wavi check-updates [session]
    CLI->>R: check_updates(assets_dir)
    R->>S: connect_over_cdp(:port)
    C-->>S: browser context
    S-->>R: restored
    R->>S: ensure_chat_list()
    Note over S,C: close New Chat panel + Escape ×3 → wait for chat-list
    R->>S: extract_sidebar_updates()
    S->>C: JS per chat row → {name, last_message, timestamp, direction}
    Note over S,C: direction = outbound if msg-check/msg-dbl-check tick present
    C-->>R: current sidebar state
    alt previous updates.json exists AND not --reset
        R->>FS: read previous contacts state
        R->>R: compare row-by-row (inbound + last_message/timestamp changed)
        alt no inbound row changed
            R-->>CLI: {status: no_updates, contacts, new_inbound: []}
        else inbound changes found
            R-->>CLI: {status: updates, contacts, new_inbound}
        end
    else first run or --reset
        R-->>CLI: {status: first_run, contacts, new_inbound: []}
    end
    R->>S: screenshot()
    S->>C: page.screenshot()
    C-->>S: PNG bytes
    R->>FS: write snapshot_current.png
    R->>FS: write updates.json (new baseline)
    R->>S: close()
    Note over S: Playwright disconnects — Chrome stays alive
    CLI-->>U: status + new_inbound list + output path

Architecture

System overview

wavi is a thin CLI on top of a three-layer stack: session management (Chrome CDP via Playwright), vision (pure image processing), and orchestration (WARunner ties them together with scroll logic and audio capture).

graph LR
    subgraph CLI["cli.py · Click"]
        connect; get; send; stop; status; queue; bubbles; boarding; list-contacts; check-updates
    end
    subgraph Core["Core modules"]
        session["session.py\nWASession · CDP"]
        runner["runner.py\nWARunner · orchestrator"]
        vision["vision.py\nOCR + classify"]
        detector["element_detector.py\nbbox detection"]
        transcribe["transcription.py\nGroq Whisper"]
        q["queue.py\nsession lock"]
    end
    subgraph External["External"]
        chrome["Chrome headless\nCDP :9200+"]
        wa["WhatsApp Web"]
        groq["Groq API\nwhisper-large-v3"]
    end
    subgraph Output["Output"]
        json["history_bubbles.json"]
        ogg["audio_*.ogg"]
        contacts["contacts_list.json\n+ screenshot.png"]
    end
    CLI --> session
    CLI --> runner
    CLI --> q
    runner --> session
    runner --> vision
    runner --> transcribe
    vision --> detector
    session --> chrome
    chrome --> wa
    transcribe --> groq
    runner --> json
    runner --> ogg
    runner --> contacts

Module dependency graph

graph TD
    cli["cli.py"]
    runner["runner.py"]
    session["session.py"]
    vision["vision.py"]
    detector["element_detector.py"]
    transcription["transcription.py"]
    queue["queue.py"]
    cli --> runner
    cli --> session
    cli --> queue
    runner --> session
    runner --> vision
    runner --> transcription
    vision --> detector
    style cli fill:#128C7E,color:#fff,stroke:#25D366
    style runner fill:#1c2330,color:#d4dce8,stroke:#2a3547
    style session fill:#1c2330,color:#d4dce8,stroke:#2a3547
    style vision fill:#1c2330,color:#d4dce8,stroke:#2a3547
    style detector fill:#141920,color:#7a8a9e,stroke:#2a3547
    style transcription fill:#141920,color:#7a8a9e,stroke:#2a3547
    style queue fill:#141920,color:#7a8a9e,stroke:#2a3547

Command sequences

One sequence diagram per command. Variants show only the delta from the base flow. Solid arrows (->>) = action/request; dashed arrows (-->>) = response/return.

wavi get — full history capture

sequenceDiagram
    participant U as User
    participant CLI as cli.py
    participant Q as queue.py
    participant R as WARunner
    participant S as WASession
    participant C as Chrome CDP
    participant V as vision.py
    participant G as Groq API
    U->>CLI: wavi get [session] contact
    CLI->>Q: is_locked() + session_lock()
    CLI->>R: run_enhanced()
    R->>S: connect_over_cdp(:port)
    C-->>S: browser context
    S-->>R: restored
    R->>S: navigate_to_contact(contact)
    Note over S,C: mouse.click(317,80) — fixed coords, search box
    S->>C: keyboard.type(contact, delay=40ms)
    S->>C: keyboard.press(ArrowDown + Enter)
    C-->>S: WA chat open at bottom
    R->>S: install_blob_monitor()
    Note over S,C: JS hooks HTMLMediaElement.src + URL.createObjectURL
    R->>S: get_dpr()
    C-->>R: 1.0
    R->>S: get_visible_message_ids()
    C-->>R: list of id+vy pairs from DOM data-id
    R->>V: analyze(screenshot) — iter_000
    V-->>R: Bubble[]
    R->>R: _assign_dom_ids() — match bubble y-center to DOM ids
    R->>R: _download_audio_for_bubbles()
    Note over R,C: JS find play buttons by aria-label, coordinate click, drain_blobs, fetch_blob
    loop scroll iterations up to max_iter
        R->>S: get_chat_scroll_state()
        C-->>R: scrollTop, scrollHeight, clientHeight
        alt scrollTop < 20
            Note over R: reached top — stop
        end
        Note over R: save anchor = topmost bubble (oldest visible)
        R->>S: scroll_chat_up(css_px)
        S->>C: JS el.scrollTop -= pixels
        R->>S: get_visible_message_ids()
        C-->>R: id+vy pairs
        R->>V: analyze(screenshot) — iter_N
        V-->>R: Bubble[]
        R->>R: _assign_dom_ids()
        R->>R: find anchor in new bubbles — prefer dom_id, fallback OCR key
        R->>R: new content = bubbles above anchor y (positional dedup)
        R->>R: key dedup — bubble_key = dom_id or sender+type+text+timestamp
        R->>R: _download_audio_for_bubbles() — skip already downloaded dom_ids
        R->>R: extract_day_pills() — date metadata for bubbles in this iter
    end
    R->>S: close() — Playwright disconnect, Chrome stays alive
    R->>G: transcribe_history_audios() — second pass after browser close
    G-->>R: transcript text per .ogg
    R->>R: renumber IDs 1..N (id=1 = newest) + write history_bubbles.json
    R-->>CLI: Bubble[]
    CLI-->>U: N messages, M audios

wavi get --newest — incremental update

Loads the existing history_bubbles.json and stops scrolling as soon as the first already-known message is found. New messages are prepended and IDs renumbered.

sequenceDiagram
    participant U as User
    participant CLI as cli.py
    participant R as WARunner
    participant S as WASession
    participant V as vision.py
    U->>CLI: wavi get contact --newest
    CLI->>CLI: load history_bubbles.json
    CLI->>CLI: build known_keys set (dom_id or sender+type+text+ts)
    CLI->>R: run_enhanced(newest=True, known_keys)
    R->>S: connect() + navigate_to_contact()
    loop scroll iterations
        R->>V: analyze(screenshot)
        V-->>R: Bubble[]
        R->>R: _assign_dom_ids()
        loop for each candidate bubble
            R->>R: bubble_key(b) in known_keys?
            alt duplicate found
                Note over R: should_stop_newest = true — break inner loop
            end
        end
        R->>S: scroll_chat_up()
        opt should_stop_newest
            Note over R: break outer loop
        end
    end
    R->>R: merge new_bubbles prepended to existing_bubbles
    R->>R: renumber IDs 1..N (newest=1)
    R->>R: write history_bubbles.json
    R-->>CLI: updated bubble list
    CLI-->>U: history updated
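The merge step of --newest can be sketched in a few lines. The sketch assumes bubbles are plain dicts stored newest-first (consistent with id=1 being the newest) and that the scan yields bubbles from the bottom of the chat upward; `merge_newest` and `key_of` are hypothetical names.

```python
def key_of(bubble: dict) -> str:
    """DOM id when present, otherwise a composite of OCR-derived fields."""
    if bubble.get("dom_id"):
        return bubble["dom_id"]
    return "|".join(
        str(bubble.get(field)) for field in ("sender", "msg_type", "text", "timestamp")
    )


def merge_newest(existing: list, scanned: list) -> list:
    """Prepend newly seen bubbles to the saved history and renumber ids.

    `existing` is the saved history, newest first.
    `scanned` is what the scroll loop read, newest first.
    Scanning stops at the first bubble that is already known.
    """
    known = {key_of(b) for b in existing}
    new_bubbles = []
    for bubble in scanned:
        if key_of(bubble) in known:
            break  # first already-known message: everything older is saved
        new_bubbles.append(bubble)

    merged = [dict(b) for b in new_bubbles + existing]  # copy, do not mutate inputs
    for new_id, bubble in enumerate(merged, start=1):
        bubble["id"] = new_id  # id=1 is the newest message
    return merged


saved = [
    {"id": 1, "dom_id": "m3", "text": "c"},
    {"id": 2, "dom_id": "m2", "text": "b"},
    {"id": 3, "dom_id": "m1", "text": "a"},
]
seen_on_screen = [
    {"dom_id": "m5", "text": "e"},
    {"dom_id": "m4", "text": "d"},
    {"dom_id": "m3", "text": "c"},
    {"dom_id": "m2", "text": "b"},
]
history = merge_newest(saved, seen_on_screen)
```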

wavi get --from YYYY-MM-DD — date-bounded capture

Stops scrolling when the oldest visible day-separator pill is earlier than the requested date. Bubbles older than from_date are dropped from that iteration.

sequenceDiagram
    participant U as User
    participant CLI as cli.py
    participant R as WARunner
    participant S as WASession
    participant V as vision.py
    U->>CLI: wavi get contact --from 2024-01-15
    CLI->>CLI: parse from_date = date(2024, 1, 15)
    CLI->>R: run_enhanced(from_date=2024-01-15)
    R->>S: connect() + navigate_to_contact()
    loop scroll iterations
        R->>V: analyze(screenshot)
        V-->>R: Bubble[]
        R->>R: extract_day_pills() from cropped screenshot
        R->>R: parse pill dates — find oldest pill in current view
        alt oldest pill date < from_date
            R->>R: for each bubble compute its date from nearest pill above it
            R->>R: drop bubbles with date < from_date
            Note over R: should_stop_from_date = true
        end
        R->>S: scroll_chat_up()
        opt should_stop_from_date
            Note over R: break loop
        end
    end
    R->>R: write history_bubbles.json (messages on/after from_date only)
    R-->>CLI: filtered Bubble[]
    CLI-->>U: history from 2024-01-15 onward
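The per-iteration date filter can be sketched as below. It assumes bubbles and day pills are reduced to (y, value) pairs in the same crop coordinates, and that a bubble with no pill above it in the current frame is kept because its date is unknown; `filter_from_date` is a hypothetical name.

```python
from datetime import date


def filter_from_date(bubbles, pills, from_date):
    """Drop bubbles older than `from_date` in one screenshot.

    bubbles: list of (y, text) pairs.
    pills:   list of (y, date) pairs, one per day-separator pill.
    Returns (kept_bubbles, should_stop).
    """
    pills = sorted(pills)  # top to bottom
    # Stop scrolling once any visible pill is earlier than the requested date.
    should_stop = any(pill_date < from_date for _, pill_date in pills)

    kept = []
    for y, text in bubbles:
        above = [pill_date for pill_y, pill_date in pills if pill_y <= y]
        bubble_date = above[-1] if above else None  # nearest pill above the bubble
        if bubble_date is not None and bubble_date < from_date:
            continue
        kept.append((y, text))
    return kept, should_stop


pills = [(10, date(2024, 1, 14)), (300, date(2024, 1, 15))]
bubbles = [(5, "unknown day"), (50, "old"), (350, "new")]
kept, should_stop = filter_from_date(bubbles, pills, date(2024, 1, 15))
```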

wavi get --grow — paged history capture

Loads history_bubbles.json and grow_checkpoint.json, fast-forwards past known messages, then captures --max-iter N new-content iterations toward the past. Each run extends the history by one block. When scrollTop reaches 0 the checkpoint is marked complete and future runs exit immediately. Incompatible with --newest.

sequenceDiagram
    participant U as User
    participant CLI as cli.py
    participant R as WARunner
    participant S as WASession
    participant V as vision.py
    U->>CLI: wavi get contact --grow --max-iter 10
    CLI->>CLI: load history_bubbles.json → known_keys
    CLI->>CLI: load grow_checkpoint.json → anchor dom_id
    alt checkpoint.completed == true
        CLI-->>U: "history already complete — nothing to do"
    end
    CLI->>R: run_enhanced(grow=True, max_iterations=10)
    R->>S: connect() + navigate_to_contact()
    note over R: Phase 1 — fast-forward (cheap DOM polls)
    loop scroll until anchor dom_id visible
        R->>S: get_visible_message_ids()
        S-->>R: [{id, vy}, ...]
        alt anchor found
            Note over R: fast_forwarded = true — exit FF loop
        end
        R->>S: scroll_chat_up()
    end
    note over R: Phase 2 — normal capture (grow_new_iters counts new-content iters)
    loop until grow_new_iters == 10 or scrollTop == 0
        R->>V: analyze(screenshot)
        V-->>R: Bubble[]
        R->>R: filter known_keys (skip, don't stop)
        R->>R: new_count > 0 → grow_new_iters++
        R->>S: scroll_chat_up()
    end
    R->>R: merge existing_bubbles + new all_bubbles
    R->>R: renumber IDs 1..N
    R->>R: write history_bubbles.json
    R->>R: write grow_checkpoint.json (oldest dom_id, completed flag)
    R-->>CLI: new Bubble[]
    CLI-->>U: N new messages captured (M total)

wavi send

sequenceDiagram
    participant U as User
    participant CLI as cli.py
    participant Q as queue.py
    participant S as WASession
    participant C as Chrome CDP
    U->>CLI: wavi send [session] contact message
    CLI->>Q: is_locked() + session_lock()
    CLI->>S: connect()
    S->>C: connect_over_cdp(:port)
    C-->>S: restored
    CLI->>S: navigate_to_contact(contact)
    Note over S,C: mouse.click(317,80) + keyboard.type() + ArrowDown+Enter
    S->>C: coordinate click + keyboard events
    C-->>S: WA chat open
    CLI->>S: send_message(message)
    S->>C: JS _FIND_COMPOSE_INPUT_JS — locate footer contenteditable
    C-->>S: x,y viewport coords of compose box
    S->>C: mouse.click(x, y) — coordinate click using JS-returned coords
    S->>C: keyboard.type(text, delay=30ms)
    Note over S: multiline: Shift+Enter per newline, bare Enter to send
    S->>C: keyboard.press(Enter)
    S->>C: JS check compose box empty
    alt compose empty — message sent
        C-->>S: sent = true
    else Enter acted as newline
        S->>C: JS click send icon button — span[data-icon=send]
        C-->>S: sent = true
    end
    S-->>CLI: sent — input at x,y
    opt --screenshot-out FILE
        CLI->>S: screenshot_to_file(path)
        S->>C: page.screenshot(type=png)
        C-->>S: PNG bytes
        S-->>CLI: saved at path
    end
    CLI->>S: close()
    CLI-->>U: Mensaje enviado a contact

wavi connect — session already active (fast path)

sequenceDiagram
    participant U as User
    participant CLI as cli.py
    participant FS as filesystem
    participant S as WASession
    participant C as Chrome daemon
    U->>CLI: wavi connect [session]
    CLI->>FS: read chrome_daemon.pid
    FS-->>CLI: pid=N
    CLI->>CLI: _is_process_alive(pid) = true
    CLI->>S: connect() via CDP
    S->>C: connect_over_cdp(:port)
    C-->>S: browser context
    S-->>CLI: restored
    CLI-->>U: Session active (PID N, CDP port)

wavi connect — QR scan required

Optimistic headless probe first. If QR is needed, captures it headlessly and writes a local HTML file — no visible Chrome window ever opens. Session dir is renamed to the phone number once scanned. QR expiry detected via data-ref change.

sequenceDiagram
    participant U as User
    participant CLI as cli.py
    participant Ch as Chrome headless
    participant H as data/qr.html
    U->>CLI: wavi connect [--open] [--new] [session]
    Note over CLI: no alive daemon (or --new forces fresh profile)
    CLI->>CLI: _claim_port() from society registry (or local scan)
    CLI->>Ch: arch -arm64 chrome --headless=new --window-size=1280x1920
    CLI->>Ch: connect_over_cdp — navigate to WA Web
    Ch-->>CLI: qr_needed
    CLI->>Ch: screenshot [data-testid='qrcode'] element
    Ch-->>CLI: QR PNG bytes
    CLI->>H: write HTML (QR base64 + 60s countdown)
    CLI-->>U: "QR → data/qr.html"
    opt --open
        CLI-->>U: opens browser on data/qr.html
    end
    U->>U: scan QR from HTML page with phone
    loop poll every 2s
        CLI->>Ch: check AUTH_SEL / data-ref value
        alt QR scanned — WA authenticated
            Ch-->>CLI: chat-list visible
            CLI->>H: write connected HTML
        else QR expired (data-ref changed)
            Ch-->>CLI: new data-ref value
            CLI->>H: write expired HTML
            CLI-->>U: "QR expirado — ejecutá wavi connect de nuevo"
        end
    end
    CLI->>Ch: JS read phone number from WA Web store
    Ch-->>CLI: phone number (e.g. 5491155612767)
    CLI->>Ch: page.goto("about:blank") — WA flushes IndexedDB
    CLI->>Ch: SIGTERM
    Note over CLI: rename session dir → phone number, update .default alias
    CLI->>Ch: arch -arm64 chrome --headless=new (fresh daemon)
    Note over Ch: WA session restored from IndexedDB on disk
    CLI->>Ch: connect_over_cdp — verify auth
    Ch-->>CLI: restored
    CLI-->>U: Daemon headless activo (PID, port)

wavi stop

sequenceDiagram
    participant U as User
    participant CLI as cli.py
    participant S as WASession
    participant C as Chrome daemon
    U->>CLI: wavi stop [session]
    CLI->>S: connect() via CDP
    S->>C: connect_over_cdp
    C-->>S: page
    S->>C: page.goto("about:blank")
    Note over C: WA React app unmounts, flushes IndexedDB
    C-->>S: navigation ok
    S->>C: browser.close() — Playwright disconnects
    S->>C: os.kill(pid, SIGTERM)
    alt exits within 10s
        C-->>CLI: process terminated cleanly
    else timeout
        S->>C: os.kill(pid, SIGKILL)
    end
    S->>S: unlink PID + port files
    CLI->>CLI: _release_port() — deregister from society registry
    CLI-->>U: Daemon stopped cleanly.

wavi status

sequenceDiagram
    participant U as User
    participant CLI as cli.py
    participant S as WASession
    participant C as Chrome daemon
    U->>CLI: wavi status [session]
    CLI->>S: load_pid() + _is_process_alive(pid)
    alt pid alive
        S-->>CLI: daemon=running pid=N
    else no pid file or dead process
        S-->>CLI: daemon=stopped
    end
    CLI->>S: connect() via CDP
    S->>C: connect_over_cdp(:port)
    C-->>S: browser context
    S-->>CLI: restored | qr_needed | timeout | error
    CLI-->>U: session=restored
    CLI->>S: close()
    Note over S: Playwright disconnects — Chrome stays alive

wavi queue

sequenceDiagram
    participant U as User
    participant CLI as cli.py
    participant Q as queue.py
    participant FS as filesystem
    U->>CLI: wavi queue [session]
    CLI->>Q: get_status(profile)
    Q->>FS: read lock file
    alt no lock file
        FS-->>Q: not found
        Q-->>CLI: None
        CLI-->>U: session=default idle
    else lock file exists
        FS-->>Q: operation, pid, contact, started_at
        Q-->>CLI: status dict
        CLI->>CLI: compute elapsed from started_at
        CLI-->>U: operation=get contact='Ana' pid=N running 2m05s
    end

wavi bubbles — offline vision pipeline

No browser. Runs entirely from a saved PNG file. Useful for debugging detection and OCR on a specific frame.

sequenceDiagram
    participant U as User
    participant CLI as cli.py
    participant V as vision.py
    participant D as element_detector.py
    participant OCR as Swift Vision OCR
    U->>CLI: wavi bubbles screenshot.png [--assets DIR]
    CLI->>V: analyze(screenshot_path, assets_dir)
    V->>V: crop_chat_panel() — remove sidebar 580px + header 60px
    V->>OCR: ocr_tiled() — structural scan
    OCR-->>V: text blocks (date pills + split-point detection only)
    V->>D: detect_bubbles(img)
    Note over D: RGB color masks + scipy.ndimage connected components
    D-->>V: bbox list x,y,w,h,type
    V->>V: extract_day_pills() reusing structural scan — no extra OCR call
    V->>V: _split_bubbles_by_timestamps() — cut regions spanning multiple messages
    loop for each bubble bbox
        V->>V: crop region + upscale 2x-5x depending on height
        V->>OCR: _run_ocr(cropped_bubble)
        OCR-->>V: text blocks — canonical text for this element
        V->>V: classify_msg_type() — text / audio / file / media
        V->>V: _extract_timestamp() + _build_timestamp() — date from nearest pill
    end
    V-->>CLI: Bubble[]
    opt --debug
        V->>V: _save_debug_image() — bboxes + play-button crosses
    end
    CLI-->>U: N mensajes detectados

wavi list-contacts

DOM-based extraction — no vision pipeline. Opens the "New chat" panel, reads [role="listitem"] elements while scrolling the virtualized list to the bottom (anchor-based overlap dedup), writes contacts_list.json + screenshot.png to output/<session>/contacts/ (overwritten on every run).

sequenceDiagram
    participant U as User
    participant CLI as cli.py
    participant R as WARunner
    participant S as WASession
    participant C as Chrome CDP
    participant FS as output/contacts/
    U->>CLI: wavi list-contacts [session]
    CLI->>R: list_contacts(assets_dir)
    R->>S: connect_over_cdp(:port)
    C-->>S: browser context
    S-->>R: restored
    R->>S: navigate_to_new_chat()
    Note over S,C: JS span[data-icon="new-chat-outline"].click()
    S->>C: wait_for_selector [role="listitem"] timeout=8s
    C-->>S: contact list rendered
    R->>S: extract_contacts()
    S->>C: JS querySelectorAll [role="listitem"] → name+subtitle
    C-->>R: [{name, subtitle}, ...]
    R->>S: screenshot_to_file(assets_dir/screenshot.png)
    S->>C: page.screenshot()
    C-->>S: PNG bytes
    S-->>R: saved
    R->>FS: write contacts_list.json
    R->>S: close_new_chat()
    Note over S,C: JS span[data-icon="back-refreshed"].click() OR Escape
    R->>S: close()
    Note over S: Playwright disconnects — Chrome stays alive
    R-->>CLI: {contacts, screenshot, assets_dir}
    CLI-->>U: Found N contacts + Output: output/contacts/

Key classes

classDiagram
    class Bubble {
        +int id
        +int screen_id
        +str sender
        +str msg_type
        +str|None timestamp
        +str text
        +dict bbox
        +str|None dom_id
        +str|None audio_path
        +str|None transcript
        +as_dict() dict
    }
    class WASession {
        +Path profile_dir
        +bool headless
        +connect() str
        +navigate_to_contact(contact)
        +navigate_to_new_chat()
        +extract_contacts() list
        +close_new_chat()
        +extract_sidebar_updates() list
        +ensure_chat_list()
        +screenshot() bytes
        +get_chat_scroll_state() dict
        +scroll_chat_up(px)
        +install_blob_monitor()
        +drain_blobs() list
        +get_dpr() float
        +get_visible_message_ids() list
        +fetch_blob(url) bytes
        +stop_daemon()
        +daemon_alive() bool
    }
    class WARunner {
        +WASession session
        +connect() str
        +open_chat(contact)
        +get_bubbles(assets_dir) list
        +find_play_buttons() list
        +capture_full_history() list
        +capture_audio_bubbles() list
        +list_contacts(assets_dir) dict
        +check_updates(assets_dir, reset) dict
    }
    WARunner "1" --> "1" WASession
    WARunner ..> Bubble : produces

Browser interaction layer

WASession interacts with WhatsApp Web using physical coordinates and keyboard events, not DOM selectors or page.locator() calls. This is a deliberate choice: WA's DOM is obfuscated and changes with every deploy, but the visual layout is stable. If a button gets a new data-testid, wavi doesn't care — it clicks wherever the button appears on screen.

Coordinate translation — vision pipeline works in crop-panel pixels (sidebar and header stripped). WASession maps them back to viewport pixels before any click:

viewport_x = crop_x + SIDEBAR_X  (580 px)
viewport_y = crop_y + HEADER_Y   ( 60 px)

→ page.mouse.click(viewport_x, viewport_y)   ← always a coordinate click
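The translation above is a fixed offset, which a short sketch makes concrete. It assumes DPR = 1 (so screenshot pixels equal CSS pixels) and a bbox dict with x, y, w, h keys; `crop_to_viewport` and `bbox_click_point` are hypothetical helper names.

```python
SIDEBAR_X = 580  # px cropped from the left (WhatsApp sidebar)
HEADER_Y = 60    # px cropped from the top (chat header)


def crop_to_viewport(crop_x: float, crop_y: float) -> tuple:
    """Map a point in crop-panel pixels back to viewport pixels."""
    return crop_x + SIDEBAR_X, crop_y + HEADER_Y


def bbox_click_point(bbox: dict) -> tuple:
    """Viewport coordinates of the center of a crop-space bounding box."""
    center_x = bbox["x"] + bbox["w"] / 2
    center_y = bbox["y"] + bbox["h"] / 2
    return crop_to_viewport(center_x, center_y)


# A 200x40 bubble whose top-left corner is at (100, 300) in the cropped panel.
click_x, click_y = bbox_click_point({"x": 100, "y": 300, "w": 200, "h": 40})
# The resulting point is what would be passed to page.mouse.click(click_x, click_y).
```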

DOM queries are used only when there is no alternative: to obtain the coordinates of a dynamic element (compose box), to read scroll state, to read DOM data-id attributes for deduplication, and to capture blob URLs via JS injection.

Method | Interaction type | Mechanism
click(crop_x, crop_y) | Coordinate | page.mouse.click(x+580, y+60)
navigate_to_contact | Coordinate + keyboard | mouse.click(317, 80) → keyboard.type() → ArrowDown + Enter
send_message | JS→coords + keyboard | JS locates compose box → mouse.click(x, y) → keyboard.type() → Enter
scroll_chat_up/down | DOM scrollTop | el.scrollTop -= pixels (no visual alternative)
get_visible_message_ids | DOM read | querySelectorAll('[data-id]'), only for dedup keys
install_blob_monitor | JS injection | Hooks HTMLMediaElement.src, URL.createObjectURL, play()
fetch_blob | JS evaluate | fetch(blobUrl) → base64 → Python bytes
navigate_to_new_chat | JS click | span[data-icon="new-chat-outline"].closest('button').click(); icon-stable, locale-agnostic
extract_contacts | DOM read | querySelectorAll('[role="listitem"]') → [role="gridcell"] text per item
close_new_chat | JS click + keyboard | span[data-icon="back-refreshed"].click(), fallback Escape
extract_sidebar_updates | DOM read | querySelectorAll chat cells → name + last_message + timestamp; direction via msg-check tick icons
ensure_chat_list | keyboard | Escape → wait_for_selector('[data-testid="chat-list"]')

navigate_to_contact — how it works

flowchart TD
    N1["mouse.click(317, 80)\ncoordinate — search box\n(fixed: sidebar center, row 80)"] --> N2
    N2["keyboard.press(Cmd+A)\nkeyboard.press(Delete)\nclear previous search"] --> N3
    N3["keyboard.type(contact, delay=40ms)\nno fill(), no locator"] --> N4
    N4["keyboard.press(ArrowDown)\nkeyboard.press(Enter)\nopen first search result"] --> N5
    N5["wait_for_selector\nbubble container ready\n(timeout 15s, non-fatal)"] --> N6
    N6["JS: click scroll-to-bottom btn\n(WA floating button, DOM position only)\nOR dom scrollTop → 999 999"] --> N7
    N7["verify: scrollHeight − scrollTop − clientHeight ≤ 50px\nretry × 5 if virtualizer restores old position"]

send_message — how it works

flowchart TD
    S1["JS: _FIND_COMPOSE_INPUT_JS\nqueries footer contenteditable\nreturns {x, y} in viewport coords"] --> S2
    S2["mouse.click(x, y)\ncoordinate click using coords from JS\nnot a locator click"] --> S3
    S3["keyboard.type(text, delay=30ms)\nSplit on \\n → Shift+Enter\n(WA would send on bare Enter)"] --> S4
    S4["keyboard.press(Enter)"] --> S5
    S5{"JS: compose box empty?"}
    S5 -->|"yes — sent"| S6["return {selector, x, y}"]
    S5 -->|"no — Enter acted as newline"| S7["JS: _CLICK_SEND_BTN_JS\nfallback: span[data-icon=send].click()"]
    S7 --> S8{"compose empty now?"}
    S8 -->|"yes"| S6
    S8 -->|"no"| S9["raise RuntimeError"]

Vision pipeline

The vision pipeline is pure — no browser, no network, no side effects. It takes a PNG file and returns a list of classified Bubbles. This makes it fully testable without mocking anything real.

1
Crop — remove chrome
The 1280×1920 screenshot includes WhatsApp's sidebar (left 580px) and header (top 60px). These are cropped out. The resulting panel is pure chat bubbles, left-right aligned by sender.
2
Bubble detection — numpy
element_detector.py uses RGB channel masks to isolate bubble pixels: outgoing ("me") = light green (G>200, G−R>15, G−B>15); incoming ("other") = near-white (R,G,B>248). Morphological closing fills intra-bubble gaps, then scipy.ndimage.label() finds connected components. Each component is filtered by size, aspect ratio, and color density. Each surviving region is a bounding box.
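The color-mask and connected-component idea can be shown with a dependency-free sketch. It is a stand-in, not element_detector.py: a pure-Python flood fill replaces NumPy masks and scipy.ndimage.label(), and the morphological closing and the size/aspect/density filters are left out. Only the RGB thresholds come from the description above; `pixel_type` is a hypothetical helper.

```python
from collections import deque


def pixel_type(r: int, g: int, b: int):
    """Classify one pixel using the thresholds described above."""
    if g > 200 and g - r > 15 and g - b > 15:
        return "me"       # light green, outgoing
    if r > 248 and g > 248 and b > 248:
        return "other"    # near-white, incoming
    return None           # background


def detect_bubbles(img):
    """img: list of rows of (r, g, b). Returns bounding boxes of 4-connected regions."""
    height, width = len(img), len(img[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y0 in range(height):
        for x0 in range(width):
            kind = pixel_type(*img[y0][x0])
            if kind is None or seen[y0][x0]:
                continue
            # Flood-fill one region, tracking its extent.
            seen[y0][x0] = True
            queue = deque([(x0, y0)])
            min_x = max_x = x0
            min_y = max_y = y0
            while queue:
                x, y = queue.popleft()
                min_x, max_x = min(min_x, x), max(max_x, x)
                min_y, max_y = min(min_y, y), max(max_y, y)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and not seen[ny][nx]
                            and pixel_type(*img[ny][nx]) == kind):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append({"x": min_x, "y": min_y,
                          "w": max_x - min_x + 1, "h": max_y - min_y + 1,
                          "type": kind})
    return boxes


# Tiny synthetic panel: beige background, one green and one white rectangle.
BG, GREEN, WHITE = (239, 234, 226), (217, 253, 211), (255, 255, 255)
panel = [[BG] * 8 for _ in range(6)]
for y in (1, 2):
    for x in (5, 6):
        panel[y][x] = GREEN
for y in (3, 4):
    for x in (1, 2):
        panel[y][x] = WHITE
boxes = detect_bubbles(panel)
```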
2b
Auxiliary: structural scan for date separators
One OCR pass over the full panel (ocr_tiled) serves two layout helpers: detecting WA date-separator pills ("Hoy", "Ayer", etc.) to attach dates to bubbles, and detecting timestamps inside color regions that span multiple messages so they can be split. This scan is not the source of each bubble's text — it is structural metadata only. The canonical text comes from the per-bubble OCR in step 3.
3
Per-element OCR — Apple Vision
Each bounding box is cropped from the panel and upscaled (2–5×, depending on height). The crop is passed to swift/ocr_vision.swift, which calls macOS's Vision framework (VNRecognizeTextRequest) and returns a JSON array of text blocks with fractional coordinates. This is the authoritative text for the element. The step is fast, accurate, and runs offline.
4
Classify — text / audio / file / media
classify_msg_type() inspects the OCR blocks: duration patterns (0:21) → audio; file extensions or sizes → file; all-blank blocks → media (likely a photo/video); anything else → text. Sender is inferred from x-position (right = "me", left = "other").
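The rule order described above can be sketched as follows. This is an illustration rather than the real `classify_msg_type()`: the function names, the regular expressions, the list of file extensions, and the midpoint rule for the sender are all assumptions.

```python
import re
from typing import List

# A block that is only a duration, e.g. "0:21" (assumed pattern).
DURATION_RE = re.compile(r"^\d{1,2}:\d{2}$")
# A file extension or a size such as "1.2 MB" (assumed, non-exhaustive list).
FILE_RE = re.compile(
    r"\.(pdf|docx?|xlsx?|pptx?|zip)\b|\b\d+(\.\d+)?\s?(kB|MB|GB)\b",
    re.IGNORECASE,
)


def classify_msg_type_sketch(blocks: List[str]) -> str:
    """Classify OCR text blocks as audio, file, media, or text."""
    texts = [block.strip() for block in blocks]
    if not any(texts):
        return "media"  # nothing readable: likely a photo or video
    if any(DURATION_RE.match(text) for text in texts):
        return "audio"
    if any(FILE_RE.search(text) for text in texts):
        return "file"
    return "text"


def infer_sender(center_x: float, panel_width: float) -> str:
    """Right half of the panel is "me", left half is "other" (assumed midpoint rule)."""
    return "me" if center_x > panel_width / 2 else "other"
```

One caveat of this sketch: a bare 24-hour timestamp such as "14:15" in its own OCR block would be read as a duration, so a real classifier needs more context than the pattern alone.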
5
Timestamp extraction
Timestamps are extracted with regex against known WA formats (2:15 p. m.). Special handling covers a Cyrillic OCR artifact: in some locales Apple Vision misreads Latin "p" as Cyrillic "р".
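A sketch of the normalization and matching described here, covering only the 12-hour "p. m." / "a. m." style shown in the example. The names `extract_timestamp` and `CYRILLIC_TO_LATIN`, the exact pattern, and the 24-hour output format are assumptions.

```python
import re
from typing import Optional

# Map the Cyrillic lookalike (U+0440, U+0420) back to Latin "p" before matching.
CYRILLIC_TO_LATIN = str.maketrans({"\u0440": "p", "\u0420": "P"})

# Matches "2:15 p. m.", "2:15 pm", "12:05 a.m." and similar spacing variants.
TIMESTAMP_RE = re.compile(r"\b(\d{1,2}):(\d{2})\s*([ap])\.?\s*m\.?", re.IGNORECASE)


def extract_timestamp(text: str) -> Optional[str]:
    """Return the first timestamp in the text as 24-hour "HH:MM", or None."""
    match = TIMESTAMP_RE.search(text.translate(CYRILLIC_TO_LATIN))
    if match is None:
        return None
    hour, minute = int(match.group(1)), int(match.group(2))
    if not (1 <= hour <= 12 and minute <= 59):
        return None  # reject OCR noise such as "25:99 p. m."
    is_pm = match.group(3).lower() == "p"
    hour24 = hour % 12 + (12 if is_pm else 0)
    return f"{hour24:02d}:{minute:02d}"
```

Normalizing to 24-hour form is a design choice of this sketch; it makes timestamps from different captures directly comparable.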
6
DOM anchor matching
Each Bubble gets a dom_id assigned by matching its y-center (crop coords → CSS viewport via DPR) against WA's data-id attributes queried from the DOM. DOM ID is the primary dedup key — stable even when OCR produces different text for the same message.
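The y-proximity matching can be sketched as a nearest-neighbor search after converting crop coordinates back to CSS viewport coordinates. The function names, the 20px tolerance, and the `(data_id, y_center)` anchor shape are assumptions; only the 60px header offset and the DPR division come from the pipeline description.

```python
from typing import List, Optional, Sequence, Tuple

HEADER_CROP_PX = 60  # header height removed by the crop step


def crop_y_to_css(y_crop: float, dpr: float = 1.0, header_px: int = HEADER_CROP_PX) -> float:
    """Undo the header crop, then convert device pixels to CSS pixels."""
    return (y_crop + header_px) / dpr


def assign_dom_ids_sketch(
    bubble_centers: Sequence[float],        # y-centers in crop coordinates
    anchors: Sequence[Tuple[str, float]],   # (data-id, y-center in CSS px), hypothetical shape
    max_distance: float = 20.0,             # assumed tolerance in CSS px
    dpr: float = 1.0,
) -> List[Optional[str]]:
    """Give each bubble the nearest unused DOM anchor, or None if none is close."""
    used = set()
    assigned: List[Optional[str]] = []
    for y_crop in bubble_centers:
        y_css = crop_y_to_css(y_crop, dpr)
        candidates = [
            (abs(y_anchor - y_css), data_id)
            for data_id, y_anchor in anchors
            if data_id not in used
        ]
        best = min(candidates, default=None)
        if best is not None and best[0] <= max_distance:
            used.add(best[1])
            assigned.append(best[1])
        else:
            assigned.append(None)  # falls back to the OCR dedup key later
    return assigned
```

Bubbles that end up with `None` are the cases where the OCR-based key takes over.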
7
Audio capture — blob monitor
A JavaScript hook intercepts HTMLMediaElement.src, URL.createObjectURL, and play() calls. When wavi clicks a play button, the hook captures the blob URL. The blob is then fetched as binary and saved as audio_N.ogg.

Vision pipeline — flowchart

flowchart TD
    A["📸 screenshot.png\n1280×1920px, DPR=1"] --> B
    B["✂️ Crop\nremove sidebar 580px\nremove header 60px"] --> C & AUX
    AUX["🔎 Structural scan\nocr_tiled: full panel\ndate pills + split points"]:::aux
    C["🔍 detect_bubbles\nRGB color masks\nconnected components"] --> D
    D["✂️ Split by timestamps\nstructural scan → cut regions\nspanning multiple messages"] --> E
    E["📝 Per-element OCR\ncrop + upscale each bbox\nSwift VNRecognizeTextRequest"] --> F
    F["🏷️ classify_msg_type\ntext / audio / file / media"] --> G
    G["📅 Attach dates\ndate pills from structural scan\nbubble_y → nearest pill date"] --> H
    H["⚓ assign_dom_ids\ny-proximity matching\ncrop px → CSS viewport"] --> I
    I["🔀 Dedup\ndom_id primary key\nOCR fallback"] --> J
    J["📄 Bubble list\nid · sender · type\ntimestamp · text · bbox"]
    AUX -.->|"auxiliary\ndate pills"| G
    AUX -.->|"auxiliary\nsplit points"| D
    classDef aux fill:#1c2330,color:#7a8a9e,stroke:#2a3547,stroke-dasharray:4 4

Test suite

All tests run offline — no browser, no real WA account, no API calls. Browser methods are mocked with AsyncMock. The vision tests use synthetic images generated in-memory with Pillow + numpy.
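As an illustration of the mocking style, the sketch below tests a coordinate click without starting a browser. `focus_search_box` and `make_fake_page` are hypothetical helpers written for this example; the key name "Meta+A" is assumed to be how the Cmd+A step is spelled for Playwright.

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock, call


async def focus_search_box(page) -> None:
    """Hypothetical helper: click the search box by coordinate and clear it."""
    await page.mouse.click(317, 80)
    await page.keyboard.press("Meta+A")
    await page.keyboard.press("Delete")


def make_fake_page() -> MagicMock:
    """Build a stand-in for a Playwright page with awaitable input methods."""
    page = MagicMock()
    page.mouse.click = AsyncMock()
    page.keyboard.press = AsyncMock()
    return page


def test_focus_search_box_uses_coordinates() -> None:
    page = make_fake_page()
    asyncio.run(focus_search_box(page))
    # The click must be coordinate-based, never a locator lookup.
    page.mouse.click.assert_awaited_once_with(317, 80)
    assert page.keyboard.press.await_args_list == [call("Meta+A"), call("Delete")]
    page.locator.assert_not_called()
```

Asserting that `page.locator` was never called encodes the "no DOM selectors" rule directly in the test.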

Design decisions

Key architectural choices and the reasoning behind them.

ADR-001
Vision over DOM
Decision: extract data via screenshots + OCR, not DOM selectors.

Why: WhatsApp Web's DOM is minified, obfuscated, and changes with every deploy. CSS selectors and JS APIs that work today silently break tomorrow. Visual rendering is stable because it is anchored to human perception.

Trade-off: slower than DOM parsing, but resilient to any non-visual WA change.
ADR-002
Force DPR=1 in headless Chrome
Decision: pass --force-device-scale-factor=1 to Chrome.

Why: on macOS Retina, Chrome defaults to DPR=2. With a 1280×1920 window spec, the effective CSS viewport becomes 640×960: half the width and height, a quarter of the area. Screenshots show far fewer messages and bounding box math breaks. Forcing DPR=1 maps --window-size 1:1 to CSS pixels.

Test guard: TestViewportRegression ensures this never regresses.
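The arithmetic behind this decision, and the kind of check a viewport regression test might make, can be sketched as below. The helper names are hypothetical, and the model "CSS viewport = window size / DPR" is taken from the description above rather than verified against Chrome.

```python
from typing import List, Tuple

WINDOW_SIZE = (1280, 1920)  # window spec used throughout the pipeline


def css_viewport(window: Tuple[int, int], dpr: float) -> Tuple[int, int]:
    """CSS viewport implied by a window size and a device pixel ratio."""
    width, height = window
    return (round(width / dpr), round(height / dpr))


def chrome_args(window: Tuple[int, int] = WINDOW_SIZE) -> List[str]:
    """Launch flags that pin the scale factor so window pixels equal CSS pixels."""
    return [
        f"--window-size={window[0]},{window[1]}",
        "--force-device-scale-factor=1",
    ]


def check_viewport_guard() -> None:
    """Fail loudly if the DPR flag is dropped or the viewport would shrink."""
    assert "--force-device-scale-factor=1" in chrome_args()
    assert css_viewport(WINDOW_SIZE, dpr=1) == WINDOW_SIZE
```

With DPR=2 the same window would give `css_viewport(WINDOW_SIZE, 2) == (640, 960)`, which is the failure mode the flag prevents.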
ADR-003
Chrome as a long-lived daemon
Decision: wavi connect starts Chrome once; all other commands connect via CDP and never kill Chrome.

Why: WhatsApp stores its auth state in IndexedDB inside the Chrome user-data-dir. Killing Chrome mid-session can corrupt IndexedDB and force re-authentication. CDP connect/disconnect is cheap; Chrome restart is expensive and lossy.

Shutdown: always via wavi stop, which navigates to about:blank first.
ADR-004
DOM anchor dedup + OCR fallback
Decision: primary dedup key = dom_id (WA's data-id); fallback = OCR content key.

Why: scroll regions overlap by ~15%. A message visible at the bottom of frame N appears again at the top of frame N+1. OCR text alone can vary between captures (noise, punctuation). The DOM ID is immutable for a given message.

Fallback: when DOM ID assignment fails (element off-screen), OCR key (sender + type + text[:80] + timestamp) handles it.
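A sketch of the two-tier key and of merging overlapping frames with it. The `Bubble` dataclass here is a simplified stand-in for wavi's own type, and `dedup_key` / `merge_frames` are hypothetical names; the key fields follow the fallback formula above.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class Bubble:
    """Simplified stand-in for wavi's Bubble type."""
    sender: str
    type: str
    text: str
    timestamp: str
    dom_id: Optional[str] = None


def dedup_key(bubble: Bubble) -> Tuple[str, ...]:
    """Prefer the DOM id; otherwise build a key from OCR-derived content."""
    if bubble.dom_id:
        return ("dom", bubble.dom_id)
    return ("ocr", bubble.sender, bubble.type, bubble.text[:80], bubble.timestamp)


def merge_frames(frames: Iterable[List[Bubble]]) -> List[Bubble]:
    """Concatenate scroll frames, keeping the first sighting of each message."""
    seen = set()
    merged: List[Bubble] = []
    for frame in frames:
        for bubble in frame:
            key = dedup_key(bubble)
            if key not in seen:
                seen.add(key)
                merged.append(bubble)
    return merged
```

Note that in this sketch a message seen once with a DOM id and once without one produces two different keys, so reconciling those cases would need extra logic.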
ADR-005
ARM64 Chrome on macOS
Decision: launch Chrome with arch -arm64.

Why: on Apple Silicon, Chrome launched without explicit arch may run under Rosetta (x86 emulation). WhatsApp Web's service worker and IndexedDB behave inconsistently under Rosetta — the session appears authenticated but the chat list never loads (frozen). ARM64 native is stable.

Memory: this is saved in project memory to prevent regression.
ADR-006
Optimistic headless connect
Decision: always try headless first in wavi connect; only open a visible window if QR is needed.

Why: once authenticated, the session is stored in the Chrome user-data-dir. Headless restores it silently. Opening a visible window unnecessarily is disruptive and slower.

Strategy: headless → check WA auth → if QR needed, kill headless, open visible → scan → kill visible → restart headless.
ADR-007
Transcription after browser closes
Decision: transcribe_history_audios() runs after runner.close(), not during the scroll loop.

Why: Groq API calls can take 1–3 seconds each. Running them inside the Playwright event loop while Chrome is open risks timeouts and interleaves async operations in hard-to-predict ways. Closing the browser first, then transcribing serially, is safe and deterministic.
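The ordering this decision enforces can be sketched as below. Only `runner.close()` is named in the text; `runner.capture()`, the `transcribe` callable, and `capture_then_transcribe` itself are hypothetical stand-ins for the scroll loop and the Groq call.

```python
import asyncio
from typing import Callable, Dict, List


async def capture_then_transcribe(
    runner,                            # object with async capture() (hypothetical) and close()
    transcribe: Callable[[str], str],  # hypothetical: audio path in, transcript out
) -> Dict[str, str]:
    """Finish all browser work, close the browser, then transcribe serially."""
    try:
        audio_paths: List[str] = await runner.capture()
    finally:
        # The browser is closed before any slow API call starts.
        await runner.close()
    # One call at a time keeps ordering and failures easy to reason about.
    return {path: transcribe(path) for path in audio_paths}
```

A fake runner that records the call order is enough to check that no transcription happens before `close()`.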
ADR-008
Session port registry via Local Agent Society
Decision: claim CDP ports from the Local Agent Society registry (http://localhost:8700); fall back to local socket scan.

Why: multiple wavi sessions (and other local agents) share the 9200–9249 port range. Atomic POST /ports/claim prevents race conditions when two sessions start simultaneously. Three-tier fallback (claim → free+register → local scan) ensures the tool works even without the society daemon.
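The tiered fallback can be sketched as a list of strategies tried in order, ending with a local socket scan. This is an assumption about structure only: `acquire_port`, `scan_local_port`, and `is_port_free` are hypothetical names, and the registry tiers are passed in as callables because their HTTP request and response formats are not described here.

```python
import socket
from typing import Callable, Iterable, Optional

PORT_RANGE = range(9200, 9250)  # 9200 to 9249 inclusive


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Probe a port by trying to bind to it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


def scan_local_port(
    ports: Iterable[int] = PORT_RANGE,
    probe: Callable[[int], bool] = is_port_free,
) -> Optional[int]:
    """Last tier: return the first port in the range that looks free."""
    return next((port for port in ports if probe(port)), None)


def acquire_port(tiers: Iterable[Callable[[], Optional[int]]]) -> Optional[int]:
    """Try each tier in order; a tier may return None or raise OSError."""
    for tier in tiers:
        try:
            port = tier()
        except OSError:  # includes refused connections and urllib URLError
            continue
        if port is not None:
            return port
    return None
```

Unlike the atomic claim, the local scan is not race-free: two sessions could probe the same port at once, which is why it sits last in the chain.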