Metadata-Version: 2.4
Name: talk-to-tux
Version: 0.2.0
Summary: Voice-to-smart-paste pipeline for Linux
Project-URL: Homepage, https://github.com/viperjuice/talk-to-tux
Project-URL: Repository, https://github.com/viperjuice/talk-to-tux
Project-URL: Issues, https://github.com/viperjuice/talk-to-tux/issues
Author: viperjuice
License-Expression: AGPL-3.0-only
License-File: LICENSE
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: X11 Applications
Classifier: Environment :: X11 Applications :: Gnome
Classifier: Intended Audience :: End Users/Desktop
Classifier: License :: OSI Approved :: GNU Affero General Public License v3
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Multimedia :: Sound/Audio :: Speech
Requires-Python: >=3.11
Requires-Dist: baml-py==0.221.0
Requires-Dist: evdev>=1.7
Requires-Dist: httpx>=0.28
Requires-Dist: keyring>=25
Requires-Dist: numpy>=1.26
Requires-Dist: openai>=1.60
Requires-Dist: pillow>=10.0
Requires-Dist: pulsectl-asyncio>=1.2
Requires-Dist: pydantic-settings>=2.7
Requires-Dist: pydantic>=2.10
Requires-Dist: python-dotenv>=1.0
Requires-Dist: sounddevice>=0.5
Requires-Dist: soundfile>=0.13
Requires-Dist: torch>=2.5
Requires-Dist: torchaudio>=2.5
Provides-Extra: indicator
Requires-Dist: pygobject>=3.42; extra == 'indicator'
Description-Content-Type: text/markdown

# Talk-to-Tux

Voice-to-smart-paste pipeline for Linux. Hold a mouse button, speak, and
the transcribed + LLM-rewritten text is pasted into your active application —
formatted for the context you're in.

```mermaid
flowchart LR
    A["Hold Button"] --> B["Record Audio"]
    B --> C["STT\nhosted Groq Whisper or BYO chain"]
    C --> D["Rewrite\nhosted Groq Scout or BYO BAML"]
    D --> E["Smart Paste"]

    F["App Context\nwindow + AT-SPI text + optional screenshot"] --> D
```

## How It Works

1. **Hold** your mouse side button (or keyboard hotkey) and speak. The
   microphone stream keeps a local in-memory 2-second ring buffer, so any
   speech immediately before the button press is also captured.
2. **Release** — audio is checked for speech by VAD, then sent to the STT provider chain
3. The transcription is **rewritten** by an LLM using your active window's context
   (app name, window title, AT-SPI widget text, optional screenshot)
4. The result is **pasted** into the focused application using the correct shortcut
5. **Double-tap** the side button after a paste to send Enter (e.g., submit a chat message)

Works on both **X11** and **Wayland** (GNOME, tested on Ubuntu 24.04+).
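
The pre-roll behavior in step 1 can be pictured as a fixed-size ring buffer that is always being filled and is copied into the recording when the button goes down. The following is a minimal sketch, not the project's recorder: the class name `PreRollRecorder`, the 16 kHz sample rate, and the 100 ms block size are all assumptions made for illustration.

```python
from collections import deque

SAMPLE_RATE = 16_000      # assumed sample rate in Hz
PRE_ROLL_SECONDS = 2      # the 2-second buffer described in step 1
BLOCK_SIZE = 1_600        # assumed 100 ms of samples per audio callback


class PreRollRecorder:
    """Hypothetical recorder that prepends buffered audio when recording starts."""

    def __init__(self) -> None:
        max_blocks = SAMPLE_RATE * PRE_ROLL_SECONDS // BLOCK_SIZE
        # A bounded deque drops its oldest block once it is full.
        self._ring: deque[list[int]] = deque(maxlen=max_blocks)
        self._recording: list[list[int]] | None = None

    def on_audio_block(self, block: list[int]) -> None:
        """Called for every microphone block, whether or not the button is held."""
        if self._recording is not None:
            self._recording.append(block)
        else:
            self._ring.append(block)

    def start(self) -> None:
        """Button pressed: seed the recording with the buffered pre-roll."""
        self._recording = list(self._ring)
        self._ring.clear()

    def stop(self) -> list[int]:
        """Button released: return all captured samples as one flat list."""
        blocks, self._recording = self._recording or [], None
        return [sample for block in blocks for sample in block]
```

Because the buffer lives only in memory and is bounded, audio older than two seconds is discarded unless a recording is in progress.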

## Architecture

```mermaid
flowchart TB
    subgraph Trigger["Trigger Layer"]
        SB["Side Button / Keyboard Hotkey"]
    end

    subgraph Recording["Recording Phase"]
        SB -->|hold| REC["Audio Recorder\n(sounddevice + 2s ring buffer)"]
        SB -->|press| CTX0["Capture Context\nwindow + AT-SPI + screenshot"]
    end

    subgraph Processing["Processing Phase"]
        REC -->|release| VAD["VAD Gate\n(Silero)"]
        VAD -->|raw WAV| STT{"API Mode"}
        STT -->|hosted| HSTT["Supabase v1-stt\nGroq Whisper"]
        STT -->|BYO| BSTT["STT Chain\ndefault: groq, openai, google\noptional: gpu_server, local_whisper, elevenlabs"]
        HSTT --> TX["Transcription"]
        BSTT --> TX
        CTX0 --> RP["Rewrite Prompt\n6-layer context"]
        TX --> RP
        RP --> RW{"API Mode"}
        RW -->|hosted| HRW["Supabase v1-rewrite\nGroq Scout text\noptional visual model"]
        RW -->|BYO| BRW["BAML SmartRewrite\nLocalAI, Groq, Gemini, Ollama fallback"]
    end

    subgraph Output["Paste Phase"]
        HRW --> TT["Tooltip / Notification\n(confirm or auto-paste)"]
        BRW --> TT
        TT --> PASTE["Paster\nxclip/wl-copy + xdotool/wtype"]
    end
```

## Prerequisites

**OS**: Linux with GNOME (X11 or Wayland). **Python**: 3.11+.
**STT backend**: hosted beta account, *or* a BYO key/local server for Groq,
OpenAI, Google, ElevenLabs, `gpu_server`, or `local_whisper`.

### System packages

| Tool(s) | Group | Ubuntu/Debian (`apt`) | Arch (`pacman`) | Fedora (`dnf`) |
|---------|-------|-----------------------|-----------------|----------------|
| `libportaudio2` | Audio | `libportaudio2` | `portaudio` | `portaudio` |
| `grim` | Screenshots (Wayland) | `grim` | `grim` | `grim` |
| `scrot` | Screenshots (X11/XWayland) | `scrot` | `scrot` | `scrot` |
| `xclip` | Clipboard (X11) | `xclip` | `xclip` | `xclip` |
| `wl-copy`, `wl-paste` | Clipboard (Wayland) | `wl-clipboard` | `wl-clipboard` | `wl-clipboard` |
| `xdotool` | Keystroke injection (X11) | `xdotool` | `xdotool` | `xdotool` |
| `wtype` | Keystroke injection (Wayland/wlroots) | `wtype` | `wtype` | `wtype` |
| `ydotool` + `ydotoold` | Keystroke injection (GNOME Wayland) | see note below | `ydotool` | `ydotool` |
| `dbus-send`, `busctl`, `gdbus` | D-Bus utilities | `dbus` / `systemd` | `dbus` / `systemd` | `dbus` / `systemd` |
| `notify-send` | Desktop notifications | `libnotify-bin` | `libnotify` | `libnotify` |
| `evtest` | Input device debug | `evtest` | `evtest` | `evtest` |
| `pgrep` | Process checks | `procps` | `procps-ng` | `procps-ng` |
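
As a rough illustration of how the table maps onto a running session, the sketch below picks a clipboard tool and a keystroke injector from the standard `XDG_SESSION_TYPE` and `XDG_CURRENT_DESKTOP` environment variables. The function names and the selection logic are assumptions for illustration; the application's own detection may differ, and `--doctor` remains the authoritative check.

```python
import os
import shutil


def pick_tools(session_type: str, desktop: str) -> dict[str, str]:
    """Map a session type and desktop name to the tools listed in the table."""
    if session_type == "wayland":
        # Assumption: GNOME Wayland needs ydotool, other compositors use wtype.
        injector = "ydotool" if "gnome" in desktop.lower() else "wtype"
        return {"clipboard": "wl-copy", "injector": injector}
    return {"clipboard": "xclip", "injector": "xdotool"}


def detect_tools(environ=os.environ) -> dict[str, str]:
    """Read the session variables, defaulting to X11 when they are unset."""
    return pick_tools(
        environ.get("XDG_SESSION_TYPE", "x11"),
        environ.get("XDG_CURRENT_DESKTOP", ""),
    )


def missing_tools(tools: dict[str, str]) -> list[str]:
    """Return the selected tools that are not on PATH."""
    return [name for name in tools.values() if shutil.which(name) is None]
```

Calling `missing_tools(detect_tools())` gives a quick list of packages still to install for the current session.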

### ydotool on Ubuntu/Debian — build v1.0+ from source

Ubuntu's `apt` ships **ydotool 0.1.8**, which has no daemon and injects garbled
keystrokes. You need **v1.0+** built from source:

```bash
# Build dependencies
sudo apt install cmake libevdev-dev libudev-dev

# Clone and build
git clone https://github.com/ReimuNotMoe/ydotool
cd ydotool && cmake -B build && cmake --build build && sudo cmake --install build

# Enable the daemon and grant /dev/uinput access
systemctl --user enable --now ydotoold
sudo usermod -aG input "$USER"   # log out and back in for the group change to apply
# or add udev rule: echo 'KERNEL=="uinput", GROUP="input", MODE="0660"' | sudo tee /etc/udev/rules.d/99-uinput.rules
```

Arch and Fedora ship a working `ydotool` via their package managers.

### Self-verify

```bash
uv run talk-to-tux --doctor
```

## Quick Start

```bash
git clone https://github.com/viperjuice/talk-to-tux.git
cd talk-to-tux
uv sync --all-groups

# First run launches the setup wizard. Choose hosted beta or BYO-key mode.
uv run talk-to-tux
```

On first run, the setup wizard writes `~/.config/talk-to-tux/config.toml`.
It can also install a desktop launcher and optionally enable login autostart so
daily use does not require a terminal.

For production-style local use, run without `--debug`:

```bash
uv run talk-to-tux
```

For launcher and startup integration:

```bash
uv run talk-to-tux desktop status
uv run talk-to-tux desktop install-launcher
uv run talk-to-tux desktop enable-autostart
uv run talk-to-tux desktop disable-autostart
uv run talk-to-tux desktop remove-launcher
```

`--debug` enables debug logging and the debug popup. Leave it off for normal
beta/production use.

Hosted mode signs in with GitHub OAuth. BYO-key mode tells you where to put
provider keys in `~/.config/talk-to-tux/secrets.env`. `.env.example` is a
reference file; do not copy it wholesale unless you want every sample override.

## Trigger Modes and Key Mapping

### Default: Mouse Side Buttons (hold-to-record)

| Button | evdev Code | Action |
|--------|-----------|--------|
| **BTN_SIDE** (thumb back) | 275 | Hold to record; recording stops once all trigger buttons are released |
| **BTN_EXTRA** (thumb forward) | 276 | Hold to record; recording stops once all trigger buttons are released |

The device is grabbed exclusively so side buttons don't trigger browser
back/forward. All other mouse events (movement, clicks, scroll) are forwarded
transparently via uinput.

### Alternative: Keyboard Hotkey

| Key Combo | evdev Names | Action |
|-----------|------------|--------|
| **Ctrl + Super** (left) | `KEY_LEFTCTRL+KEY_LEFTMETA` | Toggle recording |

> **Note:** The Super key may trigger GNOME Activities. Disable with:
> `gsettings set org.gnome.mutter overlay-key ''`

### Customizing the Trigger

**Option 1: TOML config** (`~/.config/talk-to-tux/config.toml`)

```toml
[trigger]
mode = "mouse"           # "auto", "mouse", or "keyboard"
record_mode = "hold"     # "hold" (release to stop) or "toggle" (tap/tap)

[trigger.mouse]
button_codes = [275, 276]          # any evdev button codes
device_name = "Logitech G502"     # match by name substring (stable across reboots and USB replug)
# device_path = "/dev/input/event5"  # or explicit path (fragile)
grab = true

[trigger.keyboard]
hotkey = "KEY_LEFTCTRL+KEY_LEFTMETA"
```

**Option 2: Environment variables**

```bash
TTT_TRIGGER_MODE=keyboard
TTT_HOTKEY=KEY_RIGHTCTRL
TTT_RECORD_MODE=toggle
# Or nested format:
TTT_TRIGGER__MOUSE__BUTTON_CODES='[275, 276]'
TTT_TRIGGER__MOUSE__DEVICE_NAME="Logitech"
```

**Option 3: CLI flags**

```bash
uv run talk-to-tux --trigger keyboard --record-mode toggle
```
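
The nested environment variable form uses a double underscore to separate config sections, which resembles the nested-delimiter convention in pydantic-settings. As a sketch of how such names could map onto the TOML structure, assuming lowercase keys and JSON-decoded values (both assumptions, as is the function name `nest_env`):

```python
import json

PREFIX = "TTT_"
DELIMITER = "__"


def nest_env(environ: dict[str, str]) -> dict:
    """Turn TTT_A__B__C=value entries into {"a": {"b": {"c": value}}}."""
    result: dict = {}
    for name, raw in environ.items():
        if not name.startswith(PREFIX) or DELIMITER not in name:
            continue  # flat names such as TTT_HOTKEY are out of scope here
        *parents, leaf = name[len(PREFIX):].lower().split(DELIMITER)
        node = result
        for key in parents:
            node = node.setdefault(key, {})
        try:
            node[leaf] = json.loads(raw)   # lists, numbers, booleans
        except json.JSONDecodeError:
            node[leaf] = raw               # plain strings stay as-is
    return result
```

With the two nested variables from Option 2, this yields a `trigger.mouse` table holding `button_codes = [275, 276]` and `device_name = "Logitech"`.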

### Finding Your Button Codes

```bash
# List input devices
sudo evtest

# Pick your mouse, press buttons, note the codes:
#   Event: type 1 (EV_KEY), code 275 (BTN_SIDE), value 1
```

## Configuration

Configuration is loaded with this precedence (highest first):

1. **CLI arguments** (`--trigger mouse`, `--debug`, etc.)
2. **Environment variables** (`TTT_*` prefix)
3. **`~/.config/talk-to-tux/secrets.env`** (CWD `.env` intentionally **not** loaded — prevents rogue `.env` in a project dir from overriding secrets)
4. **TOML config** (`~/.config/talk-to-tux/config.toml`)
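
One way to read the precedence list is as a series of layered dictionaries merged from lowest to highest priority. This is a sketch of that idea under the assumption that each source has already been parsed into a nested dict; `deep_merge` and `resolve` are illustrative names, not the application's API.

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Return a copy of base with override applied, recursing into sub-tables."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve(toml_cfg: dict, secrets_env: dict, environ: dict, cli_args: dict) -> dict:
    """Apply layers from lowest (TOML) to highest (CLI) precedence."""
    resolved: dict = {}
    for layer in (toml_cfg, secrets_env, environ, cli_args):
        resolved = deep_merge(resolved, layer)
    return resolved
```

A setting present only in `config.toml` survives, while one also passed on the command line takes the CLI value.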

### Key Settings

| Section | Setting | Default | Description |
|---------|---------|---------|-------------|
| `api_mode` | value | `byo_key` (auto-hosted when a token exists) | Hosted account vs bring-your-own-provider keys |
| `stt` | `providers` | `groq,openai,google` | BYO STT fallback chain (tried in order) |
| `stt.gpu_server` | `url` | `http://localhost:8000` | Self-hosted Whisper server URL |
| `rewrite` | `enabled` | `true` | Enable LLM smart rewrite |
| `rewrite` | `local_ai_url` | `http://ai:8002/v1` | LocalAI / vLLM-compatible rewrite endpoint |
| `rewrite` | `ollama_base_url` | `http://localhost:11434/v1` | Ollama-compatible rewrite fallback endpoint |
| `context` | `screenshot_enabled` | `true` | Include screenshot in LLM context |
| `ducking` | `enabled` | `true` | Reduce other apps' volume while recording |
| `ducking` | `factor` | `0.15` | Duck to 15% of original volume |
| `learning` | `enabled` | `true` | Store local correction-learning artifacts |
| `tooltip` | `enabled` | `false` | Show confirm-before-paste tooltip (disabled = auto-paste) |
| `tooltip` | `use_notifications` | `true` | Use desktop notifications (vs GTK tooltip) |
| `paste` | `enabled` | `true` | Auto-paste into active window |
| `indicator` | `enabled` | `true` | Show system tray indicator |
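
The `stt.providers` chain is tried in order until one provider succeeds. The sketch below shows that general fallback pattern with plain callables standing in for providers; the names `transcribe_with_fallback` and `AllProvidersFailed` are hypothetical, and real providers would make network calls and likely catch narrower exceptions.

```python
from collections.abc import Callable

Transcriber = Callable[[bytes], str]


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain has failed."""


def transcribe_with_fallback(
    wav: bytes, providers: list[tuple[str, Transcriber]]
) -> tuple[str, str]:
    """Return (provider_name, transcript) from the first provider that works."""
    errors: list[str] = []
    for name, transcribe in providers:
        try:
            return name, transcribe(wav)
        except Exception as exc:  # broad on purpose in this sketch
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

Collecting the per-provider errors makes the final failure message show why each step in the chain was skipped.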

### Model Selection Workflow

Talk-to-Tux now treats model settings as **dropdowns**, not free-text fields.
The Providers tab exposes catalog-backed STT and rewrite model selectors, and
the Local Inference tab exposes local model selectors once endpoint discovery
has something safe to offer.

- Local endpoint URLs stay editable text fields. Paste the STT or rewrite base
  URL into the Local Inference tab, then probe the endpoint to refresh bounded
  health plus discovered model choices.
- `uv run talk-to-tux settings --check-local-inference --json` reports the same
  sanitized discovery summary from the CLI, including reachable/degraded status
  and bounded model counts.
- Rewrite model choices are limited to **visual-capable** models because the
  rewrite pipeline can include screenshot context. Text-only local discoveries
  stay visible as read-only feedback, but they are not selectable for rewrite.
- Unsupported legacy model ids are preserved non-destructively when older
  configs are loaded. The settings UI surfaces the current unsupported value,
  and the fix is to replace it with a supported dropdown selection rather than
  typing a new arbitrary model string.

### Per-App Rules

Customize behavior per application in `config.toml`:

```toml
[[app]]
match = "Google-chrome"
match_title = "*ChatGPT*"         # optional title filter (glob or ~regex)
paste_shortcut = "ctrl+shift+v"   # override paste shortcut
rewrite_hint = "Conversational tone, no markdown"

[[app]]
match = "Code"
rewrite_hint = "Generate code in the language of the active file"

[[app]]
match = "kitty"
is_terminal = true
paste_shortcut = "ctrl+shift+v"
```

Default rules for common apps (browsers, terminals, editors, chat apps) are
shipped in `src/talk_to_tux/data/default_app_rules.toml`. User rules in
`config.toml` take priority.

### API Keys

In BYO-key mode, store API keys in `~/.config/talk-to-tux/secrets.env` (CWD
`.env` is not loaded):

```bash
TTT_OPENAI_API_KEY=sk-...
TTT_GROQ_API_KEY=gsk_...
TTT_GOOGLE_API_KEY=AIza...
TTT_ELEVENLABS_API_KEY=...
```

## GPU Server Deployment

The STT server runs [faster-whisper](https://github.com/SYSTRAN/faster-whisper)
on NVIDIA GPUs.

```bash
cd server
uv sync
uv run ttt-server --host 0.0.0.0 --port 8000

# Or Docker:
docker build -f deploy/Dockerfile.server -t ttt-server .
docker run --gpus all -p 8000:8000 ttt-server
```

Systemd service files are in `deploy/`.

## Development

```bash
uv sync --all-groups          # install all deps including dev
uv run pytest tests/ -q       # run desktop tests
cd server && uv run pytest tests/ -q  # run GPU server tests
make lint                     # ruff check
make format                   # ruff format
uv run baml-cli generate      # regenerate BAML client after .baml changes
```

See [CONTRIBUTING.md](CONTRIBUTING.md) for the full developer guide.

## Privacy & Data Flow

Talk-to-Tux captures your voice, screen, and active-app context to power the
voice-to-paste pipeline. What leaves your machine depends on the API mode you
pick during setup:

- **Hosted mode** — recorded WAV audio and metadata are sent to Supabase Edge
  Functions for Groq Whisper STT. Rewrite requests send the transcript, active
  window, static/dynamic/internal context, app-specific hints, and optional
  screenshot bytes to Supabase. The hosted rewrite backend uses Groq Scout for
  text by default; Groq receives image bytes only when a hosted visual rewrite
  model is configured. Your account email/GitHub identity, quota rows, and
  audit data are retained per the [privacy policy](https://www.talk-to-tux.com/privacy).
- **BYO-key mode** — the same data is sent only to the providers whose keys
  you configure (`TTT_OPENAI_API_KEY`, `TTT_GROQ_API_KEY`, etc.); nothing hits
  our servers.

The local debug run cache (audio, transcripts, screenshots, rewrites) is **off
by default** in both modes. Enable it only when you need to diagnose a bug:

```bash
TTT_CACHE_ENABLED=true uv run talk-to-tux
# or set [cache] enabled = true in ~/.config/talk-to-tux/config.toml
```

Correction-learning and feedback artifacts are stored separately under
`~/.config/talk-to-tux/` and are not affected by the cache toggle.

## CLI Reference

```
uv run talk-to-tux [COMMAND] [OPTIONS]

Daemon options (no subcommand):
  --trigger {auto,mouse,keyboard}   Trigger mode
  --record-mode {toggle,hold}       Recording mode
  --no-indicator                    Disable system tray
  --no-tooltip                      Disable tooltip/notifications
  --no-validation                   Disable recording validation sound
  --retry [RUN_ID]                  Retry from a cached run (default: latest)
  --show-config                     Print resolved config and exit
  --doctor                          Run diagnostics and exit
  --setup                           Run interactive first-run setup wizard
  --migrate-config                  Convert .env to config.toml
  --debug                           Enable debug logging

Hosted-mode subcommands (beta):
  login                             Sign in to hosted mode (GitHub OAuth)
  logout                            Clear saved hosted-mode token
  whoami                            Show signed-in account + tier + token expiry
  usage                             Show current quota (STT hours, rewrite calls)
  switch-mode {hosted,byo-key}      Switch between hosted and BYO-key API modes

Settings subcommand:
  settings [--json]                 Show resolved non-secret settings
  settings --set PATH=VALUE         Update allowlisted non-secret settings
  settings --reset-learning         Remove local global learning artifacts
  settings --reset-learning-app APP_ID
  settings --learning-export PATH
  settings --learning-import PATH
  settings --learning-reset SCOPE
  settings --learning-forget TARGET
```

The tray settings window now reports whether each saved change applied live,
rebuilt runtime providers, or requires a restart. `talk-to-tux settings --set`
prints the same apply mode metadata for local writes, but it only updates local
config on disk; it does not hot-patch another running Talk-to-Tux process.
Secret saves are still separate: use the tray settings window BYOK key controls
or edit `~/.config/talk-to-tux/secrets.env` directly. The CLI `settings --set`
path is intentionally limited to allowlisted non-secret values in
`~/.config/talk-to-tux/config.toml`.

### Settings Recovery

If a settings change leaves Talk-to-Tux pointed at a bad local endpoint or a
restart-required combination, recover with the same storage split the app uses:

1. Run `uv run talk-to-tux settings --json` to inspect the current non-secret
   state, or add `--check-local-inference` to see bounded endpoint health
   results.
2. Fix or remove the bad non-secret value in
   `~/.config/talk-to-tux/config.toml`, then restart Talk-to-Tux if the saved
   change was marked `restart_required`.
3. Fix, replace, or clear BYOK provider keys only in
   `~/.config/talk-to-tux/secrets.env`.
4. Run `uv run talk-to-tux --doctor` if desktop integration or provider setup
   still looks wrong after the config/secrets fix.

On a fresh install with no `config.toml`, no `TTT_API_MODE` env var, no
`secrets.env`, and no stored hosted token, the first invocation auto-launches
the setup wizard (same as running `--setup`). The wizard asks you to pick
**hosted** (GitHub OAuth, quota-managed) or **byo-key** (provide your own
OpenAI/Groq/ElevenLabs keys in `secrets.env`). `switch-mode byo-key` never
creates or modifies `secrets.env` — you edit it yourself.

## Running as a Service

To start Talk-to-Tux automatically on login and restart on crash:

```bash
# Copy the service file
cp deploy/talk-to-tux.service ~/.config/systemd/user/

# Enable and start
systemctl --user enable --now talk-to-tux

# Check status / logs
systemctl --user status talk-to-tux
journalctl --user -u talk-to-tux -f
```

The service auto-restarts within 3 seconds if the app crashes. The GNOME
Shell extension also auto-hides the recording overlay after 30 seconds if
the app stops responding.

## License

AGPL-3.0-only — see [LICENSE](LICENSE). Hosted beta service terms are tracked
separately on the Talk-to-Tux website.
