Metadata-Version: 2.4
Name: talk-to-tux
Version: 0.1.0
Summary: Voice-to-smart-paste pipeline for Linux
Project-URL: Homepage, https://github.com/viperjuice/talk-to-tux
Project-URL: Repository, https://github.com/viperjuice/talk-to-tux
Project-URL: Issues, https://github.com/viperjuice/talk-to-tux/issues
Author: viperjuice
License-Expression: AGPL-3.0-only
License-File: LICENSE
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: X11 Applications
Classifier: Environment :: X11 Applications :: Gnome
Classifier: Intended Audience :: End Users/Desktop
Classifier: License :: OSI Approved :: GNU Affero General Public License v3
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Multimedia :: Sound/Audio :: Speech
Requires-Python: >=3.11
Requires-Dist: baml-py>=0.218.0
Requires-Dist: evdev>=1.7
Requires-Dist: httpx>=0.28
Requires-Dist: keyring>=25
Requires-Dist: numpy>=1.26
Requires-Dist: openai>=1.60
Requires-Dist: pillow>=10.0
Requires-Dist: pulsectl-asyncio>=1.2
Requires-Dist: pydantic-settings>=2.7
Requires-Dist: pydantic>=2.10
Requires-Dist: python-dotenv>=1.0
Requires-Dist: sounddevice>=0.5
Requires-Dist: soundfile>=0.13
Requires-Dist: torch>=2.5
Requires-Dist: torchaudio>=2.5
Provides-Extra: indicator
Requires-Dist: pygobject>=3.42; extra == 'indicator'
Description-Content-Type: text/markdown

# Talk-to-Tux

Voice-to-smart-paste pipeline for Linux. Hold a mouse button, speak, and
the transcribed, LLM-rewritten text is pasted into your active application,
formatted for the context you're in.

```mermaid
flowchart LR
    A["Hold Button"] --> B["Record Audio"]
    B --> C["STT (Whisper)"]
    C --> D["LLM Rewrite"]
    D --> E["Smart Paste"]

    F["App Context\n(window + screenshot)"] --> D
```

## How It Works

1. **Hold** your mouse side button (or keyboard hotkey) and speak. The
   microphone is **always listening** in a 2-second ring buffer, so any
   speech immediately before the button press is also captured.
2. **Release** — audio is checked for speech by VAD, then sent to the STT provider chain
3. The transcription is **rewritten** by an LLM using your active window's context
   (app name, window title, AT-SPI widget text, screenshot)
4. The result is **pasted** into the focused application using the correct shortcut
5. **Double-tap** the side button after a paste to send Enter (e.g., submit a chat message)

Works on both **X11** and **Wayland** (GNOME, tested on Ubuntu 24.04+).
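
The 2-second pre-roll from step 1 can be pictured as a fixed-size ring buffer that silently drops its oldest samples. The sketch below is illustrative only: `PreRollBuffer`, the 16 kHz default, and the plain-list samples are assumptions, not the project's actual recorder.

```python
from collections import deque


class PreRollBuffer:
    """Hypothetical helper: keep only the most recent `seconds` of audio."""

    def __init__(self, sample_rate: int = 16_000, seconds: float = 2.0) -> None:
        # maxlen makes the deque behave as a ring buffer.
        self._samples: deque[float] = deque(maxlen=int(sample_rate * seconds))

    def write(self, chunk: list[float]) -> None:
        # When the deque is full, the oldest samples are discarded automatically.
        self._samples.extend(chunk)

    def snapshot(self) -> list[float]:
        # Copy the contents so the caller can prepend them to a new recording.
        return list(self._samples)


# Tiny sample rate so the behavior is easy to follow: capacity is 4 * 2 = 8.
buffer = PreRollBuffer(sample_rate=4, seconds=2.0)
buffer.write([0.0, 0.1, 0.2, 0.3, 0.4])
buffer.write([0.5, 0.6, 0.7, 0.8, 0.9])
pre_roll = buffer.snapshot()  # the 8 most recent samples, oldest first
```

On button press, a recorder built this way would take `snapshot()` as the start of the clip and append live audio after it.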

## Architecture

```mermaid
flowchart TB
    subgraph Trigger["Trigger Layer"]
        SB["Side Button / Keyboard Hotkey"]
    end

    subgraph Recording["Recording Phase"]
        SB -->|hold| REC["Audio Recorder\n(sounddevice + 2s ring buffer)"]
        SB -->|press| CTX0["Capture Context\nwindow + AT-SPI + screenshot"]
    end

    subgraph Processing["Processing Phase"]
        REC -->|release| VAD["VAD Gate\n(Silero)"]
        VAD -->|raw WAV| STT["STT Chain\ngpu_server -> elevenlabs -> groq -> openai -> google"]
        STT --> TX["Transcription"]
        CTX0 --> RP["Rewrite Prompt\n6-layer context"]
        TX --> RP
        RP --> LLM["BAML SmartRewrite\n(Ollama / Gemini / Groq)"]
    end

    subgraph Output["Paste Phase"]
        LLM --> TT["Tooltip / Notification\n(confirm or auto-paste)"]
        TT --> PASTE["Paster\nxclip/wl-copy + xdotool/wtype"]
    end
```

## Prerequisites

**OS**: Linux with GNOME (X11 or Wayland). **Python**: 3.11+.
**STT backend**: GPU server with CUDA, *or* an API key for ElevenLabs/Groq/OpenAI/Google.

### System packages

| Tool(s) | Group | Ubuntu/Debian (`apt`) | Arch (`pacman`) | Fedora (`dnf`) |
|---------|-------|-----------------------|-----------------|----------------|
| `libportaudio2` | Audio | `libportaudio2` | `portaudio` | `portaudio` |
| `grim` | Screenshots (Wayland) | `grim` | `grim` | `grim` |
| `scrot` | Screenshots (X11/XWayland) | `scrot` | `scrot` | `scrot` |
| `xclip` | Clipboard (X11) | `xclip` | `xclip` | `xclip` |
| `wl-copy`, `wl-paste` | Clipboard (Wayland) | `wl-clipboard` | `wl-clipboard` | `wl-clipboard` |
| `xdotool` | Keystroke injection (X11) | `xdotool` | `xdotool` | `xdotool` |
| `wtype` | Keystroke injection (Wayland/wlroots) | `wtype` | `wtype` | `wtype` |
| `ydotool` + `ydotoold` | Keystroke injection (GNOME Wayland) | see note below | `ydotool` | `ydotool` |
| `dbus-send`, `busctl`, `gdbus` | D-Bus utilities | `dbus` / `systemd` | `dbus` / `systemd` | `dbus` / `systemd` |
| `notify-send` | Desktop notifications | `libnotify-bin` | `libnotify` | `libnotify` |
| `evtest` | Input device debug | `evtest` | `evtest` | `evtest` |
| `pgrep` | Process checks | `procps` | `procps-ng` | `procps-ng` |

### ydotool on Ubuntu/Debian — build v1.0+ from source

Ubuntu's `apt` ships **ydotool 0.1.8**, which has no daemon and injects garbled
keystrokes. Build **v1.0+** from source instead:

```bash
# Build dependencies
sudo apt install cmake libevdev-dev libudev-dev

# Clone and build
git clone https://github.com/ReimuNotMoe/ydotool
cd ydotool && cmake -B build && cmake --build build && sudo cmake --install build

# Enable the daemon and grant /dev/uinput access
systemctl --user enable --now ydotoold
sudo usermod -aG input "$USER"   # log out and back in for this to take effect
# Or add a udev rule instead:
# echo 'KERNEL=="uinput", GROUP="input", MODE="0660"' | sudo tee /etc/udev/rules.d/99-uinput.rules
```

Arch and Fedora ship a working `ydotool` via their package managers.

### Self-verify

```bash
uv run talk-to-tux --doctor
```

## Quick Start

```bash
git clone https://github.com/viperjuice/talk-to-tux.git
cd talk-to-tux
uv sync --all-groups

# Copy and edit config — secrets live under ~/.config (CWD .env is not loaded)
mkdir -p ~/.config/talk-to-tux
cp .env.example ~/.config/talk-to-tux/secrets.env

# Run
uv run talk-to-tux
```

On first run, the app auto-detects your mouse and starts listening for side
button presses. A system tray indicator shows the current state.

## Trigger Modes and Key Mapping

### Default: Mouse Side Buttons (hold-to-record)

| Button | evdev Code | Action |
|--------|-----------|--------|
| **BTN_SIDE** (thumb back) | 275 | Hold to record; recording stops once all trigger buttons are released |
| **BTN_EXTRA** (thumb forward) | 276 | Same as BTN_SIDE (either button starts recording) |

The device is grabbed exclusively so side buttons don't trigger browser
back/forward. All other mouse events (movement, clicks, scroll) are forwarded
transparently via uinput.

### Alternative: Keyboard Hotkey

| Key Combo | evdev Names | Action |
|-----------|------------|--------|
| **Ctrl + Super** (left) | `KEY_LEFTCTRL+KEY_LEFTMETA` | Toggle recording |

> **Note:** The Super key may trigger GNOME Activities. Disable with:
> `gsettings set org.gnome.mutter overlay-key ''`

### Customizing the Trigger

**Option 1: TOML config** (`~/.config/talk-to-tux/config.toml`)

```toml
[trigger]
mode = "mouse"           # "auto", "mouse", or "keyboard"
record_mode = "hold"     # "hold" (release to stop) or "toggle" (tap/tap)

[trigger.mouse]
button_codes = [275, 276]          # any evdev button codes
device_name = "Logitech G502"     # match by name substring (stable across reboots and USB replug)
# device_path = "/dev/input/event5"  # or explicit path (fragile)
grab = true

[trigger.keyboard]
hotkey = "KEY_LEFTCTRL+KEY_LEFTMETA"
```

**Option 2: Environment variables**

```bash
TTT_TRIGGER_MODE=keyboard
TTT_HOTKEY=KEY_RIGHTCTRL
TTT_RECORD_MODE=toggle
# Or nested format:
TTT_TRIGGER__MOUSE__BUTTON_CODES='[275, 276]'
TTT_TRIGGER__MOUSE__DEVICE_NAME="Logitech"
```

**Option 3: CLI flags**

```bash
uv run talk-to-tux --trigger keyboard --record-mode toggle
```

### Finding Your Button Codes

```bash
# List input devices
sudo evtest

# Pick your mouse, press buttons, note the codes:
#   Event: type 1 (EV_KEY), code 275 (BTN_SIDE), value 1
```

## Configuration

Configuration is loaded with this precedence (highest first):

1. **CLI arguments** (`--trigger mouse`, `--debug`, etc.)
2. **Environment variables** (`TTT_*` prefix)
3. **`~/.config/talk-to-tux/secrets.env`** (CWD `.env` intentionally **not** loaded — prevents rogue `.env` in a project dir from overriding secrets)
4. **TOML config** (`~/.config/talk-to-tux/config.toml`)
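
The precedence order above behaves like a layered lookup in which the first source that defines a key wins. A minimal sketch using `collections.ChainMap` follows; the keys and values are made up for illustration, and the application's real loader may be implemented differently.

```python
from collections import ChainMap

# Hypothetical values from each source, highest precedence first.
cli_args = {"trigger_mode": "keyboard"}
env_vars = {"trigger_mode": "mouse", "record_mode": "toggle"}
secrets_env = {"openai_api_key": "sk-example"}
toml_config = {"trigger_mode": "auto", "record_mode": "hold", "debug": False}

# ChainMap searches its maps left to right, so earlier maps override later ones.
resolved = ChainMap(cli_args, env_vars, secrets_env, toml_config)

trigger_mode = resolved["trigger_mode"]  # set on the CLI, so the CLI wins
record_mode = resolved["record_mode"]    # not on the CLI, so the env var wins
debug = resolved["debug"]                # only defined in config.toml
```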

### Key Settings

| Section | Setting | Default | Description |
|---------|---------|---------|-------------|
| `stt` | `providers` | `gpu_server,elevenlabs,groq,openai,google` | STT fallback chain (tried in order) |
| `stt.gpu_server` | `url` | `http://localhost:8000` | Self-hosted Whisper server URL |
| `rewrite` | `enabled` | `true` | Enable LLM smart rewrite |
| `rewrite` | `ollama_base_url` | `http://localhost:11434/v1` | Ollama / vLLM endpoint |
| `context` | `screenshot_enabled` | `true` | Include screenshot in LLM context |
| `ducking` | `enabled` | `true` | Reduce other apps' volume while recording |
| `ducking` | `factor` | `0.15` | Duck to 15% of original volume |
| `tooltip` | `enabled` | `false` | Show confirm-before-paste tooltip (disabled = auto-paste) |
| `tooltip` | `use_notifications` | `true` | Use desktop notifications (vs GTK tooltip) |
| `paste` | `enabled` | `true` | Auto-paste into active window |
| `indicator` | `enabled` | `true` | Show system tray indicator |
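
The `stt.providers` setting describes a fallback chain: each provider is tried in order until one succeeds. The sketch below shows that pattern under stated assumptions; `transcribe_with_fallback` and the fake providers are hypothetical stand-ins, not the project's API.

```python
from collections.abc import Callable

Provider = Callable[[bytes], str]


def transcribe_with_fallback(
    audio: bytes, providers: list[tuple[str, Provider]]
) -> tuple[str, str]:
    """Return (provider_name, text) from the first provider that succeeds."""
    errors = []
    for name, provider in providers:
        try:
            return name, provider(audio)
        except Exception as exc:  # any failure moves on to the next provider
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all STT providers failed: " + "; ".join(errors))


def fake_gpu_server(audio: bytes) -> str:
    raise ConnectionError("connection refused")  # simulate an offline server


def fake_groq(audio: bytes) -> str:
    return "hello world"


used, text = transcribe_with_fallback(
    b"fake-wav-bytes", [("gpu_server", fake_gpu_server), ("groq", fake_groq)]
)
```

Collecting the per-provider errors keeps the final failure message useful when every provider in the chain is unavailable.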

### Per-App Rules

Customize behavior per application in `config.toml`:

```toml
[[app]]
match = "Google-chrome"
match_title = "*ChatGPT*"         # optional title filter (glob or ~regex)
paste_shortcut = "ctrl+shift+v"   # override paste shortcut
rewrite_hint = "Conversational tone, no markdown"

[[app]]
match = "Code"
rewrite_hint = "Generate code in the language of the active file"

[[app]]
match = "kitty"
is_terminal = true
paste_shortcut = "ctrl+shift+v"
```

Default rules for common apps (browsers, terminals, editors, chat apps) are
shipped in `src/talk_to_tux/data/default_app_rules.toml`. User rules in
`config.toml` take priority.

### API Keys

Store API keys in `~/.config/talk-to-tux/secrets.env` (CWD `.env` is not loaded):

```bash
TTT_OPENAI_API_KEY=sk-...
TTT_GROQ_API_KEY=gsk_...
TTT_GOOGLE_API_KEY=AIza...
TTT_ELEVENLABS_API_KEY=...
```

## GPU Server Deployment

The STT server runs [faster-whisper](https://github.com/SYSTRAN/faster-whisper)
on NVIDIA GPUs.

```bash
cd server
uv sync
uv run ttt-server --host 0.0.0.0 --port 8000

# Or Docker:
docker build -f deploy/Dockerfile.server -t ttt-server .
docker run --gpus all -p 8000:8000 ttt-server
```

Systemd service files are in `deploy/`.

## Development

```bash
uv sync --all-groups          # install all deps including dev
uv run pytest tests/ -q       # run all tests (1611)
make lint                     # ruff check
make format                   # ruff format
uv run baml-cli generate      # regenerate BAML client after .baml changes
```

See [CONTRIBUTING.md](CONTRIBUTING.md) for the full developer guide.

## CLI Reference

```
uv run talk-to-tux [COMMAND] [OPTIONS]

Daemon options (no subcommand):
  --trigger {auto,mouse,keyboard}   Trigger mode
  --record-mode {toggle,hold}       Recording mode
  --no-indicator                    Disable system tray
  --no-tooltip                      Disable tooltip/notifications
  --no-validation                   Disable recording validation sound
  --show-config                     Print resolved config and exit
  --doctor                          Run diagnostics and exit
  --setup                           Run interactive first-run setup wizard
  --migrate-config                  Convert .env to config.toml
  --debug                           Enable debug logging

Hosted-mode subcommands (beta):
  login                             Sign in to hosted mode (GitHub OAuth)
  logout                            Clear saved hosted-mode token
  whoami                            Show signed-in account + tier + token expiry
  usage                             Show current quota (STT hours, rewrite calls)
  switch-mode {hosted,byo-key}      Switch between hosted and BYO-key API modes
```

On a fresh install with no `config.toml`, no `TTT_API_MODE` env var, no
`secrets.env`, and no stored hosted token, the first invocation auto-launches
the setup wizard (same as running `--setup`). The wizard asks you to pick
**hosted** (GitHub OAuth, quota-managed) or **byo-key** (provide your own
OpenAI/Groq/ElevenLabs keys in `secrets.env`). `switch-mode byo-key` never
creates or modifies `secrets.env` — you edit it yourself.

## Running as a Service

To start Talk-to-Tux automatically on login and restart on crash:

```bash
# Copy the service file (create the user unit directory if it does not exist)
mkdir -p ~/.config/systemd/user
cp deploy/talk-to-tux.service ~/.config/systemd/user/

# Enable and start
systemctl --user enable --now talk-to-tux

# Check status / logs
systemctl --user status talk-to-tux
journalctl --user -u talk-to-tux -f
```

The service auto-restarts within 3 seconds if the app crashes. The GNOME
Shell extension also auto-hides the recording overlay after 30 seconds if
the app stops responding.

## License

AGPL-3.0-only (see [LICENSE](LICENSE)). Commercial licensing is available for
organizations that need to use Talk-to-Tux without the copyleft requirements.
