Metadata-Version: 2.4
Name: livekit-wakeword
Version: 0.2.1
Summary: Wake word detection for voice-enabled applications
Project-URL: Homepage, https://github.com/livekit/livekit-wakeword
Project-URL: Repository, https://github.com/livekit/livekit-wakeword
Project-URL: Issues, https://github.com/livekit/livekit-wakeword/issues
Author-email: Binh Pham <binh.pham@livekit.io>
License-Expression: Apache-2.0
License-File: LICENSE
Keywords: keyword-spotting,livekit,onnx,speech,voice,wake-word
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Multimedia :: Sound/Audio :: Speech
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.11
Requires-Dist: numpy>=1.24
Requires-Dist: onnxruntime>=1.17
Provides-Extra: eval
Requires-Dist: matplotlib>=3.8; extra == 'eval'
Provides-Extra: export
Requires-Dist: onnx>=1.15; extra == 'export'
Requires-Dist: onnxruntime>=1.17; extra == 'export'
Requires-Dist: onnxscript>=0.6.2; extra == 'export'
Provides-Extra: listener
Requires-Dist: pyaudio>=0.2.14; extra == 'listener'
Provides-Extra: train
Requires-Dist: audiomentations>=0.36; extra == 'train'
Requires-Dist: cmudict>=1.0; extra == 'train'
Requires-Dist: huggingface-hub>=0.20; extra == 'train'
Requires-Dist: librosa>=0.10; extra == 'train'
Requires-Dist: nltk>=3.8; extra == 'train'
Requires-Dist: pronouncing>=0.2; extra == 'train'
Requires-Dist: pydantic>=2.0; extra == 'train'
Requires-Dist: pyyaml>=6.0; extra == 'train'
Requires-Dist: rich>=13.0; extra == 'train'
Requires-Dist: scikit-learn>=1.3; extra == 'train'
Requires-Dist: scipy>=1.11; extra == 'train'
Requires-Dist: setuptools<81,>=69.0; extra == 'train'
Requires-Dist: soundfile>=0.12; extra == 'train'
Requires-Dist: torch-audiomentations>=0.11; extra == 'train'
Requires-Dist: torch>=2.5; extra == 'train'
Requires-Dist: torchaudio>=2.5; extra == 'train'
Requires-Dist: tqdm>=4.60; extra == 'train'
Requires-Dist: typer>=0.12; extra == 'train'
Requires-Dist: webrtcvad-wheels>=2.0.14; extra == 'train'
Provides-Extra: voxcpm
Requires-Dist: voxcpm>=2.0.0; extra == 'voxcpm'
Description-Content-Type: text/markdown

<a href="https://livekit.io/">
  <img src="https://raw.githubusercontent.com/livekit/livekit-wakeword/main/.github/assets/livekit-mark.png" alt="LiveKit logo" width="100" height="100">
</a>

# livekit-wakeword

[![CI](https://github.com/livekit/livekit-wakeword/actions/workflows/ci.yml/badge.svg)](https://github.com/livekit/livekit-wakeword/actions/workflows/ci.yml)
[![Python 3.11+](https://img.shields.io/badge/python-3.11%2B-blue)](https://www.python.org/downloads/)
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE)
[![Version](https://img.shields.io/badge/version-v0.2-green)](https://github.com/livekit/livekit-wakeword)

An open-source wake word library for creating voice-enabled applications. Based on [openWakeWord](https://github.com/dscripka/openWakeWord) with streamlined training: generate synthetic data, augment, train, and export from a single YAML config.

**Features:**

- **Conv-Attention classifier**: 1D temporal convolutions + multi-head self-attention replace openWakeWord's flat DNN head, preserving temporal structure across the 16-frame embedding window for better accuracy and fewer false positives (see [comparison](#why-livekit-wakeword) below)
- **Backward compatible** with openWakeWord models and library
- **Multilingual support**: over 30 languages via VoxCPM synthetic data generation
- **Train anywhere**: local machine, cloud, or spawn [SkyPilot](https://github.com/skypilot-org/skypilot) jobs
- **Zero dependency headaches**: uv handles everything

**Quick Links:**

- [Why livekit-wakeword](#why-livekit-wakeword)
- [Using a Pre-trained Model](#using-a-pre-trained-model)
- [Training a Custom Wake Word](#training-a-custom-wake-word)
- [Multilingual Support](#multilingual)
- [Python API](#python-api)
- [Example: Wake Word–Triggered Agent](https://github.com/livekit-examples/hello-wakeword)

## Why livekit-wakeword

Both livekit-wakeword and openWakeWord share the same audio front-end: mel spectrograms are fed through frozen [Google speech embedding](https://github.com/google-research/google-research/tree/master/embedding_fns) and [openWakeWord embedding](https://github.com/dscripka/openWakeWord) models to produce a `(16, 96)` feature matrix (16 timesteps × 96-dim embeddings). The difference is the classification head that sits on top.

### Architecture

**openWakeWord** flattens the `(16, 96)` matrix into a 1536-d vector and feeds it through a small fully-connected DNN:

```
Flatten(16×96=1536) → Dense → Dense → Sigmoid
```

While the positional information is technically still present in the flattened vector, the dense layer has no inductive bias for temporal structure and must learn any sequential patterns from scratch.

**livekit-wakeword** introduces a **Conv-Attention** (`conv_attention`) classifier:

```
Conv1D blocks → MultiheadAttention → Mean pool → Linear(1) → Sigmoid
```

1. **1D Convolutions** (kernel size 3) slide across the 16 timesteps, capturing local temporal patterns (e.g., syllable transitions).
2. **Multi-Head Self-Attention** models long-range dependencies across the full temporal window, letting the model learn which timestep relationships matter.
3. **Mean pooling** aggregates attended features into a fixed-size vector for the final sigmoid output.

### Results

To compare, we evaluated an openWakeWord DNN, a livekit-wakeword DNN (same architecture, better training pipeline), and a livekit-wakeword conv-attention model on the same "hey livekit" validation set (15,000 positive clips, 45,084 negative clips, 25 hours of audio). The livekit-wakeword models were trained with the [prod config](configs/prod.yaml).

| Metric              | openWakeWord (DNN) | livekit-wakeword (DNN) | livekit-wakeword (conv-attention) |
| ------------------- | :----------------: | :--------------------: | :-------------------------------: |
| **AUT\***           |       0.0720       |         0.0423         |            **0.0012**             |
| **FPPH\***          |        8.50        |          3.07          |             **0.08**              |
| **Recall\***        |       68.6%        |         85.3%          |             **86.1%**             |
| Optimal Threshold\* |        0.01        |          0.01          |               0.68                |

<table>
<tr>
<td align="center"><strong>openWakeWord (DNN)</strong></td>
<td align="center"><strong>livekit-wakeword (DNN)</strong></td>
<td align="center"><strong>livekit-wakeword (conv-attention)</strong></td>
</tr>
<tr>
<td><img src="https://raw.githubusercontent.com/livekit/livekit-wakeword/main/.github/assets/det_openwakeword.png" alt="DET curve: openWakeWord" width="280"></td>
<td><img src="https://raw.githubusercontent.com/livekit/livekit-wakeword/main/.github/assets/det_livekit_wakeword_dnn.png" alt="DET curve: livekit-wakeword DNN" width="280"></td>
<td><img src="https://raw.githubusercontent.com/livekit/livekit-wakeword/main/.github/assets/det_livekit_wakeword.png" alt="DET curve: livekit-wakeword conv-attention" width="280"></td>
</tr>
</table>

The livekit-wakeword DNN already outperforms openWakeWord's DNN thanks to the improved training pipeline (focal loss, embedding mixup, 3-phase training, checkpoint averaging). However, both DNN models fail to meet the FPPH target: their optimal thresholds fall to 0.01, meaning no operating point can keep false positives low enough.

The conv-attention head is what unlocks the low false positive rate: **60x lower AUT** and **100x fewer false positives per hour** than openWakeWord, while detecting 17% more wake words.

_\***AUT** (Area Under the DET curve): summarizes the full DET (Detection Error Tradeoff) curve, which plots false positive rate vs false negative rate across all thresholds. Lower is better (0 = perfect). A DET curve that hugs the bottom-left corner indicates strong separation between wake words and non-wake-words._

_\***FPPH** (False Positives Per Hour): how many times the model falsely triggers per hour of non-wake-word audio. Lower is better. For production use, < 0.5 FPPH is typical._

_\***Recall**: the percentage of actual wake words correctly detected. Higher is better._

_\***Optimal Threshold**: the detection threshold that maximizes recall while keeping FPPH at or below the target (configurable, default 0.1). A threshold of 0.01 indicates no threshold could meet the FPPH target; the evaluator fell back to the highest balanced accuracy._

### Why conv-attention wins

- **Temporal awareness**: the conv-attention model sees the _order_ of speech events, not just their presence, reducing false triggers from phonetically similar but differently ordered phrases.
- **Better accuracy at the same model size**: attention lets a small model selectively focus on discriminative time regions rather than learning dense connections over the full flattened input.
- **Lower false-positive rates**: temporal structure helps reject partial or reordered matches that a flat DNN would accept.

The conv-attention head is the default. You can switch to the original DNN or an RNN head via `model_type` in your config:

```yaml
model:
  model_type: conv_attention # conv_attention (default) | dnn | rnn
  model_size: small # tiny, small, medium, large
```

## Using a Pre-trained Model

### Python

**System dependencies (for microphone listener):**

```bash
# macOS
brew install portaudio

# Ubuntu/Debian
sudo apt install portaudio19-dev
```

**Installation:**

```bash
pip install livekit-wakeword
# or
uv add livekit-wakeword
```

For microphone listening, install with the `listener` extra:

```bash
pip install livekit-wakeword[listener]
```

**Basic inference:**

```python
from livekit.wakeword import WakeWordModel

model = WakeWordModel(models=["hey_livekit.onnx"])

# Feed audio frames (16kHz, int16 or float32)
scores = model.predict(audio_frame)
if scores["hey_livekit"] > 0.5:
    print("Wake word detected!")
```

**Async listener with microphone:**

```python
import asyncio
from livekit.wakeword import WakeWordModel, WakeWordListener

model = WakeWordModel(models=["hey_livekit.onnx"])

async def main():
    async with WakeWordListener(model, threshold=0.5, debounce=2.0) as listener:
        while True:
            detection = await listener.wait_for_detection()
            print(f"Detected {detection.name}! ({detection.confidence:.2f})")

asyncio.run(main())
```

### Rust

For Rust applications, use the [`livekit-wakeword`](https://crates.io/crates/livekit-wakeword) crate:

```toml
[dependencies]
livekit-wakeword = "0.1"
```

```rust
use livekit_wakeword::WakeWordModel;

let mut model = WakeWordModel::new(&["hey_livekit.onnx"], 16000)?;

// Feed ~2s PCM audio chunks (i16, at configured sample rate)
let scores = model.predict(&audio_chunk)?;
if scores["hey_livekit"] > 0.5 {
    println!("Wake word detected!");
}
```

The mel spectrogram and speech embedding models are compiled into the binary; only the wake word classifier ONNX file is loaded at runtime. Audio at supported sample rates (22050–384000 Hz) is automatically resampled to 16 kHz.

### Swift

For Swift applications on iOS 16+ / macOS 14+, add the [`LiveKitWakeWord`](swift) Swift package:

```swift
// Package.swift
.package(url: "https://github.com/livekit/livekit-wakeword", branch: "main"),
```

**Basic inference:**

```swift
import LiveKitWakeWord

let classifier = Bundle.main.url(forResource: "hey_livekit", withExtension: "onnx")!
let model = try WakeWordModel(models: [classifier], sampleRate: 16_000)

// Feed ~2 s PCM chunks (Int16, at the configured sample rate):
let scores = try model.predict(audioChunk)
if (scores["hey_livekit"] ?? 0) > 0.5 {
    print("Wake word detected!")
}
```

**Async listener with microphone:**

```swift
import LiveKitWakeWord

let classifier = Bundle.main.url(forResource: "hey_livekit", withExtension: "onnx")!
let model = try WakeWordModel(models: [classifier], sampleRate: 16_000)
let listener = WakeWordListener(model: model, threshold: 0.5, debounce: 2.0)

try await listener.start()
for await detection in listener.detections() {
    print("Detected \(detection.name)! (confidence=\(String(format: "%.2f", detection.confidence)))")
}
```

The mel spectrogram and speech embedding `.onnx` models ship inside the Swift package; only the classifier ships with your app. Audio at any sample rate is resampled to 16 kHz internally via `AVAudioConverter` (matches the Rust crate's 22050–384000 Hz input range); the listener handles mic-hardware resampling automatically. ONNX Runtime with the CoreML Execution Provider dispatches to ANE / GPU / CPU by default (override via `executionProvider:`).

Add `NSMicrophoneUsageDescription` to Info.plist (and `com.apple.security.device.audio-input` on sandboxed macOS apps) for listener use. A runnable SwiftUI demo (iOS + macOS) lives in [examples/ios_wakeword/](examples/ios_wakeword/).

## Training a Custom Wake Word

### CLI quick start

**System dependencies:**

```bash
# macOS
brew install espeak-ng ffmpeg portaudio

# Ubuntu/Debian
sudo apt install espeak-ng libsndfile1 ffmpeg sox portaudio19-dev
```

**Installation (with pip):**

```bash
pip install livekit-wakeword[train,eval,export]
```

**Installation (with uv):**

```bash
uv tool install livekit-wakeword[train,eval,export]
```

**Installation (from source):**

```bash
# Install uv (if you don't have it)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and install
git clone https://github.com/livekit/livekit-wakeword
cd livekit-wakeword
uv sync --all-extras
```

**Download models and data:**

```bash
livekit-wakeword setup --config configs/prod.yaml
```

**Train a wake word:**

```bash
livekit-wakeword run configs/prod.yaml
```

Or run stages individually:

```bash
livekit-wakeword generate configs/prod.yaml  # TTS synthesis + adversarial negatives
livekit-wakeword augment configs/prod.yaml   # Augment + extract features
livekit-wakeword train configs/prod.yaml     # 3-phase adaptive training
livekit-wakeword export configs/prod.yaml    # Export to ONNX
livekit-wakeword eval configs/prod.yaml      # Evaluate model (DET curve, AUT, FPPH)
```

You can also evaluate any compatible ONNX model (e.g., one trained with openWakeWord):

```bash
livekit-wakeword eval configs/prod.yaml -m /path/to/other_model.onnx
```

Eval produces a DET curve plot and metrics JSON in the output directory. See [Evaluation](docs/evaluation.md) for details.

### Configuration

The full pipeline runs based on a single YAML configuration file. Example configs:

| Config                                               | Wake word      | Use                                                      |
| ---------------------------------------------------- | -------------- | -------------------------------------------------------- |
| [configs/prod.yaml](configs/prod.yaml)               | "hey livekit"  | Production-scale for English with **Piper TTS** backbone |
| [configs/test.yaml](configs/test.yaml)               | "hey livekit"  | Small end-to-end test run with **Piper TTS** backbone    |
| [configs/prod_voxcpm.yaml](configs/prod_voxcpm.yaml) | "你好 livekit" | Production-scale multilingual with **VoxCPM** backbone   |
| [configs/test_voxcpm.yaml](configs/test_voxcpm.yaml) | "你好 livekit" | Small end-to-end test run with **VoxCPM** backbone       |

The bare minimum configuration required is as follows:

```yaml
model_name: hey_livekit
target_phrases:
  - "hey livekit"

n_samples: 10000 # training samples per class
model:
  model_type: conv_attention # conv_attention, dnn, or rnn
  model_size: small # tiny, small, medium, large
steps: 50000
target_fp_per_hour: 0.2
```

### Multilingual

We support training wake words in 30 languages with [VoxCPM2 TTS](https://github.com/OpenBMB/VoxCPM) synthetic data generation:

```
Arabic, Burmese, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Vietnamese

Chinese Dialect: 四川话, 粤语, 吴语, 东北话, 河南话, 陕西话, 山东话, 天津话, 闽南话
```

To use this, add `tts_backend` in your configuration YAML:

```yaml
tts_backend: voxcpm
```

And install `livekit-wakeword` with the `voxcpm` optional dependency:

```bash
pip install livekit-wakeword[train,eval,export,voxcpm]
```

> [!WARNING]
> **Multilingual models currently achieve lower accuracy than English models.** This is due to two factors: (1) the frozen [Google speech embedding](https://www.kaggle.com/models/google/speech-embedding) model was trained predominantly on English data, so its representations are weaker for other languages, and (2) VoxCPM produces less diverse synthetic speech compared to the large Piper TTS speaker pool available for English.
>
> <img src="https://raw.githubusercontent.com/livekit/livekit-wakeword/main/.github/assets/det_nihao_livekit_voxcpm.png" alt="DET curve: 你好 livekit (VoxCPM)" width="350">
>
> _DET curve for "你好 livekit" trained with VoxCPM — note the higher error rates compared to the English "hey livekit" results [above](#results)._
>
> To improve multilingual performance, increase the number of `voice_design_prompts` (50–100) and `n_samples` in your config so the model sees a wider range of speaker and prosody variation:
>
> ```yaml
> n_samples: 50000 # ↑ from 25000
> n_samples_val: 10000 # ↑ from 5000
>
> voxcpm_tts:
>   voice_design_prompts:
>     # Add 50-100 diverse prompts covering age, gender, pitch, pace, accent, energy, etc.
>     - "A young adult woman, clear mid-pitch voice, moderate pace, calm and professional"
>     - "A young adult man, warm baritone, steady pace, friendly and articulate"
>     - "A middle-aged woman, slightly low pitch, measured pace, confident tone"
>     # ... add more prompts for wider speaker diversity
> ```

More detail: [docs/data-generation.md](docs/data-generation.md) (Piper, VoxCPM, setup).

### Python API

The full training pipeline is available as a Python API, so you can import and drive it from your own code instead of using the CLI:

```python
from livekit.wakeword import (
    WakeWordConfig,
    load_config,
    run_generate,
    run_augment,
    run_extraction,
    run_train,
    run_export,
    run_eval,
)

# Load from YAML or construct directly
config = load_config("configs/prod.yaml")

# Or build a config programmatically
config = WakeWordConfig(
    model_name="hey_robot",
    target_phrases=["hey robot"],
    n_samples=5000,
    steps=30000,
)

# Run individual stages
run_generate(config)     # TTS synthesis + adversarial negatives
run_augment(config)      # Add noise, reverb, pitch shifts
run_extraction(config)   # Extract mel spectrograms + speech embeddings → .npy
run_train(config)        # 3-phase adaptive training
onnx_path = run_export(config)       # Export to ONNX

# Evaluate the exported model
results = run_eval(config, onnx_path)
print(f"AUT={results['aut']:.4f}  FPPH={results['fpph']:.2f}  Recall={results['recall']:.1%}")
```

This is useful for integrating wake word training into larger pipelines, automating model iteration, or building custom tooling on top of the data generation and training stages.

### Cloud GPUs with SkyPilot

See [skypilot/train.yaml](skypilot/train.yaml) for an example training job on Nebius.

```bash
sky launch skypilot/train.yaml
```

## Further Reading

- [Architecture Overview](docs/overview.md): system design and data flow
- [Data Generation](docs/data-generation.md): TTS synthesis and adversarial negatives
- [Augmentation](docs/augmentation.md): audio transforms and alignment
- [Feature Extraction](docs/feature-extraction.md): mel spectrograms and embeddings
- [Training](docs/training.md): 3-phase training and checkpoint averaging
- [Export & Inference](docs/export-and-inference.md): ONNX export and Python API
- [Evaluation](docs/evaluation.md): DET curves, AUT, and model comparison

## License

This project is licensed under the Apache License 2.0. See the [LICENSE](LICENSE) file for details.
