Metadata-Version: 2.4
Name: voxon
Version: 1.0.0
Summary: High-fidelity voice cloning TTS for Apple Silicon — powered by Qwen3-TTS and ChatterboxTurbo via MLX
Author: voxon contributors
License: MIT License
        
        Copyright (c) 2025 jarvis-tts contributors
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Homepage, https://github.com/voxon/voxon
Project-URL: Documentation, https://github.com/voxon/voxon#readme
Project-URL: Repository, https://github.com/voxon/voxon
Project-URL: Bug Tracker, https://github.com/voxon/voxon/issues
Project-URL: Changelog, https://github.com/voxon/voxon/blob/main/CHANGELOG.md
Keywords: text-to-speech,tts,voice-cloning,speech-synthesis,mlx,apple-silicon,qwen3,chatterbox
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: MacOS
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Multimedia :: Sound/Audio :: Speech
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Typing :: Typed
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: mlx-audio>=0.2.0
Requires-Dist: soundfile>=0.12.1
Requires-Dist: sounddevice>=0.4.6
Requires-Dist: numpy>=1.24.0
Requires-Dist: fastapi>=0.110.0
Requires-Dist: uvicorn[standard]>=0.29.0
Requires-Dist: noisereduce>=3.0.0
Requires-Dist: resampy>=0.4.2
Requires-Dist: faster-whisper>=1.0.0
Requires-Dist: torch>=2.1.0
Requires-Dist: torchaudio>=2.1.0
Requires-Dist: transformers>=4.40.0
Dynamic: license-file

# voxon

[![PyPI version](https://img.shields.io/pypi/v/voxon.svg)](https://pypi.org/project/voxon/)
[![Python versions](https://img.shields.io/pypi/pyversions/voxon.svg)](https://pypi.org/project/voxon/)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
[![Platform: macOS](https://img.shields.io/badge/platform-macOS%20Apple%20Silicon-lightgrey.svg)]()

High-fidelity, real-time voice cloning on Apple Silicon.

**One command to install. One command to speak.**

```bash
pip install voxon
voxon "Hello, this is Voxon."
```

---

## Overview

voxon turns a 15-second audio clip into a permanent voice identity that you can synthesise speech through at any time. It runs entirely on-device — no cloud, no API keys, no data leaving your machine.

The synthesis engine loads into memory once and stays there as a background daemon. Subsequent calls have sub-second dispatch overhead regardless of text length.

**Backends supported:**

- **Qwen3-TTS 1.7B** (default) — Alibaba's 1.7B parameter multilingual TTS model, quantised to 8-bit for Apple Silicon via MLX
- **ChatterboxTurbo** — ResembleAI's fast, high-quality voice cloning model (opt-in via `VOXON_MODEL=chatterbox-turbo`)

**Requirements:** macOS 13+, Apple Silicon (M1 or later), Python 3.11+

---

## Prerequisites

**System:** macOS 13+, Apple Silicon (M1 or later), Python 3.11+

**HuggingFace account (free, one-time):**

The TTS model weights are hosted on HuggingFace and require a free account and a one-click licence acceptance before the first download.

1. Create a free account at [huggingface.co](https://huggingface.co)
2. Accept the model licence — one click, no payment:
   - Qwen3-TTS (default): [huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base)
3. Generate a token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) (read-only is sufficient)
4. Log in once from your terminal:

```bash
pip install huggingface_hub
huggingface-cli login
```

After this, model weights download automatically on first use and are cached locally — you never need to do this again.

> If you skip this step, voxon will print a clear error with the exact URL to visit when you first try to synthesise.

---

## Installation

```bash
pip install voxon
```

The installer pulls all required dependencies automatically, including MLX, PyTorch (CPU), faster-whisper, and the audio processing libraries. No manual configuration is needed.

> **First run:** On the first synthesis call, model weights (~1–3 GB depending on backend) are downloaded from HuggingFace. This happens once and is cached locally.

---

## Quick Start

### Step 1 — Prepare a reference voice

Record or export approximately 15 seconds of clean speech as any audio file (WAV, MP3, FLAC, AIFF). Then run:

```bash
voxon prep my_recording.wav
```

This processes the audio through a five-stage pipeline:

1. Loads and trims to the specified window
2. Resamples to 24 kHz mono
3. Applies stationary noise reduction
4. Normalises peak amplitude
5. Auto-transcribes using Whisper large-v3-turbo

**Critical:** Open the generated `.txt` file in `~/.voxon/voices/` and verify every word matches the audio exactly — including hesitations and fillers. Transcript accuracy directly determines cloning fidelity.

```bash
# Use a specific time window
voxon prep recording.wav --start 30 --duration 15

# Skip noise reduction (for already-clean audio)
voxon prep recording.wav --no-denoise

# Transcribe manually (write the .txt yourself after)
voxon prep recording.wav --no-transcribe
```
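
To make stages 1 and 4 of the pipeline concrete, here is a minimal Python sketch of window trimming and peak normalisation. It is an illustration only, not voxon's actual implementation: the function names and the `target_peak` value of 0.95 are assumptions, while the `start_s` and `duration_s` defaults mirror the `--start` and `--duration` flags. It works on a plain list of float samples so that it runs without any audio libraries.

```python
from typing import List


def trim_window(samples: List[float], sample_rate: int,
                start_s: float = 0.0, duration_s: float = 15.0) -> List[float]:
    """Return the samples that fall inside [start_s, start_s + duration_s)."""
    first = int(start_s * sample_rate)
    last = first + int(duration_s * sample_rate)
    return samples[first:last]


def normalise_peak(samples: List[float], target_peak: float = 0.95) -> List[float]:
    """Scale the signal so its largest absolute sample equals target_peak.

    target_peak = 0.95 is an assumed headroom value, not taken from voxon.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence or empty input: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]
```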

### Step 2 — Speak

```bash
voxon "Hello, this is Voxon speaking."
```

On the first invocation, voxon starts the synthesis daemon, waits for it to warm up (model load + Metal shader compilation), then synthesises and plays the audio. Subsequent calls skip the model load and warm-up, so only the synthesis time itself remains (see Performance).

```bash
# Use a specific voice
voxon --voice alan "Hello world"

# Play the audio and also save it to a WAV file
voxon "Hello world" --save output.wav

# Combine a specific voice with --save
voxon --voice alan "This sentence will be played and saved." --save sentence.wav
```

---

## Command Reference

### `voxon "<text>"` — Synthesise

```
voxon [--voice NAME] [--save FILE] "<text>"
```

| Flag | Description |
|---|---|
| `--voice NAME` | Use a specific prepared voice. Restarts the daemon if the voice changes. |
| `--save FILE` | Save synthesised audio to this WAV file in addition to playing. |

### `voxon prep <file>` — Prepare a voice

```
voxon prep <FILE> [--start SECONDS] [--duration SECONDS]
                   [--out-wav PATH] [--out-transcript PATH]
                   [--no-denoise] [--no-transcribe]
```

| Flag | Default | Description |
|---|---|---|
| `--start` | `0` | Start offset in seconds into the source file. |
| `--duration` | `15` | Duration to extract in seconds. |
| `--out-wav` | `~/.voxon/voices/<name>_clean.wav` | Output WAV path. |
| `--out-transcript` | `~/.voxon/voices/<name>.txt` | Output transcript path. |
| `--no-denoise` | — | Skip noise reduction. |
| `--no-transcribe` | — | Skip Whisper transcription (write the `.txt` manually). |

### `voxon voices` — List voices

```
voxon voices
```

Lists all prepared voices in `~/.voxon/voices/`, indicating which is the current default and which is loaded in the running daemon.

### `voxon status` — Daemon status

```
voxon status
```

Prints the daemon's current state: online/offline, active voice, backend model, and whether voice embeddings are cached.

### `voxon stop` — Stop the daemon

```
voxon stop
```

Sends SIGTERM to the background daemon and removes the PID file.
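
For reference, the stop sequence described above looks roughly like this in Python. This is a sketch, not voxon's source: it assumes `daemon.pid` holds a single integer PID, and `stop_daemon` is a hypothetical name. It can be handy in scripts that need to clean up a daemon they started.

```python
import os
import signal
from pathlib import Path

PID_FILE = Path.home() / ".voxon" / "daemon.pid"


def stop_daemon(pid_file: Path = PID_FILE) -> bool:
    """Send SIGTERM to the PID stored in pid_file, then delete the file.

    Returns True if the signal was delivered, False if no daemon was found.
    Assumes the file contains one integer PID (not confirmed by the README).
    """
    try:
        pid = int(pid_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        return False  # no PID file, or contents are not an integer
    try:
        os.kill(pid, signal.SIGTERM)
        delivered = True
    except ProcessLookupError:
        delivered = False  # stale PID file: the process is already gone
    pid_file.unlink(missing_ok=True)
    return delivered
```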

---

## Voice Files

All runtime state is stored under `~/.voxon/`:

```
~/.voxon/
├── config              # Active voice selection and other persisted settings
├── daemon.pid          # PID of the running synthesis daemon
├── daemon.log          # Daemon stdout/stderr — check this on errors
└── voices/
    ├── myvoice_clean.wav   # Cleaned reference audio
    ├── myvoice.txt         # Exact transcript
    ├── alan_clean.wav
    └── alan.txt
```

To add a pre-existing voice (if you already have a clean WAV and transcript), copy the files to `~/.voxon/voices/` manually and run `voxon voices` to confirm detection.
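
When copying files in by hand, it can help to check that every voice has both halves of the pair. The sketch below assumes the `<name>_clean.wav` / `<name>.txt` naming shown above. `list_complete_voices` is a hypothetical helper, and voxon's own detection logic may differ.

```python
from pathlib import Path
from typing import List

VOICES_DIR = Path.home() / ".voxon" / "voices"
WAV_SUFFIX = "_clean.wav"


def list_complete_voices(voices_dir: Path = VOICES_DIR) -> List[str]:
    """Return sorted voice names that have both a cleaned WAV and a transcript."""
    if not voices_dir.is_dir():
        return []
    names = []
    for wav in voices_dir.glob(f"*{WAV_SUFFIX}"):
        name = wav.name[: -len(WAV_SUFFIX)]
        # Keep the voice only if the matching transcript exists
        if (voices_dir / f"{name}.txt").is_file():
            names.append(name)
    return sorted(names)
```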

---

## Configuration

voxon respects the following environment variables:

| Variable | Default | Description |
|---|---|---|
| `VOXON_PORT` | `7860` | TCP port the synthesis daemon listens on. |
| `VOXON_MODEL` | `qwen3` | TTS backend: `qwen3` (default) or `chatterbox-turbo`. |
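
A script that talks to the daemon can resolve these variables the same way. This is a minimal sketch that assumes only the variable names and the port default from the table above. `Settings` and `load_settings` are hypothetical names, and an unset `VOXON_MODEL` is left as `None` so that the backend choice stays with voxon.

```python
import os
from typing import Mapping, NamedTuple, Optional


class Settings(NamedTuple):
    port: int
    model: Optional[str]  # None means "let voxon pick its default backend"


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read VOXON_PORT and VOXON_MODEL, validating the port range."""
    port = int(env.get("VOXON_PORT", "7860"))
    if not 1 <= port <= 65535:
        raise ValueError(f"VOXON_PORT out of range: {port}")
    return Settings(port=port, model=env.get("VOXON_MODEL"))
```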

---

## HTTP API

The daemon exposes a local HTTP API that you can query directly:

```bash
# Full WAV download
curl "http://localhost:7860/synthesize?text=Hello+world" -o out.wav

# Chunked streaming (lowest latency for long text)
curl -sN "http://localhost:7860/stream_chunked?text=Hello+world" -o out.wav

# Daemon health
curl http://localhost:7860/health
```
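
The same requests can be issued from Python's standard library. The sketch below is a hypothetical client, not part of voxon: `build_url` and `synthesize_to_file` are illustrative names, and the 120-second timeout is an arbitrary choice. It assumes the daemon is already running on the default port. Building the query with `urlencode` keeps characters such as `&` or `=` in the text from breaking the URL.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:7860"  # default VOXON_PORT


def build_url(path: str, text: str, base_url: str = BASE_URL) -> str:
    """Build an endpoint URL with the text safely percent-encoded."""
    return f"{base_url}{path}?{urlencode({'text': text})}"


def synthesize_to_file(text: str, out_path: str, base_url: str = BASE_URL,
                       timeout: float = 120.0) -> int:
    """Fetch /synthesize and write the WAV bytes to out_path.

    Returns the number of bytes written. Requires a running daemon.
    """
    with urlopen(build_url("/synthesize", text, base_url), timeout=timeout) as resp:
        data = resp.read()
    with open(out_path, "wb") as fh:
        fh.write(data)
    return len(data)
```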

**Endpoints:**

| Method | Path | Description |
|---|---|---|
| `GET` | `/health` | Returns daemon status and readiness. |
| `GET` | `/synthesize?text=` | Full synthesis, returns complete WAV. |
| `GET` | `/stream?text=` | Streaming WAV (synthesises first, then streams). |
| `GET` | `/stream_chunked?text=` | True sentence-level streaming — first audio arrives after the first sentence. |

---

## Performance

All measurements were taken on a single-speaker English sentence (~80 characters).

| Mac | Model load | Warmup | Synthesis (1 sentence) |
|---|---|---|---|
| M1 8 GB | ~30 s | ~5 s | ~3–5 s |
| M2 16 GB | ~20 s | ~4 s | ~2–4 s |
| M3 16 GB | ~15 s | ~3 s | ~1–3 s |
| M4 16 GB | ~10 s | ~2 s | ~1–2 s |

Model load and warmup happen once per daemon start. Voice embedding cache pre-computation (also once at startup) eliminates the dominant per-synthesis overhead on subsequent calls.

RTF (real-time factor) is synthesis time divided by the duration of the audio produced; a value below 1.0 means synthesis is faster than real time.
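
To compute RTF for your own runs, time the synthesis call and divide by the length of the resulting audio. The helper below is illustrative, and the 5-second clip length in the comment is an assumed figure, not a measurement from the table.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time divided by the duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Example with assumed numbers: 3.0 s of compute for a 5.0 s clip
# real_time_factor(3.0, 5.0) -> 0.6, i.e. faster than real time
```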

---

## Tips for Best Quality

**Reference audio:**
- 15 seconds, single speaker, no background music, quiet environment
- Any consistent microphone works — quality matters less than consistency

**Transcript:**
- Must match word-for-word, including "um", "uh", false starts, and any audible sounds
- A single wrong word or missing word will degrade output quality noticeably

**Input text:**
- Sentences of 15–80 characters synthesise fastest; longer inputs are auto-split
- Punctuation matters — commas and periods control synthesis rhythm
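
How voxon splits long input is not documented here, so the following is only a sketch of one plausible approach: break the text at whitespace that follows sentence-ending punctuation. `split_sentences` is a hypothetical helper. It can be used to pre-split text on the client side, for example to check that each sentence falls in the 15–80 character range.

```python
import re
from typing import List

# Whitespace preceded by '.', '!' or '?'. Abbreviations such as "Dr. Smith"
# will also be split here; a production splitter would need to handle those.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> List[str]:
    """Split text into sentences, dropping empty pieces."""
    parts = _SENTENCE_END.split(text.strip())
    return [part.strip() for part in parts if part.strip()]
```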

---

## Troubleshooting

**Daemon fails to start:**
```bash
cat ~/.voxon/daemon.log
```

**Voice not found:**
```bash
voxon voices   # list all prepared voices
```

**Port conflict:**
```bash
VOXON_PORT=7861 voxon "Hello world"
```

**Model download stalls:** The first run downloads model weights from HuggingFace. Check your network connection. Progress is visible in `~/.voxon/daemon.log`.

**ChatterboxTurbo "reference clip too short":** The clip must be strictly longer than 5 seconds. Re-run `voxon prep` with a longer `--duration`.

---

## License

MIT — see [LICENSE](LICENSE).
