Metadata-Version: 2.4
Name: audiobook-tts
Version: 1.0.2
Summary: Local audiobook generation system using MLX-Audio for Apple Silicon
License: MIT
Requires-Python: >=3.10
Requires-Dist: espeakng-loader
Requires-Dist: fastapi>=0.109.0
Requires-Dist: ffmpeg-normalize>=1.28.0
Requires-Dist: misaki>=0.9.4
Requires-Dist: mlx-audio>=0.3.1
Requires-Dist: nltk>=3.8.1
Requires-Dist: num2words
Requires-Dist: numpy>=1.24.0
Requires-Dist: phonemizer<3.3
Requires-Dist: pydantic>=2.5.0
Requires-Dist: pydub>=0.25.1
Requires-Dist: python-multipart>=0.0.6
Requires-Dist: pyyaml>=6.0.1
Requires-Dist: rich>=13.0.0
Requires-Dist: setuptools<81
Requires-Dist: soundfile>=0.12.1
Requires-Dist: spacy<4,>=3.7
Requires-Dist: transformers>=5.0.0rc3
Requires-Dist: uvicorn[standard]>=0.27.0
Provides-Extra: dev
Requires-Dist: httpx>=0.25.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.21.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.1.0; extra == 'dev'
Requires-Dist: pytest>=7.4.0; extra == 'dev'
Description-Content-Type: text/markdown

# Audiobook TTS

Local audiobook generation system using MLX-Audio for Apple Silicon Macs.

## Features

- **High-quality narration** using Kokoro (20 English preset voices)
- **Voice cloning with emotion control** using Chatterbox
- **Multi-speaker dialogue** using Dia with [S1]/[S2] tags
- **Layered configuration** — sensible defaults + project-specific overrides
- **Project scaffolding** — `audiobook-init` sets up config in seconds
- **ACX-compliant audio** with automatic normalization
- **Progress tracking** with resume capability
- **FastAPI server** for integration with other tools

## Requirements

- macOS 14.0+ (Sonoma or later)
- Apple Silicon (M1/M2/M3/M4)
- Python 3.10+
- [uv](https://docs.astral.sh/uv/) (required — see note below)
- ffmpeg

> **Why uv?** Several dependencies (`misaki`, `transformers`) have Python version metadata that `pip` enforces too strictly on Python 3.13+. `uv` handles this correctly. All install commands below use `uv`.

## Installation

```bash
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install ffmpeg
brew install ffmpeg

# Install audiobook-tts into the current uv project
# (run `uv init` first if the directory has no pyproject.toml)
uv add audiobook-tts
```

Verify the installation (prefix each command with `uv run` if the project's virtual environment is not activated):

```bash
audiobook-generate --list-voices
audiobook-voice-ref --list
```

## Quick Start

### 1. Initialize project config

```bash
audiobook-init
```

This creates `.audiobook/` with template configuration files:

```
.audiobook/
  voices.yaml       # Voice profiles (edit to add characters)
  series.yaml       # Per-book POV-to-voice mappings
  voice-refs/       # Kokoro reference clips for Chatterbox
```

### 2. Generate reference clips (for Chatterbox voice cloning)

```bash
audiobook-voice-ref --all
```

This generates a WAV clip for each of the 20 English Kokoro presets in `.audiobook/voice-refs/`.

### 3. Edit configuration

Edit `.audiobook/voices.yaml` to define character voices and `.audiobook/series.yaml` to map books to POV characters. See [Configuration](#configuration) below.

### 4. Generate audiobook

```bash
# With series config (automatic POV-to-voice mapping)
audiobook-generate \
  --input compiled/my-manuscript.md \
  --series-config .audiobook/series.yaml

# With specific voice
audiobook-generate --input manuscript.md --voice narrator_female_us

# Specific chapters
audiobook-generate --input manuscript.md --chapters 1-5

# As MP3
audiobook-generate --input manuscript.md --format mp3
```

## CLI Commands

| Command | Purpose |
|---------|---------|
| `audiobook-generate` | Generate audiobook from manuscript |
| `audiobook-init` | Scaffold `.audiobook/` project config |
| `audiobook-voice-ref` | Generate Kokoro reference clips for Chatterbox |
| `audiobook-server` | Start FastAPI TTS server |

### audiobook-generate

```bash
audiobook-generate --input manuscript.md [options]

Options:
  --input, -i        Path to compiled manuscript or chapter directory
  --output, -o       Output directory (default: ./output)
  --chapters, -c     Chapter range (e.g., "1-5" or "1,3,5-10")
  --voice, -v        Voice profile name (default: narrator_female_us)
  --format, -f       Output format: wav or mp3 (default: wav)
  --config           Path to voices.yaml config file
  --series-config    Path to series.yaml for automatic book/voice detection
  --resume, -r       Resume from last checkpoint
  --no-normalize     Skip ACX normalization
  --list-voices      List available voice profiles
```
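
The `--chapters` syntax (`"1-5"` or `"1,3,5-10"`) can be read as a comma-separated list of single chapters and inclusive ranges. The sketch below shows one way such a spec could be expanded; `parse_chapter_range` is a hypothetical helper written for illustration, not the function the CLI actually uses, and its handling of stray commas and descending ranges is an assumption.

```python
def parse_chapter_range(spec: str) -> list[int]:
    """Expand a spec such as "1,3,5-10" into a sorted list of chapter numbers."""
    chapters: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas (assumed behaviour)
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            chapters.update(range(start, end + 1))  # ranges are inclusive
        else:
            chapters.add(int(part))
    return sorted(chapters)


print(parse_chapter_range("1,3,5-10"))  # [1, 3, 5, 6, 7, 8, 9, 10]
```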

### audiobook-init

```bash
audiobook-init [--dir .audiobook] [--force]
```

### audiobook-voice-ref

```bash
audiobook-voice-ref --all                    # Generate all 20 English presets
audiobook-voice-ref --preset af_bella        # Generate single preset
audiobook-voice-ref --list                   # List available presets
audiobook-voice-ref --all --output-dir DIR   # Custom output directory
audiobook-voice-ref --preset af_bella --text "Custom text"
```

## Configuration

### How config loading works

Configuration is layered:

1. **Built-in defaults** — 9 Kokoro narrator voices + 1 Dia dialogue voice, audio settings, processing settings
2. **Project overrides** (`.audiobook/voices.yaml`) — your character voices and Chatterbox profiles merge on top

When you run `audiobook-generate --series-config .audiobook/series.yaml`, the CLI automatically loads `.audiobook/voices.yaml` from the same directory if it exists.
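
To illustrate what "merge on top" means, here is a minimal sketch of a recursive dictionary merge in which project values override built-in defaults key by key. It assumes a deep merge; the package may instead replace whole voice profiles. `merge_config` is a hypothetical name, and the preset and speed given for `narrator_female_us` are illustrative values, not the shipped defaults.

```python
from copy import deepcopy


def merge_config(defaults: dict, overrides: dict) -> dict:
    """Return a new dict where `overrides` is layered on top of `defaults`."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = merge_config(merged[key], value)
        else:
            # Scalars, lists and new keys simply replace the default.
            merged[key] = deepcopy(value)
    return merged


# Illustrative values only; the real built-in defaults may differ.
builtin = {
    "voices": {
        "narrator_female_us": {"model": "kokoro", "voice_preset": "af_heart", "speed": 1.0},
    }
}
project = {
    "voices": {
        "narrator_female_us": {"speed": 0.95},
        "voice_protagonist": {"model": "kokoro", "voice_preset": "am_liam"},
    }
}

merged = merge_config(builtin, project)
print(sorted(merged["voices"]))  # ['narrator_female_us', 'voice_protagonist']
```

In this sketch the project file only needs to list the keys it changes, and the built-in dictionary is left unmodified.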

### voices.yaml — Voice profiles

```yaml
voices:
  # Kokoro voice (fast, preset-based)
  voice_protagonist:
    model: kokoro
    voice_preset: am_liam
    speed: 1.0
    lang_code: a    # 'a' = American, 'b' = British
    description: "Young male - gentle, thoughtful"

  # Chatterbox voice (voice-cloned with emotion control)
  chatterbox_protagonist:
    model: chatterbox
    voice_preset: am_liam
    ref_audio: voice-refs/am_liam.wav   # Relative to this file's directory
    exaggeration: 0.5                    # 0.0 = neutral, 1.0 = max emotion
    cfg_weight: 0.5
    temperature: 0.8
    description: "Young male - thoughtful [Chatterbox]"
```

`ref_audio` paths are resolved relative to the config file's directory, not CWD.
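
The rule can be sketched with `pathlib`: a relative `ref_audio` is joined to the directory that contains the config file, while an absolute path is left alone. `resolve_ref_audio` is a hypothetical helper for illustration, and passing absolute paths through unchanged is an assumption.

```python
from pathlib import Path


def resolve_ref_audio(config_path: str, ref_audio: str) -> Path:
    """Resolve `ref_audio` against the config file's directory, not the CWD."""
    ref = Path(ref_audio)
    if ref.is_absolute():
        return ref  # assumed: absolute paths are used as given
    return Path(config_path).parent / ref


print(resolve_ref_audio(".audiobook/voices.yaml", "voice-refs/am_liam.wav"))
```

On macOS this prints `.audiobook/voice-refs/am_liam.wav`, regardless of the directory the command is run from.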

### series.yaml — Per-book POV mappings

```yaml
series:
  name: "My Series"
  default_voice: narrator_female_us

books:
  1-my-first-book:
    title: "My First Book"
    default_voice: narrator_female_us
    pov_voices:
      Alice: chatterbox_alice
      Bob: chatterbox_bob
    chapter_announcement:
      enabled: true
      format_string: "Chapter {number}. {title}"
      pause_after_ms: 1000
```

Book identifiers are matched against manuscript filenames (e.g., `1-my-first-book` matches `compiled/1-my-first-book-manuscript.md`).

## Voice Profiles

### Kokoro Voices (Narration)

| Category | Voice IDs |
|----------|-----------|
| American Female | `af_heart`, `af_bella`, `af_nova`, `af_sky`, `af_nicole`, `af_sarah` |
| American Male | `am_adam`, `am_echo`, `am_eric`, `am_liam`, `am_michael`, `am_onyx` |
| British Female | `bf_alice`, `bf_emma`, `bf_isabella`, `bf_lily` |
| British Male | `bm_daniel`, `bm_fable`, `bm_george`, `bm_lewis` |

### Chatterbox (Voice Cloning)

Chatterbox clones any Kokoro voice from a reference clip, adding emotion control:

| Parameter | Default | Range | Effect |
|-----------|---------|-------|--------|
| `exaggeration` | 0.5 | 0.0-1.0 | Emotion intensity (0 = flat, 1 = maximum) |
| `cfg_weight` | 0.5 | 0.0-1.0 | Text adherence (higher = more faithful) |
| `temperature` | 0.8 | 0.0-1.0 | Sampling randomness (lower = more consistent) |

Guidelines:
- **Controlled characters** (military, strategists): `exaggeration: 0.2-0.3`
- **Moderate characters** (narrators, scholars): `exaggeration: 0.4-0.5`
- **Emotional characters** (protagonists, passionate): `exaggeration: 0.6-0.7`

### Dia (Multi-Speaker Dialogue)

```
[S1] The door creaked open.
[S2] Who's there? (gasps)
[S1] It's just me.
```

Supported non-verbal sounds: `(laughs)`, `(sighs)`, `(gasps)`, `(coughs)`, `(clears throat)`, `(screams)`, `(whispers)`, `(singing)`, `(humming)`, `(whistles)`
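
If dialogue is held as structured data, a script in this format can be assembled programmatically. The sketch below assumes exactly two speakers, as the `[S1]`/`[S2]` tags suggest; `build_dia_script` is a hypothetical helper, not part of the package.

```python
def build_dia_script(turns: list[tuple[int, str]]) -> str:
    """Format (speaker, text) pairs as one [S1]/[S2]-tagged line per turn."""
    lines = []
    for speaker, text in turns:
        if speaker not in (1, 2):
            raise ValueError(f"unsupported speaker number: {speaker}")
        lines.append(f"[S{speaker}] {text.strip()}")
    return "\n".join(lines)


script = build_dia_script([
    (1, "The door creaked open."),
    (2, "Who's there? (gasps)"),
    (1, "It's just me."),
])
print(script)
```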

## API Server

```bash
audiobook-server --host 0.0.0.0 --port 8000
```

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/v1/health` | GET | Health check |
| `/v1/voices` | GET | List voices |
| `/v1/generate` | POST | Generate narration |
| `/v1/generate/dialogue` | POST | Generate dialogue |
| `/v1/generate/chapter` | POST | Generate chapter (background) |
| `/v1/status/{job_id}` | GET | Check job status |
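
A minimal client sketch for the two read-only endpoints, using only the standard library. It assumes the server is reachable at `http://localhost:8000` and that responses are JSON; the response fields are not documented here, so the sketch returns the decoded body as is. The helper names `endpoint` and `get_json` are hypothetical.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # assumption: server started with --port 8000 on this machine


def endpoint(path: str, base_url: str = BASE_URL) -> str:
    """Join the base URL and an API path without doubling slashes."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def get_json(path: str, base_url: str = BASE_URL, timeout: float = 10.0):
    """GET an endpoint and decode the body, assuming it is JSON."""
    with urlopen(endpoint(path, base_url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `get_json("/v1/health")` and `get_json("/v1/voices")` fetch the health check and the voice list. The POST endpoints need request bodies whose schema is defined in `api/schemas.py`, so they are left out of this sketch.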

## ACX/Audible Compliance

Generated audio meets ACX requirements:

- **Loudness**: -20 LUFS target (ACX specifies its -23 to -18 dB window as RMS)
- **Peak**: <= -3 dB true peak
- **Noise floor**: <= -60 dB
- **Room tone**: 0.5s silence at start/end
- **MP3**: 192 kbps CBR
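
For a quick sanity check of the level limits, the sketch below measures plain RMS and sample peak in dBFS on a list of float samples (1.0 = full scale). These are simpler stand-ins for LUFS loudness and true peak, so treat the result as a rough indication only. `measure_levels` is a hypothetical helper, and the noise-floor and room-tone checks are not covered.

```python
import math


def level_db(value: float) -> float:
    """Convert a linear amplitude (1.0 = full scale) to dBFS."""
    return 20.0 * math.log10(value) if value > 0 else float("-inf")


def measure_levels(samples: list[float]) -> dict:
    """Report RMS and sample peak in dBFS against the -23..-18 dB and -3 dB limits."""
    if not samples:
        raise ValueError("no samples")
    peak_db = level_db(max(abs(s) for s in samples))
    rms_db = level_db(math.sqrt(sum(s * s for s in samples) / len(samples)))
    return {
        "peak_db": peak_db,
        "rms_db": rms_db,
        "peak_ok": peak_db <= -3.0,
        "rms_ok": -23.0 <= rms_db <= -18.0,
    }


# One second of a 440 Hz sine at amplitude 0.14, sampled at 44.1 kHz.
tone = [0.14 * math.sin(2 * math.pi * 440 * n / 44100) for n in range(44100)]
report = measure_levels(tone)
print(round(report["rms_db"], 1), report["peak_ok"], report["rms_ok"])
```

The test tone has an RMS of about -20.1 dB and a peak of about -17.1 dB, so both checks pass.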

## Performance

On Apple Silicon M4 Max:

| Model | Speed | Memory |
|-------|-------|--------|
| Kokoro-82M | ~25x real-time | ~2-3 GB |
| Chatterbox (fp16) | ~3-5x real-time | ~4-6 GB |
| Dia-1.6B | ~5-8x real-time | ~4-6 GB |

A 100,000-word novel (~11 hours of audio) takes roughly:
- **Kokoro**: ~25-30 minutes
- **Chatterbox**: ~2-4 hours
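
These estimates follow from dividing the audio duration by the real-time factor. The sketch below reproduces the arithmetic; the 150 words-per-minute narration rate is an assumption inferred from the 100,000 words and ~11 hours figures, and both function names are hypothetical.

```python
def audio_hours_from_words(word_count: int, words_per_minute: float = 150.0) -> float:
    """Estimate narration length; 150 wpm is an assumed reading rate."""
    return word_count / words_per_minute / 60.0


def generation_minutes(audio_hours: float, realtime_factor: float) -> float:
    """Time to synthesize `audio_hours` of audio at `realtime_factor` x real time."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_hours * 60.0 / realtime_factor


hours = audio_hours_from_words(100_000)       # about 11.1 hours
kokoro = generation_minutes(11, 25)           # about 26 minutes
chatterbox_fast = generation_minutes(11, 5)   # 132 minutes (2.2 hours)
chatterbox_slow = generation_minutes(11, 3)   # 220 minutes (about 3.7 hours)
print(round(hours, 1), round(kokoro, 1), chatterbox_fast, chatterbox_slow)
```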

## Project Structure

```
audiobook-tts/
├── src/audiobook_tts/
│   ├── cli.py              # audiobook-generate CLI
│   ├── init_project.py     # audiobook-init CLI
│   ├── voice_ref.py        # audiobook-voice-ref CLI
│   ├── server.py           # FastAPI server
│   ├── config.py           # Layered config loading
│   ├── series_config.py    # Series/book config
│   ├── compat.py           # espeak/misaki compatibility shims
│   ├── defaults/
│   │   ├── voices.yaml           # Built-in default voices
│   │   ├── voices.yaml.template  # Project config template
│   │   └── series.yaml.template  # Series config template
│   ├── models/
│   │   ├── tts_engine.py   # Kokoro, Dia, CSM, Chatterbox engines
│   │   └── model_manager.py
│   ├── processing/
│   │   ├── text_processor.py
│   │   ├── audio_processor.py
│   │   └── manuscript.py
│   └── api/
│       ├── routes.py
│       └── schemas.py
├── tests/
├── pyproject.toml
└── README.md
```

## Troubleshooting

### "Model not found" error

Models download automatically from Hugging Face on first use, so the first run needs an internet connection.

### "ffmpeg not found" error

```bash
brew install ffmpeg
```

### Out of memory

Close other memory-intensive applications. For dialogue, use the 4-bit Dia model.

### Slow generation

- Ensure you're running natively on Apple Silicon (not Rosetta)
- Use Kokoro instead of Chatterbox for faster generation
- Reduce chunk size via config: `max_chunk_chars: 300`
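
To show what a character limit such as `max_chunk_chars` implies, here is a sketch of greedy sentence packing: sentences are appended to the current chunk until the next one would exceed the limit. The regex sentence splitter and the `chunk_text` function are assumptions for illustration; the package's own text processor may split differently.

```python
import re


def chunk_text(text: str, max_chunk_chars: int = 300) -> list[str]:
    """Pack whole sentences into chunks of at most `max_chunk_chars` characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chunk_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = sentence  # an over-long sentence becomes its own chunk
    if current:
        chunks.append(current)
    return chunks


print(chunk_text("One. Two. Three.", max_chunk_chars=9))  # ['One. Two.', 'Three.']
```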

## License

MIT License

Kokoro and Dia are released under the Apache 2.0 license and Chatterbox under the MIT license; all three licenses permit commercial use.
