Metadata-Version: 2.4
Name: audiobook-tts
Version: 1.0.0
Summary: Local audiobook generation system using MLX-Audio for Apple Silicon
License: MIT
Requires-Python: >=3.10
Requires-Dist: espeakng-loader
Requires-Dist: fastapi>=0.109.0
Requires-Dist: ffmpeg-normalize>=1.28.0
Requires-Dist: misaki
Requires-Dist: mlx-audio>=0.3.1
Requires-Dist: nltk>=3.8.1
Requires-Dist: num2words
Requires-Dist: numpy>=1.24.0
Requires-Dist: phonemizer<3.3
Requires-Dist: pydantic>=2.5.0
Requires-Dist: pydub>=0.25.1
Requires-Dist: python-multipart>=0.0.6
Requires-Dist: pyyaml>=6.0.1
Requires-Dist: rich>=13.0.0
Requires-Dist: setuptools<81
Requires-Dist: soundfile>=0.12.1
Requires-Dist: spacy<4,>=3.7
Requires-Dist: transformers>=5.0.0rc3
Requires-Dist: uvicorn[standard]>=0.27.0
Provides-Extra: dev
Requires-Dist: httpx>=0.25.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.21.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.1.0; extra == 'dev'
Requires-Dist: pytest>=7.4.0; extra == 'dev'
Description-Content-Type: text/markdown

# Audiobook TTS

Local audiobook generation system using MLX-Audio for Apple Silicon Macs.

## Features

- **High-quality narration** using Kokoro (54 preset voices)
- **Voice cloning with emotion control** using Chatterbox (clones Kokoro voices with per-character emotion exaggeration)
- **Multi-speaker dialogue** using Dia with [S1]/[S2] tags
- **ACX-compliant audio** with automatic normalization
- **Progress tracking** with resume capability
- **FastAPI server** for integration with other tools
- **CLI tool** for batch processing

## Requirements

- macOS 14.0+ (Sonoma or later)
- Apple Silicon (M1/M2/M3/M4)
- Python 3.10+
- [uv](https://docs.astral.sh/uv/) - Fast Python package manager
- ffmpeg

## Installation

### 1. Install System Dependencies

```bash
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install ffmpeg
brew install ffmpeg
```

### 2. Install Package with uv

```bash
cd tools/audiobook-tts

# Create venv and install with dev dependencies
uv venv .venv
uv pip install -e ".[dev]"
```

This installs:
- `mlx-audio` - MLX-optimized TTS models
- `pydub`, `soundfile` - Audio processing
- `ffmpeg-normalize` - ACX-compliant normalization
- `fastapi`, `uvicorn` - API server
- `rich` - Beautiful CLI output
- `pytest`, `pytest-cov` - Testing (dev)

### 3. Verify Installation

```bash
# Test Kokoro model
uv run python -c "from mlx_audio.tts.utils import load_model; m = load_model('mlx-community/Kokoro-82M-bf16'); print('Kokoro loaded!')"

# Test CLI
uv run audiobook-generate --list-voices

# Run tests
uv run pytest tests/ -v
```

## Usage

### Command-Line Interface

#### Generate Full Audiobook

```bash
# From compiled manuscript
audiobook-generate --input ../../compiled/1-resonance-and-reason-manuscript.md

# With specific voice
audiobook-generate --input ../../compiled/1-resonance-and-reason-manuscript.md --voice narrator_male_uk
```

#### Generate Specific Chapters

```bash
# Generate chapters 1-5
audiobook-generate --input ../../compiled/1-resonance-and-reason-manuscript.md --chapters 1-5

# Generate specific chapters
audiobook-generate --input ../../compiled/1-resonance-and-reason-manuscript.md --chapters 1,3,5-10
```

#### Resume Interrupted Generation

```bash
audiobook-generate --input ../../compiled/1-resonance-and-reason-manuscript.md --resume
```

#### List Available Voices

```bash
audiobook-generate --list-voices
```

### API Server

#### Start Server

```bash
audiobook-server --host 0.0.0.0 --port 8000
```

#### API Endpoints

```bash
# Health check
curl http://localhost:8000/v1/health

# List voices
curl http://localhost:8000/v1/voices

# Generate narration
curl -X POST http://localhost:8000/v1/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, world!", "voice": "narrator_female_us"}'

# Generate dialogue
curl -X POST http://localhost:8000/v1/generate/dialogue \
  -H "Content-Type: application/json" \
  -d '{"text": "[S1] Hello! [S2] Hi there. (laughs)"}'

# Generate chapter (background task)
curl -X POST http://localhost:8000/v1/generate/chapter \
  -H "Content-Type: application/json" \
  -d '{"text": "Chapter text here...", "chapter_number": 1}'

# Check job status
curl http://localhost:8000/v1/status/{job_id}
```
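The same endpoints can be called from Python. As a sketch, the request bodies match the curl examples above; with the server running, `httpx` (included in the `dev` extras) can send them (the network call is shown in a comment only):

```python
def build_generate_payload(text: str, voice: str = "narrator_female_us") -> dict:
    """JSON body for POST /v1/generate, mirroring the curl example above."""
    return {"text": text, "voice": voice}

def build_dialogue_payload(text: str) -> dict:
    """JSON body for POST /v1/generate/dialogue ([S1]/[S2] tags go in the text)."""
    return {"text": text}

# With the server running and httpx installed:
#   import httpx
#   r = httpx.post("http://localhost:8000/v1/generate",
#                  json=build_generate_payload("Hello, world!"), timeout=120)
#   r.raise_for_status()
```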

### Python Library

```python
from audiobook_tts.models import KokoroEngine, ModelManager
from audiobook_tts.processing import TextProcessor, AudioProcessor
from audiobook_tts.config import Config

# Initialize
config = Config.default()
model_manager = ModelManager()
text_processor = TextProcessor(max_chunk_chars=400)
audio_processor = AudioProcessor()

# Load manuscript
from pathlib import Path
text = Path("../../compiled/1-resonance-and-reason-manuscript.md").read_text()

# Process chapter
chunks = text_processor.chunk_text(text[:5000])  # First 5000 chars

# Generate audio
kokoro = model_manager.get_kokoro()
audio_segments = []

for chunk in chunks:
    for segment in kokoro.generate(chunk, voice="af_bella"):
        audio_segments.append(segment.audio)

# Combine and save
combined = audio_processor.concatenate_segments(audio_segments)
combined = audio_processor.add_silence_padding(combined)
audio_processor.save_audio(combined, Path("./output/test.wav"), normalize=True)
```

## Voice Profiles

### Kokoro Voices (Narration)

| Category | Voice IDs |
|----------|-----------|
| American Female | `af_heart`, `af_bella`, `af_nova`, `af_sky`, `af_nicole`, `af_sarah` |
| American Male | `am_adam`, `am_echo`, `am_eric`, `am_liam`, `am_michael`, `am_onyx` |
| British Female | `bf_alice`, `bf_emma`, `bf_isabella`, `bf_lily` |
| British Male | `bm_daniel`, `bm_fable`, `bm_george`, `bm_lewis` |

### Dia (Multi-Speaker Dialogue)

Use speaker tags to indicate different speakers:

```
[S1] The door creaked open.
[S2] Who's there? (gasps)
[S1] It's just me.
```

Supported non-verbal sounds:
- `(laughs)`, `(sighs)`, `(gasps)`, `(coughs)`
- `(clears throat)`, `(screams)`, `(whispers)`
- `(singing)`, `(humming)`, `(whistles)`
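The tag convention is easy to split into per-speaker turns with a regex; this parser is not part of the package, just a sketch of the format:

```python
import re

# Matches a speaker tag like [S1] or [S2] and captures the speaker number.
TAG_RE = re.compile(r"\[S(\d+)\]")

def split_dialogue(text: str) -> list[tuple[str, str]]:
    """Split tagged dialogue into (speaker, line) pairs."""
    parts = TAG_RE.split(text)  # ["", "1", " line...", "2", " line...", ...]
    return [
        (f"S{speaker}", line.strip())
        for speaker, line in zip(parts[1::2], parts[2::2])
    ]

split_dialogue("[S1] The door creaked open. [S2] Who's there? (gasps)")
# → [("S1", "The door creaked open."), ("S2", "Who's there? (gasps)")]
```

Non-verbal sounds like `(gasps)` stay inside the line text, where the model interprets them.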

## Chatterbox Voice Cloning

Chatterbox TTS provides voice cloning with per-character emotion control. It clones Kokoro voices from reference audio clips, then adds adjustable emotion exaggeration.

### Step 1: Generate Reference Clips

Generate ~15 second Kokoro reference clips for each character voice:

```bash
cd tools/audiobook-tts
uv run python scripts/generate_voice_refs.py
```

This creates WAV files in `audiobooks/voice-refs/` (one per Kokoro preset used in the series).

### Step 2: Use Chatterbox Profiles

Chatterbox profiles are pre-configured in `config/voices.yaml` (prefixed `chatterbox_*`) and mapped in `config/series.yaml`. Generate as usual:

```bash
uv run audiobook-generate \
  --input ../../compiled/1-resonance-and-reason-manuscript.md \
  --series-config config/series.yaml \
  --output ./output/book1/
```

### Tuning Chatterbox Parameters

Each Chatterbox voice profile has three tunable parameters in `config/voices.yaml`:

| Parameter | Default | Range | Effect |
|-----------|---------|-------|--------|
| `exaggeration` | 0.5 | 0.0-1.0 | Emotion intensity. 0.0 = flat/neutral, 1.0 = maximum emotion |
| `cfg_weight` | 0.5 | 0.0-1.0 | Classifier-free guidance. Higher = more adherence to text |
| `temperature` | 0.8 | 0.0-1.0 | Sampling randomness. Lower = more consistent, higher = more varied |

Character-appropriate defaults:
- **Controlled characters** (Cassieth, Aurelius, Decimus, Basileon): `exaggeration: 0.2-0.3`
- **Moderate characters** (Dessa, Kael, Jorin, Vara): `exaggeration: 0.4-0.5`
- **Emotional characters** (Mirael, Lysa): `exaggeration: 0.6-0.7`
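Put together, a profile entry in `config/voices.yaml` might look like the following. This is a hypothetical example: the three tunable parameters are as documented above, but the other field names (`model`, `ref_audio`) are assumptions modeled on the Kokoro example later in this README — check the shipped `config/voices.yaml` for the actual schema:

```yaml
voices:
  chatterbox_mirael:
    model: chatterbox
    ref_audio: audiobooks/voice-refs/af_bella.wav  # Kokoro reference clip
    exaggeration: 0.65   # emotional character
    cfg_weight: 0.5
    temperature: 0.8
```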

### Switching Between Engines

To switch a character back to Kokoro, change the profile name in `config/series.yaml`:

```yaml
# Chatterbox (voice-cloned with emotion)
Mirael: chatterbox_mirael

# Kokoro (preset voice, faster)
Mirael: voice_mirael
```

Both Kokoro and Chatterbox profiles are defined in `config/voices.yaml` — the original Kokoro profiles remain available as fallback.

### Performance

Chatterbox is roughly 5-8x slower than Kokoro (consistent with the real-time factors below) due to the voice cloning process:


| Model | Speed | Memory |
|-------|-------|--------|
| Kokoro-82M | ~25x real-time | ~2-3 GB |
| Chatterbox (fp16) | ~3-5x real-time | ~4-6 GB |

A 100,000-word novel takes approximately 2-4 hours with Chatterbox (vs ~30 minutes with Kokoro).

## Configuration

### Custom Voice Configuration

Create a `config/voices.yaml` file:

```yaml
voices:
  my_narrator:
    model: kokoro
    voice_preset: af_bella
    speed: 0.95  # Slightly slower
    lang_code: a
    description: "Custom narrator voice"

audio:
  sample_rate: 24000
  output_format: wav
  target_lufs: -20.0
  true_peak: -3.0

processing:
  max_chunk_chars: 400
  crossfade_ms: 50
  silence_padding_ms: 500
```

Use with CLI:

```bash
audiobook-generate --input manuscript.md --config config/voices.yaml
```
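The same file can also be read programmatically with PyYAML (already a dependency). A minimal sketch, shown here against an inline document so it is self-contained:

```python
import yaml  # PyYAML, pulled in via the pyyaml dependency

def load_voice(config_text: str, name: str) -> dict:
    """Parse a voices.yaml document and return one voice profile."""
    config = yaml.safe_load(config_text)
    return config["voices"][name]

doc = """
voices:
  my_narrator:
    model: kokoro
    voice_preset: af_bella
    speed: 0.95
"""
profile = load_voice(doc, "my_narrator")
# profile["voice_preset"] → "af_bella"
```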

## ACX/Audible Compliance

Generated audio meets ACX requirements:

- **Sample rate**: 44.1 kHz (upsampled from 24kHz)
- **Bit rate**: 192 kbps CBR (MP3)
- **Loudness**: -20 LUFS (-23 to -18 acceptable)
- **Peak**: ≤ -3 dB true peak
- **Noise floor**: ≤ -60 dB
- **Room tone**: 0.5s silence at start/end
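These targets map directly onto an `ffmpeg-normalize` invocation. As a sketch, the command below uses ffmpeg-normalize's documented CLI flags; the flags the package passes internally may differ:

```python
def acx_normalize_cmd(infile: str, outfile: str) -> list[str]:
    """Build an ffmpeg-normalize command line targeting the ACX levels above."""
    return [
        "ffmpeg-normalize", infile,
        "-o", outfile,
        "-t", "-20",           # target loudness (LUFS)
        "-tp", "-3",           # true-peak ceiling (dB)
        "-ar", "44100",        # resample to 44.1 kHz
        "-c:a", "libmp3lame",  # MP3 encoder
        "-b:a", "192k",        # 192 kbps
    ]

# subprocess.run(acx_normalize_cmd("chapter-01.wav", "chapter-01.mp3"), check=True)
```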

## Performance

On Apple Silicon M4 Max:

| Model | Speed | Memory |
|-------|-------|--------|
| Kokoro-82M | ~25x real-time | ~2-3 GB |
| Dia-1.6B | ~5-8x real-time | ~4-6 GB |
| Dia-1.6B-4bit | ~8-12x real-time | ~2-3 GB |

A 100,000-word novel (~11 hours audio) takes approximately:
- **Kokoro**: ~25-30 minutes to generate
- **With normalization**: Add ~5-10 minutes
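These estimates follow from two numbers: narration pace (roughly 150 words per minute of finished audio, an assumed typical pace) and the model's real-time factor. A back-of-envelope calculator:

```python
def estimate_generation_minutes(word_count: int, realtime_factor: float,
                                words_per_minute: float = 150.0) -> float:
    """Estimate generation time as audio length divided by the real-time factor."""
    audio_minutes = word_count / words_per_minute
    return audio_minutes / realtime_factor

# 100,000 words at ~150 wpm ≈ 11 hours of audio; at 25x real-time that is
# roughly 27 minutes of generation, matching the Kokoro figure above.
estimate_generation_minutes(100_000, 25)  # ≈ 26.7
```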

## Troubleshooting

### "Model not found" Error

Models are downloaded automatically from HuggingFace on first use. Ensure you have internet connectivity.

### "ffmpeg not found" Error

Install ffmpeg:

```bash
brew install ffmpeg
```

### Out of Memory

Use the 4-bit Dia model for dialogue:

```python
from audiobook_tts.models import DiaEngine
dia = DiaEngine(use_4bit=True)
```

### Slow Generation

- Ensure Python is running natively on Apple Silicon (not under Rosetta)
- Close other memory-intensive applications
- Use smaller chunk sizes: `--config` with `max_chunk_chars: 300`

## Directory Structure

```
tools/audiobook-tts/
├── src/audiobook_tts/
│   ├── __init__.py
│   ├── server.py          # FastAPI server
│   ├── cli.py             # Command-line interface
│   ├── config.py          # Configuration
│   ├── models/
│   │   ├── tts_engine.py  # TTS model wrappers (Kokoro, Dia, CSM, Chatterbox)
│   │   └── model_manager.py
│   ├── processing/
│   │   ├── text_processor.py   # Text chunking
│   │   ├── audio_processor.py  # Audio concatenation
│   │   └── manuscript.py       # Manuscript handling
│   └── api/
│       ├── routes.py      # API endpoints
│       └── schemas.py     # Pydantic models
├── scripts/
│   └── generate_voice_refs.py  # Generate Kokoro reference clips for Chatterbox
├── config/
│   ├── voices.yaml        # Voice configuration (Kokoro + Chatterbox profiles)
│   └── series.yaml        # Per-book POV-to-voice mappings
├── output/                # Generated audio files
├── cache/                 # Progress tracking
├── pyproject.toml
└── README.md
```

## Next Steps: Offline Audiobook Generation

### Step 1: Verify Prerequisites

```bash
cd tools/audiobook-tts

# Check all dependencies
uv run python -c "from audiobook_tts.utils import get_dependency_status; print(get_dependency_status())"

# Or manually check:
ffmpeg -version          # Should show version info
ffmpeg-normalize --help  # Should show help
```

### Step 2: Test with Sample Text

Before generating a full book, test with a short sample:

```bash
# Generate 30-second test
uv run python -c "
from audiobook_tts.models import KokoroEngine
from audiobook_tts.processing import AudioProcessor
from pathlib import Path

engine = KokoroEngine()
processor = AudioProcessor()

# Test generation
segments = list(engine.generate('This is a test of the audiobook generation system. The quick brown fox jumps over the lazy dog.', voice='af_bella'))
audio = processor.concatenate_segments([s.audio for s in segments])
processor.save_audio_raw(audio, Path('./output/test-sample.wav'))
print('Test audio saved to output/test-sample.wav')
"
```

### Step 3: Generate Audiobook for a Single Chapter

```bash
# Generate Chapter 1 of Book 1
uv run audiobook-generate \
  --input ../../compiled/1-resonance-and-reason-manuscript.md \
  --chapters 1 \
  --voice af_bella \
  --format mp3 \
  --output ./output/book1/
```

### Step 4: Generate Full Book with Series Configuration

The series configuration maps POV characters to specific voices:

```bash
# Generate Book 1 with POV-aware voice selection
uv run audiobook-generate \
  --input ../../compiled/1-resonance-and-reason-manuscript.md \
  --series-config ./config/series.yaml \
  --format mp3 \
  --output ./output/book1/

# Generate Book 2
uv run audiobook-generate \
  --input ../../compiled/2-vessels-and-vestments-manuscript.md \
  --series-config ./config/series.yaml \
  --format mp3 \
  --output ./output/book2/
```

### Step 5: Resume Interrupted Generation

Generation progress is saved automatically. To resume:

```bash
uv run audiobook-generate \
  --input ../../compiled/1-resonance-and-reason-manuscript.md \
  --series-config ./config/series.yaml \
  --resume
```
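The exact schema of `progress.json` is internal to the tool, but the idea is simple: resume skips chapters already recorded as complete. A sketch with a hypothetical `completed_chapters` field (not the actual schema):

```python
import json

def remaining_chapters(progress_text: str, total_chapters: int) -> list[int]:
    """Return chapter numbers not yet marked complete (hypothetical schema)."""
    done = set(json.loads(progress_text).get("completed_chapters", []))
    return [n for n in range(1, total_chapters + 1) if n not in done]

remaining_chapters('{"completed_chapters": [1, 2, 3]}', 5)  # → [4, 5]
```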

### Estimated Generation Times

| Book | Word Count | Estimated Audio | Generation Time* |
|------|------------|-----------------|------------------|
| Book 1: Resonance and Reason | 138,249 | ~15 hours | ~35 min |
| Book 2: Vessels and Vestments | 221,498 | ~24 hours | ~55 min |
| Book 4: Canon and Council | ~200,000 | ~22 hours | ~50 min |

*On Apple Silicon M4 Max with Kokoro. Add ~10-15 min for MP3 normalization.

### Series Configuration (config/series.yaml)

Edit `config/series.yaml` to customize voice assignments:

```yaml
name: "A Testament of Stone"
default_voice: "af_bella"  # Default narrator

books:
  resonance-and-reason:
    title: "Resonance and Reason"
    pov_voices:
      Kael: "am_michael"      # Male, intense
      Basileon: "bm_george"   # British male, imperial
      Tiberus: "am_adam"      # Male, military
      Vara: "bf_emma"         # British female, scholarly
      Jorin: "am_echo"        # Male, gentle
      Dessa: "af_bella"       # Female, warm
```

### Output Structure

Generated files are organized as:

```
output/
├── book1/
│   ├── chapter-01.mp3
│   ├── chapter-02.mp3
│   ├── ...
│   └── progress.json      # Resume tracking
├── book2/
│   └── ...
└── test-sample.wav        # Test files
```

---

## PDF Generation (MacTeX)

MacTeX installs its binaries outside the default `PATH`, so add them manually:

```bash
# Add to ~/.zshrc or ~/.bash_profile
export PATH="/Library/TeX/texbin:$PATH"

# Reload shell
source ~/.zshrc

# Verify
pdflatex --version
```

---

## License

MIT License - see LICENSE file.

The TTS models (Kokoro, Dia) are released under the Apache 2.0 license and are suitable for commercial use.
