Metadata-Version: 2.4
Name: modern-tts
Version: 0.1.1
Summary: A unified, extensible, and modern Python toolkit for LLM-based Text-to-Speech (TTS) synthesis.
Project-URL: Homepage, https://github.com/vra/modern-tts
Project-URL: Repository, https://github.com/vra/modern-tts
Project-URL: Issues, https://github.com/vra/modern-tts/issues
Author: Modern TTS Contributors
License: Apache-2.0
Keywords: audio,bertvits2,chattts,cosyvoice,f5-tts,fish-speech,glm-tts,index-tts,llm,melotts,moss-tts,parler-tts,piper-tts,pocket-tts,qwen3-tts,redfire-tts,synthesis,text-to-speech,tts,xtts
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Multimedia :: Sound/Audio :: Speech
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: numpy>=1.24.0
Requires-Dist: pydantic>=2.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.0
Requires-Dist: typing-extensions>=4.8.0
Provides-Extra: all
Requires-Dist: accelerate>=0.25.0; extra == 'all'
Requires-Dist: numpy>=1.24.0; extra == 'all'
Requires-Dist: onnxruntime>=1.17.0; extra == 'all'
Requires-Dist: sentencepiece>=0.2.0; extra == 'all'
Requires-Dist: torch>=2.0; extra == 'all'
Requires-Dist: torchaudio>=2.0; extra == 'all'
Requires-Dist: transformers>=4.40.0; extra == 'all'
Provides-Extra: all-backends
Requires-Dist: accelerate>=0.25.0; extra == 'all-backends'
Requires-Dist: torch>=2.0; extra == 'all-backends'
Requires-Dist: torchaudio>=2.0; extra == 'all-backends'
Requires-Dist: transformers>=4.40.0; extra == 'all-backends'
Provides-Extra: all-models
Requires-Dist: numpy>=1.24.0; extra == 'all-models'
Requires-Dist: onnxruntime>=1.17.0; extra == 'all-models'
Requires-Dist: sentencepiece>=0.2.0; extra == 'all-models'
Requires-Dist: torch>=2.0; extra == 'all-models'
Requires-Dist: torchaudio>=2.0; extra == 'all-models'
Requires-Dist: transformers>=4.40.0; extra == 'all-models'
Provides-Extra: bertvits2
Requires-Dist: torch>=2.0; extra == 'bertvits2'
Requires-Dist: torchaudio>=2.0; extra == 'bertvits2'
Provides-Extra: chattts
Requires-Dist: torch>=2.0; extra == 'chattts'
Requires-Dist: torchaudio>=2.0; extra == 'chattts'
Provides-Extra: cosyvoice
Requires-Dist: torch>=2.0; extra == 'cosyvoice'
Requires-Dist: torchaudio>=2.0; extra == 'cosyvoice'
Provides-Extra: f5
Requires-Dist: torch>=2.0; extra == 'f5'
Requires-Dist: torchaudio>=2.0; extra == 'f5'
Provides-Extra: fishspeech
Requires-Dist: torch>=2.0; extra == 'fishspeech'
Requires-Dist: torchaudio>=2.0; extra == 'fishspeech'
Provides-Extra: glm
Requires-Dist: sentencepiece>=0.2.0; extra == 'glm'
Requires-Dist: torch>=2.0; extra == 'glm'
Requires-Dist: torchaudio>=2.0; extra == 'glm'
Requires-Dist: transformers>=4.40.0; extra == 'glm'
Provides-Extra: gptsovits
Requires-Dist: torch>=2.0; extra == 'gptsovits'
Requires-Dist: torchaudio>=2.0; extra == 'gptsovits'
Provides-Extra: index
Requires-Dist: torch>=2.0; extra == 'index'
Requires-Dist: torchaudio>=2.0; extra == 'index'
Requires-Dist: transformers>=4.40.0; extra == 'index'
Provides-Extra: maskgct
Requires-Dist: torch>=2.0; extra == 'maskgct'
Requires-Dist: torchaudio>=2.0; extra == 'maskgct'
Requires-Dist: transformers>=4.40.0; extra == 'maskgct'
Provides-Extra: melotts
Requires-Dist: torch>=2.0; extra == 'melotts'
Requires-Dist: torchaudio>=2.0; extra == 'melotts'
Provides-Extra: moss
Requires-Dist: torch>=2.0; extra == 'moss'
Requires-Dist: torchaudio>=2.0; extra == 'moss'
Provides-Extra: parler
Requires-Dist: torch>=2.0; extra == 'parler'
Requires-Dist: torchaudio>=2.0; extra == 'parler'
Requires-Dist: transformers>=4.40.0; extra == 'parler'
Provides-Extra: piper
Requires-Dist: onnxruntime>=1.17.0; extra == 'piper'
Provides-Extra: pocket
Requires-Dist: numpy>=1.24.0; extra == 'pocket'
Provides-Extra: qwen3-tts
Requires-Dist: torch>=2.0; extra == 'qwen3-tts'
Requires-Dist: torchaudio>=2.0; extra == 'qwen3-tts'
Requires-Dist: transformers>=4.40.0; extra == 'qwen3-tts'
Provides-Extra: redfire
Requires-Dist: torch>=2.0; extra == 'redfire'
Requires-Dist: torchaudio>=2.0; extra == 'redfire'
Provides-Extra: transformers
Requires-Dist: accelerate>=0.25.0; extra == 'transformers'
Requires-Dist: torch>=2.0; extra == 'transformers'
Requires-Dist: torchaudio>=2.0; extra == 'transformers'
Requires-Dist: transformers>=4.40.0; extra == 'transformers'
Provides-Extra: xtts
Requires-Dist: torch>=2.0; extra == 'xtts'
Requires-Dist: torchaudio>=2.0; extra == 'xtts'
Description-Content-Type: text/markdown

# Modern TTS

A **unified, extensible, and future-proof** Python toolkit for locally running state-of-the-art LLM-based Text-to-Speech (TTS) synthesis models.

[![Python](https://img.shields.io/badge/python-3.10%2B-blue)](https://www.python.org/)
[![License](https://img.shields.io/badge/license-Apache%202.0-green)](LICENSE)

---

## ✨ Features

- 🧩 **25+ Models** — MeloTTS, ChatTTS, CosyVoice, Fish Speech, Parler-TTS, XTTS, GPT-SoVITS, F5-TTS, Qwen3-TTS, GLM-TTS, Index-TTS, MaskGCT, and more
- 🔌 **Plugin Architecture** — Add new models with `@register_model` decorator
- 🚀 **Hot-Swap** — Switch models at runtime without restarting
- 🌍 **Multi-Language** — Chinese, English, Japanese, Korean, and more
- 🎯 **Multi-Task** — Speech synthesis, voice cloning, emotion control, style transfer, streaming
- 💻 **Local-First** — All inference on-device. No APIs. No data leaves your machine.
- 🐍 **Modern Python** — uv-native packaging, Pydantic configs, rich CLI
- 📦 **Zero-Config for select models** — GLM-TTS and Index-TTS automatically download their official code repositories on first use

---

## 📦 Installation

```bash
# Clone the repository
git clone https://github.com/vra/modern-tts.git
cd modern-tts

# Sync all dependencies (recommended)
uv sync --all-extras

# Or install specific extras only
uv sync --extra melotts --extra chattts --extra glm --extra index

# Or just core dependencies
uv sync
```

**Python 3.10+ is required.** Some models (e.g. Index-TTS) need specific PyTorch / transformers versions; see the per-model notes below.

---

## 🚀 Quick Start

```python
from modern_tts import TTSPipeline

# Synthesize with MeloTTS
pipe = TTSPipeline("melotts-zh")
result = pipe("你好世界，这是语音合成测试。")
result.save("output.wav")

# Switch to ChatTTS for emotional speech
pipe.switch_model("chattts")
result = pipe("这是一个带有情感的语音合成。")
result.save("output_emotion.wav")

# Voice cloning with CosyVoice
pipe.switch_model("cosyvoice-300m")
result = pipe("这是克隆的声音。", task="clone", reference_audio="reference.wav")
result.save("cloned.wav")

# Zero-config voice cloning with GLM-TTS (auto-downloads code)
pipe.switch_model("glm-tts")
result = pipe("你好，这是 GLM-TTS 的语音克隆测试。", task="clone", reference_audio="ref.wav")
result.save("glm_cloned.wav")

# Zero-config voice cloning with Index-TTS (auto-downloads code)
pipe.switch_model("index-tts")
result = pipe("你好，这是 Index-TTS 的语音克隆测试。", task="clone", reference_audio="ref.wav")
result.save("index_cloned.wav")
```

---

## 🎙️ Supported Models

### ✅ Ready to use (loadable out-of-the-box)

| Model ID | Type | Languages | Modes | Install Extra | Notes |
|---|---|---|---|---|---|
| `melotts-zh` | TTS | zh, en | speak, emotion | `--extra melotts` | Many text-processing deps (pypinyin, jieba, etc.) |
| `melotts-en` | TTS | zh, en | speak, emotion | `--extra melotts` | English variant |
| `chattts` | TTS | zh, en | speak, clone, emotion | `--extra chattts` | Emotional prosody control |
| `f5-tts` | ZS-VC | zh, en, ja, ko | speak, clone, emotion | `--extra f5` | Requires reference audio for synthesis |
| `glm-tts` | ZS-VC | zh, en | speak, clone | `--extra glm` | **Auto-downloads** official repo. Heavy deps (transformers, onnxruntime, peft). |
| `index-tts` | ZS-VC | zh, en, ja, ko, yue | speak, clone, emotion, style | `--extra index` | **Auto-downloads** official repo. Requires Python ≥ 3.10. |
| `moss-tts` | TTS | zh, en, ja, ko | speak, emotion | `--extra moss` | MOSS-TTS-Nano (0.1B), CPU-friendly |
| `piper-tts` | TTS | 15+ | speak | `--extra piper` | ONNX-based, edge-optimized |
| `qwen3-tts-0.6b` | ZS-VC | 11+ | speak, clone | `--extra qwen3-tts` | Requires `qwen-tts` package |
| `qwen3-tts-1.7b` | ZS-VC | 11+ | speak, clone | `--extra qwen3-tts` | Larger Qwen3-TTS variant |
| `xtts-v1` | ZS-VC | 13+ | speak, clone | `--extra xtts` | Requires `coqui-tts` |
| `xtts-v2` | ZS-VC | 13+ | speak, clone | `--extra xtts` | Adds Chinese support |
| `xtts-v2.1` | ZS-VC | 13+ | speak, clone, streaming | `--extra xtts` | Adds streaming mode |

> **ZS-VC** = Zero-Shot Voice Cloning (requires a `reference_audio` sample).

### ⚠️ Requires manual setup

These models require manually cloning their official repositories and/or downloading weights before use. Calling `load()` raises a `RuntimeError` with setup instructions.

| Model ID | Type | Languages | Modes | Install Extra | Setup Notes |
|---|---|---|---|---|---|
| `bertvits2-zh` | TTS | zh, en | speak, emotion | `--extra bertvits2` | Clone repo + download weights |
| `bertvits2-en` | TTS | en | speak, emotion | `--extra bertvits2` | Clone repo + download weights |
| `bertvits2-jp` | TTS | ja, en | speak, emotion | `--extra bertvits2` | Clone repo + download weights |
| `cosyvoice-300m` | ZS-VC | zh, en, yue, ja, ko | speak, clone, emotion, style | `--extra cosyvoice` | Clone repo + download weights |
| `cosyvoice-300m-sft` | ZS-VC | zh, en, yue, ja, ko | speak, clone, emotion, style | `--extra cosyvoice` | SFT variant |
| `cosyvoice-300m-instruct` | ZS-VC | zh, en, yue, ja, ko | speak, clone, emotion, style | `--extra cosyvoice` | Instruct variant |
| `fishspeech-1.5` | ZS-VC | zh, en, ja, ko | speak, clone, emotion | `--extra fishspeech` | Clone repo + weights; pyaudio needs system headers |
| `gptsovits` | ZS-VC | zh, en, ja, yue | speak, clone | `--extra gptsovits` | Clone repo + download weights |
| `redfire-tts` | ZS-VC | zh, en, yue | speak, clone, emotion | `--extra redfire` | fairseq needs C++ build headers |
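
The fail-fast `RuntimeError` contract for these models can be illustrated with a standalone sketch. The adapter class, path, and message below are hypothetical, not the package's actual code:

```python
from pathlib import Path


class ManualSetupAdapter:
    """Hypothetical adapter for a model whose upstream code must be cloned by hand."""

    def __init__(self, repo_path: str):
        self.repo_path = Path(repo_path)

    def load(self) -> None:
        # Fail early with actionable setup instructions instead of a
        # cryptic ImportError deep inside the upstream code.
        if not self.repo_path.exists():
            raise RuntimeError(
                f"Repository not found at {self.repo_path}. "
                "Clone the model's official repo and download its weights first."
            )
        # Real weight/model loading would happen here.
```

Surfacing the error at `load()` time keeps the failure close to the fix: the message tells you exactly which repo and weights are missing.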

### ❌ Temporarily unavailable

| Model ID | Reason |
|---|---|
| `maskgct` | Custom tokenizer incompatible with generic `TextToAudioLLMModel` loader |
| `parler-tts-mini` | `parler-tts` package incompatible with `transformers >= 4.50` |
| `parler-tts-large` | Same compatibility issue as `parler-tts-mini` |
| `pocket-tts` | No public repository or weights found (reserved for future implementation) |

---

## 📋 Changelog & API Changes

### Latest

#### New Models
- **`glm-tts`** — LLM + Flow Matching zero-shot TTS (Zhipu AI). Merged previous `glm-tts-nano-2512` and `glm-tts-2512` into a single `glm-tts` model ID.
- **`index-tts`** — Industrial-level multilingual zero-shot voice cloning (IndexTeam).

#### Zero-Config Auto-Download
- **GLM-TTS** and **Index-TTS** no longer require manual environment variables (`GLM_TTS_REPO_PATH`, `INDEX_TTS_REPO_PATH`) or `PYTHONPATH` manipulation.
- On first use, the framework automatically:
  1. Clones the official repository to `~/.cache/modern-tts/repos/`
  2. Injects the path into `sys.path`
  3. Proceeds with model loading
- You can still override the auto-download path via `config.extra["glm_tts_repo_path"]` / `config.extra["index_tts_repo_path"]` or the corresponding environment variables.

#### New Infrastructure Modules
- **`modern_tts.core.hf_hub`** — HuggingFace Hub download helpers (`download_hf_model`, `get_hf_model_path`) so custom-code adapters don't re-implement caching logic.
- **`modern_tts.core.repo_manager`** — Generic git repository auto-downloader (`ensure_repo`, `inject_repo_path`) used by adapters that depend on upstream code not on PyPI.
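
As a rough illustration of the clone-then-inject flow that `ensure_repo` and `inject_repo_path` provide, here is a standalone sketch. It mirrors the behavior described above but is not the library's actual implementation:

```python
import subprocess
import sys
from pathlib import Path

CACHE_DIR = Path.home() / ".cache" / "modern-tts" / "repos"


def ensure_repo(name: str, url: str, cache_dir: Path = CACHE_DIR) -> Path:
    """Clone `url` into the cache on first use; later calls reuse the checkout."""
    dest = cache_dir / name
    if not (dest / ".git").exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(
            ["git", "clone", "--depth", "1", url, str(dest)], check=True
        )
    return dest


def inject_repo_path(repo_path: Path) -> None:
    """Make the upstream code importable without PYTHONPATH tweaks."""
    p = str(repo_path)
    if p not in sys.path:
        sys.path.insert(0, p)
```

Caching under `~/.cache/modern-tts/repos/` means the clone happens once per machine, and the `sys.path` injection is idempotent, so repeated model loads stay cheap.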

#### Base Class Improvements
- `TextToAudioLLMModel.load()` now raises a clear `NotImplementedError` when a subclass has not set `PROCESSOR_CLS` / `MODEL_CLS`, signaling that the subclass must override `load()` for custom loading logic.

#### Model ID Changes
| Old ID | New ID | Note |
|---|---|---|
| `glm-tts-nano-2512` | `glm-tts` | Merged into unified `glm-tts` |
| `glm-tts-2512` | `glm-tts` | Merged into unified `glm-tts` |

---

## 🏗️ Architecture

Modern TTS is built on three layers:

1. **TTSPipeline** — Unified user API. Handles text normalization, task dispatch, model lifecycle.
2. **TTSModel / TextToAudioLLMModel** — Adapter layer. New models often need only **8 lines of config** via `TextToAudioLLMModel`.
3. **Backends** — Transformers, vLLM, ONNX Runtime.

### Adding a New Model

```python
from modern_tts.core.audio_llm import TextToAudioLLMModel
from modern_tts.core.registry import register_model

@register_model("my-tts-1b")
class MyTTS1B(TextToAudioLLMModel):
    HF_PATH = "org/MyTTS-1B"
    PROCESSOR_CLS = "transformers.AutoTokenizer"
    MODEL_CLS = "transformers.AutoModelForTextToWaveform"
    SUPPORTED_LANGUAGES = {"zh", "en"}
    DEFAULT_SAMPLE_RATE = 24000

    @property
    def model_id(self) -> str:
        return "my-tts-1b"
```

That's it. The registry auto-discovers it at runtime.
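
Under the hood, a decorator-based registry of this kind can be very small. The sketch below is a simplified illustration of the pattern, not the package's actual source:

```python
_REGISTRY: dict[str, type] = {}


def register_model(model_id: str):
    """Class decorator: record the adapter class under `model_id`."""

    def wrap(cls: type) -> type:
        _REGISTRY[model_id] = cls
        return cls  # return the class unchanged so normal use is unaffected

    return wrap


def get_model_cls(model_id: str) -> type:
    """Look up a registered adapter; fail loudly on unknown IDs."""
    try:
        return _REGISTRY[model_id]
    except KeyError:
        raise KeyError(f"Unknown model id: {model_id!r}") from None


@register_model("demo-tts")
class DemoTTS:
    pass
```

Because registration happens at class-definition time, importing a model module is all it takes for the pipeline to discover it.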

---

## 🤝 Contributing

See [Contributing Guide](docs/contributing.md) for development setup, code style, and PR checklist.

---

## 📄 License

Apache-2.0
