Metadata-Version: 2.4
Name: modern-tts
Version: 0.1.6.post1
Summary: A unified, extensible, and modern Python toolkit for LLM-based Text-to-Speech (TTS) synthesis.
Project-URL: Homepage, https://github.com/vra/modern-tts
Project-URL: Repository, https://github.com/vra/modern-tts
Project-URL: Issues, https://github.com/vra/modern-tts/issues
Author: Modern TTS Contributors
License: Apache-2.0
Keywords: audio,chattts,f5-tts,glm-tts,index-tts,llm,melotts,moss-tts,piper-tts,qwen3-tts,redfire-tts,synthesis,text-to-speech,tts
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Multimedia :: Sound/Audio :: Speech
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: numpy>=1.24.0
Requires-Dist: pydantic>=2.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.0
Requires-Dist: typing-extensions>=4.8.0
Provides-Extra: all
Requires-Dist: accelerate>=0.25.0; extra == 'all'
Requires-Dist: conformer>=0.3.0; extra == 'all'
Requires-Dist: onnxruntime>=1.17.0; extra == 'all'
Requires-Dist: peft>=0.10.0; extra == 'all'
Requires-Dist: sentencepiece>=0.2.0; extra == 'all'
Requires-Dist: torch>=2.0; extra == 'all'
Requires-Dist: torchaudio>=2.0; extra == 'all'
Requires-Dist: transformers>=4.40.0; extra == 'all'
Requires-Dist: wetext>=0.1.0; extra == 'all'
Provides-Extra: all-backends
Requires-Dist: accelerate>=0.25.0; extra == 'all-backends'
Requires-Dist: torch>=2.0; extra == 'all-backends'
Requires-Dist: torchaudio>=2.0; extra == 'all-backends'
Requires-Dist: transformers>=4.40.0; extra == 'all-backends'
Provides-Extra: all-models
Requires-Dist: conformer>=0.3.0; extra == 'all-models'
Requires-Dist: onnxruntime>=1.17.0; extra == 'all-models'
Requires-Dist: peft>=0.10.0; extra == 'all-models'
Requires-Dist: sentencepiece>=0.2.0; extra == 'all-models'
Requires-Dist: torch>=2.0; extra == 'all-models'
Requires-Dist: torchaudio>=2.0; extra == 'all-models'
Requires-Dist: transformers>=4.40.0; extra == 'all-models'
Requires-Dist: wetext>=0.1.0; extra == 'all-models'
Provides-Extra: chattts
Requires-Dist: torch>=2.0; extra == 'chattts'
Requires-Dist: torchaudio>=2.0; extra == 'chattts'
Provides-Extra: f5
Requires-Dist: torch>=2.0; extra == 'f5'
Requires-Dist: torchaudio>=2.0; extra == 'f5'
Provides-Extra: glm
Requires-Dist: conformer>=0.3.0; extra == 'glm'
Requires-Dist: peft>=0.10.0; extra == 'glm'
Requires-Dist: sentencepiece>=0.2.0; extra == 'glm'
Requires-Dist: torch>=2.0; extra == 'glm'
Requires-Dist: torchaudio>=2.0; extra == 'glm'
Requires-Dist: transformers>=4.40.0; extra == 'glm'
Provides-Extra: index
Requires-Dist: torch>=2.0; extra == 'index'
Requires-Dist: torchaudio>=2.0; extra == 'index'
Requires-Dist: transformers>=4.40.0; extra == 'index'
Requires-Dist: wetext>=0.1.0; extra == 'index'
Provides-Extra: melotts
Requires-Dist: torch>=2.0; extra == 'melotts'
Requires-Dist: torchaudio>=2.0; extra == 'melotts'
Provides-Extra: moss
Requires-Dist: torch>=2.0; extra == 'moss'
Requires-Dist: torchaudio>=2.0; extra == 'moss'
Provides-Extra: piper
Requires-Dist: onnxruntime>=1.17.0; extra == 'piper'
Provides-Extra: qwen3-tts
Requires-Dist: torch>=2.0; extra == 'qwen3-tts'
Requires-Dist: torchaudio>=2.0; extra == 'qwen3-tts'
Requires-Dist: transformers>=4.40.0; extra == 'qwen3-tts'
Provides-Extra: redfire
Requires-Dist: torch>=2.0; extra == 'redfire'
Requires-Dist: torchaudio>=2.0; extra == 'redfire'
Provides-Extra: transformers
Requires-Dist: accelerate>=0.25.0; extra == 'transformers'
Requires-Dist: torch>=2.0; extra == 'transformers'
Requires-Dist: torchaudio>=2.0; extra == 'transformers'
Requires-Dist: transformers>=4.40.0; extra == 'transformers'
Description-Content-Type: text/markdown

# Modern TTS

A **unified, extensible, and future-proof** Python toolkit for locally running state-of-the-art LLM-based Text-to-Speech (TTS) synthesis models.

[![Python](https://img.shields.io/badge/python-3.10%2B-blue)](https://www.python.org/)
[![License](https://img.shields.io/badge/license-Apache%202.0-green)](LICENSE)

---

## ✨ Features

- 🧩 **10+ Models** — MeloTTS, ChatTTS, F5-TTS, Qwen3-TTS, GLM-TTS, Index-TTS, RedFire-TTS, MOSS-TTS, Piper-TTS, and more
- 🔌 **Plugin Architecture** — Add new models with `@register_model` decorator
- 🚀 **Hot-Swap** — Switch models at runtime without restarting
- 🌍 **Multi-Language** — Chinese, English, Japanese, Korean, Cantonese, and more
- 🎯 **Multi-Task** — Speech synthesis, voice cloning, emotion control, style transfer, streaming
- 💻 **Local-First** — All inference on-device. No APIs. No data leaves your machine.
- 🐍 **Modern Python** — uv-native packaging, Pydantic configs, rich CLI
- 📦 **Zero-Config for select models** — GLM-TTS, Index-TTS, and RedFire-TTS automatically download their official code repositories and weights on first use

---

## 📦 Installation

```bash
# Clone the repository
git clone https://github.com/vra/modern-tts.git
cd modern-tts

# Sync all dependencies (recommended)
uv sync --all-extras

# Or install specific extras only
uv sync --extra melotts --extra chattts --extra glm --extra index --extra redfire

# Or just core dependencies
uv sync
```

**Python 3.10+ required.** Some models (e.g. Index-TTS) require specific PyTorch / transformers versions — see per-model notes below.

---

## 🚀 Quick Start

```python
from modern_tts import TTSPipeline

# Synthesize with MeloTTS
pipe = TTSPipeline("melotts-zh")
result = pipe("你好世界，这是语音合成测试。")
result.save("output.wav")

# Switch to ChatTTS for emotional speech
pipe.switch_model("chattts")
result = pipe("这是一个带有情感的语音合成。")
result.save("output_emotion.wav")

# Voice cloning with F5-TTS
pipe.switch_model("f5-tts")
result = pipe("这是克隆的声音。", task="clone", reference_audio="reference.wav")
result.save("cloned.wav")

# Zero-config voice cloning with GLM-TTS (auto-downloads code)
pipe.switch_model("glm-tts")
result = pipe("你好，这是 GLM-TTS 的语音克隆测试。", task="clone", reference_audio="ref.wav")
result.save("glm_cloned.wav")

# Zero-config voice cloning with Index-TTS (auto-downloads code)
pipe.switch_model("index-tts")
result = pipe("你好，这是 Index-TTS 的语音克隆测试。", task="clone", reference_audio="ref.wav")
result.save("index_cloned.wav")
```

---

## 🎙️ Supported Models

### ✅ Ready to use (loadable out-of-the-box)

| Model ID | Type | Languages | Modes | Install Extra | Notes |
|---|---|---|---|---|---|
| `melotts-zh` | TTS | zh, en | speak, emotion | `--extra melotts` | Many text-processing deps (pypinyin, jieba, etc.) |
| `melotts-en` | TTS | zh, en | speak, emotion | `--extra melotts` | English variant |
| `chattts` | TTS | zh, en | speak, clone, emotion | `--extra chattts` | Emotional prosody control |
| `f5-tts` | ZS-VC | zh, en, ja, ko | speak, clone, emotion | `--extra f5` | Requires reference audio for synthesis |
| `glm-tts` | ZS-VC | zh, en | speak, clone | `--extra glm` | **Auto-downloads** official repo. Heavy deps (transformers, onnxruntime, peft). |
| `index-tts` | ZS-VC | zh, en, ja, ko, yue | speak, clone, emotion, style | `--extra index` | **Auto-downloads** official repo. Requires Python ≥ 3.10. |
| `moss-tts` | TTS | zh, en, ja, ko | speak, emotion | `--extra moss` | MOSS-TTS-Nano (0.1B), CPU-friendly |
| `piper-tts` | TTS | 15+ | speak | `--extra piper` | ONNX-based, edge-optimized |
| `qwen3-tts-0.6b` | ZS-VC | 11+ | speak, clone | `--extra qwen3-tts` | Requires `qwen-tts` package |
| `qwen3-tts-1.7b` | ZS-VC | 11+ | speak, clone | `--extra qwen3-tts` | Larger Qwen3-TTS variant |
| `redfire-tts` | ZS-VC | zh, en, yue | speak, clone, emotion | `--extra redfire` | **Auto-downloads** official repo & weights. fairseq needs C++ build headers |

> **ZS-VC** = Zero-Shot Voice Cloning (requires a `reference_audio` sample).

---

## 📋 Changelog & API Changes

### v0.1.6

#### New Supported Models
- **`redfire-tts`** — FireRedTTS-1S from Xiaohongshu / FireRedTeam. Auto-downloads repo and extracts 11GB pretrained weights on first use.

#### Removed Models
The following models have been removed due to upstream compatibility issues, unmaintained dependencies, or lack of working weights:
- `bertvits2-zh`, `bertvits2-en`, `bertvits2-jp`
- `cosyvoice-300m`, `cosyvoice-300m-sft`, `cosyvoice-300m-instruct`
- `fishspeech-1.5`
- `gptsovits`
- `maskgct`
- `parler-tts-mini`, `parler-tts-large`
- `pocket-tts`
- `xtts-v1`, `xtts-v2`, `xtts-v2.1`

#### Fixes
- **index-tts**: Runtime-patched vendored transformers files for compatibility with `transformers >= 4.57`
- **redfire-tts**: Patched `fairseq` dataclasses for Python 3.13; fixed `torch.load(weights_only)`; patched `LogitsWarper→LogitsProcessor` for new transformers
- **glm-tts**: Added CPU fallback when CUDA driver is incompatible; patched `tn→wetext` normalizer for Python 3.13
- **qwen3-tts-0.6b**: Fixed hardcoded local path to use HuggingFace Hub ID

### v0.1.5 and earlier

#### New Models
- **`glm-tts`** — LLM + Flow Matching zero-shot TTS (Zhipu AI). Merged previous `glm-tts-nano-2512` and `glm-tts-2512` into a single `glm-tts` model ID.
- **`index-tts`** — Industrial-level multilingual zero-shot voice cloning (IndexTeam).

#### Zero-Config Auto-Download
- **GLM-TTS** and **Index-TTS** no longer require manual environment variables (`GLM_TTS_REPO_PATH`, `INDEX_TTS_REPO_PATH`) or `PYTHONPATH` manipulation.
- On first use, the framework automatically:
  1. Clones the official repository to `~/.cache/modern-tts/repos/`
  2. Injects the path into `sys.path`
  3. Proceeds with model loading
- You can still override the auto-download path via `config.extra["glm_tts_repo_path"]` / `config.extra["index_tts_repo_path"]` or the corresponding environment variables.
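As a concrete illustration of the environment-variable override mentioned above, the snippet below points GLM-TTS at a pre-cloned repository so the framework skips the auto-download step. The variable name comes from the docs above; the path is a placeholder.

```python
import os

# Point GLM-TTS at a pre-cloned repository instead of the auto-download
# location under ~/.cache/modern-tts/repos/ (path is a placeholder).
os.environ["GLM_TTS_REPO_PATH"] = "/opt/repos/GLM-TTS"
```

Set the variable before the model is first loaded; the `config.extra` override works the same way but is scoped to a single pipeline instance.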

#### New Infrastructure Modules
- **`modern_tts.core.hf_hub`** — HuggingFace Hub download helpers (`download_hf_model`, `get_hf_model_path`) so custom-code adapters don't re-implement caching logic.
- **`modern_tts.core.repo_manager`** — Generic git repository auto-downloader (`ensure_repo`, `inject_repo_path`) used by adapters that depend on upstream code not on PyPI.
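The `repo_manager` pattern is straightforward to sketch with the standard library. The sketch below is an assumed simplification — the real module's signatures may differ — but it shows the clone-if-missing / inject-into-`sys.path` flow the adapters rely on:

```python
import subprocess
import sys
from pathlib import Path


def ensure_repo(url: str, cache_dir: Path) -> Path:
    """Clone `url` into `cache_dir` if not already present; return the local path."""
    name = url.rstrip("/").split("/")[-1].removesuffix(".git")
    target = cache_dir / name
    if not target.exists():
        subprocess.run(
            ["git", "clone", "--depth", "1", url, str(target)], check=True
        )
    return target


def inject_repo_path(repo_path: Path) -> None:
    """Prepend the repo to sys.path so its modules are importable."""
    p = str(repo_path)
    if p not in sys.path:
        sys.path.insert(0, p)
```

An adapter would call `ensure_repo(...)` followed by `inject_repo_path(...)` inside its `load()` before importing the upstream code.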

#### Base Class Improvements
- `TextToAudioLLMModel.load()` now raises a clear `NotImplementedError` when a subclass has not set `PROCESSOR_CLS` / `MODEL_CLS`, signaling that the subclass must override `load()` for custom loading logic.

#### Model ID Changes
| Old ID | New ID | Note |
|---|---|---|
| `glm-tts-nano-2512` | `glm-tts` | Merged into unified `glm-tts` |
| `glm-tts-2512` | `glm-tts` | Merged into unified `glm-tts` |

---

## 🏗️ Architecture

Modern TTS is built on three layers:

1. **TTSPipeline** — Unified user API. Handles text normalization, task dispatch, model lifecycle.
2. **TTSModel / TextToAudioLLMModel** — Adapter layer. New models often need only **a few lines of class-level config** via `TextToAudioLLMModel`.
3. **Backends** — Transformers, vLLM, ONNX Runtime.

### Adding a New Model

```python
from modern_tts.core.audio_llm import TextToAudioLLMModel
from modern_tts.core.registry import register_model

@register_model("my-tts-1b")
class MyTTS1B(TextToAudioLLMModel):
    HF_PATH = "org/MyTTS-1B"
    PROCESSOR_CLS = "transformers.AutoTokenizer"
    MODEL_CLS = "transformers.AutoModelForTextToWaveform"
    SUPPORTED_LANGUAGES = {"zh", "en"}
    DEFAULT_SAMPLE_RATE = 24000

    @property
    def model_id(self) -> str:
        return "my-tts-1b"
```

That's it. The registry auto-discovers it at runtime.

---

## 🤝 Contributing

See [Contributing Guide](docs/contributing.md) for development setup, code style, and PR checklist.

---

## 📄 License

Apache-2.0
