Metadata-Version: 2.4
Name: indextts-onnx
Version: 0.3.0
Summary: IndexTTS2 voice cloning via ONNX Runtime — no PyTorch at inference time
Author: vra
License-Expression: MIT
Requires-Dist: onnxruntime>=1.17
Requires-Dist: numpy>=1.24
Requires-Dist: librosa>=0.10
Requires-Dist: sentencepiece>=0.2
Requires-Dist: soundfile>=0.12
Requires-Dist: huggingface-hub>=0.20
Requires-Dist: torch>=2.1 ; extra == 'export'
Requires-Dist: torchaudio>=2.1 ; extra == 'export'
Requires-Dist: transformers>=4.40 ; extra == 'export'
Requires-Dist: safetensors>=0.4 ; extra == 'export'
Requires-Dist: omegaconf>=2.3 ; extra == 'export'
Requires-Dist: onnx>=1.15 ; extra == 'export'
Requires-Dist: onnxruntime>=1.17 ; extra == 'export'
Requires-Dist: modelscope>=1.10 ; extra == 'modelscope'
Requires-Python: >=3.11
Project-URL: Homepage, https://github.com/vra/indextts-onnx
Project-URL: Repository, https://github.com/vra/indextts-onnx
Provides-Extra: export
Provides-Extra: modelscope
Description-Content-Type: text/markdown

# indextts-onnx

[IndexTTS2](https://github.com/index-tts/index-tts) voice cloning via ONNX Runtime. No PyTorch at inference time.

Models are automatically downloaded from [HuggingFace](https://huggingface.co/vra/indextts-onnx) on first use.

## One-liner

```bash
uvx indextts-onnx --ref-audio reference.wav --text "你好，欢迎使用语音克隆系统。" --output output.wav
```

## Python API

```python
from indextts_onnx import IndexTTSInfer

tts = IndexTTSInfer("~/.cache/indextts-onnx/models")
tts.infer("reference.wav", "你好，欢迎使用语音克隆系统。", "output.wav")
```
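
To synthesize several sentences with the same reference voice, a thin loop around `infer` may be enough. The sketch below is an illustration, not part of the package: `clone_many` and the `utt_000.wav` naming scheme are hypothetical, and it assumes only the positional `infer(ref_audio, text, output)` call shown above.

```python
from pathlib import Path


def clone_many(tts, ref_audio, texts, out_dir):
    """Synthesize each text with the same reference audio.

    `tts` is any object exposing infer(ref_audio, text, output_path),
    as in the example above. Returns the list of output paths.
    """
    # Expand "~" ourselves and make sure the output directory exists.
    out_dir = Path(out_dir).expanduser()
    out_dir.mkdir(parents=True, exist_ok=True)

    outputs = []
    for i, text in enumerate(texts):
        out_path = out_dir / f"utt_{i:03d}.wav"  # hypothetical naming scheme
        tts.infer(str(ref_audio), text, str(out_path))
        outputs.append(out_path)
    return outputs
```

With the real class this would be called as `clone_many(IndexTTSInfer(model_dir), "reference.wav", texts, "outputs")`. Whether the constructor expands `~` in `model_dir` is not documented here, so passing an already expanded path (for example via `Path(...).expanduser()`) is the cautious choice.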

## Install

```bash
pip install indextts-onnx
```

## CLI

```bash
indextts-onnx --ref-audio ref.wav --text "Your text" --output out.wav
```

Options:
- `--model-dir`: Custom model directory (default: auto-download to `~/.cache/indextts-onnx/models`)
- `--threads`: CPU threads (default: auto)
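
The CLI can also be driven from a script. The following is a minimal sketch that builds the argument list from the flags listed above and runs it with `subprocess`. The helper names `build_command` and `run_cli` are hypothetical, and it assumes `--threads` accepts an integer value.

```python
import subprocess


def build_command(ref_audio, text, output, model_dir=None, threads=None):
    """Build the indextts-onnx argument list from the documented flags."""
    cmd = [
        "indextts-onnx",
        "--ref-audio", str(ref_audio),
        "--text", text,
        "--output", str(output),
    ]
    # Optional flags are only added when set, so the CLI defaults apply otherwise.
    if model_dir is not None:
        cmd += ["--model-dir", str(model_dir)]
    if threads is not None:
        cmd += ["--threads", str(threads)]  # assumption: integer thread count
    return cmd


def run_cli(ref_audio, text, output, model_dir=None, threads=None):
    """Run the CLI and raise CalledProcessError if it exits with an error."""
    subprocess.run(
        build_command(ref_audio, text, output, model_dir, threads),
        check=True,
    )
```

Passing the arguments as a list avoids shell quoting problems with text that contains spaces or non-ASCII characters.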

## Architecture

IndexTTS2 inference involves 10 ONNX models (~3.3GB in total). This package runs them with ONNX Runtime and numpy, so PyTorch is not needed at inference time:

| Model | Format | Size | Purpose |
|-------|--------|------|---------|
| wav2vec2bert | int8 | 396M | Semantic feature extraction |
| semantic_codec | int8 | 22M | Quantize semantic embeddings |
| campplus | int8 | 8M | Speaker style embedding |
| gpt2_init | int8 | 837M | First autoregressive step |
| gpt2_step | int8 | 476M | Per-token generation step |
| gpt2_forward | int8 | 826M | Forward pass for latent |
| s2mel_ref | int8 | 4M | Reference length regulation |
| s2mel_gen | int8 | 4M | Generation preprocessing |
| dit_step | fp32 | 363M | DiT flow matching step |
| bigvgan | fp32 | 429M | Neural vocoder |

The GPT2 models use int8 quantization, which makes them about 3x faster. DiT and BigVGAN stay in fp32 because int8 degrades quality for flow matching.
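
As a quick check of the download size, the table can be summed per format. This sketch copies the sizes from the table and assumes `M` means mebibytes, which is not stated in the source.

```python
# (format, size in M) per model, copied from the table above.
MODELS = {
    "wav2vec2bert": ("int8", 396),
    "semantic_codec": ("int8", 22),
    "campplus": ("int8", 8),
    "gpt2_init": ("int8", 837),
    "gpt2_step": ("int8", 476),
    "gpt2_forward": ("int8", 826),
    "s2mel_ref": ("int8", 4),
    "s2mel_gen": ("int8", 4),
    "dit_step": ("fp32", 363),
    "bigvgan": ("fp32", 429),
}

# Total size, and a breakdown by numeric format.
total_mb = sum(size for _, size in MODELS.values())
by_format = {}
for fmt, size in MODELS.values():
    by_format[fmt] = by_format.get(fmt, 0) + size

print(f"total: {total_mb} M (~{total_mb / 1024:.1f} G)")
for fmt, size in by_format.items():
    print(f"{fmt}: {size} M")
```

The three GPT2 models account for 2139M of the total, so they dominate both the download and the int8 portion.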

## License

MIT
