Metadata-Version: 2.4
Name: moss-tts-local-onnx-cli
Version: 0.1.0
Summary: MOSS-TTS-Local-Transformer-v1.5 voice cloning via ONNX Runtime
Author: vra
License-Expression: MIT
Requires-Dist: numpy>=1.24
Requires-Dist: onnxruntime>=1.17
Requires-Dist: torch>=2.1
Requires-Dist: torchaudio>=2.1
Requires-Dist: transformers>=4.40
Requires-Dist: huggingface-hub>=0.20
Requires-Dist: modelscope>=1.10 ; extra == 'modelscope'
Requires-Python: >=3.10
Project-URL: Homepage, https://github.com/vra/moss-tts-local-onnx-cli
Project-URL: Repository, https://github.com/vra/moss-tts-local-onnx-cli
Provides-Extra: modelscope
Description-Content-Type: text/markdown

# moss-tts-local-onnx-cli

Voice cloning with MOSS-TTS-Local-Transformer-v1.5, running on ONNX Runtime. The ONNX models are converted from the [original PyTorch model](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5) (4.13B parameters; Qwen3 global transformer plus GPT2 local transformer).

## Quick Start

Run it in one line with `uvx` (the first run downloads ~31GB of model files):

```bash
uvx moss-tts-local-onnx-cli --text "你好，这是一段测试语音合成。" --reference ref.wav --output output.wav
```

## Installation

```bash
# Install with uv
uv pip install moss-tts-local-onnx-cli

# Or with pip
pip install moss-tts-local-onnx-cli

# With ModelScope download support (for users in mainland China)
pip install "moss-tts-local-onnx-cli[modelscope]"
```

## CLI Usage

```bash
# Basic voice cloning
moss-tts-local-onnx-cli --text "你好" --reference speaker.wav --output output.wav

# With language hint
moss-tts-local-onnx-cli --text "Hello world" --reference speaker.wav --language English --output output.wav

# Download from ModelScope (China)
moss-tts-local-onnx-cli --text "你好" --reference speaker.wav --source modelscope --output output.wav

# Custom model directory
moss-tts-local-onnx-cli --text "你好" --reference speaker.wav --model-dir /path/to/models --output output.wav
```
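
The same flags can be driven from a script. The sketch below builds the argument list for one invocation using only the options shown above; `build_command` and `synthesize` are hypothetical helper names, not part of the package, and each call starts a separate CLI process.

```python
import subprocess


def build_command(text, reference, output, language=None, source=None, model_dir=None):
    """Build the argv list for one moss-tts-local-onnx-cli invocation.

    Hypothetical helper: only the flags documented in this README are used.
    """
    cmd = [
        "moss-tts-local-onnx-cli",
        "--text", text,
        "--reference", reference,
        "--output", output,
    ]
    # Optional flags are appended only when a value is given.
    if language is not None:
        cmd += ["--language", language]
    if source is not None:
        cmd += ["--source", source]
    if model_dir is not None:
        cmd += ["--model-dir", model_dir]
    return cmd


def synthesize(text, reference, output, **options):
    """Run the CLI once; check=True raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_command(text, reference, output, **options), check=True)
```

For example, `synthesize("Hello world", "speaker.wav", "output.wav", language="English")` mirrors the language-hint command above.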

## Python API

```python
from moss_tts_local_onnx_cli import MossTTS

# Initialize once (auto-downloads models)
tts = MossTTS()

# Generate multiple times without reloading
audio = tts.generate("你好，这是第一段。", reference="ref.wav", language="Chinese")
tts.save(audio, "output1.wav")

audio2 = tts.generate("这是第二段。", reference="ref.wav")
tts.save(audio2, "output2.wav")

# Or generate and save in one call
tts.generate("你好", reference="ref.wav", output="output3.wav")
```
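
Because the models are loaded once, batch synthesis is a natural fit for the Python API. The following is a minimal sketch, assuming `generate` accepts `reference`, `language`, and `output` together (the examples above show them only in pairs); `synthesize_batch` and the `segment_NNN.wav` naming are hypothetical.

```python
from pathlib import Path


def synthesize_batch(tts, texts, reference, out_dir, language=None):
    """Synthesize each text to its own WAV file, reusing one loaded tts object.

    Hypothetical helper: `tts` is expected to behave like the MossTTS
    instance shown above (a `generate(text, reference=..., output=...)` method).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, text in enumerate(texts, start=1):
        path = out_dir / f"segment_{index:03d}.wav"
        kwargs = {"reference": reference, "output": str(path)}
        # Pass the language hint only when the caller supplied one.
        if language is not None:
            kwargs["language"] = language
        tts.generate(text, **kwargs)
        paths.append(path)
    return paths
```

Usage would look like `synthesize_batch(MossTTS(), ["你好，这是第一段。", "这是第二段。"], "ref.wav", "out", language="Chinese")`.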

## Audio Codec

The audio codec ([MOSS-Audio-Tokenizer-v2](https://www.modelscope.cn/models/OpenMOSS/MOSS-Audio-Tokenizer-v2)) is required to decode audio tokens. It is auto-detected at `~/.cache/modelscope/hub/models/OpenMOSS/MOSS-Audio-Tokenizer-v2`.

To download it:

```bash
pip install modelscope
python -c "from modelscope import snapshot_download; snapshot_download('OpenMOSS/MOSS-Audio-Tokenizer-v2')"
```

Or specify a custom path:

```python
tts = MossTTS(codec_path="/path/to/MOSS-Audio-Tokenizer-v2")
```
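
To see which location would apply before loading the models, the lookup can be approximated with `pathlib`. This is only a sketch of the idea, not the package's actual detection logic; `resolve_codec_path` is a hypothetical helper that prefers a custom path and falls back to the default cache directory named above.

```python
from pathlib import Path

# Default location quoted in this README.
DEFAULT_CODEC_DIR = (
    Path.home() / ".cache" / "modelscope" / "hub" / "models"
    / "OpenMOSS" / "MOSS-Audio-Tokenizer-v2"
)


def resolve_codec_path(custom_path=None, default_dir=DEFAULT_CODEC_DIR):
    """Return the first existing codec directory, or None if neither exists."""
    for candidate in (custom_path, default_dir):
        if candidate is not None and Path(candidate).is_dir():
            return Path(candidate)
    return None
```

If the function returns a path, it can be passed on as `MossTTS(codec_path=str(path))`; if it returns `None`, the codec still needs to be downloaded.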

## Model Conversion

To convert the original PyTorch model to ONNX yourself:

```bash
# Install export dependencies
pip install torch transformers onnx

# Run export (requires ~18GB RAM)
python scripts/export_onnx.py

# Validate correctness
python scripts/validate.py
```

See `scripts/export_onnx.py` for the full conversion pipeline.

## Model Architecture

| ONNX Model | Description | Size |
|---|---|---|
| `global_prefill.onnx` | Qwen3 36-layer full-sequence forward | ~15GB |
| `global_decode.onnx` | Qwen3 single-token decode with KV cache | ~15GB |
| `local_transformer_first.onnx` | GPT2 1-layer (first step) | 290MB |
| `local_transformer.onnx` | GPT2 1-layer (with KV cache) | 290MB |
| `lm_heads.npz` | LM head weight matrices | 120MB |
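
As a quick check, the sizes in the table add up to the ~31GB download mentioned elsewhere in this README. The snippet assumes decimal units (1 GB = 1000 MB) and treats the two `~15GB` entries as exactly 15 GB, so the total is approximate.

```python
# Sizes from the table above, in GB (the ~15GB entries are approximate).
SIZES_GB = {
    "global_prefill.onnx": 15.0,
    "global_decode.onnx": 15.0,
    "local_transformer_first.onnx": 0.29,
    "local_transformer.onnx": 0.29,
    "lm_heads.npz": 0.12,
}

total_gb = sum(SIZES_GB.values())
print(f"Total model size: about {total_gb:.1f} GB")
```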

## Requirements

- Python >= 3.10
- ~32GB RAM (for loading models)
- ~31GB disk space (for model files)
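
The disk requirement can be checked before the first download with the standard library. This is a minimal sketch: `has_enough_disk` is a hypothetical helper, and the path to check is an assumption, since it should point at whichever filesystem will hold the model files.

```python
import shutil

REQUIRED_DISK_GB = 31  # from the requirements list above


def has_enough_disk(path=".", required_gb=REQUIRED_DISK_GB):
    """Return True if the filesystem containing `path` has enough free space."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 10**9


if __name__ == "__main__":
    if not has_enough_disk():
        print(f"Less than {REQUIRED_DISK_GB} GB free; the model download may fail.")
```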

## License

MIT
