Metadata-Version: 2.4
Name: moss-tts-local-mnn-cli
Version: 0.1.0
Summary: MOSS-TTS-Local-Transformer-v1.5 voice cloning via MNN
Author: vra
License-Expression: MIT
Requires-Dist: mnn>=3.0
Requires-Dist: numpy>=1.24
Requires-Dist: torch>=2.1
Requires-Dist: torchaudio>=2.1
Requires-Dist: transformers>=4.40
Requires-Dist: huggingface-hub>=0.20
Requires-Dist: modelscope>=1.10 ; extra == 'modelscope'
Requires-Python: >=3.10
Project-URL: Homepage, https://github.com/vra/moss-tts-local-mnn-cli
Project-URL: Repository, https://github.com/vra/moss-tts-local-mnn-cli
Provides-Extra: modelscope
Description-Content-Type: text/markdown

# moss-tts-local-mnn-cli

MOSS-TTS-Local-Transformer-v1.5 voice cloning via [MNN](https://github.com/alibaba/MNN) (Mobile Neural Network). The MNN models are converted from the [original PyTorch model](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Local-Transformer-v1.5) (4.13B parameters; a Qwen3 global transformer plus a GPT2 local transformer).

MNN provides optimized CPU inference on ARM (Apple Silicon, Android) with fp16/int8 quantization support.

## Quick Start

```bash
uvx moss-tts-local-mnn-cli --text "你好，这是一段测试语音合成。" --reference ref.wav --output output.wav
```

## Installation

```bash
uv pip install moss-tts-local-mnn-cli

# For ModelScope download support (China mainland):
uv pip install "moss-tts-local-mnn-cli[modelscope]"
```

## CLI Usage

```bash
# Voice cloning with fp16 (default)
moss-tts-local-mnn-cli --text "你好" --reference speaker.wav --output output.wav

# Use int8 quantization (smaller on disk and in RAM; similar speed to fp16 in the benchmark below)
moss-tts-local-mnn-cli --text "你好" --reference speaker.wav --quantization int8 --output output.wav

# Download from ModelScope (China)
moss-tts-local-mnn-cli --text "你好" --reference speaker.wav --source modelscope --output output.wav
```
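
To synthesize many lines from a script, the CLI can also be driven from Python. The sketch below only assembles an argument list from the flags shown above and hands it to `subprocess`. `build_command` and `synthesize` are hypothetical helpers, not part of the package, and the sketch assumes the `moss-tts-local-mnn-cli` executable is on `PATH`.

```python
import subprocess
from typing import Optional


def build_command(text: str, reference: str, output: str,
                  quantization: Optional[str] = None,
                  source: Optional[str] = None) -> list[str]:
    """Assemble the CLI argument list (hypothetical helper, not shipped with the package)."""
    cmd = ["moss-tts-local-mnn-cli", "--text", text,
           "--reference", reference, "--output", output]
    # Optional flags are appended only when given, so the CLI defaults apply otherwise.
    if quantization is not None:
        cmd += ["--quantization", quantization]
    if source is not None:
        cmd += ["--source", source]
    return cmd


def synthesize(text: str, reference: str, output: str, **options) -> None:
    """Run the CLI once and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(build_command(text, reference, output, **options), check=True)
```

Passing the arguments as a list avoids shell quoting problems with non-ASCII text. Each call starts a new process and reloads the model, so the Python API is likely the better fit for repeated generation.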

## Python API

```python
from moss_tts_local_mnn_cli import MossTTS

# Initialize once
tts = MossTTS(quantization="fp16")

# Generate multiple times without reloading
audio = tts.generate("你好，这是第一段。", reference="ref.wav", language="Chinese")
tts.save(audio, "output1.wav")

audio2 = tts.generate("这是第二段。", reference="ref.wav")
tts.save(audio2, "output2.wav")
```
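
Because the engine is initialized once, a batch of texts can reuse it in a loop. The following is a minimal sketch under that assumption. `synthesize_all` is a hypothetical helper, not part of the package, and it relies only on the `generate` and `save` calls shown above.

```python
from pathlib import Path


def synthesize_all(tts, texts, reference, out_dir, language=None):
    """Generate one WAV per text with an already-initialized engine.

    `tts` is any object exposing generate(text, reference=..., language=...)
    and save(audio, path), e.g. MossTTS(quantization="fp16").
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, text in enumerate(texts, start=1):
        kwargs = {"reference": reference}
        # Only pass language when the caller set it, so the library default applies otherwise.
        if language is not None:
            kwargs["language"] = language
        audio = tts.generate(text, **kwargs)
        path = out_dir / f"output{i}.wav"
        tts.save(audio, str(path))
        paths.append(path)
    return paths
```

Taking the engine as a parameter keeps the model load outside the loop and makes the helper easy to exercise with a stub object.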

## Audio Codec

The audio codec ([MOSS-Audio-Tokenizer-v2](https://www.modelscope.cn/models/OpenMOSS/MOSS-Audio-Tokenizer-v2)) is required for decoding. Download it:

```bash
pip install modelscope
python -c "from modelscope import snapshot_download; snapshot_download('OpenMOSS/MOSS-Audio-Tokenizer-v2')"
```

## Model Conversion

Convert ONNX models to MNN:

```bash
# Prerequisites: build MNNConvert from https://github.com/alibaba/MNN

# fp16
MNNConvert -f ONNX --modelFile global_prefill.onnx --MNNModel global_prefill_fp16.mnn --fp16 --saveExternalData
MNNConvert -f ONNX --modelFile global_decode.onnx --MNNModel global_decode_fp16.mnn --fp16 --saveExternalData
MNNConvert -f ONNX --modelFile local_transformer_first.onnx --MNNModel local_transformer_first_fp16.mnn --fp16
MNNConvert -f ONNX --modelFile local_transformer.onnx --MNNModel local_transformer_fp16.mnn --fp16

# int8
MNNConvert -f ONNX --modelFile global_prefill.onnx --MNNModel global_prefill_int8.mnn --weightQuantBits 8 --saveExternalData
MNNConvert -f ONNX --modelFile global_decode.onnx --MNNModel global_decode_int8.mnn --weightQuantBits 8 --saveExternalData
MNNConvert -f ONNX --modelFile local_transformer_first.onnx --MNNModel local_transformer_first_int8.mnn --weightQuantBits 8
MNNConvert -f ONNX --modelFile local_transformer.onnx --MNNModel local_transformer_int8.mnn --weightQuantBits 8
```
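
The eight commands above follow one pattern: a model name, a precision flag, and `--saveExternalData` for the two global models only. A short script can generate them instead of maintaining them by hand. This is a sketch; `convert_commands` is a hypothetical helper, and it prints the commands rather than running them.

```python
import shlex

# Model name -> whether the command above passes --saveExternalData.
MODELS = {
    "global_prefill": True,
    "global_decode": True,
    "local_transformer_first": False,
    "local_transformer": False,
}

# Output suffix -> MNNConvert flags, copied from the commands above.
PRECISION_FLAGS = {
    "fp16": ["--fp16"],
    "int8": ["--weightQuantBits", "8"],
}


def convert_commands(converter="MNNConvert"):
    """Return one argument list per (precision, model) pair."""
    cmds = []
    for precision, flags in PRECISION_FLAGS.items():
        for name, external in MODELS.items():
            cmd = [converter, "-f", "ONNX",
                   "--modelFile", f"{name}.onnx",
                   "--MNNModel", f"{name}_{precision}.mnn", *flags]
            if external:
                cmd.append("--saveExternalData")
            cmds.append(cmd)
    return cmds


for cmd in convert_commands():
    print(shlex.join(cmd))
```

To execute the conversions rather than print them, each list could be passed to `subprocess.run(cmd, check=True)` once `MNNConvert` is built and on `PATH`.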

## Benchmark (Apple M3 Max, 36GB)

| Engine | ms/step | Speedup vs PyTorch |
|---|---|---|
| PyTorch fp32 | 338 | 1.00x |
| ONNX fp32 | 374 | 0.90x |
| MNN fp16 | 274 | 1.23x |
| MNN int8 | 277 | 1.22x |
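
The last column is the PyTorch time divided by each engine's time, so values above 1 mean faster than PyTorch. A small sketch that recomputes it from the ms/step column follows. Since it uses the rounded table figures, a result can differ from a published ratio in the last digit.

```python
# ms/step values copied from the table above.
MS_PER_STEP = {
    "PyTorch fp32": 338,
    "ONNX fp32": 374,
    "MNN fp16": 274,
    "MNN int8": 277,
}


def speedups(ms_per_step, baseline="PyTorch fp32"):
    """Return baseline_time / engine_time for every engine."""
    base = ms_per_step[baseline]
    return {name: base / ms for name, ms in ms_per_step.items()}


for name, ratio in speedups(MS_PER_STEP).items():
    print(f"{name}: {ratio:.2f}x")
```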

## Requirements

- Python >= 3.10
- ~20GB RAM (fp16) or ~16GB RAM (int8)
- ~14GB disk (fp16) or ~11GB disk (int8)
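
The Python version and free disk space can be checked with the standard library before downloading anything. The sketch below uses the approximate disk figures from the list above and treats 1 GB as 10^9 bytes, which is an assumption. `preflight` is a hypothetical helper, and RAM is not checked because the standard library has no portable call for it.

```python
import shutil
import sys

# Approximate disk requirements in GB, taken from the list above.
DISK_GB = {"fp16": 14, "int8": 11}


def preflight(quantization="fp16", path="."):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python >= 3.10 is required")
    # Free space on the filesystem that contains `path`.
    free_gb = shutil.disk_usage(path).free / 1e9
    needed = DISK_GB[quantization]
    if free_gb < needed:
        problems.append(
            f"about {needed} GB of free disk is needed, found {free_gb:.1f} GB"
        )
    return problems


for problem in preflight("fp16"):
    print("warning:", problem)
```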

## License

MIT
