Metadata-Version: 2.4
Name: x-voice
Version: 0.1.1
Summary: X-Voice multilingual TTS toolkit
Author: Rixi Xu, Qingyu Liu
License: MIT License
Project-URL: Homepage, https://github.com/sunnyxrxrx/X-Voice
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Requires-Python: >=3.11
Description-Content-Type: text/markdown
Requires-Dist: addict
Requires-Dist: accelerate>=0.33.0
Requires-Dist: bg_text_normalizer
Requires-Dist: bitsandbytes>0.37.0; platform_machine != "arm64" and platform_system != "Darwin"
Requires-Dist: cached_path
Requires-Dist: click
Requires-Dist: ctranslate2==4.5.0
Requires-Dist: datasets
Requires-Dist: deepfilternet
Requires-Dist: ema_pytorch>=0.5.2
Requires-Dist: epitran
Requires-Dist: finnsyll
Requires-Dist: g2pk
Requires-Dist: gradio>=6.0.0
Requires-Dist: hydra-core>=1.3.0
Requires-Dist: jieba
Requires-Dist: fastlid
Requires-Dist: fasttext
Requires-Dist: librosa
Requires-Dist: matplotlib
Requires-Dist: nemo_text_processing
Requires-Dist: num2words
Requires-Dist: numpy<=1.26.4; python_version <= "3.10"
Requires-Dist: openai-whisper
Requires-Dist: phonemizer
Requires-Dist: pydantic<=2.10.6
Requires-Dist: pydub
Requires-Dist: pyloudnorm
Requires-Dist: pykakasi
Requires-Dist: pypinyin
Requires-Dist: pyphen
Requires-Dist: pythainlp
Requires-Dist: pyopenjtalk
Requires-Dist: pytest-runner
Requires-Dist: python-crfsuite
Requires-Dist: rjieba
Requires-Dist: regex
Requires-Dist: safetensors
Requires-Dist: simplejson
Requires-Dist: soundfile
Requires-Dist: tomli
Requires-Dist: torch>=2.0.0
Requires-Dist: torchaudio>=2.0.0
Requires-Dist: torchcodec
Requires-Dist: torchdiffeq
Requires-Dist: tqdm>=4.65.0
Requires-Dist: transformers<=4.48.3
Requires-Dist: transformers_stream_generator
Requires-Dist: unidecode
Requires-Dist: vocos
Requires-Dist: wandb
Requires-Dist: WeTextProcessing
Requires-Dist: x_transformers>=1.31.14
Requires-Dist: xphonebr
Provides-Extra: eval
Requires-Dist: faster_whisper==0.10.1; extra == "eval"
Requires-Dist: funasr; extra == "eval"
Requires-Dist: jiwer; extra == "eval"
Requires-Dist: modelscope; extra == "eval"
Requires-Dist: onnxruntime-gpu; extra == "eval"
Requires-Dist: zhconv; extra == "eval"
Requires-Dist: zhon; extra == "eval"
Provides-Extra: ipa-v4
Requires-Dist: lingua; extra == "ipa-v4"
Requires-Dist: spellchecker; extra == "ipa-v4"

# X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning

<a href="https://arxiv.org/abs/unknown" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/Paper-Coming%20Soon-b31b1b.svg?logo=arXiv&style=for-the-badge" alt="Paper"></a>
<a href="https://sunnyxrxrx.github.io/X-Voice-Demo/" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/Demo-Samples-orange.svg?logo=github&style=for-the-badge" alt="Demo"></a>
<img src="https://img.shields.io/badge/Python-3.11%2B-3776AB?style=for-the-badge&logo=python&logoColor=white" alt="Python">
<a href="unknown" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/Online%20Experience-HF%20Space-yellow?labelColor=grey&logo=huggingface&style=for-the-badge" alt="HF Space"></a>
<a href="https://huggingface.co/datasets/XRXRX/X-Voice-Dataset-Train" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/Dataset-Train%20Set-yellow?labelColor=grey&logo=huggingface&style=for-the-badge" alt="HF Dataset"></a>
<a href="https://huggingface.co/datasets/XRXRX/X-Voice-Testset" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/Benchmark-Test%20Set-lightgrey?labelColor=grey&logo=huggingface&style=for-the-badge" alt="HF Benchmark"></a>
<a href="https://modelscope.cn/datasets/sunnyxrxrx/X-Voice-Dataset-Train" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/ModelScope-Dataset-blue?logo=alibabacloud&style=for-the-badge" alt="ModelScope"></a>
<a href="https://x-lance.sjtu.edu.cn/" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/X--LANCE-grey?labelColor=lightgrey&logo=leanpub&style=for-the-badge" alt="X-LANCE"></a>
<a href="https://www.sii.edu.cn/" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/SII-grey?labelColor=lightgrey&logo=leanpub&style=for-the-badge" alt="SII"></a>
<a href="https://www.geely.com" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/Geely-grey?labelColor=lightgrey&logo=accenture&style=for-the-badge" alt="Geely"></a>
<a href="https://www.clsp.jhu.edu" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/CLSP-grey?labelColor=lightgrey&logo=leanpub&style=for-the-badge" alt="CLSP"></a>
<!-- <img src="https://github.com/user-attachments/assets/12d7749c-071a-427c-81bf-b87b91def670" alt="Watermark" style="width: 40px; height: auto"> -->

**X-Voice** is a flow-matching-based multilingual zero-shot voice cloning system that enables one speaker to speak 30 languages.
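As background, flow matching synthesizes by integrating a learned velocity field along an ODE from noise (t=0) to data (t=1). The toy sketch below is purely illustrative, not X-Voice's actual model: it replaces the learned network with the closed-form straight-line velocity toward a known scalar target, so a plain Euler loop can demonstrate the sampling procedure in miniature.

```python
# Toy flow-matching sampling loop (illustrative only).
# A real system predicts v(x, t) with a neural network; here we use the
# closed-form velocity for the straight-line path to a known target x1,
# so the Euler integration visibly transports x0 onto x1.

def sample_flow(x0: float, x1: float, steps: int = 100) -> float:
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 (data)."""
    dt = 1.0 / steps
    x, t = x0, 0.0
    for _ in range(steps):
        v = (x1 - x) / (1.0 - t)  # velocity along the linear (rectified) path
        x += dt * v
        t += dt
    return x

result = sample_flow(0.0, 2.0)  # approximately 2.0, up to float error
```

In X-Voice the state is a mel-spectrogram rather than a scalar, and the ODE is solved by torchdiffeq (see Acknowledgements), but the sampling principle is the same.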

## News

- **2026/04/30**: X-Voice <a href="https://github.com/sunnyxrxrx/X-Voice" target="_blank" rel="noopener noreferrer">codebase</a>, <a href="https://huggingface.co/XRXRX/X-Voice" target="_blank" rel="noopener noreferrer">model</a>, <a href="https://sunnyxrxrx.github.io/X-Voice-Demo/" target="_blank" rel="noopener noreferrer">demo</a>, <a href="https://huggingface.co/datasets/XRXRX/X-Voice-Dataset-Train" target="_blank" rel="noopener noreferrer">dataset</a>, and <a href="https://huggingface.co/datasets/XRXRX/X-Voice-Testset" target="_blank" rel="noopener noreferrer">benchmark</a> are released.

## Installation

### Create a separate environment if needed

```bash
# Create a conda env with Python >= 3.11
conda create -n x-voice python=3.11
conda activate x-voice

# Install FFmpeg if you haven't yet
conda install ffmpeg
```

### Install PyTorch matched to your device

<details>
<summary>NVIDIA GPU</summary>

> ```bash
> # Install pytorch with your CUDA version, e.g.
> pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128
> ```

</details>

<details>
<summary>AMD GPU</summary>

> ```bash
> # Install pytorch with your ROCm version (Linux only), e.g.
> pip install torch==2.5.1+rocm6.2 torchaudio==2.5.1+rocm6.2 --extra-index-url https://download.pytorch.org/whl/rocm6.2
> ```

</details>

<details>
<summary>Intel GPU</summary>

> ```bash
> # Install pytorch with your XPU version, e.g.
> pip install torch torchaudio --index-url https://download.pytorch.org/whl/test/xpu
> ```

</details>

<details>
<summary>Apple Silicon</summary>

> ```bash
> # Install the stable pytorch, e.g.
> pip install torch torchaudio
> ```

</details>

### Install X-Voice

```bash
git clone https://github.com/sunnyxrxrx/X-Voice.git
cd X-Voice
pip install -e .
```

Check your eSpeak NG installation:

```bash
espeak-ng --version
```

If not found, run `src/x_voice/prepare_ipa.sh` first.
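The same check can be done programmatically before launching inference. The helper below is a small convenience sketch, not part of the X-Voice API; it only probes the PATH for the `espeak-ng` binary.

```python
# Hypothetical helper (not shipped with X-Voice): verify that the
# espeak-ng binary used for IPA phonemization is available on PATH.
import shutil

def espeak_ng_available() -> bool:
    """Return True if the espeak-ng executable can be found on PATH."""
    return shutil.which("espeak-ng") is not None

if not espeak_ng_available():
    print("espeak-ng not found; run src/x_voice/prepare_ipa.sh first")
```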

## Inference

- To achieve the desired performance, take a moment to read the [detailed guidance](src/x_voice/infer).

### 1. Gradio App

```bash
x-voice_infer-gradio --host 0.0.0.0 --port 7860
```

### 2. CLI Inference

```bash
# X-Voice Stage1
python -m x_voice.infer.infer_cli_stage1 -c src/x_voice/infer/examples/basic/basic_stage1.toml

# X-Voice Stage2
python -m x_voice.infer.infer_cli_stage2 -c src/x_voice/infer/examples/basic/basic_stage2.toml
```
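For orientation, an inference TOML typically pairs a reference audio clip and its transcript with the text to generate. The sketch below is hypothetical; every key name is an illustrative guess modeled on common F5-TTS-style configs, so consult the shipped files under `src/x_voice/infer/examples/basic/` for the authoritative schema.

```toml
# Hypothetical config sketch; key names are illustrative, not authoritative.
model = "X-Voice"
ref_audio = "examples/basic/ref_en.wav"          # voice to clone
ref_text = "Transcript of the reference audio."
gen_text = "Text to synthesize in the target language."
remove_silence = false
output_dir = "outputs"
```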

## Training

### TTS Model Training

Refer to [training guidance](src/x_voice/train/README.md) for best practices.

### Speaking Rate Predictor Training

Refer to [speaking rate predictor guidance](src/rate_pred) for the multilingual speaking rate predictor used in X-Voice.

## Evaluation

Refer to [evaluation guidance](src/x_voice/eval/README.md) for benchmark and metric scripts.

## Repo Structure

```text
X-Voice/
├── ckpts/                  # checkpoints
├── data/                   # datasets and processed data
├── src/
│   ├── rate_pred/          # speaking rate predictor
│   ├── third_party/
│   │   └── BigVGAN/        # BigVGAN submodule
│   └── x_voice/            # main X-Voice package
└── pyproject.toml          # package definition and dependencies
```

## Development

Use pre-commit to ensure code quality:

```bash
pip install pre-commit
pre-commit install
pre-commit run --all-files
```

## Acknowledgements

- [F5-TTS](https://arxiv.org/abs/2410.06885), a brilliant work and the foundation of this codebase
- Cross-Lingual F5-TTS 2 for its supervised fine-tuning strategy with synthetic audio prompts
- [Cross-Lingual F5-TTS](https://arxiv.org/abs/2509.14579) for its speaking rate predictor
- [NLLB](https://huggingface.co/facebook/nllb-200-distilled-600M) for translation in the Gradio demo
- [torchdiffeq](https://github.com/rtqichen/torchdiffeq) as the ODE solver, [Vocos](https://huggingface.co/charactr/vocos-mel-24khz) and [BigVGAN](https://github.com/NVIDIA/BigVGAN) as vocoders
- [FunASR](https://github.com/modelscope/FunASR), [faster-whisper](https://github.com/SYSTRAN/faster-whisper), [UniSpeech](https://github.com/microsoft/UniSpeech), [SpeechMOS](https://github.com/tarepan/SpeechMOS) for evaluation tools
- [MAVL](https://github.com/k1064190/MAVL/tree/main) for Japanese syllable counting

## License

Our code is released under the MIT License. The pre-trained models are licensed under CC-BY-NC because they were trained on Emilia, an in-the-wild dataset. Sorry for any inconvenience this may cause.
