Metadata-Version: 2.4
Name: leva-tts
Version: 0.1.6
Summary: Low-latency Levantine Arabic / English code-switching TTS (fine-tuned XTTS-v2)
Author-email: Mohammed Aly <mae678900@gmail.com>
License-Expression: Apache-2.0
Project-URL: Homepage, https://github.com/MohammedAly22/Leva-TTS
Project-URL: Demo, https://mohammedaly22.github.io/Leva-TTS/
Project-URL: Model, https://huggingface.co/mohammedaly22/leva-tts
Project-URL: Space, https://huggingface.co/spaces/mohammedaly22/leva-tts
Project-URL: Dataset, https://huggingface.co/datasets/mohammedaly22/lahgtna-levantine-tts
Keywords: tts,arabic,levantine,code-switching,speech-synthesis,xtts
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Topic :: Multimedia :: Sound/Audio :: Speech
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: coqui-tts>=0.27.5
Requires-Dist: torch>=2.3.0
Requires-Dist: torchaudio>=2.3.0
Requires-Dist: soundfile>=0.12.1
Requires-Dist: librosa>=0.11.0
Requires-Dist: huggingface_hub>=0.23.3
Requires-Dist: rich>=13.7.1
Requires-Dist: numpy>=1.26.0
Requires-Dist: scipy>=1.11
Requires-Dist: requests>=2.32
Requires-Dist: tqdm>=4.66
Requires-Dist: pyyaml>=6.0
Requires-Dist: pandas>=1.5.3
Requires-Dist: sympy
Requires-Dist: transformers<5,>=4.57
Provides-Extra: server
Requires-Dist: fastapi>=0.111; extra == "server"
Requires-Dist: uvicorn[standard]>=0.30; extra == "server"
Requires-Dist: websockets>=11; extra == "server"
Provides-Extra: pipecat
Requires-Dist: pipecat-ai>=0.0.48; extra == "pipecat"
Requires-Dist: openai; extra == "pipecat"
Requires-Dist: ormsgpack; extra == "pipecat"
Provides-Extra: gradio
Requires-Dist: gradio>=4.37; extra == "gradio"
Provides-Extra: all
Requires-Dist: leva-tts[gradio,pipecat,server]; extra == "all"
Dynamic: license-file


<div align="center">

# 🌿 Leva-TTS

### Low-Latency Code-Switching TTS — Levantine Arabic ⇄ English

*A production-oriented Levantine Text-to-Speech pipeline built on a fine-tuned **XTTS-v2** optimized for real-time conversational agents.*


[![Demo](https://img.shields.io/badge/🔊_Live_Demo-Listen-2ea043)](https://mohammedaly22.github.io/Leva-TTS/)
[![HF Model](https://img.shields.io/badge/🤗_Model-Leva--TTS-FFD21E)](https://huggingface.co/mohammedaly22/leva-tts)
[![HF Space](https://img.shields.io/badge/🤗_Space-Gradio_Demo-FFD21E)](https://huggingface.co/spaces/mohammedaly22/leva-tts)
[![PyPI](https://img.shields.io/pypi/v/leva-tts?color=3775A9&logo=pypi&logoColor=white)](https://pypi.org/project/leva-tts/)
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MohammedAly22/Leva-TTS/blob/main/examples/01_quick_start.ipynb)

| 🎯 KPI | Target | **Measured** | Status |
|---|---|---|---|
| Peak VRAM (inference) | ≤ 3 GB | **2.13 GB** | ✅ |
| Time-to-First-Audio (p50) | < 300 ms | **565 ms** | ⚠️ |
| Real-Time Factor (RTF) | < 0.3 | **0.21** | ✅ |
| Streaming output | required | **chunked PCM + WS** | ✅ |

</div>

---

## 🌟 Overview

Leva-TTS is a production-oriented streaming TTS system that handles **natural code-switching between Levantine Arabic dialect and English**, the way real speakers actually talk.

It fine-tunes **XTTS-v2** (Coqui) on 50,000 high-quality synthetic Levantine Arabic + code-switching utterances generated by **[Lahgtna-OmniVoice v2](https://huggingface.co/oddadmix/lahgtna-omnivoice-v2)** — a zero-shot TTS model already fine-tuned for the Levantine Arabic dialect (ISO 639-3: `apc`).

### ✨ Key Features

| Feature | Details |
|---------|---------|
| 🗣️ **Natural code-switching** | Intra-sentence Arabic ↔ English |
| ⚡ **Streaming output** | Chunked PCM; first-chunk target < 300 ms (~565 ms measured on H100) |
| 💾 **Low VRAM** | ≤ 3 GB at inference |
| 🌿 **Levantine dialect** | ق→/ʔ/ glottal, ج→/ʒ/, *il-* article, *b-* prefix |
| 🔤 **Smart text front-end** | Partial diacritics on homographs + Levantine lexicon CSV |
| 👥 **10 speakers** | 5 male + 5 female, diverse Levantine accents |
| 📡 **WebSocket streaming** | FastAPI server with real-time chunked PCM |
| 🔌 **Pipecat ready** | Drop-in `TTSService` for voice agents |

---

## 📊 Performance

Measured on a single **NVIDIA H100** (fp16) over a 15-sentence held-out set
(6 pure Levantine · 3 pure English · 6 code-switched), speaker **Mohamed**:

| Metric | Target | Achieved |
|--------|--------|----------|
| 💾 Peak VRAM (inference only) | ≤ 3 GB | **2.13 GB** ✅ |
| ⚡ TTFA — streaming (first chunk) | < 300 ms | **~565 ms** ⚠️ |
| ⏱️ TTFA — batch p50 | — | 707 ms |
| 🎚️ RTF p50 / p95 | < 0.3 | **0.21 / 0.59** ✅ (p50) |
| 📡 Streaming | Required | ✅ |

> Notes: RTF p50 is well under target, but longer sentences push p95 (0.59)
> above it. Streaming TTFA (~565 ms) is the time to the first playable audio
> chunk: XTTS-v2's autoregressive GPT does not produce its first chunk within
> the 300 ms target, but audio plays continuously thereafter. The VRAM figure
> excludes the Whisper model, which is used only during evaluation.
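
RTF is synthesis wall-clock time divided by the duration of the audio produced, so values below 1.0 are faster than real time. The sketch below shows one way to compute RTF and its p50/p95 from per-sentence timings. The helper names and the sample timings are illustrative and are not taken from `scripts/evaluate.py`.

```python
import statistics

SAMPLE_RATE = 24_000  # Leva-TTS output rate stated in this README


def real_time_factor(synth_seconds: float, num_samples: int, sr: int = SAMPLE_RATE) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    audio_seconds = num_samples / sr
    return synth_seconds / audio_seconds


def percentile(values: list[float], pct: int) -> float:
    """Return the pct-th percentile (1..99) using inclusive interpolation."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return cuts[pct - 1]


# Made-up timings: (synthesis seconds, samples generated) per sentence.
runs = [(0.5, 48_000), (1.0, 96_000), (1.2, 72_000)]
rtfs = [real_time_factor(t, n) for t, n in runs]
print(f"RTF p50={percentile(rtfs, 50):.2f}  p95={percentile(rtfs, 95):.2f}")
```

With only a handful of sentences, p95 is dominated by the slowest one or two utterances, which is why long sentences move it so much.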

---

## 🎵 Audio Samples

> ### 🔊 **[▶ Open the interactive demo page →](https://mohammedaly22.github.io/Leva-TTS/)**

---

## ⚡ Try it on Colab (zero setup)

Run everything on a **free Colab T4 GPU** — no local install:

| Notebook | Description | |
|----------|-------------|--|
| **Quick Start** | Synthesize, zero-shot clone, stream | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MohammedAly22/Leva-TTS/blob/main/examples/01_quick_start.ipynb) |
| **Inference Server** | FastAPI streaming server + requests | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MohammedAly22/Leva-TTS/blob/main/examples/02_inference_server.ipynb) |
| **Evaluation** | RTF / TTFA / CER / WER / UTMOS on T4 | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MohammedAly22/Leva-TTS/blob/main/examples/03_evaluation.ipynb) |
| **Gradio App** | Full web demo with a public link | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MohammedAly22/Leva-TTS/blob/main/examples/04_gradio_app.ipynb) |

See [`examples/`](examples/) for details.

---

## 🚀 Getting Started

Leva-TTS supports two usage paths:

| Path | For whom | What you get |
|------|----------|--------------|
| **A — `pip install`** | You only want to **synthesize speech** | The `LevaTTS` Python class — `synthesize`, `zero_shot_synthesize`, `stream`, `zero_shot_stream`. The fine-tuned checkpoint + 10 reference speakers download automatically on first use. |
| **B — Clone the repo** | You want **full control** — streaming server, Pipecat, Gradio app, fine-tuning | Everything in A plus the FastAPI server, the Pipecat plugin, the Gradio demo, the evaluation suite, and the training pipeline. |

---

## 📦 Path A — `pip install` (inference only)

### 1. Create the environment

```bash
conda create -n leva-tts python=3.10 -y
conda activate leva-tts

# System audio libraries (Ubuntu/Debian)
sudo apt-get install -y espeak-ng ffmpeg libsndfile1
```

### 2. Install PyTorch first

Install PyTorch **before** `leva-tts` so that pip keeps the CUDA build that
matches your machine instead of pulling a default wheel. Pick the command that
matches your hardware from https://pytorch.org/get-started/locally/:

```bash
# CUDA 12.1 (most H100 / A100 / RTX setups)
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121

# CUDA 11.8
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118

# CPU only
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu
```

### 3. Install leva-tts

```bash
pip install leva-tts
```

> **Engine note:** Leva-TTS depends on **`coqui-tts`** — the maintained Coqui
> fork that exposes the same `TTS`/XTTS modules. The original `TTS` package is
> unmaintained and pins `numpy==1.22.0`, which cannot resolve against modern
> `librosa`/`numba` on Python 3.10+ (the classic `ResolutionImpossible` error).
> `coqui-tts` ships a coherent, numpy-2-compatible dependency set, so a plain
> `pip install leva-tts` resolves cleanly.

> The first synthesis call automatically downloads the fine-tuned checkpoint and
> the 10 reference speakers from HuggingFace (`mohammedaly22/leva-tts`), falling
> back to the GitHub release. To pre-download:
> ```bash
> python -c "import leva_tts; leva_tts.download_model()"
> ```

### 4. Initialize

```python
from leva_tts import LevaTTS, SPEAKERS

tts = LevaTTS(
    device="cuda",          # "cuda" | "cpu" (auto-detected if omitted)
    preprocess_text=True,   # Levantine text front-end (numbers, dates, diacritics, lexicon)
    verbose=False,          # print the text-processing stages
)

print(SPEAKERS)
# ['Badr', 'Mohamed', 'Saad', 'Rami', 'Fadi',
#  'Amina', 'Fatma', 'Lamyaa', 'Mona', 'Haneen']
```

### 5. Synthesize with a built-in speaker

`synthesize(text, speaker, language="ar", **gen_params)` returns `(wav, sr)`:
`wav` is a float32 NumPy array and `sr` is the sample rate (24 kHz). `speaker`
**must** be one of the 10 names above; otherwise a `ValueError` is raised.

```python
import soundfile as sf

wav, sr = tts.synthesize(
    "هَلَّق أنا عم أشتغل على the project",
    speaker="Badr",
    temperature=0.65,          # generation params are optional per-call
    repetition_penalty=5.0,
    top_p=0.85,
    top_k=50,
    speed=1.0,
)
sf.write("output.wav", wav, sr)   # sr == 24000
```

### 6. Zero-shot voice cloning

`zero_shot_synthesize(text, reference_audio, language="ar", **gen_params)` works
like `synthesize`, but you pass a **path to your own 3–10 s reference clip**
instead of a built-in speaker name.

```python
wav, sr = tts.zero_shot_synthesize(
    "والله the meeting today كانت important كتير",
    "my_voice.wav",
    language="ar",
)
sf.write("cloned.wav", wav, sr)
```

### 7. Streaming (generators)

`stream(...)` and `zero_shot_stream(...)` mirror the two methods above but
**yield audio chunks** as they are generated — ideal for low-latency playback or
sending over a socket.

```python
import numpy as np
import soundfile as sf

# Built-in speaker
chunks = []
for chunk in tts.stream("بِدِّي أحكيلك عن the new feature هَلَّق", speaker="Amina"):
    chunks.append(chunk)        # play / forward each chunk in real time
sf.write("streamed.wav", np.concatenate(chunks), 24000)

# Zero-shot streaming
for chunk in tts.zero_shot_stream("هلق عم نشتغل على الموضوع", "my_voice.wav"):
    ...
```

**Generation parameters** (all optional, valid on every method): `temperature`,
`length_penalty`, `repetition_penalty`, `top_k`, `top_p`, `speed`.
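
When forwarding chunks in real time, the latency that matters is the wait before the first chunk arrives. A minimal sketch of measuring it for any chunk iterator follows. `fake_stream` is a hypothetical stand-in for `tts.stream(...)` so the snippet runs without a GPU, and `consume_with_ttfa` is an illustrative helper, not part of the `leva_tts` API.

```python
import time
from typing import Callable, Iterable, Iterator


def consume_with_ttfa(chunks: Iterable, on_chunk: Callable[[object], None]) -> tuple[float, int]:
    """Forward every chunk to on_chunk; return (seconds to first chunk, chunk count)."""
    start = time.perf_counter()
    ttfa = float("nan")  # stays NaN if the iterator yields nothing
    count = 0
    for chunk in chunks:
        if count == 0:
            ttfa = time.perf_counter() - start
        on_chunk(chunk)
        count += 1
    return ttfa, count


def fake_stream(n_chunks: int = 3, delay: float = 0.01) -> Iterator[list[float]]:
    """Hypothetical stand-in for tts.stream(...): yields short silent chunks."""
    for _ in range(n_chunks):
        time.sleep(delay)
        yield [0.0] * 240


received = []
ttfa, count = consume_with_ttfa(fake_stream(), received.append)
print(f"first chunk after {ttfa * 1000:.0f} ms, {count} chunks total")
```

With the real model, pass `tts.stream(text, speaker="Amina")` as `chunks` and a playback or socket-send function as `on_chunk`.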

---

## 🛠️ Path B — Clone the repo (advanced)

For the streaming server, Pipecat integration, the Gradio app, evaluation, or
fine-tuning, clone the repo and create the full conda environment.

### 1. Clone & create the environment

```bash
git clone https://github.com/MohammedAly22/Leva-TTS.git
cd Leva-TTS

# System dependencies
sudo apt-get install -y espeak-ng ffmpeg libsndfile1

# Full conda environment (XTTS, training, server, pipecat, gradio)
conda env create -f environment.yml
conda activate leva-tts
pip install -e .

# Optional — GPU training acceleration
bash scripts/install_deepspeed.sh
```

Download the checkpoint + reference speakers:

```bash
python -c "import leva_tts; leva_tts.download_model('./checkpoints')"
```

### 2. Inference (CLI)

```bash
# Built-in speaker
python scripts/inference.py --text "كيفك اليوم؟" --speaker Amina --out output.wav

# Streaming mode
python scripts/inference.py --text "..." --speaker Badr --stream

# Zero-shot with your own reference audio
python scripts/inference.py --text "..." --ref-audio your_speaker.wav --out clone.wav
```

### 3. FastAPI streaming server

```bash
# Start the server
LEVA_CHECKPOINT=./checkpoints LEVA_SPEAKER_WAV=./reference_audios/Badr.wav python -m leva_tts.server.app

# Health check
curl http://localhost:8000/health

# Batch synthesize
curl -X POST http://localhost:8000/synthesize \
  -H "Content-Type: application/json" \
  -d '{"text":"كيفك اليوم؟","language":"ar","format":"wav"}' \
  --output output.wav
```

Endpoints: `POST /synthesize` (WAV/PCM), `WS /stream` (real-time chunks),
`GET /health`, `GET /metrics`.
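
The batch `curl` call can also be issued from Python. The sketch below uses only the standard library and mirrors the JSON body from the curl example. Any other request fields the server may accept are not documented here and are left out, and `build_synthesize_request` and `synthesize_to_file` are illustrative helpers rather than part of the package.

```python
import json
import urllib.request


def build_synthesize_request(text: str, base_url: str = "http://localhost:8000",
                             language: str = "ar", fmt: str = "wav") -> urllib.request.Request:
    """Build the same POST /synthesize request as the curl example."""
    body = json.dumps({"text": text, "language": language, "format": fmt}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/synthesize",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text: str, out_path: str = "output.wav") -> None:
    """Send the request and write the response bytes to disk (server must be running)."""
    with urllib.request.urlopen(build_synthesize_request(text), timeout=60) as resp:
        with open(out_path, "wb") as fh:
            fh.write(resp.read())


# Building the request does not touch the network, so this runs without a server.
req = build_synthesize_request("كيفك اليوم؟")
print(req.get_method(), req.full_url)
```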

### 4. Pipecat integration

```python
from leva_tts.pipecat_plugin import LevaTTSService
from pipecat.pipeline.pipeline import Pipeline

# Local GPU mode
tts = LevaTTSService(
    mode="local",
    checkpoint="./checkpoints",
    speaker_wav="./reference_audios/Badr.wav",
    language="ar",
)

# Remote WebSocket mode (points at the streaming server above)
tts_remote = LevaTTSService(
    mode="remote",
    server_url="ws://localhost:8000/stream",
    language="ar",
)

pipeline = Pipeline([..., tts, ...])
```

The service emits `TTSStartedFrame → TTSAudioRawFrame(s) → TTSStoppedFrame`,
streaming audio chunk-by-chunk for conversational latency.

### 5. Gradio demo

```bash
python app.py
# Open http://localhost:7860
```

Features: 🎤 10-speaker dropdown with reference playback · 📝 processed-text
preview (see exactly what XTTS-v2 receives) · 🎵 batch synthesis with TTFA / RTF /
VRAM metrics · 🎙️ zero-shot upload (any 3–10 s clip) · 💡 pre-loaded
code-switching examples.

### 6. Fine-tuning

The full data pipeline (50K synthetic utterances via Lahgtna-OmniVoice v2) and
the XTTS-v2 fine-tuning steps are documented in the
[Data Pipeline](#-data-pipeline) section below.

```bash
python scripts/prepare_data.py --metadata data/metadata.csv --out data/
python scripts/train.py --config configs/<YOUR_TRAINING_CONFIG>.json
```

---

## 📊 Evaluation

```bash
python scripts/evaluate.py --checkpoint checkpoints/

# Skip ASR (faster)
python scripts/evaluate.py --checkpoint checkpoints/ --no-asr
```

Reports:
- TTFA p50/p95, RTF p50/p95, Peak VRAM
- CER/WER via Whisper large-v3 ASR round-trip
- UTMOS (reference-free neural MOS)
- Per-type breakdown: pure_levantine / pure_english / code_switching

### Results (speaker Mohamed, NVIDIA H100, Whisper large-v3 round-trip)

**Overall**

| Metric | Value |
|--------|-------|
| Peak VRAM (inference) | 2.13 GB |
| RTF p50 / p95 | 0.36 / 0.53 |
| TTFA p50 / p95 (batch) | 1194 / 1743 ms |
| TTFA streaming (first chunk) | ~565 ms |
| CER (mean) | 0.255 |
| WER (mean) | 0.496 |
| **UTMOS** (reference-free MOS) | **3.13 / 5.0** |

**Per-category (intelligibility via ASR round-trip)**

| Category | n | CER ↓ | WER ↓ | RTF ↓ | UTMOS ↑ |
|----------|---|-------|-------|-------|---------|
| Pure English | 3 | **0.144** | **0.190** | 0.365 | **3.35** |
| Pure Levantine Arabic | 6 | 0.236 | 0.544 | 0.412 | 2.97 |
| Code-Switching | 6 | 0.330 | 0.602 | 0.358 | 3.19 |

> Pure English achieves the lowest CER/WER, which suggests that English quality
> is well retained (although n = 3 is a small sample). Arabic CER/WER are higher
> partly because Whisper large-v3 transcribes MSA-normalized Arabic while the
> references keep Levantine spelling and partial diacritics, so a fraction of
> the "errors" are orthographic differences, not pronunciation errors.
> Code-switching is the hardest case (language boundaries), as expected.

### ⚡ v2 — Inference Optimization (TF32 + torch.compile)

We provide an optimized inference path that enables **TF32 matmul** (Hopper/Ampere)
and **`torch.compile`** (`reduce-overhead`) on the autoregressive GPT — the main
latency bottleneck. Run it with:

```bash
# Baseline
python scripts/evaluate.py --checkpoint checkpoints --tag default

# Optimized (fp16 GPT path + TF32 + compiled kernels)
python scripts/evaluate.py --checkpoint checkpoints --tag optimized --optimize
```

**Default vs Optimized** (same 15-sentence set, speaker Mohamed, H100):

| Metric | Default | Optimized | Δ |
|--------|---------|-----------|---|
| RTF p50 | 0.362 | **0.355** | −1.9% |
| RTF p95 | 0.528 | **0.494** | **−6.4%** |
| TTFA p50 (ms) | 1194 | **1150** | −44 ms |
| UTMOS ↑ | 3.13 | **3.24** | **+3.5%** |
| CER | 0.255 | 0.173 | (within sampling variance) |

The optimization **lowers RTF (p95 −6.4%) and TTFA while slightly improving UTMOS**
— quality is preserved. The CER/WER spread between runs is dominated by the
sampling temperature (0.65), not the optimization.

> **Tried & rejected:** Full fp16 on the HiFi-GAN decoder broke the fp32 conv
> filters in the speaker encoder (dtype mismatch). ONNX export of the
> autoregressive GPT is non-trivial (KV-cache + dynamic loop) and gave no
> reliable speedup over `torch.compile` for streaming, so TF32 + compile is the
> recommended path.

---

## 🏗️ Data Pipeline

### Step 1 — Text collection (50K sentences)

```bash
python scripts/gather_levantine_text.py
# → data/levantine_50k.txt
```

**Sources:**
- [GU-CLASP Shami Corpus](https://github.com/GU-CLASP/shami-corpus) — 60K real Levantine sentences (Syrian, Lebanese, Palestinian, Jordanian)
- Synthetic code-switching templates (35K+ unique combinations)

**Text processing pipeline:**
```
Raw text
  → Unicode NFC + tatweel removal
  → Number verbalization (Levantine: مية not مئة, تلاتة, etc.)
  → ه → ة correction (nouns/adjectives, names — preserves والله, pronoun suffixes)
  → Partial diacritics on homographs (ضَلّ, هَلَّق, مِشْ, بِدِّي, etc.)
  → Levantine lexicon CSV overrides (148 entries)
```
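
The first stage of this pipeline is easy to illustrate. The sketch below applies Unicode NFC normalization and strips the tatweel (kashida, U+0640) elongation character. It is a simplified stand-in, not the package's `TextProcessor`, and it leaves diacritics untouched because a later stage adds them deliberately.

```python
import unicodedata

TATWEEL = "\u0640"  # Arabic elongation character (kashida)


def normalize_basic(text: str) -> str:
    """NFC-normalize the text and drop every tatweel character."""
    text = unicodedata.normalize("NFC", text)
    return text.replace(TATWEEL, "")


# The elongated Arabic word is reduced to its plain spelling; English is unchanged.
print(normalize_basic("مرحـــبا the team"))
```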

### Step 2 — Audio synthesis with Lahgtna-OmniVoice v2

```bash
python scripts/generate_lahgetna_data.py
# → data/synthetic_data/wavs/<spk_id>/*.wav  +  metadata.csv
```

| Property | Value |
|----------|-------|
| Model | [`oddadmix/lahgtna-omnivoice-v2`](https://huggingface.co/oddadmix/lahgtna-omnivoice-v2) |
| Language | `apc` — North Levantine Arabic (ISO 639-3) |
| Speakers | 10 (5M + 5F), 5,000 utterances each |
| Generation params | temperature=0.7, top_p=0.7, repetition_penalty=1.2 |

### Step 3 — Data preparation

```bash
python scripts/prepare_dataset.py --skip_download
```

Final training data:

| Source | Language | Utterances | Est. Hours |
|--------|----------|------------|------------|
| **Lahgtna synthetic** (primary) | Levantine AR + EN CS | 50,000 | ~70 h |
| **LibriSpeech clean-100** | English | 5,888 | ~20 h |
| **Total** | | **55,888** | **~90 h** |

### Step 4 — Fine-tuning XTTS-v2

```bash
CUDA_VISIBLE_DEVICES=0 python scripts/train.py
# Monitor:
tensorboard --logdir checkpoints/tensorboard --port 6006
```

**Training config** (`configs/finetune_xtts.yaml`):

| Parameter | Value |
|-----------|-------|
| Model | XTTS-v2 GPT backbone |
| Optimizer | AdamW, lr=5e-6 |
| Batch size | 4 (grad_accum=8 → effective 32) |
| Epochs | 30 |
| Checkpoint | Every 2,000 steps |

---

## 👥 Speakers

| # | ID | Name | Gender |
|---|-----|------|--------|
| 1 | spk_01_male | **Badr** | Male |
| 2 | spk_02_male | **Mohamed** | Male |
| 3 | spk_03_male | **Saad** | Male |
| 4 | spk_04_male | **Rami** | Male |
| 5 | spk_05_male | **Fadi** | Male |
| 6 | spk_06_female | **Amina** | Female |
| 7 | spk_07_female | **Fatma** | Female |
| 8 | spk_08_female | **Lamyaa** | Female |
| 9 | spk_09_female | **Mona** | Female |
| 10 | spk_10_female | **Haneen** | Female |

---

## 🏗️ Architecture

### Why XTTS-v2?

| Requirement | XTTS-v2 | F5-TTS | Kokoro |
|-------------|---------|--------|--------|
| Native Arabic | ✅ | ❌ | ❌ |
| Code-switching | ✅ | ✅ | ❌ |
| Native streaming | ✅ | ❌ | partial |
| RTF < 0.3 | ✅ | ❌ (real ~3.0) | ✅ |
| VRAM ≤ 3 GB | ✅ | ❌ | ✅ |


---

## 📁 Project Structure

```
leva-tts/
├── leva_tts/
│   ├── text/
│   │   ├── processor.py       ← TextProcessor (normalization + lexicon)
│   │   └── lexicon.py         ← CSV loader
│   ├── inference/
│   │   └── engine.py          ← LevaTTSEngine (streaming, DeepSpeed)
│   ├── server/
│   │   └── app.py             ← FastAPI (POST /synthesize, WS /stream)
│   ├── pipecat_plugin/
│   │   └── leva_tts_service.py ← Pipecat TTSService
│   └── training/
│       └── finetune.py        ← XTTS-v2 GPT fine-tune
├── scripts/
│   ├── train.py               ← Fine-tuning
│   ├── inference.py           ← CLI synthesis (rich UI)
│   └── evaluate.py            ← Full evaluation suite
├── configs/
│   └── finetune_xtts.yaml
├── data/
│   └── levantine_lexicon.csv  ← 148 Levantine dialect overrides
├── reference_audios/
│   ├── references.json        ← 10 speaker reference configs
│   └── *.wav / *.mp3          ← Reference recordings
└── app.py                     ← Gradio demo
```

---

## 📄 License

- **Code** (this repository and the `leva-tts` package): **Apache-2.0** — see [`LICENSE`](LICENSE).
- **Model weights** (`mohammedaly22/leva-tts` on HuggingFace): the fine-tuned
  XTTS-v2 weights inherit **Coqui's non-commercial license (CPML)** from the
  [base model](https://huggingface.co/coqui/XTTS-v2) and are for **research /
  non-commercial** use.

## 📜 Citation

```bibtex
@misc{leva-tts-2026,
  title   = {Leva-TTS: Levantine Arabic / English Code-Switching TTS},
  author  = {Mohammed Aly},
  year    = {2026},
  url     = {https://huggingface.co/mohammedaly22},
}
```
