Metadata-Version: 2.4
Name: qwencleo-asr
Version: 0.1.0
Summary: QwenCleo-ASR — Egyptian Arabic & code-switching speech recognition, built on Qwen3-ASR.
Author: Mohammed Aly
License: Apache-2.0
Project-URL: Homepage, https://huggingface.co/mohammedaly22/QwenCleo-ASR
Project-URL: Model Card, https://huggingface.co/mohammedaly22/QwenCleo-ASR
Keywords: asr,speech-recognition,arabic,egyptian,code-switching,qwen
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: qwen-asr>=0.0.6
Requires-Dist: torch>=2.5
Requires-Dist: torchaudio>=2.5
Requires-Dist: numpy
Requires-Dist: soundfile
Requires-Dist: huggingface_hub
Provides-Extra: server
Requires-Dist: fastapi; extra == "server"
Requires-Dist: uvicorn[standard]; extra == "server"
Requires-Dist: python-multipart; extra == "server"
Provides-Extra: app
Requires-Dist: gradio>=4.0; extra == "app"
Provides-Extra: vllm
Requires-Dist: vllm>=0.6.0; extra == "vllm"
Requires-Dist: openai; extra == "vllm"
Requires-Dist: httpx; extra == "vllm"
Provides-Extra: vllm-client
Requires-Dist: openai; extra == "vllm-client"
Requires-Dist: httpx; extra == "vllm-client"
Provides-Extra: all
Requires-Dist: fastapi; extra == "all"
Requires-Dist: uvicorn[standard]; extra == "all"
Requires-Dist: python-multipart; extra == "all"
Requires-Dist: gradio>=4.0; extra == "all"
Requires-Dist: openai; extra == "all"
Requires-Dist: httpx; extra == "all"
Dynamic: license-file

<div align="center">

# 🎙️ QwenCleo-ASR

### The best open-source model for Egyptian Arabic & code-switching speech recognition

*Built on [Qwen3-ASR-1.7B](https://huggingface.co/Qwen/Qwen3-ASR-1.7B), fine-tuned for Egyptian dialect and Arabic ↔ English code-switching.*

[![🤗 Model](https://img.shields.io/badge/🤗%20Model-mohammedaly22/QwenCleo--ASR-D4AF37)](https://huggingface.co/mohammedaly22/QwenCleo-ASR)
[![PyPI](https://img.shields.io/badge/PyPI-qwencleo--asr-3775A9?logo=pypi&logoColor=white)](https://pypi.org/project/qwencleo-asr/)
[![Base model](https://img.shields.io/badge/Base-Qwen3--ASR--1.7B-615CED)](https://huggingface.co/Qwen/Qwen3-ASR-1.7B)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue)](LICENSE)

</div>


<img src="assets/QwenCleo-ASR-Banner.png" alt="QwenCleo-ASR" width="100%"/>


---

> **QwenCleo** — the name carries three meanings: **Qwen**, the powerful base model it is built on; **Queen**, signalling a model that reigns over its domain; and **Cleo**, for **Cleopatra**, the queen of Egypt — because this model is tailored for **Egyptian** Arabic. 👑🏺

**QwenCleo-ASR is, to our knowledge, the best open-source ASR model for Egyptian Arabic and Arabic/English code-switching.** It cuts the error rate of the strong Qwen3-ASR base **roughly in half**, and correctly keeps embedded English tech/loan words in Latin script (`engineering`, `download`, `React`, `at least`) instead of mangling them into broken Arabic.

- 🎯 **Egyptian dialect** — tuned on hundreds of hours of Egyptian podcast speech.
- 🔀 **Code-switching** — keeps English terms in Latin script, Arabic in Arabic script.
- 🥇 **State-of-the-art (open)** — beats Qwen3-ASR base, NVIDIA Nemotron, Cohere, and every Whisper variant on our Egyptian + CS test set.
- 📦 **`pip install qwencleo-asr`** — inference & chunked long-audio transcription in three lines.
- ⚡ **Real streaming** — token-by-token via vLLM (`asr.stream(...)`), plus a mic web demo.
- 🚀 **Serving** — FastAPI server, Gradio demo, OpenAI-compatible vLLM endpoint.

---

## 📊 Results

WER / CER (%) on an Egyptian-Arabic + code-switching test set (3,699 utterances); the **AR** and **CS** columns report the Egyptian Arabic and code-switching subsets.
**Lower is better.** All models scored with the same Egyptian-aware normalization.

![Benchmark overview](assets/QwenCleo-ASR-Benchmark.png)

<div align="center">

| Model | Params | WER all | CER all | WER · AR | CER · AR | WER · CS | CER · CS |
|---|---:|---:|---:|---:|---:|---:|---:|
| **🏆 QwenCleo-ASR** | 1.7B | **19.85** | **10.64** | **19.08** | **10.43** | **20.29** | **10.92** |
| NVIDIA Nemotron-3.5 | 0.6B | 38.88 | 20.58 | 37.14 | 17.40 | 42.15 | 26.30 |
| Qwen3-ASR-1.7B (base) | 1.7B | 41.51 | 20.86 | 40.59 | 18.52 | 43.20 | 25.04 |
| Whisper Large-v3 Turbo (FT) | 0.81B | 50.83 | 22.86 | 48.37 | 18.42 | 55.08 | 37.84 |
| Cohere Transcribe 03-2026 | 2.0B | 53.78 | 39.63 | 48.57 | 34.12 | 63.76 | 49.66 |
| Whisper Large-v3 | 1.54B | 63.94 | 39.76 | 49.25 | 22.76 | 59.32 | 31.52 |
| Whisper Large-v2 | 1.54B | 72.34 | 48.73 | 60.75 | 33.21 | 66.85 | 40.75 |
| Whisper Large-v3 Turbo | 0.81B | 73.83 | 46.86 | 59.37 | 29.42 | 66.08 | 37.84 |
| Whisper Medium | 0.76B | 80.46 | 53.19 | 74.77 | 41.76 | 74.15 | 44.90 |
| Whisper Small | 0.24B | 89.99 | 60.34 | 77.42 | 42.53 | 87.09 | 55.22 |
| Whisper Tiny | 0.04B | 124.68 | 89.42 | 116.02 | 77.74 | 110.67 | 74.57 |

</div>
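
For readers who want to sanity-check the numbers, here is a minimal sketch of how WER and CER are conventionally computed: word-level or character-level edit distance divided by the reference length. It is not the evaluation script behind the table. The Egyptian-aware normalization is not reproduced, dropping spaces before computing CER is one common convention rather than a confirmed detail, and the names `edit_distance`, `wer`, and `cer` are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent; assumes a non-empty reference."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent, ignoring spaces; assumes a non-empty reference."""
    ref_chars = list(reference.replace(" ", ""))
    hyp_chars = list(hypothesis.replace(" ", ""))
    return 100.0 * edit_distance(ref_chars, hyp_chars) / len(ref_chars)


print(wer("a b c d", "a x c d"))   # one substitution against four reference words
print(wer("a b", "a b c d e"))     # three insertions against two reference words
```

Because insertions count as errors, the ratio can exceed 100% when a model emits many more words than the reference contains, which is how a row such as Whisper Tiny can land above 100.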


## 🗣️ Sample outputs

Real transcriptions from the test set. **Ground truth** first; each model's output below it.
Notice how QwenCleo keeps English terms in Latin script and Egyptian dialect intact, while
the other models transliterate English into broken Arabic or drop words entirely.

### 🔀 Code-switching

> **✅ Ground truth**
> طب وانتوا يعني ك`engineering` المفروض ان بيكون مثلا ال`staff engineer` بيقعد مع ال`engineering managers`

| Model | Output |
|---|---|
| 🏆 **QwenCleo** | طب وانتوا يعني ك`engineering` المفروض ان بيكون مثلا ال`staff engineer` بيقعد مع ال`engineering managers` ✅ |
| Qwen3-ASR (base) | وأنتوا يعني كإنجنييرينج المفروض إنه بيكون مثلاً الأستاف إنجنيير بيعد مع ال engineering managers ❌ |
| Cohere Transcribe | وانتم كانجينيري المفروض ان يكون مثلا الاستفاده من الهدف ❌ *(truncated)* |
| Nemotron 3.5 ASR | وانتو يعني كإنجينير المفروض ان بيكون مثلاً الإستف إنجنير بيقعد مع الإنجنير مانيجرز ❌ |

> **✅ Ground truth**
> يعني شوية حاجات كده `across` كل ال`domains` او `at least` يعني مع 4 5 `squads` فالموضوع صعب

| Model | Output |
|---|---|
| 🏆 **QwenCleo** | يعني شوية حاجات كده `across` كل ال`domains` او `at least` يعني مع 4 5 `squads` فالموضوع صعب ✅ |
| Qwen3-ASR (base) | يعني شوية حاجات كده أكرس كل الدومينز وأتلست يعني مع أربعة خمسة سكوات ❌ |
| Cohere Transcribe | يعني شويه حاجات كده اكروس كل الدومينز او اتليست مع اربع خمسه سكوات ❌ |
| Nemotron 3.5 ASR | يعني آه شوائد حاجات كده أكرس كل الدومين أو أتليست يعني مع أربع خمسة سكوات ❌ |

> **✅ Ground truth**
> يقعد معاك حد مثلا من اللي هما ال `C level` او مثلا `engineer manager` او كده حسب ال`position` بتاعه

| Model | Output |
|---|---|
| 🏆 **QwenCleo** | يقعد معاك حد مثلا من هما ال `C level` او مثلا `engineer manager` او كده حسب ال`position` بتاعك ✅ |
| Qwen3-ASR (base) | بيعرض معك حد مثلاً من هم C level أو مثلاً إنجنير مانAGER أو كذا حسب البوسيشن ❌ |
| Cohere Transcribe | يقعد معاك حد مثلا اللي هم السي لافل او انجنير مانجر او كده حسب الموضوع ❌ |
| Nemotron 3.5 ASR | بيقعد معاك حد مثلاً إن هم السي لفل أو مثلاً إنجينير مانجر أو كده حسب البوزيشن ❌ |

### 🇪🇬 Egyptian Arabic

> **✅ Ground truth**
> طب دي كانت مثلا تاخد 84% 88%

| Model | Output |
|---|---|
| 🏆 **QwenCleo** | طب دي كانت مثلا تاخد 84% 88% ✅ |
| Qwen3-ASR (base) | طب دي كانت مثلاً تأخذ أربعة وثمانين في المية، ثمانية وثمانين في المية ❌ |
| Cohere Transcribe | طيب دي كانت مثلا تاخد اربعه وثمانين في الميه ثمانيه وثمانين في الميه ❌ |
| Nemotron 3.5 ASR | طيب دي كانت مثلاً تاخد أربعة وثمانين في المئة ثمانية وثمانين في المئة ❌ |

> **✅ Ground truth**
> خد ال 4 في 4 او 4 ونص طلع دور 9

| Model | Output |
|---|---|
| 🏆 **QwenCleo** | خد ال 4 في 4 او 4 ونص طلع دور 9 ✅ |
| Qwen3-ASR (base) | خادل الأربعة فاربعة واربعة ونص تطلع دور تاسع ❌ |
| Cohere Transcribe | خد الاربعه فاربعه واربعه ونص طلع دور تسعه ❌ |
| Nemotron 3.5 ASR | خد الأربعة في أربعة وأربعة ونص طلع دور تسعة ❌ |

---

## 📦 Installation

> **Install the right torch first.** A plain `pip install` pulls the newest torch
> (built for the latest CUDA), which fails on older drivers with *"NVIDIA driver
> too old"*. Install a torch build matching **your** driver **before** the package,
> then add QwenCleo with `--no-deps` so torch is never reinstalled.
>
> Pick the wheel index for the CUDA version your driver supports: `cu121` (CUDA ≥ 12.1, e.g. 12.2),
> `cu118` (CUDA ≥ 11.8), or `cpu`. Check yours with `nvidia-smi`, which reports the supported CUDA version in its header.
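
The choice of wheel index can be written down as a small helper. This is only a sketch: `pick_torch_index` is a hypothetical function, not part of `qwencleo-asr`, and it assumes you pass in the CUDA version that `nvidia-smi` reports in `major.minor` form (for example `12.2`), or `None` on a machine without an NVIDIA GPU.

```python
from typing import Optional

PYTORCH_WHEELS = "https://download.pytorch.org/whl"


def pick_torch_index(cuda_version: Optional[str]) -> str:
    """Return the PyTorch wheel index URL for the CUDA version a driver supports."""
    if cuda_version is None:
        return f"{PYTORCH_WHEELS}/cpu"
    major, minor = (int(part) for part in cuda_version.split(".")[:2])
    if (major, minor) >= (12, 1):
        return f"{PYTORCH_WHEELS}/cu121"
    if (major, minor) >= (11, 8):
        return f"{PYTORCH_WHEELS}/cu118"
    # Driver too old for either CUDA build listed here: fall back to CPU wheels.
    return f"{PYTORCH_WHEELS}/cpu"


print(pick_torch_index("12.2"))
```

The returned URL is what goes after `--index-url` when installing torch.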

### For inference & chunked transcription (PyPI)

```bash
conda create -n qwencleo-asr python=3.12 -y
conda activate qwencleo-asr

# 1) torch matching your driver (cu121 shown — change the index for yours)
pip install torch==2.5.1 torchaudio==2.5.1 \
  --index-url https://download.pytorch.org/whl/cu121

# 2) QwenCleo without touching torch, then its remaining deps
pip install qwencleo-asr --no-deps
pip install "qwen-asr>=0.0.6" numpy soundfile huggingface_hub
```

That's all you need for the Python API and the `qwencleo` CLI.


### For serving / Gradio / vLLM (clone the repo)

```bash
conda create -n qwencleo-asr python=3.12 -y
conda activate qwencleo-asr

# 1) torch matching your driver, first
pip install torch==2.5.1 torchaudio==2.5.1 \
  --index-url https://download.pytorch.org/whl/cu121

# 2) the repo (without re-resolving torch) + serving deps
git clone https://github.com/MohammedAly22/qwencleo-asr.git
cd qwencleo-asr
pip install -e . --no-deps
pip install "qwen-asr>=0.0.6" numpy soundfile huggingface_hub
pip install -r requirements-serving.txt
```

Verify torch sees the GPU before running:

```bash
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
# -> 2.5.1+cu121 True
```

---

## 🚀 Usage

### Python — basic transcription

```python
from qwencleo_asr import QwenCleoASR

asr = QwenCleoASR()                       # loads mohammedaly22/QwenCleo-ASR
result = asr.transcribe("clip.wav")       # language defaults to "Arabic"
print(result.text)
```

Batch, auto-detect language, and Egyptian normalization:

```python
results = asr.transcribe(["a.wav", "b.wav"], language=None)   # auto-detect
clean   = asr.transcribe("clip.wav", normalize=True)          # normalized text
```

### Python — chunked transcription of long audio / mic

```python
from qwencleo_asr import QwenCleoASR, stream_file

asr = QwenCleoASR()
for chunk in stream_file(asr, "long_podcast.wav", chunk_s=20, overlap_s=2):
    print(f"[{chunk.start:.0f}-{chunk.end:.0f}s] {chunk.text}")
```

> ℹ️ **This is chunked transcription, not true streaming.** It splits long/live
> audio into overlapping windows and transcribes each — convenient for captioning
> without a server, but latency is per-window. For genuine **token-by-token
> streaming**, use the vLLM path below.
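
To make the windowing concrete, here is a minimal sketch of how overlapping windows could be laid out for a given `chunk_s` and `overlap_s`. It is an illustration, not the package's implementation: `window_bounds` is a hypothetical helper, and the assumption that each window starts `chunk_s - overlap_s` seconds after the previous one is ours.

```python
def window_bounds(duration_s: float, chunk_s: float = 20.0, overlap_s: float = 2.0):
    """Yield (start, end) times in seconds of overlapping windows covering the audio."""
    if not 0 <= overlap_s < chunk_s:
        raise ValueError("overlap_s must be non-negative and smaller than chunk_s")
    step = chunk_s - overlap_s          # how far each new window advances
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        yield (start, end)
        if end >= duration_s:           # last window reached the end of the audio
            break
        start += step


for start, end in window_bounds(50):
    print(f"{start:.0f}-{end:.0f}s")
```

With the defaults, a 50-second file is covered by three windows, each sharing its first 2 seconds with the previous one. Merging duplicated text in the overlap is a separate step that this sketch does not cover.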

### Python — true streaming (vLLM)

QwenCleo inherits Qwen3-ASR's **real token-by-token streaming** via vLLM. Start a
server (see [`server/vllm_serve.md`](server/vllm_serve.md)):

```bash
pip install "qwencleo-asr[vllm]"          # vLLM nightly recommended — see docs
vllm serve mohammedaly22/QwenCleo-ASR
```

Then stream straight off the model object — deltas arrive as they're generated:

```python
from qwencleo_asr import QwenCleoASR

asr = QwenCleoASR()
for delta in asr.stream("clip.wav"):       # talks to the vLLM server
    print(delta, end="", flush=True)
```

Or use the helpers directly:

```python
from qwencleo_asr import stream_vllm, transcribe_vllm, VLLMOffline

for delta in stream_vllm("clip.wav", language="Arabic"):
    print(delta, end="", flush=True)

print(transcribe_vllm("clip.wav"))         # one-shot via the server
print(VLLMOffline().transcribe("clip.wav"))  # in-process, no server
```
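
If you need the final transcript as well as the live output, the deltas can be accumulated as they arrive. The sketch below assumes, as the `print(delta, end="")` calls suggest, that each delta is a plain string fragment. `collect_stream` is a hypothetical helper, and the list of fragments stands in for a real `asr.stream(...)` call so the example runs without a server.

```python
from typing import Iterable


def collect_stream(deltas: Iterable[str], echo: bool = True) -> str:
    """Print each delta as it arrives and return the concatenated transcript."""
    pieces = []
    for delta in deltas:
        if echo:
            print(delta, end="", flush=True)
        pieces.append(delta)
    if echo:
        print()  # finish the line once the stream ends
    return "".join(pieces)


# Stand-in for asr.stream("clip.wav"); swap in the real generator when a server is running.
fake_deltas = ["hello", " ", "world"]
text = collect_stream(fake_deltas)
```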

### CLI

```bash
qwencleo transcribe clip.wav
qwencleo transcribe a.wav b.wav --language None --normalize
qwencleo stream long_podcast.wav --chunk-s 20 --overlap-s 2
```

---

## 🌐 Serving

### FastAPI server

```bash
QWENCLEO_MODEL=mohammedaly22/QwenCleo-ASR \
uvicorn server.app:app --host 0.0.0.0 --port 8000

curl -X POST http://localhost:8000/v1/transcribe -F file=@clip.wav -F language=Arabic
```
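
The same request can be sent from Python. The sketch below uses only the standard library and mirrors the `curl` call: a multipart form with a `file` part and a `language` field posted to `/v1/transcribe`. Both helper names are hypothetical, and the response is returned as raw text because the server's response schema is not documented here.

```python
import urllib.request
import uuid
from pathlib import Path


def build_transcribe_request(url: str, audio_path: str, language: str = "Arabic"):
    """Build a multipart/form-data POST equivalent to the curl example."""
    audio = Path(audio_path)
    boundary = uuid.uuid4().hex
    body = b"".join([
        # Plain form field: the language hint
        f"--{boundary}\r\n".encode(),
        b'Content-Disposition: form-data; name="language"\r\n\r\n',
        language.encode("utf-8") + b"\r\n",
        # File part: raw audio bytes sent under the original filename
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{audio.name}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        audio.read_bytes() + b"\r\n",
        f"--{boundary}--\r\n".encode(),
    ])
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def transcribe_via_http(audio_path: str,
                        url: str = "http://localhost:8000/v1/transcribe") -> str:
    """Send the request and return the raw response body as text."""
    request = build_transcribe_request(url, audio_path)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With the server running, `print(transcribe_via_http("clip.wav"))` shows whatever the endpoint returns.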

### Gradio demo

```bash
python app/gradio_app.py        # http://localhost:7860  (mic + file upload)
```

### vLLM — serving, streaming & OpenAI-compatible API

Full guide in **[`server/vllm_serve.md`](server/vllm_serve.md)**. In short:

```bash
pip install "qwencleo-asr[vllm]"           # vLLM nightly recommended (see docs)
vllm serve mohammedaly22/QwenCleo-ASR
```

OpenAI-compatible transcription:

```python
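from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
# Pass the open file handle so the upload keeps its filename, and close it afterwards.
with open("clip.wav", "rb") as audio:
    result = client.audio.transcriptions.create(
        model="mohammedaly22/QwenCleo-ASR", file=audio)
print(result.text)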
```

### Streaming mic web demo

Live browser-mic transcription via the upstream Flask demo:

```bash
qwen-asr-demo-streaming \
  --asr-model-path mohammedaly22/QwenCleo-ASR \
  --host 0.0.0.0 --port 8000 --gpu-memory-utilization 0.9
# open http://<your-ip>:8000
```

---

## 🔗 Links

- **🤗 Model card:** [`mohammedaly22/QwenCleo-ASR`](https://huggingface.co/mohammedaly22/QwenCleo-ASR)
- **📦 PyPI:** [`qwencleo-asr`](https://pypi.org/project/qwencleo-asr/)
- **🧱 Base model:** [`Qwen/Qwen3-ASR-1.7B`](https://huggingface.co/Qwen/Qwen3-ASR-1.7B) · [Qwen3-ASR repo](https://github.com/QwenLM/Qwen3-ASR)
- **Languages:** Egyptian Arabic, Modern Standard Arabic, Arabic↔English code-switching
- **Recommended `language` hint:** `"Arabic"` (or `None` to auto-detect)

---

## 📜 License & citation

Apache-2.0, inheriting the Qwen3-ASR license terms.

```bibtex
@misc{qwencleo_asr_2026,
  title  = {QwenCleo-ASR: The Best Open-Source Egyptian Arabic and Code-Switching Speech Recognition Model},
  author = {Mohammed Aly},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/mohammedaly22/QwenCleo-ASR}},
  note   = {Fine-tuned from Qwen3-ASR-1.7B}
}
```
