Metadata-Version: 2.4
Name: sonicscribe-asr
Version: 0.1.0
Summary: Unified ASR inference framework — multi-backend, optimized for consumer GPUs
Project-URL: Homepage, https://github.com/SeaL773/SonicScribe
Project-URL: Repository, https://github.com/SeaL773/SonicScribe
Author-email: SeaL <seal@sonicscribe.dev>
License-Expression: MIT
License-File: LICENSE
Keywords: asr,moonshine,speech-recognition,transcription,whisper
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Multimedia :: Sound/Audio :: Speech
Requires-Python: >=3.10
Requires-Dist: numpy>=1.24
Requires-Dist: soundfile>=0.12
Provides-Extra: all
Requires-Dist: accelerate<2,>=1.0; extra == 'all'
Requires-Dist: ctranslate2<5,>=4.0; extra == 'all'
Requires-Dist: faster-whisper<2,>=1.0; extra == 'all'
Requires-Dist: huggingface-hub>=0.30; extra == 'all'
Requires-Dist: onnx-asr[hub]<1,>=0.11; extra == 'all'
Requires-Dist: onnxruntime<2,>=1.17; extra == 'all'
Requires-Dist: punctuators>=0.1; extra == 'all'
Requires-Dist: pyannote-audio<5,>=4.0; extra == 'all'
Requires-Dist: qwen-asr>=0.0.6; extra == 'all'
Requires-Dist: tokenizers<1,>=0.19; extra == 'all'
Requires-Dist: torch<3,>=2.4; extra == 'all'
Requires-Dist: torchaudio<3,>=2.4; extra == 'all'
Requires-Dist: transformers<5,>=4.50; extra == 'all'
Requires-Dist: useful-moonshine-onnx>=2024.10; extra == 'all'
Provides-Extra: dev
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Provides-Extra: diarization
Requires-Dist: pyannote-audio<5,>=4.0; extra == 'diarization'
Requires-Dist: torch<3,>=2.4; extra == 'diarization'
Requires-Dist: torchaudio<3,>=2.4; extra == 'diarization'
Provides-Extra: firered-vad
Requires-Dist: fireredvad>=0.1; extra == 'firered-vad'
Requires-Dist: huggingface-hub>=0.30; extra == 'firered-vad'
Provides-Extra: hf-whisper
Requires-Dist: accelerate<2,>=1.0; extra == 'hf-whisper'
Requires-Dist: torch<3,>=2.4; extra == 'hf-whisper'
Requires-Dist: transformers<5,>=4.50; extra == 'hf-whisper'
Provides-Extra: moonshine
Requires-Dist: onnxruntime<2,>=1.17; extra == 'moonshine'
Requires-Dist: tokenizers<1,>=0.19; extra == 'moonshine'
Requires-Dist: useful-moonshine-onnx>=2024.10; extra == 'moonshine'
Provides-Extra: parakeet
Requires-Dist: huggingface-hub>=0.30; extra == 'parakeet'
Requires-Dist: onnx-asr[hub]<1,>=0.11; extra == 'parakeet'
Provides-Extra: parakeet-gpu
Requires-Dist: huggingface-hub>=0.30; extra == 'parakeet-gpu'
Requires-Dist: nvidia-cublas-cu12; extra == 'parakeet-gpu'
Requires-Dist: nvidia-cuda-runtime-cu12; extra == 'parakeet-gpu'
Requires-Dist: nvidia-cudnn-cu12; extra == 'parakeet-gpu'
Requires-Dist: nvidia-cufft-cu12; extra == 'parakeet-gpu'
Requires-Dist: nvidia-curand-cu12; extra == 'parakeet-gpu'
Requires-Dist: onnx-asr[hub]>=0.11; extra == 'parakeet-gpu'
Requires-Dist: onnxruntime-gpu>=1.20; extra == 'parakeet-gpu'
Provides-Extra: punctuation
Requires-Dist: onnxruntime<2,>=1.17; extra == 'punctuation'
Requires-Dist: punctuators>=0.1; extra == 'punctuation'
Provides-Extra: qwen
Requires-Dist: qwen-asr>=0.0.6; extra == 'qwen'
Requires-Dist: torch<3,>=2.4; extra == 'qwen'
Requires-Dist: transformers<5,>=4.50; extra == 'qwen'
Provides-Extra: qwen-vllm
Requires-Dist: qwen-asr>=0.0.6; extra == 'qwen-vllm'
Requires-Dist: torch<3,>=2.9; extra == 'qwen-vllm'
Requires-Dist: vllm==0.14.0; extra == 'qwen-vllm'
Provides-Extra: ten-vad
Requires-Dist: ten-vad>=0.1; extra == 'ten-vad'
Provides-Extra: whisper
Requires-Dist: ctranslate2<5,>=4.0; extra == 'whisper'
Requires-Dist: faster-whisper<2,>=1.0; extra == 'whisper'
Description-Content-Type: text/markdown

# SonicScribe

[![tests](https://github.com/SeaL773/SonicScribe/actions/workflows/test.yml/badge.svg)](https://github.com/SeaL773/SonicScribe/actions/workflows/test.yml)
[![lint](https://github.com/SeaL773/SonicScribe/actions/workflows/lint.yml/badge.svg)](https://github.com/SeaL773/SonicScribe/actions/workflows/lint.yml)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)

**SonicScribe** is a unified speech-to-text framework that picks the best ASR model for your hardware and gives you one consistent API. It wraps five production backends — from lightweight CPU models to the fastest GPU engines — so you don't have to choose between speed, accuracy, and ease of use.

On a single consumer GPU (RTX 4070 Ti), SonicScribe transcribes audio **up to 53x faster than real-time** with **0.67% WER** on a 30-utterance LibriSpeech slice, the best accuracy and speed among the tools compared in the benchmark below. Results vary by domain; see the multi-dataset evaluation.

## Benchmark

### Comparison against popular ASR tools

All numbers measured on the same 30-utterance LibriSpeech val.clean slice (131.8s of real English speech), single RTX 4070 Ti GPU, 16-bit precision where applicable.

| Tool / model | WER | Speed | Notes |
|---|---|---|---|
| **SonicScribe** (Parakeet TDT v3, CUDA) | **0.67%** | **53.4x RT** | best WER + fastest |
| **SonicScribe** (Parakeet TDT v3, CPU) | **0.67%** | 19.9x RT | no GPU needed |
| **SonicScribe** (Moonshine/base, CPU) | 1.34% | 20.9x RT | MIT, 61M params |
| **SonicScribe** (Qwen3-ASR-0.6B, CUDA) | 1.57% | 14.1x RT | 30+ languages |
| faster-whisper large-v3 (fp16 CUDA) | 1.79% | 11.4x RT | via SonicScribe whisper backend |
| HF Whisper large-v2 (fp16 CUDA) | 2.01% | 13.6x RT | via SonicScribe hf_whisper backend |
| openai/whisper large-v2 (fp16 CUDA)\* | ~3% | ~4x RT | reference from upstream benchmarks |

\* openai/whisper numbers are approximate from published benchmarks on comparable hardware; not measured on our test slice.
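
For readers new to the two metrics, the sketch below shows how they are conventionally computed: WER is the word-level edit distance divided by the number of reference words, and the "x RT" speed is audio duration divided by wall-clock time. It is a plain-Python illustration, not the project's benchmark code (the multi-dataset section mentions `jiwer` for that). The function names and sample inputs are made up for the example, and no text normalization is applied.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or free match)
            ))
        prev = curr
    return prev[-1] / len(ref)


def speed_x_realtime(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio transcribed per second of compute (higher is faster)."""
    return audio_seconds / wall_seconds


# Illustrative inputs only: one dropped word out of six, 60 s of audio in 3 s.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
speed = speed_x_realtime(audio_seconds=60.0, wall_seconds=3.0)
print(f"WER {wer:.2%}, {speed:.1f}x RT")  # WER 16.67%, 20.0x RT
```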

### Batched transcription (native)

| Backend | Batch N=8 | Batch N=32 | Speedup over sequential |
|---|---|---|---|
| Qwen3-ASR-0.6B | 0.65s | 2.55s | **4.4x** |
| HF Whisper-small | 0.53s | 1.86s | **2.3x** |
| Parakeet TDT v3 (CPU) | via onnx-asr list input | — | native batch |

### CPU-only (no GPU required)

| Tool | WER | Speed | Model size |
|---|---|---|---|
| **SonicScribe** (Moonshine/tiny) | 3.13% | **39.3x RT** | 27M |
| **SonicScribe** (Moonshine/base) | **1.34%** | 20.9x RT | 61M |
| **SonicScribe** (Parakeet, CPU EP) | **0.67%** | 19.9x RT | 600M |
| faster-whisper large-v3 (int8) | 1.57% | 1.2x RT | 1.5B |

Moonshine/tiny on CPU is **33x faster** than faster-whisper int8 on CPU (39.3x vs 1.2x RT), at about twice the WER (3.13% vs 1.57%).

### Multi-dataset evaluation (8 benchmarks × 5 backends)

The numbers above use a single 30-utterance LibriSpeech slice. The table below extends evaluation to **8 diverse benchmarks**: clean English, three multilingual splits (French / Spanish / German), accented business English, long-form (15-min) talks, dialectal English, and noisy multi-speaker meetings. Short-form datasets use **200 utterances**; TED-LIUM long-form uses **20 utterances (~5h audio)** and CORAAL uses **10 utterances (~7h)**. All cells were measured on the same RTX 4070 Ti with the same `jiwer` transform.

**WER (%):**

| Dataset (domain) | parakeet | qwen | whisper-CT2 | hf-whisper | moonshine | best |
|---|---|---|---|---|---|---|
| LibriSpeech clean (en, audiobook) | 3.80% | **2.57%** | 6.32% | 3.06% | 3.52% | qwen |
| VoxPopuli FR (fr, parliament) | **11.96%** | 14.35% | 12.74% | 14.98% | N/A | parakeet |
| VoxPopuli ES (es, parliament) | **8.11%** | 10.08% | 8.38% | 9.15% | N/A | parakeet |
| MLS DE (de, audiobook) | 9.93% | 6.73% | 6.36% | **3.98%** | N/A | hf-whisper |
| Earnings-22 (en, accented finance) | 20.48% | 16.92% | **15.35%** | 17.50% | 31.78% | whisper-CT2 |
| AMI IHM (en, noisy meetings) | 31.86% | **16.73%** | 22.09% | 23.21% | 59.87% | qwen |
| TED-LIUM long-form (en, 15-min talks) | 8.63% | 93.07%\* | **7.31%** | 25.52% | N/A | whisper-CT2 |
| CORAAL Atlanta (en, dialect, 42-min interviews) | **22.94%** | ERR\* | 24.76% | 48.50% | N/A | parakeet |

\* **Correction (round 2)**: My initial reading was that Qwen had a lower long-form ceiling than the other backends. Follow-up debugging on real continuous TED audio showed Qwen works well at 5 min (RTFx 10.1×) and 10 min (RTFx 10.2×). The 93% WER and the CORAAL hang (ERR in the table) are caused by two separate issues:

1. `max_new_tokens=256` was silently truncating dense long transcripts (raised to 4096 in `qwen/backend.py`).
2. TED utt 1 is **20.8 min**, just past qwen-asr's `MAX_ASR_INPUT_SECONDS=1200`, so the library auto-splits it into two chunks and per-chunk decoder cost grows non-linearly.

**A manual sweep confirms the degradation isn't a chunk-boundary artifact**: WER rises monotonically from < 5% at 10 min to 33.88% at 15 min, 45.33% at 18 min, and 53.48% at 19 min, driven by attention/KV-cache numerical drift on long Qwen3-ASR contexts.

I tried the `qwen-asr` vLLM backend as the alleged production fix. After upgrading torch 2.4 → 2.9.1, installing vllm 0.14.0, and validating 5-min audio (vLLM RTFx 16.5× vs transformers 10.1×), **vLLM turned out to be slower on long audio**: 10-min audio takes 246s on vLLM (RTFx 2.4×) vs 58s on transformers (RTFx 10.2×) — a **4.2× regression**. Root cause: Qwen3-ASR is prefill-bound (13K audio tokens fed in one shot), but vLLM is engineered for decode-bound LLM serving (paged KV cache for autoregressive generation). The optimization target doesn't match. The vLLM backend wiring stays in the codebase (`inference_backend="vllm"`) for future hardware/version upgrades, but the recommended production path for >10-min audio remains **Parakeet or whisper-CT2**, not Qwen.
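
One way to act on that recommendation is a small guard that reroutes long recordings away from Qwen. The helper below is a hypothetical sketch, not part of SonicScribe: the function name, the constant, and the choice of Parakeet as the fallback are assumptions made for illustration, and the 10-minute threshold is simply the figure quoted above.

```python
# Threshold taken from the guidance above: Qwen is not recommended past ~10 min.
QWEN_MAX_RECOMMENDED_SECONDS = 10 * 60


def backend_for_duration(duration_seconds: float, preferred: str = "qwen") -> str:
    """Hypothetical helper: keep the preferred backend unless it is Qwen on long audio."""
    if preferred == "qwen" and duration_seconds > QWEN_MAX_RECOMMENDED_SECONDS:
        # Parakeet is one of the two recommended long-form options;
        # "whisper" (whisper-CT2) is the other.
        return "parakeet"
    return preferred


print(backend_for_duration(5 * 60))     # qwen
print(backend_for_duration(20.8 * 60))  # parakeet
```

The duration itself could be read with `soundfile.info(path).duration`, since `soundfile` is already a core dependency; the returned name would then be passed as `Engine(backend=...)`.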

**RTFx (inverse real-time factor: seconds of audio transcribed per second of compute; higher is faster):**

| Dataset | parakeet | qwen | whisper-CT2 | hf-whisper | moonshine |
|---|---|---|---|---|---|
| LibriSpeech clean | **62.3x** | 16.0x | 13.3x | 17.0x | 22.1x |
| VoxPopuli FR | **72.1x** | 14.1x | 15.3x | 19.2x | N/A |
| VoxPopuli ES | **75.9x** | 15.2x | 16.9x | 20.5x | N/A |
| MLS DE | **92.0x** | 17.3x | 19.5x | 23.6x | N/A |
| Earnings-22 | **56.3x** | 17.1x | 14.9x | 17.6x | 23.1x |
| AMI IHM | **39.4x** | 13.1x | 7.6x | 8.8x | 14.3x |
| TED-LIUM long-form | **97.6x** | 56.0x | 12.0x | 24.0x | N/A |
| CORAAL Atlanta | **102.3x** | ERR | 12.2x | 34.2x | N/A |

**Three takeaways the headline number doesn't show:**

1. **No single backend wins every category.** Parakeet leads on French, Spanish, and CORAAL (long-form dialectal English); qwen wins clean English and noisy meetings (16.73% WER on AMI is 5pp ahead of #2); HF Whisper takes German (3.98%); whisper-CT2 takes finance and TED long-form. Picking the right backend per domain matters.
2. **Long-form is genuinely hard.** WER roughly doubles or triples on long-form vs short-form for the same backend. The Qwen rows on TED-LIUM and CORAAL look catastrophic, but see the footnote: those numbers reflect a pre-fix configuration (`max_new_tokens=256`) plus a transformers-backend cost cliff at the 20-min chunk boundary. The footnote's manual sweep still shows Qwen WER climbing past 10 min, so Parakeet or whisper-CT2 remain the recommended backends for long recordings.
3. **Moonshine is English-only and short-form-only by design.** N/A cells are intentional — Moonshine is the right choice for ≤30s English on CPU.
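
The first takeaway can be checked mechanically. The sketch below copies the WER table into a dictionary (N/A cells and the Qwen ERR cell are left out) and picks the lowest-WER backend per dataset. The variable names are illustrative, and the TED-LIUM Qwen value is the pre-fix number from the table.

```python
from collections import Counter

# WER (%) copied from the table above; missing keys are N/A or ERR cells.
WER = {
    "LibriSpeech clean": {"parakeet": 3.80, "qwen": 2.57, "whisper-CT2": 6.32, "hf-whisper": 3.06, "moonshine": 3.52},
    "VoxPopuli FR": {"parakeet": 11.96, "qwen": 14.35, "whisper-CT2": 12.74, "hf-whisper": 14.98},
    "VoxPopuli ES": {"parakeet": 8.11, "qwen": 10.08, "whisper-CT2": 8.38, "hf-whisper": 9.15},
    "MLS DE": {"parakeet": 9.93, "qwen": 6.73, "whisper-CT2": 6.36, "hf-whisper": 3.98},
    "Earnings-22": {"parakeet": 20.48, "qwen": 16.92, "whisper-CT2": 15.35, "hf-whisper": 17.50, "moonshine": 31.78},
    "AMI IHM": {"parakeet": 31.86, "qwen": 16.73, "whisper-CT2": 22.09, "hf-whisper": 23.21, "moonshine": 59.87},
    "TED-LIUM long-form": {"parakeet": 8.63, "qwen": 93.07, "whisper-CT2": 7.31, "hf-whisper": 25.52},
    "CORAAL Atlanta": {"parakeet": 22.94, "whisper-CT2": 24.76, "hf-whisper": 48.50},
}

# Lowest WER wins each dataset.
best = {dataset: min(scores, key=scores.get) for dataset, scores in WER.items()}
wins = Counter(best.values())

for dataset, backend in best.items():
    print(f"{dataset:22s} {backend}")
print(dict(wins))  # parakeet 3, qwen 2, whisper-CT2 2, hf-whisper 1
```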

**Reproduce**: `python benchmarks/bench_multi_dataset.py --all` then `--collect`. Per-combo JSONs land in `benchmarks/results/`; runs are resumable.

## Requirements

- Python 3.10 or greater
- No FFmpeg required (audio decoded via soundfile/numpy)
- GPU optional (all backends work on CPU; in the benchmarks above, CUDA is about 2.7x faster for Parakeet and about 9.5x faster for faster-whisper large-v3)

## Installation

```bash
pip install "sonicscribe-asr[parakeet]"      # Fastest + best LibriSpeech WER (CC-BY-4.0)
pip install "sonicscribe-asr[moonshine]"     # Best CPU option (MIT)
pip install "sonicscribe-asr[whisper]"       # faster-whisper / CTranslate2 (MIT)
pip install "sonicscribe-asr[qwen]"          # 30+ languages (Apache-2.0)
pip install "sonicscribe-asr[hf-whisper]"    # PyTorch Whisper, supports encoder hooks (MIT)
pip install "sonicscribe-asr[all]"           # All backends
```

For GPU acceleration with Parakeet:
```bash
pip install "sonicscribe-asr[parakeet-gpu]"  # includes CUDA 12 wheels
```

## Usage

### Simplest possible

```python
import sonic_scribe

result = sonic_scribe.transcribe("meeting.wav")
print(result.text)
```

SonicScribe auto-detects installed backends and picks the best one.

### Choose a backend

```python
from sonic_scribe import Engine

engine = Engine(backend="parakeet", device="cuda")
result = engine.transcribe("podcast.mp3", language="en")

print(result.text)
print(f"Duration: {result.duration:.1f}s")
for seg in result.segments:
    print(f"  [{seg.start:.1f}s → {seg.end:.1f}s] {seg.text}")
```

### Long-form audio (podcasts, lectures, meetings)

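SonicScribe handles long recordings without manual chunking. Parakeet, faster-whisper, HF Whisper, and Qwen all accept long-form input, but accuracy differs: in the multi-dataset benchmark above, Parakeet and whisper-CT2 hold up best, and Qwen is not recommended beyond about 10 min. Moonshine is short-form only (≤30s).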

```python
engine = Engine(backend="parakeet", device="cuda")
result = engine.transcribe("2hour_lecture.mp3")  # just works
print(f"{len(result.segments)} segments transcribed")
```

### Word-level timestamps

```python
result = engine.transcribe("audio.wav", word_timestamps=True)
for seg in result.segments:
    for word in seg.words:
        print(f"[{word.start:.2f}s → {word.end:.2f}s] {word.word}")
```

### Hotwords / prompt injection

Guide the model with domain-specific vocabulary (whisper and hf_whisper backends):

```python
engine = Engine(backend="whisper", device="cuda")  # initial_prompt applies to whisper / hf_whisper
result = engine.transcribe("meeting.wav", initial_prompt="SonicScribe, PyTorch, RTFx")
```

### Batch transcription

Process multiple files in one GPU pass for maximum throughput:

```python
files = ["clip1.wav", "clip2.wav", "clip3.wav", "clip4.wav"]
results = engine.transcribe_batch(files, language="en")
# Qwen: 4.4x faster than sequential
# HF Whisper: 2.3x faster than sequential
```
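
When the file list is much larger than a comfortable GPU batch (the benchmark above measured N=8 and N=32), one option is to split it into fixed-size groups and call `transcribe_batch` once per group. The helper below is a hypothetical sketch; whether `transcribe_batch` already splits large inputs internally is not stated here, so treat this as a caller-side precaution.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` items, preserving order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


files = [f"clip{i}.wav" for i in range(1, 11)]  # 10 illustrative file names
batches = list(chunked(files, 4))
print([len(b) for b in batches])  # [4, 4, 2]

# Each group would then go through the API shown above, for example:
#   for batch in chunked(files, 8):
#       results.extend(engine.transcribe_batch(batch, language="en"))
```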

### Parallel file processing

Process many files across threads (different from GPU batching):

```python
results = engine.transcribe_files(["a.wav", "b.wav", "c.wav", "d.wav"], workers=4)
```
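
For intuition about thread-level parallelism, the sketch below maps a per-file function over a thread pool with the standard library. It is an illustration of the general pattern, not SonicScribe's implementation of `transcribe_files`; `fake_transcribe` is a stand-in so the example runs without any backend installed.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def fake_transcribe(path: str) -> str:
    """Stand-in for a per-file transcription call; not a SonicScribe API."""
    return f"transcript of {path}"


def transcribe_in_threads(
    paths: Sequence[str],
    transcribe: Callable[[str], str],
    workers: int = 4,
) -> List[str]:
    """Run `transcribe` on every path using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in input order, regardless of which
        # thread finishes first.
        return list(pool.map(transcribe, paths))


print(transcribe_in_threads(["a.wav", "b.wav", "c.wav", "d.wav"], fake_transcribe))
```

Threads help most when the per-file work releases the GIL (file I/O, native inference kernels); GPU batching, by contrast, packs several inputs into one forward pass.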

### Speaker diarization (who said what)

Add speaker labels to transcribed segments via pyannote.audio 4.0 as a post-process pipeline stage. Works on top of any backend; turns plain transcription into speaker-attributed dialogue:

```python
from sonic_scribe import Engine
from sonic_scribe.optimizations import DiarizationStage, DiarizationConfig, Pipeline

pipeline = Pipeline(stages=[
    DiarizationStage(DiarizationConfig(num_speakers=2, hf_token="hf_xxx")),
])
engine = Engine(backend="parakeet", device="cuda", pipeline=pipeline)
result = engine.transcribe("meeting.wav")

for seg in result.segments:
    print(f"[{seg.speaker_id}] {seg.start:.1f}s-{seg.end:.1f}s: {seg.text}")
# Output:
# [SPEAKER_00] 0.5s-2.1s: Welcome to the meeting.
# [SPEAKER_01] 2.4s-4.0s: Thanks for having me.
```

Install with `pip install "sonicscribe-asr[diarization]"`. The `pyannote/speaker-diarization-community-1` model is gated; accept the license at https://huggingface.co/pyannote/speaker-diarization-community-1 and pass an HF read token via `hf_token=` (or set `HF_TOKEN` in the environment).

By default, the stage uses pyannote 4.0's *exclusive* diarization timeline (one speaker per frame) for clean ASR alignment. Set `use_exclusive=False` in `DiarizationConfig` to expose overlapping-speech turns instead.
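
A common way to attach speaker labels to ASR segments is to give each segment the speaker whose diarization turn overlaps it the most. The sketch below shows that idea in isolation; it is an assumption about the general technique, not a description of what `DiarizationStage` does internally, and the `Turn` type and the sample turns are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Turn(NamedTuple):
    """One diarization turn: a time span attributed to a single speaker."""
    start: float
    end: float
    speaker: str


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speaker(seg_start: float, seg_end: float, turns: List[Turn]) -> Optional[str]:
    """Return the speaker with the largest overlap, or None if nothing overlaps."""
    best_speaker, best_overlap = None, 0.0
    for turn in turns:
        shared = overlap(seg_start, seg_end, turn.start, turn.end)
        if shared > best_overlap:
            best_speaker, best_overlap = turn.speaker, shared
    return best_speaker


# Hypothetical exclusive timeline: exactly one speaker at any instant.
turns = [Turn(0.0, 2.2, "SPEAKER_00"), Turn(2.2, 5.0, "SPEAKER_01")]
print(assign_speaker(0.5, 2.1, turns))  # SPEAKER_00
print(assign_speaker(2.4, 4.0, turns))  # SPEAKER_01
```

With an exclusive timeline the turns never overlap each other, which keeps the "largest overlap" choice unambiguous for most segments.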

### Punctuation restoration + truecasing

ASR output is typically lowercased and unpunctuated. Add a `PunctuationStage` to fix this automatically (47 languages via `punctuators` ONNX model):

```python
from sonic_scribe import Engine
from sonic_scribe.optimizations import Pipeline, PunctuationStage

pipeline = Pipeline(stages=[PunctuationStage()])
engine = Engine(backend="moonshine", pipeline=pipeline)
result = engine.transcribe("audio.wav")
# Input:  "hello world how are you"
# Output: "Hello world, how are you?"
```

Install with `pip install "sonicscribe-asr[punctuation]"`.

### Export to SRT / VTT / JSON

```python
result = engine.transcribe("lecture.wav")
print(result.to_srt())   # SRT subtitle format
print(result.to_vtt())   # WebVTT subtitle format
print(result.to_json())  # structured JSON
```
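
For reference, an SRT cue is an index line, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time line, and the text. The sketch below formats one cue from a start time, end time, and text; it illustrates the file format only and is not the code behind `to_srt()`. Both function names are made up for the example.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a non-negative time in seconds as HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def srt_cue(index: int, start: float, end: float, text: str) -> str:
    """Build one SRT cue; cues in a file are separated by a blank line."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"


print(srt_cue(1, 0.5, 2.1, "Welcome to the meeting."))
# 1
# 00:00:00,500 --> 00:00:02,100
# Welcome to the meeting.
```

WebVTT timestamps use the same layout with a period instead of a comma before the milliseconds.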

### Async streaming (real-time)

Get partial transcripts as audio comes in — sub-500ms first-partial latency:

```python
import asyncio

from sonic_scribe import Engine
from sonic_scribe.streaming import (
    StreamingConfig, transcribe_stream,
    PartialTranscript, FinalSegment,
)

async def live_transcribe(audio_frames):
    engine = Engine(backend="hf_whisper", device="cuda")
    backend = engine.backend
    config = StreamingConfig(chunk_seconds=4.0)

    async for event in transcribe_stream(audio_frames, backend, config):
        if isinstance(event, PartialTranscript):
            print(f"... {event.text}", end="\r")
        elif isinstance(event, FinalSegment):
            print(f"✓ {event.segment.text}")
```
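
The example above takes `audio_frames` as given. As a rough idea of what such a source can look like, the sketch below is an async generator that slices a PCM byte buffer into fixed-size frames. The frame type and size that `transcribe_stream` actually expects are not specified here, so the `bytes` frames and the `pcm_frames` name are assumptions for illustration only.

```python
import asyncio
from typing import AsyncIterator, List


async def pcm_frames(pcm: bytes, frame_bytes: int) -> AsyncIterator[bytes]:
    """Yield consecutive slices of `pcm`; the last frame may be shorter."""
    for offset in range(0, len(pcm), frame_bytes):
        yield pcm[offset:offset + frame_bytes]
        # Hand control back to the event loop so a consumer can run between frames.
        await asyncio.sleep(0)


async def collect(frames: AsyncIterator[bytes]) -> List[bytes]:
    """Drain an async frame source into a list (for demonstration)."""
    return [frame async for frame in frames]


chunks = asyncio.run(collect(pcm_frames(b"\x00" * 10, frame_bytes=4)))
print([len(c) for c in chunks])  # [4, 4, 2]
```

A live source (microphone, socket) would follow the same shape, yielding each frame as it arrives instead of slicing a buffer.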

### CLI

```bash
# Transcribe a single file
sonic-scribe transcribe interview.wav -b parakeet -d cuda

# Multiple files as JSON
sonic-scribe transcribe a.wav b.wav c.wav --json

# SRT / WebVTT subtitle output
sonic-scribe transcribe lecture.wav --srt > lecture.srt
sonic-scribe transcribe lecture.wav --vtt > lecture.vtt

# Word timestamps in JSON output
sonic-scribe transcribe audio.wav --word-timestamps --json

# List installed backends
sonic-scribe info
```

## Encoder optimizations

SonicScribe includes two encoder compression techniques that speed up PyTorch-based backends without retraining:

### Token Merging (lossless 1.4x speedup)

```python
from sonic_scribe.optimizations.tome import AudioToMeStack, ToMeConfig

stack = AudioToMeStack(ToMeConfig(sim_threshold=0.95, mode="shrink")).apply(model)
# Encoder runs 1.4x faster, WER unchanged
```
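
To give a feel for what a similarity threshold does, the sketch below merges runs of adjacent frames whose cosine similarity is at or above the threshold, replacing each run with its mean. This is a deliberately simplified stand-in written in plain Python; it is not the algorithm inside `AudioToMeStack`, and the function names and the toy 2-dimensional frames are made up for the example.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean(group: List[Vector]) -> List[float]:
    """Element-wise mean of a non-empty group of vectors."""
    return [sum(col) / len(group) for col in zip(*group)]


def merge_adjacent(tokens: List[Vector], sim_threshold: float = 0.95) -> List[List[float]]:
    """Greedily merge each frame into the current run while it stays similar."""
    if not tokens:
        return []
    merged, group = [], [tokens[0]]
    for tok in tokens[1:]:
        if cosine(group[-1], tok) >= sim_threshold:
            group.append(tok)           # similar enough: extend the run
        else:
            merged.append(mean(group))  # close the run, start a new one
            group = [tok]
    merged.append(mean(group))
    return merged


frames = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]  # first two are near-duplicates
print(len(frames), "->", len(merge_adjacent(frames)))  # 3 -> 2
```

Fewer tokens leave less work for later encoder layers, which is where a speedup of this kind comes from; a higher threshold merges less aggressively.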

### LiteASR low-rank compression

```python
# Offline, from a shell: compress encoder weights
#   python tools/liteasr_compress.py --model openai/whisper-small --out factors.npz

# Runtime: swap in compressed weights (19% fewer encoder params)
from sonic_scribe.optimizations.liteasr import apply_factors_to_encoder, load_factors
entries = load_factors("factors.npz")
apply_factors_to_encoder(model, entries)
```
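
The parameter saving from a low-rank factorization follows from simple counting: a dense `rows × cols` weight has `rows * cols` parameters, while two factors of rank `r` have `r * (rows + cols)`. The sketch below works through that arithmetic for an illustrative 768×3072 layer; the shape and ranks are assumptions chosen for the example, not values read from the model or from `liteasr_compress.py`, and the 19% figure above is an encoder-wide number that this per-layer sketch does not reproduce.

```python
def dense_params(rows: int, cols: int) -> int:
    """Parameter count of a dense rows x cols weight matrix."""
    return rows * cols


def low_rank_params(rows: int, cols: int, rank: int) -> int:
    """Parameter count of the factors (rows x rank) and (rank x cols)."""
    return rank * (rows + cols)


def saving(rows: int, cols: int, rank: int) -> float:
    """Fraction of parameters removed by the rank-`rank` factorization."""
    return 1 - low_rank_params(rows, cols, rank) / dense_params(rows, cols)


rows, cols = 768, 3072  # illustrative layer shape (assumption)
for rank in (128, 256, 512):
    print(f"rank {rank}: {saving(rows, cols, rank):.1%} fewer parameters")
# rank 128: 79.2% fewer parameters
# rank 256: 58.3% fewer parameters
# rank 512: 16.7% fewer parameters
```

The factorization only saves parameters while `rank < rows * cols / (rows + cols)`, which is about 614 for this shape.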

### Composable optimization pipeline

Stack multiple optimizations together. The Pipeline validates conflicts at construction time:

```python
from sonic_scribe import Engine
from sonic_scribe.optimizations.pipeline import Pipeline, VADStage, EncoderToMeStage

pipeline = Pipeline(stages=[
    VADStage(),                                          # silence removal
    EncoderToMeStage(sim_threshold=0.95, mode="shrink"), # token merging
])
engine = Engine(backend="hf_whisper", device="cuda", pipeline=pipeline)
result = engine.transcribe("lecture.wav", language="en")
```

### VAD backend selection

Three voice activity detection backends, switchable at pipeline level:

| VAD | F1 (FLEURS-102) | Latency | License |
|---|---|---|---|
| Silero (default) | 95.95% | Good | MIT |
| FireRedVAD | **97.57%** | Good | Apache 2.0 |
| TEN-VAD | 95.19% | **<10ms** | Apache 2.0 + Agora restrictions |

```python
from sonic_scribe.optimizations.pipeline import Pipeline, VADStage
from sonic_scribe.optimizations.vad import load_vad

vad = load_vad("firered")  # or "silero", "ten"
pipeline = Pipeline(stages=[VADStage(vad=vad)])
```

Install with `pip install "sonicscribe-asr[firered-vad]"` or `pip install "sonicscribe-asr[ten-vad]"`.

## Available backends

| Backend | Best model | LibriSpeech WER | Speed | License |
|---|---|---|---|---|
| `parakeet` | Parakeet TDT v3 (0.6B) | **0.67%** | **53.4x** CUDA | CC-BY-4.0 |
| `moonshine` | moonshine/base (61M) | 1.34% | 20.9x CPU | MIT |
| `qwen` | Qwen3-ASR-0.6B | 1.57% | 14.1x CUDA | Apache-2.0 |
| `whisper` | large-v3 (CT2 int8) | 1.57% | 1.2x CPU | MIT |
| `hf_whisper` | whisper-large-v2 | 2.01% | 13.6x CUDA | MIT |

## Reproducing benchmarks

Every measured number in this README is reproducible (the openai/whisper reference row is taken from published benchmarks, not measured here):

```bash
python benchmarks/bench_multi_dataset.py --all     # 8 datasets × 5 backends, ~6h wall
python benchmarks/bench_multi_dataset.py --collect # aggregate to markdown
python benchmarks/bench_parakeet_cuda.py           # Parakeet CUDA vs CPU
python benchmarks/bench_moonshine_whisper_real.py  # Moonshine + faster-whisper
python benchmarks/bench_batch.py                   # Batch throughput comparison
python benchmarks/bench_tome_e2e.py                # ToMe speedup sweep
```

All benchmarks use real audio (auto-downloaded via HuggingFace datasets). The multi-dataset runner writes per-combo JSON to `benchmarks/results/` so interrupted runs resume cleanly.

## Examples

Runnable scripts in `examples/`:

```bash
python examples/quickstart.py                  # 3-line transcription
python examples/multi_backend_compare.py       # compare all installed backends
python examples/streaming_demo.py audio.wav    # async streaming from file
python examples/optimizations_demo.py audio.wav  # VAD + ToMe pipeline
```

## Development

```bash
git clone https://github.com/SeaL773/SonicScribe.git
cd SonicScribe
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev,moonshine,whisper,hf-whisper]"

pytest -q                                          # 330 tests
ruff check src/ tests/ benchmarks/ tools/          # zero warnings
```

CUDA-gated tests skip cleanly without a GPU. See [CONTRIBUTING.md](CONTRIBUTING.md) for project conventions.

## License

MIT. Individual backend models have their own licenses (see table above).
