Metadata-Version: 2.4
Name: phostt
Version: 0.4.3
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Rust
Classifier: Topic :: Multimedia :: Sound/Audio :: Speech
License-File: LICENSE
Summary: On-device Vietnamese speech recognition — Python bindings for phostt
Keywords: stt,speech-recognition,vietnamese,onnx,zipformer
Author-email: Eugene Khodzitsky <eugene.khodzitsky@gmail.com>
License: MIT
Requires-Python: >=3.9
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Changelog, https://github.com/ekhodzitsky/phostt/blob/master/CHANGELOG.md
Project-URL: Documentation, https://docs.rs/phostt
Project-URL: Homepage, https://github.com/ekhodzitsky/phostt
Project-URL: Repository, https://github.com/ekhodzitsky/phostt

<p align="center">
  <h1 align="center">phostt</h1>
  <p align="center"><strong>On-device Vietnamese speech recognition</strong></p>
  <p align="center">Local STT server powered by Zipformer-vi RNN-T — no cloud, no API keys, full privacy</p>
  <p align="center">
    <a href="https://crates.io/crates/phostt"><img src="https://img.shields.io/crates/v/phostt.svg" alt="Crates.io"></a>
    <a href="https://crates.io/crates/phostt"><img src="https://img.shields.io/crates/d/phostt.svg" alt="Downloads"></a>
    <a href="https://docs.rs/phostt"><img src="https://docs.rs/phostt/badge.svg" alt="Docs.rs"></a>
    <a href="https://github.com/ekhodzitsky/phostt/releases/latest"><img src="https://img.shields.io/github/v/release/ekhodzitsky/phostt?label=release" alt="Release"></a>
    <a href="https://pypi.org/project/phostt/"><img src="https://img.shields.io/pypi/v/phostt.svg" alt="PyPI"></a>
    <a href="https://github.com/ekhodzitsky/phostt/actions/workflows/ci.yml"><img src="https://img.shields.io/github/actions/workflow/status/ekhodzitsky/phostt/ci.yml?branch=master&label=ci" alt="CI"></a>
    <a href="https://github.com/ekhodzitsky/phostt/blob/master/LICENSE"><img src="https://img.shields.io/badge/license-MIT-blue.svg" alt="MIT License"></a>
    <a href="https://github.com/ekhodzitsky/phostt/blob/master/CHANGELOG.md"><img src="https://img.shields.io/badge/changelog-Keep%20a%20Changelog-orange" alt="Changelog"></a>
  </p>
</p>

---

**phostt** turns any machine into a Vietnamese speech recognition server that
runs entirely on-device. Zipformer-vi RNN-T weights ship pre-quantized from
[sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx); installation is a single
command, the model is ~75 MB, and everything runs locally.

```sh
cargo install phostt && phostt serve
# WebSocket: ws://127.0.0.1:9876/v1/ws
# REST API:  http://127.0.0.1:9876/v1/transcribe
```

Or use Python:

```sh
pip install phostt
```

```python
from phostt import Engine

engine = Engine("~/.phostt/models")
text = engine.transcribe_file("audio.wav")
print(text)
```

Or build from source:

```sh
git clone https://github.com/ekhodzitsky/phostt
cd phostt
cargo run --release -- serve
```

## Table of Contents

- [Why phostt?](#why-phostt)
- [Features](#features)
- [Platform Support](#platform-support)
- [Quick Start](#quick-start)
  - [Install](#install)
  - [Python](#python)
  - [Serve](#serve)
  - [Smoke test](#smoke-test)
  - [Usage Examples](#usage-examples)
  - [Docker](#docker)
- [Benchmarks](#benchmarks)
- [Quality / WER](#quality--wer)
- [Architecture](#architecture)
- [Mobile / FFI](#mobile--ffi)
- [Roadmap](#roadmap)
- [Known Limitations](#known-limitations)
- [Troubleshooting](#troubleshooting)
- [Contributing](#contributing)
- [Security](#security)
- [Acknowledgements](#acknowledgements)
- [License](#license)

## Why phostt?

| | phostt | PhoWhisper-large | Cloud APIs |
|---|:---:|:---:|:---:|
| **Architecture** | Zipformer + RNN-T | Whisper enc-dec | varies |
| **Model size (INT8)** | **~75 MB** | ~1.5 GB | server-side |
| **WER (GigaSpeech2-vi)** | ~7.7% | n/a | varies |
| **Latency (3.7 s audio)** | **~61 ms** | ~300 ms | network + queue |
| **Throughput** | **61× RTF** | ~3× RTF | varies |
| **Privacy** | 100% local | 100% local | data leaves device |
| **Cost** | free forever | free | $0.006/min+ |
| **Setup** | `cargo install` | Python + deps | API key + billing |
| **Streaming** | real-time WebSocket | batch only | varies |

The Zipformer-vi-30M weights ship via [sherpa-onnx releases](https://github.com/k2-fsa/sherpa-onnx/releases) (Apache 2.0,
trained on ~70,000 hours of Vietnamese speech, with ICASSP-published WER on the
VLSP and GigaSpeech2 benchmarks).

## Features

- **Real-time streaming** — partial transcription via WebSocket as you speak
- **REST API + SSE** — file transcription with instant or streaming response
- **Hardware acceleration** — CoreML + Neural Engine (macOS), CUDA 12+ (Linux), CPU everywhere
- **Pre-quantized INT8** — encoder ships at ~75 MB INT8 from upstream
- **Multi-format audio** — WAV, M4A/AAC, MP3, OGG/Vorbis, FLAC
- **Auto-download** — model fetched from sherpa-onnx GitHub releases on first run
- **Speaker diarization** — optional `diarization` feature for multi-speaker sessions
- **Docker ready** — CPU and CUDA images with multi-stage builds
- **Android FFI** — C-ABI + Kotlin bridge for mobile integration
- **Hardened** — connection limits, frame caps, idle timeouts, sanitized errors, rate limiting

## Platform Support

| Platform | Target | Backend | Notes |
|---|---|---|---|
| macOS (Apple Silicon) | `aarch64-apple-darwin` | CoreML / CPU | Neural Engine + CPU fallback |
| macOS (Intel) | `x86_64-apple-darwin` | CPU | |
| Linux (x86_64) | `x86_64-unknown-linux-gnu` | CUDA 12+ / CPU | CUDA via `--features cuda` |
| Linux (ARM64) | `aarch64-unknown-linux-gnu` | CPU | Buildable, not CI-tested yet |
| Android | `aarch64-linux-android`, `armv7-linux-androideabi` | NNAPI / CPU | Via `cargo-ndk` + `ffi` feature |
| Windows | `x86_64-pc-windows-msvc` | CPU | Community-maintained |

> **iOS** is theoretically supported via CoreML (`--features coreml,ffi`), but not yet verified in CI.

## Quick Start

### Install

```sh
cargo install phostt
```

The first run downloads the ~75 MB Zipformer-vi ONNX bundle automatically into `~/.phostt/models/`.

### Python

```sh
pip install phostt
```

```python
from phostt import Engine

engine = Engine("~/.phostt/models")
text = engine.transcribe_file("audio.wav")
print(text)
```

`Engine` is thread-safe — multiple Python threads can call `transcribe_file`
or `transcribe_bytes` concurrently (limited by the ONNX session pool size).
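
A sketch of that concurrency pattern, using a stand-in function in place of a real `Engine` call (in real use you would construct one shared `Engine("~/.phostt/models")` and call `engine.transcribe_file` from each worker):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for engine.transcribe_file — replace with a call on a single
# shared Engine instance; the internal session pool serializes ONNX access.
def transcribe(path):
    return f"transcript of {path}"

files = ["a.wav", "b.wav", "c.wav"]

# One shared engine, many worker threads submitting files concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(transcribe, files))

for path, text in zip(files, results):
    print(path, "->", text)
```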

### Serve

```sh
phostt serve
# Listening on ws://127.0.0.1:9876/v1/ws
# REST API at http://127.0.0.1:9876/v1/transcribe
```

### Smoke test

```sh
phostt transcribe ~/.phostt/models/test_wavs/0.wav
```

Expected output (from the bundled Vietnamese test fixture):

```
RỒI CŨNG HỖ TRỢ CHO LÂU LÂU CŨNG CHO GẠO CHO NÀY KIA
```

### Usage Examples

**REST API (single file):**

```sh
curl -X POST http://localhost:9876/v1/transcribe \
  -H "Content-Type: audio/wav" \
  --data-binary @sample.wav
```
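
The same call from Python's standard library (no `requests` dependency). Actually sending it requires a running `phostt serve`, so this sketch only builds the request:

```python
import urllib.request

# Normally: audio = open("sample.wav", "rb").read()
audio = b"RIFF..."  # placeholder bytes for illustration

# Build the same POST the curl example sends; urlopen(req) would execute it.
req = urllib.request.Request(
    "http://localhost:9876/v1/transcribe",
    data=audio,
    headers={"Content-Type": "audio/wav"},
    method="POST",
)
print(req.get_method(), req.full_url)
```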

**REST API (streaming SSE):**

```sh
curl -X POST http://localhost:9876/v1/stream \
  -H "Content-Type: audio/wav" \
  --data-binary @sample.wav
```
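
An SSE response body is a stream of `data:` lines separated by blank lines. A minimal parser (the payload text below is illustrative; check the server's actual event shape):

```python
def parse_sse(body: str):
    """Yield the data payload of each SSE event in `body`."""
    for event in body.split("\n\n"):
        for line in event.splitlines():
            if line.startswith("data:"):
                yield line[len("data:"):].strip()

# Example stream with made-up partial transcripts:
sample = "data: xin\n\ndata: xin chào\n\n"
print(list(parse_sse(sample)))
```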

**WebSocket (real-time):**

```sh
# Connect and stream PCM16 chunks as you speak
websocat ws://localhost:9876/v1/ws
```
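
The WebSocket endpoint takes raw PCM16 chunks, i.e. 16-bit little-endian samples (16 kHz mono is assumed here to match the model's input). Converting float samples to that wire format:

```python
import struct

def floats_to_pcm16(samples):
    """Pack float samples in [-1.0, 1.0] as 16-bit little-endian PCM."""
    ints = [max(-32768, min(32767, int(s * 32767))) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)

chunk = floats_to_pcm16([0.0, 0.5, -0.5, 1.0])
print(len(chunk))  # 2 bytes per sample
```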

**Python:**

```python
from phostt import Engine

engine = Engine("~/.phostt/models")
text = engine.transcribe_file("audio.wav")
print(text)
```

See [`examples/python_binding.py`](examples/python_binding.py) for a runnable demo.

**With hardware acceleration:**

```sh
# macOS Apple Silicon — CoreML Neural Engine
cargo install phostt --features coreml
phostt serve

# Linux + NVIDIA — CUDA 12
cargo install phostt --features cuda
phostt serve
```

### Docker

```sh
# CPU (any platform)
docker build -t phostt .
docker run -p 9876:9876 phostt

# CUDA (Linux + NVIDIA Container Toolkit)
docker build -f Dockerfile.cuda -t phostt-cuda .
docker run --gpus all -p 9876:9876 phostt-cuda
```

Or pull from GitHub Container Registry:

```sh
docker pull ghcr.io/ekhodzitsky/phostt:latest
docker run -p 9876:9876 ghcr.io/ekhodzitsky/phostt:latest
```

Or use Docker Compose:

```sh
docker compose up
```

## Benchmarks

Measured on Apple Silicon M2 Pro, release build, 3.74 s Vietnamese test audio:

| Backend | Mean Latency | Median | P95 | RTF | Peak RSS |
|---|---|---|---|---|---|
| **CPU** | **60 ms** | 60 ms | 61 ms | **62×** | **1.4 GB** |
| CoreML (Neural Engine) | 93 ms | 90 ms | 124 ms | 40× | 1.2 GB |

*RTF = real-time factor (audio seconds processed per wall-clock second).*
*For this 30M-param INT8 model, CPU is faster than CoreML on Apple Silicon.*
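
The RTF column follows directly from the latency column; for the 3.74 s clip:

```python
audio_seconds = 3.74

# RTF = audio duration / processing wall-clock time
rtf_cpu = audio_seconds / 0.060     # CPU mean latency, 60 ms
rtf_coreml = audio_seconds / 0.093  # CoreML mean latency, 93 ms

print(round(rtf_cpu), round(rtf_coreml))  # matches the 62× / 40× column
```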

Auto-updated benchmark history: [`BENCHMARKS.md`](BENCHMARKS.md).

## Quality / WER

**GigaSpeech2-vi (clean): ~7.7%** — published upstream benchmark on clean Vietnamese speech.

For detailed benchmark history (latency, throughput, memory) and regression tracking datasets, see [`BENCHMARKS.md`](BENCHMARKS.md).

## Architecture

```
┌─────────────┐     ┌─────────────┐     ┌─────────────────────┐
│   Client    │────▶│  axum HTTP  │────▶│   SessionPool       │
│  (WS/REST)  │     │   router    │     │  (async-channel)    │
└─────────────┘     └─────────────┘     └─────────────────────┘
                                                │
                    ┌───────────────────────────┘
                    ▼
           ┌────────────────┐
           │ SessionTriplet │──▶ Zipformer Encoder (ONNX)
           │ (enc/dec/join) │──▶ RNN-T Decoder (greedy)
           └────────────────┘──▶ Joiner
                    │
                    ▼
           ┌────────────────┐
           │ StreamingState │──▶ overlap-buffer / VAD
           │(per-connection)│    → partial + final segments
           └────────────────┘
```
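
The SessionPool in the diagram hands a fixed set of decoder sessions out over a channel. An illustrative sketch of that pattern, with `asyncio.Queue` standing in for the Rust `async-channel` (names and structure here are illustrative, not the actual internals):

```python
import asyncio

class SessionPool:
    """Hand out a fixed set of session objects over an async channel."""
    def __init__(self, sessions):
        self._q = asyncio.Queue()
        for s in sessions:
            self._q.put_nowait(s)

    async def run(self, job):
        session = await self._q.get()    # waits when all sessions are busy
        try:
            return f"{session}:{job}"    # stand-in for running inference
        finally:
            self._q.put_nowait(session)  # return the session to the pool

async def main():
    pool = SessionPool(["s0", "s1"])     # two ONNX session triplets
    jobs = [pool.run(f"req{i}") for i in range(4)]
    return await asyncio.gather(*jobs)

results = asyncio.run(main())
print(results)
```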

## Mobile / FFI

phostt exposes a C-ABI for Android integration:

```c
PhosttEngine* engine = phostt_engine_new("/path/to/models");
PhosttStream* stream = phostt_stream_new(engine);
char* json = phostt_stream_process_chunk(engine, stream, pcm16, len, 16000);
// ... free with phostt_string_free(json) ...
```

See [`ANDROID.md`](ANDROID.md) for NDK setup, Kotlin bridge (`ffi/android/PhosttBridge.kt`), and model bundling strategies.

## Roadmap

- [x] v0.3.0 — Silero VAD streaming, configurable overlap-buffer, auto-benchmark CI
- [x] v0.4.0 — Polyvoice diarization, security/resource hardening, benchmark RSS
- [x] v0.4.1 — Dependency updates (rubato 2.0, sha2 0.11), docs polish, CI improvements
- [ ] iOS build verification (CoreML + `ffi` feature) — *theoretically supported, not yet CI-tested*
- [ ] Quantized embedding extractor for faster diarization
- [ ] Offline batch re-clustering pass for improved speaker accuracy

## Known Limitations

- **Out-of-domain audio** (English loanwords, numbers, proper names) may produce phonetic Vietnamese transcriptions rather than verbatim text. This is expected for a monolingual model trained on ~70,000 hours of Vietnamese speech.
- **Memory footprint** (~1.4 GB peak RSS) may be too heavy for <2 GB devices. Consider the CPU backend and a smaller batch size for embedded use.
- **iOS** is theoretically supported via CoreML + `ffi`, but has not been verified in CI.
- **Windows** builds are community-maintained and not CI-tested.

## Troubleshooting

| Symptom | Cause | Fix |
|---|---|---|
| `Model not found` on first run | Auto-download failed or proxy blocks GitHub | Set `PHOSTT_MODEL_DIR` to a local path with extracted weights |
| High latency (>200 ms) on CPU | Debug build or missing `release` profile | Always run `cargo run --release` or `cargo install` |
| CoreML slower than CPU | Neural Engine overhead on short audio | CPU is actually faster for this 30M-param INT8 model; CoreML wins on larger models |
| `SIGKILL` during model load | OOM on low-RAM system | Close other apps, use CPU backend, or run on a machine with ≥4 GB RAM |
| WebSocket closes immediately | Rate limit hit or origin mismatch | Check logs; disable rate limiting with `--rate-limit 0` for local testing |
| Diarization missing speakers | `diarization` feature not enabled | Rebuild with `--features diarization` |

See [`TODO.md`](TODO.md) for the full tracker.

## Contributing

See [`CONTRIBUTING.md`](CONTRIBUTING.md). Quick start for developers:

```sh
cargo build --release --features coreml   # or cuda
cargo test                                # 146 fast unit tests, no model needed
cargo clippy --all-targets -- -D warnings
cargo deny check
```

## Security

Please report security vulnerabilities privately — see [`SECURITY.md`](SECURITY.md) for contact details and supported versions.

## Acknowledgements

phostt is a Vietnamese fork of [`gigastt`](https://github.com/ekhodzitsky/gigastt),
which provides the production-grade server scaffolding (HTTP/WS/SSE,
rate-limit, metrics, graceful shutdown). The Vietnamese inference uses the
Zipformer-Transducer weights packaged by the
[sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx) project.

## License

MIT — see [LICENSE](LICENSE).

