Metadata-Version: 2.4
Name: fscript
Version: 1.0.0
Summary: Fast local transcription for large lectures with NVIDIA Parakeet ONNX
Home-page: https://github.com/brenorb/fast-transcript
Author: Breno Brito
License: MIT
Project-URL: Homepage, https://github.com/brenorb/fast-transcript
Project-URL: Repository, https://github.com/brenorb/fast-transcript
Project-URL: Issues, https://github.com/brenorb/fast-transcript/issues
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Operating System :: MacOS
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Rust
Classifier: Topic :: Multimedia :: Sound/Audio :: Speech
Classifier: Topic :: Utilities
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Dynamic: author
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license
Dynamic: license-file
Dynamic: project-url
Dynamic: requires-python
Dynamic: summary

# fast-transcript

**`fast-transcript` is a local lecture transcription CLI built to beat the usual Apple Silicon tradeoff: either fast but flaky, or accurate but painfully slow.**

On the development machine, this project transcribed **30 minutes of audio in about 2 minutes**\* while staying around **2.51 GB RSS** on the long run. In the same local test set, it ran faster than **`mlx-whisper`**, **`insanely-fast-whisper`**, and **`parakeet-mlx`**.

<sub>* Benchmark run on a MacBook Pro M1. The exact long-run measurement was **29m47s** of Portuguese lecture audio transcribed in about **2m14s** (**13.38x real-time**).</sub>

The CLI binary is called **`fscript`**:

```bash
fscript lecture.mp3
fscript lecture.mp3 notes/
fscript lecture.mp3 --text
fscript lecture.mp3 --text=plain
fscript lecture.mp3 --raw
fscript lecture.mp3 --srt
fscript lecture.mp3 --vtt
fscript lecture.mp3 --json
fscript lecture.mp3 --backend=lseend-dihard3
fscript lecture.mp3 --backend=none --json --raw
fscript lecture.mp3 -n 2
```

That is the whole point of this project. One command. Large audio. No babysitting.

## Why this exists

I wanted a tool for **transcribing long classes and lectures quickly on a laptop while still using the computer for normal work**.

The existing options I tested had clear problems for this use case:

- **`insanely-fast-whisper`** was far too slow on this Mac once it fell back to CPU
- **`mlx-whisper`** was solid, but slower than I wanted for long lecture workflows
- **`parakeet-mlx`** had excellent memory numbers, but drifted into English on longer Portuguese segments unless heavily tuned

`fast-transcript` packages the ONNX Parakeet path that held up best in practice.

## What it does

- downloads the default **Parakeet TDT 0.6B v3 int8** model automatically if it is missing
- stores the extracted model in a persistent per-user application data directory
- keeps the downloaded tarball in the user cache directory
- accepts `mp3`, `wav`, and other audio formats supported by `ffmpeg`
- accepts remote `http(s)` video/audio URLs supported by `yt-dlp`
- prefers platform-provided manual subtitles for remote URLs when available
- falls back to downloading remote audio and transcribing locally when only auto-captions exist or no captions exist
- auto-converts unsupported audio to **16 kHz mono PCM16 WAV**
- uses **120s chunks** with **2s overlap** by default
- runs local speaker diarization by default via `fluidaudiocli process --mode offline`
- writes `<audio>.speakers.txt` next to the input unless you choose a different output path
- can alternatively write transcript text via `--text`, with timestamps on by default and `--text=plain` as the opt-out
- can alternatively write subtitle files via `--srt` or `--vtt`
- can alternatively write speaker-aware text via `--speakers`, defaulting to `HH:MM:SS - SPEAKER_01: ...`
- cleans pathological repeated-word runs such as `we we we we` into `we... we` by default, with `--raw` as the opt-out
- stays quiet by default: concise progress in the terminal, transcript file on disk
- shows a spinner and chunk progress bar on interactive terminals
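
To make the default chunking concrete, here is a minimal Python sketch of how 120 s windows with 2 s of overlap could tile a recording. It assumes each window starts `chunk - overlap` seconds after the previous one. `plan_chunks` is a hypothetical helper for illustration, not part of `fscript`, and the real chunker may place boundaries differently.

```python
def plan_chunks(duration_s: float, chunk_s: float = 120.0, overlap_s: float = 2.0):
    """Return (start, end) windows in seconds that cover [0, duration_s]."""
    if overlap_s < 0 or chunk_s <= overlap_s:
        raise ValueError("chunk length must exceed a non-negative overlap")
    step = chunk_s - overlap_s  # assumption: consecutive windows share overlap_s seconds
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, float(duration_s))
        windows.append((start, end))
        if end >= duration_s:
            break  # the last window reached the end of the audio
        start += step
    return windows


# The 29m47s lecture from the benchmark footnote.
windows = plan_chunks(29 * 60 + 47)
print(len(windows), windows[0], windows[-1])  # 16 (0.0, 120.0) (1770.0, 1787.0)
```

Under these assumptions the long benchmark lecture splits into 16 windows, with only the final one shorter than 120 s.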

## Install

### Requirements

- `ffmpeg`
- `ffprobe`
- `yt-dlp` for remote URLs, or `uvx yt-dlp`
- `fluidaudiocli` on `PATH` for the default speaker-aware mode
  - use `--backend=none` if you want to skip diarization

### Install with Homebrew

```bash
brew install brenorb/tap/fast-transcript
```

On Apple Silicon macOS, Homebrew installs `fast-transcript` from a prebuilt bottle.
On Linux x86_64, the formula installs from the published release binary.

If you prefer the explicit two-step form:

```bash
brew tap brenorb/tap
brew install fast-transcript
```

### PyPI / uv

The PyPI package name for this project is **`fscript`**, so the intended usage is:

```bash
uvx fscript lecture.mp3
uv tool install fscript
```

The repo already includes platform wheel builds for:

- macOS arm64
- Linux x86_64

PyPI publishing is currently enabled for:

- macOS arm64

See [`docs/pypi-publishing.md`](./docs/pypi-publishing.md) for the release workflow details.

### Install a prebuilt binary directly

Download the archive for your platform from the [GitHub Releases page](https://github.com/brenorb/fast-transcript/releases), then put `fscript` on your `PATH`.

### Build from source

```bash
cargo install --git https://github.com/brenorb/fast-transcript
```

Or from a local clone:

```bash
cargo install --path .
```

On macOS, the build now auto-detects the active Xcode or Command Line Tools Clang runtime directories so `cargo test` keeps linking even if your Rust toolchain points at a stale `libclang_rt.osx` path.

## Quick start

```bash
fscript lecture.mp3
fscript https://www.youtube.com/watch?v=QSdh8Gj0mEg
```

This will:

1. ensure the default model exists
2. normalize the audio if needed
3. transcribe with the default chunking strategy
4. diarize with the default `coreml` backend
5. write `lecture.speakers.txt`
6. print the final absolute transcript path to `stdout`

For remote URLs, the default speaker-aware flow is:

1. inspect the URL with `yt-dlp`
2. download the remote audio
3. run the normal local transcription + diarization pipeline

If you switch to `--backend=none`, `fscript` can use platform-provided manual subtitles directly when they are available, unless you also force local transcription with `--local`.

## Usage

```bash
fscript <audio-or-url> [output-path]
fscript <audio-or-url> -o output-path
fscript <audio-or-url> --stdout
fscript <audio-or-url> -
fscript <audio-or-url> --speakers
fscript <audio-or-url> --speakers=plain
fscript <audio-or-url> --text
fscript <audio-or-url> --text=plain
fscript <audio-or-url> --raw
fscript <audio-or-url> --json
fscript <audio-or-url> --srt
fscript <audio-or-url> --vtt
fscript <audio-or-url> --backend=lseend-dihard3
fscript <audio-or-url> --backend=none --json --raw
fscript <audio-or-url> -n 2
fscript --version
```

When `fscript` writes the transcript to a file, it keeps progress and human-readable status on `stderr` and prints only the final absolute transcript path on `stdout`.
That makes it easy to compose in shell scripts:

```bash
out=$(fscript lecture.mp3)
open "$out"
```
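
The same contract is convenient outside the shell. As a hedged sketch in Python, relying only on the behavior described above (progress on `stderr`, the final path alone on `stdout`), a wrapper could look like this. `transcript_path` is a hypothetical helper name, not something `fscript` ships.

```python
import subprocess
from pathlib import Path


def transcript_path(command: list[str]) -> Path:
    """Run a command and return the single path it prints on stdout."""
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,  # capture stdout only; stderr (progress) passes through
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return Path(result.stdout.strip())


# Example call (requires fscript on PATH):
# out = transcript_path(["fscript", "lecture.mp3"])
```

Because `stderr` is left attached to the terminal, progress stays visible while the script only ever sees the path.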

If the explicit `output-path` already exists as a directory, `fscript` writes the default filename for the chosen mode inside that directory.

Optional overrides:

```bash
fscript lecture.wav custom-output.txt
fscript lecture.wav exports/
fscript lecture.wav -o custom-output.txt
fscript lecture.wav --stdout
fscript lecture.wav --speakers
fscript lecture.wav --speakers=plain
fscript lecture.wav --text
fscript lecture.wav --text=plain
fscript lecture.wav --raw
fscript lecture.wav --json
fscript lecture.wav --srt
fscript lecture.wav --vtt
fscript lecture.wav --backend=lseend-dihard3
fscript lecture.wav --backend=none --json --raw
fscript lecture.wav -n 2
fscript lecture.wav --chunk 180 --overlap 3
fscript lecture.wav --chunk 0
fscript lecture.wav --model-dir ./models/parakeet/custom-copy
fscript lecture.wav --model-package ./models/parakeet-v3-int8.tar.gz
fscript lecture.wav --model-url https://example.com/parakeet-v3-int8.tar.gz
fscript https://www.youtube.com/watch?v=QSdh8Gj0mEg
fscript https://www.youtube.com/watch?v=QSdh8Gj0mEg --local
```

Text output modes:

- `--text`: transcript text with segment timestamps, one line per segment with `HH:MM:SS - ...`
- `--text=plain`: transcript text without timestamps or speaker labels
- when `--text` is active and you do not pass an explicit output path, the default file becomes `<audio>.transcript.txt`

Cleaning mode:

- cleaning is on by default and affects only the output being written for that invocation
- `--raw`: disables output cleaning for that invocation
- it applies to JSON, speakers, text, SRT, and VTT outputs
- it is intentionally conservative and leaves ordinary repetition alone
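
As an illustration of what a conservative cleanup can look like, the Python sketch below collapses long same-word runs into `word... word`. The threshold of four consecutive repeats is an assumption chosen to match the `we we we we` example; the actual rule used by `fscript` is not documented here and may differ.

```python
import re

# Assumption: a "pathological" run is the same word four or more times in a row.
_RUN = re.compile(r"\b(\w+)(?:\s+\1\b){3,}", flags=re.IGNORECASE)


def clean_repeats(text: str) -> str:
    """Collapse long same-word runs into 'word... word'; leave short repeats alone."""
    return _RUN.sub(r"\1... \1", text)


print(clean_repeats("we we we we need more data"))  # we... we need more data
print(clean_repeats("that is very very good"))      # that is very very good
```

The second call shows ordinary repetition passing through unchanged.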

Subtitle output modes:

- `--srt`: SubRip subtitle file
- `--vtt`: WebVTT subtitle file
- if diarization is active, subtitle cues include normalized speaker labels such as `SPEAKER_01: ...`
- when `--srt` is active and you do not pass an explicit output path, the default file becomes `<audio>.srt`
- when `--vtt` is active and you do not pass an explicit output path, the default file becomes `<audio>.vtt`

Speaker-aware output modes:

- `--speakers`: speaker-aware output with timestamps, for example `00:12:34 - SPEAKER_01: ...`
- `--speakers=plain`: speaker-aware output without timestamps, for example `SPEAKER_01: ...`
- if diarization is disabled or a segment has no speaker label, the line falls back to plain segment text without an `UNKNOWN:` prefix
- when no output mode is passed, `--speakers` is the default
- when `--speakers` is active and you do not pass an explicit output path, the default file becomes `<audio>.speakers.txt`
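
The line shapes described for `--text` and `--speakers` can be summarized in a few lines of Python. This is a sketch of the documented output shapes only. The helper names are hypothetical, and truncating fractional seconds (rather than rounding) is an assumption.

```python
from typing import Optional


def clock(seconds: float) -> str:
    """Format seconds as HH:MM:SS, dropping the fractional part (assumption)."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def format_line(
    start_s: float,
    text: str,
    speaker: Optional[str] = None,
    timestamps: bool = True,
) -> str:
    # No speaker label means plain segment text, without an "UNKNOWN:" prefix.
    body = f"{speaker}: {text}" if speaker else text
    return f"{clock(start_s)} - {body}" if timestamps else body


print(format_line(754.9, "hello", "SPEAKER_01"))                    # 00:12:34 - SPEAKER_01: hello
print(format_line(754.9, "hello", "SPEAKER_01", timestamps=False))  # SPEAKER_01: hello
print(format_line(754.9, "hello"))                                  # 00:12:34 - hello
```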

Environment overrides:

- `FSCRIPT_MODEL_DIR`
- `FSCRIPT_MODEL_PACKAGE`
- `FSCRIPT_MODEL_URL`
- `FSCRIPT_DIARIZATION_BINARY`

## Optional diarization

`fscript` keeps the speaker-aware path as the default; pass `--backend=none` to opt out of diarization.

By default, it:

1. runs the normal Parakeet ASR flow first
2. releases the ASR model
3. runs a separate `fluidaudiocli` diarization subprocess
4. merges diarization windows into ASR segments by temporal overlap
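
The final merge step can be sketched in Python as follows. This is one plausible reading of "by temporal overlap": each ASR segment takes the speaker whose diarization windows share the most time with it. The `start`/`end` field names and the tie-breaking behavior are assumptions, and the real merge logic may differ.

```python
def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, windows):
    """segments: dicts with 'start'/'end'; windows: (start, end, speaker) tuples."""
    labeled = []
    for seg in segments:
        totals = {}
        for w_start, w_end, speaker in windows:
            shared = overlap(seg["start"], seg["end"], w_start, w_end)
            if shared > 0:
                totals[speaker] = totals.get(speaker, 0.0) + shared
        out = dict(seg)
        if totals:
            # Pick the speaker with the largest total overlap.
            out["speaker"] = max(totals, key=totals.get)
        labeled.append(out)  # segments with no overlap stay unlabeled
    return labeled


segments = [{"start": 0, "end": 5}, {"start": 5, "end": 9}, {"start": 20, "end": 22}]
windows = [(0, 4.5, "SPEAKER_01"), (4.5, 10, "SPEAKER_02")]
for seg in assign_speakers(segments, windows):
    print(seg)
```

The third segment falls outside every diarization window, so it carries no `speaker` key, which matches the fallback to plain segment text described for speaker-aware output.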

Backends:

- `--backend=coreml`: default `FluidInference/speaker-diarization-coreml` path via `fluidaudiocli process --mode offline`
- `--backend=lseend-dihard3`: alternate `FluidInference/ls-eend-coreml` DIHARD III path via `fluidaudiocli lseend --variant dihard3`
  - defaults to `--threshold 0.3`
- `--backend=none`: skip diarization entirely

Controls:

- `-n N` / `--num-speakers N` is forwarded only to the default `coreml` backend
- `-t N` / `--threshold N` overrides the default diarization threshold for `lseend-dihard3`
- `lseend-dihard3` does not support `--num-speakers`; use the default threshold or override it with `-t` / `--threshold`

If `fluidaudiocli` is missing, `fscript` now returns a clear backend error instead of silently falling back.

## Defaults

- model dir:
  - macOS: `~/Library/Application Support/fast-transcript/models/parakeet-tdt-0.6b-v3-int8`
  - Linux: `~/.local/share/fast-transcript/models/parakeet-tdt-0.6b-v3-int8`
- model package cache:
  - macOS: `~/Library/Caches/fast-transcript/parakeet-v3-int8.tar.gz`
  - Linux: `~/.cache/fast-transcript/parakeet-v3-int8.tar.gz`
- model URL: `https://huggingface.co/brenorb/parakeet-tdt-0.6b-v3-int8-onnx-bundle/resolve/main/parakeet-v3-int8.tar.gz?download=1`
- chunk seconds: `120`
- chunk overlap seconds: `2`
- default backend: `coreml`
- cleaning: on
- default output path: `<audio>.speakers.txt`
- output path with `--json`: `<audio>.transcript.json`
- output path with `--text`: `<audio>.transcript.txt`
- output path with `--srt`: `<audio>.srt`
- output path with `--vtt`: `<audio>.vtt`
- output path with `--speakers`: `<audio>.speakers.txt`
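
The output-path defaults above, together with the rule that an existing directory receives the default filename, can be modeled in a short Python sketch. It assumes `<audio>` means the input file name without its final extension, which matches `lecture.mp3` producing `lecture.speakers.txt`. `resolve_output` and the mode names are hypothetical, and remote URLs are not covered.

```python
from pathlib import Path
from typing import Optional

# Default suffix per output mode, taken from the list above.
DEFAULT_SUFFIX = {
    "speakers": ".speakers.txt",
    "json": ".transcript.json",
    "text": ".transcript.txt",
    "srt": ".srt",
    "vtt": ".vtt",
}


def resolve_output(audio: Path, mode: str = "speakers", output: Optional[Path] = None) -> Path:
    default_name = audio.stem + DEFAULT_SUFFIX[mode]
    if output is None:
        return audio.with_name(default_name)  # next to the input file
    if output.is_dir():
        return output / default_name  # existing directory: default filename inside it
    return output  # explicit file path wins


print(resolve_output(Path("talks/lecture.mp3")))         # talks/lecture.speakers.txt
print(resolve_output(Path("talks/lecture.mp3"), "srt"))  # talks/lecture.srt
```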

## Benchmarks

These are **local development benchmarks**, not universal claims. They were run on the same Apple Silicon Mac used during development, on Portuguese lecture audio, with every engine compared under the same workflow.

### 2-minute lecture clip

| Engine | Setup | Speed | Peak RSS | Notes |
| --- | --- | ---: | ---: | --- |
| **fast-transcript** | Parakeet ONNX | **13.06x** real-time | **2.25 GB** | Best balance of speed and reliability |
| `mlx-whisper` | `whisper-large-v3-turbo` | `5.25x` | `1.70 GB` | Good quality, slower |
| `parakeet-mlx` | tuned for quality | `4.92x` | `1.29 GB` | Needed substantial tuning |
| `parakeet-mlx` | raw greedy | `10.16x` | `0.57 GB` | Faster on short audio, drifted into English on longer PT-BR |
| `insanely-fast-whisper` | `whisper-large-v3` CPU | `0.30x` | `6.18 GB` | Accurate, but too slow here |
| `insanely-fast-whisper` | MPS + fallback | `0.31x` | `3.04 GB` | Small gain, same general problem |

### Long lecture run

| Engine | Audio | Speed | Peak RSS | Notes |
| --- | --- | ---: | ---: | --- |
| **fast-transcript** | `29m47s` lecture | **13.38x** real-time | **2.51 GB** | Stable long run with default chunking |
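
The speed column is a real-time factor: audio duration divided by wall-clock transcription time. A quick Python check of the long-run numbers, using only the figures reported in this README:

```python
def to_seconds(minutes: int, secs: float) -> float:
    return minutes * 60 + secs


audio_s = to_seconds(29, 47)                    # 1787 s of lecture audio
rtf_from_rounded = audio_s / to_seconds(2, 14)  # using the rounded "about 2m14s"
implied_wall_s = audio_s / 13.38                # wall time implied by the reported factor

print(f"{rtf_from_rounded:.2f}x")  # 13.34x
print(f"{implied_wall_s:.1f} s")   # 133.6 s
```

The small gap between 13.34x and the reported 13.38x is consistent with the wall time having been rounded up to 2m14s from roughly 133.6 s.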

### Practical reading

- `fast-transcript` was not the absolute fastest option in every synthetic case
- it **was** the best result once long Portuguese lecture audio, transcript quality, and unattended runs all mattered at the same time
- that is the target workload for this repo

## Output format

Default output is speaker-aware text and includes:

- segment timestamps
- speaker labels when diarization returns them
- cleaned repeated-word runs unless you pass `--raw`

JSON output via `--json` includes:

- merged transcript text
- model path
- original input path
- prepared WAV path
- whether a remote URL used manual subtitles or the local model
- whether `ffmpeg` normalization was used
- load time
- transcribe time
- chunk configuration
- per-chunk timing
- transcript `segments`
- optional `speaker_diarization` metadata

When diarization is enabled, each transcript segment may include:

- `speaker`
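
As a small example of consuming the JSON payload, the Python sketch below counts transcript segments per speaker. It relies only on the `segments` and `speaker` keys named above. The sample payload is made up for illustration, and real files contain additional fields whose exact names are not listed in this README.

```python
import json
from collections import Counter


def segments_per_speaker(payload: dict) -> Counter:
    """Count segments by their optional 'speaker' label."""
    counts = Counter()
    for segment in payload.get("segments", []):
        counts[segment.get("speaker", "(no speaker)")] += 1
    return counts


# Hypothetical minimal payload; load a real file with json.load(open(path)) instead.
sample = json.loads(
    '{"segments": [{"speaker": "SPEAKER_01"}, {"speaker": "SPEAKER_02"},'
    ' {"speaker": "SPEAKER_01"}, {}]}'
)
print(segments_per_speaker(sample))
```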

Alternative output modes:

- `--speakers`: speaker-aware text with timestamps
- `--speakers=plain`: speaker-aware text without timestamps
- `--text`: transcript text with segment timestamps
- `--text=plain`: transcript text without timestamps
- `--json`: structured JSON benchmark/transcript payload
- `--srt`: subtitle file
- `--vtt`: subtitle file

## Motivation

This project is optimized for **large lectures and classes**, including files in the **30-minute to 2-hour** range, where:

- startup friction matters
- background CPU usage matters
- memory spikes matter
- brittle hand-tuned command lines become a tax

The design goal is not “highest benchmark on a cherry-picked GPU server”.
The goal is “transcribe big local lecture audio fast enough that you actually keep using it”.

## Inspiration

This project was heavily informed by:

- [Handy](https://github.com/cjpais/Handy)
- [GLaDOS](https://github.com/dnhkng/GLaDOS)
- [transcribe-rs](https://github.com/cjpais/transcribe-rs)

In particular, the ONNX Parakeet path here was shaped by the packaging and implementation ideas used in Handy and GLaDOS.

## Default model bundle

The default auto-download bundle is published in our own Hugging Face model repository:

- [brenorb/parakeet-tdt-0.6b-v3-int8-onnx-bundle](https://huggingface.co/brenorb/parakeet-tdt-0.6b-v3-int8-onnx-bundle)

This keeps the default install path tied to the exact validated tarball instead of an app-specific blob host.

## License

MIT
