Metadata-Version: 2.4
Name: demucs-onnx
Version: 0.2.0
Summary: Run and export HT-Demucs / Demucs music source separation as ONNX. Pure numpy + onnxruntime inference (no PyTorch). Karaoke / acapella CLI, auto-resampling, auto execution provider routing, fp16-weight downloads, MP3 output. Fixes the 4 blockers that prevent vanilla torch.onnx.export from working on htdemucs.
Project-URL: Homepage, https://stemsplit.io
Project-URL: Documentation, https://github.com/StemSplit/demucs-onnx#readme
Project-URL: Repository, https://github.com/StemSplit/demucs-onnx
Project-URL: Hugging Face, https://huggingface.co/StemSplitio
Project-URL: Bug Tracker, https://github.com/StemSplit/demucs-onnx/issues
Author-email: StemSplit <team@stemsplit.io>
License: MIT
License-File: LICENSE
Keywords: acapella,audio-to-audio,demucs,demucs-android,demucs-ios,demucs-mobile,demucs-onnx-export,htdemucs,htdemucs-onnx,karaoke,music-source-separation,onnx,onnxruntime,source-separation,stem-separation,vocal-isolation,vocal-remover
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Multimedia :: Sound/Audio
Classifier: Topic :: Multimedia :: Sound/Audio :: Analysis
Classifier: Topic :: Multimedia :: Sound/Audio :: Conversion
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: huggingface-hub>=0.24
Requires-Dist: numpy>=1.24
Requires-Dist: onnxruntime>=1.17
Requires-Dist: soundfile>=0.12
Requires-Dist: soxr>=0.3
Requires-Dist: tqdm>=4.65
Provides-Extra: all
Requires-Dist: lameenc>=1.6; extra == 'all'
Provides-Extra: dev
Requires-Dist: lameenc>=1.6; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.5; extra == 'dev'
Provides-Extra: export
Requires-Dist: demucs==4.0.1; extra == 'export'
Requires-Dist: onnx>=1.16; extra == 'export'
Requires-Dist: onnxscript>=0.1.0; extra == 'export'
Requires-Dist: torch<2.5,>=2.4; extra == 'export'
Requires-Dist: torchaudio<2.5,>=2.4; extra == 'export'
Provides-Extra: mp3
Requires-Dist: lameenc>=1.6; extra == 'mp3'
Description-Content-Type: text/markdown

# demucs-onnx

[![PyPI](https://img.shields.io/pypi/v/demucs-onnx.svg)](https://pypi.org/project/demucs-onnx/)
[![Python](https://img.shields.io/pypi/pyversions/demucs-onnx.svg)](https://pypi.org/project/demucs-onnx/)
[![License: MIT](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)

> **The canonical way to run and export HT-Demucs / Demucs music source
> separation as ONNX.** Pure numpy + onnxruntime at inference (no PyTorch),
> and a one-liner export pipeline that fixes the four known blockers in
> `torch.onnx.export`. Powers the [StemSplit](https://stemsplit.io)
> production stack.

```bash
pip install 'demucs-onnx[mp3]'

# One command -> karaoke instrumental as a shareable MP3.
demucs-onnx separate song.mp3 out/ --karaoke --mp3
# writes out/karaoke.mp3  (drums + bass + other, vocals removed)

# Or every stem, automatically picking the best execution provider on this host.
demucs-onnx separate song.mp3 out/
# writes out/drums.wav out/bass.wav out/other.wav out/vocals.wav
```

That's the whole thing. Models auto-download from the Hugging Face Hub on
first run and are cached for later runs. Inputs at any sample rate or
channel count (48 kHz, 22 kHz, mono) work transparently: audio is
resampled to 44.1 kHz for inference and resampled back, so the output
matches the sample rate of the file you put in.

---

## Why this package exists

For the entire history of the [`demucs`](https://github.com/facebookresearch/demucs)
repo (2021 – 2026) **nobody on PyPI has shipped working ONNX export
tooling** for HT-Demucs. Searching GitHub turns up half a dozen abandoned
forks, all stuck on one of four blockers, all without a working `.onnx`
file to show for it. The official demucs README has no mention of ONNX.

We solved it. This package ships:

1. A **pure-numpy + onnxruntime inference path** that runs the official
   HT-Demucs FT models with no PyTorch dependency. Install footprint
   drops from ~2 GB (PyTorch) to ~50 MB (onnxruntime).
2. **A one-call export pipeline** — `export_to_onnx("htdemucs_ft", ...)` —
   that applies all four patches, parity-checks the output against
   PyTorch fp32, and only writes the file if max abs diff < 1e-3.
3. The same patches as **independent, grep-able modules** (`stft.py`,
   `mha.py`, `pos_embed.py`, `segment.py`) so you can debug your own
   exports of related architectures.

The exported models are also mirrored as five Hugging Face repos under
[`StemSplitio`](https://huggingface.co/StemSplitio) for direct download.

| Want to … | Use this |
|---|---|
| Run htdemucs_ft on CPU / mobile / web with no PyTorch | `from demucs_onnx import separate` |
| Convert your own demucs checkpoint to ONNX | `from demucs_onnx.export import export_to_onnx` |
| Skip the infrastructure entirely | The hosted [StemSplit API](https://stemsplit.io/developers) |

---

## What's new in v0.2.0 — the UX bundle

- 🎤 **`--karaoke` shortcut** — one flag, instant karaoke instrumental
  (sum of `drums`/`bass`/`other`, vocals removed).
- 🔀 **`--mix-stems vocals,drums`** — write a *single* file that's the
  sum of whichever stems you list. Great for "vocals + drums only"
  remix beds, acapella + drums tracks, etc.
- 🎧 **`--mp3` output** with `--bitrate 192k` (32-320 kbps). Powered by
  the tiny `lameenc` wheel — no ffmpeg required.
- ⚡ **Auto execution-provider routing**: `providers="auto"` (the new
  default) picks CoreML on macOS arm64, CUDA on Linux+NVIDIA, DML on
  Windows DX12, CPU otherwise. No more `--provider coreml` boilerplate.
- 🪶 **fp16-weight downloads** with `--small` / `precision="fp16weights"`:
  166 MB per model instead of 316 MB (1.91× smaller). Same runtime
  memory and latency, max abs diff vs fp32 is ~6e-5.
- 🎚️ **Auto-resampling**: any sample rate input (8 kHz to 192 kHz, mono
  or multi-channel) is transparently resampled to 44.1 kHz for
  inference and back to the input rate before writing.
- 📊 **Progress bar** via `tqdm` when stdout is a TTY (`--quiet` to
  silence everything, `--verbose` for the old chunk-by-chunk log).

See [`CHANGELOG.md`](CHANGELOG.md) for the full diff vs v0.1.0.

---

## Comparison vs alternatives

| Project | Working ONNX export? | Working ONNX inference? | PyPI? |
|---|---|---|---|
| **demucs-onnx** *(this)* | **Yes**, parity-verified to within 1.71e-4 | **Yes**, no torch needed | **Yes** |
| `facebookresearch/demucs` | No (4 blockers, see below) | n/a | Yes (PyTorch only) |
| `lstm-mode/demucs-onnx` (GH fork) | Stuck on STFT complex blocker | n/a | No |
| Various Stack Overflow gists | Each stuck on one of the 4 blockers | n/a | No |
| `mvsep` / Audio Separator GUIs | Use bundled MDX/UVR ONNX, not htdemucs | Yes for MDX, not htdemucs | n/a |

If you know of a comparable working solution that this table misses,
please [open an issue](https://github.com/StemSplit/demucs-onnx/issues)
so we can update it.

---

## Quick start

### Install

```bash
pip install demucs-onnx                # inference only — onnxruntime + numpy + soundfile + soxr
pip install "demucs-onnx[mp3]"         # adds the lameenc encoder for --mp3 output
pip install "demucs-onnx[export]"      # adds torch + demucs for the export pipeline
```

### Separate (Python)

```python
from demucs_onnx import separate

# Full 4-stem bag (default). Auto-downloads from HF on first run, auto
# picks the best execution provider for this host (CoreML / CUDA / DML).
stems = separate("song.mp3")
# stems: {"drums": ndarray (2, S), "bass": ..., "other": ..., "vocals": ...}

# Just one stem — 4× faster, 75% less RAM, model size 316 MB instead of 1.26 GB.
from demucs_onnx import separate_stem
vocals = separate_stem("song.mp3", "vocals")

# Smaller download (166 MB per stem instead of 316 MB), no runtime cost.
stems = separate("song.mp3", precision="fp16weights")

# Write straight to MP3, including a karaoke instrumental mix.
separate(
    "song.mp3", "stems/",
    output_format="mp3", bitrate_kbps=192,
    mix_stems=("drums", "bass", "other"), mix_output_name="karaoke",
)
```

### Separate (CLI)

```bash
# Killer feature — one command -> karaoke.mp3 ready to share.
demucs-onnx separate song.mp3 stems/ --karaoke --mp3

# All 4 stems, auto provider (CoreML on macOS, CUDA on Linux, etc).
demucs-onnx separate song.mp3 stems/

# Single specialist mode — 4x faster than the bag.
demucs-onnx separate song.mp3 stems/ --stem vocals

# Smaller download (1.91x), same runtime cost.
demucs-onnx separate song.mp3 stems/ --small

# Custom mix-down: write one file that's vocals + drums only.
demucs-onnx separate song.mp3 stems/ --mix-stems vocals,drums --mp3

# Explicit provider override (auto is the default).
demucs-onnx separate song.mp3 stems/ --providers coreml
demucs-onnx separate song.mp3 stems/ --providers cuda
demucs-onnx separate song.mp3 stems/ --providers dml

demucs-onnx list-models
```

### Export (Python)

```python
from demucs_onnx.export import export_to_onnx

# Export every specialist of htdemucs_ft into out/ as 4 .onnx files.
paths = export_to_onnx("htdemucs_ft", "out/")
# paths == {"drums": Path("out/htdemucs_ft_drums.onnx"), "bass": ..., ...}

# Export just the vocals specialist to a single file.
export_to_onnx("htdemucs_ft", "vocals.onnx", stem="vocals")

# Export your own fine-tuned checkpoint.
from pathlib import Path
export_to_onnx(Path("my_finetune.th"), "my_finetune.onnx")
```

### Export (CLI)

```bash
demucs-onnx export htdemucs_ft out/                    # all 4 specialists
demucs-onnx export htdemucs_ft drums.onnx --stem drums # one stem -> single file
demucs-onnx export htdemucs_ft out/ --opset 17         # change opset
demucs-onnx export htdemucs_ft out/ --no-parity-check  # skip the parity check (advanced, not recommended)
```

### Mobile / web (after exporting)

```swift
// iOS / Swift, ORT 1.17+
import onnxruntime_objc
let opts = try ORTSessionOptions()
try opts.appendCoreMLExecutionProvider(with: ORTCoreMLExecutionProviderOptions())
let session = try ORTSession(env: env,
                              modelPath: bundle.path(forResource: "htdemucs_ft_vocals",
                                                     ofType: "onnx")!,
                              sessionOptions: opts)
```

```js
// Browser / web, onnxruntime-web
import * as ort from "onnxruntime-web";
const session = await ort.InferenceSession.create("htdemucs_ft_vocals.onnx", {
  executionProviders: ["wasm"],
  graphOptimizationLevel: "all",
});
const tensor = new ort.Tensor("float32", audioBuffer, [1, 2, 343980]);
const out = await session.run({ mix: tensor });
```

---

## The 4 blockers explained

These are the four things that break vanilla `torch.onnx.export` on
HT-Demucs (PyTorch 2.4 / opset 17). Each lives in its own grep-able
module so you can lift the fix into a different project.

### Blocker 1 — `torch.stft` returns complex tensors

```python
# demucs/htdemucs.py
z = torch.stft(x, n_fft, hop_length, return_complex=True)  # complex64 output
```

`torch.onnx.export` raises `Exporting STFT does not currently support
complex types`. The dynamo exporter sometimes lowers it, but the resulting
graph fails ORT shape inference.

**Fix** — [`demucs_onnx/export/stft.py`](src/demucs_onnx/export/stft.py).
Replace `torch.stft` with a `Conv1d` whose kernels are precomputed
sin/cos DFT bases for `n_fft = 4096`, `hop = 1024`, hann window,
`normalized=True`. The output is two real channels (real, imag) instead
of one complex channel. Inverse: a matching `ConvTranspose1d` plus an
`OLA(window²)` envelope normalisation. The class also overrides demucs's
own `_spec` / `_ispec` / `_magnitude` / `_mask` methods so the rest of
the network sees `(B, C, 2, F, T)` real tensors throughout.

Verified to 5×10⁻⁶ max abs diff against `torch.stft` on real audio.
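
The idea behind the fix can be illustrated without PyTorch. The sketch below is not the code in `stft.py`; it is a minimal stdlib-only illustration, assuming a periodic Hann window, `1/sqrt(n_fft)` scaling, and no centre padding, with toy sizes (`n_fft = 8`, `hop = 2`). The helper names are hypothetical. It shows that a strided correlation against windowed cos / -sin kernels reproduces the real and imaginary parts of a complex STFT.

```python
import cmath
import math


def hann(n_fft):
    """Periodic Hann window (assumed here; matches torch.hann_window's default)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]


def dft_kernels(n_fft):
    """Windowed cos / -sin rows, one per frequency bin 0..n_fft/2."""
    scale = 1.0 / math.sqrt(n_fft)  # same scaling as normalized=True
    win = hann(n_fft)
    cos_rows, sin_rows = [], []
    for k in range(n_fft // 2 + 1):
        angles = [2 * math.pi * k * n / n_fft for n in range(n_fft)]
        cos_rows.append([scale * w * math.cos(a) for w, a in zip(win, angles)])
        sin_rows.append([-scale * w * math.sin(a) for w, a in zip(win, angles)])
    return cos_rows, sin_rows


def real_stft(signal, n_fft, hop):
    """Strided 1-D convolution with the kernels; returns (real, imag) as [frame][bin]."""
    cos_rows, sin_rows = dft_kernels(n_fft)
    real, imag = [], []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = signal[start:start + n_fft]
        real.append([sum(c * x for c, x in zip(row, frame)) for row in cos_rows])
        imag.append([sum(s * x for s, x in zip(row, frame)) for row in sin_rows])
    return real, imag


def complex_stft(signal, n_fft, hop):
    """Reference: the same transform written with complex exponentials."""
    scale = 1.0 / math.sqrt(n_fft)
    win = hann(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = signal[start:start + n_fft]
        frames.append([
            scale * sum(w * x * cmath.exp(-2j * math.pi * k * n / n_fft)
                        for n, (w, x) in enumerate(zip(win, frame)))
            for k in range(n_fft // 2 + 1)
        ])
    return frames


signal = [math.sin(0.3 * i) + 0.5 * math.cos(1.1 * i) for i in range(32)]
real, imag = real_stft(signal, n_fft=8, hop=2)
ref = complex_stft(signal, n_fft=8, hop=2)
max_diff = max(
    max(abs(r - z.real), abs(i - z.imag))
    for rf, imf, zf in zip(real, imag, ref)
    for r, i, z in zip(rf, imf, zf)
)
print(len(real), len(real[0]), max_diff < 1e-12)
```

Because every kernel entry is a plain real constant, the same computation maps onto a `Conv1d` with fixed weights, which is an operator ONNX exporters already support.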

### Blocker 2 — `model.segment` is a `fractions.Fraction`

```python
# demucs/htdemucs.py
self.segment = Fraction(39, 5)  # = 7.8 seconds
```

`torch._dynamo` allow-lists a small set of "user-defined classes" it can
trace through. `Fraction` is not on it (PyTorch 2.4) and graph capture
crashes. The legacy exporter is more permissive but still produces a
wrong graph because `Fraction` arithmetic is opaque to it.

**Fix** — [`demucs_onnx/export/segment.py`](src/demucs_onnx/export/segment.py).
Coerce to `float`. Mathematically identical at inference, side-steps both
exporter limitations.
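
A minimal sketch of what such a coercion could look like, using a stand-in object rather than a real HT-Demucs model. The function name `coerce_segment_sketch` and the `samplerate` attribute on the toy object are assumptions for illustration; the package's own helper is `coerce_segment_to_float` and may be written differently.

```python
from fractions import Fraction
from types import SimpleNamespace


def coerce_segment_sketch(model):
    """Replace a Fraction-valued `segment` attribute with an equivalent float."""
    if isinstance(model.segment, Fraction):
        model.segment = float(model.segment)
    return model


# Stand-in for an HT-Demucs model: only the two attributes used here.
toy_model = SimpleNamespace(segment=Fraction(39, 5), samplerate=44100)
samples_before = int(toy_model.segment * toy_model.samplerate)

coerce_segment_sketch(toy_model)
samples_after = int(toy_model.segment * toy_model.samplerate)

print(type(toy_model.segment).__name__, samples_before, samples_after)
```

In this toy case the segment length in samples is the same before and after the coercion, which is the sense in which the change is harmless at inference.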

### Blocker 3 — `random.randrange` in the transformer pos-embedding

```python
# demucs/transformer.py
shift = random.randrange(self.sin_random_shift + 1)  # = 0 at eval
```

Used during training for positional-embedding augmentation. At eval,
`sin_random_shift = 0` so the call always returns 0, but neither the
legacy exporter nor dynamo can trace through a call to `random` —
`UnsupportedOperatorError` and graph break, respectively.

**Fix** — [`demucs_onnx/export/pos_embed.py`](src/demucs_onnx/export/pos_embed.py).
Monkey-patch `CrossTransformerEncoder._get_pos_embedding` with a
deterministic version that hardcodes `shift = 0`. Mathematically
identical at inference time.
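
The patching pattern can be sketched on a toy class. `ToyEncoder` below is a hypothetical stand-in, not demucs's `CrossTransformerEncoder`, and the sketch patches a single instance rather than the class; it only shows that replacing the method with a fixed `shift = 0` leaves the eval-time result unchanged.

```python
import random
import types


class ToyEncoder:
    """Stand-in for a transformer encoder with a random positional shift."""

    sin_random_shift = 0  # the eval-time value described above

    def _get_pos_embedding(self, length):
        shift = random.randrange(self.sin_random_shift + 1)  # call the exporter cannot trace
        return [position + shift for position in range(length)]


def deterministic_pos_embedding(self, length):
    """Same computation with the shift fixed at 0, so no call to `random`."""
    shift = 0
    return [position + shift for position in range(length)]


encoder = ToyEncoder()
before = encoder._get_pos_embedding(4)

# Bind the replacement to this one instance; the class itself is untouched.
encoder._get_pos_embedding = types.MethodType(deterministic_pos_embedding, encoder)
after = encoder._get_pos_embedding(4)

print(before == after)
```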

### Blocker 4 — `aten::_native_multi_head_attention` has no ONNX symbolic

```python
# torch/nn/modules/activation.py, inside nn.MultiheadAttention.forward
return torch._native_multi_head_attention(...)  # fused C++ kernel
```

`nn.MultiheadAttention` dispatches to a fast fused C++ kernel when its
inputs satisfy a fast-path check. The fused kernel has no ONNX symbolic:
the exporter raises `UnsupportedOperatorError: Exporting the operator
'aten::_native_multi_head_attention' to ONNX opset version 17 is not
supported`.

**Fix** — [`demucs_onnx/export/mha.py`](src/demucs_onnx/export/mha.py).
Replace `nn.MultiheadAttention.forward` (per instance, via
`types.MethodType`) with a manual scaled-dot-product attention built
from `Linear` / `bmm` / `softmax`. The exporter handles those primitives
without complaint. Output matches the fused kernel to within fp32
round-off.
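
For readers unfamiliar with the primitive being rebuilt, here is a stdlib-only sketch of single-head scaled-dot-product attention. It leaves out the input/output `Linear` projections, the split into heads, and batching that a real `nn.MultiheadAttention` replacement needs, so treat it as an illustration of the core formula rather than the code in `mha.py`.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head softmax(Q K^T / sqrt(d)) V on lists of row vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    output = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        output.append([
            sum(w * v[col] for w, v in zip(weights, values))
            for col in range(len(values[0]))
        ])
    return output


# Identical keys give uniform weights, so each output row is the mean of the values.
queries = [[1.0, 0.0], [0.0, 2.0]]
keys = [[0.5, 0.5], [0.5, 0.5]]
values = [[1.0, 3.0], [5.0, 7.0]]
result = attention(queries, keys, values)
print(result)
```

Each step is a matrix product, a scale, or a softmax, all of which have standard ONNX operators.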

### Net result

After all four patches, end-to-end parity vs PyTorch fp32:

| Stem | max abs diff (1×2×343980 random input) |
|---|---:|
| drums | 1.63 × 10⁻⁴ |
| bass | 1.42 × 10⁻⁴ |
| other | 1.71 × 10⁻⁴ |
| vocals | 1.55 × 10⁻⁴ |

…and the ONNX graph runs in `onnxruntime` CPU at **1.31× the speed of
PyTorch CPU** on Apple M4 Pro (no GPU).
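
The parity metric in the table is a plain maximum absolute difference. A minimal sketch of such a check, with hypothetical helper names and toy values standing in for flattened PyTorch and ONNX outputs (the package's own check may differ in detail):

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same length")
    return max(abs(r - c) for r, c in zip(reference, candidate))


def passes_parity(reference, candidate, tolerance=1e-3):
    """True when the candidate stays within tolerance of the reference."""
    return max_abs_diff(reference, candidate) < tolerance


# Toy stand-ins for flattened PyTorch and ONNX outputs.
torch_out = [0.1250, -0.5000, 0.7500, 0.0000]
onnx_out = [0.1251, -0.5001, 0.7498, 0.0001]

diff = max_abs_diff(torch_out, onnx_out)
print(passes_parity(torch_out, onnx_out))
print(passes_parity(torch_out, [0.2, -0.5, 0.75, 0.0]))
```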

---

## Pre-trained ONNX models on Hugging Face

We host five companion model repos. The Python package downloads from
these automatically on first run; you can also fetch them by hand.

| Repo | Stems | Size | Use case |
|---|---|---:|---|
| [`StemSplitio/htdemucs-ft-onnx`](https://huggingface.co/StemSplitio/htdemucs-ft-onnx) | all 4 | 1.26 GB | Full bag, single download |
| [`StemSplitio/htdemucs-ft-drums-onnx`](https://huggingface.co/StemSplitio/htdemucs-ft-drums-onnx) | drums | 316 MB | Drum extraction, beat transcription |
| [`StemSplitio/htdemucs-ft-bass-onnx`](https://huggingface.co/StemSplitio/htdemucs-ft-bass-onnx) | bass | 316 MB | Bassline isolation, mix rebalancing |
| [`StemSplitio/htdemucs-ft-other-onnx`](https://huggingface.co/StemSplitio/htdemucs-ft-other-onnx) | other | 316 MB | Karaoke instrumental, sample-flipping |
| [`StemSplitio/htdemucs-ft-vocals-onnx`](https://huggingface.co/StemSplitio/htdemucs-ft-vocals-onnx) | vocals | 316 MB | **#1 open-source vocal SDR** — vocal removal, acapella, karaoke |

All five are MIT-licensed and parity-verified to < 1e-3 vs PyTorch fp32.

---

## Performance

Real measurements on Apple M4 Pro (8-core CPU; only the PyTorch MPS row uses the GPU):

| Mode | Per 7.8-s segment | Per 3-min song | RTF |
|---|---:|---:|---:|
| `demucs-onnx`, single specialist (CPU) | **1.59 s** | **~22 s** | 0.20 |
| `demucs-onnx`, full bag (CPU) | 6.4 s | ~88 s | 0.49 |
| PyTorch CPU (single specialist) | 2.09 s | ~29 s | 0.26 |
| PyTorch MPS (full bag) | 1.0 s | ~12 s | 0.07 |

CUDA / DirectML / CoreML ONNX EPs are all ≥ 5× faster than the CPU EP
on real GPUs — see the model card on each HF repo for hardware-specific
numbers.

---

## API

### `demucs_onnx.separate(input, output_dir=None, *, model="htdemucs_ft", stems=None, providers="auto", precision="fp32", cache_dir=None, token=None, verbose=False, progress=True, output_format="wav", bitrate_kbps=192, mix_stems=None, mix_output_name="mix") -> dict[str, np.ndarray]`

Run separation on an audio file. Returns a dict mapping each stem name
to a float32 array of shape `(channels, samples)` **at the input file's
native sample rate** (audio is resampled for inference and back). If
`output_dir` is given, also writes `<stem>.wav` (or `.mp3`) files into
it; pass `mix_stems=("drums","bass","other")` to additionally write a
single karaoke instrumental file.

`model` accepts:

- `"htdemucs_ft"` (default) — full 4-stem bag.
- `"htdemucs_ft_<stem>"` or just `"<stem>"` — single specialist
  (`drums` / `bass` / `other` / `vocals`).

`providers` accepts:

- `"auto"` (default, new in v0.2.0) — auto-detect the best EP for this
  host (CoreML / CUDA / DML / CPU).
- A short alias (`"cpu"`, `"coreml"`, `"cuda"`, `"dml"`), an explicit
  ORT provider name, or a list of either.

`precision` accepts `"fp32"` (default) or `"fp16weights"`. The latter
downloads a 166 MB variant per stem (1.91× smaller) with identical
runtime memory and latency; max abs diff vs fp32 is ~6e-5.
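
One plausible reading of "fp16 weights, same runtime memory" is that weights are stored as half precision on disk and expanded to fp32 when loaded. That is an assumption, not something this README states. Under that assumption, the stdlib sketch below shows the storage saving and the rounding error on a few made-up weight values; the helper names are hypothetical.

```python
import struct


def to_fp16_bytes(weights):
    """Serialize floats as little-endian IEEE 754 half precision (2 bytes each)."""
    return struct.pack(f"<{len(weights)}e", *weights)


def from_fp16_bytes(blob):
    """Decode half-precision bytes back to Python floats for full-precision compute."""
    return list(struct.unpack(f"<{len(blob) // 2}e", blob))


weights = [0.1, -0.25, 0.333, 1.5]  # made-up values, not real model weights
fp32_blob = struct.pack(f"<{len(weights)}f", *weights)
fp16_blob = to_fp16_bytes(weights)
restored = from_fp16_bytes(fp16_blob)

worst = max(abs(w - r) for w, r in zip(weights, restored))
print(len(fp32_blob), len(fp16_blob), worst)
```

The weight payload halves exactly, while a whole model file shrinks by a little less than 2× because it also holds data that is not weights.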

### `demucs_onnx.auto_select_providers() -> list[str]`

Return the EP list `separate()` would pick on this host. Useful for
debugging — print it from your code if `auto` selects something
surprising.
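
To make the routing rule concrete, here is a hedged sketch of a platform-to-provider mapping. It is not the package's implementation: `pick_providers_sketch` is a hypothetical name, and the real logic may check more conditions. In practice the arguments would come from `platform.system()`, `platform.machine()`, and `onnxruntime.get_available_providers()`.

```python
def pick_providers_sketch(system, machine, available):
    """Return a preferred accelerator (if usable) followed by the CPU fallback."""
    if system == "Darwin" and machine == "arm64":
        preferred = "CoreMLExecutionProvider"
    elif system == "Linux":
        preferred = "CUDAExecutionProvider"
    elif system == "Windows":
        preferred = "DmlExecutionProvider"
    else:
        preferred = None

    chosen = []
    if preferred in available:
        chosen.append(preferred)
    chosen.append("CPUExecutionProvider")  # always keep a fallback
    return chosen


mac = pick_providers_sketch(
    "Darwin", "arm64", ["CoreMLExecutionProvider", "CPUExecutionProvider"]
)
linux_no_gpu = pick_providers_sketch("Linux", "x86_64", ["CPUExecutionProvider"])
print(mac, linux_no_gpu)
```

Keeping the CPU provider last means a session can still run when the accelerator is present in the build but unusable on the host.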

### `demucs_onnx.describe_runtime() -> dict[str, object]`

Returns `{system, machine, python, onnxruntime, available_providers,
in_browser}`. Print this if `auto` doesn't pick the EP you expect.

### `demucs_onnx.separate_stem(input, stem, output_dir=None, **kwargs) -> np.ndarray`

Shorthand: run only one specialist and return the single stem as a numpy
array. ~4× faster than running the full bag when you only need one stem.

### `demucs_onnx.separate_all(input, output_dir=None, **kwargs) -> dict[str, np.ndarray]`

Shorthand for `separate(..., model="htdemucs_ft")`.

### `demucs_onnx.export.export_to_onnx(checkpoint, output, *, stem=None, stems=None, opset=17, parity_check=True, parity_tolerance=1e-3, ...) -> dict[str, Path]`

Convert a demucs/htdemucs PyTorch checkpoint (by name or `.th` path) to
one or more ONNX files. Applies all four patches, runs a numerical
parity check before writing, and aborts if max abs diff > tolerance.

### `demucs_onnx.export.patch_htdemucs_for_onnx(model) -> nn.Module`

Apply all four patches in place, return the same model. Useful when you
want to keep the patched model around for alternative tracers.

### Individual patches

Each blocker is a single-purpose module so you can pull just one fix
into a different project:

- `demucs_onnx.export.coerce_segment_to_float` — Fraction → float
- `demucs_onnx.export.disable_random_pos_shift` — drop `random.randrange`
- `demucs_onnx.export.onnx_friendly_mha_forward` — manual MHA forward
- `demucs_onnx.export.RealSTFT` / `RealISTFT` — complex STFT replacement

---

## Skip the infrastructure — use the StemSplit API

Don't want to bundle a 316 MB model in your app, manage a GPU pool, or
write overlap-add chunking? Use the
**[StemSplit API](https://stemsplit.io/developers)** instead — same
models under the hood, hosted for you, with credits and a dashboard.

- 🌐 [stemsplit.io](https://stemsplit.io)
- 📘 [Developer docs](https://stemsplit.io/developers/docs)
- 🔌 [API reference](https://stemsplit.io/developers/reference)

Or use the no-code tools that ship the same model family:

- 🎤 [Vocal Remover](https://stemsplit.io/vocal-remover)
- 🎶 [Karaoke Maker](https://stemsplit.io/karaoke-maker)
- 🎙️ [Acapella Maker](https://stemsplit.io/acapella-maker)
- 📺 [YouTube Stem Splitter](https://stemsplit.io/youtube-stem-splitter)

---

## License & attribution

This package is **MIT-licensed**, matching the original HT-Demucs.

Please cite the original authors if you use the model in research:

```bibtex
@inproceedings{rouard2023hybrid,
  title     = {Hybrid Transformers for Music Source Separation},
  author    = {Rouard, Simon and Massa, Francisco and D{\'e}fossez, Alexandre},
  booktitle = {ICASSP},
  year      = {2023}
}
```

- Original PyTorch model: [`facebookresearch/demucs`](https://github.com/facebookresearch/demucs)
- ONNX export, parity verification, packaging, and host inference by [StemSplit](https://stemsplit.io)
- Search keywords: **demucs onnx**, **htdemucs onnx**, **demucs export python**,
  **demucs ios**, **demucs android**, **demucs mobile**, **htdemucs export onnx**,
  **demucs onnxruntime**, **demucs source separation onnx**, **vocal remover onnx**,
  **karaoke onnx**, **acapella extractor onnx**.
