Metadata-Version: 2.4
Name: wav2caption
Version: 0.1.4
Summary: Audio → instrument-aware caption for AI music generation (ACE-Step, Suno, Udio prompts)
Project-URL: Homepage, https://github.com/hinanohart/wav2caption
Project-URL: Repository, https://github.com/hinanohart/wav2caption
Project-URL: Issues, https://github.com/hinanohart/wav2caption/issues
Project-URL: Changelog, https://github.com/hinanohart/wav2caption/blob/main/CHANGELOG.md
Author: wav2caption contributors
License: MIT License
        
        Copyright (c) 2026 wav2caption contributors
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Keywords: ace-step,audio-analysis,audio-captioning,caption-generation,essentia,instrument-detection,mir,mtg-jamendo,music-captioning,music-generation,music-information-retrieval,prompt-generation,suno,text-to-music,udio
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: End Users/Desktop
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Multimedia :: Sound/Audio :: Analysis
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Typing :: Typed
Requires-Python: >=3.9
Requires-Dist: numpy>=1.23.5
Provides-Extra: dev
Requires-Dist: mypy>=1.8; extra == 'dev'
Requires-Dist: pytest-cov>=4.1; extra == 'dev'
Requires-Dist: pytest>=7.4; extra == 'dev'
Requires-Dist: ruff>=0.5; extra == 'dev'
Provides-Extra: essentia
Requires-Dist: essentia-tensorflow>=2.1b6.dev1110; extra == 'essentia'
Description-Content-Type: text/markdown

# wav2caption

[![PyPI](https://img.shields.io/pypi/v/wav2caption.svg)](https://pypi.org/project/wav2caption/)
[![Python](https://img.shields.io/pypi/pyversions/wav2caption.svg)](https://pypi.org/project/wav2caption/)
[![License](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
[![CI](https://github.com/hinanohart/wav2caption/actions/workflows/test.yml/badge.svg)](https://github.com/hinanohart/wav2caption/actions/workflows/test.yml)
[![CodeQL](https://github.com/hinanohart/wav2caption/actions/workflows/codeql.yml/badge.svg)](https://github.com/hinanohart/wav2caption/actions/workflows/codeql.yml)
[![Downloads](https://static.pepy.tech/badge/wav2caption)](https://pepy.tech/project/wav2caption)

**Audio → instrument-aware caption for AI music generation.**
Suno / Udio / ACE-Step prompt generator that describes *what is playing*
and *how it is used* (rhythm / bass / harmony / lead / strings / brass /
synth / vocal).

Point it at a WAV/MP3/FLAC file and get back a structured analysis *and* a
ready-to-paste prompt for [ACE-Step](https://github.com/ace-step/ACE-Step),
[Suno](https://suno.com/), [Udio](https://www.udio.com/), or any other
prompt-conditioned music model.

> **Disclaimer:** This is an **independent third-party tool**. It is **not affiliated with, endorsed by, or sponsored by** Suno, Udio, ACE-Step, Essentia, MTG-Jamendo, or Discogs. Those names appear **nominatively** to identify the downstream prompt formats and upstream models / datasets this tool integrates with. The pretrained model weights are not bundled; once downloaded they remain under their original CC-BY-NC-SA 4.0 license. Users are responsible for verifying that the audio inputs they analyse are properly licensed.

```
live drums, electric guitar, piano, bass, string section, brass section,
D major, 140 BPM, dynamic build-up, breakdown section
```

Under the hood it combines Essentia's TensorFlow graphs (MTG-Jamendo 40-class
instrument head + Discogs-EffNet embeddings) with classical MIR features
(BPM, key, loudness, spectral centroid, pitch range) and a small role
taxonomy, so the caption describes both **what is playing** and **how it is
used** (rhythm / bass / harmony / lead / strings / brass / synth / vocal).

---

## Why this exists

Most "audio → tag" tools stop at a flat list of instruments. When you feed
that into a prompt-conditioned music model, the arrangement gets lost —
instruments are named but their *role* is missing, and dynamics are dropped
entirely. `wav2caption` was factored out of a production pipeline that
captioned hundreds of reference tracks for ACE-Step Lego-mode generation, and
it keeps two things other tools don't:

- **Role grouping.** `drums` and `bass` are not just instruments; they are
  the *rhythm* and *bass* roles. A section that also has `strings` + `brass`
  gets tagged as "string section, brass section" rather than five
  indistinguishable labels.
- **Section features.** Per-window loudness, centroid, and pitch-range give
  you "quiet (breakdown/interlude)", "peak energy (chorus/climax)",
  "staccato stabs", "metallic percussion accents" — the kind of descriptors
  music LLMs actually condition on.

## Install

```bash
pip install wav2caption
# Then opt in to the (AGPL-3.0) Essentia runtime — required for analysis.
pip install "wav2caption[essentia]"
```

> Essentia is distributed under **AGPL-3.0** (or a commercial license
> from MTG-UPF). If you ship a network service built on `wav2caption`,
> you may need to release your source under AGPL-3.0 or buy a commercial
> license. The `wav2caption` code itself is MIT.

### Models

The pretrained weights are **not** bundled (they are licensed CC-BY-NC-SA 4.0,
which permits non-commercial use only). Download them once, then verify the
SHA-256 digests:

```bash
mkdir -p ~/.cache/wav2caption/models
cd ~/.cache/wav2caption/models
curl -LO https://essentia.upf.edu/models/feature-extractors/discogs-effnet/discogs-effnet-bs64-1.pb
curl -LO https://essentia.upf.edu/models/classification-heads/mtg_jamendo_instrument/mtg_jamendo_instrument-discogs-effnet-1.pb

# Captured 2026-04-18 against https://essentia.upf.edu/models/
sha256sum -c <<'EOF'
3ed9af50d5367c0b9c795b294b00e7599e4943244f4cbd376869f3bfc87721b1  discogs-effnet-bs64-1.pb
2e8c3003c722e098da371b6a1f7ad0ce62fac0dcfc09c7c7997d430941196c2a  mtg_jamendo_instrument-discogs-effnet-1.pb
EOF
```

The same check is available programmatically:

```python
from wav2caption import resolve_models, verify_digests
verify_digests(resolve_models())
```

or automatically on every `analyze(...)` call by setting
`WAV2CAPTION_VERIFY_DIGESTS=1` in your environment.
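
Where `sha256sum` is not available (for example on a stock Windows shell), the same comparison can be done with the Python standard library alone. The sketch below is independent of the package's own `verify_digests`, whose internals are not shown in this README; `sha256_of` and `check_models` are hypothetical helper names, and the expected digests are the two listed above.

```python
import hashlib
from pathlib import Path

# Digests copied from the sha256sum listing above.
EXPECTED = {
    "discogs-effnet-bs64-1.pb":
        "3ed9af50d5367c0b9c795b294b00e7599e4943244f4cbd376869f3bfc87721b1",
    "mtg_jamendo_instrument-discogs-effnet-1.pb":
        "2e8c3003c722e098da371b6a1f7ad0ce62fac0dcfc09c7c7997d430941196c2a",
}


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large model files are never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_models(models_dir: Path, expected: dict = EXPECTED) -> dict:
    """Return {filename: bool}; a missing file counts as a failed check."""
    results = {}
    for name, want in expected.items():
        path = models_dir / name
        results[name] = path.is_file() and sha256_of(path) == want
    return results
```

Calling `check_models(Path.home() / ".cache" / "wav2caption" / "models")` returns a per-file pass/fail mapping; anything other than `True` for both files means the download should be discarded and repeated.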

> ⚠️ **Supply-chain note.** The `.pb` files are TensorFlow GraphDefs and
> a maliciously crafted graph can influence what runs inside Essentia.
> Always download over HTTPS from `essentia.upf.edu` and verify the
> digests before first load.

If the models already live somewhere else, point `WAV2CAPTION_MODELS_DIR`
(or `--models-dir`) at that folder instead.

## Quick start

### CLI

```bash
wav2caption song.wav
wav2caption song.wav --json > analysis.json
wav2caption song.wav --section-seconds 5
```

### Example output

On a 3:32 record-grand-prix reference instrumental, `wav2caption
song.wav` produces:

```
=== song.wav ===
duration: 3:32  tempo: 132.9 BPM  key: Eb major (conf 0.87)  danceability: 1.10

[ detected instruments ]
  drums                0.402  ################
  electricguitar       0.308  ############
  bass                 0.286  ###########
  guitar               0.274  ##########
  piano                0.222  ########
  acousticguitar       0.177  #######
  synthesizer          0.176  #######
  violin               0.126  #####
  ...

[ role scores ]
  rhythm             0.468
  acoustic_guitar    0.450
  harmony            0.377
  lead_guitar        0.308
  bass               0.286
  strings            0.219
  synth              0.176
  brass              0.118
  vocal              0.067
  woodwind           0.061

[ sections ]
  0:20-0:30  loud=1301  bright=1019Hz  Eb major
    roles: rhythm=drums(0.44) / lead_guitar=electricguitar(0.37) / bass=bass(0.34) / ...
    features: metallic percussion accents, string harmonies, brass accents
  0:30-0:40  loud=1224  bright=1278Hz  Eb major
    roles: rhythm=drums(0.38) / lead_guitar=electricguitar(0.31) / bass=bass(0.29) / ...
    features: metallic percussion accents, staccato stabs

[ caption ]
  live drums, electric guitar, piano, bass, string section, acoustic guitar,
  Eb major, 133 BPM, dynamic build-up, breakdown section
```

### Python

```python
from wav2caption import analyze, build_caption

result = analyze("song.wav")
print(build_caption(result))

for s in result.sections:
    roles = {r: name for r, (name, _score) in s.roles.items()}
    print(f"{s.start:>5.1f}s  {roles}  {s.features}")
```

`AnalysisResult` is a typed dataclass:

```python
@dataclass
class AnalysisResult:
    path: Path
    duration_sec: float
    bpm: float
    key: str
    scale: str  # "major" | "minor"
    key_confidence: float
    danceability: float
    detected_instruments: list[tuple[str, float]]   # (label, probability)
    role_scores: dict[str, float]                   # aggregated per role
    sections: list[Section]
```
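
Because `detected_instruments` is a plain list of `(label, probability)` pairs, downstream filtering needs nothing beyond the standard library. The sketch below works on a hand-written stand-in list (values copied from the example output above) rather than a real `analyze()` result. `confident_labels` is a hypothetical helper, not part of the package, and the `0.25` cut-off is an arbitrary choice for illustration.

```python
from typing import List, Tuple

# Stand-in for AnalysisResult.detected_instruments, copied from the sample run.
detected: List[Tuple[str, float]] = [
    ("drums", 0.402),
    ("electricguitar", 0.308),
    ("bass", 0.286),
    ("guitar", 0.274),
    ("piano", 0.222),
    ("acousticguitar", 0.177),
]


def confident_labels(pairs: List[Tuple[str, float]], min_prob: float = 0.25) -> List[str]:
    """Keep labels at or above min_prob, highest probability first."""
    ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)
    return [label for label, prob in ranked if prob >= min_prob]


print(confident_labels(detected))
# ['drums', 'electricguitar', 'bass', 'guitar']
```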

## Role taxonomy

| role              | instruments                                                                 |
| ----------------- | --------------------------------------------------------------------------- |
| `rhythm`          | drums, drummachine, beat, percussion, bongo                                 |
| `bass`            | bass, acousticbassguitar, doublebass                                        |
| `harmony`         | piano, electricpiano, keyboard, rhodes, organ, pipeorgan, accordion         |
| `lead_guitar`     | electricguitar                                                              |
| `acoustic_guitar` | acousticguitar, classicalguitar, guitar                                     |
| `strings`         | strings, violin, viola, cello, orchestra                                    |
| `brass`           | brass, trumpet, trombone, horn, saxophone                                   |
| `woodwind`        | flute, clarinet, oboe                                                       |
| `synth`           | synthesizer, pad, sampler, computer                                         |
| `bells`           | bell, harp, harmonica                                                       |
| `vocal`           | voice                                                                       |

The mapping is intentionally opinionated and biased toward *production
arrangement* labels rather than strict orchestration (e.g. `guitar` goes to
`acoustic_guitar` because the MTG-Jamendo label is ambiguous and the
acoustic interpretation is safer for caption conditioning). Override
`ROLE_MAP` if you disagree — it's just a `dict[str, tuple[str, ...]]`.
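
As a rough illustration of what such an override looks like, the sketch below builds a stand-in mapping with the same `dict[str, tuple[str, ...]]` shape and moves `guitar` from `acoustic_guitar` to `lead_guitar`. It deliberately does not import the package: `SKETCH_ROLE_MAP` and `role_of` are hypothetical names, only three rows of the table are reproduced, and whether the real `ROLE_MAP` is best patched in place or replaced by a copy is not specified here.

```python
from typing import Dict, Optional, Tuple

# Three rows copied from the taxonomy table; a stand-in, not the package's ROLE_MAP.
SKETCH_ROLE_MAP: Dict[str, Tuple[str, ...]] = {
    "bass": ("bass", "acousticbassguitar", "doublebass"),
    "lead_guitar": ("electricguitar",),
    "acoustic_guitar": ("acousticguitar", "classicalguitar", "guitar"),
}


def role_of(label: str, role_map: Dict[str, Tuple[str, ...]]) -> Optional[str]:
    """Return the first role whose instrument tuple contains label, else None."""
    for role, instruments in role_map.items():
        if label in instruments:
            return role
    return None


# Build an overridden copy: treat the ambiguous "guitar" label as a lead instrument.
custom = dict(SKETCH_ROLE_MAP)
custom["acoustic_guitar"] = tuple(i for i in custom["acoustic_guitar"] if i != "guitar")
custom["lead_guitar"] = custom["lead_guitar"] + ("guitar",)

print(role_of("guitar", SKETCH_ROLE_MAP))  # acoustic_guitar
print(role_of("guitar", custom))           # lead_guitar
```

Working on a copy keeps the original mapping intact, which makes it easy to compare captions produced under both interpretations.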

## Project layout

```
src/wav2caption/
    __init__.py       # public API
    analyzer.py       # analyze() + build_caption() + dataclasses
    constants.py      # INSTRUMENTS, ROLE_MAP, get_role()
    models.py         # model-path discovery
    cli.py            # wav2caption console script
tests/                # no-Essentia unit tests
```

## Development

```bash
git clone https://github.com/hinanohart/wav2caption
cd wav2caption
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest
ruff check .
mypy src
```

The unit tests intentionally do **not** require Essentia, so CI stays fast
and free of TensorFlow. Real-audio smoke tests belong in `examples/`.

## Verification (sigstore)

Releases published after 2026-05-16 include a sigstore keyless signature bundle
(one `.sigstore` file per artifact) attached to the GitHub Release.

### Verify a PyPI install

```bash
pip download wav2caption==<version> --no-deps -d ./verify
python -m sigstore verify identity \
    --cert-identity 'https://github.com/hinanohart/wav2caption/.github/workflows/release.yml@refs/tags/v<version>' \
    --cert-oidc-issuer 'https://token.actions.githubusercontent.com' \
    ./verify/*.whl ./verify/*.tar.gz
```

Before running the verify command, download the corresponding `.sigstore` bundles
from the GitHub Release page and place them next to the artifacts in `./verify`.

### Historic releases (pre-2026-05-16)

Earlier releases were published without sigstore bundles. Re-installing those versions
provides no cryptographic provenance — pin to a current release if assurance matters.

## License

- Source code: **MIT** (see [LICENSE](LICENSE)).
- Runtime dep Essentia: **AGPL-3.0** (opt-in via `pip install "wav2caption[essentia]"`).
- Pretrained models: **CC-BY-NC-SA 4.0** (user-downloaded, non-commercial).

Full third-party notices: [`NOTICE.md`](NOTICE.md).

If you need a commercial pipeline you will have to either license Essentia
from MTG-UPF or swap in a different backend. The MIT-licensed code
in this repo is backend-agnostic enough that a `torch` / `onnxruntime`
port is straightforward — PRs welcome.
