Metadata-Version: 2.4
Name: vocaboot
Version: 0.1.0
Summary: Terminal-first singing voice synthesis framework for the Claude Code era.
Project-URL: Homepage, https://github.com/hinanohart/vocaboot
Project-URL: Repository, https://github.com/hinanohart/vocaboot
Project-URL: Issues, https://github.com/hinanohart/vocaboot/issues
Project-URL: Changelog, https://github.com/hinanohart/vocaboot/blob/main/CHANGELOG.md
Author-email: vocaboot contributors <255530825+hinanohart@users.noreply.github.com>
License-Expression: Apache-2.0
License-File: LICENSE
License-File: NOTICE
Keywords: claude-code,cli,diffsinger,mcp,singing-voice-synthesis,svs,vocaloid
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: MacOS
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Multimedia :: Sound/Audio :: Sound Synthesis
Requires-Python: >=3.10
Requires-Dist: click>=8.1
Requires-Dist: numpy>=1.24
Requires-Dist: pyyaml>=6.0
Requires-Dist: soundfile>=0.12
Provides-Extra: dev
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.6; extra == 'dev'
Provides-Extra: diffsinger
Requires-Dist: librosa<1,>=0.10; extra == 'diffsinger'
Requires-Dist: onnxruntime<2,>=1.18; extra == 'diffsinger'
Requires-Dist: scipy<2,>=1.11; extra == 'diffsinger'
Provides-Extra: mcp
Requires-Dist: mcp>=0.9; extra == 'mcp'
Provides-Extra: svc
Requires-Dist: torch<3,>=2.1; extra == 'svc'
Requires-Dist: torchaudio<3,>=2.1; extra == 'svc'
Provides-Extra: verify
Requires-Dist: pyworld<1,>=0.3; extra == 'verify'
Requires-Dist: scipy<2,>=1.11; extra == 'verify'
Provides-Extra: verify-similarity
Requires-Dist: huggingface-hub>=0.23; extra == 'verify-similarity'
Requires-Dist: librosa<1,>=0.10; extra == 'verify-similarity'
Requires-Dist: torch<3,>=2.1; extra == 'verify-similarity'
Requires-Dist: torchaudio<3,>=2.1; extra == 'verify-similarity'
Requires-Dist: transformers<5,>=4.40; extra == 'verify-similarity'
Description-Content-Type: text/markdown

# vocaboot

> Terminal-first singing voice synthesis framework for the Claude Code era.

**Status**: stage 1 smoke under development (2026-05-12). Not yet released.

## Why

Existing singing voice synthesis (SVS) tools — SynthV, VOCALOID, NEUTRINO, CeVIO AI — are GUI-only. You cannot script them from a terminal, cannot pipe them through an agent like Claude Code, cannot automate batch covers, cannot integrate them into CI.

`vocaboot` is the bridge: a single `pip install` and you can synthesize a singing voice from your shell, from a Python `import`, or from an MCP-aware agent. Stage 1 ships a CLI scaffold; whether **CLI / MCP / Skill / library only** is the *primary* surface for the Claude Code era is treated as an open question, to be re-evaluated at the end of Stage 1 against actual usage. (See "Distribution" below.)

## Design constraints

- **Cost: $0 (inference path)** — CPU-only numeric verify is the $0 invariant. The *self-train* path (see "Stage 0 result") is **not** $0 by definition: it requires user GPU compute and audio data.
- **License: Apache-2.0** (code) — model weights inherit their upstream license; no NC weights bundled.
- **Generality**: the framework must work with any voice, any song, any host environment (WSL2 / Linux / macOS first-class, Windows second-class).
- **"Roughly close" is good enough**: chasing literal pitch-perfect voice cloning is a structural failure mode (see `_failures/`); the framework optimizes for *direction* of a target timbre, not literal match.
- **Auto-failure museum (R8)**: every failed run writes a structured post-mortem under `_failures/`, so the next attempt cannot repeat the same mistake. Paths are scrubbed to `{basename, sha256}` before persistence (R15); cyclic exception detail is replaced with `<cycle>` placeholders so the museum itself never raises.
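
A minimal sketch of the two scrubbing rules in the museum bullet, not the project's actual `museum` code. `scrub_path` and `break_cycles` are hypothetical names, and hashing the path string (rather than the file contents) is an assumption made so the sketch needs no file I/O.

```python
import hashlib
import os


def scrub_path(path: str) -> dict:
    """Reduce a filesystem path to {basename, sha256}.

    Assumption: the digest covers the path string, not the file contents.
    """
    digest = hashlib.sha256(path.encode("utf-8")).hexdigest()
    return {"basename": os.path.basename(path), "sha256": digest}


def break_cycles(detail, _ancestors=None):
    """Copy nested dicts/lists, replacing cyclic references with "<cycle>"."""
    ancestors = set() if _ancestors is None else _ancestors
    if not isinstance(detail, (dict, list)):
        return detail
    if id(detail) in ancestors:
        # The container is already being visited higher up: a true cycle.
        return "<cycle>"
    ancestors.add(id(detail))
    try:
        if isinstance(detail, dict):
            return {str(k): break_cycles(v, ancestors) for k, v in detail.items()}
        return [break_cycles(v, ancestors) for v in detail]
    finally:
        # Track ancestors only, so shared but acyclic references are kept.
        ancestors.discard(id(detail))
```

The result of `break_cycles` contains no self-references, so it can be passed to `json.dumps` without hitting a circular-reference error, provided the leaf values are themselves JSON-serializable.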

## Architecture overview (locked, 4 stages, strict superset)

| Stage | What works | Modules added |
|---|---|---|
| **0** | `ckpt` acquisition gate — closed 2026-05-12 with result **(B): 0 license-clean pairs**; project ships `--accept-nc` opt-in + self-train as dual smoke path. | (research, no new module) |
| 1 | One engine, one voice, 5-second wav, CPU, **agent-verify gate** | input + score_align + svs + audio_out + agent_verify + metrics + agent_advisory + museum + license_check + privacy + stage_1 (= **11 layer modules**; `cli.py` is the CLI entry point and is excluded from the module count) |
| 2 | Any voice, any song | + profile_match (optional target spec) |
| 3 | "Roughly like X" via voice conversion | + svc (DDSP-SVC default; RVC v2 / Seed-VC plugin) |
| 4 | Natural-language input (`vocaboot "sing 'Yume' like a soft anime voice"`) | + nl_parse |

Hard ceilings (anti-bloat invariant): layers ≤ 9, **layer modules ≤ 12** (CLI entry, MCP entry etc. are surface-level and counted separately), hooks ≤ 2. Left unmerged, the table above reaches 13 layer modules at Stage 3 and 14 at Stage 4, so by Stage 3 at the latest `score_align` will be folded into `input` and `agent_advisory` into `agent_verify`, which brings Stage 4 back to 12 and keeps the ceiling intact.
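
The ceiling arithmetic can be checked mechanically. The sketch below takes the module names from the table above; `module_counts` and `fold_at` are hypothetical names used only for illustration.

```python
CEILING = 12  # layer-module ceiling from the anti-bloat invariant

STAGE_1 = [
    "input", "score_align", "svs", "audio_out", "agent_verify", "metrics",
    "agent_advisory", "museum", "license_check", "privacy", "stage_1",
]
ADDED = {2: ["profile_match"], 3: ["svc"], 4: ["nl_parse"]}
# Planned merges: the key module is folded into the value module.
FOLDS = {"score_align": "input", "agent_advisory": "agent_verify"}


def module_counts(fold_at=None):
    """Return {stage: layer-module count}; fold_at is the stage where merges land."""
    modules = list(STAGE_1)
    counts = {}
    for stage in (1, 2, 3, 4):
        modules += ADDED.get(stage, [])
        if fold_at is not None and stage >= fold_at:
            modules = [m for m in modules if m not in FOLDS]
        counts[stage] = len(modules)
    return counts


print(module_counts())           # no merges: Stage 3 and 4 exceed the ceiling
print(module_counts(fold_at=3))  # merges at Stage 3: every stage stays <= 12
```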

### Stage 0 result (2026-05-12, closed: B)

Stage 0 ckpt research closed with **0 Apache-2.0 / MIT / CC-BY-4.0 / CC0 acoustic+vocoder pairs** for the DiffSinger format:

- **vocoder**: BigVGAN v2 44kHz (MIT / NVIDIA) and Vocos mel-24kHz (MIT / charactr) are license-clean, but their mel formats (general 80-bin mel) are **not drop-in compatible** with DiffSinger's F0-conditioned NSF 44.1 kHz / 128-bin mel.
- **acoustic**: every public DiffSinger community voicebank (`qixuan`, `renri`, `nyaru`, `mitsudate` …) is CC-BY-NC-SA-4.0. The openvpi *code* is Apache-2.0; the *weights* are not.

vocaboot's surface resemblance to Stable Diffusion WebUI / kohya_ss is intentional ("Apache-2.0 code, license-aware opt-in for weights") but the *substantive* difference must be stated: **vocaboot has no `pip install`-and-go commercial-OK default path**, because no DiffSinger-format Apache/MIT/CC0 acoustic exists. The two real smoke paths are:

| Path | Tier | $0? | Use |
|---|---|---|---|
| `--accept-nc` opt-in | CC-BY-NC-SA-4.0 (openvpi NSF-HiFiGAN + community acoustic) | Yes (inference only) | personal, non-commercial smoke |
| self-train | Apache-2.0 openvpi/DiffSinger code + MIT Vocos/BigVGAN vocoder bridge + user audio | **No** — user supplies GPU (8 GB+ VRAM), 1–3 h labeled audio, ~12–48 GPU-hours of training, plus a self-implemented mel-format adapter (Stage 3+ contribution path) | commercial / OSS-clean redistribution |

The default tier in `docs/ckpt_registry.md` stays empty until upstream licensing changes. **Engine pivot** (to nnsvs / FastDiff / etc.) is a deferred option, not a rejected one: if Stage 0 is re-opened against another SVS framework with a license-clean default tier, the project may pivot. See `docs/ckpt_registry.md` for the deferral rationale.

### Output wav licensing (NC opt-in path)

When `--accept-nc` is set, the resulting `.wav` is the **user's copyright**, and CC's public position is that CC license conditions on the model do **not** automatically extend to model outputs ([Creative Commons FAQ](https://creativecommons.org/faq/#do-i-need-permission-to-use-a-cc-licensed-work-as-input-to-train-an-ai-model), [Using CC-licensed Works for AI Training, 2025](https://creativecommons.org/wp-content/uploads/2025/05/Using-CC-licensed-Works-for-AI-Training.pdf)). However: **the act of running NC weights is itself bound by the NC clause**, so any commercial use case requires the self-train path. vocaboot refuses to load NC weights unless `--accept-nc` is explicitly passed.
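
As an illustration of the refusal rule, a license gate could look roughly like this. `check_ckpt_license`, `is_non_commercial` and `LicenseRefused` are hypothetical names, not the `license_check` module's API, and detecting NC terms by the `-NC` marker in an SPDX-style identifier is an assumption.

```python
class LicenseRefused(Exception):
    """Raised when a checkpoint's license is not accepted by the caller."""


def is_non_commercial(ckpt_license: str) -> bool:
    # Assumption: NC terms are recognisable from the "-NC" marker in an
    # SPDX-style identifier such as "CC-BY-NC-SA-4.0".
    return "-NC" in ckpt_license.upper()


def check_ckpt_license(ckpt_license: str, accept_nc: bool = False) -> None:
    """Refuse NC-licensed weights unless the caller opted in explicitly."""
    if is_non_commercial(ckpt_license) and not accept_nc:
        raise LicenseRefused(
            f"{ckpt_license} weights are non-commercial; "
            "pass --accept-nc to load them"
        )
```

Defaulting `accept_nc` to `False` keeps the refusal as the fall-through behaviour, so a caller that forgets the flag fails closed.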

The openvpi vocoder distribution carries CC-BY-NC-SA-4.0 attribution requirements (a copy of the license + "OpenVPI Community / DiffSinger Community" notice + project page link). Stage 1 step 2 wires a structured `LICENSE_NOTICE.txt` emission into the failure museum and stdout when an NC ckpt is loaded; see `docs/ckpt_registry.md` for the literal text.

### "OSS-1 accessible" invariant — Stage 0 result reframe

The Round 35 analyst pinned OSS-1 to "1 command quick start → smoke wav". With Stage 0 result (B), the literal invariant becomes:

> **OSS-1 (revised, 2026-05-12)**: 2 commands max to smoke wav. After `pip install "vocaboot[diffsinger,verify]"`, running `vocaboot demo --accept-nc` must fetch a registry-pinned NC vocoder, synthesize a 5-second wav from the bundled `examples/short.ds`, and exit 0 (APPROVE) or 5 (REVIEW) within 60 seconds on CPU.

The `demo` subcommand is a Stage 1 step 2 deliverable. The `--accept-nc` requirement is *the* deviation from the original 1-command target, called out here explicitly so reviewers do not mistake the dual-path docs for an OSS-1-compliant default.

## Verify gate ($0, no Claude dependency)

The agent-verify gate runs pyworld F0 / HNR with R5 bootstrap CIs on every synth run. No subprocess, no Claude dependency, $0 install footprint (numpy + pyworld + scipy via `[verify]`). APPROVE / REVIEW / REJECT map to exit codes 0 / 5 / 6 so CI distinguishes "undetermined" from "rejected".
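
To make the verdict mapping concrete, here is a hedged sketch of a percentile-bootstrap gate. It is not the project's R5 procedure: the metric (per-frame F0 error in cents), the 50-cent limit, the resample count, and the decision rule are all assumptions, and the F0 extraction step (pyworld in the real gate) is left out so the example depends on numpy only. Only the exit codes 0 / 5 / 6 come from the text above.

```python
import numpy as np

EXIT_CODES = {"APPROVE": 0, "REVIEW": 5, "REJECT": 6}


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Each row is one resample (with replacement) of the original frames.
    idx = rng.integers(0, len(values), size=(n_resamples, len(values)))
    means = values[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


def verdict(ci, limit):
    """Hypothetical rule: decide only when the whole interval agrees."""
    lo, hi = ci
    if hi <= limit:
        return "APPROVE"   # interval entirely within the limit
    if lo > limit:
        return "REJECT"    # interval entirely beyond the limit
    return "REVIEW"        # interval straddles the limit: undetermined


f0_error_cents = [12.0, 8.5, 15.2, 9.9, 11.4, 13.0, 7.8, 10.6]
result = verdict(bootstrap_ci(f0_error_cents), limit=50.0)
print(result, EXIT_CODES[result])
```

A CI step can then branch on the process exit code, treating 5 as "needs a human look" rather than as a hard failure.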

Code review is **not** a per-synth concern; it is a per-commit concern, run separately via CI (kluster + GitHub Actions). The original Round-35 design included a `--verify=agent` mode that invoked `claude -p` for static code review. The LLM-as-judge fragility that excluded Agent C from the verdict (Round 26 number-fabrication regression) applies identically to Agent A, so the mode was removed before wiring (step 5e, 2026-05-12).

Speaker-similarity metrics (WavLM cosine, ECAPA-TDNN) join this gate in Stage 2 once profile-match lands, behind a `[verify-similarity]` extra so the default Stage 1 install stays footprint-minimal.

Agent C (qualitative LLM-as-judge advisory) still runs as **museum-annotation only** — structurally excluded from the verdict by design.

`--no-agent-verify` exists for unit-testing the synth path in isolation; CI is the only caller that should ever use it.

## Distribution

Stage 1 ships:

- `pip install "vocaboot[diffsinger,verify]"` → either
  - `vocaboot demo --accept-nc` (OSS-1: bundled score, synthetic acoustic, registry-pinned NC vocoder), **or**
  - `vocaboot synth --voice <ckpt> --score examples/short.ds --out smoke.wav --accept-nc` (BYO ckpt path).
- `import vocaboot` → Python library (same package)

Treated as Stage 2 retrospective decisions (no irrevocable rejection now):

- **MCP server stdio** — natural surface for agent loops; cheap to add as a thin wrapper over the library API.
- **Anthropic Skill / Subagent** — natural surface for Claude Code users; tradeoff is client coupling.
- **Headless library only (no CLI)** — would simplify install footprint if the CLI proves redundant against MCP/Skill.

Choosing among these is a retrospective decision after we know which surface actual users reach for. The Round 35 architect lock did **not** rule any of them out.

## Related work (delta)

`vocaboot` is not the first attempt at headless DiffSinger inference. Pre-existing projects:

| Project | What it does | What `vocaboot` adds |
|---|---|---|
| [openvpi DiffSingerMiniEngine](https://github.com/openvpi/DiffSingerMiniEngine) | Python HTTP server wrapping ONNX inference | License gate + privacy guard + R5-CI verify gate + failure museum + agent-callable surface |
| [diffscope/dsinfer](https://github.com/diffscope/dsinfer) | C++ low-level inference SDK | Higher-level Python API + agent integration + verify gate |
| [HighCWu / Jobsecond ONNX-Infer](https://github.com/Jobsecond/diffsinger-onnx-infer) | Reference ONNX inference snippet | Production-grade gates + structured failure capture + R8 reproducibility |
| OpenUtau (headless mode) | GUI-first with scripted batch | $0 verify gate, no GUI bridge required |
| [nnsvs/nnsvs](https://github.com/nnsvs/nnsvs) | MIT SVS framework; weights per-voicebank | If Stage 0 reopens, nnsvs is the first re-target candidate (license-clean framework, weights TBD) |

If, during Stage 0 or 1, a clean integration into one of these projects becomes more valuable than a standalone framework, the project is willing to redirect toward a PR contribution.

## Stage 1 (current target)

```bash
# OSS-1 quick start (bundled score, synthetic acoustic + NC vocoder, 5 s):
vocaboot demo --accept-nc

# BYO ckpt path (your DiffSinger acoustic + the registry vocoder):
vocaboot synth --engine diffsinger \
    --voice <ckpt_path> \
    --score examples/short.ds \
    --out smoke.wav \
    --accept-nc \
    --ckpt-license CC-BY-NC-SA-4.0
# agent verify gate (pyworld F0/HNR + R5 bootstrap CI) runs automatically.
# REJECT/REVIEW persists to ./_failures/.
```

`examples/short.ds` uses generic CV phoneme tokens (`d o r e m i f a s o`); whether those map cleanly to the chosen ckpt's phoneme dictionary is ckpt-dependent. Stage 1 step 2 pins one reference ckpt and adjusts `examples/short.ds` to match its dictionary; for other ckpts, users must align phoneme tokens themselves.
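
One way to check a ckpt before synthesis is to list the tokens its dictionary lacks. The helper below is a hypothetical sketch; how a given ckpt exposes its phoneme dictionary is not specified here, so it is passed in as a plain set, and the example dictionary is invented for illustration.

```python
def missing_phonemes(ph_seq: str, dictionary: set) -> list:
    """Return tokens from a space-separated ph_seq that the dictionary lacks.

    Tokens are reported once each, in order of first appearance.
    """
    missing = []
    for token in ph_seq.split():
        if token not in dictionary and token not in missing:
            missing.append(token)
    return missing


# Illustrative dictionary only; a real ckpt ships its own phoneme set.
example_dictionary = {"d", "o", "r", "e", "m", "i"}
print(missing_phonemes("d o r e m i f a s o", example_dictionary))
```

An empty result means every token in the score is known to the ckpt; anything else is the list to remap by hand.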

### `--voice synthetic` is an oscillator, not a voice

The Stage 1 synthetic-acoustic path produces a **4-partial harmonic oscillator** driven by the score's note sequence. It is **not** vocal-tract-filtered voice synthesis — the `ph_seq` (phoneme) field is deliberately ignored, and the output has no vowel formants or consonants. Its purpose is to prove the end-to-end pipeline (score → mel + f0 → NSF-HiFiGAN vocoder → wav → verify gate) wires correctly without depending on a license-clean acoustic checkpoint that does not exist (Stage 0 result B). Vowel-formant / phoneme-aware synthesis arrives with the BYO acoustic path in Stage 2.

A NUMERIC verify APPROVE on the synthetic path proves the pipeline **rings**, not that the output sounds like a voice. Stage 2 adds spectral/timbral metrics behind `[verify-similarity]` so APPROVE discriminates synthetic from real-voice output.
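
For intuition, a 4-partial harmonic oscillator can be sketched in a few lines of numpy. This is not the project's synthetic-acoustic code: the `1/k` partial amplitudes, the MIDI note input, and the names `midi_to_hz` and `harmonic_oscillator` are assumptions, and the sketch emits a waveform directly rather than the mel + f0 features the real pipeline hands to the vocoder.

```python
import numpy as np


def midi_to_hz(midi_pitch):
    """Equal-tempered pitch with A4 (MIDI 69) at 440 Hz."""
    return 440.0 * 2.0 ** ((midi_pitch - 69) / 12.0)


def harmonic_oscillator(notes, sample_rate=44100, partials=4):
    """Render (midi_pitch, seconds) notes as a sum of harmonic sine partials.

    No phonemes, formants, or envelopes are modelled, and the phase restarts
    at every note boundary, so this is a test tone rather than a voice.
    """
    chunks = []
    for midi_pitch, seconds in notes:
        n = int(round(seconds * sample_rate))
        t = np.arange(n) / sample_rate
        f0 = midi_to_hz(midi_pitch)
        # Partial k sits at k * f0 with amplitude 1/k (an assumed roll-off).
        tone = sum(np.sin(2 * np.pi * k * f0 * t) / k
                   for k in range(1, partials + 1))
        chunks.append(tone)
    wav = np.concatenate(chunks)
    peak = np.max(np.abs(wav))
    if peak > 0:
        wav = wav / peak  # normalize to the [-1, 1] range
    return wav.astype(np.float32)


wav = harmonic_oscillator([(60, 0.5), (62, 0.5), (64, 0.5)])
print(wav.shape, wav.dtype)
```

Because the signal is strictly periodic within each note, an F0 tracker will lock onto it easily, which is why a numeric APPROVE here says nothing about vocal quality.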

## Roadmap

- [x] Architecture lock-in (rounds 28–32; per-round audit trails are inlined into `docs/stage_2.md` ("R3 / R4 / R5 / R6 / R7 / R8 / R14 round-N" subsections), with closure decisions tracked in `CHANGELOG.md`)
- [x] **Stage 0**: ckpt acquisition gate closed with result (B), dual-path adopted (2026-05-12)
- [x] **Stage 1 step 2 (5c)**: `vocaboot demo` subcommand wired (click.group, bundled `examples/short.ds`, `--accept-nc` mandatory, SHA-256 runtime verify of the pinned vocoder).
- [x] Stage 1 step 2 (5d): `LICENSE_NOTICE.txt` attribution emission for the NC vocoder (sidecar `<out>.LICENSE_NOTICE.txt` + stdout summary; CC-BY-NC-SA-4.0 §3(a) clauses spelled out).
- [x] Stage 1: smoke wav (CPU, 5 s) via `vocaboot demo --accept-nc` (synthetic 4-partial oscillator + NSF-HiFiGAN; rings the pipeline end-to-end).
- [x] Stage 1 agent-verify gate (numeric pyworld F0/HNR + R5 bootstrap CI; AGENT mode removed in step 5e — LLM-as-judge fragility, code review delegated to CI).
- [x] Stage 1 closed (2026-05-12): a bit-exact wav regression test (originally step 5f) was rejected after review — OSS-mainline SVS/TTS projects (HF / Coqui / Vocos / NeMo / ESPnet) all use approximate/perceptual checks; the numeric verify gate plus the existing duration/RMS assertions in `test_stage1_synthetic_smoke_wav_end_to_end` already serve as the CI canary.
- [x] **Stage 1.5** closed (2026-05-12): R14 3-agent debate (analyst/architect/critic, all REVISE) surfaced 6 gaps before Stage 2 could begin. Five closures: (A) `stage_1.py` → `pipeline.py` rename, (B) `privacy.refuse_user_audio_without_consent` + `museum` user-audio hard-exclude rule, (C) `docs/quality_gate.md` Tier ladder (Tier 1 rings → Tier 2 voice-like → Tier 3 target-quality; Tier 4 subjective MOS out of scope), (D) `docs/ckpt_registry.md` acquisition Path B (NC-only public release) chosen, (E) `docs/stage_2.md` spec.
- [ ] **Stage 2** (`docs/stage_2.md`): `profile_match` module + `audio_ingest` module + Tier 2 verify metrics (`mcd`, `wavlm_cosine`).
- [ ] Stage 2.5: MCP server (`vocaboot mcp-serve`) — thin wrapper around `pipeline.run`.
- [ ] Stage 3: SVC fallback chain (DDSP-SVC primary) + self-train recipe (acquisition Path C).
- [ ] Stage 4: natural-language interface.
- [ ] Public OSS release on Path B (NC-only, local screencast demo; HF Space deferred pending Default-tier ckpt).

## Non-goals

- Pitch-perfect voice cloning (targeting a literal match is a known failure mode)
- Bundled model weights (we point at upstream URLs and verify SHA-256)
- GUI
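
The SHA-256 check on downloaded weights can be sketched with the standard library alone. `sha256_of` and `verify_pinned` are hypothetical names; where the expected digest is stored (the registry format) is not shown here.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large checkpoints are not read at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_pinned(path, expected_sha256):
    """Raise ValueError unless the file matches the pinned hex digest."""
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise ValueError(
            f"SHA-256 mismatch for {path}: expected {expected_sha256}, got {actual}"
        )
```

Running the check at load time, not only at download time, also catches a cached file that was modified or truncated afterwards.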

## License

Code: Apache-2.0.
Model weights: inherit from upstream — `vocaboot` itself bundles none.
