Metadata-Version: 2.4
Name: vocaboot
Version: 0.2.1
Summary: Terminal-first singing voice synthesis framework for the Claude Code era.
Project-URL: Homepage, https://github.com/hinanohart/vocaboot
Project-URL: Repository, https://github.com/hinanohart/vocaboot
Project-URL: Issues, https://github.com/hinanohart/vocaboot/issues
Project-URL: Changelog, https://github.com/hinanohart/vocaboot/blob/main/CHANGELOG.md
Author-email: vocaboot contributors <255530825+hinanohart@users.noreply.github.com>
License-Expression: Apache-2.0
License-File: LICENSE
License-File: NOTICE
Keywords: agent-loop,claude-code,cli,ddsp-svc,diffsinger,mcp,rvc,singing-voice-synthesis,soulx-singer,svc,svs,vocaloid,voice-conversion
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: MacOS
Classifier: Operating System :: OS Independent
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Topic :: Multimedia :: Sound/Audio :: Sound Synthesis
Requires-Python: >=3.10
Requires-Dist: click>=8.1
Requires-Dist: numpy>=1.24
Requires-Dist: pyyaml>=6.0
Requires-Dist: soundfile>=0.12
Provides-Extra: dev
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.6; extra == 'dev'
Provides-Extra: diffsinger
Requires-Dist: librosa<1,>=0.10; extra == 'diffsinger'
Requires-Dist: onnxruntime<2,>=1.18; extra == 'diffsinger'
Requires-Dist: scipy<2,>=1.11; extra == 'diffsinger'
Provides-Extra: mcp
Requires-Dist: mcp>=0.9; extra == 'mcp'
Provides-Extra: svc
Requires-Dist: torch<3,>=2.1; extra == 'svc'
Requires-Dist: torchaudio<3,>=2.1; extra == 'svc'
Provides-Extra: svc-ddsp
Requires-Dist: numpy>=1.24; extra == 'svc-ddsp'
Requires-Dist: onnxruntime<2,>=1.18; extra == 'svc-ddsp'
Requires-Dist: pyworld<1,>=0.3; extra == 'svc-ddsp'
Requires-Dist: scipy<2,>=1.11; extra == 'svc-ddsp'
Requires-Dist: torch<3,>=2.1; extra == 'svc-ddsp'
Requires-Dist: torchaudio<3,>=2.1; extra == 'svc-ddsp'
Provides-Extra: svc-rvc
Requires-Dist: fairseq<1,>=0.12; extra == 'svc-rvc'
Requires-Dist: librosa<1,>=0.10; extra == 'svc-rvc'
Requires-Dist: numpy>=1.24; extra == 'svc-rvc'
Requires-Dist: scipy<2,>=1.11; extra == 'svc-rvc'
Requires-Dist: torch<3,>=2.1; extra == 'svc-rvc'
Requires-Dist: torchaudio<3,>=2.1; extra == 'svc-rvc'
Provides-Extra: svc-soulx
Requires-Dist: accelerate>=0.30; extra == 'svc-soulx'
Requires-Dist: huggingface-hub>=0.23; extra == 'svc-soulx'
Requires-Dist: librosa<1,>=0.10; extra == 'svc-soulx'
Requires-Dist: torch<3,>=2.1; extra == 'svc-soulx'
Requires-Dist: torchaudio<3,>=2.1; extra == 'svc-soulx'
Requires-Dist: transformers<5,>=4.40; extra == 'svc-soulx'
Provides-Extra: verify
Requires-Dist: pyworld<1,>=0.3; extra == 'verify'
Requires-Dist: scipy<2,>=1.11; extra == 'verify'
Provides-Extra: verify-similarity
Requires-Dist: huggingface-hub>=0.23; extra == 'verify-similarity'
Requires-Dist: librosa<1,>=0.10; extra == 'verify-similarity'
Requires-Dist: torch<3,>=2.1; extra == 'verify-similarity'
Requires-Dist: torchaudio<3,>=2.1; extra == 'verify-similarity'
Requires-Dist: transformers<5,>=4.40; extra == 'verify-similarity'
Description-Content-Type: text/markdown

# vocaboot

> Terminal-first singing voice synthesis framework for the Claude Code era.

**Status**: 0.2.1, shipped 2026-05-13 on PyPI (`pip install vocaboot==0.2.1`). It adds the Stage 2.5 MCP server (`vocaboot mcp-serve`) and closes Stage 2 (Tier 2 canary, discrimination, consent gate, VoiceProfile round-trip; all green at 204/204). The 0.2.0 line landed Stage 1 (smoke + privacy + Tier-2 verify), the Stage 1 DiffSinger MIT ONNX BYO wiring, the Stage 3 SVC opening (DDSP-SVC inference dispatch via subprocess), and the F1 security gate (sha256 → `safe_torch_load(weights_only=True)` → subprocess isolation; OWASP A08 / CWE-502). Next: v0.3.0 (R14 round 7 DiffSinger BYO + mel-format adapter); Stage 3 SVC plugins (RVC, SoulX-Singer); Stage 4 NL parse.

## Why

Existing singing voice synthesis (SVS) tools — SynthV, VOCALOID, NEUTRINO, CeVIO AI — are GUI-only. You cannot script them from a terminal, cannot pipe them through an agent like Claude Code, cannot automate batch covers, cannot integrate them into CI.

`vocaboot` is the bridge: a single `pip install` and you can synthesize a singing voice from your shell, from a Python `import`, or from an MCP-aware agent. Stage 1 ships a CLI scaffold; whether **CLI / MCP / Skill / library only** is the *primary* surface for the Claude Code era is treated as an open question, to be re-evaluated at the end of Stage 1 against actual usage. (See "Distribution" below.)

## Design constraints

- **Cost: $0 (inference path)** — CPU-only numeric verify is the $0 invariant. The *self-train* path (see "Stage 0 result") is **not** $0 by definition: it requires user GPU compute and audio data.
- **License: Apache-2.0** (code) — model weights inherit their upstream license; no NC weights bundled.
- **Generality**: the framework must work with any voice, any song, any host environment (WSL2 / Linux / macOS first-class, Windows second-class).
- **"Roughly close" is good enough**: chasing literal pitch-perfect voice cloning is a structural failure mode (see `_failures/`); the framework optimizes for *direction* of a target timbre, not literal match.
- **Auto-failure museum (R8)**: every failed run writes a structured post-mortem under `_failures/`, so the next attempt cannot repeat the same mistake. Paths are scrubbed to `{basename, sha256}` before persistence (R15); cyclic exception detail is replaced with `<cycle>` placeholders so that the museum itself never raises.
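
The scrubbing and cycle handling described for the failure museum can be sketched as follows. This is a minimal illustration, not the project's code: `scrub_path` and `safe_detail` are hypothetical names, and hashing the path string (rather than the file contents) is an assumption, since the text only names the `{basename, sha256}` shape.

```python
import hashlib
import json
from pathlib import PurePosixPath


def scrub_path(path: str) -> dict:
    """Reduce a path to {basename, sha256} so directory names never persist.

    Assumption: the digest is taken over the path string itself.
    """
    digest = hashlib.sha256(path.encode("utf-8")).hexdigest()
    return {"basename": PurePosixPath(path).name, "sha256": digest}


def safe_detail(obj, _ancestors=None):
    """Convert nested detail to JSON-safe data, replacing cycles with '<cycle>'."""
    ancestors = set() if _ancestors is None else _ancestors
    if obj is None or isinstance(obj, (str, int, float, bool)):
        return obj
    if id(obj) in ancestors:
        return "<cycle>"  # obj is its own ancestor: stop descending
    ancestors.add(id(obj))
    try:
        if isinstance(obj, dict):
            return {str(k): safe_detail(v, ancestors) for k, v in obj.items()}
        if isinstance(obj, (list, tuple, set)):
            return [safe_detail(v, ancestors) for v in obj]
        return repr(obj)  # unknown objects degrade to their repr
    finally:
        ancestors.discard(id(obj))  # track ancestors only, not every visited object


detail = {"stage": "vocoder"}
detail["self"] = detail  # deliberately cyclic
record = {
    "input": scrub_path("/home/alice/takes/verse1.wav"),
    "detail": safe_detail(detail),
}
print(json.dumps(record, indent=2))
```

Because only ancestors are tracked, an object that is merely shared between two branches is serialized twice rather than being mistaken for a cycle.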

## Architecture overview (locked, 4 stages, strict superset)

| Stage | What works | Modules added |
|---|---|---|
| **0** | `ckpt` acquisition gate — closed 2026-05-12 with result **(B): 0 license-clean pairs**; project ships `--accept-nc` opt-in + self-train as dual smoke path. | (research, no new module) |
| 1 | One engine, one voice, 5-second wav, CPU, **agent-verify gate** | input + score_align + svs + audio_out + agent_verify + metrics + agent_advisory + museum + license_check + privacy + stage_1 (= **11 layer modules**; `cli.py` is the CLI entry point and is excluded from the module count) |
| 2 | Any voice, any song | + profile_match (optional target spec) |
| 3 | "Roughly like X" via voice conversion | + svc (DDSP-SVC default; RVC v2 / SoulX-Singer plugin) |
| 4 | Natural-language input (`vocaboot "sing 'Yume' like a soft anime voice"`) | + nl_parse |

Hard ceilings (anti-bloat invariant): layers ≤ 9, **layer modules ≤ 12** (CLI entry, MCP entry etc. are surface-level and counted separately), hooks ≤ 2. Unconsolidated, Stage 4 would land at 14 layer modules, so by Stage 3 at the latest `score_align` will be folded into `input` and `agent_advisory` into `agent_verify` to keep the ceiling intact.
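
The module arithmetic behind that ceiling can be checked in a few lines. The module names come from the table above; treating them as plain set members is only a sketch of the counting, not how the project enforces the invariant.

```python
LAYER_MODULE_CEILING = 12

# Stage 1 layer modules as listed in the architecture table (cli.py excluded).
stage_1 = {
    "input", "score_align", "svs", "audio_out", "agent_verify", "metrics",
    "agent_advisory", "museum", "license_check", "privacy", "stage_1",
}
# Stages 2-4 each add one module.
stage_4 = stage_1 | {"profile_match", "svc", "nl_parse"}
# Folding score_align into input and agent_advisory into agent_verify.
folded = stage_4 - {"score_align", "agent_advisory"}

print(len(stage_1), len(stage_4), len(folded))  # 11 14 12
assert len(folded) <= LAYER_MODULE_CEILING
```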

### Stage 0 result (2026-05-12, closed: B)

Stage 0 ckpt research closed with **0 Apache-2.0 / MIT / CC-BY-4.0 / CC0 acoustic+vocoder pairs** for the DiffSinger format:

- **vocoder**: BigVGAN v2 44kHz (MIT / NVIDIA) and Vocos mel-24kHz (MIT / charactr) are license-clean, but their mel formats (general 80-bin mel) are **not drop-in compatible** with DiffSinger's F0-conditioned NSF 44.1 kHz / 128-bin mel.
- **acoustic**: every public DiffSinger community voicebank (`qixuan`, `renri`, `nyaru`, `mitsudate` …) is CC-BY-NC-SA-4.0. The openvpi *code* is Apache-2.0; the *weights* are not.

vocaboot's surface resemblance to Stable Diffusion WebUI / kohya_ss is intentional ("Apache-2.0 code, license-aware opt-in for weights") but the *substantive* difference must be stated: **vocaboot has no `pip install`-and-go commercial-OK default path**, because no DiffSinger-format Apache/MIT/CC0 acoustic exists. The two real smoke paths are:

| Path | Tier | $0? | Use |
|---|---|---|---|
| `--accept-nc` opt-in | CC-BY-NC-SA-4.0 (openvpi NSF-HiFiGAN + community acoustic) | Yes (inference only) | personal, non-commercial smoke |
| self-train | Apache-2.0 openvpi/DiffSinger code + MIT Vocos/BigVGAN vocoder bridge + user audio | **No** — user supplies GPU (8 GB+ VRAM), 1–3 h labeled audio, ~12–48 GPU-hours of training, plus a self-implemented mel-format adapter (Stage 3+ contribution path) | commercial / OSS-clean redistribution |

The default tier in `docs/ckpt_registry.md` stays empty until upstream licensing changes. **Engine pivot** (to nnsvs / FastDiff / etc.) is a deferred option, not a rejected one: if Stage 0 is re-opened against another SVS framework with a license-clean default tier, the project may pivot. See `docs/ckpt_registry.md` for the deferral rationale.

### Output wav licensing (NC opt-in path)

When `--accept-nc` is set, the resulting `.wav` is the **user's copyright**, and CC's public position is that CC license conditions on the model do **not** automatically extend to model outputs ([Creative Commons FAQ](https://creativecommons.org/faq/#do-i-need-permission-to-use-a-cc-licensed-work-as-input-to-train-an-ai-model), [Using CC-licensed Works for AI Training, 2025](https://creativecommons.org/wp-content/uploads/2025/05/Using-CC-licensed-Works-for-AI-Training.pdf)). However: **the act of running NC weights is itself bound by the NC clause**, so any commercial use case requires the self-train path. vocaboot refuses to load NC weights unless `--accept-nc` is explicitly passed.
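
A minimal sketch of such a refusal, assuming the checkpoint's license arrives as an SPDX-style identifier (as with `--ckpt-license`). The function and exception names are hypothetical and are not vocaboot's `license_check` API.

```python
class NonCommercialRefused(Exception):
    """Raised when NC-licensed weights are requested without explicit opt-in."""


def is_non_commercial(license_id: str) -> bool:
    # Creative Commons identifiers carry NonCommercial as an "NC" component,
    # e.g. CC-BY-NC-SA-4.0 -> ["CC", "BY", "NC", "SA", "4.0"].
    return "NC" in license_id.upper().split("-")


def license_gate(ckpt_license: str, accept_nc: bool) -> None:
    """Refuse NC weights unless the caller explicitly opted in."""
    if is_non_commercial(ckpt_license) and not accept_nc:
        raise NonCommercialRefused(
            f"{ckpt_license} weights require --accept-nc (non-commercial use only)"
        )


license_gate("CC-BY-NC-SA-4.0", accept_nc=True)  # explicit opt-in: allowed
license_gate("MIT", accept_nc=False)             # license-clean: allowed
```

Splitting on `-` rather than searching for the substring `NC` avoids false positives on identifiers that merely contain those letters.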

The openvpi vocoder distribution carries CC-BY-NC-SA-4.0 attribution requirements (a copy of the license + "OpenVPI Community / DiffSinger Community" notice + project page link). Stage 1 step 2 wires a structured `LICENSE_NOTICE.txt` emission into the failure museum and stdout when an NC ckpt is loaded; see `docs/ckpt_registry.md` for the literal text.

### "OSS-1 accessible" invariant — Stage 0 result reframe

The Round 35 analyst pinned OSS-1 to "1 command quick start → smoke wav". With Stage 0 result (B), the invariant is revised to:

> **OSS-1 (revised, 2026-05-12)**: 2 commands max to smoke wav. After `pip install vocaboot[diffsinger,verify]`, running `vocaboot demo --accept-nc` must fetch a registry-pinned NC vocoder, synth a 5-second wav from the bundled `examples/short.ds`, and exit 0 (APPROVE) or 5 (REVIEW) within 60 seconds on CPU.

The `demo` subcommand is a Stage 1 step 2 deliverable. The `--accept-nc` requirement is *the* deviation from the original 1-command target, called out here explicitly so reviewers do not mistake the dual-path docs for an OSS-1-compliant default.
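
The revised invariant is straightforward to check from a script. The harness below is a sketch under stated assumptions: `oss1_smoke` is a hypothetical helper, and only the command, the 60-second budget, and the accepted exit codes 0 / 5 come from the invariant above.

```python
import subprocess

ACCEPTED_EXIT_CODES = {0, 5}  # APPROVE, REVIEW


def oss1_smoke(cmd, timeout_s=60.0):
    """Return True if `cmd` exits with APPROVE or REVIEW inside the time budget."""
    try:
        proc = subprocess.run(cmd, timeout=timeout_s, check=False)
    except subprocess.TimeoutExpired:
        return False  # exceeded the budget
    except FileNotFoundError:
        return False  # the executable is not installed
    return proc.returncode in ACCEPTED_EXIT_CODES


# Example call, assuming vocaboot is installed:
# oss1_smoke(["vocaboot", "demo", "--accept-nc"])
```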

## Verify gate ($0, no Claude dependency)

The agent-verify gate runs pyworld F0 / HNR with R5 bootstrap CIs on every synth run. No subprocess, no Claude dependency, $0 install footprint (numpy + pyworld + scipy via `[verify]`). APPROVE / REVIEW / REJECT map to exit codes 0 / 5 / 6 so CI distinguishes "undetermined" from "rejected".
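
To make the CI-to-verdict step concrete, here is a sketch of a percentile bootstrap over per-frame F0 values feeding a three-way verdict. Only the exit codes 0 / 5 / 6 are taken from the text; the decision rule, the thresholds, the sample values, and every function name are assumptions, and the F0 values stand in for what pyworld would extract.

```python
import numpy as np

EXIT_CODES = {"APPROVE": 0, "REVIEW": 5, "REJECT": 6}


def bootstrap_ci(samples, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `samples`."""
    data = np.asarray(samples, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample with replacement: one row of indices per bootstrap replicate.
    idx = rng.integers(0, data.size, size=(n_resamples, data.size))
    means = data[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2.0, 1.0 - alpha / 2.0])
    return float(lo), float(hi)


def decide(ci, floor, ceiling):
    """Hypothetical rule: approve if the whole CI is in range, reject if none of it is."""
    lo, hi = ci
    if floor <= lo and hi <= ceiling:
        return "APPROVE"
    if hi < floor or lo > ceiling:
        return "REJECT"
    return "REVIEW"  # the CI straddles a bound: undetermined


voiced_f0_hz = [218.0, 221.5, 219.2, 222.8, 220.4, 217.9, 223.1, 220.0]
verdict = decide(bootstrap_ci(voiced_f0_hz), floor=80.0, ceiling=1000.0)
print(verdict, EXIT_CODES[verdict])
```

Keeping REVIEW as a distinct outcome is what lets CI tell "the interval is inconclusive" apart from "the interval is out of range".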

Code review is **not** a per-synth concern — it is a per-commit concern, run separately via CI (kluster + GitHub Actions). The original Round-35 design included a `--verify=agent` mode that invoked `claude -p` for static codereview, but the same LLM-as-judge fragility that excluded Agent C from the verdict (Round 26 number-fabrication regression) applies identically to Agent A; the mode was removed before wiring (step 5e, 2026-05-12).

Speaker-similarity metrics (WavLM cosine, ECAPA-TDNN) join this gate in Stage 2 once profile-match lands, behind a `[verify-similarity]` extra so the default Stage 1 install stays footprint-minimal.

Agent C (qualitative LLM-as-judge advisory) still runs as **museum-annotation only** — structurally excluded from the verdict by design.

`--no-agent-verify` exists for unit-testing the synth path in isolation; CI is the only caller that should ever use it.

## Distribution

Stage 1 ships:

- `pip install vocaboot[diffsinger,verify]` → either
  - `vocaboot demo --accept-nc` (OSS-1: bundled score, synthetic acoustic, registry-pinned NC vocoder), **or**
  - `vocaboot synth --voice <ckpt> --score examples/short.ds --out smoke.wav --accept-nc` (BYO ckpt path).
- `import vocaboot` → Python library (same package)

Treated as Stage 2 retrospective decisions (no irrevocable rejection now):

- **MCP server stdio**: natural surface for agent loops; a thin wrapper over the library API. Shipped in 0.2.1 as `vocaboot mcp-serve`.
- **Anthropic Skill / Subagent**: natural surface for Claude Code users; the tradeoff is client coupling.
- **Headless library only (no CLI)**: would simplify install footprint if the CLI proves redundant against MCP/Skill.

Choosing among these is a retrospective decision after we know which surface actual users reach for. The Round 35 architect lock did **not** rule any of them out.

## Related work (delta)

`vocaboot` is not the first OSS SVS/SVC project. The slot it occupies is **terminal- and agent-loop-friendly wrapping** around upstream engines, not engine R&D itself. Direct comparisons:

| Project | What it does | What `vocaboot` adds |
|---|---|---|
| **SVC engines (Stage 3 plugin set)** | | |
| [RVC-Project / Retrieval-based-Voice-Conversion-WebUI](https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI) (35.6k★, MIT, large community voicebank set) | GUI-first realtime SVC, voicebanks per-license-audited | Terminal & MCP surface, $0 numeric verify gate, R5 CI, failure museum, agent-callable; vocaboot wraps RVC v2 as a plugin, not competing on engine quality |
| [yxlllc / DDSP-SVC](https://github.com/yxlllc/DDSP-SVC) (2.5k★, MIT, CPU-capable) | Lightweight DDSP-based SVC, mature CLI inference | Primary engine of record for Stage 3; vocaboot adds protocol bifurcation, registry-pinned weights, and verify-against-target post-conversion gate |
| [Soul-AILab / SoulX-Singer](https://github.com/Soul-AILab/SoulX-Singer) (608★, Apache-2.0 code + weights, 42 000 h vocal train) | Transcription-free zero-shot SVC; Apache-2.0 across the board (only one in the plugin set) | Default-tier candidate; vocaboot integrates as remote (HF Space) or subprocess depending on the 5.6 GB / GPU envelope (see `docs/stage_3.md` for the A/B/C distribution-form simulation) |
| **SVS engines (Stage 1 plugin set)** | | |
| [openvpi / DiffSingerMiniEngine](https://github.com/openvpi/DiffSingerMiniEngine) | Python HTTP server wrapping ONNX inference | License gate + privacy guard + R5-CI verify gate + failure museum + agent-callable surface |
| [diffscope / dsinfer](https://github.com/diffscope/dsinfer) | C++ low-level inference SDK | Higher-level Python API + agent integration + verify gate |
| [Jobsecond / diffsinger-onnx-infer](https://github.com/Jobsecond/diffsinger-onnx-infer) | Reference ONNX inference snippet | Production-grade gates + structured failure capture + R8 reproducibility |
| OpenUtau (headless mode) | GUI-first with scripted batch | $0 verify gate, no GUI bridge required |
| [nnsvs / nnsvs](https://github.com/nnsvs/nnsvs) | MIT SVS framework; weights per-voicebank | Stage 0 re-target candidate if upstream weight licensing changes |

`vocaboot` does **not** try to out-quality RVC or DDSP-SVC on the engine axis — those communities already have hundreds of person-years invested. The differentiation slot is: **the only terminal-first SVS/SVC wrapper with an MCP-shaped surface, a $0 numeric verify gate (R5 bootstrap CI), a privacy guard with structural user-audio exclusion, and an auto-failure museum**. If a clean integration into one of the above projects becomes more valuable than a standalone wrapper, vocaboot will redirect toward a PR contribution.

## Stage 1 (current target)

```bash
# OSS-1 quick start (bundled score, synthetic + NC vocoder, 5s):
vocaboot demo --accept-nc

# BYO ckpt path (your DiffSinger acoustic + the registry vocoder):
vocaboot synth --engine diffsinger \
    --voice <ckpt_path> \
    --score examples/short.ds \
    --out smoke.wav \
    --accept-nc \
    --ckpt-license CC-BY-NC-SA-4.0
# agent verify gate (pyworld F0/HNR + R5 bootstrap CI) runs automatically.
# REJECT/REVIEW persists to ./_failures/.
```

`examples/short.ds` uses generic CV phoneme tokens (`d o r e m i f a s o`); whether those map cleanly to the chosen ckpt's phoneme dictionary is ckpt-dependent. Stage 1 step 2 pins one reference ckpt and adjusts `examples/short.ds` to match its dictionary; for other ckpts, users must align phoneme tokens themselves.
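
Users aligning tokens for another ckpt can start by listing which score phonemes the ckpt's dictionary lacks. The sketch below assumes `ph_seq` is a space-separated string and that the dictionary is available as a set of tokens; `missing_phonemes` and the dictionary contents are hypothetical.

```python
def missing_phonemes(ph_seq: str, dictionary: set) -> list:
    """Return score tokens absent from the dictionary, in first-seen order."""
    missing = []
    for token in ph_seq.split():
        if token not in dictionary and token not in missing:
            missing.append(token)
    return missing


bundled_ph_seq = "d o r e m i f a s o"
# Hypothetical dictionary for illustration; a real one comes from the ckpt.
hypothetical_dictionary = {"a", "i", "u", "e", "o", "d", "r", "m", "s"}
print(missing_phonemes(bundled_ph_seq, hypothetical_dictionary))  # ['f']
```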

### `--voice synthetic` is an oscillator, not a voice

The Stage 1 synthetic-acoustic path produces a **4-partial harmonic oscillator** driven by the score's note sequence. It is **not** vocal-tract-filtered voice synthesis — the `ph_seq` (phoneme) field is deliberately ignored, and the output has no vowel formants or consonants. Its purpose is to prove the end-to-end pipeline (score → mel + f0 → NSF-HiFiGAN vocoder → wav → verify gate) wires correctly without depending on a license-clean acoustic checkpoint that does not exist (Stage 0 result B). Vowel-formant / phoneme-aware synthesis arrives with the BYO acoustic path in Stage 2.
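
For intuition, a 4-partial harmonic oscillator driven by a note sequence can be sketched directly in the time domain. This illustrates the signal model only: the actual path feeds mel + f0 to the vocoder rather than writing samples, and the partial gains, sample rate, and function names here are assumptions.

```python
import numpy as np


def midi_to_hz(note: float) -> float:
    """Equal-tempered pitch with A4 (MIDI 69) at 440 Hz."""
    return 440.0 * 2.0 ** ((note - 69) / 12.0)


def render_note(f0_hz, duration_s, sample_rate=44100,
                gains=(1.0, 0.5, 0.25, 0.125)):
    """Sum four harmonics of f0; there is no formant filtering, so no vowel quality."""
    n = int(round(duration_s * sample_rate))
    t = np.arange(n) / sample_rate
    wave = np.zeros(n)
    for k, gain in enumerate(gains, start=1):
        wave += gain * np.sin(2.0 * np.pi * k * f0_hz * t)
    return (wave / sum(gains)).astype(np.float32)  # normalized to [-1, 1]


def render_score(notes, sample_rate=44100):
    """`notes` is a list of (midi_note, duration_seconds); phonemes play no role."""
    return np.concatenate(
        [render_note(midi_to_hz(m), d, sample_rate) for m, d in notes]
    )


wav = render_score([(60, 0.5), (62, 0.5), (64, 0.5)])
print(wav.shape, wav.dtype)
```

Each note restarts at phase zero, so this sketch clicks at note boundaries; that is acceptable for a pipeline smoke signal but is another reason the result does not resemble a voice.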

A NUMERIC verify APPROVE on the synthetic path proves the pipeline **rings**, not that the output sounds like a voice. Stage 2 adds spectral/timbral metrics behind `[verify-similarity]` so APPROVE discriminates synthetic from real-voice output.

## Stage 3 DDSP-SVC voice conversion (v0.2.0-rc1, BYO)

```bash
pip install vocaboot[svc-ddsp]
git clone https://github.com/yxlllc/DDSP-SVC.git ~/ddsp-svc
# ...follow upstream README to install its requirements, train or
# download a ckpt, and compute sha256sum model.pt

export VOCABOOT_DDSP_PATH=~/ddsp-svc/exp/combsub-test/model_300000.pt
export VOCABOOT_DDSP_SHA256=<digest>
export VOCABOOT_DDSP_SRC=~/ddsp-svc

vocaboot ddsp provision        # verify sha256 + weights-only probe
vocaboot ddsp convert --source singer.wav --out converted.wav
```

The `convert` subcommand walks a 4-stage refusal ladder (unconfirmed
sha256 → ckpt not found → hash mismatch → engine not wired) and
dispatches `model.forward` to the upstream subprocess. v0.3.0+ will add
an in-process vendored adapter so the `VOCABOOT_DDSP_SRC` step becomes
optional.
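
The ordering of that refusal ladder can be sketched as below. Only the four rungs and the environment variable names come from this section; `walk_refusal_ladder`, the `Refused` exception, and modelling "engine wired" as a boolean are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


class Refused(Exception):
    """One rung of the ladder declined to proceed."""


def walk_refusal_ladder(env: dict, engine_wired: bool) -> Path:
    """Check the rungs in order and return the ckpt path only if all pass."""
    expected = env.get("VOCABOOT_DDSP_SHA256", "").strip().lower()
    if not expected:
        raise Refused("unconfirmed sha256: no digest was committed")

    ckpt = Path(env.get("VOCABOOT_DDSP_PATH", "")).expanduser()
    if not ckpt.is_file():
        raise Refused("ckpt not found")

    actual = hashlib.sha256(ckpt.read_bytes()).hexdigest()
    if actual != expected:
        raise Refused("hash mismatch")

    if not engine_wired:
        raise Refused("engine not wired")
    return ckpt
```

Checking the digest before anything touches the engine means a wrong or missing checkpoint is refused without ever spawning the upstream subprocess.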

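## Completion criteria (OSS 1k★ level; R14 round 5 concept layer, 2026-05-13)

These are the 5 product-level criteria for judging vocaboot "completed", meaning it has reached the level at which an OSS project earns roughly 1k stars. They are outward-facing conditions layered on top of the v0.1.0 alpha internal-quality gates (R5 verify / R8 museum / R15 privacy), derived by working backward from what comparable SVS/SVC projects (RVC 35k★ / MoonInTheRiver/DiffSinger 4.7k★ / OpenUtau 3.8k★ / openvpi/DiffSinger 3.1k★ / DDSP-SVC 2.5k★) have in common.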

- **C1.** Sound comes out within 30 seconds of `pip install vocaboot[...]`. v0.2.0 opened the **DiffSinger MIT ONNX BYO path** (extending the `_is_synthetic` false branch at `svs_engine.py:226`), so C1 is met when the user supplies a BYO ckpt. Fully meeting C1 without NC weights in the OSS demo default waits for v0.3.0+, when the mel-format adapter (BigVGAN/Vocos MIT bridge, a separate R14 round 6 agenda item) lands.
- **C2.** The output "sounds like a singing voice". The v0.2.0 BYO path accepts a DiffSinger acoustic ckpt plus voicebank, so C2 is met when a BYO ckpt is supplied. The OSS demo default continues to require `--accept-nc`; full achievement comes with the v0.3.0+ mel-format adapter. (The Stage 1 OSS-1 4-partial harmonic oscillator admittedly produces not even vowels; see the demo limitations section of `README.md`.)
- **C3.** A playable demo wav (`<audio>` embed) and a 30-second mp4 demo in the README. Planned for v0.2.1+.
- **C4.** A 1-click try page on HF Space / Replicate. Delivered via Stage 3 SoulX distribution form C; lands in v0.3.0.
- **C5.** Launch posts on Twitter/X, HN, and Reddit r/MachineLearning. Carried out with the v0.3.0 release once C1-C4 are all in place.

vocaboot's own axis, "terminal-first + verify gate + failure museum", is **developer-tooling value**: it resonates with the 20% of the OSS star audience that is the dev community, while C1-C5 are mandatory for the remaining 80%, the non-engineer audience. **v0.2.0 = (a) C1+C2 achieved at BYO scope (a developer-tooling milestone, limited to BYO ckpt users) + (b) real Stage 3 DDSP-SVC inference wiring** was completed on 2026-05-13 as an intermediate release that reaches the developer audience (the 0.2.0 line runs through 0.2.0.post2; the MCP server landed in 0.2.1). **The OSS 1k★ completion declaration is tied to v0.3.0**, the point at which the mel-format adapter removes the `--accept-nc` default and the HF Space (C4) lands. (R14 round 6 critic concept-layer correction, 2026-05-13: because BYO-scope C1+C2 reaches only about 5% of the OSS audience, it was narrowed to a developer-tooling milestone, and the 1k★ condition shifted to v0.3.0.)

## Roadmap

- [x] Architecture lock-in (rounds 28–32; per-round audit trails are inlined into `docs/stage_2.md` ("R3 / R4 / R5 / R6 / R7 / R8 / R14 round-N" subsections), with closure decisions tracked in `CHANGELOG.md`)
- [x] **Stage 0**: ckpt acquisition gate closed with result (B), dual-path adopted (2026-05-12)
- [x] **Stage 1 step 2 (5c)**: `vocaboot demo` subcommand wired (click.group, bundled `examples/short.ds`, `--accept-nc` mandatory, SHA-256 runtime verify of the pinned vocoder).
- [x] Stage 1 step 2 (5d): `LICENSE_NOTICE.txt` attribution emission for the NC vocoder (sidecar `<out>.LICENSE_NOTICE.txt` + stdout summary; CC-BY-NC-SA-4.0 §3(a) clauses spelled out).
- [x] Stage 1: smoke wav (CPU, 5 s) via `vocaboot demo --accept-nc` (synthetic 4-partial oscillator + NSF-HiFiGAN; rings the pipeline end-to-end).
- [x] Stage 1 agent-verify gate (numeric pyworld F0/HNR + R5 bootstrap CI; AGENT mode removed in step 5e — LLM-as-judge fragility, code review delegated to CI).
- [x] Stage 1 closed (2026-05-12): a bit-exact wav regression test (originally step 5f) was rejected after review — OSS-mainline SVS/TTS projects (HF / Coqui / Vocos / NeMo / ESPnet) all use approximate/perceptual checks; the numeric verify gate plus the existing duration/RMS assertions in `test_stage1_synthetic_smoke_wav_end_to_end` already serve as the CI canary.
- [x] **Stage 1.5** closed (2026-05-12): R14 3-agent debate (analyst/architect/critic, all REVISE) surfaced 6 gaps before Stage 2 could begin. Five closures: (A) `stage_1.py` → `pipeline.py` rename, (B) `privacy.refuse_user_audio_without_consent` + `museum` user-audio hard-exclude rule, (C) `docs/quality_gate.md` Tier ladder (Tier 1 rings → Tier 2 voice-like → Tier 3 target-quality; Tier 4 subjective MOS out of scope), (D) `docs/ckpt_registry.md` acquisition Path B (NC-only public release) chosen, (E) `docs/stage_2.md` spec.
- [x] **Public OSS release** (2026-05-13): Apache-2.0 LICENSE pristine, NOTICE chain audited, GitHub `license.spdx_id=apache-2.0` confirmed restored. 4 commits (`c267afd` + `18733d1` + `9f0ae47` + `26a3589`) on `origin/main`.
- [WIP] **Stage 3 SVC base + ddsp landing** (R14 round 5 priority 🥈; services/svc/base.py + ddsp.py; MIT/CPU primary engine): implement VoiceConverterProtocol and land the DDSP-SVC plugin.
- [WIP] **Stage 1 DiffSinger MIT ONNX BYO wiring** (R14 round 5 priority 🥉; the means of achieving Completion C1+C2): extend the `EngineNotWiredError` at `services/svs_engine.py:226` into a DiffSinger ONNX BYO path, remove the `--accept-nc` requirement, and accept openvpi/DiffSinger (3.1k★, Apache-2.0) and MoonInTheRiver/DiffSinger (4.7k★, MIT) as BYO ckpts.
- [x] **v0.2.0 release tag** (2026-05-13; currently 0.2.1): C1+C2 achieved at BYO scope and real Stage 3 DDSP-SVC inference wiring completed (R14 round 5 concept-layer decision). PyPI live: `pip install vocaboot==0.2.1`.
- [x] **Stage 2.5 MCP server** (`vocaboot mcp-serve`, bundled in v0.2.1, 2026-05-13): a thin wrapper over `pipeline.run` (`services/mcp_server.py`). The `[mcp]` extra wires in `mcp>=0.9` and exposes `synth` / `version` tools over stdio JSON-RPC. `audio_ingest` and other consent-gated operations are not exposed, per the MCP boundary invariant (`docs/stage_2.md`); `accept_nc` is fixed at server start via `--accept-nc` as a human signal.
- [x] **Stage 2 closure** (2026-05-13; labeled in v0.2.1): the 9 acceptance criteria in `docs/stage_2.md` (Tier 2 canary / discrimination / consent gate / VoiceProfile round-trip / no Stage 1 regression / layer count / WavLM pin / `SpeakerSimilarityProtocol`) all had green test gates as of 0.2.0.post2 and are stated explicitly in the changelog in 0.2.1.
- [ ] **Stage 3 SVC remaining**: services/svc/rvc.py (RVC v2, per-voicebank license gate) + services/svc/soulx.py (SoulX-Singer Apache-2.0, subprocess + HF Space remote). Lands in v0.3.0.
- [ ] **Completion C3-C5** (v0.3.0+): README demo wav embed + 30s mp4 + HF Space 1-click + HN/Reddit launch.
- [ ] **Stage 4** (v0.4.0+; separate R14 round): natural-language interface (NL prompt → DiffSinger `.ds` JSON, `[nl]` extra). An additional layer after the v0.2.0 completion declaration, not a release requirement.

## Non-goals

- Pitch-perfect voice cloning (a literal match to the target is a known failure mode)
- Bundled model weights (we point at upstream URLs and verify SHA-256)
- GUI

## Security: malicious ckpt threat model

vocaboot loads PyTorch checkpoint files (`.pt`) supplied by the user via
`VOCABOOT_DDSP_PATH` / `VOCABOOT_VOCODER_PATH` / `VOCABOOT_WAVLM_PATH`. A
malicious `.pt` file can embed pickle payloads that execute arbitrary
code at deserialization time (OWASP A08 / CWE-502 — *insecure
deserialization*). Defenses:

1. **sha256 verify before load** — `core.ckpt_registry.verify()` refuses
   any ckpt whose bytes don't match the user-committed digest
   (`VOCABOOT_*_SHA256`). A man-in-the-middle download swap is caught
   at the byte boundary, before PyTorch's parser is reached.
2. **`torch.load(weights_only=True)`** — the default in
   `core.ckpt_registry.safe_torch_load()`. PyTorch's weights-only
   deserializer refuses arbitrary classes; legitimate ckpts bundling
   `argparse.Namespace` (e.g. DDSP-SVC v5.0) trigger `UnsafeCkptError`
   so the user explicitly opts in to `allow_unsafe=True` rather than
   silently widening the trust boundary.
3. **Subprocess isolation for upstream inference** — `vocaboot ddsp
   convert` invokes `yxlllc/DDSP-SVC`'s `main.py` in a child process
   (`VOCABOOT_DDSP_SRC`). The weights are loaded in upstream's address
   space; a successful exploit would still need to break out of the
   subprocess to affect vocaboot's process.

**Social-engineering caveat (cannot be defended automatically)**: if a
user is socially engineered into exporting an adversary-supplied
digest into `VOCABOOT_*_SHA256` AND passing `allow_unsafe=True`, no
in-tree defense catches it. Pin sha256 values from sources you trust
— cross-check against `docs/ckpt_registry.md` or the upstream model
card before exporting.
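
The first two defenses, digest check before parse and weights-only deserialization, can be sketched together as below. This is not the implementation of `core.ckpt_registry.verify()` or `safe_torch_load()`: `sha256_file`, `verified_load`, and `DigestMismatch` are hypothetical names, and the injectable `loader` parameter exists only so the sketch can be exercised without PyTorch installed.

```python
import hashlib
import hmac


class DigestMismatch(Exception):
    """The file's bytes do not match the user-committed digest."""


def sha256_file(path, chunk_size=1 << 20):
    """Stream the file through SHA-256 so large ckpts never sit fully in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def _torch_weights_only(path):
    import torch  # imported lazily; only present with the svc extras
    return torch.load(path, map_location="cpu", weights_only=True)


def verified_load(path, expected_sha256, loader=_torch_weights_only):
    """Refuse at the byte boundary; the deserializer runs only after a match."""
    actual = sha256_file(path)
    if not hmac.compare_digest(actual, expected_sha256.strip().lower()):
        raise DigestMismatch(f"sha256 mismatch for {path}")
    return loader(path)
```

The point of the ordering is that a tampered file is rejected while it is still opaque bytes, so the pickle machinery is never given a chance to run on it.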

## License

Code: Apache-2.0.
Model weights: inherit from upstream — `vocaboot` itself bundles none.
