Metadata-Version: 2.4
Name: vocaboot
Version: 0.3.1
Summary: Terminal-first singing voice synthesis framework for the Claude Code era.
Project-URL: Homepage, https://github.com/hinanohart/vocaboot
Project-URL: Repository, https://github.com/hinanohart/vocaboot
Project-URL: Issues, https://github.com/hinanohart/vocaboot/issues
Project-URL: Changelog, https://github.com/hinanohart/vocaboot/blob/main/CHANGELOG.md
Author-email: vocaboot contributors <255530825+hinanohart@users.noreply.github.com>
License-Expression: Apache-2.0
License-File: LICENSE
License-File: NOTICE
Keywords: agent-loop,claude-code,cli,ddsp-svc,diffsinger,mcp,rvc,singing-voice-synthesis,soulx-singer,svc,svs,vocaloid,voice-conversion
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: MacOS
Classifier: Operating System :: OS Independent
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Topic :: Multimedia :: Sound/Audio :: Sound Synthesis
Requires-Python: >=3.10
Requires-Dist: click>=8.1
Requires-Dist: numpy>=1.24
Requires-Dist: pyyaml>=6.0
Requires-Dist: soundfile>=0.12
Provides-Extra: dev
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.6; extra == 'dev'
Provides-Extra: diffsinger
Requires-Dist: librosa<1,>=0.10; extra == 'diffsinger'
Requires-Dist: onnxruntime<2,>=1.18; extra == 'diffsinger'
Requires-Dist: scipy<2,>=1.11; extra == 'diffsinger'
Provides-Extra: mcp
Requires-Dist: mcp>=0.9; extra == 'mcp'
Provides-Extra: svc
Requires-Dist: torch<3,>=2.1; extra == 'svc'
Requires-Dist: torchaudio<3,>=2.1; extra == 'svc'
Provides-Extra: svc-ddsp
Requires-Dist: numpy>=1.24; extra == 'svc-ddsp'
Requires-Dist: onnxruntime<2,>=1.18; extra == 'svc-ddsp'
Requires-Dist: pyworld<1,>=0.3; extra == 'svc-ddsp'
Requires-Dist: scipy<2,>=1.11; extra == 'svc-ddsp'
Requires-Dist: torch<3,>=2.1; extra == 'svc-ddsp'
Requires-Dist: torchaudio<3,>=2.1; extra == 'svc-ddsp'
Provides-Extra: svc-rvc
Requires-Dist: fairseq<1,>=0.12; extra == 'svc-rvc'
Requires-Dist: librosa<1,>=0.10; extra == 'svc-rvc'
Requires-Dist: numpy>=1.24; extra == 'svc-rvc'
Requires-Dist: scipy<2,>=1.11; extra == 'svc-rvc'
Requires-Dist: torch<3,>=2.1; extra == 'svc-rvc'
Requires-Dist: torchaudio<3,>=2.1; extra == 'svc-rvc'
Provides-Extra: svc-soulx
Requires-Dist: accelerate>=0.30; extra == 'svc-soulx'
Requires-Dist: huggingface-hub>=0.23; extra == 'svc-soulx'
Requires-Dist: librosa<1,>=0.10; extra == 'svc-soulx'
Requires-Dist: torch<3,>=2.1; extra == 'svc-soulx'
Requires-Dist: torchaudio<3,>=2.1; extra == 'svc-soulx'
Requires-Dist: transformers<5,>=4.40; extra == 'svc-soulx'
Provides-Extra: verify
Requires-Dist: pyworld<1,>=0.3; extra == 'verify'
Requires-Dist: scipy<2,>=1.11; extra == 'verify'
Provides-Extra: verify-similarity
Requires-Dist: huggingface-hub>=0.23; extra == 'verify-similarity'
Requires-Dist: librosa<1,>=0.10; extra == 'verify-similarity'
Requires-Dist: torch<3,>=2.1; extra == 'verify-similarity'
Requires-Dist: torchaudio<3,>=2.1; extra == 'verify-similarity'
Requires-Dist: transformers<5,>=4.40; extra == 'verify-similarity'
Description-Content-Type: text/markdown

# vocaboot

> Singing voice synthesis from your terminal, callable from Claude Code via MCP.

**Status**: `0.3.1` Beta (`pip install vocaboot==0.3.1`, PyPI live 2026-05-14). DiffSinger BYO acoustic provider wired. 227 passed / 4 skipped / 0 failed (231 collected, `pytest -q`). <!-- r17-ignore -->  <!-- per-version pin: update count + keep marker on each release bump -->


## 30-second demo

```bash
pip install 'vocaboot[diffsinger,verify]'
vocaboot demo --accept-nc            # 5s wav, CPU; first run fetches ~50 MB NSF-HiFiGAN vocoder
# --accept-nc opts you into the CC-BY-NC-SA-4.0 vocoder weights — personal / non-commercial only.
```

🎧 [**Listen to a pre-generated demo**](examples/demo_output.mp3) (5s, 60 KB MP3).

**Honest disclosure**: Stage 1's `--voice synthetic` path is a 4-partial harmonic oscillator passed through the NSF-HiFiGAN vocoder — it proves the pipeline rings end-to-end and is the zero-dependency smoke target. Voice synthesis itself runs via the **DiffSinger BYO acoustic provider** (`--voice <path>/acoustic.onnx`, `services/svs_engine.py:229` 3-provider dispatch; landed in v0.3.0). Acquisition guide: [`docs/diffsinger_byo_spec.md`](docs/diffsinger_byo_spec.md). The NC posture (`--accept-nc`) still applies because the only license-clean vocoder (NSF-HiFiGAN) is CC-BY-NC-SA-4.0; the MIT vocoder bridge (BigVGAN / Vocos with mel-format adapter) is a future-cycle roadmap item, not a promise. **Commercial-OK usage is NOT shipped in 0.x** — the path is self-train (Stage 0 result B).

## Why use vocaboot

- **Terminal-first.** Every operation works headlessly. No GUI, no web UI, no Electron.
- **Claude-Code-MCP ready.** `vocaboot mcp-serve` exposes `synth` + `version` tools over JSON-RPC stdio so an agent loop can drive synthesis directly. Setup: [`docs/mcp_setup.md`](docs/mcp_setup.md).
- **$0 verify gate.** Every synth runs pyworld F0 / HNR with R5 bootstrap confidence intervals. APPROVE / REVIEW / REJECT map to exit codes 0 / 5 / 6 so CI can distinguish "undetermined" from "rejected".
- **Privacy by default.** `audio_ingest` refuses paths matching `DEFAULT_PROTECTED_SUBSTRINGS` or your `~/.config/vocaboot/protected_patterns.txt`, and requires explicit consent flags before persisting anything (see `src/vocaboot/privacy.py`).
- **Failure museum.** Every failed run writes a scrubbed `_failures/<id>/metadata.json` (paths reduced to `{basename, sha256}`) so the next attempt cannot repeat the same mistake. Raw audio is never committed.
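
The privacy guard in the list above can be pictured as a substring check that runs before any audio path is read. This is a minimal sketch, not the code in `src/vocaboot/privacy.py`: the pattern values, the one-pattern-per-line file format, and the helper names `load_user_patterns` / `is_protected` are all assumptions made for illustration.

```python
from pathlib import Path

# Hypothetical values: the real DEFAULT_PROTECTED_SUBSTRINGS lives in
# src/vocaboot/privacy.py and may differ.
DEFAULT_PROTECTED_SUBSTRINGS = ("private", "my_voice")


def load_user_patterns(pattern_file: Path) -> tuple[str, ...]:
    """Read one substring per line, skipping blanks and '#' comments (assumed format)."""
    if not pattern_file.is_file():
        return ()
    lines = pattern_file.read_text(encoding="utf-8").splitlines()
    return tuple(s.strip() for s in lines if s.strip() and not s.lstrip().startswith("#"))


def is_protected(path: str, patterns: tuple[str, ...]) -> bool:
    """True when any protected substring occurs in the path (case-insensitive)."""
    lowered = path.lower()
    return any(p.lower() in lowered for p in patterns)


# Usage (reads the user's pattern file if it exists):
#   user_file = Path.home() / ".config" / "vocaboot" / "protected_patterns.txt"
#   patterns = DEFAULT_PROTECTED_SUBSTRINGS + load_user_patterns(user_file)
#   if is_protected("/home/me/private/take1.wav", patterns): refuse the ingest
```

Matching on substrings rather than exact paths errs toward refusing too much, which fits a guard whose job is to keep user audio out by default.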

## Quickstart with Claude Desktop

Add to `~/.config/Claude/claude_desktop_config.json` (Linux) or `~/Library/Application Support/Claude/claude_desktop_config.json` (macOS):

```json
{
  "mcpServers": {
    "vocaboot": {
      "command": "vocaboot",
      "args": ["mcp-serve", "--accept-nc"]
    }
  }
}
```

Restart Claude Desktop. Ask Claude: *"Synthesize a singing voice from `examples/short.ds` and save to `~/out.wav`."* Claude routes the request to the `synth` tool via MCP. The `ingest` family (user-voice consumption, CLI subcommand `vocaboot ingest`) is deliberately not exposed — an LLM cannot emit a human-signal consent flag, so that boundary stays CLI-only (`docs/stage_2.md` "MCP boundary invariant").

Full setup, smithery.ai registry posting, and troubleshooting: [`docs/mcp_setup.md`](docs/mcp_setup.md).
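
If the config file already lists other MCP servers, pasting the JSON above over it would drop them. The sketch below merges the `vocaboot` entry instead; the function name `add_vocaboot_server` is hypothetical and not part of vocaboot, and the config path is passed in by the caller.

```python
import json
from pathlib import Path


def add_vocaboot_server(config_path: Path) -> dict:
    """Merge the vocaboot entry into the config, keeping other servers intact."""
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    else:
        config = {}
    servers = config.setdefault("mcpServers", {})
    servers["vocaboot"] = {"command": "vocaboot", "args": ["mcp-serve", "--accept-nc"]}
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Usage on Linux:
#   add_vocaboot_server(Path.home() / ".config/Claude/claude_desktop_config.json")
```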

## How vocaboot fits in

Existing singing voice synthesis (SVS) tools — SynthV, VOCALOID, NEUTRINO, CeVIO AI — are GUI-only. You cannot script them from a terminal, cannot pipe them through an agent like Claude Code, cannot automate batch covers, cannot integrate them into CI. `vocaboot` is the bridge: a single `pip install` and you can synthesize from your shell, from a Python `import`, or from an MCP-aware agent.

vocaboot does **not** out-quality RVC or DDSP-SVC on the engine axis — those communities already have hundreds of person-years invested. The differentiation slot is: *the only terminal-first SVS/SVC wrapper with an MCP-shaped surface, a $0 numeric verify gate, a privacy guard with structural user-audio exclusion, and an auto-failure museum.* See "Related work (delta)" below.

## Design constraints

- **Cost: $0 (inference path)** — CPU-only numeric verify is the $0 invariant. The *self-train* path (see "Stage 0 result") is **not** $0 by definition: it requires user GPU compute and audio data.
- **License: Apache-2.0** (code) — model weights inherit their upstream license; no NC weights bundled.
- **Generality**: the framework must work with any voice, any song, any host environment (WSL2 / Linux / macOS first-class, Windows second-class).
- **"Roughly close" is good enough**: chasing literal pitch-perfect voice cloning is a structural failure mode (see `_failures/`); the framework optimizes for *direction* of a target timbre, not literal match.
- **Auto-failure museum (R8)**: every failed run writes a structured post-mortem under `_failures/`, so the next attempt cannot repeat the same mistake. Paths are scrubbed to `{basename, sha256}` before persistence (R15); cyclic exception detail is replaced with `<cycle>` placeholders so the museum itself never raises.
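
The two museum safeguards in the last bullet (path scrubbing and cycle replacement) could look roughly like the sketch below. It rests on assumptions: the digest here covers the path string, whereas the real museum may hash file contents, and the helper names `scrub_path` / `scrub_detail` are hypothetical.

```python
import hashlib
from pathlib import PurePath


def scrub_path(path: str) -> dict:
    """Reduce a path to its basename plus a SHA-256 digest.

    Assumption: the digest covers the path string; the real museum may
    hash the file contents instead.
    """
    digest = hashlib.sha256(path.encode("utf-8")).hexdigest()
    return {"basename": PurePath(path).name, "sha256": digest}


def scrub_detail(value, _ancestors=None):
    """Copy nested dicts/lists, replacing self-references with '<cycle>'."""
    ancestors = set() if _ancestors is None else _ancestors
    if isinstance(value, (dict, list)):
        if id(value) in ancestors:
            return "<cycle>"
        ancestors.add(id(value))
        try:
            if isinstance(value, dict):
                return {str(k): scrub_detail(v, ancestors) for k, v in value.items()}
            return [scrub_detail(v, ancestors) for v in value]
        finally:
            # Only containers on the current branch count as ancestors, so a
            # value shared by two siblings is copied twice, not flagged.
            ancestors.discard(id(value))
    return value
```

Tracking only the current ancestry keeps the placeholder for true cycles, so the result is always a finite tree that `json.dumps` can serialize.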

## Architecture overview (locked, 4 stages, strict superset)

| Stage | What works | Modules added |
|---|---|---|
| **0** | `ckpt` acquisition gate — closed 2026-05-12 with result **(B): 0 license-clean pairs**; project ships `--accept-nc` opt-in + self-train as dual smoke path. | (research, no new module) |
| 1 | One engine, one voice, 5-second wav, CPU, **agent-verify gate** | input + score_align + svs + audio_out + agent_verify + metrics + agent_advisory + museum + license_check + privacy + stage_1 (= **11 layer modules**; `cli.py` is the CLI entry point and is excluded from the module count) |
| 2 | Any voice, any song | + profile_match (optional target spec) |
| 3 | "Roughly like X" via voice conversion | + svc (DDSP-SVC default; RVC v2 / SoulX-Singer plugin) |
| 4 | Natural-language input (`vocaboot "sing 'Yume' like a soft anime voice"`) | + nl_parse |

Hard ceilings (anti-bloat invariant): layers ≤ 9, **layer modules ≤ 12** (CLI entry, MCP entry etc. are surface-level and counted separately), hooks ≤ 2. Stage 4 lands at 14 layer modules → at the latest by Stage 3, `score_align` will be folded into `input` and `agent_advisory` into `agent_verify` to keep the ceiling intact.
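
The module arithmetic behind that ceiling can be checked directly from the table. The snippet below only counts names taken from the table; the two folds are the ones named in the paragraph above.

```python
LAYER_MODULE_CEILING = 12

stage_1 = ["input", "score_align", "svs", "audio_out", "agent_verify", "metrics",
           "agent_advisory", "museum", "license_check", "privacy", "stage_1"]
added = {2: ["profile_match"], 3: ["svc"], 4: ["nl_parse"]}

modules = list(stage_1)
counts = {1: len(modules)}
for stage in (2, 3, 4):
    modules += added[stage]
    counts[stage] = len(modules)

# Planned folds: score_align -> input, agent_advisory -> agent_verify.
folded = [m for m in modules if m not in {"score_align", "agent_advisory"}]

print(counts)       # {1: 11, 2: 12, 3: 13, 4: 14}
print(len(folded))  # 12
```

Without the folds the count already exceeds 12 at Stage 3, which is why the folds are due by then rather than at Stage 4.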

### Stage 0 result (2026-05-12, closed: B)

Stage 0 ckpt research closed with **0 Apache-2.0 / MIT / CC-BY-4.0 / CC0 acoustic+vocoder pairs** for the DiffSinger format:

- **vocoder**: BigVGAN v2 44kHz (MIT / NVIDIA) and Vocos mel-24kHz (MIT / charactr) are license-clean, but their mel formats (general 80-bin mel) are **not drop-in compatible** with DiffSinger's F0-conditioned NSF 44.1 kHz / 128-bin mel.
- **acoustic**: every public DiffSinger community voicebank (`qixuan`, `renri`, `nyaru`, `mitsudate` …) is CC-BY-NC-SA-4.0. The openvpi *code* is Apache-2.0; the *weights* are not.

vocaboot's surface resemblance to Stable Diffusion WebUI / kohya_ss is intentional ("Apache-2.0 code, license-aware opt-in for weights") but the *substantive* difference must be stated: **vocaboot has no `pip install`-and-go commercial-OK default path**, because no DiffSinger-format Apache/MIT/CC0 acoustic exists. The two real smoke paths are:

| Path | Tier | $0? | Use |
|---|---|---|---|
| `--accept-nc` opt-in | CC-BY-NC-SA-4.0 (openvpi NSF-HiFiGAN + community acoustic) | Yes (inference only) | personal, non-commercial smoke |
| self-train | Apache-2.0 openvpi/DiffSinger code + MIT Vocos/BigVGAN vocoder bridge + user audio | **No** — user supplies GPU (8 GB+ VRAM), 1–3 h labeled audio, ~12–48 GPU-hours of training, plus a self-implemented mel-format adapter (Stage 3+ contribution path) | commercial / OSS-clean redistribution |

The default tier in `docs/ckpt_registry.md` stays empty until upstream licensing changes. **Engine pivot** (to nnsvs / FastDiff / etc.) is a deferred option, not a rejected one: if Stage 0 is re-opened against another SVS framework with a license-clean default tier, the project may pivot. See `docs/ckpt_registry.md` for the deferral rationale.

### Output wav licensing (NC opt-in path)

When `--accept-nc` is set, the resulting `.wav` is the **user's copyright**, and CC's public position is that CC license conditions on the model do **not** automatically extend to model outputs ([Creative Commons FAQ](https://creativecommons.org/faq/#do-i-need-permission-to-use-a-cc-licensed-work-as-input-to-train-an-ai-model), [Using CC-licensed Works for AI Training, 2025](https://creativecommons.org/wp-content/uploads/2025/05/Using-CC-licensed-Works-for-AI-Training.pdf)). However: **the act of running NC weights is itself bound by the NC clause**, so any commercial use case requires the self-train path. vocaboot refuses to load NC weights unless `--accept-nc` is explicitly passed.

The openvpi vocoder distribution carries CC-BY-NC-SA-4.0 attribution requirements (a copy of the license + "OpenVPI Community / DiffSinger Community" notice + project page link). Stage 1 step 2 wires a structured `LICENSE_NOTICE.txt` emission into the failure museum and stdout when an NC ckpt is loaded; see `docs/ckpt_registry.md` for the literal text.
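
The opt-in rule in this subsection amounts to a small decision function. The sketch below is illustrative only: `check_ckpt_license` and `LicenseRefused` are hypothetical names, not the API of vocaboot's `license_check` module, and the clean-license set simply mirrors the four licenses named in the Stage 0 result.

```python
# License identifiers as written in this README (not validated SPDX ids).
CLEAN_LICENSES = {"Apache-2.0", "MIT", "CC-BY-4.0", "CC0"}
NC_LICENSES = {"CC-BY-NC-SA-4.0"}


class LicenseRefused(Exception):
    """Raised when weights may not be loaded under the current flags."""


def check_ckpt_license(license_id: str, accept_nc: bool) -> str:
    """Return the tier a checkpoint falls into, or raise if loading is refused."""
    if license_id in CLEAN_LICENSES:
        return "clean"
    if license_id in NC_LICENSES:
        if not accept_nc:
            raise LicenseRefused(f"{license_id} weights need an explicit --accept-nc")
        return "nc-opt-in"
    # Assumption: unknown licenses are refused rather than guessed at.
    raise LicenseRefused(f"unrecognized license {license_id!r}; refusing by default")
```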

### "OSS-1 accessible" invariant — Stage 0 result reframe

Round 35 analyst pinned OSS-1 to "1 command quick start → smoke wav". With Stage 0 result (B), the literal invariant becomes:

> **OSS-1 (revised, 2026-05-12)**: 2 commands max to smoke wav. After `pip install 'vocaboot[diffsinger,verify]'`, running `vocaboot demo --accept-nc` must fetch a registry-pinned NC vocoder, synth a 5-second wav from the bundled `examples/short.ds`, and exit 0 (APPROVE) or 5 (REVIEW) within 60 seconds on CPU.

The `demo` subcommand is a Stage 1 step 2 deliverable. The `--accept-nc` requirement is *the* deviation from the original 1-command target, called out here explicitly so reviewers do not mistake the dual-path docs for an OSS-1-compliant default.

## Verify gate ($0, no Claude dependency)

The agent-verify gate runs pyworld F0 / HNR with R5 bootstrap CIs on every synth run. No subprocess, no Claude dependency, $0 install footprint (numpy + pyworld + scipy via `[verify]`). APPROVE / REVIEW / REJECT map to exit codes 0 / 5 / 6 so CI distinguishes "undetermined" from "rejected".
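
For readers unfamiliar with bootstrap confidence intervals, a generic percentile bootstrap is sketched below using only the standard library. It is not the R5 procedure itself, whose statistic, resample count, and thresholds are not specified in this README; `bootstrap_ci` and its defaults are assumptions.

```python
import random
import statistics


def bootstrap_ci(samples, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `stat` over `samples`."""
    rng = random.Random(seed)  # seeded so repeated runs give the same interval
    n = len(samples)
    # Resample with replacement, recompute the statistic, and sort the results.
    estimates = sorted(stat(rng.choices(samples, k=n)) for _ in range(n_resamples))
    lo = estimates[int((alpha / 2) * n_resamples)]
    hi = estimates[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


# Usage with made-up per-frame F0 errors (in cents):
#   lo, hi = bootstrap_ci([3.1, -2.4, 0.8, 5.0, -1.2, 2.2])
```

An interval rather than a point estimate is what allows a gate to report "undetermined" when the data cannot separate pass from fail.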

Code review is **not** a per-synth concern; it is a per-commit concern, run separately via CI (kluster + GitHub Actions). The original Round-35 design included a `--verify=agent` mode that invoked `claude -p` for static code review, but the same LLM-as-judge fragility that excluded Agent C from the verdict (Round 26 number-fabrication regression) applies identically to Agent A; the mode was removed before wiring (step 5e, 2026-05-12).

Speaker-similarity metrics (WavLM cosine, ECAPA-TDNN) join this gate in Stage 2 once profile-match lands, behind a `[verify-similarity]` extra so the default Stage 1 install stays footprint-minimal.

Agent C (qualitative LLM-as-judge advisory) still runs as **museum-annotation only** — structurally excluded from the verdict by design.

`--no-agent-verify` exists for unit-testing the synth path in isolation; CI is the only caller that should ever use it.
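
A CI script can branch on the three documented exit codes as sketched below. The mapping comes from this section; treating every other code as a generic `ERROR` is an assumption, and `verdict_from_exit_code` / `run_demo` are hypothetical helper names.

```python
import subprocess

VERDICT_BY_EXIT_CODE = {0: "APPROVE", 5: "REVIEW", 6: "REJECT"}


def verdict_from_exit_code(code: int) -> str:
    """Map a vocaboot exit code to its verdict; anything else is a plain error."""
    return VERDICT_BY_EXIT_CODE.get(code, "ERROR")


def run_demo() -> str:
    """Run the documented demo command and return the verdict."""
    proc = subprocess.run(["vocaboot", "demo", "--accept-nc"], check=False)
    return verdict_from_exit_code(proc.returncode)
```

A pipeline might fail the build on `REJECT` and `ERROR` but only flag `REVIEW` for a human, which is the distinction the separate exit codes exist to support.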

## Distribution

Stage 1 ships:

- `pip install 'vocaboot[diffsinger,verify]'` → either
  - `vocaboot demo --accept-nc` (OSS-1: bundled score, synthetic acoustic, registry-pinned NC vocoder), **or**
  - `vocaboot synth --voice <ckpt> --score examples/short.ds --out smoke.wav --accept-nc` (BYO ckpt path).
- `import vocaboot` → Python library (same package)

Treated as Stage 2 retrospective decisions (no irrevocable rejection now):

- **MCP server stdio**: natural surface for agent loops; cheap to add as a thin wrapper over the library API. Since shipped as `vocaboot mcp-serve` in v0.2.1 (see Roadmap).
- **Anthropic Skill / Subagent** — natural surface for Claude Code users; tradeoff is client coupling.
- **Headless library only (no CLI)** — would simplify install footprint if the CLI proves redundant against MCP/Skill.

Choosing among these is a retrospective decision after we know which surface actual users reach for. The Round 35 architect lock did **not** rule any of them out.

## Related work (delta)

`vocaboot` is not the first OSS SVS/SVC project. The slot it occupies is **terminal- and agent-loop-friendly wrapping** around upstream engines, not engine R&D itself. Direct comparisons:

| Project | What it does | What `vocaboot` adds |
|---|---|---|
| **SVC engines (Stage 3 plugin set)** | | |
| [RVC-Project / Retrieval-based-Voice-Conversion-WebUI](https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI) (35.6k★, MIT, large community voicebank set) | GUI-first realtime SVC, voicebanks per-license-audited | Terminal & MCP surface, $0 numeric verify gate, R5 CI, failure museum, agent-callable; vocaboot wraps RVC v2 as a plugin, not competing on engine quality |
| [yxlllc / DDSP-SVC](https://github.com/yxlllc/DDSP-SVC) (2.5k★, MIT, CPU-capable) | Lightweight DDSP-based SVC, mature CLI inference | Primary engine of record for Stage 3; vocaboot adds protocol bifurcation, registry-pinned weights, and verify-against-target post-conversion gate |
| [Soul-AILab / SoulX-Singer](https://github.com/Soul-AILab/SoulX-Singer) (608★, Apache-2.0 code + weights, 42 000 h vocal train) | Transcription-free zero-shot SVC; Apache-2.0 across the board (only one in the plugin set) | Default-tier candidate; vocaboot integrates as remote (HF Space) or subprocess depending on the 5.6 GB / GPU envelope (see `docs/stage_3.md` for the A/B/C distribution-form simulation) |
| **SVS engines (Stage 1 plugin set)** | | |
| [openvpi / DiffSingerMiniEngine](https://github.com/openvpi/DiffSingerMiniEngine) | Python HTTP server wrapping ONNX inference | License gate + privacy guard + R5-CI verify gate + failure museum + agent-callable surface |
| [diffscope / dsinfer](https://github.com/diffscope/dsinfer) | C++ low-level inference SDK | Higher-level Python API + agent integration + verify gate |
| [Jobsecond / diffsinger-onnx-infer](https://github.com/Jobsecond/diffsinger-onnx-infer) | Reference ONNX inference snippet | Production-grade gates + structured failure capture + R8 reproducibility |
| OpenUtau (headless mode) | GUI-first with scripted batch | $0 verify gate, no GUI bridge required |
| [nnsvs / nnsvs](https://github.com/nnsvs/nnsvs) | MIT SVS framework; weights per-voicebank | Stage 0 re-target candidate if upstream weight licensing changes |

`vocaboot` does **not** try to out-quality RVC or DDSP-SVC on the engine axis — those communities already have hundreds of person-years invested. The differentiation slot is: **the only terminal-first SVS/SVC wrapper with an MCP-shaped surface, a $0 numeric verify gate (R5 bootstrap CI), a privacy guard with structural user-audio exclusion, and an auto-failure museum**. If a clean integration into one of the above projects becomes more valuable than a standalone wrapper, vocaboot will redirect toward a PR contribution.

## Stage 1 (current target)

```bash
# OSS-1 quick start (bundled score, synthetic + NC vocoder, 5s):
vocaboot demo --accept-nc

# BYO ckpt path (your DiffSinger acoustic + the registry vocoder):
vocaboot synth --engine diffsinger \
    --voice <ckpt_path> \
    --score examples/short.ds \
    --out smoke.wav \
    --accept-nc \
    --ckpt-license CC-BY-NC-SA-4.0
# agent verify gate (pyworld F0/HNR + R5 bootstrap CI) runs automatically.
# REJECT/REVIEW persists to ./_failures/.
```

`examples/short.ds` uses generic CV phoneme tokens (`d o r e m i f a s o`); whether those map cleanly to the chosen ckpt's phoneme dictionary is ckpt-dependent. Stage 1 step 2 pins one reference ckpt and adjusts `examples/short.ds` to match its dictionary; for other ckpts, users must align phoneme tokens themselves.

### Two acoustic paths: synthetic (smoke) and DiffSinger BYO (voice)

`--voice synthetic` produces a **4-partial harmonic oscillator** driven by the score's note sequence — `ph_seq` is deliberately ignored, output has no vowel formants or consonants. Purpose: prove the end-to-end pipeline (score → mel + f0 → NSF-HiFiGAN vocoder → wav → verify gate) wires correctly without a checkpoint. A NUMERIC verify APPROVE on the synthetic path proves the pipeline **rings**, not that the output sounds like a voice.

`--voice <path>/acoustic.onnx` activates the **DiffSinger BYO acoustic provider** (v0.3.0+, `services/svs_engine.py:229`): `acoustic.onnx` + sibling `phonemes.json` are loaded by onnxruntime CPU, `ph_seq` / `ph_dur` / `note_seq` are converted to `(tokens: int64, durations: int64, f0: float32)` per the openvpi v2.3.0 contract, and the mel `[1, n_frames, 128]` is handed to NSF-HiFiGAN. This is real vocal-tract-filtered synthesis. Acquisition guide + on-disk layout + failure-mode table: [`docs/diffsinger_byo_spec.md`](docs/diffsinger_byo_spec.md).

**G1 wiring-sanity gate** (NOT a voice-quality gate): `tests/integration/test_byo_acoustic.py` asserts `wav.size ≥ 0.75 × expected` + `np.isfinite(wav).all()`. This proves the BYO ckpt is wire-loadable and produces a finite waveform; it does **not** assert that the output is voice-like or matches the input voicebank. Voice-quality objective evaluation (`wavlm_cosine ≥ 0.60` + `MCD CI upper bound ≤ 500 dB` librosa-MFCC scale, see `docs/quality_gate.md:14,:45`, on N≥3 voicebanks) is post-0.3.0 work tracked in `CHANGELOG.md` under `## [Unreleased]`. <!-- r17-ignore -->
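
The two G1 assertions translate into a few lines of code. The real test works on NumPy arrays; the sketch below is a standard-library rendering of the same two conditions, and the 44.1 kHz sample rate is an assumption taken from the vocoder description in the Stage 0 result.

```python
import math


def passes_g1(wav, expected_samples: int) -> bool:
    """Wiring sanity only: long enough and free of NaN/inf. Says nothing about quality."""
    long_enough = len(wav) >= 0.75 * expected_samples
    all_finite = all(math.isfinite(x) for x in wav)
    return long_enough and all_finite


sample_rate = 44_100        # assumption: 44.1 kHz output
expected = 5 * sample_rate  # a 5-second clip
```

Pure silence passes this check, which is exactly why G1 is described as a wiring gate and not a voice-quality gate.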

Stage 2 spectral/timbral metrics behind `[verify-similarity]` discriminate synthetic from real-voice output post-synth (`wavlm_cosine` + MCD with 2-sided threshold band).

## Stage 3 DDSP-SVC voice conversion (v0.2.0-rc1, BYO)

```bash
pip install 'vocaboot[svc-ddsp]'
git clone https://github.com/yxlllc/DDSP-SVC.git ~/ddsp-svc
# ...follow upstream README to install its requirements, train or
# download a ckpt, and compute sha256sum model.pt

export VOCABOOT_DDSP_PATH=~/ddsp-svc/exp/combsub-test/model_300000.pt
export VOCABOOT_DDSP_SHA256=<digest>
export VOCABOOT_DDSP_SRC=~/ddsp-svc

vocaboot ddsp provision        # verify sha256 + weights-only probe
vocaboot ddsp convert --source singer.wav --out converted.wav
```

The `convert` subcommand walks a 4-stage refusal ladder (unconfirmed
sha256 → ckpt not found → hash mismatch → engine not wired) and
dispatches `model.forward` to the upstream subprocess. An in-process
vendored adapter making the `VOCABOOT_DDSP_SRC` step optional is on
the post-1.0 roadmap (no version promise; tracked as "DDSP vendored
adapter" under "Deferred" below).
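
The refusal ladder is an ordered series of checks in which the first failure wins. The sketch below follows the order given above using the three documented environment variables. What "engine not wired" means in practice is an assumption here (a missing `VOCABOOT_DDSP_SRC` directory), and `first_refusal` is a hypothetical name.

```python
import hashlib
import os
from pathlib import Path


def first_refusal(env=os.environ):
    """Return the first failing rung as a string, or None when all four pass."""
    digest = env.get("VOCABOOT_DDSP_SHA256", "")
    if not digest:
        return "unconfirmed sha256"

    ckpt = env.get("VOCABOOT_DDSP_PATH", "")
    if not ckpt or not Path(ckpt).is_file():
        return "ckpt not found"

    actual = hashlib.sha256(Path(ckpt).read_bytes()).hexdigest()
    if actual != digest.strip().lower():
        return "hash mismatch"

    # Assumption: the engine counts as wired once the upstream checkout exists.
    src = env.get("VOCABOOT_DDSP_SRC", "")
    if not src or not Path(src).is_dir():
        return "engine not wired"

    return None
```

Checking for the digest first means a checkpoint is never opened until the user has committed to a hash for it.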

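## Completion criteria (OSS 1k★ level, R14 round 5 concept level, 2026-05-14 snapshot)

vocaboot counts as "completed" once it reaches the level at which an OSS project earns roughly 1k stars; the five product-level criteria below define that level. They are outward-facing conditions layered on top of the v0.1.0 alpha internal-quality gates (R5 verify / R8 museum / R15 privacy), derived by working backward from what comparable SVS/SVC projects (RVC 35k★ / MoonInTheRiver/DiffSinger 4.7k★ / OpenUtau 3.8k★ / openvpi/DiffSinger 3.1k★ / DDSP-SVC 2.5k★) have in common.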

- **C1.** Sound within 30 seconds of `pip install vocaboot[...]`: v0.2.0 opened the **DiffSinger ONNX BYO path** and v0.3.0 completed the BYO acoustic provider. A fully NC-free OSS demo default waits on a future cycle in which the MIT vocoder bridge (BigVGAN / Vocos with mel-format adapter) lands (outstanding round 8 critic finding; must land before v1.0.0 ships).
- **C2.** The output "sounds like singing": v0.3.0 landed the **DiffSinger BYO acoustic provider** (`services/svs_engine.py:229` 3-provider dispatch, acoustic.onnx + phonemes.json companion, openvpi v2.3.0 baseline). The v0.3.0 ship gate is the **G1 wiring sanity gate** (`tests/integration/test_byo_acoustic.py`: wav.size ≥ 0.75 × expected + `np.isfinite(wav).all()`; it makes no voice-quality claim). The v1.0.0 ship gate additionally requires the **G2 objective gate** (wavlm_cosine ≥ 0.60 + MCD CI upper bound ≤ 500 dB librosa-MFCC scale, see `docs/quality_gate.md:14,:45`, on N≥3 voicebanks) plus a **maintainer-attested N=1 listening pass**; community feedback and real-ckpt validation accumulate over the 0.x lifetime.
- **C3.** A working demo wav (`<audio>` embed) plus a 30-second mp4 demo in the README: partially met by linking `examples/demo_output.mp3` (5s, 60 KB) near the top of the README. The mp4 demo belongs to the v1.0.0 candidate cycle.
- **C4.** A 1-click try page on HF Space / Replicate: via Stage 3 SoulX distribution form C, future cycle.
- **C5.** Launch posts on Twitter/X + HN + Reddit r/MachineLearning: to be done at the v1.0.0 release; C3-C4 must land before then.

vocaboot's own axis, "terminal-first + verify gate + failure museum", is **developer-tooling value**: it reaches the 20% of the OSS star audience that is the dev community, while the remaining 80%, the non-engineer audience, needs C1-C5. **v1.0.0 completion-declaration criteria** (concept level, future cycle): A (DiffSinger BYO acoustic provider, landed in v0.3.0) + G2 objective gate pass (real ckpt, N≥3) + MIT vocoder bridge landed (resolves the NC posture) + maintainer-attested voice listening + R14 round 9 session-separation audit pass + final packaging. The 0.3.0 → 1.0.0 jump uses, at minimum, (G2 pass + MIT bridge landed) as its ship gate.

## Roadmap

- [x] Architecture lock-in (rounds 28–32; per-round audit trails are inlined into `docs/stage_2.md` ("R3 / R4 / R5 / R6 / R7 / R8 / R14 round-N" subsections), with closure decisions tracked in `CHANGELOG.md`)
- [x] **Stage 0**: ckpt acquisition gate closed with result (B), dual-path adopted (2026-05-12)
- [x] **Stage 1 step 2 (5c)**: `vocaboot demo` subcommand wired (click.group, bundled `examples/short.ds`, `--accept-nc` mandatory, SHA-256 runtime verify of the pinned vocoder).
- [x] Stage 1 step 2 (5d): `LICENSE_NOTICE.txt` attribution emission for the NC vocoder (sidecar `<out>.LICENSE_NOTICE.txt` + stdout summary; CC-BY-NC-SA-4.0 §3(a) clauses spelled out).
- [x] Stage 1: smoke wav (CPU, 5 s) via `vocaboot demo --accept-nc` (synthetic 4-partial oscillator + NSF-HiFiGAN; rings the pipeline end-to-end).
- [x] Stage 1 agent-verify gate (numeric pyworld F0/HNR + R5 bootstrap CI; AGENT mode removed in step 5e — LLM-as-judge fragility, code review delegated to CI).
- [x] Stage 1 closed (2026-05-12): a bit-exact wav regression test (originally step 5f) was rejected after review — OSS-mainline SVS/TTS projects (HF / Coqui / Vocos / NeMo / ESPnet) all use approximate/perceptual checks; the numeric verify gate plus the existing duration/RMS assertions in `test_stage1_synthetic_smoke_wav_end_to_end` already serve as the CI canary.
- [x] **Stage 1.5** closed (2026-05-12): R14 3-agent debate (analyst/architect/critic, all REVISE) surfaced 6 gaps before Stage 2 could begin. Five closures: (A) `stage_1.py` → `pipeline.py` rename, (B) `privacy.refuse_user_audio_without_consent` + `museum` user-audio hard-exclude rule, (C) `docs/quality_gate.md` Tier ladder (Tier 1 rings → Tier 2 voice-like → Tier 3 target-quality; Tier 4 subjective MOS out of scope), (D) `docs/ckpt_registry.md` acquisition Path B (NC-only public release) chosen, (E) `docs/stage_2.md` spec.
- [x] **Public OSS release** (2026-05-13): Apache-2.0 LICENSE pristine, NOTICE chain audited, GitHub `license.spdx_id=apache-2.0` confirmed restored. 4 commits (`c267afd` + `18733d1` + `9f0ae47` + `26a3589`) on `origin/main`.
- [x] **v0.2.0 release tag** (2026-05-13; 0.2.1 at the time of writing): C1+C2 achieved at BYO scope and Stage 3 DDSP-SVC real-inference wiring completed (R14 round 5 concept-level judgment). PyPI live: `pip install vocaboot==0.2.1`.
- [x] **Stage 2.5 MCP server** (`vocaboot mcp-serve`, bundled in v0.2.1, 2026-05-13): a thin wrapper over `pipeline.run` (`services/mcp_server.py`). The `[mcp]` extra wires in `mcp>=0.9` and serves the `synth` / `version` tools over stdio JSON-RPC. `audio_ingest` and the other consent-gated operations stay unexposed under the MCP boundary invariant (`docs/stage_2.md`); `accept_nc` is fixed at server start via `--accept-nc` as the human signal.
- [x] **Stage 2 closure** (2026-05-13, labeled in v0.2.1): all 9 `docs/stage_2.md` acceptance criteria (Tier 2 canary / discrimination / consent gate / VoiceProfile round-trip / no Stage 1 regression / layer count / WavLM pin / `SpeakerSimilarityProtocol`) had green test gates as of 0.2.0.post2; 0.2.1 states this explicitly in the changelog.

### v0.3.0 (released 2026-05-14, BYO acoustic provider landing)

- [x] **PR0 prerequisite**: mypy strict configuration + 8 genuine bug fixes (latent bugs such as `pipeline.py:66/165`, detected by actually running mypy strict), full rewrite of the design document (`docs/round8_w1_a_design.md`), and G2 wording turned into an objective gate.
- [x] **W1 = A**: DiffSinger BYO acoustic provider at `services/svs_engine.py:229` (openvpi/DiffSinger v2.3.0 baseline, via acoustic.onnx + phonemes.json, ~140 LoC, 21 unit + 1 integration test).
- [x] **packaging**: NC posture explicit, PyPI 0.3.0 shipped. The G1 wiring sanity gate (`tests/integration/test_byo_acoustic.py`) adopted as the 0.3.0 ship gate.

### Toward v1.0.0 (concept level, future cycle, NOT 0.3.0 ship blockers)

Per the audit by both agents (critic + analyst), the 1.0.0 ship blockers are: G2 objective gate pass + MIT vocoder bridge landed + maintainer-attested voice listening + R14 round 9 session-separation audit + an API stability commitment in place. These accumulate over the v1.0.0 candidate cycle so that 1.0.0 ships in a state where the "production stable" claim matches reality. Inserting 0.4.0 / 0.5.0 / ... between 0.3.0 and 1.0.0 remains an open option.

### Deferred (future cycles, no version promises)

- **B**: Remaining Stage 3 SVC work (`services/svc/rvc.py` RVC v2 per-voicebank license gate + `services/svc/soulx.py` SoulX-Singer Apache-2.0 subprocess + HF Space remote).
- **C**: Stage 4 natural-language interface (NL prompt → DiffSinger `.ds` JSON, `[nl]` extra).
- **MIT vocoder bridge** (BigVGAN / Vocos + mel-format adapter): the real path to removing the NC default. v1.0.0 ship gate (resolves the outstanding round 8 critic finding).
- **G2 objective gate verification**: measure wavlm_cosine / MCD with a real openvpi DiffSinger ckpt on N≥3 voicebanks. v1.0.0 ship gate.
- **Round 9 R14 session-separation audit**: rounds 7/8/W1 all ran inside the same originSessionId 850adffc, leaving an integrity residue (critic C1.1 residual 2.5/10). To be run in a separate session; v1.0.0 ship gate.
- **DiffSinger fork variants** (mel_bin=80 etc., upstream schema evolution): handled via the mel-format adapter.
- **Completion C3-C5**: 30s mp4 demo + HF Space 1-click + HN/Reddit launch.
- **DDSP vendored adapter**: replace the `VOCABOOT_DDSP_SRC` subprocess with an in-process adapter.
- **MCP-server-start telemetry instrumentation**: to be implemented as an observe-only PR.

## Non-goals

- Pitch-perfect voice cloning (target literal match is a known failure mode)
- Bundled model weights (we point at upstream URLs and verify SHA-256)
- GUI

## Security: malicious ckpt threat model

vocaboot loads PyTorch checkpoint files (`.pt`) supplied by the user via
`VOCABOOT_DDSP_PATH` / `VOCABOOT_VOCODER_PATH` / `VOCABOOT_WAVLM_PATH`. A
malicious `.pt` file can embed pickle payloads that execute arbitrary
code at deserialization time (OWASP A08 / CWE-502 — *insecure
deserialization*). Defenses:

1. **sha256 verify before load** — `core.ckpt_registry.verify()` refuses
   any ckpt whose bytes don't match the user-committed digest
   (`VOCABOOT_*_SHA256`). A man-in-the-middle download swap is caught
   at the byte boundary, before PyTorch's parser is reached.
2. **`torch.load(weights_only=True)`** — the default in
   `core.ckpt_registry.safe_torch_load()`. PyTorch's weights-only
   deserializer refuses arbitrary classes; legitimate ckpts bundling
   `argparse.Namespace` (e.g. DDSP-SVC v5.0) trigger `UnsafeCkptError`
   so the user explicitly opts in to `allow_unsafe=True` rather than
   silently widening the trust boundary.
3. **Subprocess isolation for upstream inference** — `vocaboot ddsp
   convert` invokes `yxlllc/DDSP-SVC`'s `main.py` in a child process
   (`VOCABOOT_DDSP_SRC`). The weights are loaded in upstream's address
   space; a successful exploit would still need to break out of the
   subprocess to affect vocaboot's process.

**Social-engineering caveat (cannot be defended automatically)**: if a
user is socially engineered into exporting an adversary-supplied
digest into `VOCABOOT_*_SHA256` AND passing `allow_unsafe=True`, no
in-tree defense catches it. Pin sha256 values from sources you trust
— cross-check against `docs/ckpt_registry.md` or the upstream model
card before exporting.
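
The ordering behind defense 1 (hash the bytes first, parse second) can be sketched with the standard library alone. This is not `core.ckpt_registry.verify()`; the names `file_sha256`, `verify_then_load`, and `DigestMismatch` are hypothetical, and the loader is passed in so the sketch runs without PyTorch installed.

```python
import hashlib
import hmac
from pathlib import Path


class DigestMismatch(Exception):
    """The file's bytes do not match the digest the user committed to."""


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large checkpoints need little memory."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_then_load(path: Path, expected_sha256: str, loader):
    """Call `loader` only after the digest matches; no parser sees unverified bytes."""
    actual = file_sha256(path)
    if not hmac.compare_digest(actual, expected_sha256.strip().lower()):
        raise DigestMismatch(f"{path.name}: expected {expected_sha256}, got {actual}")
    return loader(path)


# With PyTorch installed, a weights-only loader could be passed as:
#   lambda p: torch.load(p, map_location="cpu", weights_only=True)
```

The digest check only proves the file is the one the user pinned; whether that pinned file is trustworthy is the social-engineering caveat described above.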

## License

Code: Apache-2.0.
Model weights: inherit from upstream — `vocaboot` itself bundles none.
