Metadata-Version: 2.4
Name: spine-codec
Version: 0.1.0
Summary: A neural audio codec for expressive speech
Project-URL: Homepage, https://github.com/twangodev/spine-codec
Project-URL: Repository, https://github.com/twangodev/spine-codec
Project-URL: Issues, https://github.com/twangodev/spine-codec/issues
Project-URL: Model Card, https://huggingface.co/twangodev/spine-codec
Author: James Ding
License-Expression: Apache-2.0
License-File: LICENSE
Keywords: ddsp,fsq,neural-audio-codec,speech,tts
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Multimedia :: Sound/Audio
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Typing :: Typed
Requires-Python: >=3.12
Requires-Dist: einops>=0.7
Requires-Dist: huggingface-hub>=1.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: safetensors>=0.8.0
Requires-Dist: torch>=2.4
Requires-Dist: torchaudio>=2.4
Requires-Dist: torchcodec>=0.14.0
Requires-Dist: vector-quantize-pytorch>=1.14
Provides-Extra: train
Requires-Dist: wandb; extra == 'train'
Description-Content-Type: text/markdown

# spine

[![PyTorch](https://img.shields.io/badge/PyTorch-ee4c2c?logo=pytorch&logoColor=white)](https://pytorch.org/)
[![PyPI](https://img.shields.io/pypi/v/spine-codec)](https://pypi.org/project/spine-codec/)
[![Hugging Face](https://img.shields.io/badge/Hugging_Face-model-FFD21E?logo=huggingface&logoColor=FFD21E)](https://huggingface.co/twangodev/spine-codec)
[![License](https://img.shields.io/github/license/twangodev/spine-codec)](https://github.com/twangodev/spine-codec/blob/main/LICENSE)

A neural audio codec for expressive speech.

![Architecture](https://raw.githubusercontent.com/twangodev/spine-codec/main/.github/assets/architecture.png)

Spine encodes 24 kHz mono audio into multi-scale [FSQ](https://arxiv.org/abs/2309.15505) tokens at 1.57 kbps (~88 tokens/s) across four temporal scales (~6 / 12 / 23 / 47 Hz), keeping sequences short for downstream language models. The convolutional decoder is hard-bandlimited at 6 kHz by a fixed crossover; a filtered-noise branch and a complex-STFT head synthesize the high band under purely adversarial supervision, eliminating the high-frequency static typical of GAN codecs.
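
The rates quoted above can be sanity-checked with a little arithmetic. The sketch below assumes, and this is not stated in the README, that the finest scale emits one token every 512 samples (46.875 Hz) and that each coarser scale halves that rate, which is consistent with the rounded ~6 / 12 / 23 / 47 Hz figures. The bits-per-token value is only what the quoted 1.57 kbps implies under that assumption.

```python
SAMPLE_RATE = 24_000  # Hz, from the README
BITRATE_BPS = 1_570   # 1.57 kbps, from the README
FINEST_HOP = 512      # assumption: samples per token at the finest scale
NUM_SCALES = 4        # four temporal scales, from the README

# Assumption: each coarser scale halves the frame rate of the next finer one.
rates_hz = [SAMPLE_RATE / (FINEST_HOP * 2**k) for k in reversed(range(NUM_SCALES))]
tokens_per_second = sum(rates_hz)
bits_per_token = BITRATE_BPS / tokens_per_second

print("per-scale rates (Hz):", rates_hz)
print("tokens per second:", tokens_per_second)
print("implied bits per token:", bits_per_token)
```

Under these assumptions the four scales sum to roughly 87.9 tokens/s, matching the ~88 tokens/s figure, and each token carries a little under 18 bits.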

- 115M-parameter generator: conv encoder/decoder with a 512-d transformer bottleneck (8 + 12 layers)
- Multi-scale FSQ (`pool → quantize → repeat`) on a shared latent, with no codebook collapse
- Reconstruction losses bandlimited below the crossover; the high band is owned by the DDSP split
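
To make the `pool → quantize → repeat` idea concrete, here is a framework-free sketch of one scale acting on a single scalar channel. It is an illustration under assumptions, not the library's code: the helper names, the average pooling, the `tanh` bounding, and the choice of 5 levels are all hypothetical, and real FSQ quantizes several channels at once so that the tuple of per-channel integers forms one token. How Spine combines the scales is not shown here.

```python
import math


def fsq_quantize(x: float, levels: int = 5) -> int:
    """Bound x with tanh, then round to one of `levels` integers (odd `levels`)."""
    half = (levels - 1) / 2
    return round(math.tanh(x) * half)  # integer in [-half, half]


def pool_quantize_repeat(
    latent: list[float], stride: int, levels: int = 5
) -> tuple[list[int], list[float]]:
    """Return (codes at the coarse rate, dequantized values at the input rate)."""
    half = (levels - 1) / 2
    # pool: average non-overlapping windows of up to `stride` frames
    windows = [latent[i:i + stride] for i in range(0, len(latent), stride)]
    pooled = [sum(w) / len(w) for w in windows]
    # quantize: one integer code per pooled frame
    codes = [fsq_quantize(p, levels) for p in pooled]
    # repeat: hold each dequantized value for `stride` frames, trimmed to input length
    upsampled = [c / half for c in codes for _ in range(stride)][: len(latent)]
    return codes, upsampled


latent = [0.1, 0.3, -0.2, -0.4, 2.0, 2.2, -3.0, -2.6]
codes, upsampled = pool_quantize_repeat(latent, stride=2)
print(codes)      # [0, -1, 2, -2]
print(upsampled)  # one value per input frame, constant within each window
```

A larger `stride` gives fewer codes per second at that scale, which is what keeps the coarse scales cheap in tokens.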

## Installation

```bash
pip install spine-codec
```

Training needs the `train` extra, which adds wandb:

```bash
pip install "spine-codec[train]"
```

For development, clone this repo and run `uv sync`.

## Usage

The pretrained model is downloaded from [twangodev/spine-codec](https://huggingface.co/twangodev/spine-codec) on first use; pass `--checkpoint` to use a local training checkpoint instead.

```bash
spine encode --input speech.wav --output codes.pt
spine decode --input codes.pt --output speech.wav
spine recon  --input speech.wav --output roundtrip.wav
```

```python
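import torchaudio
from spine import Spine

model = Spine.from_pretrained("twangodev/spine-codec")
audio, sr = torchaudio.load("speech.wav")  # (channels, samples)
audio = audio.mean(dim=0, keepdim=True)  # downmix to mono
audio = torchaudio.functional.resample(audio, sr, 24_000)  # codec operates on 24 kHz audio
codes = model.encode(audio.unsqueeze(0))  # add a batch dimension
reconstruction = model.decode(codes)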
```

## Training

```bash
spine train --config configs/train.yaml
```

Training configs live in the repo (not the wheel), so train from a git checkout
with the `train` extra installed.

YAML configs are sparse overrides on top of the defaults in `spine/config.py`.
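
As an illustration of what "sparse overrides" means, the sketch below deep-merges a small override mapping onto a defaults mapping. It uses plain dicts and hypothetical keys (`optimizer`, `lr`, `betas`, `batch_size`) that are not taken from `spine/config.py`, and the helper `apply_overrides` is likewise an assumption; the actual merge logic in the library may differ.

```python
from copy import deepcopy


def apply_overrides(defaults: dict, overrides: dict) -> dict:
    """Return a copy of `defaults` with `overrides` merged in, recursing into nested dicts."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overrides(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical keys, for illustration only.
defaults = {"optimizer": {"lr": 1e-4, "betas": [0.8, 0.99]}, "batch_size": 16}
overrides = {"optimizer": {"lr": 3e-4}}  # what a sparse YAML file might load to

config = apply_overrides(defaults, overrides)
print(config)
```

Only the overridden leaf changes; every key the override file omits keeps its default value.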

## Acknowledgements

The architecture builds on [Mimi](https://arxiv.org/abs/2410.00037), [SNAC](https://arxiv.org/abs/2410.14411), [DAC](https://arxiv.org/abs/2306.06546), [FSQ](https://arxiv.org/abs/2309.15505), and [DDSP](https://arxiv.org/abs/2001.04643).

