Metadata-Version: 2.4
Name: kani-tts
Version: 0.0.1
Summary: Text-to-speech using neural audio codec and causal language models
Author-email: simonlob <simonlobgromov@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/nineninesix-ai/kani-tts
Project-URL: Repository, https://github.com/nineninesix-ai/kani-tts
Project-URL: Documentation, https://github.com/nineninesix-ai/kani-tts/blob/main/README.md
Keywords: tts,text-to-speech,audio,nemo,transformers
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Multimedia :: Sound/Audio :: Speech
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.0.0
Requires-Dist: transformers>=4.30.0
Requires-Dist: nemo-toolkit[all]>=1.18.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: scipy>=1.10.0
Requires-Dist: librosa>=0.10.0
Requires-Dist: omegaconf>=2.3.0
Requires-Dist: datasets>=2.12.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: isort>=5.12.0; extra == "dev"
Requires-Dist: flake8>=6.0.0; extra == "dev"
Provides-Extra: audio
Requires-Dist: soundfile>=0.12.0; extra == "audio"
Dynamic: license-file

# Kani-TTS

A simple and efficient text-to-speech library using neural audio codecs and causal language models.

## Features

- Simple, intuitive API
- Built on Hugging Face Transformers and NVIDIA NeMo
- High-quality audio generation using neural codecs
- GPU acceleration support

## Installation

### From PyPI (once published)

```bash
pip install kani-tts
```

### From source

```bash
git clone https://github.com/nineninesix-ai/kani-tts.git
cd kani-tts
pip install -e .
```

### Optional dependencies

For saving audio files:

```bash
pip install "kani-tts[audio]"
```

For development:

```bash
pip install "kani-tts[dev]"
```
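
Because `soundfile` is only installed with the `audio` extra, a script may want to check for it before trying to save audio. The following is a minimal sketch using only the standard library; `has_module` is a hypothetical helper name, not part of Kani-TTS.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(name) is not None


# Warn early instead of failing later when saving audio.
if not has_module("soundfile"):
    print('soundfile is not installed; run: pip install "kani-tts[audio]"')
```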

## Quick Start

```python
from kani_tts import KaniTTS

# Initialize model (replace with your model name)
model = KaniTTS('your-model-name-here')

# Generate audio from text
audio, text = model("Hello, world!")

# Save to file (requires soundfile; install the "audio" extra)
model.save_audio(audio, "output.wav")
```

## Advanced Usage

### Custom Configuration

```python
from kani_tts import KaniTTS

model = KaniTTS(
    'your-model-name',
    temperature=0.7,           # Control randomness (default: 0.6)
    top_p=0.9,                 # Nucleus sampling (default: 0.95)
    max_new_tokens=2000,       # Max audio length (default: 1800)
    repetition_penalty=1.2,    # Prevent repetition (default: 1.1)
)

audio, text = model("Your text here")
```
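
To build intuition for what `temperature` and `top_p` control, here is a generic, standard-library sketch of temperature scaling followed by nucleus (top-p) filtering over a list of logits. It is an illustration of the general technique only, not the code Kani-TTS or Transformers runs; `sample_top_p` is a hypothetical name, and its defaults simply mirror the defaults listed in the comments above.

```python
import math
import random


def sample_top_p(logits, temperature=0.6, top_p=0.95, rng=random):
    """Sample an index from `logits` with temperature scaling and top-p filtering."""
    # Temperature scaling: values below 1.0 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of most-probable tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Sample among the kept tokens; random.choices renormalizes the weights.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

Lower `temperature` and lower `top_p` both shrink the set of tokens that can realistically be chosen, which tends to make output more deterministic.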

### Working with Audio Output

The generated audio is a NumPy array sampled at 22.05 kHz (22,050 samples per second):

```python
import numpy as np
import soundfile as sf

audio, text = model("Generate speech from this text")

# Audio is a numpy array
print(audio.shape)  # (num_samples,)
print(audio.dtype)  # float32/float64

# Save using soundfile
sf.write('output.wav', audio, 22050)

# Or use the built-in method
model.save_audio(audio, 'output.wav', sample_rate=22050)
```
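
If `soundfile` is not available, the same array can be written with Python's built-in `wave` module. This is a minimal sketch under two assumptions that this README does not confirm: that the audio is mono, and that sample values lie in the range [-1.0, 1.0]. `write_wav_pcm16` is a hypothetical helper, not part of Kani-TTS.

```python
import struct
import wave


def write_wav_pcm16(path, samples, sample_rate=22050):
    """Write mono float samples (assumed to lie in [-1.0, 1.0]) as 16-bit PCM."""
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))  # clip so the int16 conversion cannot overflow
        frames += struct.pack("<h", int(round(s * 32767)))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
```

Calling `write_wav_pcm16("output.wav", audio)` would then produce a standard 16-bit WAV file; for long clips, `soundfile` is likely to be much faster than this per-sample loop.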

### Batch Processing

```python
texts = [
    "First sentence to synthesize.",
    "Second sentence to synthesize.",
    "Third sentence to synthesize."
]

for i, text in enumerate(texts):
    audio, _ = model(text)
    model.save_audio(audio, f"output_{i}.wav")
```
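
Since `max_new_tokens` caps the length of each generation, a long passage may be easier to handle as a list of sentences fed through the loop above. The sketch below is a rough, standard-library heuristic for producing that list; `split_sentences` is a hypothetical helper, and it will mis-split abbreviations such as "e.g. this".

```python
import re


def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]  # drop empty strings (e.g. for empty input)
```

The result can be passed directly as `texts` in the batch loop.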

## Architecture

Kani-TTS uses a two-stage architecture:

1. **Text → Audio Tokens**: A causal language model generates audio token sequences from text
2. **Audio Tokens → Waveform**: NVIDIA NeMo's nano codec decodes tokens into audio waveforms

The system uses special tokens to mark different segments:
- Text boundaries (start/end of text)
- Speech boundaries (start/end of speech)
- Speaker turns (human/AI)

Audio tokens are organized into 4 codebook channels, each representing a different aspect of the audio signal.
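
As an illustration of the 4-codebook organization, the sketch below groups a flat token sequence into frames of four tokens. The frame-by-frame interleaved ordering is an assumption made for this example; the README does not specify how tokens are actually laid out, and `group_into_frames` is a hypothetical helper.

```python
def group_into_frames(tokens, num_codebooks=4):
    """Group a flat token list into frames of `num_codebooks` tokens each."""
    # Drop an incomplete trailing frame, if any (assumed layout: frame-major).
    usable = len(tokens) - len(tokens) % num_codebooks
    return [tokens[i:i + num_codebooks] for i in range(0, usable, num_codebooks)]
```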

## Requirements

- Python 3.10 or higher
- CUDA-capable GPU (recommended) or CPU
- PyTorch 2.0 or higher
- Transformers library
- NeMo Toolkit

## Model Compatibility

This library works with causal language models trained for TTS with the following characteristics:
- Extended vocabulary including audio tokens
- Special tokens for speech/text boundaries
- Compatible with NeMo nano codec (22 kHz, 0.6 kbps, 12.5 fps)
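
The codec's frame rate gives a rough way to relate `max_new_tokens` to audio duration. The sketch below assumes the language model emits one token per codebook per frame (4 tokens per frame at 12.5 frames per second) and ignores special tokens, so treat the result as an upper bound rather than an exact figure. `max_audio_seconds` is a hypothetical helper.

```python
FRAMES_PER_SECOND = 12.5  # codec frame rate listed above
NUM_CODEBOOKS = 4         # codebook channels described under Architecture


def max_audio_seconds(max_new_tokens,
                      frames_per_second=FRAMES_PER_SECOND,
                      num_codebooks=NUM_CODEBOOKS):
    """Upper bound on duration if every generated token is an audio token."""
    tokens_per_second = frames_per_second * num_codebooks  # 12.5 * 4 = 50
    return max_new_tokens / tokens_per_second


print(max_audio_seconds(1800))  # 36.0 (seconds, for the default budget)
```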

## License

MIT License. See the LICENSE file for details.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Citation

If you use Kani-TTS in your research, please cite:

```bibtex
@software{kani_tts,
  title = {Kani-TTS: Text-to-Speech using Neural Audio Codec},
  author = {Your Name},
  year = {2024},
  url = {https://github.com/nineninesix-ai/kani-tts}
}
```

## Acknowledgments

- Built on [Hugging Face Transformers](https://github.com/huggingface/transformers)
- Uses [NVIDIA NeMo](https://github.com/NVIDIA/NeMo) audio codec
- Powered by PyTorch
