Metadata-Version: 2.4
Name: pipecat-xtts-vllm
Version: 0.1.1
Summary: Pipecat community TTS integration for the XTTSv2-vLLM streaming server
Author: wuxuedaifu
License: MIT
Project-URL: Homepage, https://github.com/wuxuedaifu/pipecat-xtts-vllm
Project-URL: Server, https://github.com/wuxuedaifu/xttsv2-vllm-streaming-server
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pipecat-ai<2,>=1.4
Requires-Dist: aiohttp>=3.11.12
Provides-Extra: dev
Requires-Dist: pytest>=8; extra == "dev"
Requires-Dist: pytest-asyncio>=0.23; extra == "dev"
Requires-Dist: aioresponses>=0.7; extra == "dev"
Requires-Dist: aiohttp<3.12,>=3.11.12; extra == "dev"
Dynamic: license-file

# pipecat-xtts-vllm

A [Pipecat](https://github.com/pipecat-ai/pipecat) community TTS integration that streams
synthesized speech from an [XTTSv2-vLLM streaming server](https://github.com/wuxuedaifu/xttsv2-vllm-streaming-server).

`XTTSVLLMTTSService` is a drop-in Pipecat `TTSService` that:

- Clones a voice from a reference audio clip.
- Computes XTTSv2 conditioning once (via `POST /v1/tts/conditioning`) and caches it for the
  service lifetime, so individual synthesis requests carry no conditioning overhead.
- Streams raw PCM audio chunks (via `POST /v1/audio/speech`) directly into the Pipecat pipeline.

---

## Installation

```bash
pip install pipecat-xtts-vllm
```

To work on the package from source instead:

```bash
pip install -e .
```

### Start the XTTSv2-vLLM server

This package is only a lightweight client. Synthesis runs in the separate Docker-based server at
[wuxuedaifu/xttsv2-vllm-streaming-server](https://github.com/wuxuedaifu/xttsv2-vllm-streaming-server).
Follow its README to pull and run the image, for example:

```bash
docker run --gpus all -p 8000:8000 ghcr.io/wuxuedaifu/xttsv2-vllm-streaming-server:latest
```

---

## Usage with a Pipeline

The snippet below shows the essential setup. See
[`examples/foundational/xtts_vllm_say_one_thing.py`](examples/foundational/xtts_vllm_say_one_thing.py)
for the full, runnable version.

```python
import asyncio
from pathlib import Path

from pipecat.frames.frames import EndFrame, TTSSpeakFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.worker import PipelineParams, PipelineWorker
from pipecat.workers.runner import WorkerRunner

from pipecat_xtts_vllm import XTTSVLLMTTSService

async def main():
    reference_audio = Path("reference.wav").read_bytes()

    tts = XTTSVLLMTTSService(
        base_url="http://localhost:8000",
        reference_audio=reference_audio,
        language="en",
        sample_rate=24000,
    )

    # Add tts to a Pipeline; replace `...` with your downstream processors.
    pipeline = Pipeline([tts, ...])

    worker = PipelineWorker(
        pipeline,
        params=PipelineParams(audio_out_sample_rate=24000),
        idle_timeout_secs=None,
    )

    runner = WorkerRunner()
    await runner.add_workers(worker)

    async def say():
        await worker.queue_frames([
            TTSSpeakFrame("Hello from the XTTSv2 vLLM streaming server."),
            EndFrame(),
        ])

    await asyncio.gather(runner.run(), say())

asyncio.run(main())
```

Alternatively, pass a precomputed `XTTSVLLMConditioning` object instead of `reference_audio` if
you have cached the conditioning data externally:

```python
from pipecat_xtts_vllm import XTTSVLLMConditioning, XTTSVLLMTTSService

conditioning = XTTSVLLMConditioning(
    gpt_cond_latent_b64="<base64-encoded latent>",
    speaker_embeddings_b64="<base64-encoded embeddings>",
)

tts = XTTSVLLMTTSService(
    base_url="http://localhost:8000",
    conditioning=conditioning,
)
```
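If you want to keep the conditioning data between runs, one option is to store the two base64 strings in a small JSON file. The sketch below is only an illustration: `save_conditioning` and `load_conditioning` are hypothetical helpers, not part of this package, and the sketch does not cover how you obtain the strings in the first place. The field names are the keyword arguments shown in the snippet above.

```python
import json
from pathlib import Path

# Field names taken from the XTTSVLLMConditioning constructor shown above.
_FIELDS = ("gpt_cond_latent_b64", "speaker_embeddings_b64")


def save_conditioning(path: Path, gpt_cond_latent_b64: str, speaker_embeddings_b64: str) -> None:
    """Write the two base64 strings to a JSON file (hypothetical helper)."""
    payload = {
        "gpt_cond_latent_b64": gpt_cond_latent_b64,
        "speaker_embeddings_b64": speaker_embeddings_b64,
    }
    path.write_text(json.dumps(payload), encoding="utf-8")


def load_conditioning(path: Path) -> dict[str, str]:
    """Read the JSON file back, failing loudly if a field is missing."""
    payload = json.loads(path.read_text(encoding="utf-8"))
    missing = [name for name in _FIELDS if name not in payload]
    if missing:
        raise ValueError(f"conditioning file is missing fields: {missing}")
    # Return only the expected fields so the dict can be unpacked as kwargs.
    return {name: payload[name] for name in _FIELDS}
```

The returned dict can then be unpacked into the constructor, for example `XTTSVLLMConditioning(**load_conditioning(Path("voice.json")))`.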

---

## Running the Example

Set the required environment variables, then run the script:

```bash
export XTTS_VLLM_BASE_URL=http://localhost:8000
export XTTS_VLLM_REFERENCE_AUDIO=/path/to/reference.wav

python examples/foundational/xtts_vllm_say_one_thing.py
```

The script synthesizes one sentence and writes the output to `output.wav` in the current
directory.
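
For readers who want to save streamed audio themselves, the following is a minimal sketch of writing raw PCM chunks to a WAV file with the standard-library `wave` module. It is not the example script's actual code. `write_pcm_to_wav` is a hypothetical helper, and the sketch assumes 16-bit signed mono samples, which you should verify against the server's output format.

```python
import wave
from collections.abc import Iterable


def write_pcm_to_wav(path: str, chunks: Iterable[bytes], sample_rate: int = 24000) -> int:
    """Write raw PCM chunks to a WAV file and return the number of frames written.

    Assumption: samples are 16-bit signed, mono. Adjust the channel count and
    sample width if the server emits a different format.
    """
    total_bytes = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)            # mono (assumed)
        wav.setsampwidth(2)            # 2 bytes per sample = 16-bit (assumed)
        wav.setframerate(sample_rate)  # should match the service's sample_rate
        for chunk in chunks:
            wav.writeframes(chunk)
            total_bytes += len(chunk)
    # One mono 16-bit frame occupies two bytes.
    return total_bytes // 2
```

Dividing the returned frame count by `sample_rate` gives the clip duration in seconds.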

---

## Configuration

`XTTSVLLMTTSService` accepts keyword-only arguments. At least one of `reference_audio` or
`conditioning` must be given; if both are provided, `conditioning` takes precedence.

| Parameter | Default | Description |
|---|---|---|
| `base_url` | _(required)_ | Base URL of the XTTSv2-vLLM streaming server (e.g. `http://localhost:8000`). |
| `reference_audio` | `None` | Raw bytes of a reference WAV clip (~6 s) used for voice cloning. Required unless `conditioning` is given. |
| `conditioning` | `None` | Optional precomputed `XTTSVLLMConditioning` (skips the `/v1/tts/conditioning` call); if set, it takes precedence over `reference_audio`. |
| `language` | `"en"` | Language code passed to the server (see [Supported languages](#supported-languages)). |
| `chunk_size` | `20` | Number of tokens per streaming chunk. |
| `speed` | `1.0` | Speech rate multiplier. |
| `sample_rate` | `24000` | PCM sample rate in Hz (should match the server output). |
| `aiohttp_session` | `None` | External `aiohttp.ClientSession` to reuse. If `None`, a session is created and closed by the service. |

---

## Supported languages

XTTSv2 supports 17 languages. Pass the matching code as the `language` argument:

| Code | Language | Code | Language |
|---|---|---|---|
| `en` | English | `nl` | Dutch |
| `es` | Spanish | `cs` | Czech |
| `fr` | French | `ar` | Arabic |
| `de` | German | `zh-cn` | Chinese (Simplified) |
| `it` | Italian | `hu` | Hungarian |
| `pt` | Portuguese | `ko` | Korean |
| `pl` | Polish | `ja` | Japanese |
| `tr` | Turkish | `hi` | Hindi |
| `ru` | Russian | | |

Pass `auto` to let the server auto-detect the language.
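
If the language code comes from user input, it can help to check it before constructing the service. The sketch below is a hypothetical client-side helper (`normalize_language` is not part of this package). It uses the codes from the table above and lowercases the input before checking; whether the server itself accepts other casings is not something this sketch assumes.

```python
# Codes copied from the table above.
SUPPORTED_LANGUAGES = frozenset({
    "en", "es", "fr", "de", "it", "pt", "pl", "tr", "ru",
    "nl", "cs", "ar", "zh-cn", "hu", "ko", "ja", "hi",
})


def normalize_language(code: str) -> str:
    """Return a lowercase language code, or raise ValueError if it is not listed."""
    normalized = code.strip().lower()
    # "auto" asks the server to detect the language.
    if normalized == "auto" or normalized in SUPPORTED_LANGUAGES:
        return normalized
    raise ValueError(f"unsupported language code: {code!r}")
```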

---

## Compatibility

Supports Python 3.11+; last tested with **pipecat-ai v1.4.0** on Python 3.12.

---

## License

**Integration code:** MIT — see [LICENSE](LICENSE).

**XTTSv2 model weights:** distributed under the
[Coqui Public Model License](https://coqui.ai/cpml) (non-commercial use only). See the
[server repository](https://github.com/wuxuedaifu/xttsv2-vllm-streaming-server) for details.

---

## Attribution

Developed by [wuxuedaifu](https://github.com/wuxuedaifu).
