Metadata-Version: 2.4
Name: pipecat-xtts-vllm
Version: 0.1.0
Summary: Pipecat community TTS integration for the XTTSv2-vLLM streaming server
Author: wuxuedaifu
License: MIT
Project-URL: Homepage, https://github.com/wuxuedaifu/pipecat-xtts-vllm
Project-URL: Server, https://github.com/wuxuedaifu/xttsv2-vllm-streaming-server
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pipecat-ai<2,>=1.4
Requires-Dist: aiohttp>=3.11.12
Provides-Extra: dev
Requires-Dist: pytest>=8; extra == "dev"
Requires-Dist: pytest-asyncio>=0.23; extra == "dev"
Requires-Dist: aioresponses>=0.7; extra == "dev"
Requires-Dist: aiohttp<3.12,>=3.11.12; extra == "dev"
Dynamic: license-file

# pipecat-xtts-vllm

A [Pipecat](https://github.com/pipecat-ai/pipecat) community TTS integration that streams
synthesized speech from an [XTTSv2-vLLM streaming server](https://github.com/wuxuedaifu/xttsv2-vllm-streaming-server).

`XTTSVLLMTTSService` is a drop-in Pipecat `TTSService` that:

- Clones a voice from a reference audio clip.
- Computes XTTSv2 conditioning once (via `POST /v1/tts/conditioning`) and caches it for the
  lifetime of the service, so individual requests pay no conditioning overhead.
- Streams raw PCM audio chunks (via `POST /v1/audio/speech`) directly into the Pipecat pipeline.

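The compute-once behaviour described above can be pictured as a small async cache. This is a minimal sketch, not the package's implementation: `ConditioningCache` and `fetch` are hypothetical names, and `fetch` stands in for whatever coroutine performs the `POST /v1/tts/conditioning` request.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class ConditioningCache:
    """Hypothetical helper: run an expensive async fetch once and reuse the result."""

    def __init__(self, fetch: Callable[[], Awaitable[dict]]) -> None:
        # `fetch` is an assumption: any coroutine function returning the conditioning data.
        self._fetch = fetch
        self._value: Optional[dict] = None
        self._lock = asyncio.Lock()

    async def get(self) -> dict:
        # Fast path: the conditioning has already been computed.
        if self._value is None:
            async with self._lock:
                # Re-check inside the lock so concurrent callers trigger only one fetch.
                if self._value is None:
                    self._value = await self._fetch()
        return self._value
```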
---

## Installation

> **Note:** `pipecat-xtts-vllm` is not yet published on PyPI. Until it is, install from a
> checkout of the source repository:
>
> ```bash
> pip install -e .
> ```
>
> Once published:
>
> ```bash
> pip install pipecat-xtts-vllm
> ```

### Start the XTTSv2-vLLM server

This package is only the client. Synthesis runs on a separate server, distributed as a Docker image from
[wuxuedaifu/xttsv2-vllm-streaming-server](https://github.com/wuxuedaifu/xttsv2-vllm-streaming-server).
Follow its README to pull and run the image, for example:

```bash
docker run --gpus all -p 8000:8000 ghcr.io/wuxuedaifu/xttsv2-vllm-streaming-server:latest
```

---

## Usage with a Pipeline

The snippet below shows the essential setup. See
[`examples/foundational/xtts_vllm_say_one_thing.py`](examples/foundational/xtts_vllm_say_one_thing.py)
for the full, runnable version.

```python
import asyncio
from pathlib import Path

from pipecat.frames.frames import EndFrame, TTSSpeakFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.worker import PipelineParams, PipelineWorker
from pipecat.workers.runner import WorkerRunner

from pipecat_xtts_vllm import XTTSVLLMTTSService

async def main():
    reference_audio = Path("reference.wav").read_bytes()

    tts = XTTSVLLMTTSService(
        base_url="http://localhost:8000",
        reference_audio=reference_audio,
        language="en",
        sample_rate=24000,
    )

    # `...` is a placeholder: replace it with your downstream processors
    # (for example, an audio output) before running.
    pipeline = Pipeline([tts, ...])

    worker = PipelineWorker(
        pipeline,
        params=PipelineParams(audio_out_sample_rate=24000),
        idle_timeout_secs=None,
    )

    runner = WorkerRunner()
    await runner.add_workers(worker)

    async def say():
        await worker.queue_frames([
            TTSSpeakFrame("Hello from the XTTSv2 vLLM streaming server."),
            EndFrame(),
        ])

    await asyncio.gather(runner.run(), say())

asyncio.run(main())
```

Alternatively, pass a precomputed `XTTSVLLMConditioning` object instead of `reference_audio` if
you have cached the conditioning data externally:

```python
from pipecat_xtts_vllm import XTTSVLLMConditioning, XTTSVLLMTTSService

conditioning = XTTSVLLMConditioning(
    gpt_cond_latent_b64="<base64-encoded latent>",
    speaker_embeddings_b64="<base64-encoded embeddings>",
)

tts = XTTSVLLMTTSService(
    base_url="http://localhost:8000",
    conditioning=conditioning,
)
```
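
One way to cache the conditioning data externally is to keep the two base64 strings in a JSON file. The sketch below is an assumption-laden illustration: `CachedConditioning`, `save_conditioning`, and `load_conditioning` are hypothetical names, and the dataclass is a stand-in that mirrors the two fields shown above so the block runs without the package installed.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class CachedConditioning:
    """Hypothetical stand-in mirroring the two XTTSVLLMConditioning fields."""

    gpt_cond_latent_b64: str
    speaker_embeddings_b64: str


def save_conditioning(cond: CachedConditioning, path: Path) -> None:
    # The fields are already base64 text, so plain JSON is enough.
    path.write_text(json.dumps(asdict(cond)), encoding="utf-8")


def load_conditioning(path: Path) -> CachedConditioning:
    # Raises TypeError if the file has missing or unexpected keys.
    return CachedConditioning(**json.loads(path.read_text(encoding="utf-8")))
```

The loaded values could then be passed to `XTTSVLLMConditioning(...)` as in the snippet above.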

---

## Running the Example

Set the required environment variables, then run the script:

```bash
export XTTS_VLLM_BASE_URL=http://localhost:8000
export XTTS_VLLM_REFERENCE_AUDIO=/path/to/reference.wav

python examples/foundational/xtts_vllm_say_one_thing.py
```

The script synthesizes one sentence and writes the output to `output.wav` in the current
directory.

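For readers who want to do something similar in their own code, raw PCM chunks can be wrapped in a WAV container with the standard-library `wave` module. This is a sketch rather than the example script's code: `write_pcm_to_wav` is a hypothetical helper, and it assumes 16-bit signed mono PCM, which this README does not state.

```python
import io
import wave
from typing import BinaryIO, Iterable, Union


def write_pcm_to_wav(
    chunks: Iterable[bytes],
    dest: Union[str, BinaryIO],
    sample_rate: int = 24000,
) -> None:
    """Write raw PCM chunks to a WAV file or a seekable file-like object."""
    with wave.open(dest, "wb") as wav:
        wav.setnchannels(1)   # assumption: mono
        wav.setsampwidth(2)   # assumption: 16-bit samples (2 bytes each)
        wav.setframerate(sample_rate)
        for chunk in chunks:
            # The WAV header is patched with the final length when the writer closes.
            wav.writeframes(chunk)


# Usage with an in-memory buffer; pass "output.wav" instead to write to disk.
buffer = io.BytesIO()
write_pcm_to_wav([b"\x00\x00" * 240], buffer)
```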
---

## Configuration

`XTTSVLLMTTSService` accepts keyword-only arguments. At least one of `reference_audio` or
`conditioning` must be given; if both are provided, `conditioning` takes precedence.

| Parameter | Default | Description |
|---|---|---|
| `base_url` | _(required)_ | Base URL of the XTTSv2-vLLM streaming server (e.g. `http://localhost:8000`). |
| `reference_audio` | `None` | Raw bytes of a reference WAV clip (~6 s) used for voice cloning. Required unless `conditioning` is given. |
| `conditioning` | `None` | Optional precomputed `XTTSVLLMConditioning` (skips the `/v1/tts/conditioning` call); if set, it takes precedence over `reference_audio`. |
| `language` | `"en"` | BCP-47 language code passed to the server. |
| `chunk_size` | `20` | Number of tokens per streaming chunk. |
| `speed` | `1.0` | Speech rate multiplier. |
| `sample_rate` | `24000` | PCM sample rate in Hz (should match the server output). |
| `aiohttp_session` | `None` | External `aiohttp.ClientSession` to reuse. If `None`, a session is created and closed by the service. |

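The precedence rule for `reference_audio` and `conditioning` can be sketched as a small function. `resolve_voice_source` is a hypothetical name used only for illustration, and the `ValueError` is an assumption: this README does not say which exception the service raises when neither argument is given.

```python
from typing import Optional


def resolve_voice_source(
    reference_audio: Optional[bytes],
    conditioning: Optional[object],
) -> str:
    """Return which input would be used, following the rule described above."""
    if conditioning is not None:
        # Precomputed conditioning wins, so no conditioning request is needed.
        return "conditioning"
    if reference_audio is not None:
        # Otherwise the reference clip is used to compute conditioning.
        return "reference_audio"
    # Assumption: the real service may signal this case differently.
    raise ValueError("Provide at least one of reference_audio or conditioning")
```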
---

## Compatibility

Supports Python 3.11+; last tested with **pipecat-ai v1.4.0** on Python 3.12.

---

## License

**Integration code:** MIT. See [LICENSE](LICENSE).

**XTTSv2 model weights:** distributed under the
[Coqui Public Model License](https://coqui.ai/cpml) (non-commercial use only). See the
[server repository](https://github.com/wuxuedaifu/xttsv2-vllm-streaming-server) for details.

---

## Attribution

Developed by [wuxuedaifu](https://github.com/wuxuedaifu).
