Metadata-Version: 2.4
Name: voxarena
Version: 0.1.6
Summary: An evaluation arena for realtime voice agents.
Author: VoxArena contributors
License: MIT License
        
        Copyright (c) 2026 Simkeyur
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Homepage, https://github.com/simkeyur/vox-arena
Project-URL: Issues, https://github.com/simkeyur/vox-arena/issues
Keywords: voice-agents,realtime-llm,evaluation,benchmarking,gemini-live,openai-realtime,pipecat
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: fastapi>=0.110.0
Requires-Dist: uvicorn>=0.28.0
Requires-Dist: pydantic>=2.6.0
Requires-Dist: pydantic-settings>=2.2.0
Requires-Dist: pipecat-ai[google,openai]>=0.5.0
Requires-Dist: python-dotenv>=1.0.1
Requires-Dist: pyyaml>=6.0.1
Requires-Dist: loguru>=0.7.2
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"
Dynamic: license-file

<p align="center">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/simkeyur/vox-arena/main/ui/src/assets/logo-dark.png" />
    <img src="https://raw.githubusercontent.com/simkeyur/vox-arena/main/ui/src/assets/logo.png" alt="VoxArena" width="220" />
  </picture>
</p>

<p align="center"><em>An evaluation arena for realtime voice agents.</em></p>

<p align="center">

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
[![Python 3.11+](https://img.shields.io/badge/python-3.11%2B-blue.svg)](https://www.python.org/downloads/)
[![Built with Pipecat](https://img.shields.io/badge/built%20with-pipecat-9cf.svg)](https://github.com/pipecat-ai/pipecat)
[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](#contributing)

</p>

VoxArena is a reproducible benchmarking harness for realtime voice agents. Run the same scripted conversation across Gemini Live, OpenAI Realtime, and other [Pipecat](https://github.com/pipecat-ai/pipecat)-supported providers, then compare them like for like on latency, tool-call accuracy, and hallucinations.

Drop it into your CI pipeline, your dev loop, or the bundled control panel.

---

## 🚀 CI & Pipeline Integration

VoxArena ships a `voxarena` CLI designed for headless use in your build pipeline. It returns a non-zero exit code when metrics fall below thresholds you define, and emits JUnit XML for native CI reporting.

```bash
pip install voxarena

voxarena run \
  --provider gemini \
  --script ./script/utterances.yaml \
  --min-tool-accuracy 0.9 \
  --max-hallucinations 0 \
  --max-avg-ttfa-ms 1500 \
  --output result.json \
  --junit voxarena.xml
# exit 0 if every threshold passes, 1 otherwise
```
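
The gating behaviour described above can be sketched in a few lines of Python. This is a minimal illustration, not VoxArena's actual implementation: the `Thresholds` class, the `check_thresholds` and `exit_code` helpers, and the metric keys (`tool_accuracy`, `hallucinations`, `avg_ttfa_ms`) are all assumed names chosen to mirror the CLI flags.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Thresholds:
    """Hypothetical container mirroring the CLI threshold flags."""
    min_tool_accuracy: Optional[float] = None
    max_hallucinations: Optional[int] = None
    max_avg_ttfa_ms: Optional[float] = None


def check_thresholds(metrics: dict, t: Thresholds) -> list:
    """Return one message per violated threshold (empty list means pass)."""
    failures = []
    # A threshold left as None is treated as "not configured" and skipped.
    if t.min_tool_accuracy is not None and metrics["tool_accuracy"] < t.min_tool_accuracy:
        failures.append(
            f"tool accuracy {metrics['tool_accuracy']:.2f} < {t.min_tool_accuracy:.2f}"
        )
    if t.max_hallucinations is not None and metrics["hallucinations"] > t.max_hallucinations:
        failures.append(
            f"hallucinations {metrics['hallucinations']} > {t.max_hallucinations}"
        )
    if t.max_avg_ttfa_ms is not None and metrics["avg_ttfa_ms"] > t.max_avg_ttfa_ms:
        failures.append(
            f"avg TTFA {metrics['avg_ttfa_ms']:.0f} ms > {t.max_avg_ttfa_ms:.0f} ms"
        )
    return failures


def exit_code(metrics: dict, t: Thresholds) -> int:
    """Map threshold results to a process exit code: 0 on pass, 1 on any failure."""
    return 1 if check_thresholds(metrics, t) else 0
```

Keeping the check as a pure function that returns messages makes it easy to reuse the same result for both the exit code and a human-readable report.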

### Compare two providers in one shot

```bash
voxarena compare \
  --gemini-model gemini-3.1-flash-live-preview \
  --openai-model gpt-realtime-2 \
  --num-turns 5 \
  --min-tool-accuracy 0.9 \
  --output compare.json
```

### GitHub Actions

```yaml
- name: Voice agent regression check
  env:
    GOOGLE_API_KEY: ${{ secrets.GOOGLE_API_KEY }}
  run: |
    pip install voxarena
    voxarena run --provider gemini \
      --min-tool-accuracy 0.92 --max-hallucinations 0 \
      --junit voxarena.xml --quiet

- uses: mikepenz/action-junit-report@v4
  if: always()
  with:
    report_paths: voxarena.xml
```
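
For a sense of what a JUnit report of threshold checks could contain, here is a small standard-library sketch. The element layout (`testsuite`, `testcase`, `failure`) follows the common JUnit XML convention that most CI report actions accept; whether `--junit` writes exactly this structure is an assumption, and `to_junit_xml` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def to_junit_xml(suite_name, checks):
    """Build a JUnit-style XML string.

    `checks` is a list of (check_name, failure_message_or_None) tuples;
    a message of None means the check passed.
    """
    failures = sum(1 for _, message in checks if message is not None)
    suite = ET.Element(
        "testsuite",
        name=suite_name,
        tests=str(len(checks)),
        failures=str(failures),
    )
    for name, message in checks:
        case = ET.SubElement(suite, "testcase", classname=suite_name, name=name)
        if message is not None:
            # A <failure> child marks the test case as failed.
            failure = ET.SubElement(case, "failure", message=message)
            failure.text = message
    return ET.tostring(suite, encoding="unicode")


report = to_junit_xml(
    "voxarena.gemini",
    [
        ("min-tool-accuracy", None),
        ("max-hallucinations", "hallucinations 2 > 0"),
    ],
)
```

Writing `report` to a file such as `voxarena.xml` would give a CI report action one test case per threshold.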

### Subcommands

| Command | What it does |
| --- | --- |
| `voxarena run` | Single-provider scripted run; exits 0 if every threshold passes, 1 otherwise. |
| `voxarena compare` | Runs Gemini and OpenAI in parallel against the same script. |
| `voxarena report` | Generates a markdown comparison report from past runs. |

Run `voxarena <command> --help` for the full flag set.

---

## 🖥️ Web Control Panel UI (Zero Setup)

You can configure credentials, build test scripts, and run the benchmark suite entirely from your web browser:

```bash
pip install voxarena
voxarena ui
```

This starts a local server and automatically opens the dashboard in your default browser at `http://127.0.0.1:8000`.

From the UI, you can:
- **Set Up API Keys:** Add and save Google Gemini and OpenAI API keys securely in the local database.
- **Select Models:** Pick from preloaded Gemini and OpenAI realtime models, or write in your own custom model identifiers.
- **Edit Test Utterances:** Create, edit, and delete turns in your test scripts using the interactive visual list editor (no raw YAML/JSON formatting needed).
- **Run & Inspect:** Start live comparison runs and watch real-time transcripts, metrics, audio playbacks, and tool-call correctness side-by-side.

*Note: If you run `voxarena ui` in a clean, empty directory, it will automatically bootstrap default script files and pre-recorded audio so you can run benchmarks immediately.*

---

## Features

- 🎙️ **Provider-agnostic agent** — one Pipecat pipeline drives every provider; swap models without re-implementing your agent
- 🔁 **Scripted conversations** — multi-turn YAML scripts with pre-recorded WAV inputs and expected tool calls / response content
- 📊 **Automated scoring** — tool-call correctness, response matching, hallucination counts, time-to-first-audio, interruption-stop latency
- 🆚 **Side-by-side comparisons** — run multiple providers in parallel against the same script
- 🗄️ **Persistent run history** — JSON manifests on disk, indexed in SQLite
- 🖥️ **Web control panel** — React UI to launch runs, watch live status, browse results, and edit scripts
- 🧩 **Extensible** — add a new provider by implementing one adapter class

## Architecture

<p align="center">
  <img src="https://raw.githubusercontent.com/simkeyur/vox-arena/main/ui/src/assets/architecture.png" alt="VoxArena Architecture" width="800" />
</p>

## Local Dev (with UI)

```bash
git clone https://github.com/simkeyur/vox-arena.git
cd vox-arena
cp .env.example .env  # add GOOGLE_API_KEY / OPENAI_API_KEY

python3 -m venv .venv && source .venv/bin/activate
pip install -e .

uvicorn voxarena.main:app --reload --port 8000
```

Then in another terminal:

```bash
cd ui && npm install && npm run dev
```

Open the control panel at `http://localhost:5173`.

## Bring Your Own Agent

The demo ships with the "Saffron Leaf" restaurant agent so you can run end-to-end on day one. To evaluate your own:

1. Replace the system prompt and tool schemas in `voxarena/agent.py`
2. Implement (or stub) your tools in `voxarena/tools.py`
3. Re-record `script/audio/*.wav` and update `script/utterances.yaml` to reflect your real workload
4. Run the arena as normal — every provider gets scored against your scripts
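
As an illustration of step 2, a stubbed tool can be as simple as a lookup table. The sketch below is hypothetical: the opening hours are made-up placeholder data, and the plain-function signature is an assumption, so adapt it to whatever handler signature `voxarena/tools.py` actually expects.

```python
# Placeholder data for illustration only; None means closed that day.
OPENING_HOURS = {
    "monday": ("11:00", "22:00"),
    "saturday": ("10:00", "23:00"),
    "sunday": None,
}


def get_hours(day: str) -> dict:
    """Stubbed tool: return opening hours for a day as a JSON-friendly dict."""
    key = day.strip().lower()
    if key not in OPENING_HOURS:
        # Report unknown input explicitly so the agent does not have to guess.
        return {"day": key, "error": "unknown day"}
    hours = OPENING_HOURS[key]
    if hours is None:
        return {"day": key, "open": False}
    opens, closes = hours
    return {"day": key, "open": True, "opens": opens, "closes": closes}
```

A deterministic stub like this keeps runs reproducible, since every provider sees the same tool result for the same arguments.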

## Scripted Conversations

Conversations live in [`script/utterances.yaml`](script/utterances.yaml). Each turn pairs an utterance id with an `expect` block describing the correct tool call and/or response content:

```yaml
- id: u04
  text: "Are you open on Sundays?"
  expect:
    tool: get_hours
    args:
      day: sunday
    response_contains:
      - "closed"
```

The harness plays `script/audio/{id}.wav` into the pipeline and scores the agent's actual tool calls and transcript against `expect`.
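
To make the scoring step concrete, the following sketch checks one turn against an `expect` block shaped like the YAML above. It is not the harness's real scorer: `score_turn`, the shape of `tool_calls`, case-insensitive substring matching, and ignoring extra arguments are all assumptions made for illustration.

```python
def score_turn(expect: dict, tool_calls: list, transcript: str) -> dict:
    """Score one turn.

    `tool_calls` is assumed to be a list of {"name": str, "args": dict}
    entries recorded during the turn; `transcript` is the agent's reply text.
    """
    result = {"tool_ok": True, "response_ok": True}

    if "tool" in expect:
        wanted_args = expect.get("args", {})
        # Pass if any recorded call has the expected name and every expected
        # argument matches; extra arguments on the call are ignored.
        result["tool_ok"] = any(
            call["name"] == expect["tool"]
            and all(call.get("args", {}).get(k) == v for k, v in wanted_args.items())
            for call in tool_calls
        )

    # Every listed phrase must appear in the transcript (case-insensitive).
    lowered = transcript.lower()
    phrases = expect.get("response_contains", [])
    result["response_ok"] = all(p.lower() in lowered for p in phrases)
    return result


expect_u04 = {
    "tool": "get_hours",
    "args": {"day": "sunday"},
    "response_contains": ["closed"],
}
outcome = score_turn(
    expect_u04,
    [{"name": "get_hours", "args": {"day": "sunday"}}],
    "Sorry, we're closed on Sundays.",
)
```

Here `outcome` marks both the tool call and the response as correct; a reply that skipped the tool or omitted "closed" would flip the corresponding flag to `False`.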

## Configuration

| Variable | Description |
| --- | --- |
| `GOOGLE_API_KEY` / `OPENAI_API_KEY` | Provider credentials |
| `GEMINI_MODEL` / `OPENAI_MODEL` | Realtime model under test |
| `GEMINI_EVAL_MODEL` / `OPENAI_EVAL_MODEL` | Text models used for grading (typically cheaper than the realtime models) |
| `PORT` | FastAPI server port |
| `BASE_DIR` | Override workdir (CLI: `--workdir`) |
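
As a rough picture of how these variables might be consumed, here is a standard-library sketch. The package depends on `pydantic-settings`, so the real loader probably looks different; `ArenaConfig`, `load_config`, and the fallback values (port 8000, current directory) are assumptions for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ArenaConfig:
    """Hypothetical snapshot of the environment variables listed above."""
    google_api_key: Optional[str]
    openai_api_key: Optional[str]
    gemini_model: Optional[str]
    openai_model: Optional[str]
    port: int
    base_dir: str


def load_config(env: Mapping[str, str] = os.environ) -> ArenaConfig:
    """Read settings from a mapping (defaults to the process environment)."""
    return ArenaConfig(
        google_api_key=env.get("GOOGLE_API_KEY"),
        openai_api_key=env.get("OPENAI_API_KEY"),
        gemini_model=env.get("GEMINI_MODEL"),
        openai_model=env.get("OPENAI_MODEL"),
        # Assumed default: 8000, the port used in the examples in this README.
        port=int(env.get("PORT", "8000")),
        # Assumed default: the current working directory.
        base_dir=env.get("BASE_DIR", os.getcwd()),
    )
```

Passing the environment in as a mapping keeps the loader easy to test without mutating `os.environ`.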

## Contributing

To add a new provider: implement an adapter in `voxarena/providers/` following the pattern in `gemini.py` / `openai.py`, wire it into `voxarena/harness.py` and `voxarena/config.py`, and open a PR.

For bugs and feature requests, please open an issue.

## License

[MIT](LICENSE).
