Metadata-Version: 2.4
Name: bellwether
Version: 0.1.0
Summary: The cost-and-failure-mode benchmark for LLM agents.
Author-email: Stephen Hedrick <Stephen@wavebound.io>
License: MIT
Project-URL: Homepage, https://github.com/cartesianxr7/bellwether
Project-URL: Repository, https://github.com/cartesianxr7/bellwether
Project-URL: Documentation, https://cartesianxr7.github.io/bellwether
Project-URL: Issues, https://github.com/cartesianxr7/bellwether/issues
Keywords: llm,benchmark,evaluation,agents,anthropic,openai,gemini,tcot
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: anthropic>=0.40
Requires-Dist: openai>=1.50
Requires-Dist: google-genai>=0.3
Requires-Dist: python-dotenv>=1.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-cov>=4.1; extra == "dev"
Requires-Dist: ruff>=0.5; extra == "dev"
Requires-Dist: mypy>=1.10; extra == "dev"
Requires-Dist: jinja2>=3.1; extra == "dev"
Requires-Dist: markdown>=3.5; extra == "dev"
Dynamic: license-file

# bellwether

[![tests](https://github.com/cartesianxr7/bellwether/actions/workflows/test.yml/badge.svg)](https://github.com/cartesianxr7/bellwether/actions/workflows/test.yml)
[![python](https://img.shields.io/badge/python-3.11%2B-blue.svg)](https://github.com/cartesianxr7/bellwether)
[![methodology](https://img.shields.io/badge/methodology-v0.1-blueviolet.svg)](https://cartesianxr7.github.io/bellwether/methodology.html)
[![license](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)

The cost-and-failure-mode benchmark for LLM agents: a methodology plus a Python package for honest, reproducible cross-provider agent evaluation.

**[Live leaderboard](https://cartesianxr7.github.io/bellwether/)** &middot; **[Methodology](https://cartesianxr7.github.io/bellwether/methodology.html)**

## Why

Cross-provider LLM benchmarks today rank capability ("which model is smarter on average"). HELM and Chatbot Arena own that ground.

Practitioners building production systems need a different answer: **which provider for THIS task, at THIS cost once retries and failures are accounted for, with THESE failure modes that map to my product's tolerance.**

bellwether answers the procurement question and ships a toolkit anyone can run on their own prompts.

## What it measures

- **`effective_TCoT`**: total cost per successfully completed task, including the cost of failed retries. The procurement-question metric, not the average-quality one.
- **Failure-mode taxonomy**: classify *how* models fail, not just whether they do (refusal, confabulation, schema break, truncation, partial, off-task, timeout, error). The taxonomy maps directly to product-tolerance decisions.
- **Machine-checkable ground truth only.** No LLM-as-judge. Sidesteps the well-documented judge-bias issue.
- **Prompt portability.** Headline numbers use one canonical prompt across providers; a tuned-vs-canonical portability-cost track is promised for v1, with a concrete contract.
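
The headline metric can be sketched as follows. This is an illustrative snippet, not the package's implementation; the authoritative formula and retry policy live in METHODOLOGY.md, and the function and variable names here are hypothetical:

```python
def effective_tcot(attempts: list[tuple[float, bool]]) -> float:
    """Effective total cost per successfully completed task.

    `attempts` is one (cost_usd, succeeded) pair per attempt, including
    failed retries. All spend is charged against the successes, so
    failures inflate the effective cost rather than disappearing.
    """
    total_cost = sum(cost for cost, _ in attempts)
    successes = sum(1 for _, ok in attempts if ok)
    if successes == 0:
        return float("inf")  # nothing completed: cost per success is unbounded
    return total_cost / successes


# Three attempts at $0.01 each, two succeed -> ~$0.015 per success
print(effective_tcot([(0.01, True), (0.01, False), (0.01, True)]))
```

The point of the metric is visible in the example: a provider that is cheap per call but fails half the time can have a worse effective_TCoT than a pricier, more reliable one.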

See [METHODOLOGY.md](METHODOLOGY.md) for formulas, retry policy, validator contract, and reproducibility caveats.
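
For a feel of how the eight failure modes above might be modeled, here is a minimal enum sketch; the names are illustrative, and the package's actual taxonomy is defined in its own code and METHODOLOGY.md:

```python
from enum import Enum


class FailureMode(Enum):
    REFUSAL = "refusal"              # model declined the task outright
    CONFABULATION = "confabulation"  # fabricated content that fails ground truth
    SCHEMA_BREAK = "schema_break"    # output violates the required schema
    TRUNCATION = "truncation"        # output cut off mid-response
    PARTIAL = "partial"              # some required fields present, others missing
    OFF_TASK = "off_task"            # well-formed, but answers the wrong question
    TIMEOUT = "timeout"              # provider did not respond in time
    ERROR = "error"                  # API or transport error
```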

## Install

From source (currently the only install path; PyPI publication pending):

```bash
git clone https://github.com/cartesianxr7/bellwether
cd bellwether
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
cp .env.example .env       # add ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY
pre-commit install         # optional; blocks accidental secret commits
pytest                     # 120+ tests; all should pass
```

Once v0.1.0 is published to PyPI:

```bash
pip install bellwether
```

## Run

```bash
bellwether list providers           # show registered provider adapters
bellwether list tasks               # show registered tasks

# Smoke test: 2 instances, 1 run each, $1 cap, takes ~10 seconds and ~$0.01:
bellwether run --instances 2 --n 1 --max-cost 1

# Standard bench: 5 instances, 3 runs per instance, all 3 providers, $5 cap:
bellwether run --instances 5 --n 3 --max-cost 5

# Re-render leaderboard from existing results without re-running:
bellwether report results
```

The cost guardrail (`--max-cost USD`) is a hard cap on total spend per invocation. Strongly recommended.
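
Conceptually, a hard cap like this can be implemented as a running-total guard that aborts the invocation the moment cumulative spend crosses the limit. The sketch below is illustrative only (class and method names are hypothetical, not bellwether's API):

```python
class CostCapExceeded(RuntimeError):
    """Raised when cumulative spend crosses the per-invocation cap."""


class CostGuard:
    """Hard cap on total spend for one benchmark invocation."""

    def __init__(self, max_cost_usd: float):
        self.max_cost_usd = max_cost_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        """Record a completed call's cost; abort the run if the cap is hit."""
        self.spent_usd += cost_usd
        if self.spent_usd > self.max_cost_usd:
            raise CostCapExceeded(
                f"spent ${self.spent_usd:.4f}, cap ${self.max_cost_usd:.2f}"
            )


guard = CostGuard(max_cost_usd=1.0)
guard.charge(0.25)    # within the cap, run continues
# guard.charge(0.80)  # would push spend to $1.05 and raise CostCapExceeded
```

Note that a guard like this is checked after each call completes, so one invocation can overshoot the cap by at most the cost of a single request.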

## Status

**v0.1**: methodology, package, CLI, structured-output extraction task across Claude Sonnet 4.6, GPT-4o, and Gemini 2.5 Flash Lite. 1-task leaderboard, 3-pass reproducibility data.

**v0.2 through v0.5**: function calling (BFCL), RAG (FinanceBench/NQ-open/HotpotQA), multi-step reasoning (GAIA validation set), long-context summarization (GovReport). One task per release.

**v1**: code-generation task with sandboxing, OpenRouter open-weights, tuned-prompt-track formalization, plugin loader.

## Repository

- Code: [github.com/cartesianxr7/bellwether](https://github.com/cartesianxr7/bellwether)
- Leaderboard: [cartesianxr7.github.io/bellwether](https://cartesianxr7.github.io/bellwether)
- Methodology: [cartesianxr7.github.io/bellwether/methodology.html](https://cartesianxr7.github.io/bellwether/methodology.html)
- Raw results JSON: [results/](https://github.com/cartesianxr7/bellwether/tree/main/results)

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md). Adding a task or a provider adapter is a single PR; the contract is documented and small.

## License

MIT. See [LICENSE](LICENSE).

## Author

Stephen Hedrick.
