Metadata-Version: 2.4
Name: voting-mcp
Version: 0.1.0
Summary: MCP server exposing principled social-choice aggregation rules (Borda, Copeland, Condorcet, approval, STV, opinion pool), with a reproducible benchmark measuring their accuracy vs majority vote over an LLM ensemble.
Author-email: Hrishi Kabra <kabrahrishi@gmail.com>
License: MIT
Keywords: aggregation,ensemble,mcp,social-choice,voting
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.12
Requires-Dist: mcp[cli]>=1.2.0
Requires-Dist: pydantic>=2.6
Provides-Extra: bench
Requires-Dist: matplotlib>=3.8; extra == 'bench'
Requires-Dist: numpy>=1.26; extra == 'bench'
Requires-Dist: openai>=1.30; extra == 'bench'
Requires-Dist: python-dotenv>=1.0; extra == 'bench'
Requires-Dist: pyyaml>=6.0; extra == 'bench'
Description-Content-Type: text/markdown

# voting-mcp

**Principled social-choice aggregation as MCP tools — with a benchmark that measures the
accuracy lift over naive majority vote.**

Almost every multi-agent system aggregates votes with `Counter(votes).most_common(1)`, throwing
away preference order and confidence. `voting-mcp` ships the real rules (Borda, Copeland,
Condorcet, approval, STV, linear opinion pool) as callable MCP tools — each with its known
axiomatic behavior and explicit, documented tie-breaking — plus a reproducible benchmark that
aggregates a diverse ensemble of LLMs on a reasoning set and reports accuracy with bootstrap
confidence intervals.
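
To make the point concrete, here is a minimal sketch (plain Python, not the package's own code) of a profile where counting first choices and Borda disagree. The five ballots are hypothetical and chosen for illustration.

```python
from collections import Counter

# Five hypothetical agents each rank three candidate answers, best first.
ballots = [
    ["A", "B", "C"],
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["C", "B", "A"],
    ["C", "B", "A"],
]

# Naive aggregation: look at first choices only.
first_choices = Counter(ballot[0] for ballot in ballots)

# Borda: with m candidates, 0-based position i earns m - 1 - i points.
borda = Counter()
for ballot in ballots:
    m = len(ballot)
    for position, candidate in enumerate(ballot):
        borda[candidate] += m - 1 - position

print(dict(first_choices))  # A and C tie on first choices, B trails
print(dict(borda))          # B leads once the full order is counted
```

On first choices A and C tie at two votes each, and `most_common(1)` silently returns whichever was seen first. B is nobody's favourite twice over, yet it is ranked second on every other ballot, which is exactly the information the first-choice count discards.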

The server is **pure compute**: stdio transport, no network, no file writes, no secrets — clean
against the OWASP MCP Top 10 by construction.

## Install

```sh
# run the server directly (once published)
uvx voting-mcp

# or from source
git clone <repo> && cd voting-mcp
uv sync
uv run python -m voting_mcp.server
```

Add it to an MCP client (e.g. Claude Desktop `claude_desktop_config.json`):

```json
{
  "mcpServers": {
    "voting": { "command": "uvx", "args": ["voting-mcp"] }
  }
}
```

## Tools

Every tool takes a `profile` (`{candidates, ballots}`) and returns a `Result` with the full
co-winner set (`winners`, so ties are never hidden), the single tie-broken `winner` (or `null`
when none exists), a `ranking`, per-candidate `scores`, and a `note`.

| Tool | Ballots | Notes |
|------|---------|-------|
| `borda` | rankings | positional; Condorcet-inconsistent, clone-sensitive |
| `copeland` | rankings | Condorcet-consistent pairwise (+1 win, +0.5 tie) |
| `condorcet` | rankings | returns the pairwise winner **or an explicit no-winner on a cycle** |
| `approval` | approval sets | most-approved wins |
| `stv` | rankings | single-winner instant-runoff; clone-resistant |
| `opinion_pool` | distributions | linear pool — **preserves confidence, not an argmax vote** |
| `plurality` | rankings | baseline (most first choices) |
| `majority` | rankings | strict >50% or **no winner** |
| `aggregate_rule` | any | dispatch by a `rule` enum |
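
As an illustration of the pairwise rules in the table, the sketch below computes Copeland scores with the +1 / +0.5 convention and reports no Condorcet winner on a cycle. It is a standalone sketch that assumes complete strict rankings; the function names are hypothetical and are not the package's API.

```python
from itertools import combinations


def pairwise_wins(candidates, ballots):
    """wins[(a, b)] = number of ballots that rank a above b."""
    wins = {(a, b): 0 for a in candidates for b in candidates if a != b}
    for ballot in ballots:
        rank = {c: i for i, c in enumerate(ballot)}
        for a, b in combinations(candidates, 2):
            if rank[a] < rank[b]:
                wins[(a, b)] += 1
            else:
                wins[(b, a)] += 1
    return wins


def copeland_scores(candidates, ballots):
    """+1 for each pairwise win, +0.5 to both sides of a pairwise tie."""
    wins = pairwise_wins(candidates, ballots)
    scores = {c: 0.0 for c in candidates}
    for a, b in combinations(candidates, 2):
        if wins[(a, b)] > wins[(b, a)]:
            scores[a] += 1.0
        elif wins[(a, b)] < wins[(b, a)]:
            scores[b] += 1.0
        else:
            scores[a] += 0.5
            scores[b] += 0.5
    return scores


def condorcet_winner(candidates, ballots):
    """Return the candidate who beats every other one head-to-head, else None."""
    scores = copeland_scores(candidates, ballots)
    for c in candidates:
        if scores[c] == len(candidates) - 1:
            return c
    return None


candidates = ["A", "B", "C"]
cycle = [["A", "B", "C"], ["B", "C", "A"], ["C", "A", "B"]]
cycle_scores = copeland_scores(candidates, cycle)
cycle_winner = condorcet_winner(candidates, cycle)
print(cycle_scores, cycle_winner)
```

In the cyclic profile A beats B, B beats C, and C beats A, so every candidate scores 1.0 and the Condorcet check returns `None` rather than inventing a winner.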

Tie-breaking is an explicit parameter (`lexicographic` default, `none`, or seeded `random`).
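
A minimal sketch of how the three modes could resolve a co-winner set. The `break_tie` helper, its signature, and the choice to sort before drawing are assumptions made for illustration, not the package's implementation.

```python
import random


def break_tie(winners, method="lexicographic", seed=None):
    """Reduce a co-winner set to a single winner, or None if left unresolved."""
    ordered = sorted(winners)  # fixed order so the result never depends on set iteration
    if not ordered:
        return None
    if len(ordered) == 1:
        return ordered[0]  # nothing to break
    if method == "lexicographic":
        return ordered[0]
    if method == "none":
        return None  # report the tie instead of hiding it
    if method == "random":
        # A private, seeded generator makes the draw reproducible.
        return random.Random(seed).choice(ordered)
    raise ValueError(f"unknown tie-break method: {method!r}")


lex = break_tie({"B", "A"})
unresolved = break_tie({"B", "A"}, method="none")
draw_1 = break_tie({"A", "B", "C"}, method="random", seed=7)
draw_2 = break_tie({"A", "B", "C"}, method="random", seed=7)
print(lex, unresolved, draw_1 == draw_2)
```

Sorting first matters for the `random` mode: without it, the same seed could pick different winners depending on how the set happens to iterate.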

## Benchmark

Aggregate an ensemble of 5 models (all called through a single OpenAI-compatible client via
OpenRouter) on ARC-Challenge and compare each rule to the naive majority vote:

```sh
uv sync --extra bench
uv run python -m bench.fetch_arc --limit 200
# prints a cost estimate and STOPS; add --yes to actually call the API, --mock for a free dry run
uv run python -m bench.run_ensemble --dataset bench/datasets/arc_challenge.jsonl --limit 200 --yes
uv run python -m bench.compare --dataset bench/datasets/arc_challenge.jsonl --limit 200
```

Every raw response is cached under `bench/results/raw/`; re-runs never re-call the API, so
aggregation tweaks are free.
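
The caching idea can be sketched as a read-through file cache. Everything specific here is an assumption for illustration: the `cached_call` helper, the SHA-256 key over model and prompt, and the one-JSON-file-per-response layout are hypothetical and may differ from what `bench.run_ensemble` actually does.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


def cached_call(cache_dir: Path, model: str, prompt: str,
                call_api: Callable[[str, str], str]) -> str:
    """Return the cached response if present; otherwise call the API once and store it."""
    key = hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))["response"]
    response = call_api(model, prompt)
    cache_dir.mkdir(parents=True, exist_ok=True)
    record = {"model": model, "prompt": prompt, "response": response}
    path.write_text(json.dumps(record), encoding="utf-8")
    return response


# Demo with a stand-in for the real API client (hypothetical).
calls = []


def fake_api(model: str, prompt: str) -> str:
    calls.append((model, prompt))
    return "B"


with tempfile.TemporaryDirectory() as tmp:
    raw_dir = Path(tmp) / "raw"
    first = cached_call(raw_dir, "model-x", "Q1", fake_api)
    second = cached_call(raw_dir, "model-x", "Q1", fake_api)

print(first, second, len(calls))
```

The second call is served from disk, so the stand-in API is hit only once; changing an aggregation rule afterwards only re-reads the stored responses.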

### Results

5-model ensemble (gpt-4o-mini · gemini-2.5-flash-lite · deepseek-v3 · claude-haiku-4.5 ·
glm-4.7), n = 200, bootstrap 95% CI. Two datasets of different difficulty; full write-up and
both plots in [`RESULTS.md`](RESULTS.md).

**MMLU-Pro (hard, baseline 73.5%) — the informative case:**

| Rule | Accuracy | 95% CI | Δ vs majority |
|------|---------:|:------:|--------------:|
| **opinion_pool** | **0.755** | [0.695, 0.815] | **+0.020** |
| **majority_vote (baseline)** | 0.735 | [0.679, 0.788] | — |
| approval | 0.701 | [0.640, 0.757] | −0.035 |
| stv | 0.693 | [0.630, 0.750] | −0.043 |
| copeland | 0.647 | [0.580, 0.710] | −0.088 |
| condorcet | 0.620 | [0.550, 0.685] | −0.115 |
| majority (strict) | 0.590 | [0.520, 0.655] | −0.145 |
| borda | 0.472 | [0.405, 0.540] | −0.263 |

![MMLU-Pro](docs/accuracy_mmlu_pro.png)
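
For readers unfamiliar with bootstrap intervals, one common variant is the percentile bootstrap over per-question correctness, sketched below. Whether `bench.compare` uses this exact variant, resample count, or seed is an assumption; the input vector is synthetic, built only to have 147 correct answers out of 200.

```python
import random


def bootstrap_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of a 0/1 correctness vector."""
    rng = random.Random(seed)
    n = len(correct)
    # Resample the questions with replacement and record each resample's accuracy.
    means = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(correct) / n, lower, upper


# Synthetic data: 147 of 200 questions correct (accuracy 0.735).
synthetic = [1] * 147 + [0] * 53
accuracy, ci_low, ci_high = bootstrap_ci(synthetic)
print(accuracy, ci_low, ci_high)
```

With n = 200 the interval is roughly a dozen points wide, which is why a two-point difference between rules cannot be called conclusive on its own.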

**The finding (honest):** the value isn't "fancy voting beats majority." It's that **the
confidence-preserving rule (`opinion_pool`) wins** when the crowd is uncertain: at +2.0pp it is
the only rule above baseline, though its CI still overlaps the baseline's, so the result is
*suggestive, not conclusive*. Meanwhile, **forcing the distributions into full rankings actively
hurts**: `borda` collapses to 0.472, far below majority, because with 10 options the tail of
each ranking is mostly noise. Aggregate the confidence; don't throw it away. On
**ARC-Challenge** (baseline 96.8%, near-ceiling) no rule separates from the rest: every rule's
CI overlaps the others'. See [`RESULTS.md`](RESULTS.md).
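
The difference between pooling confidence and voting on argmaxes can be seen in a small sketch. The three distributions are hypothetical, and equal weights are an assumption; the package's `opinion_pool` tool may accept weights or differ in other details.

```python
from collections import Counter

# Hypothetical per-model probability distributions over three options.
distributions = [
    {"A": 0.40, "B": 0.35, "C": 0.25},  # leans A, barely
    {"A": 0.45, "B": 0.40, "C": 0.15},  # leans A, barely
    {"A": 0.05, "B": 0.90, "C": 0.05},  # confident in B
]

# Argmax vote: each model casts one vote for its top option; confidence is lost.
votes = Counter(max(d, key=d.get) for d in distributions)
vote_winner = votes.most_common(1)[0][0]

# Linear opinion pool with equal weights: average the probabilities per option.
options = distributions[0].keys()
pooled = {o: sum(d[o] for d in distributions) / len(distributions) for o in options}
pool_winner = max(pooled, key=pooled.get)

print(vote_winner, pool_winner, pooled)
```

Two lukewarm votes for A outvote one confident vote for B under the argmax count, while the pool averages to roughly 0.30 for A and 0.55 for B and picks B.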

## Develop

```sh
uv run pytest -q
uv run ruff check .
uv run mypy --strict src
# exercise the tools in the MCP Inspector:
npx @modelcontextprotocol/inspector uv run python -m voting_mcp.server
```

> Note: if you keep this repo under an iCloud-synced folder (e.g. `~/Desktop`), iCloud can spawn
> duplicate `.pth` files that intermittently break the editable install. Tests use
> `pythonpath=src`; run the server with `PYTHONPATH=src` if an import fails, or move the repo
> off the synced folder.

## License

MIT
