Metadata-Version: 2.4
Name: agent-colosseum
Version: 0.1.0
Summary: Multi-agent colosseum framework. Prove that agents debating, researching, and red-teaming produce better answers than a single agent.
Project-URL: Homepage, https://github.com/jinsoo96/agent-colosseum
Project-URL: Repository, https://github.com/jinsoo96/agent-colosseum
Project-URL: Issues, https://github.com/jinsoo96/agent-colosseum/issues
Project-URL: Documentation, https://github.com/jinsoo96/agent-colosseum#readme
Author-email: Jinsoo Kim <wlstn010203@gmail.com>
License-Expression: MIT
License-File: LICENSE
Keywords: agent,ai,benchmark,debate,llm,multi-agent,peer-review,reasoning,red-team,research
Classifier: Development Status :: 3 - Alpha
Classifier: Framework :: AsyncIO
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Typing :: Typed
Requires-Python: >=3.11
Requires-Dist: pydantic>=2.0
Provides-Extra: all
Requires-Dist: anthropic>=0.40; extra == 'all'
Requires-Dist: openai>=1.0; extra == 'all'
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.40; extra == 'anthropic'
Provides-Extra: dev
Requires-Dist: pytest; extra == 'dev'
Requires-Dist: pytest-asyncio; extra == 'dev'
Requires-Dist: ruff; extra == 'dev'
Provides-Extra: openai
Requires-Dist: openai>=1.0; extra == 'openai'
Description-Content-Type: text/markdown

# agent-colosseum

**Make AI agents fight, research, and review each other — then prove the result is better than a single agent.**

agent-colosseum is a Python framework for running multi-agent interactions (debate, collaborative research, red-teaming, peer review) and benchmarking them against single-agent baselines. It is provider-agnostic (Claude, GPT, Gemini, local models) and ships with built-in datasets and evaluation metrics.

## Why?

A single LLM has blind spots. It commits to an answer too early, hallucinates confidently, and can't challenge its own reasoning. Multiple agents attacking, questioning, and building on each other's work produce measurably better results.

**This isn't theory. The papers prove it:**

| Paper | Key Result |
|-------|-----------|
| [Du et al. (2023)](https://arxiv.org/abs/2305.14325) | GSM8K: 77% → **85%**, MMLU: 63.9% → **71.1%** |
| [Liang et al. (EMNLP 2024)](https://arxiv.org/abs/2305.19118) | **GPT-3.5 + debate > GPT-4 alone**; **2.6x** increase in thought diversity |
| [Chan et al. (2023)](https://arxiv.org/abs/2308.07201) | Same-role agents = **zero** multi-agent benefit. Diverse roles required |
| [CAMEL (NeurIPS 2023)](https://arxiv.org/abs/2303.17760) | Multi-agent wins **76.3%** vs single-agent 10.4% |
| [Park et al. (2023)](https://arxiv.org/abs/2304.03442) | Full architecture (memory+reflection) **+41%** believability |

agent-colosseum lets you reproduce these results, run your own experiments, and use multi-agent patterns in production.

## Installation

```bash
pip install agent-colosseum                # core only
pip install "agent-colosseum[anthropic]"   # + Claude support
pip install "agent-colosseum[openai]"      # + GPT support
pip install "agent-colosseum[all]"         # everything
```

## Quick Start

```python
import asyncio
from agent_colosseum import DebateExecutor, DebateConfig
from agent_colosseum.providers.anthropic import AnthropicProvider

async def main():
    provider = AnthropicProvider(model="claude-sonnet-4-6")
    executor = DebateExecutor(provider)

    # Run a debate
    result = await executor.debate("Should AI models be open-sourced without restrictions?")
    print(result.final_answer)
    print(f"Confidence: {result.confidence}")
    print(f"Key arguments: {result.key_arguments}")

    # Compare: single agent vs debate
    single, debate = await executor.compare("Is Rust better than C++?")
    print(f"Single: {single.final_answer[:200]}")
    print(f"Debate: {debate.final_answer[:200]}")
    print(f"Token cost: {debate.total_tokens / single.total_tokens:.1f}x")

asyncio.run(main())
```

## Four Arena Modes

### 1. Debate — Agents Fight, Then Reach Consensus

Agents take opposing sides, attack each other's arguments, and are forced to synthesize a conclusion. Based on MAD (Liang et al.) and Du et al.

```python
result = await executor.debate("Is remote work more productive than office work?")
```

**Three debate protocols:**

| Protocol | How it works | Best for |
|----------|-------------|----------|
| `mad` (default) | Pro vs Con + Judge, tit-for-tat forced opposition | Factual questions, reasoning |
| `du` | N agents answer independently → share → update → converge | Math, well-defined problems |
| `freestyle` | Free-form argument, no forced sides | Open-ended topics |

```python
# Du-style consensus (no judge, natural convergence)
result = await executor.debate(topic, protocol="du")

# Freestyle with custom agents
result = await executor.debate(topic, protocol="freestyle", debaters=my_agents)
```

**How it works:**
```
Rounds 1-3: Fight (argument → rebuttal → concession → challenge)
    ↓
Synthesis Round: Each debater states what they concede and what they won't
    ↓  
Final Integration: Merge synthesis statements into unified conclusion
    ↓
Result: final_answer + agreed_points + unresolved issues + insight
```

### 2. Research — Independent Investigation, Cross-Examination, Synthesis

Each agent investigates from its own specialty (empiricist, theorist, methodologist); the researchers then cross-examine each other's findings before synthesizing.

```python
result = await executor.research("Are LLM agents production-ready in 2026?")
```

**How it works:**
```
Phase 1 (Independent): Each researcher analyzes from their specialty
Phase 2 (Cross-exam):  Read each other's work, challenge methodology
Phase 3 (Synthesis):   Confirmed findings + probable hypotheses + unresolved
```

**Output includes:**
- `confirmed_findings` — all researchers agree
- `probable_hypotheses` — most agree, needs verification
- `unresolved` — researchers disagree
- `further_research` — suggested next steps
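
A minimal sketch of consuming those fields (attribute names as listed above; assumed to be lists of strings):

```python
result = await executor.research("Are LLM agents production-ready in 2026?")

print("Confirmed by all researchers:")
for finding in result.confirmed_findings:
    print(f"  - {finding}")

print("Probable, needs verification:")
for hypothesis in result.probable_hypotheses:
    print(f"  - {hypothesis}")

print("Suggested next steps:")
for step in result.further_research:
    print(f"  - {step}")
```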

### 3. Red Team — One Defends, Others Attack

One agent presents a claim. The others attack it from different angles (logic, evidence, edge cases, practicality, assumptions). The defender absorbs valid attacks and strengthens the claim.

```python
result = await executor.redteam("Microservices are better than monoliths")
```

**How it works:**
```
Phase 1: Defender presents initial claim
Phase 2: Attackers hit from different angles → Defender revises (repeat)
Phase 3: Compare original vs hardened claim
```

**Output includes:**
- `original_weaknesses` — weaknesses found in the initial claim
- `improvements` — how the claim got stronger
- `surviving_weaknesses` — weaknesses that couldn't be fixed
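
To see how the claim hardened, compare those fields directly (a sketch; the attributes are assumed to be lists of strings):

```python
result = await executor.redteam("Microservices are better than monoliths")

print(f"Weaknesses in the initial claim: {len(result.original_weaknesses)}")
for fix in result.improvements:
    print(f"  strengthened: {fix}")
for weakness in result.surviving_weaknesses:
    print(f"  still open:   {weakness}")
```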

### 4. Peer Review — Present, Review, Revise

One agent presents analysis. Others review it (strengths, weaknesses, questions, suggestions). The author revises based on feedback.

```python
result = await executor.peer_review("What caused the 2008 financial crisis?")
```

**How it works:**
```
Phase 1: Author presents structured analysis
Phase 2: Reviewers critique from different angles → Author revises (repeat)
Phase 3: Final reviewed output
```
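
Phase 2's critique-and-revise loop runs for a configurable number of rounds. A minimal sketch, assuming `peer_review` honors the same `max_rounds` setting from `DebateConfig` (see Configuration below):

```python
from agent_colosseum import DebateConfig, DebateExecutor

# Assumption: max_rounds also bounds the critique/revise cycles in peer review.
config = DebateConfig(max_rounds=2)
executor = DebateExecutor(provider, config=config)
result = await executor.peer_review("What caused the 2008 financial crisis?")
print(result.final_answer)
```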

## Custom Agents

Every mode accepts custom agents. Control their number, roles, personalities, and even which LLM each agent uses.

```python
from agent_colosseum import Debater, DebaterRole

debaters = [
    Debater(
        name="DataScientist",
        role=DebaterRole.PROPONENT,
        persona="Relies on statistical evidence and empirical data.",
        model="claude-sonnet-4-6",  # optional per-agent model
    ),
    Debater(
        name="Philosopher",
        role=DebaterRole.OPPONENT,
        persona="Questions fundamental assumptions and ethical implications.",
    ),
    Debater(
        name="Engineer",
        role=DebaterRole.SKEPTIC,
        persona="Demands practical feasibility. 'Will it actually work?'",
    ),
    Debater(
        name="Contrarian",
        role=DebaterRole.DEVILS_ADVOCATE,
        persona="Opposes whatever the majority thinks. Finds hidden problems.",
    ),
]

# Use in any mode
result = await executor.debate(topic, debaters=debaters)
result = await executor.research(topic, debaters=debaters)
result = await executor.redteam(topic, debaters=debaters)   # first = defender
result = await executor.peer_review(topic, debaters=debaters)  # first = author
```

**Available roles:**

| Role | Behavior |
|------|----------|
| `PROPONENT` | Argues in favor |
| `OPPONENT` | Argues against |
| `SKEPTIC` | Demands evidence for everything |
| `DEVILS_ADVOCATE` | Opposes majority, reveals hidden problems |
| `WILDCARD` | Unpredictable, attacks from unexpected angles |

## Cross-Model Debate

Du et al. showed that ChatGPT (14/20) + Bard (11/20) debating together scored **17/20** — better than either alone. agent-colosseum supports this natively:

```python
from agent_colosseum.providers.anthropic import AnthropicProvider
from agent_colosseum.providers.openai import OpenAIProvider

debaters = [
    Debater(name="Claude", role=DebaterRole.PROPONENT, model="claude-sonnet-4-6"),
    Debater(name="GPT", role=DebaterRole.OPPONENT, model="gpt-4o"),
]

# Use MultiProvider to route by model name
# (see examples/cross_model_debate.py for full implementation)
```
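
The full routing implementation lives in `examples/cross_model_debate.py`; the sketch below shows one plausible shape for it, assuming `MultiProvider` is a thin `LLMProvider` that dispatches on model-name prefixes. This is illustrative, not the shipped class:

```python
from agent_colosseum import DebateExecutor
from agent_colosseum.providers.base import LLMProvider, LLMResponse

class MultiProvider(LLMProvider):
    """Dispatch each generate() call to a provider chosen by model-name prefix."""

    def __init__(self, routes: dict[str, LLMProvider], default: LLMProvider):
        self.routes = routes    # prefix -> provider, e.g. {"claude": ..., "gpt": ...}
        self.default = default  # fallback when no prefix matches

    async def generate(
        self,
        system: str,
        messages: list[dict[str, str]],
        *,
        max_tokens: int = 800,
        temperature: float = 0.7,
        model: str | None = None,
    ) -> LLMResponse:
        provider = self.default
        if model:
            for prefix, candidate in self.routes.items():
                if model.startswith(prefix):
                    provider = candidate
                    break
        # Forward the call unchanged so per-agent model= settings keep working.
        return await provider.generate(
            system, messages,
            max_tokens=max_tokens, temperature=temperature, model=model,
        )

provider = MultiProvider(
    routes={
        "claude": AnthropicProvider(model="claude-sonnet-4-6"),
        "gpt": OpenAIProvider(model="gpt-4o"),
    },
    default=AnthropicProvider(model="claude-sonnet-4-6"),
)
result = await DebateExecutor(provider).debate(
    "Should AI models be open-sourced without restrictions?", debaters=debaters
)
```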

## Configuration

```python
from agent_colosseum import DebateConfig

config = DebateConfig(
    max_rounds=3,              # Du: 2-3 optimal, plateau at 4+
    disagreement_level=2,      # 1=mild, 2=strong (optimal), 3=extreme
    adaptive_break=True,       # Liang: judge can end early on consensus
    scoring_criteria=["logic", "evidence", "persuasion"],
    temperature=0.7,           # agent generation temperature
    judge_temperature=0.0,     # SOTOPIA: deterministic judge correlates r=0.71 with humans
    max_tokens_per_statement=800,
    max_tokens_final=2000,
)

executor = DebateExecutor(provider, config=config)
```

**Why these defaults?** Every default is backed by paper evidence:

| Setting | Default | Evidence |
|---------|---------|----------|
| `max_rounds=3` | 3 | Du et al.: 2-3 rounds optimal, plateau at 4+ |
| `disagreement_level=2` | 2 | Liang: Level 2 optimal. Level 3 devolves into "trying to win" |
| `adaptive_break=True` | True | Liang: judge-based early termination is effective |
| `judge_temperature=0.0` | 0.0 | SOTOPIA: deterministic scoring correlates r=0.71 with human judgment |

## Benchmarking

Built-in benchmarking to prove multi-agent > single agent with your own data:

```python
from agent_colosseum.benchmark import BenchmarkRunner

runner = BenchmarkRunner(executor)

# Run on built-in datasets
result = await runner.run("reasoning", protocol="mad", max_items=5)
print(result.summary())

# Output:
# ═══ Benchmark: reasoning (protocol: mad) ═══
#
# ## Accuracy
#   Single Agent:  3/5 (60.0%)
#   Debate:        4/5 (80.0%)
#   Difference:    +20.0%
#
# ## Quality (0-10)
#   Single Agent:  6.2
#   Debate:        7.8
#   Difference:    +1.6
#
# ## Cost
#   Single tokens: 2,100
#   Debate tokens: 8,400
#   Ratio:         4.0x
```

**Built-in datasets:**

| Dataset | Items | Type | Based on |
|---------|-------|------|----------|
| `arithmetic` | 10 | Exact-answer math | Du et al. |
| `reasoning` | 8 | Counter-intuitive logic | Liang CIAR |
| `factual` | 8 | Fact verification | Du MMLU/TruthfulQA |
| `opinion` | 5 | Open-ended (no ground truth) | Quality eval only |
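
Since `runner.run` takes the dataset name (as in the `reasoning` example above), sweeping all four built-in sets is a short loop. Note that `opinion` has no ground truth, so only the quality metrics apply to it:

```python
for dataset in ("arithmetic", "reasoning", "factual", "opinion"):
    result = await runner.run(dataset, protocol="mad", max_items=3)
    print(result.summary())
```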

**Quality metrics (SOTOPIA-inspired 7 dimensions):**

| Dimension | Measures | Weight |
|-----------|----------|--------|
| depth | Argument depth (surface vs root cause) | 1.0x |
| diversity | Perspective diversity (Liang: debate = 2.6x) | 1.5x |
| evidence | Evidence specificity | 1.0x |
| logic | Logical consistency | 1.0x |
| **novelty** | Insights impossible for single agent | **2.0x** |
| concession | Quality of concessions (intellectual honesty) | 1.0x |
| **synthesis** | Final conclusion better than individual positions | **2.0x** |
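
The weights imply a weighted mean over the seven 0-10 scores. A worked sketch of that aggregation (hypothetical; the shipped `metrics.py` may aggregate differently):

```python
# Hypothetical aggregation; agent_colosseum/benchmark/metrics.py may differ.
WEIGHTS = {
    "depth": 1.0, "diversity": 1.5, "evidence": 1.0, "logic": 1.0,
    "novelty": 2.0, "concession": 1.0, "synthesis": 2.0,
}

def overall_quality(scores: dict[str, float]) -> float:
    """Weighted mean of the seven 0-10 dimension scores."""
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS) / sum(WEIGHTS.values())

scores = {"depth": 7, "diversity": 8, "evidence": 6, "logic": 7,
          "novelty": 9, "concession": 6, "synthesis": 8}
print(f"{overall_quality(scores):.2f}")  # 7.58: novelty and synthesis dominate
```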

## Real-Time Streaming

All modes support a callback for real-time statement streaming:

```python
async def on_statement(stmt, round_num):
    label = "Synthesis" if round_num == 0 else f"R{round_num}"
    print(f"[{label}] {stmt.debater_name}: {stmt.content[:200]}")

result = await executor.debate(topic, on_statement=on_statement)
```

## Custom LLM Providers

Implement the `LLMProvider` interface to use any LLM:

```python
from agent_colosseum.providers.base import LLMProvider, LLMResponse

class OllamaProvider(LLMProvider):
    async def generate(
        self,
        system: str,
        messages: list[dict[str, str]],
        *,
        max_tokens: int = 800,
        temperature: float = 0.7,
        model: str | None = None,
    ) -> LLMResponse:
        # Your implementation here
        return LLMResponse(content="...", input_tokens=0, output_tokens=0)
```
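
For a concrete fill-in, here is one way to back that skeleton with Ollama's local `/api/chat` endpoint. The HTTP details below are Ollama's; `httpx` and the default model name are choices made for this sketch, not agent-colosseum requirements:

```python
import httpx

from agent_colosseum.providers.base import LLMProvider, LLMResponse

class OllamaProvider(LLMProvider):
    def __init__(self, model: str = "llama3.1", base_url: str = "http://localhost:11434"):
        self.model = model
        self.base_url = base_url

    async def generate(
        self,
        system: str,
        messages: list[dict[str, str]],
        *,
        max_tokens: int = 800,
        temperature: float = 0.7,
        model: str | None = None,
    ) -> LLMResponse:
        payload = {
            "model": model or self.model,
            "messages": [{"role": "system", "content": system}, *messages],
            "stream": False,
            "options": {"temperature": temperature, "num_predict": max_tokens},
        }
        async with httpx.AsyncClient(timeout=120.0) as client:
            resp = await client.post(f"{self.base_url}/api/chat", json=payload)
            resp.raise_for_status()
            data = resp.json()
        return LLMResponse(
            content=data["message"]["content"],
            input_tokens=data.get("prompt_eval_count", 0),   # prompt tokens as reported by Ollama
            output_tokens=data.get("eval_count", 0),         # completion tokens as reported by Ollama
        )
```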

Built-in providers: `AnthropicProvider`, `OpenAIProvider` (also works with any OpenAI-compatible API via `base_url`).

## Architecture

```
agent_colosseum/
├── models.py              # Debater, Statement, Round, DebateResult
├── prompts.py             # All prompts (English, provider-agnostic)
├── protocols.py           # Debate protocols (MAD, Du consensus, freestyle)
├── executor.py            # Main engine — debate/research/redteam/peer_review/compare
├── modes/
│   ├── research.py        # Independent investigation → cross-exam → synthesis
│   ├── redteam.py         # 1 defender + N attackers → hardened claim
│   └── peer_review.py     # 1 author → N reviewers → improved output
├── providers/
│   ├── base.py            # LLMProvider ABC
│   ├── anthropic.py       # Claude
│   └── openai.py          # GPT / OpenAI-compatible
└── benchmark/
    ├── datasets.py        # Built-in datasets (arithmetic, reasoning, factual, opinion)
    ├── metrics.py         # Accuracy + 7-dimension quality evaluation
    └── runner.py          # Single vs multi-agent comparison runner
```

## Design Decisions

Every design choice in agent-colosseum is grounded in published research:

| Decision | Reasoning | Source |
|----------|-----------|--------|
| Default 2 agents | 2 debaters optimal; 3+ degrades due to context length | Liang et al. |
| Default 3 rounds | 2-3 optimal; 4+ shows plateau or decline | Du et al. |
| Diverse roles mandatory | Same role = zero multi-agent benefit | Chan et al. |
| Disagreement level 2 | Level 3 devolves into "arguing to win" | Liang et al. |
| One-by-one communication | Sequential > simultaneous (60% vs 55% accuracy) | Chan et al. |
| Adaptive early termination | Judge-based consensus detection saves cost | Liang et al. |
| Deterministic judge (temp=0) | Correlates r=0.71 with human judgment | SOTOPIA |
| Synthesis round | Debaters themselves reach conclusion, not external judge | Novel |

<details>
<summary><b>References</b></summary>

#### Core Multi-Agent Debate
- [Du et al. (2023)](https://arxiv.org/abs/2305.14325) — Improving Factuality and Reasoning in Language Models through Multiagent Debate
- [Liang et al. (EMNLP 2024)](https://arxiv.org/abs/2305.19118) — Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate
- [Chan et al. (2023)](https://arxiv.org/abs/2308.07201) — ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate
- [CAMEL (NeurIPS 2023)](https://arxiv.org/abs/2303.17760) — Communicative Agents for "Mind" Exploration of Large Language Model Society

#### Agent Architecture & Evaluation
- [Park et al. (UIST 2023)](https://arxiv.org/abs/2304.03442) — Generative Agents: Interactive Simulacra of Human Behavior
- [SOTOPIA (ICLR 2024)](https://arxiv.org/abs/2310.11667) — Interactive Evaluation for Social Intelligence in Language Agents
- [MetaGPT (2023)](https://arxiv.org/abs/2308.00352) — Meta Programming for A Multi-Agent Collaborative Framework
- [Tree of Thoughts (2023)](https://arxiv.org/abs/2305.10601) — Deliberate Problem Solving with Large Language Models

#### Personality & Behavior
- [BIG5-CHAT (ACL 2025)](https://arxiv.org/abs/2410.16491) — Shaping LLM Personalities via Training on Human-Grounded Data
- [Agentic LLMs Survey (2025)](https://arxiv.org/abs/2503.23037) — Theory of Mind, multi-agent cooperation, social norms
- [Dynamic Personality (ACL 2025)](https://aclanthology.org/2025.findings-acl.1185.pdf) — Personality evolution during interaction

</details>

## License

MIT
