Metadata-Version: 2.4
Name: mixture-llm
Version: 0.1.3
Summary: Lightweight Mixture of Agents pipeline framework.
Author-email: Leonard O'Sullivan <leonard@machine.dev>
License: MIT License
        
        Copyright (c) 2025
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Homepage, https://github.com/leonardosul/mixture-llm
Project-URL: Documentation, https://leonardosul.github.io/mixture-llm/
Project-URL: Issues, https://github.com/leonardosul/mixture-llm/issues
Keywords: llm,moa,mixture-of-agents,prompting,pipelines
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Typing :: Typed
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: openai
Requires-Dist: openai>=1.0.0; extra == "openai"
Provides-Extra: litellm
Requires-Dist: litellm>=1.0.0; extra == "litellm"
Provides-Extra: examples
Requires-Dist: openai>=1.0.0; extra == "examples"
Requires-Dist: litellm>=1.0.0; extra == "examples"
Provides-Extra: dev
Requires-Dist: mixture-llm[examples]; extra == "dev"
Requires-Dist: pytest>=8; extra == "dev"
Requires-Dist: pytest-asyncio>=0.24; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: ruff>=0.6; extra == "dev"
Requires-Dist: pre-commit>=3.8; extra == "dev"
Requires-Dist: build>=1.2; extra == "dev"
Requires-Dist: twine>=5.1; extra == "dev"
Provides-Extra: docs
Requires-Dist: mkdocs>=1.6; extra == "docs"
Requires-Dist: mkdocs-material>=9.5; extra == "docs"
Requires-Dist: mkdocstrings[python]>=0.25; extra == "docs"
Dynamic: license-file

# mixture-llm

[![Test](https://github.com/leonardosul/mixture-llm/actions/workflows/test.yaml/badge.svg)](https://github.com/leonardosul/mixture-llm/actions/workflows/test.yaml)
[![Release](https://img.shields.io/github/v/release/leonardosul/mixture-llm)](https://github.com/leonardosul/mixture-llm/releases/latest)
[![PyPI](https://img.shields.io/pypi/v/mixture-llm)](https://pypi.org/project/mixture-llm/)
[![License: MIT](https://img.shields.io/github/license/leonardosul/mixture-llm)](LICENSE)
[![Docs](https://img.shields.io/badge/docs-GitHub%20Pages-blue)](https://leonardosul.github.io/mixture-llm)

Combine LLMs to beat the best single LLM.

The Mixture-of-Agents architecture achieved **65.1% on AlpacaEval 2.0** using only open-source models—surpassing GPT-4o's 57.5%. This library gives you the building blocks to construct these pipelines.

## Install

```bash
pip install mixture-llm
```

## Quick start

```python
from mixture_llm import Propose, Aggregate, run

pipeline = [
    Propose(["gpt-5-nano-2025-08-07", "claude-sonnet-4-5", "llama-3.3-70b"]),
    Aggregate("gpt-5-nano-2025-08-07"),
]

# my_client: any async client; see "Client examples" below
result, history = await run(pipeline, "What is quantum computing?", my_client)
```
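
`run` is a coroutine, so it needs an event loop. In a plain script you might wrap the call like this (a minimal sketch reusing `pipeline` and `my_client` from above):

```python
import asyncio

async def main() -> None:
    # Await the pipeline and print the final aggregated answer
    result, history = await run(pipeline, "What is quantum computing?", my_client)
    print(result)

asyncio.run(main())
```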

## Paper-accurate pipelines

### Together MoA (65.1% AlpacaEval)

The benchmark-winning configuration from [Wang et al. (2024)](https://arxiv.org/abs/2406.04692): 3 layers, 6 diverse proposers, Qwen aggregator.

```python
from mixture_llm import Propose, Synthesize, Aggregate

PROPOSERS = [
    "wizardlm-2-8x22b",
    "qwen1.5-110b-chat",
    "qwen1.5-72b-chat",
    "llama-3-70b-instruct",
    "mixtral-8x22b-instruct",
    "dbrx-instruct",
]

together_moa = [
    Propose(PROPOSERS, temp=0.7, max_tokens=512),
    Synthesize(PROPOSERS, temp=0.7, max_tokens=512),
    Synthesize(PROPOSERS, temp=0.7, max_tokens=512),
    Aggregate("qwen1.5-110b-chat"),
]
```

### MoA-Lite (59.3% AlpacaEval)

Cost-optimized 2-layer variant—still beats GPT-4o.

```python
moa_lite = [
    Propose(PROPOSERS, temp=0.7, max_tokens=512),
    Synthesize(PROPOSERS, temp=0.7, max_tokens=512),
    Aggregate("qwen1.5-72b-chat"),
]
```

### Self-MoA (+6.6% over standard MoA)

[Li et al. (2025)](https://arxiv.org/abs/2502.00674) showed that sampling one top model multiple times can outperform diverse model mixtures.

```python
# Same model, multiple samples via temperature
self_moa = [
    Propose(["gpt-5-nano-2025-08-07"] * 6, temp=0.7),
    Aggregate("gpt-5-nano-2025-08-07"),
]
```

### With robustness (shuffle + dropout)

Prevents positional bias and improves diversity.

```python
from mixture_llm import Propose, Shuffle, Dropout, Aggregate

robust_moa = [
    Propose(["gpt-5-nano-2025-08-07", "claude-sonnet-4-5", "llama-70b", "gemini-2.5-flash"]),
    Shuffle(),
    Dropout(0.2),
    Aggregate("gpt-5-nano-2025-08-07"),
]
```

## Steps

**LLM steps** — call models:
- `Propose(agents)` — generate initial responses in parallel
- `Synthesize(agents)` — each agent synthesizes all previous outputs
- `Aggregate(agent)` — single model combines everything into final output
- `Refine(agents)` — improve each response individually
- `Rank(agent, n)` — select top n responses by quality
- `Vote(agent)` — pick consensus answer

**Transform steps** — manipulate responses:
- `Shuffle()` — randomize order (prevents position bias)
- `Dropout(rate)` — randomly drop responses (improves robustness)
- `Sample(n)` — random subset
- `Take(n)` — first n responses
- `Filter(fn)` — keep responses matching predicate
- `Map(fn)` — transform each response
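
Steps compose freely, so transform steps can sit between LLM steps. A hypothetical quality-gate pipeline (a sketch, assuming the `Filter` predicate receives the response text):

```python
from mixture_llm import Propose, Filter, Rank, Aggregate

quality_gate = [
    Propose(["llama-3.3-70b-versatile"] * 8, temp=0.9),  # 8 samples of one model
    Filter(lambda r: len(r) > 100),                      # drop trivially short answers
    Rank("llama-3.3-70b-versatile", 3),                  # keep the top 3 by judged quality
    Aggregate("llama-3.3-70b-versatile"),                # merge into one final answer
]
```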

## Configuration

Every LLM step accepts `temp` and `max_tokens`:

```python
Propose(["gpt-5-nano-2025-08-07", "claude-sonnet-4-5"], temp=0.9, max_tokens=4096)
```

Override the synthesis prompt:

```python
Aggregate("gpt-5-nano-2025-08-07", prompt="Pick the single best response and return it verbatim.")
```

## Client examples

Your client is an async function with this signature:

```python
async def client(model, messages, temp, max_tokens) -> tuple[str, int, int]:
    """Return (response_text, input_tokens, output_tokens)."""
    ...
```
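
Any coroutine with that shape works, which makes offline testing easy. A purely illustrative stub (assuming OpenAI-style message dicts with a `"content"` key):

```python
# Illustrative stub: satisfies the client contract without any API calls
async def stub_client(model, messages, temp, max_tokens):
    text = f"[{model}] {messages[-1]['content'][:60]}"
    return text, 0, 0  # (response_text, input_tokens, output_tokens)
```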

### OpenAI SDK (OpenAI + Anthropic models)

```python
import os

from openai import AsyncOpenAI

openai_client = AsyncOpenAI()
anthropic_client = AsyncOpenAI(
    base_url="https://api.anthropic.com/v1/",
    api_key=os.environ["ANTHROPIC_API_KEY"],
)

async def multi_provider_client(model, messages, temp, max_tokens):
    client = anthropic_client if model.startswith("claude") else openai_client
    # GPT-5: max_completion_tokens, no temperature, minimal reasoning
    is_gpt5 = model.startswith("gpt-5")
    params = {"model": model, "messages": messages}
    if is_gpt5:
        params.update({"max_completion_tokens": max_tokens, "reasoning_effort": "minimal"})
    else:
        params.update({"max_tokens": max_tokens, "temperature": temp})
    resp = await client.chat.completions.create(**params)
    return resp.choices[0].message.content, resp.usage.prompt_tokens, resp.usage.completion_tokens

# Mix providers in one pipeline
pipeline = [
    Propose(["gpt-5-nano-2025-08-07", "claude-sonnet-4-5", "gpt-5-nano-2025-08-07"]),
    Aggregate("claude-sonnet-4-5"),
]
```

### OpenRouter (access all models via one API)

```python
import os

from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

async def openrouter_client(model, messages, temp, max_tokens):
    resp = await client.chat.completions.create(
        model=model, messages=messages, temperature=temp, max_tokens=max_tokens
    )
    return resp.choices[0].message.content, resp.usage.prompt_tokens, resp.usage.completion_tokens

# Together MoA models via OpenRouter
PROPOSERS = [
    "qwen/qwen-2.5-72b-instruct",
    "meta-llama/llama-3.3-70b-instruct",
    "mistralai/mixtral-8x22b-instruct",
]

together_moa_openrouter = [
    Propose(PROPOSERS, temp=0.7, max_tokens=512),
    Synthesize(PROPOSERS, temp=0.7, max_tokens=512),
    Aggregate("qwen/qwen-2.5-72b-instruct"),
]
```

### Groq via LiteLLM (free tier)

Groq offers free-tier access to several models, making it great for experimentation.

```python
from litellm import acompletion

async def groq_client(model, messages, temp, max_tokens):
    resp = await acompletion(
        model=f"groq/{model}", messages=messages, temperature=temp, max_tokens=max_tokens
    )
    return resp.choices[0].message.content, resp.usage.prompt_tokens, resp.usage.completion_tokens

# Free Groq models (check console.groq.com/docs/rate-limits for current list)
GROQ_FREE = [
    "llama-3.3-70b-versatile",
    "llama-3.1-8b-instant",
    "qwen/qwen3-32b",
    "meta-llama/llama-4-scout-17b-16e-instruct",
]

free_moa = [
    Propose(GROQ_FREE, temp=0.7, max_tokens=512),
    Aggregate("llama-3.3-70b-versatile"),
]

# Self-MoA with Groq (single model, multiple samples)
free_self_moa = [
    Propose(["llama-3.3-70b-versatile"] * 4, temp=0.7),
    Aggregate("llama-3.3-70b-versatile"),
]
```

## Examples

The [`examples/`](examples/) directory contains tested, runnable scripts for different providers. See [`examples/EXAMPLES.md`](examples/EXAMPLES.md) for detailed documentation.

| Example | Provider | What You'll Learn |
|---------|----------|-------------------|
| [`openai_basic.py`](examples/openai_basic.py) | OpenAI | Basic MoA pattern (Propose → Aggregate), client setup, token tracking |
| [`openai_self_moa.py`](examples/openai_self_moa.py) | OpenAI | Self-MoA technique—one model sampled six times can beat diverse mixtures |
| [`multi_provider.py`](examples/multi_provider.py) | OpenAI + Anthropic | Provider routing, Shuffle step to prevent position bias |
| [`openrouter_moa.py`](examples/openrouter_moa.py) | OpenRouter | 3-layer MoA (Propose → Synthesize → Aggregate), paper configuration |
| [`groq_free.py`](examples/groq_free.py) | Groq | Free experimentation, LiteLLM integration, Dropout for robustness |
| [`with_history.py`](examples/with_history.py) | Groq | Pipeline debugging, Rank step, execution history inspection |

```bash
# Install and run
pip install -e ".[examples]"
export OPENAI_API_KEY=sk-...
python examples/openai_basic.py

# Or try free with Groq
export GROQ_API_KEY=gsk_...
python examples/groq_free.py
```

## Key findings from the research

- **Aggregator quality matters about twice as much as proposer quality** — invest in your final model
- **3 layers is the sweet spot** — diminishing returns beyond this
- **Diversity vs quality tradeoff** — Self-MoA shows a single great model can beat diverse mediocre ones
- **6 proposers optimal** — gains diminish after this point
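
Taken together, these findings favor spending your budget on the aggregator. A hypothetical example:

```python
# Hypothetical budget split: six cheap proposer samples, one strong aggregator
budget_moa = [
    Propose(["llama-3.1-8b-instant"] * 6, temp=0.7, max_tokens=512),
    Aggregate("claude-sonnet-4-5"),  # put the strongest model where it matters most
]
```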

## References

- Wang et al. "Mixture-of-Agents Enhances Large Language Model Capabilities" (2024) — [arXiv:2406.04692](https://arxiv.org/abs/2406.04692)
- Li et al. "Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial?" (2025) — [arXiv:2502.00674](https://arxiv.org/abs/2502.00674)

## License

MIT
