Metadata-Version: 2.4
Name: mixpilot
Version: 0.1.2
Summary: An agentic harness for marketing measurement: adstock, saturation, attribution, and budget allocation as typed agent tools with an eval suite.
Author: Mohit Luthra
License: MIT
Project-URL: Homepage, https://github.com/mohit-luthra/mixpilot
Keywords: marketing-mix-modeling,mmm,attribution,agents,llm,adstock,saturation
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: numpy>=1.24
Requires-Dist: scipy>=1.10
Requires-Dist: pydantic>=2.0
Provides-Extra: agent
Requires-Dist: anthropic>=0.39; extra == "agent"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"

# MixPilot

**An agentic harness for marketing measurement.**

Everyone is building general-purpose agents. MixPilot is the opposite bet: a small,
real agent runtime whose entire action space is the marketing-science toolkit —
adstock, saturation, multi-touch attribution, budget optimisation, and a
marketing-aware data-quality audit — wired as typed tools with structured results,
an approval gate, and an eval suite that scores the agent's *method-selection
judgment*.

The thesis: in a domain agent, the tools are the product. A model can sound
confident about marketing data; the hard part is knowing which method the data can
actually support — and refusing the ones it can't. That judgment is what MixPilot
encodes and tests.

```bash
pip install mixpilot            # tools only
pip install "mixpilot[groq]"    # + the agent loop on Groq (open models)
pip install "mixpilot[agent]"   # + the agent loop on Anthropic
```

The agent loop is provider-agnostic: the LLM client is injected, so the same loop
runs on Groq-hosted Llama or on Claude by swapping one line.

## What's inside

| Component | What it does | The lesson |
|---|---|---|
| `audit_data_quality` | nulls, negative spend, zero-variance, **multicollinearity** | The audit decides whether any later number is trustworthy at all |
| `adstock_transform` | geometric carryover with half-life | Carryover before saturation, always |
| `fit_saturation_curve` | Hill curve (β, α, γ) + R² | A low-R² curve is a warning, not a result |
| `run_attribution_model` | last-touch / linear / **Markov removal-effect** | Match the model to the data shape, never overreach |
| `allocate_budget` | constrained optimisation over saturation curves | Move money toward unsaturated marginal return |
| `ToolResult` contract | summary + next_actions + recovery_hint on every call | A tool result is the next observation, not a log line |
| Approval gate | mutating actions clear a policy gate outside the prompt | Safety lives in code, not prose |
| Agent loop | turn budget + loop detection + stop conditions | The loop is a control system, not a while-tool-calls toy |
| Eval suite | data shape → required/forbidden method | Test the judgment, not the prose |
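
The "carryover before saturation" ordering can be sketched in a few lines of NumPy. This is a minimal illustration, not MixPilot's implementation: the function names, the unnormalised adstock form, and the reading of (β, α, γ) as ceiling, steepness, and half-saturation point are all assumptions, and the packaged tools may normalise or parameterise differently.

```python
import numpy as np

def geometric_adstock(spend, half_life):
    """Hypothetical sketch: carry a decaying share of each period's spend forward."""
    decay = 0.5 ** (1.0 / half_life)      # fraction of the effect left after one period
    out = np.zeros(len(spend), dtype=float)
    carry = 0.0
    for t, x in enumerate(spend):
        carry = x + decay * carry         # this period's spend plus decayed history
        out[t] = carry
    return out

def hill(x, beta, alpha, gamma):
    """Hypothetical sketch of a Hill curve (assumed parameter roles):
    beta = ceiling, alpha = steepness, gamma = half-saturation point."""
    x = np.asarray(x, dtype=float)
    return beta * x**alpha / (gamma**alpha + x**alpha)

# One burst of spend, then nothing: the effect decays with a two-period half-life.
spend = np.array([100.0, 0.0, 0.0, 0.0])
carried = geometric_adstock(spend, half_life=2.0)   # carried[2] is 50.0
response = hill(carried, beta=1000.0, alpha=1.0, gamma=100.0)
```

Saturation is applied to the carried-over series rather than to raw spend, so a quiet week that still benefits from earlier spend is placed at the right point on the curve.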

## The judgment, made concrete

Same data, two attribution models:

```python
from mixpilot.tools.attribution import run_attribution_model

paths  = [{"path": ["Social", "Search"], "converted": True}] * 40
paths += [{"path": ["Social"],           "converted": False}] * 20

run_attribution_model(paths, "last_touch").artifacts["share"]
# {'Search': 1.0, 'Social': 0.0}   <- the assist is thrown away
run_attribution_model(paths, "markov").artifacts["share"]
# {'Search': 0.5, 'Social': 0.5}   <- both channels are necessary
```

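Last-touch hands Social nothing. Markov's removal effect asks how many conversions
survive when a channel is deleted from the path graph: here every conversion runs
Social then Search, so removing either one drops conversions to zero and the credit
splits evenly. That gap, the assist value, is the entire reason multi-touch
attribution exists.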
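
For readers who want to see where the 0.5 / 0.5 split comes from, here is a rough first-order Markov sketch of the removal effect on the same paths. It is an illustration under assumptions, not the code behind `run_attribution_model`: the helper names and the `(start)` / `(conversion)` / `(null)` state labels are hypothetical, and it does not handle degenerate graphs (for example, channels that only ever loop back to each other).

```python
from collections import defaultdict
import numpy as np

START, CONV, NULL = "(start)", "(conversion)", "(null)"

def transition_counts(paths):
    """Count first-order transitions, including start and terminal states."""
    counts = defaultdict(lambda: defaultdict(int))
    for p in paths:
        states = [START, *p["path"], CONV if p["converted"] else NULL]
        for a, b in zip(states, states[1:]):
            counts[a][b] += 1
    return counts

def conversion_probability(counts, removed=None):
    """Probability of reaching CONV from START, optionally with one channel removed."""
    transient = [s for s in counts if s != removed]
    index = {s: i for i, s in enumerate(transient)}
    n = len(transient)
    Q = np.zeros((n, n))   # transient -> transient probabilities
    r = np.zeros(n)        # transient -> CONV probabilities
    for a in transient:
        total = sum(counts[a].values())
        for b, c in counts[a].items():
            if b == CONV:
                r[index[a]] += c / total
            elif b in index:
                Q[index[a], index[b]] += c / total
            # Transitions into NULL or into the removed channel are lost.
    x = np.linalg.solve(np.eye(n) - Q, r)   # absorption probabilities
    return x[index[START]]

def removal_effect_shares(paths):
    counts = transition_counts(paths)
    base = conversion_probability(counts)
    channels = [s for s in counts if s != START]
    effects = {c: 1.0 - conversion_probability(counts, removed=c) / base
               for c in channels} if base > 0 else {c: 0.0 for c in channels}
    total = sum(effects.values())
    return {c: (e / total if total > 0 else 0.0) for c, e in effects.items()}

paths  = [{"path": ["Social", "Search"], "converted": True}] * 40
paths += [{"path": ["Social"],           "converted": False}] * 20
shares = removal_effect_shares(paths)   # Social and Search each get 0.5
```

Both channels have a removal effect of 1.0 on this data, because no path converts without passing through both, and normalising the effects gives the even split.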

## Running the agent

```python
from mixpilot import Agent
from mixpilot.agent.groq_client import GroqClient

agent = Agent(GroqClient(model="llama-3.3-70b-versatile"))  # needs GROQ_API_KEY
result = agent.run(
    "Here are 12 weeks of spend and sales for one channel ... "
    "what's the saturation point and should we spend more?"
)
print(result.final_text)
```

Swap `GroqClient` for `AnthropicClient` (needs `ANTHROPIC_API_KEY`) to run the same
loop on Claude. The client is injected, so production uses a real model and the evals
use a scripted client — no network needed to test the harness.
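
The idea of an injected, scripted client can be sketched as follows. Everything here is hypothetical and far simpler than a real harness: `Step`, `ScriptedClient`, `next_step`, and `run_loop` are made-up names for illustration and are not MixPilot's actual interfaces. The sketch only shows how a turn budget, loop detection, and a stop condition can be exercised with no network.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Step:
    """Hypothetical model output: either a tool call or a final text answer."""
    tool: Optional[str] = None
    args: dict = field(default_factory=dict)
    text: str = ""

class ScriptedClient:
    """Replays a fixed list of steps instead of calling a model."""
    def __init__(self, steps):
        self._steps = iter(steps)

    def next_step(self, history):
        return next(self._steps)

def run_loop(client, tools, max_turns=8):
    history, seen = [], set()
    for _ in range(max_turns):                       # turn budget
        step = client.next_step(history)
        if step.tool is None:                        # stop condition: final answer
            return step.text, history
        key = (step.tool, repr(sorted(step.args.items())))
        if key in seen:                              # loop detection: same call twice
            return "stopped: repeated tool call", history
        seen.add(key)
        history.append((step.tool, tools[step.tool](**step.args)))
    return "stopped: turn budget exhausted", history

# A stand-in tool and a two-step script: one tool call, then a final answer.
tools = {"audit_data_quality": lambda weeks: f"audited {weeks} weeks"}
script = [
    Step(tool="audit_data_quality", args={"weeks": 12}),
    Step(text="Data looks usable."),
]
final_text, history = run_loop(ScriptedClient(script), tools)
```

Because the loop only depends on the client's method, swapping the scripted client for a real provider client changes one constructor call and nothing else.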

## Evals: scoring method selection

```bash
python evals/run_evals.py          # offline, deterministic
python evals/run_evals.py --live   # against a real model
```

Each case pairs a data shape with the methods it does and does not support. A case
passes only if the agent uses every required tool and avoids every forbidden one —
e.g. it must **not** run Markov attribution on single-touch data, and must audit
before claiming per-channel effects on collinear spend.
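
The pass rule described above reduces to two set checks. The sketch below is an assumption about shape only: `EvalCase`, `passes`, and the `"run_attribution_model:markov"` label format are hypothetical, the case contents are illustrative, and the real cases in `evals/` may record more, such as call order.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalCase:
    """Hypothetical eval case: a data shape plus required and forbidden tools."""
    name: str
    required: frozenset
    forbidden: frozenset

def passes(case, tools_used):
    """Pass only if every required tool was used and no forbidden tool was."""
    used = set(tools_used)
    return case.required <= used and not (case.forbidden & used)

single_touch = EvalCase(
    name="single-touch paths",
    required=frozenset(),
    forbidden=frozenset({"run_attribution_model:markov"}),
)
collinear_spend = EvalCase(
    name="collinear channel spend",
    required=frozenset({"audit_data_quality"}),
    forbidden=frozenset(),
)

ok = passes(single_touch, ["run_attribution_model:last_touch"])
overreach = passes(single_touch, ["run_attribution_model:markov"])
skipped_audit = passes(collinear_spend, ["fit_saturation_curve"])
```

A set check like this cannot express "audit before claiming effects"; enforcing that ordering would need the sequence of calls, not just the set.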

## Tests

```bash
python -m pytest -q     # 14 passing — domain math, harness contracts, provider translation
```

The tests never call a model. They prove the adstock conserves mass, the Hill fit
recovers known parameters, the Markov removal effect credits assists, the optimiser
moves budget toward unsaturated channels, the audit catches multicollinearity, and
the approval gate blocks mutating actions. That is harness reliability, and it has
nothing to do with model intelligence.
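
As one example of the kind of deterministic check such a test can pin down, here is a rough multicollinearity screen based on pairwise correlation. It is a sketch under assumptions: `collinear_pairs` and the 0.9 threshold are hypothetical, and `audit_data_quality` may well use a different statistic.

```python
import numpy as np

def collinear_pairs(spend_by_channel, threshold=0.9):
    """Hypothetical check: flag channel pairs whose spend moves almost in lockstep."""
    names = list(spend_by_channel)
    matrix = np.array([spend_by_channel[n] for n in names], dtype=float)
    corr = np.corrcoef(matrix)            # rows are channels, columns are weeks
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                flagged.append((names[i], names[j], float(corr[i, j])))
    return flagged

spend = {
    "Search": [10, 20, 30, 40, 50],
    "Social": [21, 39, 61, 80, 99],   # scaled up with Search week by week
    "TV":     [50, 10, 40, 20, 30],
}
flagged = collinear_pairs(spend)      # only the Search/Social pair is flagged
```

When two channels are flagged like this, a per-channel effect cannot be separated from the data alone, which is why the audit has to run before any curve fit or allocation is trusted.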

## Scope

This is deliberately small. No context compaction, no MCP, no subagents — those are
solved problems in general harnesses. The point here is the opposite: how far you get
when the action space is a real domain and the tools enforce the judgment.

MIT licensed.
