Metadata-Version: 2.4
Name: codeprobe
Version: 0.1.0a1
Summary: Benchmark AI coding agents against your own codebase. Mine real tasks from repo history, run agents, interpret results.
Author: codeprobe contributors
License-Expression: Apache-2.0
Project-URL: Homepage, https://github.com/sjarmak/codeprobe
Project-URL: Repository, https://github.com/sjarmak/codeprobe
Project-URL: Issues, https://github.com/sjarmak/codeprobe/issues
Keywords: ai,benchmark,eval,coding-agent,mcp
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: click<9,>=8.0
Provides-Extra: yaml
Requires-Dist: pyyaml<7,>=6.0; extra == "yaml"
Provides-Extra: codex
Requires-Dist: openai>=1.66; extra == "codex"
Provides-Extra: tokens
Requires-Dist: tiktoken<1,>=0.7; extra == "tokens"
Provides-Extra: stats
Requires-Dist: scipy<2,>=1.11; extra == "stats"
Provides-Extra: all
Requires-Dist: pyyaml<7,>=6.0; extra == "all"
Requires-Dist: openai>=1.66; extra == "all"
Requires-Dist: tiktoken<1,>=0.7; extra == "all"
Requires-Dist: scipy<2,>=1.11; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest<9,>=8.0; extra == "dev"
Requires-Dist: pytest-cov<6,>=5.0; extra == "dev"
Requires-Dist: ruff<1,>=0.4; extra == "dev"
Requires-Dist: mypy<2,>=1.10; extra == "dev"
Requires-Dist: types-PyYAML<7,>=6.0; extra == "dev"
Requires-Dist: scipy<2,>=1.11; extra == "dev"
Dynamic: license-file

# codeprobe

Benchmark AI coding agents against **your own codebase**.

Mine real tasks from your repo history, run agents against them, and find out which setup works best for your code rather than for someone else's benchmark suite.

## Why codeprobe?

Existing benchmarks such as SWE-bench and HumanEval use fixed, public task sets that AI models may have memorized from training data. codeprobe mines tasks from **your private repo history**, so the resulting benchmark is much harder to contaminate.

## Quick Start

```bash
pip install codeprobe            # Core (mine + run + interpret)
pip install "codeprobe[stats]"   # + statistical tests (scipy)
pip install "codeprobe[tokens]"  # + exact Copilot token counting (tiktoken)
pip install "codeprobe[all]"     # Everything (quotes keep zsh from globbing the brackets)

cd /path/to/your/repo

codeprobe init          # What do you want to learn?
codeprobe mine .        # Extract tasks from repo history
codeprobe run .         # Run agents against tasks
codeprobe interpret .   # Get recommendations
```

## Commands

| Command               | Purpose                                     |
| --------------------- | ------------------------------------------- |
| `codeprobe init`      | Interactive wizard — choose what to compare |
| `codeprobe mine`      | Mine eval tasks from merged PRs/MRs         |
| `codeprobe run`       | Execute tasks against AI agents             |
| `codeprobe interpret` | Analyze results, rank configurations        |
| `codeprobe assess`    | Score a codebase's benchmarking potential   |
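
To give a feel for what mining merged work from history can involve, here is a minimal sketch that lists merge commits with plain `git log`. It is an illustration only, not codeprobe's mining logic: the `MergeCandidate` type and both function names are hypothetical, and the real `codeprobe mine` command may use host APIs and different selection criteria.

```python
import subprocess
from dataclasses import dataclass


@dataclass(frozen=True)
class MergeCandidate:
    """A merge commit that could become an eval task (hypothetical type)."""

    sha: str
    subject: str


def parse_merge_log(log_text: str) -> list[MergeCandidate]:
    """Parse lines of the form '<sha><TAB><subject>' into candidates."""
    candidates = []
    for line in log_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        sha, _, subject = line.partition("\t")
        candidates.append(MergeCandidate(sha=sha, subject=subject))
    return candidates


def list_merge_commits(repo_path: str = ".", limit: int = 20) -> list[MergeCandidate]:
    """Ask git for the most recent merge commits in repo_path."""
    result = subprocess.run(
        [
            "git", "-C", repo_path, "log", "--merges",
            f"--max-count={limit}",
            "--pretty=format:%H%x09%s",  # full hash, a tab, then the subject
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_merge_log(result.stdout)
```

Squash-merged or rebase-merged PRs leave no merge commit, so a purely local approach like this one would miss them.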

## Supported Agents

- **Claude Code** (`--agent claude`)
- **GitHub Copilot** (`--agent copilot`)
- Custom agents via the `AgentAdapter` protocol
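
This README does not show the methods of the `AgentAdapter` protocol, so the sketch below only illustrates how a structural protocol of this kind could look in Python. `SketchAgentAdapter`, `AgentResult`, `EchoAgent`, and the `run` signature are all hypothetical names chosen for the example; check the codeprobe source for the real interface before writing an adapter.

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class AgentResult:
    """Hypothetical result type; codeprobe's real one may differ."""

    output: str
    exit_code: int


@runtime_checkable
class SketchAgentAdapter(Protocol):
    """Hypothetical stand-in, NOT codeprobe's actual AgentAdapter."""

    def run(self, prompt: str, workdir: str) -> AgentResult:
        """Run the agent on one task prompt inside workdir."""
        ...


class EchoAgent:
    """Toy adapter that returns the prompt unchanged."""

    def run(self, prompt: str, workdir: str) -> AgentResult:
        return AgentResult(output=prompt, exit_code=0)


agent = EchoAgent()
# Structural typing: EchoAgent never inherits from the protocol.
assert isinstance(agent, SketchAgentAdapter)
```

With a protocol, a custom agent only needs to provide the right methods; it does not have to subclass anything from the library.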

## Supported Git Hosts

GitHub, GitLab, Bitbucket, Azure DevOps, Gitea/Forgejo, and local repos.

## Configuration

Create a `.evalrc.yaml` in your repo root:

```yaml
name: my-experiment
agents: [claude, copilot]
models: [claude-sonnet-4-6, claude-opus-4-6]
tasks_dir: .codeprobe/tasks
```
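
As a rough way to reason about how many configurations a file like this implies, the sketch below crosses every agent with every model. This assumes codeprobe pairs each agent with each listed model, which this README does not confirm; the `configurations` helper is hypothetical and the dict simply mirrors the YAML above.

```python
from itertools import product

# Mirrors the example .evalrc.yaml above as a plain dict.
config = {
    "name": "my-experiment",
    "agents": ["claude", "copilot"],
    "models": ["claude-sonnet-4-6", "claude-opus-4-6"],
    "tasks_dir": ".codeprobe/tasks",
}


def configurations(cfg: dict) -> list[tuple[str, str]]:
    """Return every (agent, model) pair (assumed full cross product)."""
    return list(product(cfg["agents"], cfg["models"]))


for agent, model in configurations(config):
    print(f"{agent} + {model}")
```

Under that assumption, two agents and two models give four configurations, so run time and cost scale with the product of the two lists.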

## License

Apache-2.0
