Metadata-Version: 2.4
Name: issuebenchkit
Version: 0.1.0
Summary: Turn real GitHub issues into small, reproducible coding-agent benchmark tasks.
Project-URL: Homepage, https://github.com/he-yufeng/IssueBenchKit
Project-URL: Issues, https://github.com/he-yufeng/IssueBenchKit/issues
Author: Yufeng He
License: MIT
License-File: LICENSE
Keywords: agent,benchmark,coding-agent,github-issues,swe-bench
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.10
Requires-Dist: click>=8.1
Requires-Dist: rich>=13.0
Provides-Extra: dev
Requires-Dist: build>=1.2; extra == 'dev'
Requires-Dist: pytest>=7.4; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Requires-Dist: twine>=5.0; extra == 'dev'
Description-Content-Type: text/markdown

<p align="right"><a href="README_CN.md">中文文档</a></p>

# IssueBenchKit

Turn a real GitHub issue, pull request, or local bug into a small coding-agent benchmark task.

SWE-bench is great when you want a public leaderboard. Most teams need something smaller: a
repeatable task built from the bugs they actually care about, with a clear test command and a
report that says whether a candidate patch really fixed it.

IssueBenchKit is that local builder. It does not try to invent tests for you. It packages the
issue context, base commit, reproduction command, and scoring result so you can evaluate coding
agents on your own repositories.

## Quick Start

```bash
pip install issuebenchkit
```

Create a benchmark task:

```bash
issuebench init tasks/qwen-copy \
  --repo ./qwen-code \
  --issue https://github.com/QwenLM/qwen-code/issues/4716 \
  --base 8b4f3b2 \
  --test "npm test -- copyCommand.test.ts"
```

Run the task against a candidate checkout:

```bash
issuebench run tasks/qwen-copy --repo ./candidate-qwen-code --out after.json
```

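Compare before and after, where `before.json` is a result file saved by the same `issuebench run` command against the unfixed checkout: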

```bash
issuebench score tasks/qwen-copy --before before.json --after after.json
```

Export a report:

```bash
issuebench export tasks/qwen-copy --format html --out report.html
```
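
The before/after comparison in the `score` step can be pictured with a minimal sketch. It is not IssueBenchKit's actual implementation: the `passed` field and the four outcome labels are assumptions made for illustration, and the real result files and report wording may differ.

```python
def score(before: dict, after: dict) -> str:
    """Classify a pair of run results by their pass/fail verdicts.

    Both dicts are assumed to carry a boolean "passed" key (hypothetical name).
    """
    was_passing = bool(before["passed"])
    now_passing = bool(after["passed"])
    if not was_passing and now_passing:
        return "fixed"            # the candidate patch made the failing command pass
    if was_passing and not now_passing:
        return "regressed"        # the command passed before and fails now
    if was_passing and now_passing:
        return "already-passing"  # the command never reproduced the bug
    return "still-failing"        # the patch did not change the outcome


if __name__ == "__main__":
    before = {"passed": False}  # result from the unfixed checkout
    after = {"passed": True}    # result from the candidate checkout
    print(score(before, after))
```

The "already-passing" case is worth surfacing on its own: if the command passes on the base commit, the task does not reproduce the bug and a later pass says little about the patch.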

## What It Stores

Each task directory contains one `issuebench.json` manifest:

- source repo path and optional GitHub issue URL
- base commit or version marker
- reproduction / validation command
- expected signal, notes, and tags
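
As a rough illustration of the list above, a manifest could be written and read back like this. Every key name here (`repo`, `issue_url`, `base`, `test_command`, `expected_signal`, `notes`, `tags`) and the sample values for the last three are assumptions; the real `issuebench.json` schema may use different names.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical field names and values; check a generated issuebench.json for the real schema.
manifest = {
    "repo": "./qwen-code",
    "issue_url": "https://github.com/QwenLM/qwen-code/issues/4716",
    "base": "8b4f3b2",
    "test_command": "npm test -- copyCommand.test.ts",
    "expected_signal": "exit code 0",
    "notes": "",
    "tags": [],
}


def write_manifest(task_dir: Path, data: dict) -> Path:
    """Write the manifest as pretty-printed JSON inside the task directory."""
    task_dir.mkdir(parents=True, exist_ok=True)
    path = task_dir / "issuebench.json"
    path.write_text(json.dumps(data, indent=2) + "\n", encoding="utf-8")
    return path


def load_manifest(task_dir: Path) -> dict:
    """Read the manifest back from the task directory."""
    return json.loads((task_dir / "issuebench.json").read_text(encoding="utf-8"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        task_dir = Path(tmp) / "tasks" / "qwen-copy"
        write_manifest(task_dir, manifest)
        print(load_manifest(task_dir)["test_command"])
```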

Run results are plain JSON files with exit code, duration, command, stdout tail, stderr tail, and
the pass/fail verdict. They are easy to archive, diff, or attach to a PR.
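
A result record of this shape can be sketched in a few lines of Python. This is not the tool's own code: the key names, the 20-line tail length, and the rule that exit code 0 means "pass" are all assumptions made for the example.

```python
import json
import subprocess
import sys
import time


def tail(text: str, max_lines: int = 20) -> str:
    """Keep only the last max_lines lines of captured output."""
    return "\n".join(text.splitlines()[-max_lines:])


def run_command(command: str, cwd: str = ".") -> dict:
    """Run a shell command and summarize it as a JSON-serializable record."""
    start = time.monotonic()
    proc = subprocess.run(command, shell=True, cwd=cwd, capture_output=True, text=True)
    duration = time.monotonic() - start
    return {
        "command": command,
        "exit_code": proc.returncode,
        "duration_seconds": round(duration, 3),
        "stdout_tail": tail(proc.stdout),
        "stderr_tail": tail(proc.stderr),
        "passed": proc.returncode == 0,  # assumption: exit code 0 counts as a pass
    }


if __name__ == "__main__":
    # A stand-in command; a real task would use its own test command.
    result = run_command(f'"{sys.executable}" -c "print(42)"')
    print(json.dumps(result, indent=2))
```

Keeping only the tail of stdout and stderr keeps each record small enough to diff or paste into a PR while still showing the final test summary.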

## Why Not Just Use SWE-bench?

Use SWE-bench for public comparison. Use IssueBenchKit when you need:

- a benchmark task for a private or small repo
- a tiny task that can run in CI
- a before/after report for one real bug
- a dataset of issues that reflects your own engineering workflow

## Current Scope

The first version is intentionally small:

- generic shell test commands
- JSON manifest files
- before/after scoring
- JSONL and single-file HTML export

It does not generate tests automatically, mutate repositories, or claim that one command can
evaluate every language ecosystem.

## License

MIT
