Metadata-Version: 2.4
Name: skilleval
Version: 0.1.0
Summary: Find the cheapest LLM that gets your task 100% right
Project-URL: Homepage, https://github.com/chan-kinghin/skills-eval
Project-URL: Repository, https://github.com/chan-kinghin/skills-eval
Project-URL: Documentation, https://github.com/chan-kinghin/skills-eval/blob/main/docs/USER_MANUAL.md
Project-URL: Issues, https://github.com/chan-kinghin/skills-eval/issues
Author: King Hin Chan
License-Expression: MIT
License-File: LICENSE
Keywords: ai,benchmark,cost-optimization,evaluation,llm,nlp
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Testing
Classifier: Typing :: Typed
Requires-Python: >=3.11
Requires-Dist: aiohttp>=3.9
Requires-Dist: click>=8.0
Requires-Dist: pydantic>=2.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.0
Provides-Extra: dev
Requires-Dist: aioresponses>=0.7; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
Requires-Dist: pytest-cov>=5.0; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.3; extra == 'dev'
Provides-Extra: docs
Requires-Dist: openpyxl>=3.1; extra == 'docs'
Requires-Dist: pdfplumber>=0.10; extra == 'docs'
Requires-Dist: python-docx>=1.0; extra == 'docs'
Description-Content-Type: text/markdown

# SkillEval

[English](README.md) | [中文](README_ZH.md)

[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
[![Python 3.11+](https://img.shields.io/badge/python-3.11%2B-blue.svg)](https://www.python.org/downloads/)

**Find the cheapest LLM that gets your task 100% right.**

SkillEval is a CLI tool that automates LLM evaluation for deterministic tasks. It runs your task across multiple models in parallel, compares outputs against expected results, and recommends the most cost-effective option.

## Quick Start

```bash
# Install from a local clone of this repository
pip install -e .

# Set at least one provider API key
export DASHSCOPE_API_KEY="sk-..."   # Qwen (Alibaba DashScope)

# Create a task folder
skilleval init my-task

# Add your input files, expected output, and skill prompt
# Then run the evaluation
skilleval run my-task/

# Machine-readable output
skilleval run my-task/ --json | jq '.recommendation'
```
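
One possible layout of a task folder after `skilleval init my-task` (the individual file and directory names here are illustrative assumptions; the tool's actual scaffold may differ):

```
my-task/
├── config.yaml    # models to evaluate, comparator settings
├── skill.md       # your prompt (the "skill")
├── input/         # input files sent to each model
└── expected/      # expected outputs to compare against
```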

## Supported Providers

| Provider    | Platform                  | Env Variable        |
|-------------|---------------------------|---------------------|
| **Qwen**    | Alibaba Cloud / DashScope | `DASHSCOPE_API_KEY` |
| **GLM**     | Zhipu AI / BigModel       | `ZHIPU_API_KEY`     |
| **MiniMax** | MiniMax                   | `MINIMAX_API_KEY`   |

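For example, to enable all three providers at once (the key values below are placeholders):

```bash
export DASHSCOPE_API_KEY="sk-..."   # Qwen (Alibaba Cloud / DashScope)
export ZHIPU_API_KEY="..."          # GLM (Zhipu AI / BigModel)
export MINIMAX_API_KEY="..."        # MiniMax
```

Only the providers whose keys are set will be evaluated.
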
## Evaluation Modes

- **Mode 1 (`run`)** — You write the prompt (skill), SkillEval tests it across models.
- **Mode 2 (`matrix`)** — One model writes the prompt, another executes it. Tests all creator × executor combinations.
- **Mode 3 (`chain`)** — A meta-skill guides prompt creation, then another model executes it. Full pipeline evaluation.
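
As a sketch, the three modes might be invoked like this (argument shapes beyond `run my-task/` are assumptions inferred from the command names above):

```bash
# Mode 1: evaluate your own skill prompt across the model catalog
skilleval run my-task/

# Mode 2: test every creator × executor pairing (argument shape is an assumption)
skilleval matrix my-task/

# Mode 3: meta-skill pipeline; --yes skips the confirmation prompt
skilleval chain my-task/ --yes
```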

## Additional Features

- **Ad-hoc endpoints** — Use any OpenAI-compatible API without editing the catalog: `--endpoint`, `--api-key`, `--model-name`.
- **Skill linting (`lint`)** — Validate Claude Code skill structure (frontmatter, phases, references, code blocks).
- **Skill testing (`skill-test`)** — Test a skill's core prompt logic against expected outputs.
- **Run comparison (`compare`)** — Diff two runs to detect improvements or regressions.
- **HTML reports (`report --html`)** — Generate self-contained HTML reports for sharing.
- **JSON output (`--json`)** — Machine-readable JSON on `run`, `matrix`, `chain`, `catalog`, and `report` commands for piping into other tools.
- **Verbose logging (`-v` / `-vv`)** — `-v` for INFO, `-vv` for DEBUG. Logs go to stderr so they don't interfere with `--json` output.
- **Auto-confirm (`--yes` / `-y`)** — Skip the confirmation prompt on `chain` (replaces the old `--confirm` flag).
- **Config validation** — Warns on unknown keys in `config.yaml` and validates comparator names at load time.
- **Circuit breaker** — Automatically skips a provider after 5 consecutive failures, avoiding wasted time and cost.
- **Ctrl+C handling** — Saves partial results on interrupt so you never lose a half-finished run.
- **Friendly errors** — No raw tracebacks by default; use `-vv` to see full stack traces when debugging.
- **Progress bar** — Shows elapsed time and ETA alongside the completion percentage.
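
A few of these features combined in one workflow sketch (paths and argument shapes, other than the flags documented above, are assumptions):

```bash
# Machine-readable results, with DEBUG logs kept out of the JSON stream
skilleval run my-task/ --json -vv 2>debug.log | jq '.recommendation'

# Lint the skill, then produce a shareable HTML report (output handling is an assumption)
skilleval lint my-task/
skilleval report my-task/ --html

# Diff two runs to catch regressions (run identifiers are placeholders)
skilleval compare <run-a> <run-b>
```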

## Documentation

See the [User Manual](docs/USER_MANUAL.md) ([中文](docs/USER_MANUAL_ZH.md)) for detailed setup instructions, configuration options, comparator reference, and walkthroughs.

## Development

```bash
pip install -e ".[dev,docs]"
pytest
ruff check src/ tests/
```

See [CONTRIBUTING.md](CONTRIBUTING.md) ([中文](CONTRIBUTING_ZH.md)) for full contributor guidelines.

## License

[MIT](LICENSE)
