Metadata-Version: 2.4
Name: mawlaia-evalforge
Version: 0.1.0
Summary: Eval runner for LLM applications
License: MIT
Author: Mawlaia
Author-email: dev@mawlaia.com
Requires-Python: >=3.10,<4.0
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Provides-Extra: all
Provides-Extra: anthropic
Provides-Extra: openai
Provides-Extra: semantic
Requires-Dist: anthropic (>=0.25,<0.26) ; extra == "anthropic" or extra == "all"
Requires-Dist: click (>=8.0,<9.0)
Requires-Dist: openai (>=1.0,<2.0) ; extra == "openai" or extra == "all"
Requires-Dist: pydantic (>=2.0,<3.0)
Requires-Dist: sentence-transformers (>=3.0,<4.0) ; extra == "semantic" or extra == "all"
Project-URL: Homepage, https://mawlaia.com
Project-URL: Repository, https://github.com/mawlaia/evalforge
Description-Content-Type: text/markdown

# evalforge

> Eval runner for LLM applications. CI-native AI quality.

Your code has tests. Your prompts don't. **evalforge** brings the same quality gates you have for code to your LLM features — run on every PR, catch regressions before they reach production.

```python
from evalforge import EvalDataset, run_eval
from evalforge.scorers import llm_judge, exact_match

dataset = EvalDataset.from_json("evals/rag_quality.json")

report = run_eval(
    dataset=dataset,
    model="claude-3-5-sonnet",
    scorers=[llm_judge(rubric="Is the answer accurate and grounded?"), exact_match()],
)

report.assert_pass(threshold=0.85)  # fails CI if score drops below 85%
```

## Status

🚧 **Early development.** Star to follow progress.

## What it does

- **Eval runner** — LLM-as-judge, exact match, ROUGE, semantic similarity, custom Python scorers
- **CI integration** — GitHub Actions native, fails the build if quality drops
- **Dataset versioning** — store, version, and sample eval sets
- **Comparison mode** — A/B test prompts and models against a baseline
- **Shadow traffic** — route production traffic to a new model and compare live
- **Vertical eval packs** — pre-built datasets and scorers for RAG, customer support, code generation, legal, medical

## Roadmap

- [ ] Python SDK
- [ ] CLI
- [ ] GitHub Actions action
- [ ] Hosted control plane ([mawlaia.com](https://mawlaia.com))
- [ ] Vertical eval packs

## License

MIT

