Metadata-Version: 2.4
Name: evalguide
Version: 0.1.0
Summary: CLI tool that evaluates LLM outputs from production logs against a dual-dimension rubric.
Project-URL: Homepage, https://github.com/onicarps/eval-harness
Project-URL: Repository, https://github.com/onicarps/eval-harness
Project-URL: Issues, https://github.com/onicarps/eval-harness/issues
Author: evalguide
License: MIT
License-File: LICENSE
Keywords: cli,evaluation,faithfulness,llm,rubric
Classifier: Development Status :: 4 - Beta
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Quality Assurance
Requires-Python: >=3.11
Requires-Dist: httpx>=0.27
Requires-Dist: pydantic>=2.0
Requires-Dist: rich>=13.0
Requires-Dist: tiktoken>=0.7
Requires-Dist: typer>=0.12
Provides-Extra: dev
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
Requires-Dist: pytest-cov>=5; extra == 'dev'
Requires-Dist: pytest-httpx>=0.30; extra == 'dev'
Requires-Dist: pytest>=8; extra == 'dev'
Requires-Dist: ruff>=0.5; extra == 'dev'
Description-Content-Type: text/markdown

# eval-harness

A Python CLI that evaluates LLM outputs from production logs against a
dual-dimension rubric (faithfulness + task completion).

## Install

```bash
pip install -e ".[dev]"
```

## Quickstart

```bash
export OPENRIXER_API_KEY=sk-or-...
eval-harness run path/to/logs.jsonl --judge meta-llama/llama-3.1-8b-instruct:free
```

Input JSONL schema:

```json
{"input": "user prompt", "output": "model response", "reference": "optional ground truth"}
```

## Commands

- `eval-harness run <file>` — ingest, evaluate, and report
- `eval-harness judges` — list free judge models (cached in `~/.eval-harness/judges.json`)
- `eval-harness report --run-id UUID` — show a stored run
- `eval-harness export --run-id UUID --format json|csv --output-file PATH`
- `eval-harness cache [--stats] [--clear]`

Exit codes: `0` all pass, `1` any failures, `2` evaluator error.

## CI/CD example

```yaml
- run: pip install eval-harness
- run: OPENRIXER_API_KEY=${{ secrets.OPENRIXER_API_KEY }} eval-harness run eval/cases.jsonl --pass-threshold 0.7
```

## Development

```bash
pip install -e ".[dev]"
ruff check src tests && ruff format --check src tests
pytest tests/ -v --cov=src
```
