Metadata-Version: 2.4
Name: juryeval
Version: 0.4.0
Summary: Lightweight NLP/LLM evaluation toolkit — metrics, judges, significance testing
Home-page: https://github.com/liodon-ai/juryeval
Author: Liodon AI
Author-email: info@liodon.ai
License: MIT
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy
Requires-Dist: nltk
Provides-Extra: full
Requires-Dist: scikit-learn; extra == "full"
Requires-Dist: sacrebleu; extra == "full"
Requires-Dist: rouge-score; extra == "full"
Requires-Dist: transformers; extra == "full"
Requires-Dist: torch; extra == "full"
Requires-Dist: sentence-transformers; extra == "full"
Provides-Extra: judge
Requires-Dist: openai>=1.0.0; extra == "judge"
Provides-Extra: semantic
Requires-Dist: sentence-transformers; extra == "semantic"
Provides-Extra: lmeval
Requires-Dist: lm-eval>=0.4.0; extra == "lmeval"
Provides-Extra: all
Requires-Dist: scikit-learn; extra == "all"
Requires-Dist: sacrebleu; extra == "all"
Requires-Dist: rouge-score; extra == "all"
Requires-Dist: transformers; extra == "all"
Requires-Dist: torch; extra == "all"
Requires-Dist: sentence-transformers; extra == "all"
Requires-Dist: openai>=1.0.0; extra == "all"
Requires-Dist: lm-eval>=0.4.0; extra == "all"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license
Dynamic: license-file
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# juryeval

Lightweight NLP/LLM evaluation toolkit: metrics, LLM-as-Judge infrastructure, statistical significance testing, and prompt robustness analysis.

Designed for fast smoke tests and demos, and as a shared dependency for evaluation frameworks such as LM Eval Harness, OpenCompass, and Lighteval.

## Install

```bash
pip install juryeval

# Optional feature sets (quoted so shells such as zsh do not expand the brackets):
pip install "juryeval[full]"      # all metrics (scikit-learn, sacrebleu, rouge-score, transformers, etc.)
pip install "juryeval[judge]"     # LLM-as-Judge (openai)
pip install "juryeval[semantic]"  # embedding similarity (sentence-transformers)
pip install "juryeval[lmeval]"    # lm-eval-harness integration
pip install "juryeval[all]"       # everything
```

## Usage

### Metrics

```python
from juryeval import (
    eval_classification, eval_translation, eval_summarization,
    perplexity, flesch_kincaid, bert_score,
)

acc_f1 = eval_classification(preds=["pos", "neg"], refs=["pos", "pos"])
# References must be in the same (target) language as the predictions.
bleu   = eval_translation(preds=["hello world"], refs=["hello, world"])
rouge  = eval_summarization(preds=["summary here"], refs=["reference here"])
ppl    = perplexity("This is a sentence.")
fk     = flesch_kincaid("This is easy to read.")
bs     = bert_score(preds=["answer"], refs=["reference"])
```

### LLM-as-Judge

```python
from juryeval import PairwiseJudge, PointwiseJudge, MultiJudgeEnsemble, JudgeCalibration

judge = PairwiseJudge("gpt-4")
result = judge.compare(
    answer_a="Paris is the capital of France.",
    answer_b="It's Paris.",
    question="What is the capital of France?",
)
# {"winner": "A", "score": 1.0, "reason": "..."}

# Pointwise scoring
scorer = PointwiseJudge("gpt-4")
result = scorer.score("Paris is the capital.", question="What is the capital of France?")
# {"score": 0.9, "reason": "..."}

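# Multi-judge ensemble
answer_a = "Paris is the capital of France."
answer_b = "It's Paris."
question = "What is the capital of France?"
ensemble = MultiJudgeEnsemble([
    PairwiseJudge("gpt-4"),
    PairwiseJudge("claude-3-opus"),
    PairwiseJudge("gemini-pro"),
])
result = ensemble.compare(answer_a, answer_b, question)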
# {"majority_winner": "A", "agreement": 0.67, "vote_distribution": {...}, ...}

# Judge calibration
cal = JudgeCalibration()
report = cal.evaluate(judge)
# {"position_bias": 0.05, "consistency": 0.95, "length_bias": 0.1, ...}
```
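
The ensemble result above reports a majority winner and an agreement rate. As a rough illustration of how such an aggregate can be computed from individual judge verdicts, here is a minimal standard-library sketch. It is not juryeval's implementation: the `majority_vote` helper, its tie handling, and the rounding are assumptions made for this example.

```python
from collections import Counter


def majority_vote(votes):
    """Aggregate per-judge winners (e.g. "A", "B") into one verdict.

    Hypothetical helper for illustration; not part of juryeval.
    """
    if not votes:
        raise ValueError("need at least one vote")
    counts = Counter(votes)
    top_count = max(counts.values())
    # Every label that reached the highest vote count.
    leaders = [label for label, c in counts.items() if c == top_count]
    # Assumption: a split at the top is reported as a tie.
    winner = leaders[0] if len(leaders) == 1 else "tie"
    return {
        "majority_winner": winner,
        # Share of judges that voted for the most common label.
        "agreement": round(top_count / len(votes), 2),
        "vote_distribution": dict(counts),
    }


result = majority_vote(["A", "A", "B"])
print(result)
```

With three judges voting `A`, `A`, `B`, the agreement is 2/3, which rounds to the `0.67` shown in the ensemble output comment.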

### Statistical Significance

```python
from juryeval import bootstrap_ci, compare_models

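# `scores` is a sequence of per-example metric values for one model.
ci = bootstrap_ci(scores, num_resamples=2000)
# {"estimate": 0.72, "lower": 0.68, "upper": 0.76, "std_err": 0.02}

# `model_a_scores` and `model_b_scores` are per-example scores for the two models being compared.
result = compare_models(model_a_scores, model_b_scores)
# {"win_rate": 0.65, "p_value": 0.003, "mean_a": 0.72, "mean_b": 0.68, ...}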
```
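
One common way to obtain an interval like the one above is the percentile bootstrap: resample the scores with replacement, recompute the mean each time, and read the interval off the sorted resampled means. The sketch below uses only the standard library and is not juryeval's implementation; the function name `percentile_bootstrap_ci` and the `alpha` and `seed` parameters are assumptions for illustration.

```python
import random
import statistics


def percentile_bootstrap_ci(scores, num_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean (illustrative, not juryeval's code)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores)
    # Mean of each resample drawn with replacement, sorted for percentile lookup.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(num_resamples)
    )
    # Clamp the percentile indices so small resample counts stay in range.
    lo_idx = max(0, int((alpha / 2) * num_resamples))
    hi_idx = min(num_resamples - 1, int((1 - alpha / 2) * num_resamples))
    return {
        "estimate": statistics.fmean(scores),
        "lower": means[lo_idx],
        "upper": means[hi_idx],
        # Spread of the resampled means approximates the standard error.
        "std_err": statistics.stdev(means),
    }


sample_scores = [0.60, 0.70, 0.80, 0.75, 0.65, 0.90, 0.70, 0.72]  # made-up values
ci = percentile_bootstrap_ci(sample_scores)
print(ci)
```

More resamples give a smoother estimate of the interval endpoints at the cost of runtime.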

### Prompt Robustness

```python
from juryeval import PromptVariance

pv = PromptVariance(model_fn=lambda prompt: "output")
report = pv.analyze("What is 2+2?")
# {"num_variants": 7, "output_length_mean": 5.0, "outputs": [...], ...}
```
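
To make the idea concrete, the sketch below shows one way a robustness check of this kind could work: perturb the prompt's surface form, call the model on each variant, and summarize how the outputs differ. It is a standalone illustration, not juryeval's implementation. The `make_variants` and `analyze_variance` helpers and the specific perturbations are hypothetical; the variants `PromptVariance` actually generates may differ.

```python
import statistics


def make_variants(prompt):
    """Hypothetical surface-level perturbations of a prompt."""
    variants = [
        prompt,                        # original
        prompt.lower(),                # lowercase
        prompt.upper(),                # uppercase
        prompt.rstrip("?.!"),          # trailing punctuation removed
        "  " + prompt + "  ",          # extra whitespace
        "Please answer: " + prompt,    # polite prefix
        prompt + " Answer briefly.",   # instruction suffix
    ]
    # Drop duplicates while keeping order.
    return list(dict.fromkeys(variants))


def analyze_variance(model_fn, prompt):
    """Run model_fn on each variant and summarize the outputs."""
    variants = make_variants(prompt)
    outputs = [model_fn(v) for v in variants]
    return {
        "num_variants": len(variants),
        "unique_outputs": len(set(outputs)),
        "output_length_mean": statistics.fmean(len(o) for o in outputs),
        "outputs": outputs,
    }


# A stub model that ignores the prompt, so every variant yields the same output.
report = analyze_variance(lambda prompt: "4", "What is 2+2?")
print(report["num_variants"], report["unique_outputs"])
```

A robust model should produce few distinct outputs across variants; a large `unique_outputs` count relative to `num_variants` suggests sensitivity to prompt wording.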

### LM Eval Harness Integration

```bash
pip install "juryeval[lmeval]"
python -c "from juryeval.lmeval import register_all; register_all()"

# Then reference the pairwise_judge / pointwise_judge metrics in your task YAML:
# metric_list:
#   - metric: pairwise_judge
#     aggregation: mean
#     higher_is_better: true
```

### Running Tests

```bash
pip install pytest
pytest tests/ -v
```

## License

MIT
