Metadata-Version: 2.4
Name: GamELY
Version: 0.1.0
Summary: A Python package for evaluating LLM-generated responses against human references using state-of-the-art LLMs as judges.
Author-email: Nishant Singh <nishant1695@gmail.com>, Raghav Awasthi <raghavaw@iiitd.ac.in>
License-Expression: MIT
Project-URL: Homepage, https://github.com/nishant1695/GamELY
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Dynamic: license-file

# GamELY - LLM Response Evaluation Framework

A Python package for evaluating LLM-generated responses against human references using state-of-the-art LLMs as judges.

## Installation

```bash
pip install GamELY
```

## Quick Start

```python
import pandas as pd
from GamELY import evaluate_responses

# Prepare your data
df = pd.DataFrame({
    'reference': [
        'The capital of France is Paris',
        'Water boils at 100°C at sea level'
    ],
    'generated': [
        'Paris is the capital city of France',
        'Water boils at 90°C in high altitudes'
    ]
})

# Run evaluation
results = evaluate_responses(
    dataframe=df,
    model_name='gpt-4-turbo',  # or 'claude-3-opus-latest', 'deepseek-chat'
    api_key='your_api_key_here'
)

print(results[['reference', 'generated', 'Is the LLM generated response accurate?']])
```

## Key Features

- **Automatic Provider Detection**: Just specify the model name
- **Batch Processing**: Evaluate hundreds of responses efficiently (see the sketch after this list)
- **Custom Criteria**: Use default or define your own evaluation criteria
- **Multiple LLM Support**: OpenAI, Anthropic, and DeepSeek models
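
For larger runs, here is a minimal batch-processing sketch; the file name `eval_set.csv` is just a placeholder, and the file only needs the required `reference` and `generated` columns:

```python
import pandas as pd
from GamELY import evaluate_responses

# Load a larger evaluation set; only the 'reference' and 'generated' columns are required.
df = pd.read_csv('eval_set.csv')  # placeholder path

results = evaluate_responses(
    dataframe=df,
    model_name='gpt-4o-mini',
    api_key='your_api_key_here'
)

# Persist the per-criterion scores for later analysis.
results.to_csv('eval_results.csv', index=False)
```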

## Required Parameters

### `dataframe`
- **Type**: `pandas.DataFrame`
- **Columns**:
  - `reference`: Human-written reference answers (str)
  - `generated`: LLM-generated responses to evaluate (str)
- **Example**:
  ```python
  pd.DataFrame({
      'reference': ['Reference answer 1', 'Reference answer 2'],
      'generated': ['Generated response 1', 'Generated response 2']
  })
  ```

### `model_name`
Supported models:
- **OpenAI**: `gpt-3.5-turbo`, `gpt-4`, `gpt-4-turbo`, `gpt-4o-mini`, `gpt-4o`, `o1-mini`, `o1`
- **Anthropic**: `claude-2`, `claude-3-haiku-20240307`, `claude-3-sonnet-20240229`, `claude-3-opus-latest`, `claude-3-5-haiku-latest`, `claude-3-5-sonnet-latest`
- **DeepSeek**: `deepseek-chat`, `deepseek-reasoner`
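
As a rough sketch, automatic provider detection means switching judges only requires changing `model_name` and supplying that provider's key (this reuses `df` from the Quick Start; the keys are placeholders):

```python
# The provider is inferred from the model name, so the same call works across providers.
anthropic_results = evaluate_responses(
    dataframe=df,
    model_name='claude-3-5-sonnet-latest',
    api_key='your_anthropic_api_key'
)

deepseek_results = evaluate_responses(
    dataframe=df,
    model_name='deepseek-chat',
    api_key='your_deepseek_api_key'
)
```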

### `api_key`
- Obtain from your LLM provider's console
- **Recommended**: Store in environment variables
  ```python
  import os
  os.environ['OPENAI_API_KEY'] = 'your-key-here'  # For OpenAI/DeepSeek
  os.environ['ANTHROPIC_API_KEY'] = 'your-key-here'
  ```
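
Building on that, a minimal sketch of reading the key back at call time (this assumes the variable is already set and reuses `df` from the Quick Start):

```python
import os
from GamELY import evaluate_responses

results = evaluate_responses(
    dataframe=df,
    model_name='gpt-4o',
    api_key=os.environ['OPENAI_API_KEY']
)
```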

## Advanced Usage

### Custom Evaluation Criteria
```python
custom_criteria = [
    'Does the response use formal language?',
    'Is the response under 100 characters?'
]

results = evaluate_responses(
    dataframe=df,
    model_name='claude-3-5-sonnet-latest',
    api_key='your_key',
    criteria=custom_criteria
)
```

### Default Evaluation Criteria
If you do not pass a custom list of `criteria` to `evaluate_responses`, GamELY falls back to the following set of 17 default criteria, which aim to provide a holistic evaluation of the LLM's output:
```python
DEFAULT_CRITERIA = [
    'Is the LLM generated response accurate?',
    'Is the response correct in comprehension?',
    'Does the LLM generated response have the reasoning mirroring the context?',
    'Is the LLM generated response helpful to the user?',
    'Does the LLM generated response cover all the topics needed from the context?',
    'Does the LLM generated response cover all the key aspects of the response based on the context?',
    'Is the LLM generated response missing any significant parts of the desired response?',
    'Is the LLM generated response fluent?',
    'Is the LLM generated response grammatically correct?',
    'Is the LLM generated response organized well?',
    'Does the LLM generated response have any amount of biasness?',
    'Does the LLM generated response have any amount of toxicity?',
    'Does the LLM generated response violate any privacy?',
    'Does the LLM generated response have any amount of hallucinations?',
    'Is the generated response distinguishable from human response?',
    'How does the generated response compare with human response?',
    'How does the generated response compare to other LLM responses?'
]
```
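
For downstream analysis, a small sketch of aggregating those per-criterion columns (this assumes, as in the Quick Start, that `results` holds one column per criterion with the 1-5 scores described in the FAQ below):

```python
# Every criterion appears as a column in 'results'; irrelevant criteria are
# scored NaN, which pandas skips by default when averaging.
criterion_columns = [c for c in results.columns if c not in ('reference', 'generated')]

mean_scores = results[criterion_columns].mean()
print(mean_scores.sort_values(ascending=False))
```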

### Error Handling
```python
from GamELY import AuthenticationError, APIRequestError

try:
    results = evaluate_responses(df, 'gpt-4', 'invalid-key')
except AuthenticationError as e:
    print(f"Invalid API key for {e.provider}: Please check your credentials")
except APIRequestError as e:
    print(f"API Error: {str(e)}")
```

## FAQ

### Q: How are scores calculated?
A: Each criterion is scored 1-5 by the LLM judge:
- 1 = Strongly disagree
- 2 = Disagree
- 3 = Neutral
- 4 = Agree
- 5 = Strongly agree
- NaN = Irrelevant criterion

### Q: What's the cost?
A: Evaluation uses your LLM provider's API; costs depend on the model you choose and the size of your dataset.

### Q: Can I add custom models?
A: The package currently supports OpenAI, Anthropic, and DeepSeek. Contact us to request support for a new provider.

### Q: How long does evaluation take?
A: It depends on the model's speed and the dataset size; with GPT-4, 100 rows take roughly 2-5 minutes.

## Troubleshooting

### Common Errors
- `AuthenticationError`: Check your API key and provider billing
- `ValueError`: Verify model name spelling and support status
- `APIRequestError`: Check your network connection and API rate limits (see the retry sketch below)
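
For transient failures, a minimal retry sketch using the documented exceptions (the helper name and backoff values are illustrative, not part of the package):

```python
import time
from GamELY import evaluate_responses, APIRequestError

def evaluate_with_retry(df, model_name, api_key, attempts=3):
    """Retry transient API failures with simple exponential backoff."""
    for attempt in range(attempts):
        try:
            return evaluate_responses(dataframe=df, model_name=model_name, api_key=api_key)
        except APIRequestError:
            if attempt == attempts - 1:
                raise
            time.sleep(2 ** attempt)  # wait 1s, then 2s, then 4s ...
```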

### Best Practices
1. Start with small batches (5-10 rows) for testing
2. Use the lowest-cost model that is adequate for the task (e.g., `gpt-3.5-turbo` for simple evaluations)
3. Cache results for repeated evaluations (see the chunk-and-cache sketch below)
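
A chunk-and-cache sketch that combines these practices (`evaluate_in_chunks` and the `eval_cache` directory are hypothetical helpers, not part of GamELY):

```python
import os
import pandas as pd
from GamELY import evaluate_responses

def evaluate_in_chunks(df, model_name, api_key, chunk_size=10, cache_dir='eval_cache'):
    """Evaluate in small chunks, caching each chunk so interrupted runs can resume."""
    os.makedirs(cache_dir, exist_ok=True)
    chunks = []
    for start in range(0, len(df), chunk_size):
        cache_path = os.path.join(cache_dir, f'chunk_{start}.csv')
        if os.path.exists(cache_path):
            chunks.append(pd.read_csv(cache_path))  # reuse cached scores
            continue
        chunk = df.iloc[start:start + chunk_size]
        result = evaluate_responses(dataframe=chunk, model_name=model_name, api_key=api_key)
        result.to_csv(cache_path, index=False)
        chunks.append(result)
    return pd.concat(chunks, ignore_index=True)
```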
