Metadata-Version: 2.4
Name: evaluator-service
Version: 0.1.0
Summary: Multi-LLM Response Validation & Selection Framework with RAG metrics evaluation
Author-email: PepsiCo <tech@pepsico.com>
Maintainer-email: PepsiCo <tech@pepsico.com>
License: MIT
Project-URL: Homepage, https://github.com/pepsico/evaluator-service
Project-URL: Documentation, https://github.com/pepsico/evaluator-service#readme
Project-URL: Repository, https://github.com/pepsico/evaluator-service.git
Project-URL: Issues, https://github.com/pepsico/evaluator-service/issues
Keywords: llm,rag,evaluation,multi-llm,response-selection,observability
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.14
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.10
Classifier: Framework :: FastAPI
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: fastapi>=0.115.0
Requires-Dist: uvicorn[standard]>=0.32.0
Requires-Dist: pydantic>=2.9.2
Requires-Dist: httpx>=0.27.2
Requires-Dist: python-dotenv>=1.0.1
Requires-Dist: pymongo>=4.10.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Dynamic: license-file

# Evaluator Service

Multi-LLM Response Validation & Selection Framework with RAG metrics evaluation.

## Features

- **Single-call Custom Evaluator**: Evaluates RAG metrics (faithfulness, context precision, context recall, relevance, hallucination risk) in a single LLM call
- **Score Aggregation**: Weighted scoring formula to combine multiple metrics into a final score
- **LLM-as-a-Judge**: Tie-breaking mechanism using LLM comparison when scores are close
- **Parallel Processing**: Evaluates multiple candidate responses concurrently
- **Observability**: MongoDB integration for storing evaluation traces
- **FastAPI**: RESTful API for easy integration
- **Extensible**: Pluggable architecture for different storage backends (MongoDB, Azure Blob, etc.)
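
The parallel-processing feature can be pictured with a small `asyncio` sketch. This is an illustration only: `evaluate_all` and the `evaluate_one` callable are hypothetical names, and how the service actually schedules its LLM calls is not described here.

```python
import asyncio


async def evaluate_all(candidates, evaluate_one):
    """Score every candidate concurrently; results keep the input order.

    `evaluate_one` is a hypothetical async callable that scores one candidate.
    """
    return await asyncio.gather(*(evaluate_one(c) for c in candidates))
```

With `asyncio.gather`, total latency is roughly that of the slowest candidate rather than the sum over all candidates.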

## Installation

```bash
pip install evaluator-service
```

## Configuration

Set the following environment variables:

```bash
# Pepgnix LLM Service Configuration
PEPGNIX_SERVICE_URL=https://pepgnix-service.example.com/api/v1/llm
PEPGNIX_TEAM_ID=your-team-id
PEPGNIX_PROJECT_ID=your-project-id
PEPGNIX_API_KEY=your-pepgnix-api-key

# MongoDB Configuration (for observability)
MONGODB_CONNECTION_STRING=mongodb://localhost:27017
MONGODB_DATABASE_NAME=evaluator_service
MONGODB_COLLECTION_NAME=evaluation_traces
```
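
A minimal sketch of checking these variables at startup, using only the standard library. The helper name `load_settings` and the decision to treat every variable as required are assumptions for illustration; the package's own configuration loading may differ.

```python
import os

# Variable names are taken from the list above.
REQUIRED_VARS = (
    "PEPGNIX_SERVICE_URL",
    "PEPGNIX_TEAM_ID",
    "PEPGNIX_PROJECT_ID",
    "PEPGNIX_API_KEY",
    "MONGODB_CONNECTION_STRING",
    "MONGODB_DATABASE_NAME",
    "MONGODB_COLLECTION_NAME",
)


def load_settings(env=None):
    """Return the required settings as a dict, or raise if any are missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Failing fast with the full list of missing names tends to be easier to debug than a connection error later in the request path.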

## Usage

### As a Library

```python
# LLMJudge is used below; it is assumed to be exported from the package root.
from evaluator_service import EvaluationOrchestrator, EvaluatorService, LLMJudge, WinnerSelector
from evaluator_service.clients import PepgnixClient, MongoObservabilityClient
from evaluator_service.models import EvalRequest, Candidate, ContextChunk

# Initialize clients
llm_client = PepgnixClient()
observability_client = MongoObservabilityClient()

# Initialize services
evaluator_service = EvaluatorService(llm_client)
llm_judge = LLMJudge(llm_client)
winner_selector = WinnerSelector(llm_judge)
orchestrator = EvaluationOrchestrator(evaluator_service, winner_selector, observability_client)

# Create evaluation request
request = EvalRequest(
    request_id="req-123",
    user_query="What was PepsiCo revenue in 2024?",
    context_chunks=[
        ContextChunk(
            chunk_id="doc-001-chunk-04",
            text="PepsiCo reported revenue of 91.8 billion USD in FY2024.",
            retrieval_score=0.94
        )
    ],
    candidates=[
        Candidate(model="gpt", response="PepsiCo reported revenue of 91.8 billion USD in FY2024."),
        Candidate(model="claude", response="According to the annual report, PepsiCo reported total revenue of 91.8B for FY2024.")
    ]
)

# Run evaluation
result = orchestrator.evaluate(request)
# The score is nested under `winner`, matching the REST response schema.
print(f"Winner: {result.winner.model}, Score: {result.winner.score}")
```

### As a REST API

```bash
# Start the server
evaluator-service

# Or using python
python -m evaluator_service.main
```

The API is then available at `http://localhost:8080`.

#### API Endpoint

**POST /api/v1/evaluate**

Request body:
```json
{
  "request_id": "req-123",
  "user_query": "What was PepsiCo revenue in 2024?",
  "context_chunks": [
    {
      "chunk_id": "doc-001-chunk-04",
      "text": "PepsiCo reported revenue of 91.8 billion USD in FY2024.",
      "retrieval_score": 0.94
    }
  ],
  "candidates": [
    {
      "model": "gpt",
      "response": "PepsiCo reported revenue of 91.8 billion USD in FY2024."
    },
    {
      "model": "claude",
      "response": "According to the annual report, PepsiCo reported total revenue of 91.8B for FY2024."
    }
  ]
}
```

Example response (values are illustrative):
```json
{
  "request_id": "req-123",
  "winner": {
    "model": "claude",
    "response": "According to the annual report, PepsiCo reported total revenue of 91.8B for FY2024.",
    "score": 0.85,
    "selection_method": "score_winner"
  },
  "all_scores": {
    "gpt": {
      "final": 0.82,
      "faithfulness": 0.9,
      "context_precision": 0.85,
      "context_recall": 0.8,
      "relevance": 0.95,
      "hallucination_risk": 0.1
    },
    "claude": {
      "final": 0.85,
      "faithfulness": 0.95,
      "context_precision": 0.9,
      "context_recall": 0.85,
      "relevance": 0.9,
      "hallucination_risk": 0.05
    }
  },
  "trace_id": "abc-123-def-456",
  "evaluated_at": "2024-01-15T10:30:00Z",
  "latency_ms": 2340
}
```
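
As a sketch of calling this endpoint from Python, the snippet below uses only the standard library. The helper names `build_request` and `evaluate` are hypothetical, the base URL is the default shown above, and no authentication is assumed because none is described here.

```python
import json
import urllib.request


def build_request(payload, base_url="http://localhost:8080"):
    """Build a JSON POST request for the /api/v1/evaluate endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        base_url + "/api/v1/evaluate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def evaluate(payload, base_url="http://localhost:8080", timeout=30.0):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_request(payload, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `evaluate(payload)` with the request body shown above returns the parsed response, so the winning model can be read from `result["winner"]["model"]`.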

## Scoring Formula

The final score is calculated using the following weighted formula:

```
Final Score =
  (0.35 × faithfulness)
+ (0.25 × context_recall)
+ (0.20 × relevance)
+ (0.20 × context_precision)
- (0.30 × hallucination_risk)
```
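
The formula can be written as a short Python function. This is a sketch, not the package's implementation: the function name `final_score` is hypothetical, and whether the service rounds or clamps the result is not stated here.

```python
# Weights copied from the formula above.
WEIGHTS = {
    "faithfulness": 0.35,
    "context_recall": 0.25,
    "relevance": 0.20,
    "context_precision": 0.20,
}
HALLUCINATION_PENALTY = 0.30


def final_score(metrics):
    """Combine per-metric scores into one value using the weighted formula."""
    weighted = sum(weight * metrics[name] for name, weight in WEIGHTS.items())
    return weighted - HALLUCINATION_PENALTY * metrics["hallucination_risk"]
```

The four positive weights sum to 1.0, so if every metric lies in the range 0 to 1, the unclamped result lies between -0.30 and 1.0.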

## Tie-Breaking

When the top two final scores differ by less than 0.05, the LLM Judge is invoked to compare the two answers on:
- Accuracy
- Completeness
- Grounding
- Clarity
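
The tie-breaking rule can be sketched as follows. The function `select_winner`, the `judge` callable, and the `"score"` / `"judge"` labels are hypothetical stand-ins for illustration; only the 0.05 threshold comes from the rule above.

```python
TIE_THRESHOLD = 0.05  # from the rule above


def select_winner(scores, judge):
    """Pick a winner from a {model: final_score} mapping.

    `judge` is a hypothetical callable taking the top two model names and
    returning the preferred one; it is only called when the scores are close.
    """
    if not scores:
        raise ValueError("at least one candidate score is required")
    ranked = sorted(scores, key=scores.get, reverse=True)
    if len(ranked) == 1:
        return ranked[0], "score"
    top, runner_up = ranked[0], ranked[1]
    if scores[top] - scores[runner_up] < TIE_THRESHOLD:
        return judge(top, runner_up), "judge"
    return top, "score"
```

Keeping the judge behind a threshold means the extra LLM call is paid only when the weighted scores cannot separate the candidates clearly.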

## Development

```bash
# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Format code
black .

# Lint
ruff check .
```

## License

MIT License - see LICENSE file for details.

## Contributing

Contributions are welcome! Please open an issue or submit a pull request.
