Metadata-Version: 2.4
Name: eic-model-evaluation
Version: 1.0.0
Summary: End-to-end LLM evaluation framework with synthetic data generation and multi-cloud support
Author-email: Enterprise Innovation Consulting LLC <seroukhov@entinco.com>
License: Commercial
Project-URL: Homepage, https://bitbucket.org/entinco/eic-aimodelknowledge-utils/src/master/lib-modelevaluation-python
Project-URL: Repository, https://bitbucket.org/entinco/eic-aimodelknowledge-utils
Keywords: llm,evaluation,langchain,openai,vertexai
Classifier: Intended Audience :: Developers
Classifier: License :: Other/Proprietary License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests>=2.31.0
Requires-Dist: langchain>=0.3.0
Requires-Dist: langchain-openai>=0.2.0
Requires-Dist: langchain-google-genai>=2.0.0
Requires-Dist: langchain-anthropic>=0.3.0
Requires-Dist: langchain-community>=0.3.0
Requires-Dist: langchain-text-splitters>=0.3.0
Requires-Dist: litellm>=1.0.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: codebleu>=0.7.0
Requires-Dist: sentence-transformers>=2.2.0
Requires-Dist: chromadb>=0.4.0
Requires-Dist: faiss-cpu>=1.7.0
Requires-Dist: gradio>=4.44.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.24.0; extra == "dev"
Dynamic: license-file

# Model Evaluation Library

A production-grade Python framework for end-to-end evaluation of Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems.

This library provides a modular architecture for:

* Generating synthetic evaluation datasets from private documentation.
* Evaluating models using "LLM-as-a-Judge" and deterministic metrics.
* Benchmarking RAG retrieval accuracy and generation quality.
* Comparing models with statistical and narrative reporting.

## Installation

```bash
pip install eic-model-evaluation
```
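
To run the test suite, install the `dev` extra (pytest and pytest-asyncio) declared in the package metadata:

```bash
pip install "eic-model-evaluation[dev]"
```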

## Quick Start

Run a complete evaluation pipeline: generate a dataset, define metrics, and evaluate a target model.

```python
import asyncio
from model_evaluation import ModelFactory, EvaluationEngine, DatasetGenerator

async def main():
    # 1. Set up secure connectors (provider is auto-inferred from the model name)
    # Ensure OPENAI_API_KEY environment variable is set
    judge = ModelFactory.create_auto("gpt-4o")
    target = ModelFactory.create_auto("gpt-4o-mini")

    # 2. Generate synthetic test data from your documentation
    generator = DatasetGenerator(
        model_connector=judge,
        prompt_template="Create technical interview questions based on: {chunk}"
    )
    # Supports PDF, Markdown, and text files
    dataset = await generator.agenerate_from_file("docs/architecture.md", num_questions=10)

    # 3. Configure the Evaluation Engine
    engine = EvaluationEngine(
        judge_connector=judge,
        max_concurrency=5,  # Control parallelism
        metrics_config={
            "faithfulness": {"weight": 2.0},  # Assign higher importance
            "answer_relevance": {"weight": 1.0}
        }
    )

    # 4. Execute Evaluation
    results = await engine.aevaluate(
        dataset=dataset,
        target_model=target,
        metrics=["faithfulness", "answer_relevance", "compliance"]
    )

    # 5. Review Results
    print(f"Average Quality Score: {results['average_score']:.2f}")
    
    for metric, stats in results['metrics'].items():
        print(f"{metric}: {stats['score']:.2f} (min: {stats['min']:.2f})")

if __name__ == "__main__":
    asyncio.run(main())
```

## Advanced Usage & Extensibility

The framework is designed to be easily extended for custom enterprise needs.

### 1. Custom Model Connectors

Integrate with internal or unsupported model providers by subclassing `ModelConnector`.

```python
from model_evaluation.connectors.base import ModelConnector, PredictResult
from model_evaluation.connectors.factory import ModelFactory

class LocalLlamaConnector(ModelConnector):
    async def predict(self, input_text: str, system_prompt: str | None = None) -> PredictResult:
        # Your custom inference logic (e.g., requests to local vLLM)
        response_text = call_my_local_model(input_text)
        return PredictResult(output=response_text)

# Register globally
ModelFactory.register("local_llama", LocalLlamaConnector)

# Use in pipeline
model = ModelFactory.create("local_llama", "llama-3-70b-instruct")
```

### 2. Custom Metrics

Define domain-specific evaluation logic.

```python
from model_evaluation.metrics import BaseLLMJudgeMetric, MetricResult, MetricRegistry

class SecurityComplianceMetric(BaseLLMJudgeMetric):
    @property
    def name(self) -> str:
        return "security_compliance"

    @property
    def prompt_template(self) -> str:
        return """
        Analyze if the answer reveals internal IP addresses or secrets.
        Context: {context}
        Answer: {answer}
        """

    def _parse_response(self, response: str) -> MetricResult:
        # Custom parsing logic
        return MetricResult(score=1.0, reasoning="No secrets found.")

# Register and use
MetricRegistry.register(SecurityComplianceMetric)
results = await engine.aevaluate(..., metrics=["security_compliance"])
```

### 3. RAG Evaluation (Retrieval + Generation)

Assess the retrieval component separately from the generation quality.

```python
from model_evaluation.metrics import RecallAtK, PrecisionAtK

# Measure if relevant chunks were retrieved
recall_metric = RecallAtK(k=5)
score = await recall_metric.evaluate(
    context=ground_truth_document,
    retrieved_chunks=retrieved_snippets
)
```

### 4. Model Comparison & Narrative Reporting

Compare two models side-by-side with an AI-generated summary.

```python
from model_evaluation.reporting import NarrativeReporter

# Get results from two different models
results_a = await engine.aevaluate(dataset, model_a)
results_b = await engine.aevaluate(dataset, model_b)

# Generate a textual comparison (judge_llm is a judge connector, e.g. ModelFactory.create_auto("gpt-4o"))
reporter = NarrativeReporter(judge_llm)
report = await reporter.generate_comparison(
    results_a, results_b, 
    model_a_name="GPT-4", 
    model_b_name="Llama-3"
)
print(report)
```

## Core Features

* **Multi-Provider Support**: Unified interface for OpenAI, Anthropic, Google Vertex AI, Groq, and custom endpoints.
* **Intelligent Filtering**: Automatically applies relevant metrics based on item type (e.g., `CodeBLEU` runs only on `type="code"` items).
* **Cost Estimation**: Calculate expected token usage and costs before running large evaluations.
* **Concurrency Control**: Reliable async execution with semaphores to respect API rate limits (see the sketch after this list).
* **Schema Validation**: Strict typing for inputs and outputs ensures pipeline reliability.
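
The concurrency control noted above is the standard asyncio semaphore pattern. A minimal standalone sketch of that pattern (illustrative only, not the library's internal code):

```python
import asyncio

async def bounded_gather(coros, max_concurrency: int = 5):
    # Allow at most `max_concurrency` coroutines to run at the same time,
    # so bursts of judge/target calls stay within provider rate limits.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run(coro):
        async with semaphore:
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))
```

The `max_concurrency` argument of `EvaluationEngine` in the Quick Start plays the same role when the engine fans out evaluation calls.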

## API Reference

### Evaluation Item Schema

The input dataset should follow this structure:

| Field | Type | Description |
| :--- | :--- | :--- |
| `question` | `str` | The input prompt or question. |
| `ground_truth` | `str, optional` | Reference answer for comparison. |
| `context` | `str, optional` | Source text for grounding/faithfulness checks. |
| `type` | `str, optional` | `text` or `code`. Controls which metrics run. |
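
For illustration, a minimal dataset matching this schema might look like the following (assuming items are plain Python dictionaries; the field names come from the table above, the contents are made up):

```python
dataset = [
    {
        "question": "What does max_concurrency control in the EvaluationEngine?",
        "ground_truth": "The maximum number of evaluation calls running in parallel.",
        "context": "EvaluationEngine limits parallel requests via a semaphore.",
        "type": "text",
    },
    {
        # Items marked as code are evaluated with code-oriented metrics such as CodeBLEU
        "question": "Write a Python function that returns the nth Fibonacci number.",
        "ground_truth": "def fib(n):\n    a, b = 0, 1\n    for _ in range(n):\n        a, b = b, a + b\n    return a",
        "type": "code",
    },
]
```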

### Available Metrics

| Type | Metric Names |
| :--- | :--- |
| **Judge** | `faithfulness`, `compliance`, `answer_relevance`, `conciseness`, `toxicity`, `logical_consistency` |
| **Retrieval** | `recall`, `precision`, `ndcg` |
| **Similarity** | `semantic_similarity`, `codebleu`, `determinism` |

## Examples

Check the `examples/` directory for ready-to-run scripts:

* `quickstart_eval.py`: Basic evaluation loop.
* `vertex_rag_eval.py`: RAG evaluation with Google Vertex AI.
* `model_comparison.py`: Comparing two models.
* `cost_estimation.py`: Estimating API costs.
* `custom_gen_prompt.py`: Customizing dataset generation.
* `template_customization.py`: Customizing report outputs.
