Metadata-Version: 2.2
Name: gllm-evals-binary
Version: 0.1.12
Summary: A library for evaluating LLM-based applications.
Author-email: Surya Mahadi <made.r.s.mahadi@gdplabs.id>
Requires-Python: <3.14,>=3.11
Description-Content-Type: text/markdown
Requires-Dist: aioboto3<16.0.0,>=15.2.0
Requires-Dist: aiohttp<4.0.0,>=3.13.3
Requires-Dist: datasets<4.0.0,>=3.6.0
Requires-Dist: deepmerge<3.0,>=2.0
Requires-Dist: filelock<4.0.0,>=3.20.3
Requires-Dist: filetype<2.0.0,>=1.2.0
Requires-Dist: gllm-core-binary<0.5.0,>=0.4.3
Requires-Dist: gllm-inference-binary[anthropic,google,openai]<0.7.0,>=0.6.0
Requires-Dist: google-api-python-client<3.0.0,>=2.0.0
Requires-Dist: google-auth<3.0.0,>=2.0.0
Requires-Dist: json-repair<0.47.0,>=0.46.0
Requires-Dist: jsonschema<5.0.0,>=4.0.0
Requires-Dist: langfuse<4.0.0,>=3.2.1
Requires-Dist: orjson<4.0.0,>=3.11.6
Requires-Dist: pyasn1<0.7.0,>=0.6.2
Requires-Dist: pydantic<3.0.0,>=2.11.5
Requires-Dist: python-box[all]<8.0.0,>=7.3.2; sys_platform != "win32"
Requires-Dist: python-box<8.0.0,>=7.3.2; sys_platform == "win32"
Requires-Dist: python-magic<0.5.0,>=0.4.27; sys_platform != "win32"
Requires-Dist: python-magic-bin<0.5.0,>=0.4.14; sys_platform == "win32"
Requires-Dist: pytrec-eval-terrier<0.6.0,>=0.5.7
Requires-Dist: sutoppu<2.0.0,>=1.2.0
Requires-Dist: urllib3<3.0.0,>=2.7.0
Requires-Dist: virtualenv<21.0.0,>=20.36.1
Requires-Dist: cryptography<47.0.0,>=46.0.7
Provides-Extra: dev
Requires-Dist: coverage==7.4.4; extra == "dev"
Requires-Dist: mypy==1.15.0; extra == "dev"
Requires-Dist: pre-commit==3.7.0; extra == "dev"
Requires-Dist: pytest<9.1.0,>=9.0.3; extra == "dev"
Requires-Dist: pytest-asyncio<2.0.0,>=1.0.0; extra == "dev"
Requires-Dist: pytest-cov<6.0.0,>=5.0.0; extra == "dev"
Requires-Dist: pytest-mock<3.15.0,>=3.14.1; extra == "dev"
Requires-Dist: ruff<0.7.0,>=0.6.7; extra == "dev"
Provides-Extra: deepeval
Requires-Dist: deepeval<4.0.0,>=3.7.0; extra == "deepeval"
Provides-Extra: ragas
Requires-Dist: marshmallow<3.27.0,>=3.26.2; extra == "ragas"
Requires-Dist: pillow<13.0.0,>=12.1.1; extra == "ragas"
Requires-Dist: ragas<0.4.0,>=0.3.0; extra == "ragas"
Provides-Extra: langchain
Requires-Dist: agentevals<0.0.9,>=0.0.8; extra == "langchain"
Requires-Dist: openevals<0.2.0,>=0.1.0; extra == "langchain"
Requires-Dist: langchain-community<0.4.0,>=0.3.0; extra == "langchain"
Requires-Dist: langchain-core>=1.3.3; extra == "langchain"
Requires-Dist: langsmith<0.9.0,>=0.8.0; extra == "langchain"

# GLLM Evaluator SDK

A comprehensive evaluation framework for Generative AI applications, including LLM outputs, AI Agent responses, and RAG (Retrieval-Augmented Generation) systems.

## Overview

The GLLM Evaluator SDK provides an extensible framework for evaluating AI systems across the GDP Labs ecosystem. Built with an integration-first philosophy, it lets teams assess the quality of generated content from any AI system and connect the results to experiment tracking and observability platforms.

### Philosophy

**Easy Evaluation Everywhere**: Standardize evaluation practices across all GDP Labs AI applications with minimal setup and maximum flexibility.

**Integration-First Design**: Built to work seamlessly with your existing experiment tracking, observability, and MLOps infrastructure.

**Extensible by Design**: Add new evaluators, metrics, and integrations without breaking existing workflows.

### Key Features

- 🌐 **GDP Labs Ecosystem Ready**: Standardized evaluation framework across all internal AI applications
- 🔌 **Seamless Integration**: Easy integration with experiment tracking and observability platforms
- 🚀 **Async-First Design**: High-performance async evaluation with parallel processing
- 🔧 **Extensible Architecture**: Easy to add new evaluators and metrics for any use case
- 🤖 **LLM as a Judge**: Advanced language models for nuanced, contextual evaluation
- 📐 **Traditional Metrics**: Support for classical evaluation metrics and custom scoring functions
- 🔗 **Popular Evaluator Integration**: Integration with popular evaluators such as RAGAS, DeepEval, and LangChain
- ⚡ **Zero-Config Start**: Get started with sensible defaults, customize as needed

## Installation

### Prerequisites

Mandatory:
1. Python 3.11+ — [Install here](https://www.python.org/downloads/)
2. pip — [Install here](https://pip.pypa.io/en/stable/installation/)
3. uv — [Install here](https://docs.astral.sh/uv/getting-started/installation/)
4. gcloud CLI (for authentication) — [Install here](https://cloud.google.com/sdk/docs/install), then log in using:
   ```bash
   gcloud auth login
   ```

---


### Install from Artifact

Because `gllm-evals` is a private library hosted in a secure Google Cloud repository, you must provide an access token to install it. The command below handles this authorization inline by using an access token from the `gcloud` CLI.

```bash
uv pip install --extra-index-url https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/ gllm-evals
```
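
To confirm that the install succeeded, a small check like the one below can report which version is visible to the current interpreter. This is only a sketch: the candidate distribution names are assumptions (the install command uses `gllm-evals`, while this package's metadata is published as `gllm-evals-binary`), and `installed_version` is a hypothetical helper, not part of the SDK.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(
    candidates: tuple[str, ...] = ("gllm-evals", "gllm-evals-binary"),
) -> Optional[str]:
    """Return the version of the first installed candidate distribution, or None."""
    for name in candidates:
        try:
            return version(name)
        except PackageNotFoundError:
            # Not installed under this name; try the next candidate.
            continue
    return None


print(installed_version() or "gllm-evals is not installed in this environment")
```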

---

## Local Development Setup

### Prerequisites

1. Python 3.11+ — [Install here](https://www.python.org/downloads/)
2. pip — [Install here](https://pip.pypa.io/en/stable/installation/)
3. uv — [Install here](https://docs.astral.sh/uv/getting-started/installation/)
4. gcloud CLI — [Install here](https://cloud.google.com/sdk/docs/install), then log in using:

   ```bash
   gcloud auth login
   ```
5. Git — [Install here](https://git-scm.com/downloads)
6. Access to the [GDP Labs SDK GitHub repository](https://github.com/GDP-ADMIN/gl-sdk)

---

### 1. Clone Repository

```bash
git clone git@github.com:GDP-ADMIN/gl-sdk.git
cd gl-sdk/libs/gllm-evals
```

---

### 2. Setup Authentication

Because `gllm-evals` is a private library, `uv` must authenticate with the internal Google Cloud package index before it can install dependencies.
Set the following environment variables:

```bash
export UV_INDEX_GEN_AI_INTERNAL_USERNAME=oauth2accesstoken
export UV_INDEX_GEN_AI_INTERNAL_PASSWORD="$(gcloud auth print-access-token)"
```
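
Tokens printed by `gcloud` are short-lived, so it can be convenient to fetch a fresh one each time a `uv` command runs. The sketch below shows one way to do that from Python, assuming `gcloud` and `uv` are on your `PATH`. The helper names (`uv_index_env`, `fetch_gcloud_token`, `run_with_index_auth`) are hypothetical and not part of the SDK or the Makefile.

```python
import os
import subprocess


def uv_index_env(token: str, index_name: str = "GEN_AI_INTERNAL") -> dict[str, str]:
    """Build the two uv credential variables for one named index."""
    token = token.strip()
    if not token:
        raise ValueError("access token is empty; run `gcloud auth login` first")
    prefix = f"UV_INDEX_{index_name}"
    return {
        f"{prefix}_USERNAME": "oauth2accesstoken",
        f"{prefix}_PASSWORD": token,
    }


def fetch_gcloud_token() -> str:
    """Ask the gcloud CLI for a fresh access token."""
    result = subprocess.run(
        ["gcloud", "auth", "print-access-token"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def run_with_index_auth(command: list[str]) -> int:
    """Run a command, for example ["uv", "sync"], with the credentials set."""
    env = {**os.environ, **uv_index_env(fetch_gcloud_token())}
    return subprocess.run(command, env=env).returncode
```

Calling `run_with_index_auth(["uv", "sync"])` would then behave like exporting the two variables above before running `uv sync`, without leaving the token in your shell session.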

---

### 3. Quick Setup

Run:

```bash
make setup
```

---

### 4. Activate Virtual Environment

```bash
source .venv/bin/activate
```

---

## Local Development Utilities

The following Makefile commands are available for quick operations:

### Install uv

```bash
make install-uv
```

### Install Pre-Commit

```bash
make install-pre-commit
```

### Install Dependencies

```bash
make install
```

### Update Dependencies

```bash
make update
```

### Run Tests

```bash
make test
```

---

## Adding the Package

Once authorization is configured, you can add `gllm-evals` to your project:

```bash
uv add gllm-evals
```

### Dependencies

The SDK requires:
- `gllm-core` and `gllm-inference` for LLM interactions
- `pydantic` for data validation

## Quick Start

### Basic Usage

```python
import asyncio
import os
from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator

async def main():
    # Initialize the evaluator
    evaluator = GEvalGenerationEvaluator(
        model_credentials=os.getenv("GOOGLE_API_KEY")
    )

    # Prepare evaluation data
    data = {
        "query": "What is the capital of France?",
        "expected_response": "Paris is the capital of France.",
        "generated_response": "The capital of France is Paris.",
        "retrieved_context": "Paris is the capital and largest city of France."
    }

    # Evaluate
    result = await evaluator.evaluate(data)
    print(result)

if __name__ == "__main__":
    asyncio.run(main())
```
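
Since each evaluation triggers an LLM call, it can help to check a record before sending it. The sketch below is a plain-Python pre-check that uses the four field names from the example above. Which fields the evaluator actually requires is an assumption here, and `missing_fields` is a hypothetical helper, not an SDK function.

```python
# Field names taken from the Basic Usage example; treating all four as
# required is an assumption, so adjust the tuple to match your evaluator.
EXAMPLE_FIELDS = ("query", "expected_response", "generated_response", "retrieved_context")


def missing_fields(record: dict, required: tuple[str, ...] = EXAMPLE_FIELDS) -> list[str]:
    """Return the names of required fields that are absent or blank."""
    missing = []
    for name in required:
        value = record.get(name)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(name)
    return missing


record = {
    "query": "What is the capital of France?",
    "expected_response": "Paris is the capital of France.",
    "generated_response": "",
}
print(missing_fields(record))
```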

### Batch Evaluation

```python
import asyncio
import os
from gllm_evals.dataset.dict_dataset import DictDataset
from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator
from gllm_evals.runner import Runner
from gllm_evals.experiment_tracker.csv_experiment_tracker import CSVExperimentTracker

async def batch_evaluation():
    # Initialize evaluator
    evaluator = GEvalGenerationEvaluator(
        model_credentials=os.getenv("GOOGLE_API_KEY"),
        run_parallel=True  # Enable parallel processing
    )

    # Create dataset
    dataset = DictDataset([
        {
            "query": "What is the capital of France?",
            "expected_response": "Paris",
            "generated_response": "Paris is the capital of France.",
            "retrieved_context": "Paris is the capital of France."
        },
        {
            "query": "What is 1 + 1?",
            "expected_response": "2",
            "generated_response": "The answer is 2.",
            "retrieved_context": "1 + 1 equals 2."
        }
    ])

    # Run evaluation
    runner = Runner(evaluator, batch_size=10)
    results = await runner.evaluate(dataset)

    # Track results
    tracker = CSVExperimentTracker(score_key="generation/score")
    tracker.log_batch(results)

    print(f"Evaluation Results: {tracker.get_results()}")

if __name__ == "__main__":
    asyncio.run(batch_evaluation())
```
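
To build intuition for what a `batch_size` setting means in an async pipeline, here is a standalone sketch of batched concurrent evaluation using only `asyncio`. It is not the `Runner` implementation, whose internals are not described here. `evaluate_in_batches` and `fake_evaluate` are hypothetical names, and the stand-in evaluator simply checks for an exact match.

```python
import asyncio
from collections.abc import Awaitable, Callable, Sequence


async def evaluate_in_batches(
    records: Sequence[dict],
    evaluate_one: Callable[[dict], Awaitable[dict]],
    batch_size: int = 10,
) -> list[dict]:
    """Evaluate records batch by batch, running each batch concurrently."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    results: list[dict] = []
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        # asyncio.gather returns results in the same order as its inputs.
        results.extend(await asyncio.gather(*(evaluate_one(r) for r in batch)))
    return results


async def fake_evaluate(record: dict) -> dict:
    """Stand-in for a real evaluator call; scores exact matches only."""
    await asyncio.sleep(0)  # yield control, as a network call would
    score = int(record["expected"] == record["generated"])
    return {"query": record["query"], "score": score}
```

Processing one batch at a time caps the number of in-flight requests at `batch_size`, which is a simple way to stay under provider rate limits.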

### Custom Metrics

Create domain-specific metrics easily:

```python
from gllm_evals.metrics.metric import BaseMetric
from gllm_evals.types import MetricInput, MetricOutput

class DomainSpecificMetric(BaseMetric):
    """Custom metric for domain-specific evaluation."""

    name = "domain_accuracy"

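    def calculate_domain_score(self, data: MetricInput) -> float:
        """Placeholder helper: replace with your own scoring logic."""
        raise NotImplementedError

    async def _evaluate(self, data: MetricInput) -> MetricOutput:
        # Delegate scoring to the helper above and return it with an explanation
        score = self.calculate_domain_score(data)
        return {"score": score, "explanation": "Domain-specific reasoning"}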
```
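
As an illustration of the kind of scoring logic a custom metric might wrap, the sketch below computes how many of the expected answer's distinct words also appear in the generated answer. It is a deliberately simple, standalone example that does not depend on the SDK. `keyword_coverage` is a hypothetical function, and word overlap is only one of many possible domain scores.

```python
import re


def _terms(text: str) -> set[str]:
    """Lowercase the text and split it into distinct alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_coverage(expected: str, generated: str) -> float:
    """Fraction of distinct expected words that also appear in the generated text."""
    expected_terms = _terms(expected)
    if not expected_terms:
        # Nothing to match against, so report zero coverage.
        return 0.0
    return len(expected_terms & _terms(generated)) / len(expected_terms)


print(keyword_coverage("Paris is the capital of France.", "The capital of France is Paris."))
```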

## Architecture

### Core Components

#### 1. Evaluators
- **`BaseEvaluator`**: Abstract base class for all evaluators - extend for any evaluation scenario
- **`GEvalGenerationEvaluator`**: Production-ready GEval-backed evaluator for text generation quality with rule-based scoring

#### 2. Metrics
- **`BaseMetric`**: Abstract base class for metrics - create custom metrics for any domain
- **`LMBasedMetric`**: Generic LM-powered metric evaluation with customizable prompts

#### 3. Datasets
- **`BaseDataset`**: Abstract base class for datasets - support any data format
- **`DictDataset`**: Simple dictionary-based dataset implementation

#### 4. Runner
- **`Runner`**: Runner class for batch evaluation

## Metrics
Below is a list of metrics that are currently supported by the SDK.

| Metric | Description | Type | Score Range |
|--------|-------------|------|-------------|
| LMBasedMetric | An all-purpose metric for any evaluation criterion that can be expressed as an LM prompt | LM-based | - |
| DeepEvalGEvalMetric | A versatile evaluation metric framework that can be used to create custom evaluation metrics with configurable criteria, evaluation steps, and rubrics | LM-based | - |
| GEvalCompletenessMetric | A metric that can be used to evaluate the completeness of the generated output | DeepEval GEval | 1-3 |
| GEvalRedundancyMetric | A metric that can be used to evaluate the redundancy of the generated output | DeepEval GEval | 1-3 |
| GEvalGroundednessMetric | A metric that can be used to evaluate the groundedness of the generated output | DeepEval GEval | 1-3 |
| GEvalLanguageConsistencyMetric | A metric that can be used to evaluate language consistency between query and generated response | DeepEval GEval | 0-1 |
| GEvalRefusalMetric | A metric that can be used to evaluate refusal behavior from query and expected response | DeepEval GEval | 0-1 |
| GEvalRefusalAlignmentMetric | A metric that can be used to evaluate refusal alignment between expected and generated responses | DeepEval GEval | 0-1 |
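
Because the metrics above use different score ranges (1-3 and 0-1), comparing or averaging them usually calls for a common scale. The sketch below applies min-max normalization using the ranges from the table. It is an assumption-laden illustration: `normalize_score` is a hypothetical helper, the SDK may already handle scaling differently, and the table does not state whether a higher score is better for every metric (redundancy, for instance), so check the direction before aggregating.

```python
# Ranges copied from the metrics table; rows with "-" are omitted.
SCORE_RANGES: dict[str, tuple[int, int]] = {
    "GEvalCompletenessMetric": (1, 3),
    "GEvalRedundancyMetric": (1, 3),
    "GEvalGroundednessMetric": (1, 3),
    "GEvalLanguageConsistencyMetric": (0, 1),
    "GEvalRefusalMetric": (0, 1),
    "GEvalRefusalAlignmentMetric": (0, 1),
}


def normalize_score(metric_name: str, raw_score: float) -> float:
    """Map a raw score onto [0, 1] using the metric's documented range."""
    low, high = SCORE_RANGES[metric_name]
    if not low <= raw_score <= high:
        raise ValueError(f"{raw_score} is outside {low}-{high} for {metric_name}")
    return (raw_score - low) / (high - low)


print(normalize_score("GEvalCompletenessMetric", 2))
```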

## Evaluators
Below is a list of evaluators that are currently supported by the SDK.

| Evaluator | Description | Type |
|--------|-------------|------|
| GEvalGenerationEvaluator | An evaluator that can be used to evaluate the quality of the generated output | LM-based |

## Datasets
Below is a list of datasets that are currently supported by the SDK.

| Dataset | Description |
|--------|-------------|
| DictDataset | A dataset that loads data from a dictionary |
| HuggingFaceDataset | A dataset that loads data from a HuggingFace dataset |
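
In the Batch Evaluation example, `DictDataset` is constructed from a list of dictionaries. If your evaluation data lives in a CSV file, a sketch like the following can turn it into that shape using only the standard library. `rows_from_csv` is a hypothetical helper, and the column names simply mirror the fields used in the earlier examples.

```python
import csv
import io


def rows_from_csv(text: str) -> list[dict[str, str]]:
    """Parse CSV text with a header row into a list of plain dictionaries."""
    reader = csv.DictReader(io.StringIO(text))
    return [dict(row) for row in reader]


sample = (
    "query,expected_response,generated_response\n"
    "What is 1 + 1?,2,The answer is 2.\n"
)
rows = rows_from_csv(sample)
print(rows)
```

The resulting `rows` list has the same structure as the list passed to `DictDataset` in the Batch Evaluation example.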
