Metadata-Version: 2.4
Name: judge-llm
Version: 1.0.11
Summary: A lightweight LLM evaluation framework for comparing and testing AI providers
Author-email: Ilayanambi <hihelloai@yahoo.com>
License: CC-BY-NC-SA-4.0
Project-URL: Homepage, https://github.com/HiHelloAI/judge-llm
Project-URL: Documentation, https://github.com/HiHelloAI/judge-llm#readme
Project-URL: Repository, https://github.com/HiHelloAI/judge-llm
Project-URL: Issues, https://github.com/HiHelloAI/judge-llm/issues
Keywords: llm,evaluation,testing,ai,gemini,openai,claude
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: Other/Proprietary License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Testing
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pydantic>=2.0.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: jinja2>=3.1.0
Requires-Dist: click>=8.1.0
Requires-Dist: httpx>=0.25.0
Requires-Dist: rich>=13.0.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: dotenv>=0.9.9
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Provides-Extra: google-adk
Requires-Dist: google-adk==1.16.0; extra == "google-adk"
Requires-Dist: google-adk[eval]==1.16.0; extra == "google-adk"
Requires-Dist: rouge-score; extra == "google-adk"
Provides-Extra: gemini
Requires-Dist: google-genai>=0.1.0; extra == "gemini"
Provides-Extra: openai
Requires-Dist: openai>=1.0.0; extra == "openai"
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.7.0; extra == "anthropic"
Provides-Extra: telemetry
Requires-Dist: opentelemetry-api>=1.20.0; extra == "telemetry"
Requires-Dist: opentelemetry-sdk>=1.20.0; extra == "telemetry"
Requires-Dist: opentelemetry-exporter-otlp>=1.20.0; extra == "telemetry"
Provides-Extra: phoenix
Requires-Dist: opentelemetry-api>=1.20.0; extra == "phoenix"
Requires-Dist: opentelemetry-sdk>=1.20.0; extra == "phoenix"
Requires-Dist: arize-phoenix-otel>=0.6.0; extra == "phoenix"
Requires-Dist: openinference-semantic-conventions>=0.1.0; extra == "phoenix"
Provides-Extra: all
Requires-Dist: google-genai>=0.1.0; extra == "all"
Requires-Dist: openai>=1.0.0; extra == "all"
Requires-Dist: anthropic>=0.7.0; extra == "all"
Requires-Dist: opentelemetry-api>=1.20.0; extra == "all"
Requires-Dist: opentelemetry-sdk>=1.20.0; extra == "all"
Requires-Dist: opentelemetry-exporter-otlp>=1.20.0; extra == "all"
Dynamic: license-file

<div align="center">
  <img src="assets/icon.png" alt="Judge LLM" width="200"/>

  # JUDGE LLM

  A lightweight, extensible Python framework for **evaluating and comparing LLM providers**. Test your AI agents systematically with multi-turn conversations, cost tracking, and comprehensive reporting.

  [![Python](https://img.shields.io/badge/python-3.9%2B-blue.svg)](https://www.python.org/downloads/)
  [![Version](https://img.shields.io/badge/version-1.0.9-green.svg)](https://github.com/HiHelloAI/judge-llm)
  [![License: CC BY-NC-SA 4.0](https://img.shields.io/badge/License-CC%20BY--NC--SA%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc-sa/4.0/)
  [![PyPI](https://img.shields.io/pypi/v/judge-llm.svg)](https://pypi.org/project/judge-llm/)
  [![OpenTelemetry](https://img.shields.io/badge/OpenTelemetry-optional-blueviolet.svg)](https://opentelemetry.io/)

  [Documentation](https://hihelloai.github.io/judge-llm/docs/intro) • [Quick Start](#quick-start) • [Features](#features) • [Examples](#testing-examples) • [Reports](#reports--dashboard)
</div>

<div align="center">
  <img src="assets/judge-llm.gif" alt="Judge LLM Demo" width="100%"/>
</div>

## Purpose

JUDGE LLM helps you **evaluate AI agents and LLM providers** by running test cases against your models and measuring:
- **Response quality** (exact matching, semantic similarity, ROUGE scores)
- **Cost & latency** (token usage, execution time, budget compliance)
- **Conversation flow** (tool uses, multi-turn interactions)
- **Safety & custom metrics** (extensible evaluation logic)

Perfect for regression testing, A/B testing providers, and ensuring production-grade quality.

## Features

- **Multiple Providers**: Gemini, Google ADK, ADK HTTP, Mock, and custom providers with registry-based extensibility
- **Built-in Evaluators**: Response similarity, trajectory validation, cost/latency checks, embedding similarity, LLM-as-judge, sub-agent chain validation
- **Custom Components**: Create and register custom providers, evaluators, and reporters
- **Registry System**: Register once in defaults, use everywhere by name
- **Rich Reports**: Console tables, interactive HTML dashboard, JSON exports, SQLite database, plus custom reporters
- **Parallel Execution**: Run evaluations concurrently with configurable workers
- **Quality Gates**: Fail CI/CD builds when thresholds are violated (configurable)
- **Config-Driven**: YAML configs with smart defaults or programmatic Python API
- **Default Config**: Reusable configurations with component registration
- **Per-Test Overrides**: Fine-tune evaluator thresholds per test case
- **Environment Variables**: Auto-loads `.env` for secure API key management
- **Telemetry**: Optional OpenTelemetry tracing with Phoenix, Jaeger, and OTLP support

## Installation

### From Source

```bash
git clone https://github.com/HiHelloAI/judge-llm.git
cd judge-llm
pip install -e .
```

### From PyPI (when published)

```bash
pip install judge-llm
```

### With Optional Dependencies

```bash
# Install with Gemini provider support
pip install judge-llm[gemini]

# Install with Google ADK provider support
pip install judge-llm[google_adk]

# Install with dev dependencies
pip install judge-llm[dev]

# Install with OpenTelemetry support
pip install judge-llm[telemetry]

# Install with Arize Phoenix observability
pip install judge-llm[phoenix]
```

### Setup Environment Variables

JUDGE LLM automatically loads environment variables from a `.env` file:

```bash
# Copy the example file
cp .env.example .env

# Edit .env and add your API keys
nano .env
```

**`.env` file:**
```bash
# Google Gemini API Key
GOOGLE_API_KEY=your-google-api-key-here
```

The `.env` file is automatically loaded when you import the library or run the CLI. **Never commit `.env` to version control** - it's already in `.gitignore`.

## Quick Start

### CLI Usage

```bash
# Run evaluation from config file
judge-llm run --config config.yaml

# Run with inline arguments (supports .json, .yaml, or .yml)
judge-llm run --dataset ./data/eval.yaml --provider mock --agent-id my_agent --report html --output report.html

# Validate configuration
judge-llm validate --config config.yaml

# List available components
judge-llm list providers
judge-llm list evaluators
judge-llm list reporters

# Generate dashboard from database
judge-llm dashboard --db results.db --output dashboard.html
```

### Python API

```python
from judge_llm import evaluate

# From config file
report = evaluate(config="config.yaml")

# Programmatic API (supports .json, .yaml, or .yml datasets)
report = evaluate(
    dataset={"loader": "local_file", "paths": ["./data/eval.yaml"]},
    providers=[{"type": "mock", "agent_id": "my_agent"}],
    evaluators=[{"type": "response_evaluator", "config": {"similarity_threshold": 0.8}}],
    reporters=[{"type": "console"}, {"type": "html", "output_path": "./report.html"}]
)

print(f"Success: {report.success_rate:.1%} | Cost: ${report.total_cost:.4f}")
```

## Configuration

**Minimal config.yaml:**
```yaml
dataset:
  loader: local_file
  paths: [./data/eval.json]  # Supports .json, .yaml, or .yml files

providers:
  - type: gemini
    agent_id: my_agent
    model: gemini-2.0-flash-exp

evaluators:
  - type: response_evaluator
    config: {similarity_threshold: 0.8}

reporters:
  - type: console
  - type: html
    output_path: ./report.html
```

**Advanced config with quality gates:**
```yaml
agent:
  fail_on_threshold_violation: true  # Exit with error if evaluations fail (default: true)
  parallel_execution: true            # Run tests in parallel
  max_workers: 4                      # Number of parallel workers
  num_runs: 3                         # Run each test 3 times

dataset:
  loader: local_file
  paths: [./data/eval.yaml]

providers:
  - type: gemini
    agent_id: production_agent
    model: gemini-2.0-flash-exp

evaluators:
  - type: response_evaluator
    config:
      similarity_threshold: 0.85  # Minimum 85% similarity required
  - type: cost_evaluator
    config:
      max_cost_per_case: 0.05      # Maximum $0.05 per test

reporters:
  - type: database
    db_path: ./results.db  # Track results over time
```

**Use in CI/CD:**
```bash
# Fails with exit code 1 if any evaluator thresholds are violated
judge-llm run --config ci-config.yaml

# Or disable failures for monitoring
# Set fail_on_threshold_violation: false in config
```

**Dataset File Formats:**

JUDGE LLM supports both JSON and YAML formats for evaluation datasets. Use whichever format you prefer:

```yaml
# Using JSON dataset
dataset:
  loader: local_file
  paths: [./data/eval.json]

# Using YAML dataset
dataset:
  loader: local_file
  paths: [./data/eval.yaml]

# Using multiple datasets (mixed formats)
dataset:
  loader: local_file
  paths:
    - ./data/eval1.json
    - ./data/eval2.yaml

# Using directory loader with pattern
dataset:
  loader: directory
  paths: [./data]
  pattern: "*.yaml"  # or "*.json" or "*.yml"
```

**Google ADK Provider Configuration:**

For agents built with Google's Agent Development Kit (ADK):

```yaml
providers:
  - type: google_adk
    agent_id: my_adk_agent
    agent_metadata:
      module_path: "tool_agent.agent"  # Path to your agent module
      agent_name: "root_agent"         # Agent variable name (default: "root_agent")
      root_path: "."                   # Root path for imports (optional)
```

The ADK provider automatically:
- Loads your ADK agent from the specified module
- Converts between Judge LLM and ADK formats
- Validates tool usage and conversation flow
- Reports results in standard Judge LLM format

See [examples/09-google-adk-agent/](examples/09-google-adk-agent/) for a complete working example.

**ADK HTTP Provider Configuration:**

For evaluating agents deployed as remote HTTP services:

```yaml
providers:
  - type: adk_http
    agent_id: my_remote_agent
    endpoint_url: "http://localhost:8000/run_sse"
    auth_type: bearer          # bearer, api_key, basic, none
    timeout: 60
    app_name: "my_app"
    model: gemini-2.0-flash

evaluators:
  - type: response_evaluator
  - type: trajectory_evaluator
  - type: subagent_evaluator   # Validate agent transfer chains
  - type: llm_judge_evaluator  # LLM-as-judge quality assessment
    config:
      evaluation_type: comprehensive
```

**Lifecycle Callbacks (Extensibility):**

The ADK HTTP provider exposes lifecycle callbacks that you can override by subclassing:

| Callback | When | Context keys |
|----------|------|-------------|
| `on_before_session_create` | Before session creation request | `payload, headers, url, app_name, user_id, session_id` |
| `on_after_session_create` | After session created | `session_id, response, app_name, user_id` |
| `on_before_run` | Before sending message | `payload, headers, message, session_id, system_instruction` |
| `on_after_run` | After collecting events | `events, session_id, response` |

```python
# my_provider.py
from judge_llm.providers.adk_http_provider import ADKHTTPProvider

class MyADKHTTPProvider(ADKHTTPProvider):
    def on_before_run(self, context):
        context["headers"]["X-Trace-Id"] = "my-trace"
        return context
```

```yaml
# Register and use via config
providers:
  - type: custom
    module_path: ./my_provider.py
    class_name: MyADKHTTPProvider
    register_as: my_adk_http
    agent_id: my_agent
    endpoint_url: http://localhost:8000/run_sse
```

See [examples/10-adk-http-agent/](examples/10-adk-http-agent/) for a complete working example.

See the [examples/](examples/) directory for complete configuration examples including default configs, custom evaluators, and advanced features.

## Custom Component Registration

JUDGE LLM supports registering custom providers, evaluators, and reporters for reuse across projects.

### Method 1: Register in Default Config

Create `.judge_llm.defaults.yaml` in your project root:

```yaml
# Register custom components once
providers:
  - type: custom
    module_path: ./my_providers/anthropic.py
    class_name: AnthropicProvider
    register_as: anthropic  # ← Use this name in test configs

evaluators:
  - type: custom
    module_path: ./my_evaluators/safety.py
    class_name: SafetyEvaluator
    register_as: safety

reporters:
  - type: custom
    module_path: ./my_reporters/slack.py
    class_name: SlackReporter
    register_as: slack
```

Then use them by name in any test config:

```yaml
# test.yaml - clean and simple!
providers:
  - type: anthropic  # ← Uses registered custom provider
    agent_id: claude

evaluators:
  - type: safety  # ← Uses registered custom evaluator

reporters:
  - type: slack  # ← Uses registered custom reporter
    config: {webhook_url: ${SLACK_WEBHOOK}}
```

### Method 2: Programmatic Registration

```python
from judge_llm import evaluate, register_provider, register_evaluator, register_reporter
from my_components import CustomProvider, SafetyEvaluator, SlackReporter

# Register components
register_provider("my_provider", CustomProvider)
register_evaluator("safety", SafetyEvaluator)
register_reporter("slack", SlackReporter)

# Use by name
report = evaluate(
    dataset={"loader": "local_file", "paths": ["./tests.json"]},
    providers=[{"type": "my_provider", "agent_id": "test"}],
    evaluators=[{"type": "safety"}],
    reporters=[{"type": "slack", "config": {"webhook_url": "..."}}]
)
```

**Benefits:**
- ✅ **DRY** - Register once, use everywhere
- ✅ **Team Standardization** - Share defaults across team
- ✅ **Clean Configs** - Test configs reference components by name
- ✅ **Easy Updates** - Change implementation in one place

See [examples/default_config_reporters/](examples/default_config_reporters/) for complete examples.

## Testing Examples

Explore **10 complete examples** in the `examples/` directory, from basic setup to advanced features:

| Example | Description | Key Features |
|---------|-------------|--------------|
| **[01-gemini-agent](examples/01-gemini-agent/)** | Basic Gemini agent evaluation | Simple setup, response evaluation, CLI & Python API usage |
| **[02-default-config](examples/02-default-config/)** | Using default configuration files | Config merging, `.judge_llm.defaults.yaml`, reducing duplication |
| **[03-custom-evaluator](examples/03-custom-evaluator/)** | Creating custom evaluators | Extending `BaseEvaluator`, custom validation logic, registration |
| **[04-safety-long-conversation](examples/04-safety-long-conversation/)** | Multi-turn safety evaluation | Long conversations, PII detection, toxicity analysis, LLM-as-judge |
| **[05-evaluator-config-override](examples/05-evaluator-config-override/)** | Per-test-case evaluator overrides | Fine-grained threshold control, two-level configuration |
| **[06-database-reporter](examples/06-database-reporter/)** | SQLite database reporter | Historical tracking, trend analysis, SQL queries, cost monitoring |
| **[07-custom-reporter](examples/07-custom-reporter/)** | Creating custom reporters | CSV reporter example, config-based & programmatic registration |
| **[08-default-config-reporters](examples/08-default-config-reporters/)** | Registering custom components | Register providers, evaluators, and reporters in defaults |
| **[09-google-adk-agent](examples/09-google-adk-agent/)** | Google ADK agent evaluation | ADK integration, tool usage validation, agent module loading |
| **[10-adk-http-agent](examples/10-adk-http-agent/)** | Remote ADK HTTP agent evaluation | HTTP/SSE streaming, multi-agent chains, sub-agent evaluation |

### Running Examples

Each example includes configuration files, datasets, and detailed instructions:

```bash
# Navigate to any example
cd examples/01-gemini-agent

# Run via CLI
judge-llm run --config config.yaml

# Or run via Python script
python run_evaluation.py

# Or use the convenience shell script
./run.sh
```

### Example Categories

**Getting Started:**
- Start with **01-gemini-agent** for basic usage
- Use **02-default-config** to learn configuration best practices

**Customization:**
- **03-custom-evaluator** - Build domain-specific evaluation logic
- **07-custom-reporter** - Create custom output formats
- **08-default-config-reporters** - Organize custom components

**Advanced Features:**
- **04-safety-long-conversation** - Production-ready safety evaluation with LLM-as-judge
- **05-evaluator-config-override** - Fine-tune evaluations per test case
- **06-database-reporter** - Track metrics over time with SQL queries
- **09-google-adk-agent** - Evaluate Google ADK agents seamlessly
- **10-adk-http-agent** - Evaluate remote agents via HTTP with multi-agent chain validation

## Built-in Components

### Providers
- **Gemini** - Google's Gemini models (requires `GOOGLE_API_KEY` in `.env`)
- **Google ADK** - Google's Agent Development Kit for local agentic workflows (requires `google-adk` package)
- **ADK HTTP** - Remote ADK HTTP endpoints with SSE streaming, multi-auth, and agent chain tracking (requires `httpx`)
- **Mock** - Built-in test provider, no setup required
- **Custom** - Extend `BaseProvider` for your own LLM providers (OpenAI, Anthropic, etc.)

### Evaluators
- **ResponseEvaluator** - Compare responses (exact, semantic similarity, ROUGE)
- **TrajectoryEvaluator** - Validate tool uses, conversation flow, and argument matching
- **CostEvaluator** - Enforce cost thresholds
- **LatencyEvaluator** - Enforce latency thresholds
- **EmbeddingSimilarityEvaluator** - Semantic similarity using embeddings (Gemini, OpenAI, sentence-transformers)
- **LLMJudgeEvaluator** - LLM-as-judge for relevance, hallucination, quality, and factuality assessment
- **SubAgentEvaluator** - Validate agent transfer chains in multi-agent orchestration systems
- **Custom** - Extend `BaseEvaluator` for custom logic (safety, compliance, etc.)

### Reporters
- **ConsoleReporter** - Rich terminal output with colored tables
- **HTMLReporter** - Interactive HTML dashboard
- **JSONReporter** - Machine-readable JSON export
- **DatabaseReporter** - SQLite database for historical tracking
- **Custom** - Extend `BaseReporter` for custom formats (CSV, Slack, Datadog, etc.)

## Reports & Dashboard

### HTML Dashboard
Interactive web interface with:
- **Sidebar**: Summary metrics + execution list with color-coded status
- **Main Panel**: Execution details, evaluator scores, conversation history
- **Features**: Dark mode, responsive, self-contained (works offline)

### Console Output
Rich formatted tables with live execution progress

### JSON Export
Machine-readable results for programmatic analysis

### SQLite Database
Persistent storage for:
- Historical trend tracking
- Regression detection
- Cost analysis over time
- SQL-based queries

```bash
# Generate dashboard from database
judge-llm dashboard --db results.db --output dashboard.html
```

## Telemetry & Observability (OpenTelemetry)

Judge LLM includes optional OpenTelemetry instrumentation for deep observability into evaluation runs. **Disabled by default** with zero overhead when not enabled.

### Quick Start

```bash
# Install telemetry dependencies
pip install judge-llm[telemetry]

# Run with console tracing
judge-llm run --config config.yaml --telemetry

# Run with OTLP exporter (Jaeger, Grafana Tempo, etc.)
judge-llm run --config config.yaml --telemetry --telemetry-exporter otlp
```

### Arize Phoenix Integration

```bash
# Install Phoenix support
pip install judge-llm[phoenix]

# Start Phoenix server
pip install arize-phoenix && phoenix serve

# Run with Phoenix tracing
judge-llm run --config config.yaml --telemetry --telemetry-exporter phoenix
# Open http://localhost:6006 to view traces
```

### Enable via YAML Config

```yaml
agent:
  telemetry:
    enabled: true
    exporter: phoenix    # "console", "otlp", or "phoenix"
    endpoint: http://localhost:6006
```

### Enable via Environment Variables

```bash
export JUDGE_LLM_TELEMETRY=true
export OTEL_EXPORTER_TYPE=phoenix
export PHOENIX_COLLECTOR_ENDPOINT=http://localhost:6006
```

### What Gets Traced

Every evaluation creates a span tree with detailed attributes:

```
judge_llm.evaluate                                [CHAIN]
├── judge_llm.execute_task                        [CHAIN, session.id]
│   ├── judge_llm.provider.execute                [LLM, input/output, tokens]
│   │   ├── adk_http.create_session               [TOOL, HTTP req/res]
│   │   └── adk_http.send_and_collect             [LLM, HTTP req/res, tokens]
│   └── judge_llm.evaluator.evaluate              [EVALUATOR, score]
└── judge_llm.reporter.generate                   [per reporter]
```

**Phoenix-specific features:**
- **Sessions** — spans are grouped by `session.id` for multi-turn conversation tracking
- **Input/Output** — full request payloads and agent response text visible on each span
- **HTTP details** — request/response bodies, headers, and status codes on HTTP spans
- **LLM metadata** — model name, token counts (prompt/completion/total) via OpenInference conventions

See the [Telemetry Guide](docs/docs/guides/telemetry.md) for complete documentation.

## Development

```bash
# Setup
pip install -e ".[dev]"

# Run tests
pytest

# Format code
black judge_llm && ruff check judge_llm
```

Contributions welcome! Fork, create a feature branch, add tests, and submit a PR.

## License

Licensed under **CC BY-NC-SA 4.0** - Free for non-commercial use with attribution. See [LICENSE](LICENSE) for details.

For commercial licensing, contact the maintainers.
