Metadata-Version: 2.4
Name: evalmeter
Version: 0.1.0
Summary: Comprehensive evaluation library for Gen AI applications using AWS Bedrock
Author-email: Ramprasath S <ramprasath.s@example.com>
License: MIT
Project-URL: Homepage, https://github.com/RamprasathS/evalmeter
Project-URL: Repository, https://github.com/RamprasathS/evalmeter
Project-URL: Issues, https://github.com/RamprasathS/evalmeter/issues
Keywords: ai,evaluation,bedrock,llm,claude,genai,evalmeter
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: boto3>=1.34.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: click>=8.1.0
Requires-Dist: fastapi>=0.109.0
Requires-Dist: uvicorn[standard]>=0.27.0
Requires-Dist: plotly>=5.18.0
Requires-Dist: scikit-learn>=1.3.0
Requires-Dist: nltk>=3.8.0
Requires-Dist: rouge>=1.0.1
Requires-Dist: python-Levenshtein>=0.23.0
Requires-Dist: pydantic>=2.5.0
Requires-Dist: rich>=13.7.0
Requires-Dist: tqdm>=4.66.0
Provides-Extra: dev
Requires-Dist: pytest>=7.4.0; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: black>=23.12.0; extra == "dev"
Requires-Dist: ruff>=0.1.9; extra == "dev"
Requires-Dist: mypy>=1.8.0; extra == "dev"
Dynamic: license-file

# 📊 EvalMeter

**Measure AI Quality with Precision using AWS Bedrock**

A comprehensive evaluation framework for Gen AI applications, powered by AWS Bedrock. EvalMeter provides 11 evaluation metrics across heuristic, statistical, and LLM-as-judge methods to help you measure and improve your AI systems.

[![PyPI version](https://badge.fury.io/py/evalmeter.svg)](https://badge.fury.io/py/evalmeter)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

---

## 🎬 Demo

![EvalMeter Demo](demo/evalmeter-demo.gif)

*Quick demo showing project tracking, experiment comparison, and metrics visualization*

### Key Features Shown:
- 📁 **Projects** - Group related experiments and track progress
- 📊 **Dashboard** - Overview with key statistics
- 📈 **Progress Charts** - Visualize improvements over time
- ⚖️ **Compare** - Side-by-side experiment comparison
- 💬 **Comments** - Document changes and insights

---

## ✨ Key Features

- 🎯 **11 Evaluation Metrics** - Heuristic, Statistical, and LLM-as-Judge evaluators
- 🤖 **AWS Bedrock Powered** - Claude Sonnet 4.5 and Titan Embeddings V2
- 📊 **Multiple Data Formats** - CSV, JSONL, JSON, Parquet support
- 💾 **Local SQLite Storage** - Track experiments without external dependencies
- 🎨 **Modern Web UI** - React dashboard with real-time visualization
- 📁 **Project Tracking** - Group experiments and monitor progress over time
- ⚡ **Simple CLI** - One-line commands to run evaluations
- 🔌 **REST API** - FastAPI backend for programmatic access
- 📈 **Progress Charts** - Visualize improvement trends
- 🔍 **Detailed Metrics** - Comprehensive scoring and metadata

---

## 📦 Installation

```bash
pip install evalmeter
```

### Prerequisites

- Python 3.9 or higher
- AWS account with Bedrock access
- AWS credentials configured

### AWS Setup

```bash
# Configure AWS credentials
aws configure

# Or set environment variables
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=us-east-1
```

**Required IAM Permissions:**
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel"
      ],
      "Resource": [
        "arn:aws:bedrock:*::foundation-model/anthropic.claude-*",
        "arn:aws:bedrock:*::foundation-model/amazon.titan-embed-*"
      ]
    }
  ]
}
```

---

## 🚀 Quick Start

### 1. Prepare Your Data

Create a CSV file with your test cases:

```csv
input,output,expected
"What is 2+2?","4","4"
"Capital of France?","Paris","Paris"
"Explain photosynthesis","Plants use sunlight to make food","Photosynthesis is how plants convert light energy into chemical energy"
```

### 2. Run Evaluation

```bash
# Basic evaluation
evalmeter run --data test.csv --evals "exact_match,bleu,rouge"

# With project tracking
evalmeter run --data test.csv \
  --project "chatbot-v2" \
  --experiment "baseline" \
  --comments "Initial baseline test" \
  --evals "factuality,relevance,coherence"

# Comprehensive evaluation (all 11 metrics)
evalmeter run --data test.csv \
  --experiment "comprehensive" \
  --evals "exact_match,fuzzy_match,contains,bleu,rouge,levenshtein,cosine_similarity,factuality,relevance,coherence,completeness"
```

### 3. View Results in Web UI

```bash
# Launch the web UI
./start-ui.sh

# This starts:
# - API server on http://localhost:8000
# - React UI on http://localhost:5173 (opens automatically)
```

The web UI provides:
- 📊 **Dashboard** - Overview of all experiments
- 📁 **Projects** - Group related experiments and track progress
- 📈 **Progress Charts** - Visualize improvements over time
- 🔍 **Detailed Results** - View scores, metrics, and sample-level data
- ⚖️ **Compare** - Side-by-side experiment comparison
- 💬 **Comments** - Document changes and insights for each experiment

**CLI Alternative:**
```bash
# List experiments
evalmeter list

# Show details
evalmeter show <experiment-id>
```

---

## 📊 Available Evaluators (11 Total)

### 🎯 Heuristic Evaluators (4)

| Evaluator | Description | Use Case |
|-----------|-------------|----------|
| `exact_match` | Binary exact string match | Classification, short answers |
| `fuzzy_match` | Similarity ratio (0.0-1.0) | Typo tolerance, spelling variations |
| `contains` | Substring matching | Long answers, key phrase detection |
| `regex_match` | Pattern matching | Format validation (emails, dates) |

### 📈 Statistical Evaluators (4)

| Evaluator | Description | Use Case |
|-----------|-------------|----------|
| `bleu` | N-gram precision | Translation, text generation |
| `rouge` | Recall-oriented matching | Summarization |
| `levenshtein` | Edit distance similarity | Text similarity |
| `cosine_similarity` | Semantic similarity via embeddings | Meaning comparison |

### 🤖 LLM-as-Judge Evaluators (4)

| Evaluator | Description | Use Case |
|-----------|-------------|----------|
| `factuality` | Factual correctness | Accuracy verification |
| `relevance` | Answer relevance | Relevance checking |
| `coherence` | Response structure | Quality assessment |
| `completeness` | Answer coverage | Thoroughness verification |

---

## 💻 Python API

```python
from evalmeter import Evaluator

# Initialize
evaluator = Evaluator(
    model_id="us.anthropic.claude-sonnet-4-5-20250929-v1:0",
    aws_region="us-east-1"
)

# Run evaluation
results = evaluator.run(
    data_path="test.csv",
    experiment_name="my-eval",
    project_id="chatbot-v2",
    comments="Testing new prompts",
    evaluators=["factuality", "relevance", "cosine_similarity"]
)

# Print summary
print(results.summary())

# Access metrics
print(f"Factuality: {results.metrics['factuality_mean']:.2f}")
print(f"Relevance: {results.metrics['relevance_mean']:.2f}")

# Iterate results
for result in results:
    print(f"Input: {result['input']}")
    print(f"Scores: {result['scores']}")
```

---

## 🎨 Web UI - Visualize Your Results

Launch the interactive dashboard to view and analyze your evaluation results:

```bash
./start-ui.sh
```

This opens the React UI at **http://localhost:5173**

### Dashboard Pages

#### 📊 Dashboard
Overview of all experiments with key statistics, metrics, and recent activity.

#### 📁 Projects - Track Progress Over Time
**Group related experiments to visualize improvements**

Create projects to organize experiments and track progress across iterations:

```bash
# Baseline
evalmeter run --data test.csv \
  --project "chatbot-v2" \
  --experiment "baseline" \
  --comments "Initial baseline with default prompts" \
  --evals "factuality,relevance,coherence,completeness"

# After improvements
evalmeter run --data test.csv \
  --project "chatbot-v2" \
  --experiment "improved-prompts" \
  --comments "Updated system prompts for better accuracy" \
  --evals "factuality,relevance,coherence,completeness"

# With RAG
evalmeter run --data test.csv \
  --project "chatbot-v2" \
  --experiment "with-rag" \
  --comments "Added RAG with vector database" \
  --evals "factuality,relevance,coherence,completeness"
```

**In the UI:**
1. Navigate to **Projects** → **chatbot-v2**
2. See all experiments in chronological order
3. View progress chart showing metric improvements over time
4. Read comments to understand what changed between versions

#### 🔬 Experiments
Browse all evaluation runs, filter by project/status/date, and view summary metrics.

#### 📈 Experiment Details
Click any experiment to see:
- **Metrics Summary** - Mean, min, max for all evaluators
- **Sample Results** - Individual input/output/expected with scores
- **Comments** - Your notes about this experiment
- **Metadata** - Model used, dataset, timestamps
- **Configuration** - Evaluators used and parameters

#### ⚖️ Compare
Select two experiments to compare side-by-side, view metric differences, and identify improvements or regressions.

#### 💬 Comments & Documentation
**Document your experiments for better tracking**

Add comments to every experiment explaining what changed, why, and observations:

```bash
evalmeter run --data test.csv \
  --project "qa-bot" \
  --experiment "test-5" \
  --comments "Increased temperature to 0.7 for more creative responses. Added context window of 3 previous messages. Results show better coherence but slightly lower factuality."
```

View these comments in the UI to understand your experimentation history and make informed decisions!

**See `docs/PROJECT_TRACKING.md` for complete guide.**

### Screenshots

<table>
  <tr>
    <td width="50%">
      <img src="demo/dashboard.png" alt="Dashboard">
      <p align="center"><em>Dashboard with experiment overview and statistics</em></p>
    </td>
    <td width="50%">
      <img src="demo/projects.png" alt="Projects">
      <p align="center"><em>Project tracking with progress charts</em></p>
    </td>
  </tr>
  <tr>
    <td width="50%">
      <img src="demo/experiments.png" alt="Experiments">
      <p align="center"><em>Experiment list with filtering and metrics</em></p>
    </td>
    <td width="50%">
      <img src="demo/metricgraph.png" alt="Metric Graphs">
      <p align="center"><em>Detailed metric visualization and trends</em></p>
    </td>
  </tr>
  <tr>
    <td colspan="2">
      <img src="demo/experimentvisualcomparison.png" alt="Visual Comparison">
      <p align="center"><em>Side-by-side experiment comparison with detailed metrics</em></p>
    </td>
  </tr>
</table>

---

## 📖 CLI Reference

### Run Evaluation

```bash
evalmeter run [OPTIONS]

Options:
  -d, --data PATH       Path to data file (required)
  -e, --experiment TEXT Experiment name
  -p, --project TEXT    Project ID for grouping
  -c, --comments TEXT   Experiment notes
  --evals TEXT          Comma-separated evaluators
  --model TEXT          Bedrock model ID
  --region TEXT         AWS region (default: us-east-1)
```

### List Experiments

```bash
evalmeter list [OPTIONS]

Options:
  -n, --limit INTEGER   Number to show (default: 10)
```

### Show Details

```bash
evalmeter show EXPERIMENT_ID
```

### List Evaluators

```bash
evalmeter evaluators
```

### Start API Server

```bash
evalmeter-api
```

---

## 🎯 Use Cases

### Question Answering
```bash
evalmeter run --data qa.csv \
  --evals "cosine_similarity,factuality,relevance,completeness"
```

### Text Generation
```bash
evalmeter run --data generation.csv \
  --evals "bleu,rouge,cosine_similarity,coherence"
```

### Summarization
```bash
evalmeter run --data summaries.csv \
  --evals "rouge,cosine_similarity,coherence"
```

---

## 💰 Cost Considerations

| Evaluator Type | Cost | Speed |
|----------------|------|-------|
| Heuristic | Free | ⚡⚡⚡ Instant |
| Statistical | Free | ⚡⚡⚡ Instant |
| Cosine Similarity | AWS Bedrock (Titan Embeddings) | ⚡⚡ Fast |
| LLM-as-Judge | AWS Bedrock (Claude) | ⚡ Moderate |

**Pricing**: See [AWS Bedrock Pricing](https://aws.amazon.com/bedrock/pricing/) for current rates.

**Recommendation**: Start with free metrics, add cosine similarity for semantic understanding, use LLM judges for final validation.

---

## 📂 Project Structure

```
evalmeter/
├── evalmeter/           # Main package
│   ├── core/           # Core evaluation logic
│   │   ├── evaluators/ # All evaluator implementations
│   │   ├── data_loader.py
│   │   └── evaluator.py
│   ├── storage/        # Database and models
│   ├── api/            # FastAPI server
│   ├── utils/          # Configuration and utilities
│   └── cli.py          # CLI interface
├── ui/                 # React web interface
├── examples/           # Example data and scripts
├── docs/               # Documentation
└── tests/              # Test suite
```

---

## 🗄️ Data Storage

EvalMeter uses SQLite for local storage:

- **Location**: `~/.evalmeter/evalmeter.db`
- **Tables**: experiments, results, metrics
- **Capacity**: Millions of records
- **No external dependencies**

---

## 📚 Documentation

- **Quick Start**: This README
- **Evaluators Guide**: See `docs/EVALUATORS.md`
- **Project Tracking**: See `docs/PROJECT_TRACKING.md`
- **Examples**: See `examples/` directory

---

## 🤝 Contributing

Contributions are welcome! Please:

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests
5. Submit a pull request

See `CONTRIBUTING.md` for detailed guidelines.

---

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

---

## 🙏 Acknowledgments

- **AWS Bedrock** - For providing Claude and Titan models
- **Anthropic** - For Claude Sonnet 4.5
- **Amazon** - For Titan Embeddings V2
- **NLTK, Rouge, Levenshtein** - For statistical metrics

---

## 📞 Support

- **Issues**: [GitHub Issues](https://github.com/RamprasathS/evalmeter/issues)
- **Discussions**: [GitHub Discussions](https://github.com/RamprasathS/evalmeter/discussions)

---

## 🌟 Star History

If you find EvalMeter useful, please consider giving it a star on GitHub!

---

**Made with ❤️ for the AI community**
