Metadata-Version: 2.4
Name: praisonaibench
Version: 0.0.13
Summary: Simple LLM Benchmarking Tool using PraisonAI Agents
Project-URL: Homepage, https://github.com/MervinPraison/praisonaibench
Project-URL: Repository, https://github.com/MervinPraison/praisonaibench
Project-URL: Documentation, https://github.com/MervinPraison/praisonaibench#readme
Project-URL: Issues, https://github.com/MervinPraison/praisonaibench/issues
Author: MervinPraison
Maintainer: MervinPraison
License: MIT
Keywords: agents,ai,benchmark,llm,testing
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.8
Requires-Dist: praisonaiagents[llm]>=0.0.1
Requires-Dist: pyyaml>=6.0
Provides-Extra: all
Requires-Dist: black>=22.0; extra == 'all'
Requires-Dist: duckduckgo-search>=3.0; extra == 'all'
Requires-Dist: flake8>=4.0; extra == 'all'
Requires-Dist: mypy>=1.0; extra == 'all'
Requires-Dist: pre-commit>=3.0; extra == 'all'
Requires-Dist: pytest>=7.0; extra == 'all'
Provides-Extra: dev
Requires-Dist: black>=22.0; extra == 'dev'
Requires-Dist: flake8>=4.0; extra == 'dev'
Requires-Dist: mypy>=1.0; extra == 'dev'
Requires-Dist: pre-commit>=3.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Provides-Extra: tools
Requires-Dist: duckduckgo-search>=3.0; extra == 'tools'
Description-Content-Type: text/markdown

# PraisonAI Bench

🚀 **A simple, powerful LLM benchmarking tool built with PraisonAI Agents**

Benchmark any LiteLLM-compatible model with automatic HTML extraction, model-specific output organization, and flexible test suite management.

## 🎯 Testing Modes

| Feature | Single Test | Test Suite (YAML) |
|---------|-------------|-------------------|
| **📝 Description** | Run one prompt | Run multiple tests from YAML file |
| **🔧 Command** | `praisonaibench --test "prompt"` | `praisonaibench --suite tests.yaml` |
| **📊 Evaluation** | ✅ Enabled (Browser + LLM Judge) | ✅ Enabled (Browser + LLM Judge) |
| **🎨 HTML Extraction** | ✅ Auto-extracted | ✅ Auto-extracted |
| **📁 Output** | Single JSON result | Batch JSON results |
| **🖼️ Screenshots** | ✅ Generated | ✅ Generated |
| **⚡ Console Errors** | ✅ Detected | ✅ Detected |
| **🤖 LLM Judge** | ✅ gpt-5.1 quality scoring | ✅ gpt-5.1 quality scoring |
| **🔄 Retry Logic** | ✅ 3 attempts | ✅ 3 attempts |
| **⚡ Parallel Execution** | N/A | ✅ `--concurrent N` |
| **💰 Cost Tracking** | ✅ Token & cost per test | ✅ Cumulative cost summary |
| **📊 Export Formats** | JSON | JSON, CSV |
| **📈 HTML Reports** | N/A | ✅ `--report` |
| **🎯 Use Case** | Quick testing | Comprehensive benchmarking |

### 🔍 What's Included in Evaluation?

Our **research-backed hybrid evaluation system** provides comprehensive quality assessment:

| Component | What It Does | Score Weight |
|-----------|--------------|--------------|
| **📝 HTML Validation** | Static structure validation, DOCTYPE, required tags | 15% |
| **🌐 Functional** | Browser rendering, console errors, render time | 40% |
| **🎯 Expected Result** | Objective comparison (optional, for factual tasks) | 20%* |
| **🎨 Quality (LLM)** | Code quality, completeness, best practices | 25% |
| **📊 Overall** | Combined score (0-100) with pass/fail (≥70) | 100% |

*When `expected` field is not provided, weights are automatically normalized (HTML: 18.75%, Functional: 50%, LLM: 31.25%)

**Example Output (with expected)**:
```
HTML Validation: 90/100 ✅ Valid structure
Functional: 85/100 (renders ✅, 1 error, <1s)
Expected: 95/100 (95% similarity with expected result)
Quality: 80/100 (good structure, minor issues)
Overall: 87/100 ✅ PASSED
```

**Example Output (without expected)**:
```
HTML Validation: 90/100 ✅ Valid structure
Functional: 85/100 (renders ✅, 1 error, <1s)
Expected: N/A (not provided)
Quality: 80/100 (good structure, minor issues)
Overall: 85/100 ✅ PASSED
```

## ✨ Key Features

- 🎯 **Any LLM Model** - OpenAI, Anthropic, Google, XAI, local models via LiteLLM
- 🔄 **Single Agent Design** - Your prompt becomes the instruction (no complex configs)
- 💾 **Auto HTML Extraction** - Automatically saves HTML code from responses
- 📁 **Smart Organization** - Model-specific output folders (`output/gpt-4o/`, `output/xai/grok-code-fast-1/`)
- 🎛️ **Flexible Testing** - Run single tests, full suites, or filter specific tests
- ⚡ **Parallel Execution** - Run tests concurrently with `--concurrent N` for faster benchmarking
- 💰 **Cost & Token Tracking** - Automatic token usage and cost calculation for all supported models
- 📊 **Multiple Export Formats** - Export results as JSON or CSV for easy analysis
- 📈 **HTML Dashboard Reports** - Beautiful visual reports with interactive charts using `--report`
- 🛠️ **Modern Tooling** - Built with `pyproject.toml` and `uv` package manager
- 📋 **Comprehensive Results** - Complete metrics with timing, success rates, costs, and metadata

## 🚀 Quick Start

### 1. Install from PyPI (Recommended)

```bash
pip install praisonaibench
```

[![PyPI](https://img.shields.io/pypi/v/praisonaibench)](https://pypi.org/project/praisonaibench/)

### 2. Install with uv

```bash
# Clone the repository
git clone https://github.com/MervinPraison/praisonaibench
cd praisonaibench

# Install with uv
uv sync

# Or install in development mode
uv pip install -e .
```

### 3. Alternative: Install with pip (from source)

```bash
git clone https://github.com/MervinPraison/praisonaibench
cd praisonaibench
pip install -e .
```

### 4. Set Your API Keys

```bash
# OpenAI
export OPENAI_API_KEY=your_openai_key

# XAI (Grok)
export XAI_API_KEY=your_xai_key

# Anthropic
export ANTHROPIC_API_KEY=your_anthropic_key

# Google
export GOOGLE_API_KEY=your_google_key
```

### 4. Run Your First Benchmark

```python
from praisonaibench import Bench

# Create benchmark suite
bench = Bench()

# Run a simple test
result = bench.run_single_test("What is 2+2?")
print(result['response'])

# Run with specific model
result = bench.run_single_test(
    "Create a rotating cube HTML file", 
    model="xai/grok-code-fast-1"
)

# Get summary
summary = bench.get_summary()
print(summary)
```

## 📁 Project Structure

```
praisonaibench/
├── pyproject.toml           # Modern Python packaging
├── src/praisonaibench/      # Source code
│   ├── __init__.py          # Main imports
│   ├── bench.py             # Core benchmarking engine
│   ├── agent.py             # LLM agent wrapper
│   ├── cli.py               # Command line interface
│   └── version.py           # Version info
├── examples/                # Example configurations
│   ├── threejs_simulation_suite.yaml
│   └── config_example.yaml
└── output/                  # Generated results
    ├── gpt-4o/             # Model-specific HTML files
    ├── xai/grok-code-fast-1/
    └── benchmark_results_*.json
```

## 💻 CLI Usage

### Basic Commands

```bash
# Single test with default model
praisonaibench --test "Explain quantum computing"

# Single test with specific model
praisonaibench --test "Write a poem" --model gpt-4o

# Use any LiteLLM-compatible model
praisonaibench --test "Create HTML" --model xai/grok-code-fast-1
praisonaibench --test "Write code" --model gemini/gemini-1.5-flash-8b
praisonaibench --test "Analyze data" --model claude-3-sonnet-20240229
```

### Test Suites

```bash
# Run entire test suite
praisonaibench --suite examples/threejs_simulation_suite.yaml

# Run specific test from suite
praisonaibench --suite examples/threejs_simulation_suite.yaml --test-name "rotating_cube_simulation"

# Run suite with specific model (overrides individual test models)
praisonaibench --suite tests.yaml --model xai/grok-code-fast-1

# Run tests in parallel (3 concurrent workers)
praisonaibench --suite tests.yaml --concurrent 3
```

### Cross-Model Comparison

```bash
# Compare across multiple models
praisonaibench --cross-model "Write a poem" --models gpt-4o,gpt-3.5-turbo,xai/grok-code-fast-1
```

### Extract HTML from Results

```bash
# Extract HTML from existing benchmark results
praisonaibench --extract output/benchmark_results_20250829_160426.json
# → Processes JSON file and saves any HTML content to .html files

# Works with any benchmark results JSON file
praisonaibench --extract my_results.json
```

### HTML Generation Examples

```bash
# Generate Three.js simulation (auto-saves HTML)
praisonaibench --test "Create a rotating cube HTML with Three.js" --model gpt-4o
# → Saves to: output/gpt-4o/test_cube.html

# Run Three.js test suite
praisonaibench --suite examples/threejs_simulation_suite.yaml --model xai/grok-code-fast-1
# → Saves to: output/xai/grok-code-fast-1/rotating_cube_simulation.html
```

### Cost & Token Tracking

Automatically track token usage and costs for all LLM API calls:

```bash
# Run tests with automatic cost tracking
praisonaibench --suite tests.yaml --model gpt-4o

# Output includes per-test costs:
💰 Cost: $0.002400 (1250 tokens)

# Summary shows total costs:
📊 Summary:
   Total tests: 4
   Success rate: 100.0%
   Average time: 8.42s

💰 Cost Summary:
   Total tokens: 5,420
   Total cost: $0.0124

   By model:
     gpt-4o: $0.0124 (5,420 tokens)
```

**Supported models** include accurate pricing for:
- OpenAI (GPT-4o, GPT-4, GPT-3.5, O1)
- Anthropic (Claude 3 family)
- Google (Gemini 1.5 family)
- XAI (Grok models)
- Groq (optimised models)

Token usage is extracted from API responses when available, or estimated from text length. Costs are calculated using official provider pricing (updated December 2024).

### CSV Export

Export benchmark results to CSV for spreadsheet analysis:

```bash
# Export to CSV format
praisonaibench --suite tests.yaml --format csv

# Results saved to: output/csv/benchmark_results_20241211_123456.csv
```

**CSV includes:**
- Test names and status
- Model information
- Execution times
- Token usage (input/output/total)
- Costs per test
- Evaluation scores
- Prompts and response lengths
- Error messages (if any)

Perfect for:
- Spreadsheet analysis in Excel/Google Sheets
- Data visualization tools
- Statistical analysis
- Sharing results with non-technical stakeholders

### HTML Dashboard Reports

Generate beautiful interactive reports with comprehensive visualizations inspired by the React UI:

```bash
# Generate enhanced HTML report after running tests
praisonaibench --suite tests.yaml --report

# Generate report from existing results (without re-running tests)
praisonaibench --report-from output/json/benchmark_results_20241211_123456.json

# Compare multiple test results
praisonaibench --compare result1.json result2.json result3.json

# Reports saved to: output/reports/
```

**Enhanced Report Includes:**

**📊 Dashboard Tab:**
- Summary cards with key metrics (tests, models, success rate, avg time, cost, tokens)
- Interactive charts:
  - Status distribution (success/failure)
  - Execution time by model
  - Evaluation scores (radar chart)
  - Errors & warnings

**🏆 Leaderboard Tab:**
- Model rankings with multiple criteria:
  - Overall Score (default)
  - Functional Score
  - Quality Score
  - Pass Rate
  - Speed (fastest first)
- Top 3 models highlighted with medals
- Detailed metrics per model (functional, quality, pass rate, time)
- Click criteria to re-rank dynamically

**⚖️ Comparison Tab:**
- Detailed side-by-side model comparison
- Comprehensive metrics table:
  - Overall score, functional score, quality score
  - Pass rate with color coding
  - Average execution time
  - Total errors and warnings count
- Full model names and stats

**📋 Results Tab:**
- Complete test results table
- Individual test status, scores, time, tokens, cost
- Sortable columns
- Status indicators

**Features:**
- 🎨 Modern UI with gradient headers and smooth transitions
- 📱 Fully responsive design
- ⚡ Fast and lightweight (no external dependencies)
- 🔄 Tab navigation for organized viewing
- 📊 Chart.js powered visualizations
- 🎯 Based on praisonaibench-ui React application
- 💾 Standalone HTML - works offline
- 📧 Easy to share via email or host on web

**Comparison Reports:**
Multi-run comparison shows:
- Side-by-side success rates
- Performance trends
- Cost and token usage evolution
- Model improvements over time

Open any generated HTML file in any modern browser!

## 📋 Test Suite Format

### Basic Test Suite (`tests.yaml`)

```yaml
tests:
  - name: "math_test"
    prompt: "What is 15 * 23?"
    expected: "345"  # Optional: for objective comparison
  
  - name: "creative_test"
    prompt: "Write a short story about a robot"
    # No expected field - subjective task
  
  - name: "model_specific_test"
    prompt: "Explain quantum physics"
    model: "gpt-4o"
```

**Using the `expected` field**:
- ✅ **Use for**: Factual questions, math problems, code output, deterministic tasks
- ❌ **Skip for**: Creative tasks, open-ended questions, visual/interactive content
- When provided: Adds 20% objective scoring based on similarity
- When omitted: Weights automatically normalize (no penalty)

### Advanced Test Suite with Full Config Support

```yaml
# Global LLM configuration (applies to all tests)
config:
  max_tokens: 4000
  temperature: 0.7
  top_p: 0.9
  frequency_penalty: 0.0
  presence_penalty: 0.0
  # Any LiteLLM-compatible parameter is supported!

tests:
  - name: "creative_writing"
    prompt: "Write a detailed sci-fi story"
    model: "gpt-4o"
  
  - name: "code_generation"
    prompt: "Create a Python web scraper"
    model: "xai/grok-code-fast-1"
```

### Three.js HTML Generation Suite

```yaml
# examples/threejs_simulation_suite.yaml
tests:
  - name: "rotating_cube_simulation"
    prompt: |
      Create a complete HTML file with Three.js that displays a rotating 3D cube.
      The cube should have different colored faces, rotate continuously, and include proper lighting.
      The HTML file should be self-contained with Three.js loaded from CDN.
      Include camera controls for user interaction.
      Save the output as 'rotating_cube.html'.
    
  - name: "particle_system"
    prompt: |
      Create an HTML file with Three.js showing an animated particle system.
      Include 1000+ particles with random colors, movement, and physics.
      Add mouse interaction to influence particle behavior.
      
  - name: "terrain_simulation"
    prompt: |
      Create a Three.js HTML file with a procedurally generated terrain landscape.
      Include realistic textures, lighting, and a first-person camera.
      Add fog effects and animated elements.
      
  - name: "solar_system"
    prompt: |
      Create a Three.js solar system simulation in HTML.
      Include the sun, planets with realistic orbits, textures, and lighting.
      Add controls to speed up/slow down time.
```

## 🔧 Configuration

### Basic Configuration (`config.yaml`)

```yaml
# Default model (can be overridden per test)
default_model: "gpt-4o"

# Output settings
output_format: "json"
save_results: true
output_dir: "output"

# Performance settings
max_retries: 3
timeout: 60
```

### Supported Models

PraisonAI Bench supports **any LiteLLM-compatible model**:

```yaml
# OpenAI Models
- gpt-4o
- gpt-4o-mini
- gpt-3.5-turbo

# Anthropic Models
- claude-3-opus-20240229
- claude-3-sonnet-20240229
- claude-3-haiku-20240307

# Google Models
- gemini/gemini-1.5-pro
- gemini/gemini-1.5-flash
- gemini/gemini-1.5-flash-8b

# XAI Models
- xai/grok-beta
- xai/grok-code-fast-1

# Local Models (via LM Studio, Ollama, etc.)
- ollama/llama2
- openai/gpt-3.5-turbo  # with OPENAI_API_BASE set
```

## 📊 Results & Output

### Automatic HTML Extraction

When LLM responses contain HTML code blocks, they're automatically extracted and saved:

```
output/
├── gpt-4o/
│   ├── rotating_cube_simulation.html
│   └── particle_system.html
├── xai/
│   └── grok-code-fast-1/
│       ├── terrain_simulation.html
│       └── solar_system.html
└── benchmark_results_20250829_160426.json
```

### JSON Results Format

```json
[
  {
    "test_name": "rotating_cube_simulation",
    "prompt": "Create a complete HTML file with Three.js...",
    "response": "<!DOCTYPE html>\n<html>\n...",
    "model": "xai/grok-code-fast-1",
    "agent_name": "BenchAgent",
    "execution_time": 8.24,
    "status": "success",
    "timestamp": "2025-08-29 16:04:26"
  }
]
```

### Summary Statistics

```bash
📊 Summary:
   Total tests: 4
   Success rate: 100.0%
   Average time: 12.34s
Results saved to: output/benchmark_results_20250829_160426.json
```

## 🎯 Advanced Features

### 🔄 **Universal Model Support**
- Works with **any LiteLLM-compatible model**
- No hardcoded model restrictions
- Automatic API key detection

### 💾 **Smart HTML Handling**
- Auto-detects HTML in multiple formats:
  - Markdown-wrapped HTML (```html...```)
  - Truncated HTML blocks (incomplete responses)
  - Raw HTML content (direct HTML responses)
- Extracts and saves as `.html` files automatically
- Organizes by model in separate folders
- Extract HTML from existing benchmark results with `--extract`
- Perfect for Three.js, React, or any web development benchmarks

### 🎛️ **Flexible Test Management**
- Run entire suites or filter specific tests
- Override models per test or globally
- Cross-model comparisons with detailed metrics

### ⚡ **Modern Development**
- Built with `pyproject.toml` (no legacy `setup.py`)
- Optimized for `uv` package manager
- Fast dependency resolution and installation

### 🏗️ **Simple Architecture**
- **Single Agent Design** - Your prompt becomes the instruction
- **No Complex Configs** - Just write your test prompts
- **Minimal Dependencies** - Only what you need

## 🚀 Use Cases

### Web Development Benchmarking
```bash
# Test HTML/CSS/JS generation across models
praisonaibench --suite web_dev_suite.yaml --model gpt-4o
```

### Code Generation Comparison
```bash
# Compare coding abilities
praisonaibench --cross-model "Write a Python web scraper" --models gpt-4o,claude-3-sonnet-20240229,xai/grok-code-fast-1
```

### Creative Content Testing
```bash
# Test creative writing
praisonaibench --test "Write a sci-fi short story" --model gemini/gemini-1.5-pro
```

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch: `git checkout -b feature-name`
3. Install dependencies: `uv sync`
4. Make your changes
5. Run tests: `uv run pytest`
6. Submit a pull request

## 📄 License

MIT License - see LICENSE file for details.

---

**Perfect for developers who need powerful, flexible LLM benchmarking with zero complexity!** 🚀
