Metadata-Version: 2.4
Name: praisonaibench-python
Version: 0.1.0
Summary: Python code evaluator plugin for PraisonAI Bench
Project-URL: Homepage, https://github.com/MervinPraison/praisonaibench
Project-URL: Repository, https://github.com/MervinPraison/praisonaibench
Project-URL: Documentation, https://github.com/MervinPraison/praisonaibench#readme
Project-URL: Issues, https://github.com/MervinPraison/praisonaibench/issues
Author: PraisonAI
Maintainer: PraisonAI
License: MIT
License-File: LICENSE
Keywords: benchmark,evaluator,llm,plugin,praisonaibench,python
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Software Development :: Quality Assurance
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.8
Requires-Dist: praisonaibench>=0.1.0
Provides-Extra: dev
Requires-Dist: black>=22.0; extra == 'dev'
Requires-Dist: flake8>=4.0; extra == 'dev'
Requires-Dist: mypy>=1.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Description-Content-Type: text/markdown

# PraisonAI Bench Python Evaluator Plugin

🐍 A comprehensive Python code evaluation plugin for [PraisonAI Bench](https://github.com/MervinPraison/praisonaibench)

[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

## 🎯 Overview

The Python Evaluator Plugin enables PraisonAI Bench to evaluate generated Python code through a three-stage assessment:

- ✅ **Syntax Validation** (30 points) - AST-based Python syntax checking
- ✅ **Code Execution** (40 points) - Safe subprocess execution with timeout protection
- ✅ **Output Comparison** (30 points) - Fuzzy matching with expected results

**Total Score: 0-100** | **Pass Threshold: ≥70**

## 🚀 Quick Start

### Installation

#### Using uv (Recommended)

```bash
# Clone or download the plugin
cd praisonaibench-python

# Install with uv
uv pip install -e .
```

#### Using pip

```bash
# Install from directory
cd praisonaibench-python
pip install -e .
```

#### Verify Installation

```bash
# Check that the plugin is registered
python -c "from praisonaibench_python import PythonEvaluator; print('Plugin loaded successfully!')"
```

### Configuration

Create a `.env` file (or copy from `.env.example`):

```bash
# OpenAI API Key for LLM-based benchmarking
OPENAI_API_KEY=your_api_key_here

# Default model
DEFAULT_MODEL=gpt-4o-mini

# Execution timeout (seconds)
PYTHON_EXECUTION_TIMEOUT=5
```

### Basic Usage

Create a test suite file `tests.yaml`:

```yaml
tests:
  - name: "hello_world"
    language: "python"
    prompt: "Write Python code that prints 'Hello World'"
    expected: "Hello World"
  
  - name: "calculate_factorial"
    language: "python"
    prompt: "Write a Python function that calculates factorial of 5"
    expected: "120"
```

Run the benchmarks:

```bash
praisonaibench --suite tests.yaml --model gpt-4o-mini
```

## 📊 Evaluation System

### Scoring Breakdown

The evaluator uses a three-stage assessment system:

| Stage | Points | Description |
|-------|--------|-------------|
| **Syntax Validation** | 30 | AST parsing, import detection |
| **Code Execution** | 40 | Safe subprocess execution, error capture |
| **Output Comparison** | 30 | Fuzzy matching with expected output |
| **Total** | **100** | Combined score |

**Pass Threshold**: 70/100 points
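
As an illustration of the first stage, syntax validation and import detection can be done with Python's standard `ast` module. This is a minimal sketch, assuming a hypothetical `check_syntax` helper and the 30-point value from the table above; it is not the plugin's actual implementation.

```python
import ast


def check_syntax(code: str) -> dict:
    """Validate Python syntax and list imported modules (illustrative sketch)."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return {"valid": False, "points": 0, "imports": [], "error": str(exc)}

    imports = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imports.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            imports.append(node.module)
    return {"valid": True, "points": 30, "imports": imports, "error": None}
```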

### Scoring Examples

#### Example 1: Perfect Score (100/100)

```
# Code: print("Hello World")
# Expected: "Hello World"

✅ Syntax: 30 points (valid Python)
✅ Execution: 40 points (runs successfully)
✅ Output: 30 points (exact match)
─────────────────────────
Total: 100/100 ✅ PASSED
```

#### Example 2: Partial Score (70/100)

```
# Code: print("Hello")
# Expected: "Hello World"

✅ Syntax: 30 points (valid Python)
✅ Execution: 40 points (runs successfully)
⚠️  Output: 0 points (different output)
─────────────────────────
Total: 70/100 ✅ PASSED
```

#### Example 3: Failure (30/100)

```
# Code: print(undefined_variable)
# Expected: "Hello World"

✅ Syntax: 30 points (valid syntax)
❌ Execution: 0 points (NameError)
❌ Output: 0 points (didn't execute)
─────────────────────────
Total: 30/100 ❌ FAILED
```

## 📖 Usage Guide

### Python API

```python
from praisonaibench_python import PythonEvaluator

# Create evaluator
evaluator = PythonEvaluator(timeout=5)

# Evaluate code
result = evaluator.evaluate(
    code='print("Hello World")',
    test_name="hello_test",
    prompt="Write Python code that prints Hello World",
    expected="Hello World"
)

# Check results
print(f"Score: {result['score']}/100")
print(f"Passed: {result['passed']}")

# View feedback
for item in result['feedback']:
    print(f"{item['level']}: {item['message']}")

# Access details
print(f"Output: {result['details']['output']}")
print(f"Score breakdown: {result['details']['score_breakdown']}")
```

### Test Suite Format

#### Simple Test

```yaml
tests:
  - name: "basic_math"
    language: "python"
    prompt: "Calculate 15 * 23 and print the result"
    expected: "345"
```

#### Advanced Test

```yaml
tests:
  - name: "fibonacci"
    language: "python"
    prompt: |
      Write a Python function that calculates the nth Fibonacci number.
      Calculate and print the 10th Fibonacci number.
    expected: "55"
```

#### Test Without Expected Output

```yaml
tests:
  - name: "creative_code"
    language: "python"
    prompt: "Write a Python class for a simple calculator"
    # No expected field - evaluation based on syntax and execution only
```

### Command Line Interface

```bash
# Run single test suite
praisonaibench --suite examples/simple_tests.yaml --model gpt-4o-mini

# Run with specific model
praisonaibench --suite examples/advanced_tests.yaml --model gpt-4o

# Run with custom configuration
praisonaibench --suite tests.yaml --config custom_config.yaml
```

## 🎨 Features

### Security Features

- ✅ **Subprocess Isolation** - Code runs in separate process
- ✅ **Timeout Protection** - Configurable execution timeout (default: 5s; see the sketch below)
- ✅ **Resource Limits** - Prevents infinite loops and resource exhaustion
- ✅ **Error Handling** - Graceful handling of syntax errors, runtime exceptions, and timeouts
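
A minimal sketch of how such sandboxed execution can work, assuming a hypothetical `run_code` helper; the plugin's actual internals may differ:

```python
import subprocess
import sys


def run_code(code: str, timeout: int = 5) -> dict:
    """Run code in a separate interpreter process with a hard timeout (illustrative sketch)."""
    try:
        completed = subprocess.run(
            [sys.executable, "-c", code],  # isolated subprocess, not the host interpreter
            capture_output=True,
            text=True,
            timeout=timeout,               # terminates runaway or infinite-loop code
        )
        return {
            "executed": completed.returncode == 0,
            "output": completed.stdout.strip(),
            "error": completed.stderr.strip(),
        }
    except subprocess.TimeoutExpired:
        return {"executed": False, "output": "", "error": f"Timed out after {timeout}s"}
```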

### Code Extraction

The evaluator automatically extracts code from several response formats:

- **Markdown Python code blocks** - fenced blocks tagged `python`
- **Generic code blocks** - fenced blocks with no language tag
- **Raw code** - responses that contain only code, e.g. `print('Hello')`
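
For illustration, extraction can be done with a couple of regular expressions. This sketch assumes a hypothetical `extract_code` helper; the plugin's actual extraction logic may differ:

```python
import re


def extract_code(response: str) -> str:
    """Pull Python code out of an LLM response (illustrative sketch)."""
    # Prefer a fenced block explicitly tagged as Python.
    match = re.search(r"```python\s*\n(.*?)```", response, re.DOTALL)
    if match:
        return match.group(1).strip()
    # Fall back to any generic fenced block.
    match = re.search(r"```\s*\n(.*?)```", response, re.DOTALL)
    if match:
        return match.group(1).strip()
    # Otherwise treat the whole response as raw code.
    return response.strip()
```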

### Output Comparison

Output comparison uses a fuzzy matching algorithm (sketched below):

- **Exact match**: 30/30 points
- **High similarity** (>80%): 25-29 points
- **Medium similarity** (50-80%): 15-24 points
- **Low similarity** (<50%): 0-14 points

Features:
- Case-insensitive comparison
- Whitespace normalisation
- Substring matching (e.g., "345" in "The answer is 345")
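
A minimal sketch of such a comparison using the standard library's `difflib`; the `score_output` name and the proportional scoring are assumptions that only approximate the tiers above:

```python
from difflib import SequenceMatcher


def score_output(actual: str, expected: str, max_points: int = 30) -> int:
    """Score how closely actual output matches the expected output (illustrative sketch)."""
    actual_norm = " ".join(actual.lower().split())      # case and whitespace normalisation
    expected_norm = " ".join(expected.lower().split())

    if actual_norm == expected_norm or expected_norm in actual_norm:
        return max_points                               # exact or substring match
    similarity = SequenceMatcher(None, actual_norm, expected_norm).ratio()
    return int(similarity * max_points)                 # partial credit for near misses
```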

### Detailed Feedback

```python
{
  "score": 85,
  "passed": True,
  "feedback": [
    {"level": "success", "message": "✅ Valid Python syntax"},
    {"level": "info", "message": "📦 Imports: math, sys"},
    {"level": "success", "message": "✅ Code executed successfully"},
    {"level": "info", "message": "📤 Output: Hello World"},
    {"level": "warning", "message": "⚠️  Output partially matches expected"}
  ],
  "details": {
    "extracted_code": "print('Hello World')",
    "executed": True,
    "output": "Hello World",
    "similarity": 0.95,
    "score_breakdown": {
      "syntax": 30,
      "execution": 40,
      "output_match": 28
    }
  }
}
```

## 📚 Examples

### Example 1: Hello World

```yaml
tests:
  - name: "hello_world"
    language: "python"
    prompt: "Write Python code that prints 'Hello World'"
    expected: "Hello World"
```

### Example 2: Factorial Function

```yaml
tests:
  - name: "factorial"
    language: "python"
    prompt: |
      Write a Python function that calculates the factorial of a number.
      Calculate factorial(5) and print the result.
    expected: "120"
```

### Example 3: List Operations

```yaml
tests:
  - name: "list_sum"
    language: "python"
    prompt: |
      Create a list [1, 2, 3, 4, 5], calculate the sum, and print it.
    expected: "15"
```

More examples available in:
- `examples/simple_tests.yaml` - Basic Python tests
- `examples/advanced_tests.yaml` - Complex Python challenges
- `examples/algorithm_tests.yaml` - Algorithm implementations

## 🧪 Testing

### Run Unit Tests

```bash
# Install development dependencies
uv pip install -e ".[dev]"

# Run all tests
pytest tests/ -v

# Run specific test file
pytest tests/test_evaluator.py -v

# Run with coverage
pytest tests/ --cov=praisonaibench_python --cov-report=html
```

### Test Coverage

The plugin includes comprehensive tests:

- ✅ **Unit Tests** (`tests/test_evaluator.py`)
  - Code extraction
  - Syntax validation
  - Code execution
  - Output comparison
  - Error handling
  - Timeout protection

- ✅ **Integration Tests** (`tests/test_integration.py`)
  - Plugin interface compatibility
  - Multiple test scenarios
  - Concurrent evaluations
  - Large output handling
  - Import support

## 🔧 Configuration

### Environment Variables

```bash
# Required
OPENAI_API_KEY=your_api_key_here

# Optional
DEFAULT_MODEL=gpt-4o-mini
PYTHON_EXECUTION_TIMEOUT=5
PYTHON_EXECUTABLE=/path/to/python  # Leave empty for system default
```
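
For illustration, these settings could be consumed like this; the variable names match the list above, but the loading code itself is an assumption about how the plugin reads them:

```python
import os

# Illustrative sketch: read optional settings with sensible fallbacks
timeout = int(os.getenv("PYTHON_EXECUTION_TIMEOUT", "5"))
python_executable = os.getenv("PYTHON_EXECUTABLE") or None  # None -> system default interpreter
default_model = os.getenv("DEFAULT_MODEL", "gpt-4o-mini")
```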

### Programmatic Configuration

```python
from praisonaibench_python import PythonEvaluator

# Custom timeout
evaluator = PythonEvaluator(timeout=10)

# Custom Python executable
evaluator = PythonEvaluator(
    timeout=5,
    python_executable="/usr/bin/python3.11"
)
```

## 🏗️ Architecture

### Plugin Structure

```
praisonaibench-python/
├── src/praisonaibench_python/
│   ├── __init__.py          # Plugin exports
│   ├── evaluator.py         # Main evaluator class
│   └── version.py           # Version info
├── tests/
│   ├── test_evaluator.py    # Unit tests
│   └── test_integration.py  # Integration tests
├── examples/
│   ├── simple_tests.yaml
│   ├── advanced_tests.yaml
│   └── algorithm_tests.yaml
├── pyproject.toml           # Project configuration
├── .env                     # Configuration
└── README.md               # This file
```

### Class Hierarchy

```
BaseEvaluator (from praisonaibench)
    └── PythonEvaluator
        ├── get_language() → 'python'
        ├── get_file_extension() → 'py'
        └── evaluate(code, test_name, prompt, expected) → dict
```
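
A skeleton subclass following this interface might look like the sketch below; the `BaseEvaluator` import path and method signatures are inferred from the hierarchy above and the Python API section, and should be treated as assumptions:

```python
# Import path assumed from the hierarchy above; check the plugin system docs for the real location.
from praisonaibench import BaseEvaluator


class MinimalPythonEvaluator(BaseEvaluator):
    """Skeleton evaluator implementing the interface shown above (illustrative sketch)."""

    def get_language(self) -> str:
        return "python"

    def get_file_extension(self) -> str:
        return "py"

    def evaluate(self, code, test_name, prompt, expected=None) -> dict:
        # A real implementation returns the score/feedback/details structure
        # documented in the "Detailed Feedback" section.
        return {"score": 0, "passed": False, "feedback": [], "details": {}}
```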

## 🤝 Contributing

Contributions are welcome! Here's how:

1. Fork the repository
2. Create a feature branch: `git checkout -b feature-name`
3. Make your changes
4. Run tests: `pytest tests/ -v`
5. Format code: `black src/ tests/`
6. Submit a pull request

### Development Setup

```bash
# Clone repository
git clone https://github.com/YourUsername/praisonaibench-python
cd praisonaibench-python

# Install in development mode
uv pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Format code
black src/ tests/
```

## 📄 License

MIT License - see LICENSE file for details.

## 🔗 Links

- [PraisonAI Bench](https://github.com/MervinPraison/praisonaibench) - Main project
- [Plugin System Documentation](https://github.com/MervinPraison/praisonaibench/blob/main/PLUGIN_SYSTEM.md)
- [Issue Tracker](https://github.com/MervinPraison/praisonaibench/issues)

## 📞 Support

- **Issues**: [GitHub Issues](https://github.com/MervinPraison/praisonaibench/issues)
- **Documentation**: [PraisonAI Bench Docs](https://github.com/MervinPraison/praisonaibench#readme)
- **Community**: Join the discussion on GitHub

## 🎉 Acknowledgements

Built with ❤️ for the PraisonAI Bench community.

Special thanks to:
- [PraisonAI](https://github.com/MervinPraison) - For the amazing benchmarking framework
- Contributors and testers
- The Python community

---

**Ready to benchmark Python code generation? Install the plugin and start testing!** 🚀
