Metadata-Version: 2.4
Name: llmthinkbench
Version: 0.1.2
Summary: A framework for evaluating overthinking and basic reasoning capabilities of Large Language Models
Home-page: https://github.com/ctrl-gaurav/LLMThinkBench
Author: Gaurav Srivastava
Author-email: gauravhhh30@gmail.com
Project-URL: Bug Tracker, https://github.com/ctrl-gaurav/LLMThinkBench/issues
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: vllm>=0.2.0
Requires-Dist: transformers>=4.30.0
Requires-Dist: numpy>=1.20.0
Requires-Dist: tqdm>=4.64.0
Requires-Dist: tabulate>=0.9.0
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license-file
Dynamic: project-url
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# 🧠 LLMThinkBench: An Advanced Reasoning and Overthinking Evaluation Framework for Language Models 

![Python](https://img.shields.io/badge/Python-3.8%2B-blue)
![License](https://img.shields.io/badge/License-MIT-green)
![vLLM](https://img.shields.io/badge/Powered%20by-vLLM-orange)
![HuggingFace](https://img.shields.io/badge/HuggingFace-Compatible-yellow)

**LLMThinkBench** is an extensible framework for evaluating the basic reasoning capabilities and "overthinking" tendencies of Large Language Models. It runs standardized, reproducible benchmarks on core reasoning tasks and reports accuracy, instruction following, and response length for each one.

## 🌟 Key Features

- **Modular Architecture**: Easily extend with custom evaluation tasks
- **Efficient Inference**: Built on vLLM for high-throughput batched evaluation
- **Detailed Metrics**: Reports accuracy, instruction following, and response length (characters, words, tokens)
- **Multi-GPU Support**: Scale evaluations across multiple GPUs
- **Reproducible Results**: Consistent methodology across model comparisons

## 📊 Supported Tasks

| Task | Description | Metrics |
|------|-------------|---------|
| **Sorting** | Evaluates ability to correctly sort numerical lists of varying sizes | Accuracy, Instruction Following |
| **Comparison** | Tests number comparison abilities across different relationships | Accuracy across comparison types |
| **Custom Tasks** | Easily add your own evaluation tasks | Customizable metrics |
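
To make the two sorting metrics concrete, here is a minimal sketch of how a single response could be scored. `score_sorting` is a hypothetical helper written for illustration, not part of the package, and it assumes the task asks for ascending order.

```python
from typing import Optional, Sequence


def score_sorting(numbers: Sequence[int],
                  parsed_answer: Optional[Sequence[int]]) -> dict:
    """Score one sorting response (illustrative; not the package's own code)."""
    # Instruction following: an answer in the requested format was found at all.
    instruction_followed = parsed_answer is not None
    # Accuracy: the parsed list equals the ascending sort of the input (assumed order).
    correct = instruction_followed and list(parsed_answer) == sorted(numbers)
    return {"accuracy": int(correct), "instruction_followed": instruction_followed}


print(score_sorting([3, -1, 2], [-1, 2, 3]))  # well formatted and correct
print(score_sorting([3, -1, 2], [2, -1, 3]))  # well formatted but wrong
print(score_sorting([3, -1, 2], None))        # no parsable answer
```

Keeping the two metrics separate distinguishes a model that reasons incorrectly from one that reasons correctly but ignores the output format.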

## 🚀 Installation

```bash
# From PyPI
pip install llmthinkbench

# From source
git clone https://github.com/ctrl-gaurav/LLMThinkBench.git
cd LLMThinkBench
pip install -e .
```
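
After installing, one way to confirm that the distribution is visible to your interpreter is to query its metadata with the standard library. This is a small sketch; `installed_version` is a hypothetical helper name, and only `importlib.metadata` (Python 3.8+) is used.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


found = installed_version("llmthinkbench")
if found:
    print(f"llmthinkbench version: {found}")
else:
    print("llmthinkbench is not installed in this environment")
```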

## 📈 Quick Start

### Command Line Interface

```bash
# Basic usage with default parameters
llmthinkbench --model_id "Qwen/Qwen2.5-1.5B-Instruct" --tasks sorting comparison

# Comprehensive evaluation
llmthinkbench --model_id "meta-llama/Llama-2-7b-chat-hf" \
  --tensor_parallel_size 2 \
  --tasks sorting comparison \
  --datapoints 1000 \
  --list_sizes 8 16 32 64 \
  --folds 3 \
  --range -1000 1000 \
  --store_details \
  --output_dir "./my_evaluation_results"
```
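
When sweeping over several models or settings, it can be convenient to assemble the command line from Python. The sketch below only builds the argument list from the flags shown above; `build_command` is a hypothetical helper, not something the package provides.

```python
import shlex
from typing import Dict, List, Sequence, Union

Scalar = Union[str, int, float]
Value = Union[Scalar, bool, Sequence[Scalar]]


def build_command(options: Dict[str, Value]) -> List[str]:
    """Turn a dict of option names and values into an llmthinkbench argv list."""
    argv = ["llmthinkbench"]
    for name, value in options.items():
        flag = f"--{name}"
        if isinstance(value, bool):
            # Boolean switches such as --store_details take no value.
            if value:
                argv.append(flag)
        elif isinstance(value, (list, tuple)):
            # Multi-value options such as --tasks or --list_sizes.
            argv.append(flag)
            argv.extend(str(v) for v in value)
        else:
            argv.extend([flag, str(value)])
    return argv


command = build_command({
    "model_id": "Qwen/Qwen2.5-1.5B-Instruct",
    "tasks": ["sorting", "comparison"],
    "list_sizes": [8, 16],
    "range": [-1000, 1000],
    "store_details": True,
})
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` to launch the evaluation.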

### Python API

```python
from llmthinkbench.models.model_handler import ModelHandler
from llmthinkbench.tasks.sorting_task import SortingTask
from llmthinkbench.tasks.comparison_task import ComparisonTask
from llmthinkbench.utils.reporting import generate_final_report

# Initialize model
model_handler = ModelHandler(
    model_id="meta-llama/Llama-2-7b-chat-hf",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9
)

# Configure output directory
output_dir = "llama2_eval_results"

# Run sorting task
sorting = SortingTask(
    model_handler=model_handler,
    output_dir=output_dir,
    min_val=-100,
    max_val=100,
    num_folds=3,
    num_samples=500,
    store_details=True,
    temperature=0.7,
    top_p=0.9,
    max_tokens=512
)

# Evaluate multiple list sizes
list_sizes = [8, 16, 32]
sorting_metrics = sorting.run_evaluation(list_sizes)

# Run comparison task
comparison = ComparisonTask(
    model_handler=model_handler,
    output_dir=output_dir,
    min_val=-100,
    max_val=100,
    num_folds=3,
    num_samples=500,
    store_details=True,
    temperature=0.7,
    top_p=0.9,
    max_tokens=512
)

# Run evaluation
comparison_metrics = comparison.run_evaluation()

# Generate comprehensive report
all_metrics = sorting_metrics + comparison_metrics
report = generate_final_report(all_metrics, list_sizes, output_dir)
```

## 📝 Example Results

Below is an example report generated by LLMThinkBench:

```
+----------------+------------------+---------------+----------------------+-----------+-----------+-------------+
| Test Case      | Accuracy (Mean)  | Accuracy (Std)| Instruction Followed | Avg Chars | Avg Words | Avg Tokens  |
+----------------+------------------+---------------+----------------------+-----------+-----------+-------------+
| sorting_8      | 95.20%           | 3.60%         | 98.80%               | 612.57    | 93.45     | 186.23      |
| sorting_16     | 87.40%           | 4.80%         | 96.70%               | 982.32    | 167.85    | 312.45      |
| sorting_32     | 68.60%           | 7.20%         | 92.40%               | 1872.15   | 348.76    | 645.65      |
| comparison     | 99.20%           | 1.20%         | 99.60%               | 324.83    | 48.27     | 93.75       |
+----------------+------------------+---------------+----------------------+-----------+-----------+-------------+
```
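
The mean and standard deviation columns summarize accuracy over the evaluation folds. As a rough sketch of that aggregation, assuming one accuracy value per fold and a population standard deviation (the package may use a different estimator), the computation looks like this. The fold values are made up for illustration.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence


def summarize_folds(fold_accuracies: Sequence[float]) -> Dict[str, float]:
    """Aggregate per-fold accuracies into a mean and a population std."""
    return {"mean": mean(fold_accuracies), "std": pstdev(fold_accuracies)}


folds = [0.90, 0.95, 1.00]  # hypothetical per-fold accuracies
summary = summarize_folds(folds)
print(f"accuracy: {summary['mean']:.2%} (std {summary['std']:.2%})")
```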

## ⚙️ Advanced Configuration

### Command Line Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| `--model_id` | Hugging Face model ID | *Required* |
| `--tasks` | Tasks to evaluate | `["sorting"]` |
| `--datapoints` | Number of samples per test case | `1000` |
| `--folds` | Number of evaluation folds | `1` |
| `--range` | Number range for evaluation | `[-100, 100]` |
| `--list_sizes` | List sizes for sorting task | `[8]` |
| `--store_details` | Store detailed per-example results | `False` |
| `--output_dir` | Directory to save results | Auto-generated |
| `--tensor_parallel_size` | Number of GPUs to use | `1` |
| `--gpu_memory_utilization` | Fraction of GPU memory vLLM may use | `0.9` |
| `--temperature` | Sampling temperature | `0.7` |
| `--top_p` | Sampling top_p value | `0.9` |
| `--max_tokens` | Maximum tokens for sampling | `512` |
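
For readers who want to see how these options fit together, the table can be mirrored as an `argparse` parser. This is a reconstruction from the table above, not the package's actual parser, so types and multi-value handling are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser reconstructed from the parameter table (illustrative only)."""
    parser = argparse.ArgumentParser(prog="llmthinkbench")
    parser.add_argument("--model_id", required=True)
    parser.add_argument("--tasks", nargs="+", default=["sorting"])
    parser.add_argument("--datapoints", type=int, default=1000)
    parser.add_argument("--folds", type=int, default=1)
    # Two integers: lower and upper bound of the generated numbers.
    parser.add_argument("--range", type=int, nargs=2, default=[-100, 100],
                        metavar=("MIN", "MAX"))
    parser.add_argument("--list_sizes", type=int, nargs="+", default=[8])
    parser.add_argument("--store_details", action="store_true")
    parser.add_argument("--output_dir", default=None)
    parser.add_argument("--tensor_parallel_size", type=int, default=1)
    parser.add_argument("--gpu_memory_utilization", type=float, default=0.9)
    parser.add_argument("--temperature", type=float, default=0.7)
    parser.add_argument("--top_p", type=float, default=0.9)
    parser.add_argument("--max_tokens", type=int, default=512)
    return parser


args = build_parser().parse_args(
    ["--model_id", "Qwen/Qwen2.5-1.5B-Instruct", "--range", "-1000", "1000"]
)
print(args.tasks, args.range, args.store_details)
```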

## 🧩 Extending with Custom Tasks

LLMThinkBench is designed to be easily extensible. Here's how to create a custom evaluation task:

1. Create a new task module:

```python
# llmthinkbench/tasks/addition_task.py
import random
from ..utils.parsing import parse_boxed_answer
from .base_task import BaseTask

class AdditionTask(BaseTask):
    """Implementation of the addition task"""
    
    @property
    def task_name(self):
        return "addition"
    
    def generate_data(self):
        """Generate random number pairs for addition"""
        data = []
        for _ in range(self.num_samples):
            a = random.randint(self.min_val, self.max_val)
            b = random.randint(self.min_val, self.max_val)
            data.append({"a": a, "b": b, "sum": a + b})
        return data
    
    def create_prompt(self, data_point):
        """Create prompt for addition task"""
        return (f"Calculate the sum of these two numbers:\n\n"
                f"First number: {data_point['a']}\n"
                f"Second number: {data_point['b']}\n\n"
                f"Provide the result. Your final answer must be in the format "
                f"\\boxed{{result}} at the end.")
    
    def evaluate_response(self, response, data_point):
        """Evaluate model response for addition task"""
        boxed_answer = parse_boxed_answer(response)
        instruction_followed = boxed_answer is not None
        accuracy = 0
        
        if instruction_followed and len(boxed_answer) == 1:
            accuracy = 1 if boxed_answer[0] == data_point['sum'] else 0
        
        return {
            "num1": data_point['a'],
            "num2": data_point['b'],
            "expected_sum": data_point['sum'],
            "parsed_answer": boxed_answer[0] if boxed_answer and len(boxed_answer) > 0 else None,
            "accuracy": accuracy,
            "instruction_followed": instruction_followed
        }
    
    def run_evaluation(self):
        """Run evaluation for addition task"""
        all_metrics = []
        
        # Generate evaluation data
        data = self.generate_data()
        
        # Run each fold
        for fold in range(1, self.num_folds + 1):
            metrics = self.run_fold(data, "addition", fold)
            all_metrics.append(metrics)
        
        return all_metrics
```

2. Use your custom task:

```bash
llmthinkbench --model_id "meta-llama/Llama-2-7b-chat-hf" --tasks addition
```
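
`evaluate_response` in the task above relies on `parse_boxed_answer` returning either `None` or a list of parsed values. The real helper is not shown in this README, so the following is a hypothetical stand-in (`parse_boxed_ints`) that illustrates the contract with a regular expression. It assumes integer answers, comma-separated values, and no nested braces.

```python
import re
from typing import List, Optional

# Matches \boxed{...} with no nested braces inside.
_BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def parse_boxed_ints(response: str) -> Optional[List[int]]:
    """Return the integers in the last \\boxed{...} of a response, or None."""
    matches = _BOXED.findall(response)
    if not matches:
        return None  # the model did not follow the output format
    try:
        # Use the last box, since the prompt asks for the answer at the end.
        return [int(part) for part in matches[-1].split(",") if part.strip()]
    except ValueError:
        return None  # boxed content was not a list of integers


print(parse_boxed_ints("12 + 30 = 42, so the answer is \\boxed{42}"))
print(parse_boxed_ints("I am not sure."))
```

Returning `None` for both a missing box and unparsable content keeps the `instruction_followed` check in the task a single comparison.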

## 📊 Visualization

LLMThinkBench results can be visualized using any plotting library. Here's a simple example using pandas and matplotlib:

```python
import json
import matplotlib.pyplot as plt
import pandas as pd

# Load results
with open("final_report.json") as f:
    results = json.load(f)

# Create dataframe for plotting
data = []
for task, metrics in results.items():
    data.append({
        "Task": task,
        "Accuracy": metrics["accuracy"]["mean"] * 100,
        "Instruction Following": metrics["instruction_followed"]["mean"] * 100
    })

df = pd.DataFrame(data)

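# Plot results on an explicit Axes so the figure size is applied
# (DataFrame.plot would otherwise create its own figure)
fig, ax = plt.subplots(figsize=(12, 6))
df.plot(x="Task", y=["Accuracy", "Instruction Following"], kind="bar", ax=ax)
ax.set_title("LLMThinkBench Results")
ax.set_ylabel("Percentage")
ax.set_ylim(0, 100)
ax.grid(axis="y")
fig.tight_layout()
fig.savefig("results_comparison.png")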
```
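
The plotting script expects each task entry in the report to contain `accuracy` and `instruction_followed` objects with a `mean` field. If your report differs, a small check like the one below fails with a clearer message than a bare `KeyError`. `report_rows` is a hypothetical helper, and the sample values are made up.

```python
import json
from typing import Dict, List


def report_rows(results: Dict[str, dict]) -> List[dict]:
    """Flatten a report into plot-ready rows, validating the expected layout."""
    rows = []
    for task, metrics in results.items():
        try:
            rows.append({
                "Task": task,
                "Accuracy": metrics["accuracy"]["mean"] * 100,
                "Instruction Following": metrics["instruction_followed"]["mean"] * 100,
            })
        except (KeyError, TypeError) as exc:
            raise ValueError(f"unexpected report layout for task {task!r}") from exc
    return rows


# Made-up sample in the layout the plotting script assumes.
sample = json.loads(
    '{"sorting_8": {"accuracy": {"mean": 0.5}, "instruction_followed": {"mean": 1.0}}}'
)
print(report_rows(sample))
```

The returned list can be passed straight to `pd.DataFrame(...)` in place of the manual loop.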

## 🔍 Contributing

Contributions to LLMThinkBench are welcome! Please check out our [contributing guidelines](CONTRIBUTING.md) for more information.

## 📜 License

LLMThinkBench is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 📚 Citation

If you use LLMThinkBench in your research, please cite:

```
@software{llmthinkbench2025,
  author = {Gaurav Srivastava and Aafiya Hussain and Sriram Srinivasan and Aninditaa Chauhan},
  title = {LLMThinkBench: Advanced Reasoning and Overthinking Evaluation Framework for LLMs},
  year = {2025},
  url = {https://github.com/ctrl-gaurav/LLMThinkBench/}
}
```

## 📧 Contact

For questions, issues, or feedback, please [open an issue](https://github.com/ctrl-gaurav/llmthinkbench/issues) on GitHub.
