Metadata-Version: 2.4
Name: ragworkbench
Version: 0.1.3
Summary: A comprehensive benchmarking framework for Retrieval-Augmented Generation (RAG) systems
Project-URL: Homepage, https://github.com/IBM/RagWorkbench
Project-URL: Documentation, https://github.com/IBM/RagWorkbench#readme
Project-URL: Repository, https://github.com/IBM/RagWorkbench
Project-URL: Issues, https://github.com/IBM/RagWorkbench/issues
Project-URL: Changelog, https://github.com/IBM/RagWorkbench/blob/main/CHANGELOG.md
Author-email: Matan Orbach <matano@il.ibm.com>, Assaf Toledo <assaf.toledo@ibm.com>, Benjamin Sznajder <benjams@il.ibm.com>
Maintainer-email: Matan Orbach <matano@il.ibm.com>, Assaf Toledo <assaf.toledo@ibm.com>, Benjamin Sznajder <benjams@il.ibm.com>
License: Apache-2.0
License-File: LICENSE
Keywords: ai,benchmark,generation,llm,nlp,rag,retrieval
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.11
Requires-Dist: bert-score
Requires-Dist: datasets
Requires-Dist: docling-core
Requires-Dist: gdown
Requires-Dist: huggingface-hub
Requires-Dist: litellm
Requires-Dist: nicegui
Requires-Dist: pandas
Requires-Dist: pandas-stubs
Requires-Dist: pydantic
Requires-Dist: python-dotenv
Requires-Dist: requests
Requires-Dist: scipy
Requires-Dist: tenacity
Requires-Dist: types-pyyaml
Requires-Dist: types-requests
Requires-Dist: typing-extensions
Requires-Dist: unitxt
Provides-Extra: all
Requires-Dist: black==26.3.1; extra == 'all'
Requires-Dist: docling; extra == 'all'
Requires-Dist: isort==8.0.1; extra == 'all'
Requires-Dist: mkdocs-material>=9.0; extra == 'all'
Requires-Dist: mkdocs>=1.5; extra == 'all'
Requires-Dist: mypy>=1.0; extra == 'all'
Requires-Dist: pre-commit==4.5.1; extra == 'all'
Requires-Dist: pytest-asyncio>=0.21.0; extra == 'all'
Requires-Dist: pytest-cov>=4.0; extra == 'all'
Requires-Dist: pytest>=7.0; extra == 'all'
Requires-Dist: ruff==0.15.6; extra == 'all'
Requires-Dist: sphinx-rtd-theme>=2.0; extra == 'all'
Requires-Dist: sphinx>=7.0; extra == 'all'
Provides-Extra: dev
Requires-Dist: black==26.3.1; extra == 'dev'
Requires-Dist: isort==8.0.1; extra == 'dev'
Requires-Dist: mypy>=1.0; extra == 'dev'
Requires-Dist: pre-commit==4.5.1; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.21.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff==0.15.6; extra == 'dev'
Provides-Extra: docs
Requires-Dist: mkdocs-material>=9.0; extra == 'docs'
Requires-Dist: mkdocs>=1.5; extra == 'docs'
Requires-Dist: sphinx-rtd-theme>=2.0; extra == 'docs'
Requires-Dist: sphinx>=7.0; extra == 'docs'
Provides-Extra: examples
Requires-Dist: docling; extra == 'examples'
Description-Content-Type: text/markdown

# RAGWorkbench

A comprehensive benchmarking framework for Retrieval-Augmented Generation (RAG) systems.

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)
[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)

## Overview

RAGWorkbench is a Python framework for evaluating and benchmarking RAG systems across multiple datasets and metrics. It provides a unified interface for loading diverse RAG benchmarks, running inference pipelines, and computing evaluation metrics.

### Key Features

- 🎯 **Multiple Benchmark Datasets**: Support for 18+ RAG benchmark datasets including AIT-QA, BioASQ, HotpotQA, NarrativeQA, QASPER, and more
- 📊 **Comprehensive Metrics**: Built-in evaluation metrics for context correctness (Retrieval@K, MRR, MAP) and answer correctness (BERT Score, Sentence-BERT, LLM-as-a-Judge)
- 🔄 **Flexible Pipeline**: Modular architecture supporting custom ingest and inference pipelines
- 💾 **Smart Caching**: File-system based caching for data loading, generation, and evaluation results
- 💰 **Cost Tracking**: Automatic API usage and cost tracking via LiteLLM proxy with detailed reporting in results and boards - [see details](#viewing-cost-tracking-results)
- 🌐 **Interactive Explorer**: Web-based dataset exploration tool with advanced filtering capabilities
- 🧪 **Experiment Management**: End-to-end experiment orchestration from data loading to evaluation

## Installation

### Requirements

- Python 3.11 or higher

### Basic Installation

From the root of a cloned repository:

```bash
pip install .
```

### Development Installation

```bash
git clone https://github.com/IBM/RagWorkbench.git
cd RagWorkbench
pip install -e ".[dev]"
```

### Environment Configuration

Some evaluation metrics require environment variables to be configured. See [ENVIRONMENT_SETUP.md](ENVIRONMENT_SETUP.md) for detailed instructions on setting up credentials for:
- watsonx.ai LLM-as-a-Judge metrics
- Azure OpenAI metrics
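
As a rough illustration, a `.env` file for these providers might look like the sketch below. The variable names follow common LiteLLM conventions (RAGWorkbench routes model calls through LiteLLM) and are shown here only as an example; the authoritative names are in [ENVIRONMENT_SETUP.md](ENVIRONMENT_SETUP.md).

```bash
# .env — placeholder values; see ENVIRONMENT_SETUP.md for the exact variable names
# watsonx.ai credentials (LiteLLM conventions)
WATSONX_URL=https://us-south.ml.cloud.ibm.com
WATSONX_APIKEY=your-watsonx-api-key
WATSONX_PROJECT_ID=your-project-id

# Azure OpenAI credentials (LiteLLM conventions)
AZURE_API_KEY=your-azure-api-key
AZURE_API_BASE=https://your-resource.openai.azure.com/
AZURE_API_VERSION=2024-02-01
```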

### Optional Dependencies

```bash
# For documentation
pip install ".[docs]"

# For examples
pip install ".[examples]"

# Install all optional dependencies
pip install ".[all]"
```

## Quick Start

### Basic Usage

```python
from ragbench import DataLoaderFactory, DatasetName, Experiment
from ragbench.api.inference import InferencePipeline, InferenceParams
from ragbench.api.ingest import IngestPipeline
from ragbench.eval import MetricDefinition

# Load a dataset
data_loader = DataLoaderFactory.create_data_loader(DatasetName.HOTPOT_QA)

# Define your custom pipelines
class MyIngestPipeline(IngestPipeline):
    def process(self, data_loader):
        # Your ingestion logic
        pass

class MyInferencePipeline(InferencePipeline):
    def __init__(self, params: InferenceParams, cache_dir=None):
        super().__init__(params, cache_dir)

    def set_ingest_artifacts(self, ingest_artifacts):
        # Set up your retrieval system
        pass

    def process_no_cache(self, benchmark_entry):
        # Your inference logic
        pass

# Define evaluation metrics
metrics = [
    MetricDefinition.from_yaml_key("unitxt.context_correctness.retrieval_at_k"),
    MetricDefinition.from_yaml_key("unitxt.answer_correctness.bert_score_recall"),
]

# Create and run experiment
experiment = Experiment(
    name="my_rag_experiment",
    data_loader=data_loader,
    ingest_pipeline=MyIngestPipeline(),
    inference_pipeline=MyInferencePipeline(InferenceParams()),
    eval_metrics=metrics,
    cache_dir="./cache"
)

results, evaluation = experiment.run()
```

## Supported Datasets

RAGWorkbench supports 18+ benchmark datasets across various domains:

| Dataset | Domain | Retrieval Hops | Modalities |
|---------|--------|----------------|------------|
| AIT-QA | Financial | Single | TEXT, TABLE |
| BioASQ | Biomedical | Single | TEXT |
| CLAP-NQ | Wikipedia | Single | TEXT |
| DA-Code | Code | Single | TEXT |
| DABStep | Code | Multi | TEXT |
| HotpotQA | Wikipedia | Multi | TEXT |
| KramaBench | Wikipedia | Single | TEXT |
| Mini-Wiki | Wikipedia | Single | TEXT |
| MLDR | Multilingual | Single | TEXT |
| NarrativeQA | Literature | Single | TEXT |
| OfficeQA | Technical Docs | Single | TEXT |
| QASPER | Scientific Papers | Single | TEXT |
| SecQue | Policies | Single | TEXT |
| WatsonX DocsQA | Technical Docs | Single | TEXT |
| RealMM (4 variants) | Financial/Technical | Single | TEXT, TABLE, IMAGE |

### Loading Datasets

```python
from ragbench import DataLoaderFactory, DatasetName

# Load a specific dataset
loader = DataLoaderFactory.create_data_loader(DatasetName.BIOASQ)

# Get the benchmark and corpus
benchmark = loader.get_benchmark()
corpus = loader.get_corpus()

# Access benchmark entries
for entry in benchmark.get_benchmark_entries():
    print(f"Question: {entry.question}")
    print(f"Ground truth answers: {entry.ground_truth_answers}")
```

## Evaluation Metrics

RAGWorkbench provides evaluation metrics through its integration with [Unitxt](https://github.com/IBM/unitxt):

### Context Correctness Metrics

- **Retrieval@K**: Measures retrieval accuracy at different cutoffs (K=1, 3, 5, 10, 20, 40)
- **MRR (Mean Reciprocal Rank)**: Evaluates the rank of the first relevant document
- **MAP (Mean Average Precision)**: Measures precision across all relevant documents
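
For intuition, these metrics can be computed directly from a ranked list of retrieved document IDs. The following is a minimal standalone sketch of the standard formulas for a single query, independent of the Unitxt implementations RAGWorkbench actually uses:

```python
def retrieval_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """1.0 if any relevant document appears in the top-k, else 0.0 (single query)."""
    return float(any(doc in relevant for doc in retrieved[:k]))

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant document; 0.0 if none is retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def average_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Sum of precision@rank at each relevant hit, normalized by |relevant|."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

retrieved = ["d3", "d7", "d1", "d9"]  # ranked retriever output
relevant = {"d1", "d4"}               # gold context documents
print(retrieval_at_k(retrieved, relevant, k=3))  # 1.0  (d1 is in the top 3)
print(mrr(retrieved, relevant))                  # 0.333... (first hit at rank 3)
print(average_precision(retrieved, relevant))    # 0.1666... ((1/3) / 2)
```

The corpus-level scores reported by the framework are these per-query values averaged over the benchmark.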

### Answer Correctness Metrics

- **BERT Score Recall**: Semantic similarity using BERT embeddings
- **Sentence-BERT**: Sentence-level semantic similarity
- **LLM-as-a-Judge**: Uses LLMs (Llama, GPT-4) to evaluate answer quality
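
Since `bert-score` is a declared dependency, the recall variant can be illustrated directly with that package. This is only a sketch of what the metric measures, not RAGWorkbench's metric API:

```python
from bert_score import score

candidates = ["ATP is produced by the mitochondria of the cell."]
references = ["The mitochondria generate ATP."]

# Returns precision/recall/F1 tensors, one entry per candidate-reference pair.
# The first call downloads the underlying model.
precision, recall, f1 = score(candidates, references, lang="en")
print(f"BERT Score recall: {recall[0].item():.3f}")
```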

### Using Metrics

```python
from ragbench.eval import MetricDefinition

# Load metrics from YAML definitions
metric = MetricDefinition.from_yaml_key("unitxt.context_correctness.retrieval_at_k")

# Or create custom metrics
custom_metric = MetricDefinition(
    metric_id="custom.metric",
    metric_params={"param": "value"},
    metric_fields=["field1", "field2"],
    vendor="unitxt"
)
```

## Caching System

RAGWorkbench includes a file-system based caching system to speed up repeated experiment runs:

```python
from pathlib import Path

# Enable caching for all components
cache_dir = Path("./cache")

experiment = Experiment(
    name="cached_experiment",
    data_loader=data_loader,
    ingest_pipeline=ingest_pipeline,
    inference_pipeline=MyInferencePipeline(params, cache_dir=cache_dir),
    eval_metrics=metrics,
    cache_dir=cache_dir  # Enables evaluator caching
)
```

The caching system supports:
- **Data Loader Cache**: Caches loaded datasets
- **Generation Cache**: Caches inference results
- **Evaluator Cache**: Caches evaluation results

## Cost Tracking

RAGWorkbench supports optional cost tracking for experiments via a LiteLLM proxy. When enabled, each experiment run is assigned a unique tracking key, and usage statistics for that key are queried from the proxy to report API usage and cost.

### Prerequisites

Before enabling cost tracking, ensure you have:

1. **LiteLLM Proxy Running**: A LiteLLM proxy server must be running (default: `http://localhost:4000`), with all inference and ingestion calls routed through it. The proxy must be configured to track usage by API key (see the [LiteLLM documentation](https://docs.litellm.ai/))

2. **Master Key**: Set the `LITELLM_MASTER_KEY` environment variable to your LiteLLM proxy master key

```bash
# .env file
LITELLM_MASTER_KEY=sk-your-master-key-here
```
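
For orientation, the sketch below shows what happens conceptually when tracking is enabled, using the proxy's key-generation and spend-log endpoints as documented by LiteLLM (`/key/generate`, `/spend/logs`). The endpoint names and response fields are assumptions taken from the LiteLLM docs, not RAGWorkbench APIs:

```python
import os
import requests

PROXY_URL = "http://localhost:4000"  # default proxy address
headers = {"Authorization": f"Bearer {os.environ['LITELLM_MASTER_KEY']}"}

# 1. Generate a fresh key so this run's calls can be attributed to it.
key = requests.post(f"{PROXY_URL}/key/generate", headers=headers, json={}).json()["key"]

# 2. ... route all ingestion/inference traffic through the proxy with this key ...

# 3. Afterwards, query the spend logged against the key.
logs = requests.get(
    f"{PROXY_URL}/spend/logs", headers=headers, params={"api_key": key}
).json()
print("total cost ($):", sum(entry.get("spend", 0.0) for entry in logs))
```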

### Enabling Cost Tracking

Cost tracking is configured at the experiment level in your `board.yaml` file:

```yaml
# Experiment-level configuration
experiment:
  usage_tracking: true  # Enable cost tracking
```

### Viewing Cost Tracking Results

When cost tracking is enabled, usage and cost information is available in:

#### 1. Board Results (CSV)
Cost data is included in the main `results.csv` file in the `output/` directory with columns:
- `total_cost` - Total cost in USD
- `total_tokens` - Total tokens used (prompt + completion)
- `prompt_tokens` - Number of prompt tokens
- `completion_tokens` - Number of completion tokens
- `requests` - Number of API requests made
- `models_used` - List of models used
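
Because this is a plain CSV, the cost columns can be aggregated with pandas (already a RAGWorkbench dependency), for example:

```python
import pandas as pd

results = pd.read_csv("output/results.csv")
print(results[["total_cost", "total_tokens", "requests"]].sum())
```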

#### 2. Board Markdown Report
Cost metrics can be displayed in the board's markdown report (`board.md`) by adding them to your report configuration:

```yaml
report:
  screens:
    - title: "Performance & Cost"
      columns:
        accuracy_mean: "Accuracy"
        total_cost: "Cost ($)"
        total_tokens: "Tokens"
```

#### 3. Experiment Results JSON
Detailed cost data is exported to `experiment_results_<id>.json` files:
```json
{
  "cost_data": {
    "api_key": "sk-...",
    "total_cost": 0.1234,
    "total_tokens": 5000,
    "prompt_tokens": 3000,
    "completion_tokens": 2000,
    "requests": 10,
    "models_used": ["gpt-4", "gpt-3.5-turbo"]
  }
}
```


## Dataset Explorer

RAGWorkbench includes an interactive web-based dataset explorer:

```bash
python -m ragbench.dataset_exploration.dataset_explorer
```

Then open your browser at `http://localhost:8080`.

### Explorer Features

- 📋 Browse all available datasets in a sortable table
- 🔍 Search datasets by name or description
- 🎨 Filter by domain, retrieval hops, modalities, and more
- 📊 View detailed dataset statistics and metadata
- 📋 Copy dataset names with one click


## Core Components

- **DataLoader**: Loads and manages benchmark datasets
- **IngestPipeline**: Processes and indexes documents
- **InferencePipeline**: Runs retrieval and generation
- **Evaluator**: Computes evaluation metrics
- **Experiment**: Orchestrates the complete workflow

## Development

### Running Tests

```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=src --cov-report=html

# Run only unit tests
pytest tests/datasets_loader/unit

# Run only integration tests
pytest -m integration
```

### Code Quality

```bash
# Format code
black src tests

# Lint code
ruff check src tests

# Type checking
mypy src
```

### Pre-commit Hooks

```bash
pre-commit install
pre-commit run --all-files
```

## Contributing

We welcome contributions! Please see our contributing guidelines for more details.

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## Releases

RAGWorkbench follows [Semantic Versioning](https://semver.org/) and uses automated releases via GitHub Actions.

### Installation

Install the latest stable release from PyPI:

```bash
pip install ragworkbench
```

Install a specific version:

```bash
pip install ragworkbench==0.1.0
```

### Release Process

For maintainers preparing a new release:

1. **Prepare the release**:
   ```bash
   ./scripts/prepare_release.sh 0.2.0
   ```

2. **Create and push the tag**:
   ```bash
   git tag -a v0.2.0 -m "Release version 0.2.0"
   git push origin v0.2.0
   ```

3. **Monitor the automated workflow** at [GitHub Actions](https://github.com/IBM/RagWorkbench/actions)

The release workflow will automatically:
- Build the package
- Publish to PyPI
- Create a GitHub release with release notes

For detailed release instructions, see [RELEASE.md](RELEASE.md).

## License

This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.

## Authors

- **Matan Orbach** - [matano@il.ibm.com](mailto:matano@il.ibm.com)
- **Assaf Toledo** - [assaf.toledo@ibm.com](mailto:assaf.toledo@ibm.com)
- **Benjamin Sznajder** - [benjams@il.ibm.com](mailto:benjams@il.ibm.com)
- **Odellia Boni** - [odelliab@il.ibm.com](mailto:odelliab@il.ibm.com)

## Acknowledgments

- Built with [Unitxt](https://github.com/IBM/unitxt) for evaluation metrics
- Uses [NiceGUI](https://nicegui.io/) for the dataset explorer
- Integrates with [Hugging Face Datasets](https://huggingface.co/docs/datasets/)

## Support

For questions, issues, or feature requests, please:
- Open an issue on [GitHub Issues](https://github.com/IBM/RagWorkbench/issues)

---
