Metadata-Version: 2.1
Name: benchbench
Version: 1.0.0
Summary: Tools for measuring sensitivity and diversity of multi-task benchmarks.
Author: Guanhua Zhang
License: MIT
Classifier: Development Status :: 3 - Alpha
Classifier: License :: OSI Approved :: MIT License
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE.txt
Requires-Dist: scipy
Requires-Dist: numpy
Requires-Dist: torch
Requires-Dist: pandas
Requires-Dist: joblib
Requires-Dist: scikit-learn
Requires-Dist: zarth-utils ==1.0

<p align="center">
<img src="https://raw.githubusercontent.com/socialfoundations/benchbench/main/assets/logo.jpg" height="400" width="600">
</p>

**BenchBench** is a Python package for evaluating multi-task benchmarks. It measures a benchmark's diversity and its
sensitivity to irrelevant variations, such as label noise injection and the addition of irrelevant candidate models.
The package analyzes multi-task benchmarks through a social choice lens, exposing the fundamental trade-off between
diversity and stability in both cardinal and ordinal benchmarks.

For more information, including the motivations behind the measures and our empirical findings, please
see [our paper](https://github.com/socialfoundations/benchbench).

## Quick Start

To install the package, simply run:

```bash
pip install benchbench
```

## Example Usage

To evaluate a cardinal benchmark, you can use the following code:

```python
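from benchbench.data import load_cardinal_benchmark
from benchbench.measures.cardinal import get_diversity, get_sensitivity

# Load the GLUE results together with the list of task columns
data, cols = load_cardinal_benchmark('GLUE')
# Compute the cardinal diversity and sensitivity measures over those tasks
diversity = get_diversity(data, cols)
sensitivity = get_sensitivity(data, cols)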
```

To evaluate an ordinal benchmark, you can use the following code:

```python
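from benchbench.data import load_ordinal_benchmark
from benchbench.measures.ordinal import get_diversity, get_sensitivity

# Load the HELM accuracy results together with the list of task columns
data, cols = load_ordinal_benchmark('HELM-accuracy')
# Compute the ordinal diversity and sensitivity measures over those tasks
diversity = get_diversity(data, cols)
sensitivity = get_sensitivity(data, cols)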
```

To use your own benchmark, provide a pandas DataFrame and a list of the columns that identify the tasks.
Check the [documentation](https://socialfoundations.github.io/benchbench) for more details.
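
As a rough sketch of what such an input could look like, the snippet below builds a small table by hand. The layout (one row per model, one column per task score, plus a `model` name column) and all names and scores are hypothetical and are assumptions, not taken from the package; the documentation is the authority on the expected format. The `evaluate` helper simply repeats the cardinal calls from the example above and imports `benchbench` lazily, so the table-building part runs without the package installed.

```python
import pandas as pd

# Hypothetical scores: one row per model, one column per task (assumed layout).
data = pd.DataFrame(
    {
        "model": ["model_a", "model_b", "model_c"],
        "task_1": [0.81, 0.77, 0.69],
        "task_2": [0.64, 0.70, 0.58],
        "task_3": [0.92, 0.88, 0.90],
    }
)

# The list of columns that identify the tasks.
cols = ["task_1", "task_2", "task_3"]

# Basic sanity checks before handing the table to any measure:
# every task column exists and contains no missing scores.
missing = [c for c in cols if c not in data.columns]
has_gaps = bool(data[cols].isna().any().any())


def evaluate(data, cols):
    """Call the cardinal measures exactly as in the example above.

    benchbench is imported here, not at module level, so that the
    table-building code runs even when the package is not installed.
    """
    from benchbench.measures.cardinal import get_diversity, get_sensitivity

    return get_diversity(data, cols), get_sensitivity(data, cols)
```

With the package installed, `evaluate(data, cols)` would return the two measures for the custom table, provided this layout matches what the library expects.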

## Reproduce the Paper

<p align="center">
<img src="https://raw.githubusercontent.com/socialfoundations/benchbench/main/assets/banner.png" height="400" width="600">
</p>

To reproduce our results in Google Colab with one click, open [cardinal.ipynb](https://githubtocolab.com/socialfoundations/benchbench/blob/main/examples/cardinal.ipynb), [ordinal.ipynb](https://githubtocolab.com/socialfoundations/benchbench/blob/main/examples/ordinal.ipynb), and [banner.ipynb](https://githubtocolab.com/socialfoundations/benchbench/blob/main/examples/banner.ipynb).
