Metadata-Version: 2.4
Name: jetts
Version: 0.0.1
Summary: JETTS Benchmark
Author-email: Yilun Zhou <yilun.zhou@salesforce.com>, Austin Xu <austin.xu@salesforce.com>
License-Expression: LicenseRef-Creative-Commons-Attribution-NonCommercial-4.0-International
Project-URL: Homepage, https://github.com/SalesforceAIResearch/jetts-benchmark
Project-URL: Issues, https://github.com/SalesforceAIResearch/jetts-benchmark/issues
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: vllm
Requires-Dist: datasets
Requires-Dist: openai
Requires-Dist: msgspec
Requires-Dist: transformers
Requires-Dist: tqdm
Requires-Dist: numpy
Dynamic: license-file

# JETTS: Judge Evaluation for Test-Time-Scaling

*Authors: Yilun Zhou&ast;, Austin Xu&ast;, Peifeng Wang, Caiming Xiong, Shafiq Joty*

This repository contains the source code for the JETTS benchmark, introduced in the paper *Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators*.

## Setup

### Install JETTS

We recommend installing this package in a fresh conda environment, as follows:
```sh
conda create -n jetts python=3.12 -y
conda activate jetts
pip install uv
git clone https://github.com/SalesforceAIResearch/jetts-benchmark
cd jetts-benchmark
uv pip install -e .
```

Alternatively, if you do not need to modify any source code, you can skip cloning and install the released package directly, for example:
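
```sh
conda create -n jetts python=3.12 -y
conda activate jetts
pip install uv
uv pip install jetts
```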

### Download Model Response Data

We provide pre-generated and pre-evaluated responses from a set of generator models on our benchmark tasks, so that, except for critique-based refinement, you do not need to run any generator models. These data are stored on Google Cloud and are freely accessible, but you need the `gcloud` command-line tool to download them. You can follow [this link](https://cloud.google.com/sdk/docs/install) to download and install it.


If you are working from the repository root, we recommend creating a subfolder for these data files.

```sh
mkdir jetts_data
cd jetts_data

# data for reranking and refinement (143MB zipped, 650MB extracted)
gcloud storage cp gs://sfr-jetts-benchmark-data/reranking_and_refinement.tar.gz .
tar xzf reranking_and_refinement.tar.gz
rm reranking_and_refinement.tar.gz

# data for beam search (6.7GB zipped, 51GB extracted)
gcloud storage cp gs://sfr-jetts-benchmark-data/beam_search.tar.gz .
# extraction can take a while; for a progress bar, install "pv" and run "pv beam_search.tar.gz | tar xz" instead of the tar command below
tar xzf beam_search.tar.gz
rm beam_search.tar.gz
```
If everything works correctly, you should see the following folders being created and populated: 
```
jetts_data
├─ beam_search
│  └─ (more subfolders)
└─ reranking_and_refinement
   └─ (jsonl files)
```

*Note: we are working on uploading the data files to Hugging Face so that they can be downloaded automatically on the fly. Please stay tuned!*

In `reranking_and_refinement`, each file is named `{dataset}_{generator_model}.jsonl` and contains the responses generated by that model for that dataset, with up to 10 responses per query.
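
To get a quick look at the data format, you can pretty-print the first record of a file (the file name below is illustrative; substitute any file you downloaded):

```sh
head -n 1 jetts_data/reranking_and_refinement/gsm8k_llama8b.jsonl | python -m json.tool | head -n 20
```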

In `beam_search`, each subfolder contains the fully expanded beam search trees generated by a model for a particular dataset, named `{dataset}_{N}_{M}_{d}_{generator_model}`, with `0.jsonl` to `{L-1}.jsonl` corresponding to the `L` queries in the dataset. `N`, `M` and `d` correspond to the number of initial step samples, beam width and max depth of the search tree, as detailed in Sec. 3.3 of the paper.
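
As a hypothetical illustration of this convention (the `N`/`M`/`d` values here are made up; the actual values used are detailed in the paper), a folder of trees generated by Llama-3.1-8B-Instruct on GSM8k with `N=8`, `M=4`, and `d=10` would look like:

```sh
ls jetts_data/beam_search/gsm8k_8_4_10_llama8b/
# 0.jsonl  1.jsonl  2.jsonl  ...  (one file per query in the dataset)
```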

### Launch a Judge Model

To benchmark a specific judge, JETTS supports two methods of defining a judge model instance: 

1. An OpenAI-compatible server (e.g., OpenAI models, Together AI models, and model servers launched by `vllm serve`). 
2. A `vllm.LLM` object, which is wrapped in `jetts.judge.vllm_judge.VllmJudge`.

In the demos below, we use the first method, launching a model server with `vllm serve`. We provide a helper script to make this process easy:

```sh
python scripts/launch_judge.py --judge-model [JUDGE]
```

where `[JUDGE]` is one of the short or full names in the table below, or the Hugging Face model ID of any vLLM-supported model.

| Short Name | Full Name                                   |
|------------|---------------------------------------------|
| prom7b     | prometheus-eval/prometheus-7b-v2.0          |
| sc8b       | Skywork/Skywork-Critic-Llama-3.1-8B         |
| ob8b       | NCSOFT/Llama-3-OffsetBias-8B                |
| thm8b      | PKU-ONELab/Themis                           |
| prom8x7b   | prometheus-eval/prometheus-8x7b-v2.0        |
| sc70b      | Skywork/Skywork-Critic-Llama-3.1-70B        |
| ste70b     | facebook/Self-taught-evaluator-llama3.1-70B |
| llama8b    | meta-llama/Llama-3.1-8B-Instruct            |

The judge will be served on `localhost:8000`, as expected by the script for each task.
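
For example, to serve the Skywork-Critic-8B judge and confirm that the server is up (`/v1/models` is the standard model-listing route of an OpenAI-compatible server, including those launched by `vllm serve`):

```sh
python scripts/launch_judge.py --judge-model sc8b

# in another terminal: should return a JSON list containing the judge model
curl http://localhost:8000/v1/models
```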

Note that due to company policy, we are not able to release weights for the SFR-Judge family of models or provide API access. We hope to do so in the future and will update these instructions accordingly.

## Running JETTS Tasks

### Response Reranking
With the `reranking_and_refinement` data downloaded and a judge launched, reranking can be run with

```sh
python scripts/reranking.py --data-file [DATA_FILE]
```

where `[DATA_FILE]` is the path to one of the `.jsonl` files in the `reranking_and_refinement` data folder. The script automatically computes the performance at the end and writes a file containing the ranked responses for the dataset. The output file is named `{judge_model}_{data_file_name}_{reranking_method}.jsonl` and placed inside the folder specified by `--output-dir`, which defaults to `outputs/reranking`. Please consult the script or run it with the `-h` flag for optional arguments to customize the run.
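
For example, to rerank the Llama-3.1-8B-Instruct responses on GSM8k (the file name is illustrative; use any file you downloaded):

```sh
python scripts/reranking.py --data-file jetts_data/reranking_and_refinement/gsm8k_llama8b.jsonl
```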

### Step-Level Beam Search
With the `beam_search` data downloaded and a judge launched, beam search can be run with

```sh
python scripts/beam_search.py --input-dir [INPUT_DIR]
```

where `[INPUT_DIR]` is the path to one of the subfolders inside the `beam_search` data folder. The script automatically computes the performance at the end and creates a folder with files `0.jsonl` to `{L-1}.jsonl` containing the beam search decisions for each tree. The result folder is named `{judge_model}_{input_dir_name}_{beam_selection_reranking_method}_{final_selection_reranking_method}` and placed inside the folder specified by `--output-dir`, which defaults to `outputs/beam_search`. Please consult the script or run it with the `-h` flag for optional arguments to customize the run.
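
For example, reusing the hypothetical folder name from the data section above:

```sh
python scripts/beam_search.py --input-dir jetts_data/beam_search/gsm8k_8_4_10_llama8b
```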

### Critique-Based Refinement

In addition to downloading the `reranking_and_refinement` data and launching the judge, you also need to launch a generator, since refinement performs live response generation following judge critiques. The generator can be launched in a similar manner to the judge, with

```sh
python scripts/launch_generator.py --generator-name [GENERATOR]
```

where `[GENERATOR]` is one of the short or full names in the table below, or the Hugging Face model ID of any vLLM-supported model.

| Short Name | Full Name                         |
|------------|-----------------------------------|
| llama8b    | meta-llama/Llama-3.1-8B-Instruct  |
| llama70b   | meta-llama/Llama-3.1-70B-Instruct |
| qwen32b    | Qwen/Qwen2.5-32B-Instruct         |
| qwen72b    | Qwen/Qwen2.5-72B-Instruct         |
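
For example, to serve Llama-3.1-70B-Instruct as the refiner:

```sh
python scripts/launch_generator.py --generator-name llama70b
```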

The generator will be served on `localhost:8001`, as expected by the refinement script, which can be executed as 

```sh
python scripts/refinement.py --data-file [DATA_FILE]
```

where `[DATA_FILE]` is the path to one of the `.jsonl` files in the `reranking_and_refinement` data folder. The script *does not automatically compute the performance at the end*, but it does write a file containing all refined responses in the final reranking order for the dataset. The output file is named `{judge_model}_{refiner_model}_{data_file_name}_{final_reranking_method}.jsonl` and placed inside the folder specified by `--output-dir`, which defaults to `outputs/refinement`. Please consult the script or run it with the `-h` flag for optional arguments to customize the run.
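
For example, with both the judge (port 8000) and the generator (port 8001) running, and an illustrative file name as before:

```sh
python scripts/refinement.py --data-file jetts_data/reranking_and_refinement/gsm8k_llama8b.jsonl
```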

### Evaluating Refinement Result

*Note: you do not need to run manual evaluation for reranking and beam search, as we have pre-computed the scores for all model responses (including every leaf node in the search trees) and saved them with the data files. You only need to run this evaluation for refinement, since those responses are freshly generated by the generator model.*

To avoid interfering with the packages needed to run the actual benchmarking, we strongly recommend running the evaluation in a dedicated conda environment. Furthermore, since BigCodeBench pins many packages to relatively old versions (e.g., `numpy==1.21.2`, released in August 2021), its evaluation interferes with that of the other datasets. We therefore recommend following the steps below to create two environments.

To run evaluations for CHAMP and AlpacaEval, you need `OPENAI_API_KEY` set as an environment variable (e.g., `export OPENAI_API_KEY=[YOUR_API_KEY]`). For CHAMP, you also need to keep the generator server (i.e., `scripts/launch_generator.py`) running on port 8001 during evaluation; the judge server can be terminated.

*Note: all evaluations need to be run inside the `jetts_eval` directory.*

#### Everything except for BigCodeBench

```sh
cd jetts_eval
conda create -n jetts-eval python=3.10 -y
conda activate jetts-eval
pip install uv

# You only need to run the line(s) below for the dataset(s) that you wish to evaluate
uv pip install "math-verify[antlr4_13_2]"  # for GSM8k and MATH (quoted so the shell does not expand the brackets)
uv pip install champ_dataset datasets openai  # for CHAMP
uv pip install alpaca-eval  # for AlpacaEval
uv pip install numpy absl-py langdetect nltk immutabledict  # for IFEval

# run all lines below for HumanEval+ and MBPP+
cd local_code_eval
tar xzf evalplus.tar.gz
cd evalplus
uv pip install -e .
cd ../..  # back to jetts_eval
```

#### BigCodeBench

```sh
cd jetts_eval  # skip if you are already in this directory
conda create -n jetts-eval-bcb python=3.10 -y
conda activate jetts-eval-bcb
pip install uv
cd local_code_eval
tar xzf bigcodebench.tar.gz
cd bigcodebench
uv pip install -e .
uv pip install -r requirements-eval.txt
```

#### Running the evaluation

After you have installed the necessary packages and activated the correct environment, you can run

```sh
python evaluate_refinement.py --refinement-output-file [REFINEMENT_OUTPUT_FILE]
```

where `[REFINEMENT_OUTPUT_FILE]` is the path to the output file generated by `scripts/refinement.py`. By default, the file name contains the dataset name. However, if you provide a custom output file without the dataset name in it, you need to additionally specify `--dataset [DATASET]`, where `[DATASET]` is one of `[gsm8k, math, champ, humaneval, mbpp, bigcodebench, alpacaeval, ifeval]`. The program will print out the score at the end.
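
For example, from inside `jetts_eval`, with illustrative paths (the `--dataset` flag is only needed when the file name does not contain the dataset name):

```sh
conda activate jetts-eval  # or jetts-eval-bcb for BigCodeBench
python evaluate_refinement.py \
    --refinement-output-file ../outputs/refinement/my_refinement_run.jsonl \
    --dataset gsm8k
```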

## Questions?

If you have any questions, feel free to open an issue or contact the authors at `yilun.zhou@salesforce.com` and `austin.xu@salesforce.com`.
