Running the benchmark¶
Requirements¶
Strictly Python 3.8, on Linux.
Installation¶
First, install Progress-Gym using pip. Note that, in order to perform accelerated inference with vllm and resolve the resulting dependency conflicts,
we strictly require Python 3.8 and vllm 0.4.0.post1. Make sure python3.8 is installed, then run:
$ python3.8 -m venv .venv
$ source .venv/bin/activate
(.venv) $ pip install progressgym
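Since the interpreter pin is strict, a quick version check before installing can save debugging time. Below is a minimal sketch; the `is_supported_python` helper is hypothetical, not part of Progress-Gym:

```python
import sys

def is_supported_python(version_info=sys.version_info):
    """Return True only for Python 3.8.x, the version Progress-Gym requires."""
    return (version_info[0], version_info[1]) == (3, 8)

if not is_supported_python():
    print("Warning: Progress-Gym requires Python 3.8, found %d.%d"
          % (sys.version_info[0], sys.version_info[1]))
```

Running this inside the activated virtualenv should print nothing if the environment was created with python3.8.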
Alternatively, install by cloning the GitHub repository:
$ git clone https://github.com/ProgressAlign/ProgressAlign.git
$ cd ProgressAlign
$ python3.8 -m venv .venv
$ source .venv/bin/activate
(.venv) $ pip install -r src/config/requirements-pip.txt
Basic Usage¶
Building the Dataset
Note that you don't have to run this part yourself unless you want to reproduce the findings in our paper, since the HisText dataset has already been used to train the Human Proxy Model sequence.
Running
$ python ./build_dataset.py
will automatically download the relevant files and compile them into the desired dataset. Specifically, it will:
1. Download and decompress raw data files from multiple sources, placing them in ./dataset/raw_downloads.
2. Organize the downloaded data into a unified format; the resulting historical text dataset is saved as a series of files ./dataset/dataset_textseq/[century_num]/[year_num].json.
3. Train a series of historical language models, saved in ./dataset/dataset_modelseq/[century_num], using the historical text dataset.
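The resulting text-sequence layout can be traversed programmatically. Here is a minimal sketch, assuming only the `[century_num]/[year_num].json` layout described above; the temporary directory and its contents are stand-ins for ./dataset/dataset_textseq, and the JSON payload is purely illustrative:

```python
import json
import os
import tempfile

def list_textseq_files(root):
    """Yield (century, year, path) for every [century_num]/[year_num].json under root."""
    for century in sorted(os.listdir(root)):
        century_dir = os.path.join(root, century)
        if not os.path.isdir(century_dir):
            continue
        for fname in sorted(os.listdir(century_dir)):
            if fname.endswith(".json"):
                yield int(century), int(fname[:-5]), os.path.join(century_dir, fname)

# Build a tiny stand-in layout and walk it.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "13"))
with open(os.path.join(root, "13", "1215.json"), "w") as f:
    json.dump({"text": "illustrative payload"}, f)

entries = list(list_textseq_files(root))
```

After the real build step, pointing `root` at ./dataset/dataset_textseq would enumerate every century/year file in order.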
Running the Benchmarks
Running
$ python run_benchmark.py \
--algorithms=ALGO1[,ALGO2[,ALGO3[,...]]] \
--challenges=SUBTASK1[,SUBTASK2[,SUBTASK3[,...]]] \
--output_filename=OUTPUT_FILENAME \
[--output_dir=OUTPUT_DIR] (defaults to ./output/benchmark_results) \
[-h | --help]
[...] (additional arguments are supplied to the algorithms when they are instantiated; only string values are supported)
will produce the benchmark results. Omitting
--challenges
defaults to testing all challenges. Results are saved to [output_dir]/[output_filename].json, with
[output_dir] defaulting to ./output/benchmark_results. For each test, results are saved as a dictionary; the item
score gives a comprehensive score for each examinee's performance on one subtask.
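Once a run finishes, the results file can be inspected with a few lines of Python. This is a minimal sketch, assuming only that each per-subtask entry is a dictionary containing a score item as described above; the algorithm and subtask names and the surrounding JSON structure are illustrative, not the exact schema:

```python
import json

# Illustrative results; the real file lives at [output_dir]/[output_filename].json.
results_json = '{"some_algorithm": {"some_subtask": {"score": 0.42}}}'
results = json.loads(results_json)

# Collect the comprehensive score for every (algorithm, subtask) pair.
scores = {
    (algo, subtask): entry["score"]
    for algo, subtasks in results.items()
    for subtask, entry in subtasks.items()
}
```

Replacing `results_json` with the contents of the saved file yields a flat mapping that is convenient for tabulating or plotting scores.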
See the comments at the beginning of run_benchmark.py for detailed usage of the command-line arguments.
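The rule that additional arguments are passed to algorithms as strings can be illustrated with a small parser. This is a hypothetical sketch, not the actual implementation in run_benchmark.py:

```python
def parse_extra_args(argv):
    """Collect unrecognized --key=value flags into a dict of strings,
    mirroring the rule that only string values are supported."""
    known = {"--algorithms", "--challenges", "--output_filename", "--output_dir"}
    extra = {}
    for arg in argv:
        if arg.startswith("--") and "=" in arg:
            key, _, value = arg.partition("=")
            if key not in known:
                extra[key.lstrip("-")] = value  # values are kept as strings
    return extra
```

So an invocation carrying, say, `--some_param=0.01` would hand the algorithm the string "0.01", and the algorithm itself is responsible for any numeric conversion.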