Unverified commit 5ab295c8 authored by Uanu, committed by GitHub

Add a new task GPQA (the part without CoT) (#1434)



* add new task GPQA_n_shot

* add new task GPQA_zeroshot

* correct GPQA_zeroshot filename

* Add randomly shuffle choices

* Correct missing parentheses

* delete wrong tasks

* Add README

* Update lm_eval/tasks/gpqa/zeroshot/_gpqa_zeroshot_yaml

* Update lm_eval/tasks/gpqa/n_shot/utils.py

* Update lm_eval/tasks/gpqa/n_shot/utils.py

* Update lm_eval/tasks/gpqa/README.md

* placate linter

* linter

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
# GPQA
### Paper
Title: GPQA: A Graduate-Level Google-Proof Q&A Benchmark
Abstract: https://arxiv.org/abs/2311.12022
We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web (i.e., the questions are “Google-proof”). The questions are also difficult for state-of-the-art AI systems, with our strongest GPT-4–based baseline achieving 39% accuracy. If we are to use future AI systems to help us answer very hard questions—for example, when developing new scientific knowledge—we need to develop *scalable oversight* methods that enable humans to supervise their outputs, which may be difficult even if the supervisors are themselves skilled and knowledgeable. The difficulty of GPQA both for skilled non-experts and frontier AI systems should enable realistic scalable oversight experiments, which we hope can help devise ways for human experts to reliably get truthful information from AI systems that surpass human capabilities.
Homepage: `https://github.com/idavidrein/gpqa/tree/main`
### Citation
```
@misc{rein2023gpqa,
title={GPQA: A Graduate-Level Google-Proof Q&A Benchmark},
author={David Rein and Betty Li Hou and Asa Cooper Stickland and Jackson Petty and Richard Yuanzhe Pang and Julien Dirani and Julian Michael and Samuel R. Bowman},
year={2023},
eprint={2311.12022},
archivePrefix={arXiv},
primaryClass={cs.AI}
}
```
This dataset is gated, so you will have to accept the terms of use at https://huggingface.co/datasets/Idavidrein/gpqa and log in via `huggingface-cli login` with your HF Hub token before running this task.
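As a quick check that the gated data is reachable, the following sketch loads one subset directly (it assumes you have already accepted the terms and logged in; `gpqa_main` matches the `dataset_name` used by the generated task configs):

```python
# Minimal access check for the gated GPQA dataset. Assumes you have already
# accepted the terms of use and run `huggingface-cli login` (or called
# huggingface_hub.login() with your token) beforehand.
import datasets

# "gpqa_main" matches the dataset_name in the generated task configs;
# "gpqa_diamond" and "gpqa_extended" work the same way.
ds = datasets.load_dataset("Idavidrein/gpqa", "gpqa_main", split="train")
print(len(ds), ds.column_names)
```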
### Groups and Tasks
#### Groups
* `gpqa`
#### Tasks
* `gpqa_{main, diamond, extended}_zeroshot`
* `gpqa_{main, diamond, extended}_n_shot`
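As a rough sketch (not an official recipe), one of these tasks can be run programmatically through the harness's `simple_evaluate` entry point; the model string below is only a placeholder:

```python
# Sketch: run the zero-shot GPQA main split through the Python API.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=EleutherAI/pythia-1.4b",  # placeholder model
    tasks=["gpqa_main_zeroshot"],
    batch_size=8,
)
print(results["results"]["gpqa_main_zeroshot"])
```

The same task names can also be passed to the `lm_eval` command-line interface via `--tasks`.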
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
import yaml
from tqdm import tqdm


def main() -> None:
    subset = ["extended", "diamond", "experts", "main"]
    for task in tqdm(subset):
        file_name = f"gpqa_{task}_n_shot.yaml"
        try:
            with open(f"{file_name}", "w") as f:
                f.write("# Generated by _generate_configs.py\n")
                yaml.dump(
                    {
                        "include": "_gpqa_n_shot_yaml",
                        "task": f"gpqa_{task}_n_shot",
                        "dataset_name": f"gpqa_{task}",
                    },
                    f,
                )
        except FileExistsError:
            pass


if __name__ == "__main__":
    main()
dataset_path: Idavidrein/gpqa
group: gpqa
output_type: multiple_choice
process_docs: !function utils.process_docs
training_split: train
# The HF dataset only has a train split, so it is reused for validation
validation_split: train
test_split: null
description: "Here are some example questions from experts. Answer the final question yourself, following the format of the previous questions exactly.\n"
doc_to_text: "Question: {{Question}}\nChoices:\n(A) {{choice1}}\n(B) {{choice2}}\n(C) {{choice3}}\n(D) {{choice4}}\nAnswer:"
doc_to_target: answer
doc_to_choice: ["(A)", "(B)", "(C)", "(D)"]
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
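For orientation only: rendering the `doc_to_text` template above against a processed document yields the prompt shape below. This sketch is not part of the task code; it assumes `jinja2` is installed and uses invented field values.

```python
# Illustrative only: render the doc_to_text template for one fake processed doc.
import jinja2

template = jinja2.Template(
    "Question: {{Question}}\nChoices:\n(A) {{choice1}}\n(B) {{choice2}}\n"
    "(C) {{choice3}}\n(D) {{choice4}}\nAnswer:"
)
doc = {
    "Question": "Which particle mediates the electromagnetic force?",  # made up
    "choice1": "Gluon",
    "choice2": "Photon",
    "choice3": "W boson",
    "choice4": "Graviton",
}
print(template.render(**doc))
```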
# Generated by _generate_configs.py
dataset_name: gpqa_diamond
include: _gpqa_n_shot_yaml
task: gpqa_diamond_n_shot
# Generated by _generate_configs.py
dataset_name: gpqa_extended
include: _gpqa_n_shot_yaml
task: gpqa_extended_n_shot
# Generated by _generate_configs.py
dataset_name: gpqa_main
include: _gpqa_n_shot_yaml
task: gpqa_main_n_shot
import datasets
import re
import random


def preprocess(text):
    """Light cleanup of question/answer strings from the HF dataset."""
    if text is None:
        return " "
    text = text.strip()
    text = text.replace(" [title]", ". ")
    text = re.sub("\\[.*?\\]", "", text)
    text = text.replace("  ", " ")
    return text


# Fixed seed so the shuffled choice order is reproducible across runs.
rng = random.Random(42)


def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
    def _process_doc(doc):
        choices = [
            preprocess(doc["Incorrect Answer 1"]),
            preprocess(doc["Incorrect Answer 2"]),
            preprocess(doc["Incorrect Answer 3"]),
            preprocess(doc["Correct Answer"]),
        ]

        rng.shuffle(choices)
        correct_answer_index = choices.index(preprocess(doc["Correct Answer"]))

        out_doc = {
            "choice1": choices[0],
            "choice2": choices[1],
            "choice3": choices[2],
            "choice4": choices[3],
            "answer": f"({chr(65 + correct_answer_index)})",
        }
        return out_doc

    return dataset.map(_process_doc)
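A quick, informal sanity check of the shuffling logic above. This sketch assumes the file can be imported as `utils` (e.g. when run from the same directory) and uses an invented toy question:

```python
# Sketch: verify that process_docs records where the correct answer landed.
import datasets

import utils  # assumes this utils.py is on the import path

toy = datasets.Dataset.from_dict(
    {
        "Correct Answer": ["Photon"],
        "Incorrect Answer 1": ["Gluon"],
        "Incorrect Answer 2": ["W boson"],
        "Incorrect Answer 3": ["Graviton"],
    }
)
doc = utils.process_docs(toy)[0]
letter = doc["answer"].strip("()")          # e.g. "B"
assert doc[f"choice{ord(letter) - 64}"] == "Photon"
print(doc)
```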
import yaml
from tqdm import tqdm


def main() -> None:
    subset = ["extended", "diamond", "experts", "main"]
    setting = "zeroshot"
    for task in tqdm(subset):
        file_name = f"gpqa_{task}_{setting}.yaml"
        try:
            with open(f"{file_name}", "w") as f:
                f.write("# Generated by _generate_configs.py\n")
                yaml.dump(
                    {
                        "include": f"_gpqa_{setting}_yaml",
                        "task": f"gpqa_{task}_{setting}",
                        "dataset_name": f"gpqa_{task}",
                    },
                    f,
                )
        except FileExistsError:
            pass


if __name__ == "__main__":
    main()
dataset_path: Idavidrein/gpqa
group: gpqa
output_type: multiple_choice
process_docs: !function utils.process_docs
training_split: train
# The HF dataset only has a train split, so it is reused for validation
validation_split: train
test_split: null
doc_to_text: "What is the correct answer to this question:{{Question}}\nChoices:\n(A) {{choice1}}\n(B) {{choice2}}\n(C) {{choice3}}\n(D) {{choice4}}\nAnswer:"
doc_to_target: answer
doc_to_choice: ["(A)", "(B)", "(C)", "(D)"]
num_fewshot: 0
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
# Generated by _generate_configs.py
dataset_name: gpqa_diamond
include: _gpqa_zeroshot_yaml
task: gpqa_diamond_zeroshot
# Generated by _generate_configs.py
dataset_name: gpqa_extended
include: _gpqa_zeroshot_yaml
task: gpqa_extended_zeroshot
# Generated by _generate_configs.py
dataset_name: gpqa_main
include: _gpqa_zeroshot_yaml
task: gpqa_main_zeroshot
import datasets
import re
import random


def preprocess(text):
    """Light cleanup of question/answer strings from the HF dataset."""
    if text is None:
        return " "
    text = text.strip()
    text = text.replace(" [title]", ". ")
    text = re.sub("\\[.*?\\]", "", text)
    text = text.replace("  ", " ")
    return text


def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
    def _process_doc(doc):
        choices = [
            preprocess(doc["Incorrect Answer 1"]),
            preprocess(doc["Incorrect Answer 2"]),
            preprocess(doc["Incorrect Answer 3"]),
            preprocess(doc["Correct Answer"]),
        ]

        # Note: unlike the n_shot variant, this uses the global (unseeded)
        # random module, so the choice order is not fixed across runs.
        random.shuffle(choices)
        correct_answer_index = choices.index(preprocess(doc["Correct Answer"]))

        out_doc = {
            "choice1": choices[0],
            "choice2": choices[1],
            "choice3": choices[2],
            "choice4": choices[3],
            "answer": f"({chr(65 + correct_answer_index)})",
        }
        return out_doc

    return dataset.map(_process_doc)