Unverified commit 5ab295c8 authored by Uanu, committed by GitHub

Add a new task GPQA (the part without CoT) (#1434)



* add new task GPQA_n_shot

* add new task GPQA_zeroshot

* correct GPQA_zeroshot filename

* Add randomly shuffle choices

* Correct missing parentheses

* delete wrong tasks

* Add README

* Update lm_eval/tasks/gpqa/zeroshot/_gpqa_zeroshot_yaml

* Update lm_eval/tasks/gpqa/n_shot/utils.py

* Update lm_eval/tasks/gpqa/n_shot/utils.py

* Update lm_eval/tasks/gpqa/README.md

* placate linter

* linter

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
# GPQA
### Paper
Title: GPQA: A Graduate-Level Google-Proof Q&A Benchmark
Abstract: https://arxiv.org/abs/2311.12022
We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web (i.e., the questions are “Google-proof”). The questions are also difficult for state-of-the-art AI systems, with our strongest GPT-4–based baseline achieving 39% accuracy. If we are to use future AI systems to help us answer very hard questions—for example, when developing new scientific knowledge—we need to develop *scalable oversight* methods that enable humans to supervise their outputs, which may be difficult even if the supervisors are themselves skilled and knowledgeable. The difficulty of GPQA both for skilled non-experts and frontier AI systems should enable realistic scalable oversight experiments, which we hope can help devise ways for human experts to reliably get truthful information from AI systems that surpass human capabilities.
Homepage: `https://github.com/idavidrein/gpqa/tree/main`
### Citation
```
@misc{rein2023gpqa,
title={GPQA: A Graduate-Level Google-Proof Q&A Benchmark},
author={David Rein and Betty Li Hou and Asa Cooper Stickland and Jackson Petty and Richard Yuanzhe Pang and Julien Dirani and Julian Michael and Samuel R. Bowman},
year={2023},
eprint={2311.12022},
archivePrefix={arXiv},
primaryClass={cs.AI}
}
```
This dataset is gated, so you will have to accept the terms of use at https://huggingface.co/datasets/Idavidrein/gpqa and log in via `huggingface-cli login` with your HF Hub token before running this task.
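As a quick check that the gated data is reachable, the following sketch loads one subset directly (it assumes you have already accepted the terms and logged in; `gpqa_main` matches the `dataset_name` used by the generated task configs):

```python
# Minimal access check for the gated GPQA dataset. Assumes you have already
# accepted the terms of use and run `huggingface-cli login` (or called
# huggingface_hub.login() with your token) beforehand.
import datasets

# "gpqa_main" matches the dataset_name in the generated task configs;
# "gpqa_diamond" and "gpqa_extended" work the same way.
ds = datasets.load_dataset("Idavidrein/gpqa", "gpqa_main", split="train")
print(len(ds), ds.column_names)
```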
### Groups and Tasks
#### Groups
* `gpqa`
#### Tasks
* `gpqa_{main, diamond, extended}_zeroshot`
* `gpqa_{main, diamond, extended}_n_shot`
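As a rough sketch (not an official recipe), one of these tasks can be run programmatically through the harness's `simple_evaluate` entry point; the model string below is only a placeholder:

```python
# Sketch: run the zero-shot GPQA main split through the Python API.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=EleutherAI/pythia-1.4b",  # placeholder model
    tasks=["gpqa_main_zeroshot"],
    batch_size=8,
)
print(results["results"]["gpqa_main_zeroshot"])
```

The same task names can also be passed to the `lm_eval` command-line interface via `--tasks`.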
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
import yaml
from tqdm import tqdm


def main() -> None:
    subset = ["extended", "diamond", "experts", "main"]
    for task in tqdm(subset):
        file_name = f"gpqa_{task}_n_shot.yaml"
        try:
            with open(f"{file_name}", "w") as f:
                f.write("# Generated by _generate_configs.py\n")
                yaml.dump(
                    {
                        "include": "_gpqa_n_shot_yaml",
                        "task": f"gpqa_{task}_n_shot",
                        "dataset_name": f"gpqa_{task}",
                    },
                    f,
                )
        except FileExistsError:
            pass


if __name__ == "__main__":
    main()
dataset_path: Idavidrein/gpqa
group: gpqa
output_type: multiple_choice
process_docs: !function utils.process_docs
training_split: train
# The HF dataset only has a train split, so it is reused for validation
validation_split: train
test_split: null
description: "Here are some example questions from experts. Answer the final question yourself, following the format of the previous questions exactly.\n"
doc_to_text: "Question: {{Question}}\nChoices:\n(A) {{choice1}}\n(B) {{choice2}}\n(C) {{choice3}}\n(D) {{choice4}}\nAnswer:"
doc_to_target: answer
doc_to_choice: ["(A)", "(B)", "(C)", "(D)"]
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
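For orientation only: rendering the `doc_to_text` template above against a processed document yields the prompt shape below. This sketch is not part of the task code; it assumes `jinja2` is installed and uses invented field values.

```python
# Illustrative only: render the doc_to_text template for one fake processed doc.
import jinja2

template = jinja2.Template(
    "Question: {{Question}}\nChoices:\n(A) {{choice1}}\n(B) {{choice2}}\n"
    "(C) {{choice3}}\n(D) {{choice4}}\nAnswer:"
)
doc = {
    "Question": "Which particle mediates the electromagnetic force?",  # made up
    "choice1": "Gluon",
    "choice2": "Photon",
    "choice3": "W boson",
    "choice4": "Graviton",
}
print(template.render(**doc))
```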
# Generated by _generate_configs.py
dataset_name: gpqa_diamond
include: _gpqa_n_shot_yaml
task: gpqa_diamond_n_shot
# Generated by _generate_configs.py
dataset_name: gpqa_extended
include: _gpqa_n_shot_yaml
task: gpqa_extended_n_shot
# Generated by _generate_configs.py
dataset_name: gpqa_main
include: _gpqa_n_shot_yaml
task: gpqa_main_n_shot
import datasets
import re
import random


def preprocess(text):
    """Light cleanup of question/answer strings from the HF dataset."""
    if text is None:
        return " "
    text = text.strip()
    text = text.replace(" [title]", ". ")
    text = re.sub("\\[.*?\\]", "", text)
    text = text.replace("  ", " ")
    return text


# Fixed seed so the shuffled choice order is reproducible across runs.
rng = random.Random(42)


def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
    def _process_doc(doc):
        choices = [
            preprocess(doc["Incorrect Answer 1"]),
            preprocess(doc["Incorrect Answer 2"]),
            preprocess(doc["Incorrect Answer 3"]),
            preprocess(doc["Correct Answer"]),
        ]

        rng.shuffle(choices)
        correct_answer_index = choices.index(preprocess(doc["Correct Answer"]))

        out_doc = {
            "choice1": choices[0],
            "choice2": choices[1],
            "choice3": choices[2],
            "choice4": choices[3],
            "answer": f"({chr(65 + correct_answer_index)})",
        }
        return out_doc

    return dataset.map(_process_doc)
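A quick, informal sanity check of the shuffling logic above. This sketch assumes the file can be imported as `utils` (e.g. when run from the same directory) and uses an invented toy question:

```python
# Sketch: verify that process_docs records where the correct answer landed.
import datasets

import utils  # assumes this utils.py is on the import path

toy = datasets.Dataset.from_dict(
    {
        "Correct Answer": ["Photon"],
        "Incorrect Answer 1": ["Gluon"],
        "Incorrect Answer 2": ["W boson"],
        "Incorrect Answer 3": ["Graviton"],
    }
)
doc = utils.process_docs(toy)[0]
letter = doc["answer"].strip("()")          # e.g. "B"
assert doc[f"choice{ord(letter) - 64}"] == "Photon"
print(doc)
```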
import yaml
from tqdm import tqdm


def main() -> None:
    subset = ["extended", "diamond", "experts", "main"]
    setting = "zeroshot"
    for task in tqdm(subset):
        file_name = f"gpqa_{task}_{setting}.yaml"
        try:
            with open(f"{file_name}", "w") as f:
                f.write("# Generated by _generate_configs.py\n")
                yaml.dump(
                    {
                        "include": f"_gpqa_{setting}_yaml",
                        "task": f"gpqa_{task}_{setting}",
                        "dataset_name": f"gpqa_{task}",
                    },
                    f,
                )
        except FileExistsError:
            pass


if __name__ == "__main__":
    main()
dataset_path: Idavidrein/gpqa
group: gpqa
output_type: multiple_choice
process_docs: !function utils.process_docs
training_split: train
# The HF dataset only has a train split, so it is reused for validation
validation_split: train
test_split: null
doc_to_text: "What is the correct answer to this question:{{Question}}\nChoices:\n(A) {{choice1}}\n(B) {{choice2}}\n(C) {{choice3}}\n(D) {{choice4}}\nAnswer:"
doc_to_target: answer
doc_to_choice: ["(A)", "(B)", "(C)", "(D)"]
num_fewshot: 0
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
# Generated by _generate_configs.py
dataset_name: gpqa_diamond
include: _gpqa_zeroshot_yaml
task: gpqa_diamond_zeroshot
# Generated by _generate_configs.py
dataset_name: gpqa_extended
include: _gpqa_zeroshot_yaml
task: gpqa_extended_zeroshot
# Generated by _generate_configs.py
dataset_name: gpqa_main
include: _gpqa_zeroshot_yaml
task: gpqa_main_zeroshot
import datasets
import re
import random


def preprocess(text):
    """Light cleanup of question/answer strings from the HF dataset."""
    if text is None:
        return " "
    text = text.strip()
    text = text.replace(" [title]", ". ")
    text = re.sub("\\[.*?\\]", "", text)
    text = text.replace("  ", " ")
    return text


def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
    def _process_doc(doc):
        choices = [
            preprocess(doc["Incorrect Answer 1"]),
            preprocess(doc["Incorrect Answer 2"]),
            preprocess(doc["Incorrect Answer 3"]),
            preprocess(doc["Correct Answer"]),
        ]

        # Note: unlike the n_shot variant, this uses the global (unseeded)
        # random module, so the choice order is not fixed across runs.
        random.shuffle(choices)
        correct_answer_index = choices.index(preprocess(doc["Correct Answer"]))

        out_doc = {
            "choice1": choices[0],
            "choice2": choices[1],
            "choice3": choices[2],
            "choice4": choices[3],
            "answer": f"({chr(65 + correct_answer_index)})",
        }
        return out_doc

    return dataset.map(_process_doc)