# ARC
### Paper
Title: Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Abstract: https://arxiv.org/abs/1803.05457
The ARC dataset consists of 7,787 science exam questions drawn from a variety
of sources, including science questions provided under license by a research
partner affiliated with AI2. These are text-only, English language exam questions
that span several grade levels as indicated in the files. Each question has a
multiple choice structure (typically 4 answer options). The questions are sorted
into a Challenge Set of 2,590 “hard” questions (those that both a retrieval and
a co-occurrence method fail to answer correctly) and an Easy Set of 5,197 questions.
Homepage: https://allenai.org/data/arc
### Citation
```
@article{Clark2018ThinkYH,
title={Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge},
author={Peter Clark and Isaac Cowhey and Oren Etzioni and Tushar Khot and Ashish Sabharwal and Carissa Schoenick and Oyvind Tafjord},
journal={ArXiv},
year={2018},
volume={abs/1803.05457}
}
```
### Groups and Tasks
#### Groups
* `ai2_arc`: Evaluates `arc_easy` and `arc_challenge`
#### Tasks
* `arc_easy`
* `arc_challenge`
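
Both the group and the individual tasks can be selected by name when running the harness. Below is a hedged sketch using the library's Python entry point; the model name and `limit` are placeholders, and `lm_eval.simple_evaluate` is assumed to be available as in recent lm-evaluation-harness releases:

```python
# Hypothetical quick check: evaluate the ai2_arc group on a small model.
# The model name and limit below are illustrative only.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["ai2_arc"],  # expands to arc_easy and arc_challenge
    limit=10,           # only a handful of documents, as a smoke test
)
print(results["results"])
```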
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
```yaml
include: arc_easy.yaml
task: arc_challenge
dataset_name: ARC-Challenge
```

```yaml
group:
  - ai2_arc
task: arc_easy
dataset_path: allenai/ai2_arc
dataset_name: ARC-Easy
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{choices.label.index(answerKey)}}"
doc_to_choice: "{{choices.text}}"
should_decontaminate: true
doc_to_decontamination_query: "Question: {{question}}\nAnswer:"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
```
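
To make the templates concrete, here is a minimal sketch (the record below is invented, in the shape of `allenai/ai2_arc` rows) of how `doc_to_text`, `doc_to_choice`, and `doc_to_target` turn a document into a prompt, a list of choices, and a gold index:

```python
# Invented example record; real rows come from the allenai/ai2_arc dataset
# and carry the same fields (question, choices.text, choices.label, answerKey).
doc = {
    "question": "Which property of a mineral can be determined just by looking at it?",
    "choices": {
        "text": ["luster", "mass", "weight", "hardness"],
        "label": ["A", "B", "C", "D"],
    },
    "answerKey": "A",
}

prompt = f"Question: {doc['question']}\nAnswer:"          # doc_to_text
choices = doc["choices"]["text"]                          # doc_to_choice
target = doc["choices"]["label"].index(doc["answerKey"])  # doc_to_target -> 0

print(prompt)
print(choices[target])  # "luster"
```

The harness scores each choice as a continuation of the prompt; `acc` takes the raw log-likelihood argmax, while `acc_norm` normalizes each score by the length of the choice before taking the argmax.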
# Arithmetic
### Paper
Title: `Language Models are Few-Shot Learners`
Abstract: https://arxiv.org/abs/2005.14165
A small battery of 10 tests that involve asking language models a simple arithmetic
problem in natural language.
Homepage: https://github.com/openai/gpt-3/tree/master/data
### Citation
```
@inproceedings{NEURIPS2020_1457c0d6,
author = {Brown, Tom and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel and Wu, Jeffrey and Winter, Clemens and Hesse, Chris and Chen, Mark and Sigler, Eric and Litwin, Mateusz and Gray, Scott and Chess, Benjamin and Clark, Jack and Berner, Christopher and McCandlish, Sam and Radford, Alec and Sutskever, Ilya and Amodei, Dario},
booktitle = {Advances in Neural Information Processing Systems},
editor = {H. Larochelle and M. Ranzato and R. Hadsell and M. F. Balcan and H. Lin},
pages = {1877--1901},
publisher = {Curran Associates, Inc.},
title = {Language Models are Few-Shot Learners},
url = {https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf},
volume = {33},
year = {2020}
}
```
### Groups and Tasks
#### Groups
* `arithmetic`: Evaluates all ten sub-tasks, `arithmetic_1dc` through `arithmetic_5ds`
#### Tasks
* `arithmetic_1dc`: single-digit composite operations, e.g. `6+(4*8)`
* `arithmetic_2da`: 2-digit addition
* `arithmetic_2dm`: 2-digit multiplication
* `arithmetic_2ds`: 2-digit subtraction
* `arithmetic_3da`: 3-digit addition
* `arithmetic_3ds`: 3-digit subtraction
* `arithmetic_4da`: 4-digit addition
* `arithmetic_4ds`: 4-digit subtraction
* `arithmetic_5da`: 5-digit addition
* `arithmetic_5ds`: 5-digit subtraction
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
```yaml
group:
  - arithmetic
task: arithmetic_1dc
dataset_path: EleutherAI/arithmetic
dataset_name: arithmetic_1dc
output_type: loglikelihood
validation_split: validation
test_split: null
doc_to_text: "{{context}}"
doc_to_target: "{{completion}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
dataset_kwargs:
  trust_remote_code: true
```

```yaml
include: arithmetic_1dc.yaml
task: arithmetic_2da
dataset_name: arithmetic_2da
dataset_kwargs:
  trust_remote_code: true
```

```yaml
include: arithmetic_1dc.yaml
task: arithmetic_2dm
dataset_name: arithmetic_2dm
dataset_kwargs:
  trust_remote_code: true
```

```yaml
include: arithmetic_1dc.yaml
task: arithmetic_2ds
dataset_name: arithmetic_2ds
dataset_kwargs:
  trust_remote_code: true
```

```yaml
include: arithmetic_1dc.yaml
task: arithmetic_3da
dataset_name: arithmetic_3da
dataset_kwargs:
  trust_remote_code: true
```

```yaml
include: arithmetic_1dc.yaml
task: arithmetic_3ds
dataset_name: arithmetic_3ds
dataset_kwargs:
  trust_remote_code: true
```

```yaml
include: arithmetic_1dc.yaml
task: arithmetic_4da
dataset_name: arithmetic_4da
dataset_kwargs:
  trust_remote_code: true
```

```yaml
include: arithmetic_1dc.yaml
task: arithmetic_4ds
dataset_name: arithmetic_4ds
dataset_kwargs:
  trust_remote_code: true
```

```yaml
include: arithmetic_1dc.yaml
task: arithmetic_5da
dataset_name: arithmetic_5da
dataset_kwargs:
  trust_remote_code: true
```

```yaml
include: arithmetic_1dc.yaml
task: arithmetic_5ds
dataset_name: arithmetic_5ds
dataset_kwargs:
  trust_remote_code: true
```
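
The rendered documents for these tasks are simple completion pairs. A minimal sketch follows; the field contents are invented, and real rows come from `EleutherAI/arithmetic` with `context` and `completion` fields:

```python
# Invented example of an arithmetic row as the templates above see it.
doc = {
    "context": "Question: What is 83 plus 14?\nAnswer:",
    "completion": " 97",
}

prompt = doc["context"]     # doc_to_text: "{{context}}"
target = doc["completion"]  # doc_to_target: "{{completion}}"

# With output_type: loglikelihood, the harness scores `target` as a continuation
# of `prompt`; acc records whether the target would also be the model's greedy
# continuation.
print(repr(prompt), repr(target))
```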
# ASDiv
### Paper
Title: `ASDiv: A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers`
Abstract: https://arxiv.org/abs/2106.15772
ASDiv (Academia Sinica Diverse MWP Dataset) is a diverse (in terms of both language
patterns and problem types) English math word problem (MWP) corpus for evaluating
the capability of various MWP solvers. Existing MWP corpora for studying AI progress
remain limited either in language usage patterns or in problem types. We thus present
a new English MWP corpus with 2,305 MWPs that cover more text patterns and most problem
types taught in elementary school. Each MWP is annotated with its problem type and grade
level (for indicating the level of difficulty).
NOTE: We currently ignore formulas for answer generation.
Homepage: https://github.com/chaochun/nlu-asdiv-dataset
### Citation
```
@misc{miao2021diverse,
title={A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers},
author={Shen-Yun Miao and Chao-Chun Liang and Keh-Yih Su},
year={2021},
eprint={2106.15772},
archivePrefix={arXiv},
primaryClass={cs.AI}
}
```
### Groups and Tasks
#### Groups
* Not part of a group yet.
#### Tasks
* `asdiv`
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
```yaml
task: asdiv
dataset_path: EleutherAI/asdiv
output_type: loglikelihood
validation_split: validation
doc_to_text: "{{body}}\nQuestion:{{question}}\nAnswer:"
doc_to_target: "{{answer.split(' (')[0]}}"
should_decontaminate: true
doc_to_decontamination_query: "{{body}} {{question}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
dataset_kwargs:
  trust_remote_code: true
```
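
The `doc_to_target` template above strips the unit annotation from the answer field. A minimal sketch with an invented record (real rows come from `EleutherAI/asdiv` and carry an answer of the form `"<number> (<unit>)"`):

```python
# Invented ASDiv-style record for illustration.
doc = {
    "body": "Ellen has 6 more balls than Marin. Marin has 9 balls.",
    "question": "How many balls does Ellen have?",
    "answer": "15 (balls)",
}

prompt = f"{doc['body']}\nQuestion:{doc['question']}\nAnswer:"  # doc_to_text
target = doc["answer"].split(" (")[0]                           # doc_to_target -> "15"

print(prompt)
print(target)
```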
# bAbI
### Paper
Title: Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks
Abstract: https://arxiv.org/abs/1502.05698
One long-term goal of machine learning research is to produce methods that are applicable to reasoning and natural language, in particular building an intelligent dialogue agent. To measure progress towards that goal, we argue for the usefulness of a set of proxy tasks that evaluate reading comprehension via question answering. Our tasks measure understanding in several ways: whether a system is able to answer questions via chaining facts, simple induction, deduction and many more. The tasks are designed to be prerequisites for any system that aims to be capable of conversing with a human. We believe many existing learning systems can currently not solve them, and hence our aim is to classify these tasks into skill sets, so that researchers can identify (and then rectify) the failings of their systems. We also extend and improve the recently introduced Memory Networks model, and show it is able to solve some, but not all, of the tasks.
Homepage: https://github.com/facebookarchive/bAbI-tasks
### Citation
```
@article{weston2015towards,
title={Towards ai-complete question answering: A set of prerequisite toy tasks},
author={Weston, Jason and Bordes, Antoine and Chopra, Sumit and Rush, Alexander M and Van Merri{\"e}nboer, Bart and Joulin, Armand and Mikolov, Tomas},
journal={arXiv preprint arXiv:1502.05698},
year={2015}
}
```
### Groups and Tasks
#### Groups
* Not part of a group yet
#### Tasks
* `babi`
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
```yaml
task: babi
dataset_path: Muennighoff/babi
dataset_name: null
output_type: generate_until
training_split: train
validation_split: valid
test_split: test
doc_to_text: "Passage: {{passage}}Question: {{question}}\nAnswer:"
doc_to_target: " {{answer}}"
target_delimiter: ""
generation_kwargs:
  until:
    - "\n"
    - "Passage:"
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
```
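
To illustrate the generation setup, here is a minimal sketch with an invented bAbI-style record (real rows come from `Muennighoff/babi`):

```python
# Invented bAbI-style record for illustration.
doc = {
    "passage": "Mary moved to the bathroom.\nJohn went to the hallway.\n",
    "question": "Where is Mary?",
    "answer": "bathroom",
}

prompt = f"Passage: {doc['passage']}Question: {doc['question']}\nAnswer:"  # doc_to_text
target = f" {doc['answer']}"  # doc_to_target; target_delimiter "" adds no separator

# The model generates from `prompt` until it emits "\n" or "Passage:"; the harness then
# compares the truncated generation against `target` with the exact_match metric.
generation = " bathroom"      # a hypothetical model output after truncation
print(generation == target)   # illustrative check; the harness's metric does the scoring
```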
# BigBenchHard
## Paper
Title: `Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them`
Abstract: https://arxiv.org/abs/2210.09261
A suite of 23 challenging BIG-Bench tasks which we call BIG-Bench Hard (BBH).
These are the tasks for which prior language model evaluations did not outperform
the average human rater.
Homepage: https://github.com/suzgunmirac/BIG-Bench-Hard
## Citation
```
@article{suzgun2022challenging,
title={Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them},
author={Suzgun, Mirac and Scales, Nathan and Sch{\"a}rli, Nathanael and Gehrmann, Sebastian and Tay, Yi and Chung, Hyung Won and Chowdhery, Aakanksha and Le, Quoc V and Chi, Ed H and Zhou, Denny and and Wei, Jason},
journal={arXiv preprint arXiv:2210.09261},
year={2022}
}
```
### Groups and Tasks
#### Groups
- `bbh_zeroshot`
- `bbh_fewshot`
- `bbh_cot_fewshot`
- `bbh_cot_zeroshot`
#### Tasks
- ...
### Checklist
- [x] Is in Eval-harness v1.0?
- [ ] Has been checked for regression from v1.0?
- [ ] Has been checked for equivalence with original paper methodology?
- [ ] Is the "Main" variant of this task clearly denoted?
### Variant Wishlist
- [ ] Variant with Calculator (see https://github.com/openai/grade-school-math/blob/master/grade_school_math/calculator.py for example implementation)
- [ ] Using Verifiers
- [ ] Majority voting "without CoT"
"""
Take in a YAML, and output all other splits with this YAML
"""
import argparse
import os
import re
import datasets
import requests
import yaml
from tqdm import tqdm
from lm_eval import utils
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument("--base_yaml_path", required=True)
parser.add_argument("--save_prefix_path", default="zeroshot")
parser.add_argument("--cot", default=False)
parser.add_argument("--fewshot", default=False)
parser.add_argument("--task_prefix", default="")
return parser.parse_args()
if __name__ == "__main__":
args = parse_args()
# get filename of base_yaml so we can `"include": ` it in our other YAMLs.
base_yaml_name = os.path.split(args.base_yaml_path)[-1]
with open(args.base_yaml_path, encoding="utf-8") as f:
base_yaml = yaml.full_load(f)
base_doc_to_text = "Q: {{input}}\nA:"
answer_regex = re.compile("(?<=answer is )(.*)(?=.)")
dataset_path = "lukaemon/bbh"
for task in tqdm(datasets.get_dataset_infos(dataset_path).keys()):
resp = requests.get(
f"https://raw.githubusercontent.com/suzgunmirac/BIG-Bench-Hard/main/cot-prompts/{task}.txt"
).content.decode("utf-8")
prompt = resp.split("\n-----\n")[-1]
description, *few_shot = prompt.split("\n\n")
prefix_doc_to_text = ""
if args.fewshot:
if args.cot:
prefix_doc_to_text = "\n\n".join(few_shot) + "\n\n"
else:
for shot in few_shot:
try:
answer = answer_regex.search(shot)[0]
except Exception:
print("task", task)
print(shot)
example = shot.split("Let's think step by step.")[0]
prefix_doc_to_text += f"{example}{answer}\n\n"
doc_to_text = prefix_doc_to_text + base_doc_to_text
if args.cot:
doc_to_text = doc_to_text + " Let's think step by step.\n"
yaml_dict = {
"include": base_yaml_name,
"task": f"bbh_{args.task_prefix}_{task}",
"dataset_name": task,
"description": description + "\n\n",
"doc_to_text": doc_to_text,
}
file_save_path = args.save_prefix_path + f"/{task}.yaml"
utils.eval_logger.info(f"Saving yaml for subset {task} to {file_save_path}")
with open(file_save_path, "w", encoding="utf-8") as yaml_file:
yaml.dump(
yaml_dict,
yaml_file,
width=float("inf"),
allow_unicode=True,
default_style='"',
)
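
To make the few-shot (non-CoT) branch concrete, here is a small illustration with an invented exemplar; it mirrors what the loop above does to each shot when `--fewshot` is set without `--cot`:

```python
import re

# Same pattern as in the script: grab the final answer from a CoT exemplar,
# dropping the trailing period.
answer_regex = re.compile("(?<=answer is )(.*)(?=.)")

# Invented exemplar in the style of the official BBH cot-prompts files.
shot = (
    'Q: Is the following sentence plausible? "The pitcher threw a curveball."\n'
    "A: Let's think step by step.\n"
    "A curveball is a type of pitch, and pitchers throw pitches. So the answer is yes."
)

answer = answer_regex.search(shot)[0]                 # -> "yes"
example = shot.split("Let's think step by step.")[0]  # question plus "A: "
print(f"{example}{answer}")                           # the stripped, no-CoT exemplar
```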