"src/vscode:/vscode.git/clone" did not exist on "94cba9e553359c5cbd934021b74012420a942fec"
Unverified Commit a2af2101 authored by Yen-Ting Lin's avatar Yen-Ting Lin Committed by GitHub
Browse files

Merge branch 'EleutherAI:main' into main

parents 82cb25c1 d5f39bf8
include: aqua-rat.yaml
group:
  - agieval
  - agieval_nous
  - agieval_en
task: agieval_lsat_lr
dataset_path: hails/agieval-lsat-lr

include: aqua-rat.yaml
group:
  - agieval
  - agieval_nous
  - agieval_en
task: agieval_lsat_rc
dataset_path: hails/agieval-lsat-rc

group:
  - agieval
  - agieval_en
task: agieval_math
dataset_path: hails/agieval-math
dataset_name: null
......
include: aqua-rat.yaml
group:
  - agieval
  - agieval_nous
  - agieval_en
task: agieval_sat_en_without_passage
dataset_path: hails/agieval-sat-en-without-passage

include: aqua-rat.yaml
group:
  - agieval
  - agieval_nous
  - agieval_en
task: agieval_sat_en
dataset_path: hails/agieval-sat-en

include: aqua-rat.yaml
group:
  - agieval
  - agieval_nous
  - agieval_en
task: agieval_sat_math
dataset_path: hails/agieval-sat-math
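Each of these AGIEval configs only overrides `task`, `dataset_path`, and `group` on top of the shared `aqua-rat.yaml` base file. As a rough mental model of how such an `include` resolves (an illustrative sketch, not the harness's actual config loader; the file name passed in is hypothetical):

```python
import yaml


def resolve_config(path: str) -> dict:
    """Toy resolver: merge a task YAML onto the file named by its `include` key."""
    with open(path, encoding="utf-8") as f:
        cfg = yaml.safe_load(f)
    base_name = cfg.pop("include", None)
    if base_name is None:
        return cfg
    merged = resolve_config(base_name)  # assumes the included file is in the current directory
    merged.update(cfg)  # per-task keys (task, dataset_path, group) override the base
    return merged


# Hypothetical usage: the effective agieval_sat_math config is the aqua-rat.yaml
# defaults with the three overridden keys shown above.
print(resolve_config("agieval_sat_math.yaml"))
```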
# ArabicMMLU

### Paper

ArabicMMLU: Measuring massive multitask language understanding in Arabic

This dataset has been translated from the original MMLU with the help of GPT-4.
The original data: [MMLU](https://arxiv.org/pdf/2009.03300v3.pdf)
The translation was done with the AceGPT researchers: [AceGPT](https://arxiv.org/abs/2309.12053)

ArabicMMLU is a comprehensive evaluation benchmark specifically designed to evaluate the knowledge and reasoning abilities of LLMs within the context of the Arabic language and culture.
ArabicMMLU covers a wide range of subjects, comprising 57 topics that span from elementary to advanced professional levels.

Homepage: [AceGPT Homepage](https://github.com/FreedomIntelligence/AceGPT/tree/main/eval/benchmark_eval/benchmarks/MMLUArabic)

### Citation

### Groups and Tasks

#### Groups

- `ammlu`: All 57 subjects of the ArabicMMLU dataset, evaluated following the methodology in MMLU's original implementation.

#### Tasks

The following tasks evaluate subjects in the ArabicMMLU dataset using loglikelihood-based multiple-choice scoring:
- `ammlu_{subject_english}`

### Checklist

For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation?
  * [x] Yes, original implementation contributed by author of the benchmark

If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?

# Arabic COPA

### Paper

Original Title: `COPA`

The Choice Of Plausible Alternatives (COPA) evaluation provides researchers with a tool for assessing progress in open-domain commonsense causal reasoning.

[Homepage](https://people.ict.usc.edu/~gordon/copa.html)

AlGhafa has translated this dataset to Arabic: [AlGhafa](https://aclanthology.org/2023.arabicnlp-1.21.pdf)
The link to the Arabic version of the dataset: [COPA](https://gitlab.com/tiiuae/alghafa/-/tree/main/arabic-eval/copa_ar)

### Citation

### Groups and Tasks

#### Groups

* Not part of a group yet.

#### Tasks

* `copa_ar`

### Checklist

For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?

If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
......
task: copa_ar
dataset_path: Hennara/copa_ar
dataset_name: null
output_type: multiple_choice
training_split: null
validation_split: null
test_split: test
doc_to_text: "السؤال: {{query}}\nالجواب:"  # "Question: {{query}}\nAnswer:"
doc_to_choice: "{{[sol1, sol2]}}"
doc_to_target: label
should_decontaminate: true
doc_to_decontamination_query: query
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
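With `output_type: multiple_choice`, each record is scored by comparing the model's log-likelihood of every candidate answer given the prompt; `acc` picks the highest raw score, while `acc_norm` first normalizes each score by the length of the choice (the harness uses byte length). A minimal sketch under those assumptions, where `loglikelihood` stands in for whatever model call you use and the field names `query`, `sol1`, `sol2`, `label` come from the config above:

```python
from typing import Callable, Dict


def score_doc(doc: Dict, loglikelihood: Callable[[str, str], float]) -> Dict[str, int]:
    """Toy multiple-choice scoring for one copa_ar-style record."""
    prompt = f"السؤال: {doc['query']}\nالجواب:"  # doc_to_text
    choices = [doc["sol1"], doc["sol2"]]         # doc_to_choice
    gold = doc["label"]                          # doc_to_target

    scores = [loglikelihood(prompt, choice) for choice in choices]
    # Length-normalized scores, so longer answers are not penalized for
    # simply containing more tokens.
    norm_scores = [s / max(len(c.encode("utf-8")), 1) for s, c in zip(scores, choices)]

    return {
        "acc": int(max(range(len(choices)), key=scores.__getitem__) == gold),
        "acc_norm": int(max(range(len(choices)), key=norm_scores.__getitem__) == gold),
    }
```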
# Arabic PIQA
### Paper
Original Title: `PIQA: Reasoning about Physical Commonsense in Natural Language`
Original paper: [PIQA](https://arxiv.org/abs/1911.11641)

Physical Interaction: Question Answering (PIQA) is a physical commonsense reasoning task and a corresponding benchmark dataset. PIQA was designed to investigate the physical knowledge of existing models: to what extent are current approaches actually learning about the world?
[Homepage](https://yonatanbisk.com/piqa)
AlGhafa has translated this dataset to Arabic: [AlGhafa](https://aclanthology.org/2023.arabicnlp-1.21.pdf)
The link to the Arabic version of the dataset: [PIQA](https://gitlab.com/tiiuae/alghafa/-/tree/main/arabic-eval/pica_ar)
### Citation
### Groups and Tasks
#### Groups
* Not part of a group yet.
#### Tasks
* `piqa_ar`
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?
task: piqa_ar
dataset_path: Hennara/pica_ar
dataset_name: null
output_type: multiple_choice
training_split: null
validation_split: null
test_split: test
doc_to_text: "السؤال: {{goal}}\nالجواب:"  # "Question: {{goal}}\nAnswer:"
doc_to_choice: "{{[sol1, sol2]}}"
doc_to_target: label
should_decontaminate: true
doc_to_decontamination_query: goal
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
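The `doc_to_text`, `doc_to_choice`, and `doc_to_target` fields are Jinja-style templates applied to each dataset record. A minimal rendering sketch, assuming the `jinja2` package and a made-up record (the harness's own handling of `doc_to_choice` is more involved, but the end result is a list of candidate answers):

```python
import ast

from jinja2 import Template

# Hypothetical piqa_ar-style record; field names match the config above.
doc = {
    "goal": "كيف تغلي الماء؟",
    "sol1": "ضع الماء في قدر على النار حتى يغلي.",
    "sol2": "ضع الماء في الثلاجة لعدة دقائق.",
    "label": 0,
}

prompt = Template("السؤال: {{goal}}\nالجواب:").render(**doc)
# "{{[sol1, sol2]}}" renders to the string "['…', '…']"; literal_eval
# turns it back into a Python list of the two candidate answers.
choices = ast.literal_eval(Template("{{[sol1, sol2]}}").render(**doc))

print(prompt)
print(choices[doc["label"]])  # the gold answer for this record
```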
"""
Take in a base YAML and output one YAML per ArabicMMLU subject that includes it.
"""
import argparse
import os
import yaml
from tqdm import tqdm
# Map each MMLU subject to its Arabic category label:
# "ألعلوم وتقنية المعلومات و الرياضيات" = science, IT and mathematics,
# "العلوم الإجتماعية" = social sciences, "العلوم الانسانية" = humanities,
# "علوم أخرى" = other sciences.
SUBJECTS = {
"abstract_algebra": "ألعلوم وتقنية المعلومات و الرياضيات",
"anatomy": "ألعلوم وتقنية المعلومات و الرياضيات",
"astronomy": "ألعلوم وتقنية المعلومات و الرياضيات",
"business_ethics": "علوم أخرى",
"clinical_knowledge": "علوم أخرى",
"college_biology": "ألعلوم وتقنية المعلومات و الرياضيات",
"college_chemistry": "ألعلوم وتقنية المعلومات و الرياضيات",
"college_computer_science": "ألعلوم وتقنية المعلومات و الرياضيات",
"college_mathematics": "ألعلوم وتقنية المعلومات و الرياضيات",
"college_medicine": "علوم أخرى",
"college_physics": "ألعلوم وتقنية المعلومات و الرياضيات",
"computer_security": "ألعلوم وتقنية المعلومات و الرياضيات",
"conceptual_physics": "ألعلوم وتقنية المعلومات و الرياضيات",
"econometrics": "العلوم الإجتماعية",
"electrical_engineering": "ألعلوم وتقنية المعلومات و الرياضيات",
"elementary_mathematics": "ألعلوم وتقنية المعلومات و الرياضيات",
"formal_logic": "العلوم الانسانية",
"global_facts": "علوم أخرى",
"high_school_biology": "ألعلوم وتقنية المعلومات و الرياضيات",
"high_school_chemistry": "ألعلوم وتقنية المعلومات و الرياضيات",
"high_school_computer_science": "ألعلوم وتقنية المعلومات و الرياضيات",
"high_school_european_history": "العلوم الانسانية",
"high_school_geography": "العلوم الإجتماعية",
"high_school_government_and_politics": "العلوم الإجتماعية",
"high_school_macroeconomics": "العلوم الإجتماعية",
"high_school_mathematics": "ألعلوم وتقنية المعلومات و الرياضيات",
"high_school_microeconomics": "العلوم الإجتماعية",
"high_school_physics": "ألعلوم وتقنية المعلومات و الرياضيات",
"high_school_psychology": "العلوم الإجتماعية",
"high_school_statistics": "ألعلوم وتقنية المعلومات و الرياضيات",
"high_school_us_history": "العلوم الانسانية",
"high_school_world_history": "العلوم الانسانية",
"human_aging": "علوم أخرى",
"human_sexuality": "العلوم الإجتماعية",
"international_law": "العلوم الانسانية",
"jurisprudence": "العلوم الانسانية",
"logical_fallacies": "العلوم الانسانية",
"machine_learning": "ألعلوم وتقنية المعلومات و الرياضيات",
"management": "علوم أخرى",
"marketing": "علوم أخرى",
"medical_genetics": "علوم أخرى",
"miscellaneous": "علوم أخرى",
"moral_disputes": "العلوم الانسانية",
"moral_scenarios": "العلوم الانسانية",
"nutrition": "علوم أخرى",
"philosophy": "العلوم الانسانية",
"prehistory": "العلوم الانسانية",
"professional_accounting": "علوم أخرى",
"professional_law": "العلوم الانسانية",
"professional_medicine": "علوم أخرى",
"professional_psychology": "العلوم الإجتماعية",
"public_relations": "العلوم الإجتماعية",
"security_studies": "العلوم الإجتماعية",
"sociology": "العلوم الإجتماعية",
"us_foreign_policy": "العلوم الإجتماعية",
"virology": "علوم أخرى",
"world_religions": "العلوم الانسانية",
}
def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--base_yaml_path", required=True)
    parser.add_argument("--save_prefix_path", default="ammlu")
    parser.add_argument("--cot_prompt_path", default=None)
    parser.add_argument("--task_prefix", default="")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()

    # Get the filename of base_yaml so we can `include` it in the generated YAMLs.
    base_yaml_name = os.path.split(args.base_yaml_path)[-1]
    with open(args.base_yaml_path, encoding="utf-8") as f:
        base_yaml = yaml.full_load(f)

    # Optional per-subject descriptions (e.g. chain-of-thought prompts) from a JSON file.
    if args.cot_prompt_path is not None:
        import json

        with open(args.cot_prompt_path, encoding="utf-8") as f:
            cot_file = json.load(f)

    for subject_eng, category in tqdm(SUBJECTS.items()):
        if args.cot_prompt_path is not None:
            description = cot_file[subject_eng]
        else:
            # Arabic: "Carry out the evaluation in the field of {category}"
            description = f"فم بعملية التقييم في مجال {category} \n\n"

        yaml_dict = {
            "include": base_yaml_name,
            "task": (
                f"ammlu_{args.task_prefix}_{subject_eng}"
                if args.task_prefix != ""
                else f"ammlu_{subject_eng}"
            ),
            "dataset_name": subject_eng,
            "description": description,
        }

        file_save_path = args.save_prefix_path + f"_{subject_eng}.yaml"
        print(f"Saving yaml for subset {subject_eng} to {file_save_path}")
        with open(file_save_path, "w", encoding="utf-8") as yaml_file:
            yaml.dump(
                yaml_dict,
                yaml_file,
                width=float("inf"),
                allow_unicode=True,
                default_style='"',
            )
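The optional `--cot_prompt_path` file is read with `json.load` and indexed by the English subject name, so it is expected to be a JSON object with one description per key in `SUBJECTS`. A minimal sketch of producing such a file (the file name and description strings here are hypothetical):

```python
import json

# Hypothetical per-subject descriptions; every key in SUBJECTS must be present,
# because the script looks up cot_file[subject_eng] for each subject.
cot_prompts = {
    "abstract_algebra": "Custom few-shot / chain-of-thought preamble for abstract_algebra.\n\n",
    "anatomy": "Custom few-shot / chain-of-thought preamble for anatomy.\n\n",
    # ... one entry for each remaining subject
}

with open("cot_prompts.json", "w", encoding="utf-8") as f:
    json.dump(cot_prompts, f, ensure_ascii=False, indent=2)
```

Running the script with its defaults writes one quoted-style YAML per subject, like the generated files that follow.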
"dataset_name": "abstract_algebra"
"description": "فم بعملية التقييم في مجال ألعلوم وتقنية المعلومات و الرياضيات \n\n"
"include": "_default_template_yaml"
"task": "ammlu_abstract_algebra"
"dataset_name": "anatomy"
"description": "فم بعملية التقييم في مجال ألعلوم وتقنية المعلومات و الرياضيات \n\n"
"include": "_default_template_yaml"
"task": "ammlu_anatomy"
"dataset_name": "astronomy"
"description": "فم بعملية التقييم في مجال ألعلوم وتقنية المعلومات و الرياضيات \n\n"
"include": "_default_template_yaml"
"task": "ammlu_astronomy"
"dataset_name": "college_biology"
"description": "فم بعملية التقييم في مجال ألعلوم وتقنية المعلومات و الرياضيات \n\n"
"include": "_default_template_yaml"
"task": "ammlu_college_biology"
"dataset_name": "college_chemistry"
"description": "فم بعملية التقييم في مجال ألعلوم وتقنية المعلومات و الرياضيات \n\n"
"include": "_default_template_yaml"
"task": "ammlu_college_chemistry"
"dataset_name": "college_computer_science"
"description": "فم بعملية التقييم في مجال ألعلوم وتقنية المعلومات و الرياضيات \n\n"
"include": "_default_template_yaml"
"task": "ammlu_college_computer_science"
"dataset_name": "college_physics"
"description": "فم بعملية التقييم في مجال ألعلوم وتقنية المعلومات و الرياضيات \n\n"
"include": "_default_template_yaml"
"task": "ammlu_college_physics"
"dataset_name": "computer_security"
"description": "فم بعملية التقييم في مجال ألعلوم وتقنية المعلومات و الرياضيات \n\n"
"include": "_default_template_yaml"
"task": "ammlu_computer_security"
"dataset_name": "conceptual_physics"
"description": "فم بعملية التقييم في مجال ألعلوم وتقنية المعلومات و الرياضيات \n\n"
"include": "_default_template_yaml"
"task": "ammlu_conceptual_physics"