"dataset_name": "professional_law"
"description": "فم بعملية التقييم في مجال العلوم الانسانية \n\n"
"include": "_default_template_yaml"
"task": "ammlu_professional_law"
"dataset_name": "professional_medicine"
"description": "فم بعملية التقييم في مجال علوم أخرى \n\n"
"include": "_default_template_yaml"
"task": "ammlu_professional_medicine"
"dataset_name": "professional_psychology"
"description": "فم بعملية التقييم في مجال العلوم الإجتماعية \n\n"
"include": "_default_template_yaml"
"task": "ammlu_professional_psychology"
"dataset_name": "public_relations"
"description": "فم بعملية التقييم في مجال العلوم الإجتماعية \n\n"
"include": "_default_template_yaml"
"task": "ammlu_public_relations"
"dataset_name": "security_studies"
"description": "فم بعملية التقييم في مجال العلوم الإجتماعية \n\n"
"include": "_default_template_yaml"
"task": "ammlu_security_studies"
"dataset_name": "sociology"
"description": "فم بعملية التقييم في مجال العلوم الإجتماعية \n\n"
"include": "_default_template_yaml"
"task": "ammlu_sociology"
"dataset_name": "us_foreign_policy"
"description": "فم بعملية التقييم في مجال العلوم الإجتماعية \n\n"
"include": "_default_template_yaml"
"task": "ammlu_us_foreign_policy"
"dataset_name": "virology"
"description": "فم بعملية التقييم في مجال علوم أخرى \n\n"
"include": "_default_template_yaml"
"task": "ammlu_virology"
"dataset_name": "world_religions"
"description": "فم بعملية التقييم في مجال العلوم الانسانية \n\n"
"include": "_default_template_yaml"
"task": "ammlu_world_religions"
# ANLI
### Paper
Title: `Adversarial NLI: A New Benchmark for Natural Language Understanding`
Paper Link: https://arxiv.org/abs/1910.14599
Adversarial NLI (ANLI) is a dataset collected via an iterative, adversarial
human-and-model-in-the-loop procedure. It consists of three rounds that progressively
increase in difficulty and complexity, and each question-answer pair includes
annotator-provided explanations.
Homepage: https://github.com/facebookresearch/anli
### Citation
```
@inproceedings{nie-etal-2020-adversarial,
title = "Adversarial {NLI}: A New Benchmark for Natural Language Understanding",
author = "Nie, Yixin and
Williams, Adina and
Dinan, Emily and
Bansal, Mohit and
Weston, Jason and
Kiela, Douwe",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
year = "2020",
publisher = "Association for Computational Linguistics",
}
```
### Groups and Tasks
#### Groups
* `anli`: Evaluates `anli_r1`, `anli_r2`, and `anli_r3` (see the usage sketch after the task list below)
#### Tasks
* `anli_r1`: The data collected adversarially in the first round.
* `anli_r2`: The data collected adversarially in the second round, after training on the previous round's data.
* `anli_r3`: The data collected adversarially in the third round, after training on the previous multiple rounds of data.
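A minimal usage sketch for the group and tasks above, assuming the Python entry point `lm_eval.simple_evaluate` exposed by recent lm-evaluation-harness releases; the model checkpoint, batch size, and few-shot count below are placeholders, not a prescribed setup.

```python
# Sketch: evaluate the `anli` group (all three rounds) through the harness's
# Python API. Model choice and settings are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # HuggingFace causal-LM backend
    model_args="pretrained=EleutherAI/pythia-160m",  # placeholder checkpoint
    tasks=["anli"],                                  # group name; expands to anli_r1/r2/r3
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])  # per-round accuracy
```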
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
group:
  - anli
task: anli_r1
dataset_path: anli
dataset_name: null
output_type: multiple_choice
training_split: train_r1
validation_split: dev_r1
test_split: test_r1
doc_to_text: "{{premise}}\nQuestion: {{hypothesis}} True, False, or Neither?\nAnswer:"
# True = entailment
# False = contradiction
# Neither = neutral
doc_to_target: "{{['True', 'Neither', 'False'][label]}}"
doc_to_choice:
  - "True"
  - "Neither"
  - "False"
should_decontaminate: true
doc_to_decontamination_query: premise
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
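In the `anli_r1` config above, `doc_to_text` and `doc_to_target` are Jinja templates applied to each ANLI example, with the numeric label (0 = entailment, 1 = neutral, 2 = contradiction) mapped onto the verbalized choices `True`/`Neither`/`False`. A minimal rendering sketch, using a made-up example document rather than a real dataset row:

```python
# Sketch: render the anli_r1 prompt and target templates for one illustrative
# example. The premise/hypothesis text is invented; only the templates and the
# label-to-choice mapping come from the config above.
from jinja2 import Template

doc = {
    "premise": "The cat sat on the mat.",
    "hypothesis": "An animal is on the mat.",
    "label": 0,  # 0 = entailment, 1 = neutral, 2 = contradiction
}

doc_to_text = Template(
    "{{premise}}\nQuestion: {{hypothesis}} True, False, or Neither?\nAnswer:"
)
doc_to_target = Template("{{['True', 'Neither', 'False'][label]}}")

print(doc_to_text.render(**doc))
print(doc_to_target.render(**doc))  # -> "True" for label 0
```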
include: anli_r1.yaml
task: anli_r2
training_split: train_r2
validation_split: dev_r2
test_split: test_r2

include: anli_r1.yaml
task: anli_r3
training_split: train_r3
validation_split: dev_r3
test_split: test_r3
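`anli_r2` and `anli_r3` reuse `anli_r1.yaml` through `include` and override only the task name and splits. A rough sketch of the resulting effective configuration, assuming a plain "included file first, overriding keys win" merge:

```python
# Sketch of the include/override pattern: the effective anli_r2 config is the
# anli_r1 base with the task name and splits replaced. Dicts are abbreviated.
anli_r1_base = {
    "task": "anli_r1",
    "dataset_path": "anli",
    "output_type": "multiple_choice",
    "training_split": "train_r1",
    "validation_split": "dev_r1",
    "test_split": "test_r1",
    # ... remaining keys as in anli_r1.yaml above ...
}

anli_r2_overrides = {
    "task": "anli_r2",
    "training_split": "train_r2",
    "validation_split": "dev_r2",
    "test_split": "test_r2",
}

effective_anli_r2 = {**anli_r1_base, **anli_r2_overrides}
print(effective_anli_r2["test_split"])  # -> "test_r2"
```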
# ARC
### Paper
Title: Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Abstract: https://arxiv.org/abs/1803.05457
The ARC dataset consists of 7,787 science exam questions drawn from a variety
of sources, including science questions provided under license by a research
partner affiliated with AI2. These are text-only, English language exam questions
that span several grade levels as indicated in the files. Each question has a
multiple choice structure (typically 4 answer options). The questions are sorted
into a Challenge Set of 2,590 “hard” questions (those that both a retrieval and
a co-occurrence method fail to answer correctly) and an Easy Set of 5,197 questions.
Homepage: https://allenai.org/data/arc
### Citation
```
@article{Clark2018ThinkYH,
title={Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge},
author={Peter Clark and Isaac Cowhey and Oren Etzioni and Tushar Khot and Ashish Sabharwal and Carissa Schoenick and Oyvind Tafjord},
journal={ArXiv},
year={2018},
volume={abs/1803.05457}
}
```
### Groups and Tasks
#### Groups
* `ai2_arc`: Evaluates `arc_easy` and `arc_challenge`
#### Tasks
* `arc_easy`
* `arc_challenge`
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
include: arc_easy.yaml
task: arc_challenge
dataset_name: ARC-Challenge

group:
  - ai2_arc
task: arc_easy
dataset_path: allenai/ai2_arc
dataset_name: ARC-Easy
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{choices.label.index(answerKey)}}"
doc_to_choice: "{{choices.text}}"
should_decontaminate: true
doc_to_decontamination_query: "Question: {{question}}\nAnswer:"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
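In the `arc_easy` config above, `doc_to_target` resolves the gold answer by finding `answerKey` in the document's `choices.label` list, and `doc_to_choice` presents the matching answer texts. A plain-Python sketch with a hand-written document shaped like the `allenai/ai2_arc` rows (the question and options are invented; only the field names mirror the config):

```python
# Sketch: derive the ARC prompt, target index, and choice list for one
# illustrative document, following the templates in the config above.
doc = {
    "question": "Which gas do plants absorb from the atmosphere?",
    "choices": {
        "text": ["Oxygen", "Carbon dioxide", "Nitrogen", "Helium"],
        "label": ["A", "B", "C", "D"],
    },
    "answerKey": "B",
}

prompt = f"Question: {doc['question']}\nAnswer:"                 # doc_to_text
target_index = doc["choices"]["label"].index(doc["answerKey"])   # doc_to_target
choices = doc["choices"]["text"]                                 # doc_to_choice

print(prompt)
print(choices[target_index])  # -> "Carbon dioxide"
```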
# Arithmetic
### Paper
Title: `Language Models are Few-Shot Learners`
Abstract: https://arxiv.org/abs/2005.14165
A small battery of 10 tests that involve asking language models a simple arithmetic
problem in natural language.
Homepage: https://github.com/openai/gpt-3/tree/master/data
### Citation
```
@inproceedings{NEURIPS2020_1457c0d6,
author = {Brown, Tom and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel and Wu, Jeffrey and Winter, Clemens and Hesse, Chris and Chen, Mark and Sigler, Eric and Litwin, Mateusz and Gray, Scott and Chess, Benjamin and Clark, Jack and Berner, Christopher and McCandlish, Sam and Radford, Alec and Sutskever, Ilya and Amodei, Dario},
booktitle = {Advances in Neural Information Processing Systems},
editor = {H. Larochelle and M. Ranzato and R. Hadsell and M. F. Balcan and H. Lin},
pages = {1877--1901},
publisher = {Curran Associates, Inc.},
title = {Language Models are Few-Shot Learners},
url = {https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf},
volume = {33},
year = {2020}
}
```
### Groups and Tasks
#### Groups
* `arithmetic`: Evaluates all ten `arithmetic_*` tasks listed below, from `1dc` to `5ds`
#### Tasks
* `arithmetic_1dc`: one-digit composite expressions (mixed operations on single-digit numbers)
* `arithmetic_2da`: two-digit addition
* `arithmetic_2dm`: two-digit multiplication
* `arithmetic_2ds`: two-digit subtraction
* `arithmetic_3da`: three-digit addition
* `arithmetic_3ds`: three-digit subtraction
* `arithmetic_4da`: four-digit addition
* `arithmetic_4ds`: four-digit subtraction
* `arithmetic_5da`: five-digit addition
* `arithmetic_5ds`: five-digit subtraction
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
group:
  - arithmetic
task: arithmetic_1dc
dataset_path: EleutherAI/arithmetic
dataset_name: arithmetic_1dc
output_type: loglikelihood
validation_split: validation
test_split: null
doc_to_text: "{{context}}"
doc_to_target: "{{completion}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
dataset_kwargs:
  trust_remote_code: true

include: arithmetic_1dc.yaml
task: arithmetic_2da
dataset_name: arithmetic_2da
dataset_kwargs:
  trust_remote_code: true

include: arithmetic_1dc.yaml
task: arithmetic_2dm
dataset_name: arithmetic_2dm
dataset_kwargs:
  trust_remote_code: true
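With `output_type: loglikelihood`, each document provides a `context` and a `completion`, and the harness scores the completion conditioned on the context. A rough illustration of that scoring with HuggingFace `transformers` (the checkpoint and the question/answer strings are placeholders, and the harness's own implementation differs in detail):

```python
# Sketch: score a (context, completion) pair by summing the log-probabilities
# of the completion tokens given the context. Model and strings are placeholders,
# and we assume context + completion tokenizes into context tokens followed by
# completion tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

context = "Question: What is 6 plus 7?\nAnswer:"  # illustrative, not a real dataset row
completion = " 13"

ctx_ids = tok(context, return_tensors="pt").input_ids
full_ids = tok(context + completion, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(full_ids).logits  # [1, seq_len, vocab]

# Log-probability assigned to each actual next token.
log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
targets = full_ids[:, 1:]
token_logps = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

# Positions that predict the completion tokens start one step before them.
completion_start = ctx_ids.shape[1] - 1
completion_logprob = token_logps[0, completion_start:].sum()
print(float(completion_logprob))
```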