afrimmlu folder update

f64b943d · Israel Abebe Azime · 64490d95
Commit f64b943d authored May 08, 2024 by Israel Abebe Azime
20 changed files
--- a/lm_eval/tasks/masakhane/README.md
+++ b/lm_eval/tasks/masakhane/README.md
-# MathQA
-
-### Paper
-
-MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms
-https://arxiv.org/pdf/1905.13319.pdf
-
-MathQA is a large-scale dataset of 37k English multiple-choice math word problems
-covering multiple math domain categories by modeling operation programs corresponding
-to word problems in the AQuA dataset (Ling et al., 2017).
-
-Homepage: https://math-qa.github.io/math-QA/
-
-
-### Citation
-
-```
-@misc{amini2019mathqa,
-    title={MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms},
-    author={Aida Amini and Saadia Gabriel and Peter Lin and Rik Koncel-Kedziorski and Yejin Choi and Hannaneh Hajishirzi},
-    year={2019},
-    eprint={1905.13319},
-    archivePrefix={arXiv},
-    primaryClass={cs.CL}
-}
-```
-
-### Groups and Tasks
-
-#### Groups
-
-* `math_word_problems`
-
-#### Tasks
-
-* `mathqa`: The MathQA dataset, as a multiple choice dataset where the answer choices are not in context.
-
-### Checklist
-
-For adding novel benchmarks/datasets to the library:
-* [x] Is the task an existing benchmark in the literature?
-  * [x] Have you referenced the original paper that introduced the task?
-  * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
-    * The MathQA dataset predates transformer-based prompted LLMs. We should, however, return to this task to ensure equivalence to the non-CoT version of mathQA used in the Chain-of-Thought paper.
-
-If other tasks on this dataset are already supported:
-* [x] Is the "Main" variant of this task clearly denoted?
-* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
-* [x] Have you noted which, if any, published evaluation setups are matched by this variant?
-  * [x] Checked for equivalence with v0.3.0 LM Evaluation Harness
--- a/lm_eval/tasks/masakhane/afrimmlu_amh.yaml
+++ b/lm_eval/tasks/masakhane/afrimmlu_amh.yaml
-dataset_name: amh
-include: afrimmlu_common_yaml
-task: afrimmlu_amh
--- a/lm_eval/tasks/masakhane/afrimmlu_common_yaml
+++ b/lm_eval/tasks/masakhane/afrimmlu_common_yaml
-group:
-  - mmlu
-  - afrimmlu
-task: null
-dataset_path: masakhane/afrimmlu
-dataset_name: null
-output_type: multiple_choice
-validation_split: validation
-test_split: test
-fewshot_split: validation
-doc_to_text: "Question: {{question}}\nAnswer:"
-doc_to_target: "{{['A', 'B', 'C', 'D'].index(answer)}}"
-doc_to_choice: !function utils.doc_to_choice
-should_decontaminate: true
-doc_to_decontamination_query: "Question: {{question}}\nAnswer:"
-metric_list:
-  - metric: f1 
-    aggregation: !function utils.weighted_f1_score 
-    # aggregation: mean
-    average: weighted 
-    hf_evaluate: true 
-    higher_is_better: True 
-    ignore_case: true
-    ignore_punctuation: true
-    regexes_to_ignore:
-      - ","
-      - "\\$"
-  - metric: acc
-    aggregation: mean
-    higher_is_better: true
-    ignore_case: true
-    ignore_punctuation: true
-    regexes_to_ignore:
-      - ","
-      - "\\$"
-metadata:
-  version: 1.0
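
Review note: the shared config above points at `utils.doc_to_choice` and `utils.weighted_f1_score`, but `utils.py` is not part of this diff. The following is a minimal sketch of what such helpers could look like, not the author's implementation. It assumes each document has an `answer` letter and a `choices` field (a list, or a string holding a Python list literal), and that the aggregation function receives `(gold, prediction)` index pairs. `doc_to_target` mirrors the Jinja expression in `doc_to_target`. The F1 is computed in plain Python so the sketch has no extra dependencies.

```python
import ast
from collections import Counter

LETTERS = ["A", "B", "C", "D"]


def doc_to_target(doc):
    # Same mapping as the Jinja template: answer letter -> 0-based index.
    return LETTERS.index(doc["answer"])


def doc_to_choice(doc):
    # Assumption: `choices` is either a list or a string such as "['x', 'y']".
    choices = doc["choices"]
    if isinstance(choices, str):
        choices = ast.literal_eval(choices)
    return list(choices)


def weighted_f1_score(items):
    # Assumption: `items` is an iterable of (gold, prediction) pairs.
    items = list(items)
    total = len(items)
    if total == 0:
        return 0.0
    support = Counter(gold for gold, _ in items)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for g, p in items if g == label and p == label)
        fp = sum(1 for g, p in items if g != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Weight each class's F1 by its share of the gold labels.
        score += f1 * n / total
    return score
```

Labels that appear only among the predictions have zero support and therefore contribute nothing to the weighted average, which is the usual convention for support-weighted F1.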
--- a/lm_eval/tasks/masakhane/afrimmlu_eng.yaml
+++ b/lm_eval/tasks/masakhane/afrimmlu_eng.yaml
-dataset_name: eng
-include: afrimmlu_common_yaml
-task: afrimmlu_eng
-
--- a/lm_eval/tasks/masakhane/afrimmlu_ewe.yaml
+++ b/lm_eval/tasks/masakhane/afrimmlu_ewe.yaml
-dataset_name: eng
-include: afrimmlu_common_yaml
-task: afrimmlu_ewe
-
--- a/lm_eval/tasks/masakhane/afrimmlu_fra.yaml
+++ b/lm_eval/tasks/masakhane/afrimmlu_fra.yaml
-dataset_name: fra
-include: afrimmlu_common_yaml
-task: afrimmlu_fra
\ No newline at end of file
--- a/lm_eval/tasks/masakhane/afrimmlu_hau.yaml
+++ b/lm_eval/tasks/masakhane/afrimmlu_hau.yaml
-dataset_name: hau
-include: afrimmlu_common_yaml
-task: afrimmlu_hau
\ No newline at end of file
--- a/lm_eval/tasks/masakhane/afrimmlu_ibo.yaml
+++ b/lm_eval/tasks/masakhane/afrimmlu_ibo.yaml
-dataset_name: ibo
-include: afrimmlu_common_yaml
-task: afrimmlu_ibo
\ No newline at end of file
--- a/lm_eval/tasks/masakhane/afrimmlu_kin.yaml
+++ b/lm_eval/tasks/masakhane/afrimmlu_kin.yaml
-dataset_name: kin
-include: afrimmlu_common_yaml
-task: afrimmlu_kin
\ No newline at end of file
--- a/lm_eval/tasks/masakhane/afrimmlu_lin.yaml
+++ b/lm_eval/tasks/masakhane/afrimmlu_lin.yaml
-dataset_name: lin
-include: afrimmlu_common_yaml
-task: afrimmlu_lin
\ No newline at end of file
--- a/lm_eval/tasks/masakhane/afrimmlu_lug.yaml
+++ b/lm_eval/tasks/masakhane/afrimmlu_lug.yaml
-dataset_name: lug
-include: afrimmlu_common_yaml
-task: afrimmlu_lug
\ No newline at end of file
--- a/lm_eval/tasks/masakhane/afrimmlu_orm.yaml
+++ b/lm_eval/tasks/masakhane/afrimmlu_orm.yaml
-dataset_name: orm
-include: afrimmlu_common_yaml
-task: afrimmlu_orm
\ No newline at end of file
--- a/lm_eval/tasks/masakhane/afrimmlu_sna.yaml
+++ b/lm_eval/tasks/masakhane/afrimmlu_sna.yaml
-dataset_name: sna
-include: afrimmlu_common_yaml
-task: afrimmlu_sna
\ No newline at end of file
--- a/lm_eval/tasks/masakhane/afrimmlu_sot.yaml
+++ b/lm_eval/tasks/masakhane/afrimmlu_sot.yaml
-dataset_name: sot
-include: afrimmlu_common_yaml
-task: afrimmlu_sot
\ No newline at end of file
--- a/lm_eval/tasks/masakhane/afrimmlu_twi.yaml
+++ b/lm_eval/tasks/masakhane/afrimmlu_twi.yaml
-dataset_name: twi
-include: afrimmlu_common_yaml
-task: afrimmlu_twi
\ No newline at end of file
--- a/lm_eval/tasks/masakhane/afrimmlu_wol.yaml
+++ b/lm_eval/tasks/masakhane/afrimmlu_wol.yaml
-dataset_name: wol
-include: afrimmlu_common_yaml
-task: afrimmlu_wol
\ No newline at end of file
--- a/lm_eval/tasks/masakhane/afrimmlu_xho.yaml
+++ b/lm_eval/tasks/masakhane/afrimmlu_xho.yaml
-dataset_name: xho
-include: afrimmlu_common_yaml
-task: afrimmlu_xho
\ No newline at end of file
--- a/lm_eval/tasks/masakhane/afrimmlu_yor.yaml
+++ b/lm_eval/tasks/masakhane/afrimmlu_yor.yaml
-dataset_name: yor
-include: afrimmlu_common_yaml
-task: afrimmlu_yor
\ No newline at end of file
--- a/lm_eval/tasks/masakhane/afrimmlu_zul.yaml
+++ b/lm_eval/tasks/masakhane/afrimmlu_zul.yaml
-dataset_name: zul
-include: afrimmlu_common_yaml
-task: afrimmlu_zul
\ No newline at end of file
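
Review note: the seventeen per-language files above differ only in the language code. As a hedged sketch, the stubs and the `--tasks` argument used in `fewshot.sh` could be derived from one list, which keeps `dataset_name` and the task suffix in sync (in this diff, `afrimmlu_ewe.yaml` carries `dataset_name: eng`, which may or may not be intentional). `render_stub` and `task_list` are hypothetical helper names, not functions from the harness.

```python
# Language codes taken from the file names in this diff.
LANGS = [
    "amh", "eng", "ewe", "fra", "hau", "ibo", "kin", "lin", "lug",
    "orm", "sna", "sot", "twi", "wol", "xho", "yor", "zul",
]


def render_stub(lang):
    # Produce the three-line YAML body used by each per-language task file.
    return (
        f"dataset_name: {lang}\n"
        "include: afrimmlu_common_yaml\n"
        f"task: afrimmlu_{lang}\n"
    )


def task_list(langs=LANGS):
    # Comma-separated task names, in the form passed to `--tasks`.
    return ",".join(f"afrimmlu_{lang}" for lang in langs)


if __name__ == "__main__":
    print(render_stub("ewe"))
    print(task_list())
```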
--- a/lm_eval/tasks/masakhane/fewshot.sh
+++ b/lm_eval/tasks/masakhane/fewshot.sh
-# NUMBER OF SHOT IS SET HERE 
-num_fewshot=0
-
-
-
-lm_eval --model hf \
-        --model_args pretrained=masakhane/African-ultrachat-alpaca  \
-        --tasks afrimmlu_amh,afrimmlu_eng,afrimmlu_ewe,afrimmlu_fra,afrimmlu_hau,afrimmlu_ibo,afrimmlu_kin,afrimmlu_lin,afrimmlu_lug,afrimmlu_orm,afrimmlu_sna,afrimmlu_sot,afrimmlu_twi,afrimmlu_wol,afrimmlu_xho,afrimmlu_yor,afrimmlu_zul   \
-        --device cuda:0     \
-        --batch_size 1 \
-        --num_fewshot $num_fewshot \
-        --verbosity DEBUG
\ No newline at end of file