Unverified commit 4bb92ebc, authored by Jess and committed by GitHub

Merge pull request #18 from JessicaOjo/africamgsm

fix exact match bug and restructure mmlu folder
parents 348e304a 5ba791e2
from sklearn.metrics import f1_score


def doc_to_choice(doc):
    # The dataset stores the answer options as a stringified list, so parse it.
    choices = eval(doc["choices"])
    return choices


def doc_to_text(doc):
    # Build the multiple-choice prompt shown to the model for a single example.
    output = """You are a highly knowledgeable and intelligent artificial intelligence
model that answers multiple-choice questions about '{subject}'
Question: '''{question}'''
Choices:
A: '''{choice1}'''
B: '''{choice2}'''
C: '''{choice3}'''
D: '''{choice4}'''
Answer: """

    choices = eval(doc["choices"])
    text = output.format(
        subject=doc["subject"],
        question=doc["question"],
        choice1=choices[0],
        choice2=choices[1],
        choice3=choices[2],
        choice4=choices[3],
    )
    return text


def weighted_f1_score(items):
    # items is an iterable of (gold, prediction) pairs; unzip and score them.
    golds, preds = zip(*items)
    fscore = f1_score(golds, preds, average="weighted")
    return fscore
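For reference, a minimal sketch of how these helpers behave on a single record. The field values below are invented for illustration only; the real dataset stores `choices` as a stringified list, which is why `eval` is applied above.

```
# Hypothetical example record; the keys match those accessed above,
# but the values are invented for illustration.
sample_doc = {
    "subject": "elementary_mathematics",
    "question": "What is 7 + 5?",
    "choices": "['10', '11', '12', '13']",  # stored as a string, hence eval()
}

print(doc_to_choice(sample_doc))  # ['10', '11', '12', '13']
print(doc_to_text(sample_doc))    # the formatted multiple-choice prompt

# weighted_f1_score expects (gold, prediction) pairs.
print(weighted_f1_score([("A", "A"), ("B", "A"), ("A", "B"), ("B", "B")]))  # 0.5
```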
# MathQA
### Paper
MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms
https://arxiv.org/pdf/1905.13319.pdf
MathQA is a large-scale dataset of 37k English multiple-choice math word problems covering multiple math domain categories. It was built by annotating word problems from the AQuA dataset (Ling et al., 2017) with operation programs that describe how each problem is solved.
Homepage: https://math-qa.github.io/math-QA/
### Citation
```
@misc{amini2019mathqa,
      title={MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms},
      author={Aida Amini and Saadia Gabriel and Peter Lin and Rik Koncel-Kedziorski and Yejin Choi and Hannaneh Hajishirzi},
      year={2019},
      eprint={1905.13319},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
### Groups and Tasks
#### Groups
* `math_word_problems`
#### Tasks
* `mathqa`: The MathQA dataset, treated as a multiple-choice task in which the answer choices are not shown in the prompt context (see the example invocation below).
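As a purely illustrative sketch (not part of the original README), the task can be run with the harness CLI in the same way as the AfriMMLU command shown further down in this diff; the model checkpoint and batch size below are placeholders:

```
lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks mathqa \
    --device cuda:0 \
    --batch_size 8
```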
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
* The MathQA dataset predates transformer-based prompted LLMs. We should, however, return to this task to ensure equivalence with the non-CoT version of MathQA used in the chain-of-thought paper.
If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?
* [x] Checked for equivalence with v0.3.0 LM Evaluation Harness
```
lm_eval --model hf \
    --model_args pretrained=masakhane/African-ultrachat-alpaca \
    --tasks afrimmlu_amh,afrimmlu_eng,afrimmlu_ewe,afrimmlu_fra,afrimmlu_hau,afrimmlu_ibo,afrimmlu_kin,afrimmlu_lin,afrimmlu_lug,afrimmlu_orm,afrimmlu_sna,afrimmlu_sot,afrimmlu_twi,afrimmlu_wol,afrimmlu_xho,afrimmlu_yor,afrimmlu_zul \
    --device cuda:0 \
    --batch_size 1 \
    --num_fewshot 0 \
    --verbosity DEBUG \
    --wandb_args project=afrimmlu
```