add mathqa

Commit 255f4c6e authored Jul 07, 2023 by haileyschoelkopf
Showing 3 changed files with 77 additions and 0 deletions

lm_eval/tasks/mathqa/README.md +44 -0
lm_eval/tasks/mathqa/mathqa.yaml +19 -0
lm_eval/tasks/mathqa/utils.py +14 -0

--- a/lm_eval/tasks/mathqa/README.md
+++ b/lm_eval/tasks/mathqa/README.md
+# MathQA
+
+### Paper
+
+MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms
+https://arxiv.org/pdf/1905.13319.pdf
+
+MathQA is a large-scale dataset of 37k English multiple-choice math word problems
+covering multiple math domain categories by modeling operation programs corresponding
+to word problems in the AQuA dataset (Ling et al., 2017).
+
+Homepage: https://math-qa.github.io/math-QA/
+
+
+### Citation
+
+```
+@misc{amini2019mathqa,
+    title={MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms},
+    author={Aida Amini and Saadia Gabriel and Peter Lin and Rik Koncel-Kedziorski and Yejin Choi and Hannaneh Hajishirzi},
+    year={2019},
+    eprint={1905.13319},
+    archivePrefix={arXiv},
+    primaryClass={cs.CL}
+}
+```
+
+### Subtasks
+
+* `mathqa`: The MathQA dataset, framed as a multiple-choice task in which the answer choices are not shown in the prompt context.
+
+### Checklist
+
+For adding novel benchmarks/datasets to the library:
+* [x] Is the task an existing benchmark in the literature?
+  * [x] Have you referenced the original paper that introduced the task?
+  * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
+    * The MathQA dataset predates transformer-based prompted LLMs. We should, however, return to this task to ensure equivalence to the non-CoT version of MathQA used in the Chain-of-Thought paper.
+
+If other tasks on this dataset are already supported:
+* [x] Is the "Main" variant of this task clearly denoted?
+* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+* [x] Have you noted which, if any, published evaluation setups are matched by this variant?
+  * [x] Checked for equivalence with v0.3.0 LM Evaluation Harness
--- a/lm_eval/tasks/mathqa/mathqa.yaml
+++ b/lm_eval/tasks/mathqa/mathqa.yaml
+group:
+  - multiple_choice
+task: mathqa
+dataset_path: math_qa
+output_type: multiple_choice
+training_split: train
+validation_split: validation
+test_split: test
+create_choices: !function utils.create_choices # create list of answer choices
+doc_to_text: "Question: {{Problem}}\nAnswer:"
+doc_to_target: !function utils.doc_to_target
+gold_alias: "{{['a', 'b', 'c', 'd', 'e'].index(correct)}}" # this will be cast to an int.
+metric_list:
+  - metric: acc
+    aggregation: mean
+    higher_is_better: true
+  - metric: acc_norm
+    aggregation: mean
+    higher_is_better: true
--- a/lm_eval/tasks/mathqa/utils.py
+++ b/lm_eval/tasks/mathqa/utils.py
+import re
+
+
+def create_choices(doc):
+    choices = [
+        c[4:].rstrip(" ,")  # drop the 4-character "a ) " label and the trailing " , " separator
+        for c in re.findall(r"[abcd] \) .*?, |e \) .*?$", doc["options"])
+    ]
+    return choices
+
+
+def doc_to_target(doc):
+    choices = create_choices(doc)
+    return choices[["a", "b", "c", "d", "e"].index(doc["correct"])]
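To sanity-check the option parsing and the gold-index lookup, the sketch below repeats `create_choices` from `utils.py` and runs it on a single document. The document is hypothetical: its field names follow the ones used in `mathqa.yaml` and `utils.py` (`Problem`, `options`, `correct`), but the values are made up, and the `options` layout (`a ) ... , b ) ... , ... , e ) ...`) is an assumption inferred from the regex rather than a row copied from the dataset.

```python
import re


def create_choices(doc):
    # Same logic as utils.create_choices in the diff above.
    return [
        c[4:].rstrip(" ,")
        for c in re.findall(r"[abcd] \) .*?, |e \) .*?$", doc["options"])
    ]


# Hypothetical document; values are illustrative, not taken from math_qa.
doc = {
    "Problem": "what is 15 percent of 80 ?",
    "options": "a ) 8 , b ) 10 , c ) 12 , d ) 15 , e ) 20",
    "correct": "c",
}

# The prompt mirrors the doc_to_text template: only the problem is in context.
prompt = f"Question: {doc['Problem']}\nAnswer:"

choices = create_choices(doc)
# Same lookup as the gold_alias template and doc_to_target.
gold = ["a", "b", "c", "d", "e"].index(doc["correct"])

print(prompt)
print(choices)             # ['8', '10', '12', '15', '20']
print(gold, choices[gold])  # 2 12
```

The prompt contains only the problem text, so each entry in `choices` is scored as a continuation of `Answer:`. This is what the README means by the answer choices not being in context.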