Commit ebe41744 authored by JessicaOjo

cleanups and readmes

parent da211969
# MGSM
### Paper

Title: `Language Models are Multilingual Chain-of-Thought Reasoners`
Abstract: https://arxiv.org/abs/2210.03057

Multilingual Grade School Math Benchmark (MGSM) is a benchmark of grade-school math problems, proposed in the paper [Language models are multilingual chain-of-thought reasoners](http://arxiv.org/abs/2210.03057).
The same 250 problems from [GSM8K](https://arxiv.org/abs/2110.14168) are each translated by human annotators into 10 languages:
- Spanish
- French
- German
- Russian
- Chinese
- Japanese
- Thai
- Swahili
- Bengali
- Telugu

GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality, linguistically diverse grade-school math word problems, created to support question answering on basic mathematical problems that require multi-step reasoning.
The inputs and targets for each of the ten languages (and English) are provided as `.tsv` files, and few-shot exemplars, also manually translated for each language, are provided in `exemplars.py`.

Homepage: https://github.com/google-research/url-nlp/tree/main/mgsm

Title: `IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models`
Paper: https://arxiv.org/pdf/2406.03368

IrokoBench is a human-translated benchmark dataset for 16 typologically diverse low-resource African languages covering three tasks: natural language inference (AfriXNLI), mathematical reasoning (AfriMGSM), and multiple-choice knowledge-based QA (AfriMMLU).
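For orientation, the sketch below loads one AfriMGSM language directly from the Hugging Face Hub. The dataset path (`masakhane/afrimgsm`), the config name (`amh`), and the `question` / `answer_number` fields are taken from the task configs further down in this commit; the `test` split name is an assumption.

```python
# Minimal sketch: inspect one AfriMGSM language.
# Dataset path and config name ("masakhane/afrimgsm", "amh") come from the
# task configs in this commit; the "test" split name is an assumption.
from datasets import load_dataset

data = load_dataset("masakhane/afrimgsm", "amh", split="test")

sample = data[0]
print(sample["question"])       # grade-school math word problem (Amharic)
print(sample["answer_number"])  # numeric gold answer used by the direct tasks
```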
### Citation
```
@misc{cobbe2021training,
  title={Training Verifiers to Solve Math Word Problems},
  author={Karl Cobbe and Vineet Kosaraju and Mohammad Bavarian and Jacob Hilton and Reiichiro Nakano and Christopher Hesse and John Schulman},
  year={2021},
  eprint={2110.14168},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
@misc{shi2022language,
  title={Language Models are Multilingual Chain-of-Thought Reasoners},
  author={Freda Shi and Mirac Suzgun and Markus Freitag and Xuezhi Wang and Suraj Srivats and Soroush Vosoughi and Hyung Won Chung and Yi Tay and Sebastian Ruder and Denny Zhou and Dipanjan Das and Jason Wei},
  year={2022},
  eprint={2210.03057},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
@misc{adelani2024irokobenchnewbenchmarkafrican,
  title={IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models},
  author={David Ifeoluwa Adelani and Jessica Ojo and Israel Abebe Azime and Jian Yun Zhuang and Jesujoba O. Alabi and Xuanli He and Millicent Ochieng and Sara Hooker and Andiswa Bukula and En-Shiun Annie Lee and Chiamaka Chukwuneke and Happy Buzaaba and Blessing Sibanda and Godson Kalipe and Jonathan Mukiibi and Salomon Kabongo and Foutse Yuehgoh and Mmasibidi Setaka and Lolwethu Ndolela and Nkiruka Odu and Rooweither Mabuya and Shamsuddeen Hassan Muhammad and Salomey Osei and Sokhar Samb and Tadesse Kebede Guge and Pontus Stenetorp},
  year={2024},
  eprint={2406.03368},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2406.03368},
}
```
#### Groups
* `mgsm_direct`: Direct question
  * `mgsm_direct_bn`: Bengali
  * `mgsm_direct_de`: German
  * `mgsm_direct_en`: English
  * `mgsm_direct_es`: Spanish
  * `mgsm_direct_fr`: French
  * `mgsm_direct_ja`: Japanese
  * `mgsm_direct_ru`: Russian
  * `mgsm_direct_sw`: Swahili
  * `mgsm_direct_te`: Telugu
  * `mgsm_direct_th`: Thai
  * `mgsm_direct_zh`: Chinese
* `mgsm_cot_native`: Question and answer cue followed by a chain-of-thought prompt in the same language as the dataset.
  * `mgsm_cot_native_bn`: Bengali
  * `mgsm_cot_native_de`: German
  * `mgsm_cot_native_en`: English
  * `mgsm_cot_native_es`: Spanish
  * `mgsm_cot_native_fr`: French
  * `mgsm_cot_native_ja`: Japanese
  * `mgsm_cot_native_ru`: Russian
  * `mgsm_cot_native_sw`: Swahili
  * `mgsm_cot_native_te`: Telugu
  * `mgsm_cot_native_th`: Thai
  * `mgsm_cot_native_zh`: Chinese
* `afrimgsm`: all AfriMGSM tasks
* `afrimgsm_direct`: evaluates model performance on the curated dataset with direct question prompting
* `afrimgsm_en_cot`: evaluates models with 5-shot English chain-of-thought exemplars
* `afrimgsm_translate`: evaluates models in the translate-test setting

Exemplar samples: https://github.com/google-research/url-nlp/blob/main/mgsm/exemplars.py

#### Tasks
* `afrimgsm_direct_{language_code}`: one task per language
* `afrimgsm_en_cot_{language_code}`: one task per language
* `afrimgsm_translate_{language_code}`: one task per language (see the invocation sketch below)
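As a usage illustration, here is a minimal sketch of running the `afrimgsm_direct` group through the harness's Python API. It assumes a recent lm-evaluation-harness (v0.4-style `simple_evaluate`); the model checkpoint and the `limit` value are placeholders, not part of this commit.

```python
# Sketch of running the afrimgsm_direct group through the harness's Python API.
# Assumes lm-evaluation-harness v0.4+ (simple_evaluate); the checkpoint name is
# a placeholder, and limit=8 is only there to keep a smoke test fast.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=google/gemma-2b",  # placeholder checkpoint
    tasks=["afrimgsm_direct"],
    limit=8,
)
print(results["results"])
```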
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?
* [x] Checked for equivalence with v0.3.0 LM Evaluation Harness
# This file will be included in the generated language-specific task configs.
# It doesn't have a yaml file extension as it is not meant to be imported directly
# by the harness.
group:
  - mgsm
  - afrimgsm
  - afrimgsm_direct
dataset_path: masakhane/afrimgsm
dataset_name: null # Overridden by language-specific config.
output_type: generate_until
# Generated by utils.py
dataset_name: amh
doc_to_target: '{% if answer is not none %}{{answer[15:]}}{% else %}{{answer_number|string}}{% endif %}'
doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"ጥያቄ: "+question+"\nAnswer:"}}{% endif %}'
generation_kwargs:
  do_sample: false
  until:
    - 'ጥያቄ:'
    - </s>
    - <|im_end|>
include: direct_native_yaml
task: afrimgsm_direct_native_amh
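To make the prompt templates concrete, the sketch below renders the Amharic `doc_to_text` and `doc_to_target` templates with jinja2, once for a document that carries a full `answer` string and once for one that only has `answer_number`. The two example documents are invented for illustration; only the templates come from the config above.

```python
# Sketch: how the afrimgsm_direct_native_amh templates resolve, depending on
# whether a document has a full `answer` string or only `answer_number`.
# The example documents below are invented; only the templates are from the config.
from jinja2 import Template

doc_to_text = Template(
    '{% if answer is not none %}{{question+"\\nAnswer:"}}'
    '{% else %}{{"ጥያቄ: "+question+"\\nAnswer:"}}{% endif %}'
)
doc_to_target = Template(
    '{% if answer is not none %}{{answer[15:]}}'
    '{% else %}{{answer_number|string}}{% endif %}'
)

with_cot = {"question": "…", "answer": "ደ" * 15 + "worked solution", "answer_number": 7}
direct_only = {"question": "…", "answer": None, "answer_number": 7}

print(doc_to_text.render(**with_cot))       # question + newline + "Answer:" (no prefix)
print(doc_to_target.render(**with_cot))     # the answer with its first 15 characters stripped
print(doc_to_text.render(**direct_only))    # "ጥያቄ: " prefix prepended to the question
print(doc_to_target.render(**direct_only))  # "7"
```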
# Generated by utils.py
dataset_name: eng
doc_to_target: '{% if answer is not none %}{{answer[21:]}}{% else %}{{answer_number|string}}{% endif %}'
doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"Question: "+question+"\nAnswer:"}}{% endif %}'
generation_kwargs:
  do_sample: false
  until:
    - 'Question:'
    - </s>
    - <|im_end|>
include: direct_native_yaml
task: afrimgsm_direct_native_eng
# Generated by utils.py
dataset_name: ewe
doc_to_target: '{% if answer is not none %}{{answer[21:]}}{% else %}{{answer_number|string}}{% endif %}'
doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"Question: "+question+"\nAnswer:"}}{% endif %}'
generation_kwargs:
  do_sample: false
  until:
    - 'Question:'
    - </s>
    - <|im_end|>
include: direct_native_yaml
task: afrimgsm_direct_native_ewe
# Generated by utils.py
dataset_name: fra
doc_to_target: '{% if answer is not none %}{{answer[21:]}}{% else %}{{answer_number|string}}{% endif %}'
doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"Question: "+question+"\nAnswer:"}}{% endif %}'
generation_kwargs:
  do_sample: false
  until:
    - 'Question:'
    - </s>
    - <|im_end|>
include: direct_native_yaml
task: afrimgsm_direct_native_fra
# Generated by utils.py
dataset_name: hau
doc_to_target: '{% if answer is not none %}{{answer[21:]}}{% else %}{{answer_number|string}}{% endif %}'
doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"Question: "+question+"\nAnswer:"}}{% endif %}'
generation_kwargs:
  do_sample: false
  until:
    - 'Question:'
    - </s>
    - <|im_end|>
include: direct_native_yaml
task: afrimgsm_direct_native_hau
# Generated by utils.py
dataset_name: ibo
doc_to_target: '{% if answer is not none %}{{answer[21:]}}{% else %}{{answer_number|string}}{% endif %}'
doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"Question: "+question+"\nAnswer:"}}{% endif %}'
generation_kwargs:
  do_sample: false
  until:
    - 'Question:'
    - </s>
    - <|im_end|>
include: direct_native_yaml
task: afrimgsm_direct_native_ibo
# Generated by utils.py
dataset_name: kin
doc_to_target: '{% if answer is not none %}{{answer[21:]}}{% else %}{{answer_number|string}}{% endif %}'
doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"Question: "+question+"\nAnswer:"}}{% endif %}'
generation_kwargs:
  do_sample: false
  until:
    - 'Question:'
    - </s>
    - <|im_end|>
include: direct_native_yaml
task: afrimgsm_direct_native_kin
# Generated by utils.py
dataset_name: lin
doc_to_target: '{% if answer is not none %}{{answer[21:]}}{% else %}{{answer_number|string}}{% endif %}'
doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"Question: "+question+"\nAnswer:"}}{% endif %}'
generation_kwargs:
  do_sample: false
  until:
    - 'Question:'
    - </s>
    - <|im_end|>
include: direct_native_yaml
task: afrimgsm_direct_native_lin
# Generated by utils.py
dataset_name: lug
doc_to_target: '{% if answer is not none %}{{answer[21:]}}{% else %}{{answer_number|string}}{% endif %}'
doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"Question: "+question+"\nAnswer:"}}{% endif %}'
generation_kwargs:
  do_sample: false
  until:
    - 'Question:'
    - </s>
    - <|im_end|>
include: direct_native_yaml
task: afrimgsm_direct_native_lug
# Generated by utils.py
dataset_name: orm
doc_to_target: '{% if answer is not none %}{{answer[21:]}}{% else %}{{answer_number|string}}{% endif %}'
doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"Question: "+question+"\nAnswer:"}}{% endif %}'
generation_kwargs:
  do_sample: false
  until:
    - 'Question:'
    - </s>
    - <|im_end|>
include: direct_native_yaml
task: afrimgsm_direct_native_orm
# Generated by utils.py
dataset_name: sna
doc_to_target: '{% if answer is not none %}{{answer[21:]}}{% else %}{{answer_number|string}}{% endif %}'
doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"Question: "+question+"\nAnswer:"}}{% endif %}'
generation_kwargs:
  do_sample: false
  until:
    - 'Question:'
    - </s>
    - <|im_end|>
include: direct_native_yaml
task: afrimgsm_direct_native_sna
# Generated by utils.py
dataset_name: sot
doc_to_target: '{% if answer is not none %}{{answer[21:]}}{% else %}{{answer_number|string}}{% endif %}'
doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"Question: "+question+"\nAnswer:"}}{% endif %}'
generation_kwargs:
  do_sample: false
  until:
    - 'Question:'
    - </s>
    - <|im_end|>
include: direct_native_yaml
task: afrimgsm_direct_native_sot
# Generated by utils.py
dataset_name: swa
doc_to_target: '{% if answer is not none %}{{answer[21:]}}{% else %}{{answer_number|string}}{% endif %}'
doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"Question: "+question+"\nAnswer:"}}{% endif %}'
generation_kwargs:
  do_sample: false
  until:
    - 'Question:'
    - </s>
    - <|im_end|>
include: direct_native_yaml
task: afrimgsm_direct_native_swa
# Generated by utils.py
dataset_name: twi
doc_to_target: '{% if answer is not none %}{{answer[21:]}}{% else %}{{answer_number|string}}{% endif %}'
doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"Question: "+question+"\nAnswer:"}}{% endif %}'
generation_kwargs:
  do_sample: false
  until:
    - 'Question:'
    - </s>
    - <|im_end|>
include: direct_native_yaml
task: afrimgsm_direct_native_twi
# Generated by utils.py
dataset_name: wol
doc_to_target: '{% if answer is not none %}{{answer[21:]}}{% else %}{{answer_number|string}}{% endif %}'
doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"Question: "+question+"\nAnswer:"}}{% endif %}'
generation_kwargs:
  do_sample: false
  until:
    - 'Question:'
    - </s>
    - <|im_end|>
include: direct_native_yaml
task: afrimgsm_direct_native_wol
# Generated by utils.py
dataset_name: xho
doc_to_target: '{% if answer is not none %}{{answer[21:]}}{% else %}{{answer_number|string}}{% endif %}'
doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"Question: "+question+"\nAnswer:"}}{% endif %}'
generation_kwargs:
  do_sample: false
  until:
    - 'Question:'
    - </s>
    - <|im_end|>
include: direct_native_yaml
task: afrimgsm_direct_native_xho
# Generated by utils.py
dataset_name: yor
doc_to_target: '{% if answer is not none %}{{answer[16:]}}{% else %}{{answer_number|string}}{% endif %}'
doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"Ìbéèrè: "+question+"\nAnswer:"}}{% endif %}'
generation_kwargs:
  do_sample: false
  until:
    - 'Ìbéèrè:'
    - </s>
    - <|im_end|>
include: direct_native_yaml
task: afrimgsm_direct_native_yor
# Generated by utils.py
dataset_name: zul
doc_to_target: '{% if answer is not none %}{{answer[21:]}}{% else %}{{answer_number|string}}{% endif %}'
doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"Question: "+question+"\nAnswer:"}}{% endif %}'
generation_kwargs:
  do_sample: false
  until:
    - 'Question:'
    - </s>
    - <|im_end|>
include: direct_native_yaml
task: afrimgsm_direct_native_zul
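The per-language files above all carry a "Generated by utils.py" header, but utils.py itself is not part of this diff. Purely as a hypothetical sketch of what such a generator might look like, the snippet below rebuilds configs of the same shape; the language table contains only what is recoverable from the generated files (the question-word prefix and the answer-slice offset), and everything else about the real utils.py is unknown.

```python
# Hypothetical sketch of a config generator in the spirit of the
# "Generated by utils.py" headers above. utils.py is not shown in this diff;
# the table below only includes entries recoverable from the generated files
# (question-word prefix and answer-slice offset).
import yaml

# lang code -> (question prefix used in the prompt and stop sequence, answer slice offset)
LANGS = {
    "amh": ("ጥያቄ", 15),
    "eng": ("Question", 21),
    "yor": ("Ìbéèrè", 16),
    # ... remaining languages would be listed here
}

def build_config(code: str, prefix: str, offset: int) -> dict:
    return {
        "dataset_name": code,
        "include": "direct_native_yaml",
        "task": f"afrimgsm_direct_native_{code}",
        "doc_to_text": (
            '{% if answer is not none %}{{question+"\\nAnswer:"}}'
            '{% else %}{{"' + prefix + ': "+question+"\\nAnswer:"}}{% endif %}'
        ),
        "doc_to_target": (
            "{% if answer is not none %}{{answer[" + str(offset) + ":]}}"
            "{% else %}{{answer_number|string}}{% endif %}"
        ),
        "generation_kwargs": {
            "do_sample": False,
            "until": [f"{prefix}:", "</s>", "<|im_end|>"],
        },
    }

if __name__ == "__main__":
    for code, (prefix, offset) in LANGS.items():
        with open(f"afrimgsm_direct_native_{code}.yaml", "w", encoding="utf-8") as f:
            f.write("# Generated by utils.py\n")
            yaml.safe_dump(build_config(code, prefix, offset), f, allow_unicode=True)
```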