Unverified commit f720ce81 authored by Jess, committed by GitHub

Merge pull request #20 from JessicaOjo/afri_mgsm

Afri mgsm modified
parents facf38ca c56593ee
# MGSM
### Paper
Title: `Language Models are Multilingual Chain-of-Thought Reasoners`
Abstract: https://arxiv.org/abs/2210.03057
Multilingual Grade School Math Benchmark (MGSM) is a benchmark of grade-school math problems, proposed in the paper [Language models are multilingual chain-of-thought reasoners](http://arxiv.org/abs/2210.03057).
The same 250 problems from [GSM8K](https://arxiv.org/abs/2110.14168) were each translated by human annotators into 10 languages. The 10 languages are:
- Spanish
- French
- German
- Russian
- Chinese
- Japanese
- Thai
- Swahili
- Bengali
- Telugu

GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality, linguistically diverse grade-school math word problems. It was created to support question answering on basic math problems that require multi-step reasoning.
The inputs and targets for each of the ten languages (and English) are provided as `.tsv` files.
We also include few-shot exemplars, manually translated into each language, in `exemplars.py`.
Homepage: https://github.com/google-research/url-nlp/tree/main/mgsm
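A quick way to inspect the `.tsv` files is a short bash loop. This is a minimal sketch that assumes each line holds a question and its numeric answer separated by a single tab, with no header row; `mgsm_read_tsv` is a hypothetical helper, so check the actual files before relying on that layout.

```bash
# Hypothetical helper: print each question/answer pair from an MGSM-style
# .tsv file. ASSUMPTION: "<question>\t<answer>" per line, no header row.
mgsm_read_tsv() {
  local file="$1" question answer
  # IFS=$'\t' splits on tabs only, so spaces inside the question are kept;
  # -r keeps backslashes literal. The || clause still processes a final
  # line that has no trailing newline.
  while IFS=$'\t' read -r question answer || [[ -n "$question" ]]; do
    printf 'Q: %s\nA: %s\n' "$question" "$answer"
  done < "$file"
}
```

For example, `mgsm_read_tsv mgsm_sw.tsv | head -n 4` would show the first two Swahili items, assuming the files follow an `mgsm_<lang>.tsv` naming scheme.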
### Citation
```
@misc{cobbe2021training,
title={Training Verifiers to Solve Math Word Problems},
author={Karl Cobbe and Vineet Kosaraju and Mohammad Bavarian and Jacob Hilton and Reiichiro Nakano and Christopher Hesse and John Schulman},
year={2021},
eprint={2110.14168},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
@misc{shi2022language,
title={Language Models are Multilingual Chain-of-Thought Reasoners},
author={Freda Shi and Mirac Suzgun and Markus Freitag and Xuezhi Wang and Suraj Srivats and Soroush Vosoughi and Hyung Won Chung and Yi Tay and Sebastian Ruder and Denny Zhou and Dipanjan Das and Jason Wei},
year={2022},
eprint={2210.03057},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
### Groups and Tasks
#### Groups
* `mgsm_direct`: Direct question, without a chain-of-thought prompt.
  * `mgsm_direct_bn`: Bengali
  * `mgsm_direct_de`: German
  * `mgsm_direct_en`: English
  * `mgsm_direct_es`: Spanish
  * `mgsm_direct_fr`: French
  * `mgsm_direct_ja`: Japanese
  * `mgsm_direct_ru`: Russian
  * `mgsm_direct_sw`: Swahili
  * `mgsm_direct_te`: Telugu
  * `mgsm_direct_th`: Thai
  * `mgsm_direct_zh`: Chinese
* `mgsm_cot_native`: Question followed by a chain-of-thought (CoT) answer prompt in the same language as the dataset.
  * `mgsm_cot_native_bn`: Bengali
  * `mgsm_cot_native_de`: German
  * `mgsm_cot_native_en`: English
  * `mgsm_cot_native_es`: Spanish
  * `mgsm_cot_native_fr`: French
  * `mgsm_cot_native_ja`: Japanese
  * `mgsm_cot_native_ru`: Russian
  * `mgsm_cot_native_sw`: Swahili
  * `mgsm_cot_native_te`: Telugu
  * `mgsm_cot_native_th`: Thai
  * `mgsm_cot_native_zh`: Chinese

Exemplar samples: https://github.com/google-research/url-nlp/blob/main/mgsm/exemplars.py
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?

If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
```bash
#!/bin/bash
# Evaluate each model on every AfriMGSM direct task at 0, 2, 4, 6 and 8 shots.
models=(
  "masakhane/African-ultrachat-alpaca"
  "masakhane/zephyr-7b-gemma-sft-african-alpaca"
  "masakhane/zephyr-7b-gemma-sft-african-ultrachat-5k"
  "google/flan-t5-xxl"
  "bigscience/mt0-xxl-mt"
  "CohereForAI/aya-101"
  "bigscience/bloomz-7b1-mt"
  "meta-llama/Llama-2-7b-chat-hf"
  "meta-llama/Meta-Llama-3-8B-Instruct"
  "meta-llama/Meta-Llama-3-70B-Instruct"
  "google/gemma-1.1-7b-it"
  "RWKV/v5-EagleX-v2-7B-HF"
  "RWKV/rwkv-6-world-7b"
)
task=afrimgsm_direct_amh,afrimgsm_direct_eng,afrimgsm_direct_ewe,afrimgsm_direct_fra,afrimgsm_direct_hau,afrimgsm_direct_ibo,afrimgsm_direct_kin,afrimgsm_direct_lin,afrimgsm_direct_lug,afrimgsm_direct_orm,afrimgsm_direct_sna,afrimgsm_direct_sot,afrimgsm_direct_swa,afrimgsm_direct_twi,afrimgsm_direct_wol,afrimgsm_direct_xho,afrimgsm_direct_yor,afrimgsm_direct_zul
for model in "${models[@]}"
do
  echo "Evaluating model: $model"
  for fewshot in 0 2 4 6 8
  do
    # One output directory per few-shot setting.
    export OUTPUT_DIR=results/$fewshot
    mkdir -p "$OUTPUT_DIR"
    lm_eval --model hf \
      --model_args "pretrained=${model}" \
      --tasks "$task" \
      --device cuda:0 \
      --batch_size 16 \
      --output_path "$OUTPUT_DIR" \
      --num_fewshot "$fewshot" \
      --verbosity DEBUG
  done
done
```
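The long `task=` line in the script is just the prefix `afrimgsm_direct_` joined with each language code. A sketch of a hypothetical helper, `build_task_list` (not part of the original script), that builds the same comma-separated `--tasks` value from an array of codes:

```bash
# Hypothetical helper: join "<prefix>_<code>" for every code with commas.
build_task_list() {
  local prefix="$1"; shift
  local IFS=,          # "${tasks[*]}" joins elements with the first char of IFS
  local tasks=() lang
  for lang in "$@"; do
    tasks+=("${prefix}_${lang}")
  done
  printf '%s' "${tasks[*]}"
}

# Same 18 language codes, in the same order, as the hard-coded task list.
langs=(amh eng ewe fra hau ibo kin lin lug orm sna sot swa twi wol xho yor zul)
task=$(build_task_list afrimgsm_direct "${langs[@]}")
```

Adding or dropping a language then means editing only `langs`. The same helper also yields the MGSM task names from the README, e.g. `build_task_list mgsm_direct bn de` gives `mgsm_direct_bn,mgsm_direct_de`.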
```diff
 # Generated by utils.py
 dataset_name: amh
-include: afrimgsm_common_yaml
+doc_to_target: '{% if answer is not none %}{{answer[21:]}}{% else %}{{answer_number|string}}{% endif %}'
+doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"Question: "+question+"\nAnswer:"}}{% endif %}'
+generation_kwargs:
+  do_sample: false
+  until:
+  - 'Question:'
+  - </s>
+  - <|im_end|>
+include: direct_yaml
 task: afrimgsm_direct_amh
```
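Both Jinja templates branch on whether a row has an `answer` field. Presumably few-shot exemplars carry one and evaluation rows carry only `answer_number`, with exemplar questions already starting with `Question: ` as in MGSM's exemplars. That is an assumption drawn from the templates themselves, not something the configs state. The sketch below mirrors the logic in bash for illustration only; lm-eval renders the real templates with Jinja, and `render_text` and `render_target` are hypothetical helpers.

```bash
# Hypothetical bash re-expression of the two Jinja templates, for
# illustration only. An empty first argument stands in for a missing
# (`none`) answer, i.e. an evaluation row; a non-empty one is a few-shot
# exemplar. (In Jinja an empty string is *not* none; the sketch ignores that.)

# doc_to_text: exemplar questions are used as-is; evaluation questions
# get a "Question: " prefix. Both end with a newline and "Answer:".
render_text() {
  local answer="$1" question="$2"
  if [[ -n "$answer" ]]; then
    printf '%s\nAnswer:' "$question"
  else
    printf 'Question: %s\nAnswer:' "$question"
  fi
}

# doc_to_target: exemplars drop their first 21 characters (answer[21:]);
# evaluation rows use answer_number converted to a string.
render_target() {
  local answer="$1" answer_number="$2"
  if [[ -n "$answer" ]]; then
    printf '%s' "${answer:21}"
  else
    printf '%s' "$answer_number"
  fi
}
```

For example, `render_target "Step-by-Step Answer: 2 + 3 = 5. The answer is 5." ""` prints `2 + 3 = 5. The answer is 5.`, because `Step-by-Step Answer: ` is exactly 21 characters long. If a language's exemplars used a prefix of a different length, the fixed slice would cut too much or too little, so the prefix is worth checking per language.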
```diff
 # Generated by utils.py
 dataset_name: eng
-include: afrimgsm_common_yaml
+doc_to_target: '{% if answer is not none %}{{answer[21:]}}{% else %}{{answer_number|string}}{% endif %}'
+doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"Question: "+question+"\nAnswer:"}}{% endif %}'
+generation_kwargs:
+  do_sample: false
+  until:
+  - 'Question:'
+  - </s>
+  - <|im_end|>
+include: direct_yaml
 task: afrimgsm_direct_eng
```
```diff
 # Generated by utils.py
 dataset_name: ewe
-include: afrimgsm_common_yaml
+doc_to_target: '{% if answer is not none %}{{answer[21:]}}{% else %}{{answer_number|string}}{% endif %}'
+doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"Question: "+question+"\nAnswer:"}}{% endif %}'
+generation_kwargs:
+  do_sample: false
+  until:
+  - 'Question:'
+  - </s>
+  - <|im_end|>
+include: direct_yaml
 task: afrimgsm_direct_ewe
```
```diff
 # Generated by utils.py
 dataset_name: fra
-include: afrimgsm_common_yaml
+doc_to_target: '{% if answer is not none %}{{answer[21:]}}{% else %}{{answer_number|string}}{% endif %}'
+doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"Question: "+question+"\nAnswer:"}}{% endif %}'
+generation_kwargs:
+  do_sample: false
+  until:
+  - 'Question:'
+  - </s>
+  - <|im_end|>
+include: direct_yaml
 task: afrimgsm_direct_fra
```
```diff
 # Generated by utils.py
 dataset_name: hau
-include: afrimgsm_common_yaml
+doc_to_target: '{% if answer is not none %}{{answer[21:]}}{% else %}{{answer_number|string}}{% endif %}'
+doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"Question: "+question+"\nAnswer:"}}{% endif %}'
+generation_kwargs:
+  do_sample: false
+  until:
+  - 'Question:'
+  - </s>
+  - <|im_end|>
+include: direct_yaml
 task: afrimgsm_direct_hau
```
```diff
 # Generated by utils.py
 dataset_name: ibo
-include: afrimgsm_common_yaml
+doc_to_target: '{% if answer is not none %}{{answer[21:]}}{% else %}{{answer_number|string}}{% endif %}'
+doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"Question: "+question+"\nAnswer:"}}{% endif %}'
+generation_kwargs:
+  do_sample: false
+  until:
+  - 'Question:'
+  - </s>
+  - <|im_end|>
+include: direct_yaml
 task: afrimgsm_direct_ibo
```
```diff
 # Generated by utils.py
 dataset_name: kin
-include: afrimgsm_common_yaml
+doc_to_target: '{% if answer is not none %}{{answer[21:]}}{% else %}{{answer_number|string}}{% endif %}'
+doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"Question: "+question+"\nAnswer:"}}{% endif %}'
+generation_kwargs:
+  do_sample: false
+  until:
+  - 'Question:'
+  - </s>
+  - <|im_end|>
+include: direct_yaml
 task: afrimgsm_direct_kin
```
```diff
 # Generated by utils.py
 dataset_name: lin
-include: afrimgsm_common_yaml
+doc_to_target: '{% if answer is not none %}{{answer[21:]}}{% else %}{{answer_number|string}}{% endif %}'
+doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"Question: "+question+"\nAnswer:"}}{% endif %}'
+generation_kwargs:
+  do_sample: false
+  until:
+  - 'Question:'
+  - </s>
+  - <|im_end|>
+include: direct_yaml
 task: afrimgsm_direct_lin
```
```diff
 # Generated by utils.py
 dataset_name: lug
-include: afrimgsm_common_yaml
+doc_to_target: '{% if answer is not none %}{{answer[21:]}}{% else %}{{answer_number|string}}{% endif %}'
+doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"Question: "+question+"\nAnswer:"}}{% endif %}'
+generation_kwargs:
+  do_sample: false
+  until:
+  - 'Question:'
+  - </s>
+  - <|im_end|>
+include: direct_yaml
 task: afrimgsm_direct_lug
```
```diff
 # Generated by utils.py
 dataset_name: orm
-include: afrimgsm_common_yaml
+doc_to_target: '{% if answer is not none %}{{answer[21:]}}{% else %}{{answer_number|string}}{% endif %}'
+doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"Question: "+question+"\nAnswer:"}}{% endif %}'
+generation_kwargs:
+  do_sample: false
+  until:
+  - 'Question:'
+  - </s>
+  - <|im_end|>
+include: direct_yaml
 task: afrimgsm_direct_orm
```
```diff
 # Generated by utils.py
 dataset_name: sna
-include: afrimgsm_common_yaml
+doc_to_target: '{% if answer is not none %}{{answer[21:]}}{% else %}{{answer_number|string}}{% endif %}'
+doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"Question: "+question+"\nAnswer:"}}{% endif %}'
+generation_kwargs:
+  do_sample: false
+  until:
+  - 'Question:'
+  - </s>
+  - <|im_end|>
+include: direct_yaml
 task: afrimgsm_direct_sna
```
```diff
 # Generated by utils.py
 dataset_name: sot
-include: afrimgsm_common_yaml
+doc_to_target: '{% if answer is not none %}{{answer[21:]}}{% else %}{{answer_number|string}}{% endif %}'
+doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"Question: "+question+"\nAnswer:"}}{% endif %}'
+generation_kwargs:
+  do_sample: false
+  until:
+  - 'Question:'
+  - </s>
+  - <|im_end|>
+include: direct_yaml
 task: afrimgsm_direct_sot
```
```diff
 # Generated by utils.py
 dataset_name: swa
-include: afrimgsm_common_yaml
+doc_to_target: '{% if answer is not none %}{{answer[21:]}}{% else %}{{answer_number|string}}{% endif %}'
+doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"Question: "+question+"\nAnswer:"}}{% endif %}'
+generation_kwargs:
+  do_sample: false
+  until:
+  - 'Question:'
+  - </s>
+  - <|im_end|>
+include: direct_yaml
 task: afrimgsm_direct_swa
```
```diff
 # Generated by utils.py
 dataset_name: twi
-include: afrimgsm_common_yaml
+doc_to_target: '{% if answer is not none %}{{answer[21:]}}{% else %}{{answer_number|string}}{% endif %}'
+doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"Question: "+question+"\nAnswer:"}}{% endif %}'
+generation_kwargs:
+  do_sample: false
+  until:
+  - 'Question:'
+  - </s>
+  - <|im_end|>
+include: direct_yaml
 task: afrimgsm_direct_twi
```
```diff
 # Generated by utils.py
 dataset_name: wol
-include: afrimgsm_common_yaml
+doc_to_target: '{% if answer is not none %}{{answer[21:]}}{% else %}{{answer_number|string}}{% endif %}'
+doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"Question: "+question+"\nAnswer:"}}{% endif %}'
+generation_kwargs:
+  do_sample: false
+  until:
+  - 'Question:'
+  - </s>
+  - <|im_end|>
+include: direct_yaml
 task: afrimgsm_direct_wol
```
```diff
 # Generated by utils.py
 dataset_name: xho
-include: afrimgsm_common_yaml
+doc_to_target: '{% if answer is not none %}{{answer[21:]}}{% else %}{{answer_number|string}}{% endif %}'
+doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"Question: "+question+"\nAnswer:"}}{% endif %}'
+generation_kwargs:
+  do_sample: false
+  until:
+  - 'Question:'
+  - </s>
+  - <|im_end|>
+include: direct_yaml
 task: afrimgsm_direct_xho
```
```diff
 # Generated by utils.py
 dataset_name: yor
-include: afrimgsm_common_yaml
+doc_to_target: '{% if answer is not none %}{{answer[21:]}}{% else %}{{answer_number|string}}{% endif %}'
+doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"Question: "+question+"\nAnswer:"}}{% endif %}'
+generation_kwargs:
+  do_sample: false
+  until:
+  - 'Question:'
+  - </s>
+  - <|im_end|>
+include: direct_yaml
 task: afrimgsm_direct_yor
```
```diff
 # Generated by utils.py
 dataset_name: zul
-include: afrimgsm_common_yaml
+doc_to_target: '{% if answer is not none %}{{answer[21:]}}{% else %}{{answer_number|string}}{% endif %}'
+doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"Question: "+question+"\nAnswer:"}}{% endif %}'
+generation_kwargs:
+  do_sample: false
+  until:
+  - 'Question:'
+  - </s>
+  - <|im_end|>
+include: direct_yaml
 task: afrimgsm_direct_zul
```
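The `until` list in these configs names the strings at which generation should stop, so that the model does not run on into an invented next question or an end-of-turn token. The net effect is roughly "keep only the text before the earliest stop string". A minimal post-processing sketch of that idea follows. `truncate_at_stops` is a hypothetical helper, and this is not how lm-eval implements stopping internally.

```bash
# Hypothetical helper: cut text at the earliest occurrence of any stop string.
# Usage: truncate_at_stops "<text>" "<stop1>" ["<stop2>" ...]
truncate_at_stops() {
  local text="$1"; shift
  local best="$text" stop cut
  for stop in "$@"; do
    # ${text%%"$stop"*} removes the longest suffix that starts with $stop,
    # i.e. cuts at its first occurrence. Quoting $stop makes it literal, so
    # characters such as | in "<|im_end|>" are not treated as a pattern.
    cut="${text%%"$stop"*}"
    # Keep the shortest prefix: that is the cut at the earliest stop string.
    if (( ${#cut} < ${#best} )); then
      best="$cut"
    fi
  done
  printf '%s' "$best"
}
```

Each stop string is searched in the original text and the shortest surviving prefix wins, so the result does not depend on the order in which the stop strings are listed.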