AfroBench: How Good are Large Language Models on African Languages? (#2825)

* add afrixnli to task
* add chat completion
* remove chat completion -untested
* afrimmlu added
* afrimmlu folder update
* afrimmlu folder update
* updated prompt
* remove print
* add afrimgsm -direct
* add squad metric
* fix bash script
* remove direct util, update common yaml
* remove print
* add few show. metric fixes
* fix direct path, add bash script for gpt models
* added transate test
* update afrixnli tasks
* update afrixnli tasks
* update metrics for afrixnli
* prompt translations fix
* prompt translations fix
* filter and metric fix -mgsm
* remove squad metric
* remove squad metric
* add f1 score to mgsm
* add f1 score to mgsm
* update native-direct with lin
* change f1 function
* add lin to utils
* add utils
* remove test limit
* remove test configs
* add swahili to mmlu
* change eng to ewe in ewe yaml mmlu
* add squad metric to mgsm, remove whitespace filter
* added translate test
* added afrixnli_translate
* fix exact match valueError
* fix exact match valueError
* restructure mmlu folder
* spacing
* remove afrimmlu_translate folder
* add utility
* format task name, clean ups
* modefied mgsm
* update on afrimgsm
* update on afrimgsm
* removed utils
* other mgsm varieties
* other mgsm varieties
* adding trasnslate direct
* Update translate_direct_yaml
* add manual xnli prompt, add multichoice for openai models, and adapt multichoice metric for openai model
* edit for open models
* Update translate_direct_yaml
* add verbalizer for xnli
* change xnli from multiple choice to generate
* add manual accuracy scores
* revert xnli to multiple choice
* change afrimgsm utils
* revert xnli to multiple_choice
* cleanups and readmes
* remove openai fixes and unused regex
* pr review changes
* revert metrics.py, task.py and extraction.py to main version
* add afrisenti
* utilities
* pulled from main
* add afrixnli
* add afrimmlu
* update afrixnli prompts
* mising senti language
* fix afrisenti prompt 2
* fix afrisenti prompts
* fix afrisenti prompts
* configure task grouping
* add multiple prompts to afrixnli for irokobench
* add multiple prompts to afrimmlu for irokobench
* Update afrixnli_yaml
* fixes and moves
* fixes and moves
* afrimmlu multiple prompts configs
* remove validation set from afrimmlu
* remove eng from afrimmlu translate test
* correct dataset path
* multiple prompts for mgsm
* file restructure
* afribench grouping
* repo restructuring
* repo restructuring
* update exact match to hugging face exact match and add new mgsm language
* remove decontamination
* update generation kwargs
* update generation kwargs for all mgsm prompts
* remove lang
* update generation kwargs for afrimgsm translatetest
* add afrimgsm cot for direct and translate
* remove eng from translate-cot
* add masakhaPOS tasks
* remove changes from task script
* add masakhanews tasks
* add uhura arc easy
* add afriqa and belebele files
* add tags for easier run. add naija rc
* add new metrics and transformation scripts
* fix afriqa swa fewshot split
* add naijarc
* add afrobench lite tasks
* update afrobench
* update afrobench
* remove unverified files to avoid bugs
* remove files not needed
* add afrobench tasks
* add afrobench tasks
* change to version 1
* change to version 1
* update afrobench
* update afrobench
* restore metric to original script
* update readme instructions
* add individual dataset readmes
* add link to collections
* correct run script
* align with main
* align with main
* align with main
* align with main
* align with main
* align with main
* align with main
* align with main
* failed run fixes
* failed run fixes
* add afrimgsm cot
* Apply precommit fixes
* update mafand dataset name
* pull request fixes
* remove afrihate due to availability

---------

Co-authored-by: Israel Abebe Azime <azime@cg.uni-saarland.de>
Co-authored-by: Israel Abebe Azime <se.israel.abebe@gmail.com>
Co-authored-by: David Adelani <davlanade@gmail.com>
Co-authored-by: theyorubayesian <akin.o.oladipo@gmail.com>

Commit 18297993 (unverified), authored May 15, 2025 by Jess; committed by GitHub May 15, 2025
20 changed files
--- a/lm_eval/tasks/afrobench/adr/prompt_3/afridiacritics_bbj.yaml
+++ b/lm_eval/tasks/afrobench/adr/prompt_3/afridiacritics_bbj.yaml
+# Generated by utils.py
+dataset_name: bbj
+doc_to_text: 'This text is in Ghomala. Restore all diacritical marks to their proper
+  places in the following sentence: {{text}}. Return output sentence only'
+include: afridiacritics_yaml
+task: afridiacritics_bbj_prompt_3
--- a/lm_eval/tasks/afrobench/adr/prompt_3/afridiacritics_fon.yaml
+++ b/lm_eval/tasks/afrobench/adr/prompt_3/afridiacritics_fon.yaml
+# Generated by utils.py
+dataset_name: fon
+doc_to_text: 'This text is in Fon. Restore all diacritical marks to their proper places
+  in the following sentence: {{text}}. Return output sentence only'
+include: afridiacritics_yaml
+task: afridiacritics_fon_prompt_3
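Review note: the per-language files differ only in `dataset_name`, `task`, and the language name inside `doc_to_text`. The harness fills the `{{text}}` placeholder from each dataset row using Jinja2. The sketch below imitates that substitution with the standard library only, so `render` and `example_doc` are illustrative stand-ins and not harness code; it handles plain field names and nothing else Jinja2 supports.

```python
import re

# Prompt text copied from afridiacritics_fon.yaml (prompt_3), with the YAML
# line fold already collapsed to a single space.
DOC_TO_TEXT = (
    "This text is in Fon. Restore all diacritical marks to their proper places "
    "in the following sentence: {{text}}. Return output sentence only"
)


def render(template: str, doc: dict) -> str:
    """Replace each {{field}} placeholder with the matching value from doc.

    Stand-in for Jinja2 rendering; supports bare field names only.
    """
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: str(doc[m.group(1)]), template)


# Hypothetical row; real rows come from masakhane/diacritics-restoration.
example_doc = {"text": "SAMPLE INPUT", "target": "SAMPLE TARGET"}
prompt = render(DOC_TO_TEXT, example_doc)
print(prompt)
```

Only the `text` field reaches the prompt; `target` is consumed separately through `doc_to_target`.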
--- a/lm_eval/tasks/afrobench/adr/prompt_3/afridiacritics_ibo.yaml
+++ b/lm_eval/tasks/afrobench/adr/prompt_3/afridiacritics_ibo.yaml
+# Generated by utils.py
+dataset_name: ibo
+doc_to_text: 'This text is in Igbo. Restore all diacritical marks to their proper
+  places in the following sentence: {{text}}. Return output sentence only'
+include: afridiacritics_yaml
+task: afridiacritics_ibo_prompt_3
--- a/lm_eval/tasks/afrobench/adr/prompt_3/afridiacritics_wol.yaml
+++ b/lm_eval/tasks/afrobench/adr/prompt_3/afridiacritics_wol.yaml
+# Generated by utils.py
+dataset_name: wol
+doc_to_text: 'This text is in Wolof. Restore all diacritical marks to their proper
+  places in the following sentence: {{text}}. Return output sentence only'
+include: afridiacritics_yaml
+task: afridiacritics_wol_prompt_3
--- a/lm_eval/tasks/afrobench/adr/prompt_3/afridiacritics_yaml
+++ b/lm_eval/tasks/afrobench/adr/prompt_3/afridiacritics_yaml
+tag:
+- adr_tasks
+- adr_prompt_3
+dataset_path: masakhane/diacritics-restoration
+dataset_kwargs: {trust_remote_code: True}
+doc_to_target: target
+output_type: generate_until
+fewshot_split: dev
+test_split: test
+training_split: train
+metric_list:
+  - metric: bleu
+    aggregation: bleu
+    higher_is_better: true
+  - metric: chrf
+    aggregation: chrf
+    higher_is_better: true
+generation_kwargs:
+  do_sample: false
+  until:
+  - '<eos>'
+  - </s>
+  - <|im_end|>
+metadata:
+  version: 1.0
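Review note: each language file pulls this shared base in through `include: afridiacritics_yaml`. A minimal sketch of the assumed behavior follows, in which the base is loaded first and the including file's keys are overlaid on top. `resolve_include` is a hypothetical helper written for illustration, and the harness's actual loader may differ in detail.

```python
# Keys from the shared prompt_3 base file (afridiacritics_yaml), abridged.
base_config = {
    "tag": ["adr_tasks", "adr_prompt_3"],
    "dataset_path": "masakhane/diacritics-restoration",
    "doc_to_target": "target",
    "output_type": "generate_until",
    "test_split": "test",
}

# Keys from one per-language file (afridiacritics_fon.yaml), prompt omitted.
language_config = {
    "dataset_name": "fon",
    "include": "afridiacritics_yaml",
    "task": "afridiacritics_fon_prompt_3",
}


def resolve_include(child: dict, base: dict) -> dict:
    """Overlay the child's keys on the base; the child wins on conflicts."""
    merged = dict(base)
    merged.update({k: v for k, v in child.items() if k != "include"})
    return merged


task_config = resolve_include(language_config, base_config)
print(task_config["task"], task_config["output_type"])
```

Under this assumption, every language task inherits the same splits, metrics, and generation settings, and only the dataset subset, task name, and prompt vary.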
--- a/lm_eval/tasks/afrobench/adr/prompt_3/afridiacritics_yor.yaml
+++ b/lm_eval/tasks/afrobench/adr/prompt_3/afridiacritics_yor.yaml
+# Generated by utils.py
+dataset_name: yor
+doc_to_text: 'This text is in Yoruba. Restore all diacritical marks to their proper
+  places in the following sentence: {{text}}. Return output sentence only'
+include: afridiacritics_yaml
+task: afridiacritics_yor_prompt_3
--- a/lm_eval/tasks/afrobench/adr/prompt_4/afridiacritics_bbj.yaml
+++ b/lm_eval/tasks/afrobench/adr/prompt_4/afridiacritics_bbj.yaml
+# Generated by utils.py
+dataset_name: bbj
+doc_to_text: 'You are a linguist specializing in diacritical marks for Ghomala. Add
+  the appropriate diacritics to this Ghomala sentence: {{text}}. Return output sentence
+  only'
+include: afridiacritics_yaml
+task: afridiacritics_bbj_prompt_4
--- a/lm_eval/tasks/afrobench/adr/prompt_4/afridiacritics_fon.yaml
+++ b/lm_eval/tasks/afrobench/adr/prompt_4/afridiacritics_fon.yaml
+# Generated by utils.py
+dataset_name: fon
+doc_to_text: 'You are a linguist specializing in diacritical marks for Fon. Add the
+  appropriate diacritics to this Fon sentence: {{text}}. Return output sentence only'
+include: afridiacritics_yaml
+task: afridiacritics_fon_prompt_4
--- a/lm_eval/tasks/afrobench/adr/prompt_4/afridiacritics_ibo.yaml
+++ b/lm_eval/tasks/afrobench/adr/prompt_4/afridiacritics_ibo.yaml
+# Generated by utils.py
+dataset_name: ibo
+doc_to_text: 'You are a linguist specializing in diacritical marks for Igbo. Add the
+  appropriate diacritics to this Igbo sentence: {{text}}. Return output sentence only'
+include: afridiacritics_yaml
+task: afridiacritics_ibo_prompt_4
--- a/lm_eval/tasks/afrobench/adr/prompt_4/afridiacritics_wol.yaml
+++ b/lm_eval/tasks/afrobench/adr/prompt_4/afridiacritics_wol.yaml
+# Generated by utils.py
+dataset_name: wol
+doc_to_text: 'You are a linguist specializing in diacritical marks for Wolof. Add
+  the appropriate diacritics to this Wolof sentence: {{text}}. Return output sentence
+  only'
+include: afridiacritics_yaml
+task: afridiacritics_wol_prompt_4
--- a/lm_eval/tasks/afrobench/adr/prompt_4/afridiacritics_yaml
+++ b/lm_eval/tasks/afrobench/adr/prompt_4/afridiacritics_yaml
+tag:
+- adr_tasks
+- adr_prompt_4
+dataset_path: masakhane/diacritics-restoration
+dataset_kwargs: {trust_remote_code: True}
+doc_to_target: target
+output_type: generate_until
+fewshot_split: dev
+test_split: test
+training_split: train
+metric_list:
+  - metric: bleu
+    aggregation: bleu
+    higher_is_better: true
+  - metric: chrf
+    aggregation: chrf
+    higher_is_better: true
+generation_kwargs:
+  do_sample: false
+  until:
+  - '<eos>'
+  - </s>
+  - <|im_end|>
+metadata:
+  version: 1.0
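Review note: `generation_kwargs.until` lists the strings at which generation should stop, so that end-of-sequence markers from different model families do not leak into the scored output. The sketch below shows the general idea of cutting at the earliest stop string; `truncate_at_stop` and the sample output are hypothetical and do not reproduce the harness's own stopping logic.

```python
STOP_SEQUENCES = ["<eos>", "</s>", "<|im_end|>"]  # values from generation_kwargs.until


def truncate_at_stop(generated: str, stops: list[str]) -> str:
    """Cut the text at the earliest occurrence of any stop sequence."""
    cut = len(generated)
    for stop in stops:
        idx = generated.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return generated[:cut]


# Hypothetical raw model output, for illustration only.
raw = "restored sentence</s>extra tokens<|im_end|>"
print(truncate_at_stop(raw, STOP_SEQUENCES))
```

With `do_sample: false`, decoding is greedy, so the truncated output is deterministic for a given model and prompt.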
--- a/lm_eval/tasks/afrobench/adr/prompt_4/afridiacritics_yor.yaml
+++ b/lm_eval/tasks/afrobench/adr/prompt_4/afridiacritics_yor.yaml
+# Generated by utils.py
+dataset_name: yor
+doc_to_text: 'You are a linguist specializing in diacritical marks for Yoruba. Add
+  the appropriate diacritics to this Yoruba sentence: {{text}}. Return output sentence
+  only'
+include: afridiacritics_yaml
+task: afridiacritics_yor_prompt_4
--- a/lm_eval/tasks/afrobench/adr/prompt_5/afridiacritics_bbj.yaml
+++ b/lm_eval/tasks/afrobench/adr/prompt_5/afridiacritics_bbj.yaml
+# Generated by utils.py
+dataset_name: bbj
+doc_to_text: 'You are a linguist specializing in diacritical marks for Ghomala. Diacritics
+  are essential for proper pronunciation and meaning in Ghomala. You are tasked with
+  converting Ghomala sentences without diacritics into their correctly accented forms.
+  Here''s the input: {{text}}. Return output sentence only'
+include: afridiacritics_yaml
+task: afridiacritics_bbj_prompt_5
--- a/lm_eval/tasks/afrobench/adr/prompt_5/afridiacritics_fon.yaml
+++ b/lm_eval/tasks/afrobench/adr/prompt_5/afridiacritics_fon.yaml
+# Generated by utils.py
+dataset_name: fon
+doc_to_text: 'You are a linguist specializing in diacritical marks for Fon. Diacritics
+  are essential for proper pronunciation and meaning in Fon. You are tasked with converting
+  Fon sentences without diacritics into their correctly accented forms. Here''s the
+  input: {{text}}. Return output sentence only'
+include: afridiacritics_yaml
+task: afridiacritics_fon_prompt_5
--- a/lm_eval/tasks/afrobench/adr/prompt_5/afridiacritics_ibo.yaml
+++ b/lm_eval/tasks/afrobench/adr/prompt_5/afridiacritics_ibo.yaml
+# Generated by utils.py
+dataset_name: ibo
+doc_to_text: 'You are a linguist specializing in diacritical marks for Igbo. Diacritics
+  are essential for proper pronunciation and meaning in Igbo. You are tasked with
+  converting Igbo sentences without diacritics into their correctly accented forms.
+  Here''s the input: {{text}}. Return output sentence only'
+include: afridiacritics_yaml
+task: afridiacritics_ibo_prompt_5
--- a/lm_eval/tasks/afrobench/adr/prompt_5/afridiacritics_wol.yaml
+++ b/lm_eval/tasks/afrobench/adr/prompt_5/afridiacritics_wol.yaml
+# Generated by utils.py
+dataset_name: wol
+doc_to_text: 'You are a linguist specializing in diacritical marks for Wolof. Diacritics
+  are essential for proper pronunciation and meaning in Wolof. You are tasked with
+  converting Wolof sentences without diacritics into their correctly accented forms.
+  Here''s the input: {{text}}. Return output sentence only'
+include: afridiacritics_yaml
+task: afridiacritics_wol_prompt_5
--- a/lm_eval/tasks/afrobench/adr/prompt_5/afridiacritics_yaml
+++ b/lm_eval/tasks/afrobench/adr/prompt_5/afridiacritics_yaml
+tag:
+- adr_tasks
+- adr_prompt_5
+dataset_path: masakhane/diacritics-restoration
+dataset_kwargs: {trust_remote_code: True}
+doc_to_target: target
+output_type: generate_until
+fewshot_split: dev
+test_split: test
+training_split: train
+metric_list:
+  - metric: bleu
+    aggregation: bleu
+    higher_is_better: true
+  - metric: chrf
+    aggregation: chrf
+    higher_is_better: true
+generation_kwargs:
+  do_sample: false
+  until:
+  - '<eos>'
+  - </s>
+  - <|im_end|>
+metadata:
+  version: 1.0
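Review note: the `metric_list` scores outputs with `bleu` and `chrf`. A character-level score is a natural fit here, since a missed diacritic changes one character while leaving most of the sentence intact. The sketch below is a deliberately simplified character n-gram F-score meant to show that intuition. It is not the sacreBLEU chrF implementation or the harness's metric code, and `simple_chrf` and its defaults are assumptions chosen for illustration.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified character n-gram F-score in [0, 1]."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # string too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)


# Illustrative pair: the hypothesis misses one diacritic.
reference = "Báwo ni"
hypothesis = "Bawo ni"
print(simple_chrf(reference, reference))
print(simple_chrf(hypothesis, reference))
```

An exact restoration scores 1.0, and a single missed mark lowers the score without dropping it to zero, which is the graded behavior an exact-match metric cannot provide.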
--- a/lm_eval/tasks/afrobench/adr/prompt_5/afridiacritics_yor.yaml
+++ b/lm_eval/tasks/afrobench/adr/prompt_5/afridiacritics_yor.yaml
+# Generated by utils.py
+dataset_name: yor
+doc_to_text: 'You are a linguist specializing in diacritical marks for Yoruba. Diacritics
+  are essential for proper pronunciation and meaning in Yoruba. You are tasked with
+  converting Yoruba sentences without diacritics into their correctly accented forms.
+  Here''s the input: {{text}}. Return output sentence only'
+include: afridiacritics_yaml
+task: afridiacritics_yor_prompt_5
--- a/lm_eval/tasks/afrobench/afriqa/README.md
+++ b/lm_eval/tasks/afrobench/afriqa/README.md
+# AfriQA
+
+## Paper
+Title: `AfriQA: Cross-lingual Open-Retrieval Question Answering for African Languages`
+
+Paper Link: https://arxiv.org/abs/2305.06897
+
+## Abstract
+>AfriQA is the first cross-lingual question answering (QA) dataset with a focus on African languages. The dataset includes over 12,000 XOR QA examples across 10 African languages, making it an invaluable resource for developing more equitable QA technology. African languages have historically been underserved in the digital landscape, with far less in-language content available online. This makes it difficult for QA systems to provide accurate information to users in their native language. However, cross-lingual open-retrieval question answering (XOR QA) systems can help fill this gap by retrieving answer content from other languages. AfriQA focuses specifically on African languages where cross-lingual answer content is the only high-coverage source of information. Previous datasets have primarily focused on languages where cross-lingual QA augments coverage from the target language, but AfriQA highlights the importance of African languages as a realistic use case for XOR QA.
+
+HomePage: https://github.com/masakhane-io/afriqa
+
+### Citation
+
+```
+@misc{ogundepo2023afriqa,
+      title={AfriQA: Cross-lingual Open-Retrieval Question Answering for African Languages},
+      author={Odunayo Ogundepo and Tajuddeen R. Gwadabe and Clara E. Rivera and Jonathan H. Clark and Sebastian Ruder and David Ifeoluwa Adelani and Bonaventure F. P. Dossou and Abdou Aziz DIOP and Claytone Sikasote and Gilles Hacheme and Happy Buzaaba and Ignatius Ezeani and Rooweither Mabuya and Salomey Osei and Chris Emezue and Albert Njoroge Kahira and Shamsuddeen H. Muhammad and Akintunde Oladipo and Abraham Toluwase Owodunni and Atnafu Lambebo Tonja and Iyanuoluwa Shode and Akari Asai and Tunde Oluwaseyi Ajayi and Clemencia Siro and Steven Arthur and Mofetoluwa Adeyemi and Orevaoghene Ahia and Aremu Anuoluwapo and Oyinkansola Awosan and Chiamaka Chukwuneke and Bernard Opoku and Awokoya Ayodele and Verrah Otiende and Christine Mwase and Boyd Sinkala and Andre Niyongabo Rubungo and Daniel A. Ajisafe and Emeka Felix Onwuegbuzia and Habib Mbow and Emile Niyomutabazi and Eunice Mukonde and Falalu Ibrahim Lawan and Ibrahim Said Ahmad and Jesujoba O. Alabi and Martin Namukombo and Mbonu Chinedu and Mofya Phiri and Neo Putini and Ndumiso Mngoma and Priscilla A. Amuok and Ruqayya Nasir Iro and Sonia Adhiambo},
+      year={2023},
+      eprint={2305.06897},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL}
+}
+```
--- a/lm_eval/tasks/afrobench/afriqa/afriqa.yaml
+++ b/lm_eval/tasks/afrobench/afriqa/afriqa.yaml
+group: afriqa
+task:
+  - afriqa_prompt_1
+  - afriqa_prompt_2
+  - afriqa_prompt_3
+  - afriqa_prompt_4
+  - afriqa_prompt_5
+aggregate_metric_list:
+  - metric: acc
+    aggregation: mean
+    weight_by_size: true
+metadata:
+  version: 1
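Review note: `aggregation: mean` with `weight_by_size: true` asks for a size-weighted (micro) average over the five prompt subtasks, not a plain (macro) average. The sketch below shows the arithmetic under that reading; `aggregate_mean` and the numbers are hypothetical and are not taken from the harness or from any AfriQA run.

```python
def aggregate_mean(scores: list[float], sizes: list[int], weight_by_size: bool = True) -> float:
    """Mean of subtask scores, optionally weighted by subtask sample counts."""
    if not weight_by_size:
        return sum(scores) / len(scores)
    total = sum(sizes)
    return sum(score * size for score, size in zip(scores, sizes)) / total


# Hypothetical per-subtask scores and sample counts, for illustration only.
scores = [0.50, 0.25]
sizes = [100, 300]
print(aggregate_mean(scores, sizes))                        # size-weighted mean
print(aggregate_mean(scores, sizes, weight_by_size=False))  # plain mean
```

When all subtasks have the same number of samples, the two averages coincide, so the flag only matters for groups whose members differ in size.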