AfroBench: How Good are Large Language Models on African Languages? (#2825)

* add afrixnli to task
* add chat completion
* remove chat completion -untested
* afrimmlu added
* afrimmlu folder update
* afrimmlu folder update
* updated prompt
* remove print
* add afrimgsm -direct
* add squad metric
* fix bash script
* remove direct util, update common yaml
* remove print
* add few shot. metric fixes
* fix direct path, add bash script for gpt models
* added translate test
* update afrixnli tasks
* update afrixnli tasks
* update metrics for afrixnli
* prompt translations fix
* prompt translations fix
* filter and metric fix -mgsm
* remove squad metric
* remove squad metric
* add f1 score to mgsm
* add f1 score to mgsm
* update native-direct with lin
* change f1 function
* add lin to utils
* add utils
* remove test limit
* remove test configs
* add swahili to mmlu
* change eng to ewe in ewe yaml mmlu
* add squad metric to mgsm, remove whitespace filter
* added translate test
* added afrixnli_translate
* fix exact match valueError
* fix exact match valueError
* restructure mmlu folder
* spacing
* remove afrimmlu_translate folder
* add utility
* format task name, clean ups
* modified mgsm
* update on afrimgsm
* update on afrimgsm
* removed utils
* other mgsm varieties
* other mgsm varieties
* adding translate direct
* Update translate_direct_yaml
* add manual xnli prompt, add multichoice for openai models, and adapt multichoice metric for openai model
* edit for open models
* Update translate_direct_yaml
* add verbalizer for xnli
* change xnli from multiple choice to generate
* add manual accuracy scores
* revert xnli to multiple choice
* change afrimgsm utils
* revert xnli to multiple_choice
* cleanups and readmes
* remove openai fixes and unused regex
* pr review changes
* revert metrics.py, task.py and extraction.py to main version
* add afrisenti
* utilities
* pulled from main
* add afrixnli
* add afrimmlu
* update afrixnli prompts
* missing senti language
* fix afrisenti prompt 2
* fix afrisenti prompts
* fix afrisenti prompts
* configure task grouping
* add multiple prompts to afrixnli for irokobench
* add multiple prompts to afrimmlu for irokobench
* Update afrixnli_yaml
* fixes and moves
* fixes and moves
* afrimmlu multiple prompts configs
* remove validation set from afrimmlu
* remove eng from afrimmlu translate test
* correct dataset path
* multiple prompts for mgsm
* file restructure
* afribench grouping
* repo restructuring
* repo restructuring
* update exact match to hugging face exact match and add new mgsm language
* remove decontamination
* update generation kwargs
* update generation kwargs for all mgsm prompts
* remove lang
* update generation kwargs for afrimgsm translatetest
* add afrimgsm cot for direct and translate
* remove eng from translate-cot
* add masakhaPOS tasks
* remove changes from task script
* add masakhanews tasks
* add uhura arc easy
* add afriqa and belebele files
* add tags for easier run. add naija rc
* add new metrics and transformation scripts
* fix afriqa swa fewshot split
* add naijarc
* add afrobench lite tasks
* update afrobench
* update afrobench
* remove unverified files to avoid bugs
* remove files not needed
* add afrobench tasks
* add afrobench tasks
* change to version 1
* change to version 1
* update afrobench
* update afrobench
* restore metric to original script
* update readme instructions
* add individual dataset readmes
* add link to collections
* correct run script
* align with main
* align with main
* align with main
* align with main
* align with main
* align with main
* align with main
* align with main
* failed run fixes
* failed run fixes
* add afrimgsm cot
* Apply precommit fixes
* update mafand dataset name
* pull request fixes
* remove afrihate due to availability

---------

Co-authored-by: Israel Abebe Azime <azime@cg.uni-saarland.de>
Co-authored-by: Israel Abebe Azime <se.israel.abebe@gmail.com>
Co-authored-by: David Adelani <davlanade@gmail.com>
Co-authored-by: theyorubayesian <akin.o.oladipo@gmail.com>

Commit 18297993 (unverified), authored May 15, 2025 by Jess; committed by GitHub May 15, 2025
20 changed files
--- a/lm_eval/tasks/afrobench/afrisenti/prompt_1/afrisenti_amh.yaml
+++ b/lm_eval/tasks/afrobench/afrisenti/prompt_1/afrisenti_amh.yaml
+# Generated by utils.py
+dataset_name: amh
+include: afrisenti
+task: afrisenti_amh_prompt_1
--- a/lm_eval/tasks/afrobench/afrisenti/prompt_1/afrisenti_arq.yaml
+++ b/lm_eval/tasks/afrobench/afrisenti/prompt_1/afrisenti_arq.yaml
+# Generated by utils.py
+dataset_name: arq
+include: afrisenti
+task: afrisenti_arq_prompt_1
--- a/lm_eval/tasks/afrobench/afrisenti/prompt_1/afrisenti_ary.yaml
+++ b/lm_eval/tasks/afrobench/afrisenti/prompt_1/afrisenti_ary.yaml
+# Generated by utils.py
+dataset_name: ary
+include: afrisenti
+task: afrisenti_ary_prompt_1
--- a/lm_eval/tasks/afrobench/afrisenti/prompt_1/afrisenti_hau.yaml
+++ b/lm_eval/tasks/afrobench/afrisenti/prompt_1/afrisenti_hau.yaml
+# Generated by utils.py
+dataset_name: hau
+include: afrisenti
+task: afrisenti_hau_prompt_1
--- a/lm_eval/tasks/afrobench/afrisenti/prompt_1/afrisenti_ibo.yaml
+++ b/lm_eval/tasks/afrobench/afrisenti/prompt_1/afrisenti_ibo.yaml
+# Generated by utils.py
+dataset_name: ibo
+include: afrisenti
+task: afrisenti_ibo_prompt_1
--- a/lm_eval/tasks/afrobench/afrisenti/prompt_1/afrisenti_kin.yaml
+++ b/lm_eval/tasks/afrobench/afrisenti/prompt_1/afrisenti_kin.yaml
+# Generated by utils.py
+dataset_name: kin
+include: afrisenti
+task: afrisenti_kin_prompt_1
--- a/lm_eval/tasks/afrobench/afrisenti/prompt_1/afrisenti_orm.yaml
+++ b/lm_eval/tasks/afrobench/afrisenti/prompt_1/afrisenti_orm.yaml
+# Generated by utils.py
+dataset_name: orm
+include: afrisenti
+task: afrisenti_orm_prompt_1
--- a/lm_eval/tasks/afrobench/afrisenti/prompt_1/afrisenti_pcm.yaml
+++ b/lm_eval/tasks/afrobench/afrisenti/prompt_1/afrisenti_pcm.yaml
+# Generated by utils.py
+dataset_name: pcm
+include: afrisenti
+task: afrisenti_pcm_prompt_1
--- a/lm_eval/tasks/afrobench/afrisenti/prompt_1/afrisenti_por.yaml
+++ b/lm_eval/tasks/afrobench/afrisenti/prompt_1/afrisenti_por.yaml
+# Generated by utils.py
+dataset_name: por
+include: afrisenti
+task: afrisenti_por_prompt_1
--- a/lm_eval/tasks/afrobench/afrisenti/prompt_1/afrisenti_swa.yaml
+++ b/lm_eval/tasks/afrobench/afrisenti/prompt_1/afrisenti_swa.yaml
+# Generated by utils.py
+dataset_name: swa
+include: afrisenti
+task: afrisenti_swa_prompt_1
--- a/lm_eval/tasks/afrobench/afrisenti/prompt_1/afrisenti_tir.yaml
+++ b/lm_eval/tasks/afrobench/afrisenti/prompt_1/afrisenti_tir.yaml
+# Generated by utils.py
+dataset_name: tir
+include: afrisenti
+task: afrisenti_tir_prompt_1
--- a/lm_eval/tasks/afrobench/afrisenti/prompt_1/afrisenti_tso.yaml
+++ b/lm_eval/tasks/afrobench/afrisenti/prompt_1/afrisenti_tso.yaml
+# Generated by utils.py
+dataset_name: tso
+include: afrisenti
+task: afrisenti_tso_prompt_1
--- a/lm_eval/tasks/afrobench/afrisenti/prompt_1/afrisenti_twi.yaml
+++ b/lm_eval/tasks/afrobench/afrisenti/prompt_1/afrisenti_twi.yaml
+# Generated by utils.py
+dataset_name: twi
+include: afrisenti
+task: afrisenti_twi_prompt_1
--- a/lm_eval/tasks/afrobench/afrisenti/prompt_1/afrisenti_yor.yaml
+++ b/lm_eval/tasks/afrobench/afrisenti/prompt_1/afrisenti_yor.yaml
+# Generated by utils.py
+dataset_name: yor
+include: afrisenti
+task: afrisenti_yor_prompt_1
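Note: the fourteen per-language stubs above differ only in the language code, and each carries a "Generated by utils.py" header, although the generator itself is not part of this diff. A minimal sketch of how such stubs could be produced is given below. The function names `render_config` and `write_configs` are hypothetical, and plain string formatting stands in for whatever YAML writer the authors actually used.

```python
from pathlib import Path

# Language codes taken from the per-language files in this diff.
LANGS = ["amh", "arq", "ary", "hau", "ibo", "kin", "orm",
         "pcm", "por", "swa", "tir", "tso", "twi", "yor"]


def render_config(lang: str, prompt_no: int) -> str:
    """Return the text of one per-language stub (hypothetical helper)."""
    return (
        "# Generated by utils.py\n"
        f"dataset_name: {lang}\n"
        "include: afrisenti\n"
        f"task: afrisenti_{lang}_prompt_{prompt_no}\n"
    )


def write_configs(out_dir: Path, prompt_no: int) -> list[Path]:
    """Write one stub per language into out_dir and return the paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for lang in LANGS:
        path = out_dir / f"afrisenti_{lang}.yaml"
        path.write_text(render_config(lang, prompt_no), encoding="utf-8")
        paths.append(path)
    return paths
```

Calling `write_configs(Path("prompt_1"), 1)` would recreate files with the same names and contents as the prompt_1 stubs shown above.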
--- a/lm_eval/tasks/afrobench/afrisenti/prompt_1/run.sh
+++ b/lm_eval/tasks/afrobench/afrisenti/prompt_1/run.sh
+#!/bin/bash
+models=(
+  "google/gemma-1.1-7b-it"
+  "CohereForAI/aya-101"
+  "meta-llama/Llama-2-7b-chat-hf"
+  "meta-llama/Meta-Llama-3-8B-Instruct"
+  "google/gemma-2-9b-it"
+  "bigscience/mt0-xxl"
+  "google/gemma-2-27b-it"
+  "meta-llama/Meta-Llama-3-70B-Instruct"
+)
+task=afrisenti_amh_prompt_1,afrisenti_arq_prompt_1,afrisenti_ary_prompt_1,afrisenti_hau_prompt_1,afrisenti_ibo_prompt_1,afrisenti_kin_prompt_1,afrisenti_pcm_prompt_1,afrisenti_por_prompt_1,afrisenti_swa_prompt_1,afrisenti_tir_prompt_1,afrisenti_tso_prompt_1,afrisenti_twi_prompt_1,afrisenti_yor_prompt_1
+for model in "${models[@]}"
+do
+  echo "Evaluating model: $model"
+  for fewshot in 0 5
+  do
+    export OUTPUT_DIR=results/$fewshot
+    mkdir -p "$OUTPUT_DIR"
+    lm_eval --model hf \
+            --model_args "pretrained=${model}" \
+            --tasks "$task" \
+            --device cuda:0 \
+            --batch_size 16 \
+            --output_path "$OUTPUT_DIR" \
+            --num_fewshot $fewshot \
+            --verbosity DEBUG
+  done
+done
--- a/lm_eval/tasks/afrobench/afrisenti/prompt_1/utils.py
+++ b/lm_eval/tasks/afrobench/afrisenti/prompt_1/utils.py
+from lm_eval.utils import weighted_f1_score
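Note: this file only re-exports `weighted_f1_score`; its body lives in `lm_eval/utils.py` and is not shown in this diff. For readers unfamiliar with the metric, the sketch below illustrates what a support-weighted F1 computes. It is a pure-Python illustration, not the library's implementation, and it assumes the aggregation receives a list of `(gold, prediction)` pairs. The name `weighted_f1` is hypothetical.

```python
from collections import Counter


def weighted_f1(items):
    """Support-weighted F1 over (gold, prediction) pairs (illustrative sketch)."""
    items = list(items)
    if not items:
        return 0.0
    support = Counter(gold for gold, _ in items)
    total = len(items)
    score = 0.0
    for label, n_gold in support.items():
        # True positives, false positives and false negatives for this label.
        tp = sum(1 for g, p in items if g == label and p == label)
        fp = sum(1 for g, p in items if g != label and p == label)
        fn = n_gold - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Each class contributes in proportion to its share of gold labels.
        score += (n_gold / total) * f1
    return score
```

Because each class is weighted by its gold-label count, a frequent class affects the score more than a rare one, which matters when the sentiment classes are imbalanced.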
--- a/lm_eval/tasks/afrobench/afrisenti/prompt_1/xx.py
+++ b/lm_eval/tasks/afrobench/afrisenti/prompt_1/xx.py
+from datasets import load_dataset
+# ['amh', 'hau', 'ibo', 'arq', 'ary', 'yor', 'por', 'twi', 'tso', 'tir', 'orm', 'pcm', 'kin', 'swa']
+data = load_dataset("masakhane/afrisenti", "pcm", trust_remote_code=True)
+print(data)
+print(data["test"][:5])
+#
+# ['Naija', 'Pipo', 'wey', 'dey', 'for', 'inside', 'social', 'Media', 'sef', 'don', 'put', 'hand', 'for', 'ear', 'give',
+#  'federal', 'goment', 'and', 'polical', 'leader', 'dem', 'ova', 'di', 'kilin', '.']
+#
+# [6, 0, 14, 17, 2, 2, 6, 0, 7, 17, 16, 0, 2, 0, 16, 0, 0, 9, 0, 0, 11, 2, 8, 0, 1]
--- a/lm_eval/tasks/afrobench/afrisenti/prompt_2/afrisenti
+++ b/lm_eval/tasks/afrobench/afrisenti/prompt_2/afrisenti
+tag:
+    - afrobench_sentiment_tasks
+    - afrisent_prompt_2
+dataset_path: masakhane/afrisenti
+dataset_name: null
+dataset_kwargs: {trust_remote_code: True}
+output_type: multiple_choice
+validation_split: validation
+test_split: test
+fewshot_split: train
+doc_to_target: label
+doc_to_choice:
+    - "negative"
+    - "positive"
+    - "neutral"
+should_decontaminate: true
+doc_to_decontamination_query: 'text: {{tweet}} \nlabel: '
+metric_list:
+  - metric: f1
+    aggregation: !function utils.weighted_f1_score
+    # aggregation: mean
+    average: weighted
+    hf_evaluate: true
+    higher_is_better: True
+    ignore_case: true
+    ignore_punctuation: true
+    regexes_to_ignore:
+      - ","
+      - "\\$"
+  - metric: acc
+    aggregation: mean
+    higher_is_better: true
+    ignore_case: true
+    ignore_punctuation: true
+    regexes_to_ignore:
+      - ","
+      - "\\$"
+metadata:
+  version: 1.0
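Note: the base config above pairs `output_type: multiple_choice` with `doc_to_target: label` and a three-item `doc_to_choice` list. A rough sketch of how these fields interact is given below, assuming the dataset's `label` column is an integer index into `doc_to_choice` and that the harness scores each choice by log-likelihood and picks the highest. The helper names and the example scores are hypothetical.

```python
# Choice strings copied from doc_to_choice in the config above.
CHOICES = ["negative", "positive", "neutral"]


def gold_choice(doc: dict) -> str:
    """Map the integer in doc["label"] to its choice string (assumed integer label)."""
    return CHOICES[doc["label"]]


def predict_choice(loglikelihoods: list[float]) -> str:
    """Return the choice whose continuation has the highest log-likelihood."""
    if len(loglikelihoods) != len(CHOICES):
        raise ValueError("expected one log-likelihood per choice")
    best = max(range(len(CHOICES)), key=loglikelihoods.__getitem__)
    return CHOICES[best]


# Made-up scores for one document: the second choice wins.
doc = {"tweet": "example text", "label": 1}
prediction = predict_choice([-3.2, -1.1, -2.0])
is_correct = prediction == gold_choice(doc)
```

Under these assumptions the `acc` metric is the mean of `is_correct` over the test split, while the `f1` entry aggregates the gold and predicted labels with the weighted F1 helper.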
--- a/lm_eval/tasks/afrobench/afrisenti/prompt_2/afrisenti_amh.yaml
+++ b/lm_eval/tasks/afrobench/afrisenti/prompt_2/afrisenti_amh.yaml
+# Generated by utils.py
+dataset_name: amh
+doc_to_text: Does this Amharic statement; '{{tweet}}' have a Neutral, Positive or
+  Negative sentiment? Labels only
+include: afrisenti
+task: afrisenti_amh_prompt_2
--- a/lm_eval/tasks/afrobench/afrisenti/prompt_2/afrisenti_arq.yaml
+++ b/lm_eval/tasks/afrobench/afrisenti/prompt_2/afrisenti_arq.yaml
+# Generated by utils.py
+dataset_name: arq
+doc_to_text: Does this Algerian Arabic statement; '{{tweet}}' have a Neutral, Positive
+  or Negative sentiment? Labels only
+include: afrisenti
+task: afrisenti_arq_prompt_2