Unverified Commit 18297993 authored by Jess, committed by GitHub

AfroBench: How Good are Large Language Models on African Languages? (#2825)



* add afrixnli to task

* add chat completion

* remove chat completion -untested

* afrimmlu added

* afrimmlu folder update

* afrimmlu folder update

* updated prompt

* remove print

* add afrimgsm -direct

* add squad metric

* fix bash script

* remove direct util, update common yaml

* remove print

* add few-shot; metric fixes

* fix direct path, add bash script for gpt models

* added translate test

* update afrixnli tasks

* update afrixnli tasks

* update metrics for afrixnli

* prompt translations fix

* prompt translations fix

* filter and metric fix -mgsm

* remove squad metric

* remove squad metric

* add f1 score to mgsm

* add f1 score to mgsm

* update native-direct with lin

* change f1 function

* add lin to utils

* add utils

* remove test limit

* remove test configs

* add swahili to mmlu

* change eng to ewe in ewe yaml mmlu

* add squad metric to mgsm, remove whitespace filter

* added translate test

* added afrixnli_translate

* fix exact match valueError

* fix exact match valueError

* restructure mmlu folder

* spacing

* remove afrimmlu_translate folder

* add utility

* format task name, clean ups

* modified mgsm

* update on afrimgsm

* update on afrimgsm

* removed utils

* other mgsm varieties

* other mgsm varieties

* adding translate direct

* Update translate_direct_yaml

* add manual xnli prompt, add multichoice for openai models, and adapt multichoice metric for openai model

* edit for open models

* Update translate_direct_yaml

* add verbalizer for xnli

* change xnli from multiple choice to generate

* add manual accuracy scores

* revert xnli to multiple choice

* change afrimgsm utils

* revert xnli to multiple_choice

* cleanups and readmes

* remove openai fixes and unused regex

* pr review changes

* revert metrics.py, task.py and extraction.py to main version

* add afrisenti

* utilities

* pulled from main

* add afrixnli

* add afrimmlu

* update afrixnli prompts

* missing senti language

* fix afrisenti prompt 2

* fix afrisenti prompts

* fix afrisenti prompts

* configure task grouping

* add multiple prompts to afrixnli for irokobench

* add multiple prompts to afrimmlu for irokobench

* Update afrixnli_yaml

* fixes and moves

* fixes and moves

* afrimmlu multiple prompts configs

* remove validation set from afrimmlu

* remove eng from afrimmlu translate test

* correct dataset path

* multiple prompts for mgsm

* file restructure

* afribench grouping

* repo restructuring

* repo restructuring

* update exact match to hugging face exact match and add new mgsm language

* remove decontamination

* update generation kwargs

* update generation kwargs for all mgsm prompts

* remove lang

* update generation kwargs for afrimgsm translatetest

* add afrimgsm cot for direct and translate

* remove eng from translate-cot

* add masakhaPOS tasks

* remove changes from task script

* add masakhanews tasks

* add uhura arc easy

* add afriqa and belebele files

* add tags for easier run. add naija rc

* add new metrics and transformation scripts

* fix afriqa swa fewshot split

* add naijarc

* add afrobench lite tasks

* update afrobench

* update afrobench

* remove unverified files to avoid bugs

* remove files not needed

* add afrobench tasks

* add afrobench tasks

* change to version 1

* change to version 1

* update afrobench

* update afrobench

* restore metric to original script

* update readme instructions

* add individual dataset readmes

* add link to collections

* correct run script

* align with main

* align with main

* align with main

* align with main

* align with main

* align with main

* align with main

* align with main

* failed run fixes

* failed run fixes

* add afrimgsm cot

* Apply precommit fixes

* update mafand dataset name

* pull request fixes

* remove afrihate due to availability

---------
Co-authored-by: Israel Abebe Azime <azime@cg.uni-saarland.de>
Co-authored-by: Israel Abebe Azime <se.israel.abebe@gmail.com>
Co-authored-by: David Adelani <davlanade@gmail.com>
Co-authored-by: theyorubayesian <akin.o.oladipo@gmail.com>
parent cf51e699
# Generated by utils.py
dataset_name: twi
doc_to_text: 'You are an AI assistant and your task is to answer the question based
  on the provided context. Your answer should be the shortest span that contains the
  answer within the context.

  Question: {{question_lang}}

  Context: {{context}}

  Answer:'
include: afriqa
task: afriqa_twi_prompt_4
# Generated by utils.py
dataset_name: yor
doc_to_text: 'You are an AI assistant and your task is to answer the question based
  on the provided context. Your answer should be the shortest span that contains the
  answer within the context.

  Question: {{question_lang}}

  Context: {{context}}

  Answer:'
include: afriqa
task: afriqa_yor_prompt_4
# Generated by utils.py
dataset_name: zul
doc_to_text: 'You are an AI assistant and your task is to answer the question based
  on the provided context. Your answer should be the shortest span that contains the
  answer within the context.

  Question: {{question_lang}}

  Context: {{context}}

  Answer:'
include: afriqa
task: afriqa_zul_prompt_4
import re
import string
from collections import Counter


def normalize_answer(s):
    """
    Taken from the official evaluation script for v1.1 of the SQuAD dataset.
    Lower text and remove punctuation, articles and extra whitespace.
    """

    def remove_articles(text):
        return re.sub(r"\b(a|an|the)\b", " ", text)

    def white_space_fix(text):
        return " ".join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return "".join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))


def f1(items):
    """
    Taken from the official evaluation script for v1.1 of the SQuAD dataset.
    """
    unzipped_list = list(zip(*items))
    golds = unzipped_list[0]
    preds = unzipped_list[1]

    f1_list = []
    for i in range(len(golds)):
        prediction_tokens = normalize_answer(preds[i]).split()
        references_tokens = normalize_answer(golds[i]).split()
        common = Counter(prediction_tokens) & Counter(references_tokens)
        num_same = sum(common.values())
        if num_same == 0:
            f1_score = 0
        else:
            precision = 1.0 * num_same / len(prediction_tokens)
            recall = 1.0 * num_same / len(references_tokens)
            f1_score = (2 * precision * recall) / (precision + recall)
        f1_list.append(f1_score)
    return sum(f1_list) / len(f1_list)
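As a quick sanity check of the aggregation above: `items` is a list of `(gold, pred)` pairs, and the score is the mean per-example token-overlap F1. A compact, self-contained re-statement of the same SQuAD-style logic (condensed here for illustration, not the repo's exact code):

```python
# Condensed demo of SQuAD-style mean token-overlap F1 over (gold, pred) pairs.
import re
import string
from collections import Counter


def normalize_answer(s):
    # Lowercase, strip punctuation, drop English articles, collapse whitespace.
    s = "".join(ch for ch in s.lower() if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def f1(items):
    scores = []
    for gold, pred in items:
        pred_tokens = normalize_answer(pred).split()
        gold_tokens = normalize_answer(gold).split()
        num_same = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
        if num_same == 0:
            scores.append(0.0)
            continue
        precision = num_same / len(pred_tokens)
        recall = num_same / len(gold_tokens)
        scores.append(2 * precision * recall / (precision + recall))
    return sum(scores) / len(scores)


# A normalized exact match scores 1.0; a total miss scores 0.0.
print(f1([("the Nile River", "Nile River"), ("Accra", "Lagos")]))  # -> 0.5
```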
tag:
  - afrobench_xqa_tasks
  - afriqa_prompt_5
dataset_kwargs: {trust_remote_code: True}
dataset_path: masakhane/afriqa-gold-passages
dataset_name: null
output_type: generate_until
test_split: test
fewshot_split: train
doc_to_target: answer_pivot
should_decontaminate: true
doc_to_decontamination_query: question_lang
generation_kwargs:
  until:
    - "\n"
  do_sample: false
  temperature: 0.0
filter_list:
  - name: remove_whitespace
    filter:
      - function: remove_whitespace
      - function: take_first
target_delimiter: " "
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true
    regexes_to_ignore:
      - "."
      - ","
      - "\\$"
  - metric: f1
    aggregation: !function utils.f1
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true
    regexes_to_ignore:
      - "."
      - ","
      - "\\$"
metadata:
  version: 1.0
# Generated by utils.py
dataset_name: bem
doc_to_text: 'Using the context, find the answer to the question. Respond with the
  briefest span that includes the answer from the context.

  Question: {{question_lang}}

  Context: {{context}}

  Answer:'
include: afriqa
task: afriqa_bem_prompt_5
# Generated by utils.py
dataset_name: fon
doc_to_text: 'Using the context, find the answer to the question. Respond with the
  briefest span that includes the answer from the context.

  Question: {{question_lang}}

  Context: {{context}}

  Answer:'
include: afriqa
task: afriqa_fon_prompt_5
# Generated by utils.py
dataset_name: hau
doc_to_text: 'Using the context, find the answer to the question. Respond with the
  briefest span that includes the answer from the context.

  Question: {{question_lang}}

  Context: {{context}}

  Answer:'
include: afriqa
task: afriqa_hau_prompt_5
# Generated by utils.py
dataset_name: ibo
doc_to_text: 'Using the context, find the answer to the question. Respond with the
  briefest span that includes the answer from the context.

  Question: {{question_lang}}

  Context: {{context}}

  Answer:'
include: afriqa
task: afriqa_ibo_prompt_5
# Generated by utils.py
dataset_name: kin
doc_to_text: 'Using the context, find the answer to the question. Respond with the
  briefest span that includes the answer from the context.

  Question: {{question_lang}}

  Context: {{context}}

  Answer:'
include: afriqa
task: afriqa_kin_prompt_5
# Generated by utils.py
dataset_name: swa
doc_to_text: 'Using the context, find the answer to the question. Respond with the
  briefest span that includes the answer from the context.

  Question: {{question_lang}}

  Context: {{context}}

  Answer:'
include: afriqa
fewshot_split: test
fewshot_config:
  sampler: first_n
task: afriqa_swa_prompt_5
# Generated by utils.py
dataset_name: twi
doc_to_text: 'Using the context, find the answer to the question. Respond with the
  briefest span that includes the answer from the context.

  Question: {{question_lang}}

  Context: {{context}}

  Answer:'
include: afriqa
task: afriqa_twi_prompt_5
# Generated by utils.py
dataset_name: yor
doc_to_text: 'Using the context, find the answer to the question. Respond with the
  briefest span that includes the answer from the context.

  Question: {{question_lang}}

  Context: {{context}}

  Answer:'
include: afriqa
task: afriqa_yor_prompt_5
# Generated by utils.py
dataset_name: zul
doc_to_text: 'Using the context, find the answer to the question. Respond with the
  briefest span that includes the answer from the context.

  Question: {{question_lang}}

  Context: {{context}}

  Answer:'
include: afriqa
task: afriqa_zul_prompt_5
import re
import string
from collections import Counter


def normalize_answer(s):
    """
    Taken from the official evaluation script for v1.1 of the SQuAD dataset.
    Lower text and remove punctuation, articles and extra whitespace.
    """

    def remove_articles(text):
        return re.sub(r"\b(a|an|the)\b", " ", text)

    def white_space_fix(text):
        return " ".join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return "".join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))


def f1(items):
    """
    Taken from the official evaluation script for v1.1 of the SQuAD dataset.
    """
    unzipped_list = list(zip(*items))
    golds = unzipped_list[0]
    preds = unzipped_list[1]

    f1_list = []
    for i in range(len(golds)):
        prediction_tokens = normalize_answer(preds[i]).split()
        references_tokens = normalize_answer(golds[i]).split()
        common = Counter(prediction_tokens) & Counter(references_tokens)
        num_same = sum(common.values())
        if num_same == 0:
            f1_score = 0
        else:
            precision = 1.0 * num_same / len(prediction_tokens)
            recall = 1.0 * num_same / len(references_tokens)
            f1_score = (2 * precision * recall) / (precision + recall)
        f1_list.append(f1_score)
    return sum(f1_list) / len(f1_list)
import argparse
import os

import yaml


class FunctionTag:
    def __init__(self, value):
        self.value = value


def prompt_func(mode, lang):
    prompt_map = {
        "prompt_1": "Your task is to answer a question given a context. "
        "Make sure you respond with the shortest span containing the answer in the context.\n"
        "Question: {{question_lang}}\n"
        "Context: {{context}}\n"
        "Answer:",
        "prompt_2": f"Your task is to answer a question given a context. The question is in {lang}, while the context is in English or French. "
        "Make sure you respond with the shortest span in the context that contains the answer.\n"
        "Question: {{question_lang}}\n"
        "Context: {{context}}\n"
        "Answer:",
        "prompt_3": "Given the context, provide the answer to the following question. "
        "Ensure your response is concise and directly from the context.\n"
        "Question: {{question_lang}}\n"
        "Context: {{context}}\n"
        "Answer:",
        "prompt_4": "You are an AI assistant and your task is to answer the question based on the provided context. "
        "Your answer should be the shortest span that contains the answer within the context.\n"
        "Question: {{question_lang}}\n"
        "Context: {{context}}\n"
        "Answer:",
        "prompt_5": "Using the context, find the answer to the question. "
        "Respond with the briefest span that includes the answer from the context.\n"
        "Question: {{question_lang}}\n"
        "Context: {{context}}\n"
        "Answer:",
    }
    return prompt_map[mode]


def gen_lang_yamls(output_dir: str, overwrite: bool, mode: str) -> None:
    """
    Generate a yaml file for each language.

    :param output_dir: The directory to output the files to.
    :param overwrite: Whether to overwrite files if they already exist.
    :param mode: Which prompt template to use.
    """
    err = []
    languages = {
        "bem": "Bemba",
        "fon": "Fon",
        "hau": "Hausa",
        "ibo": "Igbo",
        "kin": "Kinyarwanda",
        "swa": "Swahili",
        "twi": "Twi",
        "wol": "Wolof",
        "yor": "Yoruba",
        "zul": "Zulu",
    }
    for lang in languages.keys():
        try:
            file_name = f"afriqa_{lang}.yaml"
            task_name = f"afriqa_{lang}_{mode}"
            yaml_template = "afriqa"
            yaml_details = {
                "include": yaml_template,
                "task": task_name,
                "dataset_name": lang,
                "doc_to_text": prompt_func(mode, languages[lang]),
            }
            file_path = os.path.join(output_dir, mode)
            os.makedirs(file_path, exist_ok=True)
            with open(
                f"{output_dir}/{mode}/{file_name}",
                "w" if overwrite else "x",
                encoding="utf8",
            ) as f:
                f.write("# Generated by utils.py\n")
                yaml.dump(
                    yaml_details,
                    f,
                    allow_unicode=True,
                )
        except FileExistsError:
            err.append(file_name)

    if len(err) > 0:
        raise FileExistsError(
            "Files were not created because they already exist (use --overwrite flag):"
            f" {', '.join(err)}"
        )


def main() -> None:
    """Parse CLI args and generate language-specific yaml files."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--overwrite",
        default=False,
        action="store_true",
        help="Overwrite files if they already exist",
    )
    parser.add_argument(
        "--output-dir",
        default="./",
        help="Directory to write yaml files to",
    )
    parser.add_argument(
        "--mode",
        default="prompt_1",
        choices=["prompt_1", "prompt_2", "prompt_3", "prompt_4", "prompt_5"],
        help="Prompt number",
    )
    args = parser.parse_args()
    gen_lang_yamls(output_dir=args.output_dir, overwrite=args.overwrite, mode=args.mode)


if __name__ == "__main__":
    main()
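Each generated per-language file is a small include stub. The sketch below (with a hypothetical task/language combination) shows the shape `yaml.dump` emits, keys sorted alphabetically by its defaults, and checks that the text round-trips back to the same mapping:

```python
# Sketch of the per-language include stub the generator writes; the
# doc_to_text value is abbreviated here and the task name is illustrative.
import yaml

yaml_details = {
    "include": "afriqa",
    "task": "afriqa_hau_prompt_5",
    "dataset_name": "hau",
    "doc_to_text": "Using the context, find the answer to the question. ...",
}

text = "# Generated by utils.py\n" + yaml.dump(yaml_details, allow_unicode=True)
print(text)

# The comment line is ignored by the YAML parser, so the file loads back
# to exactly the mapping we dumped.
assert yaml.safe_load(text) == yaml_details
```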
# AfriSenti
## Paper
Title: `AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages`
Paper Link: https://aclanthology.org/2023.emnlp-main.862/
## Abstract
>Africa is home to over 2,000 languages from over six language families and has the highest linguistic diversity among all continents. This includes 75 languages with at least one million speakers each. Yet, there is little NLP research conducted on African languages. Crucial in enabling such research is the availability of high-quality annotated datasets. In this paper, we introduce AfriSenti, a sentiment analysis benchmark that contains a total of >110,000 tweets in 14 African languages (Amharic, Algerian Arabic, Hausa, Igbo, Kinyarwanda, Moroccan Arabic, Mozambican Portuguese, Nigerian Pidgin, Oromo, Swahili, Tigrinya, Twi, Xitsonga, and Yoruba) from four language families. The tweets were annotated by native speakers and used in the AfriSenti-SemEval shared task (with over 200 participants, see website: https://afrisenti-semeval.github.io). We describe the data collection methodology, annotation process, and the challenges we dealt with when curating each dataset. We further report baseline experiments conducted on the AfriSenti datasets and discuss their usefulness.
HomePage: https://github.com/afrisenti-semeval/afrisent-semeval-2023
### Citation
```
@inproceedings{muhammad-etal-2023-afrisenti,
title = "{A}fri{S}enti: A {T}witter Sentiment Analysis Benchmark for {A}frican Languages",
author = "Muhammad, Shamsuddeen Hassan and
Abdulmumin, Idris and
Ayele, Abinew Ali and
Ousidhoum, Nedjma and
Adelani, David Ifeoluwa and
Yimam, Seid Muhie and
Ahmad, Ibrahim Sa'id and
Beloucif, Meriem and
Mohammad, Saif M. and
Ruder, Sebastian and
Hourrane, Oumaima and
Brazdil, Pavel and
Jorge, Alipio and
Ali, Felermino D{\'a}rio M{\'a}rio Ant{\'o}nio and
David, Davis and
Osei, Salomey and
Shehu Bello, Bello and
Ibrahim, Falalu and
Gwadabe, Tajuddeen and
Rutunda, Samuel and
Belay, Tadesse and
Messelle, Wendimu Baye and
Balcha, Hailu Beshada and
Chala, Sisay Adugna and
Gebremichael, Hagos Tesfahun and
Opoku, Bernard and
Arthur, Stephen",
editor = "Bouamor, Houda and
Pino, Juan and
Bali, Kalika",
booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.emnlp-main.862/",
doi = "10.18653/v1/2023.emnlp-main.862",
pages = "13968--13981",
abstract = "Africa is home to over 2,000 languages from over six language families and has the highest linguistic diversity among all continents. This includes 75 languages with at least one million speakers each. Yet, there is little NLP research conducted on African languages. Crucial in enabling such research is the availability of high-quality annotated datasets. In this paper, we introduce AfriSenti, a sentiment analysis benchmark that contains a total of {\ensuremath{>}}110,000 tweets in 14 African languages (Amharic, Algerian Arabic, Hausa, Igbo, Kinyarwanda, Moroccan Arabic, Mozambican Portuguese, Nigerian Pidgin, Oromo, Swahili, Tigrinya, Twi, Xitsonga, and Yoruba) from four language families. The tweets were annotated by native speakers and used in the AfriSenti-SemEval shared task (with over 200 participants, see website: https://afrisenti-semeval.github.io). We describe the data collection methodology, annotation process, and the challenges we dealt with when curating each dataset. We further report baseline experiments conducted on the AfriSenti datasets and discuss their usefulness."
}
```
group: afrisenti
task:
  - afrisenti_prompt_1
  - afrisenti_prompt_2
  - afrisenti_prompt_3
  - afrisenti_prompt_4
  - afrisenti_prompt_5
aggregate_metric_list:
  - metric: acc
    aggregation: mean
    weight_by_size: true
metadata:
  version: 1
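With `weight_by_size: true`, the group score is a size-weighted mean of the subtask scores rather than a plain average. A minimal sketch of that aggregation, over hypothetical `(accuracy, num_examples)` pairs:

```python
# Size-weighted mean over hypothetical (accuracy, num_examples) pairs,
# as implied by `weight_by_size: true` in the group config above.
def weighted_mean(results):
    total = sum(n for _, n in results)
    return sum(acc * n for acc, n in results) / total


# The larger subtask (300 examples at 0.5) pulls the aggregate below the
# plain average of 0.65.
print(weighted_mean([(0.8, 100), (0.5, 300)]))  # -> 0.575
```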
lm_eval --model hf \
--model_args pretrained=masakhane/African-ultrachat-alpaca \
--tasks afrimmlu_direct_amh,afrimmlu_direct_eng,afrimmlu_direct_ewe,afrimmlu_direct_fra,afrimmlu_direct_hau,afrimmlu_direct_ibo,afrimmlu_direct_kin,afrimmlu_direct_lin,afrimmlu_direct_lug,afrimmlu_direct_orm,afrimmlu_direct_sna,afrimmlu_direct_sot,afrimmlu_direct_twi,afrimmlu_direct_wol,afrimmlu_direct_xho,afrimmlu_direct_yor,afrimmlu_direct_zul \
--device cuda:0 \
--batch_size 1 \
--num_fewshot 0 \
--verbosity DEBUG \
--wandb_args project=afrimmlu
lm_eval --model hf \
--model_args pretrained=bigscience/mt0-small,parallelize=true \
--tasks afrisenti_amh_prompt_1,afrisenti_arq_prompt_1,afrisenti_ary_prompt_1,afrisenti_hau_prompt_1,afrisenti_ibo_prompt_1,afrisenti_kin_prompt_1,afrisenti_orm_prompt_1,afrisenti_pcm_prompt_1,afrisenti_por_prompt_1,afrisenti_swa_prompt_1,afrisenti_tir_prompt_1,afrisenti_tso_prompt_1,afrisenti_twi_prompt_1,afrisenti_yor_prompt_1 \
--device cuda:0 \
--batch_size 1 \
--num_fewshot 0 \
--verbosity DEBUG \
--limit 5
lm_eval --model hf \
--model_args pretrained=bigscience/mt0-xxl,parallelize=true \
--tasks afrisenti_amh_prompt_1,afrisenti_arq_prompt_1,afrisenti_ary_prompt_1,afrisenti_hau_prompt_1,afrisenti_ibo_prompt_1,afrisenti_kin_prompt_1,afrisenti_orm_prompt_1,afrisenti_pcm_prompt_1,afrisenti_por_prompt_1,afrisenti_swa_prompt_1,afrisenti_tir_prompt_1,afrisenti_tso_prompt_1,afrisenti_twi_prompt_1,afrisenti_yor_prompt_1 \
--batch_size 128 \
--num_fewshot 0 \
--verbosity DEBUG
lm_eval --model hf \
--model_args pretrained=google/gemma-2-27b-it,parallelize=true,trust_remote_code=True \
--tasks afriqa_wol_prompt_2 \
--batch_size 1 \
--device 'cuda' \
--num_fewshot 5 \
--verbosity DEBUG \
--output_path './afriqa_results/' \
--log_samples
lm_eval --model vllm \
--model_args pretrained=meta-llama/Llama-2-7b-chat-hf,tensor_parallel_size=2,dtype='auto',gpu_memory_utilization=0.8,data_parallel_size=1 \
--tasks masakhapos_pcm_prompt_1,masakhapos_pcm_prompt_2,masakhapos_pcm_prompt_3,masakhapos_pcm_prompt_4,masakhapos_pcm_prompt_5 \
--batch_size 'auto' \
--device 'cuda' \
--num_fewshot 0 \
--verbosity DEBUG \
--limit 2
lm_eval --model vllm \
--model_args pretrained=meta-llama/Llama-2-7b-chat-hf,tensor_parallel_size=2,dtype='auto',gpu_memory_utilization=0.8,data_parallel_size=1 \
--tasks masakhapos_pcm_prompt_1,masakhapos_pcm_prompt_2,masakhapos_pcm_prompt_3,masakhapos_bam_prompt_2,masakhapos_bbj_prompt_3 \
--batch_size 'auto' \
--device 'cuda' \
--num_fewshot 0 \
--verbosity DEBUG
lm_eval --model vllm \
--model_args pretrained=google/gemma-1.1-7b-it,tensor_parallel_size=2,dtype='auto',gpu_memory_utilization=0.8,data_parallel_size=1 \
--tasks masakhaner_pcm_prompt_1 \
--batch_size 'auto' \
--device 'cuda' \
--num_fewshot 0 \
--verbosity DEBUG \
--limit 5
lm_eval --model vllm \
--model_args pretrained=google/gemma-2-9b-it,tensor_parallel_size=2,dtype='auto',gpu_memory_utilization=0.8,data_parallel_size=1 \
--tasks masakhaner_pcm_prompt_1,masakhaner_pcm_prompt_2,masakhaner_pcm_prompt_3,masakhaner_pcm_prompt_4,masakhaner_pcm_prompt_5 \
--batch_size 'auto' \
--device 'cuda' \
--num_fewshot 0 \
--verbosity DEBUG \
--limit 5
lm_eval --model vllm \
--model_args pretrained=google/gemma-1.1-7b-it,tensor_parallel_size=2,dtype='auto',gpu_memory_utilization=0.8,data_parallel_size=1 \
--tasks flores_eng_Latn-fuv_Latn_prompt_1,flores_eng_Latn-fuv_Latn_prompt_2,flores_eng_Latn-fuv_Latn_prompt_3,flores_fuv_Latn-eng_Latn_prompt_1,flores_fuv_Latn-eng_Latn_prompt_2,flores_fuv_Latn-eng_Latn_prompt_3 \
--batch_size 'auto' \
--device 'cuda' \
--num_fewshot 0 \
--verbosity DEBUG \
--limit 2
lm_eval --model vllm \
--model_args pretrained=google/gemma-2-27b-it,tensor_parallel_size=2,dtype='auto',gpu_memory_utilization=0.9,data_parallel_size=1 \
--tasks masakhapos_twi_prompt_3,masakhapos_wol_prompt_3,masakhapos_xho_prompt_3,masakhapos_yor_prompt_3,masakhapos_zul_prompt_3 \
--batch_size 'auto' \
--num_fewshot 5 \
--verbosity DEBUG \
--output_path './masakhapos_results/' \
--log_samples
lm_eval --model hf \
--model_args pretrained=bigscience/mt0-small,parallelize=true \
--tasks injongointent_amh_prompt_1,injongointent_eng_prompt_1,injongointent_yor_prompt_1,injongointent_ibo_prompt_1,injongointent_wol_prompt_1 \
--device 'mps' \
--batch_size 1 \
--num_fewshot 0 \
--verbosity DEBUG \
--limit 5
lm_eval --model hf \
--model_args pretrained=google/gemma-3-27b-it,parallelize=true \
--tasks afrobench_sentiment_tasks \
--device 'cuda' \
--batch_size 1 \
--num_fewshot 0 \
--verbosity DEBUG \
--output_path './senti_results/' \
--log_samples
tag:
  - afrobench_sentiment_tasks
  - afrisenti_prompt_1
task: null
dataset_path: masakhane/afrisenti
dataset_name: null
dataset_kwargs: {trust_remote_code: True}
output_type: multiple_choice
validation_split: validation
test_split: test
fewshot_split: train
doc_to_text: 'Does this statement; "{{tweet}}" have a Neutral, Positive or Negative sentiment? Labels only'
doc_to_target: label
doc_to_choice:
  - "negative"
  - "positive"
  - "neutral"
should_decontaminate: true
doc_to_decontamination_query: tweet
metric_list:
  - metric: f1
    aggregation: !function utils.weighted_f1_score
    # aggregation: mean
    average: weighted
    hf_evaluate: true
    higher_is_better: True
    ignore_case: true
    ignore_punctuation: true
    regexes_to_ignore:
      - ","
      - "\\$"
  - metric: acc
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true
    regexes_to_ignore:
      - ","
      - "\\$"
metadata:
  version: 1.0
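The `utils.weighted_f1_score` aggregation referenced above is not included in this excerpt. Assuming it behaves like scikit-learn's `f1_score(..., average="weighted")`, a minimal pure-Python sketch over `(gold, pred)` label pairs might look like this (the function body here is illustrative, not the repo's code):

```python
# Hypothetical sketch of a support-weighted F1 over (gold, pred) label pairs,
# in the spirit of sklearn's f1_score(average="weighted").
from collections import Counter


def weighted_f1_score(items):
    golds = [g for g, _ in items]
    support = Counter(golds)
    total = 0.0
    for label, count in support.items():
        tp = sum(1 for g, p in items if g == label and p == label)
        fp = sum(1 for g, p in items if g != label and p == label)
        fn = sum(1 for g, p in items if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        # Weight each class's F1 by its support (number of gold examples).
        total += f1 * count
    return total / len(golds)


# Perfect predictions score 1.0 regardless of class balance.
print(weighted_f1_score([("positive", "positive"), ("negative", "negative")]))  # -> 1.0
```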