Merge branch 'main' into feature/eval_from_config

Commit 601be343 authored Jun 23, 2025 by Baber
20 changed files
--- a/lm_eval/tasks/afrobench/adr/prompt_5/afridiacritics_yaml
+++ b/lm_eval/tasks/afrobench/adr/prompt_5/afridiacritics_yaml
+tag:
+- adr_tasks
+- adr_prompt_5
+dataset_path: masakhane/diacritics-restoration
+dataset_kwargs: {trust_remote_code: True}
+doc_to_target: target
+output_type: generate_until
+fewshot_split: dev
+test_split: test
+training_split: train
+metric_list:
+  - metric: bleu
+    aggregation: bleu
+    higher_is_better: true
+  - metric: chrf
+    aggregation: chrf
+    higher_is_better: true
+generation_kwargs:
+  do_sample: false
+  until:
+  - '<eos>'
+  - </s>
+  - <|im_end|>
+metadata:
+  version: 1.0
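
The `until` list under `generation_kwargs` names stop sequences for generation. As a rough illustration of their effect (a sketch, not the harness's actual implementation), the snippet below cuts a raw generation at the earliest occurrence of any stop string. `truncate_at_stop` is a hypothetical helper name and the sample text is made up.

```python
def truncate_at_stop(text: str, stop_sequences: list[str]) -> str:
    """Cut `text` at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            # Keep the smallest index so the first stop that appears wins.
            cut = min(cut, idx)
    return text[:cut]


# Stop sequences copied from the config above.
until = ["<eos>", "</s>", "<|im_end|>"]

# Illustrative raw model output (not real data).
raw = "restored sentence</s> trailing tokens<eos>"
truncated = truncate_at_stop(raw, until)
print(truncated)
```

Text that contains none of the stop strings is returned unchanged.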
--- a/lm_eval/tasks/afrobench/adr/prompt_5/afridiacritics_yor.yaml
+++ b/lm_eval/tasks/afrobench/adr/prompt_5/afridiacritics_yor.yaml
+# Generated by utils.py
+dataset_name: yor
+doc_to_text: 'You are a linguist specializing in diacritical marks for Yoruba. Diacritics
+  are essential for proper pronunciation and meaning in Yoruba. You are tasked with
+  converting Yoruba sentences without diacritics into their correctly accented forms.
+  Here''s the input: {{text}}. Return output sentence only'
+include: afridiacritics_yaml
+task: afridiacritics_yor_prompt_5
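
The `doc_to_text` value is a template in which `{{text}}` is filled from the dataset row. The harness renders such templates with Jinja2; the sketch below uses a small regex substitution from the standard library as a stand-in, so it only covers plain `{{field}}` placeholders. `render` and the sample row are hypothetical.

```python
import re


def render(template: str, doc: dict) -> str:
    """Replace each {{field}} placeholder with the matching value from `doc`."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: str(doc[m.group(1)]),
        template,
    )


# Last sentence of the prompt above, used as a short template.
template = "Here's the input: {{text}}. Return output sentence only"

# Illustrative dataset row (not real data).
doc = {"text": "example sentence"}

prompt = render(template, doc)
print(prompt)
```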
--- a/lm_eval/tasks/afrobench/afriqa/README.md
+++ b/lm_eval/tasks/afrobench/afriqa/README.md
+# AfriQA
+
+## Paper
+Title: `AfriQA: Cross-lingual Open-Retrieval Question Answering for African Languages`
+
+Paper Link: https://arxiv.org/abs/2305.06897
+
+## Abstract
+>AfriQA is the first cross-lingual question answering (QA) dataset with a focus on African languages. The dataset includes over 12,000 XOR QA examples across 10 African languages, making it an invaluable resource for developing more equitable QA technology. African languages have historically been underserved in the digital landscape, with far less in-language content available online. This makes it difficult for QA systems to provide accurate information to users in their native language. However, cross-lingual open-retrieval question answering (XOR QA) systems can help fill this gap by retrieving answer content from other languages. AfriQA focuses specifically on African languages where cross-lingual answer content is the only high-coverage source of information. Previous datasets have primarily focused on languages where cross-lingual QA augments coverage from the target language, but AfriQA highlights the importance of African languages as a realistic use case for XOR QA.
+
+HomePage: https://github.com/masakhane-io/afriqa
+
+### Citation
+
+```
+@misc{ogundepo2023afriqa,
+      title={AfriQA: Cross-lingual Open-Retrieval Question Answering for African Languages},
+      author={Odunayo Ogundepo and Tajuddeen R. Gwadabe and Clara E. Rivera and Jonathan H. Clark and Sebastian Ruder and David Ifeoluwa Adelani and Bonaventure F. P. Dossou and Abdou Aziz DIOP and Claytone Sikasote and Gilles Hacheme and Happy Buzaaba and Ignatius Ezeani and Rooweither Mabuya and Salomey Osei and Chris Emezue and Albert Njoroge Kahira and Shamsuddeen H. Muhammad and Akintunde Oladipo and Abraham Toluwase Owodunni and Atnafu Lambebo Tonja and Iyanuoluwa Shode and Akari Asai and Tunde Oluwaseyi Ajayi and Clemencia Siro and Steven Arthur and Mofetoluwa Adeyemi and Orevaoghene Ahia and Aremu Anuoluwapo and Oyinkansola Awosan and Chiamaka Chukwuneke and Bernard Opoku and Awokoya Ayodele and Verrah Otiende and Christine Mwase and Boyd Sinkala and Andre Niyongabo Rubungo and Daniel A. Ajisafe and Emeka Felix Onwuegbuzia and Habib Mbow and Emile Niyomutabazi and Eunice Mukonde and Falalu Ibrahim Lawan and Ibrahim Said Ahmad and Jesujoba O. Alabi and Martin Namukombo and Mbonu Chinedu and Mofya Phiri and Neo Putini and Ndumiso Mngoma and Priscilla A. Amuok and Ruqayya Nasir Iro and Sonia Adhiambo},
+      year={2023},
+      eprint={2305.06897},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL}
+}
+```
--- a/lm_eval/tasks/afrobench/afriqa/afriqa.yaml
+++ b/lm_eval/tasks/afrobench/afriqa/afriqa.yaml
+group: afriqa
+task:
+  - afriqa_prompt_1
+  - afriqa_prompt_2
+  - afriqa_prompt_3
+  - afriqa_prompt_4
+  - afriqa_prompt_5
+aggregate_metric_list:
+  - metric: acc
+    aggregation: mean
+    weight_by_size: true
+metadata:
+  version: 1
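
The group config asks for a `mean` aggregation with `weight_by_size: true`. One plausible reading, assumed here rather than taken from the harness source, is a mean over subtask scores weighted by the number of examples in each subtask. The scores and sizes below are invented for illustration, and `size_weighted_mean` is a hypothetical helper.

```python
def size_weighted_mean(scores: list[float], sizes: list[int]) -> float:
    """Average subtask scores, weighting each by its number of examples."""
    total = sum(sizes)
    if total == 0:
        raise ValueError("at least one subtask must contain examples")
    return sum(score * n for score, n in zip(scores, sizes)) / total


# Two hypothetical subtasks: a small one and a larger one.
scores = [0.50, 0.80]
sizes = [100, 300]

weighted = size_weighted_mean(scores, sizes)
unweighted = sum(scores) / len(scores)
print(weighted, unweighted)
```

With these numbers the larger subtask pulls the weighted result above the plain average of the two scores.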
--- a/lm_eval/tasks/afrobench/afriqa/prompt_1/afriqa
+++ b/lm_eval/tasks/afrobench/afriqa/prompt_1/afriqa
+tag:
+    - afrobench_xqa_tasks
+    - afriqa_prompt_1
+dataset_kwargs: {trust_remote_code: True}
+dataset_path: masakhane/afriqa-gold-passages
+dataset_name: null
+output_type: generate_until
+test_split: test
+fewshot_split: train
+doc_to_target: answer_pivot
+should_decontaminate: true
+doc_to_decontamination_query: question_lang
+generation_kwargs:
+  until:
+    - "\n"
+  do_sample: false
+  temperature: 0.0
+filter_list:
+  - name: remove_whitespace
+    filter:
+      - function: remove_whitespace
+      - function: take_first
+target_delimiter: " "
+metric_list:
+  - metric: exact_match
+    aggregation: mean
+    higher_is_better: true
+    ignore_case: true
+    ignore_punctuation: true
+    regexes_to_ignore:
+      - "."
+      - ","
+      - "\\$"
+  - metric: f1
+    aggregation: !function utils.f1
+    higher_is_better: true
+    ignore_case: true
+    ignore_punctuation: true
+    regexes_to_ignore:
+      - "."
+      - ","
+      - "\\$"
+metadata:
+  version: 1.0
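
The config above chains two filters (`remove_whitespace`, then `take_first`) before scoring with `exact_match` under `ignore_case` and `ignore_punctuation`. The sketch below mimics that flow in plain Python. It assumes that `remove_whitespace` strips leading whitespace and that `take_first` keeps the first response per document. All function names and sample responses are hypothetical.

```python
import string


def remove_leading_whitespace(resps: list[list[str]]) -> list[list[str]]:
    """Strip leading whitespace from every response of every document."""
    return [[r.lstrip() for r in doc_resps] for doc_resps in resps]


def take_first(resps: list[list[str]]) -> list[str]:
    """Keep only the first response for each document."""
    return [doc_resps[0] for doc_resps in resps]


def exact_match(pred: str, gold: str,
                ignore_case: bool = True,
                ignore_punctuation: bool = True) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    if ignore_case:
        pred, gold = pred.lower(), gold.lower()
    if ignore_punctuation:
        table = str.maketrans("", "", string.punctuation)
        pred, gold = pred.translate(table), gold.translate(table)
    return float(pred == gold)


# Two documents; the first has two sampled responses (illustrative data).
resps = [["  Lusaka.", "Ndola"], [" Paris"]]
golds = ["lusaka", "London"]

preds = take_first(remove_leading_whitespace(resps))
scores = [exact_match(p, g) for p, g in zip(preds, golds)]
mean_score = sum(scores) / len(scores)  # `aggregation: mean`
print(preds, scores, mean_score)
```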
--- a/lm_eval/tasks/afrobench/afriqa/prompt_1/afriqa_bem.yaml
+++ b/lm_eval/tasks/afrobench/afriqa/prompt_1/afriqa_bem.yaml
+# Generated by utils.py
+dataset_name: bem
+doc_to_text: 'Your task is to answer a question given a context. Make sure you respond
+  with the shortest span containing the answer in the context.
+
+  Question: {{question_lang}}
+
+  Context: {{context}}
+
+  Answer:'
+include: afriqa
+task: afriqa_bem_prompt_1
--- a/lm_eval/tasks/afrobench/afriqa/prompt_1/afriqa_fon.yaml
+++ b/lm_eval/tasks/afrobench/afriqa/prompt_1/afriqa_fon.yaml
+# Generated by utils.py
+dataset_name: fon
+doc_to_text: 'Your task is to answer a question given a context. Make sure you respond
+  with the shortest span containing the answer in the context.
+
+  Question: {{question_lang}}
+
+  Context: {{context}}
+
+  Answer:'
+include: afriqa
+task: afriqa_fon_prompt_1
--- a/lm_eval/tasks/afrobench/afriqa/prompt_1/afriqa_hau.yaml
+++ b/lm_eval/tasks/afrobench/afriqa/prompt_1/afriqa_hau.yaml
+# Generated by utils.py
+dataset_name: hau
+doc_to_text: 'Your task is to answer a question given a context. Make sure you respond
+  with the shortest span containing the answer in the context.
+
+  Question: {{question_lang}}
+
+  Context: {{context}}
+
+  Answer:'
+include: afriqa
+task: afriqa_hau_prompt_1
--- a/lm_eval/tasks/afrobench/afriqa/prompt_1/afriqa_ibo.yaml
+++ b/lm_eval/tasks/afrobench/afriqa/prompt_1/afriqa_ibo.yaml
+# Generated by utils.py
+dataset_name: ibo
+doc_to_text: 'Your task is to answer a question given a context. Make sure you respond
+  with the shortest span containing the answer in the context.
+
+  Question: {{question_lang}}
+
+  Context: {{context}}
+
+  Answer:'
+include: afriqa
+task: afriqa_ibo_prompt_1
--- a/lm_eval/tasks/afrobench/afriqa/prompt_1/afriqa_kin.yaml
+++ b/lm_eval/tasks/afrobench/afriqa/prompt_1/afriqa_kin.yaml
+# Generated by utils.py
+dataset_name: kin
+doc_to_text: 'Your task is to answer a question given a context. Make sure you respond
+  with the shortest span containing the answer in the context.
+
+  Question: {{question_lang}}
+
+  Context: {{context}}
+
+  Answer:'
+include: afriqa
+task: afriqa_kin_prompt_1
--- a/lm_eval/tasks/afrobench/afriqa/prompt_1/afriqa_swa.yaml
+++ b/lm_eval/tasks/afrobench/afriqa/prompt_1/afriqa_swa.yaml
+# Generated by utils.py
+dataset_name: swa
+doc_to_text: 'Your task is to answer a question given a context. Make sure you respond
+  with the shortest span containing the answer in the context.
+
+  Question: {{question_lang}}
+
+  Context: {{context}}
+
+  Answer:'
+include: afriqa
+fewshot_split: test
+fewshot_config:
+  sampler: first_n
+task: afriqa_swa_prompt_1
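
Each language file pulls in the shared `afriqa` base via `include` and then sets its own keys; here `fewshot_split: test` differs from the base value `train`. The sketch below shows the assumed precedence (keys in the including file win over keys in the included file) using plain dictionaries. `resolve_include` is a hypothetical helper, and the base dictionary holds only a subset of the real keys.

```python
def resolve_include(base: dict, child: dict) -> dict:
    """Merge an included base config with the including file's own keys.

    Assumption: keys in `child` override keys in `base`.
    """
    merged = dict(base)
    merged.update(child)
    merged.pop("include", None)  # the directive itself is not a task setting
    return merged


# Subset of the shared `afriqa` base config.
base = {
    "output_type": "generate_until",
    "test_split": "test",
    "fewshot_split": "train",
}

# Keys from afriqa_swa.yaml (doc_to_text omitted for brevity).
child = {
    "dataset_name": "swa",
    "include": "afriqa",
    "fewshot_split": "test",
    "fewshot_config": {"sampler": "first_n"},
    "task": "afriqa_swa_prompt_1",
}

merged = resolve_include(base, child)
print(merged["fewshot_split"], merged["test_split"])
```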
--- a/lm_eval/tasks/afrobench/afriqa/prompt_1/afriqa_twi.yaml
+++ b/lm_eval/tasks/afrobench/afriqa/prompt_1/afriqa_twi.yaml
+# Generated by utils.py
+dataset_name: twi
+doc_to_text: 'Your task is to answer a question given a context. Make sure you respond
+  with the shortest span containing the answer in the context.
+
+  Question: {{question_lang}}
+
+  Context: {{context}}
+
+  Answer:'
+include: afriqa
+task: afriqa_twi_prompt_1
--- a/lm_eval/tasks/afrobench/afriqa/prompt_1/afriqa_yor.yaml
+++ b/lm_eval/tasks/afrobench/afriqa/prompt_1/afriqa_yor.yaml
+# Generated by utils.py
+dataset_name: yor
+doc_to_text: 'Your task is to answer a question given a context. Make sure you respond
+  with the shortest span containing the answer in the context.
+
+  Question: {{question_lang}}
+
+  Context: {{context}}
+
+  Answer:'
+include: afriqa
+task: afriqa_yor_prompt_1
--- a/lm_eval/tasks/afrobench/afriqa/prompt_1/afriqa_zul.yaml
+++ b/lm_eval/tasks/afrobench/afriqa/prompt_1/afriqa_zul.yaml
+# Generated by utils.py
+dataset_name: zul
+doc_to_text: 'Your task is to answer a question given a context. Make sure you respond
+  with the shortest span containing the answer in the context.
+
+  Question: {{question_lang}}
+
+  Context: {{context}}
+
+  Answer:'
+include: afriqa
+task: afriqa_zul_prompt_1
--- a/lm_eval/tasks/afrobench/afriqa/prompt_1/utils.py
+++ b/lm_eval/tasks/afrobench/afriqa/prompt_1/utils.py
+import re
+import string
+from collections import Counter
+
+
+def normalize_answer(s):
+    """
+    Taken from the official evaluation script for v1.1 of the SQuAD dataset.
+    Lower text and remove punctuation, articles and extra whitespace.
+    """
+
+    def remove_articles(text):
+        return re.sub(r"\b(a|an|the)\b", " ", text)
+
+    def white_space_fix(text):
+        return " ".join(text.split())
+
+    def remove_punc(text):
+        exclude = set(string.punctuation)
+        return "".join(ch for ch in text if ch not in exclude)
+
+    def lower(text):
+        return text.lower()
+
+    return white_space_fix(remove_articles(remove_punc(lower(s))))
+
+
+def f1(items):
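+    """
+    Corpus-level token F1: the mean of the per-example F1 scores.
+    Taken from the official evaluation script for v1.1 of the SQuAD dataset.
+
+    `items` is a sequence of (gold, prediction) string pairs.
+    """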
+
+    unzipped_list = list(zip(*items))
+    golds = unzipped_list[0]
+    preds = unzipped_list[1]
+
+    f1_list = []
+
+    for i in range(len(golds)):
+        prediction_tokens = normalize_answer(preds[i]).split()
+        references_tokens = normalize_answer(golds[i]).split()
+        common = Counter(prediction_tokens) & Counter(references_tokens)
+        num_same = sum(common.values())
+        if num_same == 0:
+            f1_score = 0
+        else:
+            precision = 1.0 * num_same / len(prediction_tokens)
+            recall = 1.0 * num_same / len(references_tokens)
+            f1_score = (2 * precision * recall) / (precision + recall)
+
+        f1_list.append(f1_score)
+
+    return sum(f1_list) / len(f1_list)
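
As a worked example of the token-overlap F1 computed in `utils.py` above, the sketch below re-implements the same normalization and scoring in compact form for a single pair, then averages over two pairs. The names `normalize` and `token_f1` and the sample answers are illustrative and not part of the commit.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize(s: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in PUNCTUATION)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def token_f1(gold: str, pred: str) -> float:
    """Token-level F1 between one gold answer and one prediction."""
    pred_tokens = normalize(pred).split()
    gold_tokens = normalize(gold).split()
    # Multiset intersection counts tokens shared by both sides.
    num_same = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# (gold, prediction) pairs, in the same order utils.f1 unpacks them.
items = [("The Nile River", "Nile"), ("Lusaka", "Ndola")]

per_item = [token_f1(gold, pred) for gold, pred in items]
corpus_f1 = sum(per_item) / len(per_item)
print(per_item, corpus_f1)
```

For the first pair, the gold answer normalizes to two tokens and the prediction to one shared token, so precision is 1, recall is 1/2, and F1 is 2/3. The second pair shares no tokens and scores 0, giving a mean of 1/3.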
--- a/lm_eval/tasks/afrobench/afriqa/prompt_2/afriqa
+++ b/lm_eval/tasks/afrobench/afriqa/prompt_2/afriqa
+tag:
+    - afrobench_xqa_tasks
+    - afriqa_prompt_2
+dataset_kwargs: {trust_remote_code: True}
+dataset_path: masakhane/afriqa-gold-passages
+dataset_name: null
+output_type: generate_until
+test_split: test
+fewshot_split: train
+doc_to_target: answer_pivot
+should_decontaminate: true
+doc_to_decontamination_query: question_lang
+generation_kwargs:
+  until:
+    - "\n"
+  do_sample: false
+  temperature: 0.0
+filter_list:
+  - name: remove_whitespace
+    filter:
+      - function: remove_whitespace
+      - function: take_first
+target_delimiter: " "
+metric_list:
+  - metric: exact_match
+    aggregation: mean
+    higher_is_better: true
+    ignore_case: true
+    ignore_punctuation: true
+    regexes_to_ignore:
+      - "."
+      - ","
+      - "\\$"
+  - metric: f1
+    aggregation: !function utils.f1
+    higher_is_better: true
+    ignore_case: true
+    ignore_punctuation: true
+    regexes_to_ignore:
+      - "."
+      - ","
+      - "\\$"
+metadata:
+  version: 1.0
--- a/lm_eval/tasks/afrobench/afriqa/prompt_2/afriqa_bem.yaml
+++ b/lm_eval/tasks/afrobench/afriqa/prompt_2/afriqa_bem.yaml
+# Generated by utils.py
+dataset_name: bem
+doc_to_text: 'Your task is to answer a question given a context. The question is in
+  Bemba, while the context is in English or French. Make sure you respond with the
+  shortest span in the context that contains the answer.
+
+  Question: {{question_lang}}
+
+  Context: {{context}}
+
+  Answer:'
+include: afriqa
+task: afriqa_bem_prompt_2
--- a/lm_eval/tasks/afrobench/afriqa/prompt_2/afriqa_fon.yaml
+++ b/lm_eval/tasks/afrobench/afriqa/prompt_2/afriqa_fon.yaml
+# Generated by utils.py
+dataset_name: fon
+doc_to_text: 'Your task is to answer a question given a context. The question is in
+  Fon, while the context is in English or French. Make sure you respond with the shortest
+  span in the context that contains the answer.
+
+  Question: {{question_lang}}
+
+  Context: {{context}}
+
+  Answer:'
+include: afriqa
+task: afriqa_fon_prompt_2
--- a/lm_eval/tasks/afrobench/afriqa/prompt_2/afriqa_hau.yaml
+++ b/lm_eval/tasks/afrobench/afriqa/prompt_2/afriqa_hau.yaml
+# Generated by utils.py
+dataset_name: hau
+doc_to_text: 'Your task is to answer a question given a context. The question is in
+  Hausa, while the context is in English or French. Make sure you respond with the
+  shortest span in the context that contains the answer.
+
+  Question: {{question_lang}}
+
+  Context: {{context}}
+
+  Answer:'
+include: afriqa
+task: afriqa_hau_prompt_2
--- a/lm_eval/tasks/afrobench/afriqa/prompt_2/afriqa_ibo.yaml
+++ b/lm_eval/tasks/afrobench/afriqa/prompt_2/afriqa_ibo.yaml
+# Generated by utils.py
+dataset_name: ibo
+doc_to_text: 'Your task is to answer a question given a context. The question is in
+  Igbo, while the context is in English or French. Make sure you respond with the shortest
+  span in the context that contains the answer.
+
+  Question: {{question_lang}}
+
+  Context: {{context}}
+
+  Answer:'
+include: afriqa
+task: afriqa_ibo_prompt_2