Irokobench: Benchmark Dataset for African languages (#2042)

* add afrixnli to task * add chat completion * remove chat completion -untested * afrimmlu added * afrimmlu folder update * afrimmlu folder update * updated prompt * remove print * add afrimgsm -direct * add squad metric * fix bash script * remove direct util, update common yaml * remove print * add few show. metric fixes * fix direct path, add bash script for gpt models * added transate test * update afrixnli tasks * update afrixnli tasks * update metrics for afrixnli * prompt translations fix * prompt translations fix * filter and metric fix -mgsm * remove squad metric * remove squad metric * add f1 score to mgsm * add f1 score to mgsm * update native-direct with lin * change f1 function * add lin to utils * add utils * remove test limit * remove test configs * add swahili to mmlu * change eng to ewe in ewe yaml mmlu * add squad metric to mgsm, remove whitespace filter * added translate test * added afrixnli_translate * fix exact match valueError * fix exact match valueError * restructure mmlu folder * spacing * remove afrimmlu_translate folder * add utility * format task name, clean ups * modefied mgsm * update on afrimgsm * update on afrimgsm * removed utils * other mgsm varieties * other mgsm varieties * adding trasnslate direct * Update translate_direct_yaml * add manual xnli prompt, add multichoice for openai models, and adapt multichoice metric for openai model * edit for open models * Update translate_direct_yaml * add verbalizer for xnli * change xnli from multiple choice to generate * add manual accuracy scores * revert xnli to multiple choice * change afrimgsm utils * revert xnli to multiple_choice * cleanups and readmes * remove openai fixes and unused regex * pr review changes * revert metrics.py, task.py and extraction.py to main version --------- Co-authored-by: Israel Abebe Azime <azime@cg.uni-saarland.de> Co-authored-by: Israel Abebe Azime <se.israel.abebe@gmail.com>

Irokobench: Benchmark Dataset for African languages (#2042)
* add afrixnli to task * add chat completion * remove chat completion -untested * afrimmlu added * afrimmlu folder update * afrimmlu folder update * updated prompt * remove print * add afrimgsm -direct * add squad metric * fix bash script * remove direct util, update common yaml * remove print * add few show. metric fixes * fix direct path, add bash script for gpt models * added transate test * update afrixnli tasks * update afrixnli tasks * update metrics for afrixnli * prompt translations fix * prompt translations fix * filter and metric fix -mgsm * remove squad metric * remove squad metric * add f1 score to mgsm * add f1 score to mgsm * update native-direct with lin * change f1 function * add lin to utils * add utils * remove test limit * remove test configs * add swahili to mmlu * change eng to ewe in ewe yaml mmlu * add squad metric to mgsm, remove whitespace filter * added translate test * added afrixnli_translate * fix exact match valueError * fix exact match valueError * restructure mmlu folder * spacing * remove afrimmlu_translate folder * add utility * format task name, clean ups * modefied mgsm * update on afrimgsm * update on afrimgsm * removed utils * other mgsm varieties * other mgsm varieties * adding trasnslate direct * Update translate_direct_yaml * add manual xnli prompt, add multichoice for openai models, and adapt multichoice metric for openai model * edit for open models * Update translate_direct_yaml * add verbalizer for xnli * change xnli from multiple choice to generate * add manual accuracy scores * revert xnli to multiple choice * change afrimgsm utils * revert xnli to multiple_choice * cleanups and readmes * remove openai fixes and unused regex * pr review changes * revert metrics.py, task.py and extraction.py to main version --------- Co-authored-by: Israel Abebe Azime <azime@cg.uni-saarland.de> Co-authored-by: Israel Abebe Azime <se.israel.abebe@gmail.com>
383bbd54 · Jess · GitHub · d5f39bf8 · 383bbd54 · 383bbd54
Unverified Commit 383bbd54 authored Jul 12, 2024 by Jess Committed by GitHub Jul 12, 2024
20 changed files
--- a/lm_eval/tasks/afrimmlu/direct/afrimmlu_direct_wol.yaml
+++ b/lm_eval/tasks/afrimmlu/direct/afrimmlu_direct_wol.yaml
+dataset_name: wol
+include: afrimmlu_common_yaml
+task: afrimmlu_direct_wol
\ No newline at end of file
--- a/lm_eval/tasks/afrimmlu/direct/afrimmlu_direct_xho.yaml
+++ b/lm_eval/tasks/afrimmlu/direct/afrimmlu_direct_xho.yaml
+dataset_name: xho
+include: afrimmlu_common_yaml
+task: afrimmlu_direct_xho
\ No newline at end of file
--- a/lm_eval/tasks/afrimmlu/direct/afrimmlu_direct_yor.yaml
+++ b/lm_eval/tasks/afrimmlu/direct/afrimmlu_direct_yor.yaml
+dataset_name: yor
+include: afrimmlu_common_yaml
+task: afrimmlu_direct_yor
\ No newline at end of file
--- a/lm_eval/tasks/afrimmlu/direct/afrimmlu_direct_zul.yaml
+++ b/lm_eval/tasks/afrimmlu/direct/afrimmlu_direct_zul.yaml
+dataset_name: zul
+include: afrimmlu_common_yaml
+task: afrimmlu_direct_zul
\ No newline at end of file
--- a/lm_eval/tasks/afrimmlu/direct/utils.py
+++ b/lm_eval/tasks/afrimmlu/direct/utils.py
+import re
+import sys
+import unicodedata
+
+from sklearn.metrics import f1_score
+from lm_eval.filters.extraction import RegexFilter
+
+
+def doc_to_choice(doc):
+    choices = eval(doc["choices"])
+    return choices
+
+
+def doc_to_text(doc):
+    output = """You are a highly knowledgeable and intelligent artificial intelligence 
+                model answers multiple-choice questions about {subject}
+                
+                Question: {question}
+
+                Choices:
+                        A: {choice1}
+                        B: {choice2}
+                        C: {choice3}
+                        D: {choice4}
+                       
+                Answer:  """
+    
+    choices = eval(doc["choices"])
+    text = output.format(subject=doc['subject'],
+                         question=doc['question'],
+                         choice1=choices[0],
+                         choice2=choices[1],
+                         choice3=choices[2],
+                         choice4=choices[3])
+    return text
+
+
+def weighted_f1_score(items):
+    unzipped_list = list(zip(*items))
+    golds = unzipped_list[0]
+    preds = unzipped_list[1]
+    fscore = f1_score(golds, preds, average="weighted")
+    return fscore
\ No newline at end of file
--- a/lm_eval/tasks/afrimmlu/fewshot.sh
+++ b/lm_eval/tasks/afrimmlu/fewshot.sh
+lm_eval --model hf \
+        --model_args pretrained=masakhane/African-ultrachat-alpaca  \
+        --tasks afrimmlu_direct_amh,afrimmlu_direct_eng,afrimmlu_direct_ewe,afrimmlu_direct_fra,afrimmlu_direct_hau,afrimmlu_direct_ibo,afrimmlu_direct_kin,afrimmlu_direct_lin,afrimmlu_direct_lug,afrimmlu_direct_orm,afrimmlu_direct_sna,afrimmlu_direct_sot,afrimmlu_direct_twi,afrimmlu_direct_wol,afrimmlu_direct_xho,afrimmlu_direct_yor,afrimmlu_direct_zul   \
+        --device cuda:0     \
+        --batch_size 1 \
+        --num_fewshot 0 \
+        --verbosity DEBUG \
+        --wandb_args project=afrimmlu
\ No newline at end of file
--- a/lm_eval/tasks/afrimmlu/translate/afrimmlu_common_translate_yaml
+++ b/lm_eval/tasks/afrimmlu/translate/afrimmlu_common_translate_yaml
+group:
+  - mmlu
+  - afrimmlu_translate
+task: null
+dataset_path: masakhane/afrimmlu-translate-test
+dataset_name: null
+output_type: multiple_choice
+test_split: test
+doc_to_text: !function utils.doc_to_text 
+doc_to_target: "{{['A', 'B', 'C', 'D'].index(answer)}}"
+doc_to_choice: !function utils.doc_to_choice
+should_decontaminate: true
+doc_to_decontamination_query: "Question: {{question}}\nAnswer:"
+metric_list:
+  - metric: f1 
+    aggregation: !function utils.weighted_f1_score 
+    # aggregation: mean
+    average: weighted 
+    hf_evaluate: true 
+    higher_is_better: True 
+    ignore_case: true
+    ignore_punctuation: true
+    regexes_to_ignore:
+      - ","
+      - "\\$"
+  - metric: acc
+    aggregation: mean
+    higher_is_better: true
+    ignore_case: true
+    ignore_punctuation: true
+    regexes_to_ignore:
+      - ","
+      - "\\$"
+metadata:
+  version: 1.0
--- a/lm_eval/tasks/afrimmlu/translate/afrimmlu_translate_amh.yaml
+++ b/lm_eval/tasks/afrimmlu/translate/afrimmlu_translate_amh.yaml
+dataset_name: amh
+include: afrimmlu_common_translate_yaml
+task: afrimmlu_translate_amh
--- a/lm_eval/tasks/afrimmlu/translate/afrimmlu_translate_eng.yaml
+++ b/lm_eval/tasks/afrimmlu/translate/afrimmlu_translate_eng.yaml
+dataset_name: eng
+include: afrimmlu_common_translate_yaml
+task: afrimmlu_translate_eng
+
--- a/lm_eval/tasks/afrimmlu/translate/afrimmlu_translate_ewe.yaml
+++ b/lm_eval/tasks/afrimmlu/translate/afrimmlu_translate_ewe.yaml
+dataset_name: ewe
+include: afrimmlu_common_translate_yaml
+task: afrimmlu_translate_ewe
+
--- a/lm_eval/tasks/afrimmlu/translate/afrimmlu_translate_fra.yaml
+++ b/lm_eval/tasks/afrimmlu/translate/afrimmlu_translate_fra.yaml
+dataset_name: fra
+include: afrimmlu_common_translate_yaml
+task: afrimmlu_translate_fra
\ No newline at end of file
--- a/lm_eval/tasks/afrimmlu/translate/afrimmlu_translate_hau.yaml
+++ b/lm_eval/tasks/afrimmlu/translate/afrimmlu_translate_hau.yaml
+dataset_name: hau
+include: afrimmlu_common_translate_yaml
+task: afrimmlu_translate_hau
\ No newline at end of file
--- a/lm_eval/tasks/afrimmlu/translate/afrimmlu_translate_ibo.yaml
+++ b/lm_eval/tasks/afrimmlu/translate/afrimmlu_translate_ibo.yaml
+dataset_name: ibo
+include: afrimmlu_common_translate_yaml
+task: afrimmlu_translate_ibo
\ No newline at end of file
--- a/lm_eval/tasks/afrimmlu/translate/afrimmlu_translate_kin.yaml
+++ b/lm_eval/tasks/afrimmlu/translate/afrimmlu_translate_kin.yaml
+dataset_name: kin
+include: afrimmlu_common_translate_yaml
+task: afrimmlu_translate_kin
\ No newline at end of file
--- a/lm_eval/tasks/afrimmlu/translate/afrimmlu_translate_lin.yaml
+++ b/lm_eval/tasks/afrimmlu/translate/afrimmlu_translate_lin.yaml
+dataset_name: lin
+include: afrimmlu_common_translate_yaml
+task: afrimmlu_translate_lin
\ No newline at end of file
--- a/lm_eval/tasks/afrimmlu/translate/afrimmlu_translate_lug.yaml
+++ b/lm_eval/tasks/afrimmlu/translate/afrimmlu_translate_lug.yaml
+dataset_name: lug
+include: afrimmlu_common_translate_yaml
+task: afrimmlu_translate_lug
\ No newline at end of file
--- a/lm_eval/tasks/afrimmlu/translate/afrimmlu_translate_orm.yaml
+++ b/lm_eval/tasks/afrimmlu/translate/afrimmlu_translate_orm.yaml
+dataset_name: orm
+include: afrimmlu_common_translate_yaml
+task: afrimmlu_translate_orm
\ No newline at end of file
--- a/lm_eval/tasks/afrimmlu/translate/afrimmlu_translate_sna.yaml
+++ b/lm_eval/tasks/afrimmlu/translate/afrimmlu_translate_sna.yaml
+dataset_name: sna
+include: afrimmlu_common_translate_yaml
+task: afrimmlu_translate_sna
\ No newline at end of file
--- a/lm_eval/tasks/afrimmlu/translate/afrimmlu_translate_sot.yaml
+++ b/lm_eval/tasks/afrimmlu/translate/afrimmlu_translate_sot.yaml
+dataset_name: sot
+include: afrimmlu_common_translate_yaml
+task: afrimmlu_translate_sot
\ No newline at end of file
--- a/lm_eval/tasks/afrimmlu/translate/afrimmlu_translate_swa.yaml
+++ b/lm_eval/tasks/afrimmlu/translate/afrimmlu_translate_swa.yaml
+dataset_name: swa
+include: afrimmlu_common_translate_yaml
+task: afrimmlu_translate_swa
\ No newline at end of file