Commit c1e63555 authored by Yu Shi Jie

Merge branch 'upstream' into 'mmlu-pro'

add tokenizer logs info (#1731)

See merge request shijie.yu/lm-evaluation-harness!4
parents e361687c 42dc2448
import ast

from sklearn.metrics import f1_score


def doc_to_choice(doc):
    # "choices" is stored as a stringified Python list; parse it safely
    # with ast.literal_eval rather than eval.
    choices = ast.literal_eval(doc["choices"])
    return choices


def doc_to_text(doc):
    output = """You are a highly knowledgeable and intelligent artificial intelligence
model that answers multiple-choice questions about '{subject}'.
Question: '''{question}'''
Choices:
A: '''{choice1}'''
B: '''{choice2}'''
C: '''{choice3}'''
D: '''{choice4}'''
Answer: """
    choices = ast.literal_eval(doc["choices"])
    text = output.format(
        subject=doc["subject"],
        question=doc["question"],
        choice1=choices[0],
        choice2=choices[1],
        choice3=choices[2],
        choice4=choices[3],
    )
    return text


def weighted_f1_score(items):
    # items is a list of (gold, prediction) pairs; unzip and score.
    golds, preds = zip(*items)
    fscore = f1_score(golds, preds, average="weighted")
    return fscore
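As a quick sanity check of the helpers above, here is a toy round trip with hypothetical inputs (the relevant pieces of `utils.py` are inlined so the snippet is self-contained, using `ast.literal_eval` as a safer stand-in for `eval`):

```python
import ast

from sklearn.metrics import f1_score


# Inlined from utils.py: parse the stringified choice list.
def doc_to_choice(doc):
    return ast.literal_eval(doc["choices"])


# Inlined from utils.py: weighted F1 over accumulated (gold, pred) pairs.
def weighted_f1_score(items):
    golds, preds = zip(*items)
    return f1_score(golds, preds, average="weighted")


doc = {"choices": "['True', 'Neither', 'False']"}
print(doc_to_choice(doc))  # ['True', 'Neither', 'False']

# Perfect predictions give a weighted F1 of 1.0.
items = [("True", "True"), ("False", "False"), ("Neither", "Neither")]
print(weighted_f1_score(items))
```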
# IrokoBench
### Paper
IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models
https://arxiv.org/pdf/2406.03368
IrokoBench is a human-translated benchmark dataset for 16 typologically diverse
low-resource African languages covering three tasks: natural language inference (AfriXNLI),
mathematical reasoning (AfriMGSM), and multiple-choice knowledge-based question answering (AfriMMLU).
### Citation
```
@misc{adelani2024irokobenchnewbenchmarkafrican,
title={IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models},
author={David Ifeoluwa Adelani and Jessica Ojo and Israel Abebe Azime and Jian Yun Zhuang and Jesujoba O. Alabi and Xuanli He and Millicent Ochieng and Sara Hooker and Andiswa Bukula and En-Shiun Annie Lee and Chiamaka Chukwuneke and Happy Buzaaba and Blessing Sibanda and Godson Kalipe and Jonathan Mukiibi and Salomon Kabongo and Foutse Yuehgoh and Mmasibidi Setaka and Lolwethu Ndolela and Nkiruka Odu and Rooweither Mabuya and Shamsuddeen Hassan Muhammad and Salomey Osei and Sokhar Samb and Tadesse Kebede Guge and Pontus Stenetorp},
year={2024},
eprint={2406.03368},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2406.03368},
}
```
### Groups and Tasks
#### Groups
* `afrixnli`: All afrixnli tasks
* `afrixnli_en_direct`: evaluates model performance using the ANLI prompt on the curated dataset
* `afrixnli_native_direct`: evaluates model performance using the ANLI prompt translated into the
respective languages on the curated dataset
* `afrixnli_translate`: evaluates models using the ANLI prompt in the translate-test setting
* `afrixnli_manual_direct`: evaluates model performance using Lai's prompt on the curated dataset
* `afrixnli_manual_translate`: evaluates models using Lai's prompt in the translate-test setting
#### Tasks
* `afrixnli_en_direct_{language_code}`: each task evaluates for one language
* `afrixnli_native_direct_{language_code}`: each task evaluates for one language
* `afrixnli_translate_{language_code}`: each task evaluates for one language
* `afrixnli_manual_direct_{language_code}`: each task evaluates for one language
* `afrixnli_manual_translate_{language_code}`: each task evaluates for one language
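Any of the groups or per-language tasks above can be passed to the harness CLI by name. A sketch of a zero-shot run on the Swahili English-direct task (the model name is only illustrative):

```shell
# Zero-shot evaluation of a small Hugging Face model on AfriXNLI (Swahili,
# English direct prompt). Swap in any task or group name from the lists above.
lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks afrixnli_en_direct_swa \
    --num_fewshot 0
```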
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?
* [x] Checked for equivalence with v0.3.0 LM Evaluation Harness
# Generated by utils.py
dataset_name: amh
include: afrixnli_en_direct_yaml
task: afrixnli_en_direct_amh
# Generated by utils.py
dataset_name: eng
include: afrixnli_en_direct_yaml
task: afrixnli_en_direct_eng
# Generated by utils.py
dataset_name: ewe
include: afrixnli_en_direct_yaml
task: afrixnli_en_direct_ewe
# Generated by utils.py
dataset_name: fra
include: afrixnli_en_direct_yaml
task: afrixnli_en_direct_fra
# Generated by utils.py
dataset_name: hau
include: afrixnli_en_direct_yaml
task: afrixnli_en_direct_hau
# Generated by utils.py
dataset_name: ibo
include: afrixnli_en_direct_yaml
task: afrixnli_en_direct_ibo
# Generated by utils.py
dataset_name: kin
include: afrixnli_en_direct_yaml
task: afrixnli_en_direct_kin
# Generated by utils.py
dataset_name: lin
include: afrixnli_en_direct_yaml
task: afrixnli_en_direct_lin
# Generated by utils.py
dataset_name: lug
include: afrixnli_en_direct_yaml
task: afrixnli_en_direct_lug
# Generated by utils.py
dataset_name: orm
include: afrixnli_en_direct_yaml
task: afrixnli_en_direct_orm
# Generated by utils.py
dataset_name: sna
include: afrixnli_en_direct_yaml
task: afrixnli_en_direct_sna
# Generated by utils.py
dataset_name: sot
include: afrixnli_en_direct_yaml
task: afrixnli_en_direct_sot
# Generated by utils.py
dataset_name: swa
include: afrixnli_en_direct_yaml
task: afrixnli_en_direct_swa
# Generated by utils.py
dataset_name: twi
include: afrixnli_en_direct_yaml
task: afrixnli_en_direct_twi
# Generated by utils.py
dataset_name: wol
include: afrixnli_en_direct_yaml
task: afrixnli_en_direct_wol
# Generated by utils.py
dataset_name: xho
include: afrixnli_en_direct_yaml
task: afrixnli_en_direct_xho
group:
  - xnli
  - afrixnli
  - afrixnli_en_direct
dataset_path: masakhane/afrixnli
dataset_name: null
output_type: multiple_choice
validation_split: validation
test_split: test
fewshot_split: validation
doc_to_text: "{{premise}}\nQuestion: {{hypothesis}} True, False, or Neither?\nAnswer:"
# True = entailment
# False = contradiction
# Neither = neutral
doc_to_target: !function utils.doc_to_target
doc_to_choice:
  - "True"
  - "Neither"
  - "False"
should_decontaminate: true
doc_to_decontamination_query: premise
metric_list:
  - metric: f1
    aggregation: !function utils.weighted_f1_score
    average: weighted
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true
  - metric: acc
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true
metadata:
  version: 1.0
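The config references `utils.doc_to_target`, which is not shown in this excerpt. Under the XNLI-style integer label convention (0 = entailment, 1 = neutral, 2 = contradiction) and the choice order above (`"True"`, `"Neither"`, `"False"`), the label index already lines up with the correct choice, so a hypothetical minimal sketch of such a helper could be:

```python
# Hypothetical sketch only -- the real utils.doc_to_target is defined in the
# task's utils.py and may differ. With choices ordered
# ["True", "Neither", "False"], the XNLI label maps directly:
#   0 -> "True"    (entailment)
#   1 -> "Neither" (neutral)
#   2 -> "False"   (contradiction)
def doc_to_target(doc):
    return int(doc["label"])


print(doc_to_target({"label": 2}))  # 2, i.e. "False" / contradiction
```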
# Generated by utils.py
dataset_name: yor
include: afrixnli_en_direct_yaml
task: afrixnli_en_direct_yor