# IrokoBench

### Paper

IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models
https://arxiv.org/pdf/2406.03368

IrokoBench is a human-translated benchmark dataset for 16 typologically diverse 
low-resource African languages covering three tasks: natural language inference (AfriXNLI), 
mathematical reasoning (AfriMGSM), and multi-choice knowledge-based QA (AfriMMLU).


### Citation

```
@misc{adelani2024irokobenchnewbenchmarkafrican,
      title={IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models}, 
      author={David Ifeoluwa Adelani and Jessica Ojo and Israel Abebe Azime and Jian Yun Zhuang and Jesujoba O. Alabi and Xuanli He and Millicent Ochieng and Sara Hooker and Andiswa Bukula and En-Shiun Annie Lee and Chiamaka Chukwuneke and Happy Buzaaba and Blessing Sibanda and Godson Kalipe and Jonathan Mukiibi and Salomon Kabongo and Foutse Yuehgoh and Mmasibidi Setaka and Lolwethu Ndolela and Nkiruka Odu and Rooweither Mabuya and Shamsuddeen Hassan Muhammad and Salomey Osei and Sokhar Samb and Tadesse Kebede Guge and Pontus Stenetorp},
      year={2024},
      eprint={2406.03368},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.03368}, 
}
```

### Groups and Tasks

#### Groups

* `afrixnli`: all AfriXNLI tasks
* `afrixnli_en_direct`: evaluates model performance using the anli prompt in English on the curated dataset
* `afrixnli_native_direct`: evaluates model performance using the anli prompt translated into the respective languages on the curated dataset
* `afrixnli_translate`: evaluates model performance using the anli prompt in the translate-test setting
* `afrixnli_manual_direct`: evaluates model performance using Lai's prompt on the curated dataset
* `afrixnli_manual_translate`: evaluates model performance using Lai's prompt in the translate-test setting
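
Any of these group names can be passed straight to the evaluation harness. A minimal sketch using the `lm_eval` Python API (assuming lm-evaluation-harness v0.4+; the model checkpoint is purely illustrative):

```python
# Minimal sketch: run an IrokoBench group through lm-evaluation-harness.
# Assumes lm-evaluation-harness v0.4+ is installed; the checkpoint below
# is a hypothetical example, not prescribed by this README.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                               # Hugging Face transformers backend
    model_args="pretrained=google/gemma-2b",  # hypothetical checkpoint
    tasks=["afrixnli_en_direct"],             # any group name listed above
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])  # per-task metrics, keyed by task name
```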

#### Tasks
* `afrixnli_en_direct_{language_code}`: evaluates one language with the anli prompt in English
* `afrixnli_native_direct_{language_code}`: evaluates one language with the anli prompt translated into that language
* `afrixnli_translate_{language_code}`: evaluates one language with the anli prompt in the translate-test setting
* `afrixnli_manual_direct_{language_code}`: evaluates one language with Lai's prompt
* `afrixnli_manual_translate_{language_code}`: evaluates one language with Lai's prompt in the translate-test setting
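
Per-language tasks follow the same pattern, with the `{language_code}` placeholder expanding to one code per covered language. A hedged sketch (the three codes below are illustrative examples, not the full list):

```python
# Sketch: expand the {language_code} placeholder into concrete task names
# and evaluate them together. Codes shown (Hausa, Swahili, Yoruba) are
# examples only; consult the task configs for the full set.
import lm_eval

language_codes = ["hau", "swa", "yor"]  # assumed subset of IrokoBench codes
tasks = [f"afrixnli_en_direct_{code}" for code in language_codes]

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=google/gemma-2b",  # hypothetical checkpoint
    tasks=tasks,
)
for name in tasks:
    print(name, results["results"][name])  # per-language metrics
```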

### Checklist

For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
  * [x] Have you referenced the original paper that introduced the task?
  * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?

If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?
  * [x] Checked for equivalence with v0.3.0 LM Evaluation Harness