Unverified Commit decc533d authored by Malikeh Ehghaghi's avatar Malikeh Ehghaghi Committed by GitHub
Browse files

Add Open Arabic LLM Leaderboard Benchmarks (Full and Light Version) (#2232)



* arabic leaferboard yaml file is added

* arabic toxigen is implemented

* Dataset library is imported

* arabic sciq is added

* util file of arabic toxigen is updated

* arabic race is added

* arabic piqa is implemented

* arabic open qa is added

* arabic copa is implemented

* arabic boolq ia added

* arabic arc easy is added

* arabic arc challenge is added

* arabic exams benchmark is implemented

* arabic hellaswag is added

* arabic leaderboard yaml file metrics are updated

* arabic mmlu benchmarks are added

* arabic mmlu group yaml file is updated

* alghafa benchmarks are added

* acva benchmarks are added

* acva utils.py is updated

* light version of arabic leaderboard benchmarks are added

* bugs fixed

* bug fixed

* bug fixed

* bug fixed

* bug fixed

* bug fixed

* library import bug is fixed

* doc to target updated

* bash file is deleted

* results folder is deleted

* leaderboard groups are added

* full arabic leaderboard groups are added, plus some bug fixes to the light version

* Create README.md

README.md for arabic_leaderboard_complete

* Create README.md

README.md for arabic_leaderboard_light

* Delete lm_eval/tasks/arabic_leaderboard directory

* Update README.md

* Update README.md

adding the Arabic leaderboards to the library

* Update README.md

10% of the training set

* Update README.md

10% of the training set

* revert .gitignore to prev version

* Update lm_eval/tasks/README.md
Co-authored-by: default avatarHailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* updated main README.md

* Update lm_eval/tasks/README.md

* specify machine translated benchmarks (complete)

* specify machine translated benchmarks (light version)

* add alghafa to the related task names (complete and light)

* add 'acva' to the related task names (complete and light)

* add 'arabic_leaderboard' to all the groups (complete and light)

* all dataset - not a random sample

* added more accurate details to the readme file

* added mt_mmlu from okapi

* Update lm_eval/tasks/README.md
Co-authored-by: default avatarHailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/tasks/README.md
Co-authored-by: default avatarHailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* updated mt_mmlu readme

* renaming 'alghafa' full and light

* renaming 'arabic_mmlu' light and full

* renaming 'acva' full and light

* update readme and standardize dir/file names

* running pre-commit

---------
Co-authored-by: default avatarshahrzads <sayehban@ualberta.ca>
Co-authored-by: default avatarshahrzads <56282669+shahrzads@users.noreply.github.com>
Co-authored-by: default avatarHailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
parent 543617fe
task: arabic_leaderboard_acva_Tunisia
dataset_path: OALL/ACVA
dataset_name: Tunisia
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
sampler: first_n
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
task: arabic_leaderboard_acva_United_Arab_Emirates
dataset_path: OALL/ACVA
dataset_name: United_Arab_Emirates
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
sampler: first_n
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
task: arabic_leaderboard_acva_Yemen
dataset_path: OALL/ACVA
dataset_name: Yemen
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
sampler: first_n
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
task: arabic_leaderboard_acva_communication
dataset_path: OALL/ACVA
dataset_name: communication
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
sampler: first_n
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
task: arabic_leaderboard_acva_computer_and_phone
dataset_path: OALL/ACVA
dataset_name: computer_and_phone
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
sampler: first_n
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
task: arabic_leaderboard_acva_daily_life
dataset_path: OALL/ACVA
dataset_name: daily_life
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
sampler: first_n
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
task: arabic_leaderboard_acva_entertainment
dataset_path: OALL/ACVA
dataset_name: entertainment
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
sampler: first_n
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
import datasets
import numpy as np
def process_docs(dataset: datasets.Dataset):
def _process_doc(doc):
question = doc["question"]
answer = doc["answer"]
return {
"query": f"السؤال: {question}\nالإجابة:",
"choices": ["صح", "خطأ"],
"gold": ["صح", "خطأ"].index(answer),
}
return dataset.map(_process_doc)
group: arabic_leaderboard_complete
task:
- arabic_leaderboard_acva
- arabic_leaderboard_alghafa
- arabic_leaderboard_arabic_exams
- arabic_leaderboard_arabic_mt_arc_challenge
- arabic_leaderboard_arabic_mt_arc_easy
- arabic_leaderboard_arabic_mt_boolq
- arabic_leaderboard_arabic_mt_hellaswag
- arabic_leaderboard_arabic_mt_mmlu
- arabic_leaderboard_arabic_mt_copa
- arabic_leaderboard_arabic_mt_openbook_qa
- arabic_leaderboard_arabic_mt_piqa
- arabic_leaderboard_arabic_mt_race
- arabic_leaderboard_arabic_mt_sciq
- arabic_leaderboard_arabic_mt_toxigen
aggregate_metric_list:
- metric: acc
aggregation: mean
weight_by_size: true
- metric: acc_norm
aggregation: mean
weight_by_size: true
metadata:
version: 1.0
# Arabic Leaderboard Light
Title: Open Arabic LLM Leaderboard Light
This leaderboard follows all the details as in [`arabic_leaderboard_complete`](../arabic_leaderboard_complete), except that a light version - 10% random sample of the test set of each benchmark - is used to test the language models.
NOTE: In ACVA benchmark, there is Yemen subset, and it is a small dataset - it has only 10 samples in the test split. So, for this specific subset dataset, to have more reliable results, we consider the original dataset, instead of 10% of its test samples.
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
group: arabic_leaderboard_alghafa_light
task:
- arabic_leaderboard_alghafa_mcq_exams_test_ar_light
- arabic_leaderboard_alghafa_meta_ar_dialects_light
- arabic_leaderboard_alghafa_meta_ar_msa_light
- arabic_leaderboard_alghafa_multiple_choice_facts_truefalse_balanced_task_light
- arabic_leaderboard_alghafa_multiple_choice_grounded_statement_soqal_task_light
- arabic_leaderboard_alghafa_multiple_choice_grounded_statement_xglue_mlqa_task_light
- arabic_leaderboard_alghafa_multiple_choice_rating_sentiment_no_neutral_task_light
- arabic_leaderboard_alghafa_multiple_choice_rating_sentiment_task_light
- arabic_leaderboard_alghafa_multiple_choice_sentiment_task_light
aggregate_metric_list:
- metric: acc
aggregation: mean
weight_by_size: true
- metric: acc_norm
aggregation: mean
weight_by_size: true
metadata:
version: 1.0
task: arabic_leaderboard_alghafa_mcq_exams_test_ar_light
dataset_path: arcee-globe/AlGhafa-Arabic-LLM-Benchmark-Native-10percent
dataset_name: mcq_exams_test_ar
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
sampler: first_n
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
task: arabic_leaderboard_alghafa_meta_ar_dialects_light
dataset_path: arcee-globe/AlGhafa-Arabic-LLM-Benchmark-Native-10percent
dataset_name: meta_ar_dialects
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
sampler: first_n
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
task: arabic_leaderboard_alghafa_meta_ar_msa_light
dataset_path: arcee-globe/AlGhafa-Arabic-LLM-Benchmark-Native-10percent
dataset_name: meta_ar_msa
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
sampler: first_n
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
task: arabic_leaderboard_alghafa_multiple_choice_facts_truefalse_balanced_task_light
dataset_path: arcee-globe/AlGhafa-Arabic-LLM-Benchmark-Native-10percent
dataset_name: multiple_choice_facts_truefalse_balanced_task
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
sampler: first_n
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
task: arabic_leaderboard_alghafa_multiple_choice_grounded_statement_soqal_task_light
dataset_path: arcee-globe/AlGhafa-Arabic-LLM-Benchmark-Native-10percent
dataset_name: multiple_choice_grounded_statement_soqal_task
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
sampler: first_n
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
task: arabic_leaderboard_alghafa_multiple_choice_grounded_statement_xglue_mlqa_task_light
dataset_path: arcee-globe/AlGhafa-Arabic-LLM-Benchmark-Native-10percent
dataset_name: multiple_choice_grounded_statement_xglue_mlqa_task
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
sampler: first_n
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
task: arabic_leaderboard_alghafa_multiple_choice_rating_sentiment_no_neutral_task_light
dataset_path: arcee-globe/AlGhafa-Arabic-LLM-Benchmark-Native-10percent
dataset_name: multiple_choice_rating_sentiment_no_neutral_task
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
sampler: first_n
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
task: arabic_leaderboard_alghafa_multiple_choice_rating_sentiment_task_light
dataset_path: arcee-globe/AlGhafa-Arabic-LLM-Benchmark-Native-10percent
dataset_name: multiple_choice_rating_sentiment_task
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
sampler: first_n
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
task: arabic_leaderboard_alghafa_multiple_choice_sentiment_task_light
dataset_path: arcee-globe/AlGhafa-Arabic-LLM-Benchmark-Native-10percent
dataset_name: multiple_choice_sentiment_task
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
sampler: first_n
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment