Unverified Commit decc533d authored by Malikeh Ehghaghi's avatar Malikeh Ehghaghi Committed by GitHub
Browse files

Add Open Arabic LLM Leaderboard Benchmarks (Full and Light Version) (#2232)



* arabic leaferboard yaml file is added

* arabic toxigen is implemented

* Dataset library is imported

* arabic sciq is added

* util file of arabic toxigen is updated

* arabic race is added

* arabic piqa is implemented

* arabic open qa is added

* arabic copa is implemented

* arabic boolq ia added

* arabic arc easy is added

* arabic arc challenge is added

* arabic exams benchmark is implemented

* arabic hellaswag is added

* arabic leaderboard yaml file metrics are updated

* arabic mmlu benchmarks are added

* arabic mmlu group yaml file is updated

* alghafa benchmarks are added

* acva benchmarks are added

* acva utils.py is updated

* light version of arabic leaderboard benchmarks are added

* bugs fixed

* bug fixed

* bug fixed

* bug fixed

* bug fixed

* bug fixed

* library import bug is fixed

* doc to target updated

* bash file is deleted

* results folder is deleted

* leaderboard groups are added

* full arabic leaderboard groups are added, plus some bug fixes to the light version

* Create README.md

README.md for arabic_leaderboard_complete

* Create README.md

README.md for arabic_leaderboard_light

* Delete lm_eval/tasks/arabic_leaderboard directory

* Update README.md

* Update README.md

adding the Arabic leaderboards to the library

* Update README.md

10% of the training set

* Update README.md

10% of the training set

* revert .gitignore to prev version

* Update lm_eval/tasks/README.md
Co-authored-by: default avatarHailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* updated main README.md

* Update lm_eval/tasks/README.md

* specify machine translated benchmarks (complete)

* specify machine translated benchmarks (light version)

* add alghafa to the related task names (complete and light)

* add 'acva' to the related task names (complete and light)

* add 'arabic_leaderboard' to all the groups (complete and light)

* all dataset - not a random sample

* added more accurate details to the readme file

* added mt_mmlu from okapi

* Update lm_eval/tasks/README.md
Co-authored-by: default avatarHailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/tasks/README.md
Co-authored-by: default avatarHailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* updated mt_mmlu readme

* renaming 'alghafa' full and light

* renaming 'arabic_mmlu' light and full

* renaming 'acva' full and light

* update readme and standardize dir/file names

* running pre-commit

---------
Co-authored-by: default avatarshahrzads <sayehban@ualberta.ca>
Co-authored-by: default avatarshahrzads <56282669+shahrzads@users.noreply.github.com>
Co-authored-by: default avatarHailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
parent 543617fe
task: arabic_mt_race
dataset_path: OALL/AlGhafa-Arabic-LLM-Benchmark-Translated
dataset_name: race_ar
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
sampler: first_n
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
import datasets
import numpy as np
def process_docs(dataset: datasets.Dataset):
def _process_doc(doc):
question = doc["query"]
answer_index = int(doc["label"])
# Dynamically determining the choices by excluding '__few_shots', 'query' and 'label'
choices_keys = [
key for key in doc.keys() if key not in ["query", "label", "__few_shots"]
]
choices = [doc[key] for key in choices_keys]
instruction = "الأسئلة التالية هي أسئلة متعددة الإختيارات مع الجواب الصحيح\n\n"
query = f"{instruction}السؤال: {question}\n"
for index, choice in enumerate(choices):
query += f"{index}) {choice}\n"
query += "الإجابة:"
return {"query": query, "choices": choices, "gold": answer_index}
return dataset.map(_process_doc)
group: arabic_leaderboard_arabic_mt_sciq
task:
- arabic_mt_sciq
aggregate_metric_list:
- metric: acc
aggregation: mean
weight_by_size: true
- metric: acc_norm
aggregation: mean
weight_by_size: true
metadata:
version: 1.0
task: arabic_mt_sciq
dataset_path: OALL/AlGhafa-Arabic-LLM-Benchmark-Translated
dataset_name: sciq_ar
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
sampler: first_n
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
import random
import datasets
import numpy as np
def doc_to_text(doc):
instruction = (
"بناءً على السياق أدناه، اختر الإجابة الصحيحة للسؤال التالي من قائمة الاقتراحات"
)
support = doc["support"]
question = doc["question"]
query = f"""{instruction}
السياق:
{support}
السؤال:
{question}
الإجابات المحتملة:
"""
return query
def process_docs(dataset: datasets.Dataset):
def _process_doc(doc):
correct_answer = doc["correct_answer"]
choices = [
doc["distractor1"],
doc["distractor2"],
doc["distractor3"],
correct_answer,
]
# Shuffle the choices
random.shuffle(choices)
answer_index = choices.index(correct_answer)
return {"query": doc_to_text(doc), "choices": choices, "gold": answer_index}
return dataset.map(_process_doc)
group: arabic_leaderboard_arabic_mt_toxigen
task:
- arabic_mt_toxigen
aggregate_metric_list:
- metric: acc
aggregation: mean
weight_by_size: true
- metric: acc_norm
aggregation: mean
weight_by_size: true
metadata:
version: 1.0
task: arabic_mt_toxigen
dataset_path: OALL/AlGhafa-Arabic-LLM-Benchmark-Translated
dataset_name: toxigen_ar
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
sampler: first_n
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
import datasets
import numpy as np
def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
def _process_doc(doc):
text = doc["text"]
label = 1 if ((doc["toxicity_ai"] + doc["toxicity_human"]) > 5.5) else 0
instruction = 'هل العبارة التالية تحض على الكراهية؟ قم بالرد بـ"نعم" أو "لا".'
query = f"""{instruction}
العبارة:
'{text}'
الإجابة:
"""
out_doc = {
"query": query,
"choices": ["لا", "نعم"],
"gold": label,
}
return out_doc
return dataset.map(_process_doc)
group: arabic_leaderboard_acva
task:
- arabic_leaderboard_acva_Algeria
- arabic_leaderboard_acva_Ancient_Egypt
- arabic_leaderboard_acva_Arab_Empire
- arabic_leaderboard_acva_Arabic_Architecture
- arabic_leaderboard_acva_Arabic_Art
- arabic_leaderboard_acva_Arabic_Astronomy
- arabic_leaderboard_acva_Arabic_Calligraphy
- arabic_leaderboard_acva_Arabic_Ceremony
- arabic_leaderboard_acva_Arabic_Clothing
- arabic_leaderboard_acva_Arabic_Culture
- arabic_leaderboard_acva_Arabic_Food
- arabic_leaderboard_acva_Arabic_Funeral
- arabic_leaderboard_acva_Arabic_Geography
- arabic_leaderboard_acva_Arabic_History
- arabic_leaderboard_acva_Arabic_Language_Origin
- arabic_leaderboard_acva_Arabic_Literature
- arabic_leaderboard_acva_Arabic_Math
- arabic_leaderboard_acva_Arabic_Medicine
- arabic_leaderboard_acva_Arabic_Music
- arabic_leaderboard_acva_Arabic_Ornament
- arabic_leaderboard_acva_Arabic_Philosophy
- arabic_leaderboard_acva_Arabic_Physics_and_Chemistry
- arabic_leaderboard_acva_Arabic_Wedding
- arabic_leaderboard_acva_Bahrain
- arabic_leaderboard_acva_Comoros
- arabic_leaderboard_acva_Egypt_modern
- arabic_leaderboard_acva_InfluenceFromAncientEgypt
- arabic_leaderboard_acva_InfluenceFromByzantium
- arabic_leaderboard_acva_InfluenceFromChina
- arabic_leaderboard_acva_InfluenceFromGreece
- arabic_leaderboard_acva_InfluenceFromIslam
- arabic_leaderboard_acva_InfluenceFromPersia
- arabic_leaderboard_acva_InfluenceFromRome
- arabic_leaderboard_acva_Iraq
- arabic_leaderboard_acva_Islam_Education
- arabic_leaderboard_acva_Islam_branches_and_schools
- arabic_leaderboard_acva_Islamic_law_system
- arabic_leaderboard_acva_Jordan
- arabic_leaderboard_acva_Kuwait
- arabic_leaderboard_acva_Lebanon
- arabic_leaderboard_acva_Libya
- arabic_leaderboard_acva_Mauritania
- arabic_leaderboard_acva_Mesopotamia_civilization
- arabic_leaderboard_acva_Morocco
- arabic_leaderboard_acva_Oman
- arabic_leaderboard_acva_Palestine
- arabic_leaderboard_acva_Qatar
- arabic_leaderboard_acva_Saudi_Arabia
- arabic_leaderboard_acva_Somalia
- arabic_leaderboard_acva_Sudan
- arabic_leaderboard_acva_Syria
- arabic_leaderboard_acva_Tunisia
- arabic_leaderboard_acva_United_Arab_Emirates
- arabic_leaderboard_acva_Yemen
- arabic_leaderboard_acva_communication
- arabic_leaderboard_acva_computer_and_phone
- arabic_leaderboard_acva_daily_life
- arabic_leaderboard_acva_entertainment
aggregate_metric_list:
- metric: acc
aggregation: mean
weight_by_size: true
- metric: acc_norm
aggregation: mean
weight_by_size: true
metadata:
version: 1.0
task: arabic_leaderboard_acva_Algeria
dataset_path: OALL/ACVA
dataset_name: Algeria
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
sampler: first_n
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
task: arabic_leaderboard_acva_Ancient_Egypt
dataset_path: OALL/ACVA
dataset_name: Ancient_Egypt
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
sampler: first_n
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
task: arabic_leaderboard_acva_Arab_Empire
dataset_path: OALL/ACVA
dataset_name: Arab_Empire
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
sampler: first_n
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
task: arabic_leaderboard_acva_Arabic_Architecture
dataset_path: OALL/ACVA
dataset_name: Arabic_Architecture
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
sampler: first_n
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
task: arabic_leaderboard_acva_Arabic_Art
dataset_path: OALL/ACVA
dataset_name: Arabic_Art
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
sampler: first_n
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
task: arabic_leaderboard_acva_Arabic_Astronomy
dataset_path: OALL/ACVA
dataset_name: Arabic_Astronomy
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
sampler: first_n
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
task: arabic_leaderboard_acva_Arabic_Calligraphy
dataset_path: OALL/ACVA
dataset_name: Arabic_Calligraphy
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
sampler: first_n
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
task: arabic_leaderboard_acva_Arabic_Ceremony
dataset_path: OALL/ACVA
dataset_name: Arabic_Ceremony
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
sampler: first_n
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
task: arabic_leaderboard_acva_Arabic_Clothing
dataset_path: OALL/ACVA
dataset_name: Arabic_Clothing
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
sampler: first_n
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
task: arabic_leaderboard_acva_Arabic_Culture
dataset_path: OALL/ACVA
dataset_name: Arabic_Culture
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
sampler: first_n
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
task: arabic_leaderboard_acva_Arabic_Food
dataset_path: OALL/ACVA
dataset_name: Arabic_Food
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
sampler: first_n
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment