Unverified Commit decc533d authored by Malikeh Ehghaghi, committed by GitHub

Add Open Arabic LLM Leaderboard Benchmarks (Full and Light Version) (#2232)



* arabic leaderboard yaml file is added

* arabic toxigen is implemented

* Dataset library is imported

* arabic sciq is added

* util file of arabic toxigen is updated

* arabic race is added

* arabic piqa is implemented

* arabic open qa is added

* arabic copa is implemented

* arabic boolq is added

* arabic arc easy is added

* arabic arc challenge is added

* arabic exams benchmark is implemented

* arabic hellaswag is added

* arabic leaderboard yaml file metrics are updated

* arabic mmlu benchmarks are added

* arabic mmlu group yaml file is updated

* alghafa benchmarks are added

* acva benchmarks are added

* acva utils.py is updated

* light version of arabic leaderboard benchmarks is added

* bugs fixed

* bug fixed

* bug fixed

* bug fixed

* bug fixed

* bug fixed

* library import bug is fixed

* doc to target updated

* bash file is deleted

* results folder is deleted

* leaderboard groups are added

* full arabic leaderboard groups are added, plus some bug fixes to the light version

* Create README.md

README.md for arabic_leaderboard_complete

* Create README.md

README.md for arabic_leaderboard_light

* Delete lm_eval/tasks/arabic_leaderboard directory

* Update README.md

* Update README.md

adding the Arabic leaderboards to the library

* Update README.md

10% of the training set

* Update README.md

10% of the training set

* revert .gitignore to prev version

* Update lm_eval/tasks/README.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* updated main README.md

* Update lm_eval/tasks/README.md

* specify machine translated benchmarks (complete)

* specify machine translated benchmarks (light version)

* add alghafa to the related task names (complete and light)

* add 'acva' to the related task names (complete and light)

* add 'arabic_leaderboard' to all the groups (complete and light)

* all dataset - not a random sample

* added more accurate details to the readme file

* added mt_mmlu from okapi

* Update lm_eval/tasks/README.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/tasks/README.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* updated mt_mmlu readme

* renaming 'alghafa' full and light

* renaming 'arabic_mmlu' light and full

* renaming 'acva' full and light

* update readme and standardize dir/file names

* running pre-commit

---------
Co-authored-by: shahrzads <sayehban@ualberta.ca>
Co-authored-by: shahrzads <56282669+shahrzads@users.noreply.github.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
parent 543617fe
@@ -11,6 +11,8 @@
| [aexams](aexams/README.md) | Tasks in Arabic related to various academic exams covering a range of subjects. | Arabic |
| [agieval](agieval/README.md) | Tasks involving historical data or questions related to history and historical texts. | English, Chinese |
| [anli](anli/README.md) | Adversarial natural language inference tasks designed to test model robustness. | English |
| [arabic_leaderboard_complete](arabic_leaderboard_complete/README.md) | The full version of the tasks in the Open Arabic LLM Leaderboard, focused on evaluating models on Arabic language understanding and comprehension, culture, and heritage. Note that some of these tasks are machine-translated. | Arabic (Some MT) |
| [arabic_leaderboard_light](arabic_leaderboard_light/README.md) | A light version of the tasks in the Open Arabic LLM Leaderboard (i.e., a 10% sample of the test set of each original benchmark), focused on evaluating models on Arabic language understanding and comprehension, culture, and heritage. Note that some of these tasks are machine-translated. | Arabic (Some MT) |
| [arabicmmlu](arabicmmlu/README.md) | Localized Arabic version of MMLU with multiple-choice questions from 40 subjects. | Arabic |
| [arc](arc/README.md) | Tasks involving complex reasoning over a diverse set of questions. | English |
| [arithmetic](arithmetic/README.md) | Tasks involving numerical computations and arithmetic reasoning. | English |
# Arabic Leaderboard

Title: Open Arabic LLM Leaderboard

The Open Arabic LLM Leaderboard evaluates language models on a large number of evaluation tasks that reflect the characteristics of the Arabic language and culture.
The benchmark uses several datasets, most of them translated into Arabic and validated by native Arabic speakers. It also includes benchmarks adopted from other papers as well as benchmarks prepared from scratch natively for Arabic.

Homepage: https://huggingface.co/spaces/OALL/Open-Arabic-LLM-Leaderboard
### Citation
```
@misc{OALL,
author = {Elfilali, Ali and Alobeidli, Hamza and Fourrier, Clémentine and Boussaha, Basma El Amel and Cojocaru, Ruxandra and Habib, Nathan and Hacid, Hakim},
title = {Open Arabic LLM Leaderboard},
year = {2024},
publisher = {OALL},
howpublished = "\url{https://huggingface.co/spaces/OALL/Open-Arabic-LLM-Leaderboard}"
}
@inproceedings{almazrouei-etal-2023-alghafa,
title = "{A}l{G}hafa Evaluation Benchmark for {A}rabic Language Models",
author = "Almazrouei, Ebtesam and
Cojocaru, Ruxandra and
Baldo, Michele and
Malartic, Quentin and
Alobeidli, Hamza and
Mazzotta, Daniele and
Penedo, Guilherme and
Campesan, Giulia and
Farooq, Mugariya and
Alhammadi, Maitha and
Launay, Julien and
Noune, Badreddine",
editor = "Sawaf, Hassan and
El-Beltagy, Samhaa and
Zaghouani, Wajdi and
Magdy, Walid and
Abdelali, Ahmed and
Tomeh, Nadi and
Abu Farha, Ibrahim and
Habash, Nizar and
Khalifa, Salam and
Keleg, Amr and
Haddad, Hatem and
Zitouni, Imed and
Mrini, Khalil and
Almatham, Rawan",
booktitle = "Proceedings of ArabicNLP 2023",
month = dec,
year = "2023",
address = "Singapore (Hybrid)",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.arabicnlp-1.21",
doi = "10.18653/v1/2023.arabicnlp-1.21",
pages = "244--275",
abstract = "Recent advances in the space of Arabic large language models have opened up a wealth of potential practical applications. From optimal training strategies, large scale data acquisition and continuously increasing NLP resources, the Arabic LLM landscape has improved in a very short span of time, despite being plagued by training data scarcity and limited evaluation resources compared to English. In line with contributing towards this ever-growing field, we introduce AlGhafa, a new multiple-choice evaluation benchmark for Arabic LLMs. For showcasing purposes, we train a new suite of models, including a 14 billion parameter model, the largest monolingual Arabic decoder-only model to date. We use a collection of publicly available datasets, as well as a newly introduced HandMade dataset consisting of 8 billion tokens. Finally, we explore the quantitative and qualitative toxicity of several Arabic models, comparing our models to existing public Arabic LLMs.",
}
@misc{huang2023acegpt,
title={AceGPT, Localizing Large Language Models in Arabic},
author={Huang Huang and Fei Yu and Jianqing Zhu and Xuening Sun and Hao Cheng and Dingjie Song and Zhihong Chen and Abdulmohsen Alharthi and Bang An and Ziche Liu and Zhiyi Zhang and Junying Chen and Jianquan Li and Benyou Wang and Lian Zhang and Ruoyu Sun and Xiang Wan and Haizhou Li and Jinchao Xu},
year={2023},
eprint={2309.12053},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@misc{lighteval,
author = {Fourrier, Clémentine and Habib, Nathan and Wolf, Thomas and Tunstall, Lewis},
title = {LightEval: A lightweight framework for LLM evaluation},
year = {2023},
version = {0.3.0},
url = {https://github.com/huggingface/lighteval}
}
```
### Groups and Tasks
* `arabic_leaderboard_alghafa`: A multiple-choice evaluation benchmark for zero- and few-shot evaluation of Arabic LLMs, prepared from scratch natively for Arabic.
  * Paper: https://aclanthology.org/2023.arabicnlp-1.21.pdf
  * You can find the list of the tasks as follows:
    * `arabic_leaderboard_alghafa_mcq_exams_test_ar`
    * `arabic_leaderboard_alghafa_meta_ar_dialects`
    * `arabic_leaderboard_alghafa_meta_ar_msa`
    * `arabic_leaderboard_alghafa_multiple_choice_facts_truefalse_balanced_task`
    * `arabic_leaderboard_alghafa_multiple_choice_grounded_statement_soqal_task`
    * `arabic_leaderboard_alghafa_multiple_choice_grounded_statement_xglue_mlqa_task`
    * `arabic_leaderboard_alghafa_multiple_choice_rating_sentiment_no_neutral_task`
    * `arabic_leaderboard_alghafa_multiple_choice_rating_sentiment_task`
    * `arabic_leaderboard_alghafa_multiple_choice_sentiment_task`
* `arabic_leaderboard_arabic_exams`: A question answering benchmark built from high school examinations in different school subjects, requiring knowledge and reasoning across multiple domains.
  * Paper: https://aclanthology.org/2020.emnlp-main.438.pdf
* `arabic_leaderboard_arabic_mmlu`: A multi-task language understanding benchmark for the Arabic language, sourced from school exams across diverse educational levels in different countries, prepared with native speakers in the region. The data comprises multiple-choice questions in 40 tasks.
  * Paper: https://arxiv.org/pdf/2402.12840
  * You can find the list of the tasks as follows:
    * `arabic_leaderboard_arabic_mmlu_abstract_algebra`
    * `arabic_leaderboard_arabic_mmlu_anatomy`
    * `arabic_leaderboard_arabic_mmlu_astronomy`
    * `arabic_leaderboard_arabic_mmlu_business_ethics`
    * `arabic_leaderboard_arabic_mmlu_clinical_knowledge`
    * `arabic_leaderboard_arabic_mmlu_college_biology`
    * `arabic_leaderboard_arabic_mmlu_college_chemistry`
    * `arabic_leaderboard_arabic_mmlu_college_computer_science`
    * `arabic_leaderboard_arabic_mmlu_college_mathematics`
    * `arabic_leaderboard_arabic_mmlu_college_medicine`
    * `arabic_leaderboard_arabic_mmlu_college_physics`
    * `arabic_leaderboard_arabic_mmlu_computer_security`
    * `arabic_leaderboard_arabic_mmlu_conceptual_physics`
    * `arabic_leaderboard_arabic_mmlu_econometrics`
    * `arabic_leaderboard_arabic_mmlu_electrical_engineering`
    * `arabic_leaderboard_arabic_mmlu_elementary_mathematics`
    * `arabic_leaderboard_arabic_mmlu_formal_logic`
    * `arabic_leaderboard_arabic_mmlu_global_facts`
    * `arabic_leaderboard_arabic_mmlu_high_school_biology`
    * `arabic_leaderboard_arabic_mmlu_high_school_chemistry`
    * `arabic_leaderboard_arabic_mmlu_high_school_computer_science`
    * `arabic_leaderboard_arabic_mmlu_high_school_european_history`
    * `arabic_leaderboard_arabic_mmlu_high_school_geography`
    * `arabic_leaderboard_arabic_mmlu_high_school_government_and_politics`
    * `arabic_leaderboard_arabic_mmlu_high_school_macroeconomics`
    * `arabic_leaderboard_arabic_mmlu_high_school_mathematics`
    * `arabic_leaderboard_arabic_mmlu_high_school_microeconomics`
    * `arabic_leaderboard_arabic_mmlu_high_school_physics`
    * `arabic_leaderboard_arabic_mmlu_high_school_psychology`
    * `arabic_leaderboard_arabic_mmlu_high_school_statistics`
    * `arabic_leaderboard_arabic_mmlu_high_school_us_history`
    * `arabic_leaderboard_arabic_mmlu_high_school_world_history`
    * `arabic_leaderboard_arabic_mmlu_human_aging`
    * `arabic_leaderboard_arabic_mmlu_human_sexuality`
    * `arabic_leaderboard_arabic_mmlu_international_law`
    * `arabic_leaderboard_arabic_mmlu_jurisprudence`
    * `arabic_leaderboard_arabic_mmlu_logical_fallacies`
    * `arabic_leaderboard_arabic_mmlu_machine_learning`
    * `arabic_leaderboard_arabic_mmlu_management`
    * `arabic_leaderboard_arabic_mmlu_marketing`
    * `arabic_leaderboard_arabic_mmlu_medical_genetics`
    * `arabic_leaderboard_arabic_mmlu_miscellaneous`
    * `arabic_leaderboard_arabic_mmlu_moral_disputes`
    * `arabic_leaderboard_arabic_mmlu_moral_scenarios`
    * `arabic_leaderboard_arabic_mmlu_nutrition`
    * `arabic_leaderboard_arabic_mmlu_philosophy`
    * `arabic_leaderboard_arabic_mmlu_prehistory`
    * `arabic_leaderboard_arabic_mmlu_professional_accounting`
    * `arabic_leaderboard_arabic_mmlu_professional_law`
    * `arabic_leaderboard_arabic_mmlu_professional_medicine`
    * `arabic_leaderboard_arabic_mmlu_professional_psychology`
    * `arabic_leaderboard_arabic_mmlu_public_relations`
    * `arabic_leaderboard_arabic_mmlu_security_studies`
    * `arabic_leaderboard_arabic_mmlu_sociology`
    * `arabic_leaderboard_arabic_mmlu_us_foreign_policy`
    * `arabic_leaderboard_arabic_mmlu_virology`
    * `arabic_leaderboard_arabic_mmlu_world_religions`
* `arabic_leaderboard_arabic_mt_arc_challenge`: The AI2 Reasoning Challenge (ARC) is a multiple-choice question task. The dataset contains only natural, grade-school science questions, written for human tests. The challenge set contains only questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. (machine-translated benchmark - part of the AlGhafa Arabic translated LLM benchmark)
  * Paper: https://aclanthology.org/2023.arabicnlp-1.21.pdf
* `arabic_leaderboard_arabic_mt_arc_easy`: This dataset is the same as `arabic_leaderboard_arabic_mt_arc_challenge`, except that its questions are not from the challenge set.
  * Paper: https://aclanthology.org/2023.arabicnlp-1.21.pdf
* `arabic_leaderboard_arabic_mt_boolq`: A true/false question dataset whose examples contain a passage, a question, and the answer (i.e., true/false). (machine-translated benchmark - part of the AlGhafa Arabic translated LLM benchmark)
  * Paper: https://aclanthology.org/2023.arabicnlp-1.21.pdf
* `arabic_leaderboard_arabic_mt_copa`: Choice Of Plausible Alternatives (COPA) is a multiple-choice question dataset that involves open-domain commonsense causal reasoning. (machine-translated benchmark - part of the AlGhafa Arabic translated LLM benchmark)
  * Paper: https://aclanthology.org/2023.arabicnlp-1.21.pdf
* `arabic_leaderboard_arabic_mt_hellaswag`: The task is to choose the most plausible continuation from a set of candidate sentence completions. It combines reading comprehension and information retrieval challenges, testing models on basic knowledge (i.e., from 3rd to 9th grade) and commonsense inference. (machine-translated benchmark - part of the AlGhafa Arabic translated LLM benchmark)
  * Paper: https://aclanthology.org/2023.arabicnlp-1.21.pdf
* `arabic_leaderboard_arabic_mt_mmlu`: A multiple-choice question answering dataset covering various branches of knowledge, including the humanities, social sciences, hard sciences, and other areas. The examples in the English dataset are translated into Arabic using ChatGPT with a translation prompt.
  * Paper: https://aclanthology.org/2023.arabicnlp-1.21.pdf
* `arabic_leaderboard_arabic_mt_openbook_qa`: A multiple-choice open-book question answering dataset that requires external knowledge and reasoning. The open book that accompanies these questions is based on elementary-level science facts. (machine-translated benchmark - part of the AlGhafa Arabic translated LLM benchmark)
  * Paper: https://aclanthology.org/2023.arabicnlp-1.21.pdf
* `arabic_leaderboard_arabic_mt_piqa`: Physical Interaction Question Answering (PIQA) is a multiple-choice question answering task based on physical commonsense reasoning. (machine-translated benchmark - part of the AlGhafa Arabic translated LLM benchmark)
  * Paper: https://aclanthology.org/2023.arabicnlp-1.21.pdf
* `arabic_leaderboard_arabic_mt_race`: A multiple-choice question dataset assessing reading comprehension, built from English exams in China designed for middle school and high school students. (machine-translated benchmark - part of the AlGhafa Arabic translated LLM benchmark)
  * Paper: https://aclanthology.org/2023.arabicnlp-1.21.pdf
* `arabic_leaderboard_arabic_mt_sciq`: A multiple-choice science question answering task assessing understanding of scientific concepts in physics, chemistry, and biology. (machine-translated benchmark - part of the AlGhafa Arabic translated LLM benchmark)
  * Paper: https://aclanthology.org/2023.arabicnlp-1.21.pdf
* `arabic_leaderboard_arabic_mt_toxigen`: This benchmark evaluates a model's ability to classify input text as hateful or not hateful. (machine-translated benchmark - part of the AlGhafa Arabic translated LLM benchmark)
  * Paper: https://aclanthology.org/2023.arabicnlp-1.21.pdf
* `arabic_leaderboard_acva`: Arabic-Culture-Value-Alignment (ACVA) is a yes/no question dataset generated by GPT-3.5 Turbo from Arabic topics to assess model alignment with Arabic values and cultures.
  * Paper: https://arxiv.org/pdf/2309.12053
  * You can find the list of the tasks as follows:
    - `arabic_leaderboard_acva_Algeria`
    - `arabic_leaderboard_acva_Ancient_Egypt`
    - `arabic_leaderboard_acva_Arab_Empire`
    - `arabic_leaderboard_acva_Arabic_Architecture`
    - `arabic_leaderboard_acva_Arabic_Art`
    - `arabic_leaderboard_acva_Arabic_Astronomy`
    - `arabic_leaderboard_acva_Arabic_Calligraphy`
    - `arabic_leaderboard_acva_Arabic_Ceremony`
    - `arabic_leaderboard_acva_Arabic_Clothing`
    - `arabic_leaderboard_acva_Arabic_Culture`
    - `arabic_leaderboard_acva_Arabic_Food`
    - `arabic_leaderboard_acva_Arabic_Funeral`
    - `arabic_leaderboard_acva_Arabic_Geography`
    - `arabic_leaderboard_acva_Arabic_History`
    - `arabic_leaderboard_acva_Arabic_Language_Origin`
    - `arabic_leaderboard_acva_Arabic_Literature`
    - `arabic_leaderboard_acva_Arabic_Math`
    - `arabic_leaderboard_acva_Arabic_Medicine`
    - `arabic_leaderboard_acva_Arabic_Music`
    - `arabic_leaderboard_acva_Arabic_Ornament`
    - `arabic_leaderboard_acva_Arabic_Philosophy`
    - `arabic_leaderboard_acva_Arabic_Physics_and_Chemistry`
    - `arabic_leaderboard_acva_Arabic_Wedding`
    - `arabic_leaderboard_acva_Bahrain`
    - `arabic_leaderboard_acva_Comoros`
    - `arabic_leaderboard_acva_Egypt_modern`
    - `arabic_leaderboard_acva_InfluenceFromAncientEgypt`
    - `arabic_leaderboard_acva_InfluenceFromByzantium`
    - `arabic_leaderboard_acva_InfluenceFromChina`
    - `arabic_leaderboard_acva_InfluenceFromGreece`
    - `arabic_leaderboard_acva_InfluenceFromIslam`
    - `arabic_leaderboard_acva_InfluenceFromPersia`
    - `arabic_leaderboard_acva_InfluenceFromRome`
    - `arabic_leaderboard_acva_Iraq`
    - `arabic_leaderboard_acva_Islam_Education`
    - `arabic_leaderboard_acva_Islam_branches_and_schools`
    - `arabic_leaderboard_acva_Islamic_law_system`
    - `arabic_leaderboard_acva_Jordan`
    - `arabic_leaderboard_acva_Kuwait`
    - `arabic_leaderboard_acva_Lebanon`
    - `arabic_leaderboard_acva_Libya`
    - `arabic_leaderboard_acva_Mauritania`
    - `arabic_leaderboard_acva_Mesopotamia_civilization`
    - `arabic_leaderboard_acva_Morocco`
    - `arabic_leaderboard_acva_Oman`
    - `arabic_leaderboard_acva_Palestine`
    - `arabic_leaderboard_acva_Qatar`
    - `arabic_leaderboard_acva_Saudi_Arabia`
    - `arabic_leaderboard_acva_Somalia`
    - `arabic_leaderboard_acva_Sudan`
    - `arabic_leaderboard_acva_Syria`
    - `arabic_leaderboard_acva_Tunisia`
    - `arabic_leaderboard_acva_United_Arab_Emirates`
    - `arabic_leaderboard_acva_Yemen`
    - `arabic_leaderboard_acva_communication`
    - `arabic_leaderboard_acva_computer_and_phone`
    - `arabic_leaderboard_acva_daily_life`
    - `arabic_leaderboard_acva_entertainment`
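
For example, a minimal sketch of running one of the groups above through the harness's Python API (the checkpoint name is a placeholder, and `arabic_leaderboard_alghafa` stands in for whichever group you want to run):

```python
# Minimal sketch: evaluate a Hugging Face model on one leaderboard group.
# "your-org/your-arabic-model" is a placeholder checkpoint, not a recommendation.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=your-org/your-arabic-model",
    tasks=["arabic_leaderboard_alghafa"],
    num_fewshot=5,
)
print(results["results"])  # per-task and group-level acc / acc_norm
```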
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
  * [ ] Have you referenced the original paper that introduced the task?
  * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
group: arabic_leaderboard_alghafa
task:
  - arabic_leaderboard_alghafa_mcq_exams_test_ar
  - arabic_leaderboard_alghafa_meta_ar_dialects
  - arabic_leaderboard_alghafa_meta_ar_msa
  - arabic_leaderboard_alghafa_multiple_choice_facts_truefalse_balanced_task
  - arabic_leaderboard_alghafa_multiple_choice_grounded_statement_soqal_task
  - arabic_leaderboard_alghafa_multiple_choice_grounded_statement_xglue_mlqa_task
  - arabic_leaderboard_alghafa_multiple_choice_rating_sentiment_no_neutral_task
  - arabic_leaderboard_alghafa_multiple_choice_rating_sentiment_task
  - arabic_leaderboard_alghafa_multiple_choice_sentiment_task
aggregate_metric_list:
  - metric: acc
    aggregation: mean
    weight_by_size: true
  - metric: acc_norm
    aggregation: mean
    weight_by_size: true
metadata:
  version: 1.0
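
Note that `weight_by_size: true` makes the group score a size-weighted (micro) average of the subtask scores rather than a uniform mean. A toy illustration with invented numbers:

```python
# Toy illustration of size-weighted aggregation; accuracies/sizes are invented.
subtask_acc = {"task_a": 0.80, "task_b": 0.50}
subtask_size = {"task_a": 1000, "task_b": 100}

unweighted = sum(subtask_acc.values()) / len(subtask_acc)  # 0.65
weighted = sum(subtask_acc[t] * subtask_size[t] for t in subtask_acc) / sum(
    subtask_size.values()
)  # ~0.773: the larger subtask dominates
print(unweighted, weighted)
```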
task: arabic_leaderboard_alghafa_mcq_exams_test_ar
dataset_path: OALL/AlGhafa-Arabic-LLM-Benchmark-Native
dataset_name: mcq_exams_test_ar
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
  sampler: first_n
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0

task: arabic_leaderboard_alghafa_meta_ar_dialects
dataset_path: OALL/AlGhafa-Arabic-LLM-Benchmark-Native
dataset_name: meta_ar_dialects
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
  sampler: first_n
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0

task: arabic_leaderboard_alghafa_meta_ar_msa
dataset_path: OALL/AlGhafa-Arabic-LLM-Benchmark-Native
dataset_name: meta_ar_msa
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
  sampler: first_n
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0

task: arabic_leaderboard_alghafa_multiple_choice_facts_truefalse_balanced_task
dataset_path: OALL/AlGhafa-Arabic-LLM-Benchmark-Native
dataset_name: multiple_choice_facts_truefalse_balanced_task
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
  sampler: first_n
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0

task: arabic_leaderboard_alghafa_multiple_choice_grounded_statement_soqal_task
dataset_path: OALL/AlGhafa-Arabic-LLM-Benchmark-Native
dataset_name: multiple_choice_grounded_statement_soqal_task
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
  sampler: first_n
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0

task: arabic_leaderboard_alghafa_multiple_choice_grounded_statement_xglue_mlqa_task
dataset_path: OALL/AlGhafa-Arabic-LLM-Benchmark-Native
dataset_name: multiple_choice_grounded_statement_xglue_mlqa_task
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
  sampler: first_n
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0

task: arabic_leaderboard_alghafa_multiple_choice_rating_sentiment_no_neutral_task
dataset_path: OALL/AlGhafa-Arabic-LLM-Benchmark-Native
dataset_name: multiple_choice_rating_sentiment_no_neutral_task
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
  sampler: first_n
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0

task: arabic_leaderboard_alghafa_multiple_choice_rating_sentiment_task
dataset_path: OALL/AlGhafa-Arabic-LLM-Benchmark-Native
dataset_name: multiple_choice_rating_sentiment_task
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
  sampler: first_n
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0

task: arabic_leaderboard_alghafa_multiple_choice_sentiment_task
dataset_path: OALL/AlGhafa-Arabic-LLM-Benchmark-Native
dataset_name: multiple_choice_sentiment_task
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
  sampler: first_n
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
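
The `doc_to_text` and `doc_to_target` fields in the configs above are Jinja2 templates rendered against each processed document (while `doc_to_choice: "choices"` names a document column). A simplified sketch of that rendering step; the real harness adds its own template environment and filters:

```python
# Simplified sketch of how "{{query}}" / "{{gold}}" are rendered per document.
from jinja2 import Template

doc = {  # shaped like the output of utils.process_docs below
    "query": "السؤال: ...\n0) ...\n1) ...\nالإجابة:",
    "choices": ["...", "..."],
    "gold": 1,
}
prompt = Template("{{query}}").render(**doc)
target = Template("{{gold}}").render(**doc)
print(prompt)
print(target)  # "1"
```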
import datasets
import numpy as np


def process_docs(dataset: datasets.Dataset):
    def _process_doc(doc):
        question = doc["query"]
        answer_index = int(doc["label"])
        # Dynamically determine the choices by excluding '__few_shots', 'query' and 'label'
        choices_keys = [
            key for key in doc.keys() if key not in ["query", "label", "__few_shots"]
        ]
        choices = [doc[key] for key in choices_keys]

        instruction = "الأسئلة التالية هي أسئلة متعددة الإختيارات مع الجواب الصحيح\n\n"
        query = f"{instruction}السؤال: {question}\n"
        for index, choice in enumerate(choices):
            query += f"{index}) {choice}\n"
        query += "الإجابة:"

        return {"query": query, "choices": choices, "gold": answer_index}

    return dataset.map(_process_doc)
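
A quick way to sanity-check this transform is to run it on a toy `datasets.Dataset`. The option columns below (`sol1`, `sol2`) are hypothetical stand-ins for the answer columns of an AlGhafa subset; anything other than `query`, `label`, and `__few_shots` is treated as a choice:

```python
# Toy check of process_docs (defined above); columns sol1/sol2 are hypothetical.
import datasets

toy = datasets.Dataset.from_list(
    [{"query": "ما عاصمة فرنسا؟", "label": "0", "sol1": "باريس", "sol2": "لندن"}]
)
processed = process_docs(toy)
print(processed[0]["query"])    # instruction + question + numbered options + "الإجابة:"
print(processed[0]["choices"])  # ["باريس", "لندن"]
print(processed[0]["gold"])     # 0
```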
task: arabic_exams
dataset_path: OALL/Arabic_EXAMS
dataset_name: default
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: validation
fewshot_config:
  sampler: first_n
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
group: arabic_leaderboard_arabic_exams
task:
  - arabic_exams
aggregate_metric_list:
  - metric: acc
    aggregation: mean
    weight_by_size: true
  - metric: acc_norm
    aggregation: mean
    weight_by_size: true
metadata:
  version: 1.0
import datasets
import numpy as np


# fmt: off
LETTER_INDICES_AR = ["أ", "ب", "ج", "د", "هـ", "و", "ز", "ح", "ط", "ي", "ك", "ل", "م", "ن", "س", "ع", "ف", "ص", "ق", "ر", "ش", "ت", "ث", "خ", "ذ", "ض", "ظ", "غ"]
# fmt: on

# fmt: off
LETTER_INDICES = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z"]
# fmt: on


def process_docs(dataset: datasets.Dataset):
    def _process_doc(doc):
        topic = doc["subject"]
        question = doc["question"]
        choices = [doc["A"], doc["B"], doc["C"], doc["D"]]
        choices_formatted = [
            f" {LETTER_INDICES_AR[i]}) {choice}\n" for i, choice in enumerate(choices)
        ]
        answer = doc["answer"]
        answer_index = LETTER_INDICES.index(answer)

        instruction = f"الأسئلة التالية هي أسئلة متعددة الإختيارات مع الجواب الصحيح حول {topic.replace('_', ' ')}. \n\n"
        query = f"{instruction}السؤال: {question}\n"
        query += "\n".join(choices_formatted)
        query += "\nالإجابة:"

        return {"query": query, "choices": LETTER_INDICES_AR[:4], "gold": answer_index}

    return dataset.map(_process_doc)
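
Since this variant hard-codes its expected columns (`subject`, `question`, `A`-`D`, `answer`), a single invented row is enough to see the prompt it builds; the gold label is the index of the answer letter:

```python
# Toy check of the MMLU-style process_docs (defined above); row content is invented.
import datasets

toy = datasets.Dataset.from_list(
    [
        {
            "subject": "abstract_algebra",
            "question": "كم عدد عناصر الزمرة Z_5؟",
            "A": "3",
            "B": "4",
            "C": "5",
            "D": "6",
            "answer": "C",
        }
    ]
)
processed = process_docs(toy)
print(processed[0]["query"])    # subject-aware instruction + lettered options
print(processed[0]["choices"])  # the first four Arabic letters
print(processed[0]["gold"])     # 2
```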
group: arabic_leaderboard_arabic_mmlu
task:
  - arabic_leaderboard_arabic_mmlu_abstract_algebra
  - arabic_leaderboard_arabic_mmlu_anatomy
  - arabic_leaderboard_arabic_mmlu_astronomy
  - arabic_leaderboard_arabic_mmlu_business_ethics
  - arabic_leaderboard_arabic_mmlu_clinical_knowledge
  - arabic_leaderboard_arabic_mmlu_college_biology
  - arabic_leaderboard_arabic_mmlu_college_chemistry
  - arabic_leaderboard_arabic_mmlu_college_computer_science
  - arabic_leaderboard_arabic_mmlu_college_mathematics
  - arabic_leaderboard_arabic_mmlu_college_medicine
  - arabic_leaderboard_arabic_mmlu_college_physics
  - arabic_leaderboard_arabic_mmlu_computer_security
  - arabic_leaderboard_arabic_mmlu_conceptual_physics
  - arabic_leaderboard_arabic_mmlu_econometrics
  - arabic_leaderboard_arabic_mmlu_electrical_engineering
  - arabic_leaderboard_arabic_mmlu_elementary_mathematics
  - arabic_leaderboard_arabic_mmlu_formal_logic
  - arabic_leaderboard_arabic_mmlu_global_facts
  - arabic_leaderboard_arabic_mmlu_high_school_biology
  - arabic_leaderboard_arabic_mmlu_high_school_chemistry
  - arabic_leaderboard_arabic_mmlu_high_school_computer_science
  - arabic_leaderboard_arabic_mmlu_high_school_european_history
  - arabic_leaderboard_arabic_mmlu_high_school_geography
  - arabic_leaderboard_arabic_mmlu_high_school_government_and_politics
  - arabic_leaderboard_arabic_mmlu_high_school_macroeconomics
  - arabic_leaderboard_arabic_mmlu_high_school_mathematics
  - arabic_leaderboard_arabic_mmlu_high_school_microeconomics
  - arabic_leaderboard_arabic_mmlu_high_school_physics
  - arabic_leaderboard_arabic_mmlu_high_school_psychology
  - arabic_leaderboard_arabic_mmlu_high_school_statistics
  - arabic_leaderboard_arabic_mmlu_high_school_us_history
  - arabic_leaderboard_arabic_mmlu_high_school_world_history
  - arabic_leaderboard_arabic_mmlu_human_aging
  - arabic_leaderboard_arabic_mmlu_human_sexuality
  - arabic_leaderboard_arabic_mmlu_international_law
  - arabic_leaderboard_arabic_mmlu_jurisprudence
  - arabic_leaderboard_arabic_mmlu_logical_fallacies
  - arabic_leaderboard_arabic_mmlu_machine_learning
  - arabic_leaderboard_arabic_mmlu_management
  - arabic_leaderboard_arabic_mmlu_marketing
  - arabic_leaderboard_arabic_mmlu_medical_genetics
  - arabic_leaderboard_arabic_mmlu_miscellaneous
  - arabic_leaderboard_arabic_mmlu_moral_disputes
  - arabic_leaderboard_arabic_mmlu_moral_scenarios
  - arabic_leaderboard_arabic_mmlu_nutrition
  - arabic_leaderboard_arabic_mmlu_philosophy
  - arabic_leaderboard_arabic_mmlu_prehistory
  - arabic_leaderboard_arabic_mmlu_professional_accounting
  - arabic_leaderboard_arabic_mmlu_professional_law
  - arabic_leaderboard_arabic_mmlu_professional_medicine
  - arabic_leaderboard_arabic_mmlu_professional_psychology
  - arabic_leaderboard_arabic_mmlu_public_relations
  - arabic_leaderboard_arabic_mmlu_security_studies
  - arabic_leaderboard_arabic_mmlu_sociology
  - arabic_leaderboard_arabic_mmlu_us_foreign_policy
  - arabic_leaderboard_arabic_mmlu_virology
  - arabic_leaderboard_arabic_mmlu_world_religions
aggregate_metric_list:
  - metric: acc
    aggregation: mean
    weight_by_size: true
  - metric: acc_norm
    aggregation: mean
    weight_by_size: true
metadata:
  version: 1.0
task: arabic_leaderboard_arabic_mmlu_abstract_algebra
dataset_path: OALL/Arabic_MMLU
dataset_name: abstract_algebra
output_type: multiple_choice
training_split: null
validation_split: dev
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: dev
fewshot_config:
  sampler: first_n
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0

task: arabic_leaderboard_arabic_mmlu_anatomy
dataset_path: OALL/Arabic_MMLU
dataset_name: anatomy
output_type: multiple_choice
training_split: null
validation_split: dev
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: dev
fewshot_config:
  sampler: first_n
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0

task: arabic_leaderboard_arabic_mmlu_astronomy
dataset_path: OALL/Arabic_MMLU
dataset_name: astronomy
output_type: multiple_choice
training_split: null
validation_split: dev
test_split: test
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "choices"
fewshot_split: dev
fewshot_config:
  sampler: first_n
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0