Unverified Commit ebbbb968 authored by Hadi Abdine, committed by GitHub

add Darija (Moroccan dialects) tasks including darijammlu, darijahellaswag and darija_bench (#2521)



* add Darija tasks

* fix multiple groups issue in darijammlu

* add MT to the description of the Darija tasks

* Update README.md

nit

* fix the recursion error caused by the darija_summarization task

* use a custom filter instead of the decorator for the strip function

---------
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
parent 5a9d5ba0
@@ -39,6 +39,9 @@
| [coqa](coqa/README.md) | Conversational question answering tasks to test dialog understanding. | English |
| [crows_pairs](crows_pairs/README.md) | Tasks designed to test model biases in various sociodemographic groups. | English, French |
| csatqa | Tasks related to SAT and other standardized testing questions for academic assessment. | Korean |
| [darija_bench](darija_bench/README.md) | Traditional NLP tasks (translation, summarization, etc.) for Moroccan Darija. | Moroccan Darija (some MT) |
| [darijahellaswag](darijahellaswag/README.md) | Moroccan Darija version of HellaSwag. | Moroccan Darija (MT) |
| [darijammlu](darijammlu/README.md) | Multiple-choice QA in Moroccan Darija (an Arabic dialect). | Moroccan Darija (MT) |
| [drop](drop/README.md) | Tasks requiring numerical reasoning, reading comprehension, and question answering. | English |
| [eq_bench](eq_bench/README.md) | Tasks focused on equality and ethics in question answering and decision-making. | English |
| [eus_exams](eus_exams/README.md) | Tasks based on various professional and academic exams in the Basque language. | Basque |
# DarijaBench
### Paper
Title: Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect
Abstract: [https://arxiv.org/abs/2409.17912](https://arxiv.org/abs/2409.17912)
DarijaBench is a comprehensive evaluation suite tailored for Moroccan Darija. It includes datasets for core NLP tasks: translation (based on four datasets: [DODa-10K](https://huggingface.co/datasets/MBZUAI-Paris/DODa-10K), [FLORES+](https://github.com/openlanguagedata/flores), [NLLB-Seed](https://github.com/openlanguagedata/seed) and [MADAR](https://sites.google.com/nyu.edu/madar/)), summarization (based on [MArSum](https://github.com/KamelGaanoun/MoroccanSummarization)) and sentiment analysis (based on five datasets: [MAC](https://github.com/LeMGarouani/MAC), [MYC](https://github.com/MouadJb/MYC), [MSAC](https://hal.science/hal-03670346/document), [MSDA](https://cc.um6p.ma/cc_datasets) and [ElectroMorocco2016](https://github.com/sentiprojects/ElecMorocco2016)), as well as a new transliteration task that converts between Darija (written in Arabic script) and Arabizi (written in Latin script), based on the [DODa-10K](https://huggingface.co/datasets/MBZUAI-Paris/DODa-10K) dataset.
Homepage: [https://huggingface.co/datasets/MBZUAI-Paris/DarijaBench](https://huggingface.co/datasets/MBZUAI-Paris/DarijaBench)
### Citation
```
@article{shang2024atlaschatadaptinglargelanguage,
title={Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect},
author={Guokan Shang and Hadi Abdine and Yousef Khoubrane and Amr Mohamed and Yassine Abbahaddou and Sofiane Ennadir and Imane Momayiz and Xuguang Ren and Eric Moulines and Preslav Nakov and Michalis Vazirgiannis and Eric Xing},
year={2024},
eprint={2409.17912},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.17912},
}
```
### Groups and Tasks
#### Groups
* `darija_sentiment`: evaluates all Darija sentiment analysis tasks.
* `darija_summarization`: evaluates the Darija summarization task.
* `darija_translation`: evaluates all Darija translation tasks.
* `darija_transliteration`: evaluates the Darija transliteration task.
#### Tasks
* `darija_sentiment_mac`: evaluates the Darija sentiment analysis task on the [MAC](https://github.com/LeMGarouani/MAC) dataset.
* `darija_sentiment_myc`: evaluates the Darija sentiment analysis task on the [MYC](https://github.com/MouadJb/MYC) dataset.
* `darija_sentiment_msac`: evaluates the Darija sentiment analysis task on the [MSAC](https://hal.science/hal-03670346/document) dataset.
* `darija_sentiment_msda`: evaluates the Darija sentiment analysis task on the [MSDA](https://cc.um6p.ma/cc_datasets) dataset.
* `darija_sentiment_electrom`: evaluates the Darija sentiment analysis task on the [ElectroMorocco2016](https://github.com/sentiprojects/ElecMorocco2016) dataset.
* `darija_summarization_task`: evaluates the Darija summarization task on the [MArSum](https://github.com/KamelGaanoun/MoroccanSummarization) corpus.
* `darija_translation_doda`: evaluates the Darija translation task on the [DODa-10K](https://huggingface.co/datasets/MBZUAI-Paris/DODa-10K) corpus.
* `darija_translation_flores`: evaluates the Darija translation task on the [FLORES+](https://github.com/openlanguagedata/flores) dataset.
* `darija_translation_madar`: evaluates the Darija translation task on the [MADAR](https://sites.google.com/nyu.edu/madar/) dataset.
* `darija_translation_seed`: evaluates the Darija translation task on the [NLLB-Seed](https://github.com/openlanguagedata/seed) dataset.
* `darija_transliteration_task`: evaluates the Darija transliteration task on the [DODa-10K](https://huggingface.co/datasets/MBZUAI-Paris/DODa-10K) corpus.
Note: depending on the model, padding and the padding side can affect the results. The default padding side in this library is forced to left; use a batch size of 1 to avoid such issues.
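As a minimal illustration (the checkpoint name is a placeholder and the entry point assumes the standard `lm_eval` Python API), the groups above can be evaluated with a batch size of 1 as follows:
```
# Minimal sketch: evaluate the DarijaBench groups with batch_size=1, as recommended above.
# The checkpoint below is a placeholder; substitute any Hugging Face causal LM.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=MBZUAI-Paris/Atlas-Chat-9B",  # placeholder model
    tasks=["darija_sentiment", "darija_summarization",
           "darija_translation", "darija_transliteration"],
    batch_size=1,  # avoids the padding-related issues noted above
)
print(results["results"])
```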
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
# DarijaBench: Sentiment Analysis
### Paper
Title: Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect
Abstract: [https://arxiv.org/abs/2409.17912](https://arxiv.org/abs/2409.17912)
DarijaBench is a comprehensive evaluation suite tailored for Moroccan Darija. It includes datasets for core NLP tasks: translation (based on four datasets: [DODa-10K](https://huggingface.co/datasets/MBZUAI-Paris/DODa-10K), [FLORES+](https://github.com/openlanguagedata/flores), [NLLB-Seed](https://github.com/openlanguagedata/seed) and [MADAR](https://sites.google.com/nyu.edu/madar/)), summarization (based on [MArSum](https://github.com/KamelGaanoun/MoroccanSummarization)) and sentiment analysis (based on five datasets: [MAC](https://github.com/LeMGarouani/MAC), [MYC](https://github.com/MouadJb/MYC), [MSAC](https://hal.science/hal-03670346/document), [MSDA](https://cc.um6p.ma/cc_datasets) and [ElectroMorocco2016](https://github.com/sentiprojects/ElecMorocco2016)), as well as a new transliteration task that converts between Darija (written in Arabic script) and Arabizi (written in Latin script), based on the [DODa-10K](https://huggingface.co/datasets/MBZUAI-Paris/DODa-10K) dataset.
Homepage: [https://huggingface.co/datasets/MBZUAI-Paris/DarijaBench](https://huggingface.co/datasets/MBZUAI-Paris/DarijaBench)
### Citation
```
@article{shang2024atlaschatadaptinglargelanguage,
title={Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect},
author={Guokan Shang and Hadi Abdine and Yousef Khoubrane and Amr Mohamed and Yassine Abbahaddou and Sofiane Ennadir and Imane Momayiz and Xuguang Ren and Eric Moulines and Preslav Nakov and Michalis Vazirgiannis and Eric Xing},
year={2024},
eprint={2409.17912},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.17912},
}
```
### Groups and Tasks
#### Groups
* `darija_sentiment`: evaluates all Darija sentiment analysis tasks.
#### Tasks
* `darija_sentiment_mac`: evaluates the Darija sentiment analysis task on the [MAC](https://github.com/LeMGarouani/MAC) dataset.
* `darija_sentiment_myc`: evaluates the Darija sentiment analysis task on the [MYC](https://github.com/MouadJb/MYC) dataset.
* `darija_sentiment_msac`: evaluates the Darija sentiment analysis task on the [MSAC](https://hal.science/hal-03670346/document) dataset.
* `darija_sentiment_msda`: evaluates the Darija sentiment analysis task on the [MSDA](https://cc.um6p.ma/cc_datasets) dataset.
* `darija_sentiment_electrom`: evaluates the Darija sentiment analysis task on the [ElectroMorocco2016](https://github.com/sentiprojects/ElecMorocco2016) dataset.
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
group: darija_sentiment
group_alias: Sentiment_Analysis
task:
- darija_sentiment_tasks
aggregate_metric_list:
- metric: acc
weight_by_size: True
metadata:
version: 0
test_split: electro_maroc
"include": "default_darija_sentiment_template_yaml"
"tag":
- "darija_sentiment_tasks"
"task": "darija_sentiment_electrom"
"task_alias": "Electro Maroc"
doc_to_choice: !function utils.doc_to_choice_2
test_split: mac
"include": "default_darija_sentiment_template_yaml"
"tag":
- "darija_sentiment_tasks"
"task": "darija_sentiment_mac"
"task_alias": "MAC"
doc_to_choice: !function utils.doc_to_choice_3
test_split: msac
"include": "default_darija_sentiment_template_yaml"
"tag":
- "darija_sentiment_tasks"
"task": "darija_sentiment_msac"
"task_alias": "MSAC"
doc_to_choice: !function utils.doc_to_choice_2
test_split: msda
"include": "default_darija_sentiment_template_yaml"
"tag":
- "darija_sentiment_tasks"
"task": "darija_sentiment_msda"
"task_alias": "MSDA"
doc_to_choice: !function utils.doc_to_choice_3
test_split: myc
"include": "default_darija_sentiment_template_yaml"
"tag":
- "darija_sentiment_tasks"
"task": "darija_sentiment_myc"
"task_alias": "MYC"
doc_to_choice: !function utils.doc_to_choice_2
dataset_path: MBZUAI-Paris/DarijaBench
output_type: multiple_choice
doc_to_text: !function utils.doc_to_text
doc_to_choice: !function utils.doc_to_choice_3
doc_to_target: !function utils.doc_to_target
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
metadata:
version: 0.0
dataset_kwargs:
trust_remote_code: true
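Each per-dataset config above only overrides `test_split`, the task name/alias, and `doc_to_choice`; everything else comes from this shared template. For inspection, the underlying records can be loaded directly from the Hub (a sketch, assuming the split names used above, such as `mac`, are exposed by the dataset's default config):
```
# Sketch: load one DarijaBench sentiment split for inspection.
# Assumes the split names used in the configs above (e.g. "mac") are available.
from datasets import load_dataset

ds = load_dataset("MBZUAI-Paris/DarijaBench", split="mac", trust_remote_code=True)
print(ds[0]["messages"])  # chat-style record: user prompt followed by the gold label
```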
# Utilities for the DarijaBench sentiment analysis tasks.

alpha = ['A', 'B', 'C']  # answer letters shown to the model
# Maps the gold label ("positive" / "negative" / "no sentiment") to its letter index.
out_dic = {"ايجابي": 1, "سلبي": 0, "ماكينش إحساس": 2}

def doc_to_text(doc):
    # Rewrite the inline "-<label>" options into an explicit A/B/C multiple-choice
    # prompt and append the answer-format instruction.
    return (doc["messages"][0]["content"]
            .replace('-سلبي', 'A. سلبي')
            .replace('-ايجابي', 'B. ايجابي')
            .replace('-ماكينش إحساس', 'C. ماكينش إحساس\nThe answer should be strictly one letter of the following: A, B, C.'))

def doc_to_choice_3(doc):
    return alpha  # three-class datasets

def doc_to_choice_2(doc):
    return alpha[:2]  # two-class datasets

def doc_to_target(doc):
    # The gold answer is the letter corresponding to the label string.
    return alpha[out_dic[doc["messages"][1]["content"]]]
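For illustration, a hypothetical record in the `messages` format (the text below is made up) is mapped by these helpers as follows:
```
# Hypothetical record (made up for illustration) in the DarijaBench "messages" format.
doc = {
    "messages": [
        {"role": "user", "content": "شنو هو الإحساس ديال هاد الجملة؟ ... -سلبي -ايجابي -ماكينش إحساس"},
        {"role": "assistant", "content": "ايجابي"},  # gold label: positive
    ]
}

print(doc_to_choice_3(doc))  # ['A', 'B', 'C']
print(doc_to_target(doc))    # 'B'  (alpha[out_dic["ايجابي"]] == alpha[1])
```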
# DarijaBench: Summarization
### Paper
Title: Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect
Abstract: [https://arxiv.org/abs/2409.17912](https://arxiv.org/abs/2409.17912)
DarijaBench is a comprehensive evaluation suite tailored for Moroccan Darija. It includes datasets for core NLP tasks: translation (based on four datasets: [DODa-10K](https://huggingface.co/datasets/MBZUAI-Paris/DODa-10K), [FLORES+](https://github.com/openlanguagedata/flores), [NLLB-Seed](https://github.com/openlanguagedata/seed) and [MADAR](https://sites.google.com/nyu.edu/madar/)), summarization (based on [MArSum](https://github.com/KamelGaanoun/MoroccanSummarization)) and sentiment analysis (based on five datasets: [MAC](https://github.com/LeMGarouani/MAC), [MYC](https://github.com/MouadJb/MYC), [MSAC](https://hal.science/hal-03670346/document), [MSDA](https://cc.um6p.ma/cc_datasets) and [ElectroMorocco2016](https://github.com/sentiprojects/ElecMorocco2016)), as well as a new transliteration task that converts between Darija (written in Arabic script) and Arabizi (written in Latin script), based on the [DODa-10K](https://huggingface.co/datasets/MBZUAI-Paris/DODa-10K) dataset.
Homepage: [https://huggingface.co/datasets/MBZUAI-Paris/DarijaBench](https://huggingface.co/datasets/MBZUAI-Paris/DarijaBench)
### Citation
```
@article{shang2024atlaschatadaptinglargelanguage,
title={Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect},
author={Guokan Shang and Hadi Abdine and Yousef Khoubrane and Amr Mohamed and Yassine Abbahaddou and Sofiane Ennadir and Imane Momayiz and Xuguang Ren and Eric Moulines and Preslav Nakov and Michalis Vazirgiannis and Eric Xing},
year={2024},
eprint={2409.17912},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.17912},
}
```
### Groups and Tasks
#### Groups
* `darija_summarization`: evaluates the Darija summarization task.
#### Tasks
* `darija_summarization_task`: evaluates the Darija summarization task on the [MArSum](https://github.com/KamelGaanoun/MoroccanSummarization) corpus.
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
"include": "summarization_common_yaml"
"task": "darija_summarization_task"
output_type: generate_until
dataset_path: MBZUAI-Paris/DarijaBench
test_split: marsum
doc_to_text: !function utils.doc_to_text
doc_to_target: !function utils.doc_to_target
metric_list:
- metric: !function utils.rouge1
- metric: !function utils.rouge2
- metric: !function utils.rougeL
- metric: !function utils.rougeLsum
- metric: !function utils.bert
- metric: chrf
generation_kwargs:
until:
- "<end_of_turn>"
- "<eos>"
- "</s>"
- "<|end_of_text|>"
- "<|eot_id|>"
- "<|endoftext|>"
do_sample: false
temperature: 0.0
max_new_tokens: 128
repeats: 1
filter_list:
- name: "STRIP_ANSWER"
filter:
- function: "custom"
filter_fn: !function utils.strip
metadata:
version: 1.0
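The `STRIP_ANSWER` entry applies the custom `strip` filter defined in `utils.py` below: it keeps only the first generated response per document and removes surrounding whitespace. A rough illustration (the responses are made up):
```
# Rough illustration of the STRIP_ANSWER custom filter (responses are made up).
# Each entry of resps is the list of generations for one document.
resps = [["  ملخص قصير للمقال  \n"], ["\nملخص آخر"]]
filtered = [r[0].strip() for r in resps]
print(filtered)  # ['ملخص قصير للمقال', 'ملخص آخر']
```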
group: darija_summarization
task:
- darija_summarization_task
metric_list:
- metric: !function utils.rouge1
aggregation: !function utils.agg_rouge1
higher_is_better: true
- metric: !function utils.rouge2
aggregation: !function utils.agg_rouge2
higher_is_better: true
- metric: !function utils.rougeL
aggregation: !function utils.agg_rougel
higher_is_better: true
- metric: !function utils.rougeLsum
aggregation: !function utils.agg_rougelsum
higher_is_better: true
- metric: !function utils.bert
aggregation: !function utils.darijabert
higher_is_better: true
- metric: chrf
aggregation: chrf
higher_is_better: true
metadata:
version: 1.0
import evaluate


def strip(resps, docs):
    """Custom filter: keep only the first generated response per document and
    strip surrounding whitespace."""
    return [r[0].strip() for r in resps]


def doc_to_text(doc):
    # Ask for a bounded summary: "summarize this passage" -> "summarize this passage in 30 words".
    return doc["messages"][0]["content"].replace("لخص هاد المقطع", "لخص هاد المقطع في ٣٠ كلمة")


def doc_to_target(doc):
    # The reference summary is the assistant turn of the record.
    return doc["messages"][1]["content"]


# The per-instance "metrics" below are identity functions: each passes its
# (prediction, reference) pair through unchanged so that the corpus-level score
# can be computed once over all pairs in the corresponding aggregation function.
def bert(items):
    return items


def Average(lst):
    return sum(lst) / len(lst)


def darijabert(items):
    # Corpus-level BERTScore computed with the Moroccan Darija BERT encoder.
    bert_model = 'SI2M-Lab/DarijaBERT'
    bert_score = evaluate.load("bertscore")
    predictions, references = zip(*items)
    bert = bert_score.compute(predictions=predictions, references=references, model_type=bert_model, num_layers=12)
    return Average(bert['f1'])


def rouge1(items):
    return items


def rougeL(items):
    return items


def rouge2(items):
    return items


def rougeLsum(items):
    return items


def agg_rougelsum(items):
    rouge = evaluate.load("rouge")
    predictions, references = zip(*items)
    return rouge.compute(predictions=predictions, references=references)["rougeLsum"]


def agg_rouge1(items):
    rouge = evaluate.load("rouge")
    predictions, references = zip(*items)
    return rouge.compute(predictions=predictions, references=references)["rouge1"]


def agg_rouge2(items):
    rouge = evaluate.load("rouge")
    predictions, references = zip(*items)
    return rouge.compute(predictions=predictions, references=references)["rouge2"]


def agg_rougel(items):
    rouge = evaluate.load("rouge")
    predictions, references = zip(*items)
    return rouge.compute(predictions=predictions, references=references)["rougeL"]
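A toy demonstration of the passthrough-metric / corpus-level-aggregation pattern used above (the prediction and reference strings are made up):
```
# Toy example: each (prediction, reference) pair passes through e.g. rouge1()
# unchanged; the aggregator then computes one corpus-level score over all pairs.
items = [
    ("the cat sat on the mat", "a cat sat on the mat"),
    ("it is raining today", "it rained today"),
]
print(agg_rouge1(items))  # corpus-level ROUGE-1 score in [0, 1]
```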
# DarijaBench: Translation
### Paper
Title: Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect
Abstract: [https://arxiv.org/abs/2409.17912](https://arxiv.org/abs/2409.17912)
DarijaBench is a comprehensive evaluation suite tailored for Moroccan Darija. It includes datasets for core NLP tasks: translation (based on four datasets: [DODa-10K](https://huggingface.co/datasets/MBZUAI-Paris/DODa-10K), [FLORES+](https://github.com/openlanguagedata/flores), [NLLB-Seed](https://github.com/openlanguagedata/seed) and [MADAR](https://sites.google.com/nyu.edu/madar/)), summarization (based on [MArSum](https://github.com/KamelGaanoun/MoroccanSummarization)) and sentiment analysis (based on five datasets: [MAC](https://github.com/LeMGarouani/MAC), [MYC](https://github.com/MouadJb/MYC), [MSAC](https://hal.science/hal-03670346/document), [MSDA](https://cc.um6p.ma/cc_datasets) and [ElectroMorocco2016](https://github.com/sentiprojects/ElecMorocco2016)), as well as a new transliteration task that converts between Darija (written in Arabic script) and Arabizi (written in Latin script), based on the [DODa-10K](https://huggingface.co/datasets/MBZUAI-Paris/DODa-10K) dataset.
Homepage: [https://huggingface.co/datasets/MBZUAI-Paris/DarijaBench](https://huggingface.co/datasets/MBZUAI-Paris/DarijaBench)
### Citation
```
@article{shang2024atlaschatadaptinglargelanguage,
title={Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect},
author={Guokan Shang and Hadi Abdine and Yousef Khoubrane and Amr Mohamed and Yassine Abbahaddou and Sofiane Ennadir and Imane Momayiz and Xuguang Ren and Eric Moulines and Preslav Nakov and Michalis Vazirgiannis and Eric Xing},
year={2024},
eprint={2409.17912},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.17912},
}
```
### Groups and Tasks
#### Groups
* `darija_translation`: evaluates all Darija translation tasks.
#### Tasks
* `darija_translation_doda`: evaluates the Darija translation task on the [DODa-10K](https://huggingface.co/datasets/MBZUAI-Paris/DODa-10K) corpus.
* `darija_translation_flores`: evaluates the Darija translation task on the [FLORES+](https://github.com/openlanguagedata/flores) dataset.
* `darija_translation_madar`: evaluates the Darija translation task on the [MADAR](https://sites.google.com/nyu.edu/madar/) dataset.
* `darija_translation_seed`: evaluates the Darija translation task on the [NLLB-Seed](https://github.com/openlanguagedata/seed) dataset.
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
test_split: doda
include:
- translation_common_yaml
- doda_common_yaml
"tag":
- "darija_translation_tasks_doda"
"task": "trasnlation_all_doda"
"task_alias": "all_doda"
metric_list:
- metric: !function utils.bert
aggregation: !function utils.mbert
higher_is_better: true
group: darija_translation_doda
group_alias: translation_doda
task:
- darija_translation_tasks_doda
metric_list:
- metric: bleu
aggregation: bleu
higher_is_better: true
- metric: chrf
aggregation: chrf
higher_is_better: true
- metric: ter
aggregation: ter
higher_is_better: false
- metric: !function utils.bert
aggregation: !function utils.mbert
higher_is_better: true
metadata:
version: 1.0
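The translation configs reference `utils.bert` and `utils.mbert`, whose definitions are not shown in this excerpt. By analogy with the `darijabert` aggregator above, a multilingual BERTScore aggregator might look like the following sketch (the encoder choice is an assumption, not the repository's actual implementation):
```
# Hypothetical sketch of utils.mbert, analogous to darijabert() above.
# The encoder (bert-base-multilingual-cased) is an assumption.
import evaluate

def mbert(items):
    bert_score = evaluate.load("bertscore")
    predictions, references = zip(*items)
    scores = bert_score.compute(
        predictions=list(predictions),
        references=list(references),
        model_type="bert-base-multilingual-cased",
    )
    return sum(scores["f1"]) / len(scores["f1"])
```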