Unverified commit ebbbb968 authored by Hadi Abdine, committed by GitHub

add Darija (Moroccan dialects) tasks, including darijammlu, darijahellaswag, and darija_bench (#2521)



* add Darija tasks

* fix multiple groups issue in darijammlu

* add MT to the description of the Darija tasks

* Update README.md

nit

* fix the recursion error caused by the darija_summarization task

* use a custom filter instead of the decorator for the strip function

---------
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
parent 5a9d5ba0
test_split: seed
include:
  - translation_common_yaml
  - seed_common_yaml
"tag":
  - "darija_translation_tasks_seed"
"task": "translation_all_seed"
"task_alias": "all_seed"
metric_list:
  - metric: !function utils.bert
    aggregation: !function utils.mbert
    higher_is_better: true
group: darija_translation_seed
group_alias: translation_seed
task:
  - darija_translation_tasks_seed
metric_list:
  - metric: bleu
    aggregation: bleu
    higher_is_better: true
  - metric: chrf
    aggregation: chrf
    higher_is_better: true
  - metric: ter
    aggregation: ter
    higher_is_better: false
  - metric: !function utils.bert
    aggregation: !function utils.mbert
    higher_is_better: true
metadata:
  version: 1.0
"process_docs": !function utils.dr_en
include:
- translation_common_yaml
- seed_common_yaml
"tag":
- "darija_translation_tasks_seed"
"task": "trasnlation_dr_en_seed"
"task_alias": "dr_en_seed"
metric_list:
- metric: !function utils.bert
aggregation: !function utils.bertbase
higher_is_better: true
"process_docs": !function utils.en_dr
include:
- translation_common_yaml
- seed_common_yaml
"tag":
- "darija_translation_tasks_seed"
"task": "trasnlation_en_dr_seed"
"task_alias": "en_dr_seed"
metric_list:
- metric: !function utils.bert
aggregation: !function utils.darijabert
higher_is_better: true
output_type: generate_until
dataset_path: MBZUAI-Paris/DarijaBench
doc_to_text: !function utils.doc_to_text
doc_to_target: !function utils.doc_to_target
metric_list:
  - metric: bleu
  - metric: ter
  - metric: chrf
  - metric: !function utils.bert
generation_kwargs:
  until:
    - "<end_of_turn>"
    - "<eos>"
    - "</s>"
    - "<|end_of_text|>"
    - "<|eot_id|>"
    - "<|endoftext|>"
  do_sample: false
  temperature: 0.0
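# Post-processing before scoring: utils.strip keeps only the first response per
# instance and trims surrounding whitespace (a custom filter rather than the
# decorator-based registration).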
filter_list:
  - name: "STRIP_ANSWER"
    filter:
      - function: "custom"
        filter_fn: !function utils.strip
repeats: 1
metadata:
  version: 1.0
group: darija_translation
group_alias: translation
task:
  - darija_translation_doda
  - darija_translation_flores
  - darija_translation_madar
  - darija_translation_seed
metric_list:
  - metric: bleu
    aggregation: bleu
    higher_is_better: true
  - metric: chrf
    aggregation: chrf
    higher_is_better: true
  - metric: ter
    aggregation: ter
    higher_is_better: false
  - metric: !function utils.bert
    aggregation: !function utils.mbert
    higher_is_better: true
metadata:
  version: 1.0
import evaluate
import datasets


def strip(resps, docs):
    """
    Each entry of `resps` is a list of model responses; keep only the first
    response and strip its surrounding whitespace.
    """
    return map(lambda r: r[0].strip(), resps)


# Per-direction dataset filters: `direction` encodes <source>_<target> with
# dr = Darija, fr = French, en = English, msa = Modern Standard Arabic.
def dr_fr(dataset: datasets.Dataset):
    return dataset.filter(lambda x: x["direction"] == "dr_fr")


def dr_en(dataset: datasets.Dataset):
    return dataset.filter(lambda x: x["direction"] == "dr_en")


def dr_msa(dataset: datasets.Dataset):
    return dataset.filter(lambda x: x["direction"] == "dr_msa")


def fr_dr(dataset: datasets.Dataset):
    return dataset.filter(lambda x: x["direction"] == "fr_dr")


def en_dr(dataset: datasets.Dataset):
    return dataset.filter(lambda x: x["direction"] == "en_dr")


def msa_dr(dataset: datasets.Dataset):
    return dataset.filter(lambda x: x["direction"] == "msa_dr")
prompt_templates = {
    "fr_dr": "ترجم من الفرنساوية للدارجة:\n{0}",
    "dr_fr": "ترجم من الدارجة للفرنساوية:\n{0}",
    "en_dr": "ترجم من الإنجليزية للدارجة:\n{0}",
    "dr_en": "ترجم من الدارجة للإنجليزية:\n{0}",
    "msa_dr": "ترجم من الفصحى للدارجة:\n{0}",
    "dr_msa": "ترجم من الدارجة للفصحى:\n{0}",
}
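# The templates above instruct the model, in Darija, to translate between
# French (fr), English (en), Modern Standard Arabic (msa), and Darija (dr);
# "{0}" is filled with the source sentence. Illustrative use:
#   prompt_templates["dr_en"].format(src)  # builds a Darija-to-English prompt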
def doc_to_text(doc):
    return doc["messages"][0]["content"]


def doc_to_target(doc):
    return doc["messages"][1]["content"]


def bert(items):
    # Pass (prediction, reference) pairs through unchanged; scoring happens in
    # the corpus-level aggregation functions below.
    return items


def Average(lst):
    return sum(lst) / len(lst)
def _bertscore_f1(items, model_type):
    """Mean BERTScore F1 over (prediction, reference) pairs with the given encoder."""
    predictions, references = zip(*items)
    scores = evaluate.load("bertscore").compute(
        predictions=predictions,
        references=references,
        model_type=model_type,
        num_layers=12,
    )
    return Average(scores["f1"])


def camembert(items):
    return _bertscore_f1(items, "almanach/camembert-base")


def darijabert(items):
    return _bertscore_f1(items, "SI2M-Lab/DarijaBERT")


def arabert(items):
    return _bertscore_f1(items, "aubmindlab/bert-base-arabert")


def bertbase(items):
    return _bertscore_f1(items, "google-bert/bert-base-uncased")


def mbert(items):
    return _bertscore_f1(items, "google-bert/bert-base-multilingual-cased")
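# Minimal smoke test (illustrative only; downloads the mBERT scorer on first run):
if __name__ == "__main__":
    pairs = [("Hello there.", "Hello there."), ("Good morning.", "Hi.")]
    print(mbert(pairs))  # mean BERTScore F1 over two (prediction, reference) pairs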
# DarijaBench: Transliteration
### Paper
Title: Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect
Abstract: [https://arxiv.org/abs/2409.17912](https://arxiv.org/abs/2409.17912)
DarijaBench is a comprehensive evaluation dataset tailored for Moroccan Darija. It includes datasets for core NLP tasks: translation (based on four datasets: [DODa-10K](https://huggingface.co/datasets/MBZUAI-Paris/DODa-10K), [FLORES+](https://github.com/openlanguagedata/flores), [NLLB-Seed](https://github.com/openlanguagedata/seed), and [MADAR](https://sites.google.com/nyu.edu/madar/)), summarization (based on [MArSum](https://github.com/KamelGaanoun/MoroccanSummarization)), and sentiment analysis (based on five datasets: [MAC](https://github.com/LeMGarouani/MAC), [MYC](https://github.com/MouadJb/MYC), [MSAC](https://hal.science/hal-03670346/document), [MSDA](https://cc.um6p.ma/cc_datasets), and [ElectroMorocco2016](https://github.com/sentiprojects/ElecMorocco2016)), in addition to a new transliteration task, based on the [DODa-10K](https://huggingface.co/datasets/MBZUAI-Paris/DODa-10K) dataset, for converting between Darija (written in Arabic script) and Arabizi (written in Latin script).
Homepage: [https://huggingface.co/datasets/MBZUAI-Paris/DarijaBench](https://huggingface.co/datasets/MBZUAI-Paris/DarijaBench)
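Examples are stored as chat-style `messages` pairs tagged with a `direction` field; a minimal sketch for peeking at the transliteration split (split and field names follow the task configs in this directory):
```python
from datasets import load_dataset

ds = load_dataset("MBZUAI-Paris/DarijaBench", split="transliteration")
doc = ds[0]
print(doc["direction"])               # e.g. "dr_ar" or "ar_dr"
print(doc["messages"][0]["content"])  # prompt shown to the model
print(doc["messages"][1]["content"])  # reference output
```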
### Citation
```
@article{shang2024atlaschatadaptinglargelanguage,
title={Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect},
author={Guokan Shang and Hadi Abdine and Yousef Khoubrane and Amr Mohamed and Yassine Abbahaddou and Sofiane Ennadir and Imane Momayiz and Xuguang Ren and Eric Moulines and Preslav Nakov and Michalis Vazirgiannis and Eric Xing},
year={2024},
eprint={2409.17912},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.17912},
}
```
### Groups and Tasks
#### Groups
* `darija_transliteration`: evaluates the Darija transliteration task.
#### Tasks
* `darija_transliteration_task`: evaluates Darija transliteration on the [DODa-10K](https://huggingface.co/datasets/MBZUAI-Paris/DODa-10K) corpus.
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
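For a quick run, the tasks can also be invoked through the harness's Python API; a minimal sketch (the model identifier is a placeholder):
```python
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=MBZUAI-Paris/Atlas-Chat-2B",  # placeholder model
    tasks=["darija_transliteration"],
)
print(results["results"])
```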
"process_docs": !function utils.ar_dr
"include": "transliteration_common_yaml"
"tag":
- "darija_transliteration_tasks"
"task": "transliteration_ar_dr"
"task_alias": "ar_dr"
metric_list:
- metric: !function utils.bert
aggregation: !function utils.darijabert
higher_is_better: true
"process_docs": !function utils.dr_ar
"include": "transliteration_common_yaml"
"tag":
- "darija_transliteration_tasks"
"task": "transliteration_dr_ar"
"task_alias": "dr_ar"
metric_list:
- metric: !function utils.bert
aggregation: !function utils.arabizibert
higher_is_better: true
"include": "transliteration_common_yaml"
"tag":
- "darija_transliteration_tasks"
"task": "transliteration_all"
"task_alias": "all"
metric_list:
- metric: !function utils.bert
aggregation: !function utils.mbert
higher_is_better: true
output_type: generate_until
dataset_path: MBZUAI-Paris/DarijaBench
doc_to_text: !function utils.doc_to_text
doc_to_target: !function utils.doc_to_target
test_split: transliteration
metric_list:
  - metric: bleu
  - metric: ter
  - metric: chrf
  - metric: !function utils.bert
generation_kwargs:
  until:
    - "<end_of_turn>"
    - "<eos>"
    - "</s>"
    - "<|end_of_text|>"
    - "<|eot_id|>"
    - "<|endoftext|>"
  do_sample: false
  temperature: 0.0
filter_list:
  - name: "STRIP_ANSWER"
    filter:
      - function: "custom"
        filter_fn: !function utils.strip
repeats: 1
metadata:
  version: 1.0
group: darija_transliteration
group_alias: transliteration
task:
  - darija_transliteration_tasks
metric_list:
  - metric: bleu
    aggregation: bleu
    higher_is_better: true
  - metric: chrf
    aggregation: chrf
    higher_is_better: true
  - metric: ter
    aggregation: ter
    higher_is_better: false
  - metric: !function utils.bert
    aggregation: !function utils.mbert
    higher_is_better: true
metadata:
  version: 1.0
import evaluate
import datasets


def strip(resps, docs):
    """
    Each entry of `resps` is a list of model responses; keep only the first
    response and strip its surrounding whitespace.
    """
    return map(lambda r: r[0].strip(), resps)


# Per-direction dataset filters: dr = Darija (Arabic script), ar = Arabizi
# (Latin script).
def dr_ar(dataset: datasets.Dataset):
    return dataset.filter(lambda x: x["direction"] == "dr_ar")


def ar_dr(dataset: datasets.Dataset):
    return dataset.filter(lambda x: x["direction"] == "ar_dr")
def doc_to_text(doc):
    return doc["messages"][0]["content"]


def doc_to_target(doc):
    return doc["messages"][1]["content"]


def bert(items):
    # Pass (prediction, reference) pairs through unchanged; scoring happens in
    # the corpus-level aggregation functions below.
    return items


def Average(lst):
    return sum(lst) / len(lst)
def _bertscore_f1(items, model_type):
    """Mean BERTScore F1 over (prediction, reference) pairs with the given encoder."""
    predictions, references = zip(*items)
    scores = evaluate.load("bertscore").compute(
        predictions=predictions,
        references=references,
        model_type=model_type,
        num_layers=12,
    )
    return Average(scores["f1"])


def arabizibert(items):
    return _bertscore_f1(items, "SI2M-Lab/DarijaBERT-arabizi")


def darijabert(items):
    return _bertscore_f1(items, "SI2M-Lab/DarijaBERT")


def mbert(items):
    return _bertscore_f1(items, "google-bert/bert-base-multilingual-cased")
# DarijaHellaSwag
### Paper
Title: Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect
Abstract: [https://arxiv.org/abs/2409.17912](https://arxiv.org/abs/2409.17912)
DarijaHellaSwag is a challenging multiple-choice benchmark designed to evaluate machine reading comprehension and commonsense reasoning in Moroccan Darija. It is a translated version of the HellaSwag validation set, which presents scenarios where models must choose the most plausible continuation of a passage from four options.
Homepage: [https://huggingface.co/datasets/MBZUAI-Paris/DarijaHellaSwag](https://huggingface.co/datasets/MBZUAI-Paris/DarijaHellaSwag)
### Citation
```
@article{shang2024atlaschatadaptinglargelanguage,
title={Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect},
author={Guokan Shang and Hadi Abdine and Yousef Khoubrane and Amr Mohamed and Yassine Abbahaddou and Sofiane Ennadir and Imane Momayiz and Xuguang Ren and Eric Moulines and Preslav Nakov and Michalis Vazirgiannis and Eric Xing},
year={2024},
eprint={2409.17912},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.17912},
}
```
### Groups and Tasks
#### Groups
- Not part of a group yet
#### Tasks
- `darijahellaswag`
### Checklist
For adding novel benchmarks/datasets to the library:
* [X] Is the task an existing benchmark in the literature?
* [X] Have you referenced the original paper that introduced the task?
* [X] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
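For a quick look at the data outside the harness, the benchmark loads directly from the Hub; a minimal sketch (split and field names follow the task config and `utils.process_docs` below):
```python
from datasets import load_dataset

ds = load_dataset("MBZUAI-Paris/DarijaHellaSwag", split="validation", trust_remote_code=True)
doc = ds[0]
print(doc["activity_label"], doc["label"])  # activity category and gold ending index
```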
tag:
  - multiple_choice
task: darijahellaswag
dataset_path: MBZUAI-Paris/DarijaHellaSwag
dataset_name: null
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: null
process_docs: !function utils.process_docs
doc_to_text: "{{query}}"
doc_to_target: "{{label}}"
doc_to_choice: "choices"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
dataset_kwargs:
  trust_remote_code: true
import datasets


def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
    def _process_doc(doc):
        ctx = doc["ctx"]
        out_doc = {
            "query": doc["activity_label"] + ": " + ctx,
            "choices": doc["endings"],
            "gold": int(doc["label"]),
        }
        return out_doc

    return dataset.map(_process_doc)
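# Illustrative round-trip on a toy document (field values are made up):
if __name__ == "__main__":
    toy = datasets.Dataset.from_list(
        [
            {
                "activity_label": "Cooking",
                "ctx": "a short context in Darija",
                "endings": ["A", "B", "C", "D"],
                "label": "2",
            }
        ]
    )
    processed = process_docs(toy)
    print(processed[0]["query"], processed[0]["gold"])  # "Cooking: ..." and 2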
# DarijaMMLU
### Paper
Title: Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect
Abstract: [https://arxiv.org/abs/2409.17912](https://arxiv.org/abs/2409.17912)
DarijaMMLU is an evaluation benchmark designed to assess large language models' performance in Moroccan Darija, a variety of Arabic. It consists of 22,027 multiple-choice questions covering 44 subjects, constructed by translating selected subsets of two major benchmarks into Darija: the Massive Multitask Language Understanding (MMLU) benchmark, translated from English, and ArabicMMLU, translated from Modern Standard Arabic.
Homepage: [https://huggingface.co/datasets/MBZUAI-Paris/DarijaMMLU](https://huggingface.co/datasets/MBZUAI-Paris/DarijaMMLU)
### Citation
```
@article{shang2024atlaschatadaptinglargelanguage,
title={Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect},
author={Guokan Shang and Hadi Abdine and Yousef Khoubrane and Amr Mohamed and Yassine Abbahaddou and Sofiane Ennadir and Imane Momayiz and Xuguang Ren and Eric Moulines and Preslav Nakov and Michalis Vazirgiannis and Eric Xing},
year={2024},
eprint={2409.17912},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.17912},
}
```
### Groups and Tasks
#### Groups
* `darijammlu`: evaluates all DarijaMMLU tasks.
#### Tags
Source-based tags:
* `darijammlu_mmlu`: evaluates DarijaMMLU tasks that were translated from MMLU.
* `darijammlu_ar_mmlu`: evaluates DarijaMMLU tasks that were translated from ArabicMMLU.
Category-based tags:
* `darijammlu_stem`: evaluates DarijaMMLU STEM tasks.
* `darijammlu_social_sciences`: evaluates DarijaMMLU social sciences tasks.
* `darijammlu_humanities`: evaluates DarijaMMLU humanities tasks.
* `darijammlu_language`: evaluates DarijaMMLU language tasks.
* `darijammlu_other`: evaluates other DarijaMMLU tasks.
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
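The group config that follows aggregates the two source subsets with `weight_by_size: True`, i.e. a micro-average in which each subtask contributes in proportion to its document count. A minimal sketch with hypothetical numbers:
```python
# Hypothetical per-subset accuracies and document counts (illustrative only).
accs = {"darijammlu_mmlu": 0.41, "darijammlu_ar_mmlu": 0.47}
sizes = {"darijammlu_mmlu": 8_000, "darijammlu_ar_mmlu": 14_027}  # 22,027 total

weighted_acc = sum(accs[t] * sizes[t] for t in accs) / sum(sizes.values())
print(round(weighted_acc, 4))
```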
group: darijammlu
group_alias: DarijaMMLU
task:
  - darijammlu_mmlu
  - darijammlu_ar_mmlu
aggregate_metric_list:
  - metric: acc
    weight_by_size: True
metadata:
  version: 0