Unverified Commit ebbbb968 authored by Hadi Abdine, committed by GitHub

add Darija (Moroccan dialects) tasks including darijammlu, darijahellaswag and darija_bench (#2521)



* add Darija tasks

* fix multiple groups issue in darijammlu

* add MT to the description of the Darija tasks

* Update README.md

nit

* fix the recursion error caused by the darija_summarization task

* use a custom filter instead of the decorator for the strip function

---------
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
parent 5a9d5ba0
@@ -39,6 +39,9 @@
| [coqa](coqa/README.md) | Conversational question answering tasks to test dialog understanding. | English |
| [crows_pairs](crows_pairs/README.md) | Tasks designed to test model biases in various sociodemographic groups. | English, French |
| csatqa | Tasks related to SAT and other standardized testing questions for academic assessment. | Korean |
| [darija_bench](darija_bench/README.md) | Traditional NLP tasks (translation, summarization, etc.) for Moroccan Darija. | Moroccan Darija (some MT) |
| [darijahellaswag](darijahellaswag/README.md) | Moroccan Darija version of HellaSwag. | Moroccan Darija (MT) |
| [darijammlu](darijammlu/README.md) | Multiple-choice QA in Moroccan Darija (an Arabic dialect). | Moroccan Darija (MT) |
| [drop](drop/README.md) | Tasks requiring numerical reasoning, reading comprehension, and question answering. | English |
| [eq_bench](eq_bench/README.md) | Tasks focused on equality and ethics in question answering and decision-making. | English |
| [eus_exams](eus_exams/README.md) | Tasks based on various professional and academic exams in the Basque language. | Basque |
# DarijaBench
### Paper
Title: Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect
Abstract: [https://arxiv.org/abs/2409.17912](https://arxiv.org/abs/2409.17912)
DarijaBench is a comprehensive evaluation suite tailored for Moroccan Darija. It includes datasets for core NLP tasks: translation (based on four datasets: [DODa-10K](https://huggingface.co/datasets/MBZUAI-Paris/DODa-10K), [FLORES+](https://github.com/openlanguagedata/flores), [NLLB-Seed](https://github.com/openlanguagedata/seed) and [MADAR](https://sites.google.com/nyu.edu/madar/)), summarization (based on [MArSum](https://github.com/KamelGaanoun/MoroccanSummarization)) and sentiment analysis (based on five datasets: [MAC](https://github.com/LeMGarouani/MAC), [MYC](https://github.com/MouadJb/MYC), [MSAC](https://hal.science/hal-03670346/document), [MSDA](https://cc.um6p.ma/cc_datasets) and [ElectroMorocco2016](https://github.com/sentiprojects/ElecMorocco2016)), as well as a new transliteration task that converts between Darija (written in Arabic script) and Arabizi (written in Latin script), based on the [DODa-10K](https://huggingface.co/datasets/MBZUAI-Paris/DODa-10K) dataset.
Homepage: [https://huggingface.co/datasets/MBZUAI-Paris/DarijaBench](https://huggingface.co/datasets/MBZUAI-Paris/DarijaBench)
### Citation
```
@article{shang2024atlaschatadaptinglargelanguage,
title={Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect},
author={Guokan Shang and Hadi Abdine and Yousef Khoubrane and Amr Mohamed and Yassine Abbahaddou and Sofiane Ennadir and Imane Momayiz and Xuguang Ren and Eric Moulines and Preslav Nakov and Michalis Vazirgiannis and Eric Xing},
year={2024},
eprint={2409.17912},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.17912},
}
```
### Groups and Tasks
#### Groups
* `darija_sentiment`: evaluates all Darija sentiment analysis tasks.
* `darija_summarization`: evaluates the Darija summarization task.
* `darija_translation`: evaluates all Darija translation tasks.
* `darija_transliteration`: evaluates the Darija transliteration task.
#### Tasks
* `darija_sentiment_mac`: evaluates the Darija sentiment analysis task on the [MAC](https://github.com/LeMGarouani/MAC) dataset.
* `darija_sentiment_myc`: evaluates the Darija sentiment analysis task on the [MYC](https://github.com/MouadJb/MYC) dataset.
* `darija_sentiment_msac`: evaluates the Darija sentiment analysis task on the [MSAC](https://hal.science/hal-03670346/document) dataset.
* `darija_sentiment_msda`: evaluates the Darija sentiment analysis task on the [MSDA](https://cc.um6p.ma/cc_datasets) dataset.
* `darija_sentiment_electrom`: evaluates the Darija sentiment analysis task on the [ElectroMorocco2016](https://github.com/sentiprojects/ElecMorocco2016) dataset.
* `darija_summarization_task`: evaluates the Darija summarization task on the [MArSum](https://github.com/KamelGaanoun/MoroccanSummarization) corpus.
* `darija_translation_doda`: evaluates the Darija translation task on the [DODa-10K](https://huggingface.co/datasets/MBZUAI-Paris/DODa-10K) corpus.
* `darija_translation_flores`: evaluates the Darija translation task on the [FLORES+](https://github.com/openlanguagedata/flores) dataset.
* `darija_translation_madar`: evaluates the Darija translation task on the [MADAR](https://sites.google.com/nyu.edu/madar/) dataset.
* `darija_translation_seed`: evaluates the Darija translation task on the [NLLB-Seed](https://github.com/openlanguagedata/seed) dataset.
* `darija_transliteration_task`: evaluates the Darija transliteration task on the [DODa-10K](https://huggingface.co/datasets/MBZUAI-Paris/DODa-10K) corpus.
Note: depending on the model, padding and the padding side can affect the results. The default padding side in this library is forced to left; use a batch size of 1 to avoid such issues.
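As a minimal illustration (the checkpoint name is a placeholder and the entry point assumes the standard `lm_eval` Python API), the groups above can be evaluated with a batch size of 1 as follows:
```
# Minimal sketch: evaluate the DarijaBench groups with batch_size=1, as recommended above.
# The checkpoint below is a placeholder; substitute any Hugging Face causal LM.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=MBZUAI-Paris/Atlas-Chat-9B",  # placeholder model
    tasks=["darija_sentiment", "darija_summarization",
           "darija_translation", "darija_transliteration"],
    batch_size=1,  # avoids the padding-related issues noted above
)
print(results["results"])
```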
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
# DarijaBench: Sentiment Analysis
### Paper
Title: Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect
Abstract: [https://arxiv.org/abs/2409.17912](https://arxiv.org/abs/2409.17912)
DarijaBench is a comprehensive evaluation suite tailored for Moroccan Darija. It includes datasets for core NLP tasks: translation (based on four datasets: [DODa-10K](https://huggingface.co/datasets/MBZUAI-Paris/DODa-10K), [FLORES+](https://github.com/openlanguagedata/flores), [NLLB-Seed](https://github.com/openlanguagedata/seed) and [MADAR](https://sites.google.com/nyu.edu/madar/)), summarization (based on [MArSum](https://github.com/KamelGaanoun/MoroccanSummarization)) and sentiment analysis (based on five datasets: [MAC](https://github.com/LeMGarouani/MAC), [MYC](https://github.com/MouadJb/MYC), [MSAC](https://hal.science/hal-03670346/document), [MSDA](https://cc.um6p.ma/cc_datasets) and [ElectroMorocco2016](https://github.com/sentiprojects/ElecMorocco2016)), as well as a new transliteration task that converts between Darija (written in Arabic script) and Arabizi (written in Latin script), based on the [DODa-10K](https://huggingface.co/datasets/MBZUAI-Paris/DODa-10K) dataset.
Homepage: [https://huggingface.co/datasets/MBZUAI-Paris/DarijaBench](https://huggingface.co/datasets/MBZUAI-Paris/DarijaBench)
### Citation
```
@article{shang2024atlaschatadaptinglargelanguage,
title={Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect},
author={Guokan Shang and Hadi Abdine and Yousef Khoubrane and Amr Mohamed and Yassine Abbahaddou and Sofiane Ennadir and Imane Momayiz and Xuguang Ren and Eric Moulines and Preslav Nakov and Michalis Vazirgiannis and Eric Xing},
year={2024},
eprint={2409.17912},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.17912},
}
```
### Groups and Tasks
#### Groups
* `darija_sentiment`: evaluates all Darija sentiment analysis tasks.
#### Tasks
* `darija_sentiment_mac`: evaluates the Darija sentiment analysis task on the [MAC](https://github.com/LeMGarouani/MAC) dataset.
* `darija_sentiment_myc`: evaluates the Darija sentiment analysis task on the [MYC](https://github.com/MouadJb/MYC) dataset.
* `darija_sentiment_msac`: evaluates the Darija sentiment analysis task on the [MSAC](https://hal.science/hal-03670346/document) dataset.
* `darija_sentiment_msda`: evaluates the Darija sentiment analysis task on the [MSDA](https://cc.um6p.ma/cc_datasets) dataset.
* `darija_sentiment_electrom`: evaluates the Darija sentiment analysis task on the [ElectroMorocco2016](https://github.com/sentiprojects/ElecMorocco2016) dataset.
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
group: darija_sentiment
group_alias: Sentiment_Analysis
task:
- darija_sentiment_tasks
aggregate_metric_list:
- metric: acc
weight_by_size: True
metadata:
version: 0
test_split: electro_maroc
"include": "default_darija_sentiment_template_yaml"
"tag":
- "darija_sentiment_tasks"
"task": "darija_sentiment_electrom"
"task_alias": "Electro Maroc"
doc_to_choice: !function utils.doc_to_choice_2
test_split: mac
"include": "default_darija_sentiment_template_yaml"
"tag":
- "darija_sentiment_tasks"
"task": "darija_sentiment_mac"
"task_alias": "MAC"
doc_to_choice: !function utils.doc_to_choice_3
test_split: msac
"include": "default_darija_sentiment_template_yaml"
"tag":
- "darija_sentiment_tasks"
"task": "darija_sentiment_msac"
"task_alias": "MSAC"
doc_to_choice: !function utils.doc_to_choice_2
test_split: msda
"include": "default_darija_sentiment_template_yaml"
"tag":
- "darija_sentiment_tasks"
"task": "darija_sentiment_msda"
"task_alias": "MSDA"
doc_to_choice: !function utils.doc_to_choice_3
test_split: myc
"include": "default_darija_sentiment_template_yaml"
"tag":
- "darija_sentiment_tasks"
"task": "darija_sentiment_myc"
"task_alias": "MYC"
doc_to_choice: !function utils.doc_to_choice_2
dataset_path: MBZUAI-Paris/DarijaBench
output_type: multiple_choice
doc_to_text: !function utils.doc_to_text
doc_to_choice: !function utils.doc_to_choice_3
doc_to_target: !function utils.doc_to_target
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
metadata:
version: 0.0
dataset_kwargs:
trust_remote_code: true
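Each per-dataset config above only overrides `test_split`, the task name/alias, and `doc_to_choice`; everything else comes from this shared template. For inspection, the underlying records can be loaded directly from the Hub (a sketch, assuming the split names used above, such as `mac`, are exposed by the dataset's default config):
```
# Sketch: load one DarijaBench sentiment split for inspection.
# Assumes the split names used in the configs above (e.g. "mac") are available.
from datasets import load_dataset

ds = load_dataset("MBZUAI-Paris/DarijaBench", split="mac", trust_remote_code=True)
print(ds[0]["messages"])  # chat-style record: user prompt followed by the gold label
```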
# Utilities for the DarijaBench sentiment analysis tasks.

alpha = ['A', 'B', 'C']  # answer letters shown to the model
# Maps the gold label ("positive" / "negative" / "no sentiment") to its letter index.
out_dic = {"ايجابي": 1, "سلبي": 0, "ماكينش إحساس": 2}

def doc_to_text(doc):
    # Rewrite the inline "-<label>" options into an explicit A/B/C multiple-choice
    # prompt and append the answer-format instruction.
    return (doc["messages"][0]["content"]
            .replace('-سلبي', 'A. سلبي')
            .replace('-ايجابي', 'B. ايجابي')
            .replace('-ماكينش إحساس', 'C. ماكينش إحساس\nThe answer should be strictly one letter of the following: A, B, C.'))

def doc_to_choice_3(doc):
    return alpha  # three-class datasets

def doc_to_choice_2(doc):
    return alpha[:2]  # two-class datasets

def doc_to_target(doc):
    # The gold answer is the letter corresponding to the label string.
    return alpha[out_dic[doc["messages"][1]["content"]]]
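For illustration, a hypothetical record in the `messages` format (the text below is made up) is mapped by these helpers as follows:
```
# Hypothetical record (made up for illustration) in the DarijaBench "messages" format.
doc = {
    "messages": [
        {"role": "user", "content": "شنو هو الإحساس ديال هاد الجملة؟ ... -سلبي -ايجابي -ماكينش إحساس"},
        {"role": "assistant", "content": "ايجابي"},  # gold label: positive
    ]
}

print(doc_to_choice_3(doc))  # ['A', 'B', 'C']
print(doc_to_target(doc))    # 'B'  (alpha[out_dic["ايجابي"]] == alpha[1])
```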
# DarijaBench: Summarization
### Paper
Title: Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect
Abstract: [https://arxiv.org/abs/2409.17912](https://arxiv.org/abs/2409.17912)
DarijaBench is a comprehensive evaluation suite tailored for Moroccan Darija. It includes datasets for core NLP tasks: translation (based on four datasets: [DODa-10K](https://huggingface.co/datasets/MBZUAI-Paris/DODa-10K), [FLORES+](https://github.com/openlanguagedata/flores), [NLLB-Seed](https://github.com/openlanguagedata/seed) and [MADAR](https://sites.google.com/nyu.edu/madar/)), summarization (based on [MArSum](https://github.com/KamelGaanoun/MoroccanSummarization)) and sentiment analysis (based on five datasets: [MAC](https://github.com/LeMGarouani/MAC), [MYC](https://github.com/MouadJb/MYC), [MSAC](https://hal.science/hal-03670346/document), [MSDA](https://cc.um6p.ma/cc_datasets) and [ElectroMorocco2016](https://github.com/sentiprojects/ElecMorocco2016)), as well as a new transliteration task that converts between Darija (written in Arabic script) and Arabizi (written in Latin script), based on the [DODa-10K](https://huggingface.co/datasets/MBZUAI-Paris/DODa-10K) dataset.
Homepage: [https://huggingface.co/datasets/MBZUAI-Paris/DarijaBench](https://huggingface.co/datasets/MBZUAI-Paris/DarijaBench)
### Citation
```
@article{shang2024atlaschatadaptinglargelanguage,
title={Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect},
author={Guokan Shang and Hadi Abdine and Yousef Khoubrane and Amr Mohamed and Yassine Abbahaddou and Sofiane Ennadir and Imane Momayiz and Xuguang Ren and Eric Moulines and Preslav Nakov and Michalis Vazirgiannis and Eric Xing},
year={2024},
eprint={2409.17912},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.17912},
}
```
### Groups and Tasks
#### Groups
* `darija_summarization`: evaluates the Darija summarization task.
#### Tasks
* `darija_summarization_task`: evaluates the Darija summarization task on the [MArSum](https://github.com/KamelGaanoun/MoroccanSummarization) corpus.
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
"include": "summarization_common_yaml"
"task": "darija_summarization_task"
output_type: generate_until
dataset_path: MBZUAI-Paris/DarijaBench
test_split: marsum
doc_to_text: !function utils.doc_to_text
doc_to_target: !function utils.doc_to_target
metric_list:
- metric: !function utils.rouge1
- metric: !function utils.rouge2
- metric: !function utils.rougeL
- metric: !function utils.rougeLsum
- metric: !function utils.bert
- metric: chrf
generation_kwargs:
until:
- "<end_of_turn>"
- "<eos>"
- "</s>"
- "<|end_of_text|>"
- "<|eot_id|>"
- "<|endoftext|>"
do_sample: false
temperature: 0.0
max_new_tokens: 128
repeats: 1
filter_list:
- name: "STRIP_ANSWER"
filter:
- function: "custom"
filter_fn: !function utils.strip
metadata:
version: 1.0
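The `STRIP_ANSWER` entry applies the custom `strip` filter defined in `utils.py` below: it keeps only the first generated response per document and removes surrounding whitespace. A rough illustration (the responses are made up):
```
# Rough illustration of the STRIP_ANSWER custom filter (responses are made up).
# Each entry of resps is the list of generations for one document.
resps = [["  ملخص قصير للمقال  \n"], ["\nملخص آخر"]]
filtered = [r[0].strip() for r in resps]
print(filtered)  # ['ملخص قصير للمقال', 'ملخص آخر']
```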
group: darija_summarization
task:
- darija_summarization_task
metric_list:
- metric: !function utils.rouge1
aggregation: !function utils.agg_rouge1
higher_is_better: true
- metric: !function utils.rouge2
aggregation: !function utils.agg_rouge2
higher_is_better: true
- metric: !function utils.rougeL
aggregation: !function utils.agg_rougel
higher_is_better: true
- metric: !function utils.rougeLsum
aggregation: !function utils.agg_rougelsum
higher_is_better: true
- metric: !function utils.bert
aggregation: !function utils.darijabert
higher_is_better: true
- metric: chrf
aggregation: chrf
higher_is_better: true
metadata:
version: 1.0
import evaluate


def strip(resps, docs):
    """Custom filter: keep only the first generated response per document and
    strip surrounding whitespace."""
    return [r[0].strip() for r in resps]


def doc_to_text(doc):
    # Ask for a bounded summary: "summarize this passage" -> "summarize this passage in 30 words".
    return doc["messages"][0]["content"].replace("لخص هاد المقطع", "لخص هاد المقطع في ٣٠ كلمة")


def doc_to_target(doc):
    # The reference summary is the assistant turn of the record.
    return doc["messages"][1]["content"]


# The per-instance "metrics" below are identity functions: each passes its
# (prediction, reference) pair through unchanged so that the corpus-level score
# can be computed once over all pairs in the corresponding aggregation function.
def bert(items):
    return items


def Average(lst):
    return sum(lst) / len(lst)


def darijabert(items):
    # Corpus-level BERTScore computed with the Moroccan Darija BERT encoder.
    bert_model = 'SI2M-Lab/DarijaBERT'
    bert_score = evaluate.load("bertscore")
    predictions, references = zip(*items)
    bert = bert_score.compute(predictions=predictions, references=references, model_type=bert_model, num_layers=12)
    return Average(bert['f1'])


def rouge1(items):
    return items


def rougeL(items):
    return items


def rouge2(items):
    return items


def rougeLsum(items):
    return items


def agg_rougelsum(items):
    rouge = evaluate.load("rouge")
    predictions, references = zip(*items)
    return rouge.compute(predictions=predictions, references=references)["rougeLsum"]


def agg_rouge1(items):
    rouge = evaluate.load("rouge")
    predictions, references = zip(*items)
    return rouge.compute(predictions=predictions, references=references)["rouge1"]


def agg_rouge2(items):
    rouge = evaluate.load("rouge")
    predictions, references = zip(*items)
    return rouge.compute(predictions=predictions, references=references)["rouge2"]


def agg_rougel(items):
    rouge = evaluate.load("rouge")
    predictions, references = zip(*items)
    return rouge.compute(predictions=predictions, references=references)["rougeL"]
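A toy demonstration of the passthrough-metric / corpus-level-aggregation pattern used above (the prediction and reference strings are made up):
```
# Toy example: each (prediction, reference) pair passes through e.g. rouge1()
# unchanged; the aggregator then computes one corpus-level score over all pairs.
items = [
    ("the cat sat on the mat", "a cat sat on the mat"),
    ("it is raining today", "it rained today"),
]
print(agg_rouge1(items))  # corpus-level ROUGE-1 score in [0, 1]
```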
# DarijaBench: Translation
### Paper
Title: Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect
Abstract: [https://arxiv.org/abs/2409.17912](https://arxiv.org/abs/2409.17912)
DarijaBench is a comprehensive evaluation suite tailored for Moroccan Darija. It includes datasets for core NLP tasks: translation (based on four datasets: [DODa-10K](https://huggingface.co/datasets/MBZUAI-Paris/DODa-10K), [FLORES+](https://github.com/openlanguagedata/flores), [NLLB-Seed](https://github.com/openlanguagedata/seed) and [MADAR](https://sites.google.com/nyu.edu/madar/)), summarization (based on [MArSum](https://github.com/KamelGaanoun/MoroccanSummarization)) and sentiment analysis (based on five datasets: [MAC](https://github.com/LeMGarouani/MAC), [MYC](https://github.com/MouadJb/MYC), [MSAC](https://hal.science/hal-03670346/document), [MSDA](https://cc.um6p.ma/cc_datasets) and [ElectroMorocco2016](https://github.com/sentiprojects/ElecMorocco2016)), as well as a new transliteration task that converts between Darija (written in Arabic script) and Arabizi (written in Latin script), based on the [DODa-10K](https://huggingface.co/datasets/MBZUAI-Paris/DODa-10K) dataset.
Homepage: [https://huggingface.co/datasets/MBZUAI-Paris/DarijaBench](https://huggingface.co/datasets/MBZUAI-Paris/DarijaBench)
### Citation
```
@article{shang2024atlaschatadaptinglargelanguage,
title={Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect},
author={Guokan Shang and Hadi Abdine and Yousef Khoubrane and Amr Mohamed and Yassine Abbahaddou and Sofiane Ennadir and Imane Momayiz and Xuguang Ren and Eric Moulines and Preslav Nakov and Michalis Vazirgiannis and Eric Xing},
year={2024},
eprint={2409.17912},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.17912},
}
```
### Groups and Tasks
#### Groups
* `darija_translation`: evaluates all Darija translation tasks.
#### Tasks
* `darija_translation_doda`: evaluates the Darija translation task on the [DODa-10K](https://huggingface.co/datasets/MBZUAI-Paris/DODa-10K) corpus.
* `darija_translation_flores`: evaluates the Darija translation task on the [FLORES+](https://github.com/openlanguagedata/flores) dataset.
* `darija_translation_madar`: evaluates the Darija translation task on the [MADAR](https://sites.google.com/nyu.edu/madar/) dataset.
* `darija_translation_seed`: evaluates the Darija translation task on the [NLLB-Seed](https://github.com/openlanguagedata/seed) dataset.
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
test_split: doda
include:
- translation_common_yaml
- doda_common_yaml
"tag":
- "darija_translation_tasks_doda"
"task": "trasnlation_all_doda"
"task_alias": "all_doda"
metric_list:
- metric: !function utils.bert
aggregation: !function utils.mbert
higher_is_better: true
group: darija_translation_doda
group_alias: translation_doda
task:
- darija_translation_tasks_doda
metric_list:
- metric: bleu
aggregation: bleu
higher_is_better: true
- metric: chrf
aggregation: chrf
higher_is_better: true
- metric: ter
aggregation: ter
higher_is_better: false
- metric: !function utils.bert
aggregation: !function utils.mbert
higher_is_better: true
metadata:
version: 1.0
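The translation configs reference `utils.bert` and `utils.mbert`, whose definitions are not shown in this excerpt. By analogy with the `darijabert` aggregator above, a multilingual BERTScore aggregator might look like the following sketch (the encoder choice is an assumption, not the repository's actual implementation):
```
# Hypothetical sketch of utils.mbert, analogous to darijabert() above.
# The encoder (bert-base-multilingual-cased) is an assumption.
import evaluate

def mbert(items):
    bert_score = evaluate.load("bertscore")
    predictions, references = zip(*items)
    scores = bert_score.compute(
        predictions=list(predictions),
        references=list(references),
        model_type="bert-base-multilingual-cased",
    )
    return sum(scores["f1"]) / len(scores["f1"])
```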