mela (#1970)

* mela
* Update mela_en.yaml
* Create _mela.yaml
---------
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

Unverified Commit a4987bba authored Aug 21, 2024 by Geralt Committed by GitHub Aug 20, 2024
12 changed files
--- a/lm_eval/tasks/mela/README.md
+++ b/lm_eval/tasks/mela/README.md
+# MELA
+### Paper
+Title: [MELA: Multilingual Evaluation of Linguistic Acceptability](https://arxiv.org/abs/2311.09033)
+**Abstract**: In this work, we present the largest benchmark to date on linguistic acceptability: Multilingual Evaluation of Linguistic Acceptability -- MELA, with 46K samples covering 10 languages from a diverse set of language families. We establish LLM baselines on this benchmark, and investigate cross-lingual transfer in acceptability judgements with XLM-R. In pursuit of multilingual interpretability, we conduct probing experiments with fine-tuned XLM-R to explore the process of syntax capability acquisition. Our results show that GPT-4o exhibits a strong multilingual ability, outperforming fine-tuned XLM-R, while open-source multilingual models lag behind by a noticeable gap. Cross-lingual transfer experiments show that transfer in acceptability judgment is non-trivial: 500 Icelandic fine-tuning examples lead to 23 MCC performance in a completely unrelated language -- Chinese. Results of our probing experiments indicate that training on MELA improves the performance of XLM-R on syntax-related tasks.
+Homepage: https://github.com/sjtu-compling/MELA
+### Citation
+```
+@inproceedings{zhang2023mela,
+  author       = {Ziyin Zhang and
+                  Yikang Liu and
+                  Weifang Huang and
+                  Junyu Mao and
+                  Rui Wang and
+                  Hai Hu},
+  title        = {{MELA:} Multilingual Evaluation of Linguistic Acceptability},
+  booktitle    = {Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), {ACL} 2024, Bangkok, Thailand},
+  publisher    = {Association for Computational Linguistics},
+  year         = {2024},
+  url          = {https://doi.org/10.48550/arXiv.2311.09033}
+}
+```
+### Groups and Tasks
+#### Groups
+- `mela`: multilingual evaluation of linguistic acceptability
+#### Tasks
+- `mela_en`: English
+- `mela_zh`: Chinese
+- `mela_it`: Italian
+- `mela_ru`: Russian
+- `mela_de`: German
+- `mela_fr`: French
+- `mela_es`: Spanish
+- `mela_ja`: Japanese
+- `mela_ar`: Arabic
+- `mela_is`: Icelandic
+### Checklist
+For adding novel benchmarks/datasets to the library:
+- [x] Is the task an existing benchmark in the literature?
+  - [x] Have you referenced the original paper that introduced the task?
+  - [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
+If other tasks on this dataset are already supported:
+- [ ] Is the "Main" variant of this task clearly denoted?
+- [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+- [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
--- a/lm_eval/tasks/mela/_mela.yaml
+++ b/lm_eval/tasks/mela/_mela.yaml
+group: mela
+task:
+  - mela_en
+  - mela_zh
+  - mela_it
+  - mela_ru
+  - mela_de
+  - mela_fr
+  - mela_es
+  - mela_ja
+  - mela_ar
+  - mela_is
+aggregate_metric_list:
+  - metric: mcc
+    weight_by_size: False
+metadata:
+  version: 1
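
Note on `aggregate_metric_list`: a minimal sketch of what the group score likely amounts to, assuming `weight_by_size: False` means an unweighted (macro) mean of the per-language MCC values, as opposed to a mean weighted by each subtask's sample count. The function names, scores, and sizes below are hypothetical and only illustrate the arithmetic; this is not the harness's own code.

```python
from typing import Dict


def macro_average(scores: Dict[str, float]) -> float:
    """Unweighted mean: every language counts equally (assumed behavior
    of weight_by_size: False)."""
    return sum(scores.values()) / len(scores)


def size_weighted_average(scores: Dict[str, float], sizes: Dict[str, int]) -> float:
    """Mean weighted by subtask size, shown only for contrast."""
    total = sum(sizes[name] for name in scores)
    return sum(scores[name] * sizes[name] for name in scores) / total


# Hypothetical per-language MCC values and test-set sizes.
example_scores = {"mela_en": 0.6, "mela_is": 0.2}
example_sizes = {"mela_en": 3000, "mela_is": 1000}

print(macro_average(example_scores))
print(size_weighted_average(example_scores, example_sizes))
```

Under the macro reading, a small language such as Icelandic influences the `mela` group score as much as English does.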
--- a/lm_eval/tasks/mela/mela_ar.yaml
+++ b/lm_eval/tasks/mela/mela_ar.yaml
+include: mela_en.yaml
+task: mela_ar
+dataset_name: ar
+training_split: null
--- a/lm_eval/tasks/mela/mela_de.yaml
+++ b/lm_eval/tasks/mela/mela_de.yaml
+include: mela_en.yaml
+task: mela_de
+dataset_name: de
+training_split: null
--- a/lm_eval/tasks/mela/mela_en.yaml
+++ b/lm_eval/tasks/mela/mela_en.yaml
+task: mela_en
+dataset_path: Geralt-Targaryen/MELA
+dataset_name: en
+training_split: train
+validation_split: dev
+test_split: test
+output_type: multiple_choice
+doc_to_text: "Sentence: {{sentence}}\nDetermine whether this sentence is acceptable or unacceptable?\nA. Acceptable\nB. Unacceptable\nAnswer:"
+doc_to_choice: ["A", "B"]
+doc_to_target: "{{['B', 'A'][label]}}"
+description: "Determine whether the following sentence(s) violate certain linguistic constraints. If yes, then it is \"unacceptable\"; otherwise, \"acceptable\".\n\n"
+fewshot_split: dev
+fewshot_config:
+  sampler: first_n
+metric_list:
+  - metric: mcc
+    higher_is_better: true
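
Note on `mela_en.yaml`: the Jinja expression `{{['B', 'A'][label]}}` indexes a two-element list by `label`, so label 1 maps to choice `A` (Acceptable) and label 0 to `B` (Unacceptable). The sketch below mirrors that mapping and the prompt template in plain Python, and computes the Matthews correlation coefficient (MCC) from the standard confusion-matrix formula. The helper names are hypothetical, and the harness's own metric implementation may differ in detail.

```python
import math
from typing import Sequence


def build_prompt(sentence: str) -> str:
    """Plain-Python equivalent of the doc_to_text template."""
    return (
        f"Sentence: {sentence}\n"
        "Determine whether this sentence is acceptable or unacceptable?\n"
        "A. Acceptable\nB. Unacceptable\nAnswer:"
    )


def target_letter(label: int) -> str:
    """Equivalent of doc_to_target: 1 -> 'A' (acceptable), 0 -> 'B'."""
    return ["B", "A"][label]


def mcc(gold: Sequence[int], pred: Sequence[int]) -> float:
    """Binary Matthews correlation coefficient; 0.0 if undefined."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


print(build_prompt("The cat sat on the mat."))
print(target_letter(1), target_letter(0))
print(mcc([1, 0, 1, 0], [1, 0, 1, 0]))
```

MCC ranges from -1 to 1, with 0 corresponding to chance-level agreement, which is why it is preferred over accuracy when acceptable and unacceptable sentences are imbalanced.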
--- a/lm_eval/tasks/mela/mela_es.yaml
+++ b/lm_eval/tasks/mela/mela_es.yaml
+include: mela_en.yaml
+task: mela_es
+dataset_name: es
+training_split: null
--- a/lm_eval/tasks/mela/mela_fr.yaml
+++ b/lm_eval/tasks/mela/mela_fr.yaml
+include: mela_en.yaml
+task: mela_fr
+dataset_name: fr
+training_split: null
--- a/lm_eval/tasks/mela/mela_is.yaml
+++ b/lm_eval/tasks/mela/mela_is.yaml
+include: mela_en.yaml
+task: mela_is
+dataset_name: is
+training_split: null
--- a/lm_eval/tasks/mela/mela_it.yaml
+++ b/lm_eval/tasks/mela/mela_it.yaml
+include: mela_en.yaml
+task: mela_it
+dataset_name: it
--- a/lm_eval/tasks/mela/mela_ja.yaml
+++ b/lm_eval/tasks/mela/mela_ja.yaml
+include: mela_en.yaml
+task: mela_ja
+dataset_name: ja
+training_split: null
--- a/lm_eval/tasks/mela/mela_ru.yaml
+++ b/lm_eval/tasks/mela/mela_ru.yaml
+include: mela_en.yaml
+task: mela_ru
+dataset_name: ru
--- a/lm_eval/tasks/mela/mela_zh.yaml
+++ b/lm_eval/tasks/mela/mela_zh.yaml
+include: mela_en.yaml
+task: mela_zh
+dataset_name: zh