Unverified commit c2c8e238 authored by Julen Etxaniz, committed by GitHub

Add Latxa paper evaluation tasks for Basque (#1654)

* add basqueglue

* add eus_exams

* add eus_proficiency

* add eus_reading

* add eus_trivia

* run pre-commit
# Generated by utils.py
dataset_name: eu_opegasteizkoudala
include: eus_exams_eu
task: eus_exams_eu_opegasteizkoudala
# Generated by utils.py
dataset_name: eu_opeosakiadmineu
include: eus_exams_eu
task: eus_exams_eu_opeosakiadmineu
# Generated by utils.py
dataset_name: eu_opeosakiauxenfeu
include: eus_exams_eu
task: eus_exams_eu_opeosakiauxenfeu
# Generated by utils.py
dataset_name: eu_opeosakiauxeu
include: eus_exams_eu
task: eus_exams_eu_opeosakiauxeu
# Generated by utils.py
dataset_name: eu_opeosakiceladoreu
include: eus_exams_eu
task: eus_exams_eu_opeosakiceladoreu
# Generated by utils.py
dataset_name: eu_opeosakienfeu
include: eus_exams_eu
task: eus_exams_eu_opeosakienfeu
# Generated by utils.py
dataset_name: eu_opeosakioperarioeu
include: eus_exams_eu
task: eus_exams_eu_opeosakioperarioeu
# Generated by utils.py
dataset_name: eu_opeosakitecnicoeu
include: eus_exams_eu
task: eus_exams_eu_opeosakitecnicoeu
# Generated by utils.py
dataset_name: eu_opeosakivarioseu
include: eus_exams_eu
task: eus_exams_eu_opeosakivarioseu
# Generated by utils.py
dataset_name: eu_osakidetza1e
include: eus_exams_eu
task: eus_exams_eu_osakidetza1e
# Generated by utils.py
dataset_name: eu_osakidetza2e
include: eus_exams_eu
task: eus_exams_eu_osakidetza2e
# Generated by utils.py
dataset_name: eu_osakidetza3e
include: eus_exams_eu
task: eus_exams_eu_osakidetza3e
# Generated by utils.py
dataset_name: eu_osakidetza5e
include: eus_exams_eu
task: eus_exams_eu_osakidetza5e
# Generated by utils.py
dataset_name: eu_osakidetza6e
include: eus_exams_eu
task: eus_exams_eu_osakidetza6e
# Generated by utils.py
dataset_name: eu_osakidetza7e
include: eus_exams_eu
task: eus_exams_eu_osakidetza7e
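Each per-exam config above carries a `# Generated by utils.py` header. A hypothetical sketch of what such a generator loop could look like is below; the subset list and output paths are assumptions for illustration, not the actual script:

```python
# Hypothetical sketch of a per-subset config generator; the real generator is
# only referenced by the "# Generated by utils.py" headers above. Subset names
# and output paths here are assumptions.
from pathlib import Path

SUBSETS = ["eu_opegasteizkoudala", "eu_osakidetza7e"]  # one entry per exam subset

for name in SUBSETS:
    Path(f"eus_exams_{name}.yaml").write_text(
        "# Generated by utils.py\n"
        f"dataset_name: {name}\n"
        "include: eus_exams_eu\n"
        f"task: eus_exams_{name}\n"
    )
```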
import datasets


def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
    """Filter out examples with no answer."""

    def valid_example(example: dict) -> bool:
        """Check if an example is valid."""
        # The answer must be a valid index into the four candidates.
        if example["answer"] not in [0, 1, 2, 3]:
            return False
        # Discard rows where every candidate answer is empty.
        if example["candidates"] == ["", "", "", ""]:
            return False
        return True

    return dataset.filter(valid_example)
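A quick way to sanity-check the filter is to run it over a small in-memory dataset; the rows below are made up for illustration:

```python
import datasets

# Made-up rows: the second has an out-of-range answer index and the third has
# all-empty candidates, so only the first survives process_docs.
toy = datasets.Dataset.from_list([
    {"question": "1 + 1 = ?", "candidates": ["1", "2", "3", "4"], "answer": 1},
    {"question": "?", "candidates": ["a", "b", "c", "d"], "answer": 5},
    {"question": "?", "candidates": ["", "", "", ""], "answer": 0},
])

print(len(process_docs(toy)))  # -> 1
```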
# EusProficiency
### Paper
Title: Latxa: An Open Language Model and Evaluation Suite for Basque
Abstract: https://arxiv.org/abs/2403.20266
EusProficiency comprises 5,169 exercises on different topics from past EGA exams, the official C1-level certificate of proficiency in Basque. We collected the atarikoa exercises from EGA exams administered between 1998 and 2008. Atarikoa is the first qualifying test of EGA, which measures different aspects of language competency, such as reading comprehension, grammar, vocabulary, spelling, and writing. Each test generally has 85 multiple-choice questions, with 4 choices and a single correct answer.
Homepage: https://github.com/hitz-zentroa/latxa
### Citation
```
@misc{etxaniz2024latxa,
title={Latxa: An Open Language Model and Evaluation Suite for Basque},
author={Julen Etxaniz and Oscar Sainz and Naiara Perez and Itziar Aldabe and German Rigau and Eneko Agirre and Aitor Ormazabal and Mikel Artetxe and Aitor Soroa},
year={2024},
eprint={2403.20266},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
### Groups and Tasks
#### Groups
There are no groups.
#### Tasks
* `eus_proficiency`: EusProficiency comprises 5,169 exercises on different topics from past EGA exams, the official C1-level certificate of proficiency in Basque.
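Once this task is available in the harness, a typical zero-shot run looks like the following; the model checkpoint is only an example (`HiTZ/latxa-7b-v1` is one of the models evaluated in the paper):

```
lm_eval --model hf \
    --model_args pretrained=HiTZ/latxa-7b-v1 \
    --tasks eus_proficiency \
    --batch_size 8
```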
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
dataset_path: HiTZ/EusProficiency
dataset_name: default
task: eus_proficiency
doc_to_text: "Galdera: {{question}}\nA: {{candidates[0]}}\nB: {{candidates[1]}}\nC: {{candidates[2]}}\nD: {{candidates[3]}}\nErantzuna:"
doc_to_choice: ["A", "B", "C", "D"]
validation_split: null
test_split: test
fewshot_split: test
output_type: multiple_choice
doc_to_target: answer
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 0.0
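For reference, here is what the `doc_to_text` template above renders for a hypothetical document; the gold target is the letter in `doc_to_choice` at index `answer`:

```python
# Hypothetical document, invented to illustrate the prompt format.
doc = {
    "question": "Zein da forma zuzena?",
    "candidates": ["zuen", "zuten", "zituen", "zituzten"],
    "answer": 2,  # index into doc_to_choice, so the gold letter is "C"
}

prompt = (
    f"Galdera: {doc['question']}\n"
    + "\n".join(f"{letter}: {cand}" for letter, cand in zip("ABCD", doc["candidates"]))
    + "\nErantzuna:"
)
print(prompt)
```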
# EusReading
### Paper
Title: Latxa: An Open Language Model and Evaluation Suite for Basque
Abstract: https://arxiv.org/abs/2403.20266
EusReading consists of 352 reading comprehension exercises (irakurmena) sourced from the set of past EGA exams from 1998 to 2008. Each test generally has 10 multiple-choice questions, with 4 choices and a single correct answer. These exercises are more challenging than Belebele due to the complexity and length of the input texts, which makes EusReading useful for measuring the long-context understanding of models.
Homepage: https://github.com/hitz-zentroa/latxa
### Citation
```
@misc{etxaniz2024latxa,
title={Latxa: An Open Language Model and Evaluation Suite for Basque},
author={Julen Etxaniz and Oscar Sainz and Naiara Perez and Itziar Aldabe and German Rigau and Eneko Agirre and Aitor Ormazabal and Mikel Artetxe and Aitor Soroa},
year={2024},
eprint={2403.20266},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
### Groups and Tasks
#### Groups
There are no groups.
#### Tasks
* `eus_reading`: EusReading consists of 352 reading comprehension exercises (irakurmena) sourced from the set of past EGA exams from 1998 to 2008.
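Note that the config below sets `fewshot_split: test`, so in-context examples for few-shot runs are drawn from the test split itself. A few-shot invocation might look like this (shot count and model checkpoint are illustrative):

```
lm_eval --model hf \
    --model_args pretrained=HiTZ/latxa-7b-v1 \
    --tasks eus_reading \
    --num_fewshot 5 \
    --batch_size 4
```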
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
dataset_path: HiTZ/EusReading
dataset_name: default
task: eus_reading
doc_to_text: !function utils.doc_to_text_context
doc_to_choice: !function utils.doc_to_choice
validation_split: null
test_split: test
fewshot_split: test
output_type: multiple_choice
doc_to_target: answer
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 0.0
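The `!function utils.doc_to_text_context` and `!function utils.doc_to_choice` hooks point at helpers in the task's `utils.py`, which is not fully reproduced in this excerpt. A hypothetical sketch of what they could look like, assuming each document exposes `context`, `question`, and `candidates` fields:

```python
# Hypothetical sketches only; the real helpers live in the task's utils.py,
# and the field names used here are assumptions.
def doc_to_text_context(doc: dict) -> str:
    """Render the passage, then the question and lettered choices."""
    choices = "\n".join(
        f"{letter}: {cand}"
        for letter, cand in zip("ABCD", doc["candidates"])
    )
    return f"{doc['context']}\n\nGaldera: {doc['question']}\n{choices}\nErantzuna:"


def doc_to_choice(doc: dict) -> list:
    """One answer letter per candidate, since questions can vary in choice count."""
    return list("ABCD")[: len(doc["candidates"])]
```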