add Basque translation of ARC and PAWS to BasqueBench (#2732)

* add Basque translation of ARC and PAWS to BasqueBench
* pre-commit

Co-authored-by: Baber <baber@hey.com>

Commit 2f403fa0, authored Feb 25, 2025 by Naiara Perez, committed by GitHub Feb 25, 2025
6 changed files
--- a/lm_eval/tasks/basque_bench/README.md
+++ b/lm_eval/tasks/basque_bench/README.md
@@ -5,14 +5,16 @@
 BasqueBench is a benchmark for evaluating language models in Basque tasks. This is, it evaluates the ability of a language model to understand and generate Basque text. BasqueBench offers a combination of pre-existing, open datasets and datasets developed exclusivelly for this benchmark. All the details of BasqueBench will be published in a paper soon.

 The new evaluation datasets included in BasqueBench are:
-| Task          | Category       | Homepage  |
-|:-------------:|:-----:|:-----:|
-| MGSM_eu | Math | https://huggingface.co/datasets/HiTZ/MGSM-eu |
-| PIQA_eu | Question Answering | https://huggingface.co/datasets/HiTZ/PIQA-eu |
-| WNLI_eu | Natural Language Inference | https://huggingface.co/datasets/HiTZ/wnli-eu |
-| XCOPA_eu | Commonsense Reasoning | https://huggingface.co/datasets/HiTZ/XCOPA-eu |
+| Task     | Category                   | Homepage                                      |
+|:--------:|:--------------------------:|:---------------------------------------------:|
+| ARC_eu   | Question Answering         | https://huggingface.co/datasets/HiTZ/ARC-eu   |
+| MGSM_eu  | Math                       | https://huggingface.co/datasets/HiTZ/MGSM-eu  |
+| PAWS_eu  | Paraphrasing               | https://huggingface.co/datasets/HiTZ/PAWS-eu  |
+| PIQA_eu  | Question Answering         | https://huggingface.co/datasets/HiTZ/PIQA-eu  |
+| WNLI_eu  | Natural Language Inference | https://huggingface.co/datasets/HiTZ/WNLI-eu  |
+| XCOPA_eu | Commonsense Reasoning      | https://huggingface.co/datasets/HiTZ/XCOPA-eu |

-The datasets included in BasqueBench that have been made public in previous pubications are:
+The datasets included in BasqueBench that have been made public in previous publications are:

 | Task          | Category       | Paper title          | Homepage  |
 |:-------------:|:-----:|:-------------:|:-----:|
@@ -73,6 +75,8 @@ The datasets included in BasqueBench that have been made public in previous pubi
 #### Tasks

 The following tasks evaluate tasks on BasqueBench dataset using various scoring methods.
+  - `arc_eu_challenge`
+  - `arc_eu_easy`
  - `belebele_eus_Latn`
  - `eus_exams_eu`
  - `eus_proficiency`
@@ -97,6 +101,7 @@ The following tasks evaluate tasks on BasqueBench dataset using various scoring
  - `flores_pt-eu`
  - `mgsm_direct_eu`
  - `mgsm_native_cot_eu`
+  - `paws_eu`
  - `piqa_eu`
  - `qnlieu`
  - `wnli_eu`

--- a/lm_eval/tasks/basque_bench/arc_eu_challenge.yaml
+++ b/lm_eval/tasks/basque_bench/arc_eu_challenge.yaml
+include: arc_eu_easy.yaml
+task: arc_eu_challenge
+dataset_name: ARC-Challenge
--- a/lm_eval/tasks/basque_bench/arc_eu_easy.yaml
+++ b/lm_eval/tasks/basque_bench/arc_eu_easy.yaml
+task: arc_eu_easy
+dataset_path: HiTZ/ARC-eu
+dataset_name: ARC-Easy
+output_type: multiple_choice
+training_split: null
+validation_split: validation
+test_split: test
+doc_to_text: "Galdera: {{question}}\nErantzuna:"
+doc_to_target: "{{choices.label.index(answerKey)}}"
+doc_to_choice: "{{choices.text}}"
+should_decontaminate: true
+doc_to_decontamination_query: "Galdera: {{question}}\nErantzuna:"
+metric_list:
+  - metric: acc
+    aggregation: mean
+    higher_is_better: true
+  - metric: acc_norm
+    aggregation: mean
+    higher_is_better: true
+metadata:
+  version: 1.0
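To make the ARC templates above concrete, here is a minimal Python sketch of what `doc_to_text` and `doc_to_target` would evaluate to for a single record. The record is invented for illustration (it is not taken from HiTZ/ARC-eu), and the helper names `arc_prompt` and `arc_target_index` are hypothetical. Only the field names (`question`, `choices.label`, `choices.text`, `answerKey`) come from the YAML.

```python
# Hypothetical ARC-style record; only the field layout mirrors the
# template variables used in arc_eu_easy.yaml.
doc = {
    "question": "Zein da Lurraren satelite naturala?",
    "choices": {
        "label": ["A", "B", "C", "D"],
        "text": ["Eguzkia", "Ilargia", "Marte", "Artizarra"],
    },
    "answerKey": "B",
}


def arc_prompt(doc):
    # Plain-Python equivalent of: "Galdera: {{question}}\nErantzuna:"
    return f"Galdera: {doc['question']}\nErantzuna:"


def arc_target_index(doc):
    # Plain-Python equivalent of: {{choices.label.index(answerKey)}}
    # The gold answer is the position of answerKey in the label list.
    return doc["choices"]["label"].index(doc["answerKey"])


prompt = arc_prompt(doc)
gold = arc_target_index(doc)
gold_text = doc["choices"]["text"][gold]  # the candidate from doc_to_choice at the gold index
```

Resolving the target as an index rather than a letter means the candidates listed in `choices.text` can be scored directly, whatever label scheme a record uses.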
--- a/lm_eval/tasks/basque_bench/basque_bench.yaml
+++ b/lm_eval/tasks/basque_bench/basque_bench.yaml
 group: basque_bench
 task:
+    - arc_eu_challenge
+    - arc_eu_easy
    - belebele_eus_Latn
    - xstorycloze_eu
    - flores_eu
@@ -14,6 +16,7 @@ task:
    - xcopa_eu
    - mgsm_direct_eu
    - mgsm_native_cot_eu
+    - paws_eu
    - piqa_eu
 metadata:
  version: 1.0
--- a/lm_eval/tasks/basque_bench/paws_eu.yaml
+++ b/lm_eval/tasks/basque_bench/paws_eu.yaml
+task: paws_eu
+dataset_path: HiTZ/PAWS-eu
+dataset_name: null
+output_type: multiple_choice
+test_split: test
+process_docs: !function utils.paws_process_docs
+doc_to_text: ''
+doc_to_target: label
+doc_to_choice: '{{[sentence1+", ezta? Ez, "+sentence2, sentence1+", ezta? Bai, "+sentence2]}}'
+target_delimiter: ''
+metric_list:
+  - metric: acc
+    aggregation: mean
+    higher_is_better: true
+metadata:
+  version: 1.0
--- a/lm_eval/tasks/basque_bench/utils.py
+++ b/lm_eval/tasks/basque_bench/utils.py
-from functools import partial
-
-
 # ~~~~~~~~~~~ XCOPA ~~~~~~~~~~~ #

 xcopa_connectors = {"cause": " Izan ere,", "effect": " Beraz,"}
@@ -18,4 +15,28 @@ def xcopa_doc_to_choice(doc):
    return [convert_choice(doc["choice1"]), convert_choice(doc["choice2"])]


-# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #
+# ~~~~~~~~~~~ PAWS-X ~~~~~~~~~~~ #
+
+
+def paws_process_docs(dataset):
+    empty_docs = []
+
+    def _process_doc(doc):
+        if doc["sentence1"] not in [None, ""] and doc["sentence2"] not in [None, ""]:
+            # Remove final punctuation mark in the first sentence
+            if doc["sentence1"].endswith((".", ",", ";")):
+                doc["sentence1"] = doc["sentence1"][:-1]
+            # Start the second sentence in lowercase (to be used after "Yes, ...")
+            doc["sentence2"] = lowercase_first_letter(doc["sentence2"])
+            return doc
+        else:
+            empty_docs.append(doc)
+            return doc
+
+    def lowercase_first_letter(text):
+        return text[0].lower() + text[1:]
+
+    return dataset.filter(
+        lambda doc: doc["sentence1"] not in [None, ""]
+        and doc["sentence2"] not in [None, ""]
+    ).map(_process_doc)
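The combined effect of `paws_process_docs` and the `doc_to_choice` template in `paws_eu.yaml` can be sketched on plain dictionaries. This is an illustration under stated assumptions, not the code from the commit: it uses a Python list instead of a Hugging Face dataset, the function names `process_doc`, `process_docs` and `paws_choices` are hypothetical, and the example sentences are invented.

```python
def lowercase_first_letter(text):
    return text[0].lower() + text[1:]


def process_doc(doc):
    # Work on a copy so the input record is left untouched (illustrative choice).
    doc = dict(doc)
    # Remove the final punctuation mark of the first sentence.
    if doc["sentence1"].endswith((".", ",", ";")):
        doc["sentence1"] = doc["sentence1"][:-1]
    # The second sentence follows "Ez, " / "Bai, ", so it starts in lowercase.
    doc["sentence2"] = lowercase_first_letter(doc["sentence2"])
    return doc


def process_docs(docs):
    # Drop records with a missing or empty sentence, then normalise the rest.
    return [
        process_doc(d)
        for d in docs
        if d["sentence1"] not in (None, "") and d["sentence2"] not in (None, "")
    ]


def paws_choices(doc):
    # Plain-Python equivalent of the doc_to_choice template in paws_eu.yaml.
    s1, s2 = doc["sentence1"], doc["sentence2"]
    return [s1 + ", ezta? Ez, " + s2, s1 + ", ezta? Bai, " + s2]


# Invented records for illustration only.
docs = [
    {"sentence1": "Etxea handia da.", "sentence2": "Etxea handia da.", "label": 1},
    {"sentence1": "", "sentence2": "Hutsik dago.", "label": 0},
]
processed = process_docs(docs)
choices = paws_choices(processed[0])
gold_choice = choices[processed[0]["label"]]
```

Because `doc_to_text` is empty and `doc_to_target` is `label`, each candidate is scored as a complete string, and the integer label selects the gold entry: index 0 is the "Ez" (no) continuation and index 1 is the "Bai" (yes) continuation.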