Unverified commit 7ecee2bc authored by zxcvuser, committed by GitHub

Add new tasks to spanish_bench and fix duplicates (#2390)



* added tasks to spanish_bench

* fixed capitalization in escola and run pre-commit

* Update _flores_common_yaml

* Update _flores_common_yaml

* Update direct_yaml

* Update cot_yaml

* Update cot_yaml

* Update _flores_common_yaml

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
parent 7785577c
group: flores
dataset_path: facebook/flores
dataset_name: all
output_type: generate_until
......
# This file will be included in the generated language-specific task configs.
# It doesn't have a yaml file extension as it is not meant to be imported directly
# by the harness.
group: mgsm_direct
tag: mgsm_direct
dataset_path: juletxara/mgsm
dataset_name: null # Overridden by language-specific config.
output_type: generate_until
......
# This file will be included in the generated language-specific task configs.
# It doesn't have a yaml file extension as it is not meant to be imported directly
# by the harness.
group: mgsm_cot_native
tag: mgsm_cot_native
dataset_path: juletxara/mgsm
dataset_name: null # Overridden by language-specific config.
output_type: generate_until
......
......@@ -4,11 +4,18 @@
SpanishBench is a benchmark for evaluating language models on Spanish tasks. That is, it evaluates the ability of a language model to understand and generate Spanish text. SpanishBench offers a combination of pre-existing, open datasets. All the details of SpanishBench will be published in a paper soon.
The new evaluation datasets included in SpanishBench are:
| Task | Category | Homepage |
|:-------------:|:-----:|:-----:|
| COPA-es | Commonsense Reasoning | https://huggingface.co/datasets/BSC-LT/COPA-es |
| OpenBookQA_es | Question Answering | https://huggingface.co/datasets/BSC-LT/openbookqa-es |
The datasets included in SpanishBench that have been made public in previous publications are:
| Task | Category | Paper title | Homepage |
|:-------------:|:-----:|:-------------:|:-----:|
| Belebele_es | Reading Comprehension | [The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants](https://arxiv.org/abs/2308.16884) | https://huggingface.co/datasets/facebook/belebele |
| EsCoLA | Linguistic Acceptability | [EsCoLA: Spanish Corpus of Linguistic Acceptability](https://aclanthology.org/2024.lrec-main.554/) | https://huggingface.co/datasets/nbel/EsCoLA |
| FLORES_es | Translation | [The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation](https://arxiv.org/abs/2106.03193) | https://huggingface.co/datasets/facebook/flores |
| MGSM_es | Math | [Language Models are Multilingual Chain-of-Thought Reasoners](https://arxiv.org/abs/2210.03057) | https://huggingface.co/datasets/juletxara/mgsm |
| PAWS-X_es | Paraphrasing | [PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification](https://aclanthology.org/D19-1382/) | https://huggingface.co/datasets/google-research-datasets/paws-x |
......@@ -19,6 +26,7 @@ The datasets included in SpanishBench are:
| XStoryCloze_es | Commonsense Reasoning | [Few-shot Learning with Multilingual Generative Language Models](https://aclanthology.org/2022.emnlp-main.616/) | https://huggingface.co/datasets/juletxara/xstory_cloze |
### Citation
Paper for SpanishBench coming soon.
......@@ -36,6 +44,8 @@ Paper for SpanishBench coming soon.
The following tasks evaluate language models on the SpanishBench datasets using various scoring methods.
- `belebele_spa_Latn`
- `copa_es`
- `escola`
- `flores_es`
- `flores_es-ca`
- `flores_es-de`
......@@ -53,20 +63,21 @@ The following tasks evaluate tasks on SpanishBench dataset using various scoring
- `flores_gl-es`
- `flores_it-es`
- `flores_pt-es`
- `mgsm_direct_es_spanish_bench` (the `spanish_bench` suffix is due to an existing open issue in the original task)
- `openbookqa_es`
- `paws_es_spanish_bench` (the `spanish_bench` suffix is due to an existing open issue in the original task)
- `phrases_es`
- `wnli_es`
- `xlsum_es`
- `xnli_es_spanish_bench` (the `spanish_bench` suffix is due to an existing open issue in the original task)
- `xquad_es`
- `xstorycloze_es`
Some of these tasks are taken from benchmarks already available in LM Evaluation Harness. These are:
- `belebele_spa_Latn`: Belebele Spanish
- `mgsm_direct_es`: MGSM Spanish (fixed an existing open issue in the original task)
- `paws_es`: PAWS-X Spanish (fixed an existing open issue in the original task)
- `xnli_es`: XNLI Spanish (fixed an existing open issue in the original task)
- `xstorycloze_es`: XStoryCloze Spanish
### Checklist
......
group:
  - pawsx
task: paws_es
dataset_path: paws-x
dataset_name: es
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
process_docs: !function utils.process_docs_paraphrases
doc_to_text: ''
doc_to_target: label
doc_to_choice: '{{[sentence1+", ¿verdad? No, "+sentence2, sentence1+", ¿verdad? Sí, "+sentence2]}}'
target_delimiter: ''
metric_list:
  - metric: acc
    aggregation: mean
......
task: copa_es
dataset_path: BSC-LT/COPA-es
dataset_name: null
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
process_docs: !function utils.process_docs_copa_es
doc_to_text: '{{premise[:-1].strip() + " " + {"cause": "porque", "effect": "y por lo tanto"}[question]}}'
doc_to_target: '{{choice1 if label == 0 else choice2}}'
doc_to_choice: '{{[choice1, choice2]}}'
metric_list:
  - metric: acc
    aggregation: mean
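As a sketch of how the copa_es templates compose (hypothetical sample document, not harness code): `doc_to_text` strips the premise's final period and appends a Spanish connective selected by `question`, and `doc_to_choice`/`doc_to_target` supply the two continuations and the gold one.

```python
# Hypothetical COPA-es style document; field names follow the config above.
doc = {
    "premise": "El hombre perdió el autobús.",
    "question": "effect",
    "choice1": "Llegó tarde al trabajo.",
    "choice2": "Desayunó temprano.",
    "label": 0,
}

# doc_to_text: drop the premise's final character (the period) and append
# the connective: "porque" for causes, "y por lo tanto" for effects.
connector = {"cause": "porque", "effect": "y por lo tanto"}[doc["question"]]
text = doc["premise"][:-1].strip() + " " + connector

# doc_to_choice / doc_to_target: the two continuations and the gold one.
choices = [doc["choice1"], doc["choice2"]]
target = choices[doc["label"]]

print(text)    # El hombre perdió el autobús y por lo tanto
print(target)  # Llegó tarde al trabajo.
```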
......
task: escola
dataset_path: nbel/EsCoLA
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: null
doc_to_text: "{{Sentence}}\nPregunta: ¿Tiene sentido esta frase?\nRespuesta:"
doc_to_target: Label
doc_to_choice: ["no", "sí"]
metric_list:
  - metric: mcc
  - metric: acc
metadata:
  version: 1.0
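EsCoLA reports `mcc` alongside accuracy. A minimal self-contained sketch of the Matthews correlation coefficient for binary labels (the harness ships its own metric implementation; this is only illustrative):

```python
import math

def mcc(preds, refs):
    """Matthews correlation coefficient for binary 0/1 predictions."""
    tp = sum(p == 1 and r == 1 for p, r in zip(preds, refs))
    tn = sum(p == 0 and r == 0 for p, r in zip(preds, refs))
    fp = sum(p == 1 and r == 0 for p, r in zip(preds, refs))
    fn = sum(p == 0 and r == 1 for p, r in zip(preds, refs))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when any marginal is empty (undefined MCC).
    return (tp * tn - fp * fn) / denom if denom else 0.0

print(mcc([1, 1, 0, 0], [1, 1, 0, 0]))  # 1.0
```

Unlike plain accuracy, MCC stays informative on the label-imbalanced acceptability data, which is why the task reports both.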
......
include: ../mgsm/direct/mgsm_direct_es.yaml
doc_to_target: '{{answer_number|string}}'
doc_to_text: '{% if answer is not none %}{{question+"\nRespuesta: "}}{% else %}{{"Pregunta: "+question+"\nRespuesta: "}}{% endif %}'
generation_kwargs:
  until:
    - "\n\n"
    - "\n"
task: mgsm_direct_es_spanish_bench
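The `until` stop sequences above end generation at the first occurrence of any stop string, so the answer line is kept and any follow-up text is discarded. A hedged sketch of that behavior (hypothetical helper, not the harness implementation):

```python
def cut_at_stops(text, stops=("\n\n", "\n")):
    """Truncate `text` at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stops:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

print(cut_at_stops("42\nPregunta: ¿y ahora?"))  # 42
```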
task: openbookqa_es
dataset_path: BSC-LT/openbookqa-es
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
doc_to_text: question_stem
doc_to_target: "{{choices.label.index(answerKey.lstrip())}}"
doc_to_choice: "{{choices.text}}"
should_decontaminate: true
doc_to_decontamination_query: question_stem
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
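The `doc_to_target` above resolves the gold choice by position: the index of `answerKey` within `choices.label`, with `lstrip()` guarding against a stray leading space in the key. A sketch with a hypothetical document (field names as in the config):

```python
# Hypothetical OpenBookQA-es style document.
doc = {
    "question_stem": "¿Qué fuente de energía proviene del sol?",
    "choices": {
        "label": ["A", "B", "C", "D"],
        "text": ["solar", "eólica", "nuclear", "mareomotriz"],
    },
    "answerKey": " A",  # the leading space is what lstrip() removes
}

# doc_to_target: index of the cleaned answer key within the choice labels.
gold = doc["choices"]["label"].index(doc["answerKey"].lstrip())
print(doc["choices"]["text"][gold])  # solar
```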
group: spanish_bench
task:
- belebele_spa_Latn
- copa_es
- escola
- openbookqa_es
- wnli_es
- xnli_es_spanish_bench
- xstorycloze_es
......
......@@ -82,6 +82,15 @@ def process_docs_paraphrases(dataset):
).map(_process_doc)
def process_docs_copa_es(dataset):
    def _process_doc(doc):
        doc["choice1"] = lowercase_first_letter(doc["choice1"])
        doc["choice2"] = lowercase_first_letter(doc["choice2"])
        return doc

    return dataset.map(_process_doc)
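`lowercase_first_letter` is defined elsewhere in `utils.py`; a minimal sketch of the behavior `process_docs_copa_es` relies on, so each choice reads as a continuation of the prompt:

```python
def lowercase_first_letter(s: str) -> str:
    # Lowercase only the first character; safe on empty strings via slicing.
    return s[:1].lower() + s[1:]

print(lowercase_first_letter("Llegó tarde al trabajo."))  # llegó tarde al trabajo.
```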
def rouge1(items):
    """
    # passthrough for efficiency
......
# Task configuration derived from Eleuther AI's implementation as of March 22, 2024, supplemented with an additional preprocessing function
task: xnli_es
dataset_path: xnli
dataset_name: es
output_type: multiple_choice
doc_to_choice: '{{[premise+", ¿correcto? Sí, "+hypothesis,premise+", ¿correcto? Así
que, "+hypothesis,premise+", ¿correcto? No, "+hypothesis]}}'
doc_to_text: ''
target_delimiter: ''
process_docs: !function utils.process_doc_nli
training_split: null
validation_split: validation
doc_to_target: label
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
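As a sketch of how the `doc_to_choice` template above expands one example (hypothetical sample, not harness code): each premise/hypothesis pair becomes three Spanish continuations, one per NLI label, and the model's likelihood picks among them.

```python
# Hypothetical XNLI-es style document.
doc = {
    "premise": "El gato duerme en el sofá",
    "hypothesis": "el animal está descansando",
}

# Order matches the template: entailment ("Sí"), neutral ("Así que"),
# contradiction ("No"); the integer `label` indexes into this list.
choices = [
    doc["premise"] + ", ¿correcto? Sí, " + doc["hypothesis"],
    doc["premise"] + ", ¿correcto? Así que, " + doc["hypothesis"],
    doc["premise"] + ", ¿correcto? No, " + doc["hypothesis"],
]
print(choices[0])
```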