Unverified commit 7ecee2bc authored by zxcvuser, committed by GitHub

Add new tasks to spanish_bench and fix duplicates (#2390)



* added tasks to spanish_bench

* fixed capitalization in escola and run pre-commit

* Update _flores_common_yaml

* Update _flores_common_yaml

* Update direct_yaml

* Update cot_yaml

* Update cot_yaml

* Update _flores_common_yaml

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
parent 7785577c
group: flores
dataset_path: facebook/flores
dataset_name: all
output_type: generate_until
......
# This file will be included in the generated language-specific task configs.
# It doesn't have a yaml file extension as it is not meant to be imported directly
# by the harness.
group: mgsm_direct
tag: mgsm_direct
dataset_path: juletxara/mgsm
dataset_name: null # Overridden by language-specific config.
output_type: generate_until
......
# This file will be included in the generated language-specific task configs.
# It doesn't have a yaml file extension as it is not meant to be imported directly
# by the harness.
group: mgsm_cot_native
tag: mgsm_cot_native
dataset_path: juletxara/mgsm
dataset_name: null # Overridden by language-specific config.
output_type: generate_until
......
......@@ -4,11 +4,18 @@
SpanishBench is a benchmark for evaluating language models on Spanish tasks. That is, it evaluates the ability of a language model to understand and generate Spanish text. SpanishBench offers a combination of pre-existing, open datasets. All the details of SpanishBench will be published in a paper soon.
The new evaluation datasets included in SpanishBench are:
| Task | Category | Homepage |
|:-------------:|:-----:|:-----:|
| COPA-es | Commonsense Reasoning | https://huggingface.co/datasets/BSC-LT/COPA-es |
| OpenBookQA_es | Question Answering | https://huggingface.co/datasets/BSC-LT/openbookqa-es |
The datasets included in SpanishBench that have been made public in previous publications are:
| Task | Category | Paper title | Homepage |
|:-------------:|:-----:|:-------------:|:-----:|
| Belebele_es | Reading Comprehension | [The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants](https://arxiv.org/abs/2308.16884) | https://huggingface.co/datasets/facebook/belebele |
| EsCoLA | Linguistic Acceptability | [EsCoLA: Spanish Corpus of Linguistic Acceptability](https://aclanthology.org/2024.lrec-main.554/) | https://huggingface.co/datasets/nbel/EsCoLA |
| FLORES_es | Translation | [The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation](https://arxiv.org/abs/2106.03193) | https://huggingface.co/datasets/facebook/flores |
| MGSM_es | Math | [Language Models are Multilingual Chain-of-Thought Reasoners](https://arxiv.org/abs/2210.03057) | https://huggingface.co/datasets/juletxara/mgsm |
| PAWS-X_es | Paraphrasing | [PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification](https://aclanthology.org/D19-1382/) | https://huggingface.co/datasets/google-research-datasets/paws-x |
......@@ -19,6 +26,7 @@ The datasets included in SpanishBench are:
| XStoryCloze_es | Commonsense Reasoning | [Few-shot Learning with Multilingual Generative Language Models](https://aclanthology.org/2022.emnlp-main.616/) | https://huggingface.co/datasets/juletxara/xstory_cloze |
### Citation
Paper for SpanishBench coming soon.
......@@ -36,6 +44,8 @@ Paper for SpanishBench coming soon.
The following tasks evaluate language models on the SpanishBench datasets using various scoring methods.
- `belebele_spa_Latn`
- `copa_es`
- `escola`
- `flores_es`
- `flores_es-ca`
- `flores_es-de`
......@@ -53,20 +63,21 @@ The following tasks evaluate tasks on SpanishBench dataset using various scoring
- `flores_gl-es`
- `flores_it-es`
- `flores_pt-es`
- `mgsm_direct_es_spanish_bench` (the `spanish_bench` suffix is due to an existing open issue in the original task)
- `openbookqa_es`
- `paws_es_spanish_bench` (the `spanish_bench` suffix is due to an existing open issue in the original task)
- `phrases_es`
- `wnli_es`
- `xlsum_es`
- `xnli_es_spanish_bench` (the `spanish_bench` suffix is due to an existing open issue in the original task)
- `xquad_es`
- `xstorycloze_es`
Some of these tasks are taken from benchmarks already available in LM Evaluation Harness. These are:
- `belebele_spa_Latn`: Belebele Spanish
- `mgsm_direct_es`: MGSM Spanish (fixed an existing open issue in the original task)
- `paws_es`: PAWS-X Spanish (fixed an existing open issue in the original task)
- `xnli_es`: XNLI Spanish (fixed an existing open issue in the original task)
- `xstorycloze_es`: XStoryCloze Spanish
### Checklist
......
group:
  - pawsx
task: paws_es
dataset_path: paws-x
dataset_name: es
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
process_docs: !function utils.process_docs_paraphrases
doc_to_text: ''
doc_to_target: label
doc_to_choice: '{{[sentence1+", ¿verdad? No, "+sentence2, sentence1+", ¿verdad? Sí, "+sentence2]}}'
target_delimiter: ''
metric_list:
  - metric: acc
    aggregation: mean
......
task: copa_es
dataset_path: BSC-LT/COPA-es
dataset_name: null
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
process_docs: !function utils.process_docs_copa_es
doc_to_text: '{{premise[:-1].strip() + " " + {"cause": "porque", "effect": "y por lo tanto"}[question]}}'
doc_to_target: '{{choice1 if label == 0 else choice2}}'
doc_to_choice: '{{[choice1, choice2]}}'
metric_list:
  - metric: acc
    aggregation: mean
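As a sketch of how the copa_es templates compose (hypothetical sample document, not harness code): `doc_to_text` strips the premise's final period and appends a Spanish connective selected by `question`, and `doc_to_choice`/`doc_to_target` supply the two continuations and the gold one.

```python
# Hypothetical COPA-es style document; field names follow the config above.
doc = {
    "premise": "El hombre perdió el autobús.",
    "question": "effect",
    "choice1": "Llegó tarde al trabajo.",
    "choice2": "Desayunó temprano.",
    "label": 0,
}

# doc_to_text: drop the premise's final character (the period) and append
# the connective: "porque" for causes, "y por lo tanto" for effects.
connector = {"cause": "porque", "effect": "y por lo tanto"}[doc["question"]]
text = doc["premise"][:-1].strip() + " " + connector

# doc_to_choice / doc_to_target: the two continuations and the gold one.
choices = [doc["choice1"], doc["choice2"]]
target = choices[doc["label"]]

print(text)    # El hombre perdió el autobús y por lo tanto
print(target)  # Llegó tarde al trabajo.
```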
......
task: escola
dataset_path: nbel/EsCoLA
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: null
doc_to_text: "{{Sentence}}\nPregunta: ¿Tiene sentido esta frase?\nRespuesta:"
doc_to_target: Label
doc_to_choice: ["no", "sí"]
metric_list:
  - metric: mcc
  - metric: acc
metadata:
  version: 1.0
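EsCoLA reports `mcc` alongside accuracy. A minimal self-contained sketch of the Matthews correlation coefficient for binary labels (the harness ships its own metric implementation; this is only illustrative):

```python
import math

def mcc(preds, refs):
    """Matthews correlation coefficient for binary 0/1 predictions."""
    tp = sum(p == 1 and r == 1 for p, r in zip(preds, refs))
    tn = sum(p == 0 and r == 0 for p, r in zip(preds, refs))
    fp = sum(p == 1 and r == 0 for p, r in zip(preds, refs))
    fn = sum(p == 0 and r == 1 for p, r in zip(preds, refs))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when any marginal is empty (undefined MCC).
    return (tp * tn - fp * fn) / denom if denom else 0.0

print(mcc([1, 1, 0, 0], [1, 1, 0, 0]))  # 1.0
```

Unlike plain accuracy, MCC stays informative on the label-imbalanced acceptability data, which is why the task reports both.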
......
include: ../mgsm/direct/mgsm_direct_es.yaml
doc_to_target: '{{answer_number|string}}'
doc_to_text: '{% if answer is not none %}{{question+"\nRespuesta: "}}{% else %}{{"Pregunta: "+question+"\nRespuesta: "}}{% endif %}'
generation_kwargs:
  until:
    - "\n\n"
    - "\n"
task: mgsm_direct_es_spanish_bench
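The `until` stop sequences above end generation at the first occurrence of any stop string, so the answer line is kept and any follow-up text is discarded. A hedged sketch of that behavior (hypothetical helper, not the harness implementation):

```python
def cut_at_stops(text, stops=("\n\n", "\n")):
    """Truncate `text` at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stops:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

print(cut_at_stops("42\nPregunta: ¿y ahora?"))  # 42
```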
task: openbookqa_es
dataset_path: BSC-LT/openbookqa-es
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: test
doc_to_text: question_stem
doc_to_target: "{{choices.label.index(answerKey.lstrip())}}"
doc_to_choice: "{{choices.text}}"
should_decontaminate: true
doc_to_decontamination_query: question_stem
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
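The `doc_to_target` above resolves the gold choice by position: the index of `answerKey` within `choices.label`, with `lstrip()` guarding against a stray leading space in the key. A sketch with a hypothetical document (field names as in the config):

```python
# Hypothetical OpenBookQA-es style document.
doc = {
    "question_stem": "¿Qué fuente de energía proviene del sol?",
    "choices": {
        "label": ["A", "B", "C", "D"],
        "text": ["solar", "eólica", "nuclear", "mareomotriz"],
    },
    "answerKey": " A",  # the leading space is what lstrip() removes
}

# doc_to_target: index of the cleaned answer key within the choice labels.
gold = doc["choices"]["label"].index(doc["answerKey"].lstrip())
print(doc["choices"]["text"][gold])  # solar
```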
group: spanish_bench
task:
- belebele_spa_Latn
- copa_es
- escola
- openbookqa_es
- wnli_es
- xnli_es_spanish_bench
- xstorycloze_es
......
......@@ -82,6 +82,15 @@ def process_docs_paraphrases(dataset):
).map(_process_doc)
def process_docs_copa_es(dataset):
    def _process_doc(doc):
        doc["choice1"] = lowercase_first_letter(doc["choice1"])
        doc["choice2"] = lowercase_first_letter(doc["choice2"])
        return doc

    return dataset.map(_process_doc)
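`lowercase_first_letter` is defined elsewhere in `utils.py`; a minimal sketch of the behavior `process_docs_copa_es` relies on, so each choice reads as a continuation of the prompt:

```python
def lowercase_first_letter(s: str) -> str:
    # Lowercase only the first character; safe on empty strings via slicing.
    return s[:1].lower() + s[1:]

print(lowercase_first_letter("Llegó tarde al trabajo."))  # llegó tarde al trabajo.
```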
def rouge1(items):
    """
    # passthrough for efficiency
......
# Task configuration derived from Eleuther AI's implementation as of March 22, 2024, supplemented with an additional preprocessing function
task: xnli_es
dataset_path: xnli
dataset_name: es
output_type: multiple_choice
doc_to_choice: '{{[premise+", ¿correcto? Sí, "+hypothesis,premise+", ¿correcto? Así
que, "+hypothesis,premise+", ¿correcto? No, "+hypothesis]}}'
doc_to_text: ''
target_delimiter: ''
process_docs: !function utils.process_doc_nli
training_split: null
validation_split: validation
doc_to_target: label
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
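As a sketch of how the `doc_to_choice` template above expands one example (hypothetical sample, not harness code): each premise/hypothesis pair becomes three Spanish continuations, one per NLI label, and the model's likelihood picks among them.

```python
# Hypothetical XNLI-es style document.
doc = {
    "premise": "El gato duerme en el sofá",
    "hypothesis": "el animal está descansando",
}

# Order matches the template: entailment ("Sí"), neutral ("Así que"),
# contradiction ("No"); the integer `label` indexes into this list.
choices = [
    doc["premise"] + ", ¿correcto? Sí, " + doc["hypothesis"],
    doc["premise"] + ", ¿correcto? Así que, " + doc["hypothesis"],
    doc["premise"] + ", ¿correcto? No, " + doc["hypothesis"],
]
print(choices[0])
```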