Unverified Commit ea17b98e authored by zxcvuser, committed by GitHub

Add new benchmark: Spanish bench (#2157)

* Add spanish_bench

* Add flores_es group

* Update _flores_common_yaml

* Delete lm_eval/tasks/spanish_bench/escola.yaml

* Delete escola from spanish_bench.yaml

* Delete escola from README.md

* pre-commit run --all-files

* Updated some task groupings and readme

---------
parent 15ffb0da
......@@ -86,6 +86,7 @@
| [pile_10k](pile_10k/README.md) | The first 10K elements of The Pile, useful for debugging models trained on it. | English |
| [piqa](piqa/README.md) | Physical Interaction Question Answering tasks to test physical commonsense reasoning. | English |
| [polemo2](polemo2/README.md) | Sentiment analysis and emotion detection tasks based on Polish language data. | Polish |
| [portuguese_bench](portuguese_bench/README.md) | Collection of tasks in European Portuguese encompassing various evaluation areas. | Portuguese |
| [prost](prost/README.md) | Tasks requiring understanding of professional standards and ethics in various domains. | English |
| [pubmedqa](pubmedqa/README.md) | Question answering tasks based on PubMed research articles for biomedical understanding. | English |
| [qa4mre](qa4mre/README.md) | Question Answering for Machine Reading Evaluation, assessing comprehension and reasoning. | English |
......@@ -95,6 +96,7 @@
| [sciq](sciq/README.md) | Science Question Answering tasks to assess understanding of scientific concepts. | English |
| [scrolls](scrolls/README.md) | Tasks that involve long-form reading comprehension across various domains. | English |
| [siqa](siqa/README.md) | Social Interaction Question Answering to evaluate common sense and social reasoning. | English |
| [spanish_bench](spanish_bench/README.md) | Collection of tasks in Spanish encompassing various evaluation areas. | Spanish |
| [squad_completion](squad_completion/README.md) | A variant of the SQuAD question answering task designed for zero-shot evaluation of small LMs. | English |
| [squadv2](squadv2/README.md) | Stanford Question Answering Dataset version 2, a reading comprehension benchmark. | English |
| [storycloze](storycloze/README.md) | Tasks to predict story endings, focusing on narrative logic and coherence. | English |
......@@ -121,4 +123,4 @@
| [xnli_eu](xnli_eu/README.md) | Cross-lingual Natural Language Inference tasks in Basque. | Basque |
| [xstorycloze](xstorycloze/README.md) | Cross-lingual narrative understanding tasks to predict story endings in multiple languages. | Russian, Simplified Chinese, Spanish, Arabic, Hindi, Indonesian, Telugu, Swahili, Basque, Burmese |
| [xwinograd](xwinograd/README.md) | Cross-lingual Winograd schema tasks for coreference resolution in multiple languages. | English, French, Japanese, Portuguese, Russian, Chinese |
| [portuguese_bench](portuguese_bench/README.md) | Collection of tasks in European Portuguese encompassing various evaluation areas. | Portuguese |
# SpanishBench
### Paper
SpanishBench is a benchmark for evaluating language models on Spanish tasks. That is, it evaluates the ability of a language model to understand and generate Spanish text. SpanishBench combines pre-existing, open datasets. Full details of SpanishBench will be published in a forthcoming paper.
The datasets included in SpanishBench are:
| Task | Category | Paper title | Homepage |
|:-------------:|:-----:|:-------------:|:-----:|
| Belebele_es | Reading Comprehension | [The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants](https://arxiv.org/abs/2308.16884) | https://huggingface.co/datasets/facebook/belebele |
| FLORES_es | Translation | [The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation](https://arxiv.org/abs/2106.03193) | https://huggingface.co/datasets/facebook/flores |
| MGSM_es | Math | [Language Models are Multilingual Chain-of-Thought Reasoners](https://arxiv.org/abs/2210.03057) | https://huggingface.co/datasets/juletxara/mgsm |
| PAWS-X_es | Paraphrasing | [PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification](https://aclanthology.org/D19-1382/) | https://huggingface.co/datasets/google-research-datasets/paws-x |
| WNLI-es | Natural Language Inference | No paper. | https://huggingface.co/datasets/PlanTL-GOB-ES/wnli-es |
| XL-Sum_es | Summarization | [XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages](https://aclanthology.org/2021.findings-acl.413/) | https://huggingface.co/datasets/csebuetnlp/xlsum |
| XNLI_es | Natural Language Inference | [XNLI: Evaluating Cross-lingual Sentence Representations](https://aclanthology.org/D18-1269/) | https://huggingface.co/datasets/facebook/xnli |
| XQuAD_es | Question Answering | [On the Cross-lingual Transferability of Monolingual Representations](https://aclanthology.org/2020.acl-main.421/) | https://huggingface.co/datasets/google/xquad |
| XStoryCloze_es | Commonsense Reasoning | [Few-shot Learning with Multilingual Generative Language Models](https://aclanthology.org/2022.emnlp-main.616/) | https://huggingface.co/datasets/juletxara/xstory_cloze |
### Citation
Paper for SpanishBench coming soon.
### Groups and Tasks
#### Groups
- `spanish_bench`: All tasks included in SpanishBench.
- `flores_es`: All FLORES translation tasks from or to Spanish.
#### Tags
- `phrases_es`: Two Phrases_va tasks for language adaptation between Spanish and Valencian.
#### Tasks
The following tasks evaluate models on the SpanishBench datasets using various scoring methods.
- `belebele_spa_Latn`
- `flores_es`
- `flores_es-ca`
- `flores_es-de`
- `flores_es-en`
- `flores_es-eu`
- `flores_es-fr`
- `flores_es-gl`
- `flores_es-it`
- `flores_es-pt`
- `flores_ca-es`
- `flores_de-es`
- `flores_en-es`
- `flores_eu-es`
- `flores_fr-es`
- `flores_gl-es`
- `flores_it-es`
- `flores_pt-es`
- `mgsm_direct_es_v2` (`v2` is due to an existing open issue in the original task)
- `paws_es`
- `phrases_es`
- `wnli_es`
- `xlsum_es`
- `xnli_es`
- `xquad_es`
- `xstorycloze_es`
Some of these tasks are taken from benchmarks already available in LM Evaluation Harness. These are:
- `belebele_spa_Latn`: Belebele Spanish
- `mgsm_direct_es`: MGSM Spanish (We fix an existing open issue in the original task)
- `paws_es`: PAWS-X Spanish
- `xnli_es`: XNLI Spanish
- `xstorycloze_es`: XStoryCloze Spanish
### Checklist
* [x] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation?
* [ ] Yes, original implementation contributed by author of the benchmark
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
group: flores
dataset_path: facebook/flores
dataset_name: all
output_type: generate_until
#! The test split of flores is not publicly available! (See paper section 6.1)
#! We are using `dev` and `devtest` splits, but they're mapped to train/validation/test in `data/flores/flores.py`.
training_split: dev
validation_split: dev
test_split: devtest
fewshot_split: dev
target_delimiter: ''
generation_kwargs:
  until:
    - "\n"
metric_list:
  - metric: bleu
    aggregation: bleu
    higher_is_better: true
  - metric: ter
    aggregation: ter
    higher_is_better: false
  - metric: chrf
    aggregation: chrf
    higher_is_better: true
metadata:
  version: 1.0
dataset_kwargs:
  trust_remote_code: true
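The `generation_kwargs` above ask for generation to stop at the first newline, so each model output is scored as a single translated sentence. The effect can be pictured with a small sketch. `truncate_at_stop` is a hypothetical helper written for illustration, not the harness's own implementation, and the raw output string is made up.

```python
from typing import Sequence


def truncate_at_stop(text: str, until: Sequence[str]) -> str:
    """Cut `text` at the earliest occurrence of any non-empty stop sequence."""
    cut = len(text)
    for stop in until:
        if not stop:
            continue  # an empty stop string would match at index 0; ignore it
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


# Hypothetical raw model output: the translation, then an unwanted continuation.
raw_output = " El gato duerme.\nEnglish sentence: The dog barks."
prediction = truncate_at_stop(raw_output, until=["\n"])
print(repr(prediction))
```

Only the text before the first `"\n"` would reach the BLEU, TER, and chrF metrics under this reading of the config.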
"""
Script to generate task YAMLs for the FLORES-200 dataset.
Based on `tasks/translation/utils.py`.
"""
import argparse
import itertools

import yaml
from langcodes import Language
# utils
flatten = lambda l: list(itertools.chain(*l))
# constants
_LANGUAGES = [
"ace_Arab",
"bam_Latn",
"dzo_Tibt",
"hin_Deva",
"khm_Khmr",
"mag_Deva",
"pap_Latn",
"sot_Latn",
"tur_Latn",
"ace_Latn",
"ban_Latn",
"ell_Grek",
"hne_Deva",
"kik_Latn",
"mai_Deva",
"pbt_Arab",
"spa_Latn",
"twi_Latn",
"acm_Arab",
"bel_Cyrl",
"eng_Latn",
"hrv_Latn",
"kin_Latn",
"mal_Mlym",
"pes_Arab",
"srd_Latn",
"tzm_Tfng",
"acq_Arab",
"bem_Latn",
"epo_Latn",
"hun_Latn",
"kir_Cyrl",
"mar_Deva",
"plt_Latn",
"srp_Cyrl",
"uig_Arab",
"aeb_Arab",
"ben_Beng",
"est_Latn",
"hye_Armn",
"kmb_Latn",
"min_Arab",
"pol_Latn",
"ssw_Latn",
"ukr_Cyrl",
"afr_Latn",
"bho_Deva",
"eus_Latn",
"ibo_Latn",
"kmr_Latn",
"min_Latn",
"por_Latn",
"sun_Latn",
"umb_Latn",
"ajp_Arab",
"bjn_Arab",
"ewe_Latn",
"ilo_Latn",
"knc_Arab",
"mkd_Cyrl",
"prs_Arab",
"swe_Latn",
"urd_Arab",
"aka_Latn",
"bjn_Latn",
"fao_Latn",
"ind_Latn",
"knc_Latn",
"mlt_Latn",
"quy_Latn",
"swh_Latn",
"uzn_Latn",
"als_Latn",
"bod_Tibt",
"fij_Latn",
"isl_Latn",
"kon_Latn",
"mni_Beng",
"ron_Latn",
"szl_Latn",
"vec_Latn",
"amh_Ethi",
"bos_Latn",
"fin_Latn",
"ita_Latn",
"kor_Hang",
"mos_Latn",
"run_Latn",
"tam_Taml",
"vie_Latn",
"apc_Arab",
"bug_Latn",
"fon_Latn",
"jav_Latn",
"lao_Laoo",
"mri_Latn",
"rus_Cyrl",
"taq_Latn",
"war_Latn",
"arb_Arab",
"bul_Cyrl",
"fra_Latn",
"jpn_Jpan",
"lij_Latn",
"mya_Mymr",
"sag_Latn",
"taq_Tfng",
"wol_Latn",
"arb_Latn",
"cat_Latn",
"fur_Latn",
"kab_Latn",
"lim_Latn",
"nld_Latn",
"san_Deva",
"tat_Cyrl",
"xho_Latn",
"ars_Arab",
"ceb_Latn",
"fuv_Latn",
"kac_Latn",
"lin_Latn",
"nno_Latn",
"sat_Olck",
"tel_Telu",
"ydd_Hebr",
"ary_Arab",
"ces_Latn",
"gaz_Latn",
"kam_Latn",
"lit_Latn",
"nob_Latn",
"scn_Latn",
"tgk_Cyrl",
"yor_Latn",
"arz_Arab",
"cjk_Latn",
"gla_Latn",
"kan_Knda",
"lmo_Latn",
"npi_Deva",
"shn_Mymr",
"tgl_Latn",
"yue_Hant",
"asm_Beng",
"ckb_Arab",
"gle_Latn",
"kas_Arab",
"ltg_Latn",
"nso_Latn",
"sin_Sinh",
"tha_Thai",
"zho_Hans",
"ast_Latn",
"crh_Latn",
"glg_Latn",
"kas_Deva",
"ltz_Latn",
"nus_Latn",
"slk_Latn",
"tir_Ethi",
"zho_Hant",
"awa_Deva",
"cym_Latn",
"grn_Latn",
"kat_Geor",
"lua_Latn",
"nya_Latn",
"slv_Latn",
"tpi_Latn",
"zsm_Latn",
"ayr_Latn",
"dan_Latn",
"guj_Gujr",
"kaz_Cyrl",
"lug_Latn",
"oci_Latn",
"smo_Latn",
"tsn_Latn",
"zul_Latn",
"azb_Arab",
"deu_Latn",
"hat_Latn",
"kbp_Latn",
"luo_Latn",
"ory_Orya",
"sna_Latn",
"tso_Latn",
"azj_Latn",
"dik_Latn",
"hau_Latn",
"kea_Latn",
"lus_Latn",
"pag_Latn",
"snd_Arab",
"tuk_Latn",
"bak_Cyrl",
"dyu_Latn",
"heb_Hebr",
"khk_Cyrl",
"lvs_Latn",
"pan_Guru",
"som_Latn",
"tum_Latn",
]
LANGUAGE_PAIRS = [
    (a, b) for idx, a in enumerate(_LANGUAGES) for b in _LANGUAGES[idx + 1 :]
]
LANGUAGES_OF_INTEREST = [
    "cat_Latn",
    "spa_Latn",
    "eng_Latn",
    "glg_Latn",
    "eus_Latn",
    "ita_Latn",
    "deu_Latn",
    "por_Latn",
    "fra_Latn",
]
MAIN_LANG = "spa_Latn"
LANGUAGE_PAIRS = [
    (a, b)
    for (a, b) in LANGUAGE_PAIRS
    if a in LANGUAGES_OF_INTEREST and b in LANGUAGES_OF_INTEREST and MAIN_LANG in (a, b)
]
# auxiliary functions
code_to_language_name = lambda code: Language.make(
    language=Language.get(code)["language"]
).display_name()
code_to_short_name = lambda code: Language.get(code)["language"]
jinja_var = (
    lambda s: "{{" + s + "}}"
)  # wrapper to avoid having to escape { } in format strings


def doc_to_text(src: str, tgt: str) -> str:
    src_name, tgt_name = map(code_to_language_name, [src, tgt])

    return f"""\
{src_name} sentence: {jinja_var('sentence_' + src)}
{tgt_name} sentence:"""


def doc_to_target(tgt: str) -> str:
    return f"{jinja_var('sentence_' + tgt)}"


# main function
def gen_lang_yamls(output_dir: str, overwrite: bool) -> None:
    """
    Generate a YAML file for each translation direction.
    """
    err = []
    for src, tgt in LANGUAGE_PAIRS:
        # do both translation directions for each lang pair
        for src, tgt in [(src, tgt), (tgt, src)]:
            lang_pair_name = f"{code_to_short_name(src)}-{code_to_short_name(tgt)}"
            yaml_file_name = f"flores_{lang_pair_name}.yaml"
            try:
                with open(
                    f"{output_dir}/{yaml_file_name}",
                    "w" if overwrite else "x",
                    encoding="utf-8",
                ) as outfile:
                    print(f"Creating {yaml_file_name}...")
                    outfile.write("# File generated by `create-yamls.py`\n")
                    yaml.dump(
                        {
                            # "group": "flores_es",
                            "include": "_flores_common_yaml",
                            "task": f"flores_{lang_pair_name}",
                            "doc_to_text": doc_to_text(src, tgt),
                            "doc_to_target": doc_to_target(tgt),
                        },
                        outfile,
                        sort_keys=False,
                    )
            except FileExistsError:
                err.append(yaml_file_name)

    if len(err) > 0:
        raise FileExistsError(
            "Files were not created because they already exist:"
            f" {', '.join(err)}"
            "\nUse flag --overwrite to overwrite them."
        )


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--overwrite",
        default=False,
        action="store_true",
        help="Overwrite files if they already exist",
    )
    parser.add_argument(
        "--output-dir", default=".", help="Directory to write yaml files to"
    )
    args = parser.parse_args()

    gen_lang_yamls(output_dir=args.output_dir, overwrite=args.overwrite)


if __name__ == "__main__":
    main()
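To see which task files the script produces, the pair selection and prompt template can be reproduced without `langcodes`. The sketch below is an illustration under stated assumptions: the `SHORT` and `NAME` tables are hard-coded stand-ins for the `langcodes` lookups, and the nine codes follow the order of `LANGUAGES_OF_INTEREST` rather than `_LANGUAGES`, so only the resulting set of task names is meant to match the script, not the order in which it writes files.

```python
from itertools import combinations

# Hard-coded stand-ins for the langcodes lookups (assumption, illustration only).
SHORT = {
    "cat_Latn": "ca", "spa_Latn": "es", "eng_Latn": "en",
    "glg_Latn": "gl", "eus_Latn": "eu", "ita_Latn": "it",
    "deu_Latn": "de", "por_Latn": "pt", "fra_Latn": "fr",
}
NAME = {
    "cat_Latn": "Catalan", "spa_Latn": "Spanish", "eng_Latn": "English",
    "glg_Latn": "Galician", "eus_Latn": "Basque", "ita_Latn": "Italian",
    "deu_Latn": "German", "por_Latn": "Portuguese", "fra_Latn": "French",
}
MAIN_LANG = "spa_Latn"

# Keep only the unordered pairs that involve the main language.
pairs = [(a, b) for a, b in combinations(SHORT, 2) if MAIN_LANG in (a, b)]


def prompt(src: str, tgt: str) -> str:
    """Build the doc_to_text template with a literal Jinja placeholder."""
    placeholder = "{{" + f"sentence_{src}" + "}}"
    return f"{NAME[src]} sentence: {placeholder}\n{NAME[tgt]} sentence:"


# Each pair yields two translation directions, hence two task names.
task_names = sorted(
    f"flores_{SHORT[s]}-{SHORT[t]}"
    for a, b in pairs
    for s, t in ((a, b), (b, a))
)

print(len(pairs), len(task_names))
print(prompt("cat_Latn", "spa_Latn"))
```

Eight pairs give sixteen directions, which agrees with the number of `flores_*-*` tasks listed in the README.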
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_ca-es
doc_to_text: 'Catalan sentence: {{sentence_cat_Latn}}

  Spanish sentence:'
doc_to_target: '{{sentence_spa_Latn}}'
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_de-es
doc_to_text: 'German sentence: {{sentence_deu_Latn}}

  Spanish sentence:'
doc_to_target: '{{sentence_spa_Latn}}'
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_en-es
doc_to_text: 'English sentence: {{sentence_eng_Latn}}

  Spanish sentence:'
doc_to_target: '{{sentence_spa_Latn}}'
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_es-ca
doc_to_text: 'Spanish sentence: {{sentence_spa_Latn}}

  Catalan sentence:'
doc_to_target: '{{sentence_cat_Latn}}'
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_es-de
doc_to_text: 'Spanish sentence: {{sentence_spa_Latn}}

  German sentence:'
doc_to_target: '{{sentence_deu_Latn}}'
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_es-en
doc_to_text: 'Spanish sentence: {{sentence_spa_Latn}}

  English sentence:'
doc_to_target: '{{sentence_eng_Latn}}'
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_es-eu
doc_to_text: 'Spanish sentence: {{sentence_spa_Latn}}

  Basque sentence:'
doc_to_target: '{{sentence_eus_Latn}}'
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_es-fr
doc_to_text: 'Spanish sentence: {{sentence_spa_Latn}}

  French sentence:'
doc_to_target: '{{sentence_fra_Latn}}'
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_es-gl
doc_to_text: 'Spanish sentence: {{sentence_spa_Latn}}

  Galician sentence:'
doc_to_target: '{{sentence_glg_Latn}}'
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_es-it
doc_to_text: 'Spanish sentence: {{sentence_spa_Latn}}

  Italian sentence:'
doc_to_target: '{{sentence_ita_Latn}}'
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_es-pt
doc_to_text: 'Spanish sentence: {{sentence_spa_Latn}}

  Portuguese sentence:'
doc_to_target: '{{sentence_por_Latn}}'
group: flores_es
task:
- flores_es-en
- flores_en-es
- flores_es-eu
- flores_eu-es
- flores_es-pt
- flores_pt-es
- flores_es-it
- flores_it-es
- flores_es-fr
- flores_fr-es
- flores_es-ca
- flores_ca-es
- flores_es-gl
- flores_gl-es
- flores_es-de
- flores_de-es
aggregate_metric_list:
  - metric: bleu
    aggregation: mean
    weight_by_size: false
metadata:
  version: 1.0
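With `aggregation: mean` and `weight_by_size: false`, the group score is read here as a plain (macro) average of the per-task BLEU scores, so every translation direction counts equally regardless of how many examples it contains. The sketch below contrasts that with a size-weighted mean. The `aggregate` helper is hypothetical rather than harness code, and the scores and sizes are made-up numbers chosen only to make the difference visible.

```python
from typing import Dict


def aggregate(scores: Dict[str, float], sizes: Dict[str, int], weight_by_size: bool) -> float:
    """Average per-task scores, optionally weighting each task by its example count."""
    if weight_by_size:
        total = sum(sizes[task] for task in scores)
        return sum(scores[task] * sizes[task] for task in scores) / total
    return sum(scores.values()) / len(scores)


# Made-up per-task BLEU scores and example counts (illustration only).
scores = {"flores_es-en": 30.0, "flores_en-es": 20.0}
sizes = {"flores_es-en": 100, "flores_en-es": 300}

unweighted = aggregate(scores, sizes, weight_by_size=False)
weighted = aggregate(scores, sizes, weight_by_size=True)
print(unweighted, weighted)
```

When all subtasks have the same number of examples, the two averages coincide.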
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_eu-es
doc_to_text: 'Basque sentence: {{sentence_eus_Latn}}

  Spanish sentence:'
doc_to_target: '{{sentence_spa_Latn}}'
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_fr-es
doc_to_text: 'French sentence: {{sentence_fra_Latn}}

  Spanish sentence:'
doc_to_target: '{{sentence_spa_Latn}}'
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_gl-es
doc_to_text: 'Galician sentence: {{sentence_glg_Latn}}

  Spanish sentence:'
doc_to_target: '{{sentence_spa_Latn}}'
# File generated by `create-yamls.py`
include: _flores_common_yaml
task: flores_it-es
doc_to_text: 'Italian sentence: {{sentence_ita_Latn}}

  Spanish sentence:'
doc_to_target: '{{sentence_spa_Latn}}'