Commit 60c9c170 authored by haileyschoelkopf's avatar haileyschoelkopf
Merge branch 'main' into inverse-scaling-tasks

parents 4b2d565b b4cd85d4
"dataset_name": "ancient_medical"
"description": "以下是关于医古文的单项选择题,请直接给出正确答案的选项。\n\n"
"include": "_default_template_yaml"
"task": "aclue_ancient_medical"
"dataset_name": "ancient_phonetics"
"description": "以下是关于古音学的单项选择题,请直接给出正确答案的选项。\n\n"
"include": "_default_template_yaml"
"task": "aclue_ancient_phonetics"
"dataset_name": "basic_ancient_chinese"
"description": "以下是关于古汉语知识的单项选择题,请直接给出正确答案的选项。\n\n"
"include": "_default_template_yaml"
"task": "aclue_basic_ancient_chinese"
"dataset_name": "couplet_prediction"
"description": "以下是关于对联的单项选择题,请直接给出正确答案的选项。\n\n"
"include": "_default_template_yaml"
"task": "aclue_couplet_prediction"
"dataset_name": "homographic_character_resolution"
"description": "以下是关于通假字的单项选择题,请直接给出正确答案的选项。\n\n"
"include": "_default_template_yaml"
"task": "aclue_homographic_character_resolution"
"dataset_name": "named_entity_recognition"
"description": "以下是关于古汉语命名体识别的单项选择题,请直接给出正确答案的选项。\n\n"
"include": "_default_template_yaml"
"task": "aclue_named_entity_recognition"
"dataset_name": "poetry_appreciate"
"description": "以下是关于古诗词曲鉴赏的单项选择题,请直接给出正确答案的选项。\n\n"
"include": "_default_template_yaml"
"task": "aclue_poetry_appreciate"
"dataset_name": "poetry_context_prediction"
"description": "以下是关于古诗词上下句预测的单项选择题,请直接给出正确答案的选项。\n\n"
"include": "_default_template_yaml"
"task": "aclue_poetry_context_prediction"
"dataset_name": "poetry_quality_assessment"
"description": "以下是关于古诗词质量评估的单项选择题,请直接给出正确答案的选项。\n\n"
"include": "_default_template_yaml"
"task": "aclue_poetry_quality_assessment"
"dataset_name": "poetry_sentiment_analysis"
"description": "以下是关于诗词情感分类的单项选择题,请直接给出正确答案的选项。\n\n"
"include": "_default_template_yaml"
"task": "aclue_poetry_sentiment_analysis"
"dataset_name": "polysemy_resolution"
"description": "以下是关于古文单字多义的单项选择题,请直接给出正确答案的选项。\n\n"
"include": "_default_template_yaml"
"task": "aclue_polysemy_resolution"
"dataset_name": "reading_comprehension"
"description": "以下是关于古文阅读理解的单项选择题,请直接给出正确答案的选项。\n\n"
"include": "_default_template_yaml"
"task": "aclue_reading_comprehension"
"dataset_name": "sentence_segmentation"
"description": "以下是关于古文断句的单项选择题,请直接给出正确答案的选项。\n\n"
"include": "_default_template_yaml"
"task": "aclue_sentence_segmentation"
# BasqueGLUE
### Paper
Title: `BasqueGLUE: A Natural Language Understanding Benchmark for Basque`
Abstract: `https://aclanthology.org/2022.lrec-1.172/`
Natural Language Understanding (NLU) technology has improved significantly over the last few years and multitask benchmarks such as GLUE are key to evaluate this improvement in a robust and general way. These benchmarks take into account a wide and diverse set of NLU tasks that require some form of language understanding, beyond the detection of superficial, textual clues. However, they are costly to develop and language-dependent, and therefore they are only available for a small number of languages. In this paper, we present BasqueGLUE, the first NLU benchmark for Basque, a less-resourced language, which has been elaborated from previously existing datasets and following similar criteria to those used for the construction of GLUE and SuperGLUE. We also report the evaluation of two state-of-the-art language models for Basque on BasqueGLUE, thus providing a strong baseline to compare upon. BasqueGLUE is freely available under an open license.
Homepage: `https://github.com/orai-nlp/BasqueGLUE`
Title: `Latxa: An Open Language Model and Evaluation Suite for Basque`
Abstract: `https://arxiv.org/abs/2403.20266`
This paper presents the use of BasqueGLUE for evaluating the performance of decoder models in Basque.
Homepage: `https://github.com/hitz-zentroa/latxa`
### Citation
```
@InProceedings{urbizu2022basqueglue,
  author    = {Urbizu, Gorka and San Vicente, Iñaki and Saralegi, Xabier and Agerri, Rodrigo and Soroa, Aitor},
  title     = {BasqueGLUE: A Natural Language Understanding Benchmark for Basque},
  booktitle = {Proceedings of the Language Resources and Evaluation Conference},
  month     = {June},
  year      = {2022},
  address   = {Marseille, France},
  publisher = {European Language Resources Association},
  pages     = {1603--1612},
  url       = {https://aclanthology.org/2022.lrec-1.172}
}

@misc{etxaniz2024latxa,
  title         = {Latxa: An Open Language Model and Evaluation Suite for Basque},
  author        = {Julen Etxaniz and Oscar Sainz and Naiara Perez and Itziar Aldabe and German Rigau and Eneko Agirre and Aitor Ormazabal and Mikel Artetxe and Aitor Soroa},
  year          = {2024},
  eprint        = {2403.20266},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL}
}
```
### Groups and Tasks
#### Groups
* `basque-glue`: First version of the implementation
#### Tasks
* `bhtc_v2`: Topic classification of news extracts with 12 categories.
* `bec`: Sentiment analysis on tweets about the campaign for the 2016 Basque elections.
* `vaxx_stance`: Stance detection on tweets around the anti-vaccine movement.
* `qnlieu`: Q&A NLI as in [glue/qnli](../glue/qnli).
* `wiceu`: Word-in-Context as in [super_glue/wic](../super_glue/wic).
* `epec_korref_bin`: Coreference detection as in [super_glue/wsc](../super_glue/wsc).
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
group: basque-glue
task: bec2016eu
dataset_path: orai-nlp/basqueGLUE
dataset_name: bec
output_type: multiple_choice
validation_split: validation
test_split: test
doc_to_text: "Testua: {{text}}\nGaldera: Nolako jarrera agertzen du aurreko testuak?\nErantzuna:"
doc_to_target: label
doc_to_choice: ['negatiboa', 'neutrala', 'positiboa']
metric_list:
  - metric: f1
    aggregation: !function utils.micro_f1_score
    higher_is_better: true
metadata:
  version: 1.0
group: basque-glue
task: bhtc_v2
dataset_path: orai-nlp/basqueGLUE
dataset_name: bhtc
output_type: multiple_choice
validation_split: validation
test_split: test
doc_to_text: "Testua: {{text}}\nGaldera: Zein da aurreko testuaren gaia?\nErantzuna:"
doc_to_target: label
doc_to_choice: ['Ekonomia', 'Euskal Herria', 'Euskara', 'Gizartea', 'Historia', 'Ingurumena', 'Iritzia', 'Komunikazioa', 'Kultura', 'Nazioartea', 'Politika', 'Zientzia']
metric_list:
  - metric: f1
    aggregation: !function utils.micro_f1_score
    higher_is_better: true
metadata:
  version: 1.0
group: basque-glue
task: epec_koref_bin
dataset_path: orai-nlp/basqueGLUE
dataset_name: coref
output_type: multiple_choice
validation_split: validation
test_split: test
doc_to_text: !function utils.coref_doc_to_text
doc_to_target: label
doc_to_choice: ['ez', 'bai']
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
group: basque-glue
task: qnlieu
dataset_path: orai-nlp/basqueGLUE
dataset_name: qnli
output_type: multiple_choice
validation_split: validation
test_split: test
doc_to_text: "{{question}}\n{{sentence}}\nGaldera: aurreko galderari erantzuten al dio emandako testuak?\nErantzuna:"
doc_to_target: label
doc_to_choice: ['bai', 'ez']
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
import html
import re

from datasets import load_metric


def general_detokenize(string):
    """Undo whitespace tokenization around punctuation and paired delimiters."""
    string = re.sub(r"\s+([.,;:!?)])", r"\1", string)
    string = re.sub(r"(\s+|^)\(\s+([^)]+)\s+\)", r"\1(\2)", string)
    string = re.sub(r"(\s+|^)\[\s+([^\]]+)\s+\]", r"\1[\2]", string)
    string = re.sub(r'(\s+|^)"\s+([^"]+)\s+"', r'\1"\2"', string)
    string = re.sub(r"(\s+|^)'\s+([^']+)\s+'", r"\1'\2'", string)
    return string


def process_doc(string):
    """Unescape HTML entities and detokenize a document string."""
    string = html.unescape(string)
    string = general_detokenize(string)
    return string


def process_wic_docs(dataset):
    def _helper(doc):
        # There are some encoding issues with this dataset: the sentences are
        # UTF-8 bytes that were mis-decoded as Latin-1, so round-trip them to
        # recover the original text.
        doc["sentence1"] = (
            process_doc(doc["sentence1"]).encode("latin-1").decode("utf-8")
        )
        doc["sentence2"] = (
            process_doc(doc["sentence2"]).encode("latin-1").decode("utf-8")
        )
        return doc

    return dataset.map(_helper)


def coref_doc_to_text(x):
    """Build the coreference prompt, marking both spans with asterisks."""

    def _span_in_context(span_index, span_text):
        span_start = span_index
        span_end = span_start + len(span_text.split(" ")) - 1
        tokens[span_start] = f"*{tokens[span_start]}"
        tokens[span_end] = f"{tokens[span_end]}*"

    tokens = x["text"].split(" ")
    _span_in_context(x["span1_index"], x["span1_text"])
    _span_in_context(
        x["span2_index"] - 1, x["span2_text"]
    )  # span1_index is 0-based but span2_index is 1-based
    context = process_doc(" ".join(tokens))
    span_1 = process_doc(x["span1_text"])
    span_2 = process_doc(x["span2_text"])
    text = (
        f"Testua: {context}\n"
        + f'Galdera: Aurreko testuan, "*{span_1}*" eta "*{span_2}*" gauza bera dira?\n'
        + "Erantzuna:"
    )
    return text


# Measure F1 as in the benchmark repo:
# https://github.com/orai-nlp/BasqueGLUE/blob/main/eval_basqueglue.py
def micro_f1_score(items):
    # Note: load_metric is deprecated in newer versions of `datasets`;
    # `evaluate.load("f1")` is the replacement.
    f1_metric = load_metric("f1")
    golds, preds = list(zip(*items))
    f1_score = f1_metric.compute(references=golds, predictions=preds, average="micro")[
        "f1"
    ]
    return f1_score


def vaxx_f1_score(items):
    # Average the per-class F1 of the AGAINST (0) and FAVOR (2) classes only,
    # ignoring the neutral class, as in the original benchmark evaluation.
    f1_metric = load_metric("f1")
    golds, preds = list(zip(*items))
    f1_class = f1_metric.compute(
        references=golds, predictions=preds, labels=[0, 2], average=None
    )["f1"]
    f1_score = sum(f1_class) / len(f1_class)
    return f1_score
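The span-marking step of `coref_doc_to_text` can be illustrated in isolation. This is a minimal sketch with no dataset dependencies; `mark_spans` is a hypothetical helper for demonstration, not part of the harness:

```python
def mark_spans(text, span1_index, span1_text, span2_index, span2_text):
    # Wrap both spans in asterisks, mirroring coref_doc_to_text.
    # span1_index is 0-based; span2_index is 1-based in the dataset.
    tokens = text.split(" ")

    def _mark(start, span_text):
        # A multi-word span gets one asterisk at each end.
        end = start + len(span_text.split(" ")) - 1
        tokens[start] = "*" + tokens[start]
        tokens[end] = tokens[end] + "*"

    _mark(span1_index, span1_text)
    _mark(span2_index - 1, span2_text)
    return " ".join(tokens)


print(mark_spans("Jonek esan zuen bera etorriko dela", 0, "Jonek", 4, "bera"))
# → *Jonek* esan zuen *bera* etorriko dela
```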
group: basque-glue
task: vaxx_stance
dataset_path: orai-nlp/basqueGLUE
dataset_name: vaxx
output_type: multiple_choice
validation_split: validation
test_split: test
doc_to_text: "Testua: {{text}}\nGaldera: Nolako jarrera agertzen du aurreko testuak txertoei buruz?\nErantzuna:"
doc_to_target: label
doc_to_choice: ['aurka', 'neutrala', 'alde']
metric_list:
  - metric: f1
    aggregation: !function utils.vaxx_f1_score
    higher_is_better: true
metadata:
  version: 1.0
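The custom aggregations referenced by these configs (`utils.micro_f1_score` and `utils.vaxx_f1_score`) can be reproduced without the `datasets` metric backend. A rough sketch, assuming single-label multiclass predictions (`micro_f1` and `macro_f1_subset` are illustrative names, not part of the harness):

```python
def micro_f1(golds, preds):
    # Micro-averaged F1 pools TP/FP/FN over all classes; for single-label
    # multiclass predictions it reduces to plain accuracy.
    correct = sum(g == p for g, p in zip(golds, preds))
    return correct / len(golds)


def macro_f1_subset(golds, preds, labels):
    # Per-class F1 averaged over a subset of labels, mirroring the
    # vaxx_stance metric that scores only the against (0) and favor (2)
    # classes and ignores the neutral class.
    scores = []
    for lab in labels:
        tp = sum(1 for g, p in zip(golds, preds) if g == lab and p == lab)
        fp = sum(1 for g, p in zip(golds, preds) if g != lab and p == lab)
        fn = sum(1 for g, p in zip(golds, preds) if g == lab and p != lab)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```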