Commit 653217a7 authored by jp, committed by GitHub

add Kobest (#1263)

* Add: kobest config file

* Add: kobest utils

* Add: README

* Update utils.py
# KoBEST
### Paper

Title: `KOBEST: Korean Balanced Evaluation of Significant Tasks`

Abstract: https://arxiv.org/abs/2204.04541

A well-formulated benchmark plays a critical role in spurring advancements in the natural language processing (NLP) field, as it allows objective and precise evaluation of diverse models. As modern language models (LMs) have become more elaborate and sophisticated, more difficult benchmarks that require linguistic knowledge and reasoning have been proposed. However, most of these benchmarks only support English, and great effort is necessary to construct benchmarks for other low resource languages. To this end, we propose a new benchmark named Korean balanced evaluation of significant tasks (KoBEST), which consists of five Korean-language downstream tasks. Professional Korean linguists designed the tasks that require advanced Korean linguistic knowledge. Moreover, our data is purely annotated by humans and thoroughly reviewed to guarantee high data quality. We also provide baseline models and human performance results. Our dataset is available on the Huggingface.

Homepage: https://huggingface.co/datasets/skt/kobest_v1
### Groups and Tasks

#### Groups

- `kobest`

#### Tasks

- `kobest_boolq`
- `kobest_copa`
- `kobest_hellaswag`
- `kobest_sentineg`
- `kobest_wic`
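Once these configs land, the whole `kobest` group (or any single task) can be run through the harness's Python entry point. A minimal sketch, assuming an lm-eval version that exposes `lm_eval.simple_evaluate`; the model below is purely a stand-in:

```python
# Minimal sketch: run the KoBEST group end to end via the harness API.
# The pretrained model named here is only an illustration; any HF causal LM works.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/polyglot-ko-1.3b",
    tasks=["kobest"],  # or a subset, e.g. ["kobest_boolq", "kobest_copa"]
)
print(results["results"])
```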
### Citation

@misc{kim2022kobest,
    author={Kim, Dohyeong and Jang, Myeongjun and Kwon, Deuk Sin and Davis, Eric},
    title={KOBEST: Korean Balanced Evaluation of Significant Tasks},
    doi={10.48550/arXiv.2204.04541},
    publisher={arXiv},
    year={2022},
    month={Apr}
}
group:
  - kobest
task: kobest_boolq
dataset_path: skt/kobest_v1
dataset_name: boolq
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
doc_to_text: "{{paragraph}} 질문: {{question}} 답변: "
doc_to_target: "{{label}}"
doc_to_choice: ["아니오", "예"]
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: True
  - metric: f1
    aggregation: !function utils.macro_f1_score
    average: macro
    hf_evaluate: true
    higher_is_better: True
metadata:
  version: 1.0
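In the boolq config above, the prompt is a plain Jinja template over the dataset columns, and the integer label indexes into `doc_to_choice`. A small sketch of how the rendering works (the sample values are made up):

```python
# Sketch: how the boolq doc_to_text template and doc_to_choice interact.
# The harness renders Jinja templates over each doc; sample values are made up.
from jinja2 import Template

doc = {"paragraph": "두루미는 다리가 길다.", "question": "두루미는 다리가 긴가?", "label": 1}

prompt = Template("{{paragraph}} 질문: {{question}} 답변: ").render(**doc)
choices = ["아니오", "예"]  # label 0 -> "아니오" (no), label 1 -> "예" (yes)

print(prompt)                 # 두루미는 다리가 길다. 질문: 두루미는 다리가 긴가? 답변:
print(choices[doc["label"]])  # 예
```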
group:
  - kobest
task: kobest_copa
dataset_path: skt/kobest_v1
dataset_name: copa
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
doc_to_text: !function utils.copa_doc_to_text
doc_to_target: !function utils.copa_doc_to_target
doc_to_choice: !function utils.copa_doc_to_choice
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: True
  - metric: f1
    aggregation: !function utils.macro_f1_score
    average: macro
    hf_evaluate: true
    higher_is_better: True
metadata:
  version: 1.0
group:
  - kobest
task: kobest_hellaswag
dataset_path: skt/kobest_v1
dataset_name: hellaswag
training_split: train
validation_split: validation
output_type: multiple_choice
test_split: test
doc_to_text: "{{query}}"
doc_to_target: "{{label}}"
process_docs: !function utils.hellaswag_process_doc
doc_to_choice: "choices"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: True
  - metric: acc_norm
    aggregation: mean
    higher_is_better: True
  - metric: f1
    aggregation: !function utils.macro_f1_score
    average: macro
    hf_evaluate: true
    higher_is_better: True
metadata:
  version: 1.0
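Unlike the other four tasks, hellaswag routes the raw split through `process_docs` (see `hellaswag_process_doc` in utils.py below) before templating. The raw columns it consumes can be inspected directly; a sketch, assuming network access to the Hub:

```python
# Sketch: peek at the raw KoBEST hellaswag split that process_docs consumes.
from datasets import load_dataset

ds = load_dataset("skt/kobest_v1", "hellaswag", split="validation")
print(ds.column_names)  # expected to include context, ending_1..ending_4, label
print(ds[0]["context"])
```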
group:
  - kobest
task: kobest_sentineg
dataset_path: skt/kobest_v1
dataset_name: sentineg
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
doc_to_text: !function utils.sentineg_doc_to_text
doc_to_target: "{{label}}"
doc_to_choice: ["부정", "긍정"]
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: True
  - metric: f1
    aggregation: !function utils.macro_f1_score
    average: macro
    hf_evaluate: true
    higher_is_better: True
metadata:
  version: 1.0
group:
  - kobest
task: kobest_wic
dataset_path: skt/kobest_v1
dataset_name: wic
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
doc_to_text: !function utils.wic_doc_to_text
doc_to_target: "{{label}}"
doc_to_choice: ['아니오', '예']
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: True
  - metric: f1
    aggregation: !function utils.macro_f1_score
    average: macro
    hf_evaluate: true
    higher_is_better: True
metadata:
  version: 1.0
from datasets import Dataset
from sklearn.metrics import f1_score


def copa_doc_to_text(doc: dict) -> str:
    # Map the question type to a Korean connective: "원인" (cause) -> "왜냐하면"
    # (because), "결과" (effect) -> "그래서" (so).
    connector = {"원인": " 왜냐하면", "결과": " 그래서"}[doc["question"].strip()]
    return f"""{doc["premise"]} {connector}"""


def copa_doc_to_target(doc: dict) -> str:
    # Label 0 selects the first alternative, label 1 the second.
    correct_choice = doc["alternative_1"] if doc["label"] == 0 else doc["alternative_2"]
    return f"""{correct_choice}"""


def copa_doc_to_choice(doc: dict) -> list:
    return [f"""{doc["alternative_1"]}""", f"""{doc["alternative_2"]}"""]


def sentineg_doc_to_text(doc: dict):
    # "문장" = sentence, "긍부정" = positive/negative.
    return f"""문장: {doc["sentence"]} 긍부정:"""


def wic_doc_to_text(doc: dict) -> str:
    # Asks whether the target word is used with the same sense in both sentences.
    return f"""문장1: {doc["context_1"]} 문장2: {doc["context_2"]} 두 문장에서 {doc["word"]}가 같은 뜻으로 쓰였나?"""


def hellaswag_process_doc(doc: Dataset) -> Dataset:
    # Collapse the four ending columns into a single "choices" list and expose
    # the query/gold fields that the YAML template expects.
    def preprocessor(dataset):
        return {
            "query": f"""문장: {dataset["context"]}""",
            "choices": [dataset["ending_1"], dataset["ending_2"], dataset["ending_3"], dataset["ending_4"]],
            "gold": int(dataset["label"]),
        }

    return doc.map(preprocessor)


def macro_f1_score(items):
    # items is a list of (gold, pred) pairs accumulated by the harness.
    unzipped_list = list(zip(*items))
    golds = unzipped_list[0]
    preds = unzipped_list[1]
    fscore = f1_score(golds, preds, average='macro')
    return fscore
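As a quick sanity check, the helpers above can be exercised on a hand-written doc (run alongside utils.py so the functions are in scope; the Korean sample sentences are made up):

```python
# Sketch: exercise the COPA helpers and the macro-F1 aggregator directly.
copa_doc = {
    "premise": "비가 왔다.",            # "It rained."
    "question": "결과",                 # effect
    "alternative_1": "우산을 챙겼다.",  # "I packed an umbrella."
    "alternative_2": "땅이 젖었다.",    # "The ground got wet."
    "label": 1,
}
print(copa_doc_to_text(copa_doc))    # 비가 왔다.  그래서
print(copa_doc_to_target(copa_doc))  # 땅이 젖었다.
print(copa_doc_to_choice(copa_doc))  # ['우산을 챙겼다.', '땅이 젖었다.']

# macro_f1_score consumes (gold, pred) pairs and macro-averages per-class F1.
print(macro_f1_score([(0, 0), (1, 1), (1, 0), (0, 0)]))  # ~0.733
```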