Commit 653217a7 authored by jp, committed by GitHub

add Kobest (#1263)

* Add: kobest config file

* Add: kobest utils

* Add: README

* Update utils.py
# KoBEST
### Paper

Title: `KOBEST: Korean Balanced Evaluation of Significant Tasks`

Abstract: https://arxiv.org/abs/2204.04541

A well-formulated benchmark plays a critical role in spurring advancements in the natural language processing (NLP) field, as it allows objective and precise evaluation of diverse models. As modern language models (LMs) have become more elaborate and sophisticated, more difficult benchmarks that require linguistic knowledge and reasoning have been proposed. However, most of these benchmarks only support English, and great effort is necessary to construct benchmarks for other low resource languages. To this end, we propose a new benchmark named Korean balanced evaluation of significant tasks (KoBEST), which consists of five Korean-language downstream tasks. Professional Korean linguists designed the tasks that require advanced Korean linguistic knowledge. Moreover, our data is purely annotated by humans and thoroughly reviewed to guarantee high data quality. We also provide baseline models and human performance results. Our dataset is available on the Huggingface.

Homepage: https://huggingface.co/datasets/skt/kobest_v1
### Groups and Tasks

#### Groups

- `kobest`

#### Tasks

- `kobest_boolq`
- `kobest_copa`
- `kobest_hellaswag`
- `kobest_sentineg`
- `kobest_wic`
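Once these configs land, the whole `kobest` group (or any single task) can be run through the harness's Python entry point. A minimal sketch, assuming an lm-eval version that exposes `lm_eval.simple_evaluate`; the model below is purely a stand-in:

```python
# Minimal sketch: run the KoBEST group end to end via the harness API.
# The pretrained model named here is only an illustration; any HF causal LM works.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/polyglot-ko-1.3b",
    tasks=["kobest"],  # or a subset, e.g. ["kobest_boolq", "kobest_copa"]
)
print(results["results"])
```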
### Citation

@misc{kim2022kobest,
    author={Kim, Dohyeong and Jang, Myeongjun and Kwon, Deuk Sin and Davis, Eric},
    title={KOBEST: Korean Balanced Evaluation of Significant Tasks},
    doi={10.48550/arXiv.2204.04541},
    publisher={arXiv},
    year={2022},
    month={Apr}
}
group:
  - kobest
task: kobest_boolq
dataset_path: skt/kobest_v1
dataset_name: boolq
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
doc_to_text: "{{paragraph}} 질문: {{question}} 답변: "
doc_to_target: "{{label}}"
doc_to_choice: ["아니오", "예"]
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: True
  - metric: f1
    aggregation: !function utils.macro_f1_score
    average: macro
    hf_evaluate: true
    higher_is_better: True
metadata:
  version: 1.0
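In the boolq config above, the prompt is a plain Jinja template over the dataset columns, and the integer label indexes into `doc_to_choice`. A small sketch of how the rendering works (the sample values are made up):

```python
# Sketch: how the boolq doc_to_text template and doc_to_choice interact.
# The harness renders Jinja templates over each doc; sample values are made up.
from jinja2 import Template

doc = {"paragraph": "두루미는 다리가 길다.", "question": "두루미는 다리가 긴가?", "label": 1}

prompt = Template("{{paragraph}} 질문: {{question}} 답변: ").render(**doc)
choices = ["아니오", "예"]  # label 0 -> "아니오" (no), label 1 -> "예" (yes)

print(prompt)                 # 두루미는 다리가 길다. 질문: 두루미는 다리가 긴가? 답변:
print(choices[doc["label"]])  # 예
```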
group:
  - kobest
task: kobest_copa
dataset_path: skt/kobest_v1
dataset_name: copa
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
doc_to_text: !function utils.copa_doc_to_text
doc_to_target: !function utils.copa_doc_to_target
doc_to_choice: !function utils.copa_doc_to_choice
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: True
  - metric: f1
    aggregation: !function utils.macro_f1_score
    average: macro
    hf_evaluate: true
    higher_is_better: True
metadata:
  version: 1.0
group:
  - kobest
task: kobest_hellaswag
dataset_path: skt/kobest_v1
dataset_name: hellaswag
training_split: train
validation_split: validation
output_type: multiple_choice
test_split: test
doc_to_text: "{{query}}"
doc_to_target: "{{label}}"
process_docs: !function utils.hellaswag_process_doc
doc_to_choice: "choices"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: True
  - metric: acc_norm
    aggregation: mean
    higher_is_better: True
  - metric: f1
    aggregation: !function utils.macro_f1_score
    average: macro
    hf_evaluate: true
    higher_is_better: True
metadata:
  version: 1.0
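Unlike the other four tasks, hellaswag routes the raw split through `process_docs` (see `hellaswag_process_doc` in utils.py below) before templating. The raw columns it consumes can be inspected directly; a sketch, assuming network access to the Hub:

```python
# Sketch: peek at the raw KoBEST hellaswag split that process_docs consumes.
from datasets import load_dataset

ds = load_dataset("skt/kobest_v1", "hellaswag", split="validation")
print(ds.column_names)  # expected to include context, ending_1..ending_4, label
print(ds[0]["context"])
```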
group:
  - kobest
task: kobest_sentineg
dataset_path: skt/kobest_v1
dataset_name: sentineg
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
doc_to_text: !function utils.sentineg_doc_to_text
doc_to_target: "{{label}}"
doc_to_choice: ["부정", "긍정"]
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: True
  - metric: f1
    aggregation: !function utils.macro_f1_score
    average: macro
    hf_evaluate: true
    higher_is_better: True
metadata:
  version: 1.0
group:
  - kobest
task: kobest_wic
dataset_path: skt/kobest_v1
dataset_name: wic
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
doc_to_text: !function utils.wic_doc_to_text
doc_to_target: "{{label}}"
doc_to_choice: ['아니오', '예']
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: True
  - metric: f1
    aggregation: !function utils.macro_f1_score
    average: macro
    hf_evaluate: true
    higher_is_better: True
metadata:
  version: 1.0
from datasets import Dataset
from sklearn.metrics import f1_score


def copa_doc_to_text(doc: dict) -> str:
    # Map the question type to a Korean connective: "원인" (cause) -> "왜냐하면"
    # (because), "결과" (effect) -> "그래서" (so).
    connector = {"원인": " 왜냐하면", "결과": " 그래서"}[doc["question"].strip()]
    return f"""{doc["premise"]} {connector}"""


def copa_doc_to_target(doc: dict) -> str:
    # Label 0 selects the first alternative, label 1 the second.
    correct_choice = doc["alternative_1"] if doc["label"] == 0 else doc["alternative_2"]
    return f"""{correct_choice}"""


def copa_doc_to_choice(doc: dict) -> list:
    return [f"""{doc["alternative_1"]}""", f"""{doc["alternative_2"]}"""]


def sentineg_doc_to_text(doc: dict):
    # "문장" = sentence, "긍부정" = positive/negative.
    return f"""문장: {doc["sentence"]} 긍부정:"""


def wic_doc_to_text(doc: dict) -> str:
    # Asks whether the target word is used with the same sense in both sentences.
    return f"""문장1: {doc["context_1"]} 문장2: {doc["context_2"]} 두 문장에서 {doc["word"]}가 같은 뜻으로 쓰였나?"""


def hellaswag_process_doc(doc: Dataset) -> Dataset:
    # Collapse the four ending columns into a single "choices" list and expose
    # the query/gold fields that the YAML template expects.
    def preprocessor(dataset):
        return {
            "query": f"""문장: {dataset["context"]}""",
            "choices": [dataset["ending_1"], dataset["ending_2"], dataset["ending_3"], dataset["ending_4"]],
            "gold": int(dataset["label"]),
        }

    return doc.map(preprocessor)


def macro_f1_score(items):
    # items is a list of (gold, pred) pairs accumulated by the harness.
    unzipped_list = list(zip(*items))
    golds = unzipped_list[0]
    preds = unzipped_list[1]
    fscore = f1_score(golds, preds, average='macro')
    return fscore
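As a quick sanity check, the helpers above can be exercised on a hand-written doc (run alongside utils.py so the functions are in scope; the Korean sample sentences are made up):

```python
# Sketch: exercise the COPA helpers and the macro-F1 aggregator directly.
copa_doc = {
    "premise": "비가 왔다.",            # "It rained."
    "question": "결과",                 # effect
    "alternative_1": "우산을 챙겼다.",  # "I packed an umbrella."
    "alternative_2": "땅이 젖었다.",    # "The ground got wet."
    "label": 1,
}
print(copa_doc_to_text(copa_doc))    # 비가 왔다.  그래서
print(copa_doc_to_target(copa_doc))  # 땅이 젖었다.
print(copa_doc_to_choice(copa_doc))  # ['우산을 챙겼다.', '땅이 젖었다.']

# macro_f1_score consumes (gold, pred) pairs and macro-averages per-class F1.
print(macro_f1_score([(0, 0), (1, 1), (1, 0), (0, 0)]))  # ~0.733
```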