Commit b13753cd authored by haileyschoelkopf's avatar haileyschoelkopf

Merge branch 'main' into fix-task-table

parents 8ea9c59d 5c25dd55
@@ -26,4 +26,4 @@ metric_list:
     aggregation: !function utils.agg_inst_level_acc
     higher_is_better: true
 metadata:
-  version: 1.0
+  version: 2.0
# KoBEST
### Paper
Title: `KOBEST: Korean Balanced Evaluation of Significant Tasks`
Abstract: https://arxiv.org/abs/2204.04541
A well-formulated benchmark plays a critical role in spurring advancements in the natural language processing (NLP) field, as it allows objective and precise evaluation of diverse models. As modern language models (LMs) have become more elaborate and sophisticated, more difficult benchmarks that require linguistic knowledge and reasoning have been proposed. However, most of these benchmarks only support English, and great effort is necessary to construct benchmarks for other low resource languages. To this end, we propose a new benchmark named Korean balanced evaluation of significant tasks (KoBEST), which consists of five Korean-language downstream tasks. Professional Korean linguists designed the tasks that require advanced Korean linguistic knowledge. Moreover, our data is purely annotated by humans and thoroughly reviewed to guarantee high data quality. We also provide baseline models and human performance results. Our dataset is available on the Huggingface.
Homepage: https://huggingface.co/datasets/skt/kobest_v1
### Groups and Tasks
#### Groups
- `kobest`
#### Tasks
- `kobest_boolq`
- `kobest_copa`
- `kobest_hellaswag`
- `kobest_sentineg`
- `kobest_wic`
### Citation
```
@misc{kim2022kobest,
    author={Kim, Dohyeong and Jang, Myeongjun and Kwon, Deuk Sin and Davis, Eric},
    title={KOBEST: Korean Balanced Evaluation of Significant Tasks},
    DOI={https://doi.org/10.48550/arXiv.2204.04541},
    publisher={arXiv},
    year={2022},
    month={Apr}
}
```
group:
- kobest
task: kobest_boolq
dataset_path: skt/kobest_v1
dataset_name: boolq
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
doc_to_text: "{{paragraph}} 질문: {{question}} 답변: "
doc_to_target: "{{label}}"
doc_to_choice: ["아니오", "예"]
metric_list:
- metric: acc
aggregation: mean
higher_is_better: True
- metric: f1
aggregation: !function utils.macro_f1_score
average: macro
hf_evaluate: true
higher_is_better: True
metadata:
version: 1.0
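As a sketch of how the `kobest_boolq` config above turns a document into a scored prompt: the harness renders `doc_to_text`/`doc_to_target` with Jinja2, and the integer `label` indexes into `doc_to_choice`. Plain Python stands in for the template engine here, and the sample document is made up for illustration:

```python
# Hypothetical boolq-style document; field names follow the config above.
doc = {
    "paragraph": "한강은 서울을 가로지른다.",   # "The Han River crosses Seoul."
    "question": "한강은 서울에 있는가?",        # "Is the Han River in Seoul?"
    "label": 1,
}
choices = ["아니오", "예"]  # doc_to_choice: ["no", "yes"]

# doc_to_text: "{{paragraph}} 질문: {{question}} 답변: "
prompt = f"{doc['paragraph']} 질문: {doc['question']} 답변: "
# doc_to_target: "{{label}}" selects the gold choice by index.
target = choices[doc["label"]]
```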
group:
- kobest
task: kobest_copa
dataset_path: skt/kobest_v1
dataset_name: copa
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
doc_to_text: !function utils.copa_doc_to_text
doc_to_target: !function utils.copa_doc_to_target
doc_to_choice: !function utils.copa_doc_to_choice
metric_list:
- metric: acc
aggregation: mean
higher_is_better: True
- metric: f1
aggregation: !function utils.macro_f1_score
average: macro
hf_evaluate: true
higher_is_better: True
metadata:
version: 1.0
group:
- kobest
task: kobest_hellaswag
dataset_path: skt/kobest_v1
dataset_name: hellaswag
training_split: train
validation_split: validation
output_type: multiple_choice
test_split: test
doc_to_text: "{{query}}"
doc_to_target: "{{label}}"
process_docs: !function utils.hellaswag_process_doc
doc_to_choice: "choices"
metric_list:
- metric: acc
aggregation: mean
higher_is_better: True
- metric: acc_norm
aggregation: mean
higher_is_better: True
- metric: f1
aggregation: !function utils.macro_f1_score
average: macro
hf_evaluate: true
higher_is_better: True
metadata:
version: 1.0
group:
- kobest
task: kobest_sentineg
dataset_path: skt/kobest_v1
dataset_name: sentineg
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
doc_to_text: !function utils.sentineg_doc_to_text
doc_to_target: "{{label}}"
doc_to_choice: ["부정", "긍정"]
metric_list:
- metric: acc
aggregation: mean
higher_is_better: True
- metric: f1
aggregation: !function utils.macro_f1_score
average: macro
hf_evaluate: true
higher_is_better: True
metadata:
version: 1.0
group:
- kobest
task: kobest_wic
dataset_path: skt/kobest_v1
dataset_name: wic
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
doc_to_text: !function utils.wic_doc_to_text
doc_to_target: "{{label}}"
doc_to_choice: ['아니오', '예']
metric_list:
- metric: acc
aggregation: mean
higher_is_better: True
- metric: f1
aggregation: !function utils.macro_f1_score
average: macro
hf_evaluate: true
higher_is_better: True
metadata:
version: 1.0
from datasets import Dataset
from sklearn.metrics import f1_score


def copa_doc_to_text(doc: dict) -> str:
    # Map the question type to its Korean connective: cause -> "왜냐하면" (because), result -> "그래서" (so).
    connector = {"원인": " 왜냐하면", "결과": " 그래서"}[doc["question"].strip()]
    return f"""{doc["premise"]} {connector}"""


def copa_doc_to_target(doc: dict) -> str:
    correct_choice = doc["alternative_1"] if doc["label"] == 0 else doc["alternative_2"]
    return f"""{correct_choice}"""


def copa_doc_to_choice(doc: dict) -> list:
    return [f"""{doc["alternative_1"]}""", f"""{doc["alternative_2"]}"""]


def sentineg_doc_to_text(doc: dict) -> str:
    return f"""문장: {doc["sentence"]} 긍부정:"""


def wic_doc_to_text(doc: dict) -> str:
    return f"""문장1: {doc["context_1"]} 문장2: {doc["context_2"]} 두 문장에서 {doc["word"]}가 같은 뜻으로 쓰였나?"""


def hellaswag_process_doc(doc: Dataset) -> Dataset:
    def preprocessor(dataset):
        return {
            "query": f"""문장: {dataset["context"]}""",
            "choices": [dataset["ending_1"], dataset["ending_2"], dataset["ending_3"], dataset["ending_4"]],
            "gold": int(dataset["label"]),
        }

    return doc.map(preprocessor)


def macro_f1_score(items):
    # items is a list of (gold, pred) pairs collected per instance.
    golds, preds = zip(*items)
    return f1_score(golds, preds, average="macro")
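To illustrate the COPA prompt construction: for a "결과" (result) question the premise is followed by the connective " 그래서" ("so"), and the model is scored on which alternative continues it more plausibly. A self-contained copy of `copa_doc_to_text`, applied to a made-up document (the sample fields are hypothetical, not real dataset rows):

```python
def copa_doc_to_text(doc: dict) -> str:
    # "원인" (cause) -> " 왜냐하면" (because); "결과" (result) -> " 그래서" (so).
    connector = {"원인": " 왜냐하면", "결과": " 그래서"}[doc["question"].strip()]
    return f"""{doc["premise"]} {connector}"""

# Hypothetical sample document for illustration only.
sample = {
    "premise": "비가 많이 왔다.",      # "It rained heavily."
    "question": "결과",                # ask for the result
    "alternative_1": "길이 젖었다.",   # "The road got wet."
    "alternative_2": "길이 말랐다.",   # "The road dried up."
    "label": 0,
}
prompt = copa_doc_to_text(sample)
```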
task: medmcqa
dataset_path: medmcqa
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: validation
doc_to_text: !function utils_medmcqa.doc_to_text
doc_to_target: cop
doc_to_choice: [ 'A','B','C','D' ]
should_decontaminate: true
doc_to_decontamination_query: "{{question}}"
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
# Copied from Master
def doc_to_text(doc) -> str:
"""
Question: <question>
Choices:
A. <choice1>
B. <choice2>
C. <choice3>
D. <choice4>
Answer:
"""
choices = [doc["opa"], doc["opb"], doc["opc"], doc["opd"]]
option_choices = {'A': choices[0], 'B': choices[1], 'C': choices[2], 'D': choices[3]}
prompt = "Question: " + doc["question"] + "\nChoices:\n"
for choice, option in option_choices.items():
prompt += f"{choice.upper()}. {option}\n"
prompt += "Answer:"
return prompt
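Running the prompt builder above on a made-up document shows the resulting format (the field values are illustrative, not real `medmcqa` rows):

```python
def doc_to_text(doc) -> str:
    # Order the four options A-D and lay them out under the question.
    choices = [doc["opa"], doc["opb"], doc["opc"], doc["opd"]]
    option_choices = {"A": choices[0], "B": choices[1], "C": choices[2], "D": choices[3]}
    prompt = "Question: " + doc["question"] + "\nChoices:\n"
    for choice, option in option_choices.items():
        prompt += f"{choice}. {option}\n"
    prompt += "Answer:"
    return prompt

# Hypothetical sample row following the opa/opb/opc/opd schema.
sample = {
    "question": "Which vitamin deficiency causes scurvy?",
    "opa": "Vitamin A", "opb": "Vitamin B12",
    "opc": "Vitamin C", "opd": "Vitamin D",
}
prompt = doc_to_text(sample)
```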
task: medqa_4options
dataset_path: GBaker/MedQA-USMLE-4-options-hf
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
doc_to_text: !function preprocess_medqa.doc_to_text
doc_to_target: !function preprocess_medqa.doc_to_target
doc_to_choice: [ 'A', 'B', 'C', 'D' ]
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
def doc_to_text(doc) -> str:
option_choices = {'A': doc["ending0"], 'B': doc["ending1"], 'C': doc["ending2"], 'D': doc["ending3"]}
answers = "".join((f"{k}. {v}\n") for k, v in option_choices.items())
return f"Question: {doc['sent1']}\n{answers}Answer:"
def doc_to_target(doc) -> int:
return doc["label"]
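The `medqa_4options` helpers above can be sanity-checked on a hypothetical row of the `GBaker/MedQA-USMLE-4-options-hf` schema (`sent1` holds the question stem, `ending0`..`ending3` the options, `label` the gold index; the sample values are invented):

```python
def doc_to_text(doc) -> str:
    option_choices = {"A": doc["ending0"], "B": doc["ending1"], "C": doc["ending2"], "D": doc["ending3"]}
    answers = "".join(f"{k}. {v}\n" for k, v in option_choices.items())
    return f"Question: {doc['sent1']}\n{answers}Answer:"

def doc_to_target(doc) -> int:
    return doc["label"]

# Hypothetical sample row for illustration only.
sample = {
    "sent1": "What is the first-line treatment for anaphylaxis?",
    "ending0": "Epinephrine", "ending1": "Aspirin",
    "ending2": "Ibuprofen", "ending3": "Warfarin",
    "label": 0,
}
prompt = doc_to_text(sample)
gold = doc_to_target(sample)
```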
@@ -26,4 +26,4 @@ metric_list:
     ignore_case: true
     ignore_punctuation: true
 metadata:
-  version: 0.0
+  version: 1.0
@@ -28,4 +28,4 @@ filter_list:
       regex_pattern: "The answer is (\\-?[0-9\\.\\,]+)"
     - function: "take_first"
 metadata:
-  version: 0.0
+  version: 1.0
@@ -28,4 +28,4 @@ filter_list:
       regex_pattern: "The answer is (\\-?[0-9\\.\\,]+)"
     - function: "take_first"
 metadata:
-  version: 1.0
+  version: 2.0
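A sketch of what the filter pipeline above does to a model completion: `regex_pattern` pulls out numeric answers following the phrase "The answer is", and `take_first` keeps only the first match (the completion string here is invented for illustration):

```python
import re

# Same pattern as the regex_pattern entry above.
pattern = re.compile(r"The answer is (\-?[0-9\.\,]+)")

completion = "6 boxes of 7 pencils is 42 pencils. The answer is 42"
matches = pattern.findall(completion)
# take_first: keep only the first extracted answer.
answer = matches[0] if matches else None
```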
@@ -21,4 +21,4 @@ metric_list:
     higher_is_better: true
 num_fewshot: 0
 metadata:
-  version: 0.0
+  version: 1.0
@@ -3,7 +3,7 @@ dataset_path: nq_open
 output_type: generate_until
 training_split: train
 validation_split: validation
-description: "Answer these questions:\n"
+description: "Answer these questions:\n\n"
 doc_to_text: "Q: {{question}}?\nA:"
 doc_to_target: "{{answer}}" # TODO: should be multi-target
 fewshot_delimiter: "\n"
@@ -27,6 +27,6 @@ metric_list:
     ignore_case: true
     ignore_punctuation: true
   regexes_to_ignore:
-    - "\ban|a|the\b"
+    - "\\b(?:The |the |An |A |The |a |an )"
 metadata:
-  version: 0.0
+  version: 3.0
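The updated `regexes_to_ignore` entry above strips leading articles before the exact-match comparison. A minimal sketch of that normalization step (illustrative, not the harness's actual metric code):

```python
import re

# Same pattern as the new regexes_to_ignore entry: matches a leading
# article (with its trailing space) at a word boundary.
pattern = r"\b(?:The |the |An |A |The |a |an )"

# Matches are removed from both prediction and reference before comparing.
normalized = re.sub(pattern, "", "the Eiffel Tower")
```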
# Multilingual HellaSwag
### Paper
Title: `Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback`
Abstract: https://arxiv.org/abs/2307.16039
A key technology for the development of large language models (LLMs) involves instruction tuning that helps align the models' responses with human expectations to realize impressive learning abilities. Two major approaches for instruction tuning characterize supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), which are currently applied to produce the best commercial LLMs (e.g., ChatGPT). To improve the accessibility of LLMs for research and development efforts, various instruction-tuned open-source LLMs have also been introduced recently, e.g., Alpaca, Vicuna, to name a few. However, existing open-source LLMs have only been instruction-tuned for English and a few popular languages, thus hindering their impacts and accessibility to many other languages in the world. Among a few very recent work to explore instruction tuning for LLMs in multiple languages, SFT has been used as the only approach to instruction-tune LLMs for multiple languages. This has left a significant gap for fine-tuned LLMs based on RLHF in diverse languages and raised important questions on how RLHF can boost the performance of multilingual instruction tuning. To overcome this issue, we present Okapi, the first system with instruction-tuned LLMs based on RLHF for multiple languages. Okapi introduces instruction and response-ranked data in 26 diverse languages to facilitate the experiments and development of future multilingual LLM research. We also present benchmark datasets to enable the evaluation of generative LLMs in multiple languages. Our experiments demonstrate the advantages of RLHF for multilingual instruction over SFT for different base models and datasets. Our framework and resources are released at this https URL.
Homepage: `https://github.com/nlp-uoregon/Okapi`
### Citation
```
@article{dac2023okapi,
title={Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback},
author={Dac Lai, Viet and Van Nguyen, Chien and Ngo, Nghia Trung and Nguyen, Thuat and Dernoncourt, Franck and Rossi, Ryan A and Nguyen, Thien Huu},
journal={arXiv e-prints},
pages={arXiv--2307},
year={2023}
}
```
### Groups and Tasks
#### Groups
- hellaswag_multilingual
#### Tasks
- `hellaswag_{ar,bn,ca,da,de,es,eu,fr,gu,hi,hr,hu,hy,id,it,kn,ml,mr,ne,nl,pt,ro,ru,sk,sr,sv,ta,te,uk,vi}`
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
group:
- hellaswag_multilingual
dataset_path: null
dataset_name: null
output_type: multiple_choice
training_split: null
validation_split: validation
test_split: null
process_docs: !function utils.process_docs
doc_to_text: "query"
doc_to_target: "{{label.lstrip()}}"
doc_to_choice: "choices"
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
metadata:
version: 1.0
include: _hellaswag_yaml
task: hellaswag_ar
dataset_path: alexandrainst/m_hellaswag
dataset_name: ar
training_split: null
validation_split: val
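Conceptually, the per-language file above combines with `_hellaswag_yaml` by key-wise override: any key set in the including file replaces the template's value. A rough dict-merge sketch of that behavior (the harness's real merge lives in its config loader; the key subset here is illustrative):

```python
# Shared defaults from _hellaswag_yaml (subset shown).
template = {
    "dataset_path": None,
    "dataset_name": None,
    "output_type": "multiple_choice",
    "validation_split": "validation",
}

# Per-language task file: hellaswag_ar overrides.
task = {
    "task": "hellaswag_ar",
    "dataset_path": "alexandrainst/m_hellaswag",
    "dataset_name": "ar",
    "validation_split": "val",
}

# Later dict wins on shared keys, as the include mechanism does.
config = {**template, **task}
```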