Commit b8bda478 authored by haileyschoelkopf

Merge branch 'main' into add-chat-templating

parents 6ca8ab15 588a493c
@misc{eval-harness,
  author       = {Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang and Tang, Eric and Thite, Anish and Wang, Ben and Wang, Kevin and Zou, Andy},
  title        = {A framework for few-shot language model evaluation},
  month        = 12,
  year         = 2023,
  publisher    = {Zenodo},
  version      = {v0.4.0},
  doi          = {10.5281/zenodo.10256836},
  url          = {https://zenodo.org/records/10256836}
}
@@ -34,7 +34,7 @@ This project provides a unified framework to test generative language models on

- Evaluation with publicly available prompts ensures reproducibility and comparability between papers.
- Easy support for custom prompts and evaluation metrics.

The Language Model Evaluation Harness is the backend for 🤗 Hugging Face's popular [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), has been used in [hundreds of papers](https://scholar.google.com/scholar?oi=bibs&hl=en&authuser=2&cites=15052937328817631261,4097184744846514103,1520777361382155671,17476825572045927382,18443729326628441434,14801318227356878622,7890865700763267262,12854182577605049984,15641002901115500560,5104500764547628290), and is used internally by dozens of organizations including NVIDIA, Cohere, BigScience, BigCode, Nous Research, and Mosaic ML.

## Install
@@ -109,33 +109,45 @@ The full list of supported arguments are provided [here](./docs/interface.md), a

#### Multi-GPU Evaluation with Hugging Face `accelerate`

We support two main ways of using Hugging Face's [accelerate 🚀](https://github.com/huggingface/accelerate) library for multi-GPU evaluation.

To perform *data-parallel evaluation* (where each GPU loads a **separate full copy** of the model), we leverage the `accelerate` launcher as follows:

```
accelerate launch -m lm_eval --model hf \
    --tasks lambada_openai,arc_easy \
    --batch_size 16
```
(or via `accelerate launch --no-python lm_eval`).

For cases where your model can fit on a single GPU, this allows you to evaluate on K GPUs K times faster than on one.

**WARNING**: This setup does not work with FSDP model sharding, so in `accelerate config` FSDP must be disabled, or the NO_SHARD FSDP option must be used.

The second way of using `accelerate` for multi-GPU evaluation is when your model is *too large to fit on a single GPU*.

In this setting, run the library *outside of the `accelerate` launcher*, but pass `parallelize=True` to `--model_args` as follows:

```
lm_eval --model hf \
    --tasks lambada_openai,arc_easy \
    --model_args parallelize=True \
    --batch_size 16
```
This means that your model's weights will be split across all available GPUs.

For more advanced users or even larger models, we allow for the following arguments when `parallelize=True` as well (an example invocation is sketched below this section):
- `device_map_option`: How to split model weights across available GPUs. Defaults to "auto".
- `max_memory_per_gpu`: the max GPU memory to use per GPU when loading the model.
- `max_cpu_memory`: the max amount of CPU memory to use when offloading the model weights to RAM.
- `offload_folder`: a folder where model weights will be offloaded to disk if needed.

These two options (`accelerate launch` and `parallelize=True`) are mutually exclusive.
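For example, a hedged sketch combining `parallelize=True` with these arguments (the model name and memory values here are purely illustrative):

```
lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-2.8b,parallelize=True,device_map_option=auto,max_memory_per_gpu=40GiB,offload_folder=./offload \
    --tasks lambada_openai,arc_easy \
    --batch_size 16
```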
### Tensor + Data Parallel and Optimized Inference with `vLLM`

We also support vLLM for faster inference on [supported model types](https://docs.vllm.ai/en/latest/models/supported_models.html), with speedups that are especially pronounced when splitting a model across multiple GPUs. For single-GPU or multi-GPU — tensor parallel, data parallel, or a combination of both — inference, for example:

```bash
lm_eval --model vllm \
```

@@ -219,11 +231,11 @@ lm_eval --model hf \

```bash
    --device cuda:0
```
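As an illustrative sketch of a full vLLM invocation (the `model_args` names `tensor_parallel_size`, `dtype`, `gpu_memory_utilization`, and `data_parallel_size` shown here are assumptions, not taken from the hunks above):

```bash
lm_eval --model vllm \
    --model_args pretrained=EleutherAI/pythia-2.8b,tensor_parallel_size=2,dtype=auto,gpu_memory_utilization=0.8,data_parallel_size=2 \
    --tasks lambada_openai,arc_easy \
    --batch_size auto
```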
[GPTQ](https://github.com/PanQiWei/AutoGPTQ) quantized models can be loaded by specifying their file names in `,autogptq=NAME` (or `,autogptq=True` for default names) in the `model_args` argument:

```bash
lm_eval --model hf \
    --model_args pretrained=model-name-or-path,autogptq=model.safetensors,gptq_use_triton=True \
    --tasks hellaswag
```
@@ -301,10 +313,14 @@ The best way to get support is to open an issue on this repo or join the [Eleuth

## Cite as

```
@misc{eval-harness,
  author       = {Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang and Tang, Eric and Thite, Anish and Wang, Ben and Wang, Kevin and Zou, Andy},
  title        = {A framework for few-shot language model evaluation},
  month        = 12,
  year         = 2023,
  publisher    = {Zenodo},
  version      = {v0.4.0},
  doi          = {10.5281/zenodo.10256836},
  url          = {https://zenodo.org/records/10256836}
}
```
@@ -46,16 +46,6 @@ dataset_name: ... # the dataset configuration to use. Leave `null` if your datas
dataset_kwargs: null # any extra keyword arguments that should be passed to the dataset constructor, e.g. `data_dir`.
```
Next, we'd like to tell our task what the dataset's train, validation, and test splits are named, if they exist:

@@ -99,6 +89,36 @@ Now, in our YAML config file we'll use the `!function` constructor, and tell the

```yaml
process_docs: !function utils.process_docs
```
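For context, a minimal sketch of what such a `process_docs` function in `utils.py` might look like (the column names `context` and `label` are placeholders for illustration, not fields from any particular dataset):

```python
# utils.py -- referenced from the task YAML via `process_docs: !function utils.process_docs`
import datasets


def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
    def _process(doc):
        # Build the fields the prompt template expects; "context" and "label"
        # are hypothetical column names used only for this sketch.
        return {
            "query": doc["context"].strip(),
            "gold": int(doc["label"]),
        }

    return dataset.map(_process)
```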
### Using Local Datasets
To load a local dataset for evaluation, you can specify data files in the `dataset_kwargs` field, such as the following for JSON files:
```
dataset_path: json
dataset_name: null
dataset_kwargs:
data_files: /path/to/my/json
```
Or with files already split into separate directories:
```
dataset_path: arrow
dataset_kwargs:
data_files:
train: /path/to/arrow/train/data-00000-of-00001.arrow
validation: /path/to/arrow/validation/data-00000-of-00001.arrow
```
Alternatively, if you have previously downloaded a dataset from the Hugging Face Hub (using `save_to_disk()`) and wish to use the local files, set `data_dir` under `dataset_kwargs` to point to the saved directory.
```
dataset_path: hellaswag
dataset_kwargs:
data_dir: hellaswag_local/
```
You can also set `dataset_path` as a directory path on your local filesystem. This assumes that there is a loading script with the same name as the directory. [See datasets docs](https://huggingface.co/docs/datasets/loading#local-loading-script).
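A minimal sketch of this layout (the directory name is hypothetical):

```
dataset_path: /path/to/my_dataset  # expects a loading script at /path/to/my_dataset/my_dataset.py
```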
## Writing a Prompt Template

The next thing we need to do is decide what format to use when presenting the data to the LM. This is our **prompt**, where we'll define both an input and output format.
......
@@ -301,6 +301,23 @@ task:
  - hendrycksTest*
```

It is also possible to list an existing task in your benchmark configuration with some adjustments. For example, a few tasks from `mmlu` are included in `multimedqa`; there, the `task_alias` and `group_alias` (see [here](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#beautifying-table-display) for more details) are modified to suit the benchmark.
```yaml
group: multimedqa
task:
  - pubmedqa
  - medmcqa
  - medqa_4options
  - task: mmlu_anatomy
    task_alias: "anatomy (mmlu)"
    group_alias: null
  - task: mmlu_clinical_knowledge
    task_alias: "clinical_knowledge (mmlu)"
    group_alias: null
  ...
```
Alternatively, a benchmark can include tasks whose configuration is customized inline; these entries are defined in the same way a standalone YAML task is usually written.
......
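As a rough sketch (the group and task names here are hypothetical), such an inline entry can override per-task settings directly:

```yaml
group: my_benchmark
task:
  - task: my_subtask
    num_fewshot: 5
    # any other keys accepted by a standalone task YAML can be overridden here
```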
@@ -527,6 +527,10 @@ class ConfigurableTask(Task):
                "Must pass a config to ConfigurableTask, either in cls.CONFIG or `config` kwarg"
            )

        if isinstance(self.config.metadata, dict):
            if "version" in self.config.metadata:
                self.VERSION = self.config.metadata["version"]

        if self.config.output_type is not None:
            assert self.config.output_type in ALL_OUTPUT_TYPES
            self.OUTPUT_TYPE = self.config.output_type

@@ -755,6 +759,8 @@ class ConfigurableTask(Task):
    def fewshot_docs(self):
        if self.config.fewshot_split is not None:
            if self.config.process_docs is not None:
                return self.config.process_docs(self.dataset[self.config.fewshot_split])
            return self.dataset[self.config.fewshot_split]
        else:
            if (self.config.num_fewshot is not None) and (self.config.num_fewshot > 0):
......
@@ -133,6 +133,8 @@ class HFLM(LM):
            gpus = torch.cuda.device_count()
            accelerator = Accelerator()
            if accelerator.num_processes > 1:
                self.accelerator = accelerator

            if not (parallelize or accelerator.num_processes > 1):
                # use user-passed device

@@ -202,15 +204,16 @@ class HFLM(LM):
            self.model.tie_weights()

        if isinstance(pretrained, str) and (gpus >= 1 or str(self.device) == "mps"):
            # TODO: can remove this whole snippet except in the mps case, perhaps?
            if not (parallelize or autogptq or hasattr(self, "accelerator")):
                # place model onto device requested manually,
                # if not using HF Accelerate or device_map
                # or any other option that preloads model onto device
                try:
                    self.model.to(self.device)
                except ValueError:
                    eval_logger.debug(
                        "Failed to place model onto specified device. This may be because the model is quantized via `bitsandbytes` or `device_map` is provided. If the desired GPU is being used, this message is safe to ignore."
                    )

        self._create_tokenizer(

@@ -456,12 +459,24 @@ class HFLM(LM):
        if parallelize:
            model_kwargs.update(
                _get_accelerate_args(
                    device_map_option,  # TODO: phase out device_map_option?
                    max_memory_per_gpu,
                    max_cpu_memory,
                    offload_folder,
                )
            )
        elif "device_map" not in model_kwargs:
            # set a device_map to initialize model on the right GPU.
            # this is needed because it seems that the default behavior
            # for quantized models now seems to be device_map="auto"
            # which breaks data-parallel mode.
            if hasattr(self, "accelerator"):
                model_kwargs.update(
                    {"device_map": {"": f"cuda:{self.accelerator.local_process_index}"}}
                )
            else:
                model_kwargs.update({"device_map": {"": str(self.device)}})

        if not autogptq:
            if model_kwargs.get("load_in_4bit", None):
                assert (
......
@@ -61,11 +61,27 @@ def register_configurable_group(config: Dict[str, str], yaml_path: str = None) -
    task_list = [task for task in all_task_list if type(task) == str]

    for task_config in config_list:

        base_config = {}
        task_name_config = {}
        if "task" in task_config:
            task_name = task_config["task"]
            if task_name in ALL_TASKS:
                task_obj = get_task_dict(task_name)[task_name]
                if type(task_obj) == tuple:
                    _, task_obj = task_obj
                if task_obj is not None:
                    base_config = task_obj._config.to_dict()
                    task_name_config["task"] = f"{group}_{task_name}"

        task_config = utils.load_yaml_config(yaml_path, task_config)
        var_configs = check_prompt_config(
            {
                **base_config,
                **task_config,
                **{"group": group},
                **task_name_config,
            },
            yaml_path=os.path.dirname(yaml_path),
        )
......
@@ -3,9 +3,21 @@ task:
  - pubmedqa
  - medmcqa
  - medqa_4options
  - task: mmlu_anatomy
    task_alias: "anatomy (mmlu)"
    group_alias: null
  - task: mmlu_clinical_knowledge
    task_alias: "clinical_knowledge (mmlu)"
    group_alias: null
  - task: mmlu_college_medicine
    task_alias: "college_medicine (mmlu)"
    group_alias: null
  - task: mmlu_medical_genetics
    task_alias: "medical_genetics (mmlu)"
    group_alias: null
  - task: mmlu_professional_medicine
    task_alias: "professional_medicine (mmlu)"
    group_alias: null
  - task: mmlu_college_biology
    task_alias: "college_biology (mmlu)"
    group_alias: null
@@ -31,4 +31,4 @@ filter_list:
      - function: "majority_vote"
      - function: "take_first"
metadata:
  version: 2.0
@@ -5,16 +5,16 @@ dataset_path: gsm8k
dataset_name: main
output_type: generate_until
test_split: test
doc_to_text: "Q: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?\nA: There are 15 trees originally. Then there were 21 trees after some more were planted. So there must have been 21 - 15 = 6. The answer is 6.\n\n\
Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?\nA: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The answer is 5.\n\n\
Q: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?\nA: Originally, Leah had 32 chocolates. Her sister had 42. So in total they had 32 + 42 = 74. After eating 35, they had 74 - 35 = 39. The answer is 39.\n\n\
Q: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?\nA: Jason started with 20 lollipops. Then he had 12 after giving some to Denny. So he gave Denny 20 - 12 = 8. The answer is 8.\n\n\
Q: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now?\nA: Shawn started with 5 toys. If he got 2 toys each from his mom and dad, then that is 4 more toys. 5 + 4 = 9. The answer is 9.\n\n\
Q: There were nine computers in the server room. Five more computers were installed each day, from monday to thursday. How many computers are now in the server room?\nA: There were originally 9 computers. For each of 4 days, 5 more computers were added. So 5 * 4 = 20 computers were added. 9 + 20 is 29. The answer is 29.\n\n\
Q: Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many golf balls did he have at the end of wednesday?\nA: Michael started with 58 golf balls. After losing 23 on tuesday, he had 58 - 23 = 35. After losing 2 more, he had 35 - 2 = 33 golf balls. The answer is 33.\n\n\
Q: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?\nA: Olivia had 23 dollars. 5 bagels for 3 dollars each will be 5 x 3 = 15 dollars. So she has 23 - 15 dollars left. 23 - 15 is 8. The answer is 8.\n\n\
Q: {{question}}\nA:"
doc_to_target: "{{answer.split('####')[-1].strip()}}"
metric_list:
  - metric: exact_match
    aggregation: mean

@@ -31,7 +31,6 @@ generation_kwargs:
    - "Q:"
    - "\n\n"
  do_sample: false
repeats: 1
num_fewshot: 0
filter_list:

@@ -41,4 +40,4 @@ filter_list:
      regex_pattern: "The answer is (\\-?[0-9\\.\\,]+)."
    - function: "take_first"
metadata:
  version: 2.0
@@ -24,7 +24,6 @@ generation_kwargs:
    - "\n\n"
    - "Question:"
  do_sample: false
repeats: 1
num_fewshot: 5
filter_list:
......
# KoBEST
### Paper
Title: `KOBEST: Korean Balanced Evaluation of Significant Tasks`
Abstract: https://arxiv.org/abs/2204.04541
A well-formulated benchmark plays a critical role in spurring advancements in the natural language processing (NLP) field, as it allows objective and precise evaluation of diverse models. As modern language models (LMs) have become more elaborate and sophisticated, more difficult benchmarks that require linguistic knowledge and reasoning have been proposed. However, most of these benchmarks only support English, and great effort is necessary to construct benchmarks for other low resource languages. To this end, we propose a new benchmark named Korean balanced evaluation of significant tasks (KoBEST), which consists of five Korean-language downstream tasks. Professional Korean linguists designed the tasks that require advanced Korean linguistic knowledge. Moreover, our data is purely annotated by humans and thoroughly reviewed to guarantee high data quality. We also provide baseline models and human performance results. Our dataset is available on the Huggingface.
Homepage: https://huggingface.co/datasets/skt/kobest_v1
### Groups and Tasks
#### Groups
- `kobest`
#### Tasks
- `kobest_boolq`
- `kobest_copa`
- `kobest_hellaswag`
- `kobest_sentineg`
- `kobest_wic`
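For example, a usage sketch for running the whole group (the model checkpoint is only an illustration):

```bash
lm_eval --model hf \
    --model_args pretrained=EleutherAI/polyglot-ko-1.3b \
    --tasks kobest \
    --batch_size 8
```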
### Citation
```
@misc{kim2022kobest,
  author={Kim, Dohyeong and Jang, Myeongjun and Kwon, Deuk Sin and Davis, Eric},
  title={KOBEST: Korean Balanced Evaluation of Significant Tasks},
  doi={https://doi.org/10.48550/arXiv.2204.04541},
  publisher={arXiv},
  year={2022},
  month={Apr}
}
```

group:
  - kobest
task: kobest_boolq
dataset_path: skt/kobest_v1
dataset_name: boolq
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
doc_to_text: "{{paragraph}} 질문: {{question}} 답변: "
doc_to_target: "{{label}}"
doc_to_choice: ["아니오", "예"]
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: True
  - metric: f1
    aggregation: !function utils.macro_f1_score
    average: macro
    hf_evaluate: true
    higher_is_better: True
metadata:
  version: 1.0

group:
  - kobest
task: kobest_copa
dataset_path: skt/kobest_v1
dataset_name: copa
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
doc_to_text: !function utils.copa_doc_to_text
doc_to_target: !function utils.copa_doc_to_target
doc_to_choice: !function utils.copa_doc_to_choice
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: True
  - metric: f1
    aggregation: !function utils.macro_f1_score
    average: macro
    hf_evaluate: true
    higher_is_better: True
metadata:
  version: 1.0

group:
  - kobest
task: kobest_hellaswag
dataset_path: skt/kobest_v1
dataset_name: hellaswag
training_split: train
validation_split: validation
output_type: multiple_choice
test_split: test
doc_to_text: "{{query}}"
doc_to_target: "{{label}}"
process_docs: !function utils.hellaswag_process_doc
doc_to_choice: "choices"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: True
  - metric: acc_norm
    aggregation: mean
    higher_is_better: True
  - metric: f1
    aggregation: !function utils.macro_f1_score
    average: macro
    hf_evaluate: true
    higher_is_better: True
metadata:
  version: 1.0

group:
  - kobest
task: kobest_sentineg
dataset_path: skt/kobest_v1
dataset_name: sentineg
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
doc_to_text: !function utils.sentineg_doc_to_text
doc_to_target: "{{label}}"
doc_to_choice: ["부정", "긍정"]
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: True
  - metric: f1
    aggregation: !function utils.macro_f1_score
    average: macro
    hf_evaluate: true
    higher_is_better: True
metadata:
  version: 1.0

group:
  - kobest
task: kobest_wic
dataset_path: skt/kobest_v1
dataset_name: wic
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
doc_to_text: !function utils.wic_doc_to_text
doc_to_target: "{{label}}"
doc_to_choice: ['아니오', '예']
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: True
  - metric: f1
    aggregation: !function utils.macro_f1_score
    average: macro
    hf_evaluate: true
    higher_is_better: True
metadata:
  version: 1.0

from datasets import Dataset
from sklearn.metrics import f1_score


def copa_doc_to_text(doc: dict) -> str:
    # Map the COPA question type to the appropriate Korean connective.
    connector = {"원인": " 왜냐하면", "결과": " 그래서"}[doc["question"].strip()]
    return f"""{doc["premise"]} {connector}"""


def copa_doc_to_target(doc: dict) -> str:
    correct_choice = doc["alternative_1"] if doc["label"] == 0 else doc["alternative_2"]
    return f"""{correct_choice}"""


def copa_doc_to_choice(doc: dict) -> list:
    return [f"""{doc["alternative_1"]}""", f"""{doc["alternative_2"]}"""]


def sentineg_doc_to_text(doc: dict):
    return f"""문장: {doc["sentence"]} 긍부정:"""


def wic_doc_to_text(doc: dict) -> str:
    return f"""문장1: {doc["context_1"]} 문장2: {doc["context_2"]} 두 문장에서 {doc["word"]}가 같은 뜻으로 쓰였나?"""


def hellaswag_process_doc(doc: Dataset) -> Dataset:
    def preprocessor(dataset):
        return {
            "query": f"""문장: {dataset["context"]}""",
            "choices": [
                dataset["ending_1"],
                dataset["ending_2"],
                dataset["ending_3"],
                dataset["ending_4"],
            ],
            "gold": int(dataset["label"]),
        }

    return doc.map(preprocessor)


def macro_f1_score(items):
    # items is a list of (gold, prediction) pairs; compute macro-averaged F1.
    unzipped_list = list(zip(*items))
    golds = unzipped_list[0]
    preds = unzipped_list[1]
    fscore = f1_score(golds, preds, average="macro")
    return fscore