Commit b8bda478 authored by haileyschoelkopf

Merge branch 'main' into add-chat-templating

parents 6ca8ab15 588a493c
@misc{eval-harness,
  author       = {Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang and Tang, Eric and Thite, Anish and Wang, Ben and Wang, Kevin and Zou, Andy},
  title        = {A framework for few-shot language model evaluation},
  month        = 12,
  year         = 2023,
  publisher    = {Zenodo},
  version      = {v0.4.0},
  doi          = {10.5281/zenodo.10256836},
  url          = {https://zenodo.org/records/10256836}
}
@@ -34,7 +34,7 @@ This project provides a unified framework to test generative language models on

- Evaluation with publicly available prompts ensures reproducibility and comparability between papers.
- Easy support for custom prompts and evaluation metrics.

The Language Model Evaluation Harness is the backend for 🤗 Hugging Face's popular [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), has been used in [hundreds of papers](https://scholar.google.com/scholar?oi=bibs&hl=en&authuser=2&cites=15052937328817631261,4097184744846514103,1520777361382155671,17476825572045927382,18443729326628441434,14801318227356878622,7890865700763267262,12854182577605049984,15641002901115500560,5104500764547628290), and is used internally by dozens of organizations including NVIDIA, Cohere, BigScience, BigCode, Nous Research, and Mosaic ML.

## Install
@@ -109,33 +109,45 @@ The full list of supported arguments are provided [here](./docs/interface.md), a

#### Multi-GPU Evaluation with Hugging Face `accelerate`

We support two main ways of using Hugging Face's [accelerate 🚀](https://github.com/huggingface/accelerate) library for multi-GPU evaluation.

To perform *data-parallel evaluation* (where each GPU loads a **separate full copy** of the model), we leverage the `accelerate` launcher as follows:

```
accelerate launch -m lm_eval --model hf \
    --tasks lambada_openai,arc_easy \
    --batch_size 16
```
(or via `accelerate launch --no-python lm_eval`).

For cases where your model can fit on a single GPU, this allows you to evaluate on K GPUs K times faster than on one.

**WARNING**: This setup does not work with FSDP model sharding, so in `accelerate config` FSDP must be disabled, or the NO_SHARD FSDP option must be used.

The second way of using `accelerate` for multi-GPU evaluation is when your model is *too large to fit on a single GPU*.

In this setting, run the library *outside of the `accelerate` launcher*, but pass `parallelize=True` to `--model_args` as follows:

```
lm_eval --model hf \
    --tasks lambada_openai,arc_easy \
    --model_args parallelize=True \
    --batch_size 16
```
This means that your model's weights will be split across all available GPUs.

For more advanced users or even larger models, we allow for the following arguments when `parallelize=True` as well (an example invocation is sketched below this section):
- `device_map_option`: How to split model weights across available GPUs. Defaults to "auto".
- `max_memory_per_gpu`: the max GPU memory to use per GPU when loading the model.
- `max_cpu_memory`: the max amount of CPU memory to use when offloading the model weights to RAM.
- `offload_folder`: a folder where model weights will be offloaded to disk if needed.

These two options (`accelerate launch` and `parallelize=True`) are mutually exclusive.
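For example, a hedged sketch combining `parallelize=True` with these arguments (the model name and memory values here are purely illustrative):

```
lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-2.8b,parallelize=True,device_map_option=auto,max_memory_per_gpu=40GiB,offload_folder=./offload \
    --tasks lambada_openai,arc_easy \
    --batch_size 16
```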
### Tensor + Data Parallel and Optimized Inference with `vLLM`

We also support vLLM for faster inference on [supported model types](https://docs.vllm.ai/en/latest/models/supported_models.html), with speedups that are especially pronounced when splitting a model across multiple GPUs. For single-GPU or multi-GPU — tensor parallel, data parallel, or a combination of both — inference, for example:

```bash
lm_eval --model vllm \
```

@@ -219,11 +231,11 @@ lm_eval --model hf \

```bash
    --device cuda:0
```
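As an illustrative sketch of a full vLLM invocation (the `model_args` names `tensor_parallel_size`, `dtype`, `gpu_memory_utilization`, and `data_parallel_size` shown here are assumptions, not taken from the hunks above):

```bash
lm_eval --model vllm \
    --model_args pretrained=EleutherAI/pythia-2.8b,tensor_parallel_size=2,dtype=auto,gpu_memory_utilization=0.8,data_parallel_size=2 \
    --tasks lambada_openai,arc_easy \
    --batch_size auto
```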
[GPTQ](https://github.com/PanQiWei/AutoGPTQ) quantized models can be loaded by specifying their file names in `,autogptq=NAME` (or `,autogptq=True` for default names) in the `model_args` argument:

```bash
lm_eval --model hf \
    --model_args pretrained=model-name-or-path,autogptq=model.safetensors,gptq_use_triton=True \
    --tasks hellaswag
```
@@ -301,10 +313,14 @@ The best way to get support is to open an issue on this repo or join the [Eleuth

## Cite as

```
@misc{eval-harness,
  author       = {Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang and Tang, Eric and Thite, Anish and Wang, Ben and Wang, Kevin and Zou, Andy},
  title        = {A framework for few-shot language model evaluation},
  month        = 12,
  year         = 2023,
  publisher    = {Zenodo},
  version      = {v0.4.0},
  doi          = {10.5281/zenodo.10256836},
  url          = {https://zenodo.org/records/10256836}
}
```
@@ -46,16 +46,6 @@ dataset_name: ... # the dataset configuration to use. Leave `null` if your datas
dataset_kwargs: null # any extra keyword arguments that should be passed to the dataset constructor, e.g. `data_dir`.
```
Next, we'd like to tell our task what the dataset's train, validation, and test splits are named, if they exist:

@@ -99,6 +89,36 @@ Now, in our YAML config file we'll use the `!function` constructor, and tell the

```yaml
process_docs: !function utils.process_docs
```
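For context, a minimal sketch of what such a `process_docs` function in `utils.py` might look like (the column names `context` and `label` are placeholders for illustration, not fields from any particular dataset):

```python
# utils.py -- referenced from the task YAML via `process_docs: !function utils.process_docs`
import datasets


def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
    def _process(doc):
        # Build the fields the prompt template expects; "context" and "label"
        # are hypothetical column names used only for this sketch.
        return {
            "query": doc["context"].strip(),
            "gold": int(doc["label"]),
        }

    return dataset.map(_process)
```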
### Using Local Datasets
To load a local dataset for evaluation, you can specify data files in the `dataset_kwargs` field, such as the following for JSON files:
```
dataset_path: json
dataset_name: null
dataset_kwargs:
data_files: /path/to/my/json
```
Or with files already split into separate directories:
```
dataset_path: arrow
dataset_kwargs:
data_files:
train: /path/to/arrow/train/data-00000-of-00001.arrow
validation: /path/to/arrow/validation/data-00000-of-00001.arrow
```
Alternatively, if you have previously downloaded a dataset from the Hugging Face Hub (using `save_to_disk()`) and wish to use the local files, set `data_dir` under `dataset_kwargs` to point to the saved directory.
```
dataset_path: hellaswag
dataset_kwargs:
data_dir: hellaswag_local/
```
You can also set `dataset_path` as a directory path on your local filesystem. This assumes that there is a loading script with the same name as the directory. [See datasets docs](https://huggingface.co/docs/datasets/loading#local-loading-script).
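A minimal sketch of this layout (the directory name is hypothetical):

```
dataset_path: /path/to/my_dataset  # expects a loading script at /path/to/my_dataset/my_dataset.py
```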
## Writing a Prompt Template

The next thing we need to do is decide what format to use when presenting the data to the LM. This is our **prompt**, where we'll define both an input and output format.
......
@@ -301,6 +301,23 @@ task:
  - hendrycksTest*
```

It is also possible to list an existing task in your benchmark configuration with some adjustments. For example, a few tasks from `mmlu` are included in `multimedqa`; there, the `task_alias` and `group_alias` (see [here](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#beautifying-table-display) for more details) are modified to suit the benchmark.
```yaml
group: multimedqa
task:
  - pubmedqa
  - medmcqa
  - medqa_4options
  - task: mmlu_anatomy
    task_alias: "anatomy (mmlu)"
    group_alias: null
  - task: mmlu_clinical_knowledge
    task_alias: "clinical_knowledge (mmlu)"
    group_alias: null
  ...
```
Alternatively, a benchmark can include tasks whose configuration is customized inline; these entries are defined in the same way a standalone YAML task is usually written.
......
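As a rough sketch (the group and task names here are hypothetical), such an inline entry can override per-task settings directly:

```yaml
group: my_benchmark
task:
  - task: my_subtask
    num_fewshot: 5
    # any other keys accepted by a standalone task YAML can be overridden here
```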
@@ -527,6 +527,10 @@ class ConfigurableTask(Task):
                "Must pass a config to ConfigurableTask, either in cls.CONFIG or `config` kwarg"
            )

        if isinstance(self.config.metadata, dict):
            if "version" in self.config.metadata:
                self.VERSION = self.config.metadata["version"]

        if self.config.output_type is not None:
            assert self.config.output_type in ALL_OUTPUT_TYPES
            self.OUTPUT_TYPE = self.config.output_type

@@ -755,6 +759,8 @@ class ConfigurableTask(Task):
    def fewshot_docs(self):
        if self.config.fewshot_split is not None:
            if self.config.process_docs is not None:
                return self.config.process_docs(self.dataset[self.config.fewshot_split])
            return self.dataset[self.config.fewshot_split]
        else:
            if (self.config.num_fewshot is not None) and (self.config.num_fewshot > 0):
......
@@ -133,6 +133,8 @@ class HFLM(LM):
            gpus = torch.cuda.device_count()
            accelerator = Accelerator()
            if accelerator.num_processes > 1:
                self.accelerator = accelerator

            if not (parallelize or accelerator.num_processes > 1):
                # use user-passed device

@@ -202,15 +204,16 @@ class HFLM(LM):
            self.model.tie_weights()

        if isinstance(pretrained, str) and (gpus >= 1 or str(self.device) == "mps"):
            # TODO: can remove this whole snippet except in the mps case, perhaps?
            if not (parallelize or autogptq or hasattr(self, "accelerator")):
                # place model onto device requested manually,
                # if not using HF Accelerate or device_map
                # or any other option that preloads model onto device
                try:
                    self.model.to(self.device)
                except ValueError:
                    eval_logger.debug(
                        "Failed to place model onto specified device. This may be because the model is quantized via `bitsandbytes` or `device_map` is provided. If the desired GPU is being used, this message is safe to ignore."
                    )

        self._create_tokenizer(

@@ -456,12 +459,24 @@ class HFLM(LM):
        if parallelize:
            model_kwargs.update(
                _get_accelerate_args(
                    device_map_option,  # TODO: phase out device_map_option?
                    max_memory_per_gpu,
                    max_cpu_memory,
                    offload_folder,
                )
            )
        elif "device_map" not in model_kwargs:
            # set a device_map to initialize model on the right GPU.
            # this is needed because it seems that the default behavior
            # for quantized models now seems to be device_map="auto"
            # which breaks data-parallel mode.
            if hasattr(self, "accelerator"):
                model_kwargs.update(
                    {"device_map": {"": f"cuda:{self.accelerator.local_process_index}"}}
                )
            else:
                model_kwargs.update({"device_map": {"": str(self.device)}})

        if not autogptq:
            if model_kwargs.get("load_in_4bit", None):
                assert (
......
@@ -61,11 +61,27 @@ def register_configurable_group(config: Dict[str, str], yaml_path: str = None) -
    task_list = [task for task in all_task_list if type(task) == str]

    for task_config in config_list:

        base_config = {}
        task_name_config = {}
        if "task" in task_config:
            task_name = task_config["task"]
            if task_name in ALL_TASKS:
                task_obj = get_task_dict(task_name)[task_name]
                if type(task_obj) == tuple:
                    _, task_obj = task_obj
                if task_obj is not None:
                    base_config = task_obj._config.to_dict()
                    task_name_config["task"] = f"{group}_{task_name}"

        task_config = utils.load_yaml_config(yaml_path, task_config)
        var_configs = check_prompt_config(
            {
                **base_config,
                **task_config,
                **{"group": group},
                **task_name_config,
            },
            yaml_path=os.path.dirname(yaml_path),
        )
......
@@ -3,9 +3,21 @@ task:
  - pubmedqa
  - medmcqa
  - medqa_4options
  - task: mmlu_anatomy
    task_alias: "anatomy (mmlu)"
    group_alias: null
  - task: mmlu_clinical_knowledge
    task_alias: "clinical_knowledge (mmlu)"
    group_alias: null
  - task: mmlu_college_medicine
    task_alias: "college_medicine (mmlu)"
    group_alias: null
  - task: mmlu_medical_genetics
    task_alias: "medical_genetics (mmlu)"
    group_alias: null
  - task: mmlu_professional_medicine
    task_alias: "professional_medicine (mmlu)"
    group_alias: null
  - task: mmlu_college_biology
    task_alias: "college_biology (mmlu)"
    group_alias: null
@@ -31,4 +31,4 @@ filter_list:
      - function: "majority_vote"
      - function: "take_first"
metadata:
  version: 2.0
@@ -5,16 +5,16 @@ dataset_path: gsm8k
dataset_name: main
output_type: generate_until
test_split: test
doc_to_text: "Q: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?\nA: There are 15 trees originally. Then there were 21 trees after some more were planted. So there must have been 21 - 15 = 6. The answer is 6.\n\n\
Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?\nA: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The answer is 5.\n\n\
Q: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?\nA: Originally, Leah had 32 chocolates. Her sister had 42. So in total they had 32 + 42 = 74. After eating 35, they had 74 - 35 = 39. The answer is 39.\n\n\
Q: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?\nA: Jason started with 20 lollipops. Then he had 12 after giving some to Denny. So he gave Denny 20 - 12 = 8. The answer is 8.\n\n\
Q: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now?\nA: Shawn started with 5 toys. If he got 2 toys each from his mom and dad, then that is 4 more toys. 5 + 4 = 9. The answer is 9.\n\n\
Q: There were nine computers in the server room. Five more computers were installed each day, from monday to thursday. How many computers are now in the server room?\nA: There were originally 9 computers. For each of 4 days, 5 more computers were added. So 5 * 4 = 20 computers were added. 9 + 20 is 29. The answer is 29.\n\n\
Q: Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many golf balls did he have at the end of wednesday?\nA: Michael started with 58 golf balls. After losing 23 on tuesday, he had 58 - 23 = 35. After losing 2 more, he had 35 - 2 = 33 golf balls. The answer is 33.\n\n\
Q: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?\nA: Olivia had 23 dollars. 5 bagels for 3 dollars each will be 5 x 3 = 15 dollars. So she has 23 - 15 dollars left. 23 - 15 is 8. The answer is 8.\n\n\
Q: {{question}}\nA:"
doc_to_target: "{{answer.split('####')[-1].strip()}}"
metric_list:
  - metric: exact_match
    aggregation: mean

@@ -31,7 +31,6 @@ generation_kwargs:
    - "Q:"
    - "\n\n"
  do_sample: false
repeats: 1
num_fewshot: 0
filter_list:

@@ -41,4 +40,4 @@ filter_list:
      regex_pattern: "The answer is (\\-?[0-9\\.\\,]+)."
    - function: "take_first"
metadata:
  version: 2.0
@@ -24,7 +24,6 @@ generation_kwargs:
    - "\n\n"
    - "Question:"
  do_sample: false
repeats: 1
num_fewshot: 5
filter_list:
......
# KoBEST
### Paper
Title: `KOBEST: Korean Balanced Evaluation of Significant Tasks`
Abstract: https://arxiv.org/abs/2204.04541
A well-formulated benchmark plays a critical role in spurring advancements in the natural language processing (NLP) field, as it allows objective and precise evaluation of diverse models. As modern language models (LMs) have become more elaborate and sophisticated, more difficult benchmarks that require linguistic knowledge and reasoning have been proposed. However, most of these benchmarks only support English, and great effort is necessary to construct benchmarks for other low resource languages. To this end, we propose a new benchmark named Korean balanced evaluation of significant tasks (KoBEST), which consists of five Korean-language downstream tasks. Professional Korean linguists designed the tasks that require advanced Korean linguistic knowledge. Moreover, our data is purely annotated by humans and thoroughly reviewed to guarantee high data quality. We also provide baseline models and human performance results. Our dataset is available on the Huggingface.
Homepage: https://huggingface.co/datasets/skt/kobest_v1
### Groups and Tasks
#### Groups
- `kobest`
#### Tasks
- `kobest_boolq`
- `kobest_copa`
- `kobest_hellaswag`
- `kobest_sentineg`
- `kobest_wic`
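For example, a usage sketch for running the whole group (the model checkpoint is only an illustration):

```bash
lm_eval --model hf \
    --model_args pretrained=EleutherAI/polyglot-ko-1.3b \
    --tasks kobest \
    --batch_size 8
```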
### Citation
```
@misc{kim2022kobest,
  author={Kim, Dohyeong and Jang, Myeongjun and Kwon, Deuk Sin and Davis, Eric},
  title={KOBEST: Korean Balanced Evaluation of Significant Tasks},
  doi={https://doi.org/10.48550/arXiv.2204.04541},
  publisher={arXiv},
  year={2022},
  month={Apr}
}
```

group:
  - kobest
task: kobest_boolq
dataset_path: skt/kobest_v1
dataset_name: boolq
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
doc_to_text: "{{paragraph}} 질문: {{question}} 답변: "
doc_to_target: "{{label}}"
doc_to_choice: ["아니오", "예"]
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: True
  - metric: f1
    aggregation: !function utils.macro_f1_score
    average: macro
    hf_evaluate: true
    higher_is_better: True
metadata:
  version: 1.0

group:
  - kobest
task: kobest_copa
dataset_path: skt/kobest_v1
dataset_name: copa
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
doc_to_text: !function utils.copa_doc_to_text
doc_to_target: !function utils.copa_doc_to_target
doc_to_choice: !function utils.copa_doc_to_choice
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: True
  - metric: f1
    aggregation: !function utils.macro_f1_score
    average: macro
    hf_evaluate: true
    higher_is_better: True
metadata:
  version: 1.0

group:
  - kobest
task: kobest_hellaswag
dataset_path: skt/kobest_v1
dataset_name: hellaswag
training_split: train
validation_split: validation
output_type: multiple_choice
test_split: test
doc_to_text: "{{query}}"
doc_to_target: "{{label}}"
process_docs: !function utils.hellaswag_process_doc
doc_to_choice: "choices"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: True
  - metric: acc_norm
    aggregation: mean
    higher_is_better: True
  - metric: f1
    aggregation: !function utils.macro_f1_score
    average: macro
    hf_evaluate: true
    higher_is_better: True
metadata:
  version: 1.0

group:
  - kobest
task: kobest_sentineg
dataset_path: skt/kobest_v1
dataset_name: sentineg
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
doc_to_text: !function utils.sentineg_doc_to_text
doc_to_target: "{{label}}"
doc_to_choice: ["부정", "긍정"]
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: True
  - metric: f1
    aggregation: !function utils.macro_f1_score
    average: macro
    hf_evaluate: true
    higher_is_better: True
metadata:
  version: 1.0

group:
  - kobest
task: kobest_wic
dataset_path: skt/kobest_v1
dataset_name: wic
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
doc_to_text: !function utils.wic_doc_to_text
doc_to_target: "{{label}}"
doc_to_choice: ['아니오', '예']
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: True
  - metric: f1
    aggregation: !function utils.macro_f1_score
    average: macro
    hf_evaluate: true
    higher_is_better: True
metadata:
  version: 1.0

from datasets import Dataset
from sklearn.metrics import f1_score


def copa_doc_to_text(doc: dict) -> str:
    # Map the COPA question type to the appropriate Korean connective.
    connector = {"원인": " 왜냐하면", "결과": " 그래서"}[doc["question"].strip()]
    return f"""{doc["premise"]} {connector}"""


def copa_doc_to_target(doc: dict) -> str:
    correct_choice = doc["alternative_1"] if doc["label"] == 0 else doc["alternative_2"]
    return f"""{correct_choice}"""


def copa_doc_to_choice(doc: dict) -> list:
    return [f"""{doc["alternative_1"]}""", f"""{doc["alternative_2"]}"""]


def sentineg_doc_to_text(doc: dict):
    return f"""문장: {doc["sentence"]} 긍부정:"""


def wic_doc_to_text(doc: dict) -> str:
    return f"""문장1: {doc["context_1"]} 문장2: {doc["context_2"]} 두 문장에서 {doc["word"]}가 같은 뜻으로 쓰였나?"""


def hellaswag_process_doc(doc: Dataset) -> Dataset:
    def preprocessor(dataset):
        return {
            "query": f"""문장: {dataset["context"]}""",
            "choices": [
                dataset["ending_1"],
                dataset["ending_2"],
                dataset["ending_3"],
                dataset["ending_4"],
            ],
            "gold": int(dataset["label"]),
        }

    return doc.map(preprocessor)


def macro_f1_score(items):
    # items is a list of (gold, prediction) pairs; compute macro-averaged F1.
    unzipped_list = list(zip(*items))
    golds = unzipped_list[0]
    preds = unzipped_list[1]
    fscore = f1_score(golds, preds, average="macro")
    return fscore