Merge branch 'big-refactor' into fix-unittests

Commit de71ad92 authored Oct 17, 2023 by Lintang Sutawika; committed by GitHub Oct 17, 2023
20 changed files
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -33,7 +33,6 @@ repos:
    rev: 22.3.0
    hooks:
      - id: black
-        language_version: python3.8
  - repo: https://github.com/codespell-project/codespell
    rev: v2.1.0
    hooks:

--- a/README.md
+++ b/README.md
@@ -23,8 +23,12 @@ Features:
 - Many tasks implemented, 200+ tasks [implemented in the old framework](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/docs/task_table.md) which require porting to the new setup as described in [the new task guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/new_task_guide.md).
 - Support for models loaded via [transformers](https://github.com/huggingface/transformers/) (including quantization via [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)), [GPT-NeoX](https://github.com/EleutherAI/gpt-neox), and [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/), with a flexible tokenization-agnostic interface.
 - Support for commercial APIs including [OpenAI](https://openai.com), [goose.ai](https://goose.ai), and [TextSynth](https://textsynth.com/).
- Support for evaluation on adapters (e.g. LoRa) supported in [HuggingFace's PEFT library](https://github.com/huggingface/peft).
+- Support for evaluation on adapters (e.g. LoRA) supported in [HuggingFace's PEFT library](https://github.com/huggingface/peft).
- Evaluating with publicly available prompts ensures reproducibility and comparability between papers.
+- Support for local models and benchmarks.
+- Evaluation with publicly available prompts ensures reproducibility and comparability between papers.
+The Language Model Evaluation Harness is the backend for 🤗 Hugging Face's popular [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) and is used internally by dozens of companies including NVIDIA, Cohere, Booz Allen Hamilton, and Mosaic ML.
 ## Install
@@ -232,7 +236,7 @@ We support wildcards in task names, for example you can run all of the machine-t
 To implement a new task in the eval harness, see [this guide](./docs/new_task_guide.md).
-As a start, we currently only support one prompt per task, which we strive to make the "standard" as defined by the benchmark's authors. If you would like to study how varying prompts causes changes in the evaluation score, we support prompts authored in the [Promptsource Library](https://github.com/bigscience-workshop/promptsource/tree/main) as described further in https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/docs/new_task_guide.md and https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/docs/advanced_task_guide.md and welcome contributions of novel task templates and task variants.
+As a start, we currently only support one prompt per task, which we strive to make the "standard" as defined by the benchmark's authors. If you would like to study how varying prompts causes changes in the evaluation score, we support prompts authored in the [Promptsource Library](https://github.com/bigscience-workshop/promptsource/tree/main) as described further in [the task guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/docs/new_task_guide.md) and [the advanced task guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/docs/advanced_task_guide.md) and welcome contributions of novel task templates and task variants.
 ## How to Contribute or Learn More?
@@ -248,16 +252,23 @@ You can also ask for help, or discuss new features with the maintainers in the #
 @software{eval-harness,
  author       = {Gao, Leo and
                  Tow, Jonathan and
+                  Abbasi, Baber and
                  Biderman, Stella and
                  Black, Sid and
                  DiPofi, Anthony and
                  Foster, Charles and
                  Golding, Laurence and
                  Hsu, Jeffrey and
+                  Le Noac'h, Alain and
+                  Li, Haonan and
                  McDonell, Kyle and
                  Muennighoff, Niklas and
+                  Ociepa, Chris and
                  Phang, Jason and
                  Reynolds, Laria and
+                  Schoelkopf, Hailey and
+                  Skowron, Aviya and
+                  Sutawika, Lintang and
                  Tang, Eric and
                  Thite, Anish and
                  Wang, Ben and

--- a/docs/task_guide.md
+++ b/docs/task_guide.md
@@ -142,7 +142,7 @@ Our final filter pipeline, "maj@8", does majority voting across the first 8 of t
 - performing the same sequence of filters on these new sets of 8 responses, for each document.
 ```yaml
 - name: "maj@8"
-    filter:
+  filter:
    - function: "take_first_k"
      k: 8
    - function: "regex"

--- a/lm_eval/__main__.py
+++ b/lm_eval/__main__.py
@@ -101,7 +101,6 @@ def parse_eval_args() -> argparse.Namespace:
 def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
    if not args:
        # we allow for args to be passed externally, else we parse them ourselves
        args = parse_eval_args()
@@ -132,19 +131,21 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
        else:
            tasks_list = args.tasks.split(",")
            task_names = utils.pattern_match(tasks_list, ALL_TASKS)
-            task_missing = []
            for task in [task for task in tasks_list if task not in task_names]:
                if os.path.isfile(task):
                    config = utils.load_yaml_config(task)
                    task_names.append(config)
+            task_missing = [task for task in tasks_list if task not in task_names]
-        if task_missing != []:
-            missing = ", ".join(task_missing)
-            eval_logger.error(
-                f"Tasks were not found: {missing}\n"
-                f"{SPACING}Try `lm-eval -h` for list of available tasks",
-            )
-            raise ValueError(f"Tasks {missing} were not found.")
+            if task_missing:
+                missing = ", ".join(task_missing)
+                eval_logger.error(
+                    f"Tasks were not found: {missing}\n"
+                    f"{SPACING}Try `lm-eval -h` for list of available tasks",
+                )
+                raise ValueError(
+                    f"Tasks {missing} were not found. Try `lm-eval -h` for list of available tasks."
+                )
    if args.output_path:
        path = Path(args.output_path)
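
Note on the hunk above: the missing-task list is now computed after the YAML-file check rather than initialised empty beforehand. A minimal sketch of that flow follows. `resolve_tasks` is a hypothetical name, `fnmatch` stands in for `utils.pattern_match` (whose implementation is not shown in this diff), and the sketch keeps the file path where the real code appends the loaded config.

```python
import fnmatch
import os
from typing import Iterable, List


def resolve_tasks(requested: List[str], known: Iterable[str]) -> List[str]:
    """Return resolved task names; raise if any requested entry is unresolved."""
    known = sorted(known)
    # Assumption: shell-style wildcard expansion stands in for utils.pattern_match.
    matched = sorted(
        {name for pattern in requested for name in fnmatch.filter(known, pattern)}
    )
    # Entries that matched nothing may still be paths to YAML config files.
    for task in [t for t in requested if t not in matched]:
        if os.path.isfile(task):
            matched.append(task)  # the real code appends the loaded config here
    # Leftovers are computed only after the file check, as in the new hunk.
    missing = [t for t in requested if t not in matched]
    if missing:
        raise ValueError(f"Tasks {', '.join(missing)} were not found.")
    return matched
```

In this sketch a wildcard such as `arc_*` expands to concrete names, so the pattern string itself is not in `matched` and would be reported as missing. The diff does not show whether the harness guards against that elsewhere.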

--- a/lm_eval/api/task.py
+++ b/lm_eval/api/task.py
@@ -44,7 +44,7 @@ ALL_OUTPUT_TYPES = [
    "loglikelihood",
    "multiple_choice",
    "loglikelihood_rolling",
-    "greedy_until",
+    "generate_until",
 ]
@@ -80,7 +80,7 @@ class TaskConfig(dict):
    num_fewshot: int = 0
    # scoring options
    metric_list: list = None
-    output_type: str = "greedy_until"
+    output_type: str = "generate_until"
    generation_kwargs: dict = None
    repeats: int = 1
    filter_list: Union[str, list] = None
@@ -97,11 +97,11 @@ class TaskConfig(dict):
            self.dataset_path = inspect.getfile(import_module(self.dataset_path))
        if self.generation_kwargs is not None:
-            if self.output_type != "greedy_until":
+            if self.output_type != "generate_until":
                eval_logger.warning(
-                    "passed `generation_kwargs`, but not using `output_type: greedy_until`!"
+                    f"[{self.task}] passed `generation_kwargs`, but not using `output_type: generate_until`!"
                )
-                assert self.output_type != "greedy_until"
+                assert self.output_type != "generate_until"
            if "temperature" in self.generation_kwargs:
                self.generation_kwargs["temperature"] = float(
@@ -111,7 +111,7 @@ class TaskConfig(dict):
            if "until" not in self.generation_kwargs:
                self.generation_kwargs["until"] = [self.fewshot_delimiter]
        else:
-            if self.output_type == "greedy_until":
+            if self.output_type == "generate_until":
                # ensure that we greedily generate in absence of explicit arguments otherwise
                self.generation_kwargs = {
                    "until": None
@@ -759,7 +759,6 @@ class ConfigurableTask(Task):
            return super().fewshot_docs()
    def apply_filters(self):
        if hasattr(self, "_filters"):
            for f in self._filters:
                f.apply(self._instances, self.task_docs)
@@ -959,7 +958,7 @@ class ConfigurableTask(Task):
                )
            return request_list
-        elif self.OUTPUT_TYPE == "greedy_until":
+        elif self.OUTPUT_TYPE == "generate_until":
            arguments = (ctx, self.config.generation_kwargs)
        return Instance(
@@ -967,7 +966,6 @@ class ConfigurableTask(Task):
        )
    def process_results(self, doc, results):
        if callable(self.config.process_results):
            return self.config.process_results(doc, results)
@@ -1072,7 +1070,7 @@ class ConfigurableTask(Task):
                acc_mutual_info = 1.0 if np.argmax(lls_mutual_info) == gold else 0.0
                result_dict["acc_mutual_info"] = acc_mutual_info
-        elif self.OUTPUT_TYPE == "greedy_until":
+        elif self.OUTPUT_TYPE == "generate_until":
            gold = self.doc_to_target(doc)
            result = results[0]
            if self.config.doc_to_choice is not None:
@@ -1104,7 +1102,9 @@ class ConfigurableTask(Task):
                                predictions=[result],
                                **self._metric_fn_kwargs[metric],
                            )
-                        except TypeError:  # TODO: this is hacky and I don't want to do it
+                        except (
+                            TypeError
+                        ):  # TODO: this is hacky and I don't want to do it
                            result_score = self._metric_fn_list[metric](
                                [gold_option, result]
                            )
@@ -1123,7 +1123,9 @@ class ConfigurableTask(Task):
                            predictions=[result],
                            **self._metric_fn_kwargs[metric],
                        )
-                    except TypeError:  # needed for now in order to use a different interface between our own metrics and HF Evaluate metrics
+                    except (
+                        TypeError
+                    ):  # needed for now in order to use a different interface between our own metrics and HF Evaluate metrics
                        result_score = self._metric_fn_list[metric]([gold, result])
                    if isinstance(result_score, dict):
                        # TODO: this handles the case where HF evaluate returns a dict.
@@ -1132,7 +1134,7 @@ class ConfigurableTask(Task):
        else:
            raise ValueError(
                f"Passed invalid output_type '{self.OUTPUT_TYPE}' ! Please use one of ",
-                "'loglikelihood', 'loglikelihood_rolling', 'greedy_until' or 'multiple_choice'",
+                "'loglikelihood', 'loglikelihood_rolling', 'generate_until' or 'multiple_choice'",
            )
        return result_dict
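
Note on the `TaskConfig` hunks above: after the rename, `generation_kwargs` is normalised the same way as before, only keyed on `generate_until`. The following is a rough sketch of that normalisation as a free function. `normalize_generation_kwargs` is a hypothetical name, the `"\n\n"` delimiter default is an assumption, and the default dict is limited to the single key visible in the hunk.

```python
from typing import Optional


def normalize_generation_kwargs(
    output_type: str,
    generation_kwargs: Optional[dict],
    fewshot_delimiter: str = "\n\n",  # assumed default, not shown in the diff
) -> Optional[dict]:
    """Mirror the visible TaskConfig logic for generation settings."""
    if generation_kwargs is not None:
        kwargs = dict(generation_kwargs)  # copy so the caller's dict is untouched
        if "temperature" in kwargs:
            # YAML may deliver the temperature as a string or int.
            kwargs["temperature"] = float(kwargs["temperature"])
        if "until" not in kwargs:
            # Stop at the few-shot delimiter when no stop sequence is given.
            kwargs["until"] = [fewshot_delimiter]
        return kwargs
    if output_type == "generate_until":
        # The hunk shows only the first key of the default dict; the rest is elided.
        return {"until": None}
    return None
```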

--- a/lm_eval/benchmarks/__init__.py
+++ b/lm_eval/benchmarks/__init__.py
-import os
-import yaml
-from lm_eval import utils
-from lm_eval.tasks import register_configurable_task, check_prompt_config
-from lm_eval.logger import eval_logger
-from lm_eval.api.registry import (
-    TASK_REGISTRY,
-    GROUP_REGISTRY,
-    ALL_TASKS,
-)
-def include_benchmarks(task_dir: str) -> None:
-    for root, subdirs, file_list in os.walk(task_dir):
-        if (subdirs == [] or "__pycache__" in subdirs) and (len(file_list) > 0):
-            for f in file_list:
-                if f.endswith(".yaml"):
-                    try:
-                        benchmark_path = os.path.join(root, f)
-                        with open(benchmark_path, "rb") as file:
-                            yaml_config = yaml.full_load(file)
-                        if "prompts" in yaml_config:
-                            continue  # Skip it
-                        assert "group" in yaml_config
-                        group = yaml_config["group"]
-                        all_task_list = yaml_config["task"]
-                        config_list = [
-                            task for task in all_task_list if type(task) != str
-                        ]
-                        task_list = [
-                            task for task in all_task_list if type(task) == str
-                        ]
-                        for task_config in config_list:
-                            yaml_dir = os.path.dirname(benchmark_path)
-                            task_config = utils.load_yaml_config(
-                                yaml_config=task_config, yaml_dir=yaml_dir
-                            )
-                            if "use_prompt" in task_config:
-                                if "yaml" in task_config["use_prompt"]:
-                                    task_config["use_prompt"] = os.path.join(
-                                        root, task_config["use_prompt"]
-                                    )
-                            var_configs = check_prompt_config(
-                                {
-                                    **task_config,
-                                    **{"group": group},
-                                }
-                            )
-                            for config in var_configs:
-                                register_configurable_task(config)
-                        task_names = utils.pattern_match(task_list, ALL_TASKS)
-                        for task in task_names:
-                            if task in TASK_REGISTRY:
-                                if group in GROUP_REGISTRY:
-                                    GROUP_REGISTRY[group].append(task)
-                                else:
-                                    GROUP_REGISTRY[group] = [task]
-                                    ALL_TASKS.add(group)
-                    except Exception as error:
-                        eval_logger.warning(
-                            "Failed to load benchmark in\n"
-                            f"                                 {benchmark_path}\n"
-                            "                                 Benchmark will not be added to registry\n"
-                            f"                                 Error: {error}"
-                        )
-task_dir = os.path.dirname(os.path.abspath(__file__)) + "/"
-include_benchmarks(task_dir)
--- a/lm_eval/models/anthropic_llms.py
+++ b/lm_eval/models/anthropic_llms.py
@@ -138,7 +138,7 @@ please install anthropic via `pip install lm-eval[anthropic]` or `pip install -e
    def _loglikelihood_tokens(self, requests, disable_tqdm: bool = False):
        raise NotImplementedError("No support for logits.")
-    def greedy_until(self, requests) -> List[str]:
+    def generate_until(self, requests) -> List[str]:
        if not requests:
            return []
@@ -164,7 +164,7 @@ please install anthropic via `pip install lm-eval[anthropic]` or `pip install -e
                )
                res.append(response)
-                self.cache_hook.add_partial("greedy_until", request, response)
+                self.cache_hook.add_partial("generate_until", request, response)
            except anthropic.APIConnectionError as e:  # type: ignore # noqa: F821
                eval_logger.critical(f"Server unreachable: {e.__cause__}")
                break
@@ -179,7 +179,7 @@ please install anthropic via `pip install lm-eval[anthropic]` or `pip install -e
        raise NotImplementedError()
    def _model_generate(self, context, max_length, eos_token_id):
-        # Isn't used because we override greedy_until
+        # Isn't used because we override generate_until
        raise NotImplementedError()
    def loglikelihood(self, requests):

--- a/lm_eval/models/dummy.py
+++ b/lm_eval/models/dummy.py
@@ -20,7 +20,7 @@ class DummyLM(LM):
        return res
-    def greedy_until(self, requests):
+    def generate_until(self, requests):
        res = []
        for ctx, _ in requests:

--- a/lm_eval/models/huggingface.py
+++ b/lm_eval/models/huggingface.py
@@ -621,6 +621,23 @@ class HFLM(LM):
        return loglikelihoods
+    def _batch_scheduler(self, pos, n_reordered_requests):
+        sched = pos // int(len(n_reordered_requests) / self.batch_schedule)
+        if sched in self.batch_sizes:
+            return self.batch_sizes[sched]
+        if (len(self.batch_sizes) > 1) and (
+            self.batch_sizes[sched - 1] == self.max_batch_size
+        ):
+            # if previous batch size is already maximal, skip recomputation
+            self.batch_sizes[sched] = self.max_batch_size
+            return self.batch_sizes[sched]
+        print(
+            f"Passed argument batch_size = auto:{self.batch_schedule}. Detecting largest batch size"
+        )
+        self.batch_sizes[sched] = self._detect_batch_size(n_reordered_requests, pos)
+        print(f"Determined largest batch size: {self.batch_sizes[sched]}")
+        return self.batch_sizes[sched]
    def _loglikelihood_tokens(
        self, requests, disable_tqdm: bool = False, override_bs=None
    ):
@@ -644,25 +661,6 @@ class HFLM(LM):
        # automatic (variable) batch size detection for vectorization
        # pull longest context sample from request
-        def _batch_scheduler(pos):
-            sched = pos // int(n_reordered_requests / self.batch_schedule)
-            if sched in self.batch_sizes:
-                return self.batch_sizes[sched]
-            if (len(self.batch_sizes) > 1) and (
-                self.batch_sizes[sched - 1] == self.max_batch_size
-            ):
-                # if previous batch size is already maximal, skip recomputation
-                self.batch_sizes[sched] = self.max_batch_size
-                return self.batch_sizes[sched]
-            print(
-                f"Passed argument batch_size = auto:{self.batch_schedule}. Detecting largest batch size"
-            )
-            self.batch_sizes[sched] = self._detect_batch_size(
-                re_ord.get_reordered(), pos
-            )
-            print(f"Determined largest batch size: {self.batch_sizes[sched]}")
-            return self.batch_sizes[sched]
        for chunk in utils.chunks(
            tqdm(re_ord.get_reordered(), disable=(disable_tqdm or (self.rank != 0))),
            n=self.batch_size
@@ -670,7 +668,7 @@ class HFLM(LM):
            else override_bs
            if override_bs is not None
            else 0,
-            fn=_batch_scheduler
+            fn=self._batch_scheduler
            if self.batch_size == "auto"
            and n_reordered_requests > 0
            and not override_bs
@@ -815,7 +813,7 @@ class HFLM(LM):
        return re_ord.get_original(res)
-    def greedy_until(self, requests):
+    def generate_until(self, requests):
        res = defaultdict(list)
        re_ords = {}
@@ -838,12 +836,24 @@ class HFLM(LM):
            re_ords[key] = utils.Reorderer([req.args for req in reqs], _collate)
        pbar = tqdm(total=len(requests), disable=(self.rank != 0))
+        if self.batch_size == "auto":
+            # using rolling window with maximum context
+            print("Passed argument batch_size = auto. Detecting largest batch size")
+            batch_size = self._detect_batch_size()
+            print(f"Determined Largest batch size: {batch_size}")
+            adaptive_batch_size = batch_size
        # for each different set of kwargs, we execute all requests, by batch.
        for key, re_ord in re_ords.items():
            for chunk in utils.chunks(
-                re_ord.get_reordered(),
+                tqdm(re_ord.get_reordered(), disable=self.rank != 0),
-                self.batch_size,
+                n=self.batch_size
+                if self.batch_size != "auto"
+                else adaptive_batch_size
+                if adaptive_batch_size is not None
+                else 0,
+                fn=self._batch_scheduler
+                if self.batch_size == "auto" and not adaptive_batch_size
+                else None,
            ):
                contexts, all_gen_kwargs = zip(*chunk)
                # we assume all gen kwargs in the batch are the same
@@ -920,7 +930,7 @@ class HFLM(LM):
                    res[key].append(s)
                    self.cache_hook.add_partial(
-                        "greedy_until", (context, gen_kwargs), s
+                        "generate_until", (context, gen_kwargs), s
                    )
                    pbar.update(1)
            # reorder this group of results back to original unsorted form
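
Note on the `_batch_scheduler` move above: hoisting the closure to a method changes its signature to `(pos, n_reordered_requests)`, so the chunking helper has to pass the request sequence along with the position. A self-contained sketch of that interaction follows. `SizeScheduler`, `detect`, and this `chunks` are hypothetical stand-ins, and the `fn(index, items)` calling convention is an assumption, since `utils.chunks` is not part of this diff.

```python
from typing import Callable, Dict, Iterator, List, Optional, Sequence


class SizeScheduler:
    """Caches one batch size per schedule segment, as _batch_scheduler does."""

    def __init__(
        self,
        batch_schedule: int,
        max_batch_size: int,
        detect: Callable[[Sequence, int], int],
    ) -> None:
        self.batch_schedule = batch_schedule
        self.max_batch_size = max_batch_size
        self.detect = detect  # stand-in for HFLM._detect_batch_size
        self.batch_sizes: Dict[int, int] = {}

    def __call__(self, pos: int, requests: Sequence) -> int:
        # Which segment of the schedule does this position fall in?
        # Note: the divisor is 0 if there are fewer requests than segments.
        sched = pos // int(len(requests) / self.batch_schedule)
        if sched in self.batch_sizes:
            return self.batch_sizes[sched]
        if len(self.batch_sizes) > 1 and (
            self.batch_sizes[sched - 1] == self.max_batch_size
        ):
            # Previous segment already hit the maximum, so skip detection.
            self.batch_sizes[sched] = self.max_batch_size
            return self.batch_sizes[sched]
        self.batch_sizes[sched] = self.detect(requests, pos)
        return self.batch_sizes[sched]


def chunks(
    items: Sequence, n: int = 0, fn: Optional[Callable[[int, Sequence], int]] = None
) -> Iterator[List]:
    """Yield lists of size n, or of the size fn(index, items) reports."""
    arr: List = []
    for i, x in enumerate(items):
        arr.append(x)
        if len(arr) == (fn(i, items) if fn else n):
            yield arr
            arr = []
    if arr:
        yield arr
```

With eight requests, two segments, and a detector that reports 2 for the first half and 4 for the second, the chunks come out as two pairs followed by one group of four.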

--- a/lm_eval/models/openai_completions.py
+++ b/lm_eval/models/openai_completions.py
@@ -203,7 +203,7 @@ class OpenaiCompletionsLM(LM):
                    self.cache_hook.add_partial("loglikelihood", cache_key, answer)
        return re_ord.get_original(res)
-    def greedy_until(self, requests) -> List[str]:
+    def generate_until(self, requests) -> List[str]:
        if not requests:
            return []
        res = []
@@ -260,7 +260,7 @@ class OpenaiCompletionsLM(LM):
                # partial caching
                self.cache_hook.add_partial(
-                    "greedy_until", (context, {"until": until_}), s
+                    "generate_until", (context, {"until": until_}), s
                )
                res.append(s)
@@ -271,7 +271,7 @@ class OpenaiCompletionsLM(LM):
        raise NotImplementedError()
    def _model_generate(self, context, max_length, eos_token_id):
-        # Isn't used because we override greedy_until
+        # Isn't used because we override generate_until
        raise NotImplementedError()
    def loglikelihood_rolling(self, requests) -> List[float]:

--- a/lm_eval/models/textsynth.py
+++ b/lm_eval/models/textsynth.py
@@ -58,7 +58,7 @@ class TextSynthLM(LM):
    @property
    def eot_token_id(self):
-        # Isn't used because we override loglikelihood, loglikelihood_rolling and greedy_until
+        # Isn't used because we override loglikelihood, loglikelihood_rolling and generate_until
        raise NotImplementedError()
    @property
@@ -72,20 +72,20 @@ class TextSynthLM(LM):
    @property
    def batch_size(self):
-        # Isn't used because we override loglikelihood, loglikelihood_rolling and greedy_until
+        # Isn't used because we override loglikelihood, loglikelihood_rolling and generate_until
        raise NotImplementedError()
    @property
    def device(self):
-        # Isn't used because we override loglikelihood, loglikelihood_rolling and greedy_until
+        # Isn't used because we override loglikelihood, loglikelihood_rolling and generate_until
        raise NotImplementedError()
    def tok_encode(self, string: str):
-        # Isn't used because we override loglikelihood, loglikelihood_rolling and greedy_until
+        # Isn't used because we override loglikelihood, loglikelihood_rolling and generate_until
        raise NotImplementedError()
    def tok_decode(self, tokens):
-        # Isn't used because we override loglikelihood, loglikelihood_rolling and greedy_until
+        # Isn't used because we override loglikelihood, loglikelihood_rolling and generate_until
        raise NotImplementedError()
    def loglikelihood(self, requests):
@@ -122,7 +122,7 @@ class TextSynthLM(LM):
            "input tokenization support from TextSynth."
        )
-    def greedy_until(self, requests):
+    def generate_until(self, requests):
        if not requests:
            return []
@@ -146,7 +146,7 @@ class TextSynthLM(LM):
                s = resp["text"]
                res.append(s)
-                self.cache_hook.add_partial("greedy_until", (inp, request_args), s)
+                self.cache_hook.add_partial("generate_until", (inp, request_args), s)
            else:
                logger.error(
                    f"The following response does not contain generated `text`. "
@@ -160,5 +160,5 @@ class TextSynthLM(LM):
        raise NotImplementedError()
    def _model_generate(self, context, max_length, eos_token_id):
-        # Isn't used because we override greedy_until
+        # Isn't used because we override generate_until
        raise NotImplementedError()
--- a/lm_eval/tasks/README.md
+++ b/lm_eval/tasks/README.md
@@ -59,6 +59,7 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for
 - [x] MGSM
 - [ ] SCROLLS
 - [x] Babi
+- [x] Belebele
 # Novel Tasks
 Tasks added in the revamped harness that were not previously available. Again, a strikethrough denotes checking performed *against the original task's implementation or published results introducing the task*.

--- a/lm_eval/tasks/__init__.py
+++ b/lm_eval/tasks/__init__.py
@@ -27,7 +27,9 @@ def register_configurable_task(config: Dict[str, str]) -> int:
        register_task(task_name)(SubClass)
    if "group" in config:
-        if type(config["group"]) == str:
+        if config["group"] == config["task"]:
+            raise ValueError("task and group name cannot be the same")
+        elif type(config["group"]) == str:
            group_name = [config["group"]]
        else:
            group_name = config["group"]
@@ -45,7 +47,6 @@ def register_configurable_group(config: Dict[str, str], yaml_path: str = None) -
    task_list = [task for task in all_task_list if type(task) == str]
    for task_config in config_list:
        task_config = utils.load_yaml_config(yaml_path, task_config)
        var_configs = check_prompt_config(
            {
@@ -97,7 +98,7 @@ def check_prompt_config(
                            ]
                        )
                    },
-                    **{"output_type": "greedy_until"},
+                    **{"output_type": "generate_until"},
                }
            )
    else:
@@ -137,7 +138,10 @@ def include_task_folder(task_dir: str, register_task: bool = True) -> None:
                        else:
                            if type(config["task"]) == list:
                                register_configurable_group(config, yaml_path)
+                except ModuleNotFoundError as e:
+                    eval_logger.warning(
+                        f"{yaml_path}: {e}. Config will not be added to registry."
+                    )
                except Exception as error:
                    import traceback
@@ -187,7 +191,6 @@ def get_task_name_from_object(task_object):
 # TODO: pass num_fewshot and other cmdline overrides in a better way
 def get_task_dict(task_name_list: List[Union[str, Dict, Task]], **kwargs):
    config = {**kwargs}
    task_name_from_registry_dict = {}
@@ -199,7 +202,6 @@ def get_task_dict(task_name_list: List[Union[str, Dict, Task]], **kwargs):
    for task_element in task_name_list:
        if isinstance(task_element, str):
            if task_element in GROUP_REGISTRY:
                group_name = task_element
                for task_name in GROUP_REGISTRY[task_element]:
@@ -237,7 +239,6 @@ def get_task_dict(task_name_list: List[Union[str, Dict, Task]], **kwargs):
            }
        elif isinstance(task_element, Task):
            task_name_from_object_dict = {
                **task_name_from_object_dict,
                get_task_name_from_object(task_element): task_element,
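
Note on the `register_configurable_task` hunk above: the new guard rejects a config whose group name equals its task name before the group is normalised to a list. A minimal sketch, with `group_names` as a hypothetical helper name:

```python
from typing import Dict, List, Union


def group_names(config: Dict[str, Union[str, List[str]]]) -> List[str]:
    """Return the config's groups as a list, rejecting group == task."""
    if "group" not in config:
        return []
    group = config["group"]
    if group == config["task"]:
        raise ValueError("task and group name cannot be the same")
    # A single string becomes a one-element list; lists pass through.
    return [group] if isinstance(group, str) else list(group)
```

As in the diff, the equality check only catches a string group; a list of groups that happens to contain the task name would pass through this sketch unchanged.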

--- a/lm_eval/tasks/babi/babi.yaml
+++ b/lm_eval/tasks/babi/babi.yaml
 task: babi
 dataset_path: Muennighoff/babi
 dataset_name: null
-output_type: greedy_until
+output_type: generate_until
 training_split: train
 validation_split: valid
 test_split: test

--- a/lm_eval/tasks/bbh/flan_cot_fewshot/_flan_cot_fewshot_template_yaml
+++ b/lm_eval/tasks/bbh/flan_cot_fewshot/_flan_cot_fewshot_template_yaml
 group: bbh_flan_cot_fewshot
 dataset_path: lukaemon/bbh
-output_type: greedy_until
+output_type: generate_until
 test_split: test
 doc_to_target: "{{target}}"
 metric_list:

--- a/lm_eval/tasks/bbh/flan_cot_zeroshot/_flan_cot_zeroshot_template_yaml
+++ b/lm_eval/tasks/bbh/flan_cot_zeroshot/_flan_cot_zeroshot_template_yaml
 group: bbh_flan_cot_zeroshot
 dataset_path: lukaemon/bbh
-output_type: greedy_until
+output_type: generate_until
 test_split: test
 doc_to_target: "{{target}}"
 metric_list:

--- a/lm_eval/tasks/bbh/flan_fewshot/_flan_fewshot_template_yaml
+++ b/lm_eval/tasks/bbh/flan_fewshot/_flan_fewshot_template_yaml
 group: bbh_flan_fewshot
 dataset_path: lukaemon/bbh
-output_type: greedy_until
+output_type: generate_until
 test_split: test
 doc_to_target: "{{target}}"
 metric_list:

--- a/lm_eval/tasks/bbh/flan_zeroshot/_flan_zeroshot_template_yaml
+++ b/lm_eval/tasks/bbh/flan_zeroshot/_flan_zeroshot_template_yaml
 group: bbh_flan_zeroshot
 dataset_path: lukaemon/bbh
-output_type: greedy_until
+output_type: generate_until
 test_split: test
 doc_to_target: "{{target}}"
 metric_list:

--- a/lm_eval/tasks/belebele/README.md
+++ b/lm_eval/tasks/belebele/README.md
+# Belebele
+### Paper
+The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants
+https://arxiv.org/abs/2308.16884
+Belebele is a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. This dataset enables the evaluation of mono- and multi-lingual models in high-, medium-, and low-resource languages. Each question has four multiple-choice answers and is linked to a short passage from the FLORES-200 dataset. The human annotation procedure was carefully curated to create questions that discriminate between different levels of generalizable language comprehension and is reinforced by extensive quality checks. While all questions directly relate to the passage, the English dataset on its own proves difficult enough to challenge state-of-the-art language models. Being fully parallel, this dataset enables direct comparison of model performance across all languages. Belebele opens up new avenues for evaluating and analyzing the multilingual abilities of language models and NLP systems.
+Homepage: https://github.com/facebookresearch/belebele
+### Citation
+```bibtex
+@misc{bandarkar2023belebele,
+      title={The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants},
+      author={Lucas Bandarkar and Davis Liang and Benjamin Muller and Mikel Artetxe and Satya Narayan Shukla and Donald Husa and Naman Goyal and Abhinandan Krishnan and Luke Zettlemoyer and Madian Khabsa},
+      year={2023},
+      eprint={2308.16884},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL}
+}
+```
+### Groups and Tasks
+#### Groups
+- `belebele`: All 122 languages of the Belebele dataset, evaluated following the methodology in MMLU's original implementation.
+#### Tasks
+The following tasks evaluate languages in the Belebele dataset using loglikelihood-based multiple-choice scoring:
+- `belebele_{language}`
+The variant evaluated here is the 0-shot or few-shot evaluation with English Instructions.
+### Checklist
+* [x] Is the task an existing benchmark in the literature?
+  * [x] Have you referenced the original paper that introduced the task?
+  * [x] If yes, does the original paper provide a reference implementation?
+    * [ ] Yes, original implementation contributed by author of the benchmark
+If other tasks on this dataset are already supported:
+* [x] Is the "Main" variant of this task clearly denoted?
+* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
--- a/lm_eval/tasks/belebele/_default_template_yaml
+++ b/lm_eval/tasks/belebele/_default_template_yaml
+group: belebele
+dataset_path: facebook/belebele
+test_split: test
+fewshot_split: test
+fewshot_config:
+  sampler: first_n
+output_type: multiple_choice
+should_decontaminate: true
+doc_to_decontamination_query: "{{question}}"
+doc_to_text: "P: {{flores_passage}}\nQ: {{question.strip()}}\nA: {{mc_answer1}}\nB: {{mc_answer2}}\nC: {{mc_answer3}}\nD: {{mc_answer4}}\nAnswer："
+doc_to_choice: ["A", "B", "C", "D"]
+doc_to_target: "{{['1', '2', '3', '4'].index(correct_answer_num)}}"
+metric_list:
+  - metric: acc
+    aggregation: mean
+    higher_is_better: true
+  - metric: acc_norm
+    aggregation: mean
+    higher_is_better: true
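
Note on the template above: `doc_to_target` converts the dataset's 1-based string label into a 0-based index into `doc_to_choice`. The sketch below reproduces the two Jinja expressions in plain Python. The function names mirror the YAML keys, and the sample document is invented for illustration.

```python
CHOICES = ["A", "B", "C", "D"]


def doc_to_text(doc: dict) -> str:
    """Build the prompt the same way the doc_to_text template does."""
    return (
        f"P: {doc['flores_passage']}\n"
        f"Q: {doc['question'].strip()}\n"
        f"A: {doc['mc_answer1']}\n"
        f"B: {doc['mc_answer2']}\n"
        f"C: {doc['mc_answer3']}\n"
        f"D: {doc['mc_answer4']}\n"
        "Answer："  # fullwidth colon copied verbatim from the template
    )


def doc_to_target(doc: dict) -> int:
    """Map the label '1'..'4' to an index 0..3 into CHOICES."""
    return ["1", "2", "3", "4"].index(doc["correct_answer_num"])


# Invented sample document, for illustration only.
sample = {
    "flores_passage": "The river floods every spring.",
    "question": " When does the river flood? ",
    "mc_answer1": "Winter",
    "mc_answer2": "Autumn",
    "mc_answer3": "Spring",
    "mc_answer4": "Summer",
    "correct_answer_num": "3",
}
```

For `sample`, `doc_to_target` returns 2, which selects `"C"` from the choice list.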