Unverified commit d714fc95, authored by Lintang Sutawika, committed by GitHub

Faster Task and Group Loading, Allow Recursive Groups (#1321)



* add trust_remote_code as default

* task for testing recursive

* changed source of ALL_TASKS

* tasks should only accept TaskObjects

* initialize_tasks returns list of tasks and groups

* remove trust_remote_code for now

* moved constructor process to inside load_yaml_config

* more comprehensive way to index tasks and groups

* pre-commit format

* add exit after error

* adjust how task objects are called

* no need to use get_task_dict

* load_task_or_group works but only for tasks

* pre-commit format

* half working for nested groups

* changed variable names

* allow groups and tasks to work

* temp save

* indexing and loading are part of a task_manager object

* adapted initialize_tasks

* iron out bugs

* fixed typo

* fixed typo

* simplified code

* further tidy up

* remove lines for testing

* removed test lines

* removed unused code

* remove unused import

* fixed bug

* removed comments

* group in a list of group can accept parameter changes like `num_fewshot`

* check if config is task update

* add GroupConfig object

* edit test yaml

* remove args

* testing returning to python task list

* add weight_by_size config

* describe weight_by_size in docs

* fix weight by size potential error

* can load individual custom python class task

* moved import_function into the config loading file

* remove print lines

* add squadv2 yaml

* temporary scroll implementation

* revert back to use load_yaml_config but with modes

* fix group being loaded with a None

* reformat

* can load unregistered tasks from a group

* update scrolls

* edit scrolls multiplechoice task

* adjust class initialization

* fix initialization

* changed how to identify group and python tasks, fix logger

* allow loading "include" that is nested in a group config

* reworked flan benchmark

* allow duplicate task in the same group to co-exist

* process group_alias

* removed group_alias

* allow parameters set in group_config to apply to all tasks in tasklist

* add function, but comment for now

* reworked processing dict-base config

* fixed how configs in group are processed

* update to allow root group to have its alias used

* remove unused classes

* remove unused classes

* revert some parts to original

* forgot to change one variable

* adapt the new process to use get_task_dict

* fix for singular group call

* fix variable names

* add TaskManager into the evaluator

* format

* changed how dict tasks are loaded

* add docs

* Update docs/new_task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update evaluator.py

* Update evaluator.py

* remove groupconfig for now

* changed _config to config

* update interface.md to explain TaskManager

* added property functions

* adjusted logger

* update write_out.py

* updated tests

* added documentation and some modifications

* added docstring documentation

* precommit format

* updated task loading for tests

* updates tests

* changed arg order for load_yaml_config

* update to handle scrolls and edit log message

* remove unused lines

* return a list of task classes and not a dict

* Update __init__.py

* Delete lm_eval/tasks/benchmarks/test.yaml

* Update task.py

* Update lm_eval/utils.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update lm_eval/utils.py
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update utils.py

* re-added old functions with new log message

* Update docs/new_task_guide.md
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update new_task_guide.md

* added info regarding `get_task_dict` and documentation

* add get_config for Task

* pre-commit formatting

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
parent 17191063
@@ -61,14 +61,25 @@ import lm_eval

my_model = initialize_my_model() # create your model (could be running finetuning with some custom modeling code)
...

# instantiate an LM subclass that takes your initialized model and can run
# - `Your_LM.loglikelihood()`
# - `Your_LM.loglikelihood_rolling()`
# - `Your_LM.generate_until()`
lm_obj = Your_LM(model=my_model, batch_size=16)

# indexes all tasks from the `lm_eval/tasks` subdirectory.
# Alternatively, you can set `TaskManager(include_path="path/to/my/custom/task/configs")`
# to include a set of tasks in a separate directory.
task_manager = lm_eval.tasks.TaskManager()

# Setting `task_manager` to the one above is optional and should generally be done
# if you want to include tasks from paths other than ones in `lm_eval/tasks`.
# `simple_evaluate` will instantiate its own task_manager if it is set to None here.
results = lm_eval.simple_evaluate( # call simple_evaluate
    model=lm_obj,
    tasks=["taskname1", "taskname2"],
    num_fewshot=0,
    task_manager=task_manager,
    ...
)
```
@@ -84,18 +95,49 @@ As a brief example usage of `evaluate()`:

```python
import lm_eval

# suppose you've defined a custom lm_eval.api.Task subclass in your own external codebase
from my_tasks import MyTask1
...

# create your model (could be running finetuning with some custom modeling code)
my_model = initialize_my_model()
...

# instantiate an LM subclass that takes your initialized model and can run
# - `Your_LM.loglikelihood()`
# - `Your_LM.loglikelihood_rolling()`
# - `Your_LM.generate_until()`
lm_obj = Your_LM(model=my_model, batch_size=16)

# The task_manager indexes tasks, including ones
# specified by the user through `include_path`.
task_manager = lm_eval.tasks.TaskManager(
    include_path="/path/to/custom/yaml"
)

# To get a task dict for `evaluate`
task_dict = lm_eval.tasks.get_task_dict(
    [
        "mmlu",            # A stock task
        "my_custom_task",  # A custom task
        {
            "task": ...,   # A dict that configures a task
            "doc_to_text": ...,
        },
        MyTask1,           # A task object from `lm_eval.api.task.Task`
    ],
    task_manager,  # A task manager that allows lm_eval to
                   # load the task during evaluation.
                   # If none is provided, `get_task_dict`
                   # will instantiate one itself, but this
                   # only includes the stock tasks, so users
                   # will need to set this if including
                   # custom paths is required.
)

def evaluate(
    lm=lm_obj,
    task_dict=task_dict,
    ...
):
```
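For reference, the dictionary returned by `evaluate()` / `simple_evaluate()` can be pretty-printed the same way the CLI does, via `make_table` from `lm_eval.utils`. A short sketch, assuming `results` is that returned dictionary:

```python
from lm_eval.utils import make_table

# `results` is the dict returned by lm_eval.evaluate(...) or lm_eval.simple_evaluate(...)
print(make_table(results))
if "groups" in results:
    print(make_table(results, "groups"))
```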
@@ -290,17 +290,80 @@ This will add your task to the `group1` and `group2` groups, enabling people to

If your task is not in the `lm_eval/tasks` folder, you'll need to tell the Eval Harness where to look for YAML files.

You can do this via the `--include_path` argument in `__main__.py`. This argument is used to initialize the `TaskManager` object, which you can also use in your own custom scripts.

```python
task_manager = TaskManager(args.verbosity, include_path=args.include_path)
```

Passing `--tasks /path/to/yaml/file` is also accepted.
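If you are driving the harness from Python rather than the CLI, the same effect can be achieved by constructing the `TaskManager` yourself and handing it to `simple_evaluate`. A minimal sketch; the model arguments and the task name `my_custom_task` are placeholders, not tasks shipped with the harness:

```python
import lm_eval
from lm_eval.tasks import TaskManager

# Index the built-in tasks plus any YAML configs found under the custom directory.
task_manager = TaskManager(include_path="/path/to/yaml/parent/folder")

results = lm_eval.simple_evaluate(
    model="hf",  # placeholder model backend and arguments
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["my_custom_task"],  # placeholder: a task defined in the included directory
    task_manager=task_manager,
)
```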
### Advanced Group Configs
You can make a more complete group config while also tailoring parameters for individual tasks.

For example, let's build a config for evaluating MMLU and a few natural language inference tasks. For MMLU, we can simply write the name of the benchmark as a subtask under `task`. You can configure parameters such as `num_fewshot`; if the task being configured is a group such as `mmlu` or `super_glue`, the parameters will be applied to all of its subtasks.
```yaml
group: nli_and_mmlu
task:
- group: nli_tasks
task:
- cb
- anli_r1
- rte
- task: mmlu
num_fewshot: 2
```
Note that a group config can itself be inserted as a task: to make the group of natural language inference tasks here, you write it just as you would a standalone group config, but place it inside the task list of the group being built.
### Duplicate Tasks in Group Configs
There may be cases where you want to evaluate how models perform across prompt variations. You can list an existing task (in the example below, `anli_r1`) multiple times with varying `doc_to_text` implementations. To differentiate the variations, use `task_alias`; LM-Eval will recognize that these are multiple variations of the same task and keep them distinct.
```yaml
group: flan_held_in
group_alias: Flan (Held-In)
task:
# ANLI R1
- group: anli_r1_flan
group_alias: ANLI R1
task:
- task: anli_r1
task_alias: prompt-0
include: _held_in_template_yaml
doc_to_text: "{{premise}}\n\nChoose your answer ..."
...
- task: anli_r1
task_alias: prompt-1
include: _held_in_template_yaml
doc_to_text: "{{premise}}\n\nBased on ..."
...
```
### Configuring python classes
There can be occasions when YAML-based tasks cannot accommodate how a task needs to be handled. LM-Eval supports manually implementing tasks, as was done before `0.4.x`. To register such a task, simply make a YAML with the name of the task in `task` and the class object in `class`, using the `!function` prefix.
```yaml
task: squadv2
class: !function task.SQuAD2
```
This also applies to building group configurations with subtasks that are python classes.
```yaml
group: scrolls
task:
- task: scrolls_qasper
class: !function task.Qasper
- task: scrolls_quality
class: !function task.QuALITY
- task: scrolls_narrativeqa
class: !function task.NarrativeQA
...
```
## Beautifying Table Display

To avoid conflicts, each task needs to be registered with a unique name. Because of this, slight variations of a task are still counted as unique tasks and need to be named uniquely. This can be done by appending a suffix that refers to the variation, as in MMLU, where the FLAN-template variants are differentiated from the default by the prefix `mmlu_flan_*`. Printing the full task names can easily clutter the results table at the end of the evaluation, especially when you have a long list of tasks or are using a benchmark that comprises many tasks. To make it more legible, you can use `task_alias` and `group_alias` to provide alternative task and group names to be printed.
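For instance, a group config along these lines (an illustrative sketch; the group name and aliases are made up, not configs shipped with the harness) would print the aliases in the results table instead of the registered identifiers:

```yaml
group: nli_and_mmlu            # hypothetical group name for illustration
group_alias: NLI and MMLU      # printed in place of "nli_and_mmlu"
task:
  - task: anli_r1
    task_alias: ANLI Round 1   # printed in place of "anli_r1"
  - task: mmlu
```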
......
...@@ -10,8 +10,7 @@ from typing import Union ...@@ -10,8 +10,7 @@ from typing import Union
import numpy as np import numpy as np
from lm_eval import evaluator, utils from lm_eval import evaluator, utils
from lm_eval.api.registry import ALL_TASKS from lm_eval.tasks import TaskManager, include_path, initialize_tasks
from lm_eval.tasks import include_path, initialize_tasks
from lm_eval.utils import make_table from lm_eval.utils import make_table
...@@ -169,6 +168,7 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None: ...@@ -169,6 +168,7 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
assert args.output_path, "Specify --output_path" assert args.output_path, "Specify --output_path"
initialize_tasks(args.verbosity) initialize_tasks(args.verbosity)
task_manager = TaskManager(args.verbosity, include_path=args.include_path)
if args.limit: if args.limit:
eval_logger.warning( eval_logger.warning(
...@@ -180,12 +180,12 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None: ...@@ -180,12 +180,12 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
include_path(args.include_path) include_path(args.include_path)
if args.tasks is None: if args.tasks is None:
task_names = ALL_TASKS eval_logger.error("Need to specify task to evaluate.")
sys.exit()
elif args.tasks == "list": elif args.tasks == "list":
eval_logger.info( eval_logger.info(
f"Available Tasks:\n - {(os.linesep + ' - ').join(sorted(ALL_TASKS))}" "Available Tasks:\n - {}".format("\n - ".join(task_manager.all_tasks()))
) )
sys.exit()
else: else:
if os.path.isdir(args.tasks): if os.path.isdir(args.tasks):
import glob import glob
...@@ -196,16 +196,14 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None: ...@@ -196,16 +196,14 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
config = utils.load_yaml_config(yaml_file) config = utils.load_yaml_config(yaml_file)
task_names.append(config) task_names.append(config)
else: else:
tasks_list = args.tasks.split(",") task_list = args.tasks.split(",")
task_names = utils.pattern_match(tasks_list, ALL_TASKS) task_names = task_manager.match_tasks(task_list)
for task in [task for task in tasks_list if task not in task_names]: for task in [task for task in task_list if task not in task_names]:
if os.path.isfile(task): if os.path.isfile(task):
config = utils.load_yaml_config(task) config = utils.load_yaml_config(task)
task_names.append(config) task_names.append(config)
task_missing = [ task_missing = [
task task for task in task_list if task not in task_names and "*" not in task
for task in tasks_list
if task not in task_names and "*" not in task
] # we don't want errors if a wildcard ("*") task name was used ] # we don't want errors if a wildcard ("*") task name was used
if task_missing: if task_missing:
...@@ -237,6 +235,7 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None: ...@@ -237,6 +235,7 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
output_path_file = path.joinpath("results.json") output_path_file = path.joinpath("results.json")
eval_logger.info(f"Selected Tasks: {task_names}") eval_logger.info(f"Selected Tasks: {task_names}")
eval_logger.info("Loading selected tasks...")
results = evaluator.simple_evaluate( results = evaluator.simple_evaluate(
model=args.model, model=args.model,
...@@ -253,6 +252,7 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None: ...@@ -253,6 +252,7 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
write_out=args.write_out, write_out=args.write_out,
log_samples=args.log_samples, log_samples=args.log_samples,
gen_kwargs=args.gen_kwargs, gen_kwargs=args.gen_kwargs,
task_manager=task_manager,
predict_only=args.predict_only, predict_only=args.predict_only,
) )
......
...@@ -85,7 +85,6 @@ class TaskConfig(dict): ...@@ -85,7 +85,6 @@ class TaskConfig(dict):
filter_list: Union[str, list] = None filter_list: Union[str, list] = None
should_decontaminate: bool = False should_decontaminate: bool = False
doc_to_decontamination_query: str = None doc_to_decontamination_query: str = None
metadata: dict = None # by default, not used in the code. allows for users to pass arbitrary info to tasks metadata: dict = None # by default, not used in the code. allows for users to pass arbitrary info to tasks
def __post_init__(self) -> None: def __post_init__(self) -> None:
...@@ -306,7 +305,7 @@ class Task(abc.ABC): ...@@ -306,7 +305,7 @@ class Task(abc.ABC):
return self.validation_docs() return self.validation_docs()
else: else:
eval_logger.warning( eval_logger.warning(
"has_training_docs and has_validation_docs are False" f"[Task: {self.config.task}] has_training_docs and has_validation_docs are False"
", using test_docs as fewshot_docs but this is not recommended." ", using test_docs as fewshot_docs but this is not recommended."
) )
return self.test_docs() return self.test_docs()
...@@ -437,6 +436,9 @@ class Task(abc.ABC): ...@@ -437,6 +436,9 @@ class Task(abc.ABC):
""" """
pass pass
def get_config(self, key: str) -> Any:
return getattr(self._config, key, None)
@classmethod @classmethod
def count_bytes(cls, doc): def count_bytes(cls, doc):
"""Used for byte-level perplexity metrics in rolling loglikelihood""" """Used for byte-level perplexity metrics in rolling loglikelihood"""
...@@ -622,7 +624,7 @@ class ConfigurableTask(Task): ...@@ -622,7 +624,7 @@ class ConfigurableTask(Task):
INV_AGG_REGISTRY = {v: k for k, v in AGGREGATION_REGISTRY.items()} INV_AGG_REGISTRY = {v: k for k, v in AGGREGATION_REGISTRY.items()}
metric_agg = get_metric_aggregation(metric_name) metric_agg = get_metric_aggregation(metric_name)
eval_logger.warning( eval_logger.warning(
f"[Task: {self._config.task}] metric {metric_name} is defined, but aggregation is not. " f"[Task: {self.config.task}] metric {metric_name} is defined, but aggregation is not. "
f"using default " f"using default "
f"aggregation={INV_AGG_REGISTRY[metric_agg]}" f"aggregation={INV_AGG_REGISTRY[metric_agg]}"
) )
...@@ -634,7 +636,7 @@ class ConfigurableTask(Task): ...@@ -634,7 +636,7 @@ class ConfigurableTask(Task):
] ]
else: else:
eval_logger.warning( eval_logger.warning(
f"[Task: {self._config.task}] metric {metric_name} is defined, but higher_is_better is not. " f"[Task: {self.config.task}] metric {metric_name} is defined, but higher_is_better is not. "
f"using default " f"using default "
f"higher_is_better={is_higher_better(metric_name)}" f"higher_is_better={is_higher_better(metric_name)}"
) )
......
...@@ -4,20 +4,24 @@ import collections ...@@ -4,20 +4,24 @@ import collections
import torch import torch
import logging
import numpy as np import numpy as np
import lm_eval.api import lm_eval.api
import lm_eval.tasks
import lm_eval.models import lm_eval.models
import lm_eval.api.metrics import lm_eval.api.metrics
import lm_eval.api.registry import lm_eval.api.registry
from lm_eval.tasks import (
get_task_dict,
TaskManager
)
from lm_eval.utils import ( from lm_eval.utils import (
positional_deprecated, positional_deprecated,
run_task_tests, run_task_tests,
get_git_commit_hash, get_git_commit_hash,
simple_parse_args_string, simple_parse_args_string,
eval_logger, eval_logger
) )
...@@ -38,6 +42,8 @@ def simple_evaluate( ...@@ -38,6 +42,8 @@ def simple_evaluate(
write_out: bool = False, write_out: bool = False,
log_samples: bool = True, log_samples: bool = True,
gen_kwargs: str = None, gen_kwargs: str = None,
task_manager: TaskManager = None,
verbosity: str = "INFO",
predict_only: bool = False, predict_only: bool = False,
): ):
"""Instantiate and evaluate a model on a list of tasks. """Instantiate and evaluate a model on a list of tasks.
...@@ -47,7 +53,7 @@ def simple_evaluate( ...@@ -47,7 +53,7 @@ def simple_evaluate(
:param model_args: Optional[str] :param model_args: Optional[str]
String arguments for each model class, see LM.create_from_arg_string. String arguments for each model class, see LM.create_from_arg_string.
Ignored if `model` argument is a LM object. Ignored if `model` argument is a LM object.
:param tasks: list[Union[str, Task]] :param tasks: list[Union[str, dict, Task]]
List of task names or Task objects. Task objects will be taken to have name task.EVAL_HARNESS_NAME if defined and type(task).__name__ otherwise. List of task names or Task objects. Task objects will be taken to have name task.EVAL_HARNESS_NAME if defined and type(task).__name__ otherwise.
:param num_fewshot: int :param num_fewshot: int
Number of examples in few-shot context Number of examples in few-shot context
...@@ -84,6 +90,8 @@ def simple_evaluate( ...@@ -84,6 +90,8 @@ def simple_evaluate(
1234 1234
) # TODO: this may affect training runs that are run with evaluation mid-run. ) # TODO: this may affect training runs that are run with evaluation mid-run.
eval_logger.setLevel(getattr(logging, f"{verbosity}"))
if tasks is None: if tasks is None:
tasks = [] tasks = []
assert ( assert (
...@@ -125,11 +133,18 @@ def simple_evaluate( ...@@ -125,11 +133,18 @@ def simple_evaluate(
+ ".db", + ".db",
) )
task_dict = lm_eval.tasks.get_task_dict(tasks) if task_manager is None:
task_manager = TaskManager(verbosity)
eval_logger.info(
"get_task_dict has been updated to accept an optional argument, `task_manager`"
"Read more here: https://github.com/EleutherAI/lm-evaluation-harness/blob/recursive-groups/docs/interface.md#external-library-usage"
)
task_dict = get_task_dict(tasks, task_manager)
for task_name in task_dict.keys(): for task_name in task_dict.keys():
task_obj = task_dict[task_name] task_obj = task_dict[task_name]
if isinstance(task_obj, tuple): if isinstance(task_obj, tuple):
group, task_obj = task_obj _, task_obj = task_obj
if task_obj is None: if task_obj is None:
continue continue
...@@ -169,6 +184,7 @@ def simple_evaluate( ...@@ -169,6 +184,7 @@ def simple_evaluate(
decontamination_ngrams_path=decontamination_ngrams_path, decontamination_ngrams_path=decontamination_ngrams_path,
write_out=write_out, write_out=write_out,
log_samples=log_samples, log_samples=log_samples,
verbosity=verbosity,
) )
if lm.rank == 0: if lm.rank == 0:
...@@ -211,6 +227,7 @@ def evaluate( ...@@ -211,6 +227,7 @@ def evaluate(
decontamination_ngrams_path=None, decontamination_ngrams_path=None,
write_out: bool = False, write_out: bool = False,
log_samples: bool = True, log_samples: bool = True,
verbosity: str = "INFO",
): ):
"""Instantiate and evaluate a model on a list of tasks. """Instantiate and evaluate a model on a list of tasks.
...@@ -230,6 +247,7 @@ def evaluate( ...@@ -230,6 +247,7 @@ def evaluate(
Dictionary of results Dictionary of results
""" """
eval_logger.setLevel(getattr(logging, f"{verbosity}"))
# decontaminate = decontamination_ngrams_path is not None # decontaminate = decontamination_ngrams_path is not None
for task_name, task in task_dict.items(): for task_name, task in task_dict.items():
...@@ -511,11 +529,6 @@ def evaluate( ...@@ -511,11 +529,6 @@ def evaluate(
metrics.pop("alias") metrics.pop("alias")
current_size = metrics.pop("samples") current_size = metrics.pop("samples")
# TODO: There should be a way for users
# to toggle between weighted and
# unweighted averaging
# For unweighted averaging, use:
# current_size = 1
all_stderr = [] all_stderr = []
for metric in [ for metric in [
......
import os import os
import yaml import abc
import collections
from functools import partial
from typing import List, Union, Dict from typing import List, Union, Dict
from lm_eval import utils from lm_eval import utils
from lm_eval import prompts from lm_eval.api.task import Task, ConfigurableTask
from lm_eval.api.task import TaskConfig, Task, ConfigurableTask
from lm_eval.api.registry import (
register_task,
register_group,
TASK_REGISTRY,
GROUP_REGISTRY,
ALL_TASKS,
)
import logging import logging
# import python tasks
from .squadv2.task import SQuAD2
from .scrolls.task import (
QuALITY,
NarrativeQA,
ContractNLI,
GovReport,
SummScreenFD,
QMSum,
)
eval_logger = utils.eval_logger
def register_configurable_task(config: Dict[str, str]) -> int:
SubClass = type(
config["task"] + "ConfigurableTask",
(ConfigurableTask,),
{"CONFIG": TaskConfig(**config)},
)
if "task" in config: class TaskManager:
task_name = "{}".format(config["task"]) """TaskManager indexes all tasks from the default `lm_eval/tasks/`
register_task(task_name)(SubClass) and an optional directory if provided.
if "group" in config: """
if config["group"] == config["task"]: def __init__(
raise ValueError("task and group name cannot be the same") self,
elif isinstance(config["group"], str): verbosity="INFO",
group_name = [config["group"]] include_path=None
else: ) -> None:
group_name = config["group"]
self.verbosity = verbosity
self.include_path = include_path
self.logger = utils.eval_logger
self.logger.setLevel(getattr(logging, f"{verbosity}"))
self._task_index = self.initialize_tasks(
include_path=include_path
)
self._all_tasks = sorted(list(self._task_index.keys()))
for group in group_name: self.task_group_map = collections.defaultdict(list)
register_group(group)(SubClass)
return 0 def initialize_tasks(self, include_path: str = None):
"""Creates an dictionary of tasks index.
:param include_path: str = None
An additional path to be searched for tasks
:return
Dictionary of task names as key and task metadata
"""
all_paths = [os.path.dirname(os.path.abspath(__file__)) + "/"]
if include_path is not None:
if isinstance(include_path, str):
include_path = [include_path]
all_paths.extend(include_path)
task_index = {}
for task_dir in all_paths:
tasks = self._get_task_and_group(task_dir)
task_index = {**tasks, **task_index}
def register_configurable_group(config: Dict[str, str], yaml_path: str = None) -> int: return task_index
group = config["group"]
all_task_list = config["task"] @property
config_list = [task for task in all_task_list if not isinstance(task, str)] def all_tasks(self):
task_list = [task for task in all_task_list if isinstance(task, str)] return self._all_tasks
for task_config in config_list: @property
def task_index(self):
base_config = {} return self._task_index
task_name_config = {}
if "task" in task_config: def match_tasks(self, task_list):
task_name = task_config["task"] return utils.pattern_match(
if task_name in ALL_TASKS: task_list, self.all_tasks
task_obj = TASK_REGISTRY[task_name]
if isinstance(task_obj, tuple):
_, task_obj = task_obj
if task_obj is not None:
base_config = task_obj.CONFIG.to_dict(keep_callable=True)
task_name_config["task"] = f"{group}_{task_name}"
task_config = utils.load_yaml_config(yaml_path, task_config)
var_configs = check_prompt_config(
{
**base_config,
**task_config,
**{"group": group},
**task_name_config,
},
yaml_path=os.path.dirname(yaml_path),
) )
for config in var_configs:
register_configurable_task(config) def _name_is_registered(self, name):
if name in self.all_tasks:
task_names = utils.pattern_match(task_list, ALL_TASKS) return True
for task in task_names: return False
if (task in TASK_REGISTRY) or (task in GROUP_REGISTRY):
if group in GROUP_REGISTRY: def _name_is_task(self, name):
GROUP_REGISTRY[group].append(task) if self._name_is_registered(name) and ("task" in self.task_index[name]["type"]):
return True
return False
def _name_is_group(self, name):
if self._name_is_registered(name) and (self.task_index[name]["type"] == "group"):
return True
return False
def _name_is_python_task(self, name):
if self._name_is_registered(name) and (self.task_index[name]["type"] == "python_task"):
return True
return False
def _config_is_task(self, config):
if ("task" in config) and isinstance(config["task"], str):
return True
return False
def _config_is_group(self, config):
if ("task" in config) and isinstance(config["task"], list):
return True
return False
def _config_is_python_task(self, config):
if "class" in config:
return True
return False
def _get_yaml_path(self, name):
assert name in self.task_index
return self.task_index[name]["yaml_path"]
def _get_config(self, name):
assert name in self.task_index
yaml_path = self._get_yaml_path(name)
if yaml_path == -1:
return {}
else:
return utils.load_yaml_config(yaml_path, mode="full")
def _get_tasklist(self, name):
assert self._name_is_task(name) == False
return self.task_index[name]["task"]
def _process_alias(self, config, group=None):
# If the group is not the same as the original
# group which the group alias was intended for,
# Set the group_alias to None instead.
if ("group_alias" in config) and ("group" in config) and group is not None:
if config["group"] != group:
config["group_alias"] = None
return config
def _load_individual_task_or_group(
self,
name_or_config: Union[str, dict] = None,
parent_name: str = None,
update_config: dict = None,
yaml_path: str = None,
) -> ConfigurableTask:
def load_task(config, task, group=None, yaml_path=None):
if "include" in config:
assert yaml_path is not None
config.update(
utils.load_yaml_config(
yaml_path,
yaml_config={"include": config.pop("include")},
mode="full",
)
)
if self._config_is_python_task(config):
task_object = config["class"]()
else:
config = self._process_alias(config, group=group)
task_object = ConfigurableTask(config=config)
if group is not None:
task_object = (group, task_object)
return {task: task_object}
if isinstance(name_or_config, str):
if update_config is not None:
# Process name_or_config as a dict instead
name_or_config = {"task": name_or_config, **update_config}
elif self._name_is_task(name_or_config):
task_config = self._get_config(name_or_config)
return load_task(task_config, task=name_or_config, group=parent_name)
else: else:
GROUP_REGISTRY[group] = [task] group_name = name_or_config
ALL_TASKS.add(group) subtask_list = self._get_tasklist(name_or_config)
if subtask_list == -1:
group_config = self._get_config(name_or_config)
subtask_list = group_config["task"]
# This checks if we're at the root.
if parent_name is None:
group_config = self._get_config(name_or_config)
if set(group_config.keys()) > set(["task", "group"]):
update_config = {
k:v for k,v in group_config.items() if k not in ["task", "group"]
}
yaml_path = self._get_yaml_path(group_name)
return 0 if (update_config is not None) and ("group_alias" in update_config):
group_name = update_config["group_alias"]
update_config.pop("group_alias")
if isinstance(name_or_config, dict):
def check_prompt_config( if update_config is not None:
config: Dict[str, str], yaml_path: str = None name_or_config={
) -> List[Dict[str, str]]: **name_or_config,
all_configs = [] **update_config,
if "use_prompt" in config:
prompt_list = prompts.load_prompt_list(
use_prompt=config["use_prompt"],
dataset_name=config["dataset_path"],
subset_name=config["dataset_name"] if "dataset_name" in config else None,
yaml_path=yaml_path,
)
for idx, prompt_variation in enumerate(prompt_list):
all_configs.append(
{
**config,
**{"use_prompt": prompt_variation},
**{
"task": "_".join(
[
config["task"]
if "task" in config
else get_task_name_from_config(config),
prompt_variation.split("/")[-1]
if ".yaml" in prompt_variation
else prompt_variation,
]
)
},
**{"output_type": "generate_until"},
} }
)
else:
all_configs.append(config)
return all_configs
def get_task_name_from_config(task_config: Dict[str, str]) -> str: if self._config_is_task(name_or_config):
if "dataset_name" in task_config: name = name_or_config["task"]
return "{dataset_path}_{dataset_name}".format(**task_config) # If the name is registered as a group
else: # if self._name_is_task(name) is False:
return "{dataset_path}".format(**task_config) if self._name_is_group(name):
group_name = name
update_config = {k:v for k,v in name_or_config.items() if k != "task"}
subtask_list = self._get_tasklist(name)
if subtask_list == -1:
subtask_list = self._get_config(name)["task"]
else:
if self._name_is_registered(name):
base_task_config = self._get_config(name)
# Check if this is a duplicate.
if parent_name is not None:
name_or_config["group"] = parent_name
num_duplicate = len(list(filter(lambda x: x.startswith(name), self.task_group_map[parent_name])))
if num_duplicate > 0:
name = f"{name}-{num_duplicate}"
self.task_group_map[parent_name].append(name)
task_config={
**base_task_config,
**name_or_config,
}
else:
task_config = name_or_config
return load_task(task_config, task=name, group=parent_name, yaml_path=yaml_path)
else:
group_name = name_or_config["group"]
subtask_list = name_or_config["task"]
# update_config = {k:v for k,v in name_or_config.items() if k != "task"}
if set(name_or_config.keys()) > set(["task", "group"]):
update_config = {
k:v for k,v in name_or_config.items() if k not in ["task", "group"]
}
all_subtasks = {}
if (parent_name is not None):
all_subtasks = {group_name: (parent_name, None)}
def include_task_folder(task_dir: str, register_task: bool = True) -> None: fn = partial(self._load_individual_task_or_group, parent_name=group_name, update_config=update_config, yaml_path=yaml_path)
""" all_subtasks = {**all_subtasks, **dict(collections.ChainMap(*map(fn, subtask_list)))}
Calling this function return all_subtasks
"""
# Track whether any tasks failed during loading
import_fail = False
for root, subdirs, file_list in os.walk(task_dir):
# if (subdirs == [] or subdirs == ["__pycache__"]) and (len(file_list) > 0):
for f in file_list:
if f.endswith(".yaml"):
yaml_path = os.path.join(root, f)
try:
config = utils.load_yaml_config(yaml_path)
if "task" not in config:
continue
all_configs = check_prompt_config(
config, yaml_path=os.path.dirname(yaml_path)
)
for config in all_configs:
if register_task:
if isinstance(config["task"], str):
register_configurable_task(config)
else:
if isinstance(config["task"], list):
register_configurable_group(config, yaml_path)
# Log this silently and show it only when
# the user defines the appropriate verbosity.
except (ImportError, ModuleNotFoundError) as e:
import_fail = True
eval_logger.debug(
f"{yaml_path}: {e}. Config will not be added to registry."
)
except Exception as error:
import traceback
eval_logger.warning(
"Unexpected error loading config in\n"
f" {yaml_path}\n"
" Config will not be added to registry\n"
f" Error: {error}\n"
f" Traceback: {traceback.format_exc()}"
)
if import_fail: def load_task_or_group(self, task_list: Union[str, list] = None) -> dict:
eval_logger.warning( """Loads a dictionary of task objects from a list
"Some tasks could not be loaded due to missing dependencies."
" Run with `--verbosity DEBUG` for full details."
)
return 0
:param task_list: Union[str, list] = None
Single string or list of string of task names to be loaded
def include_path(task_dir): :return
include_task_folder(task_dir) Dictionary of task objects
# Register Benchmarks after all tasks have been added """
include_task_folder(task_dir, register_task=False) if isinstance(task_list, str):
return 0 task_list = [task_list]
all_loaded_tasks = dict(
collections.ChainMap(
*map(
self._load_individual_task_or_group,
task_list
)
)
)
return all_loaded_tasks
def load_config(self, config: Dict):
return self._load_individual_task_or_group(config)
def _get_task_and_group(self, task_dir: str):
"""Creates an dictionary of tasks index with the following metadata,
- `type`, that can be either `task`, `python_task`, or `group`.
`task` refer to regular task configs, `python_task` are special
yaml files that only consists of `task` and `class` parameters.
`group` are group configs.
- `yaml_path`, path to the yaml file. If the entry is a `group` that
was configured through a task config, the yaml_path will be -1
and all subtasks will be listed in `task` (see below)
- `task`, reserved for entries with `type` as `group`. This will list
all subtasks. When a group config is created (as opposed to task
config having `group` parameter set), this will be set to -1 to
avoid recursive indexing. The whole list of subtasks will be loaded
at evaluation.
:param task_dir: str
A directory to check for tasks
:return
Dictionary of task names as key and task metadata
"""
tasks_and_groups = collections.defaultdict()
for root, _, file_list in os.walk(task_dir):
for f in file_list:
if f.endswith(".yaml"):
yaml_path = os.path.join(root, f)
config = utils.load_yaml_config(yaml_path, mode="simple")
if self._config_is_python_task(config):
# This is a python class config
tasks_and_groups[config["task"]] = {
"type": "python_task",
"yaml_path": yaml_path,
}
elif self._config_is_group(config):
# This is a group config
tasks_and_groups[config["group"]] = {
"type": "group",
"task": -1, # This signals that
# we don't need to know
# the task list for indexing
# as it can be loaded
# when called.
"yaml_path": yaml_path,
}
def initialize_tasks(verbosity="INFO"): # # Registered the level 1 tasks from a group config
eval_logger.setLevel(getattr(logging, f"{verbosity}")) # for config in config["task"]:
# if isinstance(config, dict) and self._config_is_task(config):
# task = config["task"]
# tasks_and_groups[task] = {
# "type": "task",
# "yaml_path": yaml_path,
# }
elif self._config_is_task(config):
# This is a task config
task = config["task"]
tasks_and_groups[task] = {
"type": "task",
"yaml_path": yaml_path,
}
task_dir = os.path.dirname(os.path.abspath(__file__)) + "/" if "group" in config:
include_path(task_dir) groups = config["group"]
if isinstance(config["group"], str):
groups = [groups]
for group in groups:
if group not in tasks_and_groups:
tasks_and_groups[group] = {
"type": "group",
"task": [task],
"yaml_path": -1,
}
else:
tasks_and_groups[group]["task"].append(task)
else:
self.logger.debug(f"File {f} in {root} could not be loaded")
return tasks_and_groups
def include_path(task_dir):
logger = utils.eval_logger
logger.setLevel(getattr(logging, "INFO"))
logger.info(
"To still use tasks loaded from args.include_path,"
"see an example of the new TaskManager API in https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md#external-library-usage"
)
return 0
def get_task(task_name, config): def initialize_tasks(verbosity="INFO"):
try: logger = utils.eval_logger
return TASK_REGISTRY[task_name](config=config) logger.setLevel(getattr(logging, f"{verbosity}"))
except KeyError: logger.info(
eval_logger.info("Available tasks:") "lm_eval.tasks.initialize_tasks() is deprecated and no longer necessary. "
eval_logger.info(list(TASK_REGISTRY) + list(GROUP_REGISTRY)) "It will be removed in v0.4.2 release. "
raise KeyError(f"Missing task {task_name}") "TaskManager will instead be used."
)
return 0
def get_task_name_from_config(task_config: Dict[str, str]) -> str:
if "task" in task_config:
return task_config["task"]
if "dataset_name" in task_config:
return "{dataset_path}_{dataset_name}".format(**task_config)
else:
return "{dataset_path}".format(**task_config)
def get_task_name_from_object(task_object): def get_task_name_from_object(task_object):
for name, class_ in TASK_REGISTRY.items(): if hasattr(task_object, "config"):
if class_ is task_object: return task_object._config["task"]
return name
# TODO: scrap this # TODO: scrap this
# this gives a mechanism for non-registered tasks to have a custom name anyways when reporting # this gives a mechanism for non-registered tasks to have a custom name anyways when reporting
...@@ -234,54 +382,40 @@ def get_task_name_from_object(task_object): ...@@ -234,54 +382,40 @@ def get_task_name_from_object(task_object):
else type(task_object).__name__ else type(task_object).__name__
) )
def get_task_dict(task_name_list: List[Union[str, Dict, Task]], task_manager: TaskManager = None):
"""Creates a dictionary of task objects from either a name of task, config, or prepared Task object.
# TODO: pass num_fewshot and other cmdline overrides in a better way :param task_name_list: List[Union[str, Dict, Task]]
def get_task_dict(task_name_list: List[Union[str, Dict, Task]], **kwargs): List of task names, task config dicts, or Task objects to be loaded
config = {**kwargs} :param task_manager: TaskManager = None
A TaskManager object that stores indexed tasks. If not set,
task_manager will load one. This should be set by the user
if there are additional paths that want to be included
via `include_path`
task_name_from_registry_dict = {} :return
Dictionary of task objects
"""
task_name_from_string_dict = {}
task_name_from_config_dict = {} task_name_from_config_dict = {}
task_name_from_object_dict = {} task_name_from_object_dict = {}
if not isinstance(task_name_list, list): if isinstance(task_name_list, str):
task_name_list = [task_name_list] task_name_list = [task_name_list]
for task_element in task_name_list: string_task_name_list = [task for task in task_name_list if isinstance(task, str)]
if isinstance(task_element, str): others_task_name_list = [task for task in task_name_list if not isinstance(task, str)]
if task_element in GROUP_REGISTRY: if len(string_task_name_list) > 0:
group_name = task_element if task_manager is None:
for task_name in GROUP_REGISTRY[task_element]: task_manager = TaskManager()
if task_name not in task_name_from_registry_dict:
task_obj = get_task_dict(task_name)
if task_name in task_obj.keys():
task_dict = {
task_name: (group_name, task_obj[task_name]),
}
else:
task_dict = {
task_name: (group_name, None),
**task_obj,
}
task_name_from_registry_dict = { task_name_from_string_dict = task_manager.load_task_or_group(string_task_name_list)
**task_name_from_registry_dict,
**task_dict,
}
else:
task_name = task_element
if task_name not in task_name_from_registry_dict:
task_name_from_registry_dict = {
**task_name_from_registry_dict,
task_name: get_task(task_name=task_element, config=config),
}
elif isinstance(task_element, dict): for task_element in others_task_name_list:
task_element.update(config) if isinstance(task_element, dict):
task_name_from_config_dict = { task_name_from_config_dict = {
**task_name_from_config_dict, **task_name_from_config_dict,
get_task_name_from_config(task_element): ConfigurableTask( **task_manager.load_config(config=task_element),
config=task_element
),
} }
elif isinstance(task_element, Task): elif isinstance(task_element, Task):
...@@ -290,11 +424,11 @@ def get_task_dict(task_name_list: List[Union[str, Dict, Task]], **kwargs): ...@@ -290,11 +424,11 @@ def get_task_dict(task_name_list: List[Union[str, Dict, Task]], **kwargs):
get_task_name_from_object(task_element): task_element, get_task_name_from_object(task_element): task_element,
} }
assert set(task_name_from_registry_dict.keys()).isdisjoint( assert set(task_name_from_string_dict.keys()).isdisjoint(
set(task_name_from_object_dict.keys()) set(task_name_from_object_dict.keys())
) )
return { return {
**task_name_from_registry_dict, **task_name_from_string_dict,
**task_name_from_config_dict, **task_name_from_config_dict,
**task_name_from_object_dict, **task_name_from_object_dict,
} }
output_type: generate_until output_type: generate_until
validation_split: validation test_split: null
doc_to_choice: null
metric_list: metric_list:
- metric: exact_match - metric: exact_match
aggregation: mean aggregation: mean
......
group: flan_anli
task:
- include: yaml_templates/held_in_template_yaml
task: anli_r1
dataset_path: anli
use_prompt: prompt_templates/anli.yaml:*
validation_split: dev_r1
- include: yaml_templates/held_in_template_yaml
task: anli_r2
dataset_path: anli
use_prompt: prompt_templates/anli.yaml:*
validation_split: dev_r2
- include: yaml_templates/held_in_template_yaml
task: anli_r3
dataset_path: anli
use_prompt: prompt_templates/anli.yaml:*
validation_split: dev_r3
group: flan_arc
task:
- include: yaml_templates/held_in_template_yaml
task: arc_easy
dataset_path: ai2_arc
dataset_name: ARC-Easy
use_prompt: prompt_templates/arc.yaml:*
validation_split: validation
- include: yaml_templates/held_in_template_yaml
task: arc_challenge
dataset_path: ai2_arc
dataset_name: ARC-Challenge
use_prompt: prompt_templates/arc.yaml:*
validation_split: validation
group: flan_boolq
task:
- include: yaml_templates/held_in_template_yaml
dataset_path: super_glue
dataset_name: boolq
use_prompt: prompt_templates/boolq.yaml:*
validation_split: validation
group: flan_cot
task:
- include: yaml_templates/cot_template_yaml
dataset_path: gsm8k
use_prompt: promptsource:*
validation_split: validation
- include: yaml_templates/cot_template_yaml
dataset_path: EleutherAI/asdiv
use_prompt: promptsource:*
validation_split: validation
group: flan_held_in group: flan_held_in
group_alias: Flan (Held-In)
task: task:
- flan_boolq # ANLI R1
- flan_rte - group: anli_r1_flan
- flan_anli group_alias: ANLI R1
- flan_arc task:
- task: anli_r1
task_alias: prompt-0
include: _held_in_template_yaml
doc_to_text: "{{premise}}\n\nChoose your answer: based on the paragraph above can we conclude that \"{{hypothesis}}\"?\n\nOPTIONS:\n- Yes\n- It's impossible to say\n- No\nI think the answer is"
doc_to_target: "{{[\"Yes\", \"It's impossible to say\", \"No\"][label]}}"
- task: anli_r1
task_alias: prompt-1
include: _held_in_template_yaml
doc_to_text: "{{premise}}\n\nBased on that paragraph can we conclude that this sentence is true?\n{{hypothesis}}\n\nOPTIONS:\n- Yes\n- It's impossible to say\n- No"
doc_to_target: "{{[\"Yes\", \"It's impossible to say\", \"No\"][label]}}"
- task: anli_r1
task_alias: prompt-2
include: _held_in_template_yaml
doc_to_text: "{{premise}}\n\nCan we draw the following conclusion?\n{{hypothesis}}\n\nOPTIONS:\n- Yes\n- It's impossible to say\n- No"
doc_to_target: "{{[\"Yes\", \"It's impossible to say\", \"No\"][label]}}"
- task: anli_r1
task_alias: prompt-3
include: _held_in_template_yaml
doc_to_text: "{{premise}}\nDoes this next sentence follow, given the preceding text?\n{{hypothesis}}\n\nOPTIONS:\n- Yes\n- It's impossible to say\n- No"
doc_to_target: "{{[\"Yes\", \"It's impossible to say\", \"No\"][label]}}"
- task: anli_r1
task_alias: prompt-4
include: _held_in_template_yaml
doc_to_text: "{{premise}}\nCan we infer the following?\n{{hypothesis}}\n\nOPTIONS:\n- Yes\n- It's impossible to say\n- No\nThe answer is:"
doc_to_target: "{{[\"Yes\", \"It's impossible to say\", \"No\"][label]}}"
- task: anli_r1
task_alias: prompt-5
include: _held_in_template_yaml
doc_to_text: "Read the following paragraph and determine if the hypothesis is true:\n\n{{premise}}\n\nOPTIONS:\n- Yes\n- It's impossible to say\n- No\nHypothesis: {{hypothesis}}\n\n\n"
doc_to_target: "{{[\"Yes\", \"It's impossible to say\", \"No\"][label]}}"
- task: anli_r1
task_alias: prompt-6
include: _held_in_template_yaml
doc_to_text: "Read the text and determine if the sentence is true (see options at the end):\n\n{{premise}}\n\nSentence: {{hypothesis}}\nOPTIONS:\n- Yes\n- It's impossible to say\n- No"
doc_to_target: "{{[\"Yes\", \"It's impossible to say\", \"No\"][label]}}"
- task: anli_r1
task_alias: prompt-7
include: _held_in_template_yaml
doc_to_text: "Can we draw the following hypothesis from the context (see options)? \n\nContext:\n\n{{premise}}\n\nHypothesis: {{hypothesis}}\nOPTIONS:\n- Yes\n- It's impossible to say\n- No"
doc_to_target: "{{[\"Yes\", \"It's impossible to say\", \"No\"][label]}}"
- task: anli_r1
task_alias: prompt-8
include: _held_in_template_yaml
doc_to_text: "Choose from options: Determine if the sentence is true based on the text below:\n{{hypothesis}}\n\n{{premise}}\nOPTIONS:\n- Yes\n- It's impossible to say\n- No"
doc_to_target: "{{[\"Yes\", \"It's impossible to say\", \"No\"][label]}}"
# ANLI R2
- group: anli_r2_flan
group_alias: ANLI R2
task:
- task: anli_r2
task_alias: prompt-0
include: _held_in_template_yaml
doc_to_text: "{{premise}}\n\nChoose your answer: based on the paragraph above can we conclude that \"{{hypothesis}}\"?\n\nOPTIONS:\n- Yes\n- It's impossible to say\n- No\nI think the answer is"
doc_to_target: "{{[\"Yes\", \"It's impossible to say\", \"No\"][label]}}"
- task: anli_r2
task_alias: prompt-1
include: _held_in_template_yaml
doc_to_text: "{{premise}}\n\nBased on that paragraph can we conclude that this sentence is true?\n{{hypothesis}}\n\nOPTIONS:\n- Yes\n- It's impossible to say\n- No"
doc_to_target: "{{[\"Yes\", \"It's impossible to say\", \"No\"][label]}}"
- task: anli_r2
task_alias: prompt-2
include: _held_in_template_yaml
doc_to_text: "{{premise}}\n\nCan we draw the following conclusion?\n{{hypothesis}}\n\nOPTIONS:\n- Yes\n- It's impossible to say\n- No"
doc_to_target: "{{[\"Yes\", \"It's impossible to say\", \"No\"][label]}}"
- task: anli_r2
task_alias: prompt-3
include: _held_in_template_yaml
doc_to_text: "{{premise}}\nDoes this next sentence follow, given the preceding text?\n{{hypothesis}}\n\nOPTIONS:\n- Yes\n- It's impossible to say\n- No"
doc_to_target: "{{[\"Yes\", \"It's impossible to say\", \"No\"][label]}}"
- task: anli_r2
task_alias: prompt-4
include: _held_in_template_yaml
doc_to_text: "{{premise}}\nCan we infer the following?\n{{hypothesis}}\n\nOPTIONS:\n- Yes\n- It's impossible to say\n- No\nThe answer is:"
doc_to_target: "{{[\"Yes\", \"It's impossible to say\", \"No\"][label]}}"
- task: anli_r2
task_alias: prompt-5
include: _held_in_template_yaml
doc_to_text: "Read the following paragraph and determine if the hypothesis is true:\n\n{{premise}}\n\nOPTIONS:\n- Yes\n- It's impossible to say\n- No\nHypothesis: {{hypothesis}}\n\n\n"
doc_to_target: "{{[\"Yes\", \"It's impossible to say\", \"No\"][label]}}"
- task: anli_r2
task_alias: prompt-6
include: _held_in_template_yaml
doc_to_text: "Read the text and determine if the sentence is true (see options at the end):\n\n{{premise}}\n\nSentence: {{hypothesis}}\nOPTIONS:\n- Yes\n- It's impossible to say\n- No"
doc_to_target: "{{[\"Yes\", \"It's impossible to say\", \"No\"][label]}}"
- task: anli_r2
task_alias: prompt-7
include: _held_in_template_yaml
doc_to_text: "Can we draw the following hypothesis from the context (see options)? \n\nContext:\n\n{{premise}}\n\nHypothesis: {{hypothesis}}\nOPTIONS:\n- Yes\n- It's impossible to say\n- No"
doc_to_target: "{{[\"Yes\", \"It's impossible to say\", \"No\"][label]}}"
- task: anli_r2
task_alias: prompt-8
include: _held_in_template_yaml
doc_to_text: "Choose from options: Determine if the sentence is true based on the text below:\n{{hypothesis}}\n\n{{premise}}\nOPTIONS:\n- Yes\n- It's impossible to say\n- No"
doc_to_target: "{{[\"Yes\", \"It's impossible to say\", \"No\"][label]}}"
# ANLI R3
- group: anli_r3_flan
group_alias: ANLI R3
task:
- task: anli_r3
task_alias: prompt-0
include: _held_in_template_yaml
doc_to_text: "{{premise}}\n\nChoose your answer: based on the paragraph above can we conclude that \"{{hypothesis}}\"?\n\nOPTIONS:\n- Yes\n- It's impossible to say\n- No\nI think the answer is"
doc_to_target: "{{[\"Yes\", \"It's impossible to say\", \"No\"][label]}}"
- task: anli_r3
task_alias: prompt-1
include: _held_in_template_yaml
doc_to_text: "{{premise}}\n\nBased on that paragraph can we conclude that this sentence is true?\n{{hypothesis}}\n\nOPTIONS:\n- Yes\n- It's impossible to say\n- No"
doc_to_target: "{{[\"Yes\", \"It's impossible to say\", \"No\"][label]}}"
- task: anli_r3
task_alias: prompt-2
include: _held_in_template_yaml
doc_to_text: "{{premise}}\n\nCan we draw the following conclusion?\n{{hypothesis}}\n\nOPTIONS:\n- Yes\n- It's impossible to say\n- No"
doc_to_target: "{{[\"Yes\", \"It's impossible to say\", \"No\"][label]}}"
- task: anli_r3
task_alias: prompt-3
include: _held_in_template_yaml
doc_to_text: "{{premise}}\nDoes this next sentence follow, given the preceding text?\n{{hypothesis}}\n\nOPTIONS:\n- Yes\n- It's impossible to say\n- No"
doc_to_target: "{{[\"Yes\", \"It's impossible to say\", \"No\"][label]}}"
- task: anli_r3
task_alias: prompt-4
include: _held_in_template_yaml
doc_to_text: "{{premise}}\nCan we infer the following?\n{{hypothesis}}\n\nOPTIONS:\n- Yes\n- It's impossible to say\n- No\nThe answer is:"
doc_to_target: "{{[\"Yes\", \"It's impossible to say\", \"No\"][label]}}"
- task: anli_r3
task_alias: prompt-5
include: _held_in_template_yaml
doc_to_text: "Read the following paragraph and determine if the hypothesis is true:\n\n{{premise}}\n\nOPTIONS:\n- Yes\n- It's impossible to say\n- No\nHypothesis: {{hypothesis}}\n\n\n"
doc_to_target: "{{[\"Yes\", \"It's impossible to say\", \"No\"][label]}}"
- task: anli_r3
task_alias: prompt-6
include: _held_in_template_yaml
doc_to_text: "Read the text and determine if the sentence is true (see options at the end):\n\n{{premise}}\n\nSentence: {{hypothesis}}\nOPTIONS:\n- Yes\n- It's impossible to say\n- No"
doc_to_target: "{{[\"Yes\", \"It's impossible to say\", \"No\"][label]}}"
- task: anli_r3
task_alias: prompt-7
include: _held_in_template_yaml
doc_to_text: "Can we draw the following hypothesis from the context (see options)? \n\nContext:\n\n{{premise}}\n\nHypothesis: {{hypothesis}}\nOPTIONS:\n- Yes\n- It's impossible to say\n- No"
doc_to_target: "{{[\"Yes\", \"It's impossible to say\", \"No\"][label]}}"
- task: anli_r3
task_alias: prompt-8
include: _held_in_template_yaml
doc_to_text: "Choose from options: Determine if the sentence is true based on the text below:\n{{hypothesis}}\n\n{{premise}}\nOPTIONS:\n- Yes\n- It's impossible to say\n- No"
doc_to_target: "{{[\"Yes\", \"It's impossible to say\", \"No\"][label]}}"
# Arc Easy
- group: arc_easy_flan
group_alias: Arc Easy
task:
- task: arc_easy
task_alias: prompt-0
include: _held_in_template_yaml
doc_to_text: "{{question}}\n\nOPTIONS:\n- {{choices.text|join('\n- ')}}"
doc_to_target: "{{choices.text[choices.label.index(answerKey)]}}"
- task: arc_easy
task_alias: prompt-1
include: _held_in_template_yaml
doc_to_text: "Question: {{question}}\nOPTIONS:\n- {{choices.text|join('\n- ')}}\nAnswer:"
doc_to_target: "{{choices.text[choices.label.index(answerKey)]}}"
- task: arc_easy
task_alias: prompt-2
include: _held_in_template_yaml
doc_to_text: "Question: {{question}}\n\nWhat is the correct answer to the question from the following choices?\nOPTIONS:\n- {{choices.text|join('\n- ')}}"
doc_to_target: "{{choices.text[choices.label.index(answerKey)]}}"
- task: arc_easy
task_alias: prompt-3
include: _held_in_template_yaml
doc_to_text: "Q: {{question}}\nWhat is the correct answer to this question?\nOPTIONS:\n- {{choices.text|join('\n- ')}}...A:"
doc_to_target: "{{choices.text[choices.label.index(answerKey)]}}"
- task: arc_easy
task_alias: prompt-4
include: _held_in_template_yaml
doc_to_text: "Choose your answer?\n\n{{question}}\n\nOPTIONS:\n- {{choices.text|join('\n- ')}}"
doc_to_target: "{{choices.text[choices.label.index(answerKey)]}}"
- task: arc_easy
task_alias: prompt-5
include: _held_in_template_yaml
doc_to_text: "Answer the question\n\n{{question}}\nOPTIONS:\n- {{choices.text|join('\n- ')}}"
doc_to_target: "{{choices.text[choices.label.index(answerKey)]}}"
- task: arc_easy
task_alias: prompt-6
include: _held_in_template_yaml
doc_to_text: "{{question}}\n\nPick the answer from these options\n\nOPTIONS:\n- {{choices.text|join('\n- ')}}"
doc_to_target: "{{choices.text[choices.label.index(answerKey)]}}"
# Arc Challenge
- group: arc_challenge_flan
group_alias: Arc Challenge
task:
- task: arc_challenge
task_alias: prompt-0
include: _held_in_template_yaml
doc_to_text: "{{question}}\n\nOPTIONS:\n- {{choices.text|join('\n- ')}}"
doc_to_target: "{{choices.text[choices.label.index(answerKey)]}}"
- task: arc_challenge
task_alias: prompt-1
include: _held_in_template_yaml
doc_to_text: "Question: {{question}}\nOPTIONS:\n- {{choices.text|join('\n- ')}}\nAnswer:"
doc_to_target: "{{choices.text[choices.label.index(answerKey)]}}"
- task: arc_challenge
task_alias: prompt-2
include: _held_in_template_yaml
doc_to_text: "Question: {{question}}\n\nWhat is the correct answer to the question from the following choices?\nOPTIONS:\n- {{choices.text|join('\n- ')}}"
doc_to_target: "{{choices.text[choices.label.index(answerKey)]}}"
- task: arc_challenge
task_alias: prompt-3
include: _held_in_template_yaml
doc_to_text: "Q: {{question}}\nWhat is the correct answer to this question?\nOPTIONS:\n- {{choices.text|join('\n- ')}}...A:"
doc_to_target: "{{choices.text[choices.label.index(answerKey)]}}"
- task: arc_challenge
task_alias: prompt-4
include: _held_in_template_yaml
doc_to_text: "Choose your answer?\n\n{{question}}\n\nOPTIONS:\n- {{choices.text|join('\n- ')}}"
doc_to_target: "{{choices.text[choices.label.index(answerKey)]}}"
- task: arc_challenge
task_alias: prompt-5
include: _held_in_template_yaml
doc_to_text: "Answer the question\n\n{{question}}\nOPTIONS:\n- {{choices.text|join('\n- ')}}"
doc_to_target: "{{choices.text[choices.label.index(answerKey)]}}"
- task: arc_challenge
task_alias: prompt-6
include: _held_in_template_yaml
doc_to_text: "{{question}}\n\nPick the answer from these options\n\nOPTIONS:\n- {{choices.text|join('\n- ')}}"
doc_to_target: "{{choices.text[choices.label.index(answerKey)]}}"
# BoolQ
- group: boolq_flan
group_alias: BoolQ
task:
- task: boolq
task_alias: prompt-0
include: _held_in_template_yaml
doc_to_text: "{{passage}}\n\nCan we conclude that {{question}}?\n\nOPTIONS:\n- no\n- yes"
doc_to_target: "{{['no', 'yes'][label]}}"
- task: boolq
task_alias: prompt-1
include: _held_in_template_yaml
doc_to_text: "{{passage}}\n\nIs it true that {{question}}?\n\nOPTIONS:\n- no\n- yes"
doc_to_target: "{{['no', 'yes'][label]}}"
- task: boolq
task_alias: prompt-2
include: _held_in_template_yaml
doc_to_text: "{{passage}}\n\n{{question}}?\n\nOPTIONS:\n- no\n- yes"
doc_to_target: "{{['no', 'yes'][label]}}"
- task: boolq
task_alias: prompt-3
include: _held_in_template_yaml
doc_to_text: "Text: {{passage}}\n\nQuestion: {{question}}?\n\nOPTIONS:\n- no\n- yes"
doc_to_target: "{{['no', 'yes'][label]}}"
- task: boolq
task_alias: prompt-4
include: _held_in_template_yaml
doc_to_text: "{{passage}}\n\nWhat's the best answer to this question: {{question}}?\n\nOPTIONS:\n- no\n- yes"
doc_to_target: "{{['no', 'yes'][label]}}"
- task: boolq
task_alias: prompt-5
include: _held_in_template_yaml
doc_to_text: "{{passage}}\nBased on the above text what's the best answer to this question: {{question}}?\n\nOPTIONS:\n- no\n- yes"
doc_to_target: "{{['no', 'yes'][label]}}"
- task: boolq
task_alias: prompt-6
include: _held_in_template_yaml
doc_to_text: "{{passage}}\nAnswer this question making sure that the answer is supposed by the text: {{question}}?\n\nOPTIONS:\n- no\n- yes"
doc_to_target: "{{['no', 'yes'][label]}}"
- task: boolq
task_alias: prompt-7
include: _held_in_template_yaml
doc_to_text: "{{passage}}\n\nIs the following statement correct based on the text\n\n{{question}}\n\nOPTIONS:\n- no\n- yes"
doc_to_target: "{{['no', 'yes'][label]}}"
- task: boolq
task_alias: prompt-8
include: _held_in_template_yaml
doc_to_text: "{{passage}}\n\nIs this statement correct \"{{question}}\"?\n\nOPTIONS:\n- no\n- yes"
doc_to_target: "{{['no', 'yes'][label]}}"
- task: boolq
task_alias: prompt-9
include: _held_in_template_yaml
doc_to_text: "Is it true that {{question}} based on the following text?\n\n{{passage}}\n\nOPTIONS:\n- no\n- yes"
doc_to_target: "{{['no', 'yes'][label]}}"
# RTE
- group: rte_flan
group_alias: RTE
task:
- task: rte
task_alias: prompt-0
include: _held_in_template_yaml
doc_to_text: "{{sentence1}}\n\nQuestion with options: Based on the paragraph above can we conclude that \"{{sentence2}}\"?\n\nOPTIONS:\n- yes\n- no"
doc_to_target: "{{['yes', 'no'][label]}}"
- task: rte
task_alias: prompt-1
include: _held_in_template_yaml
doc_to_text: "{{sentence1}}\n\nBased on that paragraph can we conclude that the sentence below is true?\n{{sentence2}}\n\nOPTIONS:\n- yes\n- no"
doc_to_target: "{{['yes', 'no'][label]}}"
- task: rte
task_alias: prompt-2
include: _held_in_template_yaml
doc_to_text: "{{sentence1}}\n\nQ with options: Can we draw the following conclusion?\n{{sentence2}}\n\nOPTIONS:\n- yes\n- no"
doc_to_target: "{{['yes', 'no'][label]}}"
- task: rte
task_alias: prompt-3
include: _held_in_template_yaml
doc_to_text: "{{sentence1}}\nDoes this next sentence follow, given the preceding text?\n{{sentence2}}\n\nOPTIONS:\n- yes\n- no"
doc_to_target: "{{['yes', 'no'][label]}}"
- task: rte
task_alias: prompt-4
include: _held_in_template_yaml
doc_to_text: "{{sentence1}}\nOPTIONS:\n- yes\n- no\nQuestion: Can we infer the following?\n{{sentence2}}"
doc_to_target: "{{['yes', 'no'][label]}}"
- task: rte
task_alias: prompt-5
include: _held_in_template_yaml
doc_to_text: "Read the following paragraph and determine if the hypothesis is true. Select from options at the end:\n\n{{sentence1}}\n\nHypothesis: {{sentence2}}\nOPTIONS:\n- yes\n- no\nThe answer is"
doc_to_target: "{{['yes', 'no'][label]}}"
- task: rte
task_alias: prompt-6
include: _held_in_template_yaml
doc_to_text: "Read the text and determine if the sentence is true:\n\n{{sentence1}}\n\nSentence: {{sentence2}}\nOPTIONS:\n- yes\n- no\nA:"
doc_to_target: "{{['yes', 'no'][label]}}"
- task: rte
task_alias: prompt-7
include: _held_in_template_yaml
doc_to_text: "Question with options: can we draw the following hypothesis from the context? \n\nContext:\n\n{{sentence1}}\n\nHypothesis: {{sentence2}}\nOPTIONS:\n- yes\n- no\nA:"
doc_to_target: "{{['yes', 'no'][label]}}"
- task: rte
task_alias: prompt-8
include: _held_in_template_yaml
doc_to_text: "Determine if the sentence is true based on the text below. Choose from options.\n{{sentence2}}\n\n{{sentence1}}\nOPTIONS:\n- yes\n- no"
doc_to_target: "{{['yes', 'no'][label]}}"
group: flan_held_in
task:
- include: flan/yaml_templates/held_in_template_yaml
dataset_path: super_glue
dataset_name: boolq
use_prompt: flan/prompt_templates/boolq.yaml:*
validation_split: validation
- include: flan/yaml_templates/held_in_template_yaml
dataset_path: super_glue
dataset_name: rte
use_prompt: flan/prompt_templates/rte.yaml:*
validation_split: validation
- include: flan/yaml_templates/held_in_template_yaml
task: anli_r1
dataset_path: anli
use_prompt: flan/prompt_templates/anli.yaml:*
validation_split: dev_r1
- include: flan/yaml_templates/held_in_template_yaml
task: anli_r2
dataset_path: anli
use_prompt: flan/prompt_templates/anli.yaml:*
validation_split: dev_r2
- include: flan/yaml_templates/held_in_template_yaml
task: anli_r3
dataset_path: anli
use_prompt: flan/prompt_templates/anli.yaml:*
validation_split: dev_r3
- include: flan/yaml_templates/held_in_template_yaml
task: arc_easy
dataset_path: ai2_arc
dataset_name: ARC-Easy
use_prompt: flan/prompt_templates/arc.yaml:*
validation_split: validation
- include: flan/yaml_templates/held_in_template_yaml
task: arc_challenge
dataset_path: ai2_arc
dataset_name: ARC-Challenge
use_prompt: flan/prompt_templates/arc.yaml:*
validation_split: validation
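Each entry in `flan_held_in` `include`s the shared `held_in_template_yaml` base and layers its own `dataset_path`, `dataset_name`, `use_prompt`, and split on top. A rough sketch of that merge, with a hypothetical helper name and deliberately simplified semantics (the real loader also resolves nested groups, registered task names, and prompt expansion):

import yaml

def resolve_entry(entry):
    """Illustrative only: load the 'include'd base template, then let the
    per-entry keys override it."""
    config = {}
    if "include" in entry:
        with open(entry["include"]) as f:   # path relative to the task directory
            config.update(yaml.safe_load(f) or {})
    config.update({k: v for k, v in entry.items() if k != "include"})
    return config

# e.g. the boolq entry above resolves to the shared template fields plus
# dataset_path=super_glue, dataset_name=boolq, its prompts, and its split.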
group: flan_held_out
task:
  # BBH
  - bbh_zeroshot
  - bbh_fewshot
  - bbh_cot_fewshot
  - bbh_cot_zeroshot
  # MMLU
  - mmlu
  - mmlu_flan_n_shot_generative
  ...
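`flan_held_out` shows the other half of the recursive-group feature: its task list names other groups (`mmlu`, the `bbh_*` suites) rather than inline configs, so resolution has to recurse through the group index until it reaches leaf tasks. A hedged sketch of that walk (the function and the shape of the index are assumptions, not the harness's actual loader):

def resolve(entry, group_index):
    """Illustrative only: expand a group entry into {group_name: [subtasks]}.
    An entry may itself be a registered group (e.g. "mmlu" or "bbh_zeroshot"),
    so the expansion recurses until only leaf task configs remain."""
    if isinstance(entry, str) and entry in group_index:
        subgroup = group_index[entry]
        return {entry: [resolve(t, group_index) for t in subgroup["task"]]}
    return entry  # a leaf task name or an inline task config

# resolve("flan_held_out", group_index) would yield a nested dict mirroring
# the group -> subgroup -> task hierarchy written in the YAML above.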
group: flan_rte
task:
- include: yaml_templates/held_in_template_yaml
dataset_path: super_glue
dataset_name: rte
use_prompt: prompt_templates/rte.yaml:*
validation_split: validation
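`use_prompt: prompt_templates/rte.yaml:*` points at a prompts file like the ones listed below; the `:*` suffix selects every template in it, so this single entry fans out into one task variant per template, much like the hand-written `prompt-0` … `prompt-8` tasks earlier. A sketch of that expansion under the same caveat of simplified, assumed semantics (the helper name is hypothetical):

import yaml

def expand_use_prompt(use_prompt, base_config):
    """Illustrative only: 'prompts.yaml:*' selects every template in the
    file; 'prompts.yaml:template-3' would select just one."""
    path, selector = use_prompt.rsplit(":", 1)
    with open(path) as f:
        prompts = yaml.safe_load(f)["prompts"]
    selected = prompts if selector == "*" else {selector: prompts[selector]}
    # one merged config per template, each carrying its doc_to_text / doc_to_target
    return [{**base_config, **template} for template in selected.values()]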
# Flan Prompt Templates
prompts:
"template-0":
doc_to_text: "{{premise}}\n\nChoose your answer: based on the paragraph above can we conclude that \"{{hypothesis}}\"?\n\nOPTIONS:\n- Yes\n- It's impossible to say\n- No\nI think the answer is"
doc_to_target: "{{[\"Yes\", \"It's impossible to say\", \"No\"][label]}}"
"template-1":
doc_to_text: "{{premise}}\n\nBased on that paragraph can we conclude that this sentence is true?\n{{hypothesis}}\n\nOPTIONS:\n- Yes\n- It's impossible to say\n- No"
doc_to_target: "{{[\"Yes\", \"It's impossible to say\", \"No\"][label]}}"
"template-2":
doc_to_text: "{{premise}}\n\nCan we draw the following conclusion?\n{{hypothesis}}\n\nOPTIONS:\n- Yes\n- It's impossible to say\n- No"
doc_to_target: "{{[\"Yes\", \"It's impossible to say\", \"No\"][label]}}"
"template-3":
doc_to_text: "{{premise}}\nDoes this next sentence follow, given the preceding text?\n{{hypothesis}}\n\nOPTIONS:\n- Yes\n- It's impossible to say\n- No"
doc_to_target: "{{[\"Yes\", \"It's impossible to say\", \"No\"][label]}}"
"template-4":
doc_to_text: "{{premise}}\nCan we infer the following?\n{{hypothesis}}\n\nOPTIONS:\n- Yes\n- It's impossible to say\n- No\nThe answer is:"
doc_to_target: "{{[\"Yes\", \"It's impossible to say\", \"No\"][label]}}"
"template-5":
doc_to_text: "Read the following paragraph and determine if the hypothesis is true:\n\n{{premise}}\n\nOPTIONS:\n- Yes\n- It's impossible to say\n- No\nHypothesis: {{hypothesis}}\n\n\n"
doc_to_target: "{{[\"Yes\", \"It's impossible to say\", \"No\"][label]}}"
"template-6":
doc_to_text: "Read the text and determine if the sentence is true (see options at the end):\n\n{{premise}}\n\nSentence: {{hypothesis}}\nOPTIONS:\n- Yes\n- It's impossible to say\n- No"
doc_to_target: "{{[\"Yes\", \"It's impossible to say\", \"No\"][label]}}"
"template-7":
doc_to_text: "Can we draw the following hypothesis from the context (see options)? \n\nContext:\n\n{{premise}}\n\nHypothesis: {{hypothesis}}\nOPTIONS:\n- Yes\n- It's impossible to say\n- No"
doc_to_target: "{{[\"Yes\", \"It's impossible to say\", \"No\"][label]}}"
"template-8":
doc_to_text: "Choose from options: Determine if the sentence is true based on the text below:\n{{hypothesis}}\n\n{{premise}}\nOPTIONS:\n- Yes\n- It's impossible to say\n- No"
doc_to_target: "{{[\"Yes\", \"It's impossible to say\", \"No\"][label]}}"
# Flan Prompt Templates
prompts:
"template-0":
doc_to_text: "{{question}}\n\nOPTIONS:\n- {{choices.text|join('\n- ')}}"
doc_to_target: "{{choices.text[choices.label.index(answerKey)]}}"
"template-1":
doc_to_text: "Question: {{question}}\nOPTIONS:\n- {{choices.text|join('\n- ')}}\nAnswer:"
doc_to_target: "{{choices.text[choices.label.index(answerKey)]}}"
"template-2":
doc_to_text: "Question: {{question}}\n\nWhat is the correct answer to the question from the following choices?\nOPTIONS:\n- {{choices.text|join('\n- ')}}"
doc_to_target: "{{choices.text[choices.label.index(answerKey)]}}"
"template-3":
doc_to_text: "Q: {{question}}\nWhat is the correct answer to this question?\nOPTIONS:\n- {{choices.text|join('\n- ')}}...A:"
doc_to_target: "{{choices.text[choices.label.index(answerKey)]}}"
"template-4":
doc_to_text: "Choose your answer?\n\n{{question}}\n\nOPTIONS:\n- {{choices.text|join('\n- ')}}"
doc_to_target: "{{choices.text[choices.label.index(answerKey)]}}"
"template-5":
doc_to_text: "Answer the question\n\n{{question}}\nOPTIONS:\n- {{choices.text|join('\n- ')}}"
doc_to_target: "{{choices.text[choices.label.index(answerKey)]}}"
"template-6":
doc_to_text: "{{question}}\n\nPick the answer from these options\n\nOPTIONS:\n- {{choices.text|join('\n- ')}}"
doc_to_target: "{{choices.text[choices.label.index(answerKey)]}}"
# Flan Prompt Templates
prompts:
"template-0":
doc_to_text: "{{passage}}\n\nCan we conclude that {{question}}?\n\nOPTIONS:\n- no\n- yes"
doc_to_target: "{{['no', 'yes'][label]}}"
"template-1":
doc_to_text: "{{passage}}\n\nIs it true that {{question}}?\n\nOPTIONS:\n- no\n- yes"
doc_to_target: "{{['no', 'yes'][label]}}"
"template-2":
doc_to_text: "{{passage}}\n\n{{question}}?\n\nOPTIONS:\n- no\n- yes"
doc_to_target: "{{['no', 'yes'][label]}}"
"template-3":
doc_to_text: "Text: {{passage}}\n\nQuestion: {{question}}?\n\nOPTIONS:\n- no\n- yes"
doc_to_target: "{{['no', 'yes'][label]}}"
"template-4":
doc_to_text: "{{passage}}\n\nWhat's the best answer to this question: {{question}}?\n\nOPTIONS:\n- no\n- yes"
doc_to_target: "{{['no', 'yes'][label]}}"
"template-5":
doc_to_text: "{{passage}}\nBased on the above text what's the best answer to this question: {{question}}?\n\nOPTIONS:\n- no\n- yes"
doc_to_target: "{{['no', 'yes'][label]}}"
"template-6":
doc_to_text: "{{passage}}\nAnswer this question making sure that the answer is supposed by the text: {{question}}?\n\nOPTIONS:\n- no\n- yes"
doc_to_target: "{{['no', 'yes'][label]}}"
"template-7":
doc_to_text: "{{passage}}\n\nIs the following statement correct based on the text\n\n{{question}}\n\nOPTIONS:\n- no\n- yes"
doc_to_target: "{{['no', 'yes'][label]}}"
"template-8":
# doc_to_text: "{{title}}\n\n{{passage}}\n\nIs this statement correct \"{{question}}\"?\n\nOPTIONS:\n- no\n- yes"
doc_to_text: "{{passage}}\n\nIs this statement correct \"{{question}}\"?\n\nOPTIONS:\n- no\n- yes"
doc_to_target: "{{['no', 'yes'][label]}}"
"template-9":
doc_to_text: "Is it true that {{question}} based on the following text?\n\n{{passage}}\n\nOPTIONS:\n- no\n- yes"
doc_to_target: "{{['no', 'yes'][label]}}"
# Flan Prompt Templates
prompts:
"template-0":
doc_to_text: "{{premise}}\n\nQuestion with options: Based on the paragraph above can we conclude that \"{{hypothesis}}\"?\n\nOPTIONS:\n- yes\n- no"
doc_to_target: "{{['yes', 'no'][label]}}"
"template-1":
doc_to_text: "{{premise}}\n\nBased on that paragraph can we conclude that the sentence below is true?\n{{hypothesis}}\n\nOPTIONS:\n- yes\n- no"
doc_to_target: "{{['yes', 'no'][label]}}"
"template-2":
doc_to_text: "{{premise}}\n\nQ with options: Can we draw the following conclusion?\n{{hypothesis}}\n\nOPTIONS:\n- yes\n- no"
doc_to_target: "{{['yes', 'no'][label]}}"
"template-3":
doc_to_text: "{{premise}}\nDoes this next sentence follow, given the preceding text?\n{{hypothesis}}\n\nOPTIONS:\n- yes\n- no"
doc_to_target: "{{['yes', 'no'][label]}}"
"template-4":
doc_to_text: "{{premise}}\nOPTIONS:\n- yes\n- no\nQuestion: Can we infer the following?\n{{hypothesis}}"
doc_to_target: "{{['yes', 'no'][label]}}"
"template-5":
doc_to_text: "Read the following paragraph and determine if the hypothesis is true. Select from options at the end:\n\n{{premise}}\n\nHypothesis: {{hypothesis}}\nOPTIONS:\n- yes\n- no\nThe answer is"
doc_to_target: "{{['yes', 'no'][label]}}"
"template-6":
doc_to_text: "Read the text and determine if the sentence is true:\n\n{{premise}}\n\nSentence: {{hypothesis}}\nOPTIONS:\n- yes\n- no\nA:"
doc_to_target: "{{['yes', 'no'][label]}}"
"template-7":
doc_to_text: "Question with options: can we draw the following hypothesis from the context? \n\nContext:\n\n{{premise}}\n\nHypothesis: {{hypothesis}}\nOPTIONS:\n- yes\n- no\nA:"
doc_to_target: "{{['yes', 'no'][label]}}"
"template-8":
doc_to_text: "Determine if the sentence is true based on the text below. Choose from options.\n{{hypothesis}}\n\n{{premise}}\nOPTIONS:\n- yes\n- no"
doc_to_target: "{{['yes', 'no'][label]}}"
group: flan-cot
output_type: generate_until
validation_split: validation
doc_to_target: "{{answer}}"
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
generation_kwargs:
until:
- "\n\n"
do_sample: false
temperature: 0.0
filter_list:
- name: "get-answer"
filter:
- function: "regex"
regex_pattern: "The answer is (\\-?[0-9\\.\\,]+)"
- function: "take_first"
metadata:
version: 1.0
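The `get-answer` filter chain post-processes each generation: the `regex` step captures the first `The answer is <number>` span and `take_first` keeps a single candidate per document. A standalone check of the same pattern with Python's `re` (only this reproduction is mine; the filter wiring is the harness's):

import re

# Same regex as regex_pattern above, with the YAML double-escaping removed.
pattern = re.compile(r"The answer is (-?[0-9\.\,]+)")

generation = "Twelve plus thirty is forty-two. The answer is 42"
matches = pattern.findall(generation)
answer = matches[0] if matches else None  # "take_first": keep the first match
print(answer)  # -> 42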