Commit c31c4300 authored by lintangsutawika

Merge branch 'recursive-groups' of https://github.com/EleutherAI/lm-evaluation-harness into t5v2-alt-plus
parents f7f298ee 6282a1be
......@@ -172,7 +172,7 @@ lm_eval --model openai-completions \
--tasks lambada_openai,hellaswag
```
We also support using your own local inference server with an implemented version of the OpenAI ChatCompletions endpoint and passing trained HuggingFace artifacts and tokenizers.
We also support using your own local inference server, provided it mirrors the OpenAI Completions or ChatCompletions API.
```bash
lm_eval --model local-chat-completions --tasks gsm8k --model_args model=facebook/opt-125m,base_url=http://{yourip}:8000/v1
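# A minimal sketch (not taken verbatim from this diff): the same kind of server can be
# targeted with the Completions-style `local-completions` model type. The
# `tokenizer_backend` and `tokenizer` arguments shown here are the constructor options
# added in this change; passing them through --model_args is assumed to work the same
# way as the other keyword arguments.
lm_eval --model local-completions --tasks gsm8k --model_args model=facebook/opt-125m,base_url=http://{yourip}:8000/v1,tokenizer_backend=huggingface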
......@@ -181,7 +181,7 @@ Note that for externally hosted models, configs such as `--device` and `--batch_
| API or Inference Server | Implemented? | `--model <xxx>` name | Models supported: | Request Types: |
|---------------------------------------------------------------------------------------------------------------------------|---------------------------------|---------------------------------------------------------------------|-----------------------------------------------------------------------------------------------|------------------------------------------------------------|
| OpenAI Completions | :heavy_check_mark: | `openai-completions` | up to `code-davinci-002` | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| OpenAI Completions | :heavy_check_mark: | `openai-completions`, `local-completions` | All OpenAI Completions API models | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| OpenAI ChatCompletions | :heavy_check_mark: | `openai-chat-completions`, `local-chat-completions` | [All ChatCompletions API models](https://platform.openai.com/docs/guides/gpt) | `generate_until` (no logprobs) |
| Anthropic | :heavy_check_mark: | `anthropic` | [Supported Anthropic Engines](https://docs.anthropic.com/claude/reference/selecting-a-model) | `generate_until` (no logprobs) |
| Textsynth | :heavy_check_mark: | `textsynth` | [All supported engines](https://textsynth.com/documentation.html#engines) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
......@@ -189,9 +189,12 @@ Note that for externally hosted models, configs such as `--device` and `--batch_
| [Llama.cpp](https://github.com/ggerganov/llama.cpp) (via [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)) | :heavy_check_mark: | `gguf`, `ggml` | [All models supported by llama.cpp](https://github.com/ggerganov/llama.cpp) | `generate_until`, `loglikelihood`, (perplexity evaluation not yet implemented) |
| vLLM | :heavy_check_mark: | `vllm` | [Most HF Causal Language Models](https://docs.vllm.ai/en/latest/models/supported_models.html) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| Mamba | :heavy_check_mark: | `mamba_ssm` | [Mamba architecture Language Models via the `mamba_ssm` package](https://huggingface.co/state-spaces) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| Your local inference server! | :heavy_check_mark: | `local-chat-completions` (using `openai-chat-completions` model type) | Any server address that accepts GET requests using HF models and mirror's OpenAI's ChatCompletions interface | `generate_until` | | ... |
| Your local inference server! | :heavy_check_mark: | `local-completions` or `local-chat-completions` (using the `openai-completions` or `openai-chat-completions` model type, respectively) | Any server address that accepts GET requests using HF models and mirrors OpenAI's Completions or ChatCompletions interface | `generate_until` | ... |
It is on our roadmap to create task variants designed to enable models which do not serve logprobs/loglikelihoods to be compared against the generation performance of open-source models.
Models which do not supply logits or logprobs can be used with tasks of type `generate_until` only, while models that are local or APIs that supply logprobs/logits can be run on all task types: `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`.
For more information on the different task `output_types` and model request types, see [our documentation](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/model_guide.md#interface).
### Other Frameworks
......
......@@ -232,6 +232,10 @@ If you would like to run evaluation on all prompt templates, you can simply call
use_prompt: "promptsource:*"
```
### Weighting evaluation based on task size
By default, all tasks in a group are aggregated by a simple average (a group of 2 tasks with the same metric will have their scores summed and divided by 2 to produce the group metric). You may instead want to aggregate task scores weighted by task size. To do this, set `weight_by_size: True` in the task config to have its score weighted by the number of samples it contains.
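To make the difference concrete, here is a minimal sketch in plain Python (not the harness's own aggregation code) of how a size-weighted group score differs from the default simple average; the task names and sample counts are hypothetical.

```python
# Hypothetical per-task accuracies and sample counts for a two-task group.
scores = {"task_a": 0.80, "task_b": 0.60}
sizes = {"task_a": 1000, "task_b": 200}

# Default behaviour: simple (unweighted) average over the tasks in the group.
simple_avg = sum(scores.values()) / len(scores)  # (0.80 + 0.60) / 2 = 0.70

# With `weight_by_size: True`, each task contributes proportionally to its sample count.
weighted_avg = sum(scores[t] * sizes[t] for t in scores) / sum(sizes.values())
# (0.80 * 1000 + 0.60 * 200) / 1200 ≈ 0.7667

print(f"simple={simple_avg:.4f} weighted={weighted_avg:.4f}")
```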
### Setting metrics
You're almost done! Now we need to choose how to score our task.
......
......@@ -40,6 +40,21 @@ ALL_OUTPUT_TYPES = [
eval_logger = logging.getLogger("lm-eval")
@dataclass
class GroupConfig(dict):
group: str = None
task: Union[str, list] = None
weight_by_size: bool = False
def __getitem__(self, item):
return getattr(self, item)
def __setitem__(self, item, value):
return setattr(self, item, value)
def to_dict(self):
return asdict(self)
@dataclass
class TaskConfig(dict):
......@@ -80,7 +95,7 @@ class TaskConfig(dict):
filter_list: Union[str, list] = None
should_decontaminate: bool = False
doc_to_decontamination_query: str = None
weight_by_size: bool = False
metadata: Union[
str, list
] = None # by default, not used in the code. allows for users to pass arbitrary info to tasks
......
......@@ -124,7 +124,7 @@ def simple_evaluate(
for task_name in task_dict.keys():
task_obj = task_dict[task_name]
if type(task_obj) == tuple:
group, task_obj = task_obj
_, task_obj = task_obj
if task_obj is None:
continue
......@@ -160,11 +160,16 @@ def simple_evaluate(
)
if lm.rank == 0:
if isinstance(model, str):
model_name = model
elif hasattr(model, "config") and hasattr(model.config, "_name_or_path"):
model_name = model.config._name_or_path
else:
model_name = type(model).__name__
# add info about the model and few shot config
results["config"] = {
"model": model
if isinstance(model, str)
else model.model.config._name_or_path,
"model": model_name,
"model_args": model_args,
"batch_size": batch_size,
"batch_sizes": list(lm.batch_sizes.values())
......@@ -482,10 +487,7 @@ def evaluate(
if "alias" in metrics:
metrics.pop("alias")
# TODO: There should be a way for users
# to toggle between weighted and
# unweighted averaging
if weight_by_size:
if ("weight_by_size" in configs) and configs[task]["weight_by_size"]:
current_size = metrics.pop("samples")
else:
metrics.pop("samples")
......
......@@ -42,7 +42,7 @@ class MambaLMWrapper(HFLM):
The HFLM arguments
`backend`, `revision`, `subfolder`, `tokenizer`, `truncation`, `max_length`,
`backend`, `tokenizer`, `truncation`, `max_length`,
`device`, `dtype`, `batch_size`, `max_batch_size`, `trust_remote_code`, `use_fast_tokenizer`
Are all supported by Mamba where they do not conflict
......@@ -98,7 +98,6 @@ please install mamba via `pip install lm-eval[mamba]` or `pip install -e .[mamba
pretrained,
device=self._device,
dtype=torch.float16 if dtype == "auto" else utils.get_dtype(dtype),
**kwargs,
)
def _model_generate(self, context, max_length, stop, **generation_kwargs):
......
......@@ -2,14 +2,14 @@ import copy
import os
from collections import defaultdict
from importlib.util import find_spec
from typing import List, Optional, Tuple
from typing import List, Literal, Optional, Tuple
from tqdm import tqdm
from lm_eval import utils
from lm_eval.api.model import LM
from lm_eval.api.registry import register_model
from lm_eval.utils import retry_on_specific_exceptions
from lm_eval.utils import eval_logger, retry_on_specific_exceptions
def get_result(response, ctxlen: int) -> Tuple[float, bool]:
......@@ -40,7 +40,7 @@ def get_result(response, ctxlen: int) -> Tuple[float, bool]:
return continuation_logprobs, is_greedy
def oa_completion(**kwargs):
def oa_completion(client, chat: bool = False, **kwargs):
"""Query OpenAI API for completion.
Retry with back-off until they respond
......@@ -64,19 +64,24 @@ def oa_completion(**kwargs):
on_exception_callback=_exception_callback,
)
def completion():
return openai.completions.create(**kwargs)
if chat:
return client.chat.completions.create(**kwargs)
else:
return client.completions.create(**kwargs)
return completion()
@register_model("openai-completions")
@register_model("openai-completions", "local-completions")
class OpenaiCompletionsLM(LM):
REQ_CHUNK_SIZE = 20
_DEFAULT_MAX_LENGTH = 2048
def __init__(
self,
model: str,
base_url: str = None,
tokenizer: Optional[str] = None,
tokenizer_backend: Literal["tiktoken", "huggingface"] = "tiktoken",
truncate: bool = False,
max_gen_toks: int = 256,
batch_size: int = 1,
......@@ -101,15 +106,44 @@ class OpenaiCompletionsLM(LM):
please install these via `pip install lm-eval[openai]` or `pip install -e .[openai]`",
)
self.model = model
self.tokenizer = tiktoken.encoding_for_model(self.model)
self.vocab_size = self.tokenizer.n_vocab
self.base_url = base_url
self.tokenizer_backend = tokenizer_backend
self.truncate = truncate
self.end_of_text_token_id = self.tokenizer.eot_token
self._batch_size = batch_size
self._max_gen_toks = max_gen_toks
self._max_length = max_length
# if we have a local model, use HF tokenizer over tiktoken
if self.tokenizer_backend == "huggingface":
import transformers # noqa: E401
self.tokenizer = transformers.AutoTokenizer.from_pretrained(
tokenizer if tokenizer else self.model
)
self.vocab_size = self.tokenizer.vocab_size
self.end_of_text_token_id = self.tokenizer.eos_token_id
elif self.tokenizer_backend == "tiktoken":
if self.base_url:
eval_logger.warning(
f"Passed `base_url={self.base_url}` but using Tiktoken tokenizer backend. "
"Pass `tokenizer_backend=huggingface` and provide the HF tokenizer name if your model does not use Tiktoken."
)
self.tokenizer = tiktoken.encoding_for_model(self.model)
self.vocab_size = self.tokenizer.n_vocab
self.end_of_text_token_id = self.tokenizer.eot_token
else:
raise ValueError(
f"Expected tokenizer_backend to be one of ['tiktoken', 'huggingface'] but got {self.tokenizer_backend}"
)
# Read from environment variable OPENAI_API_KEY
# Set to EMPTY for local
openai.api_key = os.environ["OPENAI_API_KEY"]
if self.base_url:
self.client = openai.OpenAI(base_url=self.base_url)
else:
self.client = openai.OpenAI()
@property
def eot_token_id(self):
......@@ -127,9 +161,8 @@ class OpenaiCompletionsLM(LM):
return self._max_gen_toks
@property
def batch_size(self):
# Isn't used because we override _loglikelihood_tokens
raise NotImplementedError()
def batch_size(self) -> int:
return self._batch_size
@property
def device(self):
......@@ -186,7 +219,7 @@ class OpenaiCompletionsLM(LM):
re_ord = utils.Reorderer(requests, _collate)
for chunk in tqdm(
list(utils.chunks(re_ord.get_reordered(), self.REQ_CHUNK_SIZE)),
list(utils.chunks(re_ord.get_reordered(), self.batch_size)),
disable=disable_tqdm,
):
inps = []
......@@ -203,6 +236,7 @@ class OpenaiCompletionsLM(LM):
ctxlens.append(ctxlen)
response = oa_completion(
client=self.client,
model=self.model,
prompt=inps,
echo=True,
......@@ -251,7 +285,7 @@ class OpenaiCompletionsLM(LM):
# todo: more intelligent batching for heterogeneous `until`
for chunk, request_args in tqdm(
list(sameuntil_chunks(re_ord.get_reordered(), self.REQ_CHUNK_SIZE))
list(sameuntil_chunks(re_ord.get_reordered(), self.batch_size))
):
inps = []
self._max_gen_toks = request_args.pop("max_gen_toks", self.max_gen_toks)
......@@ -265,6 +299,7 @@ class OpenaiCompletionsLM(LM):
request_args["temperature"] = request_args.get("temperature", 0)
response = oa_completion(
client=self.client,
model=self.model,
prompt=inps,
max_tokens=self.max_gen_toks,
......@@ -329,35 +364,6 @@ class OpenaiCompletionsLM(LM):
return loglikelihoods
def oa_chat_completion(client, **kwargs):
"""Query OpenAI API for chat completion.
Retry with back-off until they respond
"""
if not find_spec("openai") or not find_spec("tiktoken"):
raise Exception(
"attempted to use 'openai' LM type, but package `openai` or `tiktoken` are not installed. "
"Please install these via `pip install lm-eval[openai]` or `pip install -e .[openai]`"
)
else:
import openai
def _exception_callback(e: Exception, sleep_time: float) -> None:
import traceback
traceback.print_exc()
@retry_on_specific_exceptions(
on_exceptions=[openai.OpenAIError],
max_retries=None, # retry forever, consider changing
on_exception_callback=_exception_callback,
)
def completion():
return client.chat.completions.create(**kwargs)
return completion()
@register_model("openai-chat-completions", "local-chat-completions")
class OpenaiChatCompletionsLM(LM):
def __init__(
......@@ -460,8 +466,12 @@ class OpenaiChatCompletionsLM(LM):
f"Expected repr(kwargs) to be of type repr(dict) but got {kwargs}"
)
response = oa_chat_completion(
client=self.client, messages=inps, model=self.model, **kwargs
response = oa_completion(
client=self.client,
chat=True,
messages=inps,
model=self.model,
**kwargs,
)
for resp, (context, args_) in zip(response.choices, chunk):
......
......@@ -3,34 +3,32 @@ import abc
import yaml
import collections
from functools import partial, lru_cache
from functools import partial
from typing import List, Union, Dict
from lm_eval import utils
from lm_eval import prompts
from lm_eval.api.task import TaskConfig, Task, ConfigurableTask
from lm_eval.api.registry import (
register_task,
register_group,
TASK_REGISTRY,
GROUP_REGISTRY,
)
import logging
# import python tasks
from .squadv2.task import SQuAD2
from .scrolls.task import (
QuALITY,
NarrativeQA,
ContractNLI,
GovReport,
SummScreenFD,
QMSum,
)
# # import python tasks
# import squadv2.task
# import scrolls.task
# python_tasks = {
# "squadv2": squadv2.task.SQuAD2,
# "scrolls_quality": scrolls.task.QuALITY,
# "scrolls_narrativeqa": scrolls.task.NarrativeQA,
# "scrolls_contractnli": scrolls.task.ContractNLI,
# "scrolls_govreport": scrolls.task.GovReport,
# "scrolls_summscreenfd": scrolls.task.SummScreenFD,
# "scrolls_qmsum": scrolls.task.QMSum,
# }
eval_logger = utils.eval_logger
GROUP_KEYS = ["group", "task", "weight_by_size"]
PYTHON_TASK_KEYS = ["task", "class"]
class TaskManager(abc.ABC):
......@@ -72,15 +70,25 @@ class TaskManager(abc.ABC):
return False
def _name_is_task(self, name):
if self.ALL_TASKS[name]["type"] == "task":
if self._name_is_registered(name) and ("task" in self.ALL_TASKS[name]["type"]):
return True
return False
def _name_is_python_task(self, name):
if self._name_is_registered(name) and (self.ALL_TASKS[name]["type"] == "python_task"):
return True
return False
def _config_is_task(self, config):
if list(config.keys()) == ["group", "task"]:
if set(config.keys()) <= set(GROUP_KEYS):
return False
return True
def _config_is_python_task(self, config):
if set(config.keys()) == set(PYTHON_TASK_KEYS):
return True
return False
def _get_yaml_path(self, name):
assert name in self.ALL_TASKS
return self.ALL_TASKS[name]["yaml_path"]
......@@ -94,47 +102,75 @@ class TaskManager(abc.ABC):
assert self._name_is_task(name) == False
return self.ALL_TASKS[name]["task"]
@lru_cache(None)
def _load_individual_task_or_group(self, name_or_config: Union[str, dict] = None, parent_name: str = None) -> ConfigurableTask:
def _load_individual_task_or_group(
self,
name_or_config: Union[str, dict] = None,
parent_name: str = None,
update_config: dict = None
) -> ConfigurableTask:
def load_task(config, task, group=None):
task_object = ConfigurableTask(config=config)
def load_task(config, task, group=None, is_python_class=False):
if is_python_class:
task_object = config["class"]()
else:
task_object = ConfigurableTask(config=config)
if group is not None:
task_object = (group, task_object)
return {task: task_object}
if isinstance(name_or_config, str):
if self._name_is_task(name_or_config):
if update_config is not None:
# Process name_or_config as a dict instead
name_or_config = {"task": name_or_config, **update_config}
elif self._name_is_task(name_or_config):
task_config = self._get_config(name_or_config)
return load_task(task_config, task=name_or_config, group=parent_name)
is_python_class = False
if self._name_is_python_task(name_or_config):
is_python_class = True
return load_task(task_config, task=name_or_config, group=parent_name, is_python_class=is_python_class)
else:
group_name = name_or_config
subtask_list = self._get_tasklist(name_or_config)
if subtask_list == -1:
subtask_list = self._get_config(name_or_config)["task"]
elif isinstance(name_or_config, dict):
if isinstance(name_or_config, dict):
if update_config is not None:
name_or_config = {
**name_or_config,
**update_config,
}
if self._config_is_task(name_or_config):
task_name = name_or_config["task"]
if self._name_is_registered(task_name):
base_task_config = self._get_config(task_name)
task_config = {
**base_task_config,
**name_or_config,
}
name = name_or_config["task"]
# If the name is registered as a group
if self._name_is_task(name) is False:
group_name = name
update_config = {k: v for k, v in name_or_config.items() if k != "task"}
subtask_list = self._get_tasklist(name)
if subtask_list == -1:
subtask_list = self._get_config(name)["task"]
else:
task_config = name_or_config
return load_task(task_config, task=name_or_config, group=parent_name)
if self._name_is_registered(name):
base_task_config = self._get_config(name)
task_config = {
**base_task_config,
**name_or_config,
}
else:
task_config = name_or_config
return load_task(task_config, task=name, group=parent_name)
else:
group_name = name_or_config["group"]
subtask_list = name_or_config["task"]
if self._get_yaml_path(group_name) == -1:
if (self._name_is_registered(group_name) is False) or (self._get_yaml_path(group_name) == -1):
all_subtasks = {group_name: (parent_name, None)}
else:
all_subtasks = {}
fn = partial(self._load_individual_task_or_group, parent_name=group_name)
fn = partial(self._load_individual_task_or_group, parent_name=group_name, update_config=update_config)
all_subtasks = {**all_subtasks, **dict(collections.ChainMap(*map(fn, subtask_list)))}
return all_subtasks
......@@ -161,7 +197,13 @@ class TaskManager(abc.ABC):
if f.endswith(".yaml"):
yaml_path = os.path.join(root, f)
config = utils.simple_load_yaml_config(yaml_path)
if list(config.keys()) == ["group", "task"]:
if set(config.keys()) == set(PYTHON_TASK_KEYS):
# This is a python class config
tasks_and_groups[config["task"]] = {
"type": "python_task",
"yaml_path": yaml_path,
}
elif set(config.keys()) <= set(GROUP_KEYS):
# This is a group config
tasks_and_groups[config["group"]] = {
"type": "group",
......
group:
- ai2_arc
task: arc_easy
dataset_path: ai2_arc
dataset_path: allenai/ai2_arc
dataset_name: ARC-Easy
output_type: multiple_choice
training_split: train
......
group: grouptest
task:
- boolq
- group: arc_stuff
- group: random_collection
task:
- arc_challenge
- glue
- task: arc_easy
metric_list:
- metric: acc
- ai2_arc
- task: cola
num_fewshot: 3
# - task: mmlu
# num_fewshot: 2
- task: mmlu
num_fewshot: 2
......@@ -41,5 +41,6 @@ metric_list:
- metric: accuracy
aggregation: mean
higher_is_better: true
hf_evaluate: true
metadata:
version: 1.0
group: qasper
task: qasper_bool
dataset_path: qasper
dataset_path: allenai/qasper
output_type: multiple_choice
training_split: train
validation_split: validation
......
group: qasper
task: qasper_freeform
dataset_path: qasper
dataset_path: allenai/qasper
output_type: generate_until
training_split: train
validation_split: validation
......
group: scrolls
task:
- scrolls_qasper
- scrolls_quality
- scrolls_narrativeqa
- scrolls_contractnli
- scrolls_govreport
- scrolls_summscreenfd
- scrolls_qmsum
# - task: scrolls_qasper
# class: !function task.Qasper
- task: scrolls_quality
class: !function task.QuALITY
# - scrolls_narrativeqa
# class: !function task.NarrativeQA
# - scrolls_contractnli
# class: !function task.ContractNLI
# - scrolls_govreport
# class: !function task.GovReport
# - scrolls_summscreenfd
# class: !function task.SummScreenFD
# - scrolls_qmsum
# class: !function task.QMSum
......@@ -279,7 +279,7 @@ class _SCROLLSSummaryTask(_SCROLLSTask):
return f"{doc['input']}\n\nQuestion: What is a summary of the preceding text?\nAnswer:"
@register_task("scrolls_qasper")
# @register_task("scrolls_qasper")
class Qasper(_SCROLLSTask):
"""A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers
https://arxiv.org/abs/2105.03011
......@@ -337,7 +337,7 @@ class Qasper(_SCROLLSTask):
)
@register_task("scrolls_quality")
# @register_task("scrolls_quality")
class QuALITY(_SCROLLSMultipleChoiceTask):
"""QuALITY: Question Answering with Long Input Texts, Yes!
https://arxiv.org/abs/2112.08608
......@@ -366,7 +366,7 @@ class QuALITY(_SCROLLSMultipleChoiceTask):
return [doc]
@register_task("scrolls_narrativeqa")
# @register_task("scrolls_narrativeqa")
class NarrativeQA(_SCROLLSTask):
"""The NarrativeQA Reading Comprehension Challenge
https://arxiv.org/abs/1712.07040
......@@ -400,7 +400,7 @@ class NarrativeQA(_SCROLLSTask):
)
@register_task("scrolls_contractnli")
# @register_task("scrolls_contractnli")
class ContractNLI(_SCROLLSMultipleChoiceTask):
"""ContractNLI: A Dataset for Document-level Natural Language Inference for Contracts
https://arxiv.org/abs/1712.07040
......@@ -419,7 +419,7 @@ class ContractNLI(_SCROLLSMultipleChoiceTask):
return f"{doc['text']}\n\nHypothesis: {doc['question']}\nConclusion:"
@register_task("scrolls_govreport")
# @register_task("scrolls_govreport")
class GovReport(_SCROLLSSummaryTask):
"""Efficient Attentions for Long Document Summarization
https://arxiv.org/abs/2104.02112
......@@ -433,7 +433,7 @@ class GovReport(_SCROLLSSummaryTask):
DATASET_NAME = "gov_report"
@register_task("scrolls_summscreenfd")
# @register_task("scrolls_summscreenfd")
class SummScreenFD(_SCROLLSSummaryTask):
"""SummScreen: A Dataset for Abstractive Screenplay Summarization
https://arxiv.org/abs/2104.07091
......@@ -442,7 +442,7 @@ class SummScreenFD(_SCROLLSSummaryTask):
DATASET_NAME = "summ_screen_fd"
@register_task("scrolls_qmsum")
# @register_task("scrolls_qmsum")
class QMSum(_SCROLLSSummaryTask):
"""QMSum: A New Benchmark for Query-based Multi-domain
Meeting Summarization
......
task: squadv2
class: !function task.SQuAD2
\ No newline at end of file
......@@ -21,7 +21,6 @@ from packaging import version
from lm_eval.api.task import Task
from lm_eval.api.instance import Instance
from lm_eval.api.registry import register_task
_CITATION = """
@misc{rajpurkar2018know,
......@@ -47,7 +46,6 @@ def _squad_agg(key, items):
return _squad_metric(predictions=predictions, references=references).get(key, 0)
@register_task("squadv2")
class SQuAD2(Task):
VERSION = 3
DATASET_PATH = "squad_v2"
......
......@@ -480,28 +480,10 @@ def get_git_commit_hash():
return git_hash
def import_function(loader, node):
function_name = loader.construct_scalar(node)
yaml_path = os.path.dirname(loader.name)
*module_name, function_name = function_name.split(".")
if isinstance(module_name, list):
module_name = ".".join(module_name)
module_path = os.path.normpath(os.path.join(yaml_path, "{}.py".format(module_name)))
spec = importlib.util.spec_from_file_location(module_name, module_path)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
function = getattr(module, function_name)
return function
def ignore_constructor(loader, node):
return node
def simple_load_yaml_config(yaml_path=None, yaml_config=None, yaml_dir=None):
def ignore_constructor(loader, node):
return node
yaml.add_constructor("!function", ignore_constructor)
with open(yaml_path, "rb") as file:
yaml_config = yaml.full_load(file)
......@@ -509,6 +491,24 @@ def simple_load_yaml_config(yaml_path=None, yaml_config=None, yaml_dir=None):
def load_yaml_config(yaml_path=None, yaml_config=None, yaml_dir=None):
def import_function(loader, node):
function_name = loader.construct_scalar(node)
yaml_path = os.path.dirname(loader.name)
*module_name, function_name = function_name.split(".")
if isinstance(module_name, list):
module_name = ".".join(module_name)
module_path = os.path.normpath(
os.path.join(yaml_path, "{}.py".format(module_name))
)
spec = importlib.util.spec_from_file_location(module_name, module_path)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
function = getattr(module, function_name)
return function
# Add the import_function constructor to the YAML loader
yaml.add_constructor("!function", import_function)
if yaml_config is None:
......
......@@ -22,12 +22,13 @@ def load_changed_files(file_path: str) -> List[str]:
# checks the txt file for list of changed files.
# if file ends with .yaml then check yaml for task name
# if file ends with .py then parse the folder for all yaml files
# skips benchmarks folder
def parser(full_path: List[str]) -> List[str]:
_output = set()
for x in full_path:
if x.endswith(".yaml"):
if x.endswith(".yaml") and "benchmarks" not in x:
_output.add(load_yaml_config(x)["task"])
elif x.endswith(".py"):
elif x.endswith(".py") and "benchmarks" not in x:
path = [str(x) for x in (list(Path(x).parent.glob("*.yaml")))]
_output |= {load_yaml_config(x)["task"] for x in path}
return list(_output)
......