Commit c490f165 authored by gk

Merge branch 'big-refactor' of github.com:EleutherAI/lm-evaluation-harness into big-refactor-merge

parents b5efc813 b6c70fc8
@@ -19,7 +19,7 @@ Tasks are configured via the `TaskConfig` object. Below, we describe all fields
- **reference** (`str`, *optional*) —
- **dataset_path** (`str`) — The name of the dataset as listed by HF in the datasets Hub.
- **dataset_name** (`str`, *optional*, defaults to None) — The name of, what HF calls, a “data instance” or sub-task of the benchmark. If your task does not contain any data instances, just leave this to default to None. (If you're familiar with the HF `datasets.load_dataset` function, these are just the first 2 arguments to it.)
-- **dataset_kwargs** (`dict`, *optional*) — Auxillary arguments that `datasets.load_dataset` accepts. This can be used to specify arguments such as `data_files` or `data_dir` if you want to use local datafiles such as json or csv.
+- **dataset_kwargs** (`dict`, *optional*) — Auxiliary arguments that `datasets.load_dataset` accepts. This can be used to specify arguments such as `data_files` or `data_dir` if you want to use local data files such as JSON or CSV (see the sketch after this list).
- **training_split** (`str`, *optional*) — Split in the dataset to use as the training split.
- **validation_split** (`str`, *optional*) — Split in the dataset to use as the validation split.
- **test_split** (`str`, *optional*) — Split in the dataset to use as the test split.
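
For orientation, the three `dataset_*` fields above map directly onto a `datasets.load_dataset` call, with `dataset_kwargs` unpacked as keyword arguments. A minimal sketch (the local file name below is hypothetical, not from the harness):

```python
from datasets import load_dataset

# dataset_path -> first argument, dataset_name -> second argument,
# dataset_kwargs -> extra keyword arguments.
ds = load_dataset("glue", "cola")  # a Hub dataset with a sub-task/config name

# Local files instead: leave dataset_name unset and put data_files in dataset_kwargs.
ds_local = load_dataset("json", data_files={"test": "my_task_test.jsonl"})  # hypothetical file
```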
@@ -169,7 +169,7 @@ You can find an example of how to use this feature at [gsm8k-cot-self-consistenc
## Passing Arguments to Metrics
-Metrics can be defined in the `metric_list` argument when building the YAML config. Multiple metrics can be listed along with any auxillary arguments. For example, setting the [`exact_match` metric](https://github.com/huggingface/evaluate/tree/main/metrics/exact_match), auxiliary arguments such as `ignore_case`, `ignore_punctuation`, `regexes_to_ignore` can be listed as well. They will be added to the metric function as `kwargs`. Some metrics have predefined values for `aggregation` and `higher_is_better` so listing the metric name only can be sufficient.
+Metrics can be defined in the `metric_list` argument when building the YAML config. Multiple metrics can be listed along with any auxiliary arguments. For example, when using the [`exact_match` metric](https://github.com/huggingface/evaluate/tree/main/metrics/exact_match), auxiliary arguments such as `ignore_case`, `ignore_punctuation`, and `regexes_to_ignore` can be listed as well; they will be passed to the metric function as `kwargs`. Some metrics have predefined values for `aggregation` and `higher_is_better`, so listing only the metric name can be sufficient.
```
metric_list:
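  # Illustrative entry only (not from the original example): auxiliary kwargs
  # listed under a metric are forwarded to the metric function as **kwargs.
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true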
@@ -225,4 +225,3 @@ Generative tasks:
Tasks using complex filtering:
- GSM8k with CoT (+ with Self-Consistency): (`lm_eval/tasks/gsm8k/gsm8k-cot.yaml` ; `lm_eval/tasks/gsm8k/gsm8k-cot-self-consistency.yaml`)
@@ -21,21 +21,16 @@ As a concrete example, we'll walk through reimplementing the `gsm8k` benchmark (
## Creating a YAML file
-- Tasks in eval harness are largely implemented via YAML files.
-- mention the tasks worth "forking"/building off of
-- Step through the different args all tasks will need
-To implement a new standard task, we'll need to write a YAML file which configures our task logic. We start by making a new empty YAML file:
+To implement a new standard task, we'll need to write a YAML file which configures our task logic. We start by making a new empty YAML file. This file can have any name, but we recommend placing it in a subfolder of `lm_eval/tasks` titled by the dataset or task's shorthand name: for example,
```sh
-touch lm_eval/tasks/new_mcqa.yaml
+touch lm_eval/tasks/<dataset_name>/<my_new_task_name>.yaml
```
-or
+Or, copy the template subfolder we provide from `templates/new_yaml_task`:
```sh
-touch lm_eval/tasks/new_generative_task.yaml
+cp -r templates/new_yaml_task lm_eval/tasks/
```
+and rename the folders and YAML file(s) as desired.
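
A quick way to sanity-check the new file as you fill it in is to parse it with PyYAML; this is just a hedged convenience sketch (the path is a placeholder), not part of the harness:

```python
import yaml

# Placeholder path: point this at the YAML file you just created.
with open("lm_eval/tasks/my_dataset/my_new_task.yaml") as f:
    cfg = yaml.safe_load(f)

assert isinstance(cfg, dict), "a task config should parse to a mapping of fields"
print(sorted(cfg))  # the keys should be TaskConfig fields (task, dataset_path, ...)
```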
### Selecting and configuring a dataset

@@ -241,16 +236,17 @@ The checklist is the following:
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
-* [ ] Has the task been checked for equivalence with the original paper's methodology?
-* [ ] Is the task in Eval-harness v0.3.0 or earlier?
-* [ ] If so, has it been checked for regression from earlier versions? If there is a change in results, is it justified by matching the original authors' intended setup?
+* [ ] Have you referenced the original paper that introduced the task?
+* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
+It is recommended to include a filled-out copy of this checklist in the README.md for the subfolder you are creating, if you have created a new subfolder in `lm_eval/tasks`.
## Submitting your task
You're all set! Now push your work and make a pull request to the `big-refactor` branch! Thanks for the contribution :). If there are any questions, please leave a message in the `#lm-thunderdome` channel on the EAI discord!
@@ -20,6 +20,8 @@ def median(arr):
    return arr[len(arr) // 2]


+# Certain metrics must be calculated across all documents in a benchmark.
+# We use them as aggregation metrics, paired with no-op passthrough metric fns.
@register_aggregation("perplexity")
def perplexity(items):
    return math.exp(-mean(items))
@@ -35,6 +37,25 @@ def bits_per_byte(items):
    return -weighted_mean(items) / math.log(2)


+@register_aggregation("f1")
+def f1_score(items):
+    unzipped_list = list(zip(*items))
+    golds = unzipped_list[0]
+    preds = unzipped_list[1]
+    fscore = sklearn.metrics.f1_score(golds, preds)
+    return np.max(fscore)
+
+
+@register_aggregation("matthews_corrcoef")
+def matthews_corrcoef(items):
+    unzipped_list = list(zip(*items))
+    golds = unzipped_list[0]
+    preds = unzipped_list[1]
+    # print(preds)
+    return sklearn.metrics.matthews_corrcoef(golds, preds)
+
+
@register_metric(
    metric="acc",
    higher_is_better=True,
@@ -119,27 +140,24 @@ def mean_stderr(arr):
    return sample_stddev(arr) / math.sqrt(len(arr))


-@register_metric(metric="matthews_corrcoef", higher_is_better=True, aggregation="mean")
-def matthews_corrcoef(items):
-    unzipped_list = list(zip(*items))
-    golds = unzipped_list[0]
-    preds = unzipped_list[1]
-    return sklearn.metrics.matthews_corrcoef(golds, preds)
+@register_metric(
+    metric="mcc",
+    higher_is_better=True,
+    output_type="multiple_choice",
+    aggregation="matthews_corrcoef",
+)
+def mcc_fn(items):  # This is a passthrough function
+    return items


@register_metric(
    metric="f1",
    higher_is_better=True,
    output_type="multiple_choice",
-    aggregation="mean",
+    aggregation="f1",
)
-def f1_score(items):
-    unzipped_list = list(zip(*items))
-    golds = unzipped_list[0]
-    preds = unzipped_list[1]
-    fscore = sklearn.metrics.f1_score(golds, preds)
-    return np.max(fscore)
+def f1_fn(items):  # This is a passthrough function
+    return items


@register_metric(
......
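The comment added at the top of this file explains the pattern behind the new `f1` and `mcc` entries: the per-document metric function is a passthrough that just forwards the `(gold, pred)` pair, and the real computation happens once, corpus-wide, in the registered aggregation. A toy, self-contained sketch of that flow (not the harness's actual plumbing):

```python
import sklearn.metrics


def mcc_fn(items):  # passthrough "metric": forward the (gold, pred) pair untouched
    return items


def matthews_corrcoef_agg(items):  # aggregation: computed once over all documents
    golds, preds = zip(*items)
    return sklearn.metrics.matthews_corrcoef(golds, preds)


# One (gold, pred) pair per document, collected across the whole benchmark:
per_doc = [mcc_fn((gold, pred)) for gold, pred in [(1, 1), (0, 1), (1, 0), (0, 0)]]
print(matthews_corrcoef_agg(per_doc))  # a single corpus-level MCC value
```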
import abc
+from typing import Union

from lm_eval import utils
......
@@ -26,7 +26,12 @@ def register_model(*names):
def get_model(model_name):
-    return MODEL_REGISTRY[model_name]
+    try:
+        return MODEL_REGISTRY[model_name]
+    except KeyError:
+        raise ValueError(
+            f"Attempted to load model '{model_name}', but no model for this name found! Supported model names: {', '.join(MODEL_REGISTRY.keys())}"
+        )


TASK_REGISTRY = {}
@@ -74,10 +79,7 @@ DEFAULT_METRIC_REGISTRY = {
        "acc",
    ],
    "loglikelihood_rolling": ["word_perplexity", "byte_perplexity", "bits_per_byte"],
-    "multiple_choice": [
-        "acc",
-        "acc_norm"
-    ],
+    "multiple_choice": ["acc", "acc_norm"],
    "greedy_until": ["exact_match"],
}
@@ -135,7 +137,6 @@ searching in HF Evaluate library..."
def register_aggregation(name):
-    # TODO: should we enforce a specific interface to aggregation metrics?
    def decorate(fn):
        assert (
            name not in AGGREGATION_REGISTRY
......
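Two registry behaviours worth noting from the hunks above: unknown model names now raise a `ValueError` listing the supported names, and `register_aggregation` refuses duplicate names. A short usage sketch (the import path is assumed from the file layout, and the names used are hypothetical):

```python
from lm_eval.api.registry import get_model, register_aggregation  # path assumed

try:
    get_model("not-a-real-model")  # hypothetical bad name
except ValueError as err:
    print(err)  # "... Supported model names: openai, openai-completions, gooseai, ..."


@register_aggregation("my_median")  # hypothetical; must not collide with an existing name
def my_median(items):
    items = sorted(items)
    return items[len(items) // 2]
```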
@@ -108,7 +108,21 @@ class TaskConfig(dict):
        return getattr(self, item)

    def to_dict(self):
-        return asdict(self)
+        """dumps the current config as a dictionary object, as a printable format.
+        null fields will not be printed.
+        Used for dumping results alongside full task configuration
+        :return: dict
+            A printable dictionary version of the TaskConfig object.
+        # TODO: should any default value in the TaskConfig not be printed?
+        """
+        cfg_dict = asdict(self)
+        # remove values that are `None`
+        for k, v in list(cfg_dict.items()):
+            if v is None:
+                cfg_dict.pop(k)
+        return cfg_dict


class Task(abc.ABC):
@@ -663,6 +677,7 @@ class ConfigurableTask(Task):
        else:
            if self._config.num_fewshot > 0:
                eval_logger.warning(
+                    f"Task '{self._config.task}': "
                    "num_fewshot > 0 but fewshot_split is None. "
                    "using preconfigured rule."
                )
@@ -852,7 +867,8 @@ class ConfigurableTask(Task):
            result_dict = {
                **({"acc": acc} if "acc" in use_metric else {}),
-                **({"f1": (pred, gold)} if "f1" in use_metric else {}),
+                **({"f1": (gold, pred)} if "f1" in use_metric else {}),
+                **({"mcc": (gold, pred)} if "mcc" in use_metric else {}),
                **({"acc_norm": acc_norm} if "acc_norm" in use_metric else {}),
            }
......
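The new `to_dict` above drops any field left at `None` before dumping. A stand-in illustration of that behaviour (`DemoConfig` is a toy dataclass, not the harness's `TaskConfig`):

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class DemoConfig:  # toy stand-in for TaskConfig
    task: Optional[str] = None
    dataset_path: Optional[str] = None
    dataset_name: Optional[str] = None

    def to_dict(self):
        cfg_dict = asdict(self)
        for k, v in list(cfg_dict.items()):
            if v is None:  # drop unset fields, as in the diff above
                cfg_dict.pop(k)
        return cfg_dict


print(DemoConfig(task="demo", dataset_path="glue").to_dict())
# -> {'task': 'demo', 'dataset_path': 'glue'}
```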
@@ -167,8 +167,9 @@ def evaluate(
    # get lists of each type of request
    for task_name, task in task_dict.items():
        versions[task_name] = task.VERSION
-        # TODO: don't access a private attribute here ; for non-YAML tasks handle this case
-        configs[task_name] = dict(task.dump_config())
+        configs[task_name] = dict(
+            task.dump_config()
+        )  # TODO: don't access a private attribute here ; for non-YAML tasks handle this case

        # deterministically shuffle docs and chop off the first `limit` because sometimes docs are in some kind of order
        # task_docs = list(task_doc_func())
@@ -316,7 +317,11 @@ def evaluate(
            if stderr is not None:
                results[task_name][metric + "_stderr" + "," + key] = stderr(items)

-        return {"results": dict(results), "configs": dict(configs), "versions": dict(versions)}
+        return {
+            "results": dict(results),
+            "configs": dict(configs),
+            "versions": dict(versions),
+        }
    else:
        return None
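Downstream code sees the reshaped return value above as a plain dict with `results`, `configs`, and `versions` keys. A hedged sketch of consuming it (how you obtain `output` from `evaluate` is elided here):

```python
def summarize(output):
    # `evaluate` can also return None (the `else` branch above); guard for that.
    if output is None:
        return
    for task_name, metrics in output["results"].items():
        version = output["versions"].get(task_name)
        print(f"{task_name} (version {version}): {metrics}")
```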
@@ -57,7 +57,7 @@ def oa_completion(**kwargs):
            backoff_time *= 1.5


-@register_model("openai", "gooseai")
+@register_model("openai", "openai-completions", "gooseai")
class GPT3LM(LM):
    REQ_CHUNK_SIZE = 20
......
# v1.0 Tasks
This list keeps track of which tasks' implementations have been ported to YAML / v2.0 of the Eval Harness.
-Boxes should be checked iff tasks are implemented in v2.0 and tested for regression. Tasks should be struck through if checked *against original introducing paper* implementation or popularizing implementation.
+Boxes should be checked iff tasks are implemented in the refactor and tested for regression. Tasks should be struck through if checked *against original introducing paper* implementation or popularizing implementation.
- [ ] Glue
-- [ ] SuperGlue
+- [x] SuperGlue
- [ ] CoQA
- [ ] DROP
- [x] ~~Lambada~~
@@ -31,7 +31,7 @@ Boxes should be checked iff tasks are implemented in v2.0 and tested for regress
- [ ] WebQs
- [ ] WSC273
- [ ] Winogrande
-- [ ] ANLI
+- [x] ANLI
- [ ] Hendrycks Ethics
- [ ] TruthfulQA
- [ ] MuTual
......
@@ -3,6 +3,7 @@ from typing import List, Union
from .gsm8k import *
from .triviaqa import *
+from .glue import *

from lm_eval import utils
from lm_eval.logger import eval_logger
@@ -66,7 +67,7 @@ def get_task(task_name, config):
        return TASK_REGISTRY[task_name](config=config)
    except KeyError:
        eval_logger.info("Available tasks:")
-        eval_logger.info(ALL_TASKS)
+        eval_logger.info(list(TASK_REGISTRY) + list(GROUP_REGISTRY))
        raise KeyError(f"Missing task {task_name}")
......
include: pile_arxiv.yaml
task: pile_pubmed-abstracts
dataset_name: pile_pubmed-abstracts

include: pile_arxiv.yaml
task: pile_pubmed-central
dataset_name: pile_pubmed-central

include: pile_arxiv.yaml
task: pile_stackexchange
dataset_name: pile_stackexchange

include: pile_arxiv.yaml
task: pile_ubuntu-irc
dataset_name: pile_ubuntu-irc
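
These Pile sub-task configs reuse a shared base via `include`, overriding only `task` and `dataset_name`. A conceptual sketch of that composition (the override-wins merge semantics and the example path are assumptions, not the harness's actual loader):

```python
import os

import yaml


def load_with_include(path):
    """Naive include-merge: keys in the including file override the included base."""
    with open(path) as f:
        cfg = yaml.safe_load(f)
    base_name = cfg.pop("include", None)
    if base_name:
        base = load_with_include(os.path.join(os.path.dirname(path), base_name))
        base.update(cfg)
        cfg = base
    return cfg


# e.g. load_with_include("lm_eval/tasks/pile/pile_pubmed-abstracts.yaml")  # path assumed
```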