The `lm-evaluation-harness` is meant to be an extensible and flexible framework within which many different evaluation tasks can be defined. All tasks in the new version of the harness are built around a YAML configuration file format.
These YAML configuration files, along with the current codebase commit hash, are intended to be shareable, such that providing the YAML config enables another researcher to precisely replicate the evaluation setup used, in cases where the prompt or setup differs from standard `lm-eval` task implementations.
While adding a standard evaluation task on a new dataset can occasionally be as simple as swapping out a Hugging Face dataset path in an existing file, more specialized evaluation setups call for more involved configuration. Here we'll provide a crash course on the more advanced logic available to users in YAML form.
If your intended task relies on features beyond what is described in this guide, we'd love to hear about it! Feel free to open an issue describing the scenario on GitHub, create a PR to the project with a proposed implementation, or ask in the `#lm-thunderdome` channel on the EleutherAI Discord.
...
...
Tasks are configured via the `TaskConfig` object. Below, we describe all fields usable within the object; a minimal example config using these fields is sketched after the list.
- **task** (`str`, defaults to None) — name of the task.
- **group** (`str`, *optional*) — name of the task group(s) a task belongs to. Enables one to run all tasks with a specified tag or group name at once.
- **reference** (`str`, *optional*) —
- **dataset_path** (`str`) — The name of the dataset as listed by HF in the datasets Hub.
- **dataset_name** (`str`, *optional*, defaults to None) — The name of what HF calls a “data instance” or sub-task of the benchmark. If your task does not contain any data instances, just leave this set to the default of None. (If you're familiar with the HF `datasets.load_dataset` function, these are just the first two arguments to it.)
- **dataset_kwargs** (`dict`, *optional*) — Auxiliary arguments that `datasets.load_dataset` accepts. This can be used to specify arguments such as `data_files` or `data_dir` if you want to use local datafiles such as json or csv.
- **training_split** (`str`, *optional*) — Split in the dataset to use as the training split.
- **validation_split** (`str`, *optional*) — Split in the dataset to use as the validation split.
- **test_split** (`str`, *optional*) — Split in the dataset to use as the test split.
- **fewshot_split** (`str`, *optional*) — Split in the dataset from which to draw few-shot exemplars. Must not be None if `num_fewshot` > 0.
- **template_aliases** (`str`, *optional*) —
- **aliases** (`Union[str, list]`, *optional*) —
- **doc_to_text** (`Union[Callable, str]`, *optional*) — Jinja2, f-string, or function to process a sample into the appropriate input for the model.
- **doc_to_target** (`Union[Callable, str]`, *optional*) — Jinja2, f-string, or function to process a sample into the appropriate target output for the model.
...
...
- **batch_size** (`int`, *optional*, defaults to 1) — Batch size.
- **repeats** (`int`, *optional*, defaults to 1) — Number of repeated runs for each sample. Can be used for cases such as self-consistency.
- **metric_list** (`str`, *optional*, defaults to None) — A list of metrics to use for evaluation. See docs for expected format.
- **gold_alias** (`str`, *optional*, defaults to None) — If provided, used to generate the reference answer that is scored against. Intended for cases where `doc_to_target` should be the "target string" appended to each example's input when formatting few-shot exemplars: `doc_to_target` is then used for the few-shot examples, while the `gold` input to the metric function comes from `gold_alias`.
- **output_type** (`str`, *optional*, defaults to "greedy_until") — Selects the type of model output for the given task. Options are `greedy_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`.
- **generation_kwargs** (`dict`, *optional*) — Auxiliary arguments for the `generate` function from HF transformers library. Advanced keyword arguments may not be supported for non-HF LM classes.
- **delimiter** (`str`, *optional*, defaults to "\n\n") — String to insert between few-shot examples.
- **filter_list** (`Union[str, list]`, *optional*) — List of filters to postprocess model outputs. See below for further detail on the filter API.
- **should_decontaminate** (`bool`, *optional*, defaults to False) —
- **use_prompt** (`str`, *optional*) — Name of prompt in `promptsource` to use. If defined, this will overwrite `doc_to_text` and `doc_to_target`.
- **metadata** (`str`, *optional*) — An optional field where arbitrary metadata can be passed.
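To give a sense of how these fields fit together, here is a minimal sketch of a task config. Only a subset of the fields above is shown, and the task name, dataset path, and the field names referenced inside the Jinja2 templates are placeholders rather than a real task shipped with the harness:

```yaml
task: my_new_task                 # placeholder task name
dataset_path: my_org/my_dataset   # placeholder; the dataset's name on the HF Hub
dataset_name: null                # no sub-task / "data instance"
training_split: train
validation_split: validation
test_split: test
fewshot_split: train
output_type: greedy_until
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{answer}}"
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
```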
## Filters
...
...
After getting scores or output text from our LM on each `Instance` or document in our task's dataset, we pass these model outputs to a metric function to produce scores.
However, certain tasks may require more complex behavior than directly turning over model outputs to a metric function. For example, we may want to post-process our output text by truncating it or extracting a model's answer, or we may want to ensemble over multiple "takes" on the same document, et cetera.
**Detailed Aside**:
We do such post-processing by operating on *responses*, which are stored in `Instance.resps` after running an LM on an `Instance` from the task.
`resps` is a `List[str]` for each instance, and we pass a `List[List[<expected return type from model>]]` to our filters that is a list of `[instance.resps for instance in instances]`.
Our filters, after completing a pipeline, must return a `List[<expected return type from model>]`, which we then unpack, storing each element in `Instance.filtered_resps` for the corresponding instance. Thus, for each doc we take as input a list of responses from our model, and must return a single model response *without it being wrapped in a list*, as illustrated in the sketch below.
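To make the shape contract concrete, here is a plain-Python illustration of the transformations involved. This is not the harness's actual filter class API, only a sketch of the list shapes a pipeline must consume and produce:

```python
from collections import Counter
from typing import List

def take_first(resps_per_doc: List[List[str]]) -> List[str]:
    # Input: one inner list of model responses per document (each `instance.resps`).
    # Output: exactly one response per document, no longer wrapped in a list,
    # matching what is stored in `Instance.filtered_resps`.
    return [doc_resps[0] for doc_resps in resps_per_doc]

def majority_vote(resps_per_doc: List[List[str]]) -> List[List[str]]:
    # An intermediate stage can keep the per-document list shape: here each
    # document's responses are reduced to the single most common answer,
    # still wrapped in a list so a later take_first stage can unwrap it.
    return [[Counter(doc_resps).most_common(1)[0][0]] for doc_resps in resps_per_doc]
```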
We are able to provide multiple different filter pipelines, each with their own name and list of filters to apply in sequence.
Our first filter pipeline implements the following steps, sketched in YAML after this list:
- applying a regex to the model generations (extracting the number within the phrase "The answer is (number)")
- selecting only the first out of the 64 model answers
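As a rough sketch of how this first pipeline might be expressed under `filter_list` (the pipeline name and regex pattern are illustrative, and the `regex`/`regex_pattern` keys are assumed spellings of the harness's regex filter; `take_first` appears in the later examples):

```yaml
filter_list:
  - name: "get-answer"   # illustrative pipeline name
    filter:
      - function: "regex"
        regex_pattern: "The answer is (\\-?[0-9\\.\\,]+)"
      - function: "take_first"
```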
...
...
Our second filter pipeline, "maj@64", does majority voting across all 64 answers via:
```yaml
      - function: "take_first"
```
Our final filter pipeline, "maj@8", does majority voting across the first 8 of the model's responses per document via:
- subsetting the len-64 list of responses `[answer1, answer2, ..., answer64]` to `[answer1, answer2, ..., answer8]` for each document
- performing the same sequence of filters on these new sets of 8 responses, for each document.
```yaml
...
...
      - function: "take_first"
```
Thus, given the 64 responses from our LM on each document, we can report metrics on these responses in these 3 different ways, as defined by our filter pipelines.
## Embedded Python Code
...
...
You can find an example of how to use this feature in `lm_eval/tasks/gsm8k/gsm8k-cot-self-consistency.yaml`.
## Passing Arguments to Metrics
Metrics can be defined in the `metric_list` argument when building the YAML config. Multiple metrics can be listed along with any auxiliary arguments. For example, when using the [`exact_match` metric](https://github.com/huggingface/evaluate/tree/main/metrics/exact_match), auxiliary arguments such as `ignore_case`, `ignore_punctuation`, or `regexes_to_ignore` can be listed as well. They will be passed to the metric function as `kwargs`. Some metrics have predefined values for `aggregation` and `higher_is_better`, so listing only the metric name can be sufficient.
```yaml
metric_list:
...
...
## Good Reference Tasks
Contributing a new task can be daunting! Luckily, much of the work has often been done for you in a different, similarly evaluated task. Good examples of task implementations to study include:
Multiple choice tasks:
- SciQ (`lm_eval/tasks/sciq/sciq.yaml`)
Corpus perplexity evaluations:
...
...
Generative tasks:
Tasks using complex filtering:
- GSM8k with CoT (+ with Self-Consistency): (`lm_eval/tasks/gsm8k/gsm8k-cot.yaml` ; `lm_eval/tasks/gsm8k/gsm8k-cot-self-consistency.yaml`)
As a concrete example, we'll walk through reimplementing the `gsm8k` benchmark.
## Creating a YAML file
In this section, we will:
- note that tasks in the eval harness are largely implemented via YAML files
- mention the tasks worth "forking"/building off of
- step through the different args all tasks will need
To implement a new standard task, we'll need to write a YAML file which configures our task logic. We start by making a new empty YAML file. This file can have any name, but we recommend placing it in a subfolder of `lm_eval/tasks` titled by the dataset or task's shorthand name: for example,
```sh
touch lm_eval/tasks/new_generative_task.yaml
```
Or, copy the template subfolder we provide from `templates/new_yaml_task`:
```sh
cp -r templates/new_yaml_task lm_eval/tasks/
```
and rename the folders and YAML file(s) as desired.
### Selecting and configuring a dataset
...
...
## Checking performance + equivalence
It's now time to check models' performance on your task! In the evaluation harness, we intend to support a wide range of evaluation tasks and setups, but prioritize the inclusion of already-proven benchmarks following the precise evaluation setups in the literature where possible.
To enable this, we provide a checklist that should be completed when contributing a new task, to enable accurate book-keeping and to ensure that tasks added to the library are well-tested and, where applicable, precedented.
### Task impl. checklist
The checklist is the following:
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Has the task been checked for equivalence with the original paper's methodology?
* [ ] Is the task in Eval-harness v0.3.0 or earlier?
* [ ] If so, has it been checked for regression from earlier versions? If there is a change in results, is it justified by matching the original authors' intended setup?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
It is recommended to include a filled-out copy of this checklist in the README.md for the subfolder you are creating, if you have created a new subfolder in `lm_eval/tasks`.
## Submitting your task
You're all set! Now push your work and make a pull request to the `big-refactor` branch! Thanks for the contribution :). If there are any questions, please leave a message in the `#lm-thunderdome` channel on the EleutherAI Discord!
This folder is meant to contain instructions and task setups required to evaluate certain papers which may perform non-standard evaluation setups.
Tasks can be supported already in the library under `lm_eval/tasks`, or if highly paper-specific, may remain as YAMLs in the respective `examples/paper-title` folder.
...
...
* All setups from GPT-3 Paper
* Varying few-shot orderings + selection ; Varying the label choices for multiple-choice tasks
This list keeps track of which tasks' implementations have been ported to YAML / v2.0 of the Eval Harness.
Boxes should be checked iff tasks are implemented in the refactor and tested for regression. Tasks should be struck through if checked *against original introducing paper* implementation or popularizing implementation.
- [ ] Glue (WIP)
- [x] SuperGlue
- [ ] CoQA
- [ ] DROP
- [x] ~~Lambada~~
...
...