Unverified Commit 79b972d6 authored by Hailey Schoelkopf, committed by GitHub

[Refactor] [WIP] New YAML advanced docs (#567)



* add wip gsm8k yaml

* cleanup tasks dir

* push gsm8k yaml changes

* rename gpt2.py

* add updated gsm8k , triviaqa baseline

* add new cot yaml

* allow for multiple filter pipelines, new filter types

* updated gsm8k + sampling gen configs

* cleanup self-consistency yaml

* push outline for advanced docs

* push docs checklist

* switch to inheritance for many tasks

* acc_norm and acc_mutual_info fixed

* fix missing newline in error msg

* remove many .py tasks

* updated GSM8k

* added more doc

* Update advanced_task_guide.md

Added list of parameters

* Update advanced_task_guide.md

* Added details on listing metrics

* Update advanced_task_guide.md

* Added more explanation

* modify current default filter name

* add new tags to tasks

* remove a lingering print()

* add rest of param docs, cleanup deprecated fields

* push docs update

* move ALL_TASKS definition location

* confirm write_out.py works if no description dict passed

---------
Co-authored-by: lintangsutawika <lintang@sutawika.com>
parent 761f0087
Tracking progress on revamping documentation pages for the refactor of LM-Evaluation-Harness.
## Desired Pages
* [ ] YAML explainer
* [ ] Explainer on filters + advanced features
* [ ] Walkthrough start-to-finish of adding a new task to codebase
* [ ] Explaining registries + decorators
* [ ] model_guide.md for adding new model API
* [ ] guide to writing an adapter to new advanced codebase (e.g. NeoX)
* [ ] Parallelism guide (?)
\ No newline at end of file
# Advanced Task Configuration
The `lm-evaluation-harness` is meant to be an extensible and flexible framework within which many different evaluation tasks can be defined. All tasks in the new version of the harness are built around a YAML configuration file format.
These YAML configuration files, along with the current codebase commit hash, are intended to be shareable: providing the YAML config enables another researcher to precisely replicate your evaluation setup, in cases where the prompt or setup differs from the standard `lm-eval` task implementations.
While adding a standard evaluation task on a new dataset can occasionally be as simple as swapping out a Hugging Face dataset path in an existing file, more specialized evaluation setups require extra configuration. Here we'll provide a crash course on the more advanced logic that users can implement in YAML form.
If your intended task relies on features beyond what is described in this guide, we'd love to hear about it! Feel free to open an issue describing the scenario on GitHub, create a PR to the project with a proposed implementation, or ask in the `#lm-thunderdome` channel on the EleutherAI Discord.
## Configurations
### Parameters
- **task** (`str`, defaults to None) — name of the task.
- **group** (`str`, *optional*) — name of the task group(s) a task belongs to. Enables one to run all tasks with a specified tag or group name at once.
- **reference** (`str`, *optional*) —
- **dataset_path** (`str`) — The name of the dataset as listed by HF in the datasets Hub.
- **dataset_name** (`str`, *optional*, defaults to None) — The name of what HF calls a “data instance” or sub-task of the benchmark. If your task does not contain any data instances, just leave this as the default `None`. (If you're familiar with the HF `datasets.load_dataset` function, these are just the first two arguments to it.)
- **dataset_kwargs** (`dict`, *optional*) — Auxiliary arguments that `datasets.load_dataset` accepts. This can be used to specify arguments such as `data_files` or `data_dir` if you want to use local data files such as JSON or CSV.
- **training_split** (`str`, *optional*) — Split in the dataset to use as the training split.
- **validation_split** (`str`, *optional*) — Split in the dataset to use as the validation split.
- **test_split** (`str`, *optional*) — Split in the dataset to use as the test split.
- **fewshot_split** (`str`, *optional*) — Split in the dataset from which to draw few-shot examples. Should not be None when `num_fewshot` > 0, and generally should not be the same split being evaluated on.
- **template_aliases** (`str`, *optional*) —
- **aliases**: (`Union[str, list]`, *optional*) —
- **doc_to_text** (`Union[Callable, str]`, *optional*) — Jinja2, f-string, or function to process a sample into the appropriate input for the model
- **doc_to_target** (`Union[Callable, str]`, *optional*) — Jinja2, f-string, or function to process a sample into the appropriate target output for the model
- **num_fewshot** (`int`, *optional*, defaults to 0) — Number of few-shot examples before the input.
- **batch_size** (`int`, *optional*, defaults to 1) — Batch size.
- **repeats** (`int`, *optional*, defaults to 1) — Number of repeated runs for each sample. Can be used for cases such as self-consistency.
- **metric_list** (`str`, *optional*, defaults to None) — A list of metrics to use for evaluation. See docs for expected format.
- **gold_alias** (`str`, *optional*, defaults to None) — If provided, used to generate the reference answer that is scored against. Useful when `doc_to_target` should be the "target string" appended to each few-shot exemplar's input, while the value passed to the metric function as `gold` should instead come from `gold_alias`.
- **output_type** (`str`, *optional*, defaults to "greedy_until") — Selects the type of model output for the given task. Options are `greedy_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`.
- **generation_kwargs** (`dict`, *optional*) — Auxiliary arguments for the `generate` function from HF transformers library. Advanced keyword arguments may not be supported for non-HF LM classes.
- **delimiter** (`str`, *optional*, defaults to "\n\n") — String to insert between few-shot examples.
- **filter_list** (`Union[str, list]`, *optional*) — List of filters to postprocess model outputs. See below for further detail on the filter API.
- **should_decontaminate** (`bool`, *optional*, defaults to False) -
- **doc_to_decontamination_query** (`str`, *optional*) —
- **use_prompt** (`str`, *optional*) — Name of a prompt in Promptsource to use. If defined, this will overwrite `doc_to_text` and `doc_to_target`.
- **metadata** (`str`, *optional*) — An optional field where arbitrary metadata can be passed.
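To make these fields concrete, here is a hypothetical minimal configuration for a generative task, loosely modeled on the GSM8k YAML added in this PR. The task name, prompt strings, and metric settings below are illustrative assumptions rather than the shipped config:
```yaml
group:
  - math_word_problems
task: gsm8k_demo                 # hypothetical name for illustration
dataset_path: gsm8k              # HF Hub dataset
dataset_name: main               # HF dataset config / "data instance"
output_type: greedy_until
training_split: train
test_split: test
doc_to_text: "Question: {{question}}\nAnswer:"   # assumed prompt format
doc_to_target: "{{answer}}"                      # assumed target field
num_fewshot: 5
generation_kwargs:
  until:
    - "Question:"
  do_sample: false
  temperature: 0.0
metric_list:
  - metric: exact_match
```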
## Filters
Explain: What are filters? What is their place in the pipeline?
Format of the `resps` object, and what needs to happen to yield proper scorable results
TODO: triviaqa is implementable if we don't use `take_first` and implement a multi-alias exact_match_any metric
TODO: Filters might warrant a separate doc.
### Multiple Filter Pipelines
On the same set of model outputs, we can run multiple distinct filtering pipelines in parallel (see the sketch below).
Case study: gsm8k-CoT-self-consistency
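As a rough sketch only (the exact filter function names and YAML schema are still in flux in this PR; `take_first` appears in the codebase, while `regex` and `majority_vote` are assumptions), two parallel pipelines over the same sampled generations might be declared like this:
```yaml
filter_list:
  - name: "get-answer"          # score only the first sampled chain-of-thought
    filter:
      - function: "regex"       # assumed filter: extract the final numeric answer
        regex_pattern: "(-?[0-9.,]+)"
      - function: "take_first"
  - name: "maj@k"               # score a majority vote over all sampled generations
    filter:
      - function: "regex"
        regex_pattern: "(-?[0-9.,]+)"
      - function: "majority_vote"  # assumed filter name
      - function: "take_first"
```
Each named pipeline would then be reported as its own set of metric results over the same underlying model responses.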
### "Splitting" Pipelines
TODO: either allow for pipelines that "split" and report multiple keys, or something different. We in particular want to support not re-running reward /scoring models on every different filter pipeline if can be shared.
## Embedded Python Code
There could be cases where Jinja 2 or a simple f-string format won't cut it. For tasks like these, we additionally support importing Python helper functions that can be injected directly into the YAML. Note that the helper script must be in the same directory as the YAML file.
TODO: document the `!function filename.pythonfunctionname` syntax here.
TODO: add permanent link to wikitext.yaml and super_glue_cb.yml
```
wikitext.yaml and helper fn go here
```
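Until the permanent links above are added, here is a hedged sketch of the `!function filename.pythonfunctionname` syntax. The file and function names are entirely hypothetical; the only assumption carried over from this PR is that the helper `.py` file lives next to the YAML:
```yaml
# new_task.yaml (hypothetical)
doc_to_text: !function utils.doc_to_text   # resolves to doc_to_text() in utils.py beside this YAML
doc_to_target: "{{answer}}"
```
```python
# utils.py (hypothetical helper, same directory as the YAML)
def doc_to_text(doc):
    # Build the model input from a dataset row; field names are illustrative.
    return f"Question: {doc['question']}\nAnswer:"
```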
## (No Longer Recommended) Direct `Task` Subclassing
The prior implementation method for new tasks was to subclass `Task`. While we intend to migrate all tasks to the new YAML implementation option going forward, it remains possible to subclass the `Task` class directly.
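For reference, here is a minimal sketch of what direct subclassing looks like, modeled on the LAMBADA implementation included in this codebase. The task name, dataset, and prompt details are placeholders, and only the method names follow the `Task` API as used there:
```python
from lm_eval.api.task import Task
from lm_eval.api.instance import Instance
from lm_eval.api.metrics import mean, perplexity
from lm_eval.api.registry import register_task


@register_task("my_custom_task")  # hypothetical task name
class MyCustomTask(Task):
    VERSION = "2.0"
    DATASET_PATH = "my_org/my_dataset"  # placeholder HF dataset
    DATASET_NAME = None
    OUTPUT_TYPE = "loglikelihood"

    def has_training_docs(self):
        return False

    def has_validation_docs(self):
        return True

    def has_test_docs(self):
        return False

    def validation_docs(self):
        return self.dataset["validation"]

    def doc_to_text(self, doc):
        # Model input; field names are illustrative.
        return f"Question: {doc['question']}\nAnswer:"

    def doc_to_target(self, doc):
        # Target string, prefixed with the single connecting whitespace.
        return " " + doc["answer"]

    def construct_requests(self, doc, ctx, **kwargs):
        return Instance(
            request_type=self.OUTPUT_TYPE,
            doc=doc,
            arguments=(ctx, self.doc_to_target(doc)),
            **kwargs,
        )

    def process_results(self, doc, results):
        # results is a list of (loglikelihood, is_greedy) tuples for this request type.
        ll, is_greedy = results[0]
        return {"ppl": ll, "acc": int(is_greedy)}

    def aggregation(self):
        return {"ppl": perplexity, "acc": mean}

    def higher_is_better(self):
        return {"ppl": False, "acc": True}
```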
## Configuring Tasks with YAMLs
You can easily configure a task evaluation using YAML files; this allows for a faster and easier authoring experience.
### Doc to text
You can use Jinja 2 or f-strings to write a prompt template. To set a mapping of verbalizer to label, you can define that mapping in the Jinja string directly, as sketched below.
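For example, assuming a dataset with `premise` and `hypothesis` text columns and an integer `label` column (field names here are hypothetical), the verbalizer mapping can live inside the Jinja expression itself:
```yaml
doc_to_text: "{{premise}}\nQuestion: {{hypothesis}} True or False?\nAnswer:"
doc_to_target: "{{ ['True', 'False'][label] }}"   # maps label 0 -> "True", label 1 -> "False"
```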
## Including a Base YAML
You can base a YAML on another YAML file and use it as a template. This can be handy when you just need to change the prompt for `doc_to_text` but keep the rest the same, or change `filters` to compare which works better. Simply use `include` in the YAML file and write the name of the template you want to base it on. This assumes that the base template is in the same directory; otherwise, you will need to provide the full path.
```
include: <YAML file or with full path>
...
```
You can find an example of how to use this feature at [gsm8k-cot-self-consistency.yaml](https://github.com/EleutherAI/lm-evaluation-harness/blob/3c07cc04a92fc467d7c9a94894aeddd58c93a5da/lm_eval/tasks/gsm8k/gsm8k-cot-self-consistency.yaml), which is based on [gsm8k-cot.yaml](https://github.com/EleutherAI/lm-evaluation-harness/blob/3c07cc04a92fc467d7c9a94894aeddd58c93a5da/lm_eval/tasks/gsm8k/gsm8k-cot.yaml).
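For instance, a hypothetical variant that reuses everything from `gsm8k-cot.yaml` but swaps the prompt might look like this (the override values are illustrative, not the actual self-consistency config):
```yaml
include: gsm8k-cot.yaml          # base template, assumed to sit in the same directory
task: gsm8k-cot-alt-prompt       # hypothetical name for the variant
doc_to_text: "Q: {{question}}\nA: Let's think step by step."
```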
## Listing Metrics
Metrics can be defined in the `metric_list` argument when building the YAML config. Multiple metrics can be listed along with any auxiliary arguments. For example, when using `exact_match` (TODO: add URL to metric), auxiliary arguments such as `ignore_case`, `ignore_punctuation`, and `regexes_to_ignore` can be listed as well; they will be passed to the metric function as `kwargs`. Some metrics have predefined values for `aggregation` and `higher_is_better`, so listing only the metric name can be sufficient.
```
metric_list:
- metric: acc
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: false
regexes_to_ignore:
- ","
- "\\$"
```
## Using Promptsource
- load prompt from promptsource
## Good Reference Tasks
- This section should list some "canonized" task examples for different use cases / subcategories, as suggestions from which to build new tasks off of.
\ No newline at end of file
# New Task Guide
`lm-evaluation-harness` is a framework that strives to support a wide range of zero- and few-shot evaluation tasks on autoregressive language models (LMs).
This documentation page provides a walkthrough to get started creating your own task.
## Setup
If you haven't already, go ahead and fork the main repo, clone it, create a branch with the name of your task, and install the project requirements in your environment:
```sh
# After forking...
git clone https://github.com/<YOUR-USERNAME>/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout -b <task-name>
pip install -e ".[dev]"
```
As a concrete example, we'll walk through reimplementing the `gsm8k` benchmark (a *generative* task which requires sampling text from a model) and the `sciq` benchmark (a *discriminative*, or *multiple choice*, task where the model picks the most likely of several fixed answer choices).
## Creating a YAML file
- Tasks in eval harness are largely implemented via YAML files.
- mention the tasks worth "forking"/building off of
- Step through the different args all tasks will need
To implement a new standard task, we'll need to write a YAML file which configures our task logic. We start by making a new empty YAML file:
```sh
touch lm_eval/tasks/new_mcqa.yaml
```
or
```sh
touch lm_eval/tasks/new_generative_task.yaml
```
### Selecting and configuring a dataset
All data downloading and management is handled through the HuggingFace (**HF**) [`datasets`](https://github.com/huggingface/datasets) API. So, the first thing you should do is check to see if your task's dataset is already provided in their catalog [here](https://huggingface.co/datasets). If it's not in there, please consider adding it to their Hub to make it accessible to a wider user base by following their [new dataset guide](https://github.com/huggingface/datasets/blob/master/ADD_NEW_DATASET.md).
Once you have a HuggingFace dataset prepared for your task, we want to configure our new YAML to use this dataset:
```yaml
dataset_path: ... # the name of the dataset on the HF Hub.
dataset_name: ... # the dataset configuration to use. Leave `null` if your dataset does not require a config to be passed. See https://huggingface.co/docs/datasets/load_hub#configurations for more info.
dataset_kwargs: null # any extra keyword arguments that should be passed to the dataset constructor, e.g. `data_dir`.
```
Next, we'd like to tell our task what the dataset's train, validation, and test splits are named, if they exist:
```yaml
training_split: <split name of training set, or `null`>
validation_split: <split name of val. set, or `null`>
test_split: <split name of test set, or `null`>
```
Tests will run on the `test_split` if it is available, and otherwise evaluate on the `validation_split`.
We can also specify from which split the task should retrieve few-shot examples via:
```yaml
fewshot_split: <split name to draw fewshot examples from, or `null`>
```
though if this is not set, we will default to train/validation/test sets, in that order.
### Writing a prompt
The next thing we need to do is decide what format to use when presenting the data to the LM. This is our **prompt**, where we'll define both an input and output format.
We support the [Jinja 2](https://jinja.palletsprojects.com/en/3.1.x/) templating language for writing prompts. In practice, this means you can take your dataset's columns and do many basic string manipulations to place each document into prompted format.
To write a prompt, users are required to write two YAML fields in Jinja as strings:
```yaml
doc_to_text:
doc_to_target:
```
Suppose our dataset has a `"question"` field and an `"answer"` field, both strings. Given a `document` object that is a row of our dataset, we want the model to see:
```
Question: {document[question]}
Answer:
```
We do this by writing
```yaml
doc_to_text: "Question: {{question}}\nAnswer:"
```
Here, `{{question}}` will be replaced by `doc["question"]` when the prompt template is rendered.
Our intended output is for the model to predict a single whitespace, and then the answer to the question. We do this via:
```yaml
doc_to_target: "{{answer}}"
```
**Important**: We always add one whitespace between the input and output, such that the full input-output string is `doc_to_text(doc) + " " + doc_to_target(doc)`. `doc_to_text` and `doc_to_target` should not contain trailing right or left whitespace, respectively.
TODO: mention promptsource here, or reserve it for advanced guide
#### Multiple choice format
- template_aliases
- expected mcqa setup
### Setting metrics
You're almost done! Now we need to choose how to score our task.
- *If this is a multiple choice task:* do you just want to check your model's accuracy in choosing the correct answer choice?
- *If this is a generation task:* do you just want to check how often your model outputs *exactly the ground-truth output string provided*?
If the answer to the above is no: you'll need to record what scoring metrics to use! Metrics can be listed in the following format:
```yaml
metric_list:
- metric: <name of the metric here>
aggregation: <name of the aggregation fn here>
higher_is_better: <true or false>
- metric: ...
aggregation: ...
higher_is_better: ...
```
For a full list of natively supported metrics and aggregation functions see `TODO: we should list out all supported metrics, aggregations, models, somewhere in the docs.` All metrics supported in [HuggingFace Evaluate](https://github.com/huggingface/evaluate/tree/main/metrics) can also be used, and will be loaded if a given metric name is not one natively supported in `lm-eval`.
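For example, a metric name such as `bleu`, which is not implemented natively, would be expected to fall back to its Hugging Face Evaluate implementation under this scheme (treat the exact aggregation settings below as an assumption):
```yaml
metric_list:
  - metric: bleu               # not native to lm-eval; assumed to resolve via HF Evaluate
    aggregation: mean
    higher_is_better: true
```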
### Optional, more advanced setup
Some tasks may require more advanced processing logic than is described in this guide.
As a heuristic check:
* Does your task require generating multiple free-form outputs per input document?
* Does your task require complex, multi-step post-processing of generated model outputs?
* Does your task require subsetting documents on the fly based on their content?
* Do you expect to compute metrics after applying multiple such processing steps on your model outputs?
* Does your task rely on metrics that need a custom implementation?
For more detail on the task system and advanced features, see `docs/advanced_task_guide.md`. If none of the above sound like they apply to your task, it's time to continue on to checking your task performance!
### Task name + groups (registering a task)
To test a task conveniently, it helps to *register* the task: that is, to give it a name and make the `lm-eval` library aware that it exists!
If you're writing your YAML file inside the `lm_eval/tasks` folder, you just need to give your task a name! You can do this inside your YAML file:
```yaml
task: <name of the task>
```
Including a task name is mandatory.
It is often also convenient to label your task with several `groups`, or tags, though this field is optional:
```yaml
group:
- group1
- group2
```
This will add your task to the `group1` and `group2` groups, letting others know how to categorize your task and, if desired, run all tasks in one of these groups at once, your task along with them.
If your task is not in the `lm_eval/tasks` folder, you'll need to tell the Eval Harness where to look for YAML files.
You can do this by adding the Python snippet
```python
from lm_eval.tasks import include_task_folder
include_task_folder("/path/to/yaml/parent/folder")
```
to the top of any Python file that is run or imported when performing evaluation, such as `main.py`.
Passing `--tasks /path/to/yaml/file` is also accepted.
## Checking validity
- write_out
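As a stopgap while this section is written, here is a hedged example of sanity-checking rendered prompts with `scripts/write_out.py`, assuming the script keeps its pre-refactor interface (flag names may differ on this branch):
```sh
python scripts/write_out.py \
    --output_base_path /tmp/writeout \
    --tasks <your-task-name> \
    --sets test \
    --num_fewshot 5 \
    --num_examples 10
```
Inspecting the written-out files lets you confirm that the few-shot context and targets render as intended before running a full evaluation.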
## Checking performance ; implementation equivalence
## Task impl. checklist
- turn this into a GH PR template too
- README.md in task dir
## Submitting your task
You're all set! Now push your work and make a pull request! Thanks for the contribution 👍. If there are any questions, please leave a message in the `#lm-thunderdome` channel on the EAI Discord!
\ No newline at end of file
......@@ -45,6 +45,16 @@ def acc_fn(items): # This is a passthrough function
return items
@register_metric(
metric="acc_norm",
higher_is_better=True,
output_type=["loglikelihood", "multiple_choice"],
aggregation="mean",
)
def acc_norm_fn(items): # This is a passthrough function
return items
@register_metric(
metric="acc_mutual_info",
higher_is_better=True,
......
......@@ -31,6 +31,7 @@ def get_model(model_name):
TASK_REGISTRY = {}
GROUP_REGISTRY = {}
ALL_TASKS = []
func2task_index = {}
......@@ -49,10 +50,6 @@ def register_task(name):
def register_group(name):
def decorate(fn):
# assert (
# name not in GROUP_REGISTRY
# ), f"group named '{name}' conflicts with existing registered group!"
func_name = func2task_index[fn.__name__]
if name in GROUP_REGISTRY:
GROUP_REGISTRY[name].append(func_name)
......@@ -77,6 +74,7 @@ DEFAULT_METRIC_REGISTRY = {
"loglikelihood_rolling": ["word_perplexity", "byte_perplexity", "bits_per_byte"],
"multiple_choice": [
"acc",
"acc_norm"
],
"greedy_until": ["exact_match"],
}
......
import abc
from dataclasses import dataclass, field
from dataclasses import dataclass, field, asdict
import re
import ast
......@@ -51,11 +51,9 @@ ALL_OUTPUT_TYPES = [
class TaskConfig(dict):
task: str = None
group: str = None
group: Union[str, list] = None
reference: str = None
task_name: str = (
None # TODO: deprecate this, it'll be set in __post_init__ to be names[0]
)
dataset_path: str = None
dataset_name: str = None
dataset_kwargs: dict = None
......@@ -68,6 +66,7 @@ class TaskConfig(dict):
aliases: Union[str, list] = None
doc_to_text: Union[Callable, str] = None
doc_to_target: Union[Callable, str] = None
use_prompt: str = None
num_fewshot: int = 0
batch_size: int = 1
......@@ -79,12 +78,8 @@ class TaskConfig(dict):
generation_kwargs: dict = None
delimiter: str = "\n\n"
filter_list: Union[str, list] = None
normalization: str = (
None # TODO: add length-normalization of various types, mutual info
)
should_decontaminate: bool = False
doc_to_decontamination_query: str = None
use_prompt: str = None
metadata: str = None # by default, not used in the code. allows for users to pass arbitrary info to tasks
......@@ -102,13 +97,17 @@ class TaskConfig(dict):
if type(self.gold_alias) == str:
self.gold_alias = self.template_aliases + self.doc_to_target
if not self.generation_kwargs:
if self.generation_kwargs or self.output_type == "greedy_until":
assert self.output_type == "greedy_until", "passed `generation_kwargs`, but not using a generation request type!"
# ensure that we greedily generate in absence of explicit arguments otherwise
self.generation_kwargs = {"do_sample": False, "temperature": 0.0}
def __getitem__(self, item):
return getattr(self, item)
def to_dict(self):
return asdict(self)
class Task(abc.ABC):
"""A task represents an entire benchmark including its dataset, problems,
......@@ -460,10 +459,20 @@ class Task(abc.ABC):
eval_logger.warning("No filter defined, passing through instances")
return self._instances
def dump_config(self):
"""Returns a dictionary representing the task's config.
:returns: str
The fewshot context.
"""
# TODO: this should only return the overrides applied to a non-YAML task's configuration.
# (batch size, num_fewshot)
return self._config.to_dict()
class ConfigurableTask(Task):
VERSION = "2.0"
VERSION = "Yaml"
OUTPUT_TYPE = None
CONFIG = None
......@@ -503,7 +512,7 @@ class ConfigurableTask(Task):
_metric_list = DEFAULT_METRIC_REGISTRY[self._config.output_type]
if self._config.metric_list is None:
# TODO: handle this in TaskConfig.__post_init__ ?
for metric_name in _metric_list:
self._metric_fn_list[metric_name] = METRIC_REGISTRY[metric_name]
self._aggregation_list[metric_name] = DEFAULT_AGGREGATION_REGISTRY[
......@@ -521,9 +530,9 @@ class ConfigurableTask(Task):
for key in metric_config
if key not in ["metric", "aggregation", "higher_is_better"]
}
if metric_name in _metric_list:
try:
self._metric_fn_list[metric_name] = METRIC_REGISTRY[metric_name]
else:
except:
eval_logger.warning(
f"Metric {metric_name} not found, "
"Searching from https://huggingface.co/evaluate-metric"
......@@ -540,7 +549,8 @@ class ConfigurableTask(Task):
)
if "aggregation" in metric_config:
self._aggregation_list[metric_name] = metric_config["aggregation"]
agg_name = metric_config["aggregation"]
self._aggregation_list[metric_name] = AGGREGATION_REGISTRY[agg_name]
else:
eval_logger.warning(
f"metric {metric_name} is defined, but aggregation is not"
......@@ -579,12 +589,11 @@ class ConfigurableTask(Task):
key: function[key] for key in function if key != "function"
}
components.append([function["function"], kwargs])
filter_pipeline = build_filter_ensemble(filter_name, components)
self._filters.append(filter_pipeline)
else:
self._filters = [
build_filter_ensemble("take_first", [["take_first", None]])
build_filter_ensemble("none", [["take_first", None]])
]
if self._config.use_prompt is not None:
......@@ -598,7 +607,7 @@ class ConfigurableTask(Task):
if self.fewshot_docs() is not None:
self.sampler = samplers.Sampler(
list(self.fewshot_docs()), self, rnd=random.Random()
) # TODO: pass the correct docs in here
)
def download(self, dataset_kwargs=None):
......@@ -639,15 +648,15 @@ class ConfigurableTask(Task):
return self.dataset[self._config.test_split]
def fewshot_docs(self):
if (self._config.num_fewshot > 0) and (self._config.fewshot_split is None):
eval_logger.warning(
"num_fewshot > 0 but fewshot_split is None. "
"using preconfigured rule."
)
return super().fewshot_docs()
elif self._config.fewshot_split is not None:
if self._config.fewshot_split is not None:
return self.dataset[self._config.fewshot_split]
else:
if self._config.num_fewshot > 0:
eval_logger.warning(
"num_fewshot > 0 but fewshot_split is None. "
"using preconfigured rule."
)
return super().fewshot_docs()
def should_decontaminate(self):
return self._config.should_decontaminate
......@@ -818,7 +827,7 @@ class ConfigurableTask(Task):
)
if (
2 * len(choices) == len(lls)
and "acc_mutual_info" in self._metric_list.keys()
and "acc_mutual_info" in self._metric_fn_list.keys()
):
# then we are doing mutual info.
# this stores the "dryrun" / unconditional answer loglikelihoods
......
......@@ -141,15 +141,16 @@ def evaluate(
results = collections.defaultdict(dict)
versions = collections.defaultdict(dict)
configs = collections.defaultdict(dict)
requests = collections.defaultdict(list)
# requests_origin = collections.defaultdict(list)
# docs = {}
# get lists of each type of request
for task_name, task in task_dict.items():
versions[task_name] = task.VERSION
configs[task_name] = dict(task.dump_config()) # TODO: don't access a private attribute here ; for non-YAML tasks handle this case
# deterministically shuffle docs and chop off the first `limit` because sometimes docs are in some kind of order
# task_docs = list(task_doc_func())
......@@ -289,7 +290,7 @@ def evaluate(
if stderr is not None:
results[task_name][metric + "_stderr" + "," + key] = stderr(items)
return {"results": dict(results), "versions": dict(versions)}
return {"results": dict(results), "configs": dict(configs), "versions": dict(versions)}
else:
return None
......@@ -19,37 +19,42 @@ def get_task_name_from_config(task_config):
return "{dataset_path}_{dataset_name}".format(**task_config)
def include_task_folder(task_dir):
"""
Calling this function
"""
for root, subdirs, file_list in os.walk(task_dir):
if (subdirs == []) and (len(file_list) > 0):
for f in file_list:
if f.endswith(".yaml"):
yaml_path = os.path.join(root, f)
try:
config = utils.load_yaml_config(yaml_path)
SubClass = type(
config["task"] + "ConfigurableTask",
(ConfigurableTask,),
{"CONFIG": TaskConfig(**config)},
)
if "task" in config:
task_name = "{}".format(config["task"])
register_task(task_name)(SubClass)
if "group" in config:
for group in config["group"]:
register_group(group)(SubClass)
except Exception as error:
eval_logger.warning(
"Failed to load config in\n"
f" {yaml_path}\n"
" Config will not be added to registry\n"
f" Error: {error}"
)
task_dir = os.path.dirname(os.path.abspath(__file__)) + "/"
for root, subdirs, file_list in os.walk(task_dir):
if (subdirs == []) and (len(file_list) > 0):
for file in file_list:
if "yaml" in file:
yaml_path = os.path.join(root, file)
try:
config = utils.load_yaml_config(yaml_path)
SubClass = type(
config["task"] + "ConfigurableTask",
(ConfigurableTask,),
{"CONFIG": TaskConfig(**config)},
)
if "task" in config:
task_name = "{}".format(config["task"])
register_task(task_name)(SubClass)
if "group" in config:
for group in config["group"]:
register_group(group)(SubClass)
except Exception as error:
eval_logger.warning(
"Failed to load config in\n"
f" {yaml_path}\n"
" Config will not be added to registry"
f" Error: {error}"
)
ALL_TASKS = sorted(list(TASK_REGISTRY.keys()) + list(GROUP_REGISTRY.keys()))
include_task_folder(task_dir)
def get_task(task_name, config):
......
"""
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
https://arxiv.org/pdf/1803.05457.pdf
The ARC dataset consists of 7,787 science exam questions drawn from a variety
of sources, including science questions provided under license by a research
partner affiliated with AI2. These are text-only, English language exam questions
that span several grade levels as indicated in the files. Each question has a
multiple choice structure (typically 4 answer options). The questions are sorted
into a Challenge Set of 2,590 “hard” questions (those that both a retrieval and
a co-occurrence method fail to answer correctly) and an Easy Set of 5,197 questions.
Homepage: https://allenai.org/data/arc
"""
from lm_eval import utils
from lm_eval.prompts import get_prompt
from lm_eval.api.task import MultipleChoiceTask
from lm_eval.api.registry import register_task, register_group
_CITATION = """
@article{Clark2018ThinkYH,
title={Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge},
author={Peter Clark and Isaac Cowhey and Oren Etzioni and Tushar Khot and Ashish Sabharwal and Carissa Schoenick and Oyvind Tafjord},
journal={ArXiv},
year={2018},
volume={abs/1803.05457}
}
"""
@register_group("arc")
@register_task("arc_easy")
class ARCEasy(MultipleChoiceTask):
VERSION = "2.0"
DATASET_PATH = "ai2_arc"
DATASET_NAME = "ARC-Easy"
OUTPUT_TYPE = "loglikelihood"
def has_training_docs(self):
return True
def has_validation_docs(self):
return True
def has_test_docs(self):
return True
def training_docs(self):
if self._training_docs is None:
self._training_docs = list(map(self._process_doc, self.dataset["train"]))
return self._training_docs
def validation_docs(self):
return map(self._process_doc, self.dataset["validation"])
def test_docs(self):
return map(self._process_doc, self.dataset["test"])
def _process_doc(self, doc):
# NOTE: Some `doc["answerKey"]`s are in numeric string format being one
# of {'1', '2', '3', '4', '5'}. We map them back to letters.
num_to_letter = {"1": "A", "2": "B", "3": "C", "4": "D", "5": "E"}
doc["answerKey"] = num_to_letter.get(doc["answerKey"], doc["answerKey"])
out_doc = {
"id": doc["id"],
"question": doc["question"],
"choices": doc["choices"]["text"],
"gold": ["A", "B", "C", "D", "E"].index(doc["answerKey"]),
}
return out_doc
def doc_to_text(self, doc):
doc_to_text = get_prompt("qa-basic:question-newline-answer")
return utils.apply_template(doc_to_text, doc)
def should_decontaminate(self):
return True
def doc_to_decontamination_query(self, doc):
return doc["query"]
@register_group("arc")
@register_task("arc_challenge")
class ARCChallenge(ARCEasy):
DATASET_PATH = "ai2_arc"
DATASET_NAME = "ARC-Challenge"
group:
- arc_yaml
task: arc_challenge_yaml
- ai2_arc
- multiple_choice
task: arc_challenge
dataset_path: ai2_arc
dataset_name: ARC-Challenge
output_type: multiple_choice
......
group:
- arc_yaml
task: arc_easy_yaml
- ai2_arc
- multiple_choice
task: arc_easy
dataset_path: ai2_arc
dataset_name: ARC-Easy
output_type: multiple_choice
......
......@@ -30,3 +30,17 @@ Homepage: https://github.com/openai/grade-school-math
primaryClass={cs.LG}
}
```
### Checklist
- [x] Is in Eval-harness v1.0 ?
- [ ] Has been checked for regression from v1.0?
- [ ] Has been checked for equivalence with original paper methodology?
- [ ] "Main" checked variant clearly denoted?
### Variant Wishlist
- [ ] Variant with Calculator (see https://github.com/openai/grade-school-math/blob/master/grade_school_math/calculator.py for example implementation)
- [ ] Using Verifiers
- [ ] Majority voting "without CoT"
\ No newline at end of file
group:
- greedy_until
- math_word_problems
task: gsm8k_yaml
dataset_path: gsm8k
dataset_name: main
......@@ -25,7 +28,7 @@ generation_kwargs:
- "Question:"
do_sample: false
temperature: 0.0
repeats: 2
repeats: 1
num_fewshot: 5
# filter_list:
# - name: "get-answer"
......
"""
The LAMBADA dataset: Word prediction requiring a broad discourse context∗
https://arxiv.org/pdf/1606.06031.pdf
LAMBADA is a dataset to evaluate the capabilities of computational models for text
understanding by means of a word prediction task. LAMBADA is a collection of narrative
passages sharing the characteristic that human subjects are able to guess their last
word if they are exposed to the whole passage, but not if they only see the last
sentence preceding the target word. To succeed on LAMBADA, computational models
cannot simply rely on local context, but must be able to keep track of information
in the broader discourse.
Homepage: https://zenodo.org/record/2630551#.X4Xzn5NKjUI
"""
from lm_eval.api.task import Task
from lm_eval.api.instance import Instance
from lm_eval.api.metrics import mean, perplexity
from lm_eval.api.registry import register_task, register_group
_CITATION = """
@misc{
author={Paperno, Denis and Kruszewski, Germán and Lazaridou, Angeliki and Pham, Quan Ngoc and Bernardi, Raffaella and Pezzelle, Sandro and Baroni, Marco and Boleda, Gemma and Fernández, Raquel},
title={The LAMBADA dataset},
DOI={10.5281/zenodo.2630551},
publisher={Zenodo},
year={2016},
month={Aug}
}
"""
class LambadaBase(Task):
VERSION = None
OUTPUT_TYPE = "loglikelihood"
def training_docs(self):
if self.has_training_docs():
return self.dataset["train"]
def validation_docs(self):
if self.has_validation_docs():
return self.dataset["validation"]
def test_docs(self):
if self.has_test_docs():
return self.dataset["test"]
def doc_to_text(self, doc):
return doc["text"].rsplit(" ", 1)[0]
def should_decontaminate(self):
return True
def doc_to_decontamination_query(self, doc):
return doc["text"]
def doc_to_target(self, doc):
return " " + doc["text"].rsplit(" ", 1)[1]
def construct_requests(self, doc, ctx, **kwargs):
return Instance(
request_type=self.OUTPUT_TYPE,
doc=doc,
arguments=(ctx, self.doc_to_target(doc)),
**kwargs
)
def process_results(self, doc, results):
# TODO: this ^ is a hack. filters should make it so that we only have one response per request that we score
results = results[
0
] # TODO: recheck this. currently a list of [(ll, is_greedy)] is passed in
ll, is_greedy = results
return {"ppl": ll, "acc": int(is_greedy)}
def aggregation(self):
return {"ppl": perplexity, "acc": mean}
def higher_is_better(self):
return {"ppl": False, "acc": True}
@register_task("lambada_standard")
class LambadaStandard(LambadaBase):
"""The LAMBADA task using the standard original LAMBADA dataset."""
VERSION = "2.0"
DATASET_PATH = "lambada"
def has_training_docs(self):
return False
def has_validation_docs(self):
return True
def has_test_docs(self):
return True
@register_task("lambada_openai")
class LambadaOpenAI(LambadaBase):
"""The LAMBADA task using the LAMBADA OpenAI dataset, a modified version of the
original LAMBADA dataset created by OpenAI for evaluating their GPT-2 model.
Reference: https://github.com/openai/gpt-2/issues/131#issuecomment-497136199
"""
VERSION = "2.0"
DATASET_PATH = "EleutherAI/lambada_openai"
def has_training_docs(self):
return False
def has_validation_docs(self):
return False
def has_test_docs(self):
return True
group:
- lambada
task: lambada_openai_yaml
- loglikelihood
- perplexity
task: lambada_openai
dataset_path: EleutherAI/lambada_openai
dataset_name: default
output_type: loglikelihood
......
group:
- lambada
task: lambada_standard_yaml
- loglikelihood
- perplexity
task: lambada_standard
dataset_path: lambada
dataset_name: null
output_type: loglikelihood
......
group:
- lambada_cloze
- loglikelihood
task: lambada_openai_cloze_yaml
dataset_path: EleutherAI/lambada_openai
dataset_name: default
......
group:
- lambada_cloze
- loglikelihood
task: lambada_standard_cloze_yaml
dataset_path: lambada
dataset_name: null
......
"""
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
https://arxiv.org/pdf/2101.00027.pdf
The Pile is a 825 GiB diverse, open source language modelling data set that consists
of 22 smaller, high-quality datasets combined together. To score well on Pile
BPB (bits per byte), a model must be able to understand many disparate domains
including books, github repositories, webpages, chat logs, and medical, physics,
math, computer science, and philosophy papers.
Homepage: https://pile.eleuther.ai/
"""
from lm_eval.api.task import PerplexityTask
from lm_eval.api.registry import register_task, register_group
_CITATION = """
@article{pile,
title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},
author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},
journal={arXiv preprint arXiv:2101.00027},
year={2020}
}
"""
class PilePerplexityTask(PerplexityTask):
VERSION = "2.0"
DATASET_PATH = "EleutherAI/the_pile"
DATASET_NAME = None
def has_training_docs(self):
return False
def test_docs(self):
for doc in self.dataset["train"].select(range(100)):
yield doc
def has_validation_docs(self):
return False
def has_test_docs(self):
return True
def doc_to_target(self, doc):
return doc["text"]
# def validation_docs(self):
# for doc in self.dataset["validation"]:
# yield doc["text"]
# def test_docs(self):
# for doc in self.dataset["test"]:
# yield doc["text"]
class PileArxiv(PilePerplexityTask):
DATASET_NAME = "pile_arxiv"
class PileBooks3(PilePerplexityTask):
DATASET_NAME = "pile_books3"
class PileBookCorpus2(PilePerplexityTask):
DATASET_NAME = "pile_bookcorpus2"
class PileDmMathematics(PilePerplexityTask):
DATASET_NAME = "pile_dm-mathematics"
@register_task("pile_enron")
class PileEnron(PilePerplexityTask):
DATASET_NAME = "enron_emails"
class PileEuroparl(PilePerplexityTask):
DATASET_NAME = "pile_europarl"
class PileFreeLaw(PilePerplexityTask):
DATASET_NAME = "pile_freelaw"
class PileGithub(PilePerplexityTask):
DATASET_NAME = "pile_github"
class PileGutenberg(PilePerplexityTask):
DATASET_NAME = "pile_gutenberg"
class PileHackernews(PilePerplexityTask):
DATASET_NAME = "pile_hackernews"
class PileNIHExporter(PilePerplexityTask):
DATASET_NAME = "pile_nih-exporter"
class PileOpenSubtitles(PilePerplexityTask):
DATASET_NAME = "pile_opensubtitles"
class PileOpenWebText2(PilePerplexityTask):
DATASET_NAME = "pile_openwebtext2"
class PilePhilPapers(PilePerplexityTask):
DATASET_NAME = "pile_philpapers"
class PilePileCc(PilePerplexityTask):
DATASET_NAME = "pile_pile-cc"
class PilePubmedAbstracts(PilePerplexityTask):
DATASET_NAME = "pile_pubmed-abstracts"
class PilePubmedCentral(PilePerplexityTask):
DATASET_NAME = "pile_pubmed-central"
class PileStackExchange(PilePerplexityTask):
DATASET_NAME = "pile_stackexchange"
class PileUspto(PilePerplexityTask):
DATASET_NAME = "pile_upsto"
class PileUbuntuIrc(PilePerplexityTask):
DATASET_NAME = "pile_ubuntu-irc"
class PileWikipedia(PilePerplexityTask):
DATASET_NAME = "pile_wikipedia"
class PileYoutubeSubtitles(PilePerplexityTask):
DATASET_NAME = "pile_youtubesubtitles"
group:
- pile
- perplexity
- loglikelihood_rolling
task: pile_arxiv
dataset_path: EleutherAI/the_pile
dataset_name: pile_arxiv
......