Unverified Commit 02362e6a authored by Lintang Sutawika's avatar Lintang Sutawika Committed by GitHub

Merge branch 'big-refactor' into refactor-more-tasks

parents b1b5239d 0ba4ae15
......@@ -5,3 +5,4 @@ lm_cache
.idea
*.egg-info/
.vscode/
......@@ -14,32 +14,42 @@ Tasks are configured via the `TaskConfig` object. Below, we describe all fields
### Parameters
Task naming + registration:
- **task** (`str`, defaults to None) — name of the task.
- **group** (`str`, *optional*) — name of the task group(s) a task belongs to. Enables one to run all tasks with a specified tag or group name at once.
- **reference** (`str`, *optional*) —
Dataset configuration options:
- **dataset_path** (`str`) — The name of the dataset as listed by HF in the datasets Hub.
- **dataset_name** (`str`, *optional*, defaults to None) — The name of what HF calls a “data instance” or sub-task of the benchmark. If your task does not contain any data instances, leave this as the default None. (If you're familiar with the HF `datasets.load_dataset` function, `dataset_path` and `dataset_name` are simply its first two arguments.)
- **dataset_kwargs** (`dict`, *optional*) — Auxiliary arguments that `datasets.load_dataset` accepts. This can be used to specify arguments such as `data_files` or `data_dir` if you want to use local datafiles such as json or csv.
- **training_split** (`str`, *optional*) — Split in the dataset to use as the training split.
- **validation_split** (`str`, *optional*) — Split in the dataset to use as the validation split.
- **test_split** (`str`, *optional*) — Split in the dataset to use as the test split.
- **fewshot_split** (`str`, *optional*) — Split in the dataset to draw few-shot exemplars from. Should be set whenever `num_fewshot` > 0, and generally should not be the same split being evaluated.
Prompting / in-context formatting options:
- **template_aliases** (`str`, *optional*) — A field for additional Jinja2 code. It is not meant to render as text itself, but to define Jinja variables that the written prompts can then use (for example, mapping the dataset column `label` to the new name `gold`).
- **use_prompt** (`str`, *optional*) — Name of a prompt in Promptsource to use. If defined, it will overwrite `doc_to_text` and `doc_to_target`, and `template_aliases` will be unused.
- **doc_to_text** (`Union[Callable, str]`, *optional*) — Jinja2 template, f-string, or function to process a sample into the appropriate input for the model.
- **doc_to_target** (`Union[Callable, str]`, *optional*) — Jinja2 template, f-string, or function to process a sample into the appropriate target output for the model.
- **fewshot_delimiter** (`str`, *optional*, defaults to "\n\n") — String to insert between few-shot examples.
- **target_delimiter** (`str`, *optional*, defaults to `" "`) — String to insert between input and target output for the datapoint being tested.
Runtime configuration options:
- **num_fewshot** (`int`, *optional*, defaults to 0) — Number of few-shot examples before the input.
- **batch_size** (`int`, *optional*, defaults to 1) — Batch size.
- **repeats** (`int`, *optional*, defaults to 1) — Number of repeated runs for each sample. Can be used for techniques such as self-consistency.
Scoring details:
- **metric_list** (`str`, *optional*, defaults to None) — A list of metrics to use for evaluation. Each entry names a `metric` and may optionally specify an `aggregation` and `higher_is_better`; see the configuration sketch after this list for one possible format.
- **gold_alias** (`str`, *optional*, defaults to None) — If provided, used to generate the reference answer that predictions are scored against. Useful when `doc_to_target` should render the "target string" appended to each few-shot exemplar's input, while the value passed to the metric function as `gold` should instead come from `gold_alias`.
- **output_type** (`str`, *optional*, defaults to "greedy_until") — Selects the type of model output for the given task. Options are `greedy_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`.
- **generation_kwargs** (`dict`, *optional*) — Auxiliary arguments for the `generate` function from HF transformers library. Advanced keyword arguments may not be supported for non-HF LM classes.
- **filter_list** (`Union[str, list]`, *optional*) — List of filters to postprocess model outputs. See below for further detail on the filter API.
- **should_decontaminate** (`bool`, *optional*, defaults to False) — Whether this task's documents should be checked for overlap with a training corpus as part of decontamination.
- **doc_to_decontamination_query** (`str`, *optional*) — Which portion of each document to use as the decontamination query (e.g. `"{{text}}"`).
Other:
- **metadata** (`str`, *optional*) — An optional field where arbitrary metadata can be passed.
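Putting several of these fields together, here is a minimal illustrative configuration, written as a Python `TaskConfig` (the same fields can be supplied via a task's YAML file). The import path, task name, and concrete values below are assumptions chosen for illustration and may differ in your checkout:

```python
# A minimal, hypothetical task configuration. Field names follow this guide;
# the import path and concrete values are illustrative assumptions only.
from lm_eval.api.task import TaskConfig  # assumed module path

config = TaskConfig(
    task="boolq_demo",                       # hypothetical task name
    dataset_path="super_glue",               # HF Hub dataset
    dataset_name="boolq",                    # HF "data instance" / sub-task
    training_split="train",
    validation_split="validation",
    # define `answer_choices` for use in the Jinja templates below
    template_aliases="{% set answer_choices = ['no', 'yes'] %}",
    doc_to_text="{{passage}}\nQuestion: {{question}}\nAnswer:",
    doc_to_target="{{answer_choices[label]}}",
    target_delimiter=" ",
    fewshot_delimiter="\n\n",
    num_fewshot=0,
    output_type="multiple_choice",
    metric_list=[
        {"metric": "acc", "aggregation": "mean", "higher_is_better": True},
    ],
)
```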
## Filters
......
# Description Guide
![fewshot-example](./img/fewshot_example_gpt3.png)
(Figure from [Brown et al., 2020](https://arxiv.org/pdf/2005.14165.pdf))
Task descriptions provide in-context task instruction for your language model. If you'd like to prepend a natural language description to your few-shot examples and prompt, you can do so on a per-task basis via the `description_dict` arg of [`evaluator.evaluate`](../lm_eval/evaluator.py). This `description_dict` must adhere to the following key-value structure:
- **key**: the task name (`str`) as specified in the lm-eval-harness [task registry](../lm_eval/tasks/__init__.py).
- **value**: the corresponding (`str`) description/prompt for the task identified by **key**.
```python
description_dict = {
"task_name_1": "description",
"task_name_2": "description",
...
}
```
Note that a task's description will be separated from its following few-shot examples and prompt by a new line as such:
```python
"""
<description>
<examples>
<prompt>
"""
```
## Descriptions in File
One can also interface with the aforementioned [`evaluator.evaluate`](../lm_eval/evaluator.py) (or `evaluator.simple_evaluate`) method from a higher level by simply passing a JSON file path to the `description_dict_path` arg of the command-line interface (CLI) program, `main.py`. The JSON file pointed to should be structured the same as the `description_dict`. E.g. for some file at `/your/path/descriptions.json` you may have:
```json
{
"cycle_letters": "Please unscramble the letters into a word, and write that word:",
"copa": "Given a premise and one alternative with a causal relation to the premise and another without, choose the more plausible alternative"
}
```
which can then be supplied to the CLI as:
```bash
python main.py \
--tasks cycle_letters,copa \
--description_dict_path /your/path/descriptions.json \
...
```
......@@ -157,3 +157,17 @@ def get_aggregation(name):
raise Warning(
"{} not a registered aggregation metric!".format(name),
)
def get_default_aggregation(metric_name):
try:
return DEFAULT_AGGREGATION_REGISTRY[metric_name]
except KeyError:
raise Warning(f"No default aggregation metric for metric '{metric_name}'!")
def is_higher_better(metric_name):
try:
return HIGHER_IS_BETTER_REGISTRY[metric_name]
except KeyError:
raise Warning(f"higher_is_better not specified for metric '{metric_name}'!")
......@@ -24,19 +24,18 @@ from lm_eval.logger import eval_logger
from lm_eval.prompts import get_prompt
from lm_eval.filters import build_filter_ensemble
from lm_eval.api.metrics import (
# get_metric,
# get_aggregation,
mean,
weighted_perplexity,
bits_per_byte,
)
from lm_eval.api.registry import (
METRIC_REGISTRY,
get_metric,
get_aggregation,
get_default_aggregation,
is_higher_better,
DEFAULT_METRIC_REGISTRY,
OUTPUT_TYPE_REGISTRY,
AGGREGATION_REGISTRY,
HIGHER_IS_BETTER_REGISTRY,
DEFAULT_AGGREGATION_REGISTRY,
)
ALL_OUTPUT_TYPES = [
......@@ -50,10 +49,12 @@ ALL_OUTPUT_TYPES = [
@dataclass
class TaskConfig(dict):
# task naming/registry
task: str = None
group: Union[str, list] = None
# HF dataset options.
# which dataset to use,
# and what splits for what purpose
dataset_path: str = None
dataset_name: str = None
dataset_kwargs: dict = None
......@@ -61,24 +62,25 @@ class TaskConfig(dict):
validation_split: str = None
test_split: str = None
fewshot_split: str = None # TODO: assert that this not None if num_fewshot > 0. (?) assert if this is same split as one evaling (?)
# formatting / prompting options.
# see docs/advanced_task_guide.md for more info
template_aliases: str = None
doc_to_text: Union[Callable, str] = None
doc_to_target: Union[Callable, str] = None
use_prompt: str = None
description: str = ""
target_delimiter: str = " "
fewshot_delimiter: str = "\n\n"
# runtime configuration options
num_fewshot: int = 0
batch_size: int = 1
repeats: int = 1
# scoring options
metric_list: str = None
gold_alias: Union[Callable, str] = None
create_choices: Union[Callable, str] = None
output_type: str = "greedy_until"
generation_kwargs: dict = None
filter_list: Union[str, list] = None
should_decontaminate: bool = False
doc_to_decontamination_query: str = None
......@@ -480,7 +482,7 @@ class Task(abc.ABC):
The fewshot context.
"""
# TODO: this should only return the overrides applied to a non-YAML task's configuration.
# (batch size, num_fewshot)
# (num_fewshot)
return self._config.to_dict()
......@@ -528,13 +530,11 @@ class ConfigurableTask(Task):
if self._config.metric_list is None:
# TODO: handle this in TaskConfig.__post_init__ ?
for metric_name in _metric_list:
self._metric_fn_list[metric_name] = get_metric(metric_name)
self._aggregation_list[metric_name] = get_default_aggregation(
metric_name
)
self._higher_is_better[metric_name] = is_higher_better(metric_name)
else:
for metric_config in self._config.metric_list:
assert "metric" in metric_config
......@@ -544,30 +544,13 @@ class ConfigurableTask(Task):
for key in metric_config
if key not in ["metric", "aggregation", "higher_is_better"]
}
try:
self._metric_fn_list[metric_name] = METRIC_REGISTRY[metric_name]
except Exception:
eval_logger.warning(
f"Metric {metric_name} not found, "
"Searching from https://huggingface.co/evaluate-metric"
)
try:
metric_object = evaluate.load(metric_name)
self._metric_fn_list[metric_name] = metric_object
self._metric_fn_kwargs[metric_name] = kwargs
except Exception:
raise Warning(
"{} not found in the evaluate library!".format(metric_name),
"Please check https://huggingface.co/evaluate-metric",
)
self._metric_fn_list[metric_name] = get_metric(metric_name)
self._metric_fn_kwargs[metric_name] = kwargs
if "aggregation" in metric_config:
agg_name = metric_config["aggregation"]
if type(agg_name) == str:
self._aggregation_list[metric_name] = get_aggregation(agg_name)
elif callable(agg_name):
self._aggregation_list[metric_name] = metric_config[
"aggregation"
......@@ -575,7 +558,7 @@ class ConfigurableTask(Task):
else:
INV_AGG_REGISTRY = {v: k for k, v in AGGREGATION_REGISTRY.items()}
metric_agg = get_default_aggregation(metric_name)
eval_logger.warning(
f"metric {metric_name} is defined, but aggregation is not. "
f"using default "
......@@ -591,11 +574,9 @@ class ConfigurableTask(Task):
eval_logger.warning(
f"metric {metric_name} is defined, but higher_is_better is not. "
f"using default "
f"higher_is_better={HIGHER_IS_BETTER_REGISTRY[metric_name]}"
f"higher_is_better={is_higher_better(metric_name)}"
)
self._higher_is_better[metric_name] = is_higher_better(metric_name)
self.download(self._config.dataset_kwargs)
self._training_docs = None
......@@ -865,7 +846,6 @@ class ConfigurableTask(Task):
else:
gold = int(self.doc_to_target(doc))
pred = np.argmax(lls)
# retrieve choices in List[str] form, to compute choice lengths, etc.
choices = self.create_choices(doc)
if (
......@@ -879,6 +859,8 @@ class ConfigurableTask(Task):
# and this stores our "regular" conditional loglikelihoods
lls = lls[::2]
pred = np.argmax(lls)
acc = 1.0 if np.argmax(lls) == gold else 0.0
completion_len = np.array([float(len(i)) for i in choices])
acc_norm = 1.0 if np.argmax(lls / completion_len) == gold else 0.0
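# Illustration (not part of this diff): the length normalization behind
# `acc_norm` above, with made-up numbers. Longer answer strings tend to get
# lower total loglikelihood, so each choice's loglikelihood is divided by its
# character length before taking the argmax, e.g.:
#   lls = np.array([-4.0, -6.0]); completion_len = np.array([5.0, 30.0])
#   np.argmax(lls) == 0, but np.argmax(lls / completion_len) == 1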
......@@ -890,7 +872,6 @@ class ConfigurableTask(Task):
**({"acc_norm": acc_norm} if "acc_norm" in use_metric else {}),
}
# TODO: set which normalization metrics should be reported, and calculate them
if "exact_match" in self._metric_fn_list.keys():
# TODO: this gets score of 0 on arc_challenge for pythia-70m. need to test that this works properly
is_greedy = is_greedy[gold] # take value for the gold answer
......@@ -926,7 +907,7 @@ class ConfigurableTask(Task):
gold = self.doc_to_target(doc)
for key, result in zip(self._metric_fn_list.keys(), results):
_dict = self._metric_fn_list[key](
references=[gold],
predictions=[result],
**self._metric_fn_kwargs[key],
......
......@@ -45,7 +45,6 @@ def simple_evaluate(
check_integrity=False,
decontamination_ngrams_path=None,
write_out=False,
output_base_path=None,
):
"""Instantiate and evaluate a model on a list of tasks.
......@@ -74,8 +73,6 @@ def simple_evaluate(
Whether to run the relevant part of the test suite for the tasks
:param write_out: bool
If True, write details about prompts and logits to json for all tasks
:param output_base_path: str, optional
Directory to which detailed eval info will be written. Defaults to present working dir.
:return
Dictionary of results
"""
......@@ -121,7 +118,6 @@ def simple_evaluate(
bootstrap_iters=bootstrap_iters,
decontamination_ngrams_path=decontamination_ngrams_path,
write_out=write_out,
output_base_path=output_base_path,
)
if lm.rank == 0:
......@@ -158,7 +154,6 @@ def evaluate(
bootstrap_iters=100000,
decontamination_ngrams_path=None,
write_out=False,
output_base_path=None,
):
"""Instantiate and evaluate a model on a list of tasks.
......@@ -174,8 +169,6 @@ def evaluate(
Number of iterations for bootstrap statistics
:param write_out: bool
If True, write all prompts, logits and metrics to json for offline analysis
:param output_base_path: str, optional
Directory to which detailed eval info will be written. Defaults to present working dir
:return
Dictionary of results
"""
......@@ -188,8 +181,6 @@ def evaluate(
samples = collections.defaultdict(list)
requests = collections.defaultdict(list)
# docs = {}
# get lists of each type of request
for task_name, task in task_dict.items():
versions[task_name] = task.VERSION
......
......@@ -115,9 +115,10 @@ class HFLM(LM):
else torch.device("cpu")
)
else:
if device != "cuda":
eval_logger.info(
f"Using `accelerate launch` or `parallelize=True`, device '{device}' will be overridden when placing model."
)
# TODO: include in warning that `load_in_8bit` etc. affect this too
self._device = device
......@@ -204,7 +205,12 @@ class HFLM(LM):
self.model.tie_weights()
if gpus <= 1 and not parallelize:
# place model onto device, if not using HF Accelerate in any form
try:
self.model.to(self.device)
except ValueError:
eval_logger.info(
"Failed to place model onto specified device. This may be because the model is quantized via `bitsandbytes`. If the desired GPU is being used, this message is safe to ignore."
)
self.tokenizer = transformers.AutoTokenizer.from_pretrained(
pretrained if tokenizer is None else tokenizer,
......@@ -246,7 +252,12 @@ class HFLM(LM):
if torch.cuda.is_available()
else torch.device("cpu")
)
try:
self.model.to(self.device)
except ValueError:
eval_logger.info(
"Failed to place model onto specified device. This may be because the model is quantized via `bitsandbytes`. If the desired GPU is being used, this message is safe to ignore."
)
else:
self._model = accelerator.prepare(self.model)
self._device = torch.device(f"cuda:{accelerator.local_process_index}")
......
......@@ -9,7 +9,7 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for
- [ ] DROP
- [x] ~~Lambada~~
- [x] Lambada (Cloze variants)
- [x] ~~Lambada (Multilingual)~~
- [x] Wikitext
- [x] PiQA
- [ ] PROST (WIP)
......@@ -17,7 +17,7 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for
- [x] Pubmed QA
- [x] SciQ
- [ ] QASPER
- [ ] QA4MRE (WIP)
- [ ] TriviaQA
- [x] AI2 ARC
- [ ] LogiQA
......@@ -31,7 +31,8 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for
- [ ] SWAG (WIP)
- [x] OpenBookQA
- [ ] SQuADv2 (WIP)
- [ ] RACE (WIP)
- [ ] HeadQA (WIP)
- [ ] MathQA
- [ ] WebQs
- [ ] WSC273
......
......@@ -14,7 +14,10 @@ from lm_eval.api.registry import (
def get_task_name_from_config(task_config):
return "{dataset_path}_{dataset_name}".format(**task_config)
if "dataset_name" in task_config:
return "{dataset_path}_{dataset_name}".format(**task_config)
else:
return "{dataset_path}".format(**task_config)
def include_task_folder(task_dir):
......
......@@ -19,6 +19,6 @@ metric_list:
- metric: acc_norm
aggregation: mean
higher_is_better: true
# - metric: acc_mutual_info
# aggregation: mean
# higher_is_better: true
# LAMBADA
### Paper
The LAMBADA dataset: Word prediction requiring a broad discourse context
https://arxiv.org/pdf/1606.06031.pdf
LAMBADA is a dataset to evaluate the capabilities of computational models for text
understanding by means of a word prediction task. LAMBADA is a collection of narrative
passages sharing the characteristic that human subjects are able to guess their last
word if they are exposed to the whole passage, but not if they only see the last
sentence preceding the target word. To succeed on LAMBADA, computational models
cannot simply rely on local context, but must be able to keep track of information
in the broader discourse.
Homepage: https://zenodo.org/record/2630551#.X4Xzn5NKjUI
### Citation
@misc{
author={Paperno, Denis and Kruszewski, Germán and Lazaridou, Angeliki and Pham, Quan Ngoc and Bernardi, Raffaella and Pezzelle, Sandro and Baroni, Marco and Boleda, Gemma and Fernández, Raquel},
title={The LAMBADA dataset},
DOI={10.5281/zenodo.2630551},
publisher={Zenodo},
year={2016},
month={Aug}
}
### Subtasks
* `lambada_mt_{en, fr, de, it, es}`: Machine-translated versions of OpenAI's Lambada variant.
### Checklist
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
(This task is novel to the Evaluation Harness, and has been checked against v0.3.0 of the harness.)
If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?
include: lambada_mt_en.yaml
group:
- lambada_multilingual
- loglikelihood
- perplexity
task: lambada_openai_mt_de
dataset_name: de
group:
- lambada_multilingual
- loglikelihood
- perplexity
task: lambada_openai_mt_en
dataset_path: EleutherAI/lambada_openai
dataset_name: en
output_type: loglikelihood
test_split: test
template_aliases: ""
doc_to_text: "{{text.split(' ')[:-1]|join(' ')}}"
doc_to_target: "{{' '+text.split(' ')[-1]}}"
should_decontaminate: true
doc_to_decontamination_query: "{{text}}"
metric_list:
- metric: perplexity
aggregation: perplexity
higher_is_better: false
- metric: acc
aggregation: mean
higher_is_better: true
include: lambada_mt_en.yaml
group:
- lambada_multilingual
- loglikelihood
- perplexity
task: lambada_openai_mt_es
dataset_name: es
include: lambada_mt_en.yaml
group:
- lambada_multilingual
- loglikelihood
- perplexity
task: lambada_openai_mt_fr
dataset_name: fr
include: lambada_mt_en.yaml
group:
- lambada_multilingual
- loglikelihood
- perplexity
task: lambada_openai_mt_it
dataset_name: it
group:
- multiple_choice
task: corypaik_prost
dataset_path: corypaik/prost
dataset_name: null
output_type: multiple_choice
test_split: test
template_aliases: "{% set answer_choices = [A, B, C, D] %}{% set gold = label %}" # set the list of possible answer choices, and set what this doc's gold answer is (set what ds column used, and what)
doc_to_text: "{{context}}\nQuestion: {{ex_question}}\nAnswer:"
doc_to_target: "{{answer_choices[gold]}}"
gold_alias: "{{gold}}" # this will be cast to an int.
should_decontaminate: true
doc_to_decontamination_query: "{{context}}\nQuestion: {{ex_question}}\nAnswer:"
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
......@@ -8,7 +8,13 @@ training_split: train
validation_split: validation
doc_to_text: "{{passage}}\nQuestion: {{question}}\nAnswer:"
doc_to_target: "{{answer_choices[label]}}"
gold_alias: "{{label}}" # this will be cast to an int.
gold_alias: " {{answer_choices[label]}}" # this will be cast to an int.
generation_kwargs:
until:
- "\n\n"
- "\n"
do_sample: false
temperature: 0.0
template_aliases: "{% set answer_choices = ['no', 'yes'] %}"
metric_list:
- metric: exact_match
......
group:
- super-glue-lm-eval-v1
task: "copa"
dataset_path: super_glue
dataset_name: copa
......
group:
- super-glue-promptsource
task: "GPT-3 style"
task: "rte"
dataset_path: super_glue
dataset_name: rte
training_split: train
validation_split: validation
use_prompt: "promptsource:GPT-3 style"
generation_kwargs:
until:
- "\n"
- "\n\n"
metric_list:
- metric: exact_match
aggregation: mean
......