Merge branch 'big-refactor' into wmt

adde09b3 · haileyschoelkopf · df6c5dcb · 64c76fc3 · adde09b3 · adde09b3
Commit adde09b3 authored Aug 14, 2023 by haileyschoelkopf
20 changed files
--- a/README.md
+++ b/README.md
@@ -33,7 +33,6 @@ To install the `lm-eval` refactor branch from the github repository, run:
 ```bash
 git clone https://github.com/EleutherAI/lm-evaluation-harness
 cd lm-evaluation-harness
-git checkout big-refactor
 pip install -e .
 ```

@@ -49,6 +48,13 @@ To support loading GPTQ quantized models, install the package with the `gptq` ex
 pip install -e ".[gptq]"
 ```

+
+To install the package with all extras, run:
+```bash
+pip install -e ".[all]"
+```
+
+
 ## Support

 The best way to get support is to open an issue on this repo or join the [EleutherAI discord server](https://discord.gg/eleutherai). The `#lm-thunderdome` channel is dedicated to developing this project and the `#release-discussion` channel is for receiving support for our releases.
@@ -93,6 +99,8 @@ python main.py \
    --batch_size auto:4
 ```

+Alternatively, you can use the `lm-eval` command instead of `python main.py` to run evaluations from any directory.
+
 ### Multi-GPU Evaluation with Hugging Face `accelerate`

 To parallelize evaluation of HuggingFace models across multiple GPUs, we allow for two different types of multi-GPU evaluation.
@@ -128,30 +136,43 @@ Using this setting helps for massive models like BLOOM which require, or to avoi

 **Note that this option requires launching evaluation via `python main.py` rather than `accelerate launch main.py`.**

+To use `accelerate` with the `lm-eval` command, use:
+```
+accelerate launch --no_python lm-eval --model ...
+```
+
 ### Commercial APIs

-Our library also supports language models served via the OpenAI API:
+Our library also supports the evaluation of models served via several commercial APIs, and we hope to implement support for common performant local/self-hosted inference servers.
+
+A full accounting of the supported and planned libraries + APIs can be seen below:
+
+| API or Inference Server     | Implemented?                    | `--model <xxx>` name                                                             | Models supported:                    | Request Types:                                           |
+|-----------------------------|---------------------------------|----------------------------------------------------------------------------------|--------------------------------------|----------------------------------------------------------|
+| OpenAI Completions          | :heavy_check_mark:              | `openai`, `openai-completions`, `gooseai`                                        | up to `code-davinci-002`             | `greedy_until`, `loglikelihood`, `loglikelihood_rolling` |
+| OpenAI ChatCompletions      | :x: Not yet - needs help!       | N/A                                                                              | (link here?)                         | `greedy_until` (no logprobs)                             |
+| Anthropic                   | :heavy_check_mark:              | `anthropic`                                                                      | [Supported Anthropic Engines](https://docs.anthropic.com/claude/reference/selecting-a-model)         | `greedy_until` (no logprobs)                             |
+| GooseAI                     | :heavy_check_mark: (not separately maintained)  | `openai`, `openai-completions`, `gooseai` (same interface as OpenAI Completions) |                                      | `greedy_until`, `loglikelihood`, `loglikelihood_rolling` |
+| Textsynth                   | Needs testing                   | `textsynth`                                                                      | ???                                  | `greedy_until`, `loglikelihood`, `loglikelihood_rolling` |
+| Cohere                      | :hourglass: - blocked on Cohere API bug | N/A                                                                              | [All `cohere.generate()` engines](https://docs.cohere.com/docs/models) | `greedy_until`, `loglikelihood`, `loglikelihood_rolling` |
+| GGML                        | :hourglass: [PR](https://github.com/EleutherAI/lm-evaluation-harness/pull/617)              | N/A                                                                              | ???                                  | `greedy_until`, `loglikelihood`, `loglikelihood_rolling` |
+| vLLM                        | :x: Not yet - needs help!       | N/A                                                                              | All HF models                        | `greedy_until` (no logprobs)                             |
+| Your inference server here! | ...                             | ...                                                                              | ...                                  | ...                                                      |
+
+It is on our roadmap to create task variants that allow models which do not serve logprobs/loglikelihoods to be compared against open-source models on generation performance.
+
+Our library supports language models served via the OpenAI Completions API as follows:

 ```bash
 export OPENAI_API_SECRET_KEY=YOUR_KEY_HERE
 python main.py \
-    --model openai \
+    --model openai-completions \
    --model_args engine=davinci \
    --tasks lambada_openai,hellaswag
 ```

 While this functionality is only officially maintained for the official OpenAI API, it tends to also work for other hosting services that use the same API such as [goose.ai](goose.ai) with minor modification. We also have an implementation for the [TextSynth](https://textsynth.com/index.html) API, using `--model textsynth`.

-To verify the data integrity of the tasks you're performing in addition to running the tasks themselves, you can use the `--check_integrity` flag:
-
-```bash
-python main.py \
-    --model openai \
-    --model_args engine=davinci \
-    --tasks lambada_openai,hellaswag \
-    --check_integrity
-```
-
 ### Other Frameworks

 A number of other libraries contain scripts for calling the eval harness through their library. These include [GPT-NeoX](https://github.com/EleutherAI/gpt-neox/blob/main/eval_tasks/eval_adapter.py), [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples/MoE/readme_evalharness.md), and [mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/blob/master/eval_harness.py).
@@ -172,6 +193,16 @@ python write_out.py \

 This will write out one text file for each task.

+To verify the data integrity of the tasks you're performing in addition to running the tasks themselves, you can use the `--check_integrity` flag:
+
+```bash
+python main.py \
+    --model openai \
+    --model_args engine=davinci \
+    --tasks lambada_openai,hellaswag \
+    --check_integrity
+```
+
 ## Advanced Usage

 For models loaded with the HuggingFace  `transformers` library, any arguments provided via `--model_args` get passed to the relevant constructor directly. This means that anything you can do with `AutoModel` can be done with our library. For example, you can pass a local path via `pretrained=` or use models finetuned with [PEFT](https://github.com/huggingface/peft) by taking the call you would run to evaluate the base model and add `,peft=PATH` to the `model_args` argument:
@@ -201,6 +232,14 @@ To implement a new task in the eval harness, see [this guide](./docs/new_task_gu

 As a start, we currently only support one prompt per task, which we strive to make the "standard" as defined by the benchmark's authors. If you would like to study how varying prompts causes changes in the evaluation score, we support prompts authored in the [Promptsource Library](https://github.com/bigscience-workshop/promptsource/tree/main) as described further in https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/docs/new_task_guide.md and https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/docs/advanced_task_guide.md and welcome contributions of novel task templates and task variants.

+## How to Contribute or Learn More?
+
+For more information on the library and how everything fits together, check out all of our [documentation pages](https://github.com/EleutherAI/lm-evaluation-harness/tree/big-refactor/docs)! We plan to post a larger roadmap of desired + planned library improvements soon, with more information on how contributors can help.
+
+
+You can also ask for help, or discuss new features with the maintainers in the #lm-thunderdome channel of the EleutherAI discord! If you've used the library and have had a positive (or negative) experience, we'd love to hear from you!
+
+
 ## Cite as

 ```

--- a/docs/advanced_task_guide.md
+++ b/docs/advanced_task_guide.md
@@ -236,3 +236,89 @@ Generative tasks:

 Tasks using complex filtering:
 - GSM8k with CoT (+ with Self-Consistency): (`lm_eval/tasks/gsm8k/gsm8k-cot.yaml` ; `lm_eval/tasks/gsm8k/gsm8k-cot-self-consistency.yaml`)
+
+
+## Benchmarks
+
+When evaluating a language model, it is not unusual to test across a number of tasks that may not be related to one another in order to assess a variety of capabilities. In that case, it can be cumbersome to list the full set of tasks each time or to add a new group name to the yaml of each individual task.
+
+To solve this, we can create a benchmark yaml config. This is a config that contains the names of the tasks that should be included in a particular benchmark. The config consists of two main keys: `group`, which denotes the name of the benchmark, and `task`, which lists the tasks. The tasks listed in `task` are the task names that have been registered. A good example would be the list of tasks used to evaluate the Pythia Suite.
+
+```yaml
+group: pythia
+task:
+  - lambada_openai
+  - wikitext
+  - piqa
+  - sciq
+  - wsc
+  - winogrande
+  - arc
+  - logiqa
+  - blimp
+  - hendrycksTest*
+```
+
+Alternatively, a benchmark can define its tasks inline and customize each one. Inline tasks are written the same way as a regular yaml task config.
+
+```yaml
+group: t0_eval
+task:
+  # Coreference Resolution
+  - dataset_path: super_glue
+    dataset_name: wsc.fixed
+    use_prompt: promptsource:*
+    training_split: train
+    validation_split: validation
+    metric_list:
+      - metric: exact_match
+        aggregation: mean
+        higher_is_better: true
+        ignore_case: true
+        ignore_punctuation: true
+  # Coreference Resolution
+  - dataset_path: winogrande
+    dataset_name: winogrande_xl
+    use_prompt: promptsource:*
+    training_split: train
+    validation_split: validation
+    metric_list:
+      - metric: exact_match
+        aggregation: mean
+        higher_is_better: true
+        ignore_case: true
+        ignore_punctuation: true
+  ...
+```
+
+If the benchmark uses the same dataset with different configurations, set `task` on each entry to differentiate between them. For example, T0-Eval evaluates on 3 versions of ANLI, but the Hugging Face dataset collects them in one dataset.
+
+```yaml
+group: t0_eval
+task:
+  ...
+  - task: anli_r1
+    dataset_path: anli
+    use_prompt: promptsource:*
+    training_split: train_r1
+    validation_split: dev_r1
+    metric_list:
+      - metric: exact_match
+        aggregation: mean
+        higher_is_better: true
+        ignore_case: true
+        ignore_punctuation: true
+  - task: anli_r2
+    dataset_path: anli
+    use_prompt: promptsource:*
+    training_split: train_r2
+    validation_split: dev_r2
+    metric_list:
+      - metric: exact_match
+        aggregation: mean
+        higher_is_better: true
+        ignore_case: true
+        ignore_punctuation: true
+```
+
+Calling the benchmark is done the same way we would call any task with `--tasks`. Benchmarks can be added in `lm_eval/benchmarks/`.
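
The wildcard entry `hendrycksTest*` in the Pythia example is resolved against the registered task names. A minimal sketch of that expansion, assuming glob-style matching via the standard-library `fnmatch`; the `expand_task_patterns` helper and the registry contents below are hypothetical and are not taken from the harness:

```python
import fnmatch


def expand_task_patterns(patterns, registered):
    """Return the sorted registered names matched by any glob pattern."""
    matched = set()
    for pattern in patterns:
        # fnmatch.filter keeps only the names that match this pattern.
        matched.update(fnmatch.filter(registered, pattern))
    return sorted(matched)


# Hypothetical registry contents, for illustration only.
registered_tasks = ["arc", "hendrycksTest-anatomy", "hendrycksTest-astronomy", "piqa"]
selected = expand_task_patterns(["piqa", "hendrycksTest*"], registered_tasks)
print(selected)
```

A pattern that matches nothing simply contributes no tasks, which is consistent with the loader only adding names it finds in the task registry.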
--- a/lm_eval/api/task.py
+++ b/lm_eval/api/task.py
@@ -761,7 +761,12 @@ class ConfigurableTask(Task):
            return doc_to_text(doc)
        # Used when applying a Promptsource template
        elif hasattr(doc_to_text, "apply"):
-            return doc_to_text.apply(doc)[0]
+            applied_prompt = doc_to_text.apply(doc)
+            if len(applied_prompt) == 2:
+                return applied_prompt[0]
+            else:
+                eval_logger.warning("Applied prompt returns empty string")
+                return self._config.fewshot_delimiter
        else:
            print(type(doc_to_text))
            raise TypeError
@@ -791,7 +796,12 @@ class ConfigurableTask(Task):
            return doc_to_target(doc)
        # Used when applying a Promptsource template
        elif hasattr(doc_to_target, "apply"):
-            return doc_to_target.apply(doc)[1]
+            applied_prompt = doc_to_target.apply(doc)
+            if len(applied_prompt) == 2:
+                return applied_prompt[1]
+            else:
+                eval_logger.warning("Applied prompt returns empty string")
+                return self._config.fewshot_delimiter
        else:
            raise TypeError
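
The two hunks above apply the same guard: index into the applied template only when it produced exactly two parts, otherwise warn and fall back. A small sketch of that pattern in isolation, assuming the applied prompt is a list and using `"\n\n"` as a stand-in for the configured few-shot delimiter (`pick_prompt_part` is a hypothetical helper, not part of the harness):

```python
import logging

logger = logging.getLogger("prompt-sketch")


def pick_prompt_part(applied_prompt, index, fallback):
    """Return the text (index 0) or target (index 1) of an applied template.

    Falls back when the template did not yield exactly two parts.
    """
    if len(applied_prompt) == 2:
        return applied_prompt[index]
    logger.warning("Applied prompt returns empty string")
    return fallback


# Well-formed template output: [text, target].
print(pick_prompt_part(["Is the sky blue?", "yes"], 1, "\n\n"))
# Malformed output: only one part, so the fallback is returned instead.
print(repr(pick_prompt_part([""], 0, "\n\n")))
```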


--- a/lm_eval/benchmarks/__init__.py
+++ b/lm_eval/benchmarks/__init__.py
+import os
+import yaml
+
+from lm_eval import utils
+from lm_eval.tasks import register_configurable_task, check_prompt_config
+from lm_eval.logger import eval_logger
+from lm_eval.api.registry import (
+    TASK_REGISTRY,
+    GROUP_REGISTRY,
+    ALL_TASKS,
+)
+
+
+def include_benchmarks(task_dir):
+
+    for root, subdirs, file_list in os.walk(task_dir):
+        if (subdirs == [] or subdirs == ["__pycache__"]) and (len(file_list) > 0):
+            for f in file_list:
+                if f.endswith(".yaml"):
+                    try:
+                        benchmark_path = os.path.join(root, f)
+
+                        with open(benchmark_path, "rb") as file:
+                            yaml_config = yaml.full_load(file)
+
+                        assert "group" in yaml_config
+                        group = yaml_config["group"]
+                        all_task_list = yaml_config["task"]
+                        config_list = [
+                            task for task in all_task_list if type(task) != str
+                        ]
+                        task_list = [
+                            task for task in all_task_list if type(task) == str
+                        ]
+
+                        for task_config in config_list:
+                            var_configs = check_prompt_config(
+                                {
+                                    **task_config,
+                                    **{"group": group},
+                                }
+                            )
+                            for config in var_configs:
+                                register_configurable_task(config)
+
+                        task_names = utils.pattern_match(task_list, ALL_TASKS)
+                        for task in task_names:
+                            if task in TASK_REGISTRY:
+                                if group in GROUP_REGISTRY:
+                                    GROUP_REGISTRY[group].append(task)
+                                else:
+                                    GROUP_REGISTRY[group] = [task]
+                                    ALL_TASKS.add(group)
+                    except Exception as error:
+                        eval_logger.warning(
+                            "Failed to load benchmark in\n"
+                            f"                                 {benchmark_path}\n"
+                            "                                 Benchmark will not be added to registry\n"
+                            f"                                 Error: {error}"
+                        )
+
+
+task_dir = os.path.dirname(os.path.abspath(__file__)) + "/"
+include_benchmarks(task_dir)
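
The core of `include_benchmarks` is a split of the `task` list into plain names and inline configs, followed by group registration. The sketch below reproduces that logic on a plain dict shaped like a loaded benchmark yaml; `split_benchmark_entries`, `add_to_group`, and the `demo_benchmark` data are hypothetical names used for illustration, and the real loader additionally expands prompt variants and registers each inline config:

```python
def split_benchmark_entries(yaml_config):
    """Separate registered task names (str) from inline task configs (dict)."""
    group = yaml_config["group"]
    entries = yaml_config["task"]
    task_names = [entry for entry in entries if isinstance(entry, str)]
    # Each inline config inherits the benchmark's group name.
    inline_configs = [
        {**entry, "group": group} for entry in entries if not isinstance(entry, str)
    ]
    return group, task_names, inline_configs


def add_to_group(group_registry, group, task):
    """Append a task to its group, creating the group on first use."""
    group_registry.setdefault(group, []).append(task)


# A dict shaped like what a YAML loader would return for a small benchmark file.
config = {
    "group": "demo_benchmark",
    "task": ["piqa", {"dataset_path": "anli", "task": "anli_r1"}],
}
group, names, configs = split_benchmark_entries(config)
registry = {}
for name in names:
    add_to_group(registry, group, name)
print(registry, configs)
```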
--- a/lm_eval/tasks/benchmarks/pythia.yaml
+++ b/lm_eval/tasks/benchmarks/pythia.yaml
@@ -6,7 +6,7 @@ task:
  - sciq
  - wsc
  - winogrande
-  - arc_*
-  # - logiqa
-  # - blimp_*
-  # - hendrycksTest*
+  - arc
+  - logiqa
+  - blimp
+  - hendrycksTest*
--- a/lm_eval/benchmarks/t0_eval.yaml
+++ b/lm_eval/benchmarks/t0_eval.yaml
+group: t0_eval
+task:
+  # Coreference Resolution
+  - dataset_path: super_glue
+    dataset_name: wsc.fixed
+    use_prompt: promptsource:*
+    training_split: train
+    validation_split: validation
+    metric_list:
+      - metric: exact_match
+        aggregation: mean
+        higher_is_better: true
+        ignore_case: true
+        ignore_punctuation: true
+  # Coreference Resolution
+  - dataset_path: winogrande
+    dataset_name: winogrande_xl
+    use_prompt: promptsource:*
+    training_split: train
+    validation_split: validation
+    metric_list:
+      - metric: exact_match
+        aggregation: mean
+        higher_is_better: true
+        ignore_case: true
+        ignore_punctuation: true
+  # Natural Language Inference
+  - dataset_path: super_glue
+    dataset_name: cb
+    use_prompt: promptsource:*
+    training_split: train
+    validation_split: validation
+    output_type: greedy_until
+    metric_list:
+      - metric: exact_match
+        aggregation: mean
+        higher_is_better: true
+        ignore_case: true
+        ignore_punctuation: true
+  - dataset_path: super_glue
+    dataset_name: rte
+    use_prompt: promptsource:*
+    training_split: train
+    validation_split: validation
+    metric_list:
+      - metric: exact_match
+        aggregation: mean
+        higher_is_better: true
+        ignore_case: true
+        ignore_punctuation: true
+  - task: anli_r1
+    dataset_path: anli
+    use_prompt: promptsource:*
+    training_split: train_r1
+    validation_split: dev_r1
+    metric_list:
+      - metric: exact_match
+        aggregation: mean
+        higher_is_better: true
+        ignore_case: true
+        ignore_punctuation: true
+  - task: anli_r2
+    dataset_path: anli
+    use_prompt: promptsource:*
+    training_split: train_r2
+    validation_split: dev_r2
+    metric_list:
+      - metric: exact_match
+        aggregation: mean
+        higher_is_better: true
+        ignore_case: true
+        ignore_punctuation: true
+  - task: anli_r3
+    dataset_path: anli
+    use_prompt: promptsource:*
+    training_split: train_r3
+    validation_split: dev_r3
+    metric_list:
+      - metric: exact_match
+        aggregation: mean
+        higher_is_better: true
+        ignore_case: true
+        ignore_punctuation: true
+  # Sentence Completion
+  - dataset_path: super_glue
+    dataset_name: copa
+    use_prompt: promptsource:*
+    training_split: train
+    validation_split: validation
+    metric_list:
+      - metric: exact_match
+        aggregation: mean
+        higher_is_better: true
+        ignore_case: true
+        ignore_punctuation: true
+  # Natural Language Inference
+  - dataset_path: hellaswag
+    use_prompt: promptsource:*
+    training_split: train
+    validation_split: validation
+    metric_list:
+      - metric: exact_match
+        aggregation: mean
+        higher_is_better: true
+        ignore_case: true
+        ignore_punctuation: true
+  # Word Sense Disambiguation
+  - dataset_path: super_glue
+    dataset_name: wic
+    use_prompt: promptsource:*
+    training_split: train
+    validation_split: validation
+    metric_list:
+      - metric: exact_match
+        aggregation: mean
+        higher_is_better: true
+        ignore_case: true
+        ignore_punctuation: true
--- a/lm_eval/evaluator.py
+++ b/lm_eval/evaluator.py
@@ -11,6 +11,7 @@ import numpy as np

 import lm_eval.api
 import lm_eval.tasks
+import lm_eval.benchmarks
 import lm_eval.models
 import lm_eval.api.metrics
 import lm_eval.api.registry

--- a/lm_eval/tasks/README.md
+++ b/lm_eval/tasks/README.md
@@ -55,7 +55,7 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for
 - [ ] XStoryCloze (Lintang)
 - [x] XWinograd
 - [ ] PAWS-X (Lintang)
-- [ ] XNLI (Lintang)
+- [x] XNLI
 - [ ] MGSM (Lintang)
 - [ ] SCROLLS
 - [x] Babi

--- a/lm_eval/tasks/__init__.py
+++ b/lm_eval/tasks/__init__.py
@@ -44,7 +44,7 @@ def check_prompt_config(config):
        prompt_list = prompts.load_prompt_list(
            use_prompt=config["use_prompt"],
            dataset_name=config["dataset_path"],
-            subset_name=config["dataset_name"],
+            subset_name=config["dataset_name"] if "dataset_name" in config else None,
        )
        for idx, prompt_variation in enumerate(prompt_list):
            all_configs.append(
@@ -54,7 +54,9 @@ def check_prompt_config(config):
                    **{
                        "task": "_".join(
                            [
-                                get_task_name_from_config(config),
+                                config["task"]
+                                if "task" in config
+                                else get_task_name_from_config(config),
                                prompt_variation,
                            ]
                        )
@@ -98,58 +100,8 @@ def include_task_folder(task_dir):
                        )


-def include_benchmarks(task_dir, benchmark_dir="benchmarks"):
-
-    for root, subdirs, file_list in os.walk(os.path.join(task_dir, benchmark_dir)):
-        if (subdirs == [] or subdirs == ["__pycache__"]) and (len(file_list) > 0):
-            for f in file_list:
-                if f.endswith(".yaml"):
-                    try:
-                        benchmark_path = os.path.join(root, f)
-
-                        with open(benchmark_path, "rb") as file:
-                            yaml_config = yaml.full_load(file)
-
-                        assert "group" in yaml_config
-                        group = yaml_config["group"]
-                        all_task_list = yaml_config["task"]
-                        config_list = [
-                            task for task in all_task_list if type(task) != str
-                        ]
-                        task_list = [
-                            task for task in all_task_list if type(task) == str
-                        ]
-
-                        for task_config in config_list:
-                            var_configs = check_prompt_config(
-                                {
-                                    **task_config,
-                                    **{"group": group},
-                                }
-                            )
-                            for config in var_configs:
-                                register_configurable_task(config)
-
-                        task_names = utils.pattern_match(task_list, ALL_TASKS)
-                        for task in task_names:
-                            if task in TASK_REGISTRY:
-                                if group in GROUP_REGISTRY:
-                                    GROUP_REGISTRY[group].append(task)
-                                else:
-                                    GROUP_REGISTRY[group] = [task]
-                                    ALL_TASKS.add(group)
-                    except Exception as error:
-                        eval_logger.warning(
-                            "Failed to load benchmark in\n"
-                            f"                                 {benchmark_path}\n"
-                            "                                 Benchmark will not be added to registry\n"
-                            f"                                 Error: {error}"
-                        )
-
-
 task_dir = os.path.dirname(os.path.abspath(__file__)) + "/"
 include_task_folder(task_dir)
-include_benchmarks(task_dir)


 def get_task(task_name, config):
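
The change to `check_prompt_config` lets an explicit `task` key override the name derived from the dataset, which is what allows the three ANLI rounds to get distinct names. A rough sketch of the naming rule, where `derive_task_name` is a hypothetical stand-in for `get_task_name_from_config` (its real behavior is not shown in this diff) and `prompt_a` is a made-up prompt name:

```python
def derive_task_name(config):
    """Hypothetical stand-in for the harness's name-from-dataset helper."""
    if "dataset_name" in config:
        return f"{config['dataset_path']}_{config['dataset_name']}"
    return config["dataset_path"]


def variant_task_name(config, prompt_variation):
    """Prefer an explicit `task` key; otherwise derive one from the dataset."""
    base = config["task"] if "task" in config else derive_task_name(config)
    return "_".join([base, prompt_variation])


explicit = variant_task_name({"task": "anli_r1", "dataset_path": "anli"}, "prompt_a")
derived = variant_task_name(
    {"dataset_path": "super_glue", "dataset_name": "cb"}, "prompt_a"
)
print(explicit, derived)
```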

--- a/lm_eval/tasks/benchmarks/t0_eval.yaml
+++ b/lm_eval/tasks/benchmarks/t0_eval.yaml
-group: t0_eval
-task:
-  # # Coreference Resolution
-  # - dataset_path: super_glue
-  #   dataset_name: wsc.fixed
-  #   use_prompt: promptsource:*
-  #   training_split: train
-  #   validation_split: validation
-  #   metric_list:
-  #     - metric: exact_match
-  #       aggregation: mean
-  #       higher_is_better: true
-  #       ignore_case: true
-  #       ignore_punctuation: true
-  # # Coreference Resolution
-  # - dataset_path: winogrande
-  #   dataset_name: winogrande_xl
-  #   use_prompt: promptsource:*
-  #   training_split: train
-  #   validation_split: validation
-  #   metric_list:
-  #     - metric: exact_match
-  #       aggregation: mean
-  #       higher_is_better: true
-  #       ignore_case: true
-  #       ignore_punctuation: true
-  # Natural Language Inference
-  - dataset_path: super_glue
-    dataset_name: cb
-    use_prompt: promptsource:*
-    training_split: train
-    validation_split: validation
-    output_type: greedy_until
-    metric_list:
-      - metric: exact_match
-        aggregation: mean
-        higher_is_better: true
-        ignore_case: true
-        ignore_punctuation: true
-  # Natural Language Inference
-  # - dataset_path: super_glue
-  #   dataset_name: rte
-  #   use_prompt: promptsource:*
-  #   training_split: train
-  #   validation_split: validation
-  #   metric_list:
-  #     - metric: exact_match
-  #       aggregation: mean
-  #       higher_is_better: true
-  #       ignore_case: true
-  #       ignore_punctuation: true
-  # # Natural Language Inference
-  # # - dataset_path: anli
-  # #   use_prompt: promptsource:*
-  # #   training_split: train_r1
-  # #   validation_split: dev_r1
-  # # Sentence Completion
-  # - dataset_path: super_glue
-  #   dataset_name: copa
-  #   use_prompt: promptsource:*
-  #   training_split: train
-  #   validation_split: validation
-  #   metric_list:
-  #     - metric: exact_match
-  #       aggregation: mean
-  #       higher_is_better: true
-  #       ignore_case: true
-  #       ignore_punctuation: true
-  # # Natural Language Inference
-  # - dataset_path: hellaswag
-  #   use_prompt: promptsource:*
-  #   training_split: train
-  #   validation_split: validation
-  #   metric_list:
-  #     - metric: exact_match
-  #       aggregation: mean
-  #       higher_is_better: true
-  #       ignore_case: true
-  #       ignore_punctuation: true
-  # # Word Sense Disambiguation
-  # - dataset_path: super_glue
-  #   dataset_name: wic
-  #   use_prompt: promptsource:*
-  #   training_split: train
-  #   validation_split: validation
-  #   metric_list:
-  #     - metric: exact_match
-  #       aggregation: mean
-  #       higher_is_better: true
-  #       ignore_case: true
-  #       ignore_punctuation: true
--- a/lm_eval/tasks/crows_pairs/crows_pairs_english.yaml
+++ b/lm_eval/tasks/crows_pairs/crows_pairs_english.yaml
@@ -16,6 +16,6 @@ metric_list:
  - metric: likelihood_diff
    aggregation: mean
    higher_is_better: false
-  - metric: acc
+  - metric: pct_stereotype
    aggregation: mean
-    higher_is_better: true
+    higher_is_better: false
--- a/lm_eval/tasks/crows_pairs/utils.py
+++ b/lm_eval/tasks/crows_pairs/utils.py
@@ -13,7 +13,7 @@ def process_results(doc, results):
    # then treat this as predicting stereotyped sentence
    acc = 1.0 if likelihood1 > likelihood2 else 0.0

-    return {"likelihood_diff": diff, "acc": acc}
+    return {"likelihood_diff": diff, "pct_stereotype": acc}


 def doc_to_choice(doc):
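
The rename from `acc` to `pct_stereotype` reflects what the per-document value measures: whether the model assigned a higher likelihood to the stereotyped sentence. Averaged with `mean`, it becomes the fraction of pairs where that happened, which the yaml now marks as `higher_is_better: false`. An illustrative sketch with made-up log-likelihoods; taking the absolute difference for `likelihood_diff` is an assumption, since that line is outside the hunk:

```python
def score_pair(likelihood_stereo, likelihood_anti):
    """Score one sentence pair from two log-likelihoods (illustrative only)."""
    # Assumption: the difference is reported as an absolute value.
    diff = abs(likelihood_stereo - likelihood_anti)
    # 1.0 means the model preferred the stereotyped sentence.
    prefers_stereotype = 1.0 if likelihood_stereo > likelihood_anti else 0.0
    return {"likelihood_diff": diff, "pct_stereotype": prefers_stereotype}


def mean(values):
    return sum(values) / len(values)


# Made-up (stereotyped, anti-stereotyped) log-likelihood pairs.
pairs = [(-10.0, -12.5), (-8.0, -7.0), (-5.0, -9.0), (-6.0, -6.5)]
scores = [score_pair(a, b) for a, b in pairs]
pct = mean([s["pct_stereotype"] for s in scores])
print(pct)
```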

--- a/lm_eval/tasks/hellaswag/hellaswag.yaml
+++ b/lm_eval/tasks/hellaswag/hellaswag.yaml
@@ -7,9 +7,10 @@ output_type: multiple_choice
 training_split: train
 validation_split: validation
 test_split: null
-doc_to_text: "{% set text = activity_label ~ ': ' ~ ctx_a ~ ' ' ~ ctx_b.capitalize() %}{{text|trim|replace(' [title]', '. ')|regex_replace('\\[.*?\\]', '')|replace('  ', ' ')}}"
+process_docs: !function utils.process_docs
+doc_to_text: "{{query}}"
 doc_to_target: "{{label}}"
-doc_to_choice: "{{endings|map('trim')|map('replace', ' [title]', '. ')|map('regex_replace', '\\[.*?\\]', '')|map('replace', '  ', ' ')|list}}"
+doc_to_choice: "{{choices}}"
 metric_list:
  - metric: acc
    aggregation: mean

--- a/lm_eval/tasks/hellaswag/utils.py
+++ b/lm_eval/tasks/hellaswag/utils.py
+import datasets
+import re
+
+
+def preprocess(text):
+    text = text.strip()
+    # NOTE: Brackets are artifacts of the WikiHow dataset portion of HellaSwag.
+    text = text.replace(" [title]", ". ")
+    text = re.sub("\\[.*?\\]", "", text)
+    text = text.replace("  ", " ")
+    return text
+
+
+def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
+    def _process_doc(doc):
+        ctx = doc["ctx_a"] + " " + doc["ctx_b"].capitalize()
+        out_doc = {
+            "query": preprocess(doc["activity_label"] + ": " + ctx),
+            "choices": [preprocess(ending) for ending in doc["endings"]],
+            "gold": int(doc["label"]),
+        }
+        return out_doc
+
+    return dataset.map(_process_doc)
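
To see what the new preprocessing produces, the per-document logic can be run on a plain dict without the `datasets` dependency. The sample document below is made up and only mirrors the fields the diff reads; `process_doc` is the inner `_process_doc` lifted out for illustration:

```python
import re


def preprocess(text):
    text = text.strip()
    # Brackets are artifacts of the WikiHow portion of HellaSwag.
    text = text.replace(" [title]", ". ")
    text = re.sub(r"\[.*?\]", "", text)
    text = text.replace("  ", " ")
    return text


def process_doc(doc):
    ctx = doc["ctx_a"] + " " + doc["ctx_b"].capitalize()
    return {
        "query": preprocess(doc["activity_label"] + ": " + ctx),
        "choices": [preprocess(ending) for ending in doc["endings"]],
        "gold": int(doc["label"]),
    }


# Made-up document with the same fields the diff reads.
sample = {
    "activity_label": "Baking",
    "ctx_a": "[header] How to bake [title] Mix the batter.",
    "ctx_b": "then",
    "endings": [" Pour it into a pan. ", "Leave the room [title] Close the door."],
    "label": "0",
}
out = process_doc(sample)
print(out["query"])
print(out["choices"])
```

Moving this cleanup out of the Jinja templates and into Python keeps the yaml readable, at the cost of one extra `utils.py` per task.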
--- a/lm_eval/tasks/xnli/README.md
+++ b/lm_eval/tasks/xnli/README.md
+# XNLI
+
+### Paper
+
+Title: `XNLI: Evaluating Cross-lingual Sentence Representations`
+
+Abstract: https://arxiv.org/abs/1809.05053
+
+Based on the implementation of @yongzx (see https://github.com/EleutherAI/lm-evaluation-harness/pull/258)
+
+Prompt format (same as XGLM and mGPT):
+
+sentence1 + ", right? " + mask = (Yes|Also|No) + ", " + sentence2
+
+Prediction is the full sequence with the highest likelihood.
+
+Language-specific prompts are translated word-by-word with Google Translate
+and may differ from the ones used by mGPT and XGLM (they do not provide their prompts).
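
The prompt format above can be sketched as building one full sequence per label and picking the most likely one. In this sketch `build_candidates` and `predict` are hypothetical helpers, and the toy scorer stands in for a real language model's log-likelihood:

```python
def build_candidates(sentence1, sentence2, question_word, labels):
    """Build one full sequence per label: s1 + ', <q>? ' + label + ', ' + s2."""
    return [
        f"{sentence1}, {question_word}? {label}, {sentence2}" for label in labels
    ]


def predict(candidates, loglikelihood):
    """Return the index of the candidate with the highest likelihood."""
    scores = [loglikelihood(candidate) for candidate in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


# English labels in the order entailment, neutral, contradiction.
english = build_candidates(
    "The cat sat on the mat", "there is a cat", "right", ["Yes", "Also", "No"]
)
# Toy scorer standing in for a language model: shorter sequences score higher.
best = predict(english, lambda text: -len(text))
print(english[best])
```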
+
+Homepage: https://github.com/facebookresearch/XNLI
+
+
+### Citation
+
+"""
+@InProceedings{conneau2018xnli,
+  author = "Conneau, Alexis
+        and Rinott, Ruty
+        and Lample, Guillaume
+        and Williams, Adina
+        and Bowman, Samuel R.
+        and Schwenk, Holger
+        and Stoyanov, Veselin",
+  title = "XNLI: Evaluating Cross-lingual Sentence Representations",
+  booktitle = "Proceedings of the 2018 Conference on Empirical Methods
+               in Natural Language Processing",
+  year = "2018",
+  publisher = "Association for Computational Linguistics",
+  location = "Brussels, Belgium",
+}
+"""
+
+### Groups and Tasks
+
+#### Groups
+
+* `xnli`
+
+#### Tasks
+
+* `xnli_ar`: Arabic
+* `xnli_bg`: Bulgarian
+* `xnli_de`: German
+* `xnli_el`: Greek
+* `xnli_en`: English
+* `xnli_es`: Spanish
+* `xnli_fr`: French
+* `xnli_hi`: Hindi
+* `xnli_ru`: Russian
+* `xnli_sw`: Swahili
+* `xnli_th`: Thai
+* `xnli_tr`: Turkish
+* `xnli_ur`: Urdu
+* `xnli_vi`: Vietnamese
+* `xnli_zh`: Chinese
+
+### Checklist
+
+For adding novel benchmarks/datasets to the library:
+* [ ] Is the task an existing benchmark in the literature?
+  * [ ] Have you referenced the original paper that introduced the task?
+  * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
+
+
+If other tasks on this dataset are already supported:
+* [ ] Is the "Main" variant of this task clearly denoted?
+* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
--- a/lm_eval/tasks/xnli/utils.py
+++ b/lm_eval/tasks/xnli/utils.py
+import argparse
+from typing import Dict, List
+
+import yaml
+
+
+# Different languages that are part of xnli.
+# These correspond to dataset names (Subsets) on HuggingFace.
+# A yaml file is generated by this script for each language.
+
+LANGUAGES = {
+    "ar": {  # Arabic
+        "QUESTION_WORD": "صحيح",
+        "ENTAILMENT_LABEL": "نعم",
+        "NEUTRAL_LABEL": "لذا",
+        "CONTRADICTION_LABEL": "رقم",
+    },
+    "bg": {  # Bulgarian
+        "QUESTION_WORD": "правилно",
+        "ENTAILMENT_LABEL": "да",
+        "NEUTRAL_LABEL": "така",
+        "CONTRADICTION_LABEL": "не",
+    },
+    "de": {  # German
+        "QUESTION_WORD": "richtig",
+        "ENTAILMENT_LABEL": "Ja",
+        "NEUTRAL_LABEL": "Auch",
+        "CONTRADICTION_LABEL": "Nein",
+    },
+    "el": {  # Greek
+        "QUESTION_WORD": "σωστός",
+        "ENTAILMENT_LABEL": "Ναί",
+        "NEUTRAL_LABEL": "Έτσι",
+        "CONTRADICTION_LABEL": "όχι",
+    },
+    "en": {  # English
+        "QUESTION_WORD": "right",
+        "ENTAILMENT_LABEL": "Yes",
+        "NEUTRAL_LABEL": "Also",
+        "CONTRADICTION_LABEL": "No",
+    },
+    "es": {  # Spanish
+        "QUESTION_WORD": "correcto",
+        "ENTAILMENT_LABEL": "Sí",
+        "NEUTRAL_LABEL": "Asi que",
+        "CONTRADICTION_LABEL": "No",
+    },
+    "fr": {  # French
+        "QUESTION_WORD": "correct",
+        "ENTAILMENT_LABEL": "Oui",
+        "NEUTRAL_LABEL": "Aussi",
+        "CONTRADICTION_LABEL": "Non",
+    },
+    "hi": {  # Hindi
+        "QUESTION_WORD": "सही",
+        "ENTAILMENT_LABEL": "हाँ",
+        "NEUTRAL_LABEL": "इसलिए",
+        "CONTRADICTION_LABEL": "नहीं",
+    },
+    "ru": {  # Russian
+        "QUESTION_WORD": "правильно",
+        "ENTAILMENT_LABEL": "Да",
+        "NEUTRAL_LABEL": "Так",
+        "CONTRADICTION_LABEL": "Нет",
+    },
+    "sw": {  # Swahili
+        "QUESTION_WORD": "sahihi",
+        "ENTAILMENT_LABEL": "Ndiyo",
+        "NEUTRAL_LABEL": "Hivyo",
+        "CONTRADICTION_LABEL": "Hapana",
+    },
+    "th": {  # Thai
+        "QUESTION_WORD": "ถูกต้อง",
+        "ENTAILMENT_LABEL": "ใช่",
+        "NEUTRAL_LABEL": "ดังนั้น",
+        "CONTRADICTION_LABEL": "ไม่",
+    },
+    "tr": {  # Turkish
+        "QUESTION_WORD": "doğru",
+        "ENTAILMENT_LABEL": "Evet",
+        "NEUTRAL_LABEL": "Böylece",
+        "CONTRADICTION_LABEL": "Hayır",
+    },
+    "ur": {  # Urdu
+        "QUESTION_WORD": "صحیح",
+        "ENTAILMENT_LABEL": "جی ہاں",
+        "NEUTRAL_LABEL": "اس لئے",
+        "CONTRADICTION_LABEL": "نہیں",
+    },
+    "vi": {  # Vietnamese
+        "QUESTION_WORD": "đúng",
+        "ENTAILMENT_LABEL": "Vâng",
+        "NEUTRAL_LABEL": "Vì vậy",
+        "CONTRADICTION_LABEL": "Không",
+    },
+    "zh": {  # Chinese
+        "QUESTION_WORD": "正确",
+        "ENTAILMENT_LABEL": "是的",
+        "NEUTRAL_LABEL": "所以",
+        "CONTRADICTION_LABEL": "不是的",
+    },
+}
+
+
+def gen_lang_yamls(output_dir: str, overwrite: bool) -> None:
+    """
+    Generate a yaml file for each language.
+
+    :param output_dir: The directory to output the files to.
+    :param overwrite: Whether to overwrite files if they already exist.
+    """
+    # Names of files skipped because they already exist and overwrite is False.
+    err = []
+    for lang in LANGUAGES.keys():
+        file_name = f"xnli_{lang}.yaml"
+        try:
+            QUESTION_WORD = LANGUAGES[lang]["QUESTION_WORD"]
+            ENTAILMENT_LABEL = LANGUAGES[lang]["ENTAILMENT_LABEL"]
+            NEUTRAL_LABEL = LANGUAGES[lang]["NEUTRAL_LABEL"]
+            CONTRADICTION_LABEL = LANGUAGES[lang]["CONTRADICTION_LABEL"]
+            with open(
+                f"{output_dir}/{file_name}", "w" if overwrite else "x", encoding="utf8"
+            ) as f:
+                f.write("# Generated by utils.py\n")
+                yaml.dump(
+                    {
+                        "include": "xnli_common_yaml",
+                        "dataset_name": lang,
+                        "task": f"xnli_{lang}",
+                        "doc_to_text": "",
+                        # Jinja expression yielding the three answer choices;
+                        # only the middle lines need f-string interpolation.
+                        "doc_to_choice": "{{["
+                        f'premise+", {QUESTION_WORD}? {ENTAILMENT_LABEL}, "+hypothesis,'
+                        f'premise+", {QUESTION_WORD}? {NEUTRAL_LABEL}, "+hypothesis,'
+                        f'premise+", {QUESTION_WORD}? {CONTRADICTION_LABEL}, "+hypothesis'
+                        "]}}",
+                    },
+                    f,
+                    allow_unicode=True,
+                )
+        except FileExistsError:
+            err.append(file_name)
+
+    if len(err) > 0:
+        raise FileExistsError(
+            "Files were not created because they already exist (use --overwrite flag):"
+            f" {', '.join(err)}"
+        )
+
+
+def main() -> None:
+    """Parse CLI args and generate language-specific yaml files."""
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "--overwrite",
+        default=False,
+        action="store_true",
+        help="Overwrite files if they already exist",
+    )
+    parser.add_argument(
+        "--output-dir", default=".", help="Directory to write yaml files to"
+    )
+    args = parser.parse_args()
+
+    gen_lang_yamls(output_dir=args.output_dir, overwrite=args.overwrite)
+
+
+if __name__ == "__main__":
+    main()
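
As a reading aid for the template built in `gen_lang_yamls`, the sketch below mirrors in plain Python what the generated `doc_to_choice` expression appears to produce for a single document. `build_choices` and the sample premise/hypothesis are hypothetical and exist only for illustration; the harness evaluates the Jinja expression itself and does not call anything like this.

```python
from typing import Dict, List

# English entry copied from LANGUAGES in utils.py.
EN: Dict[str, str] = {
    "QUESTION_WORD": "right",
    "ENTAILMENT_LABEL": "Yes",
    "NEUTRAL_LABEL": "Also",
    "CONTRADICTION_LABEL": "No",
}


def build_choices(premise: str, hypothesis: str, words: Dict[str, str]) -> List[str]:
    """Hypothetical helper: plain-Python equivalent of the generated template."""
    # Same order as in the template: entailment, neutral, contradiction.
    labels = [
        words["ENTAILMENT_LABEL"],
        words["NEUTRAL_LABEL"],
        words["CONTRADICTION_LABEL"],
    ]
    question = words["QUESTION_WORD"]
    return [f"{premise}, {question}? {label}, {hypothesis}" for label in labels]


# Made-up example document, not taken from the dataset.
choices = build_choices("The cat sat on the mat", "An animal is sitting", EN)
for choice in choices:
    print(choice)
```

Each choice is a full continuation (premise, question word, label, hypothesis), which is consistent with `doc_to_text` being left empty in the generated configs.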
--- a/lm_eval/tasks/xnli/xnli_ar.yaml
+++ b/lm_eval/tasks/xnli/xnli_ar.yaml
+# Generated by utils.py
+dataset_name: ar
+doc_to_choice: '{{[premise+", صحيح? نعم, "+hypothesis,premise+", صحيح? لذا, "+hypothesis,premise+",
+  صحيح? رقم, "+hypothesis]}}'
+doc_to_text: ''
+include: xnli_common_yaml
+task: xnli_ar
--- a/lm_eval/tasks/xnli/xnli_bg.yaml
+++ b/lm_eval/tasks/xnli/xnli_bg.yaml
+# Generated by utils.py
+dataset_name: bg
+doc_to_choice: '{{[premise+", правилно? да, "+hypothesis,premise+", правилно? така,
+  "+hypothesis,premise+", правилно? не, "+hypothesis]}}'
+doc_to_text: ''
+include: xnli_common_yaml
+task: xnli_bg
--- a/lm_eval/tasks/xnli/xnli_common_yaml
+++ b/lm_eval/tasks/xnli/xnli_common_yaml
+# This file will be included in the generated language-specific task configs.
+# It has no .yaml file extension because it is not meant to be loaded directly
+# by the harness as a task of its own.
+group: xnli
+task: null
+dataset_path: xnli
+dataset_name: null
+output_type: multiple_choice
+training_split: train
+validation_split: validation
+doc_to_text: null
+doc_to_target: label
+doc_to_choice: null
+metric_list:
+  - metric: acc
+    aggregation: mean
+    higher_is_better: true
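
One way to read the relationship between `xnli_common_yaml` and the generated per-language files is as a dictionary merge in which the including file's keys take precedence over the shared ones. The sketch below assumes that override behaviour; it is not the harness's loader, and `resolve_include` is a hypothetical name used only here.

```python
from typing import Any, Dict

# Subset of the shared settings from xnli_common_yaml, written as a dict.
COMMON: Dict[str, Any] = {
    "group": "xnli",
    "task": None,
    "dataset_path": "xnli",
    "dataset_name": None,
    "output_type": "multiple_choice",
    "doc_to_text": None,
    "doc_to_target": "label",
    "doc_to_choice": None,
}

# Language-specific keys as in xnli_de.yaml; the doc_to_choice value is
# abbreviated here for readability and is not the real template.
DE: Dict[str, Any] = {
    "dataset_name": "de",
    "task": "xnli_de",
    "doc_to_text": "",
    "doc_to_choice": "{{[...]}}",
}


def resolve_include(common: Dict[str, Any], specific: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical merge: keys in the including file override the shared ones."""
    merged = dict(common)    # start from the shared defaults
    merged.update(specific)  # language-specific values win
    return merged


config = resolve_include(COMMON, DE)
print(config["task"], config["dataset_name"], config["output_type"])
```

Under this reading, the `null` placeholders in the common file (`task`, `dataset_name`, `doc_to_text`, `doc_to_choice`) are exactly the keys each generated file fills in.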
--- a/lm_eval/tasks/xnli/xnli_de.yaml
+++ b/lm_eval/tasks/xnli/xnli_de.yaml
+# Generated by utils.py
+dataset_name: de
+doc_to_choice: '{{[premise+", richtig? Ja, "+hypothesis,premise+", richtig? Auch,
+  "+hypothesis,premise+", richtig? Nein, "+hypothesis]}}'
+doc_to_text: ''
+include: xnli_common_yaml
+task: xnli_de