**A new v0.4.0 release of lm-evaluation-harness is available!**
New updates and features include:
- Internal refactoring
- Config-based task creation and configuration
- Easier import and sharing of externally-defined task config YAMLs
- Support for Jinja2 prompt design, easy modification of prompts + prompt imports from Promptsource
- More advanced configuration options, including output post-processing, answer extraction, multiple LM generations per document, configurable few-shot settings, and more
- Speedups and new modeling libraries supported, including: faster data-parallel HF model usage, vLLM support, MPS support with HuggingFace, and more
- Logging and usability changes
- New tasks including CoT BIG-Bench-Hard, Belebele, user-defined task groupings, and more
Please see our updated documentation pages in `docs/` for more details.
Development will be continuing on the `main` branch, and we encourage you to give us feedback on what features are desired and how to improve the library further, or to ask questions, either in issues or PRs on GitHub, or in the [EleutherAI discord](https://discord.gg/eleutherai)!
## Overview

This project provides a unified framework to test generative language models on a large number of different evaluation tasks.
...

To install the `lm-eval` package from the github repository, run:

```bash
git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
```
We also provide a number of optional dependencies for extended functionality. Extras can be installed via `pip install -e ".[NAME]"`
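For example, to pull in the extra dependencies used for vLLM inference (the `vllm` extra is one of the extras defined in the project's `pyproject.toml`):

```bash
pip install -e ".[vllm]"
```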
...

To use `accelerate` with the `lm-eval` command, use:

```bash
accelerate launch --no_python lm-eval --model ...
```
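As a hedged sketch, a fuller data-parallel invocation might look like the following (the model and task names here are illustrative):

```bash
accelerate launch --no_python lm-eval \
    --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks lambada_openai \
    --batch_size 16
```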
### Tensor + Data Parallel and Optimized Inference with `vLLM`

We also support vLLM for faster inference on [supported model types](https://docs.vllm.ai/en/latest/models/supported_models.html), for single-GPU or multi-GPU inference using tensor parallelism, data parallelism, or a combination of both. For example:
```bash
lm_eval --model vllm \
    --model_args pretrained={model_name},tensor_parallel_size={number of GPUs to use},dtype=auto,gpu_memory_utilization=0.8 \
    --tasks lambada_openai \
    --batch_size auto
```
For a full list of supported vLLM configurations, please reference our vLLM integration and the vLLM documentation.
vLLM occasionally differs in output from Huggingface. We treat Huggingface as the reference implementation, and provide a script at [./scripts/model_comparator.py](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/scripts/model_comparator.py) for checking the validity of vLLM results against HF.
### Model APIs and Inference Servers

Our library also supports the evaluation of models served via several commercial APIs, and we hope to implement support for the most commonly used performant local/self-hosted inference servers.
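As a hedged sketch, evaluating through a hosted API might look like the following (`openai-completions` is one of the supported API model types; the exact keyword argument naming the underlying model, written here as `model={model_name}`, should be checked against the current docs, and the API key is read from the environment):

```bash
export OPENAI_API_KEY=<your key here>
lm_eval --model openai-completions \
    --model_args model={model_name} \
    --tasks lambada_openai
```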
...

> [!Note]
> You can inspect what the LM inputs look like by running the following command:
>
> ```bash
> python write_out.py \
>     --tasks all_tasks \
>     --num_fewshot 5 \
>     --num_examples 10 \
>     --output_base_path /path/to/output/folder
> ```
>
> This will write out one text file for each task.
To verify the data integrity of the tasks you're performing in addition to running the tasks themselves, you can use the `--check_integrity` flag:
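For example (the model and task names here are illustrative):

```bash
lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks sciq \
    --check_integrity
```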
...

Additionally, one can provide a directory with `--use_cache` to cache the results of prior runs. This allows you to avoid repeated execution of the same (model, task) pairs for re-scoring.
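For instance, a run that both saves results and caches model responses might look like this (the paths, model, and task names are illustrative):

```bash
lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks hellaswag \
    --output_path results/pythia-160m \
    --use_cache cache/pythia-160m
```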
For a full list of supported arguments, check out the [interface](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md) guide in our documentation!
## How to Contribute or Learn More?

For more information on the library and how everything fits together, check out all of our [documentation pages](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs)! We plan to post a larger roadmap of desired + planned library improvements soon, with more information on how contributors can help.
You can also ask for help, or discuss new features with the maintainers, in the #lm-thunderdome channel of the EleutherAI discord!
### Implementing new tasks

To implement a new task in the eval harness, see [this guide](./docs/new_task_guide.md).
In general, we follow this priority list for addressing concerns about prompting and other eval details:
1. If there is widespread agreement among people who train LLMs, use the agreed upon procedure.
2. If there is a clear and unambiguous official implementation, use that procedure.
3. If there is widespread agreement among people who evaluate LLMs, use the agreed upon procedure.
...

These are guidelines and not rules, and can be overruled in special circumstances.
We try to prioritize agreement with the procedures used by other groups to decrease the harm when people inevitably compare runs across different papers despite our discouragement of the practice. Historically, we also prioritized the implementation from [Language Models are Few Shot Learners](https://arxiv.org/abs/2005.14165) as our original goal was specifically to compare results with that paper.
### Support

The best way to get support is to open an issue on this repo or join the [EleutherAI Discord server](https://discord.gg/eleutherai). The `#lm-thunderdome` channel is dedicated to developing this project and the `#release-discussion` channel is for receiving support for our releases. If you've used the library and have had a positive (or negative) experience, we'd love to hear from you!
## Cite as

...

@misc{eval-harness,
  author = {Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang and Tang, Eric and Thite, Anish and Wang, Ben and Wang, Kevin and Zou, Andy},
  title = {A framework for few-shot language model evaluation},
...

Equivalently, running the library can be done via the `lm-eval` entrypoint at the command line.
This mode supports a number of command-line arguments, the details of which can also be seen by running with `-h` or `--help`:
* `--model`: Selects which model type or provider is evaluated. Must be a string corresponding to the name of the model type/provider being used. See [the main README](https://github.com/EleutherAI/lm-evaluation-harness/tree/main#commercial-apis) for a full list of enabled model names and supported libraries or APIs.
* `--model_args`: Controls parameters passed to the model constructor. Accepts a string containing comma-separated keyword arguments to the model class of the format `"arg1=val1,arg2=val2,..."`, for example `--model_args pretrained=EleutherAI/pythia-160m,dtype=float32`. For a full list of supported keyword arguments, see the initialization of the `lm_eval.api.model.LM` subclass, e.g. [`HFLM`](https://github.com/EleutherAI/lm-evaluation-harness/blob/365fcda9b85bbb6e0572d91976b8daf409164500/lm_eval/models/huggingface.py#L66)
...
`lm_eval` supplies two functions for external import and use: `lm_eval.evaluate()` and `lm_eval.simple_evaluate()`.
`simple_evaluate()` can be used by simply creating an `lm_eval.api.model.LM` subclass that implements the methods described in the [Model Guide](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs/model_guide.md), and wrapping your custom model in that class as follows:
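A minimal sketch of this pattern, assuming a hypothetical `MyCustomLM` wrapper class and an illustrative task name:

```python
import lm_eval
from lm_eval.api.model import LM


class MyCustomLM(LM):
    """Hypothetical wrapper around your own model, implementing the
    request methods described in the Model Guide."""

    def loglikelihood(self, requests):
        ...

    def loglikelihood_rolling(self, requests):
        ...

    def generate_until(self, requests):
        ...


# Wrap your model, then hand it to the harness.
lm_obj = MyCustomLM()

results = lm_eval.simple_evaluate(
    model=lm_obj,
    tasks=["hellaswag"],  # illustrative task name
)
```

`simple_evaluate()` returns a dictionary of results that can then be logged or post-processed as needed.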
Where `Instance` is a dataclass defined in [`lm_eval.api.instance`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/api/instance.py) with property `args` of request-dependent type signature described below.
We support three types of requests, consisting of different interactions / measurements with an autoregressive LM.
# New Task Guide

`lm-evaluation-harness` is a framework that strives to support a wide range of zero- and few-shot evaluation tasks on autoregressive language models (LMs).
This documentation page provides a walkthrough to get started creating your own task, in `lm-eval` versions v0.4.0 and later.
A more interactive tutorial is available as a Jupyter notebook [here](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/examples/lm-eval-overview.ipynb).
## Setup

...
In this document, we'll walk through the basics of implementing a static benchmark evaluation in two formats: a *generative* task which requires sampling text from a model, such as [`gsm8k`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/gsm8k/gsm8k.yaml), and a *discriminative*, or *multiple choice*, task where the model picks the most likely of several fixed answer choices, such as [`sciq`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/sciq/sciq.yaml).