Lastly, we'll no longer be accepting new feature requests beyond those already open on the `master` branch as we carry out this switch to the new version over the next week, though we will continue accepting bugfixes to the `master` branch and PRs to `big-refactor`. Feel free to reach out in the #lm-thunderdome channel of the EAI discord for more information.
## Overview

This project provides a unified framework to test generative language models on a large number of different evaluation tasks.

Features:

- Many tasks implemented: 200+ tasks [implemented in the old framework](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/docs/task_table.md), which require porting to the new setup as described in [the new task guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/docs/new_task_guide.md).
- Support for models loaded via [transformers](https://github.com/huggingface/transformers/) (including quantization via [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)), [GPT-NeoX](https://github.com/EleutherAI/gpt-neox), and [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/), with a flexible tokenization-agnostic interface.
- Support for commercial APIs including [OpenAI](https://openai.com), [goose.ai](https://goose.ai), and [TextSynth](https://textsynth.com/).
- Support for evaluation on adapters (e.g. LoRA) supported in [HuggingFace's PEFT library](https://github.com/huggingface/peft).
...

To support loading GPTQ quantized models, install the package with the `gptq` extra:

```
pip install -e ".[gptq]"
```
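Once installed, a quantized checkpoint can be evaluated by pointing the harness at it through `--model_args`. The exact key names below (in particular `autogptq`) are an assumption for illustration, so check the current README for the supported spelling:

```
# Sketch only: the `autogptq` model_args key is assumed, not verified.
python main.py \
    --model hf \
    --model_args pretrained=model-name-or-path,autogptq=True \
    --tasks hellaswag
```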
## Support
The best way to get support is to open an issue on this repo or join the [EleutherAI discord server](https://discord.gg/eleutherai). The `#lm-thunderdome` channel is dedicated to developing this project and the `#release-discussion` channel is for receiving support for our releases.
## Basic Usage

### Hugging Face `transformers`
...
Models that are loaded via `transformers.AutoModelForCausalLM` (autoregressive, decoder-only GPT style models) in Huggingface are supported. Support for `transformers.AutoModelForSeq2SeqLM` (such as encoder-decoder models like T5) is currently pending.
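For example, a causal model hosted on the Hugging Face Hub might be evaluated with a command along the following lines (the model name, task, and `hf` model-type string here are illustrative assumptions; adjust them to your setup):

```
python main.py \
    --model hf \
    --model_args pretrained=EleutherAI/gpt-j-6B \
    --tasks hellaswag \
    --device cuda:0 \
    --batch_size 8
```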
Batch size selection can be automated by setting the `--batch_size` flag to `auto`. This will automatically detect the largest batch size that fits on your device. On tasks where there is a large difference between the longest and shortest example, it can be helpful to periodically recompute the largest batch size to gain a further speedup. To do this, append `:N` to the above flag to automatically recompute the largest batch size `N` times. For example, to recompute the batch size 4 times, the command would be:
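```
# Model and task names are illustrative, as in the previous example.
python main.py \
    --model hf \
    --model_args pretrained=EleutherAI/gpt-j-6B \
    --tasks hellaswag \
    --batch_size auto:4
```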
Where `Instance` is a dataclass defined in [`lm_eval.api.instance`](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/api/instance.py) with property `args` which returns a tuple of (context, continuation).
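For intuition, a simplified stand-in for that dataclass might look like the following sketch (field names other than `args` are illustrative assumptions, not the exact definition in `lm_eval.api.instance`):

```
from dataclasses import dataclass

@dataclass
class Instance:
    # Illustrative simplification; the real class in lm_eval.api.instance
    # carries more metadata than shown here.
    request_type: str   # e.g. "loglikelihood" -- assumed field name
    context: str        # the prompt presented to the model
    continuation: str   # the target string to be scored

    @property
    def args(self) -> tuple:
        # As described above: a (context, continuation) tuple.
        return (self.context, self.continuation)
```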
- [x] Hendrycks Ethics (missing some tasks/metrics, see PR 660: <https://github.com/EleutherAI/lm-evaluation-harness/pull/660> for more info)
- [x] TruthfulQA (mc1) (Lintang)
- [ ] TruthfulQA (mc2) (Lintang)
- [ ] TruthfulQA (gen) (Lintang)
- [ ] MuTual
- [ ] Hendrycks Math (Hailey)
- [ ] Asdiv
- [ ] GSM8k
- [x] Arithmetic
...
- [x] ~~Pile (perplexity)~~
- [ ] BLiMP (Lintang)
- [x] ToxiGen
- [ ] StoryCloze (Lintang)
- [ ] NaturalQs (Hailey)
- [x] CrowS-Pairs
- [x] XCopa
- [ ] BIG-Bench (Hailey)
- [ ] XStoryCloze (Lintang)
- [x] XWinograd
- [ ] PAWS-X (Lintang)
- [ ] XNLI (Lintang)
- [ ] MGSM (Lintang)
- [ ] SCROLLS
- [x] Babi
# Novel Tasks

Tasks added in the revamped harness that were not previously available. Again, a strikethrough denotes checking performed *against the original task's implementation or published results introducing the task*.
Title: `BLiMP: A Benchmark of Linguistic Minimal Pairs for English`
Abstract: `https://arxiv.org/abs/1912.00582`
BLiMP is a challenge set for evaluating what language models (LMs) know about
major grammatical phenomena in English. BLiMP consists of 67 sub-datasets, each
containing 1000 minimal pairs isolating specific contrasts in syntax, morphology,
or semantics. The data is automatically generated according to expert-crafted
grammars.
Homepage: https://github.com/alexwarstadt/blimp
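As the abstract in the citation below notes, models are scored by whether they assign higher probability to the acceptable sentence in each minimal pair. A minimal sketch of that criterion using `transformers` follows; the model choice and example pair are illustrative, and this is not the harness's implementation:

```
# Minimal sketch of BLiMP-style scoring: a model "passes" a minimal pair
# when it assigns higher log-probability to the grammatical sentence.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sentence_logprob(text: str) -> float:
    """Summed log-probability of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, HF returns the mean cross-entropy over
        # predicted tokens; scale by their count to get the summed log-prob.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)

good = "The cats annoy Tim."   # acceptable member of the pair
bad = "The cats annoys Tim."   # unacceptable member

print(sentence_logprob(good) > sentence_logprob(bad))  # ideally True
```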
### Citation
```
@article{warstadt2019blimp,
author = {Warstadt, Alex and Parrish, Alicia and Liu, Haokun and Mohananey, Anhad and Peng, Wei and Wang, Sheng-Fu and Bowman, Samuel R.},
title = {BLiMP: The Benchmark of Linguistic Minimal Pairs for English},
journal = {Transactions of the Association for Computational Linguistics},
volume = {8},
number = {},
pages = {377-392},
year = {2020},
doi = {10.1162/tacl\_a\_00321},
URL = {https://doi.org/10.1162/tacl_a_00321},
eprint = {https://doi.org/10.1162/tacl_a_00321},
abstract = { We introduce The Benchmark of Linguistic Minimal Pairs (BLiMP),1 a challenge set for evaluating the linguistic knowledge of language models (LMs) on major grammatical phenomena in English. BLiMP consists of 67 individual datasets, each containing 1,000 minimal pairs—that is, pairs of minimally different sentences that contrast in grammatical acceptability and isolate specific phenomenon in syntax, morphology, or semantics. We generate the data according to linguist-crafted grammar templates, and human aggregate agreement with the labels is 96.4\%. We evaluate n-gram, LSTM, and Transformer (GPT-2 and Transformer-XL) LMs by observing whether they assign a higher probability to the acceptable sentence in each minimal pair. We find that state-of-the-art models identify morphological contrasts related to agreement reliably, but they struggle with some subtle semantic and syntactic phenomena, such as negative polarity items and extraction islands. }
}
```
### Subtasks
List or describe tasks defined in this folder, and their names here:
* `task_name`: `1-sentence description of what this particular task does`
* `task_name2`: .....
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?