Unverified Commit 318bd988 authored by Wang, Yi's avatar Wang, Yi Committed by GitHub

Merge branch 'EleutherAI:master' into fix_ptun

parents 35f1b5a7 25dfd3f6
......@@ -3,3 +3,5 @@ env
data/
lm_cache
.idea
*.egg-info/
* @jon-tow @StellaAthena
* @haileyschoelkopf @lintangsutawika
FROM nvidia/cuda:11.2.0-cudnn8-runtime-ubuntu20.04
### Install python 3.10 and set it as default python interpreter
RUN apt update && apt install software-properties-common -y && \
add-apt-repository ppa:deadsnakes/ppa -y && apt update && \
apt install curl -y && \
apt install python3.10 -y && \
update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.10 1 && \
update-alternatives --install /usr/bin/python python /usr/bin/python3.10 1 && \
apt install python3.10-venv python3.10-dev -y && \
curl -Ss https://bootstrap.pypa.io/get-pip.py | python3.10 && \
apt-get clean && rm -rf /var/lib/apt/lists/
### Copy files
COPY . /lm-evaluation-harness/
### Set working directory
WORKDIR /lm-evaluation-harness
### Install requirements
RUN pip install --no-cache-dir -e .
### Run bash
CMD ["/bin/bash"]
# Language Model Evaluation Harness
## We're Refactoring LM-Eval!
(as of 6/15/23)
We have a revamp of the Evaluation Harness library internals staged on the [big-refactor](https://github.com/EleutherAI/lm-evaluation-harness/tree/big-refactor) branch! It is far along in progress, but before we start to move the `master` branch of the repository over to this new design with a new version release, we'd like to ensure that it's been tested by outside users and there are no glaring bugs.
We’d like your help to test it out! You can help by:
1. Trying out your current workloads on the big-refactor branch, and seeing if anything breaks or is counterintuitive,
2. Porting tasks supported in the previous version of the harness to the new YAML configuration format. Please check out our [task implementation guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/new_task_guide.md) for more information.
If you choose to port a task not yet completed according to [our checklist](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/tasks/README.md), you can contribute it by opening a PR with [Refactor] in its title that includes:
- A shell command to run the task in the `master` branch, and the resulting score
- A shell command to run the task in your PR branch to `big-refactor`, and the resulting score, to show that the two implementations produce matching results.
Lastly, as we carry out this switch to the new version over the next week, we will not be accepting new feature requests for the `master` branch beyond those already open, though we will continue to accept bugfixes to the `master` branch and PRs to `big-refactor`. Feel free to reach out in the #lm-thunderdome channel of the EleutherAI Discord for more information.
## Overview
This project provides a unified framework to test generative language models on a large number of different evaluation tasks.
......@@ -142,7 +157,7 @@ When reporting eval harness results, please also report the version of each task
## Test Set Decontamination
To address concerns about train / test contamination, we provide utilities for comparing results on a benchmark using only the data points nto found in the model training set. Unfortunately, outside of models trained on the Pile and C4, its very rare that people who train models disclose the contents of the training data. However this utility can be useful to evaluate models you have trained on private data, provided you are willing to pre-compute the necessary indices. We provide computed indices for 13-gram exact match deduplication against the Pile, and plan to add additional precomputed dataset indices in the future (including C4 and min-hash LSH deduplication).
To address concerns about train / test contamination, we provide utilities for comparing results on a benchmark using only the data points not found in the model training set. Unfortunately, outside of models trained on the Pile and C4, it's very rare that people who train models disclose the contents of the training data. However, this utility can be useful to evaluate models you have trained on private data, provided you are willing to pre-compute the necessary indices. We provide computed indices for 13-gram exact match deduplication against the Pile, and plan to add additional precomputed dataset indices in the future (including C4 and min-hash LSH deduplication).
For details on text decontamination, see the [decontamination guide](./docs/decontamination.md).
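The indexing pipeline itself lives behind the decontamination utilities, but the core idea is simple n-gram overlap. A minimal sketch, assuming nothing about the harness's actual index format (the helper names below are hypothetical):

```python
# Illustrative 13-gram exact-match contamination check (not the harness's implementation).
def ngrams(tokens, n=13):
    """All contiguous n-grams of a token sequence, as hashable tuples."""
    return {tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(eval_doc_tokens, train_ngram_index, n=13):
    """Flag an evaluation document if any of its 13-grams appears verbatim in the training set."""
    return any(gram in train_ngram_index for gram in ngrams(eval_doc_tokens, n))
```

In practice the training-set side is precomputed once (the indices we distribute), so only the evaluation documents need to be tokenized and hashed at eval time.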
......
......@@ -202,7 +202,7 @@ def construct_requests(self, doc, ctx):
#### What's a `Request`? What's a `doc`?
To reiterate, a `doc` is just a `Dict` object that contains information about a document from your corpus. It can contain things like a prompt, question type information, answers, and anything else you think will be needed in order to assess your model for a given task. Keep in mind that its fields can be basically whatever you want (you can sort this out in `training_docs` / `validation_docs` / `test_docs` if you need to customise things - see above); just remember to be consistent with them throughout the rest of the `Task` you write up.
A `Request` is an object that takes the text prompt you want to present to a model and computes one of a few different types of response. These are evaluated lazily (meaning, only when the result is actually needed). If your task requires generating text, you'll need to return an `rf.greedy_until` request; otherwise an `rf.loglikelihood` across all labels in a classification task will do.
The function `construct_requests` can return a list of `Request`s or an iterable; it's perfectly fine to `yield` them from something or other. This is particularly handy if you are creating more than one request per `doc` (usually because you're up to something like multi-task learning). The objects this function returns then get consumed one by one and turned into result objects.
The function `construct_requests` returns a list or tuple of `Request`s, or a single `Request`. This is particularly handy if you are creating more than one request per `doc` (usually because you're up to something like multi-task learning). The objects this function returns then get consumed one by one and turned into result objects.
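As a concrete sketch, the two common return shapes look like this (the task classes below are hypothetical and only mirror patterns used elsewhere in this repository):

```python
from lm_eval.base import Task, rf

class MyClassificationTask(Task):  # hypothetical, for illustration only
    def construct_requests(self, doc, ctx):
        # One loglikelihood Request per candidate label; they are consumed in
        # the same order by process_results.
        return [rf.loglikelihood(ctx, " {}".format(choice))[0] for choice in doc["choices"]]

class MyGenerationTask(Task):  # hypothetical, for illustration only
    def construct_requests(self, doc, ctx):
        # A single generation Request that stops at the first newline.
        return rf.greedy_until(ctx, ["\n"])
```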
```python
......
......@@ -286,6 +286,13 @@
|reversed_words | |✓ | | 10000|acc |
|rte |✓ |✓ | | 277|acc |
|sciq |✓ |✓ |✓ | 1000|acc, acc_norm |
|scrolls_contractnli |✓ |✓ | | 1037|em, acc, acc_norm |
|scrolls_govreport |✓ |✓ | | 972|rouge1, rouge2, rougeL |
|scrolls_narrativeqa |✓ |✓ | | 3425|f1 |
|scrolls_qasper |✓ |✓ | | 984|f1 |
|scrolls_qmsum |✓ |✓ | | 272|rouge1, rouge2, rougeL |
|scrolls_quality |✓ |✓ | | 2086|em, acc, acc_norm |
|scrolls_summscreenfd |✓ |✓ | | 338|rouge1, rouge2, rougeL |
|squad2 |✓ |✓ | | 11873|exact, f1, HasAns_exact, HasAns_f1, NoAns_exact, NoAns_f1, best_exact, best_f1 |
|sst |✓ |✓ | | 872|acc |
|swag |✓ |✓ | | 20006|acc, acc_norm |
......
......@@ -119,6 +119,12 @@ class LM(abc.ABC):
class BaseLM(LM):
def __init__(self):
super().__init__()
self.batch_schedule = 1
self.batch_sizes = {}
self.max_batch_size = 512
@property
@abstractmethod
def eot_token_id(self):
......@@ -167,19 +173,52 @@ class BaseLM(LM):
"""
pass
def _detect_batch_size(self, requests=None, pos=0):
if requests:
_, context_enc, continuation_enc = requests[pos]
max_length = len(
(context_enc + continuation_enc)[-(self.max_length + 1) :][:-1]
)
else:
max_length = self.max_length
# if OOM, then halves batch_size and tries again
@find_executable_batch_size(starting_batch_size=self.max_batch_size)
def forward_batch(batch_size):
test_batch = torch.ones((batch_size, max_length), device=self.device).long()
for _ in range(5):
_ = F.log_softmax(self._model_call(test_batch), dim=-1).cpu()
return batch_size
batch_size = forward_batch()
utils.clear_torch_cache()
return batch_size
# subclass must implement properties vocab_size, eot_token_id, max_gen_toks, batch_size, device, max_length.
# TODO: enforce this somehow
def _encode_pair(self, context, continuation):
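# Move trailing whitespace from the context onto the continuation before tokenizing:
# BPE-style tokenizers fold a leading space into the following token, so this keeps the
# context/continuation token boundary consistent with encoding the concatenated string.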
n_spaces = len(context) - len(context.rstrip())
if n_spaces > 0:
continuation = context[-n_spaces:] + continuation
context = context[:-n_spaces]
whole_enc = self.tok_encode(context + continuation)
context_enc = self.tok_encode(context)
context_enc_len = len(context_enc)
continuation_enc = whole_enc[context_enc_len:]
return context_enc, continuation_enc
def loglikelihood(self, requests):
new_reqs = []
for context, continuation in requests:
if context == "":
# end of text as context
context_enc = [self.eot_token_id]
context_enc, continuation_enc = [self.eot_token_id], self.tok_encode(
continuation
)
else:
context_enc = self.tok_encode(context)
continuation_enc = self.tok_encode(continuation)
context_enc, continuation_enc = self._encode_pair(context, continuation)
new_reqs.append(((context, continuation), context_enc, continuation_enc))
......@@ -193,19 +232,7 @@ class BaseLM(LM):
if self.batch_size == "auto":
# using rolling window with maximum context
print("Passed argument batch_size = auto. Detecting largest batch size")
@find_executable_batch_size(
starting_batch_size=512
) # if OOM, then halves batch_size and tries again
def forward_batch(batch_size):
test_batch = torch.ones(
(batch_size, self.max_length), device=self.device
).long()
for _ in range(5):
_ = F.log_softmax(self._model_call(test_batch), dim=-1).cpu()
return batch_size
batch_size = forward_batch()
batch_size = self._detect_batch_size()
print(f"Determined Largest batch size: {batch_size}")
adaptive_batch_size = batch_size
......@@ -258,34 +285,32 @@ class BaseLM(LM):
re_ord = utils.Reorderer(requests, _collate)
reordered_requests = re_ord.get_reordered()
n_reordered_requests = len(reordered_requests)
# automatic (variable) batch size detection for vectorization
# pull longest context sample from request
if len(re_ord.get_reordered()) > 0:
_, context_enc, continuation_enc = re_ord.get_reordered()[0]
max_context = len((context_enc + continuation_enc)[-(self.max_length + 1) :][:-1])
if (self.batch_size == 'auto'):
if override_bs is None:
print('Passed argument batch_size = auto. Detecting largest batch size')
@find_executable_batch_size(starting_batch_size=512) # if OOM, then halves batch_size and tries again
def forward_batch(batch_size):
test_batch = torch.ones((batch_size, max_context), device=self.device).long()
for _ in range(5):
out = F.log_softmax(self._model_call(test_batch), dim = -1).cpu()
return batch_size
batch_size = forward_batch()
print(f"Determined largest batch size: {batch_size}")
adaptive_batch_size = batch_size
else:
adaptive_batch_size = override_bs
else:
adaptive_batch_size = 0 if override_bs is None else override_bs
def _batch_scheduler(pos):
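# With batch_size = "auto:N", re-detect the executable batch size N times across the
# length-sorted request list; later chunks contain shorter sequences, so they can
# usually fit a larger batch.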
sched = pos // int(n_reordered_requests / self.batch_schedule)
if sched in self.batch_sizes:
return self.batch_sizes[sched]
print(
f"Passed argument batch_size = auto:{self.batch_schedule}. Detecting largest batch size"
)
self.batch_sizes[sched] = self._detect_batch_size(reordered_requests, pos)
print(f"Determined largest batch size: {self.batch_sizes[sched]}")
return self.batch_sizes[sched]
for chunk in utils.chunks(
tqdm(re_ord.get_reordered(), disable=disable_tqdm),
self.batch_size if self.batch_size != "auto" else adaptive_batch_size,
tqdm(reordered_requests, disable=disable_tqdm),
n=self.batch_size
if self.batch_size != "auto"
else override_bs
if override_bs is not None
else 0,
fn=_batch_scheduler
if self.batch_size == "auto" and n_reordered_requests > 0 and not override_bs
else None,
):
inps = []
cont_toks_list = []
......@@ -387,18 +412,34 @@ class BaseLM(LM):
res = []
def _collate(x):
# the negative sign on len(toks) sorts descending - this has a few advantages:
# - time estimates will always be over- rather than underestimates, which is more useful for planning
# - to know the size of a batch when going through the list, you know the first one is always the batch
# padded context length. this is useful to simplify the batching logic and more importantly to make
# automatic adaptive batches much much easier to implement
# - any OOMs will happen right away rather than near the end
toks = self.tok_encode(x[0])
return len(toks), x[0]
return -len(toks), x[0]
re_ord = utils.Reorderer(requests, _collate)
warn_stop_seq = False
for context, request_args in tqdm(re_ord.get_reordered()):
until = request_args["until"]
if isinstance(until, str):
until = [until]
if until:
(primary_until,) = self.tok_encode(until[0])
try:
(primary_until,) = self.tok_encode(until[0])
except ValueError:
if not warn_stop_seq:
print(
"Warning: a primary stop sequence is multi-token! Will default to EOS token for this tokenizer. Consider using `hf-causal-experimental` for multi-token stop sequence support for the time being."
)
warn_stop_seq = True
primary_until = self.eot_token_id
else:
primary_until = None
......@@ -856,6 +897,10 @@ class CachingLM:
lm.set_cache_hook(self.get_cache_hook())
def __getattr__(self, attr):
lm_attr = getattr(self.lm, attr)
if not callable(lm_attr):
return lm_attr
def fn(requests):
res = []
remaining_reqs = []
......
---
dataset_info:
features:
- name: question_id
dtype: string
- name: question_source
dtype: string
- name: question
dtype: string
- name: answer
struct:
- name: aliases
sequence: string
- name: value
dtype: string
- name: search_results
sequence:
- name: description
dtype: string
- name: filename
dtype: string
- name: rank
dtype: int32
- name: title
dtype: string
- name: url
dtype: string
- name: search_context
dtype: string
config_name: triviaqa
splits:
- name: train
num_bytes: 1270894387
num_examples: 87622
- name: validation
num_bytes: 163755044
num_examples: 11313
download_size: 632549060
dataset_size: 1434649431
---
{"triviaqa": {"description": "TriviaQA is a reading comprehension dataset containing over 650K question-answer-evidence\ntriples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts\nand independently gathered evidence documents, six per question on average, that provide\nhigh quality distant supervision for answering the questions.\n", "citation": "@InProceedings{JoshiTriviaQA2017,\n author = {Joshi, Mandar and Choi, Eunsol and Weld, Daniel S. and Zettlemoyer, Luke},\n title = {TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension},\n booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics},\n month = {July},\n year = {2017},\n address = {Vancouver, Canada},\n publisher = {Association for Computational Linguistics},\n}\n", "homepage": "https://nlp.cs.washington.edu/triviaqa/", "license": "Apache License 2.0", "features": {"question_id": {"dtype": "string", "id": null, "_type": "Value"}, "question_source": {"dtype": "string", "id": null, "_type": "Value"}, "question": {"dtype": "string", "id": null, "_type": "Value"}, "answer": {"aliases": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "value": {"dtype": "string", "id": null, "_type": "Value"}}, "search_results": {"feature": {"description": {"dtype": "string", "id": null, "_type": "Value"}, "filename": {"dtype": "string", "id": null, "_type": "Value"}, "rank": {"dtype": "int32", "id": null, "_type": "Value"}, "title": {"dtype": "string", "id": null, "_type": "Value"}, "url": {"dtype": "string", "id": null, "_type": "Value"}, "search_context": {"dtype": "string", "id": null, "_type": "Value"}}, "length": -1, "id": null, "_type": "Sequence"}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "triviaqa", "config_name": "triviaqa", "version": {"version_str": "0.0.1", "description": null, "major": 0, "minor": 0, "patch": 1}, "splits": {"train": {"name": "train", "num_bytes": 1271393601, "num_examples": 87622, "dataset_name": "triviaqa"}, "validation": {"name": "validation", "num_bytes": 163819509, "num_examples": 11313, "dataset_name": "triviaqa"}}, "download_checksums": {"http://eaidata.bmk.sh/data/triviaqa-unfiltered.tar.gz": {"num_bytes": 546481381, "checksum": "adc19b42769062d241a8fbe834c56e58598d9322eb6c614e9f33a68a2cf5523e"}}, "download_size": 546481381, "post_processing_size": null, "dataset_size": 1435213110, "size_in_bytes": 1981694491}}
# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Custom TriviaQA because HF version sanitizes the dataset differently.
# https://github.com/huggingface/datasets/blob/9977ade72191ff0b6907ec63935448c6269a91a1/datasets/trivia_qa/trivia_qa.py#L285
"""TriviaQA (Unfiltered Raw) dataset."""
import json
import os
import datasets
_CITATION = """\
@InProceedings{JoshiTriviaQA2017,
author = {Joshi, Mandar and Choi, Eunsol and Weld, Daniel S. and Zettlemoyer, Luke},
title = {TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension},
booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics},
month = {July},
year = {2017},
address = {Vancouver, Canada},
publisher = {Association for Computational Linguistics},
}
"""
_DESCRIPTION = """\
TriviaQA is a reading comprehension dataset containing over 650K question-answer-evidence
triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts
and independently gathered evidence documents, six per question on average, that provide
high quality distant supervision for answering the questions.
"""
_HOMEPAGE = "https://nlp.cs.washington.edu/triviaqa/"
_LICENSE = "Apache License 2.0"
_URLS = "https://nlp.cs.washington.edu/triviaqa/data/triviaqa-unfiltered.tar.gz"
class Triviaqa(datasets.GeneratorBasedBuilder):
"""TriviaQA is a reading comprehension dataset containing over 650K question-answer-evidence triples"""
VERSION = datasets.Version("0.0.2")
BUILDER_CONFIGS = [
datasets.BuilderConfig(
name="triviaqa", version=VERSION, description="The TriviaQA dataset"
),
]
def _info(self):
features = datasets.Features(
{
"question_id": datasets.Value("string"),
"question_source": datasets.Value("string"),
"question": datasets.Value("string"),
"answer": {
"aliases": datasets.features.Sequence(
datasets.Value("string"),
),
"value": datasets.Value("string"),
},
"search_results": datasets.features.Sequence(
{
"description": datasets.Value("string"),
"filename": datasets.Value("string"),
"rank": datasets.Value("int32"),
"title": datasets.Value("string"),
"url": datasets.Value("string"),
"search_context": datasets.Value("string"),
}
),
}
)
return datasets.DatasetInfo(
description=_DESCRIPTION,
features=features,
homepage=_HOMEPAGE,
license=_LICENSE,
citation=_CITATION,
)
def _split_generators(self, dl_manager):
urls = _URLS
data_dir = dl_manager.download_and_extract(urls)
return [
datasets.SplitGenerator(
name=datasets.Split.TRAIN,
# These kwargs will be passed to _generate_examples
gen_kwargs={
"filepath": os.path.join(
data_dir, "triviaqa-unfiltered", "unfiltered-web-train.json"
),
},
),
datasets.SplitGenerator(
name=datasets.Split.VALIDATION,
# These kwargs will be passed to _generate_examples
gen_kwargs={
"filepath": os.path.join(
data_dir, "triviaqa-unfiltered", "unfiltered-web-dev.json"
),
},
),
]
# method parameters are unpacked from `gen_kwargs` as given in `_split_generators`
def _generate_examples(self, filepath):
with open(filepath, encoding="utf-8") as f:
json_data = json.load(f)["Data"]
for key, data in enumerate(json_data):
search_results = []
for search_result in data["SearchResults"]:
search_results.append(
{
"description": search_result["Description"]
if "Description" in search_result
else "",
"filename": search_result["Filename"]
if "Filename" in search_result
else "",
"rank": search_result["Rank"]
if "Rank" in search_result
else -1,
"title": search_result["Title"]
if "Title" in search_result
else "",
"url": search_result["Url"]
if "Url" in search_result
else "",
"search_context": search_result["SearchContext"]
if "SearchContext" in search_result
else "",
}
)
yield key, {
"question_id": data["QuestionId"],
"question_source": data["QuestionSource"],
"question": data["Question"],
"answer": {
"aliases": data["Answer"]["Aliases"],
"value": data["Answer"]["Value"],
},
"search_results": search_results,
}
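For reference, the builder above can be exercised directly with the `datasets` API (the local path is hypothetical, and loading dataset scripts requires a `datasets` version that still supports them):

```python
import datasets

# Load the custom TriviaQA builder from this script; the path below is an assumption.
triviaqa = datasets.load_dataset("lm_eval/datasets/triviaqa/triviaqa.py", name="triviaqa")
sample = triviaqa["train"][0]
print(sample["question"], "->", sample["answer"]["value"])
```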
import collections
import itertools
import numpy as np
import random
import lm_eval.metrics
import lm_eval.models
import lm_eval.tasks
import lm_eval.base
from lm_eval.utils import positional_deprecated, run_task_tests
from lm_eval.models.gpt2 import HFLM
import numpy as np
import transformers
@positional_deprecated
......@@ -16,6 +20,7 @@ def simple_evaluate(
tasks=[],
num_fewshot=0,
batch_size=None,
max_batch_size=None,
device=None,
no_cache=False,
limit=None,
......@@ -29,7 +34,7 @@ def simple_evaluate(
"""Instantiate and evaluate a model on a list of tasks.
:param model: Union[str, LM]
Name of model or LM object, see lm_eval.models.get_model
Name of model, transformers.PreTrainedModel object, or LM object, see lm_eval.models.get_model
:param model_args: Optional[str]
String arguments for each model class, see LM.create_from_arg_string.
Ignored if `model` argument is a LM object.
......@@ -37,8 +42,10 @@ def simple_evaluate(
List of task names or Task objects. Task objects will be taken to have name task.EVAL_HARNESS_NAME if defined and type(task).__name__ otherwise.
:param num_fewshot: int
Number of examples in few-shot context
:param batch_size: int, optional
:param batch_size: int or str, optional
Batch size for model
:param max_batch_size: int, optional
Maximal batch size to try with automatic batch size detection
:param device: str, optional
PyTorch device (e.g. "cpu" or "cuda:0") for running models
:param no_cache: bool
......@@ -67,8 +74,15 @@ def simple_evaluate(
if model_args is None:
model_args = ""
lm = lm_eval.models.get_model(model).create_from_arg_string(
model_args, {"batch_size": batch_size, "device": device}
model_args, {"batch_size": batch_size, "max_batch_size": max_batch_size, "device": device}
)
elif isinstance(model, transformers.PreTrainedModel):
lm = lm_eval.models.get_model("hf-causal")(
pretrained=model,
batch_size=batch_size,
max_batch_size=max_batch_size,
)
no_cache = True
else:
assert isinstance(model, lm_eval.base.LM)
lm = model
......@@ -77,7 +91,7 @@ def simple_evaluate(
lm = lm_eval.base.CachingLM(
lm,
"lm_cache/"
+ model
+ (model if isinstance(model, str) else model.model.config._name_or_path)
+ "_"
+ model_args.replace("=", "-").replace(",", "_").replace("/", "-")
+ ".db",
......@@ -101,11 +115,17 @@ def simple_evaluate(
)
# add info about the model and few shot config
model_name = None
if isinstance(model, str):
model_name = model
elif isinstance(model, transformers.PreTrainedModel):
model_name = "pretrained=" + model.config._name_or_path
results["config"] = {
"model": model,
"model": model_name,
"model_args": model_args,
"num_fewshot": num_fewshot,
"batch_size": batch_size,
"batch_sizes": list(lm.batch_sizes.values()) if hasattr(lm, "batch_sizes") else [],
"device": device,
"no_cache": no_cache,
"limit": limit,
......
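The net effect of the evaluator changes above is that an already-instantiated `transformers.PreTrainedModel` can now be passed straight to `simple_evaluate`. A minimal sketch, with the model, task, and limit chosen purely for illustration:

```python
import transformers
from lm_eval import evaluator

# Pass an in-memory HF model; the harness wraps it as `hf-causal` and forces no_cache=True.
model = transformers.AutoModelForCausalLM.from_pretrained("gpt2")

results = evaluator.simple_evaluate(
    model=model,
    tasks=["lambada_openai"],
    num_fewshot=0,
    batch_size="auto",
    max_batch_size=64,
    limit=16,  # small limit, just a smoke test
)
print(results["config"]["model"])   # "pretrained=gpt2", per the config block above
print(results["results"].keys())
```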
from . import gpt2
from . import gpt3
from . import anthropic_llms
from . import huggingface
from . import textsynth
from . import dummy
......@@ -11,6 +12,7 @@ MODEL_REGISTRY = {
"hf-seq2seq": huggingface.AutoSeq2SeqLM,
"gpt2": gpt2.GPT2LM,
"gpt3": gpt3.GPT3LM,
"anthropic": anthropic_llms.AnthropicLM,
"textsynth": textsynth.TextSynthLM,
"dummy": dummy.DummyLM,
}
......
import os
from lm_eval.base import BaseLM
from tqdm import tqdm
import time
def anthropic_completion(client, model, prompt, max_tokens_to_sample, temperature, stop):
"""Query Anthropic API for completion.
Retry with back-off until they respond
"""
import anthropic
backoff_time = 3
while True:
try:
response = client.completion(
prompt=f"{anthropic.HUMAN_PROMPT} {prompt}{anthropic.AI_PROMPT}",
model=model,
# NOTE: Claude really likes to do CoT, and overly aggressive stop sequences
# (e.g. gsm8k's ":") may truncate a lot of the input.
stop_sequences=[anthropic.HUMAN_PROMPT] + stop,
max_tokens_to_sample=max_tokens_to_sample,
temperature=temperature,
)
print(response)
return response["completion"]
except RuntimeError:
# TODO: I don't actually know what error Anthropic raises when it times out
# so update this exception type once we find out.
import traceback
traceback.print_exc()
time.sleep(backoff_time)
backoff_time *= 1.5
class AnthropicLM(BaseLM):
REQ_CHUNK_SIZE = 20
def __init__(self, model):
"""
:param model: str
Anthropic model e.g. claude-instant-v1
"""
super().__init__()
import anthropic
self.model = model
self.client = anthropic.Client(os.environ['ANTHROPIC_API_KEY'])
@property
def eot_token_id(self):
raise NotImplementedError("No idea about anthropic tokenization.")
@property
def max_length(self):
return 2048
@property
def max_gen_toks(self):
return 256
@property
def batch_size(self):
# Isn't used because we override _loglikelihood_tokens
raise NotImplementedError()
@property
def device(self):
# Isn't used because we override _loglikelihood_tokens
raise NotImplementedError()
def tok_encode(self, string: str):
raise NotImplementedError("No idea about anthropic tokenization.")
def tok_decode(self, tokens):
raise NotImplementedError("No idea about anthropic tokenization.")
def _loglikelihood_tokens(self, requests, disable_tqdm=False):
raise NotImplementedError("No support for logits.")
def greedy_until(self, requests):
if not requests:
return []
res = []
for request in tqdm(requests):
inp = request[0]
request_args = request[1]
until = request_args["until"]
response = anthropic_completion(
client=self.client,
model=self.model,
prompt=inp,
max_tokens_to_sample=self.max_gen_toks,
temperature=0.0,
stop=until,
)
res.append(response)
return res
def _model_call(self, inps):
# Isn't used because we override _loglikelihood_tokens
raise NotImplementedError()
def _model_generate(self, context, max_length, eos_token_id):
# Isn't used because we override greedy_until
raise NotImplementedError()
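For completeness, a minimal sketch of using the new backend programmatically (the model name is an example; the constructor reads `ANTHROPIC_API_KEY` from the environment, and the `anthropic` package must be installed):

```python
import os
from lm_eval.models.anthropic_llms import AnthropicLM

assert "ANTHROPIC_API_KEY" in os.environ, "set your Anthropic API key first"

lm = AnthropicLM(model="claude-instant-v1")
# greedy_until takes (prompt, request_args) pairs, as elsewhere in the harness.
completions = lm.greedy_until(
    [("Q: What is the capital of France?\nA:", {"until": ["\n"]})]
)
print(completions[0])
```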
......@@ -17,6 +17,9 @@ def _get_dtype(
class HFLM(BaseLM):
_DEFAULT_MAX_LENGTH = 2048
def __init__(
self,
device="cuda",
......@@ -26,67 +29,93 @@ class HFLM(BaseLM):
subfolder=None,
tokenizer=None,
batch_size=1,
max_batch_size=512,
max_length=None,
load_in_8bit: Optional[bool] = False,
trust_remote_code: Optional[bool] = False,
dtype: Optional[Union[str, torch.dtype]]="auto",
):
super().__init__()
assert isinstance(device, str)
assert isinstance(pretrained, str)
assert isinstance(batch_size, (int, str))
device_list = set(
["cuda", "cpu"] + [f"cuda:{i}" for i in range(torch.cuda.device_count())]
)
if device and device in device_list:
self._device = torch.device(device)
print(f"Using device '{device}'")
else:
print("Device not specified")
print(f"Cuda Available? {torch.cuda.is_available()}")
self._device = (
torch.device("cuda")
if torch.cuda.is_available()
else torch.device("cpu")
# Initialize model
if isinstance(pretrained, transformers.PreTrainedModel):
self.model = pretrained
self._device = self.model.device
if tokenizer:
assert isinstance(
tokenizer,
transformers.PreTrainedTokenizer
) or isinstance(
tokenizer,
transformers.PreTrainedTokenizerFast
)
self.tokenizer = tokenizer
else:
# Get tokenizer
model_name = self.model.name_or_path
self.tokenizer = transformers.AutoTokenizer.from_pretrained(
model_name,
revision=revision,
trust_remote_code=trust_remote_code,
)
elif isinstance(pretrained, str):
# Initialize device
assert isinstance(device, str)
device_list = set(
["cuda", "cpu"] + [f"cuda:{i}" for i in range(torch.cuda.device_count())]
)
if device and device in device_list:
self._device = torch.device(device)
print(f"Using device '{device}'")
else:
print("Device not specified")
print(f"Cuda Available? {torch.cuda.is_available()}")
self._device = (
torch.device("cuda")
if torch.cuda.is_available()
else torch.device("cpu")
)
revision = revision + ("/" + subfolder if subfolder is not None else "")
# Initialize new model and tokenizer instances
self.model = transformers.AutoModelForCausalLM.from_pretrained(
pretrained,
load_in_8bit=load_in_8bit,
low_cpu_mem_usage=low_cpu_mem_usage,
revision=revision,
torch_dtype=_get_dtype(dtype),
trust_remote_code=trust_remote_code,
).to(self.device)
self.tokenizer = transformers.AutoTokenizer.from_pretrained(
tokenizer if tokenizer else pretrained,
revision=revision,
trust_remote_code=trust_remote_code,
)
# TODO: update this to be less of a hack once subfolder is fixed in HF
revision = revision + ("/" + subfolder if subfolder is not None else "")
self.gpt2 = transformers.AutoModelForCausalLM.from_pretrained(
pretrained,
load_in_8bit=load_in_8bit,
low_cpu_mem_usage=low_cpu_mem_usage,
revision=revision,
torch_dtype=_get_dtype(dtype),
trust_remote_code=trust_remote_code,
).to(self.device)
self.gpt2.eval()
self.tokenizer = transformers.AutoTokenizer.from_pretrained(
pretrained if tokenizer is None else tokenizer,
revision=revision,
trust_remote_code=trust_remote_code,
)
else:
raise TypeError('Parameter pretrained should be of type str or transformers.PreTrainedModel')
self.model.eval()
self.vocab_size = self.tokenizer.vocab_size
if isinstance(
self.tokenizer, (transformers.GPT2Tokenizer, transformers.GPT2TokenizerFast)
):
assert self.tokenizer.encode("hello\n\nhello") == [
31373,
198,
198,
31373,
], self.tokenizer.encode("hello\n\nhello")
# Validate batch_size
assert isinstance(batch_size, (int, str))
# setup for automatic batch size detection
if batch_size == "auto":
self.batch_size_per_gpu = batch_size
if str(batch_size).startswith("auto"):
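# batch_size may be "auto" or "auto:N"; N (default 1) controls how many times the batch
# size is re-detected while working through the requests (see _batch_scheduler in base.py).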
batch_size = batch_size.split(":")
self.batch_size_per_gpu = batch_size[0]
self.batch_schedule = float(batch_size[1]) if len(batch_size) > 1 else 1
else:
self.batch_size_per_gpu = int(batch_size)
self.max_batch_size = max_batch_size
self._max_length = max_length
@property
def eot_token_id(self):
......@@ -95,11 +124,18 @@ class HFLM(BaseLM):
@property
def max_length(self):
try:
return self.gpt2.config.n_ctx
except AttributeError:
# gptneoconfig doesn't have n_ctx apparently
return self.gpt2.config.max_position_embeddings
if self._max_length: # if max length manually set, return it
return self._max_length
seqlen_config_attrs = ("n_positions", "max_position_embeddings", "n_ctx")
for attr in seqlen_config_attrs:
if hasattr(self.model.config, attr):
return getattr(self.model.config, attr)
if hasattr(self.tokenizer, "model_max_length"):
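# 1000000000000000019884624838656 is int(1e30), the placeholder transformers assigns to
# model_max_length when the tokenizer does not define a real maximum.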
if self.tokenizer.model_max_length == 1000000000000000019884624838656:
return self._DEFAULT_MAX_LENGTH
return self.tokenizer.model_max_length
return self._DEFAULT_MAX_LENGTH
@property
def max_gen_toks(self):
......@@ -130,14 +166,14 @@ class HFLM(BaseLM):
logits returned from the model
"""
with torch.no_grad():
return self.gpt2(inps)[0]
return self.model(inps)[0]
def _model_generate(self, context, max_length, eos_token_id):
generation_kwargs = {"do_sample": False, "max_length": max_length}
if eos_token_id is not None:
generation_kwargs['eos_token_id'] = eos_token_id
generation_kwargs['pad_token_id'] = eos_token_id # setting eos_token_id as pad token
return self.gpt2.generate(context, **generation_kwargs)
return self.model.generate(context, **generation_kwargs)
# for backwards compatibility
......
......@@ -3,12 +3,12 @@ import torch
import torch.nn.functional as F
import transformers
import peft
from peft import __version__ as PEFT_VERSION
from pathlib import Path
from typing import List, Mapping, NewType, Optional, Tuple, Union
from tqdm import tqdm
from transformers import BatchEncoding
from accelerate import find_executable_batch_size
from lm_eval import utils
from lm_eval.base import BaseLM
......@@ -75,6 +75,7 @@ class HuggingFaceAutoLM(BaseLM):
subfolder: Optional[str] = None,
revision: Optional[str] = "main",
batch_size: Optional[Union[int, str]] = 1,
max_batch_size: Optional[int] = 512,
max_gen_toks: Optional[int] = 256,
max_length: Optional[int] = None,
add_special_tokens: Optional[bool] = None,
......@@ -87,8 +88,11 @@ class HuggingFaceAutoLM(BaseLM):
device: Optional[Union[int, str]] = "cuda",
peft: str = None,
load_in_8bit: Optional[bool] = False,
load_in_4bit: Optional[bool] = False,
trust_remote_code: Optional[bool] = False,
gptq_use_triton: Optional[bool] = False,
bnb_4bit_quant_type: Optional[str] = None,
bnb_4bit_compute_dtype: Optional[Union[str, torch.dtype]] = None,
):
"""Initializes a HuggingFace `AutoModel` and `AutoTokenizer` for evaluation.
Args:
......@@ -142,11 +146,21 @@ class HuggingFaceAutoLM(BaseLM):
`adapter_model.bin`. Compatible with [PEFT](https://github.com/huggingface/peft)
load_in_8bit (bool, optional, defaults to False):
If True, will convert the loaded model into mixed-8bit quantized model. See:
https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel.from_pretrained.load_in_8bit
https://huggingface.co/docs/transformers/main/en/main_classes/quantization#load-a-large-model-in-8bit
load_in_4bit (bool, optional, defaults to False):
If True, will convert the loaded model into mixed-4bit quantized model. See:
https://huggingface.co/docs/transformers/main/en/main_classes/quantization#load-a-large-model-in-4bit
trust_remote_code (bool, optional, defaults to False):
If True, will trust the remote code when loading the model.
gptq_use_triton (bool, optional, defaults to False):
Use Triton for GPTQ inference.
bnb_4bit_quant_type (str, optional, defaults to None):
The quantization type to use for BnB 4bit quantization. See:
https://github.com/huggingface/transformers/blob/main/src/transformers/utils/quantization_config.py#L77
bnb_4bit_compute_dtype (Union[str, torch.dtype], optional, defaults to None):
The compute dtype to use for BnB 4bit quantization. See:
https://github.com/huggingface/transformers/blob/main/src/transformers/utils/quantization_config.py#L74
"""
super().__init__()
......@@ -167,10 +181,13 @@ class HuggingFaceAutoLM(BaseLM):
), "Evaluating causal models with `add_special_tokens=True` is currently not supported."
# setup for automatic batch size detection
if batch_size == "auto":
self._batch_size = batch_size
if str(batch_size).startswith("auto"):
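# Same "auto" / "auto:N" convention as in gpt2.HFLM above: N controls how often the
# batch size is re-detected during evaluation.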
batch_size = batch_size.split(":")
self._batch_size = batch_size[0]
self.batch_schedule = float(batch_size[1]) if len(batch_size) > 1 else 1
else:
self._batch_size = int(batch_size)
self.max_batch_size = max_batch_size
self._max_gen_toks = max_gen_toks
self._max_length = max_length
......@@ -186,6 +203,7 @@ class HuggingFaceAutoLM(BaseLM):
revision=revision,
subfolder=subfolder,
tokenizer=tokenizer,
trust_remote_code=trust_remote_code,
)
self.tokenizer.model_max_length = self.max_length
......@@ -197,7 +215,6 @@ class HuggingFaceAutoLM(BaseLM):
max_cpu_memory,
offload_folder,
)
model_kwargs["load_in_8bit"] = load_in_8bit
self.model = self._create_auto_model(
pretrained=pretrained,
quantized=quantized,
......@@ -206,6 +223,10 @@ class HuggingFaceAutoLM(BaseLM):
subfolder=subfolder,
torch_dtype=_get_dtype(dtype, self._config),
gptq_use_triton=gptq_use_triton,
load_in_8bit=load_in_8bit,
load_in_4bit=load_in_4bit,
bnb_4bit_quant_type=bnb_4bit_quant_type,
bnb_4bit_compute_dtype=bnb_4bit_compute_dtype,
**model_kwargs,
)
# note: peft_path can be different than pretrained model path
......@@ -215,8 +236,7 @@ class HuggingFaceAutoLM(BaseLM):
peft=peft,
revision=revision,
subfolder=subfolder,
torch_dtype=_get_dtype(dtype, self._config),
**model_kwargs,
load_in_4bit=load_in_4bit,
)
self.model.eval()
torch.set_grad_enabled(False)
......@@ -227,8 +247,11 @@ class HuggingFaceAutoLM(BaseLM):
# the user specified one so we force `self._device` to be the same as
# `lm_head`'s.
self._device = self.model.hf_device_map["lm_head"]
if not use_accelerate:
self.model.to(self._device)
if not use_accelerate and not (load_in_4bit or load_in_8bit):
try:
self.model.to(self._device)
except:
print("Failed to place model onto specified device. This may be because the model is quantized via `bitsandbytes`. If the desired GPU is being used, this message is safe to ignore.")
def _create_auto_model(
self,
......@@ -241,12 +264,25 @@ class HuggingFaceAutoLM(BaseLM):
max_memory: Optional[dict] = None,
offload_folder: Optional[str] = None,
load_in_8bit: Optional[bool] = False,
load_in_4bit: Optional[bool] = False,
trust_remote_code: Optional[bool] = False,
torch_dtype: Optional[Union[str, torch.dtype]] = None,
gptq_use_triton: Optional[bool] = False,
bnb_4bit_quant_type: Optional[str] = None,
bnb_4bit_compute_dtype: Optional[Union[str, torch.dtype]] = None,
) -> transformers.AutoModel:
"""Returns a pre-trained pytorch model from a pre-trained model configuration."""
if not quantized:
if load_in_4bit:
assert transformers.__version__ >= "4.30.0", "load_in_4bit requires transformers >= 4.30.0"
model_kwargs = {}
if transformers.__version__ >= "4.30.0":
model_kwargs["load_in_4bit"] = load_in_4bit
if load_in_4bit:
if bnb_4bit_quant_type:
model_kwargs["bnb_4bit_quant_type"] = bnb_4bit_quant_type
if bnb_4bit_compute_dtype:
model_kwargs["bnb_4bit_compute_dtype"] = _get_dtype(bnb_4bit_compute_dtype)
model = self.AUTO_MODEL_CLASS.from_pretrained(
pretrained,
revision=revision + ("/" + subfolder if subfolder is not None else ""),
......@@ -256,6 +292,7 @@ class HuggingFaceAutoLM(BaseLM):
load_in_8bit=load_in_8bit,
trust_remote_code=trust_remote_code,
torch_dtype=torch_dtype,
**model_kwargs,
)
else:
from auto_gptq import AutoGPTQForCausalLM
......@@ -278,23 +315,14 @@ class HuggingFaceAutoLM(BaseLM):
peft: str,
revision: str,
subfolder: str,
device_map: Optional[Union[str, _DeviceMapping]] = None,
max_memory: Optional[dict] = None,
offload_folder: Optional[str] = None,
load_in_8bit: Optional[bool] = False,
trust_remote_code: Optional[bool] = False,
torch_dtype: Optional[Union[str, torch.dtype]] = None,
load_in_4bit: Optional[bool] = False,
):
if load_in_4bit:
assert PEFT_VERSION >= "0.4.0", "load_in_4bit requires peft >= 0.4.0"
model = self.AUTO_PEFT_CLASS.from_pretrained(
model,
peft,
revision=revision + ("/" + subfolder if subfolder is not None else ""),
device_map=device_map,
max_memory=max_memory,
offload_folder=offload_folder,
load_in_8bit=load_in_8bit,
trust_remote_code=trust_remote_code,
torch_dtype=torch_dtype,
)
return model
......@@ -305,11 +333,13 @@ class HuggingFaceAutoLM(BaseLM):
revision: str,
subfolder: str,
tokenizer: Optional[str] = None,
trust_remote_code: Optional[bool] = False,
) -> transformers.PreTrainedTokenizer:
"""Returns a pre-trained tokenizer from a pre-trained tokenizer configuration."""
tokenizer = self.AUTO_TOKENIZER_CLASS.from_pretrained(
pretrained if tokenizer is None else tokenizer,
revision=revision + ("/" + subfolder if subfolder is not None else ""),
trust_remote_code=trust_remote_code,
)
tokenizer.pad_token = tokenizer.eos_token
return tokenizer
......@@ -351,7 +381,7 @@ class HuggingFaceAutoLM(BaseLM):
"""Return the maximum sequence length of the model.
NOTE: Different model configurations have different max sequence length
attribute names.
- n_positions: (CTRLConfig)
- n_positions: (CTRLConfig, T5Config)
- max_position_embeddings: (BartConfig, RoFormerConfig)
- n_ctx: (GPT2Config)
NOTE: For relative position encoded models you should specify the max
......@@ -408,19 +438,7 @@ class HuggingFaceAutoLM(BaseLM):
if self.batch_size == "auto":
# using rolling window with maximum context
print("Passed argument batch_size = auto. Detecting largest batch size")
@find_executable_batch_size(
starting_batch_size=512
) # if OOM, then halves batch_size and tries again
def forward_batch(batch_size):
test_batch = torch.ones(
(batch_size, self.max_length), device=self.device
).long()
for _ in range(5):
_ = F.log_softmax(self._model_call(test_batch), dim=-1).cpu()
return batch_size
batch_size = forward_batch()
batch_size = self._detect_batch_size()
print(f"Determined Largest batch size: {batch_size}")
adaptive_batch_size = batch_size
......@@ -485,12 +503,14 @@ class AutoCausalLM(HuggingFaceAutoLM):
revision: str,
subfolder: str,
tokenizer: Optional[str] = None,
trust_remote_code: Optional[bool] = False,
) -> transformers.PreTrainedTokenizer:
tokenizer = super()._create_auto_tokenizer(
pretrained=pretrained,
revision=revision,
subfolder=subfolder,
tokenizer=tokenizer,
trust_remote_code=trust_remote_code,
)
tokenizer.padding_side = "left"
return tokenizer
......@@ -543,15 +563,6 @@ class AutoSeq2SeqLM(HuggingFaceAutoLM):
AUTO_MODEL_CLASS = transformers.AutoModelForSeq2SeqLM
AUTO_PEFT_CLASS = peft.PeftModel
@property
def max_length(self) -> int:
"""Return the maximum sequence length of the model.
TODO: Currently only works for relative position encoded Seq2Seq models.
"""
if self._max_length is not None:
return self._max_length
return self._DEFAULT_MAX_LENGTH
def loglikelihood(
self, requests: List[Tuple[str, str]]
) -> List[Tuple[float, bool]]:
......
......@@ -4,6 +4,7 @@ from typing import List, Union
import sacrebleu
import lm_eval.base
from . import babi
from . import superglue
from . import glue
from . import arc
......@@ -60,6 +61,7 @@ from . import xwinograd
from . import pawsx
from . import xnli
from . import mgsm
from . import scrolls
########################################
# Translation tasks
......@@ -92,6 +94,7 @@ all_translation_benchmarks = {
TASK_REGISTRY = {
"babi": babi.Babi,
# GLUE
"cola": glue.CoLA,
"mnli": glue.MNLI,
......@@ -325,6 +328,7 @@ TASK_REGISTRY = {
**pawsx.construct_tasks(),
**xnli.construct_tasks(),
**mgsm.construct_tasks(),
**scrolls.construct_tasks()
}
......
"""
Inspired by https://github.com/stanford-crfm/helm/blob/0eaaa62a2263ddb94e9850ee629423b010f57e4a/src/helm/benchmark/scenarios/babi_qa_scenario.py
"""
import numpy as np
from collections import defaultdict
from lm_eval.base import rf, Task
from lm_eval.metrics import mean
_CITATION = """
@article{weston2015towards,
title={Towards ai-complete question answering: A set of prerequisite toy tasks},
author={Weston, Jason and Bordes, Antoine and Chopra, Sumit and Rush, Alexander M and Van Merri{\"e}nboer, Bart and Joulin, Armand and Mikolov, Tomas},
journal={arXiv preprint arXiv:1502.05698},
year={2015}
}
"""
class Babi(Task):
VERSION = 0
DATASET_PATH = "Muennighoff/babi"
DATASET_NAME = None
def has_training_docs(self):
return True
def has_validation_docs(self):
return True
def has_test_docs(self):
return True
def training_docs(self):
if self.has_training_docs():
return self.dataset["train"]
def validation_docs(self):
if self.has_validation_docs():
return self.dataset["valid"]
def test_docs(self):
if self.has_test_docs():
return self.dataset["test"]
def doc_to_text(self, doc):
return (
doc['passage'] + doc['question']
)
def should_decontaminate(self):
return False # TODO Necessary?
def doc_to_decontamination_query(self, doc):
return f"Passage: {doc['passage']}\nQuestion: {doc['question']}\nAnswer:"
def doc_to_target(self, doc):
return " " + doc['answer']
def construct_requests(self, doc, ctx):
"""Uses RequestFactory to construct Requests and returns an iterable of
Requests which will be sent to the LM.
:param doc:
The document as returned from training_docs, validation_docs, or test_docs.
:param ctx: str
The context string, generated by fewshot_context. This includes the natural
language description, as well as the few shot examples, and the question
part of the document for `doc`.
"""
return rf.greedy_until(ctx, ["\n"])
def process_results(self, doc, results):
"""Take a single document and the LM results and evaluates, returning a
dict where keys are the names of submetrics and values are the values of
the metric for that one document
:param doc:
The document as returned from training_docs, validation_docs, or test_docs.
:param results:
The results of the requests created in construct_requests.
"""
gold = doc["answer"]
pred = gold.strip() == results[0].strip()
return {"em": pred}
def aggregation(self):
return {
"em": mean,
}
def higher_is_better(self):
return {
"em": True,
}
......@@ -14,7 +14,6 @@ Homepage: https://github.com/hendrycks/test
"""
from lm_eval.base import MultipleChoiceTask
_CITATION = """
@article{hendryckstest2021,
title={Measuring Massive Multitask Language Understanding},
......@@ -103,8 +102,8 @@ def create_task(subject):
class GeneralHendrycksTest(MultipleChoiceTask):
VERSION = 0
DATASET_PATH = "hendrycks_test"
VERSION = 1
DATASET_PATH = "cais/mmlu"
DATASET_NAME = None
def __init__(self, subject):
......@@ -112,7 +111,7 @@ class GeneralHendrycksTest(MultipleChoiceTask):
super().__init__()
def has_training_docs(self):
return False
return True
def has_validation_docs(self):
return True
......@@ -126,41 +125,50 @@ class GeneralHendrycksTest(MultipleChoiceTask):
def test_docs(self):
return map(self._process_doc, self.dataset["test"])
def _format_subject(self, subject):
words = subject.split("_")
return " ".join(words)
def fewshot_context(self, doc, num_fewshot, **kwargs):
subject = self.DATASET_NAME
description = f"The following are multiple choice questions (with answers) about {self._format_subject(subject)}."
kwargs["description"] = description
return super().fewshot_context(doc=doc, num_fewshot=num_fewshot, **kwargs)
def _process_doc(self, doc):
def format_example(doc, keys):
"""
Question: <prompt>
Choices:
<prompt>
A. <choice1>
B. <choice2>
C. <choice3>
D. <choice4>
Answer:
"""
prompt = "Question: " + doc["question"] + "\nChoices:\n"
prompt += "".join(
question = doc["question"].strip()
choices = "".join(
[f"{key}. {choice}\n" for key, choice in zip(keys, doc["choices"])]
)
prompt += "Answer:"
prompt = f"{question}\n{choices}Answer:"
return prompt
keys = ["A", "B", "C", "D"]
return {
"query": format_example(doc, keys),
"choices": doc["choices"],
"gold": keys.index(doc["answer"])
if isinstance(doc["answer"], str)
else doc["answer"],
"choices": keys,
"gold": doc["answer"],
}
def fewshot_examples(self, k, rnd):
# fewshot_examples is not just sampling from train_docs because dev is
# in the same distribution as val/test but auxiliary_train isn't
if self._fewshot_docs is None:
self._fewshot_docs = list(map(self._process_doc, self.dataset["dev"]))
return rnd.sample(list(self._fewshot_docs), k)
# use the unchanged order of the dev set without sampling,
# just as in the original code https://github.com/hendrycks/test/blob/master/evaluate.py#L28
return self._fewshot_docs[:k]
def doc_to_text(self, doc):
return doc["query"]
......
"""
SCROLLS: Standardized CompaRison Over Long Language Sequences
https://arxiv.org/abs/2201.03533
SCROLLS is a suite of datasets that require synthesizing information over long texts.
The benchmark includes seven natural language tasks across multiple domains,
including summarization, question answering, and natural language inference.
Homepage: https://www.scrolls-benchmark.com/
Since SCROLLS tasks are generally longer than the maximum sequence length of many models,
it is possible to create "subset" tasks that contain only those samples whose tokenized length
is less than some pre-defined limit. For example, to create a subset of "Qasper" that would
be suitable for a model using the GPTNeoX tokenizer and a 4K maximum sequence length:
```
class QasperGPTNeoX4K(Qasper):
PRUNE_TOKENIZERS = ["EleutherAI/pythia-410m-deduped"]
PRUNE_MAX_TOKENS = 4096
PRUNE_NUM_PROC = _num_cpu_cores() # optional, to speed up pruning of large datasets like NarrativeQA
```
`PRUNE_TOKENIZERS` can contain more than one tokenizer; this will include only samples that are
less than `PRUNE_MAX_TOKENS` for ALL of the tokenizers. This can be useful for comparing models
that use different tokenizers but the same maximum sequence length.
Once the subset task class has been defined in this file, it can be used by adding the class
to `lm_eval/tasks/__init__.py`.
NOTE: GovReport may need `max_gen_toks` set larger for causal models.
"""
from abc import abstractmethod
from datasets import load_metric
from transformers import AutoTokenizer
from lm_eval.base import rf, Task
from lm_eval.metrics import mean
from functools import reduce
import transformers.data.metrics.squad_metrics as squad_metrics
import numpy as np
import re
_CITATION = """
@inproceedings{shaham-etal-2022-scrolls,
title = "{SCROLLS}: Standardized {C}ompa{R}ison Over Long Language Sequences",
author = "Shaham, Uri and
Segal, Elad and
Ivgi, Maor and
Efrat, Avia and
Yoran, Ori and
Haviv, Adi and
Gupta, Ankit and
Xiong, Wenhan and
Geva, Mor and
Berant, Jonathan and
Levy, Omer",
booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.emnlp-main.823",
pages = "12007--12021"
}
"""
# SCROLLS is formulated as a sequence-to-sequence task.
# To allow for evaluation of causal models, we'll
# reformulate these with appropriate prompts
def _download_metric():
import os
import shutil
from huggingface_hub import hf_hub_download
scrolls_metric_path = hf_hub_download(repo_id="tau/scrolls", repo_type="dataset", filename="metrics/scrolls.py")
updated_scrolls_metric_path = (
os.path.dirname(scrolls_metric_path) + os.path.basename(scrolls_metric_path).replace(".", "_") + ".py"
)
shutil.copy(scrolls_metric_path, updated_scrolls_metric_path)
return updated_scrolls_metric_path
def _process_doc_prepended_question(doc):
# "When a query is given in addition to the raw text (as
# in QMSum, Qasper, NarrativeQA, QuALITY, and ContractNLI),
# we prepend it to the text, using two newlines as a natural separator"
input = doc["input"]
split = input.find("\n\n")
return {
"id": doc["id"],
"pid": doc["pid"],
"input": input,
"outputs": doc["outputs"],
"question": input[0:split],
"text": input[split + 2:]
}
def _drop_duplicates_in_input(untokenized_dataset):
# from scrolls/evaluator/dataset_evaluator.py
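# Collapse rows that share an "id", gathering each duplicate's reference "output" string
# into a single "outputs" list column, so every document appears exactly once.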
indices_to_keep = []
id_to_idx = {}
outputs = []
for i, (id_, output) in enumerate(zip(untokenized_dataset["id"], untokenized_dataset["output"])):
if id_ in id_to_idx:
outputs[id_to_idx[id_]].append(output)
continue
indices_to_keep.append(i)
id_to_idx[id_] = len(outputs)
outputs.append([output])
untokenized_dataset = untokenized_dataset.select(indices_to_keep).flatten_indices()
untokenized_dataset = untokenized_dataset.remove_columns("output")
untokenized_dataset = untokenized_dataset.add_column("outputs", outputs)
return untokenized_dataset
def _num_cpu_cores():
# https://stackoverflow.com/questions/1006289/how-to-find-out-the-number-of-cpus-using-python/55423170#55423170
try:
import psutil
return psutil.cpu_count(logical=False)
except ImportError:
import os
return len(os.sched_getaffinity(0))
class _SCROLLSTask(Task):
VERSION = 0
DATASET_PATH = "tau/scrolls"
DATASET_NAME = None
PRUNE_TOKENIZERS = None
PRUNE_MAX_TOKENS = None
PRUNE_NUM_PROC = None
def __init__(self, no_metric=False):
super().__init__()
self.metric = load_metric(_download_metric(), config_name=self.DATASET_NAME) if not no_metric else None
def has_training_docs(self):
return True
def has_validation_docs(self):
return True
def has_test_docs(self):
return False
def training_docs(self):
for doc in self.dataset["train"]:
yield from self._process_doc(doc)
def validation_docs(self):
for doc in self.dataset["validation"]:
yield from self._process_doc(doc)
def should_decontaminate(self):
return True
def doc_to_decontamination_query(self, doc):
return doc["input"]
def download(self, *args, **kwargs):
super().download(*args, **kwargs)
del self.dataset["test"]
for split in self.dataset:
self.dataset[split] = _drop_duplicates_in_input(self.dataset[split])
if self.PRUNE_TOKENIZERS is not None and self.PRUNE_MAX_TOKENS is not None:
self.prune()
def _get_prune_text(self, sample):
return self.doc_to_text(self._process_doc(sample)[0])
def prune(self):
"""Create a pruned version of a SCROLLS task dataset containing only inputs
that are less than `max_tokens` when tokenized by each tokenizer
"""
tokenizers = [AutoTokenizer.from_pretrained(tokenizer) for tokenizer in self.PRUNE_TOKENIZERS]
cache = {}
def _filter(sample):
text = self._get_prune_text(sample)
cached = cache.get(text, None)
if cached is None:
for tokenizer in tokenizers:
if len(tokenizer(text).input_ids) > self.PRUNE_MAX_TOKENS:
cache[text] = False
return False
cache[text] = True
return True
else:
return cached
self.dataset = self.dataset.filter(_filter, num_proc=self.PRUNE_NUM_PROC)
def doc_to_target(self, doc):
return " " + ", ".join(doc["outputs"])
def doc_to_text(self, doc):
return f"{doc['text']}\n\nQuestion: {doc['question']}\nAnswer:"
def higher_is_better(self):
return {x: True for x in self._scrolls_metrics().keys()}
@abstractmethod
def _scrolls_metrics(self):
pass
def _make_compute_metrics(self, value):
def compute_metrics(samples):
predictions, references = zip(*samples) # unzip, if you will
computed = self.metric.compute(predictions=predictions, references=references)
return computed[value]
return compute_metrics
def aggregation(self):
return {
key: self._make_compute_metrics(value) for key, value in self._scrolls_metrics().items()
}
class _SCROLLSMultipleChoiceTask(_SCROLLSTask):
def __init__(self):
super().__init__(no_metric=True)
def _scrolls_metrics(self):
return None
def aggregation(self):
return {
"em": mean,
"acc": mean,
"acc_norm": mean
}
def higher_is_better(self):
return {
"em": True,
"acc": True,
"acc_norm": True
}
def process_results(self, doc, results):
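# `results` holds one loglikelihood per choice; acc takes the argmax directly, while
# acc_norm first normalizes each loglikelihood by the choice's character length.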
gold = doc["gold"]
acc = 1.0 if np.argmax(results) == gold else 0.0
completion_len = np.array([float(len(i)) for i in doc["choices"]])
acc_norm = 1.0 if np.argmax(results / completion_len) == gold else 0.0
return {
"acc": acc,
"acc_norm": acc_norm,
"em": acc_norm * 100.0,
}
def construct_requests(self, doc, ctx):
lls = [
rf.loglikelihood(ctx, " {}".format(choice))[0] for choice in doc["choices"]
]
return lls
class _SCROLLSSummaryTask(_SCROLLSTask):
def _process_doc(self, doc):
return [doc]
def _scrolls_metrics(self):
return {"rouge1": "rouge/rouge1", "rouge2": "rouge/rouge2", "rougeL": "rouge/rougeL"}
def process_results(self, doc, results):
return {
"rouge1": (results[0], doc["outputs"]),
"rouge2": (results[0], doc["outputs"]),
"rougeL": (results[0], doc["outputs"])
}
def construct_requests(self, doc, ctx):
return [rf.greedy_until(ctx, {'until': ["\n"]})]
def doc_to_text(self, doc):
return f"{doc['input']}\n\nQuestion: What is a summary of the preceding text?\nAnswer:"
class Qasper(_SCROLLSTask):
"""A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers
https://arxiv.org/abs/2105.03011
"""
DATASET_NAME = "qasper"
def _process_doc(self, doc):
doc = _process_doc_prepended_question(doc)
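# Treat the question as yes/no only if every reference answer normalizes to "yes" or "no";
# such questions are scored by comparing the loglikelihoods of " yes" vs " no" rather than
# by free-form generation.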
doc["is_yes_no"] = reduce(lambda prev, cur: prev and squad_metrics.normalize_answer(cur)
in ["yes", "no"], doc["outputs"], True)
return [doc]
def _scrolls_metrics(self):
return {"f1": "f1"}
def process_results(self, doc, results):
if doc["is_yes_no"]:
prediction = " yes" if results[0] > results[1] else " no"
elif len(results[0].strip()) == 0:
prediction = "Unanswerable"
else:
prediction = results[0]
return {
"f1": (prediction, doc["outputs"])
}
def construct_requests(self, doc, ctx):
if doc["is_yes_no"]:
ll_yes, _ = rf.loglikelihood(ctx, " yes")
ll_no, _ = rf.loglikelihood(ctx, " no")
return [ll_yes, ll_no]
else:
return [rf.greedy_until(ctx, {'until': ["\n"]})]
class QuALITY(_SCROLLSMultipleChoiceTask):
"""QuALITY: Question Answering with Long Input Texts, Yes!
https://arxiv.org/abs/2112.08608
"""
DATASET_NAME = "quality"
_multiple_choice_pattern = re.compile(r" *\([A-D]\) *")
@staticmethod
def _normalize_answer(text):
return " ".join(text.split()).strip()
def _process_doc(self, doc):
doc = _process_doc_prepended_question(doc)
split = doc["text"].find("\n\n", doc["text"].find("(D)"))
choices_text = doc["text"][:split]
doc["text"] = doc["text"][split:].strip()
doc["choices"] = [QuALITY._normalize_answer(choice) for choice in re.split(
QuALITY._multiple_choice_pattern, choices_text)[1:]]
doc["gold"] = doc["choices"].index(QuALITY._normalize_answer(doc["outputs"][0]))
return [doc]
class NarrativeQA(_SCROLLSTask):
"""The NarrativeQA Reading Comprehension Challenge
https://arxiv.org/abs/1712.07040
"""
DATASET_NAME = "narrative_qa"
def _process_doc(self, doc):
return [_process_doc_prepended_question(doc)]
def _scrolls_metrics(self):
return {"f1": "f1"}
def _get_prune_text(self, doc):
# pruning narrativeqa takes forever -- let's cheat a bit
# and just cache on the text, not the question, since
# the dataset is different questions about the same large
# documents
return self._process_doc(doc)[0]["text"]
def process_results(self, doc, results):
return {
"f1": (results[0], doc["outputs"])
}
def construct_requests(self, doc, ctx):
return [rf.greedy_until(ctx, {'until': ["\n"]})]
class ContractNLI(_SCROLLSMultipleChoiceTask):
"""ContractNLI: A Dataset for Document-level Natural Language Inference for Contracts
https://arxiv.org/abs/2110.01799
"""
DATASET_NAME = "contract_nli"
CHOICES = ["Not mentioned", "Entailment", "Contradiction"]
def _process_doc(self, doc):
doc = _process_doc_prepended_question(doc)
doc["choices"] = ContractNLI.CHOICES
doc["gold"] = ContractNLI.CHOICES.index(doc["outputs"][0])
return [doc]
def doc_to_text(self, doc):
return f"{doc['text']}\n\nHypothesis: {doc['question']}\nConclusion:"
class GovReport(_SCROLLSSummaryTask):
"""Efficient Attentions for Long Document Summarization
https://arxiv.org/abs/2104.02112
Note: The average length of the reference summaries is ~3,000
characters, or ~600 tokens as tokenized by GPT-NeoX. For causal models,
it is recommended to set `max_gen_toks` sufficiently large (e.g. 1024)
to allow a full summary to be generated.
"""
DATASET_NAME = "gov_report"
class SummScreenFD(_SCROLLSSummaryTask):
"""SummScreen: A Dataset for Abstractive Screenplay Summarization
https://arxiv.org/abs/2104.07091
"""
DATASET_NAME = "summ_screen_fd"
class QMSum(_SCROLLSSummaryTask):
"""QMSum: A New Benchmark for Query-based Multi-domain
Meeting Summarization
https://arxiv.org/abs/2104.05938
"""
DATASET_NAME = "qmsum"
def _process_doc(self, doc):
return [_process_doc_prepended_question(doc)]
def doc_to_text(self, doc):
return f"{doc['text']}\n\nQuestion: {doc['question']}\nAnswer:"
def construct_tasks():
return {
"scrolls_qasper": Qasper,
"scrolls_quality": QuALITY,
"scrolls_narrativeqa": NarrativeQA,
"scrolls_contractnli": ContractNLI,
"scrolls_govreport": GovReport,
"scrolls_summscreenfd": SummScreenFD,
"scrolls_qmsum": QMSum
}
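With the registry hookup above, the new SCROLLS tasks can be addressed by name like any other task. A minimal sketch (model, task, and limit chosen only for illustration):

```python
from lm_eval import evaluator, tasks

# The SCROLLS entries registered by construct_tasks().
print([name for name in tasks.TASK_REGISTRY if name.startswith("scrolls_")])

results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["scrolls_contractnli"],
    num_fewshot=0,
    limit=8,  # tiny limit, just a smoke test
)
print(results["results"]["scrolls_contractnli"])
```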