Commit 835cc40e authored by lintangsutawika

merged latest and added altworld files

parents 8da401e0 c9bbec6e
......@@ -39,7 +39,7 @@ repos:
- id: codespell
exclude: >
(?x)^(
.*\.json|ignore.txt|lm_eval/tasks/.*|.*yaml
.*\.json|ignore.txt|lm_eval/tasks/.*|.*yaml|.*\.ipynb
)$
args: [--check-filenames, --check-hidden, --ignore-words=ignore.txt]
- repo: https://github.com/pre-commit/mirrors-mypy
......
......@@ -7,18 +7,4 @@ Welcome to the docs for the LM Evaluation Harness!
* To learn about the public interface of the library, as well as how to evaluate via the commandline or as integrated into an external library, see the [Interface](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/user_guide.md)
* To learn how to add a new model, API, or model type to the library, along with a quick explainer of the ways an LM can be evaluated, see the [Model Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/model_guide.md).
* For a crash course on adding new tasks to the library, see our [New Task Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/new_task_guide.md).
* To learn more about pushing the limits of task configuration that the Eval Harness supports, see the [Advanced Task Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/advanced_task_guide.md).
## Progress on Revamp
Tracking progress on revamping documentation pages for the refactor of LM-Evaluation-Harness.
### Desired Pages
* [ ] YAML explainer
* [ ] Explainer on filters + advanced features
* [ ] Walkthrough start-to-finish of adding a new task to codebase
* [ ] Explaining registries + decorators
* [ ] model_guide.md for adding new model API
* [ ] guide to writing an adapter to new advanced codebase (e.g. NeoX)
* [ ] Parallelism guide (?)
* To learn more about pushing the limits of task configuration that the Eval Harness supports, see the [Task Configuration Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/task_guide.md).
......@@ -18,6 +18,8 @@ This mode supports a number of command-line arguments, the details of which can
* `--num_fewshot` : Sets the number of few-shot examples to place in context. Must be an integer.
* `--gen_kwargs` : takes an argument string in the same format as `--model_args` and creates a dictionary of keyword arguments (a parsing sketch follows this list). These are passed to the model for all `generate_until` (free-form or greedy generation) tasks, to set options such as the sampling temperature or `top_p` / `top_k`. For the arguments each model type supports, see the respective library's documentation (for example, the documentation for `transformers.AutoModelForCausalLM.generate()`). These kwargs are applied to every `generate_until` task in a run; we do not currently support unique `gen_kwargs` or `batch_size` values per task in a single run of the library. To control these on a per-task level, set them in that task's YAML file.
* `--batch_size` : Sets the batch size used for evaluation. Can be a positive integer or `"auto"` to automatically select the largest batch size that will fit in memory, speeding up evaluation. One can pass `--batch_size auto:N` to re-select the maximum batch size `N` times during evaluation. This can help accelerate evaluation further, since `lm-eval` sorts documents in descending order of context length.
* `--max_batch_size` : Sets the maximum batch size to try to fit in memory, if `--batch_size auto` is passed.
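For reference, a minimal sketch (values are illustrative) of how a `--gen_kwargs` string is turned into keyword arguments, using `simple_parse_args_string`, the same helper the harness applies to `--model_args`:

```python
from lm_eval.utils import simple_parse_args_string

# comma-separated "key=value" pairs become a dict with typed values
gen_kwargs = simple_parse_args_string("temperature=0,top_p=0.95,max_gen_toks=128")
print(gen_kwargs)  # {'temperature': 0, 'top_p': 0.95, 'max_gen_toks': 128}
```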
......
......@@ -102,6 +102,8 @@ class MyCustomLM(LM):
Using this decorator adds the class to the registry of usable LM types that the library maintains internally at `lm_eval.api.registry.MODEL_REGISTRY`. See `lm_eval.api.registry` for more detail on the registries and decorators the library provides!
**Tip: be sure to import your model in `lm_eval/models/__init__.py`!**
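As a minimal sketch of registration (the registry name `my-custom-lm` is hypothetical and the method bodies are elided):

```python
from lm_eval.api.model import LM
from lm_eval.api.registry import register_model


@register_model("my-custom-lm")  # hypothetical name; enables `--model my-custom-lm`
class MyCustomLM(LM):
    # the three core request types every model must implement
    def loglikelihood(self, requests): ...

    def loglikelihood_rolling(self, requests): ...

    def generate_until(self, requests): ...
```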
## Testing
We also recommend that new model contributions be accompanied by, at minimum, short tests of their three core functionalities. To see an example of such tests, look at https://github.com/EleutherAI/lm-evaluation-harness/blob/35bdecd379c0cefad6897e67db892f4a6026a128/tests/test_ggml.py .
......
......@@ -2,7 +2,9 @@
`lm-evaluation-harness` is a framework that strives to support a wide range of zero- and few-shot evaluation tasks on autoregressive language models (LMs).
This documentation page provides a walkthrough to get started creating your own task, on the `big-refactor` branch of the repository (which will be v0.5.0 in the future).
This documentation page provides a walkthrough to get started creating your own task, on the `big-refactor` branch of the repository (which will be v0.4.0 in the future).
A more interactive tutorial is available as a Jupyter notebook [here](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/examples/lm-eval-overview.ipynb).
## Setup
......
......@@ -50,7 +50,7 @@ Scoring details:
- **doc_to_decontamination_query** (`str`, *optional*) —
Other:
- **metadata** (`str`, *optional*) — An optional field where arbitrary metadata can be passed.
- **metadata** (`Union[str, list]`, *optional*) — An optional field where arbitrary metadata can be passed. A good example is a `version` entry used to denote the version of the YAML config (e.g. `metadata: - version: 1.0`).
## Filters
......
......@@ -105,6 +105,14 @@ def parse_eval_args() -> argparse.Namespace:
default=None,
help="Additional path to include if there are external tasks to include.",
)
parser.add_argument(
"--gen_kwargs",
default="",
help=(
"String arguments for model generation on greedy_until tasks,"
" e.g. `temperature=0,top_k=0,top_p=0`"
),
)
parser.add_argument(
"--verbosity",
type=str,
......@@ -210,6 +218,7 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
check_integrity=args.check_integrity,
write_out=args.write_out,
log_samples=args.log_samples,
gen_kwargs=args.gen_kwargs,
)
if results is not None:
......@@ -236,7 +245,7 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
filename.open("w").write(samples_dumped)
print(
f"{args.model} ({args.model_args}), limit: {args.limit}, num_fewshot: {args.num_fewshot}, "
f"{args.model} ({args.model_args}), gen_kwargs: ({args.gen_kwargs}), limit: {args.limit}, num_fewshot: {args.num_fewshot}, "
f"batch_size: {args.batch_size}{f' ({batch_sizes})' if batch_sizes else ''}"
)
print(evaluator.make_table(results))
......
......@@ -112,8 +112,10 @@ def ter(items):
@register_aggregation("brier_score")
def brier_score(items): # This is a passthrough function
gold, predictions = list(zip(*items))
predictions = np.array(predictions)
gold = np.array(gold)
gold_one_hot = np.eye(len(predictions[0]))[gold]
return np.mean(np.sum((predictions - gold_one_hot) ** 2, axis=1))
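# Illustrative check of the aggregation above: gold indices [0, 1] with the
# probability vectors below give per-item squared errors 0.02 and 0.18,
# so their mean is returned:
# brier_score([(0, [0.9, 0.1]), (1, [0.3, 0.7])]) -> 0.1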
......
......@@ -133,13 +133,6 @@ class LM(abc.ABC):
additional_config = {} if additional_config is None else additional_config
args = utils.simple_parse_args_string(arg_string)
args2 = {k: v for k, v in additional_config.items() if v is not None}
# TODO: delete once float16 MPS is fixed in torch stable
if (
args2.get("device") in ("mps", "mps:0")
or args.get("device") in ("mps", "mps:0")
and "dev" not in torch.__version__
):
args["dtype"] = "float32"
return cls(**args, **args2)
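# e.g. (illustrative values) create_from_arg_string("pretrained=gpt2,dtype=float32",
# {"device": "cuda:0"}) parses the string into kwargs and merges the non-None extras.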
@property
......
......@@ -81,7 +81,7 @@ class TaskConfig(dict):
fewshot_delimiter: str = "\n\n"
fewshot_config: dict = None
# runtime configuration options
num_fewshot: int = 0
num_fewshot: int = None
# scoring options
metric_list: list = None
output_type: str = "generate_until"
......@@ -91,7 +91,9 @@ class TaskConfig(dict):
should_decontaminate: bool = False
doc_to_decontamination_query: str = None
metadata: str = None # by default, not used in the code. allows for users to pass arbitrary info to tasks
metadata: Union[
str, list
] = None # by default, not used in the code. allows for users to pass arbitrary info to tasks
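# e.g. a task YAML may set `metadata:` to a string, or to a list such as `- version: 1.0`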
def __post_init__(self) -> None:
if self.dataset_path and ("." in self.dataset_path):
......@@ -359,7 +361,7 @@ class Task(abc.ABC):
# sample fewshot context #TODO: need to offset doc_id by rank now!
fewshot_ctx = self.fewshot_context(
doc,
self.config.num_fewshot,
0 if self.config.num_fewshot is None else self.config.num_fewshot,
)
# TODO: we should override self.config.repeats if doing greedy gen so users don't waste time+compute
......@@ -775,7 +777,7 @@ class ConfigurableTask(Task):
if self.config.fewshot_split is not None:
return self.dataset[self.config.fewshot_split]
else:
if self.config.num_fewshot > 0:
if (self.config.num_fewshot is not None) and (self.config.num_fewshot > 0):
eval_logger.warning(
f"Task '{self.config.task}': "
"num_fewshot > 0 but fewshot_split is None. "
......
......@@ -20,6 +20,7 @@ from lm_eval.utils import (
make_table,
create_iterator,
get_git_commit_hash,
simple_parse_args_string,
eval_logger,
)
......@@ -40,6 +41,7 @@ def simple_evaluate(
decontamination_ngrams_path=None,
write_out: bool = False,
log_samples: bool = True,
gen_kwargs: str = None,
):
"""Instantiate and evaluate a model on a list of tasks.
......@@ -70,6 +72,9 @@ def simple_evaluate(
If True, write out an example document and model input for checking task integrity
:param log_samples: bool
If True, write out all model outputs and documents for per-sample measurement and post-hoc analysis
:param gen_kwargs: str
String arguments for model generation
Ignored for all tasks with loglikelihood output_type
:return
Dictionary of results
"""
......@@ -83,6 +88,14 @@ def simple_evaluate(
tasks != []
), "No tasks specified, or no tasks found. Please verify the task names."
if gen_kwargs == "":
gen_kwargs = None
if gen_kwargs is not None:
gen_kwargs = simple_parse_args_string(gen_kwargs)
eval_logger.warning(
"generation_kwargs specified through the CLI; these settings will override any generation_kwargs set in task YAMLs."
)
if isinstance(model, str):
if model_args is None:
model_args = ""
......@@ -117,14 +130,21 @@ def simple_evaluate(
continue
config = task_obj._config
if config["output_type"] == "generate_until" and gen_kwargs is not None:
config["generation_kwargs"].update(gen_kwargs)
if num_fewshot is not None:
if config["num_fewshot"] > 0:
if config["num_fewshot"] == 0:
eval_logger.info(
f"num_fewshot has been set to 0 for {task_name} in its config. Manual configuration will be ignored."
)
else:
default_num_fewshot = config["num_fewshot"]
eval_logger.warning(
f"Overwriting default num_fewshot of {task_name} from {default_num_fewshot} to {num_fewshot}"
)
task_obj._config["num_fewshot"] = num_fewshot
if check_integrity:
run_task_tests(task_list=tasks)
......@@ -154,6 +174,7 @@ def simple_evaluate(
"use_cache": use_cache,
"limit": limit,
"bootstrap_iters": bootstrap_iters,
"gen_kwargs": gen_kwargs,
}
results["git_hash"] = get_git_commit_hash()
return results
......@@ -216,6 +237,8 @@ def evaluate(
# store the ordering of tasks and groups
task_order = collections.defaultdict(int)
task_group_alias = collections.defaultdict(dict)
# store num-fewshot value per task
num_fewshot = collections.defaultdict(int)
# get lists of each type of request
for task_name, task in task_dict.items():
......@@ -234,6 +257,12 @@ def evaluate(
versions[task_name] = task.VERSION
configs[task_name] = dict(task.dump_config())
if "num_fewshot" in configs[task_name]:
n_shot = configs[task_name]["num_fewshot"]
else:
n_shot = 0
num_fewshot[task_name] = n_shot
if "task_alias" in configs[task_name]:
task_group_alias[task_name] = configs[task_name]["task_alias"]
......@@ -411,7 +440,6 @@ def evaluate(
vals = vals_torch
if lm.rank == 0:
### Get task ordering for correct sample-wide aggregation
group_to_task = {}
for group in task_hierarchy.keys():
......@@ -422,7 +450,6 @@ def evaluate(
group_to_task[group] = task_hierarchy[group].copy()
for task in task_hierarchy[group]:
if task in task_order:
task_order[task] += 1
else:
......@@ -471,9 +498,7 @@ def evaluate(
results[task_name][metric + "_stderr" + "," + key] = 0
if bool(results):
for group, task_list in reversed(task_hierarchy.items()):
if task_list == []:
total_size = results[group]["samples"]
else:
......@@ -493,7 +518,6 @@ def evaluate(
for metric in [
key for key in metrics.keys() if "_stderr" not in key
]:
stderr = "_stderr,".join(metric.split(","))
stderr_score = results[task][stderr]
var_score = stderr_score**2
......@@ -530,11 +554,9 @@ def evaluate(
results[group]["samples"] = total_size
def print_tasks(task_hierarchy, task_order, task_version, task_group_alias):
results_agg = collections.defaultdict(dict)
groups_agg = collections.defaultdict(dict)
for group_name, task_list in task_hierarchy.items():
order = task_order[group_name]
results_agg[group_name] = results[group_name].copy()
results_agg[group_name]["tab"] = order
......@@ -597,11 +619,16 @@ def evaluate(
else:
groups_agg[group]["alias"] = tab_string + group
for group_name, task_list in task_hierarchy.items():
if task_list != []:
num_fewshot[group_name] = num_fewshot[task_list[0]]
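# note: a group reports the n-shot of its first subtask; this assumes subtasks share the setting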
results_dict = {
"results": dict(results_agg.items()),
**({"groups": dict(groups_agg.items())} if bool(groups_agg) else {}),
"configs": dict(sorted(configs.items())),
"versions": dict(sorted(versions.items())),
"n-shot": dict(sorted(num_fewshot.items())),
}
if log_samples:
results_dict["samples"] = dict(samples)
......
......@@ -4,6 +4,6 @@ from . import textsynth
from . import dummy
from . import anthropic_llms
from . import gguf
from . import vllm_causallms
# TODO: implement __all__
import os
from packaging import version
import torch
import transformers
from transformers.models.auto.modeling_auto import (
......@@ -16,13 +16,14 @@ from pathlib import Path
import torch.nn.functional as F
from lm_eval import utils
from lm_eval.api.instance import Instance
from lm_eval.api.model import LM
from lm_eval.api.registry import register_model
from lm_eval.utils import MultiTokenEOSCriteria, stop_sequences_criteria
from accelerate import Accelerator, find_executable_batch_size, DistributedType
from typing import List, Optional, Union
from typing import List, Optional, Union, Tuple
eval_logger = utils.eval_logger
......@@ -117,11 +118,11 @@ class HFLM(LM):
device = int(device)
self._device = torch.device(device)
eval_logger.info(f"Using device '{device}'")
if device in ("mps", "mps:0") and "dev" not in torch.__version__:
eval_logger.info(
"MPS: Setting dtype to float32. To use float16 with MPS, please install a nightly build of "
"PyTorch: pip3 install --pre torch torchvision torchaudio --index-url "
"https://download.pytorch.org/whl/nightly/cpu"
if device in ("mps", "mps:0") and version.parse(
torch.__version__
) < version.parse("2.1"):
raise RuntimeError(
f"mps requires torch >= 2.1. You have {torch.__version__}"
)
else:
eval_logger.info("Device not specified")
......@@ -157,12 +158,17 @@ class HFLM(LM):
trust_remote_code=trust_remote_code,
)
if getattr(self._config, "model_type") in MODEL_FOR_CAUSAL_LM_MAPPING_NAMES:
self.AUTO_MODEL_CLASS = transformers.AutoModelForCausalLM
elif (
not getattr(self._config, "model_type")
if (
getattr(self._config, "model_type")
in MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING_NAMES
):
# first check whether the model type is listed under seq2seq models, since some
# models (e.g. MBart) are mistakenly listed under both seq2seq and causal in HF transformers.
# these special cases should be treated as seq2seq models.
self.AUTO_MODEL_CLASS = transformers.AutoModelForSeq2SeqLM
elif getattr(self._config, "model_type") in MODEL_FOR_CAUSAL_LM_MAPPING_NAMES:
self.AUTO_MODEL_CLASS = transformers.AutoModelForCausalLM
else:
if not trust_remote_code:
eval_logger.warning(
"HF model type is neither marked as CausalLM or Seq2SeqLM. \
......@@ -171,8 +177,6 @@ class HFLM(LM):
# if model type is neither in HF transformers causal or seq2seq model registries
# then we default to AutoModelForCausalLM
self.AUTO_MODEL_CLASS = transformers.AutoModelForCausalLM
else:
self.AUTO_MODEL_CLASS = transformers.AutoModelForSeq2SeqLM
assert self.AUTO_MODEL_CLASS in [
transformers.AutoModelForCausalLM,
......@@ -420,7 +424,9 @@ class HFLM(LM):
utils.clear_torch_cache()
return batch_size
def tok_encode(self, string: str, left_truncate_len=None, add_special_tokens=None):
def tok_encode(
self, string: str, left_truncate_len=None, add_special_tokens=None
) -> List[int]:
""" """
if add_special_tokens is None:
if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM:
......@@ -442,7 +448,7 @@ class HFLM(LM):
padding_side: str = "left",
left_truncate_len: int = None,
truncation: bool = False,
):
) -> Tuple[List[int], List[int]]:
# encode a batch of strings. converts to tensors and pads automatically, unlike tok_encode.
old_padding_side = self.tokenizer.padding_side
self.tokenizer.padding_side = padding_side
......@@ -536,7 +542,9 @@ class HFLM(LM):
return logits
def _encode_pair(self, context, continuation):
def _encode_pair(
self, context: str, continuation: str
) -> Tuple[List[int], List[int]]:
n_spaces = len(context) - len(context.rstrip())
if n_spaces > 0:
continuation = context[-n_spaces:] + continuation
......@@ -551,7 +559,7 @@ class HFLM(LM):
continuation_enc = whole_enc[context_enc_len:]
return context_enc, continuation_enc
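# e.g. context "Hello " + continuation "world" is encoded as ("Hello", " world"),
# so trailing whitespace is never stranded at the end of the context.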
def loglikelihood(self, requests):
def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]:
new_reqs = []
for context, continuation in [req.args for req in requests]:
if context == "":
......@@ -566,7 +574,7 @@ class HFLM(LM):
return self._loglikelihood_tokens(new_reqs)
def loglikelihood_rolling(self, requests):
def loglikelihood_rolling(self, requests: List[Instance]) -> List[float]:
loglikelihoods = []
adaptive_batch_size = None
......@@ -640,8 +648,11 @@ class HFLM(LM):
return self.batch_sizes[sched]
def _loglikelihood_tokens(
self, requests, disable_tqdm: bool = False, override_bs=None
):
self,
requests: List[Tuple[Tuple[str, str], List[int], List[int]]],
disable_tqdm: bool = False,
override_bs: int = None,
) -> List[Tuple[float, bool]]:
# TODO: implement some kind of efficient-request-middleware that lumps together requests with the same context
res = []
......@@ -820,7 +831,7 @@ class HFLM(LM):
return re_ord.get_original(res)
def generate_until(self, requests):
def generate_until(self, requests: List[Instance]) -> List[str]:
res = defaultdict(list)
re_ords = {}
......
import os
import time
from typing import List, Tuple
import copy
from collections import defaultdict
from tqdm import tqdm
from lm_eval import utils
from lm_eval.api.model import LM
from lm_eval.api.registry import register_model
......@@ -51,7 +55,7 @@ please install these via `pip install lm-eval[openai]` or `pip install -e .[open
backoff_time = 3
while True:
try:
return openai.Completion.create(**kwargs)
return openai.Completions.create(**kwargs)
except openai.error.OpenAIError:
import traceback
......@@ -60,7 +64,7 @@ please install these via `pip install lm-eval[openai]` or `pip install -e .[open
backoff_time *= 1.5
@register_model("openai", "openai-completions", "gooseai")
@register_model("gooseai")
class OpenaiCompletionsLM(LM):
REQ_CHUNK_SIZE = 20
......@@ -304,3 +308,211 @@ class OpenaiCompletionsLM(LM):
string_nll = sum(string_nll)
loglikelihoods.append(string_nll)
return loglikelihoods
def oa_chat_completion(client, **kwargs):
"""Query OpenAI API for chat completion.
Retry with back-off until they respond
"""
try:
import openai, tiktoken # noqa: E401
except ModuleNotFoundError:
raise Exception(
"attempted to use 'openai' LM type, but package `openai` or `tiktoken` are not installed. \
please install these via `pip install lm-eval[openai]` or `pip install -e .[openai]`",
)
backoff_time = 3
while True:
try:
return client.chat.completions.create(**kwargs)
except openai.OpenAIError:
import traceback
traceback.print_exc()
time.sleep(backoff_time)
backoff_time *= 1.5
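# e.g. (illustrative) oa_chat_completion(client=client, model="gpt-3.5-turbo",
# messages=[{"role": "user", "content": "Say hi"}], max_tokens=16)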
@register_model("openai-chat-completions")
class OpenaiChatCompletionsLM(LM):
def __init__(
self, model: str = "gpt-3.5-turbo", truncate: bool = False, batch_size: int = 1
) -> None:
"""
:param model: str
OpenAI API model (e.g. gpt-3.5-turbo)
:param truncate: bool
Truncate input if too long (if False and input is too long, throw error)
"""
super().__init__()
try:
import openai, tiktoken # noqa: E401
except ModuleNotFoundError:
raise Exception(
"attempted to use 'openai' LM type, but package `openai` or `tiktoken` are not installed. \
please install these via `pip install lm-eval[openai]` or `pip install -e .[openai]`",
)
self.model = model
self.frequency_penalty = 0
self.logit_bias = None
self.n = 1
self.presence_penalty = 0
self.temperature = 1
self.top_p = 1
self.tokenizer = tiktoken.encoding_for_model(self.model)
self.vocab_size = self.tokenizer.n_vocab
self.truncate = truncate
self.end_of_text_token_id = self.tokenizer.eot_token
# Read from environment variable OPENAI_API_KEY
self.client = openai.OpenAI() # openai.AsyncOpenAI()
@property
def eot_token_id(self):
return self.end_of_text_token_id
@property
def max_length(self) -> int:
# Note: the OpenAI API supports up to 2049 tokens, with the first token being the first input token
return 2048
@property
def max_gen_toks(self) -> int:
return 256
@property
def batch_size(self):
# Isn't used because we override _loglikelihood_tokens
raise NotImplementedError()
@property
def device(self):
# Isn't used because we override _loglikelihood_tokens
raise NotImplementedError()
def tok_encode(self, string: str) -> List[int]:
return self.tokenizer.encode(string)
def tok_decode(self, tokens: List[int]) -> str:
return self.tokenizer.decode(tokens)
def _encode_pair(
self, context: str, continuation: str
) -> Tuple[List[int], List[int]]:
n_spaces = len(context) - len(context.rstrip())
if n_spaces > 0:
continuation = context[-n_spaces:] + continuation
context = context[:-n_spaces]
whole_enc = self.tok_encode(context + continuation)
context_enc = self.tok_encode(context)
context_enc_len = len(context_enc)
continuation_enc = whole_enc[context_enc_len:]
return context_enc, continuation_enc
def generate_until(self, requests) -> List[str]:
res = defaultdict(list)
re_ords = {}
def _collate(x):
toks = self.tok_encode(x[0])
return -len(toks), x[0]
# we group requests by their generation_kwargs,
# so that we don't try to execute e.g. greedy sampling and temp=0.8 sampling
# in the same batch.
grouper = utils.Grouper(requests, lambda x: str(x.args[1]))
for key, reqs in grouper.get_grouped().items():
# within each set of reqs for given kwargs, we reorder by token length, descending.
re_ords[key] = utils.Reorderer([req.args for req in reqs], _collate)
def sameuntil_chunks(xs, size):
ret = []
lastuntil = xs[0][1]
for x in xs:
if len(ret) >= size or x[1] != lastuntil:
yield ret, lastuntil
ret = []
lastuntil = x[1]
ret.append(x)
if ret:
yield ret, lastuntil
pbar = tqdm(total=len(requests), disable=(self.rank != 0))
for key, re_ord in re_ords.items():
# n must be 1: the messages in a chat completion request are not a batch
# but are treated as a single conversation.
chunks = utils.chunks(re_ord.get_reordered(), n=1)
for chunk in chunks:
contexts, all_gen_kwargs = zip(*chunk)
inps = [{"role": "user", "content": context} for context in contexts]
gen_kwargs = all_gen_kwargs[0]
until = None
if isinstance(gen_kwargs, dict):
kwargs = copy.deepcopy(gen_kwargs) # edge case for repeats > 1
if "until" in kwargs.keys():
until = kwargs.pop("until")
if isinstance(until, str):
until = [until]
elif not isinstance(until, list):
raise ValueError(
f"Expected `kwargs['until']` to be of type Union[str,list] but got {until}"
)
else:
raise ValueError(
f"Expected `kwargs` to be of type `dict` but got {kwargs}"
)
if "max_gen_toks" in kwargs.keys():
max_gen_toks = kwargs.pop("max_gen_toks")
else:
max_gen_toks = self.max_gen_toks
response = oa_chat_completion(
client=self.client,
messages=inps,
model=self.model,
frequency_penalty=self.frequency_penalty,
# logit_bias=self.logit_bias,
max_tokens=max_gen_toks,
n=self.n,
presence_penalty=self.presence_penalty,
temperature=self.temperature,
top_p=self.top_p,
)
for resp, (context, args_) in zip(response.choices, chunk):
s = resp.message.content
if until is not None:
for term in until:
if len(term) > 0:
s = s.split(term)[0]
res[key].append(s)
self.cache_hook.add_partial(
"generate_until", (context, {"until": until}), s
)
pbar.update(1)
# reorder this group of results back to original unsorted form
res[key] = re_ord.get_original(res[key])
pbar.close()
return grouper.get_original(res)
def loglikelihood(self, requests):
raise NotImplementedError("No support for logits.")
def loglikelihood_rolling(self, requests):
raise NotImplementedError("No support for logits.")
from collections import defaultdict
from typing import List, Tuple, Optional, Literal, Union
from lm_eval.api.instance import Instance
from lm_eval.api.model import LM
import copy
from tqdm import tqdm
from lm_eval.api.registry import register_model
from lm_eval import utils
try:
from vllm import LLM, SamplingParams
except ModuleNotFoundError:
pass
eval_logger = utils.eval_logger
@register_model("vllm")
class VLLM(LM):
_DEFAULT_MAX_LENGTH = 2048
def __init__(
self,
pretrained="gpt2",
dtype: Literal["float16", "bfloat16", "float32", "auto"] = "auto",
revision: Optional[str] = None,
trust_remote_code: Optional[bool] = False,
tokenizer_mode: Literal["auto", "slow"] = "auto",
tensor_parallel_size: int = 1,
quantization: Optional[Literal["awq"]] = None,
max_gen_toks: int = 256,
swap_space: int = 4,
batch_size: Union[str, int] = 1,
max_batch_size=None,
max_length: int = None,
seed: int = 1234,
gpu_memory_utilization: float = 0.9,
device: str = "cuda",
):
super().__init__()
try:
import vllm
except ModuleNotFoundError:
raise Exception(
"attempted to use 'vllm' LM type, but package `vllm` is not installed. \
please install vllm via `pip install lm-eval[vllm]` or `pip install -e .[vllm]`",
)
assert "cuda" in device or device is None, "vLLM only supports CUDA"
self.model = LLM(
model=pretrained,
gpu_memory_utilization=float(gpu_memory_utilization),
revision=revision,
dtype=dtype,
tokenizer_mode=tokenizer_mode,
trust_remote_code=trust_remote_code,
tensor_parallel_size=int(tensor_parallel_size),
swap_space=int(swap_space),
quantization=quantization,
seed=int(seed),
)
self.tokenizer = self.model.get_tokenizer()
self.batch_size = batch_size
self._max_length = max_length
self._max_gen_toks = max_gen_toks
@property
def eot_token_id(self):
# we use EOT because end of *text* is more accurate for what we're doing than end of *sentence*
return self.tokenizer.eos_token_id
@property
def max_length(self):
if self._max_length: # if max length manually set, return it
return self._max_length
if hasattr(self.model.llm_engine.model_config, "max_model_len"):
return self.model.llm_engine.model_config.max_model_len
return self._DEFAULT_MAX_LENGTH
@property
def max_gen_toks(self):
return self._max_gen_toks
def tok_encode(
self,
string: str,
left_truncate_len=None,
add_special_tokens=False,
truncation=False,
):
""" """
encoding = self.tokenizer.encode(
string, add_special_tokens=add_special_tokens, truncation=truncation
)
# left-truncate the encoded context to be at most `left_truncate_len` tokens long
if left_truncate_len:
encoding = encoding[-left_truncate_len:]
return encoding
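# e.g. tok_encode(context, left_truncate_len=max_ctx_len) keeps only the last
# max_ctx_len tokens, matching the context truncation done in generate_until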
def _model_generate(
self,
requests: List[int] = None,
generate: bool = False,
max_tokens: int = None,
stop: Optional[List[str]] = None,
use_tqdm=True,
**kwargs,
):
if "do_sample" in kwargs.keys():
kwargs.pop("do_sample")
if generate:
generate_sampling_params = SamplingParams(
max_tokens=max_tokens, stop=stop, **kwargs
)
outputs = self.model.generate(
prompt_token_ids=requests,
sampling_params=generate_sampling_params,
use_tqdm=use_tqdm,
)
else:
loglikelihood_sampling_params = SamplingParams(
temperature=0, prompt_logprobs=2, max_tokens=1
)
outputs = self.model.generate(
prompt_token_ids=requests,
sampling_params=loglikelihood_sampling_params,
use_tqdm=use_tqdm,
)
return outputs
def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]:
new_reqs = []
for context, continuation in [req.args for req in requests]:
if context == "":
# end of text as context
context_enc, continuation_enc = [self.eot_token_id], self.tok_encode(
continuation
)
else:
context_enc, continuation_enc = self.tokenizer(
[context, continuation],
truncation="do_not_truncate",
add_special_tokens=False,
return_attention_mask=False,
).input_ids
new_reqs.append(((context, continuation), context_enc, continuation_enc))
return self._loglikelihood_tokens(new_reqs)
def loglikelihood_rolling(self, requests: List[Instance]) -> List[float]:
loglikelihoods = []
for (string,) in tqdm([req.args for req in requests]):
rolling_token_windows = list(
map(
utils.make_disjoint_window,
utils.get_rolling_token_windows(
token_list=self.tok_encode(string),
prefix_token=self.eot_token_id,
max_seq_len=self.max_length - 1,
context_len=1,
),
)
)
rolling_token_windows = [(None,) + x for x in rolling_token_windows]
string_nll = self._loglikelihood_tokens(
rolling_token_windows,
)
# discard is_greedy
string_nll = [x[0] for x in string_nll]
string_nll = sum(string_nll)
loglikelihoods.append(string_nll)
return loglikelihoods
def generate_until(self, requests: List[Instance]) -> List[str]:
res = defaultdict(list)
re_ords = {}
# batch tokenize contexts
context, all_gen_kwargs = zip(*(req.args for req in requests))
context_encoding = self.tokenizer(context).input_ids
requests = [
((a, b), c) for a, b, c in zip(context, context_encoding, all_gen_kwargs)
]
def _collate_gen(_requests):
# the negative sign on len(toks) sorts descending - this has a few advantages:
# - time estimates will always be over not underestimates, which is more useful for planning
# - to know the size of a batch when going through the list, you know the first one is always the batch
# padded context length. this is useful to simplify the batching logic and more importantly to make
# automatic adaptive batches much much easier to implement
# - any OOMs will happen right away rather than near the end
return -len(_requests[0][1]), tuple(_requests[0][1])
# we group requests by their generation_kwargs,
# so that we don't try to execute e.g. greedy sampling and temp=0.8 sampling
# in the same batch.
grouper = utils.Grouper(requests, lambda x: str(x[1]))
for key, reqs in grouper.get_grouped().items():
# within each set of reqs for given kwargs, we reorder by token length, descending.
re_ords[key] = utils.Reorderer(reqs, _collate_gen)
pbar = tqdm(total=len(requests), disable=(self.rank != 0))
# for each different set of kwargs, we execute all requests, by batch.
for key, re_ord in re_ords.items():
chunks = utils.chunks(
re_ord.get_reordered(),
n=self.batch_size if self.batch_size != "auto" else 0,
fn=None,
)
for chunk in chunks:
context_and_encoding, all_gen_kwargs = zip(*chunk)
context, context_encoding = zip(*context_and_encoding)
# we assume all gen kwargs in the batch are the same
# this is safe to assume because the `grouper` object ensures it.
gen_kwargs = all_gen_kwargs[0]
# unpack our keyword arguments.
until = None
if isinstance(gen_kwargs, dict):
kwargs = copy.deepcopy(gen_kwargs) # edge case for repeats > 1
if "until" in kwargs.keys():
until = kwargs.pop("until")
if isinstance(until, str):
until = [until]
elif not isinstance(until, list):
raise ValueError(
f"Expected `kwargs['until']` to be of type Union[str,list] but got {until}"
)
else:
raise ValueError(
f"Expected `kwargs` to be of type `dict` but got {gen_kwargs}"
)
if not until:
until = [self.tokenizer.decode(self.eot_token_id)]
if "max_gen_toks" in kwargs.keys():
max_gen_toks = kwargs.pop("max_gen_toks")
else:
max_gen_toks = self.max_gen_toks
# set the max length in tokens of inputs ("context_enc")
# max len for inputs = max length, minus room to generate the max new tokens
max_ctx_len = self.max_length - max_gen_toks
context_encoding = [x[-max_ctx_len:] for x in context_encoding]
# TODO: max_length in kwargs
# perform batched generation
cont = self._model_generate(
requests=context_encoding,
generate=True,
max_tokens=max_gen_toks,
stop=until,
**kwargs,
)
# cache generations
for output, context in zip(cont, context):
generated_text = output.outputs[0].text
res[key].append(generated_text)
self.cache_hook.add_partial(
"generate_until", (context, gen_kwargs), generated_text
)
pbar.update(1)
# reorder this group of results back to original unsorted form
res[key] = re_ord.get_original(res[key])
pbar.close()
return grouper.get_original(res)
def _loglikelihood_tokens(
self,
requests: List[Tuple[Tuple[str, str], List[int], List[int]]],
disable_tqdm: bool = False,
) -> List[Tuple[float, bool]]:
res = []
def _collate(x):
toks = x[1] + x[2]
return -len(toks), tuple(toks)
re_ord = utils.Reorderer(requests, _collate)
chunks = utils.chunks(
re_ord.get_reordered(),
n=self.batch_size if self.batch_size != "auto" else 0,
fn=None,
)
pbar = tqdm(total=len(requests), disable=disable_tqdm)
for chunk in chunks:
inps = []
ctxlens = []
for cache_key, context_enc, continuation_enc in chunk:
inp = (context_enc + continuation_enc)[-(self.max_length) :]
ctxlen = len(context_enc) - max(
0, len(context_enc) + len(continuation_enc) - (self.max_length)
)
inps.append(inp)
ctxlens.append(ctxlen)
outputs = self._model_generate(requests=inps, generate=False)
for output, ctxlen, (cache_key, context_enc, continuation_enc) in zip(
outputs, ctxlens, chunk
):
answer = self._parse_logprobs(
(context_enc + continuation_enc),
output,
ctxlen,
)
res.append(answer)
# partial caching
if cache_key is not None:
self.cache_hook.add_partial("loglikelihood", cache_key, answer)
pbar.update(1)
pbar.close()
return re_ord.get_original(res)
@staticmethod
def _parse_logprobs(tokens: List, outputs, ctxlen: int) -> Tuple[float, bool]:
"""Process logprobs and tokens.
:param tokens: list
Tokens from context+continuations
:param outputs: RequestOutput
Contains prompt
:param ctxlen: int
Length of context (so we can slice them away and only keep the predictions)
:return:
continuation_logprobs: float
Log probabilities of continuation tokens
is_greedy: bool
Whether argmax matches given continuation exactly
"""
# outputs.prompt_logprobs is [None] followed by one logprob dict per prompt token
continuation_logprobs_dicts = outputs.prompt_logprobs
# Calculate continuation_logprobs
# assume ctxlen always > 1
continuation_logprobs = sum(
logprob_dict.get(token)
for token, logprob_dict in zip(
tokens[ctxlen:], continuation_logprobs_dicts[ctxlen:]
)
)
# Determine if is_greedy
is_greedy = True
for token, logprob_dict in zip(
tokens[ctxlen:], continuation_logprobs_dicts[ctxlen:]
):
# Get the token with the maximum log probability from the logprob_dict
if logprob_dict: # Ensure the logprob_dict is not None
top_token = max(logprob_dict, key=logprob_dict.get)
if top_token != token:
is_greedy = False
break
return continuation_logprobs, is_greedy
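# e.g. with ctxlen=2 and tokens [t0, t1, t2], only t2 contributes: its logprob is
# summed, and is_greedy requires t2 to be the argmax of its logprob dict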
......@@ -22,3 +22,5 @@ metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
metadata:
- version: 1.0
Investigate effect of letter options
- (A)
- A)
- A.
- A\t
- (a)
- a)
- a.
- a\t
Answer types:
- letters only
  - original option
  - just letter
- letters + continuation
  - original option
  - just letter
  - continuation
\ No newline at end of file
group:
- ai2_arc
task: arc_easy
dataset_path: ai2_arc
dataset_name: ARC-Easy
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{choices.label.index(answerKey)}}"
doc_to_choice: "{{choices.text}}"
should_decontaminate: true
doc_to_decontamination_query: "Question: {{question}}\nAnswer:"
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
- metric: brier_score
higher_is_better: false