- [2024/07] [API model](docs/API_guide.md) support has been updated and refactored, introducing support for batched and async requests, and making it significantly easier to customize and use for your own purposes. **To run Llama 405B, we recommend hosting the model with vLLM's OpenAI-compliant API and evaluating it with the `local-completions` model type.**
- [2024/07] New Open LLM Leaderboard tasks have been added! You can find them under the [leaderboard](lm_eval/tasks/leaderboard/README.md) task group.
---
## Announcement
**A new v0.4.0 release of lm-evaluation-harness is available!**
...
Please see our updated documentation pages in `docs/` for more details.
Development will be continuing on the `main` branch, and we encourage you to give us feedback on what features are desired and how to improve the library further, or ask questions, either in issues or PRs on GitHub, or in the [EleutherAI discord](https://discord.gg/eleutherai)!
---
## Overview
This project provides a unified framework to test generative language models on a large number of different evaluation tasks.
...
#### Multi-GPU Evaluation with Hugging Face `accelerate`
We support three main ways of using Hugging Face's [accelerate 🚀](https://github.com/huggingface/accelerate) library for multi-GPU evaluation.
To perform *data-parallel evaluation* (where each GPU loads a **separate full copy** of the model), we leverage the `accelerate` launcher as follows:
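A typical invocation looks roughly like the sketch below; the pretrained model, tasks, and batch size here are illustrative placeholders rather than recommended settings.
```
accelerate launch -m lm_eval --model hf \
    --model_args pretrained=EleutherAI/gpt-j-6B \
    --tasks lambada_openai,arc_easy \
    --batch_size 16
```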
...
The second way of using `accelerate` for multi-GPU evaluation is when your model is *too large to fit on a single GPU.*
In this setting, run the library *outside the `accelerate` launcher*, passing `parallelize=True` to `--model_args` as follows:
```
lm_eval --model hf \
    ...
```
For more advanced users or even larger models, we allow for the following arguments when `parallelize=True` as well:
- `max_cpu_memory`: the max amount of CPU memory to use when offloading the model weights to RAM.
- `offload_folder`: a folder where model weights will be offloaded to disk if needed.
The third option is to use both at the same time. This will allow you to take advantage of both data parallelism and model sharding, and is especially useful for models that are too large to fit on a single GPU.
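As a rough sketch of such a combined launch (the model name, task, batch size, and the `{num_model_replicas}` placeholder for the number of full model copies are all illustrative):
```
accelerate launch --num_processes {num_model_replicas} \
    -m lm_eval --model hf \
    --model_args pretrained=EleutherAI/gpt-j-6B,parallelize=True \
    --tasks lambada_openai \
    --batch_size 16
```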
To learn more about model parallelism and how to use it with the `accelerate` library, see the [accelerate documentation](https://huggingface.co/docs/transformers/v4.15.0/en/parallelism).
**Note: we do not currently support multi-node evaluations natively, and advise using either an externally hosted server to run inference requests against, or creating a custom integration with your distributed framework [as is done for the GPT-NeoX library](https://github.com/EleutherAI/gpt-neox/blob/main/eval_tasks/eval_adapter.py).**
Note that for externally hosted models, configs such as `--device`, which relate to where to place a local model, should not be used and do not function. Just like you can use `--model_args` to pass arbitrary arguments to the model constructor for local models, you can use it to pass arbitrary arguments to the model API for hosted models. See the documentation of the hosting service for information on what arguments they support.
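For example, a run against an OpenAI-compatible completions server via the `local-completions` model type might look like the following sketch; the model name and base URL are placeholders for your own deployment.
```
lm_eval --model local-completions \
    --model_args model=facebook/opt-125m,base_url=http://localhost:8000/v1/completions,num_concurrent=1,max_retries=3,tokenized_requests=False \
    --tasks gsm8k
```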
| API or Inference Server | Implemented? | `--model <xxx>` name | Models supported: | Request Types: |
| --- | --- | --- | --- | --- |
| Mamba | :heavy_check_mark: | `mamba_ssm` | [Mamba architecture Language Models via the `mamba_ssm` package](https://huggingface.co/state-spaces) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| Huggingface Optimum (Causal LMs) | ✔️ | `openvino` | Any decoder-only AutoModelForCausalLM converted with Huggingface Optimum into OpenVINO™ Intermediate Representation (IR) format | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| Neuron via AWS Inf2 (Causal LMs) | ✔️ | `neuronx` | Any decoder-only AutoModelForCausalLM supported to run on [huggingface-ami image for inferentia2](https://aws.amazon.com/marketplace/pp/prodview-gr3e6yiscria2) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| [Neural Magic DeepSparse](https://github.com/neuralmagic/deepsparse) | ✔️ | `deepsparse` | Any LM from [SparseZoo](https://sparsezoo.neuralmagic.com/) or on [HF Hub with the "deepsparse" tag](https://huggingface.co/models?other=deepsparse) | `generate_until`, `loglikelihood` |
| [Neural Magic SparseML](https://github.com/neuralmagic/sparseml) | ✔️ | `sparseml` | Any decoder-only AutoModelForCausalLM from [SparseZoo](https://sparsezoo.neuralmagic.com/) or on [HF Hub](https://huggingface.co/neuralmagic). Especially useful for models with quantization like [`zoo:llama2-7b-gsm8k_llama2_pretrain-pruned60_quantized`](https://sparsezoo.neuralmagic.com/models/llama2-7b-gsm8k_llama2_pretrain-pruned60_quantized) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| Your local inference server! | :heavy_check_mark: | `local-completions` or `local-chat-completions` | Support for OpenAI API-compatible servers, with easy customization for other APIs. | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
Models which do not supply logits or logprobs can be used with tasks of type `generate_until` only, while local models, or APIs that supply logprobs/logits of their prompts, can be run on all task types: `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`.
...
The best way to get support is to open an issue on this repo or join the [EleutherAI discord](https://discord.gg/eleutherai).
## Optional Extras
Extras dependencies can be installed via `pip install -e ".[NAME]"`
---

# TemplateAPI Usage Guide

The `TemplateAPI` class is a versatile superclass designed to facilitate the integration of various API-based language models into the lm-evaluation-harness framework. This guide will explain how to use and extend the `TemplateAPI` class to implement your own API models. If your API implements the OpenAI API, you can use the `local-completions` or the `local-chat-completions` (defined [here](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/models/openai_completions.py)) model types, which can also serve as examples of how to effectively subclass this template.
## Overview
The `TemplateAPI` class provides a template for creating API-based model implementations. It handles common functionalities such as:
- Tokenization (optional)
- Batch processing
- Caching
- Retrying failed requests
- Parsing API responses
To use this class, you typically need to subclass it and implement specific methods for your API.
## Key Methods to Implement
When subclassing `TemplateAPI`, you need to implement the following methods:
1. `_create_payload`: Creates the JSON payload for API requests.
2. `parse_logprobs`: Parses log probabilities from API responses.
3. `parse_generations`: Parses generated text from API responses.
4. `headers`: Returns the headers for the API request.
You may also need to override other methods or properties depending on your API's specific requirements.
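A minimal sketch of such a subclass is shown below. The method signatures, decorators, and attributes used here (e.g. `self.api_key`, `self.model`, `self._max_gen_toks`, and the import path) are illustrative assumptions and may not match the current `TemplateAPI` definition exactly; consult the `local-completions` implementation linked above for the authoritative version.
```python
from lm_eval.models.api_models import TemplateAPI  # assumed import path


class MyAPI(TemplateAPI):
    @property
    def headers(self):
        # Headers attached to every request; the API-key handling here is illustrative.
        return {"Authorization": f"Bearer {self.api_key}", "Content-Type": "application/json"}

    def _create_payload(self, messages, generate=True, gen_kwargs=None, **kwargs):
        # Build the JSON body your API expects.
        if generate:
            return {"model": self.model, "prompt": messages, "max_tokens": self._max_gen_toks}
        # For loglikelihood requests, ask the API to echo the prompt with logprobs.
        return {"model": self.model, "prompt": messages, "logprobs": 1, "echo": True, "max_tokens": 0}

    def parse_generations(self, outputs, **kwargs):
        # Extract the generated text string(s) from the raw API response(s).
        ...

    def parse_logprobs(self, outputs, tokens=None, ctxlen=None, **kwargs):
        # Extract per-token logprobs for the continuation portion of each prompt.
        ...
```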
> [!NOTE]
> Currently, loglikelihood and MCQ-based tasks (such as MMLU) are only supported for completion endpoints, not for chat-completion endpoints (those that expect a list of dicts). Completion APIs which support instruct-tuned models can be evaluated with the `--apply_chat_template` option in order to simultaneously evaluate models using a chat template format while still being able to access the model logits needed for loglikelihood-based tasks.
## TemplateAPI Arguments
When initializing a `TemplateAPI` instance or a subclass, you can provide several arguments to customize its behavior. Here's a detailed explanation of some important arguments:
- `model` or `pretrained` (str):
  - The name or identifier of the model to use.
  - `model` takes precedence over `pretrained` when both are provided.
- `base_url` (str):
  - The base URL for the API endpoint.
- `tokenizer` (str, optional):
  - The name or path of the tokenizer to use.
  - If not provided, it defaults to using the same tokenizer name as the model.
- `num_concurrent` (int):
  - Number of concurrent requests to make to the API.
  - Useful for APIs that support parallel processing.
  - Default is 1 (sequential processing).
- `tokenized_requests` (bool):
  - Determines whether the input is pre-tokenized. Defaults to `True`.
  - Requests can be sent in either tokenized form (`list[list[int]]`) or as text (`list[str]`, or `str` for batch_size=1).
  - For loglikelihood-based tasks, prompts require tokenization to calculate the context length. If `False`, prompts are decoded back to text before being sent to the API.
  - Not as important for `generate_until` tasks.
  - Ignored for chat-formatted inputs (`list[dict...]`) or if `tokenizer_backend` is None.
- `tokenizer_backend` (str, optional):
  - Required for loglikelihood-based or MCQ tasks.
  - Specifies the tokenizer library to use. Options are "tiktoken", "huggingface", or None.
  - Default is "huggingface".
- `max_length` (int, optional):
  - Maximum length of input + output.
  - Default is 2048.
- `max_retries` (int, optional):
  - Maximum number of retries for failed API requests.
  - Default is 3.
- `max_gen_toks` (int, optional):
  - Maximum number of tokens to generate in completion tasks.
  - Default is 256 or set in task yaml.
- `batch_size` (int or str, optional):
  - Number of requests to batch together (if the API supports batching).
  - Can be an integer or "auto" (which defaults to 1 for API models).
  - Default is 1.
- `seed` (int, optional):
  - Random seed for reproducibility.
  - Default is 1234.
- `add_bos_token` (bool, optional):
  - Whether to add the beginning-of-sequence token to inputs (when tokenizing).
  - Default is False.
- `custom_prefix_token_id` (int, optional):
  - Custom token ID to use as a prefix for inputs.
  - If not provided, uses the model's default BOS or EOS token (if `add_bos_token` is True).
When subclassing `TemplateAPI`, you can override these arguments in your `__init__` method to set default values specific to your API. You can also add additional (potentially user-specified) arguments as needed for your specific implementation.
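For instance, a subclass might pin API-specific defaults in its constructor, roughly as in the sketch below; the class name and endpoint URL are placeholders, and the exact superclass signature may differ.
```python
from lm_eval.models.api_models import TemplateAPI  # assumed import path


class MyCompletionsAPI(TemplateAPI):
    def __init__(
        self,
        base_url="https://api.example.com/v1/completions",  # placeholder endpoint
        tokenizer_backend="huggingface",
        **kwargs,
    ):
        # Fix the endpoint and tokenizer backend for this API, while still letting users
        # override other arguments (num_concurrent, max_retries, ...) via --model_args.
        super().__init__(base_url=base_url, tokenizer_backend=tokenizer_backend, **kwargs)
```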
## Example Implementation: OpenAI API
The `OpenAICompletionsAPI` and `OpenAIChatCompletion` classes ([defined here](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/models/openai_completions.py)) demonstrate how to implement API models using the `TemplateAPI` class. Here's a breakdown of the key components:
---

Welcome and thank you for your interest in the LM Evaluation Harness! We welcome contributions and feedback and appreciate your time spent with our library, and hope you find it useful!
We intend LM Evaluation Harness to be a broadly useful and
## Important Resources
There are several places information about LM Evaluation Harness is located:
...
- We occasionally use [GitHub Milestones](https://github.com/EleutherAI/lm-evaluation-harness/milestones) to track progress toward specific near-term version releases.
- We maintain a [Project Board](https://github.com/orgs/EleutherAI/projects/25) for tracking current work items and PRs, and for future roadmap items or feature requests.
- Further discussion and support conversations are located in the #lm-thunderdome channel of the [EleutherAI discord](https://discord.gg/eleutherai).
## Code Style
...
We use [pytest](https://docs.pytest.org/en/latest/) for running unit tests. All library unit tests can be run via:
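A minimal invocation from the repository root might look like the sketch below; the exact flags recommended in the contributing guide may differ.
```
python -m pytest tests/
```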
---

Welcome to the docs for the LM Evaluation Harness!
## Table of Contents
* To learn about the public interface of the library, as well as how to evaluate via the commandline or as integrated into an external library, see the [Interface](./interface.md).
* To learn how to add a new library, API, or model type to the library, as well as a quick explainer on the types of ways to evaluate an LM, see the [Model Guide](./model_guide.md).
* For an extended description of how to extend the library to new model classes served over an API, see the [API Guide](./API_guide.md).
* For a crash course on adding new tasks to the library, see our [New Task Guide](./new_task_guide.md).
* To learn more about pushing the limits of task configuration that the Eval Harness supports, see the [Task Configuration Guide](./task_guide.md).
# or any other option that preloads model onto device
try:
    self.model.to(self.device)
except ValueError:
    eval_logger.debug(
        "Failed to place model onto specified device. This may be because the model is quantized via `bitsandbytes` or `device_map` is provided. If the desired GPU is being used, this message is safe to ignore."
    )

# multigpu data-parallel support when launched with accelerate
if gpus > 1:
    if accelerator.num_processes > 1:
        if parallelize:
            eval_logger.warning(
                "You are both using a HF Accelerate `device_map` (`--model_args parallelize=True`) and launching via `accelerate launch`. This will attempt to do model and data parallelism depending on the resources available."
            )
        elif gpus > accelerator.num_processes:
            eval_logger.warning(
                "WARNING: The number of total system GPUs does not match the number of spawned processes. "
                "If you would like to use data parallelism, please launch the script "
                "with 'accelerate launch *script*'. "
                f"Current run will proceed with {accelerator.num_processes} devices."
            )
        # only DDP- and FSDP-style setups are supported for data parallelism
        assert (
            accelerator.distributed_type
            in [
                DistributedType.FSDP,
                DistributedType.MULTI_GPU,
                DistributedType.MULTI_NPU,
            ]
        ), "Unsupported distributed type provided. Only DDP and FSDP are supported."
        if self.accelerator.is_local_main_process:
            eval_logger.info(f"Using {gpus} devices with data parallelism")
    elif accelerator.num_processes == 1:
        # if we aren't launching via accelerate, ditch
        self._rank = 0
        self._world_size = 1
| [ceval](ceval/README.md) | Tasks that evaluate language understanding and reasoning in an educational context. | Chinese |
| [cmmlu](cmmlu/README.md) | Multi-subject multiple choice question tasks for comprehensive academic assessment. | Chinese |
| code_x_glue | Tasks that involve understanding and generating code across multiple programming languages. | Go, Java, JS, PHP, Python, Ruby |
| [commonsense_qa](commonsense_qa/README.md) | CommonsenseQA, a multiple-choice QA dataset for measuring commonsense knowledge. | English |
| [copal_id](copal_id/README.md) | Indonesian causal commonsense reasoning dataset that captures local nuances. | Indonesian |
| [coqa](coqa/README.md) | Conversational question answering tasks to test dialog understanding. | English |
| [crows_pairs](crows_pairs/README.md) | Tasks designed to test model biases in various sociodemographic groups. | English, French |
...
| [logiqa2](logiqa2/README.md) | Large-scale logical reasoning dataset adapted from the Chinese Civil Service Examination. | English, Chinese |
| [mathqa](mathqa/README.md) | Question answering tasks involving mathematical reasoning and problem-solving. | English |
| [mc_taco](mc_taco/README.md) | Question-answer pairs that require temporal commonsense comprehension. | English |
| [med_concepts_qa](med_concepts_qa/README.md) | Benchmark for evaluating LLMs on their abilities to interpret medical codes and distinguish between medical concepts. | English |
| medmcqa | Medical multiple choice questions assessing detailed medical knowledge. | English |
| medqa | Multiple choice question answering based on the United States Medical License Exams. | |
| [mutual](mutual/README.md) | A retrieval-based dataset for multi-turn dialogue reasoning. | English |
| [nq_open](nq_open/README.md) | Open domain question answering tasks based on the Natural Questions dataset. | English |
| [okapi/arc_multilingual](okapi/arc_multilingual/README.md) | Tasks that involve reading comprehension and information retrieval challenges. | Multiple (31 languages) **Machine Translated.** |
| [okapi/hellaswag_multilingual](okapi/hellaswag_multilingual/README.md) | Tasks that involve reading comprehension and information retrieval challenges. | Multiple (30 languages) **Machine Translated.** |
| okapi/mmlu_multilingual | Tasks that involve reading comprehension and information retrieval challenges. | Multiple (34 languages) **Machine Translated.** |
| [okapi/truthfulqa_multilingual](okapi/truthfulqa_multilingual/README.md) | Tasks that involve reading comprehension and information retrieval challenges. | Multiple (31 languages) **Machine Translated.** |
| [openbookqa](openbookqa/README.md) | Open-book question answering tasks that require external knowledge and reasoning. | English |
| [paloma](paloma/README.md) | Paloma is a comprehensive benchmark designed to evaluate open language models across a wide range of domains, ranging from niche artist communities to mental health forums on Reddit. | English |
| [paws-x](paws-x/README.md) | Paraphrase Adversaries from Word Scrambling, focusing on cross-lingual capabilities. | English, French, Spanish, German, Chinese, Japanese, Korean |
...
| [wmt2016](wmt2016/README.md) | Tasks from the WMT 2016 shared task, focusing on translation between multiple languages. | English, Czech, German, Finnish, Russian, Romanian, Turkish |
| [wsc273](wsc273/README.md) | The Winograd Schema Challenge, a test of commonsense reasoning and coreference resolution. | English |
| [xcopa](xcopa/README.md) | Cross-lingual Choice of Plausible Alternatives, testing reasoning in multiple languages. | Estonian, Haitian, Indonesian, Italian, Quechua, Swahili, Tamil, Thai, Turkish, Vietnamese, Chinese |
| [xnli](xnli/README.md) | Cross-Lingual Natural Language Inference to test understanding across different languages. | Arabic, Bulgarian, German, Greek, English, Spanish, French, Hindi, Russian, Swahili, Thai, Turkish, Urdu, Vietnamese, Chinese |
| [xnli_eu](xnli_eu/README.md) | Cross-lingual Natural Language Inference tasks in Basque. | Basque |
| [xstorycloze](xstorycloze/README.md) | Cross-lingual narrative understanding tasks to predict story endings in multiple languages. | Russian, Simplified Chinese, Spanish, Arabic, Hindi, Indonesian, Telugu, Swahili, Basque, Burmese |
| [xwinograd](xwinograd/README.md) | Cross-lingual Winograd schema tasks for coreference resolution in multiple languages. | English, French, Japanese, Portuguese, Russian, Chinese |