author={Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang and Tang, Eric and Thite, Anish and Wang, Ben and Wang, Kevin and Zou, Andy},
title={A framework for few-shot language model evaluation},
...
@@ -34,7 +34,7 @@ This project provides a unified framework to test generative language models on
- Evaluation with publicly available prompts ensures reproducibility and comparability between papers.
- Easy support for custom prompts and evaluation metrics.
The Language Model Evaluation Harness is the backend for 🤗 Hugging Face's popular [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), has been used in [hundreds of papers](https://scholar.google.com/scholar?oi=bibs&hl=en&authuser=2&cites=15052937328817631261,4097184744846514103,1520777361382155671,17476825572045927382,18443729326628441434,14801318227356878622,7890865700763267262,12854182577605049984,15641002901115500560,5104500764547628290), and is used internally by dozens of organizations including NVIDIA, Cohere, BigScience, BigCode, Nous Research, and Mosaic ML.
## Install
...
@@ -109,33 +109,45 @@ The full list of supported arguments are provided [here](./docs/interface.md), a
#### Multi-GPU Evaluation with Hugging Face `accelerate`
We support two main ways of using Hugging Face's [accelerate 🚀](https://github.com/huggingface/accelerate) library for multi-GPU evaluation.
To perform *data-parallel evaluation* (where each GPU loads a **separate full copy** of the model), we leverage the `accelerate` launcher as follows:
```
accelerate launch -m lm_eval --model hf \
    --tasks lambada_openai,arc_easy \
    --batch_size 16
```
(or via `accelerate launch --no-python lm_eval`).
For cases where your model can fit on a single GPU, this allows you to evaluate on K GPUs K times faster than on one.
**WARNING**: This setup does not work with FSDP model sharding, so in `accelerate config` FSDP must be disabled, or the NO_SHARD FSDP option must be used.
The second way of using `accelerate` for multi-GPU evaluation is when your model is *too large to fit on a single GPU.*
In this setting, run the library *outside of the `accelerate` launcher*, but pass `parallelize=True` to `--model_args` as follows:
```
lm_eval --model hf \
    --tasks lambada_openai,arc_easy \
    --model_args parallelize=True \
    --batch_size 16
```
This means that your model's weights will be split across all available GPUs.
For more advanced users or even larger models, we also allow for the following arguments when `parallelize=True` (see the example sketch after this list):
- `device_map_option`: How to split model weights across available GPUs. Defaults to `"auto"`.
- `max_memory_per_gpu`: the maximum GPU memory to use per GPU when loading the model.
- `max_cpu_memory`: the maximum amount of CPU memory to use when offloading model weights to RAM.
- `offload_folder`: a folder where model weights will be offloaded to disk if needed.
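For instance, these options can be supplied as comma-separated `key=value` pairs inside `--model_args` alongside `parallelize=True`. This is only a rough sketch: the model name, memory limit, and offload path below are illustrative placeholders, not recommended values.
```
lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-2.8b,parallelize=True,device_map_option=auto,max_memory_per_gpu=20GiB,offload_folder=./offload \
    --tasks lambada_openai,arc_easy \
    --batch_size 16
```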
These two options (`accelerate launch` and `parallelize=True`) are mutually exclusive.
### Tensor + Data Parallel and Optimized Inference with `vLLM`
We also support vLLM for faster inference on [supported model types](https://docs.vllm.ai/en/latest/models/supported_models.html), which is especially fast when splitting a model across multiple GPUs. For single-GPU or multi-GPU — tensor parallel, data parallel, or a combination of both — inference, for example:
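A sketch of such an invocation (the values in curly braces are placeholders for your model name, the number of GPUs per model copy, and the number of model replicas):
```
lm_eval --model vllm \
    --model_args pretrained={model_name},tensor_parallel_size={GPUs_per_model},dtype=auto,gpu_memory_utilization=0.8,data_parallel_size={model_replicas} \
    --tasks lambada_openai \
    --batch_size auto
```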
We also support using your own local inference server, as long as it mirrors the OpenAI Completions or ChatCompletions API.
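As a rough, hypothetical example (the model name, server URL, port, and the exact `--model_args` keys such as `base_url` may differ depending on your server and harness version):
```
lm_eval --model local-completions \
    --model_args model=facebook/opt-125m,base_url=http://localhost:8000/v1/completions \
    --tasks lambada_openai
```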
| Mamba | :heavy_check_mark: | `mamba_ssm` | [Mamba architecture Language Models via the `mamba_ssm` package](https://huggingface.co/state-spaces) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| Your local inference server! | :heavy_check_mark: | `local-completions` (using the `openai-completions` model type) or `local-chat-completions` (using the `openai-chat-completions` model type) | Any server address that accepts GET requests using HF models and mirrors OpenAI's Completions or ChatCompletions interface | `generate_until` |
...
Models which do not supply logits or logprobs can be used with tasks of type `generate_until` only, while models that are local or APIs that supply logprobs/logits can be run on all task types: `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`.
It is on our roadmap to create task variants designed to enable models which do not serve logprobs/loglikelihoods to be compared against the generation performance of open-source models.
For more information on the different task `output_types` and model request types, see [our documentation](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/model_guide.md#interface).
### Other Frameworks
A number of other libraries contain scripts for calling the eval harness through their library. These include [GPT-NeoX](https://github.com/EleutherAI/gpt-neox/blob/main/eval_tasks/eval_adapter.py), [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples/MoE/readme_evalharness.md), and [mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/blob/master/eval_harness.py).
To create your own custom integration you can follow instructions from [this tutorial](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md#external-library-usage).
### Additional Features
If you have a Metal compatible Mac, you can run the eval harness using the MPS back-end by replacing `--device cuda:0` with `--device mps` (requires PyTorch version 2.1 or higher).
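For example (the model and task here are only illustrative):
```
lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks lambada_openai \
    --device mps
```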
...
@@ -219,11 +236,11 @@ lm_eval --model hf \
    --device cuda:0
```
[GPTQ](https://github.com/PanQiWei/AutoGPTQ) quantized models can be loaded by specifying their file names in `,autogptq=NAME` (or `,autogptq=True` for default names) in the `model_args` argument:
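For example, something along these lines (the model path and weights file name are placeholders, and `gptq_use_triton` is an optional loader flag):
```
lm_eval --model hf \
    --model_args pretrained=model-name-or-path,autogptq=model.safetensors,gptq_use_triton=True \
    --tasks hellaswag
```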
...
@@ -239,7 +256,7 @@ For a full list of supported arguments, check out the [interface](https://github
You can use [Zeno](https://zenoml.com) to visualize the results of your eval harness runs.
First, head to [hub.zenoml.com](https://hub.zenoml.com) to create an account and get an API key [on your account page](https://hub.zenoml.com/account).
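A sketch of the overall workflow, assuming the `zeno_visualize.py` script in the repository's `scripts/` directory, the `[zeno]` optional dependency group, and a `ZENO_API_KEY` environment variable (model, tasks, output path, and project name are placeholders):
```
pip install lm_eval[zeno]
export ZENO_API_KEY=<your API key>
lm_eval --model hf \
    --model_args pretrained=EleutherAI/gpt-j-6B \
    --tasks hellaswag \
    --log_samples \
    --output_path output/gpt-j-6B
python scripts/zeno_visualize.py \
    --data_path output \
    --project_name "Eval Harness Visualization"
```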
Alternatively, if you have previously downloaded a dataset from huggingface hub (using `save_to_disk()`) and wish to use the local files, you will need to use `data_dir` under `dataset_kwargs` to point to where the directory is.
```
dataset_path: hellaswag
dataset_kwargs:
  data_dir: hellaswag_local/
```
You can also set `dataset_path` as a directory path in your local system. This will assume that there is a loading script with the same name as the directory. [See datasets docs](https://huggingface.co/docs/datasets/loading#local-loading-script).
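For example, a minimal sketch, assuming a hypothetical local directory `hellaswag_local/` that contains a loading script named `hellaswag_local.py`:
```
dataset_path: ./hellaswag_local
```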
## Writing a Prompt Template
The next thing we need to do is decide what format to use when presenting the data to the LM. This is our **prompt**, where we'll define both an input and output format.
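For example, a minimal Jinja-templated sketch, assuming (hypothetically) that each document has `question` and `answer` fields:
```yaml
doc_to_text: "{{question}}\nAnswer:"
doc_to_target: "{{answer}}"
```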
...
@@ -4,7 +4,7 @@ The `lm-evaluation-harness` is meant to be an extensible and flexible framework
These YAML configuration files, along with the current codebase commit hash, are intended to be shareable such that providing the YAML config enables another researcher to precisely replicate the evaluation setup used by another, in the case that the prompt or setup differs from standard `lm-eval` task implementations.
While adding a standard evaluation task on a new dataset can occasionally be as simple as swapping out a Hugging Face dataset path in an existing file, more specialized evaluation setups also exist. Here we'll provide a crash course on the more advanced logic that users can implement in YAML form.
If your intended task relies on features beyond what is described in this guide, we'd love to hear about it! Feel free to open an issue describing the scenario on GitHub, create a PR to the project with a proposed implementation, or ask in the `#lm-thunderdome` channel on the EleutherAI Discord.
...
@@ -301,6 +301,23 @@ task:
  - hendrycksTest*
```
It is also possible to list an existing task in your benchmark configuration with some adjustments. For example, a few tasks from `mmlu` are included in `multimedqa`. There, the `task_alias` and `group_alias` (see [here](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#beautifying-table-display) for more details) are modified to suit the benchmark.
```yaml
group: multimedqa
task:
  - pubmedqa
  - medmcqa
  - medqa_4options
  - task: mmlu_anatomy
    task_alias: "anatomy (mmlu)"
    group_alias: null
  - task: mmlu_clinical_knowledge
    task_alias: "clinical_knowledge (mmlu)"
    group_alias: null
  ...
```
Alternatively, a benchmark can customize each of its tasks individually. These entries are defined the same way a YAML task is usually configured.
```yaml
...
@@ -363,4 +380,4 @@ task:
ignore_punctuation: true
```
Calling the benchmark is done the same way we would call any task with `--tasks`. Benchmarks can be added in `lm_eval/tasks/benchmarks/`.
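For example (the model choice is illustrative), the `multimedqa` grouping defined above can be invoked like any other task:
```
lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-1.4b \
    --tasks multimedqa
```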
f'Both target_delimiter "{self.config.target_delimiter}" and target choice: "{choice}" do not have whitespace, ignore if the language you are evaluating on does not require/use whitespace'
# or any other option that preloads model onto device
try:
    self.model.to(self.device)
except ValueError:
    eval_logger.debug(
        "Failed to place model onto specified device. This may be because the model is quantized via `bitsandbytes` or `device_map` is provided. If the desired GPU is being used, this message is safe to ignore."
    )
self._create_tokenizer(
...
@@ -456,12 +459,24 @@ class HFLM(LM):
if parallelize:
    model_kwargs.update(
        _get_accelerate_args(
            device_map_option,  # TODO: phase out device_map_option?
            max_memory_per_gpu,
            max_cpu_memory,
            offload_folder,
        )
    )
elif "device_map" not in model_kwargs:
    # set a device_map to initialize model on the right GPU.
    # this is needed because it seems that the default behavior
    # for quantized models now seems to be device_map="auto"