Unverified commit 7158f4f4, authored by Kiersten Stokes, committed by GitHub

Add Markdown linter (#2818)

* Add markdown linter to pre-commit hooks

* Reformat existing markdown (excluding lm_eval/tasks/*.md)
parent c73b43f4
@@ -46,6 +46,12 @@ repos:
.*\.json|ignore.txt|lm_eval/tasks/.*|.*yaml|.*\.ipynb
)$
args: [--check-filenames, --check-hidden, --ignore-words=ignore.txt]
- repo: https://github.com/jackdewinter/pymarkdown
rev: v0.9.29
hooks:
- id: pymarkdown
exclude: ^lm_eval/tasks/
args: [fix, -r]
# - repo: https://github.com/pre-commit/mirrors-mypy
# rev: v1.5.1
# hooks:
...
@@ -4,7 +4,8 @@
---

## Latest News 📣

- [2025/03] Added support for steering HF models!
- [2025/02] Added [SGLang](https://docs.sglang.ai/) support!
- [2024/09] We are prototyping allowing users of LM Evaluation Harness to create and evaluate on text+image multimodal input, text output tasks, and have just added the `hf-multimodal` and `vllm-vlm` model types and `mmmu` task as a prototype feature. We welcome users to try out this in-progress feature and stress-test it for themselves, and suggest they check out [`lmms-eval`](https://github.com/EvolvingLMMs-Lab/lmms-eval), a wonderful project originally forking off of the lm-evaluation-harness, for a broader range of multimodal tasks, models, and features.
@@ -14,6 +15,7 @@
---

## Announcement

**A new v0.4.0 release of lm-evaluation-harness is available!**

New updates and features include:
@@ -39,6 +41,7 @@ Development will be continuing on the `main` branch, and we encourage you to giv
This project provides a unified framework to test generative language models on a large number of different evaluation tasks.

**Features:**

- Over 60 standard academic benchmarks for LLMs, with hundreds of subtasks and variants implemented.
- Support for models loaded via [transformers](https://github.com/huggingface/transformers/) (including quantization via [GPTQModel](https://github.com/ModelCloud/GPTQModel) and [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)), [GPT-NeoX](https://github.com/EleutherAI/gpt-neox), and [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/), with a flexible tokenization-agnostic interface.
- Support for fast and memory-efficient inference with [vLLM](https://github.com/vllm-project/vllm).
@@ -63,6 +66,7 @@ pip install -e .
We also provide a number of optional dependencies for extended functionality. A detailed table is available at the end of this document.

## Basic Usage

### User Guide

A user guide detailing the full list of supported arguments is provided [here](./docs/interface.md), and on the terminal by calling `lm_eval -h`. Alternatively, you can use `lm-eval` instead of `lm_eval`.
@@ -112,11 +116,12 @@ We support three main ways of using Hugging Face's [accelerate 🚀](https://git
To perform *data-parallel evaluation* (where each GPU loads a **separate full copy** of the model), we leverage the `accelerate` launcher as follows:

```bash
accelerate launch -m lm_eval --model hf \
    --tasks lambada_openai,arc_easy \
    --batch_size 16
```

(or via `accelerate launch --no-python lm_eval`).

For cases where your model can fit on a single GPU, this allows you to evaluate on K GPUs K times faster than on one.
@@ -127,7 +132,7 @@ The second way of using `accelerate` for multi-GPU evaluation is when your model
In this setting, run the library *outside the `accelerate` launcher*, but pass `parallelize=True` to `--model_args` as follows:

```bash
lm_eval --model hf \
    --tasks lambada_openai,arc_easy \
    --model_args parallelize=True \
@@ -137,6 +142,7 @@ lm_eval --model hf \
This means that your model's weights will be split across all available GPUs.

For more advanced users or even larger models, we allow for the following arguments when `parallelize=True` as well (an example invocation follows the list):

- `device_map_option`: How to split model weights across available GPUs. Defaults to `"auto"`.
- `max_memory_per_gpu`: The max GPU memory to use per GPU when loading the model.
- `max_cpu_memory`: The max amount of CPU memory to use when offloading the model weights to RAM.
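For instance, a sharded run combining these options might look like the sketch below; the model name and the memory limits are illustrative placeholders rather than recommendations.

```bash
lm_eval --model hf \
    --model_args pretrained=EleutherAI/gpt-j-6b,parallelize=True,device_map_option=auto,max_memory_per_gpu=38GiB,max_cpu_memory=100GiB \
    --tasks lambada_openai \
    --batch_size 8
```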
@@ -144,7 +150,7 @@ For more advanced users or even larger models, we allow for the following argume
The third option is to use both at the same time. This will allow you to take advantage of both data parallelism and model sharding, and is especially useful for models that are too large to fit on a single GPU.

```bash
accelerate launch --multi_gpu --num_processes {nb_of_copies_of_your_model} \
    -m lm_eval --model hf \
    --tasks lambada_openai,arc_easy \
@@ -194,6 +200,7 @@ pd.DataFrame({
```

Run the evaluation harness with steering vectors applied:

```bash
lm_eval --model steered \
    --model_args pretrained=EleutherAI/pythia-160m,steer_path=steer_config.pt \
@@ -211,6 +218,7 @@ To evaluate a `nemo` model, start by installing NeMo following [the documentatio
NeMo models can be obtained through the [NVIDIA NGC Catalog](https://catalog.ngc.nvidia.com/models) or on [NVIDIA's Hugging Face page](https://huggingface.co/nvidia). The [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo/tree/main/scripts/nlp_language_modeling) provides conversion scripts to convert the `hf` checkpoints of popular models such as Llama, Falcon, Mixtral, or MPT to `nemo`.

Run a `nemo` model on one GPU:

```bash
lm_eval --model nemo_lm \
    --model_args path=<path_to_nemo_model> \
@@ -220,7 +228,7 @@ lm_eval --model nemo_lm \
It is recommended to unpack the `nemo` model to avoid unpacking inside the Docker container, which may overflow disk space. You can do so by running:

```bash
mkdir MY_MODEL
tar -xvf MY_MODEL.nemo -C MY_MODEL
```
@@ -230,6 +238,7 @@ tar -xvf MY_MODEL.nemo -c MY_MODEL
By default, only one GPU is used, but we also support data replication or tensor/pipeline parallelism during evaluation on a single node.

1) To enable data replication, set `devices` in `model_args` to the number of data replicas to run. For example, the command to run 8 data replicas over 8 GPUs is:

```bash
torchrun --nproc-per-node=8 --no-python lm_eval \
    --model nemo_lm \
@@ -238,7 +247,8 @@ torchrun --nproc-per-node=8 --no-python lm_eval \
    --batch_size 32
```
2) To enable tensor and/or pipeline parallelism, set `tensor_model_parallel_size` and/or `pipeline_model_parallel_size` in `model_args`. In addition, set `devices` equal to the product of `tensor_model_parallel_size` and `pipeline_model_parallel_size`. For example, the command to use one node of 4 GPUs with tensor parallelism of 2 and pipeline parallelism of 2 is:

```bash
torchrun --nproc-per-node=4 --no-python lm_eval \
    --model nemo_lm \
@@ -246,6 +256,7 @@ torchrun --nproc-per-node=4 --no-python lm_eval \
    --tasks hellaswag \
    --batch_size 32
```

Note that it is recommended to replace the `python` command with `torchrun --nproc-per-node=<number of devices> --no-python` to facilitate loading the model onto the GPUs. This is especially important for large checkpoints loaded onto multiple GPUs.

Not supported yet: multi-node evaluation and combinations of data replication with tensor or pipeline parallelism.
@@ -256,7 +267,7 @@ Pipeline parallelism during evaluation is supported with OpenVINO models
To enable pipeline parallelism, set `pipeline_parallel` in `model_args`. In addition, set `device` to `HETERO:<GPU index1>,<GPU index2>`, for example `HETERO:GPU.1,GPU.0`. The command to use pipeline parallelism of 2 is:

```bash
lm_eval --model openvino \
    --tasks wikitext \
    --model_args pretrained=<path_to_ov_model>,pipeline_parallel=True \
@@ -273,6 +284,7 @@ lm_eval --model vllm \
    --tasks lambada_openai \
    --batch_size auto
```

To use vLLM, run `pip install lm_eval[vllm]`. For a full list of supported vLLM configurations, please reference our [vLLM integration](https://github.com/EleutherAI/lm-evaluation-harness/blob/e74ec966556253fbe3d8ecba9de675c77c075bce/lm_eval/models/vllm_causallms.py) and the vLLM documentation.

vLLM occasionally differs in output from Hugging Face. We treat Hugging Face as the reference implementation and provide a [script](./scripts/model_comparator.py) for checking the validity of vLLM results against HF.
@@ -293,14 +305,17 @@ To use SGLang as the evaluation backend, please **install it in advance** via SG
> Due to the installation method of [`Flashinfer`](https://docs.flashinfer.ai/), a fast attention kernel library, we don't include the dependencies of `SGLang` within [pyproject.toml](pyproject.toml). Note that `Flashinfer` also has some requirements on the `torch` version.

SGLang's server arguments differ slightly from those of other backends; see [here](https://docs.sglang.ai/backend/server_arguments.html) for more information. We provide an example of the usage here:

```bash
lm_eval --model sglang \
    --model_args pretrained={model_name},dp_size={data_parallel_size},tp_size={tensor_parallel_size},dtype=auto \
    --tasks gsm8k_cot \
    --batch_size auto
```
> [!Tip]
> When encountering out-of-memory (OOM) errors (especially for multiple-choice tasks), try these solutions:
>
> 1. Use a manual `batch_size`, rather than `auto`.
> 2. Lower KV cache pool memory usage by adjusting `mem_fraction_static`, e.g., add it to your model arguments: `--model_args pretrained=...,mem_fraction_static=0.7`.
> 3. Increase the tensor parallel size `tp_size` (if using multiple GPUs).
@@ -323,6 +338,7 @@ We also support using your own local inference server with servers that mirror t
```bash
lm_eval --model local-completions --tasks gsm8k --model_args model=facebook/opt-125m,base_url=http://{yourip}:8000/v1/completions,num_concurrent=1,max_retries=3,tokenized_requests=False,batch_size=16
```

Note that for externally hosted models, configs such as `--device`, which relate to where to place a local model, should not be used and do not function. Just like you can use `--model_args` to pass arbitrary arguments to the model constructor for local models, you can use it to pass arbitrary arguments to the model API for hosted models. See the documentation of the hosting service for information on what arguments they support.
| API or Inference Server | Implemented? | `--model <xxx>` name | Models supported: | Request Types: |
@@ -352,7 +368,6 @@ For more information on the different task `output_types` and model request type
> [!Note]
> For best performance with closed chat model APIs such as Anthropic Claude 3 and GPT-4, we recommend carefully looking at a few sample outputs using `--limit 10` first to confirm answer extraction and scoring on generative tasks is performing as expected. Providing `system="<some system prompt here>"` within `--model_args` for anthropic-chat-completions, to instruct the model what format to respond in, may be useful.

### Other Frameworks

A number of other libraries contain scripts for calling the eval harness through their library. These include [GPT-NeoX](https://github.com/EleutherAI/gpt-neox/blob/main/eval_tasks/eval_adapter.py), [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples/MoE/readme_evalharness.md), and [mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/blob/master/eval_harness.py).
@@ -360,6 +375,7 @@ A number of other libraries contain scripts for calling the eval harness through
To create your own custom integration, you can follow the instructions in [this tutorial](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md#external-library-usage).

### Additional Features

> [!Note]
> For tasks unsuitable for direct evaluation (either due to risks associated with executing untrusted code or complexities in the evaluation process), the `--predict_only` flag is available to obtain decoded generations for post-hoc evaluation.
@@ -367,6 +383,7 @@ If you have a Metal compatible Mac, you can run the eval harness using the MPS b
> [!Note]
> You can inspect what the LM inputs look like by running the following command:
>
> ```bash
> python write_out.py \
>     --tasks <task1,task2,...> \
@@ -374,6 +391,7 @@ If you have a Metal compatible Mac, you can run the eval harness using the MPS b
>     --num_examples 10 \
>     --output_base_path /path/to/output/folder
> ```
>
> This will write out one text file for each task.

To verify the data integrity of the tasks you're performing in addition to running the tasks themselves, you can use the `--check_integrity` flag:
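For example, a hedged sketch of such a run (the model and task here are illustrative placeholders):

```bash
lm_eval --model hf \
    --model_args pretrained=EleutherAI/gpt-j-6b \
    --tasks sciq \
    --check_integrity
```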
@@ -388,6 +406,7 @@ lm_eval --model openai \
## Advanced Usage Tips

For models loaded with the HuggingFace `transformers` library, any arguments provided via `--model_args` get passed to the relevant constructor directly. This means that anything you can do with `AutoModel` can be done with our library. For example, you can pass a local path via `pretrained=` or use models finetuned with [PEFT](https://github.com/huggingface/peft) by taking the call you would run to evaluate the base model and adding `,peft=PATH` to the `model_args` argument:

```bash
lm_eval --model hf \
    --model_args pretrained=EleutherAI/gpt-j-6b,parallelize=True,load_in_4bit=True,peft=nomic-ai/gpt4all-j-lora \
@@ -396,6 +415,7 @@ lm_eval --model hf \
```

Models provided as delta weights can be easily loaded using the Hugging Face transformers library. Within `--model_args`, set the `delta` argument to specify the delta weights, and use the `pretrained` argument to designate the relative base model to which they will be applied:

```bash
lm_eval --model hf \
    --model_args pretrained=Ejafa/llama_7B,delta=lmsys/vicuna-7b-delta-v1.1 \
@@ -405,6 +425,7 @@ lm_eval --model hf \
GPTQ quantized models can be loaded using [GPTQModel](https://github.com/ModelCloud/GPTQModel) (faster) or [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ).

GPTQModel: add `,gptqmodel=True` to `model_args`:

```bash
lm_eval --model hf \
    --model_args pretrained=model-name-or-path,gptqmodel=True \
@@ -412,6 +433,7 @@ lm_eval --model hf \
```

AutoGPTQ: add `,autogptq=True` to `model_args`:

```bash
lm_eval --model hf \
    --model_args pretrained=model-name-or-path,autogptq=model.safetensors,gptq_use_triton=True \
@@ -439,6 +461,7 @@ lm_eval --model hf \
```

This allows you to easily download the results and samples from the Hub, using:

```python
from datasets import load_dataset
@@ -536,6 +559,7 @@ For more information on the library and how everything fits together, check out
To implement a new task in the eval harness, see [this guide](./docs/new_task_guide.md).

In general, we follow this priority list for addressing concerns about prompting and other eval details:

1. If there is widespread agreement among people who train LLMs, use the agreed-upon procedure.
2. If there is a clear and unambiguous official implementation, use that procedure.
3. If there is widespread agreement among people who evaluate LLMs, use the agreed-upon procedure.
@@ -550,6 +574,7 @@ We try to prioritize agreement with the procedures used by other groups to decre
The best way to get support is to open an issue on this repo or join the [EleutherAI Discord server](https://discord.gg/eleutherai). The `#lm-thunderdome` channel is dedicated to developing this project and the `#release-discussion` channel is for receiving support for our releases. If you've used the library and have had a positive (or negative) experience, we'd love to hear from you!

## Optional Extras

Extras dependencies can be installed via `pip install -e ".[NAME]"`

| Name | Use |
@@ -586,7 +611,7 @@ Extras dependencies can be installed via `pip install -e ".[NAME]"`
## Cite as

```text
@misc{eval-harness,
  author = {Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang and Tang, Eric and Thite, Anish and Wang, Ben and Wang, Kevin and Zou, Andy},
  title = {A framework for few-shot language model evaluation},
...
@@ -28,31 +28,29 @@ You may also need to override other methods or properties depending on your API'
> [!NOTE]
> Currently, loglikelihood and MCQ-based tasks (such as MMLU) are only supported for completion endpoints, not for chat-completion endpoints (those that expect a list of dicts). Completion APIs which support instruct-tuned models can be evaluated with the `--apply_chat_template` option in order to simultaneously evaluate models using a chat template format while still being able to access the model logits needed for loglikelihood-based tasks; see the example below.
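A hedged sketch of such a run against a local OpenAI-compatible completions server (the URL, model name, and task are placeholders, not library defaults):

```bash
lm_eval --model local-completions \
    --model_args model=my-instruct-model,base_url=http://localhost:8000/v1/completions,tokenized_requests=False \
    --tasks mmlu \
    --apply_chat_template
```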
# TemplateAPI Usage Guide
## TemplateAPI Arguments

When initializing a `TemplateAPI` instance or a subclass, you can provide several arguments to customize its behavior. Here's a detailed explanation of some important arguments:

- `model` or `pretrained` (str):
  - The name or identifier of the model to use.
  - `model` takes precedence over `pretrained` when both are provided.
- `base_url` (str):
  - The base URL for the API endpoint.
- `tokenizer` (str, optional):
  - The name or path of the tokenizer to use.
  - If not provided, it defaults to using the same tokenizer name as the model.
- `num_concurrent` (int):
  - Number of concurrent requests to make to the API.
  - Useful for APIs that support parallel processing.
  - Default is 1 (sequential processing).
- `timeout` (int, optional):
  - Timeout for API requests in seconds.
  - Default is 30.
- `tokenized_requests` (bool):
  - Determines whether the input is pre-tokenized. Defaults to `True`.
@@ -71,8 +69,8 @@ When initializing a `TemplateAPI` instance or a subclass, you can provide severa
  - Default is 2048.
- `max_retries` (int, optional):
  - Maximum number of retries for failed API requests.
  - Default is 3.
- `max_gen_toks` (int, optional):
  - Maximum number of tokens to generate in completion tasks.
@@ -99,7 +97,6 @@ When initializing a `TemplateAPI` instance or a subclass, you can provide severa
  - Whether to validate the certificate of the API endpoint (if HTTPS).
  - Default is True.

Example usage:

```python
@@ -202,5 +199,5 @@ To implement your own API model:
## Best Practices

1. Use the `@register_model` decorator to register your model with the framework (and import it in `lm_eval/models/__init__.py`!); a registration sketch follows this list.
2. Use environment variables for sensitive information like API keys.
3. Properly handle batching and concurrent requests if supported by your API.
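A minimal registration sketch is shown below; the model name and class are placeholders, and the payload construction and response parsing that `TemplateAPI` requires (described above) are deliberately omitted, so this is not a drop-in implementation.

```python
# Hypothetical registration example; "my-api" and MyAPIModel are placeholders.
from lm_eval.api.registry import register_model
from lm_eval.models.api_models import TemplateAPI


@register_model("my-api")
class MyAPIModel(TemplateAPI):
    """Custom API model; payload creation and response parsing are omitted here."""

    ...
```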
@@ -29,7 +29,7 @@ in order to ensure linters and other checks will be run upon committing.
We use [pytest](https://docs.pytest.org/en/latest/) for running unit tests. All library unit tests can be run via:

```bash
python -m pytest --showlocals -s -vv -n=auto --ignore=tests/models/test_neuralmagic.py --ignore=tests/models/test_openvino.py
```
@@ -38,27 +38,30 @@ python -m pytest --showlocals -s -vv -n=auto --ignore=tests/models/test_neuralma
We ask that new contributors agree to a Contributor License Agreement affirming that EleutherAI has the rights to use your contribution to our library.
First-time pull requests will have a reply added by @CLAassistant containing instructions for how to confirm this, and we require it before merging your PR.

## Contribution Best Practices

We recommend a few best practices to make your contributions or reported errors easier to assist with.

**For Pull Requests:**

- PRs should be titled descriptively, and be opened with a brief description of the scope and intent of the new contribution.
- New features should have appropriate documentation added alongside them.
- Aim for code maintainability, and minimize code copying.
- If opening a task, try to share test results on the task using a publicly-available model, and if any public results are available on the task, compare to them.
**For Feature Requests:**

- Provide a short paragraph's worth of description. What is the feature you are requesting? What is its motivation, and an example use case of it? How does this differ from what is currently supported?

**For Bug Reports:**

- Provide a short description of the bug.
- Provide a *reproducible example*: what is the command you run with our library that results in this error? Have you tried any other steps to resolve it?
- Provide a *full error traceback* of the error that occurs, if applicable. A one-line error message or small screenshot snippet is unhelpful without the surrounding context.
- Note what version of the codebase you are using, and any specifics of your environment and setup that may be relevant.

**For Requesting New Tasks:**

- Provide a 1-2 sentence description of what the task is and what it evaluates.
- Provide a link to the paper introducing the task.
- Provide a link to where the dataset can be found.
@@ -70,6 +73,7 @@ We recommend a few best practices to make your contributions or reported errors
To quickly get started, we maintain a list of good first issues, which can be found [on our project board](https://github.com/orgs/EleutherAI/projects/25/views/8) or by [filtering GH Issues](https://github.com/EleutherAI/lm-evaluation-harness/issues?q=is%3Aopen+label%3A%22good+first+issue%22+label%3A%22help+wanted%22). These are typically smaller code changes or self-contained features which can be added without extensive familiarity with library internals, and we recommend new contributors consider taking a stab at one of these first if they are feeling uncertain where to begin.

There are a number of distinct ways to contribute to LM Evaluation Harness, and all are extremely helpful! A sampling of ways to contribute includes:

- **Implementing and verifying new evaluation tasks**: Is there a task you'd like to see LM Evaluation Harness support? Consider opening an issue requesting it, or helping add it! Verifying and cross-checking task implementations with their original versions is also a very valuable form of assistance in ensuring standardized evaluation.
- **Improving documentation**: Improvements to the documentation, or noting pain points / gaps in documentation, are helpful in order for us to improve the user experience of the library and the clarity and coverage of documentation.
- **Testing and devops**: We are very grateful for any assistance in adding tests for the library that can be run for new PRs, and other devops workflows.
...
# Chat Template Delimiter Handling Update

## Overview

This change modifies how delimiters are handled when applying chat templates in the request construction process for likelihood and multiple-choice based tasks. When `apply_chat_template` is set to `True`, the target delimiter is now set to an empty string instead of using the configured delimiter.

## Background

By default, the system uses a target delimiter (typically a whitespace " ") between the context and target text when constructing prompts. The full string is constructed as:

```text
doc_to_text(doc) + target_delimiter + doc_to_target(doc)
```

While this worked well for base models where we wanted the model to predict a single whitespace followed by the answer, chat models have their own formatting conventions that handle spacing differently.
## The Change

- When `apply_chat_template=True`, the target delimiter is now empty ("") instead of the default whitespace
- This prevents interference between chat template formatting and the default delimiter system
- Particularly important for multiple choice tasks where the template itself handles spacing

## Example

```text
# Before (with default delimiter " ")
<user>Question: What color is the sky?\nAnswer:<assistant> blue
...
@@ -13,6 +13,7 @@ python -m lm_eval \
```

## Background

Downstream evaluations test model generalization, and are less useful when test set data also exists in the training set, referred to as leakage or contamination.

Filtering your training set against the test set is a good first step; however, this isn't always possible, as in the case of a new benchmark or one that wasn't considered prior to model training. When training set filtering isn't possible, it is useful to measure the impact of test set leakage by detecting the contaminated test examples and producing a clean version of the benchmark.
@@ -20,9 +21,11 @@ Filtering your training set against the test set is a good first step, however t
The basis for our decontamination procedure can be found in Appendix C of "Language Models are Few-Shot Learners". OpenAI defined a test document as contaminated if any N-gram overlap existed with any training document. They used a range of N values between 8 and 13 depending on the dataset, while we just used 13 for simplicity.
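As an illustration only (this is not the harness implementation, which uses the bucketed pipeline described below), 13-gram overlap detection boils down to something like the following sketch:

```python
# Toy sketch of N-gram overlap contamination checking; the corpora here are placeholders.
def ngrams(tokens: list[str], n: int = 13) -> set[tuple[str, ...]]:
    """Return the set of all n-grams in a token sequence."""
    return {tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}


train_ngrams = ngrams("... training document text ...".split())  # placeholder training doc
test_tokens = "... evaluation document text ...".split()          # placeholder test doc
is_contaminated = bool(ngrams(test_tokens) & train_ngrams)        # any 13-gram overlap
```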
## Implementation

Contamination detection can be found in `lm_eval/decontaminate.py`, with supporting code in `lm_eval/decontamination/`.

`decontaminate.py` does the following:

1. Build dictionaries of all ngrams and their corresponding evaluation/document ids.
2. Scan through sorted files containing training set n-grams.
3. If a match is found, the corresponding evaluation/document combinations are marked as contaminated.
@@ -32,6 +35,7 @@ decontaminate.py does the following:
This is disabled by default for new tasks; to support decontamination on a task, override the `should_decontaminate` and `doc_to_decontamination_query` methods. For more details, see the [task guide](task_guide.md).

## Pile Ngram Generation

The relevant scripts can be found in `scripts/clean_training_data`, which also import from `lm_eval/decontamination/`.
@@ -52,6 +56,7 @@ python -m scripts/clean_training_data/generate_13_grams \
This took approximately 4 days for us. We had the time to wait, but this could be scaled out by doing partial Pile scans on multiple instances of this script and merging the relevant buckets. We fixed `PYTHONHASHSEED` to ensure reproducibility of bucket hashing in case you need to stop and start.

6. Sort the generated 13-grams.

```bash
python -m scripts/clean_training_data/sort_13_gram_buckets \
    -dir path/to/working/directory/output
...
@@ -47,8 +47,8 @@ This mode supports a number of command-line arguments, the details of which can
- `--system_instruction`: Specifies a system instruction string to prepend to the prompt.
- `--apply_chat_template`: This flag specifies whether to apply a chat template to the prompt. It can be used in the following ways:
  - `--apply_chat_template`: When used without an argument, applies the only available chat template to the prompt. For Hugging Face models, if no dedicated chat template exists, the default chat template will be applied.
  - `--apply_chat_template template_name`: If the model has multiple chat templates, apply the specified template to the prompt.

For Hugging Face models, the default chat template can be found in the [`default_chat_template`](https://github.com/huggingface/transformers/blob/fc35907f95459d7a6c5281dfadd680b6f7b620e3/src/transformers/tokenization_utils_base.py#L1912) property of the Transformers Tokenizer.
@@ -56,23 +56,23 @@ This mode supports a number of command-line arguments, the details of which can
- `--predict_only`: Generates the model outputs without computing metrics. Use with `--log_samples` to retrieve decoded results.
- `--seed`: Set seed for python's random, numpy and torch. Accepts a comma-separated list of 3 values for python's random, numpy, and torch seeds, respectively, or a single integer to set the same seed for all three. The values are either an integer or 'None' to not set the seed. Default is `0,1234,1234` (for backward compatibility). E.g., `--seed 0,None,8` sets `random.seed(0)` and `torch.manual_seed(8)`. Here numpy's seed is not set since the second value is `None`. E.g., `--seed 42` sets all three seeds to 42.
- `--wandb_args`: Tracks logging to Weights and Biases for evaluation runs and includes args passed to `wandb.init`, such as `project` and `job_type`. Full list [here](https://docs.wandb.ai/ref/python/init). E.g., `--wandb_args project=test-project,name=test-run`. Also allows for the passing of the step to log things at (passed to `wandb.run.log`), e.g., `--wandb_args step=123`.
- `--hf_hub_log_args`: Logs evaluation results to Hugging Face Hub. Accepts a string with the arguments separated by commas. Available arguments (a usage example follows this list):
  - `hub_results_org` - organization name on Hugging Face Hub, e.g., `EleutherAI`. If not provided, the results will be pushed to the owner of the Hugging Face token,
  - `hub_repo_name` - repository name on Hugging Face Hub (deprecated, `details_repo_name` and `results_repo_name` should be used instead), e.g., `lm-eval-results`,
  - `details_repo_name` - repository name on Hugging Face Hub to store details, e.g., `lm-eval-results`,
  - `results_repo_name` - repository name on Hugging Face Hub to store results, e.g., `lm-eval-results`,
  - `push_results_to_hub` - whether to push results to Hugging Face Hub, can be `True` or `False`,
  - `push_samples_to_hub` - whether to push samples results to Hugging Face Hub, can be `True` or `False`. Requires `--log_samples` to be set,
  - `public_repo` - whether the repository is public, can be `True` or `False`,
  - `leaderboard_url` - URL to the leaderboard, e.g., `https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard`.
  - `point_of_contact` - Point of contact for the results dataset, e.g., `yourname@example.com`.
  - `gated` - whether to gate the details dataset, can be `True` or `False`.
- `--metadata`: JSON string to pass to TaskConfig. Used for some tasks which require additional metadata to be passed for processing. E.g., `--metadata '{"key": "value"}'`.
## External Library Usage ## External Library Usage
......
...@@ -45,6 +45,7 @@ class MyCustomLM(LM): ...@@ -45,6 +45,7 @@ class MyCustomLM(LM):
#... #...
#... #...
``` ```
Where `Instance` is a dataclass defined in [`lm_eval.api.instance`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/api/instance.py) with property `args` of request-dependent type signature described below. Where `Instance` is a dataclass defined in [`lm_eval.api.instance`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/api/instance.py) with property `args` of request-dependent type signature described below.
We support three types of requests, consisting of different interactions / measurements with an autoregressive LM. We support three types of requests, consisting of different interactions / measurements with an autoregressive LM.
...@@ -65,15 +66,13 @@ All three request types take as input `requests` of type `list[Instance]` that h ...@@ -65,15 +66,13 @@ All three request types take as input `requests` of type `list[Instance]` that h
- This is used to evaluate *perplexity* on a data distribution. - This is used to evaluate *perplexity* on a data distribution.
- It should return `(ll,) : Tuple[float]` , a.k.a. solely the *loglikelihood* of producing each piece of text given no starting input. - It should return `(ll,) : Tuple[float]` , a.k.a. solely the *loglikelihood* of producing each piece of text given no starting input.
To allow a model to be evaluated on all types of tasks, you will need to implement these three types of measurements (note that `loglikelihood_rolling` is a special case of `loglikelihood`). For a reference implementation, check out `lm_eval/models/huggingface.py` ! Additionally, check out `lm_eval.api.model.TemplateLM` for a class that abstracts away some commonly used functions across LM subclasses, or see if your model would lend itself well to subclassing the `lm_eval.models.huggingface.HFLM` class and overriding just the initialization or a couple methods! To allow a model to be evaluated on all types of tasks, you will need to implement these three types of measurements (note that `loglikelihood_rolling` is a special case of `loglikelihood`). For a reference implementation, check out `lm_eval/models/huggingface.py` ! Additionally, check out `lm_eval.api.model.TemplateLM` for a class that abstracts away some commonly used functions across LM subclasses, or see if your model would lend itself well to subclassing the `lm_eval.models.huggingface.HFLM` class and overriding just the initialization or a couple methods!
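For orientation, here is a minimal sketch of such a subclass. It is illustrative only: the registry name, class name, and constant return values are made up, and the method signatures are assumed to match the abstract methods described above; a real implementation would tokenize inputs and call an actual model.

```python
# Minimal illustrative LM subclass -- placeholder logic, not a real backend.
from lm_eval.api.model import LM
from lm_eval.api.registry import register_model


@register_model("my-dummy-lm")  # hypothetical registry name
class MyDummyLM(LM):
    def loglikelihood(self, requests):
        # One (loglikelihood, is_greedy) pair per (context, continuation) request.
        return [(0.0, True) for _ in requests]

    def loglikelihood_rolling(self, requests):
        # One loglikelihood per full-text request, for perplexity-style tasks.
        return [0.0 for _ in requests]

    def generate_until(self, requests):
        # One generated string per (context, gen_kwargs) request.
        return ["" for _ in requests]
```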
**Tip: be careful of indexing in loglikelihood!** **Tip: be careful of indexing in loglikelihood!**
LMs take in tokens in position `[0 1 2 ... N]` and output a probability distribution for token position `N+1`. We provide a simplified graphic here, excerpted from `huggingface.py`: LMs take in tokens in position `[0 1 2 ... N]` and output a probability distribution for token position `N+1`. We provide a simplified graphic here, excerpted from `huggingface.py`:
``` ```text
# how this all works (illustrated on a causal decoder-only setup): # how this all works (illustrated on a causal decoder-only setup):
# CTX CONT # CTX CONT
# inp 0 1 2 3|4 5 6 7 8 9 <- last token is deleted by inp[:, :-1] # inp 0 1 2 3|4 5 6 7 8 9 <- last token is deleted by inp[:, :-1]
...@@ -162,7 +161,8 @@ class MyCustomLM(LM): ...@@ -162,7 +161,8 @@ class MyCustomLM(LM):
- `apply_chat_template` - `apply_chat_template`
- This method performs the bulk of the work required for chat-formatting. - This method performs the bulk of the work required for chat-formatting.
- As input, a `chat_history: List[Dict[str, str]]` is passed in. This is a transcript of a conversation of a form similar to - As input, a `chat_history: List[Dict[str, str]]` is passed in. This is a transcript of a conversation of a form similar to
```
```text
[ [
{"system": <user-provided system message such as "You are a helpful math-focused chatbot">}, {"system": <user-provided system message such as "You are a helpful math-focused chatbot">},
{"user": <task example - a few-shot example 'input'>} {"user": <task example - a few-shot example 'input'>}
...@@ -170,8 +170,9 @@ class MyCustomLM(LM): ...@@ -170,8 +170,9 @@ class MyCustomLM(LM):
# ... more few-shot examples, potentially # ... more few-shot examples, potentially
{"user": <test set query--response on which we will evaluate>}, {"user": <test set query--response on which we will evaluate>},
] ]
``` ```
which can then be converted into a string input.
which can then be converted into a string input.
- The output is a string representing this conversation that can be fed into the model. - The output is a string representing this conversation that can be fed into the model.
- For example, this consists of simply calling `tokenizer.apply_chat_template` for HFLM--see the implementation there for reference. - For example, this consists of simply calling `tokenizer.apply_chat_template` for HFLM--see the implementation there for reference.
- `tokenizer_name` - `tokenizer_name`
......
...@@ -27,10 +27,13 @@ To implement a new standard task, we'll need to write a YAML file which configur ...@@ -27,10 +27,13 @@ To implement a new standard task, we'll need to write a YAML file which configur
```sh ```sh
touch lm_eval/tasks/<dataset_name>/<my_new_task_name>.yaml touch lm_eval/tasks/<dataset_name>/<my_new_task_name>.yaml
``` ```
Or, copy the template subfolder we provide from `templates/new_yaml_task`: Or, copy the template subfolder we provide from `templates/new_yaml_task`:
```sh ```sh
cp -r templates/new_yaml_task lm_eval/tasks/ cp -r templates/new_yaml_task lm_eval/tasks/
``` ```
and rename the folders and YAML file(s) as desired. and rename the folders and YAML file(s) as desired.
### Selecting and configuring a dataset ### Selecting and configuring a dataset
...@@ -54,13 +57,17 @@ training_split: <split name of training set, or `null`> ...@@ -54,13 +57,17 @@ training_split: <split name of training set, or `null`>
validation_split: <split name of val. set, or `null`> validation_split: <split name of val. set, or `null`>
test_split: <split name of test set, or `null`> test_split: <split name of test set, or `null`>
``` ```
Tests will run on the `test_split` if it is available, and otherwise evaluate on the `validation_split`. Tests will run on the `test_split` if it is available, and otherwise evaluate on the `validation_split`.
We can also specify from which split the task should retrieve few-shot examples via: We can also specify from which split the task should retrieve few-shot examples via:
```yaml ```yaml
fewshot_split: <split name to draw fewshot examples from, or `null`> fewshot_split: <split name to draw fewshot examples from, or `null`>
``` ```
or by hardcoding them, either using the following in the yaml file: or by hardcoding them, either using the following in the yaml file:
```yaml ```yaml
fewshot_config: fewshot_config:
sampler: first_n sampler: first_n
...@@ -69,24 +76,28 @@ fewshot_config: ...@@ -69,24 +76,28 @@ fewshot_config:
{<sample 2>}, {<sample 2>},
] ]
``` ```
or by adding the function `list_fewshot_samples` in the associated utils.py file: or by adding the function `list_fewshot_samples` in the associated utils.py file:
```python ```python
def list_fewshot_samples() -> list[dict]: def list_fewshot_samples() -> list[dict]:
return [{<sample 1>}, {<sample 2>}] return [{<sample 1>}, {<sample 2>}]
``` ```
See `lm_eval/tasks/minerva_math/minerva_math_algebra.yaml` for an example of the latter, and `lm_eval/tasks/gsm8k/gsm8k-cot.yaml` for an example of the former. See `lm_eval/tasks/minerva_math/minerva_math_algebra.yaml` for an example of the latter, and `lm_eval/tasks/gsm8k/gsm8k-cot.yaml` for an example of the former.
In this case, each sample must contain the same fields as the samples in the above sets--for example, if `doc_to_text` expects an `input` field when rendering input prompts, these provided samples must include an `input` key. In this case, each sample must contain the same fields as the samples in the above sets--for example, if `doc_to_text` expects an `input` field when rendering input prompts, these provided samples must include an `input` key.
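For instance, if a task's `doc_to_text` and `doc_to_target` expect `question` and `answer` fields (hypothetical names here), a hand-written few-shot pool might look like this sketch:

```python
# utils.py -- illustrative hand-written few-shot pool; field names must match
# whatever the task's prompt templates expect.
def list_fewshot_samples() -> list[dict]:
    return [
        {"question": "What is 2 + 2?", "answer": "4"},
        {"question": "What colour is a clear daytime sky?", "answer": "blue"},
    ]
```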
If neither of the above options is set, we will default to the train/validation/test sets, in that order. If neither of the above options is set, we will default to the train/validation/test sets, in that order.
Finally, our dataset may not be already in the exact format we want. Maybe we have to strip whitespace and special characters via a regex from our dataset's "question" field! Or maybe we just want to rename its columns to match a convention we'll be using for our prompts. Finally, our dataset may not be already in the exact format we want. Maybe we have to strip whitespace and special characters via a regex from our dataset's "question" field! Or maybe we just want to rename its columns to match a convention we'll be using for our prompts.
Let's create a python file in the directory where we're writing our YAML file: Let's create a python file in the directory where we're writing our YAML file:
```bash ```bash
touch lm_eval/tasks/<dataset_name>/utils.py touch lm_eval/tasks/<dataset_name>/utils.py
``` ```
Now, in `utils.py` we'll write a function to process each split of our dataset (the following example is drawn from [the `hellaswag` task](../lm_eval/tasks/hellaswag/utils.py)): Now, in `utils.py` we'll write a function to process each split of our dataset (the following example is drawn from [the `hellaswag` task](../lm_eval/tasks/hellaswag/utils.py)):
```python ```python
...@@ -104,6 +115,7 @@ def process_docs(dataset: datasets.Dataset) -> datasets.Dataset: ...@@ -104,6 +115,7 @@ def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
``` ```
Now, in our YAML config file we'll use the `!function` constructor, and tell the config where our imported Python function will come from. At runtime, before doing anything else we will preprocess our dataset according to this function! Now, in our YAML config file we'll use the `!function` constructor, and tell the config where our imported Python function will come from. At runtime, before doing anything else we will preprocess our dataset according to this function!
```yaml ```yaml
process_docs: !function utils.process_docs process_docs: !function utils.process_docs
``` ```
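If you are unsure what such a preprocessing function can look like, here is an illustrative sketch (not the linked `hellaswag` implementation; the column names are made up) that strips whitespace from a question column and renames a label column:

```python
# utils.py -- illustrative preprocessing; column names here are hypothetical.
import datasets


def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
    def _process_doc(doc):
        return {
            "question": doc["raw_question"].strip(),  # clean stray whitespace
            "answer": doc["label"],                   # rename to match our prompts
        }

    return dataset.map(_process_doc)
```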
...@@ -112,15 +124,16 @@ process_docs: !function utils.process_docs ...@@ -112,15 +124,16 @@ process_docs: !function utils.process_docs
To load a local dataset for evaluation, you can specify data files in the `dataset_kwargs` field, such as the following for JSON files: To load a local dataset for evaluation, you can specify data files in the `dataset_kwargs` field, such as the following for JSON files:
``` ```yaml
dataset_path: json dataset_path: json
dataset_name: null dataset_name: null
dataset_kwargs: dataset_kwargs:
data_files: /path/to/my/json data_files: /path/to/my/json
``` ```
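Under the hood this is roughly equivalent to the `datasets` call below, which can be handy for checking that your files load as expected before wiring them into a task (the path is the same placeholder as above):

```python
# Quick sanity check that the local file loads the way the task config will.
from datasets import load_dataset

ds = load_dataset("json", data_files="/path/to/my/json")  # placeholder path
print(ds)              # shows the resulting splits and column names
print(ds["train"][0])  # inspect the first record
```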
Or with files already split into separate directories: Or with files already split into separate directories:
``` ```yaml
dataset_path: arrow dataset_path: arrow
dataset_kwargs: dataset_kwargs:
data_files: data_files:
...@@ -130,7 +143,7 @@ dataset_kwargs: ...@@ -130,7 +143,7 @@ dataset_kwargs:
Alternatively, if you have previously downloaded a dataset from huggingface hub (using `save_to_disk()`) and wish to use the local files, you will need to use `data_dir` under `dataset_kwargs` to point to where the directory is. Alternatively, if you have previously downloaded a dataset from huggingface hub (using `save_to_disk()`) and wish to use the local files, you will need to use `data_dir` under `dataset_kwargs` to point to where the directory is.
``` ```yaml
dataset_path: hellaswag dataset_path: hellaswag
dataset_kwargs: dataset_kwargs:
data_dir: hellaswag_local/ data_dir: hellaswag_local/
...@@ -149,52 +162,59 @@ To write a prompt, users will use `doc_to_text`, `doc_to_target`, and `doc_to_ch ...@@ -149,52 +162,59 @@ To write a prompt, users will use `doc_to_text`, `doc_to_target`, and `doc_to_ch
### Basic prompts ### Basic prompts
If a dataset is straightforward enough, users can enter the feature name directly. This assumes that no preprocessing is required. For example, in [Swag](https://github.com/EleutherAI/lm-evaluation-harness/blob/1710b42d52d0f327cb0eb3cb1bfbbeca992836ca/lm_eval/tasks/swag/swag.yaml#L10-L11), `doc_to_text` and `doc_to_target` are each given the name of a dataset feature. If a dataset is straightforward enough, users can enter the feature name directly. This assumes that no preprocessing is required. For example, in [Swag](https://github.com/EleutherAI/lm-evaluation-harness/blob/1710b42d52d0f327cb0eb3cb1bfbbeca992836ca/lm_eval/tasks/swag/swag.yaml#L10-L11), `doc_to_text` and `doc_to_target` are each given the name of a dataset feature.
```yaml ```yaml
doc_to_text: startphrase doc_to_text: startphrase
doc_to_target: label doc_to_target: label
``` ```
Hard-coding is also possible as is the case in [SciQ](https://github.com/EleutherAI/lm-evaluation-harness/blob/1710b42d52d0f327cb0eb3cb1bfbbeca992836ca/lm_eval/tasks/sciq/sciq.yaml#L11). Hard-coding is also possible as is the case in [SciQ](https://github.com/EleutherAI/lm-evaluation-harness/blob/1710b42d52d0f327cb0eb3cb1bfbbeca992836ca/lm_eval/tasks/sciq/sciq.yaml#L11).
```yaml ```yaml
doc_to_target: 3 doc_to_target: 3
``` ```
`doc_to_choice` can be given a list of strings directly as the options (see [Toxigen](https://github.com/EleutherAI/lm-evaluation-harness/blob/1710b42d52d0f327cb0eb3cb1bfbbeca992836ca/lm_eval/tasks/toxigen/toxigen.yaml#L11)) `doc_to_choice` can be given a list of strings directly as the options (see [Toxigen](https://github.com/EleutherAI/lm-evaluation-harness/blob/1710b42d52d0f327cb0eb3cb1bfbbeca992836ca/lm_eval/tasks/toxigen/toxigen.yaml#L11))
```yaml ```yaml
doc_to_choice: ['No', 'Yes'] doc_to_choice: ['No', 'Yes']
``` ```
If a dataset feature is already a list, you can set the name of that feature as `doc_to_choice` (see [Hellaswag](https://github.com/EleutherAI/lm-evaluation-harness/blob/e0eda4d3ffa10e5f65e0976161cd134bec61983a/lm_eval/tasks/hellaswag/hellaswag.yaml#L13)) If a dataset feature is already a list, you can set the name of that feature as `doc_to_choice` (see [Hellaswag](https://github.com/EleutherAI/lm-evaluation-harness/blob/e0eda4d3ffa10e5f65e0976161cd134bec61983a/lm_eval/tasks/hellaswag/hellaswag.yaml#L13))
```
```yaml
doc_to_choice: choices doc_to_choice: choices
``` ```
### Writing a prompt with Jinja 2 ### Writing a prompt with Jinja 2
We support the [Jinja 2](https://jinja.palletsprojects.com/en/3.1.x/) templating language for writing prompts. In practice, this means you can take your dataset's columns and do many basic string manipulations to place each document into prompted format. We support the [Jinja 2](https://jinja.palletsprojects.com/en/3.1.x/) templating language for writing prompts. In practice, this means you can take your dataset's columns and do many basic string manipulations to place each document into prompted format.
Take for example the dataset `super_glue/boolq`. As input, we'd like to use the features `passage` and `question` and string them together so that for a sample line `doc`, the model sees something in the format of: Take for example the dataset `super_glue/boolq`. As input, we'd like to use the features `passage` and `question` and string them together so that for a sample line `doc`, the model sees something in the format of:
```
```text
doc["passage"] doc["passage"]
Question: doc["question"]? Question: doc["question"]?
Answer: Answer:
``` ```
We do this by [writing](https://github.com/EleutherAI/lm-evaluation-harness/blob/1710b42d52d0f327cb0eb3cb1bfbbeca992836ca/lm_eval/tasks/super_glue/boolq/default.yaml#L9C1-L9C61) We do this by [writing](https://github.com/EleutherAI/lm-evaluation-harness/blob/1710b42d52d0f327cb0eb3cb1bfbbeca992836ca/lm_eval/tasks/super_glue/boolq/default.yaml#L9C1-L9C61)
```yaml ```yaml
doc_to_text: "{{passage}}\nQuestion: {{question}}?\nAnswer:" doc_to_text: "{{passage}}\nQuestion: {{question}}?\nAnswer:"
``` ```
This way, `{{passage}}` will be replaced by `doc["passage"]` and `{{question}}` by `doc["question"]` when the prompt template is rendered. This way, `{{passage}}` will be replaced by `doc["passage"]` and `{{question}}` by `doc["question"]` when the prompt template is rendered.
Our intended output is for the model to predict a single whitespace, and then the answer to the question. We do this via: Our intended output is for the model to predict a single whitespace, and then the answer to the question. We do this via:
```yaml ```yaml
doc_to_target: "{{answer}}" doc_to_target: "{{answer}}"
``` ```
> [!WARNING] > [!WARNING]
> We add `target_delimiter` between input and target which defaults to " ", such that the full input-output string is `doc_to_text(doc) + target_delimiter + doc_to_target(doc)`. `doc_to_text` and `doc_to_target` should not contain trailing right or left whitespace, respectively. For multiple choice the target will be each choice index concatenated with the delimiter. > We add `target_delimiter` between input and target which defaults to " ", such that the full input-output string is `doc_to_text(doc) + target_delimiter + doc_to_target(doc)`. `doc_to_text` and `doc_to_target` should not contain trailing right or left whitespace, respectively. For multiple choice the target will be each choice index concatenated with the delimiter.
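To preview what the model will actually see, you can render the same templates with the `jinja2` package directly. The sketch below uses a made-up `doc` and mimics the `doc_to_text(doc) + target_delimiter + doc_to_target(doc)` concatenation described in the warning; the harness itself handles this (plus few-shot assembly) for you.

```python
# Rough illustration of prompt rendering -- not the harness's actual code path.
from jinja2 import Template

doc = {  # made-up example document
    "passage": "The aurora is caused by charged particles striking the atmosphere.",
    "question": "is the aurora caused by charged particles",
    "answer": "yes",
}

doc_to_text = Template("{{passage}}\nQuestion: {{question}}?\nAnswer:")
doc_to_target = Template("{{answer}}")
target_delimiter = " "

full_input = doc_to_text.render(**doc) + target_delimiter + doc_to_target.render(**doc)
print(full_input)
```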
#### Multiple choice format #### Multiple choice format
For tasks which are multiple choice (a fixed, finite set of label words per document) and evaluated by comparing the loglikelihoods of all label words (the `multiple_choice` task output type), we enforce a particular convention on prompt format. For tasks which are multiple choice (a fixed, finite set of label words per document) and evaluated by comparing the loglikelihoods of all label words (the `multiple_choice` task output type), we enforce a particular convention on prompt format.
...@@ -206,6 +226,7 @@ doc_to_text: "{{support.lstrip()}}\nQuestion: {{question}}\nAnswer:" # This is t ...@@ -206,6 +226,7 @@ doc_to_text: "{{support.lstrip()}}\nQuestion: {{question}}\nAnswer:" # This is t
doc_to_target: 3 # this contains the index into the answer choice list of the correct answer. doc_to_target: 3 # this contains the index into the answer choice list of the correct answer.
doc_to_choice: "{{[distractor1, distractor2, distractor3, correct_answer]}}" doc_to_choice: "{{[distractor1, distractor2, distractor3, correct_answer]}}"
``` ```
Task implementers are thus able to decide what the answer choices should be for a document, and what prompt format to use. Task implementers are thus able to decide what the answer choices should be for a document, and what prompt format to use.
The label index can also be sourced from a feature directly. For example, in `superglue/boolq`, the label index is defined in the feature `label`. We can set `doc_to_target` to simply `label`. The options or verbalizers can then be written as a list `["no", "yes"]` whose entries correspond to the label index. The label index can also be sourced from a feature directly. For example, in `superglue/boolq`, the label index is defined in the feature `label`. We can set `doc_to_target` to simply `label`. The options or verbalizers can then be written as a list `["no", "yes"]` whose entries correspond to the label index.
...@@ -221,7 +242,8 @@ doc_to_choice: ["no", "yes"] ...@@ -221,7 +242,8 @@ doc_to_choice: ["no", "yes"]
There may be cases where the prompt we want to implement is easier to express in Python than in Jinja 2. For this, we can use Python helper functions that are defined in the YAML config. Note that the helper script must be in the same directory as the YAML file. There may be cases where the prompt we want to implement is easier to express in Python than in Jinja 2. For this, we can use Python helper functions that are defined in the YAML config. Note that the helper script must be in the same directory as the YAML file.
A good example is WikiText, which requires a number of regex rules to clean the samples. A good example is WikiText, which requires a number of regex rules to clean the samples.
```
```python
def wikitext_detokenizer(doc): def wikitext_detokenizer(doc):
string = doc["page"] string = doc["page"]
# contractions # contractions
...@@ -234,7 +256,8 @@ def wikitext_detokenizer(doc): ...@@ -234,7 +256,8 @@ def wikitext_detokenizer(doc):
``` ```
We can load this function in `doc_to_target` by using the `!function` operator followed by `<file name>.<function name>`. In the file [wikitext.yaml](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/wikitext/wikitext.yaml) we write: We can load this function in `doc_to_target` by using the `!function` operator followed by `<file name>.<function name>`. In the file [wikitext.yaml](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/wikitext/wikitext.yaml) we write:
```
```yaml
doc_to_target: !function preprocess_wikitext.wikitext_detokenizer doc_to_target: !function preprocess_wikitext.wikitext_detokenizer
``` ```
...@@ -243,22 +266,24 @@ doc_to_target: !function preprocess_wikitext.wikitext_detokenizer ...@@ -243,22 +266,24 @@ doc_to_target: !function preprocess_wikitext.wikitext_detokenizer
[Promptsource](https://github.com/bigscience-workshop/promptsource/tree/main/promptsource) is a great repository for crowdsourced prompts for many datasets. We can load these prompts easily by using the `use_prompt` argument and filling it with the format `"promptsource:<name of prompt template>"`. To use this, `doc_to_text` and `doc_to_target` should be left undefined. This will fetch the template of the dataset defined in the YAML file. [Promptsource](https://github.com/bigscience-workshop/promptsource/tree/main/promptsource) is a great repository for crowdsourced prompts for many datasets. We can load these prompts easily by using the `use_prompt` argument and filling it with the format `"promptsource:<name of prompt template>"`. To use this, `doc_to_text` and `doc_to_target` should be left undefined. This will fetch the template of the dataset defined in the YAML file.
For example, for SuperGLUE BoolQ, if we want to use the prompt template `GPT-3 Style`, we can add this to the YAML file. For example, for SuperGLUE BoolQ, if we want to use the prompt template `GPT-3 Style`, we can add this to the YAML file.
```
```yaml
use_prompt: "promptsource:GPT-3 Style" use_prompt: "promptsource:GPT-3 Style"
``` ```
If you would like to run evaluation on all prompt templates, you can use a wildcard: If you would like to run evaluation on all prompt templates, you can use a wildcard:
```
```yaml
use_prompt: "promptsource:*" use_prompt: "promptsource:*"
``` ```
### Setting metrics ### Setting metrics
You're almost done! Now we need to choose how to score our task. You're almost done! Now we need to choose how to score our task.
- *If this is a multiple choice task:* do you just want to check your model's accuracy in choosing the correct answer choice? - *If this is a multiple choice task:* do you just want to check your model's accuracy in choosing the correct answer choice?
- *If this is a generation task:* do you just want to check how often your model outputs *exactly the ground-truth output string provided*? - *If this is a generation task:* do you just want to check how often your model outputs *exactly the ground-truth output string provided*?
If the answer to the above is no: you'll need to record what scoring metrics to use! Metrics can be listed in the following format: If the answer to the above is no: you'll need to record what scoring metrics to use! Metrics can be listed in the following format:
```yaml ```yaml
...@@ -270,6 +295,7 @@ metric_list: ...@@ -270,6 +295,7 @@ metric_list:
aggregation: ... aggregation: ...
higher_is_better: ... higher_is_better: ...
``` ```
`aggregation` and `higher_is_better` can optionally be left out to default to the manually-set defaults if using a natively supported metric; otherwise they must be defined explicitly (for example, when using a custom metric implemented as a function). `aggregation` and `higher_is_better` can optionally be left out to default to the manually-set defaults if using a natively supported metric; otherwise they must be defined explicitly (for example, when using a custom metric implemented as a function).
For a full list of natively supported metrics and aggregation functions see [`docs/task_guide.md`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md). All metrics supported in [HuggingFace Evaluate](https://github.com/huggingface/evaluate/tree/main/metrics) can also be used, and will be loaded if a given metric name is not one natively supported in `lm-eval` or `hf_evaluate` is set to `true`. For a full list of natively supported metrics and aggregation functions see [`docs/task_guide.md`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md). All metrics supported in [HuggingFace Evaluate](https://github.com/huggingface/evaluate/tree/main/metrics) can also be used, and will be loaded if a given metric name is not one natively supported in `lm-eval` or `hf_evaluate` is set to `true`.
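For a metric name that is not natively supported, the fallback is conceptually similar to loading the metric yourself with the `evaluate` package; a rough sketch (shown with `exact_match` purely as a familiar example of the API):

```python
# Rough sketch of how a Hugging Face Evaluate metric is used; extra kwargs from
# `metric_list` end up being forwarded to `compute` much like this.
import evaluate

metric = evaluate.load("exact_match")
scores = metric.compute(
    predictions=["the answer is 42", "Paris"],
    references=["the answer is 42", "paris"],
    ignore_case=True,
)
print(scores)  # {'exact_match': 1.0}
```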
...@@ -279,11 +305,12 @@ For a full list of natively supported metrics and aggregation functions see [`do ...@@ -279,11 +305,12 @@ For a full list of natively supported metrics and aggregation functions see [`do
Some tasks may require more advanced processing logic than is described in this guide. Some tasks may require more advanced processing logic than is described in this guide.
As a heuristic check: As a heuristic check:
* Does your task require generating multiple free-form outputs per input document?
* Does your task require complex, multi-step post-processing of generated model outputs? - Does your task require generating multiple free-form outputs per input document?
* Does your task require subsetting documents on the fly based on their content? - Does your task require complex, multi-step post-processing of generated model outputs?
* Do you expect to compute metrics after applying multiple such processing steps on your model outputs? - Does your task require subsetting documents on the fly based on their content?
* Does your task rely on metrics that need a custom implementation? - Do you expect to compute metrics after applying multiple such processing steps on your model outputs?
- Does your task rely on metrics that need a custom implementation?
For more detail on the task system and advanced features, see [`docs/task_guide.md`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md). If none of the above apply to your task, it's time to continue on to checking your task performance! For more detail on the task system and advanced features, see [`docs/task_guide.md`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md). If none of the above apply to your task, it's time to continue on to checking your task performance!
...@@ -296,6 +323,7 @@ If you're writing your YAML file inside the `lm_eval/tasks` folder, you just nee ...@@ -296,6 +323,7 @@ If you're writing your YAML file inside the `lm_eval/tasks` folder, you just nee
```yaml ```yaml
task: <name of the task> task: <name of the task>
``` ```
Including a task name is mandatory. Including a task name is mandatory.
It is often also convenient to label your task with several `tag` values, though this field is optional: It is often also convenient to label your task with several `tag` values, though this field is optional:
...@@ -305,8 +333,8 @@ tag: ...@@ -305,8 +333,8 @@ tag:
- tag1 - tag1
- tag2 - tag2
``` ```
This will add your task to the `tag1` and `tag2` tags, letting users see how to categorize your task and, if desired, run all tasks under one of these tags at once, your task included.
This will add your task to the `tag1` and `tag2` tags, letting users see how to categorize your task and, if desired, run all tasks under one of these tags at once, your task included.
If your task is not in the `lm_eval/tasks` folder, you'll need to tell the Eval Harness where to look for YAML files. If your task is not in the `lm_eval/tasks` folder, you'll need to tell the Eval Harness where to look for YAML files.
...@@ -318,7 +346,6 @@ task_manager = TaskManager(args.verbosity, include_path=args.include_path) ...@@ -318,7 +346,6 @@ task_manager = TaskManager(args.verbosity, include_path=args.include_path)
Passing `--tasks /path/to/yaml/file` is also accepted. Passing `--tasks /path/to/yaml/file` is also accepted.
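From Python, the same thing looks roughly like the sketch below (the model, paths, and task name are placeholders); `include_path` can also point at a directory containing several YAML files.

```python
# Rough sketch of running an out-of-tree task from Python; values are placeholders.
import lm_eval
from lm_eval.tasks import TaskManager

task_manager = TaskManager(include_path="/path/to/my/yaml/dir")

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["my_new_task"],  # hypothetical task name from your YAML
    num_fewshot=0,
    task_manager=task_manager,
)
print(results["results"])
```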
### Advanced Group Configs ### Advanced Group Configs
While `tag` values are helpful when you want to be able to quickly and conveniently run a set of related tasks via `--tasks my_tag_name`, often, we wish to implement more complex logic. For example, the MMLU benchmark contains 57 *subtasks* that must all be *averaged* together in order to report a final 'MMLU score'. While `tag` values are helpful when you want to be able to quickly and conveniently run a set of related tasks via `--tasks my_tag_name`, often, we wish to implement more complex logic. For example, the MMLU benchmark contains 57 *subtasks* that must all be *averaged* together in order to report a final 'MMLU score'.
...@@ -341,7 +368,6 @@ metadata: ...@@ -341,7 +368,6 @@ metadata:
This will behave almost identically to a `tag` that includes these 3 tasks, but with one key distinction: we'll print the `nli_tasks` group as a row (with no associated metrics) in our table of outputs, and visually show that these 3 tasks appear under its subheader. This will behave almost identically to a `tag` that includes these 3 tasks, but with one key distinction: we'll print the `nli_tasks` group as a row (with no associated metrics) in our table of outputs, and visually show that these 3 tasks appear under its subheader.
Now, let's assume we actually want to report an aggregate score for `nli_tasks`. We would instead use a YAML config like the following: Now, let's assume we actually want to report an aggregate score for `nli_tasks`. We would instead use a YAML config like the following:
```yaml ```yaml
...@@ -420,7 +446,7 @@ In this example, `recipe` is the custom argument for the `Unitxt` class. ...@@ -420,7 +446,7 @@ In this example, `recipe` is the custom argument for the `Unitxt` class.
To avoid conflicts, each task needs to be registered with a unique name. Because of this, slight variations of a task are still counted as unique tasks and need to be named uniquely. This can be done by appending a qualifier that refers to the variation, as in MMLU, where the templates used for the FLAN evaluation are differentiated from the default by the prefix `mmlu_flan_*`. Printing the full task names can easily clutter the results table at the end of the evaluation, especially when you have a long list of tasks or are using a benchmark that comprises many tasks. To make the table more legible, you can use `task_alias` and `group_alias` to provide an alternative task name and group name that will be printed. For example, in `mmlu_abstract_algebra.yaml` we set `task_alias` to `abstract_algebra`. In group configs, a `group_alias` for a group can also be set. To avoid conflicts, each task needs to be registered with a unique name. Because of this, slight variations of a task are still counted as unique tasks and need to be named uniquely. This can be done by appending a qualifier that refers to the variation, as in MMLU, where the templates used for the FLAN evaluation are differentiated from the default by the prefix `mmlu_flan_*`. Printing the full task names can easily clutter the results table at the end of the evaluation, especially when you have a long list of tasks or are using a benchmark that comprises many tasks. To make the table more legible, you can use `task_alias` and `group_alias` to provide an alternative task name and group name that will be printed. For example, in `mmlu_abstract_algebra.yaml` we set `task_alias` to `abstract_algebra`. In group configs, a `group_alias` for a group can also be set.
``` ```yaml
"dataset_name": "abstract_algebra" "dataset_name": "abstract_algebra"
"description": "The following are multiple choice questions (with answers) about abstract\ "description": "The following are multiple choice questions (with answers) about abstract\
\ algebra.\n\n" \ algebra.\n\n"
...@@ -451,7 +477,7 @@ One key feature in LM Evaluation Harness is the ability to version tasks and gro ...@@ -451,7 +477,7 @@ One key feature in LM Evaluation Harness is the ability to version tasks and gro
This version info can be provided by adding the following to your new task or group config file: This version info can be provided by adding the following to your new task or group config file:
``` ```yaml
metadata: metadata:
version: 0 version: 0
``` ```
...@@ -462,7 +488,7 @@ If you are incrementing a task's version, please also consider adding a changelo ...@@ -462,7 +488,7 @@ If you are incrementing a task's version, please also consider adding a changelo
for example, for example,
* \[Dec 25, 2023\] (PR #999) Version 0.0 -> 1.0: Fixed a bug with answer extraction that led to underestimated performance. - \[Dec 25, 2023\] (PR #999) Version 0.0 -> 1.0: Fixed a bug with answer extraction that led to underestimated performance.
## Checking performance + equivalence ## Checking performance + equivalence
...@@ -475,15 +501,16 @@ To enable this, we provide a checklist that should be completed when contributin ...@@ -475,15 +501,16 @@ To enable this, we provide a checklist that should be completed when contributin
The checklist is the following: The checklist is the following:
For adding novel benchmarks/datasets to the library: For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
- [ ] Is the task an existing benchmark in the literature?
- [ ] Have you referenced the original paper that introduced the task?
- [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported: If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates? - [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant? - [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
- [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
It is recommended to include a filled-out copy of this checklist in the README.md for the subfolder you are creating, if you have created a new subfolder in `lm_eval/tasks`. It is recommended to include a filled-out copy of this checklist in the README.md for the subfolder you are creating, if you have created a new subfolder in `lm_eval/tasks`.
......
...@@ -15,11 +15,13 @@ Tasks are configured via the `TaskConfig` object. Below, we describe all fields ...@@ -15,11 +15,13 @@ Tasks are configured via the `TaskConfig` object. Below, we describe all fields
### Parameters ### Parameters
Task naming + registration: Task naming + registration:
- **task** (`str`, defaults to None) — name of the task. - **task** (`str`, defaults to None) — name of the task.
- **task_alias** (`str`, defaults to None) - Alias of the task name that will be printed in the final results table. - **task_alias** (`str`, defaults to None) - Alias of the task name that will be printed in the final results table.
- **tag** (`str`, *optional*) — name of the tag(s) a task belongs to. Enables one to run all tasks with a specified tag name at once. - **tag** (`str`, *optional*) — name of the tag(s) a task belongs to. Enables one to run all tasks with a specified tag name at once.
Dataset configuration options: Dataset configuration options:
- **dataset_path** (`str`) — The name of the dataset as listed by HF in the datasets Hub. - **dataset_path** (`str`) — The name of the dataset as listed by HF in the datasets Hub.
- **dataset_name** (`str`, *optional*, defaults to None) — The name of what HF calls a “data instance” or sub-task of the benchmark. If your task does not contain any data instances, just leave this to default to None. (If you're familiar with the HF `datasets.load_dataset` function, these are just the first 2 arguments to it.) - **dataset_name** (`str`, *optional*, defaults to None) — The name of what HF calls a “data instance” or sub-task of the benchmark. If your task does not contain any data instances, just leave this to default to None. (If you're familiar with the HF `datasets.load_dataset` function, these are just the first 2 arguments to it.)
- **dataset_kwargs** (`dict`, *optional*) — Auxiliary arguments that `datasets.load_dataset` accepts. This can be used to specify arguments such as `data_files` or `data_dir` if you want to use local datafiles such as json or csv. - **dataset_kwargs** (`dict`, *optional*) — Auxiliary arguments that `datasets.load_dataset` accepts. This can be used to specify arguments such as `data_files` or `data_dir` if you want to use local datafiles such as json or csv.
...@@ -31,6 +33,7 @@ Dataset configuration options: ...@@ -31,6 +33,7 @@ Dataset configuration options:
- **process_docs** (`Callable`, *optional*) — Optionally define a function to apply to each HF dataset split, to preprocess all documents before being fed into prompt template rendering or other evaluation steps. Can be used to rename dataset columns, or to process documents into a format closer to that expected by a prompt template. - **process_docs** (`Callable`, *optional*) — Optionally define a function to apply to each HF dataset split, to preprocess all documents before being fed into prompt template rendering or other evaluation steps. Can be used to rename dataset columns, or to process documents into a format closer to that expected by a prompt template.
Prompting / in-context formatting options: Prompting / in-context formatting options:
- **use_prompt** (`str`, *optional*) — Name of the prompt in promptsource to use. If defined, it will override `doc_to_text`, `doc_to_target`, and `doc_to_choice`. - **use_prompt** (`str`, *optional*) — Name of the prompt in promptsource to use. If defined, it will override `doc_to_text`, `doc_to_target`, and `doc_to_choice`.
- **description** (`str`, *optional*) — An optional prepended Jinja2 template or string which will be prepended to the few-shot examples passed into the model, often describing the task or providing instructions to a model, such as `"The following are questions (with answers) about {{subject}}.\n\n"`. No delimiters or spacing are inserted between the description and the first few-shot example. - **description** (`str`, *optional*) — An optional prepended Jinja2 template or string which will be prepended to the few-shot examples passed into the model, often describing the task or providing instructions to a model, such as `"The following are questions (with answers) about {{subject}}.\n\n"`. No delimiters or spacing are inserted between the description and the first few-shot example.
- **doc_to_text** (`Union[Callable, str]`, *optional*) — Jinja2 template, string, or function to process a sample into the appropriate input for the model. - **doc_to_text** (`Union[Callable, str]`, *optional*) — Jinja2 template, string, or function to process a sample into the appropriate input for the model.
...@@ -41,10 +44,12 @@ Prompting / in-context formatting options: ...@@ -41,10 +44,12 @@ Prompting / in-context formatting options:
- **gen_prefix** (`str`, *optional*) — String to append after the <|assistant|> token. For example, if the task is to answer a question, the gen_prefix could be "The answer is: " to prompt the model to generate the answer. If not using a chat template, this string will be appended to the end of the prompt. - **gen_prefix** (`str`, *optional*) — String to append after the <|assistant|> token. For example, if the task is to answer a question, the gen_prefix could be "The answer is: " to prompt the model to generate the answer. If not using a chat template, this string will be appended to the end of the prompt.
Runtime configuration options: Runtime configuration options:
- **num_fewshot** (`int`, *optional*, defaults to 0) — Number of few-shot examples before the input. - **num_fewshot** (`int`, *optional*, defaults to 0) — Number of few-shot examples before the input.
- **batch_size** (`int`, *optional*, defaults to 1) — Batch size. - **batch_size** (`int`, *optional*, defaults to 1) — Batch size.
Scoring details: Scoring details:
- **metric_list** (`list`, *optional*, defaults to None) — A list of metrics to use for evaluation. See docs for expected format. - **metric_list** (`list`, *optional*, defaults to None) — A list of metrics to use for evaluation. See docs for expected format.
- **output_type** (`str`, *optional*, defaults to "generate_until") — Selects the type of model output for the given task. Options are `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`. - **output_type** (`str`, *optional*, defaults to "generate_until") — Selects the type of model output for the given task. Options are `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`.
- **generation_kwargs** (`dict`, *optional*) — Auxiliary arguments for the `generate` function from HF transformers library. Advanced keyword arguments may not be supported for non-HF LM classes. - **generation_kwargs** (`dict`, *optional*) — Auxiliary arguments for the `generate` function from HF transformers library. Advanced keyword arguments may not be supported for non-HF LM classes.
...@@ -54,6 +59,7 @@ Scoring details: ...@@ -54,6 +59,7 @@ Scoring details:
- **doc_to_decontamination_query** (`str`, *optional*) — Query for decontamination if `should_decontaminate` is True. If `should_decontaminate` is True but `doc_to_decontamination_query` is `None`, `doc_to_decontamination_query` will follow `doc_to_text`. - **doc_to_decontamination_query** (`str`, *optional*) — Query for decontamination if `should_decontaminate` is True. If `should_decontaminate` is True but `doc_to_decontamination_query` is `None`, `doc_to_decontamination_query` will follow `doc_to_text`.
Other: Other:
- **metadata** (`dict`, *optional*) — An optional field where arbitrary metadata can be passed. Most tasks should include a `version` key in this field that is used to denote the version of the yaml config. Other special metadata keys are: `num_fewshot`, to override the printed `n-shot` table column for a task. Will also be passed to the `custom_dataset` function if defined. - **metadata** (`dict`, *optional*) — An optional field where arbitrary metadata can be passed. Most tasks should include a `version` key in this field that is used to denote the version of the yaml config. Other special metadata keys are: `num_fewshot`, to override the printed `n-shot` table column for a task. Will also be passed to the `custom_dataset` function if defined.
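Putting a few of the fields above together: the same keys you would normally write in a task YAML map directly onto `TaskConfig`, so a rough (hypothetical) configuration sketch looks like this:

```python
# Rough sketch only -- in practice these fields usually live in the task YAML.
from lm_eval.api.task import TaskConfig

config = TaskConfig(
    task="demo_local_qa",  # hypothetical task name
    dataset_path="json",
    dataset_kwargs={"data_files": "/path/to/my/json"},  # placeholder path
    test_split="train",
    output_type="generate_until",
    doc_to_text="{{question}}\nAnswer:",
    doc_to_target="{{answer}}",
    metric_list=[{"metric": "exact_match", "aggregation": "mean", "higher_is_better": True}],
    metadata={"version": 0},
)
print(config.task, config.output_type)
```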
## Filters ## Filters
...@@ -70,10 +76,8 @@ We do such post-processing by operating on *responses*, which are stored after r ...@@ -70,10 +76,8 @@ We do such post-processing by operating on *responses*, which are stored after r
`resps` is a `List[str]` for each instance, and we pass a `List[List[<expected return type from model>]]` to our filters that is a list of `[instance.resps for instance in instances]`. `resps` is a `List[str]` for each instance, and we pass a `List[List[<expected return type from model>]]` to our filters that is a list of `[instance.resps for instance in instances]`.
Our filters, after completing a pipeline, must return a `List[<expected return type from model>]`, which we then unpack and store element-wise in `Instance.filtered_resps` for the corresponding instance. Thus, we take as input a list of responses from our model for each doc, and must return a single response *not wrapped in a list* for each doc. Our filters, after completing a pipeline, must return a `List[<expected return type from model>]`, which we then unpack and store element-wise in `Instance.filtered_resps` for the corresponding instance. Thus, we take as input a list of responses from our model for each doc, and must return a single response *not wrapped in a list* for each doc.
**End Aside** **End Aside**
A full list of supported filter operations can be found in `lm_eval/filters/__init__.py`. Contributions of new filter types are welcome! A full list of supported filter operations can be found in `lm_eval/filters/__init__.py`. Contributions of new filter types are welcome!
### Multiple Filter Pipelines ### Multiple Filter Pipelines
...@@ -112,6 +116,7 @@ filter_list: ...@@ -112,6 +116,7 @@ filter_list:
We are able to provide multiple different filter pipelines, each with their own name and list of filters to apply in sequence. We are able to provide multiple different filter pipelines, each with their own name and list of filters to apply in sequence.
Our first filter pipeline implements Our first filter pipeline implements
- applying a regex to the model generations (extracting the number within the phrase "The answer is (number)") - applying a regex to the model generations (extracting the number within the phrase "The answer is (number)")
- selecting only the first out of the 64 model answers - selecting only the first out of the 64 model answers
...@@ -126,6 +131,7 @@ Then scoring this single answer. ...@@ -126,6 +131,7 @@ Then scoring this single answer.
``` ```
Our second filter pipeline, "maj@64", does majority voting across all 64 answers via: Our second filter pipeline, "maj@64", does majority voting across all 64 answers via:
- applying the same regex to all responses, to get the numerical answer from the model for each of the 64 responses per problem - applying the same regex to all responses, to get the numerical answer from the model for each of the 64 responses per problem
- applying majority voting to all responses, which then returns a length-1 `[<majority answer>]` list for each - applying majority voting to all responses, which then returns a length-1 `[<majority answer>]` list for each
- taking the first element of this length-1 list, to then score the sole response `<majority answer>` for each document. - taking the first element of this length-1 list, to then score the sole response `<majority answer>` for each document.
...@@ -140,8 +146,10 @@ Our second filter pipeline, "maj@64", does majority voting across all 64 answers ...@@ -140,8 +146,10 @@ Our second filter pipeline, "maj@64", does majority voting across all 64 answers
``` ```
Our final filter pipeline, "maj@8", does majority voting across the first 8 of the model's responses per document via: Our final filter pipeline, "maj@8", does majority voting across the first 8 of the model's responses per document via:
- subsetting the len-64 list of responses `[answer1, answer2, ..., answer64]` to `[answer1, answer2, ..., answer8]` for each document - subsetting the len-64 list of responses `[answer1, answer2, ..., answer64]` to `[answer1, answer2, ..., answer8]` for each document
- performing the same sequence of filters on these new sets of 8 responses, for each document. - performing the same sequence of filters on these new sets of 8 responses, for each document.
```yaml ```yaml
- name: "maj@8" - name: "maj@8"
filter: filter:
...@@ -155,7 +163,6 @@ Our final filter pipeline, "maj@8", does majority voting across the first 8 of t ...@@ -155,7 +163,6 @@ Our final filter pipeline, "maj@8", does majority voting across the first 8 of t
Thus, given the 64 responses from our LM on each document, we can report metrics on these responses in these 3 different ways, as defined by our filter pipelines. Thus, given the 64 responses from our LM on each document, we can report metrics on these responses in these 3 different ways, as defined by our filter pipelines.
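Conceptually, what a majority-voting pipeline like "maj@64" does to one document's responses can be sketched in plain Python (the regex and responses below are made up; the harness applies the configured filters rather than this code):

```python
# Plain-Python sketch of "regex-extract -> majority vote -> take first".
import re
from collections import Counter

responses = [  # pretend these are an LM's sampled answers for one document
    "Let's think step by step. 3 + 4 = 7. The answer is 7",
    "The answer is 7",
    "3 + 4 + 1 = 8. The answer is 8",
]

def extract(resp: str) -> str:
    # Mirrors a regex-extraction step ("The answer is (number)").
    match = re.search(r"The answer is (-?\d+)", resp)
    return match.group(1) if match else "[invalid]"

extracted = [extract(r) for r in responses]           # ['7', '7', '8']
majority = Counter(extracted).most_common(1)[0][0]    # majority vote
print(majority)                                       # '7' is what gets scored
```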
### Adding a custom filter ### Adding a custom filter
Just as you can add a custom model with the `register_model` decorator, you can do the same with filters, for example: Just as you can add a custom model with the `register_model` decorator, you can do the same with filters, for example:
...@@ -169,11 +176,10 @@ class NewFilter(Filter) ...@@ -169,11 +176,10 @@ class NewFilter(Filter)
... ...
``` ```
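Filling in the elided pieces, a complete (if toy) custom filter might look like the sketch below. The filter name, class, and lowercasing behaviour are made up, and the `apply` signature is assumed to follow the built-in filters in `lm_eval/filters/`; once registered, it should be usable from a task's `filter_list` like any built-in filter.

```python
# Toy custom filter that lowercases every model response; names are illustrative.
from lm_eval.api.filter import Filter
from lm_eval.api.registry import register_filter


@register_filter("lowercase")
class LowercaseFilter(Filter):
    def apply(self, resps, docs):
        # `resps` holds, per document, the list of that document's model responses.
        return [[resp.lower() for resp in doc_resps] for doc_resps in resps]
```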
## Embedded Python Code ## Embedded Python Code
Users can use Python functions for certain arguments by using the `!function` operator after the argument name, followed by `<filename>.<pythonfunctionname>`. This feature can be used for the following arguments: Users can use Python functions for certain arguments by using the `!function` operator after the argument name, followed by `<filename>.<pythonfunctionname>`. This feature can be used for the following arguments:
1. `doc_to_text` 1. `doc_to_text`
2. `doc_to_target` 2. `doc_to_target`
3. `doc_to_choice` 3. `doc_to_choice`
...@@ -183,22 +189,22 @@ Use can use python functions for certain arguments by using the `!function` oper ...@@ -183,22 +189,22 @@ Use can use python functions for certain arguments by using the `!function` oper
The prior implementation method of new tasks was to subclass `Task`. While we intend to migrate all tasks to the new YAML implementation option going forward, it remains possible to subclass the Task class and implement custom logic. For more information, see `docs/task_guide.md` in v0.3.0 of the `lm-evaluation-harness`. The prior implementation method of new tasks was to subclass `Task`. While we intend to migrate all tasks to the new YAML implementation option going forward, it remains possible to subclass the Task class and implement custom logic. For more information, see `docs/task_guide.md` in v0.3.0 of the `lm-evaluation-harness`.
## Including a Base YAML ## Including a Base YAML
You can base a YAML on another YAML file as a template. This can be handy when you need to change just the prompt for `doc_to_text` but keep the rest the same, or change `filters` to compare which works better. Simply use `include` in the YAML file and write the name of the template you want to base it on. This assumes that the base template is in the same directory. Otherwise, you will need to provide the full path. You can base a YAML on another YAML file as a template. This can be handy when you need to change just the prompt for `doc_to_text` but keep the rest the same, or change `filters` to compare which works better. Simply use `include` in the YAML file and write the name of the template you want to base it on. This assumes that the base template is in the same directory. Otherwise, you will need to provide the full path.
```
```yaml
include: <YAML filename or with full path> include: <YAML filename or with full path>
... ...
``` ```
You can find an example of how to use this feature at [gsm8k-cot-self-consistency.yaml](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/gsm8k/gsm8k-cot-self-consistency.yaml) where it is based off [gsm8k-cot.yaml](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/gsm8k/gsm8k-cot.yaml)
You can find an example of how to use this feature at [gsm8k-cot-self-consistency.yaml](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/gsm8k/gsm8k-cot-self-consistency.yaml) where it is based off [gsm8k-cot.yaml](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/gsm8k/gsm8k-cot.yaml)
## Passing Arguments to Metrics ## Passing Arguments to Metrics
Metrics can be defined in the `metric_list` argument when building the YAML config. Multiple metrics can be listed along with any auxiliary arguments. For example, when using the [`exact_match` metric](https://github.com/huggingface/evaluate/tree/main/metrics/exact_match), auxiliary arguments such as `ignore_case`, `ignore_punctuation`, and `regexes_to_ignore` can be listed as well. They will be added to the metric function as `kwargs`. Some metrics have predefined values for `aggregation` and `higher_is_better`, so listing only the metric name can be sufficient. Metrics can be defined in the `metric_list` argument when building the YAML config. Multiple metrics can be listed along with any auxiliary arguments. For example, when using the [`exact_match` metric](https://github.com/huggingface/evaluate/tree/main/metrics/exact_match), auxiliary arguments such as `ignore_case`, `ignore_punctuation`, and `regexes_to_ignore` can be listed as well. They will be added to the metric function as `kwargs`. Some metrics have predefined values for `aggregation` and `higher_is_better`, so listing only the metric name can be sufficient.
``` ```yaml
metric_list: metric_list:
- metric: acc - metric: acc
- metric: exact_match - metric: exact_match
...@@ -216,25 +222,27 @@ metric_list: ...@@ -216,25 +222,27 @@ metric_list:
Here we list all metrics currently supported natively in `lm-eval`: Here we list all metrics currently supported natively in `lm-eval`:
Metrics: Metrics:
* `acc` (accuracy)
* `acc_norm` (length-normalized accuracy) - `acc` (accuracy)
* `acc_mutual_info` (baseline loglikelihood - normalized accuracy) - `acc_norm` (length-normalized accuracy)
* `perplexity` - `acc_mutual_info` (baseline loglikelihood - normalized accuracy)
* `word_perplexity` (perplexity per word) - `perplexity`
* `byte_perplexity` (perplexity per byte) - `word_perplexity` (perplexity per word)
* `bits_per_byte` - `byte_perplexity` (perplexity per byte)
* `matthews_corrcoef` (Matthews correlation coefficient) - `bits_per_byte`
* `f1` (F1 score) - `matthews_corrcoef` (Matthews correlation coefficient)
* `bleu` - `f1` (F1 score)
* `chrf` - `bleu`
* `ter` - `chrf`
- `ter`
Aggregation functions: Aggregation functions:
* `mean`
* `median` - `mean`
* `perplexity` - `median`
* `weighted_perplexity` - `perplexity`
* `bits_per_byte` - `weighted_perplexity`
- `bits_per_byte`
### Adding a Multiple Choice Metric ### Adding a Multiple Choice Metric
...@@ -246,37 +254,41 @@ Adding a multiple choice metric has a few steps. To get it working you need to: ...@@ -246,37 +254,41 @@ Adding a multiple choice metric has a few steps. To get it working you need to:
The default metric and aggregation functions are in `lm_eval/api/metrics.py`, and you can add a function there if it's for general use. The metrics are towards the bottom of the file and look like this: The default metric and aggregation functions are in `lm_eval/api/metrics.py`, and you can add a function there if it's for general use. The metrics are towards the bottom of the file and look like this:
```python
@register_metric(
    metric="mcc",
    higher_is_better=True,
    output_type="multiple_choice",
    aggregation="matthews_corrcoef",
)
def mcc_fn(items):  # This is a passthrough function
    return items
```
Note that many of these are passthrough functions, and for multiple choice (at least) this function is never actually called.

Aggregation functions are defined towards the top of the file; here's an example:
```python
@register_aggregation("matthews_corrcoef")
def matthews_corrcoef(items):
    unzipped_list = list(zip(*items))
    golds = unzipped_list[0]
    preds = unzipped_list[1]
    return sklearn.metrics.matthews_corrcoef(golds, preds)
```
This function returns a single numeric value. The input is defined in `Task.process_results` in `lm_eval/api/task.py`. There's a section that looks like this:
```python
result_dict = {
    **({"acc": acc} if "acc" in use_metric else {}),
    **({"f1": (gold, pred)} if "f1" in use_metric else {}),
    **({"mcc": (gold, pred)} if "mcc" in use_metric else {}),
    **({"acc_norm": acc_norm} if "acc_norm" in use_metric else {}),
    **({"exact_match": exact_match} if "exact_match" in use_metric else {}),
}
```
The value here determines the input to the aggregation function, while the key matches the name of the metric function. These metrics all have simple needs and only require the accuracy or the gold and predicted values, but immediately below this section there are examples of metrics with more complicated needs that you can use as a reference.
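To make the wiring concrete, here is a minimal sketch (not harness code) of how the pieces fit together for a hypothetical agreement metric. The decorator names mirror the examples above, but the metric name `kappa`, the import path, and the use of `sklearn.metrics.cohen_kappa_score` are illustrative assumptions:

```python
# Minimal sketch of registering a new multiple-choice metric.
# NOTE: the import path is an assumption; these are the same decorators used
# in lm_eval/api/metrics.py above, but they may live elsewhere in your version.
import sklearn.metrics

from lm_eval.api.metrics import register_aggregation, register_metric


@register_aggregation("kappa")  # hypothetical aggregation name
def kappa_agg(items):
    # items is the list of (gold, pred) tuples collected by process_results
    golds, preds = zip(*items)
    return sklearn.metrics.cohen_kappa_score(golds, preds)


@register_metric(
    metric="kappa",  # hypothetical metric name
    higher_is_better=True,
    output_type="multiple_choice",
    aggregation="kappa",
)
def kappa_fn(items):  # passthrough, like mcc_fn above
    return items
```

For this sketch to work, `Task.process_results` would also need to emit a `{"kappa": (gold, pred)}` entry, analogous to the `"mcc"` entry shown above.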
...@@ -285,15 +297,19 @@ The value here determines the input to the aggregation function, though the name
Contributing a new task can be daunting! Luckily, much of the work has often been done for you in a different, similarly evaluated task. Good examples of task implementations to study include:

Multiple choice tasks:

- SciQ (`lm_eval/tasks/sciq/sciq.yaml`)

Corpus perplexity evaluations:

- Wikitext (`lm_eval/tasks/wikitext/wikitext.yaml`)

Generative tasks:

- GSM8k (`lm_eval/tasks/gsm8k/gsm8k.yaml`)

Tasks using complex filtering:

- GSM8k with CoT (+ with Self-Consistency): (`lm_eval/tasks/gsm8k/gsm8k-cot.yaml`; `lm_eval/tasks/gsm8k/gsm8k-cot-self-consistency.yaml`)
# Group Configuration
......
...@@ -114,6 +114,14 @@ all = [
"lm_eval[zeno]",
]
[tool.pymarkdown]
plugins.md013.enabled = false # line-length
plugins.md024.allow_different_nesting = true # no-duplicate-headers
plugins.md025.enabled = false # single-header
plugins.md028.enabled = false # no-blanks-blockquote
plugins.md029.allow_extended_start_values = true # ol-prefix
plugins.md034.enabled = false # no-bare-urls
[tool.ruff.lint]
extend-select = ["I"]
......
# Clean Training Data
janitor.py contains a script to remove benchmark data contamination from training data sets.
It uses the approach described in the [GPT-3 paper](https://arxiv.org/abs/2005.14165).

## Algorithm

1) Collects all contamination text files that are to be removed from training data
2) Filters training data by finding `N`-gram matches between the training data
   and any contamination
...@@ -13,7 +16,8 @@ It uses the approach described in the [GPT-3 paper](https://arxiv.org/abs/2005.1
   completely contaminated and removed

OpenAI used:

```text
ngram_n = 13
window_to_remove = 200
minimum_slice_length = 200
```

...@@ -25,12 +29,13 @@ too_dirty_cutoff = 10
Janitor can be used as a pure Python program, but it is much faster if the n-gram code is run in C++. To compile the C++ code, run

```bash
pip install pybind11
c++ -O3 -Wall -shared -std=c++11 -fPIC $(python3 -m pybind11 --includes) janitor_util.cpp -o janitor_util$(python3-config --extension-suffix)
```
MacOS users: If your compiler isn't linked to Python, you may need to add `-undefined dynamic_lookup` to the above command. \
Linux users: If your compiler isn't linked to Python, you may need to follow these steps:

1. Rename the compiled code file to `janitor_util.so`.
2. Before running `import Janitor` in your code, add `sys.path.append("your/relative/path/to/janitor_util.so")` so that Python knows the location of `janitor_util.so`.
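For readers who want a feel for the matching step described above, here is a simplified, pure-Python sketch of 13-gram contamination detection. It is only an illustration under stated assumptions (whitespace tokenization, set-based lookup); it is not the janitor.py or janitor_util implementation, and the function names are made up:

```python
# Illustrative sketch of the n-gram matching step described above
# (not the janitor.py implementation; names and defaults are assumptions).
import re


def ngrams(text, n=13):
    """Yield space-joined n-grams over a crude lowercase tokenization."""
    tokens = re.sub(r"[^a-z0-9]+", " ", text.lower()).split()
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i : i + n])


def build_contamination_index(benchmark_docs, n=13):
    """Collect every n-gram appearing in any benchmark/contamination document."""
    index = set()
    for doc in benchmark_docs:
        index.update(ngrams(doc, n))
    return index


def is_contaminated(training_doc, index, n=13):
    """Flag a training document if any of its n-grams appears in the index."""
    return any(gram in index for gram in ngrams(training_doc, n))
```

In the pipeline described above, matches are not merely flagged: text around each match is removed (`window_to_remove`), and documents with too many matches are treated as completely contaminated and dropped.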
# Task-name # Task-name
## Paper

Title: `paper title goes here`

...@@ -10,10 +10,9 @@ Abstract: `link to paper PDF or arXiv abstract goes here`

Homepage: `homepage to the benchmark's website goes here, if applicable`

### Citation

```text
BibTeX-formatted citation goes here
```

...@@ -35,12 +34,13 @@ BibTeX-formatted citation goes here

### Checklist

For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature? * [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task? * [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test? * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted? * [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates? * [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant? * [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
......