Unverified commit 7158f4f4, authored by Kiersten Stokes, committed by GitHub

Add Markdown linter (#2818)

* Add markdown linter to pre-commit hooks

* Reformat existing markdown (excluding lm_eval/tasks/*.md)
parent c73b43f4
@@ -46,6 +46,12 @@ repos:
.*\.json|ignore.txt|lm_eval/tasks/.*|.*yaml|.*\.ipynb
)$
args: [--check-filenames, --check-hidden, --ignore-words=ignore.txt]
- repo: https://github.com/jackdewinter/pymarkdown
rev: v0.9.29
hooks:
- id: pymarkdown
exclude: ^lm_eval/tasks/
args: [fix, -r]
# - repo: https://github.com/pre-commit/mirrors-mypy
# rev: v1.5.1
# hooks:
...
@@ -4,7 +4,8 @@
---

## Latest News 📣

- [2025/03] Added support for steering HF models!
- [2025/02] Added [SGLang](https://docs.sglang.ai/) support!
- [2024/09] We are prototyping allowing users of LM Evaluation Harness to create and evaluate on text+image multimodal input, text output tasks, and have just added the `hf-multimodal` and `vllm-vlm` model types and `mmmu` task as a prototype feature. We welcome users to try out this in-progress feature and stress-test it for themselves, and suggest they check out [`lmms-eval`](https://github.com/EvolvingLMMs-Lab/lmms-eval), a wonderful project originally forking off of the lm-evaluation-harness, for a broader range of multimodal tasks, models, and features.
@@ -14,6 +15,7 @@
---

## Announcement

**A new v0.4.0 release of lm-evaluation-harness is available!**

New updates and features include:
@@ -39,6 +41,7 @@ Development will be continuing on the `main` branch, and we encourage you to giv
This project provides a unified framework to test generative language models on a large number of different evaluation tasks.

**Features:**

- Over 60 standard academic benchmarks for LLMs, with hundreds of subtasks and variants implemented.
- Support for models loaded via [transformers](https://github.com/huggingface/transformers/) (including quantization via [GPTQModel](https://github.com/ModelCloud/GPTQModel) and [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)), [GPT-NeoX](https://github.com/EleutherAI/gpt-neox), and [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/), with a flexible tokenization-agnostic interface.
- Support for fast and memory-efficient inference with [vLLM](https://github.com/vllm-project/vllm).
@@ -63,6 +66,7 @@ pip install -e .
We also provide a number of optional dependencies for extended functionality. A detailed table is available at the end of this document.

## Basic Usage

### User Guide

A user guide detailing the full list of supported arguments is provided [here](./docs/interface.md), and on the terminal by calling `lm_eval -h`. Alternatively, you can use `lm-eval` instead of `lm_eval`.
@@ -112,11 +116,12 @@ We support three main ways of using Hugging Face's [accelerate 🚀](https://git
To perform *data-parallel evaluation* (where each GPU loads a **separate full copy** of the model), we leverage the `accelerate` launcher as follows:

```bash
accelerate launch -m lm_eval --model hf \
    --tasks lambada_openai,arc_easy \
    --batch_size 16
```

(or via `accelerate launch --no-python lm_eval`).

For cases where your model can fit on a single GPU, this allows you to evaluate on K GPUs K times faster than on one.
@@ -127,7 +132,7 @@ The second way of using `accelerate` for multi-GPU evaluation is when your model
In this setting, run the library *outside the `accelerate` launcher*, but pass `parallelize=True` to `--model_args` as follows:

```bash
lm_eval --model hf \
    --tasks lambada_openai,arc_easy \
    --model_args parallelize=True \
@@ -137,6 +142,7 @@ lm_eval --model hf \
This means that your model's weights will be split across all available GPUs.

For more advanced users or even larger models, we allow for the following arguments when `parallelize=True` as well (an example invocation follows the list):

- `device_map_option`: How to split model weights across available GPUs. Defaults to `"auto"`.
- `max_memory_per_gpu`: The max GPU memory to use per GPU when loading the model.
- `max_cpu_memory`: The max amount of CPU memory to use when offloading the model weights to RAM.
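For instance, a sharded run combining these options might look like the sketch below; the model name and the memory limits are illustrative placeholders rather than recommendations.

```bash
lm_eval --model hf \
    --model_args pretrained=EleutherAI/gpt-j-6b,parallelize=True,device_map_option=auto,max_memory_per_gpu=38GiB,max_cpu_memory=100GiB \
    --tasks lambada_openai \
    --batch_size 8
```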
@@ -144,7 +150,7 @@ For more advanced users or even larger models, we allow for the following argume
The third option is to use both at the same time. This will allow you to take advantage of both data parallelism and model sharding, and is especially useful for models that are too large to fit on a single GPU.

```bash
accelerate launch --multi_gpu --num_processes {nb_of_copies_of_your_model} \
    -m lm_eval --model hf \
    --tasks lambada_openai,arc_easy \
@@ -194,6 +200,7 @@ pd.DataFrame({
```

Run the evaluation harness with steering vectors applied:

```bash
lm_eval --model steered \
    --model_args pretrained=EleutherAI/pythia-160m,steer_path=steer_config.pt \
@@ -211,6 +218,7 @@ To evaluate a `nemo` model, start by installing NeMo following [the documentatio
NeMo models can be obtained through the [NVIDIA NGC Catalog](https://catalog.ngc.nvidia.com/models) or on [NVIDIA's Hugging Face page](https://huggingface.co/nvidia). The [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo/tree/main/scripts/nlp_language_modeling) provides conversion scripts to convert the `hf` checkpoints of popular models such as Llama, Falcon, Mixtral, or MPT to `nemo`.

Run a `nemo` model on one GPU:

```bash
lm_eval --model nemo_lm \
    --model_args path=<path_to_nemo_model> \
@@ -220,7 +228,7 @@ lm_eval --model nemo_lm \
It is recommended to unpack the `nemo` model to avoid unpacking inside the Docker container, which may overflow disk space. You can do so by running:

```bash
mkdir MY_MODEL
tar -xvf MY_MODEL.nemo -C MY_MODEL
```
@@ -230,6 +238,7 @@ tar -xvf MY_MODEL.nemo -c MY_MODEL
By default, only one GPU is used, but we also support data replication or tensor/pipeline parallelism during evaluation on a single node.

1) To enable data replication, set `devices` in `model_args` to the number of data replicas to run. For example, the command to run 8 data replicas over 8 GPUs is:

```bash
torchrun --nproc-per-node=8 --no-python lm_eval \
    --model nemo_lm \
@@ -238,7 +247,8 @@ torchrun --nproc-per-node=8 --no-python lm_eval \
    --batch_size 32
```
2) To enable tensor and/or pipeline parallelism, set `tensor_model_parallel_size` and/or `pipeline_model_parallel_size` in `model_args`. In addition, set `devices` equal to the product of `tensor_model_parallel_size` and `pipeline_model_parallel_size`. For example, the command to use one node of 4 GPUs with tensor parallelism of 2 and pipeline parallelism of 2 is:

```bash
torchrun --nproc-per-node=4 --no-python lm_eval \
    --model nemo_lm \
@@ -246,6 +256,7 @@ torchrun --nproc-per-node=4 --no-python lm_eval \
    --tasks hellaswag \
    --batch_size 32
```

Note that it is recommended to replace the `python` command with `torchrun --nproc-per-node=<number of devices> --no-python` to facilitate loading the model onto the GPUs. This is especially important for large checkpoints loaded onto multiple GPUs.

Not supported yet: multi-node evaluation and combinations of data replication with tensor or pipeline parallelism.
@@ -256,7 +267,7 @@ Pipeline parallelism during evaluation is supported with OpenVINO models
To enable pipeline parallelism, set `pipeline_parallel` in `model_args`. In addition, set `device` to `HETERO:<GPU index1>,<GPU index2>`, for example `HETERO:GPU.1,GPU.0`. The command to use pipeline parallelism of 2 is:

```bash
lm_eval --model openvino \
    --tasks wikitext \
    --model_args pretrained=<path_to_ov_model>,pipeline_parallel=True \
@@ -273,6 +284,7 @@ lm_eval --model vllm \
    --tasks lambada_openai \
    --batch_size auto
```

To use vLLM, run `pip install lm_eval[vllm]`. For a full list of supported vLLM configurations, please reference our [vLLM integration](https://github.com/EleutherAI/lm-evaluation-harness/blob/e74ec966556253fbe3d8ecba9de675c77c075bce/lm_eval/models/vllm_causallms.py) and the vLLM documentation.

vLLM occasionally differs in output from Hugging Face. We treat Hugging Face as the reference implementation and provide a [script](./scripts/model_comparator.py) for checking the validity of vLLM results against HF.
@@ -293,14 +305,17 @@ To use SGLang as the evaluation backend, please **install it in advance** via SG
> Due to the installation method of [`Flashinfer`](https://docs.flashinfer.ai/), a fast attention kernel library, we don't include the dependencies of `SGLang` within [pyproject.toml](pyproject.toml). Note that `Flashinfer` also has some requirements on the `torch` version.

SGLang's server arguments differ slightly from those of other backends; see [here](https://docs.sglang.ai/backend/server_arguments.html) for more information. We provide an example of the usage here:

```bash
lm_eval --model sglang \
    --model_args pretrained={model_name},dp_size={data_parallel_size},tp_size={tensor_parallel_size},dtype=auto \
    --tasks gsm8k_cot \
    --batch_size auto
```
> [!Tip]
> When encountering out-of-memory (OOM) errors (especially for multiple-choice tasks), try these solutions:
>
> 1. Use a manual `batch_size`, rather than `auto`.
> 2. Lower KV cache pool memory usage by adjusting `mem_fraction_static`, e.g., add it to your model arguments: `--model_args pretrained=...,mem_fraction_static=0.7`.
> 3. Increase the tensor parallel size `tp_size` (if using multiple GPUs).
@@ -323,6 +338,7 @@ We also support using your own local inference server with servers that mirror t
```bash
lm_eval --model local-completions --tasks gsm8k --model_args model=facebook/opt-125m,base_url=http://{yourip}:8000/v1/completions,num_concurrent=1,max_retries=3,tokenized_requests=False,batch_size=16
```

Note that for externally hosted models, configs such as `--device`, which relate to where to place a local model, should not be used and do not function. Just like you can use `--model_args` to pass arbitrary arguments to the model constructor for local models, you can use it to pass arbitrary arguments to the model API for hosted models. See the documentation of the hosting service for information on what arguments they support.
| API or Inference Server | Implemented? | `--model <xxx>` name | Models supported: | Request Types: |
@@ -352,7 +368,6 @@ For more information on the different task `output_types` and model request type
> [!Note]
> For best performance with closed chat model APIs such as Anthropic Claude 3 and GPT-4, we recommend carefully looking at a few sample outputs using `--limit 10` first to confirm answer extraction and scoring on generative tasks is performing as expected. Providing `system="<some system prompt here>"` within `--model_args` for anthropic-chat-completions, to instruct the model what format to respond in, may be useful.

### Other Frameworks

A number of other libraries contain scripts for calling the eval harness through their library. These include [GPT-NeoX](https://github.com/EleutherAI/gpt-neox/blob/main/eval_tasks/eval_adapter.py), [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples/MoE/readme_evalharness.md), and [mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/blob/master/eval_harness.py).
@@ -360,6 +375,7 @@ A number of other libraries contain scripts for calling the eval harness through
To create your own custom integration, you can follow the instructions in [this tutorial](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md#external-library-usage).

### Additional Features

> [!Note]
> For tasks unsuitable for direct evaluation (either due to risks associated with executing untrusted code or complexities in the evaluation process), the `--predict_only` flag is available to obtain decoded generations for post-hoc evaluation.
@@ -367,6 +383,7 @@ If you have a Metal compatible Mac, you can run the eval harness using the MPS b
> [!Note]
> You can inspect what the LM inputs look like by running the following command:
>
> ```bash
> python write_out.py \
>     --tasks <task1,task2,...> \
@@ -374,6 +391,7 @@ If you have a Metal compatible Mac, you can run the eval harness using the MPS b
>     --num_examples 10 \
>     --output_base_path /path/to/output/folder
> ```
>
> This will write out one text file for each task.

To verify the data integrity of the tasks you're performing in addition to running the tasks themselves, you can use the `--check_integrity` flag:
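For example, a hedged sketch of such a run (the model and task here are illustrative placeholders):

```bash
lm_eval --model hf \
    --model_args pretrained=EleutherAI/gpt-j-6b \
    --tasks sciq \
    --check_integrity
```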
@@ -388,6 +406,7 @@ lm_eval --model openai \
## Advanced Usage Tips

For models loaded with the HuggingFace `transformers` library, any arguments provided via `--model_args` get passed to the relevant constructor directly. This means that anything you can do with `AutoModel` can be done with our library. For example, you can pass a local path via `pretrained=` or use models finetuned with [PEFT](https://github.com/huggingface/peft) by taking the call you would run to evaluate the base model and adding `,peft=PATH` to the `model_args` argument:

```bash
lm_eval --model hf \
    --model_args pretrained=EleutherAI/gpt-j-6b,parallelize=True,load_in_4bit=True,peft=nomic-ai/gpt4all-j-lora \
@@ -396,6 +415,7 @@ lm_eval --model hf \
```

Models provided as delta weights can be easily loaded using the Hugging Face transformers library. Within `--model_args`, set the `delta` argument to specify the delta weights, and use the `pretrained` argument to designate the relative base model to which they will be applied:

```bash
lm_eval --model hf \
    --model_args pretrained=Ejafa/llama_7B,delta=lmsys/vicuna-7b-delta-v1.1 \
@@ -405,6 +425,7 @@ lm_eval --model hf \
GPTQ quantized models can be loaded using [GPTQModel](https://github.com/ModelCloud/GPTQModel) (faster) or [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ).

GPTQModel: add `,gptqmodel=True` to `model_args`:

```bash
lm_eval --model hf \
    --model_args pretrained=model-name-or-path,gptqmodel=True \
@@ -412,6 +433,7 @@ lm_eval --model hf \
```

AutoGPTQ: add `,autogptq=True` to `model_args`:

```bash
lm_eval --model hf \
    --model_args pretrained=model-name-or-path,autogptq=model.safetensors,gptq_use_triton=True \
@@ -439,6 +461,7 @@ lm_eval --model hf \
```

This allows you to easily download the results and samples from the Hub, using:

```python
from datasets import load_dataset
@@ -536,6 +559,7 @@ For more information on the library and how everything fits together, check out
To implement a new task in the eval harness, see [this guide](./docs/new_task_guide.md).

In general, we follow this priority list for addressing concerns about prompting and other eval details:

1. If there is widespread agreement among people who train LLMs, use the agreed-upon procedure.
2. If there is a clear and unambiguous official implementation, use that procedure.
3. If there is widespread agreement among people who evaluate LLMs, use the agreed-upon procedure.
@@ -550,6 +574,7 @@ We try to prioritize agreement with the procedures used by other groups to decre
The best way to get support is to open an issue on this repo or join the [EleutherAI Discord server](https://discord.gg/eleutherai). The `#lm-thunderdome` channel is dedicated to developing this project and the `#release-discussion` channel is for receiving support for our releases. If you've used the library and have had a positive (or negative) experience, we'd love to hear from you!

## Optional Extras

Extras dependencies can be installed via `pip install -e ".[NAME]"`

| Name | Use |
@@ -586,7 +611,7 @@ Extras dependencies can be installed via `pip install -e ".[NAME]"`
## Cite as

```text
@misc{eval-harness,
  author = {Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang and Tang, Eric and Thite, Anish and Wang, Ben and Wang, Kevin and Zou, Andy},
  title = {A framework for few-shot language model evaluation},
...
@@ -28,31 +28,29 @@ You may also need to override other methods or properties depending on your API'
> [!NOTE]
> Currently, loglikelihood and MCQ-based tasks (such as MMLU) are only supported for completion endpoints, not for chat-completion endpoints (those that expect a list of dicts). Completion APIs which support instruct-tuned models can be evaluated with the `--apply_chat_template` option in order to simultaneously evaluate models using a chat template format while still being able to access the model logits needed for loglikelihood-based tasks; see the example below.
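A hedged sketch of such a run against a local OpenAI-compatible completions server (the URL, model name, and task are placeholders, not library defaults):

```bash
lm_eval --model local-completions \
    --model_args model=my-instruct-model,base_url=http://localhost:8000/v1/completions,tokenized_requests=False \
    --tasks mmlu \
    --apply_chat_template
```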
# TemplateAPI Usage Guide
## TemplateAPI Arguments

When initializing a `TemplateAPI` instance or a subclass, you can provide several arguments to customize its behavior. Here's a detailed explanation of some important arguments:

- `model` or `pretrained` (str):
  - The name or identifier of the model to use.
  - `model` takes precedence over `pretrained` when both are provided.
- `base_url` (str):
  - The base URL for the API endpoint.
- `tokenizer` (str, optional):
  - The name or path of the tokenizer to use.
  - If not provided, it defaults to using the same tokenizer name as the model.
- `num_concurrent` (int):
  - Number of concurrent requests to make to the API.
  - Useful for APIs that support parallel processing.
  - Default is 1 (sequential processing).
- `timeout` (int, optional):
  - Timeout for API requests in seconds.
  - Default is 30.
- `tokenized_requests` (bool):
  - Determines whether the input is pre-tokenized. Defaults to `True`.
@@ -71,8 +69,8 @@ When initializing a `TemplateAPI` instance or a subclass, you can provide severa
  - Default is 2048.
- `max_retries` (int, optional):
  - Maximum number of retries for failed API requests.
  - Default is 3.
- `max_gen_toks` (int, optional):
  - Maximum number of tokens to generate in completion tasks.
@@ -99,7 +97,6 @@ When initializing a `TemplateAPI` instance or a subclass, you can provide severa
  - Whether to validate the certificate of the API endpoint (if HTTPS).
  - Default is True.

Example usage:

```python
@@ -202,5 +199,5 @@ To implement your own API model:
## Best Practices

1. Use the `@register_model` decorator to register your model with the framework (and import it in `lm_eval/models/__init__.py`!); a registration sketch follows this list.
2. Use environment variables for sensitive information like API keys.
3. Properly handle batching and concurrent requests if supported by your API.
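A minimal registration sketch is shown below; the model name and class are placeholders, and the payload construction and response parsing that `TemplateAPI` requires (described above) are deliberately omitted, so this is not a drop-in implementation.

```python
# Hypothetical registration example; "my-api" and MyAPIModel are placeholders.
from lm_eval.api.registry import register_model
from lm_eval.models.api_models import TemplateAPI


@register_model("my-api")
class MyAPIModel(TemplateAPI):
    """Custom API model; payload creation and response parsing are omitted here."""

    ...
```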
@@ -29,7 +29,7 @@ in order to ensure linters and other checks will be run upon committing.
We use [pytest](https://docs.pytest.org/en/latest/) for running unit tests. All library unit tests can be run via:

```bash
python -m pytest --showlocals -s -vv -n=auto --ignore=tests/models/test_neuralmagic.py --ignore=tests/models/test_openvino.py
```
@@ -38,27 +38,30 @@ python -m pytest --showlocals -s -vv -n=auto --ignore=tests/models/test_neuralma
We ask that new contributors agree to a Contributor License Agreement affirming that EleutherAI has the rights to use your contribution to our library.
First-time pull requests will have a reply added by @CLAassistant containing instructions for how to confirm this, and we require it before merging your PR.

## Contribution Best Practices

We recommend a few best practices to make your contributions or reported errors easier to assist with.

**For Pull Requests:**

- PRs should be titled descriptively, and be opened with a brief description of the scope and intent of the new contribution.
- New features should have appropriate documentation added alongside them.
- Aim for code maintainability, and minimize code copying.
- If opening a task, try to share test results on the task using a publicly-available model, and if any public results are available on the task, compare to them.
**For Feature Requests:**

- Provide a short paragraph's worth of description. What is the feature you are requesting? What is its motivation, and an example use case of it? How does this differ from what is currently supported?

**For Bug Reports:**

- Provide a short description of the bug.
- Provide a *reproducible example*: what is the command you run with our library that results in this error? Have you tried any other steps to resolve it?
- Provide a *full error traceback* of the error that occurs, if applicable. A one-line error message or small screenshot snippet is unhelpful without the surrounding context.
- Note what version of the codebase you are using, and any specifics of your environment and setup that may be relevant.

**For Requesting New Tasks:**

- Provide a 1-2 sentence description of what the task is and what it evaluates.
- Provide a link to the paper introducing the task.
- Provide a link to where the dataset can be found.
@@ -70,6 +73,7 @@ We recommend a few best practices to make your contributions or reported errors
To quickly get started, we maintain a list of good first issues, which can be found [on our project board](https://github.com/orgs/EleutherAI/projects/25/views/8) or by [filtering GH Issues](https://github.com/EleutherAI/lm-evaluation-harness/issues?q=is%3Aopen+label%3A%22good+first+issue%22+label%3A%22help+wanted%22). These are typically smaller code changes or self-contained features which can be added without extensive familiarity with library internals, and we recommend new contributors consider taking a stab at one of these first if they are feeling uncertain where to begin.

There are a number of distinct ways to contribute to LM Evaluation Harness, and all are extremely helpful! A sampling of ways to contribute includes:

- **Implementing and verifying new evaluation tasks**: Is there a task you'd like to see LM Evaluation Harness support? Consider opening an issue requesting it, or helping add it! Verifying and cross-checking task implementations with their original versions is also a very valuable form of assistance in ensuring standardized evaluation.
- **Improving documentation**: Improvements to the documentation, or noting pain points / gaps in documentation, are helpful in order for us to improve the user experience of the library and the clarity and coverage of documentation.
- **Testing and devops**: We are very grateful for any assistance in adding tests for the library that can be run for new PRs, and other devops workflows.
...
# Chat Template Delimiter Handling Update

## Overview

This change modifies how delimiters are handled when applying chat templates in the request construction process for likelihood and multiple-choice based tasks. When `apply_chat_template` is set to `True`, the target delimiter is now set to an empty string instead of using the configured delimiter.

## Background

By default, the system uses a target delimiter (typically a whitespace " ") between the context and target text when constructing prompts. The full string is constructed as:

```text
doc_to_text(doc) + target_delimiter + doc_to_target(doc)
```

While this worked well for base models where we wanted the model to predict a single whitespace followed by the answer, chat models have their own formatting conventions that handle spacing differently.
## The Change

- When `apply_chat_template=True`, the target delimiter is now empty ("") instead of the default whitespace
- This prevents interference between chat template formatting and the default delimiter system
- Particularly important for multiple choice tasks where the template itself handles spacing

## Example

```text
# Before (with default delimiter " ")
<user>Question: What color is the sky?\nAnswer:<assistant> blue
...
@@ -13,6 +13,7 @@ python -m lm_eval \
```

## Background

Downstream evaluations test model generalization, and are less useful when test set data also exists in the training set, referred to as leakage or contamination.

Filtering your training set against the test set is a good first step; however, this isn't always possible, as in the case of a new benchmark or one that wasn't considered prior to model training. When training set filtering isn't possible, it is useful to measure the impact of test set leakage by detecting the contaminated test examples and producing a clean version of the benchmark.
@@ -20,9 +21,11 @@ Filtering your training set against the test set is a good first step, however t
The basis for our decontamination procedure can be found in Appendix C of "Language Models are Few-Shot Learners". OpenAI defined a test document as contaminated if any N-gram overlap existed with any training document. They used a range of N values between 8 and 13 depending on the dataset, while we just used 13 for simplicity.
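As an illustration only (this is not the harness implementation, which uses the bucketed pipeline described below), 13-gram overlap detection boils down to something like the following sketch:

```python
# Toy sketch of N-gram overlap contamination checking; the corpora here are placeholders.
def ngrams(tokens: list[str], n: int = 13) -> set[tuple[str, ...]]:
    """Return the set of all n-grams in a token sequence."""
    return {tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}


train_ngrams = ngrams("... training document text ...".split())  # placeholder training doc
test_tokens = "... evaluation document text ...".split()          # placeholder test doc
is_contaminated = bool(ngrams(test_tokens) & train_ngrams)        # any 13-gram overlap
```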
## Implementation

Contamination detection can be found in `lm_eval/decontaminate.py`, with supporting code in `lm_eval/decontamination/`.

`decontaminate.py` does the following:

1. Build dictionaries of all ngrams and their corresponding evaluation/document ids.
2. Scan through sorted files containing training set n-grams.
3. If a match is found, the corresponding evaluation/document combinations are marked as contaminated.
@@ -32,6 +35,7 @@ decontaminate.py does the following:
This is disabled by default for new tasks; to support decontamination on a task, override the `should_decontaminate` and `doc_to_decontamination_query` methods. For more details, see the [task guide](task_guide.md).

## Pile Ngram Generation

The relevant scripts can be found in `scripts/clean_training_data`, which also import from `lm_eval/decontamination/`.
@@ -52,6 +56,7 @@ python -m scripts/clean_training_data/generate_13_grams \
This took approximately 4 days for us. We had the time to wait, but this could be scaled out by doing partial Pile scans on multiple instances of this script and merging the relevant buckets. We fixed `PYTHONHASHSEED` to ensure reproducibility of bucket hashing in case you need to stop and start.

6. Sort the generated 13-grams.

```bash
python -m scripts/clean_training_data/sort_13_gram_buckets \
    -dir path/to/working/directory/output
...
@@ -47,8 +47,8 @@ This mode supports a number of command-line arguments, the details of which can
- `--system_instruction`: Specifies a system instruction string to prepend to the prompt.
- `--apply_chat_template`: This flag specifies whether to apply a chat template to the prompt. It can be used in the following ways:
  - `--apply_chat_template`: When used without an argument, applies the only available chat template to the prompt. For Hugging Face models, if no dedicated chat template exists, the default chat template will be applied.
  - `--apply_chat_template template_name`: If the model has multiple chat templates, apply the specified template to the prompt.

For Hugging Face models, the default chat template can be found in the [`default_chat_template`](https://github.com/huggingface/transformers/blob/fc35907f95459d7a6c5281dfadd680b6f7b620e3/src/transformers/tokenization_utils_base.py#L1912) property of the Transformers Tokenizer.
@@ -56,23 +56,23 @@ This mode supports a number of command-line arguments, the details of which can
- `--predict_only`: Generates the model outputs without computing metrics. Use with `--log_samples` to retrieve decoded results.
- `--seed`: Set seed for python's random, numpy and torch. Accepts a comma-separated list of 3 values for python's random, numpy, and torch seeds, respectively, or a single integer to set the same seed for all three. The values are either an integer or 'None' to not set the seed. Default is `0,1234,1234` (for backward compatibility). E.g., `--seed 0,None,8` sets `random.seed(0)` and `torch.manual_seed(8)`. Here numpy's seed is not set since the second value is `None`. E.g., `--seed 42` sets all three seeds to 42.
- `--wandb_args`: Tracks logging to Weights and Biases for evaluation runs and includes args passed to `wandb.init`, such as `project` and `job_type`. Full list [here](https://docs.wandb.ai/ref/python/init). E.g., `--wandb_args project=test-project,name=test-run`. Also allows for the passing of the step to log things at (passed to `wandb.run.log`), e.g., `--wandb_args step=123`.
- `--hf_hub_log_args`: Logs evaluation results to Hugging Face Hub. Accepts a string with the arguments separated by commas. Available arguments (a usage example follows this list):
  - `hub_results_org` - organization name on Hugging Face Hub, e.g., `EleutherAI`. If not provided, the results will be pushed to the owner of the Hugging Face token,
  - `hub_repo_name` - repository name on Hugging Face Hub (deprecated, `details_repo_name` and `results_repo_name` should be used instead), e.g., `lm-eval-results`,
  - `details_repo_name` - repository name on Hugging Face Hub to store details, e.g., `lm-eval-results`,
  - `results_repo_name` - repository name on Hugging Face Hub to store results, e.g., `lm-eval-results`,
  - `push_results_to_hub` - whether to push results to Hugging Face Hub, can be `True` or `False`,
  - `push_samples_to_hub` - whether to push samples results to Hugging Face Hub, can be `True` or `False`. Requires `--log_samples` to be set,
  - `public_repo` - whether the repository is public, can be `True` or `False`,
  - `leaderboard_url` - URL to the leaderboard, e.g., `https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard`.
  - `point_of_contact` - Point of contact for the results dataset, e.g., `yourname@example.com`.
  - `gated` - whether to gate the details dataset, can be `True` or `False`.
- `--metadata`: JSON string to pass to TaskConfig. Used for some tasks which require additional metadata to be passed for processing. E.g., `--metadata '{"key": "value"}'`.
## External Library Usage ## External Library Usage
......
...@@ -45,6 +45,7 @@ class MyCustomLM(LM): ...@@ -45,6 +45,7 @@ class MyCustomLM(LM):
#... #...
#... #...
``` ```
Where `Instance` is a dataclass defined in [`lm_eval.api.instance`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/api/instance.py) with property `args` of request-dependent type signature described below. Where `Instance` is a dataclass defined in [`lm_eval.api.instance`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/api/instance.py) with property `args` of request-dependent type signature described below.
We support three types of requests, consisting of different interactions / measurements with an autoregressive LM. We support three types of requests, consisting of different interactions / measurements with an autoregressive LM.
...@@ -65,15 +66,13 @@ All three request types take as input `requests` of type `list[Instance]` that h ...@@ -65,15 +66,13 @@ All three request types take as input `requests` of type `list[Instance]` that h
- This is used to evaluate *perplexity* on a data distribution. - This is used to evaluate *perplexity* on a data distribution.
- It should return `(ll,) : Tuple[float]` , a.k.a. solely the *loglikelihood* of producing each piece of text given no starting input. - It should return `(ll,) : Tuple[float]` , a.k.a. solely the *loglikelihood* of producing each piece of text given no starting input.
To allow a model to be evaluated on all types of tasks, you will need to implement these three types of measurements (note that `loglikelihood_rolling` is a special case of `loglikelihood`). For a reference implementation, check out `lm_eval/models/huggingface.py` ! Additionally, check out `lm_eval.api.model.TemplateLM` for a class that abstracts away some commonly used functions across LM subclasses, or see if your model would lend itself well to subclassing the `lm_eval.models.huggingface.HFLM` class and overriding just the initialization or a couple methods! To allow a model to be evaluated on all types of tasks, you will need to implement these three types of measurements (note that `loglikelihood_rolling` is a special case of `loglikelihood`). For a reference implementation, check out `lm_eval/models/huggingface.py` ! Additionally, check out `lm_eval.api.model.TemplateLM` for a class that abstracts away some commonly used functions across LM subclasses, or see if your model would lend itself well to subclassing the `lm_eval.models.huggingface.HFLM` class and overriding just the initialization or a couple methods!
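For orientation, here is a minimal sketch of such a subclass. It is illustrative only: the registry name, class name, and constant return values are made up, and the method signatures are assumed to match the abstract methods described above; a real implementation would tokenize inputs and call an actual model.

```python
# Minimal illustrative LM subclass -- placeholder logic, not a real backend.
from lm_eval.api.model import LM
from lm_eval.api.registry import register_model


@register_model("my-dummy-lm")  # hypothetical registry name
class MyDummyLM(LM):
    def loglikelihood(self, requests):
        # One (loglikelihood, is_greedy) pair per (context, continuation) request.
        return [(0.0, True) for _ in requests]

    def loglikelihood_rolling(self, requests):
        # One loglikelihood per full-text request, for perplexity-style tasks.
        return [0.0 for _ in requests]

    def generate_until(self, requests):
        # One generated string per (context, gen_kwargs) request.
        return ["" for _ in requests]
```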
**Tip: be careful of indexing in loglikelihood!** **Tip: be careful of indexing in loglikelihood!**
LMs take in tokens in position `[0 1 2 ... N]` and output a probability distribution for token position `N+1`. We provide a simplified graphic here, excerpted from `huggingface.py`: LMs take in tokens in position `[0 1 2 ... N]` and output a probability distribution for token position `N+1`. We provide a simplified graphic here, excerpted from `huggingface.py`:
``` ```text
# how this all works (illustrated on a causal decoder-only setup): # how this all works (illustrated on a causal decoder-only setup):
# CTX CONT # CTX CONT
# inp 0 1 2 3|4 5 6 7 8 9 <- last token is deleted by inp[:, :-1] # inp 0 1 2 3|4 5 6 7 8 9 <- last token is deleted by inp[:, :-1]
...@@ -162,7 +161,8 @@ class MyCustomLM(LM): ...@@ -162,7 +161,8 @@ class MyCustomLM(LM):
- `apply_chat_template` - `apply_chat_template`
- This method performs the bulk of the work required for chat-formatting. - This method performs the bulk of the work required for chat-formatting.
- As input, a `chat_history: List[Dict[str, str]]` is passed in. This is a transcript of a conversation of a form similar to - As input, a `chat_history: List[Dict[str, str]]` is passed in. This is a transcript of a conversation of a form similar to
```
```text
[ [
{"system": <user-provided system message such as "You are a helpful math-focused chatbot">}, {"system": <user-provided system message such as "You are a helpful math-focused chatbot">},
{"user": <task example - a few-shot example 'input'>} {"user": <task example - a few-shot example 'input'>}
...@@ -170,8 +170,9 @@ class MyCustomLM(LM): ...@@ -170,8 +170,9 @@ class MyCustomLM(LM):
# ... more few-shot examples, potentially # ... more few-shot examples, potentially
{"user": <test set query--response on which we will evaluate>}, {"user": <test set query--response on which we will evaluate>},
] ]
``` ```
which can then be converted into a string input.
which can then be converted into a string input.
- The output is a string representing this conversation that can be fed into the model. - The output is a string representing this conversation that can be fed into the model.
- For example, this consists of simply calling `tokenizer.apply_chat_template` for HFLM--see the implementation there for reference. - For example, this consists of simply calling `tokenizer.apply_chat_template` for HFLM--see the implementation there for reference.
- `tokenizer_name` - `tokenizer_name`
......
...@@ -27,10 +27,13 @@ To implement a new standard task, we'll need to write a YAML file which configur ...@@ -27,10 +27,13 @@ To implement a new standard task, we'll need to write a YAML file which configur
```sh ```sh
touch lm_eval/tasks/<dataset_name>/<my_new_task_name>.yaml touch lm_eval/tasks/<dataset_name>/<my_new_task_name>.yaml
``` ```
Or, copy the template subfolder we provide from `templates/new_yaml_task`: Or, copy the template subfolder we provide from `templates/new_yaml_task`:
```sh ```sh
cp -r templates/new_yaml_task lm_eval/tasks/ cp -r templates/new_yaml_task lm_eval/tasks/
``` ```
and rename the folders and YAML file(s) as desired. and rename the folders and YAML file(s) as desired.
### Selecting and configuring a dataset ### Selecting and configuring a dataset
...@@ -54,13 +57,17 @@ training_split: <split name of training set, or `null`> ...@@ -54,13 +57,17 @@ training_split: <split name of training set, or `null`>
validation_split: <split name of val. set, or `null`> validation_split: <split name of val. set, or `null`>
test_split: <split name of test set, or `null`> test_split: <split name of test set, or `null`>
``` ```
Tests will run on the `test_split` if it is available, and otherwise evaluate on the `validation_split`. Tests will run on the `test_split` if it is available, and otherwise evaluate on the `validation_split`.
We can also specify from which split the task should retrieve few-shot examples via: We can also specify from which split the task should retrieve few-shot examples via:
```yaml ```yaml
fewshot_split: <split name to draw fewshot examples from, or `null`> fewshot_split: <split name to draw fewshot examples from, or `null`>
``` ```
or by hardcoding them, either using the following in the yaml file: or by hardcoding them, either using the following in the yaml file:
```yaml ```yaml
fewshot_config: fewshot_config:
sampler: first_n sampler: first_n
...@@ -69,24 +76,28 @@ fewshot_config: ...@@ -69,24 +76,28 @@ fewshot_config:
{<sample 2>}, {<sample 2>},
] ]
``` ```
or by adding the function `list_fewshot_samples` in the associated utils.py file: or by adding the function `list_fewshot_samples` in the associated utils.py file:
```python ```python
def list_fewshot_samples() -> list[dict]: def list_fewshot_samples() -> list[dict]:
return [{<sample 1>}, {<sample 2>}] return [{<sample 1>}, {<sample 2>}]
``` ```
See `lm_eval/tasks/minerva_math/minerva_math_algebra.yaml` for an example of the latter, and `lm_eval/tasks/gsm8k/gsm8k-cot.yaml` for an example of the former. See `lm_eval/tasks/minerva_math/minerva_math_algebra.yaml` for an example of the latter, and `lm_eval/tasks/gsm8k/gsm8k-cot.yaml` for an example of the former.
In this case, each sample must contain the same fields as the samples in the above sets--for example, if `doc_to_text` expects an `input` field when rendering input prompts, these provided samples must include an `input` key. In this case, each sample must contain the same fields as the samples in the above sets--for example, if `doc_to_text` expects an `input` field when rendering input prompts, these provided samples must include an `input` key.
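For instance, if a task's `doc_to_text` and `doc_to_target` expect `question` and `answer` fields (hypothetical names here), a hand-written few-shot pool might look like this sketch:

```python
# utils.py -- illustrative hand-written few-shot pool; field names must match
# whatever the task's prompt templates expect.
def list_fewshot_samples() -> list[dict]:
    return [
        {"question": "What is 2 + 2?", "answer": "4"},
        {"question": "What colour is a clear daytime sky?", "answer": "blue"},
    ]
```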
If neither of the above options is set, we will default to the train/validation/test sets, in that order. If neither of the above options is set, we will default to the train/validation/test sets, in that order.
Finally, our dataset may not be already in the exact format we want. Maybe we have to strip whitespace and special characters via a regex from our dataset's "question" field! Or maybe we just want to rename its columns to match a convention we'll be using for our prompts. Finally, our dataset may not be already in the exact format we want. Maybe we have to strip whitespace and special characters via a regex from our dataset's "question" field! Or maybe we just want to rename its columns to match a convention we'll be using for our prompts.
Let's create a python file in the directory where we're writing our YAML file: Let's create a python file in the directory where we're writing our YAML file:
```bash ```bash
touch lm_eval/tasks/<dataset_name>/utils.py touch lm_eval/tasks/<dataset_name>/utils.py
``` ```
Now, in `utils.py` we'll write a function to process each split of our dataset (the following example is drawn from [the `hellaswag` task](../lm_eval/tasks/hellaswag/utils.py)): Now, in `utils.py` we'll write a function to process each split of our dataset (the following example is drawn from [the `hellaswag` task](../lm_eval/tasks/hellaswag/utils.py)):
```python ```python
...@@ -104,6 +115,7 @@ def process_docs(dataset: datasets.Dataset) -> datasets.Dataset: ...@@ -104,6 +115,7 @@ def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
``` ```
Now, in our YAML config file we'll use the `!function` constructor, and tell the config where our imported Python function will come from. At runtime, before doing anything else we will preprocess our dataset according to this function! Now, in our YAML config file we'll use the `!function` constructor, and tell the config where our imported Python function will come from. At runtime, before doing anything else we will preprocess our dataset according to this function!
```yaml ```yaml
process_docs: !function utils.process_docs process_docs: !function utils.process_docs
``` ```
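If you are unsure what such a preprocessing function can look like, here is an illustrative sketch (not the linked `hellaswag` implementation; the column names are made up) that strips whitespace from a question column and renames a label column:

```python
# utils.py -- illustrative preprocessing; column names here are hypothetical.
import datasets


def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
    def _process_doc(doc):
        return {
            "question": doc["raw_question"].strip(),  # clean stray whitespace
            "answer": doc["label"],                   # rename to match our prompts
        }

    return dataset.map(_process_doc)
```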
...@@ -112,15 +124,16 @@ process_docs: !function utils.process_docs ...@@ -112,15 +124,16 @@ process_docs: !function utils.process_docs
To load a local dataset for evaluation, you can specify data files in the `dataset_kwargs` field, such as the following for JSON files: To load a local dataset for evaluation, you can specify data files in the `dataset_kwargs` field, such as the following for JSON files:
``` ```yaml
dataset_path: json dataset_path: json
dataset_name: null dataset_name: null
dataset_kwargs: dataset_kwargs:
data_files: /path/to/my/json data_files: /path/to/my/json
``` ```
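Under the hood this is roughly equivalent to the `datasets` call below, which can be handy for checking that your files load as expected before wiring them into a task (the path is the same placeholder as above):

```python
# Quick sanity check that the local file loads the way the task config will.
from datasets import load_dataset

ds = load_dataset("json", data_files="/path/to/my/json")  # placeholder path
print(ds)              # shows the resulting splits and column names
print(ds["train"][0])  # inspect the first record
```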
Or with files already split into separate directories: Or with files already split into separate directories:
``` ```yaml
dataset_path: arrow dataset_path: arrow
dataset_kwargs: dataset_kwargs:
data_files: data_files:
...@@ -130,7 +143,7 @@ dataset_kwargs: ...@@ -130,7 +143,7 @@ dataset_kwargs:
Alternatively, if you have previously downloaded a dataset from huggingface hub (using `save_to_disk()`) and wish to use the local files, you will need to use `data_dir` under `dataset_kwargs` to point to where the directory is. Alternatively, if you have previously downloaded a dataset from huggingface hub (using `save_to_disk()`) and wish to use the local files, you will need to use `data_dir` under `dataset_kwargs` to point to where the directory is.
``` ```yaml
dataset_path: hellaswag dataset_path: hellaswag
dataset_kwargs: dataset_kwargs:
data_dir: hellaswag_local/ data_dir: hellaswag_local/
...@@ -149,52 +162,59 @@ To write a prompt, users will use `doc_to_text`, `doc_to_target`, and `doc_to_ch ...@@ -149,52 +162,59 @@ To write a prompt, users will use `doc_to_text`, `doc_to_target`, and `doc_to_ch
### Basic prompts ### Basic prompts
If a dataset is straightforward enough, users can enter the feature name directly. This assumes that no preprocessing is required. For example, in [Swag](https://github.com/EleutherAI/lm-evaluation-harness/blob/1710b42d52d0f327cb0eb3cb1bfbbeca992836ca/lm_eval/tasks/swag/swag.yaml#L10-L11), `doc_to_text` and `doc_to_target` are each given the name of a dataset feature. If a dataset is straightforward enough, users can enter the feature name directly. This assumes that no preprocessing is required. For example, in [Swag](https://github.com/EleutherAI/lm-evaluation-harness/blob/1710b42d52d0f327cb0eb3cb1bfbbeca992836ca/lm_eval/tasks/swag/swag.yaml#L10-L11), `doc_to_text` and `doc_to_target` are each given the name of a dataset feature.
```yaml ```yaml
doc_to_text: startphrase doc_to_text: startphrase
doc_to_target: label doc_to_target: label
``` ```
Hard-coding is also possible as is the case in [SciQ](https://github.com/EleutherAI/lm-evaluation-harness/blob/1710b42d52d0f327cb0eb3cb1bfbbeca992836ca/lm_eval/tasks/sciq/sciq.yaml#L11). Hard-coding is also possible as is the case in [SciQ](https://github.com/EleutherAI/lm-evaluation-harness/blob/1710b42d52d0f327cb0eb3cb1bfbbeca992836ca/lm_eval/tasks/sciq/sciq.yaml#L11).
```yaml ```yaml
doc_to_target: 3 doc_to_target: 3
``` ```
`doc_to_choice` can be given a list of strings directly as the options (see [Toxigen](https://github.com/EleutherAI/lm-evaluation-harness/blob/1710b42d52d0f327cb0eb3cb1bfbbeca992836ca/lm_eval/tasks/toxigen/toxigen.yaml#L11)) `doc_to_choice` can be given a list of strings directly as the options (see [Toxigen](https://github.com/EleutherAI/lm-evaluation-harness/blob/1710b42d52d0f327cb0eb3cb1bfbbeca992836ca/lm_eval/tasks/toxigen/toxigen.yaml#L11))
```yaml ```yaml
doc_to_choice: ['No', 'Yes'] doc_to_choice: ['No', 'Yes']
``` ```
If a dataset feature is already a list, you can set the name of that feature as `doc_to_choice` (see [Hellaswag](https://github.com/EleutherAI/lm-evaluation-harness/blob/e0eda4d3ffa10e5f65e0976161cd134bec61983a/lm_eval/tasks/hellaswag/hellaswag.yaml#L13)) If a dataset feature is already a list, you can set the name of that feature as `doc_to_choice` (see [Hellaswag](https://github.com/EleutherAI/lm-evaluation-harness/blob/e0eda4d3ffa10e5f65e0976161cd134bec61983a/lm_eval/tasks/hellaswag/hellaswag.yaml#L13))
```
```yaml
doc_to_choice: choices doc_to_choice: choices
``` ```
### Writing a prompt with Jinja 2 ### Writing a prompt with Jinja 2
We support the [Jinja 2](https://jinja.palletsprojects.com/en/3.1.x/) templating language for writing prompts. In practice, this means you can take your dataset's columns and do many basic string manipulations to place each document into prompted format. We support the [Jinja 2](https://jinja.palletsprojects.com/en/3.1.x/) templating language for writing prompts. In practice, this means you can take your dataset's columns and do many basic string manipulations to place each document into prompted format.
Take for example the dataset `super_glue/boolq`. As input, we'd like to use the features `passage` and `question` and string them together so that for a sample line `doc`, the model sees something in the format of: Take for example the dataset `super_glue/boolq`. As input, we'd like to use the features `passage` and `question` and string them together so that for a sample line `doc`, the model sees something in the format of:
```
```text
doc["passage"] doc["passage"]
Question: doc["question"]? Question: doc["question"]?
Answer: Answer:
``` ```
We do this by [writing](https://github.com/EleutherAI/lm-evaluation-harness/blob/1710b42d52d0f327cb0eb3cb1bfbbeca992836ca/lm_eval/tasks/super_glue/boolq/default.yaml#L9C1-L9C61) We do this by [writing](https://github.com/EleutherAI/lm-evaluation-harness/blob/1710b42d52d0f327cb0eb3cb1bfbbeca992836ca/lm_eval/tasks/super_glue/boolq/default.yaml#L9C1-L9C61)
```yaml ```yaml
doc_to_text: "{{passage}}\nQuestion: {{question}}?\nAnswer:" doc_to_text: "{{passage}}\nQuestion: {{question}}?\nAnswer:"
``` ```
This way, `{{passage}}` will be replaced by `doc["passage"]` and `{{question}}` by `doc["question"]` when the prompt template is rendered. This way, `{{passage}}` will be replaced by `doc["passage"]` and `{{question}}` by `doc["question"]` when the prompt template is rendered.
Our intended output is for the model to predict a single whitespace, and then the answer to the question. We do this via: Our intended output is for the model to predict a single whitespace, and then the answer to the question. We do this via:
```yaml ```yaml
doc_to_target: "{{answer}}" doc_to_target: "{{answer}}"
``` ```
> [!WARNING] > [!WARNING]
> We add `target_delimiter` between input and target which defaults to " ", such that the full input-output string is `doc_to_text(doc) + target_delimiter + doc_to_target(doc)`. `doc_to_text` and `doc_to_target` should not contain trailing right or left whitespace, respectively. For multiple choice the target will be each choice index concatenated with the delimiter. > We add `target_delimiter` between input and target which defaults to " ", such that the full input-output string is `doc_to_text(doc) + target_delimiter + doc_to_target(doc)`. `doc_to_text` and `doc_to_target` should not contain trailing right or left whitespace, respectively. For multiple choice the target will be each choice index concatenated with the delimiter.
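To preview what the model will actually see, you can render the same templates with the `jinja2` package directly. The sketch below uses a made-up `doc` and mimics the `doc_to_text(doc) + target_delimiter + doc_to_target(doc)` concatenation described in the warning; the harness itself handles this (plus few-shot assembly) for you.

```python
# Rough illustration of prompt rendering -- not the harness's actual code path.
from jinja2 import Template

doc = {  # made-up example document
    "passage": "The aurora is caused by charged particles striking the atmosphere.",
    "question": "is the aurora caused by charged particles",
    "answer": "yes",
}

doc_to_text = Template("{{passage}}\nQuestion: {{question}}?\nAnswer:")
doc_to_target = Template("{{answer}}")
target_delimiter = " "

full_input = doc_to_text.render(**doc) + target_delimiter + doc_to_target.render(**doc)
print(full_input)
```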
#### Multiple choice format #### Multiple choice format
For tasks which are multiple choice (a fixed, finite set of label words per document) and evaluated by comparing the loglikelihoods of all label words (the `multiple_choice` task output type), we enforce a particular convention on prompt format. For tasks which are multiple choice (a fixed, finite set of label words per document) and evaluated by comparing the loglikelihoods of all label words (the `multiple_choice` task output type), we enforce a particular convention on prompt format.
...@@ -206,6 +226,7 @@ doc_to_text: "{{support.lstrip()}}\nQuestion: {{question}}\nAnswer:" # This is t ...@@ -206,6 +226,7 @@ doc_to_text: "{{support.lstrip()}}\nQuestion: {{question}}\nAnswer:" # This is t
doc_to_target: 3 # this contains the index into the answer choice list of the correct answer. doc_to_target: 3 # this contains the index into the answer choice list of the correct answer.
doc_to_choice: "{{[distractor1, distractor2, distractor3, correct_answer]}}" doc_to_choice: "{{[distractor1, distractor2, distractor3, correct_answer]}}"
``` ```
Task implementers are thus able to decide what the answer choices should be for a document, and what prompt format to use. Task implementers are thus able to decide what the answer choices should be for a document, and what prompt format to use.
The label index can also be sourced from a feature directly. For example, in `superglue/boolq`, the label index is defined in the feature `label`. We can set `doc_to_target` to simply `label`. The options or verbalizers can then be written as a list `["no", "yes"]` whose entries correspond to the label index. The label index can also be sourced from a feature directly. For example, in `superglue/boolq`, the label index is defined in the feature `label`. We can set `doc_to_target` to simply `label`. The options or verbalizers can then be written as a list `["no", "yes"]` whose entries correspond to the label index.
...@@ -221,7 +242,8 @@ doc_to_choice: ["no", "yes"] ...@@ -221,7 +242,8 @@ doc_to_choice: ["no", "yes"]
There may be cases where the prompt we want to implement is easier to express in Python than in Jinja 2. For this, we can use Python helper functions that are defined in the YAML config. Note that the helper script must be in the same directory as the YAML file. There may be cases where the prompt we want to implement is easier to express in Python than in Jinja 2. For this, we can use Python helper functions that are defined in the YAML config. Note that the helper script must be in the same directory as the YAML file.
A good example is WikiText, which requires a number of regex rules to clean the samples. A good example is WikiText, which requires a number of regex rules to clean the samples.
```
```python
def wikitext_detokenizer(doc): def wikitext_detokenizer(doc):
string = doc["page"] string = doc["page"]
# contractions # contractions
...@@ -234,7 +256,8 @@ def wikitext_detokenizer(doc): ...@@ -234,7 +256,8 @@ def wikitext_detokenizer(doc):
``` ```
We can load this function in `doc_to_target` by using the `!function` operator followed by `<file name>.<function name>`. In the file [wikitext.yaml](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/wikitext/wikitext.yaml) we write: We can load this function in `doc_to_target` by using the `!function` operator followed by `<file name>.<function name>`. In the file [wikitext.yaml](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/wikitext/wikitext.yaml) we write:
```
```yaml
doc_to_target: !function preprocess_wikitext.wikitext_detokenizer doc_to_target: !function preprocess_wikitext.wikitext_detokenizer
``` ```
...@@ -243,22 +266,24 @@ doc_to_target: !function preprocess_wikitext.wikitext_detokenizer ...@@ -243,22 +266,24 @@ doc_to_target: !function preprocess_wikitext.wikitext_detokenizer
[Promptsource](https://github.com/bigscience-workshop/promptsource/tree/main/promptsource) is a great repository for crowdsourced prompts for many datasets. We can load these prompts easily by using the `use_prompt` argument and filling it with the format `"promptsource:<name of prompt template>"`. To use this, `doc_to_text` and `doc_to_target` should be left undefined. This will fetch the template of the dataset defined in the YAML file. [Promptsource](https://github.com/bigscience-workshop/promptsource/tree/main/promptsource) is a great repository for crowdsourced prompts for many datasets. We can load these prompts easily by using the `use_prompt` argument and filling it with the format `"promptsource:<name of prompt template>"`. To use this, `doc_to_text` and `doc_to_target` should be left undefined. This will fetch the template of the dataset defined in the YAML file.
For example, for SuperGLUE BoolQ, if we want to use the prompt template `GPT-3 Style`, we can add this to the YAML file. For example, for SuperGLUE BoolQ, if we want to use the prompt template `GPT-3 Style`, we can add this to the YAML file.
```
```yaml
use_prompt: "promptsource:GPT-3 Style" use_prompt: "promptsource:GPT-3 Style"
``` ```
If you would like to run evaluation on all prompt templates, you can use a wildcard: If you would like to run evaluation on all prompt templates, you can use a wildcard:
```
```yaml
use_prompt: "promptsource:*" use_prompt: "promptsource:*"
``` ```
### Setting metrics ### Setting metrics
You're almost done! Now we need to choose how to score our task. You're almost done! Now we need to choose how to score our task.
- *If this is a multiple choice task:* do you just want to check your model's accuracy in choosing the correct answer choice? - *If this is a multiple choice task:* do you just want to check your model's accuracy in choosing the correct answer choice?
- *If this is a generation task:* do you just want to check how often your model outputs *exactly the ground-truth output string provided*? - *If this is a generation task:* do you just want to check how often your model outputs *exactly the ground-truth output string provided*?
If the answer to the above is no: you'll need to record what scoring metrics to use! Metrics can be listed in the following format: If the answer to the above is no: you'll need to record what scoring metrics to use! Metrics can be listed in the following format:
```yaml ```yaml
...@@ -270,6 +295,7 @@ metric_list: ...@@ -270,6 +295,7 @@ metric_list:
aggregation: ... aggregation: ...
higher_is_better: ... higher_is_better: ...
``` ```
`aggregation` and `higher_is_better` can optionally be left out to default to the manually-set defaults if using a natively supported metric; otherwise they must be defined explicitly (for example, when using a custom metric implemented as a function). `aggregation` and `higher_is_better` can optionally be left out to default to the manually-set defaults if using a natively supported metric; otherwise they must be defined explicitly (for example, when using a custom metric implemented as a function).
For a full list of natively supported metrics and aggregation functions see [`docs/task_guide.md`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md). All metrics supported in [HuggingFace Evaluate](https://github.com/huggingface/evaluate/tree/main/metrics) can also be used, and will be loaded if a given metric name is not one natively supported in `lm-eval` or `hf_evaluate` is set to `true`. For a full list of natively supported metrics and aggregation functions see [`docs/task_guide.md`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md). All metrics supported in [HuggingFace Evaluate](https://github.com/huggingface/evaluate/tree/main/metrics) can also be used, and will be loaded if a given metric name is not one natively supported in `lm-eval` or `hf_evaluate` is set to `true`.
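For a metric name that is not natively supported, the fallback is conceptually similar to loading the metric yourself with the `evaluate` package; a rough sketch (shown with `exact_match` purely as a familiar example of the API):

```python
# Rough sketch of how a Hugging Face Evaluate metric is used; extra kwargs from
# `metric_list` end up being forwarded to `compute` much like this.
import evaluate

metric = evaluate.load("exact_match")
scores = metric.compute(
    predictions=["the answer is 42", "Paris"],
    references=["the answer is 42", "paris"],
    ignore_case=True,
)
print(scores)  # {'exact_match': 1.0}
```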
...@@ -279,11 +305,12 @@ For a full list of natively supported metrics and aggregation functions see [`do ...@@ -279,11 +305,12 @@ For a full list of natively supported metrics and aggregation functions see [`do
Some tasks may require more advanced processing logic than is described in this guide. Some tasks may require more advanced processing logic than is described in this guide.
As a heuristic check: As a heuristic check:
* Does your task require generating multiple free-form outputs per input document?
* Does your task require complex, multi-step post-processing of generated model outputs? - Does your task require generating multiple free-form outputs per input document?
* Does your task require subsetting documents on the fly based on their content? - Does your task require complex, multi-step post-processing of generated model outputs?
* Do you expect to compute metrics after applying multiple such processing steps on your model outputs? - Does your task require subsetting documents on the fly based on their content?
* Does your task rely on metrics that need a custom implementation? - Do you expect to compute metrics after applying multiple such processing steps on your model outputs?
- Does your task rely on metrics that need a custom implementation?
For more detail on the task system and advanced features, see [`docs/task_guide.md`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md). If none of the above apply to your task, it's time to continue on to checking your task performance! For more detail on the task system and advanced features, see [`docs/task_guide.md`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md). If none of the above apply to your task, it's time to continue on to checking your task performance!
...@@ -296,6 +323,7 @@ If you're writing your YAML file inside the `lm_eval/tasks` folder, you just nee ...@@ -296,6 +323,7 @@ If you're writing your YAML file inside the `lm_eval/tasks` folder, you just nee
```yaml ```yaml
task: <name of the task> task: <name of the task>
``` ```
Including a task name is mandatory. Including a task name is mandatory.
It is often also convenient to label your task with several `tag` values, though this field is optional: It is often also convenient to label your task with several `tag` values, though this field is optional:
...@@ -305,8 +333,8 @@ tag: ...@@ -305,8 +333,8 @@ tag:
- tag1 - tag1
- tag2 - tag2
``` ```
This will add your task to the `tag1` and `tag2` tags, letting users see how to categorize your task and, if desired, run all tasks under one of these tags at once, your task included.
This will add your task to the `tag1` and `tag2` tags, letting users see how to categorize your task and, if desired, run all tasks under one of these tags at once, your task included.
If your task is not in the `lm_eval/tasks` folder, you'll need to tell the Eval Harness where to look for YAML files. If your task is not in the `lm_eval/tasks` folder, you'll need to tell the Eval Harness where to look for YAML files.
...@@ -318,7 +346,6 @@ task_manager = TaskManager(args.verbosity, include_path=args.include_path) ...@@ -318,7 +346,6 @@ task_manager = TaskManager(args.verbosity, include_path=args.include_path)
Passing `--tasks /path/to/yaml/file` is also accepted. Passing `--tasks /path/to/yaml/file` is also accepted.
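From Python, the same thing looks roughly like the sketch below (the model, paths, and task name are placeholders); `include_path` can also point at a directory containing several YAML files.

```python
# Rough sketch of running an out-of-tree task from Python; values are placeholders.
import lm_eval
from lm_eval.tasks import TaskManager

task_manager = TaskManager(include_path="/path/to/my/yaml/dir")

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["my_new_task"],  # hypothetical task name from your YAML
    num_fewshot=0,
    task_manager=task_manager,
)
print(results["results"])
```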
### Advanced Group Configs ### Advanced Group Configs
While `tag` values are helpful when you want to be able to quickly and conveniently run a set of related tasks via `--tasks my_tag_name`, often, we wish to implement more complex logic. For example, the MMLU benchmark contains 57 *subtasks* that must all be *averaged* together in order to report a final 'MMLU score'. While `tag` values are helpful when you want to be able to quickly and conveniently run a set of related tasks via `--tasks my_tag_name`, often, we wish to implement more complex logic. For example, the MMLU benchmark contains 57 *subtasks* that must all be *averaged* together in order to report a final 'MMLU score'.
...@@ -341,7 +368,6 @@ metadata: ...@@ -341,7 +368,6 @@ metadata:
This will behave almost identically to a `tag` that includes these 3 tasks, but with one key distinction: we'll print the `nli_tasks` group as a row (with no associated metrics) in our table of outputs, and visually show that these 3 tasks appear under its subheader. This will behave almost identically to a `tag` that includes these 3 tasks, but with one key distinction: we'll print the `nli_tasks` group as a row (with no associated metrics) in our table of outputs, and visually show that these 3 tasks appear under its subheader.
Now, let's assume we actually want to report an aggregate score for `nli_tasks`. We would instead use a YAML config like the following: Now, let's assume we actually want to report an aggregate score for `nli_tasks`. We would instead use a YAML config like the following:
```yaml ```yaml
...@@ -420,7 +446,7 @@ In this example, `recipe` is the custom argument for the `Unitxt` class. ...@@ -420,7 +446,7 @@ In this example, `recipe` is the custom argument for the `Unitxt` class.
To avoid conflicts, each task needs to be registered with a unique name. Because of this, slight variations of a task are still counted as unique tasks and need to be named uniquely. This can be done by appending a qualifier that refers to the variation, as in MMLU, where the templates used for the FLAN evaluation are differentiated from the default by the prefix `mmlu_flan_*`. Printing the full task names can easily clutter the results table at the end of the evaluation, especially when you have a long list of tasks or are using a benchmark that comprises many tasks. To make the table more legible, you can use `task_alias` and `group_alias` to provide an alternative task name and group name that will be printed. For example, in `mmlu_abstract_algebra.yaml` we set `task_alias` to `abstract_algebra`. In group configs, a `group_alias` for a group can also be set. To avoid conflicts, each task needs to be registered with a unique name. Because of this, slight variations of a task are still counted as unique tasks and need to be named uniquely. This can be done by appending a qualifier that refers to the variation, as in MMLU, where the templates used for the FLAN evaluation are differentiated from the default by the prefix `mmlu_flan_*`. Printing the full task names can easily clutter the results table at the end of the evaluation, especially when you have a long list of tasks or are using a benchmark that comprises many tasks. To make the table more legible, you can use `task_alias` and `group_alias` to provide an alternative task name and group name that will be printed. For example, in `mmlu_abstract_algebra.yaml` we set `task_alias` to `abstract_algebra`. In group configs, a `group_alias` for a group can also be set.
``` ```yaml
"dataset_name": "abstract_algebra" "dataset_name": "abstract_algebra"
"description": "The following are multiple choice questions (with answers) about abstract\ "description": "The following are multiple choice questions (with answers) about abstract\
\ algebra.\n\n" \ algebra.\n\n"
...@@ -451,7 +477,7 @@ One key feature in LM Evaluation Harness is the ability to version tasks and gro ...@@ -451,7 +477,7 @@ One key feature in LM Evaluation Harness is the ability to version tasks and gro
This version info can be provided by adding the following to your new task or group config file: This version info can be provided by adding the following to your new task or group config file:
``` ```yaml
metadata: metadata:
version: 0 version: 0
``` ```
...@@ -462,7 +488,7 @@ If you are incrementing a task's version, please also consider adding a changelo ...@@ -462,7 +488,7 @@ If you are incrementing a task's version, please also consider adding a changelo
for example, for example,
* \[Dec 25, 2023\] (PR #999) Version 0.0 -> 1.0: Fixed a bug with answer extraction that led to underestimated performance. - \[Dec 25, 2023\] (PR #999) Version 0.0 -> 1.0: Fixed a bug with answer extraction that led to underestimated performance.
## Checking performance + equivalence ## Checking performance + equivalence
...@@ -475,15 +501,16 @@ To enable this, we provide a checklist that should be completed when contributin ...@@ -475,15 +501,16 @@ To enable this, we provide a checklist that should be completed when contributin
The checklist is the following: The checklist is the following:
For adding novel benchmarks/datasets to the library: For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
- [ ] Is the task an existing benchmark in the literature?
- [ ] Have you referenced the original paper that introduced the task?
- [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported: If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates? - [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant? - [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
- [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
It is recommended to include a filled-out copy of this checklist in the README.md for the subfolder you are creating, if you have created a new subfolder in `lm_eval/tasks`. It is recommended to include a filled-out copy of this checklist in the README.md for the subfolder you are creating, if you have created a new subfolder in `lm_eval/tasks`.
......
...@@ -15,11 +15,13 @@ Tasks are configured via the `TaskConfig` object. Below, we describe all fields ...@@ -15,11 +15,13 @@ Tasks are configured via the `TaskConfig` object. Below, we describe all fields
### Parameters ### Parameters
Task naming + registration: Task naming + registration:
- **task** (`str`, defaults to None) — name of the task. - **task** (`str`, defaults to None) — name of the task.
- **task_alias** (`str`, defaults to None) - Alias of the task name that will be printed in the final results table. - **task_alias** (`str`, defaults to None) - Alias of the task name that will be printed in the final results table.
- **tag** (`str`, *optional*) — name of the tag(s) a task belongs to. Enables one to run all tasks with a specified tag name at once. - **tag** (`str`, *optional*) — name of the tag(s) a task belongs to. Enables one to run all tasks with a specified tag name at once.
Dataset configuration options: Dataset configuration options:
- **dataset_path** (`str`) — The name of the dataset as listed by HF in the datasets Hub. - **dataset_path** (`str`) — The name of the dataset as listed by HF in the datasets Hub.
- **dataset_name** (`str`, *optional*, defaults to None) — The name of what HF calls a “data instance” or sub-task of the benchmark. If your task does not contain any data instances, just leave this to default to None. (If you're familiar with the HF `datasets.load_dataset` function, these are just the first 2 arguments to it.) - **dataset_name** (`str`, *optional*, defaults to None) — The name of what HF calls a “data instance” or sub-task of the benchmark. If your task does not contain any data instances, just leave this to default to None. (If you're familiar with the HF `datasets.load_dataset` function, these are just the first 2 arguments to it.)
- **dataset_kwargs** (`dict`, *optional*) — Auxiliary arguments that `datasets.load_dataset` accepts. This can be used to specify arguments such as `data_files` or `data_dir` if you want to use local datafiles such as json or csv. - **dataset_kwargs** (`dict`, *optional*) — Auxiliary arguments that `datasets.load_dataset` accepts. This can be used to specify arguments such as `data_files` or `data_dir` if you want to use local datafiles such as json or csv.
...@@ -31,6 +33,7 @@ Dataset configuration options: ...@@ -31,6 +33,7 @@ Dataset configuration options:
- **process_docs** (`Callable`, *optional*) — Optionally define a function to apply to each HF dataset split, to preprocess all documents before being fed into prompt template rendering or other evaluation steps. Can be used to rename dataset columns, or to process documents into a format closer to that expected by a prompt template. - **process_docs** (`Callable`, *optional*) — Optionally define a function to apply to each HF dataset split, to preprocess all documents before being fed into prompt template rendering or other evaluation steps. Can be used to rename dataset columns, or to process documents into a format closer to that expected by a prompt template.
Prompting / in-context formatting options: Prompting / in-context formatting options:
- **use_prompt** (`str`, *optional*) — Name of the prompt in promptsource to use. If defined, it will override `doc_to_text`, `doc_to_target`, and `doc_to_choice`. - **use_prompt** (`str`, *optional*) — Name of the prompt in promptsource to use. If defined, it will override `doc_to_text`, `doc_to_target`, and `doc_to_choice`.
- **description** (`str`, *optional*) — An optional prepended Jinja2 template or string which will be prepended to the few-shot examples passed into the model, often describing the task or providing instructions to a model, such as `"The following are questions (with answers) about {{subject}}.\n\n"`. No delimiters or spacing are inserted between the description and the first few-shot example. - **description** (`str`, *optional*) — An optional prepended Jinja2 template or string which will be prepended to the few-shot examples passed into the model, often describing the task or providing instructions to a model, such as `"The following are questions (with answers) about {{subject}}.\n\n"`. No delimiters or spacing are inserted between the description and the first few-shot example.
- **doc_to_text** (`Union[Callable, str]`, *optional*) — Jinja2 template, string, or function to process a sample into the appropriate input for the model. - **doc_to_text** (`Union[Callable, str]`, *optional*) — Jinja2 template, string, or function to process a sample into the appropriate input for the model.
...@@ -41,10 +44,12 @@ Prompting / in-context formatting options: ...@@ -41,10 +44,12 @@ Prompting / in-context formatting options:
- **gen_prefix** (`str`, *optional*) — String to append after the <|assistant|> token. For example, if the task is to answer a question, the gen_prefix could be "The answer is: " to prompt the model to generate the answer. If not using a chat template, this string will be appended to the end of the prompt. - **gen_prefix** (`str`, *optional*) — String to append after the <|assistant|> token. For example, if the task is to answer a question, the gen_prefix could be "The answer is: " to prompt the model to generate the answer. If not using a chat template, this string will be appended to the end of the prompt.
Runtime configuration options: Runtime configuration options:
- **num_fewshot** (`int`, *optional*, defaults to 0) — Number of few-shot examples before the input. - **num_fewshot** (`int`, *optional*, defaults to 0) — Number of few-shot examples before the input.
- **batch_size** (`int`, *optional*, defaults to 1) — Batch size. - **batch_size** (`int`, *optional*, defaults to 1) — Batch size.
Scoring details: Scoring details:
- **metric_list** (`list`, *optional*, defaults to None) — A list of metrics to use for evaluation. See docs for expected format. - **metric_list** (`list`, *optional*, defaults to None) — A list of metrics to use for evaluation. See docs for expected format.
- **output_type** (`str`, *optional*, defaults to "generate_until") — Selects the type of model output for the given task. Options are `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`. - **output_type** (`str`, *optional*, defaults to "generate_until") — Selects the type of model output for the given task. Options are `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`.
- **generation_kwargs** (`dict`, *optional*) — Auxiliary arguments for the `generate` function from HF transformers library. Advanced keyword arguments may not be supported for non-HF LM classes. - **generation_kwargs** (`dict`, *optional*) — Auxiliary arguments for the `generate` function from HF transformers library. Advanced keyword arguments may not be supported for non-HF LM classes.
...@@ -54,6 +59,7 @@ Scoring details: ...@@ -54,6 +59,7 @@ Scoring details:
- **doc_to_decontamination_query** (`str`, *optional*) — Query for decontamination if `should_decontaminate` is True. If `should_decontaminate` is True but `doc_to_decontamination_query` is `None`, `doc_to_decontamination_query` will follow `doc_to_text`. - **doc_to_decontamination_query** (`str`, *optional*) — Query for decontamination if `should_decontaminate` is True. If `should_decontaminate` is True but `doc_to_decontamination_query` is `None`, `doc_to_decontamination_query` will follow `doc_to_text`.
Other: Other:
- **metadata** (`dict`, *optional*) — An optional field where arbitrary metadata can be passed. Most tasks should include a `version` key in this field that is used to denote the version of the yaml config. Other special metadata keys are: `num_fewshot`, to override the printed `n-shot` table column for a task. Will also be passed to the `custom_dataset` function if defined. - **metadata** (`dict`, *optional*) — An optional field where arbitrary metadata can be passed. Most tasks should include a `version` key in this field that is used to denote the version of the yaml config. Other special metadata keys are: `num_fewshot`, to override the printed `n-shot` table column for a task. Will also be passed to the `custom_dataset` function if defined.
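Putting a few of the fields above together: the same keys you would normally write in a task YAML map directly onto `TaskConfig`, so a rough (hypothetical) configuration sketch looks like this:

```python
# Rough sketch only -- in practice these fields usually live in the task YAML.
from lm_eval.api.task import TaskConfig

config = TaskConfig(
    task="demo_local_qa",  # hypothetical task name
    dataset_path="json",
    dataset_kwargs={"data_files": "/path/to/my/json"},  # placeholder path
    test_split="train",
    output_type="generate_until",
    doc_to_text="{{question}}\nAnswer:",
    doc_to_target="{{answer}}",
    metric_list=[{"metric": "exact_match", "aggregation": "mean", "higher_is_better": True}],
    metadata={"version": 0},
)
print(config.task, config.output_type)
```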
## Filters ## Filters
...@@ -70,10 +76,8 @@ We do such post-processing by operating on *responses*, which are stored after r ...@@ -70,10 +76,8 @@ We do such post-processing by operating on *responses*, which are stored after r
`resps` is a `List[str]` for each instance, and we pass a `List[List[<expected return type from model>]]` to our filters that is a list of `[instance.resps for instance in instances]`. `resps` is a `List[str]` for each instance, and we pass a `List[List[<expected return type from model>]]` to our filters that is a list of `[instance.resps for instance in instances]`.
Our filters, after completing a pipeline, must return a `List[<expected return type from model>]`, which we then unpack and store element-wise in `Instance.filtered_resps` for the corresponding instance. Thus, we take as input a list of responses from our model for each doc, and must return a single response *not wrapped in a list* for each doc. Our filters, after completing a pipeline, must return a `List[<expected return type from model>]`, which we then unpack and store element-wise in `Instance.filtered_resps` for the corresponding instance. Thus, we take as input a list of responses from our model for each doc, and must return a single response *not wrapped in a list* for each doc.
**End Aside** **End Aside**
A full list of supported filter operations can be found in `lm_eval/filters/__init__.py`. Contributions of new filter types are welcome! A full list of supported filter operations can be found in `lm_eval/filters/__init__.py`. Contributions of new filter types are welcome!
### Multiple Filter Pipelines ### Multiple Filter Pipelines
...@@ -112,6 +116,7 @@ filter_list: ...@@ -112,6 +116,7 @@ filter_list:
We are able to provide multiple different filter pipelines, each with their own name and list of filters to apply in sequence. We are able to provide multiple different filter pipelines, each with their own name and list of filters to apply in sequence.
Our first filter pipeline implements Our first filter pipeline implements
- applying a regex to the model generations (extracting the number within the phrase "The answer is (number)") - applying a regex to the model generations (extracting the number within the phrase "The answer is (number)")
- selecting only the first out of the 64 model answers - selecting only the first out of the 64 model answers
...@@ -126,6 +131,7 @@ Then scoring this single answer. ...@@ -126,6 +131,7 @@ Then scoring this single answer.
``` ```
Our second filter pipeline, "maj@64", does majority voting across all 64 answers via: Our second filter pipeline, "maj@64", does majority voting across all 64 answers via:
- applying the same regex to all responses, to get the numerical answer from the model for each of the 64 responses per problem - applying the same regex to all responses, to get the numerical answer from the model for each of the 64 responses per problem
- applying majority voting to all responses, which then returns a length-1 `[<majority answer>]` list for each - applying majority voting to all responses, which then returns a length-1 `[<majority answer>]` list for each
- taking the first element of this length-1 list, to then score the sole response `<majority answer>` for each document. - taking the first element of this length-1 list, to then score the sole response `<majority answer>` for each document.
...@@ -140,8 +146,10 @@ Our second filter pipeline, "maj@64", does majority voting across all 64 answers ...@@ -140,8 +146,10 @@ Our second filter pipeline, "maj@64", does majority voting across all 64 answers
``` ```
Our final filter pipeline, "maj@8", does majority voting across the first 8 of the model's responses per document via: Our final filter pipeline, "maj@8", does majority voting across the first 8 of the model's responses per document via:
- subsetting the len-64 list of responses `[answer1, answer2, ..., answer64]` to `[answer1, answer2, ..., answer8]` for each document - subsetting the len-64 list of responses `[answer1, answer2, ..., answer64]` to `[answer1, answer2, ..., answer8]` for each document
- performing the same sequence of filters on these new sets of 8 responses, for each document. - performing the same sequence of filters on these new sets of 8 responses, for each document.
```yaml ```yaml
- name: "maj@8" - name: "maj@8"
filter: filter:
...@@ -155,7 +163,6 @@ Our final filter pipeline, "maj@8", does majority voting across the first 8 of t ...@@ -155,7 +163,6 @@ Our final filter pipeline, "maj@8", does majority voting across the first 8 of t
Thus, given the 64 responses from our LM on each document, we can report metrics on these responses in these 3 different ways, as defined by our filter pipelines. Thus, given the 64 responses from our LM on each document, we can report metrics on these responses in these 3 different ways, as defined by our filter pipelines.
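Conceptually, what a majority-voting pipeline like "maj@64" does to one document's responses can be sketched in plain Python (the regex and responses below are made up; the harness applies the configured filters rather than this code):

```python
# Plain-Python sketch of "regex-extract -> majority vote -> take first".
import re
from collections import Counter

responses = [  # pretend these are an LM's sampled answers for one document
    "Let's think step by step. 3 + 4 = 7. The answer is 7",
    "The answer is 7",
    "3 + 4 + 1 = 8. The answer is 8",
]

def extract(resp: str) -> str:
    # Mirrors a regex-extraction step ("The answer is (number)").
    match = re.search(r"The answer is (-?\d+)", resp)
    return match.group(1) if match else "[invalid]"

extracted = [extract(r) for r in responses]           # ['7', '7', '8']
majority = Counter(extracted).most_common(1)[0][0]    # majority vote
print(majority)                                       # '7' is what gets scored
```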
### Adding a custom filter ### Adding a custom filter
Just as you can add a custom model with the `register_model` decorator, you can do the same with filters, for example: Just as you can add a custom model with the `register_model` decorator, you can do the same with filters, for example:
...@@ -169,11 +176,10 @@ class NewFilter(Filter) ...@@ -169,11 +176,10 @@ class NewFilter(Filter)
... ...
``` ```
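Filling in the elided pieces, a complete (if toy) custom filter might look like the sketch below. The filter name, class, and lowercasing behaviour are made up, and the `apply` signature is assumed to follow the built-in filters in `lm_eval/filters/`; once registered, it should be usable from a task's `filter_list` like any built-in filter.

```python
# Toy custom filter that lowercases every model response; names are illustrative.
from lm_eval.api.filter import Filter
from lm_eval.api.registry import register_filter


@register_filter("lowercase")
class LowercaseFilter(Filter):
    def apply(self, resps, docs):
        # `resps` holds, per document, the list of that document's model responses.
        return [[resp.lower() for resp in doc_resps] for doc_resps in resps]
```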
## Embedded Python Code ## Embedded Python Code
Users can use Python functions for certain arguments by using the `!function` operator after the argument name, followed by `<filename>.<pythonfunctionname>`. This feature can be used for the following arguments: Users can use Python functions for certain arguments by using the `!function` operator after the argument name, followed by `<filename>.<pythonfunctionname>`. This feature can be used for the following arguments:
1. `doc_to_text` 1. `doc_to_text`
2. `doc_to_target` 2. `doc_to_target`
3. `doc_to_choice` 3. `doc_to_choice`
...@@ -183,22 +189,22 @@ Use can use python functions for certain arguments by using the `!function` oper ...@@ -183,22 +189,22 @@ Use can use python functions for certain arguments by using the `!function` oper
The prior implementation method of new tasks was to subclass `Task`. While we intend to migrate all tasks to the new YAML implementation option going forward, it remains possible to subclass the Task class and implement custom logic. For more information, see `docs/task_guide.md` in v0.3.0 of the `lm-evaluation-harness`. The prior implementation method of new tasks was to subclass `Task`. While we intend to migrate all tasks to the new YAML implementation option going forward, it remains possible to subclass the Task class and implement custom logic. For more information, see `docs/task_guide.md` in v0.3.0 of the `lm-evaluation-harness`.
## Including a Base YAML ## Including a Base YAML
You can base a YAML on another YAML file as a template. This can be handy when you need to change just the prompt for `doc_to_text` but keep the rest the same, or change `filters` to compare which works better. Simply use `include` in the YAML file and write the name of the template you want to base it on. This assumes that the base template is in the same directory. Otherwise, you will need to provide the full path. You can base a YAML on another YAML file as a template. This can be handy when you need to change just the prompt for `doc_to_text` but keep the rest the same, or change `filters` to compare which works better. Simply use `include` in the YAML file and write the name of the template you want to base it on. This assumes that the base template is in the same directory. Otherwise, you will need to provide the full path.
```
```yaml
include: <YAML filename or with full path> include: <YAML filename or with full path>
... ...
``` ```
You can find an example of how to use this feature at [gsm8k-cot-self-consistency.yaml](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/gsm8k/gsm8k-cot-self-consistency.yaml) where it is based off [gsm8k-cot.yaml](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/gsm8k/gsm8k-cot.yaml)
You can find an example of how to use this feature at [gsm8k-cot-self-consistency.yaml](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/gsm8k/gsm8k-cot-self-consistency.yaml) where it is based off [gsm8k-cot.yaml](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/gsm8k/gsm8k-cot.yaml)
## Passing Arguments to Metrics ## Passing Arguments to Metrics
Metrics can be defined in the `metric_list` argument when building the YAML config. Multiple metrics can be listed along with any auxiliary arguments. For example, when using the [`exact_match` metric](https://github.com/huggingface/evaluate/tree/main/metrics/exact_match), auxiliary arguments such as `ignore_case`, `ignore_punctuation`, and `regexes_to_ignore` can be listed as well. They will be added to the metric function as `kwargs`. Some metrics have predefined values for `aggregation` and `higher_is_better`, so listing only the metric name can be sufficient. Metrics can be defined in the `metric_list` argument when building the YAML config. Multiple metrics can be listed along with any auxiliary arguments. For example, when using the [`exact_match` metric](https://github.com/huggingface/evaluate/tree/main/metrics/exact_match), auxiliary arguments such as `ignore_case`, `ignore_punctuation`, and `regexes_to_ignore` can be listed as well. They will be added to the metric function as `kwargs`. Some metrics have predefined values for `aggregation` and `higher_is_better`, so listing only the metric name can be sufficient.
``` ```yaml
metric_list: metric_list:
- metric: acc - metric: acc
- metric: exact_match - metric: exact_match
...@@ -216,25 +222,27 @@ metric_list: ...@@ -216,25 +222,27 @@ metric_list:
Here we list all metrics currently supported natively in `lm-eval`: Here we list all metrics currently supported natively in `lm-eval`:
Metrics: Metrics:
* `acc` (accuracy)
* `acc_norm` (length-normalized accuracy) - `acc` (accuracy)
* `acc_mutual_info` (baseline loglikelihood - normalized accuracy) - `acc_norm` (length-normalized accuracy)
* `perplexity` - `acc_mutual_info` (baseline loglikelihood - normalized accuracy)
* `word_perplexity` (perplexity per word) - `perplexity`
* `byte_perplexity` (perplexity per byte) - `word_perplexity` (perplexity per word)
* `bits_per_byte` - `byte_perplexity` (perplexity per byte)
* `matthews_corrcoef` (Matthews correlation coefficient) - `bits_per_byte`
* `f1` (F1 score) - `matthews_corrcoef` (Matthews correlation coefficient)
* `bleu` - `f1` (F1 score)
* `chrf` - `bleu`
* `ter` - `chrf`
- `ter`
Aggregation functions: Aggregation functions:
* `mean`
* `median` - `mean`
* `perplexity` - `median`
* `weighted_perplexity` - `perplexity`
* `bits_per_byte` - `weighted_perplexity`
- `bits_per_byte`
### Adding a Multiple Choice Metric ### Adding a Multiple Choice Metric
...@@ -246,37 +254,41 @@ Adding a multiple choice metric has a few steps. To get it working you need to: ...@@ -246,37 +254,41 @@ Adding a multiple choice metric has a few steps. To get it working you need to:
The default metric and aggregation functions are in `lm_eval/api/metrics.py`, and you can add a function there if it's for general use. The metrics are towards the bottom of the file and look like this: The default metric and aggregation functions are in `lm_eval/api/metrics.py`, and you can add a function there if it's for general use. The metrics are towards the bottom of the file and look like this:
```python
@register_metric(
    metric="mcc",
    higher_is_better=True,
    output_type="multiple_choice",
    aggregation="matthews_corrcoef",
)
def mcc_fn(items):  # This is a passthrough function
    return items
```
Note that many of these are passthrough functions, and for multiple choice (at least) this function is never actually called.

Aggregation functions are defined towards the top of the file; here's an example:
```python
@register_aggregation("matthews_corrcoef")
def matthews_corrcoef(items):
    unzipped_list = list(zip(*items))
    golds = unzipped_list[0]
    preds = unzipped_list[1]
    return sklearn.metrics.matthews_corrcoef(golds, preds)
```
This function returns a single numeric value. The input is defined in `Task.process_results` in `lm_eval/api/task.py`. There's a section that looks like this:
```python
result_dict = {
    **({"acc": acc} if "acc" in use_metric else {}),
    **({"f1": (gold, pred)} if "f1" in use_metric else {}),
    **({"mcc": (gold, pred)} if "mcc" in use_metric else {}),
    **({"acc_norm": acc_norm} if "acc_norm" in use_metric else {}),
    **({"exact_match": exact_match} if "exact_match" in use_metric else {}),
}
```
The value here determines the input to the aggregation function, while the key matches the name of the metric function. These metrics all have simple needs and only require the accuracy or the gold and predicted values, but immediately below this section there are examples of metrics with more complicated needs that you can use as a reference.
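To make the wiring concrete, here is a minimal sketch (not harness code) of how the pieces fit together for a hypothetical agreement metric. The decorator names mirror the examples above, but the metric name `kappa`, the import path, and the use of `sklearn.metrics.cohen_kappa_score` are illustrative assumptions:

```python
# Minimal sketch of registering a new multiple-choice metric.
# NOTE: the import path is an assumption; these are the same decorators used
# in lm_eval/api/metrics.py above, but they may live elsewhere in your version.
import sklearn.metrics

from lm_eval.api.metrics import register_aggregation, register_metric


@register_aggregation("kappa")  # hypothetical aggregation name
def kappa_agg(items):
    # items is the list of (gold, pred) tuples collected by process_results
    golds, preds = zip(*items)
    return sklearn.metrics.cohen_kappa_score(golds, preds)


@register_metric(
    metric="kappa",  # hypothetical metric name
    higher_is_better=True,
    output_type="multiple_choice",
    aggregation="kappa",
)
def kappa_fn(items):  # passthrough, like mcc_fn above
    return items
```

For this sketch to work, `Task.process_results` would also need to emit a `{"kappa": (gold, pred)}` entry, analogous to the `"mcc"` entry shown above.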
...@@ -285,15 +297,19 @@ The value here determines the input to the aggregation function, though the name
Contributing a new task can be daunting! Luckily, much of the work has often been done for you in a different, similarly evaluated task. Good examples of task implementations to study include:

Multiple choice tasks:

- SciQ (`lm_eval/tasks/sciq/sciq.yaml`)

Corpus perplexity evaluations:

- Wikitext (`lm_eval/tasks/wikitext/wikitext.yaml`)

Generative tasks:

- GSM8k (`lm_eval/tasks/gsm8k/gsm8k.yaml`)

Tasks using complex filtering:

- GSM8k with CoT (+ with Self-Consistency): (`lm_eval/tasks/gsm8k/gsm8k-cot.yaml`; `lm_eval/tasks/gsm8k/gsm8k-cot-self-consistency.yaml`)
# Group Configuration
......
...@@ -114,6 +114,14 @@ all = [
"lm_eval[zeno]",
]
[tool.pymarkdown]
plugins.md013.enabled = false # line-length
plugins.md024.allow_different_nesting = true # no-duplicate-headers
plugins.md025.enabled = false # single-header
plugins.md028.enabled = false # no-blanks-blockquote
plugins.md029.allow_extended_start_values = true # ol-prefix
plugins.md034.enabled = false # no-bare-urls
[tool.ruff.lint]
extend-select = ["I"]
......
# Clean Training Data
janitor.py contains a script to remove benchmark data contamination from training data sets.
It uses the approach described in the [GPT-3 paper](https://arxiv.org/abs/2005.14165).

## Algorithm

1) Collects all contamination text files that are to be removed from training data
2) Filters training data by finding `N`-gram matches between the training data
   and any contamination
...@@ -13,7 +16,8 @@ It uses the approach described in the [GPT-3 paper](https://arxiv.org/abs/2005.1
   completely contaminated and removed

OpenAI used:

```text
ngram_n = 13
window_to_remove = 200
minimum_slice_length = 200
```

...@@ -25,12 +29,13 @@ too_dirty_cutoff = 10
Janitor can be used as a pure Python program, but it is much faster if the n-gram code is run in C++. To compile the C++ code, run

```bash
pip install pybind11
c++ -O3 -Wall -shared -std=c++11 -fPIC $(python3 -m pybind11 --includes) janitor_util.cpp -o janitor_util$(python3-config --extension-suffix)
```
MacOS users: If your compiler isn't linked to Python, you may need to add `-undefined dynamic_lookup` to the above command. \
Linux users: If your compiler isn't linked to Python, you may need to follow these steps:

1. Rename the compiled code file to `janitor_util.so`.
2. Before running `import Janitor` in your code, add `sys.path.append("your/relative/path/to/janitor_util.so")` so that Python knows the location of `janitor_util.so`.
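For readers who want a feel for the matching step described above, here is a simplified, pure-Python sketch of 13-gram contamination detection. It is only an illustration under stated assumptions (whitespace tokenization, set-based lookup); it is not the janitor.py or janitor_util implementation, and the function names are made up:

```python
# Illustrative sketch of the n-gram matching step described above
# (not the janitor.py implementation; names and defaults are assumptions).
import re


def ngrams(text, n=13):
    """Yield space-joined n-grams over a crude lowercase tokenization."""
    tokens = re.sub(r"[^a-z0-9]+", " ", text.lower()).split()
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i : i + n])


def build_contamination_index(benchmark_docs, n=13):
    """Collect every n-gram appearing in any benchmark/contamination document."""
    index = set()
    for doc in benchmark_docs:
        index.update(ngrams(doc, n))
    return index


def is_contaminated(training_doc, index, n=13):
    """Flag a training document if any of its n-grams appears in the index."""
    return any(gram in index for gram in ngrams(training_doc, n))
```

In the pipeline described above, matches are not merely flagged: text around each match is removed (`window_to_remove`), and documents with too many matches are treated as completely contaminated and dropped.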
# Task-name # Task-name
## Paper

Title: `paper title goes here`

...@@ -10,10 +10,9 @@ Abstract: `link to paper PDF or arXiv abstract goes here`

Homepage: `homepage to the benchmark's website goes here, if applicable`

### Citation

```text
BibTeX-formatted citation goes here
```

...@@ -35,12 +34,13 @@ BibTeX-formatted citation goes here

### Checklist

For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature? * [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task? * [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test? * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted? * [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates? * [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant? * [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
......