author={Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang and Tang, Eric and Thite, Anish and Wang, Ben and Wang, Kevin and Zou, Andy},
title={A framework for few-shot language model evaluation},
Welcome to the docs for the LM Evaluation Harness!
## Table of Contents
* To learn about the public interface of the library, as well as how to evaluate via the commandline or as integrated into an external library, see the [Interface](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/interface.md)
* To learn how to add a new library, API, or model type to the library, as well as a quick explainer on the types of ways to evaluate an LM, see the [Model Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/model_guide.md).
* For a crash course on adding new tasks to the library, see our [New Task Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/new_task_guide.md).
* To learn more about pushing the limits of task configuration that the Eval Harness supports, see the [Task Configuration Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/task_guide.md).
Alternatively, if you have previously downloaded a dataset from the Hugging Face Hub and saved it locally (using `save_to_disk()`) and wish to use those local files, you will need to set `data_dir` under `dataset_kwargs` to point to that directory.
```
dataset_path: hellaswag
dataset_kwargs:
  data_dir: hellaswag_local/
```
You can also set `dataset_path` as a directory path in your local system. This will assume that there is a loading script with the same name as the directory. [See datasets docs](https://huggingface.co/docs/datasets/loading#local-loading-script).
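For example, assuming a hypothetical local directory `./hellaswag_local/` that contains a loading script named `hellaswag_local.py`, the config could be:
```
dataset_path: ./hellaswag_local
```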
## Writing a Prompt Template
The next thing we need to do is decide what format to use when presenting the data to the LM. This is our **prompt**, where we'll define both an input and output format.
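As a minimal sketch (the `question` and `answer` fields are placeholders for whatever columns your dataset actually provides), the input and output format can be expressed with Jinja templates in the task YAML:
```
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{answer}}"
```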
Open the file specified at the `--output_base_path <path>` and ensure it passes a simple eye test.
## Versioning
One key feature of the LM Evaluation Harness is the ability to version tasks: that is, to mark them with a specific version number that can be bumped whenever a breaking change is made.
This version info can be provided by adding the following to your new task config file:
```
metadata:
  version: 0
```
Now, whenever a change needs to be made to your task in the future, please increase the version number by 1 so that users can distinguish between different iterations of the task.
If you are incrementing a task's version, please also consider adding a changelog to the task's README.md noting the date, PR number, what version you have updated to, and a one-liner describing the change.
For example:
* \[Dec 25, 2023\] (PR #999) Version 0.0 -> 1.0: Fixed a bug with answer extraction that led to underestimated performance.
## Checking performance + equivalence
It's now time to check models' performance on your task! In the evaluation harness, we intend to support a wide range of evaluation tasks and setups, but prioritize the inclusion of already-proven benchmarks following the precise evaluation setups in the literature where possible.
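A quick way to sanity-check a new task is to run it on a small model over a handful of documents; a minimal sketch (the model choice is arbitrary and `<your_task_name>` is a placeholder):
```bash
lm-eval --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks <your_task_name> \
    --device cuda:0 \
    --limit 10 \
    --write_out
```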
...
It is recommended to include a filled-out copy of this checklist in the README.md for your task.
## Submitting your task
You're all set! Now push your work and make a pull request to the `main` branch! Thanks for the contribution :). If there are any questions, please leave a message in the `#lm-thunderdome` channel on the EAI discord!
Adding a multiple choice metric has a few steps. To get it working you need to:
1. register a metric function
2. register an aggregation function
3. update the `Task` definition to make sure the correct arguments are passed
The default metric and aggregation functions are in `lm_eval/api/metrics.py`, and you can add a function there if it's for general use. The metrics are towards the bottom of the file and look like this:
```python
@register_metric(
    metric="mcc",
    higher_is_better=True,
    output_type="multiple_choice",
    aggregation="matthews_corrcoef",
)
def mcc_fn(items):  # This is a passthrough function
    return items
```
Note that many of these are passthrough functions, and for multiple choice (at least) this function is never actually called.
Aggregation functions are defined towards the top of the file; here's an example:
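(The sketch below assumes `register_aggregation` is importable from `lm_eval.api.registry` and that scikit-learn is installed; see `lm_eval/api/metrics.py` for the canonical implementation.)
```python
import sklearn.metrics

from lm_eval.api.registry import register_aggregation


@register_aggregation("matthews_corrcoef")
def matthews_corrcoef(items):
    # items is a list of (gold, pred) pairs collected from process_results
    golds, preds = zip(*items)
    return sklearn.metrics.matthews_corrcoef(golds, preds)
```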
This function returns a single numeric value. The input is defined in `Task.process_results` in `lm_eval/api/task.py`. There's a section that looks like this:
```python
result_dict = {
    **({"acc": acc} if "acc" in use_metric else {}),
    **({"f1": (gold, pred)} if "f1" in use_metric else {}),
    **({"mcc": (gold, pred)} if "mcc" in use_metric else {}),
    **({"acc_norm": acc_norm} if "acc_norm" in use_metric else {}),
    **({"exact_match": exact_match} if "exact_match" in use_metric else {}),
}
```
The value stored here becomes the input to the aggregation function; the dictionary key matches the metric's name. These metrics all have simple needs and just require the accuracy value or the gold and predicted values, but immediately below this section there are examples of metrics with more complicated inputs that you can use as a reference.
## Good Reference Tasks
...
```yaml
task:
  - hendrycksTest*
```
It is also possible to list an existing task in your benchmark configuration with some adjustments. For example, `multimedqa` includes a few tasks from MMLU. There, the `task_alias` and `group_alias` (see [here](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#beautifying-table-display) for more details) are modified to suit the benchmark.
```yaml
group: multimedqa
task:
  - pubmedqa
  - medmcqa
  - medqa_4options
  - task: mmlu_anatomy
    task_alias: "anatomy (mmlu)"
    group_alias: null
  - task: mmlu_clinical_knowledge
    task_alias: "clinical_knowledge (mmlu)"
    group_alias: null
  ...
```
Alternatively, benchmarks can include tasks that are customized inline, defined in the same way a standalone YAML task is usually configured.
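A hypothetical sketch of such an inline definition (all names here are placeholders, and the available keys are the same ones documented for standalone task YAMLs):
```yaml
group: mybenchmark
task:
  - task: my_custom_task
    dataset_path: my_org/my_dataset
    doc_to_text: "{{input}}"
    doc_to_target: "{{target}}"
    num_fewshot: 3
    metric_list:
      - metric: exact_match
```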
"Benchmarking your models is the first step towards making sure your model performs well.\n",
"However, looking at the data behind the benchmark, slicing the data into subsets, and comparing models on individual instances can help you even more in evaluating and quantifying the behavior of your AI system.\n",
"\n",
"All of this can be done in [Zeno](https://zenoml.com)!\n",
"Zeno is super easy to use with the eval harness, let's explore how you can easily upload and visualize your eval results.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Install this project if you did not already do that. This is all that needs to be installed for you to be able to visualize your data in Zeno!\n",
"!pip install -e ..\n",
"!pip install -e ..[zeno]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Run the Eval Harness\n",
"\n",
"To visualize the results, run the eval harness with the `log_samples` and `output_path` flags. We expect `output_path` to contain multiple folders that represent individual model names. You can thus run your evaluation on any number of tasks and models and upload all of the results as projects on Zeno.\n"
"This is so you can be authenticated with Zeno.\n",
"If you don't already have a Zeno account, first create an account on [Zeno Hub](https://hub.zenoml.com).\n",
"After logging in to Zeno Hub, generate your API key by clicking on your profile at the bottom left to navigate to your account page.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%env ZENO_API_KEY=YOUR_API_KEY"
]
},
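{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a reference, a run could look like the following sketch (the model, task, and output path are placeholders; adjust them to your setup):\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative example: --log_samples and --output_path make the harness write\n",
"# per-sample results to disk so they can be uploaded to Zeno afterwards.\n",
"!lm_eval --model hf \\\n",
"    --model_args pretrained=EleutherAI/gpt-neo-125m \\\n",
"    --tasks hellaswag \\\n",
"    --limit 10 \\\n",
"    --log_samples \\\n",
"    --output_path output/gpt-neo-125m\n"
]
},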
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Visualize Eval Results\n",
"\n",
"You can now use the `zeno_visualize` script to upload the results to Zeno.\n",
"\n",
"This will use all subfolders in `data_path` as different models and upload all tasks within these model folders to Zeno. If you run the eval harness on multiple tasks, the `project_name` will be used as a prefix and one project will be created per task.\n"
help="Acceptable values are 'auto', 'auto:N' or N, where N is an integer. Default 1.",
)
parser.add_argument(
parser.add_argument(
"--max_batch_size",
"--max_batch_size",
type=int,
type=int,
default=None,
default=None,
help="Maximal batch size to try with --batch_size auto",
metavar="N",
help="Maximal batch size to try with --batch_size auto.",
)
)
parser.add_argument(
parser.add_argument(
"--device",
"--device",
type=str,
type=str,
default=None,
default=None,
help="Device to use (e.g. cuda, cuda:0, cpu)",
help="Device to use (e.g. cuda, cuda:0, cpu).",
)
)
parser.add_argument(
parser.add_argument(
"--output_path",
"--output_path",
"-o",
default=None,
default=None,
type=str,
type=str,
metavar="= [dir/file.jsonl] [DIR]",
metavar="DIR|DIR/file.json",
help="The path to the output file where the result metrics will be saved. If the path is a directory and log_samples is true, the results will be saved in the directory. Else the parent directory will be used.",
help="The path to the output file where the result metrics will be saved. If the path is a directory and log_samples is true, the results will be saved in the directory. Else the parent directory will be used.",
)
)
parser.add_argument(
parser.add_argument(
"--limit",
"--limit",
"-L",
type=float,
type=float,
default=None,
default=None,
metavar="N|0<N<1",
help="Limit the number of examples per task. "
help="Limit the number of examples per task. "
"If <1, limit is a percentage of the total number of examples.",
"If <1, limit is a percentage of the total number of examples.",
)
)
parser.add_argument(
parser.add_argument(
"--use_cache",
"--use_cache",
"-c",
type=str,
type=str,
default=None,
default=None,
metavar="DIR",
help="A path to a sqlite db file for caching model responses. `None` if not caching.",
help="A path to a sqlite db file for caching model responses. `None` if not caching.",
)
)
parser.add_argument("--decontamination_ngrams_path",default=None)# TODO: not used
parser.add_argument("--decontamination_ngrams_path",default=None)# TODO: not used
parser.add_argument(
parser.add_argument(
"--check_integrity",
"--check_integrity",
action="store_true",
action="store_true",
help="Whether to run the relevant part of the test suite for the tasks",
help="Whether to run the relevant part of the test suite for the tasks.",
)
)
parser.add_argument(
parser.add_argument(
"--write_out",
"--write_out",
"-w",
action="store_true",
action="store_true",
default=False,
default=False,
help="Prints the prompt for the first few documents",
help="Prints the prompt for the first few documents.",
)
)
parser.add_argument(
parser.add_argument(
"--log_samples",
"--log_samples",
"-s",
action="store_true",
action="store_true",
default=False,
default=False,
help="If True, write out all model outputs and documents for per-sample measurement and post-hoc analysis",
help="If True, write out all model outputs and documents for per-sample measurement and post-hoc analysis. Use with --output_path.",
f"{utils.SPACING}Try `lm-eval --tasks list` for list of available tasks",
f"{utils.SPACING}Try `lm-eval --tasks list` for list of available tasks",
)
)
raise ValueError(
    f"Tasks not found: {missing}. Try `lm-eval --tasks list` for list of available tasks, or '--verbosity DEBUG' to troubleshoot task registration issues."
)
f'Both target_delimiter "{self.config.target_delimiter}" and target choice: "{choice}" do not have whitespace, ignore if the language you are evaluating on does not require/use whitespace'