For more detail on the task system and advanced features, see [`docs/task_guide.md`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md). If none of the above sounds like it applies to your task, it's time to continue on to checking your task performance!
### Task name + tags (registering a task)
To test a task conveniently, it helps to *register* the task--that is, to give it a name and make the `lm-eval` library aware it exists!
...
```yaml
task: <name of the task>
```
Including a task name is mandatory.
It is often also convenient to label your task with several `tag` values, though this field is optional:
```yaml
tag:
  - tag1
  - tag2
```
This will add your task to the `tag1` and `tag2` tags, letting users know how to categorize your task and, if desired, run all tasks with one of these tags at once, your task along with them.
If your task is not in the `lm_eval/tasks` folder, you'll need to tell the Eval Harness where to look for YAML files.
...
Passing `--tasks /path/to/yaml/file` is also accepted.
You can build a more complete group config while also tailoring parameters for individual tasks.
For example, let's build a config for evaluating MMLU and a few natural language inference tasks. For MMLU, we can write the name of the benchmark as a subtask listed under `task`. You can also configure parameters such as `num_fewshot`. If the task being configured is a group such as `mmlu` or `super_glue`, the parameter set will be applied to all of its subtasks. By default, group configs do not aggregate the scores from the tasks they contain; to get an aggregate score, use `aggregate_metric` and set it to `true`.
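A minimal sketch of what such a config might look like is shown below (the group names `nli_and_mmlu` and `nli_tasks`, and the NLI task names `cb`, `anli_r1`, and `rte`, are illustrative and assume those tasks are already registered; `mmlu` and `num_fewshot` are as described above):

```yaml
group: nli_and_mmlu
task:
  - group: nli_tasks
    task:
      - cb
      - anli_r1
      - rte
  - task: mmlu
    num_fewshot: 2
```
Here `num_fewshot: 2` applies to the `mmlu` group as a whole, so each of its subtasks is run 2-shot.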
Tasks are configured via the `TaskConfig` object. Below, we describe all fields usable within the object, and their role in defining a task.
Task naming + registration:
- **task** (`str`, defaults to None) — name of the task.
- **task_alias** (`str`, defaults to None) - Alias of the task name that will be printed in the final table results.
- **tag** (`str`, *optional*) — name of the task tag(s) a task belongs to. Enables one to run all tasks with a specified tag name at once.
Dataset configuration options:
- **dataset_path** (`str`) — The name of the dataset as listed by HF in the datasets Hub.
...
Generative tasks:
Tasks using complex filtering:
- GSM8k with CoT (+ with Self-Consistency): (`lm_eval/tasks/gsm8k/gsm8k-cot.yaml` ; `lm_eval/tasks/gsm8k/gsm8k-cot-self-consistency.yaml`)
# Group Configuration
When evaluating a language model, it's not unusual to test across a number of tasks that may not be related to one another, in order to assess a variety of capabilities. To this end, it can be cumbersome to have to list the full set of tasks, or to add a new group name to the yaml of each individual task.
To solve this, we can create a group yaml config. This is a config that contains the names of the tasks that should be included in a particular group. The config consists of two main keys: `group`, which denotes the name of the group, and `task`, which is where we list the tasks. The tasks listed under `task` are task names that have already been registered. A good example would be the list of tasks used to evaluate the Pythia Suite.
## Configurations
Group configs are specified via their own set of fields. Below, we describe all fields usable within a group config, and their role in defining a group.
### Parameters
- **group** (`str`, defaults to `None`) — name of the group.
- **group_alias** (`str`, defaults to `None`) - Alternative name for the group that will be printed in the table output.
- **task** (`Union[str, list]`, defaults to `None`) - List of tasks that constitute the group.
- **tag_to_task** (`bool`, defaults to `False`) - Whether tags listed under `task` should be treated as groups.
- **aggregate_metric** (`bool`, defaults to `False`) - If `True`, aggregate the scores from all metrics across the tasks in the group.
- **aggregate_fn** (`str`, defaults to `"mean"`) - Type of aggregation performed; defaults to averaging the scores per metric.
- **weight_by_size** (`bool`, defaults to `False`) - Used together with `aggregate_metric`; per-task scores are weighted by the number of samples in each task rather than averaged equally across tasks.
- **metric_alias** - Still in development.
- **version** (`int`, defaults to `0`) - Version of the group config.
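As a sketch of how several of these fields fit together, a group config that requests aggregated scores might look like the following (the group name `my_group` and the task names `taskA` and `taskB` are placeholders; the field names follow the parameter list above):

```yaml
group: my_group
group_alias: My Group
task:
  - taskA
  - taskB
aggregate_metric: true
aggregate_fn: mean
weight_by_size: false
version: 0
```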
The simplest usage of a group yaml is to just list all tasks we want in one group.
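For example, a sketch of such a minimal group config (the group and task names here are illustrative and assume the tasks are already registered):

```yaml
group: nli_tasks
task:
  - cb
  - anli_r1
  - rte
```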