# Task Configuration

The `lm-evaluation-harness` is meant to be an extensible and flexible framework within which many different evaluation tasks can be defined. All tasks in the new version of the harness are built around a YAML configuration file format.

These YAML configuration files, along with the current codebase commit hash, are intended to be shareable: providing the YAML config enables another researcher to precisely replicate your evaluation setup, in cases where the prompt or setup differs from the standard `lm-eval` task implementations.

While adding a standard evaluation task on a new dataset can occasionally be as simple as swapping out a Hugging Face dataset path in an existing file, more specialized evaluation setups are also possible. Here we'll provide a crash course on the more advanced logic that users can implement in YAML form.

If your intended task relies on features beyond what is described in this guide, we'd love to hear about it! Feel free to open an issue describing the scenario on GitHub, create a PR to the project with a proposed implementation, or ask in the `#lm-thunderdome` channel on the EleutherAI Discord.

## Configurations

Tasks are configured via the `TaskConfig` object. Below, we describe all fields usable within the object, and their role in defining a task.

### Parameters

Task naming + registration (example below):
- **task** (`str`, defaults to None) — name of the task.
- **group** (`str`, *optional*) — name of the task group(s) a task belongs to. Enables one to run all tasks with a specified tag or group name at once.

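For instance, a task might declare its name and group membership as follows. This is only a sketch; the task and group names are hypothetical:

```yaml
task: my_new_task        # hypothetical task name, used with --tasks
group: my_task_group     # optional; all tasks sharing a group name can be run at once
```
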
Dataset configuration options (example below):
- **dataset_path** (`str`) — The name of the dataset as listed by HF in the datasets Hub.
- **dataset_name**  (`str`, *optional*, defaults to None) — The name of what HF calls a “data instance” or sub-task of the benchmark. If your task does not contain any data instances, just leave this to default to None. (If you're familiar with the HF `datasets.load_dataset` function, these are just the first 2 arguments to it.)
- **dataset_kwargs** (`dict`, *optional*) — Auxiliary arguments that `datasets.load_dataset` accepts. This can be used to specify arguments such as `data_files` or `data_dir` if you want to use local datafiles such as json or csv.
- **training_split** (`str`, *optional*) — Split in the dataset to use as the training split.
- **validation_split** (`str`, *optional*) — Split in the dataset to use as the validation split.
- **test_split** (`str`, *optional*) — Split in the dataset to use as the test split.
- **fewshot_split** (`str`, *optional*) — Split in the dataset to draw few-shot exemplars from. This must not be None if `num_fewshot > 0`.
- **process_docs** (`Callable`, *optional*) — Optionally define a function to apply to each HF dataset split, to preprocess all documents before they are fed into prompt template rendering or other evaluation steps. Can be used to rename dataset columns, or to process documents into a format closer to that expected by a prompt template.

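As an illustration, a task reading local JSON files through the generic Hugging Face `json` loader might configure its dataset as below. This is only a sketch: the file paths and split names are hypothetical.

```yaml
dataset_path: json               # generic HF loader for local JSON files
dataset_kwargs:
  data_files:
    train: data/my_task/train.json
    test: data/my_task/test.json
training_split: train
test_split: test
fewshot_split: train             # must be set (non-None) whenever num_fewshot > 0
```
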
Prompting / in-context formatting options (example below):
- **use_prompt** (`str`, *optional*) — Name of the prompt in Promptsource to use. If defined, this will overwrite `doc_to_text`, `doc_to_target`, and `doc_to_choice`.
- **description** (`str`, *optional*) — An optional Jinja2 template or string which will be prepended to the few-shot examples passed into the model, often describing the task or providing instructions to a model, such as `"The following are questions (with answers) about {{subject}}.\n\n"`. No delimiters or spacing are inserted between the description and the first few-shot example.
- **doc_to_text** (`Union[Callable, str]`, *optional*) — Jinja2 template, string, or function to process a sample into the appropriate input for the model.
- **doc_to_target** (`Union[Callable, str]`, *optional*) — Jinja2 template, string, or function to process a sample into the appropriate target output for the model. For multiple choice tasks, this should return an index into the answer choice list of the correct answer.
- **doc_to_choice** (`Union[Callable, str]`, *optional*) — Jinja2 template, string, or function to process a sample into a list of possible string choices for `multiple_choice` tasks. Left undefined for `generate_until` tasks.
- **fewshot_delimiter** (`str`, *optional*, defaults to "\n\n") — String to insert between few-shot examples.
- **target_delimiter** (`str`, *optional*, defaults to `" "`) — String to insert between input and target output for the datapoint being tested.

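For example, a multiple-choice task whose documents carry `question`, `choices`, and `label` columns (hypothetical column names, shown only as a sketch) might format its prompts like this:

```yaml
description: "Answer the following multiple-choice questions.\n\n"
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_choice: "{{choices}}"   # list of candidate answer strings
doc_to_target: "{{label}}"     # index of the correct choice
target_delimiter: " "          # inserted between the input and the target
fewshot_delimiter: "\n\n"      # inserted between few-shot examples
```
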
Runtime configuration options:
- **num_fewshot** (`int`, *optional*, defaults to 0) — Number of few-shot examples before the input.
- **batch_size** (`int`, *optional*, defaults to 1) — Batch size.

Scoring details (example below):
- **metric_list** (`list`, *optional*, defaults to None) — A list of metrics to use for evaluation. See docs for expected format.
- **output_type** (`str`, *optional*, defaults to "generate_until") — Selects the type of model output for the given task. Options are `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`.
- **generation_kwargs** (`dict`, *optional*) — Auxiliary arguments for the `generate` function from HF transformers library. Advanced keyword arguments may not be supported for non-HF LM classes.
- **repeats** (`int`, *optional*, defaults to 1) — Number of repeated runs through the model for each sample. Can be used for cases such as self-consistency.
- **filter_list** (`Union[str, list]`, *optional*) — List of filters to postprocess model outputs. See below for further detail on the filter API.
- **should_decontaminate** (`bool`, *optional*, defaults to False) - Whether to decontaminate or not.
- **doc_to_decontamination_query** (`str`, *optional*) — Query for decontamination if `should_decontaminate` is True. If `should_decontaminate` is True but `doc_to_decontamination_query` is `None`, `doc_to_decontamination_query` will follow `doc_to_text`.

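Putting several of these fields together, a free-form generation task might be scored as in the sketch below; the stop sequence, regex, and metric choice are illustrative rather than prescriptive:

```yaml
output_type: generate_until
generation_kwargs:
  until:
    - "\n\n"                   # stop sequence(s) for generation
  do_sample: false
repeats: 1
filter_list:
  - name: "get-answer"
    filter:
      - function: "regex"
        regex_pattern: "(-?[0-9]+)"
      - function: "take_first"
metric_list:
  - metric: exact_match
```
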
Other:
- **metadata** (`dict`, *optional*) — An optional field where arbitrary metadata can be passed. Most tasks should include a `version` key in this field, used to denote the version of the YAML config. Another special metadata key is `num_fewshot`, which overrides the printed `n-shot` table column for a task.

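For example, a config's version might be recorded as:

```yaml
metadata:
  version: 1.0
```
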
## Filters

Filters allow you to post-process model outputs before they are scored. Below, we explain what filters are and where they sit in the evaluation pipeline.

A key component of the `lm-evaluation-harness` library is the `Filter` object. In a typical evaluation run of the harness, we take the formatted inputs and run them through our LM, with the appropriate output type (greedy or free-form generation, or loglikelihood-based comparative scoring).

After getting scores or output text from our LM on each `Instance` or document in the dataset, we then need to feed these responses into a metric or scoring function to return scores to a user.

However, certain tasks may require more complex behavior than directly turning over model outputs to a metric function. For example, we may want to post-process our output text by truncating it or extracting a model's answer, we may want to ensemble over multiple "takes" on the same document, et cetera.

**Detailed Aside**:
We do such post-processing by operating on *responses*, which are stored after running an LM on an `Instance` from the task in `Instance.resps`.

`resps` is a `List[str]` for each instance, and we pass a `List[List[<expected return type from model>]]` to our filters that is a list of `[instance.resps for instance in instances]`.

Our filters, after completing a pipeline, must return a `List[<expected return type from model>]`, which we then unpack, storing each element in `Instance.filtered_resps` for the corresponding instance. Thus, for each doc we take as input a list of model responses, and must return a single model response for that doc *without it being wrapped in a list*.

**End Aside**


A full list of supported filter operations can be found in `lm_eval/filters/__init__.py`. Contributions of new filter types are welcome!

### Multiple Filter Pipelines

Tasks need not be limited to a single filter pipeline. We enable users to run multiple, distinct, filter pipelines on *the same model outputs* generated in one run on a task.

As a case study, let's look at an implementation of solving the GSM8k math word problem benchmark in `lm_eval/tasks/gsm8k/gsm8k-cot-self-consistency.yaml`. Here, we are emulating the setup used by [Self-Consistency Improves Chain of Thought Prompting](https://arxiv.org/abs/2203.11171), in which evaluation is performed by generating N chain-of-thought outputs from a model via temperature-based sampling, extracting the answer the model gives at the end of each chain of thought, and then majority voting across those numeric answers.

Within our YAML file:

```yaml
...
repeats: 64
filter_list:
  - name: "score-first"
    filter:
      - function: "regex"
        regex_pattern: "The answer is (\\-?[0-9\\.\\,]*[0-9]+)"
      - function: "take_first"
  - name: "maj@64"
    filter:
      - function: "regex"
        regex_pattern: "The answer is (\\-?[0-9\\.\\,]*[0-9]+)"
      - function: "majority_vote"
      - function: "take_first"
  - name: "maj@8"
    filter:
      - function: "take_first_k"
        k: 8
      - function: "regex"
        regex_pattern: "The answer is (\\-?[0-9\\.\\,]*[0-9]+)"
      - function: "majority_vote"
      - function: "take_first"
```

We are able to provide multiple different filter pipelines, each with their own name and list of filters to apply in sequence.

Our first filter pipeline, "score-first", implements:
- applying a regex to the model generations (extracting the number within the phrase "The answer is (number)")
- selecting only the first out of the 64 model answers

We then score this single answer.

```yaml
- name: "score-first"
  filter:
    - function: "regex"
      regex_pattern: "The answer is (\\-?[0-9\\.\\,]*[0-9]+)"
    - function: "take_first"
```

Our second filter pipeline, "maj@64", does majority voting across all 64 answers via:
- applying the same regex to all responses, to get the numerical answer from the model for each of the 64 responses per problem
- applying majority voting to all responses, which then returns a length-1 `[<majority answer>]` list for each document
- taking the first element of this length-1 list, to then score the sole response `<majority answer>` for each document.

```yaml
- name: "maj@64"
  filter:
    - function: "regex"
      regex_pattern: "The answer is (\\-?[0-9\\.\\,]*[0-9]+)"
    - function: "majority_vote"
    - function: "take_first"
```

Our final filter pipeline, "maj@8", does majority voting across the first 8 of the model's responses per document via:
- subsetting the len-64 list of responses `[answer1, answer2, ..., answer64]` to `[answer1, answer2, ..., answer8]` for each document
- performing the same sequence of filters on these new sets of 8 responses, for each document.
```yaml
- name: "maj@8"
  filter:
    - function: "take_first_k"
      k: 8
    - function: "regex"
      regex_pattern: "The answer is (\\-?[0-9\\.\\,]*[0-9]+)"
    - function: "majority_vote"
    - function: "take_first"
```

Thus, given the 64 responses from our LM on each document, we can report metrics on these responses in these 3 different ways, as defined by our filter pipelines.


### Adding a custom filter

Just as you can add a custom model with the `register_model` decorator, you can register a custom filter with `register_filter`. For example:

```python
from lm_eval.api.filter import Filter
from lm_eval.api.registry import register_filter

@register_filter("new_filter")
class NewFilter(Filter):
    ...
```
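Once registered, the filter can be referenced by the name passed to `register_filter` inside a task's `filter_list`. A hypothetical sketch, reusing the `new_filter` name registered above:

```yaml
filter_list:
  - name: "custom-pipeline"        # arbitrary pipeline name
    filter:
      - function: "new_filter"     # the filter registered above
      - function: "take_first"
```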



## Embedded Python Code

You can use Python functions for certain arguments by using the `!function` operator after the argument name, followed by `<filename>.<pythonfunctionname>`. This feature can be used for the following arguments (an example follows the list):
1. `doc_to_text`
2. `doc_to_target`
3. `doc_to_choice`
4. `aggregation` for a `metric` in `metric_list`

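For example, assuming a `utils.py` placed alongside the task YAML that defines `doc_to_text(doc)` and `doc_to_target(doc)` functions (the file and function names here are purely illustrative), the config could reference them as:

```yaml
doc_to_text: !function utils.doc_to_text
doc_to_target: !function utils.doc_to_target
```
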
## (No Longer Recommended) Direct `Task` Subclassing

The prior method for implementing new tasks was to subclass `Task`. While we intend to migrate all tasks to the new YAML implementation option going forward, it remains possible to subclass the `Task` class and implement custom logic. For more information, see `docs/task_guide.md` in v0.3.0 of the `lm-evaluation-harness`.


## Including a Base YAML

You can base a YAML on another YAML file as a template. This can be handy when you only need to change the prompt for `doc_to_text` but keep the rest the same, or to swap out `filters` to compare which works better. Simply use `include` in the YAML file and write the name of the template you want to base it on. This assumes that the base template is in the same directory; otherwise, you will need to provide the full path.
```yaml
include: <YAML filename or full path to it>
...
```
You can find an example of how to use this feature at [gsm8k-cot-self-consistency.yaml](https://github.com/EleutherAI/lm-evaluation-harness/blob/3c07cc04a92fc467d7c9a94894aeddd58c93a5da/lm_eval/tasks/gsm8k/gsm8k-cot-self-consistency.yaml), which is based on [gsm8k-cot.yaml](https://github.com/EleutherAI/lm-evaluation-harness/blob/3c07cc04a92fc467d7c9a94894aeddd58c93a5da/lm_eval/tasks/gsm8k/gsm8k-cot.yaml).

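As a sketch, a variant task that inherits from a (hypothetical) base config and overrides only its name and prompt might look like:

```yaml
include: my_base_task.yaml           # hypothetical base config in the same directory
task: my_task_alternate_prompt
doc_to_text: "Q: {{question}}\nA:"   # assumes a `question` column in the dataset
```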

## Passing Arguments to Metrics

Metrics can be defined in the `metric_list` argument when building the YAML config. Multiple metrics can be listed along with any auxiliary arguments. For example, when using the [`exact_match` metric](https://github.com/huggingface/evaluate/tree/main/metrics/exact_match), auxiliary arguments such as `ignore_case`, `ignore_punctuation`, and `regexes_to_ignore` can be listed as well; they will be passed to the metric function as `kwargs`. Some metrics have predefined values for `aggregation` and `higher_is_better`, so listing only the metric name can be sufficient.

```yaml
metric_list:
  - metric: acc
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: false
    regexes_to_ignore:
      - ","
      - "\\$"
```

### Natively Supported Metrics

Here we list all metrics currently supported natively in `lm-eval`:

Metrics:
* `acc` (accuracy)
* `acc_norm` (length-normalized accuracy)
* `acc_mutual_info` (baseline loglikelihood - normalized accuracy)
* `perplexity`
* `word_perplexity` (perplexity per word)
* `byte_perplexity` (perplexity per byte)
* `bits_per_byte`
* `matthews_corrcoef` (Matthews correlation coefficient)
* `f1` (F1 score)
* `bleu`
* `chrf`
* `ter`

Aggregation functions:
* `mean`
* `median`
* `perplexity`
* `weighted_perplexity`
* `bits_per_byte`

### Adding a Multiple Choice Metric

Adding a multiple choice metric has a few steps. To get it working you need to:

1. register a metric function
2. register an aggregation function
3. update the `Task` definition to make sure the correct arguments are passed

The default metric and aggregation functions are in `lm_eval/api/metrics.py`, and you can add a function there if it's for general use. The metrics are towards the bottom of the file and look like this:


    @register_metric(
        metric="mcc",
        higher_is_better=True,
        output_type="multiple_choice",
        aggregation="matthews_corrcoef",
    )
    def mcc_fn(items):  # This is a passthrough function
        return items

Note that many of these are passthrough functions, and for multiple choice (at least) this function is never actually called.

Aggregation functions are defined towards the top of the file, here's an example:

    @register_aggregation("matthews_corrcoef")
    def matthews_corrcoef(items):
        unzipped_list = list(zip(*items))
        golds = unzipped_list[0]
        preds = unzipped_list[1]
        return sklearn.metrics.matthews_corrcoef(golds, preds)

This function returns a single numeric value. The input is defined in `Task.process_results` in `lm_eval/api/task.py`. There's a section that looks like this:


    result_dict = {
        **({"acc": acc} if "acc" in use_metric else {}),
        **({"f1": (gold, pred)} if "f1" in use_metric else {}),
        **({"mcc": (gold, pred)} if "mcc" in use_metric else {}),
        **({"acc_norm": acc_norm} if "acc_norm" in use_metric else {}),
        **({"exact_match": exact_match} if "exact_match" in use_metric else {}),
    }

The value here determines the input to the aggregation function, while the key matches the name of the metric function. These metrics all have simple needs (just the accuracy, or the gold and predicted values), but immediately below this there are examples of metrics with more complicated needs that you can use as a reference.

## Good Reference Tasks

Contributing a new task can be daunting! Luckily, much of the work has often been done for you in a different, similarly evaluated task. Good examples of task implementations to study include:

Multiple choice tasks:
- SciQ (`lm_eval/tasks/sciq/sciq.yaml`)

Corpus perplexity evaluations:
- Wikitext (`lm_eval/tasks/wikitext/wikitext.yaml`)

Generative tasks:
- GSM8k (`lm_eval/tasks/gsm8k/gsm8k.yaml`)

Tasks using complex filtering:
- GSM8k with CoT (+ with Self-Consistency): (`lm_eval/tasks/gsm8k/gsm8k-cot.yaml` ; `lm_eval/tasks/gsm8k/gsm8k-cot-self-consistency.yaml`)


## Benchmarks

When evaluating a language model, it is not unusual to test across a number of tasks that may not be related to one another in order to assess a variety of capabilities. To this end, it can be cumbersome to have to list the full set of tasks, or to add a new group name to the YAML of each individual task.

To solve this, we can create a benchmark YAML config. This is a config that contains the names of the tasks that should be included in a particular benchmark. The config consists of two main keys: `group`, which denotes the name of the benchmark, and `task`, where we list the tasks. The tasks listed under `task` are registered task names. A good example is the list of tasks used to evaluate the Pythia Suite.

```yaml
group: pythia
task:
  - lambada_openai
  - wikitext
  - piqa
  - sciq
  - wsc
  - winogrande
  - arc
  - logiqa
  - blimp
  - hendrycksTest*
```

It is also possible to list an existing task in your benchmark configuration with some adjustments. For example, a few tasks from MMLU are included in `multimedqa`. There, the `task_alias` and `group_alias` (see [here](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#beautifying-table-display) for more details) are modified to suit the benchmark.

```yaml
group: multimedqa
task:
  - pubmedqa
  - medmcqa
  - medqa_4options
  - task: mmlu_anatomy
    task_alias: "anatomy (mmlu)"
    group_alias: null
  - task: mmlu_clinical_knowledge
    task_alias: "clinical_knowledge (mmlu)"
    group_alias: null
  ...
```

Alternatively, benchmarks can define tasks inline and customize each one. These entries are written the same way a standalone task YAML would be.

```yaml
group: t0_eval
task:
  # Coreference Resolution
  - dataset_path: super_glue
    dataset_name: wsc.fixed
    use_prompt: promptsource:*
    training_split: train
    validation_split: validation
    metric_list:
      - metric: exact_match
        aggregation: mean
        higher_is_better: true
        ignore_case: true
        ignore_punctuation: true
  # Coreference Resolution
  - dataset_path: winogrande
    dataset_name: winogrande_xl
    use_prompt: promptsource:*
    training_split: train
    validation_split: validation
    metric_list:
      - metric: exact_match
        aggregation: mean
        higher_is_better: true
        ignore_case: true
        ignore_punctuation: true
  ...
```

If the benchmark contains the same dataset but with different configurations, use `task` to differentiate between them. For example, T0-Eval evaluates on 3 versions of ANLI, but the Hugging Face dataset collects them all in a single dataset.

```yaml
group: t0_eval
task:
  ...
  - task: anli_r1
    dataset_path: anli
    use_prompt: promptsource:*
    training_split: train_r1
    validation_split: dev_r1
    metric_list:
      - metric: exact_match
        aggregation: mean
        higher_is_better: true
        ignore_case: true
        ignore_punctuation: true
  - task: anli_r2
    dataset_path: anli
    use_prompt: promptsource:*
    training_split: train_r2
    validation_split: dev_r2
    metric_list:
      - metric: exact_match
        aggregation: mean
        higher_is_better: true
        ignore_case: true
        ignore_punctuation: true
```

Calling the benchmark is done the same way we would call any task with `--tasks`. Benchmark configs can be added in `lm_eval/tasks/benchmarks/`.