@@ -19,7 +19,7 @@ Tasks are configured via the `TaskConfig` object. Below, we describe all fields
...
- **reference** (`str`, *optional*) —
- **dataset_path** (`str`) — The name of the dataset as listed by HF in the datasets Hub.
- **dataset_name** (`str`, *optional*, defaults to None) — The name of what HF calls a “data instance” or sub-task of the benchmark. If your task does not contain any data instances, just leave this as its default of None. (If you're familiar with the HF `datasets.load_dataset` function, these are simply the first two arguments to it.)
- **dataset_kwargs** (`dict`, *optional*) — Auxiliary arguments that `datasets.load_dataset` accepts. This can be used to specify arguments such as `data_files` or `data_dir` if you want to use local data files such as JSON or CSV (see the sketch after this list).
- **training_split** (`str`, *optional*) — Split in the dataset to use as the training split.
- **validation_split** (`str`, *optional*) — Split in the dataset to use as the validation split.
- **test_split** (`str`, *optional*) — Split in the dataset to use as the test split.
...
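To make the dataset-related fields concrete, here is a minimal sketch of how they might appear in a task YAML. The dataset shown (`gsm8k` with its `main` config) and the split names are purely illustrative; substitute your own dataset.

```yaml
# Illustrative values only; substitute your own dataset and split names.
dataset_path: gsm8k    # dataset name as listed on the HF Hub (first argument to datasets.load_dataset)
dataset_name: main     # HF "data instance" / config name; omit if the dataset has none
training_split: train
test_split: test
# To load local JSON/CSV files instead of a Hub dataset, pass load_dataset
# arguments through dataset_kwargs, e.g.:
# dataset_kwargs:
#   data_files: path/to/my_data.csv
```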
@@ -169,7 +169,7 @@ You can find an example of how to use this feature at [gsm8k-cot-self-consistenc
...
## Passing Arguments to Metrics
Metrics can be defined in the `metric_list` argument when building the YAML config. Multiple metrics can be listed along with any auxiliary arguments. For example, when setting the [`exact_match` metric](https://github.com/huggingface/evaluate/tree/main/metrics/exact_match), auxiliary arguments such as `ignore_case`, `ignore_punctuation`, and `regexes_to_ignore` can be listed as well; they will be passed to the metric function as `kwargs`. Some metrics have predefined values for `aggregation` and `higher_is_better`, so listing only the metric name can be sufficient.
```
metric_list:
...
@@ -225,4 +225,3 @@ Generative tasks:
...
Tasks using complex filtering:
- GSM8k with CoT (+ with Self-Consistency): (`lm_eval/tasks/gsm8k/gsm8k-cot.yaml` ; `lm_eval/tasks/gsm8k/gsm8k-cot-self-consistency.yaml`)
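For orientation, a filtering pipeline in one of those YAMLs looks roughly like the sketch below. Treat the key names and the regex as assumptions to be checked against the referenced `gsm8k-cot` files rather than as a definitive schema.

```yaml
# Rough sketch of a filter pipeline (key names and regex are assumptions;
# see the gsm8k-cot YAMLs referenced above for the real configuration).
filter_list:
  - name: "get-answer"
    filter:
      - function: "regex"
        regex_pattern: "The answer is (\\-?[0-9\\.\\,]+)"
      - function: "take_first"
```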
@@ -21,21 +21,16 @@ As a concrete example, we'll walk through reimplementing the `gsm8k` benchmark (
...
## Creating a YAML file
To implement a new standard task, we'll need to write a YAML file which configures our task logic. We start by making a new empty YAML file. This file can have any name, but we recommend placing it in a subfolder of `lm_eval/tasks` titled by the dataset or task's shorthand name: for example,

```sh
touch lm_eval/tasks/new_generative_task.yaml
```

Or, copy the template subfolder we provide from `templates/new_yaml_task`:

```sh
cp -r templates/new_yaml_task lm_eval/tasks/
```

and rename the folders and YAML file(s) as desired.
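To give a sense of where this is heading, a finished task YAML often ends up looking something like the sketch below. The dataset and metric fields are the ones described in this guide; the prompt-formatting fields (`doc_to_text`, `doc_to_target`) and their Jinja-style templates are assumptions about the harness's schema, and every value is an illustrative placeholder.

```yaml
# Illustrative skeleton; all values are placeholders for your own task.
task: new_generative_task
dataset_path: gsm8k        # HF Hub dataset, or local files via dataset_kwargs
dataset_name: main
training_split: train
test_split: test
doc_to_text: "Question: {{question}}\nAnswer:"   # assumed Jinja-style prompt template
doc_to_target: "{{answer}}"
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
```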
### Selecting and configuring a dataset
...
@@ -241,16 +236,17 @@ The checklist is the following:
...
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
  * [ ] Have you referenced the original paper that introduced the task?
  * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
It is recommended to include a filled-out copy of this checklist in the README.md for the subfolder you are creating, if you have created a new subfolder in `lm_eval/tasks`.
## Submitting your task
You're all set! Now push your work and make a pull request to the `big-refactor` branch! Thanks for the contribution :). If there are any questions, please leave a message in the `#lm-thunderdome` channel on the EAI discord!
This list keeps track of which tasks' implementations have been ported to YAML / v2.0 of the Eval Harness.
Boxes should be checked iff tasks are implemented in the refactor and tested for regression. Tasks should be struck through if checked *against original introducing paper* implementation or popularizing implementation.
- [ ] Glue (WIP)
- [x] SuperGlue
- [ ] CoQA
- [ ] DROP
- [x] ~~Lambada~~
...
@@ -31,7 +31,7 @@ Boxes should be checked iff tasks are implemented in v2.0 and tested for regress
...