Explain: What are filters? What is their place in the pipeline?
A key component of the `lm-evaluation-harness` library is the `Filter` object. In a typical evaluation run of the harness, we take the formatted inputs and run them through our LM, with the appropriate output type (greedy or free-form generation, or loglikelihood-based comparative scoring).
After getting scores or output text from our LM on each `Instance` or document in the dataset, we then need to feed these responses into a metric or scoring function to return scores to a user.
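Filters sit between these two steps: after the raw responses are collected, one or more filter pipelines post-process them (for example, extracting a final answer from a free-form generation) before the results are handed to the metric. As a rough sketch, a filter pipeline can be declared in a task's yaml along the lines of the block below, loosely modeled on the regex-extraction setup used by generative math tasks; the pipeline name and regex pattern here are illustrative rather than taken from any particular task config:

```yaml
filter_list:
  - name: get-answer              # illustrative name for this filter pipeline
    filter:
      - function: regex           # extract an answer span from the generation
        regex_pattern: "(-?[0-9.,]+)"   # illustrative pattern for a numeric answer
      - function: take_first      # keep only the first match
```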
...
...
When evaluating a language model, it is not unusual to test across a number of tasks that may not be related to one another, in order to assess a variety of capabilities. To this end, it can be cumbersome to have to list the full set of tasks on the command line, or to add a new group name to the yaml of each individual task.
To solve this, we can create a **group** yaml config. This is a config that contains the names of the tasks that should be included in a particular group. The config consists of two main keys: a `group` key, which denotes the name of the group (as it would be called from the command line, e.g. `mmlu`), and a `task` key, under which we can list the tasks. The tasks listed under `task` must be task names that have already been registered. A good example of a group yaml config can be found at [../lm_eval/tasks/mmlu/default/_mmlu.yaml](../lm_eval/tasks/mmlu/default/_mmlu.yaml).
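A minimal sketch of that structure, with placeholder task names, might look like:

```yaml
group: my_group    # the name used to call this group from the command line
task:              # names of already-registered tasks (placeholders here)
  - taskA
  - taskB
```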
## Configurations
...
...
- **aggregate_metric** (`bool`, defaults to `False`) - If `True`, aggregate the scores from all metrics across the tasks in the group.
- **aggregate_fn** (`str`, defaults to `"mean"`) - The type of aggregation performed; by default, scores are averaged per metric.
- **weight_by_size** (`bool`, defaults to `False`) - Used together with `aggregate_metric`; if `True`, the aggregate for each metric is weighted by each task's number of samples rather than being an unweighted average across tasks.
- **metric_alias** - Still in development.
- **version** (`int`, defaults to `0`) - Version of the group config.
- **metadata** (`dict`, *optional*) - As with TaskConfigs, a field where extra config metadata can be passed. For example, set the `num_fewshot` key within this to override the printed n-shot value in a results table for your group.
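As a rough sketch of how these fields fit together (the group and task names below are placeholders, and the exact field set may vary between versions of the harness):

```yaml
group: my_benchmark       # placeholder group name
task:
  - taskA                 # placeholder names of registered tasks
  - taskB
aggregate_metric: true    # also report scores aggregated across the subtasks
aggregate_fn: mean        # average each metric over the subtasks
weight_by_size: true      # weight the average by each task's number of samples
version: 0
metadata:
  num_fewshot: 5          # overrides the printed n-shot value in the results table
```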
The simplest usage of a group yaml is to just list all of the tasks we want in one group. This results in behavior analogous to a `tag`: it is predominantly a convenient way to invoke, from the CLI, a group of tasks one expects to frequently call in conjunction with one another.
```yaml
group: pythia
...
...
  - arc
  - logiqa
  - blimp
  - mmlu
```
Running with `--tasks pythia` will thus simply collect all of the tasks listed in this config and run them, printing out the same tables of results as if we had simply called `--tasks lambada_openai,wikitext,piqa,sciq,wsc,winogrande,arc,logiqa,blimp,mmlu`.
It is also possible to list an existing task in your benchmark configuration with some adjustments and overrides. For example, a few tasks from mmlu are included in `multimedqa`. There, the `task_alias` and `group_alias` (See [here](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#beautifying-table-display) for more details) are modified to suit the benchmark.
```yaml
group: multimedqa
...
...
ignore_punctuation: true
```
Calling the benchmark is done the same way we would call any task with `--tasks`.