Commit 7d09b24c authored by haileyschoelkopf

fix alllll the merge conflicts

parents 96dfe976 6348b947
...@@ -296,7 +296,7 @@ task: <name of the task>
```
Including a task name is mandatory.
It is often also convenient to label your task with several `tag` values, though this field is optional:
```yaml
tag:
...@@ -319,9 +319,50 @@ Passing `--tasks /path/to/yaml/file` is also accepted.
### Advanced Group Configs
While `tag` values are helpful when you want to quickly and conveniently run a set of related tasks via `--tasks my_tag_name`, we often wish to implement more complex logic. For example, the MMLU benchmark contains 57 *subtasks* that must all be *averaged* together in order to report a final 'MMLU score'.
Groupings of tasks might also use particular variants of a task--for example, we might want to default to evaluating a task 5-shot when it is called as part of a given grouping, but have no preference for the number of shots when evaluating it as a standalone task.
We implement this via **groups**, which are distinct from tags. Groups can be implemented via *group config* YAML files, which are laid out similarly to tasks' YAML configs, with a few differences.
The most basic form of group can be defined via a YAML config similar to the following:
```yaml
group: nli_tasks
task:
- cb
- anli_r1
- rte
metadata:
version: 1.0
```
This will behave almost identically to a `tag` that includes these 3 tasks, but with one key distinction: we'll print the `nli_tasks` group as a row (with no associated metrics) in our table of outputs, and visually show that these 3 tasks appear under its subheader.
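For illustration, the printed results table would then look roughly like the following (a schematic only; the metric values are elided rather than taken from a real run):
```
|  Tasks   |Version|Filter|n-shot|Metric|Value|
|----------|------:|------|-----:|------|----:|
|nli_tasks |       |      |      |      |     |
| - cb     |      1|none  |     0|acc   |  ...|
| - anli_r1|      1|none  |     0|acc   |  ...|
| - rte    |      1|none  |     0|acc   |  ...|
```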
Now, let's assume we actually want to report an aggregate score for `nli_tasks`. We would instead use a YAML config like the following:
```yaml
group: nli_tasks
task:
- cb
- anli_r1
- rte
aggregate_metric_list:
- metric: acc
aggregation: mean
weight_by_size: true # defaults to `true`. Set this to `false` to do a "macro" average (taking each subtask's average accuracy, and summing those accuracies and dividing by 3)--by default we do a "micro" average (retain all subtasks' per-document accuracies, and take the mean over all documents' accuracies to get our aggregate mean).
metadata:
version: 1.0
```
Similar to our `metric_list` for listing out the metrics we want to calculate for a given task, we use an `aggregate_metric_list` field to specify which metric name to aggregate across subtasks, what aggregation function to use, and whether we should micro- or macro- average these metrics. See [./task_guide.md](./task_guide.md) for a full list of related sub-keys.
**[!Tip]: currently, we only support the aggregation of group metrics that use `mean` (either micro- or macro-averaged) over their subtasks. If you require more complex aggregation rules, you may want to perform aggregation offline.**
Group configs can be fairly complex! We can perform various operations, such as defining new subtask(s) inline in our group YAML, overriding an existing task's specific config values, or nesting existing groups within our new group.
For example, let's build a config for evaluating MMLU and a few natural language inference tasks. For MMLU, we can write the name of the benchmark as a subtask under `task`. We can also configure parameters such as `num_fewshot`. If the task being configured is a group such as `mmlu` or `super_glue`, the parameter set will be applied to all of its subtasks.
```yaml
group: nli_and_mmlu
...@@ -331,34 +372,13 @@ task:
- cb
- anli_r1
- rte
aggregate_metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
- task: mmlu
  num_fewshot: 2
```
Note that we can insert an entire group config inline as if it were a task. Here, to make a group of natural language inference tasks, we write what would normally be a standalone group config, but place it as an entry in the `task` list of the group being built.
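A minimal sketch of that pattern, assuming an inline `nli_tasks` subgroup (names are reused from the examples above; the real config may differ in its overrides):
```yaml
group: nli_and_mmlu
task:
  - group: nli_tasks
    task:
      - cb
      - anli_r1
      - rte
  - task: mmlu
    num_fewshot: 2
```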
### Duplicate Tasks in Group Configs
There might be cases where you want to evaluate how models perform across prompt variations. You can list an existing task (in the example below, `anli_r1`) multiple times with varying `doc_to_text` implementations. To differentiate between the variations, we can use `task_alias`. LM-Eval will recognize that there are multiple variations of the same task and differentiate them.
```yaml
group: flan_held_in
group_alias: Flan (Held-In)
task:
# ANLI R1
- group: anli_r1_flan
group_alias: ANLI R1
task:
- task: anli_r1
task_alias: prompt-0
include: _held_in_template_yaml
doc_to_text: "{{premise}}\n\nChoose your answer ..."
...
- task: anli_r1
task_alias: prompt-1
include: _held_in_template_yaml
doc_to_text: "{{premise}}\n\nBased on ..."
...
```
### Configuring python classes
...@@ -385,19 +405,16 @@ task:
## Beautifying Table Display
To avoid conflicts, each task needs to be registered under a unique name. Because of this, slight variations of a task still count as distinct tasks and need to be named uniquely. This can be done by appending extra naming that refers to the variation, as in MMLU, where the templates used to evaluate Flan models are differentiated from the default by the prefix `mmlu_flan_*`. Printing the full task names can easily clutter the results table at the end of an evaluation, especially when you have a long list of tasks or are using a benchmark that comprises many tasks. To make the table more legible, you can use `task_alias` and `group_alias` to provide alternative task and group names to be printed. For example, in `mmlu_abstract_algebra.yaml` we set `task_alias` to `abstract_algebra`. In group configs, a `group_alias` for a group can also be set.
```
"dataset_name": "abstract_algebra"
"description": "The following are multiple choice questions (with answers) about abstract\
\ algebra.\n\n"
"include": "_default_template_yaml"
"task": "mmlu_abstract_algebra"
"task_alias": "abstract_algebra"
```
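For example, to give the `nli_tasks` group from earlier a prettier printed name, a group config can set `group_alias` (a sketch; the alias string is illustrative):
```yaml
group: nli_tasks
group_alias: NLI Tasks
task:
  - cb
  - anli_r1
  - rte
```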
## Checking validity
...@@ -417,9 +434,9 @@ a simple eye test.
## Versioning
One key feature in LM Evaluation Harness is the ability to version tasks and groups--that is, mark them with a specific version number that can be bumped whenever a breaking change is made.
This version info can be provided by adding the following to your new task or group config file:
```
metadata:
......
...@@ -56,8 +56,6 @@ Other:
## Filters
A key component of the `lm-evaluation-harness` library is the `Filter` object. In a typical evaluation run of the harness, we take the formatted inputs and run them through our LM, with the appropriate output type (greedy or free-form generation, or loglikelihood-based comparative scoring).
After getting scores or output text from our LM on each `Instance` or document in the dataset, we then need to feed these responses into a metric or scoring function to return scores to a user.
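As a concrete illustration, a task's YAML can declare a pipeline of filters under `filter_list`. The sketch below is loosely modeled on the `strict-match` filter used by GSM8K-style tasks; the regex shown is illustrative rather than copied from a specific config:
```yaml
filter_list:
  - name: "strict-match"
    filter:
      - function: "regex"
        regex_pattern: "#### (\\-?[0-9\\.\\,]+)"
      - function: "take_first"
```
Each filter in the list is applied in sequence to the model's raw outputs, and each named pipeline produces its own set of reported metrics.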
...@@ -300,116 +298,20 @@ Tasks using complex filtering:
When evaluating a language model, it is not unusual to test across a number of tasks that may not be related to one another, in order to assess a variety of capabilities. To this end, it may be cumbersome to have to list the full set of tasks, or to add a new group name to the YAML of each individual task.
To solve this, we can create a **group** yaml config. This is a config that contains the names of the tasks that should be included in a particular group. The config consists of two main keys: a `group` key, which denotes the name of the group (as it would be called from the command line, e.g. `mmlu`), and a `task` key, where we can list the tasks. The tasks listed in `task` are the task names that have been registered. A good example of a group yaml config can be found at [../lm_eval/tasks/mmlu/default/_mmlu.yaml]. See also the [New Task Guide](./new_task_guide.md) for a more in-depth and tutorial-esque explanation of how to write complex GroupConfigs.
## Configurations
Groups are configured via the `GroupConfig` object. Below, we describe all fields usable within the object, and their role in defining a group.
### Parameters
- **group** (`str`, defaults to `None`) — name of the group. Used to invoke it from the command line.
- **group_alias** (`str`, defaults to `None`) - Alternative name for the group that will be printed in the table output.
- **task** (`Union[str, list]`, defaults to `None`) - List of tasks that constitute the group.
- **aggregate_metric_list** (`list`, defaults to `None`) - Similar to `metric_list` in TaskConfigs, provides a list of configurations for metrics that should be aggregated across subtasks. Leaving this empty will result in no aggregation being performed for the group. Keys for each list entry are:
  - `metric: str` - the name of the metric to aggregate over (all subtasks must report a metric holding this name).
  - `aggregation: str` - the aggregation function to apply to these per-subtask metrics. **Currently, only `mean` is supported.**
  - `weight_by_size: bool = True` - whether to perform micro-averaging (`True`) or macro-averaging (`False`) of subtasks' scores when reporting the group's metric. MMLU, for example, averages over per-document accuracies (the *micro average*), resulting in the same accuracy as if one simply concatenated all 57 subjects into a single dataset and evaluated accuracy on that dataset.
  - `filter_list: Union[str, List[str]] = "none"` - which filter keys to match on when aggregating results. For example, to aggregate over the `exact_match` metric using the `strict-match` filter for `bbh_cot_zeroshot`, set this to `filter_list: "strict-match"`.
- **metadata** (`dict`, *optional*) - As with TaskConfigs, a field where extra config metadata can be passed. Set the `num_fewshot` key within this to override the printed n-shot value in a results table for your group, for example.
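Putting these fields together, a group config that aggregates `exact_match` scores under the `strict-match` filter might look like the following sketch (the group and subtask names are illustrative, borrowing the `bbh_cot_zeroshot` example above):
```yaml
group: bbh_cot_zeroshot
task:
  - bbh_cot_zeroshot_boolean_expressions
  - bbh_cot_zeroshot_causal_judgement
aggregate_metric_list:
  - metric: exact_match
    aggregation: mean
    weight_by_size: true
    filter_list: "strict-match"
metadata:
  version: 1.0
```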
The simplest usage of a group yaml is to just list all tasks we want in one group. This ends up with behavior analogous to a `tag`--it is predominantly a way to invoke a group of tasks one expects to frequently call in conjunction with one another.
```yaml
group: pythia
task:
- lambada_openai
- wikitext
- piqa
- sciq
- wsc
- winogrande
- arc
- logiqa
- blimp
- hendrycksTest*
```
It is also possible to list an existing task in your benchmark configuration with some adjustments. For example, a few tasks from MMLU are included in `multimedqa`. There, the `task_alias` and `group_alias` (see [here](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#beautifying-table-display) for more details) are modified to suit the benchmark.
```yaml
group: multimedqa
task:
- pubmedqa
- medmcqa
- medqa_4options
- task: mmlu_anatomy
task_alias: "anatomy (mmlu)"
- task: mmlu_clinical_knowledge
task_alias: "clinical_knowledge (mmlu)"
...
```
Alternatively, benchmarks can define their subtasks inline, customizing each one. These are written the same way a task yaml is usually written.
```yaml
group: t0_eval
task:
# Coreference Resolution
- dataset_path: super_glue
dataset_name: wsc.fixed
use_prompt: promptsource:*
training_split: train
validation_split: validation
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
# Coreference Resolution
- dataset_path: winogrande
dataset_name: winogrande_xl
use_prompt: promptsource:*
training_split: train
validation_split: validation
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
...
```
If the benchmark contains the same dataset but with different configurations, use `task` to differentiate between them. For example, T0-Eval evaluates on 3 versions of ANLI, but the Hugging Face dataset collects them in one dataset.
```YAML
group: t0_eval
task:
...
- task: anli_r1
dataset_path: anli
use_prompt: promptsource:*
training_split: train_r1
validation_split: dev_r1
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
- task: anli_r2
dataset_path: anli
use_prompt: promptsource:*
training_split: train_r2
validation_split: dev_r2
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
```
Calling the benchmark is done the same way we would call any task with `--tasks`. Benchmarks can be added in `lm_eval/tasks/benchmarks/`.
...@@ -377,7 +377,7 @@
"id": "LOUHK7PtQfq4"
},
"source": [
"Often, tasks are part of a larger group used to measure different capabilities. The dynamism of the field today means new dimensions of evaluation can come about which would mix and match new and older tasks alike. In LM-Eval, We can also group tasks and call that the group name to evaluate on a set of tasks easily. In this instance, let's evaluate the group `yes_or_no_tasks` which comprise of the tasks `demo_boolq` and `demo_cola`; tasks which are multiple choice tasks with options `yes` and `no` as the name suggests.\n", "Often, tasks are part of a larger group used to measure different capabilities. The dynamism of the field today means new dimensions of evaluation can come about which would mix and match new and older tasks alike. In LM-Eval, We can also group tasks and call that the group name to evaluate on a set of tasks easily. In this instance, let's evaluate the tag `yes_or_no_tasks` which comprise of the tasks `demo_boolq` and `demo_cola`; tasks which are multiple choice tasks with options `yes` and `no` as the name suggests.\n",
"\n", "\n",
"<!-- making new groups is easier than ever, allowing user to work bottom-up by makiing individual tasks and linking them to a group or Top-Down, making a new group by listing existing tasks.\n", "<!-- making new groups is easier than ever, allowing user to work bottom-up by makiing individual tasks and linking them to a group or Top-Down, making a new group by listing existing tasks.\n",
"\n", "\n",
...@@ -395,7 +395,7 @@
"outputs": [],
"source": [
"YAML_cola_string = '''\n",
"tag: yes_or_no_tasks\n",
"task: demo_cola\n",
"dataset_path: glue\n",
"dataset_name: cola\n",
...@@ -494,7 +494,6 @@
"outputs": [],
"source": [
"YAML_mmlu_geo_string = '''\n",
"task: demo_mmlu_high_school_geography\n",
"dataset_path: cais/mmlu\n",
"dataset_name: high_school_geography\n",
......
...@@ -22,7 +22,6 @@ from typing import (
import datasets
import numpy as np
from tqdm import tqdm
from lm_eval import utils
...@@ -52,125 +51,6 @@ ALL_OUTPUT_TYPES = [
eval_logger = logging.getLogger("lm-eval")
@dataclass
class TaskConfig(dict):
    # task naming/registry
...@@ -217,6 +97,18 @@ class TaskConfig(dict):
    )
    def __post_init__(self) -> None:
if self.group is not None:
eval_logger.warning(
"A task YAML file was found to contain a `group` key. Groups which provide aggregate scores over several subtasks now require a separate config file--if not aggregating, you may want to use the `tag` config option instead within your config. Setting `group` within a TaskConfig will be deprecated in v0.4.4. Please see https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md for more information."
)
if self.tag is None:
self.tag = self.group
else:
raise ValueError(
"Got both a `group` and `tag` entry within a TaskConfig. Please use one or the other--`group` values will be deprecated in v0.4.4."
)
        if self.generation_kwargs is not None:
            if self.output_type != "generate_until":
                eval_logger.warning(
...@@ -346,8 +238,6 @@ class Task(abc.ABC):
        self._fewshot_docs: Optional[list] = None
        self._instances: Optional[List[Instance]] = None
        self._config: TaskConfig = TaskConfig({**config}) if config else TaskConfig()
        self._filters = [build_filter_ensemble("none", [["take_first", None]])]
...@@ -805,14 +695,6 @@ class Task(abc.ABC):
        )
        return doc_iterator
class ConfigurableTask(Task):
    VERSION = "Yaml"
...@@ -826,9 +708,6 @@ class ConfigurableTask(Task):
        download_mode=None,
        config: Optional[dict] = None,
    ) -> None:  # TODO no super() call here
        # Get pre-configured attributes
        self._config = self.CONFIG
......
...@@ -15,6 +15,7 @@ import lm_eval.api.task
import lm_eval.models
from lm_eval.caching.cache import delete_cache
from lm_eval.evaluator_utils import (
consolidate_group_results,
    consolidate_results,
    get_sample_size,
    get_subtask_list,
...@@ -26,8 +27,6 @@ from lm_eval.evaluator_utils import (
from lm_eval.loggers import EvaluationTracker
from lm_eval.loggers.utils import add_env_info, add_tokenizer_info, get_git_commit_hash
from lm_eval.tasks import (
    TaskManager,
    get_task_dict,
)
...@@ -42,7 +41,7 @@ from lm_eval.utils import ( ...@@ -42,7 +41,7 @@ from lm_eval.utils import (
if TYPE_CHECKING:
    from lm_eval.api.model import LM
    from lm_eval.api.task import Task
@positional_deprecated
...@@ -227,13 +226,15 @@ def simple_evaluate(
    task_dict = get_task_dict(tasks, task_manager)
    # helper function to recursively apply config overrides to leaf subtasks, skipping their constituent groups.
    # (setting of num_fewshot ; bypassing metric calculation ; setting fewshot seed)
    def _adjust_config(task_dict):
        adjusted_task_dict = {}
        for task_name, task_obj in task_dict.items():
            if isinstance(task_obj, dict):
                adjusted_task_dict = {
                    **adjusted_task_dict,
                    **{task_name: _adjust_config(task_obj)},
                }
            else:
...@@ -278,7 +279,7 @@ def simple_evaluate(
        return adjusted_task_dict
    task_dict = _adjust_config(task_dict)
    if check_integrity:
        run_task_tests(task_list=tasks)
...@@ -575,138 +576,7 @@ def evaluate(
    ### Calculate group metrics ###
    if bool(results):
results, versions, show_group_table, *_ = consolidate_group_results(
            results, versions, task_dict
        )
......
...@@ -4,8 +4,13 @@ import pathlib
import sys
from typing import List, Optional, Tuple, Union
from lm_eval.api.group import ConfigurableGroup
from lm_eval.api.metrics import (
aggregate_subtask_metrics,
pooled_sample_stderr,
stderr_for_metric,
)
from lm_eval.api.task import Task
from lm_eval.utils import eval_logger, positional_deprecated
...@@ -40,7 +45,6 @@ class TaskOutput:
        self,
        task=None,
        task_name=None,
        task_config=None,
        version=None,
        group_name=None,
...@@ -52,7 +56,6 @@ class TaskOutput:
        self.task = task
        self.task_config = task_config
        self.task_name = task_name
        self.group_name = group_name
        self.version = version
        self.n_shot = n_shot
...@@ -78,7 +81,6 @@ class TaskOutput:
            task=task, task_name=task_name, is_group=is_group, group_name=group_name
        )
        version = task.VERSION
        task_config = dict(task.dump_config())
        if (n_shot := task_config.get("num_fewshot")) == 0:
            n_shot = task_config.get("metadata", {}).get("num_fewshot", 0)
...@@ -87,7 +89,6 @@ class TaskOutput:
        return cls(
            task=task,
            task_name=task_name,
            task_config=task_config,
            group_name=group_name,
            version=version,
...@@ -103,7 +104,7 @@ class TaskOutput:
        self.agg_metrics[metric_key] = agg_fn(items)
        self.sample_len = len(items)  # TODO: same sample size for each metric?
        if isinstance(bootstrap_iters, int):
            stderr_fn = stderr_for_metric(
                metric=agg_fn,
                bootstrap_iters=min(bootstrap_iters, 100)
                if metric in ["bleu", "chrf", "ter"]
...@@ -128,19 +129,13 @@ class TaskOutput:
        )
def get_task_list(task_dict: dict) -> List[TaskOutput]:
    outputs = []
    for task_name, task_obj in task_dict.items():
        if isinstance(task_obj, dict):
            _outputs = get_task_list(task_obj)
            outputs.extend(_outputs)
        else:
            task_output = TaskOutput.from_taskdict(task_name, task_obj)
            outputs.append(task_output)
...@@ -152,7 +147,7 @@ def get_subtask_list(task_dict, task_root=None, depth=0):
    for group_obj, task_obj in task_dict.items():
        if isinstance(group_obj, ConfigurableGroup):
            # group_name = group_obj.group_name
            group_name = group_obj.group_name
        else:
            group_name = group_obj
        if isinstance(task_obj, dict):
...@@ -172,10 +167,10 @@ def get_subtask_list(task_dict, task_root=None, depth=0):
        else:
            if isinstance(task_obj, ConfigurableGroup):
                # group_or_task_name = task_obj.group_name
                group_or_task_name = task_obj.group_name
            elif isinstance(task_obj, Task):
                # group_or_task_name = task_obj.task_name
                group_or_task_name = task_obj.task_name
            if task_root is None:
                subtask_list.setdefault((group_or_task_name, depth), [])
...@@ -233,19 +228,37 @@ def prepare_print_tasks(
    Prepares the task hierarchy and aggregates the results for each task and group recursively for printing.
    """
def _sort_task_dict(task_dict):
"""
Helper utility. Sorts the task dict at the current level of the hierarchy based on alphabetized task name.
Required so that we end up sorting within each sub-header correctly.
"""
return dict(
sorted(
task_dict.items(),
key=lambda item: item[0].group_name
if isinstance(item[0], ConfigurableGroup)
else item[0],
)
)
    task_agg = collections.defaultdict(dict)
    group_agg = collections.defaultdict(dict)
    task_dict = _sort_task_dict(task_dict)
    for task_or_group_name, task_or_group_obj in task_dict.items():
        tab_string = " " * task_depth + "- " if task_depth > 0 else ""
        if isinstance(task_or_group_name, ConfigurableGroup):
            # string_name = task_or_group_name.group_name
            name = task_or_group_name.group_name
            from_configurable_group = True
task_or_group_obj = _sort_task_dict(task_or_group_obj)
        elif isinstance(task_or_group_name, str):
            name = task_or_group_name
            if isinstance(task_or_group_obj, Task):
                # string_name = task_or_group_obj.task_name
                name = task_or_group_obj.task_name
            from_configurable_group = False
        task_agg[name] = results[name].copy()
...@@ -325,31 +338,169 @@ def consolidate_results(
    higher_is_better = collections.defaultdict(dict)
    for task_output in eval_tasks:
        if "task_alias" in (task_config := task_output.task_config):
            results[task_output.task_name]["alias"] = task_config["task_alias"]
        else:
            results[task_output.task_name]["alias"] = task_output.task_name
        if group_alias := task_output.group_alias:
            if group_alias not in results and (group_name := task_output.group_name):
                results[group_name]["alias"] = group_alias
        num_fewshot[task_output.task_name] = task_output.n_shot
        configs[task_output.task_name] = task_output.task_config
        versions[task_output.task_name] = task_output.version
        samples[task_output.task_name] = task_output.logged_samples
        higher_is_better[task_output.task_name] = task_output.task.higher_is_better()
        for (metric, filter_key), items in task_output.sample_metrics.items():
            metric_key = f"{metric},{filter_key}"
            results[task_output.task_name][metric_key] = task_output.agg_metrics[
                metric_key
            ]
            results[task_output.task_name]["samples"] = task_output.sample_len
            results[task_output.task_name][f"{metric}_stderr,{filter_key}"] = (
                task_output.agg_metrics[f"{metric}_stderr,{filter_key}"]
            )
    return results, samples, configs, versions, num_fewshot, higher_is_better
def consolidate_group_results(
results,
versions,
task_dict,
task_root=None,
show_group_table=False,
task_aggregation_list=None,
) -> Tuple[dict, dict, bool, Union[None, dict]]:
"""
(Recursively) calculates groups' aggregated metrics and updates the results and versions dictionaries with this info.
@return: a tuple [results, versions, show_group_table, task_aggregation_list] with formats described below:
- results: A defaultdict with task names (and, after this function is called, group names of
groups that perform aggregation) as keys, and dictionaries with "alias" and metric,filter_name pairs as keys.
- versions: A defaultdict with task names (and, after this function is called, group names of
groups that perform aggregation) as keys, and float values representing the task or group's version if a version is specified. (defaulting to None).
- show_group_table: a boolean which is true if there exists a group that requires printing of its aggregated scores in a group table.
- task_aggregation_list: a defaultdict listing the subtasks to average over to produce a given group's end metric.
The method then returns the updated results, versions, show_group_table, and task_aggregation_list as a tuple.
In the top-level invocation of this function, task_aggregation_list is ignored.
"""
if task_root is None:
task_root = {}
if task_aggregation_list is None:
task_aggregation_list = {}
for group_or_task, group_or_task_info in task_dict.items():
# Convert to string
if isinstance(group_or_task, ConfigurableGroup):
group_config = group_or_task.config
group_or_task = group_or_task.group_name
else:
group_config = None
if isinstance(group_or_task_info, Task):
if task_root:
task_aggregation_list.setdefault(task_root, []).append(
group_or_task_info.task_name
)
else:
(
results,
versions,
show_group_table,
_task_aggregation_list,
) = consolidate_group_results(
results,
versions,
group_or_task_info,
group_or_task,
show_group_table,
task_aggregation_list,
)
if task_root:
task_aggregation_list.setdefault(task_root, []).extend(
task_aggregation_list.get(group_or_task, [])
)
if (group_config is None) or (
group_config["aggregate_metric_list"] is None
):
results[group_or_task][" "] = " "
continue
if "aggregate_metric_list" in group_config:
agg_metric_list = group_config["aggregate_metric_list"]
show_group_table = show_group_table | bool(
group_config["aggregate_metric_list"]
)
task_list = _task_aggregation_list[group_or_task]
metric_list = list(
{
key
for task in task_list
for key in results[task].keys()
if "_stderr" not in key and key not in ["task", "alias", "samples"]
}
)
for metric in metric_list:
stderr = "_stderr,".join(metric.split(","))
# gather metrics, sizes, and stderrs from subtasks
metrics = [
results[task][metric]
for task in task_list
if metric in results[task]
] # TODO: copy?
stderrs = [
results[task][stderr]
for task in task_list
if stderr in results[task]
]
sizes = [
results[task]["samples"]
for task in task_list
if metric in results[task]
]
for metric_config in agg_metric_list:
for filter_name in metric_config["filter_list"]:
if metric != ",".join([metric_config["metric"], filter_name]):
continue
# compute group's pooled metric and stderr
if metric_config["aggregation"] == "mean":
aggregate_fn = aggregate_subtask_metrics
else:
raise ValueError(
f"Currently, only 'mean' is supported for automatically aggregating scores across groups' subtasks. Got '{metric_config['aggregation']}' for group '{group_or_task}'"
)
results[group_or_task][metric] = aggregate_fn(
metrics,
sizes,
metric_config["weight_by_size"],
)
# TODO: calculate groups' metrics using arbitrary agg fns
if "N/A" in stderrs:
results[group_or_task][stderr] = "N/A"
else:
# NOTE: this assumes we are using the mean to aggregate. There are warnings about this elsewhere
results[group_or_task][stderr] = pooled_sample_stderr(
stderrs, sizes
)
results[group_or_task]["samples"] = sum(sizes)
group_metadata = group_config.get("metadata", None)
if group_metadata is not None:
versions[group_or_task] = group_metadata.get("version", None)
# print(results)
return results, versions, show_group_table, task_aggregation_list
@positional_deprecated
def find_test_root(start_path: pathlib.Path) -> pathlib.Path:
    """
......
...@@ -5,7 +5,9 @@ from functools import partial
from typing import Dict, List, Mapping, Optional, Union
from lm_eval import utils
from lm_eval.api.group import ConfigurableGroup, GroupConfig
from lm_eval.api.task import ConfigurableTask, Task
from lm_eval.evaluator_utils import get_subtask_list
GROUP_ONLY_KEYS = list(GroupConfig().to_dict().keys())
...@@ -166,13 +168,16 @@ class TaskManager:
            **config,
        }
        if self._config_is_python_task(config):
            task_object = (
config["class"](config=config)
if issubclass(config["class"], ConfigurableTask)
else config["class"]()
)
# very scuffed: set task name here. TODO: fixme?
task_object.config.task = config["task"]
        else:
            task_object = ConfigurableTask(config=config)
        return {task: task_object}
    def _get_group_and_subtask_from_config(config):
...@@ -201,7 +206,9 @@ class TaskManager:
        if update_config is not None:
            # Process name_or_config as a dict instead
            name_or_config = {"task": name_or_config, **update_config}
        elif self._name_is_task(name_or_config) or self._name_is_python_task(
name_or_config
):
            task_config = self._get_config(name_or_config)
            return _load_task(task_config, task=name_or_config)
        else:
...@@ -385,7 +392,7 @@ class TaskManager:
        if attr in config:
            if attr == "group" and print_info:
                self.logger.info(
                    "`group` and `group_alias` keys in tasks' configs will no longer be used in the next release of lm-eval. "
                    "`tag` will be used to allow calling a collection of tasks just like `group`. "
                    "`group` will be removed in order to not cause confusion with the new ConfigurableGroup, "
                    "which will be the official way to create groups with the addition of group-wide configurations."
...@@ -434,6 +441,33 @@ def get_task_name_from_object(task_object):
                )
def _check_duplicates(task_dict: dict) -> List[str]:
"""helper function solely used in validating get_task_dict output.
Takes the output of lm_eval.evaluator_utils.get_subtask_list and
returns a list of all leaf subtasks contained within, and errors if any such leaf subtasks are
"oversubscribed" to several disjoint groups.
"""
subtask_names = []
for key, value in task_dict.items():
subtask_names.extend(value)
duplicate_tasks = {
task_name for task_name in subtask_names if subtask_names.count(task_name) > 1
}
# locate the potentially problematic groups that seem to 'compete' for constituent subtasks
competing_groups = [
group
for group in task_dict.keys()
if len(set(task_dict[group]).intersection(duplicate_tasks)) > 0
]
if len(duplicate_tasks) > 0:
raise ValueError(
f"Found 1 or more tasks while trying to call get_task_dict() that were members of more than 1 called group: {list(duplicate_tasks)}. Offending groups: {competing_groups}. Please call groups which overlap their constituent tasks in separate evaluation runs."
)
def get_task_dict(
    task_name_list: Union[str, List[Union[str, Dict, Task]]],
    task_manager: Optional[TaskManager] = None,
...@@ -451,6 +485,7 @@ def get_task_dict(
    :return
        Dictionary of task objects
    """
    task_name_from_string_dict = {}
    task_name_from_config_dict = {}
    task_name_from_object_dict = {}
...@@ -497,8 +532,16 @@ def get_task_dict(
    ):
        raise ValueError
    final_task_dict = {
        **task_name_from_string_dict,
        **task_name_from_config_dict,
        **task_name_from_object_dict,
    }
# behavior can get odd if one tries to invoke several groups that "compete" for the same task.
# (notably, because one could request several num_fewshot values at once in GroupConfig overrides for the subtask
# and we'd be unsure which to use and report.)
# we explicitly check and error in this case.
_check_duplicates(get_subtask_list(final_task_dict))
return final_task_dict
...@@ -26,7 +26,7 @@ Homepage: https://github.com/isen-zhang/ACLUE
}
```
### Groups, Tags, and Tasks
#### Groups
......
group: aclue
task:
- aclue_ancient_chinese_culture
- aclue_ancient_literature
- aclue_ancient_medical
- aclue_ancient_phonetics
- aclue_basic_ancient_chinese
- aclue_couplet_prediction
- aclue_homographic_character_resolution
- aclue_named_entity_recognition
- aclue_poetry_appreciate
- aclue_poetry_context_prediction
- aclue_poetry_quality_assessment
- aclue_poetry_sentiment_analysis
- aclue_polysemy_resolution
- aclue_reading_comprehension
- aclue_sentence_segmentation
aggregate_metric_list:
- metric: acc
aggregation: mean
weight_by_size: true
- metric: acc_norm
aggregation: mean
weight_by_size: true
metadata:
version: 0.0
dataset_path: tyouisen/aclue
test_split: test
fewshot_split: dev
......
...@@ -24,11 +24,11 @@ Homepage for Arabic EXAMS: [EXAMS Arabic Homepage](https://github.com/FreedomInt
### Citation
### Groups, Tags, and Tasks
#### Groups
- `aexams`: Arabic EXAMS dataset, including IslamicStudies, Biology, Science, Physics, Social subjects.
#### Tasks
......
group: aexams
task:
- aexams_Biology
- aexams_IslamicStudies
- aexams_Physics
- aexams_Science
- aexams_Social
aggregate_metric_list:
- metric: acc
aggregation: mean
weight_by_size: true
- metric: acc_norm
aggregation: mean
weight_by_size: true
metadata:
version: 0.0
dataset_path: Hennara/aexams
test_split: test
fewshot_split: dev
......
...@@ -75,7 +75,7 @@ Please make sure to cite all the individual datasets in your paper when you use
}
```
### Groups, Tags, and Tasks
#### Groups
...@@ -89,6 +89,10 @@ Please make sure to cite all the individual datasets in your paper when you use
- `agieval_nous`: Evaluates a specific subset of AGIEval tasks (multiple-choice and english-only), namely those in https://github.com/teknium1/LLM-Benchmark-Logs/blob/main/benchmark-logs/Mistral-7B-Base.md
#### Tags
None.
#### Tasks
- `agieval_aqua_rat`
......
group: agieval
task:
- agieval_gaokao_biology
- agieval_gaokao_chemistry
- agieval_gaokao_chinese
- agieval_gaokao_geography
- agieval_gaokao_history
- agieval_gaokao_mathcloze
- agieval_gaokao_mathqa
- agieval_gaokao_physics
- agieval_jec_qa_ca
- agieval_jec_qa_kd
- agieval_logiqa_zh
- agieval_aqua_rat
- agieval_gaokao_english
- agieval_logiqa_en
- agieval_lsat_ar
- agieval_lsat_lr
- agieval_lsat_rc
- agieval_math
- agieval_sat_en_without_passage
- agieval_sat_en
- agieval_sat_math
aggregate_metric_list:
- metric: acc
aggregation: mean
weight_by_size: true
metadata:
version: 0.0
group: agieval_cn
task:
- agieval_gaokao_biology
- agieval_gaokao_chemistry
- agieval_gaokao_chinese
- agieval_gaokao_geography
- agieval_gaokao_history
- agieval_gaokao_mathcloze
- agieval_gaokao_mathqa
- agieval_gaokao_physics
- agieval_jec_qa_ca
- agieval_jec_qa_kd
- agieval_logiqa_zh
aggregate_metric_list:
- metric: acc
aggregation: mean
weight_by_size: true
metadata:
version: 0.0
group: agieval_en
task:
- agieval_aqua_rat
- agieval_gaokao_english # categorizing as EN because the AGIEval codebase lists this as in `english_qa_tasks`
- agieval_logiqa_en
- agieval_lsat_ar
- agieval_lsat_lr
- agieval_lsat_rc
- agieval_math
- agieval_sat_en_without_passage
- agieval_sat_en
- agieval_sat_math
aggregate_metric_list:
- metric: acc
aggregation: mean
weight_by_size: true
metadata:
version: 0.0
group: agieval_nous
task:
- agieval_aqua_rat
- agieval_logiqa_en
- agieval_lsat_ar
- agieval_lsat_lr
- agieval_lsat_rc
- agieval_sat_en_without_passage
- agieval_sat_en
- agieval_sat_math
aggregate_metric_list:
- metric: acc_norm
aggregation: mean
weight_by_size: true
metadata:
version: 0.0
task: agieval_aqua_rat
dataset_path: hails/agieval-aqua-rat
dataset_name: null
......
include: aqua-rat.yaml
task: agieval_gaokao_biology
dataset_path: hails/agieval-gaokao-biology