Unverified Commit 79b972d6 authored by Hailey Schoelkopf, committed by GitHub

[Refactor] [WIP] New YAML advanced docs (#567)



* add wip gsm8k yaml

* cleanup tasks dir

* push gsm8k yaml changes

* rename gpt2.py

* add updated gsm8k , triviaqa baseline

* add new cot yaml

* allow for multiple filter pipelines, new filter types

* updated gsm8k + sampling gen configs

* cleanup self-consistency yaml

* push outline for advanced docs

* push docs checklist

* switch to inheritance for many tasks

* acc_norm and acc_mutual_info fixed

* fix missing newline in error msg

* remove many .py tasks

* updated GSM8k

* added more doc

* Update advanced_task_guide.md

Added list of parameters

* Update advanced_task_guide.md

* Added details on listing metrics

* Update advanced_task_guide.md

* Added more explanation

* modify current default filter name

* add new tags to tasks

* remove a lingering print()

* add rest of param docs, cleanup deprecated fields

* push docs update

* move ALL_TASKS definition location

* confirm write_out.py works if no description dict passed

---------
Co-authored-by: lintangsutawika <lintang@sutawika.com>
parent 761f0087
Tracking progress on revamping documentation pages for the refactor of LM-Evaluation-Harness.
## Desired Pages
* [ ] YAML explainer
* [ ] Explainer on filters + advanced features
* [ ] Walkthrough start-to-finish of adding a new task to codebase
* [ ] Explaining registries + decorators
* [ ] model_guide.md for adding new model API
* [ ] guide to writing an adapter to new advanced codebase (e.g. NeoX)
* [ ] Parallelism guide (?)
\ No newline at end of file
# Advanced Task Configuration
The `lm-evaluation-harness` is meant to be an extensible and flexible framework within which many different evaluation tasks can be defined. All tasks in the new version of the harness are built around a YAML configuration file format.
These YAML configuration files, along with the current codebase commit hash, are intended to be shareable: providing the YAML config enables another researcher to precisely replicate your evaluation setup, in cases where the prompt or setup differs from the standard `lm-eval` task implementations.
While adding a standard evaluation task on a new dataset can occasionally be as simple as swapping out a Hugging Face dataset path in an existing file, more specialized evaluation setups require extra configuration. Here we'll provide a crash course on the more advanced logic that users can implement in YAML form.
If your intended task relies on features beyond what is described in this guide, we'd love to hear about it! Feel free to open an issue describing the scenario on GitHub, create a PR to the project with a proposed implementation, or ask in the `#lm-thunderdome` channel on the EleutherAI Discord.
## Configurations
### Parameters
- **task** (`str`, defaults to None) — name of the task.
- **group** (`str`, *optional*) — name of the task group(s) a task belongs to. Enables one to run all tasks with a specified tag or group name at once.
- **reference** (`str`, *optional*) —
- **dataset_path** (`str`) — The name of the dataset as listed by HF in the datasets Hub.
- **dataset_name** (`str`, *optional*, defaults to None) — The name of what HF calls a “data instance” or sub-task of the benchmark. If your task does not contain any data instances, just leave this as the default `None`. (If you're familiar with the HF `datasets.load_dataset` function, these are just the first two arguments to it.)
- **dataset_kwargs** (`dict`, *optional*) — Auxiliary arguments that `datasets.load_dataset` accepts. This can be used to specify arguments such as `data_files` or `data_dir` if you want to use local data files such as JSON or CSV.
- **training_split** (`str`, *optional*) — Split in the dataset to use as the training split.
- **validation_split** (`str`, *optional*) — Split in the dataset to use as the validation split.
- **test_split** (`str`, *optional*) — Split in the dataset to use as the test split.
- **fewshot_split** (`str`, *optional*) — Split in the dataset from which to draw few-shot examples. Should not be None when `num_fewshot` > 0, and generally should not be the same split being evaluated on.
- **template_aliases** (`str`, *optional*) —
- **aliases**: (`Union[str, list]`, *optional*) —
- **doc_to_text** (`Union[Callable, str]`, *optional*) — Jinja2, f-string, or function to process a sample into the appropriate input for the model
- **doc_to_target** (`Union[Callable, str]`, *optional*) — Jinja2, f-string, or function to process a sample into the appropriate target output for the model
- **num_fewshot** (`int`, *optional*, defaults to 0) — Number of few-shot examples before the input.
- **batch_size** (`int`, *optional*, defaults to 1) — Batch size.
- **repeats** (`int`, *optional*, defaults to 1) — Number of repeated runs for each sample. Can be used for cases such as self-consistency.
- **metric_list** (`str`, *optional*, defaults to None) — A list of metrics to use for evaluation. See docs for expected format.
- **gold_alias** (`str`, *optional*, defaults to None) — If provided, used to generate the reference answer that is scored against. Useful when `doc_to_target` should be the "target string" appended to each few-shot exemplar's input, while the value passed to the metric function as `gold` should instead come from `gold_alias`.
- **output_type** (`str`, *optional*, defaults to "greedy_until") — Selects the type of model output for the given task. Options are `greedy_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`.
- **generation_kwargs** (`dict`, *optional*) — Auxiliary arguments for the `generate` function from HF transformers library. Advanced keyword arguments may not be supported for non-HF LM classes.
- **delimiter** (`str`, *optional*, defaults to "\n\n") — String to insert between few-shot examples.
- **filter_list** (`Union[str, list]`, *optional*) — List of filters to postprocess model outputs. See below for further detail on the filter API.
- **should_decontaminate** (`bool`, *optional*, defaults to False) -
- **doc_to_decontamination_query** (`str`, *optional*) —
- **use_prompt** (`str`, *optional*) — Name of a prompt in Promptsource to use. If defined, this will overwrite `doc_to_text` and `doc_to_target`.
- **metadata** (`str`, *optional*) — An optional field where arbitrary metadata can be passed.
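To make these fields concrete, here is a hypothetical minimal configuration for a generative task, loosely modeled on the GSM8k YAML added in this PR. The task name, prompt strings, and metric settings below are illustrative assumptions rather than the shipped config:
```yaml
group:
  - math_word_problems
task: gsm8k_demo                 # hypothetical name for illustration
dataset_path: gsm8k              # HF Hub dataset
dataset_name: main               # HF dataset config / "data instance"
output_type: greedy_until
training_split: train
test_split: test
doc_to_text: "Question: {{question}}\nAnswer:"   # assumed prompt format
doc_to_target: "{{answer}}"                      # assumed target field
num_fewshot: 5
generation_kwargs:
  until:
    - "Question:"
  do_sample: false
  temperature: 0.0
metric_list:
  - metric: exact_match
```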
## Filters
Explain: What are filters? What is their place in the pipeline?
Format of the `resps` object, and what needs to happen to yield proper scorable results
TODO: triviaqa is implementable if we don't use `take_first` and implement a multi-alias exact_match_any metric
TODO: Filters might warrant a separate doc.
### Multiple Filter Pipelines
On the same set of model outputs, we can run multiple distinct filtering pipelines in parallel (see the sketch below).
Case study: gsm8k-CoT-self-consistency
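As a rough sketch only (the exact filter function names and YAML schema are still in flux in this PR; `take_first` appears in the codebase, while `regex` and `majority_vote` are assumptions), two parallel pipelines over the same sampled generations might be declared like this:
```yaml
filter_list:
  - name: "get-answer"          # score only the first sampled chain-of-thought
    filter:
      - function: "regex"       # assumed filter: extract the final numeric answer
        regex_pattern: "(-?[0-9.,]+)"
      - function: "take_first"
  - name: "maj@k"               # score a majority vote over all sampled generations
    filter:
      - function: "regex"
        regex_pattern: "(-?[0-9.,]+)"
      - function: "majority_vote"  # assumed filter name
      - function: "take_first"
```
Each named pipeline would then be reported as its own set of metric results over the same underlying model responses.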
### "Splitting" Pipelines
TODO: either allow for pipelines that "split" and report multiple keys, or something different. We in particular want to support not re-running reward /scoring models on every different filter pipeline if can be shared.
## Embedded Python Code
There could be cases where Jinja 2 or a simple f-string format won't cut it. For tasks like these, we additionally support importing Python helper functions that can be injected directly into the YAML. Note that the helper script must be in the same directory as the YAML file.
TODO: document the `!function filename.pythonfunctionname` syntax here.
TODO: add permanent link to wikitext.yaml and super_glue_cb.yml
```
wikitext.yaml and helper fn go here
```
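Until the permanent links above are added, here is a hedged sketch of the `!function filename.pythonfunctionname` syntax. The file and function names are entirely hypothetical; the only assumption carried over from this PR is that the helper `.py` file lives next to the YAML:
```yaml
# new_task.yaml (hypothetical)
doc_to_text: !function utils.doc_to_text   # resolves to doc_to_text() in utils.py beside this YAML
doc_to_target: "{{answer}}"
```
```python
# utils.py (hypothetical helper, same directory as the YAML)
def doc_to_text(doc):
    # Build the model input from a dataset row; field names are illustrative.
    return f"Question: {doc['question']}\nAnswer:"
```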
## (No Longer Recommended) Direct `Task` Subclassing
The prior implementation method for new tasks was to subclass `Task`. While we intend to migrate all tasks to the new YAML implementation option going forward, it remains possible to subclass the `Task` class directly.
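For reference, here is a minimal sketch of what direct subclassing looks like, modeled on the LAMBADA implementation included in this codebase. The task name, dataset, and prompt details are placeholders, and only the method names follow the `Task` API as used there:
```python
from lm_eval.api.task import Task
from lm_eval.api.instance import Instance
from lm_eval.api.metrics import mean, perplexity
from lm_eval.api.registry import register_task


@register_task("my_custom_task")  # hypothetical task name
class MyCustomTask(Task):
    VERSION = "2.0"
    DATASET_PATH = "my_org/my_dataset"  # placeholder HF dataset
    DATASET_NAME = None
    OUTPUT_TYPE = "loglikelihood"

    def has_training_docs(self):
        return False

    def has_validation_docs(self):
        return True

    def has_test_docs(self):
        return False

    def validation_docs(self):
        return self.dataset["validation"]

    def doc_to_text(self, doc):
        # Model input; field names are illustrative.
        return f"Question: {doc['question']}\nAnswer:"

    def doc_to_target(self, doc):
        # Target string, prefixed with the single connecting whitespace.
        return " " + doc["answer"]

    def construct_requests(self, doc, ctx, **kwargs):
        return Instance(
            request_type=self.OUTPUT_TYPE,
            doc=doc,
            arguments=(ctx, self.doc_to_target(doc)),
            **kwargs,
        )

    def process_results(self, doc, results):
        # results is a list of (loglikelihood, is_greedy) tuples for this request type.
        ll, is_greedy = results[0]
        return {"ppl": ll, "acc": int(is_greedy)}

    def aggregation(self):
        return {"ppl": perplexity, "acc": mean}

    def higher_is_better(self):
        return {"ppl": False, "acc": True}
```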
## Configuring Tasks with YAMLs
You can easily configure a task evaluation using YAML files; this allows for a faster and easier authoring experience.
### Doc to text
You can use Jinja 2 or f-strings to write a prompt template. To set a mapping of verbalizer to label, you can define that mapping in the Jinja string directly, as sketched below.
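For example, assuming a dataset with `premise` and `hypothesis` text columns and an integer `label` column (field names here are hypothetical), the verbalizer mapping can live inside the Jinja expression itself:
```yaml
doc_to_text: "{{premise}}\nQuestion: {{hypothesis}} True or False?\nAnswer:"
doc_to_target: "{{ ['True', 'False'][label] }}"   # maps label 0 -> "True", label 1 -> "False"
```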
## Including a Base YAML
You can base a YAML on another YAML file and use it as a template. This can be handy when you just need to change the prompt for `doc_to_text` but keep the rest the same, or change `filters` to compare which works better. Simply use `include` in the YAML file and write the name of the template you want to base it on. This assumes that the base template is in the same directory; otherwise, you will need to provide the full path.
```
include: <YAML file or with full path>
...
```
You can find an example of how to use this feature at [gsm8k-cot-self-consistency.yaml](https://github.com/EleutherAI/lm-evaluation-harness/blob/3c07cc04a92fc467d7c9a94894aeddd58c93a5da/lm_eval/tasks/gsm8k/gsm8k-cot-self-consistency.yaml), which is based on [gsm8k-cot.yaml](https://github.com/EleutherAI/lm-evaluation-harness/blob/3c07cc04a92fc467d7c9a94894aeddd58c93a5da/lm_eval/tasks/gsm8k/gsm8k-cot.yaml).
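For instance, a hypothetical variant that reuses everything from `gsm8k-cot.yaml` but swaps the prompt might look like this (the override values are illustrative, not the actual self-consistency config):
```yaml
include: gsm8k-cot.yaml          # base template, assumed to sit in the same directory
task: gsm8k-cot-alt-prompt       # hypothetical name for the variant
doc_to_text: "Q: {{question}}\nA: Let's think step by step."
```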
## Listing Metrics
Metrics can be defined in the `metric_list` argument when building the YAML config. Multiple metrics can be listed along with any auxiliary arguments. For example, when using `exact_match` (TODO: add URL to metric), auxiliary arguments such as `ignore_case`, `ignore_punctuation`, and `regexes_to_ignore` can be listed as well; they will be passed to the metric function as `kwargs`. Some metrics have predefined values for `aggregation` and `higher_is_better`, so listing only the metric name can be sufficient.
```
metric_list:
- metric: acc
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: false
regexes_to_ignore:
- ","
- "\\$"
```
## Using Promptsource
- load prompt from promptsource
## Good Reference Tasks
- This section should list some "canonized" task examples for different use cases / subcategories, as suggestions from which to build new tasks off of.
\ No newline at end of file
# New Task Guide
`lm-evaluation-harness` is a framework that strives to support a wide range of zero- and few-shot evaluation tasks on autoregressive language models (LMs).
This documentation page provides a walkthrough to get started creating your own task.
## Setup
If you haven't already, go ahead and fork the main repo, clone it, create a branch with the name of your task, and install the project requirements in your environment:
```sh
# After forking...
git clone https://github.com/<YOUR-USERNAME>/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout -b <task-name>
pip install -e ".[dev]"
```
As a concrete example, we'll walk through reimplementing the `gsm8k` benchmark (a *generative* task which requires sampling text from a model) and the `sciq` benchmark (a *discriminative*, or *multiple choice*, task where the model picks the most likely of several fixed answer choices).
## Creating a YAML file
- Tasks in eval harness are largely implemented via YAML files.
- mention the tasks worth "forking"/building off of
- Step through the different args all tasks will need
To implement a new standard task, we'll need to write a YAML file which configures our task logic. We start by making a new empty YAML file:
```sh
touch lm_eval/tasks/new_mcqa.yaml
```
or
```sh
touch lm_eval/tasks/new_generative_task.yaml
```
### Selecting and configuring a dataset
All data downloading and management is handled through the HuggingFace (**HF**) [`datasets`](https://github.com/huggingface/datasets) API. So, the first thing you should do is check to see if your task's dataset is already provided in their catalog [here](https://huggingface.co/datasets). If it's not in there, please consider adding it to their Hub to make it accessible to a wider user base by following their [new dataset guide](https://github.com/huggingface/datasets/blob/master/ADD_NEW_DATASET.md).
Once you have a HuggingFace dataset prepared for your task, we want to configure our new YAML to use this dataset:
```yaml
dataset_path: ... # the name of the dataset on the HF Hub.
dataset_name: ... # the dataset configuration to use. Leave `null` if your dataset does not require a config to be passed. See https://huggingface.co/docs/datasets/load_hub#configurations for more info.
dataset_kwargs: null # any extra keyword arguments that should be passed to the dataset constructor, e.g. `data_dir`.
```
Next, we'd like to tell our task what the dataset's train, validation, and test splits are named, if they exist:
```yaml
training_split: <split name of training set, or `null`>
validation_split: <split name of val. set, or `null`>
test_split: <split name of test set, or `null`>
```
Tests will run on the `test_split` if it is available, and otherwise evaluate on the `validation_split`.
We can also specify from which split the task should retrieve few-shot examples via:
```yaml
fewshot_split: <split name to draw fewshot examples from, or `null`>
```
though if this is not set, we will default to train/validation/test sets, in that order.
### Writing a prompt
The next thing we need to do is decide what format to use when presenting the data to the LM. This is our **prompt**, where we'll define both an input and output format.
We support the [Jinja 2](https://jinja.palletsprojects.com/en/3.1.x/) templating language for writing prompts. In practice, this means you can take your dataset's columns and do many basic string manipulations to place each document into prompted format.
To write a prompt, users are required to write two YAML fields in Jinja as strings:
```yaml
doc_to_text:
doc_to_target:
```
Suppose our dataset has a `"question"` field and an `"answer"` field, both strings. Given a `document` object that is a row of our dataset, we want the model to see:
```
Question: {document[question]}
Answer:
```
We do this by writing
```yaml
doc_to_text: "Question: {{question}}\nAnswer:"
```
Here, `{{question}}` will be replaced by `doc["question"]` when the prompt template is rendered.
Our intended output is for the model to predict a single whitespace, and then the answer to the question. We do this via:
```yaml
doc_to_target: "{{answer}}"
```
**Important**: We always add one whitespace between the input and output, such that the full input-output string is `doc_to_text(doc) + " " + doc_to_target(doc)`. `doc_to_text` and `doc_to_target` should not contain trailing right or left whitespace, respectively.
TODO: mention promptsource here, or reserve it for advanced guide
#### Multiple choice format
- template_aliases
- expected mcqa setup
### Setting metrics
You're almost done! Now we need to choose how to score our task.
- *If this is a multiple choice task:* do you just want to check your model's accuracy in choosing the correct answer choice?
- *If this is a generation task:* do you just want to check how often your model outputs *exactly the ground-truth output string provided*?
If the answer to the above is no: you'll need to record what scoring metrics to use! Metrics can be listed in the following format:
```yaml
metric_list:
- metric: <name of the metric here>
aggregation: <name of the aggregation fn here>
higher_is_better: <true or false>
- metric: ...
aggregation: ...
higher_is_better: ...
```
For a full list of natively supported metrics and aggregation functions see `TODO: we should list out all supported metrics, aggregations, models, somewhere in the docs.` All metrics supported in [HuggingFace Evaluate](https://github.com/huggingface/evaluate/tree/main/metrics) can also be used, and will be loaded if a given metric name is not one natively supported in `lm-eval`.
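For example, a metric name such as `bleu`, which is not implemented natively, would be expected to fall back to its Hugging Face Evaluate implementation under this scheme (treat the exact aggregation settings below as an assumption):
```yaml
metric_list:
  - metric: bleu               # not native to lm-eval; assumed to resolve via HF Evaluate
    aggregation: mean
    higher_is_better: true
```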
### Optional, more advanced setup
Some tasks may require more advanced processing logic than is described in this guide.
As a heuristic check:
* Does your task require generating multiple free-form outputs per input document?
* Does your task require complex, multi-step post-processing of generated model outputs?
* Does your task require subsetting documents on the fly based on their content?
* Do you expect to compute metrics after applying multiple such processing steps on your model outputs?
* Does your task rely on metrics that need a custom implementation?
For more detail on the task system and advanced features, see `docs/advanced_task_guide.md`. If none of the above sound like they apply to your task, it's time to continue on to checking your task performance!
### Task name + groups (registering a task)
To test a task conveniently, it helps to *register* the task: that is, to give it a name and make the `lm-eval` library aware that it exists!
If you're writing your YAML file inside the `lm_eval/tasks` folder, you just need to give your task a name! You can do this inside your YAML file:
```yaml
task: <name of the task>
```
Including a task name is mandatory.
It is often also convenient to label your task with several `groups`, or tags, though this field is optional:
```yaml
group:
- group1
- group2
```
This will add your task to the `group1` and `group2` groups, letting others know how to categorize your task and, if desired, run all tasks in one of these groups at once, your task along with them.
If your task is not in the `lm_eval/tasks` folder, you'll need to tell the Eval Harness where to look for YAML files.
You can do this by adding the Python snippet
```python
from lm_eval.tasks import include_task_folder
include_task_folder("/path/to/yaml/parent/folder")
```
to the top of any Python file that is run or imported when performing evaluation, such as `main.py`.
Passing `--tasks /path/to/yaml/file` is also accepted.
## Checking validity
- write_out
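As a stopgap while this section is written, here is a hedged example of sanity-checking rendered prompts with `scripts/write_out.py`, assuming the script keeps its pre-refactor interface (flag names may differ on this branch):
```sh
python scripts/write_out.py \
    --output_base_path /tmp/writeout \
    --tasks <your-task-name> \
    --sets test \
    --num_fewshot 5 \
    --num_examples 10
```
Inspecting the written-out files lets you confirm that the few-shot context and targets render as intended before running a full evaluation.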
## Checking performance ; implementation equivalence
## Task impl. checklist
- turn this into a GH PR template too
- README.md in task dir
## Submitting your task
You're all set! Now push your work and make a pull request! Thanks for the contribution 👍. If there are any questions, please leave a message in the `#lm-thunderdome` channel on the EAI Discord!
\ No newline at end of file
......@@ -45,6 +45,16 @@ def acc_fn(items): # This is a passthrough function
return items
@register_metric(
metric="acc_norm",
higher_is_better=True,
output_type=["loglikelihood", "multiple_choice"],
aggregation="mean",
)
def acc_norm_fn(items): # This is a passthrough function
return items
@register_metric(
metric="acc_mutual_info",
higher_is_better=True,
......
......@@ -31,6 +31,7 @@ def get_model(model_name):
TASK_REGISTRY = {}
GROUP_REGISTRY = {}
ALL_TASKS = []
func2task_index = {}
......@@ -49,10 +50,6 @@ def register_task(name):
def register_group(name):
def decorate(fn):
# assert (
# name not in GROUP_REGISTRY
# ), f"group named '{name}' conflicts with existing registered group!"
func_name = func2task_index[fn.__name__]
if name in GROUP_REGISTRY:
GROUP_REGISTRY[name].append(func_name)
......@@ -77,6 +74,7 @@ DEFAULT_METRIC_REGISTRY = {
"loglikelihood_rolling": ["word_perplexity", "byte_perplexity", "bits_per_byte"],
"multiple_choice": [
"acc",
"acc_norm"
],
"greedy_until": ["exact_match"],
}
......
import abc
from dataclasses import dataclass, field
from dataclasses import dataclass, field, asdict
import re
import ast
......@@ -51,11 +51,9 @@ ALL_OUTPUT_TYPES = [
class TaskConfig(dict):
task: str = None
group: str = None
group: Union[str, list] = None
reference: str = None
task_name: str = (
None # TODO: deprecate this, it'll be set in __post_init__ to be names[0]
)
dataset_path: str = None
dataset_name: str = None
dataset_kwargs: dict = None
......@@ -68,6 +66,7 @@ class TaskConfig(dict):
aliases: Union[str, list] = None
doc_to_text: Union[Callable, str] = None
doc_to_target: Union[Callable, str] = None
use_prompt: str = None
num_fewshot: int = 0
batch_size: int = 1
......@@ -79,12 +78,8 @@ class TaskConfig(dict):
generation_kwargs: dict = None
delimiter: str = "\n\n"
filter_list: Union[str, list] = None
normalization: str = (
None # TODO: add length-normalization of various types, mutual info
)
should_decontaminate: bool = False
doc_to_decontamination_query: str = None
use_prompt: str = None
metadata: str = None # by default, not used in the code. allows for users to pass arbitrary info to tasks
......@@ -102,13 +97,17 @@ class TaskConfig(dict):
if type(self.gold_alias) == str:
self.gold_alias = self.template_aliases + self.doc_to_target
if not self.generation_kwargs:
if self.generation_kwargs or self.output_type == "greedy_until":
assert self.output_type == "greedy_until", "passed `generation_kwargs`, but not using a generation request type!"
# ensure that we greedily generate in absence of explicit arguments otherwise
self.generation_kwargs = {"do_sample": False, "temperature": 0.0}
def __getitem__(self, item):
return getattr(self, item)
def to_dict(self):
return asdict(self)
class Task(abc.ABC):
"""A task represents an entire benchmark including its dataset, problems,
......@@ -460,10 +459,20 @@ class Task(abc.ABC):
eval_logger.warning("No filter defined, passing through instances")
return self._instances
def dump_config(self):
"""Returns a dictionary representing the task's config.
:returns: str
The fewshot context.
"""
# TODO: this should only return the overrides applied to a non-YAML task's configuration.
# (batch size, num_fewshot)
return self._config.to_dict()
class ConfigurableTask(Task):
VERSION = "2.0"
VERSION = "Yaml"
OUTPUT_TYPE = None
CONFIG = None
......@@ -503,7 +512,7 @@ class ConfigurableTask(Task):
_metric_list = DEFAULT_METRIC_REGISTRY[self._config.output_type]
if self._config.metric_list is None:
# TODO: handle this in TaskConfig.__post_init__ ?
for metric_name in _metric_list:
self._metric_fn_list[metric_name] = METRIC_REGISTRY[metric_name]
self._aggregation_list[metric_name] = DEFAULT_AGGREGATION_REGISTRY[
......@@ -521,9 +530,9 @@ class ConfigurableTask(Task):
for key in metric_config
if key not in ["metric", "aggregation", "higher_is_better"]
}
if metric_name in _metric_list:
try:
self._metric_fn_list[metric_name] = METRIC_REGISTRY[metric_name]
else:
except:
eval_logger.warning(
f"Metric {metric_name} not found, "
"Searching from https://huggingface.co/evaluate-metric"
......@@ -540,7 +549,8 @@ class ConfigurableTask(Task):
)
if "aggregation" in metric_config:
self._aggregation_list[metric_name] = metric_config["aggregation"]
agg_name = metric_config["aggregation"]
self._aggregation_list[metric_name] = AGGREGATION_REGISTRY[agg_name]
else:
eval_logger.warning(
f"metric {metric_name} is defined, but aggregation is not"
......@@ -579,12 +589,11 @@ class ConfigurableTask(Task):
key: function[key] for key in function if key != "function"
}
components.append([function["function"], kwargs])
filter_pipeline = build_filter_ensemble(filter_name, components)
self._filters.append(filter_pipeline)
else:
self._filters = [
build_filter_ensemble("take_first", [["take_first", None]])
build_filter_ensemble("none", [["take_first", None]])
]
if self._config.use_prompt is not None:
......@@ -598,7 +607,7 @@ class ConfigurableTask(Task):
if self.fewshot_docs() is not None:
self.sampler = samplers.Sampler(
list(self.fewshot_docs()), self, rnd=random.Random()
) # TODO: pass the correct docs in here
)
def download(self, dataset_kwargs=None):
......@@ -639,15 +648,15 @@ class ConfigurableTask(Task):
return self.dataset[self._config.test_split]
def fewshot_docs(self):
if (self._config.num_fewshot > 0) and (self._config.fewshot_split is None):
eval_logger.warning(
"num_fewshot > 0 but fewshot_split is None. "
"using preconfigured rule."
)
return super().fewshot_docs()
elif self._config.fewshot_split is not None:
if self._config.fewshot_split is not None:
return self.dataset[self._config.fewshot_split]
else:
if self._config.num_fewshot > 0:
eval_logger.warning(
"num_fewshot > 0 but fewshot_split is None. "
"using preconfigured rule."
)
return super().fewshot_docs()
def should_decontaminate(self):
return self._config.should_decontaminate
......@@ -818,7 +827,7 @@ class ConfigurableTask(Task):
)
if (
2 * len(choices) == len(lls)
and "acc_mutual_info" in self._metric_list.keys()
and "acc_mutual_info" in self._metric_fn_list.keys()
):
# then we are doing mutual info.
# this stores the "dryrun" / unconditional answer loglikelihoods
......
......@@ -141,15 +141,16 @@ def evaluate(
results = collections.defaultdict(dict)
versions = collections.defaultdict(dict)
configs = collections.defaultdict(dict)
requests = collections.defaultdict(list)
# requests_origin = collections.defaultdict(list)
# docs = {}
# get lists of each type of request
for task_name, task in task_dict.items():
versions[task_name] = task.VERSION
configs[task_name] = dict(task.dump_config()) # TODO: don't access a private attribute here ; for non-YAML tasks handle this case
# deterministically shuffle docs and chop off the first `limit` because sometimes docs are in some kind of order
# task_docs = list(task_doc_func())
......@@ -289,7 +290,7 @@ def evaluate(
if stderr is not None:
results[task_name][metric + "_stderr" + "," + key] = stderr(items)
return {"results": dict(results), "versions": dict(versions)}
return {"results": dict(results), "configs": dict(configs), "versions": dict(versions)}
else:
return None
......@@ -19,37 +19,42 @@ def get_task_name_from_config(task_config):
return "{dataset_path}_{dataset_name}".format(**task_config)
def include_task_folder(task_dir):
"""
Calling this function
"""
for root, subdirs, file_list in os.walk(task_dir):
if (subdirs == []) and (len(file_list) > 0):
for f in file_list:
if f.endswith(".yaml"):
yaml_path = os.path.join(root, f)
try:
config = utils.load_yaml_config(yaml_path)
SubClass = type(
config["task"] + "ConfigurableTask",
(ConfigurableTask,),
{"CONFIG": TaskConfig(**config)},
)
if "task" in config:
task_name = "{}".format(config["task"])
register_task(task_name)(SubClass)
if "group" in config:
for group in config["group"]:
register_group(group)(SubClass)
except Exception as error:
eval_logger.warning(
"Failed to load config in\n"
f" {yaml_path}\n"
" Config will not be added to registry\n"
f" Error: {error}"
)
task_dir = os.path.dirname(os.path.abspath(__file__)) + "/"
for root, subdirs, file_list in os.walk(task_dir):
if (subdirs == []) and (len(file_list) > 0):
for file in file_list:
if "yaml" in file:
yaml_path = os.path.join(root, file)
try:
config = utils.load_yaml_config(yaml_path)
SubClass = type(
config["task"] + "ConfigurableTask",
(ConfigurableTask,),
{"CONFIG": TaskConfig(**config)},
)
if "task" in config:
task_name = "{}".format(config["task"])
register_task(task_name)(SubClass)
if "group" in config:
for group in config["group"]:
register_group(group)(SubClass)
except Exception as error:
eval_logger.warning(
"Failed to load config in\n"
f" {yaml_path}\n"
" Config will not be added to registry"
f" Error: {error}"
)
ALL_TASKS = sorted(list(TASK_REGISTRY.keys()) + list(GROUP_REGISTRY.keys()))
include_task_folder(task_dir)
def get_task(task_name, config):
......
"""
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
https://arxiv.org/pdf/1803.05457.pdf
The ARC dataset consists of 7,787 science exam questions drawn from a variety
of sources, including science questions provided under license by a research
partner affiliated with AI2. These are text-only, English language exam questions
that span several grade levels as indicated in the files. Each question has a
multiple choice structure (typically 4 answer options). The questions are sorted
into a Challenge Set of 2,590 “hard” questions (those that both a retrieval and
a co-occurrence method fail to answer correctly) and an Easy Set of 5,197 questions.
Homepage: https://allenai.org/data/arc
"""
from lm_eval import utils
from lm_eval.prompts import get_prompt
from lm_eval.api.task import MultipleChoiceTask
from lm_eval.api.registry import register_task, register_group
_CITATION = """
@article{Clark2018ThinkYH,
title={Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge},
author={Peter Clark and Isaac Cowhey and Oren Etzioni and Tushar Khot and Ashish Sabharwal and Carissa Schoenick and Oyvind Tafjord},
journal={ArXiv},
year={2018},
volume={abs/1803.05457}
}
"""
@register_group("arc")
@register_task("arc_easy")
class ARCEasy(MultipleChoiceTask):
VERSION = "2.0"
DATASET_PATH = "ai2_arc"
DATASET_NAME = "ARC-Easy"
OUTPUT_TYPE = "loglikelihood"
def has_training_docs(self):
return True
def has_validation_docs(self):
return True
def has_test_docs(self):
return True
def training_docs(self):
if self._training_docs is None:
self._training_docs = list(map(self._process_doc, self.dataset["train"]))
return self._training_docs
def validation_docs(self):
return map(self._process_doc, self.dataset["validation"])
def test_docs(self):
return map(self._process_doc, self.dataset["test"])
def _process_doc(self, doc):
# NOTE: Some `doc["answerKey"]`s are in numeric string format being one
# of {'1', '2', '3', '4', '5'}. We map them back to letters.
num_to_letter = {"1": "A", "2": "B", "3": "C", "4": "D", "5": "E"}
doc["answerKey"] = num_to_letter.get(doc["answerKey"], doc["answerKey"])
out_doc = {
"id": doc["id"],
"question": doc["question"],
"choices": doc["choices"]["text"],
"gold": ["A", "B", "C", "D", "E"].index(doc["answerKey"]),
}
return out_doc
def doc_to_text(self, doc):
doc_to_text = get_prompt("qa-basic:question-newline-answer")
return utils.apply_template(doc_to_text, doc)
def should_decontaminate(self):
return True
def doc_to_decontamination_query(self, doc):
return doc["query"]
@register_group("arc")
@register_task("arc_challenge")
class ARCChallenge(ARCEasy):
DATASET_PATH = "ai2_arc"
DATASET_NAME = "ARC-Challenge"
group:
- arc_yaml
task: arc_challenge_yaml
- ai2_arc
- multiple_choice
task: arc_challenge
dataset_path: ai2_arc
dataset_name: ARC-Challenge
output_type: multiple_choice
......
group:
- arc_yaml
task: arc_easy_yaml
- ai2_arc
- multiple_choice
task: arc_easy
dataset_path: ai2_arc
dataset_name: ARC-Easy
output_type: multiple_choice
......
......@@ -30,3 +30,17 @@ Homepage: https://github.com/openai/grade-school-math
primaryClass={cs.LG}
}
```
### Checklist
- [x] Is in Eval-harness v1.0 ?
- [ ] Has been checked for regression from v1.0?
- [ ] Has been checked for equivalence with original paper methodology?
- [ ] "Main" checked variant clearly denoted?
### Variant Wishlist
- [ ] Variant with Calculator (see https://github.com/openai/grade-school-math/blob/master/grade_school_math/calculator.py for example implementation)
- [ ] Using Verifiers
- [ ] Majority voting "without CoT"
\ No newline at end of file
group:
- greedy_until
- math_word_problems
task: gsm8k_yaml
dataset_path: gsm8k
dataset_name: main
......@@ -25,7 +28,7 @@ generation_kwargs:
- "Question:"
do_sample: false
temperature: 0.0
repeats: 2
repeats: 1
num_fewshot: 5
# filter_list:
# - name: "get-answer"
......
"""
The LAMBADA dataset: Word prediction requiring a broad discourse context∗
https://arxiv.org/pdf/1606.06031.pdf
LAMBADA is a dataset to evaluate the capabilities of computational models for text
understanding by means of a word prediction task. LAMBADA is a collection of narrative
passages sharing the characteristic that human subjects are able to guess their last
word if they are exposed to the whole passage, but not if they only see the last
sentence preceding the target word. To succeed on LAMBADA, computational models
cannot simply rely on local context, but must be able to keep track of information
in the broader discourse.
Homepage: https://zenodo.org/record/2630551#.X4Xzn5NKjUI
"""
from lm_eval.api.task import Task
from lm_eval.api.instance import Instance
from lm_eval.api.metrics import mean, perplexity
from lm_eval.api.registry import register_task, register_group
_CITATION = """
@misc{
author={Paperno, Denis and Kruszewski, Germán and Lazaridou, Angeliki and Pham, Quan Ngoc and Bernardi, Raffaella and Pezzelle, Sandro and Baroni, Marco and Boleda, Gemma and Fernández, Raquel},
title={The LAMBADA dataset},
DOI={10.5281/zenodo.2630551},
publisher={Zenodo},
year={2016},
month={Aug}
}
"""
class LambadaBase(Task):
VERSION = None
OUTPUT_TYPE = "loglikelihood"
def training_docs(self):
if self.has_training_docs():
return self.dataset["train"]
def validation_docs(self):
if self.has_validation_docs():
return self.dataset["validation"]
def test_docs(self):
if self.has_test_docs():
return self.dataset["test"]
def doc_to_text(self, doc):
return doc["text"].rsplit(" ", 1)[0]
def should_decontaminate(self):
return True
def doc_to_decontamination_query(self, doc):
return doc["text"]
def doc_to_target(self, doc):
return " " + doc["text"].rsplit(" ", 1)[1]
def construct_requests(self, doc, ctx, **kwargs):
return Instance(
request_type=self.OUTPUT_TYPE,
doc=doc,
arguments=(ctx, self.doc_to_target(doc)),
**kwargs
)
def process_results(self, doc, results):
# TODO: this ^ is a hack. filters should make it so that we only have one response per request that we score
results = results[
0
] # TODO: recheck this. currently a list of [(ll, is_greedy)] is passed in
ll, is_greedy = results
return {"ppl": ll, "acc": int(is_greedy)}
def aggregation(self):
return {"ppl": perplexity, "acc": mean}
def higher_is_better(self):
return {"ppl": False, "acc": True}
@register_task("lambada_standard")
class LambadaStandard(LambadaBase):
"""The LAMBADA task using the standard original LAMBADA dataset."""
VERSION = "2.0"
DATASET_PATH = "lambada"
def has_training_docs(self):
return False
def has_validation_docs(self):
return True
def has_test_docs(self):
return True
@register_task("lambada_openai")
class LambadaOpenAI(LambadaBase):
"""The LAMBADA task using the LAMBADA OpenAI dataset, a modified version of the
original LAMBADA dataset created by OpenAI for evaluating their GPT-2 model.
Reference: https://github.com/openai/gpt-2/issues/131#issuecomment-497136199
"""
VERSION = "2.0"
DATASET_PATH = "EleutherAI/lambada_openai"
def has_training_docs(self):
return False
def has_validation_docs(self):
return False
def has_test_docs(self):
return True
group:
- lambada
task: lambada_openai_yaml
- loglikelihood
- perplexity
task: lambada_openai
dataset_path: EleutherAI/lambada_openai
dataset_name: default
output_type: loglikelihood
......
group:
- lambada
task: lambada_standard_yaml
- loglikelihood
- perplexity
task: lambada_standard
dataset_path: lambada
dataset_name: null
output_type: loglikelihood
......
group:
- lambada_cloze
- loglikelihood
task: lambada_openai_cloze_yaml
dataset_path: EleutherAI/lambada_openai
dataset_name: default
......
group:
- lambada_cloze
- loglikelihood
task: lambada_standard_cloze_yaml
dataset_path: lambada
dataset_name: null
......
"""
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
https://arxiv.org/pdf/2101.00027.pdf
The Pile is a 825 GiB diverse, open source language modelling data set that consists
of 22 smaller, high-quality datasets combined together. To score well on Pile
BPB (bits per byte), a model must be able to understand many disparate domains
including books, github repositories, webpages, chat logs, and medical, physics,
math, computer science, and philosophy papers.
Homepage: https://pile.eleuther.ai/
"""
from lm_eval.api.task import PerplexityTask
from lm_eval.api.registry import register_task, register_group
_CITATION = """
@article{pile,
title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},
author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},
journal={arXiv preprint arXiv:2101.00027},
year={2020}
}
"""
class PilePerplexityTask(PerplexityTask):
VERSION = "2.0"
DATASET_PATH = "EleutherAI/the_pile"
DATASET_NAME = None
def has_training_docs(self):
return False
def test_docs(self):
for doc in self.dataset["train"].select(range(100)):
yield doc
def has_validation_docs(self):
return False
def has_test_docs(self):
return True
def doc_to_target(self, doc):
return doc["text"]
# def validation_docs(self):
# for doc in self.dataset["validation"]:
# yield doc["text"]
# def test_docs(self):
# for doc in self.dataset["test"]:
# yield doc["text"]
class PileArxiv(PilePerplexityTask):
DATASET_NAME = "pile_arxiv"
class PileBooks3(PilePerplexityTask):
DATASET_NAME = "pile_books3"
class PileBookCorpus2(PilePerplexityTask):
DATASET_NAME = "pile_bookcorpus2"
class PileDmMathematics(PilePerplexityTask):
DATASET_NAME = "pile_dm-mathematics"
@register_task("pile_enron")
class PileEnron(PilePerplexityTask):
DATASET_NAME = "enron_emails"
class PileEuroparl(PilePerplexityTask):
DATASET_NAME = "pile_europarl"
class PileFreeLaw(PilePerplexityTask):
DATASET_NAME = "pile_freelaw"
class PileGithub(PilePerplexityTask):
DATASET_NAME = "pile_github"
class PileGutenberg(PilePerplexityTask):
DATASET_NAME = "pile_gutenberg"
class PileHackernews(PilePerplexityTask):
DATASET_NAME = "pile_hackernews"
class PileNIHExporter(PilePerplexityTask):
DATASET_NAME = "pile_nih-exporter"
class PileOpenSubtitles(PilePerplexityTask):
DATASET_NAME = "pile_opensubtitles"
class PileOpenWebText2(PilePerplexityTask):
DATASET_NAME = "pile_openwebtext2"
class PilePhilPapers(PilePerplexityTask):
DATASET_NAME = "pile_philpapers"
class PilePileCc(PilePerplexityTask):
DATASET_NAME = "pile_pile-cc"
class PilePubmedAbstracts(PilePerplexityTask):
DATASET_NAME = "pile_pubmed-abstracts"
class PilePubmedCentral(PilePerplexityTask):
DATASET_NAME = "pile_pubmed-central"
class PileStackExchange(PilePerplexityTask):
DATASET_NAME = "pile_stackexchange"
class PileUspto(PilePerplexityTask):
DATASET_NAME = "pile_upsto"
class PileUbuntuIrc(PilePerplexityTask):
DATASET_NAME = "pile_ubuntu-irc"
class PileWikipedia(PilePerplexityTask):
DATASET_NAME = "pile_wikipedia"
class PileYoutubeSubtitles(PilePerplexityTask):
DATASET_NAME = "pile_youtubesubtitles"
group:
- pile
- perplexity
- loglikelihood_rolling
task: pile_arxiv
dataset_path: EleutherAI/the_pile
dataset_name: pile_arxiv
......