Unverified Commit 02362e6a authored by Lintang Sutawika's avatar Lintang Sutawika Committed by GitHub

Merge branch 'big-refactor' into refactor-more-tasks

parents b1b5239d 0ba4ae15
......@@ -5,3 +5,4 @@ lm_cache
.idea
*.egg-info/
.vscode/
......@@ -14,32 +14,42 @@ Tasks are configured via the `TaskConfig` object. Below, we describe all fields
### Parameters
Task naming + registration:
- **task** (`str`, defaults to None) — name of the task.
- **group** (`str`, *optional*) — name of the task group(s) a task belongs to. Enables one to run all tasks with a specified tag or group name at once.
- **reference** (`str`, *optional*) —
Dataset configuration options:
- **dataset_path** (`str`) — The name of the dataset as listed by HF in the datasets Hub.
- **dataset_name** (`str`, *optional*, defaults to None) — The name of what HF calls a “data instance” or sub-task of the benchmark. If your task does not contain any data instances, leave this as the default None. (If you're familiar with the HF `datasets.load_dataset` function, `dataset_path` and `dataset_name` are simply its first two arguments.)
- **dataset_kwargs** (`dict`, *optional*) — Auxiliary arguments that `datasets.load_dataset` accepts. This can be used to specify arguments such as `data_files` or `data_dir` if you want to use local datafiles such as json or csv.
- **training_split** (`str`, *optional*) — Split in the dataset to use as the training split.
- **validation_split** (`str`, *optional*) — Split in the dataset to use as the validation split.
- **test_split** (`str`, *optional*) — Split in the dataset to use as the test split.
- **fewshot_split** (`str`, *optional*) — Split in the dataset to draw few-shot exemplars from. Should be set whenever `num_fewshot` > 0, and generally should not be the same split being evaluated.
Prompting / in-context formatting options:
- **template_aliases** (`str`, *optional*) — A field for additional Jinja2 code. It is not meant to render as text itself, but to define Jinja variables that the written prompts can then use (for example, mapping the dataset column `label` to the new name `gold`).
- **use_prompt** (`str`, *optional*) — Name of a prompt in Promptsource to use. If defined, it will overwrite `doc_to_text` and `doc_to_target`, and `template_aliases` will be unused.
- **doc_to_text** (`Union[Callable, str]`, *optional*) — Jinja2 template, f-string, or function to process a sample into the appropriate input for the model.
- **doc_to_target** (`Union[Callable, str]`, *optional*) — Jinja2 template, f-string, or function to process a sample into the appropriate target output for the model.
- **fewshot_delimiter** (`str`, *optional*, defaults to "\n\n") — String to insert between few-shot examples.
- **target_delimiter** (`str`, *optional*, defaults to `" "`) — String to insert between input and target output for the datapoint being tested.
Runtime configuration options:
- **num_fewshot** (`int`, *optional*, defaults to 0) — Number of few-shot examples before the input.
- **batch_size** (`int`, *optional*, defaults to 1) — Batch size.
- **repeats** (`int`, *optional*, defaults to 1) — Number of repeated runs for each sample. Can be used for techniques such as self-consistency.
Scoring details:
- **metric_list** (`str`, *optional*, defaults to None) — A list of metrics to use for evaluation. Each entry names a `metric` and may optionally specify an `aggregation` and `higher_is_better`; see the configuration sketch after this list for one possible format.
- **gold_alias** (`str`, *optional*, defaults to None) — If provided, used to generate the reference answer that predictions are scored against. Useful when `doc_to_target` should render the "target string" appended to each few-shot exemplar's input, while the value passed to the metric function as `gold` should instead come from `gold_alias`.
- **output_type** (`str`, *optional*, defaults to "greedy_until") — Selects the type of model output for the given task. Options are `greedy_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`.
- **generation_kwargs** (`dict`, *optional*) — Auxiliary arguments for the `generate` function from HF transformers library. Advanced keyword arguments may not be supported for non-HF LM classes.
- **filter_list** (`Union[str, list]`, *optional*) — List of filters to postprocess model outputs. See below for further detail on the filter API.
- **should_decontaminate** (`bool`, *optional*, defaults to False) — Whether this task's documents should be checked for overlap with a training corpus as part of decontamination.
- **doc_to_decontamination_query** (`str`, *optional*) — Which portion of each document to use as the decontamination query (e.g. `"{{text}}"`).
Other:
- **metadata** (`str`, *optional*) — An optional field where arbitrary metadata can be passed.
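Putting several of these fields together, here is a minimal illustrative configuration, written as a Python `TaskConfig` (the same fields can be supplied via a task's YAML file). The import path, task name, and concrete values below are assumptions chosen for illustration and may differ in your checkout:

```python
# A minimal, hypothetical task configuration. Field names follow this guide;
# the import path and concrete values are illustrative assumptions only.
from lm_eval.api.task import TaskConfig  # assumed module path

config = TaskConfig(
    task="boolq_demo",                       # hypothetical task name
    dataset_path="super_glue",               # HF Hub dataset
    dataset_name="boolq",                    # HF "data instance" / sub-task
    training_split="train",
    validation_split="validation",
    # define `answer_choices` for use in the Jinja templates below
    template_aliases="{% set answer_choices = ['no', 'yes'] %}",
    doc_to_text="{{passage}}\nQuestion: {{question}}\nAnswer:",
    doc_to_target="{{answer_choices[label]}}",
    target_delimiter=" ",
    fewshot_delimiter="\n\n",
    num_fewshot=0,
    output_type="multiple_choice",
    metric_list=[
        {"metric": "acc", "aggregation": "mean", "higher_is_better": True},
    ],
)
```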
## Filters
......
# Description Guide
![fewshot-example](./img/fewshot_example_gpt3.png)
(Figure from [Brown et al., 2020](https://arxiv.org/pdf/2005.14165.pdf))
Task descriptions provide in-context task instruction for your language model. If you'd like to prepend a natural language description to your few-shot examples and prompt, you can do so on a per-task basis via the `description_dict` arg of [`evaluator.evaluate`](../lm_eval/evaluator.py). This `description_dict` must adhere to the following key-value structure:
- **key**: the task name (`str`) as specified in the lm-eval-harness [task registry](../lm_eval/tasks/__init__.py).
- **value**: the corresponding (`str`) description/prompt for the task identified by **key**.
```python
description_dict = {
"task_name_1": "description",
"task_name_2": "description",
...
}
```
Note that a task's description will be separated from its following few-shot examples and prompt by a new line as such:
```python
"""
<description>
<examples>
<prompt>
"""
```
## Descriptions in File
One can also interface with the aforementioned [`evaluator.evaluate`](../lm_eval/evaluator.py) (or `evaluator.simple_evaluate`) method from a higher level by simply passing a JSON file path to the `description_dict_path` arg of the command-line interface (CLI) program, `main.py`. The JSON file pointed to should be structured the same as the `description_dict`. E.g. for some file at `/your/path/descriptions.json` you may have:
```json
{
"cycle_letters": "Please unscramble the letters into a word, and write that word:",
"copa": "Given a premise and one alternative with a causal relation to the premise and another without, choose the more plausible alternative"
}
```
which can then be supplied to the CLI as:
```bash
python main.py \
--tasks cycle_letters,copa \
--description_dict_path /your/path/descriptions.json \
...
```
......@@ -157,3 +157,17 @@ def get_aggregation(name):
raise Warning(
"{} not a registered aggregation metric!".format(name),
)
def get_default_aggregation(metric_name):
try:
return DEFAULT_AGGREGATION_REGISTRY[metric_name]
except KeyError:
raise Warning(f"No default aggregation metric for metric '{metric_name}'!")
def is_higher_better(metric_name):
try:
return HIGHER_IS_BETTER_REGISTRY[metric_name]
except KeyError:
raise Warning(f"higher_is_better not specified for metric '{metric_name}'!")
......@@ -24,19 +24,18 @@ from lm_eval.logger import eval_logger
from lm_eval.prompts import get_prompt
from lm_eval.filters import build_filter_ensemble
from lm_eval.api.metrics import (
# get_metric,
# get_aggregation,
mean,
weighted_perplexity,
bits_per_byte,
)
from lm_eval.api.registry import (
METRIC_REGISTRY,
get_metric,
get_aggregation,
get_default_aggregation,
is_higher_better,
DEFAULT_METRIC_REGISTRY,
OUTPUT_TYPE_REGISTRY,
AGGREGATION_REGISTRY,
HIGHER_IS_BETTER_REGISTRY,
DEFAULT_AGGREGATION_REGISTRY,
)
ALL_OUTPUT_TYPES = [
......@@ -50,10 +49,12 @@ ALL_OUTPUT_TYPES = [
@dataclass
class TaskConfig(dict):
# task naming/registry
task: str = None
group: Union[str, list] = None
# HF dataset options.
# which dataset to use,
# and what splits for what purpose
dataset_path: str = None
dataset_name: str = None
dataset_kwargs: dict = None
......@@ -61,24 +62,25 @@ class TaskConfig(dict):
validation_split: str = None
test_split: str = None
fewshot_split: str = None # TODO: assert that this not None if num_fewshot > 0. (?) assert if this is same split as one evaling (?)
# formatting / prompting options.
# see docs/advanced_task_guide.md for more info
template_aliases: str = None
doc_to_text: Union[Callable, str] = None
doc_to_target: Union[Callable, str] = None
use_prompt: str = None
description: str = ""
target_delimiter: str = " "
fewshot_delimiter: str = "\n\n"
# runtime configuration options
num_fewshot: int = 0
batch_size: int = 1
repeats: int = 1
# scoring options
metric_list: str = None
gold_alias: Union[Callable, str] = None
create_choices: Union[Callable, str] = None
output_type: str = "greedy_until"
generation_kwargs: dict = None
filter_list: Union[str, list] = None
should_decontaminate: bool = False
doc_to_decontamination_query: str = None
......@@ -480,7 +482,7 @@ class Task(abc.ABC):
The fewshot context.
"""
# TODO: this should only return the overrides applied to a non-YAML task's configuration.
# (batch size, num_fewshot)
# (num_fewshot)
return self._config.to_dict()
......@@ -528,13 +530,11 @@ class ConfigurableTask(Task):
if self._config.metric_list is None:
# TODO: handle this in TaskConfig.__post_init__ ?
for metric_name in _metric_list:
self._metric_fn_list[metric_name] = get_metric(metric_name)
self._aggregation_list[metric_name] = get_default_aggregation(
metric_name
)
self._higher_is_better[metric_name] = is_higher_better(metric_name)
else:
for metric_config in self._config.metric_list:
assert "metric" in metric_config
......@@ -544,30 +544,13 @@ class ConfigurableTask(Task):
for key in metric_config
if key not in ["metric", "aggregation", "higher_is_better"]
}
try:
self._metric_fn_list[metric_name] = METRIC_REGISTRY[metric_name]
except Exception:
eval_logger.warning(
f"Metric {metric_name} not found, "
"Searching from https://huggingface.co/evaluate-metric"
)
try:
metric_object = evaluate.load(metric_name)
self._metric_fn_list[metric_name] = metric_object
self._metric_fn_kwargs[metric_name] = kwargs
except Exception:
raise Warning(
"{} not found in the evaluate library!".format(metric_name),
"Please check https://huggingface.co/evaluate-metric",
)
self._metric_fn_list[metric_name] = get_metric(metric_name)
self._metric_fn_kwargs[metric_name] = kwargs
if "aggregation" in metric_config:
agg_name = metric_config["aggregation"]
if type(agg_name) == str:
self._aggregation_list[metric_name] = get_aggregation(agg_name)
elif callable(agg_name):
self._aggregation_list[metric_name] = metric_config[
"aggregation"
......@@ -575,7 +558,7 @@ class ConfigurableTask(Task):
else:
INV_AGG_REGISTRY = {v: k for k, v in AGGREGATION_REGISTRY.items()}
metric_agg = get_default_aggregation(metric_name)
eval_logger.warning(
f"metric {metric_name} is defined, but aggregation is not. "
f"using default "
......@@ -591,11 +574,9 @@ class ConfigurableTask(Task):
eval_logger.warning(
f"metric {metric_name} is defined, but higher_is_better is not. "
f"using default "
f"higher_is_better={HIGHER_IS_BETTER_REGISTRY[metric_name]}"
f"higher_is_better={is_higher_better(metric_name)}"
)
self._higher_is_better[metric_name] = is_higher_better(metric_name)
self.download(self._config.dataset_kwargs)
self._training_docs = None
......@@ -865,7 +846,6 @@ class ConfigurableTask(Task):
else:
gold = int(self.doc_to_target(doc))
pred = np.argmax(lls)
# retrieve choices in List[str] form, to compute choice lengths, etc.
choices = self.create_choices(doc)
if (
......@@ -879,6 +859,8 @@ class ConfigurableTask(Task):
# and this stores our "regular" conditional loglikelihoods
lls = lls[::2]
pred = np.argmax(lls)
acc = 1.0 if np.argmax(lls) == gold else 0.0
completion_len = np.array([float(len(i)) for i in choices])
acc_norm = 1.0 if np.argmax(lls / completion_len) == gold else 0.0
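# Illustration (not part of this diff): the length normalization behind
# `acc_norm` above, with made-up numbers. Longer answer strings tend to get
# lower total loglikelihood, so each choice's loglikelihood is divided by its
# character length before taking the argmax, e.g.:
#   lls = np.array([-4.0, -6.0]); completion_len = np.array([5.0, 30.0])
#   np.argmax(lls) == 0, but np.argmax(lls / completion_len) == 1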
......@@ -890,7 +872,6 @@ class ConfigurableTask(Task):
**({"acc_norm": acc_norm} if "acc_norm" in use_metric else {}),
}
# TODO: set which normalization metrics should be reported, and calculate them
if "exact_match" in self._metric_fn_list.keys():
# TODO: this gets score of 0 on arc_challenge for pythia-70m. need to test that this works properly
is_greedy = is_greedy[gold] # take value for the gold answer
......@@ -926,7 +907,7 @@ class ConfigurableTask(Task):
gold = self.doc_to_target(doc)
for key, result in zip(self._metric_fn_list.keys(), results):
_dict = self._metric_fn_list[key](
references=[gold],
predictions=[result],
**self._metric_fn_kwargs[key],
......
......@@ -45,7 +45,6 @@ def simple_evaluate(
check_integrity=False,
decontamination_ngrams_path=None,
write_out=False,
output_base_path=None,
):
"""Instantiate and evaluate a model on a list of tasks.
......@@ -74,8 +73,6 @@ def simple_evaluate(
Whether to run the relevant part of the test suite for the tasks
:param write_out: bool
If True, write details about prompts and logits to json for all tasks
:param output_base_path: str, optional
Directory to which detailed eval info will be written. Defaults to present working dir.
:return
Dictionary of results
"""
......@@ -121,7 +118,6 @@ def simple_evaluate(
bootstrap_iters=bootstrap_iters,
decontamination_ngrams_path=decontamination_ngrams_path,
write_out=write_out,
output_base_path=output_base_path,
)
if lm.rank == 0:
......@@ -158,7 +154,6 @@ def evaluate(
bootstrap_iters=100000,
decontamination_ngrams_path=None,
write_out=False,
output_base_path=None,
):
"""Instantiate and evaluate a model on a list of tasks.
......@@ -174,8 +169,6 @@ def evaluate(
Number of iterations for bootstrap statistics
:param write_out: bool
If True, write all prompts, logits and metrics to json for offline analysis
:param output_base_path: str, optional
Directory to which detailed eval info will be written. Defaults to present working dir
:return
Dictionary of results
"""
......@@ -188,8 +181,6 @@ def evaluate(
samples = collections.defaultdict(list)
requests = collections.defaultdict(list)
# docs = {}
# get lists of each type of request
for task_name, task in task_dict.items():
versions[task_name] = task.VERSION
......
......@@ -115,9 +115,10 @@ class HFLM(LM):
else torch.device("cpu")
)
else:
if device != "cuda":
eval_logger.info(
f"Using `accelerate launch` or `parallelize=True`, device '{device}' will be overridden when placing model."
)
# TODO: include in warning that `load_in_8bit` etc. affect this too
self._device = device
......@@ -204,7 +205,12 @@ class HFLM(LM):
self.model.tie_weights()
if gpus <= 1 and not parallelize:
# place model onto device, if not using HF Accelerate in any form
try:
self.model.to(self.device)
except ValueError:
eval_logger.info(
"Failed to place model onto specified device. This may be because the model is quantized via `bitsandbytes`. If the desired GPU is being used, this message is safe to ignore."
)
self.tokenizer = transformers.AutoTokenizer.from_pretrained(
pretrained if tokenizer is None else tokenizer,
......@@ -246,7 +252,12 @@ class HFLM(LM):
if torch.cuda.is_available()
else torch.device("cpu")
)
try:
self.model.to(self.device)
except ValueError:
eval_logger.info(
"Failed to place model onto specified device. This may be because the model is quantized via `bitsandbytes`. If the desired GPU is being used, this message is safe to ignore."
)
else:
self._model = accelerator.prepare(self.model)
self._device = torch.device(f"cuda:{accelerator.local_process_index}")
......
......@@ -9,7 +9,7 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for
- [ ] DROP
- [x] ~~Lambada~~
- [x] Lambada (Cloze variants)
- [x] ~~Lambada (Multilingual)~~
- [x] Wikitext
- [x] PiQA
- [ ] PROST (WIP)
......@@ -17,7 +17,7 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for
- [x] Pubmed QA
- [x] SciQ
- [ ] QASPER
- [ ] QA4MRE (WIP)
- [ ] TriviaQA
- [x] AI2 ARC
- [ ] LogiQA
......@@ -31,7 +31,8 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for
- [ ] SWAG (WIP)
- [x] OpenBookQA
- [ ] SQuADv2 (WIP)
- [ ] RACE (WIP)
- [ ] HeadQA (WIP)
- [ ] MathQA
- [ ] WebQs
- [ ] WSC273
......
......@@ -14,7 +14,10 @@ from lm_eval.api.registry import (
def get_task_name_from_config(task_config):
return "{dataset_path}_{dataset_name}".format(**task_config)
if "dataset_name" in task_config:
return "{dataset_path}_{dataset_name}".format(**task_config)
else:
return "{dataset_path}".format(**task_config)
def include_task_folder(task_dir):
......
......@@ -19,6 +19,6 @@ metric_list:
- metric: acc_norm
aggregation: mean
higher_is_better: true
# - metric: acc_mutual_info
# aggregation: mean
# higher_is_better: true
# LAMBADA
### Paper
The LAMBADA dataset: Word prediction requiring a broad discourse context
https://arxiv.org/pdf/1606.06031.pdf
LAMBADA is a dataset to evaluate the capabilities of computational models for text
understanding by means of a word prediction task. LAMBADA is a collection of narrative
passages sharing the characteristic that human subjects are able to guess their last
word if they are exposed to the whole passage, but not if they only see the last
sentence preceding the target word. To succeed on LAMBADA, computational models
cannot simply rely on local context, but must be able to keep track of information
in the broader discourse.
Homepage: https://zenodo.org/record/2630551#.X4Xzn5NKjUI
### Citation
@misc{
author={Paperno, Denis and Kruszewski, Germán and Lazaridou, Angeliki and Pham, Quan Ngoc and Bernardi, Raffaella and Pezzelle, Sandro and Baroni, Marco and Boleda, Gemma and Fernández, Raquel},
title={The LAMBADA dataset},
DOI={10.5281/zenodo.2630551},
publisher={Zenodo},
year={2016},
month={Aug}
}
### Subtasks
* `lambada_mt_{en, fr, de, it, es}`: Machine-translated versions of OpenAI's Lambada variant.
### Checklist
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
(This task is novel to the Evaluation Harness, and has been checked against v0.3.0 of the harness.)
If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?
include: lambada_mt_en.yaml
group:
- lambada_multilingual
- loglikelihood
- perplexity
task: lambada_openai_mt_de
dataset_name: de
group:
- lambada_multilingual
- loglikelihood
- perplexity
task: lambada_openai_mt_en
dataset_path: EleutherAI/lambada_openai
dataset_name: en
output_type: loglikelihood
test_split: test
template_aliases: ""
doc_to_text: "{{text.split(' ')[:-1]|join(' ')}}"
doc_to_target: "{{' '+text.split(' ')[-1]}}"
should_decontaminate: true
doc_to_decontamination_query: "{{text}}"
metric_list:
- metric: perplexity
aggregation: perplexity
higher_is_better: false
- metric: acc
aggregation: mean
higher_is_better: true
include: lambada_mt_en.yaml
group:
- lambada_multilingual
- loglikelihood
- perplexity
task: lambada_openai_mt_es
dataset_name: es
include: lambada_mt_en.yaml
group:
- lambada_multilingual
- loglikelihood
- perplexity
task: lambada_openai_mt_fr
dataset_name: fr
include: lambada_mt_en.yaml
group:
- lambada_multilingual
- loglikelihood
- perplexity
task: lambada_openai_mt_it
dataset_name: it
group:
- multiple_choice
task: corypaik_prost
dataset_path: corypaik/prost
dataset_name: null
output_type: multiple_choice
test_split: test
template_aliases: "{% set answer_choices = [A, B, C, D] %}{% set gold = label %}" # set the list of possible answer choices, and set what this doc's gold answer is (set what ds column used, and what)
doc_to_text: "{{context}}\nQuestion: {{ex_question}}\nAnswer:"
doc_to_target: "{{answer_choices[gold]}}"
gold_alias: "{{gold}}" # this will be cast to an int.
should_decontaminate: true
doc_to_decontamination_query: "{{context}}\nQuestion: {{ex_question}}\nAnswer:"
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
- metric: acc_norm
aggregation: mean
higher_is_better: true
......@@ -8,7 +8,13 @@ training_split: train
validation_split: validation
doc_to_text: "{{passage}}\nQuestion: {{question}}\nAnswer:"
doc_to_target: "{{answer_choices[label]}}"
gold_alias: "{{label}}" # this will be cast to an int.
gold_alias: " {{answer_choices[label]}}" # this will be cast to an int.
generation_kwargs:
until:
- "\n\n"
- "\n"
do_sample: false
temperature: 0.0
template_aliases: "{% set answer_choices = ['no', 'yes'] %}"
metric_list:
- metric: exact_match
......
group:
- super-glue-lm-eval-v1
task: "copa"
dataset_path: super_glue
dataset_name: copa
......
group:
- super-glue-promptsource
task: "GPT-3 style"
task: "rte"
dataset_path: super_glue
dataset_name: rte
training_split: train
validation_split: validation
use_prompt: "promptsource:GPT-3 style"
generation_kwargs:
until:
- "\n"
- "\n\n"
metric_list:
- metric: exact_match
aggregation: mean
......