Unverified Commit 546fd5cd authored by Lintang Sutawika, committed by GitHub

Merge pull request #686 from EleutherAI/cleanup

[Refactor] Cleanup for `big-refactor`
parents 465c695b 540d468e
@@ -26,13 +26,13 @@ Dataset configuration options:
- **validation_split** (`str`, *optional*) — Split in the dataset to use as the validation split.
- **test_split** (`str`, *optional*) — Split in the dataset to use as the test split.
- **fewshot_split** (`str`, *optional*) — Split in the dataset to draw few-shot exemplars from. (TODO: assert that this is not None if num_fewshot > 0, and check whether it is the same split being evaluated on.)
- **process_docs** (`Callable`, *optional*) — Optionally define a function to apply to each HF dataset split, to preprocess all documents before they are fed into prompt template rendering or other evaluation steps. Can be used to rename dataset columns, or to process documents into a format closer to that expected by a prompt template.
Prompting / in-context formatting options:
- **use_prompt** (`str`, *optional*) — Name of prompt in promptsource to use. If defined, will overwrite doc_to_text, doc_to_target, and doc_to_choice.
- **doc_to_text** (`Union[Callable, str]`, *optional*) — Jinja2, f-string, or function to process a sample into the appropriate input for the model.
- **doc_to_target** (`Union[Callable, str]`, *optional*) — Jinja2, f-string, or function to process a sample into the appropriate target output for the model. For multiple choice tasks, this should return the index of the correct answer within the answer choice list.
- **doc_to_choice** (`Union[Callable, str]`, *optional*) — Jinja2, f-string, or function to process a sample into a list of possible string choices for `multiple_choice` tasks. Left undefined for `greedy_until` tasks.
- **gold_alias** (`str`, *optional*, defaults to None) — If provided, used to generate the reference answer that predictions are scored against. Useful when `doc_to_target` should be the target string appended to each few-shot exemplar's input, while the `gold` value passed to the metric function comes from `gold_alias` instead.
- **fewshot_delimiter** (`str`, *optional*, defaults to "\n\n") — String to insert between few-shot examples.
- **target_delimiter** (`str`, *optional*, defaults to `" "`) — String to insert between input and target output for the datapoint being tested. (A minimal config sketch using several of these options follows below.)
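As a concrete illustration (not part of this diff), a minimal task config combining several of the options above might look like the following sketch. The task and dataset names are placeholders, and the Jinja fields assume the dataset has `context` and `completion` columns, as the arithmetic configs later in this diff do.

```yaml
task: my_task                      # placeholder task name
dataset_path: my_org/my_dataset    # placeholder HF dataset
output_type: loglikelihood
validation_split: validation
doc_to_text: "{{context}}"
doc_to_target: "{{completion}}"
fewshot_delimiter: "\n\n"
target_delimiter: " "
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
```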
@@ -160,7 +160,7 @@ Thus, given the 64 responses from our LM on each document, we can report metrics
Users can use Python functions for certain arguments by using the `!function` operator after the argument name, followed by `<filename>.<pythonfunctionname>`. This feature can be used for the following arguments:
1. `doc_to_text`
2. `doc_to_target`
3. `doc_to_choice`
4. `aggregation` for a `metric` in `metric_list`
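For example (an illustrative sketch, not taken from this diff), a config could point `doc_to_text` at a helper defined in a `utils.py` that sits next to the YAML file. The field name and `!function` syntax are the harness's own; the function body and the assumed `question` column are hypothetical:

```yaml
doc_to_text: !function utils.doc_to_text
```

```python
# utils.py (illustrative) — build the model input from one document,
# assuming the dataset has a "question" column.
def doc_to_text(doc) -> str:
    return f"Question: {doc['question']}\nAnswer:"
```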
## (No Longer Recommended) Direct `Task` Subclassing
...
@@ -60,16 +60,44 @@ fewshot_split: <split name to draw fewshot examples from, or `null`>
```
though if this is not set, we will default to train/validation/test sets, in that order.
Finally, our dataset may not already be in the exact format we want. Maybe we have to strip whitespace and special characters via a regex from our dataset's "question" field! Or maybe we just want to rename its columns to match a convention we'll be using for our prompts.
Let's create a Python file in the directory where we're writing our YAML file:
```bash
touch lm_eval/tasks/<dataset_name>/utils.py
```
Now, in `utils.py` we'll write a function to process each split of our dataset:
```python
import datasets


def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
    def _helper(doc):
        # modifies the contents of a single
        # document in our dataset.
        doc["choices"] = [doc["choice1"], doc["choice2"], doc["wrong_answer"]]
        doc["gold"] = doc["label"]
        return doc

    return dataset.map(_helper)  # returns back a datasets.Dataset object
```
Now, in our YAML config file, we'll use the `!function` constructor and tell the config where the Python function we just wrote can be found. At runtime, before doing anything else, we will preprocess our dataset according to this function!
```yaml
process_docs: !function utils.process_docs
```
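Because our (illustrative) `process_docs` above adds `choices` and `gold` columns, the prompt fields can then refer to those columns directly. The `question` column in `doc_to_text` is likewise an assumption about this hypothetical dataset:

```yaml
doc_to_text: "{{question}}\nAnswer:"   # assumes the dataset has a `question` column
doc_to_choice: "{{choices}}"           # list built by process_docs above
doc_to_target: "{{gold}}"              # gold index set by process_docs above
```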
### Writing a prompt with Jinja 2
The next thing we need to do is decide what format to use when presenting the data to the LM. This is our **prompt**, where we'll define both an input and output format.
We support the [Jinja 2](https://jinja.palletsprojects.com/en/3.1.x/) templating language for writing prompts. In practice, this means you can take your dataset's columns and do many basic string manipulations to place each document into prompted format.
To write a prompt, users are required to write two or three YAML fields in Jinja as strings:
```yaml
doc_to_text:
doc_to_target:
doc_to_choice:
```
Suppose our dataset has a `"question"` field and an `"answer"` field, which are both strings. If given a `document` object that is a row of our dataset, we want the model to see:
```
@@ -101,10 +129,9 @@ For tasks which are multiple choice (a fixed, finite set of label words per each
An annotated example in the case of SciQ is as follows:
```yaml
doc_to_text: "{{support.lstrip()}}\nQuestion: {{question}}\nAnswer:" # This is the input portion of the prompt for this doc. It will have " {{choice}}" appended to it as target for each choice in answer_choices.
doc_to_target: 3 # This is the index into the answer choice list of the correct answer.
doc_to_choice: "{{[distractor1, distractor2, distractor3, correct_answer]}}"
```
Task implementers are thus able to decide what the answer choices should be for a document, and what prompt format to use.
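When the label set is fixed across all documents, `doc_to_choice` can also be written directly as a list of strings, as the Hendrycks Ethics configs further down in this diff do:

```yaml
doc_to_choice: ['no', 'yes']
```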
...
@@ -80,7 +80,6 @@ DEFAULT_METRIC_REGISTRY = {
    ],
    "loglikelihood_rolling": ["word_perplexity", "byte_perplexity", "bits_per_byte"],
    "multiple_choice": ["acc", "acc_norm"],
    "greedy_until": ["exact_match"],
}
...
@@ -65,7 +65,7 @@ class TaskConfig(dict):
    fewshot_split: str = None  # TODO: assert that this not None if num_fewshot > 0. (?) assert if this is same split as one evaling (?)
    # formatting / prompting options.
    # see docs/advanced_task_guide.md for more info
    process_docs: Callable = None
    doc_to_text: Union[Callable, str] = None
    doc_to_target: Union[Callable, str] = None
    doc_to_choice: Union[Callable, str, dict, list] = None
@@ -88,24 +88,13 @@ class TaskConfig(dict):
    metadata: str = None  # by default, not used in the code. allows for users to pass arbitrary info to tasks
    def __post_init__(self):
        if self.generation_kwargs is not None:
            if self.output_type != "greedy_until":
                eval_logger.warning(
                    "passed `generation_kwargs`, but not using `output_type: greedy_until`!"
                )

            if "temperature" in self.generation_kwargs:
                self.generation_kwargs["temperature"] = float(
@@ -624,10 +613,6 @@ class ConfigurableTask(Task):
                list(self.fewshot_docs()), self, rnd=random.Random(1234)
            )

        if self.has_test_docs():
            docs = self.test_docs()
        elif self.has_validation_docs():
@@ -685,15 +670,25 @@ class ConfigurableTask(Task):
            return False
    def training_docs(self):
        if self.has_training_docs():
            if self._config.process_docs is not None:
                return self._config.process_docs(
                    self.dataset[self._config.training_split]
                )
            return self.dataset[self._config.training_split]

    def validation_docs(self):
        if self.has_validation_docs():
            if self._config.process_docs is not None:
                return self._config.process_docs(
                    self.dataset[self._config.validation_split]
                )
            return self.dataset[self._config.validation_split]

    def test_docs(self):
        if self.has_test_docs():
            if self._config.process_docs is not None:
                return self._config.process_docs(self.dataset[self._config.test_split])
            return self.dataset[self._config.test_split]

    def fewshot_docs(self):
...
import torch
import transformers
from transformers.models.auto.modeling_auto import (
    MODEL_FOR_CAUSAL_LM_MAPPING_NAMES,
    MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING_NAMES,
)
from peft import __version__ as PEFT_VERSION, PeftModel
import copy
@@ -147,6 +150,18 @@ class HFLM(LM):
        if getattr(self._config, "model_type") in MODEL_FOR_CAUSAL_LM_MAPPING_NAMES:
            self.AUTO_MODEL_CLASS = transformers.AutoModelForCausalLM
        elif (
            not getattr(self._config, "model_type")
            in MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING_NAMES
        ):
            if not trust_remote_code:
                eval_logger.warning(
                    "HF model type is neither marked as CausalLM or Seq2SeqLM. \
                    This is expected if your model requires `trust_remote_code=True` but may be an error otherwise."
                )
            # if model type is neither in HF transformers causal or seq2seq model registries
            # then we default to AutoModelForCausalLM
            self.AUTO_MODEL_CLASS = transformers.AutoModelForCausalLM
        else:
            self.AUTO_MODEL_CLASS = transformers.AutoModelForSeq2SeqLM
@@ -634,8 +649,10 @@ class HFLM(LM):
                contlen = len(cont_toks)
                # take only logits in the continuation
                # (discard context toks if decoder-only ; discard right-padding)
                # also discards + checks for "virtual tokens" in the causal LM's input window
                # from prompt/prefix tuning tokens, if applicable
                ctx_len = (
                    inplen + (logits.shape[0] - padding_len_inp)
                    if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM
                    else None
                )
...
@@ -6,7 +6,6 @@ dataset_name: arithmetic_1dc
output_type: loglikelihood
validation_split: validation
test_split: null
doc_to_text: "{{context}}"
doc_to_target: "{{completion}}"
metric_list:
...
include: arithmetic_1dc.yaml
task: arithmetic_2da
dataset_name: arithmetic_2da

include: arithmetic_1dc.yaml
task: arithmetic_2dm
dataset_name: arithmetic_2dm

include: arithmetic_1dc.yaml
task: arithmetic_2ds
dataset_name: arithmetic_2ds

include: arithmetic_1dc.yaml
task: arithmetic_3da
dataset_name: arithmetic_3da

include: arithmetic_1dc.yaml
task: arithmetic_3ds
dataset_name: arithmetic_3ds

include: arithmetic_1dc.yaml
task: arithmetic_4da
dataset_name: arithmetic_4da

include: arithmetic_1dc.yaml
task: arithmetic_4ds
dataset_name: arithmetic_4ds

include: arithmetic_1dc.yaml
task: arithmetic_5da
dataset_name: arithmetic_5da

include: arithmetic_1dc.yaml
task: arithmetic_5ds
dataset_name: arithmetic_5ds

group:
  - hendrycks_ethics
task: ethics_cm
dataset_path: EleutherAI/hendrycks_ethics
dataset_name: commonsense
output_type: multiple_choice
training_split: train
...
include: commonsense.yaml
task: ethics_deontology
dataset_name: deontology
doc_to_text: "Question: Would most people believe this reasonable or unreasonable to say? \"{{scenario}} {{excuse.rstrip()}}\"\nAnswer:"
doc_to_target: label
...
@@ -3,6 +3,5 @@ group:
  - hendrycks_ethics
task: ethics_justice
dataset_name: justice
doc_to_text: "Question: Would most people believe this reasonable or unreasonable to say? \"{{scenario}}\"\nAnswer:"
# TODO: impl. exact match for this and deontology
@@ -2,11 +2,7 @@ include: commonsense.yaml
group:
  - hendrycks_ethics
task: ethics_utilitarianism
dataset_name: utilitarianism
doc_to_text: !function utils.doc_to_text
doc_to_target: !function utils.doc_to_target
doc_to_choice: ['no', 'yes']
...
@@ -7,7 +7,6 @@ dataset_path: EleutherAI/lambada_openai
dataset_name: default
output_type: loglikelihood
test_split: test
doc_to_text: "{{text.split(' ')[:-1]|join(' ')}}"
doc_to_target: "{{' '+text.split(' ')[-1]}}"
should_decontaminate: true
...