Commit 16c4afc6 authored by lintangsutawika

Merge branch 'big-refactor' of https://github.com/EleutherAI/lm-evaluation-harness into toxicity

parents 7b376ae1 176d5a26
...@@ -55,7 +55,7 @@ jobs: ...@@ -55,7 +55,7 @@ jobs:
- name: Install dependencies - name: Install dependencies
run: | run: |
python -m pip install --upgrade pip python -m pip install --upgrade pip
pip install -e '.[testing]' --extra-index-url https://download.pytorch.org/whl/cpu pip install -e '.[testing,anthropic,sentencepiece]' --extra-index-url https://download.pytorch.org/whl/cpu
# Install optional git dependencies # Install optional git dependencies
# pip install bleurt@https://github.com/google-research/bleurt/archive/b610120347ef22b494b6d69b4316e303f5932516.zip#egg=bleurt # pip install bleurt@https://github.com/google-research/bleurt/archive/b610120347ef22b494b6d69b4316e303f5932516.zip#egg=bleurt
# if [ -f requirements.txt ]; then pip install -r requirements.txt; fi # if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
......
env env
*.pyc *.pyc
output/
data/ data/
lm_cache lm_cache
.idea .idea
......
...@@ -142,6 +142,10 @@ python main.py \ ...@@ -142,6 +142,10 @@ python main.py \
A number of other libraries contain scripts for calling the eval harness through their library. These include [GPT-NeoX](https://github.com/EleutherAI/gpt-neox/blob/main/eval_tasks/eval_adapter.py), [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples/MoE/readme_evalharness.md), and [mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/blob/master/eval_harness.py). A number of other libraries contain scripts for calling the eval harness through their library. These include [GPT-NeoX](https://github.com/EleutherAI/gpt-neox/blob/main/eval_tasks/eval_adapter.py), [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples/MoE/readme_evalharness.md), and [mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/blob/master/eval_harness.py).
### Additional Features
If you have a Metal-compatible Mac GPU, you can run the eval harness using the MPS back-end by replacing `--device cuda:0` with `--device mps:0`. PyTorch does not currently support automatic mixed precision (AMP) for MPS, so we forcibly cast all weights to fp32 regardless of how they're stored. This is slower and has a larger memory footprint than we can achieve on Linux systems, but we hope to improve this as PyTorch's MPS support matures.
💡 **Tip**: You can inspect what the LM inputs look like by running the following command: 💡 **Tip**: You can inspect what the LM inputs look like by running the following command:
```bash ```bash
......
...@@ -26,13 +26,13 @@ Dataset configuration options: ...@@ -26,13 +26,13 @@ Dataset configuration options:
- **validation_split** (`str`, *optional*) — Split in the dataset to use as the validation split. - **validation_split** (`str`, *optional*) — Split in the dataset to use as the validation split.
- **test_split** (`str`, *optional*) — Split in the dataset to use as the test split. - **test_split** (`str`, *optional*) — Split in the dataset to use as the test split.
- **fewshot_split** (`str`, *optional*) — Split in the dataset to draw few-shot exemplars from. assert that this not None if num_fewshot > 0. (?) assert if this is same split as one evaling (?) - **fewshot_split** (`str`, *optional*) — Split in the dataset to draw few-shot exemplars from. assert that this not None if num_fewshot > 0. (?) assert if this is same split as one evaling (?)
- **process_docs** (`Callable`, *optional*) — Optionally define a function to apply to each HF dataset split, to preprocess all documents before they are fed into prompt template rendering or other evaluation steps. Can be used to rename dataset columns, or to process documents into a format closer to the one expected by a prompt template.
Prompting / in-context formatting options: Prompting / in-context formatting options:
- **template_aliases** (`str`, *optional*) — A field for inputting additional Jinja2 content. Intended not to render as text after applying a Jinja template, but to instead define variables within Jinja that will be used within the written prompts. (for example, mapping the dataset column `label` to the new name `gold`). - **use_prompt** (`str`, *optional*) — Name of prompt in promptsource to use. if defined, will overwrite doc_to_text, doc_to_target, and doc_to_choice.
- **use_prompt** (`str`, *optional*) — Name of prompt in promptsource to use. if defined, will overwrite doc_to_text and doc_to_target and make template_aliases unused.
- **doc_to_text** (`Union[Callable, str]`, *optional*) — Jinja2, f-string, or function to process a sample into the appropriate input for the model - **doc_to_text** (`Union[Callable, str]`, *optional*) — Jinja2, f-string, or function to process a sample into the appropriate input for the model
- **doc_to_target** (`Union[Callable, str]`, *optional*) — Jinja2, f-string, or function to process a sample into the appropriate target output for the model. - **doc_to_target** (`Union[Callable, str]`, *optional*) — Jinja2, f-string, or function to process a sample into the appropriate target output for the model. For multiple choice tasks, this should return an index into
- **doc_to_choice** (`Union[Callable, str]`, *optional*) — Jinja2, f-string, or function to process a sample into possible choices for `multiple_choice` - **doc_to_choice** (`Union[Callable, str]`, *optional*) — Jinja2, f-string, or function to process a sample into a list of possible string choices for `multiple_choice` tasks. Left undefined for `greedy_until` tasks.
- **gold_alias** (`str`, *optional*, defaults to None) — if provided, used to generate the reference answer that is scored against. Used in cases where `doc_to_target` should be the "target string" format appended to each example's input for a fewshot exemplar, so doc_to_target is used for fewshot examples, but the input to the metric function as `gold` is from `gold_alias`. - **gold_alias** (`str`, *optional*, defaults to None) — if provided, used to generate the reference answer that is scored against. Used in cases where `doc_to_target` should be the "target string" format appended to each example's input for a fewshot exemplar, so doc_to_target is used for fewshot examples, but the input to the metric function as `gold` is from `gold_alias`.
- **fewshot_delimiter** (`str`, *optional*, defaults to "\n\n") — String to insert between few-shot examples. - **fewshot_delimiter** (`str`, *optional*, defaults to "\n\n") — String to insert between few-shot examples.
- **target_delimiter** (`str`, *optional*, defaults to `" "`) — String to insert between input and target output for the datapoint being tested. - **target_delimiter** (`str`, *optional*, defaults to `" "`) — String to insert between input and target output for the datapoint being tested.
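To show how `fewshot_delimiter` and `target_delimiter` interact, here is a minimal sketch of assembling a few-shot context. The function name `build_context` and its argument layout are assumptions made for illustration; the harness's own assembly code may differ in detail.

```python
from typing import List, Tuple


def build_context(
    fewshot_examples: List[Tuple[str, str]],
    query_text: str,
    fewshot_delimiter: str = "\n\n",
    target_delimiter: str = " ",
) -> str:
    # Each exemplar is rendered as input + target_delimiter + target.
    shots = [text + target_delimiter + target for text, target in fewshot_examples]
    # Exemplars and the final (unanswered) query are joined with fewshot_delimiter.
    return fewshot_delimiter.join(shots + [query_text])


# One exemplar followed by the document being tested.
ctx = build_context([("Q: 1+1?\nA:", "2")], "Q: 2+2?\nA:")
```

With no exemplars the function returns the query text unchanged, which corresponds to the zero-shot case.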
...@@ -160,7 +160,7 @@ Thus, given the 64 responses from our LM on each document, we can report metrics ...@@ -160,7 +160,7 @@ Thus, given the 64 responses from our LM on each document, we can report metrics
You can use Python functions for certain arguments by using the `!function` operator after the argument name followed by `<filename>.<pythonfunctionname>`. This feature can be used for the following arguments: You can use Python functions for certain arguments by using the `!function` operator after the argument name followed by `<filename>.<pythonfunctionname>`. This feature can be used for the following arguments:
1. `doc_to_text` 1. `doc_to_text`
2. `doc_to_target` 2. `doc_to_target`
3. `gold_alias` 3. `doc_to_choice`
4. `aggregation` for a `metric` in `metric_list` 4. `aggregation` for a `metric` in `metric_list`
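As an illustration of what a `!function` target might look like, here is a minimal sketch of a helper module holding a prompt function and an aggregation function. The function names, the `question` field, and the exact arguments the harness passes to each are assumptions made for this example, not a confirmed interface.

```python
# utils.py (hypothetical helper module living next to the task's YAML file)
from typing import Dict, List


def doc_to_text(doc: Dict[str, str]) -> str:
    # Build the model input from a single dataset row.
    # Assumes the row has a "question" field.
    return f"Question: {doc['question'].strip()}\nAnswer:"


def mean_agg(items: List[float]) -> float:
    # Aggregate per-document scores into one number (arithmetic mean).
    return sum(items) / len(items) if items else 0.0
```

A config could then point at these with `doc_to_text: !function utils.doc_to_text` and, under a metric entry, `aggregation: !function utils.mean_agg`.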
## (No Longer Recommended) Direct `Task` Subclassing ## (No Longer Recommended) Direct `Task` Subclassing
......
...@@ -60,16 +60,44 @@ fewshot_split: <split name to draw fewshot examples from, or `null`> ...@@ -60,16 +60,44 @@ fewshot_split: <split name to draw fewshot examples from, or `null`>
``` ```
though if this is not set, we will default to train/validation/test sets, in that order. though if this is not set, we will default to train/validation/test sets, in that order.
Finally, our dataset may not be already in the exact format we want. Maybe we have to strip whitespace and special characters via a regex from our dataset's "question" field! Or maybe we just want to rename its columns to match a convention we'll be using for our prompts.
Let's create a python file in the directory where we're writing our YAML file:
```bash
touch lm_eval/tasks/<dataset_name>/utils.py
```
Now, in `utils.py` we'll write a function to process each split of our dataset:
```python
import datasets


def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
    def _helper(doc):
        # modifies the contents of a single
        # document in our dataset.
        doc["choices"] = [doc["choice1"], doc["choice2"], doc["wrong_answer"]]
        doc["gold"] = doc["label"]
        return doc

    return dataset.map(_helper)  # returns back a datasets.Dataset object
```
Now, in our YAML config file we'll use the `!function` constructor, and tell the config where our imported Python function will come from. At runtime, before doing anything else we will preprocess our dataset according to this function!
```yaml
process_docs: !function utils.process_docs
```
### Writing a prompt with Jinja 2 ### Writing a prompt with Jinja 2
The next thing we need to do is decide what format to use when presenting the data to the LM. This is our **prompt**, where we'll define both an input and output format. The next thing we need to do is decide what format to use when presenting the data to the LM. This is our **prompt**, where we'll define both an input and output format.
We support the [Jinja 2](https://jinja.palletsprojects.com/en/3.1.x/) templating language for writing prompts. In practice, this means you can take your dataset's columns and do many basic string manipulations to place each document into prompted format. We support the [Jinja 2](https://jinja.palletsprojects.com/en/3.1.x/) templating language for writing prompts. In practice, this means you can take your dataset's columns and do many basic string manipulations to place each document into prompted format.
To write a prompt, users are required to write two YAML fields in Jinja as strings: To write a prompt, users are required to write two or three YAML fields in Jinja as strings:
```yaml ```yaml
doc_to_text: doc_to_text:
doc_to_target: doc_to_target:
doc_to_choice:
``` ```
Suppose our dataset has a `"question"` field, and an `"answer"` field, which are both strings. We want the model to see, if given a `document` object that is a row of our dataset: Suppose our dataset has a `"question"` field, and an `"answer"` field, which are both strings. We want the model to see, if given a `document` object that is a row of our dataset:
``` ```
...@@ -101,10 +129,9 @@ For tasks which are multiple choice (a fixed, finite set of label words per each ...@@ -101,10 +129,9 @@ For tasks which are multiple choice (a fixed, finite set of label words per each
An annotated example in the case of SciQ is as follows: An annotated example in the case of SciQ is as follows:
```yaml ```yaml
template_aliases: "{% set answer_choices = [distractor1, distractor2, distractor3, correct_answer] %}{% set gold = 3 %}" # `template_aliases` must set the list of possible answer choices to the jinja variable `answer_choices` (List[str]), and set what the index within `answer_choices` of this doc's gold label (correct answer choice).
doc_to_text: "{{support.lstrip()}}\nQuestion: {{question}}\nAnswer:" # This is the input portion of the prompt for this doc. It will have " {{choice}}" appended to it as target for each choice in answer_choices. doc_to_text: "{{support.lstrip()}}\nQuestion: {{question}}\nAnswer:" # This is the input portion of the prompt for this doc. It will have " {{choice}}" appended to it as target for each choice in answer_choices.
doc_to_target: "{{answer_choices[gold]}}" # this contains the gold-standard answer choice, selected via indexing to index `gold` in the answer choice list. doc_to_target: 3 # this contains the index into the answer choice list of the correct answer.
gold_alias: "{{gold}}" # this must be castable to an integer. It must output only the index within `answer_choices` that is the correct label. doc_to_choice: "{{[distractor1, distractor2, distractor3, correct_answer]}}"
``` ```
Task implementers are thus able to decide what the answer choices should be for a document, and what prompt format to use. Task implementers are thus able to decide what the answer choices should be for a document, and what prompt format to use.
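To make the roles of `doc_to_text`, `doc_to_choice`, and `doc_to_target` concrete, here is a minimal Python sketch of what the SciQ config above amounts to for one document. The function `build_mc_requests` is hypothetical and only mirrors the three templates; it is not the harness's own request-building code.

```python
from typing import Dict, List, Tuple


def build_mc_requests(
    doc: Dict[str, str], target_delimiter: str = " "
) -> Tuple[List[Tuple[str, str]], int]:
    # doc_to_text: the shared input portion of the prompt.
    context = f"{doc['support'].lstrip()}\nQuestion: {doc['question']}\nAnswer:"
    # doc_to_choice: the list of candidate answer strings.
    choices = [
        doc["distractor1"],
        doc["distractor2"],
        doc["distractor3"],
        doc["correct_answer"],
    ]
    # doc_to_target: index of the correct answer within `choices`.
    gold = 3
    # One (context, continuation) pair per choice, to be scored by the LM.
    requests = [(context, target_delimiter + choice) for choice in choices]
    return requests, gold


example_doc = {
    "support": "  Plants make food using sunlight.",
    "question": "What process do plants use to make food?",
    "distractor1": "respiration",
    "distractor2": "digestion",
    "distractor3": "fermentation",
    "correct_answer": "photosynthesis",
}
requests, gold = build_mc_requests(example_doc)
```

Because the correct answer is always placed last in the list, the target index is the constant `3`.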
......
import os import os
import evaluate import evaluate
from lm_eval.api.model import LM from lm_eval.api.model import LM
from lm_eval.logger import eval_logger
MODEL_REGISTRY = {} MODEL_REGISTRY = {}
...@@ -80,7 +81,6 @@ DEFAULT_METRIC_REGISTRY = { ...@@ -80,7 +81,6 @@ DEFAULT_METRIC_REGISTRY = {
], ],
"loglikelihood_rolling": ["word_perplexity", "byte_perplexity", "bits_per_byte"], "loglikelihood_rolling": ["word_perplexity", "byte_perplexity", "bits_per_byte"],
"multiple_choice": ["acc", "acc_norm"], "multiple_choice": ["acc", "acc_norm"],
"winograd_schema": ["acc"],
"greedy_until": ["exact_match"], "greedy_until": ["exact_match"],
} }
...@@ -131,7 +131,7 @@ searching in HF Evaluate library..." ...@@ -131,7 +131,7 @@ searching in HF Evaluate library..."
metric_object = evaluate.load(name) metric_object = evaluate.load(name)
return metric_object.compute return metric_object.compute
except Exception: except Exception:
raise Warning( eval_logger.error(
"{} not found in the evaluate library!".format(name), "{} not found in the evaluate library!".format(name),
"Please check https://huggingface.co/evaluate-metric", "Please check https://huggingface.co/evaluate-metric",
) )
...@@ -154,7 +154,7 @@ def get_aggregation(name): ...@@ -154,7 +154,7 @@ def get_aggregation(name):
try: try:
return AGGREGATION_REGISTRY[name] return AGGREGATION_REGISTRY[name]
except KeyError: except KeyError:
raise Warning( eval_logger.warning(
"{} not a registered aggregation metric!".format(name), "{} not a registered aggregation metric!".format(name),
) )
...@@ -163,7 +163,9 @@ def get_default_aggregation(metric_name): ...@@ -163,7 +163,9 @@ def get_default_aggregation(metric_name):
try: try:
return DEFAULT_AGGREGATION_REGISTRY[metric_name] return DEFAULT_AGGREGATION_REGISTRY[metric_name]
except KeyError: except KeyError:
raise Warning(f"No default aggregation metric for metric '{metric_name}'!") eval_logger.warning(
f"No default aggregation metric for metric '{metric_name}'!"
)
def is_higher_better(metric_name): def is_higher_better(metric_name):
...@@ -171,3 +173,6 @@ def is_higher_better(metric_name): ...@@ -171,3 +173,6 @@ def is_higher_better(metric_name):
return HIGHER_IS_BETTER_REGISTRY[metric_name] return HIGHER_IS_BETTER_REGISTRY[metric_name]
except KeyError: except KeyError:
raise Warning(f"higher_is_better not specified for metric '{metric_name}'!") raise Warning(f"higher_is_better not specified for metric '{metric_name}'!")
eval_logger.warning(
f"higher_is_better not specified for metric '{metric_name}'!"
)
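To make the effect of swapping `raise Warning(...)` for `eval_logger.warning(...)` concrete, here is a minimal sketch with a toy registry. The names `REGISTRY`, `get_strict`, and `get_lenient` are hypothetical stand-ins for the lookups above. The point is only that a logged warning lets the function fall through and return `None`, whereas a raised one stops the caller.

```python
import logging

logger = logging.getLogger("registry-sketch")  # hypothetical logger name
REGISTRY = {"mean": lambda xs: sum(xs) / len(xs)}  # toy registry


def get_strict(name):
    # Raising: a missing key aborts the caller with an exception.
    try:
        return REGISTRY[name]
    except KeyError:
        raise Warning(f"{name} not a registered aggregation metric!")


def get_lenient(name):
    # Logging: a missing key is reported and the function falls through,
    # so the caller receives None and must handle it.
    try:
        return REGISTRY[name]
    except KeyError:
        logger.warning("%s not a registered aggregation metric!", name)
```

Callers of the lenient variant need a `None` check before using the returned function.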
...@@ -65,11 +65,12 @@ class TaskConfig(dict): ...@@ -65,11 +65,12 @@ class TaskConfig(dict):
fewshot_split: str = None # TODO: assert that this not None if num_fewshot > 0. (?) assert if this is same split as one evaling (?) fewshot_split: str = None # TODO: assert that this not None if num_fewshot > 0. (?) assert if this is same split as one evaling (?)
# formatting / prompting options. # formatting / prompting options.
# see docs/advanced_task_guide.md for more info # see docs/advanced_task_guide.md for more info
template_aliases: Union[str, list] = None process_docs: Callable = None
doc_to_text: Union[Callable, str] = None doc_to_text: Union[Callable, str] = None
doc_to_target: Union[Callable, str] = None doc_to_target: Union[Callable, str] = None
doc_to_choice: Union[Callable, str, dict, list] = None doc_to_choice: Union[Callable, str, dict, list] = None
gold_alias: Union[Callable, str] = None gold_alias: Union[Callable, str] = None
process_results: Union[Callable, str] = None
use_prompt: str = None use_prompt: str = None
description: str = "" description: str = ""
target_delimiter: str = " " target_delimiter: str = " "
...@@ -88,24 +89,13 @@ class TaskConfig(dict): ...@@ -88,24 +89,13 @@ class TaskConfig(dict):
metadata: str = None # by default, not used in the code. allows for users to pass arbitrary info to tasks metadata: str = None # by default, not used in the code. allows for users to pass arbitrary info to tasks
def __post_init__(self): def __post_init__(self):
# allow user-specified aliases so that users can
# force prompt-compatibility for some prompt regardless of
# field names in prompt
if self.template_aliases:
if type(self.doc_to_text) == str:
self.doc_to_text = self.template_aliases + self.doc_to_text
if type(self.doc_to_target) == str:
self.doc_to_target = self.template_aliases + self.doc_to_target
if type(self.gold_alias) == str:
self.gold_alias = self.template_aliases + self.gold_alias
if self.generation_kwargs is not None: if self.generation_kwargs is not None:
if self.output_type != "greedy_until": if self.output_type != "greedy_until":
eval_logger.warning( eval_logger.warning(
"passed `generation_kwargs`, but not using a generation request type!" "passed `generation_kwargs`, but not using `output_type: greedy_until`!"
) )
assert self.output_type != "greedy_until"
if "temperature" in self.generation_kwargs: if "temperature" in self.generation_kwargs:
self.generation_kwargs["temperature"] = float( self.generation_kwargs["temperature"] = float(
...@@ -556,8 +546,18 @@ class ConfigurableTask(Task): ...@@ -556,8 +546,18 @@ class ConfigurableTask(Task):
for key in metric_config for key in metric_config
if key not in ["metric", "aggregation", "higher_is_better"] if key not in ["metric", "aggregation", "higher_is_better"]
} }
self._metric_fn_list[metric_name] = get_metric(metric_name)
self._metric_fn_kwargs[metric_name] = kwargs if self._config.process_results is not None:
self._metric_fn_list[metric_name] = None
self._metric_fn_kwargs[metric_name] = {}
elif callable(metric_name):
metric_fn = metric_name.__call__
metric_name = metric_name.__name__
self._metric_fn_list[metric_name] = metric_fn
self._metric_fn_kwargs[metric_name] = kwargs
else:
self._metric_fn_list[metric_name] = get_metric(metric_name)
self._metric_fn_kwargs[metric_name] = kwargs
if "aggregation" in metric_config: if "aggregation" in metric_config:
agg_name = metric_config["aggregation"] agg_name = metric_config["aggregation"]
...@@ -624,10 +624,6 @@ class ConfigurableTask(Task): ...@@ -624,10 +624,6 @@ class ConfigurableTask(Task):
list(self.fewshot_docs()), self, rnd=random.Random(1234) list(self.fewshot_docs()), self, rnd=random.Random(1234)
) )
if self._config.template_aliases is not None:
for key, alias in self._config.template_aliases:
self.dataset.rename_column(key, alias)
if self.has_test_docs(): if self.has_test_docs():
docs = self.test_docs() docs = self.test_docs()
elif self.has_validation_docs(): elif self.has_validation_docs():
...@@ -685,15 +681,25 @@ class ConfigurableTask(Task): ...@@ -685,15 +681,25 @@ class ConfigurableTask(Task):
return False return False
def training_docs(self): def training_docs(self):
if self._config.training_split is not None: if self.has_training_docs():
if self._config.process_docs is not None:
return self._config.process_docs(
self.dataset[self._config.training_split]
)
return self.dataset[self._config.training_split] return self.dataset[self._config.training_split]
def validation_docs(self): def validation_docs(self):
if self._config.validation_split is not None: if self.has_validation_docs():
if self._config.process_docs is not None:
return self._config.process_docs(
self.dataset[self._config.validation_split]
)
return self.dataset[self._config.validation_split] return self.dataset[self._config.validation_split]
def test_docs(self): def test_docs(self):
if self._config.test_split is not None: if self.has_test_docs():
if self._config.process_docs is not None:
return self._config.process_docs(self.dataset[self._config.test_split])
return self.dataset[self._config.test_split] return self.dataset[self._config.test_split]
def fewshot_docs(self): def fewshot_docs(self):
...@@ -890,8 +896,8 @@ class ConfigurableTask(Task): ...@@ -890,8 +896,8 @@ class ConfigurableTask(Task):
def process_results(self, doc, results): def process_results(self, doc, results):
# if callable(self._config.process_results): if callable(self._config.process_results):
# return self._config.process_results(doc, results) return self._config.process_results(doc, results)
result_dict = {} result_dict = {}
use_metric = list(self._metric_fn_list.keys()) use_metric = list(self._metric_fn_list.keys())
...@@ -980,6 +986,9 @@ class ConfigurableTask(Task): ...@@ -980,6 +986,9 @@ class ConfigurableTask(Task):
elif self.OUTPUT_TYPE == "greedy_until": elif self.OUTPUT_TYPE == "greedy_until":
gold = self.doc_to_target(doc) gold = self.doc_to_target(doc)
if type(gold) == int:
choices = self.doc_to_choice(doc)
gold = choices[gold]
for key, result in zip(self._metric_fn_list.keys(), results): for key, result in zip(self._metric_fn_list.keys(), results):
if self.multiple_target: if self.multiple_target:
......
...@@ -192,14 +192,35 @@ def evaluate( ...@@ -192,14 +192,35 @@ def evaluate(
# decontaminate = decontamination_ngrams_path is not None # decontaminate = decontamination_ngrams_path is not None
# stores the final result for each task, for each metric/filter pair.
results = collections.defaultdict(dict) results = collections.defaultdict(dict)
# Tracks each task's version.
versions = collections.defaultdict(dict) versions = collections.defaultdict(dict)
# Tracks the YAML configs of all chosen tasks.
configs = collections.defaultdict(dict) configs = collections.defaultdict(dict)
# logs info about each document evaluated.
samples = collections.defaultdict(list) samples = collections.defaultdict(list)
# tracks all Instances/requests a model must generate output on.
requests = collections.defaultdict(list) requests = collections.defaultdict(list)
# Stores task scores based on task grouping.
aggregate = collections.defaultdict(dict)
# tracks if a task was chosen via user selecting a group containing it
task_groups = collections.defaultdict(dict)
# stores the amount to pad out reqs per req. type so that
# number of fwd passes per distributed rank is equal
padding_requests = collections.defaultdict(int)
# Stores group related keys and values for group-aggregation
aggregate = collections.defaultdict(dict)
task_groups = collections.defaultdict(dict)
# get lists of each type of request # get lists of each type of request
for task_name, task in task_dict.items(): for task_name, task in task_dict.items():
if type(task) == tuple:
group, task = task
task_groups[task_name] = group
versions[task_name] = task.VERSION versions[task_name] = task.VERSION
configs[task_name] = dict(task.dump_config()) configs[task_name] = dict(task.dump_config())
...@@ -243,6 +264,7 @@ def evaluate( ...@@ -243,6 +264,7 @@ def evaluate(
# compute number of pseudobatches to pad with (FSDP/DDP require even batches among ranks) # compute number of pseudobatches to pad with (FSDP/DDP require even batches among ranks)
numpad = max(gathered_item) - gathered_item[lm.rank] numpad = max(gathered_item) - gathered_item[lm.rank]
padding_requests[task.OUTPUT_TYPE] += numpad
### Run LM on inputs, get all outputs ### ### Run LM on inputs, get all outputs ###
# execute each type of request # execute each type of request
...@@ -253,8 +275,8 @@ def evaluate( ...@@ -253,8 +275,8 @@ def evaluate(
for req in reqs: for req in reqs:
cloned_reqs.extend([req] * req.repeats) cloned_reqs.extend([req] * req.repeats)
if (lm.world_size > 1) and (numpad > 0): if (lm.world_size > 1) and (padding_requests[reqtype] > 0):
for _ in range(numpad): for _ in range(padding_requests[reqtype]):
cloned_reqs.extend([req] * req.repeats) cloned_reqs.extend([req] * req.repeats)
# run requests through model # run requests through model
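The padding arithmetic above can be sketched in isolation. This assumes `gathered_item` is a list holding the number of requests on each rank; the helper name `pad_counts` is hypothetical.

```python
from typing import List


def pad_counts(requests_per_rank: List[int]) -> List[int]:
    # Every rank must run as many forward passes as the busiest rank,
    # so each rank pads by its shortfall relative to the maximum.
    most = max(requests_per_rank)
    return [most - n for n in requests_per_rank]


# Three ranks holding 10, 8 and 9 requests of one request type.
numpads = pad_counts([10, 8, 9])
```

Here the busiest rank pads nothing, and the other two pad by 2 and 1 requests respectively.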
...@@ -264,12 +286,14 @@ def evaluate( ...@@ -264,12 +286,14 @@ def evaluate(
for x, req in zip(resps, cloned_reqs): for x, req in zip(resps, cloned_reqs):
req.resps.append(x) req.resps.append(x)
if lm.world_size > 1: if lm.world_size > 1:
lm.accelerator.wait_for_everyone() lm.accelerator.wait_for_everyone()
### Postprocess outputs ### ### Postprocess outputs ###
# TODO: del model here, maybe (idea: allow user to specify device of e.g. reward model separately) # TODO: del model here, maybe (idea: allow user to specify device of e.g. reward model separately)
for task_name, task in task_dict.items(): for task_name, task in task_dict.items():
if type(task) == tuple:
group, task = task
task.apply_filters() task.apply_filters()
### Collect values of metrics on all datapoints ### ### Collect values of metrics on all datapoints ###
...@@ -277,6 +301,8 @@ def evaluate( ...@@ -277,6 +301,8 @@ def evaluate(
# unpack results and sort back in order and return control to Task # unpack results and sort back in order and return control to Task
for task_name, task in task_dict.items(): for task_name, task in task_dict.items():
if type(task) == tuple:
group, task = task
# TODO: make it possible to use a different metric per filter # TODO: make it possible to use a different metric per filter
# iterate over different filters used # iterate over different filters used
for key in task.instances[0].filtered_resps.keys(): for key in task.instances[0].filtered_resps.keys():
...@@ -362,7 +388,23 @@ def evaluate( ...@@ -362,7 +388,23 @@ def evaluate(
# aggregate results ; run bootstrap CIs # aggregate results ; run bootstrap CIs
for (task_name, key, metric), items in vals.items(): for (task_name, key, metric), items in vals.items():
task = task_dict[task_name] task = task_dict[task_name]
results[task_name][metric + "," + key] = task.aggregation()[metric](items) if type(task) == tuple:
group, task = task
task_score = task.aggregation()[metric](items)
results[task_name][metric + "," + key] = task_score
# Need to put back in results
# pythia | acc
# | perplexity
# | word_perplexity
# | byte_perplexity
# | bits_per_byte
if bool(task_groups):
group_name = task_groups[task_name]
if metric not in aggregate[group_name]:
aggregate[group_name][metric] = [task_score]
else:
aggregate[group_name][metric].append(task_score)
# hotfix: bleu, chrf, ter seem to be really expensive to bootstrap # hotfix: bleu, chrf, ter seem to be really expensive to bootstrap
# so we run them less iterations. still looking for a cleaner way to do this # so we run them less iterations. still looking for a cleaner way to do this
...@@ -377,10 +419,21 @@ def evaluate( ...@@ -377,10 +419,21 @@ def evaluate(
if stderr is not None: if stderr is not None:
results[task_name][metric + "_stderr" + "," + key] = stderr(items) results[task_name][metric + "_stderr" + "," + key] = stderr(items)
if bool(aggregate):
for group in aggregate.keys():
for metric in aggregate[group].keys():
aggregate[group][metric] = np.average(aggregate[group][metric])
versions[group] = "N/A"
results_dict = { results_dict = {
"results": dict(results), "results": dict(sorted(results.items())),
"configs": dict(configs), **(
"versions": dict(versions), {"aggregate": dict(sorted(aggregate.items()))}
if bool(aggregate)
else {}
),
"configs": dict(sorted(configs.items())),
"versions": dict(sorted(versions.items())),
} }
if log_samples: if log_samples:
results_dict["samples"] = dict(samples) results_dict["samples"] = dict(samples)
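The group-level aggregation above can be sketched as a standalone function. This is a simplified illustration, not the harness's code: it uses `statistics.mean` from the standard library in place of `np.average`, and the function name and input layout are assumptions.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple


def aggregate_by_group(
    task_scores: List[Tuple[str, str, float]], task_groups: Dict[str, str]
) -> Dict[str, Dict[str, float]]:
    # Collect each task's score under its group and metric name.
    collected = defaultdict(lambda: defaultdict(list))
    for task_name, metric, score in task_scores:
        group = task_groups.get(task_name)
        if group is not None:
            collected[group][metric].append(score)
    # Reduce every list of task scores to an unweighted mean.
    return {
        group: {metric: mean(values) for metric, values in metrics.items()}
        for group, metrics in collected.items()
    }


scores = [
    ("task_a", "acc", 0.5),
    ("task_b", "acc", 1.0),
    ("task_c", "word_perplexity", 20.0),
]
groups = {"task_a": "pythia", "task_b": "pythia", "task_c": "pythia"}
group_results = aggregate_by_group(scores, groups)
```

The mean is unweighted, so a task with few documents counts as much as a large one; weighting by document count would be a different design choice.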
......
...@@ -3,21 +3,28 @@ from lm_eval.api.model import LM ...@@ -3,21 +3,28 @@ from lm_eval.api.model import LM
from lm_eval.api.registry import register_model from lm_eval.api.registry import register_model
from tqdm import tqdm from tqdm import tqdm
import time import time
import anthropic
from lm_eval.logger import eval_logger
from typing import List, Literal, Any
def anthropic_completion( def anthropic_completion(
client, model, prompt, max_tokens_to_sample, temperature, stop client: anthropic.Anthropic,
model: str,
prompt: str,
max_tokens_to_sample: int,
temperature: float,
stop: List[str],
**kwargs: Any,
): ):
"""Query Anthropic API for completion. """Query Anthropic API for completion.
Retry with back-off until they respond Retry with back-off until they respond
""" """
import anthropic
backoff_time = 3 backoff_time = 3
while True: while True:
try: try:
response = client.completion( response = client.completions.create(
prompt=f"{anthropic.HUMAN_PROMPT} {prompt}{anthropic.AI_PROMPT}", prompt=f"{anthropic.HUMAN_PROMPT} {prompt}{anthropic.AI_PROMPT}",
model=model, model=model,
# NOTE: Claude really likes to do CoT, and overly aggressive stop sequences # NOTE: Claude really likes to do CoT, and overly aggressive stop sequences
...@@ -25,36 +32,53 @@ def anthropic_completion( ...@@ -25,36 +32,53 @@ def anthropic_completion(
stop_sequences=[anthropic.HUMAN_PROMPT] + stop, stop_sequences=[anthropic.HUMAN_PROMPT] + stop,
max_tokens_to_sample=max_tokens_to_sample, max_tokens_to_sample=max_tokens_to_sample,
temperature=temperature, temperature=temperature,
**kwargs,
)
return response.completion
except anthropic.RateLimitError as e:
eval_logger.warning(
f"RateLimitError occurred: {e.__cause__}\n Retrying in {backoff_time} seconds"
) )
return response["completion"]
except RuntimeError:
# TODO: I don't actually know what error Anthropic raises when it times out
# So err update this error when we find out.
import traceback
traceback.print_exc()
time.sleep(backoff_time) time.sleep(backoff_time)
backoff_time *= 1.5 backoff_time *= 1.5
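The retry loop in `anthropic_completion` follows a general back-off pattern, sketched here without any API client. The helper `retry_with_backoff` and its parameters are hypothetical; the starting delay of 3 seconds and the growth factor of 1.5 mirror the values above, and the `sleep` argument is injectable only so the sketch can be exercised without waiting.

```python
import time
from typing import Callable, List, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    retry_on: Type[BaseException] = Exception,
    initial_delay: float = 3.0,
    factor: float = 1.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    delay = initial_delay
    while True:
        try:
            return call()
        except retry_on:
            # Wait, then lengthen the delay before the next attempt.
            sleep(delay)
            delay *= factor


# A call that fails twice before succeeding; delays are recorded, not slept.
attempts = {"n": 0}
delays: List[float] = []


def flaky() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("busy")
    return "ok"


result = retry_with_backoff(flaky, retry_on=TimeoutError, sleep=delays.append)
```

Like the loop above, this sketch retries indefinitely; a production version would likely cap the number of attempts or the maximum delay.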
@register_model("anthropic") @register_model("anthropic")
class AnthropicLM(LM): class AnthropicLM(LM):
REQ_CHUNK_SIZE = 20 REQ_CHUNK_SIZE = 20 # TODO: not used
def __init__(self, model): def __init__(
""" self,
batch_size: int = 1,
model: str = "claude-2.0",
max_tokens_to_sample: int = 256,
temperature: float = 0, # defaults to 1
**kwargs, # top_p, top_k, etc.
):
"""Anthropic API wrapper.
:param model: str :param model: str
Anthropic model e.g. claude-instant-v1 Anthropic model e.g. 'claude-instant-v1', 'claude-2'
:param max_tokens_to_sample: int
Maximum number of tokens to sample from the model
:param temperature: float
Sampling temperature
:param kwargs: Any
Additional model_args to pass to the API client
""" """
super().__init__() super().__init__()
import anthropic
self.model = model self.model = model
self.client = anthropic.Client(os.environ["ANTHROPIC_API_KEY"]) # defaults to os.environ.get("ANTHROPIC_API_KEY")
self.client = anthropic.Anthropic()
self.temperature = temperature
self.max_tokens_to_sample = max_tokens_to_sample
self.tokenizer = self.client.get_tokenizer()
self.kwargs = kwargs
@property @property
def eot_token_id(self): def eot_token_id(self):
# Not sure but anthropic.AI_PROMPT -> [203, 203, 50803, 30]
raise NotImplementedError("No idea about anthropic tokenization.") raise NotImplementedError("No idea about anthropic tokenization.")
@property @property
...@@ -63,23 +87,23 @@ class AnthropicLM(LM): ...@@ -63,23 +87,23 @@ class AnthropicLM(LM):
@property @property
def max_gen_toks(self): def max_gen_toks(self):
return 256 return self.max_tokens_to_sample
@property @property
def batch_size(self): def batch_size(self):
# Isn't used because we override _loglikelihood_tokens # Isn't used because we override _loglikelihood_tokens
raise NotImplementedError() raise NotImplementedError("No support for logits.")
@property @property
def device(self): def device(self):
# Isn't used because we override _loglikelihood_tokens # Isn't used because we override _loglikelihood_tokens
raise NotImplementedError() raise NotImplementedError("No support for logits.")
def tok_encode(self, string: str): def tok_encode(self, string: str) -> List[int]:
raise NotImplementedError("No idea about anthropic tokenization.") return self.tokenizer.encode(string).ids
def tok_decode(self, tokens): def tok_decode(self, tokens: List[int]) -> str:
raise NotImplementedError("No idea about anthropic tokenization.") return self.tokenizer.decode(tokens)
def _loglikelihood_tokens(self, requests, disable_tqdm=False): def _loglikelihood_tokens(self, requests, disable_tqdm=False):
raise NotImplementedError("No support for logits.") raise NotImplementedError("No support for logits.")
...@@ -92,20 +116,31 @@ class AnthropicLM(LM): ...@@ -92,20 +116,31 @@ class AnthropicLM(LM):
res = [] res = []
for request in tqdm(requests): for request in tqdm(requests):
inp = request[0] try:
request_args = request[1] inp = request[0]
until = request_args["until"] request_args = request[1]
response = anthropic_completion( # generation_kwargs
client=self.client, until = request_args.get("until")
model=self.model, max_gen_toks = request_args.get("max_gen_toks", self.max_length)
prompt=inp, temperature = request_args.get("temperature", self.temperature)
max_tokens_to_sample=self.max_gen_toks, response = anthropic_completion(
temperature=0.0, # TODO: implement non-greedy sampling for Anthropic client=self.client,
stop=until, model=self.model,
) prompt=inp,
res.append(response) max_tokens_to_sample=max_gen_toks,
temperature=temperature, # TODO: implement non-greedy sampling for Anthropic
self.cache_hook.add_partial("greedy_until", request, response) stop=until,
**self.kwargs,
)
res.append(response)
self.cache_hook.add_partial("greedy_until", request, response)
except anthropic.APIConnectionError as e:
eval_logger.critical(f"Server unreachable: {e.__cause__}")
break
except anthropic.APIStatusError as e:
eval_logger.critical(f"API error {e.status_code}: {e.message}")
break
return res return res
...@@ -116,3 +151,9 @@ class AnthropicLM(LM): ...@@ -116,3 +151,9 @@ class AnthropicLM(LM):
def _model_generate(self, context, max_length, eos_token_id): def _model_generate(self, context, max_length, eos_token_id):
# Isn't used because we override greedy_until # Isn't used because we override greedy_until
raise NotImplementedError() raise NotImplementedError()
def loglikelihood(self, requests):
raise NotImplementedError("No support for logits.")
def loglikelihood_rolling(self, requests):
raise NotImplementedError("No support for logits.")
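The reworked `greedy_until` loop above reads generation settings per request, falls back to instance-level defaults, and stops at the first API failure. A minimal sketch of that control flow follows. It does not call the real `anthropic` client: `HypotheticalAPIError`, `generate_all`, and `fake_complete` are illustrative stand-ins, not names from the harness or the Anthropic SDK.

```python
from typing import Callable, Dict, List, Tuple


class HypotheticalAPIError(Exception):
    """Stand-in for an SDK error such as a failed HTTP call (assumption)."""


def generate_all(
    requests: List[Tuple[str, Dict]],
    complete: Callable[..., str],
    default_max_tokens: int = 256,
    default_temperature: float = 0.0,
) -> List[str]:
    results = []
    for prompt, request_args in requests:
        # Per-request settings win; the defaults fill in whatever is missing.
        max_tokens = request_args.get("max_gen_toks", default_max_tokens)
        temperature = request_args.get("temperature", default_temperature)
        try:
            results.append(
                complete(
                    prompt=prompt,
                    max_tokens=max_tokens,
                    temperature=temperature,
                    stop=request_args.get("until"),
                )
            )
        except HypotheticalAPIError:
            # Like the `break` in the diff: remaining requests are skipped,
            # so the result list can be shorter than the request list.
            break
    return results


def fake_complete(prompt, max_tokens, temperature, stop):
    # Simulated backend: fails on one specific prompt.
    if prompt == "boom":
        raise HypotheticalAPIError("simulated outage")
    return f"{prompt}|{max_tokens}|{temperature}"


demo = generate_all(
    [("a", {}), ("b", {"max_gen_toks": 8, "temperature": 0.7}), ("boom", {}), ("c", {})],
    fake_complete,
)
```

One consequence worth keeping in mind: because the loop breaks instead of re-raising, a caller that expects one response per request has to check the length of the returned list.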
import torch import torch
import transformers import transformers
from transformers.models.auto.modeling_auto import MODEL_FOR_CAUSAL_LM_MAPPING_NAMES from transformers.models.auto.modeling_auto import (
MODEL_FOR_CAUSAL_LM_MAPPING_NAMES,
MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING_NAMES,
)
from peft import __version__ as PEFT_VERSION, PeftModel from peft import __version__ as PEFT_VERSION, PeftModel
import copy import copy
...@@ -147,6 +150,18 @@ class HFLM(LM): ...@@ -147,6 +150,18 @@ class HFLM(LM):
if getattr(self._config, "model_type") in MODEL_FOR_CAUSAL_LM_MAPPING_NAMES: if getattr(self._config, "model_type") in MODEL_FOR_CAUSAL_LM_MAPPING_NAMES:
self.AUTO_MODEL_CLASS = transformers.AutoModelForCausalLM self.AUTO_MODEL_CLASS = transformers.AutoModelForCausalLM
elif (
getattr(self._config, "model_type")
not in MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING_NAMES
):
if not trust_remote_code:
eval_logger.warning(
"HF model type is marked as neither CausalLM nor Seq2SeqLM. \
This is expected if your model requires `trust_remote_code=True` but may be an error otherwise."
)
# if model type is neither in HF transformers causal or seq2seq model registries
# then we default to AutoModelForCausalLM
self.AUTO_MODEL_CLASS = transformers.AutoModelForCausalLM
else: else:
self.AUTO_MODEL_CLASS = transformers.AutoModelForSeq2SeqLM self.AUTO_MODEL_CLASS = transformers.AutoModelForSeq2SeqLM
...@@ -634,8 +649,10 @@ class HFLM(LM): ...@@ -634,8 +649,10 @@ class HFLM(LM):
contlen = len(cont_toks) contlen = len(cont_toks)
# take only logits in the continuation # take only logits in the continuation
# (discard context toks if decoder-only ; discard right-padding) # (discard context toks if decoder-only ; discard right-padding)
# also discards + checks for "virtual tokens" in the causal LM's input window
# from prompt/prefix tuning tokens, if applicable
ctx_len = ( ctx_len = (
inplen inplen + (logits.shape[0] - padding_len_inp)
if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM
else None else None
) )
...@@ -740,11 +757,13 @@ class HFLM(LM): ...@@ -740,11 +757,13 @@ class HFLM(LM):
context_enc = context_enc.to(self.device) context_enc = context_enc.to(self.device)
attn_masks = attn_masks.to(self.device) attn_masks = attn_masks.to(self.device)
if "max_length" not in kwargs:
kwargs["max_length"] = context_enc.shape[1] + max_gen_toks
# perform batched generation # perform batched generation
cont = self._model_generate( cont = self._model_generate(
context=context_enc, context=context_enc,
attention_mask=attn_masks, attention_mask=attn_masks,
max_length=context_enc.shape[1] + max_gen_toks,
stop=primary_until, stop=primary_until,
**kwargs, **kwargs,
) )
......
from lm_eval import utils
from lm_eval.logger import eval_logger from lm_eval.logger import eval_logger
# Prompt library. # Prompt library.
...@@ -51,3 +52,17 @@ def get_prompt(prompt_id: str, dataset_name=None, subset_name=None): ...@@ -51,3 +52,17 @@ def get_prompt(prompt_id: str, dataset_name=None, subset_name=None):
f"expected only a single `:` as separator between \ f"expected only a single `:` as separator between \
prompt category and name, but got `{prompt_id}` instead" prompt category and name, but got `{prompt_id}` instead"
) )
def load_prompt_list(use_prompt: str, dataset_name=None, subset_name=None, **kwargs):
from promptsource.templates import DatasetTemplates
if subset_name is None:
prompts = DatasetTemplates(dataset_name=dataset_name)
else:
prompts = DatasetTemplates(dataset_name=dataset_name, subset_name=subset_name)
category_name, prompt_name = use_prompt.split(":")
prompt_list = utils.pattern_match(prompt_name, prompts.all_template_names)
return [":".join([category_name, prompt]) for prompt in prompt_list]
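`load_prompt_list` splits `use_prompt` into a category and a name pattern, then expands the pattern against the available template names. The sketch below reproduces that step without `promptsource`, assuming `utils.pattern_match` performs glob-style matching (its body is not shown in this diff). `expand_prompt_names` and the template names are hypothetical.

```python
import fnmatch
from typing import List


def expand_prompt_names(use_prompt: str, available_names: List[str]) -> List[str]:
    # "category:pattern" -> one "category:name" entry per matching template.
    # A value with more than one ":" raises ValueError on unpacking.
    category, pattern = use_prompt.split(":")
    matches = sorted(fnmatch.filter(available_names, pattern))
    return [":".join([category, name]) for name in matches]


templates = ["GPT-3 style", "GPT-3 few-shot", "after_reading"]
expanded = expand_prompt_names("promptsource:GPT-3*", templates)
```

A pattern of `*` would select every template, which is how a single YAML config can fan out into one task per prompt.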
...@@ -3,7 +3,7 @@ This list keeps track of which tasks' implementations have been ported to YAML / ...@@ -3,7 +3,7 @@ This list keeps track of which tasks' implementations have been ported to YAML /
Boxes should be checked iff tasks are implemented in the refactor and tested for regression. Tasks should be struck through if checked *against original introducing paper* implementation or popularizing implementation. (WIP) Denotes that there exists a PR or person working on this task already. Boxes should be checked iff tasks are implemented in the refactor and tested for regression. Tasks should be struck through if checked *against original introducing paper* implementation or popularizing implementation. (WIP) Denotes that there exists a PR or person working on this task already.
- [ ] Glue (WIP) - [ ] Glue (Lintang)
- [x] SuperGlue - [x] SuperGlue
- [ ] CoQA - [ ] CoQA
- [ ] DROP - [ ] DROP
...@@ -20,14 +20,14 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for ...@@ -20,14 +20,14 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for
- [x] QA4MRE - [x] QA4MRE
- [ ] TriviaQA - [ ] TriviaQA
- [x] AI2 ARC - [x] AI2 ARC
- [ ] LogiQA (WIP) - [ ] LogiQA [(WIP)](https://github.com/EleutherAI/lm-evaluation-harness/pull/711)
- [x] HellaSwag - [x] HellaSwag
- [x] SWAG - [x] SWAG
- [x] OpenBookQA - [x] OpenBookQA
- [ ] SQuADv2 (WIP) - [ ] SQuADv2
- [x] RACE - [x] RACE
- [x] HeadQA - [x] HeadQA
- [ ] MathQA (WIP) - [x] MathQA
- [ ] WebQs - [ ] WebQs
- [ ] WSC273 - [ ] WSC273
- [x] Winogrande - [x] Winogrande
...@@ -37,28 +37,27 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for ...@@ -37,28 +37,27 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for
- [ ] TruthfulQA (mc2) - [ ] TruthfulQA (mc2)
- [ ] TruthfulQA (gen) - [ ] TruthfulQA (gen)
- [ ] MuTual - [ ] MuTual
- [ ] Hendrycks Math (WIP) - [ ] Hendrycks Math
- [ ] Asdiv (WIP) - [ ] Asdiv
- [ ] GSM8k - [ ] GSM8k
- [x] Arithmetic - [x] Arithmetic
- [ ] MMMLU - [ ] MMMLU (Hailey)
- [ ] Translation (WMT) suite - [ ] Translation (WMT) suite (Hailey)
- [x] Unscramble - [x] Unscramble
- [x] ~~Pile (perplexity)~~ - [x] ~~Pile (perplexity)~~
- [ ] BLiMP - [ ] BLiMP (Lintang)
- [x] ToxiGen - [x] ToxiGen
- [ ] StoryCloze - [ ] StoryCloze
- [ ] NaturalQs (WIP) - [ ] NaturalQs
- [ ] CrowS-Pairs - [ ] CrowS-Pairs
- [ ] XCopa - [ ] XCopa
- [ ] BIG-Bench - [ ] BIG-Bench
- [ ] XStoryCloze - [ ] XStoryCloze
- [ ] XWinograd - [x] XWinograd
- [ ] PAWS-X - [ ] PAWS-X
- [ ] XNLI - [ ] XNLI
- [ ] MGSM - [ ] MGSM
- [ ] SCROLLS - [ ] SCROLLS
- [ ] JSON Task (reference: https://github.com/EleutherAI/lm-evaluation-harness/pull/481)
- [ ] Babi - [ ] Babi
# Novel Tasks # Novel Tasks
......
import os import os
import yaml
from typing import List, Union from typing import List, Union
from lm_eval import utils from lm_eval import utils
from lm_eval import prompts
from lm_eval.logger import eval_logger from lm_eval.logger import eval_logger
from lm_eval.api.task import TaskConfig, Task, ConfigurableTask from lm_eval.api.task import TaskConfig, Task, ConfigurableTask
from lm_eval.api.registry import ( from lm_eval.api.registry import (
...@@ -13,6 +15,58 @@ from lm_eval.api.registry import ( ...@@ -13,6 +15,58 @@ from lm_eval.api.registry import (
) )
def register_configurable_task(config):
SubClass = type(
config["task"] + "ConfigurableTask",
(ConfigurableTask,),
{"CONFIG": TaskConfig(**config)},
)
if "task" in config:
task_name = "{}".format(config["task"])
register_task(task_name)(SubClass)
if "group" in config:
if isinstance(config["group"], str):
group_name = [config["group"]]
else:
group_name = config["group"]
for group in group_name:
register_group(group)(SubClass)
return 0
def check_prompt_config(config):
all_configs = []
if "use_prompt" in config:
prompt_list = prompts.load_prompt_list(
use_prompt=config["use_prompt"],
dataset_name=config["dataset_path"],
subset_name=config["dataset_name"],
)
for idx, prompt_variation in enumerate(prompt_list):
all_configs.append(
{
**config,
**{"use_prompt": prompt_variation},
**{
"task": "_".join(
[
get_task_name_from_config(config),
prompt_variation,
]
)
},
**{"output_type": "greedy_until"},
}
)
else:
all_configs.append(config)
return all_configs
def get_task_name_from_config(task_config): def get_task_name_from_config(task_config):
if "dataset_name" in task_config: if "dataset_name" in task_config:
return "{dataset_path}_{dataset_name}".format(**task_config) return "{dataset_path}_{dataset_name}".format(**task_config)
...@@ -31,23 +85,10 @@ def include_task_folder(task_dir): ...@@ -31,23 +85,10 @@ def include_task_folder(task_dir):
yaml_path = os.path.join(root, f) yaml_path = os.path.join(root, f)
try: try:
config = utils.load_yaml_config(yaml_path) config = utils.load_yaml_config(yaml_path)
all_configs = check_prompt_config(config)
for config in all_configs:
register_configurable_task(config)
SubClass = type(
config["task"] + "ConfigurableTask",
(ConfigurableTask,),
{"CONFIG": TaskConfig(**config)},
)
if "task" in config:
# task_name = "{}:{}".format(
# get_task_name_from_config(config), config["task"]
# )
task_name = "{}".format(config["task"])
register_task(task_name)(SubClass)
if "group" in config:
for group in config["group"]:
register_group(group)(SubClass)
except Exception as error: except Exception as error:
eval_logger.warning( eval_logger.warning(
"Failed to load config in\n" "Failed to load config in\n"
...@@ -57,8 +98,58 @@ def include_task_folder(task_dir): ...@@ -57,8 +98,58 @@ def include_task_folder(task_dir):
) )
def include_benchmarks(task_dir, benchmark_dir="benchmarks"):
for root, subdirs, file_list in os.walk(os.path.join(task_dir, benchmark_dir)):
if (subdirs == [] or subdirs == ["__pycache__"]) and (len(file_list) > 0):
for f in file_list:
if f.endswith(".yaml"):
try:
benchmark_path = os.path.join(root, f)
with open(benchmark_path, "rb") as file:
yaml_config = yaml.full_load(file)
assert "group" in yaml_config
group = yaml_config["group"]
all_task_list = yaml_config["task"]
config_list = [
task for task in all_task_list if not isinstance(task, str)
]
task_list = [
task for task in all_task_list if isinstance(task, str)
]
for task_config in config_list:
var_configs = check_prompt_config(
{
**task_config,
**{"group": group},
}
)
for config in var_configs:
register_configurable_task(config)
task_names = utils.pattern_match(task_list, ALL_TASKS)
for task in task_names:
if task in TASK_REGISTRY:
if group in GROUP_REGISTRY:
GROUP_REGISTRY[group].append(task)
else:
GROUP_REGISTRY[group] = [task]
ALL_TASKS.add(group)
except Exception as error:
eval_logger.warning(
"Failed to load benchmark in\n"
f" {benchmark_path}\n"
" Benchmark will not be added to registry\n"
f" Error: {error}"
)
task_dir = os.path.dirname(os.path.abspath(__file__)) + "/" task_dir = os.path.dirname(os.path.abspath(__file__)) + "/"
include_task_folder(task_dir) include_task_folder(task_dir)
include_benchmarks(task_dir)
def get_task(task_name, config): def get_task(task_name, config):
...@@ -97,11 +188,15 @@ def get_task_dict(task_name_list: List[Union[str, dict, Task]], **kwargs): ...@@ -97,11 +188,15 @@ def get_task_dict(task_name_list: List[Union[str, dict, Task]], **kwargs):
if isinstance(task_element, str): if isinstance(task_element, str):
if task_element in GROUP_REGISTRY: if task_element in GROUP_REGISTRY:
group_name = task_element
for task_name in GROUP_REGISTRY[task_element]: for task_name in GROUP_REGISTRY[task_element]:
if task_name not in task_name_from_registry_dict: if task_name not in task_name_from_registry_dict:
task_name_from_registry_dict = { task_name_from_registry_dict = {
**task_name_from_registry_dict, **task_name_from_registry_dict,
task_name: get_task(task_name=task_name, config=config), task_name: (
group_name,
get_task(task_name=task_name, config=config),
),
} }
else: else:
task_name = task_element task_name = task_element
......
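`check_prompt_config` in the diff above turns one config with a `use_prompt` pattern into several configs, one per matched prompt, each with its own task name and `output_type` forced to `greedy_until`. A minimal sketch of that expansion on plain dicts follows. It assumes the config always carries `dataset_name` (the only branch of `get_task_name_from_config` visible here); `expand_prompt_variants` and the sample values are illustrative.

```python
from typing import Dict, List


def expand_prompt_variants(config: Dict, prompt_variations: List[str]) -> List[Dict]:
    # Base name as built by the visible branch of get_task_name_from_config.
    base_name = "{dataset_path}_{dataset_name}".format(**config)
    return [
        {
            **config,  # copy everything, then override the keys below
            "use_prompt": variation,
            "task": "_".join([base_name, variation]),
            "output_type": "greedy_until",
        }
        for variation in prompt_variations
    ]


base_config = {
    "dataset_path": "super_glue",
    "dataset_name": "boolq",
    "use_prompt": "promptsource:*",
    "output_type": "loglikelihood",
}
variants = expand_prompt_variants(
    base_config, ["promptsource:after_reading", "promptsource:exam"]
)
```

Because each variant is a fresh dict, the original config is left untouched and every variant can be registered as a separate task.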
...@@ -6,7 +6,6 @@ dataset_name: arithmetic_1dc ...@@ -6,7 +6,6 @@ dataset_name: arithmetic_1dc
output_type: loglikelihood output_type: loglikelihood
validation_split: validation validation_split: validation
test_split: null test_split: null
template_aliases: ""
doc_to_text: "{{context}}" doc_to_text: "{{context}}"
doc_to_target: "{{completion}}" doc_to_target: "{{completion}}"
metric_list: metric_list:
......
group: include: arithmetic_1dc.yaml
- arithmetic
task: arithmetic_2da task: arithmetic_2da
dataset_path: EleutherAI/arithmetic
dataset_name: arithmetic_2da dataset_name: arithmetic_2da
output_type: loglikelihood
validation_split: validation
test_split: null
template_aliases: ""
doc_to_text: "{{context}}"
doc_to_target: "{{completion}}"
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
group: include: arithmetic_1dc.yaml
- arithmetic
task: arithmetic_2dm task: arithmetic_2dm
dataset_path: EleutherAI/arithmetic
dataset_name: arithmetic_2dm dataset_name: arithmetic_2dm
output_type: loglikelihood
validation_split: validation
test_split: null
template_aliases: ""
doc_to_text: "{{context}}"
doc_to_target: "{{completion}}"
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
group: include: arithmetic_1dc.yaml
- arithmetic
task: arithmetic_2ds task: arithmetic_2ds
dataset_path: EleutherAI/arithmetic
dataset_name: arithmetic_2ds dataset_name: arithmetic_2ds
output_type: loglikelihood
validation_split: validation
test_split: null
template_aliases: ""
doc_to_text: "{{context}}"
doc_to_target: "{{completion}}"
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
group: include: arithmetic_1dc.yaml
- arithmetic
task: arithmetic_3da task: arithmetic_3da
dataset_path: EleutherAI/arithmetic
dataset_name: arithmetic_3da dataset_name: arithmetic_3da
output_type: loglikelihood
validation_split: validation
test_split: null
template_aliases: ""
doc_to_text: "{{context}}"
doc_to_target: "{{completion}}"
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
group: include: arithmetic_1dc.yaml
- arithmetic
task: arithmetic_3ds task: arithmetic_3ds
dataset_path: EleutherAI/arithmetic
dataset_name: arithmetic_3ds dataset_name: arithmetic_3ds
output_type: loglikelihood
validation_split: validation
test_split: null
template_aliases: ""
doc_to_text: "{{context}}"
doc_to_target: "{{completion}}"
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
group: include: arithmetic_1dc.yaml
- arithmetic
task: arithmetic_4da task: arithmetic_4da
dataset_path: EleutherAI/arithmetic
dataset_name: arithmetic_4da dataset_name: arithmetic_4da
output_type: loglikelihood
validation_split: validation
test_split: null
template_aliases: ""
doc_to_text: "{{context}}"
doc_to_target: "{{completion}}"
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
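The arithmetic configs above drop their repeated keys in favor of `include: arithmetic_1dc.yaml` plus a few overrides. The diff does not show how the harness resolves `include`, so the sketch below assumes the simplest reading: load the base file, then let the including file's own keys win. `resolve_include` is a hypothetical helper, and the in-memory `FILES` dict stands in for YAML files on disk.

```python
from typing import Callable, Dict


def resolve_include(config: Dict, load_config: Callable[[str], Dict]) -> Dict:
    local = dict(config)  # copy so the caller's dict is not mutated
    base_name = local.pop("include", None)
    if base_name is None:
        return local
    # Resolve the base first (it may include another file), then override.
    base = resolve_include(load_config(base_name), load_config)
    return {**base, **local}


# Illustrative contents modeled on the configs in this diff.
FILES = {
    "arithmetic_1dc.yaml": {
        "group": ["arithmetic"],
        "task": "arithmetic_1dc",
        "dataset_path": "EleutherAI/arithmetic",
        "dataset_name": "arithmetic_1dc",
        "output_type": "loglikelihood",
    },
    "arithmetic_2da.yaml": {
        "include": "arithmetic_1dc.yaml",
        "task": "arithmetic_2da",
        "dataset_name": "arithmetic_2da",
    },
}
resolved = resolve_include(FILES["arithmetic_2da.yaml"], FILES.__getitem__)
```

Under this assumption, `arithmetic_2da` inherits `group`, `dataset_path`, and `output_type` from the base file and only replaces `task` and `dataset_name`.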