Unverified commit 170ae096 authored by Leo Gao, committed by GitHub

Merge pull request #226 from jon-tow/evaluator-description-option

Replace `fewshot_description` API with a `description_dict` based interface
parents 8728710c 02a4def2
@@ -55,7 +55,7 @@ To evaluate mesh-transformer-jax models that are not available on HF, please inv

## Implementing new tasks

To implement a new task in eval harness, see [this guide](./docs/task_guide.md).

## Cite as

@@ -364,7 +364,6 @@ To inspect what the LM inputs look like, you can run the following command:

```bash
python write_out.py \
    --tasks all_tasks \
    --num_fewshot 5 \
    --num_examples 10 \
    --output_base_path /path/to/output/folder
...
# Description Guide
![fewshot-example](./img/fewshot_example_gpt3.png)
(Figure from [Brown et al., 2020](https://arxiv.org/pdf/2005.14165.pdf))
Task descriptions provide in-context task instruction for your language model. If you'd like to prepend a natural language description to your few-shot examples and prompt, you can do so on a per-task basis via the `description_dict` arg of [`evaluator.evaluate`](../lm_eval/evaluator.py). This `description_dict` must adhere to the following key-value structure:
- **key**: the task name (`str`) as specified in the lm-eval-harness [task registry](../lm_eval/tasks/__init__.py).
- **value**: the corresponding (`str`) description/prompt for the task identified by **key**.
```python
description_dict = {
"task_name_1": "description",
"task_name_2": "description",
...
}
```
Note that a task's description will be separated from the few-shot examples and prompt that follow it by a newline, as shown below:
```python
"""
<description>
<examples>
<prompt>
"""
```
## Descriptions in File
You can also drive the aforementioned [`evaluator.evaluate`](../lm_eval/evaluator.py) (or `evaluator.simple_evaluate`) from a higher level by passing a JSON file path to the `description_dict_path` arg of the command-line interface (CLI) program, `main.py`. The JSON file should be structured the same as the `description_dict`. E.g. for a file at `/your/path/descriptions.json` you might have:
```json
{
"cycle_letters": "Please unscramble the letters into a word, and write that word:",
"copa": "Given a premise and one alternative with a causal relation to the premise and another without, choose the more plausible alternative"
}
```
which can then be supplied to the CLI as:
```bash
python main.py \
--tasks cycle_letters,copa \
--description_dict_path /your/path/descriptions.json \
...
```
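
Internally, the CLI presumably just parses that file into a plain `dict` and forwards it; a rough programmatic equivalent (the exact loading logic in `main.py` may differ) would be:

```python
import json

from lm_eval import evaluator

with open("/your/path/descriptions.json") as f:
    description_dict = json.load(f)

results = evaluator.simple_evaluate(
    model="gpt2",                      # placeholder model name
    tasks=["cycle_letters", "copa"],
    description_dict=description_dict,
)
```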
@@ -87,8 +87,7 @@ There are 2 standard approaches we follow for downloading data:
```
These methods should return `True`/`False` to indicate whether your task dataset provides documents for each split type. __Note__: if the test set doesn't have publicly available labels, please do not put it down as having a test set.
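
For instance, a task whose dataset ships labeled train and validation splits but an unlabeled test split might declare the following (a sketch, not taken from any existing task):

```python
def has_training_docs(self):
    return True

def has_validation_docs(self):
    return True

def has_test_docs(self):
    # The test split has no publicly available labels, so report it as absent.
    return False
```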
Lastly, we need to load the documents. In our terminology, a document (`doc`) is a single natural language data example stored in a Python `dict`. E.g.: `{"question": "What is the capital of France?", "answer": "Paris"}`. Override the following methods to load your data splits from their storage location in `DATASET_PATH`:
```python
def training_docs(self):
    return #...
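    # For illustration only -- a hypothetical HF-datasets-backed implementation
    # (assumes the download step stored splits on `self.data`, which may differ
    # per task): `return self.data["train"]`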
@@ -125,17 +124,9 @@ You can now skip ahead to <a href="#Registering-Your-Task">registering your task
<br>
In the case your task is _not_ multiple-choice, override the following methods for your task class:

Format your document into a single query prompt __without the answer__ here. This method takes a single `doc` example of type `dict` with `str` key-value members. You should concatenate these `doc` item values together into a neatly formatted prompt.

```python
def doc_to_text(self, doc):
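    # A hypothetical sketch only -- the field names below are placeholders,
    # not prescribed by this guide:
    # return f"Question: {doc['question']}\nAnswer:"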
@@ -161,11 +152,12 @@ After registering your task, you can now check on your data downloading and veri
```bash
python -m scripts.write_out \
    --output_base_path <path> \
    --tasks <your-task> \
    --sets <train | val | test> \
    --num_fewshot K \
    --num_examples N \
    --description_dict_path <path>
```
Open the file specified at the `--output_base_path <path>` and ensure it passes
...
import abc
from typing import Iterable
import numpy as np
import random
import re
import os
import json
@@ -450,11 +451,43 @@ class Task(abc.ABC):
        pass

    def fewshot_description(self):
        import warnings
        warnings.warn(
            "`fewshot_description` will be removed in future versions. Pass "
            "any custom descriptions to the `evaluate` function instead.",
            DeprecationWarning)
        return ""
    @utils.positional_deprecated
    def fewshot_context(self, doc, num_fewshot, provide_description=None, rnd=None, description=None):
        """ Returns a fewshot context string that is made up of a prepended description
        (if provided), the `num_fewshot` number of examples, and an appended prompt example.

        :param doc: dict
            The document as returned from training_docs, validation_docs, or test_docs.
        :param num_fewshot: int
            The number of fewshot examples to provide in the returned context string.
        :param provide_description: bool
            Not implemented. This option is deprecated and will be removed in a future version in favor of a different description-providing method.
        :param rnd: random.Random
            The pseudo-random number generator used to randomly sample examples.
            WARNING: This is currently a required arg although it's optionalized with a default `None`.
        :param description: str
            The task's description that will be prepended to the fewshot examples.
        :returns: str
            The fewshot context.
        """
        assert rnd is not None, "A `random.Random` generator argument must be provided to `rnd`"
        assert not provide_description, (
            "The `provide_description` arg will be removed in future versions. To prepend "
            "a custom description to the context, supply the corresponding string via the "
            "`description` arg."
        )
        if provide_description is not None:
            # nudge people to not specify it at all
            print("WARNING: provide_description is deprecated and will be removed in a future version in favor of description_dict")

        description = description + "\n\n" if description else ""

        if num_fewshot == 0:
            labeled_examples = ""
@@ -523,16 +556,22 @@ class PerplexityTask(Task, abc.ABC):
    def has_training_docs(self):
        return False

    def fewshot_examples(self, k, rnd):
        assert k == 0
        return []

    def fewshot_context(self, doc, num_fewshot, provide_description=None, rnd=None, description=None):
        assert num_fewshot == 0
        assert rnd is not None, "A `random.Random` generator argument must be provided to `rnd`"
        assert not provide_description, (
            "The `provide_description` arg will be removed in future versions. To prepend "
            "a custom description to the context, supply the corresponding string via the "
            "`description` arg."
        )
        if provide_description is not None:
            # nudge people to not specify it at all
            print("WARNING: provide_description is deprecated and will be removed in a future version in favor of description_dict")

        return ""
    def higher_is_better(self):
...
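
For reference, a hedged sketch of how the new `fewshot_context` signature might be exercised for a registered task; the task name, shot count, and description string are placeholders:

```python
import random

import lm_eval.tasks

task = lm_eval.tasks.get_task("copa")()
doc = list(task.validation_docs())[0]

ctx = task.fewshot_context(
    doc=doc,
    num_fewshot=2,
    rnd=random.Random(1234),  # required: used to sample the few-shot examples
    description="Given a premise, choose the more plausible alternative.",
)
```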
@@ -6,19 +6,23 @@ import lm_eval.models
import lm_eval.tasks
import lm_eval.base
import numpy as np
from lm_eval.utils import positional_deprecated
@positional_deprecated
def simple_evaluate(model, model_args=None, tasks=[],
                    num_fewshot=0, batch_size=None, device=None,
                    no_cache=False, limit=None, bootstrap_iters=100000,
                    description_dict=None):
    """Instantiate and evaluate a model on a list of tasks.

    :param model: Union[str, LM]
        Name of model or LM object, see lm_eval.models.get_model
    :param model_args: Optional[str]
        String arguments for each model class, see LM.create_from_arg_string.
        Ignored if `model` argument is an LM object.
    :param tasks: list[Union[str, Task]]
        List of task names or Task objects. Task objects will be taken to have name task.EVAL_HARNESS_NAME if defined and type(task).__name__ otherwise.
    :param num_fewshot: int
        Number of examples in few-shot context
    :param batch_size: int, optional
@@ -31,23 +35,39 @@ def simple_evaluate(model, model_args, task_names,
        Limit the number of examples per task (only use this for testing)
    :param bootstrap_iters:
        Number of iterations for bootstrap statistics
    :param description_dict: dict[str, str]
        Dictionary of custom task descriptions of the form: `task_name: description`
    :return
        Dictionary of results
    """
    random.seed(1234)
    np.random.seed(1234)

    assert tasks != [], "No tasks specified"

    if isinstance(model, str):
        if model_args is None: model_args = ""
        lm = lm_eval.models.get_model(model).create_from_arg_string(model_args, {
            'batch_size': batch_size, 'device': device
        })
    else:
        assert isinstance(model, lm_eval.base.LM)
        lm = model

    if not no_cache:
        lm = lm_eval.base.CachingLM(
            lm, 'lm_cache/' + model + '_' + model_args.replace('=', '-').replace(',', '_').replace('/', '-') + '.db'
        )

    task_dict = lm_eval.tasks.get_task_dict(tasks)

    results = evaluate(
        lm=lm,
        task_dict=task_dict,
        num_fewshot=num_fewshot,
        limit=limit,
        description_dict=description_dict
    )

    # add info about the model and few shot config
    results["config"] = {
@@ -58,19 +78,21 @@ def simple_evaluate(model, model_args, task_names,
        "device": device,
        "no_cache": no_cache,
        "limit": limit,
        "bootstrap_iters": bootstrap_iters,
        "description_dict": description_dict
    }

    return results
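
The `positional_deprecated` decorator applied here is imported from `lm_eval.utils` but not shown in this diff. As a rough idea of what such a guard could look like (illustrative only; the real implementation and its threshold may differ):

```python
import functools

def positional_deprecated(fn):
    """Warn when `fn` receives extra positional arguments, nudging callers
    toward keyword arguments. Sketch only; not the actual lm_eval.utils code."""
    @functools.wraps(fn)
    def _wrapper(*args, **kwargs):
        if len(args) > 1:  # allow e.g. `self` or the first argument positionally
            print(
                f"WARNING: calling {fn.__name__} with positional arguments is "
                "deprecated; please switch to keyword arguments."
            )
        return fn(*args, **kwargs)
    return _wrapper
```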
@positional_deprecated
def evaluate(lm, task_dict, provide_description=None, num_fewshot=0, limit=None, bootstrap_iters=100000, description_dict=None):
    """Instantiate and evaluate a model on a list of tasks.

    :param lm: obj
        Language Model
    :param task_dict: dict[str, Task]
        Dictionary of tasks. Tasks will be taken to have name task.EVAL_HARNESS_NAME if defined and type(task).__name__ otherwise.
    :param provide_description: bool
        Not implemented. This option is deprecated and will be removed in a future version in favor of a different description-providing method.
    :param num_fewshot: int
@@ -79,6 +101,8 @@ def evaluate(lm, task_dict, provide_description, num_fewshot, limit, bootstrap_i
        Limit the number of examples per task (only use this for testing)
    :param bootstrap_iters:
        Number of iterations for bootstrap statistics
    :param description_dict: dict[str, str]
        Dictionary of custom task descriptions of the form: `task_name: description`
    :return
        Dictionary of results
    """
@@ -86,6 +110,9 @@ def evaluate(lm, task_dict, provide_description, num_fewshot, limit, bootstrap_i
    # TODO: implement proper description-providing system
    assert not provide_description  # not implemented.
    if provide_description is not None:
        # nudge people to not specify it at all
        print("WARNING: provide_description is deprecated and will be removed in a future version in favor of description_dict")

    task_dict_items = [
        (name, task)
@@ -125,16 +152,16 @@ def evaluate(lm, task_dict, provide_description, num_fewshot, limit, bootstrap_i
        rnd.seed(42)
        rnd.shuffle(task_docs)

        description = description_dict[task_name] if description_dict and task_name in description_dict else ""

        for doc_id, doc in enumerate(itertools.islice(task_docs, 0, limit)):
            docs[(task_name, doc_id)] = doc
            ctx = task.fewshot_context(
                doc=doc,
                num_fewshot=num_fewshot,
                rnd=rnd,
                description=description
            )
            reqs = task.construct_requests(doc, ctx)
            if not isinstance(reqs, (list, tuple)):
                reqs = [reqs]
...
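
Putting the pieces together, `evaluate` can also be driven directly with a prebuilt `task_dict` and an already-constructed LM object; a sketch under the assumption that a `gpt2` model is available locally or via download:

```python
import lm_eval.models
import lm_eval.tasks
from lm_eval import evaluator

# Placeholder model construction; any lm_eval.base.LM instance works here.
lm = lm_eval.models.get_model("gpt2").create_from_arg_string(
    "", {"batch_size": 1, "device": None}
)

results = evaluator.evaluate(
    lm=lm,
    task_dict=lm_eval.tasks.get_task_dict(["copa"]),
    num_fewshot=0,
    limit=5,
    description_dict={"copa": "Choose the more plausible alternative."},
)
```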
from pprint import pprint
from typing import List, Union
import sacrebleu
import lm_eval.base
from . import superglue
from . import glue
@@ -305,8 +307,23 @@ def get_task(task_name):
        raise KeyError(f"Missing task {task_name}")


def get_task_name_from_object(task_object):
    for name, class_ in TASK_REGISTRY.items():
        if class_ is task_object:
            return name

    # this gives a mechanism for non-registered tasks to have a custom name anyways when reporting
    return task_object.EVAL_HARNESS_NAME if hasattr(task_object, "EVAL_HARNESS_NAME") else type(task_object).__name__


def get_task_dict(task_name_list: List[Union[str, lm_eval.base.Task]]):
    task_name_dict = {
        task_name: get_task(task_name)()
        for task_name in task_name_list if isinstance(task_name, str)
    }
    task_name_from_object_dict = {
        get_task_name_from_object(task_object): task_object
        for task_object in task_name_list if not isinstance(task_object, str)
    }
    assert set(task_name_dict.keys()).isdisjoint(set(task_name_from_object_dict.keys()))
    return {**task_name_dict, **task_name_from_object_dict}
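
A quick sketch of the new mixed-input behavior of `get_task_dict`; the unregistered-task case is left as a comment because it requires a fully implemented `Task` subclass:

```python
import lm_eval.tasks

# Registered tasks can be requested by name:
task_dict = lm_eval.tasks.get_task_dict(["copa", "cycle_letters"])

# Unregistered Task instances may be mixed in; they are reported under
# `task.EVAL_HARNESS_NAME` if defined, otherwise under their class name:
# task_dict = lm_eval.tasks.get_task_dict(["copa", MyCustomTask()])
```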
@@ -33,10 +33,6 @@ class ANLIBase(HFTask):
        if self.has_test_docs():
            return self.data["test_r" + str(self.SPLIT)]

    def fewshot_description(self):
        # TODO: figure out description
        return ""

    def doc_to_text(self, doc):
        # OA does this a bit weirdly: they prepend "anli 1: anli 1: " to the beginning
        # of the prompt (yes, repeating it!). also, " True, False, or Neither?" is directly
...
@@ -29,10 +29,6 @@ class ARCEasy(HFTask, MultipleChoiceTask):
        }
        return out_doc

    def fewshot_description(self):
        # TODO: figure out description
        return ""

    def doc_to_text(self, doc):
        return doc["query"]
...
@@ -29,9 +29,18 @@ class BlimpTask(HFTask):
        self.data["validation"] = self.data["train"]
        del self.data["train"]

    def fewshot_context(self, doc, num_fewshot, provide_description=None, rnd=None, description=None):
        assert num_fewshot == 0
        assert rnd is not None, "A `random.Random` generator argument must be provided to `rnd`"
        assert not provide_description, (
            "The `provide_description` arg will be removed in future versions. To prepend "
            "a custom description to the context, supply the corresponding string via the "
            "`description` arg."
        )
        if provide_description is not None:
            # nudge people to not specify it at all
            print("WARNING: provide_description is deprecated and will be removed in a future version in favor of description_dict")

        return ""

    def doc_to_text(self, doc):
...
@@ -17,10 +17,6 @@ class CBTBase(HFTask):
    VERSION = 0

    def fewshot_description(self):
        # TODO: Figure out description.
        return ""

    def detokenize(self, text):
        text = text.replace(" '", "'")
        text = text.replace(" \n", "\n")
...
@@ -36,10 +36,7 @@ class CoQA(Task):
    def test_docs(self):
        pass

    def fewshot_description(self):
        return "Given a passage and a conversation so far, answer the next question in the conversation."

    def doc_to_text(self, doc):
        # Given a passage p, the conversation history {q1, a1, . . . qi−1, ai−1}
        # and a question qi, the task is to predict the answer ai
...
@@ -40,10 +40,6 @@ class DROP(Task):
    def has_test_docs(self):
        return False

    def fewshot_description(self):
        # TODO: figure out description
        return ""

    def _load_docs(self, docs):
        for doc in docs:
            for qa in doc["qa_pairs"]:
...
@@ -21,10 +21,6 @@ class CoLA(HFTask):
    def has_test_docs(self):
        return False

    def fewshot_description(self):
        # TODO
        return ""

    def doc_to_text(self, doc):
        return "{}\nQuestion: Does this sentence make sense?\nAnswer:".format(doc["sentence"])
@@ -69,9 +65,6 @@ class SST(HFTask):
    def has_test_docs(self):
        return False

    def fewshot_description(self):
        return "Indicate if the sentiment of each sentence is positive or negative."

    def doc_to_text(self, doc):
        return "{}\nQuestion: Is this sentence positive or negative?\nAnswer:".format(
            general_detokenize(doc["sentence"]),
@@ -341,9 +334,6 @@ class MRPC(HFTask):
    def has_test_docs(self):
        return False

    def fewshot_description(self):
        return "Indicate if both sentences mean the same thing."

    def doc_to_text(self, doc):
        return "Sentence 1: {}\nSentence 2: {}\nQuestion: Do both sentences mean the same thing?\nAnswer:".format(
            general_detokenize(doc["sentence1"]),
@@ -394,9 +384,6 @@ class QQP(HFTask):
    def has_test_docs(self):
        return False

    def fewshot_description(self):
        return "Indicate if both questions ask the same thing."

    def doc_to_text(self, doc):
        return "Question 1: {}\nQuestion 2: {}\nQuestion: Do both questions ask the same thing?\nAnswer:".format(
            doc["question1"],
@@ -447,10 +434,6 @@ class STSB(HFTask):
    def has_test_docs(self):
        return True

    def fewshot_description(self):
        return "Indicate if both sentences mean the same thing from a scale of 0-5, " \
               "where 5 means identical and 0 means unrelated."

    def doc_to_text(self, doc):
        return "sentence 1: {}\nsentence 2: {}\nAnswer:".format(
            doc["sentence1"],
...
@@ -24,10 +24,6 @@ class HeadQABase(HFTask, MultipleChoiceTask):
        }
        return out_doc

    def fewshot_description(self):
        # TODO: figure out description
        return ""

    def doc_to_text(self, doc):
        return doc["query"]
...
@@ -35,10 +35,5 @@ class HellaSwag(HFTask, MultipleChoiceTask):
        }
        return out_doc

    def fewshot_description(self):
        return "Label for the relevant action: Sentences describing the " \
               "context, with an incomplete sentence trailing\nanswer that " \
               "plausibly completes the situation."

    def doc_to_text(self, doc):
        return doc["query"]
@@ -237,9 +237,6 @@ class EthicsUtilitarianismOriginal(Ethics):
        for doc in docs:
            yield {"activity": doc[0], "baseline": doc[1], "rating": ""}

    def fewshot_description(self):
        return "Rate how pleasant each of the following activities is on a scale from 1 (very unpleasant) to 10 (very pleasant).\n\n"

    def fewshot_examples(self, k, rnd):
        # Overwriting fewshot examples as k can be max 5
        assert k <= 5, "There are only 5 possible shots for this task. Refer to the V2 for more."
@@ -350,9 +347,6 @@ class EthicsVirtue(Ethics):
    def get_prefix(self):
        return "virtue/virtue"

    def fewshot_description(self):
        return "The following is a list of sentences and traits, along with whether the trait is exhibited in that sentence.\n\n"

    def process_doc(self, doc):
        # Append identifiers before shuffling to calculate exact matches later on & skip the first element of headers
        return [x + [i] for i, x in enumerate(doc[1:])]
...
@@ -55,9 +55,6 @@ class Math(Task):
    def test_docs(self):
        return self._load_docs(self.DATASET_PATH / "test" / self.get_file_info())

    def fewshot_description(self):
        return "Given a mathematics problem, determine the answer. Simplify your answer as much as possible."

    def doc_to_text(self, doc):
        return "Problem: " + doc["problem"] + "\nAnswer:"
...
@@ -114,9 +114,5 @@ class GeneralHendrycksTest(MultipleChoiceTask):
        return rnd.sample(list(self._fewshot_docs), k)

    def fewshot_description(self):
        subject = self.subject.replace("_", " ")
        return f"The following are multiple choice questions (with answers) about {subject}."

    def doc_to_text(self, doc):
        return doc["query"]
@@ -47,10 +47,6 @@ class LAMBADA(Task):
    def doc_to_target(self, doc):
        return " " + doc['text'].rsplit(' ', 1)[1]

    def fewshot_description(self):
        # TODO: figure out description
        return ""

    def construct_requests(self, doc, ctx):
        ll, is_greedy = rf.loglikelihood(ctx, self.doc_to_target(doc))
...