Merge pull request #965 from EleutherAI/remove-goldalias

[Refactor] Remove deprecated `gold_alias` task YAML option

Commit 01227a7e (unverified), authored Nov 06, 2023 by Lintang Sutawika; committed by GitHub on Nov 06, 2023
6 changed files
--- a/docs/task_guide.md
+++ b/docs/task_guide.md
@@ -20,12 +20,12 @@ Task naming + registration:
 Dataset configuration options:
 - **dataset_path** (`str`) — The name of the dataset as listed by HF in the datasets Hub.
- **dataset_name**  (`str`, *optional*, defaults to None) — The name of, what HF calls, a “data instance” or sub-task of the benchmark. If your task does not contain any data instances, just leave this to default to None. (If you're familiar with the HF `datasets.load_dataset` function, these are just the first 2 arguments to it.)
+- **dataset_name**  (`str`, *optional*, defaults to None) — The name of what HF calls a “data instance” or sub-task of the benchmark. If your task does not contain any data instances, just leave this to default to None. (If you're familiar with the HF `datasets.load_dataset` function, these are just the first 2 arguments to it.)
 - **dataset_kwargs** (`dict`, *optional*) — Auxiliary arguments that `datasets.load_dataset` accepts. This can be used to specify arguments such as `data_files` or `data_dir` if you want to use local datafiles such as json or csv.
 - **training_split** (`str`, *optional*) — Split in the dataset to use as the training split.
 - **validation_split** (`str`, *optional*) — Split in the dataset to use as the validation split.
 - **test_split** (`str`, *optional*) — Split in the dataset to use as the test split.
- **fewshot_split** (`str`, *optional*) — Split in the dataset to draw few-shot exemplars from. assert that this not None if num_fewshot > 0. (?) assert if this is same split as one evaling (?)
+- **fewshot_split** (`str`, *optional*) — Split in the dataset to draw few-shot exemplars from. assert that this not None if num_fewshot > 0.
 - **process_docs** (`Callable`, *optional*) — Optionally define a function to apply to each HF dataset split, to preprocess all documents before being fed into prompt template rendering or other evaluation steps. Can be used to rename dataset columns, or to process documents into a format closer to the expected format expected by a prompt template.
 Prompting / in-context formatting options:

--- a/lm_eval/api/task.py
+++ b/lm_eval/api/task.py
@@ -71,7 +71,6 @@ class TaskConfig(dict):
    doc_to_text: Union[Callable, str] = None
    doc_to_target: Union[Callable, str] = None
    doc_to_choice: Union[Callable, str, dict, list] = None
-    gold_alias: Union[Callable, str] = None
    process_results: Union[Callable, str] = None
    use_prompt: str = None
    description: str = ""
@@ -895,26 +894,6 @@ class ConfigurableTask(Task):
        else:
            raise TypeError
-    def gold_alias(self, doc):
-        # returns a version of the gold target answer to a document,
-        # which should be passed into metric for scoring as the ground truth.
-        # in multiple_choice tasks, this should be castable to an int corresponding to the index
-        # within the answer choices, while doc_to_target is the string version of {{answer_choices[gold]}}.
-        if self.config.gold_alias is not None:
-            doc_to_target = self.config.gold_alias
-        else:
-            return self.doc_to_target(doc)
-        if type(doc_to_target) == str:
-            return utils.apply_template(doc_to_target, doc)
-        elif callable(doc_to_target):
-            return doc_to_target(doc)
-        elif hasattr(doc_to_target, "apply"):
-            return doc_to_target.apply(doc)[1]
-        else:
-            raise TypeError
    def construct_requests(
        self, doc: dict, ctx: str, **kwargs
    ) -> Union[List[Instance], Instance]:

--- a/lm_eval/tasks/gsm8k/gsm8k-cot.yaml
+++ b/lm_eval/tasks/gsm8k/gsm8k-cot.yaml
@@ -14,8 +14,7 @@ Q: There were nine computers in the server room. Five more computers were instal
 Q: Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many golf balls did he have at the end of wednesday?\n\nA: Michael started with 58 golf balls. After losing 23 on tuesday, he had 58 - 23 = 35. After losing 2 more, he had 35 - 2 = 33 golf balls. The answer is 33.\n\n\
 Q: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?\n\nA: Olivia had 23 dollars. 5 bagels for 3 dollars each will be 5 x 3 = 15 dollars. So she has 23 - 15 dollars left. 23 - 15 is 8. The answer is 8.\n\n\
 Q: {{question}}\n\nA:"
-doc_to_target: "{{answer}}" #" {{answer.split('### ')[-1].rstrip()}}"
+doc_to_target: " {{answer.split('### ')[-1].rstrip()}}"
-gold_alias: "{{answer.split('### ')[-1].rstrip()}}" # this post-processes the reference that we'll score against
 metric_list:
  - metric: exact_match
    aggregation: mean
@@ -25,6 +24,8 @@ metric_list:
    regexes_to_ignore:
      - ","
      - "\\$"
+      - "(?s).*#### "
+      - "\n\n"
 generation_kwargs:
  until:
    - "Q:"
@@ -37,5 +38,5 @@ filter_list:
  - name: "get-answer"
    filter:
      - function: "regex"
-        regex_pattern: "The answer is (\\-?[0-9\\.\\,]+)"
+        regex_pattern: "The answer is (\\-?[0-9\\.\\,]+)."
      - function: "take_first"
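
Review note on the `regex_pattern` change above: the old pattern's character class includes `.`, so it can capture a sentence-final period as part of the number. The added trailing `.` makes the pattern consume one extra character after the capture group. The following is a minimal sketch using Python's `re` module. It assumes the `regex` filter applies the pattern with `re.search` and keeps capture group 1. The helper `extract` and the sample completions are hypothetical and are not part of the harness.

```python
import re

# Patterns as they read after YAML unescaping of the double-quoted strings.
OLD = r"The answer is (\-?[0-9\.\,]+)"
NEW = r"The answer is (\-?[0-9\.\,]+)."


def extract(pattern, completion):
    """Return capture group 1 of the first match, or None if nothing matches."""
    match = re.search(pattern, completion)
    return match.group(1) if match else None


# Hypothetical model completions, with and without a closing period.
with_period = "23 - 15 is 8. The answer is 33."
without_period = "The answer is 33"

old_with = extract(OLD, with_period)        # class swallows the final period
new_with = extract(NEW, with_period)        # trailing "." consumes the period
new_without = extract(NEW, without_period)  # trailing "." consumes a digit
```

With these inputs, `old_with` is `"33."`, `new_with` is `"33"`, and `new_without` is `"3"`. The new pattern therefore fixes completions that end in a period, but it truncates the last digit when the completion stops right after the number. That matters only if a stop sequence can cut the output at that point.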
--- a/lm_eval/tasks/gsm8k/gsm8k.yaml
+++ b/lm_eval/tasks/gsm8k/gsm8k.yaml
 group:
  - math_word_problems
-task: gsm8k_yaml
+task: gsm8k
 dataset_path: gsm8k
 dataset_name: main
 output_type: generate_until
@@ -9,7 +9,6 @@ fewshot_split: train
 test_split: test
 doc_to_text: "Question: {{question}}\nAnswer:"
 doc_to_target: "{{answer}}" #" {{answer.split('### ')[-1].rstrip()}}"
-gold_alias: "{{answer.split('### ')[-1].rstrip()}}" # this post-processes the reference that we'll score against
 metric_list:
  - metric: exact_match
    aggregation: mean
@@ -19,7 +18,7 @@ metric_list:
    regexes_to_ignore:
      - ","
      - "\\$"
-      - ".*### "
+      - "(?s).*#### "
 generation_kwargs:
  until:
    - "\n\n"
@@ -28,9 +27,9 @@ generation_kwargs:
  temperature: 0.0
 repeats: 1
 num_fewshot: 5
-# filter_list:
+filter_list:
-#   - name: "get-answer"
+  - name: "get-answer"
-#     filter:
+    filter:
-#       - function: "regex"
+      - function: "regex"
-#         regex_pattern: "### (\\-?[0-9\\.\\,]+)"
+        regex_pattern: "#### (\\-?[0-9\\.\\,]+)"
-#       - function: "take_first"
+      - function: "take_first"
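
Review note on the two regex changes in `gsm8k.yaml`: GSM8K references put the reasoning on several lines and end with `#### <answer>`. Without the `(?s)` flag, `.` does not match newlines, so the old ignore pattern could only strip text on the final line. The sketch below assumes each `regexes_to_ignore` entry is applied with `re.sub(pattern, "", text)` and that the filter uses `re.search` with capture group 1. The sample reference is made up in the GSM8K style.

```python
import re

# Hypothetical GSM8K-style reference: reasoning lines, then "#### <answer>".
reference = (
    "She sold 48 / 2 = 24 clips in May.\n"
    "Altogether she sold 48 + 24 = 72 clips.\n"
    "#### 72"
)

old_ignore = r".*### "       # "." stops at newlines
new_ignore = r"(?s).*#### "  # (?s) lets "." span newlines

old_stripped = re.sub(old_ignore, "", reference)
new_stripped = re.sub(new_ignore, "", reference)

# The re-enabled filter pattern, after YAML unescaping.
filter_pattern = r"#### (\-?[0-9\.\,]+)"
match = re.search(filter_pattern, reference)
extracted = match.group(1) if match else None
```

Here `old_stripped` still contains both reasoning lines followed by `72`, while `new_stripped` and `extracted` are both exactly `"72"`. Under the stated assumptions, the reference and the filtered model output reduce to the same string, and `exact_match` can compare them without `gold_alias`.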
--- a/lm_eval/tasks/hendrycks_ethics/utilitarianism_original_yaml
+++ b/lm_eval/tasks/hendrycks_ethics/utilitarianism_original_yaml
@@ -9,7 +9,6 @@
 # template_aliases:  #"{% set answer_choices = range(1, 11)|list %}"
 # doc_to_text: 'Activity: "{{activity}}"\nRating:'
 # doc_to_target: "{{answer_choices[label]}}"
-# gold_alias: "{{label}}" # this will be cast to an int.
 # metric_list:
 #   - metric: acc
 # TODO: we want this to be implemented as a winograd_schema task type, actually
--- a/lm_eval/tasks/pubmedqa/preprocess_pubmedqa.py
+++ b/lm_eval/tasks/pubmedqa/preprocess_pubmedqa.py
@@ -3,12 +3,3 @@ def doc_to_text(doc) -> str:
    return "Abstract: {}\nQuestion: {}\nAnswer:".format(
        ctxs, doc["QUESTION"], doc["final_decision"]
    )
-def doc_to_target(doc) -> str:
-    return " {}".format(doc["final_decision"])
-def gold_alias(doc):
-    dict_to_label = {"yes": 0, "no": 1, "maybe": 2}
-    return dict_to_label[doc["final_decision"]]