Commit d2332149 authored by Eddy Yeo

[Refactor] Migrate webqs task to yaml

parent 21882abe
```diff
@@ -949,22 +949,21 @@ class ConfigurableTask(Task):
+            if self.multiple_target:
+                acc = 1.0 if pred in gold else 0.0
+                acc_norm = 1.0 if pred_norm in gold else 0.0
+                exact_match = int(any([is_greedy[i] for i in gold]))
+            else:
+                acc = 1.0 if pred == gold else 0.0
+                acc_norm = 1.0 if pred_norm == gold else 0.0
+                # TODO: this gets score of 0 on arc_challenge for pythia-70m. need to test that this works properly
+                exact_match = int(is_greedy[gold])
             result_dict = {
                 **({"acc": acc} if "acc" in use_metric else {}),
                 **({"f1": (gold, pred)} if "f1" in use_metric else {}),
                 **({"mcc": (gold, pred)} if "mcc" in use_metric else {}),
                 **({"acc_norm": acc_norm} if "acc_norm" in use_metric else {}),
+                **({"exact_match": exact_match} if "exact_match" in use_metric else {}),
             }
-            if "exact_match" in self._metric_fn_list.keys():
-                # TODO: this gets score of 0 on arc_challenge for pythia-70m. need to test that this works properly
-                is_greedy = is_greedy[gold]  # take value for the gold answer
-                result_dict["exact_match"] = int(is_greedy)
             if "acc_mutual_info" in use_metric:
                 lls_mutual_info = [
                     ll_c - ll_u for ll_c, ll_u in zip(lls, lls_unconditional)
```
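For intuition, here is a minimal, self-contained sketch of the scoring rule the new `multiple_target` branch implements, using hypothetical toy values rather than the harness's real data structures:

```python
# Toy sketch of the multiple-target scoring above (hypothetical values,
# not the harness's real data structures).
is_greedy = [False, True, False]  # per choice: did greedy decoding reproduce it?
gold = [0, 1]                     # indices of all accepted answers
pred = 1                          # index of the highest-likelihood choice

acc = 1.0 if pred in gold else 0.0                  # 1.0: prediction is an accepted answer
exact_match = int(any(is_greedy[i] for i in gold))  # 1: some accepted answer was greedy-matched
print(acc, exact_match)  # 1.0 1
```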
# WebQuestions

### Paper

Title: `Semantic Parsing on Freebase from Question-Answer Pairs`

Paper: `https://cs.stanford.edu/~pliang/papers/freebase-emnlp2013.pdf`

WebQuestions is a question-answering benchmark of 6,642 question/answer pairs. The questions are meant to be answerable against Freebase, a large knowledge graph; they mostly center on a single named entity and were popular queries asked on the web (at least as of 2013).

Homepage: `https://worksheets.codalab.org/worksheets/0xba659fe363cb46e7a505c5b6a774dc8a`
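The task loads its data from the Hugging Face Hub (`dataset_path: web_questions` in the YAML config below). A quick sketch for inspecting a record, assuming the `datasets` library is available:

```python
from datasets import load_dataset

# "web_questions" matches dataset_path in the YAML config below.
ds = load_dataset("web_questions", split="test")
doc = ds[0]
print(doc["question"])  # the natural-language question
print(doc["answers"])   # list of accepted answer strings
```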
### Citation
```
@inproceedings{berant-etal-2013-semantic,
title = "Semantic Parsing on {F}reebase from Question-Answer Pairs",
author = "Berant, Jonathan and
Chou, Andrew and
Frostig, Roy and
Liang, Percy",
booktitle = "Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing",
month = oct,
year = "2013",
address = "Seattle, Washington, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/D13-1160",
pages = "1533--1544",
}
```
### Subtasks

Tasks defined in this folder:

* `webqs`: questions with multiple accepted answers; every accepted answer is scored as a gold choice.
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?

If other tasks on this dataset are already supported:

* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
`utils.py`:

```python
from typing import Dict, List


def doc_to_choice(doc: Dict) -> List[str]:
    """Return all of the accepted answers as choices."""
    return _remove_prefixes(doc["answers"])


def doc_to_target(doc: Dict) -> List[int]:
    """Return the indices of all accepted answers (i.e. every choice)."""
    remaining = _remove_prefixes(doc["answers"])
    return list(range(len(remaining)))


def _remove_prefixes(aliases):
    """
    Remove any alias that has a strict prefix elsewhere in the list.

    This is an optimization: if greedy decoding already matches a shorter
    alias, a longer alias sharing that prefix cannot change the result,
    so we can stop looking.
    """
    aliases.sort()
    ret = [aliases[0]]
    for alias in aliases[1:]:
        if not alias.startswith(ret[-1]):
            ret.append(alias)
    return ret
```
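As a toy illustration of how these helpers interact (the answers here are hypothetical, not a real dataset record): because `_remove_prefixes` keeps only the shortest alias in each prefix chain, every surviving choice is an accepted answer, so `doc_to_target` marks all indices as gold.

```python
# Hypothetical doc, for illustration only.
doc = {"answers": ["Jason Kidd", "Jason", "Steve Nash"]}

# After sorting, "Jason" is a strict prefix of "Jason Kidd",
# so the longer alias is dropped.
print(doc_to_choice(doc))  # ['Jason', 'Steve Nash']

# Every surviving choice is gold, so all indices are targets.
print(doc_to_target(doc))  # [0, 1]
```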
`webqs.yaml`:

```yaml
group:
  - freebase
  - question_answer
task: webqs
dataset_path: web_questions
dataset_name: null
output_type: multiple_choice
training_split: train
validation_split: null
test_split: test
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: !function utils.doc_to_target
doc_to_choice: !function utils.doc_to_choice
should_decontaminate: true
doc_to_decontamination_query: question
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
```
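The `doc_to_text` field is a Jinja2 template rendered against each document. A minimal sketch of the resulting prompt (the question string here is just an example):

```python
from jinja2 import Template

# Render the doc_to_text template from the config above against a
# sample question (hypothetical, for illustration).
prompt = Template("Question: {{question}}\nAnswer:").render(
    question="what language do jamaicans speak?"
)
print(prompt)
# Question: what language do jamaicans speak?
# Answer:
```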