Commit f9558ce5 authored by lintangsutawika's avatar lintangsutawika

add drop

parent 21e1ed17
# DROP
### Paper
Title: `DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs`
Abstract: https://aclanthology.org/N19-1246/
DROP is a QA dataset which tests comprehensive understanding of paragraphs. In
this crowdsourced, adversarially-created, 96k question-answering benchmark, a
system must resolve multiple references in a question, map them onto a paragraph,
and perform discrete operations over them (such as addition, counting, or sorting).
Homepage: https://allenai.org/data/drop
Acknowledgement: This implementation is based on the official evaluation for `DROP`:
https://github.com/allenai/allennlp-reading-comprehension/blob/master/allennlp_rc/eval/drop_eval.py
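
The official `drop_eval.py` reports exact match (EM) and a token-level F1, each taken as the maximum over the gold answer candidates. A minimal sketch of that idea, assuming a simplified normalization (the real script additionally strips articles and punctuation, normalizes numbers, and aligns bags of answer spans; all names below are illustrative, not the official API):

```python
# Hypothetical sketch of DROP-style scoring, not the official drop_eval.py.
from collections import Counter


def _normalize(text: str) -> str:
    # The official script also strips articles/punctuation and normalizes
    # numbers; this sketch only lowercases and squeezes whitespace.
    return " ".join(text.lower().split())


def token_f1(prediction: str, gold: str) -> float:
    pred_tokens = _normalize(prediction).split()
    gold_tokens = _normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def score(prediction: str, golds: list[str]) -> tuple[float, float]:
    # Take the best EM and best F1 over all gold answer candidates.
    em = max(float(_normalize(prediction) == _normalize(g)) for g in golds)
    f1 = max(token_f1(prediction, g) for g in golds)
    return em, f1
```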
### Citation
```
@inproceedings{dua-etal-2019-drop,
    title = "{DROP}: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs",
    author = "Dua, Dheeru and Wang, Yizhong and Dasigi, Pradeep and Stanovsky, Gabriel and Singh, Sameer and Gardner, Matt",
    booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    year = "2019",
    url = "https://aclanthology.org/N19-1246",
}
```
### Groups and Tasks
#### Groups
* Not part of a group yet.
#### Tasks
* `drop`: Free-form question answering over DROP passages, requiring discrete reasoning such as counting, arithmetic, and sorting; generations are compared against the writer's and the validators' answers.
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
task: drop
dataset_path: EleutherAI/drop
output_type: greedy_until
training_split: train
validation_split: validation
# assumes the preprocessing helpers below are saved as utils.py next to this config
process_docs: !function utils.process_doc
doc_to_text: "Passage: {{passage}}\nQuestion: {{question}}\nAnswer:"
doc_to_target: "{{answer}}"
should_decontaminate: true
doc_to_decontamination_query: "{{passage}} {{question}}"
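
The `doc_to_text`, `doc_to_target`, and `doc_to_decontamination_query` fields are Jinja2 templates rendered against each document. A quick way to preview the exact prompt string (the document below is made up, and `jinja2` is called directly here only for illustration, not how the harness invokes it internally):

```python
# Illustrative preview of the prompt format; the doc below is invented.
from jinja2 import Template

doc = {
    "passage": "The team scored three touchdowns in the first half.",
    "question": "How many touchdowns were scored in the first half?",
}
prompt = Template(
    "Passage: {{passage}}\nQuestion: {{question}}\nAnswer:"
).render(**doc)
print(prompt)
# Passage: The team scored three touchdowns in the first half.
# Question: How many touchdowns were scored in the first half?
# Answer:
```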
def process_doc(dataset):
    # Map each raw HF example into the fields the YAML templates expect.
    def _process(doc):
        return {
            "id": doc["query_id"],
            "passage": doc["passage"],
            "question": doc["question"],
            "answers": get_answers(doc),
        }

    return dataset.map(_process)


def get_answers(doc):
    def _flatten_validated_answers(validated_answers):
        """Flattens a dict of lists of validated answers.

        {"number": ['1', '8'], ...}
        -> [{"number": ['1'], ...}, {"number": ['8'], ...}]
        """
        valid_answers = []
        for i in range(len(validated_answers["number"])):
            valid_answers.append(
                {
                    "number": validated_answers["number"][i],
                    "date": validated_answers["date"][i],
                    "spans": validated_answers["spans"][i],
                }
            )
        return valid_answers

    # Collect the writer's answer plus all validated answers, deduplicated.
    answers = []
    answers_set = set()
    candidates = [doc["answer"]] + _flatten_validated_answers(
        doc["validated_answers"]
    )
    for candidate in candidates:
        answer = parse_answer(candidate)
        if answer in answers_set:
            continue
        answers_set.add(answer)
        answers.append(answer)
    return answers


def parse_answer(answer):
    # NOTE: Everything is returned as a tuple for uniformity and hashability.
    if answer["number"] != "":
        return (str(answer["number"]),)
    if answer["spans"] != []:
        return tuple(answer["spans"])
    # Fall back to the date answer, joined as "day month year".
    return (
        " ".join(
            [answer["date"]["day"], answer["date"]["month"], answer["date"]["year"]]
        ).strip(),
    )
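
As a quick sanity check, `get_answers` deduplicates across the writer's answer and the validated answers, returning one tuple per distinct answer. On a hand-made document (every field value below is invented, mirroring the HF answer schema, and the helpers above are assumed importable):

```python
# Invented example document in the DROP answer schema.
doc = {
    "query_id": "example-0",
    "passage": "The team scored three touchdowns.",
    "question": "How many touchdowns were scored?",
    "answer": {
        "number": "3",
        "date": {"day": "", "month": "", "year": ""},
        "spans": [],
    },
    "validated_answers": {
        "number": ["3", ""],
        "date": [
            {"day": "", "month": "", "year": ""},
            {"day": "", "month": "", "year": ""},
        ],
        "spans": [[], ["three"]],
    },
}

print(get_answers(doc))
# [('3',), ('three',)]  -- the duplicate "3" is collapsed
```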