added triviaqa

c86fd1a7 · lintangsutawika · d1a44c85 · c86fd1a7 · c86fd1a7
Commit c86fd1a7 authored Aug 09, 2023 by lintangsutawika
Hide whitespace changes
Inline Side-by-side

Showing with 70 additions and 0 deletions

lm_eval/tasks/triviaqa/README.md lm_eval/tasks/triviaqa/README.md +46 -0

lm_eval/tasks/triviaqa/default.yaml lm_eval/tasks/triviaqa/default.yaml +24 -0

No files found.
--- a/lm_eval/tasks/triviaqa/README.md
+++ b/lm_eval/tasks/triviaqa/README.md
+# Trivia QA
+
+### Paper
+
+Title: `TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension`
+Abstract: https://arxiv.org/abs/1705.03551
+
+TriviaQA is a reading comprehension dataset containing over 650K question-answer-evidence
+triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts
+and independently gathered evidence documents, six per question on average, that provide
+high quality distant supervision for answering the questions.
+
+Homepage: https://nlp.cs.washington.edu/triviaqa/
+
+
+### Citation
+
+```
+@InProceedings{JoshiTriviaQA2017,
+    author = {Joshi, Mandar and Choi, Eunsol and Weld, Daniel S. and Zettlemoyer, Luke},
+    title = {TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension},
+    booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics},
+    month = {July},
+    year = {2017},
+    address = {Vancouver, Canada},
+    publisher = {Association for Computational Linguistics},
+}
+```
+
+### Subtasks
+
+List or describe tasks defined in this folder, and their names here:
+* `triviaqa`: `Generate and answer based on the question.`
+
+### Checklist
+
+For adding novel benchmarks/datasets to the library:
+* [ ] Is the task an existing benchmark in the literature?
+  * [ ] Have you referenced the original paper that introduced the task?
+  * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
+
+
+If other tasks on this dataset are already supported:
+* [ ] Is the "Main" variant of this task clearly denoted?
+* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
--- a/lm_eval/tasks/triviaqa/default.yaml
+++ b/lm_eval/tasks/triviaqa/default.yaml
+task: triviaqa
+dataset_path: trivia_qa
+dataset_name: rc.nocontext
+output_type: greedy_until
+training_split: train
+validation_split: validation
+doc_to_text: "Question: {{question}}?\nAnswer:"
+doc_to_target: "{{answer.aliases}}"
+should_decontaminate: true
+doc_to_decontamination_query: question
+target_delimiter: " "
+generation_kwargs:
+  until:
+    - "\n"
+    - "." 
+    - ","
+  do_sample: false
+  temperature: 0.0
+metric_list:
+  - metric: exact_match
+    aggregation: mean
+    higher_is_better: true
+    ignore_case: true
+    ignore_punctuation: true