Commit dbf03c05 authored by Lintang Sutawika, committed by GitHub

Update README.md

parent 375346e6
# ARC

### Paper

Title: Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Abstract: https://arxiv.org/abs/1803.05457

The ARC dataset consists of 7,787 science exam questions drawn from a variety
of sources, including science questions provided under license by a research
partner affiliated with AI2. These are text-only, English language exam questions
that span several grade levels as indicated in the files. Each question has a
multiple choice structure (typically 4 answer options). The questions are sorted
into a Challenge Set of 2,590 “hard” questions (those that both a retrieval and
a co-occurrence method fail to answer correctly) and an Easy Set of 5,197 questions.

Homepage: https://allenai.org/data/arc
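
For a quick look at the question format described above, the snippet below loads the Challenge Set and prints one example. It is a hedged sketch, not part of this task's implementation: it assumes the `ai2_arc` dataset card on the Hugging Face Hub, with `ARC-Easy` / `ARC-Challenge` configs and `question` / `choices` / `answerKey` fields.

```python
# Sketch: inspect one ARC example (assumes the `ai2_arc` dataset on the Hugging Face Hub).
from datasets import load_dataset

# "ARC-Challenge" holds the 2,590 "hard" questions; "ARC-Easy" holds the 5,197 easier ones.
arc_challenge = load_dataset("ai2_arc", "ARC-Challenge", split="validation")

example = arc_challenge[0]
print(example["question"])                    # the question text
for label, text in zip(example["choices"]["label"], example["choices"]["text"]):
    print(f"  ({label}) {text}")              # typically 4 answer options
print("gold answer:", example["answerKey"])   # e.g. "A", "B", "C", or "D"
```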
### Citation
```
@article{Clark2018ThinkYH,
  title={Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge},
  author={Peter Clark and Isaac Cowhey and Oren Etzioni and Tushar Khot and Ashish Sabharwal and Carissa Schoenick and Oyvind Tafjord},
  journal={ArXiv},
  year={2018},
  volume={abs/1803.05457}
}
```
### Subtasks

* `arc_easy`: multiple-choice question answering over the Easy Set of 5,197 science exam questions.
* `arc_challenge`: multiple-choice question answering over the Challenge Set of 2,590 “hard” questions that both a retrieval and a co-occurrence method fail to answer correctly.
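
If these subtasks are run through the harness's Python API rather than the CLI, a minimal call might look like the sketch below. It assumes a version of `lm-evaluation-harness` that exposes `lm_eval.simple_evaluate` and the Hugging Face (`hf`) model backend; the model name is only an example, and the exact entry point can differ between harness versions.

```python
# Hedged sketch: evaluate both ARC subtasks via lm-evaluation-harness's Python API.
# Assumes a harness version exposing `lm_eval.simple_evaluate` and the "hf" model type.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",  # example model, not prescribed by this task
    tasks=["arc_easy", "arc_challenge"],
    num_fewshot=0,
)
print(results["results"])  # per-task accuracy metrics
```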
### Checklist

For adding novel benchmarks/datasets to the library:

* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?

If other tasks on this dataset are already supported:

* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?