For more information, including a full list of task names and their precise meanings or sources, follow the links provided to the individual README.md files for each subfolder.

| Task Family | Description | Language(s) |
|-------------|-------------|-------------|
| [aclue](aclue/README.md) | Tasks focusing on ancient Chinese language understanding and cultural aspects. | Ancient Chinese |
| [aexams](aexams/README.md) | Tasks in Arabic related to various academic exams covering a range of subjects. | Arabic |
| [agieval](agieval/README.md) | Tasks derived from human-centric standardized exams, such as college admission tests, law school admission tests, and math competitions. | English, Chinese |
...
| [qasper](qasper/README.md) | Question Answering dataset based on academic papers, testing in-depth scientific knowledge. | English |
| [race](race/README.md) | Reading comprehension assessment tasks based on English exams in China. | English |
| realtoxicityprompts | Tasks to evaluate language models for generating text with potential toxicity. | English |
| [ruler](ruler/README.md) | RULER is a benchmark for testing how well language models handle long contexts. Requires a custom argument (see the task README). | English |
| [sciq](sciq/README.md) | Science Question Answering tasks to assess understanding of scientific concepts. | English |
| [score](score/README.md) | Systematic consistency and robustness evaluation for LLMs on three datasets (MMLU-Pro, AGIEval, and MATH). | English |
| [scrolls](scrolls/README.md) | Tasks that involve long-form reading comprehension across various domains. | English |