Commit 6df13d93 authored by lintangsutawika's avatar lintangsutawika
Browse files

Merge branch 'big-refactor' of github.com:EleutherAI/lm-evaluation-harness into dataset-metric-log

parents 6d7a24b2 d1caf4a4
This is a placeholder.
......@@ -155,7 +155,7 @@ class Task(abc.ABC):
if not hasattr(self, "_filters"):
self._filters = []
for name, components in self._config.get(
"filters", [["none", ["take_first"]]]
"filters", [["none", [["take_first", None]]]]
):
filter_pipeline = build_filter_ensemble(name, components)
self._filters.append(filter_pipeline)
......
# v1.0 Tasks
This list keeps track of which tasks' implementations have been ported to YAML / v2.0 of the Eval Harness.
Boxes should be checked iff tasks are implemented in v2.0 and tested for regression. Tasks should be struck through if checked *against original introducing paper* implementation or popularizing implementation.
- [ ] Glue
- [ ] SuperGlue
- [ ] CoQA
- [ ] DROP
- [x] ~~Lambada~~
- [x] Lambada (Cloze variants)
- [ ] Lambada (Multilingual)
- [x] Wikitext
- [x] PiQA
- [ ] PROST
- [ ] MCTACO
- [ ] Pubmed QA
- [x] SciQ
- [ ] QASPER
- [ ] QA4MRE
- [ ] TriviaQA
- [x] AI2 ARC
- [ ] LogiQA
- [ ] HellaSwag
- [ ] SWAG
- [ ] OpenBookQA
- [ ] SQuADv2
- [ ] RACE
- [ ] HeadQA
- [ ] MathQA
- [ ] WebQs
- [ ] WSC273
- [ ] Winogrande
- [ ] ANLI
- [ ] Hendrycks Ethics
- [ ] TruthfulQA
- [ ] MuTual
- [ ] Hendrycks Math
- [ ] Asdiv
- [ ] GSM8k
- [ ] Arithmetic
- [ ] MMLU
- [ ] Translation (WMT) suite
- [ ] Unscramble
- [x] ~~Pile (perplexity)~~
- [ ] BLiMP
- [ ] ToxiGen
- [ ] CrowS-Pairs
- [ ] XCopa
- [ ] BIG-Bench
- [ ] XStoryCloze
- [ ] XWinograd
- [ ] PAWS-X
- [ ] XNLI
- [ ] MGSM
# Novel Tasks
Tasks added in the revamped harness that were not previously available. Again, a strikethrough denotes checking performed *against the original task's implementation or published results introducing the task*.
# Task Wishlist
- [ ] TheoremQA
- [ ] Theorem Proving evaluations
- [ ] Chain of Thought
- [ ] Self-consistency ; Least-to-Most prompting, etc.
- [ ] Summarization Tasks
- [ ] Anthropic Model-Written Evals
\ No newline at end of file
# v1.0 Tasks
This list keeps track of which tasks' implementations have been ported to YAML / v2.0 of the Eval Harness.
Boxes should be checked iff tasks are implemented in v2.0 and tested for regression. Tasks should be struck through if checked *against original introducing paper* implementation or popularizing implementation.
- [ ] Glue
- [ ] SuperGlue
- [ ] CoQA
- [ ] DROP
- [x] ~~Lambada~~
- [x] Lambada (Cloze variants)
- [ ] Lambada (Multilingual)
- [x] Wikitext
- [x] PiQA
- [ ] PROST
- [ ] MCTACO
- [ ] Pubmed QA
- [x] SciQ
- [ ] QASPER
- [ ] QA4MRE
- [ ] TriviaQA
- [x] AI2 ARC
- [ ] LogiQA
- [ ] HellaSwag
- [ ] SWAG
- [ ] OpenBookQA
- [ ] SQuADv2
- [ ] RACE
- [ ] HeadQA
- [ ] MathQA
- [ ] WebQs
- [ ] WSC273
- [ ] Winogrande
- [ ] ANLI
- [ ] Hendrycks Ethics
- [ ] TruthfulQA
- [ ] MuTual
- [ ] Hendrycks Math
- [ ] Asdiv
- [ ] GSM8k
- [ ] Arithmetic
- [ ] MMLU
- [ ] Translation (WMT) suite
- [ ] Unscramble
- [x] ~~Pile (perplexity)~~
- [ ] BLiMP
- [ ] ToxiGen
- [ ] CrowS-Pairs
- [ ] XCopa
- [ ] BIG-Bench
- [ ] XStoryCloze
- [ ] XWinograd
- [ ] PAWS-X
- [ ] XNLI
- [ ] MGSM
# Novel Tasks
Tasks added in the revamped harness that were not previously available. Again, a strikethrough denotes checking performed *against the original task's implementation or published results introducing the task*.
# Task Wishlist
- [ ] TheoremQA
- [ ] Theorem Proving evaluations
- [ ] Chain of Thought
- [ ] Self-consistency ; Least-to-Most prompting, etc.
- [ ] Summarization Tasks
- [ ] Anthropic Model-Written Evals
\ No newline at end of file
import os
from typing import List, Union
from .arc import *
from lm_eval import utils
from lm_eval.logger import eval_logger
......
# LAMBADA
### Paper
The LAMBADA dataset: Word prediction requiring a broad discourse context
https://arxiv.org/pdf/1606.06031.pdf
LAMBADA is a dataset to evaluate the capabilities of computational models for text
understanding by means of a word prediction task. LAMBADA is a collection of narrative
passages sharing the characteristic that human subjects are able to guess their last
word if they are exposed to the whole passage, but not if they only see the last
sentence preceding the target word. To succeed on LAMBADA, computational models
cannot simply rely on local context, but must be able to keep track of information
in the broader discourse.
Homepage: https://zenodo.org/record/2630551#.X4Xzn5NKjUI
### Citation
@misc{paperno2016lambada,
author={Paperno, Denis and Kruszewski, Germán and Lazaridou, Angeliki and Pham, Quan Ngoc and Bernardi, Raffaella and Pezzelle, Sandro and Baroni, Marco and Boleda, Gemma and Fernández, Raquel},
title={The LAMBADA dataset},
DOI={10.5281/zenodo.2630551},
publisher={Zenodo},
year={2016},
month={Aug}
}
\ No newline at end of file
group:
- lambada
task: lambada_openai_yaml
dataset_path: EleutherAI/lambada_openai
dataset_name: default
......@@ -11,7 +13,7 @@ doc_to_decontamination_query: "{{text}}"
metric_list:
- metric: perplexity
aggregation: perplexity
higher_is_better: true
higher_is_better: false
- metric: accuracy
aggregation: mean
higher_is_better: true
# LAMBADA (standard/original release): predict the final word of a passage.
group:
  - lambada
task: lambada_standard_yaml
dataset_path: lambada
dataset_name: null
output_type: loglikelihood
validation_split: validation
test_split: test
template_aliases: ""
# Context = all words except the last; target = the final word, with a
# leading space (presumably so it tokenizes as a continuation — confirm).
doc_to_text: "{{text.split(' ')[:-1]|join(' ')}}"
doc_to_target: "{{' '+text.split(' ')[-1]}}"
should_decontaminate: true
doc_to_decontamination_query: "{{text}}"
metric_list:
  - metric: perplexity
    aggregation: perplexity
    # Perplexity: lower is better.
    higher_is_better: false
  - metric: accuracy
    aggregation: mean
    higher_is_better: true
# LAMBADA (OpenAI variant), cloze formulation: the prompt ends with the
# "____. ->" marker and the model must supply the removed final word.
group:
  - lambada_cloze
task: lambada_openai_cloze_yaml
dataset_path: EleutherAI/lambada_openai
dataset_name: default
output_type: loglikelihood
test_split: test
template_aliases: ""
# Context = all words except the last, followed by the cloze marker.
doc_to_text: "{{text.split(' ')[:-1]|join(' ')}} ____. ->"
# Target = the final word, with a leading space (presumably so it
# tokenizes as a continuation — confirm).
doc_to_target: "{{' '+text.split(' ')[-1]}}"
should_decontaminate: true
doc_to_decontamination_query: "{{text}}"
metric_list:
  - metric: perplexity
    aggregation: perplexity
    # Perplexity: lower is better.
    higher_is_better: false
  - metric: accuracy
    aggregation: mean
    higher_is_better: true
# LAMBADA (standard release), cloze formulation: same data as
# lambada_standard_yaml but the prompt ends with the "____. ->" marker.
group:
  - lambada_cloze
task: lambada_standard_cloze_yaml
dataset_path: lambada
dataset_name: null
output_type: loglikelihood
validation_split: validation
test_split: test
template_aliases: ""
# Context = all words except the last, followed by the cloze marker.
doc_to_text: "{{text.split(' ')[:-1]|join(' ')}} ____. ->"
# Target = the final word, with a leading space (presumably so it
# tokenizes as a continuation — confirm).
doc_to_target: "{{' '+text.split(' ')[-1]}}"
should_decontaminate: true
doc_to_decontamination_query: "{{text}}"
metric_list:
  - metric: perplexity
    aggregation: perplexity
    # Perplexity: lower is better.
    higher_is_better: false
  - metric: accuracy
    aggregation: mean
    higher_is_better: true
# The Pile
### Paper
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
https://arxiv.org/pdf/2101.00027.pdf
The Pile is a 825 GiB diverse, open source language modelling data set that consists
of 22 smaller, high-quality datasets combined together. To score well on Pile
BPB (bits per byte), a model must be able to understand many disparate domains
including books, github repositories, webpages, chat logs, and medical, physics,
math, computer science, and philosophy papers.
Homepage: https://pile.eleuther.ai/
### Citation
```
@article{pile,
title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},
author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},
journal={arXiv preprint arXiv:2101.00027},
year={2020}
}
```
\ No newline at end of file
# Perplexity-style evaluation on the ArXiv subset of The Pile.
group:
  - pile
task: pile_arxiv
dataset_path: EleutherAI/the_pile
dataset_name: pile_arxiv
# Score each whole document as one rolling log-likelihood window.
output_type: loglikelihood_rolling
# Only a "train" split is referenced; presumably the Hub dataset exposes
# no dedicated test split — TODO confirm against the dataset card.
test_split: train
template_aliases: ""
# Empty prompt: the entire document text is the scoring target.
doc_to_text: ""
doc_to_target: "{{text}}"
should_decontaminate: true
doc_to_decontamination_query: "{{text}}"
metric_list:
  # All three metrics are perplexity-like, hence lower is better.
  - metric: word_perplexity
    aggregation: weighted_perplexity
    higher_is_better: false
  - metric: byte_perplexity
    aggregation: weighted_perplexity
    higher_is_better: false
  - metric: bits_per_byte
    aggregation: bits_per_byte
    higher_is_better: false
\ No newline at end of file
# Perplexity-style evaluation on the BookCorpus2 subset of The Pile.
group:
  - pile
task: pile_bookcorpus2
dataset_path: EleutherAI/the_pile
dataset_name: pile_bookcorpus2
# Score each whole document as one rolling log-likelihood window.
output_type: loglikelihood_rolling
# Only a "train" split is referenced; presumably the Hub dataset exposes
# no dedicated test split — TODO confirm against the dataset card.
test_split: train
template_aliases: ""
# Empty prompt: the entire document text is the scoring target.
doc_to_text: ""
doc_to_target: "{{text}}"
should_decontaminate: true
doc_to_decontamination_query: "{{text}}"
metric_list:
  # All three metrics are perplexity-like, hence lower is better.
  - metric: word_perplexity
    aggregation: weighted_perplexity
    higher_is_better: false
  - metric: byte_perplexity
    aggregation: weighted_perplexity
    higher_is_better: false
  - metric: bits_per_byte
    aggregation: bits_per_byte
    higher_is_better: false
\ No newline at end of file
# Perplexity-style evaluation on the Books3 subset of The Pile.
group:
  - pile
task: pile_books3
dataset_path: EleutherAI/the_pile
dataset_name: pile_books3
# Score each whole document as one rolling log-likelihood window.
output_type: loglikelihood_rolling
# Only a "train" split is referenced; presumably the Hub dataset exposes
# no dedicated test split — TODO confirm against the dataset card.
test_split: train
template_aliases: ""
# Empty prompt: the entire document text is the scoring target.
doc_to_text: ""
doc_to_target: "{{text}}"
should_decontaminate: true
doc_to_decontamination_query: "{{text}}"
metric_list:
  # All three metrics are perplexity-like, hence lower is better.
  - metric: word_perplexity
    aggregation: weighted_perplexity
    higher_is_better: false
  - metric: byte_perplexity
    aggregation: weighted_perplexity
    higher_is_better: false
  - metric: bits_per_byte
    aggregation: bits_per_byte
    higher_is_better: false
\ No newline at end of file
# Perplexity-style evaluation on the DM Mathematics subset of The Pile.
group:
  - pile
task: pile_dm-mathematics
dataset_path: EleutherAI/the_pile
dataset_name: pile_dm-mathematics
# Score each whole document as one rolling log-likelihood window.
output_type: loglikelihood_rolling
# Only a "train" split is referenced; presumably the Hub dataset exposes
# no dedicated test split — TODO confirm against the dataset card.
test_split: train
template_aliases: ""
# Empty prompt: the entire document text is the scoring target.
doc_to_text: ""
doc_to_target: "{{text}}"
should_decontaminate: true
doc_to_decontamination_query: "{{text}}"
metric_list:
  # All three metrics are perplexity-like, hence lower is better.
  - metric: word_perplexity
    aggregation: weighted_perplexity
    higher_is_better: false
  - metric: byte_perplexity
    aggregation: weighted_perplexity
    higher_is_better: false
  - metric: bits_per_byte
    aggregation: bits_per_byte
    higher_is_better: false
\ No newline at end of file
# The Pile: An 800GB Dataset of Diverse Text for Language Modeling
# https://arxiv.org/pdf/2101.00027.pdf
# The Pile is a 825 GiB diverse, open source language modelling data set that consists
# of 22 smaller, high-quality datasets combined together. To score well on Pile
# BPB (bits per byte), a model must be able to understand many disparate domains
# including books, github repositories, webpages, chat logs, and medical, physics,
# math, computer science, and philosophy papers.
# Homepage: https://pile.eleuther.ai/
# _CITATION = """
# @article{pile,
# title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},
# author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},
# journal={arXiv preprint arXiv:2101.00027},
# year={2020}
# }
# """
names:
- pile_enron_yaml
group:
- pile
task: pile_enron
dataset_path: EleutherAI/the_pile
dataset_name: enron_emails
output_type: loglikelihood_rolling
......
# Perplexity-style evaluation on the EuroParl subset of The Pile.
# NOTE(review): dataset_name here is "pile_europarl" while the enron config
# in this commit uses the bare "enron_emails" — verify which naming scheme
# the EleutherAI/the_pile Hub configs actually use.
group:
  - pile
task: pile_europarl
dataset_path: EleutherAI/the_pile
dataset_name: pile_europarl
# Score each whole document as one rolling log-likelihood window.
output_type: loglikelihood_rolling
# Only a "train" split is referenced; presumably the Hub dataset exposes
# no dedicated test split — TODO confirm against the dataset card.
test_split: train
template_aliases: ""
# Empty prompt: the entire document text is the scoring target.
doc_to_text: ""
doc_to_target: "{{text}}"
should_decontaminate: true
doc_to_decontamination_query: "{{text}}"
metric_list:
  # All three metrics are perplexity-like, hence lower is better.
  - metric: word_perplexity
    aggregation: weighted_perplexity
    higher_is_better: false
  - metric: byte_perplexity
    aggregation: weighted_perplexity
    higher_is_better: false
  - metric: bits_per_byte
    aggregation: bits_per_byte
    higher_is_better: false
\ No newline at end of file
# Perplexity-style evaluation on the FreeLaw subset of The Pile.
group:
  - pile
task: pile_freelaw
dataset_path: EleutherAI/the_pile
dataset_name: pile_freelaw
# Score each whole document as one rolling log-likelihood window.
output_type: loglikelihood_rolling
# Only a "train" split is referenced; presumably the Hub dataset exposes
# no dedicated test split — TODO confirm against the dataset card.
test_split: train
template_aliases: ""
# Empty prompt: the entire document text is the scoring target.
doc_to_text: ""
doc_to_target: "{{text}}"
should_decontaminate: true
doc_to_decontamination_query: "{{text}}"
metric_list:
  # All three metrics are perplexity-like, hence lower is better.
  - metric: word_perplexity
    aggregation: weighted_perplexity
    higher_is_better: false
  - metric: byte_perplexity
    aggregation: weighted_perplexity
    higher_is_better: false
  - metric: bits_per_byte
    aggregation: bits_per_byte
    higher_is_better: false
\ No newline at end of file
# Perplexity-style evaluation on the GitHub subset of The Pile.
group:
  - pile
task: pile_github
dataset_path: EleutherAI/the_pile
dataset_name: pile_github
# Score each whole document as one rolling log-likelihood window.
output_type: loglikelihood_rolling
# Only a "train" split is referenced; presumably the Hub dataset exposes
# no dedicated test split — TODO confirm against the dataset card.
test_split: train
template_aliases: ""
# Empty prompt: the entire document text is the scoring target.
doc_to_text: ""
doc_to_target: "{{text}}"
should_decontaminate: true
doc_to_decontamination_query: "{{text}}"
metric_list:
  # All three metrics are perplexity-like, hence lower is better.
  - metric: word_perplexity
    aggregation: weighted_perplexity
    higher_is_better: false
  - metric: byte_perplexity
    aggregation: weighted_perplexity
    higher_is_better: false
  - metric: bits_per_byte
    aggregation: bits_per_byte
    higher_is_better: false
\ No newline at end of file
# Perplexity-style evaluation on the Gutenberg (PG-19) subset of The Pile.
group:
  - pile
task: pile_gutenberg
dataset_path: EleutherAI/the_pile
dataset_name: pile_gutenberg
# Score each whole document as one rolling log-likelihood window.
output_type: loglikelihood_rolling
# Only a "train" split is referenced; presumably the Hub dataset exposes
# no dedicated test split — TODO confirm against the dataset card.
test_split: train
template_aliases: ""
# Empty prompt: the entire document text is the scoring target.
doc_to_text: ""
doc_to_target: "{{text}}"
should_decontaminate: true
doc_to_decontamination_query: "{{text}}"
metric_list:
  # All three metrics are perplexity-like, hence lower is better.
  - metric: word_perplexity
    aggregation: weighted_perplexity
    higher_is_better: false
  - metric: byte_perplexity
    aggregation: weighted_perplexity
    higher_is_better: false
  - metric: bits_per_byte
    aggregation: bits_per_byte
    higher_is_better: false
\ No newline at end of file
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment