This project provides a unified framework to test generative language models on a large number of different evaluation tasks.
**Features:**
- 200+ tasks implemented. See the [task-table](./docs/task_table.md) for a complete list.
- Support for the Hugging Face `transformers` library, GPT-NeoX, Megatron-DeepSpeed, and the OpenAI API, with a flexible, tokenization-agnostic interface.
- Support for evaluating models with adapters (e.g. LoRA) via [HuggingFace's PEFT library](https://github.com/huggingface/peft).
- Task versioning to ensure reproducibility.
## Evaluation Overview
The `Task` and `Prompt` classes contain information that, when combined, produces the input to the language model. The language model is then queried to obtain a raw output. One or more `Filter`s can then be applied to perform arbitrary operations on this raw output, such as extracting the final answer from a chain-of-thought response or calling an external API. The filtered output is then evaluated using a `Metric` to obtain the final result.
```mermaid
graph LR;
classDef empty width:0px,height:0px;
T[Task]
I[Input]
F[Filter]
M[Model]
O[Output]:::empty
P[Prompt]
Me[Metric]
R[Result]
T --- I:::empty
P --- I
I --> M
M --> O
O --> F
Me --> R:::empty
F --> R
```
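For intuition only, here is a minimal, self-contained sketch of this flow. Every name in it is hypothetical and does not correspond to the harness's actual API; it simply mirrors the Task/Prompt → Model → Filter → Metric stages in the diagram above.

```python
from collections import Counter


def build_input(task_doc: dict, prompt_template: str) -> str:
    # Task + Prompt -> Input
    return prompt_template.format(**task_doc)


def run_model(model_input: str) -> list[str]:
    # Model -> raw Output (stubbed here; a real run would query a language model)
    return ["The answer is 4", "The answer is 4", "The answer is 5"]


def majority_vote_filter(raw_outputs: list[str]) -> str:
    # Filter: reduce the raw responses to a single final answer
    return Counter(raw_outputs).most_common(1)[0][0]


def exact_match_metric(prediction: str, gold: str) -> float:
    # Metric: score the filtered output against the gold answer
    return float(prediction == gold)


doc = {"question": "What is 2 + 2?"}
model_input = build_input(doc, "Q: {question}\nA:")
final_answer = majority_vote_filter(run_model(model_input))
print(exact_match_metric(final_answer, "The answer is 4"))  # 1.0
```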
## Install
To install `lm-eval` from the GitHub repository's main branch, run:
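A typical from-source install looks like the following (the repository URL and editable-install layout are assumed here; follow the repository's own instructions if they differ):

```bash
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .
```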
## Filters

Custom `Filter` subclasses implement an `apply` method that operates on each document's list of model responses. For example:

```python
from collections import Counter

# NOTE: the exact import path of the `Filter` base class may differ between harness versions.
from lm_eval.api.filter import Filter


class TakeFirstFilter(Filter):
    def __init__(self) -> None:
        """
        Can define custom behavior here, if an individual instantiation of a Filter class should have state.
        """

    def apply(self, resps):
        """
        Assuming each entry of `resps` is a list of model responses, we discard all but the first response.
        """
        return map(lambda r: r[0], resps)


class TakeKFilter(Filter):
    def __init__(self, *args, **kwargs):
        self.k = kwargs.pop("k")
        super().__init__(*args, **kwargs)

    def apply(self, resps):
        # check we have at least k responses per doc, else we can't take the first k
        assert len(resps[0]) >= self.k, f"Need at least {self.k} responses per doc to take first {self.k}, but got {len(resps[0])} only! Please increase TaskConfig.repeats ."
        return map(lambda r: r[: self.k], resps)


class MajorityVoteFilter(Filter):
    def __init__(self) -> None:
        """
        Can define custom behavior here, if an individual instantiation of a Filter class should have state.
        """

    def apply(self, resps):
        """
        Each entry of `resps` is a list of model responses.
        We select the response that occurs most frequently in each entry of `resps`.
        """
        # one straightforward implementation: pick the most common response per doc
        return map(lambda r: Counter(r).most_common(1)[0][0], resps)
```
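As a quick sanity check (not part of the harness itself), the filters above can be exercised directly on a nested list of responses:

```python
resps = [
    ["A", "B", "A", "A"],  # responses for doc 0
    ["C", "D", "D"],       # responses for doc 1
]

print(list(TakeFirstFilter().apply(resps)))     # ['A', 'C']
print(list(TakeKFilter(k=2).apply(resps)))      # [['A', 'B'], ['C', 'D']]
print(list(MajorityVoteFilter().apply(resps)))  # ['A', 'D']
```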
# Ported Tasks

This list keeps track of which tasks' implementations have been ported to YAML / v2.0 of the Eval Harness.

A box should be checked if and only if the task is implemented in v2.0 and has been regression-tested. A task should be struck through if it has been checked against the implementation from the paper that introduced it, or against the implementation that popularized it.
- [ ] GLUE
- [ ] SuperGLUE
- [ ] CoQA
- [ ] DROP
- [x] ~~Lambada~~
- [x] Lambada (Cloze variants)
- [ ] Lambada (Multilingual)
- [x] Wikitext
- [x] PiQA
- [ ] PROST
- [ ] MCTACO
- [ ] PubMedQA
- [x] SciQ
- [ ] QASPER
- [ ] QA4MRE
- [ ] TriviaQA
- [x] AI2 ARC
- [ ] LogiQA
- [ ] HellaSwag
- [ ] SWAG
- [ ] OpenBookQA
- [ ] SQuADv2
- [ ] RACE
- [ ] HeadQA
- [ ] MathQA
- [ ] WebQs
- [ ] WSC273
- [ ] Winogrande
- [ ] ANLI
- [ ] Hendrycks Ethics
- [ ] TruthfulQA
- [ ] MuTual
- [ ] Hendrycks Math
- [ ] Asdiv
- [ ] GSM8k
- [ ] Arithmetic
- [ ] MMMLU
- [ ] Translation (WMT) suite
- [ ] Unscramble
- [x] ~~Pile (perplexity)~~
- [ ] BLiMP
- [ ] ToxiGen
- [ ] CrowS-Pairs
- [ ] XCOPA
- [ ] BIG-Bench
- [ ] XStoryCloze
- [ ] XWinograd
- [ ] PAWS-X
- [ ] XNLI
- [ ] MGSM
# Novel Tasks
Tasks added in the revamped harness that were not previously available. Again, a strikethrough denotes that the task has been checked against the original task's implementation or against published results from the paper introducing it.
# Task Wishlist
- [ ] TheoremQA
- [ ] Theorem Proving evaluations
- [ ] Chain of Thought
- [ ] Self-consistency; Least-to-Most prompting, etc.
The snippet below (apparently from a multiple-choice task's YAML configuration) shows how `template_aliases` sets the list of answer choices and the gold label index for each document:

```yaml
# TODO: we should see how shuffling answer choices affects performance.
template_aliases: "{% set answer_choices = [distractor1, distractor2, distractor3, correct_answer] %}{% set gold = 3 %}"  # set the list of possible answer choices, and set what this doc's gold label idx is
```
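The TODO above could be explored with a small helper along these lines (hypothetical, not part of the harness): shuffle the choice list per document and recompute the gold index so that it keeps pointing at `correct_answer`.

```python
import random


def shuffle_choices(distractors, correct_answer, seed=None):
    """Shuffle answer choices and return (choices, gold_index).

    Assumes the choices are distinct strings, so the gold index can be
    recovered with list.index().
    """
    choices = list(distractors) + [correct_answer]
    random.Random(seed).shuffle(choices)
    return choices, choices.index(correct_answer)


choices, gold = shuffle_choices(["d1", "d2", "d3"], "the right answer", seed=0)
print(choices, gold)
```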