Unverified commit 35a24652 authored by Aflah, committed by GitHub

Merge pull request #1 from EleutherAI/toxicity-test

Toxicity test
parents 52213e29 0021de21
...@@ -14,14 +14,13 @@ If you choose to port a task not yet completed according to [our checklist](http
Lastly, we'll no longer be accepting new feature requests beyond those that are already open to the master branch as we carry out this switch to the new version over the next week, though we will be accepting bugfixes to `master` branch and PRs to `big-refactor`. Feel free to reach out in the #lm-thunderdome channel of the EAI discord for more information.
## Overview
This project provides a unified framework to test generative language models on a large number of different evaluation tasks.
Features:
- Many tasks implemented: 200+ tasks [implemented in the old framework](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/docs/task_table.md) which require porting to the new setup as described in [the new task guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/docs/new_task_guide.md).
- Support for models loaded via [transformers](https://github.com/huggingface/transformers/) (including quantization via [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)), [GPT-NeoX](https://github.com/EleutherAI/gpt-neox), and [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/), with a flexible tokenization-agnostic interface.
- Support for commercial APIs including [OpenAI](https://openai.com), [goose.ai](https://goose.ai), and [TextSynth](https://textsynth.com/).
- Support for evaluation on adapters (e.g. LoRA) supported in [HuggingFace's PEFT library](https://github.com/huggingface/peft).
...@@ -34,7 +33,6 @@ To install the `lm-eval` refactor branch from the github repository, run:
```bash
git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
git checkout big-refactor
pip install -e .
```
...@@ -50,6 +48,17 @@ To support loading GPTQ quantized models, install the package with the `gptq` ex
pip install -e ".[gptq]"
```
To install the package with all extras, run
```bash
pip install -e ".[all]"
```
## Support
The best way to get support is to open an issue on this repo or join the [EleutherAI discord server](https://discord.gg/eleutherai). The `#lm-thunderdome` channel is dedicated to developing this project and the `#release-discussion` channel is for receiving support for our releases.
## Basic Usage
### Hugging Face `transformers`
...@@ -79,6 +88,19 @@ python main.py \
Models that are loaded via either `transformers.AutoModelForCausalLM` (autoregressive, decoder-only GPT style models) or `transformers.AutoModelForSeq2SeqLM` (such as encoder-decoder models like T5) in Huggingface are supported.
Batch size selection can be automated by setting the ```--batch_size``` flag to ```auto```. This will perform automatic detection of the largest batch size that will fit on your device. On tasks where there is a large difference between the longest and shortest example, it can be helpful to periodically recompute the largest batch size, to gain a further speedup. To do this, append ```:N``` to the above flag to automatically recompute the largest batch size ```N``` times. For example, to recompute the batch size 4 times, the command would be:
```bash
python main.py \
--model hf \
--model_args pretrained=EleutherAI/pythia-160m,revision=step100000,dtype="float" \
--tasks lambada_openai,hellaswag \
--device cuda:0 \
--batch_size auto:4
```
Alternatively, you can use `lm-eval` instead of `python main.py` to call lm eval from anywhere.
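For reference, a minimal Python-side sketch of the same batch-size behaviour (an illustration, not taken from the docs: it assumes `HFLM` is importable from `lm_eval.models.huggingface`, as the model files in this diff suggest, and that the constructor keywords mirror the `--model_args` shown above):

```python
# Hedged sketch: construct the HuggingFace-backed model directly in Python.
from lm_eval.models.huggingface import HFLM

lm = HFLM(
    pretrained="EleutherAI/pythia-160m",
    batch_size="auto:4",  # as with the CLI flag, re-detect the largest fitting batch size 4 times
)
```

The resulting object could then be handed to the evaluator entry point later in this diff, whose `isinstance(model, str)` check implies that already-constructed model objects are accepted alongside model-name strings.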
### Multi-GPU Evaluation with Hugging Face `accelerate`
To parallelize evaluation of HuggingFace models across multiple GPUs, we allow for two different types of multi-GPU evaluation.
...@@ -114,30 +136,43 @@ Using this setting helps for massive models like BLOOM which require, or to avoi
**Note that this option requires launching evaluation via `python main.py` rather than `accelerate launch main.py`.**
To use `accelerate` with the `lm-eval` command, use
```
accelerate launch --no_python lm-eval --model ...
```
### Commercial APIs
Our library also supports the evaluation of models served via several commercial APIs, and we hope to implement support for commonly used, performant local/self-hosted inference servers.
A full accounting of the supported and planned libraries + APIs can be seen below:
| API or Inference Server | Implemented? | `--model <xxx>` name | Models supported: | Request Types: |
|-----------------------------|---------------------------------|----------------------------------------------------------------------------------|--------------------------------------|----------------------------------------------------------|
| OpenAI Completions | :heavy_check_mark: | `openai`, `openai-completions`, `gooseai` | up to `code-davinci-002` | `greedy_until`, `loglikelihood`, `loglikelihood_rolling` |
| OpenAI ChatCompletions | :x: Not yet - needs help! | N/A | (link here?) | `greedy_until` (no logprobs) |
| Anthropic | :heavy_check_mark: | `anthropic` | [Supported Anthropic Engines](https://docs.anthropic.com/claude/reference/selecting-a-model) | `greedy_until` (no logprobs) |
| GooseAI | :heavy_check_mark: (not separately maintained) | `openai`, `openai-completions`, `gooseai` (same interface as OpenAI Completions) | | `greedy_until`, `loglikelihood`, `loglikelihood_rolling` |
| Textsynth | Needs testing | `textsynth` | ??? | `greedy_until`, `loglikelihood`, `loglikelihood_rolling` |
| Cohere | :hourglass: - blocked on Cohere API bug | N/A | [All `cohere.generate()` engines](https://docs.cohere.com/docs/models) | `greedy_until`, `loglikelihood`, `loglikelihood_rolling` |
| GGML | :hourglass: [PR](https://github.com/EleutherAI/lm-evaluation-harness/pull/617) | N/A | ??? | `greedy_until`, `loglikelihood`, `loglikelihood_rolling` |
| vLLM | :x: Not yet - needs help! | N/A | All HF models | `greedy_until` (no logprobs) |
| Your inference server here! | ... | ... | ... | ... |
It is on our roadmap to create task variants designed to enable models that do not serve logprobs/loglikelihoods to be compared against the generation performance of open-source models.
Our library supports language models served via the OpenAI Completions API as follows:
```bash
export OPENAI_API_SECRET_KEY=YOUR_KEY_HERE
python main.py \
--model openai-completions \
--model_args engine=davinci \
--tasks lambada_openai,hellaswag
```
While this functionality is only officially maintained for the official OpenAI API, it tends to also work for other hosting services that use the same API such as [goose.ai](https://goose.ai) with minor modification. We also have an implementation for the [TextSynth](https://textsynth.com/index.html) API, using `--model textsynth`.
To verify the data integrity of the tasks you're performing in addition to running the tasks themselves, you can use the `--check_integrity` flag:
```bash
python main.py \
--model openai \
--model_args engine=davinci \
--tasks lambada_openai,hellaswag \
--check_integrity
```
### Other Frameworks
A number of other libraries contain scripts for calling the eval harness through their library. These include [GPT-NeoX](https://github.com/EleutherAI/gpt-neox/blob/main/eval_tasks/eval_adapter.py), [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples/MoE/readme_evalharness.md), and [mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/blob/master/eval_harness.py).
...@@ -158,6 +193,16 @@ python write_out.py \
This will write out one text file for each task.
To verify the data integrity of the tasks you're performing in addition to running the tasks themselves, you can use the `--check_integrity` flag:
```bash
python main.py \
--model openai \
--model_args engine=davinci \
--tasks lambada_openai,hellaswag \
--check_integrity
```
## Advanced Usage
For models loaded with the HuggingFace `transformers` library, any arguments provided via `--model_args` get passed to the relevant constructor directly. This means that anything you can do with `AutoModel` can be done with our library. For example, you can pass a local path via `pretrained=` or use models finetuned with [PEFT](https://github.com/huggingface/peft) by taking the call you would run to evaluate the base model and add `,peft=PATH` to the `model_args` argument:
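As a hedged illustration of the same idea from Python (the adapter path is a placeholder and the keyword names simply mirror the CLI flags; `simple_evaluate` is defined in the evaluator changes later in this diff):

```python
from lm_eval import evaluator

# Evaluate a base model with a PEFT adapter applied; the ",peft=..." entry is
# forwarded to the model constructor like any other model_args key.
results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m,peft=/path/to/adapter",  # hypothetical adapter path
    tasks=["lambada_openai", "hellaswag"],
)
```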
...@@ -187,6 +232,14 @@ To implement a new task in the eval harness, see [this guide](./docs/new_task_gu
As a start, we currently only support one prompt per task, which we strive to make the "standard" as defined by the benchmark's authors. If you would like to study how varying prompts causes changes in the evaluation score, we support prompts authored in the [Promptsource Library](https://github.com/bigscience-workshop/promptsource/tree/main) as described further in https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/docs/new_task_guide.md and https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/docs/advanced_task_guide.md and welcome contributions of novel task templates and task variants.
## How to Contribute or Learn More?
For more information on the library and how everything fits together, check out all of our [documentation pages](https://github.com/EleutherAI/lm-evaluation-harness/tree/big-refactor/docs)! We plan to post a larger roadmap of desired + planned library improvements soon, with more information on how contributors can help.
You can also ask for help, or discuss new features with the maintainers in the #lm-thunderdome channel of the EleutherAI discord! If you've used the library and have had a positive (or negative) experience, we'd love to hear from you!
## Cite as
```
...
...@@ -236,3 +236,89 @@ Generative tasks:
Tasks using complex filtering:
- GSM8k with CoT (+ with Self-Consistency): (`lm_eval/tasks/gsm8k/gsm8k-cot.yaml` ; `lm_eval/tasks/gsm8k/gsm8k-cot-self-consistency.yaml`)
## Benchmarks
When evaluating a language model, it is not unusual to test across a number of tasks that may not be related to one another in order to assess a variety of capabilities. To this end, it can be cumbersome to have to list the full set of tasks, or to add a new group name to the yaml of each individual task.
To solve this, we can create a benchmark yaml config. This is a config that contains the names of the tasks that should be included in a particular benchmark. The config consists of two main keys: `group`, which denotes the name of the benchmark, and `task`, which is where we list the tasks. The tasks listed in `task` are the task names that have been registered. A good example would be the list of tasks used to evaluate the Pythia Suite.
```yaml
group: pythia
task:
- lambada_openai
- wikitext
- piqa
- sciq
- wsc
- winogrande
- arc
- logiqa
- blimp
- hendrycksTest*
```
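Entries ending in `*` (such as `hendrycksTest*` above) are wildcard patterns that get expanded against the registered task names; the harness does this via `utils.pattern_match` in the benchmark loader shown later in this diff. A rough, hedged sketch of the idea, using made-up registry contents:

```python
import fnmatch

# Hypothetical registered task names, for illustration only.
ALL_TASKS = {"lambada_openai", "hendrycksTest-anatomy", "hendrycksTest-astronomy", "wikitext"}


def expand(patterns, all_tasks):
    """Expand wildcard entries such as "hendrycksTest*" into concrete task names."""
    matched = set()
    for pattern in patterns:
        matched.update(name for name in all_tasks if fnmatch.fnmatch(name, pattern))
    return sorted(matched)


print(expand(["lambada_openai", "hendrycksTest*"], ALL_TASKS))
# -> ['hendrycksTest-anatomy', 'hendrycksTest-astronomy', 'lambada_openai']
```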
Alternatively, benchmarks can contain tasks that are configured individually, inline. These entries are defined the same way a task yaml is usually written.
```yaml
group: t0_eval
task:
  # Coreference Resolution
  - dataset_path: super_glue
    dataset_name: wsc.fixed
    use_prompt: promptsource:*
    training_split: train
    validation_split: validation
    metric_list:
      - metric: exact_match
        aggregation: mean
        higher_is_better: true
        ignore_case: true
        ignore_punctuation: true
  # Coreference Resolution
  - dataset_path: winogrande
    dataset_name: winogrande_xl
    use_prompt: promptsource:*
    training_split: train
    validation_split: validation
    metric_list:
      - metric: exact_match
        aggregation: mean
        higher_is_better: true
        ignore_case: true
        ignore_punctuation: true
  ...
```
If the benchmark contains the same dataset but with different configurations, use `task` to differentiate between them. For example, T0-Eval evaluates on 3 versions of ANLI, but the HuggingFace dataset collects them in one dataset.
```yaml
group: t0_eval
task:
  ...
  - task: anli_r1
    dataset_path: anli
    use_prompt: promptsource:*
    training_split: train_r1
    validation_split: dev_r1
    metric_list:
      - metric: exact_match
        aggregation: mean
        higher_is_better: true
        ignore_case: true
        ignore_punctuation: true
  - task: anli_r2
    dataset_path: anli
    use_prompt: promptsource:*
    training_split: train_r2
    validation_split: dev_r2
    metric_list:
      - metric: exact_match
        aggregation: mean
        higher_is_better: true
        ignore_case: true
        ignore_punctuation: true
```
Calling the benchmark is done the same way we would call any task with `--tasks`. Benchmarks can be added in `lm_eval/benchmarks/`.
...@@ -36,15 +36,19 @@ The LM class enforces a common interface via which we can extract responses from
```python
class MyCustomLM(LM):
    #...
    def loglikelihood(self, requests: list[Instance]) -> list[tuple[float, bool]]:
        #...
    def loglikelihood_rolling(self, requests: list[Instance]) -> list[tuple[float, bool]]:
        #...
    def greedy_until(self, requests: list[Instance]) -> list[str]:
        #...
    #...
```
Where `Instance` is a dataclass defined in [`lm_eval.api.instance`](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/api/instance.py) with property `args` which returns a tuple of (context, continuation).
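For illustration only, a minimal dummy implementation along these lines could look like the sketch below (the registry name is made up, the constant return values are placeholders, and any additional abstract members of `LM` would also need to be implemented):

```python
from lm_eval.api.instance import Instance
from lm_eval.api.model import LM
from lm_eval.api.registry import register_model


@register_model("my-dummy-lm")  # hypothetical name; assumes register_model is used as a naming decorator
class MyDummyLM(LM):
    def loglikelihood(self, requests: list[Instance]) -> list[tuple[float, bool]]:
        # Each Instance carries (context, continuation) in .args.
        return [(0.0, False) for _ in requests]

    def loglikelihood_rolling(self, requests: list[Instance]) -> list[tuple[float, bool]]:
        return [(0.0, False) for _ in requests]

    def greedy_until(self, requests: list[Instance]) -> list[str]:
        # Each Instance carries (context, until) in .args.
        return ["" for _ in requests]
```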
We support
...
import abc
import os
from typing import Union, List, Tuple
from sqlitedict import SqliteDict
import json
import hashlib
...@@ -25,31 +25,32 @@ class LM(abc.ABC):
        self.cache_hook = CacheHook(None)

    @abc.abstractmethod
    def loglikelihood(self, requests) -> List[Tuple[float, bool]]:
        """Compute log-likelihood of generating a continuation from a context.
        Downstream tasks should attempt to use loglikelihood instead of other
        LM calls whenever possible.

        :param requests: list[Instance]
            A list of Instance objects, with property `args` which returns a tuple (context, continuation).
            `context: str`
                Context string. Implementations of LM must be able to handle an
                empty context string.
            `continuation: str`
                The continuation over which log likelihood will be calculated. If
                there is a word boundary, the space should be in the continuation.
                For example, context="hello" continuation=" world" is correct.

        :return: list[tuple[float, bool]]
            A list of pairs (logprob, isgreedy)
            `logprob: float`
                The log probability of `continuation`.
            `isgreedy`:
                Whether `continuation` would be generated by greedy sampling from `context`.
        """
        pass

    @abc.abstractmethod
    def loglikelihood_rolling(self, requests) -> List[Tuple[float, bool]]:
        """Compute full log-likelihood of a string, with no truncation, for perplexity computation
        - We will use the full max context length of the model.
        - For inputs that exceed the max context length, we divide the tokenized string into chunks of up to
...@@ -77,11 +78,11 @@ class LM(abc.ABC):
        1. Each token is predicted exactly once
        2. For the last pair, we provide the full context, but only score the last two tokens

        :param requests: list[Instance]
            A list of Instance objects with property `args` which returns a tuple (context, continuation).
            string: str
                String for which we are computing per-token loglikelihood
        :return: list[tuple[float, bool]]
            A list of pairs (logprob, isgreedy)
            logprob: float
                The log probability of `continuation`
...@@ -92,17 +93,17 @@ class LM(abc.ABC):
    # TODO: Add an optional max length
    @abc.abstractmethod
    def greedy_until(self, requests) -> List[str]:
        """Generate greedily until a stopping sequence

        :param requests: list[Instance]
            A list of Instance objects with property `args` which returns a tuple (context, until).
            context: str
                Context string
            until: [str]
                The string sequences to generate until. These string sequences
                may each span across multiple tokens, or may be part of one token.
        :return: list[str]
            A list of strings continuation
            continuation: str
                The generated continuation.
...
...@@ -13,7 +13,7 @@ from tqdm import tqdm
import datasets
import numpy as np

from typing import Union, List, Any, Tuple, Literal
from collections.abc import Callable

from lm_eval import utils
...@@ -477,7 +477,7 @@ class Task(abc.ABC):
            eval_logger.warning("No filter defined, passing through instances")
            return self._instances

    def dump_config(self) -> dict:
        """Returns a dictionary representing the task's config.

        :returns: str
...@@ -489,14 +489,13 @@ class ConfigurableTask(Task):
class ConfigurableTask(Task):
    VERSION = "Yaml"
    OUTPUT_TYPE = None
    CONFIG = None

    def __init__(
        self, data_dir=None, cache_dir=None, download_mode=None, config: dict = None
    ):  # TODO no super() call here
        # Get pre-configured attributes
        self._config = self.CONFIG
...@@ -662,25 +661,25 @@ class ConfigurableTask(Task):
            **dataset_kwargs if dataset_kwargs is not None else {},
        )

    def has_training_docs(self) -> bool:
        if self._config.training_split is not None:
            return True
        else:
            return False

    def has_validation_docs(self) -> bool:
        if self._config.validation_split is not None:
            return True
        else:
            return False

    def has_test_docs(self) -> bool:
        if self._config.test_split is not None:
            return True
        else:
            return False

    def training_docs(self) -> datasets.Dataset:
        if self.has_training_docs():
            if self._config.process_docs is not None:
                return self._config.process_docs(
...@@ -688,7 +687,7 @@ class ConfigurableTask(Task):
                )
            return self.dataset[self._config.training_split]

    def validation_docs(self) -> datasets.Dataset:
        if self.has_validation_docs():
            if self._config.process_docs is not None:
                return self._config.process_docs(
...@@ -696,7 +695,7 @@ class ConfigurableTask(Task):
                )
            return self.dataset[self._config.validation_split]

    def test_docs(self) -> datasets.Dataset:
        if self.has_test_docs():
            if self._config.process_docs is not None:
                return self._config.process_docs(self.dataset[self._config.test_split])
...@@ -762,12 +761,17 @@ class ConfigurableTask(Task):
            return doc_to_text(doc)
        # Used when applying a Promptsource template
        elif hasattr(doc_to_text, "apply"):
            applied_prompt = doc_to_text.apply(doc)
            if len(applied_prompt) == 2:
                return applied_prompt[0]
            else:
                eval_logger.warning("Applied prompt returns empty string")
                return self._config.fewshot_delimiter
        else:
            print(type(doc_to_text))
            raise TypeError
    def doc_to_target(self, doc: dict) -> Union[int, str]:
        if self.prompt is not None:
            doc_to_target = self.prompt
...@@ -792,11 +796,16 @@ class ConfigurableTask(Task):
            return doc_to_target(doc)
        # Used when applying a Promptsource template
        elif hasattr(doc_to_target, "apply"):
            applied_prompt = doc_to_target.apply(doc)
            if len(applied_prompt) == 2:
                return applied_prompt[1]
            else:
                eval_logger.warning("Applied prompt returns empty string")
                return self._config.fewshot_delimiter
        else:
            raise TypeError

    def doc_to_choice(self, doc: Any) -> List[str]:
        if self.prompt is not None:
            doc_to_choice = self.prompt
...@@ -838,7 +847,9 @@ class ConfigurableTask(Task):
        else:
            raise TypeError

    def construct_requests(
        self, doc: dict, ctx: str, **kwargs
    ) -> Union[List[Instance], Instance]:
        if self.OUTPUT_TYPE == "loglikelihood":
            arguments = (ctx, self.doc_to_target(doc))
...@@ -847,13 +858,14 @@ class ConfigurableTask(Task):
        elif self.OUTPUT_TYPE == "multiple_choice":
            choices = self.doc_to_choice(doc)
            target_delimiter = self._config.target_delimiter
            if self.multiple_input:
                # If there are multiple inputs, choices are placed in the ctx
                cont = self.doc_to_target(doc)
                arguments = [(ctx, f"{target_delimiter}{cont}") for ctx in choices]
            else:
                # Otherwise they are placed in the continuation
                arguments = [(ctx, f"{target_delimiter}{cont}") for cont in choices]
            request_list = [
                Instance(
...@@ -1037,13 +1049,12 @@ class ConfigurableTask(Task):
class MultipleChoiceTask(Task):
    OUTPUT_TYPE: str = "loglikelihood"

    def doc_to_target(self, doc: dict) -> str:
        return " " + doc["choices"][doc["gold"]]

    def construct_requests(self, doc: dict, ctx: str, **kwargs) -> List[Instance]:
        # TODO: add mutual info here?
        return [
            Instance(
...@@ -1056,7 +1067,7 @@ class MultipleChoiceTask(Task):
            for i, choice in enumerate(doc["choices"])
        ]

    def process_results(self, doc: dict, results: List[Tuple[float, bool]]) -> dict:
        results = [
            res[0] for res in results
        ]  # only retain loglikelihoods, discard is_greedy TODO: do we need is_greedy anywhere?
...@@ -1071,13 +1082,13 @@ class MultipleChoiceTask(Task):
            "acc_norm": acc_norm,
        }

    def higher_is_better(self) -> dict:
        return {
            "acc": True,
            "acc_norm": True,
        }

    def aggregation(self) -> dict:
        return {
            "acc": mean,
            "acc_norm": mean,
...@@ -1085,24 +1096,23 @@ class MultipleChoiceTask(Task):
class PerplexityTask(Task):
    OUTPUT_TYPE = "loglikelihood_rolling"

    def has_training_docs(self) -> bool:
        return False

    def fewshot_examples(self, k: int, rnd) -> List:
        assert k == 0
        return []

    def fewshot_context(self, doc: dict, num_fewshot: int) -> Literal[""]:
        assert (
            num_fewshot == 0
        ), "The number of fewshot examples must be 0 for perplexity tasks."
        return ""

    def higher_is_better(self) -> dict:
        return {
            "word_perplexity": False,
            "byte_perplexity": False,
...@@ -1118,7 +1128,7 @@ class PerplexityTask(Task):
    def doc_to_target(self, doc):
        return doc

    def construct_requests(self, doc: dict, ctx: Union[str, None], **kwargs):
        assert not ctx
        return Instance(
...@@ -1129,7 +1139,7 @@ class PerplexityTask(Task):
            **kwargs,
        )

    def process_results(self, doc: dict, results: float) -> dict:
        (loglikelihood,) = results
        words = self.count_words(self.doc_to_target(doc))
        bytes_ = self.count_bytes(self.doc_to_target(doc))
...@@ -1139,7 +1149,7 @@ class PerplexityTask(Task):
            "bits_per_byte": (loglikelihood, bytes_),
        }

    def aggregation(self) -> dict:
        return {
            "word_perplexity": weighted_perplexity,
            "byte_perplexity": weighted_perplexity,
...@@ -1147,10 +1157,10 @@ class PerplexityTask(Task):
        }

    @classmethod
    def count_bytes(cls, doc) -> int:
        return len(doc.encode("utf-8"))

    @classmethod
    def count_words(cls, doc) -> int:
        """Downstream tasks with custom word boundaries should override this!"""
        return len(re.split(r"\s+", doc))
import os
import yaml
from lm_eval import utils
from lm_eval.tasks import register_configurable_task, check_prompt_config
from lm_eval.logger import eval_logger
from lm_eval.api.registry import (
TASK_REGISTRY,
GROUP_REGISTRY,
ALL_TASKS,
)
def include_benchmarks(task_dir):
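    # Walk `task_dir` for benchmark YAML files: inline (dict-style) task configs are
    # registered via register_configurable_task, while plain task-name strings are
    # pattern-matched against ALL_TASKS and recorded under the benchmark's group.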
for root, subdirs, file_list in os.walk(task_dir):
if (subdirs == [] or subdirs == ["__pycache__"]) and (len(file_list) > 0):
for f in file_list:
if f.endswith(".yaml"):
try:
benchmark_path = os.path.join(root, f)
with open(benchmark_path, "rb") as file:
yaml_config = yaml.full_load(file)
assert "group" in yaml_config
group = yaml_config["group"]
all_task_list = yaml_config["task"]
config_list = [
task for task in all_task_list if type(task) != str
]
task_list = [
task for task in all_task_list if type(task) == str
]
for task_config in config_list:
var_configs = check_prompt_config(
{
**task_config,
**{"group": group},
}
)
for config in var_configs:
register_configurable_task(config)
task_names = utils.pattern_match(task_list, ALL_TASKS)
for task in task_names:
if task in TASK_REGISTRY:
if group in GROUP_REGISTRY:
GROUP_REGISTRY[group].append(task)
else:
GROUP_REGISTRY[group] = [task]
ALL_TASKS.add(group)
except Exception as error:
eval_logger.warning(
"Failed to load benchmark in\n"
f" {benchmark_path}\n"
" Benchmark will not be added to registry\n"
f" Error: {error}"
)
task_dir = os.path.dirname(os.path.abspath(__file__)) + "/"
include_benchmarks(task_dir)
...@@ -6,7 +6,7 @@ task:
  - sciq
  - wsc
  - winogrande
  - arc
  - logiqa
  - blimp
  - hendrycksTest*
group: t0_eval
task:
# Coreference Resolution
- dataset_path: super_glue
dataset_name: wsc.fixed
use_prompt: promptsource:*
training_split: train
validation_split: validation
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
# Coreference Resolution
- dataset_path: winogrande
dataset_name: winogrande_xl
use_prompt: promptsource:*
training_split: train
validation_split: validation
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
# Natural Language Inference
- dataset_path: super_glue
dataset_name: cb
use_prompt: promptsource:*
training_split: train
validation_split: validation
output_type: greedy_until
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
- dataset_path: super_glue
dataset_name: rte
use_prompt: promptsource:*
training_split: train
validation_split: validation
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
- task: anli_r1
dataset_path: anli
use_prompt: promptsource:*
training_split: train_r1
validation_split: dev_r1
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
- task: anli_r2
dataset_path: anli
use_prompt: promptsource:*
training_split: train_r2
validation_split: dev_r2
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
- task: anli_r3
dataset_path: anli
use_prompt: promptsource:*
training_split: train_r3
validation_split: dev_r3
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
# Sentence Completion
- dataset_path: super_glue
dataset_name: copa
use_prompt: promptsource:*
training_split: train
validation_split: validation
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
# Natural Language Inference
- dataset_path: hellaswag
use_prompt: promptsource:*
training_split: train
validation_split: validation
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
# Word Sense Disambiguation
- dataset_path: super_glue
dataset_name: wic
use_prompt: promptsource:*
training_split: train
validation_split: validation
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
...@@ -11,6 +11,7 @@ import numpy as np
import lm_eval.api
import lm_eval.tasks
import lm_eval.benchmarks
import lm_eval.models
import lm_eval.api.metrics
import lm_eval.api.registry
...@@ -85,7 +86,9 @@ def simple_evaluate(
        1234
    )  # TODO: this may affect training runs that are run with evaluation mid-run.

    assert (
        tasks != []
    ), "No tasks specified, or no tasks found. Please verify the task names."

    if isinstance(model, str):
        if model_args is None:
...@@ -114,7 +117,12 @@ def simple_evaluate(
    task_dict = lm_eval.tasks.get_task_dict(tasks)
    for task_name in task_dict.keys():
config = task_dict[task_name]._config
task_obj = task_dict[task_name]
if type(task_obj) == tuple:
group, task_obj = task_obj
config = task_obj._config
        if num_fewshot is not None:
            if config["num_fewshot"] > 0:
                default_num_fewshot = config["num_fewshot"]
...@@ -122,7 +130,7 @@ def simple_evaluate(
                    f"Overwriting default num_fewshot of {task_name} from {default_num_fewshot} to {num_fewshot}"
                )
            task_obj._config["num_fewshot"] = num_fewshot

    if check_integrity:
        run_task_tests(task_list=tasks)
...@@ -246,7 +254,7 @@ def evaluate(
            eval_logger.info(
                f"Task: {task_name}; document {inst.doc_id}; context prompt (starting on next line):\n{inst.args[0]}\n(end of prompt on previous line)"
            )
            eval_logger.info(f"Request: {str(inst)}")

        # aggregate Instances by LM method requested to get output.
        reqtype = (
...
from . import huggingface
from . import openai_completions
from . import anthropic_llms
from . import textsynth
from . import dummy

# TODO: implement __all__
import os
from lm_eval.api.model import LM
from lm_eval.api.registry import register_model
from tqdm import tqdm
import time
import anthropic
from lm_eval.logger import eval_logger
from typing import List, Any, Tuple


def anthropic_completion(
    client,  #: anthropic.Anthropic,
    model: str,
    prompt: str,
    max_tokens_to_sample: int,
    temperature: float,
    stop: List[str],
    **kwargs: Any,
) -> str:
    """Wrapper function around the Anthropic completion API client with exponential back-off
    in case of RateLimitError.
Retry with back-off until they respond
params:
client: anthropic.Anthropic
Anthropic API client
model: str
Anthropic model e.g. 'claude-instant-v1', 'claude-2'
prompt: str
Prompt to feed to the model
max_tokens_to_sample: int
Maximum number of tokens to sample from the model
temperature: float
Sampling temperature
stop: List[str]
List of stop sequences
kwargs: Any
Additional model_args to pass to the API client
""" """
backoff_time = 3
try:
import anthropic
except ModuleNotFoundError:
raise Exception(
"attempted to use 'anthropic' LM type, but package `anthropic` is not installed. \
please install anthropic via `pip install lm-eval[anthropic]` or `pip install -e .[anthropic]`",
)
backoff_time: float = 3
    while True:
        try:
            response = client.completions.create(
...@@ -68,6 +90,14 @@ class AnthropicLM(LM):
        """
        super().__init__()
try:
import anthropic
except ModuleNotFoundError:
raise Exception(
"attempted to use 'anthropic' LM type, but package `anthropic` is not installed. \
please install anthropic via `pip install lm-eval[anthropic]` or `pip install -e .[anthropic]`",
)
        self.model = model
        # defaults to os.environ.get("ANTHROPIC_API_KEY")
        self.client = anthropic.Anthropic()
...@@ -78,15 +108,15 @@ class AnthropicLM(LM):
    @property
    def eot_token_id(self):
        # Not sure but anthropic.HUMAN_PROMPT ?
        raise NotImplementedError("No idea about anthropic tokenization.")

    @property
    def max_length(self) -> int:
        return 2048

    @property
    def max_gen_toks(self) -> int:
        return self.max_tokens_to_sample

    @property
...@@ -108,14 +138,15 @@ class AnthropicLM(LM):
    def _loglikelihood_tokens(self, requests, disable_tqdm=False):
        raise NotImplementedError("No support for logits.")

    def greedy_until(self, requests) -> List[str]:
        if not requests:
            return []

        _requests: List[Tuple[str, dict]] = [req.args for req in requests]

        res = []
        for request in tqdm(_requests):
            try:
                inp = request[0]
                request_args = request[1]
...@@ -129,16 +160,16 @@ class AnthropicLM(LM):
                    prompt=inp,
                    max_tokens_to_sample=max_gen_toks,
                    temperature=temperature,  # TODO: implement non-greedy sampling for Anthropic
                    stop=until,  # type: ignore
                    **self.kwargs,
                )
                res.append(response)
                self.cache_hook.add_partial("greedy_until", request, response)
            except anthropic.APIConnectionError as e:  # type: ignore # noqa: F821
                eval_logger.critical(f"Server unreachable: {e.__cause__}")
                break
            except anthropic.APIStatusError as e:  # type: ignore # noqa: F821
                eval_logger.critical(f"API error {e.status_code}: {e.message}")
                break
...
...@@ -20,7 +20,7 @@ from lm_eval.api.registry import register_model
from lm_eval.utils import MultiTokenEOSCriteria, stop_sequences_criteria

from accelerate import Accelerator, find_executable_batch_size
from typing import List, Optional, Union
...@@ -70,7 +70,8 @@ class HFLM(LM):
        max_length: Optional[int] = None,
        device: Optional[str] = "cuda",
        dtype: Optional[Union[str, torch.dtype]] = "auto",
        batch_size: Optional[Union[int, str]] = 1,
        max_batch_size: Optional[int] = 64,
        low_cpu_mem_usage: Optional[bool] = True,
        trust_remote_code: Optional[bool] = False,
        use_fast_tokenizer: Optional[bool] = True,
...@@ -94,7 +95,7 @@ class HFLM(LM):
        assert isinstance(device, str)
        assert isinstance(pretrained, str)
        assert isinstance(batch_size, (int, str))

        gpus = torch.cuda.device_count()
        accelerator = Accelerator()
...@@ -244,8 +245,16 @@ class HFLM(LM):
        self._max_length = max_length

        self.batch_schedule = 1
        self.batch_sizes = {}
self.max_batch_size = max_batch_size
if str(batch_size).startswith("auto"):
batch_size = batch_size.split(":")
self.batch_size_per_gpu = batch_size[0]
self.batch_schedule = float(batch_size[1]) if len(batch_size) > 1 else 1
else:
self.batch_size_per_gpu = int(batch_size)
        # multigpu data-parallel support when launched with accelerate
        if gpus > 1:
...@@ -280,7 +289,9 @@ class HFLM(LM):
                    "Failed to place model onto specified device. This may be because the model is quantized via `bitsandbytes`. If the desired GPU is being used, this message is safe to ignore."
                )
            else:
                self._model = accelerator.prepare_model(
                    self.model, evaluation_mode=True
                )
                self._device = torch.device(f"cuda:{accelerator.local_process_index}")
                self.accelerator = accelerator
...@@ -342,6 +353,56 @@ class HFLM(LM):
    def world_size(self):
        return self._world_size
def _detect_batch_size(self, requests=None, pos=0):
if requests:
_, context_enc, continuation_enc = requests[pos]
max_length = len(
(context_enc + continuation_enc)[-(self.max_length + 1) :][:-1]
)
max_context_enc = len(context_enc[-(self.max_length + 1) :])
max_cont_enc = len(continuation_enc[-(self.max_length + 1) :])
else:
max_length = self.max_length
# if OOM, then halves batch_size and tries again
@find_executable_batch_size(starting_batch_size=self.max_batch_size)
def forward_batch(batch_size):
if self.AUTO_MODEL_CLASS == transformers.AutoModelForSeq2SeqLM:
length = max(max_context_enc, max_cont_enc)
batched_conts = torch.ones(
(batch_size, length), device=self.device
).long()
test_batch = torch.ones((batch_size, length), device=self.device).long()
call_kwargs = {
"attn_mask": test_batch,
"labels": batched_conts,
}
else:
call_kwargs = {}
test_batch = torch.ones(
(batch_size, max_length), device=self.device
).long()
for _ in range(5):
out = F.log_softmax(self._model_call(test_batch, **call_kwargs), dim=-1)
out = out # Identity process so that it passes pre-commit
return batch_size
batch_size = forward_batch()
if self.world_size > 1:
# if multi-GPU, always take minimum over all selected batch sizes
max_rnk_bs = torch.tensor([batch_size], device=self.device)
gathered = (
self.accelerator.gather(max_rnk_bs).cpu().detach().numpy().tolist()
)
batch_size = min(gathered)
utils.clear_torch_cache()
return batch_size
utils.clear_torch_cache()
return batch_size
    def tok_encode(self, string: str, left_truncate_len=None):
        """ """
        if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM:
...@@ -480,6 +541,15 @@ class HFLM(LM):
    def loglikelihood_rolling(self, requests):
        loglikelihoods = []
adaptive_batch_size = None
if self.batch_size == "auto":
# using rolling window with maximum context
print("Passed argument batch_size = auto. Detecting largest batch size")
batch_size = self._detect_batch_size()
print(f"Determined Largest batch size: {batch_size}")
adaptive_batch_size = batch_size
        for (string,) in tqdm([req.args for req in requests], disable=(self.rank != 0)):
            rolling_token_windows = list(
                map(
...@@ -509,7 +579,9 @@ class HFLM(LM):
            rolling_token_windows += pad_amnt * [rolling_token_windows[0]]

            string_nll = self._loglikelihood_tokens(
                rolling_token_windows,
                disable_tqdm=True,
                override_bs=adaptive_batch_size,
            )

            if (self.world_size > 1) and (pad_amnt > 0):
...@@ -523,7 +595,7 @@ class HFLM(LM):
        return loglikelihoods

    def _loglikelihood_tokens(self, requests, disable_tqdm=False, override_bs=None):
        # TODO: implement some kind of efficient-request-middleware that lumps together requests with the same context
        res = []
...@@ -538,11 +610,43 @@ class HFLM(LM):
            toks = x[1] + x[2]
            return -len(toks), tuple(toks)

        re_ord = utils.Reorderer(requests, _collate)
n_reordered_requests = len(re_ord.get_reordered())
# automatic (variable) batch size detection for vectorization
# pull longest context sample from request
def _batch_scheduler(pos):
sched = pos // int(n_reordered_requests / self.batch_schedule)
if sched in self.batch_sizes:
return self.batch_sizes[sched]
if (len(self.batch_sizes) > 1) and (
self.batch_sizes[sched - 1] == self.max_batch_size
):
# if previous batch size is already maximal, skip recomputation
self.batch_sizes[sched] = self.max_batch_size
return self.batch_sizes[sched]
print(
f"Passed argument batch_size = auto:{self.batch_schedule}. Detecting largest batch size"
)
self.batch_sizes[sched] = self._detect_batch_size(
re_ord.get_reordered(), pos
)
print(f"Determined largest batch size: {self.batch_sizes[sched]}")
return self.batch_sizes[sched]
        for chunk in utils.chunks(
            tqdm(re_ord.get_reordered(), disable=(disable_tqdm or (self.rank != 0))),
            n=self.batch_size
if self.batch_size != "auto"
else override_bs
if override_bs is not None
else 0,
fn=_batch_scheduler
if self.batch_size == "auto"
and n_reordered_requests > 0
and not override_bs
else None,
        ):
            inps = []
            cont_toks_list = []
...
import os
import time

from typing import List, Tuple
import numpy as np
from tqdm import tqdm

from lm_eval import utils
from lm_eval.api.model import LM
from lm_eval.api.registry import register_model


def get_result(response: dict, ctxlen: int) -> Tuple[float, bool]:
    """Process results from OpenAI API response.

    :param response: dict
...@@ -43,7 +40,13 @@ def oa_completion(**kwargs):
    Retry with back-off until they respond
    """
    try:
import openai, tiktoken # noqa: E401
except ModuleNotFoundError:
raise Exception(
"attempted to use 'openai' LM type, but package `openai` or `tiktoken` are not installed. \
please install these via `pip install lm-eval[openai]` or `pip install -e .[openai]`",
)
    backoff_time = 3
    while True:
...@@ -61,7 +64,12 @@ def oa_completion(**kwargs):
class OpenaiCompletionsLM(LM):
    REQ_CHUNK_SIZE = 20
    def __init__(
self,
engine: str = "text-davinci-003",
truncate: bool = False,
batch_size: int = 1,
):
""" """
:param engine: str :param engine: str
...@@ -70,28 +78,25 @@ class OpenaiCompletionsLM(LM): ...@@ -70,28 +78,25 @@ class OpenaiCompletionsLM(LM):
Truncate input if too long (if False and input is too long, throw error) Truncate input if too long (if False and input is too long, throw error)
""" """
super().__init__() super().__init__()
try:
            import openai, tiktoken  # noqa: E401
except ModuleNotFoundError:
raise Exception(
"attempted to use 'openai' LM type, but package `openai` or `tiktoken` are not installed. \
please install these via `pip install lm-eval[openai]` or `pip install -e .[openai]`",
)
        self.engine = engine
        self.tokenizer = tiktoken.encoding_for_model(self.engine)
        self.vocab_size = self.tokenizer.n_vocab
self.vocab_size = self.tokenizer.vocab_size
# to make the annoying "Using pad_token, but it is not set yet." error go away
self.tokenizer.pad_token = "<|endoftext|>"
assert self.tokenizer.encode("hello\n\nhello") == [31373, 198, 198, 31373]
self.truncate = truncate self.truncate = truncate
self.end_of_text_token_id = self.tokenizer.convert_tokens_to_ids( self.end_of_text_token_id = self.tokenizer.eot_token
["<|endoftext|>"]
)[0]
# Read from environment variable OPENAI_API_SECRET_KEY # Read from environment variable OPENAI_API_SECRET_KEY
openai.api_key = os.environ["OPENAI_API_SECRET_KEY"] openai.api_key = os.environ["OPENAI_API_SECRET_KEY"]
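The refactored class swaps the GPT-2 `transformers` tokenizer for `tiktoken`. A quick sketch of the round trip the wrapper relies on, run separately (requires `pip install tiktoken`; the model name here is just an example):

```python
import tiktoken

# encoding_for_model maps an OpenAI model name to its tokenizer
enc = tiktoken.encoding_for_model("text-davinci-003")

tokens = enc.encode("hello\n\nhello")
print(tokens)              # token ids for the prompt
print(enc.decode(tokens))  # round-trips back to the original string
print(enc.eot_token)       # id of <|endoftext|>, exposed as eot_token_id
print(enc.n_vocab)         # vocabulary size, exposed as self.vocab_size
```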
    @property
    def eot_token_id(self):
        return self.end_of_text_token_id

    @property
    def max_length(self):
...@@ -112,19 +117,49 @@ class OpenaiCompletionsLM(LM):
        # Isn't used because we override _loglikelihood_tokens
        raise NotImplementedError()

    def tok_encode(self, string: str) -> List[int]:
        return self.tokenizer.encode(string)

    def tok_decode(self, tokens: List[int]) -> str:
        return self.tokenizer.decode(tokens)
    def _encode_pair(
        self, context: str, continuation: str
    ) -> Tuple[List[int], List[int]]:
        n_spaces = len(context) - len(context.rstrip())
        if n_spaces > 0:
            continuation = context[-n_spaces:] + continuation
            context = context[:-n_spaces]
        whole_enc = self.tok_encode(context + continuation)
        context_enc = self.tok_encode(context)
        context_enc_len = len(context_enc)
        continuation_enc = whole_enc[context_enc_len:]
        return context_enc, continuation_enc
    def loglikelihood(self, requests) -> List[Tuple[float, bool]]:
        new_reqs = []
        for context, continuation in [req.args for req in requests]:
            if context == "":
                # end of text as context
                context_enc, continuation_enc = [self.eot_token_id], self.tok_encode(
                    continuation
                )
            else:
                context_enc, continuation_enc = self._encode_pair(context, continuation)

            new_reqs.append(((context, continuation), context_enc, continuation_enc))

        return self._loglikelihood_tokens(new_reqs)
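`_encode_pair` shifts trailing whitespace from the context onto the continuation before tokenizing, because BPE tokenizers typically merge a leading space into the following word; deriving the continuation tokens by slicing the encoding of the full string (rather than encoding the two halves independently) keeps them consistent with how the model actually sees the text. A small demonstration of the effect with `tiktoken` (illustrative only; the model name and the exact token splits shown in the comments are typical behavior, not guarantees):

```python
import tiktoken

enc = tiktoken.encoding_for_model("text-davinci-003")

context, continuation = "The capital of France is ", "Paris"

# Encoding the halves independently usually leaves the trailing space as its
# own token and "Paris" without a leading space...
print(enc.encode(context), enc.encode(continuation))

# ...whereas the continuation derived from the full string, as _encode_pair
# does, is typically the single space-prefixed token " Paris".
whole = enc.encode(context + continuation)
ctx = enc.encode(context.rstrip())
print(whole[len(ctx):])
```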
    def _loglikelihood_tokens(
        self, requests, disable_tqdm=False
    ) -> List[Tuple[float, bool]]:
        res = []

        def _collate(x):
            # this doesn't efficiently handle last-token differences yet, but those are kinda annoying because
            # it's not guaranteed that the 100 or so logprobs we get to see actually contain all the continuations
            # we care about, and so we need some kind of backup for when it isn't
            toks = x[1] + x[2]
            return -len(toks), tuple(toks)
...@@ -166,13 +201,13 @@ class OpenaiCompletionsLM(LM):
                # partial caching
                if cache_key is not None:
                    self.cache_hook.add_partial("loglikelihood", cache_key, answer)

        return re_ord.get_original(res)
    def greedy_until(self, requests) -> List[str]:
        if not requests:
            return []
        res = []
        requests = [req.args for req in requests]

        def _collate(x):
            toks = self.tok_encode(x[0])
...@@ -203,12 +238,7 @@ class OpenaiCompletionsLM(LM):
                inp = context_enc[-(self.max_length - self.max_gen_toks) :]
                inps.append(inp)

            until = request_args.get("until", ["<|endoftext|>"])

            response = oa_completion(
                engine=self.engine,
...@@ -222,7 +252,7 @@ class OpenaiCompletionsLM(LM):
            for resp, (context, args_) in zip(response.choices, chunk):
                s = resp["text"]
                until_ = args_.get("until", ["<|endoftext|>"])
                for term in until_:
                    if len(term) > 0:
...@@ -234,7 +264,6 @@ class OpenaiCompletionsLM(LM):
                )
                res.append(s)

        return re_ord.get_original(res)

    def _model_call(self, inps):
...@@ -244,3 +273,34 @@ class OpenaiCompletionsLM(LM):
    def _model_generate(self, context, max_length, eos_token_id):
        # Isn't used because we override greedy_until
        raise NotImplementedError()
    def loglikelihood_rolling(self, requests) -> List[float]:
        loglikelihoods = []

        for (string,) in tqdm([req.args for req in requests]):
            rolling_token_windows = list(
                map(
                    utils.make_disjoint_window,
                    utils.get_rolling_token_windows(
                        token_list=self.tok_encode(string),
                        prefix_token=self.eot_token_id,
                        max_seq_len=self.max_length,
                        context_len=1,
                    ),
                )
            )

            # TODO: Right now, we pass single EOT token to the Encoder and the full context to the decoder, in seq2seq case
            rolling_token_windows = [(None,) + x for x in rolling_token_windows]

            string_nll = self._loglikelihood_tokens(
                rolling_token_windows,
                disable_tqdm=True,
            )

            # discard is_greedy
            string_nll = [x[0] for x in string_nll]

            string_nll = sum(string_nll)
            loglikelihoods.append(string_nll)

        return loglikelihoods
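For rolling perplexity, the full token sequence is sliced into (context, continuation) windows so that every token is scored exactly once while each request stays under `max_length`. A simplified illustration of that windowing idea (a sketch only, not the actual `utils.get_rolling_token_windows` implementation; `max_seq_len` and `context_len` mirror the arguments above):

```python
from typing import Iterator, List, Tuple

def rolling_windows(
    tokens: List[int], prefix_token: int, max_seq_len: int, context_len: int = 1
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (context, continuation) pairs that jointly cover `tokens` once.

    The first window is conditioned only on `prefix_token` (e.g. <|endoftext|>);
    later windows reuse the last `context_len` tokens of the previous window as
    context, so every window fits within the model's maximum sequence length.
    """
    pos = 0
    while pos < len(tokens):
        if pos == 0:
            context = [prefix_token]
            continuation = tokens[:max_seq_len]
        else:
            context = tokens[pos - context_len : pos]
            continuation = tokens[pos : pos + max_seq_len - context_len]
        pos += len(continuation)
        yield context, continuation

# toy example: 10 tokens, window of 4, 1 token of carried-over context
for ctx, cont in rolling_windows(list(range(10)), prefix_token=-1, max_seq_len=4):
    print(ctx, cont)  # each of the 10 tokens appears in exactly one continuation
```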
...@@ -5,39 +5,39 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for
- [ ] Glue (Lintang)
- [x] SuperGlue
- [ ] CoQA (Lintang)
- [ ] DROP (Lintang)
- [x] ~~Lambada~~
- [x] Lambada (Cloze variants)
- [x] ~~Lambada (Multilingual)~~
- [x] Wikitext
- [x] PiQA
- [x] PROST
- [ ] MCTACO (Lintang)
- [x] Pubmed QA
- [x] SciQ
- [ ] QASPER
- [x] QA4MRE
- [ ] TriviaQA (Lintang)
- [x] AI2 ARC
- [x] LogiQA
- [x] HellaSwag
- [x] SWAG
- [x] OpenBookQA
- [ ] SQuADv2 (Lintang)
- [x] RACE
- [x] HeadQA
- [x] MathQA
- [x] WebQs
- [ ] WSC273 (Lintang)
- [x] Winogrande
- [x] ANLI
- [x] Hendrycks Ethics (missing some tasks/metrics, see PR 660: <https://github.com/EleutherAI/lm-evaluation-harness/pull/660> for more info)
- [x] TruthfulQA (mc1) (Lintang)
- [ ] TruthfulQA (mc2) (Lintang)
- [ ] TruthfulQA (gen) (Lintang)
- [ ] MuTual
- [ ] Hendrycks Math (Hailey)
- [ ] Asdiv
- [ ] GSM8k
- [x] Arithmetic
...@@ -47,18 +47,18 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for
- [x] ~~Pile (perplexity)~~
- [ ] BLiMP (Lintang)
- [x] ToxiGen
- [ ] StoryCloze (Lintang)
- [ ] NaturalQs (Hailey)
- [x] CrowS-Pairs
- [x] XCopa
- [ ] BIG-Bench (Hailey)
- [ ] XStoryCloze (Lintang)
- [x] XWinograd
- [ ] PAWS-X (Lintang)
- [x] XNLI
- [ ] MGSM (Lintang)
- [ ] SCROLLS
- [x] Babi
# Novel Tasks
Tasks added in the revamped harness that were not previously available. Again, a strikethrough denotes checking performed *against the original task's implementation or published results introducing the task*.
...
...@@ -44,7 +44,7 @@ def check_prompt_config(config):
        prompt_list = prompts.load_prompt_list(
            use_prompt=config["use_prompt"],
            dataset_name=config["dataset_path"],
            subset_name=config["dataset_name"] if "dataset_name" in config else None,
        )
        for idx, prompt_variation in enumerate(prompt_list):
            all_configs.append(
...@@ -54,7 +54,9 @@ def check_prompt_config(config):
                    **{
                        "task": "_".join(
                            [
                                config["task"]
                                if "task" in config
                                else get_task_name_from_config(config),
                                prompt_variation,
                            ]
                        )
...@@ -98,58 +100,8 @@ def include_task_folder(task_dir):
        )
def include_benchmarks(task_dir, benchmark_dir="benchmarks"):
    for root, subdirs, file_list in os.walk(os.path.join(task_dir, benchmark_dir)):
        if (subdirs == [] or subdirs == ["__pycache__"]) and (len(file_list) > 0):
            for f in file_list:
                if f.endswith(".yaml"):
                    try:
                        benchmark_path = os.path.join(root, f)
                        with open(benchmark_path, "rb") as file:
                            yaml_config = yaml.full_load(file)

                        assert "group" in yaml_config
                        group = yaml_config["group"]
                        all_task_list = yaml_config["task"]
                        config_list = [
                            task for task in all_task_list if type(task) != str
                        ]
                        task_list = [
                            task for task in all_task_list if type(task) == str
                        ]

                        for task_config in config_list:
                            var_configs = check_prompt_config(
                                {
                                    **task_config,
                                    **{"group": group},
                                }
                            )
                            for config in var_configs:
                                register_configurable_task(config)

                        task_names = utils.pattern_match(task_list, ALL_TASKS)
                        for task in task_names:
                            if task in TASK_REGISTRY:
                                if group in GROUP_REGISTRY:
                                    GROUP_REGISTRY[group].append(task)
                                else:
                                    GROUP_REGISTRY[group] = [task]
                                ALL_TASKS.add(group)
                    except Exception as error:
                        eval_logger.warning(
                            "Failed to load benchmark in\n"
                            f" {benchmark_path}\n"
                            " Benchmark will not be added to registry\n"
                            f" Error: {error}"
                        )
task_dir = os.path.dirname(os.path.abspath(__file__)) + "/"
include_task_folder(task_dir)
include_benchmarks(task_dir)
def get_task(task_name, config):
...
group:
  - greedy_until
task: babi
dataset_path: Muennighoff/babi
dataset_name: null
output_type: greedy_until
training_split: train
validation_split: valid
test_split: test
doc_to_text: "Passage: {{passage}}Question: {{question}}\nAnswer:"
doc_to_target: " {{answer}}"
target_delimiter: ""
generation_kwargs:
  until:
    - "\n"
    - "Passage:"
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
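The YAML above is a declarative task config: `doc_to_text` and `doc_to_target` are Jinja-style templates filled from each dataset row, and `generation_kwargs.until` lists the stop sequences for greedy generation. A small sketch of loading such a file and rendering its prompt for one toy document, using `pyyaml` and `jinja2` directly (an illustration only, not the harness's own loader):

```python
import yaml
from jinja2 import Template

raw = """
task: babi
doc_to_text: "Passage: {{passage}}Question: {{question}}\\nAnswer:"
doc_to_target: " {{answer}}"
generation_kwargs:
  until:
    - "\\n"
    - "Passage:"
"""

config = yaml.safe_load(raw)

doc = {
    "passage": "Mary went to the kitchen. ",
    "question": "Where is Mary?",
    "answer": "kitchen",
}

prompt = Template(config["doc_to_text"]).render(**doc)
target = Template(config["doc_to_target"]).render(**doc)
print(repr(prompt + target))                  # prompt plus gold target (target_delimiter is "" for babi)
print(config["generation_kwargs"]["until"])   # stop sequences used during generation
```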
group: t0_eval
task:
# # Coreference Resolution
# - dataset_path: super_glue
# dataset_name: wsc.fixed
# use_prompt: promptsource:*
# training_split: train
# validation_split: validation
# metric_list:
# - metric: exact_match
# aggregation: mean
# higher_is_better: true
# ignore_case: true
# ignore_punctuation: true
# # Coreference Resolution
# - dataset_path: winogrande
# dataset_name: winogrande_xl
# use_prompt: promptsource:*
# training_split: train
# validation_split: validation
# metric_list:
# - metric: exact_match
# aggregation: mean
# higher_is_better: true
# ignore_case: true
# ignore_punctuation: true
# Natural Language Inference
  - dataset_path: super_glue
    dataset_name: cb
    use_prompt: promptsource:*
    training_split: train
    validation_split: validation
    output_type: greedy_until
    metric_list:
      - metric: exact_match
        aggregation: mean
        higher_is_better: true
        ignore_case: true
        ignore_punctuation: true
# Natural Language Inference
# - dataset_path: super_glue
# dataset_name: rte
# use_prompt: promptsource:*
# training_split: train
# validation_split: validation
# metric_list:
# - metric: exact_match
# aggregation: mean
# higher_is_better: true
# ignore_case: true
# ignore_punctuation: true
# # Natural Language Inference
# # - dataset_path: anli
# # use_prompt: promptsource:*
# # training_split: train_r1
# # validation_split: dev_r1
# # Sentence Completion
# - dataset_path: super_glue
# dataset_name: copa
# use_prompt: promptsource:*
# training_split: train
# validation_split: validation
# metric_list:
# - metric: exact_match
# aggregation: mean
# higher_is_better: true
# ignore_case: true
# ignore_punctuation: true
# # Natural Language Inference
# - dataset_path: hellaswag
# use_prompt: promptsource:*
# training_split: train
# validation_split: validation
# metric_list:
# - metric: exact_match
# aggregation: mean
# higher_is_better: true
# ignore_case: true
# ignore_punctuation: true
# # Word Sense Disambiguation
# - dataset_path: super_glue
# dataset_name: wic
# use_prompt: promptsource:*
# training_split: train
# validation_split: validation
# metric_list:
# - metric: exact_match
# aggregation: mean
# higher_is_better: true
# ignore_case: true
# ignore_punctuation: true
# BLiMP
### Paper
Title: `BLiMP: A Benchmark of Linguistic Minimal Pairs for English`
Abstract: `https://arxiv.org/abs/1912.00582`
BLiMP is a challenge set for evaluating what language models (LMs) know about
major grammatical phenomena in English. BLiMP consists of 67 sub-datasets, each
containing 1000 minimal pairs isolating specific contrasts in syntax, morphology,
or semantics. The data is automatically generated according to expert-crafted
grammars.
Homepage: https://github.com/alexwarstadt/blimp
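BLiMP scores a model by checking whether it assigns higher probability to the acceptable sentence of each minimal pair. A self-contained sketch of that comparison using Hugging Face `transformers` with GPT-2 and a toy agreement pair (not taken from BLiMP); this only illustrates the metric, it is not how the harness computes its loglikelihood requests:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sentence_logprob(text: str) -> float:
    """Total log-probability of the sentence under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean NLL per predicted token; scale back to a sum
    return -out.loss.item() * (ids.shape[1] - 1)

good = "The keys to the cabinet are on the table."
bad = "The keys to the cabinet is on the table."
print("model prefers the acceptable sentence:",
      sentence_logprob(good) > sentence_logprob(bad))
```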
### Citation
```
@article{warstadt2019blimp,
author = {Warstadt, Alex and Parrish, Alicia and Liu, Haokun and Mohananey, Anhad and Peng, Wei and Wang, Sheng-Fu and Bowman, Samuel R.},
title = {BLiMP: The Benchmark of Linguistic Minimal Pairs for English},
journal = {Transactions of the Association for Computational Linguistics},
volume = {8},
number = {},
pages = {377-392},
year = {2020},
doi = {10.1162/tacl\_a\_00321},
URL = {https://doi.org/10.1162/tacl_a_00321},
eprint = {https://doi.org/10.1162/tacl_a_00321},
abstract = { We introduce The Benchmark of Linguistic Minimal Pairs (BLiMP),1 a challenge set for evaluating the linguistic knowledge of language models (LMs) on major grammatical phenomena in English. BLiMP consists of 67 individual datasets, each containing 1,000 minimal pairs—that is, pairs of minimally different sentences that contrast in grammatical acceptability and isolate specific phenomenon in syntax, morphology, or semantics. We generate the data according to linguist-crafted grammar templates, and human aggregate agreement with the labels is 96.4\%. We evaluate n-gram, LSTM, and Transformer (GPT-2 and Transformer-XL) LMs by observing whether they assign a higher probability to the acceptable sentence in each minimal pair. We find that state-of-the-art models identify morphological contrasts related to agreement reliably, but they struggle with some subtle semantic and syntactic phenomena, such as negative polarity items and extraction islands. }
}
```
### Subtasks

One task is defined per BLiMP sub-dataset, named `blimp_<sub_dataset>`, e.g.:

* `blimp_adjunct_island`: minimal pairs targeting extraction out of adjunct islands
* `blimp_anaphor_gender_agreement`: minimal pairs targeting gender agreement between anaphors and their antecedents

(one such task exists for each of the 67 BLiMP sub-datasets)
### Checklist
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
# Generated by utils.py
dataset_name: adjunct_island
include: template_yaml
task: blimp_adjunct_island
# Generated by utils.py
dataset_name: anaphor_gender_agreement
include: template_yaml
task: blimp_anaphor_gender_agreement