Commit f71d56eb authored by lintangsutawika

Merge branch 'big-refactor' of https://github.com/EleutherAI/lm-evaluation-harness into superglue

parents 33f2f9bf 2f870265
...@@ -50,6 +50,7 @@ jobs: ...@@ -50,6 +50,7 @@ jobs:
uses: actions/setup-python@v4 uses: actions/setup-python@v4
with: with:
python-version: 3.9 python-version: 3.9
cache: 'pip'
- name: Install dependencies - name: Install dependencies
if: steps.changed-tasks.outputs.tasks_any_modified == 'true' || steps.changed-tasks.outputs.api_any_modified == 'true' if: steps.changed-tasks.outputs.tasks_any_modified == 'true' || steps.changed-tasks.outputs.api_any_modified == 'true'
run: | run: |
......
name: Pull Request
on: [pull_request]
jobs:
pre-commit:
runs-on: ubuntu-20.04
steps:
- uses: actions/checkout@v3
- uses: actions/setup-python@v4
with:
python-version: 3.9
- uses: pre-commit/action@v2.0.3
...@@ -6,10 +6,10 @@ name: Unit Tests ...@@ -6,10 +6,10 @@ name: Unit Tests
on: on:
push: push:
branches: branches:
- big-refactor - 'big-refactor*'
pull_request: pull_request:
branches: branches:
- big-refactor - 'big-refactor*'
workflow_dispatch: workflow_dispatch:
# Jobs run concurrently and steps run sequentially within a job. # Jobs run concurrently and steps run sequentially within a job.
# jobs: linter and cpu_tests. Add more jobs/steps as required. # jobs: linter and cpu_tests. Add more jobs/steps as required.
...@@ -26,8 +26,11 @@ jobs: ...@@ -26,8 +26,11 @@ jobs:
uses: actions/setup-python@v4 uses: actions/setup-python@v4
with: with:
python-version: 3.9 python-version: 3.9
cache: 'pip'
- name: Install dependencies - name: Install dependencies
run: pip install -e '.[linting,testing]' --extra-index-url https://download.pytorch.org/whl/cpu run: pip install -e '.[linting,testing]' --extra-index-url https://download.pytorch.org/whl/cpu
- name: Pre-Commit
uses: pre-commit/action@v3.0.0
- name: Lint with pylint - name: Lint with pylint
run: python -m pylint --disable=all -e W0311 --jobs=0 --indent-string=' ' **/*.py run: python -m pylint --disable=all -e W0311 --jobs=0 --indent-string=' ' **/*.py
- name: Lint with flake8 - name: Lint with flake8
...@@ -52,6 +55,7 @@ jobs: ...@@ -52,6 +55,7 @@ jobs:
uses: actions/setup-python@v4 uses: actions/setup-python@v4
with: with:
python-version: 3.9 python-version: 3.9
cache: 'pip'
- name: Install dependencies - name: Install dependencies
run: | run: |
python -m pip install --upgrade pip python -m pip install --upgrade pip
...@@ -60,4 +64,4 @@ jobs: ...@@ -60,4 +64,4 @@ jobs:
# pip install bleurt@https://github.com/google-research/bleurt/archive/b610120347ef22b494b6d69b4316e303f5932516.zip#egg=bleurt # pip install bleurt@https://github.com/google-research/bleurt/archive/b610120347ef22b494b6d69b4316e303f5932516.zip#egg=bleurt
# if [ -f requirements.txt ]; then pip install -r requirements.txt; fi # if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
- name: Test with pytest - name: Test with pytest
run: python -m pytest -s -v -n=auto --ignore=tests/tests_master --ignore=tests/extra run: python -m pytest --showlocals -s -vv -n=auto --ignore=tests/tests_master --ignore=tests/extra
...@@ -33,7 +33,6 @@ To install the `lm-eval` refactor branch from the github repository, run: ...@@ -33,7 +33,6 @@ To install the `lm-eval` refactor branch from the github repository, run:
```bash ```bash
git clone https://github.com/EleutherAI/lm-evaluation-harness git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness cd lm-evaluation-harness
git checkout big-refactor
pip install -e . pip install -e .
``` ```
...@@ -49,6 +48,13 @@ To support loading GPTQ quantized models, install the package with the `gptq` ex ...@@ -49,6 +48,13 @@ To support loading GPTQ quantized models, install the package with the `gptq` ex
pip install -e ".[gptq]" pip install -e ".[gptq]"
``` ```
To install the package with all extras, run: To install the package with all extras, run:
```bash
pip install -e ".[all]"
```
## Support ## Support
The best way to get support is to open an issue on this repo or join the [EleutherAI Discord server](https://discord.gg/eleutherai). The `#lm-thunderdome` channel is dedicated to developing this project and the `#release-discussion` channel is for receiving support for our releases. The best way to get support is to open an issue on this repo or join the [EleutherAI Discord server](https://discord.gg/eleutherai). The `#lm-thunderdome` channel is dedicated to developing this project and the `#release-discussion` channel is for receiving support for our releases.
...@@ -93,6 +99,8 @@ python main.py \ ...@@ -93,6 +99,8 @@ python main.py \
--batch_size auto:4 --batch_size auto:4
``` ```
Alternatively, you can use the `lm-eval` console command instead of `python main.py` to run the evaluation harness from anywhere.
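For example, a minimal sketch of an equivalent console-script invocation (the model, model arguments, and tasks shown here are illustrative placeholders rather than values taken from this README):
```bash
# Assumes the package was installed with `pip install -e .`, which exposes the
# `lm-eval` entry point; flags mirror those accepted by `python main.py`.
lm-eval \
    --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks lambada_openai,hellaswag \
    --batch_size auto:4
```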
### Multi-GPU Evaluation with Hugging Face `accelerate` ### Multi-GPU Evaluation with Hugging Face `accelerate`
To parallelize evaluation of HuggingFace models across multiple GPUs, we allow for two different types of multi-GPU evaluation. To parallelize evaluation of HuggingFace models across multiple GPUs, we allow for two different types of multi-GPU evaluation.
...@@ -128,30 +136,43 @@ Using this setting helps for massive models like BLOOM which require, or to avoi ...@@ -128,30 +136,43 @@ Using this setting helps for massive models like BLOOM which require, or to avoi
**Note that this option requires launching evaluation via `python main.py` rather than `accelerate launch main.py`.** **Note that this option requires launching evaluation via `python main.py` rather than `accelerate launch main.py`.**
To use `accelerate` with the `lm-eval` command, run:
```bash
accelerate launch --no_python lm-eval --model ...
```
### Commercial APIs ### Commercial APIs
Our library also supports language models served via the OpenAI API: Our library also supports the evaluation of models served via several commercial APIs, and we hope to implement support for commonly used, performant local/self-hosted inference servers.
A full accounting of the supported and planned libraries + APIs can be seen below:
| API or Inference Server | Implemented? | `--model <xxx>` name | Models supported: | Request Types: |
|-----------------------------|---------------------------------|----------------------------------------------------------------------------------|--------------------------------------|----------------------------------------------------------|
| OpenAI Completions | :heavy_check_mark: | `openai`, `openai-completions`, `gooseai` | up to `code-davinci-002` | `greedy_until`, `loglikelihood`, `loglikelihood_rolling` |
| OpenAI ChatCompletions | :x: Not yet - needs help! | N/A | (link here?) | `greedy_until` (no logprobs) |
| Anthropic | :heavy_check_mark: | `anthropic` | [Supported Anthropic Engines](https://docs.anthropic.com/claude/reference/selecting-a-model) | `greedy_until` (no logprobs) |
| GooseAI | :heavy_check_mark: (not separately maintained) | `openai`, `openai-completions`, `gooseai` (same interface as OpenAI Completions) | | `greedy_until`, `loglikelihood`, `loglikelihood_rolling` |
| Textsynth | Needs testing | `textsynth` | ??? | `greedy_until`, `loglikelihood`, `loglikelihood_rolling` |
| Cohere | :hourglass: - blocked on Cohere API bug | N/A | [All `cohere.generate()` engines](https://docs.cohere.com/docs/models) | `greedy_until`, `loglikelihood`, `loglikelihood_rolling` |
| GGML | :hourglass: [PR](https://github.com/EleutherAI/lm-evaluation-harness/pull/617) | N/A | ??? | `greedy_until`, `loglikelihood`, `loglikelihood_rolling` |
| vLLM | :x: Not yet - needs help! | N/A | All HF models | `greedy_until` (no logprobs) |
| Your inference server here! | ... | ... | ... | ... |
It is on our roadmap to create task variants designed to enable models that do not serve logprobs/loglikelihoods to be compared against the generative performance of open-source models.
Our library supports language models served via the OpenAI Completions API as follows:
```bash ```bash
export OPENAI_API_SECRET_KEY=YOUR_KEY_HERE export OPENAI_API_SECRET_KEY=YOUR_KEY_HERE
python main.py \ python main.py \
--model openai \ --model openai-completions \
--model_args engine=davinci \ --model_args engine=davinci \
--tasks lambada_openai,hellaswag --tasks lambada_openai,hellaswag
``` ```
While this functionality is only officially maintained for the official OpenAI API, it tends to also work for other hosting services that use the same API such as [goose.ai](goose.ai) with minor modification. We also have an implementation for the [TextSynth](https://textsynth.com/index.html) API, using `--model textsynth`. While this functionality is only officially maintained for the official OpenAI API, it tends to also work for other hosting services that use the same API such as [goose.ai](goose.ai) with minor modification. We also have an implementation for the [TextSynth](https://textsynth.com/index.html) API, using `--model textsynth`.
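As a rough sketch, a TextSynth run could look like the following; the `engine` model argument and the `TEXTSYNTH_API_SECRET_KEY` environment variable are assumptions based on the upstream TextSynth implementation rather than values documented here:
```bash
# Hypothetical TextSynth invocation; the engine name and env var are assumptions.
export TEXTSYNTH_API_SECRET_KEY=YOUR_KEY_HERE
python main.py \
    --model textsynth \
    --model_args engine=gptj_6B \
    --tasks lambada_openai
```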
To verify the data integrity of the tasks you're performing in addition to running the tasks themselves, you can use the `--check_integrity` flag:
```bash
python main.py \
--model openai \
--model_args engine=davinci \
--tasks lambada_openai,hellaswag \
--check_integrity
```
### Other Frameworks ### Other Frameworks
A number of other libraries contain scripts for calling the eval harness through their library. These include [GPT-NeoX](https://github.com/EleutherAI/gpt-neox/blob/main/eval_tasks/eval_adapter.py), [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples/MoE/readme_evalharness.md), and [mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/blob/master/eval_harness.py). A number of other libraries contain scripts for calling the eval harness through their library. These include [GPT-NeoX](https://github.com/EleutherAI/gpt-neox/blob/main/eval_tasks/eval_adapter.py), [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples/MoE/readme_evalharness.md), and [mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/blob/master/eval_harness.py).
...@@ -172,6 +193,16 @@ python write_out.py \ ...@@ -172,6 +193,16 @@ python write_out.py \
This will write out one text file for each task. This will write out one text file for each task.
To verify the data integrity of the tasks you're performing in addition to running the tasks themselves, you can use the `--check_integrity` flag:
```bash
python main.py \
--model openai \
--model_args engine=davinci \
--tasks lambada_openai,hellaswag \
--check_integrity
```
## Advanced Usage ## Advanced Usage
For models loaded with the HuggingFace `transformers` library, any arguments provided via `--model_args` get passed to the relevant constructor directly. This means that anything you can do with `AutoModel` can be done with our library. For example, you can pass a local path via `pretrained=` or use models finetuned with [PEFT](https://github.com/huggingface/peft) by taking the call you would run to evaluate the base model and add `,peft=PATH` to the `model_args` argument: For models loaded with the HuggingFace `transformers` library, any arguments provided via `--model_args` get passed to the relevant constructor directly. This means that anything you can do with `AutoModel` can be done with our library. For example, you can pass a local path via `pretrained=` or use models finetuned with [PEFT](https://github.com/huggingface/peft) by taking the call you would run to evaluate the base model and add `,peft=PATH` to the `model_args` argument:
...@@ -201,6 +232,14 @@ To implement a new task in the eval harness, see [this guide](./docs/new_task_gu ...@@ -201,6 +232,14 @@ To implement a new task in the eval harness, see [this guide](./docs/new_task_gu
As a start, we currently only support one prompt per task, which we strive to make the "standard" as defined by the benchmark's authors. If you would like to study how varying prompts causes changes in the evaluation score, we support prompts authored in the [Promptsource Library](https://github.com/bigscience-workshop/promptsource/tree/main) as described further in https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/docs/new_task_guide.md and https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/docs/advanced_task_guide.md and welcome contributions of novel task templates and task variants. As a start, we currently only support one prompt per task, which we strive to make the "standard" as defined by the benchmark's authors. If you would like to study how varying prompts causes changes in the evaluation score, we support prompts authored in the [Promptsource Library](https://github.com/bigscience-workshop/promptsource/tree/main) as described further in https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/docs/new_task_guide.md and https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/docs/advanced_task_guide.md and welcome contributions of novel task templates and task variants.
## How to Contribute or Learn More?
For more information on the library and how everything fits together, check out all of our [documentation pages](https://github.com/EleutherAI/lm-evaluation-harness/tree/big-refactor/docs)! We plan to post a larger roadmap of desired + planned library improvements soon, with more information on how contributors can help.
You can also ask for help or discuss new features with the maintainers in the `#lm-thunderdome` channel of the EleutherAI Discord! If you've used the library and have had a positive (or negative) experience, we'd love to hear from you!
## Cite as ## Cite as
``` ```
......
...@@ -236,3 +236,89 @@ Generative tasks: ...@@ -236,3 +236,89 @@ Generative tasks:
Tasks using complex filtering: Tasks using complex filtering:
- GSM8k with CoT (+ with Self-Consistency): (`lm_eval/tasks/gsm8k/gsm8k-cot.yaml` ; `lm_eval/tasks/gsm8k/gsm8k-cot-self-consistency.yaml`) - GSM8k with CoT (+ with Self-Consistency): (`lm_eval/tasks/gsm8k/gsm8k-cot.yaml` ; `lm_eval/tasks/gsm8k/gsm8k-cot-self-consistency.yaml`)
## Benchmarks
When evaluating a language model, it is not unusual to test across a number of tasks that may not be related to one another in order to assess a variety of capabilities. It can be cumbersome to have to list the full set of tasks, or to add a new group name to the YAML file of each individual task, every time you want to do this.
To solve this, we can create a benchmark YAML config. This is a config that contains the names of the tasks that should be included in a particular benchmark. The config consists of two main keys: `group`, which denotes the name of the benchmark, and `task`, which lists the tasks to include. The tasks listed under `task` must be task names that have already been registered. A good example is the list of tasks used to evaluate the Pythia suite.
```yaml
group: pythia
task:
- lambada_openai
- wikitext
- piqa
- sciq
- wsc
- winogrande
- arc
- logiqa
- blimp
- hendrycksTest*
```
Alternatively, a benchmark can define its tasks inline and customize each one individually. Each entry is written the same way a standalone task YAML would be.
```yaml
group: t0_eval
task:
# Coreference Resolution
- dataset_path: super_glue
dataset_name: wsc.fixed
use_prompt: promptsource:*
training_split: train
validation_split: validation
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
# Coreference Resolution
- dataset_path: winogrande
dataset_name: winogrande_xl
use_prompt: promptsource:*
training_split: train
validation_split: validation
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
...
```
If the benchmark contains the same dataset but with different configurations, use the `task` key to differentiate between them. For example, T0-Eval evaluates three versions of ANLI, but the Hugging Face dataset collects all of them in a single dataset.
```yaml
group: t0_eval
task:
...
- task: anli_r1
dataset_path: anli
use_prompt: promptsource:*
training_split: train_r1
validation_split: dev_r1
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
- task: anli_r2
dataset_path: anli
use_prompt: promptsource:*
training_split: train_r2
validation_split: dev_r2
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
```
A benchmark is called the same way as any other task, by passing its group name to `--tasks`, as shown below. New benchmark configs can be added under `lm_eval/benchmarks/`.
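For instance, once the `pythia` benchmark config above is registered, it could be invoked like any other task group (a sketch; the model and checkpoint here are illustrative assumptions):
```bash
# Runs every task registered under the `pythia` benchmark group.
python main.py \
    --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks pythia
```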
...@@ -48,7 +48,9 @@ class Sampler: ...@@ -48,7 +48,9 @@ class Sampler:
) )
+ self.target_delimiter + self.target_delimiter
+ ( + (
self.doc_to_target(doc) self.doc_to_target(doc)[0]
if type(self.doc_to_target(doc)) is list
else self.doc_to_target(doc)
if ( if (
self.config.doc_to_choice is None self.config.doc_to_choice is None
or type(self.doc_to_target(doc)) is str or type(self.doc_to_target(doc)) is str
......
...@@ -465,8 +465,11 @@ class Task(abc.ABC): ...@@ -465,8 +465,11 @@ class Task(abc.ABC):
elif type(example) == list: elif type(example) == list:
return [labeled_examples + ex for ex in example] return [labeled_examples + ex for ex in example]
elif type(example) == int: elif type(example) == int:
choices = self.doc_to_choice(doc) if self._config.doc_to_choice is not None:
return labeled_examples + choices[example] choices = self.doc_to_choice(doc)
return labeled_examples + choices[example]
else:
return labeled_examples + str(example)
def apply_filters(self): def apply_filters(self):
...@@ -649,9 +652,36 @@ class ConfigurableTask(Task): ...@@ -649,9 +652,36 @@ class ConfigurableTask(Task):
if type(test_text) is int: if type(test_text) is int:
self.multiple_input = num_choice self.multiple_input = num_choice
else:
test_choice = None
if type(test_target) is list: if type(test_target) is list:
self.multiple_target = len(test_target) self.multiple_target = len(test_target)
else:
if (type(test_target) is int) and (test_choice is not None):
test_target = [self.doc_to_choice(test_target)[test_target]]
else:
test_target = [test_target]
if test_choice is not None:
check_choices = test_choice
else:
check_choices = test_target
for choice in check_choices:
choice_has_whitespace = " " in choice
delimiter_has_whitespace = " " in self._config.target_delimiter
if delimiter_has_whitespace and choice_has_whitespace:
eval_logger.warning(
f'Both target_delimiter and target choice: "{choice}" have whitespace'
)
elif (not delimiter_has_whitespace) and (not choice_has_whitespace):
eval_logger.warning(
f'Neither target_delimiter nor target choice: "{choice}" contains whitespace, ignore if the language you are evaluating on does not require/use whitespace'
)
def download(self, dataset_kwargs=None): def download(self, dataset_kwargs=None):
...@@ -761,12 +791,17 @@ class ConfigurableTask(Task): ...@@ -761,12 +791,17 @@ class ConfigurableTask(Task):
return doc_to_text(doc) return doc_to_text(doc)
# Used when applying a Promptsource template # Used when applying a Promptsource template
elif hasattr(doc_to_text, "apply"): elif hasattr(doc_to_text, "apply"):
return doc_to_text.apply(doc)[0] applied_prompt = doc_to_text.apply(doc)
if len(applied_prompt) == 2:
return applied_prompt[0]
else:
eval_logger.warning("Applied prompt returns empty string")
return self._config.fewshot_delimiter
else: else:
print(type(doc_to_text)) print(type(doc_to_text))
raise TypeError raise TypeError
def doc_to_target(self, doc: dict) -> Union[int, str]: def doc_to_target(self, doc: dict) -> Union[int, str, list]:
if self.prompt is not None: if self.prompt is not None:
doc_to_target = self.prompt doc_to_target = self.prompt
...@@ -785,13 +820,26 @@ class ConfigurableTask(Task): ...@@ -785,13 +820,26 @@ class ConfigurableTask(Task):
target_string = utils.apply_template(doc_to_target, doc) target_string = utils.apply_template(doc_to_target, doc)
if target_string.isdigit(): if target_string.isdigit():
return ast.literal_eval(target_string) return ast.literal_eval(target_string)
elif (
len(target_string) >= 2
and (target_string[0] == "[")
and (target_string[-1] == "]")
):
return ast.literal_eval(target_string)
else: else:
return target_string return target_string
elif type(doc_to_target) == list:
return doc_to_target
elif callable(doc_to_target): elif callable(doc_to_target):
return doc_to_target(doc) return doc_to_target(doc)
# Used when applying a Promptsource template # Used when applying a Promptsource template
elif hasattr(doc_to_target, "apply"): elif hasattr(doc_to_target, "apply"):
return doc_to_target.apply(doc)[1] applied_prompt = doc_to_target.apply(doc)
if len(applied_prompt) == 2:
return applied_prompt[1]
else:
eval_logger.warning("Applied prompt returns empty string")
return self._config.fewshot_delimiter
else: else:
raise TypeError raise TypeError
...@@ -988,9 +1036,13 @@ class ConfigurableTask(Task): ...@@ -988,9 +1036,13 @@ class ConfigurableTask(Task):
elif self.OUTPUT_TYPE == "greedy_until": elif self.OUTPUT_TYPE == "greedy_until":
gold = self.doc_to_target(doc) gold = self.doc_to_target(doc)
if type(gold) == int: if self._config.doc_to_choice is not None:
# If you set doc_to_choice,
# it assumes that doc_to_target returns a number.
choices = self.doc_to_choice(doc) choices = self.doc_to_choice(doc)
gold = choices[gold] gold = choices[gold]
else:
gold = str(gold)
for key, result in zip(self._metric_fn_list.keys(), results): for key, result in zip(self._metric_fn_list.keys(), results):
if self.multiple_target: if self.multiple_target:
...@@ -1009,20 +1061,20 @@ class ConfigurableTask(Task): ...@@ -1009,20 +1061,20 @@ class ConfigurableTask(Task):
res = res[key] res = res[key]
scores.append(res) scores.append(res)
if any(scores): if any(scores):
result = 1.0 result_score = 1.0
else: else:
result = 0.0 result_score = 0.0
else: else:
result = self._metric_fn_list[key]( result_score = self._metric_fn_list[key](
references=[gold], references=[gold],
predictions=[result], predictions=[result],
**self._metric_fn_kwargs[key], **self._metric_fn_kwargs[key],
) )
if isinstance(result, dict): if isinstance(result_score, dict):
result_dict.update(result) result_dict.update(result_score)
else: else:
result_dict[key] = result result_dict[key] = result_score
else: else:
raise ValueError( raise ValueError(
f"Passed invalid output_type '{self.OUTPUT_TYPE}' ! Please use one of ", f"Passed invalid output_type '{self.OUTPUT_TYPE}' ! Please use one of ",
......
import os
import yaml
from lm_eval import utils
from lm_eval.tasks import register_configurable_task, check_prompt_config
from lm_eval.logger import eval_logger
from lm_eval.api.registry import (
TASK_REGISTRY,
GROUP_REGISTRY,
ALL_TASKS,
)
def include_benchmarks(task_dir):
for root, subdirs, file_list in os.walk(task_dir):
if (subdirs == [] or subdirs == ["__pycache__"]) and (len(file_list) > 0):
for f in file_list:
if f.endswith(".yaml"):
try:
benchmark_path = os.path.join(root, f)
with open(benchmark_path, "rb") as file:
yaml_config = yaml.full_load(file)
assert "group" in yaml_config
group = yaml_config["group"]
all_task_list = yaml_config["task"]
config_list = [
task for task in all_task_list if type(task) != str
]
task_list = [
task for task in all_task_list if type(task) == str
]
for task_config in config_list:
var_configs = check_prompt_config(
{
**task_config,
**{"group": group},
}
)
for config in var_configs:
register_configurable_task(config)
task_names = utils.pattern_match(task_list, ALL_TASKS)
for task in task_names:
if task in TASK_REGISTRY:
if group in GROUP_REGISTRY:
GROUP_REGISTRY[group].append(task)
else:
GROUP_REGISTRY[group] = [task]
ALL_TASKS.add(group)
except Exception as error:
eval_logger.warning(
"Failed to load benchmark in\n"
f" {benchmark_path}\n"
" Benchmark will not be added to registry\n"
f" Error: {error}"
)
task_dir = os.path.dirname(os.path.abspath(__file__)) + "/"
include_benchmarks(task_dir)
...@@ -6,7 +6,7 @@ task: ...@@ -6,7 +6,7 @@ task:
- sciq - sciq
- wsc - wsc
- winogrande - winogrande
- arc_* - arc
# - logiqa - logiqa
# - blimp_* - blimp
# - hendrycksTest* - hendrycksTest*
group: t0_eval
task:
# Coreference Resolution
- dataset_path: super_glue
dataset_name: wsc.fixed
use_prompt: promptsource:*
training_split: train
validation_split: validation
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
# Coreference Resolution
- dataset_path: winogrande
dataset_name: winogrande_xl
use_prompt: promptsource:*
training_split: train
validation_split: validation
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
# Natural Language Inference
- dataset_path: super_glue
dataset_name: cb
use_prompt: promptsource:*
training_split: train
validation_split: validation
output_type: greedy_until
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
- dataset_path: super_glue
dataset_name: rte
use_prompt: promptsource:*
training_split: train
validation_split: validation
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
- task: anli_r1
dataset_path: anli
use_prompt: promptsource:*
training_split: train_r1
validation_split: dev_r1
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
- task: anli_r2
dataset_path: anli
use_prompt: promptsource:*
training_split: train_r2
validation_split: dev_r2
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
- task: anli_r3
dataset_path: anli
use_prompt: promptsource:*
training_split: train_r3
validation_split: dev_r3
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
# Sentence Completion
- dataset_path: super_glue
dataset_name: copa
use_prompt: promptsource:*
training_split: train
validation_split: validation
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
# Natural Language Inference
- dataset_path: hellaswag
use_prompt: promptsource:*
training_split: train
validation_split: validation
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
# Word Sense Disambiguation
- dataset_path: super_glue
dataset_name: wic
use_prompt: promptsource:*
training_split: train
validation_split: validation
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
...@@ -11,6 +11,7 @@ import numpy as np ...@@ -11,6 +11,7 @@ import numpy as np
import lm_eval.api import lm_eval.api
import lm_eval.tasks import lm_eval.tasks
import lm_eval.benchmarks
import lm_eval.models import lm_eval.models
import lm_eval.api.metrics import lm_eval.api.metrics
import lm_eval.api.registry import lm_eval.api.registry
......
...@@ -8,6 +8,7 @@ FILTER_REGISTRY = { ...@@ -8,6 +8,7 @@ FILTER_REGISTRY = {
"regex": extraction.RegexFilter, "regex": extraction.RegexFilter,
"majority_vote": selection.MajorityVoteFilter, "majority_vote": selection.MajorityVoteFilter,
"take_first_k": selection.TakeKFilter, "take_first_k": selection.TakeKFilter,
"remove_whitespace": extraction.WhitespaceFilter,
# TODO: implement this filter. either it should take in an arbitrary "scoring"/reward function # TODO: implement this filter. either it should take in an arbitrary "scoring"/reward function
# that takes an input and returns a scalar and then should select the max reward, # that takes an input and returns a scalar and then should select the max reward,
# or should implement different filters for different ways of handling a reward model's inference. # or should implement different filters for different ways of handling a reward model's inference.
......
...@@ -36,3 +36,26 @@ class RegexFilter(Filter): ...@@ -36,3 +36,26 @@ class RegexFilter(Filter):
# print(filtered_resps) # print(filtered_resps)
return filtered_resps return filtered_resps
class WhitespaceFilter(Filter):
""" """
def __init__(self):
pass
def apply(self, resps):
def filter_set(inst):
filtered_resp = []
for resp in inst:
if resp.startswith(" "):
resp = resp[1:]
filtered_resp.append(resp)
return filtered_resp
filtered_resps = [filter_set(resp) for resp in resps]
return filtered_resps
...@@ -292,7 +292,9 @@ class HFLM(LM): ...@@ -292,7 +292,9 @@ class HFLM(LM):
"Failed to place model onto specified device. This may be because the model is quantized via `bitsandbytes`. If the desired GPU is being used, this message is safe to ignore." "Failed to place model onto specified device. This may be because the model is quantized via `bitsandbytes`. If the desired GPU is being used, this message is safe to ignore."
) )
else: else:
self._model = accelerator.prepare(self.model) self._model = accelerator.prepare_model(
self.model, evaluation_mode=True
)
self._device = torch.device(f"cuda:{accelerator.local_process_index}") self._device = torch.device(f"cuda:{accelerator.local_process_index}")
self.accelerator = accelerator self.accelerator = accelerator
......
...@@ -3,7 +3,7 @@ This list keeps track of which tasks' implementations have been ported to YAML / ...@@ -3,7 +3,7 @@ This list keeps track of which tasks' implementations have been ported to YAML /
Boxes should be checked iff tasks are implemented in the refactor and tested for regression. Tasks should be struck through if checked *against original introducing paper* implementation or popularizing implementation. (WIP) Denotes that there exists a PR or person working on this task already. Boxes should be checked iff tasks are implemented in the refactor and tested for regression. Tasks should be struck through if checked *against original introducing paper* implementation or popularizing implementation. (WIP) Denotes that there exists a PR or person working on this task already.
- [ ] Glue (Lintang) - [x] Glue
- [x] SuperGlue - [x] SuperGlue
- [ ] CoQA (Lintang) - [ ] CoQA (Lintang)
- [ ] DROP (Lintang) - [ ] DROP (Lintang)
...@@ -13,12 +13,12 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for ...@@ -13,12 +13,12 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for
- [x] Wikitext - [x] Wikitext
- [x] PiQA - [x] PiQA
- [x] PROST - [x] PROST
- [ ] MCTACO (Lintang) - [x] MCTACO
- [x] Pubmed QA - [x] Pubmed QA
- [x] SciQ - [x] SciQ
- [ ] QASPER - [ ] QASPER
- [x] QA4MRE - [x] QA4MRE
- [ ] TriviaQA (Lintang) - [x] TriviaQA
- [x] AI2 ARC - [x] AI2 ARC
- [x] LogiQA - [x] LogiQA
- [x] HellaSwag - [x] HellaSwag
...@@ -33,9 +33,9 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for ...@@ -33,9 +33,9 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for
- [x] Winogrande - [x] Winogrande
- [x] ANLI - [x] ANLI
- [x] Hendrycks Ethics (missing some tasks/metrics, see PR 660: <https://github.com/EleutherAI/lm-evaluation-harness/pull/660> for more info) - [x] Hendrycks Ethics (missing some tasks/metrics, see PR 660: <https://github.com/EleutherAI/lm-evaluation-harness/pull/660> for more info)
- [x] TruthfulQA (mc1) (Lintang) - [x] TruthfulQA (mc1)
- [ ] TruthfulQA (mc2) (Lintang) - [x] TruthfulQA (mc2)
- [ ] TruthfulQA (gen) (Lintang) - [x] TruthfulQA (gen)
- [ ] MuTual - [ ] MuTual
- [ ] Hendrycks Math (Hailey) - [ ] Hendrycks Math (Hailey)
- [ ] Asdiv - [ ] Asdiv
...@@ -45,17 +45,17 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for ...@@ -45,17 +45,17 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for
- [ ] Translation (WMT) suite (Hailey) - [ ] Translation (WMT) suite (Hailey)
- [x] Unscramble - [x] Unscramble
- [x] ~~Pile (perplexity)~~ - [x] ~~Pile (perplexity)~~
- [ ] BLiMP (Lintang) - [x] BLiMP
- [x] ToxiGen - [x] ToxiGen
- [ ] StoryCloze (Lintang) - [x] StoryCloze
- [ ] NaturalQs (Hailey) - [ ] NaturalQs (Hailey)
- [x] CrowS-Pairs - [x] CrowS-Pairs
- [x] XCopa - [x] XCopa
- [ ] BIG-Bench (Hailey) - [ ] BIG-Bench (Hailey)
- [ ] XStoryCloze (Lintang) - [x] XStoryCloze
- [x] XWinograd - [x] XWinograd
- [ ] PAWS-X (Lintang) - [x] PAWS-X
- [ ] XNLI (Lintang) - [x] XNLI
- [ ] MGSM (Lintang) - [ ] MGSM (Lintang)
- [ ] SCROLLS - [ ] SCROLLS
- [x] Babi - [x] Babi
......
...@@ -44,7 +44,7 @@ def check_prompt_config(config): ...@@ -44,7 +44,7 @@ def check_prompt_config(config):
prompt_list = prompts.load_prompt_list( prompt_list = prompts.load_prompt_list(
use_prompt=config["use_prompt"], use_prompt=config["use_prompt"],
dataset_name=config["dataset_path"], dataset_name=config["dataset_path"],
subset_name=config["dataset_name"], subset_name=config["dataset_name"] if "dataset_name" in config else None,
) )
for idx, prompt_variation in enumerate(prompt_list): for idx, prompt_variation in enumerate(prompt_list):
all_configs.append( all_configs.append(
...@@ -54,7 +54,9 @@ def check_prompt_config(config): ...@@ -54,7 +54,9 @@ def check_prompt_config(config):
**{ **{
"task": "_".join( "task": "_".join(
[ [
get_task_name_from_config(config), config["task"]
if "task" in config
else get_task_name_from_config(config),
prompt_variation, prompt_variation,
] ]
) )
...@@ -98,58 +100,8 @@ def include_task_folder(task_dir): ...@@ -98,58 +100,8 @@ def include_task_folder(task_dir):
) )
def include_benchmarks(task_dir, benchmark_dir="benchmarks"):
for root, subdirs, file_list in os.walk(os.path.join(task_dir, benchmark_dir)):
if (subdirs == [] or subdirs == ["__pycache__"]) and (len(file_list) > 0):
for f in file_list:
if f.endswith(".yaml"):
try:
benchmark_path = os.path.join(root, f)
with open(benchmark_path, "rb") as file:
yaml_config = yaml.full_load(file)
assert "group" in yaml_config
group = yaml_config["group"]
all_task_list = yaml_config["task"]
config_list = [
task for task in all_task_list if type(task) != str
]
task_list = [
task for task in all_task_list if type(task) == str
]
for task_config in config_list:
var_configs = check_prompt_config(
{
**task_config,
**{"group": group},
}
)
for config in var_configs:
register_configurable_task(config)
task_names = utils.pattern_match(task_list, ALL_TASKS)
for task in task_names:
if task in TASK_REGISTRY:
if group in GROUP_REGISTRY:
GROUP_REGISTRY[group].append(task)
else:
GROUP_REGISTRY[group] = [task]
ALL_TASKS.add(group)
except Exception as error:
eval_logger.warning(
"Failed to load benchmark in\n"
f" {benchmark_path}\n"
" Benchmark will not be added to registry\n"
f" Error: {error}"
)
task_dir = os.path.dirname(os.path.abspath(__file__)) + "/" task_dir = os.path.dirname(os.path.abspath(__file__)) + "/"
include_task_folder(task_dir) include_task_folder(task_dir)
include_benchmarks(task_dir)
def get_task(task_name, config): def get_task(task_name, config):
......
# Task-name # ANLI
### Paper ### Paper
Title: `Adversarial NLI: A New Benchmark for Natural Language Understanding` Title: `Adversarial NLI: A New Benchmark for Natural Language Understanding`
Abstract: `https://arxiv.org/pdf/1910.14599.pdf` Paper Link: https://arxiv.org/abs/1910.14599
Adversarial NLI (ANLI) is a dataset collected via an iterative, adversarial Adversarial NLI (ANLI) is a dataset collected via an iterative, adversarial
human-and-model-in-the-loop procedure. It consists of three rounds that progressively human-and-model-in-the-loop procedure. It consists of three rounds that progressively
increase in difficulty and complexity, and each question-answer includes annotator- increase in difficulty and complexity, and each question-answer includes annotator-
provided explanations. provided explanations.
Homepage: `https://github.com/facebookresearch/anli` Homepage: https://github.com/facebookresearch/anli
### Citation ### Citation
...@@ -31,13 +30,18 @@ Homepage: `https://github.com/facebookresearch/anli` ...@@ -31,13 +30,18 @@ Homepage: `https://github.com/facebookresearch/anli`
} }
``` ```
### Subtasks ### Groups and Tasks
#### Groups
List or describe tasks defined in this folder, and their names here: * `anli`: Evaluates `anli_r1`, `anli_r2`, and `anli_r3`
#### Tasks
* `anli_r1`: The data collected adversarially in the first round. * `anli_r1`: The data collected adversarially in the first round.
* `anli_r2`: The data collected adversarially in the second round, after training on the previous round's data. * `anli_r2`: The data collected adversarially in the second round, after training on the previous round's data.
* `anli_r3`: The data collected adversarially in the third round, after training on the previous multiple rounds of data. * `anli_r3`: The data collected adversarially in the third round, after training on the previous multiple rounds of data.
### Checklist ### Checklist
For adding novel benchmarks/datasets to the library: For adding novel benchmarks/datasets to the library:
......
group: group:
- multiple_choice - anli
- natural_language_inference
- nli
- adverserial
task: anli_r1 task: anli_r1
dataset_path: anli dataset_path: anli
dataset_name: null dataset_name: null
......
group: include: anli_r1.yaml
- multiple_choice
- natural_language_inference
- nli
- adverserial
task: anli_r2 task: anli_r2
dataset_path: anli
dataset_name: null
output_type: multiple_choice
training_split: train_r2 training_split: train_r2
validation_split: dev_r2 validation_split: dev_r2
test_split: test_r2 test_split: test_r2
doc_to_text: "{{premise}}\nQuestion: {{hypothesis}} True, False, or Neither?\nAnswer:"
# True = entailment
# False = contradiction
# Neither = neutral
doc_to_target: "{{['True', 'Neither', 'False'][label]}}"
doc_to_choice:
- "True"
- "Neither"
- "False"
should_decontaminate: true
doc_to_decontamination_query: premise
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true
group: include: anli_r1.yaml
- multiple_choice
- natural_language_inference
- nli
- adverserial
task: anli_r3 task: anli_r3
dataset_path: anli
dataset_name: null
output_type: multiple_choice
training_split: train_r3 training_split: train_r3
validation_split: dev_r3 validation_split: dev_r3
test_split: test_r3 test_split: test_r3
doc_to_text: "{{premise}}\nQuestion: {{hypothesis}} True, False, or Neither?\nAnswer:"
# True = entailment
# False = contradiction
# Neither = neutral
doc_to_target: "{{['True', 'Neither', 'False'][label]}}"
doc_to_choice:
- "True"
- "Neither"
- "False"
should_decontaminate: true
doc_to_decontamination_query: premise
metric_list:
- metric: acc
aggregation: mean
higher_is_better: true