Commit 7604b873 authored by cardy20

conflict changed

parents 17b04444 e8f38aee
...@@ -3,3 +3,7 @@ env ...@@ -3,3 +3,7 @@ env
data/ data/
lm_cache lm_cache
.idea .idea
build/
logs/
output/
lm_eval.egg-info/
\ No newline at end of file
...@@ -32,7 +32,7 @@ repos: ...@@ -32,7 +32,7 @@ repos:
rev: 22.3.0 rev: 22.3.0
hooks: hooks:
- id: black - id: black
language_version: python3.8 language_version: python3.9
- repo: https://github.com/codespell-project/codespell - repo: https://github.com/codespell-project/codespell
rev: v2.1.0 rev: v2.1.0
hooks: hooks:
......
# Language Model Evaluation Harness # Language Model Evaluation Harness
![](https://github.com/EleutherAI/lm-evaluation-harness/workflows/Build/badge.svg)
[![codecov](https://codecov.io/gh/EleutherAI/lm-evaluation-harness/branch/master/graph/badge.svg?token=JSG3O2427J)](https://codecov.io/gh/EleutherAI/lm-evaluation-harness)
## Overview ## Overview
This project provides a unified framework to test generative language models on a large number of different evaluation tasks. This project provides a unified framework to test generative language models on a large number of different evaluation tasks.
...@@ -10,9 +7,11 @@ This project provides a unified framework to test generative language models on ...@@ -10,9 +7,11 @@ This project provides a unified framework to test generative language models on
Features: Features:
- 200+ tasks implemented. See the [task-table](./docs/task_table.md) for a complete list. - 200+ tasks implemented. See the [task-table](./docs/task_table.md) for a complete list.
- Support for the Hugging Face `transformers` library, GPT-NeoX, Megatron-DeepSpeed, and the OpenAI API, with flexible tokenization-agnostic interface. - Support for models loaded via [transformers](https://github.com/huggingface/transformers/), [GPT-NeoX](https://github.com/EleutherAI/gpt-neox), and [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/), with a flexible tokenization-agnostic interface.
- Support for commercial APIs including [OpenAI](https://openai.com), [goose.ai](https://goose.ai), and [TextSynth](https://textsynth.com/).
- Support for evaluation on adapters (e.g. LoRa) supported in [HuggingFace's PEFT library](https://github.com/huggingface/peft). - Support for evaluation on adapters (e.g. LoRa) supported in [HuggingFace's PEFT library](https://github.com/huggingface/peft).
- Task versioning to ensure reproducibility. - Evaluating with publicly available prompts ensures reproducibility and comparability between papers.
- Task versioning to ensure reproducibility when tasks are updated.
## Install ## Install
...@@ -34,14 +33,16 @@ pip install -e ".[multilingual]" ...@@ -34,14 +33,16 @@ pip install -e ".[multilingual]"
> **Note**: When reporting results from eval harness, please include the task versions (shown in `results["versions"]`) for reproducibility. This allows bug fixes to tasks while also ensuring that previously reported scores are reproducible. See the [Task Versioning](#task-versioning) section for more info. > **Note**: When reporting results from eval harness, please include the task versions (shown in `results["versions"]`) for reproducibility. This allows bug fixes to tasks while also ensuring that previously reported scores are reproducible. See the [Task Versioning](#task-versioning) section for more info.
To evaluate a model hosted on the [HuggingFace Hub](https://huggingface.co/models) (e.g. GPT-J-6B) on tasks with names matching the pattern `lambada_*` and `hellaswag` you can use the following command: ### Hugging Face `transformers`
To evaluate a model hosted on the [HuggingFace Hub](https://huggingface.co/models) (e.g. GPT-J-6B) on `hellaswag` you can use the following command:
```bash ```bash
python main.py \ python main.py \
--model hf-causal \ --model hf-causal \
--model_args pretrained=EleutherAI/gpt-j-6B \ --model_args pretrained=EleutherAI/gpt-j-6B \
--tasks lambada_*,hellaswag \ --tasks hellaswag \
--device cuda:0 --device cuda:0
``` ```
...@@ -59,16 +60,9 @@ To evaluate models that are loaded via `AutoSeq2SeqLM` in Huggingface, you inste ...@@ -59,16 +60,9 @@ To evaluate models that are loaded via `AutoSeq2SeqLM` in Huggingface, you inste
> **Warning**: Choosing the wrong model may result in erroneous outputs despite not erroring. > **Warning**: Choosing the wrong model may result in erroneous outputs despite not erroring.
To use with [PEFT](https://github.com/huggingface/peft), take the call you would run to evaluate the base model and add `,peft=PATH` to the `model_args` argument as shown below: ### Commercial APIs
```bash
python main.py \
--model hf-causal-experimental \
--model_args pretrained=EleutherAI/gpt-j-6b,peft=nomic-ai/gpt4all-j-lora \
--tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq \
--device cuda:0
```
Our library also supports the OpenAI API: Our library also supports language models served via the OpenAI API:
```bash ```bash
export OPENAI_API_SECRET_KEY=YOUR_KEY_HERE export OPENAI_API_SECRET_KEY=YOUR_KEY_HERE
...@@ -90,7 +84,9 @@ python main.py \ ...@@ -90,7 +84,9 @@ python main.py \
--check_integrity --check_integrity
``` ```
To evaluate mesh-transformer-jax models that are not available on HF, please invoke eval harness through [this script](https://github.com/kingoflolz/mesh-transformer-jax/blob/master/eval_harness.py). ### Other Frameworks
A number of other libraries contain scripts for calling the eval harness through their library. These include [GPT-NeoX](https://github.com/EleutherAI/gpt-neox/blob/main/eval_tasks/eval_adapter.py), [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples/MoE/readme_evalharness.md), and [mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/blob/master/eval_harness.py).
💡 **Tip**: You can inspect what the LM inputs look like by running the following command: 💡 **Tip**: You can inspect what the LM inputs look like by running the following command:
...@@ -104,6 +100,22 @@ python write_out.py \ ...@@ -104,6 +100,22 @@ python write_out.py \
This will write out one text file for each task. This will write out one text file for each task.
## Advanced Usage
For models loaded with the HuggingFace `transformers` library, any arguments provided via `--model_args` get passed to the relevant constructor directly. This means that anything you can do with `AutoModel` can be done with our library. For example, you can pass a local path via `pretrained=` or use models finetuned with [PEFT](https://github.com/huggingface/peft) by taking the call you would run to evaluate the base model and adding `,peft=PATH` to the `model_args` argument:
```bash
python main.py \
--model hf-causal-experimental \
--model_args pretrained=EleutherAI/gpt-j-6b,peft=nomic-ai/gpt4all-j-lora \
--tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq \
--device cuda:0
```
We support wildcards in task names: for example, you can run all of the machine-translated lambada tasks via `--tasks lambada_openai_mt_*`, as shown below.
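For instance, a sketch of a wildcard invocation, reusing the GPT-J model from the example above (the exact set of matched tasks depends on which tasks are registered in your installation):

```bash
python main.py \
    --model hf-causal \
    --model_args pretrained=EleutherAI/gpt-j-6B \
    --tasks "lambada_openai_mt_*" \
    --device cuda:0
```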
We currently only support one prompt per task, which we strive to make the "standard" as defined by the benchmark's authors. If you would like to study how varying prompts causes changes in the evaluation score, check out the [BigScience fork](https://github.com/bigscience-workshop/lm-evaluation-harness) of this repo. We are currently working on upstreaming this capability to `main`.
## Implementing new tasks ## Implementing new tasks
To implement a new task in the eval harness, see [this guide](./docs/task_guide.md). To implement a new task in the eval harness, see [this guide](./docs/task_guide.md).
......
...@@ -271,6 +271,19 @@ python main.py \ ...@@ -271,6 +271,19 @@ python main.py \
--num_fewshot K --num_fewshot K
``` ```
### Checking the Model Outputs
The `write_out.py` script mentioned previously can be used to verify that the prompts look as intended. If you also want to save model outputs, you can use the `--write_out` flag in `main.py` to dump JSON with prompts and completions. The output path can be chosen with `--output_base_path`. This is helpful for debugging and for exploring model outputs.
```sh
python main.py \
--model gpt2 \
--model_args device=<device-name> \
--tasks <task-name> \
--num_fewshot K \
--write_out \
--output_base_path <path>
```
### Running Unit Tests ### Running Unit Tests
To run the entire test suite, use: To run the entire test suite, use:
......
ROUGE ROUGE
rouge rouge
nin nin
maka
mor
te
...@@ -187,20 +187,25 @@ class BaseLM(LM): ...@@ -187,20 +187,25 @@ class BaseLM(LM):
def loglikelihood_rolling(self, requests): def loglikelihood_rolling(self, requests):
# TODO: Implement caching once we've confirmed the perplexity implementation # TODO: Implement caching once we've confirmed the perplexity implementation
# automatic batch size detection for vectorization # automatic batch size detection for vectorization
adaptive_batch_size = None adaptive_batch_size = None
if self.batch_size == 'auto': if self.batch_size == "auto":
# using rolling window with maximum context # using rolling window with maximum context
print('Passed argument batch_size = auto. Detecting largest batch size') print("Passed argument batch_size = auto. Detecting largest batch size")
@find_executable_batch_size(
    starting_batch_size=512
)  # if OOM, then halves batch_size and tries again
def forward_batch(batch_size):
    test_batch = torch.ones(
        (batch_size, self.max_length), device=self.device
    ).long()
    for _ in range(5):
        _ = F.log_softmax(self._model_call(test_batch), dim=-1).cpu()
    return batch_size
batch_size = forward_batch() batch_size = forward_batch()
print(f"Determined Largest batch size: {batch_size}") print(f"Determined Largest batch size: {batch_size}")
adaptive_batch_size = batch_size adaptive_batch_size = batch_size
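As an aside, a minimal, self-contained sketch of the batch-size probing pattern used here, assuming `find_executable_batch_size` comes from Hugging Face's `accelerate` (the dummy workload and tensor shape are placeholders):

```python
import torch
from accelerate.utils import find_executable_batch_size  # assumed import path


@find_executable_batch_size(starting_batch_size=512)
def probe(batch_size):
    # Stand-in for self._model_call on a dummy batch; on a real accelerator,
    # a CUDA OOM here makes the decorator halve batch_size and retry.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.ones((batch_size, 2048), device=device)
    _ = (x @ x.T).softmax(dim=-1)
    return batch_size


largest = probe()  # called with no arguments; batch_size is injected by the decorator
print(f"Determined largest batch size: {largest}")
```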
...@@ -223,7 +228,9 @@ class BaseLM(LM): ...@@ -223,7 +228,9 @@ class BaseLM(LM):
# TODO: extract out this call so it only gets called once and also somehow figure out partial caching for # TODO: extract out this call so it only gets called once and also somehow figure out partial caching for
# that # that
string_nll = self._loglikelihood_tokens(
    rolling_token_windows,
    disable_tqdm=True,
    override_bs=adaptive_batch_size,
)
# discard is_greedy # discard is_greedy
...@@ -234,7 +241,7 @@ class BaseLM(LM): ...@@ -234,7 +241,7 @@ class BaseLM(LM):
return loglikelihoods return loglikelihoods
def _loglikelihood_tokens(self, requests, disable_tqdm=False, override_bs = None): def _loglikelihood_tokens(self, requests, disable_tqdm=False, override_bs=None):
# TODO: implement some kind of efficient-request-middleware that lumps together requests with the same context # TODO: implement some kind of efficient-request-middleware that lumps together requests with the same context
res = [] res = []
...@@ -249,33 +256,36 @@ class BaseLM(LM): ...@@ -249,33 +256,36 @@ class BaseLM(LM):
toks = x[1] + x[2] toks = x[1] + x[2]
return -len(toks), tuple(toks) return -len(toks), tuple(toks)
re_ord = utils.Reorderer(requests, _collate) re_ord = utils.Reorderer(requests, _collate)
# automatic (variable) batch size detection for vectorization # automatic (variable) batch size detection for vectorization
# pull longest context sample from request # pull longest context sample from request
if len(re_ord.get_reordered()) > 0:
    _, context_enc, continuation_enc = re_ord.get_reordered()[0]
    max_context = len((context_enc + continuation_enc)[-(self.max_length + 1) :][:-1])
    if (self.batch_size == 'auto'):
        if override_bs is None:
            print('Passed argument batch_size = auto. Detecting largest batch size')
            @find_executable_batch_size(starting_batch_size=512)  # if OOM, then halves batch_size and tries again
            def forward_batch(batch_size):
                test_batch = torch.ones((batch_size, max_context), device=self.device).long()
                for _ in range(5):
                    out = F.log_softmax(self._model_call(test_batch), dim = -1).cpu()
                return batch_size
            batch_size = forward_batch()
            print(f"Determined largest batch size: {batch_size}")
            adaptive_batch_size = batch_size
        else:
            adaptive_batch_size = override_bs
else:
    adaptive_batch_size = 0 if override_bs is None else override_bs
for chunk in utils.chunks(
    tqdm(re_ord.get_reordered(), disable=disable_tqdm),
    self.batch_size if self.batch_size != "auto" else adaptive_batch_size,
):
inps = [] inps = []
cont_toks_list = [] cont_toks_list = []
...@@ -382,7 +392,7 @@ class BaseLM(LM): ...@@ -382,7 +392,7 @@ class BaseLM(LM):
re_ord = utils.Reorderer(requests, _collate) re_ord = utils.Reorderer(requests, _collate)
for context, request_args in tqdm(re_ord.get_reordered()): for context, request_args in tqdm(re_ord.get_reordered()):
until = request_args['until'] until = request_args["until"]
if isinstance(until, str): if isinstance(until, str):
until = [until] until = [until]
...@@ -396,7 +406,7 @@ class BaseLM(LM): ...@@ -396,7 +406,7 @@ class BaseLM(LM):
).to(self.device) ).to(self.device)
max_gen_tokens = min( max_gen_tokens = min(
self.max_gen_toks, request_args.get('max_length', self.max_gen_toks) self.max_gen_toks, request_args.get("max_length", self.max_gen_toks)
) )
cont = self._model_generate( cont = self._model_generate(
context_enc, context_enc.shape[1] + max_gen_tokens, primary_until context_enc, context_enc.shape[1] + max_gen_tokens, primary_until
......
...@@ -125,7 +125,7 @@ ...@@ -125,7 +125,7 @@
{ {
"input": "The deadline is Jun 1, 2021, which is 2 days away from now. What is the date tomorrow in MM/DD/YYYY?", "input": "The deadline is Jun 1, 2021, which is 2 days away from now. What is the date tomorrow in MM/DD/YYYY?",
"target_scores": { "target_scores": {
"05/30/2021": 0, "05/30/2021": 0,
"05/31/2021": 1, "05/31/2021": 1,
"07/07/2021": 0, "07/07/2021": 0,
"06/01/2021": 0, "06/01/2021": 0,
...@@ -3406,7 +3406,7 @@ ...@@ -3406,7 +3406,7 @@
"target_scores": { "target_scores": {
"10/22/1924": 1, "10/22/1924": 1,
"10/21/1924": 0, "10/21/1924": 0,
"10/07/1924": 0, "10/07/1924": 0,
"11/21/1924": 0, "11/21/1924": 0,
"08/22/1925": 0, "08/22/1925": 0,
"10/22/1944": 0 "10/22/1944": 0
...@@ -3495,7 +3495,7 @@ ...@@ -3495,7 +3495,7 @@
"01/24/1947": 0, "01/24/1947": 0,
"08/24/2011": 0, "08/24/2011": 0,
"01/24/1923": 0, "01/24/1923": 0,
"02/06/2012": 0, "02/06/2012": 0,
"01/22/2012": 0 "01/22/2012": 0
} }
}, },
...@@ -3504,7 +3504,7 @@ ...@@ -3504,7 +3504,7 @@
"target_scores": { "target_scores": {
"02/07/2012": 0, "02/07/2012": 0,
"02/06/2012": 1, "02/06/2012": 1,
"01/31/2012": 0, "01/31/2012": 0,
"02/07/2014": 0, "02/07/2014": 0,
"03/16/2012": 0, "03/16/2012": 0,
"02/01/2012": 0 "02/01/2012": 0
......
# coding=utf-8
# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Korean Offensive Language Dataset"""
import json
import datasets
_CITATION = """\
@InProceedings{jeong-etal-2022-kold,
title = "{KOLD}: {K}orean Offensive Language Dataset",
author = "Jeong, Younghoon and
Oh, Juhyun and
Lee, Jongwon and
Ahn, Jaimeen and
Moon, Jihyung and
Park, Sungjoon and
Oh, Alice",
booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.emnlp-main.744",
pages = "10818--10833",
abstract = "Recent directions for offensive language detection are hierarchical modeling, identifying the type and the target of offensive language, and interpretability with offensive span annotation and prediction. These improvements are focused on English and do not transfer well to other languages because of cultural and linguistic differences. In this paper, we present the Korean Offensive Language Dataset (KOLD) comprising 40,429 comments, which are annotated hierarchically with the type and the target of offensive language, accompanied by annotations of the corresponding text spans. We collect the comments from NAVER news and YouTube platform and provide the titles of the articles and videos as the context information for the annotation process. We use these annotated comments as training data for Korean BERT and RoBERTa models and find that they are effective at offensiveness detection, target classification, and target span detection while having room for improvement for target group classification and offensive span detection. We discover that the target group distribution differs drastically from the existing English datasets, and observe that providing the context information improves the model performance in offensiveness detection (+0.3), target classification (+1.5), and target group classification (+13.1). We publicly release the dataset and baseline models.",
}
"""
_DESCRIPTION = """\
They present the Korean Offensive Language Dataset (KOLD) comprising 40,429 comments, which are annotated hierarchically with the type and the target of offensive language, accompanied by annotations of the corresponding text spans.
They collect the comments from NAVER news and YouTube platform and provide the titles of the articles and videos as the context information for the annotation process.
"""
_HOMEPAGE = "https://github.com/boychaboy/KOLD"
_LICENSE = "CC0 1.0 Universal (CC0 1.0)"
_URLs = "https://raw.githubusercontent.com/Gun1Yun/KOLD/main/data/kold_v1.json"
# TODO: Name of the dataset usually match the script name with CamelCase instead of snake_case
class KOLD(datasets.GeneratorBasedBuilder):
"""Korean Offensive Language Dataset."""
VERSION = datasets.Version("1.1.0")
def _info(self):
return datasets.DatasetInfo(
description=_DESCRIPTION,
features=datasets.Features(
{
"id": datasets.Value("string"),
"title": datasets.Value("string"),
"comment": datasets.Value("string"),
"off": datasets.ClassLabel(names=["False", "True"]),
"tgt": datasets.ClassLabel(names=["None", 'group', 'individual', 'other', 'untargeted'])
# "GRP": datasets.ClassLabel(names=["None", "ohters"]),
}
),
supervised_keys=None,
homepage=_HOMEPAGE,
license=_LICENSE,
citation=_CITATION,
)
def _split_generators(self, dl_manager):
downloaded_files = dl_manager.download_and_extract(_URLs)
return [
datasets.SplitGenerator(
name=datasets.Split.TEST,
gen_kwargs={
"filepath": downloaded_files,
"split": "test",
},
),
]
def _generate_examples(self, filepath, split):
with open(filepath, "r") as f:
data = json.loads(f.read())
for id_, row in enumerate(data):
yield id_, {
"id": row["guid"],
"title": row["title"],
"comment": row["comment"],
"off": int(row["OFF"]),
"tgt": row["TGT"],
# "grp": row["GRP"]
}
\ No newline at end of file
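For reference, a hedged sketch of loading the script above with the `datasets` loader (the local path to the script file is an assumption):

```python
import datasets

# Assumed location of the dataset script above inside the repository.
kold = datasets.load_dataset("lm_eval/datasets/kold/kold.py", split="test")
print(kold[0]["comment"], kold[0]["off"], kold[0]["tgt"])
```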
...@@ -23,8 +23,9 @@ def simple_evaluate( ...@@ -23,8 +23,9 @@ def simple_evaluate(
description_dict=None, description_dict=None,
check_integrity=False, check_integrity=False,
decontamination_ngrams_path=None, decontamination_ngrams_path=None,
write_out=False,
output_base_path=None,
): ):
"""Instantiate and evaluate a model on a list of tasks. """Instantiate and evaluate a model on a list of tasks.
:param model: Union[str, LM] :param model: Union[str, LM]
...@@ -42,14 +43,18 @@ def simple_evaluate( ...@@ -42,14 +43,18 @@ def simple_evaluate(
PyTorch device (e.g. "cpu" or "cuda:0") for running models PyTorch device (e.g. "cpu" or "cuda:0") for running models
:param no_cache: bool :param no_cache: bool
Whether or not to cache Whether or not to cache
:param limit: int, optional :param limit: int or float, optional
Limit the number of examples per task (only use this for testing) Limit the number of examples per task (only use this for testing). If <1, limit is interpreted as a fraction of the total number of examples.
:param bootstrap_iters: :param bootstrap_iters:
Number of iterations for bootstrap statistics Number of iterations for bootstrap statistics
:param description_dict: dict[str, str] :param description_dict: dict[str, str]
Dictionary of custom task descriptions of the form: `task_name: description` Dictionary of custom task descriptions of the form: `task_name: description`
:param check_integrity: bool :param check_integrity: bool
Whether to run the relevant part of the test suite for the tasks Whether to run the relevant part of the test suite for the tasks
:param write_out: bool
If True, write details about prompts and logits to json for all tasks
:param output_base_path: str, optional
Directory to which detailed eval info will be written. Defaults to present working dir.
:return :return
Dictionary of results Dictionary of results
""" """
...@@ -91,6 +96,8 @@ def simple_evaluate( ...@@ -91,6 +96,8 @@ def simple_evaluate(
bootstrap_iters=bootstrap_iters, bootstrap_iters=bootstrap_iters,
description_dict=description_dict, description_dict=description_dict,
decontamination_ngrams_path=decontamination_ngrams_path, decontamination_ngrams_path=decontamination_ngrams_path,
write_out=write_out,
output_base_path=output_base_path,
) )
# add info about the model and few shot config # add info about the model and few shot config
...@@ -122,6 +129,8 @@ def evaluate( ...@@ -122,6 +129,8 @@ def evaluate(
bootstrap_iters=100000, bootstrap_iters=100000,
description_dict=None, description_dict=None,
decontamination_ngrams_path=None, decontamination_ngrams_path=None,
write_out=False,
output_base_path=None,
): ):
"""Instantiate and evaluate a model on a list of tasks. """Instantiate and evaluate a model on a list of tasks.
...@@ -139,6 +148,10 @@ def evaluate( ...@@ -139,6 +148,10 @@ def evaluate(
Number of iterations for bootstrap statistics Number of iterations for bootstrap statistics
:param description_dict: dict[str, str] :param description_dict: dict[str, str]
Dictionary of custom task descriptions of the form: `task_name: description` Dictionary of custom task descriptions of the form: `task_name: description`
:param write_out: bool
If True, write all prompts, logits and metrics to json for offline analysis
:param output_base_path: str, optional
Directory to which detailed eval info will be written. Defaults to present working dir
:return :return
Dictionary of results Dictionary of results
""" """
...@@ -175,6 +188,7 @@ def evaluate( ...@@ -175,6 +188,7 @@ def evaluate(
# TODO: we need unit tests & sanity checks or something to ensure that the return of `validation_docs` is stable # TODO: we need unit tests & sanity checks or something to ensure that the return of `validation_docs` is stable
docs = {} docs = {}
write_out_info = {}
docs_for_decontamination = collections.defaultdict(list) docs_for_decontamination = collections.defaultdict(list)
...@@ -197,15 +211,20 @@ def evaluate( ...@@ -197,15 +211,20 @@ def evaluate(
rnd = random.Random() rnd = random.Random()
rnd.seed(42) rnd.seed(42)
rnd.shuffle(task_docs) rnd.shuffle(task_docs)
print(f"Task: {task_name}; number of docs: {len(task_docs)}")
if write_out:
prompt_details = []
description = ( description = (
description_dict[task_name] description_dict[task_name]
if description_dict and task_name in description_dict if description_dict and task_name in description_dict
else "" else ""
) )
if limit is not None:
limit = int(len(task_docs) * limit) if limit < 1.0 else int(limit)
for doc_id, doc in enumerate(itertools.islice(task_docs, 0, limit)): for doc_id, doc in enumerate(itertools.islice(task_docs, 0, limit)):
if decontaminate and task.should_decontaminate(): if decontaminate and task.should_decontaminate():
docs_for_decontamination[(task_name, task_set)].append( docs_for_decontamination[(task_name, task_set)].append(
task.doc_to_decontamination_query(doc) task.doc_to_decontamination_query(doc)
...@@ -216,6 +235,17 @@ def evaluate( ...@@ -216,6 +235,17 @@ def evaluate(
doc=doc, num_fewshot=num_fewshot, rnd=rnd, description=description doc=doc, num_fewshot=num_fewshot, rnd=rnd, description=description
) )
reqs = task.construct_requests(doc, ctx) reqs = task.construct_requests(doc, ctx)
if write_out:
prompt_details.append({"doc_id": doc_id})
# print the prompt for the first few documents
if doc_id < 1:
print(
f"Task: {task_name}; document {doc_id}; context prompt (starting on next line):\n{ctx}\n(end of prompt on previous line)"
)
print("Requests:", reqs)
if not isinstance(reqs, (list, tuple)): if not isinstance(reqs, (list, tuple)):
reqs = [reqs] reqs = [reqs]
for i, req in enumerate(reqs): for i, req in enumerate(reqs):
...@@ -224,6 +254,14 @@ def evaluate( ...@@ -224,6 +254,14 @@ def evaluate(
# doc_id: unique id that we can get back to a doc using `docs` # doc_id: unique id that we can get back to a doc using `docs`
requests_origin[req.request_type].append((i, task_name, doc, doc_id)) requests_origin[req.request_type].append((i, task_name, doc, doc_id))
if write_out:
prompt_details[-1][f"prompt_{i}"] = "".join(
(map(lambda x: "".join(x), req.args))
)
if write_out:
write_out_info[task_name] = prompt_details
# Compare all tasks/sets at once to ensure a single training set scan # Compare all tasks/sets at once to ensure a single training set scan
if decontaminate: if decontaminate:
from lm_eval.decontamination.decontaminate import get_train_overlap from lm_eval.decontamination.decontaminate import get_train_overlap
...@@ -252,6 +290,18 @@ def evaluate( ...@@ -252,6 +290,18 @@ def evaluate(
for resp, (i, task_name, doc, doc_id) in zip(resps, requests_origin[reqtype]): for resp, (i, task_name, doc, doc_id) in zip(resps, requests_origin[reqtype]):
process_res_queue[(task_name, doc_id)].append((i, resp)) process_res_queue[(task_name, doc_id)].append((i, resp))
if write_out:
write_out_info[task_name][doc_id][f"logit_{i}"] = resp
task = task_dict[task_name]
if isinstance(task, lm_eval.base.MultipleChoiceTask):
write_out_info[task_name][doc_id]["truth"] = doc["gold"]
elif isinstance(task, lm_eval.tasks.winogrande.Winogrande):
write_out_info[task_name][doc_id]["truth"] = task.answer_to_num[
doc["answer"]
]
else:
write_out_info[task_name][doc_id]["truth"] = task.doc_to_target(doc)
vals = collections.defaultdict(list) vals = collections.defaultdict(list)
# unpack results and sort back in order and return control to Task # unpack results and sort back in order and return control to Task
...@@ -266,6 +316,9 @@ def evaluate( ...@@ -266,6 +316,9 @@ def evaluate(
for metric, value in metrics.items(): for metric, value in metrics.items():
vals[(task_name, metric)].append(value) vals[(task_name, metric)].append(value)
if write_out:
write_out_info[task_name][doc_id][metric] = str(value)
# Re-use the evaluation for the decontaminated set by just ignoring the overlaps # Re-use the evaluation for the decontaminated set by just ignoring the overlaps
if decontaminate and task_name in overlaps: if decontaminate and task_name in overlaps:
if doc_id not in overlaps[task_name]: if doc_id not in overlaps[task_name]:
...@@ -294,6 +347,28 @@ def evaluate( ...@@ -294,6 +347,28 @@ def evaluate(
if stderr is not None: if stderr is not None:
results[task_name][metric + "_stderr"] = stderr(items) results[task_name][metric + "_stderr"] = stderr(items)
if write_out:
import json
import pathlib
output_base_path = (
pathlib.Path(output_base_path)
if output_base_path is not None
else pathlib.Path(".")
)
try:
output_base_path.mkdir(parents=True, exist_ok=False)
except FileExistsError:
pass
for task_name, _ in task_dict_items:
with open(
output_base_path.joinpath(f"{task_name}_write_out_info.json"),
"w",
encoding="utf8",
) as fp:
json.dump(write_out_info[task_name], fp, indent=4, ensure_ascii=False)
return {"results": dict(results), "versions": dict(versions)} return {"results": dict(results), "versions": dict(versions)}
......
...@@ -3,6 +3,7 @@ import transformers ...@@ -3,6 +3,7 @@ import transformers
from typing import Optional from typing import Optional
from lm_eval.base import BaseLM from lm_eval.base import BaseLM
class HFLM(BaseLM): class HFLM(BaseLM):
def __init__( def __init__(
self, self,
...@@ -20,9 +21,11 @@ class HFLM(BaseLM): ...@@ -20,9 +21,11 @@ class HFLM(BaseLM):
assert isinstance(device, str) assert isinstance(device, str)
assert isinstance(pretrained, str) assert isinstance(pretrained, str)
assert isinstance(batch_size, (int,str)) assert isinstance(batch_size, (int, str))
device_list = set(["cuda", "cpu"] + [f'cuda:{i}' for i in range(torch.cuda.device_count())]) device_list = set(
["cuda", "cpu"] + [f"cuda:{i}" for i in range(torch.cuda.device_count())]
)
if device and device in device_list: if device and device in device_list:
self._device = torch.device(device) self._device = torch.device(device)
print(f"Using device '{device}'") print(f"Using device '{device}'")
...@@ -49,6 +52,7 @@ class HFLM(BaseLM): ...@@ -49,6 +52,7 @@ class HFLM(BaseLM):
self.tokenizer = transformers.AutoTokenizer.from_pretrained( self.tokenizer = transformers.AutoTokenizer.from_pretrained(
pretrained if tokenizer is None else tokenizer, pretrained if tokenizer is None else tokenizer,
<<<<<<< HEAD
<<<<<<< HEAD <<<<<<< HEAD
revision=revision + ("/" + subfolder if subfolder is not None else "")) revision=revision + ("/" + subfolder if subfolder is not None else ""))
...@@ -71,6 +75,8 @@ class HFLM(BaseLM): ...@@ -71,6 +75,8 @@ class HFLM(BaseLM):
# if gpus > 1: # if gpus > 1:
# self.gpt2 = nn.DataParallel(self.gpt2) # self.gpt2 = nn.DataParallel(self.gpt2)
======= =======
=======
>>>>>>> e8f38aee79569d51bd6c84f23f4227771291a816
revision=revision, revision=revision,
trust_remote_code=trust_remote_code, trust_remote_code=trust_remote_code,
) )
...@@ -88,11 +94,15 @@ class HFLM(BaseLM): ...@@ -88,11 +94,15 @@ class HFLM(BaseLM):
], self.tokenizer.encode("hello\n\nhello") ], self.tokenizer.encode("hello\n\nhello")
# setup for automatic batch size detection # setup for automatic batch size detection
if batch_size == 'auto': if batch_size == "auto":
self.batch_size_per_gpu = batch_size self.batch_size_per_gpu = batch_size
else: else:
<<<<<<< HEAD
self.batch_size_per_gpu = int(batch_size) self.batch_size_per_gpu = int(batch_size)
>>>>>>> 0542d35d5e56768dd9041ef9b88b90256970d843 >>>>>>> 0542d35d5e56768dd9041ef9b88b90256970d843
=======
self.batch_size_per_gpu = int(batch_size)
>>>>>>> e8f38aee79569d51bd6c84f23f4227771291a816
@property @property
def eot_token_id(self): def eot_token_id(self):
...@@ -139,9 +149,10 @@ class HFLM(BaseLM): ...@@ -139,9 +149,10 @@ class HFLM(BaseLM):
return self.gpt2(inps)[0] return self.gpt2(inps)[0]
def _model_generate(self, context, max_length, eos_token_id): def _model_generate(self, context, max_length, eos_token_id):
generation_kwargs = {'do_sample': False, 'max_length': max_length} generation_kwargs = {"do_sample": False, "max_length": max_length}
if eos_token_id is not None: if eos_token_id is not None:
generation_kwargs['eos_token_id'] = eos_token_id generation_kwargs['eos_token_id'] = eos_token_id
generation_kwargs['pad_token_id'] = eos_token_id # setting eos_token_id as pad token
return self.gpt2.generate(context, **generation_kwargs) return self.gpt2.generate(context, **generation_kwargs)
......
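The added `pad_token_id` line mirrors a common `transformers` pattern for models that lack a pad token (e.g. GPT-2); a standalone sketch, with model and prompt chosen purely for illustration:

```python
import transformers

tok = transformers.AutoTokenizer.from_pretrained("gpt2")
model = transformers.AutoModelForCausalLM.from_pretrained("gpt2")

context = tok("hello\n\nhello", return_tensors="pt").input_ids
out = model.generate(
    context,
    do_sample=False,
    max_length=context.shape[1] + 16,
    eos_token_id=tok.eos_token_id,
    pad_token_id=tok.eos_token_id,  # reuse EOS as pad to avoid the missing-pad-token warning
)
print(tok.decode(out[0]))
```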
...@@ -72,7 +72,7 @@ class HuggingFaceAutoLM(BaseLM): ...@@ -72,7 +72,7 @@ class HuggingFaceAutoLM(BaseLM):
tokenizer: Optional[str] = None, tokenizer: Optional[str] = None,
subfolder: Optional[str] = None, subfolder: Optional[str] = None,
revision: Optional[str] = "main", revision: Optional[str] = "main",
batch_size: Optional[Union[int,str]] = 1, batch_size: Optional[Union[int, str]] = 1,
max_gen_toks: Optional[int] = 256, max_gen_toks: Optional[int] = 256,
max_length: Optional[int] = None, max_length: Optional[int] = None,
add_special_tokens: Optional[bool] = None, add_special_tokens: Optional[bool] = None,
...@@ -159,10 +159,10 @@ class HuggingFaceAutoLM(BaseLM): ...@@ -159,10 +159,10 @@ class HuggingFaceAutoLM(BaseLM):
), "Evaluating causal models with `add_special_tokens=True` is currently not supported." ), "Evaluating causal models with `add_special_tokens=True` is currently not supported."
# setup for automatic batch size detection # setup for automatic batch size detection
if batch_size == 'auto': if batch_size == "auto":
self._batch_size = batch_size self._batch_size = batch_size
else: else:
self._batch_size = int(batch_size) self._batch_size = int(batch_size)
self._max_gen_toks = max_gen_toks self._max_gen_toks = max_gen_toks
self._max_length = max_length self._max_length = max_length
...@@ -369,35 +369,43 @@ class HuggingFaceAutoLM(BaseLM): ...@@ -369,35 +369,43 @@ class HuggingFaceAutoLM(BaseLM):
def tok_decode(self, tokens: torch.LongTensor) -> List[str]: def tok_decode(self, tokens: torch.LongTensor) -> List[str]:
return self.tokenizer.batch_decode(tokens, skip_special_tokens=True) return self.tokenizer.batch_decode(tokens, skip_special_tokens=True)
def greedy_until(
    self, requests: List[Tuple[str, Union[List[str], str]]]
) -> List[str]:
def _collate(x): def _collate(x):
tokens = self.tok_encode(x[0]) tokens = self.tok_encode(x[0])
return len(tokens), x[0] return len(tokens), x[0]
results = [] results = []
reorder = utils.Reorderer(requests, _collate) reorder = utils.Reorderer(requests, _collate)
adaptive_batch_size = None adaptive_batch_size = None
if self.batch_size == 'auto': if self.batch_size == "auto":
# using rolling window with maximum context # using rolling window with maximum context
print('Passed argument batch_size = auto. Detecting largest batch size') print("Passed argument batch_size = auto. Detecting largest batch size")
@find_executable_batch_size(
    starting_batch_size=512
)  # if OOM, then halves batch_size and tries again
def forward_batch(batch_size):
    test_batch = torch.ones(
        (batch_size, self.max_length), device=self.device
    ).long()
    for _ in range(5):
        _ = F.log_softmax(self._model_call(test_batch), dim=-1).cpu()
    return batch_size
batch_size = forward_batch() batch_size = forward_batch()
print(f"Determined Largest batch size: {batch_size}") print(f"Determined Largest batch size: {batch_size}")
adaptive_batch_size = batch_size adaptive_batch_size = batch_size
for chunk in utils.chunks(
    tqdm(reorder.get_reordered(), disable=False),
    self.batch_size if self.batch_size != "auto" else adaptive_batch_size,
):
context = [c[0] for c in chunk] context = [c[0] for c in chunk]
request_args = chunk[0][1] request_args = chunk[0][1]
stop = request_args.get('until', None) stop = request_args.get("until", None)
stop_sequences = stop if isinstance(stop, list) else [stop] stop_sequences = stop if isinstance(stop, list) else [stop]
max_generation_length = request_args.get("max_length", None) max_generation_length = request_args.get("max_length", None)
...@@ -405,7 +413,7 @@ class HuggingFaceAutoLM(BaseLM): ...@@ -405,7 +413,7 @@ class HuggingFaceAutoLM(BaseLM):
isinstance(max_generation_length, int) or max_generation_length is None isinstance(max_generation_length, int) or max_generation_length is None
) )
assert isinstance(stop_sequences, list) or stop_sequences is None assert isinstance(stop_sequences, list) or stop_sequences is None
# TODO: Find a better way to handle stop sequences for 0-shot. # TODO: Find a better way to handle stop sequences for 0-shot.
if stop_sequences is None: if stop_sequences is None:
until = [self.eot_token] until = [self.eot_token]
......
...@@ -124,7 +124,7 @@ class TextSynthLM(BaseLM): ...@@ -124,7 +124,7 @@ class TextSynthLM(BaseLM):
for request in tqdm(requests): for request in tqdm(requests):
inp = request[0] inp = request[0]
request_args = request[1] request_args = request[1]
until = request_args['until'] until = request_args["until"]
response = textsynth_completion( response = textsynth_completion(
url=self.api_url + "/v1/engines/" + self.engine + "/completions", url=self.api_url + "/v1/engines/" + self.engine + "/completions",
headers={"Authorization": "Bearer " + self.api_key}, headers={"Authorization": "Bearer " + self.api_key},
......
...@@ -16,7 +16,7 @@ from lm_eval.metrics import mean ...@@ -16,7 +16,7 @@ from lm_eval.metrics import mean
_CITATION = """ _CITATION = """
@misc{srivastava2022imitation, @misc{srivastava2022imitation,
title={Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models}, title={Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models},
author={Aarohi Srivastava and Abhinav Rastogi and Abhishek Rao and Abu Awal Md Shoeb and Abubakar Abid and Adam Fisch and Adam R. Brown and Adam Santoro and Aditya Gupta and Adrià Garriga-Alonso and Agnieszka Kluska and Aitor Lewkowycz and Akshat Agarwal and Alethea Power and Alex Ray and Alex Warstadt and Alexander W. Kocurek and Ali Safaya and Ali Tazarv and Alice Xiang and Alicia Parrish and Allen Nie and Aman Hussain and Amanda Askell and Amanda Dsouza and Ambrose Slone and Ameet Rahane and Anantharaman S. Iyer and Anders Andreassen and Andrea Madotto and Andrea Santilli and Andreas Stuhlmüller and Andrew Dai and Andrew La and Andrew Lampinen and Andy Zou and Angela Jiang and Angelica Chen and Anh Vuong and Animesh Gupta and Anna Gottardi and Antonio Norelli and Anu Venkatesh and Arash Gholamidavoodi and Arfa Tabassum and Arul Menezes and Arun Kirubarajan and Asher Mullokandov and Ashish Sabharwal and Austin Herrick and Avia Efrat and Aykut Erdem and Ayla Karakaş and B. Ryan Roberts and Bao Sheng Loe and Barret Zoph and Bartłomiej Bojanowski and Batuhan Özyurt and Behnam Hedayatnia and Behnam Neyshabur and Benjamin Inden and Benno Stein and Berk Ekmekci and Bill Yuchen Lin and Blake Howald and Cameron Diao and Cameron Dour and Catherine Stinson and Cedrick Argueta and César Ferri Ramírez and Chandan Singh and Charles Rathkopf and Chenlin Meng and Chitta Baral and Chiyu Wu and Chris Callison-Burch and Chris Waites and Christian Voigt and Christopher D. Manning and Christopher Potts and Cindy Ramirez and Clara E. Rivera and Clemencia Siro and Colin Raffel and Courtney Ashcraft and Cristina Garbacea and Damien Sileo and Dan Garrette and Dan Hendrycks and Dan Kilman and Dan Roth and Daniel Freeman and Daniel Khashabi and Daniel Levy and Daniel Moseguí González and Danielle Perszyk and Danny Hernandez and Danqi Chen and Daphne Ippolito and Dar Gilboa and David Dohan and David Drakard and David Jurgens and Debajyoti Datta and Deep Ganguli and Denis Emelin and Denis Kleyko and Deniz Yuret and Derek Chen and Derek Tam and Dieuwke Hupkes and Diganta Misra and Dilyar Buzan and Dimitri Coelho Mollo and Diyi Yang and Dong-Ho Lee and Ekaterina Shutova and Ekin Dogus Cubuk and Elad Segal and Eleanor Hagerman and Elizabeth Barnes and Elizabeth Donoway and Ellie Pavlick and Emanuele Rodola and Emma Lam and Eric Chu and Eric Tang and Erkut Erdem and Ernie Chang and Ethan A. Chi and Ethan Dyer and Ethan Jerzak and Ethan Kim and Eunice Engefu Manyasi and Evgenii Zheltonozhskii and Fanyue Xia and Fatemeh Siar and Fernando Martínez-Plumed and Francesca Happé and Francois Chollet and Frieda Rong and Gaurav Mishra and Genta Indra Winata and Gerard de Melo and Germán Kruszewski and Giambattista Parascandolo and Giorgio Mariani and Gloria Wang and Gonzalo Jaimovitch-López and Gregor Betz and Guy Gur-Ari and Hana Galijasevic and Hannah Kim and Hannah Rashkin and Hannaneh Hajishirzi and Harsh Mehta and Hayden Bogar and Henry Shevlin and Hinrich Schütze and Hiromu Yakura and Hongming Zhang and Hugh Mee Wong and Ian Ng and Isaac Noble and Jaap Jumelet and Jack Geissinger and Jackson Kernion and Jacob Hilton and Jaehoon Lee and Jaime Fernández Fisac and James B. 
Simon and James Koppel and James Zheng and James Zou and Jan Kocoń and Jana Thompson and Jared Kaplan and Jarema Radom and Jascha Sohl-Dickstein and Jason Phang and Jason Wei and Jason Yosinski and Jekaterina Novikova and Jelle Bosscher and Jennifer Marsh and Jeremy Kim and Jeroen Taal and Jesse Engel and Jesujoba Alabi and Jiacheng Xu and Jiaming Song and Jillian Tang and Joan Waweru and John Burden and John Miller and John U. Balis and Jonathan Berant and Jörg Frohberg and Jos Rozen and Jose Hernandez-Orallo and Joseph Boudeman and Joseph Jones and Joshua B. Tenenbaum and Joshua S. Rule and Joyce Chua and Kamil Kanclerz and Karen Livescu and Karl Krauth and Karthik Gopalakrishnan and Katerina Ignatyeva and Katja Markert and Kaustubh D. Dhole and Kevin Gimpel and Kevin Omondi and Kory Mathewson and Kristen Chiafullo and Ksenia Shkaruta and Kumar Shridhar and Kyle McDonell and Kyle Richardson and Laria Reynolds and Leo Gao and Li Zhang and Liam Dugan and Lianhui Qin and Lidia Contreras-Ochando and Louis-Philippe Morency and Luca Moschella and Lucas Lam and Lucy Noble and Ludwig Schmidt and Luheng He and Luis Oliveros Colón and Luke Metz and Lütfi Kerem Şenel and Maarten Bosma and Maarten Sap and Maartje ter Hoeve and Maheen Farooqi and Manaal Faruqui and Mantas Mazeika and Marco Baturan and Marco Marelli and Marco Maru and Maria Jose Ramírez Quintana and Marie Tolkiehn and Mario Giulianelli and Martha Lewis and Martin Potthast and Matthew L. Leavitt and Matthias Hagen and Mátyás Schubert and Medina Orduna Baitemirova and Melody Arnaud and Melvin McElrath and Michael A. Yee and Michael Cohen and Michael Gu and Michael Ivanitskiy and Michael Starritt and Michael Strube and Michał Swędrowski and Michele Bevilacqua and Michihiro Yasunaga and Mihir Kale and Mike Cain and Mimee Xu and Mirac Suzgun and Mo Tiwari and Mohit Bansal and Moin Aminnaseri and Mor Geva and Mozhdeh Gheini and Mukund Varma T and Nanyun Peng and Nathan Chi and Nayeon Lee and Neta Gur-Ari Krakover and Nicholas Cameron and Nicholas Roberts and Nick Doiron and Nikita Nangia and Niklas Deckers and Niklas Muennighoff and Nitish Shirish Keskar and Niveditha S. Iyer and Noah Constant and Noah Fiedel and Nuan Wen and Oliver Zhang and Omar Agha and Omar Elbaghdadi and Omer Levy and Owain Evans and Pablo Antonio Moreno Casares and Parth Doshi and Pascale Fung and Paul Pu Liang and Paul Vicol and Pegah Alipoormolabashi and Peiyuan Liao and Percy Liang and Peter Chang and Peter Eckersley and Phu Mon Htut and Pinyu Hwang and Piotr Miłkowski and Piyush Patil and Pouya Pezeshkpour and Priti Oli and Qiaozhu Mei and Qing Lyu and Qinlang Chen and Rabin Banjade and Rachel Etta Rudolph and Raefer Gabriel and Rahel Habacker and Ramón Risco Delgado and Raphaël Millière and Rhythm Garg and Richard Barnes and Rif A. Saurous and Riku Arakawa and Robbe Raymaekers and Robert Frank and Rohan Sikand and Roman Novak and Roman Sitelew and Ronan LeBras and Rosanne Liu and Rowan Jacobs and Rui Zhang and Ruslan Salakhutdinov and Ryan Chi and Ryan Lee and Ryan Stovall and Ryan Teehan and Rylan Yang and Sahib Singh and Saif M. Mohammad and Sajant Anand and Sam Dillavou and Sam Shleifer and Sam Wiseman and Samuel Gruetter and Samuel R. Bowman and Samuel S. Schoenholz and Sanghyun Han and Sanjeev Kwatra and Sarah A. 
Rous and Sarik Ghazarian and Sayan Ghosh and Sean Casey and Sebastian Bischoff and Sebastian Gehrmann and Sebastian Schuster and Sepideh Sadeghi and Shadi Hamdan and Sharon Zhou and Shashank Srivastava and Sherry Shi and Shikhar Singh and Shima Asaadi and Shixiang Shane Gu and Shubh Pachchigar and Shubham Toshniwal and Shyam Upadhyay and Shyamolima and Debnath and Siamak Shakeri and Simon Thormeyer and Simone Melzi and Siva Reddy and Sneha Priscilla Makini and Soo-Hwan Lee and Spencer Torene and Sriharsha Hatwar and Stanislas Dehaene and Stefan Divic and Stefano Ermon and Stella Biderman and Stephanie Lin and Stephen Prasad and Steven T. Piantadosi and Stuart M. Shieber and Summer Misherghi and Svetlana Kiritchenko and Swaroop Mishra and Tal Linzen and Tal Schuster and Tao Li and Tao Yu and Tariq Ali and Tatsu Hashimoto and Te-Lin Wu and Théo Desbordes and Theodore Rothschild and Thomas Phan and Tianle Wang and Tiberius Nkinyili and Timo Schick and Timofei Kornev and Timothy Telleen-Lawton and Titus Tunduny and Tobias Gerstenberg and Trenton Chang and Trishala Neeraj and Tushar Khot and Tyler Shultz and Uri Shaham and Vedant Misra and Vera Demberg and Victoria Nyamai and Vikas Raunak and Vinay Ramasesh and Vinay Uday Prabhu and Vishakh Padmakumar and Vivek Srikumar and William Fedus and William Saunders and William Zhang and Wout Vossen and Xiang Ren and Xiaoyu Tong and Xinran Zhao and Xinyi Wu and Xudong Shen and Yadollah Yaghoobzadeh and Yair Lakretz and Yangqiu Song and Yasaman Bahri and Yejin Choi and Yichi Yang and Yiding Hao and Yifu Chen and Yonatan Belinkov and Yu Hou and Yufang Hou and Yuntao Bai and Zachary Seid and Zhuoye Zhao and Zijian Wang and Zijie J. Wang and Zirui Wang and Ziyi Wu}, author={Aarohi Srivastava and Abhinav Rastogi and Abhishek Rao and Abu Awal Md Shoeb and Abubakar Abid and Adam Fisch and Adam R. Brown and Adam Santoro and Aditya Gupta and Adrià Garriga-Alonso and Agnieszka Kluska and Aitor Lewkowycz and Akshat Agarwal and Alethea Power and Alex Ray and Alex Warstadt and Alexander W. Kocurek and Ali Safaya and Ali Tazarv and Alice Xiang and Alicia Parrish and Allen Nie and Aman Hussain and Amanda Askell and Amanda Dsouza and Ambrose Slone and Ameet Rahane and Anantharaman S. Iyer and Anders Andreassen and Andrea Madotto and Andrea Santilli and Andreas Stuhlmüller and Andrew Dai and Andrew La and Andrew Lampinen and Andy Zou and Angela Jiang and Angelica Chen and Anh Vuong and Animesh Gupta and Anna Gottardi and Antonio Norelli and Anu Venkatesh and Arash Gholamidavoodi and Arfa Tabassum and Arul Menezes and Arun Kirubarajan and Asher Mullokandov and Ashish Sabharwal and Austin Herrick and Avia Efrat and Aykut Erdem and Ayla Karakaş and B. Ryan Roberts and Bao Sheng Loe and Barret Zoph and Bartłomiej Bojanowski and Batuhan Özyurt and Behnam Hedayatnia and Behnam Neyshabur and Benjamin Inden and Benno Stein and Berk Ekmekci and Bill Yuchen Lin and Blake Howald and Cameron Diao and Cameron Dour and Catherine Stinson and Cedrick Argueta and César Ferri Ramírez and Chandan Singh and Charles Rathkopf and Chenlin Meng and Chitta Baral and Chiyu Wu and Chris Callison-Burch and Chris Waites and Christian Voigt and Christopher D. Manning and Christopher Potts and Cindy Ramirez and Clara E. 
Rivera and Clemencia Siro and Colin Raffel and Courtney Ashcraft and Cristina Garbacea and Damien Sileo and Dan Garrette and Dan Hendrycks and Dan Kilman and Dan Roth and Daniel Freeman and Daniel Khashabi and Daniel Levy and Daniel Moseguí González and Danielle Perszyk and Danny Hernandez and Danqi Chen and Daphne Ippolito and Dar Gilboa and David Dohan and David Drakard and David Jurgens and Debajyoti Datta and Deep Ganguli and Denis Emelin and Denis Kleyko and Deniz Yuret and Derek Chen and Derek Tam and Dieuwke Hupkes and Diganta Misra and Dilyar Buzan and Dimitri Coelho Mollo and Diyi Yang and Dong-Ho Lee and Ekaterina Shutova and Ekin Dogus Cubuk and Elad Segal and Eleanor Hagerman and Elizabeth Barnes and Elizabeth Donoway and Ellie Pavlick and Emanuele Rodola and Emma Lam and Eric Chu and Eric Tang and Erkut Erdem and Ernie Chang and Ethan A. Chi and Ethan Dyer and Ethan Jerzak and Ethan Kim and Eunice Engefu Manyasi and Evgenii Zheltonozhskii and Fanyue Xia and Fatemeh Siar and Fernando Martínez-Plumed and Francesca Happé and Francois Chollet and Frieda Rong and Gaurav Mishra and Genta Indra Winata and Gerard de Melo and Germán Kruszewski and Giambattista Parascandolo and Giorgio Mariani and Gloria Wang and Gonzalo Jaimovitch-López and Gregor Betz and Guy Gur-Ari and Hana Galijasevic and Hannah Kim and Hannah Rashkin and Hannaneh Hajishirzi and Harsh Mehta and Hayden Bogar and Henry Shevlin and Hinrich Schütze and Hiromu Yakura and Hongming Zhang and Hugh Mee Wong and Ian Ng and Isaac Noble and Jaap Jumelet and Jack Geissinger and Jackson Kernion and Jacob Hilton and Jaehoon Lee and Jaime Fernández Fisac and James B. Simon and James Koppel and James Zheng and James Zou and Jan Kocoń and Jana Thompson and Jared Kaplan and Jarema Radom and Jascha Sohl-Dickstein and Jason Phang and Jason Wei and Jason Yosinski and Jekaterina Novikova and Jelle Bosscher and Jennifer Marsh and Jeremy Kim and Jeroen Taal and Jesse Engel and Jesujoba Alabi and Jiacheng Xu and Jiaming Song and Jillian Tang and Joan Waweru and John Burden and John Miller and John U. Balis and Jonathan Berant and Jörg Frohberg and Jos Rozen and Jose Hernandez-Orallo and Joseph Boudeman and Joseph Jones and Joshua B. Tenenbaum and Joshua S. Rule and Joyce Chua and Kamil Kanclerz and Karen Livescu and Karl Krauth and Karthik Gopalakrishnan and Katerina Ignatyeva and Katja Markert and Kaustubh D. Dhole and Kevin Gimpel and Kevin Omondi and Kory Mathewson and Kristen Chiafullo and Ksenia Shkaruta and Kumar Shridhar and Kyle McDonell and Kyle Richardson and Laria Reynolds and Leo Gao and Li Zhang and Liam Dugan and Lianhui Qin and Lidia Contreras-Ochando and Louis-Philippe Morency and Luca Moschella and Lucas Lam and Lucy Noble and Ludwig Schmidt and Luheng He and Luis Oliveros Colón and Luke Metz and Lütfi Kerem Şenel and Maarten Bosma and Maarten Sap and Maartje ter Hoeve and Maheen Farooqi and Manaal Faruqui and Mantas Mazeika and Marco Baturan and Marco Marelli and Marco Maru and Maria Jose Ramírez Quintana and Marie Tolkiehn and Mario Giulianelli and Martha Lewis and Martin Potthast and Matthew L. Leavitt and Matthias Hagen and Mátyás Schubert and Medina Orduna Baitemirova and Melody Arnaud and Melvin McElrath and Michael A. 
Yee and Michael Cohen and Michael Gu and Michael Ivanitskiy and Michael Starritt and Michael Strube and Michał Swędrowski and Michele Bevilacqua and Michihiro Yasunaga and Mihir Kale and Mike Cain and Mimee Xu and Mirac Suzgun and Mo Tiwari and Mohit Bansal and Moin Aminnaseri and Mor Geva and Mozhdeh Gheini and Mukund Varma T and Nanyun Peng and Nathan Chi and Nayeon Lee and Neta Gur-Ari Krakover and Nicholas Cameron and Nicholas Roberts and Nick Doiron and Nikita Nangia and Niklas Deckers and Niklas Muennighoff and Nitish Shirish Keskar and Niveditha S. Iyer and Noah Constant and Noah Fiedel and Nuan Wen and Oliver Zhang and Omar Agha and Omar Elbaghdadi and Omer Levy and Owain Evans and Pablo Antonio Moreno Casares and Parth Doshi and Pascale Fung and Paul Pu Liang and Paul Vicol and Pegah Alipoormolabashi and Peiyuan Liao and Percy Liang and Peter Chang and Peter Eckersley and Phu Mon Htut and Pinyu Hwang and Piotr Miłkowski and Piyush Patil and Pouya Pezeshkpour and Priti Oli and Qiaozhu Mei and Qing Lyu and Qinlang Chen and Rabin Banjade and Rachel Etta Rudolph and Raefer Gabriel and Rahel Habacker and Ramón Risco Delgado and Raphaël Millière and Rhythm Garg and Richard Barnes and Rif A. Saurous and Riku Arakawa and Robbe Raymaekers and Robert Frank and Rohan Sikand and Roman Novak and Roman Sitelew and Ronan LeBras and Rosanne Liu and Rowan Jacobs and Rui Zhang and Ruslan Salakhutdinov and Ryan Chi and Ryan Lee and Ryan Stovall and Ryan Teehan and Rylan Yang and Sahib Singh and Saif M. Mohammad and Sajant Anand and Sam Dillavou and Sam Shleifer and Sam Wiseman and Samuel Gruetter and Samuel R. Bowman and Samuel S. Schoenholz and Sanghyun Han and Sanjeev Kwatra and Sarah A. Rous and Sarik Ghazarian and Sayan Ghosh and Sean Casey and Sebastian Bischoff and Sebastian Gehrmann and Sebastian Schuster and Sepideh Sadeghi and Shadi Hamdan and Sharon Zhou and Shashank Srivastava and Sherry Shi and Shikhar Singh and Shima Asaadi and Shixiang Shane Gu and Shubh Pachchigar and Shubham Toshniwal and Shyam Upadhyay and Shyamolima and Debnath and Siamak Shakeri and Simon Thormeyer and Simone Melzi and Siva Reddy and Sneha Priscilla Makini and Soo-Hwan Lee and Spencer Torene and Sriharsha Hatwar and Stanislas Dehaene and Stefan Divic and Stefano Ermon and Stella Biderman and Stephanie Lin and Stephen Prasad and Steven T. Piantadosi and Stuart M. Shieber and Summer Misherghi and Svetlana Kiritchenko and Swaroop Mishra and Tal Linzen and Tal Schuster and Tao Li and Tao Yu and Tariq Ali and Tatsu Hashimoto and Te-Lin Wu and Théo Desbordes and Theodore Rothschild and Thomas Phan and Tianle Wang and Tiberius Nkinyili and Timo Schick and Timofei Kornev and Timothy Telleen-Lawton and Titus Tunduny and Tobias Gerstenberg and Trenton Chang and Trishala Neeraj and Tushar Khot and Tyler Shultz and Uri Shaham and Vedant Misra and Vera Demberg and Victoria Nyamai and Vikas Raunak and Vinay Ramasesh and Vinay Uday Prabhu and Vishakh Padmakumar and Vivek Srikumar and William Fedus and William Saunders and William Zhang and Wout Vossen and Xiang Ren and Xiaoyu Tong and Xinran Zhao and Xinyi Wu and Xudong Shen and Yadollah Yaghoobzadeh and Yair Lakretz and Yangqiu Song and Yasaman Bahri and Yejin Choi and Yichi Yang and Yiding Hao and Yifu Chen and Yonatan Belinkov and Yu Hou and Yufang Hou and Yuntao Bai and Zachary Seid and Zhuoye Zhao and Zijian Wang and Zijie J. Wang and Zirui Wang and Ziyi Wu},
year={2022}, year={2022},
eprint={2206.04615}, eprint={2206.04615},
......
...@@ -141,7 +141,7 @@ class CoQA(Task): ...@@ -141,7 +141,7 @@ class CoQA(Task):
language description, as well as the few shot examples, and the question language description, as well as the few shot examples, and the question
part of the document for `doc`. part of the document for `doc`.
""" """
cont_request = rf.greedy_until(ctx, {'until': ["\nQ:"]}) cont_request = rf.greedy_until(ctx, {"until": ["\nQ:"]})
return cont_request return cont_request
def process_results(self, doc, results): def process_results(self, doc, results):
......
...@@ -134,7 +134,7 @@ class DROP(Task): ...@@ -134,7 +134,7 @@ class DROP(Task):
language description, as well as the few shot examples, and the question language description, as well as the few shot examples, and the question
part of the document for `doc`. part of the document for `doc`.
""" """
conts = [rf.greedy_until(ctx, {'until': ["."]})] conts = [rf.greedy_until(ctx, {"until": ["."]})]
return conts return conts
def process_results(self, doc, results): def process_results(self, doc, results):
......
...@@ -79,7 +79,7 @@ class GradeSchoolMath8K(Task): ...@@ -79,7 +79,7 @@ class GradeSchoolMath8K(Task):
""" """
# NOTE: The paper implements "verifiers" that assign a score to multiple # NOTE: The paper implements "verifiers" that assign a score to multiple
# solutions and output the highest ranked solution. # solutions and output the highest ranked solution.
completion = rf.greedy_until(ctx, {'until': ["\n"]}) completion = rf.greedy_until(ctx, {"until": [":", "Question:", "Question"]})
return completion return completion
def _extract_answer(self, completion): def _extract_answer(self, completion):
......
...@@ -63,7 +63,7 @@ class Math(Task): ...@@ -63,7 +63,7 @@ class Math(Task):
return " " + doc["solution"] return " " + doc["solution"]
def construct_requests(self, doc, ctx): def construct_requests(self, doc, ctx):
return rf.greedy_until(ctx, {'until': ["\n"]}) return rf.greedy_until(ctx, {"until": ["\n"]})
def process_results(self, doc, results): def process_results(self, doc, results):
retval = 0 retval = 0
......
import datasets
from lm_eval.base import PerplexityTask
from lm_eval.utils import escaped_split
class JsonPerplexity(PerplexityTask):
VERSION = 0
DATASET_NAME = "json"
def __init__(self, data_dir=None, cache_dir=None, download_mode=None):
"""
:param data_dir: str
Use this to specify the path to manually downloaded JSON test data.
This also needs to include the split key and text key for the data
in the following format:
```
split:text:/absolute/path/to/data.json
```
If you do not have splits inside the JSON file, it should be "train".
Colons in the split or text key can be escaped by backslashes.
:param cache_dir: str
The directory to read/write the `Task` dataset. This follows the
HuggingFace `datasets` API with the default cache directory located at:
`~/.cache/huggingface/datasets`
NOTE: You can change the cache location globally for a given process
by setting the shell environment variable, `HF_DATASETS_CACHE`,
to another directory:
`export HF_DATASETS_CACHE="/path/to/another/directory"`
:param download_mode: datasets.DownloadMode
How to treat pre-existing `Task` downloads and data.
- `datasets.DownloadMode.REUSE_DATASET_IF_EXISTS`
Reuse download and reuse dataset.
- `datasets.DownloadMode.REUSE_CACHE_IF_EXISTS`
Reuse download with fresh dataset.
- `datasets.DownloadMode.FORCE_REDOWNLOAD`
Fresh download and fresh dataset.
"""
self._split, self._key, data_file = escaped_split(data_dir, ":", 2)
self.load(data_file)
self._training_docs = None
self._fewshot_docs = None
def download(self, data_dir=None, cache_dir=None, download_mode=None):
raise TypeError("cannot download an arbitrary JSON dataset")
def load(self, data_file):
self.dataset = datasets.load_dataset("json", data_files=data_file)
def has_validation_docs(self):
return False
def has_test_docs(self):
return True
def test_docs(self):
return map(self._process_doc, self.dataset[self._split])
def _process_doc(self, doc):
return doc[self._key]
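A hedged construction example following the docstring's `split:text:path` convention (the JSON path is the placeholder from the docstring):

```python
# Sketch: point the task at a local JSON file whose records sit under the
# "train" split and keep their text in a "text" field.
task = JsonPerplexity(data_dir="train:text:/absolute/path/to/data.json")
docs = list(task.test_docs())  # raw text strings that the perplexity task will score
```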