Commit 92a50856 authored by Julen Etxaniz

Merge remote-tracking branch 'upstream/master' into results

parents afbf8e66 d1451679
@@ -10,9 +10,11 @@ This project provides a unified framework to test generative language models on
Features:

- 200+ tasks implemented. See the [task-table](./docs/task_table.md) for a complete list.
- Support for models loaded via [transformers](https://github.com/huggingface/transformers/), [GPT-NeoX](https://github.com/EleutherAI/gpt-neox), and [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/), with a flexible tokenization-agnostic interface.
- Support for commercial APIs including [OpenAI](https://openai.com), [goose.ai](https://goose.ai), and [TextSynth](https://textsynth.com/).
- Support for evaluation on adapters (e.g. LoRa) supported in [HuggingFace's PEFT library](https://github.com/huggingface/peft).
- Evaluating with publicly available prompts ensures reproducibility and comparability between papers.
- Task versioning to ensure reproducibility when tasks are updated.
## Install
@@ -34,14 +36,16 @@ pip install -e ".[multilingual]"
> **Note**: When reporting results from eval harness, please include the task versions (shown in `results["versions"]`) for reproducibility. This allows bug fixes to tasks while also ensuring that previously reported scores are reproducible. See the [Task Versioning](#task-versioning) section for more info.

### Hugging Face `transformers`
To evaluate a model hosted on the [HuggingFace Hub](https://huggingface.co/models) (e.g. GPT-J-6B) on `hellaswag` you can use the following command:
```bash
python main.py \
    --model hf-causal \
    --model_args pretrained=EleutherAI/gpt-j-6B \
    --tasks hellaswag \
    --device cuda:0
```
@@ -59,16 +63,9 @@ To evaluate models that are loaded via `AutoSeq2SeqLM` in Huggingface, you inste

> **Warning**: Choosing the wrong model may result in erroneous outputs despite not erroring.

### Commercial APIs
Our library also supports language models served via the OpenAI API:

```bash
export OPENAI_API_SECRET_KEY=YOUR_KEY_HERE
@@ -90,7 +87,9 @@ python main.py \
    --check_integrity
```

### Other Frameworks
A number of other libraries provide scripts for calling the eval harness from within their own codebases. These include [GPT-NeoX](https://github.com/EleutherAI/gpt-neox/blob/main/eval_tasks/eval_adapter.py), [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples/MoE/readme_evalharness.md), and [mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/blob/master/eval_harness.py).
💡 **Tip**: You can inspect what the LM inputs look like by running the following command:

@@ -104,6 +103,22 @@ python write_out.py \

This will write out one text file for each task.
## Advanced Usage
For models loaded with the HuggingFace `transformers` library, any arguments provided via `--model_args` get passed to the relevant constructor directly. This means that anything you can do with `AutoModel` can be done with our library. For example, you can pass a local path via `pretrained=` or use models finetuned with [PEFT](https://github.com/huggingface/peft) by taking the call you would run to evaluate the base model and adding `,peft=PATH` to the `model_args` argument:
```bash
python main.py \
    --model hf-causal-experimental \
    --model_args pretrained=EleutherAI/gpt-j-6b,peft=nomic-ai/gpt4all-j-lora \
    --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq \
    --device cuda:0
```
We support wildcards in task names; for example, you can run all of the machine-translated LAMBADA tasks via `--tasks lambada_openai_mt_*`, as in the command below.
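For instance, reusing the GPT-J setup from the earlier example (the model and device here are just placeholders for your own setup), such a run might look like:

```bash
python main.py \
    --model hf-causal \
    --model_args pretrained=EleutherAI/gpt-j-6B \
    --tasks lambada_openai_mt_* \
    --device cuda:0
```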
We currently only support one prompt per task, which we strive to make the "standard" as defined by the benchmark's authors. If you would like to study how varying prompts causes changes in the evaluation score, check out the [BigScience fork](https://github.com/bigscience-workshop/lm-evaluation-harness) of this repo. We are currently working on upstreaming this capability to `main`.
## Implementing new tasks

To implement a new task in the eval harness, see [this guide](./docs/task_guide.md).
......
@@ -190,14 +190,19 @@ class BaseLM(LM):
        # automatic batch size detection for vectorization
        adaptive_batch_size = None
        if self.batch_size == "auto":
            # using rolling window with maximum context
            print("Passed argument batch_size = auto. Detecting largest batch size")

            @find_executable_batch_size(
                starting_batch_size=512
            )  # if OOM, then halves batch_size and tries again
            def forward_batch(batch_size):
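                # Probe this candidate batch size with a few full-length dummy
                # forward passes so that an out-of-memory error, if any, surfaces
                # here; the decorator then retries with a halved batch size.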
                test_batch = torch.ones(
                    (batch_size, self.max_length), device=self.device
                ).long()
                for _ in range(5):
                    _ = F.log_softmax(self._model_call(test_batch), dim=-1).cpu()
                return batch_size

            batch_size = forward_batch()
@@ -223,7 +228,9 @@ class BaseLM(LM):
            # TODO: extract out this call so it only gets called once and also somehow figure out partial caching for
            # that
            string_nll = self._loglikelihood_tokens(
                rolling_token_windows,
                disable_tqdm=True,
                override_bs=adaptive_batch_size,
            )

            # discard is_greedy
@@ -234,7 +241,7 @@ class BaseLM(LM):
        return loglikelihoods

    def _loglikelihood_tokens(self, requests, disable_tqdm=False, override_bs=None):
        # TODO: implement some kind of efficient-request-middleware that lumps together requests with the same context
        res = []

@@ -249,11 +256,11 @@ class BaseLM(LM):
            toks = x[1] + x[2]
            return -len(toks), tuple(toks)

        re_ord = utils.Reorderer(requests, _collate)

        # automatic (variable) batch size detection for vectorization
        # pull longest context sample from request
        if len(re_ord.get_reordered()) > 0:
            _, context_enc, continuation_enc = re_ord.get_reordered()[0]
            max_context = len((context_enc + continuation_enc)[-(self.max_length + 1) :][:-1])
            if (self.batch_size == 'auto'):
@@ -273,9 +280,12 @@ class BaseLM(LM):
            else:
                adaptive_batch_size = override_bs
        else:
            adaptive_batch_size = 0 if override_bs is None else override_bs

        for chunk in utils.chunks(
            tqdm(re_ord.get_reordered(), disable=disable_tqdm),
            self.batch_size if self.batch_size != "auto" else adaptive_batch_size,
        ):
            inps = []
            cont_toks_list = []
@@ -382,7 +392,7 @@ class BaseLM(LM):
        re_ord = utils.Reorderer(requests, _collate)

        for context, request_args in tqdm(re_ord.get_reordered()):
            until = request_args["until"]
            if isinstance(until, str):
                until = [until]
@@ -396,7 +406,7 @@ class BaseLM(LM):
            ).to(self.device)

            max_gen_tokens = min(
                self.max_gen_toks, request_args.get("max_length", self.max_gen_toks)
            )
            cont = self._model_generate(
                context_enc, context_enc.shape[1] + max_gen_tokens, primary_until
......
@@ -3,6 +3,7 @@ import transformers
from typing import Optional

from lm_eval.base import BaseLM


class HFLM(BaseLM):
    def __init__(
        self,
@@ -20,9 +21,11 @@ class HFLM(BaseLM):
        assert isinstance(device, str)
        assert isinstance(pretrained, str)
        assert isinstance(batch_size, (int, str))

        device_list = set(
            ["cuda", "cpu"] + [f"cuda:{i}" for i in range(torch.cuda.device_count())]
        )
        if device and device in device_list:
            self._device = torch.device(device)
            print(f"Using device '{device}'")
@@ -66,7 +69,7 @@ class HFLM(BaseLM):
        ], self.tokenizer.encode("hello\n\nhello")

        # setup for automatic batch size detection
        if batch_size == "auto":
            self.batch_size_per_gpu = batch_size
        else:
            self.batch_size_per_gpu = int(batch_size)
@@ -116,9 +119,10 @@ class HFLM(BaseLM):
            return self.gpt2(inps)[0]

    def _model_generate(self, context, max_length, eos_token_id):
        generation_kwargs = {"do_sample": False, "max_length": max_length}
        if eos_token_id is not None:
            generation_kwargs['eos_token_id'] = eos_token_id
            generation_kwargs['pad_token_id'] = eos_token_id  # setting eos_token_id as pad token
        return self.gpt2.generate(context, **generation_kwargs)
......
@@ -72,7 +72,7 @@ class HuggingFaceAutoLM(BaseLM):
        tokenizer: Optional[str] = None,
        subfolder: Optional[str] = None,
        revision: Optional[str] = "main",
        batch_size: Optional[Union[int, str]] = 1,
        max_gen_toks: Optional[int] = 256,
        max_length: Optional[int] = None,
        add_special_tokens: Optional[bool] = None,
@@ -159,7 +159,7 @@ class HuggingFaceAutoLM(BaseLM):
), "Evaluating causal models with `add_special_tokens=True` is currently not supported." ), "Evaluating causal models with `add_special_tokens=True` is currently not supported."
# setup for automatic batch size detection # setup for automatic batch size detection
if batch_size == 'auto': if batch_size == "auto":
self._batch_size = batch_size self._batch_size = batch_size
else: else:
self._batch_size = int(batch_size) self._batch_size = int(batch_size)
@@ -369,7 +369,9 @@ class HuggingFaceAutoLM(BaseLM):
    def tok_decode(self, tokens: torch.LongTensor) -> List[str]:
        return self.tokenizer.batch_decode(tokens, skip_special_tokens=True)

    def greedy_until(
        self, requests: List[Tuple[str, Union[List[str], str]]]
    ) -> List[str]:
        def _collate(x):
            tokens = self.tok_encode(x[0])
            return len(tokens), x[0]
@@ -378,14 +380,19 @@ class HuggingFaceAutoLM(BaseLM):
        reorder = utils.Reorderer(requests, _collate)

        adaptive_batch_size = None
        if self.batch_size == "auto":
            # using rolling window with maximum context
            print("Passed argument batch_size = auto. Detecting largest batch size")

            @find_executable_batch_size(
                starting_batch_size=512
            )  # if OOM, then halves batch_size and tries again
            def forward_batch(batch_size):
                test_batch = torch.ones(
                    (batch_size, self.max_length), device=self.device
                ).long()
                for _ in range(5):
                    _ = F.log_softmax(self._model_call(test_batch), dim=-1).cpu()
                return batch_size

            batch_size = forward_batch()
@@ -393,11 +400,12 @@ class HuggingFaceAutoLM(BaseLM):
            adaptive_batch_size = batch_size

        for chunk in utils.chunks(
            tqdm(reorder.get_reordered(), disable=False),
            self.batch_size if self.batch_size != "auto" else adaptive_batch_size,
        ):
            context = [c[0] for c in chunk]
            request_args = chunk[0][1]
            stop = request_args.get("until", None)
            stop_sequences = stop if isinstance(stop, list) else [stop]
            max_generation_length = request_args.get("max_length", None)
......
@@ -124,7 +124,7 @@ class TextSynthLM(BaseLM):
        for request in tqdm(requests):
            inp = request[0]
            request_args = request[1]
            until = request_args["until"]
            response = textsynth_completion(
                url=self.api_url + "/v1/engines/" + self.engine + "/completions",
                headers={"Authorization": "Bearer " + self.api_key},
......
@@ -52,7 +52,13 @@ from . import gsm8k
from . import storycloze
from . import toxigen
from . import crowspairs
from . import xcopa
from . import bigbench
from . import xstorycloze
from . import xwinograd
from . import pawsx
from . import xnli
from . import mgsm
########################################
# Translation tasks
@@ -311,7 +317,13 @@ TASK_REGISTRY = {
# "storycloze_2016": storycloze.StoryCloze2016, # "storycloze_2016": storycloze.StoryCloze2016,
# "storycloze_2018": storycloze.StoryCloze2018, # "storycloze_2018": storycloze.StoryCloze2018,
# "sat": sat.SATAnalogies, # "sat": sat.SATAnalogies,
**xcopa.construct_tasks(),
**bigbench.create_all_tasks(), **bigbench.create_all_tasks(),
**xstorycloze.create_all_tasks(),
**xwinograd.create_all_tasks(),
**pawsx.construct_tasks(),
**xnli.construct_tasks(),
**mgsm.construct_tasks(),
} }
......
@@ -141,7 +141,7 @@ class CoQA(Task):
        language description, as well as the few shot examples, and the question
        part of the document for `doc`.
        """
        cont_request = rf.greedy_until(ctx, {"until": ["\nQ:"]})
        return cont_request

    def process_results(self, doc, results):
......
@@ -134,7 +134,7 @@ class DROP(Task):
        language description, as well as the few shot examples, and the question
        part of the document for `doc`.
        """
        conts = [rf.greedy_until(ctx, {"until": ["."]})]
        return conts

    def process_results(self, doc, results):
......
@@ -79,7 +79,7 @@ class GradeSchoolMath8K(Task):
""" """
# NOTE: The paper implements "verifiers" that assign a score to multiple # NOTE: The paper implements "verifiers" that assign a score to multiple
# solutions and output the highest ranked solution. # solutions and output the highest ranked solution.
completion = rf.greedy_until(ctx, {'until': ["\n"]}) completion = rf.greedy_until(ctx, {"until": [":", "Question:", "Question"]})
return completion return completion
def _extract_answer(self, completion): def _extract_answer(self, completion):
......
@@ -63,7 +63,7 @@ class Math(Task):
return " " + doc["solution"] return " " + doc["solution"]
def construct_requests(self, doc, ctx): def construct_requests(self, doc, ctx):
return rf.greedy_until(ctx, {'until': ["\n"]}) return rf.greedy_until(ctx, {"until": ["\n"]})
def process_results(self, doc, results): def process_results(self, doc, results):
retval = 0 retval = 0
......
"""
Language Models are Multilingual Chain-of-Thought Reasoners
https://arxiv.org/abs/2210.03057
Multilingual Grade School Math Benchmark (MGSM) is a benchmark of grade-school math problems, proposed in the paper [Language models are multilingual chain-of-thought reasoners](http://arxiv.org/abs/2210.03057).
The same 250 problems from [GSM8K](https://arxiv.org/abs/2110.14168) were each translated by human annotators into 10 languages. The 10 languages are:
- Spanish
- French
- German
- Russian
- Chinese
- Japanese
- Thai
- Swahili
- Bengali
- Telugu
GSM8K (Grade School Math 8K) is a dataset of 8.5K high quality linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.
You can find the input and targets for each of the ten languages (and English) as `.tsv` files.
Few-shot exemplars, also manually translated into each language, are included in `exemplars.py`.
Homepage: https://github.com/google-research/url-nlp/tree/main/mgsm
"""
import re
from lm_eval.base import Task, rf
from lm_eval.metrics import mean
_CITATION = """
@misc{cobbe2021training,
title={Training Verifiers to Solve Math Word Problems},
author={Karl Cobbe and Vineet Kosaraju and Mohammad Bavarian and Jacob Hilton and Reiichiro Nakano and Christopher Hesse and John Schulman},
year={2021},
eprint={2110.14168},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
@misc{shi2022language,
title={Language Models are Multilingual Chain-of-Thought Reasoners},
author={Freda Shi and Mirac Suzgun and Markus Freitag and Xuezhi Wang and Suraj Srivats and Soroush Vosoughi and Hyung Won Chung and Yi Tay and Sebastian Ruder and Denny Zhou and Dipanjan Das and Jason Wei},
year={2022},
eprint={2210.03057},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
"""
ANS_RE = re.compile(r"(\-?\d+)")
INVALID_ANS = "[invalid]"
class MGSM(Task):
    VERSION = 0
    DATASET_PATH = "juletxara/mgsm"
    DATASET_NAME = None
    QUESTION = "Question:"
    ANSWER = "Step-by-Step Answer:"

    def has_training_docs(self):
        return True

    def has_validation_docs(self):
        return False

    def has_test_docs(self):
        return True

    def training_docs(self):
        return self.dataset["train"]

    def validation_docs(self):
        raise NotImplementedError

    def test_docs(self):
        return self.dataset["test"]

    def doc_to_text(self, doc):
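        # Training docs (used as few-shot exemplars) carry a full step-by-step
        # "answer" string; test docs leave "answer" empty and provide only the
        # numeric "answer_number", which is why the two splits are formatted
        # differently below.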
if doc["answer"] is not None:
return doc["question"] + "\n" + self.ANSWER
else:
return self.QUESTION + " " + doc["question"] + "\n" + self.ANSWER
def doc_to_target(self, doc):
if doc["answer"] is not None:
return " " + doc["answer"][len(self.ANSWER) + 1 :]
else:
return " " + str(doc["answer_number"])
def construct_requests(self, doc, ctx):
"""Uses RequestFactory to construct Requests and returns an iterable of
Requests which will be sent to the LM.
:param doc:
The document as returned from training_docs, validation_docs, or test_docs.
:param ctx: str
The context string, generated by fewshot_context. This includes the natural
language description, as well as the few shot examples, and the question
part of the document for `doc`.
"""
        completion = rf.greedy_until(ctx, {"until": ["\n", ":", self.QUESTION]})
        return completion

    def _extract_answer(self, completion):
        match = re.findall(ANS_RE, completion)
        if match:
            return int(match[-1])
        else:
            return INVALID_ANS

    def _is_correct(self, completion, answer):
        gold = answer
        assert gold != INVALID_ANS, "No ground truth answer found in the document."
        return self._extract_answer(completion) == gold

    def process_results(self, doc, results):
        """Take a single document and the LM results and evaluates, returning a
        dict where keys are the names of submetrics and values are the values of
        the metric for that one document

        :param doc:
            The document as returned from training_docs, validation_docs, or test_docs.
        :param results:
            The results of the requests created in construct_requests.
        """
        completion = results[0]
        answer = doc["answer_number"]
        return {"acc": self._is_correct(completion, answer)}

    def aggregation(self):
        """
        :returns: {str: [float] -> float}
            A dictionary where keys are the names of submetrics and values are
            functions that aggregate a list of metrics
        """
        return {"acc": mean}

    def higher_is_better(self):
        """
        :returns: {str: bool}
            A dictionary where keys are the names of submetrics and values are
            whether a higher value of the submetric is better
        """
        return {"acc": True}
class MGSM_English(MGSM):
    DATASET_NAME = "en"
    QUESTION = "Question:"
    ANSWER = "Step-by-Step Answer:"


class MGSM_Spanish(MGSM):
    DATASET_NAME = "es"
    QUESTION = "Pregunta:"
    ANSWER = "Respuesta paso a paso:"


class MGSM_French(MGSM):
    DATASET_NAME = "fr"
    QUESTION = "Question :"
    ANSWER = "R\u00e9ponse \u00e9tape par \u00e9tape :"


class MGSM_German(MGSM):
    DATASET_NAME = "de"
    QUESTION = "Frage:"
    ANSWER = "Schritt-f\u00fcr-Schritt-Antwort:"


class MGSM_Russian(MGSM):
    DATASET_NAME = "ru"
    QUESTION = "\u0417\u0430\u0434\u0430\u0447\u0430:"
    ANSWER = "\u041f\u043e\u0448\u0430\u0433\u043e\u0432\u043e\u0435\u0440\u0435\u0448\u0435\u043d\u0438\u0435:"


class MGSM_Chinese(MGSM):
    DATASET_NAME = "zh"
    QUESTION = "\u95ee\u9898:"
    ANSWER = "\u9010\u6b65\u89e3\u7b54:"


class MGSM_Japanese(MGSM):
    DATASET_NAME = "ja"
    QUESTION = "\u554f\u984c:"
    ANSWER = "\u30b9\u30c6\u30c3\u30d7\u3054\u3068\u306e\u7b54\u3048:"


class MGSM_Thai(MGSM):
    DATASET_NAME = "th"
    QUESTION = "\u0e42\u0e08\u0e17\u0e22\u0e4c:"
    ANSWER = "\u0e04\u0e33\u0e15\u0e2d\u0e1a\u0e17\u0e35\u0e25\u0e30\u0e02\u0e31\u0e49\u0e19\u0e15\u0e2d\u0e19:"


class MGSM_Swahili(MGSM):
    DATASET_NAME = "sw"
    QUESTION = "Swali:"
    ANSWER = "Jibu la Hatua kwa Hatua:"


class MGSM_Bengali(MGSM):
    DATASET_NAME = "bn"
    QUESTION = "\u09aa\u09cd\u09b0\u09b6\u09cd\u09a8:"
    ANSWER = "\u09a7\u09be\u09aa\u09c7 \u09a7\u09be\u09aa\u09c7 \u0989\u09a4\u09cd\u09a4\u09b0:"


class MGSM_Telugu(MGSM):
    DATASET_NAME = "te"
    QUESTION = "\u0c2a\u0c4d\u0c30\u0c36\u0c4d\u0c28:"
    ANSWER = "\u0c26\u0c36\u0c32\u0c35\u0c3e\u0c30\u0c40\u0c17\u0c3e \u0c38\u0c2e\u0c3e\u0c27\u0c3e\u0c28\u0c02:"
LANGS = ["en", "es", "fr", "de", "ru", "zh", "ja", "th", "sw", "bn", "te"]
LANG_CLASSES = [
MGSM_English,
MGSM_Spanish,
MGSM_French,
MGSM_German,
MGSM_Russian,
MGSM_Chinese,
MGSM_Japanese,
MGSM_Thai,
MGSM_Swahili,
MGSM_Bengali,
MGSM_Telugu,
]
def construct_tasks():
tasks = {}
for lang, lang_class in zip(LANGS, LANG_CLASSES):
tasks[f"mgsm_{lang}"] = lang_class
return tasks
"""
PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification
https://arxiv.org/abs/1908.11828
The dataset consists of 23,659 human translated PAWS evaluation pairs and
296,406 machine translated training pairs in 6 typologically distinct languages.
Examples are adapted from PAWS-Wiki
Prompt format (same as in mGPT):
"<s>" + sentence1 + ", right? " + mask + ", " + sentence2 + "</s>",
where mask is the string that matches the label:
Yes, No.
Example:
<s> The Tabaci River is a tributary of the River Leurda in Romania, right? No, The Leurda River is a tributary of the River Tabaci in Romania.</s>
Language specific prompts are translated word-by-word with Google Translate
and may differ from the ones used by mGPT and XGLM (they do not provide their prompts).
Homepage: https://github.com/google-research-datasets/paws/tree/master/pawsx
"""
from lm_eval.base import Task, rf
from lm_eval.metrics import mean
_CITATION = """
@inproceedings{yang-etal-2019-paws,
title = "{PAWS}-{X}: A Cross-lingual Adversarial Dataset for Paraphrase Identification",
author = "Yang, Yinfei and
Zhang, Yuan and
Tar, Chris and
Baldridge, Jason",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
month = nov,
year = "2019",
address = "Hong Kong, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/D19-1382",
doi = "10.18653/v1/D19-1382",
pages = "3687--3692",
}"""
class PAWSXBase(Task):
    VERSION = 0
    DATASET_PATH = "paws-x"
    DATASET_NAME = None  # 'en'
    YES = None  # 'Yes'
    NO = None  # 'No'
    QUESTION_WORD = None  # 'right'

    def has_training_docs(self):
        return True

    def has_validation_docs(self):
        return True

    def has_test_docs(self):
        return True

    def training_docs(self):
        return self.dataset["train"]

    def validation_docs(self):
        return self.dataset["validation"]

    def test_docs(self):
        return self.dataset["test"]

    def doc_to_text(self, doc):
        # same as in mGPT paper
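        # e.g. for PAWSX_en: "<sentence1>, right? [MASK], <sentence2>";
        # construct_requests later scores this string with [MASK] replaced by
        # the language-specific YES and NO words.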
        return (
            doc["sentence1"]
            + ", "
            + self.QUESTION_WORD
            + "? [MASK], "
            + doc["sentence2"]
        )

    def doc_to_target(self, doc):
        return " " + [self.YES, self.NO][doc["label"]]

    def construct_requests(self, doc, ctx):
        """Uses RequestFactory to construct Requests and returns an iterable of
        Requests which will be sent to the LM.

        :param doc:
            The document as returned from training_docs, validation_docs, or
            test_docs.
        :param ctx: str
            The context string, generated by fewshot_context. This includes the natural
            language description, as well as the few shot examples, and the question
            part of the document for `doc`.
        """
        ll_yes = rf.loglikelihood_rolling(ctx.replace("[MASK]", self.YES))
        ll_no = rf.loglikelihood_rolling(ctx.replace("[MASK]", self.NO))
        return ll_yes, ll_no

    def process_results(self, doc, results):
        """Take a single document and the LM results and evaluates, returning a
        dict where keys are the names of submetrics and values are the values of
        the metric for that one document

        :param doc:
            The document as returned from training_docs, validation_docs, or test_docs.
        :param results:
            The results of the requests created in construct_requests.
        """
        ll_yes, ll_no = results
        pred = ll_yes > ll_no
        true_label = doc["label"]

        return {
            "acc": pred == true_label,
        }

    def aggregation(self):
        """
        :returns: {str: [metric_score] -> float}
            A dictionary where keys are the names of submetrics and values are
            functions that aggregate a list of metric scores
        """
        return {
            "acc": mean,
        }

    def higher_is_better(self):
        return {"acc": True}
class PAWSX_en(PAWSXBase):
    DATASET_NAME = "en"
    YES = "Yes"
    NO = "No"
    QUESTION_WORD = "right"


class PAWSX_de(PAWSXBase):
    DATASET_NAME = "de"
    YES = "Ja"
    NO = "Nein"
    QUESTION_WORD = "richtig"
class PAWSX_fr(PAWSXBase):
    DATASET_NAME = "fr"
    YES = "Oui"
    NO = "Non"
    QUESTION_WORD = "n'est-ce pas"
class PAWSX_es(PAWSXBase):
    DATASET_NAME = "es"
    YES = "Sí"
    NO = "No"
    QUESTION_WORD = "verdad"


class PAWSX_ja(PAWSXBase):
    DATASET_NAME = "ja"
    YES = "はい"
    NO = "いいえ"
    QUESTION_WORD = "ですね"


class PAWSX_ko(PAWSXBase):
    DATASET_NAME = "ko"
    YES = "예"
    NO = "아니요"
    QUESTION_WORD = "맞죠"


class PAWSX_zh(PAWSXBase):
    DATASET_NAME = "zh"
    YES = "是"
    NO = "不是"
    QUESTION_WORD = "对吧"
LANGS = [
    "en",
    "de",
    "es",
    "fr",
    "ja",
    "ko",
    "zh",
]

LANG_CLASSES = [
    PAWSX_en,
    PAWSX_de,
    PAWSX_es,
    PAWSX_fr,
    PAWSX_ja,
    PAWSX_ko,
    PAWSX_zh,
]


def construct_tasks():
    tasks = {}
    for lang, lang_class in zip(LANGS, LANG_CLASSES):
        tasks[f"pawsx_{lang}"] = lang_class
    return tasks
@@ -214,7 +214,7 @@ class QASPER(Task):
""" """
# unanswerable = rf.loglikelihood(ctx, " " + "unanswerable") # unanswerable = rf.loglikelihood(ctx, " " + "unanswerable")
if doc["answer_type"] in ("free form answer"): if doc["answer_type"] in ("free form answer"):
return [rf.greedy_until(ctx, {'until': ["\n"]})] return [rf.greedy_until(ctx, {"until": ["\n"]})]
elif doc["answer_type"] in ("bool"): elif doc["answer_type"] in ("bool"):
ll_yes, _ = rf.loglikelihood(ctx, " yes") ll_yes, _ = rf.loglikelihood(ctx, " yes")
ll_no, _ = rf.loglikelihood(ctx, " no") ll_no, _ = rf.loglikelihood(ctx, " no")
......
@@ -107,7 +107,7 @@ class SQuAD2(Task):
        language description, as well as the few shot examples, and the question
        part of the document for `doc`.
        """
        continuation = rf.greedy_until(ctx, {"until": ["\n"]})
        is_unanswerable = rf.loglikelihood(ctx, " " + "unanswerable")
        return continuation, is_unanswerable
......
@@ -184,7 +184,7 @@ class GeneralTranslationTask(Task):
        language description, as well as the few shot examples, and the question
        part of the document for `doc`.
        """
        return rf.greedy_until(ctx, {"until": ["\n"]})

    def process_results(self, doc, results):
        # Add spaces between words for BLEU score calculation of target languages like Chinese
......