conflict changed

Commit 7604b873 authored Jun 03, 2023 by cardy20
20 changed files
--- a/.gitignore
+++ b/.gitignore
@@ -3,3 +3,7 @@ env
 data/
 lm_cache
 .idea
+build/
+logs/
+output/
+lm_eval.egg-info/
\ No newline at end of file
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -32,7 +32,7 @@ repos:
    rev: 22.3.0
    hooks:
      - id: black
-        language_version: python3.8
+        language_version: python3.9
  - repo: https://github.com/codespell-project/codespell
    rev: v2.1.0
    hooks:

--- a/README.md
+++ b/README.md
 # Language Model Evaluation Harness
-![](https://github.com/EleutherAI/lm-evaluation-harness/workflows/Build/badge.svg)
-[![codecov](https://codecov.io/gh/EleutherAI/lm-evaluation-harness/branch/master/graph/badge.svg?token=JSG3O2427J)](https://codecov.io/gh/EleutherAI/lm-evaluation-harness)
 ## Overview
 This project provides a unified framework to test generative language models on a large number of different evaluation tasks.
@@ -10,9 +7,11 @@ This project provides a unified framework to test generative language models on
 Features:
 - 200+ tasks implemented. See the [task-table](./docs/task_table.md) for a complete list.
- Support for the Hugging Face `transformers` library, GPT-NeoX, Megatron-DeepSpeed, and the OpenAI API, with flexible tokenization-agnostic interface.
+- Support for models loaded via [transformers](https://github.com/huggingface/transformers/), [GPT-NeoX](https://github.com/EleutherAI/gpt-neox), and [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/), with a flexible tokenization-agnostic interface.
+- Support for commercial APIs including [OpenAI](https://openai.com), [goose.ai](https://goose.ai), and [TextSynth](https://textsynth.com/).
 - Support for evaluation on adapters (e.g. LoRa) supported in [HuggingFace's PEFT library](https://github.com/huggingface/peft).
- Task versioning to ensure reproducibility.
+- Evaluating with publicly available prompts ensures reproducibility and comparability between papers.
+- Task versioning to ensure reproducibility when tasks are updated.
 ## Install
@@ -34,14 +33,16 @@ pip install -e ".[multilingual]"
 > **Note**: When reporting results from eval harness, please include the task versions (shown in `results["versions"]`) for reproducibility. This allows bug fixes to tasks while also ensuring that previously reported scores are reproducible. See the [Task Versioning](#task-versioning) section for more info.
-To evaluate a model hosted on the [HuggingFace Hub](https://huggingface.co/models) (e.g. GPT-J-6B) on tasks with names matching the pattern `lambada_*` and `hellaswag` you can use the following command:
+### Hugging Face `transformers`
+To evaluate a model hosted on the [HuggingFace Hub](https://huggingface.co/models) (e.g. GPT-J-6B) on `hellaswag` you can use the following command:
 ```bash
 python main.py \
    --model hf-causal \
    --model_args pretrained=EleutherAI/gpt-j-6B \
-    --tasks lambada_*,hellaswag \
+    --tasks hellaswag \
    --device cuda:0
 ```
@@ -59,16 +60,9 @@ To evaluate models that are loaded via `AutoSeq2SeqLM` in Huggingface, you inste
 > **Warning**: Choosing the wrong model may result in erroneous outputs despite not erroring.
-To use with [PEFT](https://github.com/huggingface/peft), take the call you would run to evaluate the base model and add `,peft=PATH` to the `model_args` argument as shown below:
+### Commercial APIs
-```bash
-python main.py \
-    --model hf-causal-experimental \
-    --model_args pretrained=EleutherAI/gpt-j-6b,peft=nomic-ai/gpt4all-j-lora \
-    --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq \
-    --device cuda:0
-```
-Our library also supports the OpenAI API:
+Our library also supports language models served via the OpenAI API:
 ```bash
 export OPENAI_API_SECRET_KEY=YOUR_KEY_HERE
@@ -90,7 +84,9 @@ python main.py \
    --check_integrity
 ```
-To evaluate mesh-transformer-jax models that are not available on HF, please invoke eval harness through [this script](https://github.com/kingoflolz/mesh-transformer-jax/blob/master/eval_harness.py).
+### Other Frameworks
+A number of other libraries contain scripts for calling the eval harness through their library. These include [GPT-NeoX](https://github.com/EleutherAI/gpt-neox/blob/main/eval_tasks/eval_adapter.py), [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples/MoE/readme_evalharness.md), and [mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/blob/master/eval_harness.py).
 💡 **Tip**: You can inspect what the LM inputs look like by running the following command:
@@ -104,6 +100,22 @@ python write_out.py \
 This will write out one text file for each task.
+## Advanced Usage
+For models loaded with the HuggingFace `transformers` library, any arguments provided via `--model_args` are passed directly to the relevant constructor. This means that anything you can do with `AutoModel` can be done with our library. For example, you can pass a local path via `pretrained=`, or use models finetuned with [PEFT](https://github.com/huggingface/peft) by taking the call you would run to evaluate the base model and adding `,peft=PATH` to the `model_args` argument:
+```bash
+python main.py \
+    --model hf-causal-experimental \
+    --model_args pretrained=EleutherAI/gpt-j-6b,peft=nomic-ai/gpt4all-j-lora \
+    --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq \
+    --device cuda:0
+```
+We support wildcards in task names; for example, you can run all of the machine-translated lambada tasks via `--tasks lambada_openai_mt_*`.
+We currently only support one prompt per task, which we strive to make the "standard" as defined by the benchmark's authors. If you would like to study how varying prompts causes changes in the evaluation score, check out the [BigScience fork](https://github.com/bigscience-workshop/lm-evaluation-harness) of this repo. We are currently working on upstreaming this capability to `main`.
 ## Implementing new tasks
 To implement a new task in the eval harness, see [this guide](./docs/task_guide.md).
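Note on the README hunk above: the wildcard support it describes can be pictured as shell-style glob matching against the list of registered task names. The sketch below is a minimal illustration using the standard-library `fnmatch` module. The function `match_tasks` and the short `ALL_TASKS` list are hypothetical stand-ins, not the harness's own code or its full registry.

```python
import fnmatch

# Hypothetical task registry for illustration; the real harness has many more names.
ALL_TASKS = [
    "hellaswag",
    "lambada_openai",
    "lambada_openai_mt_de",
    "lambada_openai_mt_fr",
    "lambada_standard",
]


def match_tasks(patterns, available):
    """Expand a comma-separated string of glob patterns into sorted task names."""
    selected = set()
    for pattern in patterns.split(","):
        # fnmatch.filter keeps every name that matches the shell-style pattern.
        selected.update(fnmatch.filter(available, pattern))
    return sorted(selected)


print(match_tasks("lambada_openai_mt_*,hellaswag", ALL_TASKS))
```

When passing a pattern on the command line, quoting it (for example `--tasks "lambada_openai_mt_*"`) keeps the shell from expanding the `*` against local file names before the program sees it.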

--- a/docs/task_guide.md
+++ b/docs/task_guide.md
@@ -271,6 +271,19 @@ python main.py \
 	--num_fewshot K
 ```
+### Checking the Model Outputs
+The `write_out.py` script mentioned previously can be used to verify that the prompts look as intended. If you also want to save model outputs, you can use the `--write_out` parameter in `main.py` to dump JSON with prompts and completions. The output path can be chosen with `--output_base_path`. This is helpful for debugging and for exploring model outputs.
+```sh
+python main.py \
+	--model gpt2 \
+	--model_args device=<device-name> \
+	--tasks <task-name> \
+	--num_fewshot K \
+    --write_out \
+    --output_base_path <path>
+```
 ### Running Unit Tests
 To run the entire test suite, use:

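Note on the task guide hunk above: going by the `evaluator.py` changes in this commit, `--write_out` produces one `<task_name>_write_out_info.json` file per task, holding a list of per-document records. The sketch below shows one way such a file might be loaded and summarized offline. The helper `load_write_out`, the task name `demo_task`, and the sample records are all made up for illustration; the exact keys in a real file depend on the task and its metrics.

```python
import json
import pathlib
import tempfile


def load_write_out(output_base_path, task_name):
    """Load the per-document records written for one task."""
    path = pathlib.Path(output_base_path) / f"{task_name}_write_out_info.json"
    with open(path, encoding="utf8") as fp:
        return json.load(fp)


# Made-up records in roughly the shape built in evaluator.py (demonstration only).
# Metric values are stored as strings, matching the str(value) call in the diff.
sample = [
    {"doc_id": 0, "prompt_0": "Question: A\nAnswer: yes", "truth": 1, "acc": "1.0"},
    {"doc_id": 1, "prompt_0": "Question: B\nAnswer: no", "truth": 0, "acc": "0.0"},
]

with tempfile.TemporaryDirectory() as tmp:
    out = pathlib.Path(tmp) / "demo_task_write_out_info.json"
    out.write_text(json.dumps(sample, indent=4, ensure_ascii=False), encoding="utf8")

    records = load_write_out(tmp, "demo_task")
    # Convert the stringified metric back to a number before averaging.
    mean_acc = sum(float(r["acc"]) for r in records) / len(records)
    print(len(records), mean_acc)
```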
--- a/docs/task_table.md
+++ b/docs/task_table.md
--- a/ignore.txt
+++ b/ignore.txt
 ROUGE
 rouge
 nin
+maka
+mor
+te
--- a/lm_eval/base.py
+++ b/lm_eval/base.py
@@ -190,14 +190,19 @@ class BaseLM(LM):
        # automatic batch size detection for vectorization
        adaptive_batch_size = None
-        if self.batch_size == 'auto': 
+        if self.batch_size == "auto":
            # using rolling window with maximum context
-            print('Passed argument batch_size = auto. Detecting largest batch size')
+            print("Passed argument batch_size = auto. Detecting largest batch size")
-            @find_executable_batch_size(starting_batch_size=512) # if OOM, then halves batch_size and tries again
+            @find_executable_batch_size(
+                starting_batch_size=512
+            )  # if OOM, then halves batch_size and tries again
            def forward_batch(batch_size):
-                test_batch = torch.ones((batch_size, self.max_length), device=self.device).long()
+                test_batch = torch.ones(
+                    (batch_size, self.max_length), device=self.device
+                ).long()
                for _ in range(5):
-                    out = F.log_softmax(self._model_call(test_batch), dim = -1).cpu()
+                    _ = F.log_softmax(self._model_call(test_batch), dim=-1).cpu()
                return batch_size
            batch_size = forward_batch()
@@ -223,7 +228,9 @@ class BaseLM(LM):
            # TODO: extract out this call so it only gets called once and also somehow figure out partial caching for
            # that
            string_nll = self._loglikelihood_tokens(
-                rolling_token_windows, disable_tqdm=True, override_bs = adaptive_batch_size
+                rolling_token_windows,
+                disable_tqdm=True,
+                override_bs=adaptive_batch_size,
            )
            # discard is_greedy
@@ -234,7 +241,7 @@ class BaseLM(LM):
        return loglikelihoods
-    def _loglikelihood_tokens(self, requests, disable_tqdm=False, override_bs = None):
+    def _loglikelihood_tokens(self, requests, disable_tqdm=False, override_bs=None):
        # TODO: implement some kind of efficient-request-middleware that lumps together requests with the same context
        res = []
@@ -249,11 +256,11 @@ class BaseLM(LM):
            toks = x[1] + x[2]
            return -len(toks), tuple(toks)
        re_ord = utils.Reorderer(requests, _collate)
        # automatic (variable) batch size detection for vectorization
        # pull longest context sample from request
+        if len(re_ord.get_reordered()) > 0:
            _, context_enc, continuation_enc = re_ord.get_reordered()[0]
            max_context = len((context_enc + continuation_enc)[-(self.max_length + 1) :][:-1])
            if (self.batch_size == 'auto'):
@@ -273,9 +280,12 @@ class BaseLM(LM):
                else:
                    adaptive_batch_size = override_bs
+        else:
+            adaptive_batch_size = 0 if override_bs is None else override_bs
        for chunk in utils.chunks(
-            tqdm(re_ord.get_reordered(), disable=disable_tqdm), self.batch_size if self.batch_size != "auto" else adaptive_batch_size
+            tqdm(re_ord.get_reordered(), disable=disable_tqdm),
+            self.batch_size if self.batch_size != "auto" else adaptive_batch_size,
        ):
            inps = []
            cont_toks_list = []
@@ -382,7 +392,7 @@ class BaseLM(LM):
        re_ord = utils.Reorderer(requests, _collate)
        for context, request_args in tqdm(re_ord.get_reordered()):
-            until = request_args['until']
+            until = request_args["until"]
            if isinstance(until, str):
                until = [until]
@@ -396,7 +406,7 @@ class BaseLM(LM):
            ).to(self.device)
            max_gen_tokens = min(
-                self.max_gen_toks, request_args.get('max_length', self.max_gen_toks)
+                self.max_gen_toks, request_args.get("max_length", self.max_gen_toks)
            )
            cont = self._model_generate(
                context_enc, context_enc.shape[1] + max_gen_tokens, primary_until
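Note on the `base.py` hunks above: the comment on `find_executable_batch_size` says that on OOM the batch size is halved and the call retried. The sketch below illustrates that strategy in plain Python, assuming nothing about the decorator's actual implementation. `find_largest_batch_size` and `fake_forward` are hypothetical names, and `MemoryError` stands in for a CUDA out-of-memory error.

```python
def find_largest_batch_size(try_batch, starting_batch_size=512):
    """Halve the batch size until try_batch succeeds, then return that size."""
    batch_size = starting_batch_size
    while batch_size > 0:
        try:
            try_batch(batch_size)
            return batch_size
        except MemoryError:
            # Out of memory: retry with half as many rows.
            batch_size //= 2
    raise RuntimeError("no batch size fits in memory")


def fake_forward(batch_size, capacity=100):
    # Stand-in for a real forward pass: anything above `capacity` rows "runs out of memory".
    if batch_size > capacity:
        raise MemoryError(f"batch of {batch_size} does not fit")


print(find_largest_batch_size(fake_forward))  # tries 512, 256, 128, then succeeds at 64
```

Because the search only halves, it can settle below the true maximum (64 here, although 100 would fit). That trades some throughput for a small, bounded number of probe runs.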

--- a/lm_eval/datasets/bigbench_resources/date_understanding.json
+++ b/lm_eval/datasets/bigbench_resources/date_understanding.json
--- a/lm_eval/datasets/bigbench_resources/snarks.json
+++ b/lm_eval/datasets/bigbench_resources/snarks.json
--- a/lm_eval/datasets/kold/kold.py
+++ b/lm_eval/datasets/kold/kold.py
+# coding=utf-8
+# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Korean Offensive Language Dataset"""
+import json
+import datasets
+_CITATION = """\
+@InProceedings{jeong-etal-2022-kold,
+    title = "{KOLD}: {K}orean Offensive Language Dataset",
+    author = "Jeong, Younghoon  and
+      Oh, Juhyun  and
+      Lee, Jongwon  and
+      Ahn, Jaimeen  and
+      Moon, Jihyung  and
+      Park, Sungjoon  and
+      Oh, Alice",
+    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
+    month = dec,
+    year = "2022",
+    address = "Abu Dhabi, United Arab Emirates",
+    publisher = "Association for Computational Linguistics",
+    url = "https://aclanthology.org/2022.emnlp-main.744",
+    pages = "10818--10833",
+    abstract = "Recent directions for offensive language detection are hierarchical modeling, identifying the type and the target of offensive language, and interpretability with offensive span annotation and prediction. These improvements are focused on English and do not transfer well to other languages because of cultural and linguistic differences. In this paper, we present the Korean Offensive Language Dataset (KOLD) comprising 40,429 comments, which are annotated hierarchically with the type and the target of offensive language, accompanied by annotations of the corresponding text spans. We collect the comments from NAVER news and YouTube platform and provide the titles of the articles and videos as the context information for the annotation process. We use these annotated comments as training data for Korean BERT and RoBERTa models and find that they are effective at offensiveness detection, target classification, and target span detection while having room for improvement for target group classification and offensive span detection. We discover that the target group distribution differs drastically from the existing English datasets, and observe that providing the context information improves the model performance in offensiveness detection (+0.3), target classification (+1.5), and target group classification (+13.1). We publicly release the dataset and baseline models.",
+}
+"""
+_DESCRIPTION = """\
+They present the Korean Offensive Language Dataset (KOLD) comprising 40,429 comments, which are annotated hierarchically with the type and the target of offensive language, accompanied by annotations of the corresponding text spans. 
+They collect the comments from NAVER news and YouTube platform and provide the titles of the articles and videos as the context information for the annotation process. 
+"""
+_HOMEPAGE = "https://github.com/boychaboy/KOLD"
+_LICENSE = "CC0 1.0 Universal (CC0 1.0)"
+_URLs = "https://raw.githubusercontent.com/Gun1Yun/KOLD/main/data/kold_v1.json"
+# TODO: Name of the dataset usually match the script name with CamelCase instead of snake_case
+class KOLD(datasets.GeneratorBasedBuilder):
+    """Korean Offensive Language Dataset."""
+    VERSION = datasets.Version("1.1.0")
+    def _info(self):
+        return datasets.DatasetInfo(
+            description=_DESCRIPTION,
+            features=datasets.Features(
+                {
+                    "id": datasets.Value("string"),
+                    "title": datasets.Value("string"),
+                    "comment": datasets.Value("string"),
+                    "off": datasets.ClassLabel(names=["False", "True"]),
+                    "tgt": datasets.ClassLabel(names=["None", 'group', 'individual', 'other', 'untargeted'])
+                    # "GRP": datasets.ClassLabel(names=["None", "ohters"]),
+                }
+            ),
+            supervised_keys=None,
+            homepage=_HOMEPAGE,
+            license=_LICENSE,
+            citation=_CITATION,
+        )
+    def _split_generators(self, dl_manager):
+        downloaded_files = dl_manager.download_and_extract(_URLs)
+        return [
+            datasets.SplitGenerator(
+                name=datasets.Split.TEST,
+                gen_kwargs={
+                    "filepath": downloaded_files,
+                    "split": "test",
+                },
+            ),
+        ]
+    def _generate_examples(self, filepath, split):
+        with open(filepath, "r") as f:
+            data = json.loads(f.read())
+            for id_, row in enumerate(data):
+                yield id_, {
+                    "id": row["guid"],
+                    "title": row["title"],
+                    "comment": row["comment"],
+                    "off": int(row["OFF"]),  
+                    "tgt": row["TGT"],
+                    # "grp": row["GRP"] 
+                }
\ No newline at end of file
--- a/lm_eval/evaluator.py
+++ b/lm_eval/evaluator.py
@@ -23,8 +23,9 @@ def simple_evaluate(
    description_dict=None,
    check_integrity=False,
    decontamination_ngrams_path=None,
+    write_out=False,
+    output_base_path=None,
 ):
    """Instantiate and evaluate a model on a list of tasks.
    :param model: Union[str, LM]
@@ -42,14 +43,18 @@ def simple_evaluate(
        PyTorch device (e.g. "cpu" or "cuda:0") for running models
    :param no_cache: bool
        Whether or not to cache
-    :param limit: int, optional
+    :param limit: int or float, optional
-        Limit the number of examples per task (only use this for testing)
+        Limit the number of examples per task (only use this for testing). If <1, limit is treated as a fraction of the total number of examples.
    :param bootstrap_iters:
        Number of iterations for bootstrap statistics
    :param description_dict: dict[str, str]
        Dictionary of custom task descriptions of the form: `task_name: description`
    :param check_integrity: bool
        Whether to run the relevant part of the test suite for the tasks
+    :param write_out: bool
+        If True, write details about prompts and logits to json for all tasks
+    :param output_base_path: str, optional
+        Directory to which detailed eval info will be written. Defaults to present working dir.
    :return
        Dictionary of results
    """
@@ -91,6 +96,8 @@ def simple_evaluate(
        bootstrap_iters=bootstrap_iters,
        description_dict=description_dict,
        decontamination_ngrams_path=decontamination_ngrams_path,
+        write_out=write_out,
+        output_base_path=output_base_path,
    )
    # add info about the model and few shot config
@@ -122,6 +129,8 @@ def evaluate(
    bootstrap_iters=100000,
    description_dict=None,
    decontamination_ngrams_path=None,
+    write_out=False,
+    output_base_path=None,
 ):
    """Instantiate and evaluate a model on a list of tasks.
@@ -139,6 +148,10 @@ def evaluate(
        Number of iterations for bootstrap statistics
    :param description_dict: dict[str, str]
        Dictionary of custom task descriptions of the form: `task_name: description`
+    :param write_out: bool
+        If True, write all prompts, logits and metrics to json for offline analysis
+    :param output_base_path: str, optional
+        Directory to which detailed eval info will be written. Defaults to present working dir
    :return
        Dictionary of results
    """
@@ -175,6 +188,7 @@ def evaluate(
    # TODO: we need unit tests & sanity checks or something to ensure that the return of `validation_docs` is stable
    docs = {}
+    write_out_info = {}
    docs_for_decontamination = collections.defaultdict(list)
@@ -197,15 +211,20 @@ def evaluate(
        rnd = random.Random()
        rnd.seed(42)
        rnd.shuffle(task_docs)
+        print(f"Task: {task_name}; number of docs: {len(task_docs)}")
+        if write_out:
+            prompt_details = []
        description = (
            description_dict[task_name]
            if description_dict and task_name in description_dict
            else ""
        )
+        if limit is not None:
+            limit = int(len(task_docs) * limit) if limit < 1.0 else int(limit)
        for doc_id, doc in enumerate(itertools.islice(task_docs, 0, limit)):
            if decontaminate and task.should_decontaminate():
                docs_for_decontamination[(task_name, task_set)].append(
                    task.doc_to_decontamination_query(doc)
@@ -216,6 +235,17 @@ def evaluate(
                doc=doc, num_fewshot=num_fewshot, rnd=rnd, description=description
            )
            reqs = task.construct_requests(doc, ctx)
+            if write_out:
+                prompt_details.append({"doc_id": doc_id})
+            # print the prompt for the first few documents
+            if doc_id < 1:
+                print(
+                    f"Task: {task_name}; document {doc_id}; context prompt (starting on next line):\n{ctx}\n(end of prompt on previous line)"
+                )
+                print("Requests:", reqs)
            if not isinstance(reqs, (list, tuple)):
                reqs = [reqs]
            for i, req in enumerate(reqs):
@@ -224,6 +254,14 @@ def evaluate(
                # doc_id: unique id that we can get back to a doc using `docs`
                requests_origin[req.request_type].append((i, task_name, doc, doc_id))
+                if write_out:
+                    prompt_details[-1][f"prompt_{i}"] = "".join(
+                        (map(lambda x: "".join(x), req.args))
+                    )
+        if write_out:
+            write_out_info[task_name] = prompt_details
    # Compare all tasks/sets at once to ensure a single training set scan
    if decontaminate:
        from lm_eval.decontamination.decontaminate import get_train_overlap
@@ -252,6 +290,18 @@ def evaluate(
        for resp, (i, task_name, doc, doc_id) in zip(resps, requests_origin[reqtype]):
            process_res_queue[(task_name, doc_id)].append((i, resp))
+            if write_out:
+                write_out_info[task_name][doc_id][f"logit_{i}"] = resp
+                task = task_dict[task_name]
+                if isinstance(task, lm_eval.base.MultipleChoiceTask):
+                    write_out_info[task_name][doc_id]["truth"] = doc["gold"]
+                elif isinstance(task, lm_eval.tasks.winogrande.Winogrande):
+                    write_out_info[task_name][doc_id]["truth"] = task.answer_to_num[
+                        doc["answer"]
+                    ]
+                else:
+                    write_out_info[task_name][doc_id]["truth"] = task.doc_to_target(doc)
    vals = collections.defaultdict(list)
    # unpack results and sort back in order and return control to Task
@@ -266,6 +316,9 @@ def evaluate(
        for metric, value in metrics.items():
            vals[(task_name, metric)].append(value)
+            if write_out:
+                write_out_info[task_name][doc_id][metric] = str(value)
            # Re-use the evaluation for the decontaminated set by just ignoring the overlaps
            if decontaminate and task_name in overlaps:
                if doc_id not in overlaps[task_name]:
@@ -294,6 +347,28 @@ def evaluate(
        if stderr is not None:
            results[task_name][metric + "_stderr"] = stderr(items)
+    if write_out:
+        import json
+        import pathlib
+        output_base_path = (
+            pathlib.Path(output_base_path)
+            if output_base_path is not None
+            else pathlib.Path(".")
+        )
+        try:
+            output_base_path.mkdir(parents=True, exist_ok=False)
+        except FileExistsError:
+            pass
+        for task_name, _ in task_dict_items:
+            with open(
+                output_base_path.joinpath(f"{task_name}_write_out_info.json"),
+                "w",
+                encoding="utf8",
+            ) as fp:
+                json.dump(write_out_info[task_name], fp, indent=4, ensure_ascii=False)
    return {"results": dict(results), "versions": dict(versions)}
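Note on the `limit` handling in the `evaluator.py` hunks above: the docstring now allows either an absolute count or a value below 1 that is read as a fraction of each task's documents. A minimal sketch of that rule as a standalone helper follows. `resolve_limit` and the task sizes are hypothetical, chosen only to show the arithmetic.

```python
def resolve_limit(limit, num_docs):
    """Return how many documents to evaluate, or None when no limit is set."""
    if limit is None:
        return None
    # Values below 1 are a fraction of the task's documents; otherwise an absolute count.
    return int(num_docs * limit) if limit < 1.0 else int(limit)


requested = 0.5  # the user's original setting, left untouched between tasks
for name, num_docs in [("task_a", 1000), ("task_b", 50)]:
    print(name, resolve_limit(requested, num_docs))
```

Keeping the requested value in its own variable means each task derives its count from the original fraction. The hunk above assigns the result back to `limit` inside the per-task loop, so it may be worth checking whether tasks after the first still see the fraction rather than the first task's resolved count.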

--- a/lm_eval/models/gpt2.py
+++ b/lm_eval/models/gpt2.py
@@ -3,6 +3,7 @@ import transformers
 from typing import Optional
 from lm_eval.base import BaseLM
 class HFLM(BaseLM):
    def __init__(
        self,
@@ -20,9 +21,11 @@ class HFLM(BaseLM):
        assert isinstance(device, str)
        assert isinstance(pretrained, str)
-        assert isinstance(batch_size, (int,str))
+        assert isinstance(batch_size, (int, str))
-        device_list = set(["cuda", "cpu"] + [f'cuda:{i}' for i in range(torch.cuda.device_count())])
+        device_list = set(
+            ["cuda", "cpu"] + [f"cuda:{i}" for i in range(torch.cuda.device_count())]
+        )
        if device and device in device_list:
            self._device = torch.device(device)
            print(f"Using device '{device}'")
@@ -49,6 +52,7 @@ class HFLM(BaseLM):
        self.tokenizer = transformers.AutoTokenizer.from_pretrained(
            pretrained if tokenizer is None else tokenizer,
+<<<<<<< HEAD
 <<<<<<< HEAD
            revision=revision + ("/" + subfolder if subfolder is not None else ""))
@@ -71,6 +75,8 @@ class HFLM(BaseLM):
        # if gpus > 1:
        #     self.gpt2 = nn.DataParallel(self.gpt2)
 =======
+=======
+>>>>>>> e8f38aee79569d51bd6c84f23f4227771291a816
            revision=revision,
            trust_remote_code=trust_remote_code,
        )
@@ -88,11 +94,15 @@ class HFLM(BaseLM):
            ], self.tokenizer.encode("hello\n\nhello")
        # setup for automatic batch size detection
-        if batch_size == 'auto': 
+        if batch_size == "auto":
            self.batch_size_per_gpu = batch_size
        else:
+<<<<<<< HEAD
            self.batch_size_per_gpu = int(batch_size) 
 >>>>>>> 0542d35d5e56768dd9041ef9b88b90256970d843
+=======
+            self.batch_size_per_gpu = int(batch_size)
+>>>>>>> e8f38aee79569d51bd6c84f23f4227771291a816
    @property
    def eot_token_id(self):
@@ -139,9 +149,10 @@ class HFLM(BaseLM):
            return self.gpt2(inps)[0]
    def _model_generate(self, context, max_length, eos_token_id):
-        generation_kwargs = {'do_sample': False, 'max_length': max_length}
+        generation_kwargs = {"do_sample": False, "max_length": max_length}
        if eos_token_id is not None:
            generation_kwargs['eos_token_id'] = eos_token_id
+            generation_kwargs['pad_token_id'] = eos_token_id # setting eos_token_id as pad token
        return self.gpt2.generate(context, **generation_kwargs)

--- a/lm_eval/models/huggingface.py
+++ b/lm_eval/models/huggingface.py
@@ -72,7 +72,7 @@ class HuggingFaceAutoLM(BaseLM):
        tokenizer: Optional[str] = None,
        subfolder: Optional[str] = None,
        revision: Optional[str] = "main",
-        batch_size: Optional[Union[int,str]] = 1,
+        batch_size: Optional[Union[int, str]] = 1,
        max_gen_toks: Optional[int] = 256,
        max_length: Optional[int] = None,
        add_special_tokens: Optional[bool] = None,
@@ -159,7 +159,7 @@ class HuggingFaceAutoLM(BaseLM):
            ), "Evaluating causal models with `add_special_tokens=True` is currently not supported."
        # setup for automatic batch size detection
-        if batch_size == 'auto': 
+        if batch_size == "auto":
            self._batch_size = batch_size
        else:
            self._batch_size = int(batch_size)
@@ -369,7 +369,9 @@ class HuggingFaceAutoLM(BaseLM):
    def tok_decode(self, tokens: torch.LongTensor) -> List[str]:
        return self.tokenizer.batch_decode(tokens, skip_special_tokens=True)
-    def greedy_until(self, requests: List[Tuple[str, Union[List[str], str]]]) -> List[str]:
+    def greedy_until(
+        self, requests: List[Tuple[str, Union[List[str], str]]]
+    ) -> List[str]:
        def _collate(x):
            tokens = self.tok_encode(x[0])
            return len(tokens), x[0]
@@ -378,14 +380,19 @@ class HuggingFaceAutoLM(BaseLM):
        reorder = utils.Reorderer(requests, _collate)
        adaptive_batch_size = None
-        if self.batch_size == 'auto': 
+        if self.batch_size == "auto":
            # using rolling window with maximum context
-            print('Passed argument batch_size = auto. Detecting largest batch size')
+            print("Passed argument batch_size = auto. Detecting largest batch size")
-            @find_executable_batch_size(starting_batch_size=512) # if OOM, then halves batch_size and tries again
+            @find_executable_batch_size(
+                starting_batch_size=512
+            )  # if OOM, then halves batch_size and tries again
            def forward_batch(batch_size):
-                test_batch = torch.ones((batch_size, self.max_length), device=self.device).long()
+                test_batch = torch.ones(
+                    (batch_size, self.max_length), device=self.device
+                ).long()
                for _ in range(5):
-                    out = F.log_softmax(self._model_call(test_batch), dim = -1).cpu()
+                    _ = F.log_softmax(self._model_call(test_batch), dim=-1).cpu()
                return batch_size
            batch_size = forward_batch()
@@ -393,11 +400,12 @@ class HuggingFaceAutoLM(BaseLM):
            adaptive_batch_size = batch_size
        for chunk in utils.chunks(
-            tqdm(reorder.get_reordered(), disable=False), self.batch_size if self.batch_size != "auto" else adaptive_batch_size
+            tqdm(reorder.get_reordered(), disable=False),
+            self.batch_size if self.batch_size != "auto" else adaptive_batch_size,
        ):
            context = [c[0] for c in chunk]
            request_args = chunk[0][1]
-            stop = request_args.get('until', None)
+            stop = request_args.get("until", None)
            stop_sequences = stop if isinstance(stop, list) else [stop]
            max_generation_length = request_args.get("max_length", None)

--- a/lm_eval/models/textsynth.py
+++ b/lm_eval/models/textsynth.py
@@ -124,7 +124,7 @@ class TextSynthLM(BaseLM):
        for request in tqdm(requests):
            inp = request[0]
            request_args = request[1]
-            until = request_args['until']
+            until = request_args["until"]
            response = textsynth_completion(
                url=self.api_url + "/v1/engines/" + self.engine + "/completions",
                headers={"Authorization": "Bearer " + self.api_key},

--- a/lm_eval/tasks/bigbench.py
+++ b/lm_eval/tasks/bigbench.py
--- a/lm_eval/tasks/coqa.py
+++ b/lm_eval/tasks/coqa.py
@@ -141,7 +141,7 @@ class CoQA(Task):
            language description, as well as the few shot examples, and the question
            part of the document for `doc`.
        """
-        cont_request = rf.greedy_until(ctx, {'until': ["\nQ:"]})
+        cont_request = rf.greedy_until(ctx, {"until": ["\nQ:"]})
        return cont_request
    def process_results(self, doc, results):

--- a/lm_eval/tasks/drop.py
+++ b/lm_eval/tasks/drop.py
@@ -134,7 +134,7 @@ class DROP(Task):
            language description, as well as the few shot examples, and the question
            part of the document for `doc`.
        """
-        conts = [rf.greedy_until(ctx, {'until': ["."]})]
+        conts = [rf.greedy_until(ctx, {"until": ["."]})]
        return conts
    def process_results(self, doc, results):

--- a/lm_eval/tasks/gsm8k.py
+++ b/lm_eval/tasks/gsm8k.py
@@ -79,7 +79,7 @@ class GradeSchoolMath8K(Task):
        """
        # NOTE: The paper implements "verifiers" that assign a score to multiple
        # solutions and output the highest ranked solution.
-        completion = rf.greedy_until(ctx, {'until': ["\n"]})
+        completion = rf.greedy_until(ctx, {"until": [":", "Question:", "Question"]})
        return completion
    def _extract_answer(self, completion):

--- a/lm_eval/tasks/hendrycks_math.py
+++ b/lm_eval/tasks/hendrycks_math.py
@@ -63,7 +63,7 @@ class Math(Task):
        return " " + doc["solution"]
    def construct_requests(self, doc, ctx):
-        return rf.greedy_until(ctx, {'until': ["\n"]})
+        return rf.greedy_until(ctx, {"until": ["\n"]})
    def process_results(self, doc, results):
        retval = 0

--- a/lm_eval/tasks/json.py
+++ b/lm_eval/tasks/json.py
+import datasets
+from lm_eval.base import PerplexityTask
+from lm_eval.utils import escaped_split
+class JsonPerplexity(PerplexityTask):
+    VERSION = 0
+    DATASET_NAME = "json"
+    def __init__(self, data_dir=None, cache_dir=None, download_mode=None):
+        """
+        :param data_dir: str
+            Use this to specify the path to manually downloaded JSON test data.
+            This also needs to include the split key and text key for the data
+            in the following format:
+            ```
+            split:text:/absolute/path/to/data.json
+            ```
+            If you do not have splits inside the JSON file, it should be "train".
+            Colons in the split or text key can be escaped by backslashes.
+        :param cache_dir: str
+            The directory to read/write the `Task` dataset. This follows the
+            HuggingFace `datasets` API with the default cache directory located at:
+                `~/.cache/huggingface/datasets`
+            NOTE: You can change the cache location globally for a given process
+            by setting the shell environment variable, `HF_DATASETS_CACHE`,
+            to another directory:
+                `export HF_DATASETS_CACHE="/path/to/another/directory"`
+        :param download_mode: datasets.DownloadMode
+            How to treat pre-existing `Task` downloads and data.
+            - `datasets.DownloadMode.REUSE_DATASET_IF_EXISTS`
+                Reuse download and reuse dataset.
+            - `datasets.DownloadMode.REUSE_CACHE_IF_EXISTS`
+                Reuse download with fresh dataset.
+            - `datasets.DownloadMode.FORCE_REDOWNLOAD`
+                Fresh download and fresh dataset.
+        """
+        self._split, self._key, data_file = escaped_split(data_dir, ":", 2)
+        self.load(data_file)
+        self._training_docs = None
+        self._fewshot_docs = None
+    def download(self, data_dir=None, cache_dir=None, download_mode=None):
+        raise TypeError("cannot download an arbitrary JSON dataset")
+    def load(self, data_file):
+        self.dataset = datasets.load_dataset("json", data_files=data_file)
+    def has_validation_docs(self):
+        return False
+    def has_test_docs(self):
+        return True
+    def test_docs(self):
+        return map(self._process_doc, self.dataset[self._split])
+    def _process_doc(self, doc):
+        return doc[self._key]
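Note on `JsonPerplexity` above: the docstring describes `data_dir` as `split:text:/absolute/path/to/data.json`, with colons in the split or text key escapable by backslashes. The sketch below shows one way such a string could be parsed, splitting on at most two unescaped colons so that colons in the path survive. `split_data_dir` is a hypothetical helper written for this note and is not the `escaped_split` function from `lm_eval.utils`; in particular, it leaves the escaping backslashes in place.

```python
import re


def split_data_dir(data_dir):
    """Split 'split:key:path' on the first two colons not preceded by a backslash."""
    # The negative lookbehind (?<!\\) skips colons that were escaped with a backslash.
    parts = re.split(r"(?<!\\):", data_dir, maxsplit=2)
    if len(parts) != 3:
        raise ValueError("expected the form split:key:/path/to/data.json")
    return parts


print(split_data_dir("train:text:/absolute/path/to/data.json"))
print(split_data_dir(r"train:my\:key:/data/file.json"))
```

Limiting the split to two separators means everything after the second unescaped colon is treated as the file path, whatever characters it contains.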