"vscode:/vscode.git/clone" did not exist on "7b46fdb5b0259ac8dbf8429a7f17d7657d9b8620"
Unverified commit 1e6c9272, authored by Aaron V, committed by GitHub

Create a means for caching task registration and request building. Add the ability to specify an args dict for simple_evaluate(). (#1372)



* Create a means for caching task registration and request building. Add the ability to specify an args dict for simple_evaluate().

* Remove extra S in cache path in caching module
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Rename requests cache args, make model_args polymorphic so that a dict can also be accepted.

* Update docs to reflect new caching behavior, add CLI args for requests caching. Create a function for deleting items in the cache.

* Update documentation, fix minor bug with arg parsing for requests caching where an undefined variable was used.

* Remove line from gitignore, add to cli for caching datasets.

* Add hashing suffix to .pickles. Update test script typo.

* Favor isinstance() over type() in evaluator.py

* Add tests for caching, get tests working, remove unneeded arg from build_all_requests().

* Update arg description to simple_evaluate.

* Update pyproject.toml

* Fix typehint

* Remove the use of random() for creating default cache pickle hash.

* Check that cache dir exists before clearing it in request cache tests.

* Fix linting problems.

* Fix additional formatting errors.

* Remove trailing whitespace.

* Add new line to the end of .gitignore.

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
parent f6befdb9
@@ -16,5 +16,8 @@ temp
# IPython
profile_default/
ipython_config.py
# don't track (the default location of) the cached requests
lm_eval/caching/.cache
# don't track files created by wandb
wandb
examples/wandb
@@ -10,41 +10,43 @@ Equivalently, running the library can be done via the `lm-eval` entrypoint at th
This mode supports a number of command-line arguments, the details of which can also be seen via running with `-h` or `--help`:

- `--model` : Selects which model type or provider is evaluated. Must be a string corresponding to the name of the model type/provider being used. See [the main README](https://github.com/EleutherAI/lm-evaluation-harness/tree/main#commercial-apis) for a full list of enabled model names and supported libraries or APIs.
- `--model_args` : Controls parameters passed to the model constructor. Accepts a string containing comma-separated keyword arguments to the model class of the format `"arg1=val1,arg2=val2,..."`, for example `--model_args pretrained=EleutherAI/pythia-160m,dtype=float32`. For a full list of supported keyword arguments, see the initialization of the `lm_eval.api.model.LM` subclass, e.g. [`HFLM`](https://github.com/EleutherAI/lm-evaluation-harness/blob/365fcda9b85bbb6e0572d91976b8daf409164500/lm_eval/models/huggingface.py#L66)
- `--tasks` : Determines which tasks or task groups are evaluated. Accepts a comma-separated list of task names or task group names. Must consist solely of valid tasks/groups.
- `--num_fewshot` : Sets the number of few-shot examples to place in context. Must be an integer.
- `--gen_kwargs` : Takes an arg string in the same format as `--model_args` and creates a dictionary of keyword arguments. These will be passed to the models for all `generate_until` (free-form or greedy generation) tasks, to set options such as the sampling temperature or `top_p` / `top_k`. For a list of what args are supported for each model type, reference the respective library's documentation (for example, the documentation for `transformers.AutoModelForCausalLM.generate()`). These kwargs will be applied to all `generate_until` tasks called; we do not currently support unique gen_kwargs or batch_size values per task in a single run of the library. To control these on a per-task level, set them in that task's YAML file.
- `--batch_size` : Sets the batch size used for evaluation. Can be a positive integer or `"auto"` to automatically select the largest batch size that will fit in memory, speeding up evaluation. One can pass `--batch_size auto:N` to re-select the maximum batch size `N` times during evaluation. This can help accelerate evaluation further, since `lm-eval` sorts documents in descending order of context length.
- `--max_batch_size` : Sets the maximum batch size to try to fit in memory, if `--batch_size auto` is passed.
- `--device` : Sets which device to place the model onto. Must be a string, for example, `"cuda", "cuda:0", "cpu", "mps"`. Defaults to "cuda", and can be ignored if running multi-GPU or running a non-local model type.
- `--output_path` : A string of the form `dir/file.jsonl` or `dir/`. Provides a path where high-level results will be saved, either into the file named or into the directory named. If `--log_samples` is passed as well, then per-document outputs and metrics will be saved into the directory as well.
- `--log_samples` : If this flag is passed, then the model's outputs, and the text fed into the model, will be saved at per-document granularity. Must be used with `--output_path`.
- `--limit` : Accepts an integer, or a float between 0.0 and 1.0. If passed, will limit the number of documents to evaluate to the first X documents (if an integer) per task or first X% of documents per task. Useful for debugging, especially on costly API models.
- `--use_cache` : Should be a path where a sqlite db file can be written to. Takes a string of format `/path/to/sqlite_cache_` in order to create a cache db at `/path/to/sqlite_cache_rank{i}.db` for each process (0-NUM_GPUS). This allows results of prior runs to be cached, so that there is no need to re-run results in order to re-score or re-run a given (model, task) pair again.
- `--cache_requests` : Can be "true", "refresh", or "delete". "true" means that the cache should be used. "refresh" means that you wish to regenerate the cache, which you should run if you change your dataset configuration for a given task. "delete" will delete the cache. Cached files are stored under `lm_eval/caching/.cache` unless you specify a different path via the environment variable `LM_HARNESS_CACHE_PATH`, e.g. `LM_HARNESS_CACHE_PATH=~/Documents/cache_for_lm_harness`.
- `--decontamination_ngrams_path` : Deprecated, see [this commit](https://github.com/EleutherAI/lm-evaluation-harness/commit/00209e10f6e27edf5d766145afaf894079b5fe10) or older for a working decontamination-checker tool.
- `--check_integrity` : If this flag is used, the library tests for each task selected are run to confirm task integrity.
- `--write_out` : Used for diagnostic purposes to observe the format of task documents passed to a model. If this flag is used, then prints the prompt and gold target string for the first document of each task.
- `--show_config` : If used, prints the full `lm_eval.api.task.TaskConfig` contents (non-default settings in the task YAML file) for each task which was run, at the completion of an evaluation. Useful when one is modifying a task's configuration YAML locally to transmit the exact configurations used for debugging or for reproducibility purposes.
- `--include_path` : Accepts a path to a folder. If passed, then all YAML files containing `lm-eval`-compatible task configurations will be added to the task registry as available tasks. Used when one is writing config files for their own task in a folder other than `lm_eval/tasks/`.
- `--predict_only`: Generates the model outputs without computing metrics. Use with `--log_samples` to retrieve decoded results.
- `--seed`: Set seed for python's random, numpy and torch. Accepts a comma-separated list of 3 values for python's random, numpy, and torch seeds, respectively, or a single integer to set the same seed for all three. The values are either an integer or 'None' to not set the seed. Default is `0,1234,1234` (for backward compatibility). E.g. `--seed 0,None,8` sets `random.seed(0)` and `torch.manual_seed(8)`. Here numpy's seed is not set since the second value is `None`. E.g., `--seed 42` sets all three seeds to 42.
@@ -56,7 +58,6 @@ We also support using the library's external API for use within model training l
`lm_eval` supplies two functions for external import and use: `lm_eval.evaluate()` and `lm_eval.simple_evaluate()`.

`simple_evaluate()` can be used by simply creating an `lm_eval.api.model.LM` subclass that implements the methods described in the [Model Guide](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs/model_guide.md), and wrapping your custom model in that class as follows:

```python
@@ -88,7 +89,6 @@ results = lm_eval.simple_evaluate( # call simple_evaluate
)
```

See https://github.com/EleutherAI/lm-evaluation-harness/blob/365fcda9b85bbb6e0572d91976b8daf409164500/lm_eval/evaluator.py#L35 for a full description of all arguments available. All keyword arguments to simple_evaluate share the same role as the command-line flags described previously.
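For instance, here is a minimal sketch of a call exercising the two additions in this commit; the model name `hf`, the `pretrained` value, and the task `hellaswag` are placeholders chosen for illustration rather than taken from the diff:

```python
import lm_eval
from lm_eval.tasks import initialize_tasks

initialize_tasks()  # register the built-in tasks, as done in lm_eval/__main__.py

results = lm_eval.simple_evaluate(
    model="hf",  # assumed registry name for the HuggingFace causal-LM wrapper
    model_args={"pretrained": "EleutherAI/pythia-160m"},  # a dict is now accepted in place of "key=val,..."
    tasks=["hellaswag"],
    num_fewshot=0,
    cache_requests=True,  # reuse cached request objects when they exist
)
```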
Additionally, the `evaluate()` function offers the core evaluation functionality provided by the library, but without some of the special handling and simplification + abstraction provided by `simple_evaluate()`.
@@ -96,6 +96,7 @@ Additionally, the `evaluate()` function offers the core evaluation functionality
See https://github.com/EleutherAI/lm-evaluation-harness/blob/365fcda9b85bbb6e0572d91976b8daf409164500/lm_eval/evaluator.py#L173 for more details.

As a brief example usage of `evaluate()`:

```python
import lm_eval
...
@@ -11,6 +11,7 @@ from typing import Union
import numpy as np

from lm_eval import evaluator, utils
from lm_eval.evaluator import request_caching_arg_to_dict
from lm_eval.logging_utils import WandbLogger
from lm_eval.tasks import TaskManager, include_path, initialize_tasks
from lm_eval.utils import make_table
@@ -119,6 +120,13 @@ def parse_eval_args() -> argparse.Namespace:
        metavar="DIR",
        help="A path to a sqlite db file for caching model responses. `None` if not caching.",
    )
    parser.add_argument(
        "--cache_requests",
        type=str,
        default=None,
        choices=["true", "refresh", "delete"],
        help="Speed up evaluation by caching the building of dataset requests. `None` if not caching.",
    )
    parser.add_argument("--decontamination_ngrams_path", default=None)  # TODO: not used
    parser.add_argument(
        "--check_integrity",
@@ -285,6 +293,10 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
    eval_logger.info(f"Selected Tasks: {task_names}")
    eval_logger.info("Loading selected tasks...")

    request_caching_args = request_caching_arg_to_dict(
        cache_requests=args.cache_requests
    )

    results = evaluator.simple_evaluate(
        model=args.model,
        model_args=args.model_args,
@@ -302,6 +314,7 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
        gen_kwargs=args.gen_kwargs,
        task_manager=task_manager,
        predict_only=args.predict_only,
        **request_caching_args,
        random_seed=args.seed[0],
        numpy_random_seed=args.seed[1],
        torch_random_seed=args.seed[2],
...
@@ -133,6 +133,28 @@ class LM(abc.ABC):
        args2 = {k: v for k, v in additional_config.items() if v is not None}
        return cls(**args, **args2)

    @classmethod
    def create_from_arg_obj(
        cls: Type[T], arg_dict: dict, additional_config: Optional[dict] = None
    ) -> T:
        """
        Creates an instance of the LM class using the given argument dictionary.

        Parameters:
        - arg_dict: A dict of keyword arguments to pass to the model constructor.
        - additional_config: Optional dictionary containing additional configuration parameters.

        Returns:
        - Instance of the LM class.
        """

        additional_config = {} if additional_config is None else additional_config
        additional_config = {
            k: v for k, v in additional_config.items() if v is not None
        }

        return cls(**arg_dict, **additional_config)

    @property
    def rank(self):
        # used in the case of parallelism. Hardcoded to
...
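As a rough illustration of how the new classmethod might be exercised (the registry name `"hf"` and the constructor kwargs below are assumptions for the sketch, not taken from the diff):

```python
from lm_eval.api.registry import get_model

# Build an LM subclass from a plain dict instead of a "key=value,..." string.
# The second argument mirrors the additional_config dict that simple_evaluate() passes through;
# None values are filtered out before being forwarded to the constructor.
lm_cls = get_model("hf")  # assumed name for the HuggingFace model wrapper
lm = lm_cls.create_from_arg_obj(
    {"pretrained": "EleutherAI/pythia-70m"},
    {"batch_size": 1, "max_batch_size": None, "device": "cpu"},
)
```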
@@ -11,6 +11,7 @@ from typing import Any, List, Literal, Tuple, Union

import datasets
import numpy as np
from tqdm import tqdm

from lm_eval import utils
from lm_eval.api import samplers
@@ -28,6 +29,7 @@ from lm_eval.api.registry import (
    get_metric_aggregation,
    is_higher_better,
)
from lm_eval.caching.cache import load_from_cache, save_to_cache
from lm_eval.filters import build_filter_ensemble
from lm_eval.prompts import get_prompt
@@ -107,9 +109,11 @@ class TaskConfig(dict):
        if self.output_type == "generate_until":
            # ensure that we greedily generate in absence of explicit arguments otherwise
            self.generation_kwargs = {
                "until": (
                    None
                    if self.fewshot_delimiter is None
                    else [self.fewshot_delimiter]
                ),
                "do_sample": False,
            }
@@ -349,8 +353,35 @@ class Task(abc.ABC):
    def doc_to_target(self, doc):
        pass

    def build_all_requests(
        self,
        limit=None,
        rank=None,
        world_size=None,
        cache_requests=False,
        rewrite_requests_cache=False,
    ) -> None:
        """Build a set of Instances for a task, and store them in task.instances"""

        # used with caching
        og_limit = limit

        cache_key = f"requests-{self._config.task}"

        cached_instances = load_from_cache(file_name=cache_key)

        if cache_requests and cached_instances and not rewrite_requests_cache:
            cached_instances = cached_instances[:limit]

            flattened_instances = [
                instance
                for instance_group in cached_instances
                for instance in instance_group
            ]

            self._instances = flattened_instances
            return

        if self.has_test_docs():
            docs = self.test_docs()
        elif self.has_validation_docs():
@@ -361,8 +392,29 @@ class Task(abc.ABC):
        eval_logger.info(f"Building contexts for {self.config.task} on rank {rank}...")

        instances = []

        # process all documents when caching is specified for simplicity
        if (
            cache_requests
            and (not cached_instances or rewrite_requests_cache)
            and limit is not None
        ):
            limit = None

        doc_id_docs = list(
            utils.create_iterator(
                enumerate(docs),
                rank,
                world_size,
                limit,
            )
        )

        num_docs = len(doc_id_docs)

        for doc_id, doc in tqdm(
            doc_id_docs,
            total=num_docs,
        ):
            # sample fewshot context #TODO: need to offset doc_id by rank now!
            fewshot_ctx = self.fewshot_context(
@@ -380,11 +432,25 @@ class Task(abc.ABC):
            if not isinstance(inst, list):
                inst = [inst]

            instances.append(inst)

        # now flatten, this is to allow slicing to work with pickles
        sliced_instances = instances[:og_limit]

        flattened_instances = [
            instance
            for instance_group in sliced_instances
            for instance in instance_group
        ]

        self._instances = flattened_instances

        assert len(self._instances) != 0, "task.build_requests() did not find any docs!"

        if cache_requests and (not cached_instances or rewrite_requests_cache):
            save_to_cache(file_name=cache_key, obj=instances)

    @abc.abstractmethod
    def construct_requests(self, doc, ctx, **kwargs):
        """Uses RequestFactory to construct Requests and returns an iterable of
...
import hashlib
import os
import dill
from lm_eval.utils import eval_logger
MODULE_DIR = os.path.dirname(os.path.realpath(__file__))
OVERRIDE_PATH = os.getenv("LM_HARNESS_CACHE_PATH")
PATH = OVERRIDE_PATH if OVERRIDE_PATH else f"{MODULE_DIR}/.cache"
# This should be sufficient for uniqueness
HASH_INPUT = "EleutherAI-lm-evaluation-harness"
HASH_PREFIX = hashlib.sha256(HASH_INPUT.encode("utf-8")).hexdigest()
FILE_SUFFIX = f".{HASH_PREFIX}.pickle"
def load_from_cache(file_name):
    try:
        path = f"{PATH}/{file_name}{FILE_SUFFIX}"

        with open(path, "rb") as file:
            cached_task_dict = dill.loads(file.read())
            return cached_task_dict

    except Exception:
        eval_logger.debug(f"{file_name} is not cached, generating...")
        pass


def save_to_cache(file_name, obj):
    if not os.path.exists(PATH):
        os.mkdir(PATH)

    file_path = f"{PATH}/{file_name}{FILE_SUFFIX}"

    eval_logger.debug(f"Saving {file_path} to cache...")
    with open(file_path, "wb") as file:
        file.write(dill.dumps(obj))


# NOTE the "key" param is to allow for flexibility
def delete_cache(key: str = ""):
    files = os.listdir(PATH)

    for file in files:
        if file.startswith(key) and file.endswith(FILE_SUFFIX):
            file_path = f"{PATH}/{file}"

            os.unlink(file_path)
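A small usage sketch of these helpers; the key follows the `requests-<task>` convention used by `Task.build_all_requests`, and the cached payload here is a dummy value rather than real `Instance` objects:

```python
from lm_eval.caching.cache import delete_cache, load_from_cache, save_to_cache

cache_key = "requests-hellaswag"  # same "requests-<task>" naming as Task.build_all_requests
save_to_cache(file_name=cache_key, obj=[["instance_0"], ["instance_1"]])  # any dill-picklable object

cached = load_from_cache(file_name=cache_key)  # returns None (and logs a debug message) on a miss
print(cached)  # [['instance_0'], ['instance_1']]

delete_cache(key="requests-")  # removes every cached "requests-*" pickle under PATH
```

Cache files are written as `<file_name>.<sha256-prefix>.pickle` under `lm_eval/caching/.cache`, or under `LM_HARNESS_CACHE_PATH` when that environment variable is set.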
import collections
import itertools
import logging
import math
import random
from typing import TYPE_CHECKING, Optional, Union

import numpy as np
import torch
@@ -20,16 +21,26 @@ from lm_eval.utils import (
)

if TYPE_CHECKING:
    from lm_eval.api.model import LM
    from lm_eval.tasks import Task

from lm_eval.caching.cache import delete_cache


@positional_deprecated
def simple_evaluate(
    model,
    model_args: Optional[Union[str, dict, None]] = None,
    tasks=None,
    num_fewshot: Optional[int] = None,
    batch_size: Optional[int] = None,
    max_batch_size: Optional[int] = None,
    device: Optional[str] = None,
    use_cache: Optional[str] = None,
    cache_requests: bool = False,
    rewrite_requests_cache: bool = False,
    delete_requests_cache: bool = False,
    limit: Optional[Union[int, float]] = None,
    bootstrap_iters: int = 100000,
    check_integrity: bool = False,
@@ -48,8 +59,8 @@ def simple_evaluate(
    :param model: Union[str, LM]
        Name of model or LM object, see lm_eval.models.get_model
    :param model_args: Optional[str, dict]
        String or dict arguments for each model class, see LM.create_from_arg_string and LM.create_from_arg_obj.
        Ignored if `model` argument is a LM object.
    :param tasks: list[Union[str, dict, Task]]
        List of task names or Task objects. Task objects will be taken to have name task.EVAL_HARNESS_NAME if defined and type(task).__name__ otherwise.
@@ -63,6 +74,12 @@ def simple_evaluate(
        PyTorch device (e.g. "cpu" or "cuda:0") for running models
    :param use_cache: str, optional
        A path to a sqlite db file for caching model responses. `None` if not caching.
    :param cache_requests: bool, optional
        Speed up evaluation by caching the building of dataset requests. `None` if not caching.
    :param rewrite_requests_cache: bool, optional
        Rewrites all of the request cache if set to `True`. `None` if not desired.
    :param delete_requests_cache: bool, optional
        Deletes all of the request cache if set to `True`. `None` if not desired.
    :param limit: int or float, optional
        Limit the number of examples per task (only use this for testing). If <1, limit is a percentage of the total number of examples.
    :param bootstrap_iters:
@@ -90,6 +107,10 @@ def simple_evaluate(
    """
    eval_logger.setLevel(getattr(logging, f"{verbosity}"))
    if delete_requests_cache:
        eval_logger.info("Deleting requests cache...")
        delete_cache()

    if random_seed is not None:
        # See https://github.com/EleutherAI/lm-evaluation-harness/pull/1412
        eval_logger.info(f"Setting random seed to {random_seed}")
@@ -120,14 +141,26 @@ def simple_evaluate(
    if isinstance(model, str):
        if model_args is None:
            model_args = ""

        elif isinstance(model_args, dict):
            lm = lm_eval.api.registry.get_model(model).create_from_arg_obj(
                model_args,
                {
                    "batch_size": batch_size,
                    "max_batch_size": max_batch_size,
                    "device": device,
                },
            )

        else:
            lm = lm_eval.api.registry.get_model(model).create_from_arg_string(
                model_args,
                {
                    "batch_size": batch_size,
                    "max_batch_size": max_batch_size,
                    "device": device,
                },
            )
    else:
        assert isinstance(model, lm_eval.api.model.LM)
        lm = model
@@ -191,6 +224,8 @@ def simple_evaluate(
        lm=lm,
        task_dict=task_dict,
        limit=limit,
        cache_requests=cache_requests,
        rewrite_requests_cache=rewrite_requests_cache,
        bootstrap_iters=bootstrap_iters,
        decontamination_ngrams_path=decontamination_ngrams_path,
        write_out=write_out,
@@ -211,9 +246,9 @@ def simple_evaluate(
            "model": model_name,
            "model_args": model_args,
            "batch_size": batch_size,
            "batch_sizes": (
                list(lm.batch_sizes.values()) if hasattr(lm, "batch_sizes") else []
            ),
            "device": device,
            "use_cache": use_cache,
            "limit": limit,
@@ -232,9 +267,11 @@ decontaminate_suffix = "_decontaminate"
@positional_deprecated
def evaluate(
    lm: "LM",
    task_dict,
    limit: Optional[int] = None,
    cache_requests=False,
    rewrite_requests_cache=False,
    bootstrap_iters: Optional[int] = 100000,
    decontamination_ngrams_path=None,
    write_out: bool = False,
@@ -294,6 +331,8 @@ def evaluate(
    # get lists of each type of request
    for task_name, task in task_dict.items():
        task: Task

        if isinstance(task, tuple):
            group_name, task = task
            task_hierarchy[group_name].append(task_name)
@@ -331,9 +370,18 @@ def evaluate(
            task_docs = task.validation_docs()
        else:
            raise RuntimeError("Task has neither test_docs nor validation_docs")

        num_docs = len(task_docs) * limit
        # ceil to prevent limit being equal to 0
        limit = int(math.ceil(num_docs)) if limit < 1.0 else int(limit)

        task.build_all_requests(
            limit=limit,
            rank=lm.rank,
            world_size=lm.world_size,
            cache_requests=cache_requests,
            rewrite_requests_cache=rewrite_requests_cache,
        )

        eval_logger.debug(
            f"Task: {task_name}; number of requests on this rank: {len(task.instances)}"
@@ -508,9 +556,11 @@ def evaluate(
            if bootstrap_iters > 0:
                stderr_fn = lm_eval.api.metrics.stderr_for_metric(
                    metric=agg_fn,
                    bootstrap_iters=(
                        min(bootstrap_iters, 100)
                        if metric in ["bleu", "chrf", "ter"]
                        else bootstrap_iters
                    ),
                )
                results[task_name][f"{metric}_stderr,{key}"] = (
@@ -649,3 +699,15 @@ def evaluate(
    else:
        return None
def request_caching_arg_to_dict(cache_requests: str) -> dict:
    request_caching_args = {
        "cache_requests": (
            True if cache_requests == "true" or cache_requests == "refresh" else False
        ),
        "rewrite_requests_cache": True if cache_requests == "refresh" else False,
        "delete_requests_cache": True if cache_requests == "delete" else False,
    }

    return request_caching_args
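For reference, the mapping this helper produces for each CLI choice; these values are exactly what the function above returns, and `__main__.py` splats them into `simple_evaluate()`:

```python
from lm_eval.evaluator import request_caching_arg_to_dict

print(request_caching_arg_to_dict(cache_requests="true"))
# {'cache_requests': True, 'rewrite_requests_cache': False, 'delete_requests_cache': False}
print(request_caching_arg_to_dict(cache_requests="refresh"))
# {'cache_requests': True, 'rewrite_requests_cache': True, 'delete_requests_cache': False}
print(request_caching_arg_to_dict(cache_requests="delete"))
# {'cache_requests': False, 'rewrite_requests_cache': False, 'delete_requests_cache': True}
```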
@@ -445,6 +445,7 @@ def get_task_dict(
    assert set(task_name_from_string_dict.keys()).isdisjoint(
        set(task_name_from_object_dict.keys())
    )

    return {
        **task_name_from_string_dict,
        **task_name_from_config_dict,
...
@@ -36,6 +36,7 @@ dependencies = [
    "tqdm-multiprocess",
    "transformers>=4.1",
    "zstandard",
    "dill",
    "word2number",
]
...
"""
Usage:
python requests_caching.py --tasks=comma,separated,list,of,tasks --cache_requests=<true|refresh|delete]>
"""
import argparse
import os
from typing import List
import torch
from transformers import (
pipeline as trans_pipeline,
)
from lm_eval import simple_evaluate
from lm_eval.evaluator import request_caching_arg_to_dict
from lm_eval.utils import eval_logger
MODULE_DIR = os.path.dirname(os.path.realpath(__file__))
# Used to specify alternate cache path, useful if run in a docker container
# NOTE raw datasets will break if you try to transfer the cache from your host to a docker image
LM_HARNESS_CACHE_PATH = os.getenv("LM_HARNESS_CACHE_PATH")
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
MODEL = "EleutherAI/pythia-70m"
TASK = "text-generation"
def run_model_for_task_caching(tasks: List[str], cache_requests: str):
eval_logger.info(f"Loading HF model: {MODEL}")
trans_pipe = trans_pipeline(
task=TASK, model=MODEL, device=DEVICE, trust_remote_code=True
)
model = trans_pipe.model
tokenizer = trans_pipe.tokenizer
eval_logger.info(
f"Running simple_evaluate to cache request objects for tasks: {tasks}"
)
cache_args = request_caching_arg_to_dict(cache_requests=cache_requests)
eval_logger.info(
f"The following operations will be performed on the cache: {cache_requests}"
)
eval_data = simple_evaluate(
model="hf-auto",
model_args={
"pretrained": model,
"tokenizer": tokenizer,
},
limit=1,
device=DEVICE,
tasks=tasks,
write_out=True,
**cache_args,
)
return eval_data
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"--tasks",
"-t",
default=None,
metavar="task1,task2",
)
parser.add_argument(
"--cache_requests",
type=str,
default=None,
choices=["true", "refresh", "delete"],
help="Speed up evaluation by caching the building of dataset requests. `None` if not caching.",
)
args = parser.parse_args()
tasks = args.tasks.split(",")
eval_data = run_model_for_task_caching(
tasks=tasks, model=MODEL, device=DEVICE, cache_requests=args.cache_requests
)
# import lm_eval.base as base
import importlib
import os
import sys
from datetime import datetime
from typing import List, Tuple

import pytest
import torch

# import lm_eval.models as models
from lm_eval.caching.cache import PATH

MODULE_DIR = os.path.dirname(os.path.realpath(__file__))

# NOTE the script this loads uses simple_evaluate
# TODO potentially test both the helper script and the normal script
sys.path.append(f"{MODULE_DIR}/../scripts")
model_loader = importlib.import_module("requests_caching")
run_model_for_task_caching = model_loader.run_model_for_task_caching

DEFAULT_TASKS = ["lambada_openai", "hellaswag"]


@pytest.fixture(autouse=True)
def setup_and_teardown():
    # Setup
    torch.use_deterministic_algorithms(False)
    clear_cache()

    # Yields control back to the test function
    yield

    # Cleanup here


def clear_cache():
    if os.path.exists(PATH):
        cache_files = os.listdir(PATH)
        for file in cache_files:
            file_path = f"{PATH}/{file}"
            os.unlink(file_path)


# leaving tasks here to allow for the option to select specific task files
def get_cache_files(tasks: List[str] = None) -> Tuple[List[str], List[str]]:
    cache_files = os.listdir(PATH)

    file_task_names = []

    for file in cache_files:
        file_without_prefix = file.split("-")[1]
        file_without_prefix_and_suffix = file_without_prefix.split(".")[0]
        file_task_names.append(file_without_prefix_and_suffix)

    return cache_files, file_task_names


def assert_created(tasks: List[str], file_task_names: List[str]):
    tasks.sort()
    file_task_names.sort()

    assert tasks == file_task_names


@pytest.mark.parametrize("tasks", [DEFAULT_TASKS])
def test_requests_caching_true(tasks: List[str]):
    run_model_for_task_caching(tasks=tasks, cache_requests="true")

    cache_files, file_task_names = get_cache_files()

    assert_created(tasks=tasks, file_task_names=file_task_names)


@pytest.mark.parametrize("tasks", [DEFAULT_TASKS])
def test_requests_caching_refresh(tasks: List[str]):
    run_model_for_task_caching(tasks=tasks, cache_requests="true")

    timestamp_before_test = datetime.now().timestamp()

    run_model_for_task_caching(tasks=tasks, cache_requests="refresh")

    cache_files, file_task_names = get_cache_files()

    for file in cache_files:
        modification_time = os.path.getmtime(f"{PATH}/{file}")
        assert modification_time > timestamp_before_test

    tasks.sort()
    file_task_names.sort()

    assert tasks == file_task_names


@pytest.mark.parametrize("tasks", [DEFAULT_TASKS])
def test_requests_caching_delete(tasks: List[str]):
    # populate the data first, rerun this test within this test for additional confidence
    test_requests_caching_true(tasks=tasks)

    run_model_for_task_caching(tasks=tasks, cache_requests="delete")

    cache_files, file_task_names = get_cache_files()

    assert len(cache_files) == 0


# useful for locally running tests through the debugger
if __name__ == "__main__":

    def run_tests():
        tests = [
            test_requests_caching_true,
            test_requests_caching_refresh,
            test_requests_caching_delete,
        ]

        for test_func in tests:
            clear_cache()
            test_func(tasks=DEFAULT_TASKS)

        print("Tests pass")

    run_tests()