"vscode:/vscode.git/clone" did not exist on "7b46fdb5b0259ac8dbf8429a7f17d7657d9b8620"
Unverified commit 1e6c9272, authored by Aaron V, committed by GitHub

Create a means for caching task registration and request building. Add the ability to specify an args dict for simple_evaluate(). (#1372)



* Create a means for caching task registration and request building. Add the ability to specify an args dict for simple_evaluate().

* Remove extra S in cache path in caching module
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Rename requests cache args, make model_args polymorphic so that a dict can also be accepted.

* Update docs to reflect new caching behavior, add CLI args for requests caching. Create a function for deleting items in the cache.

* Update documentation, fix minor bug with arg parsing for requests caching where an undefined variable was used.

* Remove line from gitignore, add to cli for caching datasets.

* Add hashing suffix to .pickles. Update test script typo.

* Favor isinstance() over type() in evaluator.py

* Add tests for caching, get tests working, remove unneeded arg from build_all_requests().

* Update arg description to simple_evaluate.

* Update pyproject.toml

* Fix typehint

* Remove the use of random() for creating default cache pickle hash.

* Check that cache dir exists before clearing it in request cache tests.

* Fix linting problems.

* Fix additional formatting errors.

* Remove trailing whitespace.

* Add new line to the end of .gitignore.

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
parent f6befdb9
@@ -16,5 +16,8 @@ temp
# IPython
profile_default/
ipython_config.py
# don't track (the default location of) the cached requests
lm_eval/caching/.cache
# don't track files created by wandb
wandb
examples/wandb
@@ -10,41 +10,43 @@ Equivalently, running the library can be done via the `lm-eval` entrypoint at th
This mode supports a number of command-line arguments, the details of which can also be seen via running with `-h` or `--help`:

- `--model` : Selects which model type or provider is evaluated. Must be a string corresponding to the name of the model type/provider being used. See [the main README](https://github.com/EleutherAI/lm-evaluation-harness/tree/main#commercial-apis) for a full list of enabled model names and supported libraries or APIs.
- `--model_args` : Controls parameters passed to the model constructor. Accepts a string containing comma-separated keyword arguments to the model class of the format `"arg1=val1,arg2=val2,..."`, for example `--model_args pretrained=EleutherAI/pythia-160m,dtype=float32`. For a full list of supported keyword arguments, see the initialization of the `lm_eval.api.model.LM` subclass, e.g. [`HFLM`](https://github.com/EleutherAI/lm-evaluation-harness/blob/365fcda9b85bbb6e0572d91976b8daf409164500/lm_eval/models/huggingface.py#L66)
- `--tasks` : Determines which tasks or task groups are evaluated. Accepts a comma-separated list of task names or task group names. Must consist solely of valid tasks/groups.
- `--num_fewshot` : Sets the number of few-shot examples to place in context. Must be an integer.
- `--gen_kwargs` : Takes an arg string in the same format as `--model_args` and creates a dictionary of keyword arguments. These will be passed to the models for all `generate_until` (free-form or greedy generation) tasks, to set options such as the sampling temperature or `top_p` / `top_k`. For a list of what args are supported for each model type, reference the respective library's documentation (for example, the documentation for `transformers.AutoModelForCausalLM.generate()`). These kwargs will be applied to all `generate_until` tasks called; we do not currently support unique gen_kwargs or batch_size values per task in a single run of the library. To control these on a per-task level, set them in that task's YAML file.
- `--batch_size` : Sets the batch size used for evaluation. Can be a positive integer or `"auto"` to automatically select the largest batch size that will fit in memory, speeding up evaluation. One can pass `--batch_size auto:N` to re-select the maximum batch size `N` times during evaluation. This can help accelerate evaluation further, since `lm-eval` sorts documents in descending order of context length.
- `--max_batch_size` : Sets the maximum batch size to try to fit in memory, if `--batch_size auto` is passed.
- `--device` : Sets which device to place the model onto. Must be a string, for example, `"cuda", "cuda:0", "cpu", "mps"`. Defaults to "cuda", and can be ignored if running multi-GPU or running a non-local model type.
- `--output_path` : A string of the form `dir/file.jsonl` or `dir/`. Provides a path where high-level results will be saved, either into the file named or into the directory named. If `--log_samples` is passed as well, then per-document outputs and metrics will be saved into the directory as well.
- `--log_samples` : If this flag is passed, then the model's outputs, and the text fed into the model, will be saved at per-document granularity. Must be used with `--output_path`.
- `--limit` : Accepts an integer, or a float between 0.0 and 1.0. If passed, will limit the number of documents to evaluate to the first X documents (if an integer) per task or first X% of documents per task. Useful for debugging, especially on costly API models.
- `--use_cache` : Should be a path where a sqlite db file can be written to. Takes a string of format `/path/to/sqlite_cache_` in order to create a cache db at `/path/to/sqlite_cache_rank{i}.db` for each process (0-NUM_GPUS). This allows results of prior runs to be cached, so that there is no need to re-run results in order to re-score or re-run a given (model, task) pair again.
- `--cache_requests` : Can be "true", "refresh", or "delete". "true" means that the cache should be used. "refresh" means that you wish to regenerate the cache, which you should run if you change your dataset configuration for a given task. "delete" will delete the cache. Cached files are stored under `lm_eval/caching/.cache` unless you specify a different path via the environment variable `LM_HARNESS_CACHE_PATH`, e.g. `LM_HARNESS_CACHE_PATH=~/Documents/cache_for_lm_harness`.
- `--decontamination_ngrams_path` : Deprecated, see [this commit](https://github.com/EleutherAI/lm-evaluation-harness/commit/00209e10f6e27edf5d766145afaf894079b5fe10) or older for a working decontamination-checker tool.
- `--check_integrity` : If this flag is used, the library tests for each task selected are run to confirm task integrity.
- `--write_out` : Used for diagnostic purposes to observe the format of task documents passed to a model. If this flag is used, then prints the prompt and gold target string for the first document of each task.
- `--show_config` : If used, prints the full `lm_eval.api.task.TaskConfig` contents (non-default settings in the task YAML file) for each task which was run, at the completion of an evaluation. Useful when one is modifying a task's configuration YAML locally to transmit the exact configurations used for debugging or for reproducibility purposes.
- `--include_path` : Accepts a path to a folder. If passed, then all YAML files containing `lm-eval`-compatible task configurations will be added to the task registry as available tasks. Used when one is writing config files for their own task in a folder other than `lm_eval/tasks/`.
- `--predict_only`: Generates the model outputs without computing metrics. Use with `--log_samples` to retrieve decoded results.
- `--seed`: Set seed for python's random, numpy and torch. Accepts a comma-separated list of 3 values for python's random, numpy, and torch seeds, respectively, or a single integer to set the same seed for all three. The values are either an integer or 'None' to not set the seed. Default is `0,1234,1234` (for backward compatibility). E.g. `--seed 0,None,8` sets `random.seed(0)` and `torch.manual_seed(8)`. Here numpy's seed is not set since the second value is `None`. E.g., `--seed 42` sets all three seeds to 42.
@@ -56,7 +58,6 @@ We also support using the library's external API for use within model training l
`lm_eval` supplies two functions for external import and use: `lm_eval.evaluate()` and `lm_eval.simple_evaluate()`.

`simple_evaluate()` can be used by simply creating an `lm_eval.api.model.LM` subclass that implements the methods described in the [Model Guide](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs/model_guide.md), and wrapping your custom model in that class as follows:

```python
@@ -88,7 +89,6 @@ results = lm_eval.simple_evaluate( # call simple_evaluate
)
```

See https://github.com/EleutherAI/lm-evaluation-harness/blob/365fcda9b85bbb6e0572d91976b8daf409164500/lm_eval/evaluator.py#L35 for a full description of all arguments available. All keyword arguments to simple_evaluate share the same role as the command-line flags described previously.
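For instance, here is a minimal sketch of a call exercising the two additions in this commit; the model name `hf`, the `pretrained` value, and the task `hellaswag` are placeholders chosen for illustration rather than taken from the diff:

```python
import lm_eval
from lm_eval.tasks import initialize_tasks

initialize_tasks()  # register the built-in tasks, as done in lm_eval/__main__.py

results = lm_eval.simple_evaluate(
    model="hf",  # assumed registry name for the HuggingFace causal-LM wrapper
    model_args={"pretrained": "EleutherAI/pythia-160m"},  # a dict is now accepted in place of "key=val,..."
    tasks=["hellaswag"],
    num_fewshot=0,
    cache_requests=True,  # reuse cached request objects when they exist
)
```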
Additionally, the `evaluate()` function offers the core evaluation functionality provided by the library, but without some of the special handling and simplification + abstraction provided by `simple_evaluate()`.
@@ -96,6 +96,7 @@ Additionally, the `evaluate()` function offers the core evaluation functionality
See https://github.com/EleutherAI/lm-evaluation-harness/blob/365fcda9b85bbb6e0572d91976b8daf409164500/lm_eval/evaluator.py#L173 for more details.

As a brief example usage of `evaluate()`:

```python
import lm_eval
...
@@ -11,6 +11,7 @@ from typing import Union
import numpy as np

from lm_eval import evaluator, utils
from lm_eval.evaluator import request_caching_arg_to_dict
from lm_eval.logging_utils import WandbLogger
from lm_eval.tasks import TaskManager, include_path, initialize_tasks
from lm_eval.utils import make_table
@@ -119,6 +120,13 @@ def parse_eval_args() -> argparse.Namespace:
        metavar="DIR",
        help="A path to a sqlite db file for caching model responses. `None` if not caching.",
    )
    parser.add_argument(
        "--cache_requests",
        type=str,
        default=None,
        choices=["true", "refresh", "delete"],
        help="Speed up evaluation by caching the building of dataset requests. `None` if not caching.",
    )
    parser.add_argument("--decontamination_ngrams_path", default=None)  # TODO: not used
    parser.add_argument(
        "--check_integrity",
@@ -285,6 +293,10 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
    eval_logger.info(f"Selected Tasks: {task_names}")
    eval_logger.info("Loading selected tasks...")

    request_caching_args = request_caching_arg_to_dict(
        cache_requests=args.cache_requests
    )

    results = evaluator.simple_evaluate(
        model=args.model,
        model_args=args.model_args,
@@ -302,6 +314,7 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
        gen_kwargs=args.gen_kwargs,
        task_manager=task_manager,
        predict_only=args.predict_only,
        **request_caching_args,
        random_seed=args.seed[0],
        numpy_random_seed=args.seed[1],
        torch_random_seed=args.seed[2],
...
@@ -133,6 +133,28 @@ class LM(abc.ABC):
        args2 = {k: v for k, v in additional_config.items() if v is not None}
        return cls(**args, **args2)

    @classmethod
    def create_from_arg_obj(
        cls: Type[T], arg_dict: dict, additional_config: Optional[dict] = None
    ) -> T:
        """
        Creates an instance of the LM class using the given argument dictionary.

        Parameters:
        - arg_dict: A dict of keyword arguments to pass to the model constructor.
        - additional_config: Optional dictionary containing additional configuration parameters.

        Returns:
        - Instance of the LM class.
        """

        additional_config = {} if additional_config is None else additional_config
        additional_config = {
            k: v for k, v in additional_config.items() if v is not None
        }

        return cls(**arg_dict, **additional_config)

    @property
    def rank(self):
        # used in the case of parallelism. Hardcoded to
...
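As a rough illustration of how the new classmethod might be exercised (the registry name `"hf"` and the constructor kwargs below are assumptions for the sketch, not taken from the diff):

```python
from lm_eval.api.registry import get_model

# Build an LM subclass from a plain dict instead of a "key=value,..." string.
# The second argument mirrors the additional_config dict that simple_evaluate() passes through;
# None values are filtered out before being forwarded to the constructor.
lm_cls = get_model("hf")  # assumed name for the HuggingFace model wrapper
lm = lm_cls.create_from_arg_obj(
    {"pretrained": "EleutherAI/pythia-70m"},
    {"batch_size": 1, "max_batch_size": None, "device": "cpu"},
)
```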
@@ -11,6 +11,7 @@ from typing import Any, List, Literal, Tuple, Union

import datasets
import numpy as np
from tqdm import tqdm

from lm_eval import utils
from lm_eval.api import samplers
@@ -28,6 +29,7 @@ from lm_eval.api.registry import (
    get_metric_aggregation,
    is_higher_better,
)
from lm_eval.caching.cache import load_from_cache, save_to_cache
from lm_eval.filters import build_filter_ensemble
from lm_eval.prompts import get_prompt
@@ -107,9 +109,11 @@ class TaskConfig(dict):
        if self.output_type == "generate_until":
            # ensure that we greedily generate in absence of explicit arguments otherwise
            self.generation_kwargs = {
                "until": (
                    None
                    if self.fewshot_delimiter is None
                    else [self.fewshot_delimiter]
                ),
                "do_sample": False,
            }
@@ -349,8 +353,35 @@ class Task(abc.ABC):
    def doc_to_target(self, doc):
        pass

    def build_all_requests(
        self,
        limit=None,
        rank=None,
        world_size=None,
        cache_requests=False,
        rewrite_requests_cache=False,
    ) -> None:
        """Build a set of Instances for a task, and store them in task.instances"""

        # used with caching
        og_limit = limit

        cache_key = f"requests-{self._config.task}"

        cached_instances = load_from_cache(file_name=cache_key)

        if cache_requests and cached_instances and not rewrite_requests_cache:
            cached_instances = cached_instances[:limit]

            flattened_instances = [
                instance
                for instance_group in cached_instances
                for instance in instance_group
            ]

            self._instances = flattened_instances
            return

        if self.has_test_docs():
            docs = self.test_docs()
        elif self.has_validation_docs():
@@ -361,8 +392,29 @@ class Task(abc.ABC):
        eval_logger.info(f"Building contexts for {self.config.task} on rank {rank}...")

        instances = []

        # process all documents when caching is specified for simplicity
        if (
            cache_requests
            and (not cached_instances or rewrite_requests_cache)
            and limit is not None
        ):
            limit = None

        doc_id_docs = list(
            utils.create_iterator(
                enumerate(docs),
                rank,
                world_size,
                limit,
            )
        )

        num_docs = len(doc_id_docs)

        for doc_id, doc in tqdm(
            doc_id_docs,
            total=num_docs,
        ):
            # sample fewshot context #TODO: need to offset doc_id by rank now!
            fewshot_ctx = self.fewshot_context(
@@ -380,11 +432,25 @@ class Task(abc.ABC):
            if not isinstance(inst, list):
                inst = [inst]

            instances.append(inst)

        # now flatten, this is to allow slicing to work with pickles
        sliced_instances = instances[:og_limit]

        flattened_instances = [
            instance
            for instance_group in sliced_instances
            for instance in instance_group
        ]

        self._instances = flattened_instances

        assert len(self._instances) != 0, "task.build_requests() did not find any docs!"

        if cache_requests and (not cached_instances or rewrite_requests_cache):
            save_to_cache(file_name=cache_key, obj=instances)

    @abc.abstractmethod
    def construct_requests(self, doc, ctx, **kwargs):
        """Uses RequestFactory to construct Requests and returns an iterable of
...
import hashlib
import os
import dill
from lm_eval.utils import eval_logger
MODULE_DIR = os.path.dirname(os.path.realpath(__file__))
OVERRIDE_PATH = os.getenv("LM_HARNESS_CACHE_PATH")
PATH = OVERRIDE_PATH if OVERRIDE_PATH else f"{MODULE_DIR}/.cache"
# This should be sufficient for uniqueness
HASH_INPUT = "EleutherAI-lm-evaluation-harness"
HASH_PREFIX = hashlib.sha256(HASH_INPUT.encode("utf-8")).hexdigest()
FILE_SUFFIX = f".{HASH_PREFIX}.pickle"
def load_from_cache(file_name):
    try:
        path = f"{PATH}/{file_name}{FILE_SUFFIX}"

        with open(path, "rb") as file:
            cached_task_dict = dill.loads(file.read())
            return cached_task_dict

    except Exception:
        eval_logger.debug(f"{file_name} is not cached, generating...")
        pass


def save_to_cache(file_name, obj):
    if not os.path.exists(PATH):
        os.mkdir(PATH)

    file_path = f"{PATH}/{file_name}{FILE_SUFFIX}"

    eval_logger.debug(f"Saving {file_path} to cache...")
    with open(file_path, "wb") as file:
        file.write(dill.dumps(obj))


# NOTE the "key" param is to allow for flexibility
def delete_cache(key: str = ""):
    files = os.listdir(PATH)

    for file in files:
        if file.startswith(key) and file.endswith(FILE_SUFFIX):
            file_path = f"{PATH}/{file}"

            os.unlink(file_path)
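A small usage sketch of these helpers; the key follows the `requests-<task>` convention used by `Task.build_all_requests`, and the cached payload here is a dummy value rather than real `Instance` objects:

```python
from lm_eval.caching.cache import delete_cache, load_from_cache, save_to_cache

cache_key = "requests-hellaswag"  # same "requests-<task>" naming as Task.build_all_requests
save_to_cache(file_name=cache_key, obj=[["instance_0"], ["instance_1"]])  # any dill-picklable object

cached = load_from_cache(file_name=cache_key)  # returns None (and logs a debug message) on a miss
print(cached)  # [['instance_0'], ['instance_1']]

delete_cache(key="requests-")  # removes every cached "requests-*" pickle under PATH
```

Cache files are written as `<file_name>.<sha256-prefix>.pickle` under `lm_eval/caching/.cache`, or under `LM_HARNESS_CACHE_PATH` when that environment variable is set.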
import collections
import itertools
import logging
import math
import random
from typing import TYPE_CHECKING, Optional, Union

import numpy as np
import torch
@@ -20,16 +21,26 @@ from lm_eval.utils import (
)

if TYPE_CHECKING:
    from lm_eval.api.model import LM
    from lm_eval.tasks import Task

from lm_eval.caching.cache import delete_cache


@positional_deprecated
def simple_evaluate(
    model,
    model_args: Optional[Union[str, dict, None]] = None,
    tasks=None,
    num_fewshot: Optional[int] = None,
    batch_size: Optional[int] = None,
    max_batch_size: Optional[int] = None,
    device: Optional[str] = None,
    use_cache: Optional[str] = None,
    cache_requests: bool = False,
    rewrite_requests_cache: bool = False,
    delete_requests_cache: bool = False,
    limit: Optional[Union[int, float]] = None,
    bootstrap_iters: int = 100000,
    check_integrity: bool = False,
@@ -48,8 +59,8 @@ def simple_evaluate(
    :param model: Union[str, LM]
        Name of model or LM object, see lm_eval.models.get_model
    :param model_args: Optional[str, dict]
        String or dict arguments for each model class, see LM.create_from_arg_string and LM.create_from_arg_obj.
        Ignored if `model` argument is a LM object.
    :param tasks: list[Union[str, dict, Task]]
        List of task names or Task objects. Task objects will be taken to have name task.EVAL_HARNESS_NAME if defined and type(task).__name__ otherwise.
@@ -63,6 +74,12 @@ def simple_evaluate(
        PyTorch device (e.g. "cpu" or "cuda:0") for running models
    :param use_cache: str, optional
        A path to a sqlite db file for caching model responses. `None` if not caching.
    :param cache_requests: bool, optional
        Speed up evaluation by caching the building of dataset requests. `None` if not caching.
    :param rewrite_requests_cache: bool, optional
        Rewrites all of the request cache if set to `True`. `None` if not desired.
    :param delete_requests_cache: bool, optional
        Deletes all of the request cache if set to `True`. `None` if not desired.
    :param limit: int or float, optional
        Limit the number of examples per task (only use this for testing). If <1, limit is a percentage of the total number of examples.
    :param bootstrap_iters:
@@ -90,6 +107,10 @@ def simple_evaluate(
    """
    eval_logger.setLevel(getattr(logging, f"{verbosity}"))
    if delete_requests_cache:
        eval_logger.info("Deleting requests cache...")
        delete_cache()

    if random_seed is not None:
        # See https://github.com/EleutherAI/lm-evaluation-harness/pull/1412
        eval_logger.info(f"Setting random seed to {random_seed}")
@@ -120,14 +141,26 @@ def simple_evaluate(
    if isinstance(model, str):
        if model_args is None:
            model_args = ""

        elif isinstance(model_args, dict):
            lm = lm_eval.api.registry.get_model(model).create_from_arg_obj(
                model_args,
                {
                    "batch_size": batch_size,
                    "max_batch_size": max_batch_size,
                    "device": device,
                },
            )

        else:
            lm = lm_eval.api.registry.get_model(model).create_from_arg_string(
                model_args,
                {
                    "batch_size": batch_size,
                    "max_batch_size": max_batch_size,
                    "device": device,
                },
            )
    else:
        assert isinstance(model, lm_eval.api.model.LM)
        lm = model
@@ -191,6 +224,8 @@ def simple_evaluate(
        lm=lm,
        task_dict=task_dict,
        limit=limit,
        cache_requests=cache_requests,
        rewrite_requests_cache=rewrite_requests_cache,
        bootstrap_iters=bootstrap_iters,
        decontamination_ngrams_path=decontamination_ngrams_path,
        write_out=write_out,
@@ -211,9 +246,9 @@ def simple_evaluate(
            "model": model_name,
            "model_args": model_args,
            "batch_size": batch_size,
            "batch_sizes": (
                list(lm.batch_sizes.values()) if hasattr(lm, "batch_sizes") else []
            ),
            "device": device,
            "use_cache": use_cache,
            "limit": limit,
@@ -232,9 +267,11 @@ decontaminate_suffix = "_decontaminate"
@positional_deprecated
def evaluate(
    lm: "LM",
    task_dict,
    limit: Optional[int] = None,
    cache_requests=False,
    rewrite_requests_cache=False,
    bootstrap_iters: Optional[int] = 100000,
    decontamination_ngrams_path=None,
    write_out: bool = False,
@@ -294,6 +331,8 @@ def evaluate(
    # get lists of each type of request
    for task_name, task in task_dict.items():
        task: Task

        if isinstance(task, tuple):
            group_name, task = task
            task_hierarchy[group_name].append(task_name)
@@ -331,9 +370,18 @@ def evaluate(
            task_docs = task.validation_docs()
        else:
            raise RuntimeError("Task has neither test_docs nor validation_docs")

        num_docs = len(task_docs) * limit
        # ceil to prevent limit being equal to 0
        limit = int(math.ceil(num_docs)) if limit < 1.0 else int(limit)

        task.build_all_requests(
            limit=limit,
            rank=lm.rank,
            world_size=lm.world_size,
            cache_requests=cache_requests,
            rewrite_requests_cache=rewrite_requests_cache,
        )

        eval_logger.debug(
            f"Task: {task_name}; number of requests on this rank: {len(task.instances)}"
@@ -508,9 +556,11 @@ def evaluate(
            if bootstrap_iters > 0:
                stderr_fn = lm_eval.api.metrics.stderr_for_metric(
                    metric=agg_fn,
                    bootstrap_iters=(
                        min(bootstrap_iters, 100)
                        if metric in ["bleu", "chrf", "ter"]
                        else bootstrap_iters
                    ),
                )
                results[task_name][f"{metric}_stderr,{key}"] = (
@@ -649,3 +699,15 @@ def evaluate(
    else:
        return None
def request_caching_arg_to_dict(cache_requests: str) -> dict:
    request_caching_args = {
        "cache_requests": (
            True if cache_requests == "true" or cache_requests == "refresh" else False
        ),
        "rewrite_requests_cache": True if cache_requests == "refresh" else False,
        "delete_requests_cache": True if cache_requests == "delete" else False,
    }

    return request_caching_args
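For reference, the mapping this helper produces for each CLI choice; these values are exactly what the function above returns, and `__main__.py` splats them into `simple_evaluate()`:

```python
from lm_eval.evaluator import request_caching_arg_to_dict

print(request_caching_arg_to_dict(cache_requests="true"))
# {'cache_requests': True, 'rewrite_requests_cache': False, 'delete_requests_cache': False}
print(request_caching_arg_to_dict(cache_requests="refresh"))
# {'cache_requests': True, 'rewrite_requests_cache': True, 'delete_requests_cache': False}
print(request_caching_arg_to_dict(cache_requests="delete"))
# {'cache_requests': False, 'rewrite_requests_cache': False, 'delete_requests_cache': True}
```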
@@ -445,6 +445,7 @@ def get_task_dict(
    assert set(task_name_from_string_dict.keys()).isdisjoint(
        set(task_name_from_object_dict.keys())
    )

    return {
        **task_name_from_string_dict,
        **task_name_from_config_dict,
...
@@ -36,6 +36,7 @@ dependencies = [
    "tqdm-multiprocess",
    "transformers>=4.1",
    "zstandard",
    "dill",
    "word2number",
]
...
"""
Usage:
python requests_caching.py --tasks=comma,separated,list,of,tasks --cache_requests=<true|refresh|delete]>
"""
import argparse
import os
from typing import List
import torch
from transformers import (
pipeline as trans_pipeline,
)
from lm_eval import simple_evaluate
from lm_eval.evaluator import request_caching_arg_to_dict
from lm_eval.utils import eval_logger
MODULE_DIR = os.path.dirname(os.path.realpath(__file__))
# Used to specify alternate cache path, useful if run in a docker container
# NOTE raw datasets will break if you try to transfer the cache from your host to a docker image
LM_HARNESS_CACHE_PATH = os.getenv("LM_HARNESS_CACHE_PATH")
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
MODEL = "EleutherAI/pythia-70m"
TASK = "text-generation"
def run_model_for_task_caching(tasks: List[str], cache_requests: str):
eval_logger.info(f"Loading HF model: {MODEL}")
trans_pipe = trans_pipeline(
task=TASK, model=MODEL, device=DEVICE, trust_remote_code=True
)
model = trans_pipe.model
tokenizer = trans_pipe.tokenizer
eval_logger.info(
f"Running simple_evaluate to cache request objects for tasks: {tasks}"
)
cache_args = request_caching_arg_to_dict(cache_requests=cache_requests)
eval_logger.info(
f"The following operations will be performed on the cache: {cache_requests}"
)
eval_data = simple_evaluate(
model="hf-auto",
model_args={
"pretrained": model,
"tokenizer": tokenizer,
},
limit=1,
device=DEVICE,
tasks=tasks,
write_out=True,
**cache_args,
)
return eval_data
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"--tasks",
"-t",
default=None,
metavar="task1,task2",
)
parser.add_argument(
"--cache_requests",
type=str,
default=None,
choices=["true", "refresh", "delete"],
help="Speed up evaluation by caching the building of dataset requests. `None` if not caching.",
)
args = parser.parse_args()
tasks = args.tasks.split(",")
eval_data = run_model_for_task_caching(
tasks=tasks, model=MODEL, device=DEVICE, cache_requests=args.cache_requests
)
# import lm_eval.base as base
import importlib
import os
import sys
from datetime import datetime
from typing import List, Tuple

import pytest
import torch

# import lm_eval.models as models
from lm_eval.caching.cache import PATH

MODULE_DIR = os.path.dirname(os.path.realpath(__file__))

# NOTE the script this loads uses simple_evaluate
# TODO potentially test both the helper script and the normal script
sys.path.append(f"{MODULE_DIR}/../scripts")
model_loader = importlib.import_module("requests_caching")
run_model_for_task_caching = model_loader.run_model_for_task_caching

DEFAULT_TASKS = ["lambada_openai", "hellaswag"]


@pytest.fixture(autouse=True)
def setup_and_teardown():
    # Setup
    torch.use_deterministic_algorithms(False)
    clear_cache()

    # Yields control back to the test function
    yield

    # Cleanup here


def clear_cache():
    if os.path.exists(PATH):
        cache_files = os.listdir(PATH)
        for file in cache_files:
            file_path = f"{PATH}/{file}"
            os.unlink(file_path)


# leaving tasks here to allow for the option to select specific task files
def get_cache_files(tasks: List[str] = None) -> Tuple[List[str], List[str]]:
    cache_files = os.listdir(PATH)

    file_task_names = []

    for file in cache_files:
        file_without_prefix = file.split("-")[1]
        file_without_prefix_and_suffix = file_without_prefix.split(".")[0]
        file_task_names.append(file_without_prefix_and_suffix)

    return cache_files, file_task_names


def assert_created(tasks: List[str], file_task_names: List[str]):
    tasks.sort()
    file_task_names.sort()

    assert tasks == file_task_names


@pytest.mark.parametrize("tasks", [DEFAULT_TASKS])
def test_requests_caching_true(tasks: List[str]):
    run_model_for_task_caching(tasks=tasks, cache_requests="true")

    cache_files, file_task_names = get_cache_files()

    assert_created(tasks=tasks, file_task_names=file_task_names)


@pytest.mark.parametrize("tasks", [DEFAULT_TASKS])
def test_requests_caching_refresh(tasks: List[str]):
    run_model_for_task_caching(tasks=tasks, cache_requests="true")

    timestamp_before_test = datetime.now().timestamp()

    run_model_for_task_caching(tasks=tasks, cache_requests="refresh")

    cache_files, file_task_names = get_cache_files()

    for file in cache_files:
        modification_time = os.path.getmtime(f"{PATH}/{file}")
        assert modification_time > timestamp_before_test

    tasks.sort()
    file_task_names.sort()

    assert tasks == file_task_names


@pytest.mark.parametrize("tasks", [DEFAULT_TASKS])
def test_requests_caching_delete(tasks: List[str]):
    # populate the data first, rerun this test within this test for additional confidence
    test_requests_caching_true(tasks=tasks)

    run_model_for_task_caching(tasks=tasks, cache_requests="delete")

    cache_files, file_task_names = get_cache_files()

    assert len(cache_files) == 0


# useful for locally running tests through the debugger
if __name__ == "__main__":

    def run_tests():
        tests = [
            test_requests_caching_true,
            test_requests_caching_refresh,
            test_requests_caching_delete,
        ]

        for test_func in tests:
            clear_cache()
            test_func(tasks=DEFAULT_TASKS)

        print("Tests pass")

    run_tests()