Commit 90ad5db7 authored by lintangsutawika

merged main

parents f692caa9 b177c82c
@@ -16,3 +16,8 @@ temp
# IPython
profile_default/
ipython_config.py
# don't track (the default location of) the cached requests
lm_eval/caching/.cache
# don't track files created by wandb
wandb
examples/wandb
@@ -245,6 +245,10 @@ For a full list of supported arguments, check out the [interface](https://github
## Visualizing Results
You can seamlessly visualize and analyze the results of your evaluation harness runs using both Weights & Biases (W&B) and Zeno.
### Zeno
You can use [Zeno](https://zenoml.com) to visualize the results of your eval harness runs.
First, head to [hub.zenoml.com](https://hub.zenoml.com) to create an account and get an API key [on your account page](https://hub.zenoml.com/account).
@@ -284,6 +288,41 @@ If you run the eval harness on multiple tasks, the `project_name` will be used a
You can find an example of this workflow in [examples/visualize-zeno.ipynb](examples/visualize-zeno.ipynb).
### Weights and Biases
With the [Weights and Biases](https://wandb.ai/site) integration, you can now spend more time extracting deeper insights into your evaluation results. The integration is designed to streamline the process of logging and visualizing experiment results using the Weights & Biases (W&B) platform.
The integration provides functionality to:
- automatically log the evaluation results,
- log the samples as W&B Tables for easy visualization,
- log the `results.json` file as an artifact for version control,
- log the `<task_name>_eval_samples.json` file if the samples are logged,
- generate a comprehensive report for analysis and visualization with all the important metrics,
- log task- and CLI-specific configs,
- and more out of the box, such as the command used to run the evaluation, GPU/CPU counts, timestamp, etc.
First, install the `lm_eval[wandb]` package extra: `pip install lm_eval[wandb]`.
Authenticate your machine with your unique W&B token. Visit https://wandb.ai/authorize to get one, then run `wandb login` in your command-line terminal.
Run the eval harness as usual, adding the `wandb_args` flag. Use this flag to provide arguments for initializing a wandb run ([wandb.init](https://docs.wandb.ai/ref/python/init)) as a comma-separated string of `key=value` pairs.
```bash
lm_eval \
--model hf \
--model_args pretrained=microsoft/phi-2,trust_remote_code=True \
--tasks hellaswag,mmlu_abstract_algebra \
--device cuda:0 \
--batch_size 8 \
--output_path output/phi-2 \
--limit 10 \
--wandb_args project=lm-eval-harness-integration \
--log_samples
```
In the stdout, you will find the link to the W&B run page, as well as a link to the generated report. You can find an example of this workflow in [examples/visualize-wandb.ipynb](examples/visualize-wandb.ipynb).
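The same logging can also be driven from Python. The sketch below mirrors how `lm_eval/__main__.py` wires up `WandbLogger` in this commit; the model, task, and `wandb_args` values are illustrative only, not defaults.

```python
import argparse

import lm_eval
from lm_eval.logging_utils import WandbLogger

# run an evaluation programmatically (illustrative arguments)
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=microsoft/phi-2,trust_remote_code=True",
    tasks=["hellaswag", "mmlu_abstract_algebra"],
    limit=10,
    log_samples=True,
)

# WandbLogger reads the comma-separated `wandb_args` string from an args-like
# object, just as the CLI entrypoint does
args = argparse.Namespace(wandb_args="project=lm-eval-harness-integration,job_type=eval")
wandb_logger = WandbLogger(args)
wandb_logger.post_init(results)
wandb_logger.log_eval_result()
wandb_logger.log_eval_samples(results["samples"])  # only present when log_samples=True
wandb_logger.run.finish()
```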
## How to Contribute or Learn More?
For more information on the library and how everything fits together, check out all of our [documentation pages](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs)! We plan to post a larger roadmap of desired + planned library improvements soon, with more information on how contributors can help.
...
@@ -10,51 +10,54 @@ Equivalently, running the library can be done via the `lm-eval` entrypoint at th
This mode supports a number of command-line arguments, the details of which can also be seen via running with `-h` or `--help`:
- `--model` : Selects which model type or provider is evaluated. Must be a string corresponding to the name of the model type/provider being used. See [the main README](https://github.com/EleutherAI/lm-evaluation-harness/tree/main#commercial-apis) for a full list of enabled model names and supported libraries or APIs.
- `--model_args` : Controls parameters passed to the model constructor. Accepts a string containing comma-separated keyword arguments to the model class of the format `"arg1=val1,arg2=val2,..."`, such as, for example, `--model_args pretrained=EleutherAI/pythia-160m,dtype=float32`. For a full list of supported keyword arguments, see the initialization of the `lm_eval.api.model.LM` subclass, e.g. [`HFLM`](https://github.com/EleutherAI/lm-evaluation-harness/blob/365fcda9b85bbb6e0572d91976b8daf409164500/lm_eval/models/huggingface.py#L66)
- `--tasks` : Determines which tasks or task groups are evaluated. Accepts a comma-separated list of task names or task group names. Must be solely comprised of valid tasks/groups.
- `--num_fewshot` : Sets the number of few-shot examples to place in context. Must be an integer.
- `--gen_kwargs` : Takes an arg string in the same format as `--model_args` and creates a dictionary of keyword arguments. These will be passed to the models for all called `generate_until` (free-form or greedy generation task) tasks, to set options such as the sampling temperature or `top_p` / `top_k`. For a list of what args are supported for each model type, reference the respective library's documentation (for example, the documentation for `transformers.AutoModelForCausalLM.generate()`.) These kwargs will be applied to all `generate_until` tasks called--we do not currently support unique gen_kwargs or batch_size values per task in a single run of the library. To control these on a per-task level, set them in that task's YAML file.
- `--batch_size` : Sets the batch size used for evaluation. Can be a positive integer or `"auto"` to automatically select the largest batch size that will fit in memory, speeding up evaluation. One can pass `--batch_size auto:N` to re-select the maximum batch size `N` times during evaluation. This can help accelerate evaluation further, since `lm-eval` sorts documents in descending order of context length.
- `--max_batch_size` : Sets the maximum batch size to try to fit in memory, if `--batch_size auto` is passed.
- `--device` : Sets which device to place the model onto. Must be a string, for example, `"cuda", "cuda:0", "cpu", "mps"`. Defaults to "cuda", and can be ignored if running multi-GPU or running a non-local model type.
- `--output_path` : A string of the form `dir/file.jsonl` or `dir/`. Provides a path where high-level results will be saved, either into the file named or into the directory named. If `--log_samples` is passed as well, then per-document outputs and metrics will be saved into the directory as well.
- `--log_samples` : If this flag is passed, then the model's outputs, and the text fed into the model, will be saved at per-document granularity. Must be used with `--output_path`.
- `--limit` : Accepts an integer, or a float between 0.0 and 1.0. If passed, will limit the number of documents to evaluate to the first X documents (if an integer) per task or first X% of documents per task. Useful for debugging, especially on costly API models.
- `--use_cache` : Should be a path where a sqlite db file can be written to. Takes a string of format `/path/to/sqlite_cache_` in order to create a cache db at `/path/to/sqlite_cache_rank{i}.db` for each process (0-NUM_GPUS). This allows results of prior runs to be cached, so that there is no need to re-run results in order to re-score or re-run a given (model, task) pair again.
- `--cache_requests` : Can be "true", "refresh", or "delete". "true" means that the cache should be used. "refresh" means that you wish to regenerate the cache, which you should run if you change your dataset configuration for a given task. "delete" will delete the cache. Cached files are stored under `lm_eval/caching/.cache` unless you specify a different path via the environment variable `LM_HARNESS_CACHE_PATH`, e.g. `LM_HARNESS_CACHE_PATH=~/Documents/cache_for_lm_harness`. See the example after this list.
- `--decontamination_ngrams_path` : Deprecated, see [this commit](https://github.com/EleutherAI/lm-evaluation-harness/commit/00209e10f6e27edf5d766145afaf894079b5fe10) or older for a working decontamination-checker tool.
- `--check_integrity` : If this flag is used, the library tests for each task selected are run to confirm task integrity.
- `--write_out` : Used for diagnostic purposes to observe the format of task documents passed to a model. If this flag is used, then prints the prompt and gold target string for the first document of each task.
- `--show_config` : If used, prints the full `lm_eval.api.task.TaskConfig` contents (non-default settings in the task YAML file) for each task which was run, at the completion of an evaluation. Useful for when one is modifying a task's configuration YAML locally to transmit the exact configurations used for debugging or for reproducibility purposes.
- `--include_path` : Accepts a path to a folder. If passed, then all YAML files containing `lm-eval`-compatible task configurations will be added to the task registry as available tasks. Used for when one is writing config files for their own task in a folder other than `lm_eval/tasks/`.
- `--predict_only`: Generates the model outputs without computing metrics. Use with `--log_samples` to retrieve decoded results.
- `--seed`: Set seed for python's random, numpy and torch. Accepts a comma-separated list of 3 values for python's random, numpy, and torch seeds, respectively, or a single integer to set the same seed for all three. The values are either an integer or 'None' to not set the seed. Default is `0,1234,1234` (for backward compatibility). E.g. `--seed 0,None,8` sets `random.seed(0)` and `torch.manual_seed(8)`. Here numpy's seed is not set since the second value is `None`. E.g., `--seed 42` sets all three seeds to 42.
- `--wandb_args`: Tracks logging to Weights and Biases for evaluation runs and includes args passed to `wandb.init`, such as `project` and `job_type`. A full list of arguments is available [here](https://docs.wandb.ai/ref/python/init).
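For example, a run that combines response caching and request caching might look like the following (the paths, model, and task are illustrative, not defaults):

```bash
# cache model responses in per-rank sqlite files and cache the built dataset
# requests, storing the request cache in a custom location
LM_HARNESS_CACHE_PATH=~/Documents/cache_for_lm_harness lm_eval \
    --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks hellaswag \
    --use_cache /path/to/sqlite_cache_ \
    --cache_requests true
```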
## External Library Usage
We also support using the library's external API for use within model training loops or other scripts.
`lm_eval` supplies two functions for external import and use: `lm_eval.evaluate()` and `lm_eval.simple_evaluate()`.
`simple_evaluate()` can be used by simply creating an `lm_eval.api.model.LM` subclass that implements the methods described in the [Model Guide](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs/model_guide.md), and wrapping your custom model in that class as follows:
```python
@@ -86,7 +89,6 @@ results = lm_eval.simple_evaluate( # call simple_evaluate
)
```
See https://github.com/EleutherAI/lm-evaluation-harness/blob/365fcda9b85bbb6e0572d91976b8daf409164500/lm_eval/evaluator.py#L35 for a full description of all arguments available. All keyword arguments to simple_evaluate share the same role as the command-line flags described previously.
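For instance, a minimal call might look like the following (the model and task names are illustrative, not recommendations):

```python
import lm_eval

# keyword arguments mirror the CLI flags described in the interface docs
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
```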
Additionally, the `evaluate()` function offers the core evaluation functionality provided by the library, but without some of the special handling and simplification + abstraction provided by `simple_evaluate()`.
@@ -94,6 +96,7 @@ Additionally, the `evaluate()` function offers the core evaluation functionality
See https://github.com/EleutherAI/lm-evaluation-harness/blob/365fcda9b85bbb6e0572d91976b8daf409164500/lm_eval/evaluator.py#L173 for more details.
As a brief example usage of `evaluate()`:
```python
import lm_eval
...
@@ -66,7 +66,7 @@ All three request types take as input `requests` of type `list[Instance]` that h
- It should return `(ll,) : Tuple[float]`, a.k.a. solely the *loglikelihood* of producing each piece of text given no starting input.
To allow a model to be evaluated on all types of tasks, you will need to implement these three types of measurements (note that `loglikelihood_rolling` is a special case of `loglikelihood`). For a reference implementation, check out `lm_eval/models/huggingface.py`! Additionally, check out `lm_eval.api.model.TemplateLM` for a class that abstracts away some commonly used functions across LM subclasses, or see if your model would lend itself well to subclassing the `lm_eval.models.huggingface.HFLM` class and overriding just the initialization or a couple methods!
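As a rough sketch of the `TemplateLM` route (the class name and registry key below are placeholders, and the method bodies are left to your backend), a subclass only needs to supply tokenization, an end-of-text token id, and the token-level loglikelihood / generation routines:

```python
from typing import List, Tuple

from lm_eval.api.model import TemplateLM
from lm_eval.api.registry import register_model


@register_model("my-backend")  # illustrative name
class MyBackendLM(TemplateLM):
    def __init__(self, **kwargs):
        super().__init__()
        # set up your tokenizer / client here

    @property
    def eot_token_id(self):
        ...  # return your tokenizer's end-of-text token id

    def tok_encode(self, string: str, **kwargs):
        ...  # return a list of token ids

    def _loglikelihood_tokens(self, requests, **kwargs):
        ...  # return a list of (loglikelihood, is_greedy) tuples

    def loglikelihood_rolling(self, requests) -> List[Tuple[float, bool]]:
        ...

    def generate_until(self, requests) -> List[str]:
        ...
```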
**Tip: be careful of indexing in loglikelihood!**
...
{
"cells": [
{
"cell_type": "markdown",
"id": "fc477b96-adee-4829-a9d7-a5eb990df358",
"metadata": {},
"source": [
"# Visualizing Results in Weights and Biases\n",
"\n",
"With the Weights and Biases integration, you can now spend more time extracting deeper insights into your evaluation results. The integration is designed to streamline the process of logging and visualizing experiment results using the Weights & Biases (W&B) platform.\n",
"\n",
"The integration provide functionalities\n",
"\n",
"- to automatically log the evaluation results,\n",
"- log the samples as W&B Tables for easy visualization,\n",
"- log the `results.json` file as an artifact for version control,\n",
"- log the `<task_name>_eval_samples.json` file if the samples are logged,\n",
"- generate a comprehensive report for analysis and visualization with all the important metric,\n",
"- log task and cli configs,\n",
"- and more out of the box like the command used to run the evaluation, GPU/CPU counts, timestamp, etc.\n",
"\n",
"The integration is super easy to use with the eval harness. Let's see how!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3851439a-bff4-41f2-bf21-1b3d8704913b",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# Install this project if you did not already have it.\n",
"# This is all that is needed to be installed to start using Weights and Biases\n",
"\n",
"!pip -qq install -e ..[wandb]"
]
},
{
"cell_type": "markdown",
"id": "8507fd7e-3b99-4a92-89fa-9eaada74ba91",
"metadata": {},
"source": [
"# Run the Eval Harness\n",
"\n",
"Run the eval harness as usual with a `wandb_args` flag. This flag is used to provide arguments for initializing a wandb run ([wandb.init](https://docs.wandb.ai/ref/python/init)) as comma separated string arguments.\n",
"\n",
"If `wandb_args` flag is used, the metrics and all other goodness will be automatically logged to Weights and Biases. In the stdout, you will find the link to the W&B run page as well as link to the generated report."
]
},
{
"cell_type": "markdown",
"id": "eec5866e-f01e-42f8-8803-9d77472ef991",
"metadata": {},
"source": [
"## Set your API Key\n",
"\n",
"Before you can use W&B, you need to authenticate your machine with an authentication key. Visit https://wandb.ai/authorize to get one."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d824d163-71a9-4313-935d-f1d56397841c",
"metadata": {},
"outputs": [],
"source": [
"import wandb\n",
"wandb.login()"
]
},
{
"cell_type": "markdown",
"id": "124e4a34-1547-4bed-bc09-db012bacbda6",
"metadata": {},
"source": [
"> Note that if you are using command line you can simply authenticate your machine by doing `wandb login` in your terminal. For more info check out the [documentation](https://docs.wandb.ai/quickstart#2-log-in-to-wb)."
]
},
{
"cell_type": "markdown",
"id": "abc6f6b6-179a-4aff-ada9-f380fb74df6e",
"metadata": {},
"source": [
"## Run and log to W&B"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bd0a8130-a97b-451a-acd2-3f9885b88643",
"metadata": {},
"outputs": [],
"source": [
"!lm_eval \\\n",
" --model hf \\\n",
" --model_args pretrained=microsoft/phi-2,trust_remote_code=True \\\n",
" --tasks hellaswag,mmlu_abstract_algebra \\\n",
" --device cuda:0 \\\n",
" --batch_size 8 \\\n",
" --output_path output/phi-2 \\\n",
" --limit 10 \\\n",
" --wandb_args project=lm-eval-harness-integration \\\n",
" --log_samples"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
@@ -11,7 +11,9 @@ from typing import Union
import numpy as np
from lm_eval import evaluator, utils
from lm_eval.evaluator import request_caching_arg_to_dict
from lm_eval.logging_utils import WandbLogger
from lm_eval.tasks import TaskManager, include_path, initialize_tasks
from lm_eval.utils import make_table
@@ -118,6 +120,13 @@ def parse_eval_args() -> argparse.Namespace:
metavar="DIR",
help="A path to a sqlite db file for caching model responses. `None` if not caching.",
)
parser.add_argument(
"--cache_requests",
type=str,
default=None,
choices=["true", "refresh", "delete"],
help="Speed up evaluation by caching the building of dataset requests. `None` if not caching.",
)
parser.add_argument("--decontamination_ngrams_path", default=None) # TODO: not used parser.add_argument("--decontamination_ngrams_path", default=None) # TODO: not used
parser.add_argument( parser.add_argument(
"--check_integrity", "--check_integrity",
@@ -167,6 +176,11 @@ def parse_eval_args() -> argparse.Namespace:
metavar="CRITICAL|ERROR|WARNING|INFO|DEBUG",
help="Controls the reported logging error level. Set to DEBUG when testing + adding new task configurations for comprehensive log output.",
)
parser.add_argument(
"--wandb_args",
default="",
help="Comma separated string arguments passed to wandb.init, e.g. `project=lm-eval,job_type=eval",
)
parser.add_argument(
"--predict_only",
"-x",
@@ -195,6 +209,9 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
# we allow for args to be passed externally, else we parse them ourselves
args = parse_eval_args()
if args.wandb_args:
wandb_logger = WandbLogger(args)
eval_logger = utils.eval_logger
eval_logger.setLevel(getattr(logging, f"{args.verbosity}"))
eval_logger.info(f"Verbosity set to {args.verbosity}")
@@ -272,6 +289,10 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
eval_logger.info(f"Selected Tasks: {task_names}")
eval_logger.info("Loading selected tasks...")
request_caching_args = request_caching_arg_to_dict(
cache_requests=args.cache_requests
)
results = evaluator.simple_evaluate(
model=args.model,
model_args=args.model_args,
@@ -289,6 +310,7 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
gen_kwargs=args.gen_kwargs,
task_manager=task_manager,
predict_only=args.predict_only,
**request_caching_args,
random_seed=args.seed[0],
numpy_random_seed=args.seed[1],
torch_random_seed=args.seed[2],
@@ -305,6 +327,16 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
batch_sizes = ",".join(map(str, results["config"]["batch_sizes"]))
# Add W&B logging
if args.wandb_args:
try:
wandb_logger.post_init(results)
wandb_logger.log_eval_result()
if args.log_samples:
wandb_logger.log_eval_samples(samples)
except Exception as e:
eval_logger.info(f"Logging to Weights and Biases failed due to {e}")
if args.output_path:
output_path_file.open("w", encoding="utf-8").write(dumped)
@@ -330,6 +362,10 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
if "groups" in results:
print(make_table(results, "groups"))
if args.wandb_args:
# Tear down wandb run once all the logging is done.
wandb_logger.run.finish()
if __name__ == "__main__": if __name__ == "__main__":
cli_evaluate() cli_evaluate()
@@ -5,7 +5,7 @@ from collections import defaultdict
from collections.abc import Iterable
from typing import List
import evaluate as hf_evaluate
import numpy as np
import sacrebleu
import sklearn.metrics
@@ -179,7 +179,7 @@ def acc_mutual_info_fn(items): # This is a passthrough function
return items
exact_match = hf_evaluate.load("exact_match")
@register_metric(
...
@@ -133,6 +133,28 @@ class LM(abc.ABC):
args2 = {k: v for k, v in additional_config.items() if v is not None}
return cls(**args, **args2)
@classmethod
def create_from_arg_obj(
cls: Type[T], arg_dict: dict, additional_config: Optional[dict] = None
) -> T:
"""
Creates an instance of the LM class using the given arg_obj
Parameters:
- arg_obj: A dict containing arguments in the format key1=value1,key2=value2.
- additional_config: Optional dictionary containing additional configuration parameters.
Returns:
- Instance of the LM class.
"""
additional_config = {} if additional_config is None else additional_config
additional_config = {
k: v for k, v in additional_config.items() if v is not None
}
return cls(**arg_dict, **additional_config)
@property
def rank(self):
# used in the case of parallelism. Hardcoded to
@@ -203,7 +225,7 @@ class CachingLM:
eval_logger.info(
f"Loading '{attr}' responses from cache '{self.cache_db}' where possible..."
)
for req in tqdm(requests, desc="Checking cached requests"):
hsh = hash_args(attr, req.args)
if attr == "generate_until" and req.args[1].get("do_sample", False):
# when we are doing non-greedy generation, don't use the cache
@@ -224,7 +246,9 @@ class CachingLM:
else:
res.append(None)
remaining_reqs.append(req)
eval_logger.info(
f"Cached requests: {len(requests) - len(remaining_reqs)}, Requests remaining: {len(remaining_reqs)}"
)
# actually run the LM on the requests that do not have cached results
rem_res = getattr(self.lm, attr)(remaining_reqs)
@@ -247,3 +271,61 @@ class CachingLM:
def get_cache_hook(self):
return CacheHook(self)
class TemplateLM(LM):
"""
A class acting as intermediary between the LM base class
and boilerplate often included in other LM subclasses.
"""
@property
@abc.abstractmethod
def eot_token_id(self):
pass
@abc.abstractmethod
def tok_encode(self, string: str, **kwargs):
pass
@abc.abstractmethod
def _loglikelihood_tokens(self, requests, **kwargs):
pass
def _encode_pair(self, context, continuation):
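# Move any trailing whitespace on the context onto the continuation before
# encoding, so that tokenizing the concatenated string lines up with
# tokenizing the context alone.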
n_spaces = len(context) - len(context.rstrip())
if n_spaces > 0:
continuation = context[-n_spaces:] + continuation
context = context[:-n_spaces]
whole_enc = self.tok_encode(context + continuation)
context_enc = self.tok_encode(context)
context_enc_len = len(context_enc)
continuation_enc = whole_enc[context_enc_len:]
return context_enc, continuation_enc
def loglikelihood(self, requests) -> List[Tuple[float, bool]]:
new_reqs = []
for context, continuation in [req.args for req in requests]:
if context == "":
# end of text as context
context_enc, continuation_enc = (
[self.eot_token_id],
self.tok_encode(continuation),
)
else:
context_enc, continuation_enc = self._encode_pair(context, continuation)
new_reqs.append(((context, continuation), context_enc, continuation_enc))
return self._loglikelihood_tokens(new_reqs)
@abc.abstractmethod
def loglikelihood_rolling(self, requests) -> List[Tuple[float, bool]]:
pass
@abc.abstractmethod
def generate_until(self, requests) -> List[str]:
pass
import logging
from typing import Callable, Dict
import evaluate as hf_evaluate
from lm_eval.api.model import LM
@@ -129,7 +129,7 @@ def get_metric(name: str, hf_evaluate_metric=False) -> Callable:
)
try:
metric_object = hf_evaluate.load(name)
return metric_object.compute
except Exception:
eval_logger.error(
...
@@ -4,12 +4,14 @@ import logging
import random
import re
from collections.abc import Callable
from copy import deepcopy
from dataclasses import asdict, dataclass
from inspect import getsource
from typing import Any, Iterator, List, Literal, Tuple, Union
import datasets
import numpy as np
from tqdm import tqdm
from lm_eval import utils
from lm_eval.api import samplers
@@ -27,6 +29,7 @@ from lm_eval.api.registry import (
get_metric_aggregation,
is_higher_better,
)
from lm_eval.caching.cache import load_from_cache, save_to_cache
from lm_eval.filters import build_filter_ensemble
from lm_eval.prompts import get_prompt
@@ -122,9 +125,11 @@ class TaskConfig(dict):
if self.output_type == "generate_until":
# ensure that we greedily generate in absence of explicit arguments otherwise
self.generation_kwargs = {
"until": (
None
if self.fewshot_delimiter is None
else [self.fewshot_delimiter]
),
"do_sample": False,
}
@@ -338,7 +343,7 @@ class Task(abc.ABC):
return doc
@property
def instances(self) -> List[Instance]:
"""After calling `task.build_all_requests()`, tasks
maintain a list of the dataset instances which will be evaluated.
"""
@@ -364,20 +369,57 @@ class Task(abc.ABC):
def doc_to_target(self, doc):
pass
def build_all_requests(
self,
*,
limit=None,
rank=None,
world_size=None,
cache_requests=False,
rewrite_requests_cache=False,
) -> None:
"""Build a set of Instances for a task, and store them in task.instances""" """Build a set of Instances for a task, and store them in task.instances"""
if self.has_test_docs():
docs = self.test_docs() # used with caching
elif self.has_validation_docs(): og_limit = limit
docs = self.validation_docs()
else: cache_key = f"requests-{self._config.task}"
assert False, f"Task dataset (path={self.DATASET_PATH}, name={self.DATASET_NAME}) must have valid or test docs!"
cached_instances = load_from_cache(file_name=cache_key)
if cache_requests and cached_instances and not rewrite_requests_cache:
cached_instances = cached_instances[:limit]
flattened_instances = [
instance
for instance_group in cached_instances
for instance in instance_group
]
self._instances = flattened_instances
return
eval_logger.info(f"Building contexts for {self.config.task} on rank {rank}...") eval_logger.info(f"Building contexts for {self.config.task} on rank {rank}...")
instances = [] instances = []
for doc_id, doc in utils.create_iterator(
enumerate(docs), rank, world_size, limit # process all documents when caching is specified for simplicity
if (
cache_requests
and (not cached_instances or rewrite_requests_cache)
and limit is not None
):
limit = None
doc_id_docs = list(
self.doc_iterator(rank=rank, limit=limit, world_size=world_size)
)
num_docs = len(doc_id_docs)
for doc_id, doc in tqdm(
doc_id_docs,
total=num_docs,
):
# sample fewshot context #TODO: need to offset doc_id by rank now!
fewshot_ctx = self.fewshot_context(
@@ -395,11 +437,25 @@ class Task(abc.ABC):
if not isinstance(inst, list):
inst = [inst]
instances.append(inst)
# now flatten, this is to allow slicing to work with pickles
sliced_instances = instances[:og_limit]
flattened_instances = [
instance
for instance_group in sliced_instances
for instance in instance_group
]
self._instances = flattened_instances
assert len(self._instances) != 0, "task.build_requests() did not find any docs!"
if cache_requests and (not cached_instances or rewrite_requests_cache):
save_to_cache(file_name=cache_key, obj=instances)
@abc.abstractmethod
def construct_requests(self, doc, ctx, **kwargs):
"""Uses RequestFactory to construct Requests and returns an iterable of
@@ -581,6 +637,27 @@ class Task(abc.ABC):
setattr(self._config, "metric_list", [{"metric": metric_name}])
setattr(self._config, "process_results", None)
@property
def eval_docs(self) -> Union[datasets.Dataset, List[dict]]:
if self.has_test_docs():
return self.test_docs()
elif self.has_validation_docs():
return self.validation_docs()
else:
assert False, f"Task dataset (path={self.DATASET_PATH}, name={self.DATASET_NAME}) must have valid or test docs!"
def doc_iterator(
self, *, rank: int = 0, limit: Union[int, None] = None, world_size: int = 1
) -> Iterator[Tuple[int, Any]]:
limit = int(limit) if limit else None
doc_iterator = utils.create_iterator(
enumerate(self.eval_docs),
rank=int(rank),
limit=limit,
world_size=int(world_size),
)
return doc_iterator
class ConfigurableTask(Task):
VERSION = "Yaml"
@@ -730,12 +807,7 @@ class ConfigurableTask(Task):
else "default"
)(list(self.fewshot_docs()), self, rnd=random.Random(1234))
self.task_docs = self.eval_docs
# Test One Doc
self.features = list(self.task_docs.features.keys())
@@ -1080,7 +1152,7 @@ class ConfigurableTask(Task):
return request_list
elif self.OUTPUT_TYPE == "generate_until":
arguments = (ctx, deepcopy(self.config.generation_kwargs))
return Instance(
request_type=self.OUTPUT_TYPE, doc=doc, arguments=arguments, idx=0, **kwargs
@@ -1193,8 +1265,8 @@ class ConfigurableTask(Task):
**({"mcc": (gold, pred)} if "mcc" in use_metric else {}),
**({"acc_norm": acc_norm} if "acc_norm" in use_metric else {}),
**({"exact_match": exact_match} if "exact_match" in use_metric else {}),
**(
# {"brier_score": (gold, prob_norm)}
{"brier_score": [np.eye(len(prob_norm))[gold], prob_norm]}
if "brier_score" in use_metric
else {}
@@ -1293,6 +1365,15 @@ class ConfigurableTask(Task):
def get_config(self, key: str) -> Any:
return getattr(self._config, key, None)
def __repr__(self):
return (
f"ConfigurableTask(task_name={getattr(self.config, 'task', None)},"
f"group_name={getattr(self.config, 'group', None)},"
f"output_type={self.OUTPUT_TYPE},"
f"num_fewshot={getattr(self.config, 'num_fewshot', None)},"
f"num_samples={len(self.eval_docs)})"
)
class MultipleChoiceTask(Task):
OUTPUT_TYPE: str = "loglikelihood"
...
import hashlib
import os
import dill
from lm_eval.utils import eval_logger
MODULE_DIR = os.path.dirname(os.path.realpath(__file__))
OVERRIDE_PATH = os.getenv("LM_HARNESS_CACHE_PATH")
PATH = OVERRIDE_PATH if OVERRIDE_PATH else f"{MODULE_DIR}/.cache"
# This should be sufficient for uniqueness
HASH_INPUT = "EleutherAI-lm-evaluation-harness"
HASH_PREFIX = hashlib.sha256(HASH_INPUT.encode("utf-8")).hexdigest()
FILE_SUFFIX = f".{HASH_PREFIX}.pickle"
def load_from_cache(file_name):
try:
path = f"{PATH}/{file_name}{FILE_SUFFIX}"
with open(path, "rb") as file:
cached_task_dict = dill.loads(file.read())
return cached_task_dict
except Exception:
eval_logger.debug(f"{file_name} is not cached, generating...")
pass
def save_to_cache(file_name, obj):
if not os.path.exists(PATH):
os.mkdir(PATH)
file_path = f"{PATH}/{file_name}{FILE_SUFFIX}"
eval_logger.debug(f"Saving {file_path} to cache...")
with open(file_path, "wb") as file:
file.write(dill.dumps(obj))
# NOTE the "key" param is to allow for flexibility
def delete_cache(key: str = ""):
files = os.listdir(PATH)
for file in files:
if file.startswith(key) and file.endswith(FILE_SUFFIX):
file_path = f"{PATH}/{file}"
os.unlink(file_path)
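# Example round trip (illustrative; the file_name values are arbitrary):
#   save_to_cache(file_name="requests-hellaswag", obj=instances)
#   cached = load_from_cache(file_name="requests-hellaswag")
#   delete_cache(key="requests-hellaswag")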
import collections
import math
import pathlib
import sys
from typing import Dict, List, Optional, Tuple, Union
from lm_eval.api import metrics
from lm_eval.utils import eval_logger, positional_deprecated
class TaskOutput:
"""
Wrapper class for Task outputs. It contains various attributes and methods to manage and calculate metrics for the task.
Attributes:
task (object): The task object.
task_name (str): The name of the task.
task_config (dict): The configuration of the task.
version (str): The version of the task.
group_name (str): The name of the task group.
n_shot (int): The number of shots for the task.
task_alias (str): The alias of the task.
group_alias (str): The alias of the task group.
is_group (bool): Indicates if the task is a group.
logged_samples (list): The list of logged samples.
sample_len (int): The length of the samples.
sample_metrics (defaultdict): The dictionary of samples' metrics.
agg_metrics (defaultdict): The dictionary of aggregate metrics.
Methods:
from_taskdict(cls, task_name: str, task):
Creates a TaskOutput instance from a task dictionary.
calculate_aggregate_metric(bootstrap_iters=100000) -> None:
Calculates the aggregate metrics for the task.
"""
def __init__(
self,
task=None,
task_name=None,
task_config=None,
version=None,
group_name=None,
n_shot=None,
task_alias=None,
group_alias=None,
is_group=None,
):
self.task = task
self.task_config = task_config
self.task_name = task_name
self.group_name = group_name
self.version = version
self.n_shot = n_shot
self.task_alias = task_alias
self.group_alias = group_alias
self.is_group = is_group
self.logged_samples = []
self.sample_len = None
self.sample_metrics = collections.defaultdict(list)
self.agg_metrics = collections.defaultdict(list)
@classmethod
def from_taskdict(cls, task_name: str, task):
if isinstance(task, tuple):
group_name, task = task
else:
group_name = None
if not task:
# these gets filtered out in get_task_list
# once they are added to group hierarchy
is_group = True
return cls(
task=task, task_name=task_name, is_group=is_group, group_name=group_name
)
version = task.VERSION
task_config = dict(task.dump_config())
if (n_shot := task_config.get("num_fewshot")) == 0:
n_shot = task_config.get("metadata", {}).get("num_fewshot", 0)
task_alias = task_config.get("alias")
group_alias = task_config.get("group_alias")
return cls(
task=task,
task_name=task_name,
task_config=task_config,
group_name=group_name,
version=version,
n_shot=n_shot,
task_alias=task_alias,
group_alias=group_alias,
)
def calculate_aggregate_metric(self, bootstrap_iters=100000) -> None:
for (metric, filter_key), items in self.sample_metrics.items():
agg_fn = self.task.aggregation()[metric]
metric_key = f"{metric},{filter_key}"
self.agg_metrics[metric_key] = agg_fn(items)
self.sample_len = len(items) # TODO: same sample size for each metric?
if bootstrap_iters:
stderr_fn = metrics.stderr_for_metric(
metric=agg_fn,
bootstrap_iters=min(bootstrap_iters, 100)
if metric in ["bleu", "chrf", "ter"]
else bootstrap_iters,
)
self.agg_metrics[f"{metric}_stderr,{filter_key}"] = (
stderr_fn(items) if (stderr_fn and len(items) > 1) else "N/A"
)
def __repr__(self):
return (
f"TaskOutput(task_name={self.task_name}, "
f"group_name={self.group_name}, "
f"version={self.version},"
f"n_shot={self.n_shot}"
f"task_alias={self.task_alias}, group_alias={self.group_alias})"
)
def get_task_list(task_dict: dict) -> Tuple[Dict[str, list], List[TaskOutput]]:
task_hierarchy = collections.defaultdict(list)
outputs = list(TaskOutput.from_taskdict(x, y) for x, y in task_dict.items())
for task_output in outputs:
if group_name := task_output.group_name:
task_hierarchy[group_name].append(task_output.task_name)
else:
task_hierarchy[task_output.task_name] = []
# returns task_hierarchy tracking which groups contain which subtasks,
# and a list of TaskOutput classes for each non-group subtask
return task_hierarchy, [x for x in outputs if x.task]
def print_writeout(task) -> None:
for inst in task.instances:
# print the prompt for the first few documents
if inst.doc_id < 1:
eval_logger.info(
f"Task: {task}; document {inst.doc_id}; context prompt (starting on next line):\
\n{inst.args[0]}\n(end of prompt on previous line)\ntarget string or answer choice index (starting on next line):\n{task.doc_to_target(inst.doc)}\n(end of target on previous line)"
)
eval_logger.info(f"Request: {str(inst)}")
def get_sample_size(task, limit: Optional[int]) -> Union[int, None]:
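# `limit` may be an absolute document count or, if < 1.0, a fraction of the
# task's eval docs (e.g. limit=0.1 with 250 docs yields 25 documents)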
if limit is not None:
limit = (
int(math.ceil(len(task.eval_docs) * limit)) if limit < 1.0 else int(limit)
)
return limit
def prepare_print_tasks(
task_hierarchy: dict, results: dict, tab=0
) -> Tuple[dict, dict]:
"""
@param task_hierarchy: Dictionary representing the group hierarchy of tasks. Each key is a group name and its
value is a list of task names.
@param results: Dictionary containing the results of each task. Each key is a
group name and its value is a dictionary of task results.
@param tab: The indentation level for printing the task
hierarchy. Default is 0.
@return: A tuple of two dictionaries: results_agg and groups_agg. results_agg contains
aggregated results for each task, and groups_agg contains aggregated results for each group.
Prepares the task hierarchy and aggregates the results for each task and group recursively for printing.
"""
results_agg = collections.defaultdict(dict)
groups_agg = collections.defaultdict(dict)
(group_name, task_list), *_ = task_hierarchy.items()
task_list = sorted(task_list)
results_agg[group_name] = results[group_name].copy()
# results_agg[group_name]["tab"] = tab
if "samples" in results_agg[group_name]:
results_agg[group_name].pop("samples")
tab_string = " " * tab + "- " if tab > 0 else ""
if "alias" in results_agg[group_name]:
results_agg[group_name]["alias"] = tab_string + results_agg[group_name]["alias"]
else:
results_agg[group_name]["alias"] = tab_string + group_name
if len(task_list) > 0:
groups_agg[group_name] = results[group_name].copy()
# groups_agg[group_name]["tab"] = tab
if "samples" in groups_agg[group_name]:
groups_agg[group_name].pop("samples")
if "alias" in groups_agg[group_name]:
groups_agg[group_name]["alias"] = (
tab_string + groups_agg[group_name]["alias"]
)
else:
groups_agg[group_name]["alias"] = tab_string + group_name
for task_name in task_list:
if task_name in task_hierarchy:
_task_hierarchy = {
**{task_name: task_hierarchy[task_name]},
**task_hierarchy,
}
else:
_task_hierarchy = {
**{task_name: []},
**task_hierarchy,
}
_results_agg, _groups_agg = prepare_print_tasks(
_task_hierarchy, results, tab + 1
)
results_agg = {**results_agg, **_results_agg}
groups_agg = {**groups_agg, **_groups_agg}
return results_agg, groups_agg
def consolidate_results(
eval_tasks: List[TaskOutput],
) -> Tuple[dict, dict, dict, dict, dict]:
"""
@param eval_tasks: list(TaskOutput).
@return: A tuple containing the consolidated results, samples, configs, versions, and num_fewshot.
Consolidates the results of multiple evaluation tasks into a single structure.
The method iterates over each evaluation instance and extracts relevant information to create the consolidated
results structure. The consolidated results structure has the following properties:
- results: A defaultdict with task names as keys and dictionaries as values. Each dictionary contains
metric/filter pairs as keys and corresponding metric values as values. The "alias" key is used to store task
aliases specified in the task configuration.
- samples: A defaultdict with task names as keys and lists of log samples as values.
- configs: A defaultdict with task names as keys and task configurations as values.
- versions: A defaultdict with task names as keys and task versions as values.
- num_fewshot: A defaultdict with task names as keys and number of few-shot samples as values.
The method then returns the consolidated results, samples, configs, versions, and num_fewshot as a tuple.
"""
# stores the final result for each task, for each metric/filter pair.
results = collections.defaultdict(dict)
# logs info about each document evaluated.
samples = collections.defaultdict(list)
# store num-fewshot value per task
num_fewshot = collections.defaultdict(int)
# Tracks the YAML configs of all chosen task
configs = collections.defaultdict(dict)
# Tracks each task's version.
versions = collections.defaultdict(dict)
for task_output in eval_tasks:
if "task_alias" in (task_config := task_output.task_config):
results[task_output.task_name]["alias"] = task_config["task_alias"]
if group_alias := task_output.group_alias:
if group_alias not in results and (group_name := task_output.group_name):
results[group_name]["alias"] = group_alias
num_fewshot[task_output.task_name] = task_output.n_shot
configs[task_output.task_name] = task_output.task_config
versions[task_output.task_name] = task_output.version
samples[task_output.task_name] = task_output.logged_samples
for (metric, filter_key), items in task_output.sample_metrics.items():
metric_key = f"{metric},{filter_key}"
results[task_output.task_name][metric_key] = task_output.agg_metrics[
metric_key
]
results[task_output.task_name]["samples"] = task_output.sample_len
results[task_output.task_name][
f"{metric}_stderr,{filter_key}"
] = task_output.agg_metrics[f"{metric}_stderr,{filter_key}"]
return results, samples, configs, versions, num_fewshot
@positional_deprecated
def find_test_root(start_path: pathlib.Path) -> pathlib.Path:
"""
Search upward in the directory tree to a maximum of three layers
to find and return the package root (containing the 'tests' folder)
"""
cur_path = start_path.resolve()
max_layers = 3
for _ in range(max_layers):
if (cur_path / "tests" / "test_version_stable.py").exists():
return cur_path
else:
cur_path = cur_path.parent.resolve()
raise FileNotFoundError(
f"Unable to find package root within {max_layers} upwards" + f"of {start_path}"
)
@positional_deprecated
def run_task_tests(task_list: List[str]):
"""
Find the package root and run the tests for the given tasks
"""
import pytest
package_root = find_test_root(start_path=pathlib.Path(__file__))
task_string = " or ".join(task_list)
args = [
f"{package_root}/tests/test_version_stable.py",
f"--rootdir={package_root}",
"-k",
f"{task_string}",
]
sys.path.append(str(package_root))
pytest_return_val = pytest.main(args)
if pytest_return_val:
raise ValueError(
f"Not all tests for the specified tasks ({task_list}) ran successfully! Error code: {pytest_return_val}"
)
from functools import partial
from typing import List, Union
from lm_eval.api.filter import FilterEnsemble
from . import extraction, selection, transformation
FILTER_REGISTRY = {
...
import copy
import json
import logging
import os
import re
import subprocess
from pathlib import Path
from typing import Any, Dict, List, Literal, Optional, Tuple, Union
import numpy as np
import pandas as pd
from packaging.version import Version
from torch.utils.collect_env import get_pretty_env_info
from transformers import __version__ as trans_version
from lm_eval.utils import simple_parse_args_string
logger = logging.getLogger(__name__)
try:
import wandb
assert Version(wandb.__version__) >= Version("0.13.6")
if Version(wandb.__version__) < Version("0.13.6"):
wandb.require("report-editing:v0")
except Exception as e:
logger.warning(
"To use the wandb reporting functionality please install wandb>=0.13.6.\n"
"To install the latest version of wandb run `pip install wandb --upgrade`\n"
f"{e}"
)
def remove_none_pattern(input_string: str) -> Tuple[str, bool]:
"""Remove the ',none' substring from the input_string if it exists at the end.
Args:
input_string (str): The input string from which to remove the ',none' substring.
Returns:
Tuple[str, bool]: A tuple containing the modified input_string with the ',none' substring removed
and a boolean indicating whether the modification was made (True) or not (False).
"""
# Define the pattern to match ',none' at the end of the string
pattern = re.compile(r",none$")
# Use sub() to replace ',none' with an empty string
result = re.sub(pattern, "", input_string)
# check if the input_string changed
removed = result != input_string
return result, removed
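# For example (illustrative only):
#   remove_none_pattern("acc,none")             -> ("acc", True)
#   remove_none_pattern("acc,flexible-extract") -> ("acc,flexible-extract", False)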
def _handle_non_serializable(o: Any) -> Union[int, str, list]:
"""Handle non-serializable objects by converting them to serializable types.
Args:
o (Any): The object to be handled.
Returns:
Union[int, str, list]: The converted object. If the object is of type np.int64 or np.int32,
it will be converted to int. If the object is of type set, it will be converted
to a list. Otherwise, it will be converted to str.
"""
if isinstance(o, np.int64) or isinstance(o, np.int32):
return int(o)
elif isinstance(o, set):
return list(o)
else:
return str(o)
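# Used below as the `default` hook for json.dumps so that numpy integers and sets
# do not break serialization, e.g. (illustrative):
#   json.dumps({"ids": {1, 2}, "count": np.int64(3)}, default=_handle_non_serializable)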
def get_wandb_printer() -> Literal["Printer"]:
"""Returns a wandb printer instance for pretty stdout."""
from wandb.sdk.lib.printer import get_printer
from wandb.sdk.wandb_settings import Settings
printer = get_printer(Settings()._jupyter)
return printer
class WandbLogger:
def __init__(self, args: Any) -> None:
"""Initialize the WandbLogger.
Args:
results (Dict[str, Any]): The results dictionary.
args (Any): Arguments for configuration.
"""
self.wandb_args: Dict[str, Any] = simple_parse_args_string(args.wandb_args)
# initialize a W&B run
if wandb.run is None:
self.run = wandb.init(**self.wandb_args)
else:
self.run = wandb.run
self.printer = get_wandb_printer()
def post_init(self, results: Dict[str, Any]) -> None:
self.results: Dict[str, Any] = copy.deepcopy(results)
self.task_names: List[str] = list(results.get("results", {}).keys())
self.group_names: List[str] = list(results.get("groups", {}).keys())
def _get_config(self) -> Dict[str, Any]:
"""Get configuration parameters."""
self.task_configs = self.results.get("configs", {})
cli_configs = self.results.get("config", {})
configs = {
"task_configs": self.task_configs,
"cli_configs": cli_configs,
}
return configs
def _sanitize_results_dict(self) -> Tuple[Dict[str, str], Dict[str, Any]]:
"""Sanitize the results dictionary."""
_results = copy.deepcopy(self.results.get("results", dict()))
# Remove None from the metric string name
tmp_results = copy.deepcopy(_results)
for task_name in self.task_names:
task_result = tmp_results.get(task_name, dict())
for metric_name, metric_value in task_result.items():
_metric_name, removed = remove_none_pattern(metric_name)
if removed:
_results[task_name][_metric_name] = metric_value
_results[task_name].pop(metric_name)
# remove string valued keys from the results dict
wandb_summary = {}
for task in self.task_names:
task_result = _results.get(task, dict())
for metric_name, metric_value in task_result.items():
if isinstance(metric_value, str):
wandb_summary[f"{task}/{metric_name}"] = metric_value
for summary_metric, summary_value in wandb_summary.items():
_task, _summary_metric = summary_metric.split("/")
_results[_task].pop(_summary_metric)
tmp_results = copy.deepcopy(_results)
for task_name, task_results in tmp_results.items():
for metric_name, metric_value in task_results.items():
_results[f"{task_name}/{metric_name}"] = metric_value
_results[task_name].pop(metric_name)
for task in self.task_names:
_results.pop(task)
return wandb_summary, _results
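    # Sketch of the transformation above (illustrative values):
    #   {"hellaswag": {"acc,none": 0.57, "alias": "hellaswag"}}
    # becomes
    #   wandb_summary -> {"hellaswag/alias": "hellaswag"}   (string-valued entries)
    #   _results      -> {"hellaswag/acc": 0.57}            (",none" stripped, keys flattened)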
def _log_results_as_table(self) -> None:
"""Generate and log evaluation results as a table to W&B."""
columns = [
"Version",
"Filter",
"num_fewshot",
"Metric",
"Value",
"Stderr",
]
def make_table(columns: List[str], key: str = "results"):
table = wandb.Table(columns=columns)
results = copy.deepcopy(self.results)
for k, dic in results.get(key).items():
if k in self.group_names and not key == "groups":
continue
version = results.get("versions").get(k)
if version == "N/A":
version = None
n = results.get("n-shot").get(k)
                for mf, v in dic.items():
m, _, f = mf.partition(",")
if m.endswith("_stderr"):
continue
if m == "alias":
continue
if m + "_stderr" + "," + f in dic:
se = dic[m + "_stderr" + "," + f]
if se != "N/A":
se = "%.4f" % se
table.add_data(*[k, version, f, n, m, str(v), str(se)])
else:
table.add_data(*[k, version, f, n, m, str(v), ""])
return table
# log the complete eval result to W&B Table
table = make_table(["Tasks"] + columns, "results")
self.run.log({"evaluation/eval_results": table})
if "groups" in self.results.keys():
table = make_table(["Groups"] + columns, "groups")
self.run.log({"evaluation/group_eval_results": table})
def _log_results_as_artifact(self) -> None:
"""Log results as JSON artifact to W&B."""
dumped = json.dumps(
self.results, indent=2, default=_handle_non_serializable, ensure_ascii=False
)
artifact = wandb.Artifact("results", type="eval_results")
with artifact.new_file("results.json", mode="w", encoding="utf-8") as f:
f.write(dumped)
self.run.log_artifact(artifact)
def log_eval_result(self) -> None:
"""Log evaluation results to W&B."""
# Log configs to wandb
configs = self._get_config()
self.run.config.update(configs)
wandb_summary, self.wandb_results = self._sanitize_results_dict()
# update wandb.run.summary with items that were removed
self.run.summary.update(wandb_summary)
# Log the evaluation metrics to wandb
self.run.log(self.wandb_results)
# Log the evaluation metrics as W&B Table
self._log_results_as_table()
# Log the results dict as json to W&B Artifacts
self._log_results_as_artifact()
def _generate_dataset(
self, data: List[Dict[str, Any]], config: Dict[str, Any]
) -> pd.DataFrame:
"""Generate a dataset from evaluation data.
Args:
data (List[Dict[str, Any]]): The data to generate a dataset for.
config (Dict[str, Any]): The configuration of the task.
Returns:
pd.DataFrame: A dataframe that is ready to be uploaded to W&B.
"""
ids = [x["doc_id"] for x in data]
labels = [x["target"] for x in data]
instance = [""] * len(ids)
resps = [""] * len(ids)
filtered_resps = [""] * len(ids)
model_outputs = {}
metrics_list = config["metric_list"]
metrics = {}
for metric in metrics_list:
metric = metric.get("metric")
if metric in ["word_perplexity", "byte_perplexity", "bits_per_byte"]:
metrics[f"{metric}_loglikelihood"] = [x[metric][0] for x in data]
if metric in ["byte_perplexity", "bits_per_byte"]:
metrics[f"{metric}_bytes"] = [x[metric][1] for x in data]
else:
metrics[f"{metric}_words"] = [x[metric][1] for x in data]
else:
metrics[metric] = [x[metric] for x in data]
if config["output_type"] == "loglikelihood":
instance = [x["arguments"][0][0] for x in data]
labels = [x["arguments"][0][1] for x in data]
resps = [
f'log probability of continuation is {x["resps"][0][0][0]} '
+ "\n\n"
+ "continuation will {} generated with greedy sampling".format(
"not be" if not x["resps"][0][0][1] else "be"
)
for x in data
]
filtered_resps = [
f'log probability of continuation is {x["filtered_resps"][0][0]} '
+ "\n\n"
+ "continuation will {} generated with greedy sampling".format(
"not be" if not x["filtered_resps"][0][1] else "be"
)
for x in data
]
elif config["output_type"] == "multiple_choice":
instance = [x["arguments"][0][0] for x in data]
choices = [
"\n".join([f"{idx}. {y[1]}" for idx, y in enumerate(x["arguments"])])
for x in data
]
resps = [np.argmax([n[0][0] for n in x["resps"]]) for x in data]
filtered_resps = [
np.argmax([n[0] for n in x["filtered_resps"]]) for x in data
]
elif config["output_type"] == "loglikelihood_rolling":
instance = [x["arguments"][0][0] for x in data]
resps = [x["resps"][0][0] for x in data]
filtered_resps = [x["filtered_resps"][0] for x in data]
elif config["output_type"] == "generate_until":
instance = [x["arguments"][0][0] for x in data]
resps = [x["resps"][0][0] for x in data]
filtered_resps = [x["filtered_resps"][0] for x in data]
model_outputs["raw_predictions"] = resps
model_outputs["filtered_predictions"] = filtered_resps
df_data = {
"id": ids,
"data": instance,
}
if config["output_type"] == "multiple_choice":
df_data["choices"] = choices
tmp_data = {
"input_len": [len(x) for x in instance],
"labels": labels,
"output_type": config["output_type"],
}
df_data.update(tmp_data)
df_data.update(model_outputs)
df_data.update(metrics)
return pd.DataFrame(df_data)
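    # The resulting dataframe has one row per logged sample, with columns:
    # id, data, (choices for multiple_choice tasks), input_len, labels,
    # output_type, raw_predictions, filtered_predictions, and one column per metric.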
def _log_samples_as_artifact(
self, data: List[Dict[str, Any]], task_name: str
) -> None:
# log the samples as an artifact
dumped = json.dumps(
data,
indent=2,
default=_handle_non_serializable,
ensure_ascii=False,
)
artifact = wandb.Artifact(f"{task_name}", type="samples_by_task")
with artifact.new_file(
f"{task_name}_eval_samples.json", mode="w", encoding="utf-8"
) as f:
f.write(dumped)
self.run.log_artifact(artifact)
# artifact.wait()
def log_eval_samples(self, samples: Dict[str, List[Dict[str, Any]]]) -> None:
"""Log evaluation samples to W&B.
Args:
samples (Dict[str, List[Dict[str, Any]]]): Evaluation samples for each task.
"""
task_names: List[str] = [
x for x in self.task_names if x not in self.group_names
]
ungrouped_tasks = []
tasks_by_groups = {}
for task_name in task_names:
group_names = self.task_configs[task_name].get("group", None)
if group_names:
if isinstance(group_names, str):
group_names = [group_names]
for group_name in group_names:
if not tasks_by_groups.get(group_name):
tasks_by_groups[group_name] = [task_name]
else:
tasks_by_groups[group_name].append(task_name)
else:
ungrouped_tasks.append(task_name)
for task_name in ungrouped_tasks:
eval_preds = samples[task_name]
# log the samples as a W&B Table
df = self._generate_dataset(eval_preds, self.task_configs.get(task_name))
self.run.log({f"{task_name}_eval_results": df})
# log the samples as a json file as W&B Artifact
self._log_samples_as_artifact(eval_preds, task_name)
for group, grouped_tasks in tasks_by_groups.items():
grouped_df = pd.DataFrame()
for task_name in grouped_tasks:
eval_preds = samples[task_name]
df = self._generate_dataset(
eval_preds, self.task_configs.get(task_name)
)
df["group"] = group
df["task"] = task_name
grouped_df = pd.concat([grouped_df, df], ignore_index=True)
# log the samples as a json file as W&B Artifact
self._log_samples_as_artifact(eval_preds, task_name)
self.run.log({f"{group}_eval_results": grouped_df})
def get_commit_from_path(repo_path: Path) -> Optional[str]:
git_folder = Path(repo_path, ".git")
if git_folder.is_file():
git_folder = Path(
git_folder.parent,
git_folder.read_text(encoding="utf-8").split("\n")[0].split(" ")[-1],
)
if Path(git_folder, "HEAD").exists():
head_name = (
Path(git_folder, "HEAD")
.read_text(encoding="utf-8")
.split("\n")[0]
.split(" ")[-1]
)
head_ref = Path(git_folder, head_name)
git_hash = head_ref.read_text(encoding="utf-8").replace("\n", "")
else:
git_hash = None
return git_hash
def get_git_commit_hash():
"""
Gets the git commit hash of your current repo (if it exists).
Source: https://github.com/EleutherAI/gpt-neox/blob/b608043be541602170bfcfb8ec9bf85e8a0799e0/megatron/neox_arguments/neox_args.py#L42
"""
try:
git_hash = subprocess.check_output(["git", "describe", "--always"]).strip()
git_hash = git_hash.decode()
except (subprocess.CalledProcessError, FileNotFoundError):
# FileNotFoundError occurs when git not installed on system
git_hash = get_commit_from_path(os.getcwd()) # git hash of repo if exists
return git_hash
def add_env_info(storage: Dict[str, Any]):
try:
pretty_env_info = get_pretty_env_info()
except Exception as err:
pretty_env_info = str(err)
transformers_version = trans_version
upper_dir_commit = get_commit_from_path(
Path(os.getcwd(), "..")
) # git hash of upper repo if exists
added_info = {
"pretty_env_info": pretty_env_info,
"transformers_version": transformers_version,
"upper_git_hash": upper_dir_commit, # in case this repo is submodule
}
storage.update(added_info)
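# Minimal usage sketch (editorial illustration, not part of this module). It mirrors
# how the CLI drives the logger; the import path and the exact shape of `results`
# are assumptions here and should be adjusted to wherever this file lives and to
# the dict returned by `lm_eval.simple_evaluate`:
#
#   from types import SimpleNamespace
#
#   args = SimpleNamespace(wandb_args="project=lm-eval-harness-integration,job_type=eval")
#   wandb_logger = WandbLogger(args)        # starts (or reuses) a W&B run
#   wandb_logger.post_init(results)         # results: dict with "results", "configs", ...
#   wandb_logger.log_eval_result()          # configs, summary, tables, results.json artifact
#   wandb_logger.log_eval_samples(samples)  # samples: per-task logged samples, if collected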
-from . import huggingface
-from . import openai_completions
-from . import textsynth
-from . import dummy
-from . import anthropic_llms
-from . import gguf
-from . import vllm_causallms
-from . import mamba_lm
-from . import optimum_lm
-from . import neuron_optimum
+from . import (
+    anthropic_llms,
+    dummy,
+    gguf,
+    huggingface,
+    mamba_lm,
+    neuron_optimum,
+    openai_completions,
+    optimum_lm,
+    textsynth,
+    vllm_causallms,
+)
 # TODO: implement __all__
...
@@ -24,7 +24,7 @@ from transformers.models.auto.modeling_auto import (
 from lm_eval import utils
 from lm_eval.api.instance import Instance
-from lm_eval.api.model import LM
+from lm_eval.api.model import TemplateLM
 from lm_eval.api.registry import register_model
 from lm_eval.models.utils import (
     Collator,
@@ -64,7 +64,7 @@ def _get_accelerate_args(
 @register_model("hf-auto", "hf", "huggingface")
-class HFLM(LM):
+class HFLM(TemplateLM):
     """
     An abstracted Huggingface model class. Enables usage with both models of
     `transformers.AutoModelForCausalLM` and `transformers.AutoModelForSeq2SeqLM` classes.
@@ -78,9 +78,8 @@ class HFLM(LM):
     def __init__(
         self,
         pretrained: Optional[Union[str, transformers.PreTrainedModel]] = "gpt2",
-        backend: Optional[
-            Literal["default", "causal", "seq2seq"]
-        ] = "default",  # override whether the model should be treated as decoder-only (causal) or encoder-decoder (seq2seq)
+        backend: Optional[Literal["default", "causal", "seq2seq"]] = "default",
+        # override whether the model should be treated as decoder-only (causal) or encoder-decoder (seq2seq)
         revision: Optional[str] = "main",
         subfolder: Optional[str] = None,
         tokenizer: Optional[
@@ -91,6 +90,7 @@ class HFLM(LM):
             ]
         ] = None,
         truncation: Optional[bool] = False,
+        logits_cache: bool = True,
         max_length: Optional[int] = None,
         device: Optional[str] = "cuda",
         dtype: Optional[Union[str, torch.dtype]] = "auto",
@@ -98,6 +98,7 @@ class HFLM(LM):
         max_batch_size: Optional[int] = 64,
         trust_remote_code: Optional[bool] = False,
         use_fast_tokenizer: Optional[bool] = True,
+        add_bos_token: Optional[bool] = False,
         # arguments used for splitting a model across GPUs naively.
         # only used if `parallelize=True`.
         parallelize: Optional[bool] = False,
@@ -239,7 +240,7 @@ class HFLM(LM):
         )
         self.truncation = truncation
+        self.logits_cache = logits_cache
         self.vocab_size = self.tokenizer.vocab_size
         # select (or create) a pad token to use
         if self.tokenizer.pad_token:
@@ -249,7 +250,7 @@ class HFLM(LM):
         elif self.tokenizer.eos_token:
             self.tokenizer.pad_token_id = self.tokenizer.eos_token_id
         else:
-            if self.config.model_type == "qwen":
+            if getattr(self.config, "model_type", None) == "qwen":
                 # Qwen's trust_remote_code tokenizer does not allow for adding special tokens
                 self.tokenizer.pad_token = "<|endoftext|>"
             elif (
@@ -265,6 +266,14 @@ class HFLM(LM):
             else:
                 self.tokenizer.add_special_tokens({"pad_token": "<|pad|>"})
+        # TODO: override this for Gemma
+        self.add_bos_token = add_bos_token
+        if getattr(self.config, "model_type", None) == "gemma":
+            self.add_bos_token = True
+            eval_logger.info(
+                f"Model type is '{self.config.model_type}', a BOS token will be used as Gemma underperforms without it."
+            )
         self._max_length = max_length
         self.batch_schedule = 1
@@ -657,8 +666,9 @@ class HFLM(LM):
         """ """
         if add_special_tokens is None:
             if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM:
-                add_special_tokens = False
+                add_special_tokens = False or self.add_bos_token
             elif self.AUTO_MODEL_CLASS == transformers.AutoModelForSeq2SeqLM:
+                # TODO: investigate best practices for enc-dec models + special tokens
                 add_special_tokens = True
         encoding = self.tokenizer.encode(string, add_special_tokens=add_special_tokens)
@@ -681,7 +691,7 @@ class HFLM(LM):
         self.tokenizer.padding_side = padding_side
         if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM:
-            add_special_tokens = False
+            add_special_tokens = False or self.add_bos_token
         elif self.AUTO_MODEL_CLASS == transformers.AutoModelForSeq2SeqLM:
             add_special_tokens = True
@@ -760,7 +770,9 @@ class HFLM(LM):
             **generation_kwargs,
         )
-    def _select_cont_toks(self, logits, contlen=None, inplen=None):
+    def _select_cont_toks(
+        self, logits: torch.Tensor, contlen: int = None, inplen: int = None
+    ) -> torch.Tensor:
         if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM:
             assert (
                 contlen and inplen
@@ -778,39 +790,6 @@ class HFLM(LM):
         return logits
-    def _encode_pair(
-        self, context: str, continuation: str
-    ) -> Tuple[List[int], List[int]]:
-        n_spaces = len(context) - len(context.rstrip())
-        if n_spaces > 0:
-            continuation = context[-n_spaces:] + continuation
-            context = context[:-n_spaces]
-        whole_enc = self.tok_encode(context + continuation, add_special_tokens=False)
-        context_enc = self.tok_encode(context, add_special_tokens=False)
-        # whole_enc = self.tok_encode(context + continuation)
-        # context_enc = self.tok_encode(context, add_special_tokens=False)
-        context_enc_len = len(context_enc)
-        continuation_enc = whole_enc[context_enc_len:]
-        return context_enc, continuation_enc
-    def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]:
-        new_reqs = []
-        for context, continuation in [req.args for req in requests]:
-            if context == "":
-                # end of text as context
-                context_enc, continuation_enc = (
-                    [self.eot_token_id],
-                    self.tok_encode(continuation),
-                )
-            else:
-                context_enc, continuation_enc = self._encode_pair(context, continuation)
-            new_reqs.append(((context, continuation), context_enc, continuation_enc))
-        return self._loglikelihood_tokens(new_reqs)
     def loglikelihood_rolling(self, requests: List[Instance]) -> List[float]:
         loglikelihoods = []
@@ -851,7 +830,7 @@ class HFLM(LM):
             rolling_token_windows += pad_amnt * [rolling_token_windows[0]]
             string_nll = self._loglikelihood_tokens(
-                rolling_token_windows,
+                requests=rolling_token_windows,
                 disable_tqdm=True,
                 override_bs=adaptive_batch_size,
             )
@@ -893,7 +872,7 @@ class HFLM(LM):
         # TODO: implement some kind of efficient-request-middleware that lumps together requests with the same context
         res = []
-        def _collate(x):
+        def _collate(req: Tuple[Tuple[str, str], List[int], List[int]]):
             """Defines the key for the sorted method"""
             # the negative sign on len(toks) sorts descending - this has a few advantages:
            # - time estimates will always be over not underestimates, which is more useful for planning
@@ -902,10 +881,26 @@ class HFLM(LM):
             # automatic adaptive batches much much easier to implement
             # - any OOMs will happen right away rather than near the end
-            toks = x[1] + x[2]
+            toks = req[1] + req[2]
             return -len(toks), tuple(toks)
-        re_ord = Collator(requests, sort_fn=_collate)
+        def _lookup_one_token_cont(req: Tuple[Tuple[str, str], List[int], List[int]]):
+            """Defines the key to group and lookup one-token continuations"""
+            # Use with group_by="contexts" (optional)"
+            # allows for the creation of a lookup, so we can re-use logits in case of one-token continuations.
+            # speeds up some multiple-choice tasks proportionally to the number of choices.
+            # groups requests by context+continuation[:-1] and infer on one request/group.
+            return req[-2] + req[-1][:-1]
+        re_ord = Collator(
+            requests,
+            sort_fn=_collate,
+            group_by="contexts"
+            if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM
+            and self.logits_cache
+            else None,
+            group_fn=_lookup_one_token_cont,
+        )
         # automatic (variable) batch size detection for vectorization
         # pull longest context sample from request
@@ -926,7 +921,11 @@ class HFLM(LM):
             )
         chunks = re_ord.get_batched(n=batch_size, batch_fn=batch_fn)
-        pbar = tqdm(total=len(requests), disable=(disable_tqdm or (self.rank != 0)))
+        pbar = tqdm(
+            total=len(requests),
+            disable=(disable_tqdm or (self.rank != 0)),
+            desc="Running loglikelihood requests",
+        )
         for chunk in chunks:
             inps = []
             cont_toks_list = []
@@ -1026,7 +1025,7 @@ class HFLM(LM):
                 self._model_call(batched_inps, **call_kwargs), dim=-1
             )  # [batch, padding_length (inp or cont), vocab]
-            for (cache_key, _, _), logits, inplen, cont_toks in zip(
+            for (request_str, ctx_tokens, _), logits, inplen, cont_toks in zip(
                 chunk, multi_logits, inplens, cont_toks_list
             ):
                 # Slice to original seq length
@@ -1045,24 +1044,36 @@ class HFLM(LM):
                 # Check if per-token argmax is exactly equal to continuation
                 greedy_tokens = logits.argmax(dim=-1)
-                cont_toks = torch.tensor(
-                    cont_toks, dtype=torch.long, device=self.device
-                ).unsqueeze(0)  # [1, seq]
-                max_equal = (greedy_tokens == cont_toks).all()
-                # Obtain log-probs at the corresponding continuation token indices
-                # last_token_slice = logits[:, -1, :].squeeze(0).tolist()
-                logits = torch.gather(logits, 2, cont_toks.unsqueeze(-1)).squeeze(
-                    -1
-                )  # [1, seq]
-                # Answer: (log prob, is-exact-match)
-                answer = (float(logits.sum()), bool(max_equal))
-                res.append(answer)
-                self.cache_hook.add_partial("loglikelihood", cache_key, answer)
-                pbar.update(1)
+                # check for one-token continuation cache hits.
+                # noop in case group_by != "contexts" or no cache hit and returns the
+                # original args. Otherwise, expands the logits batch dimension and yields each
+                # batch along with matching continuation tokens and prompt strings.
+                # logits -> [1, seq, vocab]
+                for request_str, cont_toks, logits in re_ord.get_cache(
+                    req_str=request_str,
+                    cxt_toks=ctx_tokens,
+                    cont_toks=cont_toks,
+                    logits=logits,
+                ):
+                    cont_toks = torch.tensor(
+                        cont_toks, dtype=torch.long, device=self.device
+                    ).unsqueeze(0)  # [1, seq]
+                    max_equal = (greedy_tokens == cont_toks).all()
+                    # Obtain log-probs at the corresponding continuation token indices
+                    # last_token_slice = logits[:, -1, :].squeeze(0).tolist()
+                    logits = torch.gather(logits, 2, cont_toks.unsqueeze(-1)).squeeze(
+                        -1
+                    )  # [1, seq]
+                    # Answer: (log prob, is-exact-match)
+                    answer = (float(logits.sum()), bool(max_equal))
                    res.append(answer)
+                    self.cache_hook.add_partial("loglikelihood", request_str, answer)
+                    pbar.update(1)
         pbar.close()
@@ -1071,7 +1082,7 @@ class HFLM(LM):
     def generate_until(self, requests: List[Instance]) -> List[str]:
         res = []
-        def _collate(x):
+        def _collate(req: Tuple[str, dict]):
             """Defines the key for the sorted method"""
             # the negative sign on len(toks) sorts descending - this has a few advantages:
             # - time estimates will always be over not underestimates, which is more useful for planning
@@ -1079,10 +1090,14 @@ class HFLM(LM):
             # padded context length. this is useful to simplify the batching logic and more importantly to make
             # automatic adaptive batches much much easier to implement
             # - any OOMs will happen right away rather than near the end
-            toks = self.tok_encode(x[0])
-            return -len(toks), x[0]
-        pbar = tqdm(total=len(requests), disable=(self.rank != 0))
+            toks = self.tok_encode(req[0])
+            return -len(toks), req[0]
+        pbar = tqdm(
+            total=len(requests),
+            disable=(self.rank != 0),
+            desc="Running generate_until requests",
+        )
         adaptive_batch_size = None
         if self.batch_size == "auto":
             # using rolling window with maximum context
@@ -1107,7 +1122,13 @@ class HFLM(LM):
         # we group requests by their generation_kwargs,
        # so that we don't try to execute e.g. greedy sampling and temp=0.8 sampling
         # in the same batch.
-        re_ords = Collator([reg.args for reg in requests], _collate, grouping=True)
+        # group_fn=lambda x: x[1] -> x=(context, gen_kwargs)
+        re_ords = Collator(
+            [reg.args for reg in requests],
+            sort_fn=_collate,
+            group_by="gen_kwargs",
+            group_fn=lambda x: x[1],
+        )
         chunks = re_ords.get_batched(n=batch_size, batch_fn=batch_fn)
         for chunk in chunks:
             contexts, all_gen_kwargs = zip(*chunk)
...
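The `group_by="contexts"` path introduced above lets the harness reuse a single forward pass for requests that differ only in their final continuation token. A rough, self-contained sketch of that grouping key follows; it is an editorial illustration (the helper name `group_one_token_continuations` is hypothetical), not the actual `Collator` implementation.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_one_token_continuations(
    requests: List[Tuple[str, List[int], List[int]]],
) -> Dict[Tuple[int, ...], List[Tuple[str, List[int]]]]:
    """Group (request_str, ctx_toks, cont_toks) requests that share a prompt.

    Requests whose context plus all-but-the-last continuation token are identical
    feed the model the exact same input, so their logits can be computed once and
    reused for each candidate final token (e.g. multiple-choice answers).
    """
    groups: Dict[Tuple[int, ...], List[Tuple[str, List[int]]]] = defaultdict(list)
    for request_str, ctx_toks, cont_toks in requests:
        # same key == same model input, so one forward pass covers the whole group
        key = tuple(ctx_toks + cont_toks[:-1])
        groups[key].append((request_str, cont_toks))
    return groups
```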
@@ -56,9 +56,9 @@ class MambaLMWrapper(HFLM):
         super().__init__(
             pretrained=pretrained,
             # set appropriate defaults for tokenizer, max length, etc
-            backend=kwargs.get("backend", "causal"),
-            tokenizer=kwargs.get("tokenizer", "EleutherAI/gpt-neox-20b"),
-            max_length=kwargs.get("max_length", 2048),
+            backend=kwargs.pop("backend", "causal"),
+            tokenizer=kwargs.pop("tokenizer", "EleutherAI/gpt-neox-20b"),
+            max_length=kwargs.pop("max_length", 2048),
             **kwargs,
         )
...
@@ -15,7 +15,7 @@ from transformers.generation import StoppingCriteriaList
 import lm_eval.models.utils
 from lm_eval import utils
-from lm_eval.api.model import LM
+from lm_eval.api.model import TemplateLM
 from lm_eval.api.registry import register_model
 from lm_eval.models.utils import stop_sequences_criteria
@@ -172,7 +172,7 @@ class CustomNeuronModelForCausalLM(NeuronModelForCausalLM):
 @register_model("neuronx")
-class NEURON_HF(LM):
+class NEURON_HF(TemplateLM):
     """
     Enables usage with on AWS Neuron
     using the HuggingFace Transformers + Transformers neuronx library.
@@ -195,8 +195,7 @@ class NEURON_HF(LM):
         low_cpu_mem_usage: Optional[bool] = True,
         trust_remote_code: Optional[bool] = False,
         use_fast_tokenizer: Optional[bool] = True,
-        # arguments used for splitting a model across GPUs naively.
-        # only used if `parallelize=True`.
+        add_bos_token: Optional[bool] = False,
     ) -> None:
         if not NEURON_AVAILABLE:
             raise Exception(
@@ -289,6 +288,7 @@ class NEURON_HF(LM):
         self.vocab_size = self.tokenizer.vocab_size
         self.tokenizer.pad_token_id = self.tokenizer.eos_token_id
+        self.add_bos_token = self.add_bos_token
         self._max_length = max_length
@@ -343,7 +343,7 @@ class NEURON_HF(LM):
     def tok_encode(self, string: str, left_truncate_len=None, add_special_tokens=None):
         """ """
         if add_special_tokens is None:
-            add_special_tokens = False
+            add_special_tokens = False or self.add_bos_token
         encoding = self.tokenizer.encode(string, add_special_tokens=add_special_tokens)
@@ -364,7 +364,7 @@ class NEURON_HF(LM):
         old_padding_side = self.tokenizer.padding_side
         self.tokenizer.padding_side = padding_side
-        add_special_tokens = False
+        add_special_tokens = False or self.add_bos_token
         encoding = self.tokenizer(
             strings,
@@ -447,37 +447,6 @@ class NEURON_HF(LM):
         return logits
-    def _encode_pair(self, context, continuation):
-        n_spaces = len(context) - len(context.rstrip())
-        if n_spaces > 0:
-            continuation = context[-n_spaces:] + continuation
-            context = context[:-n_spaces]
-        whole_enc = self.tok_encode(context + continuation, add_special_tokens=False)
-        context_enc = self.tok_encode(context, add_special_tokens=False)
-        # whole_enc = self.tok_encode(context + continuation)
-        # context_enc = self.tok_encode(context, add_special_tokens=False)
-        context_enc_len = len(context_enc)
-        continuation_enc = whole_enc[context_enc_len:]
-        return context_enc, continuation_enc
-    def loglikelihood(self, requests):
-        new_reqs = []
-        for context, continuation in [req.args for req in requests]:
-            if context == "":
-                # end of text as context
-                context_enc, continuation_enc = (
-                    [self.eot_token_id],
-                    self.tok_encode(continuation),
-                )
-            else:
-                context_enc, continuation_enc = self._encode_pair(context, continuation)
-            new_reqs.append(((context, continuation), context_enc, continuation_enc))
-        return self._loglikelihood_tokens(new_reqs)
     def loglikelihood_rolling(self, requests):
         loglikelihoods = []
...
@@ -8,7 +8,7 @@ from tqdm import tqdm
 import lm_eval.models.utils
 from lm_eval import utils
-from lm_eval.api.model import LM
+from lm_eval.api.model import LM, TemplateLM
 from lm_eval.api.registry import register_model
 from lm_eval.models.utils import retry_on_specific_exceptions
 from lm_eval.utils import eval_logger
@@ -75,7 +75,7 @@ def oa_completion(client, chat: bool = False, **kwargs):
 @register_model("openai-completions", "local-completions")
-class OpenaiCompletionsLM(LM):
+class OpenaiCompletionsLM(TemplateLM):
     _DEFAULT_MAX_LENGTH = 2048
     def __init__(
@@ -171,41 +171,12 @@ class OpenaiCompletionsLM(LM):
         # Isn't used because we override _loglikelihood_tokens
         raise NotImplementedError()
-    def tok_encode(self, string: str) -> List[int]:
+    def tok_encode(self, string: str, **kwargs) -> List[int]:
         return self.tokenizer.encode(string)
     def tok_decode(self, tokens: List[int]) -> str:
         return self.tokenizer.decode(tokens)
-    def _encode_pair(
-        self, context: str, continuation: str
-    ) -> Tuple[List[int], List[int]]:
-        n_spaces = len(context) - len(context.rstrip())
-        if n_spaces > 0:
-            continuation = context[-n_spaces:] + continuation
-            context = context[:-n_spaces]
-        whole_enc = self.tok_encode(context + continuation)
-        context_enc = self.tok_encode(context)
-        context_enc_len = len(context_enc)
-        continuation_enc = whole_enc[context_enc_len:]
-        return context_enc, continuation_enc
-    def loglikelihood(self, requests) -> List[Tuple[float, bool]]:
-        new_reqs = []
-        for context, continuation in [req.args for req in requests]:
-            if context == "":
-                # end of text as context
-                context_enc, continuation_enc = (
-                    [self.eot_token_id],
-                    self.tok_encode(continuation),
-                )
-            else:
-                context_enc, continuation_enc = self._encode_pair(context, continuation)
-            new_reqs.append(((context, continuation), context_enc, continuation_enc))
-        return self._loglikelihood_tokens(new_reqs)
     def _loglikelihood_tokens(
         self, requests, disable_tqdm: bool = False
     ) -> List[Tuple[float, bool]]:
@@ -290,14 +261,13 @@ class OpenaiCompletionsLM(LM):
             list(sameuntil_chunks(re_ord.get_reordered(), self.batch_size))
         ):
             inps = []
-            self._max_gen_toks = request_args.pop("max_gen_toks", self.max_gen_toks)
+            self._max_gen_toks = request_args.get("max_gen_toks", self.max_gen_toks)
             for context, _ in chunk:
                 context_enc = self.tok_encode(context)
                 inp = context_enc[-(self.max_length - self.max_gen_toks) :]
                 inps.append(inp)
-            until = request_args.pop("until", ["<|endoftext|>"])
-            request_args.pop("do_sample", None)
+            until = request_args.get("until", ["<|endoftext|>"])
             request_args["temperature"] = request_args.get("temperature", 0)
             response = oa_completion(
@@ -307,7 +277,11 @@ class OpenaiCompletionsLM(LM):
                 max_tokens=self.max_gen_toks,
                 stop=until,
                 seed=self.seed,
-                **request_args,
+                **{
+                    k: v
+                    for k, v in request_args.items()
+                    if k not in ["do_sample", "max_gen_toks"]
+                },
             )
             for resp, (context, args_) in zip(response.choices, chunk):
                 s = getattr(resp, "text")
...