Unverified Commit 9822b06e authored by Lintang Sutawika, committed by GitHub

Merge branch 'main' into weight_by_size

parents 51f27158 b177c82c
@@ -16,3 +16,8 @@ temp
# IPython
profile_default/
ipython_config.py
# don't track (the default location of) the cached requests
lm_eval/caching/.cache
# don't track files created by wandb
wandb
examples/wandb
@@ -45,6 +45,7 @@ git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
```

We also provide a number of optional dependencies for extended functionality. A detailed table is available at the end of this document.

## Basic Usage
@@ -174,6 +175,7 @@ Note that for externally hosted models, configs such as `--device` and `--batch_

| vLLM | :heavy_check_mark: | `vllm` | [Most HF Causal Language Models](https://docs.vllm.ai/en/latest/models/supported_models.html) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| Mamba | :heavy_check_mark: | `mamba_ssm` | [Mamba architecture Language Models via the `mamba_ssm` package](https://huggingface.co/state-spaces) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| Huggingface Optimum (Causal LMs) | ✔️ | `openvino` | Any decoder-only AutoModelForCausalLM converted with Huggingface Optimum into OpenVINO™ Intermediate Representation (IR) format | `generate_until`, `loglikelihood`, `loglikelihood_rolling` | ... |
| Neuron via AWS Inf2 (Causal LMs) | ✔️ | `neuronx` | Any decoder-only AutoModelForCausalLM supported to run on [huggingface-ami image for inferentia2](https://aws.amazon.com/marketplace/pp/prodview-gr3e6yiscria2) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` | ... |
| Your local inference server! | :heavy_check_mark: | `local-completions` or `local-chat-completions` (using `openai-chat-completions` model type) | Any server address that accepts GET requests using HF models and mirrors OpenAI's Completions or ChatCompletions interface | `generate_until` | | ... |
Models which do not supply logits or logprobs can be used with tasks of type `generate_until` only, while local models, or APIs that supply logprobs/logits of their prompts, can be run on all task types: `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`.
@@ -196,7 +198,7 @@ If you have a Metal compatible Mac, you can run the eval harness using the MPS b

> You can inspect what the LM inputs look like by running the following command:
> ```bash
> python write_out.py \
>     --tasks <task1,task2,...> \
>     --num_fewshot 5 \
>     --num_examples 10 \
>     --output_base_path /path/to/output/folder
@@ -243,6 +245,10 @@ For a full list of supported arguments, check out the [interface](https://github

## Visualizing Results
You can seamlessly visualize and analyze the results of your evaluation harness runs using both Weights & Biases (W&B) and Zeno.
### Zeno
You can use [Zeno](https://zenoml.com) to visualize the results of your eval harness runs.

First, head to [hub.zenoml.com](https://hub.zenoml.com) to create an account and get an API key [on your account page](https://hub.zenoml.com/account).

@@ -282,6 +288,41 @@ If you run the eval harness on multiple tasks, the `project_name` will be used a

You can find an example of this workflow in [examples/visualize-zeno.ipynb](examples/visualize-zeno.ipynb).
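As a rough sketch of the upload step (this assumes the repository's `scripts/zeno_visualize.py` helper and its `--data_path` / `--project_name` flags, plus a prior run made with `--log_samples` and `--output_path`; check the script's `--help` for the exact interface):

```bash
pip install lm_eval[zeno]

python scripts/zeno_visualize.py \
    --data_path output/phi-2 \
    --project_name "Eval Harness Visualization"
```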
### Weights and Biases
With the [Weights and Biases](https://wandb.ai/site) integration, you can now spend more time extracting deeper insights into your evaluation results. The integration is designed to streamline the process of logging and visualizing experiment results using the Weights & Biases (W&B) platform.
The integration provides functionality to:

- automatically log the evaluation results,
- log the samples as W&B Tables for easy visualization,
- log the `results.json` file as an artifact for version control,
- log the `<task_name>_eval_samples.json` file if the samples are logged,
- generate a comprehensive report for analysis and visualization with all the important metrics,
- log task and CLI-specific configs,
- and more out of the box, such as the command used to run the evaluation, GPU/CPU counts, timestamp, etc.
First, you'll need to install the `lm_eval[wandb]` package extra: `pip install lm_eval[wandb]`.

Authenticate your machine with your unique W&B token. Visit https://wandb.ai/authorize to get one, then run `wandb login` in your command-line terminal.

Run the eval harness as usual, with the addition of the `--wandb_args` flag. Use this flag to provide arguments for initializing a wandb run ([wandb.init](https://docs.wandb.ai/ref/python/init)) as comma-separated string arguments.
```bash
lm_eval \
--model hf \
--model_args pretrained=microsoft/phi-2,trust_remote_code=True \
--tasks hellaswag,mmlu_abstract_algebra \
--device cuda:0 \
--batch_size 8 \
--output_path output/phi-2 \
--limit 10 \
--wandb_args project=lm-eval-harness-integration \
--log_samples
```
In the stdout, you will find the link to the W&B run page as well as a link to the generated report. You can find an example of this workflow in [examples/visualize-wandb.ipynb](examples/visualize-wandb.ipynb).
## How to Contribute or Learn More?

For more information on the library and how everything fits together, check out all of our [documentation pages](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs)! We plan to post a larger roadmap of desired + planned library improvements soon, with more information on how contributors can help.

@@ -312,7 +353,9 @@ Extras dependencies can be installed via `pip install -e ".[NAME]"`

| anthropic | For using Anthropic's models |
| dev | For linting PRs and contributions |
| gptq | For loading models with GPTQ |
| hf_transfer | For speeding up HF Hub file downloads |
| ifeval | For running the IFEval task |
| neuronx | For running on AWS inf2 instances |
| mamba | For loading Mamba SSM models |
| math | For running math task answer checking |
| multilingual | For multilingual tokenizers |
@@ -10,41 +10,47 @@ Equivalently, running the library can be done via the `lm-eval` entrypoint at th

This mode supports a number of command-line arguments, the details of which can also be seen via running with `-h` or `--help`:
- `--model` : Selects which model type or provider is evaluated. Must be a string corresponding to the name of the model type/provider being used. See [the main README](https://github.com/EleutherAI/lm-evaluation-harness/tree/main#commercial-apis) for a full list of enabled model names and supported libraries or APIs.
- `--model_args` : Controls parameters passed to the model constructor. Accepts a string containing comma-separated keyword arguments to the model class of the format `"arg1=val1,arg2=val2,..."`, such as, for example `--model_args pretrained=EleutherAI/pythia-160m,dtype=float32`. For a full list of supported keyword arguments, see the initialization of the `lm_eval.api.model.LM` subclass, e.g. [`HFLM`](https://github.com/EleutherAI/lm-evaluation-harness/blob/365fcda9b85bbb6e0572d91976b8daf409164500/lm_eval/models/huggingface.py#L66)
- `--tasks` : Determines which tasks or task groups are evaluated. Accepts a comma-separated list of task names or task group names. Must be solely comprised of valid tasks/groups.
- `--num_fewshot` : Sets the number of few-shot examples to place in context. Must be an integer.
- `--gen_kwargs` : Takes an arg string in the same format as `--model_args` and creates a dictionary of keyword arguments. These will be passed to the models for all called `generate_until` (free-form or greedy generation task) tasks, to set options such as the sampling temperature or `top_p` / `top_k`. For a list of what args are supported for each model type, reference the respective library's documentation (for example, the documentation for `transformers.AutoModelForCausalLM.generate()`.) These kwargs will be applied to all `generate_until` tasks called--we do not currently support unique gen_kwargs or batch_size values per task in a single run of the library. To control these on a per-task level, set them in that task's YAML file.
- `--batch_size` : Sets the batch size used for evaluation. Can be a positive integer or `"auto"` to automatically select the largest batch size that will fit in memory, speeding up evaluation. One can pass `--batch_size auto:N` to re-select the maximum batch size `N` times during evaluation. This can help accelerate evaluation further, since `lm-eval` sorts documents in descending order of context length.
- `--max_batch_size` : Sets the maximum batch size to try to fit in memory, if `--batch_size auto` is passed.
- `--device` : Sets which device to place the model onto. Must be a string, for example, `"cuda", "cuda:0", "cpu", "mps"`. Defaults to "cuda", and can be ignored if running multi-GPU or running a non-local model type.
- `--output_path` : A string of the form `dir/file.jsonl` or `dir/`. Provides a path where high-level results will be saved, either into the file named or into the directory named. If `--log_samples` is passed as well, then per-document outputs and metrics will be saved into the directory as well.
- `--log_samples` : If this flag is passed, then the model's outputs, and the text fed into the model, will be saved at per-document granularity. Must be used with `--output_path`.
- `--limit` : Accepts an integer, or a float between 0.0 and 1.0. If passed, will limit the number of documents to evaluate to the first X documents (if an integer) per task or first X% of documents per task. Useful for debugging, especially on costly API models.
- `--use_cache` : Should be a path where a sqlite db file can be written to. Takes a string of format `/path/to/sqlite_cache_` in order to create a cache db at `/path/to/sqlite_cache_rank{i}.db` for each process (0-NUM_GPUS). This allows results of prior runs to be cached, so that there is no need to re-run results in order to re-score or re-run a given (model, task) pair again.
- `--cache_requests` : Can be "true", "refresh", or "delete". "true" means that the cache should be used. "refresh" means that you wish to regenerate the cache, which you should run if you change your dataset configuration for a given task. "delete" will delete the cache. Cached files are stored under `lm_eval/caching/.cache` unless you specify a different path via the environment variable `LM_HARNESS_CACHE_PATH`, e.g. `LM_HARNESS_CACHE_PATH=~/Documents/cache_for_lm_harness` (see the example invocation after this list).
- `--decontamination_ngrams_path` : Deprecated, see [this commit](https://github.com/EleutherAI/lm-evaluation-harness/commit/00209e10f6e27edf5d766145afaf894079b5fe10) or older for a working decontamination-checker tool.
- `--check_integrity` : If this flag is used, the library tests for each task selected are run to confirm task integrity.
- `--write_out` : Used for diagnostic purposes to observe the format of task documents passed to a model. If this flag is used, then prints the prompt and gold target string for the first document of each task.
- `--show_config` : If used, prints the full `lm_eval.api.task.TaskConfig` contents (non-default settings in the task YAML file) for each task which was run, at the completion of an evaluation. Useful for when one is modifying a task's configuration YAML locally to transmit the exact configurations used for debugging or for reproducibility purposes.
- `--include_path` : Accepts a path to a folder. If passed, then all YAML files containing `lm-eval`-compatible task configurations will be added to the task registry as available tasks. Used for when one is writing config files for their own task in a folder other than `lm_eval/tasks/`.
- `--predict_only`: Generates the model outputs without computing metrics. Use with `--log_samples` to retrieve decoded results.
- `--seed`: Set seed for python's random, numpy and torch. Accepts a comma-separated list of 3 values for python's random, numpy, and torch seeds, respectively, or a single integer to set the same seed for all three. The values are either an integer or 'None' to not set the seed. Default is `0,1234,1234` (for backward compatibility). E.g. `--seed 0,None,8` sets `random.seed(0)` and `torch.manual_seed(8)`. Here numpy's seed is not set since the second value is `None`. E.g., `--seed 42` sets all three seeds to 42.
- `--wandb_args`: Tracks logging to Weights and Biases for evaluation runs and includes args passed to `wandb.init`, such as `project` and `job_type`. Full list [here](https://docs.wandb.ai/ref/python/init).
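As an illustration of how these flags compose (the model, task, and path choices below are just placeholders drawn from the examples above), a run that caches requests and pins seeds might look like:

```bash
lm_eval \
    --model hf \
    --model_args pretrained=EleutherAI/pythia-160m,dtype=float32 \
    --tasks hellaswag \
    --batch_size auto \
    --cache_requests true \
    --use_cache /path/to/sqlite_cache_ \
    --seed 0,None,8 \
    --output_path output/pythia-160m \
    --log_samples
```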
## External Library Usage
@@ -52,7 +58,6 @@ We also support using the library's external API for use within model training l

`lm_eval` supplies two functions for external import and use: `lm_eval.evaluate()` and `lm_eval.simple_evaluate()`.

`simple_evaluate()` can be used by simply creating an `lm_eval.api.model.LM` subclass that implements the methods described in the [Model Guide](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs/model_guide.md), and wrapping your custom model in that class as follows:

```python
@@ -61,19 +66,29 @@ import lm_eval

my_model = initialize_my_model()  # create your model (could be running finetuning with some custom modeling code)
...
# instantiate an LM subclass that takes your initialized model and can run
# - `Your_LM.loglikelihood()`
# - `Your_LM.loglikelihood_rolling()`
# - `Your_LM.generate_until()`
lm_obj = Your_LM(model=my_model, batch_size=16)

# indexes all tasks from the `lm_eval/tasks` subdirectory.
# Alternatively, you can set `TaskManager(include_path="path/to/my/custom/task/configs")`
# to include a set of tasks in a separate directory.
task_manager = lm_eval.tasks.TaskManager()

# Setting `task_manager` to the one above is optional and should generally be done
# if you want to include tasks from paths other than ones in `lm_eval/tasks`.
# `simple_evaluate` will instantiate its own task_manager if it is set to None here.
results = lm_eval.simple_evaluate( # call simple_evaluate
    model=lm_obj,
    tasks=["taskname1", "taskname2"],
    num_fewshot=0,
    task_manager=task_manager,
    ...
)
```
See https://github.com/EleutherAI/lm-evaluation-harness/blob/365fcda9b85bbb6e0572d91976b8daf409164500/lm_eval/evaluator.py#L35 for a full description of all arguments available. All keyword arguments to simple_evaluate share the same role as the command-line flags described previously.

Additionally, the `evaluate()` function offers the core evaluation functionality provided by the library, but without some of the special handling and simplification + abstraction provided by `simple_evaluate()`.
@@ -81,21 +96,53 @@ Additionally, the `evaluate()` function offers the core evaluation functionality

See https://github.com/EleutherAI/lm-evaluation-harness/blob/365fcda9b85bbb6e0572d91976b8daf409164500/lm_eval/evaluator.py#L173 for more details.

As a brief example usage of `evaluate()`:
```python
import lm_eval

# suppose you've defined a custom lm_eval.api.Task subclass in your own external codebase
from my_tasks import MyTask1
...

# create your model (could be running finetuning with some custom modeling code)
my_model = initialize_my_model()
...

# instantiate an LM subclass that takes your initialized model and can run
# - `Your_LM.loglikelihood()`
# - `Your_LM.loglikelihood_rolling()`
# - `Your_LM.generate_until()`
lm_obj = Your_LM(model=my_model, batch_size=16)
# The task_manager indexes tasks including ones
# specified by the user through `include_path`
task_manager = lm_eval.tasks.TaskManager(
    include_path="/path/to/custom/yaml"
)
# To get a task dict for `evaluate`
task_dict = lm_eval.tasks.get_task_dict(
    [
        "mmlu",  # A stock task
        "my_custom_task",  # A custom task
        {
            "task": ...,  # A dict that configures a task
            "doc_to_text": ...,
        },
        MyTask1,  # A task object from `lm_eval.task.Task`
    ],
    task_manager,  # A task manager that allows lm_eval to
                   # load the task during evaluation.
                   # If none is provided, `get_task_dict`
                   # will instantiate one itself, but this
                   # only includes the stock tasks, so users
                   # will need to set this if including
                   # custom paths is required.
)
results = lm_eval.evaluate(
    lm=lm_obj,
    task_dict=task_dict,
    ...
)
```
@@ -66,7 +66,7 @@ All three request types take as input `requests` of type `list[Instance]` that h

- It should return `(ll,) : Tuple[float]` , a.k.a. solely the *loglikelihood* of producing each piece of text given no starting input.

To allow a model to be evaluated on all types of tasks, you will need to implement these three types of measurements (note that `loglikelihood_rolling` is a special case of `loglikelihood`). For a reference implementation, check out `lm_eval/models/huggingface.py` ! Additionally, check out `lm_eval.api.model.TemplateLM` for a class that abstracts away some commonly used functions across LM subclasses, or see if your model would lend itself well to subclassing the `lm_eval.models.huggingface.HFLM` class and overriding just the initialization or a couple methods!
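As a rough, non-authoritative sketch of what a new backend can look like (the class name, constructor, and placeholder return values below are invented for illustration; only the three request-type methods and the `register_model` decorator come from the harness):

```python
from lm_eval.api.model import LM
from lm_eval.api.registry import register_model


@register_model("my_backend")  # usable via `--model my_backend`
class MyBackendLM(LM):
    def __init__(self, **kwargs):
        super().__init__()
        # load or connect to your model here

    def loglikelihood(self, requests):
        # one (loglikelihood, is_greedy) pair per request
        return [(0.0, False) for _ in requests]  # placeholder values

    def loglikelihood_rolling(self, requests):
        # solely the loglikelihood of producing each full string
        return [0.0 for _ in requests]  # placeholder values

    def generate_until(self, requests):
        # one generated string per request
        return ["" for _ in requests]  # placeholder values
```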
**Tip: be careful of indexing in loglikelihood!**
@@ -294,17 +294,80 @@ This will add your task to the `group1` and `group2` groups, enabling people to

If your task is not in the `lm_eval/tasks` folder, you'll need to tell the Eval Harness where to look for YAML files.

You can do this via the `--include_path` argument in `__main__.py`. This argument is used to initialize the `TaskManager` object, which you can also use for your custom scripts.
```python
task_manager = TaskManager(args.verbosity, include_path=args.include_path)
```

Passing `--tasks /path/to/yaml/file` is also accepted.
### Advanced Group Configs
You can make a more complete group config while also tailoring parameters for individual tasks.

For example, let's build a config for evaluating MMLU and a few natural language inference tasks. For MMLU, we can write the name of the benchmark as a subtask under `task`. You can configure parameters such as `num_fewshot`; if the task being configured is a group such as `mmlu` or `super_glue`, the setting will be applied to all of its subtasks.
```yaml
group: nli_and_mmlu
task:
  - group: nli_tasks
    task:
      - cb
      - anli_r1
      - rte
  - task: mmlu
    num_fewshot: 2
```
It's also worth noting that you can nest a group config as a task: to build the group of natural language inference tasks above, you write it just as you would a standalone group config, but place it inside the task list of the main group being built.
### Duplicate Tasks in Group Configs
There might be cases where you want to evaluate how models perform across prompt variations. You can list an existing task (in the example below, `anli_r1`) multiple times with varying `doc_to_text` implementations. To differentiate the variations, use `task_alias`; LM-Eval will recognize that there are multiple variations of the same task and distinguish them.
```yaml
group: flan_held_in
group_alias: Flan (Held-In)
task:
  # ANLI R1
  - group: anli_r1_flan
    group_alias: ANLI R1
    task:
      - task: anli_r1
        task_alias: prompt-0
        include: _held_in_template_yaml
        doc_to_text: "{{premise}}\n\nChoose your answer ..."
        ...
      - task: anli_r1
        task_alias: prompt-1
        include: _held_in_template_yaml
        doc_to_text: "{{premise}}\n\nBased on ..."
        ...
```
### Configuring python classes
There can be occasions when a YAML-based task cannot accommodate how a task needs to be handled. LM-Eval supports manually implementing tasks, as was done prior to `0.4.x`. To register such a task, you can simply make a YAML file with the name of the task in `task` and the class object in `class`, using the `!function` prefix.
```yaml
task: squadv2
class: !function task.SQuAD2
```
This also applies to building group configurations with subtasks that are python classes.
```yaml
group: scrolls
task:
  - task: scrolls_qasper
    class: !function task.Qasper
  - task: scrolls_quality
    class: !function task.QuALITY
  - task: scrolls_narrativeqa
    class: !function task.NarrativeQA
  ...
```
## Beautifying Table Display

To avoid conflicts, each task needs to be registered with a unique name. Because of this, slight variations of a task are still counted as unique tasks and need to be named uniquely. This can be done by appending additional naming that refers to the variation, such as in MMLU, where the templates used to evaluate with Flan prompts are differentiated from the default by the prefix `mmlu_flan_*`. Printing the full task names can easily clutter the results table at the end of the evaluation, especially when you have a long list of tasks or are using a benchmark that comprises many tasks. To make it more legible, you can use `task_alias` and `group_alias` to provide an alternative task name and group name that will be printed.
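For instance, a group config could set display names like this (the task and group names below are made up for illustration):

```yaml
group: my_benchmark_flan
group_alias: My Benchmark (Flan)
task:
  - task: my_benchmark_flan_subtask_anatomy
    task_alias: anatomy
  - task: my_benchmark_flan_subtask_astronomy
    task_alias: astronomy
```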
@@ -50,7 +50,7 @@ Scoring details:

- **doc_to_decontamination_query** (`str`, *optional*) — Query for decontamination if `should_decontaminate` is True. If `should_decontaminate` is True but `doc_to_decontamination_query` is `None`, `doc_to_decontamination_query` will follow `doc_to_text`.

Other:

- **metadata** (`dict`, *optional*) — An optional field where arbitrary metadata can be passed. Most tasks should include a `version` key in this field that is used to denote the version of the yaml config. Other special metadata keys are: `num_fewshot`, to override the printed `n-shot` table column for a task.
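For example (the values here are purely illustrative), the end of a task YAML might look like:

```yaml
metadata:
  version: 1.0
  num_fewshot: 5  # overrides the `n-shot` column shown in the results table
```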
## Filters
{
"cells": [
{
"cell_type": "markdown",
"id": "fc477b96-adee-4829-a9d7-a5eb990df358",
"metadata": {},
"source": [
"# Visualizing Results in Weights and Biases\n",
"\n",
"With the Weights and Biases integration, you can now spend more time extracting deeper insights into your evaluation results. The integration is designed to streamline the process of logging and visualizing experiment results using the Weights & Biases (W&B) platform.\n",
"\n",
"The integration provide functionalities\n",
"\n",
"- to automatically log the evaluation results,\n",
"- log the samples as W&B Tables for easy visualization,\n",
"- log the `results.json` file as an artifact for version control,\n",
"- log the `<task_name>_eval_samples.json` file if the samples are logged,\n",
"- generate a comprehensive report for analysis and visualization with all the important metric,\n",
"- log task and cli configs,\n",
"- and more out of the box like the command used to run the evaluation, GPU/CPU counts, timestamp, etc.\n",
"\n",
"The integration is super easy to use with the eval harness. Let's see how!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3851439a-bff4-41f2-bf21-1b3d8704913b",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# Install this project if you did not already have it.\n",
"# This is all that is needed to be installed to start using Weights and Biases\n",
"\n",
"!pip -qq install -e ..[wandb]"
]
},
{
"cell_type": "markdown",
"id": "8507fd7e-3b99-4a92-89fa-9eaada74ba91",
"metadata": {},
"source": [
"# Run the Eval Harness\n",
"\n",
"Run the eval harness as usual with a `wandb_args` flag. This flag is used to provide arguments for initializing a wandb run ([wandb.init](https://docs.wandb.ai/ref/python/init)) as comma separated string arguments.\n",
"\n",
"If `wandb_args` flag is used, the metrics and all other goodness will be automatically logged to Weights and Biases. In the stdout, you will find the link to the W&B run page as well as link to the generated report."
]
},
{
"cell_type": "markdown",
"id": "eec5866e-f01e-42f8-8803-9d77472ef991",
"metadata": {},
"source": [
"## Set your API Key\n",
"\n",
"Before you can use W&B, you need to authenticate your machine with an authentication key. Visit https://wandb.ai/authorize to get one."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d824d163-71a9-4313-935d-f1d56397841c",
"metadata": {},
"outputs": [],
"source": [
"import wandb\n",
"wandb.login()"
]
},
{
"cell_type": "markdown",
"id": "124e4a34-1547-4bed-bc09-db012bacbda6",
"metadata": {},
"source": [
"> Note that if you are using command line you can simply authenticate your machine by doing `wandb login` in your terminal. For more info check out the [documentation](https://docs.wandb.ai/quickstart#2-log-in-to-wb)."
]
},
{
"cell_type": "markdown",
"id": "abc6f6b6-179a-4aff-ada9-f380fb74df6e",
"metadata": {},
"source": [
"## Run and log to W&B"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bd0a8130-a97b-451a-acd2-3f9885b88643",
"metadata": {},
"outputs": [],
"source": [
"!lm_eval \\\n",
" --model hf \\\n",
" --model_args pretrained=microsoft/phi-2,trust_remote_code=True \\\n",
" --tasks hellaswag,mmlu_abstract_algebra \\\n",
" --device cuda:0 \\\n",
" --batch_size 8 \\\n",
" --output_path output/phi-2 \\\n",
" --limit 10 \\\n",
" --wandb_args project=lm-eval-harness-integration \\\n",
" --log_samples"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
@@ -4,14 +4,16 @@ import logging
import os
import re
import sys
from functools import partial
from pathlib import Path
from typing import Union

import numpy as np

from lm_eval import evaluator, utils
from lm_eval.evaluator import request_caching_arg_to_dict
from lm_eval.logging_utils import WandbLogger
from lm_eval.tasks import TaskManager, include_path, initialize_tasks
from lm_eval.utils import make_table
@@ -24,6 +26,30 @@ def _handle_non_serializable(o):
    return str(o)
def _int_or_none_list_arg_type(max_len: int, value: str, split_char: str = ","):
    def parse_value(item):
        item = item.strip().lower()
        if item == "none":
            return None
        try:
            return int(item)
        except ValueError:
            raise argparse.ArgumentTypeError(f"{item} is not an integer or None")

    items = [parse_value(v) for v in value.split(split_char)]

    num_items = len(items)
    if num_items == 1:
        # Makes downstream handling the same for single and multiple values
        items = items * max_len
    elif num_items != max_len:
        raise argparse.ArgumentTypeError(
            f"Argument requires {max_len} integers or None, separated by '{split_char}'"
        )

    return items
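# e.g. _int_or_none_list_arg_type(3, "42")       -> [42, 42, 42]
#      _int_or_none_list_arg_type(3, "0,None,8") -> [0, None, 8]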
def parse_eval_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(formatter_class=argparse.RawTextHelpFormatter)
    parser.add_argument("--model", "-m", default="hf", help="Name of model e.g. `hf`")
@@ -94,6 +120,13 @@ def parse_eval_args() -> argparse.Namespace:
        metavar="DIR",
        help="A path to a sqlite db file for caching model responses. `None` if not caching.",
    )
    parser.add_argument(
        "--cache_requests",
        type=str,
        default=None,
        choices=["true", "refresh", "delete"],
        help="Speed up evaluation by caching the building of dataset requests. `None` if not caching.",
    )
parser.add_argument("--decontamination_ngrams_path", default=None) # TODO: not used parser.add_argument("--decontamination_ngrams_path", default=None) # TODO: not used
parser.add_argument( parser.add_argument(
"--check_integrity", "--check_integrity",
...@@ -143,6 +176,11 @@ def parse_eval_args() -> argparse.Namespace: ...@@ -143,6 +176,11 @@ def parse_eval_args() -> argparse.Namespace:
metavar="CRITICAL|ERROR|WARNING|INFO|DEBUG", metavar="CRITICAL|ERROR|WARNING|INFO|DEBUG",
help="Controls the reported logging error level. Set to DEBUG when testing + adding new task configurations for comprehensive log output.", help="Controls the reported logging error level. Set to DEBUG when testing + adding new task configurations for comprehensive log output.",
) )
    parser.add_argument(
        "--wandb_args",
        default="",
        help="Comma separated string arguments passed to wandb.init, e.g. `project=lm-eval,job_type=eval`",
    )
    parser.add_argument(
        "--predict_only",
        "-x",

@@ -150,6 +188,19 @@
        default=False,
        help="Use with --log_samples. Only model outputs will be saved and metrics will not be evaluated.",
    )
    parser.add_argument(
        "--seed",
        type=partial(_int_or_none_list_arg_type, 3),
        default="0,1234,1234",  # for backward compatibility
        help=(
            "Set seed for python's random, numpy and torch.\n"
            "Accepts a comma-separated list of 3 values for python's random, numpy, and torch seeds, respectively, "
            "or a single integer to set the same seed for all three.\n"
            "The values are either an integer or 'None' to not set the seed. Default is `0,1234,1234` (for backward compatibility).\n"
            "E.g. `--seed 0,None,8` sets `random.seed(0)` and `torch.manual_seed(8)`. Here numpy's seed is not set since the second value is `None`.\n"
            "E.g, `--seed 42` sets all three seeds to 42."
        ),
    )
    return parser.parse_args()


@@ -158,6 +209,9 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
        # we allow for args to be passed externally, else we parse them ourselves
        args = parse_eval_args()
    if args.wandb_args:
        wandb_logger = WandbLogger(args)

    eval_logger = utils.eval_logger
    eval_logger.setLevel(getattr(logging, f"{args.verbosity}"))
    eval_logger.info(f"Verbosity set to {args.verbosity}")

@@ -169,6 +223,7 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
        assert args.output_path, "Specify --output_path"

    initialize_tasks(args.verbosity)
    task_manager = TaskManager(args.verbosity, include_path=args.include_path)
    if args.limit:
        eval_logger.warning(

@@ -180,10 +235,11 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
        include_path(args.include_path)
    if args.tasks is None:
        eval_logger.error("Need to specify task to evaluate.")
        sys.exit()
    elif args.tasks == "list":
        eval_logger.info(
            "Available Tasks:\n - {}".format("\n - ".join(task_manager.all_tasks))
        )
        sys.exit()
    else:
@@ -196,16 +252,14 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
                config = utils.load_yaml_config(yaml_file)
                task_names.append(config)
        else:
            task_list = args.tasks.split(",")
            task_names = task_manager.match_tasks(task_list)
            for task in [task for task in task_list if task not in task_names]:
                if os.path.isfile(task):
                    config = utils.load_yaml_config(task)
                    task_names.append(config)
            task_missing = [
                task for task in task_list if task not in task_names and "*" not in task
            ]  # we don't want errors if a wildcard ("*") task name was used
            if task_missing:

@@ -237,6 +291,11 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
        output_path_file = path.joinpath("results.json")

    eval_logger.info(f"Selected Tasks: {task_names}")
eval_logger.info("Loading selected tasks...")
request_caching_args = request_caching_arg_to_dict(
cache_requests=args.cache_requests
)
    results = evaluator.simple_evaluate(
        model=args.model,

@@ -253,7 +312,12 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
        write_out=args.write_out,
        log_samples=args.log_samples,
        gen_kwargs=args.gen_kwargs,
        task_manager=task_manager,
        predict_only=args.predict_only,
        **request_caching_args,
        random_seed=args.seed[0],
        numpy_random_seed=args.seed[1],
        torch_random_seed=args.seed[2],
    )
    if results is not None:

@@ -267,6 +331,16 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
            batch_sizes = ",".join(map(str, results["config"]["batch_sizes"]))
        # Add W&B logging
        if args.wandb_args:
            try:
                wandb_logger.post_init(results)
                wandb_logger.log_eval_result()
                if args.log_samples:
                    wandb_logger.log_eval_samples(samples)
            except Exception as e:
                eval_logger.info(f"Logging to Weights and Biases failed due to {e}")
        if args.output_path:
            output_path_file.open("w", encoding="utf-8").write(dumped)

@@ -292,6 +366,10 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
        if "groups" in results:
            print(make_table(results, "groups"))
        if args.wandb_args:
            # Tear down wandb run once all the logging is done.
            wandb_logger.run.finish()


if __name__ == "__main__":
    cli_evaluate()
@@ -2,8 +2,9 @@ import logging
import math
import random
from collections.abc import Iterable
from typing import List

import evaluate as hf_evaluate
import numpy as np
import sacrebleu
import sklearn.metrics

@@ -115,6 +116,25 @@ def ter(items):
    return sacrebleu.corpus_ter(preds, refs).score
@register_aggregation("brier_score")
def brier_score(items): # This is a passthrough function
gold, predictions = list(zip(*items))
gold = list(gold)
gold_one_hot = np.eye(np.max(gold) + 1)[gold]
predictions = list(zip(*items))[1]
return np.mean(np.sum((predictions - gold_one_hot) ** 2, axis=1))
@register_metric(
metric="brier_score",
higher_is_better=False,
output_type=["multiple_choice"],
aggregation="brier_score",
)
def brier_score_fn(items): # This is a passthrough function
return items
@register_metric(
    metric="acc",
    higher_is_better=True,

@@ -145,7 +165,7 @@ def acc_mutual_info_fn(items):  # This is a passthrough function
    return items


exact_match = hf_evaluate.load("exact_match")
@register_metric(

@@ -425,3 +445,65 @@ def stderr_for_metric(metric, bootstrap_iters):
    stderr = {mean: mean_stderr, acc_all: acc_all_stderr}

    return stderr.get(metric, None)
def pooled_sample_stderr(stderrs: List[float], sizes: List[int]):
    # Used to aggregate bootstrapped stderrs across subtasks in a group,
    # when we are weighting by the size of each subtask.
    #
    assert len(stderrs) == len(sizes)

    # formula source: https://en.wikipedia.org/wiki/Pooled_variance
    # and: https://stats.stackexchange.com/a/4841331
    # this empirically seems to match running `stderr_for_metric` on all instances
    # from the subtasks concatenated with each other.
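    # In terms of the inputs: each stderr_i = s_i / sqrt(n_i), so s_i**2 = n_i * stderr_i**2.
    # The pooled variance below is sum_i (n_i - 1) * s_i**2 / (sum_i n_i - k), and the value
    # returned is the standard error of the pooled mean, sqrt(pooled variance / sum_i n_i).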
    pooled_sample_var = (
        sum([(size - 1) * stderr**2 * size for size, stderr in zip(sizes, stderrs)])
    ) / (sum(sizes) - len(sizes))

    return np.sqrt(pooled_sample_var / sum(sizes))
def combined_sample_stderr(stderrs: List[float], sizes: List[int], metrics=None):
    assert (
        metrics is not None
    ), "Need to pass a list of each subtask's metric for this stderr aggregation"
    assert len(stderrs) == len(sizes) and len(sizes) == len(metrics)

    # See https://github.com/EleutherAI/lm-evaluation-harness/pull/1390 for more documentation.
    # This formula depends on sample means.
    # removed because it seems to give erroneously huge stderrs for groupings of tasks
    # and does not seem to match up with bootstrap-calculated stderrs for groups.
    ### don't use this unless a statistician has told you it's the right thing to do ###

    # accumulators: we'll aggregate pairwise N - 1 times
    variance = stderrs[0] ** 2
    curr_size = sizes[0]
    curr_score = metrics[0]

    for stderr, size, score in zip(stderrs[1:], sizes[1:], metrics[1:]):
        curr_score = ((curr_score * curr_size) + (score * size)) / (
            curr_size + size
        )  # NOTE: this assumes our aggregation fn is "mean"

        variance = ((curr_size - 1) * variance + (size - 1) * (stderr**2)) / (
            curr_size + size - 1
        ) + curr_size * size / ((curr_size + size) * (curr_size + size - 1)) * (
            curr_score - score
        ) ** 2

    return np.sqrt(variance)
def aggregate_subtask_metrics(metrics, sizes, weight_by_size=True):
    # A helper function that is used to aggregate
    # subtask scores cross-task.
    # TODO: does not hold for non-mean aggregations
    if not weight_by_size:
        sizes = [1] * len(sizes)

    assert len(metrics) == len(sizes)

    return sum([metric * size for metric, size in zip(metrics, sizes)]) / sum(sizes)
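# Illustrative only: aggregate_subtask_metrics([0.5, 0.7], [100, 300])
# == (0.5 * 100 + 0.7 * 300) / 400 == 0.65, i.e. a size-weighted mean.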
@@ -133,6 +133,28 @@ class LM(abc.ABC):
        args2 = {k: v for k, v in additional_config.items() if v is not None}
        return cls(**args, **args2)
    @classmethod
    def create_from_arg_obj(
        cls: Type[T], arg_dict: dict, additional_config: Optional[dict] = None
    ) -> T:
        """
        Creates an instance of the LM class using the given arg_dict.

        Parameters:
        - arg_dict: A dict containing keyword arguments for the LM constructor.
        - additional_config: Optional dictionary containing additional configuration parameters.

        Returns:
        - Instance of the LM class.
        """
        additional_config = {} if additional_config is None else additional_config
        additional_config = {
            k: v for k, v in additional_config.items() if v is not None
        }

        return cls(**arg_dict, **additional_config)
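    # Illustrative usage only ("SomeLMSubclass" and its kwargs are placeholders):
    #   lm = SomeLMSubclass.create_from_arg_obj({"pretrained": "gpt2"}, {"batch_size": 8})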
    @property
    def rank(self):
        # used in the case of parallelism. Hardcoded to

@@ -203,7 +225,7 @@ class CachingLM:
            eval_logger.info(
                f"Loading '{attr}' responses from cache '{self.cache_db}' where possible..."
            )
            for req in tqdm(requests, desc="Checking cached requests"):
                hsh = hash_args(attr, req.args)
                if attr == "generate_until" and req.args[1].get("do_sample", False):
                    # when we are doing non-greedy generation, don't use the cache
@@ -224,7 +246,9 @@ class CachingLM:
                else:
                    res.append(None)
                    remaining_reqs.append(req)
            eval_logger.info(
                f"Cached requests: {len(requests) - len(remaining_reqs)}, Requests remaining: {len(remaining_reqs)}"
            )
            # actually run the LM on the requests that do not have cached results
            rem_res = getattr(self.lm, attr)(remaining_reqs)

@@ -247,3 +271,61 @@ class CachingLM:
    def get_cache_hook(self):
        return CacheHook(self)
class TemplateLM(LM):
    """
    A class acting as intermediary between the LM base class
    and boilerplate often included in other LM subclasses.
    """

    @property
    @abc.abstractmethod
    def eot_token_id(self):
        pass

    @abc.abstractmethod
    def tok_encode(self, string: str, **kwargs):
        pass

    @abc.abstractmethod
    def _loglikelihood_tokens(self, requests, **kwargs):
        pass

    def _encode_pair(self, context, continuation):
        n_spaces = len(context) - len(context.rstrip())
        if n_spaces > 0:
            continuation = context[-n_spaces:] + continuation
            context = context[:-n_spaces]

        whole_enc = self.tok_encode(context + continuation)
        context_enc = self.tok_encode(context)

        context_enc_len = len(context_enc)
        continuation_enc = whole_enc[context_enc_len:]

        return context_enc, continuation_enc

    def loglikelihood(self, requests) -> List[Tuple[float, bool]]:
        new_reqs = []
        for context, continuation in [req.args for req in requests]:
            if context == "":
                # end of text as context
                context_enc, continuation_enc = (
                    [self.eot_token_id],
                    self.tok_encode(continuation),
                )
            else:
                context_enc, continuation_enc = self._encode_pair(context, continuation)

            new_reqs.append(((context, continuation), context_enc, continuation_enc))

        return self._loglikelihood_tokens(new_reqs)

    @abc.abstractmethod
    def loglikelihood_rolling(self, requests) -> List[Tuple[float, bool]]:
        pass

    @abc.abstractmethod
    def generate_until(self, requests) -> List[str]:
        pass
import logging
from typing import Callable, Dict

import evaluate as hf_evaluate

from lm_eval.api.model import LM

@@ -75,7 +76,7 @@ def register_group(name):
OUTPUT_TYPE_REGISTRY = {} OUTPUT_TYPE_REGISTRY = {}
METRIC_REGISTRY = {} METRIC_REGISTRY = {}
METRIC_AGGREGATION_REGISTRY = {} METRIC_AGGREGATION_REGISTRY = {}
AGGREGATION_REGISTRY = {} AGGREGATION_REGISTRY: Dict[str, Callable[[], Dict[str, Callable]]] = {}
HIGHER_IS_BETTER_REGISTRY = {} HIGHER_IS_BETTER_REGISTRY = {}
DEFAULT_METRIC_REGISTRY = { DEFAULT_METRIC_REGISTRY = {
...@@ -118,7 +119,7 @@ def register_metric(**args): ...@@ -118,7 +119,7 @@ def register_metric(**args):
return decorate return decorate
def get_metric(name, hf_evaluate_metric=False): def get_metric(name: str, hf_evaluate_metric=False) -> Callable:
if not hf_evaluate_metric: if not hf_evaluate_metric:
if name in METRIC_REGISTRY: if name in METRIC_REGISTRY:
return METRIC_REGISTRY[name] return METRIC_REGISTRY[name]
...@@ -128,7 +129,7 @@ def get_metric(name, hf_evaluate_metric=False): ...@@ -128,7 +129,7 @@ def get_metric(name, hf_evaluate_metric=False):
) )
try: try:
metric_object = evaluate.load(name) metric_object = hf_evaluate.load(name)
return metric_object.compute return metric_object.compute
except Exception: except Exception:
eval_logger.error( eval_logger.error(
...@@ -136,7 +137,7 @@ def get_metric(name, hf_evaluate_metric=False): ...@@ -136,7 +137,7 @@ def get_metric(name, hf_evaluate_metric=False):
) )
def register_aggregation(name): def register_aggregation(name: str):
def decorate(fn): def decorate(fn):
assert ( assert (
name not in AGGREGATION_REGISTRY name not in AGGREGATION_REGISTRY
...@@ -148,21 +149,21 @@ def register_aggregation(name): ...@@ -148,21 +149,21 @@ def register_aggregation(name):
return decorate return decorate
def get_aggregation(name): def get_aggregation(name: str) -> Callable[[], Dict[str, Callable]]:
try: try:
return AGGREGATION_REGISTRY[name] return AGGREGATION_REGISTRY[name]
except KeyError: except KeyError:
eval_logger.warning(f"{name} not a registered aggregation metric!") eval_logger.warning(f"{name} not a registered aggregation metric!")
def get_metric_aggregation(name): def get_metric_aggregation(name: str) -> Callable[[], Dict[str, Callable]]:
try: try:
return METRIC_AGGREGATION_REGISTRY[name] return METRIC_AGGREGATION_REGISTRY[name]
except KeyError: except KeyError:
eval_logger.warning(f"{name} metric is not assigned a default aggregation!") eval_logger.warning(f"{name} metric is not assigned a default aggregation!")
def is_higher_better(metric_name): def is_higher_better(metric_name) -> bool:
try: try:
return HIGHER_IS_BETTER_REGISTRY[metric_name] return HIGHER_IS_BETTER_REGISTRY[metric_name]
except KeyError: except KeyError:
......
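A short sketch of registering and retrieving a custom aggregation with the decorator and lookup shown above; the `"geometric_mean"` name is hypothetical, not something shipped with the harness:

```python
import math

from lm_eval.api.registry import get_aggregation, register_aggregation


@register_aggregation("geometric_mean")  # hypothetical aggregation name
def geometric_mean(items):
    return math.exp(sum(math.log(x) for x in items) / len(items))


agg_fn = get_aggregation("geometric_mean")
print(agg_fn([0.25, 0.5, 1.0]))  # ≈ 0.5
```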
...@@ -4,12 +4,14 @@ import logging ...@@ -4,12 +4,14 @@ import logging
import random import random
import re import re
from collections.abc import Callable from collections.abc import Callable
from copy import deepcopy
from dataclasses import asdict, dataclass from dataclasses import asdict, dataclass
from inspect import getsource from inspect import getsource
from typing import Any, List, Literal, Tuple, Union from typing import Any, Iterator, List, Literal, Tuple, Union
import datasets import datasets
import numpy as np import numpy as np
from tqdm import tqdm
from lm_eval import utils from lm_eval import utils
from lm_eval.api import samplers from lm_eval.api import samplers
...@@ -27,6 +29,7 @@ from lm_eval.api.registry import ( ...@@ -27,6 +29,7 @@ from lm_eval.api.registry import (
get_metric_aggregation, get_metric_aggregation,
is_higher_better, is_higher_better,
) )
from lm_eval.caching.cache import load_from_cache, save_to_cache
from lm_eval.filters import build_filter_ensemble from lm_eval.filters import build_filter_ensemble
from lm_eval.prompts import get_prompt from lm_eval.prompts import get_prompt
...@@ -86,9 +89,7 @@ class TaskConfig(dict): ...@@ -86,9 +89,7 @@ class TaskConfig(dict):
should_decontaminate: bool = False should_decontaminate: bool = False
doc_to_decontamination_query: str = None doc_to_decontamination_query: str = None
weight_by_size: bool = False weight_by_size: bool = False
metadata: Union[ metadata: dict = None # by default, not used in the code. allows for users to pass arbitrary info to tasks
str, list
] = None # by default, not used in the code. allows for users to pass arbitrary info to tasks
def __post_init__(self) -> None: def __post_init__(self) -> None:
if self.generation_kwargs is not None: if self.generation_kwargs is not None:
...@@ -109,9 +110,11 @@ class TaskConfig(dict): ...@@ -109,9 +110,11 @@ class TaskConfig(dict):
if self.output_type == "generate_until": if self.output_type == "generate_until":
# ensure that we greedily generate in absence of explicit arguments otherwise # ensure that we greedily generate in absence of explicit arguments otherwise
self.generation_kwargs = { self.generation_kwargs = {
"until": None "until": (
if self.fewshot_delimiter is None None
else [self.fewshot_delimiter], if self.fewshot_delimiter is None
else [self.fewshot_delimiter]
),
"do_sample": False, "do_sample": False,
} }
...@@ -308,7 +311,7 @@ class Task(abc.ABC): ...@@ -308,7 +311,7 @@ class Task(abc.ABC):
return self.validation_docs() return self.validation_docs()
else: else:
eval_logger.warning( eval_logger.warning(
"has_training_docs and has_validation_docs are False" f"[Task: {self.config.task}] has_training_docs and has_validation_docs are False"
", using test_docs as fewshot_docs but this is not recommended." ", using test_docs as fewshot_docs but this is not recommended."
) )
return self.test_docs() return self.test_docs()
...@@ -325,7 +328,7 @@ class Task(abc.ABC): ...@@ -325,7 +328,7 @@ class Task(abc.ABC):
return doc return doc
@property @property
def instances(self): def instances(self) -> List[Instance]:
"""After calling `task.build_all_requests()`, tasks """After calling `task.build_all_requests()`, tasks
maintain a list of the dataset instances which will be evaluated. maintain a list of the dataset instances which will be evaluated.
""" """
...@@ -351,20 +354,57 @@ class Task(abc.ABC): ...@@ -351,20 +354,57 @@ class Task(abc.ABC):
def doc_to_target(self, doc): def doc_to_target(self, doc):
pass pass
def build_all_requests(self, limit=None, rank=None, world_size=None) -> None: def build_all_requests(
self,
*,
limit=None,
rank=None,
world_size=None,
cache_requests=False,
rewrite_requests_cache=False,
) -> None:
"""Build a set of Instances for a task, and store them in task.instances""" """Build a set of Instances for a task, and store them in task.instances"""
if self.has_test_docs():
docs = self.test_docs()
elif self.has_validation_docs():
docs = self.validation_docs()
else:
assert False, f"Task dataset (path={self.DATASET_PATH}, name={self.DATASET_NAME}) must have valid or test docs!"
eval_logger.info(f"Building contexts for task on rank {rank}...") # used with caching
og_limit = limit
cache_key = f"requests-{self._config.task}"
cached_instances = load_from_cache(file_name=cache_key)
if cache_requests and cached_instances and not rewrite_requests_cache:
cached_instances = cached_instances[:limit]
flattened_instances = [
instance
for instance_group in cached_instances
for instance in instance_group
]
self._instances = flattened_instances
return
eval_logger.info(f"Building contexts for {self.config.task} on rank {rank}...")
instances = [] instances = []
for doc_id, doc in utils.create_iterator(
enumerate(docs), rank, world_size, limit # process all documents when caching is specified for simplicity
if (
cache_requests
and (not cached_instances or rewrite_requests_cache)
and limit is not None
):
limit = None
doc_id_docs = list(
self.doc_iterator(rank=rank, limit=limit, world_size=world_size)
)
num_docs = len(doc_id_docs)
for doc_id, doc in tqdm(
doc_id_docs,
total=num_docs,
): ):
# sample fewshot context #TODO: need to offset doc_id by rank now! # sample fewshot context #TODO: need to offset doc_id by rank now!
fewshot_ctx = self.fewshot_context( fewshot_ctx = self.fewshot_context(
...@@ -382,11 +422,25 @@ class Task(abc.ABC): ...@@ -382,11 +422,25 @@ class Task(abc.ABC):
if not isinstance(inst, list): if not isinstance(inst, list):
inst = [inst] inst = [inst]
instances.extend(inst) instances.append(inst)
# now flatten, this is to allow slicing to work with pickles
sliced_instances = instances[:og_limit]
flattened_instances = [
instance
for instance_group in sliced_instances
for instance in instance_group
]
self._instances = flattened_instances
self._instances = instances
assert len(self._instances) != 0, "task.build_requests() did not find any docs!" assert len(self._instances) != 0, "task.build_requests() did not find any docs!"
if cache_requests and (not cached_instances or rewrite_requests_cache):
save_to_cache(file_name=cache_key, obj=instances)
@abc.abstractmethod @abc.abstractmethod
def construct_requests(self, doc, ctx, **kwargs): def construct_requests(self, doc, ctx, **kwargs):
"""Uses RequestFactory to construct Requests and returns an iterable of """Uses RequestFactory to construct Requests and returns an iterable of
...@@ -439,6 +493,9 @@ class Task(abc.ABC): ...@@ -439,6 +493,9 @@ class Task(abc.ABC):
""" """
pass pass
def get_config(self, key: str) -> Any:
return getattr(self._config, key, None)
@classmethod @classmethod
def count_bytes(cls, doc): def count_bytes(cls, doc):
"""Used for byte-level perplexity metrics in rolling loglikelihood""" """Used for byte-level perplexity metrics in rolling loglikelihood"""
...@@ -511,6 +568,7 @@ class Task(abc.ABC): ...@@ -511,6 +568,7 @@ class Task(abc.ABC):
return description + labeled_examples + example return description + labeled_examples + example
def apply_filters(self): def apply_filters(self):
"""Iterates over FilterEnsembles and applies them to instances"""
if hasattr(self, "_filters"): if hasattr(self, "_filters"):
for f in self._filters: for f in self._filters:
f.apply(self._instances) f.apply(self._instances)
...@@ -519,15 +577,72 @@ class Task(abc.ABC): ...@@ -519,15 +577,72 @@ class Task(abc.ABC):
return self._instances return self._instances
def dump_config(self) -> dict: def dump_config(self) -> dict:
"""Returns a dictionary representing the task's config. """Returns the config as a dictionary."""
:returns: str
The fewshot context.
"""
# TODO: this should only return the overrides applied to a non-YAML task's configuration. # TODO: this should only return the overrides applied to a non-YAML task's configuration.
# (num_fewshot) # (num_fewshot)
return self.config.to_dict() return self.config.to_dict()
def set_config(self, key: str, value: Any, update: bool = False) -> None:
"""Set or update the configuration for a given key."""
if key is None:
raise ValueError("Key must be provided.")
if update:
current_value = getattr(self._config, key, {})
if not isinstance(current_value, dict):
raise TypeError(
f"Expected a dict for key '{key}', got {type(current_value).__name__} instead."
)
current_value.update(value)
else:
setattr(self._config, key, value)
def override_metric(self, metric_name: str) -> None:
"""
Override the default metrics used for evaluation with custom metrics.
Parameters:
- metric_name (str): The name of the custom metric to override. Should be registered in api.metrics.
"""
(
self._metric_fn_list,
self._aggregation_list,
self._metric_fn_kwargs,
self._higher_is_better,
) = ({}, {}, {}, {})
self._metric_fn_list[metric_name] = get_metric(metric_name)
self._aggregation_list[metric_name] = get_metric_aggregation(metric_name)
self._higher_is_better[metric_name] = is_higher_better(metric_name)
self._metric_fn_kwargs[metric_name] = {}
if not isinstance(self, ConfigurableTask):
self.process_results = lambda x, y: {metric_name: get_metric(metric_name)}
self.aggregation = lambda: {
metric_name: get_metric_aggregation(metric_name)
}
setattr(self._config, "metric_list", [{"metric": metric_name}])
setattr(self._config, "process_results", None)
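A small sketch of how these hooks are used from the evaluator side, mirroring the `set_config`/`override_metric` calls that appear later in `simple_evaluate`; the task name and generation kwargs are examples only:

```python
task = task_dict["my_task"]          # hypothetical entry from get_task_dict(...)
if isinstance(task, tuple):          # grouped tasks arrive as (group_name, task)
    _, task = task

# merge extra generation kwargs into the existing dict instead of replacing it
task.set_config(key="generation_kwargs", value={"temperature": 0.0}, update=True)

# swap every metric for the no-op "bypass" metric, as done when predict_only=True
task.override_metric(metric_name="bypass")
```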
@property
def eval_docs(self) -> Union[datasets.Dataset, List[dict]]:
if self.has_test_docs():
return self.test_docs()
elif self.has_validation_docs():
return self.validation_docs()
else:
assert False, f"Task dataset (path={self.DATASET_PATH}, name={self.DATASET_NAME}) must have valid or test docs!"
def doc_iterator(
self, *, rank: int = 0, limit: Union[int, None] = None, world_size: int = 1
) -> Iterator[Tuple[int, Any]]:
limit = int(limit) if limit else None
doc_iterator = utils.create_iterator(
enumerate(self.eval_docs),
rank=int(rank),
limit=limit,
world_size=int(world_size),
)
return doc_iterator
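`doc_iterator` shards the enumerated eval docs across ranks; assuming `utils.create_iterator` is an `itertools.islice`-style stride (start=rank, stop=limit, step=world_size), its behaviour is equivalent to this toy sketch:

```python
import itertools

docs = [{"text": f"doc {i}"} for i in range(10)]  # hypothetical eval docs
rank, world_size, limit = 1, 4, None

shard = itertools.islice(enumerate(docs), rank, limit, world_size)
print([doc_id for doc_id, _ in shard])  # [1, 5, 9]: every world_size-th doc, offset by rank
```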
class ConfigurableTask(Task): class ConfigurableTask(Task):
VERSION = "Yaml" VERSION = "Yaml"
...@@ -624,7 +739,7 @@ class ConfigurableTask(Task): ...@@ -624,7 +739,7 @@ class ConfigurableTask(Task):
INV_AGG_REGISTRY = {v: k for k, v in AGGREGATION_REGISTRY.items()} INV_AGG_REGISTRY = {v: k for k, v in AGGREGATION_REGISTRY.items()}
metric_agg = get_metric_aggregation(metric_name) metric_agg = get_metric_aggregation(metric_name)
eval_logger.warning( eval_logger.warning(
f"[Task: {self._config.task}] metric {metric_name} is defined, but aggregation is not. " f"[Task: {self.config.task}] metric {metric_name} is defined, but aggregation is not. "
f"using default " f"using default "
f"aggregation={INV_AGG_REGISTRY[metric_agg]}" f"aggregation={INV_AGG_REGISTRY[metric_agg]}"
) )
...@@ -636,7 +751,7 @@ class ConfigurableTask(Task): ...@@ -636,7 +751,7 @@ class ConfigurableTask(Task):
] ]
else: else:
eval_logger.warning( eval_logger.warning(
f"[Task: {self._config.task}] metric {metric_name} is defined, but higher_is_better is not. " f"[Task: {self.config.task}] metric {metric_name} is defined, but higher_is_better is not. "
f"using default " f"using default "
f"higher_is_better={is_higher_better(metric_name)}" f"higher_is_better={is_higher_better(metric_name)}"
) )
...@@ -677,12 +792,7 @@ class ConfigurableTask(Task): ...@@ -677,12 +792,7 @@ class ConfigurableTask(Task):
else "default" else "default"
)(list(self.fewshot_docs()), self, rnd=random.Random(1234)) )(list(self.fewshot_docs()), self, rnd=random.Random(1234))
if self.has_test_docs(): self.task_docs = self.eval_docs
self.task_docs = self.test_docs()
elif self.has_validation_docs():
self.task_docs = self.validation_docs()
else:
assert False, f"Task dataset (path={self.DATASET_PATH}, name={self.DATASET_NAME}) must have valid or test docs!"
# Test One Doc # Test One Doc
self.features = list(self.task_docs.features.keys()) self.features = list(self.task_docs.features.keys())
...@@ -833,6 +943,7 @@ class ConfigurableTask(Task): ...@@ -833,6 +943,7 @@ class ConfigurableTask(Task):
return labeled_examples + str(example) return labeled_examples + str(example)
def apply_filters(self): def apply_filters(self):
"""Iterates over FilterEnsembles and applies them to instances"""
if hasattr(self, "_filters"): if hasattr(self, "_filters"):
for f in self._filters: for f in self._filters:
f.apply(self._instances) f.apply(self._instances)
...@@ -1026,7 +1137,7 @@ class ConfigurableTask(Task): ...@@ -1026,7 +1137,7 @@ class ConfigurableTask(Task):
return request_list return request_list
elif self.OUTPUT_TYPE == "generate_until": elif self.OUTPUT_TYPE == "generate_until":
arguments = (ctx, self.config.generation_kwargs) arguments = (ctx, deepcopy(self.config.generation_kwargs))
return Instance( return Instance(
request_type=self.OUTPUT_TYPE, doc=doc, arguments=arguments, idx=0, **kwargs request_type=self.OUTPUT_TYPE, doc=doc, arguments=arguments, idx=0, **kwargs
...@@ -1122,12 +1233,21 @@ class ConfigurableTask(Task): ...@@ -1122,12 +1233,21 @@ class ConfigurableTask(Task):
# TODO: this gets score of 0 on arc_challenge for pythia-70m. need to test that this works properly # TODO: this gets score of 0 on arc_challenge for pythia-70m. need to test that this works properly
exact_match = int(is_greedy[gold]) if gold != -100 else 0 exact_match = int(is_greedy[gold]) if gold != -100 else 0
prob_norm = utils.softmax(lls)
# TODO use keyword arguments to the metric?
# gold, pred, norm stuff, the original lls,
result_dict = { result_dict = {
**({"acc": acc} if "acc" in use_metric else {}), **({"acc": acc} if "acc" in use_metric else {}),
**({"f1": (gold, pred)} if "f1" in use_metric else {}), **({"f1": (gold, pred)} if "f1" in use_metric else {}),
**({"mcc": (gold, pred)} if "mcc" in use_metric else {}), **({"mcc": (gold, pred)} if "mcc" in use_metric else {}),
**({"acc_norm": acc_norm} if "acc_norm" in use_metric else {}), **({"acc_norm": acc_norm} if "acc_norm" in use_metric else {}),
**({"exact_match": exact_match} if "exact_match" in use_metric else {}), **({"exact_match": exact_match} if "exact_match" in use_metric else {}),
**(
{"brier_score": (gold, prob_norm)}
if "brier_score" in use_metric
else {}
),
} }
if "acc_mutual_info" in use_metric: if "acc_mutual_info" in use_metric:
...@@ -1222,36 +1342,14 @@ class ConfigurableTask(Task): ...@@ -1222,36 +1342,14 @@ class ConfigurableTask(Task):
def get_config(self, key: str) -> Any: def get_config(self, key: str) -> Any:
return getattr(self._config, key, None) return getattr(self._config, key, None)
def override_metric(self, metric_name: str) -> None: def __repr__(self):
""" return (
Override the default metrics used for evaluation with custom metrics. f"ConfigurableTask(task_name={getattr(self.config, 'task', None)},"
f"group_name={getattr(self.config, 'group', None)},"
Parameters: f"output_type={self.OUTPUT_TYPE},"
- metric_name (str): The name of the custom metric to override. Should be registered in api.metrics. f"num_fewshot={getattr(self.config, 'num_fewshot', None)},"
""" f"num_samples={len(self.eval_docs)})"
( )
self._metric_fn_list,
self._aggregation_list,
self._metric_fn_kwargs,
self._higher_is_better,
) = ({}, {}, {}, {})
self._metric_fn_list[metric_name] = get_metric(metric_name)
self._aggregation_list[metric_name] = get_metric_aggregation(metric_name)
self._higher_is_better[metric_name] = is_higher_better(metric_name)
self._metric_fn_kwargs[metric_name] = {}
setattr(self._config, "metric_list", [{"metric": metric_name}])
setattr(self._config, "process_results", None)
def override_config(
self, key: str = None, value: Any = None, update: bool = False
) -> None:
if update:
current_value = getattr(self._config, key)
assert isinstance(current_value, dict)
current_value.update(value)
setattr(self._config, key, current_value)
else:
setattr(self._config, key, value)
class MultipleChoiceTask(Task): class MultipleChoiceTask(Task):
......
import hashlib
import os
import dill
from lm_eval.utils import eval_logger
MODULE_DIR = os.path.dirname(os.path.realpath(__file__))
OVERRIDE_PATH = os.getenv("LM_HARNESS_CACHE_PATH")
PATH = OVERRIDE_PATH if OVERRIDE_PATH else f"{MODULE_DIR}/.cache"
# This should be sufficient for uniqueness
HASH_INPUT = "EleutherAI-lm-evaluation-harness"
HASH_PREFIX = hashlib.sha256(HASH_INPUT.encode("utf-8")).hexdigest()
FILE_SUFFIX = f".{HASH_PREFIX}.pickle"
def load_from_cache(file_name):
try:
path = f"{PATH}/{file_name}{FILE_SUFFIX}"
with open(path, "rb") as file:
cached_task_dict = dill.loads(file.read())
return cached_task_dict
except Exception:
eval_logger.debug(f"{file_name} is not cached, generating...")
pass
def save_to_cache(file_name, obj):
if not os.path.exists(PATH):
os.mkdir(PATH)
file_path = f"{PATH}/{file_name}{FILE_SUFFIX}"
eval_logger.debug(f"Saving {file_path} to cache...")
with open(file_path, "wb") as file:
file.write(dill.dumps(obj))
# NOTE: the "key" param allows deleting only the cache files whose names start with the given prefix
def delete_cache(key: str = ""):
files = os.listdir(PATH)
for file in files:
if file.startswith(key) and file.endswith(FILE_SUFFIX):
file_path = f"{PATH}/{file}"
os.unlink(file_path)
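The request cache is just dill-pickled objects stored under `PATH` (overridable via `LM_HARNESS_CACHE_PATH`, which must be set before the module is imported since `PATH` is computed at import time). A round-trip sketch using the helpers above; the cache key, payload, and location are examples:

```python
import os

os.environ.setdefault("LM_HARNESS_CACHE_PATH", "/tmp/lm_eval_cache")  # hypothetical location

from lm_eval.caching.cache import delete_cache, load_from_cache, save_to_cache

save_to_cache(file_name="requests-my_task", obj=[["instance-0"], ["instance-1"]])
cached = load_from_cache(file_name="requests-my_task")  # the pickled object, or None on a miss
delete_cache(key="requests-my_task")                    # removes matching *.pickle files
```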
import random
import itertools
import collections import collections
import itertools
import torch import logging
import random
from typing import TYPE_CHECKING, Optional, Union
import numpy as np import numpy as np
import torch
import lm_eval.api
import lm_eval.tasks
import lm_eval.models
import lm_eval.api.metrics import lm_eval.api.metrics
import lm_eval.api.registry import lm_eval.api.registry
import lm_eval.models
from lm_eval.evaluator_utils import (
consolidate_results,
get_sample_size,
get_task_list,
prepare_print_tasks,
print_writeout,
run_task_tests,
)
from lm_eval.logging_utils import add_env_info, get_git_commit_hash
from lm_eval.tasks import TaskManager, get_task_dict
from lm_eval.utils import ( from lm_eval.utils import (
eval_logger,
positional_deprecated, positional_deprecated,
run_task_tests,
get_git_commit_hash,
simple_parse_args_string, simple_parse_args_string,
eval_logger,
) )
if TYPE_CHECKING:
from lm_eval.api.model import LM
from lm_eval.tasks import Task
from lm_eval.caching.cache import delete_cache
@positional_deprecated @positional_deprecated
def simple_evaluate( def simple_evaluate(
model, model,
model_args=None, model_args: Optional[Union[str, dict]] = None,
tasks=None, tasks=None,
num_fewshot=None, num_fewshot: Optional[int] = None,
batch_size=None, batch_size: Optional[int] = None,
max_batch_size=None, max_batch_size: Optional[int] = None,
device=None, device: Optional[str] = None,
use_cache=None, use_cache: Optional[str] = None,
limit=None, cache_requests: bool = False,
rewrite_requests_cache: bool = False,
delete_requests_cache: bool = False,
limit: Optional[Union[int, float]] = None,
bootstrap_iters: int = 100000, bootstrap_iters: int = 100000,
check_integrity: bool = False, check_integrity: bool = False,
decontamination_ngrams_path=None, decontamination_ngrams_path=None,
write_out: bool = False, write_out: bool = False,
log_samples: bool = True, log_samples: bool = True,
gen_kwargs: str = None, gen_kwargs: str = None,
task_manager: TaskManager = None,
verbosity: str = "INFO",
predict_only: bool = False, predict_only: bool = False,
random_seed: int = 0,
numpy_random_seed: int = 1234,
torch_random_seed: int = 1234,
): ):
"""Instantiate and evaluate a model on a list of tasks. """Instantiate and evaluate a model on a list of tasks.
:param model: Union[str, LM] :param model: Union[str, LM]
Name of model or LM object, see lm_eval.models.get_model Name of model or LM object, see lm_eval.models.get_model
:param model_args: Optional[str] :param model_args: Optional[str, dict]
String arguments for each model class, see LM.create_from_arg_string. String or dict arguments for each model class, see LM.create_from_arg_string and LM.create_from_arg_obj.
Ignored if `model` argument is a LM object. Ignored if `model` argument is a LM object.
:param tasks: list[Union[str, Task]] :param tasks: list[Union[str, dict, Task]]
List of task names or Task objects. Task objects will be taken to have name task.EVAL_HARNESS_NAME if defined and type(task).__name__ otherwise. List of task names or Task objects. Task objects will be taken to have name task.EVAL_HARNESS_NAME if defined and type(task).__name__ otherwise.
:param num_fewshot: int :param num_fewshot: int
Number of examples in few-shot context Number of examples in few-shot context
...@@ -59,6 +80,12 @@ def simple_evaluate( ...@@ -59,6 +80,12 @@ def simple_evaluate(
PyTorch device (e.g. "cpu" or "cuda:0") for running models PyTorch device (e.g. "cpu" or "cuda:0") for running models
:param use_cache: str, optional :param use_cache: str, optional
A path to a sqlite db file for caching model responses. `None` if not caching. A path to a sqlite db file for caching model responses. `None` if not caching.
:param cache_requests: bool, optional
Speed up evaluation by caching the building of dataset requests. `False` if not caching.
:param rewrite_requests_cache: bool, optional
Rewrites all of the request cache if set to `True`. `False` if not desired.
:param delete_requests_cache: bool, optional
Deletes all of the request cache if set to `True`. `False` if not desired.
:param limit: int or float, optional :param limit: int or float, optional
Limit the number of examples per task (only use this for testing), If <1, limit is a percentage of the total number of examples. Limit the number of examples per task (only use this for testing), If <1, limit is a percentage of the total number of examples.
:param bootstrap_iters: :param bootstrap_iters:
...@@ -74,15 +101,38 @@ def simple_evaluate( ...@@ -74,15 +101,38 @@ def simple_evaluate(
Ignored for all tasks with loglikelihood output_type Ignored for all tasks with loglikelihood output_type
:param predict_only: bool :param predict_only: bool
If true only model outputs will be generated and returned. Metrics will not be evaluated If true only model outputs will be generated and returned. Metrics will not be evaluated
:param random_seed: int
Random seed for python's random module. If set to None, the seed will not be set.
:param numpy_random_seed: int
Random seed for numpy. If set to None, the seed will not be set.
:param torch_random_seed: int
Random seed for torch. If set to None, the seed will not be set.
:return :return
Dictionary of results Dictionary of results
""" """
random.seed(0) eval_logger.setLevel(getattr(logging, f"{verbosity}"))
np.random.seed(1234)
torch.manual_seed( if delete_requests_cache:
1234 eval_logger.info("Deleting requests cache...")
) # TODO: this may affect training runs that are run with evaluation mid-run. delete_cache()
seed_message = []
if random_seed is not None:
# See https://github.com/EleutherAI/lm-evaluation-harness/pull/1412
seed_message.append(f"Setting random seed to {random_seed}")
random.seed(random_seed)
if numpy_random_seed is not None:
seed_message.append(f"Setting numpy seed to {numpy_random_seed}")
np.random.seed(numpy_random_seed)
if torch_random_seed is not None:
seed_message.append(f"Setting torch manual seed to {torch_random_seed}")
torch.manual_seed(torch_random_seed)
if seed_message:
eval_logger.info(" | ".join(seed_message))
if tasks is None: if tasks is None:
tasks = [] tasks = []
...@@ -101,20 +151,32 @@ def simple_evaluate( ...@@ -101,20 +151,32 @@ def simple_evaluate(
if isinstance(model, str): if isinstance(model, str):
if model_args is None: if model_args is None:
model_args = "" model_args = ""
lm = lm_eval.api.registry.get_model(model).create_from_arg_string(
model_args, elif isinstance(model_args, dict):
{ lm = lm_eval.api.registry.get_model(model).create_from_arg_obj(
"batch_size": batch_size, model_args,
"max_batch_size": max_batch_size, {
"device": device, "batch_size": batch_size,
}, "max_batch_size": max_batch_size,
) "device": device,
},
)
else:
lm = lm_eval.api.registry.get_model(model).create_from_arg_string(
model_args,
{
"batch_size": batch_size,
"max_batch_size": max_batch_size,
"device": device,
},
)
else: else:
assert isinstance(model, lm_eval.api.model.LM) assert isinstance(model, lm_eval.api.model.LM)
lm = model lm = model
if use_cache is not None: if use_cache is not None:
print(f"Using cache at {use_cache + '_rank' + str(lm.rank) + '.db'}") eval_logger.info(f"Using cache at {use_cache + '_rank' + str(lm.rank) + '.db'}")
lm = lm_eval.api.model.CachingLM( lm = lm_eval.api.model.CachingLM(
lm, lm,
use_cache use_cache
...@@ -125,7 +187,14 @@ def simple_evaluate( ...@@ -125,7 +187,14 @@ def simple_evaluate(
+ ".db", + ".db",
) )
task_dict = lm_eval.tasks.get_task_dict(tasks) if task_manager is None:
task_manager = TaskManager(verbosity)
eval_logger.info(
"get_task_dict has been updated to accept an optional argument, `task_manager`"
"Read more here:https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md#external-library-usage"
)
task_dict = get_task_dict(tasks, task_manager)
for task_name in task_dict.keys(): for task_name in task_dict.keys():
task_obj = task_dict[task_name] task_obj = task_dict[task_name]
if isinstance(task_obj, tuple): if isinstance(task_obj, tuple):
...@@ -135,17 +204,17 @@ def simple_evaluate( ...@@ -135,17 +204,17 @@ def simple_evaluate(
if task_obj.get_config("output_type") == "generate_until": if task_obj.get_config("output_type") == "generate_until":
if gen_kwargs is not None: if gen_kwargs is not None:
task_obj.override_config( task_obj.set_config(
key="generation_kwargs", value=gen_kwargs, update=True key="generation_kwargs", value=gen_kwargs, update=True
) )
if predict_only: if predict_only:
log_samples = True log_samples = True
eval_logger.info( eval_logger.info(
f"Processing {task_name} in output-only mode. Metrics will not be calculated!" f"Processing {task_name} in output-only mode. Metrics will not be calculated!"
) )
# we have to change the class properties post-hoc. This is pretty hacky. # we have to change the class properties post-hoc. This is pretty hacky.
task_obj.override_metric(metric_name="bypass") task_obj.override_metric(metric_name="bypass")
if num_fewshot is not None: if num_fewshot is not None:
if (default_num_fewshot := task_obj.get_config("num_fewshot")) == 0: if (default_num_fewshot := task_obj.get_config("num_fewshot")) == 0:
...@@ -156,7 +225,7 @@ def simple_evaluate( ...@@ -156,7 +225,7 @@ def simple_evaluate(
eval_logger.warning( eval_logger.warning(
f"Overwriting default num_fewshot of {task_name} from {default_num_fewshot} to {num_fewshot}" f"Overwriting default num_fewshot of {task_name} from {default_num_fewshot} to {num_fewshot}"
) )
task_obj.override_config(key="num_fewshot", value=num_fewshot) task_obj.set_config(key="num_fewshot", value=num_fewshot)
if check_integrity: if check_integrity:
run_task_tests(task_list=tasks) run_task_tests(task_list=tasks)
...@@ -165,10 +234,13 @@ def simple_evaluate( ...@@ -165,10 +234,13 @@ def simple_evaluate(
lm=lm, lm=lm,
task_dict=task_dict, task_dict=task_dict,
limit=limit, limit=limit,
cache_requests=cache_requests,
rewrite_requests_cache=rewrite_requests_cache,
bootstrap_iters=bootstrap_iters, bootstrap_iters=bootstrap_iters,
decontamination_ngrams_path=decontamination_ngrams_path, decontamination_ngrams_path=decontamination_ngrams_path,
write_out=write_out, write_out=write_out,
log_samples=log_samples, log_samples=log_samples,
verbosity=verbosity,
) )
if lm.rank == 0: if lm.rank == 0:
...@@ -184,9 +256,9 @@ def simple_evaluate( ...@@ -184,9 +256,9 @@ def simple_evaluate(
"model": model_name, "model": model_name,
"model_args": model_args, "model_args": model_args,
"batch_size": batch_size, "batch_size": batch_size,
"batch_sizes": list(lm.batch_sizes.values()) "batch_sizes": (
if hasattr(lm, "batch_sizes") list(lm.batch_sizes.values()) if hasattr(lm, "batch_sizes") else []
else [], ),
"device": device, "device": device,
"use_cache": use_cache, "use_cache": use_cache,
"limit": limit, "limit": limit,
...@@ -194,6 +266,7 @@ def simple_evaluate( ...@@ -194,6 +266,7 @@ def simple_evaluate(
"gen_kwargs": gen_kwargs, "gen_kwargs": gen_kwargs,
} }
results["git_hash"] = get_git_commit_hash() results["git_hash"] = get_git_commit_hash()
add_env_info(results) # additional environment info to results
return results return results
else: else:
return None return None
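An end-to-end sketch of the new dict-style `model_args` together with the request-caching flags; the model and task names are examples, and `from lm_eval.evaluator import simple_evaluate` assumes this file is `lm_eval/evaluator.py`:

```python
from lm_eval.evaluator import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args={"pretrained": "EleutherAI/pythia-70m", "dtype": "float32"},
    tasks=["lambada_openai"],
    num_fewshot=0,
    batch_size=8,
    cache_requests=True,            # reuse cached request construction when available
    rewrite_requests_cache=False,
    limit=0.01,                     # fractional limit = share of the dataset (testing only)
)
if results is not None:             # non-zero ranks return None
    print(results["results"])
```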
...@@ -204,13 +277,16 @@ decontaminate_suffix = "_decontaminate" ...@@ -204,13 +277,16 @@ decontaminate_suffix = "_decontaminate"
@positional_deprecated @positional_deprecated
def evaluate( def evaluate(
lm, lm: "LM",
task_dict, task_dict,
limit=None, limit: Optional[int] = None,
bootstrap_iters: int = 100000, cache_requests=False,
rewrite_requests_cache=False,
bootstrap_iters: Optional[int] = 100000,
decontamination_ngrams_path=None, decontamination_ngrams_path=None,
write_out: bool = False, write_out: bool = False,
log_samples: bool = True, log_samples: bool = True,
verbosity: str = "INFO",
): ):
"""Instantiate and evaluate a model on a list of tasks. """Instantiate and evaluate a model on a list of tasks.
...@@ -230,96 +306,38 @@ def evaluate( ...@@ -230,96 +306,38 @@ def evaluate(
Dictionary of results Dictionary of results
""" """
eval_logger.setLevel(getattr(logging, f"{verbosity}"))
# decontaminate = decontamination_ngrams_path is not None # decontaminate = decontamination_ngrams_path is not None
for task_name, task in task_dict.items():
if isinstance(task, tuple):
_, task = task
if not log_samples:
assert (
"bypass" not in getattr(task, "_metric_fn_list", {}).keys()
), f"log_samples must be True for 'bypass' only tasks: {task_name}"
# stores the final result for each task, for each metric/filter pair.
results = collections.defaultdict(dict)
# Tracks each task's version.
versions = collections.defaultdict(dict)
# Tracks the YAML configs of all chosen tasks.
configs = collections.defaultdict(dict)
# logs info about each document evaluated.
samples = collections.defaultdict(list)
# tracks all Instances/requests a model must generate output on. # tracks all Instances/requests a model must generate output on.
requests = collections.defaultdict(list) requests = collections.defaultdict(list)
# Aggregated task scores presented with groups
results_agg = collections.defaultdict(dict)
# Aggregated groups scores only
groups_agg = collections.defaultdict(dict)
# stores the amount to pad out reqs per req. type so that # stores the amount to pad out reqs per req. type so that
# number of fwd passes per distributed rank is equal # number of fwd passes per distributed rank is equal
padding_requests = collections.defaultdict(int) padding_requests = collections.defaultdict(int)
# store the hierarchy to do proper ordering
task_hierarchy = collections.defaultdict(list)
# store num-fewshot value per task
num_fewshot = collections.defaultdict(int)
# get lists of each type of request
for task_name, task in task_dict.items():
if isinstance(task, tuple):
group_name, task = task
task_hierarchy[group_name].append(task_name)
versions[group_name] = "N/A"
else:
group_name = None
task_hierarchy[task_name] = []
if task is None:
continue
versions[task_name] = task.VERSION
configs[task_name] = dict(task.dump_config())
if "num_fewshot" in configs[task_name]:
n_shot = configs[task_name]["num_fewshot"]
else:
n_shot = 0
num_fewshot[task_name] = n_shot
if "task_alias" in configs[task_name]:
results[task_name]["alias"] = configs[task_name]["task_alias"]
if (
("group_alias" in configs[task_name])
and (group_name not in results)
and (group_name is not None)
):
results[group_name]["alias"] = configs[task_name]["group_alias"]
if limit is not None:
if task.has_test_docs():
task_docs = task.test_docs()
elif task.has_validation_docs():
task_docs = task.validation_docs()
else:
raise RuntimeError("Task has neither test_docs nor validation_docs")
limit = int(len(task_docs) * limit) if limit < 1.0 else int(limit)
task.build_all_requests(limit=limit, rank=lm.rank, world_size=lm.world_size)
# get lists of group hierarchy and each type of request
task_hierarchy, eval_tasks = get_task_list(task_dict)
if not log_samples:
assert all(
"bypass" not in getattr(task_output.task, "_metric_fn_list", {}).keys()
for task_output in eval_tasks
), "log_samples must be True for 'bypass' only tasks"
for task_output in eval_tasks:
task: Task = task_output.task
limit = get_sample_size(task, limit)
task.build_all_requests(
limit=limit,
rank=lm.rank,
world_size=lm.world_size,
cache_requests=cache_requests,
rewrite_requests_cache=rewrite_requests_cache,
)
eval_logger.debug( eval_logger.debug(
f"Task: {task_name}; number of requests on this rank: {len(task.instances)}" f"Task: {task_output.task_name}; number of requests on this rank: {len(task.instances)}"
) )
if write_out: if write_out:
for inst in task.instances: print_writeout(task)
# print the prompt for the first few documents
if inst.doc_id < 1:
eval_logger.info(
f"Task: {task_name}; document {inst.doc_id}; context prompt (starting on next line):\
\n{inst.args[0]}\n(end of prompt on previous line)\ntarget string or answer choice index (starting on next line):\n{task.doc_to_target(inst.doc)}\n(end of target on previous line)"
)
eval_logger.info(f"Request: {str(inst)}")
# aggregate Instances by LM method requested to get output. # aggregate Instances by LM method requested to get output.
for instance in task.instances: for instance in task.instances:
reqtype = instance.request_type reqtype = instance.request_type
...@@ -331,7 +349,7 @@ def evaluate( ...@@ -331,7 +349,7 @@ def evaluate(
lm.accelerator.gather(instances_rnk).cpu().detach().numpy().tolist() lm.accelerator.gather(instances_rnk).cpu().detach().numpy().tolist()
) )
# compute number of pseudobatches to pad with (FSDP/DDP require even batches among ranks) # compute number of pseudo-batches to pad with (FSDP/DDP require even batches among ranks)
numpad = max(gathered_item) - gathered_item[lm.rank] numpad = max(gathered_item) - gathered_item[lm.rank]
padding_requests[task.OUTPUT_TYPE] += numpad padding_requests[task.OUTPUT_TYPE] += numpad
...@@ -358,42 +376,33 @@ def evaluate( ...@@ -358,42 +376,33 @@ def evaluate(
if lm.world_size > 1: if lm.world_size > 1:
lm.accelerator.wait_for_everyone() lm.accelerator.wait_for_everyone()
RANK = lm.rank
WORLD_SIZE = lm.world_size
### Postprocess outputs ### ### Postprocess outputs ###
# TODO: del model here, maybe (idea: allow user to specify device of e.g. reward model separately) # TODO: del model here, maybe (idea: allow user to specify device of e.g. reward model separately)
for task_name, task in task_dict.items(): for task_output in eval_tasks:
if isinstance(task, tuple): task = task_output.task
group, task = task
if task is None:
continue
task.apply_filters() task.apply_filters()
### Collect values of metrics on all datapoints ### ### Collect values of metrics on all datapoints ###
vals = collections.defaultdict(list) # # unpack results and sort back in order and return control to Task
# unpack results and sort back in order and return control to Task
for task_name, task in task_dict.items():
if isinstance(task, tuple):
group, task = task
if task is None:
continue
# TODO: make it possible to use a different metric per filter # TODO: make it possible to use a different metric per filter
# Pre-process task.instances to group by doc_id
instances_by_doc_id = collections.defaultdict(list)
for instance in task.instances:
instances_by_doc_id[instance.doc_id].append(instance)
# Sort instances within each group
for instances in instances_by_doc_id.values():
instances.sort(key=lambda x: x.idx)
# iterate over different filters used # iterate over different filters used
for key in task.instances[0].filtered_resps.keys(): for filter_key in task.instances[0].filtered_resps.keys():
doc_iterator = ( doc_iterator = task.doc_iterator(
itertools.islice( rank=RANK, limit=limit, world_size=WORLD_SIZE
enumerate(task.test_docs()), lm.rank, limit, lm.world_size
)
if task.has_test_docs()
else itertools.islice(
enumerate(task.validation_docs()), lm.rank, limit, lm.world_size
)
) )
for doc_id, doc in doc_iterator: for doc_id, doc in doc_iterator:
# subset instances to only this document id ; sort by idx requests = instances_by_doc_id[doc_id]
requests = list(filter(lambda x: x.doc_id == doc_id, task.instances))
requests.sort(key=lambda x: x.idx)
metrics = task.process_results( metrics = task.process_results(
doc, [req.filtered_resps[key] for req in requests] doc, [req.filtered_resps[filter_key] for req in requests]
) )
if log_samples: if log_samples:
target = task.doc_to_target(doc) target = task.doc_to_target(doc)
...@@ -403,214 +412,108 @@ def evaluate( ...@@ -403,214 +412,108 @@ def evaluate(
"target": target, "target": target,
"arguments": [req.args for req in requests], "arguments": [req.args for req in requests],
"resps": [req.resps for req in requests], "resps": [req.resps for req in requests],
"filtered_resps": [req.filtered_resps[key] for req in requests], "filtered_resps": [
req.filtered_resps[filter_key] for req in requests
],
} }
example.update(metrics) example.update(metrics)
samples[task_name].append(example) task_output.logged_samples.append(example)
for metric, value in metrics.items(): for metric, value in metrics.items():
vals[(task_name, key, metric)].append(value) task_output.sample_metrics[(metric, filter_key)].append(value)
if lm.world_size > 1: if WORLD_SIZE > 1:
# if multigpu, then gather data across all ranks # if multigpu, then gather data across all ranks to rank 0
# first gather logged samples across all ranks # first gather logged samples across all ranks
for task_name, task_samples in list(samples.items()): for task_output in eval_tasks:
full_samples = [None] * lm.world_size if log_samples:
torch.distributed.all_gather_object(full_samples, task_samples) # for task_name, task_samples in list(samples.items()):
full_samples = [None] * WORLD_SIZE if RANK == 0 else None
samples[task_name] = list(itertools.chain.from_iterable(full_samples)) torch.distributed.gather_object(
obj=task_output.logged_samples,
# then collect metrics across all ranks object_gather_list=full_samples,
vals_torch = collections.defaultdict(list) dst=0,
for (task_name, key, metric), items in vals.items():
numitem = 0
if isinstance(items[0], tuple):
numitem = len(items[0])
if isinstance(items[0], (str, list, tuple)):
# handle the string case
gathered_items = [None] * lm.accelerator.num_processes
torch.distributed.all_gather_object(gathered_items, items)
gathered_item = list(itertools.chain.from_iterable(gathered_items))
else:
# distributed gather requires all ranks to have same dimensions
# so we pad out with float32 min value
pad_value = torch.finfo(torch.float32).min
metrics_tensor = torch.tensor(items, device=lm.device)
original_dtype = metrics_tensor.dtype # store original dtype
torch_device_tensor = lm.accelerator.pad_across_processes(
metrics_tensor.to(torch.float32), pad_index=pad_value
) )
gathered_item = lm.accelerator.gather(torch_device_tensor)
if numitem > 0: if RANK == 0:
gathered_filtered = gathered_item[gathered_item[:, 0] != pad_value] task_output.logged_samples = list(
else: itertools.chain.from_iterable(full_samples)
gathered_filtered = gathered_item[gathered_item != pad_value] )
gathered_item = ( # then collect metrics across all ranks
gathered_filtered.to(original_dtype).cpu().detach().numpy().tolist() for metrics in task_output.sample_metrics:
metric_list = [None] * WORLD_SIZE if RANK == 0 else None
torch.distributed.gather_object(
obj=task_output.sample_metrics[metrics],
object_gather_list=metric_list,
dst=0,
) )
# reconvert if we were passed a tuple of values if RANK == 0:
if numitem > 0: task_output.sample_metrics[metrics] = list(
gathered_item = [tuple(g) for g in gathered_item] itertools.chain.from_iterable(metric_list)
)
if lm.rank == 0:
vals_torch[(task_name, key, metric)] = gathered_item
vals = vals_torch
if lm.rank == 0:
if RANK == 0:
### Aggregate results over all datapoints ### ### Aggregate results over all datapoints ###
# aggregate results ; run bootstrap CIs # aggregate results ; run bootstrap CIs
for (task_name, key, metric), items in vals.items(): for task_output in eval_tasks:
task = task_dict[task_name] task_output.calculate_aggregate_metric(bootstrap_iters=bootstrap_iters)
metric_key = metric + "," + key results, samples, configs, versions, num_fewshot = consolidate_results(
eval_tasks
if isinstance(task, tuple): )
group_name, task = task
else:
group_name = None
agg_fn = task.aggregation()[metric]
results[task_name][metric_key] = agg_fn(items)
results[task_name]["samples"] = len(items)
# hotfix: bleu, chrf, ter seem to be really expensive to bootstrap
# so we run them less iterations. still looking for a cleaner way to do this
if bootstrap_iters > 0:
stderr = lm_eval.api.metrics.stderr_for_metric(
metric=task.aggregation()[metric],
bootstrap_iters=min(bootstrap_iters, 100)
if metric in ["bleu", "chrf", "ter"]
else bootstrap_iters,
)
if stderr is not None and len(items) > 1:
results[task_name][metric + "_stderr" + "," + key] = stderr(items)
else:
results[task_name][metric + "_stderr" + "," + key] = "N/A"
### Calculate group metrics ###
if bool(results): if bool(results):
for group, task_list in reversed(task_hierarchy.items()): for group, task_list in reversed(task_hierarchy.items()):
if task_list == []: if len(task_list) == 0:
# TODO: No samples when bypass # task_hierarchy entries are either
total_size = results[group].get("samples", 999) # `group_name: [subtask1, subtask2, ...]`
else: # or `task_name: []`.
total_size = 0 # we only want to operate on groups here.
continue
for task in task_list: metric_list = list(
metrics = results[task].copy() {
key
if "alias" in metrics: for task in task_list
metrics.pop("alias") for key in results[task].keys()
if "_stderr" not in key and key not in ["alias", "samples"]
if ("weight_by_size" in configs) and configs[task]["weight_by_size"]: }
current_size = metrics.pop("samples")
else:
metrics.pop("samples")
current_size = 1
all_stderr = []
for metric in [
key for key in metrics.keys() if "_stderr" not in key
]:
stderr = "_stderr,".join(metric.split(","))
stderr_score = results[task][stderr]
if stderr_score == "N/A":
var_score = "N/A"
else:
var_score = stderr_score**2
all_stderr.append(stderr)
metric_score = results[task][metric]
if metric in results[group]:
results[group][metric] = (
results[group][metric] * total_size
+ metric_score * current_size
) / (total_size + current_size)
# $$s_z^2 = \frac{(n-1) s_x^2 + (m-1) s_y^2}{n+m-1} + \frac{nm(\bar x - \bar y)^2}{(n+m)(n+m-1)}.$$
if var_score == "N/A" or results[group][stderr] == "N/A":
results[group][stderr] = "N/A"
else:
results[group][stderr] = (
(total_size - 1) * results[group][stderr]
+ (current_size - 1) * var_score
) / (
total_size + current_size - 1
) + total_size * current_size / (
(total_size + current_size)
* (total_size + current_size - 1)
) * (
results[group][metric] - metric_score
) ** 2
else:
results[group][metric] = metric_score
results[group][stderr] = var_score
total_size += current_size
for stderr in all_stderr:
results[group][stderr] = np.sqrt(results[group][stderr])
results[group]["samples"] = total_size
def print_tasks(task_hierarchy, results, tab=0):
results_agg = collections.defaultdict(dict)
groups_agg = collections.defaultdict(dict)
(group_name, task_list), *_ = task_hierarchy.items()
task_list = sorted(task_list)
results_agg[group_name] = results[group_name].copy()
# results_agg[group_name]["tab"] = tab
if "samples" in results_agg[group_name]:
results_agg[group_name].pop("samples")
tab_string = " " * tab + "- " if tab > 0 else ""
if "alias" in results_agg[group_name]:
results_agg[group_name]["alias"] = (
tab_string + results_agg[group_name]["alias"]
) )
else: for metric in metric_list:
results_agg[group_name]["alias"] = tab_string + group_name stderr = "_stderr,".join(metric.split(","))
if len(task_list) > 0: # gather metrics, sizes, and stderrs from subtasks
groups_agg[group_name] = results[group_name].copy() metrics = [
# groups_agg[group_name]["tab"] = tab results[task][metric]
if "samples" in groups_agg[group_name]: for task in task_list
groups_agg[group_name].pop("samples") if metric in results[task]
] # TODO: copy?
if "alias" in groups_agg[group_name]: stderrs = [
groups_agg[group_name]["alias"] = ( results[task][stderr]
tab_string + groups_agg[group_name]["alias"] for task in task_list
) if stderr in results[task]
else: ]
groups_agg[group_name]["alias"] = tab_string + group_name sizes = [
results[task]["samples"]
for task_name in task_list: for task in task_list
if task_name in task_hierarchy: if metric in results[task]
_task_hierarchy = { ]
**{task_name: task_hierarchy[task_name]},
**task_hierarchy, # compute group's pooled metric and stderr
} results[group][
metric
] = lm_eval.api.metrics.aggregate_subtask_metrics(metrics, sizes)
# TODO: calculate grouped metric using aggregation fn
if "N/A" in stderrs:
results[group][stderr] = "N/A"
else: else:
_task_hierarchy = { results[group][
**{task_name: []}, stderr
**task_hierarchy, ] = lm_eval.api.metrics.pooled_sample_stderr(stderrs, sizes)
} # TODO: allow GroupConfigs to choose which variance formula is used, for back-compatibility
# To use the old (likely incorrect) variance formula, comment out the above and uncomment this line:
# results[group][stderr] = lm_eval.api.metrics.combined_sample_stderr(stderrs, sizes, metrics=metrics)
_results_agg, _groups_agg = print_tasks( results[group]["samples"] = sum(sizes)
_task_hierarchy, results, tab + 1
)
results_agg = {**results_agg, **_results_agg}
groups_agg = {**groups_agg, **_groups_agg}
return results_agg, groups_agg
results_agg = collections.defaultdict(dict) results_agg = collections.defaultdict(dict)
groups_agg = collections.defaultdict(dict) groups_agg = collections.defaultdict(dict)
...@@ -625,18 +528,21 @@ def evaluate( ...@@ -625,18 +528,21 @@ def evaluate(
_task_hierarchy = { _task_hierarchy = {
k: v for k, v in task_hierarchy.items() if k in left_tasks_list k: v for k, v in task_hierarchy.items() if k in left_tasks_list
} }
_results_agg, _groups_agg = print_tasks(_task_hierarchy, results) _results_agg, _groups_agg = prepare_print_tasks(_task_hierarchy, results)
results_agg = {**results_agg, **_results_agg} results_agg = {**results_agg, **_results_agg}
groups_agg = {**groups_agg, **_groups_agg} groups_agg = {**groups_agg, **_groups_agg}
for group_name, task_list in task_hierarchy.items(): for group_name, task_list in task_hierarchy.items():
if task_list != []: if task_list:
num_fewshot[group_name] = num_fewshot[task_list[0]] num_fewshot[group_name] = num_fewshot[
task_list[0]
] # TODO: validate this
results_dict = { results_dict = {
"results": dict(results_agg.items()), "results": dict(results_agg.items()),
**({"groups": dict(groups_agg.items())} if bool(groups_agg) else {}), **({"groups": dict(groups_agg.items())} if bool(groups_agg) else {}),
"group_subtasks": {k: v for k, v in reversed(task_hierarchy.items())},
"configs": dict(sorted(configs.items())), "configs": dict(sorted(configs.items())),
"versions": dict(sorted(versions.items())), "versions": dict(sorted(versions.items())),
"n-shot": dict(sorted(num_fewshot.items())), "n-shot": dict(sorted(num_fewshot.items())),
...@@ -648,3 +554,15 @@ def evaluate( ...@@ -648,3 +554,15 @@ def evaluate(
else: else:
return None return None
def request_caching_arg_to_dict(cache_requests: str) -> dict:
request_caching_args = {
"cache_requests": (
True if cache_requests == "true" or cache_requests == "refresh" else False
),
"rewrite_requests_cache": True if cache_requests == "refresh" else False,
"delete_requests_cache": True if cache_requests == "delete" else False,
}
return request_caching_args
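A quick sketch of how a CLI-style `--cache_requests` value maps onto the evaluator keyword arguments (the import assumes this function lives in `lm_eval/evaluator.py`):

```python
from lm_eval.evaluator import request_caching_arg_to_dict

caching_args = request_caching_arg_to_dict(cache_requests="refresh")
# -> {"cache_requests": True, "rewrite_requests_cache": True, "delete_requests_cache": False}

# the result can be splatted straight into simple_evaluate(..., **caching_args)
```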
import collections
import math
import pathlib
import sys
from typing import Dict, List, Optional, Tuple, Union
from lm_eval.api import metrics
from lm_eval.utils import eval_logger, positional_deprecated
class TaskOutput:
"""
Wrapper class for Task outputs. It contains various attributes and methods to manage and calculate metrics for the task.
Attributes:
task (object): The task object.
task_name (str): The name of the task.
task_config (dict): The configuration of the task.
version (str): The version of the task.
group_name (str): The name of the task group.
n_shot (int): The number of shots for the task.
task_alias (str): The alias of the task.
group_alias (str): The alias of the task group.
is_group (bool): Indicates if the task is a group.
logged_samples (list): The list of logged samples.
sample_len (int): The length of the samples.
sample_metrics (defaultdict): The dictionary of samples' metrics.
agg_metrics (defaultdict): The dictionary of aggregate metrics.
Methods:
from_taskdict(cls, task_name: str, task):
Creates a TaskOutput instance from a task dictionary.
calculate_aggregate_metric(bootstrap_iters=100000) -> None:
Calculates the aggregate metrics for the task.
"""
def __init__(
self,
task=None,
task_name=None,
task_config=None,
version=None,
group_name=None,
n_shot=None,
task_alias=None,
group_alias=None,
is_group=None,
):
self.task = task
self.task_config = task_config
self.task_name = task_name
self.group_name = group_name
self.version = version
self.n_shot = n_shot
self.task_alias = task_alias
self.group_alias = group_alias
self.is_group = is_group
self.logged_samples = []
self.sample_len = None
self.sample_metrics = collections.defaultdict(list)
self.agg_metrics = collections.defaultdict(list)
@classmethod
def from_taskdict(cls, task_name: str, task):
if isinstance(task, tuple):
group_name, task = task
else:
group_name = None
if not task:
# these get filtered out in get_task_list
# once they are added to group hierarchy
is_group = True
return cls(
task=task, task_name=task_name, is_group=is_group, group_name=group_name
)
version = task.VERSION
task_config = dict(task.dump_config())
if (n_shot := task_config.get("num_fewshot")) == 0:
n_shot = task_config.get("metadata", {}).get("num_fewshot", 0)
task_alias = task_config.get("alias")
group_alias = task_config.get("group_alias")
return cls(
task=task,
task_name=task_name,
task_config=task_config,
group_name=group_name,
version=version,
n_shot=n_shot,
task_alias=task_alias,
group_alias=group_alias,
)
def calculate_aggregate_metric(self, bootstrap_iters=100000) -> None:
for (metric, filter_key), items in self.sample_metrics.items():
agg_fn = self.task.aggregation()[metric]
metric_key = f"{metric},{filter_key}"
self.agg_metrics[metric_key] = agg_fn(items)
self.sample_len = len(items) # TODO: same sample size for each metric?
if bootstrap_iters:
stderr_fn = metrics.stderr_for_metric(
metric=agg_fn,
bootstrap_iters=min(bootstrap_iters, 100)
if metric in ["bleu", "chrf", "ter"]
else bootstrap_iters,
)
self.agg_metrics[f"{metric}_stderr,{filter_key}"] = (
stderr_fn(items) if (stderr_fn and len(items) > 1) else "N/A"
)
def __repr__(self):
return (
f"TaskOutput(task_name={self.task_name}, "
f"group_name={self.group_name}, "
f"version={self.version},"
f"n_shot={self.n_shot}"
f"task_alias={self.task_alias}, group_alias={self.group_alias})"
)
def get_task_list(task_dict: dict) -> Tuple[Dict[str, list], List[TaskOutput]]:
task_hierarchy = collections.defaultdict(list)
outputs = list(TaskOutput.from_taskdict(x, y) for x, y in task_dict.items())
for task_output in outputs:
if group_name := task_output.group_name:
task_hierarchy[group_name].append(task_output.task_name)
else:
task_hierarchy[task_output.task_name] = []
# returns task_hierarchy tracking which groups contain which subtasks,
# and a list of TaskOutput classes for each non-group subtask
return task_hierarchy, [x for x in outputs if x.task]
def print_writeout(task) -> None:
for inst in task.instances:
# print the prompt for the first few documents
if inst.doc_id < 1:
eval_logger.info(
f"Task: {task}; document {inst.doc_id}; context prompt (starting on next line):\
\n{inst.args[0]}\n(end of prompt on previous line)\ntarget string or answer choice index (starting on next line):\n{task.doc_to_target(inst.doc)}\n(end of target on previous line)"
)
eval_logger.info(f"Request: {str(inst)}")
def get_sample_size(task, limit: Optional[int]) -> Union[int, None]:
if limit is not None:
limit = (
int(math.ceil(len(task.eval_docs) * limit)) if limit < 1.0 else int(limit)
)
return limit
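`get_sample_size` treats a fractional limit as a share of the eval docs and an integer as an absolute count; a worked example with a stand-in task object:

```python
from lm_eval.evaluator_utils import get_sample_size


class _FakeTask:                 # stand-in with 2000 eval docs, for illustration only
    eval_docs = range(2000)


print(get_sample_size(_FakeTask(), 0.05))  # ceil(2000 * 0.05) = 100
print(get_sample_size(_FakeTask(), 64))    # 64
print(get_sample_size(_FakeTask(), None))  # None -> no limit
```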
def prepare_print_tasks(
task_hierarchy: dict, results: dict, tab=0
) -> Tuple[dict, dict]:
"""
@param task_hierarchy: Dictionary representing the group hierarchy of tasks. Each key is a group name and its
value is a list of task names.
@param results: Dictionary containing the results of each task. Each key is a
group name and its value is a dictionary of task results.
@param tab: The indentation level for printing the task
hierarchy. Default is 0.
@return: A tuple of two dictionaries: results_agg and groups_agg. results_agg contains
aggregated results for each task, and groups_agg contains aggregated results for each group.
Prepares the task hierarchy and aggregates the results for each task and group recursively for printing.
"""
results_agg = collections.defaultdict(dict)
groups_agg = collections.defaultdict(dict)
(group_name, task_list), *_ = task_hierarchy.items()
task_list = sorted(task_list)
results_agg[group_name] = results[group_name].copy()
# results_agg[group_name]["tab"] = tab
if "samples" in results_agg[group_name]:
results_agg[group_name].pop("samples")
tab_string = " " * tab + "- " if tab > 0 else ""
if "alias" in results_agg[group_name]:
results_agg[group_name]["alias"] = tab_string + results_agg[group_name]["alias"]
else:
results_agg[group_name]["alias"] = tab_string + group_name
if len(task_list) > 0:
groups_agg[group_name] = results[group_name].copy()
# groups_agg[group_name]["tab"] = tab
if "samples" in groups_agg[group_name]:
groups_agg[group_name].pop("samples")
if "alias" in groups_agg[group_name]:
groups_agg[group_name]["alias"] = (
tab_string + groups_agg[group_name]["alias"]
)
else:
groups_agg[group_name]["alias"] = tab_string + group_name
for task_name in task_list:
if task_name in task_hierarchy:
_task_hierarchy = {
**{task_name: task_hierarchy[task_name]},
**task_hierarchy,
}
else:
_task_hierarchy = {
**{task_name: []},
**task_hierarchy,
}
_results_agg, _groups_agg = prepare_print_tasks(
_task_hierarchy, results, tab + 1
)
results_agg = {**results_agg, **_results_agg}
groups_agg = {**groups_agg, **_groups_agg}
return results_agg, groups_agg
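# Hypothetical usage of prepare_print_tasks with a toy hierarchy and toy results; the
# group/task names and metric values are invented only to show how aliases are indented
# and how group rows are split from task rows.
_toy_hierarchy = {"toy_group": ["toy_task_a", "toy_task_b"]}
_toy_results = {
    "toy_group": {"acc,none": 0.50, "samples": 20},
    "toy_task_a": {"acc,none": 0.40, "samples": 10},
    "toy_task_b": {"acc,none": 0.60, "samples": 10},
}
_res, _grp = prepare_print_tasks(_toy_hierarchy, _toy_results)
# _res["toy_task_a"]["alias"] == " - toy_task_a"   (subtasks are indented one level)
# _grp contains only the aggregated "toy_group" row; "samples" keys are dropped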
def consolidate_results(
eval_tasks: List[TaskOutput],
) -> Tuple[dict, dict, dict, dict, dict]:
"""
@param eval_tasks: list(TaskOutput).
@return: A tuple containing the consolidated results, samples, configs, versions, and num_fewshot.
Consolidates the results of multiple evaluation tasks into a single structure.
The method iterates over each evaluation instance and extracts relevant information to create the consolidated
results structure. The consolidated results structure has the following properties:
- results: A defaultdict with task names as keys and dictionaries as values. Each dictionary contains
metric/filter pairs as keys and corresponding metric values as values. The "alias" key is used to store task
aliases specified in the task configuration.
- samples: A defaultdict with task names as keys and lists of log samples as values.
- configs: A defaultdict with task names as keys and task configurations as values.
- versions: A defaultdict with task names as keys and task versions as values.
- num_fewshot: A defaultdict with task names as keys and number of few-shot samples as values.
The method then returns the consolidated results, samples, configs, versions, and num_fewshot as a tuple.
"""
# stores the final result for each task, for each metric/filter pair.
results = collections.defaultdict(dict)
# logs info about each document evaluated.
samples = collections.defaultdict(list)
# store num-fewshot value per task
num_fewshot = collections.defaultdict(int)
# Tracks the YAML configs of all chosen task
configs = collections.defaultdict(dict)
# Tracks each task's version.
versions = collections.defaultdict(dict)
for task_output in eval_tasks:
if "task_alias" in (task_config := task_output.task_config):
results[task_output.task_name]["alias"] = task_config["task_alias"]
if group_alias := task_output.group_alias:
if group_alias not in results and (group_name := task_output.group_name):
results[group_name]["alias"] = group_alias
num_fewshot[task_output.task_name] = task_output.n_shot
configs[task_output.task_name] = task_output.task_config
versions[task_output.task_name] = task_output.version
samples[task_output.task_name] = task_output.logged_samples
for (metric, filter_key), items in task_output.sample_metrics.items():
metric_key = f"{metric},{filter_key}"
results[task_output.task_name][metric_key] = task_output.agg_metrics[
metric_key
]
results[task_output.task_name]["samples"] = task_output.sample_len
results[task_output.task_name][
f"{metric}_stderr,{filter_key}"
] = task_output.agg_metrics[f"{metric}_stderr,{filter_key}"]
return results, samples, configs, versions, num_fewshot
@positional_deprecated
def find_test_root(start_path: pathlib.Path) -> pathlib.Path:
"""
Search upward in the directory tree to a maximum of three layers
to find and return the package root (containing the 'tests' folder)
"""
cur_path = start_path.resolve()
max_layers = 3
for _ in range(max_layers):
if (cur_path / "tests" / "test_version_stable.py").exists():
return cur_path
else:
cur_path = cur_path.parent.resolve()
raise FileNotFoundError(
    f"Unable to find package root within {max_layers} layers upward of {start_path}"
)
@positional_deprecated
def run_task_tests(task_list: List[str]):
"""
Find the package root and run the tests for the given tasks
"""
import pytest
package_root = find_test_root(start_path=pathlib.Path(__file__))
task_string = " or ".join(task_list)
args = [
f"{package_root}/tests/test_version_stable.py",
f"--rootdir={package_root}",
"-k",
f"{task_string}",
]
sys.path.append(str(package_root))
pytest_return_val = pytest.main(args)
if pytest_return_val:
raise ValueError(
f"Not all tests for the specified tasks ({task_list}) ran successfully! Error code: {pytest_return_val}"
)
-from typing import List, Union
 from functools import partial
+from typing import List, Union
 from lm_eval.api.filter import FilterEnsemble
-from . import selection
-from . import extraction
-from . import transformation
+from . import extraction, selection, transformation
 FILTER_REGISTRY = {
...
...
@@ -7,7 +7,10 @@ class RegexFilter(Filter):
     """ """
 
     def __init__(
-        self, regex_pattern: str = r"#### (\-?[0-9\.\,]+)", fallback: str = "[invalid]"
+        self,
+        regex_pattern: str = r"#### (\-?[0-9\.\,]+)",
+        group_select=0,
+        fallback: str = "[invalid]",
     ) -> None:
         """
         pass a string `regex` to run `re.compile(r"regex")` on.
@@ -15,6 +18,7 @@ class RegexFilter(Filter):
         """
         self.regex_pattern = regex_pattern
         self.regex = re.compile(regex_pattern)
+        self.group_select = group_select
         self.fallback = fallback
 
     def apply(self, resps, docs):
@@ -25,9 +29,12 @@ class RegexFilter(Filter):
         def filter_set(inst):
             filtered = []
             for resp in inst:
-                match = self.regex.search(resp)
+                match = self.regex.findall(resp)
                 if match:
-                    match = match.group(1).strip()
+                    match = match[self.group_select]
+                    if isinstance(match, tuple):
+                        match = [m for m in match if m][0]
+                    match = match.strip()
                 else:
                     match = self.fallback
                 filtered.append(match)
...
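# A minimal sketch (separate from the diff above) of what the new `group_select`
# knob enables: `findall` collects every match in a response and `group_select`
# indexes into that list, e.g. -1 to take the last number rather than the first.
# The pattern and response text are made-up examples.
import re

_pattern = re.compile(r"-?\d+")
_resp = "First guess: 12 ... final answer: 15"
_matches = _pattern.findall(_resp)        # ['12', '15']
_match = _matches[-1]                     # group_select = -1 -> last match
if isinstance(_match, tuple):             # patterns with several groups yield tuples
    _match = [m for m in _match if m][0]
print(_match.strip())                     # '15'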
import copy
import json
import logging
import os
import re
import subprocess
from pathlib import Path
from typing import Any, Dict, List, Literal, Optional, Tuple, Union
import numpy as np
import pandas as pd
from packaging.version import Version
from torch.utils.collect_env import get_pretty_env_info
from transformers import __version__ as trans_version
from lm_eval.utils import simple_parse_args_string
logger = logging.getLogger(__name__)
try:
import wandb
assert Version(wandb.__version__) >= Version("0.13.6")
if Version(wandb.__version__) < Version("0.13.6"):
wandb.require("report-editing:v0")
except Exception as e:
logger.warning(
"To use the wandb reporting functionality please install wandb>=0.13.6.\n"
"To install the latest version of wandb run `pip install wandb --upgrade`\n"
f"{e}"
)
def remove_none_pattern(input_string: str) -> Tuple[str, bool]:
"""Remove the ',none' substring from the input_string if it exists at the end.
Args:
input_string (str): The input string from which to remove the ',none' substring.
Returns:
Tuple[str, bool]: A tuple containing the modified input_string with the ',none' substring removed
and a boolean indicating whether the modification was made (True) or not (False).
"""
# Define the pattern to match ',none' at the end of the string
pattern = re.compile(r",none$")
# Use sub() to replace ',none' with an empty string
result = re.sub(pattern, "", input_string)
# check if the input_string changed
removed = result != input_string
return result, removed
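# Two quick checks of remove_none_pattern; "acc,none" is the typical shape of a
# metric key produced when no filter name is attached.
assert remove_none_pattern("acc,none") == ("acc", True)
assert remove_none_pattern("acc_stderr,flexible-extract") == ("acc_stderr,flexible-extract", False)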
def _handle_non_serializable(o: Any) -> Union[int, str, list]:
"""Handle non-serializable objects by converting them to serializable types.
Args:
o (Any): The object to be handled.
Returns:
Union[int, str, list]: The converted object. If the object is of type np.int64 or np.int32,
it will be converted to int. If the object is of type set, it will be converted
to a list. Otherwise, it will be converted to str.
"""
if isinstance(o, np.int64) or isinstance(o, np.int32):
return int(o)
elif isinstance(o, set):
return list(o)
else:
return str(o)
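# Sketch of why the helper above exists: numpy integers and Python sets are not
# JSON-serializable, so json.dumps is handed this function via `default`. The
# payload below is an invented example.
import json

import numpy as np

_payload = {"doc_id": np.int64(3), "tags": {"few-shot", "en"}}
print(json.dumps(_payload, default=_handle_non_serializable))
# e.g. {"doc_id": 3, "tags": ["few-shot", "en"]}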
def get_wandb_printer() -> Literal["Printer"]:
"""Returns a wandb printer instance for pretty stdout."""
from wandb.sdk.lib.printer import get_printer
from wandb.sdk.wandb_settings import Settings
printer = get_printer(Settings()._jupyter)
return printer
class WandbLogger:
def __init__(self, args: Any) -> None:
"""Initialize the WandbLogger.
Args:
results (Dict[str, Any]): The results dictionary.
args (Any): Arguments for configuration.
"""
self.wandb_args: Dict[str, Any] = simple_parse_args_string(args.wandb_args)
# initialize a W&B run
if wandb.run is None:
self.run = wandb.init(**self.wandb_args)
else:
self.run = wandb.run
self.printer = get_wandb_printer()
def post_init(self, results: Dict[str, Any]) -> None:
self.results: Dict[str, Any] = copy.deepcopy(results)
self.task_names: List[str] = list(results.get("results", {}).keys())
self.group_names: List[str] = list(results.get("groups", {}).keys())
def _get_config(self) -> Dict[str, Any]:
"""Get configuration parameters."""
self.task_configs = self.results.get("configs", {})
cli_configs = self.results.get("config", {})
configs = {
"task_configs": self.task_configs,
"cli_configs": cli_configs,
}
return configs
def _sanitize_results_dict(self) -> Tuple[Dict[str, str], Dict[str, Any]]:
"""Sanitize the results dictionary."""
_results = copy.deepcopy(self.results.get("results", dict()))
# Remove None from the metric string name
tmp_results = copy.deepcopy(_results)
for task_name in self.task_names:
task_result = tmp_results.get(task_name, dict())
for metric_name, metric_value in task_result.items():
_metric_name, removed = remove_none_pattern(metric_name)
if removed:
_results[task_name][_metric_name] = metric_value
_results[task_name].pop(metric_name)
# remove string valued keys from the results dict
wandb_summary = {}
for task in self.task_names:
task_result = _results.get(task, dict())
for metric_name, metric_value in task_result.items():
if isinstance(metric_value, str):
wandb_summary[f"{task}/{metric_name}"] = metric_value
for summary_metric, summary_value in wandb_summary.items():
_task, _summary_metric = summary_metric.split("/")
_results[_task].pop(_summary_metric)
tmp_results = copy.deepcopy(_results)
for task_name, task_results in tmp_results.items():
for metric_name, metric_value in task_results.items():
_results[f"{task_name}/{metric_name}"] = metric_value
_results[task_name].pop(metric_name)
for task in self.task_names:
_results.pop(task)
return wandb_summary, _results
def _log_results_as_table(self) -> None:
"""Generate and log evaluation results as a table to W&B."""
columns = [
"Version",
"Filter",
"num_fewshot",
"Metric",
"Value",
"Stderr",
]
def make_table(columns: List[str], key: str = "results"):
table = wandb.Table(columns=columns)
results = copy.deepcopy(self.results)
for k, dic in results.get(key).items():
if k in self.group_names and not key == "groups":
continue
version = results.get("versions").get(k)
if version == "N/A":
version = None
n = results.get("n-shot").get(k)
for (mf), v in dic.items():
m, _, f = mf.partition(",")
if m.endswith("_stderr"):
continue
if m == "alias":
continue
if m + "_stderr" + "," + f in dic:
se = dic[m + "_stderr" + "," + f]
if se != "N/A":
se = "%.4f" % se
table.add_data(*[k, version, f, n, m, str(v), str(se)])
else:
table.add_data(*[k, version, f, n, m, str(v), ""])
return table
# log the complete eval result to W&B Table
table = make_table(["Tasks"] + columns, "results")
self.run.log({"evaluation/eval_results": table})
if "groups" in self.results.keys():
table = make_table(["Groups"] + columns, "groups")
self.run.log({"evaluation/group_eval_results": table})
def _log_results_as_artifact(self) -> None:
"""Log results as JSON artifact to W&B."""
dumped = json.dumps(
self.results, indent=2, default=_handle_non_serializable, ensure_ascii=False
)
artifact = wandb.Artifact("results", type="eval_results")
with artifact.new_file("results.json", mode="w", encoding="utf-8") as f:
f.write(dumped)
self.run.log_artifact(artifact)
def log_eval_result(self) -> None:
"""Log evaluation results to W&B."""
# Log configs to wandb
configs = self._get_config()
self.run.config.update(configs)
wandb_summary, self.wandb_results = self._sanitize_results_dict()
# update wandb.run.summary with items that were removed
self.run.summary.update(wandb_summary)
# Log the evaluation metrics to wandb
self.run.log(self.wandb_results)
# Log the evaluation metrics as W&B Table
self._log_results_as_table()
# Log the results dict as json to W&B Artifacts
self._log_results_as_artifact()
def _generate_dataset(
self, data: List[Dict[str, Any]], config: Dict[str, Any]
) -> pd.DataFrame:
"""Generate a dataset from evaluation data.
Args:
data (List[Dict[str, Any]]): The data to generate a dataset for.
config (Dict[str, Any]): The configuration of the task.
Returns:
pd.DataFrame: A dataframe that is ready to be uploaded to W&B.
"""
ids = [x["doc_id"] for x in data]
labels = [x["target"] for x in data]
instance = [""] * len(ids)
resps = [""] * len(ids)
filtered_resps = [""] * len(ids)
model_outputs = {}
metrics_list = config["metric_list"]
metrics = {}
for metric in metrics_list:
metric = metric.get("metric")
if metric in ["word_perplexity", "byte_perplexity", "bits_per_byte"]:
metrics[f"{metric}_loglikelihood"] = [x[metric][0] for x in data]
if metric in ["byte_perplexity", "bits_per_byte"]:
metrics[f"{metric}_bytes"] = [x[metric][1] for x in data]
else:
metrics[f"{metric}_words"] = [x[metric][1] for x in data]
else:
metrics[metric] = [x[metric] for x in data]
if config["output_type"] == "loglikelihood":
instance = [x["arguments"][0][0] for x in data]
labels = [x["arguments"][0][1] for x in data]
resps = [
f'log probability of continuation is {x["resps"][0][0][0]} '
+ "\n\n"
+ "continuation will {} generated with greedy sampling".format(
"not be" if not x["resps"][0][0][1] else "be"
)
for x in data
]
filtered_resps = [
f'log probability of continuation is {x["filtered_resps"][0][0]} '
+ "\n\n"
+ "continuation will {} generated with greedy sampling".format(
"not be" if not x["filtered_resps"][0][1] else "be"
)
for x in data
]
elif config["output_type"] == "multiple_choice":
instance = [x["arguments"][0][0] for x in data]
choices = [
"\n".join([f"{idx}. {y[1]}" for idx, y in enumerate(x["arguments"])])
for x in data
]
resps = [np.argmax([n[0][0] for n in x["resps"]]) for x in data]
filtered_resps = [
np.argmax([n[0] for n in x["filtered_resps"]]) for x in data
]
elif config["output_type"] == "loglikelihood_rolling":
instance = [x["arguments"][0][0] for x in data]
resps = [x["resps"][0][0] for x in data]
filtered_resps = [x["filtered_resps"][0] for x in data]
elif config["output_type"] == "generate_until":
instance = [x["arguments"][0][0] for x in data]
resps = [x["resps"][0][0] for x in data]
filtered_resps = [x["filtered_resps"][0] for x in data]
model_outputs["raw_predictions"] = resps
model_outputs["filtered_predictions"] = filtered_resps
df_data = {
"id": ids,
"data": instance,
}
if config["output_type"] == "multiple_choice":
df_data["choices"] = choices
tmp_data = {
"input_len": [len(x) for x in instance],
"labels": labels,
"output_type": config["output_type"],
}
df_data.update(tmp_data)
df_data.update(model_outputs)
df_data.update(metrics)
return pd.DataFrame(df_data)
def _log_samples_as_artifact(
self, data: List[Dict[str, Any]], task_name: str
) -> None:
# log the samples as an artifact
dumped = json.dumps(
data,
indent=2,
default=_handle_non_serializable,
ensure_ascii=False,
)
artifact = wandb.Artifact(f"{task_name}", type="samples_by_task")
with artifact.new_file(
f"{task_name}_eval_samples.json", mode="w", encoding="utf-8"
) as f:
f.write(dumped)
self.run.log_artifact(artifact)
# artifact.wait()
def log_eval_samples(self, samples: Dict[str, List[Dict[str, Any]]]) -> None:
"""Log evaluation samples to W&B.
Args:
samples (Dict[str, List[Dict[str, Any]]]): Evaluation samples for each task.
"""
task_names: List[str] = [
x for x in self.task_names if x not in self.group_names
]
ungrouped_tasks = []
tasks_by_groups = {}
for task_name in task_names:
group_names = self.task_configs[task_name].get("group", None)
if group_names:
if isinstance(group_names, str):
group_names = [group_names]
for group_name in group_names:
if not tasks_by_groups.get(group_name):
tasks_by_groups[group_name] = [task_name]
else:
tasks_by_groups[group_name].append(task_name)
else:
ungrouped_tasks.append(task_name)
for task_name in ungrouped_tasks:
eval_preds = samples[task_name]
# log the samples as a W&B Table
df = self._generate_dataset(eval_preds, self.task_configs.get(task_name))
self.run.log({f"{task_name}_eval_results": df})
# log the samples as a json file as W&B Artifact
self._log_samples_as_artifact(eval_preds, task_name)
for group, grouped_tasks in tasks_by_groups.items():
grouped_df = pd.DataFrame()
for task_name in grouped_tasks:
eval_preds = samples[task_name]
df = self._generate_dataset(
eval_preds, self.task_configs.get(task_name)
)
df["group"] = group
df["task"] = task_name
grouped_df = pd.concat([grouped_df, df], ignore_index=True)
# log the samples as a json file as W&B Artifact
self._log_samples_as_artifact(eval_preds, task_name)
self.run.log({f"{group}_eval_results": grouped_df})
def get_commit_from_path(repo_path: Path) -> Optional[str]:
git_folder = Path(repo_path, ".git")
if git_folder.is_file():
git_folder = Path(
git_folder.parent,
git_folder.read_text(encoding="utf-8").split("\n")[0].split(" ")[-1],
)
if Path(git_folder, "HEAD").exists():
head_name = (
Path(git_folder, "HEAD")
.read_text(encoding="utf-8")
.split("\n")[0]
.split(" ")[-1]
)
head_ref = Path(git_folder, head_name)
git_hash = head_ref.read_text(encoding="utf-8").replace("\n", "")
else:
git_hash = None
return git_hash
def get_git_commit_hash():
"""
Gets the git commit hash of your current repo (if it exists).
Source: https://github.com/EleutherAI/gpt-neox/blob/b608043be541602170bfcfb8ec9bf85e8a0799e0/megatron/neox_arguments/neox_args.py#L42
"""
try:
git_hash = subprocess.check_output(["git", "describe", "--always"]).strip()
git_hash = git_hash.decode()
except (subprocess.CalledProcessError, FileNotFoundError):
# FileNotFoundError occurs when git not installed on system
git_hash = get_commit_from_path(os.getcwd()) # git hash of repo if exists
return git_hash
def add_env_info(storage: Dict[str, Any]):
try:
pretty_env_info = get_pretty_env_info()
except Exception as err:
pretty_env_info = str(err)
transformers_version = trans_version
upper_dir_commit = get_commit_from_path(
Path(os.getcwd(), "..")
) # git hash of upper repo if exists
added_info = {
"pretty_env_info": pretty_env_info,
"transformers_version": transformers_version,
"upper_git_hash": upper_dir_commit, # in case this repo is submodule
}
storage.update(added_info)
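# Minimal usage sketch: add_env_info mutates the given dict in place, attaching
# environment metadata alongside whatever the caller already stored there.
_meta = {"git_hash": get_git_commit_hash()}
add_env_info(_meta)
print(sorted(_meta))
# ['git_hash', 'pretty_env_info', 'transformers_version', 'upper_git_hash']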
-from . import huggingface
-from . import openai_completions
-from . import textsynth
-from . import dummy
-from . import anthropic_llms
-from . import gguf
-from . import vllm_causallms
-from . import mamba_lm
-from . import optimum_lm
+from . import (
+    anthropic_llms,
+    dummy,
+    gguf,
+    huggingface,
+    mamba_lm,
+    neuron_optimum,
+    openai_completions,
+    optimum_lm,
+    textsynth,
+    vllm_causallms,
+)
 # TODO: implement __all__
try:
# enable hf hub transfer if available
import hf_transfer # type: ignore # noqa
import huggingface_hub.constants # type: ignore
huggingface_hub.constants.HF_HUB_ENABLE_HF_TRANSFER = True
except ImportError:
pass
...
@@ -5,7 +5,7 @@ from tqdm import tqdm
 
 from lm_eval import utils
 from lm_eval.api.model import LM
 from lm_eval.api.registry import register_model
-from lm_eval.utils import retry_on_specific_exceptions
+from lm_eval.models.utils import retry_on_specific_exceptions
 
 eval_logger = utils.eval_logger
...