Commit b13753cd authored by haileyschoelkopf

Merge branch 'main' into fix-task-table

parents 8ea9c59d 5c25dd55
......@@ -52,7 +52,7 @@ jobs:
# pip install bleurt@https://github.com/google-research/bleurt/archive/b610120347ef22b494b6d69b4316e303f5932516.zip#egg=bleurt
# if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
- name: Test with pytest
run: python -m pytest --showlocals -s -vv -n=auto --ignore=tests/tests_master --ignore=tests/extra
run: python -m pytest --showlocals -s -vv -n=auto
- name: Archive artifacts
uses: actions/upload-artifact@v3
with:
......
@software{eval-harness,
author = {Gao, Leo and
Tow, Jonathan and
Biderman, Stella and
Black, Sid and
DiPofi, Anthony and
Foster, Charles and
Golding, Laurence and
Hsu, Jeffrey and
McDonell, Kyle and
Muennighoff, Niklas and
Phang, Jason and
Reynolds, Laria and
Tang, Eric and
Thite, Anish and
Wang, Ben and
Wang, Kevin and
Zou, Andy},
@misc{eval-harness,
author = {Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang and Tang, Eric and Thite, Anish and Wang, Ben and Wang, Kevin and Zou, Andy},
title = {A framework for few-shot language model evaluation},
month = sep,
year = 2021,
month = 12,
year = 2023,
publisher = {Zenodo},
version = {v0.0.1},
doi = {10.5281/zenodo.5371628},
url = {https://doi.org/10.5281/zenodo.5371628}
version = {v0.4.0},
doi = {10.5281/zenodo.10256836},
url = {https://zenodo.org/records/10256836}
}
......@@ -34,7 +34,7 @@ This project provides a unified framework to test generative language models on
- Evaluation with publicly available prompts ensures reproducibility and comparability between papers.
- Easy support for custom prompts and evaluation metrics.
The Language Model Evaluation Harness is the backend for 🤗 Hugging Face's popular [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), has been used in [hundreds of papers](https://scholar.google.com/scholar?oi=bibs&hl=en&authuser=2&cites=15052937328817631261,4097184744846514103,17476825572045927382,18443729326628441434,12854182577605049984) is used internally by dozens of companies including NVIDIA, Cohere, Nous Research, Booz Allen Hamilton, and Mosaic ML.
The Language Model Evaluation Harness is the backend for 🤗 Hugging Face's popular [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), has been used in [hundreds of papers](https://scholar.google.com/scholar?oi=bibs&hl=en&authuser=2&cites=15052937328817631261,4097184744846514103,1520777361382155671,17476825572045927382,18443729326628441434,14801318227356878622,7890865700763267262,12854182577605049984,15641002901115500560,5104500764547628290), and is used internally by dozens of organizations including NVIDIA, Cohere, BigScience, BigCode, Nous Research, and Mosaic ML.
## Install
......@@ -109,33 +109,45 @@ The full list of supported arguments are provided [here](./docs/interface.md), a
#### Multi-GPU Evaluation with Hugging Face `accelerate`
To parallelize evaluation of HuggingFace models across multiple GPUs, we leverage the [accelerate 🚀](https://github.com/huggingface/accelerate) library as follows:
We support two main ways of using Hugging Face's [accelerate 🚀](https://github.com/huggingface/accelerate) library for multi-GPU evaluation.
To perform *data-parallel evaluation* (where each GPU loads a **separate full copy** of the model), we leverage the `accelerate` launcher as follows:
```
accelerate launch -m lm_eval --model hf \
--tasks lambada_openai,arc_easy \
--batch_size 16
```
(or via `accelerate launch --no_python lm_eval`).
For cases where your model can fit on a single GPU, this allows you to evaluate on K GPUs K times faster than on one.
This will perform *data-parallel evaluation*: that is, placing a **single full copy** of your model onto each available GPU and *splitting batches across GPUs* to evaluate on K GPUs K times faster than on one.
**WARNING**: This setup does not work with FSDP model sharding, so in `accelerate config` FSDP must be disabled, or the NO_SHARD FSDP option must be used.
If your model is *too large to be run on a single one of your GPUs*, then you can use `accelerate` with Fully Sharded Data Parallel (FSDP), which splits the weights of the model across your data parallel ranks. To enable this, ensure you select `YES` when asked `Do you want to use FullyShardedDataParallel?` when running `accelerate config`. To enable memory-efficient loading, select `YES` when asked `Do you want each individually wrapped FSDP unit to broadcast module parameters from rank 0 at the start?`. This ensures only the rank 0 process loads the model and then broadcasts the parameters to the other ranks, instead of having each rank load all parameters, which can lead to large RAM usage spikes around the start of the script that may cause errors.
The second way of using `accelerate` for multi-GPU evaluation is when your model is *too large to fit on a single GPU.*
To pass even more advanced keyword arguments to `accelerate`, we allow for the following arguments as well:
In this setting, run the library *outside of the `accelerate` launcher*, but passing `parallelize=True` to `--model_args` as follows:
```
lm_eval --model hf \
--tasks lambada_openai,arc_easy \
--model_args parallelize=True \
--batch_size 16
```
This means that your model's weights will be split across all available GPUs.
For more advanced users or even larger models, we allow for the following arguments when `parallelize=True` as well (an example command follows this list):
- `device_map_option`: How to split model weights across available GPUs. Defaults to `"auto"`.
- `max_memory_per_gpu`: The max GPU memory to use per GPU when loading the model.
- `max_cpu_memory`: The max amount of CPU memory to use when offloading the model weights to RAM.
- `offload_folder`: A folder where model weights will be offloaded to disk if needed.
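For example, a sketch of a run that caps per-GPU memory and offloads any overflow weights to disk might look like the following; the memory value, offload path, and model name are placeholders, and the memory string follows Hugging Face `accelerate`'s `max_memory` convention:

```bash
lm_eval --model hf \
    --model_args pretrained=model-name-or-path,parallelize=True,max_memory_per_gpu=20GiB,offload_folder=./offload \
    --tasks lambada_openai,arc_easy \
    --batch_size 16
```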
To use `accelerate` with the `lm-eval` command, use
```
accelerate launch --no_python lm-eval --model ...
```
These two options (`accelerate launch` and `parallelize=True`) are mutually exclusive.
### Tensor + Data Parallel and Optimized Inference with `vLLM`
We also support vLLM for faster inference on [supported model types](https://docs.vllm.ai/en/latest/models/supported_models.html). For single-GPU or multi-GPU — tensor parallel, data parallel, or a combination of both — inference, for example:
We also support vLLM for faster inference on [supported model types](https://docs.vllm.ai/en/latest/models/supported_models.html), with especially large speedups when splitting a model across multiple GPUs. For single-GPU or multi-GPU inference (tensor parallel, data parallel, or a combination of both), for example:
```bash
lm_eval --model vllm \
......@@ -160,7 +172,7 @@ lm_eval --model openai-completions \
--tasks lambada_openai,hellaswag
```
We also support using your own local inference server with an implemented version of the OpenAI ChatCompletions endpoint and passing trained HuggingFace artifacts and tokenizers.
We also support using your own local inference server via any server that mirrors the OpenAI Completions or ChatCompletions API.
```bash
lm_eval --model local-chat-completions --tasks gsm8k --model_args model=facebook/opt-125m,base_url=http://{yourip}:8000/v1
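
# A server mirroring the OpenAI Completions API can be targeted analogously via `local-completions`.
# This is a sketch: the model name and address are the same placeholders as above, and
# `tokenizer_backend=huggingface` selects an HF tokenizer instead of the default tiktoken one.
lm_eval --model local-completions --tasks gsm8k --model_args model=facebook/opt-125m,base_url=http://{yourip}:8000/v1,tokenizer_backend=huggingface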
......@@ -169,7 +181,7 @@ Note that for externally hosted models, configs such as `--device` and `--batch_
| API or Inference Server | Implemented? | `--model <xxx>` name | Models supported: | Request Types: |
|---------------------------------------------------------------------------------------------------------------------------|---------------------------------|---------------------------------------------------------------------|-----------------------------------------------------------------------------------------------|------------------------------------------------------------|
| OpenAI Completions | :heavy_check_mark: | `openai-completions` | up to `code-davinci-002` | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| OpenAI Completions | :heavy_check_mark: | `openai-completions`, `local-completions` | All OpenAI Completions API models | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| OpenAI ChatCompletions | :heavy_check_mark: | `openai-chat-completions`, `local-chat-completions` | [All ChatCompletions API models](https://platform.openai.com/docs/guides/gpt) | `generate_until` (no logprobs) |
| Anthropic | :heavy_check_mark: | `anthropic` | [Supported Anthropic Engines](https://docs.anthropic.com/claude/reference/selecting-a-model) | `generate_until` (no logprobs) |
| Textsynth | :heavy_check_mark: | `textsynth` | [All supported engines](https://textsynth.com/documentation.html#engines) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
......@@ -177,14 +189,19 @@ Note that for externally hosted models, configs such as `--device` and `--batch_
| [Llama.cpp](https://github.com/ggerganov/llama.cpp) (via [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)) | :heavy_check_mark: | `gguf`, `ggml` | [All models supported by llama.cpp](https://github.com/ggerganov/llama.cpp) | `generate_until`, `loglikelihood`, (perplexity evaluation not yet implemented) |
| vLLM | :heavy_check_mark: | `vllm` | [Most HF Causal Language Models](https://docs.vllm.ai/en/latest/models/supported_models.html) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| Mamba | :heavy_check_mark: | `mamba_ssm` | [Mamba architecture Language Models via the `mamba_ssm` package](https://huggingface.co/state-spaces) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| Your local inference server! | :heavy_check_mark: | `local-chat-completions` (using `openai-chat-completions` model type) | Any server address that accepts GET requests using HF models and mirrors OpenAI's ChatCompletions interface | `generate_until` | | ... |
| Your local inference server! | :heavy_check_mark: | `local-completions` or `local-chat-completions` (using `openai-chat-completions` model type) | Any server address that accepts GET requests using HF models and mirrors OpenAI's ChatCompletions interface | `generate_until` | | ... |
| Your local inference server! | :heavy_check_mark: | `local-completions` (using `openai-completions` model type) | Any server address that accepts GET requests using HF models and mirrors OpenAI's Completions interface | `generate_until` | | ... |
Models which do not supply logits or logprobs can be used with tasks of type `generate_until` only, while models that are local or APIs that supply logprobs/logits can be run on all task types: `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`.
It is on our roadmap to create task variants designed to enable models which do not serve logprobs/loglikelihoods to be compared with the generation performance of open-source models.
For more information on the different task `output_types` and model request types, see [our documentation](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/model_guide.md#interface).
### Other Frameworks
A number of other libraries contain scripts for calling the eval harness through their library. These include [GPT-NeoX](https://github.com/EleutherAI/gpt-neox/blob/main/eval_tasks/eval_adapter.py), [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples/MoE/readme_evalharness.md), and [mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/blob/master/eval_harness.py).
To create your own custom integration you can follow instructions from [this tutorial](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md#external-library-usage).
### Additional Features
If you have a Metal compatible Mac, you can run the eval harness using the MPS back-end by replacing `--device cuda:0` with `--device mps` (requires PyTorch version 2.1 or higher).
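For example (a minimal sketch; the model name is a placeholder):

```bash
lm_eval --model hf \
    --model_args pretrained=model-name-or-path \
    --tasks hellaswag \
    --device mps
```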
......@@ -219,11 +236,11 @@ lm_eval --model hf \
--device cuda:0
```
[GPTQ](https://github.com/PanQiWei/AutoGPTQ) quantized models can be loaded by specifying their file names in `,gptq=NAME` (or `,gptq=True` for default names) in the `model_args` argument:
[GPTQ](https://github.com/PanQiWei/AutoGPTQ) quantized models can be loaded by specifying their file names in `,autogptq=NAME` (or `,autogptq=True` for default names) in the `model_args` argument:
```bash
lm_eval --model hf \
--model_args pretrained=model-name-or-path,gptq=model.safetensors,gptq_use_triton=True \
--model_args pretrained=model-name-or-path,autogptq=model.safetensors,gptq_use_triton=True \
--tasks hellaswag
```
......@@ -239,7 +256,7 @@ For a full list of supported arguments, check out the [interface](https://github
You can use [Zeno](https://zenoml.com) to visualize the results of your eval harness runs.
First, head to [hub.zenoml.com](hub.zenoml.com) to create an account and get an API key [on your account page](hub.zenoml.com/account).
First, head to [hub.zenoml.com](https://hub.zenoml.com) to create an account and get an API key [on your account page](https://hub.zenoml.com/account).
Add this key as an environment variable:
```bash
......
......@@ -46,16 +46,6 @@ dataset_name: ... # the dataset configuration to use. Leave `null` if your datas
dataset_kwargs: null # any extra keyword arguments that should be passed to the dataset constructor, e.g. `data_dir`.
```
------------------------------
**Tip:** To load a local dataset for evaluation, you can specify data files in the `dataset_kwargs` field, such as the following for JSON files:
```
dataset_path: json
dataset_name: null
dataset_kwargs:
data_files: /path/to/my/json
```
-------------------------------
Next, we'd like to tell our task what the dataset's train, validation, and test splits are named, if they exist:
```yaml
......@@ -99,6 +89,36 @@ Now, in our YAML config file we'll use the `!function` constructor, and tell the
process_docs: !function utils.process_docs
```
### Using Local Datasets
To load a local dataset for evaluation, you can specify data files in the `dataset_kwargs` field, such as the following for JSON files:
```
dataset_path: json
dataset_name: null
dataset_kwargs:
data_files: /path/to/my/json
```
Or with files already split into separate directories:
```
dataset_path: arrow
dataset_kwargs:
data_files:
train: /path/to/arrow/train/data-00000-of-00001.arrow
validation: /path/to/arrow/validation/data-00000-of-00001.arrow
```
Alternatively, if you have previously downloaded a dataset from the Hugging Face Hub (using `save_to_disk()`) and wish to use the local files, you will need to set `data_dir` under `dataset_kwargs` to point to the directory.
```
dataset_path: hellaswag
dataset_kwargs:
data_dir: hellaswag_local/
```
You can also set `dataset_path` as a directory path in your local system. This will assume that there is a loading script with the same name as the directory. [See datasets docs](https://huggingface.co/docs/datasets/loading#local-loading-script).
## Writing a Prompt Template
The next thing we need to do is decide what format to use when presenting the data to the LM. This is our **prompt**, where we'll define both an input and output format.
......
......@@ -4,7 +4,7 @@ The `lm-evaluation-harness` is meant to be an extensible and flexible framework
These YAML configuration files, along with the current codebase commit hash, are intended to be shareable such that providing the YAML config enables another researcher to precisely replicate the evaluation setup used by another, in the case that the prompt or setup differs from standard `lm-eval` task implementations.
While adding a standard evaluation task on a new dataset can be occasionally as simple as swapping out a Hugging Face dataset path in an existing file, more specialized evaluation setups. Here we'll provide a crash course on the more advanced logic implementable in YAML form available to users.
While adding a standard evaluation task on a new dataset can occasionally be as simple as swapping out a Hugging Face dataset path in an existing file, more specialized evaluation setups also exist. Here we'll provide a crash course on the more advanced logic available to users in YAML form.
If your intended task relies on features beyond what are described in this guide, we'd love to hear about it! Feel free to open an issue describing the scenario on Github, create a PR to the project with a proposed implementation, or ask in the `#lm-thunderdome` channel on the EleutherAI discord.
......@@ -301,6 +301,23 @@ task:
- hendrycksTest*
```
It is also possible to list an existing task in your benchmark configuration with some adjustments. For example, a few tasks from MMLU are included in `multimedqa`. There, the `task_alias` and `group_alias` (see [here](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#beautifying-table-display) for more details) are modified to suit the benchmark.
```yaml
group: multimedqa
task:
- pubmedqa
- medmcqa
- medqa_4options
- task: mmlu_anatomy
task_alias: "anatomy (mmlu)"
group_alias: null
- task: mmlu_clinical_knowledge
task_alias: "clinical_knowledge (mmlu)"
group_alias: null
...
```
Alternatively, benchmarks can include tasks whose configuration is customized inline; these are defined the same way a YAML task is usually written.
```yaml
......@@ -363,4 +380,4 @@ task:
ignore_punctuation: true
```
Calling the benchmark is done the same way we would call any task with `--tasks`. Benchmarks can be added in `lm_eval/benchmarks/`
Calling the benchmark is done the same way we would call any task with `--tasks`. Benchmarks can be added in `lm_eval/tasks/benchmarks/`
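For example, assuming the `multimedqa` grouping shown earlier has been registered, a sketch of an invocation could look like this (model name and batch size are placeholders):

```bash
lm_eval --model hf \
    --model_args pretrained=model-name-or-path \
    --tasks multimedqa \
    --batch_size 8
```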
......@@ -248,7 +248,9 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
if results is not None:
if args.log_samples:
samples = results.pop("samples")
dumped = json.dumps(results, indent=2, default=_handle_non_serializable)
dumped = json.dumps(
results, indent=2, default=_handle_non_serializable, ensure_ascii=False
)
if args.show_config:
print(dumped)
......@@ -264,9 +266,12 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
)
filename = path.joinpath(f"{output_name}.jsonl")
samples_dumped = json.dumps(
samples[task_name], indent=2, default=_handle_non_serializable
samples[task_name],
indent=2,
default=_handle_non_serializable,
ensure_ascii=False,
)
filename.open("w").write(samples_dumped)
filename.write_text(samples_dumped, encoding="utf-8")
print(
f"{args.model} ({args.model_args}), gen_kwargs: ({args.gen_kwargs}), limit: {args.limit}, num_fewshot: {args.num_fewshot}, "
......
import abc
import ast
import logging
import os
import random
import re
from collections.abc import Callable
......@@ -87,12 +86,6 @@ class TaskConfig(dict):
] = None # by default, not used in the code. allows for users to pass arbitrary info to tasks
def __post_init__(self) -> None:
if self.dataset_path and os.path.exists(os.path.dirname(self.dataset_path)):
import inspect
from importlib import import_module
self.dataset_path = inspect.getfile(import_module(self.dataset_path))
if self.generation_kwargs is not None:
if self.output_type != "generate_until":
eval_logger.warning(
......@@ -125,7 +118,7 @@ class TaskConfig(dict):
def __setitem__(self, item, value):
return setattr(self, item, value)
def to_dict(self):
def to_dict(self, keep_callable=False):
"""dumps the current config as a dictionary object, as a printable format.
null fields will not be printed.
Used for dumping results alongside full task configuration
......@@ -141,8 +134,11 @@ class TaskConfig(dict):
if v is None:
cfg_dict.pop(k)
elif isinstance(v, Callable):
# TODO: this should handle Promptsource template objects as a separate case?
cfg_dict[k] = str(v)
if keep_callable:
cfg_dict[k] = v
else:
# TODO: this should handle Promptsource template objects as a separate case?
cfg_dict[k] = str(v)
return cfg_dict
......@@ -534,6 +530,10 @@ class ConfigurableTask(Task):
"Must pass a config to ConfigurableTask, either in cls.CONFIG or `config` kwarg"
)
if isinstance(self.config.metadata, dict):
if "version" in self.config.metadata:
self.VERSION = self.config.metadata["version"]
if self.config.output_type is not None:
assert self.config.output_type in ALL_OUTPUT_TYPES
self.OUTPUT_TYPE = self.config.output_type
......@@ -705,11 +705,11 @@ class ConfigurableTask(Task):
)
if delimiter_has_whitespace and choice_has_whitespace:
eval_logger.warning(
f'Both target_delimiter and target choice: "{choice}" have whitespace'
eval_logger.debug(
f'Both target_delimiter "{self.config.target_delimiter}" and target choice: "{choice}" have whitespace'
)
elif (not delimiter_has_whitespace) and (not choice_has_whitespace):
eval_logger.warning(
eval_logger.debug(
f'Both target_delimiter "{self.config.target_delimiter}" and target choice: "{choice}" do not have whitespace, ignore if the language you are evaluating on does not require/use whitespace'
)
......@@ -762,6 +762,8 @@ class ConfigurableTask(Task):
def fewshot_docs(self):
if self.config.fewshot_split is not None:
if self.config.process_docs is not None:
return self.config.process_docs(self.dataset[self.config.fewshot_split])
return self.dataset[self.config.fewshot_split]
else:
if (self.config.num_fewshot is not None) and (self.config.num_fewshot > 0):
......@@ -794,16 +796,19 @@ class ConfigurableTask(Task):
)
example = self.doc_to_text(doc)
if isinstance(example, str):
return labeled_examples + example
elif isinstance(example, list):
return [labeled_examples + ex for ex in example]
elif isinstance(example, int):
if self.config.doc_to_choice is not None:
choices = self.doc_to_choice(doc)
return labeled_examples + choices[example]
else:
return labeled_examples + str(example)
if self.multiple_input:
return labeled_examples
else:
if isinstance(example, str):
return labeled_examples + example
elif isinstance(example, list):
return [labeled_examples + ex for ex in example]
elif isinstance(example, int):
if self.config.doc_to_choice is not None:
choices = self.doc_to_choice(doc)
return labeled_examples + choices[example]
else:
return labeled_examples + str(example)
def apply_filters(self):
if hasattr(self, "_filters"):
......@@ -959,7 +964,9 @@ class ConfigurableTask(Task):
if self.multiple_input:
# If there are multiple inputs, choices are placed in the ctx
cont = self.doc_to_target(doc)
arguments = [(ctx, f"{target_delimiter}{cont}") for ctx in choices]
arguments = [
(ctx + choice, f"{target_delimiter}{cont}") for choice in choices
]
else:
# Otherwise they are placed in the continuation
arguments = [(ctx, f"{target_delimiter}{cont}") for cont in choices]
......@@ -1133,27 +1140,36 @@ class ConfigurableTask(Task):
# sometimes, a multiple_target dataset has exceptions where one doc has only one string answer
# print(gold)
gold = [gold]
for gold_option in gold:
try:
result_score = self._metric_fn_list[metric](
references=[gold_option],
predictions=[result],
**self._metric_fn_kwargs[metric],
)
except (
TypeError
): # TODO: this is hacky and I don't want to do it
result_score = self._metric_fn_list[metric](
[gold_option, result]
)
if isinstance(result_score, dict):
# TODO: this handles the case where HF evaluate returns a dict.
result_score = result_score[metric]
scores.append(result_score)
if any(scores):
result_score = 1.0
if metric == "exact_match":
result = [result for _ in range(len(gold))]
scores = self._metric_fn_list[metric](
references=gold,
predictions=result,
**self._metric_fn_kwargs[metric],
)[metric]
result_score = 1.0 if scores > 0.0 else 0.0
else:
result_score = 0.0
for gold_option in gold:
try:
result_score = self._metric_fn_list[metric](
references=[gold_option],
predictions=[result],
**self._metric_fn_kwargs[metric],
)
except (
TypeError
): # TODO: this is hacky and I don't want to do it
result_score = self._metric_fn_list[metric](
[gold_option, result]
)
if isinstance(result_score, dict):
# TODO: this handles the case where HF evaluate returns a dict.
result_score = result_score[metric]
scores.append(result_score)
if any(scores):
result_score = 1.0
else:
result_score = 0.0
else:
try:
result_score = self._metric_fn_list[metric](
......
......@@ -399,7 +399,7 @@ def evaluate(
if type(items[0]) == tuple:
numitem = len(items[0])
if isinstance(items[0], (str, list)):
if isinstance(items[0], (str, list, tuple)):
# handle the string case
gathered_items = [None] * lm.accelerator.num_processes
torch.distributed.all_gather_object(gathered_items, items)
......@@ -492,10 +492,13 @@ def evaluate(
]:
stderr = "_stderr,".join(metric.split(","))
stderr_score = results[task][stderr]
var_score = stderr_score**2
metric_score = results[task][metric]
if stderr_score == "N/A":
var_score = "N/A"
else:
var_score = stderr_score**2
all_stderr.append(stderr)
all_stderr.append(stderr)
metric_score = results[task][metric]
if metric in results[group]:
results[group][metric] = (
......@@ -503,15 +506,20 @@ def evaluate(
+ metric_score * current_size
) / (total_size + current_size)
# $$s_z^2 = \frac{(n-1) s_x^2 + (m-1) s_y^2}{n+m-1} + \frac{nm(\bar x - \bar y)^2}{(n+m)(n+m-1)}.$$
results[group][stderr] = (
(total_size - 1) * results[group][stderr]
+ (current_size - 1) * var_score
) / (
total_size + current_size - 1
) + total_size * current_size / (
(total_size + current_size)
* (total_size + current_size - 1)
) * (results[group][metric] - metric_score) ** 2
if var_score == "N/A":
results[group][stderr] = "N/A"
else:
results[group][stderr] = (
(total_size - 1) * results[group][stderr]
+ (current_size - 1) * var_score
) / (
total_size + current_size - 1
) + total_size * current_size / (
(total_size + current_size)
* (total_size + current_size - 1)
) * (
results[group][metric] - metric_score
) ** 2
else:
results[group][metric] = metric_score
results[group][stderr] = var_score
......
......@@ -133,6 +133,8 @@ class HFLM(LM):
gpus = torch.cuda.device_count()
accelerator = Accelerator()
if accelerator.num_processes > 1:
self.accelerator = accelerator
if not (parallelize or accelerator.num_processes > 1):
# use user-passed device
......@@ -202,15 +204,16 @@ class HFLM(LM):
self.model.tie_weights()
if isinstance(pretrained, str) and (gpus >= 1 or str(self.device) == "mps"):
if not (parallelize or autogptq or ("device_map" in kwargs)):
# TODO: can remove this whole snippet except in the mps case, perhaps?
if not (parallelize or autogptq or hasattr(self, "accelerator")):
# place model onto device requested manually,
# if not using HF Accelerate or device_map
# or any other option that preloads model onto device
try:
self.model.to(self.device)
except ValueError:
eval_logger.info(
"Failed to place model onto specified device. This may be because the model is quantized via `bitsandbytes`. If the desired GPU is being used, this message is safe to ignore."
eval_logger.debug(
"Failed to place model onto specified device. This may be because the model is quantized via `bitsandbytes` or `device_map` is provided. If the desired GPU is being used, this message is safe to ignore."
)
self._create_tokenizer(
......@@ -456,12 +459,24 @@ class HFLM(LM):
if parallelize:
model_kwargs.update(
_get_accelerate_args(
device_map_option,
device_map_option, # TODO: phase out device_map_option?
max_memory_per_gpu,
max_cpu_memory,
offload_folder,
)
)
elif "device_map" not in model_kwargs:
# set a device_map to initialize model on the right GPU.
# this is needed because it seems that the default behavior
# for quantized models now seems to be device_map="auto"
# which breaks data-parallel mode.
if hasattr(self, "accelerator"):
model_kwargs.update(
{"device_map": {"": f"cuda:{self.accelerator.local_process_index}"}}
)
else:
model_kwargs.update({"device_map": {"": str(self.device)}})
if not autogptq:
if model_kwargs.get("load_in_4bit", None):
assert (
......@@ -696,7 +711,7 @@ class HFLM(LM):
generation_kwargs["do_sample"] = False
# build stopping criteria
stopping_criteria = stop_sequences_criteria(
self.tokenizer, stop, 1, context.shape[0]
self.tokenizer, stop, context.shape[1], context.shape[0]
)
return self.model.generate(
input_ids=context,
......
......@@ -42,7 +42,7 @@ class MambaLMWrapper(HFLM):
The HFLM arguments
`backend`, `revision`, `subfolder`, `tokenizer`, `truncation`, `max_length`,
`backend`, `tokenizer`, `truncation`, `max_length`,
`device`, `dtype`, `batch_size`, `max_batch_size`, `trust_remote_code`, `use_fast_tokenizer`
Are all supported by Mamba where they do not conflict
......@@ -98,7 +98,6 @@ please install mamba via `pip install lm-eval[mamba]` or `pip install -e .[mamba
pretrained,
device=self._device,
dtype=torch.float16 if dtype == "auto" else utils.get_dtype(dtype),
**kwargs,
)
def _model_generate(self, context, max_length, stop, **generation_kwargs):
......
......@@ -2,14 +2,14 @@ import copy
import os
from collections import defaultdict
from importlib.util import find_spec
from typing import List, Optional, Tuple
from typing import List, Literal, Optional, Tuple
from tqdm import tqdm
from lm_eval import utils
from lm_eval.api.model import LM
from lm_eval.api.registry import register_model
from lm_eval.utils import retry_on_specific_exceptions
from lm_eval.utils import eval_logger, retry_on_specific_exceptions
def get_result(response, ctxlen: int) -> Tuple[float, bool]:
......@@ -40,7 +40,7 @@ def get_result(response, ctxlen: int) -> Tuple[float, bool]:
return continuation_logprobs, is_greedy
def oa_completion(**kwargs):
def oa_completion(client, chat: bool = False, **kwargs):
"""Query OpenAI API for completion.
Retry with back-off until they respond
......@@ -64,19 +64,24 @@ def oa_completion(**kwargs):
on_exception_callback=_exception_callback,
)
def completion():
return openai.completions.create(**kwargs)
if chat:
return client.chat.completions.create(**kwargs)
else:
return client.completions.create(**kwargs)
return completion()
@register_model("openai-completions")
@register_model("openai-completions", "local-completions")
class OpenaiCompletionsLM(LM):
REQ_CHUNK_SIZE = 20
_DEFAULT_MAX_LENGTH = 2048
def __init__(
self,
model: str = "text-davinci-003",
model: str,
base_url: str = None,
tokenizer: Optional[str] = None,
tokenizer_backend: Literal["tiktoken", "huggingface"] = "tiktoken",
truncate: bool = False,
max_gen_toks: int = 256,
batch_size: int = 1,
......@@ -86,7 +91,7 @@ class OpenaiCompletionsLM(LM):
"""
:param engine: str
OpenAI API engine (e.g. davinci)
OpenAI API engine (e.g. gpt-3.5-turbo-instruct)
:param truncate: bool
Truncate input if too long (if False and input is too long, throw error)
"""
......@@ -101,15 +106,44 @@ class OpenaiCompletionsLM(LM):
please install these via `pip install lm-eval[openai]` or `pip install -e .[openai]`",
)
self.model = model
self.tokenizer = tiktoken.encoding_for_model(self.model)
self.vocab_size = self.tokenizer.n_vocab
self.base_url = base_url
self.tokenizer_backend = tokenizer_backend
self.truncate = truncate
self.end_of_text_token_id = self.tokenizer.eot_token
self._batch_size = batch_size
self._max_gen_toks = max_gen_toks
self._max_length = max_length
# if we have a local model, use HF tokenizer over tiktoken
if self.tokenizer_backend == "huggingface":
import transformers # noqa: E401
self.tokenizer = transformers.AutoTokenizer.from_pretrained(
tokenizer if tokenizer else self.model
)
self.vocab_size = self.tokenizer.vocab
self.end_of_text_token_id = self.tokenizer.eos_token
elif self.tokenizer_backend == "tiktoken":
if self.base_url:
eval_logger.warning(
f"Passed `base_url={self.base_url}` but using Tiktoken tokenizer backend. "
"Pass `tokenizer_backend=huggingface` and provide the HF tokenizer name if your model does not use Tiktoken."
)
self.tokenizer = tiktoken.encoding_for_model(self.model)
self.vocab_size = self.tokenizer.n_vocab
self.end_of_text_token_id = self.tokenizer.eot_token
else:
raise ValueError(
f"Expected tokenizer_backend to be one of ['tiktoken', 'huggingface'] but got {self.tokenizer_backend}"
)
# Read from environment variable OPENAI_API_KEY
# Set to EMPTY for local
openai.api_key = os.environ["OPENAI_API_KEY"]
if self.base_url:
self.client = openai.OpenAI(base_url=self.base_url)
else:
self.client = openai.OpenAI()
@property
def eot_token_id(self):
......@@ -127,9 +161,8 @@ class OpenaiCompletionsLM(LM):
return self._max_gen_toks
@property
def batch_size(self):
# Isn't used because we override _loglikelihood_tokens
raise NotImplementedError()
def batch_size(self) -> int:
return self._batch_size
@property
def device(self):
......@@ -186,7 +219,7 @@ class OpenaiCompletionsLM(LM):
re_ord = utils.Reorderer(requests, _collate)
for chunk in tqdm(
list(utils.chunks(re_ord.get_reordered(), self.REQ_CHUNK_SIZE)),
list(utils.chunks(re_ord.get_reordered(), self.batch_size)),
disable=disable_tqdm,
):
inps = []
......@@ -203,6 +236,7 @@ class OpenaiCompletionsLM(LM):
ctxlens.append(ctxlen)
response = oa_completion(
client=self.client,
model=self.model,
prompt=inps,
echo=True,
......@@ -251,7 +285,7 @@ class OpenaiCompletionsLM(LM):
# todo: more intelligent batching for heterogeneous `until`
for chunk, request_args in tqdm(
list(sameuntil_chunks(re_ord.get_reordered(), self.REQ_CHUNK_SIZE))
list(sameuntil_chunks(re_ord.get_reordered(), self.batch_size))
):
inps = []
self._max_gen_toks = request_args.pop("max_gen_toks", self.max_gen_toks)
......@@ -265,6 +299,7 @@ class OpenaiCompletionsLM(LM):
request_args["temperature"] = request_args.get("temperature", 0)
response = oa_completion(
client=self.client,
model=self.model,
prompt=inps,
max_tokens=self.max_gen_toks,
......@@ -329,35 +364,6 @@ class OpenaiCompletionsLM(LM):
return loglikelihoods
def oa_chat_completion(client, **kwargs):
"""Query OpenAI API for chat completion.
Retry with back-off until they respond
"""
if not find_spec("openai") or not find_spec("tiktoken"):
raise Exception(
"attempted to use 'openai' LM type, but package `openai` or `tiktoken` are not installed. "
"Please install these via `pip install lm-eval[openai]` or `pip install -e .[openai]`"
)
else:
import openai
def _exception_callback(e: Exception, sleep_time: float) -> None:
import traceback
traceback.print_exc()
@retry_on_specific_exceptions(
on_exceptions=[openai.OpenAIError],
max_retries=None, # retry forever, consider changing
on_exception_callback=_exception_callback,
)
def completion():
return client.chat.completions.create(**kwargs)
return completion()
@register_model("openai-chat-completions", "local-chat-completions")
class OpenaiChatCompletionsLM(LM):
def __init__(
......@@ -460,8 +466,12 @@ class OpenaiChatCompletionsLM(LM):
f"Expected repr(kwargs) to be of type repr(dict) but got {kwargs}"
)
response = oa_chat_completion(
client=self.client, messages=inps, model=self.model, **kwargs
response = oa_completion(
client=self.client,
chat=True,
messages=inps,
model=self.model,
**kwargs,
)
for resp, (context, args_) in zip(response.choices, chunk):
......
import copy
from collections import defaultdict
from importlib.util import find_spec
from typing import List, Literal, Optional, Tuple, Union
from tqdm import tqdm
from lm_eval import utils
from lm_eval.api.instance import Instance
from lm_eval.api.model import LM
from lm_eval.api.registry import register_model
from lm_eval.utils import (
Collator,
divide,
eval_logger,
get_rolling_token_windows,
make_disjoint_window,
)
try:
import ray
from ray.util.multiprocessing import Pool
from vllm import LLM, SamplingParams
from vllm.transformers_utils.tokenizer import get_tokenizer
except ModuleNotFoundError:
pass
eval_logger = utils.eval_logger
eval_logger = eval_logger
# adapted from https://github.com/vllm-project/vllm/issues/367#issuecomment-1788341727
def run_inference_one_model(model_args: dict, sampling_params, requests: List[int]):
# gpu_id = [x for x in gpu_id]
# os.environ["CUDA_VISIBLE_DEVICES"]= str(gpu_id)
def run_inference_one_model(
model_args: dict, sampling_params, requests: List[List[int]]
):
llm = LLM(**model_args)
return llm.generate(prompt_token_ids=requests, sampling_params=sampling_params)
......@@ -43,7 +49,7 @@ class VLLM(LM):
tokenizer_mode: Literal["auto", "slow"] = "auto",
tokenizer_revision: Optional[str] = None,
tensor_parallel_size: int = 1,
quantization: Optional[Literal["awq"]] = None,
quantization: Optional[str] = None,
max_gen_toks: int = 256,
swap_space: int = 4,
batch_size: Union[str, int] = 1,
......@@ -86,17 +92,30 @@ class VLLM(LM):
"quantization": quantization,
"seed": int(seed),
}
self.batch_size = (
"auto"
if isinstance(batch_size, str) and "auto" in batch_size
else batch_size
)
if self.data_parallel_size <= 1:
self.model = LLM(**self.model_args)
else:
self.model_args["worker_use_ray"] = True
self.batch_size = "auto"
eval_logger.info("Manual batching is not compatible with data parallelism.")
from transformers import AutoConfig
self._config = AutoConfig.from_pretrained(
pretrained, trust_remote_code=trust_remote_code, revision=revision
)
self.tokenizer = get_tokenizer(
tokenizer if tokenizer else pretrained,
tokenizer_mode=tokenizer_mode,
trust_remote_code=trust_remote_code,
tokenizer_revision=tokenizer_revision,
)
self.batch_size = batch_size
self._max_gen_toks = max_gen_toks
@property
......@@ -108,9 +127,18 @@ class VLLM(LM):
def max_length(self):
if self._max_length: # if max length manually set, return it
return self._max_length
if hasattr(self.tokenizer, "model_max_length"):
return self.tokenizer.model_max_length
return self._DEFAULT_MAX_LENGTH
if self.data_parallel_size <= 1:
return self.model.llm_engine.model_config.max_model_len
else:
seqlen_config_attrs = ("n_positions", "max_position_embeddings", "n_ctx")
for attr in seqlen_config_attrs:
if hasattr(self._config, attr):
return getattr(self._config, attr)
if hasattr(self.tokenizer, "model_max_length"):
if self.tokenizer.model_max_length == 1000000000000000019884624838656:
return self._DEFAULT_MAX_LENGTH
return self.tokenizer.model_max_length
return self._DEFAULT_MAX_LENGTH
@property
def max_gen_toks(self):
......@@ -156,13 +184,13 @@ class VLLM(LM):
temperature=0, prompt_logprobs=2, max_tokens=1
)
if self.data_parallel_size > 1:
requests = [
list(x) for x in utils.divide(requests, self.data_parallel_size)
]
requests = [list(x) for x in divide(requests, self.data_parallel_size)]
inputs = [(self.model_args, sampling_params, req) for req in requests]
with Pool(self.data_parallel_size) as pool:
results = pool.starmap(run_inference_one_model, inputs)
# Invoke ray.shutdown() to prevent hang-ups if subsequent calls required.
ray.shutdown()
# flatten results
return [item for sublist in results for item in sublist]
......@@ -171,7 +199,6 @@ class VLLM(LM):
sampling_params=sampling_params,
use_tqdm=True if self.batch_size == "auto" else False,
)
return outputs
def _encode_pair(
......@@ -211,8 +238,8 @@ class VLLM(LM):
for (string,) in tqdm([req.args for req in requests]):
rolling_token_windows = list(
map(
utils.make_disjoint_window,
utils.get_rolling_token_windows(
make_disjoint_window,
get_rolling_token_windows(
token_list=self.tok_encode(string),
prefix_token=self.eot_token_id,
max_seq_len=self.max_length - 1,
......@@ -235,8 +262,7 @@ class VLLM(LM):
return loglikelihoods
def generate_until(self, requests: List[Instance]) -> List[str]:
res = defaultdict(list)
re_ords = {}
res = []
# batch tokenize contexts
context, all_gen_kwargs = zip(*(req.args for req in requests))
......@@ -252,84 +278,73 @@ class VLLM(LM):
# padded context length. this is useful to simplify the batching logic and more importantly to make
# automatic adaptive batches much much easier to implement
# - any OOMs will happen right away rather than near the end
return -len(_requests[0][1]), tuple(_requests[0][1])
return -len(_requests[0][1]), _requests[0][0]
# we group requests by their generation_kwargs,
# so that we don't try to execute e.g. greedy sampling and temp=0.8 sampling
# in the same batch.
grouper = utils.Grouper(requests, lambda x: str(x[1]))
for key, reqs in grouper.get_grouped().items():
# within each set of reqs for given kwargs, we reorder by token length, descending.
re_ords[key] = utils.Reorderer(requests, _collate_gen)
re_ords = Collator(requests, _collate_gen, grouping=True)
chunks = re_ords.get_batched(
n=int(self.batch_size) if self.batch_size != "auto" else 0, batch_fn=None
)
pbar = tqdm(total=len(requests), disable=(self.rank != 0))
# for each different set of kwargs, we execute all requests, by batch.
for key, re_ord in re_ords.items():
chunks = utils.chunks(
re_ord.get_reordered(),
n=int(self.batch_size) if self.batch_size != "auto" else 0,
fn=None,
)
for chunk in chunks:
context_and_encoding, all_gen_kwargs = zip(*chunk)
context, context_encoding = zip(*context_and_encoding)
# we assume all gen kwargs in the batch are the same
# this is safe to assume because the `grouper` object ensures it.
gen_kwargs = all_gen_kwargs[0]
# unpack our keyword arguments.
until = None
if isinstance(gen_kwargs, dict):
kwargs = copy.deepcopy(gen_kwargs) # edge case for repeats > 1
if "until" in kwargs.keys():
until = kwargs.pop("until")
if isinstance(until, str):
until = [until]
elif not isinstance(until, list):
raise ValueError(
f"Expected `kwargs['until']` to be of type Union[str,list] but got {until}"
)
else:
raise ValueError(
f"Expected `kwargs` to be of type `dict` but got {gen_kwargs}"
)
if not until:
until = [self.tokenizer.decode(self.eot_token_id)]
if "max_gen_toks" in kwargs.keys():
max_gen_toks = kwargs.pop("max_gen_toks")
else:
max_gen_toks = self.max_gen_toks
# set the max length in tokens of inputs ("context_enc")
# max len for inputs = max length, minus room to generate the max new tokens
max_ctx_len = self.max_length - max_gen_toks
context_encoding = [x[-max_ctx_len:] for x in context_encoding]
# TODO: max_length in kwargs
# perform batched generation
cont = self._model_generate(
requests=context_encoding,
generate=True,
max_tokens=max_gen_toks,
stop=until,
**kwargs,
for chunk in chunks:
context_and_encoding, all_gen_kwargs = zip(*chunk)
context, context_encoding = zip(*context_and_encoding)
# we assume all gen kwargs in the batch are the same
# this is safe to assume because the `grouper` object ensures it.
gen_kwargs = all_gen_kwargs[0]
# unpack our keyword arguments.
until = None
if isinstance(gen_kwargs, dict):
kwargs = copy.deepcopy(gen_kwargs) # edge case for repeats > 1
if "until" in kwargs.keys():
until = kwargs.pop("until")
if isinstance(until, str):
until = [until]
elif not isinstance(until, list):
raise ValueError(
f"Expected `kwargs['until']` to be of type Union[str,list] but got {until}"
)
else:
raise ValueError(
f"Expected `kwargs` to be of type `dict` but got {gen_kwargs}"
)
if not until:
until = [self.tokenizer.decode(self.eot_token_id)]
if "max_gen_toks" in kwargs.keys():
max_gen_toks = kwargs.pop("max_gen_toks")
else:
max_gen_toks = self.max_gen_toks
# set the max length in tokens of inputs ("context_enc")
# max len for inputs = max length, minus room to generate the max new tokens
max_ctx_len = self.max_length - max_gen_toks
context_encoding = [x[-max_ctx_len:] for x in context_encoding]
# perform batched generation
cont = self._model_generate(
requests=context_encoding,
generate=True,
max_tokens=max_gen_toks,
stop=until,
**kwargs,
)
# cache generations
for output, context in zip(cont, context):
generated_text = output.outputs[0].text
res[key].append(generated_text)
self.cache_hook.add_partial(
"generate_until", (context, gen_kwargs), generated_text
)
pbar.update(1)
# reorder this group of results back to original unsorted form
res[key] = re_ord.get_original(res[key])
# cache generations
for output, context in zip(cont, context):
generated_text = output.outputs[0].text
res.append(generated_text)
self.cache_hook.add_partial(
"generate_until", (context, gen_kwargs), generated_text
)
pbar.update(1)
pbar.close()
return grouper.get_original(res)
# reorder all group of results back to original unsorted form
return re_ords.get_original(res)
def _loglikelihood_tokens(
self,
......@@ -342,16 +357,15 @@ class VLLM(LM):
toks = x[1] + x[2]
return -len(toks), tuple(toks)
re_ord = utils.Reorderer(requests, _collate)
chunks = utils.chunks(
re_ord.get_reordered(),
n=int(self.batch_size) if self.batch_size != "auto" else 0,
fn=None,
# Reorder requests by length and batch
re_ord = Collator(requests, sort_fn=_collate)
chunks = re_ord.get_batched(
n=int(self.batch_size) if self.batch_size != "auto" else 0, batch_fn=None
)
pbar = tqdm(total=len(requests), disable=disable_tqdm)
for chunk in chunks:
inps = []
inputs = []
ctxlens = []
for cache_key, context_enc, continuation_enc in chunk:
inp = (context_enc + continuation_enc)[-(self.max_length) :]
......@@ -359,18 +373,18 @@ class VLLM(LM):
0, len(context_enc) + len(continuation_enc) - (self.max_length)
)
inps.append(inp)
inputs.append(inp)
ctxlens.append(ctxlen)
outputs = self._model_generate(requests=inps, generate=False)
outputs = self._model_generate(requests=inputs, generate=False)
for output, ctxlen, (cache_key, context_enc, continuation_enc) in zip(
outputs, ctxlens, chunk
for output, ctxlen, (cache_key, _, _), inp in zip(
outputs, ctxlens, chunk, inputs
):
answer = self._parse_logprobs(
(context_enc + continuation_enc),
output,
ctxlen,
tokens=inp,
outputs=output,
ctxlen=ctxlen,
)
res.append(answer)
......@@ -378,7 +392,7 @@ class VLLM(LM):
# partial caching
if cache_key is not None:
self.cache_hook.add_partial("loglikelihood", cache_key, answer)
pbar.update(1)
pbar.update(1)
pbar.close()
return re_ord.get_original(res)
......@@ -387,9 +401,9 @@ class VLLM(LM):
"""Process logprobs and tokens.
:param tokens: list
Tokens from context+continuations
Input tokens (potentially left-truncated)
:param outputs: RequestOutput
Contains prompt
Contains prompt_logprobs
:param ctxlen: int
Length of context (so we can slice them away and only keep the predictions)
:return:
......@@ -399,11 +413,11 @@ class VLLM(LM):
Whether argmax matches given continuation exactly
"""
# prompt_logprobs = [None, {}*len(context-1)]
# The first entry of prompt_logprobs is None because the model has no previous tokens to condition on.
continuation_logprobs_dicts = outputs.prompt_logprobs
# Calculate continuation_logprobs
# assume ctxlen always > 1
# assume ctxlen always >= 1
continuation_logprobs = sum(
logprob_dict.get(token)
for token, logprob_dict in zip(
......
......@@ -61,11 +61,27 @@ def register_configurable_group(config: Dict[str, str], yaml_path: str = None) -
task_list = [task for task in all_task_list if type(task) == str]
for task_config in config_list:
base_config = {}
task_name_config = {}
if "task" in task_config:
task_name = task_config["task"]
if task_name in ALL_TASKS:
task_obj = get_task_dict(task_name)[task_name]
if type(task_obj) == tuple:
_, task_obj = task_obj
if task_obj is not None:
base_config = task_obj._config.to_dict(keep_callable=True)
task_name_config["task"] = f"{group}_{task_name}"
task_config = utils.load_yaml_config(yaml_path, task_config)
var_configs = check_prompt_config(
{
**base_config,
**task_config,
**{"group": group},
**task_name_config,
},
yaml_path=os.path.dirname(yaml_path),
)
......@@ -131,6 +147,9 @@ def include_task_folder(task_dir: str, register_task: bool = True) -> None:
"""
Calling this function
"""
# Track whether any tasks failed during loading
import_fail = False
for root, subdirs, file_list in os.walk(task_dir):
# if (subdirs == [] or subdirs == ["__pycache__"]) and (len(file_list) > 0):
for f in file_list:
......@@ -155,20 +174,27 @@ def include_task_folder(task_dir: str, register_task: bool = True) -> None:
# Log this silently and show it only when
# the user defines the appropriate verbosity.
except ModuleNotFoundError as e:
except (ImportError, ModuleNotFoundError) as e:
import_fail = True
eval_logger.debug(
f"{yaml_path}: {e}. Config will not be added to registry."
)
except Exception as error:
import traceback
eval_logger.debug(
"Failed to load config in\n"
eval_logger.warning(
"Unexpected error loading config in\n"
f" {yaml_path}\n"
" Config will not be added to registry\n"
f" Error: {error}\n"
f" Traceback: {traceback.format_exc()}"
)
if import_fail:
eval_logger.warning(
"Some tasks could not be loaded due to missing dependencies."
" Run with `--verbosity DEBUG` for full details."
)
return 0
......
......@@ -17,4 +17,4 @@ metric_list:
aggregation: mean
higher_is_better: true
metadata:
version: 0.0
version: 1.0
......@@ -27,4 +27,4 @@ filter_list:
- function: "take_first"
num_fewshot: 0
metadata:
version: 1.0
version: 2.0
......@@ -24,4 +24,4 @@ filter_list:
- function: "take_first"
num_fewshot: 0
metadata:
version: 0
version: 1.0
......@@ -18,4 +18,4 @@ generation_kwargs:
temperature: 0.0
num_fewshot: 0
metadata:
version: 0
version: 1.0
......@@ -18,4 +18,4 @@ generation_kwargs:
temperature: 0.0
num_fewshot: 0
metadata:
version: 0
version: 1.0
group: belebele
dataset_path: facebook/belebele
test_split: test
fewshot_split: test
fewshot_config:
sampler: first_n
output_type: multiple_choice
......
......@@ -8,7 +8,7 @@ import requests
from tqdm import tqdm
from lm_eval.logger import eval_logger
from lm_eval.utils import logging
API_URL = "https://datasets-server.huggingface.co/splits?dataset=facebook/belebele"
......@@ -39,8 +39,8 @@ if __name__ == "__main__":
def query():
response = requests.get(API_URL)
return response.json()["splits"]
languages = [split["config"] for split in query()]
print(query())
languages = [split["split"] for split in query()]
for lang in tqdm(languages):
yaml_dict = {
......@@ -48,11 +48,12 @@ if __name__ == "__main__":
"task": f"belebele_{args.task_prefix}_{lang}"
if args.task_prefix != ""
else f"belebele_{lang}",
"dataset_name": lang,
"test_split": lang,
"fewshot_split":lang,
}
file_save_path = args.save_prefix_path + f"_{lang}.yaml"
eval_logger.info(f"Saving yaml for subset {lang} to {file_save_path}")
logging.info(f"Saving yaml for subset {lang} to {file_save_path}")
with open(file_save_path, "w") as yaml_file:
yaml.dump(
yaml_dict,
......