Commit 6f4f9e1c authored by lintangsutawika

resolved merge conflict

parents 0d5748b7 aed90773
......@@ -3,10 +3,10 @@ name: Tasks Modified
on:
push:
branches:
- 'big-refactor*'
- 'main'
pull_request:
branches:
- 'big-refactor*'
- 'main'
workflow_dispatch:
# comment/edit out the above to stop/change the triggers
jobs:
......
......@@ -6,10 +6,10 @@ name: Unit Tests
on:
push:
branches:
- 'big-refactor*'
- 'main'
pull_request:
branches:
- 'big-refactor*'
- 'main'
workflow_dispatch:
# Jobs run concurrently and steps run sequentially within a job.
# jobs: linter and cpu_tests. Add more jobs/steps as required.
......
......@@ -19,6 +19,7 @@ repos:
- id: no-commit-to-branch
- id: requirements-txt-fixer
- id: trailing-whitespace
args: [--markdown-linebreak-ext=md]
- id: fix-byte-order-marker
exclude: docs/CNAME
- id: fix-encoding-pragma
......
# Language Model Evaluation Harness
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.10256836.svg)](https://doi.org/10.5281/zenodo.10256836)
## Announcement
**A new v0.4.0 release of lm-evaluation-harness is available!**
New updates and features include:
- Internal refactoring
- Config-based task creation and configuration
- Easier import and sharing of externally-defined task config YAMLs
- Support for Jinja2 prompt design, easy modification of prompts + prompt imports from Promptsource
- More advanced configuration options, including output post-processing, answer extraction, multiple LM generations per document, configurable fewshot settings, and more
- Speedups and new modeling libraries supported, including: faster data-parallel HF model usage, vLLM support, MPS support with HuggingFace, and more
- Logging and usability changes
- New tasks including CoT BIG-Bench-Hard, Belebele, user-defined task groupings, and more
Please see our updated documentation pages in `docs/` for more details.
Development will be continuing on the `main` branch, and we encourage you to give us feedback on what features are desired and how to improve the library further, or ask questions, either in issues or PRs on GitHub, or in the [EleutherAI Discord](https://discord.gg/eleutherai)!
## Overview
This project provides a unified framework to test generative language models on a large number of different evaluation tasks.
......@@ -25,7 +45,7 @@ cd lm-evaluation-harness
pip install -e .
```
We also provide a number of optional dependencies for . Extras can be installed via `pip install -e ".[NAME]"`
We also provide a number of optional dependencies for extended functionality. Extras can be installed via `pip install -e ".[NAME]"`
| Name | Use |
| ------------- | ------------------------------------- |
......@@ -106,18 +126,21 @@ To use `accelerate` with the `lm-eval` command, use
accelerate launch --no_python lm-eval --model ...
```
### Tensor Parallel + Optimized Inference with vLLM
We also support vLLM for faster inference on [supported model types](https://docs.vllm.ai/en/latest/models/supported_models.html).
### Tensor + Data Parallel and Optimized Inference with `vLLM`
We also support vLLM for faster inference on [supported model types](https://docs.vllm.ai/en/latest/models/supported_models.html). For single-GPU or multi-GPU inference (tensor parallel, data parallel, or a combination of both), for example:
```bash
lm_eval --model vllm \
--model_args pretrained={model_name},tensor_parallel_size={number of GPUs to use},dtype=auto,gpu_memory_utilization=0.8 \
--model_args pretrained={model_name},tensor_parallel_size={GPUs_per_model},dtype=auto,gpu_memory_utilization=0.8,data_parallel_size={model_replicas} \
--tasks lambada_openai \
--batch_size auto
```
For a full list of supported vLLM configurations, please refer to our vLLM integration and the vLLM documentation.
vLLM occasionally differs in output from Hugging Face. We treat Hugging Face as the reference implementation, and provide a script at [./scripts/model_comparator.py](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/scripts/model_comparator.py) for checking the validity of vLLM results against HF.
### Model APIs and Inference Servers
Our library also supports the evaluation of models served via several commercial APIs, and we hope to implement support for the most commonly used performant local/self-hosted inference servers.
......@@ -158,7 +181,6 @@ If you have a Metal compatible Mac, you can run the eval harness using the MPS b
> [!Note]
> You can inspect what the LM inputs look like by running the following command:
>
> ```bash
> python write_out.py \
> --tasks all_tasks \
......@@ -166,7 +188,6 @@ If you have a Metal compatible Mac, you can run the eval harness using the MPS b
> --num_examples 10 \
> --output_base_path /path/to/output/folder
> ```
>
> This will write out one text file for each task.
To verify the data integrity of the tasks you're performing, in addition to running the tasks themselves, you can use the `--check_integrity` flag:
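For example, a run along the following lines (the model and task shown here are illustrative placeholders, not a required configuration) validates the task data as part of the evaluation:

```bash
lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks lambada_openai \
    --check_integrity
```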
......@@ -202,19 +223,17 @@ To save evaluation results provide an `--output_path`. We also support logging m
Additionally, one can provide a directory with `--use_cache` to cache the results of prior runs. This allows you to avoid repeated execution of the same (model, task) pairs for re-scoring.
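As a sketch (the paths and model settings below are placeholders, not required values), both options can be combined in a single invocation:

```bash
lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks lambada_openai \
    --output_path ./results \
    --use_cache ./lm_cache
```

Re-running with the same `--use_cache` directory then skips the (model, task) pairs that have already been scored.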
For a full list of supported arguments, check out the [interface](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/interface.md) guide in our documentation!
For a full list of supported arguments, check out the [interface](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md) guide in our documentation!
## How to Contribute or Learn More?
For more information on the library and how everything fits together, check out all of our [documentation pages](https://github.com/EleutherAI/lm-evaluation-harness/tree/big-refactor/docs)! We plan to post a larger roadmap of desired + planned library improvements soon, with more information on how contributors can help.
You can also ask for help, or discuss new features with the maintainers in the #lm-thunderdome channel of the EleutherAI discord! If you've used the library and have had a positive (or negative) experience, we'd love to hear from you!
### Implementing new tasks
To implement a new task in the eval harness, see [this guide](./docs/new_task_guide.md).
In general, we following the following priority list for addressing concerns about prompting and other eval details:
In general, we follow this priority list for addressing concerns about prompting and other eval details:
1. If there is widespread agreement among people who train LLMs, use the agreed upon procedure.
2. If there is a clear and unambiguous official implementation, use that procedure.
3. If there is widespread agreement among people who evaluate LLMs, use the agreed upon procedure.
......@@ -222,11 +241,11 @@ In general, we following the following priority list for addressing concerns abo
These are guidelines and not rules, and can be overruled in special circumstances.
We try to prioritize agreement with the procedures used by other groups to decrease the harm when people inevitably compare runs across different papers despite our discouragement of the practice. Historically, we also prioritized the implementation from "Language Models are Few Shot Learners" as our original goal was specifically to compare results with that paper.
We try to prioritize agreement with the procedures used by other groups to decrease the harm when people inevitably compare runs across different papers despite our discouragement of the practice. Historically, we also prioritized the implementation from [Language Models are Few Shot Learners](https://arxiv.org/abs/2005.14165) as our original goal was specifically to compare results with that paper.
### Support
The best way to get support is to open an issue on this repo or join the EleutherAI discord server](https://discord.gg/eleutherai). The `#lm-thunderdome` channel is dedicated to developing this project and the `#release-discussion` channel is for receiving support for our releases.
The best way to get support is to open an issue on this repo or join the [EleutherAI Discord server](https://discord.gg/eleutherai). The `#lm-thunderdome` channel is dedicated to developing this project and the `#release-discussion` channel is for receiving support for our releases. If you've used the library and have had a positive (or negative) experience, we'd love to hear from you!
## Cite as
......@@ -234,11 +253,11 @@ The best way to get support is to open an issue on this repo or join the Eleuthe
@misc{eval-harness,
author = {Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang and Tang, Eric and Thite, Anish and Wang, Ben and Wang, Kevin and Zou, Andy},
title = {A framework for few-shot language model evaluation},
month = sep,
year = 2021,
month = 12,
year = 2023,
publisher = {Zenodo},
version = {v0.0.1},
doi = {10.5281/zenodo.5371628},
url = {https://doi.org/10.5281/zenodo.5371628}
version = {v0.4.0},
doi = {10.5281/zenodo.10256836},
url = {https://zenodo.org/records/10256836}
}
```
......@@ -10,7 +10,7 @@ Equivalently, running the library can be done via the `lm-eval` entrypoint at th
This mode supports a number of command-line arguments, the details of which can also be seen by running with `-h` or `--help`:
* `--model` : Selects which model type or provider is evaluated. Must be a string corresponding to the name of the model type/provider being used. See [the main README](https://github.com/EleutherAI/lm-evaluation-harness/tree/big-refactor#commercial-apis) for a full list of enabled model names and supported libraries or APIs.
* `--model` : Selects which model type or provider is evaluated. Must be a string corresponding to the name of the model type/provider being used. See [the main README](https://github.com/EleutherAI/lm-evaluation-harness/tree/main#commercial-apis) for a full list of enabled model names and supported libraries or APIs.
* `--model_args` : Controls parameters passed to the model constructor. Accepts a string containing comma-separated keyword arguments to the model class of the format `"arg1=val1,arg2=val2,..."`, for example `--model_args pretrained=EleutherAI/pythia-160m,dtype=float32`. For a full list of supported keyword arguments, see the initialization of the `lm_eval.api.model.LM` subclass, e.g. [`HFLM`](https://github.com/EleutherAI/lm-evaluation-harness/blob/365fcda9b85bbb6e0572d91976b8daf409164500/lm_eval/models/huggingface.py#L66)
......@@ -51,7 +51,7 @@ We also support using the library's external API for use within model training l
`lm_eval` supplies two functions for external import and use: `lm_eval.evaluate()` and `lm_eval.simple_evaluate()`.
`simple_evaluate()` can be used by simply creating an `lm_eval.api.model.LM` subclass that implements the methods described in the [Model Guide](https://github.com/EleutherAI/lm-evaluation-harness/tree/big-refactor/docs/model_guide.md), and wrapping your custom model in that class as follows:
`simple_evaluate()` can be used by simply creating an `lm_eval.api.model.LM` subclass that implements the methods described in the [Model Guide](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs/model_guide.md), and wrapping your custom model in that class as follows:
```python
import lm_eval
......
......@@ -12,7 +12,6 @@ To get started contributing, go ahead and fork the main repo, clone it, create a
# After forking...
git clone https://github.com/<YOUR-USERNAME>/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout big-refactor
git checkout -b <model-type>
pip install -e ".[dev]"
```
......@@ -46,7 +45,7 @@ class MyCustomLM(LM):
#...
#...
```
Where `Instance` is a dataclass defined in [`lm_eval.api.instance`](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/api/instance.py) with property `args` of request-dependent type signature described below.
Where `Instance` is a dataclass defined in [`lm_eval.api.instance`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/api/instance.py) with property `args` of request-dependent type signature described below.
We support three types of requests, consisting of different interactions / measurements with an autoregressive LM.
......
......@@ -2,9 +2,9 @@
`lm-evaluation-harness` is a framework that strives to support a wide range of zero- and few-shot evaluation tasks on autoregressive language models (LMs).
This documentation page provides a walkthrough to get started creating your own task, on the `big-refactor` branch of the repository (which will be v0.4.0 in the future.)
This documentation page provides a walkthrough to get started creating your own task, in `lm-eval` versions v0.4.0 and later.
A more interactive tutorial is available as a Jupyter notebook [here](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/examples/lm-eval-overview.ipynb).
A more interactive tutorial is available as a Jupyter notebook [here](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/examples/lm-eval-overview.ipynb).
## Setup
......@@ -14,12 +14,11 @@ If you haven't already, go ahead and fork the main repo, clone it, create a bran
# After forking...
git clone https://github.com/<YOUR-USERNAME>/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout big-refactor
git checkout -b <task-name>
pip install -e ".[dev]"
```
In this document, we'll walk through the basics of implementing a static benchmark evaluation in two formats: a *generative* task which requires sampling text from a model, such as [`gsm8k`](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/tasks/gsm8k/gsm8k.yaml), and a *discriminative*, or *multiple choice*, task where the model picks the most likely of several fixed answer choices, such as [`sciq`](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/tasks/sciq/sciq.yaml).
In this document, we'll walk through the basics of implementing a static benchmark evaluation in two formats: a *generative* task which requires sampling text from a model, such as [`gsm8k`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/gsm8k/gsm8k.yaml), and a *discriminative*, or *multiple choice*, task where the model picks the most likely of several fixed answer choices, such as [`sciq`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/sciq/sciq.yaml).
## Creating a YAML file
......
......@@ -112,22 +112,3 @@ def get_sampler(name):
raise ValueError(
f"Attempted to use contextsampler '{name}', but no sampling strategy for this name found! Supported model names: {', '.join(SAMPLER_REGISTRY.keys())}"
)
# TODO: how should we do design here? might be better to have a single sampler and pass more kwargs at init.
# Depends what's easier for new user to add own functionality on top of
# types of sampler:
# - class-balanced, randomly shuffled
# - class-balanced, one particular set of fewshot examples for all evaled instances
# - hand-specify number of fewshot examples per class?
# - random, varies per example (check that this is curr. default in old repo)
# - random, unified per example
# - enforce a specific fixed fewshot string! (or should we not use this, in favor of including it in prompt template directly)
# - user-specified doc indices to restrict fewshot doc options to
# - user specifies split to use for drawing fewshot instances (TODO: manually prevent this from being same split you eval!)
# - user specifies a prepended "description"/string to add in front of the (prompted) input
# - user specifies a location to draw fewshot samples from? DO THIS IN TASK CLASS
......@@ -234,8 +234,7 @@ def evaluate(
padding_requests = collections.defaultdict(int)
# store the hierarchy to do proper ordering
task_hierarchy = collections.defaultdict(list)
# store the ordering of tasks and groups
task_order = collections.defaultdict(int)
# store task aliases
task_group_alias = collections.defaultdict(dict)
# store num-fewshot value per task
num_fewshot = collections.defaultdict(int)
......@@ -440,32 +439,6 @@ def evaluate(
vals = vals_torch
if lm.rank == 0:
### Get task ordering for correct sample-wide aggregation
group_to_task = {}
for group in task_hierarchy.keys():
if group not in task_order:
task_order[group] = 0
if len(task_hierarchy[group]) > 0:
group_to_task[group] = task_hierarchy[group].copy()
for task in task_hierarchy[group]:
if task in task_order:
task_order[task] += 1
else:
task_order[task] = 1 + task_order[group]
if task in task_hierarchy:
group_to_task[group].remove(task)
group_to_task[group].extend(task_hierarchy[task])
task_to_group = {}
for group in group_to_task:
for task in group_to_task[group]:
if task in task_to_group:
task_to_group[task].append(group)
else:
task_to_group[task] = [group]
### Aggregate results over all datapoints ###
# aggregate results ; run bootstrap CIs
......@@ -492,10 +465,10 @@ def evaluate(
else bootstrap_iters,
)
if stderr is not None:
if stderr is not None and len(items) > 1:
results[task_name][metric + "_stderr" + "," + key] = stderr(items)
else:
results[task_name][metric + "_stderr" + "," + key] = 0
results[task_name][metric + "_stderr" + "," + key] = "N/A"
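# a single observation yields no variance estimate, so report "N/A" rather than a misleading 0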
if bool(results):
for group, task_list in reversed(task_hierarchy.items()):
......@@ -553,37 +526,36 @@ def evaluate(
results[group]["samples"] = total_size
def print_tasks(task_hierarchy, task_order, task_version, task_group_alias):
def print_tasks(task_hierarchy, tab=0):
results_agg = collections.defaultdict(dict)
groups_agg = collections.defaultdict(dict)
for group_name, task_list in task_hierarchy.items():
order = task_order[group_name]
results_agg[group_name] = results[group_name].copy()
results_agg[group_name]["tab"] = order
if (order < max(task_order.values())) and (len(task_list) > 0):
groups_agg[group_name] = results[group_name].copy()
groups_agg[group_name]["tab"] = order
(group_name, task_list), *_ = task_hierarchy.items()
task_list = sorted(task_list)
if task_list != []:
for task in sorted(task_list):
if task in task_hierarchy:
_task_hierarchy = {task: task_hierarchy[task]}
else:
_task_hierarchy = {task: []}
results_agg[group_name] = results[group_name].copy()
results_agg[group_name]["tab"] = tab
_results_agg, _groups_agg, task_version = print_tasks(
_task_hierarchy, task_order, task_version, task_group_alias
)
if len(task_list) > 0:
groups_agg[group_name] = results[group_name].copy()
groups_agg[group_name]["tab"] = tab
results_agg = {**results_agg, **_results_agg}
groups_agg = {**groups_agg, **_groups_agg}
for task_name in task_list:
if task_name in task_hierarchy:
_task_hierarchy = {
**{task_name: task_hierarchy[task_name]},
**task_hierarchy,
}
else:
_task_hierarchy = {task_name: []}
return results_agg, groups_agg, task_version
_results_agg, _groups_agg = print_tasks(_task_hierarchy, tab + 1)
results_agg = {**results_agg, **_results_agg}
groups_agg = {**groups_agg, **_groups_agg}
results_agg, groups_agg, versions = print_tasks(
task_hierarchy, task_order, versions, task_group_alias
)
return results_agg, groups_agg
results_agg, groups_agg = print_tasks(task_hierarchy)
for task in results_agg:
task_results = results_agg[task]
......
from collections import defaultdict
from typing import List, Tuple, Optional, Literal, Union
from typing import List, Tuple, Optional, Literal, Union, Any
from transformers import AutoTokenizer
from lm_eval.api.instance import Instance
from lm_eval.api.model import LM
import copy
......@@ -10,13 +10,22 @@ from lm_eval import utils
try:
from vllm import LLM, SamplingParams
from ray.util.multiprocessing import Pool
from vllm.transformers_utils.tokenizer import get_tokenizer
except ModuleNotFoundError:
pass
eval_logger = utils.eval_logger
# adapted from https://github.com/vllm-project/vllm/issues/367#issuecomment-1788341727
def run_inference_one_model(model_args: dict, sampling_params, requests: List[int]):
# gpu_id = [x for x in gpu_id]
# os.environ["CUDA_VISIBLE_DEVICES"]= str(gpu_id)
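# each data-parallel worker process constructs its own vLLM engine and generates over its shard of the requests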
llm = LLM(**model_args)
return llm.generate(prompt_token_ids=requests, sampling_params=sampling_params)
@register_model("vllm")
class VLLM(LM):
_DEFAULT_MAX_LENGTH = 2048
......@@ -27,7 +36,9 @@ class VLLM(LM):
dtype: Literal["float16", "bfloat16", "float32", "auto"] = "auto",
revision: Optional[str] = None,
trust_remote_code: Optional[bool] = False,
tokenizer: Optional[str] = None,
tokenizer_mode: Literal["auto", "slow"] = "auto",
tokenizer_revision: Optional[str] = None,
tensor_parallel_size: int = 1,
quantization: Optional[Literal["awq"]] = None,
max_gen_toks: int = 256,
......@@ -38,6 +49,7 @@ class VLLM(LM):
seed: int = 1234,
gpu_memory_utilization: float = 0.9,
device: str = "cuda",
data_parallel_size: int = 1,
):
super().__init__()
......@@ -50,19 +62,32 @@ please install vllm via `pip install lm-eval[vllm]` or `pip install -e .[vllm]`"
)
assert "cuda" in device or device is None, "vLLM only supports CUDA"
self.model = LLM(
model=pretrained,
gpu_memory_utilization=float(gpu_memory_utilization),
revision=revision,
dtype=dtype,
self.tensor_parallel_size = int(tensor_parallel_size)
self.data_parallel_size = int(data_parallel_size)
self.model_args = {
"model": pretrained,
"gpu_memory_utilization": float(gpu_memory_utilization),
"revision": revision,
"dtype": dtype,
"tokenizer": tokenizer,
"tokenizer_mode": tokenizer_mode,
"tokenizer_revision": tokenizer_revision,
"trust_remote_code": trust_remote_code,
"tensor_parallel_size": int(tensor_parallel_size),
"swap_space": int(swap_space),
"quantization": quantization,
"seed": int(seed),
}
if self.data_parallel_size <= 1:
self.model = LLM(**self.model_args)
else:
self.model_args["worker_use_ray"] = True
self.tokenizer = get_tokenizer(
tokenizer if tokenizer else pretrained,
tokenizer_mode=tokenizer_mode,
trust_remote_code=trust_remote_code,
tensor_parallel_size=int(tensor_parallel_size),
swap_space=int(swap_space),
quantization=quantization,
seed=int(seed),
tokenizer_revision=tokenizer_revision,
)
self.tokenizer = self.model.get_tokenizer()
self.batch_size = batch_size
self._max_length = max_length
self._max_gen_toks = max_gen_toks
......@@ -76,8 +101,8 @@ please install vllm via `pip install lm-eval[vllm]` or `pip install -e .[vllm]`"
def max_length(self):
if self._max_length: # if max length manually set, return it
return self._max_length
if hasattr(self.model.llm_engine.model_config, "max_model_len"):
return self.model.llm_engine.model_config.max_model_len
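# when data parallelism is enabled the engine is not held by this process, so fall back to the tokenizer's reported maximum length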
if hasattr(self.tokenizer, "model_max_length"):
return self.tokenizer.model_max_length
return self._DEFAULT_MAX_LENGTH
@property
......@@ -104,7 +129,7 @@ please install vllm via `pip install lm-eval[vllm]` or `pip install -e .[vllm]`"
def _model_generate(
self,
requests: List[int] = None,
requests: List[List[int]] = None,
generate: bool = False,
max_tokens: int = None,
stop: Optional[List[str]] = None,
......@@ -114,25 +139,50 @@ please install vllm via `pip install lm-eval[vllm]` or `pip install -e .[vllm]`"
if "do_sample" in kwargs.keys():
kwargs.pop("do_sample")
if generate:
generate_sampling_params = SamplingParams(
max_tokens=max_tokens, stop=stop, **kwargs
)
outputs = self.model.generate(
prompt_token_ids=requests,
sampling_params=generate_sampling_params,
use_tqdm=use_tqdm,
# hf defaults
kwargs["skip_special_tokens"] = kwargs.get("skip_special_tokens", False)
kwargs["spaces_between_special_tokens"] = kwargs.get(
"spaces_between_special_tokens", False
)
sampling_params = SamplingParams(max_tokens=max_tokens, stop=stop, **kwargs)
else:
logliklihood_sampling_params = SamplingParams(
sampling_params = SamplingParams(
temperature=0, prompt_logprobs=2, max_tokens=1
)
outputs = self.model.generate(
prompt_token_ids=requests,
sampling_params=logliklihood_sampling_params,
use_tqdm=use_tqdm,
)
if self.data_parallel_size > 1:
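# split the requests into data_parallel_size shards; each shard is dispatched to a separate worker
# via a Ray multiprocessing Pool, and each worker builds its own engine (run_inference_one_model)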
requests = [
list(x) for x in utils.divide(requests, self.data_parallel_size)
]
inputs = [(self.model_args, sampling_params, req) for req in requests]
with Pool(self.data_parallel_size) as pool:
results = pool.starmap(run_inference_one_model, inputs)
# flatten results
return [item for sublist in results for item in sublist]
outputs = self.model.generate(
prompt_token_ids=requests,
sampling_params=sampling_params,
use_tqdm=use_tqdm,
)
return outputs
def _encode_pair(
self, context: str, continuation: str
) -> Tuple[List[int], List[int]]:
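# move any trailing whitespace from the context onto the continuation so that encoding the two pieces separately matches encoding their concatenation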
n_spaces = len(context) - len(context.rstrip())
if n_spaces > 0:
continuation = context[-n_spaces:] + continuation
context = context[:-n_spaces]
whole_enc = self.tok_encode(context + continuation, add_special_tokens=False)
context_enc = self.tok_encode(context, add_special_tokens=False)
context_enc_len = len(context_enc)
continuation_enc = whole_enc[context_enc_len:]
return context_enc, continuation_enc
def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]:
new_reqs = []
for context, continuation in [req.args for req in requests]:
......@@ -142,12 +192,7 @@ please install vllm via `pip install lm-eval[vllm]` or `pip install -e .[vllm]`"
continuation
)
else:
context_enc, continuation_enc = self.tokenizer(
[context, continuation],
truncation="do_not_truncate",
add_special_tokens=False,
return_attention_mask=False,
).input_ids
context_enc, continuation_enc = self._encode_pair(context, continuation)
new_reqs.append(((context, continuation), context_enc, continuation_enc))
......@@ -188,7 +233,7 @@ please install vllm via `pip install lm-eval[vllm]` or `pip install -e .[vllm]`"
# batch tokenize contexts
context, all_gen_kwargs = zip(*(req.args for req in requests))
context_encoding = self.tokenizer(context).input_ids
context_encoding = self.tokenizer(context, add_special_tokens=False).input_ids
requests = [
((a, b), c) for a, b, c in zip(context, context_encoding, all_gen_kwargs)
]
......
......@@ -3,4 +3,4 @@ group: arc_easy_alt_ov_01
task: arc_easy_alt_ov_01a
doc_to_text: !function ../styles.template_01
doc_to_choice: !function ../styles.choice_01a
doc_to_decontamination_query: !function ../styles.template_01
\ No newline at end of file
doc_to_decontamination_query: !function ../styles.template_01
......@@ -3,4 +3,4 @@ group: arc_easy_alt_ov_01
task: arc_easy_alt_ov_01b
doc_to_text: !function ../styles.template_01
doc_to_choice: !function ../styles.choice_01b
doc_to_decontamination_query: !function ../styles.template_01
\ No newline at end of file
doc_to_decontamination_query: !function ../styles.template_01
......@@ -3,4 +3,4 @@ group: arc_easy_alt_ov_01
task: arc_easy_alt_ov_01c
doc_to_text: !function ../styles.template_01
doc_to_choice: !function ../styles.choice_01c
doc_to_decontamination_query: !function ../styles.template_01
\ No newline at end of file
doc_to_decontamination_query: !function ../styles.template_01
......@@ -3,4 +3,4 @@ group: arc_easy_alt_ov_02
task: arc_easy_alt_ov_02a
doc_to_text: !function ../styles.template_02
doc_to_choice: !function ../styles.choice_02a
doc_to_decontamination_query: !function ../styles.template_02
\ No newline at end of file
doc_to_decontamination_query: !function ../styles.template_02
......@@ -3,4 +3,4 @@ group: arc_easy_alt_ov_02
task: arc_easy_alt_ov_02b
doc_to_text: !function ../styles.template_02
doc_to_choice: !function ../styles.choice_02b
doc_to_decontamination_query: !function ../styles.template_02
\ No newline at end of file
doc_to_decontamination_query: !function ../styles.template_02
......@@ -3,4 +3,4 @@ group: arc_easy_alt_ov_02
task: arc_easy_alt_ov_02c
doc_to_text: !function ../styles.template_02
doc_to_choice: !function ../styles.choice_02c
doc_to_decontamination_query: !function ../styles.template_02
\ No newline at end of file
doc_to_decontamination_query: !function ../styles.template_02
......@@ -3,4 +3,4 @@ group: arc_easy_alt_ov_03
task: arc_easy_alt_ov_03a
doc_to_text: !function ../styles.template_03
doc_to_choice: !function ../styles.choice_03a
doc_to_decontamination_query: !function ../styles.template_03
\ No newline at end of file
doc_to_decontamination_query: !function ../styles.template_03
......@@ -3,4 +3,4 @@ group: arc_easy_alt_ov_03
task: arc_easy_alt_ov_03b
doc_to_text: !function ../styles.template_03
doc_to_choice: !function ../styles.choice_03b
doc_to_decontamination_query: !function ../styles.template_03
\ No newline at end of file
doc_to_decontamination_query: !function ../styles.template_03
......@@ -3,4 +3,4 @@ group: arc_easy_alt_ov_03
task: arc_easy_alt_ov_03c
doc_to_text: !function ../styles.template_03
doc_to_choice: !function ../styles.choice_03c
doc_to_decontamination_query: !function ../styles.template_03
\ No newline at end of file
doc_to_decontamination_query: !function ../styles.template_03
......@@ -3,4 +3,4 @@ group: arc_easy_alt_ov_04
task: arc_easy_alt_ov_04a
doc_to_text: !function ../styles.template_04
doc_to_choice: !function ../styles.choice_04a
doc_to_decontamination_query: !function ../styles.template_04
\ No newline at end of file
doc_to_decontamination_query: !function ../styles.template_04