Unverified Commit aed90773 authored by Hailey Schoelkopf, committed by GitHub

Merge pull request #1035 from baberabb/big-refactor_dp

[Refactor] vllm data parallel
parents 61b0cd29 d588a466
@@ -19,6 +19,7 @@ repos:
- id: no-commit-to-branch
- id: requirements-txt-fixer
- id: trailing-whitespace
args: [--markdown-linebreak-ext=md]
- id: fix-byte-order-marker
exclude: docs/CNAME
- id: fix-encoding-pragma
......
@@ -45,7 +45,7 @@ cd lm-evaluation-harness
pip install -e .
```
We also provide a number of optional dependencies for . Extras can be installed via `pip install -e ".[NAME]"`
We also provide a number of optional dependencies for extended functionality. Extras can be installed via `pip install -e ".[NAME]"`
| Name | Use |
| ------------- | ------------------------------------- |
@@ -126,18 +126,21 @@ To use `accelerate` with the `lm-eval` command, use
accelerate launch --no_python lm-eval --model ...
```
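For instance, a full invocation might look like the following (the model, task, and batch size here are placeholders, not values from the original example):
```bash
accelerate launch --no_python lm-eval \
    --model hf \
    --model_args pretrained=EleutherAI/pythia-70m \
    --tasks lambada_openai \
    --batch_size 16
```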
### Tensor Parallel + Optimized Inference with vLLM
We also support vLLM for faster inference on [supported model types](https://docs.vllm.ai/en/latest/models/supported_models.html).
### Tensor + Data Parallel and Optimized Inference with `vLLM`
We also support vLLM for faster inference on [supported model types](https://docs.vllm.ai/en/latest/models/supported_models.html), on a single GPU or on multiple GPUs with tensor parallelism, data parallelism, or a combination of both. For example:
```bash
lm_eval --model vllm \
--model_args pretrained={model_name},tensor_parallel_size={number of GPUs to use},dtype=auto,gpu_memory_utilization=0.8 \
--model_args pretrained={model_name},tensor_parallel_size={GPUs_per_model},dtype=auto,gpu_memory_utilization=0.8,data_parallel_size={model_replicas} \
--tasks lambada_openai \
--batch_size auto
```
For a full list of supported vLLM configurations, please reference our vLLM integration and the vLLM documentation.
vLLM occasionally differs in output from Hugging Face. We treat Hugging Face as the reference implementation, and provide a script at [./scripts/model_comparator.py](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/scripts/model_comparator.py) for checking the validity of vLLM results against HF.
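For instance, a comparison run can be launched as follows (the model and tasks shown are simply the script's defaults and are illustrative only):
```bash
python ./scripts/model_comparator.py \
    --pretrained EleutherAI/pythia-70m \
    --tasks arc_easy,hellaswag \
    --batch 8
```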
### Model APIs and Inference Servers
Our library also supports the evaluation of models served via several commercial APIs, and we hope to implement support for the most commonly used performant local/self-hosted inference servers.
@@ -178,7 +181,6 @@ If you have a Metal compatible Mac, you can run the eval harness using the MPS b
> [!Note]
> You can inspect what the LM inputs look like by running the following command:
>
> ```bash
> python write_out.py \
> --tasks all_tasks \
@@ -186,7 +188,6 @@ If you have a Metal compatible Mac, you can run the eval harness using the MPS b
> --num_examples 10 \
> --output_base_path /path/to/output/folder
> ```
>
> This will write out one text file for each task.
To verify the data integrity of the tasks you're performing in addition to running the tasks themselves, you can use the `--check_integrity` flag:
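For example (a minimal sketch; the model and task below are placeholders):
```bash
lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-70m \
    --tasks lambada_openai \
    --check_integrity
```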
@@ -222,19 +223,17 @@ To save evaluation results provide an `--output_path`. We also support logging m
Additionally, one can provide a directory with `--use_cache` to cache the results of prior runs. This allows you to avoid repeated execution of the same (model, task) pairs for re-scoring.
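As an illustrative sketch (the paths, model, and task here are placeholders):
```bash
lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-70m \
    --tasks lambada_openai \
    --output_path ./results \
    --use_cache ./lm_cache
```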
For a full list of supported arguments, check out the [interface](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/interface.md) guide in our documentation!
For a full list of supported arguments, check out the [interface](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md) guide in our documentation!
## How to Contribute or Learn More?
For more information on the library and how everything fits together, check out all of our [documentation pages](https://github.com/EleutherAI/lm-evaluation-harness/tree/big-refactor/docs)! We plan to post a larger roadmap of desired + planned library improvements soon, with more information on how contributors can help.
You can also ask for help, or discuss new features with the maintainers in the #lm-thunderdome channel of the EleutherAI discord! If you've used the library and have had a positive (or negative) experience, we'd love to hear from you!
### Implementing new tasks
To implement a new task in the eval harness, see [this guide](./docs/new_task_guide.md).
In general, we following the following priority list for addressing concerns about prompting and other eval details:
In general, we follow this priority list for addressing concerns about prompting and other eval details:
1. If there is widespread agreement among people who train LLMs, use the agreed upon procedure.
2. If there is a clear and unambiguous official implementation, use that procedure.
3. If there is widespread agreement among people who evaluate LLMs, use the agreed upon procedure.
@@ -242,11 +241,11 @@ In general, we following the following priority list for addressing concerns abo
These are guidelines and not rules, and can be overruled in special circumstances.
We try to prioritize agreement with the procedures used by other groups to decrease the harm when people inevitably compare runs across different papers despite our discouragement of the practice. Historically, we also prioritized the implementation from "Language Models are Few Shot Learners" as our original goal was specifically to compare results with that paper.
We try to prioritize agreement with the procedures used by other groups to decrease the harm when people inevitably compare runs across different papers despite our discouragement of the practice. Historically, we also prioritized the implementation from [Language Models are Few Shot Learners](https://arxiv.org/abs/2005.14165) as our original goal was specifically to compare results with that paper.
### Support
The best way to get support is to open an issue on this repo or join the [EleutherAI discord server](https://discord.gg/eleutherai). The `#lm-thunderdome` channel is dedicated to developing this project and the `#release-discussion` channel is for receiving support for our releases.
The best way to get support is to open an issue on this repo or join the [EleutherAI Discord server](https://discord.gg/eleutherai). The `#lm-thunderdome` channel is dedicated to developing this project and the `#release-discussion` channel is for receiving support for our releases. If you've used the library and have had a positive (or negative) experience, we'd love to hear from you!
## Cite as
......
from collections import defaultdict
from typing import List, Tuple, Optional, Literal, Union
from typing import List, Tuple, Optional, Literal, Union, Any
from transformers import AutoTokenizer
from lm_eval.api.instance import Instance
from lm_eval.api.model import LM
import copy
@@ -10,13 +10,22 @@ from lm_eval import utils
try:
from vllm import LLM, SamplingParams
from ray.util.multiprocessing import Pool
from vllm.transformers_utils.tokenizer import get_tokenizer
except ModuleNotFoundError:
pass
eval_logger = utils.eval_logger
# adapted from https://github.com/vllm-project/vllm/issues/367#issuecomment-1788341727
def run_inference_one_model(model_args: dict, sampling_params, requests: List[List[int]]):
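# Runs in a separate worker process per data-parallel replica: builds a fresh LLM
# engine from `model_args` and generates completions for this shard of requests.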
# gpu_id = [x for x in gpu_id]
# os.environ["CUDA_VISIBLE_DEVICES"]= str(gpu_id)
llm = LLM(**model_args)
return llm.generate(prompt_token_ids=requests, sampling_params=sampling_params)
@register_model("vllm")
class VLLM(LM):
_DEFAULT_MAX_LENGTH = 2048
@@ -27,7 +36,9 @@ class VLLM(LM):
dtype: Literal["float16", "bfloat16", "float32", "auto"] = "auto",
revision: Optional[str] = None,
trust_remote_code: Optional[bool] = False,
tokenizer: Optional[str] = None,
tokenizer_mode: Literal["auto", "slow"] = "auto",
tokenizer_revision: Optional[str] = None,
tensor_parallel_size: int = 1,
quantization: Optional[Literal["awq"]] = None,
max_gen_toks: int = 256,
@@ -38,6 +49,7 @@ class VLLM(LM):
seed: int = 1234,
gpu_memory_utilization: float = 0.9,
device: str = "cuda",
data_parallel_size: int = 1,
):
super().__init__()
@@ -50,19 +62,32 @@ please install vllm via `pip install lm-eval[vllm]` or `pip install -e .[vllm]`"
)
assert "cuda" in device or device is None, "vLLM only supports CUDA"
self.model = LLM(
model=pretrained,
gpu_memory_utilization=float(gpu_memory_utilization),
revision=revision,
dtype=dtype,
self.tensor_parallel_size = int(tensor_parallel_size)
self.data_parallel_size = int(data_parallel_size)
self.model_args = {
"model": pretrained,
"gpu_memory_utilization": float(gpu_memory_utilization),
"revision": revision,
"dtype": dtype,
"tokenizer": tokenizer,
"tokenizer_mode": tokenizer_mode,
"tokenizer_revision": tokenizer_revision,
"trust_remote_code": trust_remote_code,
"tensor_parallel_size": int(tensor_parallel_size),
"swap_space": int(swap_space),
"quantization": quantization,
"seed": int(seed),
}
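# With a single replica, the engine is built eagerly below; with data parallelism,
# engine construction is deferred to the per-replica worker processes spawned in
# _model_generate, and worker_use_ray tells vLLM to launch its workers via Ray.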
if self.data_parallel_size <= 1:
self.model = LLM(**self.model_args)
else:
self.model_args["worker_use_ray"] = True
self.tokenizer = get_tokenizer(
tokenizer if tokenizer else pretrained,
tokenizer_mode=tokenizer_mode,
trust_remote_code=trust_remote_code,
tensor_parallel_size=int(tensor_parallel_size),
swap_space=int(swap_space),
quantization=quantization,
seed=int(seed),
tokenizer_revision=tokenizer_revision,
)
self.tokenizer = self.model.get_tokenizer()
self.batch_size = batch_size
self._max_length = max_length
self._max_gen_toks = max_gen_toks
@@ -76,8 +101,8 @@ please install vllm via `pip install lm-eval[vllm]` or `pip install -e .[vllm]`"
def max_length(self):
if self._max_length: # if max length manually set, return it
return self._max_length
if hasattr(self.model.llm_engine.model_config, "max_model_len"):
return self.model.llm_engine.model_config.max_model_len
if hasattr(self.tokenizer, "model_max_length"):
return self.tokenizer.model_max_length
return self._DEFAULT_MAX_LENGTH
@property
@@ -104,7 +129,7 @@ please install vllm via `pip install lm-eval[vllm]` or `pip install -e .[vllm]`"
def _model_generate(
self,
requests: List[int] = None,
requests: List[List[int]] = None,
generate: bool = False,
max_tokens: int = None,
stop: Optional[List[str]] = None,
@@ -114,25 +139,50 @@ please install vllm via `pip install lm-eval[vllm]` or `pip install -e .[vllm]`"
if "do_sample" in kwargs.keys():
kwargs.pop("do_sample")
if generate:
generate_sampling_params = SamplingParams(
max_tokens=max_tokens, stop=stop, **kwargs
)
outputs = self.model.generate(
prompt_token_ids=requests,
sampling_params=generate_sampling_params,
use_tqdm=use_tqdm,
# hf defaults
kwargs["skip_special_tokens"] = kwargs.get("skip_special_tokens", False)
kwargs["spaces_between_special_tokens"] = kwargs.get(
"spaces_between_special_tokens", False
)
sampling_params = SamplingParams(max_tokens=max_tokens, stop=stop, **kwargs)
else:
logliklihood_sampling_params = SamplingParams(
sampling_params = SamplingParams(
temperature=0, prompt_logprobs=2, max_tokens=1
)
outputs = self.model.generate(
prompt_token_ids=requests,
sampling_params=logliklihood_sampling_params,
use_tqdm=use_tqdm,
)
if self.data_parallel_size > 1:
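# Split the requests into one shard per replica and run each shard in its own
# process via Ray's multiprocessing Pool (each process builds its own engine).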
requests = [
list(x) for x in utils.divide(requests, self.data_parallel_size)
]
inputs = [(self.model_args, sampling_params, req) for req in requests]
with Pool(self.data_parallel_size) as pool:
results = pool.starmap(run_inference_one_model, inputs)
# flatten results
return [item for sublist in results for item in sublist]
outputs = self.model.generate(
prompt_token_ids=requests,
sampling_params=sampling_params,
use_tqdm=use_tqdm,
)
return outputs
def _encode_pair(
self, context: str, continuation: str
) -> Tuple[List[int], List[int]]:
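# Move any trailing whitespace from the context onto the continuation, then slice
# the continuation tokens out of the encoding of the full string so the split is
# consistent with how the concatenated text is tokenized.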
n_spaces = len(context) - len(context.rstrip())
if n_spaces > 0:
continuation = context[-n_spaces:] + continuation
context = context[:-n_spaces]
whole_enc = self.tok_encode(context + continuation, add_special_tokens=False)
context_enc = self.tok_encode(context, add_special_tokens=False)
context_enc_len = len(context_enc)
continuation_enc = whole_enc[context_enc_len:]
return context_enc, continuation_enc
def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]:
new_reqs = []
for context, continuation in [req.args for req in requests]:
@@ -142,12 +192,7 @@ please install vllm via `pip install lm-eval[vllm]` or `pip install -e .[vllm]`"
continuation
)
else:
context_enc, continuation_enc = self.tokenizer(
[context, continuation],
truncation="do_not_truncate",
add_special_tokens=False,
return_attention_mask=False,
).input_ids
context_enc, continuation_enc = self._encode_pair(context, continuation)
new_reqs.append(((context, continuation), context_enc, continuation_enc))
@@ -188,7 +233,7 @@ please install vllm via `pip install lm-eval[vllm]` or `pip install -e .[vllm]`"
# batch tokenize contexts
context, all_gen_kwargs = zip(*(req.args for req in requests))
context_encoding = self.tokenizer(context).input_ids
context_encoding = self.tokenizer(context, add_special_tokens=False).input_ids
requests = [
((a, b), c) for a, b, c in zip(context, context_encoding, all_gen_kwargs)
]
......
@@ -664,3 +664,55 @@ def stop_sequences_criteria(
],
]
)
# from more_itertools
def divide(iterable, n) -> List[Iterator]:
"""Divide the elements from *iterable* into *n* parts, maintaining
order.
>>> group_1, group_2 = divide(2, [1, 2, 3, 4, 5, 6])
>>> list(group_1)
[1, 2, 3]
>>> list(group_2)
[4, 5, 6]
If the length of *iterable* is not evenly divisible by *n*, then the
length of the returned iterables will not be identical:
>>> children = divide(3, [1, 2, 3, 4, 5, 6, 7])
>>> [list(c) for c in children]
[[1, 2, 3], [4, 5], [6, 7]]
If the length of the iterable is smaller than n, then the last returned
iterables will be empty:
>>> children = divide(5, [1, 2, 3])
>>> [list(c) for c in children]
[[1], [2], [3], [], []]
This function will exhaust the iterable before returning and may require
significant storage. If order is not important, see :func:`distribute`,
which does not first pull the iterable into memory.
"""
if n < 1:
raise ValueError("n must be at least 1")
try:
iterable[:0]
except TypeError:
seq = tuple(iterable)
else:
seq = iterable
q, r = divmod(len(seq), n)
ret = []
stop = 0
for i in range(1, n + 1):
start = stop
stop += q + 1 if i <= r else q
ret.append(iter(seq[start:stop]))
return ret
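# In this codebase, divide() shards tokenized requests across data-parallel vLLM
# replicas, e.g. utils.divide(requests, self.data_parallel_size) in VLLM._model_generate.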
import argparse
import numpy as np
import lm_eval.evaluator
from lm_eval import tasks
import scipy.stats
from typing import Tuple, Dict, List
import pandas as pd
import torch
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
eval_logger = lm_eval.utils.eval_logger
def calculate_z_value(res1: Dict, res2: Dict) -> Tuple[float, float]:
acc1, acc2 = res1["acc,none"], res2["acc,none"]
st_err1, st_err2 = res1["acc_stderr,none"], res2["acc_stderr,none"]
Z = (acc1 - acc2) / np.sqrt((st_err1**2) + (st_err2**2))
# Determining the p-value
p_value = 2 * scipy.stats.norm.sf(abs(Z)) # two-tailed test
return Z, p_value
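# Illustrative check with hypothetical numbers (not from this script): if
# acc1=0.60, acc2=0.58 and both standard errors are 0.01, then
# Z = 0.02 / sqrt(0.01**2 + 0.01**2) ≈ 1.41 and p ≈ 0.16, which would not be
# flagged as significant at the default alpha = 0.05.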
def print_results(
data_to_print: List = None, results_dict: Dict = None, alpha: float = None
):
model1_data = data_to_print[0]
model2_data = data_to_print[1]
table_data = []
for task in model1_data.keys():
row = {
"Task": task,
"HF Accuracy": model1_data[task]["acc,none"],
"vLLM Accuracy": model2_data[task]["acc,none"],
"HF StdErr": model1_data[task]["acc_stderr,none"],
"vLLM StdErr": model2_data[task]["acc_stderr,none"],
}
table_data.append(row)
comparison_df = pd.DataFrame(table_data)
comparison_df["Z-Score"] = comparison_df["Task"].apply(
lambda task: results_dict[task]["z"]
)
comparison_df["P-Value"] = comparison_df["Task"].apply(
lambda task: results_dict[task]["p_value"]
)
comparison_df[f"p > {alpha}"] = comparison_df["P-Value"].apply(
lambda p: "✓" if p > alpha else "×"
)
return comparison_df
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument(
"--pretrained", default="EleutherAI/pythia-70m", help="name of model to compare"
)
parser.add_argument(
"--hf_args", help="huggingface model args <arg>=<value>", default=""
)
parser.add_argument("--vllm_args", help="vllm model args <arg>=<value>", default="")
parser.add_argument("--tasks", type=str, default="arc_easy,hellaswag")
parser.add_argument(
"--limit",
type=float,
default=100,
)
parser.add_argument(
"--alpha",
type=float,
default=0.05,
help="Significance level for two-tailed z-test",
)
parser.add_argument(
"--device",
type=str,
default="cuda",
)
parser.add_argument(
"--batch",
type=str,
default=8,
)
parser.add_argument(
"--verbosity",
type=str,
default="INFO",
help="Logging verbosity",
)
return parser.parse_args()
if __name__ == "__main__":
tasks.initialize_tasks()
args = parse_args()
tasks = args.tasks.split(",")
print(tasks)
hf_args, vllm_args = "," + args.hf_args, "," + args.vllm_args
results_vllm = lm_eval.evaluator.simple_evaluate(
model="vllm",
model_args=f"pretrained={args.pretrained}" + vllm_args,
tasks=tasks,
limit=args.limit,
device=args.device,
batch_size=args.batch,
)
torch.cuda.empty_cache()
results_hf = lm_eval.evaluator.simple_evaluate(
model="hf",
model_args=f"pretrained={args.pretrained}" + hf_args,
tasks=tasks,
limit=args.limit,
device=args.device,
batch_size=args.batch,
)
all_res = {}
for task1, task2 in zip(
results_hf["results"].items(), results_vllm["results"].items()
):
assert task1[0] == task2[0]
z, p_value = calculate_z_value(task1[1], task2[1])
all_res[task1[0]] = {"z": z, "p_value": p_value}
df = print_results(
[results_hf["results"], results_vllm["results"]], all_res, args.alpha
)
print(df)