Unverified Commit aed90773 authored by Hailey Schoelkopf, committed by GitHub

Merge pull request #1035 from baberabb/big-refactor_dp

[Refactor] vllm data parallel
parents 61b0cd29 d588a466
@@ -19,6 +19,7 @@ repos:
- id: no-commit-to-branch
- id: requirements-txt-fixer
- id: trailing-whitespace
args: [--markdown-linebreak-ext=md]
- id: fix-byte-order-marker
exclude: docs/CNAME
- id: fix-encoding-pragma
......
@@ -45,7 +45,7 @@ cd lm-evaluation-harness
pip install -e .
```
We also provide a number of optional dependencies for . Extras can be installed via `pip install -e ".[NAME]"`
We also provide a number of optional dependencies for extended functionality. Extras can be installed via `pip install -e ".[NAME]"`
| Name | Use |
| ------------- | ------------------------------------- |
@@ -126,18 +126,21 @@ To use `accelerate` with the `lm-eval` command, use
accelerate launch --no_python lm-eval --model ...
```
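For instance, a full invocation might look like the following (the model, task, and batch size here are placeholders, not values from the original example):
```bash
accelerate launch --no_python lm-eval \
    --model hf \
    --model_args pretrained=EleutherAI/pythia-70m \
    --tasks lambada_openai \
    --batch_size 16
```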
### Tensor Parallel + Optimized Inference with vLLM
We also support vLLM for faster inference on [supported model types](https://docs.vllm.ai/en/latest/models/supported_models.html).
### Tensor + Data Parallel and Optimized Inference with `vLLM`
We also support vLLM for faster inference on [supported model types](https://docs.vllm.ai/en/latest/models/supported_models.html), on a single GPU or on multiple GPUs with tensor parallelism, data parallelism, or a combination of both. For example:
```bash
lm_eval --model vllm \
--model_args pretrained={model_name},tensor_parallel_size={number of GPUs to use},dtype=auto,gpu_memory_utilization=0.8 \
--model_args pretrained={model_name},tensor_parallel_size={GPUs_per_model},dtype=auto,gpu_memory_utilization=0.8,data_parallel_size={model_replicas} \
--tasks lambada_openai \
--batch_size auto
```
For a full list of supported vLLM configurations, please reference our vLLM integration and the vLLM documentation.
vLLM occasionally differs in output from Hugging Face. We treat Hugging Face as the reference implementation, and provide a script at [./scripts/model_comparator.py](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/scripts/model_comparator.py) for checking the validity of vLLM results against HF.
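For instance, a comparison run can be launched as follows (the model and tasks shown are simply the script's defaults and are illustrative only):
```bash
python ./scripts/model_comparator.py \
    --pretrained EleutherAI/pythia-70m \
    --tasks arc_easy,hellaswag \
    --batch 8
```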
### Model APIs and Inference Servers
Our library also supports the evaluation of models served via several commercial APIs, and we hope to implement support for the most commonly used performant local/self-hosted inference servers.
@@ -178,7 +181,6 @@ If you have a Metal compatible Mac, you can run the eval harness using the MPS b
> [!Note]
> You can inspect what the LM inputs look like by running the following command:
>
> ```bash
> python write_out.py \
> --tasks all_tasks \
@@ -186,7 +188,6 @@ If you have a Metal compatible Mac, you can run the eval harness using the MPS b
> --num_examples 10 \
> --output_base_path /path/to/output/folder
> ```
>
> This will write out one text file for each task.
To verify the data integrity of the tasks you're performing in addition to running the tasks themselves, you can use the `--check_integrity` flag:
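For example (a minimal sketch; the model and task below are placeholders):
```bash
lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-70m \
    --tasks lambada_openai \
    --check_integrity
```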
@@ -222,19 +223,17 @@ To save evaluation results provide an `--output_path`. We also support logging m
Additionally, one can provide a directory with `--use_cache` to cache the results of prior runs. This allows you to avoid repeated execution of the same (model, task) pairs for re-scoring.
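As an illustrative sketch (the paths, model, and task here are placeholders):
```bash
lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-70m \
    --tasks lambada_openai \
    --output_path ./results \
    --use_cache ./lm_cache
```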
For a full list of supported arguments, check out the [interface](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/interface.md) guide in our documentation!
For a full list of supported arguments, check out the [interface](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md) guide in our documentation!
## How to Contribute or Learn More?
For more information on the library and how everything fits together, check out all of our [documentation pages](https://github.com/EleutherAI/lm-evaluation-harness/tree/big-refactor/docs)! We plan to post a larger roadmap of desired + planned library improvements soon, with more information on how contributors can help.
You can also ask for help, or discuss new features with the maintainers in the #lm-thunderdome channel of the EleutherAI discord! If you've used the library and have had a positive (or negative) experience, we'd love to hear from you!
### Implementing new tasks
To implement a new task in the eval harness, see [this guide](./docs/new_task_guide.md).
In general, we following the following priority list for addressing concerns about prompting and other eval details:
In general, we follow this priority list for addressing concerns about prompting and other eval details:
1. If there is widespread agreement among people who train LLMs, use the agreed upon procedure.
2. If there is a clear and unambiguous official implementation, use that procedure.
3. If there is widespread agreement among people who evaluate LLMs, use the agreed upon procedure.
@@ -242,11 +241,11 @@ In general, we following the following priority list for addressing concerns abo
These are guidelines and not rules, and can be overruled in special circumstances.
We try to prioritize agreement with the procedures used by other groups to decrease the harm when people inevitably compare runs across different papers despite our discouragement of the practice. Historically, we also prioritized the implementation from "Language Models are Few Shot Learners" as our original goal was specifically to compare results with that paper.
We try to prioritize agreement with the procedures used by other groups to decrease the harm when people inevitably compare runs across different papers despite our discouragement of the practice. Historically, we also prioritized the implementation from [Language Models are Few Shot Learners](https://arxiv.org/abs/2005.14165) as our original goal was specifically to compare results with that paper.
### Support
The best way to get support is to open an issue on this repo or join the [EleutherAI discord server](https://discord.gg/eleutherai). The `#lm-thunderdome` channel is dedicated to developing this project and the `#release-discussion` channel is for receiving support for our releases.
The best way to get support is to open an issue on this repo or join the [EleutherAI Discord server](https://discord.gg/eleutherai). The `#lm-thunderdome` channel is dedicated to developing this project and the `#release-discussion` channel is for receiving support for our releases. If you've used the library and have had a positive (or negative) experience, we'd love to hear from you!
## Cite as
......
from collections import defaultdict
from typing import List, Tuple, Optional, Literal, Union
from typing import List, Tuple, Optional, Literal, Union, Any
from transformers import AutoTokenizer
from lm_eval.api.instance import Instance
from lm_eval.api.model import LM
import copy
@@ -10,13 +10,22 @@ from lm_eval import utils
try:
from vllm import LLM, SamplingParams
from ray.util.multiprocessing import Pool
from vllm.transformers_utils.tokenizer import get_tokenizer
except ModuleNotFoundError:
pass
eval_logger = utils.eval_logger
# adapted from https://github.com/vllm-project/vllm/issues/367#issuecomment-1788341727
def run_inference_one_model(model_args: dict, sampling_params, requests: List[List[int]]):
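# Runs in a separate worker process per data-parallel replica: builds a fresh LLM
# engine from `model_args` and generates completions for this shard of requests.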
# gpu_id = [x for x in gpu_id]
# os.environ["CUDA_VISIBLE_DEVICES"]= str(gpu_id)
llm = LLM(**model_args)
return llm.generate(prompt_token_ids=requests, sampling_params=sampling_params)
@register_model("vllm")
class VLLM(LM):
_DEFAULT_MAX_LENGTH = 2048
@@ -27,7 +36,9 @@ class VLLM(LM):
dtype: Literal["float16", "bfloat16", "float32", "auto"] = "auto",
revision: Optional[str] = None,
trust_remote_code: Optional[bool] = False,
tokenizer: Optional[str] = None,
tokenizer_mode: Literal["auto", "slow"] = "auto",
tokenizer_revision: Optional[str] = None,
tensor_parallel_size: int = 1,
quantization: Optional[Literal["awq"]] = None,
max_gen_toks: int = 256,
@@ -38,6 +49,7 @@ class VLLM(LM):
seed: int = 1234,
gpu_memory_utilization: float = 0.9,
device: str = "cuda",
data_parallel_size: int = 1,
):
super().__init__()
@@ -50,19 +62,32 @@ please install vllm via `pip install lm-eval[vllm]` or `pip install -e .[vllm]`"
)
assert "cuda" in device or device is None, "vLLM only supports CUDA"
self.model = LLM(
model=pretrained,
gpu_memory_utilization=float(gpu_memory_utilization),
revision=revision,
dtype=dtype,
self.tensor_parallel_size = int(tensor_parallel_size)
self.data_parallel_size = int(data_parallel_size)
self.model_args = {
"model": pretrained,
"gpu_memory_utilization": float(gpu_memory_utilization),
"revision": revision,
"dtype": dtype,
"tokenizer": tokenizer,
"tokenizer_mode": tokenizer_mode,
"tokenizer_revision": tokenizer_revision,
"trust_remote_code": trust_remote_code,
"tensor_parallel_size": int(tensor_parallel_size),
"swap_space": int(swap_space),
"quantization": quantization,
"seed": int(seed),
}
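# With a single replica, the engine is built eagerly below; with data parallelism,
# engine construction is deferred to the per-replica worker processes spawned in
# _model_generate, and worker_use_ray tells vLLM to launch its workers via Ray.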
if self.data_parallel_size <= 1:
self.model = LLM(**self.model_args)
else:
self.model_args["worker_use_ray"] = True
self.tokenizer = get_tokenizer(
tokenizer if tokenizer else pretrained,
tokenizer_mode=tokenizer_mode,
trust_remote_code=trust_remote_code,
tensor_parallel_size=int(tensor_parallel_size),
swap_space=int(swap_space),
quantization=quantization,
seed=int(seed),
tokenizer_revision=tokenizer_revision,
)
self.tokenizer = self.model.get_tokenizer()
self.batch_size = batch_size
self._max_length = max_length
self._max_gen_toks = max_gen_toks
@@ -76,8 +101,8 @@ please install vllm via `pip install lm-eval[vllm]` or `pip install -e .[vllm]`"
def max_length(self):
if self._max_length: # if max length manually set, return it
return self._max_length
if hasattr(self.model.llm_engine.model_config, "max_model_len"):
return self.model.llm_engine.model_config.max_model_len
if hasattr(self.tokenizer, "model_max_length"):
return self.tokenizer.model_max_length
return self._DEFAULT_MAX_LENGTH
@property
@@ -104,7 +129,7 @@ please install vllm via `pip install lm-eval[vllm]` or `pip install -e .[vllm]`"
def _model_generate(
self,
requests: List[int] = None,
requests: List[List[int]] = None,
generate: bool = False,
max_tokens: int = None,
stop: Optional[List[str]] = None,
@@ -114,25 +139,50 @@ please install vllm via `pip install lm-eval[vllm]` or `pip install -e .[vllm]`"
if "do_sample" in kwargs.keys():
kwargs.pop("do_sample")
if generate:
generate_sampling_params = SamplingParams(
max_tokens=max_tokens, stop=stop, **kwargs
)
outputs = self.model.generate(
prompt_token_ids=requests,
sampling_params=generate_sampling_params,
use_tqdm=use_tqdm,
# hf defaults
kwargs["skip_special_tokens"] = kwargs.get("skip_special_tokens", False)
kwargs["spaces_between_special_tokens"] = kwargs.get(
"spaces_between_special_tokens", False
)
sampling_params = SamplingParams(max_tokens=max_tokens, stop=stop, **kwargs)
else:
logliklihood_sampling_params = SamplingParams(
sampling_params = SamplingParams(
temperature=0, prompt_logprobs=2, max_tokens=1
)
outputs = self.model.generate(
prompt_token_ids=requests,
sampling_params=logliklihood_sampling_params,
use_tqdm=use_tqdm,
)
if self.data_parallel_size > 1:
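# Split the requests into one shard per replica and run each shard in its own
# process via Ray's multiprocessing Pool (each process builds its own engine).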
requests = [
list(x) for x in utils.divide(requests, self.data_parallel_size)
]
inputs = [(self.model_args, sampling_params, req) for req in requests]
with Pool(self.data_parallel_size) as pool:
results = pool.starmap(run_inference_one_model, inputs)
# flatten results
return [item for sublist in results for item in sublist]
outputs = self.model.generate(
prompt_token_ids=requests,
sampling_params=sampling_params,
use_tqdm=use_tqdm,
)
return outputs
def _encode_pair(
self, context: str, continuation: str
) -> Tuple[List[int], List[int]]:
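# Move any trailing whitespace from the context onto the continuation, then slice
# the continuation tokens out of the encoding of the full string so the split is
# consistent with how the concatenated text is tokenized.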
n_spaces = len(context) - len(context.rstrip())
if n_spaces > 0:
continuation = context[-n_spaces:] + continuation
context = context[:-n_spaces]
whole_enc = self.tok_encode(context + continuation, add_special_tokens=False)
context_enc = self.tok_encode(context, add_special_tokens=False)
context_enc_len = len(context_enc)
continuation_enc = whole_enc[context_enc_len:]
return context_enc, continuation_enc
def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]:
new_reqs = []
for context, continuation in [req.args for req in requests]:
@@ -142,12 +192,7 @@ please install vllm via `pip install lm-eval[vllm]` or `pip install -e .[vllm]`"
continuation
)
else:
context_enc, continuation_enc = self.tokenizer(
[context, continuation],
truncation="do_not_truncate",
add_special_tokens=False,
return_attention_mask=False,
).input_ids
context_enc, continuation_enc = self._encode_pair(context, continuation)
new_reqs.append(((context, continuation), context_enc, continuation_enc))
@@ -188,7 +233,7 @@ please install vllm via `pip install lm-eval[vllm]` or `pip install -e .[vllm]`"
# batch tokenize contexts
context, all_gen_kwargs = zip(*(req.args for req in requests))
context_encoding = self.tokenizer(context).input_ids
context_encoding = self.tokenizer(context, add_special_tokens=False).input_ids
requests = [
((a, b), c) for a, b, c in zip(context, context_encoding, all_gen_kwargs)
]
......
@@ -664,3 +664,55 @@ def stop_sequences_criteria(
],
]
)
# from more_itertools
def divide(iterable, n) -> List[Iterator]:
"""Divide the elements from *iterable* into *n* parts, maintaining
order.
>>> group_1, group_2 = divide(2, [1, 2, 3, 4, 5, 6])
>>> list(group_1)
[1, 2, 3]
>>> list(group_2)
[4, 5, 6]
If the length of *iterable* is not evenly divisible by *n*, then the
length of the returned iterables will not be identical:
>>> children = divide(3, [1, 2, 3, 4, 5, 6, 7])
>>> [list(c) for c in children]
[[1, 2, 3], [4, 5], [6, 7]]
If the length of the iterable is smaller than n, then the last returned
iterables will be empty:
>>> children = divide(5, [1, 2, 3])
>>> [list(c) for c in children]
[[1], [2], [3], [], []]
This function will exhaust the iterable before returning and may require
significant storage. If order is not important, see :func:`distribute`,
which does not first pull the iterable into memory.
"""
if n < 1:
raise ValueError("n must be at least 1")
try:
iterable[:0]
except TypeError:
seq = tuple(iterable)
else:
seq = iterable
q, r = divmod(len(seq), n)
ret = []
stop = 0
for i in range(1, n + 1):
start = stop
stop += q + 1 if i <= r else q
ret.append(iter(seq[start:stop]))
return ret
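# In this codebase, divide() shards tokenized requests across data-parallel vLLM
# replicas, e.g. utils.divide(requests, self.data_parallel_size) in VLLM._model_generate.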
import argparse
import numpy as np
import lm_eval.evaluator
from lm_eval import tasks
import scipy.stats
from typing import Tuple, Dict, List
import pandas as pd
import torch
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
eval_logger = lm_eval.utils.eval_logger
def calculate_z_value(res1: Dict, res2: Dict) -> Tuple[float, float]:
acc1, acc2 = res1["acc,none"], res2["acc,none"]
st_err1, st_err2 = res1["acc_stderr,none"], res2["acc_stderr,none"]
Z = (acc1 - acc2) / np.sqrt((st_err1**2) + (st_err2**2))
# Determining the p-value
p_value = 2 * scipy.stats.norm.sf(abs(Z)) # two-tailed test
return Z, p_value
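# Illustrative check with hypothetical numbers (not from this script): if
# acc1=0.60, acc2=0.58 and both standard errors are 0.01, then
# Z = 0.02 / sqrt(0.01**2 + 0.01**2) ≈ 1.41 and p ≈ 0.16, which would not be
# flagged as significant at the default alpha = 0.05.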
def print_results(
data_to_print: List = None, results_dict: Dict = None, alpha: float = None
):
model1_data = data_to_print[0]
model2_data = data_to_print[1]
table_data = []
for task in model1_data.keys():
row = {
"Task": task,
"HF Accuracy": model1_data[task]["acc,none"],
"vLLM Accuracy": model2_data[task]["acc,none"],
"HF StdErr": model1_data[task]["acc_stderr,none"],
"vLLM StdErr": model2_data[task]["acc_stderr,none"],
}
table_data.append(row)
comparison_df = pd.DataFrame(table_data)
comparison_df["Z-Score"] = comparison_df["Task"].apply(
lambda task: results_dict[task]["z"]
)
comparison_df["P-Value"] = comparison_df["Task"].apply(
lambda task: results_dict[task]["p_value"]
)
comparison_df[f"p > {alpha}"] = comparison_df["P-Value"].apply(
lambda p: "✓" if p > alpha else "×"
)
return comparison_df
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument(
"--pretrained", default="EleutherAI/pythia-70m", help="name of model to compare"
)
parser.add_argument(
"--hf_args", help="huggingface model args <arg>=<value>", default=""
)
parser.add_argument("--vllm_args", help="vllm model args <arg>=<value>", default="")
parser.add_argument("--tasks", type=str, default="arc_easy,hellaswag")
parser.add_argument(
"--limit",
type=float,
default=100,
)
parser.add_argument(
"--alpha",
type=float,
default=0.05,
help="Significance level for two-tailed z-test",
)
parser.add_argument(
"--device",
type=str,
default="cuda",
)
parser.add_argument(
"--batch",
type=str,
default=8,
)
parser.add_argument(
"--verbosity",
type=str,
default="INFO",
help="Logging verbosity",
)
return parser.parse_args()
if __name__ == "__main__":
tasks.initialize_tasks()
args = parse_args()
tasks = args.tasks.split(",")
print(tasks)
hf_args, vllm_args = "," + args.hf_args, "," + args.vllm_args
results_vllm = lm_eval.evaluator.simple_evaluate(
model="vllm",
model_args=f"pretrained={args.pretrained}" + vllm_args,
tasks=tasks,
limit=args.limit,
device=args.device,
batch_size=args.batch,
)
torch.cuda.empty_cache()
results_hf = lm_eval.evaluator.simple_evaluate(
model="hf",
model_args=f"pretrained={args.pretrained}" + hf_args,
tasks=tasks,
limit=args.limit,
device=args.device,
batch_size=args.batch,
)
all_res = {}
for task1, task2 in zip(
results_hf["results"].items(), results_vllm["results"].items()
):
assert task1[0] == task2[0]
z, p_value = calculate_z_value(task1[1], task2[1])
all_res[task1[0]] = {"z": z, "p_value": p_value}
df = print_results(
[results_hf["results"], results_vllm["results"]], all_res, args.alpha
)
print(df)