Commit a5753ff5 authored by zhuwenwen

v0.5.0.post1

parents 21c06ecb 0f0d8bc0
...@@ -5,6 +5,7 @@ vLLM Meetups ...@@ -5,6 +5,7 @@ vLLM Meetups
We host regular meetups in San Francisco Bay Area every 2 months. We will share the project updates from the vLLM team and have guest speakers from the industry to share their experience and insights. Please find the materials of our previous meetups below: We host regular meetups in San Francisco Bay Area every 2 months. We will share the project updates from the vLLM team and have guest speakers from the industry to share their experience and insights. Please find the materials of our previous meetups below:
- `The fourth vLLM meetup <https://lu.ma/agivllm>`__, with Cloudflare and BentoML, June 11th 2024. `[Slides] <https://docs.google.com/presentation/d/1iJ8o7V2bQEi0BFEljLTwc5G1S10_Rhv3beed5oB0NJ4/edit?usp=sharing>`__
- `The third vLLM meetup <https://robloxandvllmmeetup2024.splashthat.com/>`__, with Roblox, April 2nd 2024. `[Slides] <https://docs.google.com/presentation/d/1A--47JAK4BJ39t954HyTkvtfwn0fkqtsL8NGFuslReM/edit?usp=sharing>`__ - `The third vLLM meetup <https://robloxandvllmmeetup2024.splashthat.com/>`__, with Roblox, April 2nd 2024. `[Slides] <https://docs.google.com/presentation/d/1A--47JAK4BJ39t954HyTkvtfwn0fkqtsL8NGFuslReM/edit?usp=sharing>`__
- `The second vLLM meetup <https://lu.ma/ygxbpzhl>`__, with IBM Research, January 31st 2024. `[Slides] <https://docs.google.com/presentation/d/12mI2sKABnUw5RBWXDYY-HtHth4iMSNcEoQ10jDQbxgA/edit?usp=sharing>`__ `[Video (vLLM Update)] <https://youtu.be/Y0C-DUvEnZQ>`__ `[Video (IBM Research & torch.compile)] <https://youtu.be/m0dMtFLI-dg>`__ - `The second vLLM meetup <https://lu.ma/ygxbpzhl>`__, with IBM Research, January 31st 2024. `[Slides] <https://docs.google.com/presentation/d/12mI2sKABnUw5RBWXDYY-HtHth4iMSNcEoQ10jDQbxgA/edit?usp=sharing>`__ `[Video (vLLM Update)] <https://youtu.be/Y0C-DUvEnZQ>`__ `[Video (IBM Research & torch.compile)] <https://youtu.be/m0dMtFLI-dg>`__
- `The first vLLM meetup <https://lu.ma/first-vllm-meetup>`__, with a16z, October 5th 2023. `[Slides] <https://docs.google.com/presentation/d/1QL-XPFXiFpDBh86DbEegFXBXFXjix4v032GhShbKf3s/edit?usp=sharing>`__ - `The first vLLM meetup <https://lu.ma/first-vllm-meetup>`__, with a16z, October 5th 2023. `[Slides] <https://docs.google.com/presentation/d/1QL-XPFXiFpDBh86DbEegFXBXFXjix4v032GhShbKf3s/edit?usp=sharing>`__
......
...@@ -10,6 +10,7 @@ Table of contents: ...@@ -10,6 +10,7 @@ Table of contents:
#. :ref:`Requirements <cpu_backend_requirements>` #. :ref:`Requirements <cpu_backend_requirements>`
#. :ref:`Quick start using Dockerfile <cpu_backend_quick_start_dockerfile>` #. :ref:`Quick start using Dockerfile <cpu_backend_quick_start_dockerfile>`
#. :ref:`Build from source <build_cpu_backend_from_source>` #. :ref:`Build from source <build_cpu_backend_from_source>`
#. :ref:`Intel Extension for PyTorch <ipex_guidance>`
#. :ref:`Performance tips <cpu_backend_performance_tips>` #. :ref:`Performance tips <cpu_backend_performance_tips>`
.. _cpu_backend_requirements: .. _cpu_backend_requirements:
...@@ -18,7 +19,7 @@ Requirements ...@@ -18,7 +19,7 @@ Requirements
------------ ------------
* OS: Linux * OS: Linux
* Compiler: gcc/g++>=12.3.0 (recommended) * Compiler: gcc/g++>=12.3.0 (optional, recommended)
* Instruction set architecture (ISA) requirement: AVX512 is required. * Instruction set architecture (ISA) requirement: AVX512 is required.
.. _cpu_backend_quick_start_dockerfile: .. _cpu_backend_quick_start_dockerfile:
...@@ -41,7 +42,7 @@ Quick start using Dockerfile ...@@ -41,7 +42,7 @@ Quick start using Dockerfile
Build from source Build from source
----------------- -----------------
- First, install required compiler. We recommend to use ``gcc/g++ >= 12.3.0`` as the default compiler to avoid potential problems. For example, on Ubuntu 22.04, you can run: - First, install recommended compiler. We recommend to use ``gcc/g++ >= 12.3.0`` as the default compiler to avoid potential problems. For example, on Ubuntu 22.04, you can run:
.. code-block:: console .. code-block:: console
...@@ -70,6 +71,15 @@ Build from source ...@@ -70,6 +71,15 @@ Build from source
- If you want to force enable AVX512_BF16 for the cross-compilation, please set environment variable VLLM_CPU_AVX512BF16=1 before the building. - If you want to force enable AVX512_BF16 for the cross-compilation, please set environment variable VLLM_CPU_AVX512BF16=1 before the building.
.. _ipex_guidance:
Intel Extension for PyTorch
---------------------------
- `Intel Extension for PyTorch (IPEX) <https://github.com/intel/intel-extension-for-pytorch>`_ extends PyTorch with up-to-date feature optimizations for an extra performance boost on Intel hardware.
- IPEX versions after ``2.3.0`` are enabled in the CPU backend by default if installed.
.. _cpu_backend_performance_tips: .. _cpu_backend_performance_tips:
Performance tips Performance tips
...@@ -77,6 +87,15 @@ Performance tips ...@@ -77,6 +87,15 @@ Performance tips
- vLLM CPU backend uses environment variable ``VLLM_CPU_KVCACHE_SPACE`` to specify the KV Cache size (e.g, ``VLLM_CPU_KVCACHE_SPACE=40`` means 40 GB space for KV cache), larger setting will allow vLLM running more requests in parallel. This parameter should be set based on the hardware configuration and memory management pattern of users. - vLLM CPU backend uses environment variable ``VLLM_CPU_KVCACHE_SPACE`` to specify the KV Cache size (e.g, ``VLLM_CPU_KVCACHE_SPACE=40`` means 40 GB space for KV cache), larger setting will allow vLLM running more requests in parallel. This parameter should be set based on the hardware configuration and memory management pattern of users.
- We highly recommend using TCMalloc for high-performance memory allocation and better cache locality. For example, on Ubuntu 22.04, you can run:
.. code-block:: console
$ sudo apt-get install libtcmalloc-minimal4 # install TCMalloc library
$ find / -name "*libtcmalloc*" # find the dynamic link library path
$ export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD # prepend the library to LD_PRELOAD
$ python examples/offline_inference.py # run vLLM
- vLLM CPU backend uses OpenMP for thread-parallel computation. If you want the best performance on CPU, it will be very critical to isolate CPU cores for OpenMP threads with other thread pools (like web-service event-loop), to avoid CPU oversubscription. - vLLM CPU backend uses OpenMP for thread-parallel computation. If you want the best performance on CPU, it will be very critical to isolate CPU cores for OpenMP threads with other thread pools (like web-service event-loop), to avoid CPU oversubscription.
- If using vLLM CPU backend on a bare-metal machine, it is recommended to disable the hyper-threading. - If using vLLM CPU backend on a bare-metal machine, it is recommended to disable the hyper-threading.
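The environment-variable tips above can be gathered into a single launch helper. This is a minimal sketch, not part of vLLM: ``build_cpu_env`` is a hypothetical name, the TCMalloc path is the one from the console example, and ``/opt/other.so`` is only a placeholder for a previously preloaded library. Unlike the plain ``export`` form, it avoids a trailing colon when ``LD_PRELOAD`` is empty.

```python
import os


def build_cpu_env(kv_cache_gb, tcmalloc_path, base_env=None):
    """Return a copy of the environment with the CPU-backend settings applied."""
    env = dict(os.environ if base_env is None else base_env)
    # KV cache size in GB, read by the vLLM CPU backend.
    env["VLLM_CPU_KVCACHE_SPACE"] = str(kv_cache_gb)
    # Prepend TCMalloc to LD_PRELOAD, keeping any existing entries.
    existing = env.get("LD_PRELOAD", "")
    env["LD_PRELOAD"] = f"{tcmalloc_path}:{existing}" if existing else tcmalloc_path
    return env


env = build_cpu_env(
    kv_cache_gb=40,
    tcmalloc_path="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4",
    base_env={"LD_PRELOAD": "/opt/other.so"},  # placeholder for illustration
)
print(env["VLLM_CPU_KVCACHE_SPACE"], env["LD_PRELOAD"])
```

The resulting dictionary could then be passed as ``env=`` to ``subprocess.run`` when starting ``examples/offline_inference.py``.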
......
...@@ -8,27 +8,30 @@ Debugging hang/crash issues ...@@ -8,27 +8,30 @@ Debugging hang/crash issues
When a vLLM instance hangs or crashes, it is very difficult to debug the issue. But wait a minute, it is also possible that vLLM is doing something that indeed takes a long time: When a vLLM instance hangs or crashes, it is very difficult to debug the issue. But wait a minute, it is also possible that vLLM is doing something that indeed takes a long time:
- Downloading a model: do you have the model already downloaded in your disk? If not, vLLM will download the model from the internet, which can take a long time. Be sure to check the internet connection. It would be better to download the model first using `huggingface cli <https://huggingface.co/docs/huggingface_hub/en/guides/cli>`_ and then use the local path to the model. This way, you can isolate the issue. - **Downloading a model**: Do you have the model already downloaded in your disk? If not, vLLM will download the model from the internet, which can take a long time. Be sure to check the internet connection. It would be better to download the model first using `huggingface-cli <https://huggingface.co/docs/huggingface_hub/en/guides/cli>`_ and then use the local path to the model. This way, you can isolate the issue.
- Loading the model from disk: if the model is large, it can take a long time to load the model from disk. Please take care of the location you store the model. Some clusters have shared filesystems across nodes, e.g. distributed filesystem or network filesystem, which can be slow. It would be better to store the model in a local disk. In addition, please also watch the CPU memory usage. When the model is too large, it might take much CPU memory, which can slow down the operating system because it needs to frequently swap memory between the disk and the memory. - **Loading the model from disk**: If the model is large, it can take a long time to load the model from disk. Please take care of the location you store the model. Some clusters have shared filesystems across nodes, e.g. distributed filesystem or network filesystem, which can be slow. It would be better to store the model in a local disk. In addition, please also watch the CPU memory usage. When the model is too large, it might take much CPU memory, which can slow down the operating system because it needs to frequently swap memory between the disk and the memory.
- Tensor parallel inference: if the model is too large to fit in a single GPU, you might want to use tensor parallelism to split the model across multiple GPUs. In that case, every process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism). You can convert the model checkpoint to a sharded checkpoint using `the provided script <https://docs.vllm.ai/en/latest/getting_started/examples/save_sharded_state.html>`_ . The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism. - **Tensor parallel inference**: If the model is too large to fit in a single GPU, you might want to use tensor parallelism to split the model across multiple GPUs. In that case, every process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism). You can convert the model checkpoint to a sharded checkpoint using `the provided script <https://docs.vllm.ai/en/latest/getting_started/examples/save_sharded_state.html>`_ . The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
If you already take care of the above issues, and the vLLM instance still hangs, with CPU and GPU utilization at near zero, it is likely that the vLLM instance is stuck somewhere. Here are some tips to help debug the issue: If you have already taken care of the above issues, but the vLLM instance still hangs, with CPU and GPU utilization at near zero, it is likely that the vLLM instance is stuck somewhere. Here are some tips to help debug the issue:
- Set the environment variable ``export VLLM_LOGGING_LEVEL=DEBUG`` to turn on more logging. - Set the environment variable ``export VLLM_LOGGING_LEVEL=DEBUG`` to turn on more logging.
- Set the environment variable ``export CUDA_LAUNCH_BLOCKING=1`` to know exactly which CUDA kernel is causing the trouble. - Set the environment variable ``export CUDA_LAUNCH_BLOCKING=1`` to know exactly which CUDA kernel is causing the trouble.
- Set the environment variable ``export NCCL_DEBUG=TRACE`` to turn on more logging for NCCL. - Set the environment variable ``export NCCL_DEBUG=TRACE`` to turn on more logging for NCCL.
- Set the environment variable ``export VLLM_TRACE_FUNCTION=1`` . All the function calls in vLLM will be recorded. Inspect these log files, and tell which function crashes or hangs. **Note: it will generate a lot of logs and slow down the system. Only use it for debugging purposes.** - Set the environment variable ``export VLLM_TRACE_FUNCTION=1``. All the function calls in vLLM will be recorded. Inspect these log files, and tell which function crashes or hangs.
.. warning::
vLLM function tracing will generate a lot of logs and slow down the system. Only use it for debugging purposes.
With more logging, hopefully you can find the root cause of the issue. With more logging, hopefully you can find the root cause of the issue.
Here are some common issues that can cause hangs: Here are some common issues that can cause hangs:
- The network setup is incorrect. The vLLM instance cannot get the correct IP address. You can find the log such as ``DEBUG 06-10 21:32:17 parallel_state.py:88] world_size=8 rank=0 local_rank=0 distributed_init_method=tcp://xxx.xxx.xxx.xxx:54641 backend=nccl``. The IP address should be the correct one. If not, override the IP address by setting the environment variable ``export VLLM_HOST_IP=your_ip_address``. - **Incorrect network setup**: The vLLM instance cannot get the correct IP address. You can find the log such as ``DEBUG 06-10 21:32:17 parallel_state.py:88] world_size=8 rank=0 local_rank=0 distributed_init_method=tcp://xxx.xxx.xxx.xxx:54641 backend=nccl``. The IP address should be the correct one. If not, override the IP address by setting the environment variable ``export VLLM_HOST_IP=your_ip_address``.
- Hardware/driver setup is incorrect. GPU communication cannot be established. You can run a sanity check script below to see if the GPU communication is working correctly. - **Incorrect hardware/driver**: GPU communication cannot be established. You can run the following sanity check script to see if the GPU communication is working correctly.
.. code-block:: python .. code-block:: python
# save it as `test.py`` , and run it with `NCCL_DEBUG=TRACE torchrun --nproc-per-node=8 test.py` # save it as `test.py` , and run it with `NCCL_DEBUG=TRACE torchrun --nproc-per-node=8 test.py`
# adjust `--nproc-per-node` to the number of GPUs you want to use. # adjust `--nproc-per-node` to the number of GPUs you want to use.
import torch import torch
import torch.distributed as dist import torch.distributed as dist
...@@ -39,4 +42,4 @@ Here are some common issues that can cause hangs: ...@@ -39,4 +42,4 @@ Here are some common issues that can cause hangs:
value = data.mean().item() value = data.mean().item()
assert value == dist.get_world_size() assert value == dist.get_world_size()
If the problem persists, feel free to open an `issue <https://github.com/vllm-project/vllm/issues/new/choose>`_ on GitHub, with a detailed description of the issue, your environment, and the logs. If the problem persists, feel free to `open an issue on GitHub <https://github.com/vllm-project/vllm/issues/new/choose>`_, with a detailed description of the issue, your environment, and the logs.
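The debugging switches listed in this section can be applied together when relaunching a stuck job. The sketch below only assembles the environment; ``with_debug_flags`` is a hypothetical helper, and the IP address and ``PATH`` value are placeholders.

```python
import os

# Debug switches named in the tips above, with the values the text gives.
DEBUG_FLAGS = {
    "VLLM_LOGGING_LEVEL": "DEBUG",
    "CUDA_LAUNCH_BLOCKING": "1",
    "NCCL_DEBUG": "TRACE",
    "VLLM_TRACE_FUNCTION": "1",  # very verbose; debugging only
}


def with_debug_flags(base_env=None, host_ip=None):
    """Return a copy of the environment with all debug switches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(DEBUG_FLAGS)
    if host_ip is not None:
        # Override the address vLLM advertises if it picked the wrong one.
        env["VLLM_HOST_IP"] = host_ip
    return env


debug_env = with_debug_flags(base_env={"PATH": "/usr/bin"}, host_ip="10.0.0.5")
print(sorted(debug_env))
```

Passing ``debug_env`` as ``env=`` to ``subprocess.run`` keeps the switches scoped to the one process being debugged rather than the whole shell session.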
.. _installation_tpu:
Installation with TPU
=====================
vLLM supports Google Cloud TPUs using PyTorch XLA.
Requirements
------------
* Google Cloud TPU VM (single host)
* TPU versions: v5e, v5p, v4
* Python: 3.10
Installation options:
1. :ref:`Build a docker image with Dockerfile <build_docker_tpu>`.
2. :ref:`Build from source <build_from_source_tpu>`.
.. _build_docker_tpu:
Build a docker image with :code:`Dockerfile.tpu`
------------------------------------------------
`Dockerfile.tpu <https://github.com/vllm-project/vllm/blob/main/Dockerfile.tpu>`_ is provided to build a docker image with TPU support.
.. code-block:: console
$ docker build -f Dockerfile.tpu -t vllm-tpu .
You can run the docker image with the following command:
.. code-block:: console
$ # Make sure to add `--privileged --net host --shm-size=16G`.
$ docker run --privileged --net host --shm-size=16G -it vllm-tpu
.. _build_from_source_tpu:
Build from source
-----------------
You can also build and install the TPU backend from source.
First, install the dependencies:
.. code-block:: console
$ # (Recommended) Create a new conda environment.
$ conda create -n myenv python=3.10 -y
$ conda activate myenv
$ # Clean up the existing torch and torch-xla packages.
$ pip uninstall torch torch-xla -y
$ # Install PyTorch and PyTorch XLA.
$ export DATE="+20240601"
$ pip install https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch-nightly${DATE}-cp310-cp310-linux_x86_64.whl
$ pip install https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-nightly${DATE}-cp310-cp310-linux_x86_64.whl
$ # Install JAX and Pallas.
$ pip install torch_xla[tpu] -f https://storage.googleapis.com/libtpu-releases/index.html
$ pip install torch_xla[pallas] -f https://storage.googleapis.com/jax-releases/jax_nightly_releases.html -f https://storage.googleapis.com/jax-releases/jaxlib_nightly_releases.html
$ # Install other build dependencies.
$ pip install packaging aiohttp
Next, build vLLM from source. This will only take a few seconds:
.. code-block:: console
$ VLLM_TARGET_DEVICE="tpu" python setup.py develop
...@@ -63,8 +63,9 @@ Documentation ...@@ -63,8 +63,9 @@ Documentation
getting_started/installation getting_started/installation
getting_started/amd-installation getting_started/amd-installation
getting_started/neuron-installation
getting_started/cpu-installation getting_started/cpu-installation
getting_started/neuron-installation
getting_started/tpu-installation
getting_started/quickstart getting_started/quickstart
getting_started/debugging getting_started/debugging
getting_started/examples/examples_index getting_started/examples/examples_index
......
...@@ -20,9 +20,9 @@ The following :ref:`engine arguments <engine_args>` are specific to VLMs: ...@@ -20,9 +20,9 @@ The following :ref:`engine arguments <engine_args>` are specific to VLMs:
Currently, the support for vision language models on vLLM has the following limitations: Currently, the support for vision language models on vLLM has the following limitations:
* Only single image input is supported per text prompt. * Only single image input is supported per text prompt.
* Dynamic ``image_input_shape`` is not supported: the input image will be resized to the static ``image_input_shape``. This means model output might not exactly match the HuggingFace implementation. * Dynamic ``image_input_shape`` is not supported: the input image will be resized to the static ``image_input_shape``. This means our LLaVA-NeXT output may not exactly match the huggingface implementation.
We are continuously improving user & developer experience for VLMs. Please raise an issue on GitHub if you have any feedback or feature requests. We are continuously improving user & developer experience for VLMs. Please `open an issue on GitHub <https://github.com/vllm-project/vllm/issues/new/choose>`_ if you have any feedback or feature requests.
Offline Batched Inference Offline Batched Inference
------------------------- -------------------------
......
...@@ -332,7 +332,7 @@ def main(args): ...@@ -332,7 +332,7 @@ def main(args):
if __name__ == "__main__": if __name__ == "__main__":
parser = argparse.ArgumentParser(description=__doc__) parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument("--model_dir", parser.add_argument("--model-dir",
help="Specify where the HuggingFace model is", help="Specify where the HuggingFace model is",
required=True) required=True)
parser.add_argument("--device", default="cuda") parser.add_argument("--device", default="cuda")
...@@ -346,19 +346,19 @@ if __name__ == "__main__": ...@@ -346,19 +346,19 @@ if __name__ == "__main__":
"full_prec" "full_prec"
], ],
) )
parser.add_argument("--batch_size", parser.add_argument("--batch-size",
help="Batch size for calibration.", help="Batch size for calibration.",
type=int, type=int,
default=1) default=1)
parser.add_argument("--calib_size", parser.add_argument("--calib-size",
help="Number of samples for calibration.", help="Number of samples for calibration.",
type=int, type=int,
default=512) default=512)
parser.add_argument("--output_dir", default="exported_model") parser.add_argument("--output-dir", default="exported_model")
parser.add_argument("--tp_size", type=int, default=1) parser.add_argument("--tp-size", type=int, default=1)
parser.add_argument("--pp_size", type=int, default=1) parser.add_argument("--pp-size", type=int, default=1)
parser.add_argument("--awq_block_size", type=int, default=128) parser.add_argument("--awq-block-size", type=int, default=128)
parser.add_argument("--kv_cache_dtype", parser.add_argument("--kv-cache-dtype",
help="KV Cache dtype.", help="KV Cache dtype.",
default=None, default=None,
choices=["int8", "fp8", None]) choices=["int8", "fp8", None])
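The rename from underscores to dashes changes only the command-line spelling. ``argparse`` derives the attribute name from the first long option by replacing internal dashes with underscores, so code reading ``args.model_dir`` is unaffected. A small standalone sketch using three of the options from this hunk (the path is a placeholder):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--model-dir", required=True)
parser.add_argument("--batch-size", type=int, default=1)
parser.add_argument("--kv-cache-dtype", default=None,
                    choices=["int8", "fp8", None])

# Dashed flags on the command line, underscored attributes in code.
args = parser.parse_args(["--model-dir", "/tmp/model",
                          "--kv-cache-dtype", "fp8"])
print(args.model_dir, args.batch_size, args.kv_cache_dtype)
```

Callers that still pass the old ``--model_dir`` spelling need to switch to the dashed form, since the underscored option string is no longer registered.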
......
...@@ -3,18 +3,12 @@ import dataclasses ...@@ -3,18 +3,12 @@ import dataclasses
import json import json
import os import os
import uuid import uuid
from functools import partial
from tensorizer import stream_io
from vllm import LLM from vllm import LLM
from vllm.distributed import (init_distributed_environment,
initialize_model_parallel)
from vllm.engine.arg_utils import EngineArgs from vllm.engine.arg_utils import EngineArgs
from vllm.engine.llm_engine import LLMEngine
from vllm.model_executor.model_loader.tensorizer import (TensorizerArgs, from vllm.model_executor.model_loader.tensorizer import (TensorizerArgs,
TensorizerConfig, TensorizerConfig,
serialize_vllm_model) tensorize_vllm_model)
# yapf conflicts with isort for this docstring # yapf conflicts with isort for this docstring
# yapf: disable # yapf: disable
...@@ -61,6 +55,12 @@ Which downloads the model tensors from your S3 bucket and deserializes them. ...@@ -61,6 +55,12 @@ Which downloads the model tensors from your S3 bucket and deserializes them.
You can also provide a `--keyfile` argument to decrypt the model weights if You can also provide a `--keyfile` argument to decrypt the model weights if
they were serialized with encryption. they were serialized with encryption.
To support distributed tensor-parallel models, each model shard will be
serialized to a separate file. The tensorizer_uri is then specified as a string
template with a format specifier such as '%03d' that will be rendered with the
shard's rank. Sharded models serialized with this script will be named as
model-rank-%03d.tensors
For more information on the available arguments for serializing, run For more information on the available arguments for serializing, run
`python -m examples.tensorize_vllm_model serialize --help`. `python -m examples.tensorize_vllm_model serialize --help`.
...@@ -168,71 +168,66 @@ def parse_args(): ...@@ -168,71 +168,66 @@ def parse_args():
def deserialize(): def deserialize():
llm = LLM(model=args.model, llm = LLM(model=args.model,
load_format="tensorizer", load_format="tensorizer",
tensor_parallel_size=args.tensor_parallel_size,
model_loader_extra_config=tensorizer_config model_loader_extra_config=tensorizer_config
) )
return llm return llm
if __name__ == '__main__':
args = parse_args()
args = parse_args() s3_access_key_id = (getattr(args, 's3_access_key_id', None)
s3_access_key_id = (getattr(args, 's3_access_key_id', None)
or os.environ.get("S3_ACCESS_KEY_ID", None)) or os.environ.get("S3_ACCESS_KEY_ID", None))
s3_secret_access_key = (getattr(args, 's3_secret_access_key', None) s3_secret_access_key = (getattr(args, 's3_secret_access_key', None)
or os.environ.get("S3_SECRET_ACCESS_KEY", None)) or os.environ.get("S3_SECRET_ACCESS_KEY", None))
s3_endpoint = (getattr(args, 's3_endpoint', None) s3_endpoint = (getattr(args, 's3_endpoint', None)
or os.environ.get("S3_ENDPOINT_URL", None)) or os.environ.get("S3_ENDPOINT_URL", None))
credentials = { credentials = {
"s3_access_key_id": s3_access_key_id, "s3_access_key_id": s3_access_key_id,
"s3_secret_access_key": s3_secret_access_key, "s3_secret_access_key": s3_secret_access_key,
"s3_endpoint": s3_endpoint "s3_endpoint": s3_endpoint
} }
_read_stream, _write_stream = (partial(
stream_io.open_stream,
mode=mode,
s3_access_key_id=s3_access_key_id,
s3_secret_access_key=s3_secret_access_key,
s3_endpoint=s3_endpoint,
) for mode in ("rb", "wb+"))
model_ref = args.model
model_name = model_ref.split("/")[1]
os.environ["MASTER_ADDR"] = "127.0.0.1" model_ref = args.model
os.environ["MASTER_PORT"] = "8080"
init_distributed_environment(world_size=1, rank=0, local_rank=0) model_name = model_ref.split("/")[1]
initialize_model_parallel()
keyfile = args.keyfile if args.keyfile else None keyfile = args.keyfile if args.keyfile else None
if args.model_loader_extra_config:
if args.model_loader_extra_config:
config = json.loads(args.model_loader_extra_config) config = json.loads(args.model_loader_extra_config)
tensorizer_args = TensorizerConfig(**config)._construct_tensorizer_args() tensorizer_args = \
TensorizerConfig(**config)._construct_tensorizer_args()
tensorizer_args.tensorizer_uri = args.path_to_tensors tensorizer_args.tensorizer_uri = args.path_to_tensors
else: else:
tensorizer_args = None tensorizer_args = None
if args.command == "serialize": if args.command == "serialize":
eng_args_dict = {f.name: getattr(args, f.name) for f in eng_args_dict = {f.name: getattr(args, f.name) for f in
dataclasses.fields(EngineArgs)} dataclasses.fields(EngineArgs)}
engine_args = EngineArgs.from_cli_args(argparse.Namespace(**eng_args_dict)) engine_args = EngineArgs.from_cli_args(
engine = LLMEngine.from_engine_args(engine_args) argparse.Namespace(**eng_args_dict)
)
input_dir = args.serialized_directory.rstrip('/') input_dir = args.serialized_directory.rstrip('/')
suffix = args.suffix if args.suffix else uuid.uuid4().hex suffix = args.suffix if args.suffix else uuid.uuid4().hex
base_path = f"{input_dir}/vllm/{model_ref}/{suffix}" base_path = f"{input_dir}/vllm/{model_ref}/{suffix}"
if engine_args.tensor_parallel_size > 1:
model_path = f"{base_path}/model-rank-%03d.tensors"
else:
model_path = f"{base_path}/model.tensors" model_path = f"{base_path}/model.tensors"
tensorizer_config = TensorizerConfig( tensorizer_config = TensorizerConfig(
tensorizer_uri=model_path, tensorizer_uri=model_path,
encryption_keyfile=keyfile,
**credentials) **credentials)
serialize_vllm_model(engine, tensorizer_config, keyfile)
elif args.command == "deserialize": tensorize_vllm_model(engine_args, tensorizer_config)
elif args.command == "deserialize":
if not tensorizer_args: if not tensorizer_args:
tensorizer_config = TensorizerConfig( tensorizer_config = TensorizerConfig(
tensorizer_uri=args.path_to_tensors, tensorizer_uri=args.path_to_tensors,
...@@ -240,5 +235,5 @@ elif args.command == "deserialize": ...@@ -240,5 +235,5 @@ elif args.command == "deserialize":
**credentials **credentials
) )
deserialize() deserialize()
else: else:
raise ValueError("Either serialize or deserialize must be specified.") raise ValueError("Either serialize or deserialize must be specified.")
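The docstring in this file says the sharded ``tensorizer_uri`` is a string template with a format specifier such as ``%03d`` that is rendered with the shard's rank. A minimal sketch of that rendering follows; ``shard_uri`` is a hypothetical helper, and the bucket, model, and suffix in the path are placeholders rather than values from the script.

```python
def shard_uri(template, rank):
    """Render a tensorizer URI template for one tensor-parallel rank."""
    # '%03d' zero-pads the rank to three digits, e.g. 1 -> '001'.
    return template % rank


# Placeholder location; only the 'model-rank-%03d.tensors' part comes from the script.
template = "s3://my-bucket/vllm/facebook/opt-125m/v1/model-rank-%03d.tensors"
uris = [shard_uri(template, rank) for rank in range(2)]
for uri in uris:
    print(uri)
```

With ``tensor_parallel_size=2`` this yields one file per rank, which matches the ``model-rank-%03d.tensors`` naming the serialize branch uses when ``tensor_parallel_size > 1``.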
...@@ -36,12 +36,12 @@ tool_version_check() { ...@@ -36,12 +36,12 @@ tool_version_check() {
fi fi
} }
tool_version_check "yapf" $YAPF_VERSION "$(grep yapf requirements-dev.txt | cut -d'=' -f3)" tool_version_check "yapf" $YAPF_VERSION "$(grep yapf requirements-lint.txt | cut -d'=' -f3)"
tool_version_check "ruff" $RUFF_VERSION "$(grep "ruff==" requirements-dev.txt | cut -d'=' -f3)" tool_version_check "ruff" $RUFF_VERSION "$(grep "ruff==" requirements-lint.txt | cut -d'=' -f3)"
tool_version_check "mypy" "$MYPY_VERSION" "$(grep mypy requirements-dev.txt | cut -d'=' -f3)" tool_version_check "mypy" "$MYPY_VERSION" "$(grep mypy requirements-lint.txt | cut -d'=' -f3)"
tool_version_check "isort" "$ISORT_VERSION" "$(grep isort requirements-dev.txt | cut -d'=' -f3)" tool_version_check "isort" "$ISORT_VERSION" "$(grep isort requirements-lint.txt | cut -d'=' -f3)"
tool_version_check "codespell" "$CODESPELL_VERSION" "$(grep codespell requirements-dev.txt | cut -d'=' -f3)" tool_version_check "codespell" "$CODESPELL_VERSION" "$(grep codespell requirements-lint.txt | cut -d'=' -f3)"
tool_version_check "clang-format" "$CLANGFORMAT_VERSION" "$(grep clang-format requirements-dev.txt | cut -d'=' -f3)" tool_version_check "clang-format" "$CLANGFORMAT_VERSION" "$(grep clang-format requirements-lint.txt | cut -d'=' -f3)"
YAPF_FLAGS=( YAPF_FLAGS=(
'--recursive' '--recursive'
......
...@@ -2,5 +2,5 @@ ...@@ -2,5 +2,5 @@
-r requirements-common.txt -r requirements-common.txt
# Dependencies for x86_64 CPUs # Dependencies for x86_64 CPUs
torch == 2.3.0+cpu torch == 2.3.1+cpu
triton >= 2.2.0 # FIXME(woosuk): This is a hack to avoid import error. triton >= 2.2.0 # FIXME(woosuk): This is a hack to avoid import error.
\ No newline at end of file
# formatting -r requirements-lint.txt
yapf==0.32.0 -r requirements-test.txt
toml==0.10.2
tomli==2.0.1
ruff==0.1.5
codespell==2.2.6
isort==5.13.2
clang-format==18.1.5
# type checking # Avoid adding requirements directly to this file.
mypy==1.9.0 # Instead, modify the two files referenced above.
types-PyYAML
types-requests
types-setuptools
# testing
pytest
tensorizer>=2.9.0
pytest-forked
pytest-asyncio
pytest-rerunfailures
pytest-shard
# testing utils
awscli
einops # required for MPT
httpx
peft
requests
ray
sentence-transformers # required for embedding
# Benchmarking
aiohttp
# quantization
bitsandbytes==0.42.0
# formatting
yapf==0.32.0
toml==0.10.2
tomli==2.0.1
ruff==0.1.5
codespell==2.3.0
isort==5.13.2
clang-format==18.1.5
# type checking
mypy==1.9.0
types-PyYAML
types-requests
types-setuptools
# testing
pytest
tensorizer>=2.9.0
pytest-forked
pytest-asyncio
pytest-rerunfailures
pytest-shard
# testing utils
awscli
einops # required for MPT
httpx
peft
requests
ray
sentence-transformers # required for embedding
# Benchmarking
aiohttp
# quantization
bitsandbytes==0.42.0
# Common dependencies
-r requirements-common.txt
# Dependencies for TPU
# Currently, the TPU backend uses a nightly version of PyTorch XLA.
# You can install the dependencies in Dockerfile.tpu.
triton # To avoid import errors
...@@ -144,6 +144,7 @@ class cmake_build_ext(build_ext): ...@@ -144,6 +144,7 @@ class cmake_build_ext(build_ext):
cmake_args += [ cmake_args += [
'-DCMAKE_CXX_COMPILER_LAUNCHER=sccache', '-DCMAKE_CXX_COMPILER_LAUNCHER=sccache',
'-DCMAKE_CUDA_COMPILER_LAUNCHER=sccache', '-DCMAKE_CUDA_COMPILER_LAUNCHER=sccache',
'-DCMAKE_C_COMPILER_LAUNCHER=sccache',
] ]
elif is_ccache_available(): elif is_ccache_available():
cmake_args += [ cmake_args += [
...@@ -175,7 +176,6 @@ class cmake_build_ext(build_ext): ...@@ -175,7 +176,6 @@ class cmake_build_ext(build_ext):
else: else:
# Default build tool to whatever cmake picks. # Default build tool to whatever cmake picks.
build_tool = [] build_tool = []
subprocess.check_call( subprocess.check_call(
['cmake', ext.cmake_lists_dir, *build_tool, *cmake_args], ['cmake', ext.cmake_lists_dir, *build_tool, *cmake_args],
cwd=self.build_temp) cwd=self.build_temp)
...@@ -210,9 +210,9 @@ class cmake_build_ext(build_ext): ...@@ -210,9 +210,9 @@ class cmake_build_ext(build_ext):
def _is_cuda() -> bool: def _is_cuda() -> bool:
return VLLM_TARGET_DEVICE == "cuda" \ has_cuda = torch.version.cuda is not None
and torch.version.cuda is not None \ return (VLLM_TARGET_DEVICE == "cuda" and has_cuda
and not _is_neuron() and not (_is_neuron() or _is_tpu()))
def _is_hip() -> bool: def _is_hip() -> bool:
...@@ -229,10 +229,18 @@ def _is_neuron() -> bool: ...@@ -229,10 +229,18 @@ def _is_neuron() -> bool:
return torch_neuronx_installed or VLLM_TARGET_DEVICE == "neuron" return torch_neuronx_installed or VLLM_TARGET_DEVICE == "neuron"
def _is_tpu() -> bool:
return VLLM_TARGET_DEVICE == "tpu"
def _is_cpu() -> bool: def _is_cpu() -> bool:
return VLLM_TARGET_DEVICE == "cpu" return VLLM_TARGET_DEVICE == "cpu"
def _build_custom_ops() -> bool:
return _is_cuda() or _is_hip() or _is_cpu()
def _install_punica() -> bool: def _install_punica() -> bool:
return envs.VLLM_INSTALL_PUNICA_KERNELS return envs.VLLM_INSTALL_PUNICA_KERNELS
...@@ -350,8 +358,8 @@ def get_version_add(sha: Optional[str] = None) -> str: ...@@ -350,8 +358,8 @@ def get_version_add(sha: Optional[str] = None) -> str:
version += ".torch" + torch.__version__[:5] version += ".torch" + torch.__version__[:5]
with open(add_version_path, encoding="utf-8",mode="w") as file: with open(add_version_path, encoding="utf-8",mode="w") as file:
file.write("__version__='0.5.0'\n") file.write("__version__='0.5.0.post1'\n")
file.write("__dcu_version__='0.5.0+{}'\n".format(version)) file.write("__dcu_version__='0.5.0.post1+{}'\n".format(version))
file.close() file.close()
...@@ -364,7 +372,7 @@ def get_version(): ...@@ -364,7 +372,7 @@ def get_version():
def get_vllm_version() -> str: def get_vllm_version() -> str:
version = find_version(get_path("vllm", "__init__.py")) version = find_version(get_path("vllm", "version.py"))
if _is_cuda(): if _is_cuda():
cuda_version = str(get_nvcc_cuda_version()) cuda_version = str(get_nvcc_cuda_version())
...@@ -384,6 +392,8 @@ def get_vllm_version() -> str: ...@@ -384,6 +392,8 @@ def get_vllm_version() -> str:
if neuron_version != MAIN_CUDA_VERSION: if neuron_version != MAIN_CUDA_VERSION:
neuron_version_str = neuron_version.replace(".", "")[:3] neuron_version_str = neuron_version.replace(".", "")[:3]
version += f"+neuron{neuron_version_str}" version += f"+neuron{neuron_version_str}"
elif _is_tpu():
version += "+tpu"
elif _is_cpu(): elif _is_cpu():
version += "+cpu" version += "+cpu"
else: else:
...@@ -431,6 +441,8 @@ def get_requirements() -> List[str]: ...@@ -431,6 +441,8 @@ def get_requirements() -> List[str]:
requirements = _read_requirements("requirements-rocm.txt") requirements = _read_requirements("requirements-rocm.txt")
elif _is_neuron(): elif _is_neuron():
requirements = _read_requirements("requirements-neuron.txt") requirements = _read_requirements("requirements-neuron.txt")
elif _is_tpu():
requirements = _read_requirements("requirements-tpu.txt")
elif _is_cpu(): elif _is_cpu():
requirements = _read_requirements("requirements-cpu.txt") requirements = _read_requirements("requirements-cpu.txt")
else: else:
...@@ -444,7 +456,7 @@ ext_modules = [] ...@@ -444,7 +456,7 @@ ext_modules = []
if _is_cuda() or _is_hip(): if _is_cuda() or _is_hip():
ext_modules.append(CMakeExtension(name="vllm._moe_C")) ext_modules.append(CMakeExtension(name="vllm._moe_C"))
if not _is_neuron(): if _build_custom_ops():
ext_modules.append(CMakeExtension(name="vllm._C")) ext_modules.append(CMakeExtension(name="vllm._C"))
if _install_punica(): if _install_punica():
...@@ -487,6 +499,6 @@ setup( ...@@ -487,6 +499,6 @@ setup(
extras_require={ extras_require={
"tensorizer": ["tensorizer>=2.9.0"], "tensorizer": ["tensorizer>=2.9.0"],
}, },
cmdclass={"build_ext": cmake_build_ext} if not _is_neuron() else {}, cmdclass={"build_ext": cmake_build_ext} if _build_custom_ops() else {},
package_data=package_data, package_data=package_data,
) )
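The setup.py hunks above make the custom-op build depend on the target device: `_is_cuda()` now also excludes TPU, and `_build_custom_ops()` gates both `vllm._C` and the `cmake_build_ext` command class. Below is a minimal standalone sketch of that decision logic. It assumes the device string and the CUDA version are passed in as plain arguments rather than read from `VLLM_TARGET_DEVICE` and `torch.version.cuda`. Because the body of `_is_hip()` is not shown in this diff, the sketch takes a hypothetical boolean `hip` flag in its place.

```python
from typing import Optional


def is_neuron(target_device: str, neuronx_installed: bool = False) -> bool:
    # Mirrors _is_neuron(): either the package is present or the target says so.
    return neuronx_installed or target_device == "neuron"


def is_tpu(target_device: str) -> bool:
    return target_device == "tpu"


def is_cpu(target_device: str) -> bool:
    return target_device == "cpu"


def is_cuda(target_device: str,
            cuda_version: Optional[str],
            neuronx_installed: bool = False) -> bool:
    # Mirrors the new _is_cuda(): CUDA target, a CUDA-enabled torch build,
    # and neither Neuron nor TPU.
    has_cuda = cuda_version is not None
    return (target_device == "cuda" and has_cuda
            and not (is_neuron(target_device, neuronx_installed)
                     or is_tpu(target_device)))


def build_custom_ops(target_device: str,
                     cuda_version: Optional[str],
                     hip: bool = False,
                     neuronx_installed: bool = False) -> bool:
    # `hip` is a stand-in for _is_hip(), whose body is not part of this diff.
    return (is_cuda(target_device, cuda_version, neuronx_installed)
            or hip or is_cpu(target_device))


if __name__ == "__main__":
    for device, cuda in [("cuda", "12.1"), ("tpu", "12.1"),
                         ("cpu", None), ("neuron", None)]:
        print(device, build_custom_ops(device, cuda))
```

With this gating, a TPU or Neuron target registers no `vllm._C` extension and no CMake build command, whereas the old `not _is_neuron()` check would still have built custom ops for TPU.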
...@@ -4,16 +4,22 @@ import pytest ...@@ -4,16 +4,22 @@ import pytest
# and debugging. # and debugging.
import ray import ray
from ..utils import ServerRunner from ..utils import VLLM_PATH, RemoteOpenAIServer
# any model with a chat template should work here # any model with a chat template should work here
MODEL_NAME = "facebook/opt-125m" MODEL_NAME = "facebook/opt-125m"
@pytest.fixture(scope="module") @pytest.fixture(scope="module")
def server(): def ray_ctx():
ray.init() ray.init(runtime_env={"working_dir": VLLM_PATH})
server_runner = ServerRunner.remote([ yield
ray.shutdown()
@pytest.fixture(scope="module")
def server(ray_ctx):
return RemoteOpenAIServer([
"--model", "--model",
MODEL_NAME, MODEL_NAME,
# use half precision for speed and memory savings in CI environment # use half precision for speed and memory savings in CI environment
...@@ -24,22 +30,15 @@ def server(): ...@@ -24,22 +30,15 @@ def server():
"--enforce-eager", "--enforce-eager",
"--engine-use-ray" "--engine-use-ray"
]) ])
ray.get(server_runner.ready.remote())
yield server_runner
ray.shutdown()
@pytest.fixture(scope="module") @pytest.fixture(scope="module")
def client(): def client(server):
client = openai.AsyncOpenAI( return server.get_async_client()
base_url="http://localhost:8000/v1",
api_key="token-abc123",
)
yield client
@pytest.mark.asyncio @pytest.mark.asyncio
async def test_check_models(server, client: openai.AsyncOpenAI): async def test_check_models(client: openai.AsyncOpenAI):
models = await client.models.list() models = await client.models.list()
models = models.data models = models.data
served_model = models[0] served_model = models[0]
...@@ -48,7 +47,7 @@ async def test_check_models(server, client: openai.AsyncOpenAI): ...@@ -48,7 +47,7 @@ async def test_check_models(server, client: openai.AsyncOpenAI):
@pytest.mark.asyncio @pytest.mark.asyncio
async def test_single_completion(server, client: openai.AsyncOpenAI): async def test_single_completion(client: openai.AsyncOpenAI):
completion = await client.completions.create(model=MODEL_NAME, completion = await client.completions.create(model=MODEL_NAME,
prompt="Hello, my name is", prompt="Hello, my name is",
max_tokens=5, max_tokens=5,
...@@ -72,7 +71,7 @@ async def test_single_completion(server, client: openai.AsyncOpenAI): ...@@ -72,7 +71,7 @@ async def test_single_completion(server, client: openai.AsyncOpenAI):
@pytest.mark.asyncio @pytest.mark.asyncio
async def test_single_chat_session(server, client: openai.AsyncOpenAI): async def test_single_chat_session(client: openai.AsyncOpenAI):
messages = [{ messages = [{
"role": "system", "role": "system",
"content": "you are a helpful assistant" "content": "you are a helpful assistant"
......
import contextlib import contextlib
import gc import gc
import os import os
import subprocess
import sys
from typing import Any, Dict, List, Optional, Tuple, TypeVar from typing import Any, Dict, List, Optional, Tuple, TypeVar
import pytest import pytest
...@@ -15,13 +13,14 @@ from transformers import (AutoModelForCausalLM, AutoModelForVision2Seq, ...@@ -15,13 +13,14 @@ from transformers import (AutoModelForCausalLM, AutoModelForVision2Seq,
from vllm import LLM, SamplingParams from vllm import LLM, SamplingParams
from vllm.config import TokenizerPoolConfig, VisionLanguageConfig from vllm.config import TokenizerPoolConfig, VisionLanguageConfig
from vllm.distributed import destroy_model_parallel from vllm.distributed import (destroy_distributed_environment,
destroy_model_parallel)
from vllm.inputs import TextPrompt from vllm.inputs import TextPrompt
from vllm.logger import init_logger from vllm.logger import init_logger
from vllm.multimodal import MultiModalData from vllm.multimodal import MultiModalData
from vllm.multimodal.image import ImageFeatureData, ImagePixelData from vllm.multimodal.image import ImageFeatureData, ImagePixelData
from vllm.sequence import SampleLogprobs from vllm.sequence import SampleLogprobs
from vllm.utils import is_cpu from vllm.utils import cuda_device_count_stateless, is_cpu
logger = init_logger(__name__) logger = init_logger(__name__)
...@@ -54,6 +53,7 @@ def _read_prompts(filename: str) -> List[str]: ...@@ -54,6 +53,7 @@ def _read_prompts(filename: str) -> List[str]:
def cleanup(): def cleanup():
destroy_model_parallel() destroy_model_parallel()
destroy_distributed_environment()
with contextlib.suppress(AssertionError): with contextlib.suppress(AssertionError):
torch.distributed.destroy_process_group() torch.distributed.destroy_process_group()
gc.collect() gc.collect()
...@@ -537,15 +537,4 @@ def num_gpus_available(): ...@@ -537,15 +537,4 @@ def num_gpus_available():
"""Get number of GPUs without initializing the CUDA context """Get number of GPUs without initializing the CUDA context
in current process.""" in current process."""
try: return cuda_device_count_stateless()
out = subprocess.run([
sys.executable, "-c",
"import torch; print(torch.cuda.device_count())"
],
capture_output=True,
check=True,
text=True)
except subprocess.CalledProcessError as e:
logger.warning("Failed to get number of GPUs.", exc_info=e)
return 0
return int(out.stdout.strip())
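For reference, the left-hand (removed) side of the hunk above counted GPUs by running a one-line script in a child interpreter, so the parent process never initialized a CUDA context. The following is a standalone sketch of that removed approach, not of `cuda_device_count_stateless()`, whose implementation is not shown in this diff. The function name `count_devices_in_subprocess` and the overridable `snippet` parameter are hypothetical, added only so the sketch can be exercised without PyTorch installed.

```python
import subprocess
import sys

# Script run in the child interpreter; assumes PyTorch is importable there.
_TORCH_SNIPPET = "import torch; print(torch.cuda.device_count())"


def count_devices_in_subprocess(snippet: str = _TORCH_SNIPPET) -> int:
    """Run `snippet` in a fresh interpreter and parse its stdout as an int.

    Returns 0 if the child process exits with a non-zero status, matching
    the fallback in the removed code.
    """
    try:
        out = subprocess.run([sys.executable, "-c", snippet],
                             capture_output=True,
                             check=True,
                             text=True)
    except subprocess.CalledProcessError:
        return 0
    return int(out.stdout.strip())


if __name__ == "__main__":
    # Prints 0 when torch is missing in the child or no GPU is visible.
    print(count_devices_in_subprocess())
```

The trade-off is one extra interpreter start-up per call. The diff replaces this with a shared helper imported from `vllm.utils`.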
...@@ -149,7 +149,7 @@ def test_complex(): ...@@ -149,7 +149,7 @@ def test_complex():
# Only the first seq group has a new token appended. # Only the first seq group has a new token appended.
append_new_token(running[0], 1) append_new_token(running[0], 1)
# Add 2 more requsets. # Add 2 more requests.
for i in range(2, 4): for i in range(2, 4):
_, seq_group = create_dummy_prompt(str(i), prompt_length=60) _, seq_group = create_dummy_prompt(str(i), prompt_length=60)
scheduler.add_seq_group(seq_group) scheduler.add_seq_group(seq_group)
......
...@@ -7,9 +7,9 @@ import torch ...@@ -7,9 +7,9 @@ import torch
import torch.distributed as dist import torch.distributed as dist
from vllm.distributed.communication_op import ( # noqa from vllm.distributed.communication_op import ( # noqa
graph_capture, tensor_model_parallel_all_reduce) tensor_model_parallel_all_reduce)
from vllm.distributed.parallel_state import (get_tensor_model_parallel_group, from vllm.distributed.parallel_state import (get_tensor_model_parallel_group,
get_tp_ca_communicator) get_tp_group, graph_capture)
from ..utils import (init_test_distributed_environment, from ..utils import (init_test_distributed_environment,
multi_process_tensor_parallel) multi_process_tensor_parallel)
...@@ -91,7 +91,7 @@ def eager_allreduce(tp_size, pp_size, rank, distributed_init_port): ...@@ -91,7 +91,7 @@ def eager_allreduce(tp_size, pp_size, rank, distributed_init_port):
# communicate independently # communicate independently
num_communication = rank // tp_size + 1 num_communication = rank // tp_size + 1
sz = 1024 sz = 1024
fa = get_tp_ca_communicator() fa = get_tp_group().ca_comm
inp = torch.ones(sz, dtype=torch.float32, device=device) inp = torch.ones(sz, dtype=torch.float32, device=device)
out = inp out = inp
for _ in range(num_communication): for _ in range(num_communication):
......
...@@ -6,10 +6,11 @@ import torch ...@@ -6,10 +6,11 @@ import torch
import torch.distributed import torch.distributed
from vllm.distributed.communication_op import ( # noqa from vllm.distributed.communication_op import ( # noqa
graph_capture, tensor_model_parallel_all_reduce) tensor_model_parallel_all_reduce)
from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator
from vllm.distributed.device_communicators.pynccl_wrapper import NCCLLibrary from vllm.distributed.device_communicators.pynccl_wrapper import NCCLLibrary
from vllm.distributed.parallel_state import (ensure_model_parallel_initialized, from vllm.distributed.parallel_state import (ensure_model_parallel_initialized,
get_world_group, graph_capture,
init_distributed_environment) init_distributed_environment)
from vllm.utils import update_environment_variables from vllm.utils import update_environment_variables
...@@ -53,7 +54,8 @@ def worker_fn_wrapper(fn): ...@@ -53,7 +54,8 @@ def worker_fn_wrapper(fn):
@worker_fn_wrapper @worker_fn_wrapper
def worker_fn(): def worker_fn():
pynccl_comm = PyNcclCommunicator() pynccl_comm = PyNcclCommunicator(get_world_group().cpu_group,
device=get_world_group().device)
tensor = torch.ones(16, 1024, 1024, tensor = torch.ones(16, 1024, 1024,
dtype=torch.float32).cuda(pynccl_comm.rank) dtype=torch.float32).cuda(pynccl_comm.rank)
with pynccl_comm.change_state(enable=True): with pynccl_comm.change_state(enable=True):
...@@ -129,7 +131,8 @@ def test_pynccl_multiple_allreduce_with_vllm(): ...@@ -129,7 +131,8 @@ def test_pynccl_multiple_allreduce_with_vllm():
def worker_fn_with_cudagraph(): def worker_fn_with_cudagraph():
with torch.no_grad(): with torch.no_grad():
graph = torch.cuda.CUDAGraph() graph = torch.cuda.CUDAGraph()
pynccl_comm = PyNcclCommunicator() pynccl_comm = PyNcclCommunicator(get_world_group().cpu_group,
device=get_world_group().device)
# run something in the default stream to initialize torch engine # run something in the default stream to initialize torch engine
a = torch.ones((4, 4), device=f'cuda:{pynccl_comm.rank}') a = torch.ones((4, 4), device=f'cuda:{pynccl_comm.rank}')
torch.cuda.synchronize() torch.cuda.synchronize()
...@@ -154,7 +157,8 @@ def test_pynccl_with_cudagraph(): ...@@ -154,7 +157,8 @@ def test_pynccl_with_cudagraph():
@worker_fn_wrapper @worker_fn_wrapper
def send_recv_worker_fn(): def send_recv_worker_fn():
pynccl_comm = PyNcclCommunicator() pynccl_comm = PyNcclCommunicator(get_world_group().cpu_group,
device=get_world_group().device)
if pynccl_comm.rank == 0: if pynccl_comm.rank == 0:
tensor = torch.ones(16, 1024, 1024, tensor = torch.ones(16, 1024, 1024,
dtype=torch.float32).cuda(pynccl_comm.rank) dtype=torch.float32).cuda(pynccl_comm.rank)
......