v0.5.0.post1

a5753ff5 · zhuwenwen · 21c06ecb · 0f0d8bc0 · a5753ff5 · a5753ff5
Commit a5753ff5 authored Jun 19, 2024 by zhuwenwen
20 changed files
--- a/docs/source/community/meetups.rst
+++ b/docs/source/community/meetups.rst
@@ -5,6 +5,7 @@ vLLM Meetups
 We host regular meetups in San Francisco Bay Area every 2 months. We will share the project updates from the vLLM team and have guest speakers from the industry to share their experience and insights. Please find the materials of our previous meetups below:
+- `The fourth vLLM meetup <https://lu.ma/agivllm>`__, with Cloudflare and BentoML, June 11th 2024. `[Slides] <https://docs.google.com/presentation/d/1iJ8o7V2bQEi0BFEljLTwc5G1S10_Rhv3beed5oB0NJ4/edit?usp=sharing>`__
 - `The third vLLM meetup <https://robloxandvllmmeetup2024.splashthat.com/>`__, with Roblox, April 2nd 2024. `[Slides] <https://docs.google.com/presentation/d/1A--47JAK4BJ39t954HyTkvtfwn0fkqtsL8NGFuslReM/edit?usp=sharing>`__
 - `The second vLLM meetup <https://lu.ma/ygxbpzhl>`__, with IBM Research, January 31st 2024. `[Slides] <https://docs.google.com/presentation/d/12mI2sKABnUw5RBWXDYY-HtHth4iMSNcEoQ10jDQbxgA/edit?usp=sharing>`__ `[Video (vLLM Update)] <https://youtu.be/Y0C-DUvEnZQ>`__ `[Video (IBM Research & torch.compile)] <https://youtu.be/m0dMtFLI-dg>`__
 - `The first vLLM meetup <https://lu.ma/first-vllm-meetup>`__, with a16z, October 5th 2023. `[Slides] <https://docs.google.com/presentation/d/1QL-XPFXiFpDBh86DbEegFXBXFXjix4v032GhShbKf3s/edit?usp=sharing>`__

--- a/docs/source/getting_started/cpu-installation.rst
+++ b/docs/source/getting_started/cpu-installation.rst
@@ -10,6 +10,7 @@ Table of contents:
 #. :ref:`Requirements <cpu_backend_requirements>`
 #. :ref:`Quick start using Dockerfile <cpu_backend_quick_start_dockerfile>`
 #. :ref:`Build from source <build_cpu_backend_from_source>`
+#. :ref:`Intel Extension for PyTorch <ipex_guidance>`
 #. :ref:`Performance tips <cpu_backend_performance_tips>`
 .. _cpu_backend_requirements:
@@ -18,7 +19,7 @@ Requirements
 ------------
 * OS: Linux
-* Compiler: gcc/g++>=12.3.0 (recommended)
+* Compiler: gcc/g++>=12.3.0 (optional, recommended)
 * Instruction set architecture (ISA) requirement: AVX512 is required.
 .. _cpu_backend_quick_start_dockerfile:
@@ -41,7 +42,7 @@ Quick start using Dockerfile
 Build from source
 -----------------
- First, install required compiler. We recommend to use ``gcc/g++ >= 12.3.0`` as the default compiler to avoid potential problems. For example, on Ubuntu 22.4, you can run:
+- First, install recommended compiler. We recommend to use ``gcc/g++ >= 12.3.0`` as the default compiler to avoid potential problems. For example, on Ubuntu 22.4, you can run:
 .. code-block:: console
@@ -70,6 +71,15 @@ Build from source
    - If you want to force enable AVX512_BF16 for the cross-compilation, please set environment variable VLLM_CPU_AVX512BF16=1 before the building.    
+.. _ipex_guidance:
+Intel Extension for PyTorch
+---------------------------
+- `Intel Extension for PyTorch (IPEX) <https://github.com/intel/intel-extension-for-pytorch>`_ extends PyTorch with up-to-date features optimizations for an extra performance boost on Intel hardware.
+- IPEX after the ``2.3.0`` can be enabled in the CPU backend by default if it is installed.
 .. _cpu_backend_performance_tips:
 Performance tips
@@ -77,6 +87,15 @@ Performance tips
 - vLLM CPU backend uses environment variable ``VLLM_CPU_KVCACHE_SPACE`` to specify the KV Cache size (e.g, ``VLLM_CPU_KVCACHE_SPACE=40`` means 40 GB space for KV cache), larger setting will allow vLLM running more requests in parallel. This parameter should be set based on the hardware configuration and memory management pattern of users.
+- We highly recommend to use TCMalloc for high performance memory allocation and better cache locality. For example, on Ubuntu 22.4, you can run:
+.. code-block:: console
+    $ sudo apt-get install libtcmalloc-minimal4 # install TCMalloc library
+    $ find / -name *libtcmalloc* # find the dynamic link library path
+    $ export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD # prepend the library to LD_PRELOAD
+    $ python examples/offline_inference.py # run vLLM
 - vLLM CPU backend uses OpenMP for thread-parallel computation. If you want the best performance on CPU, it will be very critical to isolate CPU cores for OpenMP threads with other thread pools (like web-service event-loop), to avoid CPU oversubscription. 
 - If using vLLM CPU backend on a bare-metal machine, it is recommended to disable the hyper-threading.

--- a/docs/source/getting_started/debugging.rst
+++ b/docs/source/getting_started/debugging.rst
@@ -8,27 +8,30 @@ Debugging hang/crash issues
 When an vLLM instance hangs or crashes, it is very difficult to debug the issue. But wait a minute, it is also possible that vLLM is doing something that indeed takes a long time:
- Downloading a model: do you have the model already downloaded in your disk? If not, vLLM will download the model from the internet, which can take a long time. Be sure to check the internet connection. It would be better to download the model first using `huggingface cli <https://huggingface.co/docs/huggingface_hub/en/guides/cli>`_ and then use the local path to the model. This way, you can isolate the issue.
+- **Downloading a model**: Do you have the model already downloaded in your disk? If not, vLLM will download the model from the internet, which can take a long time. Be sure to check the internet connection. It would be better to download the model first using `huggingface-cli <https://huggingface.co/docs/huggingface_hub/en/guides/cli>`_ and then use the local path to the model. This way, you can isolate the issue.
- Loading the model from disk: if the model is large, it can take a long time to load the model from disk. Please take care of the location you store the model. Some clusters have shared filesystems across nodes, e.g. distributed filesystem or network filesystem, which can be slow. It would be better to store the model in a local disk. In addition, please also watch the CPU memory usage. When the model is too large, it might take much CPU memory, which can slow down the operating system because it needs to frequently swap memory between the disk and the memory.
+- **Loading the model from disk**: If the model is large, it can take a long time to load the model from disk. Please take care of the location you store the model. Some clusters have shared filesystems across nodes, e.g. distributed filesystem or network filesystem, which can be slow. It would be better to store the model in a local disk. In addition, please also watch the CPU memory usage. When the model is too large, it might take much CPU memory, which can slow down the operating system because it needs to frequently swap memory between the disk and the memory.
- Tensor parallel inference: if the model is too large to fit in a single GPU, you might want to use tensor parallelism to split the model across multiple GPUs. In that case, every process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism). You can convert the model checkpoint to a sharded checkpoint using `the provided script <https://docs.vllm.ai/en/latest/getting_started/examples/save_sharded_state.html>`_ . The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
+- **Tensor parallel inference**: If the model is too large to fit in a single GPU, you might want to use tensor parallelism to split the model across multiple GPUs. In that case, every process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism). You can convert the model checkpoint to a sharded checkpoint using `the provided script <https://docs.vllm.ai/en/latest/getting_started/examples/save_sharded_state.html>`_ . The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
-If you already take care of the above issues, and the vLLM instance still hangs, with CPU and GPU utilization at near zero, it is likely that the vLLM instance is stuck somewhere. Here are some tips to help debug the issue:
+If you have already taken care of the above issues, but the vLLM instance still hangs, with CPU and GPU utilization at near zero, it is likely that the vLLM instance is stuck somewhere. Here are some tips to help debug the issue:
 - Set the environment variable ``export VLLM_LOGGING_LEVEL=DEBUG`` to turn on more logging.
 - Set the environment variable ``export CUDA_LAUNCH_BLOCKING=1`` to know exactly which CUDA kernel is causing the trouble.
 - Set the environment variable ``export NCCL_DEBUG=TRACE`` to turn on more logging for NCCL.
- Set the environment variable ``export VLLM_TRACE_FUNCTION=1`` . All the function calls in vLLM will be recorded. Inspect these log files, and tell which function crashes or hangs. **Note: it will generate a lot of logs and slow down the system. Only use it for debugging purposes.**
+- Set the environment variable ``export VLLM_TRACE_FUNCTION=1``. All the function calls in vLLM will be recorded. Inspect these log files, and tell which function crashes or hangs.
+  .. warning::
+    vLLM function tracing will generate a lot of logs and slow down the system. Only use it for debugging purposes.
 With more logging, hopefully you can find the root cause of the issue.
 Here are some common issues that can cause hangs:
- The network setup is incorrect. The vLLM instance cannot get the correct IP address. You can find the log such as ``DEBUG 06-10 21:32:17 parallel_state.py:88] world_size=8 rank=0 local_rank=0 distributed_init_method=tcp://xxx.xxx.xxx.xxx:54641 backend=nccl``. The IP address should be the correct one. If not, override the IP address by setting the environment variable ``export VLLM_HOST_IP=your_ip_address``.
+- **Incorrect network setup**: The vLLM instance cannot get the correct IP address. You can find the log such as ``DEBUG 06-10 21:32:17 parallel_state.py:88] world_size=8 rank=0 local_rank=0 distributed_init_method=tcp://xxx.xxx.xxx.xxx:54641 backend=nccl``. The IP address should be the correct one. If not, override the IP address by setting the environment variable ``export VLLM_HOST_IP=your_ip_address``.
- Hardware/driver setup is incorrect. GPU communication cannot be established. You can run a sanity check script below to see if the GPU communication is working correctly.
+- **Incorrect hardware/driver**: GPU communication cannot be established. You can run the following sanity check script to see if the GPU communication is working correctly.
 .. code-block:: python
-    # save it as `test.py`` , and run it with `NCCL_DEBUG=TRACE torchrun --nproc-per-node=8 test.py`
+    # save it as `test.py` , and run it with `NCCL_DEBUG=TRACE torchrun --nproc-per-node=8 test.py`
    # adjust `--nproc-per-node` to the number of GPUs you want to use.
    import torch
    import torch.distributed as dist
@@ -39,4 +42,4 @@ Here are some common issues that can cause hangs:
    value = data.mean().item()
    assert value == dist.get_world_size()
-If the problem persists, feel free to open an `issue <https://github.com/vllm-project/vllm/issues/new/choose>`_ on GitHub, with a detailed description of the issue, your environment, and the logs.
+If the problem persists, feel free to `open an issue on GitHub <https://github.com/vllm-project/vllm/issues/new/choose>`_, with a detailed description of the issue, your environment, and the logs.
--- a/docs/source/getting_started/tpu-installation.rst
+++ b/docs/source/getting_started/tpu-installation.rst
+.. _installation_tpu:
+Installation with TPU
+=====================
+vLLM supports Google Cloud TPUs using PyTorch XLA.
+Requirements
+------------
+* Google Cloud TPU VM (single host)
+* TPU versions: v5e, v5p, v4
+* Python: 3.10
+Installation options:
+1. :ref:`Build a docker image with Dockerfile <build_docker_tpu>`.
+2. :ref:`Build from source <build_from_source_tpu>`.
+.. _build_docker_tpu:
+Build a docker image with :code:`Dockerfile.tpu`
+------------------------------------------------
+`Dockerfile.tpu <https://github.com/vllm-project/vllm/blob/main/Dockerfile.tpu>`_ is provided to build a docker image with TPU support.
+.. code-block:: console
+    $ docker build -f Dockerfile.tpu -t vllm-tpu .
+You can run the docker image with the following command:
+.. code-block:: console
+    $ # Make sure to add `--privileged --net host --shm-size=16G`.
+    $ docker run --privileged --net host --shm-size=16G -it vllm-tpu
+.. _build_from_source_tpu:
+Build from source
+-----------------
+You can also build and install the TPU backend from source.
+First, install the dependencies:
+.. code-block:: console
+    $ # (Recommended) Create a new conda environment.
+    $ conda create -n myenv python=3.10 -y
+    $ conda activate myenv
+    $ # Clean up the existing torch and torch-xla packages.
+    $ pip uninstall torch torch-xla -y
+    $ # Install PyTorch and PyTorch XLA.
+    $ export DATE="+20240601"
+    $ pip install https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch-nightly${DATE}-cp310-cp310-linux_x86_64.whl
+    $ pip install https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-nightly${DATE}-cp310-cp310-linux_x86_64.whl
+    $ # Install JAX and Pallas.
+    $ pip install torch_xla[tpu] -f https://storage.googleapis.com/libtpu-releases/index.html
+    $ pip install torch_xla[pallas] -f https://storage.googleapis.com/jax-releases/jax_nightly_releases.html -f https://storage.googleapis.com/jax-releases/jaxlib_nightly_releases.html
+    $ # Install other build dependencies.
+    $ pip install packaging aiohttp
+Next, build vLLM from source. This will only take a few seconds:
+.. code-block:: console
+    $ VLLM_TARGET_DEVICE="tpu" python setup.py develop
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -63,8 +63,9 @@ Documentation
   getting_started/installation
   getting_started/amd-installation
-   getting_started/neuron-installation
   getting_started/cpu-installation
+   getting_started/neuron-installation
+   getting_started/tpu-installation
   getting_started/quickstart
   getting_started/debugging
   getting_started/examples/examples_index

--- a/docs/source/models/vlm.rst
+++ b/docs/source/models/vlm.rst
@@ -20,9 +20,9 @@ The following :ref:`engine arguments <engine_args>` are specific to VLMs:
    Currently, the support for vision language models on vLLM has the following limitations:
    * Only single image input is supported per text prompt.
-    * Dynamic ``image_input_shape`` is not supported: the input image will be resized to the static ``image_input_shape``. This means model output might not exactly match the HuggingFace implementation.
+    * Dynamic ``image_input_shape`` is not supported: the input image will be resized to the static ``image_input_shape``. This means our LLaVA-NeXT output may not exactly match the huggingface implementation.
-    We are continuously improving user & developer experience for VLMs. Please raise an issue on GitHub if you have any feedback or feature requests.
+    We are continuously improving user & developer experience for VLMs. Please `open an issue on GitHub <https://github.com/vllm-project/vllm/issues/new/choose>`_ if you have any feedback or feature requests.
 Offline Batched Inference
 -------------------------

--- a/examples/fp8/quantizer/quantize.py
+++ b/examples/fp8/quantizer/quantize.py
@@ -332,7 +332,7 @@ def main(args):
 if __name__ == "__main__":
    parser = argparse.ArgumentParser(description=__doc__)
-    parser.add_argument("--model_dir",
+    parser.add_argument("--model-dir",
                        help="Specify where the HuggingFace model is",
                        required=True)
    parser.add_argument("--device", default="cuda")
@@ -346,19 +346,19 @@ if __name__ == "__main__":
            "full_prec"
        ],
    )
-    parser.add_argument("--batch_size",
+    parser.add_argument("--batch-size",
                        help="Batch size for calibration.",
                        type=int,
                        default=1)
-    parser.add_argument("--calib_size",
+    parser.add_argument("--calib-size",
                        help="Number of samples for calibration.",
                        type=int,
                        default=512)
-    parser.add_argument("--output_dir", default="exported_model")
+    parser.add_argument("--output-dir", default="exported_model")
-    parser.add_argument("--tp_size", type=int, default=1)
+    parser.add_argument("--tp-size", type=int, default=1)
-    parser.add_argument("--pp_size", type=int, default=1)
+    parser.add_argument("--pp-size", type=int, default=1)
-    parser.add_argument("--awq_block_size", type=int, default=128)
+    parser.add_argument("--awq-block-size", type=int, default=128)
-    parser.add_argument("--kv_cache_dtype",
+    parser.add_argument("--kv-cache-dtype",
                        help="KV Cache dtype.",
                        default=None,
                        choices=["int8", "fp8", None])

--- a/examples/tensorize_vllm_model.py
+++ b/examples/tensorize_vllm_model.py
@@ -3,18 +3,12 @@ import dataclasses
 import json
 import os
 import uuid
-from functools import partial
-from tensorizer import stream_io
 from vllm import LLM
-from vllm.distributed import (init_distributed_environment,
-                              initialize_model_parallel)
 from vllm.engine.arg_utils import EngineArgs
-from vllm.engine.llm_engine import LLMEngine
 from vllm.model_executor.model_loader.tensorizer import (TensorizerArgs,
                                                         TensorizerConfig,
-                                                         serialize_vllm_model)
+                                                         tensorize_vllm_model)
 # yapf conflicts with isort for this docstring
 # yapf: disable
@@ -61,6 +55,12 @@ Which downloads the model tensors from your S3 bucket and deserializes them.
 You can also provide a `--keyfile` argument to decrypt the model weights if 
 they were serialized with encryption.
+To support distributed tensor-parallel models, each model shard will be
+serialized to a separate file. The tensorizer_uri is then specified as a string
+template with a format specifier such as '%03d' that will be rendered with the
+shard's rank. Sharded models serialized with this script will be named as
+model-rank-%03d.tensors
 For more information on the available arguments for serializing, run 
 `python -m examples.tensorize_vllm_model serialize --help`.
@@ -168,71 +168,66 @@ def parse_args():
 def deserialize():
    llm = LLM(model=args.model,
              load_format="tensorizer",
+              tensor_parallel_size=args.tensor_parallel_size,
              model_loader_extra_config=tensorizer_config
    )
    return llm
+if __name__ == '__main__':
+    args = parse_args()
-args = parse_args()
+    s3_access_key_id = (getattr(args, 's3_access_key_id', None)
-s3_access_key_id = (getattr(args, 's3_access_key_id', None)
                        or os.environ.get("S3_ACCESS_KEY_ID", None))
-s3_secret_access_key = (getattr(args, 's3_secret_access_key', None)
+    s3_secret_access_key = (getattr(args, 's3_secret_access_key', None)
                            or os.environ.get("S3_SECRET_ACCESS_KEY", None))
-s3_endpoint = (getattr(args, 's3_endpoint', None)
+    s3_endpoint = (getattr(args, 's3_endpoint', None)
                or os.environ.get("S3_ENDPOINT_URL", None))
-credentials = {
+    credentials = {
        "s3_access_key_id": s3_access_key_id,
        "s3_secret_access_key": s3_secret_access_key,
        "s3_endpoint": s3_endpoint
-}
+    }
-_read_stream, _write_stream = (partial(
-    stream_io.open_stream,
-    mode=mode,
-    s3_access_key_id=s3_access_key_id,
-    s3_secret_access_key=s3_secret_access_key,
-    s3_endpoint=s3_endpoint,
-) for mode in ("rb", "wb+"))
-model_ref = args.model
-model_name = model_ref.split("/")[1]
-os.environ["MASTER_ADDR"] = "127.0.0.1"
+    model_ref = args.model
-os.environ["MASTER_PORT"] = "8080"
-init_distributed_environment(world_size=1, rank=0, local_rank=0)
+    model_name = model_ref.split("/")[1]
-initialize_model_parallel()
-keyfile = args.keyfile if args.keyfile else None
+    keyfile = args.keyfile if args.keyfile else None
+    if args.model_loader_extra_config:
-if args.model_loader_extra_config:
        config = json.loads(args.model_loader_extra_config)
-    tensorizer_args = TensorizerConfig(**config)._construct_tensorizer_args()
+        tensorizer_args = \
+            TensorizerConfig(**config)._construct_tensorizer_args()
        tensorizer_args.tensorizer_uri = args.path_to_tensors
-else:
+    else:
        tensorizer_args = None
-if args.command == "serialize":
+    if args.command == "serialize":
        eng_args_dict = {f.name: getattr(args, f.name) for f in
                        dataclasses.fields(EngineArgs)}
-    engine_args = EngineArgs.from_cli_args(argparse.Namespace(**eng_args_dict))
+        engine_args = EngineArgs.from_cli_args(
-    engine = LLMEngine.from_engine_args(engine_args)
+            argparse.Namespace(**eng_args_dict)
+        )
        input_dir = args.serialized_directory.rstrip('/')
        suffix = args.suffix if args.suffix else uuid.uuid4().hex
        base_path = f"{input_dir}/vllm/{model_ref}/{suffix}"
+        if engine_args.tensor_parallel_size > 1:
+            model_path = f"{base_path}/model-rank-%03d.tensors"
+        else:
            model_path = f"{base_path}/model.tensors"
        tensorizer_config = TensorizerConfig(
            tensorizer_uri=model_path,
+            encryption_keyfile=keyfile,
            **credentials)
-    serialize_vllm_model(engine, tensorizer_config, keyfile)
-elif args.command == "deserialize":
+        tensorize_vllm_model(engine_args, tensorizer_config)
+    elif args.command == "deserialize":
        if not tensorizer_args:
            tensorizer_config = TensorizerConfig(
                tensorizer_uri=args.path_to_tensors,
@@ -240,5 +235,5 @@ elif args.command == "deserialize":
                **credentials
            )
        deserialize()
-else:
+    else:
        raise ValueError("Either serialize or deserialize must be specified.")
--- a/format.sh
+++ b/format.sh
@@ -36,12 +36,12 @@ tool_version_check() {
    fi
 }
-tool_version_check "yapf" $YAPF_VERSION "$(grep yapf requirements-dev.txt | cut -d'=' -f3)"
+tool_version_check "yapf" $YAPF_VERSION "$(grep yapf requirements-lint.txt | cut -d'=' -f3)"
-tool_version_check "ruff" $RUFF_VERSION "$(grep "ruff==" requirements-dev.txt | cut -d'=' -f3)"
+tool_version_check "ruff" $RUFF_VERSION "$(grep "ruff==" requirements-lint.txt | cut -d'=' -f3)"
-tool_version_check "mypy" "$MYPY_VERSION" "$(grep mypy requirements-dev.txt | cut -d'=' -f3)"
+tool_version_check "mypy" "$MYPY_VERSION" "$(grep mypy requirements-lint.txt | cut -d'=' -f3)"
-tool_version_check "isort" "$ISORT_VERSION" "$(grep isort requirements-dev.txt | cut -d'=' -f3)"
+tool_version_check "isort" "$ISORT_VERSION" "$(grep isort requirements-lint.txt | cut -d'=' -f3)"
-tool_version_check "codespell" "$CODESPELL_VERSION" "$(grep codespell requirements-dev.txt | cut -d'=' -f3)"
+tool_version_check "codespell" "$CODESPELL_VERSION" "$(grep codespell requirements-lint.txt | cut -d'=' -f3)"
-tool_version_check "clang-format" "$CLANGFORMAT_VERSION" "$(grep clang-format requirements-dev.txt | cut -d'=' -f3)"
+tool_version_check "clang-format" "$CLANGFORMAT_VERSION" "$(grep clang-format requirements-lint.txt | cut -d'=' -f3)"
 YAPF_FLAGS=(
    '--recursive'

--- a/requirements-cpu.txt
+++ b/requirements-cpu.txt
@@ -2,5 +2,5 @@
 -r requirements-common.txt
 # Dependencies for x86_64 CPUs
-torch == 2.3.0+cpu
+torch == 2.3.1+cpu
 triton >= 2.2.0  # FIXME(woosuk): This is a hack to avoid import error.
\ No newline at end of file
--- a/requirements-dev.txt
+++ b/requirements-dev.txt
-# formatting
+-r requirements-lint.txt
-yapf==0.32.0
+-r requirements-test.txt
-toml==0.10.2
-tomli==2.0.1
-ruff==0.1.5
-codespell==2.2.6
-isort==5.13.2
-clang-format==18.1.5
-# type checking
+# Avoid adding requirements directly to this file.
-mypy==1.9.0
+# Instead, modify the two files referenced above.
-types-PyYAML
-types-requests
-types-setuptools
-# testing
-pytest
-tensorizer>=2.9.0
-pytest-forked
-pytest-asyncio
-pytest-rerunfailures
-pytest-shard
-# testing utils
-awscli
-einops # required for MPT
-httpx
-peft
-requests
-ray
-sentence-transformers # required for embedding
-# Benchmarking
-aiohttp
-# quantization
-bitsandbytes==0.42.0
--- a/requirements-lint.txt
+++ b/requirements-lint.txt
+# formatting
+yapf==0.32.0
+toml==0.10.2
+tomli==2.0.1
+ruff==0.1.5
+codespell==2.3.0
+isort==5.13.2
+clang-format==18.1.5
+# type checking
+mypy==1.9.0
+types-PyYAML
+types-requests
+types-setuptools
--- a/requirements-test.txt
+++ b/requirements-test.txt
+# testing
+pytest
+tensorizer>=2.9.0
+pytest-forked
+pytest-asyncio
+pytest-rerunfailures
+pytest-shard
+# testing utils
+awscli
+einops # required for MPT
+httpx
+peft
+requests
+ray
+sentence-transformers # required for embedding
+# Benchmarking
+aiohttp
+# quantization
+bitsandbytes==0.42.0
--- a/requirements-tpu.txt
+++ b/requirements-tpu.txt
+# Common dependencies
+-r requirements-common.txt
+# Dependencies for TPU
+# Currently, the TPU backend uses a nightly version of PyTorch XLA.
+# You can install the dependencies in Dockerfile.tpu.
+triton  # To avoid import errors
--- a/setup.py
+++ b/setup.py
@@ -144,6 +144,7 @@ class cmake_build_ext(build_ext):
            cmake_args += [
                '-DCMAKE_CXX_COMPILER_LAUNCHER=sccache',
                '-DCMAKE_CUDA_COMPILER_LAUNCHER=sccache',
+                '-DCMAKE_C_COMPILER_LAUNCHER=sccache',
            ]
        elif is_ccache_available():
            cmake_args += [
@@ -175,7 +176,6 @@ class cmake_build_ext(build_ext):
        else:
            # Default build tool to whatever cmake picks.
            build_tool = []
        subprocess.check_call(
            ['cmake', ext.cmake_lists_dir, *build_tool, *cmake_args],
            cwd=self.build_temp)
@@ -210,9 +210,9 @@ class cmake_build_ext(build_ext):
 def _is_cuda() -> bool:
-    return VLLM_TARGET_DEVICE == "cuda" \
+    has_cuda = torch.version.cuda is not None
-            and torch.version.cuda is not None \
+    return (VLLM_TARGET_DEVICE == "cuda" and has_cuda
-            and not _is_neuron()
+            and not (_is_neuron() or _is_tpu()))
 def _is_hip() -> bool:
@@ -229,10 +229,18 @@ def _is_neuron() -> bool:
    return torch_neuronx_installed or VLLM_TARGET_DEVICE == "neuron"
+def _is_tpu() -> bool:
+    return VLLM_TARGET_DEVICE == "tpu"
 def _is_cpu() -> bool:
    return VLLM_TARGET_DEVICE == "cpu"
+def _build_custom_ops() -> bool:
+    return _is_cuda() or _is_hip() or _is_cpu()
 def _install_punica() -> bool:
    return envs.VLLM_INSTALL_PUNICA_KERNELS
@@ -350,8 +358,8 @@ def get_version_add(sha: Optional[str] = None) -> str:
    version += ".torch" + torch.__version__[:5]
    with open(add_version_path, encoding="utf-8",mode="w") as file:
-        file.write("__version__='0.5.0'\n")
+        file.write("__version__='0.5.0.post1'\n")
-        file.write("__dcu_version__='0.5.0+{}'\n".format(version))
+        file.write("__dcu_version__='0.5.0.post1+{}'\n".format(version))
    file.close()
@@ -364,7 +372,7 @@ def get_version():
 def get_vllm_version() -> str:
-    version = find_version(get_path("vllm", "__init__.py"))
+    version = find_version(get_path("vllm", "version.py"))
    if _is_cuda():
        cuda_version = str(get_nvcc_cuda_version())
@@ -384,6 +392,8 @@ def get_vllm_version() -> str:
        if neuron_version != MAIN_CUDA_VERSION:
            neuron_version_str = neuron_version.replace(".", "")[:3]
            version += f"+neuron{neuron_version_str}"
+    elif _is_tpu():
+        version += "+tpu"
    elif _is_cpu():
        version += "+cpu"
    else:
@@ -431,6 +441,8 @@ def get_requirements() -> List[str]:
        requirements = _read_requirements("requirements-rocm.txt")
    elif _is_neuron():
        requirements = _read_requirements("requirements-neuron.txt")
+    elif _is_tpu():
+        requirements = _read_requirements("requirements-tpu.txt")
    elif _is_cpu():
        requirements = _read_requirements("requirements-cpu.txt")
    else:
@@ -444,7 +456,7 @@ ext_modules = []
 if _is_cuda() or _is_hip():
    ext_modules.append(CMakeExtension(name="vllm._moe_C"))
-if not _is_neuron():
+if _build_custom_ops():
    ext_modules.append(CMakeExtension(name="vllm._C"))
    if _install_punica():
@@ -487,6 +499,6 @@ setup(
    extras_require={
        "tensorizer": ["tensorizer>=2.9.0"],
    },
-    cmdclass={"build_ext": cmake_build_ext} if not _is_neuron() else {},
+    cmdclass={"build_ext": cmake_build_ext} if _build_custom_ops() else {},
    package_data=package_data,
 )
--- a/tests/async_engine/test_openapi_server_ray.py
+++ b/tests/async_engine/test_openapi_server_ray.py
@@ -4,16 +4,22 @@ import pytest
 # and debugging.
 import ray
-from ..utils import ServerRunner
+from ..utils import VLLM_PATH, RemoteOpenAIServer
 # any model with a chat template should work here
 MODEL_NAME = "facebook/opt-125m"
 @pytest.fixture(scope="module")
-def server():
+def ray_ctx():
-    ray.init()
+    ray.init(runtime_env={"working_dir": VLLM_PATH})
-    server_runner = ServerRunner.remote([
+    yield
+    ray.shutdown()
+@pytest.fixture(scope="module")
+def server(ray_ctx):
+    return RemoteOpenAIServer([
        "--model",
        MODEL_NAME,
        # use half precision for speed and memory savings in CI environment
@@ -24,22 +30,15 @@ def server():
        "--enforce-eager",
        "--engine-use-ray"
    ])
-    ray.get(server_runner.ready.remote())
-    yield server_runner
-    ray.shutdown()
 @pytest.fixture(scope="module")
-def client():
+def client(server):
-    client = openai.AsyncOpenAI(
+    return server.get_async_client()
-        base_url="http://localhost:8000/v1",
-        api_key="token-abc123",
-    )
-    yield client
 @pytest.mark.asyncio
-async def test_check_models(server, client: openai.AsyncOpenAI):
+async def test_check_models(client: openai.AsyncOpenAI):
    models = await client.models.list()
    models = models.data
    served_model = models[0]
@@ -48,7 +47,7 @@ async def test_check_models(server, client: openai.AsyncOpenAI):
 @pytest.mark.asyncio
-async def test_single_completion(server, client: openai.AsyncOpenAI):
+async def test_single_completion(client: openai.AsyncOpenAI):
    completion = await client.completions.create(model=MODEL_NAME,
                                                 prompt="Hello, my name is",
                                                 max_tokens=5,
@@ -72,7 +71,7 @@ async def test_single_completion(server, client: openai.AsyncOpenAI):
 @pytest.mark.asyncio
-async def test_single_chat_session(server, client: openai.AsyncOpenAI):
+async def test_single_chat_session(client: openai.AsyncOpenAI):
    messages = [{
        "role": "system",
        "content": "you are a helpful assistant"

--- a/tests/conftest.py
+++ b/tests/conftest.py
 import contextlib
 import gc
 import os
-import subprocess
-import sys
 from typing import Any, Dict, List, Optional, Tuple, TypeVar
 import pytest
@@ -15,13 +13,14 @@ from transformers import (AutoModelForCausalLM, AutoModelForVision2Seq,
 from vllm import LLM, SamplingParams
 from vllm.config import TokenizerPoolConfig, VisionLanguageConfig
-from vllm.distributed import destroy_model_parallel
+from vllm.distributed import (destroy_distributed_environment,
+                              destroy_model_parallel)
 from vllm.inputs import TextPrompt
 from vllm.logger import init_logger
 from vllm.multimodal import MultiModalData
 from vllm.multimodal.image import ImageFeatureData, ImagePixelData
 from vllm.sequence import SampleLogprobs
-from vllm.utils import is_cpu
+from vllm.utils import cuda_device_count_stateless, is_cpu
 logger = init_logger(__name__)
@@ -54,6 +53,7 @@ def _read_prompts(filename: str) -> List[str]:
 def cleanup():
    destroy_model_parallel()
+    destroy_distributed_environment()
    with contextlib.suppress(AssertionError):
        torch.distributed.destroy_process_group()
    gc.collect()
@@ -537,15 +537,4 @@ def num_gpus_available():
    """Get number of GPUs without initializing the CUDA context
    in current process."""
-    try:
+    return cuda_device_count_stateless()
-        out = subprocess.run([
-            sys.executable, "-c",
-            "import torch; print(torch.cuda.device_count())"
-        ],
-                             capture_output=True,
-                             check=True,
-                             text=True)
-    except subprocess.CalledProcessError as e:
-        logger.warning("Failed to get number of GPUs.", exc_info=e)
-        return 0
-    return int(out.stdout.strip())
--- a/tests/core/test_chunked_prefill_scheduler.py
+++ b/tests/core/test_chunked_prefill_scheduler.py
@@ -149,7 +149,7 @@ def test_complex():
    # Only the first seq group has a new token appended.
    append_new_token(running[0], 1)
-    # Add 2 more requsets.
+    # Add 2 more requests.
    for i in range(2, 4):
        _, seq_group = create_dummy_prompt(str(i), prompt_length=60)
        scheduler.add_seq_group(seq_group)

--- a/tests/distributed/test_custom_all_reduce.py
+++ b/tests/distributed/test_custom_all_reduce.py
@@ -7,9 +7,9 @@ import torch
 import torch.distributed as dist
 from vllm.distributed.communication_op import (  # noqa
-    graph_capture, tensor_model_parallel_all_reduce)
+    tensor_model_parallel_all_reduce)
 from vllm.distributed.parallel_state import (get_tensor_model_parallel_group,
-                                             get_tp_ca_communicator)
+                                             get_tp_group, graph_capture)
 from ..utils import (init_test_distributed_environment,
                     multi_process_tensor_parallel)
@@ -91,7 +91,7 @@ def eager_allreduce(tp_size, pp_size, rank, distributed_init_port):
    # communicate independently
    num_communication = rank // tp_size + 1
    sz = 1024
-    fa = get_tp_ca_communicator()
+    fa = get_tp_group().ca_comm
    inp = torch.ones(sz, dtype=torch.float32, device=device)
    out = inp
    for _ in range(num_communication):

--- a/tests/distributed/test_pynccl.py
+++ b/tests/distributed/test_pynccl.py
@@ -6,10 +6,11 @@ import torch
 import torch.distributed
 from vllm.distributed.communication_op import (  # noqa
-    graph_capture, tensor_model_parallel_all_reduce)
+    tensor_model_parallel_all_reduce)
 from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator
 from vllm.distributed.device_communicators.pynccl_wrapper import NCCLLibrary
 from vllm.distributed.parallel_state import (ensure_model_parallel_initialized,
+                                             get_world_group, graph_capture,
                                             init_distributed_environment)
 from vllm.utils import update_environment_variables
@@ -53,7 +54,8 @@ def worker_fn_wrapper(fn):
 @worker_fn_wrapper
 def worker_fn():
-    pynccl_comm = PyNcclCommunicator()
+    pynccl_comm = PyNcclCommunicator(get_world_group().cpu_group,
+                                     device=get_world_group().device)
    tensor = torch.ones(16, 1024, 1024,
                        dtype=torch.float32).cuda(pynccl_comm.rank)
    with pynccl_comm.change_state(enable=True):
@@ -129,7 +131,8 @@ def test_pynccl_multiple_allreduce_with_vllm():
 def worker_fn_with_cudagraph():
    with torch.no_grad():
        graph = torch.cuda.CUDAGraph()
-        pynccl_comm = PyNcclCommunicator()
+        pynccl_comm = PyNcclCommunicator(get_world_group().cpu_group,
+                                         device=get_world_group().device)
        # run something in the default stream to initialize torch engine
        a = torch.ones((4, 4), device=f'cuda:{pynccl_comm.rank}')
        torch.cuda.synchronize()
@@ -154,7 +157,8 @@ def test_pynccl_with_cudagraph():
 @worker_fn_wrapper
 def send_recv_worker_fn():
-    pynccl_comm = PyNcclCommunicator()
+    pynccl_comm = PyNcclCommunicator(get_world_group().cpu_group,
+                                     device=get_world_group().device)
    if pynccl_comm.rank == 0:
        tensor = torch.ones(16, 1024, 1024,
                            dtype=torch.float32).cuda(pynccl_comm.rank)