Commit 6d2051cc authored by zhuwenwen

Merge tag 'v0.6.3.post1' into v0.6.3.post1-dev

parents 2c7f740a a2c71c54
...@@ -4,6 +4,7 @@ sphinx-copybutton==0.5.2 ...@@ -4,6 +4,7 @@ sphinx-copybutton==0.5.2
myst-parser==2.0.0 myst-parser==2.0.0
sphinx-argparse==0.4.0 sphinx-argparse==0.4.0
msgspec msgspec
cloudpickle
# packages to install to build the documentation # packages to install to build the documentation
pydantic >= 2.8 pydantic >= 2.8
...@@ -12,4 +13,5 @@ torch ...@@ -12,4 +13,5 @@ torch
py-cpuinfo py-cpuinfo
transformers transformers
mistral_common >= 1.3.4 mistral_common >= 1.3.4
openai # Required by docs/source/serving/openai_compatible_server.md's vllm.entrypoints.openai.cli_args openai # Required by docs/source/serving/openai_compatible_server.md's vllm.entrypoints.openai.cli_args
\ No newline at end of file partial-json-parser # Required by docs/source/serving/openai_compatible_server.md's vllm.entrypoints.openai.cli_args
\ No newline at end of file
...@@ -25,7 +25,7 @@ Module Contents ...@@ -25,7 +25,7 @@ Module Contents
LLM Engine Inputs LLM Engine Inputs
----------------- -----------------
.. autoclass:: vllm.inputs.LLMInputs .. autoclass:: vllm.inputs.DecoderOnlyInputs
:members: :members:
:show-inheritance: :show-inheritance:
......
...@@ -8,7 +8,7 @@ Multi-Modality ...@@ -8,7 +8,7 @@ Multi-Modality
vLLM provides experimental support for multi-modal models through the :mod:`vllm.multimodal` package. vLLM provides experimental support for multi-modal models through the :mod:`vllm.multimodal` package.
Multi-modal inputs can be passed alongside text and token prompts to :ref:`supported models <supported_vlms>` Multi-modal inputs can be passed alongside text and token prompts to :ref:`supported models <supported_vlms>`
via the ``multi_modal_data`` field in :class:`vllm.inputs.PromptInputs`. via the ``multi_modal_data`` field in :class:`vllm.inputs.PromptType`.
Currently, vLLM only has built-in support for image data. You can extend vLLM to process additional modalities Currently, vLLM only has built-in support for image data. You can extend vLLM to process additional modalities
by following :ref:`this guide <adding_multimodal_plugin>`. by following :ref:`this guide <adding_multimodal_plugin>`.
......
LLM Inputs LLM Inputs
========== ==========
.. autodata:: vllm.inputs.PromptInputs .. autodata:: vllm.inputs.PromptType
.. autoclass:: vllm.inputs.TextPrompt .. autoclass:: vllm.inputs.TextPrompt
:show-inheritance: :show-inheritance:
......
...@@ -59,20 +59,6 @@ Build from source ...@@ -59,20 +59,6 @@ Build from source
$ pip install "cmake>=3.26" wheel packaging ninja "setuptools-scm>=8" numpy $ pip install "cmake>=3.26" wheel packaging ninja "setuptools-scm>=8" numpy
$ pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu $ pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
- Third, build and install oneDNN library from source:
.. code-block:: console
$ git clone -b rls-v3.5 https://github.com/oneapi-src/oneDNN.git
$ cmake -B ./oneDNN/build -S ./oneDNN -G Ninja -DONEDNN_LIBRARY_TYPE=STATIC \
-DONEDNN_BUILD_DOC=OFF \
-DONEDNN_BUILD_EXAMPLES=OFF \
-DONEDNN_BUILD_TESTS=OFF \
-DONEDNN_BUILD_GRAPH=OFF \
-DONEDNN_ENABLE_WORKLOAD=INFERENCE \
-DONEDNN_ENABLE_PRIMITIVE=MATMUL
$ cmake --build ./oneDNN/build --target install --config Release
- Finally, build and install vLLM CPU backend: - Finally, build and install vLLM CPU backend:
.. code-block:: console .. code-block:: console
......
.. _debugging: .. _debugging:
===============
Debugging Tips Debugging Tips
=============== ===============
Debugging hang/crash issues This document outlines some debugging strategies you can consider. If you think you've discovered a bug, please `search existing issues <https://github.com/vllm-project/vllm/issues?q=is%3Aissue>`_ first to see if it has already been reported. If not, please `file a new issue <https://github.com/vllm-project/vllm/issues/new/choose>`_, providing as much relevant information as possible.
---------------------------
.. note::
Once you've debugged a problem, remember to turn off any debugging environment variables defined, or simply start a new shell to avoid being affected by lingering debugging settings. Otherwise, the system might be slow with debugging functionalities left activated.
When an vLLM instance hangs or crashes, it is very difficult to debug the issue. But wait a minute, it is also possible that vLLM is doing something that indeed takes a long time: Hangs downloading a model
----------------------------------------
If the model isn't already downloaded to disk, vLLM will download it from the internet, which can take a long time depending on your internet connection.
It's recommended to download the model first using the `huggingface-cli <https://huggingface.co/docs/huggingface_hub/en/guides/cli>`_ and then pass the local path of the model to vLLM. This way, you can isolate the issue.
- **Downloading a model**: Do you have the model already downloaded in your disk? If not, vLLM will download the model from the internet, which can take a long time. Be sure to check the internet connection. It would be better to download the model first using `huggingface-cli <https://huggingface.co/docs/huggingface_hub/en/guides/cli>`_ and then use the local path to the model. This way, you can isolate the issue. Hangs loading a model from disk
- **Loading the model from disk**: If the model is large, it can take a long time to load the model from disk. Please take care of the location you store the model. Some clusters have shared filesystems across nodes, e.g. distributed filesystem or network filesystem, which can be slow. It would be better to store the model in a local disk. In addition, please also watch the CPU memory usage. When the model is too large, it might take much CPU memory, which can slow down the operating system because it needs to frequently swap memory between the disk and the memory. ----------------------------------------
- **Tensor parallel inference**: If the model is too large to fit in a single GPU, you might want to use tensor parallelism to split the model across multiple GPUs. In that case, every process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism). You can convert the model checkpoint to a sharded checkpoint using `the provided script <https://docs.vllm.ai/en/latest/getting_started/examples/save_sharded_state.html>`_ . The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism. If the model is large, it can take a long time to load it from disk. Pay attention to where you store the model. Some clusters have shared filesystems across nodes, e.g. a distributed filesystem or a network filesystem, which can be slow.
It's better to store the model on a local disk. Additionally, watch the CPU memory usage: when the model is too large, it might take a lot of CPU memory, slowing down the operating system because it needs to frequently swap between disk and memory.
If you have already taken care of the above issues, but the vLLM instance still hangs, with CPU and GPU utilization at near zero, it is likely that the vLLM instance is stuck somewhere. Here are some tips to help debug the issue: Model is too large
----------------------------------------
If the model is too large to fit in a single GPU, you might want to `consider tensor parallelism <https://docs.vllm.ai/en/latest/serving/distributed_serving.html#distributed-inference-and-serving>`_ to split the model across multiple GPUs. In that case, every process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism). You can convert the model checkpoint to a sharded checkpoint using `this example <https://docs.vllm.ai/en/latest/getting_started/examples/save_sharded_state.html>`_ . The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
- Set the environment variable ``export VLLM_LOGGING_LEVEL=DEBUG`` to turn on more logging. Enable more logging
- Set the environment variable ``export CUDA_LAUNCH_BLOCKING=1`` to know exactly which CUDA kernel is causing the trouble. ----------------------------------------
- Set the environment variable ``export NCCL_DEBUG=TRACE`` to turn on more logging for NCCL. If other strategies don't solve the problem, it's likely that the vLLM instance is stuck somewhere. You can use the following environment variables to help debug the issue:
- Set the environment variable ``export VLLM_TRACE_FUNCTION=1``. All the function calls in vLLM will be recorded. Inspect these log files, and tell which function crashes or hangs.
With more logging, hopefully you can find the root cause of the issue. - ``export VLLM_LOGGING_LEVEL=DEBUG`` to turn on more logging.
- ``export CUDA_LAUNCH_BLOCKING=1`` to identify which CUDA kernel is causing the problem.
- ``export NCCL_DEBUG=TRACE`` to turn on more logging for NCCL.
- ``export VLLM_TRACE_FUNCTION=1`` to record all function calls for inspection in the log files to tell which function crashes or hangs.
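The variables listed above can also be scoped to a single child process instead of being exported into the shell, which avoids leaving debugging settings behind afterwards. The following is a minimal sketch, not part of vLLM: the helper name ``with_debug_env`` and the script name in the usage comment are hypothetical.

```python
import os

# The four debugging variables named in this section, with the values it suggests.
DEBUG_VARS = {
    "VLLM_LOGGING_LEVEL": "DEBUG",
    "CUDA_LAUNCH_BLOCKING": "1",
    "NCCL_DEBUG": "TRACE",
    "VLLM_TRACE_FUNCTION": "1",
}


def with_debug_env(base=None):
    """Return a copy of `base` (default: the current environment) with debugging enabled.

    The caller's environment is left untouched, so nothing lingers once
    the child process exits.
    """
    env = dict(os.environ if base is None else base)
    env.update(DEBUG_VARS)
    return env


# Usage (hypothetical script name):
#   import subprocess, sys
#   subprocess.run([sys.executable, "my_vllm_script.py"], env=with_debug_env())
```

Because the variables only exist in the child's environment, there is nothing to unset once debugging is finished.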
If it crashes, and the error trace shows somewhere around ``self.graph.replay()`` in ``vllm/worker/model_runner.py``, it is a cuda error inside cudagraph. To know the particular cuda operation that causes the error, you can add ``--enforce-eager`` to the command line, or ``enforce_eager=True`` to the :class:`~vllm.LLM` class, to disable the cudagraph optimization. This way, you can locate the exact cuda operation that causes the error. Incorrect network setup
----------------------------------------
The vLLM instance cannot get the correct IP address if you have a complicated network config. You can find a log such as ``DEBUG 06-10 21:32:17 parallel_state.py:88] world_size=8 rank=0 local_rank=0 distributed_init_method=tcp://xxx.xxx.xxx.xxx:54641 backend=nccl`` and the IP address should be the correct one.
If it's not, override the IP address using the environment variable ``export VLLM_HOST_IP=<your_ip_address>``.
Here are some common issues that can cause hangs: You might also need to set ``export NCCL_SOCKET_IFNAME=<your_network_interface>`` and ``export GLOO_SOCKET_IFNAME=<your_network_interface>`` to specify the network interface for the IP address.
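To check the address vLLM picked, the ``distributed_init_method`` endpoint can be pulled out of a debug log line like the one quoted above. This is a rough sketch that assumes the log format shown in this section and an IPv4 address or hostname (bracketed IPv6 addresses are not handled); the function name ``init_endpoint`` is hypothetical.

```python
import re

# Matches e.g. "distributed_init_method=tcp://10.0.0.5:54641" in a debug log line.
_INIT_RE = re.compile(r"distributed_init_method=tcp://(?P<host>[^\s:]+):(?P<port>\d+)")


def init_endpoint(log_line):
    """Return (host, port) from a parallel_state debug line, or None if absent."""
    match = _INIT_RE.search(log_line)
    if match is None:
        return None
    return match.group("host"), int(match.group("port"))
```

If the returned host is not the address you expect for that node, that is the case where overriding ``VLLM_HOST_IP`` applies.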
- **Incorrect network setup**: The vLLM instance cannot get the correct IP address if you have complicated network config. You can find the log such as ``DEBUG 06-10 21:32:17 parallel_state.py:88] world_size=8 rank=0 local_rank=0 distributed_init_method=tcp://xxx.xxx.xxx.xxx:54641 backend=nccl``. The IP address should be the correct one. If not, override the IP address by setting the environment variable ``export VLLM_HOST_IP=your_ip_address``. You might also need to set ``export NCCL_SOCKET_IFNAME=your_network_interface`` and ``export GLOO_SOCKET_IFNAME=your_network_interface`` to specify the network interface for the IP address. Error near ``self.graph.replay()``
- **Incorrect hardware/driver**: GPU/CPU communication cannot be established. You can run the following sanity check script to see if the GPU/CPU communication is working correctly. ----------------------------------------
If vLLM crashes and the error trace captures it somewhere around ``self.graph.replay()`` in ``vllm/worker/model_runner.py``, it is a CUDA error inside CUDAGraph.
To identify the particular CUDA operation that causes the error, you can add ``--enforce-eager`` to the command line, or ``enforce_eager=True`` to the :class:`~vllm.LLM` class to disable the CUDAGraph optimization and isolate the exact CUDA operation that causes the error.
Incorrect hardware/driver
----------------------------------------
If GPU/CPU communication cannot be established, you can use the following Python script and follow the instructions below to confirm whether the GPU/CPU communication is working correctly.
.. code-block:: python .. code-block:: python
...@@ -84,33 +105,29 @@ Here are some common issues that can cause hangs: ...@@ -84,33 +105,29 @@ Here are some common issues that can cause hangs:
dist.destroy_process_group(gloo_group) dist.destroy_process_group(gloo_group)
dist.destroy_process_group() dist.destroy_process_group()
.. tip:: If you are testing with a single node, adjust ``--nproc-per-node`` to the number of GPUs you want to use:
Save the script as ``test.py``. .. code-block:: shell
If you are testing in a single-node, run it with ``NCCL_DEBUG=TRACE torchrun --nproc-per-node=8 test.py``, adjust ``--nproc-per-node`` to the number of GPUs you want to use.
If you are testing with multi-nodes, run it with ``NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR test.py``. Adjust ``--nproc-per-node`` and ``--nnodes`` according to your setup. Make sure ``MASTER_ADDR``:
- is the correct IP address of the master node
- is reachable from all nodes
- is set before running the script.
If the script runs successfully, you should see the message ``sanity check is successful!``. NCCL_DEBUG=TRACE torchrun --nproc-per-node=<number-of-GPUs> test.py
Note that multi-node environment is more complicated than single-node. If you see errors such as ``torch.distributed.DistNetworkError``, it is likely that the network/DNS setup is incorrect. In that case, you can manually assign node rank and specify the IP via command line arguments: If you are testing with multi-nodes, adjust ``--nproc-per-node`` and ``--nnodes`` according to your setup and set ``MASTER_ADDR`` to the correct IP address of the master node, reachable from all nodes. Then, run:
- In the first node, run ``NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 0 --master_addr $MASTER_ADDR test.py``. .. code-block:: shell
- In the second node, run ``NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 1 --master_addr $MASTER_ADDR test.py``.
NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR test.py
Adjust ``--nproc-per-node``, ``--nnodes``, and ``--node-rank`` according to your setup. The difference is that you need to execute different commands (with different ``--node-rank``) on different nodes. If the script runs successfully, you should see the message ``sanity check is successful!``.
If the problem persists, feel free to `open an issue on GitHub <https://github.com/vllm-project/vllm/issues/new/choose>`_, with a detailed description of the issue, your environment, and the logs. .. note::
Some known issues: A multi-node environment is more complicated than a single-node one. If you see errors such as ``torch.distributed.DistNetworkError``, it is likely that the network/DNS setup is incorrect. In that case, you can manually assign node rank and specify the IP via command line arguments:
- In ``v0.5.2``, ``v0.5.3``, and ``v0.5.3.post1``, there is a bug caused by `zmq <https://github.com/zeromq/pyzmq/issues/2000>`_ , which can cause hangs at a low probability (once in about 20 times, depending on the machine configuration). The solution is to upgrade to the latest version of ``vllm`` to include the `fix <https://github.com/vllm-project/vllm/pull/6759>`_ . - In the first node, run ``NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 0 --master_addr $MASTER_ADDR test.py``.
- In the second node, run ``NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 1 --master_addr $MASTER_ADDR test.py``.
.. warning:: Adjust ``--nproc-per-node``, ``--nnodes``, and ``--node-rank`` according to your setup, being sure to execute different commands (with different ``--node-rank``) on different nodes.
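Since the per-node commands differ only in ``--node-rank``, they can be generated rather than typed by hand. The sketch below uses only the ``torchrun`` flags shown above; the helper name ``torchrun_commands`` is hypothetical, and the defaults mirror the two-node example.

```python
import shlex


def torchrun_commands(master_addr, nnodes=2, nproc_per_node=2, script="test.py"):
    """Build one shell command per node, differing only in --node-rank."""
    commands = []
    for rank in range(nnodes):
        argv = [
            "torchrun",
            "--nnodes", str(nnodes),
            f"--nproc-per-node={nproc_per_node}",
            "--node-rank", str(rank),
            "--master_addr", master_addr,
            script,
        ]
        # NCCL_DEBUG=TRACE is prepended as an inline environment assignment.
        commands.append("NCCL_DEBUG=TRACE " + shlex.join(argv))
    return commands
```

Run the command at index ``i`` on the node that should have rank ``i``.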
After you find the root cause and solve the issue, remember to turn off all the debugging environment variables defined above, or simply start a new shell to avoid being affected by the debugging settings. If you don't do this, the system might be slow because many debugging functionalities are turned on. Known Issues
----------------------------------------
- In ``v0.5.2``, ``v0.5.3``, and ``v0.5.3.post1``, there is a bug caused by `zmq <https://github.com/zeromq/pyzmq/issues/2000>`_ , which can occasionally cause vLLM to hang depending on the machine configuration. The solution is to upgrade to the latest version of ``vllm`` to include the `fix <https://github.com/vllm-project/vllm/pull/6759>`_.
.. _installation: .. _installation:
============
Installation Installation
============ ============
vLLM is a Python library that also contains pre-compiled C++ and CUDA (12.1) binaries. vLLM is a Python library that also contains pre-compiled C++ and CUDA (12.1) binaries.
Requirements Requirements
------------ ===========================
* OS: Linux * OS: Linux
* Python: 3.8 -- 3.12 * Python: 3.8 -- 3.12
* GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, H100, etc.) * GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, H100, etc.)
Install with pip Install released versions
---------------- ===========================
You can install vLLM using pip: You can install vLLM using pip:
...@@ -46,98 +47,173 @@ You can install vLLM using pip: ...@@ -46,98 +47,173 @@ You can install vLLM using pip:
Therefore, it is recommended to install vLLM with a **fresh new** conda environment. If either you have a different CUDA version or you want to use an existing PyTorch installation, you need to build vLLM from source. See below for instructions. Therefore, it is recommended to install vLLM with a **fresh new** conda environment. If either you have a different CUDA version or you want to use an existing PyTorch installation, you need to build vLLM from source. See below for instructions.
.. note::
vLLM also publishes a subset of wheels (Python 3.10, 3.11 with CUDA 12) for every commit since v0.5.3. You can download them with the following command: .. _install-the-latest-code:
.. code-block:: console Install the latest code
=========================
LLM inference is a fast-evolving field, and the latest code may contain bug fixes, performance improvements, and new features that are not released yet. To allow users to try the latest code without waiting for the next release, vLLM provides wheels for Linux on the x86 platform with CUDA 12 for every commit since v0.5.3. You can download and install the latest one with the following command:
.. code-block:: console
$ pip install https://vllm-wheels.s3.us-west-2.amazonaws.com/nightly/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
If you want to access the wheels for previous commits, you can specify the commit hash in the URL:
.. code-block:: console
$ export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd # use full commit hash from the main branch
$ pip install https://vllm-wheels.s3.us-west-2.amazonaws.com/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
Note that the wheels are built with the Python 3.8 ABI (see `PEP 425 <https://peps.python.org/pep-0425/>`_ for more details about the ABI), so **they are compatible with Python 3.8 and later**. The version string in the wheel file name (``1.0.0.dev``) is just a placeholder to have a unified URL for the wheels. The actual versions of the wheels are contained in the wheel metadata.
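The two URL forms above differ only in the path segment between the bucket and the wheel file name. As an illustration, the sketch below builds either URL; the function name ``wheel_url`` is hypothetical, and the 40-character check is an assumption about what "full commit hash" means for a Git SHA-1.

```python
import re

WHEEL_BASE = "https://vllm-wheels.s3.us-west-2.amazonaws.com"
# The version in the file name is a placeholder, as noted above.
WHEEL_NAME = "vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl"


def wheel_url(commit=None):
    """Return the nightly wheel URL, or the URL for a specific full commit hash."""
    if commit is None:
        return f"{WHEEL_BASE}/nightly/{WHEEL_NAME}"
    # Assumption: a full Git SHA-1 is 40 lowercase hexadecimal characters.
    if not re.fullmatch(r"[0-9a-f]{40}", commit):
        raise ValueError("expected a full 40-character commit hash")
    return f"{WHEEL_BASE}/{commit}/{WHEEL_NAME}"
```

Rejecting abbreviated hashes early gives a clearer error than a failed download.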
Another way to access the latest code is to use the docker images:
.. code-block:: console
$ export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd # use full commit hash from the main branch
$ docker pull public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:${VLLM_COMMIT}
$ export VLLM_VERSION=0.6.1.post1 # vLLM's main branch version is currently set to latest released tag These Docker images are used for CI and testing only and are not intended for production use. They expire after several days.
$ pip install https://vllm-wheels.s3.us-west-2.amazonaws.com/nightly/vllm-${VLLM_VERSION}-cp38-abi3-manylinux1_x86_64.whl
$ # You can also access a specific commit
$ # export VLLM_COMMIT=...
$ # pip install https://vllm-wheels.s3.us-west-2.amazonaws.com/${VLLM_COMMIT}/vllm-${VLLM_VERSION}-cp38-abi3-manylinux1_x86_64.whl
Latest code can contain bugs and may not be stable. Please use it with caution.
.. _build_from_source: .. _build_from_source:
Build from source Build from source
----------------- ==================
You can also build and install vLLM from source: .. _python-only-build:
Python-only build (without compilation)
----------------------------------------
If you only need to change Python code, you can simply build vLLM without compilation.
The first step is to install the latest vLLM wheel:
.. code-block:: console
pip install https://vllm-wheels.s3.us-west-2.amazonaws.com/nightly/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
You can find more information about vLLM's wheels `above <#install-the-latest-code>`_.
After verifying that the installation is successful, you can use `the following script <https://github.com/vllm-project/vllm/blob/main/python_only_dev.py>`_:
.. code-block:: console .. code-block:: console
$ git clone https://github.com/vllm-project/vllm.git $ git clone https://github.com/vllm-project/vllm.git
$ cd vllm $ cd vllm
$ pip install -e . # This may take 5-10 minutes. $ python python_only_dev.py
.. note:: The script will:
This will uninstall existing PyTorch, and install the version required by vLLM. If you want to use an existing PyTorch installation, there need to be some changes: * Find the installed vLLM package in the current environment.
* Copy built files to the current directory.
* Rename the installed vLLM package.
* Symbolically link the current directory to the installed vLLM package.
.. code-block:: console Now, you can edit the Python code in the current directory, and the changes will be reflected when you run vLLM.
$ git clone https://github.com/vllm-project/vllm.git Once you have finished editing or want to install another vLLM wheel, you should exit the development environment using `the same script <https://github.com/vllm-project/vllm/blob/main/python_only_dev.py>`_ with the ``--quit-dev`` (or ``-q`` for short) flag:
$ cd vllm
$ python use_existing_torch.py .. code-block:: console
$ pip install -r requirements-build.txt
$ pip install -e . --no-build-isolation
The differences are: $ python python_only_dev.py --quit-dev
- ``python use_existing_torch.py``: This script will remove all the PyTorch versions in the requirements files, so that the existing PyTorch installation will be used. The script with ``--quit-dev`` flag will:
- ``pip install -r requirements-build.txt``: You need to manually install the requirements for building vLLM.
- ``pip install -e . --no-build-isolation``: You need to disable build isolation, so that the build system can use the existing PyTorch installation.
This is especially useful when the PyTorch dependency cannot be easily installed via pip, e.g.: * Remove the symbolic link from the current directory to the vLLM package.
* Restore the original vLLM package from the backup.
- build vLLM with PyTorch nightly or a custom PyTorch build. If you update the vLLM wheel and want to rebuild from source and make further edits, you will need to repeat `all the steps above <#python-only-build>`_.
- build vLLM with aarch64 and cuda (GH200), where the PyTorch wheels are not available on PyPI. Currently, only PyTorch nightly has wheels for aarch64 with CUDA. You can run ``pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124`` to install PyTorch nightly, and then build vLLM on top of it.
.. note:: .. note::
vLLM can fully run only on Linux, but you can still build it on other systems (for example, macOS). This build is only for development purposes, allowing for imports and a more convenient dev environment. The binaries will not be compiled and not work on non-Linux systems. You can create such a build with the following commands: There is a possibility that your source code may have a different commit ID compared to the latest vLLM wheel, which could potentially lead to unknown errors.
It is recommended to use the same commit ID for the source code as the vLLM wheel you have installed. Please refer to `the above section <#install-the-latest-code>`_ for instructions on how to install a specified wheel.
.. code-block:: console Full build (with compilation)
---------------------------------
$ export VLLM_TARGET_DEVICE=empty If you want to modify C++ or CUDA code, you'll need to build vLLM from source. This can take several minutes:
$ pip install -e .
.. code-block:: console
$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ pip install -e .
.. tip:: .. tip::
Building from source requires quite a lot compilation. If you are building from source for multiple times, it is beneficial to cache the compilation results. For example, you can install `ccache <https://github.com/ccache/ccache>`_ via either ``conda install ccache`` or ``apt install ccache`` . As long as ``which ccache`` command can find the ``ccache`` binary, it will be used automatically by the build system. After the first build, the subsequent builds will be much faster. Building from source requires a lot of compilation. If you are building from source repeatedly, it's more efficient to cache the compilation results.
For example, you can install `ccache <https://github.com/ccache/ccache>`_ using ``conda install ccache`` or ``apt install ccache`` .
As long as ``which ccache`` command can find the ``ccache`` binary, it will be used automatically by the build system. After the first build, subsequent builds will be much faster.
.. tip::
To avoid your system being overloaded, you can limit the number of compilation jobs
to be run simultaneously, via the environment variable ``MAX_JOBS``. For example:
.. code-block:: console Use an existing PyTorch installation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
There are scenarios where the PyTorch dependency cannot be easily installed via pip, e.g.:
$ export MAX_JOBS=6 * Building vLLM with PyTorch nightly or a custom PyTorch build.
$ pip install -e . * Building vLLM with aarch64 and CUDA (GH200), where the PyTorch wheels are not available on PyPI. Currently, only the PyTorch nightly has wheels for aarch64 with CUDA. You can run ``pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124`` to `install PyTorch nightly <https://pytorch.org/get-started/locally/>`_, and then build vLLM on top of it.
This is especially useful when you are building on less powerful machines. For example, when you use WSL, it only `gives you half of the memory by default <https://learn.microsoft.com/en-us/windows/wsl/wsl-config>`_, and you'd better use ``export MAX_JOBS=1`` to avoid compiling multiple files simultaneously and running out of memory. The side effect is that the build process will be much slower. If you only touch the Python code, slow compilation is okay, as you are building in an editable mode: you can just change the code and run the Python script without any re-compilation or re-installation. To build vLLM using an existing PyTorch installation:
.. tip:: .. code-block:: console
If you have trouble building vLLM, we recommend using the NVIDIA PyTorch Docker image.
.. code-block:: console $ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ python use_existing_torch.py
$ pip install -r requirements-build.txt
$ pip install -e . --no-build-isolation
$ # Use `--ipc=host` to make sure the shared memory is large enough.
$ docker run --gpus all -it --rm --ipc=host nvcr.io/nvidia/pytorch:23.10-py3
If you don't want to use docker, it is recommended to have a full installation of CUDA Toolkit. You can download and install it from `the official website <https://developer.nvidia.com/cuda-toolkit-archive>`_. After installation, set the environment variable ``CUDA_HOME`` to the installation path of CUDA Toolkit, and make sure that the ``nvcc`` compiler is in your ``PATH``, e.g.: Troubleshooting
~~~~~~~~~~~~~~~~~
.. code-block:: console To avoid your system being overloaded, you can limit the number of compilation jobs
to be run simultaneously, via the environment variable ``MAX_JOBS``. For example:
$ export CUDA_HOME=/usr/local/cuda .. code-block:: console
$ export PATH="${CUDA_HOME}/bin:$PATH"
Here is a sanity check to verify that the CUDA Toolkit is correctly installed: $ export MAX_JOBS=6
$ pip install -e .
.. code-block:: console This is especially useful when you are building on less powerful machines. For example, when you use WSL it only `assigns 50% of the total memory by default <https://learn.microsoft.com/en-us/windows/wsl/wsl-config#main-wsl-settings>`_, so using ``export MAX_JOBS=1`` can avoid compiling multiple files simultaneously and running out of memory.
A side effect is a much slower build process.
Additionally, if you have trouble building vLLM, we recommend using the NVIDIA PyTorch Docker image.
.. code-block:: console
$ # Use `--ipc=host` to make sure the shared memory is large enough.
$ docker run --gpus all -it --rm --ipc=host nvcr.io/nvidia/pytorch:23.10-py3
If you don't want to use docker, it is recommended to have a full installation of CUDA Toolkit. You can download and install it from `the official website <https://developer.nvidia.com/cuda-toolkit-archive>`_. After installation, set the environment variable ``CUDA_HOME`` to the installation path of CUDA Toolkit, and make sure that the ``nvcc`` compiler is in your ``PATH``, e.g.:
.. code-block:: console
$ export CUDA_HOME=/usr/local/cuda
$ export PATH="${CUDA_HOME}/bin:$PATH"
Here is a sanity check to verify that the CUDA Toolkit is correctly installed:
.. code-block:: console
$ nvcc --version # verify that nvcc is in your PATH
$ ${CUDA_HOME}/bin/nvcc --version # verify that nvcc is in your CUDA_HOME
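The same two checks can be scripted, for example before kicking off a long build. This is a minimal sketch using only the Python standard library; the function name ``find_nvcc`` and the result keys are hypothetical, and it only locates the binary without running it.

```python
import os
import shutil


def find_nvcc(env=None):
    """Locate nvcc via PATH and via CUDA_HOME; each entry is a path or None."""
    env = os.environ if env is None else env

    # Check 1: is nvcc discoverable on PATH?
    on_path = shutil.which("nvcc", path=env.get("PATH"))

    # Check 2: does ${CUDA_HOME}/bin/nvcc exist and is it executable?
    in_cuda_home = None
    cuda_home = env.get("CUDA_HOME")
    if cuda_home:
        candidate = os.path.join(cuda_home, "bin", "nvcc")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            in_cuda_home = candidate

    return {"on_path": on_path, "in_cuda_home": in_cuda_home}
```

If the two entries point at different installations, the build may pick up a different toolkit than the one ``CUDA_HOME`` names.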
Unsupported OS build
----------------------
vLLM can fully run only on Linux, but you can still build it on other systems (for example, macOS) for development purposes, allowing for imports and a more convenient development environment. The binaries will not be compiled and won't work on non-Linux systems.
Simply set the ``VLLM_TARGET_DEVICE`` environment variable to ``empty`` before installing:
.. code-block:: console
$ nvcc --version # verify that nvcc is in your PATH $ export VLLM_TARGET_DEVICE=empty
$ ${CUDA_HOME}/bin/nvcc --version # verify that nvcc is in your CUDA_HOME $ pip install -e .
...@@ -27,6 +27,10 @@ Installation steps: ...@@ -27,6 +27,10 @@ Installation steps:
.. _build_from_source_neuron: .. _build_from_source_neuron:
.. note::
The currently supported version of PyTorch for Neuron installs ``triton`` version ``2.1.0``. This is incompatible with vLLM >= 0.5.3. You may see an error ``cannot import name 'default_dump_dir...``. To work around this, run ``pip install --upgrade triton==3.0.0`` after installing the vLLM wheel.
Build from source Build from source
----------------- -----------------
......
...@@ -3,7 +3,7 @@ ...@@ -3,7 +3,7 @@
Installation with OpenVINO Installation with OpenVINO
========================== ==========================
vLLM powered by OpenVINO supports all LLM models from :doc:`vLLM supported models list <../models/supported_models>` and can perform optimal model serving on all x86-64 CPUs with, at least, AVX2 support. OpenVINO vLLM backend supports the following advanced vLLM features: vLLM powered by OpenVINO supports all LLM models from :doc:`vLLM supported models list <../models/supported_models>` and can perform optimal model serving on all x86-64 CPUs with, at least, AVX2 support, as well as on both integrated and discrete Intel® GPUs (`the list of supported GPUs <https://docs.openvino.ai/2024/about-openvino/release-notes-openvino/system-requirements.html#gpu>`_). OpenVINO vLLM backend supports the following advanced vLLM features:
- Prefix caching (``--enable-prefix-caching``) - Prefix caching (``--enable-prefix-caching``)
- Chunked prefill (``--enable-chunked-prefill``) - Chunked prefill (``--enable-chunked-prefill``)
...@@ -53,34 +53,57 @@ Install from source ...@@ -53,34 +53,57 @@ Install from source
$ pip install --upgrade pip $ pip install --upgrade pip
$ pip install -r requirements-build.txt --extra-index-url https://download.pytorch.org/whl/cpu $ pip install -r requirements-build.txt --extra-index-url https://download.pytorch.org/whl/cpu
- Finally, install vLLM with OpenVINO backend: - Finally, install vLLM with OpenVINO backend:
.. code-block:: console .. code-block:: console
$ PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" VLLM_TARGET_DEVICE=openvino python -m pip install -v . $ PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" VLLM_TARGET_DEVICE=openvino python -m pip install -v .
- [Optional] To use vLLM OpenVINO backend with a GPU device, ensure your system is properly set up. Follow the instructions provided here: `https://docs.openvino.ai/2024/get-started/configurations/configurations-intel-gpu.html <https://docs.openvino.ai/2024/get-started/configurations/configurations-intel-gpu.html>`_.
.. _openvino_backend_performance_tips: .. _openvino_backend_performance_tips:
Performance tips Performance tips
---------------- ----------------
vLLM OpenVINO backend uses the following environment variables to control behavior: vLLM OpenVINO backend environment variables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- ``VLLM_OPENVINO_DEVICE`` to specify which device to use for inference. If there are multiple GPUs in the system, an additional index can be used to choose the proper one (e.g., ``VLLM_OPENVINO_DEVICE=GPU.1``). If the value is not specified, the CPU device is used by default.
- ``VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON`` to enable U8 weight compression during the model loading stage. By default, compression is turned off. You can also export the model with different compression techniques using ``optimum-cli`` and pass the exported folder as ``<model_id>``.
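The two variables above are usually set in the shell, but they can also be prepared from Python before a vLLM process is launched. The following is only a minimal sketch: the helper name ``openvino_env`` is hypothetical and not part of vLLM, and the variable names and values are the ones listed above.

```python
import os


def openvino_env(device="CPU", quantized_weights=False, base_env=None):
    """Build an environment mapping for a vLLM OpenVINO process.

    `openvino_env` is an illustrative helper, not a vLLM API.
    """
    # Start from the current environment unless a base mapping is given.
    env = dict(os.environ if base_env is None else base_env)
    # Device selection, e.g. "CPU", "GPU", or "GPU.1" for a specific GPU.
    env["VLLM_OPENVINO_DEVICE"] = device
    # U8 weight compression is off by default, so set the flag only on request.
    if quantized_weights:
        env["VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS"] = "ON"
    return env


env = openvino_env(device="GPU.1", quantized_weights=True, base_env={})
print(env)
```

The resulting mapping can be passed as the ``env`` argument of ``subprocess.run`` when starting the server or a benchmark script.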
CPU performance tips
~~~~~~~~~~~~~~~~~~~~
CPU uses the following environment variables to control behavior:
- ``VLLM_OPENVINO_KVCACHE_SPACE`` to specify the KV Cache size (e.g, ``VLLM_OPENVINO_KVCACHE_SPACE=40`` means 40 GB space for KV cache), larger setting will allow vLLM running more requests in parallel. This parameter should be set based on the hardware configuration and memory management pattern of users. - ``VLLM_OPENVINO_KVCACHE_SPACE`` to specify the KV Cache size (e.g, ``VLLM_OPENVINO_KVCACHE_SPACE=40`` means 40 GB space for KV cache), larger setting will allow vLLM running more requests in parallel. This parameter should be set based on the hardware configuration and memory management pattern of users.
- ``VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8`` to control KV cache precision. By default, FP16 / BF16 is used depending on platform. - ``VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8`` to control KV cache precision. By default, FP16 / BF16 is used depending on platform.
- ``VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON`` to enable U8 weights compression during model loading stage. By default, compression is turned off. You can also export model with different compression techniques using `optimum-cli` and pass exported folder as `<model_id>`
To enable better TPOT / TTFT latency, you can use vLLM's chunked prefill feature (``--enable-chunked-prefill``). Based on the experiments, the recommended batch size is ``256`` (``--max-num-batched-tokens``) To enable better TPOT / TTFT latency, you can use vLLM's chunked prefill feature (``--enable-chunked-prefill``). Based on the experiments, the recommended batch size is ``256`` (``--max-num-batched-tokens``)
OpenVINO best known configuration is: OpenVINO best known configuration for CPU is:
.. code-block:: console .. code-block:: console
$ VLLM_OPENVINO_KVCACHE_SPACE=100 VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8 VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON \ $ VLLM_OPENVINO_KVCACHE_SPACE=100 VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8 VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON \
python3 vllm/benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-chat-hf --dataset vllm/benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json --enable-chunked-prefill --max-num-batched-tokens 256 python3 vllm/benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-chat-hf --dataset vllm/benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json --enable-chunked-prefill --max-num-batched-tokens 256
GPU performance tips
~~~~~~~~~~~~~~~~~~~~
The GPU device automatically detects the available GPU memory and, by default, tries to reserve as much of it as possible for the KV cache (taking the ``gpu_memory_utilization`` option into account). This behavior can be overridden by explicitly specifying the desired amount of memory for the KV cache with the ``VLLM_OPENVINO_KVCACHE_SPACE`` environment variable (e.g., ``VLLM_OPENVINO_KVCACHE_SPACE=8`` means 8 GB of space for the KV cache).
Currently, the best GPU performance is achieved with the default vLLM execution parameters for models with quantized weights (8-bit and 4-bit integer data types are supported) and ``preemption-mode=swap``.
OpenVINO best known configuration for GPU is:
.. code-block:: console
$ VLLM_OPENVINO_DEVICE=GPU VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON \
python3 vllm/benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-chat-hf --dataset vllm/benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json
.. _openvino_backend_limitations: .. _openvino_backend_limitations:
Limitations Limitations
......
...@@ -79,12 +79,14 @@ Documentation ...@@ -79,12 +79,14 @@ Documentation
serving/openai_compatible_server serving/openai_compatible_server
serving/deploying_with_docker serving/deploying_with_docker
serving/deploying_with_k8s
serving/distributed_serving serving/distributed_serving
serving/metrics serving/metrics
serving/env_vars serving/env_vars
serving/usage_stats serving/usage_stats
serving/integrations serving/integrations
serving/tensorizer serving/tensorizer
serving/compatibility_matrix
serving/faq serving/faq
.. toctree:: .. toctree::
......
...@@ -85,21 +85,21 @@ When it comes to the linear layers, we provide the following options to parallel ...@@ -85,21 +85,21 @@ When it comes to the linear layers, we provide the following options to parallel
* :code:`ReplicatedLinear`: Replicates the inputs and weights across multiple GPUs. No memory saving. * :code:`ReplicatedLinear`: Replicates the inputs and weights across multiple GPUs. No memory saving.
* :code:`RowParallelLinear`: The input tensor is partitioned along the hidden dimension. The weight matrix is partitioned along the rows (input dimension). An *all-reduce* operation is performed after the matrix multiplication to reduce the results. Typically used for the second FFN layer and the output linear transformation of the attention layer. * :code:`RowParallelLinear`: The input tensor is partitioned along the hidden dimension. The weight matrix is partitioned along the rows (input dimension). An *all-reduce* operation is performed after the matrix multiplication to reduce the results. Typically used for the second FFN layer and the output linear transformation of the attention layer.
* :code:`ColumnParallelLinear`: The input tensor is replicated. The weight matrix is partitioned along the columns (output dimension). The result is partitioned along the column dimension. Typically used for the first FFN layer and the separated QKV transformation of the attention layer in the original Transformer. * :code:`ColumnParallelLinear`: The input tensor is replicated. The weight matrix is partitioned along the columns (output dimension). The result is partitioned along the column dimension. Typically used for the first FFN layer and the separated QKV transformation of the attention layer in the original Transformer.
* :code:`MergedColumnParallelLinear`: Column-parallel linear that merges multiple `ColumnParallelLinear` operators. Typically used for the first FFN layer with weighted activation functions (e.g., SiLU). This class handles the sharded weight loading logic of multiple weight matrices. * :code:`MergedColumnParallelLinear`: Column-parallel linear that merges multiple :code:`ColumnParallelLinear` operators. Typically used for the first FFN layer with weighted activation functions (e.g., SiLU). This class handles the sharded weight loading logic of multiple weight matrices.
* :code:`QKVParallelLinear`: Parallel linear layer for the query, key, and value projections of the multi-head and grouped-query attention mechanisms. When number of key/value heads are less than the world size, this class replicates the key/value heads properly. This class handles the weight loading and replication of the weight matrices. * :code:`QKVParallelLinear`: Parallel linear layer for the query, key, and value projections of the multi-head and grouped-query attention mechanisms. When number of key/value heads are less than the world size, this class replicates the key/value heads properly. This class handles the weight loading and replication of the weight matrices.
Note that all the linear layers above take `linear_method` as an input. vLLM will set this parameter according to different quantization schemes to support weight quantization. Note that all the linear layers above take :code:`linear_method` as an input. vLLM will set this parameter according to different quantization schemes to support weight quantization.
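The partitioning schemes described above can be illustrated with a small, dependency-free sketch. This is not vLLM's implementation: the function names are hypothetical, ranks are simulated with a loop, and the weight is stored as an (input x output) nested list so that rows correspond to the input dimension and columns to the output dimension.

```python
def matmul(x, w):
    """Multiply an (n x k) matrix by a (k x m) matrix given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def column_parallel(x, w, world_size):
    """Replicate the input, split the weight along its columns (output dim)."""
    shard = len(w[0]) // world_size
    outputs = []
    for rank in range(world_size):
        w_shard = [row[rank * shard:(rank + 1) * shard] for row in w]
        outputs.append(matmul(x, w_shard))  # each rank produces a column slice
    # Concatenate the column slices only to compare with the reference result.
    return [sum((out[i] for out in outputs), []) for i in range(len(x))]


def row_parallel(x, w, world_size):
    """Split the input along the hidden dim and the weight along its rows."""
    shard = len(w) // world_size
    partials = []
    for rank in range(world_size):
        x_shard = [row[rank * shard:(rank + 1) * shard] for row in x]
        w_shard = w[rank * shard:(rank + 1) * shard]
        partials.append(matmul(x_shard, w_shard))  # partial sums per rank
    # The element-wise sum below plays the role of the all-reduce.
    return [
        [sum(p[i][j] for p in partials) for j in range(len(w[0]))]
        for i in range(len(x))
    ]


x = [[1, 2, 3, 4], [5, 6, 7, 8]]
w = [[1, 0], [0, 1], [1, 1], [2, -1]]
reference = matmul(x, w)
print(reference, column_parallel(x, w, 2), row_parallel(x, w, 2))
```

Both schemes reproduce the unpartitioned result. The difference is where communication happens: the column-parallel output stays partitioned, while the row-parallel output needs the all-reduce.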
4. Implement the weight loading logic 4. Implement the weight loading logic
------------------------------------- -------------------------------------
You now need to implement the :code:`load_weights` method in your :code:`*ForCausalLM` class. You now need to implement the :code:`load_weights` method in your :code:`*ForCausalLM` class.
This method should load the weights from the HuggingFace's checkpoint file and assign them to the corresponding layers in your model. Specifically, for `MergedColumnParallelLinear` and `QKVParallelLinear` layers, if the original model has separated weight matrices, you need to load the different parts separately. This method should load the weights from the HuggingFace's checkpoint file and assign them to the corresponding layers in your model. Specifically, for :code:`MergedColumnParallelLinear` and :code:`QKVParallelLinear` layers, if the original model has separated weight matrices, you need to load the different parts separately.
5. Register your model 5. Register your model
---------------------- ----------------------
Finally, register your :code:`*ForCausalLM` class to the :code:`_MODELS` in `vllm/model_executor/models/__init__.py <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/__init__.py>`_. Finally, register your :code:`*ForCausalLM` class to the :code:`_VLLM_MODELS` in `vllm/model_executor/models/registry.py <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/registry.py>`_.
6. Out-of-Tree Model Integration 6. Out-of-Tree Model Integration
-------------------------------------------- --------------------------------------------
...@@ -114,6 +114,18 @@ Just add the following lines in your code: ...@@ -114,6 +114,18 @@ Just add the following lines in your code:
from your_code import YourModelForCausalLM from your_code import YourModelForCausalLM
ModelRegistry.register_model("YourModelForCausalLM", YourModelForCausalLM) ModelRegistry.register_model("YourModelForCausalLM", YourModelForCausalLM)
If your model imports modules that initialize CUDA, consider instead lazy-importing it to avoid an error like :code:`RuntimeError: Cannot re-initialize CUDA in forked subprocess`:
.. code-block:: python
from vllm import ModelRegistry
ModelRegistry.register_model("YourModelForCausalLM", "your_code:YourModelForCausalLM")
.. important::
If your model is a multimodal model, make sure the model class implements the :class:`~vllm.model_executor.models.interfaces.SupportsMultiModal` interface.
Read more about that :ref:`here <enabling_multimodal_inputs>`.
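The string form ``"your_code:YourModelForCausalLM"`` used for lazy registration defers the import until the class is actually needed. As a rough illustration of that idea only (not vLLM's actual resolution logic), a ``module:ClassName`` reference can be resolved with the standard library as follows. The helper name ``resolve_lazy_reference`` is hypothetical.

```python
import importlib


def resolve_lazy_reference(reference):
    """Resolve a "module:ClassName" string to the object it names."""
    module_name, sep, attr_name = reference.partition(":")
    if not sep or not module_name or not attr_name:
        raise ValueError(f"expected 'module:ClassName', got {reference!r}")
    # The module is imported only at this point, not at registration time.
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


# A standard-library class stands in for a model class here.
cls = resolve_lazy_reference("collections:OrderedDict")
print(cls.__name__)
```

Because the string alone is stored at registration time, any CUDA initialization triggered by importing the model module is postponed until the reference is resolved.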
If you are running api server with :code:`vllm serve <args>`, you can wrap the entrypoint with the following code: If you are running api server with :code:`vllm serve <args>`, you can wrap the entrypoint with the following code:
.. code-block:: python .. code-block:: python
......
...@@ -22,6 +22,8 @@ If you frequently encounter preemptions from the vLLM engine, consider the follo ...@@ -22,6 +22,8 @@ If you frequently encounter preemptions from the vLLM engine, consider the follo
You can also monitor the number of preemption requests through Prometheus metrics exposed by the vLLM. Additionally, you can log the cumulative number of preemption requests by setting disable_log_stats=False. You can also monitor the number of preemption requests through Prometheus metrics exposed by the vLLM. Additionally, you can log the cumulative number of preemption requests by setting disable_log_stats=False.
.. _chunked-prefill:
Chunked Prefill Chunked Prefill
--------------- ---------------
vLLM supports chunked prefill as an experimental feature. Chunked prefill splits large prefills into smaller chunks and batches them together with decode requests. vLLM supports chunked prefill as an experimental feature. Chunked prefill splits large prefills into smaller chunks and batches them together with decode requests.
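To make the idea concrete, here is a toy sketch of how one long prefill could be split under a per-step token budget. It is a simplification under stated assumptions, not vLLM's scheduler: the function name ``plan_steps`` is hypothetical, each running decode request is assumed to consume one token per step, and the prefill is assumed to receive whatever budget is left over.

```python
def plan_steps(prefill_len, num_decodes, max_num_batched_tokens):
    """Return a list of (prefill_chunk_tokens, decode_tokens) per step.

    Assumes every decode request uses one token per step and the prefill
    fills the remaining budget (a simplification for illustration).
    """
    if num_decodes >= max_num_batched_tokens:
        raise ValueError("no token budget left for prefill chunks")
    steps = []
    remaining = prefill_len
    while remaining > 0:
        budget = max_num_batched_tokens - num_decodes
        chunk = min(remaining, budget)  # never exceed the leftover budget
        steps.append((chunk, num_decodes))
        remaining -= chunk
    return steps


steps = plan_steps(prefill_len=1000, num_decodes=56, max_num_batched_tokens=256)
print(steps)
```

With a budget of 256 tokens and 56 decode requests, the 1000-token prefill is processed in five steps of 200 prefill tokens each instead of a single large step.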
......
...@@ -30,7 +30,6 @@ The following code configures vLLM in an offline mode to use speculative decodin ...@@ -30,7 +30,6 @@ The following code configures vLLM in an offline mode to use speculative decodin
tensor_parallel_size=1, tensor_parallel_size=1,
speculative_model="facebook/opt-125m", speculative_model="facebook/opt-125m",
num_speculative_tokens=5, num_speculative_tokens=5,
use_v2_block_manager=True,
) )
outputs = llm.generate(prompts, sampling_params) outputs = llm.generate(prompts, sampling_params)
...@@ -44,10 +43,10 @@ To perform the same with an online mode launch the server: ...@@ -44,10 +43,10 @@ To perform the same with an online mode launch the server:
.. code-block:: bash .. code-block:: bash
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --model facebook/opt-6.7b \ python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --model facebook/opt-6.7b \
--seed 42 -tp 1 --speculative_model facebook/opt-125m --use-v2-block-manager \ --seed 42 -tp 1 --speculative_model facebook/opt-125m --use-v2-block-manager \
--num_speculative_tokens 5 --gpu_memory_utilization 0.8 --num_speculative_tokens 5 --gpu_memory_utilization 0.8
Then use a client: Then use a client:
.. code-block:: python .. code-block:: python
...@@ -104,7 +103,6 @@ matching n-grams in the prompt. For more information read `this thread. <https:/ ...@@ -104,7 +103,6 @@ matching n-grams in the prompt. For more information read `this thread. <https:/
speculative_model="[ngram]", speculative_model="[ngram]",
num_speculative_tokens=5, num_speculative_tokens=5,
ngram_prompt_lookup_max=4, ngram_prompt_lookup_max=4,
use_v2_block_manager=True,
) )
outputs = llm.generate(prompts, sampling_params) outputs = llm.generate(prompts, sampling_params)
...@@ -135,7 +133,6 @@ For more information see `this blog <https://pytorch.org/blog/hitchhikers-guide- ...@@ -135,7 +133,6 @@ For more information see `this blog <https://pytorch.org/blog/hitchhikers-guide-
tensor_parallel_size=4, tensor_parallel_size=4,
speculative_model="ibm-fms/llama3-70b-accelerator", speculative_model="ibm-fms/llama3-70b-accelerator",
speculative_draft_tensor_parallel_size=1, speculative_draft_tensor_parallel_size=1,
use_v2_block_manager=True,
) )
outputs = llm.generate(prompts, sampling_params) outputs = llm.generate(prompts, sampling_params)
......
...@@ -3,292 +3,457 @@ ...@@ -3,292 +3,457 @@
Supported Models Supported Models
================ ================
vLLM supports a variety of generative Transformer models in `HuggingFace Transformers <https://huggingface.co/models>`_. vLLM supports a variety of generative Transformer models in `HuggingFace (HF) Transformers <https://huggingface.co/models>`_.
The following is the list of model architectures that are currently supported by vLLM. The following is the list of model architectures that are currently supported by vLLM.
Alongside each architecture, we include some popular models that use it. Alongside each architecture, we include some popular models that use it.
---- Text-only Language Models
^^^^^^^^^^^^^^^^^^^^^^^^^
Text Generation
---------------
Decoder-only Language Models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. list-table:: .. list-table::
:widths: 25 25 50 5 :widths: 25 25 50 5 5
:header-rows: 1 :header-rows: 1
* - Architecture * - Architecture
- Models - Models
- Example HuggingFace Models - Example HF Models
- :ref:`LoRA <lora>` - :ref:`LoRA <lora>`
- :ref:`PP <distributed_serving>`
* - :code:`AquilaForCausalLM` * - :code:`AquilaForCausalLM`
- Aquila & Aquila2 - Aquila, Aquila2
- :code:`BAAI/Aquila-7B`, :code:`BAAI/AquilaChat-7B`, etc. - :code:`BAAI/Aquila-7B`, :code:`BAAI/AquilaChat-7B`, etc.
- ✅︎ - ✅︎
- ✅︎
* - :code:`ArcticForCausalLM` * - :code:`ArcticForCausalLM`
- Arctic - Arctic
- :code:`Snowflake/snowflake-arctic-base`, :code:`Snowflake/snowflake-arctic-instruct`, etc. - :code:`Snowflake/snowflake-arctic-base`, :code:`Snowflake/snowflake-arctic-instruct`, etc.
- -
- ✅︎
* - :code:`BaiChuanForCausalLM` * - :code:`BaiChuanForCausalLM`
- Baichuan & Baichuan2 - Baichuan2, Baichuan
- :code:`baichuan-inc/Baichuan2-13B-Chat`, :code:`baichuan-inc/Baichuan-7B`, etc. - :code:`baichuan-inc/Baichuan2-13B-Chat`, :code:`baichuan-inc/Baichuan-7B`, etc.
- ✅︎ - ✅︎
- ✅︎
* - :code:`BloomForCausalLM` * - :code:`BloomForCausalLM`
- BLOOM, BLOOMZ, BLOOMChat - BLOOM, BLOOMZ, BLOOMChat
- :code:`bigscience/bloom`, :code:`bigscience/bloomz`, etc. - :code:`bigscience/bloom`, :code:`bigscience/bloomz`, etc.
- -
- ✅︎
* - :code:`BartForConditionalGeneration`
- BART
- :code:`facebook/bart-base`, :code:`facebook/bart-large-cnn`, etc.
-
-
* - :code:`ChatGLMModel` * - :code:`ChatGLMModel`
- ChatGLM - ChatGLM
- :code:`THUDM/chatglm2-6b`, :code:`THUDM/chatglm3-6b`, etc. - :code:`THUDM/chatglm2-6b`, :code:`THUDM/chatglm3-6b`, etc.
- ✅︎ - ✅︎
- ✅︎
* - :code:`CohereForCausalLM` * - :code:`CohereForCausalLM`
- Command-R - Command-R
- :code:`CohereForAI/c4ai-command-r-v01`, etc. - :code:`CohereForAI/c4ai-command-r-v01`, etc.
- - ✅︎
- ✅︎
* - :code:`DbrxForCausalLM` * - :code:`DbrxForCausalLM`
- DBRX - DBRX
- :code:`databricks/dbrx-base`, :code:`databricks/dbrx-instruct`, etc. - :code:`databricks/dbrx-base`, :code:`databricks/dbrx-instruct`, etc.
- -
- ✅︎
* - :code:`DeciLMForCausalLM` * - :code:`DeciLMForCausalLM`
- DeciLM - DeciLM
- :code:`Deci/DeciLM-7B`, :code:`Deci/DeciLM-7B-instruct`, etc. - :code:`Deci/DeciLM-7B`, :code:`Deci/DeciLM-7B-instruct`, etc.
- -
- ✅︎
* - :code:`DeepseekForCausalLM`
- DeepSeek
- :code:`deepseek-ai/deepseek-llm-67b-base`, :code:`deepseek-ai/deepseek-llm-7b-chat` etc.
-
- ✅︎
* - :code:`DeepseekV2ForCausalLM`
- DeepSeek-V2
- :code:`deepseek-ai/DeepSeek-V2`, :code:`deepseek-ai/DeepSeek-V2-Chat` etc.
-
- ✅︎
* - :code:`ExaoneForCausalLM` * - :code:`ExaoneForCausalLM`
- EXAONE-3 - EXAONE-3
- :code:`LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct`, etc. - :code:`LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct`, etc.
- ✅︎ - ✅︎
- ✅︎
* - :code:`FalconForCausalLM` * - :code:`FalconForCausalLM`
- Falcon - Falcon
- :code:`tiiuae/falcon-7b`, :code:`tiiuae/falcon-40b`, :code:`tiiuae/falcon-rw-7b`, etc. - :code:`tiiuae/falcon-7b`, :code:`tiiuae/falcon-40b`, :code:`tiiuae/falcon-rw-7b`, etc.
- -
- ✅︎
* - :code:`GemmaForCausalLM` * - :code:`GemmaForCausalLM`
- Gemma - Gemma
- :code:`google/gemma-2b`, :code:`google/gemma-7b`, etc. - :code:`google/gemma-2b`, :code:`google/gemma-7b`, etc.
- ✅︎ - ✅︎
- ✅︎
* - :code:`Gemma2ForCausalLM` * - :code:`Gemma2ForCausalLM`
- Gemma2 - Gemma2
- :code:`google/gemma-2-9b`, :code:`google/gemma-2-27b`, etc. - :code:`google/gemma-2-9b`, :code:`google/gemma-2-27b`, etc.
- ✅︎ - ✅︎
- ✅︎
* - :code:`GPT2LMHeadModel` * - :code:`GPT2LMHeadModel`
- GPT-2 - GPT-2
- :code:`gpt2`, :code:`gpt2-xl`, etc. - :code:`gpt2`, :code:`gpt2-xl`, etc.
- -
- ✅︎
* - :code:`GPTBigCodeForCausalLM` * - :code:`GPTBigCodeForCausalLM`
- StarCoder, SantaCoder, WizardCoder - StarCoder, SantaCoder, WizardCoder
- :code:`bigcode/starcoder`, :code:`bigcode/gpt_bigcode-santacoder`, :code:`WizardLM/WizardCoder-15B-V1.0`, etc. - :code:`bigcode/starcoder`, :code:`bigcode/gpt_bigcode-santacoder`, :code:`WizardLM/WizardCoder-15B-V1.0`, etc.
- ✅︎ - ✅︎
- ✅︎
* - :code:`GPTJForCausalLM` * - :code:`GPTJForCausalLM`
- GPT-J - GPT-J
- :code:`EleutherAI/gpt-j-6b`, :code:`nomic-ai/gpt4all-j`, etc. - :code:`EleutherAI/gpt-j-6b`, :code:`nomic-ai/gpt4all-j`, etc.
- -
- ✅︎
* - :code:`GPTNeoXForCausalLM` * - :code:`GPTNeoXForCausalLM`
- GPT-NeoX, Pythia, OpenAssistant, Dolly V2, StableLM - GPT-NeoX, Pythia, OpenAssistant, Dolly V2, StableLM
- :code:`EleutherAI/gpt-neox-20b`, :code:`EleutherAI/pythia-12b`, :code:`OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5`, :code:`databricks/dolly-v2-12b`, :code:`stabilityai/stablelm-tuned-alpha-7b`, etc. - :code:`EleutherAI/gpt-neox-20b`, :code:`EleutherAI/pythia-12b`, :code:`OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5`, :code:`databricks/dolly-v2-12b`, :code:`stabilityai/stablelm-tuned-alpha-7b`, etc.
- -
- ✅︎
* - :code:`GraniteForCausalLM`
- PowerLM
- :code:`ibm/PowerLM-3b` etc.
- ✅︎
- ✅︎
* - :code:`GraniteMoeForCausalLM`
- PowerMoE
- :code:`ibm/PowerMoE-3b` etc.
- ✅︎
- ✅︎
* - :code:`InternLMForCausalLM` * - :code:`InternLMForCausalLM`
- InternLM - InternLM
- :code:`internlm/internlm-7b`, :code:`internlm/internlm-chat-7b`, etc. - :code:`internlm/internlm-7b`, :code:`internlm/internlm-chat-7b`, etc.
- ✅︎ - ✅︎
- ✅︎
* - :code:`InternLM2ForCausalLM` * - :code:`InternLM2ForCausalLM`
- InternLM2 - InternLM2
- :code:`internlm/internlm2-7b`, :code:`internlm/internlm2-chat-7b`, etc. - :code:`internlm/internlm2-7b`, :code:`internlm/internlm2-chat-7b`, etc.
- -
- ✅︎
* - :code:`JAISLMHeadModel` * - :code:`JAISLMHeadModel`
- Jais - Jais
- :code:`core42/jais-13b`, :code:`core42/jais-13b-chat`, :code:`core42/jais-30b-v3`, :code:`core42/jais-30b-chat-v3`, etc. - :code:`core42/jais-13b`, :code:`core42/jais-13b-chat`, :code:`core42/jais-30b-v3`, :code:`core42/jais-30b-chat-v3`, etc.
- -
- ✅︎
* - :code:`JambaForCausalLM` * - :code:`JambaForCausalLM`
- Jamba - Jamba
- :code:`ai21labs/Jamba-v0.1`, etc. - :code:`ai21labs/AI21-Jamba-1.5-Large`, :code:`ai21labs/AI21-Jamba-1.5-Mini`, :code:`ai21labs/Jamba-v0.1`, etc.
- ✅︎ - ✅︎
-
* - :code:`LlamaForCausalLM` * - :code:`LlamaForCausalLM`
- Llama 3.1, Llama 3, Llama 2, LLaMA, Yi - Llama 3.1, Llama 3, Llama 2, LLaMA, Yi
- :code:`meta-llama/Meta-Llama-3.1-405B-Instruct`, :code:`meta-llama/Meta-Llama-3.1-70B`, :code:`meta-llama/Meta-Llama-3-70B-Instruct`, :code:`meta-llama/Llama-2-70b-hf`, :code:`01-ai/Yi-34B`, etc. - :code:`meta-llama/Meta-Llama-3.1-405B-Instruct`, :code:`meta-llama/Meta-Llama-3.1-70B`, :code:`meta-llama/Meta-Llama-3-70B-Instruct`, :code:`meta-llama/Llama-2-70b-hf`, :code:`01-ai/Yi-34B`, etc.
- ✅︎ - ✅︎
- ✅︎
* - :code:`MambaForCausalLM`
- Mamba
- :code:`state-spaces/mamba-130m-hf`, :code:`state-spaces/mamba-790m-hf`, :code:`state-spaces/mamba-2.8b-hf`, etc.
-
-
* - :code:`MiniCPMForCausalLM` * - :code:`MiniCPMForCausalLM`
- MiniCPM - MiniCPM
- :code:`openbmb/MiniCPM-2B-sft-bf16`, :code:`openbmb/MiniCPM-2B-dpo-bf16`, etc. - :code:`openbmb/MiniCPM-2B-sft-bf16`, :code:`openbmb/MiniCPM-2B-dpo-bf16`, :code:`openbmb/MiniCPM-S-1B-sft`, etc.
- - ✅︎
- ✅︎
* - :code:`MiniCPM3ForCausalLM` * - :code:`MiniCPM3ForCausalLM`
- MiniCPM3 - MiniCPM3
- :code:`openbmb/MiniCPM3-4B`, etc. - :code:`openbmb/MiniCPM3-4B`, etc.
- - ✅︎
- ✅︎
* - :code:`MistralForCausalLM` * - :code:`MistralForCausalLM`
- Mistral, Mistral-Instruct - Mistral, Mistral-Instruct
- :code:`mistralai/Mistral-7B-v0.1`, :code:`mistralai/Mistral-7B-Instruct-v0.1`, etc. - :code:`mistralai/Mistral-7B-v0.1`, :code:`mistralai/Mistral-7B-Instruct-v0.1`, etc.
- ✅︎ - ✅︎
- ✅︎
* - :code:`MixtralForCausalLM` * - :code:`MixtralForCausalLM`
- Mixtral-8x7B, Mixtral-8x7B-Instruct - Mixtral-8x7B, Mixtral-8x7B-Instruct
- :code:`mistralai/Mixtral-8x7B-v0.1`, :code:`mistralai/Mixtral-8x7B-Instruct-v0.1`, :code:`mistral-community/Mixtral-8x22B-v0.1`, etc. - :code:`mistralai/Mixtral-8x7B-v0.1`, :code:`mistralai/Mixtral-8x7B-Instruct-v0.1`, :code:`mistral-community/Mixtral-8x22B-v0.1`, etc.
- ✅︎ - ✅︎
- ✅︎
* - :code:`MPTForCausalLM` * - :code:`MPTForCausalLM`
- MPT, MPT-Instruct, MPT-Chat, MPT-StoryWriter - MPT, MPT-Instruct, MPT-Chat, MPT-StoryWriter
- :code:`mosaicml/mpt-7b`, :code:`mosaicml/mpt-7b-storywriter`, :code:`mosaicml/mpt-30b`, etc. - :code:`mosaicml/mpt-7b`, :code:`mosaicml/mpt-7b-storywriter`, :code:`mosaicml/mpt-30b`, etc.
- -
- ✅︎
* - :code:`NemotronForCausalLM` * - :code:`NemotronForCausalLM`
- Nemotron-3, Nemotron-4, Minitron - Nemotron-3, Nemotron-4, Minitron
- :code:`nvidia/Minitron-8B-Base`, :code:`mgoin/Nemotron-4-340B-Base-hf-FP8`, etc. - :code:`nvidia/Minitron-8B-Base`, :code:`mgoin/Nemotron-4-340B-Base-hf-FP8`, etc.
- ✅︎ - ✅︎
* - :code:`OLMoEForCausalLM` - ✅︎
- OLMoE
- :code:`allenai/OLMoE-1B-7B-0924`, :code:`allenai/OLMoE-1B-7B-0924-Instruct`, etc.
-
* - :code:`OLMoForCausalLM` * - :code:`OLMoForCausalLM`
- OLMo - OLMo
- :code:`allenai/OLMo-1B-hf`, :code:`allenai/OLMo-7B-hf`, etc. - :code:`allenai/OLMo-1B-hf`, :code:`allenai/OLMo-7B-hf`, etc.
- -
- ✅︎
* - :code:`OLMoEForCausalLM`
- OLMoE
- :code:`allenai/OLMoE-1B-7B-0924`, :code:`allenai/OLMoE-1B-7B-0924-Instruct`, etc.
- ✅︎
- ✅︎
* - :code:`OPTForCausalLM` * - :code:`OPTForCausalLM`
- OPT, OPT-IML - OPT, OPT-IML
- :code:`facebook/opt-66b`, :code:`facebook/opt-iml-max-30b`, etc. - :code:`facebook/opt-66b`, :code:`facebook/opt-iml-max-30b`, etc.
- -
- ✅︎
* - :code:`OrionForCausalLM` * - :code:`OrionForCausalLM`
- Orion - Orion
- :code:`OrionStarAI/Orion-14B-Base`, :code:`OrionStarAI/Orion-14B-Chat`, etc. - :code:`OrionStarAI/Orion-14B-Base`, :code:`OrionStarAI/Orion-14B-Chat`, etc.
- -
- ✅︎
* - :code:`PhiForCausalLM` * - :code:`PhiForCausalLM`
- Phi - Phi
- :code:`microsoft/phi-1_5`, :code:`microsoft/phi-2`, etc. - :code:`microsoft/phi-1_5`, :code:`microsoft/phi-2`, etc.
- ✅︎ - ✅︎
- ✅︎
* - :code:`Phi3ForCausalLM` * - :code:`Phi3ForCausalLM`
- Phi-3 - Phi-3
- :code:`microsoft/Phi-3-mini-4k-instruct`, :code:`microsoft/Phi-3-mini-128k-instruct`, :code:`microsoft/Phi-3-medium-128k-instruct`, etc. - :code:`microsoft/Phi-3-mini-4k-instruct`, :code:`microsoft/Phi-3-mini-128k-instruct`, :code:`microsoft/Phi-3-medium-128k-instruct`, etc.
- - ✅︎
- ✅︎
* - :code:`Phi3SmallForCausalLM` * - :code:`Phi3SmallForCausalLM`
- Phi-3-Small - Phi-3-Small
- :code:`microsoft/Phi-3-small-8k-instruct`, :code:`microsoft/Phi-3-small-128k-instruct`, etc. - :code:`microsoft/Phi-3-small-8k-instruct`, :code:`microsoft/Phi-3-small-128k-instruct`, etc.
- -
- ✅︎
* - :code:`PhiMoEForCausalLM` * - :code:`PhiMoEForCausalLM`
- Phi-3.5-MoE - Phi-3.5-MoE
- :code:`microsoft/Phi-3.5-MoE-instruct`, etc. - :code:`microsoft/Phi-3.5-MoE-instruct`, etc.
- - ✅︎
- ✅︎
* - :code:`PersimmonForCausalLM` * - :code:`PersimmonForCausalLM`
- Persimmon - Persimmon
- :code:`adept/persimmon-8b-base`, :code:`adept/persimmon-8b-chat`, etc. - :code:`adept/persimmon-8b-base`, :code:`adept/persimmon-8b-chat`, etc.
- -
- ✅︎
* - :code:`QWenLMHeadModel` * - :code:`QWenLMHeadModel`
- Qwen - Qwen
- :code:`Qwen/Qwen-7B`, :code:`Qwen/Qwen-7B-Chat`, etc. - :code:`Qwen/Qwen-7B`, :code:`Qwen/Qwen-7B-Chat`, etc.
- -
- ✅︎
* - :code:`Qwen2ForCausalLM` * - :code:`Qwen2ForCausalLM`
- Qwen2 - Qwen2
- :code:`Qwen/Qwen2-beta-7B`, :code:`Qwen/Qwen2-beta-7B-Chat`, etc. - :code:`Qwen/Qwen2-beta-7B`, :code:`Qwen/Qwen2-beta-7B-Chat`, etc.
- ✅︎ - ✅︎
- ✅︎
* - :code:`Qwen2MoeForCausalLM` * - :code:`Qwen2MoeForCausalLM`
- Qwen2MoE - Qwen2MoE
- :code:`Qwen/Qwen1.5-MoE-A2.7B`, :code:`Qwen/Qwen1.5-MoE-A2.7B-Chat`, etc. - :code:`Qwen/Qwen1.5-MoE-A2.7B`, :code:`Qwen/Qwen1.5-MoE-A2.7B-Chat`, etc.
- -
- ✅︎
* - :code:`StableLmForCausalLM` * - :code:`StableLmForCausalLM`
- StableLM - StableLM
- :code:`stabilityai/stablelm-3b-4e1t/` , :code:`stabilityai/stablelm-base-alpha-7b-v2`, etc. - :code:`stabilityai/stablelm-3b-4e1t`, :code:`stabilityai/stablelm-base-alpha-7b-v2`, etc.
- -
- ✅︎
* - :code:`Starcoder2ForCausalLM` * - :code:`Starcoder2ForCausalLM`
- Starcoder2 - Starcoder2
- :code:`bigcode/starcoder2-3b`, :code:`bigcode/starcoder2-7b`, :code:`bigcode/starcoder2-15b`, etc. - :code:`bigcode/starcoder2-3b`, :code:`bigcode/starcoder2-7b`, :code:`bigcode/starcoder2-15b`, etc.
- -
- ✅︎
* - :code:`SolarForCausalLM` * - :code:`SolarForCausalLM`
- EXAONE-3 - Solar Pro
- :code:`upstage/solar-pro-preview-instruct`, etc. - :code:`upstage/solar-pro-preview-instruct`, etc.
- - ✅︎
- ✅︎
* - :code:`XverseForCausalLM` * - :code:`XverseForCausalLM`
- Xverse - XVERSE
- :code:`xverse/XVERSE-7B-Chat`, :code:`xverse/XVERSE-13B-Chat`, :code:`xverse/XVERSE-65B-Chat`, etc. - :code:`xverse/XVERSE-7B-Chat`, :code:`xverse/XVERSE-13B-Chat`, :code:`xverse/XVERSE-65B-Chat`, etc.
- - ✅︎
- ✅︎
.. note:: .. note::
Currently, the ROCm version of vLLM supports Mistral and Mixtral only for context lengths up to 4096. Currently, the ROCm version of vLLM supports Mistral and Mixtral only for context lengths up to 4096.
.. _supported_vlms: Text Embedding
--------------
.. list-table::
:widths: 25 25 50 5 5
:header-rows: 1
* - Architecture
- Models
- Example HF Models
- :ref:`LoRA <lora>`
- :ref:`PP <distributed_serving>`
* - :code:`Gemma2Model`
- Gemma2-based
- :code:`BAAI/bge-multilingual-gemma2`, etc.
-
- ✅︎
* - :code:`MistralModel`
- Mistral-based
- :code:`intfloat/e5-mistral-7b-instruct`, etc.
-
- ✅︎
Reward Modeling
---------------
.. list-table::
:widths: 25 25 50 5 5
:header-rows: 1
* - Architecture
- Models
- Example HF Models
- :ref:`LoRA <lora>`
- :ref:`PP <distributed_serving>`
* - :code:`Qwen2ForRewardModel`
- Qwen2-based
- :code:`Qwen/Qwen2.5-Math-RM-72B`, etc.
-
- ✅︎
.. note::
As an interim measure, these models are supported via the Embeddings API. See `this RFC <https://github.com/vllm-project/vllm/issues/8967>`_ for upcoming changes.
Multimodal Language Models Multimodal Language Models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^
The following modalities are supported depending on the model:
- **T**\ ext
- **I**\ mage
- **V**\ ideo
- **A**\ udio
.. _supported_vlms:
Text Generation
---------------
.. list-table:: .. list-table::
:widths: 25 25 25 25 5 :widths: 25 25 15 25 5 5
:header-rows: 1 :header-rows: 1
* - Architecture * - Architecture
- Models - Models
- Modalities - Inputs
- Example HuggingFace Models - Example HF Models
- :ref:`LoRA <lora>` - :ref:`LoRA <lora>`
- :ref:`PP <distributed_serving>`
* - :code:`Blip2ForConditionalGeneration` * - :code:`Blip2ForConditionalGeneration`
- BLIP-2 - BLIP-2
- Image\ :sup:`E` - T + I\ :sup:`E`
- :code:`Salesforce/blip2-opt-2.7b`, :code:`Salesforce/blip2-opt-6.7b`, etc. - :code:`Salesforce/blip2-opt-2.7b`, :code:`Salesforce/blip2-opt-6.7b`, etc.
- -
- ✅︎
* - :code:`ChameleonForConditionalGeneration` * - :code:`ChameleonForConditionalGeneration`
- Chameleon - Chameleon
- Image - T + I
- :code:`facebook/chameleon-7b` etc. - :code:`facebook/chameleon-7b` etc.
- -
- ✅︎
* - :code:`FuyuForCausalLM` * - :code:`FuyuForCausalLM`
- Fuyu - Fuyu
- Image - T + I
- :code:`adept/fuyu-8b` etc. - :code:`adept/fuyu-8b` etc.
- -
- ✅︎
* - :code:`ChatGLMModel`
- GLM-4V
- T + I
- :code:`THUDM/glm-4v-9b` etc.
-
- ✅︎
* - :code:`InternVLChatModel` * - :code:`InternVLChatModel`
- InternVL2 - InternVL2
- Image\ :sup:`E+` - T + I\ :sup:`E+`
- :code:`OpenGVLab/InternVL2-4B`, :code:`OpenGVLab/InternVL2-8B`, etc. - :code:`OpenGVLab/InternVL2-4B`, :code:`OpenGVLab/InternVL2-8B`, etc.
- -
- ✅︎
* - :code:`LlavaForConditionalGeneration` * - :code:`LlavaForConditionalGeneration`
- LLaVA-1.5 - LLaVA-1.5
- Image\ :sup:`E+` - T + I\ :sup:`E+`
- :code:`llava-hf/llava-1.5-7b-hf`, :code:`llava-hf/llava-1.5-13b-hf`, etc. - :code:`llava-hf/llava-1.5-7b-hf`, :code:`llava-hf/llava-1.5-13b-hf`, etc.
- -
- ✅︎
* - :code:`LlavaNextForConditionalGeneration` * - :code:`LlavaNextForConditionalGeneration`
- LLaVA-NeXT - LLaVA-NeXT
- Image\ :sup:`E+` - T + I\ :sup:`E+`
- :code:`llava-hf/llava-v1.6-mistral-7b-hf`, :code:`llava-hf/llava-v1.6-vicuna-7b-hf`, etc. - :code:`llava-hf/llava-v1.6-mistral-7b-hf`, :code:`llava-hf/llava-v1.6-vicuna-7b-hf`, etc.
- -
- ✅︎
* - :code:`LlavaNextVideoForConditionalGeneration` * - :code:`LlavaNextVideoForConditionalGeneration`
- LLaVA-NeXT-Video - LLaVA-NeXT-Video
- Video - T + V
- :code:`llava-hf/LLaVA-NeXT-Video-7B-hf`, etc. - :code:`llava-hf/LLaVA-NeXT-Video-7B-hf`, etc.
- -
- ✅︎
* - :code:`LlavaOnevisionForConditionalGeneration` * - :code:`LlavaOnevisionForConditionalGeneration`
- LLaVA-Onevision - LLaVA-Onevision
- Image\ :sup:`+` / Video - T + I\ :sup:`+` + V
- :code:`llava-hf/llava-onevision-qwen2-7b-ov-hf`, :code:`llava-hf/llava-onevision-qwen2-0.5b-ov-hf`, etc. - :code:`llava-hf/llava-onevision-qwen2-7b-ov-hf`, :code:`llava-hf/llava-onevision-qwen2-0.5b-ov-hf`, etc.
- -
- ✅︎
* - :code:`MiniCPMV` * - :code:`MiniCPMV`
- MiniCPM-V - MiniCPM-V
- Image\ :sup:`+` - T + I\ :sup:`E+`
- :code:`openbmb/MiniCPM-V-2` (see note), :code:`openbmb/MiniCPM-Llama3-V-2_5`, :code:`openbmb/MiniCPM-V-2_6`, etc. - :code:`openbmb/MiniCPM-V-2` (see note), :code:`openbmb/MiniCPM-Llama3-V-2_5`, :code:`openbmb/MiniCPM-V-2_6`, etc.
- - ✅︎
- ✅︎
* - :code:`MllamaForConditionalGeneration` * - :code:`MllamaForConditionalGeneration`
- Llama 3.2 - Llama 3.2
- Image - T + I
- :code:`meta-llama/Llama-3.2-90B-Vision-Instruct`, :code:`meta-llama/Llama-3.2-11B-Vision`, etc. - :code:`meta-llama/Llama-3.2-90B-Vision-Instruct`, :code:`meta-llama/Llama-3.2-11B-Vision`, etc.
- -
-
* - :code:`MolmoForCausalLM`
- Molmo
- Image
- :code:`allenai/Molmo-7B-D-0924`, :code:`allenai/Molmo-72B-0924`, etc.
-
- ✅︎
* - :code:`NVLM_D_Model`
- NVLM-D 1.0
- T + I\ :sup:`E+`
- :code:`nvidia/NVLM-D-72B`, etc.
-
- ✅︎
* - :code:`PaliGemmaForConditionalGeneration` * - :code:`PaliGemmaForConditionalGeneration`
- PaliGemma - PaliGemma
- Image\ :sup:`E` - T + I\ :sup:`E`
- :code:`google/paligemma-3b-pt-224`, :code:`google/paligemma-3b-mix-224`, etc. - :code:`google/paligemma-3b-pt-224`, :code:`google/paligemma-3b-mix-224`, etc.
- -
- ✅︎
* - :code:`Phi3VForCausalLM` * - :code:`Phi3VForCausalLM`
- Phi-3-Vision, Phi-3.5-Vision - Phi-3-Vision, Phi-3.5-Vision
- Image\ :sup:`E+` - T + I\ :sup:`E+`
- :code:`microsoft/Phi-3-vision-128k-instruct`, :code:`microsoft/Phi-3.5-vision-instruct` etc. - :code:`microsoft/Phi-3-vision-128k-instruct`, :code:`microsoft/Phi-3.5-vision-instruct` etc.
- -
- ✅︎
* - :code:`PixtralForConditionalGeneration` * - :code:`PixtralForConditionalGeneration`
- Pixtral - Pixtral
- Image\ :sup:`+` - T + I\ :sup:`+`
- :code:`mistralai/Pixtral-12B-2409` - :code:`mistralai/Pixtral-12B-2409`
- -
- ✅︎
* - :code:`QWenLMHeadModel` * - :code:`QWenLMHeadModel`
- Qwen-VL - Qwen-VL
- Image\ :sup:`E+` - T + I\ :sup:`E+`
- :code:`Qwen/Qwen-VL`, :code:`Qwen/Qwen-VL-Chat`, etc. - :code:`Qwen/Qwen-VL`, :code:`Qwen/Qwen-VL-Chat`, etc.
- -
- ✅︎
* - :code:`Qwen2VLForConditionalGeneration` * - :code:`Qwen2VLForConditionalGeneration`
- Qwen2-VL - Qwen2-VL
- Image\ :sup:`+` / Video\ :sup:`+` - T + I\ :sup:`E+` + V\ :sup:`+`
- :code:`Qwen/Qwen2-VL-2B-Instruct`, :code:`Qwen/Qwen2-VL-7B-Instruct`, :code:`Qwen/Qwen2-VL-72B-Instruct`, etc. - :code:`Qwen/Qwen2-VL-2B-Instruct`, :code:`Qwen/Qwen2-VL-7B-Instruct`, :code:`Qwen/Qwen2-VL-72B-Instruct`, etc.
- -
- ✅︎
* - :code:`UltravoxModel` * - :code:`UltravoxModel`
- Ultravox - Ultravox
- Audio\ :sup:`E+` - T + A\ :sup:`E+`
- :code:`fixie-ai/ultravox-v0_3` - :code:`fixie-ai/ultravox-v0_3`
- -
- ✅︎
| :sup:`E` Pre-computed embeddings can be inputted for this modality.
| :sup:`+` Multiple items can be inputted per text prompt for this modality.
...@@ -297,6 +462,26 @@ Multimodal Language Models ...@@ -297,6 +462,26 @@ Multimodal Language Models
For :code:`openbmb/MiniCPM-V-2`, the official repo doesn't work yet, so we need to use a fork (:code:`HwwwH/MiniCPM-V-2`) for now.
For more details, please see: https://github.com/vllm-project/vllm/pull/4087#issuecomment-2250397630
Multimodal Embedding
--------------------
.. list-table::
:widths: 25 25 15 25 5 5
:header-rows: 1
* - Architecture
- Models
- Inputs
- Example HF Models
- :ref:`LoRA <lora>`
- :ref:`PP <distributed_serving>`
* - :code:`Phi3VForCausalLM`
- Phi-3-Vision-based
- T + I
- :code:`TIGER-Lab/VLM2Vec-Full`
- 🚧
- ✅︎
----

If your model uses one of the above model architectures, you can seamlessly run your model with vLLM.
......
...@@ -6,10 +6,9 @@ Using VLMs ...@@ -6,10 +6,9 @@ Using VLMs
vLLM provides experimental support for Vision Language Models (VLMs). See the :ref:`list of supported VLMs here <supported_vlms>`.
This document shows you how to run and serve these models using vLLM.

.. note::
    We are actively iterating on VLM support. See `this RFC <https://github.com/vllm-project/vllm/issues/4194>`_ for upcoming changes,
    and `open an issue on GitHub <https://github.com/vllm-project/vllm/issues/new/choose>`_ if you have any feedback or feature requests.
Offline Inference
-----------------
...@@ -23,14 +22,10 @@ The :class:`~vllm.LLM` class can be instantiated in much the same way as languag ...@@ -23,14 +22,10 @@ The :class:`~vllm.LLM` class can be instantiated in much the same way as languag
    llm = LLM(model="llava-hf/llava-1.5-7b-hf")

To pass an image to the model, note the following in :class:`vllm.inputs.PromptType`:

* ``prompt``: The prompt should follow the format that is documented on HuggingFace.
* ``multi_modal_data``: This is a dictionary that follows the schema defined in :class:`vllm.multimodal.MultiModalDataDict`.

.. code-block:: python
...@@ -39,7 +34,7 @@ To pass an image to the model, note the following in :class:`vllm.inputs.PromptI ...@@ -39,7 +34,7 @@ To pass an image to the model, note the following in :class:`vllm.inputs.PromptI
    # Load the image using PIL.Image
    image = PIL.Image.open(...)

    # Single prompt inference
    outputs = llm.generate({
        "prompt": prompt,
...@@ -60,7 +55,31 @@ To pass an image to the model, note the following in :class:`vllm.inputs.PromptI ...@@ -60,7 +55,31 @@ To pass an image to the model, note the following in :class:`vllm.inputs.PromptI
    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)

    # Inference with image embeddings as input with additional parameters
    # Specifically, we are conducting a trial run of Qwen2VL and MiniCPM-V with the new input format, which utilizes additional parameters.
    mm_data = {}

    image_embeds = torch.load(...)  # torch.Tensor of shape (num_images, image_feature_size, hidden_size of LM)
    # For Qwen2VL, image_grid_thw is needed to calculate positional encoding.
    mm_data['image'] = {
        "image_embeds": image_embeds,
        "image_grid_thw": torch.load(...),  # torch.Tensor of shape (1, 3)
    }
    # For MiniCPM-V, image_size_list is needed to calculate details of the sliced image.
    mm_data['image'] = {
        "image_embeds": image_embeds,
        "image_size_list": [image.size],  # list of image sizes
    }
    outputs = llm.generate({
        "prompt": prompt,
        "multi_modal_data": mm_data,
    })

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)

    # Batch inference
    image_1 = PIL.Image.open(...)
    image_2 = PIL.Image.open(...)
...@@ -104,7 +123,7 @@ Instead of passing in a single image, you can pass in a list of images. ...@@ -104,7 +123,7 @@ Instead of passing in a single image, you can pass in a list of images.
.. code-block:: python

    # Refer to the HuggingFace repo for the correct format to use
    prompt = "<|user|>\n<|image_1|>\n<|image_2|>\nWhat is the content of each image?<|end|>\n<|assistant|>\n"

    # Load the images using PIL.Image
    image1 = PIL.Image.open(...)
...@@ -123,6 +142,33 @@ Instead of passing in a single image, you can pass in a list of images. ...@@ -123,6 +142,33 @@ Instead of passing in a single image, you can pass in a list of images.
A code example can be found in `examples/offline_inference_vision_language_multi_image.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language_multi_image.py>`_.
Multi-image input can be extended to perform video captioning. We show this with `Qwen2-VL <https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct>`_ as it supports videos:
.. code-block:: python

    # Specify the maximum number of frames per video to be 4. This can be changed.
    llm = LLM("Qwen/Qwen2-VL-2B-Instruct", limit_mm_per_prompt={"image": 4})

    # Create the request payload.
    video_frames = ...  # load your video making sure it only has the number of frames specified earlier.
    message = {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this set of frames. Consider the frames to be a part of the same video."},
        ],
    }
    for i in range(len(video_frames)):
        # encode_image is a user-supplied helper that returns the frame as a base64 string.
        base64_image = encode_image(video_frames[i])
        new_image = {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
        message["content"].append(new_image)

    # Perform inference and log output.
    outputs = llm.chat([message])

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)
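
The video captioning snippet calls ``encode_image`` without defining it. A minimal sketch of such a helper follows, assuming each frame is already available as JPEG-encoded bytes. Both function names and the bytes-based input are illustrative assumptions, not part of vLLM.

```python
import base64


def encode_image(jpeg_bytes: bytes) -> str:
    """Return the base64 text for one JPEG-encoded frame (hypothetical helper)."""
    return base64.b64encode(jpeg_bytes).decode("utf-8")


def build_video_message(frames: list[bytes], question: str) -> dict:
    """Build one chat message: the question plus one image_url entry per frame."""
    content = [{"type": "text", "text": question}]
    for frame in frames:
        # Each frame is embedded as a data URL, matching the format used above.
        url = f"data:image/jpeg;base64,{encode_image(frame)}"
        content.append({"type": "image_url", "image_url": {"url": url}})
    return {"role": "user", "content": content}
```

If the frames are held as PIL images or arrays instead, they would first need to be serialized to JPEG bytes before being passed to this helper.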
Online Inference
----------------
...@@ -195,7 +241,7 @@ To consume the server, you can use the OpenAI client like in the example below: ...@@ -195,7 +241,7 @@ To consume the server, you can use the OpenAI client like in the example below:
    print("Chat completion output:", chat_response.choices[0].message.content)

A full code example can be found in `examples/openai_api_client_for_multimodal.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_api_client_for_multimodal.py>`_.

.. note::
......
...@@ -106,7 +106,7 @@ Install ``vllm`` and ``lm-evaluation-harness``: ...@@ -106,7 +106,7 @@ Install ``vllm`` and ``lm-evaluation-harness``:
.. code-block:: console

    $ pip install vllm lm-eval==0.4.4

Load and run the model in ``vllm``:
......
...@@ -28,7 +28,7 @@ The table below shows the compatibility of various quantization implementations ...@@ -28,7 +28,7 @@ The table below shows the compatibility of various quantization implementations
- ✅︎ - ✅︎
- ✗ - ✗
- ✗ - ✗
- - ✅︎
- ✗ - ✗
- ✗ - ✗
* - GPTQ * - GPTQ
...@@ -61,7 +61,7 @@ The table below shows the compatibility of various quantization implementations ...@@ -61,7 +61,7 @@ The table below shows the compatibility of various quantization implementations
- ✅︎ - ✅︎
- ✗ - ✗
- ✗ - ✗
- - ✅︎
- ✗ - ✗
- ✗ - ✗
* - FP8 (W8A8) * - FP8 (W8A8)
......
.. _compatibility_matrix:
Compatibility Matrix
====================
The tables below show mutually exclusive features and the support on some hardware.
.. note::
    Each '✗' that carries a link points to the tracking issue for that unsupported feature or hardware combination.
Feature x Feature
-----------------
.. raw:: html
<style>
/* Make smaller to try to improve readability */
td {
font-size: 0.8rem;
text-align: center;
}
th {
text-align: center;
font-size: 0.8rem;
}
</style>
.. list-table::
:header-rows: 1
:widths: auto
* - Feature
- :ref:`CP <chunked-prefill>`
- :ref:`APC <apc>`
- :ref:`LoRA <lora>`
- :abbr:`prmpt adptr (Prompt Adapter)`
- :ref:`SD <spec_decode>`
- CUDA graph
- :abbr:`enc-dec (Encoder-Decoder Models)`
- :abbr:`logP (Logprobs)`
- :abbr:`prmpt logP (Prompt Logprobs)`
- :abbr:`async output (Async Output Processing)`
- multi-step
- :abbr:`MM (Multimodal)`
- best-of
- beam-search
- :abbr:`guided dec (Guided Decoding)`
* - :ref:`CP <chunked-prefill>`
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
* - :ref:`APC <apc>`
- ✅
-
-
-
-
-
-
-
-
-
-
-
-
-
-
* - :ref:`LoRA <lora>`
- `✗ <https://github.com/vllm-project/vllm/pull/9057>`__
- ✅
-
-
-
-
-
-
-
-
-
-
-
-
-
* - :abbr:`prmpt adptr (Prompt Adapter)`
- ✅
- ✅
- ✅
-
-
-
-
-
-
-
-
-
-
-
-
* - :ref:`SD <spec_decode>`
- ✗
- ✅
- ✗
- ✅
-
-
-
-
-
-
-
-
-
-
-
* - CUDA graph
- ✅
- ✅
- ✅
- ✅
- ✅
-
-
-
-
-
-
-
-
-
-
* - :abbr:`enc-dec (Encoder-Decoder Models)`
- ✗
- `✗ <https://github.com/vllm-project/vllm/issues/7366>`__
- ✗
- ✗
- `✗ <https://github.com/vllm-project/vllm/issues/7366>`__
- ✅
-
-
-
-
-
-
-
-
-
* - :abbr:`logP (Logprobs)`
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
-
-
-
-
-
-
-
-
* - :abbr:`prmpt logP (Prompt Logprobs)`
- ✅
- ✅
- ✅
- ✅
- `✗ <https://github.com/vllm-project/vllm/pull/8199>`__
- ✅
- ✅
- ✅
-
-
-
-
-
-
-
* - :abbr:`async output (Async Output Processing)`
- ✅
- ✅
- ✅
- ✅
- ✗
- ✅
- ✗
- ✅
- ✅
-
-
-
-
-
-
* - multi-step
- ✗
- ✅
- ✗
- ✅
- ✗
- ✅
- ✗
- ✅
- `✗ <https://github.com/vllm-project/vllm/issues/8198>`__
- ✅
-
-
-
-
-
* - :abbr:`MM (Multimodal)`
- `✗ <https://github.com/vllm-project/vllm/pull/8346>`__
- `✗ <https://github.com/vllm-project/vllm/pull/8348>`__
- `✗ <https://github.com/vllm-project/vllm/pull/7199>`__
- ?
- ?
- ✅
- ✗
- ✅
- ✅
- ✅
- ?
-
-
-
-
* - best-of
- ✅
- ✅
- ✅
- ✅
- `✗ <https://github.com/vllm-project/vllm/issues/6137>`__
- ✅
- ✅
- ✅
- ✅
- ?
- `✗ <https://github.com/vllm-project/vllm/issues/7968>`__
- ✅
-
-
-
* - beam-search
- ✅
- ✅
- ✅
- ✅
- `✗ <https://github.com/vllm-project/vllm/issues/6137>`__
- ✅
- ✅
- ✅
- ✅
- ?
- `✗ <https://github.com/vllm-project/vllm/issues/7968>`__
- ?
- ✅
-
-
* - :abbr:`guided dec (Guided Decoding)`
- ✅
- ✅
- ?
- ?
- ✅
- ✅
- ?
- ✅
- ✅
- ✅
- ✗
- ?
- ✅
- ✅
-
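
One way to use the matrix programmatically is to store the '✗' cells as unordered pairs and check a requested feature set against them. The sketch below is an illustration only: the pairs are a small, non-exhaustive subset transcribed from the table, the keys reuse the table's abbreviations, and ``find_conflicts`` is a hypothetical helper, not a vLLM API.

```python
from itertools import combinations

# A few '✗' cells transcribed from the Feature x Feature table; not exhaustive.
INCOMPATIBLE = {
    frozenset({"LoRA", "CP"}),
    frozenset({"SD", "CP"}),
    frozenset({"SD", "LoRA"}),
    frozenset({"multi-step", "CP"}),
    frozenset({"multi-step", "SD"}),
    frozenset({"best-of", "SD"}),
    frozenset({"beam-search", "multi-step"}),
    frozenset({"guided dec", "multi-step"}),
}


def find_conflicts(features):
    """Return the sorted pairs of requested features marked '✗' in the subset above."""
    conflicts = []
    for a, b in combinations(sorted(set(features)), 2):
        if frozenset({a, b}) in INCOMPATIBLE:
            conflicts.append((a, b))
    return conflicts
```

Storing each pair as a ``frozenset`` mirrors the fact that only the lower triangle of the table is filled in: compatibility does not depend on the order of the two features.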
Feature x Hardware
------------------
.. list-table::
:header-rows: 1
:widths: auto
* - Feature
- Volta
- Turing
- Ampere
- Ada
- Hopper
- CPU
- AMD
* - :ref:`CP <chunked-prefill>`
- `✗ <https://github.com/vllm-project/vllm/issues/2729>`__
- ✅
- ✅
- ✅
- ✅
- ✗
- ✅
* - :ref:`APC <apc>`
- `✗ <https://github.com/vllm-project/vllm/issues/3687>`__
- ✅
- ✅
- ✅
- ✅
- ✗
- ✅
* - :ref:`LoRA <lora>`
- ✅
- ✅
- ✅
- ✅
- ✅
- `✗ <https://github.com/vllm-project/vllm/pull/4830>`__
- ✅
* - :abbr:`prmpt adptr (Prompt Adapter)`
- ✅
- ✅
- ✅
- ✅
- ✅
- `✗ <https://github.com/vllm-project/vllm/issues/8475>`__
- ✅
* - :ref:`SD <spec_decode>`
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
* - CUDA graph
- ✅
- ✅
- ✅
- ✅
- ✅
- ✗
- ✅
* - :abbr:`enc-dec (Encoder-Decoder Models)`
- ✅
- ✅
- ✅
- ✅
- ✅
- `✗ <https://github.com/vllm-project/vllm/blob/a84e598e2125960d3b4f716b78863f24ac562947/vllm/worker/cpu_model_runner.py#L125>`__
- ✗
* - :abbr:`logP (Logprobs)`
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
* - :abbr:`prmpt logP (Prompt Logprobs)`
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
* - :abbr:`async output (Async Output Processing)`
- ✅
- ✅
- ✅
- ✅
- ✅
- ✗
- ✗
* - multi-step
- ✅
- ✅
- ✅
- ✅
- ✅
- `✗ <https://github.com/vllm-project/vllm/issues/8477>`__
- ✅
* - :abbr:`MM (Multimodal)`
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
* - best-of
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
* - beam-search
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
* - :abbr:`guided dec (Guided Decoding)`
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
- ✅
.. _deploying_with_k8s:
Deploying with Kubernetes
==========================
Using Kubernetes to deploy vLLM is a scalable and efficient way to serve machine learning models. This guide will walk you through the process of deploying vLLM with Kubernetes, including the necessary prerequisites, steps for deployment, and testing.
Prerequisites
-------------
Before you begin, ensure that you have the following:
- A running Kubernetes cluster
- NVIDIA Kubernetes Device Plugin (``k8s-device-plugin``): This can be found at https://github.com/NVIDIA/k8s-device-plugin/
- Available GPU resources in your cluster
Deployment Steps
----------------
1. **Create a PVC, Secret and Deployment for vLLM**

The PVC stores the model cache. It is optional; you can use ``hostPath`` or another storage option instead.

.. code-block:: yaml

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: mistral-7b
      namespace: default
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 50Gi
      storageClassName: default
      volumeMode: Filesystem
The Secret is optional and only required for accessing gated models; skip this step if you are not using gated models.

.. code-block:: yaml

    apiVersion: v1
    kind: Secret
    metadata:
      name: hf-token-secret
      namespace: default
    type: Opaque
    # stringData accepts the plain-text token; values under data must be base64-encoded.
    stringData:
      token: "REPLACE_WITH_TOKEN"
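
Kubernetes expects values under a Secret's ``data`` field to be base64-encoded, whereas ``stringData`` accepts plain text. If you use ``data``, the encoded value can be produced as in the minimal sketch below. The helper name ``encode_secret_value`` is hypothetical.

```python
import base64


def encode_secret_value(token: str) -> str:
    """Base64-encode a token so it can be placed under a Secret's `data` field."""
    return base64.b64encode(token.encode("utf-8")).decode("ascii")


if __name__ == "__main__":
    # Replace the placeholder with your own Hugging Face token.
    print(encode_secret_value("REPLACE_WITH_TOKEN"))
```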
Create a deployment file for vLLM to run the model server. The following example deploys the ``Mistral-7B-Instruct-v0.3`` model:

.. code-block:: yaml

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: mistral-7b
      namespace: default
      labels:
        app: mistral-7b
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: mistral-7b
      template:
        metadata:
          labels:
            app: mistral-7b
        spec:
          volumes:
          - name: cache-volume
            persistentVolumeClaim:
              claimName: mistral-7b
          # vLLM needs to access the host's shared memory for tensor parallel inference.
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: "2Gi"
          containers:
          - name: mistral-7b
            image: vllm/vllm-openai:latest
            command: ["/bin/sh", "-c"]
            args: [
              "vllm serve mistralai/Mistral-7B-Instruct-v0.3 --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
            ]
            env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token-secret
                  key: token
            ports:
            - containerPort: 8000
            resources:
              limits:
                cpu: "10"
                memory: 20G
                nvidia.com/gpu: "1"
              requests:
                cpu: "2"
                memory: 6G
                nvidia.com/gpu: "1"
            volumeMounts:
            - mountPath: /root/.cache/huggingface
              name: cache-volume
            - name: shm
              mountPath: /dev/shm
            livenessProbe:
              httpGet:
                path: /health
                port: 8000
              initialDelaySeconds: 60
              periodSeconds: 10
            readinessProbe:
              httpGet:
                path: /health
                port: 8000
              initialDelaySeconds: 60
              periodSeconds: 5
2. **Create a Kubernetes Service for vLLM**

Next, create a Kubernetes Service file to expose the ``mistral-7b`` deployment:

.. code-block:: yaml

    apiVersion: v1
    kind: Service
    metadata:
      name: mistral-7b
      namespace: default
    spec:
      ports:
      - name: http-mistral-7b
        port: 80
        protocol: TCP
        targetPort: 8000
      # The label selector should match the deployment labels; it is also useful for the prefix caching feature.
      selector:
        app: mistral-7b
      sessionAffinity: None
      type: ClusterIP
3. **Deploy and Test**

Apply the deployment and service configurations using ``kubectl apply -f <filename>``:

.. code-block:: console

    kubectl apply -f deployment.yaml
    kubectl apply -f service.yaml
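
``kubectl apply`` returns once the objects are created, which can be well before the model server is ready to answer requests. A minimal sketch for waiting on the ``/health`` endpoint used by the probes is shown here. It assumes it runs somewhere the Service DNS name resolves (for example, inside the cluster), and the function names, attempt count, and delay are illustrative choices.

```python
import time
import urllib.error
import urllib.request


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or a non-2xx status.
        return False


def wait_until_healthy(url, attempts=30, delay=10.0, check=is_healthy, sleep=time.sleep):
    """Poll `check(url)` up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if check(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


if __name__ == "__main__":
    # Service port 80 forwards to the container's port 8000.
    ready = wait_until_healthy("http://mistral-7b.default.svc.cluster.local/health")
    print("ready" if ready else "not ready")
```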
To test the deployment, run the following ``curl`` command. The ``model`` field must name the model that the server was started with:

.. code-block:: console

    curl http://mistral-7b.default.svc.cluster.local/v1/completions \
        -H "Content-Type: application/json" \
        -d '{
            "model": "mistralai/Mistral-7B-Instruct-v0.3",
            "prompt": "San Francisco is a",
            "max_tokens": 7,
            "temperature": 0
        }'
If the service is correctly deployed, you should receive a response from the vLLM model.
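
The same request can be issued from Python. The following is a minimal sketch using only the standard library. The helper names are hypothetical, and the ``model`` value should match the model passed to ``vllm serve`` in the Deployment (``mistralai/Mistral-7B-Instruct-v0.3`` in this guide).

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 7, temperature: float = 0.0):
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout: float = 60.0) -> dict:
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    req = build_completion_request(
        "http://mistral-7b.default.svc.cluster.local",
        "mistralai/Mistral-7B-Instruct-v0.3",
        "San Francisco is a",
    )
    print(send(req))
```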
Conclusion
----------
Deploying vLLM with Kubernetes allows for efficient scaling and management of ML models leveraging GPU resources. By following the steps outlined above, you should be able to set up and test a vLLM deployment within your Kubernetes cluster. If you encounter any issues or have suggestions, please feel free to contribute to the documentation.
\ No newline at end of file
...@@ -5,4 +5,4 @@ Deploying with KServe ...@@ -5,4 +5,4 @@ Deploying with KServe
vLLM can be deployed with `KServe <https://github.com/kserve/kserve>`_ on Kubernetes for highly scalable distributed model serving.

Please see `this guide <https://kserve.github.io/website/latest/modelserving/v1beta1/llm/huggingface/>`_ for more details on using vLLM with KServe.