Commit afd0da21 authored by zhuwenwen

Merge tag 'v0.7.1' into v0.7.1-dev

parents 1a11f127 4f4d427a
(installation-index)=
# Installation
vLLM supports the following hardware platforms:
:::{toctree}
:maxdepth: 1
:hidden:
gpu/index
cpu/index
ai_accelerator/index
:::
- <project:gpu/index.md>
  - NVIDIA CUDA
  - AMD ROCm
  - Intel XPU
- <project:cpu/index.md>
  - Intel/AMD x86
  - ARM AArch64
  - Apple silicon
- <project:ai_accelerator/index.md>
  - Google TPU
  - Intel Gaudi
  - AWS Neuron
  - OpenVINO
You can create a new Python environment using `conda`:
```console
# (Recommended) Create a new conda environment.
conda create -n myenv python=3.12 -y
conda activate myenv
```
:::{note}
[PyTorch has deprecated the conda release channel](https://github.com/pytorch/pytorch/issues/138506). If you use `conda`, use it only to create the Python environment, not to install packages.
:::
Or you can create a new Python environment using [uv](https://docs.astral.sh/uv/), a very fast Python environment manager. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment using the following command:
```console
# (Recommended) Create a new uv environment. Use `--seed` to install `pip` and `setuptools` in the environment.
uv venv myenv --python 3.12 --seed
source myenv/bin/activate
```
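After activating the environment, it can be worth confirming that the interpreter is one vLLM supports. The following is a minimal sketch, not part of vLLM: the helper name `python_is_supported` is hypothetical, and the 3.9 to 3.12 range is assumed from the Quickstart prerequisites.

```python
import sys


def python_is_supported(version_info=sys.version_info, low=(3, 9), high=(3, 12)):
    """Return True if the (major, minor) version lies within [low, high].

    The default bounds are an assumption taken from the Quickstart
    prerequisites (Python 3.9 -- 3.12); adjust them for other releases.
    """
    return low <= tuple(version_info[:2]) <= high


if __name__ == "__main__":
    print("Python", sys.version.split()[0], "supported:", python_is_supported())
```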
...@@ -2,34 +2,45 @@ ...@@ -2,34 +2,45 @@
# Quickstart # Quickstart
This guide will help you quickly get started with vLLM to: This guide will help you quickly get started with vLLM to perform:
- [Run offline batched inference](#offline-batched-inference) - [Offline batched inference](#quickstart-offline)
- [Run OpenAI-compatible inference](#openai-compatible-server) - [Online serving using OpenAI-compatible server](#quickstart-online)
## Prerequisites ## Prerequisites
- OS: Linux - OS: Linux
- Python: 3.9 -- 3.12 - Python: 3.9 -- 3.12
- GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, H100, etc.)
## Installation ## Installation
You can install vLLM using pip. It's recommended to use [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html) to create and manage Python environments. If you are using NVIDIA GPUs, you can install vLLM using [pip](https://pypi.org/project/vllm/) directly.
It's recommended to use [uv](https://docs.astral.sh/uv/), a very fast Python environment manager, to create and manage Python environments. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment and install vLLM using the following commands:
```console ```console
$ conda create -n myenv python=3.10 -y uv venv myenv --python 3.12 --seed
$ conda activate myenv source myenv/bin/activate
$ pip install vllm uv pip install vllm
``` ```
Please refer to the {ref}`installation documentation <installation>` for more details on installing vLLM. You can also use [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html) to create and manage Python environments.
```console
conda create -n myenv python=3.12 -y
conda activate myenv
pip install vllm
```
(offline-batched-inference)= :::{note}
For non-CUDA platforms, please refer [here](#installation-index) for specific instructions on how to install vLLM.
:::
(quickstart-offline)=
## Offline Batched Inference ## Offline Batched Inference
With vLLM installed, you can start generating texts for list of input prompts (i.e. offline batch inferencing). See the example script: <gh-file:examples/offline_inference.py> With vLLM installed, you can start generating texts for list of input prompts (i.e. offline batch inferencing). See the example script: <gh-file:examples/offline_inference/basic.py>
The first line of this example imports the classes {class}`~vllm.LLM` and {class}`~vllm.SamplingParams`: The first line of this example imports the classes {class}`~vllm.LLM` and {class}`~vllm.SamplingParams`:
...@@ -40,7 +51,7 @@ The first line of this example imports the classes {class}`~vllm.LLM` and {class ...@@ -40,7 +51,7 @@ The first line of this example imports the classes {class}`~vllm.LLM` and {class
from vllm import LLM, SamplingParams from vllm import LLM, SamplingParams
``` ```
The next section defines a list of input prompts and sampling parameters for text generation. The [sampling temperature](https://arxiv.org/html/2402.05201v1) is set to `0.8` and the [nucleus sampling probability](https://en.wikipedia.org/wiki/Top-p_sampling) is set to `0.95`. You can find more information about the sampling parameters [here](https://docs.vllm.ai/en/stable/dev/sampling_params.html). The next section defines a list of input prompts and sampling parameters for text generation. The [sampling temperature](https://arxiv.org/html/2402.05201v1) is set to `0.8` and the [nucleus sampling probability](https://en.wikipedia.org/wiki/Top-p_sampling) is set to `0.95`. You can find more information about the sampling parameters [here](#sampling-params).
```python ```python
prompts = [ prompts = [
...@@ -58,9 +69,9 @@ The {class}`~vllm.LLM` class initializes vLLM's engine and the [OPT-125M model]( ...@@ -58,9 +69,9 @@ The {class}`~vllm.LLM` class initializes vLLM's engine and the [OPT-125M model](
llm = LLM(model="facebook/opt-125m") llm = LLM(model="facebook/opt-125m")
``` ```
```{note} :::{note}
By default, vLLM downloads models from [HuggingFace](https://huggingface.co/). If you would like to use models from [ModelScope](https://www.modelscope.cn), set the environment variable `VLLM_USE_MODELSCOPE` before initializing the engine. By default, vLLM downloads models from [HuggingFace](https://huggingface.co/). If you would like to use models from [ModelScope](https://www.modelscope.cn), set the environment variable `VLLM_USE_MODELSCOPE` before initializing the engine.
``` :::
Now, the fun part! The outputs are generated using `llm.generate`. It adds the input prompts to the vLLM engine's waiting queue and executes the vLLM engine to generate the outputs with high throughput. The outputs are returned as a list of `RequestOutput` objects, which include all of the output tokens. Now, the fun part! The outputs are generated using `llm.generate`. It adds the input prompts to the vLLM engine's waiting queue and executes the vLLM engine to generate the outputs with high throughput. The outputs are returned as a list of `RequestOutput` objects, which include all of the output tokens.
...@@ -73,7 +84,7 @@ for output in outputs: ...@@ -73,7 +84,7 @@ for output in outputs:
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
``` ```
(openai-compatible-server)= (quickstart-online)=
## OpenAI-Compatible Server ## OpenAI-Compatible Server
...@@ -83,18 +94,18 @@ By default, it starts the server at `http://localhost:8000`. You can specify the ...@@ -83,18 +94,18 @@ By default, it starts the server at `http://localhost:8000`. You can specify the
Run the following command to start the vLLM server with the [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) model: Run the following command to start the vLLM server with the [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) model:
```console ```console
$ vllm serve Qwen/Qwen2.5-1.5B-Instruct vllm serve Qwen/Qwen2.5-1.5B-Instruct
``` ```
```{note} :::{note}
By default, the server uses a predefined chat template stored in the tokenizer. By default, the server uses a predefined chat template stored in the tokenizer.
You can learn about overriding it [here](#chat-template). You can learn about overriding it [here](#chat-template).
``` :::
This server can be queried in the same format as OpenAI API. For example, to list the models: This server can be queried in the same format as OpenAI API. For example, to list the models:
```console ```console
$ curl http://localhost:8000/v1/models curl http://localhost:8000/v1/models
``` ```
You can pass in the argument `--api-key` or environment variable `VLLM_API_KEY` to enable the server to check for API key in the header. You can pass in the argument `--api-key` or environment variable `VLLM_API_KEY` to enable the server to check for API key in the header.
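As a rough illustration of what a client then has to send, the sketch below builds (but does not send) a request to the models endpoint using only the standard library. It assumes the key travels in an `Authorization: Bearer <key>` header, as OpenAI clients do; the helper name `build_models_request` and the key `token-abc123` are placeholders.

```python
import urllib.request


def build_models_request(base_url="http://localhost:8000", api_key=None):
    """Build a GET request for /v1/models, adding the API key if one is given.

    Assumption: the server expects the key as an OpenAI-style bearer token.
    """
    headers = {}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(f"{base_url}/v1/models", headers=headers)


# Placeholder key; use the value passed to --api-key or VLLM_API_KEY.
request = build_models_request(api_key="token-abc123")
print(request.full_url, request.get_header("Authorization"))
# To actually query a running server: urllib.request.urlopen(request)
```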
...@@ -104,17 +115,17 @@ You can pass in the argument `--api-key` or environment variable `VLLM_API_KEY` ...@@ -104,17 +115,17 @@ You can pass in the argument `--api-key` or environment variable `VLLM_API_KEY`
Once your server is started, you can query the model with input prompts: Once your server is started, you can query the model with input prompts:
```console ```console
$ curl http://localhost:8000/v1/completions \ curl http://localhost:8000/v1/completions \
$ -H "Content-Type: application/json" \ -H "Content-Type: application/json" \
$ -d '{ -d '{
$ "model": "Qwen/Qwen2.5-1.5B-Instruct", "model": "Qwen/Qwen2.5-1.5B-Instruct",
$ "prompt": "San Francisco is a", "prompt": "San Francisco is a",
$ "max_tokens": 7, "max_tokens": 7,
$ "temperature": 0 "temperature": 0
$ }' }'
``` ```
Since this server is compatible with OpenAI API, you can use it as a drop-in replacement for any applications using OpenAI API. For example, another way to query the server is via the `openai` python package: Since this server is compatible with OpenAI API, you can use it as a drop-in replacement for any applications using OpenAI API. For example, another way to query the server is via the `openai` Python package:
```python ```python
from openai import OpenAI from openai import OpenAI
...@@ -131,7 +142,7 @@ completion = client.completions.create(model="Qwen/Qwen2.5-1.5B-Instruct", ...@@ -131,7 +142,7 @@ completion = client.completions.create(model="Qwen/Qwen2.5-1.5B-Instruct",
print("Completion result:", completion) print("Completion result:", completion)
``` ```
A more detailed client example can be found here: <gh-file:examples/openai_completion_client.py> A more detailed client example can be found here: <gh-file:examples/online_serving/openai_completion_client.py>
### OpenAI Chat Completions API with vLLM ### OpenAI Chat Completions API with vLLM
...@@ -140,18 +151,18 @@ vLLM is designed to also support the OpenAI Chat Completions API. The chat inter ...@@ -140,18 +151,18 @@ vLLM is designed to also support the OpenAI Chat Completions API. The chat inter
You can use the [create chat completion](https://platform.openai.com/docs/api-reference/chat/completions/create) endpoint to interact with the model: You can use the [create chat completion](https://platform.openai.com/docs/api-reference/chat/completions/create) endpoint to interact with the model:
```console ```console
$ curl http://localhost:8000/v1/chat/completions \ curl http://localhost:8000/v1/chat/completions \
$ -H "Content-Type: application/json" \ -H "Content-Type: application/json" \
$ -d '{ -d '{
$ "model": "Qwen/Qwen2.5-1.5B-Instruct", "model": "Qwen/Qwen2.5-1.5B-Instruct",
$ "messages": [ "messages": [
$ {"role": "system", "content": "You are a helpful assistant."}, {"role": "system", "content": "You are a helpful assistant."},
$ {"role": "user", "content": "Who won the world series in 2020?"} {"role": "user", "content": "Who won the world series in 2020?"}
$ ] ]
$ }' }'
``` ```
Alternatively, you can use the `openai` python package: Alternatively, you can use the `openai` Python package:
```python ```python
from openai import OpenAI from openai import OpenAI
......
(debugging)= (troubleshooting)=
# Debugging Tips # Troubleshooting
This document outlines some debugging strategies you can consider. If you think you've discovered a bug, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible. This document outlines some troubleshooting strategies you can consider. If you think you've discovered a bug, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
```{note} :::{note}
Once you've debugged a problem, remember to turn off any debugging environment variables defined, or simply start a new shell to avoid being affected by lingering debugging settings. Otherwise, the system might be slow with debugging functionalities left activated. Once you've debugged a problem, remember to turn off any debugging environment variables defined, or simply start a new shell to avoid being affected by lingering debugging settings. Otherwise, the system might be slow with debugging functionalities left activated.
``` :::
## Hangs downloading a model ## Hangs downloading a model
...@@ -18,13 +18,13 @@ It's recommended to download the model first using the [huggingface-cli](https:/ ...@@ -18,13 +18,13 @@ It's recommended to download the model first using the [huggingface-cli](https:/
If the model is large, it can take a long time to load it from disk. Pay attention to where you store the model. Some clusters have shared filesystems across nodes, e.g. a distributed filesystem or a network filesystem, which can be slow. If the model is large, it can take a long time to load it from disk. Pay attention to where you store the model. Some clusters have shared filesystems across nodes, e.g. a distributed filesystem or a network filesystem, which can be slow.
It'd be better to store the model in a local disk. Additionally, have a look at the CPU memory usage, when the model is too large it might take a lot of CPU memory, slowing down the operating system because it needs to frequently swap between disk and memory. It'd be better to store the model in a local disk. Additionally, have a look at the CPU memory usage, when the model is too large it might take a lot of CPU memory, slowing down the operating system because it needs to frequently swap between disk and memory.
```{note} :::{note}
To isolate the model downloading and loading issue, you can use the `--load-format dummy` argument to skip loading the model weights. This way, you can check if the model downloading and loading is the bottleneck. To isolate the model downloading and loading issue, you can use the `--load-format dummy` argument to skip loading the model weights. This way, you can check if the model downloading and loading is the bottleneck.
``` :::
## Model is too large ## Out of memory
If the model is too large to fit in a single GPU, you might want to [consider tensor parallelism](#distributed-serving) to split the model across multiple GPUs. In that case, every process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism). You can convert the model checkpoint to a sharded checkpoint using <gh-file:examples/save_sharded_state.py>. The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism. If the model is too large to fit in a single GPU, you will get an out-of-memory (OOM) error. Consider [using tensor parallelism](#distributed-serving) to split the model across multiple GPUs. In that case, every process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism). You can convert the model checkpoint to a sharded checkpoint using <gh-file:examples/offline_inference/save_sharded_state.py>. The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
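To make the loading cost concrete, here is a back-of-the-envelope sketch. The function name and the 140 GB figure are hypothetical. It assumes that, without sharding, each of the tensor-parallel processes reads the full checkpoint, and that with a sharded checkpoint each process reads only its own shard, so the total stays roughly equal to the checkpoint size.

```python
def total_gb_read(checkpoint_gb, tp_size, sharded=False):
    """Estimate how much data is read from disk across all processes.

    Unsharded: every process reads the whole checkpoint (tp_size copies).
    Sharded (assumption): each process reads about 1/tp_size of it.
    """
    if tp_size < 1:
        raise ValueError("tp_size must be at least 1")
    if sharded:
        return checkpoint_gb
    return checkpoint_gb * tp_size


# Hypothetical 140 GB checkpoint with tensor parallel size 4.
print("unsharded:", total_gb_read(140, 4), "GB")
print("sharded:  ", total_gb_read(140, 4, sharded=True), "GB")
```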
## Enable more logging ## Enable more logging
...@@ -47,6 +47,8 @@ You might also need to set `export NCCL_SOCKET_IFNAME=<your_network_interface>` ...@@ -47,6 +47,8 @@ You might also need to set `export NCCL_SOCKET_IFNAME=<your_network_interface>`
If vLLM crashes and the error trace captures it somewhere around `self.graph.replay()` in `vllm/worker/model_runner.py`, it is a CUDA error inside CUDAGraph. If vLLM crashes and the error trace captures it somewhere around `self.graph.replay()` in `vllm/worker/model_runner.py`, it is a CUDA error inside CUDAGraph.
To identify the particular CUDA operation that causes the error, you can add `--enforce-eager` to the command line, or `enforce_eager=True` to the {class}`~vllm.LLM` class to disable the CUDAGraph optimization and isolate the exact CUDA operation that causes the error. To identify the particular CUDA operation that causes the error, you can add `--enforce-eager` to the command line, or `enforce_eager=True` to the {class}`~vllm.LLM` class to disable the CUDAGraph optimization and isolate the exact CUDA operation that causes the error.
(troubleshooting-incorrect-hardware-driver)=
## Incorrect hardware/driver ## Incorrect hardware/driver
If GPU/CPU communication cannot be established, you can use the following Python script and follow the instructions below to confirm whether the GPU/CPU communication is working correctly. If GPU/CPU communication cannot be established, you can use the following Python script and follow the instructions below to confirm whether the GPU/CPU communication is working correctly.
...@@ -117,29 +119,30 @@ dist.destroy_process_group() ...@@ -117,29 +119,30 @@ dist.destroy_process_group()
If you are testing with a single node, adjust `--nproc-per-node` to the number of GPUs you want to use: If you are testing with a single node, adjust `--nproc-per-node` to the number of GPUs you want to use:
```console ```console
$ NCCL_DEBUG=TRACE torchrun --nproc-per-node=<number-of-GPUs> test.py NCCL_DEBUG=TRACE torchrun --nproc-per-node=<number-of-GPUs> test.py
``` ```
If you are testing with multi-nodes, adjust `--nproc-per-node` and `--nnodes` according to your setup and set `MASTER_ADDR` to the correct IP address of the master node, reachable from all nodes. Then, run: If you are testing with multi-nodes, adjust `--nproc-per-node` and `--nnodes` according to your setup and set `MASTER_ADDR` to the correct IP address of the master node, reachable from all nodes. Then, run:
```console ```console
$ NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR test.py NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR test.py
``` ```
If the script runs successfully, you should see the message `sanity check is successful!`. If the script runs successfully, you should see the message `sanity check is successful!`.
If the test script hangs or crashes, usually it means the hardware/drivers are broken in some sense. You should try to contact your system administrator or hardware vendor for further assistance. As a common workaround, you can try to tune some NCCL environment variables, such as `export NCCL_P2P_DISABLE=1` to see if it helps. Please check [their documentation](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html) for more information. Please only use these environment variables as a temporary workaround, as they might affect the performance of the system. The best solution is still to fix the hardware/drivers so that the test script can run successfully. If the test script hangs or crashes, usually it means the hardware/drivers are broken in some sense. You should try to contact your system administrator or hardware vendor for further assistance. As a common workaround, you can try to tune some NCCL environment variables, such as `export NCCL_P2P_DISABLE=1` to see if it helps. Please check [their documentation](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html) for more information. Please only use these environment variables as a temporary workaround, as they might affect the performance of the system. The best solution is still to fix the hardware/drivers so that the test script can run successfully.
```{note} :::{note}
A multi-node environment is more complicated than a single-node one. If you see errors such as `torch.distributed.DistNetworkError`, it is likely that the network/DNS setup is incorrect. In that case, you can manually assign node rank and specify the IP via command line arguments: A multi-node environment is more complicated than a single-node one. If you see errors such as `torch.distributed.DistNetworkError`, it is likely that the network/DNS setup is incorrect. In that case, you can manually assign node rank and specify the IP via command line arguments:
- In the first node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 0 --master_addr $MASTER_ADDR test.py`. - In the first node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 0 --master_addr $MASTER_ADDR test.py`.
- In the second node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 1 --master_addr $MASTER_ADDR test.py`. - In the second node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 1 --master_addr $MASTER_ADDR test.py`.
Adjust `--nproc-per-node`, `--nnodes`, and `--node-rank` according to your setup, being sure to execute different commands (with different `--node-rank`) on different nodes. Adjust `--nproc-per-node`, `--nnodes`, and `--node-rank` according to your setup, being sure to execute different commands (with different `--node-rank`) on different nodes.
``` :::
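Since the per-node commands differ only in `--node-rank`, generating them can help avoid copy-paste mistakes. The sketch below simply reproduces the command strings shown in the note; the helper name `torchrun_command` and the address `10.0.0.1` are placeholders.

```python
import shlex


def torchrun_command(node_rank, master_addr, nnodes=2, nproc_per_node=2,
                     script="test.py"):
    """Return the shell command to run on the node with the given rank."""
    args = [
        "torchrun",
        "--nnodes", str(nnodes),
        f"--nproc-per-node={nproc_per_node}",
        "--node-rank", str(node_rank),
        "--master_addr", master_addr,
        script,
    ]
    # NCCL_DEBUG=TRACE is set as an environment prefix, as in the note above.
    return "NCCL_DEBUG=TRACE " + " ".join(shlex.quote(a) for a in args)


# Print one command per node; run each on its own machine.
for rank in range(2):
    print(torchrun_command(rank, "10.0.0.1"))
```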
(troubleshooting-python-multiprocessing)=
(debugging-python-multiprocessing)=
## Python multiprocessing ## Python multiprocessing
### `RuntimeError` Exception ### `RuntimeError` Exception
...@@ -150,7 +153,7 @@ If you have seen a warning in your logs like this: ...@@ -150,7 +153,7 @@ If you have seen a warning in your logs like this:
WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously
initialized. We must use the `spawn` multiprocessing start method. Setting initialized. We must use the `spawn` multiprocessing start method. Setting
VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See
https://docs.vllm.ai/en/latest/getting_started/debugging.html#python-multiprocessing https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#python-multiprocessing
for more information. for more information.
``` ```
...@@ -194,7 +197,64 @@ if __name__ == '__main__': ...@@ -194,7 +197,64 @@ if __name__ == '__main__':
llm = vllm.LLM(...) llm = vllm.LLM(...)
``` ```
## `torch.compile` Error
vLLM depends heavily on `torch.compile` to optimize the model for better performance, which in turn makes it depend on the `triton` library. By default, we use `torch.compile` to [optimize some functions](https://github.com/vllm-project/vllm/pull/10406) in the model. Before running vLLM, you can check if `torch.compile` is working as expected by running the following script:
```python
import torch

@torch.compile
def f(x):
    # a simple function to test torch.compile
    x = x + 1
    x = x * 2
    x = x.sin()
    return x

x = torch.randn(4, 4).cuda()
print(f(x))
```
If it raises errors from the `torch/_inductor` directory, it usually means you have a custom `triton` library that is not compatible with the version of PyTorch you are using. See [this issue](https://github.com/vllm-project/vllm/issues/12219) for an example.
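When reporting such a mismatch, it helps to know which versions are actually installed. The following sketch reads them from package metadata without importing either library. The helper name is hypothetical, and the distribution names `torch` and `triton` are assumptions; a custom build may be published under a different name, in which case it will show up as `None` here.

```python
from importlib import metadata


def installed_versions(packages=("torch", "triton")):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print(installed_versions())
```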
## Model failed to be inspected
If you see an error like:
```text
File "vllm/model_executor/models/registry.py", line xxx, in _raise_for_unsupported
raise ValueError(
ValueError: Model architectures ['<arch>'] failed to be inspected. Please check the logs for more details.
```
It means that vLLM failed to import the model file.
Usually, it is related to missing dependencies or outdated binaries in the vLLM build.
Please read the logs carefully to determine the root cause of the error.
## Model not supported
If you see an error like:
```text
Traceback (most recent call last):
...
File "vllm/model_executor/models/registry.py", line xxx, in inspect_model_cls
for arch in architectures:
TypeError: 'NoneType' object is not iterable
```
or:
```text
File "vllm/model_executor/models/registry.py", line xxx, in _raise_for_unsupported
raise ValueError(
ValueError: Model architectures ['<arch>'] are not supported for now. Supported architectures: [...]
```
If you are nonetheless sure that the model is in the [list of supported models](#supported-models), there may be an issue with vLLM's model resolution. In that case, please follow [these steps](#model-resolution) to explicitly specify the vLLM implementation for the model.
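Before digging into model resolution, it can be useful to see which architectures the checkpoint itself declares. The first traceback iterates over an `architectures` value that is `None`, which suggests the field may be missing from the model configuration. The sketch below assumes a local Hugging Face-style checkpoint directory containing a `config.json` with an `architectures` list; the helper name is hypothetical and this is not how vLLM itself performs the lookup.

```python
import json
from pathlib import Path


def read_architectures(model_dir):
    """Return the architectures declared in <model_dir>/config.json.

    Returns an empty list when the field is missing or null.
    """
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text())
    architectures = config.get("architectures")
    return list(architectures) if architectures else []


# Example usage (the path is a placeholder):
# print(read_architectures("/path/to/local/model"))
```

An empty result points at the checkpoint's configuration rather than at vLLM's list of supported models.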
## Known Issues ## Known Issues
- In `v0.5.2`, `v0.5.3`, and `v0.5.3.post1`, there is a bug caused by [zmq](https://github.com/zeromq/pyzmq/issues/2000) , which can occasionally cause vLLM to hang depending on the machine configuration. The solution is to upgrade to the latest version of `vllm` to include the [fix](gh-pr:6759). - In `v0.5.2`, `v0.5.3`, and `v0.5.3.post1`, there is a bug caused by [zmq](https://github.com/zeromq/pyzmq/issues/2000) , which can occasionally cause vLLM to hang depending on the machine configuration. The solution is to upgrade to the latest version of `vllm` to include the [fix](gh-pr:6759).
- To circumvent a NCCL [bug](https://github.com/NVIDIA/nccl/issues/1234) , all vLLM processes will set an environment variable ``NCCL_CUMEM_ENABLE=0`` to disable NCCL's ``cuMem`` allocator. It does not affect performance but only gives memory benefits. When external processes want to set up a NCCL connection with vLLM's processes, they should also set this environment variable, otherwise, inconsistent environment setup will cause NCCL to hang or crash, as observed in the [RLHF integration](https://github.com/OpenRLHF/OpenRLHF/pull/604) and the [discussion](gh-issue:5723#issuecomment-2554389656) . - To circumvent a NCCL [bug](https://github.com/NVIDIA/nccl/issues/1234) , all vLLM processes will set an environment variable `NCCL_CUMEM_ENABLE=0` to disable NCCL's `cuMem` allocator. It does not affect performance but only gives memory benefits. When external processes want to set up a NCCL connection with vLLM's processes, they should also set this environment variable, otherwise, inconsistent environment setup will cause NCCL to hang or crash, as observed in the [RLHF integration](https://github.com/OpenRLHF/OpenRLHF/pull/604) and the [discussion](gh-issue:5723#issuecomment-2554389656) .
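For an external process that needs to connect to vLLM's processes over NCCL, aligning the environment might look like the sketch below. The helper name is hypothetical, and the assumption is that the variable has to be set before the process initializes NCCL (for example, before creating a `torch.distributed` process group).

```python
import os


def align_nccl_env(env=os.environ):
    """Set NCCL_CUMEM_ENABLE=0 so this process matches vLLM's processes."""
    env["NCCL_CUMEM_ENABLE"] = "0"
    return env["NCCL_CUMEM_ENABLE"]


# Call this first, before any NCCL communicator is created in this process.
print("NCCL_CUMEM_ENABLE =", align_nccl_env())
```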
# Welcome to vLLM! # Welcome to vLLM
```{figure} ./assets/logos/vllm-logo-text-light.png :::{figure} ./assets/logos/vllm-logo-text-light.png
:align: center :align: center
:alt: vLLM :alt: vLLM
:class: no-scaled-link :class: no-scaled-link
:width: 60% :width: 60%
``` :::
```{raw} html :::{raw} html
<p style="text-align:center"> <p style="text-align:center">
<strong>Easy, fast, and cheap LLM serving for everyone <strong>Easy, fast, and cheap LLM serving for everyone
</strong> </strong>
...@@ -19,14 +19,16 @@ ...@@ -19,14 +19,16 @@
<a class="github-button" href="https://github.com/vllm-project/vllm/subscription" data-icon="octicon-eye" data-size="large" aria-label="Watch">Watch</a> <a class="github-button" href="https://github.com/vllm-project/vllm/subscription" data-icon="octicon-eye" data-size="large" aria-label="Watch">Watch</a>
<a class="github-button" href="https://github.com/vllm-project/vllm/fork" data-icon="octicon-repo-forked" data-size="large" aria-label="Fork">Fork</a> <a class="github-button" href="https://github.com/vllm-project/vllm/fork" data-icon="octicon-repo-forked" data-size="large" aria-label="Fork">Fork</a>
</p> </p>
``` :::
vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is a fast and easy-to-use library for LLM inference and serving.
Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.
vLLM is fast with: vLLM is fast with:
- State-of-the-art serving throughput - State-of-the-art serving throughput
- Efficient management of attention key and value memory with **PagedAttention** - Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)
- Continuous batching of incoming requests - Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph - Fast model execution with CUDA/HIP graph
- Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8 - Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8
...@@ -50,151 +52,152 @@ For more information, check out the following: ...@@ -50,151 +52,152 @@ For more information, check out the following:
- [vLLM announcing blog post](https://vllm.ai) (intro to PagedAttention) - [vLLM announcing blog post](https://vllm.ai) (intro to PagedAttention)
- [vLLM paper](https://arxiv.org/abs/2309.06180) (SOSP 2023) - [vLLM paper](https://arxiv.org/abs/2309.06180) (SOSP 2023)
- [How continuous batching enables 23x throughput in LLM inference while reducing p50 latency](https://www.anyscale.com/blog/continuous-batching-llm-inference) by Cade Daniel et al. - [How continuous batching enables 23x throughput in LLM inference while reducing p50 latency](https://www.anyscale.com/blog/continuous-batching-llm-inference) by Cade Daniel et al.
- {ref}`vLLM Meetups <meetups>`. - [vLLM Meetups](#meetups)
## Documentation ## Documentation
```{toctree} % How to start using vLLM?
:::{toctree}
:caption: Getting Started :caption: Getting Started
:maxdepth: 1 :maxdepth: 1
getting_started/installation getting_started/installation/index
getting_started/amd-installation
getting_started/openvino-installation
getting_started/cpu-installation
getting_started/gaudi-installation
getting_started/arm-installation
getting_started/neuron-installation
getting_started/tpu-installation
getting_started/xpu-installation
getting_started/quickstart getting_started/quickstart
getting_started/debugging
getting_started/examples/examples_index getting_started/examples/examples_index
``` getting_started/troubleshooting
getting_started/faq
:::
```{toctree} % What does vLLM support?
:caption: Serving
:maxdepth: 1
serving/openai_compatible_server
serving/deploying_with_docker
serving/deploying_with_k8s
serving/deploying_with_helm
serving/deploying_with_nginx
serving/distributed_serving
serving/metrics
serving/integrations
serving/tensorizer
serving/runai_model_streamer
```
```{toctree} :::{toctree}
:caption: Models :caption: Models
:maxdepth: 1 :maxdepth: 1
models/supported_models
models/generative_models models/generative_models
models/pooling_models models/pooling_models
models/adding_model models/supported_models
models/enabling_multimodal_inputs models/extensions/index
``` :::
```{toctree}
:caption: Usage
:maxdepth: 1
usage/lora % Additional capabilities
usage/multimodal_inputs
usage/tool_calling
usage/structured_outputs
usage/spec_decode
usage/compatibility_matrix
usage/performance
usage/faq
usage/engine_args
usage/env_vars
usage/usage_stats
usage/disagg_prefill
```
```{toctree}
:caption: Quantization
:maxdepth: 1
quantization/supported_hardware :::{toctree}
quantization/auto_awq :caption: Features
quantization/bnb
quantization/gguf
quantization/int8
quantization/fp8
quantization/fp8_e5m2_kvcache
quantization/fp8_e4m3_kvcache
```
```{toctree}
:caption: Automatic Prefix Caching
:maxdepth: 1 :maxdepth: 1
automatic_prefix_caching/apc features/quantization/index
automatic_prefix_caching/details features/lora
``` features/tool_calling
features/reasoning_outputs
```{toctree} features/structured_outputs
:caption: Performance features/automatic_prefix_caching
features/disagg_prefill
features/spec_decode
features/compatibility_matrix
:::
% Details about running vLLM
:::{toctree}
:caption: Inference and Serving
:maxdepth: 1 :maxdepth: 1
performance/benchmarks serving/offline_inference
``` serving/openai_compatible_server
serving/multimodal_inputs
serving/distributed_serving
serving/metrics
serving/engine_args
serving/env_vars
serving/usage_stats
serving/integrations/index
:::
% Community: User community resources % Scaling up vLLM for production
```{toctree} :::{toctree}
:caption: Community :caption: Deployment
:maxdepth: 1 :maxdepth: 1
community/meetups deployment/docker
community/sponsors deployment/k8s
``` deployment/nginx
deployment/frameworks/index
deployment/integrations/index
:::
% API Documentation: API reference aimed at vllm library usage % Making the most out of vLLM
```{toctree} :::{toctree}
:caption: API Documentation :caption: Performance
:maxdepth: 2 :maxdepth: 1
dev/sampling_params performance/optimization
dev/pooling_params performance/benchmarks
dev/offline_inference/offline_index :::
dev/engine/engine_index
```
% Design: docs about vLLM internals % Explanation of vLLM internals
```{toctree} :::{toctree}
:caption: Design :caption: Design Documents
:maxdepth: 2 :maxdepth: 2
design/arch_overview design/arch_overview
design/huggingface_integration design/huggingface_integration
design/plugin_system design/plugin_system
design/input_processing/model_inputs_index
design/kernel/paged_attention design/kernel/paged_attention
design/multimodal/multimodal_index design/mm_processing
design/automatic_prefix_caching
design/multiprocessing design/multiprocessing
``` :::
% For Developers: contributing to the vLLM project :::{toctree}
:caption: V1 Design Documents
:maxdepth: 2
design/v1/prefix_caching
:::
% How to contribute to the vLLM project
```{toctree} :::{toctree}
:caption: For Developers :caption: Developer Guide
:maxdepth: 2 :maxdepth: 2
contributing/overview contributing/overview
contributing/profiling/profiling_index contributing/profiling/profiling_index
contributing/dockerfile/dockerfile contributing/dockerfile/dockerfile
``` contributing/model/index
contributing/vulnerability_management
:::
% Technical API specifications
:::{toctree}
:caption: API Reference
:maxdepth: 2
api/offline_inference/index
api/engine/index
api/inference_params
api/multimodal/index
api/model/index
:::
% Latest news and acknowledgements
:::{toctree}
:caption: Community
:maxdepth: 1
community/blog
community/meetups
community/sponsors
:::
# Indices and tables ## Indices and tables
- {ref}`genindex` - {ref}`genindex`
- {ref}`modindex` - {ref}`modindex`
(enabling-multimodal-inputs)=
# Enabling Multimodal Inputs
This document walks you through the steps to extend a vLLM model so that it accepts [multi-modal inputs](#multimodal-inputs).
```{seealso}
[Adding a New Model](adding-a-new-model)
```
## 1. Update the base vLLM model
It is assumed that you have already implemented the model in vLLM according to [these steps](#adding-a-new-model).
Further update the model as follows:
- Implement the {class}`~vllm.model_executor.models.interfaces.SupportsMultiModal` interface.
```diff
+ from vllm.model_executor.models.interfaces import SupportsMultiModal
- class YourModelForImage2Seq(nn.Module):
+ class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
```
```{note}
The model class does not have to be named {code}`*ForCausalLM`.
Check out [the HuggingFace Transformers documentation](https://huggingface.co/docs/transformers/model_doc/auto#multimodal) for some examples.
```
- If you haven't already done so, reserve a keyword parameter in {meth}`~torch.nn.Module.forward`
for each input tensor that corresponds to a multi-modal input, as shown in the following example:
```diff
def forward(
self,
input_ids: torch.Tensor,
positions: torch.Tensor,
kv_caches: List[torch.Tensor],
attn_metadata: AttentionMetadata,
+ pixel_values: torch.Tensor,
) -> SamplerOutput:
```
## 2. Register input mappers
For each modality type that the model accepts as input, decorate the model class with {meth}`MULTIMODAL_REGISTRY.register_input_mapper <vllm.multimodal.MultiModalRegistry.register_input_mapper>`.
This decorator accepts a function that maps multi-modal inputs to the keyword arguments you have previously defined in {meth}`~torch.nn.Module.forward`.
```diff
from vllm.model_executor.models.interfaces import SupportsMultiModal
+ from vllm.multimodal import MULTIMODAL_REGISTRY
+ @MULTIMODAL_REGISTRY.register_image_input_mapper()
class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
```
A default mapper is available for each modality in the core vLLM library. This input mapper will be used if you do not provide your own function.
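The decorator pattern described above can be sketched with a toy registry. This is a simplified illustration rather than vLLM's implementation: `ToyMultiModalRegistry`, `default_image_mapper`, and the plain-list "image" are hypothetical stand-ins, and the model class does not subclass `nn.Module` so that the snippet runs without PyTorch.

```python
from typing import Callable, Dict, Optional

# A mapper turns one multi-modal input into keyword arguments for forward().
Mapper = Callable[[object], Dict[str, object]]


def default_image_mapper(image: object) -> Dict[str, object]:
    """Hypothetical default: pass the image through as `pixel_values`."""
    return {"pixel_values": image}


class ToyMultiModalRegistry:
    """Toy stand-in for a registry; NOT the real vLLM class."""

    def __init__(self) -> None:
        self._image_mappers: Dict[type, Mapper] = {}

    def register_image_input_mapper(self, mapper: Optional[Mapper] = None):
        # Returns a class decorator; falls back to the default mapper
        # when no custom function is supplied.
        def wrapper(model_cls: type) -> type:
            self._image_mappers[model_cls] = mapper or default_image_mapper
            return model_cls

        return wrapper

    def map_image(self, model_cls: type, image: object) -> Dict[str, object]:
        return self._image_mappers[model_cls](image)


TOY_REGISTRY = ToyMultiModalRegistry()


@TOY_REGISTRY.register_image_input_mapper()
class YourModelForImage2Seq:
    def forward(self, input_ids, pixel_values):
        # A real model would run the vision and language towers here.
        return {"input_ids": input_ids, "pixel_values": pixel_values}


kwargs = TOY_REGISTRY.map_image(YourModelForImage2Seq, image=[[0, 1], [1, 0]])
result = YourModelForImage2Seq().forward(input_ids=[1, 2, 3], **kwargs)
print(sorted(result))
```

The point of the sketch is only that the mapper's output keys must match the keyword parameters reserved in `forward`.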
```{seealso}
[Input Processing Pipeline](#input-processing-pipeline)
```
## 3. Register maximum number of multi-modal tokens
For each modality type that the model accepts as input, calculate the maximum possible number of tokens per data item
and register it via {meth}`MULTIMODAL_REGISTRY.register_max_image_tokens <vllm.multimodal.MultiModalRegistry.register_max_image_tokens>`.
```diff
from vllm.inputs import INPUT_REGISTRY
from vllm.model_executor.models.interfaces import SupportsMultiModal
from vllm.multimodal import MULTIMODAL_REGISTRY
@MULTIMODAL_REGISTRY.register_image_input_mapper()
+ @MULTIMODAL_REGISTRY.register_max_image_tokens(<your_calculation>)
@INPUT_REGISTRY.register_dummy_data(<your_dummy_data_factory>)
class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
```
Here are some examples:
- Image inputs (static feature size): [LLaVA-1.5 Model](gh-file:vllm/model_executor/models/llava.py)
- Image inputs (dynamic feature size): [LLaVA-NeXT Model](gh-file:vllm/model_executor/models/llava_next.py)
```{seealso}
[Input Processing Pipeline](#input-processing-pipeline)
```
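As a rough sketch of the static case, a vision encoder that emits one token per image patch yields a fixed token count per image. The helper name `max_image_tokens` and the one-token-per-patch assumption are illustrative and not taken from the vLLM codebase; the 336-pixel, patch-14 configuration is an assumed example of a ViT-style encoder.

```python
def max_image_tokens(image_size: int, patch_size: int) -> int:
    """Upper bound on tokens per image, assuming one token per square patch."""
    if image_size <= 0 or patch_size <= 0:
        raise ValueError("image_size and patch_size must be positive")
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


# Assumed encoder configuration: 336x336 input, 14x14 patches.
static_max_tokens = max_image_tokens(image_size=336, patch_size=14)
print(static_max_tokens)  # 576
```

Models with a dynamic feature size need a calculation that considers the largest input they accept, so a fixed formula like this one would not be enough for them.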
## 4. (Optional) Register dummy data
During startup, dummy data is passed to the vLLM model to allocate memory. This only consists of text input by default, which may not be applicable to multi-modal models.
In such cases, you can define your own dummy data by registering a factory method via {meth}`INPUT_REGISTRY.register_dummy_data <vllm.inputs.registry.InputRegistry.register_dummy_data>`.
```diff
from vllm.inputs import INPUT_REGISTRY
from vllm.model_executor.models.interfaces import SupportsMultiModal
from vllm.multimodal import MULTIMODAL_REGISTRY
@MULTIMODAL_REGISTRY.register_image_input_mapper()
@MULTIMODAL_REGISTRY.register_max_image_tokens(<your_calculation>)
+ @INPUT_REGISTRY.register_dummy_data(<your_dummy_data_factory>)
class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
```
```{note}
The dummy data should have the maximum possible number of multi-modal tokens, as described in the previous step.
```
Here are some examples:
- Image inputs (static feature size): [LLaVA-1.5 Model](gh-file:vllm/model_executor/models/llava.py)
- Image inputs (dynamic feature size): [LLaVA-NeXT Model](gh-file:vllm/model_executor/models/llava_next.py)
```{seealso}
[Input Processing Pipeline](#input-processing-pipeline)
```
## 5. (Optional) Register input processor
Sometimes, there is a need to process inputs at the {class}`~vllm.LLMEngine` level before they are passed to the model executor.
This is often because, unlike implementations in HuggingFace Transformers, the reshaping and/or expansion of multi-modal embeddings needs to take place outside the model's {meth}`~torch.nn.Module.forward` call.
You can register input processors via {meth}`INPUT_REGISTRY.register_input_processor <vllm.inputs.registry.InputRegistry.register_input_processor>`.
```diff
from vllm.inputs import INPUT_REGISTRY
from vllm.model_executor.models.interfaces import SupportsMultiModal
from vllm.multimodal import MULTIMODAL_REGISTRY
@MULTIMODAL_REGISTRY.register_image_input_mapper()
@MULTIMODAL_REGISTRY.register_max_image_tokens(<your_calculation>)
@INPUT_REGISTRY.register_dummy_data(<your_dummy_data_factory>)
+ @INPUT_REGISTRY.register_input_processor(<your_input_processor>)
class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
```
A common use case of input processors is inserting placeholder tokens to leverage the vLLM framework for attention mask generation.
Here are some examples:
- Insert static number of image tokens: [LLaVA-1.5 Model](gh-file:vllm/model_executor/models/llava.py)
- Insert dynamic number of image tokens: [LLaVA-NeXT Model](gh-file:vllm/model_executor/models/llava_next.py)
```{seealso}
[Input Processing Pipeline](#input-processing-pipeline)
```
# Built-in Extensions
:::{toctree}
:maxdepth: 1
runai_model_streamer
tensorizer
:::
(runai-model-streamer)= (runai-model-streamer)=
# Loading Models with Run:ai Model Streamer # Loading models with Run:ai Model Streamer
Run:ai Model Streamer is a library that reads tensors concurrently while streaming them to GPU memory. Run:ai Model Streamer is a library that reads tensors concurrently while streaming them to GPU memory.
Further reading can be found in [Run:ai Model Streamer Documentation](https://github.com/run-ai/runai-model-streamer/blob/master/docs/README.md). Further reading can be found in [Run:ai Model Streamer Documentation](https://github.com/run-ai/runai-model-streamer/blob/master/docs/README.md).
...@@ -9,25 +9,25 @@ vLLM supports loading weights in Safetensors format using the Run:ai Model Strea ...@@ -9,25 +9,25 @@ vLLM supports loading weights in Safetensors format using the Run:ai Model Strea
You first need to install the vLLM RunAI optional dependency: You first need to install the vLLM RunAI optional dependency:
```console ```console
$ pip3 install vllm[runai] pip3 install vllm[runai]
``` ```
To run it as an OpenAI-compatible server, add the `--load-format runai_streamer` flag: To run it as an OpenAI-compatible server, add the `--load-format runai_streamer` flag:
```console ```console
$ vllm serve /home/meta-llama/Llama-3.2-3B-Instruct --load-format runai_streamer vllm serve /home/meta-llama/Llama-3.2-3B-Instruct --load-format runai_streamer
``` ```
To run a model from an AWS S3 object store, run: To run a model from an AWS S3 object store, run:
```console ```console
$ vllm serve s3://core-llm/Llama-3-8b --load-format runai_streamer vllm serve s3://core-llm/Llama-3-8b --load-format runai_streamer
``` ```
To run a model from an S3-compatible object store, run: To run a model from an S3-compatible object store, run:
```console ```console
$ RUNAI_STREAMER_S3_USE_VIRTUAL_ADDRESSING=0 AWS_EC2_METADATA_DISABLED=true AWS_ENDPOINT_URL=https://storage.googleapis.com vllm serve s3://core-llm/Llama-3-8b --load-format runai_streamer RUNAI_STREAMER_S3_USE_VIRTUAL_ADDRESSING=0 AWS_EC2_METADATA_DISABLED=true AWS_ENDPOINT_URL=https://storage.googleapis.com vllm serve s3://core-llm/Llama-3-8b --load-format runai_streamer
``` ```
## Tunable parameters ## Tunable parameters
...@@ -38,16 +38,16 @@ You can tune `concurrency` that controls the level of concurrency and number of ...@@ -38,16 +38,16 @@ You can tune `concurrency` that controls the level of concurrency and number of
For reading from S3, it will be the number of client instances the host is opening to the S3 server. For reading from S3, it will be the number of client instances the host is opening to the S3 server.
```console ```console
$ vllm serve /home/meta-llama/Llama-3.2-3B-Instruct --load-format runai_streamer --model-loader-extra-config '{"concurrency":16}' vllm serve /home/meta-llama/Llama-3.2-3B-Instruct --load-format runai_streamer --model-loader-extra-config '{"concurrency":16}'
``` ```
You can controls the size of the CPU Memory buffer to which tensors are read from the file, and limit this size. You can control the size of the CPU Memory buffer to which tensors are read from the file, and limit this size.
You can read further about CPU buffer memory limiting [here](https://github.com/run-ai/runai-model-streamer/blob/master/docs/src/env-vars.md#runai_streamer_memory_limit). You can read further about CPU buffer memory limiting [here](https://github.com/run-ai/runai-model-streamer/blob/master/docs/src/env-vars.md#runai_streamer_memory_limit).
```console ```console
$ vllm serve /home/meta-llama/Llama-3.2-3B-Instruct --load-format runai_streamer --model-loader-extra-config '{"memory_limit":5368709120}' vllm serve /home/meta-llama/Llama-3.2-3B-Instruct --load-format runai_streamer --model-loader-extra-config '{"memory_limit":5368709120}'
``` ```
```{note} :::{note}
For further instructions about tunable parameters and additional parameters configurable through environment variables, read the [Environment Variables Documentation](https://github.com/run-ai/runai-model-streamer/blob/master/docs/src/env-vars.md). For further instructions about tunable parameters and additional parameters configurable through environment variables, read the [Environment Variables Documentation](https://github.com/run-ai/runai-model-streamer/blob/master/docs/src/env-vars.md).
``` :::
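The extra-config value is a JSON object, so it can be built programmatically instead of typed by hand. The sketch below assumes that `concurrency` and `memory_limit` may be passed together in one object (the examples above show them separately) and uses a hypothetical helper name, `build_extra_config`; it only prints the command and does not launch vLLM.

```python
import json
import shlex

GIB = 1024 ** 3  # bytes per gibibyte


def build_extra_config(concurrency: int, memory_limit_gib: int) -> str:
    """Serialize the tunables as compact JSON for --model-loader-extra-config."""
    config = {
        "concurrency": concurrency,
        "memory_limit": memory_limit_gib * GIB,  # the limit is given in bytes
    }
    return json.dumps(config, separators=(",", ":"))


extra_config = build_extra_config(concurrency=16, memory_limit_gib=5)

command = [
    "vllm", "serve", "/home/meta-llama/Llama-3.2-3B-Instruct",
    "--load-format", "runai_streamer",
    "--model-loader-extra-config", extra_config,
]

# shlex.join quotes the JSON so it survives copy-pasting into a shell.
print(shlex.join(command))
```

Computing the byte count from GiB avoids transcription mistakes in values such as `5368709120`, which is exactly 5 GiB.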
(tensorizer)= (tensorizer)=
# Loading Models with CoreWeave's Tensorizer # Loading models with CoreWeave's Tensorizer
vLLM supports loading models with [CoreWeave's Tensorizer](https://docs.coreweave.com/coreweave-machine-learning-and-ai/inference/tensorizer). vLLM supports loading models with [CoreWeave's Tensorizer](https://docs.coreweave.com/coreweave-machine-learning-and-ai/inference/tensorizer).
vLLM model tensors that have been serialized to disk, an HTTP/HTTPS endpoint, or S3 endpoint can be deserialized vLLM model tensors that have been serialized to disk, an HTTP/HTTPS endpoint, or S3 endpoint can be deserialized
...@@ -9,8 +9,8 @@ shorter Pod startup times and CPU memory usage. Tensor encryption is also suppor ...@@ -9,8 +9,8 @@ shorter Pod startup times and CPU memory usage. Tensor encryption is also suppor
For more information on CoreWeave's Tensorizer, please refer to For more information on CoreWeave's Tensorizer, please refer to
[CoreWeave's Tensorizer documentation](https://github.com/coreweave/tensorizer). For more information on serializing a vLLM model, as well as a general usage guide to using Tensorizer with vLLM, see [CoreWeave's Tensorizer documentation](https://github.com/coreweave/tensorizer). For more information on serializing a vLLM model, as well as a general usage guide to using Tensorizer with vLLM, see
the [vLLM example script](https://docs.vllm.ai/en/stable/getting_started/examples/tensorize_vllm_model.html). the [vLLM example script](https://docs.vllm.ai/en/stable/getting_started/examples/offline_inference/tensorize_vllm_model.html).
```{note} :::{note}
Note that to use this feature you will need to install `tensorizer` by running `pip install vllm[tensorizer]`. Note that to use this feature you will need to install `tensorizer` by running `pip install vllm[tensorizer]`.
``` :::
...@@ -8,14 +8,14 @@ In vLLM, generative models implement the {class}`~vllm.model_executor.models.Vll ...@@ -8,14 +8,14 @@ In vLLM, generative models implement the {class}`~vllm.model_executor.models.Vll
Based on the final hidden states of the input, these models output log probabilities of the tokens to generate, Based on the final hidden states of the input, these models output log probabilities of the tokens to generate,
which are then passed through {class}`~vllm.model_executor.layers.Sampler` to obtain the final text. which are then passed through {class}`~vllm.model_executor.layers.Sampler` to obtain the final text.
For generative models, the only supported `--task` option is `"generate"`.
Usually, this is automatically inferred so you don't have to specify it.
## Offline Inference ## Offline Inference
The {class}`~vllm.LLM` class provides various methods for offline inference. The {class}`~vllm.LLM` class provides various methods for offline inference.
See [Engine Arguments](#engine-args) for a list of options when initializing the model. See [Engine Arguments](#engine-args) for a list of options when initializing the model.
For generative models, the only supported {code}`task` option is {code}`"generate"`.
Usually, this is automatically inferred so you don't have to specify it.
### `LLM.generate` ### `LLM.generate`
The {class}`~vllm.LLM.generate` method is available to all generative models in vLLM. The {class}`~vllm.LLM.generate` method is available to all generative models in vLLM.
...@@ -33,7 +33,7 @@ for output in outputs: ...@@ -33,7 +33,7 @@ for output in outputs:
``` ```
You can optionally control the language generation by passing {class}`~vllm.SamplingParams`. You can optionally control the language generation by passing {class}`~vllm.SamplingParams`.
For example, you can use greedy sampling by setting {code}`temperature=0`: For example, you can use greedy sampling by setting `temperature=0`:
```python ```python
llm = LLM(model="facebook/opt-125m") llm = LLM(model="facebook/opt-125m")
...@@ -46,7 +46,7 @@ for output in outputs: ...@@ -46,7 +46,7 @@ for output in outputs:
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
``` ```
A code example can be found here: <gh-file:examples/offline_inference.py> A code example can be found here: <gh-file:examples/offline_inference/basic.py>
### `LLM.beam_search` ### `LLM.beam_search`
...@@ -70,10 +70,10 @@ The {class}`~vllm.LLM.chat` method implements chat functionality on top of {clas ...@@ -70,10 +70,10 @@ The {class}`~vllm.LLM.chat` method implements chat functionality on top of {clas
In particular, it accepts input similar to [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat) In particular, it accepts input similar to [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat)
and automatically applies the model's [chat template](https://huggingface.co/docs/transformers/en/chat_templating) to format the prompt. and automatically applies the model's [chat template](https://huggingface.co/docs/transformers/en/chat_templating) to format the prompt.
```{important} :::{important}
In general, only instruction-tuned models have a chat template. In general, only instruction-tuned models have a chat template.
Base models may perform poorly as they are not trained to respond to the chat conversation. Base models may perform poorly as they are not trained to respond to the chat conversation.
``` :::
```python ```python
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct") llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
...@@ -103,7 +103,7 @@ for output in outputs: ...@@ -103,7 +103,7 @@ for output in outputs:
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
``` ```
A code example can be found here: <gh-file:examples/offline_inference_chat.py> A code example can be found here: <gh-file:examples/offline_inference/chat.py>
If the model doesn't have a chat template or you want to specify another one, If the model doesn't have a chat template or you want to specify another one,
you can explicitly pass a chat template: you can explicitly pass a chat template:
...@@ -118,9 +118,9 @@ print("Loaded chat template:", custom_template) ...@@ -118,9 +118,9 @@ print("Loaded chat template:", custom_template)
outputs = llm.chat(conversation, chat_template=custom_template) outputs = llm.chat(conversation, chat_template=custom_template)
``` ```
## Online Inference ## Online Serving
Our [OpenAI Compatible Server](../serving/openai_compatible_server.md) provides endpoints that correspond to the offline APIs: Our [OpenAI-Compatible Server](#openai-compatible-server) provides endpoints that correspond to the offline APIs:
- [Completions API](#completions-api) is similar to `LLM.generate` but only accepts text. - [Completions API](#completions-api) is similar to `LLM.generate` but only accepts text.
- [Chat API](#chat-api) is similar to `LLM.chat`, accepting both text and [multi-modal inputs](#multimodal-inputs) for models with a chat template. - [Chat API](#chat-api) is similar to `LLM.chat`, accepting both text and [multi-modal inputs](#multimodal-inputs) for models with a chat template.
...@@ -8,37 +8,60 @@ In vLLM, pooling models implement the {class}`~vllm.model_executor.models.VllmMo ...@@ -8,37 +8,60 @@ In vLLM, pooling models implement the {class}`~vllm.model_executor.models.VllmMo
These models use a {class}`~vllm.model_executor.layers.Pooler` to extract the final hidden states of the input These models use a {class}`~vllm.model_executor.layers.Pooler` to extract the final hidden states of the input
before returning them. before returning them.
```{note} :::{note}
We currently support pooling models primarily as a matter of convenience. We currently support pooling models primarily as a matter of convenience.
As shown in the [Compatibility Matrix](#compatibility-matrix), most vLLM features are not applicable to As shown in the [Compatibility Matrix](#compatibility-matrix), most vLLM features are not applicable to
pooling models as they only work on the generation or decode stage, so performance may not improve as much. pooling models as they only work on the generation or decode stage, so performance may not improve as much.
``` :::
For pooling models, we support the following `--task` options.
The selected option sets the default pooler used to extract the final hidden states:
:::{list-table}
:widths: 50 25 25 25
:header-rows: 1
- * Task
* Pooling Type
* Normalization
* Softmax
- * Embedding (`embed`)
* `LAST`
* ✅︎
*
- * Classification (`classify`)
* `LAST`
*
* ✅︎
- * Sentence Pair Scoring (`score`)
* \*
* \*
* \*
- * Reward Modeling (`reward`)
* `ALL`
*
*
:::
\*The default pooler is always defined by the model.
:::{note}
If the model's implementation in vLLM defines its own pooler, the default pooler is set to that instead of the one specified in this table.
:::
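The table above can also be read as a lookup from task to default pooler settings. The following is a sketch of that reading, not vLLM code: `PoolerDefaults` and `DEFAULT_POOLERS` are hypothetical names, empty table cells are assumed to mean "not applied", and `None` marks values that the model itself defines.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class PoolerDefaults:
    pooling_type: Optional[str]  # None means "defined by the model"
    normalize: Optional[bool]
    softmax: Optional[bool]


# Transcribed from the table; blank cells are interpreted as False.
DEFAULT_POOLERS: Dict[str, PoolerDefaults] = {
    "embed": PoolerDefaults("LAST", normalize=True, softmax=False),
    "classify": PoolerDefaults("LAST", normalize=False, softmax=True),
    "score": PoolerDefaults(None, normalize=None, softmax=None),
    "reward": PoolerDefaults("ALL", normalize=False, softmax=False),
}


def describe(task: str) -> str:
    """Return a one-line summary of the default pooler for a task."""
    defaults = DEFAULT_POOLERS[task]
    if defaults.pooling_type is None:
        return f"{task}: default pooler is defined by the model"
    steps = [f"pooling={defaults.pooling_type}"]
    if defaults.normalize:
        steps.append("normalize")
    if defaults.softmax:
        steps.append("softmax")
    return f"{task}: " + ", ".join(steps)


for task in DEFAULT_POOLERS:
    print(describe(task))
```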
When loading [Sentence Transformers](https://huggingface.co/sentence-transformers) models,
we attempt to override the default pooler based on its Sentence Transformers configuration file (`modules.json`).
:::{tip}
You can customize the model's pooling method via the `--override-pooler-config` option,
which takes priority over both the model's and Sentence Transformers' defaults.
:::
## Offline Inference ## Offline Inference
The {class}`~vllm.LLM` class provides various methods for offline inference. The {class}`~vllm.LLM` class provides various methods for offline inference.
See [Engine Arguments](#engine-args) for a list of options when initializing the model. See [Engine Arguments](#engine-args) for a list of options when initializing the model.
For pooling models, we support the following {code}`task` options:
- Embedding ({code}`"embed"` / {code}`"embedding"`)
- Classification ({code}`"classify"`)
- Sentence Pair Scoring ({code}`"score"`)
- Reward Modeling ({code}`"reward"`)
The selected task determines the default {class}`~vllm.model_executor.layers.Pooler` that is used:
- Embedding: Extract only the hidden states corresponding to the last token, and apply normalization.
- Classification: Extract only the hidden states corresponding to the last token, and apply softmax.
- Sentence Pair Scoring: Extract only the hidden states corresponding to the last token, and apply softmax.
- Reward Modeling: Extract all of the hidden states and return them directly.
When loading [Sentence Transformers](https://huggingface.co/sentence-transformers) models,
we attempt to override the default pooler based on its Sentence Transformers configuration file ({code}`modules.json`).
You can customize the model's pooling method via the {code}`override_pooler_config` option,
which takes priority over both the model's and Sentence Transformers' defaults.
### `LLM.encode` ### `LLM.encode`
The {class}`~vllm.LLM.encode` method is available to all pooling models in vLLM. The {class}`~vllm.LLM.encode` method is available to all pooling models in vLLM.
...@@ -65,7 +88,7 @@ embeds = output.outputs.embedding ...@@ -65,7 +88,7 @@ embeds = output.outputs.embedding
print(f"Embeddings: {embeds!r} (size={len(embeds)})") print(f"Embeddings: {embeds!r} (size={len(embeds)})")
``` ```
A code example can be found here: <gh-file:examples/offline_inference_embedding.py> A code example can be found here: <gh-file:examples/offline_inference/embedding.py>
### `LLM.classify` ### `LLM.classify`
...@@ -80,7 +103,7 @@ probs = output.outputs.probs ...@@ -80,7 +103,7 @@ probs = output.outputs.probs
print(f"Class Probabilities: {probs!r} (size={len(probs)})") print(f"Class Probabilities: {probs!r} (size={len(probs)})")
``` ```
A code example can be found here: <gh-file:examples/offline_inference_classification.py> A code example can be found here: <gh-file:examples/offline_inference/classification.py>
### `LLM.score` ### `LLM.score`
...@@ -88,10 +111,10 @@ The {class}`~vllm.LLM.score` method outputs similarity scores between sentence p ...@@ -88,10 +111,10 @@ The {class}`~vllm.LLM.score` method outputs similarity scores between sentence p
It is primarily designed for [cross-encoder models](https://www.sbert.net/examples/applications/cross-encoder/README.html). It is primarily designed for [cross-encoder models](https://www.sbert.net/examples/applications/cross-encoder/README.html).
These types of models serve as rerankers between candidate query-document pairs in RAG systems. These types of models serve as rerankers between candidate query-document pairs in RAG systems.
```{note} :::{note}
vLLM can only perform the model inference component (e.g. embedding, reranking) of RAG. vLLM can only perform the model inference component (e.g. embedding, reranking) of RAG.
To handle RAG at a higher level, you should use integration frameworks such as [LangChain](https://github.com/langchain-ai/langchain). To handle RAG at a higher level, you should use integration frameworks such as [LangChain](https://github.com/langchain-ai/langchain).
``` :::
```python ```python
llm = LLM(model="BAAI/bge-reranker-v2-m3", task="score") llm = LLM(model="BAAI/bge-reranker-v2-m3", task="score")
...@@ -102,11 +125,11 @@ score = output.outputs.score ...@@ -102,11 +125,11 @@ score = output.outputs.score
print(f"Score: {score}") print(f"Score: {score}")
``` ```
A code example can be found here: <gh-file:examples/offline_inference_scoring.py> A code example can be found here: <gh-file:examples/offline_inference/scoring.py>
## Online Inference ## Online Serving
Our [OpenAI Compatible Server](../serving/openai_compatible_server.md) provides endpoints that correspond to the offline APIs: Our [OpenAI-Compatible Server](#openai-compatible-server) provides endpoints that correspond to the offline APIs:
- [Pooling API](#pooling-api) is similar to `LLM.encode`, being applicable to all types of pooling models. - [Pooling API](#pooling-api) is similar to `LLM.encode`, being applicable to all types of pooling models.
- [Embeddings API](#embeddings-api) is similar to `LLM.embed`, accepting both text and [multi-modal inputs](#multimodal-inputs) for embedding models. - [Embeddings API](#embeddings-api) is similar to `LLM.embed`, accepting both text and [multi-modal inputs](#multimodal-inputs) for embedding models.
......
(supported-models)= (supported-models)=
# Supported Models # List of Supported Models
vLLM supports generative and pooling models across various tasks. vLLM supports generative and pooling models across various tasks.
If a model supports more than one task, you can set the task via the {code}`--task` argument. If a model supports more than one task, you can set the task via the `--task` argument.
For each task, we list the model architectures that have been implemented in vLLM. For each task, we list the model architectures that have been implemented in vLLM.
Alongside each architecture, we include some popular models that use it. Alongside each architecture, we include some popular models that use it.
...@@ -14,10 +14,10 @@ Alongside each architecture, we include some popular models that use it. ...@@ -14,10 +14,10 @@ Alongside each architecture, we include some popular models that use it.
By default, vLLM loads models from [HuggingFace (HF) Hub](https://huggingface.co/models). By default, vLLM loads models from [HuggingFace (HF) Hub](https://huggingface.co/models).
To determine whether a given model is supported, you can check the {code}`config.json` file inside the HF repository. To determine whether a given model is supported, you can check the `config.json` file inside the HF repository.
If the {code}`"architectures"` field contains a model architecture listed below, then it should be supported in theory. If the `"architectures"` field contains a model architecture listed below, then it should be supported in theory.
````{tip} :::{tip}
The easiest way to check if your model is really supported at runtime is to run the program below: The easiest way to check if your model is really supported at runtime is to run the program below:
```python ```python
...@@ -35,9 +35,9 @@ print(output) ...@@ -35,9 +35,9 @@ print(output)
``` ```
If vLLM successfully returns text (for generative models) or hidden states (for pooling models), it indicates that your model is supported. If vLLM successfully returns text (for generative models) or hidden states (for pooling models), it indicates that your model is supported.
```` :::
Otherwise, please refer to [Adding a New Model](#adding-a-new-model) and [Enabling Multimodal Inputs](#enabling-multimodal-inputs) for instructions on how to implement your model in vLLM. Otherwise, please refer to [Adding a New Model](#new-model) for instructions on how to implement your model in vLLM.
Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) to request vLLM support. Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) to request vLLM support.
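The `config.json` check described earlier in this section can be scripted. This is a minimal sketch under stated assumptions: `SUPPORTED_ARCHITECTURES` is a small illustrative subset copied from the table below rather than the full list, `find_supported_architectures` is a hypothetical helper, and the config is an inline string instead of a file downloaded from the HF repository. A match only indicates support in theory; the runtime check in the tip above remains the reliable test.

```python
import json
from typing import Iterable, List

# Illustrative subset of architecture names; not the complete list.
SUPPORTED_ARCHITECTURES = {
    "AquilaForCausalLM",
    "BloomForCausalLM",
    "DeepseekV3ForCausalLM",
}


def find_supported_architectures(
    config_text: str, supported: Iterable[str]
) -> List[str]:
    """Return the entries of the "architectures" field found in `supported`."""
    config = json.loads(config_text)
    supported_set = set(supported)
    return [
        name
        for name in config.get("architectures", [])
        if name in supported_set
    ]


# Stand-in for the contents of a repository's config.json.
example_config = '{"architectures": ["BloomForCausalLM"], "model_type": "bloom"}'

matches = find_supported_architectures(example_config, SUPPORTED_ARCHITECTURES)
print(matches if matches else "No listed architecture found")
```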
### ModelScope ### ModelScope
...@@ -45,10 +45,10 @@ Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project ...@@ -45,10 +45,10 @@ Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project
To use models from [ModelScope](https://www.modelscope.cn) instead of HuggingFace Hub, set an environment variable: To use models from [ModelScope](https://www.modelscope.cn) instead of HuggingFace Hub, set an environment variable:
```shell ```shell
$ export VLLM_USE_MODELSCOPE=True export VLLM_USE_MODELSCOPE=True
``` ```
And use with {code}`trust_remote_code=True`. And use with `trust_remote_code=True`.
```python ```python
from vllm import LLM from vllm import LLM
...@@ -72,459 +72,472 @@ See [this page](#generative-models) for more information on how to use generativ ...@@ -72,459 +72,472 @@ See [this page](#generative-models) for more information on how to use generativ
#### Text Generation (`--task generate`) #### Text Generation (`--task generate`)
```{eval-rst} :::{list-table}
.. list-table:: :widths: 25 25 50 5 5
:widths: 25 25 50 5 5 :header-rows: 1
:header-rows: 1
- * Architecture
* - Architecture * Models
- Models * Example HF Models
- Example HF Models * [LoRA](#lora-adapter)
- :ref:`LoRA <lora-adapter>` * [PP](#distributed-serving)
- :ref:`PP <distributed-serving>` - * `AquilaForCausalLM`
* - :code:`AquilaForCausalLM` * Aquila, Aquila2
- Aquila, Aquila2 * `BAAI/Aquila-7B`, `BAAI/AquilaChat-7B`, etc.
- :code:`BAAI/Aquila-7B`, :code:`BAAI/AquilaChat-7B`, etc. * ✅︎
- ✅︎ * ✅︎
- ✅︎ - * `ArcticForCausalLM`
* - :code:`ArcticForCausalLM` * Arctic
- Arctic * `Snowflake/snowflake-arctic-base`, `Snowflake/snowflake-arctic-instruct`, etc.
- :code:`Snowflake/snowflake-arctic-base`, :code:`Snowflake/snowflake-arctic-instruct`, etc. *
- * ✅︎
- ✅︎ - * `BaiChuanForCausalLM`
* - :code:`BaiChuanForCausalLM` * Baichuan2, Baichuan
- Baichuan2, Baichuan * `baichuan-inc/Baichuan2-13B-Chat`, `baichuan-inc/Baichuan-7B`, etc.
- :code:`baichuan-inc/Baichuan2-13B-Chat`, :code:`baichuan-inc/Baichuan-7B`, etc. * ✅︎
- ✅︎ * ✅︎
- ✅︎ - * `BloomForCausalLM`
* - :code:`BloomForCausalLM` * BLOOM, BLOOMZ, BLOOMChat
- BLOOM, BLOOMZ, BLOOMChat * `bigscience/bloom`, `bigscience/bloomz`, etc.
- :code:`bigscience/bloom`, :code:`bigscience/bloomz`, etc. *
- * ✅︎
- ✅︎ - * `BartForConditionalGeneration`
* - :code:`BartForConditionalGeneration` * BART
- BART * `facebook/bart-base`, `facebook/bart-large-cnn`, etc.
- :code:`facebook/bart-base`, :code:`facebook/bart-large-cnn`, etc. *
- *
- - * `ChatGLMModel`
* - :code:`ChatGLMModel` * ChatGLM
- ChatGLM * `THUDM/chatglm2-6b`, `THUDM/chatglm3-6b`, etc.
- :code:`THUDM/chatglm2-6b`, :code:`THUDM/chatglm3-6b`, etc. * ✅︎
- ✅︎ * ✅︎
- ✅︎ - * `CohereForCausalLM`, `Cohere2ForCausalLM`
* - :code:`CohereForCausalLM`,:code:`Cohere2ForCausalLM` * Command-R
- Command-R * `CohereForAI/c4ai-command-r-v01`, `CohereForAI/c4ai-command-r7b-12-2024`, etc.
- :code:`CohereForAI/c4ai-command-r-v01`, :code:`CohereForAI/c4ai-command-r7b-12-2024`, etc. * ✅︎
- ✅︎ * ✅︎
- ✅︎ - * `DbrxForCausalLM`
* - :code:`DbrxForCausalLM` * DBRX
- DBRX * `databricks/dbrx-base`, `databricks/dbrx-instruct`, etc.
- :code:`databricks/dbrx-base`, :code:`databricks/dbrx-instruct`, etc. *
- * ✅︎
- ✅︎ - * `DeciLMForCausalLM`
* - :code:`DeciLMForCausalLM` * DeciLM
- DeciLM * `Deci/DeciLM-7B`, `Deci/DeciLM-7B-instruct`, etc.
- :code:`Deci/DeciLM-7B`, :code:`Deci/DeciLM-7B-instruct`, etc. *
- * ✅︎
- ✅︎ - * `DeepseekForCausalLM`
* - :code:`DeepseekForCausalLM` * DeepSeek
- DeepSeek * `deepseek-ai/deepseek-llm-67b-base`, `deepseek-ai/deepseek-llm-7b-chat` etc.
- :code:`deepseek-ai/deepseek-llm-67b-base`, :code:`deepseek-ai/deepseek-llm-7b-chat` etc. *
- * ✅︎
- ✅︎ - * `DeepseekV2ForCausalLM`
* - :code:`DeepseekV2ForCausalLM` * DeepSeek-V2
- DeepSeek-V2 * `deepseek-ai/DeepSeek-V2`, `deepseek-ai/DeepSeek-V2-Chat` etc.
- :code:`deepseek-ai/DeepSeek-V2`, :code:`deepseek-ai/DeepSeek-V2-Chat` etc. *
- * ✅︎
- ✅︎ - * `DeepseekV3ForCausalLM`
* - :code:`DeepseekV3ForCausalLM` * DeepSeek-V3
- DeepSeek-V3 * `deepseek-ai/DeepSeek-V3-Base`, `deepseek-ai/DeepSeek-V3` etc.
- :code:`deepseek-ai/DeepSeek-V3-Base`, :code:`deepseek-ai/DeepSeek-V3` etc. *
- * ✅︎
- ✅︎ - * `ExaoneForCausalLM`
* - :code:`ExaoneForCausalLM` * EXAONE-3
- EXAONE-3 * `LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct`, etc.
- :code:`LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct`, etc. * ✅︎
- ✅︎ * ✅︎
- ✅︎ - * `FalconForCausalLM`
* - :code:`FalconForCausalLM` * Falcon
- Falcon * `tiiuae/falcon-7b`, `tiiuae/falcon-40b`, `tiiuae/falcon-rw-7b`, etc.
- :code:`tiiuae/falcon-7b`, :code:`tiiuae/falcon-40b`, :code:`tiiuae/falcon-rw-7b`, etc. *
- * ✅︎
- ✅︎ - * `FalconMambaForCausalLM`
* - :code:`FalconMambaForCausalLM` * FalconMamba
- FalconMamba * `tiiuae/falcon-mamba-7b`, `tiiuae/falcon-mamba-7b-instruct`, etc.
- :code:`tiiuae/falcon-mamba-7b`, :code:`tiiuae/falcon-mamba-7b-instruct`, etc. * ✅︎
- ✅︎ * ✅︎
- ✅︎ - * `GemmaForCausalLM`
* - :code:`GemmaForCausalLM` * Gemma
- Gemma * `google/gemma-2b`, `google/gemma-7b`, etc.
- :code:`google/gemma-2b`, :code:`google/gemma-7b`, etc. * ✅︎
- ✅︎ * ✅︎
- ✅︎ - * `Gemma2ForCausalLM`
* - :code:`Gemma2ForCausalLM` * Gemma2
- Gemma2 * `google/gemma-2-9b`, `google/gemma-2-27b`, etc.
- :code:`google/gemma-2-9b`, :code:`google/gemma-2-27b`, etc. * ✅︎
- ✅︎ * ✅︎
- ✅︎ - * `GlmForCausalLM`
* - :code:`GlmForCausalLM` * GLM-4
- GLM-4 * `THUDM/glm-4-9b-chat-hf`, etc.
- :code:`THUDM/glm-4-9b-chat-hf`, etc. * ✅︎
- ✅︎ * ✅︎
- ✅︎ - * `GPT2LMHeadModel`
* - :code:`GPT2LMHeadModel` * GPT-2
- GPT-2 * `gpt2`, `gpt2-xl`, etc.
- :code:`gpt2`, :code:`gpt2-xl`, etc. *
- * ✅︎
- ✅︎ - * `GPTBigCodeForCausalLM`
* - :code:`GPTBigCodeForCausalLM` * StarCoder, SantaCoder, WizardCoder
- StarCoder, SantaCoder, WizardCoder * `bigcode/starcoder`, `bigcode/gpt_bigcode-santacoder`, `WizardLM/WizardCoder-15B-V1.0`, etc.
- :code:`bigcode/starcoder`, :code:`bigcode/gpt_bigcode-santacoder`, :code:`WizardLM/WizardCoder-15B-V1.0`, etc. * ✅︎
- ✅︎ * ✅︎
- ✅︎ - * `GPTJForCausalLM`
* - :code:`GPTJForCausalLM` * GPT-J
- GPT-J * `EleutherAI/gpt-j-6b`, `nomic-ai/gpt4all-j`, etc.
- :code:`EleutherAI/gpt-j-6b`, :code:`nomic-ai/gpt4all-j`, etc. *
- * ✅︎
- ✅︎ - * `GPTNeoXForCausalLM`
* - :code:`GPTNeoXForCausalLM` * GPT-NeoX, Pythia, OpenAssistant, Dolly V2, StableLM
- GPT-NeoX, Pythia, OpenAssistant, Dolly V2, StableLM * `EleutherAI/gpt-neox-20b`, `EleutherAI/pythia-12b`, `OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5`, `databricks/dolly-v2-12b`, `stabilityai/stablelm-tuned-alpha-7b`, etc.
- :code:`EleutherAI/gpt-neox-20b`, :code:`EleutherAI/pythia-12b`, :code:`OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5`, :code:`databricks/dolly-v2-12b`, :code:`stabilityai/stablelm-tuned-alpha-7b`, etc. *
- * ✅︎
- ✅︎ - * `GraniteForCausalLM`
* - :code:`GraniteForCausalLM` * Granite 3.0, Granite 3.1, PowerLM
- Granite 3.0, Granite 3.1, PowerLM * `ibm-granite/granite-3.0-2b-base`, `ibm-granite/granite-3.1-8b-instruct`, `ibm/PowerLM-3b`, etc.
- :code:`ibm-granite/granite-3.0-2b-base`, :code:`ibm-granite/granite-3.1-8b-instruct`, :code:`ibm/PowerLM-3b`, etc. * ✅︎
- ✅︎ * ✅︎
- ✅︎ - * `GraniteMoeForCausalLM`
* - :code:`GraniteMoeForCausalLM` * Granite 3.0 MoE, PowerMoE
- Granite 3.0 MoE, PowerMoE * `ibm-granite/granite-3.0-1b-a400m-base`, `ibm-granite/granite-3.0-3b-a800m-instruct`, `ibm/PowerMoE-3b`, etc.
- :code:`ibm-granite/granite-3.0-1b-a400m-base`, :code:`ibm-granite/granite-3.0-3b-a800m-instruct`, :code:`ibm/PowerMoE-3b`, etc. * ✅︎
- ✅︎ * ✅︎
- ✅︎ - * `GritLM`
* - :code:`GritLM` * GritLM
- GritLM * `parasail-ai/GritLM-7B-vllm`.
- :code:`parasail-ai/GritLM-7B-vllm`. * ✅︎
- ✅︎ * ✅︎
- ✅︎ - * `InternLMForCausalLM`
* - :code:`InternLMForCausalLM` * InternLM
- InternLM * `internlm/internlm-7b`, `internlm/internlm-chat-7b`, etc.
- :code:`internlm/internlm-7b`, :code:`internlm/internlm-chat-7b`, etc. * ✅︎
- ✅︎ * ✅︎
- ✅︎ - * `InternLM2ForCausalLM`
* - :code:`InternLM2ForCausalLM` * InternLM2
- InternLM2 * `internlm/internlm2-7b`, `internlm/internlm2-chat-7b`, etc.
- :code:`internlm/internlm2-7b`, :code:`internlm/internlm2-chat-7b`, etc. * ✅︎
- ✅︎ * ✅︎
- ✅︎ - * `InternLM3ForCausalLM`
* - :code:`JAISLMHeadModel` * InternLM3
- Jais * `internlm/internlm3-8b-instruct`, etc.
- :code:`inceptionai/jais-13b`, :code:`inceptionai/jais-13b-chat`, :code:`inceptionai/jais-30b-v3`, :code:`inceptionai/jais-30b-chat-v3`, etc. * ✅︎
- * ✅︎
- ✅︎ - * `JAISLMHeadModel`
* - :code:`JambaForCausalLM` * Jais
- Jamba * `inceptionai/jais-13b`, `inceptionai/jais-13b-chat`, `inceptionai/jais-30b-v3`, `inceptionai/jais-30b-chat-v3`, etc.
- :code:`ai21labs/AI21-Jamba-1.5-Large`, :code:`ai21labs/AI21-Jamba-1.5-Mini`, :code:`ai21labs/Jamba-v0.1`, etc. *
- ✅︎ * ✅︎
- ✅︎ - * `JambaForCausalLM`
* - :code:`LlamaForCausalLM` * Jamba
- Llama 3.1, Llama 3, Llama 2, LLaMA, Yi * `ai21labs/AI21-Jamba-1.5-Large`, `ai21labs/AI21-Jamba-1.5-Mini`, `ai21labs/Jamba-v0.1`, etc.
- :code:`meta-llama/Meta-Llama-3.1-405B-Instruct`, :code:`meta-llama/Meta-Llama-3.1-70B`, :code:`meta-llama/Meta-Llama-3-70B-Instruct`, :code:`meta-llama/Llama-2-70b-hf`, :code:`01-ai/Yi-34B`, etc. * ✅︎
- ✅︎ * ✅︎
- ✅︎ - * `LlamaForCausalLM`
* - :code:`MambaForCausalLM` * Llama 3.1, Llama 3, Llama 2, LLaMA, Yi
- Mamba * `meta-llama/Meta-Llama-3.1-405B-Instruct`, `meta-llama/Meta-Llama-3.1-70B`, `meta-llama/Meta-Llama-3-70B-Instruct`, `meta-llama/Llama-2-70b-hf`, `01-ai/Yi-34B`, etc.
- :code:`state-spaces/mamba-130m-hf`, :code:`state-spaces/mamba-790m-hf`, :code:`state-spaces/mamba-2.8b-hf`, etc. * ✅︎
- * ✅︎
- ✅︎ - * `MambaForCausalLM`
* - :code:`MiniCPMForCausalLM` * Mamba
- MiniCPM * `state-spaces/mamba-130m-hf`, `state-spaces/mamba-790m-hf`, `state-spaces/mamba-2.8b-hf`, etc.
- :code:`openbmb/MiniCPM-2B-sft-bf16`, :code:`openbmb/MiniCPM-2B-dpo-bf16`, :code:`openbmb/MiniCPM-S-1B-sft`, etc. *
- ✅︎ * ✅︎
- ✅︎ - * `MiniCPMForCausalLM`
* - :code:`MiniCPM3ForCausalLM` * MiniCPM
- MiniCPM3 * `openbmb/MiniCPM-2B-sft-bf16`, `openbmb/MiniCPM-2B-dpo-bf16`, `openbmb/MiniCPM-S-1B-sft`, etc.
- :code:`openbmb/MiniCPM3-4B`, etc. * ✅︎
- ✅︎ * ✅︎
- ✅︎ - * `MiniCPM3ForCausalLM`
* - :code:`MistralForCausalLM` * MiniCPM3
- Mistral, Mistral-Instruct * `openbmb/MiniCPM3-4B`, etc.
- :code:`mistralai/Mistral-7B-v0.1`, :code:`mistralai/Mistral-7B-Instruct-v0.1`, etc. * ✅︎
- ✅︎ * ✅︎
- ✅︎ - * `MistralForCausalLM`
* - :code:`MixtralForCausalLM` * Mistral, Mistral-Instruct
- Mixtral-8x7B, Mixtral-8x7B-Instruct * `mistralai/Mistral-7B-v0.1`, `mistralai/Mistral-7B-Instruct-v0.1`, etc.
- :code:`mistralai/Mixtral-8x7B-v0.1`, :code:`mistralai/Mixtral-8x7B-Instruct-v0.1`, :code:`mistral-community/Mixtral-8x22B-v0.1`, etc. * ✅︎
- ✅︎ * ✅︎
- ✅︎ - * `MixtralForCausalLM`
* - :code:`MPTForCausalLM` * Mixtral-8x7B, Mixtral-8x7B-Instruct
- MPT, MPT-Instruct, MPT-Chat, MPT-StoryWriter * `mistralai/Mixtral-8x7B-v0.1`, `mistralai/Mixtral-8x7B-Instruct-v0.1`, `mistral-community/Mixtral-8x22B-v0.1`, etc.
- :code:`mosaicml/mpt-7b`, :code:`mosaicml/mpt-7b-storywriter`, :code:`mosaicml/mpt-30b`, etc. * ✅︎
- * ✅︎
- ✅︎ - * `MPTForCausalLM`
* - :code:`NemotronForCausalLM` * MPT, MPT-Instruct, MPT-Chat, MPT-StoryWriter
- Nemotron-3, Nemotron-4, Minitron * `mosaicml/mpt-7b`, `mosaicml/mpt-7b-storywriter`, `mosaicml/mpt-30b`, etc.
- :code:`nvidia/Minitron-8B-Base`, :code:`mgoin/Nemotron-4-340B-Base-hf-FP8`, etc. *
- ✅︎ * ✅︎
- ✅︎ - * `NemotronForCausalLM`
* - :code:`OLMoForCausalLM` * Nemotron-3, Nemotron-4, Minitron
- OLMo * `nvidia/Minitron-8B-Base`, `mgoin/Nemotron-4-340B-Base-hf-FP8`, etc.
- :code:`allenai/OLMo-1B-hf`, :code:`allenai/OLMo-7B-hf`, etc. * ✅︎
- * ✅︎
- ✅︎ - * `OLMoForCausalLM`
* - :code:`OLMo2ForCausalLM` * OLMo
- OLMo2 * `allenai/OLMo-1B-hf`, `allenai/OLMo-7B-hf`, etc.
- :code:`allenai/OLMo2-7B-1124`, etc. *
- * ✅︎
- ✅︎ - * `OLMo2ForCausalLM`
* - :code:`OLMoEForCausalLM` * OLMo2
- OLMoE * `allenai/OLMo2-7B-1124`, etc.
- :code:`allenai/OLMoE-1B-7B-0924`, :code:`allenai/OLMoE-1B-7B-0924-Instruct`, etc. *
- ✅︎ * ✅︎
- ✅︎ - * `OLMoEForCausalLM`
* - :code:`OPTForCausalLM` * OLMoE
- OPT, OPT-IML * `allenai/OLMoE-1B-7B-0924`, `allenai/OLMoE-1B-7B-0924-Instruct`, etc.
- :code:`facebook/opt-66b`, :code:`facebook/opt-iml-max-30b`, etc. * ✅︎
- * ✅︎
- ✅︎ - * `OPTForCausalLM`
* - :code:`OrionForCausalLM` * OPT, OPT-IML
- Orion * `facebook/opt-66b`, `facebook/opt-iml-max-30b`, etc.
- :code:`OrionStarAI/Orion-14B-Base`, :code:`OrionStarAI/Orion-14B-Chat`, etc. *
- * ✅︎
- ✅︎ - * `OrionForCausalLM`
* - :code:`PhiForCausalLM` * Orion
- Phi * `OrionStarAI/Orion-14B-Base`, `OrionStarAI/Orion-14B-Chat`, etc.
- :code:`microsoft/phi-1_5`, :code:`microsoft/phi-2`, etc. *
- ✅︎ * ✅︎
- ✅︎ - * `PhiForCausalLM`
* - :code:`Phi3ForCausalLM` * Phi
- Phi-3 * `microsoft/phi-1_5`, `microsoft/phi-2`, etc.
- :code:`microsoft/Phi-3-mini-4k-instruct`, :code:`microsoft/Phi-3-mini-128k-instruct`, :code:`microsoft/Phi-3-medium-128k-instruct`, etc. * ✅︎
- ✅︎ * ✅︎
- ✅︎ - * `Phi3ForCausalLM`
* - :code:`Phi3SmallForCausalLM` * Phi-4, Phi-3
- Phi-3-Small * `microsoft/Phi-4`, `microsoft/Phi-3-mini-4k-instruct`, `microsoft/Phi-3-mini-128k-instruct`, `microsoft/Phi-3-medium-128k-instruct`, etc.
- :code:`microsoft/Phi-3-small-8k-instruct`, :code:`microsoft/Phi-3-small-128k-instruct`, etc. * ✅︎
- * ✅︎
- ✅︎ - * `Phi3SmallForCausalLM`
* - :code:`PhiMoEForCausalLM` * Phi-3-Small
- Phi-3.5-MoE * `microsoft/Phi-3-small-8k-instruct`, `microsoft/Phi-3-small-128k-instruct`, etc.
- :code:`microsoft/Phi-3.5-MoE-instruct`, etc. *
- ✅︎ * ✅︎
- ✅︎ - * `PhiMoEForCausalLM`
* - :code:`PersimmonForCausalLM` * Phi-3.5-MoE
- Persimmon * `microsoft/Phi-3.5-MoE-instruct`, etc.
- :code:`adept/persimmon-8b-base`, :code:`adept/persimmon-8b-chat`, etc. * ✅︎
- * ✅︎
- ✅︎ - * `PersimmonForCausalLM`
* - :code:`QWenLMHeadModel` * Persimmon
- Qwen * `adept/persimmon-8b-base`, `adept/persimmon-8b-chat`, etc.
- :code:`Qwen/Qwen-7B`, :code:`Qwen/Qwen-7B-Chat`, etc. *
- ✅︎ * ✅︎
- ✅︎ - * `QWenLMHeadModel`
* - :code:`Qwen2ForCausalLM` * Qwen
- Qwen2 * `Qwen/Qwen-7B`, `Qwen/Qwen-7B-Chat`, etc.
- :code:`Qwen/QwQ-32B-Preview`, :code:`Qwen/Qwen2-7B-Instruct`, :code:`Qwen/Qwen2-7B`, etc. * ✅︎
- ✅︎ * ✅︎
- ✅︎ - * `Qwen2ForCausalLM`
* - :code:`Qwen2MoeForCausalLM` * QwQ, Qwen2
- Qwen2MoE * `Qwen/QwQ-32B-Preview`, `Qwen/Qwen2-7B-Instruct`, `Qwen/Qwen2-7B`, etc.
- :code:`Qwen/Qwen1.5-MoE-A2.7B`, :code:`Qwen/Qwen1.5-MoE-A2.7B-Chat`, etc. * ✅︎
- * ✅︎
- ✅︎ - * `Qwen2MoeForCausalLM`
* - :code:`StableLmForCausalLM` * Qwen2MoE
- StableLM * `Qwen/Qwen1.5-MoE-A2.7B`, `Qwen/Qwen1.5-MoE-A2.7B-Chat`, etc.
- :code:`stabilityai/stablelm-3b-4e1t`, :code:`stabilityai/stablelm-base-alpha-7b-v2`, etc. *
- * ✅︎
- ✅︎ - * `StableLmForCausalLM`
* - :code:`Starcoder2ForCausalLM` * StableLM
- Starcoder2 * `stabilityai/stablelm-3b-4e1t`, `stabilityai/stablelm-base-alpha-7b-v2`, etc.
- :code:`bigcode/starcoder2-3b`, :code:`bigcode/starcoder2-7b`, :code:`bigcode/starcoder2-15b`, etc. *
- * ✅︎
- ✅︎ - * `Starcoder2ForCausalLM`
* - :code:`SolarForCausalLM` * Starcoder2
- Solar Pro * `bigcode/starcoder2-3b`, `bigcode/starcoder2-7b`, `bigcode/starcoder2-15b`, etc.
- :code:`upstage/solar-pro-preview-instruct`, etc. *
- ✅︎ * ✅︎
- ✅︎ - * `SolarForCausalLM`
* - :code:`TeleChat2ForCausalLM` * Solar Pro
- TeleChat2 * `upstage/solar-pro-preview-instruct`, etc.
- :code:`TeleAI/TeleChat2-3B`, :code:`TeleAI/TeleChat2-7B`, :code:`TeleAI/TeleChat2-35B`, etc. * ✅︎
- ✅︎ * ✅︎
- ✅︎ - * `TeleChat2ForCausalLM`
* - :code:`XverseForCausalLM` * TeleChat2
- XVERSE * `TeleAI/TeleChat2-3B`, `TeleAI/TeleChat2-7B`, `TeleAI/TeleChat2-35B`, etc.
- :code:`xverse/XVERSE-7B-Chat`, :code:`xverse/XVERSE-13B-Chat`, :code:`xverse/XVERSE-65B-Chat`, etc. * ✅︎
- ✅︎ * ✅︎
- ✅︎ - * `XverseForCausalLM`
``` * XVERSE
* `xverse/XVERSE-7B-Chat`, `xverse/XVERSE-13B-Chat`, `xverse/XVERSE-65B-Chat`, etc.
```{note} * ✅︎
* ✅︎
:::
:::{note}
Currently, the ROCm version of vLLM supports Mistral and Mixtral only for context lengths up to 4096. Currently, the ROCm version of vLLM supports Mistral and Mixtral only for context lengths up to 4096.
``` :::
### Pooling Models ### Pooling Models
See [this page](pooling-models) for more information on how to use pooling models. See [this page](pooling-models) for more information on how to use pooling models.
```{important} :::{important}
Since some model architectures support both generative and pooling tasks, Since some model architectures support both generative and pooling tasks,
you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode. you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode.
``` :::
#### Text Embedding (`--task embed`) #### Text Embedding (`--task embed`)
```{eval-rst} :::{list-table}
.. list-table:: :widths: 25 25 50 5 5
:widths: 25 25 50 5 5 :header-rows: 1
:header-rows: 1
- * Architecture
* - Architecture * Models
- Models * Example HF Models
- Example HF Models * [LoRA](#lora-adapter)
- :ref:`LoRA <lora-adapter>` * [PP](#distributed-serving)
- :ref:`PP <distributed-serving>` - * `BertModel`
* - :code:`BertModel` * BERT-based
- BERT-based * `BAAI/bge-base-en-v1.5`, etc.
- :code:`BAAI/bge-base-en-v1.5`, etc. *
- *
- - * `Gemma2Model`
* - :code:`Gemma2Model` * Gemma2-based
- Gemma2-based * `BAAI/bge-multilingual-gemma2`, etc.
- :code:`BAAI/bge-multilingual-gemma2`, etc. *
- * ✅︎
- ✅︎ - * `GritLM`
* - :code:`GritLM` * GritLM
- GritLM * `parasail-ai/GritLM-7B-vllm`.
- :code:`parasail-ai/GritLM-7B-vllm`. * ✅︎
- ✅︎ * ✅︎
- ✅︎ - * `LlamaModel`, `LlamaForCausalLM`, `MistralModel`, etc.
* - :code:`LlamaModel`, :code:`LlamaForCausalLM`, :code:`MistralModel`, etc. * Llama-based
- Llama-based * `intfloat/e5-mistral-7b-instruct`, etc.
- :code:`intfloat/e5-mistral-7b-instruct`, etc. * ✅︎
- ✅︎ * ✅︎
- ✅︎ - * `Qwen2Model`, `Qwen2ForCausalLM`
* - :code:`Qwen2Model`, :code:`Qwen2ForCausalLM` * Qwen2-based
- Qwen2-based * `ssmits/Qwen2-7B-Instruct-embed-base` (see note), `Alibaba-NLP/gte-Qwen2-7B-instruct` (see note), etc.
- :code:`ssmits/Qwen2-7B-Instruct-embed-base` (see note), :code:`Alibaba-NLP/gte-Qwen2-7B-instruct` (see note), etc. * ✅︎
- ✅︎ * ✅︎
- ✅︎ - * `RobertaModel`, `RobertaForMaskedLM`
* - :code:`RobertaModel`, :code:`RobertaForMaskedLM` * RoBERTa-based
- RoBERTa-based * `sentence-transformers/all-roberta-large-v1`, etc.
- :code:`sentence-transformers/all-roberta-large-v1`, etc. *
- *
- - * `XLMRobertaModel`
* - :code:`XLMRobertaModel` * XLM-RoBERTa-based
- XLM-RoBERTa-based * `intfloat/multilingual-e5-large`, etc.
- :code:`intfloat/multilingual-e5-large`, etc. *
- *
- :::
```
:::{note}
```{note} `ssmits/Qwen2-7B-Instruct-embed-base` has an improperly defined Sentence Transformers config.
{code}`ssmits/Qwen2-7B-Instruct-embed-base` has an improperly defined Sentence Transformers config. You should manually set mean pooling by passing `--override-pooler-config '{"pooling_type": "MEAN"}'`.
You should manually set mean pooling by passing {code}`--override-pooler-config '{"pooling_type": "MEAN"}'`. :::
```
:::{note}
```{note} Unlike base Qwen2, `Alibaba-NLP/gte-Qwen2-7B-instruct` uses bi-directional attention.
Unlike base Qwen2, {code}`Alibaba-NLP/gte-Qwen2-7B-instruct` uses bi-directional attention. You can set `--hf-overrides '{"is_causal": false}'` to change the attention mask accordingly.
You can set {code}`--hf-overrides '{"is_causal": false}'` to change the attention mask accordingly.
On the other hand, its 1.5B variant (`Alibaba-NLP/gte-Qwen2-1.5B-instruct`) uses causal attention
On the other hand, its 1.5B variant ({code}`Alibaba-NLP/gte-Qwen2-1.5B-instruct`) uses causal attention
despite being described otherwise on its model card. despite being described otherwise on its model card.
```
Regardless of the variant, you need to enable `--trust-remote-code` for the correct tokenizer to be
loaded. See [relevant issue on HF Transformers](https://github.com/huggingface/transformers/issues/34882).
:::
If your model is not in the above list, we will try to automatically convert the model using If your model is not in the above list, we will try to automatically convert the model using
:func:`vllm.model_executor.models.adapters.as_embedding_model`. By default, the embeddings {func}`~vllm.model_executor.models.adapters.as_embedding_model`. By default, the embeddings
of the whole prompt are extracted from the normalized hidden state corresponding to the last token. of the whole prompt are extracted from the normalized hidden state corresponding to the last token.
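As a rough illustration of the default described above, the sketch below takes the per-token hidden states of a single prompt, selects the vector of the last token, and normalizes it. It is not vLLM's implementation: the function name `last_token_embedding`, the toy values, and the choice of L2 normalization are assumptions made for illustration only.

```python
import math


def last_token_embedding(hidden_states: list[list[float]]) -> list[float]:
    """Return the L2-normalized hidden state of the last token.

    `hidden_states` holds one vector per prompt token (hypothetical toy input).
    """
    if not hidden_states:
        raise ValueError("prompt must contain at least one token")
    last = hidden_states[-1]
    norm = math.sqrt(sum(x * x for x in last))
    if norm == 0.0:
        return list(last)  # avoid dividing by zero for an all-zero vector
    return [x / norm for x in last]


# Toy prompt with three tokens and a hidden size of 2.
toy_hidden_states = [[1.0, 0.0], [0.5, 0.5], [3.0, 4.0]]
embedding = last_token_embedding(toy_hidden_states)
print(embedding)
```

Only the final row contributes to the result, which is why earlier tokens can take any value here without changing the embedding.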
#### Reward Modeling (`--task reward`) #### Reward Modeling (`--task reward`)
```{eval-rst} :::{list-table}
.. list-table:: :widths: 25 25 50 5 5
:widths: 25 25 50 5 5 :header-rows: 1
:header-rows: 1
- * Architecture
* - Architecture * Models
- Models * Example HF Models
- Example HF Models * [LoRA](#lora-adapter)
- :ref:`LoRA <lora-adapter>` * [PP](#distributed-serving)
- :ref:`PP <distributed-serving>` - * `InternLM2ForRewardModel`
* - :code:`LlamaForCausalLM` * InternLM2-based
- Llama-based * `internlm/internlm2-1_8b-reward`, `internlm/internlm2-7b-reward`, etc.
- :code:`peiyi9979/math-shepherd-mistral-7b-prm`, etc. * ✅︎
- ✅︎ * ✅︎
- ✅︎ - * `LlamaForCausalLM`
* - :code:`Qwen2ForRewardModel` * Llama-based
- Qwen2-based * `peiyi9979/math-shepherd-mistral-7b-prm`, etc.
- :code:`Qwen/Qwen2.5-Math-RM-72B`, etc. * ✅︎
- ✅︎ * ✅︎
- ✅︎ - * `Qwen2ForRewardModel`
``` * Qwen2-based
* `Qwen/Qwen2.5-Math-RM-72B`, etc.
* ✅︎
* ✅︎
- * `Qwen2ForProcessRewardModel`
* Qwen2-based
* `Qwen/Qwen2.5-Math-PRM-7B`, `Qwen/Qwen2.5-Math-PRM-72B`, etc.
* ✅︎
* ✅︎
:::
If your model is not in the above list, we will try to automatically convert the model using If your model is not in the above list, we will try to automatically convert the model using
:func:`vllm.model_executor.models.adapters.as_reward_model`. By default, we return the hidden states of each token directly. {func}`~vllm.model_executor.models.adapters.as_reward_model`. By default, we return the hidden states of each token directly.
```{important} :::{important}
For process-supervised reward models such as {code}`peiyi9979/math-shepherd-mistral-7b-prm`, the pooling config should be set explicitly, For process-supervised reward models such as `peiyi9979/math-shepherd-mistral-7b-prm`, the pooling config should be set explicitly,
e.g.: {code}`--override-pooler-config '{"pooling_type": "STEP", "step_tag_id": 123, "returned_token_ids": [456, 789]}'`. e.g.: `--override-pooler-config '{"pooling_type": "STEP", "step_tag_id": 123, "returned_token_ids": [456, 789]}'`.
``` :::
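Hand-writing the JSON value for `--override-pooler-config` is prone to quoting mistakes. A minimal sketch, assuming the server is launched from a Python script: build the value with `json.dumps` and let `shlex.join` handle the shell quoting. The helper name `pooler_config_flag` is hypothetical, and the token IDs are the placeholder values from the example above, not real IDs for any model.

```python
import json
import shlex


def pooler_config_flag(step_tag_id: int, returned_token_ids: list[int]) -> list[str]:
    """Build the `--override-pooler-config` argument pair for a STEP pooler."""
    config = {
        "pooling_type": "STEP",
        "step_tag_id": step_tag_id,
        "returned_token_ids": returned_token_ids,
    }
    return ["--override-pooler-config", json.dumps(config)]


# Placeholder IDs copied from the prose example; look up the real ones for your model.
args = pooler_config_flag(step_tag_id=123, returned_token_ids=[456, 789])
command = [
    "vllm", "serve", "peiyi9979/math-shepherd-mistral-7b-prm",
    "--task", "reward",
    *args,
]

# shlex.join quotes the JSON so the line can be pasted into a POSIX shell.
print(shlex.join(command))
```

Keeping the config as a Python dictionary until the last moment means the same object can be validated or logged before it is serialized.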
#### Classification (`--task classify`) #### Classification (`--task classify`)
```{eval-rst} :::{list-table}
.. list-table:: :widths: 25 25 50 5 5
:widths: 25 25 50 5 5 :header-rows: 1
:header-rows: 1
- * Architecture
* - Architecture * Models
- Models * Example HF Models
- Example HF Models * [LoRA](#lora-adapter)
- :ref:`LoRA <lora-adapter>` * [PP](#distributed-serving)
- :ref:`PP <distributed-serving>` - * `JambaForSequenceClassification`
* - :code:`JambaForSequenceClassification` * Jamba
- Jamba * `ai21labs/Jamba-tiny-reward-dev`, etc.
- :code:`ai21labs/Jamba-tiny-reward-dev`, etc. * ✅︎
- ✅︎ * ✅︎
- ✅︎ - * `Qwen2ForSequenceClassification`
* - :code:`Qwen2ForSequenceClassification` * Qwen2-based
- Qwen2-based * `jason9693/Qwen2.5-1.5B-apeach`, etc.
- :code:`jason9693/Qwen2.5-1.5B-apeach`, etc. * ✅︎
- ✅︎ * ✅︎
- ✅︎ :::
```
If your model is not in the above list, we will try to automatically convert the model using If your model is not in the above list, we will try to automatically convert the model using
:func:`vllm.model_executor.models.adapters.as_classification_model`. By default, the class probabilities are extracted from the softmaxed hidden state corresponding to the last token. {func}`~vllm.model_executor.models.adapters.as_classification_model`. By default, the class probabilities are extracted from the softmaxed hidden state corresponding to the last token.
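To make the default concrete, here is a small sketch of what "softmaxed hidden state corresponding to the last token" amounts to for one prompt. It assumes the last token's vector already has one score per class; `last_token_class_probs` and the toy scores are hypothetical and not taken from vLLM's code.

```python
import math


def last_token_class_probs(scores_per_token: list[list[float]]) -> list[float]:
    """Apply a softmax to the last token's scores (one score per class)."""
    if not scores_per_token:
        raise ValueError("prompt must contain at least one token")
    last = scores_per_token[-1]
    shift = max(last)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in last]
    total = sum(exps)
    return [e / total for e in exps]


# Toy prompt with two tokens and two classes.
probs = last_token_class_probs([[0.1, 0.2], [2.0, 0.0]])
print(probs)
```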
#### Sentence Pair Scoring (`--task score`) #### Sentence Pair Scoring (`--task score`)
```{eval-rst} :::{list-table}
.. list-table:: :widths: 25 25 50 5 5
:widths: 25 25 50 5 5 :header-rows: 1
:header-rows: 1
- * Architecture
* - Architecture * Models
- Models * Example HF Models
- Example HF Models * [LoRA](#lora-adapter)
- :ref:`LoRA <lora-adapter>` * [PP](#distributed-serving)
- :ref:`PP <distributed-serving>` - * `BertForSequenceClassification`
* - :code:`BertForSequenceClassification` * BERT-based
- BERT-based * `cross-encoder/ms-marco-MiniLM-L-6-v2`, etc.
- :code:`cross-encoder/ms-marco-MiniLM-L-6-v2`, etc. *
- *
- - * `RobertaForSequenceClassification`
* - :code:`RobertaForSequenceClassification` * RoBERTa-based
- RoBERTa-based * `cross-encoder/quora-roberta-base`, etc.
- :code:`cross-encoder/quora-roberta-base`, etc. *
- *
- - * `XLMRobertaForSequenceClassification`
* - :code:`XLMRobertaForSequenceClassification` * XLM-RoBERTa-based
- XLM-RoBERTa-based * `BAAI/bge-reranker-v2-m3`, etc.
- :code:`BAAI/bge-reranker-v2-m3`, etc. *
- *
- :::
```
(supported-mm-models)= (supported-mm-models)=
...@@ -537,206 +550,21 @@ The following modalities are supported depending on the model: ...@@ -537,206 +550,21 @@ The following modalities are supported depending on the model:
- **V**ideo - **V**ideo
- **A**udio - **A**udio
Any combination of modalities joined by {code}`+` is supported. Any combination of modalities joined by `+` is supported.
- e.g.: {code}`T + I` means that the model supports text-only, image-only, and text-with-image inputs. - e.g.: `T + I` means that the model supports text-only, image-only, and text-with-image inputs.
On the other hand, modalities separated by {code}`/` are mutually exclusive. On the other hand, modalities separated by `/` are mutually exclusive.
- e.g.: {code}`T / I` means that the model supports text-only and image-only inputs, but not text-with-image inputs. - e.g.: `T / I` means that the model supports text-only and image-only inputs, but not text-with-image inputs.
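The two notations can be read as rules for which input combinations a model accepts. The sketch below expands a plain spec into the set of accepted combinations. The helper `allowed_inputs` is hypothetical (it is not part of vLLM), it ignores the superscripts used in the tables, and it rejects specs that mix `+` and `/`, since the text above does not define that case.

```python
from itertools import combinations


def allowed_inputs(spec: str) -> set[frozenset[str]]:
    """Expand a plain modality spec such as "T + I" or "T / I".

    Superscripts (E, +) are not handled; mixing `+` and `/` is rejected.
    """
    if "+" in spec and "/" in spec:
        raise ValueError("mixed '+' and '/' specs are not covered by this sketch")
    if "/" in spec:
        # Mutually exclusive: exactly one modality per input.
        return {frozenset([m.strip()]) for m in spec.split("/")}
    modalities = [m.strip() for m in spec.split("+")]
    # Joined by '+': every non-empty combination is accepted.
    return {
        frozenset(combo)
        for size in range(1, len(modalities) + 1)
        for combo in combinations(modalities, size)
    }


print(sorted(sorted(s) for s in allowed_inputs("T + I")))
print(sorted(sorted(s) for s in allowed_inputs("T / I")))
```

For `T + I` the result contains text-only, image-only, and text-with-image; for `T / I` the combined entry is absent, matching the two examples above.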
See [this page](#multimodal-inputs) on how to pass multi-modal inputs to the model. See [this page](#multimodal-inputs) on how to pass multi-modal inputs to the model.
### Generative Models :::{important}
To enable multiple multi-modal items per text prompt, you have to set `limit_mm_per_prompt` (offline inference)
or `--limit-mm-per-prompt` (online serving). For example, to enable passing up to 4 images per text prompt:
See [this page](#generative-models) for more information on how to use generative models. Offline inference:
#### Text Generation (`--task generate`)
```{eval-rst}
.. list-table::
:widths: 25 25 15 20 5 5 5
:header-rows: 1
* - Architecture
- Models
- Inputs
- Example HF Models
- :ref:`LoRA <lora-adapter>`
- :ref:`PP <distributed-serving>`
- V1
* - :code:`AriaForConditionalGeneration`
- Aria
- T + I
- :code:`rhymes-ai/Aria`
-
- ✅︎
-
* - :code:`Blip2ForConditionalGeneration`
- BLIP-2
- T + I\ :sup:`E`
- :code:`Salesforce/blip2-opt-2.7b`, :code:`Salesforce/blip2-opt-6.7b`, etc.
-
- ✅︎
-
* - :code:`ChameleonForConditionalGeneration`
- Chameleon
- T + I
- :code:`facebook/chameleon-7b` etc.
-
- ✅︎
-
* - :code:`FuyuForCausalLM`
- Fuyu
- T + I
- :code:`adept/fuyu-8b` etc.
-
- ✅︎
-
* - :code:`ChatGLMModel`
- GLM-4V
- T + I
- :code:`THUDM/glm-4v-9b` etc.
- ✅︎
- ✅︎
-
* - :code:`H2OVLChatModel`
- H2OVL
- T + I\ :sup:`E+`
- :code:`h2oai/h2ovl-mississippi-800m`, :code:`h2oai/h2ovl-mississippi-2b`, etc.
-
- ✅︎
-
* - :code:`Idefics3ForConditionalGeneration`
- Idefics3
- T + I
- :code:`HuggingFaceM4/Idefics3-8B-Llama3` etc.
- ✅︎
-
-
* - :code:`InternVLChatModel`
- InternVL 2.5, Mono-InternVL, InternVL 2.0
- T + I\ :sup:`E+`
- :code:`OpenGVLab/InternVL2_5-4B`, :code:`OpenGVLab/Mono-InternVL-2B`, :code:`OpenGVLab/InternVL2-4B`, etc.
-
- ✅︎
- ✅︎
* - :code:`LlavaForConditionalGeneration`
- LLaVA-1.5
- T + I\ :sup:`E+`
- :code:`llava-hf/llava-1.5-7b-hf`, :code:`TIGER-Lab/Mantis-8B-siglip-llama3` (see note), etc.
-
- ✅︎
- ✅︎
* - :code:`LlavaNextForConditionalGeneration`
- LLaVA-NeXT
- T + I\ :sup:`E+`
- :code:`llava-hf/llava-v1.6-mistral-7b-hf`, :code:`llava-hf/llava-v1.6-vicuna-7b-hf`, etc.
-
- ✅︎
-
* - :code:`LlavaNextVideoForConditionalGeneration`
- LLaVA-NeXT-Video
- T + V
- :code:`llava-hf/LLaVA-NeXT-Video-7B-hf`, etc.
-
- ✅︎
-
* - :code:`LlavaOnevisionForConditionalGeneration`
- LLaVA-Onevision
- T + I\ :sup:`+` + V\ :sup:`+`
- :code:`llava-hf/llava-onevision-qwen2-7b-ov-hf`, :code:`llava-hf/llava-onevision-qwen2-0.5b-ov-hf`, etc.
-
- ✅︎
-
* - :code:`MiniCPMV`
- MiniCPM-V
- T + I\ :sup:`E+`
- :code:`openbmb/MiniCPM-V-2` (see note), :code:`openbmb/MiniCPM-Llama3-V-2_5`, :code:`openbmb/MiniCPM-V-2_6`, etc.
- ✅︎
- ✅︎
-
* - :code:`MllamaForConditionalGeneration`
- Llama 3.2
- T + I\ :sup:`+`
- :code:`meta-llama/Llama-3.2-90B-Vision-Instruct`, :code:`meta-llama/Llama-3.2-11B-Vision`, etc.
-
-
-
* - :code:`MolmoForCausalLM`
- Molmo
- T + I
- :code:`allenai/Molmo-7B-D-0924`, :code:`allenai/Molmo-72B-0924`, etc.
-
- ✅︎
- ✅︎
* - :code:`NVLM_D_Model`
- NVLM-D 1.0
- T + I\ :sup:`E+`
- :code:`nvidia/NVLM-D-72B`, etc.
-
- ✅︎
- ✅︎
* - :code:`PaliGemmaForConditionalGeneration`
- PaliGemma, PaliGemma 2
- T + I\ :sup:`E`
- :code:`google/paligemma-3b-pt-224`, :code:`google/paligemma-3b-mix-224`, :code:`google/paligemma2-3b-ft-docci-448`, etc.
-
- ✅︎
-
* - :code:`Phi3VForCausalLM`
- Phi-3-Vision, Phi-3.5-Vision
- T + I\ :sup:`E+`
- :code:`microsoft/Phi-3-vision-128k-instruct`, :code:`microsoft/Phi-3.5-vision-instruct` etc.
-
- ✅︎
- ✅︎
* - :code:`PixtralForConditionalGeneration`
- Pixtral
- T + I\ :sup:`+`
- :code:`mistralai/Pixtral-12B-2409`, :code:`mistral-community/pixtral-12b` etc.
-
- ✅︎
- ✅︎
* - :code:`QWenLMHeadModel`
- Qwen-VL
- T + I\ :sup:`E+`
- :code:`Qwen/Qwen-VL`, :code:`Qwen/Qwen-VL-Chat`, etc.
- ✅︎
- ✅︎
-
* - :code:`Qwen2AudioForConditionalGeneration`
- Qwen2-Audio
- T + A\ :sup:`+`
- :code:`Qwen/Qwen2-Audio-7B-Instruct`
-
- ✅︎
-
* - :code:`Qwen2VLForConditionalGeneration`
- Qwen2-VL
- T + I\ :sup:`E+` + V\ :sup:`E+`
- :code:`Qwen/QVQ-72B-Preview`, :code:`Qwen/Qwen2-VL-7B-Instruct`, :code:`Qwen/Qwen2-VL-72B-Instruct`, etc.
- ✅︎
- ✅︎
-
* - :code:`UltravoxModel`
- Ultravox
- T + A\ :sup:`E+`
- :code:`fixie-ai/ultravox-v0_3`
-
- ✅︎
-
```
```{eval-rst}
:sup:`E` Pre-computed embeddings can be inputted for this modality.
:sup:`+` Multiple items can be inputted per text prompt for this modality.
```
````{important}
To enable multiple multi-modal items per text prompt, you have to set {code}`limit_mm_per_prompt` (offline inference)
or {code}`--limit-mm-per-prompt` (online inference). For example, to enable passing up to 4 images per text prompt:
```python ```python
llm = LLM( llm = LLM(
...@@ -745,90 +573,300 @@ llm = LLM( ...@@ -745,90 +573,300 @@ llm = LLM(
) )
``` ```
Online serving:
```bash ```bash
vllm serve Qwen/Qwen2-VL-7B-Instruct --limit-mm-per-prompt image=4 vllm serve Qwen/Qwen2-VL-7B-Instruct --limit-mm-per-prompt image=4
``` ```
````
```{note} :::
:::{note}
vLLM currently only supports adding LoRA to the language backbone of multimodal models. vLLM currently only supports adding LoRA to the language backbone of multimodal models.
``` :::
```{note} ### Generative Models
To use {code}`TIGER-Lab/Mantis-8B-siglip-llama3`, you have to install their GitHub repo ({code}`pip install git+https://github.com/TIGER-AI-Lab/Mantis.git`)
and pass {code}`--hf_overrides '{"architectures": ["MantisForConditionalGeneration"]}'` when running vLLM. See [this page](#generative-models) for more information on how to use generative models.
```
#### Text Generation (`--task generate`)
```{note} :::{list-table}
The official {code}`openbmb/MiniCPM-V-2` doesn't work yet, so we need to use a fork ({code}`HwwwH/MiniCPM-V-2`) for now. :widths: 25 25 15 20 5 5 5
:header-rows: 1
- * Architecture
* Models
* Inputs
* Example HF Models
* [LoRA](#lora-adapter)
* [PP](#distributed-serving)
* [V1](gh-issue:8779)
- * `AriaForConditionalGeneration`
* Aria
* T + I<sup>+</sup>
* `rhymes-ai/Aria`
*
* ✅︎
* ✅︎
- * `Blip2ForConditionalGeneration`
* BLIP-2
* T + I<sup>E</sup>
* `Salesforce/blip2-opt-2.7b`, `Salesforce/blip2-opt-6.7b`, etc.
*
* ✅︎
* ✅︎
- * `ChameleonForConditionalGeneration`
* Chameleon
* T + I
* `facebook/chameleon-7b` etc.
*
* ✅︎
* ✅︎
- * `DeepseekVLV2ForCausalLM`
* DeepSeek-VL2
* T + I<sup>+</sup>
* `deepseek-ai/deepseek-vl2-tiny`, `deepseek-ai/deepseek-vl2-small`, `deepseek-ai/deepseek-vl2` etc. (see note)
*
* ✅︎
* ✅︎
- * `FuyuForCausalLM`
* Fuyu
* T + I
* `adept/fuyu-8b` etc.
*
* ✅︎
* ✅︎
- * `ChatGLMModel`
* GLM-4V
* T + I
* `THUDM/glm-4v-9b` etc.
* ✅︎
* ✅︎
*
- * `H2OVLChatModel`
* H2OVL
* T + I<sup>E+</sup>
* `h2oai/h2ovl-mississippi-800m`, `h2oai/h2ovl-mississippi-2b`, etc.
*
* ✅︎
*
- * `Idefics3ForConditionalGeneration`
* Idefics3
* T + I
* `HuggingFaceM4/Idefics3-8B-Llama3` etc.
* ✅︎
*
*
- * `InternVLChatModel`
* InternVL 2.5, Mono-InternVL, InternVL 2.0
* T + I<sup>E+</sup>
* `OpenGVLab/InternVL2_5-4B`, `OpenGVLab/Mono-InternVL-2B`, `OpenGVLab/InternVL2-4B`, etc.
*
* ✅︎
* ✅︎
- * `LlavaForConditionalGeneration`
* LLaVA-1.5
* T + I<sup>E+</sup>
* `llava-hf/llava-1.5-7b-hf`, `TIGER-Lab/Mantis-8B-siglip-llama3` (see note), etc.
*
* ✅︎
* ✅︎
- * `LlavaNextForConditionalGeneration`
* LLaVA-NeXT
* T + I<sup>E+</sup>
* `llava-hf/llava-v1.6-mistral-7b-hf`, `llava-hf/llava-v1.6-vicuna-7b-hf`, etc.
*
* ✅︎
* ✅︎
- * `LlavaNextVideoForConditionalGeneration`
* LLaVA-NeXT-Video
* T + V
* `llava-hf/LLaVA-NeXT-Video-7B-hf`, etc.
*
* ✅︎
* ✅︎
- * `LlavaOnevisionForConditionalGeneration`
* LLaVA-Onevision
* T + I<sup>+</sup> + V<sup>+</sup>
* `llava-hf/llava-onevision-qwen2-7b-ov-hf`, `llava-hf/llava-onevision-qwen2-0.5b-ov-hf`, etc.
*
* ✅︎
* ✅︎
- * `MiniCPMO`
* MiniCPM-O
* T + I<sup>E+</sup> + V<sup>E+</sup> + A<sup>E+</sup>
* `openbmb/MiniCPM-o-2_6`, etc.
* ✅︎
* ✅︎
*
- * `MiniCPMV`
* MiniCPM-V
* T + I<sup>E+</sup> + V<sup>E+</sup>
* `openbmb/MiniCPM-V-2` (see note), `openbmb/MiniCPM-Llama3-V-2_5`, `openbmb/MiniCPM-V-2_6`, etc.
* ✅︎
* ✅︎
*
- * `MllamaForConditionalGeneration`
* Llama 3.2
* T + I<sup>+</sup>
* `meta-llama/Llama-3.2-90B-Vision-Instruct`, `meta-llama/Llama-3.2-11B-Vision`, etc.
*
*
*
- * `MolmoForCausalLM`
* Molmo
* T + I
* `allenai/Molmo-7B-D-0924`, `allenai/Molmo-72B-0924`, etc.
* ✅︎
* ✅︎
* ✅︎
- * `NVLM_D_Model`
* NVLM-D 1.0
* T + I<sup>E+</sup>
* `nvidia/NVLM-D-72B`, etc.
*
* ✅︎
* ✅︎
- * `PaliGemmaForConditionalGeneration`
* PaliGemma, PaliGemma 2
* T + I<sup>E</sup>
* `google/paligemma-3b-pt-224`, `google/paligemma-3b-mix-224`, `google/paligemma2-3b-ft-docci-448`, etc.
*
* ✅︎
*
- * `Phi3VForCausalLM`
* Phi-3-Vision, Phi-3.5-Vision
* T + I<sup>E+</sup>
* `microsoft/Phi-3-vision-128k-instruct`, `microsoft/Phi-3.5-vision-instruct`, etc.
*
* ✅︎
* ✅︎
- * `PixtralForConditionalGeneration`
* Pixtral
* T + I<sup>+</sup>
* `mistralai/Pixtral-12B-2409`, `mistral-community/pixtral-12b` (see note), etc.
*
* ✅︎
* ✅︎
- * `QWenLMHeadModel`
* Qwen-VL
* T + I<sup>E+</sup>
* `Qwen/Qwen-VL`, `Qwen/Qwen-VL-Chat`, etc.
* ✅︎
* ✅︎
* ✅︎
- * `Qwen2AudioForConditionalGeneration`
* Qwen2-Audio
* T + A<sup>+</sup>
* `Qwen/Qwen2-Audio-7B-Instruct`
*
* ✅︎
* ✅︎
- * `Qwen2VLForConditionalGeneration`
* QVQ, Qwen2-VL
* T + I<sup>E+</sup> + V<sup>E+</sup>
* `Qwen/QVQ-72B-Preview`, `Qwen/Qwen2-VL-7B-Instruct`, `Qwen/Qwen2-VL-72B-Instruct`, etc.
* ✅︎
* ✅︎
* ✅︎
- * `UltravoxModel`
* Ultravox
* T + A<sup>E+</sup>
* `fixie-ai/ultravox-v0_3`
*
* ✅︎
* ✅︎
:::
<sup>E</sup> Pre-computed embeddings can be inputted for this modality.
<sup>+</sup> Multiple items can be inputted per text prompt for this modality.
:::{note}
To use `DeepSeek-VL2` series models, you have to pass `--hf_overrides '{"architectures": ["DeepseekVLV2ForCausalLM"]}'` when running vLLM.
:::
:::{note}
To use `TIGER-Lab/Mantis-8B-siglip-llama3`, you have to pass `--hf_overrides '{"architectures": ["MantisForConditionalGeneration"]}'` when running vLLM.
:::
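The `--hf_overrides` value in the two notes above is a JSON object passed as a single shell argument, so quoting matters. As a minimal sketch, the flag can be generated from Python with the standard library; the helper name `hf_overrides_flag` is hypothetical and not part of vLLM.

```python
import json
import shlex


def hf_overrides_flag(architecture: str) -> str:
    """Build a shell-safe `--hf_overrides` flag (hypothetical helper)."""
    overrides = {"architectures": [architecture]}
    # shlex.quote wraps the JSON in single quotes so the shell passes it as one argument.
    return "--hf_overrides " + shlex.quote(json.dumps(overrides))


print(hf_overrides_flag("DeepseekVLV2ForCausalLM"))
# --hf_overrides '{"architectures": ["DeepseekVLV2ForCausalLM"]}'
```

Generating the JSON with `json.dumps` avoids hand-written quoting mistakes when the override is assembled in a launch script.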
:::{note}
The official `openbmb/MiniCPM-V-2` doesn't work yet, so we need to use a fork (`HwwwH/MiniCPM-V-2`) for now.
For more details, please see: <gh-pr:4087#issuecomment-2250397630> For more details, please see: <gh-pr:4087#issuecomment-2250397630>
``` :::
:::{note}
The chat template for Pixtral-HF is incorrect (see [discussion](https://huggingface.co/mistral-community/pixtral-12b/discussions/22)).
A corrected version is available at <gh-file:examples/template_pixtral_hf.jinja>.
:::
### Pooling Models ### Pooling Models
See [this page](pooling-models) for more information on how to use pooling models. See [this page](pooling-models) for more information on how to use pooling models.
```{important} :::{important}
Since some model architectures support both generative and pooling tasks, Since some model architectures support both generative and pooling tasks,
you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode. you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode.
``` :::
#### Text Embedding (`--task embed`) #### Text Embedding (`--task embed`)
Any text generation model can be converted into an embedding model by passing {code}`--task embed`. Any text generation model can be converted into an embedding model by passing `--task embed`.
```{note} :::{note}
To get the best results, you should use pooling models that are specifically trained as such. To get the best results, you should use pooling models that are specifically trained as such.
``` :::
The following table lists those that are tested in vLLM. The following table lists those that are tested in vLLM.
```{eval-rst} :::{list-table}
.. list-table:: :widths: 25 25 15 25 5 5
:widths: 25 25 15 25 5 5 :header-rows: 1
:header-rows: 1
- * Architecture
* - Architecture * Models
- Models * Inputs
- Inputs * Example HF Models
- Example HF Models * [LoRA](#lora-adapter)
- :ref:`LoRA <lora-adapter>` * [PP](#distributed-serving)
- :ref:`PP <distributed-serving>` - * `LlavaNextForConditionalGeneration`
* - :code:`LlavaNextForConditionalGeneration` * LLaVA-NeXT-based
- LLaVA-NeXT-based * T / I
- T / I * `royokong/e5-v`
- :code:`royokong/e5-v` *
- * ✅︎
- ✅︎ - * `Phi3VForCausalLM`
* - :code:`Phi3VForCausalLM` * Phi-3-Vision-based
- Phi-3-Vision-based * T + I
- T + I * `TIGER-Lab/VLM2Vec-Full`
- :code:`TIGER-Lab/VLM2Vec-Full` * 🚧
- 🚧 * ✅︎
- ✅︎ - * `Qwen2VLForConditionalGeneration`
* - :code:`Qwen2VLForConditionalGeneration` * Qwen2-VL-based
- Qwen2-VL-based * T + I
- T + I * `MrLight/dse-qwen2-2b-mrl-v1`
- :code:`MrLight/dse-qwen2-2b-mrl-v1` *
- * ✅︎
- ✅︎ :::
```
_________________
______________________________________________________________________
## Model Support Policy
# Model Support Policy
At vLLM, we are committed to facilitating the integration and support of third-party models within our ecosystem. Our approach is designed to balance the need for robustness and the practical limitations of supporting a wide range of models. Here’s how we manage third-party model support: At vLLM, we are committed to facilitating the integration and support of third-party models within our ecosystem. Our approach is designed to balance the need for robustness and the practical limitations of supporting a wide range of models. Here’s how we manage third-party model support:
1. **Community-Driven Support**: We encourage community contributions for adding new models. When a user requests support for a new model, we welcome pull requests (PRs) from the community. These contributions are evaluated primarily on the sensibility of the output they generate, rather than strict consistency with existing implementations such as those in transformers. **Call for contribution:** PRs coming directly from model vendors are greatly appreciated! 1. **Community-Driven Support**: We encourage community contributions for adding new models. When a user requests support for a new model, we welcome pull requests (PRs) from the community. These contributions are evaluated primarily on the sensibility of the output they generate, rather than strict consistency with existing implementations such as those in transformers. **Call for contribution:** PRs coming directly from model vendors are greatly appreciated!
2. **Best-Effort Consistency**: While we aim to maintain a level of consistency between the models implemented in vLLM and other frameworks like transformers, complete alignment is not always feasible. Factors like acceleration techniques and the use of low-precision computations can introduce discrepancies. Our commitment is to ensure that the implemented models are functional and produce sensible results. 2. **Best-Effort Consistency**: While we aim to maintain a level of consistency between the models implemented in vLLM and other frameworks like transformers, complete alignment is not always feasible. Factors like acceleration techniques and the use of low-precision computations can introduce discrepancies. Our commitment is to ensure that the implemented models are functional and produce sensible results.
```{tip} :::{tip}
When comparing the output of {code}`model.generate` from HuggingFace Transformers with the output of {code}`llm.generate` from vLLM, note that the former reads the model's generation config file (i.e., [generation_config.json](https://github.com/huggingface/transformers/blob/19dabe96362803fb0a9ae7073d03533966598b17/src/transformers/generation/utils.py#L1945)) and applies the default parameters for generation, while the latter only uses the parameters passed to the function. Ensure all sampling parameters are identical when comparing outputs. When comparing the output of `model.generate` from HuggingFace Transformers with the output of `llm.generate` from vLLM, note that the former reads the model's generation config file (i.e., [generation_config.json](https://github.com/huggingface/transformers/blob/19dabe96362803fb0a9ae7073d03533966598b17/src/transformers/generation/utils.py#L1945)) and applies the default parameters for generation, while the latter only uses the parameters passed to the function. Ensure all sampling parameters are identical when comparing outputs.
``` :::
3. **Issue Resolution and Model Updates**: Users are encouraged to report any bugs or issues they encounter with third-party models. Proposed fixes should be submitted via PRs, with a clear explanation of the problem and the rationale behind the proposed solution. If a fix for one model impacts another, we rely on the community to highlight and address these cross-model dependencies. Note: for bugfix PRs, it is good etiquette to inform the original author to seek their feedback. 3. **Issue Resolution and Model Updates**: Users are encouraged to report any bugs or issues they encounter with third-party models. Proposed fixes should be submitted via PRs, with a clear explanation of the problem and the rationale behind the proposed solution. If a fix for one model impacts another, we rely on the community to highlight and address these cross-model dependencies. Note: for bugfix PRs, it is good etiquette to inform the original author to seek their feedback.
4. **Monitoring and Updates**: Users interested in specific models should monitor the commit history for those models (e.g., by tracking changes in the main/vllm/model_executor/models directory). This proactive approach helps users stay informed about updates and changes that may affect the models they use. 4. **Monitoring and Updates**: Users interested in specific models should monitor the commit history for those models (e.g., by tracking changes in the main/vllm/model_executor/models directory). This proactive approach helps users stay informed about updates and changes that may affect the models they use.
5. **Selective Focus**: Our resources are primarily directed towards models with significant user interest and impact. Models that are less frequently used may receive less attention, and we rely on the community to play a more active role in their upkeep and improvement. 5. **Selective Focus**: Our resources are primarily directed towards models with significant user interest and impact. Models that are less frequently used may receive less attention, and we rely on the community to play a more active role in their upkeep and improvement.
Through this approach, vLLM fosters a collaborative environment where both the core development team and the broader community contribute to the robustness and diversity of the third-party models supported in our ecosystem. Through this approach, vLLM fosters a collaborative environment where both the core development team and the broader community contribute to the robustness and diversity of the third-party models supported in our ecosystem.
......
(performance)= (optimization-and-tuning)=
# Performance and Tuning # Optimization and Tuning
## Preemption ## Preemption
...@@ -8,7 +8,7 @@ Due to the auto-regressive nature of transformer architecture, there are times w ...@@ -8,7 +8,7 @@ Due to the auto-regressive nature of transformer architecture, there are times w
The vLLM can preempt requests to free up KV cache space for other requests. Preempted requests are recomputed when sufficient KV cache space becomes The vLLM can preempt requests to free up KV cache space for other requests. Preempted requests are recomputed when sufficient KV cache space becomes
available again. When this occurs, the following warning is printed: available again. When this occurs, the following warning is printed:
``` ```text
WARNING 05-09 00:49:33 scheduler.py:1057 Sequence group 0 is preempted by PreemptionMode.SWAP mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_cumulative_preemption_cnt=1 WARNING 05-09 00:49:33 scheduler.py:1057 Sequence group 0 is preempted by PreemptionMode.SWAP mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_cumulative_preemption_cnt=1
``` ```
...@@ -32,8 +32,8 @@ You can enable the feature by specifying `--enable-chunked-prefill` in the comma ...@@ -32,8 +32,8 @@ You can enable the feature by specifying `--enable-chunked-prefill` in the comma
```python ```python
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_chunked_prefill=True) llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_chunked_prefill=True)
# Set max_num_batched_tokens to tune performance. # Set max_num_batched_tokens to tune performance.
# NOTE: 512 is the default max_num_batched_tokens for chunked prefill. # NOTE: 2048 is the default max_num_batched_tokens for chunked prefill.
# llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_chunked_prefill=True, max_num_batched_tokens=512) # llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_chunked_prefill=True, max_num_batched_tokens=2048)
``` ```
By default, vLLM scheduler prioritizes prefills and doesn't batch prefill and decode to the same batch. By default, vLLM scheduler prioritizes prefills and doesn't batch prefill and decode to the same batch.
...@@ -49,13 +49,12 @@ This policy has two benefits: ...@@ -49,13 +49,12 @@ This policy has two benefits:
- It improves ITL and generation decode because decode requests are prioritized. - It improves ITL and generation decode because decode requests are prioritized.
- It helps achieve better GPU utilization by locating compute-bound (prefill) and memory-bound (decode) requests to the same batch. - It helps achieve better GPU utilization by locating compute-bound (prefill) and memory-bound (decode) requests to the same batch.
You can tune the performance by changing `max_num_batched_tokens`. You can tune the performance by changing `max_num_batched_tokens`. By default, it is set to 2048.
By default, it is set to 512, which has the best ITL on A100 in the initial benchmark (llama 70B and mixtral 8x22B).
Smaller `max_num_batched_tokens` achieves better ITL because there are fewer prefills interrupting decodes. Smaller `max_num_batched_tokens` achieves better ITL because there are fewer prefills interrupting decodes.
Higher `max_num_batched_tokens` achieves better TTFT as you can put more prefill to the batch. Higher `max_num_batched_tokens` achieves better TTFT as you can put more prefill to the batch.
- If `max_num_batched_tokens` is the same as `max_model_len`, that's almost the equivalent to the default scheduling policy (except that it still prioritizes decodes). - If `max_num_batched_tokens` is the same as `max_model_len`, that's almost the equivalent to the default scheduling policy (except that it still prioritizes decodes).
- Note that the default value (512) of `max_num_batched_tokens` is optimized for ITL, and it may have lower throughput than the default scheduler. - Note that the default value (2048) of `max_num_batched_tokens` is optimized for ITL, and it may have lower throughput than the default scheduler.
We recommend you set `max_num_batched_tokens > 2048` for throughput. We recommend you set `max_num_batched_tokens > 2048` for throughput.
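The trade-off behind `max_num_batched_tokens` can be illustrated with a toy model of one scheduling step. This is a sketch under stated assumptions, not vLLM's scheduler: it assumes each decode request costs one token, decodes are placed first, and pending prefills are chunked to fit the remaining budget. The function name `plan_batch` is hypothetical.

```python
def plan_batch(num_decodes, pending_prefills, max_num_batched_tokens):
    """Toy model: decodes first (one token each), then prefill chunks fill the rest."""
    budget = max_num_batched_tokens
    decode_tokens = min(num_decodes, budget)
    budget -= decode_tokens

    prefill_chunks = []
    for prompt_len in pending_prefills:
        if budget == 0:
            break
        # A prompt that does not fit is chunked to the remaining budget.
        chunk = min(prompt_len, budget)
        prefill_chunks.append(chunk)
        budget -= chunk
    return decode_tokens, prefill_chunks


small = plan_batch(100, [3000], 512)    # (100, [412])
large = plan_batch(100, [3000], 4096)   # (100, [3000])
print(small, large)
```

With the smaller budget, only 412 prompt tokens share the step with the decodes, which keeps each step short (better ITL). With the larger budget, the whole 3000-token prompt is prefilled in one step (better TTFT).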
......
(fp8-e4m3-kvcache)=
# FP8 E4M3 KV Cache
Quantizing the KV cache to FP8 reduces its memory footprint. This increases the number of tokens that can be stored in the cache,
improving throughput. The Open Compute Project (OCP, www.opencompute.org) specifies two common 8-bit floating-point data formats: E5M2
(5 exponent bits and 2 mantissa bits) and E4M3FN (4 exponent bits and 3 mantissa bits), often shortened to E4M3. One benefit of
the E4M3 format over E5M2 is that its extra mantissa bit represents numbers with higher precision. However, the small dynamic range of
FP8 E4M3 (at most ±448.0 for E4M3FN, or ±240.0 for the E4M3FNUZ variant) usually necessitates a higher-precision (typically FP32) scaling factor alongside
each quantized tensor. For now, only per-tensor (scalar) scaling factors are supported. Development is ongoing to support scaling
factors of a finer granularity (e.g. per-channel).
These scaling factors can be specified by passing an optional quantization param JSON to the LLM engine at load time. If
this JSON is not specified, scaling factors default to 1.0. These scaling factors are typically obtained when running an
unquantized model through a quantizer tool (e.g. AMD quantizer or NVIDIA AMMO).
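To make the role of a per-tensor scaling factor concrete, here is a minimal pure-Python sketch. It assumes a simple max-abs calibration and uses the ±240.0 bound quoted above as the default `fp8_max`; both function names are hypothetical, and rounding to the actual FP8 grid is omitted, so this is not how a quantizer tool or vLLM computes its scales.

```python
def per_tensor_scale(values, fp8_max=240.0):
    """Pick one scalar scale so the largest magnitude maps onto fp8_max."""
    amax = max(abs(v) for v in values)
    return amax / fp8_max if amax > 0 else 1.0


def scale_and_clamp(values, scale, fp8_max=240.0):
    """Divide by the scale and saturate to the representable range."""
    return [max(-fp8_max, min(fp8_max, v / scale)) for v in values]


kv = [0.5, -1200.0, 300.0]
scale = per_tensor_scale(kv)                # 5.0
print(scale_and_clamp(kv, scale))           # [0.1, -240.0, 60.0]
print(scale_and_clamp(kv, 1.0))             # [0.5, -240.0, 240.0]
```

The last line shows why the default scale of 1.0 can hurt: values outside the representable range saturate, whereas a calibrated scale keeps their relative magnitudes.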
To install AMMO (AlgorithMic Model Optimization):
```console
$ pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-ammo
```
Studies have shown that FP8 E4M3 quantization typically degrades inference accuracy only minimally. Recent silicon
offerings (e.g., AMD MI300, NVIDIA Hopper or later) support native hardware conversion between FP8 and fp32, fp16, bf16, etc.
Thus, LLM inference is greatly accelerated with minimal accuracy loss.
Here is an example of how to enable this feature:
```python
# two float8_e4m3fn kv cache scaling factor files are provided under tests/fp8_kv, please refer to
# https://github.com/vllm-project/vllm/blob/main/examples/fp8/README.md to generate kv_cache_scales.json of your own.
from vllm import LLM, SamplingParams
sampling_params = SamplingParams(temperature=1.3, top_p=0.8)
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf",
          kv_cache_dtype="fp8",
          quantization_param_path="./tests/fp8_kv/llama2-7b-fp8-kv/kv_cache_scales.json")
prompt = "London is the capital of"
out = llm.generate(prompt, sampling_params)[0].outputs[0].text
print(out)
# output w/ scaling factors: England, the United Kingdom, and one of the world's leading financial,
# output w/o scaling factors: England, located in the southeastern part of the country. It is known
```
(fp8-kv-cache)=
# FP8 E5M2 KV Cache
The int8/int4 quantization schemes require additional GPU memory to store scaling factors, which reduces the expected memory savings.
The FP8 data format retains 2 to 3 mantissa bits, and values can be converted between float/fp16/bfloat16 and fp8 in either direction.
Here is an example of how to enable this feature:
```python
from vllm import LLM, SamplingParams
# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM.
llm = LLM(model="facebook/opt-125m", kv_cache_dtype="fp8")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
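To put the memory saving in rough numbers, the sketch below estimates KV cache bytes per token for a 16-bit cache versus an FP8 cache. The model dimensions (32 layers, 32 KV heads, head size 128) are illustrative assumptions and not taken from this page, and the function name is hypothetical.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element):
    """Bytes of KV cache needed for one token."""
    # Factor 2: one key vector and one value vector per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


fp16 = kv_cache_bytes_per_token(32, 32, 128, bytes_per_element=2)
fp8 = kv_cache_bytes_per_token(32, 32, 128, bytes_per_element=1)
print(fp16, fp8, fp16 // fp8)  # 524288 262144 2
```

Under these assumptions, the same amount of GPU memory holds about twice as many cached tokens with an FP8 KV cache.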
(deploying-with-helm)=
# Deploying with Helm
A Helm chart to deploy vLLM for Kubernetes
Helm is a package manager for Kubernetes. It helps you deploy vLLM on k8s and automate the deployment of vLLM Kubernetes applications. With Helm, you can deploy the same framework architecture with different configurations to multiple namespaces by overriding variable values.
This guide will walk you through the process of deploying vLLM with Helm, including the necessary prerequisites, the steps for `helm install`, and documentation on the architecture and the values file.
## Prerequisites
Before you begin, ensure that you have the following:
- A running Kubernetes cluster
- NVIDIA Kubernetes Device Plugin (`k8s-device-plugin`): This can be found at [https://github.com/NVIDIA/k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin)
- Available GPU resources in your cluster
- S3 storage containing the model to be deployed
## Installing the Chart
To install the chart with the release name `test-vllm`:
```console
helm upgrade --install --create-namespace --namespace=ns-vllm test-vllm . -f values.yaml --set secrets.s3endpoint=$ACCESS_POINT --set secrets.s3bucketname=$BUCKET --set secrets.s3accesskeyid=$ACCESS_KEY --set secrets.s3accesskey=$SECRET_KEY
```
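When the install is scripted, the same command can be assembled as an argument list so that secret values are never re-parsed by a shell. This is a minimal sketch: the helper name `helm_install_command` and the example endpoint and bucket values are hypothetical placeholders, while the flags mirror the command above.

```python
import shlex


def helm_install_command(release, namespace, secrets, chart=".", values_file="values.yaml"):
    """Assemble the `helm upgrade --install` call as an argument list (hypothetical helper)."""
    cmd = ["helm", "upgrade", "--install", "--create-namespace",
           f"--namespace={namespace}", release, chart, "-f", values_file]
    for key, value in secrets.items():
        cmd += ["--set", f"secrets.{key}={value}"]
    return cmd


cmd = helm_install_command(
    "test-vllm",
    "ns-vllm",
    {"s3endpoint": "https://s3.example.com", "s3bucketname": "my-bucket"},  # placeholder values
)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` on a machine where `helm` is installed.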
## Uninstalling the Chart
To uninstall the `test-vllm` deployment:
```console
helm uninstall test-vllm --namespace=ns-vllm
```
The command removes all the Kubernetes components associated with the
chart **including persistent volumes** and deletes the release.
## Architecture
```{image} architecture_helm_deployment.png
```
## Values
```{eval-rst}
.. list-table:: Values
:widths: 25 25 25 25
:header-rows: 1
* - Key
- Type
- Default
- Description
* - autoscaling
- object
- {"enabled":false,"maxReplicas":100,"minReplicas":1,"targetCPUUtilizationPercentage":80}
- Autoscaling configuration
* - autoscaling.enabled
- bool
- false
- Enable autoscaling
* - autoscaling.maxReplicas
- int
- 100
- Maximum replicas
* - autoscaling.minReplicas
- int
- 1
- Minimum replicas
* - autoscaling.targetCPUUtilizationPercentage
- int
- 80
- Target CPU utilization for autoscaling
* - configs
- object
- {}
- Configmap
* - containerPort
- int
- 8000
- Container port
* - customObjects
- list
- []
- Custom Objects configuration
* - deploymentStrategy
- object
- {}
- Deployment strategy configuration
* - externalConfigs
- list
- []
- External configuration
* - extraContainers
- list
- []
- Additional containers configuration
* - extraInit
- object
- {"pvcStorage":"1Gi","s3modelpath":"relative_s3_model_path/opt-125m", "awsEc2MetadataDisabled": true}
- Additional configuration for the init container
* - extraInit.pvcStorage
- string
- "50Gi"
- Storage size of the s3
* - extraInit.s3modelpath
- string
- "relative_s3_model_path/opt-125m"
- Path of the model on the s3 which hosts model weights and config files
* - extraInit.awsEc2MetadataDisabled
- boolean
- true
- Disables the use of the Amazon EC2 instance metadata service
* - extraPorts
- list
- []
- Additional ports configuration
* - gpuModels
- list
- ["TYPE_GPU_USED"]
- Type of gpu used
* - image
- object
- {"command":["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"],"repository":"vllm/vllm-openai","tag":"latest"}
- Image configuration
* - image.command
- list
- ["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"]
- Container launch command
* - image.repository
- string
- "vllm/vllm-openai"
- Image repository
* - image.tag
- string
- "latest"
- Image tag
* - livenessProbe
- object
- {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":15,"periodSeconds":10}
- Liveness probe configuration
* - livenessProbe.failureThreshold
- int
- 3
- Number of consecutive probe failures after which Kubernetes considers the overall check to have failed: the container is not alive
* - livenessProbe.httpGet
- object
- {"path":"/health","port":8000}
- Configuration of the Kubelet http request on the server
* - livenessProbe.httpGet.path
- string
- "/health"
- Path to access on the HTTP server
* - livenessProbe.httpGet.port
- int
- 8000
- Name or number of the port to access on the container, on which the server is listening
* - livenessProbe.initialDelaySeconds
- int
- 15
- Number of seconds after the container has started before liveness probe is initiated
* - livenessProbe.periodSeconds
- int
- 10
- How often (in seconds) to perform the liveness probe
* - maxUnavailablePodDisruptionBudget
- string
- ""
- Disruption Budget Configuration
* - readinessProbe
- object
- {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":5,"periodSeconds":5}
- Readiness probe configuration
* - readinessProbe.failureThreshold
- int
- 3
- Number of consecutive probe failures after which Kubernetes considers the overall check to have failed: the container is not ready
* - readinessProbe.httpGet
- object
- {"path":"/health","port":8000}
- Configuration of the Kubelet http request on the server
* - readinessProbe.httpGet.path
- string
- "/health"
- Path to access on the HTTP server
* - readinessProbe.httpGet.port
- int
- 8000
- Name or number of the port to access on the container, on which the server is listening
* - readinessProbe.initialDelaySeconds
- int
- 5
- Number of seconds after the container has started before readiness probe is initiated
* - readinessProbe.periodSeconds
- int
- 5
- How often (in seconds) to perform the readiness probe
* - replicaCount
- int
- 1
- Number of replicas
* - resources
- object
- {"limits":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1},"requests":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1}}
- Resource configuration
* - resources.limits."nvidia.com/gpu"
- int
- 1
- Number of gpus used
* - resources.limits.cpu
- int
- 4
- Number of CPUs
* - resources.limits.memory
- string
- "16Gi"
- CPU memory configuration
* - resources.requests."nvidia.com/gpu"
- int
- 1
- Number of gpus used
* - resources.requests.cpu
- int
- 4
- Number of CPUs
* - resources.requests.memory
- string
- "16Gi"
- CPU memory configuration
* - secrets
- object
- {}
- Secrets configuration
* - serviceName
- string
-
- Service name
* - servicePort
- int
- 80
- Service port
* - labels.environment
- string
- test
- Environment name
* - labels.release
- string
- test
- Release name
```
(deploying-with-k8s)=
# Deploying with Kubernetes
Using Kubernetes to deploy vLLM is a scalable and efficient way to serve machine learning models. This guide will walk you through the process of deploying vLLM with Kubernetes, including the necessary prerequisites, steps for deployment, and testing.
## Prerequisites
Before you begin, ensure that you have the following:
- A running Kubernetes cluster
- NVIDIA Kubernetes Device Plugin (`k8s-device-plugin`): This can be found at `https://github.com/NVIDIA/k8s-device-plugin/`
- Available GPU resources in your cluster
## Deployment Steps
1. **Create a PVC, Secret, and Deployment for vLLM**
The PVC is used to store the model cache. It is optional; you can use hostPath or other storage options instead.
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mistral-7b
  namespace: default
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: default
  volumeMode: Filesystem
```
The Secret is optional and only required for accessing gated models; you can skip this step if you are not using gated models.
```yaml
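apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
  namespace: default
type: Opaque
stringData:
  # stringData accepts the plain-text token; values under `data` would have to be base64-encoded.
  token: "REPLACE_WITH_TOKEN"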
```
Create a deployment file for vLLM to run the model server. The following example deploys the `Mistral-7B-Instruct-v0.3` model:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mistral-7b
  namespace: default
  labels:
    app: mistral-7b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mistral-7b
  template:
    metadata:
      labels:
        app: mistral-7b
    spec:
      volumes:
      - name: cache-volume
        persistentVolumeClaim:
          claimName: mistral-7b
      # vLLM needs to access the host's shared memory for tensor parallel inference.
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "2Gi"
      containers:
      - name: mistral-7b
        image: vllm/vllm-openai:latest
        command: ["/bin/sh", "-c"]
        args: [
          "vllm serve mistralai/Mistral-7B-Instruct-v0.3 --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
        ]
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token-secret
              key: token
        ports:
        - containerPort: 8000
        resources:
          limits:
            cpu: "10"
            memory: 20G
            nvidia.com/gpu: "1"
          requests:
            cpu: "2"
            memory: 6G
            nvidia.com/gpu: "1"
        volumeMounts:
        - mountPath: /root/.cache/huggingface
          name: cache-volume
        - name: shm
          mountPath: /dev/shm
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 5
```
2. **Create a Kubernetes Service for vLLM**
Next, create a Kubernetes Service file to expose the `mistral-7b` deployment:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: mistral-7b
  namespace: default
spec:
  ports:
  - name: http-mistral-7b
    port: 80
    protocol: TCP
    targetPort: 8000
  # The label selector should match the deployment labels & it is useful for prefix caching feature
  selector:
    app: mistral-7b
  sessionAffinity: None
  type: ClusterIP
```
3. **Deploy and Test**
Apply the deployment and service configurations using `kubectl apply -f <filename>`:
```console
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
```
To test the deployment, run the following `curl` command:
```console
curl http://mistral-7b.default.svc.cluster.local/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistralai/Mistral-7B-Instruct-v0.3",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
      }'
```
If the service is correctly deployed, you should receive a response from the vLLM model.
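The same test request can be issued from Python using only the standard library. This is a minimal sketch that mirrors the `curl` call above; the helper name `build_completion_request` is hypothetical, and the send step is left commented out because the service DNS name only resolves from inside the cluster.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=7, temperature=0):
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_completion_request(
    "http://mistral-7b.default.svc.cluster.local",
    "mistralai/Mistral-7B-Instruct-v0.3",
    "San Francisco is a",
)

# Uncomment to send the request from a pod inside the cluster:
# with urllib.request.urlopen(request, timeout=30) as response:
#     print(json.load(response))
```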
## Conclusion
Deploying vLLM with Kubernetes allows for efficient scaling and management of ML models leveraging GPU resources. By following the steps outlined above, you should be able to set up and test a vLLM deployment within your Kubernetes cluster. If you encounter any issues or have suggestions, please feel free to contribute to the documentation.
...@@ -8,23 +8,23 @@ Before going into the details of distributed inference and serving, let's first ...@@ -8,23 +8,23 @@ Before going into the details of distributed inference and serving, let's first
- **Single GPU (no distributed inference)**: If your model fits in a single GPU, you probably don't need to use distributed inference. Just use the single GPU to run the inference. - **Single GPU (no distributed inference)**: If your model fits in a single GPU, you probably don't need to use distributed inference. Just use the single GPU to run the inference.
- **Single-Node Multi-GPU (tensor parallel inference)**: If your model is too large to fit in a single GPU, but it can fit in a single node with multiple GPUs, you can use tensor parallelism. The tensor parallel size is the number of GPUs you want to use. For example, if you have 4 GPUs in a single node, you can set the tensor parallel size to 4. - **Single-Node Multi-GPU (tensor parallel inference)**: If your model is too large to fit in a single GPU, but it can fit in a single node with multiple GPUs, you can use tensor parallelism. The tensor parallel size is the number of GPUs you want to use. For example, if you have 4 GPUs in a single node, you can set the tensor parallel size to 4.
- **Multi-Node Multi-GPU (tensor parallel plus pipeline parallel inference)**: If your model is too large to fit in a single node, you can use tensor parallel together with pipeline parallelism. The tensor parallel size is the number of GPUs you want to use in each node, and the pipeline parallel size is the number of nodes you want to use. For example, if you have 16 GPUs in 2 nodes (8GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2. - **Multi-Node Multi-GPU (tensor parallel plus pipeline parallel inference)**: If your model is too large to fit in a single node, you can use tensor parallel together with pipeline parallelism. The tensor parallel size is the number of GPUs you want to use in each node, and the pipeline parallel size is the number of nodes you want to use. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2.
In short, you should increase the number of GPUs and the number of nodes until you have enough GPU memory to hold the model. The tensor parallel size should be the number of GPUs in each node, and the pipeline parallel size should be the number of nodes. In short, you should increase the number of GPUs and the number of nodes until you have enough GPU memory to hold the model. The tensor parallel size should be the number of GPUs in each node, and the pipeline parallel size should be the number of nodes.
After adding enough GPUs and nodes to hold the model, you can run vLLM first, which will print some logs like `# GPU blocks: 790`. Multiply the number by `16` (the block size), and you can get roughly the maximum number of tokens that can be served on the current configuration. If this number is not satisfying, e.g. you want higher throughput, you can further increase the number of GPUs or nodes, until the number of blocks is enough. After adding enough GPUs and nodes to hold the model, you can run vLLM first, which will print some logs like `# GPU blocks: 790`. Multiply the number by `16` (the block size), and you can get roughly the maximum number of tokens that can be served on the current configuration. If this number is not satisfying, e.g. you want higher throughput, you can further increase the number of GPUs or nodes, until the number of blocks is enough.
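The sizing rules above reduce to simple arithmetic. The sketch below encodes them as two small helpers; the function names are hypothetical, and the default block size of 16 is the value stated in the text.

```python
def parallel_sizes(num_nodes, gpus_per_node):
    """Tensor parallel size = GPUs per node; pipeline parallel size = number of nodes."""
    return {
        "tensor_parallel_size": gpus_per_node,
        "pipeline_parallel_size": num_nodes,
    }


def approx_token_capacity(num_gpu_blocks, block_size=16):
    """Rough number of tokens the KV cache can hold."""
    return num_gpu_blocks * block_size


print(parallel_sizes(num_nodes=2, gpus_per_node=8))
# {'tensor_parallel_size': 8, 'pipeline_parallel_size': 2}
print(approx_token_capacity(790))  # 12640
```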
:::{note}
There is one edge case: if the model fits in a single node with multiple GPUs, but the number of GPUs cannot divide the model size evenly, you can use pipeline parallelism, which splits the model along layers and supports uneven splits. In this case, the tensor parallel size should be 1 and the pipeline parallel size should be the number of GPUs.
:::
## Running vLLM on a single node
vLLM supports distributed tensor-parallel and pipeline-parallel inference and serving. Currently, we support [Megatron-LM's tensor parallel algorithm](https://arxiv.org/pdf/1909.08053.pdf). We manage the distributed runtime with either [Ray](https://github.com/ray-project/ray) or Python native multiprocessing. Multiprocessing can be used when deploying on a single node; multi-node inference currently requires Ray.
Multiprocessing will be used by default when not running in a Ray placement group and if there are sufficient GPUs available on the same node for the configured `tensor_parallel_size`; otherwise Ray will be used. This default can be overridden via the `LLM` class `distributed_executor_backend` argument or the `--distributed-executor-backend` API server argument. Set it to `mp` for multiprocessing or `ray` for Ray. Ray does not need to be installed for the multiprocessing case.
To run multi-GPU inference with the `LLM` class, set the `tensor_parallel_size` argument to the number of GPUs you want to use. For example, to run inference on 4 GPUs:
```python
from vllm import LLM

llm = LLM("facebook/opt-13b", tensor_parallel_size=4)
output = llm.generate("San Francisco is a")
```
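To override the default executor choice described above, the `distributed_executor_backend` argument can be passed alongside `tensor_parallel_size`. The sketch below is a minimal illustration: the helper names `engine_kwargs` and `build_llm` are hypothetical, and the validation only covers the two values named in the text (`mp` and `ray`).

```python
from typing import Optional

VALID_BACKENDS = ("mp", "ray")  # the two values named in the text


def engine_kwargs(tensor_parallel_size: int, backend: Optional[str] = None) -> dict:
    """Build keyword arguments for `LLM` (hypothetical helper, not a vLLM API)."""
    kwargs = {"tensor_parallel_size": tensor_parallel_size}
    if backend is not None:
        if backend not in VALID_BACKENDS:
            raise ValueError(f"backend must be one of {VALID_BACKENDS}")
        kwargs["distributed_executor_backend"] = backend
    return kwargs


def build_llm(model: str, tensor_parallel_size: int, backend: Optional[str] = None):
    """Create the engine; vLLM is imported lazily so the helper above runs without it."""
    from vllm import LLM

    return LLM(model, **engine_kwargs(tensor_parallel_size, backend))


print(engine_kwargs(4, "mp"))
```

Calling `build_llm("facebook/opt-13b", 4, "mp")` would then request multiprocessing explicitly instead of relying on the default.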
To run multi-GPU serving, pass in the `--tensor-parallel-size` argument when starting the server. For example, to run the API server on 4 GPUs:
```console
vllm serve facebook/opt-13b \
    --tensor-parallel-size 4
```
You can additionally specify `--pipeline-parallel-size` to enable pipeline parallelism. For example, to run the API server on 8 GPUs with pipeline parallelism and tensor parallelism:
```console
vllm serve gpt2 \
    --tensor-parallel-size 4 \
    --pipeline-parallel-size 2
```
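In the example above, the 8 GPUs are the product of the two sizes (4 × 2). The following sketch assembles the same command programmatically and checks the GPU count; `serve_command` and `gpus_required` are hypothetical helper names, and only the flags shown in this section are used.

```python
import shlex
from typing import List


def gpus_required(tensor_parallel_size: int, pipeline_parallel_size: int = 1) -> int:
    """Total GPUs used: tensor parallel size times pipeline parallel size."""
    return tensor_parallel_size * pipeline_parallel_size


def serve_command(model: str, tensor_parallel_size: int,
                  pipeline_parallel_size: int = 1) -> List[str]:
    """Build the `vllm serve` argument list (hypothetical helper)."""
    cmd = ["vllm", "serve", model, "--tensor-parallel-size", str(tensor_parallel_size)]
    if pipeline_parallel_size > 1:
        cmd += ["--pipeline-parallel-size", str(pipeline_parallel_size)]
    return cmd


cmd = serve_command("gpt2", tensor_parallel_size=4, pipeline_parallel_size=2)
print(shlex.join(cmd))
print("GPUs required:", gpus_required(4, 2))
```

The resulting list can be passed to `subprocess.run(cmd)` on a machine where vLLM is installed.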
## Running vLLM on multiple nodes
If a single node does not have enough GPUs to hold the model, you can run the model using multiple nodes. It is important to make sure the execution environment is the same on all nodes, including the model path and the Python environment. The recommended way is to use Docker images to ensure the same environment, and to hide the heterogeneity of the host machines by mapping them into the same Docker configuration.
The first step is to start containers and organize them into a cluster. We have provided the helper script <gh-file:examples/online_serving/run_cluster.sh> to start the cluster. Please note that this script launches Docker without the administrative privileges that would be required to access GPU performance counters when running profiling and tracing tools. For that purpose, the script can add `CAP_SYS_ADMIN` to the Docker container by using the `--cap-add` option in the `docker run` command.
Pick a node as the head node, and run the following command:
```console
bash run_cluster.sh \
    vllm/vllm-openai \
    ip_of_head_node \
    --head \
    /path/to/the/huggingface/home/in/this/node
```
On the rest of the worker nodes, run the following command:
```console
bash run_cluster.sh \
    vllm/vllm-openai \
    ip_of_head_node \
    --worker \
    /path/to/the/huggingface/home/in/this/node
```
You now have a Ray cluster of containers. Note that you need to keep the shells running these commands alive to hold the cluster. Any shell disconnect will terminate the cluster. In addition, please note that the argument `ip_of_head_node` should be the IP address of the head node, which must be accessible by all the worker nodes. A common mistake is to use the IP address of the worker node, which is incorrect.
Then, on any node, use `docker exec -it node /bin/bash` to enter the container, and execute `ray status` to check the status of the Ray cluster. You should see the right number of nodes and GPUs.
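The same check can be done from Python inside the container. This is a minimal sketch, assuming Ray is installed there and the cluster is already running; `cluster_has_gpus` and `check_live_cluster` are hypothetical helper names, while `ray.init(address="auto")` and `ray.cluster_resources()` are standard Ray calls.

```python
from typing import Mapping


def cluster_has_gpus(resources: Mapping[str, float], expected_gpus: int) -> bool:
    """Return True if the resource summary reports at least `expected_gpus` GPUs."""
    return resources.get("GPU", 0) >= expected_gpus


def check_live_cluster(expected_gpus: int) -> bool:
    """Connect to the running Ray cluster and compare its GPU count (needs Ray)."""
    import ray  # imported lazily so the pure helper above works without Ray

    ray.init(address="auto")  # attach to the existing cluster instead of starting one
    return cluster_has_gpus(ray.cluster_resources(), expected_gpus)


# Offline illustration with a hand-written resource summary (illustrative values):
print(cluster_has_gpus({"CPU": 128.0, "GPU": 16.0}, expected_gpus=16))
```

For the 2-node, 8-GPUs-per-node setup, `check_live_cluster(16)` should return `True` once every worker has joined.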
After that, on any node, you can use vLLM as usual, just as if you had all the GPUs on one node. The common practice is to set the tensor parallel size to the number of GPUs in each node, and the pipeline parallel size to the number of nodes. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2:
```console
vllm serve /path/to/the/model/in/the/container \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 2
```
You can also use tensor parallelism without pipeline parallelism: just set the tensor parallel size to the number of GPUs in the cluster. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 16:
```console
vllm serve /path/to/the/model/in/the/container \
    --tensor-parallel-size 16
```
To make tensor parallelism performant, make sure the communication between nodes is efficient, e.g. by using high-speed network cards such as InfiniBand. To set up the cluster to use InfiniBand, append additional arguments like `--privileged -e NCCL_IB_HCA=mlx5` to the `run_cluster.sh` script. Please contact your system administrator for more information on how to set up the flags. One way to confirm that InfiniBand is working is to run vLLM with the `NCCL_DEBUG=TRACE` environment variable set, e.g. `NCCL_DEBUG=TRACE vllm serve ...`, and check the logs for the NCCL version and the network used. If you find `[send] via NET/Socket` in the logs, NCCL is using raw TCP sockets, which is not efficient for cross-node tensor parallelism. If you find `[send] via NET/IB/GDRDMA` in the logs, NCCL is using InfiniBand with GPU-Direct RDMA, which is efficient.
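The log check described above can be automated with a small script. This is a sketch under the assumption that the captured log contains the marker strings quoted in the text; `count_nccl_transports` is a hypothetical helper, and the sample log lines are shortened stand-ins rather than real NCCL output.

```python
from typing import Dict

# Marker substrings taken from the text above.
MARKERS = {
    "socket": "via NET/Socket",
    "infiniband_gdrdma": "via NET/IB/GDRDMA",
}


def count_nccl_transports(log_text: str) -> Dict[str, int]:
    """Count log lines mentioning each transport marker."""
    counts = {name: 0 for name in MARKERS}
    for line in log_text.splitlines():
        for name, marker in MARKERS.items():
            if marker in line:
                counts[name] += 1
    return counts


# Shortened, made-up lines that only carry the markers of interest.
sample_log = "\n".join([
    "... [send] via NET/IB/GDRDMA",
    "... [send] via NET/IB/GDRDMA",
    "... unrelated line",
])
print(count_nccl_transports(sample_log))
```

A non-zero `socket` count suggests that at least some traffic falls back to TCP and the InfiniBand setup deserves another look.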
:::{warning}
After you start the Ray cluster, you should also check the GPU-GPU communication between nodes. It can be non-trivial to set up. Please refer to the [sanity check script](#troubleshooting-incorrect-hardware-driver) for more information. If you need to set some environment variables for the communication configuration, you can append them to the `run_cluster.sh` script, e.g. `-e NCCL_SOCKET_IFNAME=eth0`. Note that setting environment variables in the shell (e.g. `NCCL_SOCKET_IFNAME=eth0 vllm serve ...`) only works for the processes in the same node, not for the processes in the other nodes. Setting environment variables when you create the cluster is the recommended way. See <gh-issue:6803> for more information.
:::
:::{warning}
Please make sure you have downloaded the model to all the nodes (with the same path), or that the model is downloaded to a distributed file system that is accessible by all nodes.

When you use a Hugging Face repo ID to refer to the model, you should append your Hugging Face token to the `run_cluster.sh` script, e.g. `-e HF_TOKEN=`. The recommended way is to download the model first, and then use the path to refer to the model.
:::
...@@ -4,6 +4,7 @@ ...@@ -4,6 +4,7 @@
Below, you can find an explanation of every engine argument for vLLM:

<!--- pyml disable-num-lines 7 no-space-in-emphasis-->
```{eval-rst}
.. argparse::
    :module: vllm.engine.arg_utils
...@@ -16,6 +17,7 @@ Below, you can find an explanation of every engine argument for vLLM: ...@@ -16,6 +17,7 @@ Below, you can find an explanation of every engine argument for vLLM:
Below are the additional arguments related to the asynchronous engine:

<!--- pyml disable-num-lines 7 no-space-in-emphasis-->
```{eval-rst}
.. argparse::
    :module: vllm.engine.arg_utils
......
...@@ -2,14 +2,14 @@ ...@@ -2,14 +2,14 @@
vLLM uses the following environment variables to configure the system:
:::{warning}
Please note that `VLLM_PORT` and `VLLM_HOST_IP` set the port and IP for vLLM's **internal usage**. They are not the port and IP for the API server. If you use `--host $VLLM_HOST_IP` and `--port $VLLM_PORT` to start the API server, it will not work.

All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables).
:::
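To spot the Kubernetes naming conflict described above, a startup check can scan the environment for `VLLM_`-prefixed variables that look like service-discovery entries. This is a heuristic sketch, not a vLLM feature: the helper name `suspicious_vllm_vars` is hypothetical, and the rules (a `tcp://` value, or a name ending in `_SERVICE_HOST` or `_SERVICE_PORT`) are assumptions based on the variable shapes Kubernetes generates for a service.

```python
import os
from typing import Dict, Mapping

K8S_SUFFIXES = ("_SERVICE_HOST", "_SERVICE_PORT")


def suspicious_vllm_vars(environ: Mapping[str, str]) -> Dict[str, str]:
    """Return VLLM_* variables that look like Kubernetes service variables."""
    found = {}
    for name, value in environ.items():
        if not name.startswith("VLLM_"):
            continue
        # Kubernetes-style values such as VLLM_PORT=tcp://<ip>:<port> are not valid ports.
        if value.startswith("tcp://") or name.endswith(K8S_SUFFIXES):
            found[name] = value
    return found


for name, value in suspicious_vllm_vars(os.environ).items():
    print(f"warning: {name}={value} looks like a Kubernetes service variable")
```

If the check reports `VLLM_PORT` with a `tcp://` value, renaming the Kubernetes service is the fix suggested by the warning above.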
:::{literalinclude} ../../../vllm/envs.py
:end-before: end-env-vars-definition
:language: python
:start-after: begin-env-vars-definition
:::