vLLM does not support Windows natively. To run vLLM on Windows, you can use the Windows Subsystem for Linux (WSL) with a compatible Linux distribution, or use some community-maintained forks, e.g. [https://github.com/SystemPanic/vllm-windows](https://github.com/SystemPanic/vllm-windows).
vLLM does not support Windows natively. To run vLLM on Windows, you can use the Windows Subsystem for Linux (WSL) with a compatible Linux distribution, or use some community-maintained forks, e.g. [https://github.com/SystemPanic/vllm-windows](https://github.com/SystemPanic/vllm-windows).
vLLM supports AMD GPUs with ROCm 6.3 or above, and torch 2.8.0 and above.
!!! tip
!!! tip
[Docker](#set-up-using-docker) is the recommended way to use vLLM on ROCm.
[Docker](#set-up-using-docker) is the recommended way to use vLLM on ROCm.
...
@@ -11,9 +11,10 @@ vLLM supports AMD GPUs with ROCm 6.3 or above.
...
@@ -11,9 +11,10 @@ vLLM supports AMD GPUs with ROCm 6.3 or above.
# --8<-- [end:installation]
# --8<-- [end:installation]
# --8<-- [start:requirements]
# --8<-- [start:requirements]
- GPU: MI200s (gfx90a), MI300 (gfx942), MI350 (gfx950), Radeon RX 7900 series (gfx1100/1101), Radeon RX 9000 series (gfx1200/1201)
- GPU: MI200s (gfx90a), MI300 (gfx942), MI350 (gfx950), Radeon RX 7900 series (gfx1100/1101), Radeon RX 9000 series (gfx1200/1201), Ryzen AI MAX / AI 300 Series (gfx1151/1150)
- ROCm 6.3 or above
- ROCm 6.3 or above
- MI350 requires ROCm 7.0 or above
- MI350 requires ROCm 7.0 or above
- Ryzen AI MAX / AI 300 Series requires ROCm 7.0.2 or above
# --8<-- [end:requirements]
# --8<-- [end:requirements]
# --8<-- [start:set-up-using-python]
# --8<-- [start:set-up-using-python]
...
@@ -28,57 +29,63 @@ Currently, there are no pre-built ROCm wheels.
...
@@ -28,57 +29,63 @@ Currently, there are no pre-built ROCm wheels.
# --8<-- [end:pre-built-wheels]
# --8<-- [end:pre-built-wheels]
# --8<-- [start:build-wheel-from-source]
# --8<-- [start:build-wheel-from-source]
!!! tip
- If you found that the following installation step does not work for you, please refer to [docker/Dockerfile.rocm_base](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base). Dockerfile is a form of installation steps.
0. Install prerequisites (skip if you are already in an environment/docker with the following installed):
0. Install prerequisites (skip if you are already in an environment/docker with the following installed):
For installing PyTorch, you can start from a fresh docker image, e.g, `rocm/pytorch:rocm6.4.3_ubuntu24.04_py3.12_pytorch_release_2.6.0`, `rocm/pytorch-nightly`. If you are using docker image, you can skip to Step 3.
For installing PyTorch, you can start from a fresh docker image, e.g, `rocm/pytorch:rocm7.0_ubuntu22.04_py3.10_pytorch_release_2.8.0`, `rocm/pytorch-nightly`. If you are using docker image, you can skip to Step 3.
Alternatively, you can install PyTorch using PyTorch wheels. You can check PyTorch installation guide in PyTorch [Getting Started](https://pytorch.org/get-started/locally/). Example:
Alternatively, you can install PyTorch using PyTorch wheels. You can check PyTorch installation guide in PyTorch [Getting Started](https://pytorch.org/get-started/locally/). Example:
1. Install [Triton for ROCm](https://github.com/triton-lang/triton)
1. Install [Triton for ROCm](https://github.com/ROCm/triton.git)
Install ROCm's Triton (the default triton-mlir branch) following the instructions from [ROCm/triton](https://github.com/ROCm/triton/blob/triton-mlir/README.md)
Install ROCm's Triton following the instructions from [ROCm/triton](https://github.com/ROCm/triton.git)
If you see HTTP issue related to downloading packages during building triton, please try again as the HTTP error is intermittent.
- The validated `$TRITON_BRANCH` can be found in the [docker/Dockerfile.rocm_base](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base).
- If you see HTTP issue related to downloading packages during building triton, please try again as the HTTP error is intermittent.
2. Optionally, if you choose to use CK flash attention, you can install [flash attention for ROCm](https://github.com/Dao-AILab/flash-attention)
2. Optionally, if you choose to use CK flash attention, you can install [flash attention for ROCm](https://github.com/Dao-AILab/flash-attention.git)
Install ROCm's flash attention (v2.7.2) following the instructions from [ROCm/flash-attention](https://github.com/ROCm/flash-attention#amd-rocm-support)
Install ROCm's flash attention (v2.8.0) following the instructions from [ROCm/flash-attention](https://github.com/Dao-AILab/flash-attention#amd-rocm-support)
Alternatively, wheels intended for vLLM use can be accessed under the releases.
For example, for ROCm 6.3, suppose your gfx arch is `gfx90a`. To get your gfx architecture, run `rocminfo |grep gfx`.
For example, for ROCm 7.0, suppose your gfx arch is `gfx942`. To get your gfx architecture, run `rocminfo |grep gfx`.
You might need to downgrade the "ninja" version to 1.10 as it is not used when compiling flash-attention-2 (e.g. `pip install ninja==1.10.2.4`)
- The validated `$FA_BRANCH` can be found in the [docker/Dockerfile.rocm_base](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base).
3. If you choose to build AITER yourself to use a certain branch or commit, you can build AITER using the following steps:
3. If you choose to build AITER yourself to use a certain branch or commit, you can build AITER using the following steps:
...
@@ -92,11 +99,13 @@ Currently, there are no pre-built ROCm wheels.
...
@@ -92,11 +99,13 @@ Currently, there are no pre-built ROCm wheels.
```
```
!!! note
!!! note
You will need to config the `$AITER_BRANCH_OR_COMMIT` for your purpose.
- You will need to config the `$AITER_BRANCH_OR_COMMIT` for your purpose.
- The validated `$AITER_BRANCH_OR_COMMIT` can be found in the [docker/Dockerfile.rocm_base](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base).
4. Build vLLM. For example, vLLM on ROCM 6.3 can be built with the following steps:
4. Build vLLM. For example, vLLM on ROCM 7.0 can be built with the following steps:
??? console "Commands"
???+ console "Commands"
```bash
```bash
pip install --upgrade pip
pip install --upgrade pip
...
@@ -109,31 +118,48 @@ Currently, there are no pre-built ROCm wheels.
...
@@ -109,31 +118,48 @@ Currently, there are no pre-built ROCm wheels.
scipy \
scipy \
huggingface-hub[cli,hf_transfer] \
huggingface-hub[cli,hf_transfer] \
setuptools_scm
setuptools_scm
pip install "numpy<2"
pip install -r requirements/rocm.txt
pip install -r requirements/rocm.txt
# Build vLLM for MI210/MI250/MI300.
# To build for a single architecture (e.g., MI300) for faster installation (recommended):
export PYTORCH_ROCM_ARCH="gfx90a;gfx942"
export PYTORCH_ROCM_ARCH="gfx942"
# To build vLLM for multiple arch MI210/MI250/MI300, use this instead
# export PYTORCH_ROCM_ARCH="gfx90a;gfx942"
python3 setup.py develop
python3 setup.py develop
```
```
This may take 5-10 minutes. Currently, `pip install .` does not work for ROCm installation.
This may take 5-10 minutes. Currently, `pip install .` does not work for ROCm installation.
!!! tip
!!! tip
- Triton flash attention is used by default. For benchmarking purposes, it is recommended to run a warm-up step before collecting perf numbers.
- Triton flash attention does not currently support sliding window attention. If using half precision, please use CK flash-attention for sliding window support.
- To use CK flash-attention or PyTorch naive attention, please use this flag `export VLLM_USE_TRITON_FLASH_ATTN=0` to turn off triton flash attention.
- The ROCm version of PyTorch, ideally, should match the ROCm driver version.
- The ROCm version of PyTorch, ideally, should match the ROCm driver version.
!!! tip
!!! tip
- For MI300x (gfx942) users, to achieve optimal performance, please refer to [MI300x tuning guide](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/index.html) for performance optimization and tuning tips on system and workflow level.
- For MI300x (gfx942) users, to achieve optimal performance, please refer to [MI300x tuning guide](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/index.html) for performance optimization and tuning tips on system and workflow level.
For vLLM, please refer to [vLLM performance optimization](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/workload.html#vllm-performance-optimization).
For vLLM, please refer to [vLLM performance optimization](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference-optimization/vllm-optimization.html).
# --8<-- [end:build-wheel-from-source]
# --8<-- [end:build-wheel-from-source]
# --8<-- [start:pre-built-images]
# --8<-- [start:pre-built-images]
The [AMD Infinity hub for vLLM](https://hub.docker.com/r/rocm/vllm/tags) offers a prebuilt, optimized
The [AMD Infinity hub for vLLM](https://hub.docker.com/r/rocm/vllm/tags) offers a prebuilt, optimized
docker image designed for validating inference performance on the AMD Instinct™ MI300X accelerator.
docker image designed for validating inference performance on the AMD Instinct™ MI300X accelerator.
AMD also offers nightly prebuilt docker image from [Docker Hub](https://hub.docker.com/r/rocm/vllm-dev), which has vLLM and all its dependencies installed.
???+ console "Commands"
```bash
docker pull rocm/vllm-dev:nightly # to get the latest image
docker run -it--rm\
--network=host \
--group-add=video \
--ipc=host \
--cap-add=SYS_PTRACE \
--security-optseccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v <path/to/your/models>:/app/models \
-eHF_HOME="/app/models"\
rocm/vllm-dev:nightly
```
!!! tip
!!! tip
Please check [LLM inference performance validation on AMD Instinct MI300X](https://rocm.docs.amd.com/en/latest/how-to/performance-validation/mi300x/vllm-benchmark.html)
Please check [LLM inference performance validation on AMD Instinct MI300X](https://rocm.docs.amd.com/en/latest/how-to/performance-validation/mi300x/vllm-benchmark.html)
...
@@ -144,33 +170,33 @@ docker image designed for validating inference performance on the AMD Instinct
...
@@ -144,33 +170,33 @@ docker image designed for validating inference performance on the AMD Instinct
Building the Docker image from source is the recommended way to use vLLM with ROCm.
Building the Docker image from source is the recommended way to use vLLM with ROCm.
#### (Optional) Build an image with ROCm software stack
??? info "(Optional) Build an image with ROCm software stack"
Build a docker image from <gh-file:docker/Dockerfile.rocm_base> which setup ROCm software stack needed by the vLLM.
Build a docker image from [docker/Dockerfile.rocm_base](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base) which setup ROCm software stack needed by the vLLM.
**This step is optional as this rocm_base image is usually prebuilt and store at [Docker Hub](https://hub.docker.com/r/rocm/vllm-dev) under tag `rocm/vllm-dev:base` to speed up user experience.**
**This step is optional as this rocm_base image is usually prebuilt and store at [Docker Hub](https://hub.docker.com/r/rocm/vllm-dev) under tag `rocm/vllm-dev:base` to speed up user experience.**
If you choose to build this rocm_base image yourself, the steps are as follows.
If you choose to build this rocm_base image yourself, the steps are as follows.
It is important that the user kicks off the docker build using buildkit. Either the user put DOCKER_BUILDKIT=1 as environment variable when calling docker build command, or the user needs to set up buildkit in the docker daemon configuration /etc/docker/daemon.json as follows and restart the daemon:
It is important that the user kicks off the docker build using buildkit. Either the user put DOCKER_BUILDKIT=1 as environment variable when calling docker build command, or the user needs to set up buildkit in the docker daemon configuration /etc/docker/daemon.json as follows and restart the daemon:
```json
```json
{
{
"features":{
"features": {
"buildkit":true
"buildkit": true
}
}
}
}
```
```
To build vllm on ROCm 6.3 for MI200 and MI300 series, you can use the default:
To build vllm on ROCm 7.0 for MI200 and MI300 series, you can use the default:
```bash
```bash
DOCKER_BUILDKIT=1 docker build \
DOCKER_BUILDKIT=1 docker build \
-f docker/Dockerfile.rocm_base \
-f docker/Dockerfile.rocm_base \
-t rocm/vllm-dev:base .
-t rocm/vllm-dev:base .
```
```
#### Build an image with vLLM
#### Build an image with vLLM
First, build a docker image from <gh-file:docker/Dockerfile.rocm> and launch a docker container from the image.
First, build a docker image from [docker/Dockerfile.rocm](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm) and launch a docker container from the image.
It is important that the user kicks off the docker build using buildkit. Either the user put `DOCKER_BUILDKIT=1` as environment variable when calling docker build command, or the user needs to set up buildkit in the docker daemon configuration /etc/docker/daemon.json as follows and restart the daemon:
It is important that the user kicks off the docker build using buildkit. Either the user put `DOCKER_BUILDKIT=1` as environment variable when calling docker build command, or the user needs to set up buildkit in the docker daemon configuration /etc/docker/daemon.json as follows and restart the daemon:
```bash
```bash
...
@@ -181,24 +207,24 @@ It is important that the user kicks off the docker build using buildkit. Either
...
@@ -181,24 +207,24 @@ It is important that the user kicks off the docker build using buildkit. Either
}
}
```
```
<gh-file:docker/Dockerfile.rocm> uses ROCm 6.3 by default, but also supports ROCm 5.7, 6.0, 6.1, and 6.2, in older vLLM branches.
[docker/Dockerfile.rocm](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm) uses ROCm 7.0 by default, but also supports ROCm 5.7, 6.0, 6.1, 6.2, 6.3, and 6.4, in older vLLM branches.
It provides flexibility to customize the build of docker image using the following arguments:
It provides flexibility to customize the build of docker image using the following arguments:
-`BASE_IMAGE`: specifies the base image used when running `docker build`. The default value `rocm/vllm-dev:base` is an image published and maintained by AMD. It is being built using <gh-file:docker/Dockerfile.rocm_base>
-`BASE_IMAGE`: specifies the base image used when running `docker build`. The default value `rocm/vllm-dev:base` is an image published and maintained by AMD. It is being built using [docker/Dockerfile.rocm_base](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base)
-`ARG_PYTORCH_ROCM_ARCH`: Allows to override the gfx architecture values from the base docker image
-`ARG_PYTORCH_ROCM_ARCH`: Allows to override the gfx architecture values from the base docker image
Their values can be passed in when running `docker build` with `--build-arg` options.
Their values can be passed in when running `docker build` with `--build-arg` options.
To build vllm on ROCm 6.3 for MI200 and MI300 series, you can use the default:
To build vllm on ROCm 7.0 for MI200 and MI300 series, you can use the default:
XPU platform supports **tensor parallel** inference/serving and also supports **pipeline parallel** as a beta feature for online serving. For **pipeline parallel**, we support it on single node with mp as the backend. For example, a reference execution like following:
XPU platform supports **tensor parallel** inference/serving and also supports **pipeline parallel** as a beta feature for online serving. For **pipeline parallel**, we support it on single node with mp as the backend. For example, a reference execution like following:
By default, a ray instance will be launched automatically if no existing one is detected in the system, with `num-gpus` equals to `parallel_config.world_size`. We recommend properly starting a ray cluster before execution, referring to the <gh-file:examples/online_serving/run_cluster.sh> helper script.
By default, a ray instance will be launched automatically if no existing one is detected in the system, with `num-gpus` equals to `parallel_config.world_size`. We recommend properly starting a ray cluster before execution, referring to the [examples/online_serving/run_cluster.sh](https://github.com/vllm-project/vllm/blob/main/examples/online_serving/run_cluster.sh) helper script.
It's recommended to use [uv](https://docs.astral.sh/uv/), a very fast Python environment manager, to create and manage Python environments. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment using the following commands:
On NVIDIA CUDA only, it's recommended to use [uv](https://docs.astral.sh/uv/), a very fast Python environment manager, to create and manage Python environments. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment using the following commands:
-[Online serving using OpenAI-compatible server][quickstart-online]
-[Online serving using OpenAI-compatible server](#openai-compatible-server)
## Prerequisites
## Prerequisites
- OS: Linux
- OS: Linux
- Python: 3.9 -- 3.13
- Python: 3.10 -- 3.13
## Installation
## Installation
If you are using NVIDIA GPUs, you can install vLLM using [pip](https://pypi.org/project/vllm/) directly.
=== "NVIDIA CUDA"
It's recommended to use [uv](https://docs.astral.sh/uv/), a very fast Python environment manager, to create and manage Python environments. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment and install vLLM using the following commands:
If you are using NVIDIA GPUs, you can install vLLM using [pip](https://pypi.org/project/vllm/) directly.
```bash
It's recommended to use [uv](https://docs.astral.sh/uv/), a very fast Python environment manager, to create and manage Python environments. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment and install vLLM using the following commands:
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm --torch-backend=auto
```
`uv` can [automatically select the appropriate PyTorch index at runtime](https://docs.astral.sh/uv/guides/integration/pytorch/#automatic-backend-selection) by inspecting the installed CUDA driver version via `--torch-backend=auto` (or `UV_TORCH_BACKEND=auto`). To select a specific backend (e.g., `cu126`), set `--torch-backend=cu126` (or `UV_TORCH_BACKEND=cu126`).
```bash
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm --torch-backend=auto
```
Another delightful way is to use `uv run` with `--with [dependency]` option, which allows you to run commands such as `vllm serve` without creating any permanent environment:
`uv` can [automatically select the appropriate PyTorch index at runtime](https://docs.astral.sh/uv/guides/integration/pytorch/#automatic-backend-selection) by inspecting the installed CUDA driverversion via `--torch-backend=auto` (or `UV_TORCH_BACKEND=auto`). To select a specific backend (e.g., `cu126`), set `--torch-backend=cu126` (or `UV_TORCH_BACKEND=cu126`).
```bash
Another delightful way is to use `uv run` with `--with [dependency]` option, which allows you to run commands such as `vllm serve` without creating any permanent environment:
uv run --with vllm vllm --help
```
You can also use [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html) to create and manage Python environments. You can install `uv` to the conda environment through `pip` if you want to manage it within the environment.
```bash
uv run --with vllm vllm --help
```
```bash
You can also use [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html) to create and manage Python environments. You can install `uv` to the conda environment through `pip` if you want to manage it within the environment.
conda create -n myenv python=3.12 -y
conda activate myenv
```bash
pip install--upgrade uv
conda create -n myenv python=3.12 -y
uv pip install vllm --torch-backend=auto
conda activate myenv
```
pip install --upgrade uv
uv pip install vllm --torch-backend=auto
```
=== "AMD ROCm"
Use a pre-built docker image from Docker Hub. The public stable image is [rocm/vllm:latest](https://hub.docker.com/r/rocm/vllm). There is also a development image at [rocm/vllm-dev](https://hub.docker.com/r/rocm/vllm-dev).
The `-v` flag in the `docker run` command below mounts a local directory into the container. Replace `<path/to/your/models>` with the path on your host machine to the directory containing your models. The models will then be accessible inside the container at `/app/models`.
???+ console "Commands"
```bash
docker pull rocm/vllm-dev:nightly # to get the latest image
docker run -it --rm \
--network=host \
--group-add=video \
--ipc=host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v <path/to/your/models>:/app/models \
-e HF_HOME="/app/models" \
rocm/vllm-dev:nightly
```
=== "Google TPU"
To run vLLM on Google TPUs, you need to install the `vllm-tpu` package.
```bash
uv pip install vllm-tpu
```
!!! note
For more detailed instructions, including Docker, installing from source, and troubleshooting, please refer to the [vLLM on TPU documentation](https://docs.vllm.ai/projects/tpu/en/latest/).
!!! note
!!! note
For more detail and non-CUDA platforms, please refer [here](installation/README.md) for specific instructions on how to install vLLM.
For more detail and non-CUDA platforms, please refer [here](installation/README.md) for specific instructions on how to install vLLM.
[](){ #quickstart-offline }
## Offline Batched Inference
## Offline Batched Inference
With vLLM installed, you can start generating texts for list of input prompts (i.e. offline batch inferencing). See the example script: <gh-file:examples/offline_inference/basic/basic.py>
With vLLM installed, you can start generating texts for list of input prompts (i.e. offline batch inferencing). See the example script: [examples/offline_inference/basic/basic.py](../../examples/offline_inference/basic/basic.py)
The first line of this example imports the classes [LLM][vllm.LLM] and [SamplingParams][vllm.SamplingParams]:
The first line of this example imports the classes [LLM][vllm.LLM] and [SamplingParams][vllm.SamplingParams]:
...
@@ -57,7 +90,7 @@ The first line of this example imports the classes [LLM][vllm.LLM] and [Sampling
...
@@ -57,7 +90,7 @@ The first line of this example imports the classes [LLM][vllm.LLM] and [Sampling
fromvllmimportLLM,SamplingParams
fromvllmimportLLM,SamplingParams
```
```
The next section defines a list of input prompts and sampling parameters for text generation. The [sampling temperature](https://arxiv.org/html/2402.05201v1) is set to `0.8` and the [nucleus sampling probability](https://en.wikipedia.org/wiki/Top-p_sampling) is set to `0.95`. You can find more information about the sampling parameters [here][sampling-params].
The next section defines a list of input prompts and sampling parameters for text generation. The [sampling temperature](https://arxiv.org/html/2402.05201v1) is set to `0.8` and the [nucleus sampling probability](https://en.wikipedia.org/wiki/Top-p_sampling) is set to `0.95`. You can find more information about the sampling parameters [here](../api/README.md#inference-parameters).
!!! important
!!! important
By default, vLLM will use sampling parameters recommended by model creator by applying the `generation_config.json` from the Hugging Face model repository if it exists. In most cases, this will provide you with the best results by default if [SamplingParams][vllm.SamplingParams] is not specified.
By default, vLLM will use sampling parameters recommended by model creator by applying the `generation_config.json` from the Hugging Face model repository if it exists. In most cases, this will provide you with the best results by default if [SamplingParams][vllm.SamplingParams] is not specified.
vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications using OpenAI API.
vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications using OpenAI API.
By default, the server uses a predefined chat template stored in the tokenizer.
By default, the server uses a predefined chat template stored in the tokenizer.
You can learn about overriding it [here][chat-template].
You can learn about overriding it [here](../serving/openai_compatible_server.md#chat-template).
!!! important
!!! important
By default, the server applies `generation_config.json` from the huggingface model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.
By default, the server applies `generation_config.json` from the huggingface model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.
...
@@ -194,12 +225,14 @@ Since this server is compatible with OpenAI API, you can use it as a drop-in rep
...
@@ -194,12 +225,14 @@ Since this server is compatible with OpenAI API, you can use it as a drop-in rep
A more detailed client example can be found here: <gh-file:examples/online_serving/openai_completion_client.py>
A more detailed client example can be found here: [examples/offline_inference/basic/basic.py](../../examples/offline_inference/basic/basic.py)
### OpenAI Chat Completions API with vLLM
### OpenAI Chat Completions API with vLLM
...
@@ -239,7 +272,7 @@ Alternatively, you can use the `openai` Python package:
...
@@ -239,7 +272,7 @@ Alternatively, you can use the `openai` Python package:
messages=[
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Tell me a joke."},
{"role": "user", "content": "Tell me a joke."},
]
],
)
)
print("Chat response:", chat_response)
print("Chat response:", chat_response)
```
```
...
@@ -248,7 +281,17 @@ Alternatively, you can use the `openai` Python package:
...
@@ -248,7 +281,17 @@ Alternatively, you can use the `openai` Python package:
Currently, vLLM supports multiple backends for efficient Attention computation across different platforms and accelerator architectures. It automatically selects the most performant backend compatible with your system and model specifications.
Currently, vLLM supports multiple backends for efficient Attention computation across different platforms and accelerator architectures. It automatically selects the most performant backend compatible with your system and model specifications.
If desired, you can also manually set the backend of your choice by configuring the environment variable `VLLM_ATTENTION_BACKEND` to one of the following options: `FLASH_ATTN`, `FLASHINFER` or `XFORMERS`.
If desired, you can also manually set the backend of your choice by configuring the environment variable `VLLM_ATTENTION_BACKEND` to one of the following options:
- On NVIDIA CUDA: `FLASH_ATTN`, `FLASHINFER` or `XFORMERS`.
- On AMD ROCm: `TRITON_ATTN`, `ROCM_ATTN`, `ROCM_AITER_FA` or `ROCM_AITER_UNIFIED_ATTN`.
For AMD ROCm, you can futher control the specific Attention implementation using the following variables:
There are no pre-built vllm wheels containing Flash Infer, so you must install it in your environment first. Refer to the [Flash Infer official docs](https://docs.flashinfer.ai/) or see <gh-file:docker/Dockerfile> for instructions on how to install it.
There are no pre-built vllm wheels containing Flash Infer, so you must install it in your environment first. Refer to the [Flash Infer official docs](https://docs.flashinfer.ai/) or see [docker/Dockerfile](../../docker/Dockerfile) for instructions on how to install it.
@@ -3,4 +3,4 @@ Loading Model weights with fastsafetensors
...
@@ -3,4 +3,4 @@ Loading Model weights with fastsafetensors
Using fastsafetensors library enables loading model weights to GPU memory by leveraging GPU direct storage. See [their GitHub repository](https://github.com/foundation-model-stack/fastsafetensors) for more details.
Using fastsafetensors library enables loading model weights to GPU memory by leveraging GPU direct storage. See [their GitHub repository](https://github.com/foundation-model-stack/fastsafetensors) for more details.
To enable this feature, use the ``--load-format fastsafetensors`` command-line argument
To enable this feature, use the `--load-format fastsafetensors` command-line argument
You can tune parameters using `--model-loader-extra-config`:
You can tune parameters using `--model-loader-extra-config`:
You can tune `distributed` that controls whether distributed streaming should be used. This is currently only possible on CUDA and ROCM devices. This can significantly improve loading times from object storage or high-throughput network fileshares.
You can read further about Distributed streaming [here](https://github.com/run-ai/runai-model-streamer/blob/master/docs/src/usage.md#distributed-streaming)
To create sharded model files, you can use the script provided in <gh-file:examples/offline_inference/save_sharded_state.py>. This script demonstrates how to save a model in the sharded format that is compatible with the Run:ai Model Streamer sharded loader.
To create sharded model files, you can use the script provided in [examples/offline_inference/save_sharded_state.py](../../../examples/offline_inference/save_sharded_state.py). This script demonstrates how to save a model in the sharded format that is compatible with the Run:ai Model Streamer sharded loader.
The sharded loader supports all the same tunable parameters as the regular Run:ai Model Streamer, including `concurrency` and `memory_limit`. These can be configured in the same way:
The sharded loader supports all the same tunable parameters as the regular Run:ai Model Streamer, including `concurrency` and `memory_limit`. These can be configured in the same way:
By default, vLLM will use sampling parameters recommended by model creator by applying the `generation_config.json` from the huggingface model repository if it exists. In most cases, this will provide you with the best results by default if [SamplingParams][vllm.SamplingParams] is not specified.
By default, vLLM will use sampling parameters recommended by model creator by applying the `generation_config.json` from the huggingface model repository if it exists. In most cases, this will provide you with the best results by default if [SamplingParams][vllm.SamplingParams] is not specified.
However, if vLLM's default sampling parameters are preferred, please pass `generation_config="vllm"` when creating the [LLM][vllm.LLM] instance.
However, if vLLM's default sampling parameters are preferred, please pass `generation_config="vllm"` when creating the [LLM][vllm.LLM] instance.
A code example can be found here: <gh-file:examples/offline_inference/basic/basic.py>
A code example can be found here: [examples/offline_inference/basic/basic.py](../../examples/offline_inference/basic/basic.py)
### `LLM.beam_search`
### `LLM.beam_search`
...
@@ -98,15 +98,15 @@ and automatically applies the model's [chat template](https://huggingface.co/doc
...
@@ -98,15 +98,15 @@ and automatically applies the model's [chat template](https://huggingface.co/doc
conversation = [
conversation = [
{
{
"role": "system",
"role": "system",
"content": "You are a helpful assistant"
"content": "You are a helpful assistant",
},
},
{
{
"role": "user",
"role": "user",
"content": "Hello"
"content": "Hello",
},
},
{
{
"role": "assistant",
"role": "assistant",
"content": "Hello! How can I assist you today?"
"content": "Hello! How can I assist you today?",
},
},
{
{
"role": "user",
"role": "user",
...
@@ -121,7 +121,7 @@ and automatically applies the model's [chat template](https://huggingface.co/doc
...
@@ -121,7 +121,7 @@ and automatically applies the model's [chat template](https://huggingface.co/doc
Our [OpenAI-Compatible Server](../serving/openai_compatible_server.md) provides endpoints that correspond to the offline APIs:
Our [OpenAI-Compatible Server](../serving/openai_compatible_server.md) provides endpoints that correspond to the offline APIs:
-[Completions API][completions-api] is similar to `LLM.generate` but only accepts text.
-[Completions API](../serving/openai_compatible_server.md#completions-api) is similar to `LLM.generate` but only accepts text.
-[Chat API][chat-api] is similar to `LLM.chat`, accepting both text and [multi-modal inputs](../features/multimodal_inputs.md) for models with a chat template.
-[Chat API](../serving/openai_compatible_server.md#chat-api) is similar to `LLM.chat`, accepting both text and [multi-modal inputs](../features/multimodal_inputs.md) for models with a chat template.
We currently support pooling models primarily as a matter of convenience. This is not guaranteed to have any performance improvement over using HF Transformers / Sentence Transformers directly.
We currently support pooling models primarily as a matter of convenience. This is not guaranteed to have any performance improvement over using HF Transformers / Sentence Transformers directly.
We are now planning to optimize pooling models in vLLM. Please comment on <gh-issue:21796> if you have any suggestions!
We are now planning to optimize pooling models in vLLM. Please comment on <https://github.com/vllm-project/vllm/issues/21796> if you have any suggestions!
## Configuration
## Configuration
...
@@ -30,11 +30,11 @@ If `--runner pooling` has been set (manually or automatically) but the model doe
...
@@ -30,11 +30,11 @@ If `--runner pooling` has been set (manually or automatically) but the model doe
vLLM will attempt to automatically convert the model according to the architecture names
vLLM will attempt to automatically convert the model according to the architecture names
(output,)=llm.score("What is the capital of France?",
(output,)=llm.score(
"The capital of Brazil is Brasilia.")
"What is the capital of France?",
"The capital of Brazil is Brasilia.",
)
score=output.outputs.score
score=output.outputs.score
print(f"Score: {score}")
print(f"Score: {score}")
```
```
A code example can be found here: <gh-file:examples/offline_inference/basic/score.py>
A code example can be found here: [examples/offline_inference/basic/score.py](../../examples/offline_inference/basic/score.py)
### `LLM.reward`
### `LLM.reward`
The [reward][vllm.LLM.reward] method is available to all reward models in vLLM.
The [reward][vllm.LLM.reward] method is available to all reward models in vLLM.
It returns the extracted hidden states directly.
```python
```python
fromvllmimportLLM
fromvllmimportLLM
...
@@ -154,20 +157,22 @@ data = output.outputs.data
...
@@ -154,20 +157,22 @@ data = output.outputs.data
print(f"Data: {data!r}")
print(f"Data: {data!r}")
```
```
A code example can be found here: <gh-file:examples/offline_inference/basic/reward.py>
A code example can be found here: [examples/offline_inference/basic/reward.py](../../examples/offline_inference/basic/reward.py)
### `LLM.encode`
### `LLM.encode`
The [encode][vllm.LLM.encode] method is available to all pooling models in vLLM.
The [encode][vllm.LLM.encode] method is available to all pooling models in vLLM.
It returns the extracted hidden states directly.
!!! note
!!! note
Please use one of the more specific methods or set the task directly when using `LLM.encode`:
Please use one of the more specific methods or set the task directly when using `LLM.encode`:
- For embeddings, use `LLM.embed(...)` or `pooling_task="embed"`.
- For embeddings, use `LLM.embed(...)` or `pooling_task="embed"`.
- For classification logits, use `LLM.classify(...)` or `pooling_task="classify"`.
- For classification logits, use `LLM.classify(...)` or `pooling_task="classify"`.
- For rewards, use `LLM.reward(...)` or `pooling_task="reward"`.
- For similarity scores, use `LLM.score(...)`.
- For similarity scores, use `LLM.score(...)`.
- For rewards, use `LLM.reward(...)` or `pooling_task="token_classify"`.
- For token classification, use `pooling_task="token_classify"`.
- For multi-vector retrieval, use `pooling_task="token_embed"`
- For IO Processor Plugins , use `pooling_task="plugin"`
```python
```python
fromvllmimportLLM
fromvllmimportLLM
...
@@ -183,10 +188,47 @@ print(f"Data: {data!r}")
...
@@ -183,10 +188,47 @@ print(f"Data: {data!r}")
Our [OpenAI-Compatible Server](../serving/openai_compatible_server.md) provides endpoints that correspond to the offline APIs:
Our [OpenAI-Compatible Server](../serving/openai_compatible_server.md) provides endpoints that correspond to the offline APIs:
-[Pooling API][pooling-api] is similar to `LLM.encode`, being applicable to all types of pooling models.
-[Embeddings API](../serving/openai_compatible_server.md#embeddings-api) is similar to `LLM.embed`, accepting both text and [multi-modal inputs](../features/multimodal_inputs.md) for embedding models.
-[Embeddings API][embeddings-api] is similar to `LLM.embed`, accepting both text and [multi-modal inputs](../features/multimodal_inputs.md) for embedding models.
-[Classification API](../serving/openai_compatible_server.md#classification-api) is similar to `LLM.classify` and is applicable to sequence classification models.
-[Classification API][classification-api] is similar to `LLM.classify` and is applicable to sequence classification models.
-[Score API](../serving/openai_compatible_server.md#score-api) is similar to `LLM.score` for cross-encoder models.
-[Score API][score-api] is similar to `LLM.score` for cross-encoder models.
-[Pooling API](../serving/openai_compatible_server.md#pooling-api) is similar to `LLM.encode`, being applicable to all types of pooling models.
!!! note
Please use one of the more specific methods or set the task directly when using [Pooling API](../serving/openai_compatible_server.md#pooling-api) api.:
- For embeddings, use [Embeddings API](../serving/openai_compatible_server.md#embeddings-api) or `"task":"embed"`.
- For classification logits, use [Classification API](../serving/openai_compatible_server.md#classification-api) or `task":"classify"`.
- For similarity scores, use [Score API](../serving/openai_compatible_server.md#score-api).
- For rewards, `task":"token_classify"`.
- For token classification, use `task":"token_classify"`.
- For multi-vector retrieval, use `task":"token_embed"`
- For IO Processor Plugins , use `task":"plugin"`
```python
# start a supported embeddings model server with `vllm serve`, e.g.
@@ -220,27 +262,31 @@ You can change the output dimensions of embedding models that support Matryoshka
...
@@ -220,27 +262,31 @@ You can change the output dimensions of embedding models that support Matryoshka
```python
```python
fromvllmimportLLM,PoolingParams
fromvllmimportLLM,PoolingParams
llm=LLM(model="jinaai/jina-embeddings-v3",
llm=LLM(
runner="pooling",
model="jinaai/jina-embeddings-v3",
trust_remote_code=True)
runner="pooling",
outputs=llm.embed(["Follow the white rabbit."],
trust_remote_code=True,
pooling_params=PoolingParams(dimensions=32))
)
outputs=llm.embed(
["Follow the white rabbit."],
pooling_params=PoolingParams(dimensions=32),
)
print(outputs[0].outputs)
print(outputs[0].outputs)
```
```
A code example can be found here: <gh-file:examples/offline_inference/pooling/embed_matryoshka_fy.py>
A code example can be found here: [examples/offline_inference/pooling/embed_matryoshka_fy.py](../../examples/offline_inference/pooling/embed_matryoshka_fy.py)
An OpenAI client example can be found here: <gh-file:examples/online_serving/pooling/openai_embedding_matryoshka_fy.py>
An OpenAI client example can be found here: [examples/online_serving/pooling/openai_embedding_matryoshka_fy.py](../../examples/online_serving/pooling/openai_embedding_matryoshka_fy.py)
## Deprecated Features
### Encode task
We have split the `encode` task into two more specific token wise tasks: `token_embed` and `token_classify`:
-`token_embed` is the same as embed, using normalize as activation.
-`token_classify` is the same as classify, default using softmax as activation.
### Remove softmax from PoolingParams
We are going to remove `softmax` and `activation` from `PoolingParams`. Instead, you should set `use_activation`, since we actually allow `classify` and `token_classify` to use any activation function.
@@ -9,30 +9,30 @@ Alongside each architecture, we include some popular models that use it.
...
@@ -9,30 +9,30 @@ Alongside each architecture, we include some popular models that use it.
### vLLM
### vLLM
If vLLM natively supports a model, its implementation can be found in <gh-file:vllm/model_executor/models>.
If vLLM natively supports a model, its implementation can be found in [vllm/model_executor/models](../../vllm/model_executor/models).
These models are what we list in [supported-text-models][supported-text-models] and [supported-mm-models][supported-mm-models].
These models are what we list in [supported text models](#list-of-text-only-language-models) and [supported multimodal models](#list-of-multimodal-language-models).
[](){ #transformers-backend }
### Transformers
### Transformers
vLLM also supports model implementations that are available in Transformers. You should expect the performance of a Transformers model implementation used in vLLM to be within <1% of the performance of a dedicated vLLM model implementation. We call this feature the "Transformers backend".
vLLM also supports model implementations that are available in Transformers. You should expect the performance of a Transformers model implementation used in vLLM to be within <5% of the performance of a dedicated vLLM model implementation. We call this feature the "Transformers modeling backend".
Currently, the Transformers backend works for the following:
Currently, the Transformers modeling backend works for the following:
- Modalities: embedding models, language models and vision-language models*
- Modalities: embedding models, language models and vision-language models*
- Attention types: full attention and/or sliding attention
- Attention types: full attention and/or sliding attention
_*Vision-language models currently accept only image inputs. Support for video inputs will be added in a future release._
_*Vision-language models currently accept only image inputs. Support for video inputs will be added in a future release._
If the Transformers model implementation follows all the steps in [writing a custom model](#writing-custom-models) then, when used with the Transformers backend, it will be compatible with the following features of vLLM:
If the Transformers model implementation follows all the steps in [writing a custom model](#writing-custom-models) then, when used with the Transformers modeling backend, it will be compatible with the following features of vLLM:
- All the features listed in the [compatibility matrix](../features/README.md#feature-x-feature)
- All the features listed in the [compatibility matrix](../features/README.md#feature-x-feature)
- Any combination of the following vLLM parallelisation schemes:
- Any combination of the following vLLM parallelisation schemes:
-Pipeline parallel
-Data parallel
- Tensor parallel
- Tensor parallel
- Expert parallel
- Pipeline parallel
Checking if the modeling backend is Transformers is as simple as:
Checking if the modeling backend is Transformers is as simple as:
If the printed type starts with `Transformers...` then it's using the Transformers model implementation!
If the printed type starts with `Transformers...` then it's using the Transformers model implementation!
If a model has a vLLM implementation but you would prefer to use the Transformers implementation via the Transformers backend, set `model_impl="transformers"` for [offline inference](../serving/offline_inference.md) or `--model-impl transformers` for the [online serving](../serving/openai_compatible_server.md).
If a model has a vLLM implementation but you would prefer to use the Transformers implementation via the Transformers modeling backend, set `model_impl="transformers"` for [offline inference](../serving/offline_inference.md) or `--model-impl transformers` for the [online serving](../serving/openai_compatible_server.md).
!!! note
!!! note
For vision-language models, if you are loading with `dtype="auto"`, vLLM loads the whole model with config's `dtype` if it exists. In contrast the native Transformers will respect the `dtype` attribute of each backbone in the model. That might cause a slight difference in performance.
For vision-language models, if you are loading with `dtype="auto"`, vLLM loads the whole model with config's `dtype` if it exists. In contrast the native Transformers will respect the `dtype` attribute of each backbone in the model. That might cause a slight difference in performance.
...
@@ -53,12 +53,12 @@ If a model has a vLLM implementation but you would prefer to use the Transformer
...
@@ -53,12 +53,12 @@ If a model has a vLLM implementation but you would prefer to use the Transformer
If a model is neither supported natively by vLLM nor Transformers, it can still be used in vLLM!
If a model is neither supported natively by vLLM nor Transformers, it can still be used in vLLM!
For a model to be compatible with the Transformers backend for vLLM it must:
For a model to be compatible with the Transformers modeling backend for vLLM it must:
- be a Transformers compatible custom model (see [Transformers - Customizing models](https://huggingface.co/docs/transformers/en/custom_models)):
- be a Transformers compatible custom model (see [Transformers - Customizing models](https://huggingface.co/docs/transformers/en/custom_models)):
- The model directory must have the correct structure (e.g. `config.json` is present).
- The model directory must have the correct structure (e.g. `config.json` is present).
-`config.json` must contain `auto_map.AutoModel`.
-`config.json` must contain `auto_map.AutoModel`.
- be a Transformers backend for vLLM compatible model (see [writing-custom-models][writing-custom-models]):
- be a Transformers modeling backend for vLLM compatible model (see [Writingcustommodels](#writing-custom-models)):
- Customisation should be done in the base model (e.g. in `MyModel`, not `MyModelForCausalLM`).
- Customisation should be done in the base model (e.g. in `MyModel`, not `MyModelForCausalLM`).
If the compatible model is:
If the compatible model is:
...
@@ -66,18 +66,21 @@ If the compatible model is:
...
@@ -66,18 +66,21 @@ If the compatible model is:
- on the Hugging Face Model Hub, simply set `trust_remote_code=True` for [offline-inference](../serving/offline_inference.md) or `--trust-remote-code` for the [openai-compatible-server](../serving/openai_compatible_server.md).
- on the Hugging Face Model Hub, simply set `trust_remote_code=True` for [offline-inference](../serving/offline_inference.md) or `--trust-remote-code` for the [openai-compatible-server](../serving/openai_compatible_server.md).
- in a local directory, simply pass directory path to `model=<MODEL_DIR>` for [offline-inference](../serving/offline_inference.md) or `vllm serve <MODEL_DIR>` for the [openai-compatible-server](../serving/openai_compatible_server.md).
- in a local directory, simply pass directory path to `model=<MODEL_DIR>` for [offline-inference](../serving/offline_inference.md) or `vllm serve <MODEL_DIR>` for the [openai-compatible-server](../serving/openai_compatible_server.md).
This means that, with the Transformers backend for vLLM, new models can be used before they are officially supported in Transformers or vLLM!
This means that, with the Transformers modeling backend for vLLM, new models can be used before they are officially supported in Transformers or vLLM!
[](){ #writing-custom-models }
#### Writing custom models
#### Writing custom models
This section details the necessary modifications to make to a Transformers compatible custom model that make it compatible with the Transformers backend for vLLM. (We assume that a Transformers compatible custom model has already been created, see [Transformers - Customizing models](https://huggingface.co/docs/transformers/en/custom_models)).
This section details the necessary modifications to make to a Transformers compatible custom model that make it compatible with the Transformers modeling backend for vLLM. (We assume that a Transformers compatible custom model has already been created, see [Transformers - Customizing models](https://huggingface.co/docs/transformers/en/custom_models)).
To make your model compatible with the Transformers backend, it needs:
To make your model compatible with the Transformers modeling backend, it needs:
1.`kwargs` passed down through all modules from `MyModel` to `MyAttention`.
1.`kwargs` passed down through all modules from `MyModel` to `MyAttention`.
1. If your model is encoder-only, you must also add `is_causal = False` to `MyAttention`.
- If your model is encoder-only:
1. Add `is_causal = False` to `MyAttention`.
- If your model is mixture-of-experts (MoE):
1. Your sparse MoE block must have an attribute called `experts`.
2. The class of `experts` (`MyExperts`) must inherit from `nn.ModuleList`.
3.`MyExperts.forward` must accept `hidden_states`, `top_k_index`, `top_k_weights`.
2.`MyAttention` must use `ALL_ATTENTION_FUNCTIONS` to call attention.
2.`MyAttention` must use `ALL_ATTENTION_FUNCTIONS` to call attention.
3.`MyModel` must contain `_supports_attention_backend = True`.
3.`MyModel` must contain `_supports_attention_backend = True`.
...
@@ -104,6 +107,23 @@ class MyAttention(nn.Module):
...
@@ -104,6 +107,23 @@ class MyAttention(nn.Module):
@@ -114,7 +134,7 @@ Here is what happens in the background when this model is loaded:
...
@@ -114,7 +134,7 @@ Here is what happens in the background when this model is loaded:
1. The config is loaded.
1. The config is loaded.
2.`MyModel` Python class is loaded from the `auto_map` in config, and we check that the model `is_backend_compatible()`.
2.`MyModel` Python class is loaded from the `auto_map` in config, and we check that the model `is_backend_compatible()`.
3.`MyModel` is loaded into one of the Transformers backend classes in <gh-file:vllm/model_executor/models/transformers.py> which sets `self.config._attn_implementation = "vllm"` so that vLLM's attention layer is used.
3.`MyModel` is loaded into one of the Transformers modeling backend classes in [vllm/model_executor/models/transformers](../../vllm/model_executor/models/transformers) which sets `self.config._attn_implementation = "vllm"` so that vLLM's attention layer is used.
That's it!
That's it!
...
@@ -162,7 +182,7 @@ To determine whether a given model is natively supported, you can check the `con
...
@@ -162,7 +182,7 @@ To determine whether a given model is natively supported, you can check the `con
If the `"architectures"` field contains a model architecture listed below, then it should be natively supported.
If the `"architectures"` field contains a model architecture listed below, then it should be natively supported.
Models do not _need_ to be natively supported to be used in vLLM.
Models do not _need_ to be natively supported to be used in vLLM.
The [Transformers backend][transformers-backend] enables you to run models directly using their Transformers implementation (or even remote code on the Hugging Face Model Hub!).
The [Transformers modeling backend](#transformers) enables you to run models directly using their Transformers implementation (or even remote code on the Hugging Face Model Hub!).
!!! tip
!!! tip
The easiest way to check if your model is really supported at runtime is to run the program below:
The easiest way to check if your model is really supported at runtime is to run the program below:
Some models are supported only via the [Transformers backend](#transformers). The purpose of the table below is to acknowledge models which we officially support in this way. The logs will say that the Transformers backend is being used, and you will see no warning that this is fallback behaviour. This means that, if you have issues with any of the models listed below, please [make an issue](https://github.com/vllm-project/vllm/issues/new/choose) and we'll do our best to fix it!
Some models are supported only via the [Transformers modeling backend](#transformers). The purpose of the table below is to acknowledge models which we officially support in this way. The logs will say that the Transformers modeling backend is being used, and you will see no warning that this is fallback behaviour. This means that, if you have issues with any of the models listed below, please [make an issue](https://github.com/vllm-project/vllm/issues/new/choose) and we'll do our best to fix it!
<sup>C</sup> Automatically converted into a classification model via `--convert classify`. ([details](./pooling_models.md#model-conversion))
<sup>C</sup> Automatically converted into a classification model via `--convert classify`. ([details](./pooling_models.md#model-conversion))
\* Feature support is the same as that of the original model.
\* Feature support is the same as that of the original model.
...
@@ -541,7 +563,7 @@ These models primarily support the [`LLM.score`](./pooling_models.md#llmscore) A
...
@@ -541,7 +563,7 @@ These models primarily support the [`LLM.score`](./pooling_models.md#llmscore) A
```
```
!!! note
!!! note
Load the official original `Qwen3 Reranker` by using the following command. More information can be found at: <gh-file:examples/offline_inference/pooling/qwen3_reranker.py>.
Load the official original `Qwen3 Reranker` by using the following command. More information can be found at: [examples/offline_inference/pooling/qwen3_reranker.py](../../examples/offline_inference/pooling/qwen3_reranker.py).
Named Entity Recognition (NER) usage, please refer to <gh-file:examples/offline_inference/pooling/ner.py>, <gh-file:examples/online_serving/pooling/ner.py>.
Named Entity Recognition (NER) usage, please refer to [examples/offline_inference/pooling/ner.py](../../examples/offline_inference/pooling/ner.py), [examples/online_serving/pooling/ner_client.py](../../examples/online_serving/pooling/ner_client.py).
[](){ #supported-mm-models }
## List of Multimodal Language Models
## List of Multimodal Language Models
...
@@ -601,35 +622,34 @@ On the other hand, modalities separated by `/` are mutually exclusive.
...
@@ -601,35 +622,34 @@ On the other hand, modalities separated by `/` are mutually exclusive.
See [this page](../features/multimodal_inputs.md) on how to pass multi-modal inputs to the model.
See [this page](../features/multimodal_inputs.md) on how to pass multi-modal inputs to the model.
!!! important
!!! tip
**To enable multiple multi-modal items per text prompt in vLLM V0**, you have to set `limit_mm_per_prompt` (offline inference)
For hybrid-only models such as Llama-4, Step3 and Mistral-3, a text-only mode can be enabled by setting all supported multimodal modalities to 0 (e.g, `--limit-mm-per-prompt '{"image":0}`) so that their multimodal modules will not be loaded to free up more GPU memory for KV cache.
or `--limit-mm-per-prompt` (online serving). For example, to enable passing up to 4 images per text prompt:
Offline inference:
!!! note
vLLM currently only supports dynamic LoRA adapters on the language backbone of multimodal models.
If you wish to use a model with LoRA in the multi-modal encoder,
please merge the weights into the base model first before running it in vLLM like a regular model.
```python
```python
from vllm import LLM
from peft import PeftConfig, PeftModel
from transformers import AutoModelForImageTextToText, AutoProcessor
**This is no longer required if you are using vLLM V1.**
!!! tip
For hybrid-only models such as Llama-4, Step3 and Mistral-3, a text-only mode can be enabled by setting all supported multimodal modalities to 0 (e.g, `--limit-mm-per-prompt '{"image":0}`) so that their multimodal modules will not be loaded to free up more GPU memory for KV cache.
!!! note
vLLM currently only supports adding LoRA to the language backbone of multimodal models.
### Generative Models
### Generative Models
See [this page](generative_models.md) for more information on how to use generative models.
See [this page](generative_models.md) for more information on how to use generative models.
...
@@ -638,69 +658,74 @@ See [this page](generative_models.md) for more information on how to use generat
...
@@ -638,69 +658,74 @@ See [this page](generative_models.md) for more information on how to use generat
These models primarily accept the [`LLM.generate`](./generative_models.md#llmgenerate) API. Chat/Instruct models additionally support the [`LLM.chat`](./generative_models.md#llmchat) API.
These models primarily accept the [`LLM.generate`](./generative_models.md#llmgenerate) API. Chat/Instruct models additionally support the [`LLM.chat`](./generative_models.md#llmchat) API.
| `SkyworkR1VChatModel` | Skywork-R1V-38B | T + I | `Skywork/Skywork-R1V-38B` | | ✅︎ |
Some models are supported only via the [Transformers backend](#transformers). The purpose of the table below is to acknowledge models which we officially support in this way. The logs will say that the Transformers backend is being used, and you will see no warning that this is fallback behaviour. This means that, if you have issues with any of the models listed below, please [make an issue](https://github.com/vllm-project/vllm/issues/new/choose) and we'll do our best to fix it!
| `SmolVLMForConditionalGeneration` | SmolVLM2 | T + I | `SmolVLM2-2.2B-Instruct` | ✅︎ | |
| `Emu3ForConditionalGeneration` | Emu3 | T + I | `BAAI/Emu3-Chat-hf` | ✅︎ | ✅︎ | ✅︎ |
Some models are supported only via the [Transformers modeling backend](#transformers). The purpose of the table below is to acknowledge models which we officially support in this way. The logs will say that the Transformers modeling backend is being used, and you will see no warning that this is fallback behaviour. This means that, if you have issues with any of the models listed below, please [make an issue](https://github.com/vllm-project/vllm/issues/new/choose) and we'll do our best to fix it!
| `Emu3ForConditionalGeneration` | Emu3 | T + I | `BAAI/Emu3-Chat-hf` | ✅︎ | ✅︎ |
<sup>^</sup> You need to set the architecture name via `--hf-overrides` to match the one in vLLM.
<sup>^</sup> You need to set the architecture name via `--hf-overrides` to match the one in vLLM.
• For example, to use DeepSeek-VL2 series models:
• For example, to use DeepSeek-VL2 series models:
...
@@ -740,57 +765,29 @@ Some models are supported only via the [Transformers backend](#transformers). Th
...
@@ -740,57 +765,29 @@ Some models are supported only via the [Transformers backend](#transformers). Th
!!! note
!!! note
To use `TIGER-Lab/Mantis-8B-siglip-llama3`, you have to pass `--hf_overrides '{"architectures": ["MantisForConditionalGeneration"]}'` when running vLLM.
To use `TIGER-Lab/Mantis-8B-siglip-llama3`, you have to pass `--hf_overrides '{"architectures": ["MantisForConditionalGeneration"]}'` when running vLLM.
!!! warning
The output quality of `AllenAI/Molmo-7B-D-0924` (especially in object localization tasks) has deteriorated in recent updates.
For the best results, we recommend using the following dependency versions (tested on A10 and L40):
??? code "Dependency versions"
```text
# Core vLLM-compatible dependencies with Molmo accuracy setup (tested on L40)
torch==2.5.1
torchvision==0.20.1
transformers==4.48.1
tokenizers==0.21.0
tiktoken==0.7.0
vllm==0.7.0
# Optional but recommended for improved performance and stability
triton==3.1.0
xformers==0.0.28.post3
uvloop==0.21.0
protobuf==5.29.3
openai==1.60.2
opencv-python-headless==4.11.0.86
pillow==10.4.0
# Installed FlashAttention (for float16 only)
flash-attn>=2.5.6 # Not used in float32, but should be documented
```
**Note:** Make sure you understand the security implications of using outdated packages.
!!! note
!!! note
The official `openbmb/MiniCPM-V-2` doesn't work yet, so we need to use a fork (`HwwwH/MiniCPM-V-2`) for now.
The official `openbmb/MiniCPM-V-2` doesn't work yet, so we need to use a fork (`HwwwH/MiniCPM-V-2`) for now.
For more details, please see: <gh-pr:4087#issuecomment-2250397630>
For more details, please see: <https://github.com/vllm-project/vllm/pull/4087#issuecomment-2250397630>
!!! warning
!!! warning
Our PaliGemma implementations have the same problem as Gemma 3 (see above) for both V0 and V1.
Our PaliGemma implementations have the same problem as Gemma 3 (see above) for both V0 and V1.
!!! note
!!! note
For Qwen2.5-Omni, reading audio from video pre-processing (`--mm-processor-kwargs '{"use_audio_in_video": true}'`)
For Qwen2.5-Omni and Qwen3-Omni, reading audio from video pre-processing (`--mm-processor-kwargs '{"use_audio_in_video": true}'`) is currently work in progress and not yet supported.
is currently supported on V0 (but not V1), because overlapping modalities is not yet supported in V1.
#### Transcription
#### Transcription
Speech2Text models trained specifically for Automatic Speech Recognition.
Speech2Text models trained specifically for Automatic Speech Recognition.
| `JinaVLForSequenceClassification` | JinaVL-based | T + I<sup>E+</sup> | `jinaai/jina-reranker-m0`, etc. | | | ✅︎ |
| `JinaVLForSequenceClassification` | JinaVL-based | T + I<sup>E+</sup> | `jinaai/jina-reranker-m0`, etc. | ✅︎ | ✅︎ |
<sup>C</sup> Automatically converted into a classification model via `--convert classify`. ([details](./pooling_models.md#model-conversion))
<sup>C</sup> Automatically converted into a classification model via `--convert classify`. ([details](./pooling_models.md#model-conversion))
\* Feature support is the same as that of the original model.
\* Feature support is the same as that of the original model.
...
@@ -853,5 +852,5 @@ We have the following levels of testing for models:
...
@@ -853,5 +852,5 @@ We have the following levels of testing for models:
1.**Strict Consistency**: We compare the output of the model with the output of the model in the HuggingFace Transformers library under greedy decoding. This is the most stringent test. Please refer to [models tests](https://github.com/vllm-project/vllm/blob/main/tests/models) for the models that have passed this test.
1.**Strict Consistency**: We compare the output of the model with the output of the model in the HuggingFace Transformers library under greedy decoding. This is the most stringent test. Please refer to [models tests](https://github.com/vllm-project/vllm/blob/main/tests/models) for the models that have passed this test.
2.**Output Sensibility**: We check if the output of the model is sensible and coherent, by measuring the perplexity of the output and checking for any obvious errors. This is a less stringent test.
2.**Output Sensibility**: We check if the output of the model is sensible and coherent, by measuring the perplexity of the output and checking for any obvious errors. This is a less stringent test.
3.**Runtime Functionality**: We check if the model can be loaded and run without errors. This is the least stringent test. Please refer to [functionality tests](gh-dir:tests) and [examples](gh-dir:examples) for the models that have passed this test.
3.**Runtime Functionality**: We check if the model can be loaded and run without errors. This is the least stringent test. Please refer to [functionality tests](../../tests) and [examples](../../examples) for the models that have passed this test.
4.**Community Feedback**: We rely on the community to provide feedback on the models. If a model is broken or not working as expected, we encourage users to raise issues to report it or open pull requests to fix it. The rest of the models fall under this category.
4.**Community Feedback**: We rely on the community to provide feedback on the models. If a model is broken or not working as expected, we encourage users to raise issues to report it or open pull requests to fix it. The rest of the models fall under this category.
Context parallel mainly solves the problem of serving long context requests. As prefill and decode present quite different characteristics and have quite different SLO (service level objectives), we need to implement context parallel separately for them. The major considerations are:
- For long context prefill, we need to control the TTFT (time to first token) by amortizing the computation time of the prefill across query tokens.
- For long context decode, we need more space for KV cache to increase the batchsize (and hence the throughput).
## Prefill Context Parallel
During prefill, for a long request with `T` new tokens, we need to compute query/key/value tensors for these new tokens. Say we have `N` GPUs, we can split the request into `N` chunks, and each GPU computes one chunk of the query/key/value tensors.
Depending on the use case, there're two possible strategies:
1. Partial query, full key/value: If the request token length is moderately long (we can afford holding the full key/value tensors), and the goal is to accelerate the prefill (and amortize the computation time of the prefill across query tokens), then we can gather the key/value tensors from all GPUs and let each GPU compute the attention output corresponding to the query tokens of its chunk.
2. Partial query, partial key/value: If the request token length is too long, we cannot afford holding the full key/value tensors anymore, then we can only compute one chunk of query/key/value tensors for each GPU, and use techniques like [ring-attention](http://arxiv.org/abs/2310.01889) to send/recv key/value tensors chunk by chunk.
Both approaches are under active development.
## Decode Context Parallel
Due to the auto-regressive nature of decoding, every decoding step needs to compute a small amount of query tokens w.r.t. a large number of key/value tokens stored in the paged KV cache. The core of decode context parallel is how to shard the KV cache across GPUs.
For a model with `H` kv-heads, a request with `T` tokens in the context needs to store `H * T` key/value tensors in the KV cache.
1. If one GPU can hold them all, and the performance is good enough, then no parallelization is needed.
2. If one GPU cannot hold them all, or we want to hold more requests in the KV cache, we can first shard the KV cache along the `H` dimension, that's the plain tensor parallel sharding. It's as simple as adding `-tp <num_gpus>` to the command line.
3. Since `H` is limited (determined by the model architecture), when we continue to increase the tensor parallel size, the KV cache for each GPU will be duplicated for `tp_size / H` times. Of course, duplication is not good for efficiency. Then we need to add decode context parallel to further shard the KV cache along the `T` dimension. This is as simple as adding `-dcp <size>` to the command line. Note that `size` does not increase the number of GPUs we need to launch, but just reduces the KV cache duplication. The dcp size should lie in the range of `[1, tp_size/H]`. With larger dcp size, the KV cache duplication is reduced, but the communication overhead increases.
Theoretically, it is possible to extend the dcp size beyond `tp_size / H` to further shard the KV cache and accelerate the decoding phase. However, since the number of query tokens is limited in decoding, it's unclear what should we do for the remaining `dcp_size - tp_size / H` GPUs for non-attention layers. For the sake of simplicity, dcp size is upper bounded by `tp_size / H`. If you want to further accelerate the decoding phase, you can consider increasing the `tp_size` first, and then increasing the dcp size.
Note that kv cache can grow during decoding, and the sharding strategy needs to be carefully implemented. We use an interleaving strategy to shard the KV cache along the `T` dimension, so that kv cache for future tokens can be naturally sharded along the `T` dimension. This is proposed by [Chao Hong from Moonshot](https://github.com/youzhedian), and also explained in details in [this paper](http://arxiv.org/abs/2507.07120).
Case study:
For DeepSeek-R1, we have 1 kv-head when MLA is enabled. The typical single-node deployment with `-tp 8` causes 8x KV cache duplication. We can consider adding `-dcp 8` to reduce the KV cache duplication.
For Kimi-K2, the architecture is similar to DeepSeek-R1, but with more parameters. When we deploy it with `-tp 16`, the KV cache duplication is 16x. We can add `-dcp 16` to completely remove the KV cache duplication, at the cost of more communication overhead. We can also add `-dcp 8` to reduce the KV cache duplication to 2x. Although it still duplicates the KV cache twice, the communication overhead is smaller since the DCP communication only happens inside one node.
For Qwen3-235B-A22B, we have 4 kv-heads. When we deploy it with `-tp 8`, the KV cache duplication is 2x. Then we can add `-dcp 2` to remove the KV cache duplication.
In short, for decode context parallel, try to increase `-tp` size until you get satisfactory performance, and then add `-dcp` to reduce the KV cache duplication.
Decode context parallel is supported in vLLM, for both MLA and GQA models. Some attention backends also support the combination of decode context parallel and MTP (multi-token prediction) to further accelerate the decoding phase.
## Technical Discussions
The main discussions happen in the `#sig-context-parallel` channel of [vLLM Slack](https://slack.vllm.ai/).
@@ -16,7 +16,7 @@ For MoE models, when any requests are in progress in any rank, we must ensure th
...
@@ -16,7 +16,7 @@ For MoE models, when any requests are in progress in any rank, we must ensure th
In all cases, it is beneficial to load-balance requests between DP ranks. For online deployments, this balancing can be optimized by taking into account the state of each DP engine - in particular its currently scheduled and waiting (queued) requests, and KV cache state. Each DP engine has an independent KV cache, and the benefit of prefix caching can be maximized by directing prompts intelligently.
In all cases, it is beneficial to load-balance requests between DP ranks. For online deployments, this balancing can be optimized by taking into account the state of each DP engine - in particular its currently scheduled and waiting (queued) requests, and KV cache state. Each DP engine has an independent KV cache, and the benefit of prefix caching can be maximized by directing prompts intelligently.
This document focuses on online deployments (with the API server). DP + EP is also supported for offline usage (via the LLM class), for an example see <gh-file:examples/offline_inference/data_parallel.py>.
This document focuses on online deployments (with the API server). DP + EP is also supported for offline usage (via the LLM class), for an example see [examples/offline_inference/data_parallel.py](../../examples/offline_inference/data_parallel.py).
There are two distinct modes supported for online deployments - self-contained with internal load balancing, or externally per-rank process deployment and load balancing.
There are two distinct modes supported for online deployments - self-contained with internal load balancing, or externally per-rank process deployment and load balancing.
...
@@ -69,6 +69,7 @@ There are several notable differences when using Ray:
...
@@ -69,6 +69,7 @@ There are several notable differences when using Ray:
- A single launch command (on any node) is needed to start all local and remote DP ranks, therefore it is more convenient compared to launching on each node
- A single launch command (on any node) is needed to start all local and remote DP ranks, therefore it is more convenient compared to launching on each node
- There is no need to specify `--data-parallel-address`, and the node where the command is run is used as `--data-parallel-address`
- There is no need to specify `--data-parallel-address`, and the node where the command is run is used as `--data-parallel-address`
- There is no need to specify `--data-parallel-rpc-port`
- There is no need to specify `--data-parallel-rpc-port`
- When a single DP group requires multiple nodes, *e.g.* in case a single model replica needs to run on at least two nodes, make sure to set `VLLM_RAY_DP_PACK_STRATEGY="span"` in which case `--data-parallel-size-local` is ignored and will be automatically determined
- Remote DP ranks will be allocated based on node resources of the Ray cluster
- Remote DP ranks will be allocated based on node resources of the Ray cluster
Currently, the internal DP load balancing is done within the API server process(es) and is based on the running and waiting queues in each of the engines. This could be made more sophisticated in future by incorporating KV cache aware logic.
Currently, the internal DP load balancing is done within the API server process(es) and is based on the running and waiting queues in each of the engines. This could be made more sophisticated in future by incorporating KV cache aware logic.
@@ -4,11 +4,11 @@ For general troubleshooting, see [Troubleshooting](../usage/troubleshooting.md).
...
@@ -4,11 +4,11 @@ For general troubleshooting, see [Troubleshooting](../usage/troubleshooting.md).
## Verify inter-node GPU communication
## Verify inter-node GPU communication
After you start the Ray cluster, verify GPU-to-GPU communication across nodes. Proper configuration can be non-trivial. For more information, see [troubleshooting script][troubleshooting-incorrect-hardware-driver]. If you need additional environment variables for communication configuration, append them to <gh-file:examples/online_serving/run_cluster.sh>, for example `-e NCCL_SOCKET_IFNAME=eth0`. Setting environment variables during cluster creation is recommended because the variables propagate to all nodes. In contrast, setting environment variables in the shell affects only the local node. For more information, see <gh-issue:6803>.
After you start the Ray cluster, verify GPU-to-GPU communication across nodes. Proper configuration can be non-trivial. For more information, see [troubleshooting script](../usage/troubleshooting.md#incorrect-hardwaredriver). If you need additional environment variables for communication configuration, append them to [examples/online_serving/run_cluster.sh](../../examples/online_serving/run_cluster.sh), for example `-e NCCL_SOCKET_IFNAME=eth0`. Setting environment variables during cluster creation is recommended because the variables propagate to all nodes. In contrast, setting environment variables in the shell affects only the local node. For more information, see <https://github.com/vllm-project/vllm/issues/6803>.
## No available node types can fulfill resource request
## No available node types can fulfill resource request
The error message `Error: No available node types can fulfill resource request` can appear even when the cluster has enough GPUs. The issue often occurs when nodes have multiple IP addresses and vLLM can't select the correct one. Ensure that vLLM and Ray use the same IP address by setting `VLLM_HOST_IP` in <gh-file:examples/online_serving/run_cluster.sh> (with a different value on each node). Use `ray status` and `ray list nodes` to verify the chosen IP address. For more information, see <gh-issue:7815>.
The error message `Error: No available node types can fulfill resource request` can appear even when the cluster has enough GPUs. The issue often occurs when nodes have multiple IP addresses and vLLM can't select the correct one. Ensure that vLLM and Ray use the same IP address by setting `VLLM_HOST_IP` in [examples/online_serving/run_cluster.sh](../../examples/online_serving/run_cluster.sh)(with a different value on each node). Use `ray status` and `ray list nodes` to verify the chosen IP address. For more information, see <https://github.com/vllm-project/vllm/issues/7815>.
@@ -8,19 +8,22 @@ EP is typically coupled with Data Parallelism (DP). While DP can be used indepen
...
@@ -8,19 +8,22 @@ EP is typically coupled with Data Parallelism (DP). While DP can be used indepen
Before using EP, you need to install the necessary dependencies. We are actively working on making this easier in the future:
Before using EP, you need to install the necessary dependencies. We are actively working on making this easier in the future:
1.**Install DeepEP and pplx-kernels**: Set up host environment following vLLM's guide for EP kernels [here](gh-file:tools/ep_kernels).
1.**Install DeepEP and pplx-kernels**: Set up host environment following vLLM's guide for EP kernels [here](../../tools/ep_kernels).
2.**Install DeepGEMM library**: Follow the [official instructions](https://github.com/deepseek-ai/DeepGEMM#installation).
2.**Install DeepGEMM library**: Follow the [official instructions](https://github.com/deepseek-ai/DeepGEMM#installation).
3.**For disaggregated serving**: Install `gdrcopy` by running the [`install_gdrcopy.sh`](gh-file:tools/install_gdrcopy.sh) script (e.g., `install_gdrcopy.sh "${GDRCOPY_OS_VERSION}" "12.8" "x64"`). You can find available OS versions [here](https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/).
3.**For disaggregated serving**: Install `gdrcopy` by running the [`install_gdrcopy.sh`](../../tools/install_gdrcopy.sh) script (e.g., `install_gdrcopy.sh "${GDRCOPY_OS_VERSION}" "12.8" "x64"`). You can find available OS versions [here](https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/).
### Backend Selection Guide
### Backend Selection Guide
vLLM provides three communication backends for EP:
vLLM provides multiple communication backends for EP. Use `--all2all-backend` to select one:
| Backend | Use Case | Features | Best For |
| Backend | Use Case | Features | Best For |
|---------|----------|----------|----------|
|---------|----------|----------|----------|
| `pplx` | Single node | Chunked prefill support | Development, best for intra-node deployments |
| `allgather_reducescatter` | Default backend | Standard all2all using allgather/reducescatter primitives | General purpose, works with any EP+DP configuration |
@@ -191,7 +195,7 @@ For production deployments requiring strict SLA guarantees for time-to-first-tok
...
@@ -191,7 +195,7 @@ For production deployments requiring strict SLA guarantees for time-to-first-tok
### Setup Steps
### Setup Steps
1.**Install gdrcopy/ucx/nixl**: For maximum performance, run the [install_gdrcopy.sh](gh-file:tools/install_gdrcopy.sh) script to install `gdrcopy` (e.g., `install_gdrcopy.sh "${GDRCOPY_OS_VERSION}" "12.8" "x64"`). You can find available OS versions [here](https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/). If `gdrcopy` is not installed, things will still work with a plain `pip install nixl`, just with lower performance. `nixl` and `ucx` are installed as dependencies via pip.
1.**Install gdrcopy/ucx/nixl**: For maximum performance, run the [install_gdrcopy.sh](../../tools/install_gdrcopy.sh) script to install `gdrcopy` (e.g., `install_gdrcopy.sh "${GDRCOPY_OS_VERSION}" "12.8" "x64"`). You can find available OS versions [here](https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/). If `gdrcopy` is not installed, things will still work with a plain `pip install nixl`, just with lower performance. `nixl` and `ucx` are installed as dependencies via pip. For non-cuda platform to install nixl with non-cuda UCX build, run the [install_nixl_from_source_ubuntu.py](../../tools/install_nixl_from_source_ubuntu.py) script.
2.**Configure Both Instances**: Add this flag to both prefill and decode instances `--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}`. Noted, you may also specify one or multiple NIXL_Backend. Such as: `--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both", "kv_connector_extra_config":{"backends":["UCX", "GDS"]}}'`
2.**Configure Both Instances**: Add this flag to both prefill and decode instances `--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}`. Noted, you may also specify one or multiple NIXL_Backend. Such as: `--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both", "kv_connector_extra_config":{"backends":["UCX", "GDS"]}}'`
...
@@ -239,10 +243,10 @@ try:
...
@@ -239,10 +243,10 @@ try:
"remote_engine_id":None,# Will be populated by vLLM
"remote_engine_id":None,# Will be populated by vLLM
"remote_block_ids":None,# Will be populated by vLLM
"remote_block_ids":None,# Will be populated by vLLM
"remote_host":None,# Will be populated by vLLM
"remote_host":None,# Will be populated by vLLM
"remote_port":None# Will be populated by vLLM
"remote_port":None,# Will be populated by vLLM
}
}
},
},
extra_headers={"X-Request-Id":request_id}
extra_headers={"X-Request-Id":request_id},
)
)
print("-"*50)
print("-"*50)
...
@@ -258,7 +262,7 @@ try:
...
@@ -258,7 +262,7 @@ try:
extra_body={
extra_body={
"kv_transfer_params":prefill_response.kv_transfer_params# Pass KV cache info
"kv_transfer_params":prefill_response.kv_transfer_params# Pass KV cache info
},
},
extra_headers={"X-Request-Id":request_id}# Same request ID
extra_headers={"X-Request-Id":request_id},# Same request ID