Commit 469e903b authored by zhuwenwen

Merge tag 'v0.8.2' into v0.8.2-dev

parents 389ebcf7 25f560a6
......@@ -15,7 +15,7 @@ more are listed [here](#supported-models).
By extracting hidden states, vLLM can automatically convert text generation models like [Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B),
[Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) into embedding models,
but they are expected be inferior to models that are specifically trained on embedding tasks.
but they are expected to be inferior to models that are specifically trained on embedding tasks.
______________________________________________________________________
......
......@@ -8,21 +8,21 @@ vLLM supports the following hardware platforms:
:maxdepth: 1
:hidden:
gpu/index
cpu/index
ai_accelerator/index
installation/gpu
installation/cpu
installation/ai_accelerator
:::
- <project:gpu/index.md>
- <project:installation/gpu.md>
- NVIDIA CUDA
- AMD ROCm
- Intel XPU
- <project:cpu/index.md>
- <project:installation/cpu.md>
- Intel/AMD x86
- ARM AArch64
- Apple silicon
- <project:ai_accelerator/index.md>
- IBM Z (S390X)
- <project:installation/ai_accelerator.md>
- Google TPU
- Intel Gaudi
- AWS Neuron
- OpenVINO
......@@ -9,7 +9,7 @@ vLLM is a Python library that supports the following AI accelerators. Select you
:selected:
:sync: tpu
:::{include} tpu.inc.md
:::{include} ai_accelerator/tpu.inc.md
:start-after: "# Installation"
:end-before: "## Requirements"
:::
......@@ -19,7 +19,7 @@ vLLM is a Python library that supports the following AI accelerators. Select you
::::{tab-item} Intel Gaudi
:sync: hpu-gaudi
:::{include} hpu-gaudi.inc.md
:::{include} ai_accelerator/hpu-gaudi.inc.md
:start-after: "# Installation"
:end-before: "## Requirements"
:::
......@@ -29,17 +29,7 @@ vLLM is a Python library that supports the following AI accelerators. Select you
::::{tab-item} AWS Neuron
:sync: neuron
:::{include} neuron.inc.md
:start-after: "# Installation"
:end-before: "## Requirements"
:::
::::
::::{tab-item} OpenVINO
:sync: openvino
:::{include} openvino.inc.md
:::{include} ai_accelerator/neuron.inc.md
:start-after: "# Installation"
:end-before: "## Requirements"
:::
......@@ -56,7 +46,7 @@ vLLM is a Python library that supports the following AI accelerators. Select you
::::{tab-item} Google TPU
:sync: tpu
:::{include} tpu.inc.md
:::{include} ai_accelerator/tpu.inc.md
:start-after: "## Requirements"
:end-before: "## Configure a new environment"
:::
......@@ -66,7 +56,7 @@ vLLM is a Python library that supports the following AI accelerators. Select you
::::{tab-item} Intel Gaudi
:sync: hpu-gaudi
:::{include} hpu-gaudi.inc.md
:::{include} ai_accelerator/hpu-gaudi.inc.md
:start-after: "## Requirements"
:end-before: "## Configure a new environment"
:::
......@@ -76,23 +66,13 @@ vLLM is a Python library that supports the following AI accelerators. Select you
::::{tab-item} AWS Neuron
:sync: neuron
:::{include} neuron.inc.md
:::{include} ai_accelerator/neuron.inc.md
:start-after: "## Requirements"
:end-before: "## Configure a new environment"
:::
::::
::::{tab-item} OpenVINO
:sync: openvino
:::{include} openvino.inc.md
:start-after: "## Requirements"
:end-before: "## Set up using Python"
:::
::::
:::::
## Configure a new environment
......@@ -103,7 +83,7 @@ vLLM is a Python library that supports the following AI accelerators. Select you
::::{tab-item} Google TPU
:sync: tpu
:::{include} tpu.inc.md
:::{include} ai_accelerator/tpu.inc.md
:start-after: "## Configure a new environment"
:end-before: "## Set up using Python"
:::
......@@ -113,7 +93,7 @@ vLLM is a Python library that supports the following AI accelerators. Select you
::::{tab-item} Intel Gaudi
:sync: hpu-gaudi
:::{include} hpu-gaudi.inc.md
:::{include} ai_accelerator/hpu-gaudi.inc.md
:start-after: "## Configure a new environment"
:end-before: "## Set up using Python"
:::
......@@ -123,21 +103,13 @@ vLLM is a Python library that supports the following AI accelerators. Select you
::::{tab-item} AWS Neuron
:sync: neuron
:::{include} neuron.inc.md
:::{include} ai_accelerator/neuron.inc.md
:start-after: "## Configure a new environment"
:end-before: "## Set up using Python"
:::
::::
::::{tab-item} OpenVINO
:sync: openvino
:::{include} ../python_env_setup.inc.md
:::
::::
:::::
## Set up using Python
......@@ -150,7 +122,7 @@ vLLM is a Python library that supports the following AI accelerators. Select you
::::{tab-item} Google TPU
:sync: tpu
:::{include} tpu.inc.md
:::{include} ai_accelerator/tpu.inc.md
:start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source"
:::
......@@ -160,7 +132,7 @@ vLLM is a Python library that supports the following AI accelerators. Select you
::::{tab-item} Intel Gaudi
:sync: hpu-gaudi
:::{include} hpu-gaudi.inc.md
:::{include} ai_accelerator/hpu-gaudi.inc.md
:start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source"
:::
......@@ -170,17 +142,7 @@ vLLM is a Python library that supports the following AI accelerators. Select you
::::{tab-item} AWS Neuron
:sync: neuron
:::{include} neuron.inc.md
:start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source"
:::
::::
::::{tab-item} OpenVINO
:sync: openvino
:::{include} openvino.inc.md
:::{include} ai_accelerator/neuron.inc.md
:start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source"
:::
......@@ -197,7 +159,7 @@ vLLM is a Python library that supports the following AI accelerators. Select you
::::{tab-item} Google TPU
:sync: tpu
:::{include} tpu.inc.md
:::{include} ai_accelerator/tpu.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
:::
......@@ -207,7 +169,7 @@ vLLM is a Python library that supports the following AI accelerators. Select you
::::{tab-item} Intel Gaudi
:sync: hpu-gaudi
:::{include} hpu-gaudi.inc.md
:::{include} ai_accelerator/hpu-gaudi.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
:::
......@@ -217,17 +179,7 @@ vLLM is a Python library that supports the following AI accelerators. Select you
::::{tab-item} AWS Neuron
:sync: neuron
:::{include} neuron.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
:::
::::
::::{tab-item} OpenVINO
:sync: openvino
:::{include} openvino.inc.md
:::{include} ai_accelerator/neuron.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
:::
......@@ -246,7 +198,7 @@ vLLM is a Python library that supports the following AI accelerators. Select you
::::{tab-item} Google TPU
:sync: tpu
:::{include} tpu.inc.md
:::{include} ai_accelerator/tpu.inc.md
:start-after: "### Pre-built images"
:end-before: "### Build image from source"
:::
......@@ -256,7 +208,7 @@ vLLM is a Python library that supports the following AI accelerators. Select you
::::{tab-item} Intel Gaudi
:sync: hpu-gaudi
:::{include} hpu-gaudi.inc.md
:::{include} ai_accelerator/hpu-gaudi.inc.md
:start-after: "### Pre-built images"
:end-before: "### Build image from source"
:::
......@@ -266,17 +218,7 @@ vLLM is a Python library that supports the following AI accelerators. Select you
::::{tab-item} AWS Neuron
:sync: neuron
:::{include} neuron.inc.md
:start-after: "### Pre-built images"
:end-before: "### Build image from source"
:::
::::
::::{tab-item} OpenVINO
:sync: openvino
:::{include} openvino.inc.md
:::{include} ai_accelerator/neuron.inc.md
:start-after: "### Pre-built images"
:end-before: "### Build image from source"
:::
......@@ -293,7 +235,7 @@ vLLM is a Python library that supports the following AI accelerators. Select you
::::{tab-item} Google TPU
:sync: tpu
:::{include} tpu.inc.md
:::{include} ai_accelerator/tpu.inc.md
:start-after: "### Build image from source"
:end-before: "## Extra information"
:::
......@@ -303,7 +245,7 @@ vLLM is a Python library that supports the following AI accelerators. Select you
::::{tab-item} Intel Gaudi
:sync: hpu-gaudi
:::{include} hpu-gaudi.inc.md
:::{include} ai_accelerator/hpu-gaudi.inc.md
:start-after: "### Build image from source"
:end-before: "## Extra information"
:::
......@@ -313,17 +255,7 @@ vLLM is a Python library that supports the following AI accelerators. Select you
::::{tab-item} AWS Neuron
:sync: neuron
:::{include} neuron.inc.md
:start-after: "### Build image from source"
:end-before: "## Extra information"
:::
::::
::::{tab-item} OpenVINO
:sync: openvino
:::{include} openvino.inc.md
:::{include} ai_accelerator/neuron.inc.md
:start-after: "### Build image from source"
:end-before: "## Extra information"
:::
......@@ -340,7 +272,7 @@ vLLM is a Python library that supports the following AI accelerators. Select you
::::{tab-item} Google TPU
:sync: tpu
:::{include} tpu.inc.md
:::{include} ai_accelerator/tpu.inc.md
:start-after: "## Extra information"
:::
......@@ -349,7 +281,7 @@ vLLM is a Python library that supports the following AI accelerators. Select you
::::{tab-item} Intel Gaudi
:sync: hpu-gaudi
:::{include} hpu-gaudi.inc.md
:::{include} ai_accelerator/hpu-gaudi.inc.md
:start-after: "## Extra information"
:::
......@@ -358,16 +290,7 @@ vLLM is a Python library that supports the following AI accelerators. Select you
::::{tab-item} AWS Neuron
:sync: neuron
:::{include} neuron.inc.md
:start-after: "## Extra information"
:::
::::
::::{tab-item} OpenVINO
:sync: openvino
:::{include} openvino.inc.md
:::{include} ai_accelerator/neuron.inc.md
:start-after: "## Extra information"
:::
......
......@@ -63,7 +63,7 @@ To build and install vLLM from source, run:
```console
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -r requirements-hpu.txt
pip install -r requirements/hpu.txt
python setup.py develop
```
......@@ -73,7 +73,7 @@ Currently, the latest features and performance optimizations are developed in Ga
git clone https://github.com/HabanaAI/vllm-fork.git
cd vllm-fork
git checkout habana_main
pip install -r requirements-hpu.txt
pip install -r requirements/hpu.txt
python setup.py develop
```
......@@ -119,7 +119,7 @@ If you're observing the following error: `docker: Error response from daemon: Un
## Supported configurations
The following configurations have been validated to be function with
The following configurations have been validated to function with
Gaudi2 devices. Configurations that are not listed may or may not work.
- [meta-llama/Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b)
......
......@@ -116,7 +116,7 @@ Once neuronx-cc and transformers-neuronx packages are installed, we will be able
```console
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -U -r requirements-neuron.txt
pip install -U -r requirements/neuron.txt
VLLM_TARGET_DEVICE="neuron" pip install .
```
......
# Installation
vLLM powered by OpenVINO supports all LLM models from [vLLM supported models list](#supported-models) and can perform optimal model serving on all x86-64 CPUs with, at least, AVX2 support, as well as on both integrated and discrete Intel® GPUs ([the list of supported GPUs](https://docs.openvino.ai/2024/about-openvino/release-notes-openvino/system-requirements.html#gpu)).
:::{attention}
There are no pre-built wheels or images for this device, so you must build vLLM from source.
:::
## Requirements
- OS: Linux
- Instruction set architecture (ISA) requirement: at least AVX2.
## Set up using Python
### Pre-built wheels
Currently, there are no pre-built OpenVINO wheels.
### Build wheel from source
First, install Python and ensure you have the latest pip. For example, on Ubuntu 22.04, you can run:
```console
sudo apt-get update -y
sudo apt-get install python3
pip install --upgrade pip
```
Second, clone vLLM and install prerequisites for the vLLM OpenVINO backend installation:
```console
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -r requirements-build.txt --extra-index-url https://download.pytorch.org/whl/cpu
```
Finally, install vLLM with OpenVINO backend:
```console
PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" VLLM_TARGET_DEVICE=openvino python -m pip install -v .
```
:::{tip}
To use vLLM OpenVINO backend with a GPU device, ensure your system is properly set up. Follow the instructions provided here: [https://docs.openvino.ai/2024/get-started/configurations/configurations-intel-gpu.html](https://docs.openvino.ai/2024/get-started/configurations/configurations-intel-gpu.html).
:::
## Set up using Docker
### Pre-built images
Currently, there are no pre-built OpenVINO images.
### Build image from source
```console
docker build -f Dockerfile.openvino -t vllm-openvino-env .
docker run -it --rm vllm-openvino-env
```
## Extra information
## Supported features
OpenVINO vLLM backend supports the following advanced vLLM features:
- Prefix caching (`--enable-prefix-caching`)
- Chunked prefill (`--enable-chunked-prefill`)
## Performance tips
### vLLM OpenVINO backend environment variables
- `VLLM_OPENVINO_DEVICE` to specify which device to use for inference. If there are multiple GPUs in the system, an additional index can be used to choose the proper one (e.g., `VLLM_OPENVINO_DEVICE=GPU.1`). If the value is not specified, the CPU device is used by default.
- `VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON` to enable U8 weights compression during model loading stage. By default, compression is turned off. You can also export model with different compression techniques using `optimum-cli` and pass exported folder as `<model_id>`
### CPU performance tips
CPU uses the following environment variables to control behavior:
- `VLLM_OPENVINO_KVCACHE_SPACE` to specify the KV Cache size (e.g, `VLLM_OPENVINO_KVCACHE_SPACE=40` means 40 GB space for KV cache), larger setting will allow vLLM running more requests in parallel. This parameter should be set based on the hardware configuration and memory management pattern of users.
- `VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8` to control KV cache precision. By default, FP16 / BF16 is used depending on platform.
To enable better TPOT / TTFT latency, you can use vLLM's chunked prefill feature (`--enable-chunked-prefill`). Based on the experiments, the recommended batch size is `256` (`--max-num-batched-tokens`)
OpenVINO best known configuration for CPU is:
```console
$ VLLM_OPENVINO_KVCACHE_SPACE=100 VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8 VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON \
python3 vllm/benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-chat-hf --dataset vllm/benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json --enable-chunked-prefill --max-num-batched-tokens 256
```
### GPU performance tips
GPU device implements the logic for automatic detection of available GPU memory and, by default, tries to reserve as much memory as possible for the KV cache (taking into account `gpu_memory_utilization` option). However, this behavior can be overridden by explicitly specifying the desired amount of memory for the KV cache using `VLLM_OPENVINO_KVCACHE_SPACE` environment variable (e.g, `VLLM_OPENVINO_KVCACHE_SPACE=8` means 8 GB space for KV cache).
Currently, the best performance using GPU can be achieved with the default vLLM execution parameters for models with quantized weights (8 and 4-bit integer data types are supported) and `preemption-mode=swap`.
OpenVINO best known configuration for GPU is:
```console
$ VLLM_OPENVINO_DEVICE=GPU VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON \
python3 vllm/benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-chat-hf --dataset vllm/benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json
```
## Limitations
- LoRA serving is not supported.
- Only LLM models are currently supported. LLaVa and encoder-decoder models are not currently enabled in vLLM OpenVINO integration.
- Tensor and pipeline parallelism are not currently enabled in vLLM integration.
......@@ -151,7 +151,7 @@ pip uninstall torch torch-xla -y
Install build dependencies:
```bash
pip install -r requirements-tpu.txt
pip install -r requirements/tpu.txt
sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
```
......
......@@ -9,7 +9,7 @@ vLLM is a Python library that supports the following CPU variants. Select your C
:selected:
:sync: x86
:::{include} x86.inc.md
:::{include} cpu/x86.inc.md
:start-after: "# Installation"
:end-before: "## Requirements"
:::
......@@ -19,7 +19,7 @@ vLLM is a Python library that supports the following CPU variants. Select your C
::::{tab-item} ARM AArch64
:sync: arm
:::{include} arm.inc.md
:::{include} cpu/arm.inc.md
:start-after: "# Installation"
:end-before: "## Requirements"
:::
......@@ -29,7 +29,17 @@ vLLM is a Python library that supports the following CPU variants. Select your C
::::{tab-item} Apple silicon
:sync: apple
:::{include} apple.inc.md
:::{include} cpu/apple.inc.md
:start-after: "# Installation"
:end-before: "## Requirements"
:::
::::
::::{tab-item} IBM Z (S390X)
:sync: s390x
:::{include} cpu/s390x.inc.md
:start-after: "# Installation"
:end-before: "## Requirements"
:::
......@@ -48,7 +58,7 @@ vLLM is a Python library that supports the following CPU variants. Select your C
::::{tab-item} Intel/AMD x86
:sync: x86
:::{include} x86.inc.md
:::{include} cpu/x86.inc.md
:start-after: "## Requirements"
:end-before: "## Set up using Python"
:::
......@@ -58,7 +68,7 @@ vLLM is a Python library that supports the following CPU variants. Select your C
::::{tab-item} ARM AArch64
:sync: arm
:::{include} arm.inc.md
:::{include} cpu/arm.inc.md
:start-after: "## Requirements"
:end-before: "## Set up using Python"
:::
......@@ -68,7 +78,17 @@ vLLM is a Python library that supports the following CPU variants. Select your C
::::{tab-item} Apple silicon
:sync: apple
:::{include} apple.inc.md
:::{include} cpu/apple.inc.md
:start-after: "## Requirements"
:end-before: "## Set up using Python"
:::
::::
::::{tab-item} IBM Z (S390X)
:sync: s390x
:::{include} cpu/s390x.inc.md
:start-after: "## Requirements"
:end-before: "## Set up using Python"
:::
......@@ -81,7 +101,7 @@ vLLM is a Python library that supports the following CPU variants. Select your C
### Create a new Python environment
:::{include} ../python_env_setup.inc.md
:::{include} python_env_setup.inc.md
:::
### Pre-built wheels
......@@ -96,7 +116,7 @@ Currently, there are no pre-built CPU wheels.
::::{tab-item} Intel/AMD x86
:sync: x86
:::{include} x86.inc.md
:::{include} cpu/x86.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
:::
......@@ -106,7 +126,7 @@ Currently, there are no pre-built CPU wheels.
::::{tab-item} ARM AArch64
:sync: arm
:::{include} arm.inc.md
:::{include} cpu/arm.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
:::
......@@ -116,7 +136,17 @@ Currently, there are no pre-built CPU wheels.
::::{tab-item} Apple silicon
:sync: apple
:::{include} apple.inc.md
:::{include} cpu/apple.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
:::
::::
::::{tab-item} IBM Z (s390x)
:sync: s390x
:::{include} cpu/s390x.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
:::
......@@ -147,6 +177,10 @@ $ docker run -it \
For ARM or Apple silicon, use `Dockerfile.arm`
::::
::::{tip}
For IBM Z (s390x), use `Dockerfile.s390x` and add the `--dtype float` flag to the `docker run` command.
::::
## Supported features
vLLM CPU backend supports the following vLLM features:
......@@ -155,12 +189,13 @@ vLLM CPU backend supports the following vLLM features:
- Model Quantization (`INT8 W8A8, AWQ, GPTQ`)
- Chunked-prefill
- Prefix-caching
- FP8-E5M2 KV-Caching (TODO)
- FP8-E5M2 KV cache
## Related runtime environment variables
- `VLLM_CPU_KVCACHE_SPACE`: specify the KV Cache size (e.g, `VLLM_CPU_KVCACHE_SPACE=40` means 40 GB space for KV cache), larger setting will allow vLLM running more requests in parallel. This parameter should be set based on the hardware configuration and memory management pattern of users.
- `VLLM_CPU_KVCACHE_SPACE`: specify the KV Cache size (e.g, `VLLM_CPU_KVCACHE_SPACE=40` means 40 GiB space for KV cache), larger setting will allow vLLM running more requests in parallel. This parameter should be set based on the hardware configuration and memory management pattern of users.
- `VLLM_CPU_OMP_THREADS_BIND`: specify the CPU cores dedicated to the OpenMP threads. For example, `VLLM_CPU_OMP_THREADS_BIND=0-31` means there will be 32 OpenMP threads bound on 0-31 CPU cores. `VLLM_CPU_OMP_THREADS_BIND=0-31|32-63` means there will be 2 tensor parallel processes, 32 OpenMP threads of rank0 are bound on 0-31 CPU cores, and the OpenMP threads of rank1 are bound on 32-63 CPU cores.
- `VLLM_CPU_MOE_PREPACK`: whether to use prepack for MoE layer. This will be passed to `ipex.llm.modules.GatedMLPMOE`. Default is `1` (True). On unsupported CPUs, you might need to set this to `0` (False).
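The `VLLM_CPU_OMP_THREADS_BIND` syntax described above can be checked before launching vLLM. The following is a minimal sketch, not vLLM's own parser: `parse_omp_threads_bind` is a hypothetical helper, and its handling of comma-separated core ids is an assumption beyond the range form shown above. It expands a value into one list of core ids per tensor parallel rank.

```python
def parse_omp_threads_bind(value: str) -> list[list[int]]:
    """Expand a value such as "0-31|32-63" into one list of core ids per rank.

    Hypothetical helper for illustration only; it is not part of vLLM.
    """
    ranks: list[list[int]] = []
    # "|" separates the core sets of the tensor parallel ranks.
    for rank_spec in value.split("|"):
        cores: list[int] = []
        # Assumption: a rank may list several comma-separated ids or ranges.
        for part in rank_spec.split(","):
            part = part.strip()
            if "-" in part:
                first, last = part.split("-", 1)
                # Ranges are inclusive on both ends: "0-31" is 32 cores.
                cores.extend(range(int(first), int(last) + 1))
            else:
                cores.append(int(part))
        ranks.append(cores)
    return ranks


for rank, cores in enumerate(parse_omp_threads_bind("0-31|32-63")):
    print(f"rank{rank}: {len(cores)} threads on cores {cores[0]}-{cores[-1]}")
# rank0: 32 threads on cores 0-31
# rank1: 32 threads on cores 32-63
```

The number of lists returned should match the tensor parallel size, since each rank gets its own set of cores.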
## Performance tips
......
......@@ -25,7 +25,7 @@ After installation of XCode and the Command Line Tools, which include Apple Clan
```console
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -r requirements-cpu.txt
pip install -r requirements/cpu.txt
pip install -e .
```
......
......@@ -20,7 +20,7 @@ There are no pre-built wheels or images for this device, so you must build vLLM
### Build wheel from source
:::{include} build.inc.md
:::{include} cpu/build.inc.md
:::
Testing has been conducted on AWS Graviton3 instances for compatibility.
......
......@@ -6,12 +6,19 @@ sudo apt-get install -y gcc-12 g++-12 libnuma-dev
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
```
Second, install Python packages for vLLM CPU backend building:
Second, clone vLLM project:
```console
git clone https://github.com/vllm-project/vllm.git vllm_source
cd vllm_source
```
Third, install Python packages for vLLM CPU backend building:
```console
pip install --upgrade pip
pip install "cmake>=3.26" wheel packaging ninja "setuptools-scm>=8" numpy
pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
pip install -v -r requirements/cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
```
Finally, build and install vLLM CPU backend:
......
# Installation
vLLM has experimental support for the s390x architecture on the IBM Z platform. For now, users must build vLLM from source to run natively on IBM Z.
Currently, the CPU implementation for the s390x architecture supports only the FP32 data type.
:::{attention}
There are no pre-built wheels or images for this device, so you must build vLLM from source.
:::
## Requirements
- OS: `Linux`
- SDK: `gcc/g++ >= 12.3.0` or later with Command Line Tools
- Instruction Set Architecture (ISA): VXE support is required. Works with Z14 and above.
- Python packages to build and install: `pyarrow`, `torch` and `torchvision`
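Whether a host meets the VXE requirement can be checked from Python before starting a long build. This is a minimal sketch under an assumption that is not stated in this document: that Linux on s390x lists CPU facilities on a `features` line in `/proc/cpuinfo` and names this one `vxe`. The helper names `has_vxe` and `host_has_vxe` are hypothetical.

```python
from pathlib import Path


def has_vxe(cpuinfo_text: str) -> bool:
    """Return True if a "features" line in the given text lists the vxe flag.

    Assumption: the text follows the /proc/cpuinfo layout "key : value ...".
    """
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # Compare whole tokens so that "vxe2" alone does not count as "vxe".
        if key.strip() == "features" and "vxe" in value.split():
            return True
    return False


def host_has_vxe(path: str = "/proc/cpuinfo") -> bool:
    """Apply has_vxe to the cpuinfo file of the current host (Linux only)."""
    return has_vxe(Path(path).read_text())
```

On an IBM Z host, `host_has_vxe()` returning `False` would suggest the machine predates the ISA level listed above.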
## Set up using Python
### Pre-built wheels
### Build wheel from source
Install the following packages from the package manager before building vLLM. For example, on RHEL 9.4:
```console
dnf install -y \
which procps findutils tar vim git gcc g++ make patch make cython zlib-devel \
libjpeg-turbo-devel libtiff-devel libpng-devel libwebp-devel freetype-devel harfbuzz-devel \
openssl-devel openblas openblas-devel wget autoconf automake libtool cmake numactl-devel
```
Install Rust >= 1.80, which is needed to install the `outlines-core` and `uvloop` Python packages.
```console
curl https://sh.rustup.rs -sSf | sh -s -- -y && \
. "$HOME/.cargo/env"
```
Execute the following commands to build and install vLLM from the source.
::::{tip}
Build the `torchvision` and `pyarrow` dependencies from source before building vLLM.
::::
```console
sed -i '/^torch/d' requirements-build.txt # remove torch from requirements-build.txt since we use nightly builds
pip install -v \
--extra-index-url https://download.pytorch.org/whl/nightly/cpu \
-r requirements-build.txt \
-r requirements-cpu.txt && \
VLLM_TARGET_DEVICE=cpu python setup.py bdist_wheel && \
pip install dist/*.whl
```
## Set up using Docker
### Pre-built images
### Build image from source
## Extra information
......@@ -22,7 +22,7 @@ There are no pre-built wheels or images for this device, so you must build vLLM
### Build wheel from source
:::{include} build.inc.md
:::{include} cpu/build.inc.md
:::
:::{note}
......
......@@ -9,7 +9,7 @@ vLLM is a Python library that supports the following GPU variants. Select your G
:selected:
:sync: cuda
:::{include} cuda.inc.md
:::{include} gpu/cuda.inc.md
:start-after: "# Installation"
:end-before: "## Requirements"
:::
......@@ -19,7 +19,7 @@ vLLM is a Python library that supports the following GPU variants. Select your G
::::{tab-item} AMD ROCm
:sync: rocm
:::{include} rocm.inc.md
:::{include} gpu/rocm.inc.md
:start-after: "# Installation"
:end-before: "## Requirements"
:::
......@@ -29,7 +29,7 @@ vLLM is a Python library that supports the following GPU variants. Select your G
::::{tab-item} Intel XPU
:sync: xpu
:::{include} xpu.inc.md
:::{include} gpu/xpu.inc.md
:start-after: "# Installation"
:end-before: "## Requirements"
:::
......@@ -49,7 +49,7 @@ vLLM is a Python library that supports the following GPU variants. Select your G
::::{tab-item} NVIDIA CUDA
:sync: cuda
:::{include} cuda.inc.md
:::{include} gpu/cuda.inc.md
:start-after: "## Requirements"
:end-before: "## Set up using Python"
:::
......@@ -59,7 +59,7 @@ vLLM is a Python library that supports the following GPU variants. Select your G
::::{tab-item} AMD ROCm
:sync: rocm
:::{include} rocm.inc.md
:::{include} gpu/rocm.inc.md
:start-after: "## Requirements"
:end-before: "## Set up using Python"
:::
......@@ -69,7 +69,7 @@ vLLM is a Python library that supports the following GPU variants. Select your G
::::{tab-item} Intel XPU
:sync: xpu
:::{include} xpu.inc.md
:::{include} gpu/xpu.inc.md
:start-after: "## Requirements"
:end-before: "## Set up using Python"
:::
......@@ -82,7 +82,7 @@ vLLM is a Python library that supports the following GPU variants. Select your G
### Create a new Python environment
:::{include} ../python_env_setup.inc.md
:::{include} python_env_setup.inc.md
:::
:::::{tab-set}
......@@ -91,7 +91,7 @@ vLLM is a Python library that supports the following GPU variants. Select your G
::::{tab-item} NVIDIA CUDA
:sync: cuda
:::{include} cuda.inc.md
:::{include} gpu/cuda.inc.md
:start-after: "## Create a new Python environment"
:end-before: "### Pre-built wheels"
:::
......@@ -122,7 +122,7 @@ There is no extra information on creating a new Python environment for this devi
::::{tab-item} NVIDIA CUDA
:sync: cuda
:::{include} cuda.inc.md
:::{include} gpu/cuda.inc.md
:start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source"
:::
......@@ -132,7 +132,7 @@ There is no extra information on creating a new Python environment for this devi
::::{tab-item} AMD ROCm
:sync: rocm
:::{include} rocm.inc.md
:::{include} gpu/rocm.inc.md
:start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source"
:::
......@@ -142,7 +142,7 @@ There is no extra information on creating a new Python environment for this devi
::::{tab-item} Intel XPU
:sync: xpu
:::{include} xpu.inc.md
:::{include} gpu/xpu.inc.md
:start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source"
:::
......@@ -161,7 +161,7 @@ There is no extra information on creating a new Python environment for this devi
::::{tab-item} NVIDIA CUDA
:sync: cuda
:::{include} cuda.inc.md
:::{include} gpu/cuda.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
:::
......@@ -171,7 +171,7 @@ There is no extra information on creating a new Python environment for this devi
::::{tab-item} AMD ROCm
:sync: rocm
:::{include} rocm.inc.md
:::{include} gpu/rocm.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
:::
......@@ -181,7 +181,7 @@ There is no extra information on creating a new Python environment for this devi
::::{tab-item} Intel XPU
:sync: xpu
:::{include} xpu.inc.md
:::{include} gpu/xpu.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
:::
......@@ -200,7 +200,7 @@ There is no extra information on creating a new Python environment for this devi
::::{tab-item} NVIDIA CUDA
:sync: cuda
:::{include} cuda.inc.md
:::{include} gpu/cuda.inc.md
:start-after: "### Pre-built images"
:end-before: "### Build image from source"
:::
......@@ -210,7 +210,7 @@ There is no extra information on creating a new Python environment for this devi
::::{tab-item} AMD ROCm
:sync: rocm
:::{include} rocm.inc.md
:::{include} gpu/rocm.inc.md
:start-after: "### Pre-built images"
:end-before: "### Build image from source"
:::
......@@ -220,7 +220,7 @@ There is no extra information on creating a new Python environment for this devi
::::{tab-item} Intel XPU
:sync: xpu
:::{include} xpu.inc.md
:::{include} gpu/xpu.inc.md
:start-after: "### Pre-built images"
:end-before: "### Build image from source"
:::
......@@ -237,7 +237,7 @@ There is no extra information on creating a new Python environment for this devi
::::{tab-item} NVIDIA CUDA
:sync: cuda
:::{include} cuda.inc.md
:::{include} gpu/cuda.inc.md
:start-after: "### Build image from source"
:end-before: "## Supported features"
:::
......@@ -247,7 +247,7 @@ There is no extra information on creating a new Python environment for this devi
::::{tab-item} AMD ROCm
:sync: rocm
:::{include} rocm.inc.md
:::{include} gpu/rocm.inc.md
:start-after: "### Build image from source"
:end-before: "## Supported features"
:::
......@@ -257,7 +257,7 @@ There is no extra information on creating a new Python environment for this devi
::::{tab-item} Intel XPU
:sync: xpu
:::{include} xpu.inc.md
:::{include} gpu/xpu.inc.md
:start-after: "### Build image from source"
:end-before: "## Supported features"
:::
......@@ -274,7 +274,7 @@ There is no extra information on creating a new Python environment for this devi
::::{tab-item} NVIDIA CUDA
:sync: cuda
:::{include} cuda.inc.md
:::{include} gpu/cuda.inc.md
:start-after: "## Supported features"
:::
......@@ -283,7 +283,7 @@ There is no extra information on creating a new Python environment for this devi
::::{tab-item} AMD ROCm
:sync: rocm
:::{include} rocm.inc.md
:::{include} gpu/rocm.inc.md
:start-after: "## Supported features"
:::
......@@ -292,7 +292,7 @@ There is no extra information on creating a new Python environment for this devi
::::{tab-item} Intel XPU
:sync: xpu
:::{include} xpu.inc.md
:::{include} gpu/xpu.inc.md
:start-after: "## Supported features"
:::
......
......@@ -23,12 +23,12 @@ Therefore, it is recommended to install vLLM with a **fresh new** environment. I
You can install vLLM using either `pip` or `uv pip`:
```console
# Install vLLM with CUDA 12.1.
# Install vLLM with CUDA 12.4.
pip install vllm # If you are using pip.
uv pip install vllm # If you are using uv.
```
As of now, vLLM's binaries are compiled with CUDA 12.1 and public PyTorch release versions by default. We also provide vLLM binaries compiled with CUDA 11.8 and public PyTorch release versions:
As of now, vLLM's binaries are compiled with CUDA 12.4 and public PyTorch release versions by default. We also provide vLLM binaries compiled with CUDA 12.1, 11.8, and public PyTorch release versions:
```console
# Install vLLM with CUDA 11.8.
......@@ -131,6 +131,8 @@ Building from source requires a lot of compilation. If you are building from sou
For example, you can install [ccache](https://github.com/ccache/ccache) using `conda install ccache` or `apt install ccache` .
As long as `which ccache` command can find the `ccache` binary, it will be used automatically by the build system. After the first build, subsequent builds will be much faster.
When using `ccache` with `pip install -e .`, you should run `CCACHE_NOHASHDIR="true" pip install --no-build-isolation -e .`. This is because `pip` creates a new folder with a random name for each build, preventing `ccache` from recognizing that the same files are being built.
[sccache](https://github.com/mozilla/sccache) works similarly to `ccache`, but has the capability to utilize caching in remote storage environments.
The following environment variables can be set to configure the vLLM `sccache` remote: `SCCACHE_BUCKET=vllm-build-sccache SCCACHE_REGION=us-west-2 SCCACHE_S3_NO_CREDENTIALS=1`. We also recommend setting `SCCACHE_IDLE_TIMEOUT=0`.
:::
......@@ -148,7 +150,7 @@ To build vLLM using an existing PyTorch installation:
git clone https://github.com/vllm-project/vllm.git
cd vllm
python use_existing_torch.py
pip install -r requirements-build.txt
pip install -r requirements/build.txt
pip install -e . --no-build-isolation
```
......
......@@ -53,9 +53,9 @@ Currently, there are no pre-built ROCm wheels.
If you see an HTTP issue related to downloading packages while building triton, please try again, as the HTTP error is intermittent.
:::
2. Optionally, if you choose to use CK flash attention, you can install [flash attention for ROCm](https://github.com/ROCm/flash-attention/tree/ck_tile)
2. Optionally, if you choose to use CK flash attention, you can install [flash attention for ROCm](https://github.com/ROCm/flash-attention)
Install ROCm's flash attention (v2.7.2) following the instructions from [ROCm/flash-attention](https://github.com/ROCm/flash-attention/tree/ck_tile#amd-gpurocm-support)
Install ROCm's flash attention (v2.7.2) following the instructions from [ROCm/flash-attention](https://github.com/ROCm/flash-attention#amd-rocm-support)
Alternatively, wheels intended for vLLM use can be accessed under the releases.
For example, for ROCm 6.3, suppose your gfx arch is `gfx90a`. To get your gfx architecture, run `rocminfo |grep gfx`.
......@@ -84,7 +84,7 @@ Currently, there are no pre-built ROCm wheels.
# Install dependencies
$ pip install --upgrade numba scipy huggingface-hub[cli,hf_transfer] setuptools_scm
$ pip install "numpy<2"
$ pip install -r requirements-rocm.txt
$ pip install -r requirements/rocm.txt
# Build vLLM for MI210/MI250/MI300.
$ export PYTORCH_ROCM_ARCH="gfx90a;gfx942"
......
# Installation
vLLM initially supports basic model inferencing and serving on Intel GPU platform.
vLLM initially supports basic model inference and serving on Intel GPU platform.
:::{attention}
There are no pre-built wheels or images for this device, so you must build vLLM from source.
......@@ -9,7 +9,7 @@ There are no pre-built wheels or images for this device, so you must build vLLM
## Requirements
- Supported Hardware: Intel Data Center GPU, Intel ARC GPU
- OneAPI requirements: oneAPI 2024.2
- OneAPI requirements: oneAPI 2025.0
## Set up using Python
......@@ -19,21 +19,27 @@ Currently, there are no pre-built XPU wheels.
### Build wheel from source
- First, install required driver and intel OneAPI 2024.2 or later.
- First, install required driver and Intel OneAPI 2025.0 or later.
- Second, install Python packages for vLLM XPU backend building:
```console
source /opt/intel/oneapi/setvars.sh
pip install --upgrade pip
pip install -v -r requirements-xpu.txt
pip install -v -r requirements/xpu.txt
```
- Finally, build and install vLLM XPU backend:
- Then, build and install vLLM XPU backend:
```console
VLLM_TARGET_DEVICE=xpu python setup.py install
```
- Finally, due to a known oneAPI-related dependency conflict between torch-xpu 2.6 and ipex-xpu 2.6, install ipex separately here. This will be fixed in ipex-xpu 2.7.
```console
pip install intel-extension-for-pytorch==2.6.10+xpu \
--extra-index-url=https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```
:::{note}
- FP16 is the default data type in the current XPU backend. The BF16 data
type is supported on Intel Data Center GPU, but not yet on Intel Arc GPU.
......@@ -59,7 +65,7 @@ $ docker run -it \
## Supported features
XPU platform supports tensor-parallel inference/serving and also supports pipeline parallel as a beta feature for online serving. We requires Ray as the distributed runtime backend. For example, a reference execution likes following:
XPU platform supports **tensor parallel** inference/serving and also supports **pipeline parallel** as a beta feature for online serving. We require Ray as the distributed runtime backend. For example, a reference execution looks like the following:
```console
python -m vllm.entrypoints.openai.api_server \
......@@ -72,4 +78,6 @@ python -m vllm.entrypoints.openai.api_server \
-tp=8
```
By default, a ray instance will be launched automatically if no existing one is detected in system, with `num-gpus` equals to `parallel_config.world_size`. We recommend properly starting a ray cluster before execution, referring to the <gh-file:examples/online_serving/run_cluster.sh> helper script.
By default, a Ray instance will be launched automatically if no existing one is detected in the system, with `num-gpus` equal to `parallel_config.world_size`. We recommend properly starting a Ray cluster before execution, referring to the <gh-file:examples/online_serving/run_cluster.sh> helper script.
There are some new features coming with ipex-xpu 2.6, e.g. **chunked prefill**, **V1 engine support**, **LoRA**, **MoE**, etc.
......@@ -24,6 +24,12 @@ source myenv/bin/activate
uv pip install vllm
```
Another delightful way is to use `uv run` with the `--with [dependency]` option, which allows you to run commands such as `vllm serve` without creating an environment:
```console
uv run --with vllm vllm --help
```
You can also use [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html) to create and manage Python environments.
```console
......@@ -52,6 +58,11 @@ from vllm import LLM, SamplingParams
```
The next section defines a list of input prompts and sampling parameters for text generation. The [sampling temperature](https://arxiv.org/html/2402.05201v1) is set to `0.8` and the [nucleus sampling probability](https://en.wikipedia.org/wiki/Top-p_sampling) is set to `0.95`. You can find more information about the sampling parameters [here](#sampling-params).
:::{important}
By default, vLLM will use the sampling parameters recommended by the model creator by applying the `generation_config.json` from the Hugging Face model repository if it exists. In most cases, this will provide you with the best results by default if {class}`~vllm.SamplingParams` is not specified.
However, if vLLM's default sampling parameters are preferred, please set `generation_config="vllm"` when creating the {class}`~vllm.LLM` instance.
:::
```python
prompts = [
......@@ -70,7 +81,7 @@ llm = LLM(model="facebook/opt-125m")
```
:::{note}
By default, vLLM downloads models from [HuggingFace](https://huggingface.co/). If you would like to use models from [ModelScope](https://www.modelscope.cn), set the environment variable `VLLM_USE_MODELSCOPE` before initializing the engine.
By default, vLLM downloads models from [Hugging Face](https://huggingface.co/). If you would like to use models from [ModelScope](https://www.modelscope.cn), set the environment variable `VLLM_USE_MODELSCOPE` before initializing the engine.
:::
Now, the fun part! The outputs are generated using `llm.generate`. It adds the input prompts to the vLLM engine's waiting queue and executes the vLLM engine to generate the outputs with high throughput. The outputs are returned as a list of `RequestOutput` objects, which include all of the output tokens.
......@@ -101,6 +112,11 @@ vllm serve Qwen/Qwen2.5-1.5B-Instruct
By default, the server uses a predefined chat template stored in the tokenizer.
You can learn about overriding it [here](#chat-template).
:::
:::{important}
By default, the server applies `generation_config.json` from the Hugging Face model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.
To disable this behavior, please pass `--generation-config vllm` when launching the server.
:::
This server can be queried in the same format as OpenAI API. For example, to list the models:
......@@ -184,3 +200,13 @@ chat_response = client.chat.completions.create(
)
print("Chat response:", chat_response)
```
## On Attention Backends
Currently, vLLM supports multiple backends for efficient Attention computation across different platforms and accelerator architectures. It automatically selects the most performant backend compatible with your system and model specifications.
If desired, you can also manually set the backend of your choice by configuring the environment variable `VLLM_ATTENTION_BACKEND` to one of the following options: `FLASH_ATTN`, `FLASHINFER` or `XFORMERS`.
```{attention}
There are no pre-built vLLM wheels containing FlashInfer, so you must install it in your environment first. Refer to the [FlashInfer official docs](https://docs.flashinfer.ai/) or see [Dockerfile](https://github.com/vllm-project/vllm/blob/main/Dockerfile) for instructions on how to install it.
```
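As a minimal sketch of the manual override, the snippet below sets `VLLM_ATTENTION_BACKEND` from Python before the engine is created. `select_attention_backend` is a hypothetical helper written for this illustration and is not part of vLLM; only the variable name and the three backend names come from the text above.

```python
import os

# Backend names listed in the section above.
SUPPORTED_BACKENDS = ("FLASH_ATTN", "FLASHINFER", "XFORMERS")


def select_attention_backend(name: str) -> str:
    """Hypothetical helper: validate a backend name and export it."""
    backend = name.upper()
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"Unknown backend {name!r}; expected one of {SUPPORTED_BACKENDS}"
        )
    os.environ["VLLM_ATTENTION_BACKEND"] = backend
    return backend


select_attention_backend("FLASHINFER")

# Create the engine only after the variable is set, for example:
# from vllm import LLM
# llm = LLM(model="facebook/opt-125m")
```

Exporting the variable in the shell before launching `vllm serve` has the same effect for the server.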
......@@ -254,6 +254,10 @@ ValueError: Model architectures ['<arch>'] are not supported for now. Supported
But you are sure that the model is in the [list of supported models](#supported-models), there may be some issue with vLLM's model resolution. In that case, please follow [these steps](#model-resolution) to explicitly specify the vLLM implementation for the model.
## Failed to infer device type
If you see an error like `RuntimeError: Failed to infer device type`, it means that vLLM failed to infer the device type of the runtime environment. You can check [the code](gh-file:vllm/platforms/__init__.py) to see how vLLM infers the device type and why it is not working as expected. After [this PR](gh-pr:14195), you can also set the environment variable `VLLM_LOGGING_LEVEL=DEBUG` to see more detailed logs to help debug the issue.
## Known Issues
- In `v0.5.2`, `v0.5.3`, and `v0.5.3.post1`, there is a bug caused by [zmq](https://github.com/zeromq/pyzmq/issues/2000), which can occasionally cause vLLM to hang depending on the machine configuration. The solution is to upgrade to the latest version of `vllm` to include the [fix](gh-pr:6759).
......
# vLLM V1 User Guide
V1 is now enabled by default for all supported use cases, and we will gradually enable it for every use case we plan to support. Please share any feedback on [GitHub](https://github.com/vllm-project/vllm) or in the [vLLM Slack](https://inviter.co/vllm-slack).
To disable V1, please set the environment variable as: `VLLM_USE_V1=0`, and send us a GitHub issue sharing the reason!
## Why vLLM V1?
vLLM V0 successfully supported a wide range of models and hardware, but as new features were developed independently, the system grew increasingly complex. This complexity made it harder to integrate new capabilities and introduced technical debt, revealing the need for a more streamlined and unified design.
Building on V0’s success, vLLM V1 retains the stable and proven components from V0
(such as the models, GPU kernels, and utilities). At the same time, it significantly
re-architects the core systems, covering the scheduler, KV cache manager, worker,
sampler, and API server, to provide a cohesive, maintainable framework that better
accommodates continued growth and innovation.
Specifically, V1 aims to:
- Provide a **simple, modular, and easy-to-hack codebase**.
- Ensure **high performance** with near-zero CPU overhead.
- **Combine key optimizations** into a unified architecture.
- Require **zero configs** by enabling features/optimizations by default.
We see significant performance improvements from upgrading to the V1 core engine, in
particular for long-context scenarios. Please see the performance benchmark (to be
added).
For more details, check out the vLLM V1 blog post [vLLM V1: A Major
Upgrade to vLLM’s Core Architecture](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html) (published Jan 27, 2025).
This living user guide outlines a few known **important changes and limitations** introduced by vLLM V1. The team has been working actively to make V1 the default engine, so this guide will be updated regularly as more features are supported on vLLM V1.
### Supports Overview
#### Hardware
| Hardware | Status |
|----------|------------------------------------------|
| **NVIDIA** | <nobr>🚀 Natively Supported</nobr> |
| **AMD** | <nobr>🚧 WIP</nobr> |
| **TPU** | <nobr>🚧 WIP</nobr> |
#### Feature / Model
| Feature / Model | Status |
|-----------------|-----------------------------------------------------------------------------------|
| **Prefix Caching** | <nobr>🚀 Optimized</nobr> |
| **Chunked Prefill** | <nobr>🚀 Optimized</nobr> |
| **Logprobs Calculation** | <nobr>🟢 Functional</nobr> |
| **LoRA** | <nobr>🟢 Functional ([PR #13096](https://github.com/vllm-project/vllm/pull/13096))</nobr>|
| **Multimodal Models** | <nobr>🟢 Functional</nobr> |
| **Spec Decode** | <nobr>🚧 WIP ([PR #13933](https://github.com/vllm-project/vllm/pull/13933))</nobr>|
| **Prompt Logprobs with Prefix Caching** | <nobr>🟡 Planned ([RFC #13414](https://github.com/vllm-project/vllm/issues/13414))</nobr>|
| **FP8 KV Cache** | <nobr>🟡 Planned</nobr> |
| **Structured Output Alternative Backends** | <nobr>🟡 Planned</nobr> |
| **Embedding Models** | <nobr>🟡 Planned ([RFC #12249](https://github.com/vllm-project/vllm/issues/12249))</nobr> |
| **Mamba Models** | <nobr>🟡 Planned</nobr> |
| **Encoder-Decoder Models** | <nobr>🟡 Planned</nobr> |
| **Request-level Structured Output Backend** | <nobr>🔴 Deprecated</nobr> |
| **best_of** | <nobr>🔴 Deprecated ([RFC #13361](https://github.com/vllm-project/vllm/issues/13361))</nobr>|
| **Per-Request Logits Processors** | <nobr>🔴 Deprecated ([RFC #13360](https://github.com/vllm-project/vllm/pull/13360))</nobr> |
| **GPU <> CPU KV Cache Swapping** | <nobr>🔴 Deprecated</nobr> |
- **🚀 Optimized**: Nearly fully optimized, with no further work currently planned.
- **🟢 Functional**: Fully operational, with ongoing optimizations.
- **🚧 WIP**: Under active development.
- **🟡 Planned**: Scheduled for future implementation (some may have open PRs/RFCs).
- **🔴 Deprecated**: Not planned for V1 unless there is strong demand.
**Note**: vLLM V1’s unified scheduler treats both prompt and output tokens the same
way by using a simple dictionary (e.g., {request_id: num_tokens}) to dynamically
allocate a fixed token budget per request, enabling features like chunked prefills,
prefix caching, and speculative decoding without a strict separation between prefill
and decode phases.
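The idea in the note can be sketched with a toy scheduler step. This is an illustration only, not vLLM's scheduler: `schedule_step`, the request names, and the budget of 8 tokens are all hypothetical, and the sketch assumes a single token budget shared by all requests in one step.

```python
def schedule_step(remaining: dict[str, int], token_budget: int) -> dict[str, int]:
    """Toy scheduling step: returns {request_id: num_tokens} for this step.

    `remaining` maps each request to the number of tokens it still needs
    processed. Prompt and output tokens are treated the same way.
    """
    scheduled: dict[str, int] = {}
    for request_id, tokens_left in remaining.items():
        if token_budget == 0:
            break
        # A long prompt only gets what is left of the budget (a prefill chunk).
        num_tokens = min(tokens_left, token_budget)
        if num_tokens > 0:
            scheduled[request_id] = num_tokens
            token_budget -= num_tokens
    return scheduled


# Two requests that are decoding (1 token each) and one new 10-token prompt.
step = schedule_step({"decode-a": 1, "decode-b": 1, "prefill-c": 10}, token_budget=8)
print(step)  # {'decode-a': 1, 'decode-b': 1, 'prefill-c': 6}
```

In this sketch the 10-token prompt receives only 6 tokens in the step, and the remaining 4 would be scheduled in a later step. That is the sense in which chunked prefill needs no separate prefill phase.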
### Semantic Changes and Deprecated Features
#### Logprobs
vLLM V1 supports logprobs and prompt logprobs. However, there are some important semantic
differences compared to V0:
**Logprobs Calculation**
Logprobs in V1 are now returned immediately once computed from the model’s raw output (i.e.
before applying any logits post-processing such as temperature scaling or penalty
adjustments). As a result, the returned logprobs do not reflect the final adjusted
probabilities used during sampling.
Support for logprobs with post-sampling adjustments is in progress and will be added in future updates.
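To see why the two can differ, the sketch below computes log-probabilities from the same logits with and without temperature scaling. It is a pure-Python illustration with made-up logits and a hypothetical `log_softmax` helper, not vLLM's sampler code.

```python
import math


def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


logits = [2.0, 1.0, 0.0]  # made-up raw model output for three tokens
temperature = 0.5         # made-up sampling temperature

# Computed from the raw output, as the text above describes for V1.
raw_logprobs = log_softmax(logits)

# What the sampler effectively uses after temperature scaling.
scaled_logprobs = log_softmax([x / temperature for x in logits])

print([round(x, 3) for x in raw_logprobs])
print([round(x, 3) for x in scaled_logprobs])
```

With a temperature below 1 the scaled distribution is sharper, so the top token's raw logprob is lower than the value the sampler effectively uses.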
**Prompt Logprobs with Prefix Caching**
Currently prompt logprobs are only supported when prefix caching is turned off via `--no-enable-prefix-caching`. In a future release, prompt logprobs will be compatible with prefix caching, but a recomputation will be triggered to recover the full prompt logprobs even upon a prefix cache hit. See details in [RFC #13414](https://github.com/vllm-project/vllm/issues/13414).
#### Deprecated Features
As part of the major architectural rework in vLLM V1, several legacy features have been deprecated.
**Sampling features**
- **best_of**: This feature has been deprecated due to limited usage. See details at [RFC #13361](https://github.com/vllm-project/vllm/issues/13361).
- **Per-Request Logits Processors**: In V0, users could pass custom
processing functions to adjust logits on a per-request basis. In vLLM V1, this
feature has been deprecated. Instead, the design is moving toward supporting **global logits
processors**, a feature the team is actively working on for future releases. See details at [RFC #13360](https://github.com/vllm-project/vllm/pull/13360).
**KV Cache features**
- **GPU <> CPU KV Cache Swapping**: with the new simplified core architecture, vLLM V1 no longer requires KV cache swapping
to handle request preemptions.
**Structured Output features**
- **Request-level Structured Output Backend**: Deprecated. Support for alternative backends
(outlines, guidance) with fallbacks is WIP.
### Feature & Model Support in Progress
Although we have re-implemented and partially optimized many features and models from V0 in vLLM V1, optimization work is still ongoing for some, and others remain unsupported.
#### Features to Be Optimized
These features are already supported in vLLM V1, but their optimization is still
in progress.
- **LoRA**: LoRA is functionally working on vLLM V1 but its performance is
inferior to that of V0. The team is actively working on improving its
performance
(e.g., see [PR #13096](https://github.com/vllm-project/vllm/pull/13096)).
- **Spec Decode**: Currently, only ngram-based spec decode is supported in V1. There
will be follow-up work to support other types of spec decode (e.g., see [PR #13933](https://github.com/vllm-project/vllm/pull/13933)). We will prioritize support for Eagle and MTP over draft-model-based spec decode.
#### Features to Be Supported
- **FP8 KV Cache**: While vLLM V1 introduces new FP8 kernels for model weight quantization, support for an FP8 key–value cache is not yet available. Users must continue using FP16 (or other supported precisions) for the KV cache.
- **Structured Output Alternative Backends**: Support for alternative structured output backends (outlines, guidance) is planned. V1 currently
supports only the `xgrammar:no_fallback` mode, meaning that it will error out if the output schema is unsupported by xgrammar.
Details about the structured outputs can be found
[here](https://docs.vllm.ai/en/latest/features/structured_outputs.html).
#### Models to Be Supported
vLLM V1 currently excludes model architectures with the `SupportsV0Only` protocol,
and the majority fall into the following categories. V1 support for these models will be added eventually.
**Embedding Models**
Instead of having a separate model runner, a hidden states processor ([RFC #12249](https://github.com/vllm-project/vllm/issues/12249)), which is based on the global logits processor ([RFC #13360](https://github.com/vllm-project/vllm/pull/13360)), has been proposed to enable simultaneous generation and embedding using the same engine instance in V1. It is still in the planning stage.
**Mamba Models**
Models using selective state-space mechanisms (instead of standard transformer attention)
are not yet supported (e.g., `MambaForCausalLM`, `JambaForCausalLM`).
**Encoder-Decoder Models**
vLLM V1 is currently optimized for decoder-only transformers. Models requiring
cross-attention between separate encoder and decoder are not yet supported (e.g., `BartForConditionalGeneration`, `MllamaForConditionalGeneration`).
For a complete list of supported models, see the [list of supported models](https://docs.vllm.ai/en/latest/models/supported_models.html).
## Frequently Asked Questions
**I'm using vLLM V1 and I'm getting CUDA OOM errors. What should I do?**
The default `max_num_seqs` has been raised from `256` in V0 to `1024` in V1. If you encounter CUDA OOM only when using the V1 engine, try setting a lower value of `max_num_seqs` or `gpu_memory_utilization`.
On the other hand, if you get an error about insufficient memory for the cache blocks, you should increase `gpu_memory_utilization` as this indicates that your GPU has sufficient memory but you're not allocating enough to vLLM for KV cache blocks.
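A minimal sketch of the first suggestion, assuming you create the engine from Python. The values below are hypothetical starting points rather than recommendations, and the engine call is left commented out because it needs vLLM and a GPU.

```python
# Hypothetical starting values; tune them for your GPU and workload.
engine_kwargs = {
    "model": "facebook/opt-125m",
    "max_num_seqs": 256,            # the V0 default, lower than V1's 1024
    "gpu_memory_utilization": 0.8,  # fraction of GPU memory given to vLLM
}

# Pass the settings when creating the engine:
# from vllm import LLM
# llm = LLM(**engine_kwargs)
```

If the error is about insufficient memory for cache blocks instead, raise `gpu_memory_utilization` rather than lowering it.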