Commit 3fb4b5fa authored by zhuwenwen

Merge tag 'v0.18.0' into v0.18.0-ori

parents bcf25339 89138b21
# Parallel Draft Models
The following code configures vLLM to use speculative decoding where proposals are generated by [PARD](https://arxiv.org/pdf/2504.18583) (Parallel Draft Models).
## PARD Offline Mode Example
```python
from vllm import LLM, SamplingParams
prompts = ["The future of AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(
    model="Qwen/Qwen3-8B",
    tensor_parallel_size=1,
    speculative_config={
        "model": "amd/PARD-Qwen3-0.6B",
        "num_speculative_tokens": 12,
        "method": "draft_model",
        "parallel_drafting": True,
    },
)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
## PARD Online Mode Example
```bash
vllm serve Qwen/Qwen3-4B \
    --host 0.0.0.0 \
    --port 8000 \
    --seed 42 \
    -tp 1 \
    --max_model_len 2048 \
    --gpu_memory_utilization 0.8 \
    --speculative_config '{"model": "amd/PARD-Qwen3-0.6B", "num_speculative_tokens": 12, "method": "draft_model", "parallel_drafting": true}'
```
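Hand-writing the JSON passed to `--speculative_config` is easy to get wrong because of shell quoting. As a minimal sketch, the dictionary from the offline example can be serialized with the standard library and quoted for the shell. The helper name `build_serve_command` is hypothetical and is not part of vLLM.

```python
import json
import shlex


def build_serve_command(model: str, speculative_config: dict) -> str:
    """Return a shell-safe `vllm serve` command line (hypothetical helper)."""
    # json.dumps emits lowercase true/false, matching the JSON used on the CLI.
    config_json = json.dumps(speculative_config)
    argv = ["vllm", "serve", model, "--speculative_config", config_json]
    # shlex.quote wraps the JSON in single quotes so the shell passes it through intact.
    return " ".join(shlex.quote(arg) for arg in argv)


command = build_serve_command(
    "Qwen/Qwen3-4B",
    {
        "model": "amd/PARD-Qwen3-0.6B",
        "num_speculative_tokens": 12,
        "method": "draft_model",
        "parallel_drafting": True,
    },
)
print(command)
```

Generating the flag from the same dictionary keeps the offline and online configurations from drifting apart.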
## Pre-trained PARD weights
- [amd/pard](https://huggingface.co/collections/amd/pard)
# vLLM-Project/Speculators
![User Flow Light](../../assets/features/speculative_decoding/speculators-user-flow-light.svg#only-light)
![User Flow Dark](../../assets/features/speculative_decoding/speculators-user-flow-dark.svg#only-dark)
[Speculators](https://docs.vllm.ai/projects/speculators/en/latest/) is a library for accelerating LLM inference through speculative decoding, providing efficient draft model training that integrates seamlessly with vLLM to reduce latency and improve throughput.
......
# Suffix Decoding
The following code configures vLLM to use speculative decoding where proposals are generated using Suffix Decoding ([technical report](https://arxiv.org/abs/2411.04975)).
Like n-gram, Suffix Decoding can generate draft tokens by pattern-matching using the last `n` generated tokens. Unlike n-gram, Suffix Decoding (1) can pattern-match against both the prompt and previous generations, (2) uses frequency counts to propose the most likely continuations, and (3) speculates an adaptive number of tokens for each request at each iteration to get better acceptance rates.
Suffix Decoding can achieve better performance for tasks with high repetition, such as code-editing, agentic loops (e.g. self-reflection, self-consistency), and RL rollouts.
!!! tip "Install Arctic Inference"
    Suffix Decoding requires [Arctic Inference](https://github.com/snowflakedb/ArcticInference). You can install it with `pip install arctic-inference`.
!!! tip "Suffix Decoding Speculative Tokens"
    Suffix Decoding will speculate a dynamic number of tokens for each request at each decoding step, so the `num_speculative_tokens` configuration specifies the *maximum* number of speculative tokens. It is suggested to use a high number such as `16` or `32` (default).
```python
from vllm import LLM, SamplingParams
prompts = ["The future of AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(
    model="Qwen/Qwen3-8B",
    tensor_parallel_size=1,
    speculative_config={
        "method": "suffix",
        "num_speculative_tokens": 32,
    },
)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
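To make the pattern-matching idea described above more concrete, here is a toy sketch of frequency-based proposal: look up earlier occurrences of the last `n` tokens and propose the continuation seen most often. This is an illustration only, not how Arctic Inference or vLLM implement Suffix Decoding. The function name `propose_next_token` is hypothetical, and the sketch proposes a single token rather than an adaptive number of tokens.

```python
from collections import Counter


def propose_next_token(history, n=2):
    """Return the token that most often followed the last n tokens, or None."""
    if len(history) <= n:
        return None
    suffix = tuple(history[-n:])
    counts = Counter()
    # Visit every earlier window of n tokens; the final window is the suffix
    # itself and has no continuation, so it is excluded by the range bound.
    for i in range(len(history) - n):
        if tuple(history[i:i + n]) == suffix:
            counts[history[i + n]] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# `history` stands in for the prompt plus previously generated tokens.
history = ["a", "b", "c", "a", "b", "d", "a", "b", "c", "a", "b"]
print(propose_next_token(history))
```

In this example the pattern `a b` was followed by `c` twice and by `d` once, so `c` is proposed.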
@@ -210,6 +210,12 @@ Note that you can use reasoning with any provided structured outputs feature. Th
See also: [full example](../examples/online_serving/structured_outputs.md)
!!! note
    When using Qwen3 Coder models with reasoning enabled, structured outputs might become disabled if the reasoning content does not get parsed into the `reasoning` field separately (v0.11.2+).
    To use both features together, you must explicitly enable structured outputs in reasoning mode.
    To do so, add the following flag when starting the vLLM server: `--structured-outputs-config.enable_in_reasoning=True`.
    See also: [Reasoning Outputs](reasoning_outputs.md) documentation.
## Experimental Automatic Parsing (OpenAI API)
This section covers the OpenAI beta wrapper over the `client.chat.completions.create()` method that provides richer integrations with Python specific types.
......
@@ -219,7 +219,7 @@ Supported models:
* `ibm-granite/granite-4.0-h-small` and other Granite 4.0 models
    Recommended flags: `--tool-call-parser granite4`
* `ibm-granite/granite-3.0-8b-instruct`
......
@@ -16,4 +16,6 @@ vLLM supports the following hardware platforms:
vLLM supports third-party hardware plugins that live **outside** the main `vllm` repository. These follow the [Hardware-Pluggable RFC](../../design/plugin_system.md).
A list of all supported hardware can be found on the vLLM website, see [Universal Compatibility - Hardware](https://vllm.ai/#compatibility).
If you want to add new hardware, please contact us on [Slack](https://slack.vllm.ai/) or [Email](mailto:collaboration@vllm.ai).
<!-- markdownlint-disable MD041 -->
--8<-- [start:installation]
vLLM has experimental support for macOS with Apple Silicon. For now, users must build from source to natively run on macOS.
@@ -7,23 +8,23 @@ Currently the CPU implementation for macOS supports FP32 and FP16 datatypes.
!!! tip "GPU-Accelerated Inference with vLLM-Metal"
    For GPU-accelerated inference on Apple Silicon using Metal, check out [vllm-metal](https://github.com/vllm-project/vllm-metal), a community-maintained hardware plugin that uses MLX as the compute backend.
--8<-- [end:installation]
--8<-- [start:requirements]
- OS: `macOS Sonoma` or later
- SDK: `XCode 15.4` or later with Command Line Tools
- Compiler: `Apple Clang >= 15.0.0`
--8<-- [end:requirements]
--8<-- [start:set-up-using-python]
--8<-- [end:set-up-using-python]
--8<-- [start:pre-built-wheels]
Currently, there are no pre-built Apple silicon CPU wheels.
--8<-- [end:pre-built-wheels]
--8<-- [start:build-wheel-from-source]
After installation of XCode and the Command Line Tools, which include Apple Clang, execute the following commands to build and install vLLM from source.
@@ -36,7 +37,7 @@ uv pip install -e .
!!! tip
    The `--index-strategy unsafe-best-match` flag is needed to resolve dependencies across multiple package indexes (PyTorch CPU index and PyPI). Without this flag, you may encounter `typing-extensions` version conflicts.
    The term "unsafe" refers to the package resolution strategy, not security. By default, `uv` only searches the first index where a package is found to prevent dependency confusion attacks. This flag allows `uv` to search all configured indexes to find the best compatible versions. Since both PyTorch and PyPI are trusted package sources, using this strategy is safe and appropriate for vLLM installation.
!!! note
@@ -77,14 +78,14 @@ uv pip install -e .
```
On Apple Clang 16 you should see: `#define __cplusplus 201703L`
--8<-- [end:build-wheel-from-source]
--8<-- [start:pre-built-images]
Currently, there are no pre-built Arm silicon CPU images.
--8<-- [end:pre-built-images]
--8<-- [start:build-image-from-source]
--8<-- [end:build-image-from-source]
--8<-- [start:extra-information]
--8<-- [end:extra-information]
<!-- markdownlint-disable MD041 -->
--8<-- [start:installation]
vLLM offers basic model inferencing and serving on the Arm CPU platform, with NEON support and the FP32, FP16 and BF16 data types.
--8<-- [end:installation]
--8<-- [start:requirements]
- OS: Linux
- Compiler: `gcc/g++ >= 12.3.0` (optional, recommended)
- Instruction Set Architecture (ISA): NEON support is required
--8<-- [end:requirements]
--8<-- [start:set-up-using-python]
--8<-- [end:set-up-using-python]
--8<-- [start:pre-built-wheels]
Pre-built vLLM wheels for Arm are available since version 0.11.2. These wheels contain pre-compiled C++ binaries.
@@ -43,13 +44,14 @@ uv pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VE
The `uv` approach works for vLLM `v0.6.6` and later. A unique feature of `uv` is that packages in `--extra-index-url` have [higher priority than the default index](https://docs.astral.sh/uv/pip/compatibility/#packages-that-exist-on-multiple-indexes). If the latest public release is `v0.6.6.post1`, `uv`'s behavior allows installing a commit before `v0.6.6.post1` by specifying the `--extra-index-url`. In contrast, `pip` combines packages from `--extra-index-url` and the default index, choosing only the latest version, which makes it difficult to install a development version prior to the released version.
#### Install the latest code
LLM inference is a fast-evolving field, and the latest code may contain bug fixes, performance improvements, and new features that are not released yet. To allow users to try the latest code without waiting for the next release, vLLM provides working pre-built Arm CPU wheels for every commit since `v0.11.2` on <https://wheels.vllm.ai/nightly>. For native CPU wheels, this index should be used:
- `https://wheels.vllm.ai/nightly/cpu/vllm`
To install from the nightly index, run:
```bash
uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly/cpu --index-strategy first-index
```
@@ -64,7 +66,7 @@ uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly/cpu --index
pip install https://wheels.vllm.ai/4fa7ce46f31cbd97b4651694caf9991cc395a259/vllm-0.13.0rc2.dev104%2Bg4fa7ce46f.cpu-cp38-abi3-manylinux_2_35_aarch64.whl # current nightly build (the filename will change!)
```
#### Install specific revisions
If you want to access the wheels for previous commits (e.g. to bisect a behavior change or a performance regression), you can specify the commit hash in the URL:
@@ -73,8 +75,8 @@ export VLLM_COMMIT=730bd35378bf2a5b56b6d3a45be28b3092d26519 # use full commit ha
uv pip install vllm --extra-index-url https://wheels.vllm.ai/${VLLM_COMMIT}/cpu --index-strategy first-index
```
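When bisecting across many commits, it can help to build the per-commit index URL programmatically and to reject abbreviated hashes before calling `uv`. The following is a minimal sketch that assumes the URL pattern shown in the command above; `commit_wheel_index` is a hypothetical helper, not a vLLM API.

```python
import re


def commit_wheel_index(commit: str) -> str:
    """Return the per-commit CPU wheel index URL (hypothetical helper)."""
    # The commands above use the full commit hash, so reject short hashes early.
    if not re.fullmatch(r"[0-9a-f]{40}", commit):
        raise ValueError(f"expected a full 40-character commit hash, got {commit!r}")
    return f"https://wheels.vllm.ai/{commit}/cpu"


print(commit_wheel_index("730bd35378bf2a5b56b6d3a45be28b3092d26519"))
```

The returned URL can be passed to `--extra-index-url` exactly as in the `uv pip install` command above.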
--8<-- [end:pre-built-wheels]
--8<-- [start:build-wheel-from-source]
First, install the recommended compiler. We recommend using `gcc/g++ >= 12.3.0` as the default compiler to avoid potential problems. For example, on Ubuntu 22.04, you can run:
@@ -133,23 +135,23 @@ Testing has been conducted on AWS Graviton3 instances for compatibility.
export LD_PRELOAD="$TC_PATH:$LD_PRELOAD"
```
--8<-- [end:build-wheel-from-source]
--8<-- [start:pre-built-images]
To pull the latest image from Docker Hub:
```bash
docker pull vllm/vllm-openai-cpu:latest-arm64
```
To pull an image with a specific vLLM version:
```bash
export VLLM_VERSION=$(curl -s https://api.github.com/repos/vllm-project/vllm/releases/latest | jq -r .tag_name | sed 's/^v//')
docker pull vllm/vllm-openai-cpu:v${VLLM_VERSION}-arm64
```
All available image tags are here: [https://hub.docker.com/r/vllm/vllm-openai-cpu/tags](https://hub.docker.com/r/vllm/vllm-openai-cpu/tags).
You can run these images via:
@@ -158,7 +160,7 @@ docker run \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --env "HF_TOKEN=<secret>" \
    vllm/vllm-openai-cpu:latest-arm64 <args...>
```
You can also access the latest code with Docker images. These are not intended for production use and are meant for CI and testing only. They will expire after several days.
@@ -170,28 +172,81 @@ export VLLM_COMMIT=6299628d326f429eba78736acb44e76749b281f5 # use full commit ha
docker pull public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:${VLLM_COMMIT}-arm64-cpu
```
--8<-- [end:pre-built-images]
--8<-- [start:build-image-from-source]
#### Building for your target ARM CPU
```bash
docker build -f docker/Dockerfile.cpu \
    --platform=linux/arm64 \
    --build-arg VLLM_CPU_ARM_BF16=<false (default)|true> \
    --tag vllm-cpu-env \
    --target vllm-openai .
```
!!! note "Auto-detection by default"
    By default, ARM CPU instruction sets (BF16, NEON, etc.) are automatically detected from the build system's CPU flags. The `VLLM_CPU_ARM_BF16` build argument is used for cross-compilation:
    - `VLLM_CPU_ARM_BF16=true` - Force-enable ARM BF16 support (build with BF16 regardless of build system capabilities)
    - `VLLM_CPU_ARM_BF16=false` - Rely on auto-detection (default)
##### Examples
###### Auto-detection build (native ARM)
```bash
# Building on ARM64 system - platform auto-detected
docker build -f docker/Dockerfile.cpu \
    --tag vllm-cpu-arm64 \
    --target vllm-openai .
```
###### Cross-compile for ARM with BF16 support
```bash
# Building on ARM64 for newer ARM CPUs with BF16
docker build -f docker/Dockerfile.cpu \
    --build-arg VLLM_CPU_ARM_BF16=true \
    --tag vllm-cpu-arm64-bf16 \
    --target vllm-openai .
```
###### Cross-compile from x86_64 to ARM64 with BF16
```bash
# Requires Docker buildx with ARM emulation (QEMU)
docker buildx build -f docker/Dockerfile.cpu \
    --platform=linux/arm64 \
    --build-arg VLLM_CPU_ARM_BF16=true \
    --build-arg max_jobs=4 \
    --tag vllm-cpu-arm64-bf16 \
    --target vllm-openai \
    --load .
```
!!! note "ARM BF16 requirements"
    ARM BF16 support requires ARMv8.6-A or later (FEAT_BF16). Supported on AWS Graviton3/4, AmpereOne, and other recent ARM processors.
#### Launching the OpenAI server
```bash
docker run --rm \
    --security-opt seccomp=unconfined \
    --cap-add SYS_NICE \
    --shm-size=4g \
    -p 8000:8000 \
    -e VLLM_CPU_KVCACHE_SPACE=<KV cache space> \
    -e VLLM_CPU_OMP_THREADS_BIND=<CPU cores for inference> \
    vllm-cpu-arm64 \
    meta-llama/Llama-3.2-1B-Instruct \
    --dtype=bfloat16 \
    other vLLM OpenAI server arguments
```
!!! tip "Alternative to --privileged"
    Instead of `--privileged=true`, use `--cap-add SYS_NICE --security-opt seccomp=unconfined` for better security.
--8<-- [end:build-image-from-source]
--8<-- [start:extra-information]
--8<-- [end:extra-information]
---
toc_depth: 3
---
# CPU
vLLM is a Python library that supports the following CPU variants. Select your CPU type to see vendor specific instructions:
@@ -75,6 +79,8 @@ For example, the nightly build index is: `https://wheels.vllm.ai/nightly/cpu/`.
#### Set up using Python-only build (without compilation) {#python-only-build}
This method requires [pre-built wheels](#pre-built-wheels) for your platform.
Please refer to the instructions for [Python-only build on GPU](./gpu.md#python-only-build), and replace the build commands with:
```bash
@@ -176,7 +182,7 @@ For the full and up-to-date list of models validated on CPU platforms, please se
### How to find benchmark configuration examples for supported CPU models?
For any model listed under [Supported Models on CPU](../../models/hardware_supported_models/cpu.md), optimized runtime configurations are provided in the vLLM Benchmark Suite’s CPU test cases, defined in `serving-tests-cpu.json`. The full test cases for Text-only models, Multi-Modal models and Embedded models are defined in `serving-tests-cpu-text.json`, `serving-tests-cpu-multimodal.json` and `serving-tests-cpu-embed.json`, respectively.
For details on how these optimized configurations are determined, see: [performance-benchmark-details](../../../.buildkite/performance-benchmarks/README.md#performance-benchmark-details).
To benchmark the supported models using these optimized settings, follow the steps in [running vLLM Benchmark Suite manually](../../benchmarking/dashboard.md#manually-trigger-the-benchmark) and run the Benchmark Suite on a CPU environment.
@@ -199,6 +205,28 @@ lscpu | grep "NUMA node(s):" | awk '{print $3}'
For performance reference, users may also consult the [vLLM Performance Dashboard](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm&deviceName=cpu), which publishes default-model CPU results produced using the same Benchmark Suite.
#### Dry-Run
For users who only need the optimized runtime configurations without running the benchmark, a Dry-Run mode is provided.
When the environment variable `DRY_RUN=1` is passed to `run-performance-benchmarks.sh`, all commands are generated under `./benchmark/results/`.
```bash
ON_CPU=1 DRY_RUN=1 bash .buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh
```
By providing a different JSON file, users can get runtime configurations for other model types, such as Embedded Models.
```bash
ON_CPU=1 SERVING_JSON=serving-tests-cpu-embed.json DRY_RUN=1 bash .buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh
```
By providing `MODEL_FILTER` and `DTYPE_FILTER`, only the commands for the matching model ID and data type are generated.
```bash
ON_CPU=1 SERVING_JSON=serving-tests-cpu-text.json DRY_RUN=1 MODEL_FILTER=meta-llama/Llama-3.1-8B-Instruct DTYPE_FILTER=bfloat16 bash .buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh
```
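The three invocations above differ only in which environment variables are set, so they can be assembled from one small function when scripting dry runs. This is a sketch under the assumption that the script is launched from the repository root; `dry_run_invocation` is a hypothetical helper, while the variable names are the ones used in the commands above.

```python
SCRIPT = ".buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh"


def dry_run_invocation(serving_json=None, model_filter=None, dtype_filter=None):
    """Return (env_overrides, argv) for a CPU dry run (hypothetical helper)."""
    env = {"ON_CPU": "1", "DRY_RUN": "1"}
    optional = {
        "SERVING_JSON": serving_json,
        "MODEL_FILTER": model_filter,
        "DTYPE_FILTER": dtype_filter,
    }
    # Only set the variables the caller actually provided.
    env.update({key: value for key, value in optional.items() if value is not None})
    return env, ["bash", SCRIPT]


env, argv = dry_run_invocation(
    serving_json="serving-tests-cpu-text.json",
    model_filter="meta-llama/Llama-3.1-8B-Instruct",
    dtype_filter="bfloat16",
)
print(env)
print(argv)
```

To actually run it, the overrides can be merged into the current environment, for example with `subprocess.run(argv, env={**os.environ, **env}, check=True)` after importing `os` and `subprocess`.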
### How to decide `VLLM_CPU_OMP_THREADS_BIND`?
- Default `auto` thread-binding is recommended for most cases. Ideally, each OpenMP thread is bound to a dedicated physical core, the threads of each rank are bound to the same NUMA node, and 1 CPU per rank is reserved for other vLLM components when `world_size > 1`. If you have any performance problems or unexpected binding behaviours, please try to bind threads as follows.
@@ -231,7 +259,7 @@ For performance reference, users may also consult the [vLLM Performance Dashboar
# On this platform, it is recommended to only bind OpenMP threads on logical CPU cores 0-7 or 8-15
$ export VLLM_CPU_OMP_THREADS_BIND=0-7
$ python examples/basic/offline_inference/basic.py
```
- When deploying the vLLM CPU backend on a multi-socket machine with NUMA and with tensor parallel or pipeline parallel enabled, each NUMA node is treated as a TP/PP rank. Make sure the CPU cores of a single rank are all on the same NUMA node to avoid cross-NUMA-node memory access.
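As a sketch of the NUMA guidance above, the binding string can be derived from a mapping of NUMA node to CPU ids so that each rank stays on one node. The `|` separator between ranks and the comma-separated range syntax are assumptions not stated in this section and should be checked against the vLLM CPU documentation; both helper names are hypothetical.

```python
def compress_ranges(cpu_ids):
    """Collapse CPU ids into a string such as '0-3,8-11'."""
    ids = sorted(set(cpu_ids))
    if not ids:
        raise ValueError("at least one CPU id is required")
    parts = []
    start = prev = ids[0]
    for cpu in ids[1:]:
        if cpu == prev + 1:
            prev = cpu
            continue
        # A gap closes the current run and starts a new one.
        parts.append(f"{start}-{prev}" if start != prev else str(start))
        start = prev = cpu
    parts.append(f"{start}-{prev}" if start != prev else str(start))
    return ",".join(parts)


def omp_threads_bind(cores_by_numa_node):
    """Build one core list per rank, one rank per NUMA node (assumed '|' separator)."""
    return "|".join(
        compress_ranges(cores_by_numa_node[node]) for node in sorted(cores_by_numa_node)
    )


# Two NUMA nodes with eight logical cores each, as in the example platform above.
print(omp_threads_bind({0: range(0, 8), 1: range(8, 16)}))
```

The per-node CPU ids can be read from `lscpu` output; they are hard-coded here to keep the sketch self-contained.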
......
<!-- markdownlint-disable MD041 -->
--8<-- [start:installation]
vLLM has experimental support for s390x architecture on IBM Z platform. For now, users must build from source to natively run on IBM Z platform.
Currently, the CPU implementation for s390x architecture supports FP32 datatype only.
--8<-- [end:installation]
--8<-- [start:requirements]
- OS: `Linux`
- SDK: `gcc/g++ >= 12.3.0` or later with Command Line Tools
- Instruction Set Architecture (ISA): VXE support is required. Works with Z14 and above.
- Build install python packages: `pyarrow`, `torch` and `torchvision`
--8<-- [end:requirements]
--8<-- [start:set-up-using-python]
--8<-- [end:set-up-using-python]
--8<-- [start:pre-built-wheels]
Currently, there are no pre-built IBM Z CPU wheels.
--8<-- [end:pre-built-wheels]
--8<-- [start:build-wheel-from-source]
Install the following packages from the package manager before building vLLM. For example, on RHEL 9.4:
@@ -65,13 +66,13 @@ Execute the following commands to build and install vLLM from source.
pip install dist/*.whl
```
--8<-- [end:build-wheel-from-source]
--8<-- [start:pre-built-images]
Currently, there are no pre-built IBM Z CPU images.
--8<-- [end:pre-built-images]
--8<-- [start:build-image-from-source]
```bash
docker build -f docker/Dockerfile.s390x \
@@ -93,6 +94,6 @@ docker run --rm \
!!! tip
    An alternative to `--privileged true` is `--cap-add SYS_NICE --security-opt seccomp=unconfined`.
--8<-- [end:build-image-from-source]
--8<-- [start:extra-information]
--8<-- [end:extra-information]
# --8<-- [start:installation] <!-- markdownlint-disable MD041 -->
--8<-- [start:installation]
vLLM supports basic model inferencing and serving on x86 CPU platform, with data types FP32, FP16 and BF16. vLLM supports basic model inferencing and serving on x86 CPU platform, with data types FP32, FP16 and BF16.
# --8<-- [end:installation] --8<-- [end:installation]
# --8<-- [start:requirements] --8<-- [start:requirements]
- OS: Linux - OS: Linux
- CPU flags: `avx512f` (Recommended), `avx512_bf16` (Optional), `avx512_vnni` (Optional) - CPU flags: `avx512f` (Recommended), `avx2` (Limited features)
!!! tip !!! tip
Use `lscpu` to check the CPU flags. Use `lscpu` to check the CPU flags.
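The same check can be scripted. The following is a minimal sketch and not part of vLLM: it assumes a Linux `/proc/cpuinfo`-style `flags` line, and the helper name `classify_cpu_flags` is hypothetical. It only looks for the two flags named in the requirements list (`avx512f` and `avx2`).

```python
def classify_cpu_flags(cpuinfo_text: str) -> str:
    """Return 'avx512', 'avx2', or 'unsupported' for a cpuinfo-style text.

    Hypothetical helper for illustration only; it is not a vLLM API.
    """
    flags = set()
    for line in cpuinfo_text.splitlines():
        # /proc/cpuinfo lines look like "flags\t\t: fpu vme ... avx2 ..."
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags.update(value.split())
    if "avx512f" in flags:
        return "avx512"  # recommended instruction set
    if "avx2" in flags:
        return "avx2"  # limited features
    return "unsupported"


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo", encoding="utf-8") as f:
            print(classify_cpu_flags(f.read()))
    except OSError:
        print("could not read /proc/cpuinfo on this platform")
```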
# --8<-- [end:requirements] --8<-- [end:requirements]
# --8<-- [start:set-up-using-python] --8<-- [start:set-up-using-python]
# --8<-- [end:set-up-using-python] --8<-- [end:set-up-using-python]
# --8<-- [start:pre-built-wheels] --8<-- [start:pre-built-wheels]
Pre-built vLLM wheels for x86 with AVX512 are available since version 0.13.0. To install release wheels: Pre-built vLLM wheels for x86 with AVX512/AVX2 are available since version 0.17.0. To install release wheels:
```bash ```bash
export VLLM_VERSION=$(curl -s https://api.github.com/repos/vllm-project/vllm/releases/latest | jq -r .tag_name | sed 's/^v//') export VLLM_VERSION=$(curl -s https://api.github.com/repos/vllm-project/vllm/releases/latest | jq -r .tag_name | sed 's/^v//')
...@@ -25,6 +26,7 @@ export VLLM_VERSION=$(curl -s https://api.github.com/repos/vllm-project/vllm/rel ...@@ -25,6 +26,7 @@ export VLLM_VERSION=$(curl -s https://api.github.com/repos/vllm-project/vllm/rel
# use uv # use uv
uv pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cpu-cp38-abi3-manylinux_2_35_x86_64.whl --torch-backend cpu uv pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cpu-cp38-abi3-manylinux_2_35_x86_64.whl --torch-backend cpu
``` ```
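As an illustration of how the release wheel URL above is assembled, here is a small sketch in Python. The function name `cpu_wheel_url` is hypothetical, and the tag `v0.17.0` is only an example input; the URL pattern is copied from the `uv pip install` command, and the prefix handling mirrors the `sed 's/^v//'` step.

```python
def cpu_wheel_url(tag_name: str) -> str:
    """Build the x86 CPU release wheel URL for a GitHub release tag.

    Hypothetical helper; the pattern is taken from the install command above.
    """
    # Mirror `sed 's/^v//'`: drop a single leading "v" from the tag name.
    version = tag_name[1:] if tag_name.startswith("v") else tag_name
    return (
        "https://github.com/vllm-project/vllm/releases/download/"
        f"v{version}/vllm-{version}+cpu-cp38-abi3-manylinux_2_35_x86_64.whl"
    )


if __name__ == "__main__":
    print(cpu_wheel_url("v0.17.0"))
```

Whether a wheel actually exists at the resulting URL depends on the release, so treat the output as a candidate URL rather than a guarantee.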
??? console "pip" ??? console "pip"
```bash ```bash
# use pip # use pip
...@@ -46,7 +48,7 @@ uv pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VE ...@@ -46,7 +48,7 @@ uv pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VE
export LD_PRELOAD="$TC_PATH:$IOMP_PATH:$LD_PRELOAD" export LD_PRELOAD="$TC_PATH:$IOMP_PATH:$LD_PRELOAD"
``` ```
**Install the latest code** #### Install the latest code
To install the wheel built from the latest main branch: To install the wheel built from the latest main branch:
...@@ -54,7 +56,7 @@ To install the wheel built from the latest main branch: ...@@ -54,7 +56,7 @@ To install the wheel built from the latest main branch:
uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly/cpu --index-strategy first-index --torch-backend cpu uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly/cpu --index-strategy first-index --torch-backend cpu
``` ```
**Install specific revisions** #### Install specific revisions
If you want to access the wheels for previous commits (e.g. to bisect a behavior change or performance regression), you can specify the commit hash in the URL: If you want to access the wheels for previous commits (e.g. to bisect a behavior change or performance regression), you can specify the commit hash in the URL:
...@@ -63,8 +65,8 @@ export VLLM_COMMIT=730bd35378bf2a5b56b6d3a45be28b3092d26519 # use full commit ha ...@@ -63,8 +65,8 @@ export VLLM_COMMIT=730bd35378bf2a5b56b6d3a45be28b3092d26519 # use full commit ha
uv pip install vllm --extra-index-url https://wheels.vllm.ai/${VLLM_COMMIT}/cpu --index-strategy first-index --torch-backend cpu uv pip install vllm --extra-index-url https://wheels.vllm.ai/${VLLM_COMMIT}/cpu --index-strategy first-index --torch-backend cpu
``` ```
# --8<-- [end:pre-built-wheels] --8<-- [end:pre-built-wheels]
# --8<-- [start:build-wheel-from-source] --8<-- [start:build-wheel-from-source]
Install the recommended compiler. We recommend using `gcc/g++ >= 12.3.0` as the default compiler to avoid potential problems. For example, on Ubuntu 22.04, you can run: Install the recommended compiler. We recommend using `gcc/g++ >= 12.3.0` as the default compiler to avoid potential problems. For example, on Ubuntu 22.04, you can run:
...@@ -106,13 +108,13 @@ VLLM_TARGET_DEVICE=cpu uv pip install . --no-build-isolation ...@@ -106,13 +108,13 @@ VLLM_TARGET_DEVICE=cpu uv pip install . --no-build-isolation
If you want to develop vLLM, install it in editable mode instead. If you want to develop vLLM, install it in editable mode instead.
```bash ```bash
VLLM_TARGET_DEVICE=cpu uv pip install -e . --no-build-isolation VLLM_TARGET_DEVICE=cpu python3 setup.py develop
``` ```
Optionally, build a portable wheel which you can then install elsewhere: Optionally, build a portable wheel which you can then install elsewhere:
```bash ```bash
VLLM_TARGET_DEVICE=cpu uv build --wheel VLLM_TARGET_DEVICE=cpu uv build --wheel --no-build-isolation
``` ```
```bash ```bash
...@@ -158,16 +160,23 @@ uv pip install dist/*.whl ...@@ -158,16 +160,23 @@ uv pip install dist/*.whl
] ]
``` ```
# --8<-- [end:build-wheel-from-source] --8<-- [end:build-wheel-from-source]
# --8<-- [start:pre-built-images] --8<-- [start:pre-built-images]
You can pull the latest available CPU image here via: You can pull the latest available CPU image from Docker Hub:
```bash ```bash
docker pull public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest docker pull vllm/vllm-openai-cpu:latest-x86_64
``` ```
If you want a more specific build you can find all published CPU based images here: [https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo](https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo) To pull an image for a specific vLLM version:
```bash
export VLLM_VERSION=$(curl -s https://api.github.com/repos/vllm-project/vllm/releases/latest | jq -r .tag_name | sed 's/^v//')
docker pull vllm/vllm-openai-cpu:v${VLLM_VERSION}-x86_64
```
All available image tags are here: [https://hub.docker.com/r/vllm/vllm-openai-cpu/tags](https://hub.docker.com/r/vllm/vllm-openai-cpu/tags)
You can run these images via: You can run these images via:
...@@ -176,64 +185,22 @@ docker run \ ...@@ -176,64 +185,22 @@ docker run \
-v ~/.cache/huggingface:/root/.cache/huggingface \ -v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \ -p 8000:8000 \
--env "HF_TOKEN=<secret>" \ --env "HF_TOKEN=<secret>" \
public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:<tag> <args...> vllm/vllm-openai-cpu:latest-x86_64 <args...>
``` ```
!!! warning --8<-- [end:pre-built-images]
If deploying the pre-built images on machines without `avx512f`, `avx512_bf16`, or `avx512_vnni` support, an `Illegal instruction` error may be raised. See the build-image-from-source section below for build arguments to match your target CPU capabilities. --8<-- [start:build-image-from-source]
# --8<-- [end:pre-built-images]
# --8<-- [start:build-image-from-source]
## Building for your target CPU #### Building for your target CPU
```bash ```bash
docker build -f docker/Dockerfile.cpu \ docker build -f docker/Dockerfile.cpu \
--build-arg VLLM_CPU_DISABLE_AVX512=<false (default)|true> \ --build-arg VLLM_CPU_X86=<false (default)|true> \ # For cross-compilation
--build-arg VLLM_CPU_AVX2=<false (default)|true> \
--build-arg VLLM_CPU_AVX512=<false (default)|true> \
--build-arg VLLM_CPU_AVX512BF16=<false (default)|true> \
--build-arg VLLM_CPU_AVX512VNNI=<false (default)|true> \
--build-arg VLLM_CPU_AMXBF16=<false|true (default)> \
--tag vllm-cpu-env \ --tag vllm-cpu-env \
--target vllm-openai . --target vllm-openai .
``` ```
!!! note "Auto-detection by default" #### Launching the OpenAI server
By default, CPU instruction sets (AVX512, AVX2, etc.) are automatically detected from the build system's CPU flags. Build arguments like `VLLM_CPU_AVX2`, `VLLM_CPU_AVX512`, `VLLM_CPU_AVX512BF16`, `VLLM_CPU_AVX512VNNI`, and `VLLM_CPU_AMXBF16` are used for cross-compilation:
- `VLLM_CPU_{ISA}=true` - Force-enable the instruction set (build with ISA regardless of build system capabilities)
- `VLLM_CPU_{ISA}=false` - Rely on auto-detection (default)
### Examples
**Auto-detection build (default)**
```bash
docker build -f docker/Dockerfile.cpu --tag vllm-cpu-env --target vllm-openai .
```
**Cross-compile for AVX512**
```bash
docker build -f docker/Dockerfile.cpu \
--build-arg VLLM_CPU_AVX512=true \
--build-arg VLLM_CPU_AVX512BF16=true \
--build-arg VLLM_CPU_AVX512VNNI=true \
--tag vllm-cpu-avx512 \
--target vllm-openai .
```
**Cross-compile for AVX2**
```bash
docker build -f docker/Dockerfile.cpu \
--build-arg VLLM_CPU_AVX2=true \
--tag vllm-cpu-avx2 \
--target vllm-openai .
```
## Launching the OpenAI server
```bash ```bash
docker run --rm \ docker run --rm \
...@@ -248,6 +215,6 @@ docker run --rm \ ...@@ -248,6 +215,6 @@ docker run --rm \
other vLLM OpenAI server arguments other vLLM OpenAI server arguments
``` ```
# --8<-- [end:build-image-from-source] --8<-- [end:build-image-from-source]
# --8<-- [start:extra-information] --8<-- [start:extra-information]
# --8<-- [end:extra-information] --8<-- [end:extra-information]
\ No newline at end of file
# --8<-- [start:installation] <!-- markdownlint-disable MD041 MD051 -->
--8<-- [start:installation]
vLLM contains pre-compiled C++ and CUDA (12.8) binaries. vLLM contains pre-compiled C++ and CUDA (12.8) binaries.
# --8<-- [end:installation] --8<-- [end:installation]
# --8<-- [start:requirements] --8<-- [start:requirements]
- GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, H100, etc.) - GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, H100, etc.)
# --8<-- [end:requirements] --8<-- [end:requirements]
# --8<-- [start:set-up-using-python] --8<-- [start:set-up-using-python]
!!! note !!! note
PyTorch installed via `conda` will statically link `NCCL` library, which can cause issues when vLLM tries to use `NCCL`. See <https://github.com/vllm-project/vllm/issues/8420> for more details. PyTorch installed via `conda` will statically link `NCCL` library, which can cause issues when vLLM tries to use `NCCL`. See <https://github.com/vllm-project/vllm/issues/8420> for more details.
...@@ -17,8 +18,8 @@ In order to be performant, vLLM has to compile many cuda kernels. The compilatio ...@@ -17,8 +18,8 @@ In order to be performant, vLLM has to compile many cuda kernels. The compilatio
Therefore, it is recommended to install vLLM with a **fresh new** environment. If either you have a different CUDA version or you want to use an existing PyTorch installation, you need to build vLLM from source. See [below](#build-wheel-from-source) for more details. Therefore, it is recommended to install vLLM with a **fresh new** environment. If either you have a different CUDA version or you want to use an existing PyTorch installation, you need to build vLLM from source. See [below](#build-wheel-from-source) for more details.
# --8<-- [end:set-up-using-python] --8<-- [end:set-up-using-python]
# --8<-- [start:pre-built-wheels] --8<-- [start:pre-built-wheels]
```bash ```bash
uv pip install vllm --torch-backend=auto uv pip install vllm --torch-backend=auto
...@@ -49,8 +50,8 @@ uv pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VE ...@@ -49,8 +50,8 @@ uv pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VE
LLM inference is a fast-evolving field, and the latest code may contain bug fixes, performance improvements, and new features that are not released yet. To allow users to try the latest code without waiting for the next release, vLLM provides wheels for every commit since `v0.5.3` on <https://wheels.vllm.ai/nightly>. There are multiple indices that could be used: LLM inference is a fast-evolving field, and the latest code may contain bug fixes, performance improvements, and new features that are not released yet. To allow users to try the latest code without waiting for the next release, vLLM provides wheels for every commit since `v0.5.3` on <https://wheels.vllm.ai/nightly>. There are multiple indices that could be used:
* `https://wheels.vllm.ai/nightly`: the default variant (CUDA with version specified in `VLLM_MAIN_CUDA_VERSION`) built with the last commit on the `main` branch. Currently it is CUDA 12.9. - `https://wheels.vllm.ai/nightly`: the default variant (CUDA with version specified in `VLLM_MAIN_CUDA_VERSION`) built with the last commit on the `main` branch. Currently it is CUDA 12.9.
* `https://wheels.vllm.ai/nightly/<variant>`: all other variants. Now this includes `cu130`, and `cpu`. The default variant (`cu129`) also has a subdirectory to keep consistency. - `https://wheels.vllm.ai/nightly/<variant>`: all other variants. Now this includes `cu130`, and `cpu`. The default variant (`cu129`) also has a subdirectory to keep consistency.
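The index layout described in the list above can be sketched as a small URL builder. This is an illustrative helper, not a vLLM API: the name `wheel_index_url` is hypothetical, and the per-commit form (`https://wheels.vllm.ai/<commit>`) follows the per-commit install commands shown elsewhere in this document.

```python
from typing import Optional

BASE_URL = "https://wheels.vllm.ai"


def wheel_index_url(ref: str = "nightly", variant: Optional[str] = None) -> str:
    """Return the wheel index URL for a ref and an optional variant.

    ref     -- "nightly" or a full commit hash
    variant -- subdirectory such as "cu130" or "cpu"; None selects the default
    """
    url = f"{BASE_URL}/{ref}"
    if variant:
        url += f"/{variant}"
    return url


if __name__ == "__main__":
    print(wheel_index_url())                # default variant, latest main
    print(wheel_index_url(variant="cpu"))   # CPU nightly index
```

The returned string is what you would pass to `--extra-index-url`.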
To install from nightly index, run: To install from nightly index, run:
...@@ -82,8 +83,8 @@ uv pip install vllm \ ...@@ -82,8 +83,8 @@ uv pip install vllm \
--extra-index-url https://wheels.vllm.ai/${VLLM_COMMIT} # add variant subdirectory here if needed --extra-index-url https://wheels.vllm.ai/${VLLM_COMMIT} # add variant subdirectory here if needed
``` ```
# --8<-- [end:pre-built-wheels] --8<-- [end:pre-built-wheels]
# --8<-- [start:build-wheel-from-source] --8<-- [start:build-wheel-from-source]
#### Set up using Python-only build (without compilation) {#python-only-build} #### Set up using Python-only build (without compilation) {#python-only-build}
...@@ -116,9 +117,9 @@ uv pip install --editable . ...@@ -116,9 +117,9 @@ uv pip install --editable .
There are more environment variables to control the behavior of Python-only build: There are more environment variables to control the behavior of Python-only build:
* `VLLM_PRECOMPILED_WHEEL_LOCATION`: specify the exact wheel URL or local file path of a pre-compiled wheel to use. All other logic to find the wheel will be skipped. - `VLLM_PRECOMPILED_WHEEL_LOCATION`: specify the exact wheel URL or local file path of a pre-compiled wheel to use. All other logic to find the wheel will be skipped.
* `VLLM_PRECOMPILED_WHEEL_COMMIT`: override the commit hash to download the pre-compiled wheel. It can be `nightly` to use the last **already built** commit on the main branch. - `VLLM_PRECOMPILED_WHEEL_COMMIT`: override the commit hash to download the pre-compiled wheel. It can be `nightly` to use the last **already built** commit on the main branch.
* `VLLM_PRECOMPILED_WHEEL_VARIANT`: specify the variant subdirectory to use on the nightly index, e.g., `cu129`, `cu130`, `cpu`. If not specified, the variant is auto-detected based on your system's CUDA version (from PyTorch or nvidia-smi). You can also set `VLLM_MAIN_CUDA_VERSION` to override auto-detection. - `VLLM_PRECOMPILED_WHEEL_VARIANT`: specify the variant subdirectory to use on the nightly index, e.g., `cu129`, `cu130`, `cpu`. If not specified, the variant is auto-detected based on your system's CUDA version (from PyTorch or nvidia-smi). You can also set `VLLM_MAIN_CUDA_VERSION` to override auto-detection.
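The interaction between these variables can be sketched as follows. This is an assumption-laden illustration, not vLLM's build logic: the helper `precompiled_wheel_env` is hypothetical and only assembles environment overrides, dropping the commit and variant when an explicit location is given because the location skips all other lookup logic.

```python
from typing import Dict, Optional


def precompiled_wheel_env(
    location: Optional[str] = None,
    commit: Optional[str] = None,
    variant: Optional[str] = None,
) -> Dict[str, str]:
    """Return environment overrides for a Python-only build (hypothetical helper)."""
    if location is not None:
        # An exact URL or local path bypasses the commit/variant lookup entirely.
        return {"VLLM_PRECOMPILED_WHEEL_LOCATION": location}
    env: Dict[str, str] = {}
    if commit is not None:
        env["VLLM_PRECOMPILED_WHEEL_COMMIT"] = commit  # e.g. "nightly"
    if variant is not None:
        env["VLLM_PRECOMPILED_WHEEL_VARIANT"] = variant  # e.g. "cu130", "cpu"
    return env


if __name__ == "__main__":
    print(precompiled_wheel_env(commit="nightly", variant="cu130"))
```

To use the overrides, merge them into the process environment, for example `{**os.environ, **precompiled_wheel_env(...)}`, before invoking the install command.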
You can find more information about vLLM's wheels in [Install the latest code](#install-the-latest-code). You can find more information about vLLM's wheels in [Install the latest code](#install-the-latest-code).
...@@ -236,8 +237,8 @@ export VLLM_TARGET_DEVICE=empty ...@@ -236,8 +237,8 @@ export VLLM_TARGET_DEVICE=empty
uv pip install -e . uv pip install -e .
``` ```
# --8<-- [end:build-wheel-from-source] --8<-- [end:build-wheel-from-source]
# --8<-- [start:pre-built-images] --8<-- [start:pre-built-images]
vLLM offers an official Docker image for deployment. vLLM offers an official Docker image for deployment.
The image can be used to run OpenAI compatible server and is available on Docker Hub as [vllm/vllm-openai](https://hub.docker.com/r/vllm/vllm-openai/tags). The image can be used to run OpenAI compatible server and is available on Docker Hub as [vllm/vllm-openai](https://hub.docker.com/r/vllm/vllm-openai/tags).
...@@ -297,8 +298,25 @@ You can add any other [engine-args](https://docs.vllm.ai/en/latest/configuration ...@@ -297,8 +298,25 @@ You can add any other [engine-args](https://docs.vllm.ai/en/latest/configuration
RUN uv pip install --system git+https://github.com/huggingface/transformers.git RUN uv pip install --system git+https://github.com/huggingface/transformers.git
``` ```
# --8<-- [end:pre-built-images] #### Running on Systems with Older CUDA Drivers
# --8<-- [start:build-image-from-source]
vLLM's Docker image comes with [CUDA compatibility libraries](https://docs.nvidia.com/deploy/cuda-compatibility/index.html) pre-installed. This allows you to run vLLM on systems with NVIDIA drivers that are older than the CUDA Toolkit version used in the image, but only supports select professional and datacenter NVIDIA GPUs.
To enable this feature, set the `VLLM_ENABLE_CUDA_COMPATIBILITY` environment variable to `1` or `true` when running the container:
```bash
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--env "HF_TOKEN=<secret>" \
--env "VLLM_ENABLE_CUDA_COMPATIBILITY=1" \
vllm/vllm-openai <args...>
```
This will automatically configure `LD_LIBRARY_PATH` to point to the compatibility libraries before loading PyTorch and other dependencies.
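If you launch the container from a script, the command above can be assembled programmatically. The sketch below is illustrative only: `docker_run_args` is a hypothetical helper, the flags are copied from the command above, and `<secret>` remains a placeholder for your own token.

```python
import os
import shlex
from typing import List, Sequence


def docker_run_args(
    image: str = "vllm/vllm-openai",
    hf_token: str = "<secret>",
    cuda_compat: bool = True,
    extra_args: Sequence[str] = (),
) -> List[str]:
    """Build the `docker run` argument list shown above (hypothetical helper)."""
    # Expand "~" here because no shell does it when the list is passed to subprocess.
    cache_dir = os.path.expanduser("~/.cache/huggingface")
    args = [
        "docker", "run", "--runtime", "nvidia", "--gpus", "all",
        "-v", f"{cache_dir}:/root/.cache/huggingface",
        "-p", "8000:8000",
        "--env", f"HF_TOKEN={hf_token}",
    ]
    if cuda_compat:
        # Opt in to the bundled CUDA compatibility libraries.
        args += ["--env", "VLLM_ENABLE_CUDA_COMPATIBILITY=1"]
    args.append(image)
    args.extend(extra_args)  # vLLM server arguments go after the image name
    return args


if __name__ == "__main__":
    print(shlex.join(docker_run_args()))
```

Passing the list to `subprocess.run` avoids shell quoting issues; `shlex.join` is used here only to print a copy-pasteable command.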
--8<-- [end:pre-built-images]
--8<-- [start:build-image-from-source]
You can build and run vLLM from source via the provided [docker/Dockerfile](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile). To build vLLM: You can build and run vLLM from source via the provided [docker/Dockerfile](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile). To build vLLM:
...@@ -398,9 +416,9 @@ The argument `vllm/vllm-openai` specifies the image to run, and should be replac ...@@ -398,9 +416,9 @@ The argument `vllm/vllm-openai` specifies the image to run, and should be replac
!!! note !!! note
**For version 0.4.1 and 0.4.2 only** - the vLLM docker images under these versions are supposed to be run under the root user since a library under the root user's home directory, i.e. `/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1` is required to be loaded during runtime. If you are running the container under a different user, you may need to first change the permissions of the library (and all the parent directories) to allow the user to access it, then run vLLM with environment variable `VLLM_NCCL_SO_PATH=/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1` . **For version 0.4.1 and 0.4.2 only** - the vLLM docker images under these versions are supposed to be run under the root user since a library under the root user's home directory, i.e. `/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1` is required to be loaded during runtime. If you are running the container under a different user, you may need to first change the permissions of the library (and all the parent directories) to allow the user to access it, then run vLLM with environment variable `VLLM_NCCL_SO_PATH=/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1` .
# --8<-- [end:build-image-from-source] --8<-- [end:build-image-from-source]
# --8<-- [start:supported-features] --8<-- [start:supported-features]
See [Feature x Hardware](../../features/README.md#feature-x-hardware) compatibility matrix for feature support information. See [Feature x Hardware](../../features/README.md#feature-x-hardware) compatibility matrix for feature support information.
# --8<-- [end:supported-features] --8<-- [end:supported-features]
\ No newline at end of file
...@@ -88,8 +88,7 @@ vLLM is a Python library that supports the following GPU variants. Select your G ...@@ -88,8 +88,7 @@ vLLM is a Python library that supports the following GPU variants. Select your G
### Pre-built images ### Pre-built images
<!-- markdownlint-disable MD025 --> --8<-- [start:pre-built-images]
# --8<-- [start:pre-built-images]
=== "NVIDIA CUDA" === "NVIDIA CUDA"
...@@ -103,15 +102,11 @@ vLLM is a Python library that supports the following GPU variants. Select your G ...@@ -103,15 +102,11 @@ vLLM is a Python library that supports the following GPU variants. Select your G
--8<-- "docs/getting_started/installation/gpu.xpu.inc.md:pre-built-images" --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:pre-built-images"
# --8<-- [end:pre-built-images] --8<-- [end:pre-built-images]
<!-- markdownlint-enable MD025 -->
<!-- markdownlint-disable MD001 -->
### Build image from source ### Build image from source
<!-- markdownlint-enable MD001 -->
<!-- markdownlint-disable MD025 --> --8<-- [start:build-image-from-source]
# --8<-- [start:build-image-from-source]
=== "NVIDIA CUDA" === "NVIDIA CUDA"
...@@ -125,8 +120,7 @@ vLLM is a Python library that supports the following GPU variants. Select your G ...@@ -125,8 +120,7 @@ vLLM is a Python library that supports the following GPU variants. Select your G
--8<-- "docs/getting_started/installation/gpu.xpu.inc.md:build-image-from-source" --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:build-image-from-source"
# --8<-- [end:build-image-from-source] --8<-- [end:build-image-from-source]
<!-- markdownlint-enable MD025 -->
## Supported features ## Supported features
......
# --8<-- [start:installation] <!-- markdownlint-disable MD041 MD051 -->
--8<-- [start:installation]
vLLM supports AMD GPUs with ROCm 6.3 or above. Pre-built wheels are available for ROCm 7.0. vLLM supports AMD GPUs with ROCm 6.3 or above. Pre-built wheels are available for ROCm 7.0.
# --8<-- [end:installation] --8<-- [end:installation]
# --8<-- [start:requirements] --8<-- [start:requirements]
- GPU: MI200s (gfx90a), MI300 (gfx942), MI350 (gfx950), Radeon RX 7900 series (gfx1100/1101), Radeon RX 9000 series (gfx1200/1201), Ryzen AI MAX / AI 300 Series (gfx1151/1150) - GPU: MI200s (gfx90a), MI300 (gfx942), MI350 (gfx950), Radeon RX 7900 series (gfx1100/1101), Radeon RX 9000 series (gfx1200/1201), Ryzen AI MAX / AI 300 Series (gfx1151/1150)
- ROCm 6.3 or above - ROCm 6.3 or above
- MI350 requires ROCm 7.0 or above - MI350 requires ROCm 7.0 or above
- Ryzen AI MAX / AI 300 Series requires ROCm 7.0.2 or above - Ryzen AI MAX / AI 300 Series requires ROCm 7.0.2 or above
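The requirements above amount to a lookup from GPU architecture to a minimum ROCm version. A minimal sketch, assuming the `gfx` identifiers and version floors listed above; the names `MIN_ROCM`, `parse_version`, and `rocm_ok` are hypothetical and not part of vLLM:

```python
from typing import Tuple

# Architectures from the requirements list; anything not overridden needs ROCm 6.3.
SUPPORTED = {
    "gfx90a", "gfx942", "gfx950",
    "gfx1100", "gfx1101", "gfx1200", "gfx1201",
    "gfx1151", "gfx1150",
}
DEFAULT_MIN = (6, 3, 0)
MIN_ROCM = {
    "gfx950": (7, 0, 0),   # MI350
    "gfx1151": (7, 0, 2),  # Ryzen AI MAX / AI 300 Series
    "gfx1150": (7, 0, 2),
}


def parse_version(text: str) -> Tuple[int, ...]:
    """Parse 'major.minor[.patch]' into a 3-tuple, padding with zeros."""
    parts = [int(p) for p in text.split(".")]
    parts += [0] * (3 - len(parts))
    return tuple(parts[:3])


def rocm_ok(gfx: str, rocm_version: str) -> bool:
    """Return True if the architecture is listed and the ROCm version is new enough."""
    if gfx not in SUPPORTED:
        return False
    return parse_version(rocm_version) >= MIN_ROCM.get(gfx, DEFAULT_MIN)


if __name__ == "__main__":
    print(rocm_ok("gfx942", "6.3"), rocm_ok("gfx1151", "7.0.1"))
```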
# --8<-- [end:requirements] --8<-- [end:requirements]
# --8<-- [start:set-up-using-python] --8<-- [start:set-up-using-python]
The vLLM wheel bundles PyTorch and all required dependencies, and you should use the included PyTorch for compatibility. Because vLLM compiles many ROCm kernels to ensure a validated, high‑performance stack, the resulting binaries may not be compatible with other ROCm or PyTorch builds. The vLLM wheel bundles PyTorch and all required dependencies, and you should use the included PyTorch for compatibility. Because vLLM compiles many ROCm kernels to ensure a validated, high‑performance stack, the resulting binaries may not be compatible with other ROCm or PyTorch builds.
If you need a different ROCm version or want to use an existing PyTorch installation, you’ll need to build vLLM from source. See [below](#build-wheel-from-source) for more details. If you need a different ROCm version or want to use an existing PyTorch installation, you’ll need to build vLLM from source. See [below](#build-wheel-from-source) for more details.
# --8<-- [end:set-up-using-python] --8<-- [end:set-up-using-python]
# --8<-- [start:pre-built-wheels] --8<-- [start:pre-built-wheels]
To install the latest version of vLLM for Python 3.12, ROCm 7.0 and `glibc >= 2.35`. To install the latest version of vLLM for Python 3.12, ROCm 7.0 and `glibc >= 2.35`.
...@@ -34,7 +35,7 @@ To install a specific version and ROCm variant of vLLM wheel. ...@@ -34,7 +35,7 @@ To install a specific version and ROCm variant of vLLM wheel.
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/0.15.0/rocm700 uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/0.15.0/rocm700
``` ```
!!! warning "Caveats for using `pip`" !!! warning "Caveats for using `pip`"
We recommend leveraging `uv` to install vLLM wheel. Using `pip` to install from custom indices is cumbersome, because `pip` combines packages from `--extra-index-url` and the default index, choosing only the latest version, which makes it difficult to install wheel from custom index if exact versions of all packages are specified exactly. In contrast, `uv` gives the extra index [higher priority than the default index](https://docs.astral.sh/uv/pip/compatibility/#packages-that-exist-on-multiple-indexes). We recommend leveraging `uv` to install vLLM wheel. Using `pip` to install from custom indices is cumbersome, because `pip` combines packages from `--extra-index-url` and the default index, choosing only the latest version, which makes it difficult to install wheel from custom index if exact versions of all packages are specified exactly. In contrast, `uv` gives the extra index [higher priority than the default index](https://docs.astral.sh/uv/pip/compatibility/#packages-that-exist-on-multiple-indexes).
...@@ -44,8 +45,8 @@ uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/0.15.0/rocm700 ...@@ -44,8 +45,8 @@ uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/0.15.0/rocm700
pip install vllm==0.15.0+rocm700 --extra-index-url https://wheels.vllm.ai/rocm/0.15.0/rocm700 pip install vllm==0.15.0+rocm700 --extra-index-url https://wheels.vllm.ai/rocm/0.15.0/rocm700
``` ```
# --8<-- [end:pre-built-wheels] --8<-- [end:pre-built-wheels]
# --8<-- [start:build-wheel-from-source] --8<-- [start:build-wheel-from-source]
!!! tip !!! tip
- If the following installation steps do not work for you, refer to [docker/Dockerfile.rocm_base](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base). A Dockerfile is itself a record of installation steps. - If the following installation steps do not work for you, refer to [docker/Dockerfile.rocm_base](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base). A Dockerfile is itself a record of installation steps.
...@@ -104,7 +105,6 @@ uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/0.15.0/rocm700 ...@@ -104,7 +105,6 @@ uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/0.15.0/rocm700
!!! note !!! note
- The validated `$FA_BRANCH` can be found in the [docker/Dockerfile.rocm_base](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base). - The validated `$FA_BRANCH` can be found in the [docker/Dockerfile.rocm_base](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base).
3. Optionally, if you choose to build AITER yourself to use a certain branch or commit, you can build AITER using the following steps: 3. Optionally, if you choose to build AITER yourself to use a certain branch or commit, you can build AITER using the following steps:
```bash ```bash
...@@ -120,7 +120,6 @@ uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/0.15.0/rocm700 ...@@ -120,7 +120,6 @@ uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/0.15.0/rocm700
- You will need to configure `$AITER_BRANCH_OR_COMMIT` for your purpose. - You will need to configure `$AITER_BRANCH_OR_COMMIT` for your purpose.
- The validated `$AITER_BRANCH_OR_COMMIT` can be found in the [docker/Dockerfile.rocm_base](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base). - The validated `$AITER_BRANCH_OR_COMMIT` can be found in the [docker/Dockerfile.rocm_base](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base).
4. Optionally, if you want to use MORI for EP or PD disaggregation, you can install [MORI](https://github.com/ROCm/mori) using the following steps: 4. Optionally, if you want to use MORI for EP or PD disaggregation, you can install [MORI](https://github.com/ROCm/mori) using the following steps:
```bash ```bash
...@@ -135,7 +134,6 @@ uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/0.15.0/rocm700 ...@@ -135,7 +134,6 @@ uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/0.15.0/rocm700
- You will need to configure `$MORI_BRANCH_OR_COMMIT` for your purpose. - You will need to configure `$MORI_BRANCH_OR_COMMIT` for your purpose.
- The validated `$MORI_BRANCH_OR_COMMIT` can be found in the [docker/Dockerfile.rocm_base](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base). - The validated `$MORI_BRANCH_OR_COMMIT` can be found in the [docker/Dockerfile.rocm_base](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base).
5. Build vLLM. For example, vLLM on ROCM 7.0 can be built with the following steps: 5. Build vLLM. For example, vLLM on ROCM 7.0 can be built with the following steps:
???+ console "Commands" ???+ console "Commands"
...@@ -171,8 +169,8 @@ uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/0.15.0/rocm700 ...@@ -171,8 +169,8 @@ uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/0.15.0/rocm700
- For MI300x (gfx942) users, to achieve optimal performance, please refer to [MI300x tuning guide](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/index.html) for performance optimization and tuning tips on system and workflow level. - For MI300x (gfx942) users, to achieve optimal performance, please refer to [MI300x tuning guide](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/index.html) for performance optimization and tuning tips on system and workflow level.
For vLLM, please refer to [vLLM performance optimization](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference-optimization/vllm-optimization.html). For vLLM, please refer to [vLLM performance optimization](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference-optimization/vllm-optimization.html).
# --8<-- [end:build-wheel-from-source] --8<-- [end:build-wheel-from-source]
# --8<-- [start:pre-built-images] --8<-- [start:pre-built-images]
vLLM offers an official Docker image for deployment. vLLM offers an official Docker image for deployment.
The image can be used to run OpenAI compatible server and is available on Docker Hub as [vllm/vllm-openai-rocm](https://hub.docker.com/r/vllm/vllm-openai-rocm/tags). The image can be used to run OpenAI compatible server and is available on Docker Hub as [vllm/vllm-openai-rocm](https://hub.docker.com/r/vllm/vllm-openai-rocm/tags).
...@@ -217,8 +215,8 @@ rocm/vllm-dev:nightly ...@@ -217,8 +215,8 @@ rocm/vllm-dev:nightly
Please check [LLM inference performance validation on AMD Instinct MI300X](https://rocm.docs.amd.com/en/latest/how-to/performance-validation/mi300x/vllm-benchmark.html) Please check [LLM inference performance validation on AMD Instinct MI300X](https://rocm.docs.amd.com/en/latest/how-to/performance-validation/mi300x/vllm-benchmark.html)
for instructions on how to use this prebuilt docker image. for instructions on how to use this prebuilt docker image.
# --8<-- [end:pre-built-images] --8<-- [end:pre-built-images]
# --8<-- [start:build-image-from-source] --8<-- [start:build-image-from-source]
You can build and run vLLM from source via the provided [docker/Dockerfile.rocm](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm). You can build and run vLLM from source via the provided [docker/Dockerfile.rocm](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm).
...@@ -271,7 +269,6 @@ To build vllm on ROCm 7.0 for MI200 and MI300 series, you can use the default (w ...@@ -271,7 +269,6 @@ To build vllm on ROCm 7.0 for MI200 and MI300 series, you can use the default (w
DOCKER_BUILDKIT=1 docker build -f docker/Dockerfile.rocm -t vllm/vllm-openai-rocm . DOCKER_BUILDKIT=1 docker build -f docker/Dockerfile.rocm -t vllm/vllm-openai-rocm .
``` ```
To run vLLM with the custom-built Docker image: To run vLLM with the custom-built Docker image:
```bash ```bash
...@@ -308,9 +305,9 @@ To use the docker image as base for development, you can launch it in interactiv ...@@ -308,9 +305,9 @@ To use the docker image as base for development, you can launch it in interactiv
vllm/vllm-openai-rocm vllm/vllm-openai-rocm
``` ```
# --8<-- [end:build-image-from-source] --8<-- [end:build-image-from-source]
# --8<-- [start:supported-features] --8<-- [start:supported-features]
See [Feature x Hardware](../../features/README.md#feature-x-hardware) compatibility matrix for feature support information. See [Feature x Hardware](../../features/README.md#feature-x-hardware) compatibility matrix for feature support information.
# --8<-- [end:supported-features] --8<-- [end:supported-features]
# --8<-- [start:installation] <!-- markdownlint-disable MD041 -->
--8<-- [start:installation]
vLLM initially supports basic model inference and serving on Intel GPU platform. vLLM initially supports basic model inference and serving on Intel GPU platform.
# --8<-- [end:installation] --8<-- [end:installation]
# --8<-- [start:requirements] --8<-- [start:requirements]
- Supported Hardware: Intel Data Center GPU, Intel ARC GPU - Supported Hardware: Intel Data Center GPU, Intel ARC GPU
- OneAPI requirements: oneAPI 2025.3 - Dependency: [vllm-xpu-kernels](https://github.com/vllm-project/vllm-xpu-kernels): a package provide all necessary vllm custom kernel when running vLLM on Intel GPU platform,
- Dependency: [vllm-xpu-kernels](https://github.com/vllm-project/vllm-xpu-kernels): a package provide all necessary vllm custom kernel when running vLLM on Intel GPU platform,
- Python: 3.12 - Python: 3.12
!!! warning !!! warning
The provided vllm-xpu-kernels whl is Python3.12 specific so this version is a MUST. The provided vllm-xpu-kernels whl is Python3.12 specific so this version is a MUST.
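A quick way to confirm the interpreter before installing is a version check along the lines of the sketch below. The helper name `is_supported_python` is hypothetical and not part of vLLM; it only encodes the Python 3.12 requirement stated above.

```python
import sys

# Hypothetical helper (not part of vLLM). The vllm-xpu-kernels wheel is
# described above as Python 3.12 specific, so only major.minor is compared.
REQUIRED = (3, 12)


def is_supported_python(version_info=sys.version_info) -> bool:
    """Return True when the given version tuple is Python 3.12.x."""
    return tuple(version_info[:2]) == REQUIRED


def describe(version_info=sys.version_info) -> str:
    """Build a human-readable status line for the given interpreter version."""
    found = ".".join(str(part) for part in version_info[:3])
    if is_supported_python(version_info):
        return f"Python {found} matches the vllm-xpu-kernels wheel"
    return f"Python {found} found, but the vllm-xpu-kernels wheel needs 3.12"
```

Calling `print(describe())` inside the target environment reports whether the active interpreter is usable.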
# --8<-- [end:requirements] --8<-- [end:requirements]
# --8<-- [start:set-up-using-python] --8<-- [start:set-up-using-python]
There is no extra information on creating a new Python environment for this device. There is no extra information on creating a new Python environment for this device.
# --8<-- [end:set-up-using-python] --8<-- [end:set-up-using-python]
# --8<-- [start:pre-built-wheels] --8<-- [start:pre-built-wheels]
Currently, there are no pre-built XPU wheels. Currently, there are no pre-built XPU wheels.
# --8<-- [end:pre-built-wheels] --8<-- [end:pre-built-wheels]
# --8<-- [start:build-wheel-from-source] --8<-- [start:build-wheel-from-source]
- First, install required [driver](https://dgpu-docs.intel.com/driver/installation.html#installing-gpu-drivers) and [Intel OneAPI](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html) 2025.3 or later. - First, install required [driver](https://dgpu-docs.intel.com/driver/installation.html#installing-gpu-drivers).
- Second, install Python packages for vLLM XPU backend building: - Second, install Python packages for vLLM XPU backend building (Intel OneAPI dependencies are installed automatically as part of `torch-xpu`, see [PyTorch XPU get started](https://docs.pytorch.org/docs/stable/notes/get_start_xpu.html)):
```bash ```bash
git clone https://github.com/vllm-project/vllm.git git clone https://github.com/vllm-project/vllm.git
...@@ -35,19 +35,32 @@ pip install --upgrade pip ...@@ -35,19 +35,32 @@ pip install --upgrade pip
pip install -v -r requirements/xpu.txt pip install -v -r requirements/xpu.txt
``` ```
- Then, build and install vLLM XPU backend: - Then, install the correct Triton package for Intel XPU.
The default `triton` package (for NVIDIA GPUs) may be installed as a transitive dependency (e.g., via `xgrammar`). For Intel XPU, you must replace it with `triton-xpu`:
```bash
pip uninstall -y triton triton-xpu
pip install triton-xpu==3.6.0 --extra-index-url https://download.pytorch.org/whl/xpu
```
!!! note
- `triton` (without suffix) is for NVIDIA GPUs only. On XPU, using it instead of `triton-xpu` can cause correctness or runtime issues.
- For torch 2.10 (the version used in `requirements/xpu.txt`), the matching package is `triton-xpu==3.6.0`. If you use a different version of torch, check the corresponding `triton-xpu` version in [docker/Dockerfile.xpu](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.xpu).
- Finally, build and install vLLM XPU backend:
```bash ```bash
VLLM_TARGET_DEVICE=xpu pip install --no-build-isolation -e . -v VLLM_TARGET_DEVICE=xpu pip install --no-build-isolation -e . -v
``` ```
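Before building, it can help to confirm which Triton distribution is actually installed. The sketch below is illustrative only: the function names are hypothetical, and it assumes that `triton` and `triton-xpu` are separate distributions that both provide the `triton` import name, so distribution metadata (not `import triton`) is used to tell them apart.

```python
from importlib import metadata


def installed_triton_packages(names=("triton", "triton-xpu")) -> dict:
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def xpu_triton_ok(found: dict, expected: str = "3.6.0") -> bool:
    """True when plain `triton` is absent and `triton-xpu` matches `expected`.

    A local version suffix such as "3.6.0+abc" is accepted as a match;
    whether real builds carry such a suffix is an assumption.
    """
    version = found.get("triton-xpu")
    if found.get("triton") is not None or version is None:
        return False
    return version == expected or version.startswith(expected + "+")
```

For example, `xpu_triton_ok(installed_triton_packages())` returning `False` suggests rerunning the uninstall and install commands above. The default `expected="3.6.0"` matches the torch 2.10 pairing mentioned in the note and would need adjusting for other torch versions.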
# --8<-- [end:build-wheel-from-source] --8<-- [end:build-wheel-from-source]
# --8<-- [start:pre-built-images] --8<-- [start:pre-built-images]
Currently, we release prebuilt XPU images on Docker [Hub](https://hub.docker.com/r/intel/vllm/tags) based on released vLLM versions. For more information, please refer to the release [notes](https://github.com/intel/ai-containers/blob/main/vllm).
# --8<-- [end:pre-built-images] --8<-- [end:pre-built-images]
# --8<-- [start:build-image-from-source] --8<-- [start:build-image-from-source]
```bash ```bash
docker build -f docker/Dockerfile.xpu -t vllm-xpu-env --shm-size=4g . docker build -f docker/Dockerfile.xpu -t vllm-xpu-env --shm-size=4g .
...@@ -61,8 +74,8 @@ docker run -it \ ...@@ -61,8 +74,8 @@ docker run -it \
vllm-xpu-env vllm-xpu-env
``` ```
# --8<-- [end:build-image-from-source] --8<-- [end:build-image-from-source]
# --8<-- [start:supported-features] --8<-- [start:supported-features]
The XPU platform supports **tensor parallel** inference/serving and also supports **pipeline parallel** as a beta feature for online serving. **Pipeline parallel** is supported on a single node with mp as the backend. A reference invocation looks like the following:
...@@ -77,9 +90,9 @@ vllm serve facebook/opt-13b \ ...@@ -77,9 +90,9 @@ vllm serve facebook/opt-13b \
By default, a ray instance will be launched automatically if no existing one is detected in the system, with `num-gpus` equal to `parallel_config.world_size`. We recommend properly starting a ray cluster before execution, referring to the [examples/online_serving/run_cluster.sh](https://github.com/vllm-project/vllm/blob/main/examples/online_serving/run_cluster.sh) helper script.
# --8<-- [end:supported-features] --8<-- [end:supported-features]
# --8<-- [start:distributed-backend] --8<-- [start:distributed-backend]
XPU platform uses **torch-ccl** for torch<2.8 and **xccl** for torch>=2.8 as distributed backend, since torch 2.8 supports **xccl** as built-in backend for XPU. XPU platform uses **torch-ccl** for torch<2.8 and **xccl** for torch>=2.8 as distributed backend, since torch 2.8 supports **xccl** as built-in backend for XPU.
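The version rule above can be sketched as a small helper. This is not vLLM's implementation: the function name is hypothetical, and the backend strings `"ccl"` (for torch-ccl) and `"xccl"` are assumed to be the names passed to `torch.distributed`. Versions are compared numerically because a plain string comparison would sort `2.10` before `2.8`.

```python
def xpu_distributed_backend(torch_version: str) -> str:
    """Pick the distributed backend name for a torch version string.

    Hypothetical helper: returns "xccl" for torch >= 2.8, otherwise "ccl".
    """
    # Drop any local build suffix such as "+xpu" before parsing.
    base = torch_version.split("+", 1)[0]
    major, minor = (int(part) for part in base.split(".")[:2])
    return "xccl" if (major, minor) >= (2, 8) else "ccl"
```

In practice the argument would be `torch.__version__`, for example `xpu_distributed_backend("2.10.0+xpu")`.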
# --8<-- [end:distributed-backend] --8<-- [end:distributed-backend]
<!-- markdownlint-disable MD041 -->
It's recommended to use [uv](https://docs.astral.sh/uv/), a very fast Python environment manager, to create and manage Python environments. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment using the following commands: It's recommended to use [uv](https://docs.astral.sh/uv/), a very fast Python environment manager, to create and manage Python environments. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment using the following commands:
```bash ```bash
uv venv --python 3.12 --seed uv venv --python 3.12 --seed --managed-python
source .venv/bin/activate source .venv/bin/activate
``` ```
...@@ -75,7 +75,7 @@ This guide will help you quickly get started with vLLM to perform: ...@@ -75,7 +75,7 @@ This guide will help you quickly get started with vLLM to perform:
## Offline Batched Inference ## Offline Batched Inference
With vLLM installed, you can start generating texts for list of input prompts (i.e. offline batch inferencing). See the example script: [examples/offline_inference/basic/basic.py](../../examples/offline_inference/basic/basic.py) With vLLM installed, you can start generating texts for list of input prompts (i.e. offline batch inferencing). See the example script: [examples/basic/offline_inference/basic.py](../../examples/basic/offline_inference/basic.py)
The first line of this example imports the classes [LLM][vllm.LLM] and [SamplingParams][vllm.SamplingParams]: The first line of this example imports the classes [LLM][vllm.LLM] and [SamplingParams][vllm.SamplingParams]:
...@@ -228,7 +228,7 @@ Since this server is compatible with OpenAI API, you can use it as a drop-in rep ...@@ -228,7 +228,7 @@ Since this server is compatible with OpenAI API, you can use it as a drop-in rep
print("Completion result:", completion) print("Completion result:", completion)
``` ```
A more detailed client example can be found here: [examples/offline_inference/basic/basic.py](../../examples/offline_inference/basic/basic.py) A more detailed client example can be found here: [examples/basic/offline_inference/basic.py](../../examples/basic/offline_inference/basic.py)
### OpenAI Chat Completions API with vLLM ### OpenAI Chat Completions API with vLLM
......
...@@ -55,6 +55,7 @@ Sorted alphabetically by GitHub handle: ...@@ -55,6 +55,7 @@ Sorted alphabetically by GitHub handle:
- [@ywang96](https://github.com/ywang96): Multimodality, benchmarks - [@ywang96](https://github.com/ywang96): Multimodality, benchmarks
- [@zhuohan123](https://github.com/zhuohan123): Project lead, RL integration, numerics - [@zhuohan123](https://github.com/zhuohan123): Project lead, RL integration, numerics
- [@zou3519](https://github.com/zou3519): Compilation - [@zou3519](https://github.com/zou3519): Compilation
- [@BoyuanFeng](https://github.com/BoyuanFeng): Compilation, CUDAGraph
### Emeritus Committers ### Emeritus Committers
...@@ -113,7 +114,7 @@ If you have PRs touching the area, please feel free to ping the area owner for r ...@@ -113,7 +114,7 @@ If you have PRs touching the area, please feel free to ping the area owner for r
- Multi-modal Input Processing: Components that load and process image/video/audio data into feature tensors - Multi-modal Input Processing: Components that load and process image/video/audio data into feature tensors
- @DarkLight1337, @ywang96, @Isotr0py - @DarkLight1337, @ywang96, @Isotr0py
- torch compile: The torch.compile integration in vLLM, custom passes & transformations - torch compile: The torch.compile integration in vLLM, custom passes & transformations
- @ProExpertProg, @zou3519, @youkaichao - @ProExpertProg, @zou3519, @youkaichao, @BoyuanFeng
- State space models: The state space models implementation in vLLM - State space models: The state space models implementation in vLLM
- @tdoublep, @tlrmchlsmth - @tdoublep, @tlrmchlsmth
- Reasoning and tool calling parsers - Reasoning and tool calling parsers
...@@ -154,7 +155,7 @@ If you have PRs touching the area, please feel free to ping the area owner for r ...@@ -154,7 +155,7 @@ If you have PRs touching the area, please feel free to ping the area owner for r
- FlashAttention: @LucasWilkinson - FlashAttention: @LucasWilkinson
- FlashInfer: @LucasWilkinson, @mgoin, @WoosukKwon - FlashInfer: @LucasWilkinson, @mgoin, @WoosukKwon
- Blackwell Kernels: @mgoin, @yewentao256 - Blackwell Kernels: @mgoin, @yewentao256
- DeepEP/DeepGEMM/pplx: @mgoin, @yewentao256 - DeepEP/DeepGEMM: @mgoin, @yewentao256
### Integrations ### Integrations
......
...@@ -79,13 +79,15 @@ Specially, committers are almost all area owners. They author subsystems, review ...@@ -79,13 +79,15 @@ Specially, committers are almost all area owners. They author subsystems, review
For a full list of committers and their respective areas, see the [committers](./committers.md) page. For a full list of committers and their respective areas, see the [committers](./committers.md) page.
#### Nomination Process #### Committer Proposal Process
Any committer can nominate candidates via our private mailing list: Any committer can nominate candidates via our private committer mailing list. The process runs as follows:
1. **Nominate**: Any committer may nominate a candidate by email to the private maintainers’ list, citing evidence mapped to the pre‑existing standards with links to PRs, reviews, RFCs, issues, benchmarks, and adoption evidence. 1. **Nominate**: A committer sends email to the committer group to nominate a candidate, highlighting the candidate’s contributions (e.g., links to PRs, reviews, RFCs, issues, benchmarks, and adoption evidence) and how they map to the standards below.
2. **Vote**: The lead maintainers will group voices support or concerns. Shared concerns can stop the process. The vote typically last 3 working days. For concerns, committers group discuss the clear criteria for such person to be nominated again. The lead maintainers will make the final decision. 2. **Discuss and vote**: The committer group discusses the nomination, votes, and voices concerns if needed. Shared concerns can stop the process. For concerns, the group discusses clear criteria for the person to be nominated again. Most cases are decided by consensus; in contentious cases, the lead maintainers resolve conflicts and make the decision.
3. **Confirm**: The lead maintainers send invitation, update CODEOWNERS, assign permissions, add to communications channels (mailing list and Slack). 3. **Feedback period**: After a two-week feedback period (allowing time for any last input or concerns), if no blocking concerns arise and the nominator confirms with lead maintainer group to move forward (via the mailing list or committers slack channel), the nominator sends an invitation to the candidate asking them to open a PR to update their code ownership (e.g., CODEOWNERS and committers list).
4. **Permissions and onboarding**: In parallel, the lead maintainers assign the necessary permissions in GitHub and add the new member to the committer mailing list, the committer-only Slack channel, and other communications channels as appropriate.
5. **Finalize**: Once the CODEOWNERS/committer PR is ready and permissions are in place, the PR is merged and the new committer is welcomed.
Committership is highly selective and merit based. The selection criteria require:
...@@ -133,6 +135,19 @@ PRs requires at least one committer review and approval. If the code is covered ...@@ -133,6 +135,19 @@ PRs requires at least one committer review and approval. If the code is covered
In cases where CI fails for reasons unrelated to the PR, the lead maintainers can merge the PR using the "force merge" option, which overrides the CI checks.
### AI Assisted Contributions
AI tools can accelerate development, but contributors remain fully responsible for all code they submit. Like the Developer Certificate of Origin, this policy centers on accountability: contributors must believe they have the right to submit their contribution under vLLM's open source license, regardless of how the code was created.
All AI-assisted contributions must meet the same quality, testing, and review standards as any other code. Contributors must review and understand AI-generated code before submission—just make sure it is good code:
- Do not submit "pure agent" PRs. The human submitter is responsible for reviewing all changed lines, validating behavior end-to-end, and running relevant tests.
- Attribution preserves legal clarity and community trust. Contributors must disclose AI assistance in pull requests and mark commits with appropriate trailers (e.g. `Co-authored-by:`).
- Avoid one-off "busywork" PRs (single typo, isolated style cleanup, one mutable default fix, etc.). Bundle mechanical cleanups into a clear, systematic scope.
!!! warning
These topics are outlined for agents in [AGENTS.md](../../AGENTS.md) with instructions for how to autonomously implement them.
### Slack ### Slack
Contributors are encouraged to join `#pr-reviews` and `#contributors` channels. Contributors are encouraged to join `#pr-reviews` and `#contributors` channels.
......
#!/bin/bash
# SPDX-License-Identifier: Apache-2.0
# Skip PR builds unless the PR has the "documentation" or "ready" label.
# Used by Read the Docs (see .readthedocs.yaml).
if [[ "$READTHEDOCS_VERSION_TYPE" != "external" ]]; then
exit 0
fi
PR_URL="https://api.github.com/repos/vllm-project/vllm/pulls/${READTHEDOCS_VERSION}"
CURL_ARGS=(-s -o /tmp/pr_response.json -w "%{http_code}")
if [[ -n "$GITHUB_TOKEN" ]]; then
CURL_ARGS+=(-H "Authorization: token ${GITHUB_TOKEN}")
fi
HTTP_CODE=$(curl "${CURL_ARGS[@]}" "$PR_URL")
if [[ "$HTTP_CODE" -ne 200 ]]; then
echo "GitHub API returned HTTP ${HTTP_CODE}, proceeding with build."
elif grep -qE '"name": *"(documentation|ready)"' /tmp/pr_response.json; then
echo "Found required label, proceeding with build."
else
echo "PR #${READTHEDOCS_VERSION} lacks 'documentation' or 'ready' label, cancelling build."
exit 1
fi
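The decision logic of the script above can be restated as a pure function, which makes the three outcomes easier to test. This is a sketch only, not a replacement for the script: `should_build` is a hypothetical name, and unlike the `grep` check, which matches a `"name"` field anywhere in the response, it inspects only the `labels` array of the pull request JSON.

```python
import json

REQUIRED_LABELS = {"documentation", "ready"}


def should_build(version_type: str, http_code: int, body: str) -> bool:
    """Mirror the script: return True to proceed, False to cancel the build."""
    # Non-PR builds (branches, tags) are never skipped.
    if version_type != "external":
        return True
    # If the GitHub API call failed, proceed with the build.
    if http_code != 200:
        return True
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        data = {}
    labels = data.get("labels", []) if isinstance(data, dict) else []
    if not isinstance(labels, list):
        labels = []
    # Collect label names; malformed entries are ignored.
    names = {label.get("name") for label in labels if isinstance(label, dict)}
    return bool(names & REQUIRED_LABELS)
```

A `200` response with an unparseable body is treated as having no labels, which matches the script's behavior of cancelling when the pattern is not found.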