Merge remote-tracking branch 'mirror/releases/v0.9.0' into v0.9.0-ori

Commit 4eabe123 authored May 28, 2025 by zhuwenwen
20 changed files
--- a/docs/source/features/quantization/supported_hardware.md
+++ b/docs/source/features/quantization/supported_hardware.md
-(quantization-supported-hardware)=
-
-# Supported Hardware
-
-The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
-
-:::{list-table}
-:header-rows: 1
-:widths: 20 8 8 8 8 8 8 8 8 8 8
-
-- * Implementation
-  * Volta
-  * Turing
-  * Ampere
-  * Ada
-  * Hopper
-  * AMD GPU
-  * Intel GPU
-  * x86 CPU
-  * AWS Inferentia
-  * Google TPU
-- * AWQ
-  * ❌
-  * ✅︎
-  * ✅︎
-  * ✅︎
-  * ✅︎
-  * ❌
-  * ✅︎
-  * ✅︎
-  * ❌
-  * ❌
-- * GPTQ
-  * ✅︎
-  * ✅︎
-  * ✅︎
-  * ✅︎
-  * ✅︎
-  * ❌
-  * ✅︎
-  * ✅︎
-  * ❌
-  * ❌
-- * Marlin (GPTQ/AWQ/FP8)
-  * ❌
-  * ❌
-  * ✅︎
-  * ✅︎
-  * ✅︎
-  * ❌
-  * ❌
-  * ❌
-  * ❌
-  * ❌
-- * INT8 (W8A8)
-  * ❌
-  * ✅︎
-  * ✅︎
-  * ✅︎
-  * ✅︎
-  * ❌
-  * ❌
-  * ✅︎
-  * ❌
-  * ✅︎
-- * FP8 (W8A8)
-  * ❌
-  * ❌
-  * ❌
-  * ✅︎
-  * ✅︎
-  * ✅︎
-  * ❌
-  * ❌
-  * ❌
-  * ❌
-- * BitBLAS (GPTQ)
-  * ✅︎
-  * ✅︎
-  * ✅︎
-  * ✅︎
-  * ✅︎
-  * ❌
-  * ❌
-  * ❌
-  * ❌
-  * ❌
-- * AQLM
-  * ✅︎
-  * ✅︎
-  * ✅︎
-  * ✅︎
-  * ✅︎
-  * ❌
-  * ❌
-  * ❌
-  * ❌
-  * ❌
-- * bitsandbytes
-  * ✅︎
-  * ✅︎
-  * ✅︎
-  * ✅︎
-  * ✅︎
-  * ❌
-  * ❌
-  * ❌
-  * ❌
-  * ❌
-- * DeepSpeedFP
-  * ✅︎
-  * ✅︎
-  * ✅︎
-  * ✅︎
-  * ✅︎
-  * ❌
-  * ❌
-  * ❌
-  * ❌
-  * ❌
-- * GGUF
-  * ✅︎
-  * ✅︎
-  * ✅︎
-  * ✅︎
-  * ✅︎
-  * ✅︎
-  * ❌
-  * ❌
-  * ❌
-  * ❌
-- * modelopt
-  * ✅︎
-  * ✅︎
-  * ✅︎
-  * ✅︎
-  * ✅︎
-  * ❌
-  * ❌
-  * ❌
-  * ❌
-  * ❌
-:::
-
-- Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
-- ✅︎ indicates that the quantization method is supported on the specified hardware.
-- ❌ indicates that the quantization method is not supported on the specified hardware.
-
-:::{note}
-This compatibility chart is subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.
-
-For the most up-to-date information on hardware support and quantization methods, please refer to <gh-dir:vllm/model_executor/layers/quantization> or consult with the vLLM development team.
-:::
--- a/docs/source/getting_started/installation.md
+++ b/docs/source/getting_started/installation.md
-(installation-index)=
-
-# Installation
-
-vLLM supports the following hardware platforms:
-
-:::{toctree}
-:maxdepth: 1
-:hidden:
-
-installation/gpu
-installation/cpu
-installation/ai_accelerator
-:::
-
-- <project:installation/gpu.md>
-  - NVIDIA CUDA
-  - AMD ROCm
-  - Intel XPU
-- <project:installation/cpu.md>
-  - Intel/AMD x86
-  - ARM AArch64
-  - Apple silicon
-  - IBM Z (S390X)
-- <project:installation/ai_accelerator.md>
-  - Google TPU
-  - Intel Gaudi
-  - AWS Neuron
--- a/docs/source/getting_started/installation/ai_accelerator.md
+++ b/docs/source/getting_started/installation/ai_accelerator.md
-# Other AI accelerators
-
-vLLM is a Python library that supports the following AI accelerators. Select your AI accelerator type to see vendor specific instructions:
-
-:::::{tab-set}
-:sync-group: device
-
-::::{tab-item} Google TPU
-:selected:
-:sync: tpu
-
-:::{include} ai_accelerator/tpu.inc.md
-:start-after: "# Installation"
-:end-before: "## Requirements"
-:::
-
-::::
-
-::::{tab-item} Intel Gaudi
-:sync: hpu-gaudi
-
-:::{include} ai_accelerator/hpu-gaudi.inc.md
-:start-after: "# Installation"
-:end-before: "## Requirements"
-:::
-
-::::
-
-::::{tab-item} AWS Neuron
-:sync: neuron
-
-:::{include} ai_accelerator/neuron.inc.md
-:start-after: "# Installation"
-:end-before: "## Requirements"
-:::
-
-::::
-
-:::::
-
-## Requirements
-
-:::::{tab-set}
-:sync-group: device
-
-::::{tab-item} Google TPU
-:sync: tpu
-
-:::{include} ai_accelerator/tpu.inc.md
-:start-after: "## Requirements"
-:end-before: "## Configure a new environment"
-:::
-
-::::
-
-::::{tab-item} Intel Gaudi
-:sync: hpu-gaudi
-
-:::{include} ai_accelerator/hpu-gaudi.inc.md
-:start-after: "## Requirements"
-:end-before: "## Configure a new environment"
-:::
-
-::::
-
-::::{tab-item} AWS Neuron
-:sync: neuron
-
-:::{include} ai_accelerator/neuron.inc.md
-:start-after: "## Requirements"
-:end-before: "## Configure a new environment"
-:::
-
-::::
-
-:::::
-
-## Configure a new environment
-
-:::::{tab-set}
-:sync-group: device
-
-::::{tab-item} Google TPU
-:sync: tpu
-
-:::{include} ai_accelerator/tpu.inc.md
-:start-after: "## Configure a new environment"
-:end-before: "## Set up using Python"
-:::
-
-::::
-
-::::{tab-item} Intel Gaudi
-:sync: hpu-gaudi
-
-:::{include} ai_accelerator/hpu-gaudi.inc.md
-:start-after: "## Configure a new environment"
-:end-before: "## Set up using Python"
-:::
-
-::::
-
-::::{tab-item} AWS Neuron
-:sync: neuron
-
-:::{include} ai_accelerator/neuron.inc.md
-:start-after: "## Configure a new environment"
-:end-before: "## Set up using Python"
-:::
-
-::::
-
-:::::
-
-## Set up using Python
-
-### Pre-built wheels
-
-:::::{tab-set}
-:sync-group: device
-
-::::{tab-item} Google TPU
-:sync: tpu
-
-:::{include} ai_accelerator/tpu.inc.md
-:start-after: "### Pre-built wheels"
-:end-before: "### Build wheel from source"
-:::
-
-::::
-
-::::{tab-item} Intel Gaudi
-:sync: hpu-gaudi
-
-:::{include} ai_accelerator/hpu-gaudi.inc.md
-:start-after: "### Pre-built wheels"
-:end-before: "### Build wheel from source"
-:::
-
-::::
-
-::::{tab-item} AWS Neuron
-:sync: neuron
-
-:::{include} ai_accelerator/neuron.inc.md
-:start-after: "### Pre-built wheels"
-:end-before: "### Build wheel from source"
-:::
-
-::::
-
-:::::
-
-### Build wheel from source
-
-:::::{tab-set}
-:sync-group: device
-
-::::{tab-item} Google TPU
-:sync: tpu
-
-:::{include} ai_accelerator/tpu.inc.md
-:start-after: "### Build wheel from source"
-:end-before: "## Set up using Docker"
-:::
-
-::::
-
-::::{tab-item} Intel Gaudi
-:sync: hpu-gaudi
-
-:::{include} ai_accelerator/hpu-gaudi.inc.md
-:start-after: "### Build wheel from source"
-:end-before: "## Set up using Docker"
-:::
-
-::::
-
-::::{tab-item} AWS Neuron
-:sync: neuron
-
-:::{include} ai_accelerator/neuron.inc.md
-:start-after: "### Build wheel from source"
-:end-before: "## Set up using Docker"
-:::
-
-::::
-
-:::::
-
-## Set up using Docker
-
-### Pre-built images
-
-:::::{tab-set}
-:sync-group: device
-
-::::{tab-item} Google TPU
-:sync: tpu
-
-:::{include} ai_accelerator/tpu.inc.md
-:start-after: "### Pre-built images"
-:end-before: "### Build image from source"
-:::
-
-::::
-
-::::{tab-item} Intel Gaudi
-:sync: hpu-gaudi
-
-:::{include} ai_accelerator/hpu-gaudi.inc.md
-:start-after: "### Pre-built images"
-:end-before: "### Build image from source"
-:::
-
-::::
-
-::::{tab-item} AWS Neuron
-:sync: neuron
-
-:::{include} ai_accelerator/neuron.inc.md
-:start-after: "### Pre-built images"
-:end-before: "### Build image from source"
-:::
-
-::::
-
-:::::
-
-### Build image from source
-
-:::::{tab-set}
-:sync-group: device
-
-::::{tab-item} Google TPU
-:sync: tpu
-
-:::{include} ai_accelerator/tpu.inc.md
-:start-after: "### Build image from source"
-:end-before: "## Extra information"
-:::
-
-::::
-
-::::{tab-item} Intel Gaudi
-:sync: hpu-gaudi
-
-:::{include} ai_accelerator/hpu-gaudi.inc.md
-:start-after: "### Build image from source"
-:end-before: "## Extra information"
-:::
-
-::::
-
-::::{tab-item} AWS Neuron
-:sync: neuron
-
-:::{include} ai_accelerator/neuron.inc.md
-:start-after: "### Build image from source"
-:end-before: "## Extra information"
-:::
-
-::::
-
-:::::
-
-## Extra information
-
-:::::{tab-set}
-:sync-group: device
-
-::::{tab-item} Google TPU
-:sync: tpu
-
-:::{include} ai_accelerator/tpu.inc.md
-:start-after: "## Extra information"
-:::
-
-::::
-
-::::{tab-item} Intel Gaudi
-:sync: hpu-gaudi
-
-:::{include} ai_accelerator/hpu-gaudi.inc.md
-:start-after: "## Extra information"
-:::
-
-::::
-
-::::{tab-item} AWS Neuron
-:sync: neuron
-
-:::{include} ai_accelerator/neuron.inc.md
-:start-after: "## Extra information"
-:::
-
-::::
-
-:::::
--- a/docs/source/getting_started/installation/cpu/x86.inc.md
+++ b/docs/source/getting_started/installation/cpu/x86.inc.md
-# Installation
-
-vLLM initially supports basic model inferencing and serving on x86 CPU platform, with data types FP32, FP16 and BF16.
-
-:::{attention}
-There are no pre-built wheels or images for this device, so you must build vLLM from source.
-:::
-
-## Requirements
-
-- OS: Linux
-- Compiler: `gcc/g++ >= 12.3.0` (optional, recommended)
-- Instruction Set Architecture (ISA): AVX512 (optional, recommended)
-
-:::{tip}
-[Intel Extension for PyTorch (IPEX)](https://github.com/intel/intel-extension-for-pytorch) extends PyTorch with up-to-date features optimizations for an extra performance boost on Intel hardware.
-:::
-
-## Set up using Python
-
-### Pre-built wheels
-
-### Build wheel from source
-
-:::{include} cpu/build.inc.md
-:::
-
-:::{note}
-- AVX512_BF16 is an extension ISA provides native BF16 data type conversion and vector product instructions, which brings some performance improvement compared with pure AVX512. The CPU backend build script will check the host CPU flags to determine whether to enable AVX512_BF16.
-- If you want to force enable AVX512_BF16 for the cross-compilation, please set environment variable `VLLM_CPU_AVX512BF16=1` before the building.
-:::
-
-## Set up using Docker
-
-### Pre-built images
-
-See [https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo](https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo)
-
-### Build image from source
-
-## Extra information
--- a/docs/source/getting_started/installation/gpu.md
+++ b/docs/source/getting_started/installation/gpu.md
-# GPU
-
-vLLM is a Python library that supports the following GPU variants. Select your GPU type to see vendor specific instructions:
-
-:::::{tab-set}
-:sync-group: device
-
-::::{tab-item} NVIDIA CUDA
-:selected:
-:sync: cuda
-
-:::{include} gpu/cuda.inc.md
-:start-after: "# Installation"
-:end-before: "## Requirements"
-:::
-
-::::
-
-::::{tab-item} AMD ROCm
-:sync: rocm
-
-:::{include} gpu/rocm.inc.md
-:start-after: "# Installation"
-:end-before: "## Requirements"
-:::
-
-::::
-
-::::{tab-item} Intel XPU
-:sync: xpu
-
-:::{include} gpu/xpu.inc.md
-:start-after: "# Installation"
-:end-before: "## Requirements"
-:::
-
-::::
-
-:::::
-
-## Requirements
-
-- OS: Linux
-- Python: 3.9 -- 3.12
-
-:::::{tab-set}
-:sync-group: device
-
-::::{tab-item} NVIDIA CUDA
-:sync: cuda
-
-:::{include} gpu/cuda.inc.md
-:start-after: "## Requirements"
-:end-before: "## Set up using Python"
-:::
-
-::::
-
-::::{tab-item} AMD ROCm
-:sync: rocm
-
-:::{include} gpu/rocm.inc.md
-:start-after: "## Requirements"
-:end-before: "## Set up using Python"
-:::
-
-::::
-
-::::{tab-item} Intel XPU
-:sync: xpu
-
-:::{include} gpu/xpu.inc.md
-:start-after: "## Requirements"
-:end-before: "## Set up using Python"
-:::
-
-::::
-
-:::::
-
-## Set up using Python
-
-### Create a new Python environment
-
-:::{include} python_env_setup.inc.md
-:::
-
-:::::{tab-set}
-:sync-group: device
-
-::::{tab-item} NVIDIA CUDA
-:sync: cuda
-
-:::{include} gpu/cuda.inc.md
-:start-after: "## Create a new Python environment"
-:end-before: "### Pre-built wheels"
-:::
-
-::::
-
-::::{tab-item} AMD ROCm
-:sync: rocm
-
-There is no extra information on creating a new Python environment for this device.
-
-::::
-
-::::{tab-item} Intel XPU
-:sync: xpu
-
-There is no extra information on creating a new Python environment for this device.
-
-::::
-
-:::::
-
-### Pre-built wheels
-
-:::::{tab-set}
-:sync-group: device
-
-::::{tab-item} NVIDIA CUDA
-:sync: cuda
-
-:::{include} gpu/cuda.inc.md
-:start-after: "### Pre-built wheels"
-:end-before: "### Build wheel from source"
-:::
-
-::::
-
-::::{tab-item} AMD ROCm
-:sync: rocm
-
-:::{include} gpu/rocm.inc.md
-:start-after: "### Pre-built wheels"
-:end-before: "### Build wheel from source"
-:::
-
-::::
-
-::::{tab-item} Intel XPU
-:sync: xpu
-
-:::{include} gpu/xpu.inc.md
-:start-after: "### Pre-built wheels"
-:end-before: "### Build wheel from source"
-:::
-
-::::
-
-:::::
-
-(build-from-source)=
-
-### Build wheel from source
-
-:::::{tab-set}
-:sync-group: device
-
-::::{tab-item} NVIDIA CUDA
-:sync: cuda
-
-:::{include} gpu/cuda.inc.md
-:start-after: "### Build wheel from source"
-:end-before: "## Set up using Docker"
-:::
-
-::::
-
-::::{tab-item} AMD ROCm
-:sync: rocm
-
-:::{include} gpu/rocm.inc.md
-:start-after: "### Build wheel from source"
-:end-before: "## Set up using Docker"
-:::
-
-::::
-
-::::{tab-item} Intel XPU
-:sync: xpu
-
-:::{include} gpu/xpu.inc.md
-:start-after: "### Build wheel from source"
-:end-before: "## Set up using Docker"
-:::
-
-::::
-
-:::::
-
-## Set up using Docker
-
-### Pre-built images
-
-:::::{tab-set}
-:sync-group: device
-
-::::{tab-item} NVIDIA CUDA
-:sync: cuda
-
-:::{include} gpu/cuda.inc.md
-:start-after: "### Pre-built images"
-:end-before: "### Build image from source"
-:::
-
-::::
-
-::::{tab-item} AMD ROCm
-:sync: rocm
-
-:::{include} gpu/rocm.inc.md
-:start-after: "### Pre-built images"
-:end-before: "### Build image from source"
-:::
-
-::::
-
-::::{tab-item} Intel XPU
-:sync: xpu
-
-:::{include} gpu/xpu.inc.md
-:start-after: "### Pre-built images"
-:end-before: "### Build image from source"
-:::
-
-::::
-
-:::::
-
-### Build image from source
-
-:::::{tab-set}
-:sync-group: device
-
-::::{tab-item} NVIDIA CUDA
-:sync: cuda
-
-:::{include} gpu/cuda.inc.md
-:start-after: "### Build image from source"
-:end-before: "## Supported features"
-:::
-
-::::
-
-::::{tab-item} AMD ROCm
-:sync: rocm
-
-:::{include} gpu/rocm.inc.md
-:start-after: "### Build image from source"
-:end-before: "## Supported features"
-:::
-
-::::
-
-::::{tab-item} Intel XPU
-:sync: xpu
-
-:::{include} gpu/xpu.inc.md
-:start-after: "### Build image from source"
-:end-before: "## Supported features"
-:::
-
-::::
-
-:::::
-
-## Supported features
-
-:::::{tab-set}
-:sync-group: device
-
-::::{tab-item} NVIDIA CUDA
-:sync: cuda
-
-:::{include} gpu/cuda.inc.md
-:start-after: "## Supported features"
-:::
-
-::::
-
-::::{tab-item} AMD ROCm
-:sync: rocm
-
-:::{include} gpu/rocm.inc.md
-:start-after: "## Supported features"
-:::
-
-::::
-
-::::{tab-item} Intel XPU
-:sync: xpu
-
-:::{include} gpu/xpu.inc.md
-:start-after: "## Supported features"
-:::
-
-::::
-
-:::::
--- a/docs/source/getting_started/installation/python_env_setup.inc.md
+++ b/docs/source/getting_started/installation/python_env_setup.inc.md
-You can create a new Python environment using [conda](https://docs.conda.io/projects/conda/en/stable/user-guide/getting-started.html):
-
-```console
-# (Recommended) Create a new conda environment.
-conda create -n vllm python=3.12 -y
-conda activate vllm
-```
-
-:::{note}
-[PyTorch has deprecated the conda release channel](https://github.com/pytorch/pytorch/issues/138506). If you use `conda`, please only use it to create Python environment rather than installing packages.
-:::
-
-Or you can create a new Python environment using [uv](https://docs.astral.sh/uv/), a very fast Python environment manager. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment using the following command:
-
-```console
-# (Recommended) Create a new uv environment. Use `--seed` to install `pip` and `setuptools` in the environment.
-uv venv --python 3.12 --seed
-source .venv/bin/activate
-```
--- a/docs/source/index.md
+++ b/docs/source/index.md
-# Welcome to vLLM
-
-:::{figure} ./assets/logos/vllm-logo-text-light.png
-:align: center
-:alt: vLLM
-:class: no-scaled-link
-:width: 60%
-:::
-
-:::{raw} html
-<p style="text-align:center">
-<strong>Easy, fast, and cheap LLM serving for everyone
-</strong>
-</p>
-
-<p style="text-align:center">
-<script async defer src="https://buttons.github.io/buttons.js"></script>
-<a class="github-button" href="https://github.com/vllm-project/vllm" data-show-count="true" data-size="large" aria-label="Star">Star</a>
-<a class="github-button" href="https://github.com/vllm-project/vllm/subscription" data-icon="octicon-eye" data-size="large" aria-label="Watch">Watch</a>
-<a class="github-button" href="https://github.com/vllm-project/vllm/fork" data-icon="octicon-repo-forked" data-size="large" aria-label="Fork">Fork</a>
-</p>
-:::
-
-vLLM is a fast and easy-to-use library for LLM inference and serving.
-
-Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.
-
-vLLM is fast with:
-
-- State-of-the-art serving throughput
-- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)
-- Continuous batching of incoming requests
-- Fast model execution with CUDA/HIP graph
-- Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8
-- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer.
-- Speculative decoding
-- Chunked prefill
-
-vLLM is flexible and easy to use with:
-
-- Seamless integration with popular HuggingFace models
-- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
-- Tensor parallelism and pipeline parallelism support for distributed inference
-- Streaming outputs
-- OpenAI-compatible API server
-- Support NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi® accelerators and GPUs, IBM Power CPUs, TPU, and AWS Trainium and Inferentia Accelerators.
-- Prefix caching support
-- Multi-lora support
-
-For more information, check out the following:
-
-- [vLLM announcing blog post](https://vllm.ai) (intro to PagedAttention)
-- [vLLM paper](https://arxiv.org/abs/2309.06180) (SOSP 2023)
-- [How continuous batching enables 23x throughput in LLM inference while reducing p50 latency](https://www.anyscale.com/blog/continuous-batching-llm-inference) by Cade Daniel et al.
-- [vLLM Meetups](#meetups)
-
-## Documentation
-
-% How to start using vLLM?
-
-:::{toctree}
-:caption: Getting Started
-:maxdepth: 1
-
-getting_started/installation
-getting_started/quickstart
-getting_started/examples/examples_index
-getting_started/troubleshooting
-getting_started/faq
-getting_started/v1_user_guide
-
-:::
-
-% What does vLLM support?
-
-:::{toctree}
-:caption: Models
-:maxdepth: 1
-
-models/supported_models
-models/generative_models
-models/pooling_models
-models/extensions/index
-:::
-
-% Additional capabilities
-
-:::{toctree}
-:caption: Features
-:maxdepth: 1
-
-features/quantization/index
-features/multimodal_inputs
-features/prompt_embeds
-features/lora
-features/tool_calling
-features/reasoning_outputs
-features/structured_outputs
-features/automatic_prefix_caching
-features/disagg_prefill
-features/spec_decode
-features/compatibility_matrix
-:::
-
-% Details about running vLLM
-
-:::{toctree}
-:caption: Training
-:maxdepth: 1
-
-training/trl.md
-training/rlhf.md
-
-:::
-
-:::{toctree}
-:caption: Inference and Serving
-:maxdepth: 1
-
-serving/offline_inference
-serving/openai_compatible_server
-serving/serve_args
-serving/distributed_serving
-serving/metrics
-serving/engine_args
-serving/env_vars
-serving/usage_stats
-serving/integrations/index
-:::
-
-% Scaling up vLLM for production
-
-:::{toctree}
-:caption: Deployment
-:maxdepth: 1
-
-deployment/security
-deployment/docker
-deployment/k8s
-deployment/nginx
-deployment/frameworks/index
-deployment/integrations/index
-:::
-
-% Making the most out of vLLM
-
-:::{toctree}
-:caption: Performance
-:maxdepth: 1
-
-performance/optimization
-performance/benchmarks
-:::
-
-% Explanation of vLLM internals
-
-:::{toctree}
-:caption: Design Documents
-:maxdepth: 2
-
-design/arch_overview
-design/huggingface_integration
-design/plugin_system
-design/kernel/paged_attention
-design/mm_processing
-design/automatic_prefix_caching
-design/multiprocessing
-:::
-
-:::{toctree}
-:caption: V1 Design Documents
-:maxdepth: 2
-
-design/v1/torch_compile
-design/v1/prefix_caching
-design/v1/metrics
-:::
-
-% How to contribute to the vLLM project
-
-:::{toctree}
-:caption: Developer Guide
-:maxdepth: 2
-
-contributing/overview
-contributing/deprecation_policy
-contributing/profiling/profiling_index
-contributing/dockerfile/dockerfile
-contributing/model/index
-contributing/vulnerability_management
-:::
-
-% Technical API specifications
-
-:::{toctree}
-:caption: API Reference
-:maxdepth: 2
-
-api/summary
-api/vllm/vllm
-:::
-
-% Latest news and acknowledgements
-
-:::{toctree}
-:caption: Community
-:maxdepth: 1
-
-community/blog
-community/meetups
-community/sponsors
-:::
-
-## Indices and tables
-
-- {ref}`genindex`
-- {ref}`modindex`
--- a/docs/source/models/extensions/index.md
+++ b/docs/source/models/extensions/index.md
-# Built-in Extensions
-
-:::{toctree}
-:maxdepth: 1
-
-runai_model_streamer
-tensorizer
-fastsafetensor
-:::
--- a/docs/source/serving/engine_args.md
+++ b/docs/source/serving/engine_args.md
-(engine-args)=
-
-# Engine Arguments
-
-Engine arguments control the behavior of the vLLM engine.
-
-- For [offline inference](#offline-inference), they are part of the arguments to `LLM` class.
-- For [online serving](#openai-compatible-server), they are part of the arguments to `vllm serve`.
-
-For references to all arguments available from `vllm serve` see the [serve args](#serve-args) documentation.
-
-Below, you can find an explanation of every engine argument:
-
-<!--- pyml disable-num-lines 7 no-space-in-emphasis -->
-```{eval-rst}
-.. argparse::
-    :module: vllm.engine.arg_utils
-    :func: _engine_args_parser
-    :prog: vllm serve
-    :nodefaultconst:
-    :markdownhelp:
-```
-
-## Async Engine Arguments
-
-Additional arguments are available to the asynchronous engine which is used for online serving:
-
-<!--- pyml disable-num-lines 7 no-space-in-emphasis -->
-```{eval-rst}
-.. argparse::
-    :module: vllm.engine.arg_utils
-    :func: _async_engine_args_parser
-    :prog: vllm serve
-    :nodefaultconst:
-    :markdownhelp:
-```
--- a/docs/source/serving/env_vars.md
+++ b/docs/source/serving/env_vars.md
-# Environment Variables
-
-vLLM uses the following environment variables to configure the system:
-
-:::{warning}
-Please note that `VLLM_PORT` and `VLLM_HOST_IP` set the port and ip for vLLM's **internal usage**. It is not the port and ip for the API server. If you use `--host $VLLM_HOST_IP` and `--port $VLLM_PORT` to start the API server, it will not work.
-
-All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service as `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables).
-:::
-
-:::{literalinclude} ../../../vllm/envs.py
-:end-before: end-env-vars-definition
-:language: python
-:start-after: begin-env-vars-definition
-:::
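Note on the Kubernetes warning in `env_vars.md` above: the collision is easier to see with a concrete case. The sketch below is illustrative only. `k8s_service_env` is a hypothetical helper, not part of vLLM or Kubernetes, and it assumes the naming rule from the linked Kubernetes documentation (Service name upper-cased, dashes turned into underscores). The cluster IP and port are made-up values.

```python
def k8s_service_env(service_name: str, cluster_ip: str, port: int) -> dict[str, str]:
    """Approximate the per-Service variables Kubernetes injects into pods (hypothetical helper)."""
    # Assumed naming rule: upper-case the Service name and replace dashes with underscores.
    prefix = service_name.upper().replace("-", "_")
    return {
        f"{prefix}_SERVICE_HOST": cluster_ip,
        f"{prefix}_SERVICE_PORT": str(port),
        f"{prefix}_PORT": f"tcp://{cluster_ip}:{port}",
    }


# A Service named "vllm" produces a variable called VLLM_PORT that holds a URL,
# which is the same name vLLM reads for its internal port.
env = k8s_service_env("vllm", "10.0.0.11", 8000)
collides = "VLLM_PORT" in env
print(collides, env["VLLM_PORT"])
```

Naming the Service something else (for example `llm-server`, also a made-up name) moves the injected variables to a different prefix and avoids the clash.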
--- a/docs/source/serving/integrations/index.md
+++ b/docs/source/serving/integrations/index.md
-# External Integrations
-
-:::{toctree}
-:maxdepth: 1
-
-langchain
-llamaindex
-:::
--- a/docs/source/training/rlhf.md
+++ b/docs/source/training/rlhf.md
@@ -6,6 +6,6 @@ vLLM can be used to generate the completions for RLHF. The best way to do this i

 See the following basic examples to get started if you don't want to use an existing library:

-- [Training and inference processes are located on separate GPUs (inspired by OpenRLHF)](https://docs.vllm.ai/en/latest/getting_started/examples/rlhf.html)
-- [Training and inference processes are colocated on the same GPUs using Ray](https://docs.vllm.ai/en/latest/getting_started/examples/rlhf_colocate.html)
-- [Utilities for performing RLHF with vLLM](https://docs.vllm.ai/en/latest/getting_started/examples/rlhf_utils.html)
+- [Training and inference processes are located on separate GPUs (inspired by OpenRLHF)](../examples/offline_inference/rlhf.md)
+- [Training and inference processes are colocated on the same GPUs using Ray](../examples/offline_inference/rlhf_colocate.md)
+- [Utilities for performing RLHF with vLLM](../examples/offline_inference/rlhf_utils.md)
--- a/docs/source/training/trl.md
+++ b/docs/source/training/trl.md
@@ -6,8 +6,7 @@ Online methods such as GRPO or Online DPO require the model to generate completi

 See the guide [vLLM for fast generation in online methods](https://huggingface.co/docs/trl/main/en/speeding_up_training#vllm-for-fast-generation-in-online-methods) in the TRL documentation for more information.

-:::{seealso}
-For more information on the `use_vllm` flag you can provide to the configs of these online methods, see:
-- [`trl.GRPOConfig.use_vllm`](https://huggingface.co/docs/trl/main/en/grpo_trainer#trl.GRPOConfig.use_vllm)
-- [`trl.OnlineDPOConfig.use_vllm`](https://huggingface.co/docs/trl/main/en/online_dpo_trainer#trl.OnlineDPOConfig.use_vllm)
-:::
+!!! info
+    For more information on the `use_vllm` flag you can provide to the configs of these online methods, see:
+    - [`trl.GRPOConfig.use_vllm`](https://huggingface.co/docs/trl/main/en/grpo_trainer#trl.GRPOConfig.use_vllm)
+    - [`trl.OnlineDPOConfig.use_vllm`](https://huggingface.co/docs/trl/main/en/online_dpo_trainer#trl.OnlineDPOConfig.use_vllm)
--- a/docs/usage/README.md
+++ b/docs/usage/README.md
+# Using vLLM
+
+vLLM supports the following usage patterns:
+
+- [Inference and Serving](../serving/offline_inference.md): Run a single instance of a model.
+- [Deployment](../deployment/docker.md): Scale up model instances for production.
+- [Training](../training/rlhf.md): Train or fine-tune a model.
--- a/docs/source/getting_started/faq.md
+++ b/docs/source/getting_started/faq.md
-(faq)=
-
-# Frequently Asked Questions
+---
+title: Frequently Asked Questions
+---
+[](){ #faq }

 > Q: How can I serve multiple models on a single port using the OpenAI API?

 A: Assuming that you're referring to using OpenAI compatible server to serve multiple models at once, that is not currently supported, you can run multiple instances of the server (each serving a different model) at the same time, and have another layer to route the incoming request to the correct server accordingly.

-______________________________________________________________________
+---

 > Q: Which model to use for offline inference embedding?

 A: You can try [e5-mistral-7b-instruct](https://huggingface.co/intfloat/e5-mistral-7b-instruct) and [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5);
-more are listed [here](#supported-models).
+more are listed [here][supported-models].

 By extracting hidden states, vLLM can automatically convert text generation models like [Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B),
 [Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) into embedding models,
 but they are expected to be inferior to models that are specifically trained on embedding tasks.

-______________________________________________________________________
+---

 > Q: Can the output of a prompt vary across runs in vLLM?


--- a/docs/source/serving/metrics.md
+++ b/docs/source/serving/metrics.md
@@ -4,7 +4,7 @@ vLLM exposes a number of metrics that can be used to monitor the health of the
 system. These metrics are exposed via the `/metrics` endpoint on the vLLM
 OpenAI compatible API server.

-You can start the server using Python, or using [Docker](#deployment-docker):
+You can start the server using Python, or using [Docker][deployment-docker]:

 ```console
 vllm serve unsloth/Llama-3.2-1B-Instruct
@@ -31,11 +31,9 @@ vllm:iteration_tokens_total_bucket{le="512.0",model_name="unsloth/Llama-3.2-1B-I

 The following metrics are exposed:

-:::{literalinclude} ../../../vllm/engine/metrics.py
-:end-before: end-metrics-definitions
-:language: python
-:start-after: begin-metrics-definitions
-:::
+```python
+--8<-- "vllm/engine/metrics.py:metrics-definitions"
+```

 The following metrics are deprecated and due to be removed in a future version:


--- a/docs/usage/reproducibility.md
+++ b/docs/usage/reproducibility.md
+# Reproducibility
+
+vLLM does not guarantee the reproducibility of the results by default, for the sake of performance. You need to do the following to achieve
+reproducible results:
+
+- For V1: Turn off multiprocessing to make the scheduling deterministic by setting `VLLM_ENABLE_V1_MULTIPROCESSING=0`.
+- For V0: Set the global seed (see below).
+
+Example: <gh-file:examples/offline_inference/reproducibility.py>
+
+!!! warning
+
+    Applying the above settings [changes the random state in user code](#locality-of-random-state).
+
+!!! note
+
+    Even with the above settings, vLLM only provides reproducibility
+    when it runs on the same hardware and the same vLLM version.
+    Also, the online serving API (`vllm serve`) does not support reproducibility
+    because it is almost impossible to make the scheduling deterministic in the
+    online setting.
+
+## Setting the global seed
+
+The `seed` parameter in vLLM is used to control the random states for various random number generators.
+
+If a specific seed value is provided, the random states for `random`, `np.random`, and `torch.manual_seed` will be set accordingly.
+
+However, in some cases, setting the seed will also [change the random state in user code](#locality-of-random-state).
+
+### Default Behavior
+
+In V0, the `seed` parameter defaults to `None`. When the `seed` parameter is `None`, the random states for `random`, `np.random`, and `torch.manual_seed` are not set. This means that each run of vLLM will produce different results if `temperature > 0`, as expected.
+
+In V1, the `seed` parameter defaults to `0` which sets the random state for each worker, so the results will remain consistent for each vLLM run even if `temperature > 0`.
+
+!!! note
+
+    It is impossible to un-specify a seed for V1 because different workers need to sample the same outputs
+    for workflows such as speculative decoding.
+    
+    For more information, see: <gh-pr:17929>
+
+### Locality of random state
+
+The random state in user code (i.e. the code that constructs the [LLM][vllm.LLM] class) is updated by vLLM under the following conditions:
+
+- For V0: The seed is specified.
+- For V1: The workers are run in the same process as user code, i.e.: `VLLM_ENABLE_V1_MULTIPROCESSING=0`.
+
+By default, these conditions are not active, so you can use vLLM without worrying about
+accidentally making subsequent operations that rely on random state deterministic.
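Note on "Locality of random state" in `reproducibility.md` above: a minimal sketch of the effect, using only the standard library. `seed_in_process` is a hypothetical stand-in for any library that seeds the global `random` module inside the caller's process. It is not vLLM's code and assumes nothing about how vLLM applies its seed internally.

```python
import random


def seed_in_process(seed: int) -> None:
    """Hypothetical library call that seeds the global RNG in the caller's process."""
    random.seed(seed)


def user_draws(seed_library: bool) -> list[float]:
    """Simulate user code that samples after (optionally) initialising the library."""
    random.seed()  # user code starts from an unpredictable state
    if seed_library:
        seed_in_process(0)  # stands in for workers sharing the user's process
    return [random.random() for _ in range(3)]


first = user_draws(seed_library=True)
second = user_draws(seed_library=True)
print(first == second)  # True: the library's seeding leaked into user code
```

When the seeding happens in a separate process instead, the caller's `random` state is untouched, which is the default situation the text describes.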
--- a/docs/source/deployment/security.md
+++ b/docs/source/deployment/security.md
-# Security Guide
+# Security

 ## Inter-Node Communication


--- a/docs/source/getting_started/troubleshooting.md
+++ b/docs/source/getting_started/troubleshooting.md
-(troubleshooting)=
-
-# Troubleshooting
+---
+title: Troubleshooting
+---
+[](){ #troubleshooting }

 This document outlines some troubleshooting strategies you can consider. If you think you've discovered a bug, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.

-:::{note}
-Once you've debugged a problem, remember to turn off any debugging environment variables defined, or simply start a new shell to avoid being affected by lingering debugging settings. Otherwise, the system might be slow with debugging functionalities left activated.
-:::
+!!! note
+    Once you've debugged a problem, remember to turn off any debugging environment variables defined, or simply start a new shell to avoid being affected by lingering debugging settings. Otherwise, the system might be slow with debugging functionalities left activated.

 ## Hangs downloading a model

@@ -18,13 +18,12 @@ It's recommended to download the model first using the [huggingface-cli](https:/
 If the model is large, it can take a long time to load it from disk. Pay attention to where you store the model. Some clusters have shared filesystems across nodes, e.g. a distributed filesystem or a network filesystem, which can be slow.
 It is better to store the model on a local disk. Additionally, check the CPU memory usage: when the model is too large, it might take a lot of CPU memory, slowing down the operating system because it needs to frequently swap between disk and memory.

-:::{note}
-To isolate the model downloading and loading issue, you can use the `--load-format dummy` argument to skip loading the model weights. This way, you can check if the model downloading and loading is the bottleneck.
-:::
+!!! note
+    To isolate the model downloading and loading issue, you can use the `--load-format dummy` argument to skip loading the model weights. This way, you can check if the model downloading and loading is the bottleneck.

 ## Out of memory

-If the model is too large to fit in a single GPU, you will get an out-of-memory (OOM) error. Consider adopting [these options](#reducing-memory-usage) to reduce the memory consumption.
+If the model is too large to fit in a single GPU, you will get an out-of-memory (OOM) error. Consider adopting [these options](../configuration/conserving_memory.md) to reduce the memory consumption.

 ## Generation quality changed

@@ -53,9 +52,9 @@ You might also need to set `export NCCL_SOCKET_IFNAME=<your_network_interface>`
 ## Error near `self.graph.replay()`

 If vLLM crashes and the error trace captures it somewhere around `self.graph.replay()` in `vllm/worker/model_runner.py`, it is a CUDA error inside CUDAGraph.
-To identify the particular CUDA operation that causes the error, you can add `--enforce-eager` to the command line, or `enforce_eager=True` to the {class}`~vllm.LLM` class to disable the CUDAGraph optimization and isolate the exact CUDA operation that causes the error.
+To identify the particular CUDA operation that causes the error, add `--enforce-eager` to the command line, or pass `enforce_eager=True` to the [LLM][vllm.LLM] class, to disable the CUDAGraph optimization.

-(troubleshooting-incorrect-hardware-driver)=
+[](){ #troubleshooting-incorrect-hardware-driver }

 ## Incorrect hardware/driver

@@ -140,16 +139,15 @@ If the script runs successfully, you should see the message `sanity check is suc

 If the test script hangs or crashes, usually it means the hardware/drivers are broken in some sense. You should try to contact your system administrator or hardware vendor for further assistance. As a common workaround, you can try to tune some NCCL environment variables, such as `export NCCL_P2P_DISABLE=1` to see if it helps. Please check [their documentation](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html) for more information. Please only use these environment variables as a temporary workaround, as they might affect the performance of the system. The best solution is still to fix the hardware/drivers so that the test script can run successfully.

-:::{note}
-A multi-node environment is more complicated than a single-node one. If you see errors such as `torch.distributed.DistNetworkError`, it is likely that the network/DNS setup is incorrect. In that case, you can manually assign node rank and specify the IP via command line arguments:
+!!! note
+    A multi-node environment is more complicated than a single-node one. If you see errors such as `torch.distributed.DistNetworkError`, it is likely that the network/DNS setup is incorrect. In that case, you can manually assign node rank and specify the IP via command line arguments:

- In the first node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 0 --master_addr $MASTER_ADDR test.py`.
- In the second node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 1 --master_addr $MASTER_ADDR test.py`.
+    - In the first node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 0 --master_addr $MASTER_ADDR test.py`.
+    - In the second node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 1 --master_addr $MASTER_ADDR test.py`.

-Adjust `--nproc-per-node`, `--nnodes`, and `--node-rank` according to your setup, being sure to execute different commands (with different `--node-rank`) on different nodes.
-:::
+    Adjust `--nproc-per-node`, `--nnodes`, and `--node-rank` according to your setup, being sure to execute different commands (with different `--node-rank`) on different nodes.
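
The per-node commands in the note above differ only in `--node-rank`. As a minimal sketch, the helper below generates them so that the remaining flags stay consistent across nodes; `torchrun_command` is a hypothetical name introduced here for illustration, and the flags and defaults are the ones shown in the note.

```python
def torchrun_command(node_rank: int, nnodes: int = 2, nproc_per_node: int = 2,
                     script: str = "test.py") -> str:
    """Build the command for one node; only --node-rank differs between nodes."""
    args = [
        "torchrun",
        "--nnodes", str(nnodes),
        f"--nproc-per-node={nproc_per_node}",
        "--node-rank", str(node_rank),
        # Left unexpanded on purpose: the shell on each node substitutes it.
        "--master_addr", "$MASTER_ADDR",
        script,
    ]
    return "NCCL_DEBUG=TRACE " + " ".join(args)


# Print one command per node, to be run on the node with the matching rank.
for rank in range(2):
    print(torchrun_command(rank))
```

Each printed line is meant to be run on the node with the matching rank, with `MASTER_ADDR` set in that node's shell.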

-(troubleshooting-python-multiprocessing)=
+[](){ #troubleshooting-python-multiprocessing }

 ## Python multiprocessing

@@ -161,7 +159,7 @@ If you have seen a warning in your logs like this:
 WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously
    initialized. We must use the `spawn` multiprocessing start method. Setting
    VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See
-    https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#python-multiprocessing
+    https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing
    for more information.
 ```

@@ -260,7 +258,7 @@ or:
 ValueError: Model architectures ['<arch>'] are not supported for now. Supported architectures: [...]
 ```

-But you are sure that the model is in the [list of supported models](#supported-models), there may be some issue with vLLM's model resolution. In that case, please follow [these steps](#model-resolution) to explicitly specify the vLLM implementation for the model.
+If you are nevertheless sure that the model is in the [list of supported models][supported-models], there may be an issue with vLLM's model resolution. In that case, please follow [these steps](../configuration/model_resolution.md) to explicitly specify the vLLM implementation for the model.

 ## Failed to infer device type


--- a/docs/source/serving/usage_stats.md
+++ b/docs/source/serving/usage_stats.md