Commit 4eabe123 authored by zhuwenwen's avatar zhuwenwen

Merge remote-tracking branch 'mirror/releases/v0.9.0' into v0.9.0-ori

parents 45840cd2 58738772
(quantization-supported-hardware)=
# Supported Hardware
The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
:::{list-table}
:header-rows: 1
:widths: 20 8 8 8 8 8 8 8 8 8 8
- * Implementation
* Volta
* Turing
* Ampere
* Ada
* Hopper
* AMD GPU
* Intel GPU
* x86 CPU
* AWS Inferentia
* Google TPU
- * AWQ
*
* ✅︎
* ✅︎
* ✅︎
* ✅︎
*
* ✅︎
* ✅︎
*
*
- * GPTQ
* ✅︎
* ✅︎
* ✅︎
* ✅︎
* ✅︎
*
* ✅︎
* ✅︎
*
*
- * Marlin (GPTQ/AWQ/FP8)
*
*
* ✅︎
* ✅︎
* ✅︎
*
*
*
*
*
- * INT8 (W8A8)
*
* ✅︎
* ✅︎
* ✅︎
* ✅︎
*
*
* ✅︎
*
* ✅︎
- * FP8 (W8A8)
*
*
*
* ✅︎
* ✅︎
* ✅︎
*
*
*
*
- * BitBLAS (GPTQ)
* ✅︎
* ✅︎
* ✅︎
* ✅︎
* ✅︎
*
*
*
*
*
- * AQLM
* ✅︎
* ✅︎
* ✅︎
* ✅︎
* ✅︎
*
*
*
*
*
- * bitsandbytes
* ✅︎
* ✅︎
* ✅︎
* ✅︎
* ✅︎
*
*
*
*
*
- * DeepSpeedFP
* ✅︎
* ✅︎
* ✅︎
* ✅︎
* ✅︎
*
*
*
*
*
- * GGUF
* ✅︎
* ✅︎
* ✅︎
* ✅︎
* ✅︎
* ✅︎
*
*
*
*
- * modelopt
* ✅︎
* ✅︎
* ✅︎
* ✅︎
* ✅︎
*
*
*
*
*
:::
- Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
- ✅︎ indicates that the quantization method is supported on the specified hardware.
- ❌ indicates that the quantization method is not supported on the specified hardware.
:::{note}
This compatibility chart is subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.
For the most up-to-date information on hardware support and quantization methods, please refer to <gh-dir:vllm/model_executor/layers/quantization> or consult with the vLLM development team.
:::
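The SM-to-architecture mapping in the notes above can be written as a small lookup, which may help when deciding which column of the table applies to a given GPU. This is only a sketch: `SM_TO_ARCH` and `arch_name` are illustrative names, not part of vLLM, and any compute capability not listed in the notes is reported as `"unknown"`.

```python
# Illustrative lookup built from the notes above; not part of vLLM.
SM_TO_ARCH = {
    (7, 0): "Volta",
    (7, 5): "Turing",
    (8, 0): "Ampere",
    (8, 6): "Ampere",
    (8, 9): "Ada",
    (9, 0): "Hopper",
}


def arch_name(major: int, minor: int) -> str:
    """Return the architecture name for a CUDA compute capability (SM major.minor)."""
    return SM_TO_ARCH.get((major, minor), "unknown")
```

On a machine with PyTorch and a CUDA device, the `(major, minor)` pair can be obtained from `torch.cuda.get_device_capability()` and passed to `arch_name`.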
(installation-index)=
# Installation
vLLM supports the following hardware platforms:
:::{toctree}
:maxdepth: 1
:hidden:
installation/gpu
installation/cpu
installation/ai_accelerator
:::
- <project:installation/gpu.md>
- NVIDIA CUDA
- AMD ROCm
- Intel XPU
- <project:installation/cpu.md>
- Intel/AMD x86
- ARM AArch64
- Apple silicon
- IBM Z (S390X)
- <project:installation/ai_accelerator.md>
- Google TPU
- Intel Gaudi
- AWS Neuron
# Other AI accelerators
vLLM is a Python library that supports the following AI accelerators. Select your AI accelerator type to see vendor-specific instructions:
:::::{tab-set}
:sync-group: device
::::{tab-item} Google TPU
:selected:
:sync: tpu
:::{include} ai_accelerator/tpu.inc.md
:start-after: "# Installation"
:end-before: "## Requirements"
:::
::::
::::{tab-item} Intel Gaudi
:sync: hpu-gaudi
:::{include} ai_accelerator/hpu-gaudi.inc.md
:start-after: "# Installation"
:end-before: "## Requirements"
:::
::::
::::{tab-item} AWS Neuron
:sync: neuron
:::{include} ai_accelerator/neuron.inc.md
:start-after: "# Installation"
:end-before: "## Requirements"
:::
::::
:::::
## Requirements
:::::{tab-set}
:sync-group: device
::::{tab-item} Google TPU
:sync: tpu
:::{include} ai_accelerator/tpu.inc.md
:start-after: "## Requirements"
:end-before: "## Configure a new environment"
:::
::::
::::{tab-item} Intel Gaudi
:sync: hpu-gaudi
:::{include} ai_accelerator/hpu-gaudi.inc.md
:start-after: "## Requirements"
:end-before: "## Configure a new environment"
:::
::::
::::{tab-item} AWS Neuron
:sync: neuron
:::{include} ai_accelerator/neuron.inc.md
:start-after: "## Requirements"
:end-before: "## Configure a new environment"
:::
::::
:::::
## Configure a new environment
:::::{tab-set}
:sync-group: device
::::{tab-item} Google TPU
:sync: tpu
:::{include} ai_accelerator/tpu.inc.md
:start-after: "## Configure a new environment"
:end-before: "## Set up using Python"
:::
::::
::::{tab-item} Intel Gaudi
:sync: hpu-gaudi
:::{include} ai_accelerator/hpu-gaudi.inc.md
:start-after: "## Configure a new environment"
:end-before: "## Set up using Python"
:::
::::
::::{tab-item} AWS Neuron
:sync: neuron
:::{include} ai_accelerator/neuron.inc.md
:start-after: "## Configure a new environment"
:end-before: "## Set up using Python"
:::
::::
:::::
## Set up using Python
### Pre-built wheels
:::::{tab-set}
:sync-group: device
::::{tab-item} Google TPU
:sync: tpu
:::{include} ai_accelerator/tpu.inc.md
:start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source"
:::
::::
::::{tab-item} Intel Gaudi
:sync: hpu-gaudi
:::{include} ai_accelerator/hpu-gaudi.inc.md
:start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source"
:::
::::
::::{tab-item} AWS Neuron
:sync: neuron
:::{include} ai_accelerator/neuron.inc.md
:start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source"
:::
::::
:::::
### Build wheel from source
:::::{tab-set}
:sync-group: device
::::{tab-item} Google TPU
:sync: tpu
:::{include} ai_accelerator/tpu.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
:::
::::
::::{tab-item} Intel Gaudi
:sync: hpu-gaudi
:::{include} ai_accelerator/hpu-gaudi.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
:::
::::
::::{tab-item} AWS Neuron
:sync: neuron
:::{include} ai_accelerator/neuron.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
:::
::::
:::::
## Set up using Docker
### Pre-built images
:::::{tab-set}
:sync-group: device
::::{tab-item} Google TPU
:sync: tpu
:::{include} ai_accelerator/tpu.inc.md
:start-after: "### Pre-built images"
:end-before: "### Build image from source"
:::
::::
::::{tab-item} Intel Gaudi
:sync: hpu-gaudi
:::{include} ai_accelerator/hpu-gaudi.inc.md
:start-after: "### Pre-built images"
:end-before: "### Build image from source"
:::
::::
::::{tab-item} AWS Neuron
:sync: neuron
:::{include} ai_accelerator/neuron.inc.md
:start-after: "### Pre-built images"
:end-before: "### Build image from source"
:::
::::
:::::
### Build image from source
:::::{tab-set}
:sync-group: device
::::{tab-item} Google TPU
:sync: tpu
:::{include} ai_accelerator/tpu.inc.md
:start-after: "### Build image from source"
:end-before: "## Extra information"
:::
::::
::::{tab-item} Intel Gaudi
:sync: hpu-gaudi
:::{include} ai_accelerator/hpu-gaudi.inc.md
:start-after: "### Build image from source"
:end-before: "## Extra information"
:::
::::
::::{tab-item} AWS Neuron
:sync: neuron
:::{include} ai_accelerator/neuron.inc.md
:start-after: "### Build image from source"
:end-before: "## Extra information"
:::
::::
:::::
## Extra information
:::::{tab-set}
:sync-group: device
::::{tab-item} Google TPU
:sync: tpu
:::{include} ai_accelerator/tpu.inc.md
:start-after: "## Extra information"
:::
::::
::::{tab-item} Intel Gaudi
:sync: hpu-gaudi
:::{include} ai_accelerator/hpu-gaudi.inc.md
:start-after: "## Extra information"
:::
::::
::::{tab-item} AWS Neuron
:sync: neuron
:::{include} ai_accelerator/neuron.inc.md
:start-after: "## Extra information"
:::
::::
:::::
# Installation
vLLM initially supports basic model inference and serving on the x86 CPU platform, with data types FP32, FP16, and BF16.
:::{attention}
There are no pre-built wheels or images for this device, so you must build vLLM from source.
:::
## Requirements
- OS: Linux
- Compiler: `gcc/g++ >= 12.3.0` (optional, recommended)
- Instruction Set Architecture (ISA): AVX512 (optional, recommended)
:::{tip}
[Intel Extension for PyTorch (IPEX)](https://github.com/intel/intel-extension-for-pytorch) extends PyTorch with up-to-date feature optimizations for an extra performance boost on Intel hardware.
:::
## Set up using Python
### Pre-built wheels
### Build wheel from source
:::{include} cpu/build.inc.md
:::
:::{note}
- AVX512_BF16 is an ISA extension that provides native BF16 data type conversion and vector product instructions, which brings some performance improvement over pure AVX512. The CPU backend build script checks the host CPU flags to determine whether to enable AVX512_BF16.
- If you want to force-enable AVX512_BF16 for cross-compilation, set the environment variable `VLLM_CPU_AVX512BF16=1` before building.
:::
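To see ahead of time whether a Linux host advertises these instruction sets, you can inspect the `flags` line of `/proc/cpuinfo`. The sketch below is not the build script's logic; it only assumes the standard Linux flag names `avx512f` (AVX512 foundation) and `avx512_bf16`, and the helper names are made up for illustration.

```python
def cpu_flags(cpuinfo_text: str) -> set:
    """Collect the flags from the first 'flags' line of /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def avx512_summary(cpuinfo_text: str) -> dict:
    """Report whether AVX512 and AVX512_BF16 appear among the CPU flags."""
    flags = cpu_flags(cpuinfo_text)
    return {
        "avx512": "avx512f" in flags,
        "avx512_bf16": "avx512_bf16" in flags,
    }
```

On the build host, `avx512_summary(open("/proc/cpuinfo").read())` gives a quick indication of what to expect from the automatic detection.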
## Set up using Docker
### Pre-built images
See [https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo](https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo)
### Build image from source
## Extra information
# GPU
vLLM is a Python library that supports the following GPU variants. Select your GPU type to see vendor-specific instructions:
:::::{tab-set}
:sync-group: device
::::{tab-item} NVIDIA CUDA
:selected:
:sync: cuda
:::{include} gpu/cuda.inc.md
:start-after: "# Installation"
:end-before: "## Requirements"
:::
::::
::::{tab-item} AMD ROCm
:sync: rocm
:::{include} gpu/rocm.inc.md
:start-after: "# Installation"
:end-before: "## Requirements"
:::
::::
::::{tab-item} Intel XPU
:sync: xpu
:::{include} gpu/xpu.inc.md
:start-after: "# Installation"
:end-before: "## Requirements"
:::
::::
:::::
## Requirements
- OS: Linux
- Python: 3.9 -- 3.12
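
The Python version range above can be checked from a script before attempting an install. This is a minimal sketch; `python_supported` and the two bounds are illustrative names that simply restate the range listed here.

```python
import sys

# Bounds restated from the requirement above (inclusive on both ends).
SUPPORTED_MIN = (3, 9)
SUPPORTED_MAX = (3, 12)


def python_supported(version=None) -> bool:
    """Return True if the (major, minor) version lies within the supported range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return SUPPORTED_MIN <= major_minor <= SUPPORTED_MAX
```

Calling `python_supported()` with no argument checks the running interpreter.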
:::::{tab-set}
:sync-group: device
::::{tab-item} NVIDIA CUDA
:sync: cuda
:::{include} gpu/cuda.inc.md
:start-after: "## Requirements"
:end-before: "## Set up using Python"
:::
::::
::::{tab-item} AMD ROCm
:sync: rocm
:::{include} gpu/rocm.inc.md
:start-after: "## Requirements"
:end-before: "## Set up using Python"
:::
::::
::::{tab-item} Intel XPU
:sync: xpu
:::{include} gpu/xpu.inc.md
:start-after: "## Requirements"
:end-before: "## Set up using Python"
:::
::::
:::::
## Set up using Python
### Create a new Python environment
:::{include} python_env_setup.inc.md
:::
:::::{tab-set}
:sync-group: device
::::{tab-item} NVIDIA CUDA
:sync: cuda
:::{include} gpu/cuda.inc.md
:start-after: "## Create a new Python environment"
:end-before: "### Pre-built wheels"
:::
::::
::::{tab-item} AMD ROCm
:sync: rocm
There is no extra information on creating a new Python environment for this device.
::::
::::{tab-item} Intel XPU
:sync: xpu
There is no extra information on creating a new Python environment for this device.
::::
:::::
### Pre-built wheels
:::::{tab-set}
:sync-group: device
::::{tab-item} NVIDIA CUDA
:sync: cuda
:::{include} gpu/cuda.inc.md
:start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source"
:::
::::
::::{tab-item} AMD ROCm
:sync: rocm
:::{include} gpu/rocm.inc.md
:start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source"
:::
::::
::::{tab-item} Intel XPU
:sync: xpu
:::{include} gpu/xpu.inc.md
:start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source"
:::
::::
:::::
(build-from-source)=
### Build wheel from source
:::::{tab-set}
:sync-group: device
::::{tab-item} NVIDIA CUDA
:sync: cuda
:::{include} gpu/cuda.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
:::
::::
::::{tab-item} AMD ROCm
:sync: rocm
:::{include} gpu/rocm.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
:::
::::
::::{tab-item} Intel XPU
:sync: xpu
:::{include} gpu/xpu.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
:::
::::
:::::
## Set up using Docker
### Pre-built images
:::::{tab-set}
:sync-group: device
::::{tab-item} NVIDIA CUDA
:sync: cuda
:::{include} gpu/cuda.inc.md
:start-after: "### Pre-built images"
:end-before: "### Build image from source"
:::
::::
::::{tab-item} AMD ROCm
:sync: rocm
:::{include} gpu/rocm.inc.md
:start-after: "### Pre-built images"
:end-before: "### Build image from source"
:::
::::
::::{tab-item} Intel XPU
:sync: xpu
:::{include} gpu/xpu.inc.md
:start-after: "### Pre-built images"
:end-before: "### Build image from source"
:::
::::
:::::
### Build image from source
:::::{tab-set}
:sync-group: device
::::{tab-item} NVIDIA CUDA
:sync: cuda
:::{include} gpu/cuda.inc.md
:start-after: "### Build image from source"
:end-before: "## Supported features"
:::
::::
::::{tab-item} AMD ROCm
:sync: rocm
:::{include} gpu/rocm.inc.md
:start-after: "### Build image from source"
:end-before: "## Supported features"
:::
::::
::::{tab-item} Intel XPU
:sync: xpu
:::{include} gpu/xpu.inc.md
:start-after: "### Build image from source"
:end-before: "## Supported features"
:::
::::
:::::
## Supported features
:::::{tab-set}
:sync-group: device
::::{tab-item} NVIDIA CUDA
:sync: cuda
:::{include} gpu/cuda.inc.md
:start-after: "## Supported features"
:::
::::
::::{tab-item} AMD ROCm
:sync: rocm
:::{include} gpu/rocm.inc.md
:start-after: "## Supported features"
:::
::::
::::{tab-item} Intel XPU
:sync: xpu
:::{include} gpu/xpu.inc.md
:start-after: "## Supported features"
:::
::::
:::::
You can create a new Python environment using [conda](https://docs.conda.io/projects/conda/en/stable/user-guide/getting-started.html):
```console
# (Recommended) Create a new conda environment.
conda create -n vllm python=3.12 -y
conda activate vllm
```
:::{note}
[PyTorch has deprecated the conda release channel](https://github.com/pytorch/pytorch/issues/138506). If you use `conda`, please use it only to create the Python environment, not to install packages.
:::
Or you can create a new Python environment using [uv](https://docs.astral.sh/uv/), a very fast Python environment manager. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment using the following command:
```console
# (Recommended) Create a new uv environment. Use `--seed` to install `pip` and `setuptools` in the environment.
uv venv --python 3.12 --seed
source .venv/bin/activate
```
# Welcome to vLLM
:::{figure} ./assets/logos/vllm-logo-text-light.png
:align: center
:alt: vLLM
:class: no-scaled-link
:width: 60%
:::
:::{raw} html
<p style="text-align:center">
<strong>Easy, fast, and cheap LLM serving for everyone
</strong>
</p>
<p style="text-align:center">
<script async defer src="https://buttons.github.io/buttons.js"></script>
<a class="github-button" href="https://github.com/vllm-project/vllm" data-show-count="true" data-size="large" aria-label="Star">Star</a>
<a class="github-button" href="https://github.com/vllm-project/vllm/subscription" data-icon="octicon-eye" data-size="large" aria-label="Watch">Watch</a>
<a class="github-button" href="https://github.com/vllm-project/vllm/fork" data-icon="octicon-repo-forked" data-size="large" aria-label="Fork">Fork</a>
</p>
:::
vLLM is a fast and easy-to-use library for LLM inference and serving.
Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.
vLLM is fast with:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8
- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
- Speculative decoding
- Chunked prefill
vLLM is flexible and easy to use with:
- Seamless integration with popular HuggingFace models
- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
- Tensor parallelism and pipeline parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
- Support NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi® accelerators and GPUs, IBM Power CPUs, TPU, and AWS Trainium and Inferentia Accelerators.
- Prefix caching support
- Multi-LoRA support
For more information, check out the following:
- [vLLM announcing blog post](https://vllm.ai) (intro to PagedAttention)
- [vLLM paper](https://arxiv.org/abs/2309.06180) (SOSP 2023)
- [How continuous batching enables 23x throughput in LLM inference while reducing p50 latency](https://www.anyscale.com/blog/continuous-batching-llm-inference) by Cade Daniel et al.
- [vLLM Meetups](#meetups)
## Documentation
% How to start using vLLM?
:::{toctree}
:caption: Getting Started
:maxdepth: 1
getting_started/installation
getting_started/quickstart
getting_started/examples/examples_index
getting_started/troubleshooting
getting_started/faq
getting_started/v1_user_guide
:::
% What does vLLM support?
:::{toctree}
:caption: Models
:maxdepth: 1
models/supported_models
models/generative_models
models/pooling_models
models/extensions/index
:::
% Additional capabilities
:::{toctree}
:caption: Features
:maxdepth: 1
features/quantization/index
features/multimodal_inputs
features/prompt_embeds
features/lora
features/tool_calling
features/reasoning_outputs
features/structured_outputs
features/automatic_prefix_caching
features/disagg_prefill
features/spec_decode
features/compatibility_matrix
:::
% Details about running vLLM
:::{toctree}
:caption: Training
:maxdepth: 1
training/trl.md
training/rlhf.md
:::
:::{toctree}
:caption: Inference and Serving
:maxdepth: 1
serving/offline_inference
serving/openai_compatible_server
serving/serve_args
serving/distributed_serving
serving/metrics
serving/engine_args
serving/env_vars
serving/usage_stats
serving/integrations/index
:::
% Scaling up vLLM for production
:::{toctree}
:caption: Deployment
:maxdepth: 1
deployment/security
deployment/docker
deployment/k8s
deployment/nginx
deployment/frameworks/index
deployment/integrations/index
:::
% Making the most out of vLLM
:::{toctree}
:caption: Performance
:maxdepth: 1
performance/optimization
performance/benchmarks
:::
% Explanation of vLLM internals
:::{toctree}
:caption: Design Documents
:maxdepth: 2
design/arch_overview
design/huggingface_integration
design/plugin_system
design/kernel/paged_attention
design/mm_processing
design/automatic_prefix_caching
design/multiprocessing
:::
:::{toctree}
:caption: V1 Design Documents
:maxdepth: 2
design/v1/torch_compile
design/v1/prefix_caching
design/v1/metrics
:::
% How to contribute to the vLLM project
:::{toctree}
:caption: Developer Guide
:maxdepth: 2
contributing/overview
contributing/deprecation_policy
contributing/profiling/profiling_index
contributing/dockerfile/dockerfile
contributing/model/index
contributing/vulnerability_management
:::
% Technical API specifications
:::{toctree}
:caption: API Reference
:maxdepth: 2
api/summary
api/vllm/vllm
:::
% Latest news and acknowledgements
:::{toctree}
:caption: Community
:maxdepth: 1
community/blog
community/meetups
community/sponsors
:::
## Indices and tables
- {ref}`genindex`
- {ref}`modindex`
# Built-in Extensions
:::{toctree}
:maxdepth: 1
runai_model_streamer
tensorizer
fastsafetensor
:::
(engine-args)=
# Engine Arguments
Engine arguments control the behavior of the vLLM engine.
- For [offline inference](#offline-inference), they are part of the arguments to the `LLM` class.
- For [online serving](#openai-compatible-server), they are part of the arguments to `vllm serve`.
For references to all arguments available from `vllm serve` see the [serve args](#serve-args) documentation.
Below, you can find an explanation of every engine argument:
<!--- pyml disable-num-lines 7 no-space-in-emphasis -->
```{eval-rst}
.. argparse::
:module: vllm.engine.arg_utils
:func: _engine_args_parser
:prog: vllm serve
:nodefaultconst:
:markdownhelp:
```
## Async Engine Arguments
Additional arguments are available to the asynchronous engine which is used for online serving:
<!--- pyml disable-num-lines 7 no-space-in-emphasis -->
```{eval-rst}
.. argparse::
:module: vllm.engine.arg_utils
:func: _async_engine_args_parser
:prog: vllm serve
:nodefaultconst:
:markdownhelp:
```
# Environment Variables
vLLM uses the following environment variables to configure the system:
:::{warning}
Please note that `VLLM_PORT` and `VLLM_HOST_IP` set the port and IP for vLLM's **internal usage**. They are not the port and IP for the API server. If you use `--host $VLLM_HOST_IP` and `--port $VLLM_PORT` to start the API server, it will not work.
All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service as `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables).
:::
:::{literalinclude} ../../../vllm/envs.py
:end-before: end-env-vars-definition
:language: python
:start-after: begin-env-vars-definition
:::
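The Kubernetes naming conflict mentioned in the warning can be illustrated with a short sketch. It assumes the documented Kubernetes convention of upper-casing the service name and turning dashes into underscores, and it lists only a few of the variables Kubernetes injects; the function names and the default port are hypothetical.

```python
def k8s_service_env_names(service_name: str, port: int) -> list:
    """Return some of the environment variable names Kubernetes injects for a service."""
    prefix = service_name.upper().replace("-", "_")
    return [
        f"{prefix}_SERVICE_HOST",
        f"{prefix}_SERVICE_PORT",
        f"{prefix}_PORT",
        f"{prefix}_PORT_{port}_TCP",
    ]


def collides_with_vllm(service_name: str, port: int = 8000) -> list:
    """Return the injected names that fall under vLLM's `VLLM_` prefix."""
    names = k8s_service_env_names(service_name, port)
    return [name for name in names if name.startswith("VLLM_")]
```

For a service named `vllm`, the result includes `VLLM_PORT`, which is exactly the variable vLLM reserves for internal use. A name such as `llm-server` produces no names under the `VLLM_` prefix.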
# External Integrations
:::{toctree}
:maxdepth: 1
langchain
llamaindex
:::
......@@ -6,6 +6,6 @@ vLLM can be used to generate the completions for RLHF. The best way to do this i
See the following basic examples to get started if you don't want to use an existing library:
- [Training and inference processes are located on separate GPUs (inspired by OpenRLHF)](https://docs.vllm.ai/en/latest/getting_started/examples/rlhf.html)
- [Training and inference processes are colocated on the same GPUs using Ray](https://docs.vllm.ai/en/latest/getting_started/examples/rlhf_colocate.html)
- [Utilities for performing RLHF with vLLM](https://docs.vllm.ai/en/latest/getting_started/examples/rlhf_utils.html)
- [Training and inference processes are located on separate GPUs (inspired by OpenRLHF)](../examples/offline_inference/rlhf.md)
- [Training and inference processes are colocated on the same GPUs using Ray](../examples/offline_inference/rlhf_colocate.md)
- [Utilities for performing RLHF with vLLM](../examples/offline_inference/rlhf_utils.md)
......@@ -6,8 +6,7 @@ Online methods such as GRPO or Online DPO require the model to generate completi
See the guide [vLLM for fast generation in online methods](https://huggingface.co/docs/trl/main/en/speeding_up_training#vllm-for-fast-generation-in-online-methods) in the TRL documentation for more information.
:::{seealso}
For more information on the `use_vllm` flag you can provide to the configs of these online methods, see:
- [`trl.GRPOConfig.use_vllm`](https://huggingface.co/docs/trl/main/en/grpo_trainer#trl.GRPOConfig.use_vllm)
- [`trl.OnlineDPOConfig.use_vllm`](https://huggingface.co/docs/trl/main/en/online_dpo_trainer#trl.OnlineDPOConfig.use_vllm)
:::
!!! info
For more information on the `use_vllm` flag you can provide to the configs of these online methods, see:
- [`trl.GRPOConfig.use_vllm`](https://huggingface.co/docs/trl/main/en/grpo_trainer#trl.GRPOConfig.use_vllm)
- [`trl.OnlineDPOConfig.use_vllm`](https://huggingface.co/docs/trl/main/en/online_dpo_trainer#trl.OnlineDPOConfig.use_vllm)
# Using vLLM
vLLM supports the following usage patterns:
- [Inference and Serving](../serving/offline_inference.md): Run a single instance of a model.
- [Deployment](../deployment/docker.md): Scale up model instances for production.
- [Training](../training/rlhf.md): Train or fine-tune a model.
(faq)=
# Frequently Asked Questions
---
title: Frequently Asked Questions
---
[](){ #faq }
> Q: How can I serve multiple models on a single port using the OpenAI API?
A: Assuming that you're referring to using the OpenAI-compatible server to serve multiple models at once, that is not currently supported. Instead, you can run multiple instances of the server (each serving a different model) at the same time, and add another layer that routes each incoming request to the correct server.
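
The core of such a routing layer is a lookup on the `model` field of the OpenAI-style request body. The sketch below shows only that lookup, not a full proxy; the model names, ports, and the `pick_backend` helper are hypothetical and would be replaced by your own deployment details.

```python
import json

# Hypothetical mapping from model name to the server instance that serves it.
BACKENDS = {
    "model-a": "http://localhost:8001",
    "model-b": "http://localhost:8002",
}


def pick_backend(request_body: bytes) -> str:
    """Choose the backend base URL from the `model` field of a JSON request body."""
    payload = json.loads(request_body)
    model = payload.get("model")
    if model not in BACKENDS:
        raise KeyError(f"no backend registered for model {model!r}")
    return BACKENDS[model]
```

A reverse proxy or small web service would call `pick_backend` for each request and forward the unchanged body to the returned URL.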
______________________________________________________________________
---
> Q: Which model to use for offline inference embedding?
A: You can try [e5-mistral-7b-instruct](https://huggingface.co/intfloat/e5-mistral-7b-instruct) and [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5);
more are listed [here](#supported-models).
more are listed [here][supported-models].
By extracting hidden states, vLLM can automatically convert text generation models like [Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B),
[Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) into embedding models,
but they are expected to be inferior to models that are specifically trained on embedding tasks.
______________________________________________________________________
---
> Q: Can the output of a prompt vary across runs in vLLM?
......
......@@ -4,7 +4,7 @@ vLLM exposes a number of metrics that can be used to monitor the health of the
system. These metrics are exposed via the `/metrics` endpoint on the vLLM
OpenAI compatible API server.
You can start the server using Python, or using [Docker](#deployment-docker):
You can start the server using Python, or using [Docker][deployment-docker]:
```console
vllm serve unsloth/Llama-3.2-1B-Instruct
......@@ -31,11 +31,9 @@ vllm:iteration_tokens_total_bucket{le="512.0",model_name="unsloth/Llama-3.2-1B-I
The following metrics are exposed:
:::{literalinclude} ../../../vllm/engine/metrics.py
:end-before: end-metrics-definitions
:language: python
:start-after: begin-metrics-definitions
:::
```python
--8<-- "vllm/engine/metrics.py:metrics-definitions"
```
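
The `/metrics` endpoint returns the Prometheus text exposition format, so individual samples can be read with a few lines of standard-library Python. This is a rough sketch rather than a complete parser: it handles one sample per line, ignores optional timestamps and comment lines, and the `parse_sample` helper is an illustrative name.

```python
import re

# name{label="value",...} number   (the label block is optional)
_SAMPLE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_sample(line: str):
    """Split one metric sample line into (name, labels, value)."""
    match = _SAMPLE.match(line.strip())
    if match is None:
        raise ValueError(f"not a metric sample: {line!r}")
    labels = dict(_LABEL.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))
```

For production monitoring, a Prometheus server or an official client library is a better fit than hand-written parsing.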
The following metrics are deprecated and due to be removed in a future version:
......
# Reproducibility
vLLM does not guarantee the reproducibility of the results by default, for the sake of performance. You need to do the following to achieve
reproducible results:
- For V1: Turn off multiprocessing to make the scheduling deterministic by setting `VLLM_ENABLE_V1_MULTIPROCESSING=0`.
- For V0: Set the global seed (see below).
Example: <gh-file:examples/offline_inference/reproducibility.py>
!!! warning
Applying the above settings [changes the random state in user code](#locality-of-random-state).
!!! note
Even with the above settings, vLLM only provides reproducibility
when it runs on the same hardware and the same vLLM version.
Also, the online serving API (`vllm serve`) does not support reproducibility
because it is almost impossible to make the scheduling deterministic in the
online setting.
## Setting the global seed
The `seed` parameter in vLLM is used to control the random states for various random number generators.
If a specific seed value is provided, the random states for `random`, `np.random`, and `torch.manual_seed` will be set accordingly.
However, in some cases, setting the seed will also [change the random state in user code](#locality-of-random-state).
### Default Behavior
In V0, the `seed` parameter defaults to `None`. When the `seed` parameter is `None`, the random states for `random`, `np.random`, and `torch.manual_seed` are not set. This means that each run of vLLM will produce different results if `temperature > 0`, as expected.
In V1, the `seed` parameter defaults to `0` which sets the random state for each worker, so the results will remain consistent for each vLLM run even if `temperature > 0`.
!!! note
It is impossible to un-specify a seed for V1 because different workers need to sample the same outputs
for workflows such as speculative decoding.
For more information, see: <gh-pr:17929>
### Locality of random state
The random state in user code (i.e. the code that constructs [LLM][vllm.LLM] class) is updated by vLLM under the following conditions:
- For V0: The seed is specified.
- For V1: The workers are run in the same process as user code, i.e.: `VLLM_ENABLE_V1_MULTIPROCESSING=0`.
By default, these conditions are not active, so you can use vLLM without worrying about
accidentally making subsequent operations that rely on random state deterministic.
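
The effect described in this section can be reproduced with the standard library alone. The sketch below is an analogy, not vLLM code: `library_call` is a hypothetical stand-in for any library that seeds the process-wide random number generator when it runs in the same process as user code.

```python
import random


def library_call(seed=None):
    """Stand-in for a library that seeds the process-wide RNG when given a seed."""
    if seed is not None:
        random.seed(seed)


def user_draws(seed=None, n=3):
    """User code that draws random numbers after calling the library."""
    library_call(seed)
    return [random.random() for _ in range(n)]
```

Two calls to `user_draws(0)` return the same list because the library reset the shared random state, while calls without a seed leave the user's random state untouched.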
# Security Guide
# Security
## Inter-Node Communication
......
(troubleshooting)=
# Troubleshooting
---
title: Troubleshooting
---
[](){ #troubleshooting }
This document outlines some troubleshooting strategies you can consider. If you think you've discovered a bug, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
:::{note}
Once you've debugged a problem, remember to turn off any debugging environment variables defined, or simply start a new shell to avoid being affected by lingering debugging settings. Otherwise, the system might be slow with debugging functionalities left activated.
:::
!!! note
Once you've debugged a problem, remember to turn off any debugging environment variables defined, or simply start a new shell to avoid being affected by lingering debugging settings. Otherwise, the system might be slow with debugging functionalities left activated.
## Hangs downloading a model
......@@ -18,13 +18,12 @@ It's recommended to download the model first using the [huggingface-cli](https:/
If the model is large, it can take a long time to load it from disk. Pay attention to where you store the model. Some clusters have shared filesystems across nodes, e.g. a distributed filesystem or a network filesystem, which can be slow.
It is better to store the model on a local disk. Additionally, check the CPU memory usage: when the model is too large, it might take a lot of CPU memory, slowing down the operating system because it needs to frequently swap between disk and memory.
:::{note}
To isolate the model downloading and loading issue, you can use the `--load-format dummy` argument to skip loading the model weights. This way, you can check if the model downloading and loading is the bottleneck.
:::
!!! note
To isolate the model downloading and loading issue, you can use the `--load-format dummy` argument to skip loading the model weights. This way, you can check if the model downloading and loading is the bottleneck.
## Out of memory
If the model is too large to fit in a single GPU, you will get an out-of-memory (OOM) error. Consider adopting [these options](#reducing-memory-usage) to reduce the memory consumption.
If the model is too large to fit in a single GPU, you will get an out-of-memory (OOM) error. Consider adopting [these options](../configuration/conserving_memory.md) to reduce the memory consumption.
## Generation quality changed
......@@ -53,9 +52,9 @@ You might also need to set `export NCCL_SOCKET_IFNAME=<your_network_interface>`
## Error near `self.graph.replay()`
If vLLM crashes and the error trace captures it somewhere around `self.graph.replay()` in `vllm/worker/model_runner.py`, it is a CUDA error inside CUDAGraph.
To identify the particular CUDA operation that causes the error, you can add `--enforce-eager` to the command line, or `enforce_eager=True` to the {class}`~vllm.LLM` class to disable the CUDAGraph optimization and isolate the exact CUDA operation that causes the error.
To identify the particular CUDA operation that causes the error, you can add `--enforce-eager` to the command line, or `enforce_eager=True` to the [LLM][vllm.LLM] class to disable the CUDAGraph optimization and isolate the exact CUDA operation that causes the error.
(troubleshooting-incorrect-hardware-driver)=
[](){ #troubleshooting-incorrect-hardware-driver }
## Incorrect hardware/driver
......@@ -140,16 +139,15 @@ If the script runs successfully, you should see the message `sanity check is suc
If the test script hangs or crashes, usually it means the hardware/drivers are broken in some sense. You should try to contact your system administrator or hardware vendor for further assistance. As a common workaround, you can try to tune some NCCL environment variables, such as `export NCCL_P2P_DISABLE=1` to see if it helps. Please check [their documentation](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html) for more information. Please only use these environment variables as a temporary workaround, as they might affect the performance of the system. The best solution is still to fix the hardware/drivers so that the test script can run successfully.
:::{note}
A multi-node environment is more complicated than a single-node one. If you see errors such as `torch.distributed.DistNetworkError`, it is likely that the network/DNS setup is incorrect. In that case, you can manually assign node rank and specify the IP via command line arguments:
!!! note
A multi-node environment is more complicated than a single-node one. If you see errors such as `torch.distributed.DistNetworkError`, it is likely that the network/DNS setup is incorrect. In that case, you can manually assign node rank and specify the IP via command line arguments:
- In the first node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 0 --master_addr $MASTER_ADDR test.py`.
- In the second node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 1 --master_addr $MASTER_ADDR test.py`.
- In the first node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 0 --master_addr $MASTER_ADDR test.py`.
- In the second node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 1 --master_addr $MASTER_ADDR test.py`.
Adjust `--nproc-per-node`, `--nnodes`, and `--node-rank` according to your setup, being sure to execute different commands (with different `--node-rank`) on different nodes.
:::
Adjust `--nproc-per-node`, `--nnodes`, and `--node-rank` according to your setup, being sure to execute different commands (with different `--node-rank`) on different nodes.
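
Because the per-node commands differ only in `--node-rank`, they can be generated rather than typed by hand. The helper below is an illustrative sketch (its name and defaults are not part of vLLM or PyTorch); it reproduces the command format shown above and leaves `$MASTER_ADDR` for the shell to expand.

```python
def torchrun_commands(nnodes: int, nproc_per_node: int, script: str = "test.py") -> list:
    """Build one torchrun command per node, varying only --node-rank."""
    return [
        f"NCCL_DEBUG=TRACE torchrun --nnodes {nnodes} "
        f"--nproc-per-node={nproc_per_node} --node-rank {rank} "
        f"--master_addr $MASTER_ADDR {script}"
        for rank in range(nnodes)
    ]
```

Run the command at index `i` of the returned list on the node that should have rank `i`.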
(troubleshooting-python-multiprocessing)=
[](){ #troubleshooting-python-multiprocessing }
## Python multiprocessing
......@@ -161,7 +159,7 @@ If you have seen a warning in your logs like this:
WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously
initialized. We must use the `spawn` multiprocessing start method. Setting
VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See
https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#python-multiprocessing
https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing
for more information.
```
......@@ -260,7 +258,7 @@ or:
ValueError: Model architectures ['<arch>'] are not supported for now. Supported architectures: [...]
```
But you are sure that the model is in the [list of supported models](#supported-models), there may be some issue with vLLM's model resolution. In that case, please follow [these steps](#model-resolution) to explicitly specify the vLLM implementation for the model.
But you are sure that the model is in the [list of supported models][supported-models], there may be some issue with vLLM's model resolution. In that case, please follow [these steps](../configuration/model_resolution.md) to explicitly specify the vLLM implementation for the model.
## Failed to infer device type
......