You can create a new Python environment using `conda`:
```console
#(Recommended) Create a new conda environment.
conda create -n myenv python=3.12 -y
conda activate myenv
```
:::{note}
[PyTorch has deprecated the conda release channel](https://github.com/pytorch/pytorch/issues/138506). If you use `conda`, please only use it to create Python environment rather than installing packages.
:::
Or you can create a new Python environment using [uv](https://docs.astral.sh/uv/), a very fast Python environment manager. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment using the following command:
```console
#(Recommended) Create a new uv environment. Use `--seed` to install`pip` and `setuptools`in the environment.
You can install vLLM using pip. It's recommended to use [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html) to create and manage Python environments.
If you are using NVIDIA GPUs, you can install vLLM using [pip](https://pypi.org/project/vllm/) directly.
It's recommended to use [uv](https://docs.astral.sh/uv/), a very fast Python environment manager, to create and manage Python environments. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment and install vLLM using the following commands:
```console
```console
$conda create -n myenv python=3.10-y
uv venv myenv --python3.12 --seed
$conda activate myenv
source myenv/bin/activate
$pip install vllm
uv pip install vllm
```
```
Please refer to the {ref}`installation documentation <installation>` for more details on installing vLLM.
You can also use [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html) to create and manage Python environments.
```console
conda create -n myenv python=3.12 -y
conda activate myenv
pip install vllm
```
(offline-batched-inference)=
:::{note}
For non-CUDA platforms, please refer [here](#installation-index) for specific instructions on how to install vLLM.
:::
(quickstart-offline)=
## Offline Batched Inference
## Offline Batched Inference
With vLLM installed, you can start generating texts for list of input prompts (i.e. offline batch inferencing). See the example script: <gh-file:examples/offline_inference.py>
With vLLM installed, you can start generating texts for list of input prompts (i.e. offline batch inferencing). See the example script: <gh-file:examples/offline_inference/basic.py>
The first line of this example imports the classes {class}`~vllm.LLM` and {class}`~vllm.SamplingParams`:
The first line of this example imports the classes {class}`~vllm.LLM` and {class}`~vllm.SamplingParams`:
...
@@ -40,7 +51,7 @@ The first line of this example imports the classes {class}`~vllm.LLM` and {class
...
@@ -40,7 +51,7 @@ The first line of this example imports the classes {class}`~vllm.LLM` and {class
fromvllmimportLLM,SamplingParams
fromvllmimportLLM,SamplingParams
```
```
The next section defines a list of input prompts and sampling parameters for text generation. The [sampling temperature](https://arxiv.org/html/2402.05201v1) is set to `0.8` and the [nucleus sampling probability](https://en.wikipedia.org/wiki/Top-p_sampling) is set to `0.95`. You can find more information about the sampling parameters [here](https://docs.vllm.ai/en/stable/dev/sampling_params.html).
The next section defines a list of input prompts and sampling parameters for text generation. The [sampling temperature](https://arxiv.org/html/2402.05201v1) is set to `0.8` and the [nucleus sampling probability](https://en.wikipedia.org/wiki/Top-p_sampling) is set to `0.95`. You can find more information about the sampling parameters [here](#sampling-params).
```python
```python
prompts=[
prompts=[
...
@@ -58,9 +69,9 @@ The {class}`~vllm.LLM` class initializes vLLM's engine and the [OPT-125M model](
...
@@ -58,9 +69,9 @@ The {class}`~vllm.LLM` class initializes vLLM's engine and the [OPT-125M model](
llm=LLM(model="facebook/opt-125m")
llm=LLM(model="facebook/opt-125m")
```
```
```{note}
:::{note}
By default, vLLM downloads models from [HuggingFace](https://huggingface.co/). If you would like to use models from [ModelScope](https://www.modelscope.cn), set the environment variable `VLLM_USE_MODELSCOPE` before initializing the engine.
By default, vLLM downloads models from [HuggingFace](https://huggingface.co/). If you would like to use models from [ModelScope](https://www.modelscope.cn), set the environment variable `VLLM_USE_MODELSCOPE` before initializing the engine.
```
:::
Now, the fun part! The outputs are generated using `llm.generate`. It adds the input prompts to the vLLM engine's waiting queue and executes the vLLM engine to generate the outputs with high throughput. The outputs are returned as a list of `RequestOutput` objects, which include all of the output tokens.
Now, the fun part! The outputs are generated using `llm.generate`. It adds the input prompts to the vLLM engine's waiting queue and executes the vLLM engine to generate the outputs with high throughput. The outputs are returned as a list of `RequestOutput` objects, which include all of the output tokens.
@@ -83,18 +94,18 @@ By default, it starts the server at `http://localhost:8000`. You can specify the
...
@@ -83,18 +94,18 @@ By default, it starts the server at `http://localhost:8000`. You can specify the
Run the following command to start the vLLM server with the [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) model:
Run the following command to start the vLLM server with the [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) model:
```console
```console
$vllm serve Qwen/Qwen2.5-1.5B-Instruct
vllm serve Qwen/Qwen2.5-1.5B-Instruct
```
```
```{note}
:::{note}
By default, the server uses a predefined chat template stored in the tokenizer.
By default, the server uses a predefined chat template stored in the tokenizer.
You can learn about overriding it [here](#chat-template).
You can learn about overriding it [here](#chat-template).
```
:::
This server can be queried in the same format as OpenAI API. For example, to list the models:
This server can be queried in the same format as OpenAI API. For example, to list the models:
```console
```console
$curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/models
```
```
You can pass in the argument `--api-key` or environment variable `VLLM_API_KEY` to enable the server to check for API key in the header.
You can pass in the argument `--api-key` or environment variable `VLLM_API_KEY` to enable the server to check for API key in the header.
...
@@ -104,17 +115,17 @@ You can pass in the argument `--api-key` or environment variable `VLLM_API_KEY`
...
@@ -104,17 +115,17 @@ You can pass in the argument `--api-key` or environment variable `VLLM_API_KEY`
Once your server is started, you can query the model with input prompts:
Once your server is started, you can query the model with input prompts:
```console
```console
$curl http://localhost:8000/v1/completions \
curl http://localhost:8000/v1/completions \
$ -H"Content-Type: application/json"\
-H "Content-Type: application/json" \
$ -d'{
-d '{
$"model": "Qwen/Qwen2.5-1.5B-Instruct",
"model": "Qwen/Qwen2.5-1.5B-Instruct",
$"prompt": "San Francisco is a",
"prompt": "San Francisco is a",
$"max_tokens": 7,
"max_tokens": 7,
$"temperature": 0
"temperature": 0
$}'
}'
```
```
Since this server is compatible with OpenAI API, you can use it as a drop-in replacement for any applications using OpenAI API. For example, another way to query the server is via the `openai`python package:
Since this server is compatible with OpenAI API, you can use it as a drop-in replacement for any applications using OpenAI API. For example, another way to query the server is via the `openai`Python package:
This document outlines some debugging strategies you can consider. If you think you've discovered a bug, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
This document outlines some troubleshooting strategies you can consider. If you think you've discovered a bug, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
```{note}
:::{note}
Once you've debugged a problem, remember to turn off any debugging environment variables defined, or simply start a new shell to avoid being affected by lingering debugging settings. Otherwise, the system might be slow with debugging functionalities left activated.
Once you've debugged a problem, remember to turn off any debugging environment variables defined, or simply start a new shell to avoid being affected by lingering debugging settings. Otherwise, the system might be slow with debugging functionalities left activated.
```
:::
## Hangs downloading a model
## Hangs downloading a model
...
@@ -18,13 +18,13 @@ It's recommended to download the model first using the [huggingface-cli](https:/
...
@@ -18,13 +18,13 @@ It's recommended to download the model first using the [huggingface-cli](https:/
If the model is large, it can take a long time to load it from disk. Pay attention to where you store the model. Some clusters have shared filesystems across nodes, e.g. a distributed filesystem or a network filesystem, which can be slow.
If the model is large, it can take a long time to load it from disk. Pay attention to where you store the model. Some clusters have shared filesystems across nodes, e.g. a distributed filesystem or a network filesystem, which can be slow.
It'd be better to store the model in a local disk. Additionally, have a look at the CPU memory usage, when the model is too large it might take a lot of CPU memory, slowing down the operating system because it needs to frequently swap between disk and memory.
It'd be better to store the model in a local disk. Additionally, have a look at the CPU memory usage, when the model is too large it might take a lot of CPU memory, slowing down the operating system because it needs to frequently swap between disk and memory.
```{note}
:::{note}
To isolate the model downloading and loading issue, you can use the `--load-format dummy` argument to skip loading the model weights. This way, you can check if the model downloading and loading is the bottleneck.
To isolate the model downloading and loading issue, you can use the `--load-format dummy` argument to skip loading the model weights. This way, you can check if the model downloading and loading is the bottleneck.
```
:::
## Model is too large
## Out of memory
If the model is too large to fit in a single GPU, you might want to [consider tensor parallelism](#distributed-serving) to split the model across multiple GPUs. In that case, every process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism). You can convert the model checkpoint to a sharded checkpoint using <gh-file:examples/save_sharded_state.py>. The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
If the model is too large to fit in a single GPU, you will get an out-of-memory (OOM) error. Consider [using tensor parallelism](#distributed-serving) to split the model across multiple GPUs. In that case, every process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism). You can convert the model checkpoint to a sharded checkpoint using <gh-file:examples/offline_inference/save_sharded_state.py>. The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
## Enable more logging
## Enable more logging
...
@@ -47,6 +47,8 @@ You might also need to set `export NCCL_SOCKET_IFNAME=<your_network_interface>`
...
@@ -47,6 +47,8 @@ You might also need to set `export NCCL_SOCKET_IFNAME=<your_network_interface>`
If vLLM crashes and the error trace captures it somewhere around `self.graph.replay()` in `vllm/worker/model_runner.py`, it is a CUDA error inside CUDAGraph.
If vLLM crashes and the error trace captures it somewhere around `self.graph.replay()` in `vllm/worker/model_runner.py`, it is a CUDA error inside CUDAGraph.
To identify the particular CUDA operation that causes the error, you can add `--enforce-eager` to the command line, or `enforce_eager=True` to the {class}`~vllm.LLM` class to disable the CUDAGraph optimization and isolate the exact CUDA operation that causes the error.
To identify the particular CUDA operation that causes the error, you can add `--enforce-eager` to the command line, or `enforce_eager=True` to the {class}`~vllm.LLM` class to disable the CUDAGraph optimization and isolate the exact CUDA operation that causes the error.
(troubleshooting-incorrect-hardware-driver)=
## Incorrect hardware/driver
## Incorrect hardware/driver
If GPU/CPU communication cannot be established, you can use the following Python script and follow the instructions below to confirm whether the GPU/CPU communication is working correctly.
If GPU/CPU communication cannot be established, you can use the following Python script and follow the instructions below to confirm whether the GPU/CPU communication is working correctly.
If you are testing with multi-nodes, adjust `--nproc-per-node` and `--nnodes` according to your setup and set `MASTER_ADDR` to the correct IP address of the master node, reachable from all nodes. Then, run:
If you are testing with multi-nodes, adjust `--nproc-per-node` and `--nnodes` according to your setup and set `MASTER_ADDR` to the correct IP address of the master node, reachable from all nodes. Then, run:
If the script runs successfully, you should see the message `sanity check is successful!`.
If the script runs successfully, you should see the message `sanity check is successful!`.
If the test script hangs or crashes, usually it means the hardware/drivers are broken in some sense. You should try to contact your system administrator or hardware vendor for further assistance. As a common workaround, you can try to tune some NCCL environment variables, such as `export NCCL_P2P_DISABLE=1` to see if it helps. Please check [their documentation](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html) for more information. Please only use these environment variables as a temporary workaround, as they might affect the performance of the system. The best solution is still to fix the hardware/drivers so that the test script can run successfully.
If the test script hangs or crashes, usually it means the hardware/drivers are broken in some sense. You should try to contact your system administrator or hardware vendor for further assistance. As a common workaround, you can try to tune some NCCL environment variables, such as `export NCCL_P2P_DISABLE=1` to see if it helps. Please check [their documentation](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html) for more information. Please only use these environment variables as a temporary workaround, as they might affect the performance of the system. The best solution is still to fix the hardware/drivers so that the test script can run successfully.
```{note}
:::{note}
A multi-node environment is more complicated than a single-node one. If you see errors such as `torch.distributed.DistNetworkError`, it is likely that the network/DNS setup is incorrect. In that case, you can manually assign node rank and specify the IP via command line arguments:
A multi-node environment is more complicated than a single-node one. If you see errors such as `torch.distributed.DistNetworkError`, it is likely that the network/DNS setup is incorrect. In that case, you can manually assign node rank and specify the IP via command line arguments:
- In the first node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 0 --master_addr $MASTER_ADDR test.py`.
- In the first node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 0 --master_addr $MASTER_ADDR test.py`.
- In the second node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 1 --master_addr $MASTER_ADDR test.py`.
- In the second node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 1 --master_addr $MASTER_ADDR test.py`.
Adjust `--nproc-per-node`, `--nnodes`, and `--node-rank` according to your setup, being sure to execute different commands (with different `--node-rank`) on different nodes.
Adjust `--nproc-per-node`, `--nnodes`, and `--node-rank` according to your setup, being sure to execute different commands (with different `--node-rank`) on different nodes.
```
:::
(troubleshooting-python-multiprocessing)=
(debugging-python-multiprocessing)=
## Python multiprocessing
## Python multiprocessing
### `RuntimeError` Exception
### `RuntimeError` Exception
...
@@ -150,7 +153,7 @@ If you have seen a warning in your logs like this:
...
@@ -150,7 +153,7 @@ If you have seen a warning in your logs like this:
WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously
WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously
initialized. We must use the `spawn` multiprocessing start method. Setting
initialized. We must use the `spawn` multiprocessing start method. Setting
vLLM heavily depends on `torch.compile` to optimize the model for better performance, which introduces the dependency on the `torch.compile` functionality and the `triton` library. By default, we use `torch.compile` to [optimize some functions](https://github.com/vllm-project/vllm/pull/10406) in the model. Before running vLLM, you can check if `torch.compile` is working as expected by running the following script:
```python
importtorch
@torch.compile
deff(x):
# a simple function to test torch.compile
x=x+1
x=x*2
x=x.sin()
returnx
x=torch.randn(4,4).cuda()
print(f(x))
```
If it raises errors from `torch/_inductor` directory, usually it means you have a custom `triton` library that is not compatible with the version of PyTorch you are using. See [this issue](https://github.com/vllm-project/vllm/issues/12219) for example.
## Model failed to be inspected
If you see an error like:
```text
File "vllm/model_executor/models/registry.py", line xxx, in _raise_for_unsupported
raise ValueError(
ValueError: Model architectures ['<arch>'] failed to be inspected. Please check the logs for more details.
```
It means that vLLM failed to import the model file.
Usually, it is related to missing dependencies or outdated binaries in the vLLM build.
Please read the logs carefully to determine the root cause of the error.
## Model not supported
If you see an error like:
```text
Traceback (most recent call last):
...
File "vllm/model_executor/models/registry.py", line xxx, in inspect_model_cls
for arch in architectures:
TypeError: 'NoneType' object is not iterable
```
or:
```text
File "vllm/model_executor/models/registry.py", line xxx, in _raise_for_unsupported
raise ValueError(
ValueError: Model architectures ['<arch>'] are not supported for now. Supported architectures: [...]
```
But you are sure that the model is in the [list of supported models](#supported-models), there may be some issue with vLLM's model resolution. In that case, please follow [these steps](#model-resolution) to explicitly specify the vLLM implementation for the model.
## Known Issues
## Known Issues
- In `v0.5.2`, `v0.5.3`, and `v0.5.3.post1`, there is a bug caused by [zmq](https://github.com/zeromq/pyzmq/issues/2000) , which can occasionally cause vLLM to hang depending on the machine configuration. The solution is to upgrade to the latest version of `vllm` to include the [fix](gh-pr:6759).
- In `v0.5.2`, `v0.5.3`, and `v0.5.3.post1`, there is a bug caused by [zmq](https://github.com/zeromq/pyzmq/issues/2000) , which can occasionally cause vLLM to hang depending on the machine configuration. The solution is to upgrade to the latest version of `vllm` to include the [fix](gh-pr:6759).
- To circumvent a NCCL [bug](https://github.com/NVIDIA/nccl/issues/1234) , all vLLM processes will set an environment variable ``NCCL_CUMEM_ENABLE=0`` to disable NCCL's ``cuMem`` allocator. It does not affect performance but only gives memory benefits. When external processes want to set up a NCCL connection with vLLM's processes, they should also set this environment variable, otherwise, inconsistent environment setup will cause NCCL to hang or crash, as observed in the [RLHF integration](https://github.com/OpenRLHF/OpenRLHF/pull/604) and the [discussion](gh-issue:5723#issuecomment-2554389656) .
- To circumvent a NCCL [bug](https://github.com/NVIDIA/nccl/issues/1234) , all vLLM processes will set an environment variable `NCCL_CUMEM_ENABLE=0` to disable NCCL's `cuMem` allocator. It does not affect performance but only gives memory benefits. When external processes want to set up a NCCL connection with vLLM's processes, they should also set this environment variable, otherwise, inconsistent environment setup will cause NCCL to hang or crash, as observed in the [RLHF integration](https://github.com/OpenRLHF/OpenRLHF/pull/604) and the [discussion](gh-issue:5723#issuecomment-2554389656) .
vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM is a fast and easy-to-use library for LLM inference and serving.
Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has evloved into a community-driven project with contributions from both academia and industry.
vLLM is fast with:
vLLM is fast with:
- State-of-the-art serving throughput
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with **PagedAttention**
- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)
- Continuous batching of incoming requests
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Fast model execution with CUDA/HIP graph
- Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8
- Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8
...
@@ -50,151 +52,152 @@ For more information, check out the following:
...
@@ -50,151 +52,152 @@ For more information, check out the following:
-[vLLM announcing blog post](https://vllm.ai)(intro to PagedAttention)
-[vLLM announcing blog post](https://vllm.ai)(intro to PagedAttention)
-[How continuous batching enables 23x throughput in LLM inference while reducing p50 latency](https://www.anyscale.com/blog/continuous-batching-llm-inference) by Cade Daniel et al.
-[How continuous batching enables 23x throughput in LLM inference while reducing p50 latency](https://www.anyscale.com/blog/continuous-batching-llm-inference) by Cade Daniel et al.
-{ref}`vLLM Meetups <meetups>`.
-[vLLM Meetups](#meetups)
## Documentation
## Documentation
```{toctree}
% How to start using vLLM?
:::{toctree}
:caption: Getting Started
:caption: Getting Started
:maxdepth: 1
:maxdepth: 1
getting_started/installation
getting_started/installation/index
getting_started/amd-installation
getting_started/openvino-installation
getting_started/cpu-installation
getting_started/gaudi-installation
getting_started/arm-installation
getting_started/neuron-installation
getting_started/tpu-installation
getting_started/xpu-installation
getting_started/quickstart
getting_started/quickstart
getting_started/debugging
getting_started/examples/examples_index
getting_started/examples/examples_index
```
getting_started/troubleshooting
getting_started/faq
:::
```{toctree}
% What does vLLM support?
:caption: Serving
:maxdepth: 1
serving/openai_compatible_server
serving/deploying_with_docker
serving/deploying_with_k8s
serving/deploying_with_helm
serving/deploying_with_nginx
serving/distributed_serving
serving/metrics
serving/integrations
serving/tensorizer
serving/runai_model_streamer
```
```{toctree}
:::{toctree}
:caption: Models
:caption: Models
:maxdepth: 1
:maxdepth: 1
models/supported_models
models/generative_models
models/generative_models
models/pooling_models
models/pooling_models
models/adding_model
models/supported_models
models/enabling_multimodal_inputs
models/extensions/index
```
:::
```{toctree}
:caption: Usage
:maxdepth: 1
usage/lora
% Additional capabilities
usage/multimodal_inputs
usage/tool_calling
usage/structured_outputs
usage/spec_decode
usage/compatibility_matrix
usage/performance
usage/faq
usage/engine_args
usage/env_vars
usage/usage_stats
usage/disagg_prefill
```
```{toctree}
:caption: Quantization
:maxdepth: 1
quantization/supported_hardware
:::{toctree}
quantization/auto_awq
:caption: Features
quantization/bnb
quantization/gguf
quantization/int8
quantization/fp8
quantization/fp8_e5m2_kvcache
quantization/fp8_e4m3_kvcache
```
```{toctree}
:caption: Automatic Prefix Caching
:maxdepth: 1
:maxdepth: 1
automatic_prefix_caching/apc
features/quantization/index
automatic_prefix_caching/details
features/lora
```
features/tool_calling
features/reasoning_outputs
```{toctree}
features/structured_outputs
:caption: Performance
features/automatic_prefix_caching
features/disagg_prefill
features/spec_decode
features/compatibility_matrix
:::
% Details about running vLLM
:::{toctree}
:caption: Inference and Serving
:maxdepth: 1
:maxdepth: 1
performance/benchmarks
serving/offline_inference
```
serving/openai_compatible_server
serving/multimodal_inputs
serving/distributed_serving
serving/metrics
serving/engine_args
serving/env_vars
serving/usage_stats
serving/integrations/index
:::
% Community: User community resources
% Scaling up vLLM for production
```{toctree}
:::{toctree}
:caption: Community
:caption: Deployment
:maxdepth: 1
:maxdepth: 1
community/meetups
deployment/docker
community/sponsors
deployment/k8s
```
deployment/nginx
deployment/frameworks/index
deployment/integrations/index
:::
% API Documentation: API reference aimed at vllm library usage
% Making the most out of vLLM
```{toctree}
:::{toctree}
:caption: API Documentation
:caption: Performance
:maxdepth: 2
:maxdepth: 1
dev/sampling_params
performance/optimization
dev/pooling_params
performance/benchmarks
dev/offline_inference/offline_index
:::
dev/engine/engine_index
```
% Design: docs about vLLM internals
% Explanation of vLLM internals
```{toctree}
:::{toctree}
:caption: Design
:caption: Design Documents
:maxdepth: 2
:maxdepth: 2
design/arch_overview
design/arch_overview
design/huggingface_integration
design/huggingface_integration
design/plugin_system
design/plugin_system
design/input_processing/model_inputs_index
design/kernel/paged_attention
design/kernel/paged_attention
design/multimodal/multimodal_index
design/mm_processing
design/automatic_prefix_caching
design/multiprocessing
design/multiprocessing
```
:::
% For Developers: contributing to the vLLM project
This document walks you through the steps to extend a vLLM model so that it accepts [multi-modal inputs](#multimodal-inputs).
```{seealso}
[Adding a New Model](adding-a-new-model)
```
## 1. Update the base vLLM model
It is assumed that you have already implemented the model in vLLM according to [these steps](#adding-a-new-model).
Further update the model as follows:
- Implement the {class}`~vllm.model_executor.models.interfaces.SupportsMultiModal` interface.
```diff
+ from vllm.model_executor.models.interfaces import SupportsMultiModal
- class YourModelForImage2Seq(nn.Module):
+ class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
```
```{note}
The model class does not have to be named {code}`*ForCausalLM`.
Check out [the HuggingFace Transformers documentation](https://huggingface.co/docs/transformers/model_doc/auto#multimodal) for some examples.
```
- If you haven't already done so, reserve a keyword parameter in {meth}`~torch.nn.Module.forward`
for each input tensor that corresponds to a multi-modal input, as shown in the following example:
```diff
def forward(
self,
input_ids: torch.Tensor,
positions: torch.Tensor,
kv_caches: List[torch.Tensor],
attn_metadata: AttentionMetadata,
+ pixel_values: torch.Tensor,
) -> SamplerOutput:
```
## 2. Register input mappers
For each modality type that the model accepts as input, decorate the model class with {meth}`MULTIMODAL_REGISTRY.register_input_mapper <vllm.multimodal.MultiModalRegistry.register_input_mapper>`.
This decorator accepts a function that maps multi-modal inputs to the keyword arguments you have previously defined in {meth}`~torch.nn.Module.forward`.
```diff
from vllm.model_executor.models.interfaces import SupportsMultiModal
During startup, dummy data is passed to the vLLM model to allocate memory. This only consists of text input by default, which may not be applicable to multi-modal models.
In such cases, you can define your own dummy data by registering a factory method via {meth}`INPUT_REGISTRY.register_dummy_data <vllm.inputs.registry.InputRegistry.register_dummy_data>`.
```diff
from vllm.inputs import INPUT_REGISTRY
from vllm.model_executor.models.interfaces import SupportsMultiModal
Sometimes, there is a need to process inputs at the {class}`~vllm.LLMEngine` level before they are passed to the model executor.
This is often due to the fact that unlike implementations in HuggingFace Transformers, the reshaping and/or expansion of multi-modal embeddings needs to take place outside model's {meth}`~torch.nn.Module.forward` call.
You can register input processors via {meth}`INPUT_REGISTRY.register_input_processor <vllm.inputs.registry.InputRegistry.register_input_processor>`.
```diff
from vllm.inputs import INPUT_REGISTRY
from vllm.model_executor.models.interfaces import SupportsMultiModal
You can controls the size of the CPU Memory buffer to which tensors are read from the file, and limit this size.
You can control the size of the CPU Memory buffer to which tensors are read from the file, and limit this size.
You can read further about CPU buffer memory limiting [here](https://github.com/run-ai/runai-model-streamer/blob/master/docs/src/env-vars.md#runai_streamer_memory_limit).
You can read further about CPU buffer memory limiting [here](https://github.com/run-ai/runai-model-streamer/blob/master/docs/src/env-vars.md#runai_streamer_memory_limit).
For further instructions about tunable parameters and additional parameters configurable through environment variables, read the [Environment Variables Documentation](https://github.com/run-ai/runai-model-streamer/blob/master/docs/src/env-vars.md).
For further instructions about tunable parameters and additional parameters configurable through environment variables, read the [Environment Variables Documentation](https://github.com/run-ai/runai-model-streamer/blob/master/docs/src/env-vars.md).
vLLM supports loading models with [CoreWeave's Tensorizer](https://docs.coreweave.com/coreweave-machine-learning-and-ai/inference/tensorizer).
vLLM supports loading models with [CoreWeave's Tensorizer](https://docs.coreweave.com/coreweave-machine-learning-and-ai/inference/tensorizer).
vLLM model tensors that have been serialized to disk, an HTTP/HTTPS endpoint, or S3 endpoint can be deserialized
vLLM model tensors that have been serialized to disk, an HTTP/HTTPS endpoint, or S3 endpoint can be deserialized
...
@@ -9,8 +9,8 @@ shorter Pod startup times and CPU memory usage. Tensor encryption is also suppor
...
@@ -9,8 +9,8 @@ shorter Pod startup times and CPU memory usage. Tensor encryption is also suppor
For more information on CoreWeave's Tensorizer, please refer to
For more information on CoreWeave's Tensorizer, please refer to
[CoreWeave's Tensorizer documentation](https://github.com/coreweave/tensorizer). For more information on serializing a vLLM model, as well a general usage guide to using Tensorizer with vLLM, see
[CoreWeave's Tensorizer documentation](https://github.com/coreweave/tensorizer). For more information on serializing a vLLM model, as well a general usage guide to using Tensorizer with vLLM, see
the [vLLM example script](https://docs.vllm.ai/en/stable/getting_started/examples/tensorize_vllm_model.html).
the [vLLM example script](https://docs.vllm.ai/en/stable/getting_started/examples/offline_inference/tensorize_vllm_model.html).
```{note}
:::{note}
Note that to use this feature you will need to install `tensorizer` by running `pip install vllm[tensorizer]`.
Note that to use this feature you will need to install `tensorizer` by running `pip install vllm[tensorizer]`.
vLLM supports generative and pooling models across various tasks.
vLLM supports generative and pooling models across various tasks.
If a model supports more than one task, you can set the task via the {code}`--task` argument.
If a model supports more than one task, you can set the task via the `--task` argument.
For each task, we list the model architectures that have been implemented in vLLM.
For each task, we list the model architectures that have been implemented in vLLM.
Alongside each architecture, we include some popular models that use it.
Alongside each architecture, we include some popular models that use it.
...
@@ -14,10 +14,10 @@ Alongside each architecture, we include some popular models that use it.
...
@@ -14,10 +14,10 @@ Alongside each architecture, we include some popular models that use it.
By default, vLLM loads models from [HuggingFace (HF) Hub](https://huggingface.co/models).
By default, vLLM loads models from [HuggingFace (HF) Hub](https://huggingface.co/models).
To determine whether a given model is supported, you can check the {code}`config.json` file inside the HF repository.
To determine whether a given model is supported, you can check the `config.json` file inside the HF repository.
If the {code}`"architectures"` field contains a model architecture listed below, then it should be supported in theory.
If the `"architectures"` field contains a model architecture listed below, then it should be supported in theory.
````{tip}
:::{tip}
The easiest way to check if your model is really supported at runtime is to run the program below:
The easiest way to check if your model is really supported at runtime is to run the program below:
```python
```python
...
@@ -35,9 +35,9 @@ print(output)
...
@@ -35,9 +35,9 @@ print(output)
```
```
If vLLM successfully returns text (for generative models) or hidden states (for pooling models), it indicates that your model is supported.
If vLLM successfully returns text (for generative models) or hidden states (for pooling models), it indicates that your model is supported.
````
:::
Otherwise, please refer to [Adding a New Model](#adding-a-new-model) and [Enabling Multimodal Inputs](#enabling-multimodal-inputs) for instructions on how to implement your model in vLLM.
Otherwise, please refer to [Adding a New Model](#new-model) for instructions on how to implement your model in vLLM.
Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) to request vLLM support.
Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) to request vLLM support.
### ModelScope
### ModelScope
...
@@ -45,10 +45,10 @@ Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project
...
@@ -45,10 +45,10 @@ Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project
To use models from [ModelScope](https://www.modelscope.cn) instead of HuggingFace Hub, set an environment variable:
To use models from [ModelScope](https://www.modelscope.cn) instead of HuggingFace Hub, set an environment variable:
```shell
```shell
$ export VLLM_USE_MODELSCOPE=True
export VLLM_USE_MODELSCOPE=True
```
```
And use with {code}`trust_remote_code=True`.
And use with `trust_remote_code=True`.
```python
```python
fromvllmimportLLM
fromvllmimportLLM
...
@@ -72,459 +72,472 @@ See [this page](#generative-models) for more information on how to use generativ
...
@@ -72,459 +72,472 @@ See [this page](#generative-models) for more information on how to use generativ
#### Text Generation (`--task generate`)
#### Text Generation (`--task generate`)
```{eval-rst}
:::{list-table}
.. list-table::
:widths: 25 25 50 5 5
:widths: 25 25 50 5 5
:header-rows: 1
:header-rows: 1
-* Architecture
* - Architecture
* Models
- Models
* Example HF Models
- Example HF Models
*[LoRA](#lora-adapter)
- :ref:`LoRA <lora-adapter>`
*[PP](#distributed-serving)
- :ref:`PP <distributed-serving>`
-*`AquilaForCausalLM`
* - :code:`AquilaForCausalLM`
* Aquila, Aquila2
- Aquila, Aquila2
*`BAAI/Aquila-7B`, `BAAI/AquilaChat-7B`, etc.
- :code:`BAAI/Aquila-7B`, :code:`BAAI/AquilaChat-7B`, etc.
* ✅︎
- ✅︎
* ✅︎
- ✅︎
-*`ArcticForCausalLM`
* - :code:`ArcticForCausalLM`
* Arctic
- Arctic
*`Snowflake/snowflake-arctic-base`, `Snowflake/snowflake-arctic-instruct`, etc.
- :code:`Snowflake/snowflake-arctic-base`, :code:`Snowflake/snowflake-arctic-instruct`, etc.
*
-
* ✅︎
- ✅︎
-*`BaiChuanForCausalLM`
* - :code:`BaiChuanForCausalLM`
* Baichuan2, Baichuan
- Baichuan2, Baichuan
*`baichuan-inc/Baichuan2-13B-Chat`, `baichuan-inc/Baichuan-7B`, etc.
- :code:`baichuan-inc/Baichuan2-13B-Chat`, :code:`baichuan-inc/Baichuan-7B`, etc.
* ✅︎
- ✅︎
* ✅︎
- ✅︎
-*`BloomForCausalLM`
* - :code:`BloomForCausalLM`
* BLOOM, BLOOMZ, BLOOMChat
- BLOOM, BLOOMZ, BLOOMChat
*`bigscience/bloom`, `bigscience/bloomz`, etc.
- :code:`bigscience/bloom`, :code:`bigscience/bloomz`, etc.
*
-
* ✅︎
- ✅︎
-*`BartForConditionalGeneration`
* - :code:`BartForConditionalGeneration`
* BART
- BART
*`facebook/bart-base`, `facebook/bart-large-cnn`, etc.
- :code:`facebook/bart-base`, :code:`facebook/bart-large-cnn`, etc.
*
-
*
-
-*`ChatGLMModel`
* - :code:`ChatGLMModel`
* ChatGLM
- ChatGLM
*`THUDM/chatglm2-6b`, `THUDM/chatglm3-6b`, etc.
- :code:`THUDM/chatglm2-6b`, :code:`THUDM/chatglm3-6b`, etc.
If your model is not in the above list, we will try to automatically convert the model using
If your model is not in the above list, we will try to automatically convert the model using
:func:`vllm.model_executor.models.adapters.as_classification_model`. By default, the class probabilities are extracted from the softmaxed hidden state corresponding to the last token.
{func}`~vllm.model_executor.models.adapters.as_classification_model`. By default, the class probabilities are extracted from the softmaxed hidden state corresponding to the last token.
#### Sentence Pair Scoring (`--task score`)
#### Sentence Pair Scoring (`--task score`)
```{eval-rst}
:::{list-table}
.. list-table::
:widths: 25 25 50 5 5
:widths: 25 25 50 5 5
:header-rows: 1
:header-rows: 1
-* Architecture
* - Architecture
* Models
- Models
* Example HF Models
- Example HF Models
*[LoRA](#lora-adapter)
- :ref:`LoRA <lora-adapter>`
*[PP](#distributed-serving)
- :ref:`PP <distributed-serving>`
-*`BertForSequenceClassification`
* - :code:`BertForSequenceClassification`
* BERT-based
- BERT-based
*`cross-encoder/ms-marco-MiniLM-L-6-v2`, etc.
- :code:`cross-encoder/ms-marco-MiniLM-L-6-v2`, etc.
*
-
*
-
-*`RobertaForSequenceClassification`
* - :code:`RobertaForSequenceClassification`
* RoBERTa-based
- RoBERTa-based
*`cross-encoder/quora-roberta-base`, etc.
- :code:`cross-encoder/quora-roberta-base`, etc.
*
-
*
-
-*`XLMRobertaForSequenceClassification`
* - :code:`XLMRobertaForSequenceClassification`
* XLM-RoBERTa-based
- XLM-RoBERTa-based
*`BAAI/bge-reranker-v2-m3`, etc.
- :code:`BAAI/bge-reranker-v2-m3`, etc.
*
-
*
-
:::
```
(supported-mm-models)=
(supported-mm-models)=
...
@@ -537,206 +550,21 @@ The following modalities are supported depending on the model:
...
@@ -537,206 +550,21 @@ The following modalities are supported depending on the model:
-**V**ideo
-**V**ideo
-**A**udio
-**A**udio
Any combination of modalities joined by {code}`+` are supported.
Any combination of modalities joined by `+` are supported.
- e.g.: {code}`T + I` means that the model supports text-only, image-only, and text-with-image inputs.
- e.g.: `T + I` means that the model supports text-only, image-only, and text-with-image inputs.
On the other hand, modalities separated by {code}`/` are mutually exclusive.
On the other hand, modalities separated by `/` are mutually exclusive.
- e.g.: {code}`T / I` means that the model supports text-only and image-only inputs, but not text-with-image inputs.
- e.g.: `T / I` means that the model supports text-only and image-only inputs, but not text-with-image inputs.
See [this page](#multimodal-inputs) on how to pass multi-modal inputs to the model.
See [this page](#multimodal-inputs) on how to pass multi-modal inputs to the model.
### Generative Models
:::{important}
To enable multiple multi-modal items per text prompt, you have to set `limit_mm_per_prompt` (offline inference)
or `--limit-mm-per-prompt` (online serving). For example, to enable passing up to 4 images per text prompt:
See [this page](#generative-models) for more information on how to use generative models.
Offline inference:
#### Text Generation (`--task generate`)
```{eval-rst}
.. list-table::
:widths: 25 25 15 20 5 5 5
:header-rows: 1
* - Architecture
- Models
- Inputs
- Example HF Models
- :ref:`LoRA <lora-adapter>`
- :ref:`PP <distributed-serving>`
- V1
* - :code:`AriaForConditionalGeneration`
- Aria
- T + I
- :code:`rhymes-ai/Aria`
-
- ✅︎
-
* - :code:`Blip2ForConditionalGeneration`
- BLIP-2
- T + I\ :sup:`E`
- :code:`Salesforce/blip2-opt-2.7b`, :code:`Salesforce/blip2-opt-6.7b`, etc.
-
- ✅︎
-
* - :code:`ChameleonForConditionalGeneration`
- Chameleon
- T + I
- :code:`facebook/chameleon-7b` etc.
-
- ✅︎
-
* - :code:`FuyuForCausalLM`
- Fuyu
- T + I
- :code:`adept/fuyu-8b` etc.
-
- ✅︎
-
* - :code:`ChatGLMModel`
- GLM-4V
- T + I
- :code:`THUDM/glm-4v-9b` etc.
- ✅︎
- ✅︎
-
* - :code:`H2OVLChatModel`
- H2OVL
- T + I\ :sup:`E+`
- :code:`h2oai/h2ovl-mississippi-800m`, :code:`h2oai/h2ovl-mississippi-2b`, etc.
-
- ✅︎
-
* - :code:`Idefics3ForConditionalGeneration`
- Idefics3
- T + I
- :code:`HuggingFaceM4/Idefics3-8B-Llama3` etc.
- ✅︎
-
-
* - :code:`InternVLChatModel`
- InternVL 2.5, Mono-InternVL, InternVL 2.0
- T + I\ :sup:`E+`
- :code:`OpenGVLab/InternVL2_5-4B`, :code:`OpenGVLab/Mono-InternVL-2B`, :code:`OpenGVLab/InternVL2-4B`, etc.
-
- ✅︎
- ✅︎
* - :code:`LlavaForConditionalGeneration`
- LLaVA-1.5
- T + I\ :sup:`E+`
- :code:`llava-hf/llava-1.5-7b-hf`, :code:`TIGER-Lab/Mantis-8B-siglip-llama3` (see note), etc.
-
- ✅︎
- ✅︎
* - :code:`LlavaNextForConditionalGeneration`
- LLaVA-NeXT
- T + I\ :sup:`E+`
- :code:`llava-hf/llava-v1.6-mistral-7b-hf`, :code:`llava-hf/llava-v1.6-vicuna-7b-hf`, etc.
vLLM currently only supports adding LoRA to the language backbone of multimodal models.
vLLM currently only supports adding LoRA to the language backbone of multimodal models.
```
:::
```{note}
### Generative Models
To use {code}`TIGER-Lab/Mantis-8B-siglip-llama3`, you have to install their GitHub repo ({code}`pip install git+https://github.com/TIGER-AI-Lab/Mantis.git`)
and pass {code}`--hf_overrides '{"architectures": ["MantisForConditionalGeneration"]}'` when running vLLM.
See [this page](#generative-models) for more information on how to use generative models.
```
#### Text Generation (`--task generate`)
```{note}
:::{list-table}
The official {code}`openbmb/MiniCPM-V-2` doesn't work yet, so we need to use a fork ({code}`HwwwH/MiniCPM-V-2`) for now.
:widths: 25 25 15 20 5 5 5
:header-rows: 1
-* Architecture
* Models
* Inputs
* Example HF Models
*[LoRA](#lora-adapter)
*[PP](#distributed-serving)
*[V1](gh-issue:8779)
-*`AriaForConditionalGeneration`
* Aria
* T + I<sup>+</sup>
*`rhymes-ai/Aria`
*
* ✅︎
* ✅︎
-*`Blip2ForConditionalGeneration`
* BLIP-2
* T + I<sup>E</sup>
*`Salesforce/blip2-opt-2.7b`, `Salesforce/blip2-opt-6.7b`, etc.
*
* ✅︎
* ✅︎
-*`ChameleonForConditionalGeneration`
* Chameleon
* T + I
*`facebook/chameleon-7b` etc.
*
* ✅︎
* ✅︎
-*`DeepseekVLV2ForCausalLM`
* DeepSeek-VL2
* T + I<sup>+</sup>
*`deepseek-ai/deepseek-vl2-tiny`, `deepseek-ai/deepseek-vl2-small`, `deepseek-ai/deepseek-vl2` etc. (see note)
*
* ✅︎
* ✅︎
-*`FuyuForCausalLM`
* Fuyu
* T + I
*`adept/fuyu-8b` etc.
*
* ✅︎
* ✅︎
-*`ChatGLMModel`
* GLM-4V
* T + I
*`THUDM/glm-4v-9b` etc.
* ✅︎
* ✅︎
*
-*`H2OVLChatModel`
* H2OVL
* T + I<sup>E+</sup>
*`h2oai/h2ovl-mississippi-800m`, `h2oai/h2ovl-mississippi-2b`, etc.
*
* ✅︎
*
-*`Idefics3ForConditionalGeneration`
* Idefics3
* T + I
*`HuggingFaceM4/Idefics3-8B-Llama3` etc.
* ✅︎
*
*
-*`InternVLChatModel`
* InternVL 2.5, Mono-InternVL, InternVL 2.0
* T + I<sup>E+</sup>
*`OpenGVLab/InternVL2_5-4B`, `OpenGVLab/Mono-InternVL-2B`, `OpenGVLab/InternVL2-4B`, etc.
*
* ✅︎
* ✅︎
-*`LlavaForConditionalGeneration`
* LLaVA-1.5
* T + I<sup>E+</sup>
*`llava-hf/llava-1.5-7b-hf`, `TIGER-Lab/Mantis-8B-siglip-llama3` (see note), etc.
*
* ✅︎
* ✅︎
-*`LlavaNextForConditionalGeneration`
* LLaVA-NeXT
* T + I<sup>E+</sup>
*`llava-hf/llava-v1.6-mistral-7b-hf`, `llava-hf/llava-v1.6-vicuna-7b-hf`, etc.
*
* ✅︎
* ✅︎
-*`LlavaNextVideoForConditionalGeneration`
* LLaVA-NeXT-Video
* T + V
*`llava-hf/LLaVA-NeXT-Video-7B-hf`, etc.
*
* ✅︎
* ✅︎
-*`LlavaOnevisionForConditionalGeneration`
* LLaVA-Onevision
* T + I<sup>+</sup> + V<sup>+</sup>
*`llava-hf/llava-onevision-qwen2-7b-ov-hf`, `llava-hf/llava-onevision-qwen2-0.5b-ov-hf`, etc.
*
* ✅︎
* ✅︎
-*`MiniCPMO`
* MiniCPM-O
* T + I<sup>E+</sup> + V<sup>E+</sup> + A<sup>E+</sup>
*`openbmb/MiniCPM-o-2_6`, etc.
* ✅︎
* ✅︎
*
-*`MiniCPMV`
* MiniCPM-V
* T + I<sup>E+</sup> + V<sup>E+</sup>
*`openbmb/MiniCPM-V-2` (see note), `openbmb/MiniCPM-Llama3-V-2_5`, `openbmb/MiniCPM-V-2_6`, etc.
* ✅︎
* ✅︎
*
-*`MllamaForConditionalGeneration`
* Llama 3.2
* T + I<sup>+</sup>
*`meta-llama/Llama-3.2-90B-Vision-Instruct`, `meta-llama/Llama-3.2-11B-Vision`, etc.
*
*
*
-*`MolmoForCausalLM`
* Molmo
* T + I
*`allenai/Molmo-7B-D-0924`, `allenai/Molmo-72B-0924`, etc.
* ✅︎
* ✅︎
* ✅︎
-*`NVLM_D_Model`
* NVLM-D 1.0
* T + I<sup>E+</sup>
*`nvidia/NVLM-D-72B`, etc.
*
* ✅︎
* ✅︎
-*`PaliGemmaForConditionalGeneration`
* PaliGemma, PaliGemma 2
* T + I<sup>E</sup>
*`google/paligemma-3b-pt-224`, `google/paligemma-3b-mix-224`, `google/paligemma2-3b-ft-docci-448`, etc.
*
* ✅︎
*
-*`Phi3VForCausalLM`
* Phi-3-Vision, Phi-3.5-Vision
* T + I<sup>E+</sup>
*`microsoft/Phi-3-vision-128k-instruct`, `microsoft/Phi-3.5-vision-instruct`, etc.
*
* ✅︎
* ✅︎
-*`PixtralForConditionalGeneration`
* Pixtral
* T + I<sup>+</sup>
*`mistralai/Pixtral-12B-2409`, `mistral-community/pixtral-12b` (see note), etc.
*
* ✅︎
* ✅︎
-*`QWenLMHeadModel`
* Qwen-VL
* T + I<sup>E+</sup>
*`Qwen/Qwen-VL`, `Qwen/Qwen-VL-Chat`, etc.
* ✅︎
* ✅︎
* ✅︎
-*`Qwen2AudioForConditionalGeneration`
* Qwen2-Audio
* T + A<sup>+</sup>
*`Qwen/Qwen2-Audio-7B-Instruct`
*
* ✅︎
* ✅︎
-*`Qwen2VLForConditionalGeneration`
* QVQ, Qwen2-VL
* T + I<sup>E+</sup> + V<sup>E+</sup>
*`Qwen/QVQ-72B-Preview`, `Qwen/Qwen2-VL-7B-Instruct`, `Qwen/Qwen2-VL-72B-Instruct`, etc.
* ✅︎
* ✅︎
* ✅︎
-*`UltravoxModel`
* Ultravox
* T + A<sup>E+</sup>
*`fixie-ai/ultravox-v0_3`
*
* ✅︎
* ✅︎
:::
<sup>E</sup> Pre-computed embeddings can be inputted for this modality.
<sup>+</sup> Multiple items can be inputted per text prompt for this modality.
:::{note}
To use `DeepSeek-VL2` series models, you have to pass `--hf_overrides '{"architectures": ["DeepseekVLV2ForCausalLM"]}'` when running vLLM.
:::
:::{note}
To use `TIGER-Lab/Mantis-8B-siglip-llama3`, you have to pass `--hf_overrides '{"architectures": ["MantisForConditionalGeneration"]}'` when running vLLM.
:::
:::{note}
The official `openbmb/MiniCPM-V-2` doesn't work yet, so we need to use a fork (`HwwwH/MiniCPM-V-2`) for now.
For more details, please see: <gh-pr:4087#issuecomment-2250397630>
For more details, please see: <gh-pr:4087#issuecomment-2250397630>
```
:::
:::{note}
The chat template for Pixtral-HF is incorrect (see [discussion](https://huggingface.co/mistral-community/pixtral-12b/discussions/22)).
A corrected version is available at <gh-file:examples/template_pixtral_hf.jinja>.
:::
### Pooling Models
### Pooling Models
See [this page](pooling-models) for more information on how to use pooling models.
See [this page](pooling-models) for more information on how to use pooling models.
```{important}
:::{important}
Since some model architectures support both generative and pooling tasks,
Since some model architectures support both generative and pooling tasks,
you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode.
you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode.
```
:::
#### Text Embedding (`--task embed`)
#### Text Embedding (`--task embed`)
Any text generation model can be converted into an embedding model by passing {code}`--task embed`.
Any text generation model can be converted into an embedding model by passing `--task embed`.
```{note}
:::{note}
To get the best results, you should use pooling models that are specifically trained as such.
To get the best results, you should use pooling models that are specifically trained as such.
```
:::
The following table lists those that are tested in vLLM.
The following table lists those that are tested in vLLM.
At vLLM, we are committed to facilitating the integration and support of third-party models within our ecosystem. Our approach is designed to balance the need for robustness and the practical limitations of supporting a wide range of models. Here’s how we manage third-party model support:
At vLLM, we are committed to facilitating the integration and support of third-party models within our ecosystem. Our approach is designed to balance the need for robustness and the practical limitations of supporting a wide range of models. Here’s how we manage third-party model support:
1.**Community-Driven Support**: We encourage community contributions for adding new models. When a user requests support for a new model, we welcome pull requests (PRs) from the community. These contributions are evaluated primarily on the sensibility of the output they generate, rather than strict consistency with existing implementations such as those in transformers. **Call for contribution:** PRs coming directly from model vendors are greatly appreciated!
1.**Community-Driven Support**: We encourage community contributions for adding new models. When a user requests support for a new model, we welcome pull requests (PRs) from the community. These contributions are evaluated primarily on the sensibility of the output they generate, rather than strict consistency with existing implementations such as those in transformers. **Call for contribution:** PRs coming directly from model vendors are greatly appreciated!
2.**Best-Effort Consistency**: While we aim to maintain a level of consistency between the models implemented in vLLM and other frameworks like transformers, complete alignment is not always feasible. Factors like acceleration techniques and the use of low-precision computations can introduce discrepancies. Our commitment is to ensure that the implemented models are functional and produce sensible results.
2.**Best-Effort Consistency**: While we aim to maintain a level of consistency between the models implemented in vLLM and other frameworks like transformers, complete alignment is not always feasible. Factors like acceleration techniques and the use of low-precision computations can introduce discrepancies. Our commitment is to ensure that the implemented models are functional and produce sensible results.
```{tip}
:::{tip}
When comparing the output of {code}`model.generate` from HuggingFace Transformers with the output of {code}`llm.generate` from vLLM, note that the former reads the model's generation config file (i.e., [generation_config.json](https://github.com/huggingface/transformers/blob/19dabe96362803fb0a9ae7073d03533966598b17/src/transformers/generation/utils.py#L1945)) and applies the default parameters for generation, while the latter only uses the parameters passed to the function. Ensure all sampling parameters are identical when comparing outputs.
When comparing the output of `model.generate` from HuggingFace Transformers with the output of `llm.generate` from vLLM, note that the former reads the model's generation config file (i.e., [generation_config.json](https://github.com/huggingface/transformers/blob/19dabe96362803fb0a9ae7073d03533966598b17/src/transformers/generation/utils.py#L1945)) and applies the default parameters for generation, while the latter only uses the parameters passed to the function. Ensure all sampling parameters are identical when comparing outputs.
```
:::
3.**Issue Resolution and Model Updates**: Users are encouraged to report any bugs or issues they encounter with third-party models. Proposed fixes should be submitted via PRs, with a clear explanation of the problem and the rationale behind the proposed solution. If a fix for one model impacts another, we rely on the community to highlight and address these cross-model dependencies. Note: for bugfix PRs, it is good etiquette to inform the original author to seek their feedback.
3.**Issue Resolution and Model Updates**: Users are encouraged to report any bugs or issues they encounter with third-party models. Proposed fixes should be submitted via PRs, with a clear explanation of the problem and the rationale behind the proposed solution. If a fix for one model impacts another, we rely on the community to highlight and address these cross-model dependencies. Note: for bugfix PRs, it is good etiquette to inform the original author to seek their feedback.
4.**Monitoring and Updates**: Users interested in specific models should monitor the commit history for those models (e.g., by tracking changes in the main/vllm/model_executor/models directory). This proactive approach helps users stay informed about updates and changes that may affect the models they use.
4.**Monitoring and Updates**: Users interested in specific models should monitor the commit history for those models (e.g., by tracking changes in the main/vllm/model_executor/models directory). This proactive approach helps users stay informed about updates and changes that may affect the models they use.
5.**Selective Focus**: Our resources are primarily directed towards models with significant user interest and impact. Models that are less frequently used may receive less attention, and we rely on the community to play a more active role in their upkeep and improvement.
5.**Selective Focus**: Our resources are primarily directed towards models with significant user interest and impact. Models that are less frequently used may receive less attention, and we rely on the community to play a more active role in their upkeep and improvement.
Through this approach, vLLM fosters a collaborative environment where both the core development team and the broader community contribute to the robustness and diversity of the third-party models supported in our ecosystem.
Through this approach, vLLM fosters a collaborative environment where both the core development team and the broader community contribute to the robustness and diversity of the third-party models supported in our ecosystem.
@@ -8,7 +8,7 @@ Due to the auto-regressive nature of transformer architecture, there are times w
...
@@ -8,7 +8,7 @@ Due to the auto-regressive nature of transformer architecture, there are times w
The vLLM can preempt requests to free up KV cache space for other requests. Preempted requests are recomputed when sufficient KV cache space becomes
The vLLM can preempt requests to free up KV cache space for other requests. Preempted requests are recomputed when sufficient KV cache space becomes
available again. When this occurs, the following warning is printed:
available again. When this occurs, the following warning is printed:
```
```text
WARNING 05-09 00:49:33 scheduler.py:1057 Sequence group 0 is preempted by PreemptionMode.SWAP mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_cumulative_preemption_cnt=1
WARNING 05-09 00:49:33 scheduler.py:1057 Sequence group 0 is preempted by PreemptionMode.SWAP mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_cumulative_preemption_cnt=1
```
```
...
@@ -32,8 +32,8 @@ You can enable the feature by specifying `--enable-chunked-prefill` in the comma
...
@@ -32,8 +32,8 @@ You can enable the feature by specifying `--enable-chunked-prefill` in the comma
By default, vLLM scheduler prioritizes prefills and doesn't batch prefill and decode to the same batch.
By default, vLLM scheduler prioritizes prefills and doesn't batch prefill and decode to the same batch.
...
@@ -49,13 +49,12 @@ This policy has two benefits:
...
@@ -49,13 +49,12 @@ This policy has two benefits:
- It improves ITL and generation decode because decode requests are prioritized.
- It improves ITL and generation decode because decode requests are prioritized.
- It helps achieve better GPU utilization by locating compute-bound (prefill) and memory-bound (decode) requests to the same batch.
- It helps achieve better GPU utilization by locating compute-bound (prefill) and memory-bound (decode) requests to the same batch.
You can tune the performance by changing `max_num_batched_tokens`.
You can tune the performance by changing `max_num_batched_tokens`. By default, it is set to 2048.
By default, it is set to 512, which has the best ITL on A100 in the initial benchmark (llama 70B and mixtral 8x22B).
Smaller `max_num_batched_tokens` achieves better ITL because there are fewer prefills interrupting decodes.
Smaller `max_num_batched_tokens` achieves better ITL because there are fewer prefills interrupting decodes.
Higher `max_num_batched_tokens` achieves better TTFT as you can put more prefill to the batch.
Higher `max_num_batched_tokens` achieves better TTFT as you can put more prefill to the batch.
- If `max_num_batched_tokens` is the same as `max_model_len`, that's almost the equivalent to the default scheduling policy (except that it still prioritizes decodes).
- If `max_num_batched_tokens` is the same as `max_model_len`, that's almost the equivalent to the default scheduling policy (except that it still prioritizes decodes).
- Note that the default value (512) of `max_num_batched_tokens` is optimized for ITL, and it may have lower throughput than the default scheduler.
- Note that the default value (2048) of `max_num_batched_tokens` is optimized for ITL, and it may have lower throughput than the default scheduler.
We recommend you set `max_num_batched_tokens > 2048` for throughput.
We recommend you set `max_num_batched_tokens > 2048` for throughput.
Helm is a package manager for Kubernetes. It will help you to deploy vLLM on k8s and automate the deployment of vLLMm Kubernetes applications. With Helm, you can deploy the same framework architecture with different configurations to multiple namespaces by overriding variables values.
This guide will walk you through the process of deploying vLLM with Helm, including the necessary prerequisites, steps for helm install and documentation on architecture and values file.
## Prerequisites
Before you begin, ensure that you have the following:
- A running Kubernetes cluster
- NVIDIA Kubernetes Device Plugin (`k8s-device-plugin`): This can be found at [https://github.com/NVIDIA/k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin)
- Available GPU resources in your cluster
- S3 with the model which will be deployed
## Installing the chart
To install the chart with the release name `test-vllm`:
Using Kubernetes to deploy vLLM is a scalable and efficient way to serve machine learning models. This guide will walk you through the process of deploying vLLM with Kubernetes, including the necessary prerequisites, steps for deployment, and testing.
## Prerequisites
Before you begin, ensure that you have the following:
- A running Kubernetes cluster
- NVIDIA Kubernetes Device Plugin (`k8s-device-plugin`): This can be found at `https://github.com/NVIDIA/k8s-device-plugin/`
- Available GPU resources in your cluster
## Deployment Steps
1.**Create a PVC , Secret and Deployment for vLLM**
PVC is used to store the model cache and it is optional, you can use hostPath or other storage options
```yaml
apiVersion:v1
kind:PersistentVolumeClaim
metadata:
name:mistral-7b
namespace:default
spec:
accessModes:
-ReadWriteOnce
resources:
requests:
storage:50Gi
storageClassName:default
volumeMode:Filesystem
```
Secret is optional and only required for accessing gated models, you can skip this step if you are not using gated models
```yaml
apiVersion:v1
kind:Secret
metadata:
name:hf-token-secret
namespace:default
type:Opaque
data:
token:"REPLACE_WITH_TOKEN"
```
Create a deployment file for vLLM to run the model server. The following example deploys the `Mistral-7B-Instruct-v0.3` model:
```yaml
apiVersion:apps/v1
kind:Deployment
metadata:
name:mistral-7b
namespace:default
labels:
app:mistral-7b
spec:
replicas:1
selector:
matchLabels:
app:mistral-7b
template:
metadata:
labels:
app:mistral-7b
spec:
volumes:
-name:cache-volume
persistentVolumeClaim:
claimName:mistral-7b
# vLLM needs to access the host's shared memory for tensor parallel inference.
If the service is correctly deployed, you should receive a response from the vLLM model.
## Conclusion
Deploying vLLM with Kubernetes allows for efficient scaling and management of ML models leveraging GPU resources. By following the steps outlined above, you should be able to set up and test a vLLM deployment within your Kubernetes cluster. If you encounter any issues or have suggestions, please feel free to contribute to the documentation.
@@ -8,23 +8,23 @@ Before going into the details of distributed inference and serving, let's first
...
@@ -8,23 +8,23 @@ Before going into the details of distributed inference and serving, let's first
-**Single GPU (no distributed inference)**: If your model fits in a single GPU, you probably don't need to use distributed inference. Just use the single GPU to run the inference.
-**Single GPU (no distributed inference)**: If your model fits in a single GPU, you probably don't need to use distributed inference. Just use the single GPU to run the inference.
-**Single-Node Multi-GPU (tensor parallel inference)**: If your model is too large to fit in a single GPU, but it can fit in a single node with multiple GPUs, you can use tensor parallelism. The tensor parallel size is the number of GPUs you want to use. For example, if you have 4 GPUs in a single node, you can set the tensor parallel size to 4.
-**Single-Node Multi-GPU (tensor parallel inference)**: If your model is too large to fit in a single GPU, but it can fit in a single node with multiple GPUs, you can use tensor parallelism. The tensor parallel size is the number of GPUs you want to use. For example, if you have 4 GPUs in a single node, you can set the tensor parallel size to 4.
-**Multi-Node Multi-GPU (tensor parallel plus pipeline parallel inference)**: If your model is too large to fit in a single node, you can use tensor parallel together with pipeline parallelism. The tensor parallel size is the number of GPUs you want to use in each node, and the pipeline parallel size is the number of nodes you want to use. For example, if you have 16 GPUs in 2 nodes (8GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2.
-**Multi-Node Multi-GPU (tensor parallel plus pipeline parallel inference)**: If your model is too large to fit in a single node, you can use tensor parallel together with pipeline parallelism. The tensor parallel size is the number of GPUs you want to use in each node, and the pipeline parallel size is the number of nodes you want to use. For example, if you have 16 GPUs in 2 nodes (8GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2.
In short, you should increase the number of GPUs and the number of nodes until you have enough GPU memory to hold the model. The tensor parallel size should be the number of GPUs in each node, and the pipeline parallel size should be the number of nodes.
In short, you should increase the number of GPUs and the number of nodes until you have enough GPU memory to hold the model. The tensor parallel size should be the number of GPUs in each node, and the pipeline parallel size should be the number of nodes.
After adding enough GPUs and nodes to hold the model, you can run vLLM first, which will print some logs like `# GPU blocks: 790`. Multiply the number by `16` (the block size), and you can get roughly the maximum number of tokens that can be served on the current configuration. If this number is not satisfying, e.g. you want higher throughput, you can further increase the number of GPUs or nodes, until the number of blocks is enough.
After adding enough GPUs and nodes to hold the model, you can run vLLM first, which will print some logs like `# GPU blocks: 790`. Multiply the number by `16` (the block size), and you can get roughly the maximum number of tokens that can be served on the current configuration. If this number is not satisfying, e.g. you want higher throughput, you can further increase the number of GPUs or nodes, until the number of blocks is enough.
```{note}
:::{note}
There is one edge case: if the model fits in a single node with multiple GPUs, but the number of GPUs cannot divide the model size evenly, you can use pipeline parallelism, which splits the model along layers and supports uneven splits. In this case, the tensor parallel size should be 1 and the pipeline parallel size should be the number of GPUs.
There is one edge case: if the model fits in a single node with multiple GPUs, but the number of GPUs cannot divide the model size evenly, you can use pipeline parallelism, which splits the model along layers and supports uneven splits. In this case, the tensor parallel size should be 1 and the pipeline parallel size should be the number of GPUs.
```
:::
## Details for Distributed Inference and Serving
## Running vLLM on a single node
vLLM supports distributed tensor-parallel and pipeline-parallel inference and serving. Currently, we support [Megatron-LM's tensor parallel algorithm](https://arxiv.org/pdf/1909.08053.pdf). We manage the distributed runtime with either [Ray](https://github.com/ray-project/ray) or python native multiprocessing. Multiprocessing can be used when deploying on a single node, multi-node inferencing currently requires Ray.
vLLM supports distributed tensor-parallel and pipeline-parallel inference and serving. Currently, we support [Megatron-LM's tensor parallel algorithm](https://arxiv.org/pdf/1909.08053.pdf). We manage the distributed runtime with either [Ray](https://github.com/ray-project/ray) or python native multiprocessing. Multiprocessing can be used when deploying on a single node, multi-node inferencing currently requires Ray.
Multiprocessing will be used by default when not running in a Ray placement group and if there are sufficient GPUs available on the same node for the configured {code}`tensor_parallel_size`, otherwise Ray will be used. This default can be overridden via the {code}`LLM` class {code}`distributed-executor-backend` argument or {code}`--distributed-executor-backend` API server argument. Set it to {code}`mp` for multiprocessing or {code}`ray` for Ray. It's not required for Ray to be installed for the multiprocessing case.
Multiprocessing will be used by default when not running in a Ray placement group and if there are sufficient GPUs available on the same node for the configured `tensor_parallel_size`, otherwise Ray will be used. This default can be overridden via the `LLM` class `distributed_executor_backend` argument or `--distributed-executor-backend` API server argument. Set it to `mp` for multiprocessing or `ray` for Ray. It's not required for Ray to be installed for the multiprocessing case.
To run multi-GPU inference with the {code}`LLM` class, set the {code}`tensor_parallel_size` argument to the number of GPUs you want to use. For example, to run inference on 4 GPUs:
To run multi-GPU inference with the `LLM` class, set the `tensor_parallel_size` argument to the number of GPUs you want to use. For example, to run inference on 4 GPUs:
To run multi-GPU serving, pass in the {code}`--tensor-parallel-size` argument when starting the server. For example, to run API server on 4 GPUs:
To run multi-GPU serving, pass in the `--tensor-parallel-size` argument when starting the server. For example, to run API server on 4 GPUs:
```console
```console
$vllm serve facebook/opt-13b \
vllm serve facebook/opt-13b \
$--tensor-parallel-size 4
--tensor-parallel-size 4
```
```
You can also additionally specify {code}`--pipeline-parallel-size` to enable pipeline parallelism. For example, to run API server on 8 GPUs with pipeline parallelism and tensor parallelism:
You can also additionally specify `--pipeline-parallel-size` to enable pipeline parallelism. For example, to run API server on 8 GPUs with pipeline parallelism and tensor parallelism:
```console
```console
$vllm serve gpt2 \
vllm serve gpt2 \
$--tensor-parallel-size 4 \
--tensor-parallel-size 4 \
$--pipeline-parallel-size 2
--pipeline-parallel-size 2
```
```
## Multi-Node Inference and Serving
## Running vLLM on multiple nodes
If a single node does not have enough GPUs to hold the model, you can run the model using multiple nodes. It is important to make sure the execution environment is the same on all nodes, including the model path, the Python environment. The recommended way is to use docker images to ensure the same environment, and hide the heterogeneity of the host machines via mapping them into the same docker configuration.
If a single node does not have enough GPUs to hold the model, you can run the model using multiple nodes. It is important to make sure the execution environment is the same on all nodes, including the model path, the Python environment. The recommended way is to use docker images to ensure the same environment, and hide the heterogeneity of the host machines via mapping them into the same docker configuration.
The first step, is to start containers and organize them into a cluster. We have provided the helper script <gh-file:examples/run_cluster.sh> to start the cluster. Please note, this script launches docker without administrative privileges that would be required to access GPU performance counters when running profiling and tracing tools. For that purpose, the script can have `CAP_SYS_ADMIN` to the docker container by using the `--cap-add` option in the docker run command.
The first step, is to start containers and organize them into a cluster. We have provided the helper script <gh-file:examples/online_serving/run_cluster.sh> to start the cluster. Please note, this script launches docker without administrative privileges that would be required to access GPU performance counters when running profiling and tracing tools. For that purpose, the script can have `CAP_SYS_ADMIN` to the docker container by using the `--cap-add` option in the docker run command.
Pick a node as the head node, and run the following command:
Pick a node as the head node, and run the following command:
```console
```console
$bash run_cluster.sh \
bash run_cluster.sh \
$ vllm/vllm-openai \
vllm/vllm-openai \
$ ip_of_head_node \
ip_of_head_node \
$ --head\
--head \
$ /path/to/the/huggingface/home/in/this/node
/path/to/the/huggingface/home/in/this/node
```
```
On the rest of the worker nodes, run the following command:
On the rest of the worker nodes, run the following command:
```console
```console
$bash run_cluster.sh \
bash run_cluster.sh \
$ vllm/vllm-openai \
vllm/vllm-openai \
$ ip_of_head_node \
ip_of_head_node \
$ --worker\
--worker \
$ /path/to/the/huggingface/home/in/this/node
/path/to/the/huggingface/home/in/this/node
```
```
Then you get a ray cluster of containers. Note that you need to keep the shells running these commands alive to hold the cluster. Any shell disconnect will terminate the cluster. In addition, please note that the argument `ip_of_head_node` should be the IP address of the head node, which is accessible by all the worker nodes. A common misunderstanding is to use the IP address of the worker node, which is not correct.
Then you get a ray cluster of containers. Note that you need to keep the shells running these commands alive to hold the cluster. Any shell disconnect will terminate the cluster. In addition, please note that the argument `ip_of_head_node` should be the IP address of the head node, which is accessible by all the worker nodes. A common misunderstanding is to use the IP address of the worker node, which is not correct.
Then, on any node, use `docker exec -it node /bin/bash` to enter the container, execute `ray status` to check the status of the Ray cluster. You should see the right number of nodes and GPUs.
Then, on any node, use `docker exec -it node /bin/bash` to enter the container, execute `ray status` to check the status of the Ray cluster. You should see the right number of nodes and GPUs.
After that, on any node, you can use vLLM as usual, just as you have all the GPUs on one node. The common practice is to set the tensor parallel size to the number of GPUs in each node, and the pipeline parallel size to the number of nodes. For example, if you have 16 GPUs in 2 nodes (8GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2:
After that, on any node, you can use vLLM as usual, just as you have all the GPUs on one node. The common practice is to set the tensor parallel size to the number of GPUs in each node, and the pipeline parallel size to the number of nodes. For example, if you have 16 GPUs in 2 nodes (8GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2:
```console
```console
$vllm serve /path/to/the/model/in/the/container \
vllm serve /path/to/the/model/in/the/container \
$--tensor-parallel-size 8 \
--tensor-parallel-size 8 \
$--pipeline-parallel-size 2
--pipeline-parallel-size 2
```
```
You can also use tensor parallel without pipeline parallel, just set the tensor parallel size to the number of GPUs in the cluster. For example, if you have 16 GPUs in 2 nodes (8GPUs per node), you can set the tensor parallel size to 16:
You can also use tensor parallel without pipeline parallel, just set the tensor parallel size to the number of GPUs in the cluster. For example, if you have 16 GPUs in 2 nodes (8GPUs per node), you can set the tensor parallel size to 16:
```console
```console
$vllm serve /path/to/the/model/in/the/container \
vllm serve /path/to/the/model/in/the/container \
$--tensor-parallel-size 16
--tensor-parallel-size 16
```
```
To make tensor parallel performant, you should make sure the communication between nodes is efficient, e.g. using high-speed network cards like Infiniband. To correctly set up the cluster to use Infiniband, append additional arguments like `--privileged -e NCCL_IB_HCA=mlx5` to the `run_cluster.sh` script. Please contact your system administrator for more information on how to set up the flags. One way to confirm if the Infiniband is working is to run vLLM with `NCCL_DEBUG=TRACE` environment variable set, e.g. `NCCL_DEBUG=TRACE vllm serve ...` and check the logs for the NCCL version and the network used. If you find `[send] via NET/Socket` in the logs, it means NCCL uses raw TCP Socket, which is not efficient for cross-node tensor parallel. If you find `[send] via NET/IB/GDRDMA` in the logs, it means NCCL uses Infiniband with GPU-Direct RDMA, which is efficient.
To make tensor parallel performant, you should make sure the communication between nodes is efficient, e.g. using high-speed network cards like Infiniband. To correctly set up the cluster to use Infiniband, append additional arguments like `--privileged -e NCCL_IB_HCA=mlx5` to the `run_cluster.sh` script. Please contact your system administrator for more information on how to set up the flags. One way to confirm if the Infiniband is working is to run vLLM with `NCCL_DEBUG=TRACE` environment variable set, e.g. `NCCL_DEBUG=TRACE vllm serve ...` and check the logs for the NCCL version and the network used. If you find `[send] via NET/Socket` in the logs, it means NCCL uses raw TCP Socket, which is not efficient for cross-node tensor parallel. If you find `[send] via NET/IB/GDRDMA` in the logs, it means NCCL uses Infiniband with GPU-Direct RDMA, which is efficient.
```{warning}
:::{warning}
After you start the Ray cluster, you'd better also check the GPU-GPU communication between nodes. It can be non-trivial to set up. Please refer to the [sanity check script](../getting_started/debugging.md) for more information. If you need to set some environment variables for the communication configuration, you can append them to the `run_cluster.sh` script, e.g. `-e NCCL_SOCKET_IFNAME=eth0`. Note that setting environment variables in the shell (e.g. `NCCL_SOCKET_IFNAME=eth0 vllm serve ...`) only works for the processes in the same node, not for the processes in the other nodes. Setting environment variables when you create the cluster is the recommended way. See <gh-issue:6803> for more information.
After you start the Ray cluster, you'd better also check the GPU-GPU communication between nodes. It can be non-trivial to set up. Please refer to the [sanity check script](#troubleshooting-incorrect-hardware-driver) for more information. If you need to set some environment variables for the communication configuration, you can append them to the `run_cluster.sh` script, e.g. `-e NCCL_SOCKET_IFNAME=eth0`. Note that setting environment variables in the shell (e.g. `NCCL_SOCKET_IFNAME=eth0 vllm serve ...`) only works for the processes in the same node, not for the processes in the other nodes. Setting environment variables when you create the cluster is the recommended way. See <gh-issue:6803> for more information.
```
:::
```{warning}
:::{warning}
Please make sure you downloaded the model to all the nodes (with the same path), or the model is downloaded to some distributed file system that is accessible by all nodes.
Please make sure you downloaded the model to all the nodes (with the same path), or the model is downloaded to some distributed file system that is accessible by all nodes.
When you use huggingface repo id to refer to the model, you should append your huggingface token to the `run_cluster.sh` script, e.g. `-e HF_TOKEN=`. The recommended way is to download the model first, and then use the path to refer to the model.
When you use huggingface repo id to refer to the model, you should append your huggingface token to the `run_cluster.sh` script, e.g. `-e HF_TOKEN=`. The recommended way is to download the model first, and then use the path to refer to the model.
vLLM uses the following environment variables to configure the system:
vLLM uses the following environment variables to configure the system:
```{warning}
:::{warning}
Please note that `VLLM_PORT` and `VLLM_HOST_IP` set the port and ip for vLLM's **internal usage**. It is not the port and ip for the API server. If you use `--host $VLLM_HOST_IP` and `--port $VLLM_PORT` to start the API server, it will not work.
Please note that `VLLM_PORT` and `VLLM_HOST_IP` set the port and ip for vLLM's **internal usage**. It is not the port and ip for the API server. If you use `--host $VLLM_HOST_IP` and `--port $VLLM_PORT` to start the API server, it will not work.
All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service as `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables).
All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service as `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables).