We host regular meetups in San Francisco Bay Area every 2 months. We will share the project updates from the vLLM team and have guest speakers from the industry to share their experience and insights. Please find the materials of our previous meetups below:
We host regular meetups in San Francisco Bay Area every 2 months. We will share the project updates from the vLLM team and have guest speakers from the industry to share their experience and insights. Please find the materials of our previous meetups below:
Engine arguments control the behavior of the vLLM engine.
Engine arguments control the behavior of the vLLM engine.
- For [offline inference][offline-inference], they are part of the arguments to [LLM][vllm.LLM] class.
- For [offline inference](../serving/offline_inference.md), they are part of the arguments to [LLM][vllm.LLM] class.
- For [online serving][serving-openai-compatible-server], they are part of the arguments to `vllm serve`.
- For [online serving](../serving/openai_compatible_server.md), they are part of the arguments to `vllm serve`.
The engine argument classes, [EngineArgs][vllm.engine.arg_utils.EngineArgs] and [AsyncEngineArgs][vllm.engine.arg_utils.AsyncEngineArgs], are a combination of the configuration classes defined in [vllm.config][]. Therefore, if you are interested in developer documentation, we recommend looking at these configuration classes as they are the source of truth for types, defaults and docstrings.
You can look at [EngineArgs][vllm.engine.arg_utils.EngineArgs] and [AsyncEngineArgs][vllm.engine.arg_utils.AsyncEngineArgs] to see the available engine arguments.
## `EngineArgs`
However, these classes are a combination of the configuration classes defined in [vllm.config][]. Therefore, we would recommend you read about them there where they are best documented.
--8<-- "docs/argparse/engine_args.md"
For offline inference you will have access to these configuration classes and for online serving you can cross-reference the configs with `vllm serve --help`, which has its arguments grouped by config.
## `AsyncEngineArgs`
!!! note
--8<-- "docs/argparse/async_engine_args.md"
Additional arguments are available to the [AsyncLLMEngine][vllm.engine.async_llm_engine.AsyncLLMEngine] which is used for online serving. These can be found by running `vllm serve --help`
@@ -7,7 +7,7 @@ vLLM uses the following environment variables to configure the system:
...
@@ -7,7 +7,7 @@ vLLM uses the following environment variables to configure the system:
All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service as `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables).
All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service as `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables).
3. Since vLLM uses `uv`, ensure the following index strategy is applied:
- Via environment variable:
```bash
export UV_INDEX_STRATEGY=unsafe-best-match
```
- Or via CLI flag:
```bash
--index-strategy unsafe-best-match
```
If failures are found in the pull request, raise them as issues on vLLM and
If failures are found in the pull request, raise them as issues on vLLM and
cc the PyTorch release team to initiate discussion on how to address them.
cc the PyTorch release team to initiate discussion on how to address them.
...
@@ -45,20 +58,25 @@ cc the PyTorch release team to initiate discussion on how to address them.
...
@@ -45,20 +58,25 @@ cc the PyTorch release team to initiate discussion on how to address them.
## Update CUDA version
## Update CUDA version
The PyTorch release matrix includes both stable and experimental [CUDA versions](https://github.com/pytorch/pytorch/blob/main/RELEASE.md#release-compatibility-matrix). Due to limitations, only the latest stable CUDA version (for example,
The PyTorch release matrix includes both stable and experimental [CUDA versions](https://github.com/pytorch/pytorch/blob/main/RELEASE.md#release-compatibility-matrix). Due to limitations, only the latest stable CUDA version (for example,
torch2.7.0+cu12.6) is uploaded to PyPI. However, vLLM may require a different CUDA version,
`torch2.7.0+cu12.6`) is uploaded to PyPI. However, vLLM may require a different CUDA version,
such as 12.8 for Blackwell support.
such as 12.8 for Blackwell support.
This complicates the process as we cannot use the out-of-the-box
This complicates the process as we cannot use the out-of-the-box
`pip install torch torchvision torchaudio` command. The solution is to use
`pip install torch torchvision torchaudio` command. The solution is to use
`--extra-index-url` in vLLM's Dockerfiles.
`--extra-index-url` in vLLM's Dockerfiles.
1. Use `--extra-index-url https://download.pytorch.org/whl/cu128` to install torch+cu128.
- Important indexes at the moment include:
2. Other important indexes at the moment include:
1. CPU ‒ https://download.pytorch.org/whl/cpu
| Platform | `--extra-index-url` |
2. ROCm ‒ https://download.pytorch.org/whl/rocm6.2.4 and https://download.pytorch.org/whl/rocm6.3
|----------|-----------------|
3. XPU ‒ https://download.pytorch.org/whl/xpu
| CUDA 12.8| [https://download.pytorch.org/whl/cu128](https://download.pytorch.org/whl/cu128)|
3. Update .buildkite/release-pipeline.yaml and .buildkite/scripts/upload-wheels.sh to
| CPU | [https://download.pytorch.org/whl/cpu](https://download.pytorch.org/whl/cpu)|
match the CUDA version from step 1. This makes sure that the release vLLM wheel is tested
- Update the below files to match the CUDA version from step 1. This makes sure that the release vLLM wheel is tested on CI.
-`.buildkite/release-pipeline.yaml`
-`.buildkite/scripts/upload-wheels.sh`
## Address long vLLM build time
## Address long vLLM build time
...
@@ -68,8 +86,8 @@ and timeout. Additionally, since vLLM's fastcheck pipeline runs in read-only mod
...
@@ -68,8 +86,8 @@ and timeout. Additionally, since vLLM's fastcheck pipeline runs in read-only mod
it doesn't populate the cache, so re-running it to warm up the cache
it doesn't populate the cache, so re-running it to warm up the cache
is ineffective.
is ineffective.
While ongoing efforts like [#17419](https://github.com/vllm-project/vllm/issues/17419)
While ongoing efforts like [#17419](gh-issue:17419)
address the long build time at its source, the current workaround is to set VLLM_CI_BRANCH
address the long build time at its source, the current workaround is to set `VLLM_CI_BRANCH`
to a custom branch provided by @khluu (`VLLM_CI_BRANCH=khluu/use_postmerge_q`)
to a custom branch provided by @khluu (`VLLM_CI_BRANCH=khluu/use_postmerge_q`)
when manually triggering a build on Buildkite. This branch accomplishes two things:
when manually triggering a build on Buildkite. This branch accomplishes two things:
...
@@ -89,17 +107,18 @@ releases (which would take too much time), they can be built from
...
@@ -89,17 +107,18 @@ releases (which would take too much time), they can be built from
source to unblock the update process.
source to unblock the update process.
### FlashInfer
### FlashInfer
Here is how to build and install it from source with torch2.7.0+cu128 in vLLM [Dockerfile](https://github.com/vllm-project/vllm/blob/27bebcd89792d5c4b08af7a65095759526f2f9e1/docker/Dockerfile#L259-L271):
Here is how to build and install it from source with `torch2.7.0+cu128` in vLLM [Dockerfile](https://github.com/vllm-project/vllm/blob/27bebcd89792d5c4b08af7a65095759526f2f9e1/docker/Dockerfile#L259-L271):
One caveat is that building FlashInfer from source adds approximately 30
One caveat is that building FlashInfer from source adds approximately 30
minutes to the vLLM build time. Therefore, it's preferable to cache the wheel in a
minutes to the vLLM build time. Therefore, it's preferable to cache the wheel in a
public location for immediate installation, such as https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.6.post1%2Bcu128torch2.7-cp39-abi3-linux_x86_64.whl. For future releases, contact the PyTorch release
public location for immediate installation, such as [this FlashInfer wheel link](https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.6.post1%2Bcu128torch2.7-cp39-abi3-linux_x86_64.whl). For future releases, contact the PyTorch release
team if you want to get the package published there.
team if you want to get the package published there.
### xFormers
### xFormers
...
@@ -107,13 +126,15 @@ Similar to FlashInfer, here is how to build and install xFormers from source:
...
@@ -107,13 +126,15 @@ Similar to FlashInfer, here is how to build and install xFormers from source:
@@ -84,6 +84,7 @@ Below is an example of what the generated `CMakeUserPresets.json` might look lik
...
@@ -84,6 +84,7 @@ Below is an example of what the generated `CMakeUserPresets.json` might look lik
```
```
**What do the various configurations mean?**
**What do the various configurations mean?**
-`CMAKE_CUDA_COMPILER`: Path to your `nvcc` binary. The script attempts to find this automatically.
-`CMAKE_CUDA_COMPILER`: Path to your `nvcc` binary. The script attempts to find this automatically.
-`CMAKE_C_COMPILER_LAUNCHER`, `CMAKE_CXX_COMPILER_LAUNCHER`, `CMAKE_CUDA_COMPILER_LAUNCHER`: Setting these to `ccache` (or `sccache`) significantly speeds up rebuilds by caching compilation results. Ensure `ccache` is installed (e.g., `sudo apt install ccache` or `conda install ccache`). The script sets these by default.
-`CMAKE_C_COMPILER_LAUNCHER`, `CMAKE_CXX_COMPILER_LAUNCHER`, `CMAKE_CUDA_COMPILER_LAUNCHER`: Setting these to `ccache` (or `sccache`) significantly speeds up rebuilds by caching compilation results. Ensure `ccache` is installed (e.g., `sudo apt install ccache` or `conda install ccache`). The script sets these by default.
-`VLLM_PYTHON_EXECUTABLE`: Path to the Python executable in your vLLM development environment. The script will prompt for this, defaulting to the current Python environment if suitable.
-`VLLM_PYTHON_EXECUTABLE`: Path to the Python executable in your vLLM development environment. The script will prompt for this, defaulting to the current Python environment if suitable.
...
@@ -98,16 +99,16 @@ Once your `CMakeUserPresets.json` is configured:
...
@@ -98,16 +99,16 @@ Once your `CMakeUserPresets.json` is configured:
1.**Initialize the CMake build environment:**
1.**Initialize the CMake build environment:**
This step configures the build system according to your chosen preset (e.g., `release`) and creates the build directory at `binaryDir`
This step configures the build system according to your chosen preset (e.g., `release`) and creates the build directory at `binaryDir`
```console
```console
cmake --preset release
cmake --preset release
```
```
2.**Build and install the vLLM components:**
2.**Build and install the vLLM components:**
This command compiles the code and installs the resulting binaries into your vLLM source directory, making them available to your editable Python installation.
This command compiles the code and installs the resulting binaries into your vLLM source directory, making them available to your editable Python installation.
```console
```console
cmake --build --preset release --target install
cmake --build --preset release --target install
```
```
3.**Make changes and repeat!**
3.**Make changes and repeat!**
Now you start using your editable install of vLLM, testing and making changes as needed. If you need to build again to update based on changes, simply run the CMake command again to build only the affected files.
Now you start using your editable install of vLLM, testing and making changes as needed. If you need to build again to update based on changes, simply run the CMake command again to build only the affected files.
Many decoder language models can now be automatically loaded using the [Transformers backend][transformers-backend] without having to implement them in vLLM. See if `vllm serve <model>` works first!
Many decoder language models can now be automatically loaded using the [Transformers backend][transformers-backend] without having to implement them in vLLM. See if `vllm serve <model>` works first!
vLLM models are specialized [PyTorch](https://pytorch.org/) models that take advantage of various [features][compatibility-matrix] to optimize their performance.
vLLM models are specialized [PyTorch](https://pytorch.org/) models that take advantage of various [features](../../features/compatibility_matrix.md) to optimize their performance.
The complexity of integrating a model into vLLM depends heavily on the model's architecture.
The complexity of integrating a model into vLLM depends heavily on the model's architecture.
The process is considerably straightforward if the model shares a similar architecture with an existing model in vLLM.
The process is considerably straightforward if the model shares a similar architecture with an existing model in vLLM.
This document walks you through the steps to extend a basic model so that it accepts [multi-modal inputs][multimodal-inputs].
This document walks you through the steps to extend a basic model so that it accepts [multi-modal inputs](../../features/multimodal_inputs.md).
## 1. Update the base vLLM model
## 1. Update the base vLLM model
It is assumed that you have already implemented the model in vLLM according to [these steps][new-model-basic].
It is assumed that you have already implemented the model in vLLM according to [these steps](basic.md).
Further update the model as follows:
Further update the model as follows:
- Implement [get_placeholder_str][vllm.model_executor.models.interfaces.SupportsMultiModal.get_placeholder_str] to define the placeholder string which is used to represent the multi-modal item in the text prompt. This should be consistent with the chat template of the model.
- Implement [get_placeholder_str][vllm.model_executor.models.interfaces.SupportsMultiModal.get_placeholder_str] to define the placeholder string which is used to represent the multi-modal item in the text prompt. This should be consistent with the chat template of the model.
??? Code
??? code
```python
```python
class YourModelForImage2Seq(nn.Module):
class YourModelForImage2Seq(nn.Module):
...
@@ -41,7 +38,7 @@ Further update the model as follows:
...
@@ -41,7 +38,7 @@ Further update the model as follows:
- Implement [get_multimodal_embeddings][vllm.model_executor.models.interfaces.SupportsMultiModal.get_multimodal_embeddings] that returns the embeddings from running the multimodal inputs through the multimodal tokenizer of the model. Below we provide a boilerplate of a typical implementation pattern, but feel free to adjust it to your own needs.
- Implement [get_multimodal_embeddings][vllm.model_executor.models.interfaces.SupportsMultiModal.get_multimodal_embeddings] that returns the embeddings from running the multimodal inputs through the multimodal tokenizer of the model. Below we provide a boilerplate of a typical implementation pattern, but feel free to adjust it to your own needs.
??? Code
??? code
```python
```python
class YourModelForImage2Seq(nn.Module):
class YourModelForImage2Seq(nn.Module):
...
@@ -71,7 +68,7 @@ Further update the model as follows:
...
@@ -71,7 +68,7 @@ Further update the model as follows:
- Implement [get_input_embeddings][vllm.model_executor.models.interfaces.SupportsMultiModal.get_input_embeddings] to merge `multimodal_embeddings` with text embeddings from the `input_ids`. If input processing for the model is implemented correctly (see sections below), then you can leverage the utility function we provide to easily merge the embeddings.
- Implement [get_input_embeddings][vllm.model_executor.models.interfaces.SupportsMultiModal.get_input_embeddings] to merge `multimodal_embeddings` with text embeddings from the `input_ids`. If input processing for the model is implemented correctly (see sections below), then you can leverage the utility function we provide to easily merge the embeddings.
??? Code
??? code
```python
```python
from .utils import merge_multimodal_embeddings
from .utils import merge_multimodal_embeddings
...
@@ -155,7 +152,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
...
@@ -155,7 +152,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
Looking at the code of HF's `LlavaForConditionalGeneration`:
Looking at the code of HF's `LlavaForConditionalGeneration`:
and [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor] (Step 4),
and [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor] (Step 4),
decorate the model class with {meth}`MULTIMODAL_REGISTRY.register_processor <vllm.multimodal.registry.MultiModalRegistry.register_processor>`
decorate the model class with [MULTIMODAL_REGISTRY.register_processor][vllm.multimodal.processing.MultiModalRegistry.register_processor]
to register them to the multi-modal registry:
to register them to the multi-modal registry:
```diff
```diff
...
@@ -846,7 +843,7 @@ Examples:
...
@@ -846,7 +843,7 @@ Examples:
### Handling prompt updates unrelated to multi-modal data
### Handling prompt updates unrelated to multi-modal data
[_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates] assumes that each application of prompt update corresponds to one multi-modal item. If the HF processor performs additional processing regardless of how many multi-modal items there are, you should override [_apply_hf_processor_tokens_only][vllm.multimodal.processing.BaseMultiModalProcessor._apply_hf_processor_tokens_only] so that the processed token inputs are consistent with the result of applying the HF processor on text inputs. This is because token inputs bypass the HF processor according to [our design][mm-processing].
[_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates] assumes that each application of prompt update corresponds to one multi-modal item. If the HF processor performs additional processing regardless of how many multi-modal items there are, you should override [_apply_hf_processor_tokens_only][vllm.multimodal.processing.BaseMultiModalProcessor._apply_hf_processor_tokens_only] so that the processed token inputs are consistent with the result of applying the HF processor on text inputs. This is because token inputs bypass the HF processor according to [our design](../../design/mm_processing.md).
vLLM relies on a model registry to determine how to run each model.
vLLM relies on a model registry to determine how to run each model.
A list of pre-registered architectures can be found [here][supported-models].
A list of pre-registered architectures can be found [here](../../models/supported_models.md).
If your model is not on this list, you must register it to vLLM.
If your model is not on this list, you must register it to vLLM.
This page provides detailed instructions on how to do so.
This page provides detailed instructions on how to do so.
...
@@ -14,16 +11,16 @@ This page provides detailed instructions on how to do so.
...
@@ -14,16 +11,16 @@ This page provides detailed instructions on how to do so.
To add a model directly to the vLLM library, start by forking our [GitHub repository](https://github.com/vllm-project/vllm) and then [build it from source][build-from-source].
To add a model directly to the vLLM library, start by forking our [GitHub repository](https://github.com/vllm-project/vllm) and then [build it from source][build-from-source].
This gives you the ability to modify the codebase and test your model.
This gives you the ability to modify the codebase and test your model.
After you have implemented your model (see [tutorial][new-model-basic]), put it into the <gh-dir:vllm/model_executor/models> directory.
After you have implemented your model (see [tutorial](basic.md)), put it into the <gh-dir:vllm/model_executor/models> directory.
Then, add your model class to `_VLLM_MODELS` in <gh-file:vllm/model_executor/models/registry.py> so that it is automatically registered upon importing vLLM.
Then, add your model class to `_VLLM_MODELS` in <gh-file:vllm/model_executor/models/registry.py> so that it is automatically registered upon importing vLLM.
Finally, update our [list of supported models][supported-models] to promote your model!
Finally, update our [list of supported models](../../models/supported_models.md) to promote your model!
!!! important
!!! important
The list of models in each section should be maintained in alphabetical order.
The list of models in each section should be maintained in alphabetical order.
## Out-of-tree models
## Out-of-tree models
You can load an external model [using a plugin][plugin-system] without modifying the vLLM codebase.
You can load an external model [using a plugin](../../design/plugin_system.md) without modifying the vLLM codebase.
To register the model, use the following code:
To register the model, use the following code:
...
@@ -51,4 +48,4 @@ def register():
...
@@ -51,4 +48,4 @@ def register():
!!! important
!!! important
If your model is a multimodal model, ensure the model class implements the [SupportsMultiModal][vllm.model_executor.models.interfaces.SupportsMultiModal] interface.
If your model is a multimodal model, ensure the model class implements the [SupportsMultiModal][vllm.model_executor.models.interfaces.SupportsMultiModal] interface.