Commit 99324e25 authored by zhuwenwen

Merge tag 'v0.9.2' into v0.9.2-ori

parents cc7f22a8 a5dd03c1
...@@ -7,6 +7,8 @@ vLLM uses the following environment variables to configure the system: ...@@ -7,6 +7,8 @@ vLLM uses the following environment variables to configure the system:
All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service as `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables). All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service as `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables).
```python ??? Code
--8<-- "vllm/envs.py:env-vars-definition"
``` ```python
--8<-- "vllm/envs.py:env-vars-definition"
```
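To make the Kubernetes warning concrete, the sketch below lists the variable names Kubernetes would inject for a service, following the naming rule in the linked Kubernetes documentation (service name upper-cased, dashes turned into underscores). The helper name `k8s_service_env_names`, the example port, and the one-element set of vLLM variable names are illustrative assumptions; consult `vllm/envs.py` for the authoritative list.

```python
def k8s_service_env_names(service_name: str, port: int, protocol: str = "tcp") -> list[str]:
    """Return the env var names Kubernetes injects into pods for one service port."""
    prefix = service_name.upper().replace("-", "_")
    proto = protocol.upper()
    return [
        f"{prefix}_SERVICE_HOST",
        f"{prefix}_SERVICE_PORT",
        # Docker-links-compatible variables
        f"{prefix}_PORT",
        f"{prefix}_PORT_{port}_{proto}",
        f"{prefix}_PORT_{port}_{proto}_PROTO",
        f"{prefix}_PORT_{port}_{proto}_PORT",
        f"{prefix}_PORT_{port}_{proto}_ADDR",
    ]


# Assumption: VLLM_PORT stands in for "a name vLLM itself reads".
assumed_vllm_vars = {"VLLM_PORT"}

clash_bad = assumed_vllm_vars & set(k8s_service_env_names("vllm", 8000))
clash_ok = assumed_vllm_vars & set(k8s_service_env_names("vllm-server", 8000))
print(clash_bad, clash_ok)
```

With a service named `vllm`, the injected `VLLM_PORT` collides with the assumed vLLM variable; with `vllm-server`, the injected names all begin with `VLLM_SERVER_` and no exact collision occurs in this sketch.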
...@@ -29,6 +29,8 @@ See <gh-file:LICENSE>. ...@@ -29,6 +29,8 @@ See <gh-file:LICENSE>.
Depending on the kind of development you'd like to do (e.g. Python, CUDA), you can choose to build vLLM with or without compilation. Depending on the kind of development you'd like to do (e.g. Python, CUDA), you can choose to build vLLM with or without compilation.
Check out the [building from source][build-from-source] documentation for details. Check out the [building from source][build-from-source] documentation for details.
For an optimized workflow when iterating on C++/CUDA kernels, see the [Incremental Compilation Workflow](./incremental_build.md) for recommendations.
### Building the docs with MkDocs ### Building the docs with MkDocs
#### Introduction to MkDocs #### Introduction to MkDocs
...@@ -93,25 +95,27 @@ For additional features and advanced configurations, refer to the official [MkDo ...@@ -93,25 +95,27 @@ For additional features and advanced configurations, refer to the official [MkDo
## Testing ## Testing
```bash ??? note "Commands"
pip install -r requirements/dev.txt
# Linting, formatting and static type checking ```bash
pre-commit install --hook-type pre-commit --hook-type commit-msg pip install -r requirements/dev.txt
# You can manually run pre-commit with # Linting, formatting and static type checking
pre-commit run --all-files pre-commit install --hook-type pre-commit --hook-type commit-msg
# To manually run something from CI that does not run # You can manually run pre-commit with
# locally by default, you can run: pre-commit run --all-files
pre-commit run mypy-3.9 --hook-stage manual --all-files
# Unit tests # To manually run something from CI that does not run
pytest tests/ # locally by default, you can run:
pre-commit run mypy-3.9 --hook-stage manual --all-files
# Run tests for a single test file with detailed output # Unit tests
pytest -s -v tests/test_logger.py pytest tests/
```
# Run tests for a single test file with detailed output
pytest -s -v tests/test_logger.py
```
!!! tip !!! tip
Since the <gh-file:docker/Dockerfile> ships with Python 3.12, all tests in CI (except `mypy`) are run with Python 3.12. Since the <gh-file:docker/Dockerfile> ships with Python 3.12, all tests in CI (except `mypy`) are run with Python 3.12.
...@@ -130,7 +134,7 @@ pytest -s -v tests/test_logger.py ...@@ -130,7 +134,7 @@ pytest -s -v tests/test_logger.py
If you encounter a bug or have a feature request, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible. If you encounter a bug or have a feature request, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
!!! warning !!! important
If you discover a security vulnerability, please follow the instructions [here](gh-file:SECURITY.md#reporting-a-vulnerability). If you discover a security vulnerability, please follow the instructions [here](gh-file:SECURITY.md#reporting-a-vulnerability).
## Pull Requests & Code Reviews ## Pull Requests & Code Reviews
...@@ -147,6 +151,14 @@ the terms of the DCO. ...@@ -147,6 +151,14 @@ the terms of the DCO.
Using `-s` with `git commit` will automatically add this header. Using `-s` with `git commit` will automatically add this header.
!!! tip
You can enable automatic sign-off via your IDE:
- **PyCharm**: Click on the `Show Commit Options` icon to the right of the `Commit and Push...` button in the `Commit` window.
It will bring up a `git` window where you can modify the `Author` and enable `Sign-off commit`.
- **VSCode**: Open the [Settings editor](https://code.visualstudio.com/docs/configure/settings)
and enable the `Git: Always Sign Off` (`git.alwaysSignOff`) field.
### PR Title and Classification ### PR Title and Classification
Only specific types of PRs will be reviewed. The PR title is prefixed Only specific types of PRs will be reviewed. The PR title is prefixed
...@@ -186,6 +198,7 @@ The PR needs to meet the following code quality standards: ...@@ -186,6 +198,7 @@ The PR needs to meet the following code quality standards:
### Adding or Changing Kernels ### Adding or Changing Kernels
When actively developing or modifying kernels, using the [Incremental Compilation Workflow](./incremental_build.md) is highly recommended for faster build times.
Each custom kernel needs a schema and one or more implementations to be registered with PyTorch. Each custom kernel needs a schema and one or more implementations to be registered with PyTorch.
- Make sure custom ops are registered following PyTorch guidelines: - Make sure custom ops are registered following PyTorch guidelines:
......
...@@ -37,14 +37,14 @@ multiple Y releases: ...@@ -37,14 +37,14 @@ multiple Y releases:
- **Timeline**: A removal version is explicitly stated in the deprecation - **Timeline**: A removal version is explicitly stated in the deprecation
warning (e.g., "This will be removed in v0.10.0"). warning (e.g., "This will be removed in v0.10.0").
- **Communication**: Deprecation is noted in the following, as applicable: - **Communication**: Deprecation is noted in the following, as applicable:
- Help strings - Help strings
- Log output - Log output
- API responses - API responses
- `/metrics` output (for metrics features) - `/metrics` output (for metrics features)
- User-facing documentation - User-facing documentation
- Release notes - Release notes
- GitHub Issue (RFC) for feedback - GitHub Issue (RFC) for feedback
- Documentation and use of the `@typing_extensions.deprecated` decorator for Python APIs - Documentation and use of the `@typing_extensions.deprecated` decorator for Python APIs
**2.Deprecated (Off By Default)** **2.Deprecated (Off By Default)**
......
# Incremental Compilation Workflow
When working on vLLM's C++/CUDA kernels located in the `csrc/` directory, recompiling the entire project with `uv pip install -e .` for every change can be time-consuming. An incremental compilation workflow using CMake allows for faster iteration by only recompiling the necessary components after an initial setup. This guide details how to set up and use such a workflow, which complements your editable Python installation.
## Prerequisites
Before setting up the incremental build:
1. **vLLM Editable Install:** Ensure you have vLLM installed from source in editable mode. Using pre-compiled wheels for the initial editable setup can be faster, as the CMake workflow will handle subsequent kernel recompilations.
```console
uv venv --python 3.12 --seed
source .venv/bin/activate
VLLM_USE_PRECOMPILED=1 uv pip install -U -e . --torch-backend=auto
```
2. **CUDA Toolkit:** Verify that the NVIDIA CUDA Toolkit is correctly installed and `nvcc` is accessible in your `PATH`. CMake relies on `nvcc` to compile CUDA code. You can typically find `nvcc` in `$CUDA_HOME/bin/nvcc` or by running `which nvcc`. If you encounter issues, refer to the [official CUDA Toolkit installation guides](https://developer.nvidia.com/cuda-toolkit-archive) and vLLM's main [GPU installation documentation](../getting_started/installation/gpu.md#troubleshooting) for troubleshooting. The `CMAKE_CUDA_COMPILER` variable in your `CMakeUserPresets.json` should also point to your `nvcc` binary.
3. **Build Tools:** It is highly recommended to install `ccache` for fast rebuilds by caching compilation results (e.g., `sudo apt install ccache` or `conda install ccache`). Also, ensure the core build dependencies like `cmake` and `ninja` are installed. These are installable through `requirements/build.txt` or your system's package manager.
```console
uv pip install -r requirements/build.txt --torch-backend=auto
```
## Setting up the CMake Build Environment
The incremental build process is managed through CMake. You can configure your build settings using a `CMakeUserPresets.json` file at the root of the vLLM repository.
### Generate `CMakeUserPresets.json` using the helper script
To simplify the setup, vLLM provides a helper script that attempts to auto-detect your system's configuration (like CUDA path, Python environment, and CPU cores) and generates the `CMakeUserPresets.json` file for you.
**Run the script:**
Navigate to the root of your vLLM clone and execute the following command:
```console
python tools/generate_cmake_presets.py
```
The script will prompt you if it cannot automatically determine certain paths (e.g., `nvcc` or a specific Python executable for your vLLM development environment). Follow the on-screen prompts. If an existing `CMakeUserPresets.json` is found, the script will ask for confirmation before overwriting it.
After running the script, a `CMakeUserPresets.json` file will be created in the root of your vLLM repository.
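The kind of auto-detection described above can be sketched with the standard library alone. This is not the code of `tools/generate_cmake_presets.py`; the function name `detect_build_settings` and the returned keys are hypothetical, and the real script may look in other places or prompt differently.

```python
import os
import shutil
import sys


def detect_build_settings() -> dict:
    """Collect values a presets file needs; a value of None means 'ask the user'."""
    # Prefer nvcc on PATH, then fall back to $CUDA_HOME/bin/nvcc if it exists.
    nvcc = shutil.which("nvcc")
    cuda_home = os.environ.get("CUDA_HOME")
    if nvcc is None and cuda_home:
        candidate = os.path.join(cuda_home, "bin", "nvcc")
        nvcc = candidate if os.path.isfile(candidate) else None
    return {
        "nvcc": nvcc,
        "python": sys.executable,          # interpreter of the active environment
        "cpu_cores": os.cpu_count() or 1,  # os.cpu_count() may return None
    }


settings = detect_build_settings()
print(settings)
```

Running it inside the activated virtual environment shows which values could be filled in automatically and which (for example a missing `nvcc`) would have to be supplied by hand.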
### Example `CMakeUserPresets.json`
Below is an example of what the generated `CMakeUserPresets.json` might look like. The script will tailor these values based on your system and any input you provide.
```json
{
    "version": 6,
    "cmakeMinimumRequired": {
        "major": 3,
        "minor": 26,
        "patch": 1
    },
    "configurePresets": [
        {
            "name": "release",
            "generator": "Ninja",
            "binaryDir": "${sourceDir}/cmake-build-release",
            "cacheVariables": {
                "CMAKE_CUDA_COMPILER": "/usr/local/cuda/bin/nvcc",
                "CMAKE_C_COMPILER_LAUNCHER": "ccache",
                "CMAKE_CXX_COMPILER_LAUNCHER": "ccache",
                "CMAKE_CUDA_COMPILER_LAUNCHER": "ccache",
                "CMAKE_BUILD_TYPE": "Release",
                "VLLM_PYTHON_EXECUTABLE": "/home/user/venvs/vllm/bin/python",
                "CMAKE_INSTALL_PREFIX": "${sourceDir}",
                "CMAKE_CUDA_FLAGS": "",
                "NVCC_THREADS": "4",
                "CMAKE_JOB_POOLS": "compile=32"
            }
        }
    ],
    "buildPresets": [
        {
            "name": "release",
            "configurePreset": "release",
            "jobs": 32
        }
    ]
}
```
**What do the various configurations mean?**
- `CMAKE_CUDA_COMPILER`: Path to your `nvcc` binary. The script attempts to find this automatically.
- `CMAKE_C_COMPILER_LAUNCHER`, `CMAKE_CXX_COMPILER_LAUNCHER`, `CMAKE_CUDA_COMPILER_LAUNCHER`: Setting these to `ccache` (or `sccache`) significantly speeds up rebuilds by caching compilation results. Ensure `ccache` is installed (e.g., `sudo apt install ccache` or `conda install ccache`). The script sets these by default.
- `VLLM_PYTHON_EXECUTABLE`: Path to the Python executable in your vLLM development environment. The script will prompt for this, defaulting to the current Python environment if suitable.
- `CMAKE_INSTALL_PREFIX: "${sourceDir}"`: Specifies that the compiled components should be installed back into your vLLM source directory. This is crucial for the editable install, as it makes the newly built kernels immediately available to your Python environment.
- `CMAKE_JOB_POOLS` and `jobs` in build presets: Control the parallelism of the build. The script sets these based on the number of CPU cores detected on your system.
- `binaryDir`: Specifies where the build artifacts will be stored (e.g., `cmake-build-release`).
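If you edit the presets file by hand, a quick sanity check can catch a build preset that points at a missing configure preset or a cache variable left empty. The sketch below is a hypothetical helper (the name `check_presets` and the list of required keys are assumptions drawn from the example above), not something shipped with vLLM or CMake.

```python
import json

# Assumption: these are the cache variables this guide treats as essential.
REQUIRED_CACHE_VARS = ("CMAKE_CUDA_COMPILER", "VLLM_PYTHON_EXECUTABLE", "CMAKE_INSTALL_PREFIX")


def check_presets(text: str) -> list[str]:
    """Return a list of human-readable problems found in a presets document."""
    presets = json.loads(text)
    problems = []
    configure = {p["name"]: p for p in presets.get("configurePresets", [])}
    for build in presets.get("buildPresets", []):
        if build.get("configurePreset") not in configure:
            problems.append(f"build preset {build.get('name')!r} references an unknown configure preset")
    for name, preset in configure.items():
        cache = preset.get("cacheVariables", {})
        for key in REQUIRED_CACHE_VARS:
            if not cache.get(key):
                problems.append(f"configure preset {name!r} is missing {key}")
    return problems


example = """
{
  "version": 6,
  "configurePresets": [
    {"name": "release",
     "cacheVariables": {"CMAKE_CUDA_COMPILER": "/usr/local/cuda/bin/nvcc",
                        "CMAKE_INSTALL_PREFIX": "${sourceDir}"}}
  ],
  "buildPresets": [{"name": "release", "configurePreset": "debug", "jobs": 8}]
}
"""
problems = check_presets(example)
print("\n".join(problems))
```

For the deliberately broken `example`, the check reports the dangling `configurePreset` reference and the missing `VLLM_PYTHON_EXECUTABLE`. To check a real file, pass it the contents of `CMakeUserPresets.json` read from the repository root.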
## Building and Installing with CMake
Once your `CMakeUserPresets.json` is configured:
1. **Initialize the CMake build environment:**
This step configures the build system according to your chosen preset (e.g., `release`) and creates the build directory at `binaryDir`.
```console
cmake --preset release
```
2. **Build and install the vLLM components:**
This command compiles the code and installs the resulting binaries into your vLLM source directory, making them available to your editable Python installation.
```console
cmake --build --preset release --target install
```
3. **Make changes and repeat!**
You can now start using your editable install of vLLM, testing and making changes as needed. After each change to the kernels, run the same CMake build command again; only the affected files are rebuilt.
```console
cmake --build --preset release --target install
```
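The configure-once, rebuild-often loop above can be wrapped in a small script. This is a minimal sketch: the function names `cmake_commands` and `run` are hypothetical, and only the two `cmake` invocations shown in the steps above are used. It defaults to a dry run so that nothing is executed unless you ask for it.

```python
import subprocess


def cmake_commands(preset: str = "release", configure: bool = False) -> list[list[str]]:
    """Build the command lines for an (optional) configure step plus build+install."""
    commands = []
    if configure:
        # Only needed the first time, or after deleting the build directory.
        commands.append(["cmake", "--preset", preset])
    commands.append(["cmake", "--build", "--preset", preset, "--target", "install"])
    return commands


def run(commands: list[list[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # raises CalledProcessError on failure


run(cmake_commands(configure=True))
```

Calling `run(cmake_commands(), dry_run=False)` from the repository root would perform the incremental rebuild from step 3.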
## Verifying the Build
After a successful build, you will find a populated build directory (e.g., `cmake-build-release/` if you used the `release` preset and the example configuration).
```console
> ls cmake-build-release/
bin cmake_install.cmake _deps machete_generation.log
build.ninja CPackConfig.cmake detect_cuda_compute_capabilities.cu marlin_generation.log
_C.abi3.so CPackSourceConfig.cmake detect_cuda_version.cc _moe_C.abi3.so
CMakeCache.txt ctest _flashmla_C.abi3.so moe_marlin_generation.log
CMakeFiles cumem_allocator.abi3.so install_local_manifest.txt vllm-flash-attn
```
The `cmake --build ... --target install` command copies the compiled shared libraries (like `_C.abi3.so`, `_moe_C.abi3.so`, etc.) into the appropriate `vllm` package directory within your source tree. This updates your editable installation with the newly compiled kernels.
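To confirm that the install step placed the shared libraries where Python can import them, you can list the `*.abi3.so` files in a directory. The helper below is an illustrative sketch (the name `find_extensions` is hypothetical); it is demonstrated on a temporary directory so that it runs anywhere.

```python
import tempfile
from pathlib import Path


def find_extensions(package_dir: str) -> list[str]:
    """Return the sorted names of compiled extension modules in package_dir."""
    return sorted(p.name for p in Path(package_dir).glob("*.abi3.so"))


# Demonstration on a throwaway directory with two fake extensions.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("_C.abi3.so", "_moe_C.abi3.so", "__init__.py"):
        (Path(tmp) / name).touch()
    found = find_extensions(tmp)

print(found)
```

In a real checkout you would point it at the `vllm` package directory in your source tree, assuming that is where your build installs the libraries.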
## Additional Tips
- **Adjust Parallelism:** Fine-tune the `CMAKE_JOB_POOLS` in `configurePresets` and `jobs` in `buildPresets` in your `CMakeUserPresets.json`. Too many jobs can overload systems with limited RAM or CPU cores, leading to slower builds or system instability. Too few won't fully utilize available resources.
- **Clean Builds When Necessary:** If you encounter persistent or strange build errors, especially after significant changes or switching branches, consider removing the CMake build directory (e.g., `rm -rf cmake-build-release`) and re-running the `cmake --preset` and `cmake --build` commands.
- **Specific Target Builds:** For even faster iterations when working on a specific module, you can sometimes build a specific target instead of the full `install` target, though `install` ensures all necessary components are updated in your Python environment. Refer to CMake documentation for more advanced target management.
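As a starting point for the parallelism tip above, the following sketch derives a job count from the core count, the `NVCC_THREADS` value, and optionally the available RAM. This heuristic is an assumption made for illustration (including the 4 GB-per-job figure); it is not what the helper script computes, so treat the result as a first guess to tune.

```python
from typing import Optional


def suggest_jobs(
    cpu_cores: int,
    nvcc_threads: int = 4,
    ram_gb: Optional[float] = None,
    gb_per_job: float = 4.0,  # assumed peak memory per compile job
) -> int:
    """Suggest a value for `jobs` / `CMAKE_JOB_POOLS` that avoids oversubscription."""
    # Each nvcc job may itself use nvcc_threads threads.
    jobs = max(1, cpu_cores // max(1, nvcc_threads))
    if ram_gb is not None:
        # Cap by memory so parallel compiles do not exhaust RAM.
        jobs = min(jobs, max(1, int(ram_gb // gb_per_job)))
    return jobs


print(suggest_jobs(32, nvcc_threads=4))             # CPU-bound estimate
print(suggest_jobs(32, nvcc_threads=4, ram_gb=16))  # memory-bound estimate
```

If builds still exhaust memory, lower the result further; if CPU cores sit idle, raise it.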
--- ---
title: Adding a New Model title: Summary
--- ---
[](){ #new-model } [](){ #new-model }
This section provides more information on how to integrate a [PyTorch](https://pytorch.org/) model into vLLM. !!! important
Many decoder language models can now be automatically loaded using the [Transformers backend][transformers-backend] without having to implement them in vLLM. See if `vllm serve <model>` works first!
Contents: vLLM models are specialized [PyTorch](https://pytorch.org/) models that take advantage of various [features][compatibility-matrix] to optimize their performance.
- [Basic](basic.md) The complexity of integrating a model into vLLM depends heavily on the model's architecture.
- [Registration](registration.md) The process is considerably straightforward if the model shares a similar architecture with an existing model in vLLM.
- [Tests](tests.md) However, this can be more complex for models that include new operators (e.g., a new attention mechanism).
- [Multimodal](multimodal.md)
!!! note Read through these pages for a step-by-step guide:
The complexity of adding a new model depends heavily on the model's architecture.
The process is considerably straightforward if the model shares a similar architecture with an existing model in vLLM. - [Basic Model](basic.md)
However, for models that include new operators (e.g., a new attention mechanism), the process can be a bit more complex. - [Registering a Model](registration.md)
- [Unit Testing](tests.md)
- [Multi-Modal Support](multimodal.md)
!!! tip !!! tip
If you are encountering issues while integrating your model into vLLM, feel free to open a [GitHub issue](https://github.com/vllm-project/vllm/issues) If you are encountering issues while integrating your model into vLLM, feel free to open a [GitHub issue](https://github.com/vllm-project/vllm/issues)
......
--- ---
title: Implementing a Basic Model title: Basic Model
--- ---
[](){ #new-model-basic } [](){ #new-model-basic }
...@@ -27,33 +27,35 @@ All vLLM modules within the model must include a `prefix` argument in their cons ...@@ -27,33 +27,35 @@ All vLLM modules within the model must include a `prefix` argument in their cons
The initialization code should look like this: The initialization code should look like this:
```python ??? Code
from torch import nn
from vllm.config import VllmConfig ```python
from vllm.attention import Attention from torch import nn
from vllm.config import VllmConfig
class MyAttention(nn.Module): from vllm.attention import Attention
def __init__(self, vllm_config: VllmConfig, prefix: str):
super().__init__() class MyAttention(nn.Module):
self.attn = Attention(prefix=f"{prefix}.attn") def __init__(self, vllm_config: VllmConfig, prefix: str):
super().__init__()
class MyDecoderLayer(nn.Module): self.attn = Attention(prefix=f"{prefix}.attn")
def __init__(self, vllm_config: VllmConfig, prefix: str):
super().__init__() class MyDecoderLayer(nn.Module):
self.self_attn = MyAttention(prefix=f"{prefix}.self_attn") def __init__(self, vllm_config: VllmConfig, prefix: str):
super().__init__()
class MyModel(nn.Module): self.self_attn = MyAttention(prefix=f"{prefix}.self_attn")
def __init__(self, vllm_config: VllmConfig, prefix: str):
super().__init__() class MyModel(nn.Module):
self.layers = nn.ModuleList( def __init__(self, vllm_config: VllmConfig, prefix: str):
[MyDecoderLayer(vllm_config, prefix=f"{prefix}.layers.{i}") for i in range(vllm_config.model_config.hf_config.num_hidden_layers)] super().__init__()
) self.layers = nn.ModuleList(
[MyDecoderLayer(vllm_config, prefix=f"{prefix}.layers.{i}") for i in range(vllm_config.model_config.hf_config.num_hidden_layers)]
class MyModelForCausalLM(nn.Module): )
def __init__(self, vllm_config: VllmConfig, prefix: str = ""):
super().__init__() class MyModelForCausalLM(nn.Module):
self.model = MyModel(vllm_config, prefix=f"{prefix}.model") def __init__(self, vllm_config: VllmConfig, prefix: str = ""):
``` super().__init__()
self.model = MyModel(vllm_config, prefix=f"{prefix}.model")
```
### Computation Code ### Computation Code
......
...@@ -10,6 +10,22 @@ This document walks you through the steps to extend a basic model so that it acc ...@@ -10,6 +10,22 @@ This document walks you through the steps to extend a basic model so that it acc
It is assumed that you have already implemented the model in vLLM according to [these steps][new-model-basic]. It is assumed that you have already implemented the model in vLLM according to [these steps][new-model-basic].
Further update the model as follows: Further update the model as follows:
- Implement [get_placeholder_str][vllm.model_executor.models.interfaces.SupportsMultiModal.get_placeholder_str] to define the placeholder string which is used to represent the multi-modal item in the text prompt. This should be consistent with the chat template of the model.
??? Code
```python
class YourModelForImage2Seq(nn.Module):
...
@classmethod
def get_placeholder_str(cls, modality: str, i: int) -> Optional[str]:
if modality.startswith("image"):
return "<image>"
raise ValueError("Only image modality is supported")
```
- Reserve a keyword parameter in [forward][torch.nn.Module.forward] for each input tensor that corresponds to a multi-modal input, as shown in the following example: - Reserve a keyword parameter in [forward][torch.nn.Module.forward] for each input tensor that corresponds to a multi-modal input, as shown in the following example:
```diff ```diff
...@@ -25,59 +41,63 @@ Further update the model as follows: ...@@ -25,59 +41,63 @@ Further update the model as follows:
- Implement [get_multimodal_embeddings][vllm.model_executor.models.interfaces.SupportsMultiModal.get_multimodal_embeddings] that returns the embeddings from running the multimodal inputs through the multimodal tokenizer of the model. Below we provide a boilerplate of a typical implementation pattern, but feel free to adjust it to your own needs. - Implement [get_multimodal_embeddings][vllm.model_executor.models.interfaces.SupportsMultiModal.get_multimodal_embeddings] that returns the embeddings from running the multimodal inputs through the multimodal tokenizer of the model. Below we provide a boilerplate of a typical implementation pattern, but feel free to adjust it to your own needs.
```python ??? Code
class YourModelForImage2Seq(nn.Module):
...
def _process_image_input(self, image_input: YourModelImageInputs) -> torch.Tensor: ```python
class YourModelForImage2Seq(nn.Module):
...
assert self.vision_encoder is not None def _process_image_input(self, image_input: YourModelImageInputs) -> torch.Tensor:
image_features = self.vision_encoder(image_input)
return self.multi_modal_projector(image_features)
def get_multimodal_embeddings( assert self.vision_encoder is not None
self, **kwargs: object) -> Optional[MultiModalEmbeddings]: image_features = self.vision_encoder(image_input)
return self.multi_modal_projector(image_features)
# Validate the multimodal input keyword arguments def get_multimodal_embeddings(
image_input = self._parse_and_validate_image_input(**kwargs) self, **kwargs: object) -> Optional[MultiModalEmbeddings]:
if image_input is None:
return None
# Run multimodal inputs through encoder and projector # Validate the multimodal input keyword arguments
vision_embeddings = self._process_image_input(image_input) image_input = self._parse_and_validate_image_input(**kwargs)
return vision_embeddings if image_input is None:
``` return None
# Run multimodal inputs through encoder and projector
vision_embeddings = self._process_image_input(image_input)
return vision_embeddings
```
!!! warning !!! important
The returned `multimodal_embeddings` must be either a **3D [torch.Tensor][]** of shape `(num_items, feature_size, hidden_size)`, or a **list / tuple of 2D [torch.Tensor][]'s** of shape `(feature_size, hidden_size)`, so that `multimodal_embeddings[i]` retrieves the embeddings generated from the `i`-th multimodal data item (e.g., image) of the request. The returned `multimodal_embeddings` must be either a **3D [torch.Tensor][]** of shape `(num_items, feature_size, hidden_size)`, or a **list / tuple of 2D [torch.Tensor][]'s** of shape `(feature_size, hidden_size)`, so that `multimodal_embeddings[i]` retrieves the embeddings generated from the `i`-th multimodal data item (e.g., image) of the request.
- Implement [get_input_embeddings][vllm.model_executor.models.interfaces.SupportsMultiModal.get_input_embeddings] to merge `multimodal_embeddings` with text embeddings from the `input_ids`. If input processing for the model is implemented correctly (see sections below), then you can leverage the utility function we provide to easily merge the embeddings. - Implement [get_input_embeddings][vllm.model_executor.models.interfaces.SupportsMultiModal.get_input_embeddings] to merge `multimodal_embeddings` with text embeddings from the `input_ids`. If input processing for the model is implemented correctly (see sections below), then you can leverage the utility function we provide to easily merge the embeddings.
```python ??? Code
from .utils import merge_multimodal_embeddings
class YourModelForImage2Seq(nn.Module): ```python
... from .utils import merge_multimodal_embeddings
def get_input_embeddings( class YourModelForImage2Seq(nn.Module):
self, ...
input_ids: torch.Tensor,
multimodal_embeddings: Optional[MultiModalEmbeddings] = None, def get_input_embeddings(
) -> torch.Tensor: self,
input_ids: torch.Tensor,
# `get_input_embeddings` should already be implemented for the language multimodal_embeddings: Optional[MultiModalEmbeddings] = None,
# model as one of the requirements of basic vLLM model implementation. ) -> torch.Tensor:
inputs_embeds = self.language_model.get_input_embeddings(input_ids)
# `get_input_embeddings` should already be implemented for the language
if multimodal_embeddings is not None: # model as one of the requirements of basic vLLM model implementation.
inputs_embeds = merge_multimodal_embeddings( inputs_embeds = self.language_model.get_input_embeddings(input_ids)
input_ids=input_ids,
inputs_embeds=inputs_embeds, if multimodal_embeddings is not None:
multimodal_embeddings=multimodal_embeddings, inputs_embeds = merge_multimodal_embeddings(
placeholder_token_id=self.config.image_token_index) input_ids=input_ids,
inputs_embeds=inputs_embeds,
return inputs_embeds multimodal_embeddings=multimodal_embeddings,
``` placeholder_token_id=self.config.image_token_index)
return inputs_embeds
```
- Implement [get_language_model][vllm.model_executor.models.interfaces.SupportsMultiModal.get_language_model] getter to provide stable access to the underlying language model. - Implement [get_language_model][vllm.model_executor.models.interfaces.SupportsMultiModal.get_language_model] getter to provide stable access to the underlying language model.
...@@ -100,8 +120,8 @@ Further update the model as follows: ...@@ -100,8 +120,8 @@ Further update the model as follows:
``` ```
!!! note !!! note
The model class does not have to be named `*ForCausalLM`. The model class does not have to be named `*ForCausalLM`.
Check out [the HuggingFace Transformers documentation](https://huggingface.co/docs/transformers/model_doc/auto#multimodal) for some examples. Check out [the HuggingFace Transformers documentation](https://huggingface.co/docs/transformers/model_doc/auto#multimodal) for some examples.
## 2. Specify processing information ## 2. Specify processing information
...@@ -135,42 +155,46 @@ Assuming that the memory usage increases with the number of tokens, the dummy in ...@@ -135,42 +155,46 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
Looking at the code of HF's `LlavaForConditionalGeneration`: Looking at the code of HF's `LlavaForConditionalGeneration`:
```python ??? Code
# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L530-L544
n_image_tokens = (input_ids == self.config.image_token_index).sum().item()
n_image_features = image_features.shape[0] * image_features.shape[1]
if n_image_tokens != n_image_features: ```python
raise ValueError( # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L530-L544
f"Image features and image tokens do not match: tokens: {n_image_tokens}, features {n_image_features}" n_image_tokens = (input_ids == self.config.image_token_index).sum().item()
n_image_features = image_features.shape[0] * image_features.shape[1]
if n_image_tokens != n_image_features:
raise ValueError(
f"Image features and image tokens do not match: tokens: {n_image_tokens}, features {n_image_features}"
)
special_image_mask = (
(input_ids == self.config.image_token_index)
.unsqueeze(-1)
.expand_as(inputs_embeds)
.to(inputs_embeds.device)
) )
special_image_mask = ( image_features = image_features.to(inputs_embeds.device, inputs_embeds.dtype)
(input_ids == self.config.image_token_index) inputs_embeds = inputs_embeds.masked_scatter(special_image_mask, image_features)
.unsqueeze(-1) ```
.expand_as(inputs_embeds)
.to(inputs_embeds.device)
)
image_features = image_features.to(inputs_embeds.device, inputs_embeds.dtype)
inputs_embeds = inputs_embeds.masked_scatter(special_image_mask, image_features)
```
The number of placeholder feature tokens per image is `image_features.shape[1]`. The number of placeholder feature tokens per image is `image_features.shape[1]`.
`image_features` is calculated inside the `get_image_features` method: `image_features` is calculated inside the `get_image_features` method:
```python ??? Code
# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L290-L300
image_outputs = self.vision_tower(pixel_values, output_hidden_states=True) ```python
# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L290-L300
selected_image_feature = image_outputs.hidden_states[vision_feature_layer] image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)
if vision_feature_select_strategy == "default":
selected_image_feature = selected_image_feature[:, 1:] selected_image_feature = image_outputs.hidden_states[vision_feature_layer]
elif vision_feature_select_strategy == "full": if vision_feature_select_strategy == "default":
selected_image_feature = selected_image_feature selected_image_feature = selected_image_feature[:, 1:]
else: elif vision_feature_select_strategy == "full":
raise ValueError(f"Unexpected select feature strategy: {self.config.vision_feature_select_strategy}") selected_image_feature = selected_image_feature
image_features = self.multi_modal_projector(selected_image_feature) else:
return image_features raise ValueError(f"Unexpected select feature strategy: {self.config.vision_feature_select_strategy}")
``` image_features = self.multi_modal_projector(selected_image_feature)
return image_features
```
We can infer that `image_features.shape[1]` is based on `image_outputs.hidden_states.shape[1]` from the vision tower We can infer that `image_features.shape[1]` is based on `image_outputs.hidden_states.shape[1]` from the vision tower
(`CLIPVisionModel` for the [`llava-hf/llava-1.5-7b-hf`](https://huggingface.co/llava-hf/llava-1.5-7b-hf) model). (`CLIPVisionModel` for the [`llava-hf/llava-1.5-7b-hf`](https://huggingface.co/llava-hf/llava-1.5-7b-hf) model).
...@@ -193,20 +217,22 @@ Assuming that the memory usage increases with the number of tokens, the dummy in ...@@ -193,20 +217,22 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
To find the sequence length, we turn to the code of `CLIPVisionEmbeddings`:

??? Code

    ```python
    # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/modeling_clip.py#L247-L257
    target_dtype = self.patch_embedding.weight.dtype
    patch_embeds = self.patch_embedding(pixel_values.to(dtype=target_dtype))  # shape = [*, width, grid, grid]
    patch_embeds = patch_embeds.flatten(2).transpose(1, 2)

    class_embeds = self.class_embedding.expand(batch_size, 1, -1)
    embeddings = torch.cat([class_embeds, patch_embeds], dim=1)
    if interpolate_pos_encoding:
        embeddings = embeddings + self.interpolate_pos_encoding(embeddings, height, width)
    else:
        embeddings = embeddings + self.position_embedding(self.position_ids)
    return embeddings
    ```

We can infer that `embeddings.shape[1] == self.num_positions`, where
...@@ -218,55 +244,59 @@ Assuming that the memory usage increases with the number of tokens, the dummy in ...@@ -218,55 +244,59 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
Overall, the number of placeholder feature tokens for an image can be calculated as:

??? Code

    ```python
    def get_num_image_tokens(
        self,
        *,
        image_width: int,
        image_height: int,
    ) -> int:
        hf_config = self.get_hf_config()
        hf_processor = self.get_hf_processor()

        image_size = hf_config.vision_config.image_size
        patch_size = hf_config.vision_config.patch_size

        num_image_tokens = (image_size // patch_size) ** 2 + 1
        if hf_processor.vision_feature_select_strategy == "default":
            num_image_tokens -= 1

        return num_image_tokens
    ```
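As a quick sanity check of the token-count formula, the sketch below evaluates it outside of vLLM. The function name `clip_num_image_tokens` is hypothetical, and the values `image_size=336` and `patch_size=14` are assumptions meant to match the CLIP ViT-L/14-336 vision tower of `llava-hf/llava-1.5-7b-hf`; check the model's `config.json` before relying on them.

```python
def clip_num_image_tokens(image_size: int,
                          patch_size: int,
                          strategy: str = "default") -> int:
    """Hypothetical standalone version of the token-count formula."""
    # One token per patch, plus the class token that
    # `CLIPVisionEmbeddings` concatenates in front of the patches.
    num_image_tokens = (image_size // patch_size) ** 2 + 1
    # The "default" feature-select strategy drops the class token again.
    if strategy == "default":
        num_image_tokens -= 1
    return num_image_tokens


# Assumed config values: 336 // 14 = 24 patches per side.
print(clip_num_image_tokens(336, 14))          # 576
print(clip_num_image_tokens(336, 14, "full"))  # 577
```

The result only depends on the vision config, which is why any dummy image size works for profiling.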
Notice that the number of image tokens doesn't depend on the image width and height.
We can simply use a dummy `image_size` to calculate the multimodal profiling data:

??? Code

    ```python
    # NOTE: In actuality, this is usually implemented as part of the
    # model's subclass of `BaseProcessingInfo`, but we show it as is
    # here for simplicity.
    def get_image_size_with_most_features(self) -> ImageSize:
        hf_config = self.get_hf_config()
        width = height = hf_config.image_size
        return ImageSize(width=width, height=height)

    def get_dummy_mm_data(
        self,
        seq_len: int,
        mm_counts: Mapping[str, int],
    ) -> MultiModalDataDict:
        num_images = mm_counts.get("image", 0)

        target_width, target_height = \
            self.info.get_image_size_with_most_features()

        return {
            "image":
            self._get_dummy_images(width=target_width,
                                   height=target_height,
                                   num_images=num_images)
        }
    ```

For the text, we simply expand the multimodal image token from the model config to match the desired number of images.
...@@ -284,21 +314,23 @@ Assuming that the memory usage increases with the number of tokens, the dummy in ...@@ -284,21 +314,23 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
Looking at the code of HF's `FuyuForCausalLM`:

??? Code

    ```python
    # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/modeling_fuyu.py#L311-L322
    if image_patches is not None and past_key_values is None:
        patch_embeddings = [
            self.vision_embed_tokens(patch.to(self.vision_embed_tokens.weight.dtype))
            .squeeze(0)
            .to(inputs_embeds.device)
            for patch in image_patches
        ]
        inputs_embeds = self.gather_continuous_embeddings(
            word_embeddings=inputs_embeds,
            continuous_embeddings=patch_embeddings,
            image_patch_input_indices=image_patches_indices,
        )
    ```

The number of placeholder feature tokens for the `i`th item in the batch is `patch_embeddings[i].shape[0]`,
which is the same as `image_patches[i].shape[0]`, i.e. `num_total_patches`.
...@@ -312,92 +344,98 @@ Assuming that the memory usage increases with the number of tokens, the dummy in ...@@ -312,92 +344,98 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
In `FuyuImageProcessor.preprocess`, the images are resized and padded to the target `FuyuImageProcessor.size`,
returning the dimensions after resizing (but before padding) as metadata.

??? Code

    ```python
    # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L541-L544
    image_encoding = self.image_processor.preprocess(images, **output_kwargs["images_kwargs"])
    batch_images = image_encoding["images"]
    image_unpadded_heights = image_encoding["image_unpadded_heights"]
    image_unpadded_widths = image_encoding["image_unpadded_widths"]

    # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L480-L
    if do_resize:
        batch_images = [
            [self.resize(image, size=size, input_data_format=input_data_format) for image in images]
            for images in batch_images
        ]

    image_sizes = [get_image_size(images[0], channel_dim=input_data_format) for images in batch_images]
    image_unpadded_heights = [[image_size[0]] for image_size in image_sizes]
    image_unpadded_widths = [[image_size[1]] for image_size in image_sizes]

    if do_pad:
        batch_images = [
            [
                self.pad_image(
                    image,
                    size=size,
                    mode=padding_mode,
                    constant_values=padding_value,
                    input_data_format=input_data_format,
                )
                for image in images
            ]
            for images in batch_images
        ]
    ```
In `FuyuImageProcessor.preprocess_with_tokenizer_info`, the images are split into patches based on this metadata:

??? Code

    ```python
    # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L425
    model_image_input = self.image_processor.preprocess_with_tokenizer_info(
        image_input=tensor_batch_images,
        image_present=image_present,
        image_unpadded_h=image_unpadded_heights,
        image_unpadded_w=image_unpadded_widths,
        image_placeholder_id=image_placeholder_id,
        image_newline_id=image_newline_id,
        variable_sized=True,
    )

    # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L638-L658
    image_height, image_width = image.shape[1], image.shape[2]
    if variable_sized:  # variable_sized=True
        new_h = min(
            image_height,
            math.ceil(image_unpadded_h[batch_index, subseq_index] / patch_height) * patch_height,
        )
        new_w = min(
            image_width,
            math.ceil(image_unpadded_w[batch_index, subseq_index] / patch_width) * patch_width,
        )
        image = image[:, :new_h, :new_w]
        image_height, image_width = new_h, new_w

    num_patches = self.get_num_patches(image_height=image_height, image_width=image_width)
    tensor_of_image_ids = torch.full(
        [num_patches], image_placeholder_id, dtype=torch.int32, device=image_input.device
    )
    patches = self.patchify_image(image=image.unsqueeze(0)).squeeze(0)
    assert num_patches == patches.shape[0]
    ```
The number of patches is in turn defined by `FuyuImageProcessor.get_num_patches`:

??? Code

    ```python
    # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L552-L562
    patch_size = patch_size if patch_size is not None else self.patch_size
    patch_height, patch_width = self.patch_size["height"], self.patch_size["width"]

    if image_height % patch_height != 0:
        raise ValueError(f"{image_height=} must be divisible by {patch_height}")
    if image_width % patch_width != 0:
        raise ValueError(f"{image_width=} must be divisible by {patch_width}")

    num_patches_per_dim_h = image_height // patch_height
    num_patches_per_dim_w = image_width // patch_width
    num_patches = num_patches_per_dim_h * num_patches_per_dim_w
    ```

These image patches correspond to placeholder tokens (`|SPEAKER|`). So, we just need to maximize the number of image patches. Since input images are first resized
to fit within `image_processor.size`, we can maximize the number of image patches by inputting an image with size equal to `image_processor.size`.
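To make the upper bound concrete, here is a minimal sketch of the patch count for an image that exactly fills the target size. The helper `fuyu_num_patches` is hypothetical (it mirrors the arithmetic of `get_num_patches` but is not the HF method), and the 1080x1920 target size with 30x30 patches is an assumption about the default `FuyuImageProcessor` configuration; read the real values from the model's preprocessor config.

```python
def fuyu_num_patches(image_height: int, image_width: int,
                     patch_height: int, patch_width: int) -> int:
    """Hypothetical standalone mirror of the patch-count arithmetic."""
    if image_height % patch_height != 0:
        raise ValueError(f"{image_height=} must be divisible by {patch_height}")
    if image_width % patch_width != 0:
        raise ValueError(f"{image_width=} must be divisible by {patch_width}")
    return (image_height // patch_height) * (image_width // patch_width)


# Assumed defaults: size = 1080 x 1920 (H x W), patch_size = 30 x 30.
# 1080 // 30 = 36 rows and 1920 // 30 = 64 columns of patches.
print(fuyu_num_patches(1080, 1920, 30, 30))  # 2304
```

Any smaller input yields fewer patches, so a dummy image of the full target size gives the worst case for profiling.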
...@@ -419,23 +457,25 @@ Assuming that the memory usage increases with the number of tokens, the dummy in ...@@ -419,23 +457,25 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
For the multimodal image profiling data, the logic is very similar to LLaVA:

??? Code

    ```python
    def get_dummy_mm_data(
        self,
        seq_len: int,
        mm_counts: Mapping[str, int],
    ) -> MultiModalDataDict:
        target_width, target_height = \
            self.info.get_image_size_with_most_features()
        num_images = mm_counts.get("image", 0)

        return {
            "image":
            self._get_dummy_images(width=target_width,
                                   height=target_height,
                                   num_images=num_images)
        }
    ```

## 4. Specify processing details
...@@ -455,6 +495,7 @@ return a schema of the tensors outputted by the HF processor that are related to ...@@ -455,6 +495,7 @@ return a schema of the tensors outputted by the HF processor that are related to
The output of `CLIPImageProcessor` is a simple tensor with shape
`(num_images, num_channels, image_height, image_width)`:

```python
# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/image_processing_clip.py#L339-L345
images = [
...@@ -505,40 +546,49 @@ return a schema of the tensors outputted by the HF processor that are related to ...@@ -505,40 +546,49 @@ return a schema of the tensors outputted by the HF processor that are related to
In order to support the use of [MultiModalFieldConfig.batched][] like in LLaVA,
we remove the extra batch dimension by overriding [BaseMultiModalProcessor._call_hf_processor][]:

??? Code

    ```python
    def _call_hf_processor(
        self,
        prompt: str,
        mm_data: Mapping[str, object],
        mm_kwargs: Mapping[str, object],
        tok_kwargs: Mapping[str, object],
    ) -> BatchFeature:
        processed_outputs = super()._call_hf_processor(
            prompt=prompt,
            mm_data=mm_data,
            mm_kwargs=mm_kwargs,
            tok_kwargs=tok_kwargs,
        )

        image_patches = processed_outputs.get("image_patches")
        if image_patches is not None:
            images = mm_data["images"]
            assert isinstance(images, list)

            # Original output: (1, num_images, Pn, Px * Py * C)
            # New output: (num_images, Pn, Px * Py * C)
            assert (isinstance(image_patches, list)
                    and len(image_patches) == 1)
            assert (isinstance(image_patches[0], torch.Tensor)
                    and len(image_patches[0]) == len(images))

            processed_outputs["image_patches"] = image_patches[0]

        return processed_outputs
    ```

!!! note
    Our [actual code](gh-file:vllm/model_executor/models/fuyu.py) has special handling
    for text-only inputs to prevent unnecessary warnings from HF processor.

!!! note
    The `_call_hf_processor` method specifies both `mm_kwargs` and `tok_kwargs` for
    processing. `mm_kwargs` is used to both initialize and call the huggingface
    processor, whereas `tok_kwargs` is only used to call the huggingface processor.

This lets us override [_get_mm_fields_config][vllm.multimodal.processing.BaseMultiModalProcessor._get_mm_fields_config] as follows:

```python
...@@ -573,35 +623,37 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies ...@@ -573,35 +623,37 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
It simply repeats each input `image_token` a number of times equal to the number of placeholder feature tokens (`num_image_tokens`).
Based on this, we override [_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates] as follows:

??? Code

    ```python
    def _get_prompt_updates(
        self,
        mm_items: MultiModalDataItems,
        hf_processor_mm_kwargs: Mapping[str, object],
        out_mm_kwargs: MultiModalKwargs,
    ) -> Sequence[PromptUpdate]:
        hf_config = self.info.get_hf_config()
        image_token_id = hf_config.image_token_index

        def get_replacement(item_idx: int):
            images = mm_items.get_items("image", ImageProcessorItems)

            image_size = images.get_image_size(item_idx)
            num_image_tokens = self.info.get_num_image_tokens(
                image_width=image_size.width,
                image_height=image_size.height,
            )

            return [image_token_id] * num_image_tokens

        return [
            PromptReplacement(
                modality="image",
                target=[image_token_id],
                replacement=get_replacement,
            ),
        ]
    ```
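To illustrate the effect of such a replacement on a tokenized prompt, the following sketch expands each placeholder into a run of feature tokens. It is a simplified stand-in, not vLLM's actual matching logic: `expand_placeholders`, the token ID `32000`, and the sample prompt are all hypothetical.

```python
from typing import Callable


def expand_placeholders(
    prompt_ids: list[int],
    target_id: int,
    get_replacement: Callable[[int], list[int]],
) -> list[int]:
    """Replace the i-th occurrence of `target_id` with `get_replacement(i)`."""
    expanded: list[int] = []
    item_idx = 0
    for token_id in prompt_ids:
        if token_id == target_id:
            # Each placeholder corresponds to one multimodal item.
            expanded.extend(get_replacement(item_idx))
            item_idx += 1
        else:
            expanded.append(token_id)
    return expanded


IMAGE_TOKEN_ID = 32000  # hypothetical ID, for illustration only
prompt = [1, IMAGE_TOKEN_ID, 319, IMAGE_TOKEN_ID, 2]

# Pretend every image needs 3 feature tokens.
result = expand_placeholders(prompt, IMAGE_TOKEN_ID,
                             lambda item_idx: [IMAGE_TOKEN_ID] * 3)
print(result)
# [1, 32000, 32000, 32000, 319, 32000, 32000, 32000, 2]
```

Passing a callable rather than a fixed list mirrors the `replacement=get_replacement` argument: the token count can differ per item.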
=== "Handling additional tokens: Fuyu"
...@@ -616,117 +668,90 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies ...@@ -616,117 +668,90 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
We define a helper function to return `ncols` and `nrows` directly:

??? Code

    ```python
    def get_image_feature_grid_size(
        self,
        *,
        image_width: int,
        image_height: int,
    ) -> tuple[int, int]:
        image_processor = self.get_image_processor()
        target_width = image_processor.size["width"]
        target_height = image_processor.size["height"]
        patch_width = image_processor.patch_size["width"]
        patch_height = image_processor.patch_size["height"]

        if not (image_width <= target_width and image_height <= target_height):
            height_scale_factor = target_height / image_height
            width_scale_factor = target_width / image_width
            optimal_scale_factor = min(height_scale_factor, width_scale_factor)

            image_height = int(image_height * optimal_scale_factor)
            image_width = int(image_width * optimal_scale_factor)

        ncols = math.ceil(image_width / patch_width)
        nrows = math.ceil(image_height / patch_height)
        return ncols, nrows
    ```
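A standalone version of the helper makes the behaviour easy to try out. In this sketch the function `image_feature_grid_size` is hypothetical, and the default target size (1920x1080) and patch size (30x30) are assumptions standing in for the values read from the image processor.

```python
import math


def image_feature_grid_size(
    image_width: int,
    image_height: int,
    *,
    target_width: int = 1920,   # assumed processor size
    target_height: int = 1080,  # assumed processor size
    patch_width: int = 30,      # assumed patch size
    patch_height: int = 30,     # assumed patch size
) -> tuple[int, int]:
    # Images that exceed the target are scaled down, keeping aspect ratio.
    if not (image_width <= target_width and image_height <= target_height):
        scale = min(target_height / image_height, target_width / image_width)
        image_height = int(image_height * scale)
        image_width = int(image_width * scale)

    # Partial patches at the border still count as a full column or row.
    ncols = math.ceil(image_width / patch_width)
    nrows = math.ceil(image_height / patch_height)
    return ncols, nrows


print(image_feature_grid_size(300, 200))    # (10, 7)
print(image_feature_grid_size(3840, 2160))  # (64, 36)
```

The second call shows an oversized image being scaled by 0.5 to exactly the target size, which is also the case that yields the largest grid.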
Based on this, we can initially define our replacement tokens as:

??? Code

    ```python
    def get_replacement(item_idx: int):
        images = mm_items.get_items("image", ImageProcessorItems)
        image_size = images.get_image_size(item_idx)

        ncols, nrows = self.info.get_image_feature_grid_size(
            image_width=image_size.width,
            image_height=image_size.height,
        )

        # `_IMAGE_TOKEN_ID` corresponds to `|SPEAKER|`
        # `_NEWLINE_TOKEN_ID` corresponds to `|NEWLINE|`
        return ([_IMAGE_TOKEN_ID] * ncols + [_NEWLINE_TOKEN_ID]) * nrows
    ```
However, this is not entirely correct. After `FuyuImageProcessor.preprocess_with_tokenizer_info` is called,
a BOS token (`<s>`) is also added to the prompt:

??? Code

    ```python
    # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L435
    model_image_input = self.image_processor.preprocess_with_tokenizer_info(
        image_input=tensor_batch_images,
        image_present=image_present,
        image_unpadded_h=image_unpadded_heights,
        image_unpadded_w=image_unpadded_widths,
        image_placeholder_id=image_placeholder_id,
        image_newline_id=image_newline_id,
        variable_sized=True,
    )
    prompt_tokens, prompts_length = _tokenize_prompts_with_image_and_batch(
        tokenizer=self.tokenizer,
        prompts=prompts,
        scale_factors=scale_factors,
        max_tokens_to_generate=self.max_tokens_to_generate,
        max_position_embeddings=self.max_position_embeddings,
        add_BOS=True,
        add_beginning_of_answer_token=True,
    )
    ```
To assign the vision embeddings to only the image tokens, instead of a string
you can return an instance of [PromptUpdateDetails][vllm.multimodal.processing.PromptUpdateDetails]:

??? Code

    ```python
    hf_config = self.info.get_hf_config()
    bos_token_id = hf_config.bos_token_id  # `<s>`
    assert isinstance(bos_token_id, int)

    def get_replacement_fuyu(item_idx: int):
        images = mm_items.get_items("image", ImageProcessorItems)
        image_size = images.get_image_size(item_idx)

        ncols, nrows = self.info.get_image_feature_grid_size(
            image_width=image_size.width,
            image_height=image_size.height,
        )
        image_tokens = ([_IMAGE_TOKEN_ID] * ncols +
                        [_NEWLINE_TOKEN_ID]) * nrows

        return PromptUpdateDetails.select_token_id(
            image_tokens + [bos_token_id],
            embed_token_id=_IMAGE_TOKEN_ID,
        )
    ```
Finally, noticing that the HF processor removes the `|ENDOFTEXT|` token from the tokenized prompt,
we can search for it to conduct the replacement at the start of the string:

??? Code

    ```python
    def _get_prompt_updates(
        self,
        mm_items: MultiModalDataItems,
        hf_processor_mm_kwargs: Mapping[str, object],
        out_mm_kwargs: MultiModalKwargs,
    ) -> Sequence[PromptUpdate]:
        hf_config = self.info.get_hf_config()
        bos_token_id = hf_config.bos_token_id
        assert isinstance(bos_token_id, int)

        tokenizer = self.info.get_tokenizer()
        eot_token_id = tokenizer.bos_token_id
        assert isinstance(eot_token_id, int)

        def get_replacement_fuyu(item_idx: int):
            images = mm_items.get_items("image", ImageProcessorItems)
            image_size = images.get_image_size(item_idx)

            ncols, nrows = self.info.get_image_feature_grid_size(
                image_width=image_size.width,
                image_height=image_size.height,
            )
            image_tokens = ([_IMAGE_TOKEN_ID] * ncols +
                            [_NEWLINE_TOKEN_ID]) * nrows

            return PromptUpdateDetails.select_token_id(
                image_tokens + [bos_token_id],
                embed_token_id=_IMAGE_TOKEN_ID,
            )

        return [
            PromptReplacement(
                modality="image",
                target=[eot_token_id],
                replacement=get_replacement_fuyu,
            )
        ]
    ```
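The token layout produced by `get_replacement_fuyu` can be previewed with a small sketch. The token IDs below are hypothetical placeholders, and the boolean mask only mimics the idea behind `embed_token_id` (marking which positions receive vision embeddings); it is not how vLLM represents this internally.

```python
# Hypothetical token IDs, for illustration only.
IMAGE_TOKEN_ID = 100    # stands in for `|SPEAKER|`
NEWLINE_TOKEN_ID = 101  # stands in for `|NEWLINE|`
BOS_TOKEN_ID = 1        # stands in for `<s>`


def fuyu_replacement(ncols: int, nrows: int) -> tuple[list[int], list[bool]]:
    # Each row is `ncols` image tokens followed by one newline token.
    image_tokens = ([IMAGE_TOKEN_ID] * ncols + [NEWLINE_TOKEN_ID]) * nrows
    # The BOS token is appended after the image tokens.
    full = image_tokens + [BOS_TOKEN_ID]
    # Only the image tokens should be assigned vision embeddings.
    is_embed = [token_id == IMAGE_TOKEN_ID for token_id in full]
    return full, is_embed


tokens, mask = fuyu_replacement(ncols=2, nrows=2)
print(tokens)     # [100, 100, 101, 100, 100, 101, 1]
print(sum(mask))  # 4
```

The number of `True` entries equals `ncols * nrows`, which should match the number of patch embeddings for that image.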
## 5. Register processor-related classes
......
---
title: Registering a Model
---
[](){ #new-model-registration }
...@@ -18,7 +18,7 @@ After you have implemented your model (see [tutorial][new-model-basic]), put it ...@@ -18,7 +18,7 @@ After you have implemented your model (see [tutorial][new-model-basic]), put it
Then, add your model class to `_VLLM_MODELS` in <gh-file:vllm/model_executor/models/registry.py> so that it is automatically registered upon importing vLLM.
Finally, update our [list of supported models][supported-models] to promote your model!

!!! important
    The list of models in each section should be maintained in alphabetical order.

## Out-of-tree models
...@@ -49,6 +49,6 @@ def register(): ...@@ -49,6 +49,6 @@ def register():
    )
```

!!! important
    If your model is a multimodal model, ensure the model class implements the [SupportsMultiModal][vllm.model_executor.models.interfaces.SupportsMultiModal] interface.
    Read more about that [here][supports-multimodal].
---
title: Unit Testing
---
[](){ #new-model-tests }
...@@ -15,7 +15,7 @@ Without them, the CI for your PR will fail. ...@@ -15,7 +15,7 @@ Without them, the CI for your PR will fail.
Include an example HuggingFace repository for your model in <gh-file:tests/models/registry.py>.
This enables a unit test that loads dummy weights to ensure that the model can be initialized in vLLM.

!!! important
    The list of models in each section should be maintained in alphabetical order.

!!! tip
......
...@@ -30,13 +30,21 @@ Refer to <gh-file:examples/offline_inference/simple_profiling.py> for an example ...@@ -30,13 +30,21 @@ Refer to <gh-file:examples/offline_inference/simple_profiling.py> for an example
#### OpenAI Server

```bash
VLLM_TORCH_PROFILER_DIR=./vllm_profile \
    python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-70B
```

benchmark_serving.py:

```bash
python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model meta-llama/Meta-Llama-3-70B \
    --dataset-name sharegpt \
    --dataset-path sharegpt.json \
    --profile \
    --num-prompts 2
```

## Profile with NVIDIA Nsight Systems
...@@ -64,7 +72,16 @@ For basic usage, you can just append `nsys profile -o report.nsys-rep --trace-fo ...@@ -64,7 +72,16 @@ For basic usage, you can just append `nsys profile -o report.nsys-rep --trace-fo
The following is an example using the `benchmarks/benchmark_latency.py` script:

```bash
nsys profile -o report.nsys-rep \
    --trace-fork-before-exec=true \
    --cuda-graph-trace=node \
    python benchmarks/benchmark_latency.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --num-iters-warmup 5 \
    --num-iters 1 \
    --batch-size 16 \
    --input-len 512 \
    --output-len 8
```
#### OpenAI Server #### OpenAI Server
...@@ -73,10 +90,21 @@ To profile the server, you will want to prepend your `vllm serve` command with ` ...@@ -73,10 +90,21 @@ To profile the server, you will want to prepend your `vllm serve` command with `
```bash ```bash
# server # server
nsys profile -o report.nsys-rep --trace-fork-before-exec=true --cuda-graph-trace=node --delay 30 --duration 60 vllm serve meta-llama/Llama-3.1-8B-Instruct nsys profile -o report.nsys-rep \
--trace-fork-before-exec=true \
--cuda-graph-trace=node \
--delay 30 \
--duration 60 \
vllm serve meta-llama/Llama-3.1-8B-Instruct
# client # client
python benchmarks/benchmark_serving.py --backend vllm --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 1 --dataset-name random --random-input 1024 --random-output 512 python benchmarks/benchmark_serving.py \
--backend vllm \
--model meta-llama/Llama-3.1-8B-Instruct \
--num-prompts 1 \
--dataset-name random \
--random-input 1024 \
--random-output 512
``` ```
In practice, you should set the `--duration` argument to a large value. Whenever you want the server to stop profiling, run: In practice, you should set the `--duration` argument to a large value. Whenever you want the server to stop profiling, run:
...@@ -97,26 +125,26 @@ to manually kill the profiler and generate your `nsys-rep` report. ...@@ -97,26 +125,26 @@ to manually kill the profiler and generate your `nsys-rep` report.
You can view these profiles either as summaries in the CLI, using `nsys stats [profile-file]`, or in the GUI by installing Nsight [locally following the directions here](https://developer.nvidia.com/nsight-systems/get-started). You can view these profiles either as summaries in the CLI, using `nsys stats [profile-file]`, or in the GUI by installing Nsight [locally following the directions here](https://developer.nvidia.com/nsight-systems/get-started).
CLI example: ??? CLI example
```bash ```bash
nsys stats report1.nsys-rep nsys stats report1.nsys-rep
... ...
** CUDA GPU Kernel Summary (cuda_gpu_kern_sum): ** CUDA GPU Kernel Summary (cuda_gpu_kern_sum):
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ----------- ----------- -------- --------- ----------- ---------------------------------------------------------------------------------------------------- -------- --------------- --------- ----------- ----------- -------- --------- ----------- ----------------------------------------------------------------------------------------------------
46.3 10,327,352,338 17,505 589,965.9 144,383.0 27,040 3,126,460 944,263.8 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_of… 46.3 10,327,352,338 17,505 589,965.9 144,383.0 27,040 3,126,460 944,263.8 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_of…
14.8 3,305,114,764 5,152 641,520.7 293,408.0 287,296 2,822,716 867,124.9 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize256x128x64_warpgroupsize2x1x1_execute_segment_k_of… 14.8 3,305,114,764 5,152 641,520.7 293,408.0 287,296 2,822,716 867,124.9 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize256x128x64_warpgroupsize2x1x1_execute_segment_k_of…
12.1 2,692,284,876 14,280 188,535.4 83,904.0 19,328 2,862,237 497,999.9 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize64x128x64_warpgroupsize1x1x1_execute_segment_k_off… 12.1 2,692,284,876 14,280 188,535.4 83,904.0 19,328 2,862,237 497,999.9 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize64x128x64_warpgroupsize1x1x1_execute_segment_k_off…
9.5 2,116,600,578 33,920 62,399.8 21,504.0 15,326 2,532,285 290,954.1 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize64x64x64_warpgroupsize1x1x1_execute_segment_k_off_… 9.5 2,116,600,578 33,920 62,399.8 21,504.0 15,326 2,532,285 290,954.1 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize64x64x64_warpgroupsize1x1x1_execute_segment_k_off_…
5.0 1,119,749,165 18,912 59,208.4 9,056.0 6,784 2,578,366 271,581.7 void vllm::act_and_mul_kernel<c10::BFloat16, &vllm::silu_kernel<c10::BFloat16>, (bool)1>(T1 *, cons… 5.0 1,119,749,165 18,912 59,208.4 9,056.0 6,784 2,578,366 271,581.7 void vllm::act_and_mul_kernel<c10::BFloat16, &vllm::silu_kernel<c10::BFloat16>, (bool)1>(T1 *, cons…
4.1 916,662,515 21,312 43,011.6 19,776.0 8,928 2,586,205 199,790.1 void cutlass::device_kernel<flash::enable_sm90_or_later<flash::FlashAttnFwdSm90<flash::CollectiveMa… 4.1 916,662,515 21,312 43,011.6 19,776.0 8,928 2,586,205 199,790.1 void cutlass::device_kernel<flash::enable_sm90_or_later<flash::FlashAttnFwdSm90<flash::CollectiveMa…
2.6 587,283,113 37,824 15,526.7 3,008.0 2,719 2,517,756 139,091.1 std::enable_if<T2>(int)0&&vllm::_typeConvert<T1>::exists, void>::type vllm::fused_add_rms_norm_kern… 2.6 587,283,113 37,824 15,526.7 3,008.0 2,719 2,517,756 139,091.1 std::enable_if<T2>(int)0&&vllm::_typeConvert<T1>::exists, void>::type vllm::fused_add_rms_norm_kern…
1.9 418,362,605 18,912 22,121.5 3,871.0 3,328 2,523,870 175,248.2 void vllm::rotary_embedding_kernel<c10::BFloat16, (bool)1>(const long *, T1 *, T1 *, const T1 *, in 1.9 418,362,605 18,912 22,121.5 3,871.0 3,328 2,523,870 175,248.2 void vllm::rotary_embedding_kernel<c10::BFloat16, (bool)1>(const long *, T1 *, T1 *, const T1 *, in…
0.7 167,083,069 18,880 8,849.7 2,240.0 1,471 2,499,996 101,436.1 void vllm::reshape_and_cache_flash_kernel<__nv_bfloat16, __nv_bfloat16, (vllm::Fp8KVCacheDataType)0… 0.7 167,083,069 18,880 8,849.7 2,240.0 1,471 2,499,996 101,436.1 void vllm::reshape_and_cache_flash_kernel<__nv_bfloat16, __nv_bfloat16, (vllm::Fp8KVCacheDataType)0…
... ...
``` ```
GUI example: GUI example:
......
...@@ -34,6 +34,7 @@ you may contact the following individuals: ...@@ -34,6 +34,7 @@ you may contact the following individuals:
- Simon Mo - simon.mo@hey.com - Simon Mo - simon.mo@hey.com
- Russell Bryant - rbryant@redhat.com - Russell Bryant - rbryant@redhat.com
- Huzaifa Sidhpurwala - huzaifas@redhat.com
## Slack Discussion ## Slack Discussion
......
...@@ -10,7 +10,7 @@ title: Using Docker ...@@ -10,7 +10,7 @@ title: Using Docker
vLLM offers an official Docker image for deployment. vLLM offers an official Docker image for deployment.
The image can be used to run an OpenAI-compatible server and is available on Docker Hub as [vllm/vllm-openai](https://hub.docker.com/r/vllm/vllm-openai/tags). The image can be used to run an OpenAI-compatible server and is available on Docker Hub as [vllm/vllm-openai](https://hub.docker.com/r/vllm/vllm-openai/tags).
```console ```bash
docker run --runtime nvidia --gpus all \ docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \ -v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=<secret>" \ --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
...@@ -22,7 +22,7 @@ docker run --runtime nvidia --gpus all \ ...@@ -22,7 +22,7 @@ docker run --runtime nvidia --gpus all \
This image can also be used with other container engines such as [Podman](https://podman.io/). This image can also be used with other container engines such as [Podman](https://podman.io/).
```console ```bash
podman run --gpus all \ podman run --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \ -v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \ --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
...@@ -71,7 +71,7 @@ You can add any other [engine-args][engine-args] you need after the image tag (` ...@@ -71,7 +71,7 @@ You can add any other [engine-args][engine-args] you need after the image tag (`
You can build and run vLLM from source via the provided <gh-file:docker/Dockerfile>. To build vLLM: You can build and run vLLM from source via the provided <gh-file:docker/Dockerfile>. To build vLLM:
```console ```bash
# optionally specifies: --build-arg max_jobs=8 --build-arg nvcc_threads=2 # optionally specifies: --build-arg max_jobs=8 --build-arg nvcc_threads=2
DOCKER_BUILDKIT=1 docker build . \ DOCKER_BUILDKIT=1 docker build . \
--target vllm-openai \ --target vllm-openai \
...@@ -97,26 +97,28 @@ of PyTorch Nightly and should be considered **experimental**. Using the flag `-- ...@@ -97,26 +97,28 @@ of PyTorch Nightly and should be considered **experimental**. Using the flag `--
flags to speed up build process. However, ensure your `max_jobs` is substantially larger than `nvcc_threads` to get the most benefits. flags to speed up build process. However, ensure your `max_jobs` is substantially larger than `nvcc_threads` to get the most benefits.
Keep an eye on memory usage with parallel jobs as it can be substantial (see example below). Keep an eye on memory usage with parallel jobs as it can be substantial (see example below).
```console ??? Command
# Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB)
python3 use_existing_torch.py ```bash
DOCKER_BUILDKIT=1 docker build . \ # Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB)
--file docker/Dockerfile \ python3 use_existing_torch.py
--target vllm-openai \ DOCKER_BUILDKIT=1 docker build . \
--platform "linux/arm64" \ --file docker/Dockerfile \
-t vllm/vllm-gh200-openai:latest \ --target vllm-openai \
--build-arg max_jobs=66 \ --platform "linux/arm64" \
--build-arg nvcc_threads=2 \ -t vllm/vllm-gh200-openai:latest \
--build-arg torch_cuda_arch_list="9.0 10.0+PTX" \ --build-arg max_jobs=66 \
--build-arg vllm_fa_cmake_gpu_arches="90-real" --build-arg nvcc_threads=2 \
``` --build-arg torch_cuda_arch_list="9.0 10.0+PTX" \
--build-arg vllm_fa_cmake_gpu_arches="90-real"
```
!!! note !!! note
If you are building the `linux/arm64` image on a non-ARM host (e.g., an x86_64 machine), you need to ensure your system is set up for cross-compilation using QEMU. This allows your host machine to emulate ARM64 execution. If you are building the `linux/arm64` image on a non-ARM host (e.g., an x86_64 machine), you need to ensure your system is set up for cross-compilation using QEMU. This allows your host machine to emulate ARM64 execution.
Run the following command on your host machine to register QEMU user static handlers: Run the following command on your host machine to register QEMU user static handlers:
```console ```bash
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
``` ```
...@@ -126,7 +128,7 @@ DOCKER_BUILDKIT=1 docker build . \ ...@@ -126,7 +128,7 @@ DOCKER_BUILDKIT=1 docker build . \
To run vLLM with the custom-built Docker image: To run vLLM with the custom-built Docker image:
```console ```bash
docker run --runtime nvidia --gpus all \ docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \ -v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \ -p 8000:8000 \
......
...@@ -15,7 +15,7 @@ It allows you to deploy a large language model (LLM) server with vLLM as the bac ...@@ -15,7 +15,7 @@ It allows you to deploy a large language model (LLM) server with vLLM as the bac
- Start the vLLM server with the supported chat completion model, e.g. - Start the vLLM server with the supported chat completion model, e.g.
```console ```bash
vllm serve Qwen/Qwen1.5-32B-Chat-AWQ --max-model-len 4096 vllm serve Qwen/Qwen1.5-32B-Chat-AWQ --max-model-len 4096
``` ```
......
...@@ -11,7 +11,7 @@ title: AutoGen ...@@ -11,7 +11,7 @@ title: AutoGen
- Setup [AutoGen](https://microsoft.github.io/autogen/0.2/docs/installation/) environment - Setup [AutoGen](https://microsoft.github.io/autogen/0.2/docs/installation/) environment
```console ```bash
pip install vllm pip install vllm
# Install AgentChat and OpenAI client from Extensions # Install AgentChat and OpenAI client from Extensions
...@@ -23,58 +23,60 @@ pip install -U "autogen-agentchat" "autogen-ext[openai]" ...@@ -23,58 +23,60 @@ pip install -U "autogen-agentchat" "autogen-ext[openai]"
- Start the vLLM server with the supported chat completion model, e.g. - Start the vLLM server with the supported chat completion model, e.g.
```console ```bash
python -m vllm.entrypoints.openai.api_server \ python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-7B-Instruct-v0.2 --model mistralai/Mistral-7B-Instruct-v0.2
``` ```
- Call it with AutoGen: - Call it with AutoGen:
```python ??? Code
import asyncio
from autogen_core.models import UserMessage ```python
from autogen_ext.models.openai import OpenAIChatCompletionClient import asyncio
from autogen_core.models import ModelFamily from autogen_core.models import UserMessage
from autogen_ext.models.openai import OpenAIChatCompletionClient
from autogen_core.models import ModelFamily
async def main() -> None:
# Create a model client
model_client = OpenAIChatCompletionClient( async def main() -> None:
model="mistralai/Mistral-7B-Instruct-v0.2", # Create a model client
base_url="http://{your-vllm-host-ip}:{your-vllm-host-port}/v1", model_client = OpenAIChatCompletionClient(
api_key="EMPTY", model="mistralai/Mistral-7B-Instruct-v0.2",
model_info={ base_url="http://{your-vllm-host-ip}:{your-vllm-host-port}/v1",
"vision": False, api_key="EMPTY",
"function_calling": False, model_info={
"json_output": False, "vision": False,
"family": ModelFamily.MISTRAL, "function_calling": False,
"structured_output": True, "json_output": False,
}, "family": ModelFamily.MISTRAL,
) "structured_output": True,
},
messages = [UserMessage(content="Write a very short story about a dragon.", source="user")] )
# Create a stream. messages = [UserMessage(content="Write a very short story about a dragon.", source="user")]
stream = model_client.create_stream(messages=messages)
# Create a stream.
# Iterate over the stream and print the responses. stream = model_client.create_stream(messages=messages)
print("Streamed responses:")
async for response in stream: # Iterate over the stream and print the responses.
if isinstance(response, str): print("Streamed responses:")
# A partial response is a string. async for response in stream:
print(response, flush=True, end="") if isinstance(response, str):
else: # A partial response is a string.
# The last response is a CreateResult object with the complete message. print(response, flush=True, end="")
print("\n\n------------\n") else:
print("The complete response:", flush=True) # The last response is a CreateResult object with the complete message.
print(response.content, flush=True) print("\n\n------------\n")
print("The complete response:", flush=True)
# Close the client when done. print(response.content, flush=True)
await model_client.close()
# Close the client when done.
await model_client.close()
asyncio.run(main())
```
asyncio.run(main())
```
For details, see the tutorial: For details, see the tutorial:
......
...@@ -11,14 +11,14 @@ vLLM can be run on a cloud based GPU machine with [Cerebrium](https://www.cerebr ...@@ -11,14 +11,14 @@ vLLM can be run on a cloud based GPU machine with [Cerebrium](https://www.cerebr
To install the Cerebrium client, run: To install the Cerebrium client, run:
```console ```bash
pip install cerebrium pip install cerebrium
cerebrium login cerebrium login
``` ```
Next, to create your Cerebrium project, run: Next, to create your Cerebrium project, run:
```console ```bash
cerebrium init vllm-project cerebrium init vllm-project
``` ```
...@@ -34,75 +34,81 @@ vllm = "latest" ...@@ -34,75 +34,81 @@ vllm = "latest"
Next, add code to handle inference for the LLM of your choice (`mistralai/Mistral-7B-Instruct-v0.1` in this example). Add the following to your `main.py`: Next, add code to handle inference for the LLM of your choice (`mistralai/Mistral-7B-Instruct-v0.1` in this example). Add the following to your `main.py`:
```python ??? Code
from vllm import LLM, SamplingParams
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1") ```python
from vllm import LLM, SamplingParams
def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95): llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")
sampling_params = SamplingParams(temperature=temperature, top_p=top_p) def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95):
outputs = llm.generate(prompts, sampling_params)
# Print the outputs. sampling_params = SamplingParams(temperature=temperature, top_p=top_p)
results = [] outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
results.append({"prompt": prompt, "generated_text": generated_text})
return {"results": results} # Print the outputs.
``` results = []
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
results.append({"prompt": prompt, "generated_text": generated_text})
return {"results": results}
```
Then, run the following code to deploy it to the cloud: Then, run the following code to deploy it to the cloud:
```console ```bash
cerebrium deploy cerebrium deploy
``` ```
If successful, you should receive a curl command that you can use to run inference. Remember to end the URL with the name of the function you are calling (in our case `/run`): If successful, you should receive a curl command that you can use to run inference. Remember to end the URL with the name of the function you are calling (in our case `/run`):
```python ??? Command
curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
-H 'Content-Type: application/json' \ ```python
-H 'Authorization: <JWT TOKEN>' \ curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
--data '{ -H 'Content-Type: application/json' \
"prompts": [ -H 'Authorization: <JWT TOKEN>' \
"Hello, my name is", --data '{
"The president of the United States is", "prompts": [
"The capital of France is", "Hello, my name is",
"The future of AI is" "The president of the United States is",
] "The capital of France is",
}' "The future of AI is"
``` ]
}'
```
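For callers that prefer Python over curl, the same request can be sketched with the standard library. This is a minimal sketch, not Cerebrium's client API: `build_run_request` is a hypothetical helper name, and the endpoint and token are the placeholders from the curl example above, which must be replaced with the values printed by your own deployment.

```python
import json
import urllib.request

# Placeholders copied from the curl example; replace with your own values.
ENDPOINT = "https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run"
JWT_TOKEN = "<JWT TOKEN>"


def build_run_request(prompts: list[str]) -> urllib.request.Request:
    """Build (but do not send) the POST request for the `run` function."""
    body = json.dumps({"prompts": prompts}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": JWT_TOKEN,
        },
        method="POST",
    )


request = build_run_request(["Hello, my name is", "The capital of France is"])
# To actually send it once the placeholders are filled in:
#   with urllib.request.urlopen(request) as response:
#       print(response.read().decode("utf-8"))
```

Building the request separately from sending it makes it easy to inspect the URL, headers, and body before any network call is made.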
You should get a response like: You should get a response like:
```python ??? Response
{
"run_id": "52911756-3066-9ae8-bcc9-d9129d1bd262", ```python
"result": { {
"result": [ "run_id": "52911756-3066-9ae8-bcc9-d9129d1bd262",
{ "result": {
"prompt": "Hello, my name is", "result": [
"generated_text": " Sarah, and I'm a teacher. I teach elementary school students. One of" {
}, "prompt": "Hello, my name is",
{ "generated_text": " Sarah, and I'm a teacher. I teach elementary school students. One of"
"prompt": "The president of the United States is", },
"generated_text": " elected every four years. This is a democratic system.\n\n5. What" {
}, "prompt": "The president of the United States is",
{ "generated_text": " elected every four years. This is a democratic system.\n\n5. What"
"prompt": "The capital of France is", },
"generated_text": " Paris.\n" {
}, "prompt": "The capital of France is",
{ "generated_text": " Paris.\n"
"prompt": "The future of AI is", },
"generated_text": " bright, but it's important to approach it with a balanced and nuanced perspective." {
} "prompt": "The future of AI is",
] "generated_text": " bright, but it's important to approach it with a balanced and nuanced perspective."
}, }
"run_time_ms": 152.53663063049316 ]
} },
``` "run_time_ms": 152.53663063049316
}
```
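To pull just the generated text out of such a response, something like the following may help. It is a sketch that assumes the nesting shown in the sample response (`result` inside `result`). Note that the `main.py` handler above returns its list under `results`, so the helper, whose name `extract_texts` is hypothetical, accepts either key. Check it against what your deployment actually returns.

```python
import json

# Trimmed copy of the sample response shown above.
raw_response = """
{
  "run_id": "52911756-3066-9ae8-bcc9-d9129d1bd262",
  "result": {
    "result": [
      {"prompt": "The capital of France is", "generated_text": " Paris.\\n"}
    ]
  },
  "run_time_ms": 152.53663063049316
}
"""


def extract_texts(payload: dict) -> list[str]:
    """Return the generated texts, stripped of surrounding whitespace."""
    inner = payload.get("result", {})
    # Accept either key: the sample response uses "result",
    # while the handler in main.py returns "results".
    items = inner.get("result") or inner.get("results") or []
    return [item["generated_text"].strip() for item in items]


texts = extract_texts(json.loads(raw_response))
print(texts)
```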
You now have an autoscaling endpoint where you only pay for the compute you use! You now have an autoscaling endpoint where you only pay for the compute you use!
...@@ -15,7 +15,7 @@ It allows you to deploy a large language model (LLM) server with vLLM as the bac ...@@ -15,7 +15,7 @@ It allows you to deploy a large language model (LLM) server with vLLM as the bac
- Start the vLLM server with the supported chat completion model, e.g. - Start the vLLM server with the supported chat completion model, e.g.
```console ```bash
vllm serve qwen/Qwen1.5-0.5B-Chat vllm serve qwen/Qwen1.5-0.5B-Chat
``` ```
......
...@@ -18,13 +18,13 @@ This guide walks you through deploying Dify using a vLLM backend. ...@@ -18,13 +18,13 @@ This guide walks you through deploying Dify using a vLLM backend.
- Start the vLLM server with the supported chat completion model, e.g. - Start the vLLM server with the supported chat completion model, e.g.
```console ```bash
vllm serve Qwen/Qwen1.5-7B-Chat vllm serve Qwen/Qwen1.5-7B-Chat
``` ```
- Start the Dify server with docker compose ([details](https://github.com/langgenius/dify?tab=readme-ov-file#quick-start)): - Start the Dify server with docker compose ([details](https://github.com/langgenius/dify?tab=readme-ov-file#quick-start)):
```console ```bash
git clone https://github.com/langgenius/dify.git git clone https://github.com/langgenius/dify.git
cd dify cd dify
cd docker cd docker
......
...@@ -11,14 +11,14 @@ vLLM can be run on a cloud based GPU machine with [dstack](https://dstack.ai/), ...@@ -11,14 +11,14 @@ vLLM can be run on a cloud based GPU machine with [dstack](https://dstack.ai/),
To install the dstack client, run: To install the dstack client, run:
```console ```bash
pip install "dstack[all]" pip install "dstack[all]"
dstack server dstack server
``` ```
Next, to configure your dstack project, run: Next, to configure your dstack project, run:
```console ```bash
mkdir -p vllm-dstack mkdir -p vllm-dstack
cd vllm-dstack cd vllm-dstack
dstack init dstack init
...@@ -26,75 +26,81 @@ dstack init ...@@ -26,75 +26,81 @@ dstack init
Next, to provision a VM instance with LLM of your choice (`NousResearch/Llama-2-7b-chat-hf` for this example), create the following `serve.dstack.yml` file for the dstack `Service`: Next, to provision a VM instance with LLM of your choice (`NousResearch/Llama-2-7b-chat-hf` for this example), create the following `serve.dstack.yml` file for the dstack `Service`:
```yaml ??? Config
type: service
```yaml
python: "3.11" type: service
env:
- MODEL=NousResearch/Llama-2-7b-chat-hf python: "3.11"
port: 8000 env:
resources: - MODEL=NousResearch/Llama-2-7b-chat-hf
gpu: 24GB port: 8000
commands: resources:
- pip install vllm gpu: 24GB
- vllm serve $MODEL --port 8000 commands:
model: - pip install vllm
format: openai - vllm serve $MODEL --port 8000
type: chat model:
name: NousResearch/Llama-2-7b-chat-hf format: openai
``` type: chat
name: NousResearch/Llama-2-7b-chat-hf
```
Then, run the following CLI for provisioning: Then, run the following CLI for provisioning:
```console ??? Command
$ dstack run . -f serve.dstack.yml
```console
⠸ Getting run plan... $ dstack run . -f serve.dstack.yml
Configuration serve.dstack.yml
Project deep-diver-main ⠸ Getting run plan...
User deep-diver Configuration serve.dstack.yml
Min resources 2..xCPU, 8GB.., 1xGPU (24GB) Project deep-diver-main
Max price - User deep-diver
Max duration - Min resources 2..xCPU, 8GB.., 1xGPU (24GB)
Spot policy auto Max price -
Retry policy no Max duration -
Spot policy auto
# BACKEND REGION INSTANCE RESOURCES SPOT PRICE Retry policy no
1 gcp us-central1 g2-standard-4 4xCPU, 16GB, 1xL4 (24GB), 100GB (disk) yes $0.223804
2 gcp us-east1 g2-standard-4 4xCPU, 16GB, 1xL4 (24GB), 100GB (disk) yes $0.223804 # BACKEND REGION INSTANCE RESOURCES SPOT PRICE
3 gcp us-west1 g2-standard-4 4xCPU, 16GB, 1xL4 (24GB), 100GB (disk) yes $0.223804 1 gcp us-central1 g2-standard-4 4xCPU, 16GB, 1xL4 (24GB), 100GB (disk) yes $0.223804
... 2 gcp us-east1 g2-standard-4 4xCPU, 16GB, 1xL4 (24GB), 100GB (disk) yes $0.223804
Shown 3 of 193 offers, $5.876 max 3 gcp us-west1 g2-standard-4 4xCPU, 16GB, 1xL4 (24GB), 100GB (disk) yes $0.223804
...
Continue? [y/n]: y Shown 3 of 193 offers, $5.876 max
⠙ Submitting run...
⠏ Launching spicy-treefrog-1 (pulling) Continue? [y/n]: y
spicy-treefrog-1 provisioning completed (running) ⠙ Submitting run...
Service is published at ... ⠏ Launching spicy-treefrog-1 (pulling)
``` spicy-treefrog-1 provisioning completed (running)
Service is published at ...
```
After the provisioning, you can interact with the model by using the OpenAI SDK: After the provisioning, you can interact with the model by using the OpenAI SDK:
```python ??? Code
from openai import OpenAI
```python
client = OpenAI( from openai import OpenAI
base_url="https://gateway.<gateway domain>",
api_key="<YOUR-DSTACK-SERVER-ACCESS-TOKEN>" client = OpenAI(
) base_url="https://gateway.<gateway domain>",
api_key="<YOUR-DSTACK-SERVER-ACCESS-TOKEN>"
completion = client.chat.completions.create( )
model="NousResearch/Llama-2-7b-chat-hf",
messages=[ completion = client.chat.completions.create(
{ model="NousResearch/Llama-2-7b-chat-hf",
"role": "user", messages=[
"content": "Compose a poem that explains the concept of recursion in programming.", {
} "role": "user",
] "content": "Compose a poem that explains the concept of recursion in programming.",
) }
]
print(completion.choices[0].message.content) )
```
print(completion.choices[0].message.content)
```
!!! note !!! note
dstack automatically handles authentication on the gateway using dstack's tokens. If you don't want to configure a gateway, you can provision a dstack `Task` instead of a `Service`. The `Task` is for development purposes only. For more hands-on material on serving vLLM with dstack, check out [this repository](https://github.com/dstackai/dstack-examples/tree/main/deployment/vllm). dstack automatically handles authentication on the gateway using dstack's tokens. If you don't want to configure a gateway, you can provision a dstack `Task` instead of a `Service`. The `Task` is for development purposes only. For more hands-on material on serving vLLM with dstack, check out [this repository](https://github.com/dstackai/dstack-examples/tree/main/deployment/vllm).
...@@ -13,7 +13,7 @@ It allows you to deploy a large language model (LLM) server with vLLM as the bac ...@@ -13,7 +13,7 @@ It allows you to deploy a large language model (LLM) server with vLLM as the bac
- Set up the vLLM and Haystack environment - Set up the vLLM and Haystack environment
```console ```bash
pip install vllm haystack-ai pip install vllm haystack-ai
``` ```
...@@ -21,35 +21,35 @@ pip install vllm haystack-ai ...@@ -21,35 +21,35 @@ pip install vllm haystack-ai
- Start the vLLM server with the supported chat completion model, e.g. - Start the vLLM server with the supported chat completion model, e.g.
```console ```bash
vllm serve mistralai/Mistral-7B-Instruct-v0.1 vllm serve mistralai/Mistral-7B-Instruct-v0.1
``` ```
- Use the `OpenAIGenerator` and `OpenAIChatGenerator` components in Haystack to query the vLLM server. - Use the `OpenAIGenerator` and `OpenAIChatGenerator` components in Haystack to query the vLLM server.
```python ??? Code
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage ```python
from haystack.utils import Secret from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage
generator = OpenAIChatGenerator( from haystack.utils import Secret
# for compatibility with the OpenAI API, a placeholder api_key is needed
api_key=Secret.from_token("VLLM-PLACEHOLDER-API-KEY"), generator = OpenAIChatGenerator(
model="mistralai/Mistral-7B-Instruct-v0.1", # for compatibility with the OpenAI API, a placeholder api_key is needed
api_base_url="http://{your-vLLM-host-ip}:{your-vLLM-host-port}/v1", api_key=Secret.from_token("VLLM-PLACEHOLDER-API-KEY"),
generation_kwargs = {"max_tokens": 512} model="mistralai/Mistral-7B-Instruct-v0.1",
) api_base_url="http://{your-vLLM-host-ip}:{your-vLLM-host-port}/v1",
generation_kwargs = {"max_tokens": 512}
response = generator.run( )
messages=[ChatMessage.from_user("Hi. Can you help me plan my next trip to Italy?")]
) response = generator.run(
messages=[ChatMessage.from_user("Hi. Can you help me plan my next trip to Italy?")]
print("-"*30) )
print(response)
print("-"*30) print("-"*30)
``` print(response)
print("-"*30)
Output e.g.: ```
```console ```console
------------------------------ ------------------------------
......
...@@ -5,9 +5,9 @@ title: Helm ...@@ -5,9 +5,9 @@ title: Helm
A Helm chart to deploy vLLM for Kubernetes A Helm chart to deploy vLLM for Kubernetes
Helm is a package manager for Kubernetes. It will help you to deploy vLLM on k8s and automate the deployment of vLLM Kubernetes applications. With Helm, you can deploy the same framework architecture with different configurations to multiple namespaces by overriding variable values. Helm is a package manager for Kubernetes. It helps automate the deployment of vLLM applications on Kubernetes. With Helm, you can deploy the same framework architecture with different configurations to multiple namespaces by overriding variable values.
This guide will walk you through the process of deploying vLLM with Helm, including the necessary prerequisites, steps for helm installation and documentation on architecture and values file. This guide will walk you through the process of deploying vLLM with Helm, including the necessary prerequisites, steps for Helm installation and documentation on architecture and values file.
## Prerequisites ## Prerequisites
...@@ -16,21 +16,27 @@ Before you begin, ensure that you have the following: ...@@ -16,21 +16,27 @@ Before you begin, ensure that you have the following:
- A running Kubernetes cluster - A running Kubernetes cluster
- NVIDIA Kubernetes Device Plugin (`k8s-device-plugin`): This can be found at [https://github.com/NVIDIA/k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin) - NVIDIA Kubernetes Device Plugin (`k8s-device-plugin`): This can be found at [https://github.com/NVIDIA/k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin)
- Available GPU resources in your cluster - Available GPU resources in your cluster
- S3 with the model which will be deployed - An S3 with the model which will be deployed
## Installing the chart ## Installing the chart
To install the chart with the release name `test-vllm`: To install the chart with the release name `test-vllm`:
```console ```bash
helm upgrade --install --create-namespace --namespace=ns-vllm test-vllm . -f values.yaml --set secrets.s3endpoint=$ACCESS_POINT --set secrets.s3bucketname=$BUCKET --set secrets.s3accesskeyid=$ACCESS_KEY --set secrets.s3accesskey=$SECRET_KEY helm upgrade --install --create-namespace \
--namespace=ns-vllm test-vllm . \
-f values.yaml \
--set secrets.s3endpoint=$ACCESS_POINT \
--set secrets.s3bucketname=$BUCKET \
--set secrets.s3accesskeyid=$ACCESS_KEY \
--set secrets.s3accesskey=$SECRET_KEY
``` ```
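One pitfall with the command above is that an unset shell variable expands to an empty string, so `--set secrets.s3endpoint=` would be passed without complaint. The sketch below assembles the same arguments in Python and fails early if a variable is missing. The helper name `build_helm_command` is hypothetical, and the example values are made up. The flags, chart keys, and variable names are the ones from the command above.

```python
import shlex
from collections.abc import Mapping

# Chart keys from the command above, mapped to the environment
# variables that are expected to hold their values.
SECRET_ENV_VARS = {
    "secrets.s3endpoint": "ACCESS_POINT",
    "secrets.s3bucketname": "BUCKET",
    "secrets.s3accesskeyid": "ACCESS_KEY",
    "secrets.s3accesskey": "SECRET_KEY",
}


def build_helm_command(env: Mapping[str, str]) -> list[str]:
    """Return the helm argument list, raising if any secret is unset."""
    missing = [var for var in SECRET_ENV_VARS.values() if not env.get(var)]
    if missing:
        raise ValueError(f"unset environment variables: {', '.join(missing)}")

    command = [
        "helm", "upgrade", "--install", "--create-namespace",
        "--namespace=ns-vllm", "test-vllm", ".",
        "-f", "values.yaml",
    ]
    for key, var in SECRET_ENV_VARS.items():
        command += ["--set", f"{key}={env[var]}"]
    return command


# Made-up example values, for illustration only.
example_env = {
    "ACCESS_POINT": "https://s3.example.com",
    "BUCKET": "models",
    "ACCESS_KEY": "example-id",
    "SECRET_KEY": "example-key",
}
print(shlex.join(build_helm_command(example_env)))
```

In real use you would pass `os.environ` instead of `example_env` and hand the resulting list to `subprocess.run`.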
## Uninstalling the Chart ## Uninstalling the chart
To uninstall the `test-vllm` deployment: To uninstall the `test-vllm` deployment:
```console ```bash
helm uninstall test-vllm --namespace=ns-vllm helm uninstall test-vllm --namespace=ns-vllm
``` ```
...@@ -39,57 +45,59 @@ chart **including persistent volumes** and deletes the release. ...@@ -39,57 +45,59 @@ chart **including persistent volumes** and deletes the release.
## Architecture ## Architecture
![](../../assets/deployment/architecture_helm_deployment.png) ![helm deployment architecture](../../assets/deployment/architecture_helm_deployment.png)
## Values ## Values
| Key | Type | Default | Description | The following table describes configurable parameters of the chart in `values.yaml`:
|--------------------------------------------|---------|----------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------|
| autoscaling | object | {"enabled":false,"maxReplicas":100,"minReplicas":1,"targetCPUUtilizationPercentage":80} | Autoscaling configuration | | Key | Type | Default | Description |
| autoscaling.enabled | bool | false | Enable autoscaling | |-----|------|---------|-------------|
| autoscaling.maxReplicas | int | 100 | Maximum replicas | | autoscaling | object | {"enabled":false,"maxReplicas":100,"minReplicas":1,"targetCPUUtilizationPercentage":80} | Autoscaling configuration |
| autoscaling.minReplicas | int | 1 | Minimum replicas | | autoscaling.enabled | bool | false | Enable autoscaling |
| autoscaling.targetCPUUtilizationPercentage | int | 80 | Target CPU utilization for autoscaling | | autoscaling.maxReplicas | int | 100 | Maximum replicas |
| configs | object | {} | Configmap | | autoscaling.minReplicas | int | 1 | Minimum replicas |
| containerPort | int | 8000 | Container port | | autoscaling.targetCPUUtilizationPercentage | int | 80 | Target CPU utilization for autoscaling |
| customObjects | list | [] | Custom Objects configuration | | configs | object | {} | Configmap |
| deploymentStrategy | object | {} | Deployment strategy configuration | | containerPort | int | 8000 | Container port |
| externalConfigs | list | [] | External configuration | | customObjects | list | [] | Custom Objects configuration |
| extraContainers | list | [] | Additional containers configuration | | deploymentStrategy | object | {} | Deployment strategy configuration |
| extraInit | object | {"pvcStorage":"1Gi","s3modelpath":"relative_s3_model_path/opt-125m", "awsEc2MetadataDisabled": true} | Additional configuration for the init container | | externalConfigs | list | [] | External configuration |
| extraInit.pvcStorage | string | "50Gi" | Storage size of the s3 | | extraContainers | list | [] | Additional containers configuration |
| extraInit.s3modelpath | string | "relative_s3_model_path/opt-125m" | Path of the model on the s3 which hosts model weights and config files | | extraInit | object | {"pvcStorage":"1Gi","s3modelpath":"relative_s3_model_path/opt-125m", "awsEc2MetadataDisabled": true} | Additional configuration for the init container |
| extraInit.awsEc2MetadataDisabled | boolean | true | Disables the use of the Amazon EC2 instance metadata service | | extraInit.pvcStorage | string | "1Gi" | Storage size of the s3 |
| extraPorts | list | [] | Additional ports configuration | | extraInit.s3modelpath | string | "relative_s3_model_path/opt-125m" | Path of the model on the s3 which hosts model weights and config files |
| gpuModels | list | ["TYPE_GPU_USED"] | Type of gpu used | | extraInit.awsEc2MetadataDisabled | boolean | true | Disables the use of the Amazon EC2 instance metadata service |
| image | object | {"command":["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"],"repository":"vllm/vllm-openai","tag":"latest"} | Image configuration | | extraPorts | list | [] | Additional ports configuration |
| image.command | list | ["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"] | Container launch command | | gpuModels | list | ["TYPE_GPU_USED"] | Type of gpu used |
| image.repository | string | "vllm/vllm-openai" | Image repository | | image | object | {"command":["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"],"repository":"vllm/vllm-openai","tag":"latest"} | Image configuration |
| image.tag | string | "latest" | Image tag | | image.command | list | ["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"] | Container launch command |
| livenessProbe | object | {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":15,"periodSeconds":10} | Liveness probe configuration | | image.repository | string | "vllm/vllm-openai" | Image repository |
| livenessProbe.failureThreshold | int | 3 | Number of times after which if a probe fails in a row, Kubernetes considers that the overall check has failed: the container is not alive | | image.tag | string | "latest" | Image tag |
| livenessProbe.httpGet | object | {"path":"/health","port":8000} | Configuration of the Kubelet http request on the server | | livenessProbe | object | {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":15,"periodSeconds":10} | Liveness probe configuration |
| livenessProbe.httpGet.path | string | "/health" | Path to access on the HTTP server | | livenessProbe.failureThreshold | int | 3 | Number of times after which if a probe fails in a row, Kubernetes considers that the overall check has failed: the container is not alive |
| livenessProbe.httpGet.port | int | 8000 | Name or number of the port to access on the container, on which the server is listening | | livenessProbe.httpGet | object | {"path":"/health","port":8000} | Configuration of the kubelet http request on the server |
| livenessProbe.initialDelaySeconds | int | 15 | Number of seconds after the container has started before liveness probe is initiated | | livenessProbe.httpGet.path | string | "/health" | Path to access on the HTTP server |
| livenessProbe.periodSeconds | int | 10 | How often (in seconds) to perform the liveness probe | | livenessProbe.httpGet.port | int | 8000 | Name or number of the port to access on the container, on which the server is listening |
| maxUnavailablePodDisruptionBudget | string | "" | Disruption Budget Configuration | | livenessProbe.initialDelaySeconds | int | 15 | Number of seconds after the container has started before liveness probe is initiated |
| readinessProbe | object | {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":5,"periodSeconds":5} | Readiness probe configuration | | livenessProbe.periodSeconds | int | 10 | How often (in seconds) to perform the liveness probe |
| readinessProbe.failureThreshold | int | 3 | Number of times after which if a probe fails in a row, Kubernetes considers that the overall check has failed: the container is not ready | | maxUnavailablePodDisruptionBudget | string | "" | Disruption Budget Configuration |
| readinessProbe.httpGet | object | {"path":"/health","port":8000} | Configuration of the Kubelet http request on the server | | readinessProbe | object | {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":5,"periodSeconds":5} | Readiness probe configuration |
| readinessProbe.httpGet.path | string | "/health" | Path to access on the HTTP server | | readinessProbe.failureThreshold | int | 3 | Number of times after which if a probe fails in a row, Kubernetes considers that the overall check has failed: the container is not ready |
| readinessProbe.httpGet.port | int | 8000 | Name or number of the port to access on the container, on which the server is listening | | readinessProbe.httpGet | object | {"path":"/health","port":8000} | Configuration of the kubelet http request on the server |
| readinessProbe.initialDelaySeconds | int | 5 | Number of seconds after the container has started before readiness probe is initiated | | readinessProbe.httpGet.path | string | "/health" | Path to access on the HTTP server |
| readinessProbe.periodSeconds | int | 5 | How often (in seconds) to perform the readiness probe | | readinessProbe.httpGet.port | int | 8000 | Name or number of the port to access on the container, on which the server is listening |
| replicaCount | int | 1 | Number of replicas | | readinessProbe.initialDelaySeconds | int | 5 | Number of seconds after the container has started before readiness probe is initiated |
| resources | object | {"limits":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1},"requests":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1}} | Resource configuration | | readinessProbe.periodSeconds | int | 5 | How often (in seconds) to perform the readiness probe |
| resources.limits."nvidia.com/gpu" | int | 1 | Number of gpus used | | replicaCount | int | 1 | Number of replicas |
| resources.limits.cpu | int | 4 | Number of CPUs | | resources | object | {"limits":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1},"requests":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1}} | Resource configuration |
| resources.limits.memory | string | "16Gi" | CPU memory configuration | | resources.limits."nvidia.com/gpu" | int | 1 | Number of GPUs used |
| resources.requests."nvidia.com/gpu" | int | 1 | Number of gpus used | | resources.limits.cpu | int | 4 | Number of CPUs |
| resources.requests.cpu | int | 4 | Number of CPUs | | resources.limits.memory | string | "16Gi" | CPU memory configuration |
| resources.requests.memory | string | "16Gi" | CPU memory configuration | | resources.requests."nvidia.com/gpu" | int | 1 | Number of GPUs used |
| secrets | object | {} | Secrets configuration | | resources.requests.cpu | int | 4 | Number of CPUs |
| serviceName | string | Service name | | | resources.requests.memory | string | "16Gi" | CPU memory configuration |
| servicePort | int | 80 | Service port | | secrets | object | {} | Secrets configuration |
| labels.environment | string | test | Environment name | | serviceName | string | "" | Service name |
| servicePort | int | 80 | Service port |
| labels.environment | string | test | Environment name |