Merge tag 'v0.9.2' into v0.9.2-ori

99324e25 · zhuwenwen · cc7f22a8 · a5dd03c1 · 99324e25 · 99324e25
Commit 99324e25 authored Jul 12, 2025 by zhuwenwen
20 changed files
--- a/docs/configuration/env_vars.md
+++ b/docs/configuration/env_vars.md
@@ -7,6 +7,8 @@ vLLM uses the following environment variables to configure the system:

    All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service as `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables).

-```python
--8<-- "vllm/envs.py:env-vars-definition"
-```
+??? Code
+
+    ```python
+    --8<-- "vllm/envs.py:env-vars-definition"
+    ```
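To make the Kubernetes warning above concrete, here is a minimal sketch of the variable names Kubernetes injects into pods for a Service. The function names, the port `8000`, and the protocol are illustrative assumptions and not part of vLLM; only the naming scheme (upper-cased Service name, dashes turned into underscores) comes from the Kubernetes documentation linked above.

```python
def k8s_service_env_names(service_name: str, port: int = 8000, proto: str = "tcp") -> list[str]:
    """Names of the variables Kubernetes injects for one Service port."""
    # Kubernetes upper-cases the Service name and converts dashes to underscores.
    prefix = service_name.upper().replace("-", "_")
    port_prefix = f"{prefix}_PORT_{port}_{proto.upper()}"
    return [
        f"{prefix}_SERVICE_HOST",
        f"{prefix}_SERVICE_PORT",
        f"{prefix}_PORT",
        port_prefix,
        f"{port_prefix}_PROTO",
        f"{port_prefix}_PORT",
        f"{port_prefix}_ADDR",
    ]


def names_in_vllm_namespace(service_name: str) -> list[str]:
    """Injected names that fall under the `VLLM_` prefix reserved by vLLM."""
    return [n for n in k8s_service_env_names(service_name) if n.startswith("VLLM_")]


# A Service called "vllm" yields names such as VLLM_PORT and VLLM_SERVICE_HOST.
print(names_in_vllm_namespace("vllm"))
# A differently named Service stays out of the namespace.
print(names_in_vllm_namespace("llm-server"))
```

Any Service name that starts with `vllm` lands in the same namespace, so a name with a different leading word is the safer choice.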
--- a/docs/contributing/README.md
+++ b/docs/contributing/README.md
@@ -29,6 +29,8 @@ See <gh-file:LICENSE>.
 Depending on the kind of development you'd like to do (e.g. Python, CUDA), you can choose to build vLLM with or without compilation.
 Check out the [building from source][build-from-source] documentation for details.

+For an optimized workflow when iterating on C++/CUDA kernels, see the [Incremental Compilation Workflow](./incremental_build.md) for recommendations.
+
 ### Building the docs with MkDocs

 #### Introduction to MkDocs
@@ -93,25 +95,27 @@ For additional features and advanced configurations, refer to the official [MkDo

 ## Testing

-```bash
-pip install -r requirements/dev.txt
+??? note "Commands"

-# Linting, formatting and static type checking
-pre-commit install --hook-type pre-commit --hook-type commit-msg
+    ```bash
+    pip install -r requirements/dev.txt

-# You can manually run pre-commit with
-pre-commit run --all-files
+    # Linting, formatting and static type checking
+    pre-commit install --hook-type pre-commit --hook-type commit-msg

-# To manually run something from CI that does not run
-# locally by default, you can run:
-pre-commit run mypy-3.9 --hook-stage manual --all-files
+    # You can manually run pre-commit with
+    pre-commit run --all-files

-# Unit tests
-pytest tests/
+    # To manually run something from CI that does not run
+    # locally by default, you can run:
+    pre-commit run mypy-3.9 --hook-stage manual --all-files

-# Run tests for a single test file with detailed output
-pytest -s -v tests/test_logger.py
-```
+    # Unit tests
+    pytest tests/
+
+    # Run tests for a single test file with detailed output
+    pytest -s -v tests/test_logger.py
+    ```

 !!! tip
    Since the <gh-file:docker/Dockerfile> ships with Python 3.12, all tests in CI (except `mypy`) are run with Python 3.12.
@@ -130,7 +134,7 @@ pytest -s -v tests/test_logger.py

 If you encounter a bug or have a feature request, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.

-!!! warning
+!!! important
    If you discover a security vulnerability, please follow the instructions [here](gh-file:SECURITY.md#reporting-a-vulnerability).

 ## Pull Requests & Code Reviews
@@ -147,6 +151,14 @@ the terms of the DCO.

 Using `-s` with `git commit` will automatically add this header.

+!!! tip
+    You can enable automatic sign-off via your IDE:
+  
+    - **PyCharm**: Click on the `Show Commit Options` icon to the right of the `Commit and Push...` button in the `Commit` window.
+      It will bring up a `git` window where you can modify the `Author` and enable `Sign-off commit`.
+    - **VSCode**: Open the [Settings editor](https://code.visualstudio.com/docs/configure/settings)
+      and enable the `Git: Always Sign Off` (`git.alwaysSignOff`) field.
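Whichever way sign-off is enabled, the result is a `Signed-off-by:` trailer in the commit message. The following is a minimal sketch of a local check for that trailer; `has_signoff` and the regular expression are hypothetical helpers for illustration, not the check that the project's CI actually runs.

```python
import re

# Matches a trailer line such as "Signed-off-by: Jane Doe <jane@example.com>".
SIGNOFF_RE = re.compile(r"^Signed-off-by: .+ <[^<>\s]+@[^<>\s]+>$", re.MULTILINE)


def has_signoff(commit_message: str) -> bool:
    """Return True if the message contains a well-formed sign-off trailer."""
    return SIGNOFF_RE.search(commit_message) is not None


message = "Fix typo in docs\n\nSigned-off-by: Jane Doe <jane@example.com>\n"
print(has_signoff(message))
```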
+
 ### PR Title and Classification

 Only specific types of PRs will be reviewed. The PR title is prefixed
@@ -186,6 +198,7 @@ The PR needs to meet the following code quality standards:

 ### Adding or Changing Kernels

+When actively developing or modifying kernels, using the [Incremental Compilation Workflow](./incremental_build.md) is highly recommended for faster build times.
 Each custom kernel needs a schema and one or more implementations to be registered with PyTorch.

 - Make sure custom ops are registered following PyTorch guidelines:

--- a/docs/contributing/deprecation_policy.md
+++ b/docs/contributing/deprecation_policy.md
@@ -37,14 +37,14 @@ multiple Y releases:
 - **Timeline**: A removal version is explicitly stated in the deprecation
 warning (e.g., "This will be removed in v0.10.0").
 - **Communication**: Deprecation is noted in the following, as applicable:
-  - Help strings
-  - Log output
-  - API responses
-  - `/metrics` output (for metrics features)
-  - User-facing documentation
-  - Release notes
-  - GitHub Issue (RFC) for feedback
-  - Documentation and use of the `@typing_extensions.deprecated` decorator for Python APIs
+    - Help strings
+    - Log output
+    - API responses
+    - `/metrics` output (for metrics features)
+    - User-facing documentation
+    - Release notes
+    - GitHub Issue (RFC) for feedback
+    - Documentation and use of the `@typing_extensions.deprecated` decorator for Python APIs
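As an illustration of the last item, the sketch below marks a Python API as deprecated with the PEP 702 decorator. The functions `old_api` and `new_api` are made-up names, and the message merely follows the "will be removed in v0.10.0" pattern quoted above. The import falls back from the standard library (`warnings.deprecated`, Python 3.13+) to `typing_extensions`.

```python
import warnings

try:
    from warnings import deprecated  # Python 3.13+
except ImportError:
    from typing_extensions import deprecated  # older interpreters


def new_api() -> str:
    return "result"


@deprecated("old_api() is deprecated and will be removed in v0.10.0. Use new_api() instead.")
def old_api() -> str:
    # Keep the old entry point working while steering callers to the new one.
    return new_api()


# Calling the deprecated function emits a DeprecationWarning at runtime.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = old_api()

for w in caught:
    print(w.category.__name__, w.message)
```

Static type checkers that implement PEP 702 also flag call sites of the decorated function, which covers the "documentation and use" part without extra tooling.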

 **2.Deprecated (Off By Default)**


--- a/docs/contributing/incremental_build.md
+++ b/docs/contributing/incremental_build.md
+# Incremental Compilation Workflow
+
+When working on vLLM's C++/CUDA kernels located in the `csrc/` directory, recompiling the entire project with `uv pip install -e .` for every change can be time-consuming. An incremental compilation workflow using CMake allows for faster iteration by only recompiling the necessary components after an initial setup. This guide details how to set up and use such a workflow, which complements your editable Python installation.
+
+## Prerequisites
+
+Before setting up the incremental build:
+
+1. **vLLM Editable Install:** Ensure you have vLLM installed from source in editable mode. Using pre-compiled wheels for the initial editable setup can be faster, as the CMake workflow will handle subsequent kernel recompilations.
+
+    ```console
+    uv venv --python 3.12 --seed
+    source .venv/bin/activate
+    VLLM_USE_PRECOMPILED=1 uv pip install -U -e . --torch-backend=auto
+    ```
+
+2. **CUDA Toolkit:** Verify that the NVIDIA CUDA Toolkit is correctly installed and `nvcc` is accessible in your `PATH`. CMake relies on `nvcc` to compile CUDA code. You can typically find `nvcc` in `$CUDA_HOME/bin/nvcc` or by running `which nvcc`. If you encounter issues, refer to the [official CUDA Toolkit installation guides](https://developer.nvidia.com/cuda-toolkit-archive) and vLLM's main [GPU installation documentation](../getting_started/installation/gpu.md#troubleshooting) for troubleshooting. The `CMAKE_CUDA_COMPILER` variable in your `CMakeUserPresets.json` should also point to your `nvcc` binary.
+
+3. **Build Tools:** It is highly recommended to install `ccache` for fast rebuilds by caching compilation results (e.g., `sudo apt install ccache` or `conda install ccache`). Also, ensure the core build dependencies like `cmake` and `ninja` are installed. These are installable through `requirements/build.txt` or your system's package manager.
+
+    ```console
+    uv pip install -r requirements/build.txt --torch-backend=auto
+    ```
+
+## Setting up the CMake Build Environment
+
+The incremental build process is managed through CMake. You can configure your build settings using a `CMakeUserPresets.json` file at the root of the vLLM repository.
+
+### Generate `CMakeUserPresets.json` using the helper script
+
+To simplify the setup, vLLM provides a helper script that attempts to auto-detect your system's configuration (like CUDA path, Python environment, and CPU cores) and generates the `CMakeUserPresets.json` file for you.
+
+**Run the script:**
+
+Navigate to the root of your vLLM clone and execute the following command:
+
+```console
+python tools/generate_cmake_presets.py
+```
+
+The script will prompt you if it cannot automatically determine certain paths (e.g., `nvcc` or a specific Python executable for your vLLM development environment). Follow the on-screen prompts. If an existing `CMakeUserPresets.json` is found, the script will ask for confirmation before overwriting it.
+
+After running the script, a `CMakeUserPresets.json` file will be created in the root of your vLLM repository.
+
+### Example `CMakeUserPresets.json`
+
+Below is an example of what the generated `CMakeUserPresets.json` might look like. The script will tailor these values based on your system and any input you provide.
+
+```json
+{
+    "version": 6,
+    "cmakeMinimumRequired": {
+        "major": 3,
+        "minor": 26,
+        "patch": 1
+    },
+    "configurePresets": [
+        {
+            "name": "release",
+            "generator": "Ninja",
+            "binaryDir": "${sourceDir}/cmake-build-release",
+            "cacheVariables": {
+                "CMAKE_CUDA_COMPILER": "/usr/local/cuda/bin/nvcc",
+                "CMAKE_C_COMPILER_LAUNCHER": "ccache",
+                "CMAKE_CXX_COMPILER_LAUNCHER": "ccache",
+                "CMAKE_CUDA_COMPILER_LAUNCHER": "ccache",
+                "CMAKE_BUILD_TYPE": "Release",
+                "VLLM_PYTHON_EXECUTABLE": "/home/user/venvs/vllm/bin/python",
+                "CMAKE_INSTALL_PREFIX": "${sourceDir}",
+                "CMAKE_CUDA_FLAGS": "",
+                "NVCC_THREADS": "4",
+                "CMAKE_JOB_POOLS": "compile=32"
+            }
+        }
+    ],
+    "buildPresets": [
+        {
+            "name": "release",
+            "configurePreset": "release",
+            "jobs": 32
+        }
+    ]
+}
+```
+
+**What do the various configurations mean?**
+
+- `CMAKE_CUDA_COMPILER`: Path to your `nvcc` binary. The script attempts to find this automatically.
+- `CMAKE_C_COMPILER_LAUNCHER`, `CMAKE_CXX_COMPILER_LAUNCHER`, `CMAKE_CUDA_COMPILER_LAUNCHER`: Setting these to `ccache` (or `sccache`) significantly speeds up rebuilds by caching compilation results. Ensure `ccache` is installed (e.g., `sudo apt install ccache` or `conda install ccache`). The script sets these by default.
+- `VLLM_PYTHON_EXECUTABLE`: Path to the Python executable in your vLLM development environment. The script will prompt for this, defaulting to the current Python environment if suitable.
+- `CMAKE_INSTALL_PREFIX: "${sourceDir}"`: Specifies that the compiled components should be installed back into your vLLM source directory. This is crucial for the editable install, as it makes the newly built kernels immediately available to your Python environment.
+- `CMAKE_JOB_POOLS` and `jobs` in build presets: Control the parallelism of the build. The script sets these based on the number of CPU cores detected on your system.
+- `binaryDir`: Specifies where the build artifacts will be stored (e.g., `cmake-build-release`).
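For readers who want to see how such a file could be assembled by hand, here is a minimal sketch that builds a comparable presets dictionary from auto-detected values. It is not the logic of `tools/generate_cmake_presets.py`; `build_presets` is a hypothetical function, and the fallback `nvcc` path is simply the one from the example above.

```python
import json
import os
import shutil
import sys


def build_presets(nvcc_path=None, python_exe=None, jobs=None) -> dict:
    """Assemble a CMakeUserPresets-style dictionary from detected defaults."""
    nvcc = nvcc_path or shutil.which("nvcc") or "/usr/local/cuda/bin/nvcc"
    python_exe = python_exe or sys.executable
    jobs = jobs or os.cpu_count() or 1

    cache = {
        "CMAKE_CUDA_COMPILER": nvcc,
        "CMAKE_BUILD_TYPE": "Release",
        "VLLM_PYTHON_EXECUTABLE": python_exe,
        "CMAKE_INSTALL_PREFIX": "${sourceDir}",
        "CMAKE_JOB_POOLS": f"compile={jobs}",
    }
    # Only enable the compiler launchers when ccache is actually installed.
    if shutil.which("ccache"):
        for lang in ("C", "CXX", "CUDA"):
            cache[f"CMAKE_{lang}_COMPILER_LAUNCHER"] = "ccache"

    return {
        "version": 6,
        "configurePresets": [
            {
                "name": "release",
                "generator": "Ninja",
                "binaryDir": "${sourceDir}/cmake-build-release",
                "cacheVariables": cache,
            }
        ],
        "buildPresets": [
            {"name": "release", "configurePreset": "release", "jobs": jobs}
        ],
    }


print(json.dumps(build_presets(), indent=4))
```

Passing `jobs` explicitly is a simple way to cap parallelism on machines with many cores but limited RAM.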
+
+## Building and Installing with CMake
+
+Once your `CMakeUserPresets.json` is configured:
+
+1. **Initialize the CMake build environment:**
+    This step configures the build system according to your chosen preset (e.g., `release`) and creates the build directory at `binaryDir`.
+
+    ```console
+    cmake --preset release
+    ```
+
+2. **Build and install the vLLM components:**
+    This command compiles the code and installs the resulting binaries into your vLLM source directory, making them available to your editable Python installation.
+
+    ```console
+    cmake --build --preset release --target install
+    ```
+
+3. **Make changes and repeat!**
+    You can now start using your editable install of vLLM, testing and making changes as needed. After changing kernel code, run the same CMake command again; only the affected files are rebuilt.
+
+    ```console
+    cmake --build --preset release --target install
+    ```
+
+## Verifying the Build
+
+After a successful build, you will find a populated build directory (e.g., `cmake-build-release/` if you used the `release` preset and the example configuration).
+
+```console
+> ls cmake-build-release/
+bin             cmake_install.cmake      _deps                                machete_generation.log
+build.ninja     CPackConfig.cmake        detect_cuda_compute_capabilities.cu  marlin_generation.log
+_C.abi3.so      CPackSourceConfig.cmake  detect_cuda_version.cc               _moe_C.abi3.so
+CMakeCache.txt  ctest                    _flashmla_C.abi3.so                  moe_marlin_generation.log
+CMakeFiles      cumem_allocator.abi3.so  install_local_manifest.txt           vllm-flash-attn
+```
+
+The `cmake --build ... --target install` command copies the compiled shared libraries (like `_C.abi3.so`, `_moe_C.abi3.so`, etc.) into the appropriate `vllm` package directory within your source tree. This updates your editable installation with the newly compiled kernels.
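A quick way to confirm that the install step did its job is to list the shared libraries under the package directory. This is a minimal sketch, assuming the libraries end up somewhere below `vllm/` in the source tree as described above; `find_installed_extensions` is a hypothetical helper name.

```python
from pathlib import Path


def find_installed_extensions(repo_root: str) -> list[str]:
    """List `*.abi3.so` files below `<repo_root>/vllm`, relative to the repo root."""
    root = Path(repo_root)
    package_dir = root / "vllm"
    return sorted(p.relative_to(root).as_posix() for p in package_dir.rglob("*.abi3.so"))


# Usage, run from the root of a vLLM clone:
#     for name in find_installed_extensions("."):
#         print(name)
```

An empty result after a successful build suggests that `CMAKE_INSTALL_PREFIX` does not point at the source directory.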
+
+## Additional Tips
+
+- **Adjust Parallelism:** Fine-tune the `CMAKE_JOB_POOLS` in `configurePresets` and `jobs` in `buildPresets` in your `CMakeUserPresets.json`. Too many jobs can overload systems with limited RAM or CPU cores, leading to slower builds or system instability. Too few won't fully utilize available resources.
+- **Clean Builds When Necessary:** If you encounter persistent or strange build errors, especially after significant changes or switching branches, consider removing the CMake build directory (e.g., `rm -rf cmake-build-release`) and re-running the `cmake --preset` and `cmake --build` commands.
+- **Specific Target Builds:** For even faster iterations when working on a specific module, you can sometimes build a specific target instead of the full `install` target, though `install` ensures all necessary components are updated in your Python environment. Refer to CMake documentation for more advanced target management.
--- a/docs/contributing/model/README.md
+++ b/docs/contributing/model/README.md
 ---
-title: Adding a New Model
+title: Summary
 ---
 [](){ #new-model }

-This section provides more information on how to integrate a [PyTorch](https://pytorch.org/) model into vLLM.
+!!! important
+    Many decoder language models can now be automatically loaded using the [Transformers backend][transformers-backend] without having to implement them in vLLM. See if `vllm serve <model>` works first!

-Contents:
+vLLM models are specialized [PyTorch](https://pytorch.org/) models that take advantage of various [features][compatibility-matrix] to optimize their performance.

- [Basic](basic.md)
- [Registration](registration.md)
- [Tests](tests.md)
- [Multimodal](multimodal.md)
+The complexity of integrating a model into vLLM depends heavily on the model's architecture.
+The process is fairly straightforward if the model shares a similar architecture with an existing model in vLLM.
+However, this can be more complex for models that include new operators (e.g., a new attention mechanism).

-!!! note
-    The complexity of adding a new model depends heavily on the model's architecture.
-    The process is considerably straightforward if the model shares a similar architecture with an existing model in vLLM.
-    However, for models that include new operators (e.g., a new attention mechanism), the process can be a bit more complex.
+Read through these pages for a step-by-step guide:
+
+- [Basic Model](basic.md)
+- [Registering a Model](registration.md)
+- [Unit Testing](tests.md)
+- [Multi-Modal Support](multimodal.md)

 !!! tip
    If you are encountering issues while integrating your model into vLLM, feel free to open a [GitHub issue](https://github.com/vllm-project/vllm/issues)

--- a/docs/contributing/model/basic.md
+++ b/docs/contributing/model/basic.md
 ---
-title: Implementing a Basic Model
+title: Basic Model
 ---
 [](){ #new-model-basic }

@@ -27,33 +27,35 @@ All vLLM modules within the model must include a `prefix` argument in their cons

 The initialization code should look like this:

-```python
-from torch import nn
-from vllm.config import VllmConfig
-from vllm.attention import Attention
-
-class MyAttention(nn.Module):
-    def __init__(self, vllm_config: VllmConfig, prefix: str):
-        super().__init__()
-        self.attn = Attention(prefix=f"{prefix}.attn")
-
-class MyDecoderLayer(nn.Module):
-    def __init__(self, vllm_config: VllmConfig, prefix: str):
-        super().__init__()
-        self.self_attn = MyAttention(prefix=f"{prefix}.self_attn")
-
-class MyModel(nn.Module):
-    def __init__(self, vllm_config: VllmConfig, prefix: str):
-        super().__init__()
-        self.layers = nn.ModuleList(
-            [MyDecoderLayer(vllm_config, prefix=f"{prefix}.layers.{i}") for i in range(vllm_config.model_config.hf_config.num_hidden_layers)]
-        )
-
-class MyModelForCausalLM(nn.Module):
-    def __init__(self, vllm_config: VllmConfig, prefix: str = ""):
-        super().__init__()
-        self.model = MyModel(vllm_config, prefix=f"{prefix}.model")
-```
+??? Code
+
+    ```python
+    from torch import nn
+    from vllm.config import VllmConfig
+    from vllm.attention import Attention
+
+    class MyAttention(nn.Module):
+        def __init__(self, vllm_config: VllmConfig, prefix: str):
+            super().__init__()
+            self.attn = Attention(prefix=f"{prefix}.attn")
+
+    class MyDecoderLayer(nn.Module):
+        def __init__(self, vllm_config: VllmConfig, prefix: str):
+            super().__init__()
+            self.self_attn = MyAttention(vllm_config, prefix=f"{prefix}.self_attn")
+
+    class MyModel(nn.Module):
+        def __init__(self, vllm_config: VllmConfig, prefix: str):
+            super().__init__()
+            self.layers = nn.ModuleList(
+                [MyDecoderLayer(vllm_config, prefix=f"{prefix}.layers.{i}") for i in range(vllm_config.model_config.hf_config.num_hidden_layers)]
+            )
+
+    class MyModelForCausalLM(nn.Module):
+        def __init__(self, vllm_config: VllmConfig, prefix: str = ""):
+            super().__init__()
+            self.model = MyModel(vllm_config, prefix=f"{prefix}.model")
+    ```
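To see which names the `prefix` convention produces, here is a torch-free sketch that mirrors the class hierarchy above with plain Python objects. Everything in it is a stand-in: `AttentionStub` replaces the real attention layer, and `join_prefix` is a hypothetical helper that avoids the leading dot a plain `f"{prefix}.model"` would produce when the top-level prefix is empty.

```python
def join_prefix(prefix: str, name: str) -> str:
    """Join two name parts with a dot, skipping the dot for an empty prefix."""
    return name if not prefix else f"{prefix}.{name}"


class AttentionStub:
    def __init__(self, prefix: str):
        self.prefix = prefix  # the unique name this layer would be registered under


class MyAttention:
    def __init__(self, prefix: str):
        self.attn = AttentionStub(join_prefix(prefix, "attn"))


class MyDecoderLayer:
    def __init__(self, prefix: str):
        self.self_attn = MyAttention(join_prefix(prefix, "self_attn"))


class MyModel:
    def __init__(self, num_layers: int, prefix: str):
        self.layers = [
            MyDecoderLayer(join_prefix(prefix, f"layers.{i}"))
            for i in range(num_layers)
        ]


class MyModelForCausalLM:
    def __init__(self, num_layers: int, prefix: str = ""):
        self.model = MyModel(num_layers, join_prefix(prefix, "model"))


model = MyModelForCausalLM(num_layers=2)
names = [layer.self_attn.attn.prefix for layer in model.model.layers]
print(names)
```

Each attention layer ends up with a distinct dotted name, which is the property the `prefix` argument exists to guarantee.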

 ### Computation Code


--- a/docs/contributing/model/multimodal.md
+++ b/docs/contributing/model/multimodal.md
@@ -10,6 +10,22 @@ This document walks you through the steps to extend a basic model so that it acc
 It is assumed that you have already implemented the model in vLLM according to [these steps][new-model-basic].
 Further update the model as follows:

+- Implement [get_placeholder_str][vllm.model_executor.models.interfaces.SupportsMultiModal.get_placeholder_str] to define the placeholder string which is used to represent the multi-modal item in the text prompt. This should be consistent with the chat template of the model.
+
+    ??? Code
+
+        ```python
+        class YourModelForImage2Seq(nn.Module):
+            ...
+
+            @classmethod
+            def get_placeholder_str(cls, modality: str, i: int) -> Optional[str]:
+                if modality.startswith("image"):
+                    return "<image>"
+
+                raise ValueError("Only image modality is supported")
+        ```
+
 - Reserve a keyword parameter in [forward][torch.nn.Module.forward] for each input tensor that corresponds to a multi-modal input, as shown in the following example:

  ```diff
@@ -25,59 +41,63 @@ Further update the model as follows:

 - Implement [get_multimodal_embeddings][vllm.model_executor.models.interfaces.SupportsMultiModal.get_multimodal_embeddings] that returns the embeddings from running the multimodal inputs through the multimodal tokenizer of the model. Below we provide a boilerplate of a typical implementation pattern, but feel free to adjust it to your own needs.

-    ```python
-    class YourModelForImage2Seq(nn.Module):
-        ...
+    ??? Code

-        def _process_image_input(self, image_input: YourModelImageInputs) -> torch.Tensor:
+        ```python
+        class YourModelForImage2Seq(nn.Module):
+            ...

-            assert self.vision_encoder is not None
-            image_features = self.vision_encoder(image_input)
-            return self.multi_modal_projector(image_features)
+            def _process_image_input(self, image_input: YourModelImageInputs) -> torch.Tensor:

-        def get_multimodal_embeddings(
-                self, **kwargs: object) -> Optional[MultiModalEmbeddings]:
+                assert self.vision_encoder is not None
+                image_features = self.vision_encoder(image_input)
+                return self.multi_modal_projector(image_features)

-            # Validate the multimodal input keyword arguments
-            image_input = self._parse_and_validate_image_input(**kwargs)
-            if image_input is None:
-                return None
+            def get_multimodal_embeddings(
+                    self, **kwargs: object) -> Optional[MultiModalEmbeddings]:

-            # Run multimodal inputs through encoder and projector
-            vision_embeddings = self._process_image_input(image_input)
-            return vision_embeddings
-    ```
+                # Validate the multimodal input keyword arguments
+                image_input = self._parse_and_validate_image_input(**kwargs)
+                if image_input is None:
+                    return None
+
+                # Run multimodal inputs through encoder and projector
+                vision_embeddings = self._process_image_input(image_input)
+                return vision_embeddings
+        ```

-!!! warning
-        The returned `multimodal_embeddings` must be either a **3D [torch.Tensor][]** of shape `(num_items, feature_size, hidden_size)`, or a **list / tuple of 2D [torch.Tensor][]'s** of shape `(feature_size, hidden_size)`, so that `multimodal_embeddings[i]` retrieves the embeddings generated from the `i`-th multimodal data item (e.g, image) of the request.
+!!! important
+    The returned `multimodal_embeddings` must be either a **3D [torch.Tensor][]** of shape `(num_items, feature_size, hidden_size)`, or a **list / tuple of 2D [torch.Tensor][]'s** of shape `(feature_size, hidden_size)`, so that `multimodal_embeddings[i]` retrieves the embeddings generated from the `i`-th multimodal data item (e.g., image) of the request.
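The shape requirement above can be checked with a few lines of duck-typed Python. This is a sketch only: `check_multimodal_embeddings` is a hypothetical helper, and `FakeTensor` is a stand-in used so the example runs without PyTorch. It relies solely on the `ndim` and `shape` attributes, which `torch.Tensor` also exposes.

```python
from typing import Any


class FakeTensor:
    """Stand-in exposing only the two attributes the check needs."""

    def __init__(self, *shape: int):
        self.shape = tuple(shape)
        self.ndim = len(shape)


def check_multimodal_embeddings(embeddings: Any, hidden_size: int) -> int:
    """Validate the layout and return the number of multimodal items."""
    if isinstance(embeddings, (list, tuple)):
        # One 2D tensor of shape (feature_size, hidden_size) per item.
        for i, item in enumerate(embeddings):
            if item.ndim != 2 or item.shape[-1] != hidden_size:
                raise ValueError(f"item {i} has unexpected shape {tuple(item.shape)}")
        return len(embeddings)
    # A single 3D tensor of shape (num_items, feature_size, hidden_size).
    if embeddings.ndim != 3 or embeddings.shape[-1] != hidden_size:
        raise ValueError(f"unexpected shape {tuple(embeddings.shape)}")
    return embeddings.shape[0]


print(check_multimodal_embeddings(FakeTensor(2, 576, 4096), hidden_size=4096))
print(check_multimodal_embeddings([FakeTensor(576, 4096), FakeTensor(144, 4096)], hidden_size=4096))
```

The list form allows a different `feature_size` per item, which the single 3D tensor cannot express.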

 - Implement [get_input_embeddings][vllm.model_executor.models.interfaces.SupportsMultiModal.get_input_embeddings] to merge `multimodal_embeddings` with text embeddings from the `input_ids`. If input processing for the model is implemented correctly (see sections below), then you can leverage the utility function we provide to easily merge the embeddings.

-    ```python
-    from .utils import merge_multimodal_embeddings
+    ??? Code

-    class YourModelForImage2Seq(nn.Module):
-        ...
+        ```python
+        from .utils import merge_multimodal_embeddings

-        def get_input_embeddings(
-            self,
-            input_ids: torch.Tensor,
-            multimodal_embeddings: Optional[MultiModalEmbeddings] = None,
-        ) -> torch.Tensor:
-
-            # `get_input_embeddings` should already be implemented for the language 
-            # model as one of the requirements of basic vLLM model implementation.
-            inputs_embeds = self.language_model.get_input_embeddings(input_ids)
-
-            if multimodal_embeddings is not None:
-                inputs_embeds = merge_multimodal_embeddings(
-                    input_ids=input_ids, 
-                    inputs_embeds=inputs_embeds, 
-                    multimodal_embeddings=multimodal_embeddings,
-                    placeholder_token_id=self.config.image_token_index)
-
-            return inputs_embeds
-    ```
+        class YourModelForImage2Seq(nn.Module):
+            ...
+
+            def get_input_embeddings(
+                self,
+                input_ids: torch.Tensor,
+                multimodal_embeddings: Optional[MultiModalEmbeddings] = None,
+            ) -> torch.Tensor:
+
+                # `get_input_embeddings` should already be implemented for the language 
+                # model as one of the requirements of basic vLLM model implementation.
+                inputs_embeds = self.language_model.get_input_embeddings(input_ids)
+
+                if multimodal_embeddings is not None:
+                    inputs_embeds = merge_multimodal_embeddings(
+                        input_ids=input_ids, 
+                        inputs_embeds=inputs_embeds, 
+                        multimodal_embeddings=multimodal_embeddings,
+                        placeholder_token_id=self.config.image_token_index)
+
+                return inputs_embeds
+        ```

 - Implement [get_language_model][vllm.model_executor.models.interfaces.SupportsMultiModal.get_language_model] getter to provide stable access to the underlying language model.

@@ -100,8 +120,8 @@ Further update the model as follows:
  ```

 !!! note
-      The model class does not have to be named `*ForCausalLM`.
-      Check out [the HuggingFace Transformers documentation](https://huggingface.co/docs/transformers/model_doc/auto#multimodal) for some examples.
+    The model class does not have to be named `*ForCausalLM`.
+    Check out [the HuggingFace Transformers documentation](https://huggingface.co/docs/transformers/model_doc/auto#multimodal) for some examples.

 ## 2. Specify processing information

@@ -135,42 +155,46 @@ Assuming that the memory usage increases with the number of tokens, the dummy in

    Looking at the code of HF's `LlavaForConditionalGeneration`:

-    ```python
-    # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L530-L544
-    n_image_tokens = (input_ids == self.config.image_token_index).sum().item()
-    n_image_features = image_features.shape[0] * image_features.shape[1]
+    ??? Code

-    if n_image_tokens != n_image_features:
-        raise ValueError(
-            f"Image features and image tokens do not match: tokens: {n_image_tokens}, features {n_image_features}"
+        ```python
+        # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L530-L544
+        n_image_tokens = (input_ids == self.config.image_token_index).sum().item()
+        n_image_features = image_features.shape[0] * image_features.shape[1]
+
+        if n_image_tokens != n_image_features:
+            raise ValueError(
+                f"Image features and image tokens do not match: tokens: {n_image_tokens}, features {n_image_features}"
+            )
+        special_image_mask = (
+            (input_ids == self.config.image_token_index)
+            .unsqueeze(-1)
+            .expand_as(inputs_embeds)
+            .to(inputs_embeds.device)
        )
-    special_image_mask = (
-        (input_ids == self.config.image_token_index)
-        .unsqueeze(-1)
-        .expand_as(inputs_embeds)
-        .to(inputs_embeds.device)
-    )
-    image_features = image_features.to(inputs_embeds.device, inputs_embeds.dtype)
-    inputs_embeds = inputs_embeds.masked_scatter(special_image_mask, image_features)
-    ```
+        image_features = image_features.to(inputs_embeds.device, inputs_embeds.dtype)
+        inputs_embeds = inputs_embeds.masked_scatter(special_image_mask, image_features)
+        ```

    The number of placeholder feature tokens per image is `image_features.shape[1]`.
    `image_features` is calculated inside the `get_image_features` method:

-    ```python
-    # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L290-L300
-    image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)
-
-    selected_image_feature = image_outputs.hidden_states[vision_feature_layer]
-    if vision_feature_select_strategy == "default":
-        selected_image_feature = selected_image_feature[:, 1:]
-    elif vision_feature_select_strategy == "full":
-        selected_image_feature = selected_image_feature
-    else:
-        raise ValueError(f"Unexpected select feature strategy: {self.config.vision_feature_select_strategy}")
-    image_features = self.multi_modal_projector(selected_image_feature)
-    return image_features
-    ```
+    ??? Code
+
+        ```python
+        # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L290-L300
+        image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)
+
+        selected_image_feature = image_outputs.hidden_states[vision_feature_layer]
+        if vision_feature_select_strategy == "default":
+            selected_image_feature = selected_image_feature[:, 1:]
+        elif vision_feature_select_strategy == "full":
+            selected_image_feature = selected_image_feature
+        else:
+            raise ValueError(f"Unexpected select feature strategy: {self.config.vision_feature_select_strategy}")
+        image_features = self.multi_modal_projector(selected_image_feature)
+        return image_features
+        ```

    We can infer that `image_features.shape[1]` is based on `image_outputs.hidden_states.shape[1]` from the vision tower
    (`CLIPVisionModel` for the [`llava-hf/llava-1.5-7b-hf`](https://huggingface.co/llava-hf/llava-1.5-7b-hf) model).
@@ -193,20 +217,22 @@ Assuming that the memory usage increases with the number of tokens, the dummy in

    To find the sequence length, we turn to the code of `CLIPVisionEmbeddings`:

-    ```python
-    # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/modeling_clip.py#L247-L257
-    target_dtype = self.patch_embedding.weight.dtype
-    patch_embeds = self.patch_embedding(pixel_values.to(dtype=target_dtype))  # shape = [*, width, grid, grid]
-    patch_embeds = patch_embeds.flatten(2).transpose(1, 2)
-
-    class_embeds = self.class_embedding.expand(batch_size, 1, -1)
-    embeddings = torch.cat([class_embeds, patch_embeds], dim=1)
-    if interpolate_pos_encoding:
-        embeddings = embeddings + self.interpolate_pos_encoding(embeddings, height, width)
-    else:
-        embeddings = embeddings + self.position_embedding(self.position_ids)
-    return embeddings
-    ```
+    ??? Code
+
+        ```python
+        # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/modeling_clip.py#L247-L257
+        target_dtype = self.patch_embedding.weight.dtype
+        patch_embeds = self.patch_embedding(pixel_values.to(dtype=target_dtype))  # shape = [*, width, grid, grid]
+        patch_embeds = patch_embeds.flatten(2).transpose(1, 2)
+
+        class_embeds = self.class_embedding.expand(batch_size, 1, -1)
+        embeddings = torch.cat([class_embeds, patch_embeds], dim=1)
+        if interpolate_pos_encoding:
+            embeddings = embeddings + self.interpolate_pos_encoding(embeddings, height, width)
+        else:
+            embeddings = embeddings + self.position_embedding(self.position_ids)
+        return embeddings
+        ```

    We can infer that `embeddings.shape[1] == self.num_positions`, where

@@ -218,55 +244,59 @@ Assuming that the memory usage increases with the number of tokens, the dummy in

    Overall, the number of placeholder feature tokens for an image can be calculated as:

-    ```python
-    def get_num_image_tokens(
-        self,
-        *,
-        image_width: int,
-        image_height: int,
-    ) -> int:
-        hf_config = self.get_hf_config()
-        hf_processor = self.get_hf_processor()
+    ??? Code

-        image_size = hf_config.vision_config.image_size
-        patch_size = hf_config.vision_config.patch_size
+        ```python
+        def get_num_image_tokens(
+            self,
+            *,
+            image_width: int,
+            image_height: int,
+        ) -> int:
+            hf_config = self.get_hf_config()
+            hf_processor = self.get_hf_processor()

-        num_image_tokens = (image_size // patch_size) ** 2 + 1
-        if hf_processor.vision_feature_select_strategy == "default":
-            num_image_tokens -= 1
+            image_size = hf_config.vision_config.image_size
+            patch_size = hf_config.vision_config.patch_size

-        return num_image_tokens
-    ```
+            num_image_tokens = (image_size // patch_size) ** 2 + 1
+            if hf_processor.vision_feature_select_strategy == "default":
+                num_image_tokens -= 1
+
+            return num_image_tokens
+        ```
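
As a quick sanity check of the formula above, the sketch below evaluates it outside the class. The standalone function name `count_image_tokens` and the values `image_size=336` and `patch_size=14` are illustrative assumptions (they are typical of a CLIP ViT-L/14 vision tower at 336px), not values read from a real config:

```python
def count_image_tokens(image_size: int, patch_size: int, strategy: str) -> int:
    """Standalone version of the formula; `strategy` mirrors
    `vision_feature_select_strategy` (hypothetical helper, not vLLM API)."""
    # One token per patch, plus one class (CLS) token.
    num_image_tokens = (image_size // patch_size) ** 2 + 1
    # The "default" strategy drops the CLS token again.
    if strategy == "default":
        num_image_tokens -= 1
    return num_image_tokens


print(count_image_tokens(336, 14, "default"))  # 576
print(count_image_tokens(336, 14, "full"))     # 577
```

Neither the image width nor the image height appears in the computation, which is why a single dummy size is enough for profiling.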

    Notice that the number of image tokens doesn't depend on the image width and height.
    We can simply use a dummy `image_size` to calculate the multimodal profiling data:

-    ```python
-    # NOTE: In actuality, this is usually implemented as part of the
-    # model's subclass of `BaseProcessingInfo`, but we show it as is
-    # here for simplicity.
-    def get_image_size_with_most_features(self) -> ImageSize:
-        hf_config = self.get_hf_config()
-        width = height = hf_config.image_size
-        return ImageSize(width=width, height=height)
-
-    def get_dummy_mm_data(
-        self,
-        seq_len: int,
-        mm_counts: Mapping[str, int],
-    ) -> MultiModalDataDict:
-        num_images = mm_counts.get("image", 0)
+    ??? Code

-        target_width, target_height = \
-            self.info.get_image_size_with_most_features()
+        ```python
+        # NOTE: In actuality, this is usually implemented as part of the
+        # model's subclass of `BaseProcessingInfo`, but we show it as is
+        # here for simplicity.
+        def get_image_size_with_most_features(self) -> ImageSize:
+            hf_config = self.get_hf_config()
+            width = height = hf_config.image_size
+            return ImageSize(width=width, height=height)

-        return {
-            "image":
-            self._get_dummy_images(width=target_width,
-                                   height=target_height,
-                                   num_images=num_images)
-        }
-    ```
+        def get_dummy_mm_data(
+            self,
+            seq_len: int,
+            mm_counts: Mapping[str, int],
+        ) -> MultiModalDataDict:
+            num_images = mm_counts.get("image", 0)
+
+            target_width, target_height = \
+                self.info.get_image_size_with_most_features()
+
+            return {
+                "image":
+                self._get_dummy_images(width=target_width,
+                                       height=target_height,
+                                       num_images=num_images)
+            }
+        ```

    For the text, we simply expand the multimodal image token from the model config to match the desired number of images.

@@ -284,21 +314,23 @@ Assuming that the memory usage increases with the number of tokens, the dummy in

    Looking at the code of HF's `FuyuForCausalLM`:

-    ```python
-    # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/modeling_fuyu.py#L311-L322
-    if image_patches is not None and past_key_values is None:
-        patch_embeddings = [
-            self.vision_embed_tokens(patch.to(self.vision_embed_tokens.weight.dtype))
-            .squeeze(0)
-            .to(inputs_embeds.device)
-            for patch in image_patches
-        ]
-        inputs_embeds = self.gather_continuous_embeddings(
-            word_embeddings=inputs_embeds,
-            continuous_embeddings=patch_embeddings,
-            image_patch_input_indices=image_patches_indices,
-        )
-    ```
+    ??? Code
+
+        ```python
+        # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/modeling_fuyu.py#L311-L322
+        if image_patches is not None and past_key_values is None:
+            patch_embeddings = [
+                self.vision_embed_tokens(patch.to(self.vision_embed_tokens.weight.dtype))
+                .squeeze(0)
+                .to(inputs_embeds.device)
+                for patch in image_patches
+            ]
+            inputs_embeds = self.gather_continuous_embeddings(
+                word_embeddings=inputs_embeds,
+                continuous_embeddings=patch_embeddings,
+                image_patch_input_indices=image_patches_indices,
+            )
+        ```

    The number of placeholder feature tokens for the `i`th item in the batch is `patch_embeddings[i].shape[0]`,
    which is the same as `image_patches[i].shape[0]`, i.e. `num_total_patches`.
@@ -312,92 +344,98 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
    In `FuyuImageProcessor.preprocess`, the images are resized and padded to the target `FuyuImageProcessor.size`,
    returning the dimensions after resizing (but before padding) as metadata.

-    ```python
-    # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L541-L544
-    image_encoding = self.image_processor.preprocess(images, **output_kwargs["images_kwargs"])
-    batch_images = image_encoding["images"]
-    image_unpadded_heights = image_encoding["image_unpadded_heights"]
-    image_unpadded_widths = image_encoding["image_unpadded_widths"]
-
-    # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L480-L
-    if do_resize:
-        batch_images = [
-            [self.resize(image, size=size, input_data_format=input_data_format) for image in images]
-            for images in batch_images
-        ]
-
-    image_sizes = [get_image_size(images[0], channel_dim=input_data_format) for images in batch_images]
-    image_unpadded_heights = [[image_size[0]] for image_size in image_sizes]
-    image_unpadded_widths = [[image_size[1]] for image_size in image_sizes]
-
-    if do_pad:
-        batch_images = [
-            [
-                self.pad_image(
-                    image,
-                    size=size,
-                    mode=padding_mode,
-                    constant_values=padding_value,
-                    input_data_format=input_data_format,
-                )
-                for image in images
+    ??? Code
+
+        ```python
+        # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L541-L544
+        image_encoding = self.image_processor.preprocess(images, **output_kwargs["images_kwargs"])
+        batch_images = image_encoding["images"]
+        image_unpadded_heights = image_encoding["image_unpadded_heights"]
+        image_unpadded_widths = image_encoding["image_unpadded_widths"]
+
+        # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L480-L
+        if do_resize:
+            batch_images = [
+                [self.resize(image, size=size, input_data_format=input_data_format) for image in images]
+                for images in batch_images
            ]
-            for images in batch_images
-        ]
-    ```

-    In `FuyuImageProcessor.preprocess_with_tokenizer_info`, the images are split into patches based on this metadata:
+        image_sizes = [get_image_size(images[0], channel_dim=input_data_format) for images in batch_images]
+        image_unpadded_heights = [[image_size[0]] for image_size in image_sizes]
+        image_unpadded_widths = [[image_size[1]] for image_size in image_sizes]
+
+        if do_pad:
+            batch_images = [
+                [
+                    self.pad_image(
+                        image,
+                        size=size,
+                        mode=padding_mode,
+                        constant_values=padding_value,
+                        input_data_format=input_data_format,
+                    )
+                    for image in images
+                ]
+                for images in batch_images
+            ]
+        ```

-    ```python
-    # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L425
-    model_image_input = self.image_processor.preprocess_with_tokenizer_info(
-        image_input=tensor_batch_images,
-        image_present=image_present,
-        image_unpadded_h=image_unpadded_heights,
-        image_unpadded_w=image_unpadded_widths,
-        image_placeholder_id=image_placeholder_id,
-        image_newline_id=image_newline_id,
-        variable_sized=True,
-    )
+    In `FuyuImageProcessor.preprocess_with_tokenizer_info`, the images are split into patches based on this metadata:

-    # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L638-L658
-    image_height, image_width = image.shape[1], image.shape[2]
-    if variable_sized:  # variable_sized=True
-        new_h = min(
-            image_height,
-            math.ceil(image_unpadded_h[batch_index, subseq_index] / patch_height) * patch_height,
+    ??? Code
+
+        ```python
+        # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L425
+        model_image_input = self.image_processor.preprocess_with_tokenizer_info(
+            image_input=tensor_batch_images,
+            image_present=image_present,
+            image_unpadded_h=image_unpadded_heights,
+            image_unpadded_w=image_unpadded_widths,
+            image_placeholder_id=image_placeholder_id,
+            image_newline_id=image_newline_id,
+            variable_sized=True,
        )
-        new_w = min(
-            image_width,
-            math.ceil(image_unpadded_w[batch_index, subseq_index] / patch_width) * patch_width,
-        )
-        image = image[:, :new_h, :new_w]
-        image_height, image_width = new_h, new_w

-    num_patches = self.get_num_patches(image_height=image_height, image_width=image_width)
-    tensor_of_image_ids = torch.full(
-        [num_patches], image_placeholder_id, dtype=torch.int32, device=image_input.device
-    )
-    patches = self.patchify_image(image=image.unsqueeze(0)).squeeze(0)
-    assert num_patches == patches.shape[0]
-    ```
+        # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L638-L658
+        image_height, image_width = image.shape[1], image.shape[2]
+        if variable_sized:  # variable_sized=True
+            new_h = min(
+                image_height,
+                math.ceil(image_unpadded_h[batch_index, subseq_index] / patch_height) * patch_height,
+            )
+            new_w = min(
+                image_width,
+                math.ceil(image_unpadded_w[batch_index, subseq_index] / patch_width) * patch_width,
+            )
+            image = image[:, :new_h, :new_w]
+            image_height, image_width = new_h, new_w
+
+        num_patches = self.get_num_patches(image_height=image_height, image_width=image_width)
+        tensor_of_image_ids = torch.full(
+            [num_patches], image_placeholder_id, dtype=torch.int32, device=image_input.device
+        )
+        patches = self.patchify_image(image=image.unsqueeze(0)).squeeze(0)
+        assert num_patches == patches.shape[0]
+        ```

    The number of patches is in turn defined by `FuyuImageProcessor.get_num_patches`:

-    ```python
-    # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L552-L562
-    patch_size = patch_size if patch_size is not None else self.patch_size
-    patch_height, patch_width = self.patch_size["height"], self.patch_size["width"]
-
-    if image_height % patch_height != 0:
-        raise ValueError(f"{image_height=} must be divisible by {patch_height}")
-    if image_width % patch_width != 0:
-        raise ValueError(f"{image_width=} must be divisible by {patch_width}")
-
-    num_patches_per_dim_h = image_height // patch_height
-    num_patches_per_dim_w = image_width // patch_width
-    num_patches = num_patches_per_dim_h * num_patches_per_dim_w
-    ```
+    ??? Code
+
+        ```python
+        # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L552-L562
+        patch_size = patch_size if patch_size is not None else self.patch_size
+        patch_height, patch_width = self.patch_size["height"], self.patch_size["width"]
+
+        if image_height % patch_height != 0:
+            raise ValueError(f"{image_height=} must be divisible by {patch_height}")
+        if image_width % patch_width != 0:
+            raise ValueError(f"{image_width=} must be divisible by {patch_width}")
+
+        num_patches_per_dim_h = image_height // patch_height
+        num_patches_per_dim_w = image_width // patch_width
+        num_patches = num_patches_per_dim_h * num_patches_per_dim_w
+        ```

    These image patches correspond to placeholder tokens (`|SPEAKER|`). So, we just need to maximize the number of image patches. Since input images are first resized
    to fit within `image_processor.size`, we can maximize the number of image patches by inputting an image with size equal to `image_processor.size`.
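
To make the argument above concrete, here is a minimal sketch of the crop-then-count logic. The helper name `num_patches_after_crop`, the target size of 1080x1920, and the 30x30 patch size are assumptions chosen for illustration; check the actual `FuyuImageProcessor` config before relying on them:

```python
import math


def num_patches_after_crop(
    unpadded_h: int,
    unpadded_w: int,
    target_h: int = 1080,  # assumed `image_processor.size["height"]`
    target_w: int = 1920,  # assumed `image_processor.size["width"]`
    patch_h: int = 30,     # assumed patch height
    patch_w: int = 30,     # assumed patch width
) -> int:
    # The padded image is cropped to the smallest multiple of the patch
    # size that still covers the unpadded (resized) image.
    new_h = min(target_h, math.ceil(unpadded_h / patch_h) * patch_h)
    new_w = min(target_w, math.ceil(unpadded_w / patch_w) * patch_w)
    # Every patch then corresponds to one placeholder token.
    return (new_h // patch_h) * (new_w // patch_w)


print(num_patches_after_crop(200, 300))    # 70
print(num_patches_after_crop(1080, 1920))  # 2304
```

Under these assumptions, an image that already fills the target size yields the largest patch count, which is the case the dummy data should cover.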
@@ -419,23 +457,25 @@ Assuming that the memory usage increases with the number of tokens, the dummy in

    For the multimodal image profiling data, the logic is very similar to LLaVA:

-    ```python
-    def get_dummy_mm_data(
-        self,
-        seq_len: int,
-        mm_counts: Mapping[str, int],
-    ) -> MultiModalDataDict:
-        target_width, target_height = \
-            self.info.get_image_size_with_most_features()
-        num_images = mm_counts.get("image", 0)
+    ??? Code

-        return {
-            "image":
-            self._get_dummy_images(width=target_width,
-                                   height=target_height,
-                                   num_images=num_images)
-        }
-    ```
+        ```python
+        def get_dummy_mm_data(
+            self,
+            seq_len: int,
+            mm_counts: Mapping[str, int],
+        ) -> MultiModalDataDict:
+            target_width, target_height = \
+                self.info.get_image_size_with_most_features()
+            num_images = mm_counts.get("image", 0)
+
+            return {
+                "image":
+                self._get_dummy_images(width=target_width,
+                                       height=target_height,
+                                       num_images=num_images)
+            }
+        ```

 ## 4. Specify processing details

@@ -455,6 +495,7 @@ return a schema of the tensors outputted by the HF processor that are related to
    The output of `CLIPImageProcessor` is a simple tensor with shape
    `(num_images, num_channels, image_height, image_width)`:

+
    ```python
    # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/image_processing_clip.py#L339-L345
    images = [
@@ -505,40 +546,49 @@ return a schema of the tensors outputted by the HF processor that are related to
    In order to support the use of [MultiModalFieldConfig.batched][] like in LLaVA,
    we remove the extra batch dimension by overriding [BaseMultiModalProcessor._call_hf_processor][]:

-    ```python
-    def _call_hf_processor(
-        self,
-        prompt: str,
-        mm_data: Mapping[str, object],
-        mm_kwargs: Mapping[str, object],
-    ) -> BatchFeature:
-        processed_outputs = super()._call_hf_processor(
-            prompt=prompt,
-            mm_data=mm_data,
-            mm_kwargs=mm_kwargs,
-        )
+    ??? Code

-        image_patches = processed_outputs.get("image_patches")
-        if image_patches is not None:
-            images = mm_data["images"]
-            assert isinstance(images, list)
+        ```python
+        def _call_hf_processor(
+            self,
+            prompt: str,
+            mm_data: Mapping[str, object],
+            mm_kwargs: Mapping[str, object],
+            tok_kwargs: Mapping[str, object],
+        ) -> BatchFeature:
+            processed_outputs = super()._call_hf_processor(
+                prompt=prompt,
+                mm_data=mm_data,
+                mm_kwargs=mm_kwargs,
+                tok_kwargs=tok_kwargs,
+            )

-            # Original output: (1, num_images, Pn, Px * Py * C)
-            # New output: (num_images, Pn, Px * Py * C)
-            assert (isinstance(image_patches, list)
-                    and len(image_patches) == 1)
-            assert (isinstance(image_patches[0], torch.Tensor)
-                    and len(image_patches[0]) == len(images))
+            image_patches = processed_outputs.get("image_patches")
+            if image_patches is not None:
+                images = mm_data["images"]
+                assert isinstance(images, list)

-            processed_outputs["image_patches"] = image_patches[0]
+                # Original output: (1, num_images, Pn, Px * Py * C)
+                # New output: (num_images, Pn, Px * Py * C)
+                assert (isinstance(image_patches, list)
+                        and len(image_patches) == 1)
+                assert (isinstance(image_patches[0], torch.Tensor)
+                        and len(image_patches[0]) == len(images))

-        return processed_outputs
-    ```
+                processed_outputs["image_patches"] = image_patches[0]
+
+            return processed_outputs
+        ```

    !!! note
        Our [actual code](gh-file:vllm/model_executor/models/fuyu.py) has special handling
        for text-only inputs to prevent unnecessary warnings from HF processor.

+    !!! note
+        The `_call_hf_processor` method takes both `mm_kwargs` and `tok_kwargs`.
+        `mm_kwargs` is used both to initialize and to call the Hugging Face
+        processor, whereas `tok_kwargs` is used only when calling it.
+
    This lets us override [_get_mm_fields_config][vllm.multimodal.processing.BaseMultiModalProcessor._get_mm_fields_config] as follows:

    ```python
@@ -573,35 +623,37 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
    It simply repeats each input `image_token` a number of times equal to the number of placeholder feature tokens (`num_image_tokens`).
    Based on this, we override [_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates] as follows:

-    ```python
-    def _get_prompt_updates(
-        self,
-        mm_items: MultiModalDataItems,
-        hf_processor_mm_kwargs: Mapping[str, object],
-        out_mm_kwargs: MultiModalKwargs,
-    ) -> Sequence[PromptUpdate]:
-        hf_config = self.info.get_hf_config()
-        image_token_id = hf_config.image_token_index
-
-        def get_replacement(item_idx: int):
-            images = mm_items.get_items("image", ImageProcessorItems)
+    ??? Code

-            image_size = images.get_image_size(item_idx)
-            num_image_tokens = self.info.get_num_image_tokens(
-                image_width=image_size.width,
-                image_height=image_size.height,
-            )
+        ```python
+        def _get_prompt_updates(
+            self,
+            mm_items: MultiModalDataItems,
+            hf_processor_mm_kwargs: Mapping[str, object],
+            out_mm_kwargs: MultiModalKwargs,
+        ) -> Sequence[PromptUpdate]:
+            hf_config = self.info.get_hf_config()
+            image_token_id = hf_config.image_token_index
+
+            def get_replacement(item_idx: int):
+                images = mm_items.get_items("image", ImageProcessorItems)
+
+                image_size = images.get_image_size(item_idx)
+                num_image_tokens = self.info.get_num_image_tokens(
+                    image_width=image_size.width,
+                    image_height=image_size.height,
+                )

-            return [image_token_id] * num_image_tokens
+                return [image_token_id] * num_image_tokens

-        return [
-            PromptReplacement(
-                modality="image",
-                target=[image_token_id],
-                replacement=get_replacement,
-            ),
-        ]
-    ```
+            return [
+                PromptReplacement(
+                    modality="image",
+                    target=[image_token_id],
+                    replacement=get_replacement,
+                ),
+            ]
+        ```
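
The effect of this replacement on a tokenized prompt can be pictured with a small pure-Python sketch. This is not how vLLM applies `PromptReplacement` internally; the function `expand_placeholders` and the token ID `32000` are hypothetical and only illustrate the "repeat each image token" behavior described above:

```python
def expand_placeholders(
    prompt_ids: list[int],
    image_token_id: int,
    tokens_per_image: list[int],
) -> list[int]:
    """Repeat the i-th image token `tokens_per_image[i]` times."""
    expanded: list[int] = []
    item_idx = 0
    for tok in prompt_ids:
        if tok == image_token_id:
            # One placeholder per feature token of this image.
            expanded.extend([image_token_id] * tokens_per_image[item_idx])
            item_idx += 1
        else:
            expanded.append(tok)
    return expanded


prompt = [1, 32000, 5, 6]  # 32000 is an assumed image token ID
print(expand_placeholders(prompt, 32000, [3]))
# [1, 32000, 32000, 32000, 5, 6]
```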

 === "Handling additional tokens: Fuyu"

@@ -616,117 +668,90 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies

    We define a helper function to return `ncols` and `nrows` directly:

-    ```python
-    def get_image_feature_grid_size(
-        self,
-        *,
-        image_width: int,
-        image_height: int,
-    ) -> tuple[int, int]:
-        image_processor = self.get_image_processor()
-        target_width = image_processor.size["width"]
-        target_height = image_processor.size["height"]
-        patch_width = image_processor.patch_size["width"]
-        patch_height = image_processor.patch_size["height"]
-
-        if not (image_width <= target_width and image_height <= target_height):
-            height_scale_factor = target_height / image_height
-            width_scale_factor = target_width / image_width
-            optimal_scale_factor = min(height_scale_factor, width_scale_factor)
-
-            image_height = int(image_height * optimal_scale_factor)
-            image_width = int(image_width * optimal_scale_factor)
-
-        ncols = math.ceil(image_width / patch_width)
-        nrows = math.ceil(image_height / patch_height)
-        return ncols, nrows
-    ```
+    ??? Code
+
+        ```python
+        def get_image_feature_grid_size(
+            self,
+            *,
+            image_width: int,
+            image_height: int,
+        ) -> tuple[int, int]:
+            image_processor = self.get_image_processor()
+            target_width = image_processor.size["width"]
+            target_height = image_processor.size["height"]
+            patch_width = image_processor.patch_size["width"]
+            patch_height = image_processor.patch_size["height"]
+
+            if not (image_width <= target_width and image_height <= target_height):
+                height_scale_factor = target_height / image_height
+                width_scale_factor = target_width / image_width
+                optimal_scale_factor = min(height_scale_factor, width_scale_factor)
+
+                image_height = int(image_height * optimal_scale_factor)
+                image_width = int(image_width * optimal_scale_factor)
+
+            ncols = math.ceil(image_width / patch_width)
+            nrows = math.ceil(image_height / patch_height)
+            return ncols, nrows
+        ```

    Based on this, we can initially define our replacement tokens as:

-    ```python
-    def get_replacement(item_idx: int):
-        images = mm_items.get_items("image", ImageProcessorItems)
-        image_size = images.get_image_size(item_idx)
+    ??? Code

-        ncols, nrows = self.info.get_image_feature_grid_size(
-            image_width=image_size.width,
-            image_height=image_size.height,
-        )
+        ```python
+        def get_replacement(item_idx: int):
+            images = mm_items.get_items("image", ImageProcessorItems)
+            image_size = images.get_image_size(item_idx)

-        # `_IMAGE_TOKEN_ID` corresponds to `|SPEAKER|`
-        # `_NEWLINE_TOKEN_ID` corresponds to `|NEWLINE|`
-        return ([_IMAGE_TOKEN_ID] * ncols + [_NEWLINE_TOKEN_ID]) * nrows
-    ```
+            ncols, nrows = self.info.get_image_feature_grid_size(
+                image_width=image_size.width,
+                image_height=image_size.height,
+            )
+
+            # `_IMAGE_TOKEN_ID` corresponds to `|SPEAKER|`
+            # `_NEWLINE_TOKEN_ID` corresponds to `|NEWLINE|`
+            return ([_IMAGE_TOKEN_ID] * ncols + [_NEWLINE_TOKEN_ID]) * nrows
+        ```
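
For intuition, the two helpers above can be combined into a standalone sketch. The token IDs, the 1920x1080 target size, and the 30x30 patch size below are placeholder assumptions, not values taken from the Fuyu config:

```python
import math

IMAGE_TOKEN_ID = -1    # hypothetical stand-in for `|SPEAKER|`
NEWLINE_TOKEN_ID = -2  # hypothetical stand-in for `|NEWLINE|`


def feature_grid_size(
    image_width: int,
    image_height: int,
    target_width: int = 1920,   # assumed processor width
    target_height: int = 1080,  # assumed processor height
    patch_width: int = 30,      # assumed patch width
    patch_height: int = 30,     # assumed patch height
) -> tuple[int, int]:
    # Only images larger than the target are scaled down.
    if not (image_width <= target_width and image_height <= target_height):
        scale = min(target_height / image_height, target_width / image_width)
        image_height = int(image_height * scale)
        image_width = int(image_width * scale)
    ncols = math.ceil(image_width / patch_width)
    nrows = math.ceil(image_height / patch_height)
    return ncols, nrows


def replacement_tokens(image_width: int, image_height: int) -> list[int]:
    ncols, nrows = feature_grid_size(image_width, image_height)
    # Each row of image tokens is terminated by a newline token.
    return ([IMAGE_TOKEN_ID] * ncols + [NEWLINE_TOKEN_ID]) * nrows


tokens = replacement_tokens(image_width=300, image_height=200)
print(feature_grid_size(300, 200))  # (10, 7)
print(len(tokens))                  # 77
```

With these numbers, the 77 replacement tokens consist of 70 image tokens and 7 newline tokens, one per row.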

    However, this is not entirely correct. After `FuyuImageProcessor.preprocess_with_tokenizer_info` is called,
    a BOS token (`<s>`) is also added to the prompt:

-    ```python
-    # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L435
-    model_image_input = self.image_processor.preprocess_with_tokenizer_info(
-        image_input=tensor_batch_images,
-        image_present=image_present,
-        image_unpadded_h=image_unpadded_heights,
-        image_unpadded_w=image_unpadded_widths,
-        image_placeholder_id=image_placeholder_id,
-        image_newline_id=image_newline_id,
-        variable_sized=True,
-    )
-    prompt_tokens, prompts_length = _tokenize_prompts_with_image_and_batch(
-        tokenizer=self.tokenizer,
-        prompts=prompts,
-        scale_factors=scale_factors,
-        max_tokens_to_generate=self.max_tokens_to_generate,
-        max_position_embeddings=self.max_position_embeddings,
-        add_BOS=True,
-        add_beginning_of_answer_token=True,
-    )
-    ```
+    ??? Code
+
+        ```python
+        # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L435
+        model_image_input = self.image_processor.preprocess_with_tokenizer_info(
+            image_input=tensor_batch_images,
+            image_present=image_present,
+            image_unpadded_h=image_unpadded_heights,
+            image_unpadded_w=image_unpadded_widths,
+            image_placeholder_id=image_placeholder_id,
+            image_newline_id=image_newline_id,
+            variable_sized=True,
+        )
+        prompt_tokens, prompts_length = _tokenize_prompts_with_image_and_batch(
+            tokenizer=self.tokenizer,
+            prompts=prompts,
+            scale_factors=scale_factors,
+            max_tokens_to_generate=self.max_tokens_to_generate,
+            max_position_embeddings=self.max_position_embeddings,
+            add_BOS=True,
+            add_beginning_of_answer_token=True,
+        )
+        ```

    To assign the vision embeddings to only the image tokens, instead of a string
    you can return an instance of [PromptUpdateDetails][vllm.multimodal.processing.PromptUpdateDetails]:

-    ```python
-    hf_config = self.info.get_hf_config()
-    bos_token_id = hf_config.bos_token_id  # `<s>`
-    assert isinstance(bos_token_id, int)
-
-    def get_replacement_fuyu(item_idx: int):
-        images = mm_items.get_items("image", ImageProcessorItems)
-        image_size = images.get_image_size(item_idx)
-
-        ncols, nrows = self.info.get_image_feature_grid_size(
-            image_width=image_size.width,
-            image_height=image_size.height,
-        )
-        image_tokens = ([_IMAGE_TOKEN_ID] * ncols +
-                        [_NEWLINE_TOKEN_ID]) * nrows
+    ??? Code

-        return PromptUpdateDetails.select_token_id(
-            image_tokens + [bos_token_id],
-            embed_token_id=_IMAGE_TOKEN_ID,
-        )
-    ```
-
-    Finally, noticing that the HF processor removes the `|ENDOFTEXT|` token from the tokenized prompt,
-    we can search for it to conduct the replacement at the start of the string:
-
-    ```python
-    def _get_prompt_updates(
-        self,
-        mm_items: MultiModalDataItems,
-        hf_processor_mm_kwargs: Mapping[str, object],
-        out_mm_kwargs: MultiModalKwargs,
-    ) -> Sequence[PromptUpdate]:
+        ```python
        hf_config = self.info.get_hf_config()
-        bos_token_id = hf_config.bos_token_id
+        bos_token_id = hf_config.bos_token_id  # `<s>`
        assert isinstance(bos_token_id, int)

-        tokenizer = self.info.get_tokenizer()
-        eot_token_id = tokenizer.bos_token_id
-        assert isinstance(eot_token_id, int)
-
        def get_replacement_fuyu(item_idx: int):
            images = mm_items.get_items("image", ImageProcessorItems)
            image_size = images.get_image_size(item_idx)
@@ -742,15 +767,52 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
                image_tokens + [bos_token_id],
                embed_token_id=_IMAGE_TOKEN_ID,
            )
+        ```
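
The idea behind selecting a single embed token can be sketched as a boolean mask over the replacement. This only illustrates the concept; it is not the implementation of `PromptUpdateDetails.select_token_id`, and the helper `embed_mask` and all token IDs are hypothetical:

```python
IMAGE_TOKEN_ID = -1    # hypothetical stand-in for `|SPEAKER|`
NEWLINE_TOKEN_ID = -2  # hypothetical stand-in for `|NEWLINE|`
BOS_TOKEN_ID = 1       # hypothetical stand-in for `<s>`


def embed_mask(token_ids: list[int], embed_token_id: int) -> list[bool]:
    # True marks positions that should receive vision embeddings.
    return [tok == embed_token_id for tok in token_ids]


ncols, nrows = 2, 2
image_tokens = ([IMAGE_TOKEN_ID] * ncols + [NEWLINE_TOKEN_ID]) * nrows
full_replacement = image_tokens + [BOS_TOKEN_ID]

mask = embed_mask(full_replacement, IMAGE_TOKEN_ID)
print(mask)  # [True, True, False, True, True, False, False]
```

In this sketch, only the four image-token positions are marked, while the newline and BOS tokens stay in the prompt as ordinary text tokens.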

-        return [
-            PromptReplacement(
-                modality="image",
-                target=[eot_token_id],
-                replacement=get_replacement_fuyu,
-            )
-        ]
-    ```
+    Finally, noticing that the HF processor removes the `|ENDOFTEXT|` token from the tokenized prompt,
+    we can search for that token and perform the replacement at the start of the prompt:
+
+    ??? Code
+
+        ```python
+        def _get_prompt_updates(
+            self,
+            mm_items: MultiModalDataItems,
+            hf_processor_mm_kwargs: Mapping[str, object],
+            out_mm_kwargs: MultiModalKwargs,
+        ) -> Sequence[PromptUpdate]:
+            hf_config = self.info.get_hf_config()
+            bos_token_id = hf_config.bos_token_id
+            assert isinstance(bos_token_id, int)
+
+            tokenizer = self.info.get_tokenizer()
+            eot_token_id = tokenizer.bos_token_id
+            assert isinstance(eot_token_id, int)
+
+            def get_replacement_fuyu(item_idx: int):
+                images = mm_items.get_items("image", ImageProcessorItems)
+                image_size = images.get_image_size(item_idx)
+
+                ncols, nrows = self.info.get_image_feature_grid_size(
+                    image_width=image_size.width,
+                    image_height=image_size.height,
+                )
+                image_tokens = ([_IMAGE_TOKEN_ID] * ncols +
+                                [_NEWLINE_TOKEN_ID]) * nrows
+
+                return PromptUpdateDetails.select_token_id(
+                    image_tokens + [bos_token_id],
+                    embed_token_id=_IMAGE_TOKEN_ID,
+                )
+
+            return [
+                PromptReplacement(
+                    modality="image",
+                    target=[eot_token_id],
+                    replacement=get_replacement_fuyu,
+                )
+            ]
+        ```

 ## 5. Register processor-related classes


--- a/docs/contributing/model/registration.md
+++ b/docs/contributing/model/registration.md
 ---
-title: Registering a Model to vLLM
+title: Registering a Model
 ---
 [](){ #new-model-registration }

@@ -18,7 +18,7 @@ After you have implemented your model (see [tutorial][new-model-basic]), put it
 Then, add your model class to `_VLLM_MODELS` in <gh-file:vllm/model_executor/models/registry.py> so that it is automatically registered upon importing vLLM.
 Finally, update our [list of supported models][supported-models] to promote your model!

-!!! warning
+!!! important
    The list of models in each section should be maintained in alphabetical order.

 ## Out-of-tree models
@@ -49,6 +49,6 @@ def register():
    )
 ```

-!!! warning
+!!! important
    If your model is a multimodal model, ensure the model class implements the [SupportsMultiModal][vllm.model_executor.models.interfaces.SupportsMultiModal] interface.
    Read more about that [here][supports-multimodal].
--- a/docs/contributing/model/tests.md
+++ b/docs/contributing/model/tests.md
 ---
-title: Writing Unit Tests
+title: Unit Testing
 ---
 [](){ #new-model-tests }

@@ -15,7 +15,7 @@ Without them, the CI for your PR will fail.
 Include an example HuggingFace repository for your model in <gh-file:tests/models/registry.py>.
 This enables a unit test that loads dummy weights to ensure that the model can be initialized in vLLM.

-!!! warning
+!!! important
    The list of models in each section should be maintained in alphabetical order.

 !!! tip

--- a/docs/contributing/profiling.md
+++ b/docs/contributing/profiling.md
@@ -30,13 +30,21 @@ Refer to <gh-file:examples/offline_inference/simple_profiling.py> for an example
 #### OpenAI Server

 ```bash
-VLLM_TORCH_PROFILER_DIR=./vllm_profile python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-70B
+VLLM_TORCH_PROFILER_DIR=./vllm_profile \
+    python -m vllm.entrypoints.openai.api_server \
+    --model meta-llama/Meta-Llama-3-70B
 ```

 benchmark_serving.py:

 ```bash
-python benchmarks/benchmark_serving.py --backend vllm --model meta-llama/Meta-Llama-3-70B --dataset-name sharegpt --dataset-path sharegpt.json --profile --num-prompts 2
+python benchmarks/benchmark_serving.py \
+    --backend vllm \
+    --model meta-llama/Meta-Llama-3-70B \
+    --dataset-name sharegpt \
+    --dataset-path sharegpt.json \
+    --profile \
+    --num-prompts 2
 ```

 ## Profile with NVIDIA Nsight Systems
@@ -64,7 +72,16 @@ For basic usage, you can just append `nsys profile -o report.nsys-rep --trace-fo
 The following is an example using the `benchmarks/benchmark_latency.py` script:

 ```bash
-nsys profile -o report.nsys-rep --trace-fork-before-exec=true --cuda-graph-trace=node python benchmarks/benchmark_latency.py --model meta-llama/Llama-3.1-8B-Instruct --num-iters-warmup 5 --num-iters 1 --batch-size 16 --input-len 512 --output-len 8
+nsys profile -o report.nsys-rep \
+    --trace-fork-before-exec=true \
+    --cuda-graph-trace=node \
+    python benchmarks/benchmark_latency.py \
+    --model meta-llama/Llama-3.1-8B-Instruct \
+    --num-iters-warmup 5 \
+    --num-iters 1 \
+    --batch-size 16 \
+    --input-len 512 \
+    --output-len 8
 ```

 #### OpenAI Server
@@ -73,10 +90,21 @@ To profile the server, you will want to prepend your `vllm serve` command with `

 ```bash
 # server
-nsys profile -o report.nsys-rep --trace-fork-before-exec=true --cuda-graph-trace=node --delay 30 --duration 60 vllm serve meta-llama/Llama-3.1-8B-Instruct
+nsys profile -o report.nsys-rep \
+    --trace-fork-before-exec=true \
+    --cuda-graph-trace=node \
+    --delay 30 \
+    --duration 60 \
+    vllm serve meta-llama/Llama-3.1-8B-Instruct

 # client
-python benchmarks/benchmark_serving.py --backend vllm --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 1 --dataset-name random --random-input 1024 --random-output 512
+python benchmarks/benchmark_serving.py \
+    --backend vllm \
+    --model meta-llama/Llama-3.1-8B-Instruct \
+    --num-prompts 1 \
+    --dataset-name random \
+    --random-input 1024 \
+    --random-output 512
 ```

 In practice, you should set the `--duration` argument to a large value. Whenever you want the server to stop profiling, run:
@@ -97,26 +125,26 @@ to manually kill the profiler and generate your `nsys-rep` report.

 You can view these profiles either as summaries in the CLI, using `nsys stats [profile-file]`, or in the GUI by installing Nsight [locally following the directions here](https://developer.nvidia.com/nsight-systems/get-started).

-CLI example:
-
-```bash
-nsys stats report1.nsys-rep
-...
- ** CUDA GPU Kernel Summary (cuda_gpu_kern_sum):
-
- Time (%)  Total Time (ns)  Instances   Avg (ns)     Med (ns)    Min (ns)  Max (ns)   StdDev (ns)                                                  Name                                                
- --------  ---------------  ---------  -----------  -----------  --------  ---------  -----------  ----------------------------------------------------------------------------------------------------
-     46.3   10,327,352,338     17,505    589,965.9    144,383.0    27,040  3,126,460    944,263.8  sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_of…
-     14.8    3,305,114,764      5,152    641,520.7    293,408.0   287,296  2,822,716    867,124.9  sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize256x128x64_warpgroupsize2x1x1_execute_segment_k_of…
-     12.1    2,692,284,876     14,280    188,535.4     83,904.0    19,328  2,862,237    497,999.9  sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize64x128x64_warpgroupsize1x1x1_execute_segment_k_off…
-      9.5    2,116,600,578     33,920     62,399.8     21,504.0    15,326  2,532,285    290,954.1  sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize64x64x64_warpgroupsize1x1x1_execute_segment_k_off_…
-      5.0    1,119,749,165     18,912     59,208.4      9,056.0     6,784  2,578,366    271,581.7  void vllm::act_and_mul_kernel<c10::BFloat16, &vllm::silu_kernel<c10::BFloat16>, (bool)1>(T1 *, cons…
-      4.1      916,662,515     21,312     43,011.6     19,776.0     8,928  2,586,205    199,790.1  void cutlass::device_kernel<flash::enable_sm90_or_later<flash::FlashAttnFwdSm90<flash::CollectiveMa…
-      2.6      587,283,113     37,824     15,526.7      3,008.0     2,719  2,517,756    139,091.1  std::enable_if<T2>(int)0&&vllm::_typeConvert<T1>::exists, void>::type vllm::fused_add_rms_norm_kern…
-      1.9      418,362,605     18,912     22,121.5      3,871.0     3,328  2,523,870    175,248.2  void vllm::rotary_embedding_kernel<c10::BFloat16, (bool)1>(const long *, T1 *, T1 *, const T1 *, in…
-      0.7      167,083,069     18,880      8,849.7      2,240.0     1,471  2,499,996    101,436.1  void vllm::reshape_and_cache_flash_kernel<__nv_bfloat16, __nv_bfloat16, (vllm::Fp8KVCacheDataType)0…
-... 
-```
+??? CLI example
+
+    ```bash
+    nsys stats report1.nsys-rep
+    ...
+    ** CUDA GPU Kernel Summary (cuda_gpu_kern_sum):
+
+    Time (%)  Total Time (ns)  Instances   Avg (ns)     Med (ns)    Min (ns)  Max (ns)   StdDev (ns)                                                  Name                                                
+    --------  ---------------  ---------  -----------  -----------  --------  ---------  -----------  ----------------------------------------------------------------------------------------------------
+        46.3   10,327,352,338     17,505    589,965.9    144,383.0    27,040  3,126,460    944,263.8  sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_of…
+        14.8    3,305,114,764      5,152    641,520.7    293,408.0   287,296  2,822,716    867,124.9  sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize256x128x64_warpgroupsize2x1x1_execute_segment_k_of…
+        12.1    2,692,284,876     14,280    188,535.4     83,904.0    19,328  2,862,237    497,999.9  sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize64x128x64_warpgroupsize1x1x1_execute_segment_k_off…
+         9.5    2,116,600,578     33,920     62,399.8     21,504.0    15,326  2,532,285    290,954.1  sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize64x64x64_warpgroupsize1x1x1_execute_segment_k_off_…
+         5.0    1,119,749,165     18,912     59,208.4      9,056.0     6,784  2,578,366    271,581.7  void vllm::act_and_mul_kernel<c10::BFloat16, &vllm::silu_kernel<c10::BFloat16>, (bool)1>(T1 *, cons…
+         4.1      916,662,515     21,312     43,011.6     19,776.0     8,928  2,586,205    199,790.1  void cutlass::device_kernel<flash::enable_sm90_or_later<flash::FlashAttnFwdSm90<flash::CollectiveMa…
+         2.6      587,283,113     37,824     15,526.7      3,008.0     2,719  2,517,756    139,091.1  std::enable_if<T2>(int)0&&vllm::_typeConvert<T1>::exists, void>::type vllm::fused_add_rms_norm_kern…
+         1.9      418,362,605     18,912     22,121.5      3,871.0     3,328  2,523,870    175,248.2  void vllm::rotary_embedding_kernel<c10::BFloat16, (bool)1>(const long *, T1 *, T1 *, const T1 *, in…
+         0.7      167,083,069     18,880      8,849.7      2,240.0     1,471  2,499,996    101,436.1  void vllm::reshape_and_cache_flash_kernel<__nv_bfloat16, __nv_bfloat16, (vllm::Fp8KVCacheDataType)0…
+    ... 
+    ```

 GUI example:


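The `nsys profile` invocations in the `profiling.md` hunks above all share the same leading flags, so they can be assembled from one helper when scripting profiling runs. The following is a minimal sketch, not part of vLLM: `build_nsys_command` is a hypothetical helper name, and it only uses the flags that already appear in the examples above (`-o`, `--trace-fork-before-exec=true`, `--cuda-graph-trace=node`, `--delay`, `--duration`).

```python
import shlex


def build_nsys_command(target, output="report.nsys-rep", delay=None, duration=None):
    """Prefix `target` (an argv list) with the nsys flags used in the docs examples.

    Hypothetical helper for illustration; it only builds the command line.
    """
    cmd = [
        "nsys", "profile",
        "-o", output,
        "--trace-fork-before-exec=true",
        "--cuda-graph-trace=node",
    ]
    # --delay and --duration appear only in the server example.
    if delay is not None:
        cmd += ["--delay", str(delay)]
    if duration is not None:
        cmd += ["--duration", str(duration)]
    return cmd + list(target)


# Server example from the docs: delay 30 s, profile for 60 s.
server_cmd = build_nsys_command(
    ["vllm", "serve", "meta-llama/Llama-3.1-8B-Instruct"],
    delay=30,
    duration=60,
)
print(shlex.join(server_cmd))
```

The resulting list can be passed to `subprocess.run(server_cmd)` as-is. Keeping the command as an argv list avoids the shell line-continuation formatting that this change had to fix by hand.
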
--- a/docs/contributing/vulnerability_management.md
+++ b/docs/contributing/vulnerability_management.md
@@ -34,6 +34,7 @@ you may contact the following individuals:

 - Simon Mo - simon.mo@hey.com
 - Russell Bryant - rbryant@redhat.com
+- Huzaifa Sidhpurwala - huzaifas@redhat.com

 ## Slack Discussion


--- a/docs/deployment/docker.md
+++ b/docs/deployment/docker.md
@@ -10,7 +10,7 @@ title: Using Docker
 vLLM offers an official Docker image for deployment.
 The image can be used to run an OpenAI-compatible server and is available on Docker Hub as [vllm/vllm-openai](https://hub.docker.com/r/vllm/vllm-openai/tags).

-```console
+```bash
 docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
@@ -22,7 +22,7 @@ docker run --runtime nvidia --gpus all \

 This image can also be used with other container engines such as [Podman](https://podman.io/).

-```console
+```bash
 podman run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
@@ -71,7 +71,7 @@ You can add any other [engine-args][engine-args] you need after the image tag (`

 You can build and run vLLM from source via the provided <gh-file:docker/Dockerfile>. To build vLLM:

-```console
+```bash
 # optionally specifies: --build-arg max_jobs=8 --build-arg nvcc_threads=2
 DOCKER_BUILDKIT=1 docker build . \
    --target vllm-openai \
@@ -97,26 +97,28 @@ of PyTorch Nightly and should be considered **experimental**. Using the flag `--
    flags to speed up build process. However, ensure your `max_jobs` is substantially larger than `nvcc_threads` to get the most benefits.
    Keep an eye on memory usage with parallel jobs as it can be substantial (see example below).

-```console
-# Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB)
-python3 use_existing_torch.py
-DOCKER_BUILDKIT=1 docker build . \
-  --file docker/Dockerfile \
-  --target vllm-openai \
-  --platform "linux/arm64" \
-  -t vllm/vllm-gh200-openai:latest \
-  --build-arg max_jobs=66 \
-  --build-arg nvcc_threads=2 \
-  --build-arg torch_cuda_arch_list="9.0 10.0+PTX" \
-  --build-arg vllm_fa_cmake_gpu_arches="90-real"
-```
+??? Command
+
+    ```bash
+    # Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB)
+    python3 use_existing_torch.py
+    DOCKER_BUILDKIT=1 docker build . \
+      --file docker/Dockerfile \
+      --target vllm-openai \
+      --platform "linux/arm64" \
+      -t vllm/vllm-gh200-openai:latest \
+      --build-arg max_jobs=66 \
+      --build-arg nvcc_threads=2 \
+      --build-arg torch_cuda_arch_list="9.0 10.0+PTX" \
+      --build-arg vllm_fa_cmake_gpu_arches="90-real"
+    ```

 !!! note
    If you are building the `linux/arm64` image on a non-ARM host (e.g., an x86_64 machine), you need to ensure your system is set up for cross-compilation using QEMU. This allows your host machine to emulate ARM64 execution.

    Run the following command on your host machine to register QEMU user static handlers:

-    ```console
+    ```bash
    docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
    ```

@@ -126,7 +128,7 @@ DOCKER_BUILDKIT=1 docker build . \

 To run vLLM with the custom-built Docker image:

-```console
+```bash
 docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \

--- a/docs/deployment/frameworks/anything-llm.md
+++ b/docs/deployment/frameworks/anything-llm.md
@@ -15,7 +15,7 @@ It allows you to deploy a large language model (LLM) server with vLLM as the bac

 - Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
 vllm serve Qwen/Qwen1.5-32B-Chat-AWQ --max-model-len 4096
 ```


--- a/docs/deployment/frameworks/autogen.md
+++ b/docs/deployment/frameworks/autogen.md
@@ -11,7 +11,7 @@ title: AutoGen

 - Set up the [AutoGen](https://microsoft.github.io/autogen/0.2/docs/installation/) environment

-```console
+```bash
 pip install vllm

 # Install AgentChat and OpenAI client from Extensions
@@ -23,58 +23,60 @@ pip install -U "autogen-agentchat" "autogen-ext[openai]"

 - Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
 python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.2
 ```

 - Call it with AutoGen:

-```python
-import asyncio
-from autogen_core.models import UserMessage
-from autogen_ext.models.openai import OpenAIChatCompletionClient
-from autogen_core.models import ModelFamily
-
-
-async def main() -> None:
-    # Create a model client
-    model_client = OpenAIChatCompletionClient(
-        model="mistralai/Mistral-7B-Instruct-v0.2",
-        base_url="http://{your-vllm-host-ip}:{your-vllm-host-port}/v1",
-        api_key="EMPTY",
-        model_info={
-            "vision": False,
-            "function_calling": False,
-            "json_output": False,
-            "family": ModelFamily.MISTRAL,
-            "structured_output": True,
-        },
-    )
-
-    messages = [UserMessage(content="Write a very short story about a dragon.", source="user")]
-
-    # Create a stream.
-    stream = model_client.create_stream(messages=messages)
-
-    # Iterate over the stream and print the responses.
-    print("Streamed responses:")
-    async for response in stream:
-        if isinstance(response, str):
-            # A partial response is a string.
-            print(response, flush=True, end="")
-        else:
-            # The last response is a CreateResult object with the complete message.
-            print("\n\n------------\n")
-            print("The complete response:", flush=True)
-            print(response.content, flush=True)
-
-    # Close the client when done.
-    await model_client.close()
-
-
-asyncio.run(main())
-```
+??? Code
+
+    ```python
+    import asyncio
+    from autogen_core.models import UserMessage
+    from autogen_ext.models.openai import OpenAIChatCompletionClient
+    from autogen_core.models import ModelFamily
+
+
+    async def main() -> None:
+        # Create a model client
+        model_client = OpenAIChatCompletionClient(
+            model="mistralai/Mistral-7B-Instruct-v0.2",
+            base_url="http://{your-vllm-host-ip}:{your-vllm-host-port}/v1",
+            api_key="EMPTY",
+            model_info={
+                "vision": False,
+                "function_calling": False,
+                "json_output": False,
+                "family": ModelFamily.MISTRAL,
+                "structured_output": True,
+            },
+        )
+
+        messages = [UserMessage(content="Write a very short story about a dragon.", source="user")]
+
+        # Create a stream.
+        stream = model_client.create_stream(messages=messages)
+
+        # Iterate over the stream and print the responses.
+        print("Streamed responses:")
+        async for response in stream:
+            if isinstance(response, str):
+                # A partial response is a string.
+                print(response, flush=True, end="")
+            else:
+                # The last response is a CreateResult object with the complete message.
+                print("\n\n------------\n")
+                print("The complete response:", flush=True)
+                print(response.content, flush=True)
+
+        # Close the client when done.
+        await model_client.close()
+
+
+    asyncio.run(main())
+    ```

 For details, see the tutorial:


--- a/docs/deployment/frameworks/cerebrium.md
+++ b/docs/deployment/frameworks/cerebrium.md
@@ -11,14 +11,14 @@ vLLM can be run on a cloud based GPU machine with [Cerebrium](https://www.cerebr

 To install the Cerebrium client, run:

-```console
+```bash
 pip install cerebrium
 cerebrium login
 ```

 Next, create your Cerebrium project by running:

-```console
+```bash
 cerebrium init vllm-project
 ```

@@ -34,75 +34,81 @@ vllm = "latest"

 Next, add code to handle inference for the LLM of your choice (`mistralai/Mistral-7B-Instruct-v0.1` in this example). Add the following to your `main.py`:

-```python
-from vllm import LLM, SamplingParams
+??? Code

-llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")
+    ```python
+    from vllm import LLM, SamplingParams

-def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95):
+    llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")

-    sampling_params = SamplingParams(temperature=temperature, top_p=top_p)
-    outputs = llm.generate(prompts, sampling_params)
+    def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95):

-    # Print the outputs.
-    results = []
-    for output in outputs:
-        prompt = output.prompt
-        generated_text = output.outputs[0].text
-        results.append({"prompt": prompt, "generated_text": generated_text})
+        sampling_params = SamplingParams(temperature=temperature, top_p=top_p)
+        outputs = llm.generate(prompts, sampling_params)

-    return {"results": results}
-```
+        # Collect the prompt and generated text for each output.
+        results = []
+        for output in outputs:
+            prompt = output.prompt
+            generated_text = output.outputs[0].text
+            results.append({"prompt": prompt, "generated_text": generated_text})
+
+        return {"results": results}
+    ```

 Then, run the following code to deploy it to the cloud:

-```console
+```bash
 cerebrium deploy
 ```

 If the deployment succeeds, it returns a `curl` command that you can use to run inference. Remember to end the URL with the name of the function you are calling (in our case, `/run`).

-```python
-curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
- -H 'Content-Type: application/json' \
- -H 'Authorization: <JWT TOKEN>' \
- --data '{
-   "prompts": [
-     "Hello, my name is",
-     "The president of the United States is",
-     "The capital of France is",
-     "The future of AI is"
-   ]
- }'
-```
+??? Command
+
+    ```bash
+    curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
+    -H 'Content-Type: application/json' \
+    -H 'Authorization: <JWT TOKEN>' \
+    --data '{
+    "prompts": [
+        "Hello, my name is",
+        "The president of the United States is",
+        "The capital of France is",
+        "The future of AI is"
+    ]
+    }'
+    ```

 You should get a response like:

-```python
-{
-    "run_id": "52911756-3066-9ae8-bcc9-d9129d1bd262",
-    "result": {
-        "result": [
-            {
-                "prompt": "Hello, my name is",
-                "generated_text": " Sarah, and I'm a teacher. I teach elementary school students. One of"
-            },
-            {
-                "prompt": "The president of the United States is",
-                "generated_text": " elected every four years. This is a democratic system.\n\n5. What"
-            },
-            {
-                "prompt": "The capital of France is",
-                "generated_text": " Paris.\n"
-            },
-            {
-                "prompt": "The future of AI is",
-                "generated_text": " bright, but it's important to approach it with a balanced and nuanced perspective."
-            }
-        ]
-    },
-    "run_time_ms": 152.53663063049316
-}
-```
+??? Response
+
+    ```json
+    {
+        "run_id": "52911756-3066-9ae8-bcc9-d9129d1bd262",
+        "result": {
+            "result": [
+                {
+                    "prompt": "Hello, my name is",
+                    "generated_text": " Sarah, and I'm a teacher. I teach elementary school students. One of"
+                },
+                {
+                    "prompt": "The president of the United States is",
+                    "generated_text": " elected every four years. This is a democratic system.\n\n5. What"
+                },
+                {
+                    "prompt": "The capital of France is",
+                    "generated_text": " Paris.\n"
+                },
+                {
+                    "prompt": "The future of AI is",
+                    "generated_text": " bright, but it's important to approach it with a balanced and nuanced perspective."
+                }
+            ]
+        },
+        "run_time_ms": 152.53663063049316
+    }
+    ```

 You now have an autoscaling endpoint where you only pay for the compute you use!
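
For readers who would rather call the Cerebrium endpoint from Python than from `curl`, the request in the `cerebrium.md` hunk above can be sketched with the standard library alone. This is an illustrative sketch, not Cerebrium's client API: `build_run_request` is a hypothetical helper, and the URL and token are the same placeholders used in the `curl` example.

```python
import json
import urllib.request


def build_run_request(endpoint, token, prompts):
    """Build (but do not send) the POST request shown in the curl example.

    Hypothetical helper for illustration only.
    """
    body = json.dumps({"prompts": prompts}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json", "Authorization": token},
        method="POST",
    )


# Placeholder URL and token copied from the curl example; replace with real values.
request = build_run_request(
    "https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run",
    "<JWT TOKEN>",
    ["Hello, my name is", "The capital of France is"],
)
```

To send the request, pass it to `urllib.request.urlopen(request)` and decode the body with `json.load`. Building the request separately from sending it makes the payload easy to inspect before any compute is billed.
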
--- a/docs/deployment/frameworks/chatbox.md
+++ b/docs/deployment/frameworks/chatbox.md
@@ -15,7 +15,7 @@ It allows you to deploy a large language model (LLM) server with vLLM as the bac

 - Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
 vllm serve qwen/Qwen1.5-0.5B-Chat
 ```


--- a/docs/deployment/frameworks/dify.md
+++ b/docs/deployment/frameworks/dify.md
@@ -18,13 +18,13 @@ This guide walks you through deploying Dify using a vLLM backend.

 - Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
 vllm serve Qwen/Qwen1.5-7B-Chat
 ```

 - Start the Dify server with docker compose ([details](https://github.com/langgenius/dify?tab=readme-ov-file#quick-start)):

-```console
+```bash
 git clone https://github.com/langgenius/dify.git
 cd dify
 cd docker

--- a/docs/deployment/frameworks/dstack.md
+++ b/docs/deployment/frameworks/dstack.md
@@ -11,14 +11,14 @@ vLLM can be run on a cloud based GPU machine with [dstack](https://dstack.ai/),

 To install the dstack client, run:

-```console
+```bash
 pip install "dstack[all]"
 dstack server
 ```

 Next, to configure your dstack project, run:

-```console
+```bash
 mkdir -p vllm-dstack
 cd vllm-dstack
 dstack init
@@ -26,75 +26,81 @@ dstack init

 Next, to provision a VM instance with the LLM of your choice (`NousResearch/Llama-2-7b-chat-hf` in this example), create the following `serve.dstack.yml` file for the dstack `Service`:

-```yaml
-type: service
-
-python: "3.11"
-env:
-    - MODEL=NousResearch/Llama-2-7b-chat-hf
-port: 8000
-resources:
-    gpu: 24GB
-commands:
-    - pip install vllm
-    - vllm serve $MODEL --port 8000
-model:
-    format: openai
-    type: chat
-    name: NousResearch/Llama-2-7b-chat-hf
-```
+??? Config
+
+    ```yaml
+    type: service
+
+    python: "3.11"
+    env:
+        - MODEL=NousResearch/Llama-2-7b-chat-hf
+    port: 8000
+    resources:
+        gpu: 24GB
+    commands:
+        - pip install vllm
+        - vllm serve $MODEL --port 8000
+    model:
+        format: openai
+        type: chat
+        name: NousResearch/Llama-2-7b-chat-hf
+    ```

 Then, run the following CLI for provisioning:

-```console
-$ dstack run . -f serve.dstack.yml
-
-⠸ Getting run plan...
- Configuration  serve.dstack.yml
- Project        deep-diver-main
- User           deep-diver
- Min resources  2..xCPU, 8GB.., 1xGPU (24GB)
- Max price      -
- Max duration   -
- Spot policy    auto
- Retry policy   no
-
- #  BACKEND  REGION       INSTANCE       RESOURCES                               SPOT  PRICE
- 1  gcp   us-central1  g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
- 2  gcp   us-east1     g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
- 3  gcp   us-west1     g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
-    ...
- Shown 3 of 193 offers, $5.876 max
-
-Continue? [y/n]: y
-⠙ Submitting run...
-⠏ Launching spicy-treefrog-1 (pulling)
-spicy-treefrog-1 provisioning completed (running)
-Service is published at ...
-```
+??? Command
+
+    ```console
+    $ dstack run . -f serve.dstack.yml
+
+    ⠸ Getting run plan...
+    Configuration  serve.dstack.yml
+    Project        deep-diver-main
+    User           deep-diver
+    Min resources  2..xCPU, 8GB.., 1xGPU (24GB)
+    Max price      -
+    Max duration   -
+    Spot policy    auto
+    Retry policy   no
+
+    #  BACKEND  REGION       INSTANCE       RESOURCES                               SPOT  PRICE
+    1  gcp   us-central1  g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
+    2  gcp   us-east1     g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
+    3  gcp   us-west1     g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
+        ...
+    Shown 3 of 193 offers, $5.876 max
+
+    Continue? [y/n]: y
+    ⠙ Submitting run...
+    ⠏ Launching spicy-treefrog-1 (pulling)
+    spicy-treefrog-1 provisioning completed (running)
+    Service is published at ...
+    ```

 After the provisioning, you can interact with the model by using the OpenAI SDK:

-```python
-from openai import OpenAI
-
-client = OpenAI(
-    base_url="https://gateway.<gateway domain>",
-    api_key="<YOUR-DSTACK-SERVER-ACCESS-TOKEN>"
-)
-
-completion = client.chat.completions.create(
-    model="NousResearch/Llama-2-7b-chat-hf",
-    messages=[
-        {
-            "role": "user",
-            "content": "Compose a poem that explains the concept of recursion in programming.",
-        }
-    ]
-)
-
-print(completion.choices[0].message.content)
-```
+??? Code
+
+    ```python
+    from openai import OpenAI
+
+    client = OpenAI(
+        base_url="https://gateway.<gateway domain>",
+        api_key="<YOUR-DSTACK-SERVER-ACCESS-TOKEN>"
+    )
+
+    completion = client.chat.completions.create(
+        model="NousResearch/Llama-2-7b-chat-hf",
+        messages=[
+            {
+                "role": "user",
+                "content": "Compose a poem that explains the concept of recursion in programming.",
+            }
+        ]
+    )
+
+    print(completion.choices[0].message.content)
+    ```

 !!! note
    dstack automatically handles authentication on the gateway using dstack's tokens. If you don't want to configure a gateway, you can provision a dstack `Task` instead of a `Service`. The `Task` is for development purposes only. For more hands-on material on serving vLLM with dstack, check out [this repository](https://github.com/dstackai/dstack-examples/tree/main/deployment/vllm).
--- a/docs/deployment/frameworks/haystack.md
+++ b/docs/deployment/frameworks/haystack.md
@@ -13,7 +13,7 @@ It allows you to deploy a large language model (LLM) server with vLLM as the bac

 - Set up the vLLM and Haystack environment

-```console
+```bash
 pip install vllm haystack-ai
 ```

@@ -21,35 +21,35 @@ pip install vllm haystack-ai

 - Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
 vllm serve mistralai/Mistral-7B-Instruct-v0.1
 ```

 - Use the `OpenAIGenerator` and `OpenAIChatGenerator` components in Haystack to query the vLLM server.

-```python
-from haystack.components.generators.chat import OpenAIChatGenerator
-from haystack.dataclasses import ChatMessage
-from haystack.utils import Secret
-
-generator = OpenAIChatGenerator(
-    # for compatibility with the OpenAI API, a placeholder api_key is needed
-    api_key=Secret.from_token("VLLM-PLACEHOLDER-API-KEY"),
-    model="mistralai/Mistral-7B-Instruct-v0.1",
-    api_base_url="http://{your-vLLM-host-ip}:{your-vLLM-host-port}/v1",
-    generation_kwargs = {"max_tokens": 512}
-)
-
-response = generator.run(
-  messages=[ChatMessage.from_user("Hi. Can you help me plan my next trip to Italy?")]
-)
-
-print("-"*30)
-print(response)
-print("-"*30)
-```
-
-Output e.g.:
+??? Code
+
+    ```python
+    from haystack.components.generators.chat import OpenAIChatGenerator
+    from haystack.dataclasses import ChatMessage
+    from haystack.utils import Secret
+
+    generator = OpenAIChatGenerator(
+        # for compatibility with the OpenAI API, a placeholder api_key is needed
+        api_key=Secret.from_token("VLLM-PLACEHOLDER-API-KEY"),
+        model="mistralai/Mistral-7B-Instruct-v0.1",
+        api_base_url="http://{your-vLLM-host-ip}:{your-vLLM-host-port}/v1",
+        generation_kwargs = {"max_tokens": 512}
+    )
+
+    response = generator.run(
+      messages=[ChatMessage.from_user("Hi. Can you help me plan my next trip to Italy?")]
+    )
+
+    print("-"*30)
+    print(response)
+    print("-"*30)
+    ```

 ```console
 ------------------------------

--- a/docs/deployment/frameworks/helm.md
+++ b/docs/deployment/frameworks/helm.md
@@ -5,9 +5,9 @@ title: Helm

 A Helm chart to deploy vLLM for Kubernetes

-Helm is a package manager for Kubernetes. It will help you to deploy vLLM on k8s and automate the deployment of vLLM Kubernetes applications. With Helm, you can deploy the same framework architecture with different configurations to multiple namespaces by overriding variable values.
+Helm is a package manager for Kubernetes. It helps automate the deployment of vLLM applications on Kubernetes. With Helm, you can deploy the same framework architecture with different configurations to multiple namespaces by overriding variable values.

-This guide will walk you through the process of deploying vLLM with Helm, including the necessary prerequisites, steps for helm installation and documentation on architecture and values file.
+This guide walks you through deploying vLLM with Helm, including the necessary prerequisites, the Helm installation steps, and documentation of the architecture and values file.

 ## Prerequisites

@@ -16,21 +16,27 @@ Before you begin, ensure that you have the following:
 - A running Kubernetes cluster
 - NVIDIA Kubernetes Device Plugin (`k8s-device-plugin`): This can be found at [https://github.com/NVIDIA/k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin)
 - Available GPU resources in your cluster
- S3 with the model which will be deployed
+- An S3 bucket containing the model to be deployed

 ## Installing the chart

 To install the chart with the release name `test-vllm`:

-```console
-helm upgrade --install --create-namespace --namespace=ns-vllm test-vllm . -f values.yaml --set secrets.s3endpoint=$ACCESS_POINT --set secrets.s3bucketname=$BUCKET --set secrets.s3accesskeyid=$ACCESS_KEY --set secrets.s3accesskey=$SECRET_KEY
+```bash
+helm upgrade --install --create-namespace \
+  --namespace=ns-vllm test-vllm . \
+  -f values.yaml \
+  --set secrets.s3endpoint=$ACCESS_POINT \
+  --set secrets.s3bucketname=$BUCKET \
+  --set secrets.s3accesskeyid=$ACCESS_KEY \
+  --set secrets.s3accesskey=$SECRET_KEY
 ```

-## Uninstalling the Chart
+## Uninstalling the chart

 To uninstall the `test-vllm` deployment:

-```console
+```bash
 helm uninstall test-vllm --namespace=ns-vllm
 ```

@@ -39,57 +45,59 @@ chart **including persistent volumes** and deletes the release.

 ## Architecture

-![](../../assets/deployment/architecture_helm_deployment.png)
+![helm deployment architecture](../../assets/deployment/architecture_helm_deployment.png)

 ## Values

-| Key                                        | Type    | Default                                                                                                                                                  | Description                                                                                                                               |
-|--------------------------------------------|---------|----------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------|
-| autoscaling                                | object  | {"enabled":false,"maxReplicas":100,"minReplicas":1,"targetCPUUtilizationPercentage":80}                                                                  | Autoscaling configuration                                                                                                                 |
-| autoscaling.enabled                        | bool    | false                                                                                                                                                    | Enable autoscaling                                                                                                                        |
-| autoscaling.maxReplicas                    | int     | 100                                                                                                                                                      | Maximum replicas                                                                                                                          |
-| autoscaling.minReplicas                    | int     | 1                                                                                                                                                        | Minimum replicas                                                                                                                          |
-| autoscaling.targetCPUUtilizationPercentage | int     | 80                                                                                                                                                       | Target CPU utilization for autoscaling                                                                                                    |
-| configs                                    | object  | {}                                                                                                                                                       | Configmap                                                                                                                                 |
-| containerPort                              | int     | 8000                                                                                                                                                     | Container port                                                                                                                            |
-| customObjects                              | list    | []                                                                                                                                                       | Custom Objects configuration                                                                                                              |
-| deploymentStrategy                         | object  | {}                                                                                                                                                       | Deployment strategy configuration                                                                                                         |
-| externalConfigs                            | list    | []                                                                                                                                                       | External configuration                                                                                                                    |
-| extraContainers                            | list    | []                                                                                                                                                       | Additional containers configuration                                                                                                       |
-| extraInit                                  | object  | {"pvcStorage":"1Gi","s3modelpath":"relative_s3_model_path/opt-125m", "awsEc2MetadataDisabled": true}                                                     | Additional configuration for the init container                                                                                           |
-| extraInit.pvcStorage                       | string  | "50Gi"                                                                                                                                                   | Storage size of the s3                                                                                                                    |
-| extraInit.s3modelpath                      | string  | "relative_s3_model_path/opt-125m"                                                                                                                        | Path of the model on the s3 which hosts model weights and config files                                                                    |
-| extraInit.awsEc2MetadataDisabled           | boolean | true                                                                                                                                                     | Disables the use of the Amazon EC2 instance metadata service                                                                              |
-| extraPorts                                 | list    | []                                                                                                                                                       | Additional ports configuration                                                                                                            |
-| gpuModels                                  | list    | ["TYPE_GPU_USED"]                                                                                                                                        | Type of gpu used                                                                                                                          |
-| image                                      | object  | {"command":["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"],"repository":"vllm/vllm-openai","tag":"latest"} | Image configuration                                                                                                                       |
-| image.command                              | list    | ["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"]                                                            | Container launch command                                                                                                                  |
-| image.repository                           | string  | "vllm/vllm-openai"                                                                                                                                       | Image repository                                                                                                                          |
-| image.tag                                  | string  | "latest"                                                                                                                                                 | Image tag                                                                                                                                 |
-| livenessProbe                              | object  | {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":15,"periodSeconds":10}                                              | Liveness probe configuration                                                                                                              |
-| livenessProbe.failureThreshold             | int     | 3                                                                                                                                                        | Number of times after which if a probe fails in a row, Kubernetes considers that the overall check has failed: the container is not alive |
-| livenessProbe.httpGet                      | object  | {"path":"/health","port":8000}                                                                                                                           | Configuration of the Kubelet http request on the server                                                                                   |
-| livenessProbe.httpGet.path                 | string  | "/health"                                                                                                                                                | Path to access on the HTTP server                                                                                                         |
-| livenessProbe.httpGet.port                 | int     | 8000                                                                                                                                                     | Name or number of the port to access on the container, on which the server is listening                                                   |
-| livenessProbe.initialDelaySeconds          | int     | 15                                                                                                                                                       | Number of seconds after the container has started before liveness probe is initiated                                                      |
-| livenessProbe.periodSeconds                | int     | 10                                                                                                                                                       | How often (in seconds) to perform the liveness probe                                                                                      |
-| maxUnavailablePodDisruptionBudget          | string  | ""                                                                                                                                                       | Disruption Budget Configuration                                                                                                           |
-| readinessProbe                             | object  | {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":5,"periodSeconds":5}                                                | Readiness probe configuration                                                                                                             |
-| readinessProbe.failureThreshold            | int     | 3                                                                                                                                                        | Number of times after which if a probe fails in a row, Kubernetes considers that the overall check has failed: the container is not ready |
-| readinessProbe.httpGet                     | object  | {"path":"/health","port":8000}                                                                                                                           | Configuration of the Kubelet http request on the server                                                                                   |
-| readinessProbe.httpGet.path                | string  | "/health"                                                                                                                                                | Path to access on the HTTP server                                                                                                         |
-| readinessProbe.httpGet.port                | int     | 8000                                                                                                                                                     | Name or number of the port to access on the container, on which the server is listening                                                   |
-| readinessProbe.initialDelaySeconds         | int     | 5                                                                                                                                                        | Number of seconds after the container has started before readiness probe is initiated                                                     |
-| readinessProbe.periodSeconds               | int     | 5                                                                                                                                                        | How often (in seconds) to perform the readiness probe                                                                                     |
-| replicaCount                               | int     | 1                                                                                                                                                        | Number of replicas                                                                                                                        |
-| resources                                  | object  | {"limits":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1},"requests":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1}}                                          | Resource configuration                                                                                                                    |
-| resources.limits."nvidia.com/gpu"          | int     | 1                                                                                                                                                        | Number of gpus used                                                                                                                       |
-| resources.limits.cpu                       | int     | 4                                                                                                                                                        | Number of CPUs                                                                                                                            |
-| resources.limits.memory                    | string  | "16Gi"                                                                                                                                                   | CPU memory configuration                                                                                                                  |
-| resources.requests."nvidia.com/gpu"        | int     | 1                                                                                                                                                        | Number of gpus used                                                                                                                       |
-| resources.requests.cpu                     | int     | 4                                                                                                                                                        | Number of CPUs                                                                                                                            |
-| resources.requests.memory                  | string  | "16Gi"                                                                                                                                                   | CPU memory configuration                                                                                                                  |
-| secrets                                    | object  | {}                                                                                                                                                       | Secrets configuration                                                                                                                     |
-| serviceName                                | string  | Service name                                                                                                                                             |                                                                                                                                           |
-| servicePort                                | int     | 80                                                                                                                                                       | Service port                                                                                                                              |
-| labels.environment                         | string  | test                                                                                                                                                     | Environment name                                                                                                                          |
+The following table describes configurable parameters of the chart in `values.yaml`:
+
+| Key | Type | Default | Description |
+|-----|------|---------|-------------|
+| autoscaling | object | {"enabled":false,"maxReplicas":100,"minReplicas":1,"targetCPUUtilizationPercentage":80} | Autoscaling configuration |
+| autoscaling.enabled | bool | false | Enable autoscaling |
+| autoscaling.maxReplicas | int | 100 | Maximum replicas |
+| autoscaling.minReplicas | int | 1 | Minimum replicas |
+| autoscaling.targetCPUUtilizationPercentage | int | 80 | Target CPU utilization for autoscaling |
+| configs | object | {} | Configmap |
+| containerPort | int | 8000 | Container port |
+| customObjects | list | [] | Custom Objects configuration |
+| deploymentStrategy | object | {} | Deployment strategy configuration |
+| externalConfigs | list | [] | External configuration |
+| extraContainers | list | [] | Additional containers configuration |
+| extraInit | object | {"pvcStorage":"1Gi","s3modelpath":"relative_s3_model_path/opt-125m", "awsEc2MetadataDisabled": true} | Additional configuration for the init container |
+| extraInit.pvcStorage | string | "1Gi" | Storage size of the S3 |
+| extraInit.s3modelpath | string | "relative_s3_model_path/opt-125m" | Path of the model on S3, which hosts the model weights and config files |
+| extraInit.awsEc2MetadataDisabled | boolean | true | Disables the use of the Amazon EC2 instance metadata service |
+| extraPorts | list | [] | Additional ports configuration |
+| gpuModels | list | ["TYPE_GPU_USED"] | Type of GPU used |
+| image | object | {"command":["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"],"repository":"vllm/vllm-openai","tag":"latest"} | Image configuration |
+| image.command | list | ["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"] | Container launch command |
+| image.repository | string | "vllm/vllm-openai" | Image repository |
+| image.tag | string | "latest" | Image tag |
+| livenessProbe | object | {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":15,"periodSeconds":10} | Liveness probe configuration |
+| livenessProbe.failureThreshold | int | 3 | Number of consecutive probe failures after which Kubernetes considers the overall check to have failed: the container is not alive |
+| livenessProbe.httpGet | object | {"path":"/health","port":8000} | Configuration of the kubelet HTTP request on the server |
+| livenessProbe.httpGet.path | string | "/health" | Path to access on the HTTP server |
+| livenessProbe.httpGet.port | int | 8000 | Name or number of the port to access on the container, on which the server is listening |
+| livenessProbe.initialDelaySeconds | int | 15 | Number of seconds after the container has started before liveness probe is initiated |
+| livenessProbe.periodSeconds | int | 10 | How often (in seconds) to perform the liveness probe |
+| maxUnavailablePodDisruptionBudget | string | "" | Disruption Budget Configuration |
+| readinessProbe | object | {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":5,"periodSeconds":5} | Readiness probe configuration |
+| readinessProbe.failureThreshold | int | 3 | Number of consecutive probe failures after which Kubernetes considers the overall check to have failed: the container is not ready |
+| readinessProbe.httpGet | object | {"path":"/health","port":8000} | Configuration of the kubelet HTTP request on the server |
+| readinessProbe.httpGet.path | string | "/health" | Path to access on the HTTP server |
+| readinessProbe.httpGet.port | int | 8000 | Name or number of the port to access on the container, on which the server is listening |
+| readinessProbe.initialDelaySeconds | int | 5 | Number of seconds after the container has started before readiness probe is initiated |
+| readinessProbe.periodSeconds | int | 5 | How often (in seconds) to perform the readiness probe |
+| replicaCount | int | 1 | Number of replicas |
+| resources | object | {"limits":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1},"requests":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1}} | Resource configuration |
+| resources.limits."nvidia.com/gpu" | int | 1 | Number of GPUs used |
+| resources.limits.cpu | int | 4 | Number of CPUs |
+| resources.limits.memory | string | "16Gi" | CPU memory configuration |
+| resources.requests."nvidia.com/gpu" | int | 1 | Number of GPUs used |
+| resources.requests.cpu | int | 4 | Number of CPUs |
+| resources.requests.memory | string | "16Gi" | CPU memory configuration |
+| secrets | object | {} | Secrets configuration |
+| serviceName | string | "" | Service name |
+| servicePort | int | 80 | Service port |
+| labels.environment | string | test | Environment name |