[doc] Fold long code blocks to improve readability (#19926)

Signed-off-by: reidliu41 <reid201711@gmail.com> Co-authored-by: reidliu41 <reid201711@gmail.com>

[doc] Fold long code blocks to improve readability (#19926)
Signed-off-by: reidliu41 <reid201711@gmail.com> Co-authored-by: reidliu41 <reid201711@gmail.com>
f17aec0d · Reid · GitHub · 493c2753 · f17aec0d · f17aec0d
Unverified Commit f17aec0d authored Jun 23, 2025 by Reid Committed by GitHub Jun 23, 2025
20 changed files
--- a/docs/ci/update_pytorch_version.md
+++ b/docs/ci/update_pytorch_version.md
@@ -91,7 +91,7 @@ source to unblock the update process.
 ### FlashInfer
 Here is how to build and install it from source with torch2.7.0+cu128 in vLLM [Dockerfile](https://github.com/vllm-project/vllm/blob/27bebcd89792d5c4b08af7a65095759526f2f9e1/docker/Dockerfile#L259-L271):

-```
+```bash
 export TORCH_CUDA_ARCH_LIST='7.5 8.0 8.9 9.0 10.0+PTX'
 export FLASHINFER_ENABLE_SM90=1
 uv pip install --system --no-build-isolation "git+https://github.com/flashinfer-ai/flashinfer@v0.2.6.post1"
@@ -105,14 +105,14 @@ team if you want to get the package published there.
 ### xFormers
 Similar to FlashInfer, here is how to build and install xFormers from source:

-```
+```bash
 export TORCH_CUDA_ARCH_LIST='7.0 7.5 8.0 8.9 9.0 10.0+PTX'
 MAX_JOBS=16 uv pip install --system --no-build-isolation "git+https://github.com/facebookresearch/xformers@v0.0.30"
 ```

 ### Mamba

-```
+```bash
 uv pip install --system --no-build-isolation "git+https://github.com/state-spaces/mamba@v2.2.4"
 ```


--- a/docs/cli/README.md
+++ b/docs/cli/README.md
@@ -16,35 +16,33 @@ vllm {chat,complete,serve,bench,collect-env,run-batch}

 Start the vLLM OpenAI Compatible API server.

-Examples:
+??? Examples

-```bash
-# Start with a model
-vllm serve meta-llama/Llama-2-7b-hf
+    ```bash
+    # Start with a model
+    vllm serve meta-llama/Llama-2-7b-hf

-# Specify the port
-vllm serve meta-llama/Llama-2-7b-hf --port 8100
+    # Specify the port
+    vllm serve meta-llama/Llama-2-7b-hf --port 8100

-# Check with --help for more options
-# To list all groups
-vllm serve --help=listgroup
+    # Check with --help for more options
+    # To list all groups
+    vllm serve --help=listgroup

-# To view a argument group
-vllm serve --help=ModelConfig
+    # To view a argument group
+    vllm serve --help=ModelConfig

-# To view a single argument
-vllm serve --help=max-num-seqs
+    # To view a single argument
+    vllm serve --help=max-num-seqs

-# To search by keyword
-vllm serve --help=max
-```
+    # To search by keyword
+    vllm serve --help=max
+    ```

 ## chat

 Generate chat completions via the running API server.

-Examples:
-
 ```bash
 # Directly connect to localhost API without arguments
 vllm chat
@@ -60,8 +58,6 @@ vllm chat --quick "hi"

 Generate text completions based on the given prompt via the running API server.

-Examples:
-
 ```bash
 # Directly connect to localhost API without arguments
 vllm complete
@@ -73,6 +69,8 @@ vllm complete --url http://{vllm-serve-host}:{vllm-serve-port}/v1
 vllm complete --quick "The future of AI is"
 ```

+</details>
+
 ## bench

 Run benchmark tests for latency online serving throughput and offline inference throughput.
@@ -89,8 +87,6 @@ vllm bench {latency, serve, throughput}

 Benchmark the latency of a single batch of requests.

-Example:
-
 ```bash
 vllm bench latency \
    --model meta-llama/Llama-3.2-1B-Instruct \
@@ -104,8 +100,6 @@ vllm bench latency \

 Benchmark the online serving throughput.

-Example:
-
 ```bash
 vllm bench serve \
    --model meta-llama/Llama-3.2-1B-Instruct \
@@ -120,8 +114,6 @@ vllm bench serve \

 Benchmark offline inference throughput.

-Example:
-
 ```bash
 vllm bench throughput \
    --model meta-llama/Llama-3.2-1B-Instruct \
@@ -143,7 +135,8 @@ vllm collect-env

 Run batch prompts and write results to file.

-Examples:
+<details>
+<summary>Examples</summary>

 ```bash
 # Running with a local file
@@ -159,6 +152,8 @@ vllm run-batch \
    --model meta-llama/Meta-Llama-3-8B-Instruct
 ```

+</details>
+
 ## More Help

 For detailed options of any subcommand, use:

--- a/docs/configuration/conserving_memory.md
+++ b/docs/configuration/conserving_memory.md
@@ -57,19 +57,21 @@ By default, we optimize model inference using CUDA graphs which take up extra me

 You can adjust `compilation_config` to achieve a better balance between inference speed and memory usage:

-```python
-from vllm import LLM
-from vllm.config import CompilationConfig, CompilationLevel
-
-llm = LLM(
-    model="meta-llama/Llama-3.1-8B-Instruct",
-    compilation_config=CompilationConfig(
-        level=CompilationLevel.PIECEWISE,
-        # By default, it goes up to max_num_seqs
-        cudagraph_capture_sizes=[1, 2, 4, 8, 16],
-    ),
-)
-```
+??? Code
+
+    ```python
+    from vllm import LLM
+    from vllm.config import CompilationConfig, CompilationLevel
+
+    llm = LLM(
+        model="meta-llama/Llama-3.1-8B-Instruct",
+        compilation_config=CompilationConfig(
+            level=CompilationLevel.PIECEWISE,
+            # By default, it goes up to max_num_seqs
+            cudagraph_capture_sizes=[1, 2, 4, 8, 16],
+        ),
+    )
+    ```

 You can disable graph capturing completely via the `enforce_eager` flag:

@@ -127,18 +129,20 @@ reduce the size of the processed multi-modal inputs, which in turn saves memory.

 Here are some examples:

-```python
-from vllm import LLM
+??? Code

-# Available for Qwen2-VL series models
-llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
-          mm_processor_kwargs={
-              "max_pixels": 768 * 768,  # Default is 1280 * 28 * 28
-          })
-
-# Available for InternVL series models
-llm = LLM(model="OpenGVLab/InternVL2-2B",
-          mm_processor_kwargs={
-              "max_dynamic_patch": 4,  # Default is 12
-          })
-```
+    ```python
+    from vllm import LLM
+
+    # Available for Qwen2-VL series models
+    llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
+            mm_processor_kwargs={
+                "max_pixels": 768 * 768,  # Default is 1280 * 28 * 28
+            })
+
+    # Available for InternVL series models
+    llm = LLM(model="OpenGVLab/InternVL2-2B",
+            mm_processor_kwargs={
+                "max_dynamic_patch": 4,  # Default is 12
+            })
+    ```
--- a/docs/configuration/env_vars.md
+++ b/docs/configuration/env_vars.md
@@ -7,6 +7,8 @@ vLLM uses the following environment variables to configure the system:

    All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service as `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables).

-```python
--8<-- "vllm/envs.py:env-vars-definition"
-```
+??? Code
+
+    ```python
+    --8<-- "vllm/envs.py:env-vars-definition"
+    ```
--- a/docs/contributing/README.md
+++ b/docs/contributing/README.md
@@ -93,25 +93,27 @@ For additional features and advanced configurations, refer to the official [MkDo

 ## Testing

-```bash
-pip install -r requirements/dev.txt
+??? note "Commands"

-# Linting, formatting and static type checking
-pre-commit install --hook-type pre-commit --hook-type commit-msg
+    ```bash
+    pip install -r requirements/dev.txt

-# You can manually run pre-commit with
-pre-commit run --all-files
+    # Linting, formatting and static type checking
+    pre-commit install --hook-type pre-commit --hook-type commit-msg

-# To manually run something from CI that does not run
-# locally by default, you can run:
-pre-commit run mypy-3.9 --hook-stage manual --all-files
+    # You can manually run pre-commit with
+    pre-commit run --all-files

-# Unit tests
-pytest tests/
+    # To manually run something from CI that does not run
+    # locally by default, you can run:
+    pre-commit run mypy-3.9 --hook-stage manual --all-files

-# Run tests for a single test file with detailed output
-pytest -s -v tests/test_logger.py
-```
+    # Unit tests
+    pytest tests/
+
+    # Run tests for a single test file with detailed output
+    pytest -s -v tests/test_logger.py
+    ```

 !!! tip
    Since the <gh-file:docker/Dockerfile> ships with Python 3.12, all tests in CI (except `mypy`) are run with Python 3.12.

--- a/docs/contributing/model/basic.md
+++ b/docs/contributing/model/basic.md
@@ -27,33 +27,35 @@ All vLLM modules within the model must include a `prefix` argument in their cons

 The initialization code should look like this:

-```python
-from torch import nn
-from vllm.config import VllmConfig
-from vllm.attention import Attention
-
-class MyAttention(nn.Module):
-    def __init__(self, vllm_config: VllmConfig, prefix: str):
-        super().__init__()
-        self.attn = Attention(prefix=f"{prefix}.attn")
-
-class MyDecoderLayer(nn.Module):
-    def __init__(self, vllm_config: VllmConfig, prefix: str):
-        super().__init__()
-        self.self_attn = MyAttention(prefix=f"{prefix}.self_attn")
-
-class MyModel(nn.Module):
-    def __init__(self, vllm_config: VllmConfig, prefix: str):
-        super().__init__()
-        self.layers = nn.ModuleList(
-            [MyDecoderLayer(vllm_config, prefix=f"{prefix}.layers.{i}") for i in range(vllm_config.model_config.hf_config.num_hidden_layers)]
-        )
-
-class MyModelForCausalLM(nn.Module):
-    def __init__(self, vllm_config: VllmConfig, prefix: str = ""):
-        super().__init__()
-        self.model = MyModel(vllm_config, prefix=f"{prefix}.model")
-```
+??? Code
+
+    ```python
+    from torch import nn
+    from vllm.config import VllmConfig
+    from vllm.attention import Attention
+
+    class MyAttention(nn.Module):
+        def __init__(self, vllm_config: VllmConfig, prefix: str):
+            super().__init__()
+            self.attn = Attention(prefix=f"{prefix}.attn")
+
+    class MyDecoderLayer(nn.Module):
+        def __init__(self, vllm_config: VllmConfig, prefix: str):
+            super().__init__()
+            self.self_attn = MyAttention(prefix=f"{prefix}.self_attn")
+
+    class MyModel(nn.Module):
+        def __init__(self, vllm_config: VllmConfig, prefix: str):
+            super().__init__()
+            self.layers = nn.ModuleList(
+                [MyDecoderLayer(vllm_config, prefix=f"{prefix}.layers.{i}") for i in range(vllm_config.model_config.hf_config.num_hidden_layers)]
+            )
+
+    class MyModelForCausalLM(nn.Module):
+        def __init__(self, vllm_config: VllmConfig, prefix: str = ""):
+            super().__init__()
+            self.model = MyModel(vllm_config, prefix=f"{prefix}.model")
+    ```

 ### Computation Code


--- a/docs/contributing/model/multimodal.md
+++ b/docs/contributing/model/multimodal.md
@@ -25,59 +25,63 @@ Further update the model as follows:

 - Implement [get_multimodal_embeddings][vllm.model_executor.models.interfaces.SupportsMultiModal.get_multimodal_embeddings] that returns the embeddings from running the multimodal inputs through the multimodal tokenizer of the model. Below we provide a boilerplate of a typical implementation pattern, but feel free to adjust it to your own needs.

-    ```python
-    class YourModelForImage2Seq(nn.Module):
-        ...
+    ??? Code

-        def _process_image_input(self, image_input: YourModelImageInputs) -> torch.Tensor:
+        ```python
+        class YourModelForImage2Seq(nn.Module):
+            ...

-            assert self.vision_encoder is not None
-            image_features = self.vision_encoder(image_input)
-            return self.multi_modal_projector(image_features)
+            def _process_image_input(self, image_input: YourModelImageInputs) -> torch.Tensor:

-        def get_multimodal_embeddings(
-                self, **kwargs: object) -> Optional[MultiModalEmbeddings]:
+                assert self.vision_encoder is not None
+                image_features = self.vision_encoder(image_input)
+                return self.multi_modal_projector(image_features)

-            # Validate the multimodal input keyword arguments
-            image_input = self._parse_and_validate_image_input(**kwargs)
-            if image_input is None:
-                return None
+            def get_multimodal_embeddings(
+                    self, **kwargs: object) -> Optional[MultiModalEmbeddings]:

-            # Run multimodal inputs through encoder and projector
-            vision_embeddings = self._process_image_input(image_input)
-            return vision_embeddings
-    ```
+                # Validate the multimodal input keyword arguments
+                image_input = self._parse_and_validate_image_input(**kwargs)
+                if image_input is None:
+                    return None
+
+                # Run multimodal inputs through encoder and projector
+                vision_embeddings = self._process_image_input(image_input)
+                return vision_embeddings
+        ```

 !!! important
    The returned `multimodal_embeddings` must be either a **3D [torch.Tensor][]** of shape `(num_items, feature_size, hidden_size)`, or a **list / tuple of 2D [torch.Tensor][]'s** of shape `(feature_size, hidden_size)`, so that `multimodal_embeddings[i]` retrieves the embeddings generated from the `i`-th multimodal data item (e.g, image) of the request.

 - Implement [get_input_embeddings][vllm.model_executor.models.interfaces.SupportsMultiModal.get_input_embeddings] to merge `multimodal_embeddings` with text embeddings from the `input_ids`. If input processing for the model is implemented correctly (see sections below), then you can leverage the utility function we provide to easily merge the embeddings.

-    ```python
-    from .utils import merge_multimodal_embeddings
+    ??? Code

-    class YourModelForImage2Seq(nn.Module):
-        ...
+        ```python
+        from .utils import merge_multimodal_embeddings

-        def get_input_embeddings(
-            self,
-            input_ids: torch.Tensor,
-            multimodal_embeddings: Optional[MultiModalEmbeddings] = None,
-        ) -> torch.Tensor:
-
-            # `get_input_embeddings` should already be implemented for the language 
-            # model as one of the requirements of basic vLLM model implementation.
-            inputs_embeds = self.language_model.get_input_embeddings(input_ids)
-
-            if multimodal_embeddings is not None:
-                inputs_embeds = merge_multimodal_embeddings(
-                    input_ids=input_ids, 
-                    inputs_embeds=inputs_embeds, 
-                    multimodal_embeddings=multimodal_embeddings,
-                    placeholder_token_id=self.config.image_token_index)
-
-            return inputs_embeds
-    ```
+        class YourModelForImage2Seq(nn.Module):
+            ...
+
+            def get_input_embeddings(
+                self,
+                input_ids: torch.Tensor,
+                multimodal_embeddings: Optional[MultiModalEmbeddings] = None,
+            ) -> torch.Tensor:
+
+                # `get_input_embeddings` should already be implemented for the language 
+                # model as one of the requirements of basic vLLM model implementation.
+                inputs_embeds = self.language_model.get_input_embeddings(input_ids)
+
+                if multimodal_embeddings is not None:
+                    inputs_embeds = merge_multimodal_embeddings(
+                        input_ids=input_ids, 
+                        inputs_embeds=inputs_embeds, 
+                        multimodal_embeddings=multimodal_embeddings,
+                        placeholder_token_id=self.config.image_token_index)
+
+                return inputs_embeds
+        ```

 - Implement [get_language_model][vllm.model_executor.models.interfaces.SupportsMultiModal.get_language_model] getter to provide stable access to the underlying language model.

@@ -135,42 +139,46 @@ Assuming that the memory usage increases with the number of tokens, the dummy in

    Looking at the code of HF's `LlavaForConditionalGeneration`:

-    ```python
-    # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L530-L544
-    n_image_tokens = (input_ids == self.config.image_token_index).sum().item()
-    n_image_features = image_features.shape[0] * image_features.shape[1]
+    ??? Code

-    if n_image_tokens != n_image_features:
-        raise ValueError(
-            f"Image features and image tokens do not match: tokens: {n_image_tokens}, features {n_image_features}"
+        ```python
+        # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L530-L544
+        n_image_tokens = (input_ids == self.config.image_token_index).sum().item()
+        n_image_features = image_features.shape[0] * image_features.shape[1]
+
+        if n_image_tokens != n_image_features:
+            raise ValueError(
+                f"Image features and image tokens do not match: tokens: {n_image_tokens}, features {n_image_features}"
+            )
+        special_image_mask = (
+            (input_ids == self.config.image_token_index)
+            .unsqueeze(-1)
+            .expand_as(inputs_embeds)
+            .to(inputs_embeds.device)
        )
-    special_image_mask = (
-        (input_ids == self.config.image_token_index)
-        .unsqueeze(-1)
-        .expand_as(inputs_embeds)
-        .to(inputs_embeds.device)
-    )
-    image_features = image_features.to(inputs_embeds.device, inputs_embeds.dtype)
-    inputs_embeds = inputs_embeds.masked_scatter(special_image_mask, image_features)
-    ```
+        image_features = image_features.to(inputs_embeds.device, inputs_embeds.dtype)
+        inputs_embeds = inputs_embeds.masked_scatter(special_image_mask, image_features)
+        ```

    The number of placeholder feature tokens per image is `image_features.shape[1]`.
    `image_features` is calculated inside the `get_image_features` method:

-    ```python
-    # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L290-L300
-    image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)
-
-    selected_image_feature = image_outputs.hidden_states[vision_feature_layer]
-    if vision_feature_select_strategy == "default":
-        selected_image_feature = selected_image_feature[:, 1:]
-    elif vision_feature_select_strategy == "full":
-        selected_image_feature = selected_image_feature
-    else:
-        raise ValueError(f"Unexpected select feature strategy: {self.config.vision_feature_select_strategy}")
-    image_features = self.multi_modal_projector(selected_image_feature)
-    return image_features
-    ```
+    ??? Code
+
+        ```python
+        # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L290-L300
+        image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)
+
+        selected_image_feature = image_outputs.hidden_states[vision_feature_layer]
+        if vision_feature_select_strategy == "default":
+            selected_image_feature = selected_image_feature[:, 1:]
+        elif vision_feature_select_strategy == "full":
+            selected_image_feature = selected_image_feature
+        else:
+            raise ValueError(f"Unexpected select feature strategy: {self.config.vision_feature_select_strategy}")
+        image_features = self.multi_modal_projector(selected_image_feature)
+        return image_features
+        ```

    We can infer that `image_features.shape[1]` is based on `image_outputs.hidden_states.shape[1]` from the vision tower
    (`CLIPVisionModel` for the [`llava-hf/llava-1.5-7b-hf`](https://huggingface.co/llava-hf/llava-1.5-7b-hf) model).
@@ -193,20 +201,22 @@ Assuming that the memory usage increases with the number of tokens, the dummy in

    To find the sequence length, we turn to the code of `CLIPVisionEmbeddings`:

-    ```python
-    # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/modeling_clip.py#L247-L257
-    target_dtype = self.patch_embedding.weight.dtype
-    patch_embeds = self.patch_embedding(pixel_values.to(dtype=target_dtype))  # shape = [*, width, grid, grid]
-    patch_embeds = patch_embeds.flatten(2).transpose(1, 2)
-
-    class_embeds = self.class_embedding.expand(batch_size, 1, -1)
-    embeddings = torch.cat([class_embeds, patch_embeds], dim=1)
-    if interpolate_pos_encoding:
-        embeddings = embeddings + self.interpolate_pos_encoding(embeddings, height, width)
-    else:
-        embeddings = embeddings + self.position_embedding(self.position_ids)
-    return embeddings
-    ```
+    ??? Code
+
+        ```python
+        # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/modeling_clip.py#L247-L257
+        target_dtype = self.patch_embedding.weight.dtype
+        patch_embeds = self.patch_embedding(pixel_values.to(dtype=target_dtype))  # shape = [*, width, grid, grid]
+        patch_embeds = patch_embeds.flatten(2).transpose(1, 2)
+
+        class_embeds = self.class_embedding.expand(batch_size, 1, -1)
+        embeddings = torch.cat([class_embeds, patch_embeds], dim=1)
+        if interpolate_pos_encoding:
+            embeddings = embeddings + self.interpolate_pos_encoding(embeddings, height, width)
+        else:
+            embeddings = embeddings + self.position_embedding(self.position_ids)
+        return embeddings
+        ```

    We can infer that `embeddings.shape[1] == self.num_positions`, where

@@ -218,55 +228,59 @@ Assuming that the memory usage increases with the number of tokens, the dummy in

    Overall, the number of placeholder feature tokens for an image can be calculated as:

-    ```python
-    def get_num_image_tokens(
-        self,
-        *,
-        image_width: int,
-        image_height: int,
-    ) -> int:
-        hf_config = self.get_hf_config()
-        hf_processor = self.get_hf_processor()
+    ??? Code

-        image_size = hf_config.vision_config.image_size
-        patch_size = hf_config.vision_config.patch_size
+        ```python
+        def get_num_image_tokens(
+            self,
+            *,
+            image_width: int,
+            image_height: int,
+        ) -> int:
+            hf_config = self.get_hf_config()
+            hf_processor = self.get_hf_processor()

-        num_image_tokens = (image_size // patch_size) ** 2 + 1
-        if hf_processor.vision_feature_select_strategy == "default":
-            num_image_tokens -= 1
+            image_size = hf_config.vision_config.image_size
+            patch_size = hf_config.vision_config.patch_size

-        return num_image_tokens
-    ```
+            num_image_tokens = (image_size // patch_size) ** 2 + 1
+            if hf_processor.vision_feature_select_strategy == "default":
+                num_image_tokens -= 1
+
+            return num_image_tokens
+        ```

    Notice that the number of image tokens doesn't depend on the image width and height.
    We can simply use a dummy `image_size` to calculate the multimodal profiling data:

-    ```python
-    # NOTE: In actuality, this is usually implemented as part of the
-    # model's subclass of `BaseProcessingInfo`, but we show it as is
-    # here for simplicity.
-    def get_image_size_with_most_features(self) -> ImageSize:
-        hf_config = self.get_hf_config()
-        width = height = hf_config.image_size
-        return ImageSize(width=width, height=height)
+    ??? Code

-    def get_dummy_mm_data(
-        self,
-        seq_len: int,
-        mm_counts: Mapping[str, int],
-    ) -> MultiModalDataDict:
-        num_images = mm_counts.get("image", 0)
-
-        target_width, target_height = \
-            self.info.get_image_size_with_most_features()
+        ```python
+        # NOTE: In actuality, this is usually implemented as part of the
+        # model's subclass of `BaseProcessingInfo`, but we show it as is
+        # here for simplicity.
+        def get_image_size_with_most_features(self) -> ImageSize:
+            hf_config = self.get_hf_config()
+            width = height = hf_config.image_size
+            return ImageSize(width=width, height=height)

-        return {
-            "image":
-            self._get_dummy_images(width=target_width,
-                                   height=target_height,
-                                   num_images=num_images)
-        }
-    ```
+        def get_dummy_mm_data(
+            self,
+            seq_len: int,
+            mm_counts: Mapping[str, int],
+        ) -> MultiModalDataDict:
+            num_images = mm_counts.get("image", 0)
+
+            target_width, target_height = \
+                self.info.get_image_size_with_most_features()
+
+            return {
+                "image":
+                self._get_dummy_images(width=target_width,
+                                    height=target_height,
+                                    num_images=num_images)
+            }
+        ```

    For the text, we simply expand the multimodal image token from the model config to match the desired number of images.

@@ -284,21 +298,23 @@ Assuming that the memory usage increases with the number of tokens, the dummy in

    Looking at the code of HF's `FuyuForCausalLM`:

-    ```python
-    # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/modeling_fuyu.py#L311-L322
-    if image_patches is not None and past_key_values is None:
-        patch_embeddings = [
-            self.vision_embed_tokens(patch.to(self.vision_embed_tokens.weight.dtype))
-            .squeeze(0)
-            .to(inputs_embeds.device)
-            for patch in image_patches
-        ]
-        inputs_embeds = self.gather_continuous_embeddings(
-            word_embeddings=inputs_embeds,
-            continuous_embeddings=patch_embeddings,
-            image_patch_input_indices=image_patches_indices,
-        )
-    ```
+    ??? Code
+
+        ```python
+        # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/modeling_fuyu.py#L311-L322
+        if image_patches is not None and past_key_values is None:
+            patch_embeddings = [
+                self.vision_embed_tokens(patch.to(self.vision_embed_tokens.weight.dtype))
+                .squeeze(0)
+                .to(inputs_embeds.device)
+                for patch in image_patches
+            ]
+            inputs_embeds = self.gather_continuous_embeddings(
+                word_embeddings=inputs_embeds,
+                continuous_embeddings=patch_embeddings,
+                image_patch_input_indices=image_patches_indices,
+            )
+        ```

    The number of placeholder feature tokens for the `i`th item in the batch is `patch_embeddings[i].shape[0]`,
    which is the same as `image_patches[i].shape[0]`, i.e. `num_total_patches`.
@@ -312,92 +328,98 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
    In `FuyuImageProcessor.preprocess`, the images are resized and padded to the target `FuyuImageProcessor.size`,
    returning the dimensions after resizing (but before padding) as metadata.

-    ```python
-    # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L541-L544
-    image_encoding = self.image_processor.preprocess(images, **output_kwargs["images_kwargs"])
-    batch_images = image_encoding["images"]
-    image_unpadded_heights = image_encoding["image_unpadded_heights"]
-    image_unpadded_widths = image_encoding["image_unpadded_widths"]
-
-    # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L480-L
-    if do_resize:
-        batch_images = [
-            [self.resize(image, size=size, input_data_format=input_data_format) for image in images]
-            for images in batch_images
-        ]
-
-    image_sizes = [get_image_size(images[0], channel_dim=input_data_format) for images in batch_images]
-    image_unpadded_heights = [[image_size[0]] for image_size in image_sizes]
-    image_unpadded_widths = [[image_size[1]] for image_size in image_sizes]
-
-    if do_pad:
-        batch_images = [
-            [
-                self.pad_image(
-                    image,
-                    size=size,
-                    mode=padding_mode,
-                    constant_values=padding_value,
-                    input_data_format=input_data_format,
-                )
-                for image in images
+    ??? Code
+
+        ```python
+        # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L541-L544
+        image_encoding = self.image_processor.preprocess(images, **output_kwargs["images_kwargs"])
+        batch_images = image_encoding["images"]
+        image_unpadded_heights = image_encoding["image_unpadded_heights"]
+        image_unpadded_widths = image_encoding["image_unpadded_widths"]
+
+        # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L480-L
+        if do_resize:
+            batch_images = [
+                [self.resize(image, size=size, input_data_format=input_data_format) for image in images]
+                for images in batch_images
            ]
-            for images in batch_images
-        ]
-    ```

-    In `FuyuImageProcessor.preprocess_with_tokenizer_info`, the images are split into patches based on this metadata:
+        image_sizes = [get_image_size(images[0], channel_dim=input_data_format) for images in batch_images]
+        image_unpadded_heights = [[image_size[0]] for image_size in image_sizes]
+        image_unpadded_widths = [[image_size[1]] for image_size in image_sizes]
+
+        if do_pad:
+            batch_images = [
+                [
+                    self.pad_image(
+                        image,
+                        size=size,
+                        mode=padding_mode,
+                        constant_values=padding_value,
+                        input_data_format=input_data_format,
+                    )
+                    for image in images
+                ]
+                for images in batch_images
+            ]
+        ```

-    ```python
-    # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L425
-    model_image_input = self.image_processor.preprocess_with_tokenizer_info(
-        image_input=tensor_batch_images,
-        image_present=image_present,
-        image_unpadded_h=image_unpadded_heights,
-        image_unpadded_w=image_unpadded_widths,
-        image_placeholder_id=image_placeholder_id,
-        image_newline_id=image_newline_id,
-        variable_sized=True,
-    )
+    In `FuyuImageProcessor.preprocess_with_tokenizer_info`, the images are split into patches based on this metadata:

-    # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L638-L658
-    image_height, image_width = image.shape[1], image.shape[2]
-    if variable_sized:  # variable_sized=True
-        new_h = min(
-            image_height,
-            math.ceil(image_unpadded_h[batch_index, subseq_index] / patch_height) * patch_height,
-        )
-        new_w = min(
-            image_width,
-            math.ceil(image_unpadded_w[batch_index, subseq_index] / patch_width) * patch_width,
+    ??? Code
+
+        ```python
+        # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L425
+        model_image_input = self.image_processor.preprocess_with_tokenizer_info(
+            image_input=tensor_batch_images,
+            image_present=image_present,
+            image_unpadded_h=image_unpadded_heights,
+            image_unpadded_w=image_unpadded_widths,
+            image_placeholder_id=image_placeholder_id,
+            image_newline_id=image_newline_id,
+            variable_sized=True,
        )
-        image = image[:, :new_h, :new_w]
-        image_height, image_width = new_h, new_w

-    num_patches = self.get_num_patches(image_height=image_height, image_width=image_width)
-    tensor_of_image_ids = torch.full(
-        [num_patches], image_placeholder_id, dtype=torch.int32, device=image_input.device
-    )
-    patches = self.patchify_image(image=image.unsqueeze(0)).squeeze(0)
-    assert num_patches == patches.shape[0]
-    ```
+        # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L638-L658
+        image_height, image_width = image.shape[1], image.shape[2]
+        if variable_sized:  # variable_sized=True
+            new_h = min(
+                image_height,
+                math.ceil(image_unpadded_h[batch_index, subseq_index] / patch_height) * patch_height,
+            )
+            new_w = min(
+                image_width,
+                math.ceil(image_unpadded_w[batch_index, subseq_index] / patch_width) * patch_width,
+            )
+            image = image[:, :new_h, :new_w]
+            image_height, image_width = new_h, new_w
+
+        num_patches = self.get_num_patches(image_height=image_height, image_width=image_width)
+        tensor_of_image_ids = torch.full(
+            [num_patches], image_placeholder_id, dtype=torch.int32, device=image_input.device
+        )
+        patches = self.patchify_image(image=image.unsqueeze(0)).squeeze(0)
+        assert num_patches == patches.shape[0]
+        ```

    The number of patches is in turn defined by `FuyuImageProcessor.get_num_patches`:

-    ```python
-    # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L552-L562
-    patch_size = patch_size if patch_size is not None else self.patch_size
-    patch_height, patch_width = self.patch_size["height"], self.patch_size["width"]
-
-    if image_height % patch_height != 0:
-        raise ValueError(f"{image_height=} must be divisible by {patch_height}")
-    if image_width % patch_width != 0:
-        raise ValueError(f"{image_width=} must be divisible by {patch_width}")
-
-    num_patches_per_dim_h = image_height // patch_height
-    num_patches_per_dim_w = image_width // patch_width
-    num_patches = num_patches_per_dim_h * num_patches_per_dim_w
-    ```
+    ??? Code
+
+        ```python
+        # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L552-L562
+        patch_size = patch_size if patch_size is not None else self.patch_size
+        patch_height, patch_width = self.patch_size["height"], self.patch_size["width"]
+
+        if image_height % patch_height != 0:
+            raise ValueError(f"{image_height=} must be divisible by {patch_height}")
+        if image_width % patch_width != 0:
+            raise ValueError(f"{image_width=} must be divisible by {patch_width}")
+
+        num_patches_per_dim_h = image_height // patch_height
+        num_patches_per_dim_w = image_width // patch_width
+        num_patches = num_patches_per_dim_h * num_patches_per_dim_w
+        ```

    These image patches correspond to placeholder tokens (`|SPEAKER|`). So, we just need to maximize the number of image patches. Since input images are first resized
    to fit within `image_processor.size`, we can maximize the number of image patches by inputting an image with size equal to `image_processor.size`.
@@ -419,23 +441,25 @@ Assuming that the memory usage increases with the number of tokens, the dummy in

    For the multimodal image profiling data, the logic is very similar to LLaVA:

-    ```python
-    def get_dummy_mm_data(
-        self,
-        seq_len: int,
-        mm_counts: Mapping[str, int],
-    ) -> MultiModalDataDict:
-        target_width, target_height = \
-            self.info.get_image_size_with_most_features()
-        num_images = mm_counts.get("image", 0)
+    ??? Code

-        return {
-            "image":
-            self._get_dummy_images(width=target_width,
-                                   height=target_height,
-                                   num_images=num_images)
-        }
-    ```
+        ```python
+        def get_dummy_mm_data(
+            self,
+            seq_len: int,
+            mm_counts: Mapping[str, int],
+        ) -> MultiModalDataDict:
+            target_width, target_height = \
+                self.info.get_image_size_with_most_features()
+            num_images = mm_counts.get("image", 0)
+
+            return {
+                "image":
+                self._get_dummy_images(width=target_width,
+                                    height=target_height,
+                                    num_images=num_images)
+            }
+        ```

 ## 4. Specify processing details

@@ -455,6 +479,7 @@ return a schema of the tensors outputted by the HF processor that are related to
    The output of `CLIPImageProcessor` is a simple tensor with shape
    `(num_images, num_channels, image_height, image_width)`:

+
    ```python
    # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/image_processing_clip.py#L339-L345
    images = [
@@ -505,35 +530,37 @@ return a schema of the tensors outputted by the HF processor that are related to
    In order to support the use of [MultiModalFieldConfig.batched][] like in LLaVA,
    we remove the extra batch dimension by overriding [BaseMultiModalProcessor._call_hf_processor][]:

-    ```python
-    def _call_hf_processor(
-        self,
-        prompt: str,
-        mm_data: Mapping[str, object],
-        mm_kwargs: Mapping[str, object],
-    ) -> BatchFeature:
-        processed_outputs = super()._call_hf_processor(
-            prompt=prompt,
-            mm_data=mm_data,
-            mm_kwargs=mm_kwargs,
-        )
+    ??? Code

-        image_patches = processed_outputs.get("image_patches")
-        if image_patches is not None:
-            images = mm_data["images"]
-            assert isinstance(images, list)
+        ```python
+        def _call_hf_processor(
+            self,
+            prompt: str,
+            mm_data: Mapping[str, object],
+            mm_kwargs: Mapping[str, object],
+        ) -> BatchFeature:
+            processed_outputs = super()._call_hf_processor(
+                prompt=prompt,
+                mm_data=mm_data,
+                mm_kwargs=mm_kwargs,
+            )

-            # Original output: (1, num_images, Pn, Px * Py * C)
-            # New output: (num_images, Pn, Px * Py * C)
-            assert (isinstance(image_patches, list)
-                    and len(image_patches) == 1)
-            assert (isinstance(image_patches[0], torch.Tensor)
-                    and len(image_patches[0]) == len(images))
+            image_patches = processed_outputs.get("image_patches")
+            if image_patches is not None:
+                images = mm_data["images"]
+                assert isinstance(images, list)

-            processed_outputs["image_patches"] = image_patches[0]
+                # Original output: (1, num_images, Pn, Px * Py * C)
+                # New output: (num_images, Pn, Px * Py * C)
+                assert (isinstance(image_patches, list)
+                        and len(image_patches) == 1)
+                assert (isinstance(image_patches[0], torch.Tensor)
+                        and len(image_patches[0]) == len(images))

-        return processed_outputs
-    ```
+                processed_outputs["image_patches"] = image_patches[0]
+
+            return processed_outputs
+        ```

    !!! note
        Our [actual code](gh-file:vllm/model_executor/models/fuyu.py) has special handling
@@ -573,35 +600,37 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
    It simply repeats each input `image_token` a number of times equal to the number of placeholder feature tokens (`num_image_tokens`).
    Based on this, we override [_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates] as follows:

-    ```python
-    def _get_prompt_updates(
-        self,
-        mm_items: MultiModalDataItems,
-        hf_processor_mm_kwargs: Mapping[str, object],
-        out_mm_kwargs: MultiModalKwargs,
-    ) -> Sequence[PromptUpdate]:
-        hf_config = self.info.get_hf_config()
-        image_token_id = hf_config.image_token_index
+    ??? Code

-        def get_replacement(item_idx: int):
-            images = mm_items.get_items("image", ImageProcessorItems)
-
-            image_size = images.get_image_size(item_idx)
-            num_image_tokens = self.info.get_num_image_tokens(
-                image_width=image_size.width,
-                image_height=image_size.height,
-            )
+        ```python
+        def _get_prompt_updates(
+            self,
+            mm_items: MultiModalDataItems,
+            hf_processor_mm_kwargs: Mapping[str, object],
+            out_mm_kwargs: MultiModalKwargs,
+        ) -> Sequence[PromptUpdate]:
+            hf_config = self.info.get_hf_config()
+            image_token_id = hf_config.image_token_index
+
+            def get_replacement(item_idx: int):
+                images = mm_items.get_items("image", ImageProcessorItems)
+
+                image_size = images.get_image_size(item_idx)
+                num_image_tokens = self.info.get_num_image_tokens(
+                    image_width=image_size.width,
+                    image_height=image_size.height,
+                )

-            return [image_token_id] * num_image_tokens
+                return [image_token_id] * num_image_tokens

-        return [
-            PromptReplacement(
-                modality="image",
-                target=[image_token_id],
-                replacement=get_replacement,
-            ),
-        ]
-    ```
+            return [
+                PromptReplacement(
+                    modality="image",
+                    target=[image_token_id],
+                    replacement=get_replacement,
+                ),
+            ]
+        ```

 === "Handling additional tokens: Fuyu"

@@ -616,117 +645,90 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies

    We define a helper function to return `ncols` and `nrows` directly:

-    ```python
-    def get_image_feature_grid_size(
-        self,
-        *,
-        image_width: int,
-        image_height: int,
-    ) -> tuple[int, int]:
-        image_processor = self.get_image_processor()
-        target_width = image_processor.size["width"]
-        target_height = image_processor.size["height"]
-        patch_width = image_processor.patch_size["width"]
-        patch_height = image_processor.patch_size["height"]
-
-        if not (image_width <= target_width and image_height <= target_height):
-            height_scale_factor = target_height / image_height
-            width_scale_factor = target_width / image_width
-            optimal_scale_factor = min(height_scale_factor, width_scale_factor)
-
-            image_height = int(image_height * optimal_scale_factor)
-            image_width = int(image_width * optimal_scale_factor)
-
-        ncols = math.ceil(image_width / patch_width)
-        nrows = math.ceil(image_height / patch_height)
-        return ncols, nrows
-    ```
+    ??? Code
+
+        ```python
+        def get_image_feature_grid_size(
+            self,
+            *,
+            image_width: int,
+            image_height: int,
+        ) -> tuple[int, int]:
+            image_processor = self.get_image_processor()
+            target_width = image_processor.size["width"]
+            target_height = image_processor.size["height"]
+            patch_width = image_processor.patch_size["width"]
+            patch_height = image_processor.patch_size["height"]
+
+            if not (image_width <= target_width and image_height <= target_height):
+                height_scale_factor = target_height / image_height
+                width_scale_factor = target_width / image_width
+                optimal_scale_factor = min(height_scale_factor, width_scale_factor)
+
+                image_height = int(image_height * optimal_scale_factor)
+                image_width = int(image_width * optimal_scale_factor)
+
+            ncols = math.ceil(image_width / patch_width)
+            nrows = math.ceil(image_height / patch_height)
+            return ncols, nrows
+        ```

    Based on this, we can initially define our replacement tokens as:

-    ```python
-    def get_replacement(item_idx: int):
-        images = mm_items.get_items("image", ImageProcessorItems)
-        image_size = images.get_image_size(item_idx)
+    ??? Code

-        ncols, nrows = self.info.get_image_feature_grid_size(
-            image_width=image_size.width,
-            image_height=image_size.height,
-        )
+        ```python
+        def get_replacement(item_idx: int):
+            images = mm_items.get_items("image", ImageProcessorItems)
+            image_size = images.get_image_size(item_idx)

-        # `_IMAGE_TOKEN_ID` corresponds to `|SPEAKER|`
-        # `_NEWLINE_TOKEN_ID` corresponds to `|NEWLINE|`
-        return ([_IMAGE_TOKEN_ID] * ncols + [_NEWLINE_TOKEN_ID]) * nrows
-    ```
+            ncols, nrows = self.info.get_image_feature_grid_size(
+                image_width=image_size.width,
+                image_height=image_size.height,
+            )
+
+            # `_IMAGE_TOKEN_ID` corresponds to `|SPEAKER|`
+            # `_NEWLINE_TOKEN_ID` corresponds to `|NEWLINE|`
+            return ([_IMAGE_TOKEN_ID] * ncols + [_NEWLINE_TOKEN_ID]) * nrows
+        ```

    However, this is not entirely correct. After `FuyuImageProcessor.preprocess_with_tokenizer_info` is called,
    a BOS token (`<s>`) is also added to the promopt:

-    ```python
-    # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L435
-    model_image_input = self.image_processor.preprocess_with_tokenizer_info(
-        image_input=tensor_batch_images,
-        image_present=image_present,
-        image_unpadded_h=image_unpadded_heights,
-        image_unpadded_w=image_unpadded_widths,
-        image_placeholder_id=image_placeholder_id,
-        image_newline_id=image_newline_id,
-        variable_sized=True,
-    )
-    prompt_tokens, prompts_length = _tokenize_prompts_with_image_and_batch(
-        tokenizer=self.tokenizer,
-        prompts=prompts,
-        scale_factors=scale_factors,
-        max_tokens_to_generate=self.max_tokens_to_generate,
-        max_position_embeddings=self.max_position_embeddings,
-        add_BOS=True,
-        add_beginning_of_answer_token=True,
-    )
-    ```
+    ??? Code
+
+        ```python
+        # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L435
+        model_image_input = self.image_processor.preprocess_with_tokenizer_info(
+            image_input=tensor_batch_images,
+            image_present=image_present,
+            image_unpadded_h=image_unpadded_heights,
+            image_unpadded_w=image_unpadded_widths,
+            image_placeholder_id=image_placeholder_id,
+            image_newline_id=image_newline_id,
+            variable_sized=True,
+        )
+        prompt_tokens, prompts_length = _tokenize_prompts_with_image_and_batch(
+            tokenizer=self.tokenizer,
+            prompts=prompts,
+            scale_factors=scale_factors,
+            max_tokens_to_generate=self.max_tokens_to_generate,
+            max_position_embeddings=self.max_position_embeddings,
+            add_BOS=True,
+            add_beginning_of_answer_token=True,
+        )
+        ```

    To assign the vision embeddings to only the image tokens, instead of a string
    you can return an instance of [PromptUpdateDetails][vllm.multimodal.processing.PromptUpdateDetails]:

-    ```python
-    hf_config = self.info.get_hf_config()
-    bos_token_id = hf_config.bos_token_id  # `<s>`
-    assert isinstance(bos_token_id, int)
-
-    def get_replacement_fuyu(item_idx: int):
-        images = mm_items.get_items("image", ImageProcessorItems)
-        image_size = images.get_image_size(item_idx)
+    ??? Code

-        ncols, nrows = self.info.get_image_feature_grid_size(
-            image_width=image_size.width,
-            image_height=image_size.height,
-        )
-        image_tokens = ([_IMAGE_TOKEN_ID] * ncols +
-                        [_NEWLINE_TOKEN_ID]) * nrows
-
-        return PromptUpdateDetails.select_token_id(
-            image_tokens + [bos_token_id],
-            embed_token_id=_IMAGE_TOKEN_ID,
-        )
-    ```
-
-    Finally, noticing that the HF processor removes the `|ENDOFTEXT|` token from the tokenized prompt,
-    we can search for it to conduct the replacement at the start of the string:
-
-    ```python
-    def _get_prompt_updates(
-        self,
-        mm_items: MultiModalDataItems,
-        hf_processor_mm_kwargs: Mapping[str, object],
-        out_mm_kwargs: MultiModalKwargs,
-    ) -> Sequence[PromptUpdate]:
+        ```python
        hf_config = self.info.get_hf_config()
-        bos_token_id = hf_config.bos_token_id
+        bos_token_id = hf_config.bos_token_id  # `<s>`
        assert isinstance(bos_token_id, int)

-        tokenizer = self.info.get_tokenizer()
-        eot_token_id = tokenizer.bos_token_id
-        assert isinstance(eot_token_id, int)
-
        def get_replacement_fuyu(item_idx: int):
            images = mm_items.get_items("image", ImageProcessorItems)
            image_size = images.get_image_size(item_idx)
@@ -742,15 +744,52 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
                image_tokens + [bos_token_id],
                embed_token_id=_IMAGE_TOKEN_ID,
            )
+        ```

-        return [
-            PromptReplacement(
-                modality="image",
-                target=[eot_token_id],
-                replacement=get_replacement_fuyu,
-            )
-        ]
-    ```
+    Finally, noticing that the HF processor removes the `|ENDOFTEXT|` token from the tokenized prompt,
+    we can search for it to conduct the replacement at the start of the string:
+
+    ??? Code
+
+        ```python
+        def _get_prompt_updates(
+            self,
+            mm_items: MultiModalDataItems,
+            hf_processor_mm_kwargs: Mapping[str, object],
+            out_mm_kwargs: MultiModalKwargs,
+        ) -> Sequence[PromptUpdate]:
+            hf_config = self.info.get_hf_config()
+            bos_token_id = hf_config.bos_token_id
+            assert isinstance(bos_token_id, int)
+
+            tokenizer = self.info.get_tokenizer()
+            eot_token_id = tokenizer.bos_token_id
+            assert isinstance(eot_token_id, int)
+
+            def get_replacement_fuyu(item_idx: int):
+                images = mm_items.get_items("image", ImageProcessorItems)
+                image_size = images.get_image_size(item_idx)
+
+                ncols, nrows = self.info.get_image_feature_grid_size(
+                    image_width=image_size.width,
+                    image_height=image_size.height,
+                )
+                image_tokens = ([_IMAGE_TOKEN_ID] * ncols +
+                                [_NEWLINE_TOKEN_ID]) * nrows
+
+                return PromptUpdateDetails.select_token_id(
+                    image_tokens + [bos_token_id],
+                    embed_token_id=_IMAGE_TOKEN_ID,
+                )
+
+            return [
+                PromptReplacement(
+                    modality="image",
+                    target=[eot_token_id],
+                    replacement=get_replacement_fuyu,
+                )
+            ]
+        ```

 ## 5. Register processor-related classes


--- a/docs/contributing/profiling.md
+++ b/docs/contributing/profiling.md
@@ -97,26 +97,26 @@ to manually kill the profiler and generate your `nsys-rep` report.

 You can view these profiles either as summaries in the CLI, using `nsys stats [profile-file]`, or in the GUI by installing Nsight [locally following the directions here](https://developer.nvidia.com/nsight-systems/get-started).

-CLI example:
-
-```bash
-nsys stats report1.nsys-rep
-...
- ** CUDA GPU Kernel Summary (cuda_gpu_kern_sum):
-
- Time (%)  Total Time (ns)  Instances   Avg (ns)     Med (ns)    Min (ns)  Max (ns)   StdDev (ns)                                                  Name                                                
- --------  ---------------  ---------  -----------  -----------  --------  ---------  -----------  ----------------------------------------------------------------------------------------------------
-     46.3   10,327,352,338     17,505    589,965.9    144,383.0    27,040  3,126,460    944,263.8  sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_of…
-     14.8    3,305,114,764      5,152    641,520.7    293,408.0   287,296  2,822,716    867,124.9  sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize256x128x64_warpgroupsize2x1x1_execute_segment_k_of…
-     12.1    2,692,284,876     14,280    188,535.4     83,904.0    19,328  2,862,237    497,999.9  sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize64x128x64_warpgroupsize1x1x1_execute_segment_k_off…
-      9.5    2,116,600,578     33,920     62,399.8     21,504.0    15,326  2,532,285    290,954.1  sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize64x64x64_warpgroupsize1x1x1_execute_segment_k_off_…
-      5.0    1,119,749,165     18,912     59,208.4      9,056.0     6,784  2,578,366    271,581.7  void vllm::act_and_mul_kernel<c10::BFloat16, &vllm::silu_kernel<c10::BFloat16>, (bool)1>(T1 *, cons…
-      4.1      916,662,515     21,312     43,011.6     19,776.0     8,928  2,586,205    199,790.1  void cutlass::device_kernel<flash::enable_sm90_or_later<flash::FlashAttnFwdSm90<flash::CollectiveMa…
-      2.6      587,283,113     37,824     15,526.7      3,008.0     2,719  2,517,756    139,091.1  std::enable_if<T2>(int)0&&vllm::_typeConvert<T1>::exists, void>::type vllm::fused_add_rms_norm_kern…
-      1.9      418,362,605     18,912     22,121.5      3,871.0     3,328  2,523,870    175,248.2  void vllm::rotary_embedding_kernel<c10::BFloat16, (bool)1>(const long *, T1 *, T1 *, const T1 *, in…
-      0.7      167,083,069     18,880      8,849.7      2,240.0     1,471  2,499,996    101,436.1  void vllm::reshape_and_cache_flash_kernel<__nv_bfloat16, __nv_bfloat16, (vllm::Fp8KVCacheDataType)0…
-... 
-```
+??? CLI example
+
+    ```bash
+    nsys stats report1.nsys-rep
+    ...
+    ** CUDA GPU Kernel Summary (cuda_gpu_kern_sum):
+
+    Time (%)  Total Time (ns)  Instances   Avg (ns)     Med (ns)    Min (ns)  Max (ns)   StdDev (ns)                                                  Name                                                
+    --------  ---------------  ---------  -----------  -----------  --------  ---------  -----------  ----------------------------------------------------------------------------------------------------
+        46.3   10,327,352,338     17,505    589,965.9    144,383.0    27,040  3,126,460    944,263.8  sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_of…
+        14.8    3,305,114,764      5,152    641,520.7    293,408.0   287,296  2,822,716    867,124.9  sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize256x128x64_warpgroupsize2x1x1_execute_segment_k_of…
+        12.1    2,692,284,876     14,280    188,535.4     83,904.0    19,328  2,862,237    497,999.9  sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize64x128x64_warpgroupsize1x1x1_execute_segment_k_off…
+        9.5    2,116,600,578     33,920     62,399.8     21,504.0    15,326  2,532,285    290,954.1  sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize64x64x64_warpgroupsize1x1x1_execute_segment_k_off_…
+        5.0    1,119,749,165     18,912     59,208.4      9,056.0     6,784  2,578,366    271,581.7  void vllm::act_and_mul_kernel<c10::BFloat16, &vllm::silu_kernel<c10::BFloat16>, (bool)1>(T1 *, cons…
+        4.1      916,662,515     21,312     43,011.6     19,776.0     8,928  2,586,205    199,790.1  void cutlass::device_kernel<flash::enable_sm90_or_later<flash::FlashAttnFwdSm90<flash::CollectiveMa…
+        2.6      587,283,113     37,824     15,526.7      3,008.0     2,719  2,517,756    139,091.1  std::enable_if<T2>(int)0&&vllm::_typeConvert<T1>::exists, void>::type vllm::fused_add_rms_norm_kern…
+        1.9      418,362,605     18,912     22,121.5      3,871.0     3,328  2,523,870    175,248.2  void vllm::rotary_embedding_kernel<c10::BFloat16, (bool)1>(const long *, T1 *, T1 *, const T1 *, in…
+        0.7      167,083,069     18,880      8,849.7      2,240.0     1,471  2,499,996    101,436.1  void vllm::reshape_and_cache_flash_kernel<__nv_bfloat16, __nv_bfloat16, (vllm::Fp8KVCacheDataType)0…
+    ... 
+    ```

 GUI example:


--- a/docs/deployment/docker.md
+++ b/docs/deployment/docker.md
@@ -97,19 +97,21 @@ of PyTorch Nightly and should be considered **experimental**. Using the flag `--
    flags to speed up build process. However, ensure your `max_jobs` is substantially larger than `nvcc_threads` to get the most benefits.
    Keep an eye on memory usage with parallel jobs as it can be substantial (see example below).

-```console
-# Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB)
-python3 use_existing_torch.py
-DOCKER_BUILDKIT=1 docker build . \
-  --file docker/Dockerfile \
-  --target vllm-openai \
-  --platform "linux/arm64" \
-  -t vllm/vllm-gh200-openai:latest \
-  --build-arg max_jobs=66 \
-  --build-arg nvcc_threads=2 \
-  --build-arg torch_cuda_arch_list="9.0 10.0+PTX" \
-  --build-arg vllm_fa_cmake_gpu_arches="90-real"
-```
+??? Command
+
+    ```console
+    # Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB)
+    python3 use_existing_torch.py
+    DOCKER_BUILDKIT=1 docker build . \
+    --file docker/Dockerfile \
+    --target vllm-openai \
+    --platform "linux/arm64" \
+    -t vllm/vllm-gh200-openai:latest \
+    --build-arg max_jobs=66 \
+    --build-arg nvcc_threads=2 \
+    --build-arg torch_cuda_arch_list="9.0 10.0+PTX" \
+    --build-arg vllm_fa_cmake_gpu_arches="90-real"
+    ```

 !!! note
    If you are building the `linux/arm64` image on a non-ARM host (e.g., an x86_64 machine), you need to ensure your system is set up for cross-compilation using QEMU. This allows your host machine to emulate ARM64 execution.

--- a/docs/deployment/frameworks/autogen.md
+++ b/docs/deployment/frameworks/autogen.md
@@ -30,51 +30,53 @@ python -m vllm.entrypoints.openai.api_server \

 - Call it with AutoGen:

-```python
-import asyncio
-from autogen_core.models import UserMessage
-from autogen_ext.models.openai import OpenAIChatCompletionClient
-from autogen_core.models import ModelFamily
-
-
-async def main() -> None:
-    # Create a model client
-    model_client = OpenAIChatCompletionClient(
-        model="mistralai/Mistral-7B-Instruct-v0.2",
-        base_url="http://{your-vllm-host-ip}:{your-vllm-host-port}/v1",
-        api_key="EMPTY",
-        model_info={
-            "vision": False,
-            "function_calling": False,
-            "json_output": False,
-            "family": ModelFamily.MISTRAL,
-            "structured_output": True,
-        },
-    )
-
-    messages = [UserMessage(content="Write a very short story about a dragon.", source="user")]
-
-    # Create a stream.
-    stream = model_client.create_stream(messages=messages)
-
-    # Iterate over the stream and print the responses.
-    print("Streamed responses:")
-    async for response in stream:
-        if isinstance(response, str):
-            # A partial response is a string.
-            print(response, flush=True, end="")
-        else:
-            # The last response is a CreateResult object with the complete message.
-            print("\n\n------------\n")
-            print("The complete response:", flush=True)
-            print(response.content, flush=True)
-
-    # Close the client when done.
-    await model_client.close()
-
-
-asyncio.run(main())
-```
+??? Code
+
+    ```python
+    import asyncio
+    from autogen_core.models import UserMessage
+    from autogen_ext.models.openai import OpenAIChatCompletionClient
+    from autogen_core.models import ModelFamily
+
+
+    async def main() -> None:
+        # Create a model client
+        model_client = OpenAIChatCompletionClient(
+            model="mistralai/Mistral-7B-Instruct-v0.2",
+            base_url="http://{your-vllm-host-ip}:{your-vllm-host-port}/v1",
+            api_key="EMPTY",
+            model_info={
+                "vision": False,
+                "function_calling": False,
+                "json_output": False,
+                "family": ModelFamily.MISTRAL,
+                "structured_output": True,
+            },
+        )
+
+        messages = [UserMessage(content="Write a very short story about a dragon.", source="user")]
+
+        # Create a stream.
+        stream = model_client.create_stream(messages=messages)
+
+        # Iterate over the stream and print the responses.
+        print("Streamed responses:")
+        async for response in stream:
+            if isinstance(response, str):
+                # A partial response is a string.
+                print(response, flush=True, end="")
+            else:
+                # The last response is a CreateResult object with the complete message.
+                print("\n\n------------\n")
+                print("The complete response:", flush=True)
+                print(response.content, flush=True)
+
+        # Close the client when done.
+        await model_client.close()
+
+
+    asyncio.run(main())
+    ```

 For details, see the tutorial:


--- a/docs/deployment/frameworks/cerebrium.md
+++ b/docs/deployment/frameworks/cerebrium.md
@@ -34,25 +34,27 @@ vllm = "latest"

 Next, let us add our code to handle inference for the LLM of your choice (`mistralai/Mistral-7B-Instruct-v0.1` for this example), add the following code to your `main.py`:

-```python
-from vllm import LLM, SamplingParams
+??? Code

-llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")
+    ```python
+    from vllm import LLM, SamplingParams

-def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95):
+    llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")

-    sampling_params = SamplingParams(temperature=temperature, top_p=top_p)
-    outputs = llm.generate(prompts, sampling_params)
+    def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95):

-    # Print the outputs.
-    results = []
-    for output in outputs:
-        prompt = output.prompt
-        generated_text = output.outputs[0].text
-        results.append({"prompt": prompt, "generated_text": generated_text})
+        sampling_params = SamplingParams(temperature=temperature, top_p=top_p)
+        outputs = llm.generate(prompts, sampling_params)

-    return {"results": results}
-```
+        # Print the outputs.
+        results = []
+        for output in outputs:
+            prompt = output.prompt
+            generated_text = output.outputs[0].text
+            results.append({"prompt": prompt, "generated_text": generated_text})
+
+        return {"results": results}
+    ```

 Then, run the following code to deploy it to the cloud:

@@ -62,47 +64,51 @@ cerebrium deploy

 If successful, you should be returned a CURL command that you can call inference against. Just remember to end the url with the function name you are calling (in our case`/run`)

-```python
-curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
- -H 'Content-Type: application/json' \
- -H 'Authorization: <JWT TOKEN>' \
- --data '{
-   "prompts": [
-     "Hello, my name is",
-     "The president of the United States is",
-     "The capital of France is",
-     "The future of AI is"
-   ]
- }'
-```
+??? Command
+
+    ```python
+    curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
+    -H 'Content-Type: application/json' \
+    -H 'Authorization: <JWT TOKEN>' \
+    --data '{
+    "prompts": [
+        "Hello, my name is",
+        "The president of the United States is",
+        "The capital of France is",
+        "The future of AI is"
+    ]
+    }'
+    ```

 You should get a response like:

-```python
-{
-    "run_id": "52911756-3066-9ae8-bcc9-d9129d1bd262",
-    "result": {
-        "result": [
-            {
-                "prompt": "Hello, my name is",
-                "generated_text": " Sarah, and I'm a teacher. I teach elementary school students. One of"
-            },
-            {
-                "prompt": "The president of the United States is",
-                "generated_text": " elected every four years. This is a democratic system.\n\n5. What"
-            },
-            {
-                "prompt": "The capital of France is",
-                "generated_text": " Paris.\n"
-            },
-            {
-                "prompt": "The future of AI is",
-                "generated_text": " bright, but it's important to approach it with a balanced and nuanced perspective."
-            }
-        ]
-    },
-    "run_time_ms": 152.53663063049316
-}
-```
+??? Response
+
+    ```python
+    {
+        "run_id": "52911756-3066-9ae8-bcc9-d9129d1bd262",
+        "result": {
+            "result": [
+                {
+                    "prompt": "Hello, my name is",
+                    "generated_text": " Sarah, and I'm a teacher. I teach elementary school students. One of"
+                },
+                {
+                    "prompt": "The president of the United States is",
+                    "generated_text": " elected every four years. This is a democratic system.\n\n5. What"
+                },
+                {
+                    "prompt": "The capital of France is",
+                    "generated_text": " Paris.\n"
+                },
+                {
+                    "prompt": "The future of AI is",
+                    "generated_text": " bright, but it's important to approach it with a balanced and nuanced perspective."
+                }
+            ]
+        },
+        "run_time_ms": 152.53663063049316
+    }
+    ```

 You now have an autoscaling endpoint where you only pay for the compute you use!
--- a/docs/deployment/frameworks/dstack.md
+++ b/docs/deployment/frameworks/dstack.md
@@ -26,75 +26,81 @@ dstack init

 Next, to provision a VM instance with LLM of your choice (`NousResearch/Llama-2-7b-chat-hf` for this example), create the following `serve.dstack.yml` file for the dstack `Service`:

-```yaml
-type: service
-
-python: "3.11"
-env:
-    - MODEL=NousResearch/Llama-2-7b-chat-hf
-port: 8000
-resources:
-    gpu: 24GB
-commands:
-    - pip install vllm
-    - vllm serve $MODEL --port 8000
-model:
-    format: openai
-    type: chat
-    name: NousResearch/Llama-2-7b-chat-hf
-```
+??? Config
+
+    ```yaml
+    type: service
+
+    python: "3.11"
+    env:
+        - MODEL=NousResearch/Llama-2-7b-chat-hf
+    port: 8000
+    resources:
+        gpu: 24GB
+    commands:
+        - pip install vllm
+        - vllm serve $MODEL --port 8000
+    model:
+        format: openai
+        type: chat
+        name: NousResearch/Llama-2-7b-chat-hf
+    ```

 Then, run the following CLI for provisioning:

-```console
-$ dstack run . -f serve.dstack.yml
-
-⠸ Getting run plan...
- Configuration  serve.dstack.yml
- Project        deep-diver-main
- User           deep-diver
- Min resources  2..xCPU, 8GB.., 1xGPU (24GB)
- Max price      -
- Max duration   -
- Spot policy    auto
- Retry policy   no
-
- #  BACKEND  REGION       INSTANCE       RESOURCES                               SPOT  PRICE
- 1  gcp   us-central1  g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
- 2  gcp   us-east1     g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
- 3  gcp   us-west1     g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
-    ...
- Shown 3 of 193 offers, $5.876 max
-
-Continue? [y/n]: y
-⠙ Submitting run...
-⠏ Launching spicy-treefrog-1 (pulling)
-spicy-treefrog-1 provisioning completed (running)
-Service is published at ...
-```
+??? Command
+
+    ```console
+    $ dstack run . -f serve.dstack.yml
+
+    ⠸ Getting run plan...
+    Configuration  serve.dstack.yml
+    Project        deep-diver-main
+    User           deep-diver
+    Min resources  2..xCPU, 8GB.., 1xGPU (24GB)
+    Max price      -
+    Max duration   -
+    Spot policy    auto
+    Retry policy   no
+
+    #  BACKEND  REGION       INSTANCE       RESOURCES                               SPOT  PRICE
+    1  gcp   us-central1  g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
+    2  gcp   us-east1     g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
+    3  gcp   us-west1     g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
+        ...
+    Shown 3 of 193 offers, $5.876 max
+
+    Continue? [y/n]: y
+    ⠙ Submitting run...
+    ⠏ Launching spicy-treefrog-1 (pulling)
+    spicy-treefrog-1 provisioning completed (running)
+    Service is published at ...
+    ```

 After the provisioning, you can interact with the model by using the OpenAI SDK:

-```python
-from openai import OpenAI
-
-client = OpenAI(
-    base_url="https://gateway.<gateway domain>",
-    api_key="<YOUR-DSTACK-SERVER-ACCESS-TOKEN>"
-)
-
-completion = client.chat.completions.create(
-    model="NousResearch/Llama-2-7b-chat-hf",
-    messages=[
-        {
-            "role": "user",
-            "content": "Compose a poem that explains the concept of recursion in programming.",
-        }
-    ]
-)
-
-print(completion.choices[0].message.content)
-```
+??? Code
+
+    ```python
+    from openai import OpenAI
+
+    client = OpenAI(
+        base_url="https://gateway.<gateway domain>",
+        api_key="<YOUR-DSTACK-SERVER-ACCESS-TOKEN>"
+    )
+
+    completion = client.chat.completions.create(
+        model="NousResearch/Llama-2-7b-chat-hf",
+        messages=[
+            {
+                "role": "user",
+                "content": "Compose a poem that explains the concept of recursion in programming.",
+            }
+        ]
+    )
+
+    print(completion.choices[0].message.content)
+    ```

 !!! note
    dstack automatically handles authentication on the gateway using dstack's tokens. Meanwhile, if you don't want to configure a gateway, you can provision dstack `Task` instead of `Service`. The `Task` is for development purpose only. If you want to know more about hands-on materials how to serve vLLM using dstack, check out [this repository](https://github.com/dstackai/dstack-examples/tree/main/deployment/vllm)
--- a/docs/deployment/frameworks/haystack.md
+++ b/docs/deployment/frameworks/haystack.md
@@ -27,29 +27,29 @@ vllm serve mistralai/Mistral-7B-Instruct-v0.1

 - Use the `OpenAIGenerator` and `OpenAIChatGenerator` components in Haystack to query the vLLM server.

-```python
-from haystack.components.generators.chat import OpenAIChatGenerator
-from haystack.dataclasses import ChatMessage
-from haystack.utils import Secret
-
-generator = OpenAIChatGenerator(
-    # for compatibility with the OpenAI API, a placeholder api_key is needed
-    api_key=Secret.from_token("VLLM-PLACEHOLDER-API-KEY"),
-    model="mistralai/Mistral-7B-Instruct-v0.1",
-    api_base_url="http://{your-vLLM-host-ip}:{your-vLLM-host-port}/v1",
-    generation_kwargs = {"max_tokens": 512}
-)
-
-response = generator.run(
-  messages=[ChatMessage.from_user("Hi. Can you help me plan my next trip to Italy?")]
-)
-
-print("-"*30)
-print(response)
-print("-"*30)
-```
-
-Output e.g.:
+??? Code
+
+    ```python
+    from haystack.components.generators.chat import OpenAIChatGenerator
+    from haystack.dataclasses import ChatMessage
+    from haystack.utils import Secret
+
+    generator = OpenAIChatGenerator(
+        # for compatibility with the OpenAI API, a placeholder api_key is needed
+        api_key=Secret.from_token("VLLM-PLACEHOLDER-API-KEY"),
+        model="mistralai/Mistral-7B-Instruct-v0.1",
+        api_base_url="http://{your-vLLM-host-ip}:{your-vLLM-host-port}/v1",
+        generation_kwargs = {"max_tokens": 512}
+    )
+
+    response = generator.run(
+      messages=[ChatMessage.from_user("Hi. Can you help me plan my next trip to Italy?")]
+    )
+
+    print("-"*30)
+    print(response)
+    print("-"*30)
+    ```

 ```console
 ------------------------------

--- a/docs/deployment/frameworks/litellm.md
+++ b/docs/deployment/frameworks/litellm.md
@@ -34,21 +34,23 @@ vllm serve qwen/Qwen1.5-0.5B-Chat

 - Call it with litellm:

-```python
-import litellm 
+??? Code

-messages = [{ "content": "Hello, how are you?","role": "user"}]
+    ```python
+    import litellm 

-# hosted_vllm is prefix key word and necessary
-response = litellm.completion(
-            model="hosted_vllm/qwen/Qwen1.5-0.5B-Chat", # pass the vllm model name
-            messages=messages,
-            api_base="http://{your-vllm-server-host}:{your-vllm-server-port}/v1",
-            temperature=0.2,
-            max_tokens=80)
-
-print(response)
-```
+    messages = [{ "content": "Hello, how are you?","role": "user"}]
+
+    # hosted_vllm is prefix key word and necessary
+    response = litellm.completion(
+                model="hosted_vllm/qwen/Qwen1.5-0.5B-Chat", # pass the vllm model name
+                messages=messages,
+                api_base="http://{your-vllm-server-host}:{your-vllm-server-port}/v1",
+                temperature=0.2,
+                max_tokens=80)
+
+    print(response)
+    ```

 ### Embeddings


--- a/docs/deployment/frameworks/lws.md
+++ b/docs/deployment/frameworks/lws.md
@@ -17,99 +17,101 @@ vLLM can be deployed with [LWS](https://github.com/kubernetes-sigs/lws) on Kuber

 Deploy the following yaml file `lws.yaml`

-```yaml
-apiVersion: leaderworkerset.x-k8s.io/v1
-kind: LeaderWorkerSet
-metadata:
-  name: vllm
-spec:
-  replicas: 2
-  leaderWorkerTemplate:
-    size: 2
-    restartPolicy: RecreateGroupOnPodRestart
-    leaderTemplate:
-      metadata:
-        labels:
-          role: leader
-      spec:
-        containers:
-          - name: vllm-leader
-            image: docker.io/vllm/vllm-openai:latest
-            env:
-              - name: HUGGING_FACE_HUB_TOKEN
-                value: <your-hf-token>
-            command:
-              - sh
-              - -c
-              - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE); 
-                 python3 -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Meta-Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline_parallel_size 2"
-            resources:
-              limits:
-                nvidia.com/gpu: "8"
-                memory: 1124Gi
-                ephemeral-storage: 800Gi
-              requests:
-                ephemeral-storage: 800Gi
-                cpu: 125
-            ports:
-              - containerPort: 8080
-            readinessProbe:
-              tcpSocket:
-                port: 8080
-              initialDelaySeconds: 15
-              periodSeconds: 10
-            volumeMounts:
-              - mountPath: /dev/shm
-                name: dshm
-        volumes:
-        - name: dshm
-          emptyDir:
-            medium: Memory
-            sizeLimit: 15Gi
-    workerTemplate:
-      spec:
-        containers:
-          - name: vllm-worker
-            image: docker.io/vllm/vllm-openai:latest
-            command:
-              - sh
-              - -c
-              - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
-            resources:
-              limits:
-                nvidia.com/gpu: "8"
-                memory: 1124Gi
-                ephemeral-storage: 800Gi
-              requests:
-                ephemeral-storage: 800Gi
-                cpu: 125
-            env:
-              - name: HUGGING_FACE_HUB_TOKEN
-                value: <your-hf-token>
-            volumeMounts:
-              - mountPath: /dev/shm
-                name: dshm   
-        volumes:
-        - name: dshm
-          emptyDir:
-            medium: Memory
-            sizeLimit: 15Gi
---
-apiVersion: v1
-kind: Service
-metadata:
-  name: vllm-leader
-spec:
-  ports:
-    - name: http
-      port: 8080
-      protocol: TCP
-      targetPort: 8080
-  selector:
-    leaderworkerset.sigs.k8s.io/name: vllm
-    role: leader
-  type: ClusterIP
-```
+??? Yaml
+
+    ```yaml
+    apiVersion: leaderworkerset.x-k8s.io/v1
+    kind: LeaderWorkerSet
+    metadata:
+      name: vllm
+    spec:
+      replicas: 2
+      leaderWorkerTemplate:
+        size: 2
+        restartPolicy: RecreateGroupOnPodRestart
+        leaderTemplate:
+          metadata:
+            labels:
+              role: leader
+          spec:
+            containers:
+              - name: vllm-leader
+                image: docker.io/vllm/vllm-openai:latest
+                env:
+                  - name: HUGGING_FACE_HUB_TOKEN
+                    value: <your-hf-token>
+                command:
+                  - sh
+                  - -c
+                  - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE); 
+                    python3 -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Meta-Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline_parallel_size 2"
+                resources:
+                  limits:
+                    nvidia.com/gpu: "8"
+                    memory: 1124Gi
+                    ephemeral-storage: 800Gi
+                  requests:
+                    ephemeral-storage: 800Gi
+                    cpu: 125
+                ports:
+                  - containerPort: 8080
+                readinessProbe:
+                  tcpSocket:
+                    port: 8080
+                  initialDelaySeconds: 15
+                  periodSeconds: 10
+                volumeMounts:
+                  - mountPath: /dev/shm
+                    name: dshm
+            volumes:
+            - name: dshm
+              emptyDir:
+                medium: Memory
+                sizeLimit: 15Gi
+        workerTemplate:
+          spec:
+            containers:
+              - name: vllm-worker
+                image: docker.io/vllm/vllm-openai:latest
+                command:
+                  - sh
+                  - -c
+                  - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
+                resources:
+                  limits:
+                    nvidia.com/gpu: "8"
+                    memory: 1124Gi
+                    ephemeral-storage: 800Gi
+                  requests:
+                    ephemeral-storage: 800Gi
+                    cpu: 125
+                env:
+                  - name: HUGGING_FACE_HUB_TOKEN
+                    value: <your-hf-token>
+                volumeMounts:
+                  - mountPath: /dev/shm
+                    name: dshm   
+            volumes:
+            - name: dshm
+              emptyDir:
+                medium: Memory
+                sizeLimit: 15Gi
+    ---
+    apiVersion: v1
+    kind: Service
+    metadata:
+      name: vllm-leader
+    spec:
+      ports:
+        - name: http
+          port: 8080
+          protocol: TCP
+          targetPort: 8080
+      selector:
+        leaderworkerset.sigs.k8s.io/name: vllm
+        role: leader
+      type: ClusterIP
+    ```

 ```bash
 kubectl apply -f lws.yaml
@@ -175,25 +177,27 @@ curl http://localhost:8080/v1/completions \

 The output should be similar to the following

-```text
-{
-  "id": "cmpl-1bb34faba88b43f9862cfbfb2200949d",
-  "object": "text_completion",
-  "created": 1715138766,
-  "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
-  "choices": [
+??? Output
+
+    ```text
    {
-      "index": 0,
-      "text": " top destination for foodies, with",
-      "logprobs": null,
-      "finish_reason": "length",
-      "stop_reason": null
+      "id": "cmpl-1bb34faba88b43f9862cfbfb2200949d",
+      "object": "text_completion",
+      "created": 1715138766,
+      "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
+      "choices": [
+        {
+          "index": 0,
+          "text": " top destination for foodies, with",
+          "logprobs": null,
+          "finish_reason": "length",
+          "stop_reason": null
+        }
+      ],
+      "usage": {
+        "prompt_tokens": 5,
+        "total_tokens": 12,
+        "completion_tokens": 7
+      }
    }
-  ],
-  "usage": {
-    "prompt_tokens": 5,
-    "total_tokens": 12,
-    "completion_tokens": 7
-  }
-}
-```
+    ```
--- a/docs/deployment/frameworks/skypilot.md
+++ b/docs/deployment/frameworks/skypilot.md
@@ -24,48 +24,50 @@ sky check

 See the vLLM SkyPilot YAML for serving, [serving.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/vllm/serve.yaml).

-```yaml
-resources:
-  accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
-  use_spot: True
-  disk_size: 512  # Ensure model checkpoints can fit.
-  disk_tier: best
-  ports: 8081  # Expose to internet traffic.
-
-envs:
-  MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
-  HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.
-
-setup: |
-  conda create -n vllm python=3.10 -y
-  conda activate vllm
-
-  pip install vllm==0.4.0.post1
-  # Install Gradio for web UI.
-  pip install gradio openai
-  pip install flash-attn==2.5.7
-
-run: |
-  conda activate vllm
-  echo 'Starting vllm api server...'
-  python -u -m vllm.entrypoints.openai.api_server \
-    --port 8081 \
-    --model $MODEL_NAME \
-    --trust-remote-code \
-    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
-    2>&1 | tee api_server.log &
-
-  echo 'Waiting for vllm api server to start...'
-  while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done
-
-  echo 'Starting gradio server...'
-  git clone https://github.com/vllm-project/vllm.git || true
-  python vllm/examples/online_serving/gradio_openai_chatbot_webserver.py \
-    -m $MODEL_NAME \
-    --port 8811 \
-    --model-url http://localhost:8081/v1 \
-    --stop-token-ids 128009,128001
-```
+??? Yaml
+
+    ```yaml
+    resources:
+      accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
+      use_spot: True
+      disk_size: 512  # Ensure model checkpoints can fit.
+      disk_tier: best
+      ports: 8081  # Expose to internet traffic.
+
+    envs:
+      MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
+      HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.
+
+    setup: |
+      conda create -n vllm python=3.10 -y
+      conda activate vllm
+
+      pip install vllm==0.4.0.post1
+      # Install Gradio for web UI.
+      pip install gradio openai
+      pip install flash-attn==2.5.7
+
+    run: |
+      conda activate vllm
+      echo 'Starting vllm api server...'
+      python -u -m vllm.entrypoints.openai.api_server \
+        --port 8081 \
+        --model $MODEL_NAME \
+        --trust-remote-code \
+        --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
+        2>&1 | tee api_server.log &
+
+      echo 'Waiting for vllm api server to start...'
+      while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done
+
+      echo 'Starting gradio server...'
+      git clone https://github.com/vllm-project/vllm.git || true
+      python vllm/examples/online_serving/gradio_openai_chatbot_webserver.py \
+        -m $MODEL_NAME \
+        --port 8811 \
+        --model-url http://localhost:8081/v1 \
+        --stop-token-ids 128009,128001
+    ```

 Start the serving the Llama-3 8B model on any of the candidate GPUs listed (L4, A10g, ...):

@@ -93,68 +95,67 @@ HF_TOKEN="your-huggingface-token" \

 SkyPilot can scale up the service to multiple service replicas with built-in autoscaling, load-balancing and fault-tolerance. You can do it by adding a services section to the YAML file.

-```yaml
-service:
-  replicas: 2
-  # An actual request for readiness probe.
-  readiness_probe:
-    path: /v1/chat/completions
-    post_data:
-    model: $MODEL_NAME
-    messages:
-      - role: user
-        content: Hello! What is your name?
-  max_completion_tokens: 1
-```
-
-<details>
-<summary>Click to see the full recipe YAML</summary>
-
-```yaml
-service:
-  replicas: 2
-  # An actual request for readiness probe.
-  readiness_probe:
-    path: /v1/chat/completions
-    post_data:
-      model: $MODEL_NAME
-      messages:
-        - role: user
-          content: Hello! What is your name?
+??? Yaml
+
+    ```yaml
+    service:
+      replicas: 2
+      # An actual request for readiness probe.
+      readiness_probe:
+        path: /v1/chat/completions
+        post_data:
+        model: $MODEL_NAME
+        messages:
+          - role: user
+            content: Hello! What is your name?
      max_completion_tokens: 1
+    ```

-resources:
-  accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
-  use_spot: True
-  disk_size: 512  # Ensure model checkpoints can fit.
-  disk_tier: best
-  ports: 8081  # Expose to internet traffic.
-
-envs:
-  MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
-  HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.
-
-setup: |
-  conda create -n vllm python=3.10 -y
-  conda activate vllm
-
-  pip install vllm==0.4.0.post1
-  # Install Gradio for web UI.
-  pip install gradio openai
-  pip install flash-attn==2.5.7
-
-run: |
-  conda activate vllm
-  echo 'Starting vllm api server...'
-  python -u -m vllm.entrypoints.openai.api_server \
-    --port 8081 \
-    --model $MODEL_NAME \
-    --trust-remote-code \
-    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
-    2>&1 | tee api_server.log
-```
-
-</details>
+??? Yaml
+
+    ```yaml
+    service:
+      replicas: 2
+      # An actual request for readiness probe.
+      readiness_probe:
+        path: /v1/chat/completions
+        post_data:
+          model: $MODEL_NAME
+          messages:
+            - role: user
+              content: Hello! What is your name?
+          max_completion_tokens: 1
+
+    resources:
+      accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
+      use_spot: True
+      disk_size: 512  # Ensure model checkpoints can fit.
+      disk_tier: best
+      ports: 8081  # Expose to internet traffic.
+
+    envs:
+      MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
+      HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.
+
+    setup: |
+      conda create -n vllm python=3.10 -y
+      conda activate vllm
+
+      pip install vllm==0.4.0.post1
+      # Install Gradio for web UI.
+      pip install gradio openai
+      pip install flash-attn==2.5.7
+
+    run: |
+      conda activate vllm
+      echo 'Starting vllm api server...'
+      python -u -m vllm.entrypoints.openai.api_server \
+        --port 8081 \
+        --model $MODEL_NAME \
+        --trust-remote-code \
+        --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
+        2>&1 | tee api_server.log
+    ```

 Start the serving the Llama-3 8B model on multiple replicas:

@@ -170,8 +171,7 @@ Wait until the service is ready:
 watch -n10 sky serve status vllm
 ```

-<details>
-<summary>Example outputs:</summary>
+Example outputs:

 ```console
 Services
@@ -184,29 +184,29 @@ vllm          1   1        xx.yy.zz.121  18 mins ago  1x GCP([Spot]{'L4': 1})  R
 vllm          2   1        xx.yy.zz.245  18 mins ago  1x GCP([Spot]{'L4': 1})  READY   us-east4
 ```

-</details>
-
 After the service is READY, you can find a single endpoint for the service and access the service with the endpoint:

-```console
-ENDPOINT=$(sky serve status --endpoint 8081 vllm)
-curl -L http://$ENDPOINT/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
-    "messages": [
-    {
-      "role": "system",
-      "content": "You are a helpful assistant."
-    },
-    {
-      "role": "user",
-      "content": "Who are you?"
-    }
-    ],
-    "stop_token_ids": [128009,  128001]
-  }'
-```
+??? Commands
+
+    ```bash
+    ENDPOINT=$(sky serve status --endpoint 8081 vllm)
+    curl -L http://$ENDPOINT/v1/chat/completions \
+      -H "Content-Type: application/json" \
+      -d '{
+        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
+        "messages": [
+        {
+          "role": "system",
+          "content": "You are a helpful assistant."
+        },
+        {
+          "role": "user",
+          "content": "Who are you?"
+        }
+        ],
+        "stop_token_ids": [128009,  128001]
+      }'
+    ```

 To enable autoscaling, you could replace the `replicas` with the following configs in `service`:

@@ -220,57 +220,54 @@ service:

 This will scale the service up to when the QPS exceeds 2 for each replica.

-<details>
-<summary>Click to see the full recipe YAML</summary>
-
-```yaml
-service:
-  replica_policy:
-    min_replicas: 2
-    max_replicas: 4
-    target_qps_per_replica: 2
-  # An actual request for readiness probe.
-  readiness_probe:
-    path: /v1/chat/completions
-    post_data:
-      model: $MODEL_NAME
-      messages:
-        - role: user
-          content: Hello! What is your name?
-      max_completion_tokens: 1
-
-resources:
-  accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
-  use_spot: True
-  disk_size: 512  # Ensure model checkpoints can fit.
-  disk_tier: best
-  ports: 8081  # Expose to internet traffic.
-
-envs:
-  MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
-  HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.
-
-setup: |
-  conda create -n vllm python=3.10 -y
-  conda activate vllm
-
-  pip install vllm==0.4.0.post1
-  # Install Gradio for web UI.
-  pip install gradio openai
-  pip install flash-attn==2.5.7
-
-run: |
-  conda activate vllm
-  echo 'Starting vllm api server...'
-  python -u -m vllm.entrypoints.openai.api_server \
-    --port 8081 \
-    --model $MODEL_NAME \
-    --trust-remote-code \
-    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
-    2>&1 | tee api_server.log
-```
-
-</details>
+??? Yaml
+
+    ```yaml
+    service:
+      replica_policy:
+        min_replicas: 2
+        max_replicas: 4
+        target_qps_per_replica: 2
+      # An actual request for readiness probe.
+      readiness_probe:
+        path: /v1/chat/completions
+        post_data:
+          model: $MODEL_NAME
+          messages:
+            - role: user
+              content: Hello! What is your name?
+          max_completion_tokens: 1
+
+    resources:
+      accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
+      use_spot: True
+      disk_size: 512  # Ensure model checkpoints can fit.
+      disk_tier: best
+      ports: 8081  # Expose to internet traffic.
+
+    envs:
+      MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
+      HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.
+
+    setup: |
+      conda create -n vllm python=3.10 -y
+      conda activate vllm
+
+      pip install vllm==0.4.0.post1
+      # Install Gradio for web UI.
+      pip install gradio openai
+      pip install flash-attn==2.5.7
+
+    run: |
+      conda activate vllm
+      echo 'Starting vllm api server...'
+      python -u -m vllm.entrypoints.openai.api_server \
+        --port 8081 \
+        --model $MODEL_NAME \
+        --trust-remote-code \
+        --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
+        2>&1 | tee api_server.log
+    ```

 To update the service with the new config:

@@ -288,38 +285,35 @@ sky serve down vllm

 It is also possible to access the Llama-3 service with a separate GUI frontend, so the user requests send to the GUI will be load-balanced across replicas.

-<details>
-<summary>Click to see the full GUI YAML</summary>
+??? Yaml

-```yaml
-envs:
-  MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
-  ENDPOINT: x.x.x.x:3031 # Address of the API server running vllm.
-
-resources:
-  cpus: 2
-
-setup: |
-  conda create -n vllm python=3.10 -y
-  conda activate vllm
-
-  # Install Gradio for web UI.
-  pip install gradio openai
-
-run: |
-  conda activate vllm
-  export PATH=$PATH:/sbin
-
-  echo 'Starting gradio server...'
-  git clone https://github.com/vllm-project/vllm.git || true
-  python vllm/examples/online_serving/gradio_openai_chatbot_webserver.py \
-    -m $MODEL_NAME \
-    --port 8811 \
-    --model-url http://$ENDPOINT/v1 \
-    --stop-token-ids 128009,128001 | tee ~/gradio.log
-```
+    ```yaml
+    envs:
+      MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
+      ENDPOINT: x.x.x.x:3031 # Address of the API server running vllm.
+
+    resources:
+      cpus: 2

-</details>
+    setup: |
+      conda create -n vllm python=3.10 -y
+      conda activate vllm
+
+      # Install Gradio for web UI.
+      pip install gradio openai
+
+    run: |
+      conda activate vllm
+      export PATH=$PATH:/sbin
+
+      echo 'Starting gradio server...'
+      git clone https://github.com/vllm-project/vllm.git || true
+      python vllm/examples/online_serving/gradio_openai_chatbot_webserver.py \
+        -m $MODEL_NAME \
+        --port 8811 \
+        --model-url http://$ENDPOINT/v1 \
+        --stop-token-ids 128009,128001 | tee ~/gradio.log
+    ```

 1. Start the chat web UI:


--- a/docs/deployment/integrations/production-stack.md
+++ b/docs/deployment/integrations/production-stack.md
@@ -60,22 +60,22 @@ And then you can send out a query to the OpenAI-compatible API to check the avai
 curl -o- http://localhost:30080/models
 ```

-Expected output:
+??? Output

-```json
-{
-  "object": "list",
-  "data": [
+    ```json
    {
-      "id": "facebook/opt-125m",
-      "object": "model",
-      "created": 1737428424,
-      "owned_by": "vllm",
-      "root": null
+      "object": "list",
+      "data": [
+        {
+          "id": "facebook/opt-125m",
+          "object": "model",
+          "created": 1737428424,
+          "owned_by": "vllm",
+          "root": null
+        }
+      ]
    }
-  ]
-}
-```
+    ```

 To send an actual chatting request, you can issue a curl request to the OpenAI `/completion` endpoint:

@@ -89,23 +89,23 @@ curl -X POST http://localhost:30080/completions \
  }'
 ```

-Expected output:
+??? Output

-```json
-{
-  "id": "completion-id",
-  "object": "text_completion",
-  "created": 1737428424,
-  "model": "facebook/opt-125m",
-  "choices": [
+    ```json
    {
-      "text": " there was a brave knight who...",
-      "index": 0,
-      "finish_reason": "length"
+      "id": "completion-id",
+      "object": "text_completion",
+      "created": 1737428424,
+      "model": "facebook/opt-125m",
+      "choices": [
+        {
+          "text": " there was a brave knight who...",
+          "index": 0,
+          "finish_reason": "length"
+        }
+      ]
    }
-  ]
-}
-```
+    ```

 ### Uninstall

@@ -121,23 +121,25 @@ sudo helm uninstall vllm

 The core vLLM production stack configuration is managed with YAML. Here is the example configuration used in the installation above:

-```yaml
-servingEngineSpec:
-  runtimeClassName: ""
-  modelSpec:
-  - name: "opt125m"
-    repository: "vllm/vllm-openai"
-    tag: "latest"
-    modelURL: "facebook/opt-125m"
+??? Yaml

-    replicaCount: 1
+    ```yaml
+    servingEngineSpec:
+      runtimeClassName: ""
+      modelSpec:
+      - name: "opt125m"
+        repository: "vllm/vllm-openai"
+        tag: "latest"
+        modelURL: "facebook/opt-125m"

-    requestCPU: 6
-    requestMemory: "16Gi"
-    requestGPU: 1
+        replicaCount: 1

-    pvcStorage: "10Gi"
-```
+        requestCPU: 6
+        requestMemory: "16Gi"
+        requestGPU: 1
+
+        pvcStorage: "10Gi"
+    ```

 In this YAML configuration:
 * **`modelSpec`** includes:

--- a/docs/deployment/k8s.md
+++ b/docs/deployment/k8s.md
@@ -29,85 +29,89 @@ Alternatively, you can deploy vLLM to Kubernetes using any of the following:

 First, create a Kubernetes PVC and Secret for downloading and storing Hugging Face model:

-```bash
-cat <<EOF |kubectl apply -f -
-apiVersion: v1
-kind: PersistentVolumeClaim
-metadata:
-  name: vllm-models
-spec:
-  accessModes:
-    - ReadWriteOnce
-  volumeMode: Filesystem
-  resources:
-    requests:
-      storage: 50Gi
---
-apiVersion: v1
-kind: Secret
-metadata:
-  name: hf-token-secret
-type: Opaque
-data:
-  token: $(HF_TOKEN)
-EOF
-```
+??? Config
+
+    ```bash
+    cat <<EOF |kubectl apply -f -
+    apiVersion: v1
+    kind: PersistentVolumeClaim
+    metadata:
+      name: vllm-models
+    spec:
+      accessModes:
+        - ReadWriteOnce
+      volumeMode: Filesystem
+      resources:
+        requests:
+          storage: 50Gi
+    ---
+    apiVersion: v1
+    kind: Secret
+    metadata:
+      name: hf-token-secret
+    type: Opaque
+    data:
+      token: $(HF_TOKEN)
+    EOF
+    ```

 Next, start the vLLM server as a Kubernetes Deployment and Service:

-```bash
-cat <<EOF |kubectl apply -f -
-apiVersion: apps/v1
-kind: Deployment
-metadata:
-  name: vllm-server
-spec:
-  replicas: 1
-  selector:
-    matchLabels:
-      app.kubernetes.io/name: vllm
-  template:
+??? Config
+
+    ```bash
+    cat <<EOF |kubectl apply -f -
+    apiVersion: apps/v1
+    kind: Deployment
    metadata:
-      labels:
-        app.kubernetes.io/name: vllm
+      name: vllm-server
    spec:
-      containers:
-      - name: vllm
-        image: vllm/vllm-openai:latest
-        command: ["/bin/sh", "-c"]
-        args: [
-          "vllm serve meta-llama/Llama-3.2-1B-Instruct"
-        ]
-        env:
-        - name: HUGGING_FACE_HUB_TOKEN
-          valueFrom:
-            secretKeyRef:
-              name: hf-token-secret
-              key: token
-        ports:
-          - containerPort: 8000
-        volumeMounts:
+      replicas: 1
+      selector:
+        matchLabels:
+          app.kubernetes.io/name: vllm
+      template:
+        metadata:
+          labels:
+            app.kubernetes.io/name: vllm
+        spec:
+          containers:
+          - name: vllm
+            image: vllm/vllm-openai:latest
+            command: ["/bin/sh", "-c"]
+            args: [
+              "vllm serve meta-llama/Llama-3.2-1B-Instruct"
+            ]
+            env:
+            - name: HUGGING_FACE_HUB_TOKEN
+              valueFrom:
+                secretKeyRef:
+                  name: hf-token-secret
+                  key: token
+            ports:
+              - containerPort: 8000
+            volumeMounts:
+              - name: llama-storage
+                mountPath: /root/.cache/huggingface
+          volumes:
          - name: llama-storage
-            mountPath: /root/.cache/huggingface
-      volumes:
-      - name: llama-storage
-        persistentVolumeClaim:
-          claimName: vllm-models
---
-apiVersion: v1
-kind: Service
-metadata:
-  name: vllm-server
-spec:
-  selector:
-    app.kubernetes.io/name: vllm
-  ports:
-  - protocol: TCP
-    port: 8000
-    targetPort: 8000
-  type: ClusterIP
-EOF
-```
+            persistentVolumeClaim:
+              claimName: vllm-models
+    ---
+    apiVersion: v1
+    kind: Service
+    metadata:
+      name: vllm-server
+    spec:
+      selector:
+        app.kubernetes.io/name: vllm
+      ports:
+      - protocol: TCP
+        port: 8000
+        targetPort: 8000
+      type: ClusterIP
+    EOF
+    ```

 We can verify that the vLLM server has started successfully via the logs (this might take a couple of minutes to download the model):

@@ -128,6 +132,9 @@ INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

      PVC is used to store the model cache and it is optional, you can use hostPath or other storage options

+      <details>
+      <summary>Yaml</summary>
+
      ```yaml
      apiVersion: v1
      kind: PersistentVolumeClaim
@@ -144,6 +151,8 @@ INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
        volumeMode: Filesystem
      ```

+      </details>
+
      Secret is optional and only required for accessing gated models, you can skip this step if you are not using gated models

      ```yaml
@@ -156,13 +165,16 @@ INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
      stringData:
        token: "REPLACE_WITH_TOKEN"
      ```
-
+  
      Next to create the deployment file for vLLM to run the model server. The following example deploys the `Mistral-7B-Instruct-v0.3` model.

      Here are two examples for using NVIDIA GPU and AMD GPU.

      NVIDIA GPU:

+      <details>
+      <summary>Yaml</summary>
+
      ```yaml
      apiVersion: apps/v1
      kind: Deployment
@@ -233,10 +245,15 @@ INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
                periodSeconds: 5
      ```

+      </details>
+
      AMD GPU:

      You can refer to the `deployment.yaml` below if using AMD ROCm GPU like MI300X.

+      <details>
+      <summary>Yaml</summary>
+
      ```yaml
      apiVersion: apps/v1
      kind: Deployment
@@ -305,12 +322,17 @@ INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
                mountPath: /dev/shm
      ```

+      </details>
+
      You can get the full example with steps and sample yaml files from <https://github.com/ROCm/k8s-device-plugin/tree/master/example/vllm-serve>.

 2. Create a Kubernetes Service for vLLM

      Next, create a Kubernetes Service file to expose the `mistral-7b` deployment:

+      <details>
+      <summary>Yaml</summary>
+
      ```yaml
      apiVersion: v1
      kind: Service
@@ -330,6 +352,8 @@ INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
        type: ClusterIP
      ```

+      </details>
+
 3. Deploy and Test

      Apply the deployment and service configurations using `kubectl apply -f <filename>`:

--- a/docs/deployment/nginx.md
+++ b/docs/deployment/nginx.md
@@ -36,23 +36,25 @@ docker build . -f Dockerfile.nginx --tag nginx-lb

 Create a file named `nginx_conf/nginx.conf`. Note that you can add as many servers as you'd like. In the below example we'll start with two. To add more, add another `server vllmN:8000 max_fails=3 fail_timeout=10000s;` entry to `upstream backend`.

-```console
-upstream backend {
-    least_conn;
-    server vllm0:8000 max_fails=3 fail_timeout=10000s;
-    server vllm1:8000 max_fails=3 fail_timeout=10000s;
-}
-server {
-    listen 80;
-    location / {
-        proxy_pass http://backend;
-        proxy_set_header Host $host;
-        proxy_set_header X-Real-IP $remote_addr;
-        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
-        proxy_set_header X-Forwarded-Proto $scheme;
+??? Config
+
+    ```console
+    upstream backend {
+        least_conn;
+        server vllm0:8000 max_fails=3 fail_timeout=10000s;
+        server vllm1:8000 max_fails=3 fail_timeout=10000s;
    }
-}
-```
+    server {
+        listen 80;
+        location / {
+            proxy_pass http://backend;
+            proxy_set_header Host $host;
+            proxy_set_header X-Real-IP $remote_addr;
+            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
+            proxy_set_header X-Forwarded-Proto $scheme;
+        }
+    }
+    ```

 [](){ #nginxloadbalancer-nginx-vllm-container }

@@ -93,30 +95,32 @@ Notes:
 - The below example assumes GPU backend used. If you are using CPU backend, remove `--gpus device=ID`, add `VLLM_CPU_KVCACHE_SPACE` and `VLLM_CPU_OMP_THREADS_BIND` environment variables to the docker run command.
 - Adjust the model name that you want to use in your vLLM servers if you don't want to use `Llama-2-7b-chat-hf`.

-```console
-mkdir -p ~/.cache/huggingface/hub/
-hf_cache_dir=~/.cache/huggingface/
-docker run \
-    -itd \
-    --ipc host \
-    --network vllm_nginx \
-    --gpus device=0 \
-    --shm-size=10.24gb \
-    -v $hf_cache_dir:/root/.cache/huggingface/ \
-    -p 8081:8000 \
-    --name vllm0 vllm \
-    --model meta-llama/Llama-2-7b-chat-hf
-docker run \
-    -itd \
-    --ipc host \
-    --network vllm_nginx \
-    --gpus device=1 \
-    --shm-size=10.24gb \
-    -v $hf_cache_dir:/root/.cache/huggingface/ \
-    -p 8082:8000 \
-    --name vllm1 vllm \
-    --model meta-llama/Llama-2-7b-chat-hf
-```
+??? Commands
+
+    ```console
+    mkdir -p ~/.cache/huggingface/hub/
+    hf_cache_dir=~/.cache/huggingface/
+    docker run \
+        -itd \
+        --ipc host \
+        --network vllm_nginx \
+        --gpus device=0 \
+        --shm-size=10.24gb \
+        -v $hf_cache_dir:/root/.cache/huggingface/ \
+        -p 8081:8000 \
+        --name vllm0 vllm \
+        --model meta-llama/Llama-2-7b-chat-hf
+    docker run \
+        -itd \
+        --ipc host \
+        --network vllm_nginx \
+        --gpus device=1 \
+        --shm-size=10.24gb \
+        -v $hf_cache_dir:/root/.cache/huggingface/ \
+        -p 8082:8000 \
+        --name vllm1 vllm \
+        --model meta-llama/Llama-2-7b-chat-hf
+    ```

 !!! note
    If you are behind proxy, you can pass the proxy settings to the docker run command via `-e http_proxy=$http_proxy -e https_proxy=$https_proxy`.

--- a/docs/design/arch_overview.md
+++ b/docs/design/arch_overview.md
@@ -22,31 +22,33 @@ server.

 Here is a sample of `LLM` class usage:

-```python
-from vllm import LLM, SamplingParams
-
-# Define a list of input prompts
-prompts = [
-    "Hello, my name is",
-    "The capital of France is",
-    "The largest ocean is",
-]
-
-# Define sampling parameters
-sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
-
-# Initialize the LLM engine with the OPT-125M model
-llm = LLM(model="facebook/opt-125m")
-
-# Generate outputs for the input prompts
-outputs = llm.generate(prompts, sampling_params)
-
-# Print the generated outputs
-for output in outputs:
-    prompt = output.prompt
-    generated_text = output.outputs[0].text
-    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
-```
+??? Code
+
+    ```python
+    from vllm import LLM, SamplingParams
+
+    # Define a list of input prompts
+    prompts = [
+        "Hello, my name is",
+        "The capital of France is",
+        "The largest ocean is",
+    ]
+
+    # Define sampling parameters
+    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+
+    # Initialize the LLM engine with the OPT-125M model
+    llm = LLM(model="facebook/opt-125m")
+
+    # Generate outputs for the input prompts
+    outputs = llm.generate(prompts, sampling_params)
+
+    # Print the generated outputs
+    for output in outputs:
+        prompt = output.prompt
+        generated_text = output.outputs[0].text
+        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+    ```

 More API details can be found in the [Offline Inference](#offline-inference-api) section of the API docs.

@@ -178,32 +180,34 @@ vision-language model.

    To avoid accidentally passing incorrect arguments, the constructor is now keyword-only. This ensures that the constructor will raise an error if old configurations are passed. vLLM developers have already made this change for all models within vLLM. For out-of-tree registered models, developers need to update their models, for example by adding shim code to adapt the old constructor signature to the new one:

-    ```python
-    class MyOldModel(nn.Module):
-        def __init__(
-            self,
-            config,
-            cache_config: Optional[CacheConfig] = None,
-            quant_config: Optional[QuantizationConfig] = None,
-            lora_config: Optional[LoRAConfig] = None,
-            prefix: str = "",
-        ) -> None:
-            ...
-
-    from vllm.config import VllmConfig
-    class MyNewModel(MyOldModel):
-        def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
-            config = vllm_config.model_config.hf_config
-            cache_config = vllm_config.cache_config
-            quant_config = vllm_config.quant_config
-            lora_config = vllm_config.lora_config
-            super().__init__(config, cache_config, quant_config, lora_config, prefix)
-
-    if __version__ >= "0.6.4":
-        MyModel = MyNewModel
-    else:
-        MyModel = MyOldModel
-    ```
+    ??? Code
+
+        ```python
+        class MyOldModel(nn.Module):
+            def __init__(
+                self,
+                config,
+                cache_config: Optional[CacheConfig] = None,
+                quant_config: Optional[QuantizationConfig] = None,
+                lora_config: Optional[LoRAConfig] = None,
+                prefix: str = "",
+            ) -> None:
+                ...
+
+        from vllm.config import VllmConfig
+        class MyNewModel(MyOldModel):
+            def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
+                config = vllm_config.model_config.hf_config
+                cache_config = vllm_config.cache_config
+                quant_config = vllm_config.quant_config
+                lora_config = vllm_config.lora_config
+                super().__init__(config, cache_config, quant_config, lora_config, prefix)
+
+        if __version__ >= "0.6.4":
+            MyModel = MyNewModel
+        else:
+            MyModel = MyOldModel
+        ```

    This way, the model can work with both old and new versions of vLLM.