Merge tag 'v0.7.3' into v0.7.3-dev

ec5e299c · zhuwenwen · 47bd229c · ed6e9075 · ec5e299c · ec5e299c
Commit ec5e299c authored Feb 21, 2025 by zhuwenwen
20 changed files
--- a/docs/seed_parameter_behavior.md
+++ b/docs/seed_parameter_behavior.md
+# Seed Parameter Behavior in vLLM
+
+## Overview
+
+The `seed` parameter in vLLM controls the state of several random number generators. Because these generators are global, setting it can also change the behavior of random operations in user code that runs in the same process as vLLM.
+
+## Default Behavior
+
+By default, the `seed` parameter is set to `None`. When `seed` is `None`, vLLM does not seed the global random states of `random`, `np.random`, or `torch` (via `torch.manual_seed`). Random operations in user code are therefore not fixed and may produce different results on each run.
+
+## Specifying a Seed
+
+If a specific seed value is provided, the global random states of `random`, `np.random`, and `torch` (via `torch.manual_seed`) are seeded with that value. This can be useful for reproducibility, as it ensures that random operations produce the same results across multiple runs.
+
+## Example Usage
+
+### Without Specifying a Seed
+
+```python
+import random
+from vllm import LLM
+
+# Initialize a vLLM model without specifying a seed
+model = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
+
+# Try generating random numbers
+print(random.randint(0, 100))  # Outputs different numbers across runs
+```
+
+### Specifying a Seed
+
+```python
+import random
+from vllm import LLM
+
+# Initialize a vLLM model with a specific seed
+model = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", seed=42)
+
+# Try generating random numbers
+print(random.randint(0, 100))  # Outputs the same number across runs
+```
+
+## Important Notes
+
+- If the `seed` parameter is not specified, the behavior of global random states remains unaffected.
+- If a specific seed value is provided, the global random states of `random`, `np.random`, and `torch` (via `torch.manual_seed`) are seeded with that value.
+- This behavior can be useful for reproducibility but may lead to non-intuitive behavior if the user is not explicitly aware of it.
+
+## Conclusion
+
+Understanding the behavior of the `seed` parameter in vLLM is crucial for ensuring the expected behavior of random operations in your code. By default, the `seed` parameter is set to `None`, which means that the global random states are not affected. However, specifying a seed value can help achieve reproducibility in your experiments.
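The effect described in this document can be sketched with the standard library alone, without calling vLLM. The snippet below assumes that passing a seed amounts to reseeding the global `random` state, as the text states; the helper name `draw_with_global_seed` and the seed values are illustrative only. It also shows one possible way to keep user code independent of the global state, by drawing from a private `random.Random` instance.

```python
import random


def draw_with_global_seed(seed: int, n: int = 3) -> list[int]:
    """Reseed the global generator, then draw n integers from it."""
    random.seed(seed)
    return [random.randint(0, 100) for _ in range(n)]


# Same seed -> same global state -> same sequence of draws.
first = draw_with_global_seed(42)
second = draw_with_global_seed(42)
print(first == second)

# A private generator ignores later reseeding of the global state.
local_rng = random.Random(1234)
random.seed(42)  # stands in for a library reseeding the global state
local_draws = [local_rng.randint(0, 100) for _ in range(3)]

reference_rng = random.Random(1234)
reference_draws = [reference_rng.randint(0, 100) for _ in range(3)]
print(local_draws == reference_draws)
```

Using a private generator is a design choice for user code that needs randomness unaffected by the engine's seed; it does not change how vLLM itself behaves.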
--- a/docs/source/_static/custom.css
+++ b/docs/source/_static/custom.css
+.vertical-table-header th.head:not(.stub) {
+    writing-mode: sideways-lr;
+    white-space: nowrap;
+    max-width: 0;
+    p {
+       margin: 0;
+    }
+}
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -12,6 +12,7 @@
 # add these directories to sys.path here. If the directory is relative to the
 # documentation root, use os.path.abspath to make it absolute, like shown here.

+import datetime
 import inspect
 import logging
 import os
@@ -27,7 +28,7 @@ sys.path.append(os.path.abspath("../.."))
 # -- Project information -----------------------------------------------------

 project = 'vLLM'
-copyright = '2024, vLLM Team'
+copyright = f'{datetime.datetime.now().year}, vLLM Team'
 author = 'the vLLM Team'

 # -- General configuration ---------------------------------------------------
@@ -78,8 +79,12 @@ html_theme_options = {
    'use_repository_button': True,
    'use_edit_page_button': True,
 }
+# Add any paths that contain custom static files (such as style sheets) here,
+# relative to this directory. They are copied after the builtin static files,
+# so a file named "default.css" will overwrite the builtin "default.css".
 html_static_path = ["_static"]
 html_js_files = ["custom.js"]
+html_css_files = ["custom.css"]

 myst_url_schemes = {
    'http': None,
@@ -121,11 +126,6 @@ if READTHEDOCS_VERSION_TYPE == "tag":
    if os.path.exists(header_file):
        os.remove(header_file)

-# Add any paths that contain custom static files (such as style sheets) here,
-# relative to this directory. They are copied after the builtin static files,
-# so a file named "default.css" will overwrite the builtin "default.css".
-# html_static_path = ['_static']
-

 # Generate additional rst documentation here.
 def setup(app):

--- a/docs/source/contributing/model/multimodal.md
+++ b/docs/source/contributing/model/multimodal.md
@@ -262,6 +262,255 @@ def get_mm_max_tokens_per_item(
 Our [actual code](gh-file:vllm/model_executor/models/llava.py) is more abstracted to support vision encoders other than CLIP.
 :::

+::::
+
+::::{tab-item} Non-consecutive feature tokens: Fuyu
+:sync: fuyu
+
+Looking at the code of HF's `FuyuForCausalLM`:
+
+```python
+# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/modeling_fuyu.py#L311-L322
+if image_patches is not None and past_key_values is None:
+    patch_embeddings = [
+        self.vision_embed_tokens(patch.to(self.vision_embed_tokens.weight.dtype))
+        .squeeze(0)
+        .to(inputs_embeds.device)
+        for patch in image_patches
+    ]
+    inputs_embeds = self.gather_continuous_embeddings(
+        word_embeddings=inputs_embeds,
+        continuous_embeddings=patch_embeddings,
+        image_patch_input_indices=image_patches_indices,
+    )
+```
+
+The number of placeholder feature tokens for the `i`th item in the batch is `patch_embeddings[i].shape[0]`,
+which is the same as `image_patches[i].shape[0]`, i.e. `num_total_patches`.
+
+Unlike LLaVA, Fuyu does not define the number of patches inside the modeling file. Where can we get more information?
+Considering that the model input comes from the output of `FuyuProcessor`, let's **look at the preprocessing files**.
+
+The image outputs are obtained by calling `FuyuImageProcessor.preprocess` and then
+`FuyuImageProcessor.preprocess_with_tokenizer_info` inside `FuyuProcessor`.
+
+In `FuyuImageProcessor.preprocess`, the images are resized and padded to the target `FuyuImageProcessor.size`,
+returning the dimensions after resizing (but before padding) as metadata.
+
+```python
+# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L541-L544
+image_encoding = self.image_processor.preprocess(images, **output_kwargs["images_kwargs"])
+batch_images = image_encoding["images"]
+image_unpadded_heights = image_encoding["image_unpadded_heights"]
+image_unpadded_widths = image_encoding["image_unpadded_widths"]
+
+# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L480-L
+if do_resize:
+    batch_images = [
+        [self.resize(image, size=size, input_data_format=input_data_format) for image in images]
+        for images in batch_images
+    ]
+
+image_sizes = [get_image_size(images[0], channel_dim=input_data_format) for images in batch_images]
+image_unpadded_heights = [[image_size[0]] for image_size in image_sizes]
+image_unpadded_widths = [[image_size[1]] for image_size in image_sizes]
+
+if do_pad:
+    batch_images = [
+        [
+            self.pad_image(
+                image,
+                size=size,
+                mode=padding_mode,
+                constant_values=padding_value,
+                input_data_format=input_data_format,
+            )
+            for image in images
+        ]
+        for images in batch_images
+    ]
+```
+
+In `FuyuImageProcessor.preprocess_with_tokenizer_info`, the images are split into patches based on this metadata:
+
+```python
+# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L425
+model_image_input = self.image_processor.preprocess_with_tokenizer_info(
+    image_input=tensor_batch_images,
+    image_present=image_present,
+    image_unpadded_h=image_unpadded_heights,
+    image_unpadded_w=image_unpadded_widths,
+    image_placeholder_id=image_placeholder_id,
+    image_newline_id=image_newline_id,
+    variable_sized=True,
+)
+
+# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L638-L658
+image_height, image_width = image.shape[1], image.shape[2]
+if variable_sized:  # variable_sized=True
+    new_h = min(
+        image_height,
+        math.ceil(image_unpadded_h[batch_index, subseq_index] / patch_height) * patch_height,
+    )
+    new_w = min(
+        image_width,
+        math.ceil(image_unpadded_w[batch_index, subseq_index] / patch_width) * patch_width,
+    )
+    image = image[:, :new_h, :new_w]
+    image_height, image_width = new_h, new_w
+
+num_patches = self.get_num_patches(image_height=image_height, image_width=image_width)
+tensor_of_image_ids = torch.full(
+    [num_patches], image_placeholder_id, dtype=torch.int32, device=image_input.device
+)
+patches = self.patchify_image(image=image.unsqueeze(0)).squeeze(0)
+assert num_patches == patches.shape[0]
+```
+
+The number of patches is in turn defined by `FuyuImageProcessor.get_num_patches`:
+
+```python
+# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L552-L562
+patch_size = patch_size if patch_size is not None else self.patch_size
+patch_height, patch_width = self.patch_size["height"], self.patch_size["width"]
+
+if image_height % patch_height != 0:
+    raise ValueError(f"{image_height=} must be divisible by {patch_height}")
+if image_width % patch_width != 0:
+    raise ValueError(f"{image_width=} must be divisible by {patch_width}")
+
+num_patches_per_dim_h = image_height // patch_height
+num_patches_per_dim_w = image_width // patch_width
+num_patches = num_patches_per_dim_h * num_patches_per_dim_w
+```
+
+We can calculate this in vLLM using this code:
+
+```python
+def get_num_image_patches(
+    self,
+    *,
+    image_width: int,
+    image_height: int,
+) -> int:
+    image_processor = self.get_image_processor()
+    target_width = image_processor.size["width"]
+    target_height = image_processor.size["height"]
+    patch_width = image_processor.patch_size["width"]
+    patch_height = image_processor.patch_size["height"]
+
+    if not (image_width <= target_width and image_height <= target_height):
+        height_scale_factor = target_height / image_height
+        width_scale_factor = target_width / image_width
+        optimal_scale_factor = min(height_scale_factor, width_scale_factor)
+
+        image_height = int(image_height * optimal_scale_factor)
+        image_width = int(image_width * optimal_scale_factor)
+
+    ncols = math.ceil(image_width / patch_width)
+    nrows = math.ceil(image_height / patch_height)
+    return ncols * nrows
+```
+
+These image patches correspond to placeholder tokens (`|SPEAKER|`). However, the processor also
+inserts newline tokens (`|NEWLINE|`) as shown here:
+
+```python
+# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L654-L670
+tensor_of_image_ids = torch.full(
+    [num_patches], image_placeholder_id, dtype=torch.int32, device=image_input.device
+)
+patches = self.patchify_image(image=image.unsqueeze(0)).squeeze(0)
+assert num_patches == patches.shape[0]
+
+if variable_sized:
+    # Now terminate each line with |NEWLINE|.
+    tensor_of_image_ids = tensor_of_image_ids.reshape(-1, image_width // patch_width)
+    newline_ids = torch.full(
+        [tensor_of_image_ids.shape[0], 1],
+        image_newline_id,
+        dtype=torch.int32,
+        device=image_input.device,
+    )
+    tensor_of_image_ids = torch.cat([tensor_of_image_ids, newline_ids], dim=1)
+    tensor_of_image_ids = tensor_of_image_ids.reshape(-1)
+```
+
+So, the layout of tokens for an image is:
+
+```
+|SPEAKER||SPEAKER|...|SPEAKER||NEWLINE|
+|SPEAKER||SPEAKER|...|SPEAKER||NEWLINE|
+...
+|SPEAKER||SPEAKER|...|SPEAKER||NEWLINE|
+```
+
+This makes the placeholder tokens non-consecutive in the prompt.
+Since vLLM requires the feature tokens to be consecutive, **we also treat the newline tokens as feature tokens**.
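As a rough illustration of this layout, the sketch below builds the token sequence for a small grid. The token IDs are hypothetical placeholders chosen for readability (the real values come from the tokenizer), and `build_image_token_layout` is an illustrative name rather than a vLLM or HF function.

```python
# Hypothetical token IDs, for illustration only.
IMAGE_TOKEN_ID = 1    # stands in for |SPEAKER|
NEWLINE_TOKEN_ID = 2  # stands in for |NEWLINE|


def build_image_token_layout(ncols: int, nrows: int) -> list[int]:
    """Each row holds `ncols` placeholder tokens followed by one newline token."""
    row = [IMAGE_TOKEN_ID] * ncols + [NEWLINE_TOKEN_ID]
    return row * nrows


layout = build_image_token_layout(ncols=3, nrows=2)
print(layout)  # [1, 1, 1, 2, 1, 1, 1, 2]

# Placeholders alone are interrupted by newlines, but placeholders plus
# newlines together form one consecutive run of (ncols + 1) * nrows tokens.
num_placeholders = layout.count(IMAGE_TOKEN_ID)
num_feature_tokens = len(layout)
```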
+
+So overall, the total number of feature tokens is
+
+```python
+def get_num_image_tokens(
+    self,
+    *,
+    image_width: int,
+    image_height: int,
+) -> int:
+    image_processor = self.get_image_processor()
+    target_width = image_processor.size["width"]
+    target_height = image_processor.size["height"]
+    patch_width = image_processor.patch_size["width"]
+    patch_height = image_processor.patch_size["height"]
+
+    if not (image_width <= target_width and image_height <= target_height):
+        height_scale_factor = target_height / image_height
+        width_scale_factor = target_width / image_width
+        optimal_scale_factor = min(height_scale_factor, width_scale_factor)
+
+        image_height = int(image_height * optimal_scale_factor)
+        image_width = int(image_width * optimal_scale_factor)
+
+    ncols = math.ceil(image_width / patch_width)
+    nrows = math.ceil(image_height / patch_height)
+    return (ncols + 1) * nrows
+```
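To make the arithmetic concrete, here is a standalone version of the same calculation that takes the processor settings as plain arguments. The default target size (1920 x 1080) and patch size (30 x 30) are assumptions made for this example; read the actual values from your image processor's `size` and `patch_size`.

```python
import math


def num_image_tokens(
    image_width: int,
    image_height: int,
    *,
    target_width: int = 1920,   # assumed value, check image_processor.size
    target_height: int = 1080,  # assumed value, check image_processor.size
    patch_width: int = 30,      # assumed value, check image_processor.patch_size
    patch_height: int = 30,     # assumed value, check image_processor.patch_size
) -> int:
    """Count placeholder tokens plus one newline token per patch row."""
    # Images that do not fit are scaled down, preserving aspect ratio.
    if not (image_width <= target_width and image_height <= target_height):
        scale = min(target_height / image_height, target_width / image_width)
        image_height = int(image_height * scale)
        image_width = int(image_width * scale)

    ncols = math.ceil(image_width / patch_width)
    nrows = math.ceil(image_height / patch_height)
    return (ncols + 1) * nrows


small = num_image_tokens(300, 200)     # 10 x 7 grid plus 7 newlines
full = num_image_tokens(1920, 1080)    # 64 x 36 grid plus 36 newlines
large = num_image_tokens(3840, 2160)   # scaled down to 1920 x 1080 first
print(small, full, large)  # 77 2340 2340
```

Under these assumed settings, any image at least as large as the target size in both dimensions with the same aspect ratio yields the same count as the target size itself, which is why the maximum is computed from `image_processor.size`.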
+
+To calculate the maximum number of image tokens, recall that input images are first resized
+to fit within `image_processor.size`. The maximum possible dimensions of the image before
+it is converted into patches are therefore equal to `image_processor.size`.
+
+```python
+def get_image_size_with_most_features(self) -> ImageSize:
+    image_processor = self.get_image_processor()
+    return ImageSize(width=image_processor.size["width"],
+                     height=image_processor.size["height"])
+
+def get_max_image_tokens(self) -> int:
+    target_width, target_height = self.get_image_size_with_most_features()
+
+    return self.get_num_image_tokens(
+        image_width=target_width,
+        image_height=target_height,
+    )
+```
+
+And thus, we can override the method as:
+
+```python
+def get_mm_max_tokens_per_item(
+    self,
+    seq_len: int,
+    mm_counts: Mapping[str, int],
+) -> Mapping[str, int]:
+    return {"image": self.get_max_image_tokens()}
+```
+
+:::{note}
+Our [actual code](gh-file:vllm/model_executor/models/fuyu.py) returns `ncols` and `nrows` directly instead of the total token count.
+This is because `ncols` and `nrows` are used to specify the layout of the feature tokens (as shown in Step 4 of this guide).
+:::
+
 ::::
 :::::

@@ -282,7 +531,8 @@ on the code for {meth}`~vllm.multimodal.processing.BaseProcessingInfo.get_mm_max
 ::::{tab-set}
 :::{tab-item} Basic example: LLaVA
 :sync: llava
-Making use of the `get_image_size_with_most_features` method implemented in the previous section:
+
+Making use of the `get_image_size_with_most_features` method implemented in Step 2:

 ```python
 def get_dummy_processor_inputs(
@@ -312,6 +562,39 @@ def get_dummy_processor_inputs(
 ```

 :::
+
+:::{tab-item} No input placeholders: Fuyu
+:sync: fuyu
+
+Fuyu does not expect image placeholders in the inputs to the HF processor, so
+the dummy prompt text is empty regardless of the number of images.
+Otherwise, the logic of this method is very similar to LLaVA:
+
+```python
+def get_dummy_processor_inputs(
+    self,
+    seq_len: int,
+    mm_counts: Mapping[str, int],
+) -> ProcessorInputs:
+    target_width, target_height = \
+        self.info.get_image_size_with_most_features()
+    num_images = mm_counts.get("image", 0)
+
+    mm_data = {
+        "image":
+        self._get_dummy_images(width=target_width,
+                                height=target_height,
+                                num_images=num_images)
+    }
+
+    return ProcessorInputs(
+        prompt_text="",
+        mm_data=mm_data,
+    )
+```
+
+:::
+
 ::::

 ## 4. Specify processing details
@@ -325,40 +608,28 @@ to fill in the missing details about HF processing.

 ### Multi-modal fields

-Override {class}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_mm_fields_config` to
+Override {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_mm_fields_config` to
 return a schema of the tensors outputted by the HF processor that are related to the input multi-modal items.

 :::::{tab-set}
 ::::{tab-item} Basic example: LLaVA
 :sync: llava

-Looking at the model's `forward` method:
+The output of `CLIPImageProcessor` is a simple tensor with shape
+`(num_images, num_channels, image_height, image_width)`:

 ```python
-# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L387-L404
-def forward(
-    self,
-    input_ids: torch.LongTensor = None,
-    pixel_values: torch.FloatTensor = None,
-    attention_mask: Optional[torch.Tensor] = None,
-    position_ids: Optional[torch.LongTensor] = None,
-    past_key_values: Optional[List[torch.FloatTensor]] = None,
-    inputs_embeds: Optional[torch.FloatTensor] = None,
-    vision_feature_layer: Optional[int] = None,
-    vision_feature_select_strategy: Optional[str] = None,
-    labels: Optional[torch.LongTensor] = None,
-    use_cache: Optional[bool] = None,
-    output_attentions: Optional[bool] = None,
-    output_hidden_states: Optional[bool] = None,
-    return_dict: Optional[bool] = None,
-    cache_position: Optional[torch.LongTensor] = None,
-    num_logits_to_keep: int = 0,
-) -> Union[Tuple, LlavaCausalLMOutputWithPast]:
-```
-
-The only related keyword argument is `pixel_values` which directly corresponds to input images.
-The shape of `pixel_values` is `(N, C, H, W)` where `N` is the number of images.
-So, we override the method as follows:
+# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/image_processing_clip.py#L339-L345
+images = [
+    to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format)
+    for image in all_images
+]
+
+data = {"pixel_values": images}
+return BatchFeature(data=data, tensor_type=return_tensors)
+```
+
+So, we override {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_mm_fields_config` as follows:

 ```python
 def _get_mm_fields_config(
@@ -377,11 +648,83 @@ pre-computed image embeddings, which can be passed to be model via the `image_em
 :::

 ::::
+
+::::{tab-item} With postprocessing: Fuyu
+:sync: fuyu
+
+The `image_patches` output of `FuyuImageProcessor.preprocess_with_tokenizer_info` concatenates
+the patches from each image belonging to an item in the batch:
+
+```python
+# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L673-L679
+        image_input_ids.append(tensor_of_image_ids)
+        image_patches.append(patches)
+    else:
+        image_input_ids.append(torch.tensor([], dtype=torch.int32, device=image_input.device))
+
+batch_image_input_ids.append(image_input_ids)
+batch_image_patches.append(image_patches)
+```
+
+The shape of `image_patches` outputted by `FuyuImageProcessor` is therefore
+`(1, num_images, num_patches, patch_width * patch_height * num_channels)`.
+
+In order to support the use of {func}`MultiModalFieldConfig.batched` like in LLaVA,
+we remove the extra batch dimension by overriding {meth}`BaseMultiModalProcessor._call_hf_processor`:
+
+```python
+def _call_hf_processor(
+    self,
+    prompt: str,
+    mm_data: Mapping[str, object],
+    mm_kwargs: Mapping[str, object],
+) -> BatchFeature:
+    processed_outputs = super()._call_hf_processor(
+        prompt=prompt,
+        mm_data=mm_data,
+        mm_kwargs=mm_kwargs,
+    )
+
+    image_patches = processed_outputs.get("image_patches")
+    if image_patches is not None:
+        images = mm_data["images"]
+        assert isinstance(images, list)
+
+        # Original output: (1, num_images, Pn, Px * Py * C)
+        # New output: (num_images, Pn, Px * Py * C)
+        assert (isinstance(image_patches, list)
+                and len(image_patches) == 1)
+        assert (isinstance(image_patches[0], torch.Tensor)
+                and len(image_patches[0]) == len(images))
+
+        processed_outputs["image_patches"] = image_patches[0]
+
+    return processed_outputs
+```
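The unwrapping step can be illustrated without tensors. The sketch below uses nested Python lists in place of the processor output, so the shapes are only an analogy; `drop_leading_batch_dim` is a hypothetical helper name, not part of vLLM or HF.

```python
def drop_leading_batch_dim(image_patches: list, num_images: int) -> list:
    """Return the per-image entries held inside a length-1 outer list."""
    if len(image_patches) != 1:
        raise ValueError(f"expected 1 batch entry, got {len(image_patches)}")

    per_image = image_patches[0]
    if len(per_image) != num_images:
        raise ValueError(
            f"expected {num_images} images, got {len(per_image)}")
    return per_image


# Two images with 2 and 3 patches; each patch is a flat vector of length 4.
wrapped = [[
    [[0.0] * 4 for _ in range(2)],
    [[0.0] * 4 for _ in range(3)],
]]

unwrapped = drop_leading_batch_dim(wrapped, num_images=2)
patch_counts = [len(patches) for patches in unwrapped]
print(len(unwrapped), patch_counts)  # 2 [2, 3]
```

After unwrapping, the leading dimension indexes images, which is the layout a per-item batched field expects.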
+
+:::{note}
+Our [actual code](gh-file:vllm/model_executor/models/fuyu.py) has special handling
+for text-only inputs to prevent unnecessary warnings from the HF processor.
+:::
+
+This lets us override {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_mm_fields_config` as follows:
+
+```python
+def _get_mm_fields_config(
+    self,
+    hf_inputs: BatchFeature,
+    hf_processor_mm_kwargs: Mapping[str, object],
+) -> Mapping[str, MultiModalFieldConfig]:
+    return dict(image_patches=MultiModalFieldConfig.batched("image"))
+```
+
+::::
+
 :::::

 ### Prompt replacements

-Override {class}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_replacements` to
+Override {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_replacements` to
 return a list of {class}`~vllm.multimodal.processing.PromptReplacement` instances.

 Each {class}`~vllm.multimodal.processing.PromptReplacement` instance specifies a find-and-replace
@@ -402,7 +745,7 @@ for sample in text:
 ```

 It simply repeats each input `image_token` a number of times equal to the number of placeholder feature tokens (`num_image_tokens`).
-Based on this, we override the method as follows:
+Based on this, we override {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_replacements` as follows:

 ```python
 def _get_prompt_replacements(
@@ -435,6 +778,159 @@ def _get_prompt_replacements(
 ```

 :::
+
+:::{tab-item} Handling additional tokens: Fuyu
+:sync: fuyu
+
+Recall the layout of feature tokens from Step 2:
+
+```
+|SPEAKER||SPEAKER|...|SPEAKER||NEWLINE|
+|SPEAKER||SPEAKER|...|SPEAKER||NEWLINE|
+...
+|SPEAKER||SPEAKER|...|SPEAKER||NEWLINE|
+```
+
+We define a helper function to return `ncols` and `nrows` directly:
+
+```python
+def get_image_feature_grid_size(
+    self,
+    *,
+    image_width: int,
+    image_height: int,
+) -> tuple[int, int]:
+    image_processor = self.get_image_processor()
+    target_width = image_processor.size["width"]
+    target_height = image_processor.size["height"]
+    patch_width = image_processor.patch_size["width"]
+    patch_height = image_processor.patch_size["height"]
+
+    if not (image_width <= target_width and image_height <= target_height):
+        height_scale_factor = target_height / image_height
+        width_scale_factor = target_width / image_width
+        optimal_scale_factor = min(height_scale_factor, width_scale_factor)
+
+        image_height = int(image_height * optimal_scale_factor)
+        image_width = int(image_width * optimal_scale_factor)
+
+    ncols = math.ceil(image_width / patch_width)
+    nrows = math.ceil(image_height / patch_height)
+    return ncols, nrows
+```
+
+Based on this, we can initially define our replacement tokens as:
+
+```python
+def get_replacement(item_idx: int):
+    images = mm_items.get_items("image", ImageProcessorItems)
+    image_size = images.get_image_size(item_idx)
+
+    ncols, nrows = self.info.get_image_feature_grid_size(
+        image_width=image_size.width,
+        image_height=image_size.height,
+    )
+
+    # `_IMAGE_TOKEN_ID` corresponds to `|SPEAKER|`
+    # `_NEWLINE_TOKEN_ID` corresponds to `|NEWLINE|`
+    return ([_IMAGE_TOKEN_ID] * ncols + [_NEWLINE_TOKEN_ID]) * nrows
+```
+
+However, this is not entirely correct. After `FuyuImageProcessor.preprocess_with_tokenizer_info` is called,
+a BOS token (`<s>`) is also added to the prompt:
+
+```python
+# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L435
+model_image_input = self.image_processor.preprocess_with_tokenizer_info(
+    image_input=tensor_batch_images,
+    image_present=image_present,
+    image_unpadded_h=image_unpadded_heights,
+    image_unpadded_w=image_unpadded_widths,
+    image_placeholder_id=image_placeholder_id,
+    image_newline_id=image_newline_id,
+    variable_sized=True,
+)
+prompt_tokens, prompts_length = _tokenize_prompts_with_image_and_batch(
+    tokenizer=self.tokenizer,
+    prompts=prompts,
+    scale_factors=scale_factors,
+    max_tokens_to_generate=self.max_tokens_to_generate,
+    max_position_embeddings=self.max_position_embeddings,
+    add_BOS=True,
+    add_beginning_of_answer_token=True,
+)
+```
+
+To accommodate this, instead of a plain list of token IDs you can return an instance of `PromptReplacementDetails`
+with different `full` and `features` attributes:
+
+```python
+hf_config = self.info.get_hf_config()
+bos_token_id = hf_config.bos_token_id  # `<s>`
+assert isinstance(bos_token_id, int)
+
+def get_replacement_fuyu(item_idx: int):
+    images = mm_items.get_items("image", ImageProcessorItems)
+    image_size = images.get_image_size(item_idx)
+
+    ncols, nrows = self.info.get_image_feature_grid_size(
+        image_width=image_size.width,
+        image_height=image_size.height,
+    )
+    image_tokens = ([_IMAGE_TOKEN_ID] * ncols +
+                    [_NEWLINE_TOKEN_ID]) * nrows
+
+    return PromptReplacementDetails(
+        full=image_tokens + [bos_token_id],
+        features=image_tokens,
+    )
+```
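The distinction between the two attributes can be sketched with a stand-in dataclass. `ReplacementSketch` below is hypothetical and is not vLLM's `PromptReplacementDetails`; the token IDs are likewise made up for illustration. It only mirrors the idea that `full` is everything inserted into the prompt, while `features` is the part that corresponds to image feature tokens.

```python
from dataclasses import dataclass


@dataclass
class ReplacementSketch:
    """Hypothetical stand-in used only to illustrate `full` vs `features`."""
    full: list[int]      # everything inserted into the prompt
    features: list[int]  # the subset that corresponds to image features


# Hypothetical token IDs, for illustration only.
IMAGE_TOKEN_ID = 1
NEWLINE_TOKEN_ID = 2
BOS_TOKEN_ID = 0


def make_replacement(ncols: int, nrows: int) -> ReplacementSketch:
    image_tokens = ([IMAGE_TOKEN_ID] * ncols + [NEWLINE_TOKEN_ID]) * nrows
    # The BOS token follows the image tokens but is not a feature token.
    return ReplacementSketch(
        full=image_tokens + [BOS_TOKEN_ID],
        features=image_tokens,
    )


rep = make_replacement(ncols=2, nrows=2)
print(rep.full)      # [1, 1, 2, 1, 1, 2, 0]
print(rep.features)  # [1, 1, 2, 1, 1, 2]
```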
+
+Finally, noticing that the HF processor removes the `|ENDOFTEXT|` token from the tokenized prompt,
+we can search for it to conduct the replacement at the start of the string:
+
+```python
+def _get_prompt_replacements(
+    self,
+    mm_items: MultiModalDataItems,
+    hf_processor_mm_kwargs: Mapping[str, object],
+    out_mm_kwargs: MultiModalKwargs,
+) -> list[PromptReplacement]:
+    hf_config = self.info.get_hf_config()
+    bos_token_id = hf_config.bos_token_id
+    assert isinstance(bos_token_id, int)
+
+    tokenizer = self.info.get_tokenizer()
+    eot_token_id = tokenizer.bos_token_id
+    assert isinstance(eot_token_id, int)
+
+    def get_replacement_fuyu(item_idx: int):
+        images = mm_items.get_items("image", ImageProcessorItems)
+        image_size = images.get_image_size(item_idx)
+
+        ncols, nrows = self.info.get_image_feature_grid_size(
+            image_width=image_size.width,
+            image_height=image_size.height,
+        )
+        image_tokens = ([_IMAGE_TOKEN_ID] * ncols +
+                        [_NEWLINE_TOKEN_ID]) * nrows
+
+        return PromptReplacementDetails(
+            full=image_tokens + [bos_token_id],
+            features=image_tokens,
+        )
+
+    return [
+        PromptReplacement(
+            modality="image",
+            target=[eot_token_id],
+            replacement=get_replacement_fuyu,
+        )
+    ]
+```
+
+:::
+
 ::::

 ## 5. Register processor-related classes

--- a/docs/source/contributing/profiling/profiling_index.md
+++ b/docs/source/contributing/profiling/profiling_index.md
 # Profiling vLLM

+:::{warning}
+Profiling is only intended for vLLM developers and maintainers to understand the proportion of time spent in different parts of the codebase. **vLLM end-users should never turn on profiling** as it will significantly slow down the inference.
+:::
+
 We support tracing vLLM workers using the `torch.profiler` module. You can enable tracing by setting the `VLLM_TORCH_PROFILER_DIR` environment variable to the directory where you want to save the traces: `VLLM_TORCH_PROFILER_DIR=/mnt/traces/`

 The OpenAI server also needs to be started with the `VLLM_TORCH_PROFILER_DIR` environment variable set.

 When using `benchmarks/benchmark_serving.py`, you can enable profiling by passing the `--profile` flag.

-:::{warning}
-Only enable profiling in a development environment.
-:::
-
 Traces can be visualized using <https://ui.perfetto.dev/>.

 :::{tip}

--- a/docs/source/design/arch_overview.md
+++ b/docs/source/design/arch_overview.md
@@ -66,7 +66,7 @@ This server can be started using the `vllm serve` command.
 vllm serve <model>
 ```

-The code for the `vllm` CLI can be found in <gh-file:vllm/scripts.py>.
+The code for the `vllm` CLI can be found in <gh-file:vllm/entrypoints/cli/main.py>.

 Sometimes you may see the API server entrypoint used directly instead of via the
 `vllm` CLI command. For example:

--- a/docs/source/features/compatibility_matrix.md
+++ b/docs/source/features/compatibility_matrix.md
@@ -4,8 +4,14 @@

 The tables below show mutually exclusive features and the support on some hardware.

+The symbols used have the following meanings:
+
+- ✅ = Full compatibility
+- 🟠 = Partial compatibility
+- ❌ = No compatibility
+
 :::{note}
-Check the '✗' with links to see tracking issue for unsupported feature/hardware combination.
+Check the ❌ or 🟠 entries with links to see the tracking issue for an unsupported feature/hardware combination.
 :::

 ## Feature x Feature
@@ -29,6 +35,7 @@ Check the '✗' with links to see tracking issue for unsupported feature/hardwar
 :header-rows: 1
 :stub-columns: 1
 :widths: auto
+:class: vertical-table-header

 - * Feature
  * [CP](#chunked-prefill)
@@ -48,7 +55,7 @@ Check the '✗' with links to see tracking issue for unsupported feature/hardwar
  * beam-search
  * <abbr title="Guided Decoding">guided dec</abbr>
 - * [CP](#chunked-prefill)
-  *
+  * ✅
  *
  *
  *
@@ -66,7 +73,7 @@ Check the '✗' with links to see tracking issue for unsupported feature/hardwar
  *
 - * [APC](#automatic-prefix-caching)
  * ✅
-  *
+  * ✅
  *
  *
  *
@@ -82,9 +89,9 @@ Check the '✗' with links to see tracking issue for unsupported feature/hardwar
  *
  *
 - * [LoRA](#lora-adapter)
-  * [✗](gh-pr:9057)
  * ✅
-  *
+  * ✅
+  * ✅
  *
  *
  *
@@ -102,7 +109,7 @@ Check the '✗' with links to see tracking issue for unsupported feature/hardwar
  * ✅
  * ✅
  * ✅
-  *
+  * ✅
  *
  *
  *
@@ -118,9 +125,9 @@ Check the '✗' with links to see tracking issue for unsupported feature/hardwar
 - * [SD](#spec_decode)
  * ✅
  * ✅
-  * ✗
+  * ❌
+  * ✅
  * ✅
-  *
  *
  *
  *
@@ -138,7 +145,7 @@ Check the '✗' with links to see tracking issue for unsupported feature/hardwar
  * ✅
  * ✅
  * ✅
-  *
+  * ✅
  *
  *
  *
@@ -150,13 +157,13 @@ Check the '✗' with links to see tracking issue for unsupported feature/hardwar
  *
  *
 - * <abbr title="Pooling Models">pooling</abbr>
-  * ✗
-  * ✗
-  * ✗
-  * ✗
-  * ✗
-  * ✗
-  *
+  * ❌
+  * ❌
+  * ❌
+  * ❌
+  * ❌
+  * ❌
+  * ✅
  *
  *
  *
@@ -167,14 +174,14 @@ Check the '✗' with links to see tracking issue for unsupported feature/hardwar
  *
  *
 - * <abbr title="Encoder-Decoder Models">enc-dec</abbr>
-  * ✗
-  * [✗](gh-issue:7366)
-  * ✗
-  * ✗
-  * [✗](gh-issue:7366)
+  * ❌
+  * [❌](gh-issue:7366)
+  * ❌
+  * ❌
+  * [❌](gh-issue:7366)
+  * ✅
  * ✅
  * ✅
-  *
  *
  *
  *
@@ -190,9 +197,9 @@ Check the '✗' with links to see tracking issue for unsupported feature/hardwar
  * ✅
  * ✅
  * ✅
-  * ✗
+  * ❌
+  * ✅
  * ✅
-  *
  *
  *
  *
@@ -205,12 +212,12 @@ Check the '✗' with links to see tracking issue for unsupported feature/hardwar
  * ✅
  * ✅
  * ✅
-  * [✗](gh-pr:8199)
  * ✅
-  * ✗
+  * ✅
+  * ❌
+  * ✅
  * ✅
  * ✅
-  *
  *
  *
  *
@@ -222,49 +229,49 @@ Check the '✗' with links to see tracking issue for unsupported feature/hardwar
  * ✅
  * ✅
  * ✅
-  * ✗
+  * ❌
+  * ✅
+  * ❌
+  * ❌
  * ✅
-  * ✗
-  * ✗
  * ✅
  * ✅
-  *
  *
  *
  *
  *
  *
 - * multi-step
-  * ✗
+  * ❌
  * ✅
-  * ✗
+  * ❌
+  * ✅
+  * ❌
+  * ✅
+  * ❌
+  * ❌
  * ✅
-  * ✗
  * ✅
-  * ✗
-  * ✗
  * ✅
-  * [✗](gh-issue:8198)
  * ✅
-  *
  *
  *
  *
  *
 - * <abbr title="Multimodal Inputs">mm</abbr>
  * ✅
-  * [✗](gh-pr:8348)
-  * [✗](gh-pr:7199)
-  * ?
-  * ?
+  * [🟠](gh-pr:8348)
+  * [🟠](gh-pr:4194)
+  * ❔
+  * ❔
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
-  * ?
-  *
+  * ❔
+  * ✅
  *
  *
  *
@@ -273,16 +280,16 @@ Check the '✗' with links to see tracking issue for unsupported feature/hardwar
  * ✅
  * ✅
  * ✅
-  * [✗](gh-issue:6137)
+  * [❌](gh-issue:6137)
  * ✅
-  * ✗
+  * ❌
  * ✅
  * ✅
  * ✅
-  * ?
-  * [✗](gh-issue:7968)
+  * ❔
+  * [❌](gh-issue:7968)
+  * ✅
  * ✅
-  *
  *
  *
 - * beam-search
@@ -290,35 +297,35 @@ Check the '✗' with links to see tracking issue for unsupported feature/hardwar
  * ✅
  * ✅
  * ✅
-  * [✗](gh-issue:6137)
+  * [❌](gh-issue:6137)
  * ✅
-  * ✗
+  * ❌
  * ✅
  * ✅
  * ✅
-  * ?
-  * [✗](gh-issue:7968>)
-  * ?
+  * ❔
+  * [❌](gh-issue:7968)
+  * ❔
+  * ✅
  * ✅
-  *
  *
 - * <abbr title="Guided Decoding">guided dec</abbr>
  * ✅
  * ✅
-  * ?
-  * ?
-  * [✗](gh-issue:11484)
+  * ❔
+  * ❔
+  * [❌](gh-issue:11484)
  * ✅
-  * ✗
-  * ?
+  * ❌
+  * ❔
  * ✅
  * ✅
  * ✅
-  * [✗](gh-issue:9893)
-  * ?
+  * [❌](gh-issue:9893)
+  * ❔
+  * ✅
  * ✅
  * ✅
-  *
 :::

 (feature-x-hardware)=
@@ -339,7 +346,7 @@ Check the '✗' with links to see tracking issue for unsupported feature/hardwar
  * CPU
  * AMD
 - * [CP](#chunked-prefill)
-  * [✗](gh-issue:2729)
+  * [❌](gh-issue:2729)
  * ✅
  * ✅
  * ✅
@@ -347,7 +354,7 @@ Check the '✗' with links to see tracking issue for unsupported feature/hardwar
  * ✅
  * ✅
 - * [APC](#automatic-prefix-caching)
-  * [✗](gh-issue:3687)
+  * [❌](gh-issue:3687)
  * ✅
  * ✅
  * ✅
@@ -368,7 +375,7 @@ Check the '✗' with links to see tracking issue for unsupported feature/hardwar
  * ✅
  * ✅
  * ✅
-  * [✗](gh-issue:8475)
+  * [❌](gh-issue:8475)
  * ✅
 - * [SD](#spec_decode)
  * ✅
@@ -384,7 +391,7 @@ Check the '✗' with links to see tracking issue for unsupported feature/hardwar
  * ✅
  * ✅
  * ✅
-  * ✗
+  * ❌
  * ✅
 - * <abbr title="Pooling Models">pooling</abbr>
  * ✅
@@ -393,7 +400,7 @@ Check the '✗' with links to see tracking issue for unsupported feature/hardwar
  * ✅
  * ✅
  * ✅
-  * ?
+  * ❔
 - * <abbr title="Encoder-Decoder Models">enc-dec</abbr>
  * ✅
  * ✅
@@ -401,7 +408,7 @@ Check the '✗' with links to see tracking issue for unsupported feature/hardwar
  * ✅
  * ✅
  * ✅
-  * ✗
+  * ❌
 - * <abbr title="Multimodal Inputs">mm</abbr>
  * ✅
  * ✅
@@ -432,15 +439,15 @@ Check the '✗' with links to see tracking issue for unsupported feature/hardwar
  * ✅
  * ✅
  * ✅
-  * ✗
-  * ✗
+  * ❌
+  * ❌
 - * multi-step
  * ✅
  * ✅
  * ✅
  * ✅
  * ✅
-  * [✗](gh-issue:8477)
+  * [❌](gh-issue:8477)
  * ✅
 - * best-of
  * ✅

--- a/docs/source/features/quantization/supported_hardware.md
+++ b/docs/source/features/quantization/supported_hardware.md
@@ -20,93 +20,93 @@ The table below shows the compatibility of various quantization implementations
  * AWS Inferentia
  * Google TPU
 - * AWQ
-  * ✗
+  * ❌
  * ✅︎
  * ✅︎
  * ✅︎
  * ✅︎
-  * ✗
+  * ❌
  * ✅︎
  * ✅︎
-  * ✗
-  * ✗
+  * ❌
+  * ❌
 - * GPTQ
  * ✅︎
  * ✅︎
  * ✅︎
  * ✅︎
  * ✅︎
-  * ✗
+  * ❌
  * ✅︎
  * ✅︎
-  * ✗
-  * ✗
+  * ❌
+  * ❌
 - * Marlin (GPTQ/AWQ/FP8)
-  * ✗
-  * ✗
+  * ❌
+  * ❌
  * ✅︎
  * ✅︎
  * ✅︎
-  * ✗
-  * ✗
-  * ✗
-  * ✗
-  * ✗
+  * ❌
+  * ❌
+  * ❌
+  * ❌
+  * ❌
 - * INT8 (W8A8)
-  * ✗
+  * ❌
  * ✅︎
  * ✅︎
  * ✅︎
  * ✅︎
-  * ✗
-  * ✗
+  * ❌
+  * ❌
  * ✅︎
-  * ✗
-  * ✗
+  * ❌
+  * ❌
 - * FP8 (W8A8)
-  * ✗
-  * ✗
-  * ✗
+  * ❌
+  * ❌
+  * ❌
  * ✅︎
  * ✅︎
  * ✅︎
-  * ✗
-  * ✗
-  * ✗
-  * ✗
+  * ❌
+  * ❌
+  * ❌
+  * ❌
 - * AQLM
  * ✅︎
  * ✅︎
  * ✅︎
  * ✅︎
  * ✅︎
-  * ✗
-  * ✗
-  * ✗
-  * ✗
-  * ✗
+  * ❌
+  * ❌
+  * ❌
+  * ❌
+  * ❌
 - * bitsandbytes
  * ✅︎
  * ✅︎
  * ✅︎
  * ✅︎
  * ✅︎
-  * ✗
-  * ✗
-  * ✗
-  * ✗
-  * ✗
+  * ❌
+  * ❌
+  * ❌
+  * ❌
+  * ❌
 - * DeepSpeedFP
  * ✅︎
  * ✅︎
  * ✅︎
  * ✅︎
  * ✅︎
-  * ✗
-  * ✗
-  * ✗
-  * ✗
-  * ✗
+  * ❌
+  * ❌
+  * ❌
+  * ❌
+  * ❌
 - * GGUF
  * ✅︎
  * ✅︎
@@ -114,16 +114,16 @@ The table below shows the compatibility of various quantization implementations
  * ✅︎
  * ✅︎
  * ✅︎
-  * ✗
-  * ✗
-  * ✗
-  * ✗
+  * ❌
+  * ❌
+  * ❌
+  * ❌

 :::

 - Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
- "✅︎" indicates that the quantization method is supported on the specified hardware.
- "✗" indicates that the quantization method is not supported on the specified hardware.
+- ✅︎ indicates that the quantization method is supported on the specified hardware.
+- ❌ indicates that the quantization method is not supported on the specified hardware.

 :::{note}
 This compatibility chart is subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.

--- a/docs/source/features/spec_decode.md
+++ b/docs/source/features/spec_decode.md
@@ -45,7 +45,7 @@ To perform the same with an online mode launch the server:

 ```bash
 python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --model facebook/opt-6.7b \
-    --seed 42 -tp 1 --speculative_model facebook/opt-125m --use-v2-block-manager \
+    --seed 42 -tp 1 --speculative_model facebook/opt-125m \
    --num_speculative_tokens 5 --gpu_memory_utilization 0.8
 ```

@@ -175,7 +175,7 @@ sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
 llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=4,
-    speculative_model="path/to/modified/eagle/model",
+    speculative_model="yuhuili/EAGLE-LLaMA3-Instruct-8B",
    speculative_draft_tensor_parallel_size=1,
 )

@@ -190,14 +190,12 @@ for output in outputs:

 A few important things to consider when using the EAGLE based draft models:

-1. The EAGLE draft models available in the [HF repository for EAGLE models](https://huggingface.co/yuhuili) cannot be
-   used directly with vLLM due to differences in the expected layer names and model definition.
-   To use these models with vLLM, use the [following script](https://gist.github.com/abhigoyal1997/1e7a4109ccb7704fbc67f625e86b2d6d)
-   to convert them. Note that this script does not modify the model's weights.
-
-   In the above example, use the script to first convert
-   the [yuhuili/EAGLE-LLaMA3-Instruct-8B](https://huggingface.co/yuhuili/EAGLE-LLaMA3-Instruct-8B) model
-   and then use the converted checkpoint as the draft model in vLLM.
+1. The EAGLE draft models available in the [HF repository for EAGLE models](https://huggingface.co/yuhuili) should
+   be loadable and usable directly by vLLM after [PR 12304](https://github.com/vllm-project/vllm/pull/12304).
+   If you are using a vLLM version from before [PR 12304](https://github.com/vllm-project/vllm/pull/12304), use the
+   [script](https://gist.github.com/abhigoyal1997/1e7a4109ccb7704fbc67f625e86b2d6d) to convert the speculative model,
+   and specify `speculative_model="path/to/modified/eagle/model"`. If weight-loading problems still occur with
+   the latest version of vLLM, please leave a comment or raise an issue.

 2. The EAGLE based draft models need to be run without tensor parallelism
   (i.e. speculative_draft_tensor_parallel_size is set to 1), although

--- a/docs/source/features/tool_calling.md
+++ b/docs/source/features/tool_calling.md
 # Tool Calling

-vLLM currently supports named function calling, as well as the `auto` and `none` options for the `tool_choice` field in the chat completion API. The `tool_choice` option `required` is **not yet supported** but on the roadmap.
+vLLM currently supports named function calling, as well as the `auto` and `none` options for the `tool_choice` field in the chat completion API. The `tool_choice` option `required` is **not yet supported** but [on the roadmap](gh-issue:13002).

 ## Quickstart


--- a/docs/source/generate_examples.py
+++ b/docs/source/generate_examples.py
@@ -147,7 +147,7 @@ class Example:
            return content

        content += "## Example materials\n\n"
-        for file in self.other_files:
+        for file in sorted(self.other_files):
            include = "include" if file.suffix == ".md" else "literalinclude"
            content += f":::{{admonition}} {file.relative_to(self.path)}\n"
            content += ":class: dropdown\n\n"
@@ -194,7 +194,7 @@ def generate_examples():
            path=EXAMPLE_DOC_DIR / "examples_offline_inference_index.md",
            title="Offline Inference",
            description=
-            "Offline inference examples demonstrate how to use vLLM in an offline setting, where the model is queried for predictions in batches.",  # noqa: E501
+            "Offline inference examples demonstrate how to use vLLM in an offline setting, where the model is queried for predictions in batches. We recommend starting with <project:basic.md>.",  # noqa: E501
            caption="Examples",
        ),
    }
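
The `sorted(self.other_files)` change above makes the order of the generated "Example materials" entries deterministic. A minimal sketch of the effect, assuming `other_files` holds `pathlib.Path` objects as the `file.suffix` and `file.relative_to(...)` calls suggest (the file names and the `render_order` helper are hypothetical):

```python
from pathlib import Path

# Hypothetical stand-in for Example.other_files. Directory listings can
# come back in an arbitrary, filesystem-dependent order.
other_files = [
    Path("example/run.sh"),
    Path("example/notes.md"),
    Path("example/config.yaml"),
]


def render_order(files):
    """Return file names in the order they would be rendered."""
    return [f.name for f in sorted(files)]


print(render_order(other_files))
```

Sorting at render time keeps the generated page stable between builds, regardless of how the files were discovered.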

--- a/docs/source/getting_started/installation/ai_accelerator/openvino.inc.md
+++ b/docs/source/getting_started/installation/ai_accelerator/openvino.inc.md
@@ -19,17 +19,19 @@ Currently, there are no pre-built OpenVINO wheels.

 ### Build wheel from source

-First, install Python. For example, on Ubuntu 22.04, you can run:
+First, install Python and ensure you have the latest pip. For example, on Ubuntu 22.04, you can run:

 ```console
 sudo apt-get update  -y
 sudo apt-get install python3
+pip install --upgrade pip
 ```

-Second, install prerequisites vLLM OpenVINO backend installation:
+Second, clone vLLM and install prerequisites for the vLLM OpenVINO backend installation:

 ```console
-pip install --upgrade pip
+git clone https://github.com/vllm-project/vllm.git
+cd vllm
 pip install -r requirements-build.txt --extra-index-url https://download.pytorch.org/whl/cpu
 ```


--- a/docs/source/getting_started/installation/cpu/build.inc.md
+++ b/docs/source/getting_started/installation/cpu/build.inc.md
@@ -10,7 +10,7 @@ Second, install Python packages for vLLM CPU backend building:

 ```console
 pip install --upgrade pip
-pip install cmake>=3.26 wheel packaging ninja "setuptools-scm>=8" numpy
+pip install "cmake>=3.26" wheel packaging ninja "setuptools-scm>=8" numpy
 pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
 ```
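
The quoting fix above matters because an unquoted `>` is a shell redirection: `pip install cmake>=3.26` runs `pip install cmake` and sends its output to a file named `=3.26`. A small sketch that reproduces the effect, using `echo` as a stand-in for `pip` and assuming a POSIX shell is available (the `run_in_tmp` helper is hypothetical):

```python
import subprocess
import tempfile
from pathlib import Path


def run_in_tmp(command: str):
    """Run a shell command in a fresh temp dir; return (stdout, files created)."""
    with tempfile.TemporaryDirectory() as tmp:
        result = subprocess.run(
            command, shell=True, cwd=tmp,
            capture_output=True, text=True, check=True,
        )
        created = sorted(p.name for p in Path(tmp).iterdir())
        return result.stdout.strip(), created


# Unquoted: the shell treats ">" as a redirect, so the program only sees "cmake".
unquoted = run_in_tmp("echo cmake>=3.26")
# Quoted: the full requirement string reaches the program intact.
quoted = run_in_tmp('echo "cmake>=3.26"')

print(unquoted)
print(quoted)
```

In the unquoted case the version constraint is silently dropped and a stray file appears in the working directory, which is why the requirement is now quoted.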


--- a/docs/source/getting_started/installation/cpu/index.md
+++ b/docs/source/getting_started/installation/cpu/index.md
@@ -170,7 +170,7 @@ vLLM CPU backend supports the following vLLM features:
 sudo apt-get install libtcmalloc-minimal4 # install TCMalloc library
 find / -name *libtcmalloc* # find the dynamic link library path
 export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD # prepend the library to LD_PRELOAD
-python examples/offline_inference/basic.py # run vLLM
+python examples/offline_inference/basic/basic.py # run vLLM
 ```

 - When using online serving, it is recommended to reserve 1-2 CPU cores for the serving framework to avoid CPU oversubscription. For example, on a platform with 32 physical CPU cores, reserve CPUs 30 and 31 for the framework and use CPUs 0-29 for OpenMP:
@@ -207,7 +207,7 @@ CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE    MAXMHZ   MINMHZ      MHZ

 # On this platform, it is recommended to bind OpenMP threads only to logical CPU cores 0-7 or 8-15
 $ export VLLM_CPU_OMP_THREADS_BIND=0-7
-$ python examples/offline_inference/basic.py
+$ python examples/offline_inference/basic/basic.py
 ```

 - When using the vLLM CPU backend on a multi-socket machine with NUMA, be sure to set the CPU cores with `VLLM_CPU_OMP_THREADS_BIND` to avoid cross-NUMA-node memory access.

--- a/docs/source/getting_started/installation/gpu/cuda.inc.md
+++ b/docs/source/getting_started/installation/gpu/cuda.inc.md
@@ -89,12 +89,22 @@ cd vllm
 VLLM_USE_PRECOMPILED=1 pip install --editable .
 ```

-This will download the [latest nightly wheel](https://wheels.vllm.ai/nightly/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl) and use the compiled libraries from there in the installation.
+This command will do the following:
+1. Look for the current branch in your vLLM clone.
+2. Identify the corresponding base commit in the main branch.
+3. Download the pre-built wheel of the base commit.
+4. Use its compiled libraries in the installation.

-The `VLLM_PRECOMPILED_WHEEL_LOCATION` environment variable can be used instead of `VLLM_USE_PRECOMPILED` to specify a custom path or URL to the wheel file. For example, to use the [0.6.1.post1 PyPi wheel](https://pypi.org/project/vllm/#files):
+:::{note}
+1. If you change C++ or kernel code, you cannot use the Python-only build; otherwise you will see an import error about a missing library or an undefined symbol.
+2. If you rebase your dev branch, it is recommended to uninstall vllm and re-run the above command to make sure your libraries are up to date.
+:::
+
+If you see a wheel-not-found error when running the above command, the base commit on the main branch may have just been merged, and its wheel may still be building. In that case, wait about an hour and try again, or manually select an earlier commit using the `VLLM_PRECOMPILED_WHEEL_LOCATION` environment variable.

 ```console
-export VLLM_PRECOMPILED_WHEEL_LOCATION=https://files.pythonhosted.org/packages/4a/4c/ee65ba33467a4c0de350ce29fbae39b9d0e7fcd887cc756fa993654d1228/vllm-0.6.3.post1-cp38-abi3-manylinux1_x86_64.whl
+export VLLM_COMMIT=72d9c316d3f6ede485146fe5aabd4e61dbc59069 # use full commit hash from the main branch
+export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
 pip install --editable .
 ```
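
The override above follows a simple URL pattern. A minimal sketch that builds the wheel location for a given commit, assuming the `https://wheels.vllm.ai/<commit>/` layout and the wheel file name from the example stay as shown (the `precompiled_wheel_url` helper is hypothetical, not part of vLLM):

```python
import re

# File name copied from the example above; it may change in future releases.
WHEEL_NAME = "vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl"


def precompiled_wheel_url(commit: str) -> str:
    """Build a VLLM_PRECOMPILED_WHEEL_LOCATION value for a full commit hash."""
    if not re.fullmatch(r"[0-9a-f]{40}", commit):
        raise ValueError("expected a full 40-character commit hash")
    return f"https://wheels.vllm.ai/{commit}/{WHEEL_NAME}"


print(precompiled_wheel_url("72d9c316d3f6ede485146fe5aabd4e61dbc59069"))
```

Rejecting abbreviated hashes up front mirrors the "use full commit hash" comment in the shell snippet.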


--- a/docs/source/getting_started/installation/gpu/rocm.inc.md
+++ b/docs/source/getting_started/installation/gpu/rocm.inc.md
 # Installation

-vLLM supports AMD GPUs with ROCm 6.2.
+vLLM supports AMD GPUs with ROCm 6.3.

 :::{attention}
 There are no pre-built wheels for this device, so you must either use the pre-built Docker image or build vLLM from source.
@@ -9,7 +9,7 @@ There are no pre-built wheels for this device, so you must either use the pre-bu
 ## Requirements

 - GPU: MI200s (gfx90a), MI300 (gfx942), Radeon RX 7900 series (gfx1100)
- ROCm 6.2
+- ROCm 6.3

 ## Set up using Python

@@ -24,9 +24,15 @@ Currently, there are no pre-built ROCm wheels.
 - [ROCm](https://rocm.docs.amd.com/en/latest/deploy/linux/index.html)
 - [PyTorch](https://pytorch.org/)

-    For installing PyTorch, you can start from a fresh docker image, e.g, `rocm/pytorch:rocm6.2_ubuntu20.04_py3.9_pytorch_release_2.3.0`, `rocm/pytorch-nightly`.
+    To install PyTorch, you can start from a fresh Docker image, e.g., `rocm/pytorch:rocm6.3_ubuntu24.04_py3.12_pytorch_release_2.4.0` or `rocm/pytorch-nightly`. If you are using a Docker image, you can skip to Step 3.

-    Alternatively, you can install PyTorch using PyTorch wheels. You can check PyTorch installation guide in PyTorch [Getting Started](https://pytorch.org/get-started/locally/)
+    Alternatively, you can install PyTorch using PyTorch wheels. See the installation guide in PyTorch [Getting Started](https://pytorch.org/get-started/locally/). Example:
+
+    ```console
+    # Install PyTorch
+    $ pip uninstall torch -y
+    $ pip install --no-cache-dir --pre torch --index-url https://download.pytorch.org/whl/rocm6.3
+    ```

 1. Install [Triton flash attention for ROCm](https://github.com/ROCm/triton)

@@ -37,7 +43,7 @@ Currently, there are no pre-built ROCm wheels.
    pip uninstall -y triton
    git clone https://github.com/OpenAI/triton.git
    cd triton
-    git checkout e192dba
+    git checkout e5be006
    cd python
    pip3 install .
    cd ../..
@@ -49,15 +55,15 @@ Currently, there are no pre-built ROCm wheels.

 2. Optionally, if you choose to use CK flash attention, you can install [flash attention for ROCm](https://github.com/ROCm/flash-attention/tree/ck_tile)

-    Install ROCm's flash attention (v2.5.9.post1) following the instructions from [ROCm/flash-attention](https://github.com/ROCm/flash-attention/tree/ck_tile#amd-gpurocm-support)
+    Install ROCm's flash attention (v2.7.2) following the instructions from [ROCm/flash-attention](https://github.com/ROCm/flash-attention/tree/ck_tile#amd-gpurocm-support)
    Alternatively, wheels intended for vLLM use can be accessed under the releases.

-    For example, for ROCm 6.2, suppose your gfx arch is `gfx90a`. To get your gfx architecture, run `rocminfo |grep gfx`.
+    For example, for ROCm 6.3, suppose your gfx arch is `gfx90a`. To get your gfx architecture, run `rocminfo |grep gfx`.

    ```console
    git clone https://github.com/ROCm/flash-attention.git
    cd flash-attention
-    git checkout 3cea2fb
+    git checkout b7d29fb
    git submodule update --init
    GPU_ARCHS="gfx90a" python3 setup.py install
    cd ..
@@ -67,20 +73,16 @@ Currently, there are no pre-built ROCm wheels.
    You might need to downgrade the "ninja" version to 1.10 it is not used when compiling flash-attention-2 (e.g. `pip install ninja==1.10.2.4`)
    :::

-3. Build vLLM. For example, vLLM on ROCM 6.2 can be built with the following steps:
+3. Build vLLM. For example, vLLM on ROCM 6.3 can be built with the following steps:

    ```bash
    $ pip install --upgrade pip

-    # Install PyTorch
-    $ pip uninstall torch -y
-    $ pip install --no-cache-dir --pre torch --index-url https://download.pytorch.org/whl/rocm6.2
-
    # Build & install AMD SMI
    $ pip install /opt/rocm/share/amd_smi

    # Install dependencies
-    $ pip install --upgrade numba scipy huggingface-hub[cli]
+    $ pip install --upgrade numba scipy huggingface-hub[cli,hf_transfer] setuptools_scm
    $ pip install "numpy<2"
    $ pip install -r requirements-rocm.txt

@@ -91,12 +93,11 @@ Currently, there are no pre-built ROCm wheels.

    This may take 5-10 minutes. Currently, `pip install .` does not work for ROCm installation.

-<!--- pyml disable-num-lines 5 ul-indent-->
    :::{tip}
-    - Triton flash attention is used by default. For benchmarking purposes, it is recommended to run a warm up step before collecting perf numbers.
-    - Triton flash attention does not currently support sliding window attention. If using half precision, please use CK flash-attention for sliding window support.
-    - To use CK flash-attention or PyTorch naive attention, please use this flag `export VLLM_USE_TRITON_FLASH_ATTN=0` to turn off triton flash attention.
-    - The ROCm version of PyTorch, ideally, should match the ROCm driver version.
+   - Triton flash attention is used by default. For benchmarking purposes, it is recommended to run a warm up step before collecting perf numbers.
+   - Triton flash attention does not currently support sliding window attention. If using half precision, please use CK flash-attention for sliding window support.
+   - To use CK flash-attention or PyTorch naive attention, please use this flag `export VLLM_USE_TRITON_FLASH_ATTN=0` to turn off triton flash attention.
+   - The ROCm version of PyTorch, ideally, should match the ROCm driver version.
    :::

 :::{tip}
@@ -104,7 +105,7 @@ Currently, there are no pre-built ROCm wheels.
  For vLLM, please refer to [vLLM performance optimization](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/workload.html#vllm-performance-optimization).
 :::

-## Set up using Docker
+## Set up using Docker (Recommended)

 ### Pre-built images

@@ -120,7 +121,12 @@ for instructions on how to use this prebuilt docker image.

 Building the Docker image from source is the recommended way to use vLLM with ROCm.

-First, build a docker image from <gh-file:Dockerfile.rocm> and launch a docker container from the image.
+#### (Optional) Build an image with ROCm software stack
+
+Build a Docker image from <gh-file:Dockerfile.rocm_base>, which sets up the ROCm software stack needed by vLLM.
+**This step is optional, as the rocm_base image is usually prebuilt and stored on [Docker Hub](https://hub.docker.com/r/rocm/vllm-dev) under the tag `rocm/vllm-dev:base` to save build time.**
+If you choose to build the rocm_base image yourself, the steps are as follows.
+
 It is important to start the docker build with BuildKit. Either set `DOCKER_BUILDKIT=1` as an environment variable when calling `docker build`, or enable BuildKit in the Docker daemon configuration `/etc/docker/daemon.json` as follows and restart the daemon:

 ```console
@@ -131,7 +137,26 @@ It is important that the user kicks off the docker build using buildkit. Either
 }
 ```

-<gh-file:Dockerfile.rocm> uses ROCm 6.2 by default, but also supports ROCm 5.7, 6.0 and 6.1 in older vLLM branches.
+To build vllm on ROCm 6.3 for MI200 and MI300 series, you can use the default:
+
+```console
+DOCKER_BUILDKIT=1 docker build -f Dockerfile.rocm_base -t rocm/vllm-dev:base .
+```
+
+#### Build an image with vLLM
+
+First, build a Docker image from <gh-file:Dockerfile.rocm> and launch a Docker container from the image.
+It is important to start the docker build with BuildKit. Either set `DOCKER_BUILDKIT=1` as an environment variable when calling `docker build`, or enable BuildKit in the Docker daemon configuration `/etc/docker/daemon.json` as follows and restart the daemon:
+
+```console
+{
+    "features": {
+        "buildkit": true
+    }
+}
+```
+
+<gh-file:Dockerfile.rocm> uses ROCm 6.3 by default, but also supports ROCm 5.7, 6.0, 6.1, and 6.2, in older vLLM branches.
 It provides flexibility to customize the build of docker image using the following arguments:

 - `BASE_IMAGE`: specifies the base image used when running `docker build`. The default value `rocm/vllm-dev:base` is an image published and maintained by AMD. It is being built using <gh-file:Dockerfile.rocm_base>
@@ -141,13 +166,13 @@ It provides flexibility to customize the build of docker image using the followi

 Their values can be passed in when running `docker build` with `--build-arg` options.

-To build vllm on ROCm 6.2 for MI200 and MI300 series, you can use the default:
+To build vllm on ROCm 6.3 for MI200 and MI300 series, you can use the default:

 ```console
 DOCKER_BUILDKIT=1 docker build -f Dockerfile.rocm -t vllm-rocm .
 ```

-To build vllm on ROCm 6.2 for Radeon RX7900 series (gfx1100), you should pick the alternative base image:
+To build vllm on ROCm 6.3 for Radeon RX7900 series (gfx1100), you should pick the alternative base image:

 ```console
 DOCKER_BUILDKIT=1 docker build --build-arg BASE_IMAGE="rocm/vllm-dev:navi_base" -f Dockerfile.rocm -t vllm-rocm .

--- a/docs/source/getting_started/installation/python_env_setup.inc.md
+++ b/docs/source/getting_started/installation/python_env_setup.inc.md
@@ -2,8 +2,8 @@ You can create a new Python environment using `conda`:

 ```console
 # (Recommended) Create a new conda environment.
-conda create -n myenv python=3.12 -y
-conda activate myenv
+conda create -n vllm python=3.12 -y
+conda activate vllm
 ```

 :::{note}
@@ -14,6 +14,6 @@ Or you can create a new Python environment using [uv](https://docs.astral.sh/uv/

 ```console
 # (Recommended) Create a new uv environment. Use `--seed` to install `pip` and `setuptools` in the environment.
-uv venv myenv --python 3.12 --seed
-source myenv/bin/activate
+uv venv vllm --python 3.12 --seed
+source vllm/bin/activate
 ```
--- a/docs/source/getting_started/quickstart.md
+++ b/docs/source/getting_started/quickstart.md
@@ -40,7 +40,7 @@ For non-CUDA platforms, please refer [here](#installation-index) for specific in

 ## Offline Batched Inference

-With vLLM installed, you can start generating texts for list of input prompts (i.e. offline batch inferencing). See the example script: <gh-file:examples/offline_inference/basic.py>
+With vLLM installed, you can start generating text for a list of input prompts (i.e. offline batch inference). See the example script: <gh-file:examples/offline_inference/basic/basic.py>

 The first line of this example imports the classes {class}`~vllm.LLM` and {class}`~vllm.SamplingParams`:


--- a/docs/source/getting_started/troubleshooting.md
+++ b/docs/source/getting_started/troubleshooting.md
@@ -94,20 +94,20 @@ pynccl.disabled = False
 s = torch.cuda.Stream()
 with torch.cuda.stream(s):
    data.fill_(1)
-    pynccl.all_reduce(data, stream=s)
-    value = data.mean().item()
+    out = pynccl.all_reduce(data, stream=s)
+    value = out.mean().item()
    assert value == world_size, f"Expected {world_size}, got {value}"

 print("vLLM NCCL is successful!")

 g = torch.cuda.CUDAGraph()
 with torch.cuda.graph(cuda_graph=g, stream=s):
-    pynccl.all_reduce(data, stream=torch.cuda.current_stream())
+    out = pynccl.all_reduce(data, stream=torch.cuda.current_stream())

 data.fill_(1)
 g.replay()
 torch.cuda.current_stream().synchronize()
-value = data.mean().item()
+value = out.mean().item()
 assert value == world_size, f"Expected {world_size}, got {value}"

 print("vLLM NCCL with cuda graph is successful!")

--- a/docs/source/index.md
+++ b/docs/source/index.md
@@ -23,7 +23,7 @@

 vLLM is a fast and easy-to-use library for LLM inference and serving.

-Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has evloved into a community-driven project with contributions from both academia and industry.
+Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

 vLLM is fast with: