Merge tag 'v0.14.0' into v0.14.0-dev

Commit 7e63ef82 authored Jan 21, 2026 by zhuwenwen
20 changed files
--- a/docs/design/custom_op.md
+++ b/docs/design/custom_op.md
+# CustomOp
+
+`CustomOp` is an abstract class used for dispatching the forward method of various operations to the appropriate backend. It also offers a mechanism for both vLLM and OOT (Out-Of-Tree) plugins to register their custom operations.
+
+This document will introduce how CustomOp works in vLLM and how to implement a new `CustomOp`.
+
+## How CustomOp Works in vLLM
+
+`CustomOp` manages two dictionaries of all custom ops (i.e., op classes, indexed by registered name) in its class, for vLLM and OOT plugins respectively.
+
+??? code
+
+    ```python
+    class CustomOp(nn.Module):
+
+        op_registry: dict[str, type["CustomOp"]] = {}
+        op_registry_oot: dict[str, type["CustomOp"]] = {}
+    ```
+
+We can use `@CustomOp.register("op_name")` to register an op class with the `CustomOp` system. After this, `op_name` and its class are added to the `op_registry` dictionary. We can also register an OOT op with `@CustomOp.register_oot("op_name")`; this mechanism is described in detail later.
+
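The registration step can be pictured with a small stand-alone sketch. `ToyOp` and `ToySiluAndMul` are hypothetical stand-ins rather than vLLM classes, and the duplicate-name check is an assumption made for illustration; the sketch only shows how a decorator can fill a class-level registry keyed by name.

```python
class ToyOp:
    """Hypothetical stand-in for CustomOp; not vLLM's implementation."""

    # Registered op classes, indexed by their registered name.
    op_registry: dict[str, type["ToyOp"]] = {}

    @classmethod
    def register(cls, name: str):
        def decorator(op_cls: type["ToyOp"]) -> type["ToyOp"]:
            # Assumption: registering the same name twice is an error.
            if name in cls.op_registry:
                raise ValueError(f"duplicate op name: {name}")
            cls.op_registry[name] = op_cls
            return op_cls

        return decorator


@ToyOp.register("toy_silu_and_mul")
class ToySiluAndMul(ToyOp):
    pass
```

After the decorator runs, `ToyOp.op_registry["toy_silu_and_mul"]` is the `ToySiluAndMul` class itself, which is the same shape of mapping that `op_registry` holds.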
+When a `CustomOp` is called (i.e., its `forward()` method is invoked) and it is enabled (e.g., with `--compilation_config.custom_ops '["+op_name"]'`), it automatically dispatches the forward method to the appropriate backend according to `current_platform`. Otherwise (i.e., when it is disabled), it only calls `forward_native()`, the PyTorch-native implementation of this forward method.
+
+- **CPU platform:** dispatch to `forward_cpu()`.
+- **CUDA platform:** dispatch to `forward_cuda()`.
+- **ROCm platform:** dispatch to `forward_hip()`. If `forward_hip()` is not implemented, it will use `forward_cuda()` as a fallback.
+- **XPU platform:** dispatch to `forward_xpu()`.
+- **TPU platform:** dispatch to `forward_tpu()`.
+- **OOT platform:** dispatch to `forward_oot()`. This will only be called on OOT platforms.
+- **Default:** dispatch to `forward_native()` as a final fallback for all platforms.
+
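The dispatch rules listed above can be sketched in plain Python. This is a toy model, not vLLM's code: `ToyDispatchOp`, the platform strings, and the tuple return values are all assumptions chosen so the example runs without PyTorch.

```python
class ToyDispatchOp:
    """Hypothetical stand-in for CustomOp dispatch; not vLLM's implementation."""

    def __init__(self, platform: str, enabled: bool = True) -> None:
        # Resolve the backend once, so forward() is a plain call afterwards.
        self._forward_method = self._dispatch(platform, enabled)

    def _dispatch(self, platform: str, enabled: bool):
        if not enabled:
            # A disabled op always uses the PyTorch-native path.
            return self.forward_native
        table = {
            "cpu": self.forward_cpu,
            "cuda": self.forward_cuda,
            "rocm": self.forward_hip,
            "xpu": self.forward_xpu,
            "tpu": self.forward_tpu,
            "oot": self.forward_oot,
        }
        # Unknown platforms use the native path as the final fallback.
        return table.get(platform, self.forward_native)

    def forward(self, x):
        return self._forward_method(x)

    def forward_native(self, x):
        return ("native", x)

    def forward_cuda(self, x):
        return ("cuda", x)

    def forward_hip(self, x):
        # ROCm falls back to the CUDA path unless a subclass overrides this.
        return self.forward_cuda(x)

    # In this toy class, the remaining backends fall back to the native path.
    def forward_cpu(self, x):
        return self.forward_native(x)

    def forward_xpu(self, x):
        return self.forward_native(x)

    def forward_tpu(self, x):
        return self.forward_native(x)

    def forward_oot(self, x):
        return self.forward_native(x)
```

A subclass that overrides only `forward_cuda()` would therefore also change the ROCm path, which is one way inheritance can alter the dispatch behavior.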
+!!! note
+    The dispatching logic might not be absolute because of class inheritance: a derived class might override the behavior.
+
+Furthermore, vLLM decides whether to enable or disable a `CustomOp` based on `compilation_config.custom_ops`. To be specific, if a `CustomOp` is not explicitly listed in `compilation_config.custom_ops` (i.e., it uses the default config), it is enabled if `compilation_config.custom_ops` contains `all` and disabled if it contains `none`.
+
+!!! note
+    Note that `all` and `none` cannot coexist in `compilation_config.custom_ops`.
+
+By default, if `compilation_config.backend == "inductor"` and `compilation_config.mode != CompilationMode.NONE`, `none` is appended to `compilation_config.custom_ops`; otherwise `all` is appended. In other words, `CustomOp` is disabled on platforms that use `inductor` as the default backend for `torch.compile` when running in torch compile mode. In this case, Inductor generates (fused) Triton kernels for those disabled custom ops.
+
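The default described above reduces to a single condition. The helper below is a hypothetical sketch (its name and its boolean parameter are assumptions, and it takes the compilation mode as a flag instead of a `CompilationMode` value) that returns the entry appended by default.

```python
def default_custom_ops_flag(backend: str, compile_mode_is_none: bool) -> str:
    """Return the entry appended to custom_ops by default (sketch only)."""
    # Inductor with compilation active: let Inductor generate fused kernels.
    if backend == "inductor" and not compile_mode_is_none:
        return "none"
    # Every other combination keeps the custom ops enabled.
    return "all"
```

For example, `default_custom_ops_flag("inductor", compile_mode_is_none=False)` yields `"none"`, while the same backend with compilation turned off yields `"all"`.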
+!!! note
+    For multi-modal models, vLLM forces some custom ops to be enabled, such as `MMEncoderAttention` and `ApplyRotaryEmb`, so that the ViT part uses device-specific, deeply optimized kernels for better performance. We can also pass `enforce_enable=True` to the `__init__()` method of a `CustomOp` to force-enable it at the object level.
+
+    Note that this `enforce_enable` mechanism will be removed after we add a separate `compilation_config` for multi-modal part.
+
+## How to Customise Your Configuration for CustomOp
+
+vLLM also offers fine-grained control over which custom ops to enable or disable for users, by manually passing a `--compilation_config.custom_ops '["..."]'` when launching a server.
+
+For example:
+
+- Use `--compilation_config.custom_ops '["all"]'` to enable all custom ops.
+- Use `--compilation_config.custom_ops '["none"]'` to disable all custom ops.
+- Use `--compilation_config.custom_ops '["all", "-op1"]'` to enable all custom ops except op1 (i.e., a `-` prefix means "disable").
+- Use `--compilation_config.custom_ops '["none", "+op1", "+op2"]'` to enable only op1 and op2 (i.e., a `+` prefix means "enable").
+
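How such a list resolves to a per-op decision can be sketched as follows. `is_op_enabled` is a hypothetical helper, not a vLLM function; it assumes each entry is a separate list element, that an explicit `+name` or `-name` entry overrides `all`/`none`, and that listing both for the same op is an error.

```python
def is_op_enabled(op_name: str, custom_ops: list[str]) -> bool:
    """Decide whether one op is enabled for a given custom_ops list (sketch)."""
    if "all" in custom_ops and "none" in custom_ops:
        raise ValueError("'all' and 'none' cannot coexist")

    explicitly_on = f"+{op_name}" in custom_ops
    explicitly_off = f"-{op_name}" in custom_ops
    # Assumption: enabling and disabling the same op is rejected.
    if explicitly_on and explicitly_off:
        raise ValueError(f"{op_name} is both enabled and disabled")

    if explicitly_on:
        return True
    if explicitly_off:
        return False
    # No explicit entry: fall back to the global default.
    return "all" in custom_ops
```

With `["none", "+op1", "+op2"]`, the helper returns `True` for `op1` and `op2` and `False` for every other op name.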
+## Types of Supported CustomOp in vLLM
+
+**1. Attention:**
+
+```python
+--8<-- "vllm/model_executor/layers/attention/mm_encoder_attention.py:mm_encoder_attn"
+
+--8<-- "vllm/model_executor/layers/mla.py:multi_head_latent_attention"
+```
+
+**2. Activation:**
+
+```python
+--8<-- "vllm/model_executor/layers/activation.py:silu_and_mul"
+
+--8<-- "vllm/model_executor/layers/activation.py:mul_and_silu"
+
+--8<-- "vllm/model_executor/layers/activation.py:gelu_new"
+
+--8<-- "vllm/model_executor/layers/activation.py:gelu_fast"
+
+--8<-- "vllm/model_executor/layers/activation.py:quick_gelu"
+
+--8<-- "vllm/model_executor/layers/activation.py:gelu_and_mul"
+
+--8<-- "vllm/model_executor/layers/activation.py:gelu_and_mul_sparse"
+
+--8<-- "vllm/model_executor/layers/activation.py:relu2"
+
+--8<-- "vllm/model_executor/layers/activation.py:xielu"
+
+--8<-- "vllm/model_executor/layers/activation.py:swigluoai_and_mul"
+
+--8<-- "vllm/model_executor/layers/activation.py:fatrelu_and_mul"
+```
+
+**3. MM-Conv:**
+
+```python
+--8<-- "vllm/model_executor/layers/conv.py:conv2d"
+
+--8<-- "vllm/model_executor/layers/conv.py:conv3d"
+```
+
+**4. Embedding:**
+
+```python
+--8<-- "vllm/model_executor/layers/vocab_parallel_embedding.py:vocab_parallel_embedding"
+
+--8<-- "vllm/model_executor/layers/vocab_parallel_embedding.py:parallel_lm_head"
+```
+
+**5. Linear:**
+
+```python
+--8<-- "vllm/model_executor/layers/linear.py:row_parallel_linear"
+
+--8<-- "vllm/model_executor/layers/linear.py:column_parallel_linear"
+
+--8<-- "vllm/model_executor/layers/linear.py:replicated_linear"
+```
+
+**6. Logits Processor:**
+
+```python
+--8<-- "vllm/model_executor/layers/logits_processor.py:logits_processor"
+```
+
+**7. Mamba:**
+
+```python
+--8<-- "vllm/model_executor/layers/mamba/mamba_mixer.py:mamba_mixer"
+
+--8<-- "vllm/model_executor/layers/mamba/mamba_mixer2.py:mamba_mixer2"
+
+--8<-- "vllm/model_executor/layers/mamba/mamba_mixer2.py:mixer2_gated_rms_norm"
+
+--8<-- "vllm/model_executor/models/plamo2.py:plamo2_mamba_mixer"
+
+--8<-- "vllm/model_executor/layers/mamba/short_conv.py:short_conv"
+```
+
+**8. MoE:**
+
+```python
+--8<-- "vllm/model_executor/layers/fused_moe/layer.py:fused_moe"
+
+--8<-- "vllm/model_executor/layers/fused_moe/fused_moe_modular_method.py:modular_fused_moe"
+
+--8<-- "vllm/model_executor/layers/fused_moe/unquantized_fused_moe_method.py:unquantized_fused_moe"
+
+--8<-- "vllm/model_executor/models/transformers/moe.py:transformers_fused_moe"
+
+--8<-- "vllm/model_executor/layers/fused_moe/fused_moe.py:grouped_topk"
+```
+
+**9. Norm:**
+
+```python
+--8<-- "vllm/model_executor/layers/layernorm.py:rms_norm"
+
+--8<-- "vllm/model_executor/layers/layernorm.py:rms_norm_gated"
+
+--8<-- "vllm/model_executor/layers/layernorm.py:gemma_rms_norm"
+```
+
+**10. Quantization:**
+
+```python
+--8<-- "vllm/model_executor/layers/quantization/input_quant_fp8.py:quant_fp8"
+```
+
+**11. Rope:**
+
+```python
+--8<-- "vllm/model_executor/layers/rotary_embedding/base.py:rotary_embedding"
+
+--8<-- "vllm/model_executor/layers/rotary_embedding/dual_chunk_rope.py:dual_chunk_rotary_embedding"
+
+--8<-- "vllm/model_executor/layers/rotary_embedding/common.py:apply_rotary_emb"
+```
+
+## Guidelines for Implementing a New CustomOp
+
+### Implement a New CustomOp in vLLM
+
+This part is a tutorial on how to implement a new `CustomOp` in vLLM.
+
+Steps:
+
+1. Implement a new op class that extends the `CustomOp` base class.
+2. Add the `@CustomOp.register("op_name")` decorator to this op class to register it with the `CustomOp` system.
+3. Implement the `forward_xxx()` methods you need.
+
+Taking `MMEncoderAttention` as an example:
+
+??? code
+
+    ```python
+    @CustomOp.register("mm_encoder_attn")
+    class MMEncoderAttention(CustomOp):
+
+        def __init__(
+            self,
+            num_heads: int,
+            head_size: int,
+            scale: float | None = None,
+            num_kv_heads: int | None = None,
+            prefix: str = "",
+            multimodal_config: MultiModalConfig | None = None,
+        ) -> None:
+            super().__init__()
+            # Init...
+
+        def forward_native(
+            self,
+            query: torch.Tensor,
+            key: torch.Tensor,
+            value: torch.Tensor,
+            cu_seqlens: torch.Tensor | None = None,
+            max_seqlen: torch.Tensor | None = None,  # Only used for Flash Attention
+        ) -> torch.Tensor:
+            # Call TORCH_SDPA implementation...
+
+        def forward_cuda(
+            self,
+            query: torch.Tensor,
+            key: torch.Tensor,
+            value: torch.Tensor,
+            cu_seqlens: torch.Tensor | None = None,
+            max_seqlen: torch.Tensor | None = None,  # Only used for Flash Attention
+        ) -> torch.Tensor:
+            # Call FA or TORCH_SDPA implementation...
+
+        def forward_cpu(
+            self,
+            query: torch.Tensor,
+            key: torch.Tensor,
+            value: torch.Tensor,
+            cu_seqlens: torch.Tensor | None = None,
+            max_seqlen: torch.Tensor | None = None,  # Only used for Flash Attention
+        ) -> torch.Tensor:
+            # Call TORCH_SDPA implementation...
+
+        def forward_xpu(
+            self,
+            query: torch.Tensor,
+            key: torch.Tensor,
+            value: torch.Tensor,
+            cu_seqlens: torch.Tensor | None = None,
+            max_seqlen: torch.Tensor | None = None,  # Only used for Flash Attention
+        ) -> torch.Tensor:
+            # Call FA implementation...
+
+        def forward_tpu(
+            self,
+            query: torch.Tensor,
+            key: torch.Tensor,
+            value: torch.Tensor,
+            cu_seqlens: torch.Tensor | None = None,
+            max_seqlen: torch.Tensor | None = None,  # Only used for Flash Attention
+        ) -> torch.Tensor:
+            # Call PALLAS implementation...
+    ```
+
+### Register a New CustomOp in OOT Device Plugins
+
+Thanks to [vLLM's hardware-plugin mechanism](./plugin_system.md), various OOT device plugins are emerging that enable vLLM to run seamlessly on different hardware. You can find more details about this mechanism at [Introducing vLLM Hardware Plugin, Best Practice from Ascend NPU](https://blog.vllm.ai/2025/05/12/hardware-plugin.html).
+
+- **Official device plugins:** [vllm-ascend](https://github.com/vllm-project/vllm-ascend) (for Huawei Ascend NPU), [vllm-spyre](https://github.com/vllm-project/vllm-spyre) (for Spyre), [vllm-gaudi](https://github.com/vllm-project/vllm-gaudi) (for Intel Gaudi), [vllm-neuron](https://github.com/vllm-project/vllm-neuron) (for AWS Neuron), [vllm-metal](https://github.com/vllm-project/vllm-metal) (for Apple Silicon), etc.
+- **Non-official device plugins:** [vllm-metax](https://github.com/MetaX-MACA/vLLM-metax) (for MetaX GPU), [vllm-kunlun](https://github.com/baidu/vLLM-Kunlun) (for Baidu Kunlun XPU), etc.
+
+In this case, `CustomOp` can enable these hardware manufacturers to seamlessly replace vLLM's operations with their deep-optimized kernels for specific devices at runtime, by just registering an OOT `CustomOp` and implementing the `forward_oot()` method.
+
+Now, this part will show you how to register an OOT `CustomOp` for a device plugin.
+
+Taking `MMEncoderAttention` as an example:
+
+1. Implement a `CustomMMEncoderAttention` class that extends `MMEncoderAttention` and implements its `forward_oot()` method.
+2. Register your `CustomMMEncoderAttention` into vLLM to replace `MMEncoderAttention`.
+
+??? code
+
+    ```python
+    from vllm.attention.layers.mm_encoder_attention import MMEncoderAttention
+    from vllm.model_executor.custom_op import CustomOp
+
+
+    @CustomOp.register_oot("MMEncoderAttention")
+    class CustomMMEncoderAttention(MMEncoderAttention):
+
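+        def __init__(self, *args, **kwargs):
+            # Forward all constructor arguments to MMEncoderAttention.
+            super().__init__(*args, **kwargs)
+
+        def forward_oot(self, *args, **kwargs):
+            # Call optimized device-specific kernels.
+            ...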
+    ```
+
+In this case, a new item `{"MMEncoderAttention": CustomMMEncoderAttention}` will be added to `op_registry_oot`. When an `MMEncoderAttention` op object is initialized, if the class name (i.e., `MMEncoderAttention`) is among the keys of `op_registry_oot`, vLLM replaces it with the registered class (i.e., `CustomMMEncoderAttention`) and instantiates that class instead.
+
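One way to obtain this kind of replacement at instantiation time is to look up the registry inside `__new__`. The sketch below is a toy illustration under that assumption; `ToyCustomOp`, `Attention`, and `PluginAttention` are hypothetical names, and the real lookup in vLLM may differ in detail.

```python
class ToyCustomOp:
    """Hypothetical stand-in for CustomOp; not vLLM's implementation."""

    # OOT replacement classes, indexed by the name of the class they replace.
    op_registry_oot: dict[str, type["ToyCustomOp"]] = {}

    def __new__(cls, *args, **kwargs):
        # Swap in the registered OOT class when the class name matches.
        target = cls.op_registry_oot.get(cls.__name__, cls)
        return super().__new__(target)

    @classmethod
    def register_oot(cls, name: str):
        def decorator(op_cls: type["ToyCustomOp"]) -> type["ToyCustomOp"]:
            cls.op_registry_oot[name] = op_cls
            return op_cls

        return decorator


class Attention(ToyCustomOp):
    def forward(self) -> str:
        return "in-tree"


@ToyCustomOp.register_oot("Attention")
class PluginAttention(Attention):
    def forward(self) -> str:
        return "plugin"


# Model code keeps constructing the in-tree class ...
op = Attention()
# ... but receives an instance of the plugin's subclass.
replaced_type = type(op).__name__
```

Because the replacement is a subclass of the original, `__init__` still runs on the new object, and existing model code that constructs `Attention` does not need to change.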
+After that, when this `MMEncoderAttention` op is called, your `forward_oot()` will be called if the op is enabled. Thus, you get the expected performance on your hardware without directly modifying vLLM.
+
+In addition, you can register all your `CustomOp` classes in one place for better management.
+
+??? code
+
+    ```python
+    from vllm.model_executor.custom_op import CustomOp
+
+
+    REGISTERED_CUSTOM_OPS = {
+        "CustomOP1": YourCustomOp1,
+        "CustomOP2": YourCustomOp2,
+        "CustomOP3": YourCustomOp3,
+    }
+
+    for op_name, op_cls in REGISTERED_CUSTOM_OPS.items():
+        CustomOp.register_oot(_decorated_op_cls=op_cls, name=op_name)
+    ```
--- a/docs/design/debug_vllm_compile.md
+++ b/docs/design/debug_vllm_compile.md
@@ -33,7 +33,7 @@ goals while minimizing impact to performance and also helps us (vLLM) when you o
 For more details on the design, please see the following resources:

 - [Introduction to vLLM-torch.compile blogpost](https://blog.vllm.ai/2025/08/20/torch-compile.html)
- [vLLM-torch.compile integration design](https://docs.vllm.ai/en/latest/design/torch_compile.html)
+- [vLLM-torch.compile integration design](./torch_compile.md)
 - [vLLM Office Hours #26](https://www.youtube.com/live/xLyxc7hxCJc?si=Xulo9pe53C6ywf0V&t=561)
 - [Talk at PyTorch Conference 2025](https://youtu.be/1wV1ESbGrVQ?si=s1GqymUfwiwOrDTg&t=725)


--- a/docs/design/fused_moe_modular_kernel.md
+++ b/docs/design/fused_moe_modular_kernel.md
@@ -2,7 +2,7 @@

 ## Introduction

-FusedMoEModularKernel is implemented [here](../..//vllm/model_executor/layers/fused_moe/modular_kernel.py)
+FusedMoEModularKernel is implemented [here](../../vllm/model_executor/layers/fused_moe/modular_kernel.py)

 Based on the format of the input activations, FusedMoE implementations are broadly classified into 2 types.


--- a/docs/design/logits_processors.md
+++ b/docs/design/logits_processors.md
@@ -138,7 +138,7 @@ Note that the sampler will access the logits processors via `SamplingMetadata.lo
            # ...return sampler output data structure...


-        def sample(self, logits, sampling_metadta)
+        def sample(self, logits, sampling_metadata)

            ...


--- a/docs/design/moe_kernel_features.md
+++ b/docs/design/moe_kernel_features.md
@@ -16,7 +16,7 @@ Async backends support the use of DBO (Dual Batch Overlap) and shared expert ove

 Certain models require the topk weights to be applied to the input activations rather than the output activations when topk==1, e.g. Llama. For modular kernels, this feature is supported by the `FusedMoEPrepareAndFinalize` subclass. For non-modular kernels, it is up to the experts function to deal with this flag.

-Unless otherwise specified, backends are controlled via `VLLM_ALL2ALL_BACKEND`. All backends except `flashinfer` only work with EP+DP or EP+TP. `Flashinfer` can work with EP or DP without EP.
+Unless otherwise specified, backends are controlled via the `--all2all-backend` command-line argument (or the `all2all_backend` parameter in `ParallelConfig`). All backends except `flashinfer` only work with EP+DP or EP+TP. `Flashinfer` can work with EP or DP without EP.

 <style>
 td {
@@ -86,13 +86,12 @@ To be used with a particular `FusedMoEPrepareAndFinalize` subclass, MoE kernels
 | triton | standard | all<sup>1</sup> | G,A,T | silu, gelu,</br>swigluoai,</br>silu_no_mul,</br>gelu_no_mul | Y | Y | [`fused_experts`][vllm.model_executor.layers.fused_moe.fused_moe.fused_experts],</br>[`TritonExperts`][vllm.model_executor.layers.fused_moe.fused_moe.TritonExperts] |
 | triton (batched) | batched | all<sup>1</sup> | G,A,T | silu, gelu | <sup>6</sup> | Y | [`BatchedTritonExperts`][vllm.model_executor.layers.fused_moe.fused_batched_moe.BatchedTritonExperts] |
 | deep gemm | standard,</br>batched | fp8 | G(128),A,T | silu, gelu | <sup>6</sup> | Y | [`deep_gemm_moe_fp8`][vllm.model_executor.layers.fused_moe.deep_gemm_moe.deep_gemm_moe_fp8],</br>[`DeepGemmExperts`][vllm.model_executor.layers.fused_moe.deep_gemm_moe.DeepGemmExperts],</br>[`BatchedDeepGemmExperts`][vllm.model_executor.layers.fused_moe.batched_deep_gemm_moe.BatchedDeepGemmExperts] |
-| cutlass_fp4 | standard,</br>batched | nvfp4 | A,T | silu | Y | Y | [`cutlass_moe_fp4`][vllm.model_executor.layers.fused_moe.cutlass_moe.cutlass_moe_fp4],</br>[`CutlassExpertsFp4`][vllm.model_executor.layers.fused_moe.cutlass_moe.CutlassExpertsFp4] |
-| cutlass_fp8 | standard,</br>batched | fp8 | A,T | silu, gelu | Y | Y | [`cutlass_moe_fp8`][vllm.model_executor.layers.fused_moe.cutlass_moe.cutlass_moe_fp8],</br>[`CutlassExpertsFp8`][vllm.model_executor.layers.fused_moe.cutlass_moe.CutlassExpertsFp8],</br>[`CutlasBatchedExpertsFp8`][vllm.model_executor.layers.fused_moe.cutlass_moe.CutlassBatchedExpertsFp8] |
+| cutlass_fp4 | standard,</br>batched | nvfp4 | A,T | silu | Y | Y | [`CutlassExpertsFp4`][vllm.model_executor.layers.fused_moe.cutlass_moe.CutlassExpertsFp4] |
+| cutlass_fp8 | standard,</br>batched | fp8 | A,T | silu, gelu | Y | Y | [`CutlassExpertsFp8`][vllm.model_executor.layers.fused_moe.cutlass_moe.CutlassExpertsFp8],</br>[`CutlassBatchedExpertsFp8`][vllm.model_executor.layers.fused_moe.cutlass_moe.CutlassBatchedExpertsFp8] |
 | flashinfer | standard | nvfp4,</br>fp8 | T | <sup>5</sup> | N | Y | [`flashinfer_cutlass_moe_fp4`][vllm.model_executor.layers.fused_moe.flashinfer_cutlass_moe.flashinfer_cutlass_moe_fp4],</br>[`FlashInferExperts`][vllm.model_executor.layers.fused_moe.flashinfer_cutlass_moe.FlashInferExperts] |
 | gpt oss triton | standard | N/A | N/A | <sup>5</sup> | Y | Y | [`triton_kernel_fused_experts`][vllm.model_executor.layers.fused_moe.gpt_oss_triton_kernels_moe.triton_kernel_fused_experts],</br>[`OAITritonExperts`][vllm.model_executor.layers.fused_moe.gpt_oss_triton_kernels_moe.OAITritonExperts] |
 | marlin | standard,</br>batched | <sup>3</sup> / N/A | <sup>3</sup> / N/A | silu,</br>swigluoai | Y | Y | [`fused_marlin_moe`][vllm.model_executor.layers.fused_moe.fused_marlin_moe.fused_marlin_moe],</br>[`MarlinExperts`][vllm.model_executor.layers.fused_moe.fused_marlin_moe.MarlinExperts],</br>[`BatchedMarlinExperts`][vllm.model_executor.layers.fused_moe.fused_marlin_moe.BatchedMarlinExperts] |
 | trtllm | standard | mxfp4,</br>nvfp4 | G(16),G(32) | <sup>5</sup> | N | Y | [`TrtLlmGenExperts`][vllm.model_executor.layers.fused_moe.trtllm_moe.TrtLlmGenExperts] |
-| pallas | standard | N/A | N/A | silu | N | N | [`fused_moe`][vllm.model_executor.layers.fused_moe.moe_pallas.fused_moe] |
 | iterative | standard | N/A | N/A | silu | N | N | [`fused_moe`][vllm.model_executor.layers.fused_moe.moe_torch_iterative.fused_moe] |
 | rocm aiter moe | standard | fp8 | G(128),A,T | silu, gelu | Y | N | [`rocm_aiter_fused_experts`][vllm.model_executor.layers.fused_moe.rocm_aiter_fused_moe.rocm_aiter_fused_experts] |
 | cpu_fused_moe | standard | N/A | N/A | silu | N | N | [`CPUFusedMOE`][vllm.model_executor.layers.fused_moe.cpu_fused_moe.CPUFusedMOE] |

--- a/docs/design/paged_attention.md
+++ b/docs/design/paged_attention.md
@@ -139,18 +139,14 @@ token data.
 const scalar_t* q_ptr = q + seq_idx * q_stride + head_idx * HEAD_SIZE;
 ```

-<figure markdown="span">
-  ![](../assets/design/paged_attention/query.png){ align="center" alt="query" width="70%" }
-</figure>
+![query](../assets/design/paged_attention/query.png)

 Each thread defines its own `q_ptr` which points to the assigned
 query token data on global memory. For example, if `VEC_SIZE` is 4
 and `HEAD_SIZE` is 128, the `q_ptr` points to data that contains
 total of 128 elements divided into 128 / 4 = 32 vecs.

-<figure markdown="span">
-  ![](../assets/design/paged_attention/q_vecs.png){ align="center" alt="q_vecs" width="70%" }
-</figure>
+![q_vecs](../assets/design/paged_attention/q_vecs.png)

 ```cpp
 __shared__ Q_vec q_vecs[THREAD_GROUP_SIZE][NUM_VECS_PER_THREAD];
@@ -187,9 +183,7 @@ key token at different iterations. As shown above, that `k_ptr`
 points to key token data based on `k_cache` at assigned block,
 assigned head and assigned token.

-<figure markdown="span">
-  ![](../assets/design/paged_attention/key.png){ align="center" alt="key" width="70%" }
-</figure>
+![key](../assets/design/paged_attention/key.png)

 The diagram above illustrates the memory layout for key data. It
 assumes that the `BLOCK_SIZE` is 16, `HEAD_SIZE` is 128, `x` is
@@ -202,9 +196,7 @@ iterations. Inside each rectangle, there are a total 32 vecs (128
 elements for one token) that will be processed by 2 threads (one
 thread group) separately.

-<figure markdown="span">
-  ![](../assets/design/paged_attention/k_vecs.png){ align="center" alt="k_vecs" width="70%" }
-</figure>
+![k_vecs](../assets/design/paged_attention/k_vecs.png)

 ```cpp
 K_vec k_vecs[NUM_VECS_PER_THREAD]
@@ -361,17 +353,11 @@ later steps. Now, it should store the normalized softmax result of

 ## Value

-<figure markdown="span">
-  ![](../assets/design/paged_attention/value.png){ align="center" alt="value" width="70%" }
-</figure>
+![value](../assets/design/paged_attention/value.png)

-<figure markdown="span">
-  ![](../assets/design/paged_attention/logits_vec.png){ align="center" alt="logits_vec" width="50%" }
-</figure>
+![logits_vec](../assets/design/paged_attention/logits_vec.png)

-<figure markdown="span">
-  ![](../assets/design/paged_attention/v_vec.png){ align="center" alt="v_vec" width="70%" }
-</figure>
+![v_vec](../assets/design/paged_attention/v_vec.png)

 Now we need to retrieve the value data and perform dot multiplication
 with `logits`. Unlike query and key, there is no thread group

--- a/docs/design/plugin_system.md
+++ b/docs/design/plugin_system.md
@@ -109,7 +109,7 @@ Every plugin has three parts:
    - `init_device`: This function is called to set up the device for the worker.
    - `initialize_cache`: This function is called to set cache config for the worker.
    - `load_model`: This function is called to load the model weights to device.
-    - `get_kv_cache_spaces`: This function is called to generate the kv cache spaces for the model.
+    - `get_kv_cache_spec`: This function is called to generate the kv cache spec for the model.
    - `determine_available_memory`: This function is called to profiles the peak memory usage of the model to determine how much memory can be used for KV cache without OOMs.
    - `initialize_from_config`: This function is called to allocate device KV cache with the specified kv_cache_config
    - `execute_model`: This function is called every step to inference the model.
@@ -124,7 +124,7 @@ Every plugin has three parts:

    Please look at the worker base class [WorkerBase][vllm.v1.worker.worker_base.WorkerBase] for more functions that can be implemented.

-5. Implement the attention backend class `MyDummyAttention` in `my_dummy_attention.py`. The attention backend class should inherit from [AttentionBackend][vllm.attention.backends.abstract.AttentionBackend]. It's used to calculate attentions with your device. Take `vllm.v1.attention.backends` as examples, it contains many attention backend implementations.
+5. Implement the attention backend class `MyDummyAttention` in `my_dummy_attention.py`. The attention backend class should inherit from [AttentionBackend][vllm.v1.attention.backend.AttentionBackend]. It is used to compute attention on your device. See `vllm.v1.attention.backends` for examples; it contains many attention backend implementations.

 6. Implement custom ops for high performance. Most ops can be ran by pytorch native implementation, while the performance may not be good. In this case, you can implement specific custom ops for your plugins. Currently, there are kinds of custom ops vLLM supports:

@@ -153,4 +153,5 @@ The interface for the model/module may change during vLLM's development. If you

 !!! warning "Deprecations"
    - `use_v1` parameter in `Platform.get_attn_backend_cls` is deprecated. It has been removed in v0.13.0.
-    - `_Backend` in `vllm.attention` is deprecated. It has been removed in v0.13.0. Please use `vllm.attention.backends.registry.register_backend` to add new attention backend to `AttentionBackendEnum` instead.
+    - `_Backend` in `vllm.attention` is deprecated. It has been removed in v0.13.0. Please use `vllm.v1.attention.backends.registry.register_backend` to add new attention backend to `AttentionBackendEnum` instead.
+    - `seed_everything` platform interface is deprecated. It will be removed in v0.15.0 or later. Please use `vllm.utils.torch_utils.set_random_seed` instead.
--- a/docs/design/torch_compile_multimodal.md
+++ b/docs/design/torch_compile_multimodal.md
+# torch.compile with Multimodal Encoders
+
+`torch.compile` can now be applied to multimodal encoders and miscellaneous nn modules in vLLM, including vision-language models like LLaMA 4, Qwen-VL,
+and similar encoder-based architectures.
+
+This document covers the basics of how the `torch.compile` integration works for multimodal encoders in vLLM, as well as how to apply the decorator
+to new models to improve performance.
+
+!!! note
+    For general information about `torch.compile` integration in vLLM, see the [torch.compile design document](./torch_compile.md).
+
+## Overview
+
+We have recently enabled the `@supports_torch_compile` decorator to work for multiple nn module components within a model type; this enables
+turning compile on for multimodal encoders, bringing performance improvements to additional components of the stack.
+
+When applied to the vision block of [`Qwen2_5_vl`](https://github.com/vllm-project/vllm/pull/23207), we observe a ~4.5% end-to-end performance improvement, with
+some increase in compilation time.
+
+This feature is off by default, but can be enabled by setting `compile_mm_encoder: true` in the compilation config when models have the
+`@supports_torch_compile` decorator.
+
+## How Compilation Works for Multimodal Components
+
+### APIs for Enablement
+
+To compile a multimodal component such as an encoder, we follow the same mechanism as the LLM text backbone, with a few additional pieces of scaffolding:
+
+1. The `@supports_torch_compile` decorator should include `enable_if=should_torch_compile_mm_vit`. This gates the compilation behind our
+`compile_mm_encoder` configuration.
+
+2. The `with set_model_tag("<component_name>", is_encoder=True)` context manager should be used around the nn.Module's instantiation. Since torch.compile
+relies on caching artifacts to reduce start time, we must properly propagate the `<component_name>` information to the cache in order to avoid collisions
+with the LLM text backbone or with other instances of the same artifact (as is the case with the vision block). `is_encoder=True` is also needed for encoder
+components (see the Compile ranges section).
+
+3. The `with set_forward_context` context manager should be used around the nn.Module's forward call. This properly forwards the `vllm_config`, which is needed
+for torch.compile integration.
+
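Why the component name has to reach the cache can be shown with a toy model of tagged cache keys. Everything below is hypothetical (`toy_model_tag`, `cache_key`, and the `"backbone"` default are illustration-only names, not vLLM's `set_model_tag` implementation); it only demonstrates how a tag set by a context manager keeps two components with the same graph hash from colliding.

```python
from contextlib import contextmanager

# Tag used when no component-specific tag is active (assumed default).
_current_tag = "backbone"


@contextmanager
def toy_model_tag(tag: str):
    """Temporarily set the tag that is mixed into compile-cache keys."""
    global _current_tag
    previous, _current_tag = _current_tag, tag
    try:
        yield
    finally:
        # Restore the outer tag even if module construction fails.
        _current_tag = previous


def cache_key(graph_hash: str) -> str:
    # The tag becomes part of the key, so equal hashes stay separate.
    return f"{_current_tag}/{graph_hash}"


backbone_key = cache_key("abc123")
with toy_model_tag("vision_block"):
    encoder_key = cache_key("abc123")
restored_key = cache_key("abc123")
```

Here `backbone_key` and `encoder_key` differ although the graph hash is identical, and the tag reverts to the outer value once the `with` block exits.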
+### CompilationConfig
+
+With the exception of `compile_mm_encoder: true`, the multimodal encoder will inherit from the same compilation config as the text LLM. We may extend
+this for more configuration in the future.
+
+## Applying torch.compile to a New Multimodal Model/Component
+
+To apply `supports_torch_compile` to a new general nn.Module, we advise following the same steps in [`debug_vllm_compile`](./debug_vllm_compile.md); this includes:
+
+1. Applying `supports_torch_compile` to small modules first (such as basic MLP layers), then moving up to more general modules until one reaches a good performance
+tradeoff.
+
+2. Leveraging [`tlparse`](https://github.com/meta-pytorch/tlparse) to identify and eliminate the source of recompiles and graph breaks
+
+3. Using `dynamic_arg_dims` and proper `dynamic_shapes_config` to handle dynamism.
+
+### Common pitfalls
+
+## VllmBackend Feature Support
+
+### Compile ranges
+
+The torch.compile integration will try to rely on max_batch_size to infer compilation ranges for dynamic shapes; however, for modules used in the encoder, this
+shape can be difficult to infer due to the unspecified range of shapes the encoder may see as input. Therefore, we rely on `is_encoder=True` in the `set_model_tag`
+to alert torch.compile to the fact that this range cannot be inferred, and we default to the range (1, MAX_INT).
+
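The range selection described above amounts to a two-way choice. The helper below is a hypothetical sketch: the function name is invented, `sys.maxsize` stands in for `MAX_INT`, and the `(1, max_batch_size)` range for non-encoder modules is an assumption about how the inferred range looks.

```python
import sys


def toy_compile_range(max_batch_size: int, is_encoder: bool) -> tuple[int, int]:
    """Pick the dynamic-shape range for a compiled module (sketch only)."""
    if is_encoder:
        # Encoder input shapes cannot be inferred, so leave the range open.
        return (1, sys.maxsize)
    # Assumption: other modules are bounded by the configured batch size.
    return (1, max_batch_size)
```

For instance, `toy_compile_range(256, is_encoder=False)` gives `(1, 256)`, while the encoder case ignores `max_batch_size` entirely.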
+!!! note
+    We may seek to tighten this range for better performance in the future.
+
+### Cudagraphs
+
+We have not yet explored compilation for multimodal encoders with CUDAGraph integration; behavior is currently unspecified.
+
+## Troubleshooting
+
+### Graph Breaks in Vision Encoders
+
+Some vision encoder operations may cause graph breaks. To identify them:
+
+```bash
+TORCH_LOGS="+dynamo" vllm serve <MODEL>
+```
+
+Common causes of graph breaks in multimodal models:
+
+- **Dynamic image sizes**: Use `dynamic_shapes_config` to handle variable resolutions
+- **Untraceable operations**: Some operations (such as `tolist()`) may not be supported by Dynamo
+- **Conditional processing**: Data-dependent branching based on image properties
+
+### Compilation Errors
+
+If compilation fails for a multimodal model:
+
+1. **Disable and test**: First verify the model works without compilation:
+   ```bash
+   VLLM_TORCH_COMPILE_LEVEL=0 vllm serve <model> --compilation-config='{"compile_mm_encoder": false}'
+   ```
+
+2. **Check logs**: Enable debug logging to see compilation details:
+   ```bash
+   VLLM_LOGGING_LEVEL=DEBUG vllm serve <model> --compilation-config='{"compile_mm_encoder": true}'
+   ```
+
+3. **Report issues**: If you find a bug, [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose)
+
+## See Also
+
+- [torch.compile Integration](./torch_compile.md) - Core design document
+- [Debugging torch.compile](./debug_vllm_compile.md) - Detailed debugging guide
+- [Multimodal Inputs](../features/multimodal_inputs.md) - How to pass multimodal data
+- [Disaggregated Encoder](../features/disagg_encoder.md) - Scaling vision encoders
+- [Supported Multimodal Models](../models/supported_models.md#list-of-multimodal-language-models) - Model compatibility
--- a/docs/features/README.md
+++ b/docs/features/README.md
@@ -64,7 +64,7 @@ th:not(:first-child) {
 | [CP](../configuration/optimization.md#chunked-prefill)                                     | [❌](https://github.com/vllm-project/vllm/issues/2729) | ✅        | ✅        | ✅     | ✅        | ✅                  | ✅     | ✅        |
 | [APC](automatic_prefix_caching.md)                        | [❌](https://github.com/vllm-project/vllm/issues/3687) | ✅        | ✅        | ✅     | ✅        | ✅                  | ✅     | ✅        |
 | [LoRA](lora.md)                                           | ✅                  | ✅        | ✅        | ✅     | ✅        | ✅                  | ✅     | ✅        |
-| [SD](spec_decode.md)                                      | ✅                  | ✅        | ✅        | ✅     | ✅        | ❌                  | ✅     | [🟠](https://github.com/vllm-project/vllm/issues/26963)       |
+| [SD](spec_decode.md)                                      | ✅                  | ✅        | ✅        | ✅     | ✅        | ❌                  | ✅     | ✅        |
 | CUDA graph                                                | ✅                  | ✅        | ✅        | ✅     | ✅        | ❌                  | ✅     | [❌](https://github.com/vllm-project/vllm/issues/26970)        |
 | [pooling](../models/pooling_models.md)                    | ✅                  | ✅        | ✅        | ✅     | ✅        | ✅                  | ✅     | ✅        |
 | <abbr title="Encoder-Decoder Models">enc-dec</abbr>       | ✅                  | ✅        | ✅        | ✅     | ✅        | ✅                  | ❌     | ✅        |

--- a/docs/features/custom_logitsprocs.md
+++ b/docs/features/custom_logitsprocs.md
@@ -180,7 +180,7 @@ The `DummyLogitsProcessor.update_state()` implementation maintains a "sparse" re

 ### Wrapping an Existing Request-Level Logits Processor

-Although the vLLM engine applies logits processors at batch granularity, some users may want to use vLLM with a "request-level" logits processor implementation - an implementation which operates on individual requests. This will be especially true if your logits processor was developed for vLLM version 0, which required it to be a `Callable` (as described [here](https://docs.vllm.ai/en/v0.10.1.1/api/vllm/logits_process.html)) conforming to the following type annotation:
+Although the vLLM engine applies logits processors at batch granularity, some users may want to use vLLM with a "request-level" logits processor implementation - an implementation which operates on individual requests. This will be especially true if your logits processor was developed for vLLM version 0, which required it to be a `Callable` (as described [here][vllm.logits_process]) conforming to the following type annotation:

 ``` python
 RequestLogitsProcessor = Union[

--- a/docs/features/disagg_encoder.md
+++ b/docs/features/disagg_encoder.md
@@ -68,7 +68,7 @@ Here is a figure illustrating disaggregate encoder flow:

 ![Disaggregated Encoder Flow](../assets/features/disagg_encoder/disagg_encoder_flow.png)

-For the PD disaggregation part, the Prefill instance receive cache exactly the same as the disaggregate encoder flow above. Prefill instance executes 1 step (prefill -> 1 token output) and then transfer KV cache to the Decode instance for the remaining execution. The KV transfer part purely happens after the execute of the PDinstance.
+For the PD disaggregation part, the Prefill instance receives the cache exactly as in the disaggregated encoder flow above. The Prefill instance executes one step (prefill -> 1 token output) and then transfers the KV cache to the Decode instance for the remaining execution. The KV transfer happens only after the execution of the PD instance.

 `docs/features/disagg_prefill.md` shows the brief idea about the disaggregated prefill (v0)


--- a/docs/features/disagg_prefill.md
+++ b/docs/features/disagg_prefill.md
 # Disaggregated Prefilling (experimental)

-This page introduces you the disaggregated prefilling feature in vLLM.
+This page introduces you to the disaggregated prefilling feature in vLLM.

 !!! note
    This feature is experimental and subject to change.
@@ -37,10 +37,10 @@ For NixlConnector, you may also specify one or multiple NIXL_Backend. Such as:
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both", "kv_buffer_device":"cuda", "kv_connector_extra_config":{"backends":["UCX", "GDS"]}}'
  ```

- **OffloadingConnector**: enable offloading of KV data to CPU memory, customizing the CPU block size (in tokens) and number of blocks to allocate (per worker):
+- **OffloadingConnector**: enable offloading of KV data to CPU memory, customizing the CPU block size (in tokens) and the total number of CPU memory bytes to allocate:

  ```bash
-  --kv-transfer-config '{"kv_connector":"OffloadingConnector","kv_role":"kv_both","kv_connector_extra_config":{"block_size": 64, "num_cpu_blocks": 1000}}'
+  --kv-transfer-config '{"kv_connector":"OffloadingConnector","kv_role":"kv_both","kv_connector_extra_config":{"block_size": 64, "cpu_bytes_to_use": 1000000000}}'
  ```

 ## Benchmarks

--- a/docs/features/lora.md
+++ b/docs/features/lora.md
@@ -10,7 +10,7 @@ them locally with
 ```python
 from huggingface_hub import snapshot_download

-sql_lora_path = snapshot_download(repo_id="yard1/llama-2-7b-sql-lora-test")
+sql_lora_path = snapshot_download(repo_id="jeeejeee/llama32-3b-text2sql-spider")
 ```

 Then we instantiate the base model and pass in the `enable_lora=True` flag:
@@ -19,7 +19,7 @@ Then we instantiate the base model and pass in the `enable_lora=True` flag:
 from vllm import LLM, SamplingParams
 from vllm.lora.request import LoRARequest

-llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
+llm = LLM(model="meta-llama/Llama-3.2-3B-Instruct", enable_lora=True)
 ```

 We can now submit the prompts and call `llm.generate` with the `lora_request` parameter. The first parameter
@@ -55,14 +55,11 @@ LoRA adapted models can also be served with the Open-AI compatible vLLM server.
 `--lora-modules {name}={path} {name}={path}` to specify each LoRA module when we kick off the server:

 ```bash
-vllm serve meta-llama/Llama-2-7b-hf \
+vllm serve meta-llama/Llama-3.2-3B-Instruct \
    --enable-lora \
-    --lora-modules sql-lora=$HOME/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/
+    --lora-modules sql-lora=jeeejeee/llama32-3b-text2sql-spider
 ```

-!!! note
-    The commit ID `0dfa347e8877a4d4ed19ee56c140fa518470028c` may change over time. Please check the latest commit ID in your environment to ensure you are using the correct one.
-
 The server entrypoint accepts all other LoRA configuration parameters (`max_loras`, `max_lora_rank`, `max_cpu_loras`,
 etc.), which will apply to all forthcoming requests. Upon querying the `/models` endpoint, we should see our LoRA along
 with its base model (if `jq` is not installed, you can follow [this guide](https://jqlang.org/download/) to install it.):
@@ -75,7 +72,7 @@ with its base model (if `jq` is not installed, you can follow [this guide](https
        "object": "list",
        "data": [
            {
-                "id": "meta-llama/Llama-2-7b-hf",
+                "id": "meta-llama/Llama-3.2-3B-Instruct",
                "object": "model",
                ...
            },
@@ -218,14 +215,14 @@ Alternatively, follow these example steps to implement your own plugin:
 In the previous version, users would provide LoRA modules via the following format, either as a key-value pair or in JSON format. For example:

 ```bash
--lora-modules sql-lora=$HOME/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/
+--lora-modules sql-lora=jeeejeee/llama32-3b-text2sql-spider
 ```

 This would only include the `name` and `path` for each LoRA module, but did not provide a way to specify a `base_model_name`.
 Now, you can specify a base_model_name alongside the name and path using JSON format. For example:

 ```bash
--lora-modules '{"name": "sql-lora", "path": "/path/to/lora", "base_model_name": "meta-llama/Llama-2-7b"}'
+--lora-modules '{"name": "sql-lora", "path": "jeeejeee/llama32-3b-text2sql-spider", "base_model_name": "meta-llama/Llama-3.2-3B-Instruct"}'
 ```

 To provide the backward compatibility support, you can still use the old key-value format (name=path), but the `base_model_name` will remain unspecified in that case.
@@ -234,7 +231,7 @@ To provide the backward compatibility support, you can still use the old key-val

 The new format of `--lora-modules` is mainly to support the display of parent model information in the model card. Here's an explanation of how your current response supports this:

- The `parent` field of LoRA model `sql-lora` now links to its base model `meta-llama/Llama-2-7b-hf`. This correctly reflects the hierarchical relationship between the base model and the LoRA adapter.
+- The `parent` field of LoRA model `sql-lora` now links to its base model `meta-llama/Llama-3.2-3B-Instruct`. This correctly reflects the hierarchical relationship between the base model and the LoRA adapter.
 - The `root` field points to the artifact location of the lora adapter.

 ??? console "Command output"
@@ -246,11 +243,11 @@ The new format of `--lora-modules` is mainly to support the display of parent mo
        "object": "list",
        "data": [
            {
-            "id": "meta-llama/Llama-2-7b-hf",
+            "id": "meta-llama/Llama-3.2-3B-Instruct",
            "object": "model",
            "created": 1715644056,
            "owned_by": "vllm",
-            "root": "~/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-hf/snapshots/01c7f73d771dfac7d292323805ebc428287df4f9/",
+            "root": "meta-llama/Llama-3.2-3B-Instruct",
            "parent": null,
            "permission": [
                {
@@ -263,8 +260,8 @@ The new format of `--lora-modules` is mainly to support the display of parent mo
            "object": "model",
            "created": 1715644056,
            "owned_by": "vllm",
-            "root": "~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/",
-            "parent": meta-llama/Llama-2-7b-hf,
+            "root": "jeeejeee/llama32-3b-text2sql-spider",
+            "parent": "meta-llama/Llama-3.2-3B-Instruct",
            "permission": [
                {
                ....
@@ -275,6 +272,10 @@ The new format of `--lora-modules` is mainly to support the display of parent mo
    }
    ```

+## LoRA Support for Tower and Connector of Multi-Modal Model
+
+vLLM currently has experimental support for LoRA on the tower and connector components of multi-modal models. To enable this feature for a model, you need to implement the corresponding token helper functions for its tower and connector. For more details on the rationale behind this approach, please refer to [PR 26674](https://github.com/vllm-project/vllm/pull/26674). We welcome contributions that extend LoRA support to the tower and connector of additional models. Please refer to [Issue 31479](https://github.com/vllm-project/vllm/issues/31479) to check the current model support status.
+
 ## Default LoRA Models For Multimodal Models

 Some models, e.g., [Granite Speech](https://huggingface.co/ibm-granite/granite-speech-3.3-8b) and [Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) multimodal, contain LoRA adapter(s) that are expected to always be applied when a given modality is present. This can be a bit tedious to manage with the above approaches, as it requires the user to send the `LoRARequest` (offline) or to filter requests between the base model and LoRA model (server) depending on the content of the request's multimodal data.

--- a/docs/features/multimodal_inputs.md
+++ b/docs/features/multimodal_inputs.md
@@ -166,49 +166,51 @@ Full example: [examples/offline_inference/vision_language_multi_image.py](../../

 If using the [LLM.chat](../models/generative_models.md#llmchat) method, you can pass images directly in the message content using various formats: image URLs, PIL Image objects, or pre-computed embeddings:

-```python
-from vllm import LLM
-from vllm.assets.image import ImageAsset
-
-llm = LLM(model="llava-hf/llava-1.5-7b-hf")
-image_url = "https://picsum.photos/id/32/512/512"
-image_pil = ImageAsset('cherry_blossom').pil_image
-image_embeds = torch.load(...)
-
-conversation = [
-    {"role": "system", "content": "You are a helpful assistant"},
-    {"role": "user", "content": "Hello"},
-    {"role": "assistant", "content": "Hello! How can I assist you today?"},
-    {
-        "role": "user",
-        "content": [
-            {
-                "type": "image_url",
-                "image_url": {"url": image_url},
-            },
-            {
-                "type": "image_pil",
-                "image_pil": image_pil,
-            },
-            {
-                "type": "image_embeds",
-                "image_embeds": image_embeds,
-            },
-            {
-                "type": "text",
-                "text": "What's in these images?",
-            },
-        ],
-    },
-]
+??? code

-# Perform inference and log output.
-outputs = llm.chat(conversation)
+    ```python
+    from vllm import LLM
+    from vllm.assets.image import ImageAsset

-for o in outputs:
-    generated_text = o.outputs[0].text
-    print(generated_text)
-```
+    llm = LLM(model="llava-hf/llava-1.5-7b-hf")
+    image_url = "https://picsum.photos/id/32/512/512"
+    image_pil = ImageAsset('cherry_blossom').pil_image
+    image_embeds = torch.load(...)
+
+    conversation = [
+        {"role": "system", "content": "You are a helpful assistant"},
+        {"role": "user", "content": "Hello"},
+        {"role": "assistant", "content": "Hello! How can I assist you today?"},
+        {
+            "role": "user",
+            "content": [
+                {
+                    "type": "image_url",
+                    "image_url": {"url": image_url},
+                },
+                {
+                    "type": "image_pil",
+                    "image_pil": image_pil,
+                },
+                {
+                    "type": "image_embeds",
+                    "image_embeds": image_embeds,
+                },
+                {
+                    "type": "text",
+                    "text": "What's in these images?",
+                },
+            ],
+        },
+    ]
+
+    # Perform inference and log output.
+    outputs = llm.chat(conversation)
+
+    for o in outputs:
+        generated_text = o.outputs[0].text
+        print(generated_text)
+    ```

 Multi-image input can be extended to perform video captioning. We show this with [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) as it supports videos:

@@ -354,6 +356,44 @@ You can pass a tuple `(array, sampling_rate)` to the `'audio'` field of the mult

 Full example: [examples/offline_inference/audio_language.py](../../examples/offline_inference/audio_language.py)

+#### Automatic Audio Channel Normalization
+
+vLLM automatically normalizes audio channels for models that require specific audio formats. Libraries like `torchaudio` load stereo files with shape `[channels, time]`, but many audio models (particularly Whisper-based models) expect mono audio with shape `[time]`.
+
+**Supported models with automatic mono conversion:**
+
+- **Whisper** and all Whisper-based models
+- **Qwen2-Audio**
+- **Qwen2.5-Omni** / **Qwen3-Omni** (inherits from Qwen2.5-Omni)
+- **Ultravox**
+
+For these models, vLLM automatically:
+
+1. Detects if the model requires mono audio via the feature extractor
+2. Converts multi-channel audio to mono using channel averaging
+3. Handles both `(channels, time)` format (torchaudio) and `(time, channels)` format (soundfile)
+
+**Example with stereo audio:**
+
+```python
+import torchaudio
+from vllm import LLM
+
+# Load stereo audio file - returns (channels, time) shape
+audio, sr = torchaudio.load("stereo_audio.wav")
+print(f"Original shape: {audio.shape}")  # e.g., torch.Size([2, 16000])
+
+# vLLM automatically converts to mono for Whisper-based models
+llm = LLM(model="openai/whisper-large-v3")
+
+outputs = llm.generate({
+    "prompt": "",
+    "multi_modal_data": {"audio": (audio.numpy(), sr)},
+})
+```
+
+No manual conversion is needed - vLLM handles the channel normalization automatically based on the model's requirements.
+
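The channel averaging in step 2 above can be pictured with a small NumPy sketch. This is not vLLM's implementation: `to_mono` is a hypothetical helper, and the rule that the shorter axis is the channel axis is an assumption made here so that both the `(channels, time)` and `(time, channels)` layouts are covered.

```python
import numpy as np


def to_mono(audio: np.ndarray) -> np.ndarray:
    """Average a multi-channel clip down to shape (time,).

    Hypothetical helper for illustration only.
    """
    if audio.ndim == 1:
        return audio  # already mono
    if audio.ndim != 2:
        raise ValueError(f"expected 1-D or 2-D audio, got shape {audio.shape}")
    # Assumption: a clip has far more samples than channels, so the shorter
    # axis is treated as the channel axis.
    channel_axis = 0 if audio.shape[0] <= audio.shape[1] else 1
    return audio.mean(axis=channel_axis)


# Two layouts of the same stereo clip: one silent channel, one constant channel.
stereo_channels_first = np.stack([np.zeros(16000), np.ones(16000)])  # (2, 16000)
stereo_channels_last = stereo_channels_first.T                       # (16000, 2)

print(to_mono(stereo_channels_first).shape)
print(to_mono(stereo_channels_last).shape)
```

Both calls should produce a one-dimensional array of 16000 samples, which is the shape that mono-only models expect.
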
 ### Embedding Inputs

 To input pre-computed embeddings belonging to a data type (i.e. image, video, or audio) directly to the language model,
@@ -506,6 +546,7 @@ Then, you can use the OpenAI client as follows:
 ??? code

    ```python
+    import os
    from openai import OpenAI

    openai_api_key = "EMPTY"
@@ -517,8 +558,11 @@ Then, you can use the OpenAI client as follows:
    )

    # Single-image input inference
+
+    # Public image URL for testing remote image processing
    image_url = "https://vllm-public-assets.s3.us-west-2.amazonaws.com/vision_model_images/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"

+    # Create chat completion with remote image
    chat_response = client.chat.completions.create(
        model="microsoft/Phi-3.5-vision-instruct",
        messages=[
@@ -542,6 +586,35 @@ Then, you can use the OpenAI client as follows:
    )
    print("Chat completion output:", chat_response.choices[0].message.content)

+    # Local image file path (update this to point to your actual image file)
+    image_file = "/path/to/image.jpg"
+
+    # Create chat completion with local image file
+    # Launch the API server/engine with the --allowed-local-media-path argument.
+    if os.path.exists(image_file):
+        chat_completion_from_local_image_url = client.chat.completions.create(
+            model="microsoft/Phi-3.5-vision-instruct",
+            messages=[
+                {
+                    "role": "user",
+                    "content": [
+                        {
+                            "type": "text",
+                            "text": "What’s in this image?",
+                        },
+                        {
+                            "type": "image_url",
+                            "image_url": {"url": f"file://{image_file}"},
+                        },
+                    ],
+                }
+            ],
+        )
+        result = chat_completion_from_local_image_url.choices[0].message.content
+        print("Chat completion output from local image file:\n", result)
+    else:
+        print(f"Local image file not found at {image_file}, skipping local file test.")
+
    # Multi-image input inference
    image_url_duck = "https://vllm-public-assets.s3.us-west-2.amazonaws.com/multimodal_asset/duck.jpg"
    image_url_lion = "https://vllm-public-assets.s3.us-west-2.amazonaws.com/multimodal_asset/lion.jpg"
@@ -654,6 +727,31 @@ Full example: [examples/online_serving/openai_chat_completion_client_for_multimo
    export VLLM_VIDEO_FETCH_TIMEOUT=<timeout>
    ```

+#### Video Frame Recovery
+
+For improved robustness when processing potentially corrupted or truncated video files, vLLM supports optional frame recovery using a dynamic window forward-scan approach. When enabled, if a target frame fails to load during sequential reading, the next successfully grabbed frame (before the next target frame) will be used in its place.
+
+To enable video frame recovery, pass the `frame_recovery` parameter via `--media-io-kwargs`:
+
+```bash
+# Example: Enable frame recovery
+vllm serve Qwen/Qwen3-VL-30B-A3B-Instruct \
+  --media-io-kwargs '{"video": {"frame_recovery": true}}'
+```
+
+**Parameters:**
+
+- `frame_recovery`: Boolean flag to enable forward-scan recovery. When `true`, failed frames are recovered using the next available frame within the dynamic window (up to the next target frame). Default is `false`.
+
+**How it works:**
+
+1. The system reads frames sequentially
+2. If a target frame fails to grab, it's marked as "failed"
+3. The next successfully grabbed frame (before reaching the next target) is used to recover the failed frame
+4. This approach handles both mid-video corruption and end-of-video truncation
+
+Frame recovery works with common video formats such as MP4 when using OpenCV backends.
+
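The forward-scan steps listed under "How it works" can be simulated in plain Python. This is a sketch under stated assumptions rather than vLLM's code: `recover_frames` is a hypothetical name, the `readable` list stands in for the success or failure of each sequential frame grab, and the window of the last target is assumed to run to the end of the video.

```python
def recover_frames(readable, targets):
    """Map each target frame index to the frame index actually used.

    readable[i] is True when frame i can be grabbed (stand-in for a decoder).
    A failed target is replaced by the next readable frame that comes before
    the following target; None means nothing in the window was readable.
    """
    targets = sorted(targets)
    chosen = {}
    for pos, target in enumerate(targets):
        # The dynamic window ends just before the next target frame
        # (assumption: the last target's window runs to the end of the video).
        next_target = targets[pos + 1] if pos + 1 < len(targets) else len(readable)
        window_end = min(next_target, len(readable))
        chosen[target] = next(
            (i for i in range(target, window_end) if readable[i]), None
        )
    return chosen


# Frame 2 is corrupted mid-video; frames 6 and 7 are lost to truncation.
readable = [True, True, False, True, True, True, False, False]
print(recover_frames(readable, targets=[0, 2, 4, 6]))
# prints {0: 0, 2: 3, 4: 4, 6: None}
```

In this simulation the failed target 2 is recovered from frame 3, while target 6 has no readable frame left in its window and stays unrecovered.
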
 #### Custom RGBA Background Color

 To use a custom background color for RGBA images, pass the `rgba_background_color` parameter via `--media-io-kwargs`:
@@ -860,6 +958,8 @@ The following example demonstrates how to pass image embeddings to the OpenAI se

 For Online Serving, you can also skip sending media if you expect cache hits with provided UUIDs. You can do so by sending media like this:

+??? code
+
    ```python
        # Image/video/audio URL:
        {

--- a/docs/features/nixl_connector_usage.md
+++ b/docs/features/nixl_connector_usage.md
@@ -6,11 +6,17 @@ NixlConnector is a high-performance KV cache transfer connector for vLLM's disag

 ### Installation

-Install the NIXL library: `uv pip install nixl`, as a quick start.
+For a quick start on NVIDIA platforms, install the NIXL library with `uv pip install nixl`.

 - Refer to [NIXL official repository](https://github.com/ai-dynamo/nixl) for more installation instructions
 - The specified required NIXL version can be found in [requirements/kv_connectors.txt](../../requirements/kv_connectors.txt) and other relevant config files

+On ROCm platforms, the [base ROCm Dockerfile](../../docker/Dockerfile.rocm_base) already includes RIXL and UCX.
+
+- Refer to [RIXL official repository](https://github.com/rocm/rixl) for more information
+- The supporting libraries for RIXL are listed in [requirements/kv_connectors_rocm.txt](../../requirements/kv_connectors_rocm.txt)
+- In the future, RIXL may be removed from the Docker image, and users will be able to install it from pre-compiled binary packages
+
 For non-cuda platform, please install nixl with ucx build from source, instructed as below.

 ```bash

--- a/docs/features/quantization/fp8.md
+++ b/docs/features/quantization/fp8.md
@@ -84,7 +84,7 @@ Since simple RTN does not require data for weight quantization and the activatio
 Install `vllm` and `lm-evaluation-harness` for evaluation:

 ```bash
-pip install vllm git+https://github.com/EleutherAI/lm-evaluation-harness.git@206b7722158f58c35b7ffcd53b035fdbdda5126d#egg=lm-eval[api]
+pip install vllm "lm-eval[api]>=0.4.9.2"
 ```

 Load and run the model in `vllm`:

--- a/docs/features/quantization/inc.md
+++ b/docs/features/quantization/inc.md
@@ -19,7 +19,7 @@ Once you've completed the model calibration process and collected the measuremen

 ```bash
 export QUANT_CONFIG=/path/to/quant/config/inc/meta-llama-3.1-405b-instruct/maxabs_measure_g3.json
-vllm serve meta-llama/Llama-3.1-405B-Instruct --quantization inc --kv-cache-dtype fp8_inc --tensor_paralel_size 8
+vllm serve meta-llama/Llama-3.1-405B-Instruct --quantization inc --kv-cache-dtype fp8_inc --tensor-parallel-size 8
 ```

 !!! tip

--- a/docs/features/quantization/int4.md
+++ b/docs/features/quantization/int4.md
@@ -18,7 +18,7 @@ pip install llmcompressor
 Additionally, install `vllm` and `lm-evaluation-harness` for evaluation:

 ```bash
-pip install vllm git+https://github.com/EleutherAI/lm-evaluation-harness.git@206b7722158f58c35b7ffcd53b035fdbdda5126d#egg=lm-eval[api]
+pip install vllm "lm-eval[api]>=0.4.9.2"
 ```

 ## Quantization Process

--- a/docs/features/quantization/int8.md
+++ b/docs/features/quantization/int8.md
@@ -23,7 +23,7 @@ pip install llmcompressor
 Additionally, install `vllm` and `lm-evaluation-harness` for evaluation:

 ```bash
-pip install vllm git+https://github.com/EleutherAI/lm-evaluation-harness.git@206b7722158f58c35b7ffcd53b035fdbdda5126d#egg=lm-eval[api]
+pip install vllm "lm-eval[api]>=0.4.9.2"
 ```

 ## Quantization Process

--- a/docs/features/quantization/modelopt.md
+++ b/docs/features/quantization/modelopt.md
@@ -8,6 +8,16 @@ We recommend installing the library with:
 pip install nvidia-modelopt
 ```

+## Supported ModelOpt checkpoint formats
+
+vLLM detects ModelOpt checkpoints via `hf_quant_config.json` and supports the
+following `quantization.quant_algo` values:
+
+- `FP8`: per-tensor weight scale (+ optional static activation scale).
+- `FP8_PER_CHANNEL_PER_TOKEN`: per-channel weight scale and dynamic per-token activation quantization.
+- `FP8_PB_WO` (ModelOpt may emit `fp8_pb_wo`): block-scaled FP8 weight-only (typically 128×128 blocks).
+- `NVFP4`: ModelOpt NVFP4 checkpoints (use `quantization="modelopt_fp4"`).
+
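To check which of the formats above a local checkpoint uses before serving it, the `quantization.quant_algo` field can be read directly. The sketch below is only an illustration, not vLLM's detection code: `read_quant_algo` and `SUPPORTED_QUANT_ALGOS` are hypothetical names, and upper-casing the value is an assumption made here to cover the lower-case `fp8_pb_wo` spelling.

```python
import json
import tempfile
from pathlib import Path

# Values taken from the list above; the set itself is a hypothetical helper.
SUPPORTED_QUANT_ALGOS = {"FP8", "FP8_PER_CHANNEL_PER_TOKEN", "FP8_PB_WO", "NVFP4"}


def read_quant_algo(checkpoint_dir):
    """Return the upper-cased quant_algo from hf_quant_config.json, or None."""
    config_path = Path(checkpoint_dir) / "hf_quant_config.json"
    if not config_path.is_file():
        return None  # no ModelOpt config file in this directory
    config = json.loads(config_path.read_text())
    algo = config.get("quantization", {}).get("quant_algo")
    return algo.upper() if isinstance(algo, str) else None


# Demo with a throwaway directory standing in for an exported checkpoint.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "hf_quant_config.json").write_text(
        json.dumps({"quantization": {"quant_algo": "fp8_pb_wo"}})
    )
    algo = read_quant_algo(tmp)
    print(algo, algo in SUPPORTED_QUANT_ALGOS)
    # prints FP8_PB_WO True
```

A directory without `hf_quant_config.json` yields `None`, which in this sketch simply means the checkpoint is not recognized as a ModelOpt export.
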
 ## Quantizing HuggingFace Models with PTQ

 You can quantize HuggingFace models using the example scripts provided in the Model Optimizer repository. The primary script for LLM PTQ is typically found within the `examples/llm_ptq` directory.
@@ -80,3 +90,24 @@ The quantized checkpoint can then be deployed with vLLM. As an example, the foll
    if __name__ == "__main__":
        main()
    ```
+
+## Running the OpenAI-compatible server
+
+To serve a local ModelOpt checkpoint via the OpenAI-compatible API:
+
+```bash
+vllm serve <path_to_exported_checkpoint> \
+  --quantization modelopt \
+  --host 0.0.0.0 --port 8000
+```
+
+## Testing (local checkpoints)
+
+vLLM's ModelOpt unit tests are gated by local checkpoint paths and are skipped
+by default in CI. To run the tests locally:
+
+```bash
+export VLLM_TEST_MODELOPT_FP8_PC_PT_MODEL_PATH=<path_to_fp8_pc_pt_checkpoint>
+export VLLM_TEST_MODELOPT_FP8_PB_WO_MODEL_PATH=<path_to_fp8_pb_wo_checkpoint>
+pytest -q tests/quantization/test_modelopt.py
+```