[Doc] Update ViT CUDA graph doc for mixed (image+video) inputs (#40355)

Signed-off-by: shen-shanshan <467638484@qq.com>

Commit 80975912, authored Apr 21, 2026 by Shanshan Shen, committed by GitHub Apr 21, 2026

Showing 2 additions and 5 deletions

docs/design/cuda_graphs_multimodal.md +2 -5

--- a/docs/design/cuda_graphs_multimodal.md
+++ b/docs/design/cuda_graphs_multimodal.md
@@ -52,14 +52,14 @@ For each graph replay:
 When `mm_encoder_tp_mode="data"`, the manager distributes images across TP ranks using load-balanced assignment via `get_load_balance_assignment`, executes locally on each rank, then gathers results back in the original order via `tensor_model_parallel_all_gather`.
-### Video inference support (experimental)
+### Video inference support
 Following <https://github.com/vllm-project/vllm/pull/35963> (ViT full CUDA graph support for image inference), <https://github.com/vllm-project/vllm/pull/38061> extends the encoder CUDA graph framework to support video inference for Qwen3-VL. Previously, the CUDA graph capture/replay path only handled image inputs (`pixel_values` + `image_grid_thw`). Video inputs use different keys (`pixel_values_videos` + `video_grid_thw`) and require larger `cu_seqlens` buffers because each video item contributes multiple frames (`T` attention sequences). This PR generalizes the protocol and manager to handle both modalities through a single shared graph manager.
 !!! note
    Video CUDA graphs are automatically disabled when EVS (Efficient Video Sampling) pruning is enabled, since EVS makes the token count data-dependent and incompatible with CUDA graph capture.
-    Currently, we only support image-only or video-only inputs when enabling CUDA graph, mixed inputs (image + video) are not supported yet (we will work on it in the near future). Thus, it's recommended to turn off the image modality by `--limit-mm-per-prompt '{"image": 0}'` for video-only inputs.
+    Mixed inputs (image+video) per prompt are also supported now.
 ## Model integration via `SupportsEncoderCudaGraph`
@@ -142,7 +142,6 @@ Enable encoder CUDA Graphs via `compilation_config`:
 ```bash
 vllm serve Qwen/Qwen3-VL-32B \
-  --limit-mm-per-prompt '{"image": 0}' \
  --compilation-config '{"cudagraph_mm_encoder": true}'
 ```
@@ -150,7 +149,6 @@ With explicit budgets:
 ```bash
 vllm serve Qwen/Qwen3-VL-32B \
-  --limit-mm-per-prompt '{"image": 0}' \
  --compilation-config '{"cudagraph_mm_encoder": true, "encoder_cudagraph_token_budgets": [2048, 4096, 8192, 13824], "encoder_cudagraph_max_vision_items_per_batch": 8, "encoder_cudagraph_max_frames_per_batch": 64}'
 ```
@@ -169,7 +167,6 @@ compilation_config = {
 model = vllm.LLM(
    model="Qwen/Qwen3-VL-32B",
-    limit_mm_per_prompt='{"image": 0}',
    compilation_config=compilation_config,
 )
 ```
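
The doc text touched by this diff notes that video inputs need larger `cu_seqlens` buffers because each video item contributes `T` attention sequences. A minimal sketch of that bookkeeping for a mixed (image+video) batch follows. It assumes each temporal slice of a `(t, h, w)` grid entry is one attention sequence of `h * w` patches. The helper name `build_cu_seqlens` and the grid values are hypothetical and are not taken from the vLLM code base.

```python
from itertools import accumulate


def build_cu_seqlens(grid_thw):
    """Return cumulative sequence boundaries for a list of (t, h, w) grids.

    Assumption (illustrative only): every temporal slice forms one attention
    sequence of h * w patches, so an item with t frames adds t sequences.
    """
    seq_lens = []
    for t, h, w in grid_thw:
        # An image has t == 1 (one sequence); a video has t > 1 (t sequences).
        seq_lens.extend([h * w] * t)
    # Prefix sums with a leading 0, i.e. len(cu_seqlens) == num_sequences + 1.
    return [0] + list(accumulate(seq_lens))


# Hypothetical mixed batch: one image and one 3-frame video.
image_grid_thw = [(1, 4, 6)]
video_grid_thw = [(3, 4, 4)]

cu_seqlens = build_cu_seqlens(image_grid_thw + video_grid_thw)
num_sequences = len(cu_seqlens) - 1
print(cu_seqlens, num_sequences)
```

Under this assumption the single image adds one boundary while the video adds three, which is why a frame-level limit such as `encoder_cudagraph_max_frames_per_batch` is configured alongside the per-batch item limit.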