[Doc] Fix format of multimodal_inputs.md (#31800)

Signed-off-by: BlankR <hjyblanche@gmail.com>

Unverified Commit 6ebb66cc authored Jan 06, 2026 by BlankR Committed by GitHub Jan 06, 2026

Showing with 45 additions and 41 deletions

docs/features/multimodal_inputs.md +45 -41

--- a/docs/features/multimodal_inputs.md
+++ b/docs/features/multimodal_inputs.md
@@ -166,49 +166,51 @@ Full example: [examples/offline_inference/vision_language_multi_image.py](../../
 If using the [LLM.chat](../models/generative_models.md#llmchat) method, you can pass images directly in the message content using various formats: image URLs, PIL Image objects, or pre-computed embeddings:
-```python
+??? code
-from vllm import LLM
-from vllm.assets.image import ImageAsset
-llm = LLM(model="llava-hf/llava-1.5-7b-hf")
-image_url = "https://picsum.photos/id/32/512/512"
-image_pil = ImageAsset('cherry_blossom').pil_image
-image_embeds = torch.load(...)
-conversation = [
-    {"role": "system", "content": "You are a helpful assistant"},
-    {"role": "user", "content": "Hello"},
-    {"role": "assistant", "content": "Hello! How can I assist you today?"},
-    {
-        "role": "user",
-        "content": [
-            {
-                "type": "image_url",
-                "image_url": {"url": image_url},
-            },
-            {
-                "type": "image_pil",
-                "image_pil": image_pil,
-            },
-            {
-                "type": "image_embeds",
-                "image_embeds": image_embeds,
-            },
-            {
-                "type": "text",
-                "text": "What's in these images?",
-            },
-        ],
-    },
-]
-# Perform inference and log output.
+    ```python
-outputs = llm.chat(conversation)
+    from vllm import LLM
+    from vllm.assets.image import ImageAsset
-for o in outputs:
+    llm = LLM(model="llava-hf/llava-1.5-7b-hf")
-    generated_text = o.outputs[0].text
+    image_url = "https://picsum.photos/id/32/512/512"
-    print(generated_text)
+    image_pil = ImageAsset('cherry_blossom').pil_image
-```
+    image_embeds = torch.load(...)
+    conversation = [
+        {"role": "system", "content": "You are a helpful assistant"},
+        {"role": "user", "content": "Hello"},
+        {"role": "assistant", "content": "Hello! How can I assist you today?"},
+        {
+            "role": "user",
+            "content": [
+                {
+                    "type": "image_url",
+                    "image_url": {"url": image_url},
+                },
+                {
+                    "type": "image_pil",
+                    "image_pil": image_pil,
+                },
+                {
+                    "type": "image_embeds",
+                    "image_embeds": image_embeds,
+                },
+                {
+                    "type": "text",
+                    "text": "What's in these images?",
+                },
+            ],
+        },
+    ]
+    # Perform inference and log output.
+    outputs = llm.chat(conversation)
+    for o in outputs:
+        generated_text = o.outputs[0].text
+        print(generated_text)
+    ```
 Multi-image input can be extended to perform video captioning. We show this with [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) as it supports videos:
@@ -893,6 +895,8 @@ The following example demonstrates how to pass image embeddings to the OpenAI se
 For Online Serving, you can also skip sending media if you expect cache hits with provided UUIDs. You can do so by sending media like this:
+??? code
    ```python
        # Image/video/audio URL:
        {