Commit 6ebb66cc authored by BlankR, committed by GitHub

[Doc] Fix format of multimodal_inputs.md (#31800)


Signed-off-by: BlankR <hjyblanche@gmail.com>
parent 43d384ba
@@ -166,49 +166,51 @@ Full example: [examples/offline_inference/vision_language_multi_image.py](../../
 If using the [LLM.chat](../models/generative_models.md#llmchat) method, you can pass images directly in the message content using various formats: image URLs, PIL Image objects, or pre-computed embeddings:

-```python
-from vllm import LLM
-from vllm.assets.image import ImageAsset
-
-llm = LLM(model="llava-hf/llava-1.5-7b-hf")
-image_url = "https://picsum.photos/id/32/512/512"
-image_pil = ImageAsset('cherry_blossom').pil_image
-image_embeds = torch.load(...)
-
-conversation = [
-    {"role": "system", "content": "You are a helpful assistant"},
-    {"role": "user", "content": "Hello"},
-    {"role": "assistant", "content": "Hello! How can I assist you today?"},
-    {
-        "role": "user",
-        "content": [
-            {
-                "type": "image_url",
-                "image_url": {"url": image_url},
-            },
-            {
-                "type": "image_pil",
-                "image_pil": image_pil,
-            },
-            {
-                "type": "image_embeds",
-                "image_embeds": image_embeds,
-            },
-            {
-                "type": "text",
-                "text": "What's in these images?",
-            },
-        ],
-    },
-]
-
-# Perform inference and log output.
-outputs = llm.chat(conversation)
-
-for o in outputs:
-    generated_text = o.outputs[0].text
-    print(generated_text)
-```
+??? code
+
+    ```python
+    from vllm import LLM
+    from vllm.assets.image import ImageAsset
+
+    llm = LLM(model="llava-hf/llava-1.5-7b-hf")
+    image_url = "https://picsum.photos/id/32/512/512"
+    image_pil = ImageAsset('cherry_blossom').pil_image
+    image_embeds = torch.load(...)
+
+    conversation = [
+        {"role": "system", "content": "You are a helpful assistant"},
+        {"role": "user", "content": "Hello"},
+        {"role": "assistant", "content": "Hello! How can I assist you today?"},
+        {
+            "role": "user",
+            "content": [
+                {
+                    "type": "image_url",
+                    "image_url": {"url": image_url},
+                },
+                {
+                    "type": "image_pil",
+                    "image_pil": image_pil,
+                },
+                {
+                    "type": "image_embeds",
+                    "image_embeds": image_embeds,
+                },
+                {
+                    "type": "text",
+                    "text": "What's in these images?",
+                },
+            ],
+        },
+    ]
+
+    # Perform inference and log output.
+    outputs = llm.chat(conversation)
+
+    for o in outputs:
+        generated_text = o.outputs[0].text
+        print(generated_text)
+    ```

 Multi-image input can be extended to perform video captioning. We show this with [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) as it supports videos:
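
Reviewer note: the message layout in this hunk can be checked without loading a model. The following is a minimal sketch, assuming only the part types visible in the diff (`image_url`, `image_pil`, `image_embeds`, `text`). The helper `build_user_message` is hypothetical and not part of vLLM; it only assembles plain dictionaries in the shape the documented snippet passes to `llm.chat`.

```python
from typing import Any, Optional


def build_user_message(
    text: str,
    image_url: Optional[str] = None,
    image_pil: Any = None,
    image_embeds: Any = None,
) -> dict:
    """Assemble one user message mixing image parts and a text part.

    Hypothetical helper: the part types mirror those shown in the diff.
    """
    content: list = []
    if image_url is not None:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    if image_pil is not None:
        content.append({"type": "image_pil", "image_pil": image_pil})
    if image_embeds is not None:
        content.append({"type": "image_embeds", "image_embeds": image_embeds})
    # The text part goes last, matching the order used in the documented example.
    content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}


message = build_user_message(
    "What's in this image?",
    image_url="https://picsum.photos/id/32/512/512",
)
conversation = [
    {"role": "system", "content": "You are a helpful assistant"},
    message,
]
```

The resulting `conversation` has the same shape as the list passed to `llm.chat(conversation)` in the hunk. Note that the documented snippet calls `torch.load(...)` without importing `torch`, so a runnable version would also need `import torch`.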
@@ -893,6 +895,8 @@ The following example demonstrates how to pass image embeddings to the OpenAI se
 For Online Serving, you can also skip sending media if you expect cache hits with provided UUIDs. You can do so by sending media like this:

+??? code
+
     ```python
     # Image/video/audio URL:
     {
...