.. _multimodal_inputs:

Multimodal Inputs
=================

This page teaches you how to pass multi-modal inputs to :ref:`multi-modal models <supported_mm_models>` in vLLM.

.. note::
    We are actively iterating on multi-modal support. See `this RFC <https://github.com/vllm-project/vllm/issues/4194>`_ for upcoming changes,
    and `open an issue on GitHub <https://github.com/vllm-project/vllm/issues/new/choose>`_ if you have any feedback or feature requests.

Offline Inference
-----------------

To input multi-modal data, follow this schema in :class:`vllm.inputs.PromptType`:

* ``prompt``: The prompt should follow the format documented in the model's HuggingFace repo.
* ``multi_modal_data``: A dictionary that follows the schema defined in :class:`vllm.multimodal.MultiModalDataDict`.

Image
^^^^^

You can pass a single image to the :code:`'image'` field of the multi-modal dictionary, as shown in the following examples:

.. code-block:: python

    import PIL.Image

    from vllm import LLM

    llm = LLM(model="llava-hf/llava-1.5-7b-hf")

    # Refer to the HuggingFace repo for the correct format to use
    prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"

    # Load the image using PIL.Image
    image = PIL.Image.open(...)

    # Single prompt inference
    outputs = llm.generate({
        "prompt": prompt,
        "multi_modal_data": {"image": image},
    })

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)

    # Batch inference
    image_1 = PIL.Image.open(...)
    image_2 = PIL.Image.open(...)
    outputs = llm.generate(
        [
            {
                "prompt": "USER: <image>\nWhat is the content of this image?\nASSISTANT:",
                "multi_modal_data": {"image": image_1},
            },
            {
                "prompt": "USER: <image>\nWhat's the color of this image?\nASSISTANT:",
                "multi_modal_data": {"image": image_2},
            }
        ]
    )

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)

A code example can be found in `examples/offline_inference_vision_language.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language.py>`_.

To substitute multiple images inside the same text prompt, you can pass in a list of images instead:

.. code-block:: python

    import PIL.Image

    from vllm import LLM

    llm = LLM(
        model="microsoft/Phi-3.5-vision-instruct",
        trust_remote_code=True,  # Required to load Phi-3.5-vision
        max_model_len=4096,  # Otherwise, it may not fit in smaller GPUs
        limit_mm_per_prompt={"image": 2},  # The maximum number of images to accept per prompt
    )

    # Refer to the HuggingFace repo for the correct format to use
    prompt = "<|user|>\n<|image_1|>\n<|image_2|>\nWhat is the content of each image?<|end|>\n<|assistant|>\n"

    # Load the images using PIL.Image
    image1 = PIL.Image.open(...)
    image2 = PIL.Image.open(...)

    outputs = llm.generate({
        "prompt": prompt,
        "multi_modal_data": {
            "image": [image1, image2]
        },
    })

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)

A code example can be found in `examples/offline_inference_vision_language_multi_image.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language_multi_image.py>`_.
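
The prompt above contains one numbered placeholder per image. The following is a minimal sketch of a helper that generates such a prompt for any number of images, assuming the Phi-3.5-vision format shown above. The function name ``build_phi3v_prompt`` is hypothetical and not part of vLLM:

.. code-block:: python

    def build_phi3v_prompt(question: str, num_images: int) -> str:
        """Build a Phi-3.5-vision style prompt with one placeholder per image."""
        if num_images < 1:
            raise ValueError("num_images must be at least 1")
        # Placeholders are numbered from 1: <|image_1|>, <|image_2|>, ...
        placeholders = "".join(
            f"<|image_{i}|>\n" for i in range(1, num_images + 1)
        )
        return f"<|user|>\n{placeholders}{question}<|end|>\n<|assistant|>\n"


    # Reproduces the two-image prompt used in the example above.
    prompt = build_phi3v_prompt("What is the content of each image?", 2)
    print(repr(prompt))

Keep ``num_images`` within the ``limit_mm_per_prompt`` value passed to ``LLM``; other models use different placeholder formats, so check the model's HuggingFace repo before reusing this pattern.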

Multi-image input can be extended to perform video captioning. We show this with `Qwen2-VL <https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct>`_ as it supports videos:

.. code-block:: python

    import base64
    from io import BytesIO

    from vllm import LLM

    def encode_image(image):
        # Assumed helper (not defined in the original example):
        # encodes a PIL.Image.Image as a base64 JPEG string.
        buffer = BytesIO()
        image.convert("RGB").save(buffer, format="JPEG")
        return base64.b64encode(buffer.getvalue()).decode("utf-8")

    # Specify the maximum number of frames per video to be 4. This can be changed.
    llm = LLM("Qwen/Qwen2-VL-2B-Instruct", limit_mm_per_prompt={"image": 4})

    # Create the request payload.
    video_frames = ...  # Load your video, making sure it has no more frames than the limit specified above.
    message = {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this set of frames. Consider the frames to be a part of the same video."},
        ],
    }
    for frame in video_frames:
        base64_image = encode_image(frame)  # base64 encoding
        new_image = {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
        message["content"].append(new_image)

    # Perform inference and log output.
    outputs = llm.chat([message])

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)

Video
^^^^^

You can pass a list of NumPy arrays directly to the :code:`'video'` field of the multi-modal dictionary
instead of using multi-image input.

Please refer to `examples/offline_inference_vision_language.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language.py>`_ for more details.
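
As a rough illustration of the input shape, the sketch below picks evenly spaced frames from a longer clip and places them in the ``'video'`` field. The helper names ``sample_frame_indices`` and ``build_video_input`` are hypothetical, the prompt string is a placeholder for your model's own format, and plain strings stand in for the decoded NumPy frames so that the sketch runs without a video decoder:

.. code-block:: python

    def sample_frame_indices(total_frames: int, num_frames: int) -> list[int]:
        """Pick up to ``num_frames`` evenly spaced indices from ``total_frames``."""
        if total_frames <= 0 or num_frames <= 0:
            raise ValueError("total_frames and num_frames must be positive")
        if num_frames >= total_frames:
            return list(range(total_frames))
        step = total_frames / num_frames
        return [int(i * step) for i in range(num_frames)]


    def build_video_input(prompt: str, frames: list, num_frames: int) -> dict:
        """Build a prompt dict whose 'video' field holds the sampled frames."""
        indices = sample_frame_indices(len(frames), num_frames)
        return {
            "prompt": prompt,
            "multi_modal_data": {"video": [frames[i] for i in indices]},
        }


    # Stand-in frames; in practice each item would be a NumPy array of pixels.
    frames = [f"frame-{i}" for i in range(100)]
    request = build_video_input("<model-specific video prompt>", frames, num_frames=4)
    print(request["multi_modal_data"]["video"])

The resulting dictionary can then be passed to ``llm.generate`` in the same way as the image examples.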

Audio
^^^^^

You can pass a tuple :code:`(array, sampling_rate)` to the :code:`'audio'` field of the multi-modal dictionary.

Please refer to `examples/offline_inference_audio_language.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_audio_language.py>`_ for more details.
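
The sketch below shows the shape of such an input using a synthetic tone. The helper names ``make_tone`` and ``build_audio_input`` are hypothetical, the prompt string is a placeholder for your model's own format, and the samples are kept as a plain Python list so that the sketch has no dependencies; in practice you would pass a NumPy array of samples loaded from an audio file:

.. code-block:: python

    import math


    def make_tone(frequency_hz: float, duration_s: float, sampling_rate: int) -> list[float]:
        """Generate a sine wave as a list of float samples in [-1, 1]."""
        num_samples = int(duration_s * sampling_rate)
        return [
            math.sin(2 * math.pi * frequency_hz * n / sampling_rate)
            for n in range(num_samples)
        ]


    def build_audio_input(prompt: str, samples, sampling_rate: int) -> dict:
        """Build a prompt dict whose 'audio' field is (samples, sampling_rate)."""
        if sampling_rate <= 0:
            raise ValueError("sampling_rate must be positive")
        return {
            "prompt": prompt,
            "multi_modal_data": {"audio": (samples, sampling_rate)},
        }


    samples = make_tone(frequency_hz=440.0, duration_s=0.5, sampling_rate=16000)
    request = build_audio_input("<model-specific audio prompt>", samples, 16000)
    audio, rate = request["multi_modal_data"]["audio"]
    print(len(audio), rate)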

Embedding
^^^^^^^^^

To input pre-computed embeddings belonging to a data type (i.e. image, video, or audio) directly to the language model,
pass a tensor of shape :code:`(num_items, feature_size, hidden_size of LM)` to the corresponding field of the multi-modal dictionary.

.. code-block:: python

    import torch

    from vllm import LLM

    # Inference with image embeddings as input
    llm = LLM(model="llava-hf/llava-1.5-7b-hf")

    # Refer to the HuggingFace repo for the correct format to use
    prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"

    # Embeddings for single image
    # torch.Tensor of shape (1, image_feature_size, hidden_size of LM)
    image_embeds = torch.load(...)

    outputs = llm.generate({
        "prompt": prompt,
        "multi_modal_data": {"image": image_embeds},
    })

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)

For Qwen2-VL and MiniCPM-V, we accept additional parameters alongside the embeddings:

.. code-block:: python

    # Construct the prompt based on your model
    prompt = ...

    # Embeddings for multiple images
    # torch.Tensor of shape (num_images, image_feature_size, hidden_size of LM)
    image_embeds = torch.load(...)

    # Qwen2-VL
    llm = LLM("Qwen/Qwen2-VL-2B-Instruct", limit_mm_per_prompt={"image": 4})
    mm_data = {
        "image": {
            "image_embeds": image_embeds,
            # image_grid_thw is needed to calculate positional encoding.
            "image_grid_thw": torch.load(...),  # torch.Tensor of shape (1, 3)
        }
    }

    # MiniCPM-V
    llm = LLM("openbmb/MiniCPM-V-2_6", trust_remote_code=True, limit_mm_per_prompt={"image": 4})
    mm_data = {
        "image": {
            "image_embeds": image_embeds,
            # image_size_list is needed to calculate details of the sliced image.
            # `images` is assumed to be the list of PIL images the embeddings were computed from.
            "image_size_list": [image.size for image in images],  # list of image sizes
        }
    }

    outputs = llm.generate({
        "prompt": prompt,
        "multi_modal_data": mm_data,
    })

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)

Online Inference
----------------

Our OpenAI-compatible server accepts multi-modal data via the `Chat Completions API <https://platform.openai.com/docs/api-reference/chat>`_.

.. important::
    A chat template is **required** to use the Chat Completions API.

    Although most models come with a chat template, for others you have to define one yourself.
    The chat template can be inferred based on the documentation on the model's HuggingFace repo.
    For example, LLaVA-1.5 (``llava-hf/llava-1.5-7b-hf``) requires a chat template that can be found `here <https://github.com/vllm-project/vllm/blob/main/examples/template_llava.jinja>`__.

Image
^^^^^

Image input is supported according to the `OpenAI Vision API <https://platform.openai.com/docs/guides/vision>`_.
Here is a simple example using Phi-3.5-Vision.

First, launch the OpenAI-compatible server:

.. code-block:: bash

    vllm serve microsoft/Phi-3.5-vision-instruct --task generate \
      --trust-remote-code --max-model-len 4096 --limit-mm-per-prompt image=2

Then, you can use the OpenAI client as follows:

.. code-block:: python

    from openai import OpenAI

    openai_api_key = "EMPTY"
    openai_api_base = "http://localhost:8000/v1"

    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )

    # Single-image input inference
    image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"

    chat_response = client.chat.completions.create(
        model="microsoft/Phi-3.5-vision-instruct",
        messages=[{
            "role": "user",
            "content": [
                # NOTE: The prompt formatting with the image token `<image>` is not needed
                # since the prompt will be processed automatically by the API server.
                {"type": "text", "text": "What’s in this image?"},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    print("Chat completion output:", chat_response.choices[0].message.content)

    # Multi-image input inference
    image_url_duck = "https://upload.wikimedia.org/wikipedia/commons/d/da/2015_Kaczka_krzy%C5%BCowka_w_wodzie_%28samiec%29.jpg"
    image_url_lion = "https://upload.wikimedia.org/wikipedia/commons/7/77/002_The_lion_king_Snyggve_in_the_Serengeti_National_Park_Photo_by_Giles_Laurent.jpg"

    chat_response = client.chat.completions.create(
        model="microsoft/Phi-3.5-vision-instruct",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "What are the animals in these images?"},
                {"type": "image_url", "image_url": {"url": image_url_duck}},
                {"type": "image_url", "image_url": {"url": image_url_lion}},
            ],
        }],
    )
    print("Chat completion output:", chat_response.choices[0].message.content)

A full code example can be found in `examples/openai_chat_completion_client_for_multimodal.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_client_for_multimodal.py>`_.

.. tip::
    Loading from local file paths is also supported: specify the allowed local media path via ``--allowed-local-media-path`` when launching the API server/engine,
    and pass the file path as ``url`` in the API request.
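
As a minimal sketch of the request side, the helper below turns an absolute file path into an ``image_url`` content part. It assumes the server was launched with ``--allowed-local-media-path`` covering the file's directory and that it accepts ``file://`` URLs for such files; the helper name ``local_image_part`` and the example path are hypothetical:

.. code-block:: python

    from pathlib import Path


    def local_image_part(path: str) -> dict:
        """Build an ``image_url`` content part that points at a local file."""
        file_path = Path(path)
        if not file_path.is_absolute():
            raise ValueError("an absolute path is required to build a file:// URL")
        # as_uri() percent-encodes special characters and adds the file:// scheme.
        return {"type": "image_url", "image_url": {"url": file_path.as_uri()}}


    part = local_image_part("/data/images/duck.jpg")
    print(part["image_url"]["url"])

The returned dictionary can be placed in the ``content`` list of a chat message exactly like the HTTP ``image_url`` entries in the client example.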

.. tip::
    There is no need to place image placeholders in the text content of the API request - they are already represented by the image content.
    In fact, you can place image placeholders in the middle of the text by interleaving text and image content.
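
A minimal sketch of such interleaving follows. The helper name ``build_interleaved_content`` is hypothetical and the ``example.com`` URLs are placeholders; the content parts use the same ``text`` and ``image_url`` shapes as the client example:

.. code-block:: python

    def build_interleaved_content(parts: list[tuple[str, str]]) -> list[dict]:
        """Convert ("text", str) and ("image", url) pairs into content parts, in order."""
        content = []
        for kind, value in parts:
            if kind == "text":
                content.append({"type": "text", "text": value})
            elif kind == "image":
                content.append({"type": "image_url", "image_url": {"url": value}})
            else:
                raise ValueError(f"unsupported part kind: {kind!r}")
        return content


    messages = [{
        "role": "user",
        "content": build_interleaved_content([
            ("text", "Compare this image"),
            ("image", "https://example.com/duck.jpg"),  # placeholder URL
            ("text", "with this one"),
            ("image", "https://example.com/lion.jpg"),  # placeholder URL
            ("text", "and list the differences."),
        ]),
    }]
    print([part["type"] for part in messages[0]["content"]])

The order of the list determines where each image appears relative to the surrounding text, so no placeholder tokens are written by hand.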

.. note::

    By default, the timeout for fetching images through HTTP URL is ``5`` seconds.
    You can override this by setting the environment variable:

    .. code-block:: console

        $ export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>

Video
^^^^^

Instead of :code:`image_url`, you can pass a video file via :code:`video_url`.

You can use `these tests <https://github.com/vllm-project/vllm/blob/main/tests/entrypoints/openai/test_video.py>`_ as reference.
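
The sketch below builds a chat message with a ``video_url`` part. It assumes the part mirrors the ``image_url`` shape, with the URL nested under a key named after the type; the helper name ``media_part`` and the ``example.com`` URL are hypothetical:

.. code-block:: python

    def media_part(modality: str, url: str) -> dict:
        """Build an ``image_url``, ``video_url``, or ``audio_url`` content part."""
        if modality not in ("image", "video", "audio"):
            raise ValueError(f"unsupported modality: {modality!r}")
        key = f"{modality}_url"
        return {"type": key, key: {"url": url}}


    messages = [{
        "role": "user",
        "content": [
            {"type": "text", "text": "What happens in this video?"},
            media_part("video", "https://example.com/sample.mp4"),  # placeholder URL
        ],
    }]
    print(messages[0]["content"][1])

The ``messages`` list can then be passed to ``client.chat.completions.create`` with a model that supports video input.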

.. note::

    By default, the timeout for fetching videos through HTTP URL is ``30`` seconds.
    You can override this by setting the environment variable:

    .. code-block:: console

        $ export VLLM_VIDEO_FETCH_TIMEOUT=<timeout>

Audio
^^^^^

Instead of :code:`image_url`, you can pass an audio file via :code:`audio_url`.

A full code example can be found in `examples/openai_chat_completion_client_for_multimodal.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_client_for_multimodal.py>`_.
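
When the audio is not reachable over HTTP, one option is to embed it in the request as a base64 data URL, in the same way the offline Qwen2-VL example embeds JPEG frames. The sketch below assumes the server accepts data URLs in ``audio_url``; the helper name ``to_data_url`` is hypothetical, and the four bytes stand in for the contents of a real audio file:

.. code-block:: python

    import base64


    def to_data_url(raw_bytes: bytes, mime_type: str) -> str:
        """Encode raw file bytes as a ``data:<mime>;base64,<payload>`` URL."""
        encoded = base64.b64encode(raw_bytes).decode("utf-8")
        return f"data:{mime_type};base64,{encoded}"


    # Stand-in bytes; in practice, read them from an audio file opened in binary mode.
    fake_audio = b"\x00\x01\x02\x03"
    audio_part = {
        "type": "audio_url",
        "audio_url": {"url": to_data_url(fake_audio, "audio/wav")},
    }
    print(audio_part["audio_url"]["url"])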

.. note::

    By default, the timeout for fetching audio through HTTP URL is ``10`` seconds.
    You can override this by setting the environment variable:

    .. code-block:: console

        $ export VLLM_AUDIO_FETCH_TIMEOUT=<timeout>

Embedding
^^^^^^^^^

vLLM's Embeddings API is a superset of OpenAI's `Embeddings API <https://platform.openai.com/docs/api-reference/embeddings>`_,
where a list of chat ``messages`` can be passed instead of batched ``inputs``. This enables multi-modal inputs to be passed to embedding models.

.. tip::
    The schema of ``messages`` is exactly the same as in Chat Completions API.
    You can refer to the above tutorials for more details on how to pass each type of multi-modal data.

Usually, embedding models do not expect chat-based input, so we need to use a custom chat template to format the text and images.
Refer to the examples below for illustration.

Here is an end-to-end example using VLM2Vec. To serve the model:

.. code-block:: bash

    vllm serve TIGER-Lab/VLM2Vec-Full --task embed \
      --trust-remote-code --max-model-len 4096 --chat-template examples/template_vlm2vec.jinja

.. important::

    Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass ``--task embed``
    to run this model in embedding mode instead of text generation mode.

    The custom chat template is completely different from the original one for this model,
    and can be found `here <https://github.com/vllm-project/vllm/blob/main/examples/template_vlm2vec.jinja>`__.

Since the request schema is not defined by the OpenAI client, we post a request to the server using the lower-level ``requests`` library:

.. code-block:: python

    import requests

    image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"

    response = requests.post(
        "http://localhost:8000/v1/embeddings",
        json={
            "model": "TIGER-Lab/VLM2Vec-Full",
            "messages": [{
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": "Represent the given image."},
                ],
            }],
            "encoding_format": "float",
        },
    )
    response.raise_for_status()
    response_json = response.json()
    print("Embedding output:", response_json["data"][0]["embedding"])

Below is another example, this time using the ``MrLight/dse-qwen2-2b-mrl-v1`` model.

.. code-block:: bash

    vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \
      --trust-remote-code --max-model-len 8192 --chat-template examples/template_dse_qwen2_vl.jinja

.. important::

    As with VLM2Vec, we have to explicitly pass ``--task embed``.

    Additionally, ``MrLight/dse-qwen2-2b-mrl-v1`` requires an EOS token for embeddings, which is handled
    by `this custom chat template <https://github.com/vllm-project/vllm/blob/main/examples/template_dse_qwen2_vl.jinja>`__.

.. important::

    ``MrLight/dse-qwen2-2b-mrl-v1`` also requires a placeholder image of the minimum image size for text query embeddings.
    See the full code example below for details.

A full code example can be found in `examples/openai_chat_embedding_client_for_multimodal.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_embedding_client_for_multimodal.py>`_.
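
The request body used with the Embeddings API in this section has a fixed shape, so it can be assembled by a small helper. The following is a minimal sketch under that assumption; the function name ``build_embedding_request`` is hypothetical, and it does not add the placeholder image that ``MrLight/dse-qwen2-2b-mrl-v1`` needs for text-only queries:

.. code-block:: python

    from typing import Optional


    def build_embedding_request(model: str, text: str, image_url: Optional[str] = None) -> dict:
        """Build the JSON body for a chat-style request to ``/v1/embeddings``."""
        content = []
        if image_url is not None:
            # The image part comes first, matching the VLM2Vec example above.
            content.append({"type": "image_url", "image_url": {"url": image_url}})
        content.append({"type": "text", "text": text})
        return {
            "model": model,
            "messages": [{"role": "user", "content": content}],
            "encoding_format": "float",
        }


    payload = build_embedding_request(
        "TIGER-Lab/VLM2Vec-Full",
        "Represent the given image.",
        image_url="https://example.com/boardwalk.jpg",  # placeholder URL
    )
    print(payload["messages"][0]["content"])

The payload can be sent with ``requests.post("http://localhost:8000/v1/embeddings", json=payload)``, as in the VLM2Vec example.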