multimodal_inputs.rst 17.6 KB
Newer Older
1
.. _multimodal_inputs:
2

3
4
Multimodal Inputs
=================
5

6
This page teaches you how to pass multi-modal inputs to :ref:`multi-modal models <supported_mm_models>` in vLLM.
7

8
.. note::
9
    We are actively iterating on multi-modal support. See `this RFC <https://github.com/vllm-project/vllm/issues/4194>`_ for upcoming changes,
10
    and `open an issue on GitHub <https://github.com/vllm-project/vllm/issues/new/choose>`_ if you have any feedback or feature requests.
11

12
13
14
Offline Inference
-----------------

15
To input multi-modal data, follow this schema in :class:`vllm.inputs.PromptType`:
16

17
* ``prompt``: The prompt should follow the format that is documented on HuggingFace.
18
* ``multi_modal_data``: This is a dictionary that follows the schema defined in :class:`vllm.multimodal.MultiModalDataDict`.
19

20
21
22
23
24
Image
^^^^^

You can pass a single image to the :code:`'image'` field of the multi-modal dictionary, as shown in the following examples:

25
26
.. code-block:: python

27
28
    llm = LLM(model="llava-hf/llava-1.5-7b-hf")

29
30
    # Refer to the HuggingFace repo for the correct format to use
    prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"
31
32

    # Load the image using PIL.Image
33
    image = PIL.Image.open(...)
34

35
    # Single prompt inference
36
37
    outputs = llm.generate({
        "prompt": prompt,
38
        "multi_modal_data": {"image": image},
39
40
    })

41
42
43
44
    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)

45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
    # Batch inference
    image_1 = PIL.Image.open(...)
    image_2 = PIL.Image.open(...)
    outputs = llm.generate(
        [
            {
                "prompt": "USER: <image>\nWhat is the content of this image?\nASSISTANT:",
                "multi_modal_data": {"image": image_1},
            },
            {
                "prompt": "USER: <image>\nWhat's the color of this image?\nASSISTANT:",
                "multi_modal_data": {"image": image_2},
            }
        ]
    )

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)
64

65
A code example can be found in `examples/offline_inference_vision_language.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language.py>`_.
66

67
To substitute multiple images inside the same text prompt, you can pass in a list of images instead:
68

69
.. code-block:: python
70

71
72
73
74
75
76
    llm = LLM(
        model="microsoft/Phi-3.5-vision-instruct",
        trust_remote_code=True,  # Required to load Phi-3.5-vision
        max_model_len=4096,  # Otherwise, it may not fit in smaller GPUs
        limit_mm_per_prompt={"image": 2},  # The maximum number to accept
    )
77

78
    # Refer to the HuggingFace repo for the correct format to use
79
    prompt = "<|user|>\n<|image_1|>\n<|image_2|>\nWhat is the content of each image?<|end|>\n<|assistant|>\n"
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97

    # Load the images using PIL.Image
    image1 = PIL.Image.open(...)
    image2 = PIL.Image.open(...)

    outputs = llm.generate({
        "prompt": prompt,
        "multi_modal_data": {
            "image": [image1, image2]
        },
    })

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)

A code example can be found in `examples/offline_inference_vision_language_multi_image.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language_multi_image.py>`_.

98
Multi-image input can be extended to perform video captioning. We show this with `Qwen2-VL <https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct>`_ as it supports videos:
99
100
101

.. code-block:: python

102
    # Specify the maximum number of frames per video to be 4. This can be changed.
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
    llm = LLM("Qwen/Qwen2-VL-2B-Instruct", limit_mm_per_prompt={"image": 4})

    # Create the request payload.
    video_frames = ... # load your video making sure it only has the number of frames specified earlier.
    message = {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this set of frames. Consider the frames to be a part of the same video."},
        ],
    }
    for i in range(len(video_frames)):
        base64_image = encode_image(video_frames[i]) # base64 encoding.
        new_image = {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
        message["content"].append(new_image)

    # Perform inference and log output.
    outputs = llm.chat([message])
120

121
122
123
124
    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)

125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
Video
^^^^^

You can pass a list of NumPy arrays directly to the :code:`'video'` field of the multi-modal dictionary
instead of using multi-image input.

Please refer to `examples/offline_inference_vision_language.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language.py>`_ for more details.

Audio
^^^^^

You can pass a tuple :code:`(array, sampling_rate)` to the :code:`'audio'` field of the multi-modal dictionary.

Please refer to `examples/offline_inference_audio_language.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_audio_language.py>`_ for more details.

Embedding
^^^^^^^^^

To input pre-computed embeddings belonging to a data type (i.e. image, video, or audio) directly to the language model,
pass a tensor of shape :code:`(num_items, feature_size, hidden_size of LM)` to the corresponding field of the multi-modal dictionary.

.. code-block:: python

    # Inference with image embeddings as input
    llm = LLM(model="llava-hf/llava-1.5-7b-hf")

    # Refer to the HuggingFace repo for the correct format to use
    prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"

    # Embeddings for single image
    # torch.Tensor of shape (1, image_feature_size, hidden_size of LM)
    image_embeds = torch.load(...)

    outputs = llm.generate({
        "prompt": prompt,
        "multi_modal_data": {"image": image_embeds},
    })

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)

For Qwen2-VL and MiniCPM-V, we accept additional parameters alongside the embeddings:

.. code-block:: python

    # Construct the prompt based on your model
    prompt = ...

    # Embeddings for multiple images
    # torch.Tensor of shape (num_images, image_feature_size, hidden_size of LM)
    image_embeds = torch.load(...)

    # Qwen2-VL
    llm = LLM("Qwen/Qwen2-VL-2B-Instruct", limit_mm_per_prompt={"image": 4})
    mm_data = {
        "image": {
            "image_embeds": image_embeds,
            # image_grid_thw is needed to calculate positional encoding.
            "image_grid_thw": torch.load(...),  # torch.Tensor of shape (1, 3),
        }
    }

    # MiniCPM-V
    llm = LLM("openbmb/MiniCPM-V-2_6", trust_remote_code=True, limit_mm_per_prompt={"image": 4})
    mm_data = {
        "image": {
            "image_embeds": image_embeds,
            # image_size_list is needed to calculate details of the sliced image.
            "image_size_list": [image.size for image in images],  # list of image sizes
        }
    }

    outputs = llm.generate({
        "prompt": prompt,
        "multi_modal_data": mm_data,
    })

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)

207
208
209
Online Inference
----------------

210
211
212
213
214
215
216
217
218
219
220
Our OpenAI-compatible server accepts multi-modal data via the `Chat Completions API <https://platform.openai.com/docs/api-reference/chat>`_.

.. important::
    A chat template is **required** to use Chat Completions API.

    Although most models come with a chat template, for others you have to define one yourself.
    The chat template can be inferred based on the documentation on the model's HuggingFace repo.
    For example, LLaVA-1.5 (``llava-hf/llava-1.5-7b-hf``) requires a chat template that can be found `here <https://github.com/vllm-project/vllm/blob/main/examples/template_llava.jinja>`__.

Image
^^^^^
221

222
223
Image input is supported according to `OpenAI Vision API <https://platform.openai.com/docs/guides/vision>`_.
Here is a simple example using Phi-3.5-Vision.
224

225
First, launch the OpenAI-compatible server:
226
227
228

.. code-block:: bash

229
230
    vllm serve microsoft/Phi-3.5-vision-instruct --task generate \
      --trust-remote-code --max-model-len 4096 --limit-mm-per-prompt image=2
231

232
Then, you can use the OpenAI client as follows:
233
234
235
236

.. code-block:: python

    from openai import OpenAI
237

238
239
    openai_api_key = "EMPTY"
    openai_api_base = "http://localhost:8000/v1"
240

241
242
243
244
    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )
245
246
247
248

    # Single-image input inference
    image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"

249
    chat_response = client.chat.completions.create(
250
        model="microsoft/Phi-3.5-vision-instruct",
251
252
253
        messages=[{
            "role": "user",
            "content": [
254
255
                # NOTE: The prompt formatting with the image token `<image>` is not needed
                # since the prompt will be processed automatically by the API server.
256
257
                {"type": "text", "text": "What’s in this image?"},
                {"type": "image_url", "image_url": {"url": image_url}},
258
259
260
            ],
        }],
    )
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
    print("Chat completion output:", chat_response.choices[0].message.content)

    # Multi-image input inference
    image_url_duck = "https://upload.wikimedia.org/wikipedia/commons/d/da/2015_Kaczka_krzy%C5%BCowka_w_wodzie_%28samiec%29.jpg"
    image_url_lion = "https://upload.wikimedia.org/wikipedia/commons/7/77/002_The_lion_king_Snyggve_in_the_Serengeti_National_Park_Photo_by_Giles_Laurent.jpg"

    chat_response = client.chat.completions.create(
        model="microsoft/Phi-3.5-vision-instruct",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "What are the animals in these images?"},
                {"type": "image_url", "image_url": {"url": image_url_duck}},
                {"type": "image_url", "image_url": {"url": image_url_lion}},
            ],
        }],
    )
    print("Chat completion output:", chat_response.choices[0].message.content)

280
A full code example can be found in `examples/openai_chat_completion_client_for_multimodal.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_client_for_multimodal.py>`_.
281

282
283
284
285
.. tip::
    Loading from local file paths is also supported on vLLM: You can specify the allowed local media path via ``--allowed-local-media-path`` when launching the API server/engine,
    and pass the file path as ``url`` in the API request.

286
287
288
289
.. tip::
    There is no need to place image placeholders in the text content of the API request - they are already represented by the image content.
    In fact, you can place image placeholders in the middle of the text by interleaving text and image content.

290
291
.. note::

292
293
    By default, the timeout for fetching images through HTTP URL is ``5`` seconds.
    You can override this by setting the environment variable:
294

295
    .. code-block:: console
296

297
        $ export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>
298

299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
Video
^^^^^

Instead of :code:`image_url`, you can pass a video file via :code:`video_url`.

You can use `these tests <https://github.com/vllm-project/vllm/blob/main/tests/entrypoints/openai/test_video.py>`_ as reference.

.. note::

    By default, the timeout for fetching videos through HTTP URL url is ``30`` seconds.
    You can override this by setting the environment variable:

    .. code-block:: console

        $ export VLLM_VIDEO_FETCH_TIMEOUT=<timeout>
314

315
316
317
Audio
^^^^^

318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
Audio input is supported according to `OpenAI Audio API <https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in>`_.
Here is a simple example using Ultravox-v0.3.

First, launch the OpenAI-compatible server:

.. code-block:: bash

    vllm serve fixie-ai/ultravox-v0_3
    
Then, you can use the OpenAI client as follows:

.. code-block:: python

    import base64
    import requests
    from openai import OpenAI
    from vllm.assets.audio import AudioAsset

    def encode_base64_content_from_url(content_url: str) -> str:
        """Encode a content retrieved from a remote url to base64 format."""

        with requests.get(content_url) as response:
            response.raise_for_status()
            result = base64.b64encode(response.content).decode('utf-8')

        return result

    openai_api_key = "EMPTY"
    openai_api_base = "http://localhost:8000/v1"

    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )

    # Any format supported by librosa is supported
    audio_url = AudioAsset("winning_call").url
    audio_base64 = encode_base64_content_from_url(audio_url)

    chat_completion_from_base64 = client.chat.completions.create(
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What's in this audio?"
                },
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": audio_base64,
                        "format": "wav"
                    },
                },
            ],
        }],
        model=model,
        max_completion_tokens=64,
    )

    result = chat_completion_from_base64.choices[0].message.content
    print("Chat completion output from input audio:", result)

Alternatively, you can pass :code:`audio_url`, which is the audio counterpart of :code:`image_url` for image input:

.. code-block:: python

    chat_completion_from_url = client.chat.completions.create(
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What's in this audio?"
                },
                {
                    "type": "audio_url",
                    "audio_url": {
                        "url": audio_url
                    },
                },
            ],
        }],
        model=model,
        max_completion_tokens=64,
    )

    result = chat_completion_from_url.choices[0].message.content
    print("Chat completion output from audio url:", result)
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423

A full code example can be found in `examples/openai_chat_completion_client_for_multimodal.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_client_for_multimodal.py>`_.

.. note::

    By default, the timeout for fetching audios through HTTP URL is ``10`` seconds.
    You can override this by setting the environment variable:

    .. code-block:: console

        $ export VLLM_AUDIO_FETCH_TIMEOUT=<timeout>

Embedding
^^^^^^^^^

vLLM's Embeddings API is a superset of OpenAI's `Embeddings API <https://platform.openai.com/docs/api-reference/embeddings>`_,
where a list of chat ``messages`` can be passed instead of batched ``inputs``. This enables multi-modal inputs to be passed to embedding models.
424
425
426

.. tip::
    The schema of ``messages`` is exactly the same as in Chat Completions API.
427
    You can refer to the above tutorials for more details on how to pass each type of multi-modal data.
428

429
430
431
432
Usually, embedding models do not expect chat-based input, so we need to use a custom chat template to format the text and images.
Refer to the examples below for illustration.

Here is an end-to-end example using VLM2Vec. To serve the model:
433
434
435

.. code-block:: bash

436
    vllm serve TIGER-Lab/VLM2Vec-Full --task embed \
437
      --trust-remote-code --max-model-len 4096 --chat-template examples/template_vlm2vec.jinja
438
439
440

.. important::

441
    Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass ``--task embed``
442
443
    to run this model in embedding mode instead of text generation mode.

444
445
    The custom chat template is completely different from the original one for this model,
    and can be found `here <https://github.com/vllm-project/vllm/blob/main/examples/template_vlm2vec.jinja>`__.
446
447

Since the request schema is not defined by OpenAI client, we post a request to the server using the lower-level ``requests`` library:
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471

.. code-block:: python

    import requests

    image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"

    response = requests.post(
        "http://localhost:8000/v1/embeddings",
        json={
            "model": "TIGER-Lab/VLM2Vec-Full",
            "messages": [{
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": "Represent the given image."},
                ],
            }],
            "encoding_format": "float",
        },
    )
    response.raise_for_status()
    response_json = response.json()
    print("Embedding output:", response_json["data"][0]["embedding"])
472

473
Below is another example, this time using the ``MrLight/dse-qwen2-2b-mrl-v1`` model.
474
475
476

.. code-block:: bash

477
    vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \
478
479
480
481
      --trust-remote-code --max-model-len 8192 --chat-template examples/template_dse_qwen2_vl.jinja

.. important::

482
    Like with VLM2Vec, we have to explicitly pass ``--task embed``.
483
484
485
    
    Additionally, ``MrLight/dse-qwen2-2b-mrl-v1`` requires an EOS token for embeddings, which is handled
    by `this custom chat template <https://github.com/vllm-project/vllm/blob/main/examples/template_dse_qwen2_vl.jinja>`__.
486
487
488
489
490
491

.. important::

    Also important, ``MrLight/dse-qwen2-2b-mrl-v1`` requires a placeholder image of the minimum image size for text query embeddings. See the full code 
    example below for details.

492
A full code example can be found in `examples/openai_chat_embedding_client_for_multimodal.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_embedding_client_for_multimodal.py>`_.