.. _vlm:

Using VLMs
==========

vLLM provides experimental support for Vision Language Models (VLMs). See the :ref:`list of supported VLMs here <supported_vlms>`.
This document shows you how to run and serve these models using vLLM.

.. note::
    We are actively iterating on VLM support. See `this RFC <https://github.com/vllm-project/vllm/issues/4194>`_ for upcoming changes,
    and `open an issue on GitHub <https://github.com/vllm-project/vllm/issues/new/choose>`_ if you have any feedback or feature requests.

Offline Inference
-----------------

Single-image input
^^^^^^^^^^^^^^^^^^

The :class:`~vllm.LLM` class can be instantiated in much the same way as for language-only models.

.. code-block:: python

    from vllm import LLM

    llm = LLM(model="llava-hf/llava-1.5-7b-hf")

To pass an image to the model, note the following fields in :class:`vllm.inputs.PromptType`:

* ``prompt``: The prompt should follow the format documented in the model's HuggingFace repo.
* ``multi_modal_data``: A dictionary that follows the schema defined in :class:`vllm.multimodal.MultiModalDataDict`.

.. code-block:: python

    import PIL.Image
    import torch  # used for the image-embedding examples further down

    # Refer to the HuggingFace repo for the correct format to use
    prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"

    # Load the image using PIL.Image
    image = PIL.Image.open(...)

    # Single prompt inference
    outputs = llm.generate({
        "prompt": prompt,
        "multi_modal_data": {"image": image},
    })

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)

    # Inference with image embeddings as input
    image_embeds = torch.load(...)  # torch.Tensor of shape (1, image_feature_size, hidden_size of LM)
    outputs = llm.generate({
        "prompt": prompt,
        "multi_modal_data": {"image": image_embeds},
    })

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)

    # Inference with image embeddings as input with additional parameters
    # This input format, which takes additional parameters, is currently being trialed with Qwen2VL and MiniCPM-V.
    # The two mm_data['image'] assignments below are alternatives: keep only the one
    # that matches your model (as written, the second assignment replaces the first).
    mm_data = {}

    image_embeds = torch.load(...)  # torch.Tensor of shape (num_images, image_feature_size, hidden_size of LM)
    # For Qwen2VL, image_grid_thw is needed to calculate positional encoding.
    mm_data['image'] = {
        "image_embeds": image_embeds,
        "image_grid_thw": torch.load(...),  # torch.Tensor of shape (1, 3)
    }
    # For MiniCPM-V, image_size_list is needed to calculate details of the sliced image.
    mm_data['image'] = {
        "image_embeds": image_embeds,
        "image_size_list": [image.size],  # list of image sizes
    }
    outputs = llm.generate({
        "prompt": prompt,
        "multi_modal_data": mm_data,
    })

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)

    # Batch inference
    image_1 = PIL.Image.open(...)
    image_2 = PIL.Image.open(...)
    outputs = llm.generate(
        [
            {
                "prompt": "USER: <image>\nWhat is the content of this image?\nASSISTANT:",
                "multi_modal_data": {"image": image_1},
            },
            {
                "prompt": "USER: <image>\nWhat's the color of this image?\nASSISTANT:",
                "multi_modal_data": {"image": image_2},
            }
        ]
    )

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)

A code example can be found in `examples/offline_inference_vision_language.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language.py>`_.
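
The request dictionaries above all share one shape, so a small helper can keep the prompt format and the ``multi_modal_data`` key in one place. The following is a minimal sketch under stated assumptions: ``build_llava_request`` and ``build_llava_batch`` are hypothetical names that are not part of vLLM, and the template is the LLaVA-1.5 prompt format used in this section.

.. code-block:: python

    from typing import Any, Dict, List

    # Prompt template for LLaVA-1.5, copied from the examples in this section.
    LLAVA_TEMPLATE = "USER: <image>\n{question}\nASSISTANT:"


    def build_llava_request(question: str, image: Any) -> Dict[str, Any]:
        """Return a prompt dict in the shape passed to ``llm.generate`` above.

        Hypothetical helper for illustration. ``image`` is expected to be a
        ``PIL.Image.Image`` but is passed through without inspection.
        """
        return {
            "prompt": LLAVA_TEMPLATE.format(question=question),
            "multi_modal_data": {"image": image},
        }


    def build_llava_batch(questions: List[str], images: List[Any]) -> List[Dict[str, Any]]:
        """Pair each question with its image for batch inference."""
        if len(questions) != len(images):
            raise ValueError("Expected exactly one image per question")
        return [build_llava_request(q, img) for q, img in zip(questions, images)]

With the ``llm`` instance created earlier, ``llm.generate(build_llava_batch(questions, images))`` would then submit the whole batch in one call.
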

Multi-image input
^^^^^^^^^^^^^^^^^

Multi-image input is only supported for a subset of VLMs, as shown :ref:`here <supported_vlms>`.

To allow more than one multi-modal item per text prompt, set ``limit_mm_per_prompt`` when constructing the :class:`~vllm.LLM` instance.

.. code-block:: python

    llm = LLM(
        model="microsoft/Phi-3.5-vision-instruct",
        trust_remote_code=True,  # Required to load Phi-3.5-vision
        max_model_len=4096,  # Otherwise, it may not fit in smaller GPUs
        limit_mm_per_prompt={"image": 2},  # Maximum number of images accepted per prompt
    )

Instead of passing in a single image, you can pass in a list of images.

.. code-block:: python

    # Refer to the HuggingFace repo for the correct format to use
    prompt = "<|user|>\n<|image_1|>\n<|image_2|>\nWhat is the content of each image?<|end|>\n<|assistant|>\n"

    # Load the images using PIL.Image
    image1 = PIL.Image.open(...)
    image2 = PIL.Image.open(...)

    outputs = llm.generate({
        "prompt": prompt,
        "multi_modal_data": {
            "image": [image1, image2]
        },
    })

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)

A code example can be found in `examples/offline_inference_vision_language_multi_image.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language_multi_image.py>`_.
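
The prompt above needs one numbered placeholder per image, which is easy to get wrong when the number of images varies. As a rough sketch, the placeholders can be generated from the image count. ``build_phi3v_prompt`` is a hypothetical helper, not a vLLM function, and it assumes the Phi-3.5-vision placeholder format shown in the prompt above.

.. code-block:: python

    def build_phi3v_prompt(question: str, num_images: int) -> str:
        """Build a Phi-3.5-vision prompt with one numbered placeholder per image.

        Hypothetical helper for illustration; the ``<|image_N|>`` format mirrors
        the hand-written prompt in this section.
        """
        if num_images < 1:
            raise ValueError("num_images must be at least 1")
        # Placeholders are 1-indexed and each sits on its own line.
        placeholders = "".join(f"<|image_{i}|>\n" for i in range(1, num_images + 1))
        return f"<|user|>\n{placeholders}{question}<|end|>\n<|assistant|>\n"

Calling ``build_phi3v_prompt("What is the content of each image?", 2)`` reproduces the two-image prompt used above. The number of images passed this way still has to stay within the ``limit_mm_per_prompt`` value given to :class:`~vllm.LLM`.
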

Multi-image input can be extended to perform video captioning. We show this with `Qwen2-VL <https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct>`_ as it supports videos:

.. code-block:: python

    # Specify the maximum number of frames per video to be 4. This can be changed.
    llm = LLM("Qwen/Qwen2-VL-2B-Instruct", limit_mm_per_prompt={"image": 4})

    # Create the request payload.
    video_frames = ...  # Load your video, making sure it has no more frames than the limit set above.
    message = {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this set of frames. Consider the frames to be a part of the same video."},
        ],
    }
    for frame in video_frames:
        # encode_image is a user-supplied helper that returns the frame as a base64-encoded JPEG string.
        base64_image = encode_image(frame)
        new_image = {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
        message["content"].append(new_image)

    # Perform inference and log output.
    outputs = llm.chat([message])

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)
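
The snippet above calls ``encode_image`` without defining it and leaves frame selection to the reader. One possible shape for both pieces is sketched below. It assumes that each frame is already available as JPEG-encoded bytes, which the original example does not state, and ``sample_frame_indices`` is a hypothetical helper for keeping the frame count within the configured limit.

.. code-block:: python

    import base64
    from typing import List


    def encode_image(jpeg_bytes: bytes) -> str:
        """Base64-encode raw JPEG bytes for use in a ``data:image/jpeg`` URL.

        Assumption: the frame has already been encoded as JPEG.
        """
        return base64.b64encode(jpeg_bytes).decode("utf-8")


    def sample_frame_indices(total_frames: int, max_frames: int) -> List[int]:
        """Pick up to ``max_frames`` evenly spaced frame indices."""
        if total_frames <= 0 or max_frames <= 0:
            return []
        if total_frames <= max_frames:
            # Short videos: keep every frame.
            return list(range(total_frames))
        step = total_frames / max_frames
        return [int(i * step) for i in range(max_frames)]

For a 10-frame clip and a limit of 4, ``sample_frame_indices(10, 4)`` selects frames 0, 2, 5 and 7.
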

Online Inference
----------------

OpenAI Vision API
^^^^^^^^^^^^^^^^^

You can serve vision language models with vLLM's HTTP server that is compatible with `OpenAI Vision API <https://platform.openai.com/docs/guides/vision>`_.

Below is an example of how to launch the same ``microsoft/Phi-3.5-vision-instruct`` model with vLLM's OpenAI-compatible API server.

.. code-block:: bash

    vllm serve microsoft/Phi-3.5-vision-instruct --task generate \
      --trust-remote-code --max-model-len 4096 --limit-mm-per-prompt image=2

.. important::
    Since the OpenAI Vision API is based on the `Chat Completions API <https://platform.openai.com/docs/api-reference/chat>`_,
    a chat template is **required** to launch the API server.

    Although Phi-3.5-Vision comes with a chat template, for other models you may have to provide one if the model's tokenizer does not come with it.
    The chat template can be inferred based on the documentation on the model's HuggingFace repo.
    For example, LLaVA-1.5 (``llava-hf/llava-1.5-7b-hf``) requires a chat template that can be found `here <https://github.com/vllm-project/vllm/blob/main/examples/template_llava.jinja>`_.

To consume the server, you can use the OpenAI client as in the example below:

.. code-block:: python

    from openai import OpenAI

    openai_api_key = "EMPTY"
    openai_api_base = "http://localhost:8000/v1"

    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )

    # Single-image input inference
    image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"

    chat_response = client.chat.completions.create(
        model="microsoft/Phi-3.5-vision-instruct",
        messages=[{
            "role": "user",
            "content": [
                # NOTE: The prompt formatting with the image token `<image>` is not needed
                # since the prompt will be processed automatically by the API server.
                {"type": "text", "text": "What’s in this image?"},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    print("Chat completion output:", chat_response.choices[0].message.content)

    # Multi-image input inference
    image_url_duck = "https://upload.wikimedia.org/wikipedia/commons/d/da/2015_Kaczka_krzy%C5%BCowka_w_wodzie_%28samiec%29.jpg"
    image_url_lion = "https://upload.wikimedia.org/wikipedia/commons/7/77/002_The_lion_king_Snyggve_in_the_Serengeti_National_Park_Photo_by_Giles_Laurent.jpg"

    chat_response = client.chat.completions.create(
        model="microsoft/Phi-3.5-vision-instruct",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "What are the animals in these images?"},
                {"type": "image_url", "image_url": {"url": image_url_duck}},
                {"type": "image_url", "image_url": {"url": image_url_lion}},
            ],
        }],
    )
    print("Chat completion output:", chat_response.choices[0].message.content)


A full code example can be found in `examples/openai_api_client_for_multimodal.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_api_client_for_multimodal.py>`_.

.. tip::
    There is no need to place image placeholders in the text content of the API request - they are already represented by the image content.
    In fact, you can place image placeholders in the middle of the text by interleaving text and image content.
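
To illustrate the interleaving mentioned in the tip, the sketch below builds a single user message in which text and image parts alternate. ``build_interleaved_message`` is a hypothetical helper and the ``example.com`` URLs are placeholders. Only the ``text`` and ``image_url`` content types from the request examples above are used.

.. code-block:: python

    from typing import Any, Dict, List, Tuple


    def build_interleaved_message(parts: List[Tuple[str, str]]) -> Dict[str, Any]:
        """Build a user message from ordered ("text" | "image", value) pairs.

        Hypothetical helper: image values are URLs, and each part is emitted
        at the position where it appears in ``parts``.
        """
        content: List[Dict[str, Any]] = []
        for kind, value in parts:
            if kind == "text":
                content.append({"type": "text", "text": value})
            elif kind == "image":
                content.append({"type": "image_url", "image_url": {"url": value}})
            else:
                raise ValueError(f"Unsupported content kind: {kind!r}")
        return {"role": "user", "content": content}


    message = build_interleaved_message([
        ("text", "Compare this image"),
        ("image", "https://example.com/duck.jpg"),  # placeholder URL
        ("text", "with this one"),
        ("image", "https://example.com/lion.jpg"),  # placeholder URL
        ("text", "and describe the differences."),
    ])

The resulting ``message`` can be passed as the single element of ``messages`` in ``client.chat.completions.create``.
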

.. note::

    By default, the timeout for fetching images through an HTTP URL is ``5`` seconds. You can override this by setting the environment variable:

    .. code-block:: console

        $ export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>

Chat Embeddings API
^^^^^^^^^^^^^^^^^^^

vLLM's Chat Embeddings API is a superset of OpenAI's `Embeddings API <https://platform.openai.com/docs/api-reference/embeddings>`_,
where a list of ``messages`` can be passed instead of batched ``inputs``. This enables multi-modal inputs to be passed to embedding models.

.. tip::
    The schema of ``messages`` is exactly the same as in Chat Completions API.

In this example, we will serve the ``TIGER-Lab/VLM2Vec-Full`` model.

.. code-block:: bash

    vllm serve TIGER-Lab/VLM2Vec-Full --task embedding \
      --trust-remote-code --max-model-len 4096

.. important::

    Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass ``--task embedding``
    to run this model in embedding mode instead of text generation mode.

Since this schema is not defined by the OpenAI client, we post a request to the server using the lower-level ``requests`` library:

.. code-block:: python

    import requests

    image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"

    response = requests.post(
        "http://localhost:8000/v1/embeddings",
        json={
            "model": "TIGER-Lab/VLM2Vec-Full",
            "messages": [{
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": "Represent the given image."},
                ],
            }],
            "encoding_format": "float",
        },
    )
    response.raise_for_status()
    response_json = response.json()
    print("Embedding output:", response_json["data"][0]["embedding"])
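
A common next step with the returned vectors is to compare two of them, for example an image embedding against a text embedding. The sketch below computes cosine similarity in plain Python. ``cosine_similarity`` is a hypothetical helper rather than a vLLM API, and it assumes both embeddings are lists of floats of the same length, as returned with ``"encoding_format": "float"``.

.. code-block:: python

    import math
    from typing import Sequence


    def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
        """Cosine similarity between two equal-length embedding vectors."""
        if len(a) != len(b):
            raise ValueError("Embeddings must have the same dimension")
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        if norm_a == 0.0 or norm_b == 0.0:
            raise ValueError("Cosine similarity is undefined for zero vectors")
        return dot / (norm_a * norm_b)

Given two responses from the endpoint, the similarity could be computed by passing the ``["data"][0]["embedding"]`` list of each response to ``cosine_similarity``.
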