.. _vlm:

Using VLMs
==========

vLLM provides experimental support for Vision Language Models (VLMs). See the :ref:`list of supported VLMs here <supported_vlms>`.
This document shows you how to run and serve these models using vLLM.

.. note::
    We are actively iterating on VLM support. See `this RFC <https://github.com/vllm-project/vllm/issues/4194>`_ for upcoming changes,
    and `open an issue on GitHub <https://github.com/vllm-project/vllm/issues/new/choose>`_ if you have any feedback or feature requests.

Offline Inference
-----------------

Single-image input
^^^^^^^^^^^^^^^^^^

You can instantiate the :class:`~vllm.LLM` class for a VLM in much the same way as for a language-only model.

.. code-block:: python

    from vllm import LLM

    llm = LLM(model="llava-hf/llava-1.5-7b-hf")

To pass an image to the model, set the following fields of :class:`vllm.inputs.PromptType`:

* ``prompt``: The prompt should follow the format documented on the model's HuggingFace repo.
* ``multi_modal_data``: A dictionary that follows the schema defined in :class:`vllm.multimodal.MultiModalDataDict`.

.. code-block:: python

    import PIL.Image
    import torch

    # Refer to the HuggingFace repo for the correct format to use
    prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"

    # Load the image using PIL.Image
    image = PIL.Image.open(...)

    # Single prompt inference
    outputs = llm.generate({
        "prompt": prompt,
        "multi_modal_data": {"image": image},
    })

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)

    # Inference with image embeddings as input
    image_embeds = torch.load(...)  # torch.Tensor of shape (1, image_feature_size, hidden_size of LM)
    outputs = llm.generate({
        "prompt": prompt,
        "multi_modal_data": {"image": image_embeds},
    })

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)

    # Inference with image embeddings as input with additional parameters
    # Qwen2-VL is used here to try out this input format, since the model needs
    # additional parameters to compute its positional encoding.
    image_embeds = torch.load(...)  # torch.Tensor of shape (1, image_feature_size, hidden_size of LM)
    image_grid_thw = torch.load(...)  # torch.Tensor of shape (1, 3)
    mm_data = {
        "image": {
            "image_embeds": image_embeds,
            "image_grid_thw": image_grid_thw,
        },
    }
    outputs = llm.generate({
        "prompt": prompt,
        "multi_modal_data": mm_data,
    })

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)

    # Batch inference
    image_1 = PIL.Image.open(...)
    image_2 = PIL.Image.open(...)
    outputs = llm.generate(
        [
            {
                "prompt": "USER: <image>\nWhat is the content of this image?\nASSISTANT:",
                "multi_modal_data": {"image": image_1},
            },
            {
                "prompt": "USER: <image>\nWhat's the color of this image?\nASSISTANT:",
                "multi_modal_data": {"image": image_2},
            }
        ]
    )

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)

A code example can be found in `examples/offline_inference_vision_language.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language.py>`_.
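
When a batch contains many image and question pairs, it can help to build the request list programmatically. The following is a minimal sketch, not part of vLLM: ``build_requests`` and ``LLAVA_TEMPLATE`` are hypothetical names, and the template assumes the LLaVA-1.5 prompt format used in the examples above.

.. code-block:: python

    from typing import Any

    # Hypothetical template: the LLaVA-1.5 prompt format from the examples above.
    # Other models need a different format; check the model's HuggingFace repo.
    LLAVA_TEMPLATE = "USER: <image>\n{question}\nASSISTANT:"


    def build_requests(questions: list[str], images: list[Any]) -> list[dict[str, Any]]:
        """Pair each question with one image in the dict format accepted by ``llm.generate``."""
        if len(questions) != len(images):
            raise ValueError("expected exactly one image per question")
        return [
            {
                "prompt": LLAVA_TEMPLATE.format(question=question),
                "multi_modal_data": {"image": image},
            }
            for question, image in zip(questions, images)
        ]

The resulting list can then be passed directly to the engine, for example ``outputs = llm.generate(build_requests(questions, images))``.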

Multi-image input
^^^^^^^^^^^^^^^^^

Multi-image input is only supported for a subset of VLMs, as shown :ref:`here <supported_vlms>`.

To enable multiple multi-modal items per text prompt, you have to set ``limit_mm_per_prompt`` for the :class:`~vllm.LLM` class.

.. code-block:: python

    llm = LLM(
        model="microsoft/Phi-3.5-vision-instruct",
        trust_remote_code=True,  # Required to load Phi-3.5-vision
        max_model_len=4096,  # Otherwise, it may not fit in smaller GPUs
        limit_mm_per_prompt={"image": 2},  # Maximum number of images accepted per prompt
    )

Instead of passing in a single image, you can pass in a list of images.

.. code-block:: python

    # Refer to the HuggingFace repo for the correct format to use
    prompt = "<|user|>\n<image_1>\n<image_2>\nWhat is the content of each image?<|end|>\n<|assistant|>\n"

    # Load the images using PIL.Image
    image1 = PIL.Image.open(...)
    image2 = PIL.Image.open(...)

    outputs = llm.generate({
        "prompt": prompt,
        "multi_modal_data": {
            "image": [image1, image2]
        },
    })

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)

A code example can be found in `examples/offline_inference_vision_language_multi_image.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language_multi_image.py>`_.
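
Because the number of images per prompt is capped by ``limit_mm_per_prompt``, a client-side check before calling ``llm.generate`` can surface mistakes early. The sketch below is an illustration only: ``count_images`` and ``check_image_limit`` are hypothetical helpers, not vLLM APIs, and they assume that any non-list value under ``"image"`` counts as a single item.

.. code-block:: python

    from typing import Any


    def count_images(request: dict[str, Any]) -> int:
        """Count the image items in a request dict (hypothetical helper)."""
        image_data = request.get("multi_modal_data", {}).get("image")
        if image_data is None:
            return 0
        if isinstance(image_data, (list, tuple)):
            return len(image_data)
        # Assumption: a single image or other non-list value counts as one item.
        return 1


    def check_image_limit(request: dict[str, Any], limit: int) -> None:
        """Raise ValueError if the request holds more images than ``limit``."""
        num_images = count_images(request)
        if num_images > limit:
            raise ValueError(
                f"request has {num_images} images but limit_mm_per_prompt allows {limit}"
            )

For the ``LLM`` instance configured above, ``check_image_limit(request, limit=2)`` would be called on each request before it is submitted.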

Multi-image input can be extended to perform video captioning. We show this with `Qwen2-VL <https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct>`_ as it supports videos: 

.. code-block:: python

    # Allow at most 4 frames (images) per prompt. This limit can be changed.
    llm = LLM("Qwen/Qwen2-VL-2B-Instruct", limit_mm_per_prompt={"image": 4})

    # Create the request payload.
    video_frames = ... # load your video making sure it only has the number of frames specified earlier.
    message = {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this set of frames. Consider the frames to be a part of the same video."},
        ],
    }
    for frame in video_frames:
        # encode_image is a user-supplied helper that returns the frame as a base64-encoded JPEG string.
        base64_image = encode_image(frame)
        new_image = {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
        message["content"].append(new_image)

    # Perform inference and log output.
    outputs = llm.chat([message])
    
    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)
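
The video example leaves frame selection and base64 encoding to the reader. The following sketch shows one possible way to do both, using only the standard library. ``sample_frame_indices`` and ``jpeg_to_data_url`` are hypothetical helpers, and the sketch assumes each frame is already available as JPEG-encoded bytes; decoding the video itself is out of scope here.

.. code-block:: python

    import base64


    def sample_frame_indices(num_frames: int, max_frames: int) -> list[int]:
        """Pick at most ``max_frames`` indices spread evenly over ``range(num_frames)``."""
        if max_frames <= 0:
            raise ValueError("max_frames must be positive")
        if num_frames <= max_frames:
            return list(range(num_frames))
        step = num_frames / max_frames
        return [int(i * step) for i in range(max_frames)]


    def jpeg_to_data_url(jpeg_bytes: bytes) -> str:
        """Wrap JPEG bytes in the ``data:image/jpeg;base64,...`` URL form used above."""
        encoded = base64.b64encode(jpeg_bytes).decode("ascii")
        return f"data:image/jpeg;base64,{encoded}"

With a list of JPEG byte strings ``all_frames`` (an assumed variable), the frames sent to the model could be chosen with ``[all_frames[i] for i in sample_frame_indices(len(all_frames), 4)]``, which keeps the count within the ``limit_mm_per_prompt`` value set above.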

Online Inference
----------------

OpenAI Vision API
^^^^^^^^^^^^^^^^^

vLLM's HTTP server is compatible with the `OpenAI Vision API <https://platform.openai.com/docs/guides/vision>`_, so you can use it to serve vision language models.

The example below launches the same ``microsoft/Phi-3.5-vision-instruct`` model with vLLM's OpenAI-compatible API server.

.. code-block:: bash

    vllm serve microsoft/Phi-3.5-vision-instruct --max-model-len 4096 \
      --trust-remote-code --limit-mm-per-prompt image=2

.. important::
    Since the OpenAI Vision API is based on the `Chat Completions <https://platform.openai.com/docs/api-reference/chat>`_ API,
    a chat template is **required** to launch the API server.

    Although Phi-3.5-Vision comes with a chat template, for other models you may have to provide one if the model's tokenizer does not come with it.
    The chat template can be inferred based on the documentation on the model's HuggingFace repo.
    For example, LLaVA-1.5 (``llava-hf/llava-1.5-7b-hf``) requires a chat template that can be found `here <https://github.com/vllm-project/vllm/blob/main/examples/template_llava.jinja>`_.

To query the server, you can use the OpenAI client as in the example below:

.. code-block:: python

    from openai import OpenAI

    openai_api_key = "EMPTY"
    openai_api_base = "http://localhost:8000/v1"

    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )

    # Single-image input inference
    image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"

    chat_response = client.chat.completions.create(
        model="microsoft/Phi-3.5-vision-instruct",
        messages=[{
            "role": "user",
            "content": [
                # NOTE: The prompt formatting with the image token `<image>` is not needed
                # since the prompt will be processed automatically by the API server.
                {"type": "text", "text": "What’s in this image?"},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    print("Chat completion output:", chat_response.choices[0].message.content)

    # Multi-image input inference
    image_url_duck = "https://upload.wikimedia.org/wikipedia/commons/d/da/2015_Kaczka_krzy%C5%BCowka_w_wodzie_%28samiec%29.jpg"
    image_url_lion = "https://upload.wikimedia.org/wikipedia/commons/7/77/002_The_lion_king_Snyggve_in_the_Serengeti_National_Park_Photo_by_Giles_Laurent.jpg"

    chat_response = client.chat.completions.create(
        model="microsoft/Phi-3.5-vision-instruct",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "What are the animals in these images?"},
                {"type": "image_url", "image_url": {"url": image_url_duck}},
                {"type": "image_url", "image_url": {"url": image_url_lion}},
            ],
        }],
    )
    print("Chat completion output:", chat_response.choices[0].message.content)

A full code example can be found in `examples/openai_vision_api_client.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_vision_api_client.py>`_.
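
The ``messages`` payload in the requests above follows a regular shape: one text part followed by one ``image_url`` part per image. As a rough sketch, a small helper can assemble it for any number of images. ``build_user_message`` is a hypothetical name and not part of the OpenAI client or vLLM.

.. code-block:: python

    from typing import Any


    def build_user_message(text: str, image_urls: list[str]) -> dict[str, Any]:
        """Build one user message with a text part followed by one part per image URL."""
        content: list[dict[str, Any]] = [{"type": "text", "text": text}]
        for url in image_urls:
            content.append({"type": "image_url", "image_url": {"url": url}})
        return {"role": "user", "content": content}

The result can be passed as ``messages=[build_user_message(...)]`` to ``client.chat.completions.create``. The number of URLs should stay within the ``--limit-mm-per-prompt`` value the server was launched with.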

.. note::

    By default, the timeout for fetching images from HTTP URLs is ``5`` seconds. You can override this by setting the environment variable:

    .. code-block:: shell

        export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>

.. note::
    There is no need to format the prompt in the API request since it will be handled by the server.