.. _vlm:

Using VLMs
==========

vLLM provides experimental support for Vision Language Models (VLMs). This document shows you how to run and serve these models using vLLM.

.. important::
    We are actively iterating on VLM support. Expect breaking changes to VLM usage and development in upcoming releases without prior deprecation.

Engine Arguments
----------------

The following :ref:`engine arguments <engine_args>` are specific to VLMs:

.. argparse::
    :module: vllm.engine.arg_utils
    :func: _vlm_engine_args_parser
    :prog: -m vllm.entrypoints.openai.api_server
    :nodefaultconst:

.. important::
    Currently, VLM support in vLLM has the following limitation:

    * Only single image input is supported per text prompt.

    We are continuously improving the user and developer experience for VLMs. Please `open an issue on GitHub <https://github.com/vllm-project/vllm/issues/new/choose>`_ if you have any feedback or feature requests.

Offline Batched Inference
-------------------------

To initialize a VLM, pass the engine arguments described above to the ``LLM`` class when instantiating the engine.

.. code-block:: python

    from vllm import LLM

    llm = LLM(
        model="llava-hf/llava-1.5-7b-hf",
        image_token_id=32000,
        image_input_shape="1,3,336,336",
        image_feature_size=576,
    )

.. important::
    Currently, you have to specify ``image_feature_size`` to support memory profiling.
    To avoid OOM during runtime, you should set this to the maximum value supported by the model.
    The calculation of feature size is specific to the model. For more details, please refer to
    the function :code:`get_<model_name>_image_feature_size` inside the corresponding model file.

    We will remove most of the vision-specific arguments in a future release as they can be inferred from the HuggingFace configuration.
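
The value ``576`` used above can be reproduced by hand. The sketch below assumes that the vision encoder of ``llava-hf/llava-1.5-7b-hf`` is a ViT with a patch size of 14 and that each patch yields one image feature. ``vit_image_feature_size`` is a hypothetical helper written for illustration, not part of vLLM, and other models may compute their feature size differently.

.. code-block:: python

    def vit_image_feature_size(image_input_shape: str, patch_size: int = 14) -> int:
        """Count ViT patches for a "batch,channels,height,width" shape string."""
        _, _, height, width = (int(dim) for dim in image_input_shape.split(","))
        # Assumption: one feature per non-overlapping patch_size x patch_size patch.
        return (height // patch_size) * (width // patch_size)

    print(vit_image_feature_size("1,3,336,336"))  # 576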


To pass an image to the model, note the following in :class:`vllm.inputs.PromptStrictInputs`:

* ``prompt``: The prompt should follow the format that is documented on HuggingFace.
* ``multi_modal_data``: This is a dictionary that follows the schema defined in :class:`vllm.multimodal.MultiModalDataDict`. 

.. note::

    ``multi_modal_data`` can accept keys and values beyond the builtin ones, as long as a customized plugin is registered through
    :class:`vllm.multimodal.MULTIMODAL_REGISTRY`.

.. code-block:: python

    # Refer to the HuggingFace repo for the correct format to use
    prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"

    # Load the image using PIL.Image
    image = ...

    outputs = llm.generate({
        "prompt": prompt,
        "multi_modal_data": {"image": image},
    })

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)

A code example can be found in `examples/llava_example.py <https://github.com/vllm-project/vllm/blob/main/examples/llava_example.py>`_.
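
Because only one image is supported per prompt, it can help to check the prompt before calling ``llm.generate``. The sketch below is a minimal illustration: ``build_image_request`` is a hypothetical helper, not part of vLLM, and it assumes the ``<image>`` placeholder used in the LLaVA-1.5 prompt format. Other models may use a different placeholder.

.. code-block:: python

    from typing import Any, Dict


    def build_image_request(prompt: str, image: Any,
                            placeholder: str = "<image>") -> Dict[str, Any]:
        """Pair one prompt with one image in the dictionary form shown earlier."""
        count = prompt.count(placeholder)
        if count != 1:
            raise ValueError(
                f"expected exactly one {placeholder!r} in the prompt, found {count}")
        return {"prompt": prompt, "multi_modal_data": {"image": image}}


    request = build_image_request(
        "USER: <image>\nWhat is the content of this image?\nASSISTANT:",
        image=object(),  # stand-in for illustration; pass a PIL image in real use
    )
    print(sorted(request))  # ['multi_modal_data', 'prompt']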


Online OpenAI Vision API Compatible Inference
----------------------------------------------

You can serve vision language models with vLLM's HTTP server, which is compatible with the `OpenAI Vision API <https://platform.openai.com/docs/guides/vision>`_.

.. note::
    Currently, vLLM supports only a **single** ``image_url`` input per ``messages``. Support for multi-image inputs will be
    added in the future.

Below is an example of how to launch the same ``llava-hf/llava-1.5-7b-hf`` model with the vLLM API server.

.. important::
    Since the OpenAI Vision API is based on the `Chat <https://platform.openai.com/docs/api-reference/chat>`_ API, a chat template
    is **required** to launch the API server if the model's tokenizer does not come with one. In this example, we use the
    HuggingFace Llava chat template that you can find in the example folder `here <https://github.com/vllm-project/vllm/blob/main/examples/template_llava.jinja>`_.

.. code-block:: bash

    python -m vllm.entrypoints.openai.api_server \
        --model llava-hf/llava-1.5-7b-hf \
        --image-token-id 32000 \
        --image-input-shape 1,3,336,336 \
        --image-feature-size 576 \
        --chat-template template_llava.jinja

.. important::
    Currently, you have to specify ``image_feature_size`` to support memory profiling.
    To avoid OOM during runtime, you should set this to the maximum value supported by the model.
    The calculation of feature size is specific to the model. For more details, please refer to
    the function :code:`get_<model_name>_image_feature_size` inside the corresponding model file.

    We will remove most of the vision-specific arguments in a future release as they can be inferred from the HuggingFace configuration.

To consume the server, you can use the OpenAI client, as in the example below:

.. code-block:: python

    from openai import OpenAI
    openai_api_key = "EMPTY"
    openai_api_base = "http://localhost:8000/v1"
    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )
    chat_response = client.chat.completions.create(
        model="llava-hf/llava-1.5-7b-hf",
        messages=[{
            "role": "user",
            "content": [
                # NOTE: The prompt formatting with the image token `<image>` is not needed
                # since the prompt will be processed automatically by the API server.
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                    },
                },
            ],
        }],
    )
    print("Chat response:", chat_response)

A full code example can be found in `examples/openai_vision_api_client.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_vision_api_client.py>`_.
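
The OpenAI Vision API also accepts images as base64-encoded ``data:`` URLs, which is convenient for local files. The sketch below assumes that your vLLM version handles such URLs in ``image_url``, which this page does not confirm. ``to_data_url`` is a hypothetical helper, not part of vLLM or the OpenAI client.

.. code-block:: python

    import base64
    import mimetypes


    def to_data_url(image_bytes: bytes, filename: str) -> str:
        """Encode raw image bytes as a base64 ``data:`` URL."""
        mime_type, _ = mimetypes.guess_type(filename)
        if mime_type is None or not mime_type.startswith("image/"):
            raise ValueError(f"cannot infer an image MIME type from {filename!r}")
        encoded = base64.b64encode(image_bytes).decode("ascii")
        return f"data:{mime_type};base64,{encoded}"


    # The bytes here are a stand-in; in real use, read them from an image file.
    # The result would replace the "url" value of the "image_url" content part.
    url = to_data_url(b"\xff\xd8\xff", "photo.jpg")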

.. note::

    By default, the timeout for fetching images from an HTTP URL is ``5`` seconds. You can override this by setting the environment variable:

    .. code-block:: shell

        export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>

.. note::
    There is no need to format the prompt in the API request since it will be handled by the server.