Commit f48954a4 authored by zhuwenwen

merge v0.5.0

parents 1dba29d3 8f89d720
.. _vlm:
Using VLMs
==========
vLLM provides experimental support for Vision Language Models (VLMs). This document shows you how to run and serve these models using vLLM.
Engine Arguments
----------------
The following :ref:`engine arguments <engine_args>` are specific to VLMs:

.. argparse::
    :module: vllm.engine.arg_utils
    :func: _vlm_engine_args_parser
    :prog: -m vllm.entrypoints.openai.api_server
    :nodefaultconst:

.. important::
    Currently, the support for vision language models on vLLM has the following limitations:

    * Only single image input is supported per text prompt.
    * Dynamic ``image_input_shape`` is not supported: the input image will be resized to the static ``image_input_shape``. This means model output might not exactly match the HuggingFace implementation.

    We are continuously improving user and developer experience for VLMs. Please raise an issue on GitHub if you have any feedback or feature requests.

Offline Batched Inference
-------------------------
To initialize a VLM, pass the arguments described above to the ``LLM`` class when instantiating the engine.

.. code-block:: python

    from vllm import LLM

    llm = LLM(
        model="llava-hf/llava-1.5-7b-hf",
        image_input_type="pixel_values",
        image_token_id=32000,
        image_input_shape="1,3,336,336",
        image_feature_size=576,
    )

To pass an image to the model, note the following in :class:`vllm.inputs.PromptStrictInputs`:

* ``prompt``: The prompt should have a number of ``<image>`` tokens equal to ``image_feature_size``.
* ``multi_modal_data``: This should be an instance of :class:`~vllm.multimodal.image.ImagePixelData` or :class:`~vllm.multimodal.image.ImageFeatureData`.

.. code-block:: python

    from vllm.multimodal.image import ImagePixelData

    # One <image> placeholder per image feature (image_feature_size=576)
    prompt = "<image>" * 576 + (
        "\nUSER: What is the content of this image?\nASSISTANT:")

    # Load the image using PIL.Image
    image = ...

    outputs = llm.generate({
        "prompt": prompt,
        "multi_modal_data": ImagePixelData(image),
    })

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)

A code example can be found in `examples/llava_example.py <https://github.com/vllm-project/vllm/blob/main/examples/llava_example.py>`_.
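
The placeholder count in the prompt can be derived instead of hard-coded. The sketch below assumes that each 14x14 pixel patch yields one image feature, which matches the CLIP ViT-L/14 vision tower at 336 px used by LLaVA-1.5 (336 / 14 = 24, and 24 x 24 = 576). The helper names ``feature_size_from_shape`` and ``build_llava_prompt`` and the ``patch_size`` default are illustrative and not part of vLLM.

```python
def feature_size_from_shape(image_input_shape: str, patch_size: int = 14) -> int:
    """Count patch features for a "batch,channels,height,width" shape string.

    Assumption: one feature per patch_size x patch_size patch.
    """
    _, _, height, width = (int(dim) for dim in image_input_shape.split(","))
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of the patch size")
    return (height // patch_size) * (width // patch_size)


def build_llava_prompt(question: str, image_feature_size: int) -> str:
    """Prefix the question with one <image> placeholder per image feature."""
    return "<image>" * image_feature_size + f"\nUSER: {question}\nASSISTANT:"


feature_size = feature_size_from_shape("1,3,336,336")
prompt = build_llava_prompt("What is the content of this image?", feature_size)
print(feature_size)
```

The resulting ``prompt`` has the same form as the one in the example above and can be passed to ``llm.generate`` together with the image data.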
Online OpenAI Vision API Compatible Inference
----------------------------------------------
You can serve vision language models with vLLM's HTTP server that is compatible with `OpenAI Vision API <https://platform.openai.com/docs/guides/vision>`_.

.. note::
    Currently, vLLM supports only **single** ``image_url`` input per ``messages``. Support for multi-image inputs will be added in the future.

Below is an example of how to launch the same ``llava-hf/llava-1.5-7b-hf`` model with the vLLM API server.

.. important::
    Since OpenAI Vision API is based on `Chat <https://platform.openai.com/docs/api-reference/chat>`_ API, a chat template is **required** to launch the API server if the model's tokenizer does not come with one. In this example, we use the HuggingFace Llava chat template that you can find in the example folder `here <https://github.com/vllm-project/vllm/blob/main/examples/template_llava.jinja>`_.

.. code-block:: bash

    python -m vllm.entrypoints.openai.api_server \
        --model llava-hf/llava-1.5-7b-hf \
        --image-input-type pixel_values \
        --image-token-id 32000 \
        --image-input-shape 1,3,336,336 \
        --image-feature-size 576 \
        --chat-template template_llava.jinja

To consume the server, you can use the OpenAI client as in the example below:

.. code-block:: python

    from openai import OpenAI

    openai_api_key = "EMPTY"
    openai_api_base = "http://localhost:8000/v1"

    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )

    chat_response = client.chat.completions.create(
        model="llava-hf/llava-1.5-7b-hf",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                    },
                },
            ],
        }],
    )
    print("Chat response:", chat_response)

.. note::
    By default, the timeout for fetching images over HTTP is ``5`` seconds. You can override this by setting the environment variable:

    .. code-block:: shell

        export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>

.. note::
    The prompt formatting with the image token ``<image>`` is not needed when serving VLMs with the API server since the prompt will be processed automatically by the server.

...@@ -3,11 +3,9 @@ ...@@ -3,11 +3,9 @@
Distributed Inference and Serving Distributed Inference and Serving
================================= =================================
vLLM supports distributed tensor-parallel inference and serving. Currently, we support `Megatron-LM's tensor parallel algorithm <https://arxiv.org/pdf/1909.08053.pdf>`_. We manage the distributed runtime with `Ray <https://github.com/ray-project/ray>`_. To run distributed inference, install Ray with: vLLM supports distributed tensor-parallel inference and serving. Currently, we support `Megatron-LM's tensor parallel algorithm <https://arxiv.org/pdf/1909.08053.pdf>`_. We manage the distributed runtime with either `Ray <https://github.com/ray-project/ray>`_ or python native multiprocessing. Multiprocessing can be used when deploying on a single node, multi-node inferencing currently requires Ray.
.. code-block:: console Multiprocessing will be used by default when not running in a Ray placement group and if there are sufficient GPUs available on the same node for the configured :code:`tensor_parallel_size`, otherwise Ray will be used. This default can be overridden via the :code:`LLM` class :code:`distributed-executor-backend` argument or :code:`--distributed-executor-backend` API server argument. Set it to :code:`mp` for multiprocessing or :code:`ray` for Ray. It's not required for Ray to be installed for the multiprocessing case.
$ pip install ray
To run multi-GPU inference with the :code:`LLM` class, set the :code:`tensor_parallel_size` argument to the number of GPUs you want to use. For example, to run inference on 4 GPUs: To run multi-GPU inference with the :code:`LLM` class, set the :code:`tensor_parallel_size` argument to the number of GPUs you want to use. For example, to run inference on 4 GPUs:
...@@ -25,10 +23,12 @@ To run multi-GPU serving, pass in the :code:`--tensor-parallel-size` argument wh ...@@ -25,10 +23,12 @@ To run multi-GPU serving, pass in the :code:`--tensor-parallel-size` argument wh
$ --model facebook/opt-13b \ $ --model facebook/opt-13b \
$ --tensor-parallel-size 4 $ --tensor-parallel-size 4
To scale vLLM beyond a single machine, start a `Ray runtime <https://docs.ray.io/en/latest/ray-core/starting-ray.html>`_ via CLI before running vLLM: To scale vLLM beyond a single machine, install and start a `Ray runtime <https://docs.ray.io/en/latest/ray-core/starting-ray.html>`_ via CLI before running vLLM:
.. code-block:: console .. code-block:: console
$ pip install ray
$ # On head node $ # On head node
$ ray start --head $ ray start --head
......
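
The default backend choice described in the updated text can be read as a small decision rule. The sketch below restates that rule in plain Python; the function ``choose_executor_backend`` and its parameters are hypothetical names for illustration and do not reflect vLLM's actual implementation.

```python
from typing import Optional


def choose_executor_backend(
    tensor_parallel_size: int,
    local_gpu_count: int,
    in_ray_placement_group: bool,
    override: Optional[str] = None,
) -> str:
    """Return "mp" (multiprocessing) or "ray" following the documented default."""
    # An explicit distributed-executor-backend setting always wins.
    if override is not None:
        if override not in ("mp", "ray"):
            raise ValueError("backend must be 'mp' or 'ray'")
        return override
    # Default: multiprocessing only when outside a Ray placement group
    # and the local node has enough GPUs for the tensor-parallel group.
    if not in_ray_placement_group and local_gpu_count >= tensor_parallel_size:
        return "mp"
    return "ray"


print(choose_executor_backend(4, local_gpu_count=8, in_ray_placement_group=False))
print(choose_executor_backend(8, local_gpu_count=4, in_ray_placement_group=False))
```

With 8 local GPUs and a tensor-parallel size of 4 the rule picks multiprocessing; when the node has too few GPUs it falls back to Ray.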
...@@ -30,6 +30,8 @@ Please see the [OpenAI API Reference](https://platform.openai.com/docs/api-refer ...@@ -30,6 +30,8 @@ Please see the [OpenAI API Reference](https://platform.openai.com/docs/api-refer
- Chat: `tools`, and `tool_choice`. - Chat: `tools`, and `tool_choice`.
- Completions: `suffix`. - Completions: `suffix`.
vLLM also provides experimental support for OpenAI Vision API compatible inference. See more details in [Using VLMs](../models/vlm.rst).
## Extra Parameters ## Extra Parameters
vLLM supports a set of parameters that are not part of the OpenAI API. vLLM supports a set of parameters that are not part of the OpenAI API.
In order to use them, you can pass them as extra parameters in the OpenAI client. In order to use them, you can pass them as extra parameters in the OpenAI client.
...@@ -109,4 +111,15 @@ directory [here](https://github.com/vllm-project/vllm/tree/main/examples/) ...@@ -109,4 +111,15 @@ directory [here](https://github.com/vllm-project/vllm/tree/main/examples/)
:module: vllm.entrypoints.openai.cli_args :module: vllm.entrypoints.openai.cli_args
:func: make_arg_parser :func: make_arg_parser
:prog: -m vllm.entrypoints.openai.api_server :prog: -m vllm.entrypoints.openai.api_server
``` ```
\ No newline at end of file
## Tool calling in the chat completion API
vLLM supports only named function calling in the chat completion API. The `tool_choice` options `auto` and `required` are **not yet supported** but are on the roadmap.
To use a named function, you need to define the function in the `tools` parameter and call it in the `tool_choice` parameter.
It is the caller's responsibility to prompt the model with the tool information; vLLM will not automatically manipulate the prompt. **This may change in the future.**
vLLM will use guided decoding to ensure the response matches the tool parameter object defined by the JSON schema in the `tools` parameter.
Please refer to the OpenAI API reference documentation for more information.
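
As a rough illustration of named function calling, the sketch below builds the request arguments for one tool and selects it by name in `tool_choice`, following the OpenAI chat completion request format. The tool `get_current_weather`, its schema, the helper `build_named_tool_request`, and the model name `my-model` are all hypothetical. The system message is one possible way to give the model the tool information, since the prompt is not modified automatically.

```python
import json


def build_named_tool_request(model: str, question: str) -> dict:
    """Build chat completion arguments that force a call to one named tool."""
    weather_tool = {
        "type": "function",
        "function": {
            "name": "get_current_weather",  # hypothetical tool
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["city"],
            },
        },
    }
    # The caller describes the tool in the prompt itself.
    system_prompt = (
        "You can call this tool. Reply with its JSON arguments only: "
        + json.dumps(weather_tool["function"])
    )
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        "tools": [weather_tool],
        # Named function calling: select the tool explicitly by name.
        "tool_choice": {
            "type": "function",
            "function": {"name": "get_current_weather"},
        },
    }


request = build_named_tool_request("my-model", "What is the weather in Paris?")
print(json.dumps(request["tool_choice"]))
```

With the OpenAI Python client, these arguments could be passed as `client.chat.completions.create(**request)`.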
...@@ -3,33 +3,36 @@ import os ...@@ -3,33 +3,36 @@ import os
import subprocess import subprocess
import torch import torch
from PIL import Image
from vllm import LLM from vllm import LLM
from vllm.sequence import MultiModalData from vllm.multimodal.image import ImageFeatureData, ImagePixelData
# The assets are located at `s3://air-example-data-2/vllm_opensource_llava/`. # The assets are located at `s3://air-example-data-2/vllm_opensource_llava/`.
# You can use `.buildkite/download-images.sh` to download them
def run_llava_pixel_values(): def run_llava_pixel_values(*, disable_image_processor: bool = False):
llm = LLM( llm = LLM(
model="llava-hf/llava-1.5-7b-hf", model="llava-hf/llava-1.5-7b-hf",
image_input_type="pixel_values", image_input_type="pixel_values",
image_token_id=32000, image_token_id=32000,
image_input_shape="1,3,336,336", image_input_shape="1,3,336,336",
image_feature_size=576, image_feature_size=576,
disable_image_processor=disable_image_processor,
) )
prompt = "<image>" * 576 + ( prompt = "<image>" * 576 + (
"\nUSER: What is the content of this image?\nASSISTANT:") "\nUSER: What is the content of this image?\nASSISTANT:")
# This should be provided by another online or offline component. if disable_image_processor:
image = torch.load("images/stop_sign_pixel_values.pt") image = torch.load("images/stop_sign_pixel_values.pt")
else:
image = Image.open("images/stop_sign.jpg")
outputs = llm.generate({ outputs = llm.generate({
"prompt": "prompt": prompt,
prompt, "multi_modal_data": ImagePixelData(image),
"multi_modal_data":
MultiModalData(type=MultiModalData.Type.IMAGE, data=image),
}) })
for o in outputs: for o in outputs:
...@@ -49,15 +52,13 @@ def run_llava_image_features(): ...@@ -49,15 +52,13 @@ def run_llava_image_features():
prompt = "<image>" * 576 + ( prompt = "<image>" * 576 + (
"\nUSER: What is the content of this image?\nASSISTANT:") "\nUSER: What is the content of this image?\nASSISTANT:")
# This should be provided by another online or offline component. image: torch.Tensor = torch.load("images/stop_sign_image_features.pt")
image = torch.load("images/stop_sign_image_features.pt")
outputs = llm.generate({ outputs = llm.generate({
"prompt": "prompt": prompt,
prompt, "multi_modal_data": ImageFeatureData(image),
"multi_modal_data":
MultiModalData(type=MultiModalData.Type.IMAGE, data=image),
}) })
for o in outputs: for o in outputs:
generated_text = o.outputs[0].text generated_text = o.outputs[0].text
print(generated_text) print(generated_text)
......
{%- if messages[0]['role'] == 'system' -%}
{%- set system_message = messages[0]['content'] -%}
{%- set messages = messages[1:] -%}
{%- else -%}
{% set system_message = '' -%}
{%- endif -%}
{{ bos_token + system_message }}
{%- for message in messages -%}
{%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) -%}
{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
{%- endif -%}
{%- if message['role'] == 'user' -%}
{{ 'USER: ' + message['content'] + '\n' }}
{%- elif message['role'] == 'assistant' -%}
{{ 'ASSISTANT: ' + message['content'] + eos_token + '\n' }}
{%- endif -%}
{%- endfor -%}
{%- if add_generation_prompt -%}
{{ 'ASSISTANT:' }}
{% endif %}
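
To make the template's behavior easier to follow, here is a plain-Python approximation of its logic: an optional leading system message, strict user/assistant alternation, and an optional generation prompt. The function name `render_llava_chat` and the default `bos_token`/`eos_token` values are assumptions for illustration. Exact whitespace in the real output depends on Jinja whitespace control and may differ, for example in trailing newlines.

```python
from typing import Dict, List


def render_llava_chat(
    messages: List[Dict[str, str]],
    bos_token: str = "<s>",   # assumed tokenizer default
    eos_token: str = "</s>",  # assumed tokenizer default
    add_generation_prompt: bool = True,
) -> str:
    """Approximate the chat template's output in plain Python."""
    # Peel off an optional leading system message.
    # (The guard against an empty list is an addition of this sketch.)
    if messages and messages[0]["role"] == "system":
        system_message = messages[0]["content"]
        messages = messages[1:]
    else:
        system_message = ""

    parts = [bos_token + system_message]
    for index, message in enumerate(messages):
        # Even positions must be user turns, odd positions assistant turns.
        if (message["role"] == "user") != (index % 2 == 0):
            raise ValueError(
                "Conversation roles must alternate user/assistant/user/assistant/...")
        if message["role"] == "user":
            parts.append("USER: " + message["content"] + "\n")
        elif message["role"] == "assistant":
            parts.append("ASSISTANT: " + message["content"] + eos_token + "\n")

    if add_generation_prompt:
        parts.append("ASSISTANT:")
    return "".join(parts)


rendered = render_llava_chat([{"role": "user", "content": "What's in this image?"}])
print(rendered)
```

Two consecutive user messages raise an error here, mirroring the template's `raise_exception` branch.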
...@@ -101,6 +101,7 @@ mypy vllm/core --config-file pyproject.toml ...@@ -101,6 +101,7 @@ mypy vllm/core --config-file pyproject.toml
mypy vllm/distributed --config-file pyproject.toml mypy vllm/distributed --config-file pyproject.toml
mypy vllm/entrypoints --config-file pyproject.toml mypy vllm/entrypoints --config-file pyproject.toml
mypy vllm/executor --config-file pyproject.toml mypy vllm/executor --config-file pyproject.toml
mypy vllm/multimodal --config-file pyproject.toml
mypy vllm/usage --config-file pyproject.toml mypy vllm/usage --config-file pyproject.toml
mypy vllm/*.py --config-file pyproject.toml mypy vllm/*.py --config-file pyproject.toml
mypy vllm/transformers_utils --config-file pyproject.toml mypy vllm/transformers_utils --config-file pyproject.toml
...@@ -117,7 +118,7 @@ mypy vllm/model_executor --config-file pyproject.toml ...@@ -117,7 +118,7 @@ mypy vllm/model_executor --config-file pyproject.toml
# https://github.com/codespell-project/codespell/issues/1915 # https://github.com/codespell-project/codespell/issues/1915
# Avoiding the "./" prefix and using "/**" globs for directories appears to solve the problem # Avoiding the "./" prefix and using "/**" globs for directories appears to solve the problem
CODESPELL_EXCLUDES=( CODESPELL_EXCLUDES=(
'--skip' 'tests/prompts/**,./benchmarks/sonnet.txt,tests/lora/data/**,build/**' '--skip' 'tests/prompts/**,./benchmarks/sonnet.txt,*tests/lora/data/**,build/**'
) )
# check spelling of specified files # check spelling of specified files
......
...@@ -59,7 +59,7 @@ exclude = [ ...@@ -59,7 +59,7 @@ exclude = [
] ]
[tool.codespell] [tool.codespell]
ignore-words-list = "dout, te, indicies" ignore-words-list = "dout, te, indicies, subtile"
skip = "./tests/prompts,./benchmarks/sonnet.txt,./tests/lora/data,./build" skip = "./tests/prompts,./benchmarks/sonnet.txt,./tests/lora/data,./build"
[tool.isort] [tool.isort]
...@@ -71,4 +71,5 @@ markers = [ ...@@ -71,4 +71,5 @@ markers = [
"skip_global_cleanup", "skip_global_cleanup",
"llm: run tests for vLLM API only", "llm: run tests for vLLM API only",
"openai: run tests for OpenAI API only", "openai: run tests for OpenAI API only",
"llava: run tests for LLaVA models only",
] ]
...@@ -12,10 +12,11 @@ aiohttp ...@@ -12,10 +12,11 @@ aiohttp
openai openai
uvicorn[standard] uvicorn[standard]
pydantic >= 2.0 # Required for OpenAI server. pydantic >= 2.0 # Required for OpenAI server.
pillow # Required for image processing
prometheus_client >= 0.18.0 prometheus_client >= 0.18.0
prometheus-fastapi-instrumentator >= 7.0.0 prometheus-fastapi-instrumentator >= 7.0.0
tiktoken >= 0.6.0 # Required for DBRX tokenizer tiktoken >= 0.6.0 # Required for DBRX tokenizer
lm-format-enforcer == 0.10.1 lm-format-enforcer == 0.10.1
outlines == 0.0.34 # Requires torch >= 2.1.0 outlines >= 0.0.43 # Requires torch >= 2.1.0
typing_extensions typing_extensions
filelock >= 3.10.4 # filelock starts to support `mode` argument from 3.10.4 filelock >= 3.10.4 # filelock starts to support `mode` argument from 3.10.4
...@@ -6,4 +6,4 @@ ray >= 2.9 ...@@ -6,4 +6,4 @@ ray >= 2.9
nvidia-ml-py # for pynvml package nvidia-ml-py # for pynvml package
torch == 2.3.0 torch == 2.3.0
xformers == 0.0.26.post1 # Requires PyTorch 2.3.0 xformers == 0.0.26.post1 # Requires PyTorch 2.3.0
vllm-flash-attn == 2.5.8.post2 # Requires PyTorch 2.3.0 vllm-flash-attn == 2.5.9 # Requires PyTorch 2.3.0
...@@ -33,5 +33,5 @@ sentence-transformers # required for embedding ...@@ -33,5 +33,5 @@ sentence-transformers # required for embedding
# Benchmarking # Benchmarking
aiohttp aiohttp
# Multimodal # quantization
pillow bitsandbytes==0.42.0