[vlm] Remove vision language config. (#6089)

Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>

Commit d9e98f42, authored Jul 03, 2024 by xwjiang2010; committed by GitHub Jul 03, 2024
20 changed files
--- a/docs/source/dev/multimodal/multimodal_index.rst
+++ b/docs/source/dev/multimodal/multimodal_index.rst
@@ -10,8 +10,13 @@ vLLM provides experimental support for multi-modal models through the :mod:`vllm
 :class:`vllm.inputs.PromptStrictInputs` accepts an additional attribute ``multi_modal_data``
 which allows you to pass in multi-modal input alongside text and token prompts.
+.. note::
+   ``multi_modal_data`` can accept keys and values beyond the builtin ones, as long as a customized plugin is registered through 
+    :class:`vllm.multimodal.MULTIMODAL_REGISTRY`.
 By default, vLLM models do not support multi-modal inputs. To enable multi-modal support for a model, please follow :ref:`the guide for adding a new multimodal model. <adding_a_new_multimodal_model>`.
 # TODO: Add more instructions on how to do that once embeddings is in.
 Guides

--- a/docs/source/models/vlm.rst
+++ b/docs/source/models/vlm.rst
@@ -8,18 +8,6 @@ vLLM provides experimental support for Vision Language Models (VLMs). This docum
 .. important::
    We are actively iterating on VLM support. Expect breaking changes to VLM usage and development in upcoming releases without prior deprecation.
-Engine Arguments
----------------
-The following :ref:`engine arguments <engine_args>` are specific to VLMs:
-.. argparse::
-    :module: vllm.engine.arg_utils
-    :func: _vlm_engine_args_parser
-    :prog: -m vllm.entrypoints.openai.api_server
-    :nodefaultconst:
-.. important::
    Currently, the support for vision language models on vLLM has the following limitations:
    * Only single image input is supported per text prompt.
@@ -33,20 +21,17 @@ To initialize a VLM, the aforementioned arguments must be passed to the ``LLM``
 .. code-block:: python
-    llm = LLM(
+    llm = LLM(model="llava-hf/llava-1.5-7b-hf")
-        model="llava-hf/llava-1.5-7b-hf",
-        image_token_id=32000,
-        image_input_shape="1,3,336,336",
-        image_feature_size=576,
-    )
 .. important::
-    Currently, you have to specify ``image_feature_size`` to support memory profiling.
+    We have removed all vision language related CLI args in the ``0.5.1`` release. **This is a breaking change**, so please update your code to follow
-    To avoid OOM during runtime, you should set this to the maximum value supported by the model.
+    the above snippet. Specifically, ``image_feature_size`` is no longer required to be specified, and internally we will construct data structures for
-    The calculation of feature size is specific to the model. For more details, please refer to
+    every model to perform profiling with.
-    the function :code:`get_<model_name>_image_feature_size` inside the corresponding model file.
-    We will remove most of the vision-specific arguments in a future release as they can be inferred from the HuggingFace configuration.
+    This work is still ongoing. In the meantime, we internally hardcode ``image_feature_size = 3000`` through 
+    :meth:`MULTIMODAL_REGISTRY.get_num_input_tokens <vllm.multimodal.MultiModalRegistry.get_num_input_tokens>` 
+    for every model to be conservative in terms of GPU memory consumption. This hardcoded value will be replaced 
+    with a more accurate profiling strategy in the future.
 To pass an image to the model, note the following in :class:`vllm.inputs.PromptStrictInputs`:
@@ -54,19 +39,15 @@ To pass an image to the model, note the following in :class:`vllm.inputs.PromptS
 * ``prompt``: The prompt should follow the format that is documented on HuggingFace.
 * ``multi_modal_data``: This is a dictionary that follows the schema defined in :class:`vllm.multimodal.MultiModalDataDict`. 
-.. note::
-   ``multi_modal_data`` can accept keys and values beyond the builtin ones, as long as a customized plugin is registered through
-    :class:`vllm.multimodal.MULTIMODAL_REGISTRY`.
 .. code-block:: python
    # Refer to the HuggingFace repo for the correct format to use
    prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"
    # Load the image using PIL.Image
-    image = ...
+    image = PIL.Image.open(...)
+    # Single prompt inference
    outputs = llm.generate({
        "prompt": prompt,
        "multi_modal_data": {"image": image},
@@ -76,6 +57,26 @@ To pass an image to the model, note the following in :class:`vllm.inputs.PromptS
        generated_text = o.outputs[0].text
        print(generated_text)
+    # Batch inference
+    image_1 = PIL.Image.open(...)
+    image_2 = PIL.Image.open(...)
+    outputs = llm.generate(
+        [
+            {
+                "prompt": "USER: <image>\nWhat is the content of this image?\nASSISTANT:",
+                "multi_modal_data": {"image": image_1},
+            },
+            {
+                "prompt": "USER: <image>\nWhat's the color of this image?\nASSISTANT:",
+                "multi_modal_data": {"image": image_2},
+            }
+        ]
+    )
+    for o in outputs:
+        generated_text = o.outputs[0].text
+        print(generated_text)
 A code example can be found in `examples/llava_example.py <https://github.com/vllm-project/vllm/blob/main/examples/llava_example.py>`_.
@@ -99,18 +100,17 @@ Below is an example on how to launch the same ``llava-hf/llava-1.5-7b-hf`` with
    python -m vllm.entrypoints.openai.api_server \
        --model llava-hf/llava-1.5-7b-hf \
-        --image-token-id 32000 \
-        --image-input-shape 1,3,336,336 \
-        --image-feature-size 576 \
        --chat-template template_llava.jinja
 .. important::
-    Currently, you have to specify ``image_feature_size`` to support memory profiling.
+    We have removed all vision language related CLI args in the ``0.5.1`` release. **This is a breaking change**, so please update your code to follow
-    To avoid OOM during runtime, you should set this to the maximum value supported by the model.
+    the above snippet. Specifically, ``image_feature_size`` is no longer required to be specified, and internally we will construct data structures for
-    The calculation of feature size is specific to the model. For more details, please refer to
+    every model to perform profiling with.
-    the function :code:`get_<model_name>_image_feature_size` inside the corresponding model file.
+    This work is still ongoing. In the meantime, we internally hardcode ``image_feature_size = 3000`` through 
-    We will remove most of the vision-specific arguments in a future release as they can be inferred from the HuggingFace configuration.
+    :meth:`MULTIMODAL_REGISTRY.get_num_input_tokens <vllm.multimodal.MultiModalRegistry.get_num_input_tokens>` 
+    for every model to be conservative in terms of GPU memory consumption. This hardcoded value will be replaced 
+    with a more accurate profiling strategy in the future.
 To consume the server, you can use the OpenAI client like in the example below:

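Note on the batch-inference snippet added to vlm.rst above: it passes ``llm.generate`` a list of dicts, each with a ``prompt`` and a ``multi_modal_data`` entry. A minimal sketch of assembling that list follows. The helper name ``build_llava_prompts`` is hypothetical and not part of vLLM, and plain placeholder objects stand in for PIL images so the sketch runs without any model or image files.

```python
from typing import Any, Dict, List

# Prompt format copied from the LLaVA-1.5 examples in this diff.
LLAVA_TEMPLATE = "USER: <image>\n{question}\nASSISTANT:"


def build_llava_prompts(questions: List[str],
                        images: List[Any]) -> List[Dict[str, Any]]:
    """Hypothetical helper: pair each question with exactly one image."""
    if len(questions) != len(images):
        raise ValueError("need exactly one image per question")
    return [
        {
            "prompt": LLAVA_TEMPLATE.format(question=question),
            "multi_modal_data": {"image": image},
        }
        for question, image in zip(questions, images)
    ]
```

Under these assumptions, the returned list has the same shape as the batch argument shown in the docs, so it could be handed to ``llm.generate(...)`` once real ``PIL.Image`` objects are supplied.
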
--- a/examples/llava_example.py
+++ b/examples/llava_example.py
@@ -10,12 +10,7 @@ from vllm import LLM
 def run_llava():
-    llm = LLM(
+    llm = LLM(model="llava-hf/llava-1.5-7b-hf")
-        model="llava-hf/llava-1.5-7b-hf",
-        image_token_id=32000,
-        image_input_shape="1,3,336,336",
-        image_feature_size=576,
-    )
    prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"

--- a/examples/llava_next_example.py
+++ b/examples/llava_next_example.py
@@ -7,13 +7,7 @@ from vllm import LLM, SamplingParams
 def run_llava_next():
-    llm = LLM(
+    llm = LLM(model="llava-hf/llava-v1.6-mistral-7b-hf", max_model_len=4096)
-        model="llava-hf/llava-v1.6-mistral-7b-hf",
-        image_token_id=32000,
-        image_input_shape="1,3,336,336",
-        # Use the maximum possible value for memory profiling
-        image_feature_size=2928,
-    )
    prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
    url = "https://h2o-release.s3.amazonaws.com/h2ogpt/bigben.jpg"

--- a/examples/openai_vision_api_client.py
+++ b/examples/openai_vision_api_client.py
@@ -3,9 +3,6 @@
 Launch the vLLM server with the following command:
 python -m vllm.entrypoints.openai.api_server \
    --model llava-hf/llava-1.5-7b-hf \
-    --image-token-id 32000 \
-    --image-input-shape 1,3,336,336 \
-    --image-feature-size 576 \
    --chat-template template_llava.jinja
 """
 import base64

--- a/examples/phi3v_example.py
+++ b/examples/phi3v_example.py
@@ -14,15 +14,13 @@ def run_phi3v():
    # Note: The default setting of max_num_seqs (256) and
    # max_model_len (128k) for this model may cause OOM.
+    # You may lower either to run this example on lower-end GPUs.
    # In this example, we override max_num_seqs to 5 while
    # keeping the original context length of 128k.
    llm = LLM(
        model=model_path,
        trust_remote_code=True,
-        image_token_id=32044,
-        image_input_shape="1,3,1008,1344",
-        # Use the maximum possible value for memory profiling
-        image_feature_size=2653,
        max_num_seqs=5,
    )

--- a/tests/distributed/test_multimodal_broadcast.py
+++ b/tests/distributed/test_multimodal_broadcast.py
@@ -20,9 +20,9 @@ from vllm.utils import cuda_device_count_stateless
 model = os.environ["TEST_DIST_MODEL"]
 if model.startswith("llava-hf/llava"):
-    from ..models.test_llava import model_and_vl_config, run_test
+    from ..models.test_llava import models, run_test
 elif model.startswith("microsoft/Phi-3-vision"):
-    from ..models.test_phi3v import model_and_vl_config, run_test
+    from ..models.test_phi3v import models, run_test
 else:
    raise NotImplementedError(f"Unsupported model: {model}")
@@ -44,7 +44,7 @@ def test_models(hf_runner, vllm_runner, image_assets,
        hf_runner,
        vllm_runner,
        image_assets,
-        model_and_config=model_and_vl_config[0],
+        model=models[0],
        size_factors=[1.0],
        dtype=dtype,
        max_tokens=max_tokens,

--- a/tests/entrypoints/openai/test_vision.py
+++ b/tests/entrypoints/openai/test_vision.py
@@ -39,12 +39,6 @@ def server(ray_ctx):
        "--max-model-len",
        "4096",
        "--enforce-eager",
-        "--image-token-id",
-        "32000",
-        "--image-input-shape",
-        "1,3,336,336",
-        "--image-feature-size",
-        "576",
        "--chat-template",
        str(LLAVA_CHAT_TEMPLATE),
    ])
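
Note on the hunk above: the three ``--image-*`` flags are dropped from the server arguments because the parser no longer defines them. For launch scripts that still carry these flags, a migration step could filter them out before starting the server. The sketch below is an illustration only: ``strip_removed_vlm_flags`` is a hypothetical helper, not a vLLM function, and it assumes each removed flag takes exactly one value, as in the commands shown in this diff.

```python
from typing import List

# Flags removed by this commit; each one took a single value.
REMOVED_VLM_FLAGS = (
    "--image-token-id",
    "--image-input-shape",
    "--image-feature-size",
)


def strip_removed_vlm_flags(argv: List[str]) -> List[str]:
    """Hypothetical helper: drop the removed flags and their values."""
    cleaned: List[str] = []
    skip_next = False
    for arg in argv:
        if skip_next:
            # This item is the value of a removed "--flag value" pair.
            skip_next = False
            continue
        if arg in REMOVED_VLM_FLAGS:
            skip_next = True
            continue
        if "=" in arg and arg.split("=", 1)[0] in REMOVED_VLM_FLAGS:
            # Handles the "--flag=value" spelling.
            continue
        cleaned.append(arg)
    return cleaned
```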

--- a/tests/models/test_llava.py
+++ b/tests/models/test_llava.py
@@ -3,7 +3,6 @@ from typing import List, Optional, Tuple, Type
 import pytest
 from transformers import AutoTokenizer
-from vllm.config import VisionLanguageConfig
 from vllm.multimodal.utils import rescale_image_size
 from vllm.sequence import SampleLogprobs
@@ -21,49 +20,27 @@ HF_IMAGE_PROMPTS = IMAGE_ASSETS.prompts({
    "USER: <image>\nWhat's in this image?\nASSISTANT:",
 })
+IMAGE_TOKEN_ID = 32000
-def iter_llava_configs(model_name: str):
+models = ["llava-hf/llava-1.5-7b-hf"]
-    image_hw_to_feature_size = {
-        (336, 336): 576,
-    }
-    for (h, w), f in image_hw_to_feature_size.items():
-        input_shape = (1, 3, h, w)
-        yield (model_name,
-               VisionLanguageConfig(image_feature_size=f,
-                                    image_token_id=32000,
-                                    image_input_shape=input_shape))
-model_and_vl_config = [
-    *iter_llava_configs("llava-hf/llava-1.5-7b-hf"),
-]
 def vllm_to_hf_output(vllm_output: Tuple[List[int], str,
                                         Optional[SampleLogprobs]],
-                      vlm_config: VisionLanguageConfig, model_id: str):
+                      model: str):
-    """Sanitize vllm output to be comparable with hf output.
+    """Sanitize vllm output to be comparable with hf output."""
-    The function reduces `input_ids` from 1, 32000, 32000, ..., 32000,
-    x1, x2, x3 ... to 1, 32000, x1, x2, x3 ...
-    It also reduces `output_str` from "<image><image>bla" to "bla".
-    """
    output_ids, output_str, out_logprobs = vllm_output
-    image_token_id = vlm_config.image_token_id
-    tokenizer = AutoTokenizer.from_pretrained(model_id)
+    tokenizer = AutoTokenizer.from_pretrained(model)
-    image_token_str = tokenizer.decode(image_token_id)
    eos_token_id = tokenizer.eos_token_id
    hf_output_ids = [
        token_id for idx, token_id in enumerate(output_ids)
-        if token_id != image_token_id or output_ids[idx - 1] != image_token_id
+        if token_id != IMAGE_TOKEN_ID or output_ids[idx - 1] != IMAGE_TOKEN_ID
    ]
-    hf_output_str = output_str \
+    assert output_str[0] == " "
-        .replace(image_token_str * vlm_config.image_feature_size, "")
+    hf_output_str = output_str[1:]
-    assert hf_output_str[0] == " "
-    hf_output_str = hf_output_str[1:]
    if hf_output_ids[-1] == eos_token_id:
        hf_output_str = hf_output_str + tokenizer.decode(eos_token_id)
@@ -74,7 +51,7 @@ def run_test(
    hf_runner: Type[HfRunner],
    vllm_runner: Type[VllmRunner],
    image_assets: _ImageAssets,
-    model_and_config: Tuple[str, VisionLanguageConfig],
+    model: str,
    *,
    size_factors: List[float],
    dtype: str,
@@ -92,7 +69,6 @@ def run_test(
    Note, the text input is also adjusted to abide by vllm contract.
    The text output is sanitized to be able to compare with hf.
    """
-    model_id, vlm_config = model_and_config
    images = [asset.pil_image for asset in image_assets]
    inputs_per_image = [(
@@ -106,12 +82,11 @@ def run_test(
    # will hurt multiprocessing backend with fork method (the default method).
    # max_model_len should be greater than image_feature_size
-    with vllm_runner(model_id,
+    with vllm_runner(model,
                     dtype=dtype,
                     tensor_parallel_size=tensor_parallel_size,
                     distributed_executor_backend=distributed_executor_backend,
-                     enforce_eager=True,
+                     enforce_eager=True) as vllm_model:
-                     **vlm_config.as_cli_args_dict()) as vllm_model:
        vllm_outputs_per_image = [
            vllm_model.generate_greedy_logprobs(prompts,
                                                max_tokens,
@@ -120,7 +95,7 @@ def run_test(
            for prompts, images in inputs_per_image
        ]
-    with hf_runner(model_id, dtype=dtype, is_vision_model=True) as hf_model:
+    with hf_runner(model, dtype=dtype, is_vision_model=True) as hf_model:
        hf_outputs_per_image = [
            hf_model.generate_greedy_logprobs_limit(prompts,
                                                    max_tokens,
@@ -136,7 +111,7 @@ def run_test(
        check_logprobs_close(
            outputs_0_lst=hf_outputs,
            outputs_1_lst=[
-                vllm_to_hf_output(vllm_output, vlm_config, model_id)
+                vllm_to_hf_output(vllm_output, model)
                for vllm_output in vllm_outputs
            ],
            name_0="hf",
@@ -144,7 +119,7 @@ def run_test(
        )
-@pytest.mark.parametrize("model_and_config", model_and_vl_config)
+@pytest.mark.parametrize("model", models)
 @pytest.mark.parametrize(
    "size_factors",
    [
@@ -161,14 +136,13 @@ def run_test(
 @pytest.mark.parametrize("dtype", ["half"])
 @pytest.mark.parametrize("max_tokens", [128])
 @pytest.mark.parametrize("num_logprobs", [5])
-def test_models(hf_runner, vllm_runner, image_assets, model_and_config,
+def test_models(hf_runner, vllm_runner, image_assets, model, size_factors,
-                size_factors, dtype: str, max_tokens: int,
+                dtype: str, max_tokens: int, num_logprobs: int) -> None:
-                num_logprobs: int) -> None:
    run_test(
        hf_runner,
        vllm_runner,
        image_assets,
-        model_and_config,
+        model,
        size_factors=size_factors,
        dtype=dtype,
        max_tokens=max_tokens,
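
Note on ``vllm_to_hf_output`` in the diff above: the list comprehension keeps a token unless both it and its predecessor are the image token, which collapses each run of image tokens to a single one. It reads ``output_ids[idx - 1]``, which at ``idx == 0`` refers to the last element of the list. The removed docstring describes sequences that begin with token ``1``, so this presumably never matters in these tests. The sketch below shows the same collapsing step with an explicit guard for the first position; the function name ``collapse_image_tokens`` is an assumption made for illustration.

```python
from typing import List

IMAGE_TOKEN_ID = 32000  # value used by the LLaVA tests in this diff


def collapse_image_tokens(output_ids: List[int],
                          image_token_id: int = IMAGE_TOKEN_ID) -> List[int]:
    """Keep the first token of every run of image tokens, drop the rest."""
    collapsed: List[int] = []
    for idx, token_id in enumerate(output_ids):
        is_repeat = (idx > 0 and token_id == image_token_id
                     and output_ids[idx - 1] == image_token_id)
        if not is_repeat:
            collapsed.append(token_id)
    return collapsed
```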

--- a/tests/models/test_llava_next.py
+++ b/tests/models/test_llava_next.py
@@ -4,7 +4,6 @@ from typing import List, Optional, Tuple
 import pytest
 from transformers import AutoTokenizer
-from vllm.config import VisionLanguageConfig
 from vllm.multimodal.utils import rescale_image_size
 from vllm.sequence import SampleLogprobs
@@ -27,46 +26,22 @@ HF_IMAGE_PROMPTS = IMAGE_ASSETS.prompts({
    f"{_PREFACE} USER: <image>\nWhat's in this image? ASSISTANT:",
 })
+IMAGE_TOKEN_ID = 32000
-def iter_llava_next_configs(model_name: str):
-    # Need to use the max possible feature size for profile_run
-    image_hw_to_feature_size = {
-        (336, 336): 2928,
-    }
-    for (h, w), f in image_hw_to_feature_size.items():
-        input_shape = (1, 3, h, w)
-        yield (model_name,
-               VisionLanguageConfig(
-                   image_feature_size=f,
-                   image_token_id=32000,
-                   image_input_shape=input_shape,
-               ))
-model_and_vl_config = [
-    *iter_llava_next_configs("llava-hf/llava-v1.6-vicuna-7b-hf"),
-]
 def vllm_to_hf_output(vllm_output: Tuple[List[int], str,
                                         Optional[SampleLogprobs]],
-                      vlm_config: VisionLanguageConfig, model_id: str):
+                      model: str):
-    """Sanitize vllm output to be comparable with hf output.
+    """Sanitize vllm output to be comparable with hf output."""
-    The function reduces `input_ids` from 1, 32000, 32000, ..., 32000,
-    x1, x2, x3 ... to 1, 32000, x1, x2, x3 ...
-    It also reduces `output_str` from "<image><image>bla" to "bla".
-    """
    output_ids, output_str, out_logprobs = vllm_output
-    image_token_id = vlm_config.image_token_id
-    tokenizer = AutoTokenizer.from_pretrained(model_id)
+    tokenizer = AutoTokenizer.from_pretrained(model)
-    image_token_str = tokenizer.decode(image_token_id)
+    image_token_str = tokenizer.decode(IMAGE_TOKEN_ID)
    eos_token_id = tokenizer.eos_token_id
    hf_output_ids = [
        token_id for idx, token_id in enumerate(output_ids)
-        if token_id != image_token_id or output_ids[idx - 1] != image_token_id
+        if token_id != IMAGE_TOKEN_ID or output_ids[idx - 1] != IMAGE_TOKEN_ID
    ]
    hf_output_str = re.sub(fr"({image_token_str})+", "", output_str)
@@ -78,7 +53,7 @@ def vllm_to_hf_output(vllm_output: Tuple[List[int], str,
    return hf_output_ids, hf_output_str, out_logprobs
-@pytest.mark.parametrize("model_and_config", model_and_vl_config)
+@pytest.mark.parametrize("model", ["llava-hf/llava-v1.6-vicuna-7b-hf"])
 @pytest.mark.parametrize(
    "size_factors",
    [
@@ -95,9 +70,8 @@ def vllm_to_hf_output(vllm_output: Tuple[List[int], str,
 @pytest.mark.parametrize("dtype", ["half"])
 @pytest.mark.parametrize("max_tokens", [128])
 @pytest.mark.parametrize("num_logprobs", [5])
-def test_models(hf_runner, vllm_runner, image_assets, model_and_config,
+def test_models(hf_runner, vllm_runner, image_assets, model, size_factors,
-                size_factors, dtype: str, max_tokens: int,
+                dtype, max_tokens, num_logprobs) -> None:
-                num_logprobs: int) -> None:
    """Inference result should be the same between hf and vllm.
    All the image fixtures for the test is under tests/images.
@@ -107,7 +81,6 @@ def test_models(hf_runner, vllm_runner, image_assets, model_and_config,
    Note, the text input is also adjusted to abide by vllm contract.
    The text output is sanitized to be able to compare with hf.
    """
-    model_id, vlm_config = model_and_config
    images = [asset.pil_image for asset in image_assets]
    inputs_per_image = [(
@@ -116,11 +89,10 @@ def test_models(hf_runner, vllm_runner, image_assets, model_and_config,
    ) for image, prompt in zip(images, HF_IMAGE_PROMPTS)]
    # max_model_len should be greater than image_feature_size
-    with vllm_runner(model_id,
+    with vllm_runner(model,
                     dtype=dtype,
                     max_model_len=4096,
-                     enforce_eager=True,
+                     enforce_eager=True) as vllm_model:
-                     **vlm_config.as_cli_args_dict()) as vllm_model:
        vllm_outputs_per_image = [
            vllm_model.generate_greedy_logprobs(prompts,
                                                max_tokens,
@@ -129,7 +101,7 @@ def test_models(hf_runner, vllm_runner, image_assets, model_and_config,
            for prompts, images in inputs_per_image
        ]
-    with hf_runner(model_id, dtype=dtype, is_vision_model=True) as hf_model:
+    with hf_runner(model, dtype=dtype, is_vision_model=True) as hf_model:
        hf_outputs_per_image = [
            hf_model.generate_greedy_logprobs_limit(prompts,
                                                    max_tokens,
@@ -145,7 +117,7 @@ def test_models(hf_runner, vllm_runner, image_assets, model_and_config,
        check_logprobs_close(
            outputs_0_lst=hf_outputs,
            outputs_1_lst=[
-                vllm_to_hf_output(vllm_output, vlm_config, model_id)
+                vllm_to_hf_output(vllm_output, model)
                for vllm_output in vllm_outputs
            ],
            name_0="hf",

--- a/tests/models/test_phi3v.py
+++ b/tests/models/test_phi3v.py
@@ -4,7 +4,6 @@ from typing import List, Optional, Tuple, Type
 import pytest
 from transformers import AutoTokenizer
-from vllm.config import VisionLanguageConfig
 from vllm.multimodal.utils import rescale_image_size
 from vllm.sequence import SampleLogprobs
 from vllm.utils import is_cpu
@@ -23,35 +22,14 @@ HF_IMAGE_PROMPTS = IMAGE_ASSETS.prompts({
    "<|user|>\n<|image_1|>\nWhat's in this image?<|end|>\n<|assistant|>\n",
 })
+models = ["microsoft/Phi-3-vision-128k-instruct"]
-def iter_phi3v_configs(model_name: str):
-    # Need to use the max possible feature size for profile_run
-    image_hw_to_feature_size = {
-        (1008, 1344): 2653,
-    }
-    for (h, w), f in image_hw_to_feature_size.items():
-        input_shape = (1, 3, h, w)
-        yield (model_name,
-               VisionLanguageConfig(image_feature_size=f,
-                                    image_token_id=32044,
-                                    image_input_shape=input_shape))
-model_and_vl_config = [
-    *iter_phi3v_configs("microsoft/Phi-3-vision-128k-instruct"),
-]
 def vllm_to_hf_output(vllm_output: Tuple[List[int], str,
                                         Optional[SampleLogprobs]],
-                      vlm_config: VisionLanguageConfig, model_id: str):
+                      model: str):
-    """Sanitize vllm output to be comparable with hf output.
+    """Sanitize vllm output to be comparable with hf output."""
-    The function reduces `input_ids` from 1, 32000, 32000, ..., 32000,
+    _, output_str, out_logprobs = vllm_output
-    x1, x2, x3 ... to 1, 32000, x1, x2, x3 ...
-    It also reduces `output_str` from "<image><image>bla" to "bla".
-    """
-    output_ids, output_str, out_logprobs = vllm_output
    output_str_without_image = re.sub(r"(<\|image_\d+\|>)+", "", output_str)
    assert output_str_without_image[0] == " "
@@ -60,7 +38,7 @@ def vllm_to_hf_output(vllm_output: Tuple[List[int], str,
    hf_output_str = output_str_without_image.replace("<|user|>", "") \
        .replace("<|end|>\n<|assistant|>", " ")
-    tokenizer = AutoTokenizer.from_pretrained(model_id)
+    tokenizer = AutoTokenizer.from_pretrained(model)
    hf_output_ids = tokenizer.encode(output_str_without_image)
    assert hf_output_ids[0] == 1
    hf_output_ids = hf_output_ids[1:]
@@ -77,7 +55,7 @@ def run_test(
    hf_runner: Type[HfRunner],
    vllm_runner: Type[VllmRunner],
    image_assets: _ImageAssets,
-    model_and_config: Tuple[str, VisionLanguageConfig],
+    model: str,
    *,
    size_factors: List[float],
    dtype: str,
@@ -95,7 +73,6 @@ def run_test(
    Note, the text input is also adjusted to abide by vllm contract.
    The text output is sanitized to be able to compare with hf.
    """
-    model_id, vlm_config = model_and_config
    images = [asset.pil_image for asset in image_assets]
    inputs_per_image = [(
@@ -109,13 +86,13 @@ def run_test(
    # will hurt multiprocessing backend with fork method (the default method).
    # max_model_len should be greater than image_feature_size
-    with vllm_runner(model_id,
+    with vllm_runner(model,
                     max_model_len=4096,
+                     max_num_seqs=1,
                     dtype=dtype,
                     tensor_parallel_size=tensor_parallel_size,
                     distributed_executor_backend=distributed_executor_backend,
-                     enforce_eager=True,
+                     enforce_eager=True) as vllm_model:
-                     **vlm_config.as_cli_args_dict()) as vllm_model:
        vllm_outputs_per_image = [
            vllm_model.generate_greedy_logprobs(prompts,
                                                max_tokens,
@@ -126,7 +103,7 @@ def run_test(
    # use eager mode for hf runner, since phi3_v didn't work with flash_attn
    hf_model_kwargs = {"_attn_implementation": "eager"}
-    with hf_runner(model_id, dtype=dtype,
+    with hf_runner(model, dtype=dtype,
                   model_kwargs=hf_model_kwargs) as hf_model:
        eos_token_id = hf_model.processor.tokenizer.eos_token_id
        hf_outputs_per_image = [
@@ -143,7 +120,7 @@ def run_test(
        check_logprobs_close(
            outputs_0_lst=hf_outputs,
            outputs_1_lst=[
-                vllm_to_hf_output(vllm_output, vlm_config, model_id)
+                vllm_to_hf_output(vllm_output, model)
                for vllm_output in vllm_outputs
            ],
            name_0="hf",
@@ -153,7 +130,7 @@ def run_test(
 # Since we use _attn_implementation="eager" for hf_runner, there is more
 # significant numerical difference. The basic `logprobs=5` fails to pass.
-@pytest.mark.parametrize("model_and_config", model_and_vl_config)
+@pytest.mark.parametrize("model", models)
 @pytest.mark.parametrize(
    "size_factors",
    [
@@ -170,14 +147,13 @@ def run_test(
 @pytest.mark.parametrize("dtype", [target_dtype])
 @pytest.mark.parametrize("max_tokens", [128])
 @pytest.mark.parametrize("num_logprobs", [10])
-def test_models(hf_runner, vllm_runner, image_assets, model_and_config,
+def test_models(hf_runner, vllm_runner, image_assets, model, size_factors,
-                size_factors, dtype: str, max_tokens: int,
+                dtype: str, max_tokens: int, num_logprobs: int) -> None:
-                num_logprobs: int) -> None:
    run_test(
        hf_runner,
        vllm_runner,
        image_assets,
-        model_and_config,
+        model,
        size_factors=size_factors,
        dtype=dtype,
        max_tokens=max_tokens,

--- a/vllm/config.py
+++ b/vllm/config.py
 import enum
 import json
 from dataclasses import dataclass, field, fields
-from typing import (TYPE_CHECKING, Any, ClassVar, Dict, List, Optional, Tuple,
+from typing import TYPE_CHECKING, ClassVar, List, Optional, Tuple, Union
-                    Union)
 import torch
 from transformers import PretrainedConfig
@@ -120,7 +119,7 @@ class ModelConfig:
        disable_sliding_window: bool = False,
        skip_tokenizer_init: bool = False,
        served_model_name: Optional[Union[str, List[str]]] = None,
-        multimodal_config: Optional["VisionLanguageConfig"] = None,
+        multimodal_config: Optional["MultiModalConfig"] = None,
    ) -> None:
        self.model = model
        self.tokenizer = tokenizer
@@ -1289,35 +1288,12 @@ class LoRAConfig:
            raise ValueError("LoRA is not supported with chunked prefill yet.")
-# TODO: To be replaced by MultiModalConfig.
 @dataclass
-class VisionLanguageConfig:
+class MultiModalConfig:
    """Configs the input data format and how models should run for
-    vision language models."""
+    multimodal models."""
-    # The input id corresponding to image token.
+    # TODO: Add configs to init vision tower or not.
-    image_token_id: int
+    pass
-    # Used for running `run_prefill_max_token`.
-    # For models that support varying resolution, this corresponds to
-    # worst case scenario (biggest supported resolution).
-    image_input_shape: tuple
-    image_feature_size: int
-    def as_cli_args_dict(self) -> Dict[str, Any]:
-        """Flatten vision language config to pure args.
-        Compatible with what llm entrypoint expects.
-        """
-        result: Dict[str, Any] = {}
-        for f in fields(self):
-            value = getattr(self, f.name)
-            if isinstance(value, enum.Enum):
-                result[f.name] = value.name.lower()
-            elif isinstance(value, tuple):
-                result[f.name] = ",".join([str(item) for item in value])
-            else:
-                result[f.name] = value
-        return result
 _STR_DTYPE_TO_TORCH_DTYPE = {
@@ -1541,7 +1517,7 @@ class EngineConfig:
    device_config: DeviceConfig
    load_config: LoadConfig
    lora_config: Optional[LoRAConfig]
-    vision_language_config: Optional[VisionLanguageConfig]
+    multimodal_config: Optional[MultiModalConfig]
    speculative_config: Optional[SpeculativeConfig]
    decoding_config: Optional[DecodingConfig]
    observability_config: Optional[ObservabilityConfig]
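
Note on the removed ``as_cli_args_dict`` method above: it flattened the config dataclass into keyword arguments, joining tuples into comma-separated strings so that ``image_input_shape`` matched the string form the CLI expected. The standalone sketch below reproduces that logic for reference when reading older code. ``LegacyVisionConfig`` is a stand-in defined here for illustration, not the original class.

```python
import enum
from dataclasses import dataclass, fields
from typing import Any, Dict, Tuple


@dataclass
class LegacyVisionConfig:
    """Stand-in for the removed VisionLanguageConfig (illustration only)."""
    image_token_id: int
    image_input_shape: Tuple[int, ...]
    image_feature_size: int


def as_cli_args_dict(config: Any) -> Dict[str, Any]:
    """Flatten a dataclass instance the way the removed method did."""
    result: Dict[str, Any] = {}
    for f in fields(config):
        value = getattr(config, f.name)
        if isinstance(value, enum.Enum):
            result[f.name] = value.name.lower()
        elif isinstance(value, tuple):
            result[f.name] = ",".join(str(item) for item in value)
        else:
            result[f.name] = value
    return result


legacy = LegacyVisionConfig(32000, (1, 3, 336, 336), 576)
print(as_cli_args_dict(legacy))
```

After this commit none of these three fields exist on ``MultiModalConfig``, so there is nothing left to flatten.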

--- a/vllm/engine/arg_utils.py
+++ b/vllm/engine/arg_utils.py
@@ -6,11 +6,11 @@ from typing import List, Optional, Tuple, Union
 from vllm.config import (CacheConfig, DecodingConfig, DeviceConfig,
                         EngineConfig, LoadConfig, LoRAConfig, ModelConfig,
-                         ObservabilityConfig, ParallelConfig, SchedulerConfig,
+                         MultiModalConfig, ObservabilityConfig, ParallelConfig,
-                         SpeculativeConfig, TokenizerPoolConfig,
+                         SchedulerConfig, SpeculativeConfig,
-                         VisionLanguageConfig)
+                         TokenizerPoolConfig)
 from vllm.model_executor.layers.quantization import QUANTIZATION_METHODS
-from vllm.utils import FlexibleArgumentParser, str_to_int_tuple
+from vllm.utils import FlexibleArgumentParser
 def nullable_str(val: str):
@@ -78,11 +78,6 @@ class EngineArgs:
    model_loader_extra_config: Optional[dict] = None
    preemption_mode: Optional[str] = None
-    # Related to Vision-language models such as llava
-    image_token_id: Optional[int] = None
-    image_input_shape: Optional[str] = None
-    image_feature_size: Optional[int] = None
    scheduler_delay_factor: float = 0.0
    enable_chunked_prefill: bool = False
@@ -106,27 +101,6 @@ class EngineArgs:
        if self.tokenizer is None:
            self.tokenizer = self.model
-    @staticmethod
-    def add_cli_args_for_vlm(
-            parser: FlexibleArgumentParser) -> FlexibleArgumentParser:
-        parser.add_argument('--image-token-id',
-                            type=int,
-                            default=None,
-                            help=('Input id for image token.'))
-        parser.add_argument(
-            '--image-input-shape',
-            type=nullable_str,
-            default=None,
-            help=('The biggest image input shape (worst for memory footprint) '
-                  'given an input type. Only used for vLLM\'s profile_run.'))
-        parser.add_argument(
-            '--image-feature-size',
-            type=int,
-            default=None,
-            help=('The image feature size along the context dimension.'))
-        return parser
    @staticmethod
    def add_cli_args(parser: FlexibleArgumentParser) -> FlexibleArgumentParser:
        """Shared CLI arguments for vLLM engine."""
@@ -484,9 +458,6 @@ class EngineArgs:
                            ],
                            help='Device type for vLLM execution.')
-        # Related to Vision-language models such as llava
-        parser = EngineArgs.add_cli_args_for_vlm(parser)
        parser.add_argument(
            '--scheduler-delay-factor',
            type=float,
@@ -648,19 +619,7 @@ class EngineArgs:
            raise ValueError(
                "BitsAndBytes load format and QLoRA adapter only support "
                f"'bitsandbytes' quantization, but got {self.quantization}")
-        if self.image_token_id is not None:
-            if (not self.image_input_shape or not self.image_feature_size):
-                raise ValueError(
-                    'Specify `image_input_shape` and '
-                    '`image_feature_size` together with `image_token_id`.')
-            vision_language_config = VisionLanguageConfig(
-                image_token_id=self.image_token_id,
-                image_input_shape=str_to_int_tuple(self.image_input_shape),
-                image_feature_size=self.image_feature_size,
-            )
-        else:
-            vision_language_config = None
+        multimodal_config = MultiModalConfig()
        device_config = DeviceConfig(device=self.device)
        model_config = ModelConfig(
@@ -685,7 +644,7 @@ class EngineArgs:
            disable_sliding_window=self.disable_sliding_window,
            skip_tokenizer_init=self.skip_tokenizer_init,
            served_model_name=self.served_model_name,
-            multimodal_config=vision_language_config)
+            multimodal_config=multimodal_config)
        cache_config = CacheConfig(
            block_size=self.block_size,
            gpu_memory_utilization=self.gpu_memory_utilization,
@@ -787,7 +746,7 @@ class EngineArgs:
            scheduler_config=scheduler_config,
            device_config=device_config,
            lora_config=lora_config,
-            vision_language_config=vision_language_config,
+            multimodal_config=multimodal_config,
            speculative_config=speculative_config,
            load_config=load_config,
            decoding_config=decoding_config,
@@ -831,7 +790,3 @@ def _engine_args_parser():
 def _async_engine_args_parser():
    return AsyncEngineArgs.add_cli_args(FlexibleArgumentParser(),
                                        async_args_only=True)
-def _vlm_engine_args_parser():
-    return EngineArgs.add_cli_args_for_vlm(FlexibleArgumentParser())
--- a/vllm/engine/llm_engine.py
+++ b/vllm/engine/llm_engine.py
@@ -7,9 +7,9 @@ from typing import Set, Type, TypeVar, Union
 from transformers import PreTrainedTokenizer
 from vllm.config import (CacheConfig, DecodingConfig, DeviceConfig, LoadConfig,
-                         LoRAConfig, ModelConfig, ObservabilityConfig,
-                         ParallelConfig, SchedulerConfig, SpeculativeConfig,
-                         VisionLanguageConfig)
+                         LoRAConfig, ModelConfig, MultiModalConfig,
+                         ObservabilityConfig, ParallelConfig, SchedulerConfig,
+                         SpeculativeConfig)
 from vllm.core.scheduler import (ScheduledSequenceGroup, Scheduler,
                                 SchedulerOutputs)
 from vllm.engine.arg_utils import EngineArgs
@@ -87,8 +87,8 @@ class LLMEngine:
        scheduler_config: The configuration related to the request scheduler.
        device_config: The configuration related to the device.
        lora_config (Optional): The configuration related to serving multi-LoRA.
-        vision_language_config (Optional): The configuration related to vision
-            language models.
+        multimodal_config (Optional): The configuration related to multimodal
+            models.
        speculative_config (Optional): The configuration related to speculative
            decoding.
        executor_class: The model executor class for managing distributed
@@ -157,7 +157,7 @@ class LLMEngine:
        device_config: DeviceConfig,
        load_config: LoadConfig,
        lora_config: Optional[LoRAConfig],
-        vision_language_config: Optional[VisionLanguageConfig],
+        multimodal_config: Optional[MultiModalConfig],
        speculative_config: Optional[SpeculativeConfig],
        decoding_config: Optional[DecodingConfig],
        observability_config: Optional[ObservabilityConfig],
@@ -215,7 +215,7 @@ class LLMEngine:
        self.model_config = model_config
        self.cache_config = cache_config
        self.lora_config = lora_config
-        self.vision_language_config = vision_language_config
+        self.multimodal_config = multimodal_config
        self.parallel_config = parallel_config
        self.scheduler_config = scheduler_config
        self.device_config = device_config
@@ -247,7 +247,7 @@ class LLMEngine:
            scheduler_config=scheduler_config,
            device_config=device_config,
            lora_config=lora_config,
-            vision_language_config=vision_language_config,
+            multimodal_config=multimodal_config,
            speculative_config=speculative_config,
            load_config=load_config,
        )

--- a/vllm/entrypoints/llm.py
+++ b/vllm/entrypoints/llm.py
@@ -121,6 +121,11 @@ class LLM:
    ) -> None:
        if "disable_log_stats" not in kwargs:
            kwargs["disable_log_stats"] = True
+        removed_vision_keys = ("image_token_id", "image_feature_size",
+                               "image_input_shape", "image_input_type")
+        if any(k in kwargs for k in removed_vision_keys):
+            raise TypeError(
+                "There is no need to pass vision-related arguments anymore.")
        engine_args = EngineArgs(
            model=model,
            tokenizer=tokenizer,
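
Review note: the guard added to `LLM.__init__` above can be exercised in isolation. The following is a minimal sketch, assuming a hypothetical helper `reject_removed_vision_kwargs` (it is not part of vLLM) that mirrors the check in this hunk:

```python
# Hypothetical standalone helper mirroring the check added to LLM.__init__.
# The function name is an assumption made for illustration, not vLLM API.
REMOVED_VISION_KEYS = ("image_token_id", "image_feature_size",
                       "image_input_shape", "image_input_type")


def reject_removed_vision_kwargs(kwargs: dict) -> None:
    """Raise TypeError if any removed vision-related argument is present."""
    if any(k in kwargs for k in REMOVED_VISION_KEYS):
        raise TypeError(
            "There is no need to pass vision-related arguments anymore.")


# Unrelated keyword arguments pass through without error.
reject_removed_vision_kwargs({"disable_log_stats": True})

# A removed argument is rejected before any engine arguments are built.
try:
    reject_removed_vision_kwargs({"image_token_id": 32000})
    rejected = False
except TypeError:
    rejected = True
```

Raising early means that callers still passing the old arguments get an explicit error rather than having them silently forwarded through `**kwargs`.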

--- a/vllm/entrypoints/openai/serving_chat.py
+++ b/vllm/entrypoints/openai/serving_chat.py
@@ -109,23 +109,12 @@ class OpenAIServingChat(OpenAIServing):
                          "paligemma"):
            # These models do not use image tokens in the prompt
            return None
-        # The default behaviour assumes that the image token is
-        # available to the tokenizer.
-        # (Suitable for LLaVA, Idefics2, DeepSeek-VL)
-        vlm_config = self.model_config.multimodal_config
-        if vlm_config is None:
-            raise ValueError(
-                "'image_url' input is not supported as the loaded "
-                "model is not multimodal.")
-        image_token_id = vlm_config.image_token_id
-        if vlm_config.image_token_id is None:
-            raise ValueError(
-                "'image_url' input is not supported as the loaded "
-                "model does not specify an image token.")
-        return self.tokenizer.decode(image_token_id)
+        if model_type.startswith("llava"):
+            return self.tokenizer.decode(
+                self.model_config.hf_config.image_token_index)
+        else:
+            raise TypeError(f"Unknown model type: {model_type}")
    # TODO: Let user specify how to insert image tokens into prompt
    # (similar to chat template)
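
Review note: after this change the image token string is derived from the model's HF config rather than from a user-supplied `image_token_id`. The following is a minimal sketch of that lookup, assuming a hypothetical `StubTokenizer` and a free function `image_token_str` (both are illustrative stand-ins, not vLLM or Transformers API); only the `"paligemma"` entry of the no-image-token tuple is reproduced because the rest of that tuple is not visible in this hunk:

```python
from types import SimpleNamespace
from typing import Optional


class StubTokenizer:
    """Hypothetical stand-in for a tokenizer; maps token ids to strings."""

    def __init__(self, vocab: dict) -> None:
        self._vocab = vocab

    def decode(self, token_id: int) -> str:
        return self._vocab[token_id]


def image_token_str(model_type: str, hf_config, tokenizer) -> Optional[str]:
    """Return the prompt placeholder for an image, or None if unused."""
    if model_type == "paligemma":
        # This model does not use image tokens in the prompt.
        return None
    if model_type.startswith("llava"):
        # The token id comes from the HF config, not from engine arguments.
        return tokenizer.decode(hf_config.image_token_index)
    # Format the offending model type into the error message.
    raise TypeError(f"Unknown model type: {model_type}")


# Assumed values for illustration only.
hf_config = SimpleNamespace(image_token_index=32000)
tokenizer = StubTokenizer({32000: "<image>"})
llava_token = image_token_str("llava_next", hf_config, tokenizer)
```

Any model type outside the handled branches raises `TypeError`, so adding support for another architecture means adding a branch here.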

--- a/vllm/executor/cpu_executor.py
+++ b/vllm/executor/cpu_executor.py
@@ -46,7 +46,7 @@ class CPUExecutor(ExecutorBase):
            rank=0,
            distributed_init_method=distributed_init_method,
            lora_config=self.lora_config,
-            vision_language_config=self.vision_language_config,
+            multimodal_config=self.multimodal_config,
            kv_cache_dtype=self.cache_config.cache_dtype,
            is_driver_worker=True,
        )

--- a/vllm/executor/executor_base.py
+++ b/vllm/executor/executor_base.py
@@ -3,8 +3,8 @@ from abc import ABC, abstractmethod
 from typing import List, Optional, Set, Tuple
 from vllm.config import (CacheConfig, DeviceConfig, LoadConfig, LoRAConfig,
-                         ModelConfig, ParallelConfig, SchedulerConfig,
-                         SpeculativeConfig, VisionLanguageConfig)
+                         ModelConfig, MultiModalConfig, ParallelConfig,
+                         SchedulerConfig, SpeculativeConfig)
 from vllm.lora.request import LoRARequest
 from vllm.sequence import ExecuteModelRequest, SamplerOutput
@@ -26,7 +26,7 @@ class ExecutorBase(ABC):
        device_config: DeviceConfig,
        load_config: LoadConfig,
        lora_config: Optional[LoRAConfig],
-        vision_language_config: Optional[VisionLanguageConfig],
+        multimodal_config: Optional[MultiModalConfig],
        speculative_config: Optional[SpeculativeConfig],
    ) -> None:
        self.model_config = model_config
@@ -36,7 +36,7 @@ class ExecutorBase(ABC):
        self.parallel_config = parallel_config
        self.scheduler_config = scheduler_config
        self.device_config = device_config
-        self.vision_language_config = vision_language_config
+        self.multimodal_config = multimodal_config
        self.speculative_config = speculative_config
        self._init_executor()
@@ -120,7 +120,7 @@ class ExecutorAsyncBase(ExecutorBase):
        device_config: DeviceConfig,
        load_config: LoadConfig,
        lora_config: Optional[LoRAConfig],
-        vision_language_config: Optional[VisionLanguageConfig],
+        multimodal_config: Optional[MultiModalConfig],
        speculative_config: Optional[SpeculativeConfig],
    ) -> None:
        # This locks each pipeline parallel stage so multiple virtual engines
@@ -132,8 +132,7 @@ class ExecutorAsyncBase(ExecutorBase):
        super().__init__(model_config, cache_config, parallel_config,
                         scheduler_config, device_config, load_config,
-                         lora_config, vision_language_config,
-                         speculative_config)
+                         lora_config, multimodal_config, speculative_config)
    @abstractmethod
    async def execute_model_async(

--- a/vllm/executor/gpu_executor.py
+++ b/vllm/executor/gpu_executor.py
@@ -43,7 +43,7 @@ class GPUExecutor(ExecutorBase):
            rank=rank,
            distributed_init_method=distributed_init_method,
            lora_config=self.lora_config,
-            vision_language_config=self.vision_language_config,
+            multimodal_config=self.multimodal_config,
            speculative_config=self.speculative_config,
            is_driver_worker=(not self.parallel_config)
            or (rank % self.parallel_config.tensor_parallel_size == 0),

--- a/vllm/executor/openvino_executor.py
+++ b/vllm/executor/openvino_executor.py
@@ -47,7 +47,7 @@ class OpenVINOExecutor(ExecutorBase):
            rank=0,
            distributed_init_method=distributed_init_method,
            lora_config=self.lora_config,
-            vision_language_config=self.vision_language_config,
+            multimodal_config=self.multimodal_config,
            kv_cache_dtype=self.cache_config.cache_dtype,
            is_driver_worker=True,
        )