Commit 25e48a3a, authored by Cyrus Leung, committed by GitHub

[Doc] Update usage of `--limit-mm-per-prompt` (#34148)


Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
parent 8a5e0e2b
@@ -521,7 +521,7 @@ First, launch the OpenAI-compatible server:
 ```bash
 vllm serve microsoft/Phi-3.5-vision-instruct --runner generate \
-  --trust-remote-code --max-model-len 4096 --limit-mm-per-prompt '{"image":2}'
+  --trust-remote-code --max-model-len 4096 --limit-mm-per-prompt.image 2
 ```
 Then, you can use the OpenAI client as follows:
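Reviewer note: every hunk in this commit applies the same mechanical rewrite, turning a count-only JSON value such as `'{"image":2}'` into the dotted form `--limit-mm-per-prompt.image 2`. The following is a minimal sketch of that rewrite for anyone migrating their own launch scripts. The helper name `dotted_limit_args` is hypothetical and not part of vLLM, and the sketch assumes that only count-only limits (integer values) need converting.

```python
import json
import shlex


def dotted_limit_args(legacy_json: str) -> list[str]:
    """Convert a count-only JSON limit into dotted CLI arguments.

    Hypothetical migration helper (not a vLLM API). It only handles the
    count-only form seen in this diff, e.g. '{"image":2}'.
    """
    limits = json.loads(legacy_json)
    if not isinstance(limits, dict):
        raise ValueError("expected a JSON object mapping modality to count")

    args: list[str] = []
    for modality, count in limits.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(count, bool) or not isinstance(count, int):
            raise ValueError(
                f"expected an integer count for {modality!r}, got {count!r}"
            )
        args += [f"--limit-mm-per-prompt.{modality}", str(count)]
    return args


if __name__ == "__main__":
    # Prints: --limit-mm-per-prompt.image 2
    print(shlex.join(dotted_limit_args('{"image":2}')))
```

The returned list can be appended to an existing `vllm serve` argument list. Structured per-modality options, if used, are outside the scope of this sketch.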
......
@@ -658,7 +658,7 @@ On the other hand, modalities separated by `/` are mutually exclusive.
 See [this page](../features/multimodal_inputs.md) on how to pass multi-modal inputs to the model.
 !!! tip
-    For hybrid-only models such as Llama-4, Step3 and Mistral-3, a text-only mode can be enabled by setting all supported multimodal modalities to 0 (e.g, `--limit-mm-per-prompt '{"image":0}`) so that their multimodal modules will not be loaded to free up more GPU memory for KV cache.
+    For hybrid-only models such as Llama-4, Step3, Mistral-3 and Qwen-3.5, a text-only mode can be enabled by setting all supported multimodal modalities to 0 (`--language-model-only`) so that their multimodal modules will not be loaded to free up more GPU memory for KV cache.
 !!! note
     vLLM currently supports adding LoRA adapters to the language backbone for most multimodal models. Additionally, vLLM now experimentally supports adding LoRA to the tower and connector modules for some multimodal models. See [this page](../features/lora.md).
......
@@ -18,11 +18,11 @@ from vllm.assets.image import ImageAsset
 # # Mistral format
 # vllm serve mistralai/Mistral-Small-3.1-24B-Instruct-2503 \
 # --tokenizer-mode mistral --config-format mistral --load-format mistral \
-# --limit-mm-per-prompt '{"image":4}' --max-model-len 16384
+# --limit-mm-per-prompt.image 4 --max-model-len 16384
 #
 # # HF format
 # vllm serve mistralai/Mistral-Small-3.1-24B-Instruct-2503 \
-# --limit-mm-per-prompt '{"image":4}' --max-model-len 16384
+# --limit-mm-per-prompt.image 4 --max-model-len 16384
 # ```
 #
 # - Client:
......
@@ -10,7 +10,7 @@ vllm serve llava-hf/llava-1.5-7b-hf
 (multi-image inference with Phi-3.5-vision-instruct)
 vllm serve microsoft/Phi-3.5-vision-instruct --runner generate \
-  --trust-remote-code --max-model-len 4096 --limit-mm-per-prompt '{"image":2}'
+  --trust-remote-code --max-model-len 4096 --limit-mm-per-prompt.image 2
 (audio inference with Ultravox)
 vllm serve fixie-ai/ultravox-v0_5-llama-3_2-1b \
......
@@ -7,7 +7,7 @@ NOTE:
 vllm serve muziyongshixin/Qwen2.5-VL-7B-for-VideoCls \
   --runner pooling \
   --max-model-len 5000 \
-  --limit-mm-per-prompt '{"video": 1}' \
+  --limit-mm-per-prompt.video 1 \
   --hf-overrides '{"text_config": {"architectures": ["Qwen2_5_VLForSequenceClassification"]}}'
 """
......
@@ -55,12 +55,12 @@ class MultiModalConfig:
 """Controls the behavior of multimodal models."""
 language_model_only: bool = False
-"""If True, disables all multimodal inputs by setting all modality limits
-to 0. Equivalent to setting --limit-mm-per-prompt to 0 for every
-modality."""
+"""If True, disables all multimodal inputs by setting all modality limits to 0.
+Equivalent to setting `--limit-mm-per-prompt` to 0 for every modality."""
 limit_per_prompt: dict[str, DummyOptions] = Field(default_factory=dict)
 """The maximum number of input items and options allowed per
 prompt for each modality.
 Defaults to 999 for each modality.
 Legacy format (count only):
......