vLLM supports generative and pooling models across various tasks.
vLLM supports [generative](generative-models) and [pooling](pooling-models) models across various tasks.
If a model supports more than one task, you can set the task via the `--task` argument.
If a model supports more than one task, you can set the task via the `--task` argument.
For each task, we list the model architectures that have been implemented in vLLM.
For each task, we list the model architectures that have been implemented in vLLM.
Alongside each architecture, we include some popular models that use it.
Alongside each architecture, we include some popular models that use it.
## Loading a Model
## Model Implementation
### HuggingFace Hub
### vLLM
By default, vLLM loads models from [HuggingFace (HF) Hub](https://huggingface.co/models).
If vLLM natively supports a model, its implementation can be found in <gh-file:vllm/model_executor/models>.
To determine whether a given model is natively supported, you can check the `config.json` file inside the HF repository.
These models are what we list in <project:#supported-text-models> and <project:#supported-mm-models>.
If the `"architectures"` field contains a model architecture listed below, then it should be natively supported.
Models do not _need_ to be natively supported to be used in vLLM.
(transformers-backend)=
The <project:#transformers-fallback> enables you to run models directly using their Transformers implementation (or even remote code on the Hugging Face Model Hub!).
:::{tip}
### Transformers
The easiest way to check if your model is really supported at runtime is to run the program below:
```python
vLLM also supports model implementations that are available in Transformers. This does not currently work for all models, but most decoder language models are supported, and vision language model support is planned!
from vllm import LLM
# For generative models (task=generate) only
To check if the modeling backend is Transformers, you can simply do this:
llm = LLM(model=..., task="generate")  # Name or path of your model
output = llm.generate("Hello, my name is")
print(output)
# For pooling models (task={embed,classify,reward,score}) only
```python
llm = LLM(model=..., task="embed")  # Name or path of your model
output = llm.encode("Hello, my name is")
print(output)
```
If vLLM successfully returns text (for generative models) or hidden states (for pooling models), it indicates that your model is supported.
:::
Otherwise, please refer to [Adding a New Model](#new-model) for instructions on how to implement your model in vLLM.
Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) to request vLLM support.
(transformers-fallback)=
### Transformers fallback
vLLM can fall back to model implementations that are available in Transformers. This does not work for all models for now, but most decoder language models are supported, and vision language model support is planned!
To check if the backend is Transformers, you can simply do this:
```python
from vllm import LLM
from vllm import LLM
llm = LLM(model=..., task="generate")  # Name or path of your model
llm = LLM(model=..., task="generate")  # Name or path of your model
llm.apply_model(lambda model: print(type(model)))
llm.apply_model(lambda model: print(type(model)))
```
```
If it is `TransformersModel` then it means it's based on Transformers!
If it is `TransformersForCausalLM` then it means it's based on Transformers!
:::{tip}
:::{tip}
You can force the use of `TransformersModel` by setting `model_impl="transformers"` for <project:#offline-inference> or `--model-impl transformers` for the <project:#openai-compatible-server>.
You can force the use of `TransformersForCausalLM` by setting `model_impl="transformers"` for <project:#offline-inference> or `--model-impl transformers` for the <project:#openai-compatible-server>.
:::
:::
:::{note}
:::{note}
...
@@ -69,27 +42,26 @@ vLLM may not fully optimise the Transformers implementation so you may see degra
...
@@ -69,27 +42,26 @@ vLLM may not fully optimise the Transformers implementation so you may see degra
#### Supported features
#### Supported features
The Transformers fallback explicitly supports the following features:
The Transformers modeling backend explicitly supports the following features:
Earlier we mentioned that the Transformers fallback enables you to run remote code models directly in vLLM.
If your model is supported natively by neither vLLM nor Transformers, you can still run it in vLLM!
If you are interested in this feature, this section is for you!
Simply set `trust_remote_code=True` and vLLM will run any model on the Model Hub that is compatible with Transformers.
Simply set `trust_remote_code=True` and vLLM will run any model on the Model Hub that is compatible with Transformers.
Provided that the model writer implements their model in a compatible way, this means that you can run new models before they are officially supported in Transformers or vLLM!
Provided that the model writer implements their model in a compatible way, this means that you can run new models before they are officially supported in Transformers or vLLM!
```python
```python
from vllm import LLM
from vllm import LLM
llm = LLM(model=..., task="generate", trust_remote_code=True)  # Name or path of your model
llm = LLM(model=..., task="generate", trust_remote_code=True)  # Name or path of your model
To make your model compatible with the Transformers fallback, it needs:
To make your model compatible with the Transformers backend, it needs:
```{code-block} python
```{code-block} python
:caption: modeling_my_model.py
:caption: modeling_my_model.py
...
@@ -119,9 +91,11 @@ Here is what happens in the background:
...
@@ -119,9 +91,11 @@ Here is what happens in the background:
1. The config is loaded
1. The config is loaded
2. The `MyModel` Python class is loaded from the `auto_map`, and we check that the model `_supports_attention_backend`.
2. The `MyModel` Python class is loaded from the `auto_map`, and we check that the model `_supports_attention_backend`.
3. The `TransformersModel` backend is used. See <gh-file:vllm/model_executor/models/transformers.py>, which leverages `self.config._attn_implementation = "vllm"`, hence the need to use `ALL_ATTENTION_FUNCTION`.
3. The `TransformersForCausalLM` backend is used. See <gh-file:vllm/model_executor/models/transformers.py>, which leverages `self.config._attn_implementation = "vllm"`, hence the need to use `ALL_ATTENTION_FUNCTION`.
That's it!
To make your model compatible with tensor parallel, it needs:
For your model to be compatible with vLLM's tensor parallel and/or pipeline parallel features, you must add `base_model_tp_plan` and/or `base_model_pp_plan` to your model's config class:
```{code-block} python
```{code-block} python
:caption: configuration_my_model.py
:caption: configuration_my_model.py
...
@@ -130,20 +104,65 @@ from transformers import PretrainedConfig
...
@@ -130,20 +104,65 @@ from transformers import PretrainedConfig
- `base_model_tp_plan` is a `dict` that maps fully qualified layer name patterns to tensor parallel styles (currently only `"colwise"` and `"rowwise"` are supported).
- `base_model_pp_plan` is a `dict` that maps direct child layer names to `tuple`s of `list`s of `str`s:
* You only need to do this for layers which are not present on all pipeline stages
* vLLM assumes that there will be only one `nn.ModuleList`, which is distributed across the pipeline stages
* The `list` in the first element of the `tuple` contains the names of the input arguments
* The `list` in the last element of the `tuple` contains the names of the variables the layer outputs to in your modeling code
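As a rough illustration of the two plan formats described above, the sketch below writes both dictionaries as plain Python and checks their shape. The layer names (`layers.*.self_attn.q_proj`, `embed_tokens`, and so on) and the `check_plans` helper are hypothetical and are not part of vLLM or Transformers; in a real model the two dictionaries would be attributes of your config class.

```python
# Hypothetical layer names for illustration; use the names from your own model.
base_model_tp_plan = {
    "layers.*.self_attn.q_proj": "colwise",
    "layers.*.self_attn.k_proj": "colwise",
    "layers.*.self_attn.v_proj": "colwise",
    "layers.*.self_attn.o_proj": "rowwise",
}

# Each value is a tuple: (input argument names, ..., output variable names).
base_model_pp_plan = {
    "embed_tokens": (["input_ids"], ["inputs_embeds"]),
    "layers": (["hidden_states", "attention_mask"], ["hidden_states"]),
    "norm": (["hidden_states"], ["hidden_states"]),
}

SUPPORTED_TP_STYLES = {"colwise", "rowwise"}


def check_plans(tp_plan, pp_plan):
    """Return a list of human-readable problems found in the two plans."""
    problems = []
    for pattern, style in tp_plan.items():
        if style not in SUPPORTED_TP_STYLES:
            problems.append(
                f"{pattern}: unsupported tensor parallel style {style!r}"
            )
    for name, spec in pp_plan.items():
        # A valid entry is a non-empty tuple whose elements are lists of strings.
        well_formed = (
            isinstance(spec, tuple)
            and len(spec) > 0
            and all(
                isinstance(part, list)
                and all(isinstance(item, str) for item in part)
                for part in spec
            )
        )
        if not well_formed:
            problems.append(f"{name}: expected a tuple of lists of str")
    return problems


print(check_plans(base_model_tp_plan, base_model_pp_plan))
```

A helper like this only validates the structure of the dictionaries; it does not confirm that the patterns match real layers in your model.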
## Loading a Model
### Hugging Face Hub
By default, vLLM loads models from [Hugging Face (HF) Hub](https://huggingface.co/models).
To determine whether a given model is natively supported, you can check the `config.json` file inside the HF repository.
If the `"architectures"` field contains a model architecture listed below, then it should be natively supported.
Models do not _need_ to be natively supported to be used in vLLM.
The [Transformers backend](#transformers-backend) enables you to run models directly using their Transformers implementation (or even remote code on the Hugging Face Model Hub!).
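The `config.json` check described above can be sketched with the standard library alone. The `NATIVELY_SUPPORTED` set below is a small hypothetical subset taken from the tables later on this page, and `natively_supported_architectures` is an illustrative helper, not a vLLM function.

```python
import json

# Hypothetical subset of architecture names; the full list is in the tables below.
NATIVELY_SUPPORTED = {
    "DeciLMForCausalLM",
    "Zamba2ForCausalLM",
    "LlavaForConditionalGeneration",
}


def natively_supported_architectures(config_text, supported=NATIVELY_SUPPORTED):
    """Return the architectures in a config.json string that appear in `supported`."""
    config = json.loads(config_text)
    return [arch for arch in config.get("architectures", []) if arch in supported]


# Illustrative config.json content; a real file contains many more fields.
example_config = '{"architectures": ["Zamba2ForCausalLM"]}'
print(natively_supported_architectures(example_config))
```

An empty result does not mean the model cannot run, since the Transformers implementation may still be usable.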
:::{tip}
:::{tip}
`base_model_tp_plan` is a `dict` that maps fully qualified layer name patterns to tensor parallel styles (currently only `"colwise"` and `"rowwise"` are supported).
The easiest way to check if your model is really supported at runtime is to run the program below:
```python
from vllm import LLM
# For generative models (task=generate) only
llm = LLM(model=..., task="generate")  # Name or path of your model
output = llm.generate("Hello, my name is")
print(output)
# For pooling models (task={embed,classify,reward,score}) only
llm = LLM(model=..., task="embed")  # Name or path of your model
output = llm.encode("Hello, my name is")
print(output)
```
If vLLM successfully returns text (for generative models) or hidden states (for pooling models), it indicates that your model is supported.
:::
:::
That's it!
Otherwise, please refer to [Adding a New Model](#new-model) for instructions on how to implement your model in vLLM.
Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) to request vLLM support.
### ModelScope
### ModelScope
To use models from [ModelScope](https://www.modelscope.cn) instead of HuggingFace Hub, set an environment variable:
To use models from [ModelScope](https://www.modelscope.cn) instead of HuggingFace Hub, set an environment variable:
```shell
```shell
export VLLM_USE_MODELSCOPE=True
export VLLM_USE_MODELSCOPE=True
...
@@ -165,6 +184,8 @@ output = llm.encode("Hello, my name is")
...
@@ -165,6 +184,8 @@ output = llm.encode("Hello, my name is")
print(output)
print(output)
```
```
(supported-text-models)=
## List of Text-only Language Models
## List of Text-only Language Models
### Generative Models
### Generative Models
...
@@ -197,6 +218,11 @@ See [this page](#generative-models) for more information on how to use generativ
...
@@ -197,6 +218,11 @@ See [this page](#generative-models) for more information on how to use generativ
* `baichuan-inc/Baichuan2-13B-Chat`, `baichuan-inc/Baichuan-7B`, etc.
* `baichuan-inc/Baichuan2-13B-Chat`, `baichuan-inc/Baichuan-7B`, etc.
@@ -224,7 +250,7 @@ See [this page](#generative-models) for more information on how to use generativ
...
@@ -224,7 +250,7 @@ See [this page](#generative-models) for more information on how to use generativ
* ✅︎
* ✅︎
- * `DeciLMForCausalLM`
- * `DeciLMForCausalLM`
* DeciLM
* DeciLM
* `Deci/DeciLM-7B`, `Deci/DeciLM-7B-instruct`, etc.
* `nvidia/Llama-3_3-Nemotron-Super-49B-v1`, etc.
*
*
* ✅︎
* ✅︎
- * `DeepseekForCausalLM`
- * `DeepseekForCausalLM`
...
@@ -482,6 +508,11 @@ See [this page](#generative-models) for more information on how to use generativ
...
@@ -482,6 +508,11 @@ See [this page](#generative-models) for more information on how to use generativ
* `xverse/XVERSE-7B-Chat`, `xverse/XVERSE-13B-Chat`, `xverse/XVERSE-65B-Chat`, etc.
* `xverse/XVERSE-7B-Chat`, `xverse/XVERSE-13B-Chat`, `xverse/XVERSE-65B-Chat`, etc.
* ✅︎
* ✅︎
* ✅︎
* ✅︎
- * `MiniMaxText01ForCausalLM`
* MiniMax-Text
* `MiniMaxAI/MiniMax-Text-01`, etc.
*
* ✅︎
- * `Zamba2ForCausalLM`
- * `Zamba2ForCausalLM`
* Zamba2
* Zamba2
* `Zyphra/Zamba2-7B-instruct`, `Zyphra/Zamba2-2.7B-instruct`, `Zyphra/Zamba2-1.2B-instruct`, etc.
* `Zyphra/Zamba2-7B-instruct`, `Zyphra/Zamba2-2.7B-instruct`, `Zyphra/Zamba2-1.2B-instruct`, etc.
...
@@ -545,7 +576,7 @@ you should explicitly specify the task type to ensure that the model is used in
...
@@ -545,7 +576,7 @@ you should explicitly specify the task type to ensure that the model is used in
*
*
- * `XLMRobertaModel`
- * `XLMRobertaModel`
* XLM-RoBERTa-based
* XLM-RoBERTa-based
* `intfloat/multilingual-e5-large`, etc.
* `intfloat/multilingual-e5-large`, `jinaai/jina-reranker-v2-base-multilingual`, etc.
*
*
*
*
:::
:::
...
@@ -732,6 +763,13 @@ See [this page](#generative-models) for more information on how to use generativ
...
@@ -732,6 +763,13 @@ See [this page](#generative-models) for more information on how to use generativ
*
*
* ✅︎
* ✅︎
* ✅︎
* ✅︎
- * `AyaVisionForConditionalGeneration`
* Aya Vision
* T + I<sup>+</sup>
* `CohereForAI/aya-vision-8b`, `CohereForAI/aya-vision-32b`, etc.
*
* ✅︎
* ✅︎
- * `Blip2ForConditionalGeneration`
- * `Blip2ForConditionalGeneration`
* BLIP-2
* BLIP-2
* T + I<sup>E</sup>
* T + I<sup>E</sup>
...
@@ -802,6 +840,13 @@ See [this page](#generative-models) for more information on how to use generativ
...
@@ -802,6 +840,13 @@ See [this page](#generative-models) for more information on how to use generativ
*
*
* ✅︎
* ✅︎
* ✅︎
* ✅︎
- * `Llama4ForConditionalGeneration`
* Llama-4-17B-Omni-Instruct
* T + I<sup>+</sup>
* `meta-llama/Llama-4-Scout-17B-16E-Instruct`, `meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8`, `meta-llama/Llama-4-Maverick-17B-128E-Instruct`, etc.
*
*
* ✅︎
- * `LlavaForConditionalGeneration`
- * `LlavaForConditionalGeneration`
* LLaVA-1.5
* LLaVA-1.5
* T + I<sup>E+</sup>
* T + I<sup>E+</sup>
...
@@ -836,14 +881,21 @@ See [this page](#generative-models) for more information on how to use generativ
...
@@ -836,14 +881,21 @@ See [this page](#generative-models) for more information on how to use generativ
* `openbmb/MiniCPM-o-2_6`, etc.
* `openbmb/MiniCPM-o-2_6`, etc.
* ✅︎
* ✅︎
* ✅︎
* ✅︎
*
* ✅︎
- * `MiniCPMV`
- * `MiniCPMV`
* MiniCPM-V
* MiniCPM-V
* T + I<sup>E+</sup> + V<sup>E+</sup>
* T + I<sup>E+</sup> + V<sup>E+</sup>
* `openbmb/MiniCPM-V-2` (see note), `openbmb/MiniCPM-Llama3-V-2_5`, `openbmb/MiniCPM-V-2_6`, etc.
* `openbmb/MiniCPM-V-2` (see note), `openbmb/MiniCPM-Llama3-V-2_5`, `openbmb/MiniCPM-V-2_6`, etc.
* ✅︎
* ✅︎
* ✅︎
* ✅︎
* ✅︎
- * `Mistral3ForConditionalGeneration`
* Mistral3
* T + I<sup>+</sup>
* `mistralai/Mistral-Small-3.1-24B-Instruct-2503`, etc.
*
*
* ✅︎
* ✅︎
- * `MllamaForConditionalGeneration`
- * `MllamaForConditionalGeneration`
* Llama 3.2
* Llama 3.2
* T + I<sup>+</sup>
* T + I<sup>+</sup>
...
@@ -853,7 +905,7 @@ See [this page](#generative-models) for more information on how to use generativ
...
@@ -853,7 +905,7 @@ See [this page](#generative-models) for more information on how to use generativ
*
*
- * `MolmoForCausalLM`
- * `MolmoForCausalLM`
* Molmo
* Molmo
* T + I
* T + I<sup>+</sup>
* `allenai/Molmo-7B-D-0924`, `allenai/Molmo-7B-O-0924`, etc.
* `allenai/Molmo-7B-D-0924`, `allenai/Molmo-7B-O-0924`, etc.
* ✅︎
* ✅︎
* ✅︎
* ✅︎
...
@@ -921,6 +973,13 @@ See [this page](#generative-models) for more information on how to use generativ
...
@@ -921,6 +973,13 @@ See [this page](#generative-models) for more information on how to use generativ
* ✅︎
* ✅︎
* ✅︎
* ✅︎
* ✅︎
* ✅︎
- * `SkyworkR1VChatModel`
* Skywork-R1V-38B
* T + I
* `Skywork/Skywork-R1V-38B`
*
* ✅︎
* ✅︎
- * `UltravoxModel`
- * `UltravoxModel`
* Ultravox
* Ultravox
* T + A<sup>E+</sup>
* T + A<sup>E+</sup>
...
@@ -930,10 +989,10 @@ See [this page](#generative-models) for more information on how to use generativ
...
@@ -930,10 +989,10 @@ See [this page](#generative-models) for more information on how to use generativ
* ✅︎
* ✅︎
:::
:::
<sup>^</sup> You need to set the architecture name via `--hf-overrides` to match the one in vLLM.
<sup>^</sup> You need to set the architecture name via `--hf-overrides` to match the one in vLLM.
• For example, to use DeepSeek-VL2 series models:
• For example, to use DeepSeek-VL2 series models:
<sup>E</sup> Pre-computed embeddings can be inputted for this modality.
<sup>E</sup> Pre-computed embeddings can be inputted for this modality.
<sup>+</sup> Multiple items can be inputted per text prompt for this modality.
<sup>+</sup> Multiple items can be inputted per text prompt for this modality.
:::{important}
:::{important}
...
@@ -1059,7 +1118,7 @@ At vLLM, we are committed to facilitating the integration and support of third-p
...
@@ -1059,7 +1118,7 @@ At vLLM, we are committed to facilitating the integration and support of third-p
2. **Best-Effort Consistency**: While we aim to maintain a level of consistency between the models implemented in vLLM and other frameworks like transformers, complete alignment is not always feasible. Factors like acceleration techniques and the use of low-precision computations can introduce discrepancies. Our commitment is to ensure that the implemented models are functional and produce sensible results.
2. **Best-Effort Consistency**: While we aim to maintain a level of consistency between the models implemented in vLLM and other frameworks like transformers, complete alignment is not always feasible. Factors like acceleration techniques and the use of low-precision computations can introduce discrepancies. Our commitment is to ensure that the implemented models are functional and produce sensible results.
:::{tip}
:::{tip}
When comparing the output of `model.generate` from HuggingFace Transformers with the output of `llm.generate` from vLLM, note that the former reads the model's generation config file (i.e., [generation_config.json](https://github.com/huggingface/transformers/blob/19dabe96362803fb0a9ae7073d03533966598b17/src/transformers/generation/utils.py#L1945)) and applies the default parameters for generation, while the latter only uses the parameters passed to the function. Ensure all sampling parameters are identical when comparing outputs.
When comparing the output of `model.generate` from HuggingFace Transformers with the output of `llm.generate` from vLLM, note that the former reads the model's generation config file (i.e., [generation_config.json](https://github.com/huggingface/transformers/blob/19dabe96362803fb0a9ae7073d03533966598b17/src/transformers/generation/utils.py#L1945)) and applies the default parameters for generation, while the latter only uses the parameters passed to the function. Ensure all sampling parameters are identical when comparing outputs.
:::
:::
3. **Issue Resolution and Model Updates**: Users are encouraged to report any bugs or issues they encounter with third-party models. Proposed fixes should be submitted via PRs, with a clear explanation of the problem and the rationale behind the proposed solution. If a fix for one model impacts another, we rely on the community to highlight and address these cross-model dependencies. Note: for bugfix PRs, it is good etiquette to inform the original author to seek their feedback.
3. **Issue Resolution and Model Updates**: Users are encouraged to report any bugs or issues they encounter with third-party models. Proposed fixes should be submitted via PRs, with a clear explanation of the problem and the rationale behind the proposed solution. If a fix for one model impacts another, we rely on the community to highlight and address these cross-model dependencies. Note: for bugfix PRs, it is good etiquette to inform the original author to seek their feedback.
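When comparing outputs as the tip above suggests, it can help to first see which sampling defaults a model's `generation_config.json` would apply. The sketch below is one possible way to do that with the standard library; the `sampling_defaults` helper, the `SAMPLING_KEYS` selection, and the example file content are assumptions for illustration, not part of vLLM or Transformers.

```python
import json

# Keys that commonly appear in generation_config.json and affect sampling.
# This selection is an assumption; extend it for the model you are comparing.
SAMPLING_KEYS = ("do_sample", "temperature", "top_p", "top_k", "repetition_penalty")


def sampling_defaults(generation_config_text):
    """Return only the sampling-related entries of a generation_config.json string."""
    config = json.loads(generation_config_text)
    return {key: config[key] for key in SAMPLING_KEYS if key in config}


# Illustrative file content, not taken from a real model repository.
example = '{"do_sample": true, "temperature": 0.6, "top_p": 0.9, "eos_token_id": 2}'
print(sampling_defaults(example))
```

Once the defaults are known, set the equivalent sampling parameters explicitly in both frameworks so that the comparison is like for like.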
@@ -31,6 +31,8 @@ vLLM supports an experimental feature chunked prefill. Chunked prefill allows to
...
@@ -31,6 +31,8 @@ vLLM supports an experimental feature chunked prefill. Chunked prefill allows to
You can enable the feature by specifying `--enable-chunked-prefill` in the command line or setting `enable_chunked_prefill=True` in the LLM constructor.
You can enable the feature by specifying `--enable-chunked-prefill` in the command line or setting `enable_chunked_prefill=True` in the LLM constructor.
@@ -21,6 +21,8 @@ To input multi-modal data, follow this schema in {class}`vllm.inputs.PromptType`
...
@@ -21,6 +21,8 @@ To input multi-modal data, follow this schema in {class}`vllm.inputs.PromptType`
You can pass a single image to the `'image'` field of the multi-modal dictionary, as shown in the following examples:
You can pass a single image to the `'image'` field of the multi-modal dictionary, as shown in the following examples:
```python
```python
from vllm import LLM
llm = LLM(model="llava-hf/llava-1.5-7b-hf")
llm = LLM(model="llava-hf/llava-1.5-7b-hf")
# Refer to the HuggingFace repo for the correct format to use
# Refer to the HuggingFace repo for the correct format to use
...
@@ -65,6 +67,8 @@ Full example: <gh-file:examples/offline_inference/vision_language.py>
...
@@ -65,6 +67,8 @@ Full example: <gh-file:examples/offline_inference/vision_language.py>
To substitute multiple images inside the same text prompt, you can pass in a list of images instead:
To substitute multiple images inside the same text prompt, you can pass in a list of images instead:
```python
```python
from vllm import LLM
llm = LLM(
llm = LLM(
    model="microsoft/Phi-3.5-vision-instruct",
    model="microsoft/Phi-3.5-vision-instruct",
    trust_remote_code=True,  # Required to load Phi-3.5-vision
    trust_remote_code=True,  # Required to load Phi-3.5-vision
...
@@ -96,6 +100,8 @@ Full example: <gh-file:examples/offline_inference/vision_language_multi_image.py
...
@@ -96,6 +100,8 @@ Full example: <gh-file:examples/offline_inference/vision_language_multi_image.py
Multi-image input can be extended to perform video captioning. We show this with [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) as it supports videos:
Multi-image input can be extended to perform video captioning. We show this with [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) as it supports videos:
```python
```python
from vllm import LLM
# Specify the maximum number of frames per video to be 4. This can be changed.
# Specify the maximum number of frames per video to be 4. This can be changed.
vLLM collects anonymous usage data by default to help the engineering team better understand which hardware and model configurations are widely used. This data allows them to prioritize their efforts on the most common workloads. The collected data is transparent, does not contain any sensitive information, and will be publicly released for the community's benefit.
vLLM collects anonymous usage data by default to help the engineering team better understand which hardware and model configurations are widely used. This data allows them to prioritize their efforts on the most common workloads. The collected data is transparent and does not contain any sensitive information.
A subset of the data, after cleaning and aggregation, will be publicly released for the community's benefit. For example, you can see the 2024 usage report [here](https://2024.vllm.ai).