Commit 4d3a2c28 authored by zhuwenwen's avatar zhuwenwen
Browse files

Merge tag 'v0.6.5' into v0.6.5-dev

parents 92ec5d8e 2d1b9baa
......@@ -43,7 +43,7 @@ vLLM is flexible and easy to use with:
* Tensor parallelism and pipeline parallelism support for distributed inference
* Streaming outputs
* OpenAI-compatible API server
* Support NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, TPU, and AWS Trainium and Inferentia Accelerators.
* Support NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi® accelerators and GPUs, PowerPC CPUs, TPU, and AWS Trainium and Inferentia Accelerators.
* Prefix caching support
* Multi-lora support
......@@ -66,6 +66,8 @@ Documentation
getting_started/amd-installation
getting_started/openvino-installation
getting_started/cpu-installation
getting_started/gaudi-installation
getting_started/arm-installation
getting_started/neuron-installation
getting_started/tpu-installation
getting_started/xpu-installation
......@@ -79,26 +81,40 @@ Documentation
serving/openai_compatible_server
serving/deploying_with_docker
serving/deploying_with_k8s
serving/deploying_with_helm
serving/deploying_with_nginx
serving/distributed_serving
serving/metrics
serving/env_vars
serving/usage_stats
serving/integrations
serving/tensorizer
serving/faq
.. toctree::
:maxdepth: 1
:caption: Models
models/supported_models
models/generative_models
models/pooling_models
models/adding_model
models/enabling_multimodal_inputs
models/engine_args
models/lora
models/vlm
models/spec_decode
models/performance
.. toctree::
:maxdepth: 1
:caption: Usage
usage/lora
usage/multimodal_inputs
usage/tool_calling
usage/structured_outputs
usage/spec_decode
usage/compatibility_matrix
usage/performance
usage/faq
usage/engine_args
usage/env_vars
usage/usage_stats
usage/disagg_prefill
.. toctree::
:maxdepth: 1
......@@ -122,29 +138,53 @@ Documentation
.. toctree::
:maxdepth: 1
:caption: Performance benchmarks
:caption: Performance
performance/benchmarks
.. Community: User community resources
.. toctree::
:maxdepth: 1
:caption: Community
community/meetups
community/sponsors
performance_benchmark/benchmarks
.. API Documentation: API reference aimed at vllm library usage
.. toctree::
:maxdepth: 2
:caption: Developer Documentation
:caption: API Documentation
dev/sampling_params
dev/pooling_params
dev/offline_inference/offline_index
dev/engine/engine_index
dev/kernel/paged_attention
dev/input_processing/model_inputs_index
dev/multimodal/multimodal_index
dev/dockerfile/dockerfile
dev/profiling/profiling_index
.. Design: docs about vLLM internals
.. toctree::
:maxdepth: 1
:caption: Community
:maxdepth: 2
:caption: Design
community/meetups
community/sponsors
design/arch_overview
design/huggingface_integration
design/plugin_system
design/input_processing/model_inputs_index
design/kernel/paged_attention
design/multimodal/multimodal_index
design/multiprocessing
.. For Developers: contributing to the vLLM project
.. toctree::
:maxdepth: 2
:caption: For Developers
contributing/overview
contributing/profiling/profiling_index
contributing/dockerfile/dockerfile
Indices and tables
==================
......
......@@ -38,41 +38,70 @@ For instance, vLLM's `OPT model <https://github.com/vllm-project/vllm/blob/main/
When copying the model code, make sure to review and adhere to the code's copyright and licensing terms.
2. Rewrite the :code:`forward` methods
2. Make your code compatible with vLLM
--------------------------------------
Next, you need to rewrite the :meth:`~torch.nn.Module.forward` method of your model by following these steps:
1. Remove any unnecessary code, such as the code only used for training.
2. Change the input parameters:
.. code-block:: diff
def forward(
self,
input_ids: torch.Tensor,
- attention_mask: Optional[torch.Tensor] = None,
- position_ids: Optional[torch.LongTensor] = None,
- past_key_values: Optional[List[torch.FloatTensor]] = None,
- inputs_embeds: Optional[torch.FloatTensor] = None,
- labels: Optional[torch.LongTensor] = None,
- use_cache: Optional[bool] = None,
- output_attentions: Optional[bool] = None,
- output_hidden_states: Optional[bool] = None,
- return_dict: Optional[bool] = None,
- ) -> Union[Tuple, CausalLMOutputWithPast]:
+ positions: torch.Tensor,
+ kv_caches: List[torch.Tensor],
+ attn_metadata: AttentionMetadata,
+ ) -> Optional[SamplerOutput]:
1. Update the code by considering that :code:`input_ids` and :code:`positions` are now flattened tensors.
2. Replace the attention operation with either :code:`PagedAttention`, :code:`PagedAttentionWithRoPE`, or :code:`PagedAttentionWithALiBi` depending on the model's architecture.
To ensure compatibility with vLLM, your model must meet the following requirements:
Initialization Code
^^^^^^^^^^^^^^^^^^^
All vLLM modules within the model must include a ``prefix`` argument in their constructor. This ``prefix`` is typically the full name of the module in the model's state dictionary and is crucial for:
* Runtime support: vLLM's attention operators are registered in a model's state by their full names. Each attention operator must have a unique prefix as its layer name to avoid conflicts.
* Non-uniform quantization support: A quantized checkpoint can selectively quantize certain layers while keeping others in full precision. By providing the ``prefix`` during initialization, vLLM can match the current layer's ``prefix`` with the quantization configuration to determine if the layer should be initialized in quantized mode.
The initialization code should look like this:
.. code-block:: python
from torch import nn
from vllm.config import VllmConfig
from vllm.attention import Attention
class MyAttention(nn.Module):
def __init__(self, vllm_config: VllmConfig, prefix: str):
super().__init__()
self.attn = Attention(prefix=f"{prefix}.attn")
class MyDecoderLayer(nn.Module):
def __init__(self, vllm_config: VllmConfig, prefix: str):
super().__init__()
self.self_attn = MyAttention(prefix=f"{prefix}.self_attn")
class MyModel(nn.Module):
def __init__(self, vllm_config: VllmConfig, prefix: str):
super().__init__()
self.layers = nn.ModuleList(
[MyDecoderLayer(vllm_config, prefix=f"{prefix}.layers.{i}") for i in range(vllm_config.model_config.hf_config.num_hidden_layers)]
)
class MyModelForCausalLM(nn.Module):
def __init__(self, vllm_config: VllmConfig, prefix: str = ""):
super().__init__()
self.model = MyModel(vllm_config, prefix=f"{prefix}.model")
Computation Code
^^^^^^^^^^^^^^^^
Rewrite the :meth:`~torch.nn.Module.forward` method of your model to remove any unnecessary code, such as training-specific code. Modify the input parameters to treat ``input_ids`` and ``positions`` as flattened tensors with a single batch size dimension, without a max-sequence length dimension.
.. code-block:: python
def forward(
self,
input_ids: torch.Tensor,
positions: torch.Tensor,
kv_caches: List[torch.Tensor],
attn_metadata: AttentionMetadata,
) -> torch.Tensor:
...
.. note::
Currently, vLLM supports the basic multi-head attention mechanism and its variant with rotary positional embeddings.
If your model employs a different attention mechanism, you will need to implement a new attention layer in vLLM.
For reference, check out the `LLAMA model <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama.py>`__. vLLM already supports a large number of models. It is recommended to find a model similar to yours and adapt it to your model's architecture. Check out the `vLLM models <https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models>`__ directory for more examples.
3. (Optional) Implement tensor parallelism and quantization support
-------------------------------------------------------------------
......@@ -85,28 +114,28 @@ When it comes to the linear layers, we provide the following options to parallel
* :code:`ReplicatedLinear`: Replicates the inputs and weights across multiple GPUs. No memory saving.
* :code:`RowParallelLinear`: The input tensor is partitioned along the hidden dimension. The weight matrix is partitioned along the rows (input dimension). An *all-reduce* operation is performed after the matrix multiplication to reduce the results. Typically used for the second FFN layer and the output linear transformation of the attention layer.
* :code:`ColumnParallelLinear`: The input tensor is replicated. The weight matrix is partitioned along the columns (output dimension). The result is partitioned along the column dimension. Typically used for the first FFN layer and the separated QKV transformation of the attention layer in the original Transformer.
* :code:`MergedColumnParallelLinear`: Column-parallel linear that merges multiple `ColumnParallelLinear` operators. Typically used for the first FFN layer with weighted activation functions (e.g., SiLU). This class handles the sharded weight loading logic of multiple weight matrices.
* :code:`MergedColumnParallelLinear`: Column-parallel linear that merges multiple :code:`ColumnParallelLinear` operators. Typically used for the first FFN layer with weighted activation functions (e.g., SiLU). This class handles the sharded weight loading logic of multiple weight matrices.
* :code:`QKVParallelLinear`: Parallel linear layer for the query, key, and value projections of the multi-head and grouped-query attention mechanisms. When number of key/value heads are less than the world size, this class replicates the key/value heads properly. This class handles the weight loading and replication of the weight matrices.
Note that all the linear layers above take `linear_method` as an input. vLLM will set this parameter according to different quantization schemes to support weight quantization.
Note that all the linear layers above take :code:`linear_method` as an input. vLLM will set this parameter according to different quantization schemes to support weight quantization.
4. Implement the weight loading logic
-------------------------------------
You now need to implement the :code:`load_weights` method in your :code:`*ForCausalLM` class.
This method should load the weights from the HuggingFace's checkpoint file and assign them to the corresponding layers in your model. Specifically, for `MergedColumnParallelLinear` and `QKVParallelLinear` layers, if the original model has separated weight matrices, you need to load the different parts separately.
This method should load the weights from the HuggingFace's checkpoint file and assign them to the corresponding layers in your model. Specifically, for :code:`MergedColumnParallelLinear` and :code:`QKVParallelLinear` layers, if the original model has separated weight matrices, you need to load the different parts separately.
5. Register your model
----------------------
Finally, register your :code:`*ForCausalLM` class to the :code:`_MODELS` in `vllm/model_executor/models/__init__.py <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/__init__.py>`_.
Finally, register your :code:`*ForCausalLM` class to the :code:`_VLLM_MODELS` in `vllm/model_executor/models/registry.py <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/registry.py>`_.
6. Out-of-Tree Model Integration
--------------------------------------------
--------------------------------
We also provide a way to integrate a model without modifying the vLLM codebase. Step 2, 3, 4 are still required, but you can skip step 1 and 5.
You can integrate a model without modifying the vLLM codebase. Steps 2, 3, and 4 are still required, but you can skip steps 1 and 5. Instead, write a plugin to register your model. For general introduction of the plugin system, see :ref:`plugin_system`.
Just add the following lines in your code:
To register the model, use the following code:
.. code-block:: python
......@@ -114,14 +143,17 @@ Just add the following lines in your code:
from your_code import YourModelForCausalLM
ModelRegistry.register_model("YourModelForCausalLM", YourModelForCausalLM)
If you are running api server with :code:`vllm serve <args>`, you can wrap the entrypoint with the following code:
If your model imports modules that initialize CUDA, consider lazy-importing it to avoid errors like :code:`RuntimeError: Cannot re-initialize CUDA in forked subprocess`:
.. code-block:: python
from vllm import ModelRegistry
from your_code import YourModelForCausalLM
ModelRegistry.register_model("YourModelForCausalLM", YourModelForCausalLM)
import runpy
runpy.run_module('vllm.entrypoints.openai.api_server', run_name='__main__')
Save the above code in a file and run it with :code:`python your_file.py <args>`.
ModelRegistry.register_model("YourModelForCausalLM", "your_code:YourModelForCausalLM")
.. important::
If your model is a multimodal model, ensure the model class implements the :class:`~vllm.model_executor.models.interfaces.SupportsMultiModal` interface.
Read more about that :ref:`here <enabling_multimodal_inputs>`.
.. note::
Although you can directly put these code snippets in your script using ``vllm.LLM``, the recommended way is to place these snippets in a vLLM plugin. This ensures compatibility with various vLLM features like distributed inference and the API server.
......@@ -3,7 +3,7 @@
Enabling Multimodal Inputs
==========================
This document walks you through the steps to extend a vLLM model so that it accepts :ref:`multi-modal <multi_modality>` inputs.
This document walks you through the steps to extend a vLLM model so that it accepts :ref:`multi-modal inputs <multimodal_inputs>`.
.. seealso::
:ref:`adding_a_new_model`
......@@ -66,7 +66,7 @@ A default mapper is available for each modality in the core vLLM library. This i
3. Register maximum number of multi-modal tokens
------------------------------------------------
For each modality type that the model accepts as input, calculate the maximum possible number of tokens per data instance
For each modality type that the model accepts as input, calculate the maximum possible number of tokens per data item
and register it via :meth:`INPUT_REGISTRY.register_dummy_data <vllm.inputs.registry.InputRegistry.register_max_multimodal_tokens>`.
.. code-block:: diff
......
.. _generative_models:
Generative Models
=================
vLLM provides first-class support for generative models, which covers most of LLMs.
In vLLM, generative models implement the :class:`~vllm.model_executor.models.VllmModelForTextGeneration` interface.
Based on the final hidden states of the input, these models output log probabilities of the tokens to generate,
which are then passed through :class:`~vllm.model_executor.layers.Sampler` to obtain the final text.
Offline Inference
-----------------
The :class:`~vllm.LLM` class provides various methods for offline inference.
See :ref:`Engine Arguments <engine_args>` for a list of options when initializing the model.
For generative models, the only supported :code:`task` option is :code:`"generate"`.
Usually, this is automatically inferred so you don't have to specify it.
``LLM.generate``
^^^^^^^^^^^^^^^^
The :class:`~vllm.LLM.generate` method is available to all generative models in vLLM.
It is similar to `its counterpart in HF Transformers <https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationMixin.generate>`__,
except that tokenization and detokenization are also performed automatically.
.. code-block:: python
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate("Hello, my name is")
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
You can optionally control the language generation by passing :class:`~vllm.SamplingParams`.
For example, you can use greedy sampling by setting :code:`temperature=0`:
.. code-block:: python
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0)
outputs = llm.generate("Hello, my name is", params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
A code example can be found in `examples/offline_inference.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference.py>`_.
``LLM.beam_search``
^^^^^^^^^^^^^^^^^^^
The :class:`~vllm.LLM.beam_search` method implements `beam search <https://huggingface.co/docs/transformers/en/generation_strategies#beam-search-decoding>`__ on top of :class:`~vllm.LLM.generate`.
For example, to search using 5 beams and output at most 50 tokens:
.. code-block:: python
llm = LLM(model="facebook/opt-125m")
params = BeamSearchParams(beam_width=5, max_tokens=50)
outputs = llm.generate("Hello, my name is", params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
``LLM.chat``
^^^^^^^^^^^^
The :class:`~vllm.LLM.chat` method implements chat functionality on top of :class:`~vllm.LLM.generate`.
In particular, it accepts input similar to `OpenAI Chat Completions API <https://platform.openai.com/docs/api-reference/chat>`__
and automatically applies the model's `chat template <https://huggingface.co/docs/transformers/en/chat_templating>`__ to format the prompt.
.. important::
In general, only instruction-tuned models have a chat template.
Base models may perform poorly as they are not trained to respond to the chat conversation.
.. code-block:: python
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
conversation = [
{
"role": "system",
"content": "You are a helpful assistant"
},
{
"role": "user",
"content": "Hello"
},
{
"role": "assistant",
"content": "Hello! How can I assist you today?"
},
{
"role": "user",
"content": "Write an essay about the importance of higher education.",
},
]
outputs = llm.chat(conversation)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
A code example can be found in `examples/offline_inference_chat.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_chat.py>`_.
If the model doesn't have a chat template or you want to specify another one,
you can explicitly pass a chat template:
.. code-block:: python
from vllm.entrypoints.chat_utils import load_chat_template
# You can find a list of existing chat templates under `examples/`
custom_template = load_chat_template(chat_template="<path_to_template>")
print("Loaded chat template:", custom_template)
outputs = llm.chat(conversation, chat_template=custom_template)
Online Inference
----------------
Our `OpenAI Compatible Server <../serving/openai_compatible_server>`__ can be used for online inference.
Please click on the above link for more details on how to launch the server.
Completions API
^^^^^^^^^^^^^^^
Our Completions API is similar to ``LLM.generate`` but only accepts text.
It is compatible with `OpenAI Completions API <https://platform.openai.com/docs/api-reference/completions>`__
so that you can use OpenAI client to interact with it.
A code example can be found in `examples/openai_completion_client.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_completion_client.py>`_.
Chat API
^^^^^^^^
Our Chat API is similar to ``LLM.chat``, accepting both text and :ref:`multi-modal inputs <multimodal_inputs>`.
It is compatible with `OpenAI Chat Completions API <https://platform.openai.com/docs/api-reference/chat>`__
so that you can use OpenAI client to interact with it.
A code example can be found in `examples/openai_chat_completion_client.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_client.py>`_.
.. _pooling_models:
Pooling Models
==============
vLLM also supports pooling models, including embedding, reranking and reward models.
In vLLM, pooling models implement the :class:`~vllm.model_executor.models.VllmModelForPooling` interface.
These models use a :class:`~vllm.model_executor.layers.Pooler` to extract the final hidden states of the input
before returning them.
.. note::
We currently support pooling models primarily as a matter of convenience.
As shown in the :ref:`Compatibility Matrix <compatibility_matrix>`, most vLLM features are not applicable to
pooling models as they only work on the generation or decode stage, so performance may not improve as much.
Offline Inference
-----------------
The :class:`~vllm.LLM` class provides various methods for offline inference.
See :ref:`Engine Arguments <engine_args>` for a list of options when initializing the model.
For pooling models, we support the following :code:`task` options:
- Embedding (:code:`"embed"` / :code:`"embedding"`)
- Classification (:code:`"classify"`)
- Sentence Pair Scoring (:code:`"score"`)
- Reward Modeling (:code:`"reward"`)
The selected task determines the default :class:`~vllm.model_executor.layers.Pooler` that is used:
- Embedding: Extract only the hidden states corresponding to the last token, and apply normalization.
- Classification: Extract only the hidden states corresponding to the last token, and apply softmax.
- Sentence Pair Scoring: Extract only the hidden states corresponding to the last token, and apply softmax.
- Reward Modeling: Extract all of the hidden states and return them directly.
When loading `Sentence Transformers <https://huggingface.co/sentence-transformers>`__ models,
we attempt to override the default pooler based on its Sentence Transformers configuration file (:code:`modules.json`).
You can customize the model's pooling method via the :code:`override_pooler_config` option,
which takes priority over both the model's and Sentence Transformers's defaults.
``LLM.encode``
^^^^^^^^^^^^^^
The :class:`~vllm.LLM.encode` method is available to all pooling models in vLLM.
It returns the extracted hidden states directly, which is useful for reward models.
.. code-block:: python
llm = LLM(model="Qwen/Qwen2.5-Math-RM-72B", task="reward")
(output,) = llm.encode("Hello, my name is")
data = output.outputs.data
print(f"Data: {data!r}")
``LLM.embed``
^^^^^^^^^^^^^
The :class:`~vllm.LLM.embed` method outputs an embedding vector for each prompt.
It is primarily designed for embedding models.
.. code-block:: python
llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embed")
(output,) = llm.embed("Hello, my name is")
embeds = output.outputs.embedding
print(f"Embeddings: {embeds!r} (size={len(embeds)})")
A code example can be found in `examples/offline_inference_embedding.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_embedding.py>`_.
``LLM.classify``
^^^^^^^^^^^^^^^^
The :class:`~vllm.LLM.classify` method outputs a probability vector for each prompt.
It is primarily designed for classification models.
.. code-block:: python
llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", task="classify")
(output,) = llm.classify("Hello, my name is")
probs = output.outputs.probs
print(f"Class Probabilities: {probs!r} (size={len(probs)})")
A code example can be found in `examples/offline_inference_classification.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_classification.py>`_.
``LLM.score``
^^^^^^^^^^^^^
The :class:`~vllm.LLM.score` method outputs similarity scores between sentence pairs.
It is primarily designed for `cross-encoder models <https://www.sbert.net/examples/applications/cross-encoder/README.html>`__.
These types of models serve as rerankers between candidate query-document pairs in RAG systems.
.. note::
vLLM can only perform the model inference component (e.g. embedding, reranking) of RAG.
To handle RAG at a higher level, you should use integration frameworks such as `LangChain <https://github.com/langchain-ai/langchain>`_.
.. code-block:: python
llm = LLM(model="BAAI/bge-reranker-v2-m3", task="score")
(output,) = llm.score("What is the capital of France?",
"The capital of Brazil is Brasilia.")
score = output.outputs.score
print(f"Score: {score}")
A code example can be found in `examples/offline_inference_scoring.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_scoring.py>`_.
Online Inference
----------------
Our `OpenAI Compatible Server <../serving/openai_compatible_server>`__ can be used for online inference.
Please click on the above link for more details on how to launch the server.
Embeddings API
^^^^^^^^^^^^^^
Our Embeddings API is similar to ``LLM.embed``, accepting both text and :ref:`multi-modal inputs <multimodal_inputs>`.
The text-only API is compatible with `OpenAI Embeddings API <https://platform.openai.com/docs/api-reference/embeddings>`__
so that you can use OpenAI client to interact with it.
A code example can be found in `examples/openai_embedding_client.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_embedding_client.py>`_.
The multi-modal API is an extension of the `OpenAI Embeddings API <https://platform.openai.com/docs/api-reference/embeddings>`__
that incorporates `OpenAI Chat Completions API <https://platform.openai.com/docs/api-reference/chat>`__,
so it is not part of the OpenAI standard. Please see :ref:`this page <multimodal_inputs>` for more details on how to use it.
Score API
^^^^^^^^^
Our Score API is similar to ``LLM.score``.
Please see `this page <../serving/openai_compatible_server.html#score-api-for-cross-encoder-models>`__ for more details on how to use it.
......@@ -3,347 +3,794 @@
Supported Models
================
vLLM supports a variety of generative Transformer models in `HuggingFace Transformers <https://huggingface.co/models>`_.
The following is the list of model architectures that are currently supported by vLLM.
vLLM supports generative and pooling models across various tasks.
If a model supports more than one task, you can set the task via the :code:`--task` argument.
For each task, we list the model architectures that have been implemented in vLLM.
Alongside each architecture, we include some popular models that use it.
----
Loading a Model
^^^^^^^^^^^^^^^
HuggingFace Hub
+++++++++++++++
By default, vLLM loads models from `HuggingFace (HF) Hub <https://huggingface.co/models>`_.
To determine whether a given model is supported, you can check the :code:`config.json` file inside the HF repository.
If the :code:`"architectures"` field contains a model architecture listed below, then it should be supported in theory.
.. tip::
The easiest way to check if your model is really supported at runtime is to run the program below:
.. code-block:: python
from vllm import LLM
# For generative models (task=generate) only
llm = LLM(model=..., task="generate") # Name or path of your model
output = llm.generate("Hello, my name is")
print(output)
# For pooling models (task={embed,classify,reward}) only
llm = LLM(model=..., task="embed") # Name or path of your model
output = llm.encode("Hello, my name is")
print(output)
If vLLM successfully returns text (for generative models) or hidden states (for pooling models), it indicates that your model is supported.
Otherwise, please refer to :ref:`Adding a New Model <adding_a_new_model>` and :ref:`Enabling Multimodal Inputs <enabling_multimodal_inputs>`
for instructions on how to implement your model in vLLM.
Alternatively, you can `open an issue on GitHub <https://github.com/vllm-project/vllm/issues/new/choose>`_ to request vLLM support.
ModelScope
++++++++++
To use models from `ModelScope <https://www.modelscope.cn>`_ instead of HuggingFace Hub, set an environment variable:
.. code-block:: shell
$ export VLLM_USE_MODELSCOPE=True
And use with :code:`trust_remote_code=True`.
.. code-block:: python
from vllm import LLM
llm = LLM(model=..., revision=..., task=..., trust_remote_code=True)
# For generative models (task=generate) only
output = llm.generate("Hello, my name is")
print(output)
# For pooling models (task={embed,classify,reward}) only
output = llm.encode("Hello, my name is")
print(output)
List of Text-only Language Models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Generative Models
+++++++++++++++++
See :ref:`this page <generative_models>` for more information on how to use generative models.
Text Generation (``--task generate``)
-------------------------------------
Decoder-only Language Models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. list-table::
:widths: 25 25 50 5
:widths: 25 25 50 5 5
:header-rows: 1
* - Architecture
- Models
- Example HuggingFace Models
- Example HF Models
- :ref:`LoRA <lora>`
- :ref:`PP <distributed_serving>`
* - :code:`AquilaForCausalLM`
- Aquila & Aquila2
- Aquila, Aquila2
- :code:`BAAI/Aquila-7B`, :code:`BAAI/AquilaChat-7B`, etc.
- ✅︎
- ✅︎
* - :code:`ArcticForCausalLM`
- Arctic
- :code:`Snowflake/snowflake-arctic-base`, :code:`Snowflake/snowflake-arctic-instruct`, etc.
-
- ✅︎
* - :code:`BaiChuanForCausalLM`
- Baichuan & Baichuan2
- Baichuan2, Baichuan
- :code:`baichuan-inc/Baichuan2-13B-Chat`, :code:`baichuan-inc/Baichuan-7B`, etc.
- ✅︎
- ✅︎
* - :code:`BloomForCausalLM`
- BLOOM, BLOOMZ, BLOOMChat
- :code:`bigscience/bloom`, :code:`bigscience/bloomz`, etc.
-
- ✅︎
* - :code:`BartForConditionalGeneration`
- BART
- :code:`facebook/bart-base`, :code:`facebook/bart-large-cnn`, etc.
-
-
* - :code:`ChatGLMModel`
- ChatGLM
- :code:`THUDM/chatglm2-6b`, :code:`THUDM/chatglm3-6b`, etc.
- ✅︎
* - :code:`CohereForCausalLM`
- ✅︎
* - :code:`CohereForCausalLM`,:code:`Cohere2ForCausalLM`
- Command-R
- :code:`CohereForAI/c4ai-command-r-v01`, etc.
-
- :code:`CohereForAI/c4ai-command-r-v01`, :code:`CohereForAI/c4ai-command-r7b-12-2024`, etc.
- ✅︎
- ✅︎
* - :code:`DbrxForCausalLM`
- DBRX
- :code:`databricks/dbrx-base`, :code:`databricks/dbrx-instruct`, etc.
-
- ✅︎
* - :code:`DeciLMForCausalLM`
- DeciLM
- :code:`Deci/DeciLM-7B`, :code:`Deci/DeciLM-7B-instruct`, etc.
-
- ✅︎
* - :code:`DeepseekForCausalLM`
- DeepSeek
- :code:`deepseek-ai/deepseek-llm-67b-base`, :code:`deepseek-ai/deepseek-llm-7b-chat` etc.
-
- ✅︎
* - :code:`DeepseekV2ForCausalLM`
- DeepSeek-V2
- :code:`deepseek-ai/DeepSeek-V2`, :code:`deepseek-ai/DeepSeek-V2-Chat` etc.
-
- ✅︎
* - :code:`ExaoneForCausalLM`
- EXAONE-3
- :code:`LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct`, etc.
- ✅︎
- ✅︎
* - :code:`FalconForCausalLM`
- Falcon
- :code:`tiiuae/falcon-7b`, :code:`tiiuae/falcon-40b`, :code:`tiiuae/falcon-rw-7b`, etc.
-
- ✅︎
* - :code:`FalconMambaForCausalLM`
- FalconMamba
- :code:`tiiuae/falcon-mamba-7b`, :code:`tiiuae/falcon-mamba-7b-instruct`, etc.
- ✅︎
- ✅︎
* - :code:`GemmaForCausalLM`
- Gemma
- :code:`google/gemma-2b`, :code:`google/gemma-7b`, etc.
- ✅︎
- ✅︎
* - :code:`Gemma2ForCausalLM`
- Gemma2
- :code:`google/gemma-2-9b`, :code:`google/gemma-2-27b`, etc.
- ✅︎
- ✅︎
* - :code:`GlmForCausalLM`
- GLM-4
- :code:`THUDM/glm-4-9b-chat-hf`, etc.
- ✅︎
- ✅︎
* - :code:`GPT2LMHeadModel`
- GPT-2
- :code:`gpt2`, :code:`gpt2-xl`, etc.
-
- ✅︎
* - :code:`GPTBigCodeForCausalLM`
- StarCoder, SantaCoder, WizardCoder
- :code:`bigcode/starcoder`, :code:`bigcode/gpt_bigcode-santacoder`, :code:`WizardLM/WizardCoder-15B-V1.0`, etc.
- ✅︎
- ✅︎
* - :code:`GPTJForCausalLM`
- GPT-J
- :code:`EleutherAI/gpt-j-6b`, :code:`nomic-ai/gpt4all-j`, etc.
-
- ✅︎
* - :code:`GPTNeoXForCausalLM`
- GPT-NeoX, Pythia, OpenAssistant, Dolly V2, StableLM
- :code:`EleutherAI/gpt-neox-20b`, :code:`EleutherAI/pythia-12b`, :code:`OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5`, :code:`databricks/dolly-v2-12b`, :code:`stabilityai/stablelm-tuned-alpha-7b`, etc.
-
- ✅︎
* - :code:`GraniteForCausalLM`
- Granite 3.0, PowerLM
- :code:`ibm-granite/granite-3.0-2b-base`, :code:`ibm-granite/granite-3.0-8b-instruct`, :code:`ibm/PowerLM-3b`, etc.
- ✅︎
- ✅︎
* - :code:`GraniteMoeForCausalLM`
- Granite 3.0 MoE, PowerMoE
- :code:`ibm-granite/granite-3.0-1b-a400m-base`, :code:`ibm-granite/granite-3.0-3b-a800m-instruct`, :code:`ibm/PowerMoE-3b`, etc.
- ✅︎
- ✅︎
* - :code:`GritLM`
- GritLM
- :code:`parasail-ai/GritLM-7B-vllm`.
- ✅︎
- ✅︎
* - :code:`InternLMForCausalLM`
- InternLM
- :code:`internlm/internlm-7b`, :code:`internlm/internlm-chat-7b`, etc.
- ✅︎
- ✅︎
* - :code:`InternLM2ForCausalLM`
- InternLM2
- :code:`internlm/internlm2-7b`, :code:`internlm/internlm2-chat-7b`, etc.
-
- ✅︎
- ✅︎
* - :code:`JAISLMHeadModel`
- Jais
- :code:`core42/jais-13b`, :code:`core42/jais-13b-chat`, :code:`core42/jais-30b-v3`, :code:`core42/jais-30b-chat-v3`, etc.
- :code:`inceptionai/jais-13b`, :code:`inceptionai/jais-13b-chat`, :code:`inceptionai/jais-30b-v3`, :code:`inceptionai/jais-30b-chat-v3`, etc.
-
- ✅︎
* - :code:`JambaForCausalLM`
- Jamba
- :code:`ai21labs/Jamba-v0.1`, etc.
- :code:`ai21labs/AI21-Jamba-1.5-Large`, :code:`ai21labs/AI21-Jamba-1.5-Mini`, :code:`ai21labs/Jamba-v0.1`, etc.
- ✅︎
- ✅︎
* - :code:`LlamaForCausalLM`
- Llama 3.1, Llama 3, Llama 2, LLaMA, Yi
- :code:`meta-llama/Meta-Llama-3.1-405B-Instruct`, :code:`meta-llama/Meta-Llama-3.1-70B`, :code:`meta-llama/Meta-Llama-3-70B-Instruct`, :code:`meta-llama/Llama-2-70b-hf`, :code:`01-ai/Yi-34B`, etc.
- ✅︎
- ✅︎
* - :code:`MambaForCausalLM`
- Mamba
- :code:`state-spaces/mamba-130m-hf`, :code:`state-spaces/mamba-790m-hf`, :code:`state-spaces/mamba-2.8b-hf`, etc.
-
- ✅︎
* - :code:`MiniCPMForCausalLM`
- MiniCPM
- :code:`openbmb/MiniCPM-2B-sft-bf16`, :code:`openbmb/MiniCPM-2B-dpo-bf16`, etc.
-
- :code:`openbmb/MiniCPM-2B-sft-bf16`, :code:`openbmb/MiniCPM-2B-dpo-bf16`, :code:`openbmb/MiniCPM-S-1B-sft`, etc.
- ✅︎
- ✅︎
* - :code:`MiniCPM3ForCausalLM`
- MiniCPM3
- :code:`openbmb/MiniCPM3-4B`, etc.
-
- ✅︎
- ✅︎
* - :code:`MistralForCausalLM`
- Mistral, Mistral-Instruct
- :code:`mistralai/Mistral-7B-v0.1`, :code:`mistralai/Mistral-7B-Instruct-v0.1`, etc.
- ✅︎
- ✅︎
* - :code:`MixtralForCausalLM`
- Mixtral-8x7B, Mixtral-8x7B-Instruct
- :code:`mistralai/Mixtral-8x7B-v0.1`, :code:`mistralai/Mixtral-8x7B-Instruct-v0.1`, :code:`mistral-community/Mixtral-8x22B-v0.1`, etc.
- ✅︎
- ✅︎
* - :code:`MPTForCausalLM`
- MPT, MPT-Instruct, MPT-Chat, MPT-StoryWriter
- :code:`mosaicml/mpt-7b`, :code:`mosaicml/mpt-7b-storywriter`, :code:`mosaicml/mpt-30b`, etc.
-
- ✅︎
* - :code:`NemotronForCausalLM`
- Nemotron-3, Nemotron-4, Minitron
- :code:`nvidia/Minitron-8B-Base`, :code:`mgoin/Nemotron-4-340B-Base-hf-FP8`, etc.
- ✅︎
* - :code:`OLMoEForCausalLM`
- OLMoE
- :code:`allenai/OLMoE-1B-7B-0924`, :code:`allenai/OLMoE-1B-7B-0924-Instruct`, etc.
-
- ✅︎
* - :code:`OLMoForCausalLM`
- OLMo
- :code:`allenai/OLMo-1B-hf`, :code:`allenai/OLMo-7B-hf`, etc.
-
- ✅︎
* - :code:`OLMo2ForCausalLM`
- OLMo2
- :code:`allenai/OLMo2-7B-1124`, etc.
-
- ✅︎
* - :code:`OLMoEForCausalLM`
- OLMoE
- :code:`allenai/OLMoE-1B-7B-0924`, :code:`allenai/OLMoE-1B-7B-0924-Instruct`, etc.
- ✅︎
- ✅︎
* - :code:`OPTForCausalLM`
- OPT, OPT-IML
- :code:`facebook/opt-66b`, :code:`facebook/opt-iml-max-30b`, etc.
-
- ✅︎
* - :code:`OrionForCausalLM`
- Orion
- :code:`OrionStarAI/Orion-14B-Base`, :code:`OrionStarAI/Orion-14B-Chat`, etc.
-
- ✅︎
* - :code:`PhiForCausalLM`
- Phi
- :code:`microsoft/phi-1_5`, :code:`microsoft/phi-2`, etc.
- ✅︎
- ✅︎
* - :code:`Phi3ForCausalLM`
- Phi-3
- :code:`microsoft/Phi-3-mini-4k-instruct`, :code:`microsoft/Phi-3-mini-128k-instruct`, :code:`microsoft/Phi-3-medium-128k-instruct`, etc.
-
- ✅︎
- ✅︎
* - :code:`Phi3SmallForCausalLM`
- Phi-3-Small
- :code:`microsoft/Phi-3-small-8k-instruct`, :code:`microsoft/Phi-3-small-128k-instruct`, etc.
-
- ✅︎
* - :code:`PhiMoEForCausalLM`
- Phi-3.5-MoE
- :code:`microsoft/Phi-3.5-MoE-instruct`, etc.
-
- ✅︎
- ✅︎
* - :code:`PersimmonForCausalLM`
- Persimmon
- :code:`adept/persimmon-8b-base`, :code:`adept/persimmon-8b-chat`, etc.
-
- ✅︎
* - :code:`QWenLMHeadModel`
- Qwen
- :code:`Qwen/Qwen-7B`, :code:`Qwen/Qwen-7B-Chat`, etc.
-
- ✅︎
- ✅︎
* - :code:`Qwen2ForCausalLM`
- Qwen2
- :code:`Qwen/Qwen2-beta-7B`, :code:`Qwen/Qwen2-beta-7B-Chat`, etc.
- :code:`Qwen/Qwen2-7B-Instruct`, :code:`Qwen/Qwen2-7B`, etc.
- ✅︎
- ✅︎
* - :code:`Qwen2MoeForCausalLM`
- Qwen2MoE
- :code:`Qwen/Qwen1.5-MoE-A2.7B`, :code:`Qwen/Qwen1.5-MoE-A2.7B-Chat`, etc.
-
- ✅︎
* - :code:`StableLmForCausalLM`
- StableLM
- :code:`stabilityai/stablelm-3b-4e1t/` , :code:`stabilityai/stablelm-base-alpha-7b-v2`, etc.
- :code:`stabilityai/stablelm-3b-4e1t`, :code:`stabilityai/stablelm-base-alpha-7b-v2`, etc.
-
- ✅︎
* - :code:`Starcoder2ForCausalLM`
- Starcoder2
- :code:`bigcode/starcoder2-3b`, :code:`bigcode/starcoder2-7b`, :code:`bigcode/starcoder2-15b`, etc.
-
- ✅︎
* - :code:`SolarForCausalLM`
- EXAONE-3
- Solar Pro
- :code:`upstage/solar-pro-preview-instruct`, etc.
-
- ✅︎
- ✅︎
* - :code:`TeleChat2ForCausalLM`
- TeleChat2
- :code:`TeleAI/TeleChat2-3B`, :code:`TeleAI/TeleChat2-7B`, :code:`TeleAI/TeleChat2-35B`, etc.
- ✅︎
- ✅︎
* - :code:`XverseForCausalLM`
- Xverse
- XVERSE
- :code:`xverse/XVERSE-7B-Chat`, :code:`xverse/XVERSE-13B-Chat`, :code:`xverse/XVERSE-65B-Chat`, etc.
-
- ✅︎
- ✅︎
.. note::
Currently, the ROCm version of vLLM supports Mistral and Mixtral only for context lengths up to 4096.
.. _supported_vlms:
Pooling Models
++++++++++++++
See :ref:`this page <pooling_models>` for more information on how to use pooling models.
.. important::
Since some model architectures support both generative and pooling tasks,
you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode.
Text Embedding (``--task embed``)
---------------------------------
Any text generation model can be converted into an embedding model by passing :code:`--task embed`.
.. note::
To get the best results, you should use pooling models that are specifically trained as such.
The following table lists those that are tested in vLLM.
.. list-table::
:widths: 25 25 50 5 5
:header-rows: 1
* - Architecture
- Models
- Example HF Models
- :ref:`LoRA <lora>`
- :ref:`PP <distributed_serving>`
* - :code:`BertModel`
- BERT-based
- :code:`BAAI/bge-base-en-v1.5`, etc.
-
-
* - :code:`Gemma2Model`
- Gemma2-based
- :code:`BAAI/bge-multilingual-gemma2`, etc.
-
- ✅︎
* - :code:`GritLM`
- GritLM
- :code:`parasail-ai/GritLM-7B-vllm`.
- ✅︎
- ✅︎
* - :code:`LlamaModel`, :code:`LlamaForCausalLM`, :code:`MistralModel`, etc.
- Llama-based
- :code:`intfloat/e5-mistral-7b-instruct`, etc.
- ✅︎
- ✅︎
* - :code:`Qwen2Model`, :code:`Qwen2ForCausalLM`
- Qwen2-based
- :code:`ssmits/Qwen2-7B-Instruct-embed-base` (see note), :code:`Alibaba-NLP/gte-Qwen2-7B-instruct` (see note), etc.
- ✅︎
- ✅︎
* - :code:`RobertaModel`, :code:`RobertaForMaskedLM`
- RoBERTa-based
- :code:`sentence-transformers/all-roberta-large-v1`, :code:`sentence-transformers/all-roberta-large-v1`, etc.
-
-
* - :code:`XLMRobertaModel`
- XLM-RoBERTa-based
- :code:`intfloat/multilingual-e5-large`, etc.
-
-
.. note::
:code:`ssmits/Qwen2-7B-Instruct-embed-base` has an improperly defined Sentence Transformers config.
You should manually set mean pooling by passing :code:`--override-pooler-config '{"pooling_type": "MEAN"}'`.
.. note::
Unlike base Qwen2, :code:`Alibaba-NLP/gte-Qwen2-7B-instruct` uses bi-directional attention.
You can set :code:`--hf-overrides '{"is_causal": false}'` to change the attention mask accordingly.
On the other hand, its 1.5B variant (:code:`Alibaba-NLP/gte-Qwen2-1.5B-instruct`) uses causal attention
despite being described otherwise on its model card.
Reward Modeling (``--task reward``)
-----------------------------------
.. list-table::
:widths: 25 25 50 5 5
:header-rows: 1
* - Architecture
- Models
- Example HF Models
- :ref:`LoRA <lora>`
- :ref:`PP <distributed_serving>`
* - :code:`LlamaForCausalLM`
- Llama-based
- :code:`peiyi9979/math-shepherd-mistral-7b-prm`, etc.
- ✅︎
- ✅︎
* - :code:`Qwen2ForRewardModel`
- Qwen2-based
- :code:`Qwen/Qwen2.5-Math-RM-72B`, etc.
- ✅︎
- ✅︎
.. important::
For process-supervised reward models such as :code:`peiyi9979/math-shepherd-mistral-7b-prm`, the pooling config should be set explicitly,
e.g.: :code:`--override-pooler-config '{"pooling_type": "STEP", "step_tag_id": 123, "returned_token_ids": [456, 789]}'`.
Multimodal Language Models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Classification (``--task classify``)
------------------------------------
.. list-table::
:widths: 25 25 25 25 5
:widths: 25 25 50 5 5
:header-rows: 1
* - Architecture
- Models
- Modalities
- Example HuggingFace Models
- Example HF Models
- :ref:`LoRA <lora>`
- :ref:`PP <distributed_serving>`
* - :code:`Qwen2ForSequenceClassification`
- Qwen2-based
- :code:`jason9693/Qwen2.5-1.5B-apeach`, etc.
- ✅︎
- ✅︎
Sentence Pair Scoring (``--task score``)
----------------------------------------
.. list-table::
:widths: 25 25 50 5 5
:header-rows: 1
* - Architecture
- Models
- Example HF Models
- :ref:`LoRA <lora>`
- :ref:`PP <distributed_serving>`
* - :code:`BertForSequenceClassification`
- BERT-based
- :code:`cross-encoder/ms-marco-MiniLM-L-6-v2`, etc.
-
-
* - :code:`RobertaForSequenceClassification`
- RoBERTa-based
- :code:`cross-encoder/quora-roberta-base`, etc.
-
-
* - :code:`XLMRobertaForSequenceClassification`
- XLM-RoBERTa-based
- :code:`BAAI/bge-reranker-v2-m3`, etc.
-
-
.. _supported_mm_models:
List of Multimodal Language Models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The following modalities are supported depending on the model:
- **T**\ ext
- **I**\ mage
- **V**\ ideo
- **A**\ udio
Any combination of modalities joined by :code:`+` are supported.
- e.g.: :code:`T + I` means that the model supports text-only, image-only, and text-with-image inputs.
On the other hand, modalities separated by :code:`/` are mutually exclusive.
- e.g.: :code:`T / I` means that the model supports text-only and image-only inputs, but not text-with-image inputs.
See :ref:`this page <multimodal_inputs>` on how to pass multi-modal inputs to the model.
Generative Models
+++++++++++++++++
See :ref:`this page <generative_models>` for more information on how to use generative models.
Text Generation (``--task generate``)
-------------------------------------
.. list-table::
:widths: 25 25 15 20 5 5 5
:header-rows: 1
* - Architecture
- Models
- Inputs
- Example HF Models
- :ref:`LoRA <lora>`
- :ref:`PP <distributed_serving>`
- V1
* - :code:`AriaForConditionalGeneration`
- Aria
- T + I
- :code:`rhymes-ai/Aria`
-
- ✅︎
-
* - :code:`Blip2ForConditionalGeneration`
- BLIP-2
- Image\ :sup:`E`
- T + I\ :sup:`E`
- :code:`Salesforce/blip2-opt-2.7b`, :code:`Salesforce/blip2-opt-6.7b`, etc.
-
- ✅︎
-
* - :code:`ChameleonForConditionalGeneration`
- Chameleon
- Image
- T + I
- :code:`facebook/chameleon-7b` etc.
-
- ✅︎
-
* - :code:`FuyuForCausalLM`
- Fuyu
- Image
- T + I
- :code:`adept/fuyu-8b` etc.
-
- ✅︎
-
* - :code:`ChatGLMModel`
- GLM-4V
- Image
- T + I
- :code:`THUDM/glm-4v-9b` etc.
- ✅︎
- ✅︎
-
* - :code:`H2OVLChatModel`
- H2OVL
- T + I\ :sup:`E+`
- :code:`h2oai/h2ovl-mississippi-800m`, :code:`h2oai/h2ovl-mississippi-2b`, etc.
-
- ✅︎
-
* - :code:`Idefics3ForConditionalGeneration`
- Idefics3
- T + I
- :code:`HuggingFaceM4/Idefics3-8B-Llama3` etc.
- ✅︎
-
-
* - :code:`InternVLChatModel`
- InternVL2
- Image\ :sup:`E+`
- :code:`OpenGVLab/InternVL2-4B`, :code:`OpenGVLab/InternVL2-8B`, etc.
- InternVL 2.5, Mono-InternVL, InternVL 2.0
- T + I\ :sup:`E+`
- :code:`OpenGVLab/InternVL2_5-4B`, :code:`OpenGVLab/Mono-InternVL-2B`, :code:`OpenGVLab/InternVL2-4B`, etc.
-
- ✅︎
- ✅︎
* - :code:`LlavaForConditionalGeneration`
- LLaVA-1.5
- Image\ :sup:`E+`
- :code:`llava-hf/llava-1.5-7b-hf`, :code:`llava-hf/llava-1.5-13b-hf`, etc.
- T + I\ :sup:`E+`
- :code:`llava-hf/llava-1.5-7b-hf`, :code:`TIGER-Lab/Mantis-8B-siglip-llama3` (see note), etc.
-
- ✅︎
- ✅︎
* - :code:`LlavaNextForConditionalGeneration`
- LLaVA-NeXT
- Image\ :sup:`E+`
- T + I\ :sup:`E+`
- :code:`llava-hf/llava-v1.6-mistral-7b-hf`, :code:`llava-hf/llava-v1.6-vicuna-7b-hf`, etc.
-
- ✅︎
-
* - :code:`LlavaNextVideoForConditionalGeneration`
- LLaVA-NeXT-Video
- Video
- T + V
- :code:`llava-hf/LLaVA-NeXT-Video-7B-hf`, etc.
-
- ✅︎
-
* - :code:`LlavaOnevisionForConditionalGeneration`
- LLaVA-Onevision
- Image\ :sup:`+` / Video
- T + I\ :sup:`+` + V\ :sup:`+`
- :code:`llava-hf/llava-onevision-qwen2-7b-ov-hf`, :code:`llava-hf/llava-onevision-qwen2-0.5b-ov-hf`, etc.
-
- ✅︎
-
* - :code:`MiniCPMV`
- MiniCPM-V
- Image\ :sup:`+`
- T + I\ :sup:`E+`
- :code:`openbmb/MiniCPM-V-2` (see note), :code:`openbmb/MiniCPM-Llama3-V-2_5`, :code:`openbmb/MiniCPM-V-2_6`, etc.
-
- ✅︎
- ✅︎
-
* - :code:`MllamaForConditionalGeneration`
- Llama 3.2
- Image
- T + I\ :sup:`+`
- :code:`meta-llama/Llama-3.2-90B-Vision-Instruct`, :code:`meta-llama/Llama-3.2-11B-Vision`, etc.
-
-
-
* - :code:`MolmoForCausalLM`
- Molmo
- T + I
- :code:`allenai/Molmo-7B-D-0924`, :code:`allenai/Molmo-72B-0924`, etc.
-
- ✅︎
- ✅︎
* - :code:`NVLM_D_Model`
- NVLM-D 1.0
- T + I\ :sup:`E+`
- :code:`nvidia/NVLM-D-72B`, etc.
-
- ✅︎
- ✅︎
* - :code:`PaliGemmaForConditionalGeneration`
- PaliGemma
- Image\ :sup:`E`
- :code:`google/paligemma-3b-pt-224`, :code:`google/paligemma-3b-mix-224`, etc.
- PaliGemma, PaliGemma 2
- T + I\ :sup:`E`
- :code:`google/paligemma-3b-pt-224`, :code:`google/paligemma-3b-mix-224`, :code:`google/paligemma2-3b-ft-docci-448`, etc.
-
- ✅︎
-
* - :code:`Phi3VForCausalLM`
- Phi-3-Vision, Phi-3.5-Vision
- Image\ :sup:`E+`
- T + I\ :sup:`E+`
- :code:`microsoft/Phi-3-vision-128k-instruct`, :code:`microsoft/Phi-3.5-vision-instruct` etc.
-
- ✅︎
- ✅︎
* - :code:`PixtralForConditionalGeneration`
- Pixtral
- Image\ :sup:`+`
- :code:`mistralai/Pixtral-12B-2409`
- T + I\ :sup:`+`
- :code:`mistralai/Pixtral-12B-2409`, :code:`mistral-community/pixtral-12b` etc.
-
- ✅︎
- ✅︎
* - :code:`QWenLMHeadModel`
- Qwen-VL
- Image\ :sup:`E+`
- T + I\ :sup:`E+`
- :code:`Qwen/Qwen-VL`, :code:`Qwen/Qwen-VL-Chat`, etc.
- ✅︎
- ✅︎
-
* - :code:`Qwen2AudioForConditionalGeneration`
- Qwen2-Audio
- T + A\ :sup:`+`
- :code:`Qwen/Qwen2-Audio-7B-Instruct`
-
- ✅︎
-
* - :code:`Qwen2VLForConditionalGeneration`
- Qwen2-VL
- Image\ :sup:`+` / Video\ :sup:`+`
- Qwen2-VL
- T + I\ :sup:`E+` + V\ :sup:`E+`
- :code:`Qwen/Qwen2-VL-2B-Instruct`, :code:`Qwen/Qwen2-VL-7B-Instruct`, :code:`Qwen/Qwen2-VL-72B-Instruct`, etc.
-
- ✅︎
- ✅︎
-
* - :code:`UltravoxModel`
- Ultravox
- Audio\ :sup:`E+`
- T + A\ :sup:`E+`
- :code:`fixie-ai/ultravox-v0_3`
-
- ✅︎
-
| :sup:`E` Pre-computed embeddings can be inputted for this modality.
| :sup:`+` Multiple items can be inputted per text prompt for this modality.
.. note::
For :code:`openbmb/MiniCPM-V-2`, the official repo doesn't work yet, so we need to use a fork (:code:`HwwwH/MiniCPM-V-2`) for now.
For more details, please see: https://github.com/vllm-project/vllm/pull/4087#issuecomment-2250397630
.. important::
To enable multiple multi-modal items per text prompt, you have to set :code:`limit_mm_per_prompt` (offline inference)
or :code:`--limit-mm-per-prompt` (online inference). For example, to enable passing up to 4 images per text prompt:
----
.. code-block:: python
If your model uses one of the above model architectures, you can seamlessly run your model with vLLM.
Otherwise, please refer to :ref:`Adding a New Model <adding_a_new_model>` and :ref:`Enabling Multimodal Inputs <enabling_multimodal_inputs>`
for instructions on how to implement support for your model.
Alternatively, you can raise an issue on our `GitHub <https://github.com/vllm-project/vllm/issues>`_ project.
llm = LLM(
model="Qwen/Qwen2-VL-7B-Instruct",
limit_mm_per_prompt={"image": 4},
)
.. tip::
The easiest way to check if your model is supported is to run the program below:
.. code-block:: bash
.. code-block:: python
vllm serve Qwen/Qwen2-VL-7B-Instruct --limit-mm-per-prompt image=4
from vllm import LLM
.. note::
vLLM currently only supports adding LoRA to the language backbone of multimodal models.
llm = LLM(model=...) # Name or path of your model
output = llm.generate("Hello, my name is")
print(output)
.. note::
To use :code:`TIGER-Lab/Mantis-8B-siglip-llama3`, you have to install their GitHub repo (:code:`pip install git+https://github.com/TIGER-AI-Lab/Mantis.git`)
and pass :code:`--hf_overrides '{"architectures": ["MantisForConditionalGeneration"]}'` when running vLLM.
If vLLM successfully generates text, it indicates that your model is supported.
.. note::
The official :code:`openbmb/MiniCPM-V-2` doesn't work yet, so we need to use a fork (:code:`HwwwH/MiniCPM-V-2`) for now.
For more details, please see: https://github.com/vllm-project/vllm/pull/4087#issuecomment-2250397630
.. tip::
To use models from `ModelScope <https://www.modelscope.cn>`_ instead of HuggingFace Hub, set an environment variable:
Pooling Models
++++++++++++++
.. code-block:: shell
See :ref:`this page <pooling_models>` for more information on how to use pooling models.
$ export VLLM_USE_MODELSCOPE=True
.. important::
Since some model architectures support both generative and pooling tasks,
you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode.
And use with :code:`trust_remote_code=True`.
Text Embedding (``--task embed``)
---------------------------------
.. code-block:: python
Any text generation model can be converted into an embedding model by passing :code:`--task embed`.
from vllm import LLM
.. note::
To get the best results, you should use pooling models that are specifically trained as such.
llm = LLM(model=..., revision=..., trust_remote_code=True) # Name or path of your model
output = llm.generate("Hello, my name is")
print(output)
The following table lists those that are tested in vLLM.
.. list-table::
:widths: 25 25 15 25 5 5
:header-rows: 1
* - Architecture
- Models
- Inputs
- Example HF Models
- :ref:`LoRA <lora>`
- :ref:`PP <distributed_serving>`
* - :code:`LlavaNextForConditionalGeneration`
- LLaVA-NeXT-based
- T / I
- :code:`royokong/e5-v`
-
- ✅︎
* - :code:`Phi3VForCausalLM`
- Phi-3-Vision-based
- T + I
- :code:`TIGER-Lab/VLM2Vec-Full`
- 🚧
- ✅︎
* - :code:`Qwen2VLForConditionalGeneration`
- Qwen2-VL-based
- T + I
- :code:`MrLight/dse-qwen2-2b-mrl-v1`
-
- ✅︎
----
Model Support Policy
=====================
......@@ -354,6 +801,9 @@ At vLLM, we are committed to facilitating the integration and support of third-p
2. **Best-Effort Consistency**: While we aim to maintain a level of consistency between the models implemented in vLLM and other frameworks like transformers, complete alignment is not always feasible. Factors like acceleration techniques and the use of low-precision computations can introduce discrepancies. Our commitment is to ensure that the implemented models are functional and produce sensible results.
.. tip::
When comparing the output of :code:`model.generate` from HuggingFace Transformers with the output of :code:`llm.generate` from vLLM, note that the former reads the model's generation config file (i.e., `generation_config.json <https://github.com/huggingface/transformers/blob/19dabe96362803fb0a9ae7073d03533966598b17/src/transformers/generation/utils.py#L1945>`__) and applies the default parameters for generation, while the latter only uses the parameters passed to the function. Ensure all sampling parameters are identical when comparing outputs.
3. **Issue Resolution and Model Updates**: Users are encouraged to report any bugs or issues they encounter with third-party models. Proposed fixes should be submitted via PRs, with a clear explanation of the problem and the rationale behind the proposed solution. If a fix for one model impacts another, we rely on the community to highlight and address these cross-model dependencies. Note: for bugfix PRs, it is good etiquette to inform the original author to seek their feedback.
4. **Monitoring and Updates**: Users interested in specific models should monitor the commit history for those models (e.g., by tracking changes in the main/vllm/model_executor/models directory). This proactive approach helps users stay informed about updates and changes that may affect the models they use.
......@@ -369,4 +819,4 @@ We have the following levels of testing for models:
1. **Strict Consistency**: We compare the output of the model with the output of the model in the HuggingFace Transformers library under greedy decoding. This is the most stringent test. Please refer to `models tests <https://github.com/vllm-project/vllm/blob/main/tests/models>`_ for the models that have passed this test.
2. **Output Sensibility**: We check if the output of the model is sensible and coherent, by measuring the perplexity of the output and checking for any obvious errors. This is a less stringent test.
3. **Runtime Functionality**: We check if the model can be loaded and run without errors. This is the least stringent test. Please refer to `functionality tests <https://github.com/vllm-project/vllm/tree/main/tests>`_ and `examples <https://github.com/vllm-project/vllm/tree/main/examples>`_ for the models that have passed this test.
4. **Community Feedback**: We rely on the community to provide feedback on the models. If a model is broken or not working as expected, we encourage users to raise issues to report it or open pull requests to fix it. The rest of the models fall under this category.
4. **Community Feedback**: We rely on the community to provide feedback on the models. If a model is broken or not working as expected, we encourage users to raise issues to report it or open pull requests to fix it. The rest of the models fall under this category.
\ No newline at end of file
.. _vlm:
Using VLMs
==========
vLLM provides experimental support for Vision Language Models (VLMs). See the :ref:`list of supported VLMs here <supported_vlms>`.
This document shows you how to run and serve these models using vLLM.
.. important::
We are actively iterating on VLM support. Expect breaking changes to VLM usage and development in upcoming releases without prior deprecation.
We are continuously improving user & developer experience for VLMs. Please `open an issue on GitHub <https://github.com/vllm-project/vllm/issues/new/choose>`_ if you have any feedback or feature requests.
Offline Inference
-----------------
Single-image input
^^^^^^^^^^^^^^^^^^
The :class:`~vllm.LLM` class can be instantiated in much the same way as language-only models.
.. code-block:: python
llm = LLM(model="llava-hf/llava-1.5-7b-hf")
.. note::
We have removed all vision language related CLI args in the ``0.5.1`` release. **This is a breaking change**, so please update your code to follow
the above snippet. Specifically, ``image_feature_size`` can no longer be specified as we now calculate that internally for each model.
To pass an image to the model, note the following in :class:`vllm.inputs.PromptInputs`:
* ``prompt``: The prompt should follow the format that is documented on HuggingFace.
* ``multi_modal_data``: This is a dictionary that follows the schema defined in :class:`vllm.multimodal.MultiModalDataDict`.
.. code-block:: python
# Refer to the HuggingFace repo for the correct format to use
prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"
# Load the image using PIL.Image
image = PIL.Image.open(...)
# Single prompt inference
outputs = llm.generate({
"prompt": prompt,
"multi_modal_data": {"image": image},
})
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
# Inference with image embeddings as input
image_embeds = torch.load(...) # torch.Tensor of shape (1, image_feature_size, hidden_size of LM)
outputs = llm.generate({
"prompt": prompt,
"multi_modal_data": {"image": image_embeds},
})
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
# Batch inference
image_1 = PIL.Image.open(...)
image_2 = PIL.Image.open(...)
outputs = llm.generate(
[
{
"prompt": "USER: <image>\nWhat is the content of this image?\nASSISTANT:",
"multi_modal_data": {"image": image_1},
},
{
"prompt": "USER: <image>\nWhat's the color of this image?\nASSISTANT:",
"multi_modal_data": {"image": image_2},
}
]
)
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
A code example can be found in `examples/offline_inference_vision_language.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language.py>`_.
Multi-image input
^^^^^^^^^^^^^^^^^
Multi-image input is only supported for a subset of VLMs, as shown :ref:`here <supported_vlms>`.
To enable multiple multi-modal items per text prompt, you have to set ``limit_mm_per_prompt`` for the :class:`~vllm.LLM` class.
.. code-block:: python
llm = LLM(
model="microsoft/Phi-3.5-vision-instruct",
trust_remote_code=True, # Required to load Phi-3.5-vision
max_model_len=4096, # Otherwise, it may not fit in smaller GPUs
limit_mm_per_prompt={"image": 2}, # The maximum number to accept
)
Instead of passing in a single image, you can pass in a list of images.
.. code-block:: python
# Refer to the HuggingFace repo for the correct format to use
prompt = "<|user|>\n<image_1>\n<image_2>\nWhat is the content of each image?<|end|>\n<|assistant|>\n"
# Load the images using PIL.Image
image1 = PIL.Image.open(...)
image2 = PIL.Image.open(...)
outputs = llm.generate({
"prompt": prompt,
"multi_modal_data": {
"image": [image1, image2]
},
})
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
A code example can be found in `examples/offline_inference_vision_language_multi_image.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language_multi_image.py>`_.
Online Inference
----------------
OpenAI Vision API
^^^^^^^^^^^^^^^^^
You can serve vision language models with vLLM's HTTP server that is compatible with `OpenAI Vision API <https://platform.openai.com/docs/guides/vision>`_.
Below is an example on how to launch the same ``microsoft/Phi-3.5-vision-instruct`` with vLLM's OpenAI-compatible API server.
.. code-block:: bash
vllm serve microsoft/Phi-3.5-vision-instruct --max-model-len 4096 \
--trust-remote-code --limit-mm-per-prompt image=2
.. important::
Since OpenAI Vision API is based on `Chat Completions <https://platform.openai.com/docs/api-reference/chat>`_ API,
a chat template is **required** to launch the API server.
Although Phi-3.5-Vision comes with a chat template, for other models you may have to provide one if the model's tokenizer does not come with it.
The chat template can be inferred based on the documentation on the model's HuggingFace repo.
For example, LLaVA-1.5 (``llava-hf/llava-1.5-7b-hf``) requires a chat template that can be found `here <https://github.com/vllm-project/vllm/blob/main/examples/template_llava.jinja>`_.
To consume the server, you can use the OpenAI client like in the example below:
.. code-block:: python
from openai import OpenAI
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
# Single-image input inference
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
chat_response = client.chat.completions.create(
model="microsoft/Phi-3.5-vision-instruct",
messages=[{
"role": "user",
"content": [
# NOTE: The prompt formatting with the image token `<image>` is not needed
# since the prompt will be processed automatically by the API server.
{"type": "text", "text": "What’s in this image?"},
{"type": "image_url", "image_url": {"url": image_url}},
],
}],
)
print("Chat completion output:", chat_response.choices[0].message.content)
# Multi-image input inference
image_url_duck = "https://upload.wikimedia.org/wikipedia/commons/d/da/2015_Kaczka_krzy%C5%BCowka_w_wodzie_%28samiec%29.jpg"
image_url_lion = "https://upload.wikimedia.org/wikipedia/commons/7/77/002_The_lion_king_Snyggve_in_the_Serengeti_National_Park_Photo_by_Giles_Laurent.jpg"
chat_response = client.chat.completions.create(
model="microsoft/Phi-3.5-vision-instruct",
messages=[{
"role": "user",
"content": [
{"type": "text", "text": "What are the animals in these images?"},
{"type": "image_url", "image_url": {"url": image_url_duck}},
{"type": "image_url", "image_url": {"url": image_url_lion}},
],
}],
)
print("Chat completion output:", chat_response.choices[0].message.content)
A full code example can be found in `examples/openai_vision_api_client.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_vision_api_client.py>`_.
.. note::
By default, the timeout for fetching images through http url is ``5`` seconds. You can override this by setting the environment variable:
.. code-block:: shell
export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>
.. note::
There is no need to format the prompt in the API request since it will be handled by the server.
.. _benchmarks:
================
Benchmark Suites
================
vLLM contains two sets of benchmarks:
+ :ref:`Performance benchmarks <performance_benchmarks>`
+ :ref:`Nightly benchmarks <nightly_benchmarks>`
.. _performance_benchmarks:
Performance Benchmarks
----------------------
The performance benchmarks are used for development to confirm whether new changes improve performance under various workloads. They are triggered on every commit with both the ``perf-benchmarks`` and ``ready`` labels, and when a PR is merged into vLLM.
The latest performance results are hosted on the public `vLLM Performance Dashboard <https://perf.vllm.ai>`_.
More information on the performance benchmarks and their parameters can be found `here <https://github.com/vllm-project/vllm/blob/main/.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md>`__.
.. _nightly_benchmarks:
Nightly Benchmarks
------------------
These compare vLLM's performance against alternatives (``tgi``, ``trt-llm``, and ``lmdeploy``) when there are major updates of vLLM (e.g., bumping up to a new version). They are primarily intended for consumers to evaluate when to choose vLLM over other options and are triggered on every commit with both the ``perf-benchmarks`` and ``nightly-benchmarks`` labels.
The latest nightly benchmark results are shared in major release blog posts such as `vLLM v0.6.0 <https://blog.vllm.ai/2024/09/05/perf-update.html>`_.
More information on the nightly benchmarks and their parameters can be found `here <https://github.com/vllm-project/vllm/blob/main/.buildkite/nightly-benchmarks/nightly-descriptions.md>`__.
\ No newline at end of file
.. _benchmarks:
Benchmark suites of vLLM
========================
vLLM contains two sets of benchmarks:
+ **Performance benchmarks**: benchmark vLLM's performance under various workloads at a high frequency (when a pull request (PR for short) of vLLM is being merged). See `vLLM performance dashboard <https://perf.vllm.ai>`_ for the latest performance results.
+ **Nightly benchmarks**: compare vLLM's performance against alternatives (tgi, trt-llm, and lmdeploy) when there are major updates of vLLM (e.g., bumping up to a new version). The latest results are available in the `vLLM GitHub README <https://github.com/vllm-project/vllm/blob/main/README.md>`_.
Trigger a benchmark
-------------------
The performance benchmarks and nightly benchmarks can be triggered by submitting a PR to vLLM, and label the PR with `perf-benchmarks` and `nightly-benchmarks`.
.. note::
Please refer to `vLLM performance benchmark descriptions <https://github.com/vllm-project/vllm/blob/main/.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md>`_ and `vLLM nightly benchmark descriptions <https://github.com/vllm-project/vllm/blob/main/.buildkite/nightly-benchmarks/nightly-descriptions.md>`_ for detailed descriptions on benchmark environment, workload and metrics.
......@@ -11,7 +11,7 @@ Below are the steps to utilize BitsAndBytes with vLLM.
.. code-block:: console
$ pip install bitsandbytes>=0.44.0
$ pip install bitsandbytes>=0.45.0
vLLM reads the model's config file and supports both in-flight quantization and pre-quantized checkpoint.
......
......@@ -45,7 +45,7 @@ To produce performant FP8 quantized models with vLLM, you'll need to install the
.. code-block:: console
$ pip install llmcompressor==0.1.0
$ pip install llmcompressor
Quantization Process
--------------------
......@@ -106,7 +106,7 @@ Install ``vllm`` and ``lm-evaluation-harness``:
.. code-block:: console
$ pip install vllm lm_eval==0.4.3
$ pip install vllm lm-eval==0.4.4
Load and run the model in ``vllm``:
......
......@@ -4,7 +4,7 @@ FP8 E5M2 KV Cache
==================
The int8/int4 quantization scheme requires additional scale GPU memory storage, which reduces the expected GPU memory benefits.
The FP8 data format retains 2~3 mantissa bits and can convert float/fp16/bflaot16 and fp8 to each other.
The FP8 data format retains 2~3 mantissa bits and can convert float/fp16/bfloat16 and fp8 to each other.
Here is an example of how to enable this feature:
......
......@@ -19,7 +19,7 @@ To use INT8 quantization with vLLM, you'll need to install the `llm-compressor <
.. code-block:: console
$ pip install llmcompressor==0.1.0
$ pip install llmcompressor
Quantization Process
--------------------
......@@ -142,4 +142,4 @@ Best Practices
Troubleshooting and Support
---------------------------
If you encounter any issues or have feature requests, please open an issue on the ``vllm-project/llm-compressor`` GitHub repository.
\ No newline at end of file
If you encounter any issues or have feature requests, please open an issue on the ``vllm-project/llm-compressor`` GitHub repository.
......@@ -27,8 +27,8 @@ The table below shows the compatibility of various quantization implementations
- ✅︎
- ✅︎
- ✗
-
-
- ✅︎
- ✅︎
- ✗
- ✗
* - GPTQ
......@@ -38,8 +38,8 @@ The table below shows the compatibility of various quantization implementations
- ✅︎
- ✅︎
- ✗
-
-
- ✅︎
- ✅︎
- ✗
- ✗
* - Marlin (GPTQ/AWQ/FP8)
......@@ -61,7 +61,7 @@ The table below shows the compatibility of various quantization implementations
- ✅︎
- ✗
- ✗
-
- ✅︎
- ✗
- ✗
* - FP8 (W8A8)
......@@ -129,4 +129,4 @@ Notes:
Please note that this compatibility chart may be subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.
For the most up-to-date information on hardware support and quantization methods, please check the `quantization directory <https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/layers/quantization>`_ or consult with the vLLM development team.
\ No newline at end of file
For the most up-to-date information on hardware support and quantization methods, please check the `quantization directory <https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/layers/quantization>`_ or consult with the vLLM development team.
......@@ -37,6 +37,29 @@ You can build and run vLLM from source via the provided `Dockerfile <https://git
current GPU type the machine is running on, you can add the argument ``--build-arg torch_cuda_arch_list=""``
for vLLM to find the current GPU type and build for that.
Building for Arm64/aarch64
--------------------------
A docker container can be built for aarch64 systems such as the Nvidia Grace-Hopper. At time of this writing, this requires the use
of PyTorch Nightly and should be considered **experimental**. Using the flag `--platform "linux/arm64"` will attempt to build for arm64.
.. note::
Multiple modules must be compiled, so this process can take a while. Recommend using `--build-arg max_jobs=` & `--build-arg nvcc_threads=`
flags to speed up build process. However, ensure your 'max_jobs' is substantially larger than 'nvcc_threads' to get the most benefits.
Keep an eye on memory usage with parallel jobs as it can be substantial (see example below).
.. code-block:: console
# Example of building on Nvidia GH200 server. (Memory usage: ~12GB, Build time: ~1475s / ~25 min, Image size: 7.26GB)
$ DOCKER_BUILDKIT=1 sudo docker build . \
--target vllm-openai \
--platform "linux/arm64" \
-t vllm/vllm-gh200-openai:latest \
--build-arg max_jobs=66 \
--build-arg nvcc_threads=2 \
--build-arg torch_cuda_arch_list="9.0+PTX" \
--build-arg vllm_fa_cmake_gpu_arches="90-real"
To run vLLM:
......
.. _deploying_with_helm:
Deploying with Helm
===================
A Helm chart to deploy vLLM for Kubernetes
Helm is a package manager for Kubernetes. It will help you to deploy vLLM on k8s and automate the deployment of vLLMm Kubernetes applications. With Helm, you can deploy the same framework architecture with different configurations to multiple namespaces by overriding variables values.
This guide will walk you through the process of deploying vLLM with Helm, including the necessary prerequisites, steps for helm install and documentation on architecture and values file.
Prerequisites
-------------
Before you begin, ensure that you have the following:
- A running Kubernetes cluster
- NVIDIA Kubernetes Device Plugin (``k8s-device-plugin``): This can be found at `https://github.com/NVIDIA/k8s-device-plugin <https://github.com/NVIDIA/k8s-device-plugin>`__
- Available GPU resources in your cluster
- S3 with the model which will be deployed
Installing the chart
--------------------
To install the chart with the release name ``test-vllm``:
.. code-block:: console
helm upgrade --install --create-namespace --namespace=ns-vllm test-vllm . -f values.yaml --set secrets.s3endpoint=$ACCESS_POINT --set secrets.s3bucketname=$BUCKET --set secrets.s3accesskeyid=$ACCESS_KEY --set secrets.s3accesskey=$SECRET_KEY
Uninstalling the Chart
----------------------
To uninstall the ``test-vllm`` deployment:
.. code-block:: console
helm uninstall test-vllm --namespace=ns-vllm
The command removes all the Kubernetes components associated with the
chart **including persistent volumes** and deletes the release.
Architecture
------------
.. image:: architecture_helm_deployment.png
Values
------
.. list-table:: Values
:widths: 25 25 25 25
:header-rows: 1
* - Key
- Type
- Default
- Description
* - autoscaling
- object
- {"enabled":false,"maxReplicas":100,"minReplicas":1,"targetCPUUtilizationPercentage":80}
- Autoscaling configuration
* - autoscaling.enabled
- bool
- false
- Enable autoscaling
* - autoscaling.maxReplicas
- int
- 100
- Maximum replicas
* - autoscaling.minReplicas
- int
- 1
- Minimum replicas
* - autoscaling.targetCPUUtilizationPercentage
- int
- 80
- Target CPU utilization for autoscaling
* - configs
- object
- {}
- Configmap
* - containerPort
- int
- 8000
- Container port
* - customObjects
- list
- []
- Custom Objects configuration
* - deploymentStrategy
- object
- {}
- Deployment strategy configuration
* - externalConfigs
- list
- []
- External configuration
* - extraContainers
- list
- []
- Additional containers configuration
* - extraInit
- object
- {"pvcStorage":"1Gi","s3modelpath":"relative_s3_model_path/opt-125m", "awsEc2MetadataDisabled": true}
- Additional configuration for the init container
* - extraInit.pvcStorage
- string
- "50Gi"
- Storage size of the s3
* - extraInit.s3modelpath
- string
- "relative_s3_model_path/opt-125m"
- Path of the model on the s3 which hosts model weights and config files
* - extraInit.awsEc2MetadataDisabled
- boolean
- true
- Disables the use of the Amazon EC2 instance metadata service
* - extraPorts
- list
- []
- Additional ports configuration
* - gpuModels
- list
- ["TYPE_GPU_USED"]
- Type of gpu used
* - image
- object
- {"command":["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"],"repository":"vllm/vllm-openai","tag":"latest"}
- Image configuration
* - image.command
- list
- ["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"]
- Container launch command
* - image.repository
- string
- "vllm/vllm-openai"
- Image repository
* - image.tag
- string
- "latest"
- Image tag
* - livenessProbe
- object
- {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":15,"periodSeconds":10}
- Liveness probe configuration
* - livenessProbe.failureThreshold
- int
- 3
- Number of times after which if a probe fails in a row, Kubernetes considers that the overall check has failed: the container is not alive
* - livenessProbe.httpGet
- object
- {"path":"/health","port":8000}
- Configuration of the Kubelet http request on the server
* - livenessProbe.httpGet.path
- string
- "/health"
- Path to access on the HTTP server
* - livenessProbe.httpGet.port
- int
- 8000
- Name or number of the port to access on the container, on which the server is listening
* - livenessProbe.initialDelaySeconds
- int
- 15
- Number of seconds after the container has started before liveness probe is initiated
* - livenessProbe.periodSeconds
- int
- 10
- How often (in seconds) to perform the liveness probe
* - maxUnavailablePodDisruptionBudget
- string
- ""
- Disruption Budget Configuration
* - readinessProbe
- object
- {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":5,"periodSeconds":5}
- Readiness probe configuration
* - readinessProbe.failureThreshold
- int
- 3
- Number of times after which if a probe fails in a row, Kubernetes considers that the overall check has failed: the container is not ready
* - readinessProbe.httpGet
- object
- {"path":"/health","port":8000}
- Configuration of the Kubelet http request on the server
* - readinessProbe.httpGet.path
- string
- "/health"
- Path to access on the HTTP server
* - readinessProbe.httpGet.port
- int
- 8000
- Name or number of the port to access on the container, on which the server is listening
* - readinessProbe.initialDelaySeconds
- int
- 5
- Number of seconds after the container has started before readiness probe is initiated
* - readinessProbe.periodSeconds
- int
- 5
- How often (in seconds) to perform the readiness probe
* - replicaCount
- int
- 1
- Number of replicas
* - resources
- object
- {"limits":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1},"requests":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1}}
- Resource configuration
* - resources.limits."nvidia.com/gpu"
- int
- 1
- Number of gpus used
* - resources.limits.cpu
- int
- 4
- Number of CPUs
* - resources.limits.memory
- string
- "16Gi"
- CPU memory configuration
* - resources.requests."nvidia.com/gpu"
- int
- 1
- Number of gpus used
* - resources.requests.cpu
- int
- 4
- Number of CPUs
* - resources.requests.memory
- string
- "16Gi"
- CPU memory configuration
* - secrets
- object
- {}
- Secrets configuration
* - serviceName
- string
-
- Service name
* - servicePort
- int
- 80
- Service port
* - labels.environment
- string
- test
- Environment name
* - labels.release
- string
- test
- Release name
.. _deploying_with_k8s:
Deploying with Kubernetes
==========================
Using Kubernetes to deploy vLLM is a scalable and efficient way to serve machine learning models. This guide will walk you through the process of deploying vLLM with Kubernetes, including the necessary prerequisites, steps for deployment, and testing.
Prerequisites
-------------
Before you begin, ensure that you have the following:
- A running Kubernetes cluster
- NVIDIA Kubernetes Device Plugin (`k8s-device-plugin`): This can be found at `https://github.com/NVIDIA/k8s-device-plugin/`
- Available GPU resources in your cluster
Deployment Steps
----------------
1. **Create a PVC , Secret and Deployment for vLLM**
PVC is used to store the model cache and it is optional, you can use hostPath or other storage options
.. code-block:: yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: mistral-7b
namespace: default
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 50Gi
storageClassName: default
volumeMode: Filesystem
Secret is optional and only required for accessing gated models, you can skip this step if you are not using gated models
.. code-block:: yaml
apiVersion: v1
kind: Secret
metadata:
name: hf-token-secret
namespace: default
type: Opaque
data:
token: "REPLACE_WITH_TOKEN"
Create a deployment file for vLLM to run the model server. The following example deploys the `Mistral-7B-Instruct-v0.3` model:
.. code-block:: yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: mistral-7b
namespace: default
labels:
app: mistral-7b
spec:
replicas: 1
selector:
matchLabels:
app: mistral-7b
template:
metadata:
labels:
app: mistral-7b
spec:
volumes:
- name: cache-volume
persistentVolumeClaim:
claimName: mistral-7b
# vLLM needs to access the host's shared memory for tensor parallel inference.
- name: shm
emptyDir:
medium: Memory
sizeLimit: "2Gi"
containers:
- name: mistral-7b
image: vllm/vllm-openai:latest
command: ["/bin/sh", "-c"]
args: [
"vllm serve mistralai/Mistral-7B-Instruct-v0.3 --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
]
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: hf-token-secret
key: token
ports:
- containerPort: 8000
resources:
limits:
cpu: "10"
memory: 20G
nvidia.com/gpu: "1"
requests:
cpu: "2"
memory: 6G
nvidia.com/gpu: "1"
volumeMounts:
- mountPath: /root/.cache/huggingface
name: cache-volume
- name: shm
mountPath: /dev/shm
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 60
periodSeconds: 10
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 60
periodSeconds: 5
2. **Create a Kubernetes Service for vLLM**
Next, create a Kubernetes Service file to expose the `mistral-7b` deployment:
.. code-block:: yaml
apiVersion: v1
kind: Service
metadata:
name: mistral-7b
namespace: default
spec:
ports:
- name: http-mistral-7b
port: 80
protocol: TCP
targetPort: 8000
# The label selector should match the deployment labels & it is useful for prefix caching feature
selector:
app: mistral-7b
sessionAffinity: None
type: ClusterIP
3. **Deploy and Test**
Apply the deployment and service configurations using ``kubectl apply -f <filename>``:
.. code-block:: console
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
To test the deployment, run the following ``curl`` command:
.. code-block:: console
curl http://mistral-7b.default.svc.cluster.local/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mistral-7B-Instruct-v0.3",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'
If the service is correctly deployed, you should receive a response from the vLLM model.
Conclusion
----------
Deploying vLLM with Kubernetes allows for efficient scaling and management of ML models leveraging GPU resources. By following the steps outlined above, you should be able to set up and test a vLLM deployment within your Kubernetes cluster. If you encounter any issues or have suggestions, please feel free to contribute to the documentation.
......@@ -5,4 +5,4 @@ Deploying with KServe
vLLM can be deployed with `KServe <https://github.com/kserve/kserve>`_ on Kubernetes for highly scalable distributed model serving.
Please see `this guide <https://kserve.github.io/website/latest/modelserving/v1beta1/llm/vllm/>`_ for more details on using vLLM with KServe.
Please see `this guide <https://kserve.github.io/website/latest/modelserving/v1beta1/llm/huggingface/>`_ for more details on using vLLM with KServe.
.. _deploying_with_kubeai:
Deploying with KubeAI
=====================
`KubeAI <https://github.com/substratusai/kubeai>`_ is a Kubernetes operator that enables you to deploy and manage AI models on Kubernetes. It provides a simple and scalable way to deploy vLLM in production. Functionality such as scale-from-zero, load based autoscaling, model caching, and much more is provided out of the box with zero external dependencies.
Please see the Installation Guides for environment specific instructions:
* `Any Kubernetes Cluster <https://www.kubeai.org/installation/any/>`_
* `EKS <https://www.kubeai.org/installation/eks/>`_
* `GKE <https://www.kubeai.org/installation/gke/>`_
Once you have KubeAI installed, you can
`configure text generation models <https://www.kubeai.org/how-to/configure-text-generation-models/>`_
using vLLM.
\ No newline at end of file
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment