Commit 705f6a35 authored by zhuwenwen

Merge tag 'v0.5.2' into v0.5.2-dtk24.04.1

parents af837396 4cf256ae
...@@ -10,6 +10,10 @@ This document provides a high-level guide on integrating a `HuggingFace Transfor ...@@ -10,6 +10,10 @@ This document provides a high-level guide on integrating a `HuggingFace Transfor
The process is considerably straightforward if the model shares a similar architecture with an existing model in vLLM. The process is considerably straightforward if the model shares a similar architecture with an existing model in vLLM.
However, for models that include new operators (e.g., a new attention mechanism), the process can be a bit more complex. However, for models that include new operators (e.g., a new attention mechanism), the process can be a bit more complex.
.. note::
By default, vLLM models do not support multi-modal inputs. To enable multi-modal support,
please follow :ref:`this guide <enabling_multimodal_inputs>` after implementing the model here.
.. tip:: .. tip::
If you are encountering issues while integrating your model into vLLM, feel free to open an issue on our `GitHub <https://github.com/vllm-project/vllm/issues>`_ repository. If you are encountering issues while integrating your model into vLLM, feel free to open an issue on our `GitHub <https://github.com/vllm-project/vllm/issues>`_ repository.
We will be happy to help you out! We will be happy to help you out!
...@@ -37,30 +41,30 @@ For instance, vLLM's `OPT model <https://github.com/vllm-project/vllm/blob/main/ ...@@ -37,30 +41,30 @@ For instance, vLLM's `OPT model <https://github.com/vllm-project/vllm/blob/main/
2. Rewrite the :code:`forward` methods 2. Rewrite the :code:`forward` methods
-------------------------------------- --------------------------------------
Next, you need to rewrite the :code:`forward` methods of your model by following these steps: Next, you need to rewrite the :meth:`~torch.nn.Module.forward` method of your model by following these steps:
1. Remove any unnecessary code, such as the code only used for training. 1. Remove any unnecessary code, such as the code only used for training.
2. Change the input parameters: 2. Change the input parameters:
.. code-block:: diff .. code-block:: diff
def forward( def forward(
self, self,
input_ids: torch.Tensor, input_ids: torch.Tensor,
- attention_mask: Optional[torch.Tensor] = None, - attention_mask: Optional[torch.Tensor] = None,
- position_ids: Optional[torch.LongTensor] = None, - position_ids: Optional[torch.LongTensor] = None,
- past_key_values: Optional[List[torch.FloatTensor]] = None, - past_key_values: Optional[List[torch.FloatTensor]] = None,
- inputs_embeds: Optional[torch.FloatTensor] = None, - inputs_embeds: Optional[torch.FloatTensor] = None,
- labels: Optional[torch.LongTensor] = None, - labels: Optional[torch.LongTensor] = None,
- use_cache: Optional[bool] = None, - use_cache: Optional[bool] = None,
- output_attentions: Optional[bool] = None, - output_attentions: Optional[bool] = None,
- output_hidden_states: Optional[bool] = None, - output_hidden_states: Optional[bool] = None,
- return_dict: Optional[bool] = None, - return_dict: Optional[bool] = None,
-) -> Union[Tuple, CausalLMOutputWithPast]: - ) -> Union[Tuple, CausalLMOutputWithPast]:
+ positions: torch.Tensor, + positions: torch.Tensor,
+ kv_caches: List[torch.Tensor], + kv_caches: List[torch.Tensor],
+ attn_metadata: AttentionMetadata, + attn_metadata: AttentionMetadata,
+) -> Optional[SamplerOutput]: + ) -> Optional[SamplerOutput]:
1. Update the code by considering that :code:`input_ids` and :code:`positions` are now flattened tensors. 1. Update the code by considering that :code:`input_ids` and :code:`positions` are now flattened tensors.
2. Replace the attention operation with either :code:`PagedAttention`, :code:`PagedAttentionWithRoPE`, or :code:`PagedAttentionWithALiBi` depending on the model's architecture. 2. Replace the attention operation with either :code:`PagedAttention`, :code:`PagedAttentionWithRoPE`, or :code:`PagedAttentionWithALiBi` depending on the model's architecture.
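As a rough illustration of what "flattened" means here, the sketch below concatenates a batch of variable-length prompts into one flat sequence and records each token's position within its own prompt. It uses plain Python lists instead of tensors, and ``flatten_batch`` is a hypothetical helper for illustration, not part of vLLM. It also assumes prompt (prefill) tokens whose positions start at 0.

```python
from typing import List, Tuple


def flatten_batch(sequences: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Concatenate per-sequence token ids and record each token's position.

    Hypothetical helper: stands in for the flattened ``input_ids`` and
    ``positions`` tensors, using plain lists so it runs without PyTorch.
    """
    input_ids: List[int] = []
    positions: List[int] = []
    for seq in sequences:
        input_ids.extend(seq)
        # Positions restart at 0 for every sequence in the batch.
        positions.extend(range(len(seq)))
    return input_ids, positions


batch = [[11, 12, 13], [21, 22]]
input_ids, positions = flatten_batch(batch)
print(input_ids)   # [11, 12, 13, 21, 22]
print(positions)   # [0, 1, 2, 0, 1]
```

Because there is no batch dimension, code that indexes ``input_ids`` as ``[batch, seq_len]`` has to be rewritten to work on a single token axis.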
...@@ -75,7 +79,7 @@ Next, you need to rewrite the :code:`forward` methods of your model by following ...@@ -75,7 +79,7 @@ Next, you need to rewrite the :code:`forward` methods of your model by following
If your model is too large to fit into a single GPU, you can use tensor parallelism to manage it. If your model is too large to fit into a single GPU, you can use tensor parallelism to manage it.
To do this, substitute your model's linear and embedding layers with their tensor-parallel versions. To do this, substitute your model's linear and embedding layers with their tensor-parallel versions.
For the embedding layer, you can simply replace :code:`nn.Embedding` with :code:`VocabParallelEmbedding`. For the output LM head, you can use :code:`ParallelLMHead`. For the embedding layer, you can simply replace :class:`torch.nn.Embedding` with :code:`VocabParallelEmbedding`. For the output LM head, you can use :code:`ParallelLMHead`.
When it comes to the linear layers, we provide the following options to parallelize them: When it comes to the linear layers, we provide the following options to parallelize them:
* :code:`ReplicatedLinear`: Replicates the inputs and weights across multiple GPUs. No memory saving. * :code:`ReplicatedLinear`: Replicates the inputs and weights across multiple GPUs. No memory saving.
......
.. _enabling_multimodal_inputs:
Enabling Multimodal Inputs
==========================
This document walks you through the steps to extend a vLLM model so that it accepts :ref:`multi-modal <multi_modality>` inputs.
.. seealso::
:ref:`adding_a_new_model`
1. Update the base vLLM model
-----------------------------
It is assumed that you have already implemented the model in vLLM according to :ref:`these steps <adding_a_new_model>`.
Further update the model as follows:
- Implement the :class:`~vllm.model_executor.models.interfaces.SupportsVision` interface.
.. code-block:: diff
+ from vllm.model_executor.models.interfaces import SupportsVision
- class YourModelForImage2Seq(nn.Module):
+ class YourModelForImage2Seq(nn.Module, SupportsVision):
.. note::
The model class does not have to be named :code:`*ForCausalLM`.
Check out `the HuggingFace Transformers documentation <https://huggingface.co/docs/transformers/model_doc/auto#multimodal>`__ for some examples.
- If you haven't already done so, reserve a keyword parameter in :meth:`~torch.nn.Module.forward`
for each input tensor that corresponds to a multi-modal input, as shown in the following example:
.. code-block:: diff
def forward(
self,
input_ids: torch.Tensor,
positions: torch.Tensor,
kv_caches: List[torch.Tensor],
attn_metadata: AttentionMetadata,
+ pixel_values: torch.Tensor,
) -> SamplerOutput:
2. Register input mappers
-------------------------
For each modality type that the model accepts as input, decorate the model class with :meth:`MULTIMODAL_REGISTRY.register_input_mapper <vllm.multimodal.MultiModalRegistry.register_input_mapper>`.
This decorator accepts a function that maps multi-modal inputs to the keyword arguments you have previously defined in :meth:`~torch.nn.Module.forward`.
.. code-block:: diff
from vllm.model_executor.models.interfaces import SupportsVision
+ from vllm.multimodal import MULTIMODAL_REGISTRY
+ @MULTIMODAL_REGISTRY.register_image_input_mapper()
class YourModelForImage2Seq(nn.Module, SupportsVision):
A default mapper is available for each modality in the core vLLM library. This input mapper will be used if you do not provide your own function.
.. seealso::
:ref:`input_processing_pipeline`
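To make the role of an input mapper more concrete, here is a minimal, self-contained sketch of the decorator pattern. ``ToyMapperRegistry`` and ``to_pixel_values`` are hypothetical stand-ins written for this example. They are not vLLM's ``MULTIMODAL_REGISTRY``, and the real mapper signature may differ. The point is only that the mapper returns a dictionary whose keys match the keyword parameters reserved in ``forward``.

```python
from typing import Any, Callable, Dict, List

Mapper = Callable[[Any], Dict[str, Any]]


class ToyMapperRegistry:
    """Hypothetical registry used only to illustrate the decorator pattern."""

    def __init__(self) -> None:
        self._mappers: Dict[type, Mapper] = {}

    def register_image_input_mapper(self, mapper: Mapper):
        def wrapper(model_cls: type) -> type:
            # Remember which mapper belongs to which model class.
            self._mappers[model_cls] = mapper
            return model_cls
        return wrapper

    def map_input(self, model_cls: type, image: Any) -> Dict[str, Any]:
        return self._mappers[model_cls](image)


def to_pixel_values(image: List[List[int]]) -> Dict[str, Any]:
    # Assumption: the image is a nested list of 8-bit grayscale pixel rows.
    return {"pixel_values": [[p / 255.0 for p in row] for row in image]}


TOY_REGISTRY = ToyMapperRegistry()


@TOY_REGISTRY.register_image_input_mapper(to_pixel_values)
class YourModelForImage2Seq:
    def forward(self, input_ids: List[int], pixel_values: List[List[float]]):
        return len(input_ids), pixel_values


kwargs = TOY_REGISTRY.map_input(YourModelForImage2Seq, [[0, 255], [51, 102]])
num_tokens, pixels = YourModelForImage2Seq().forward([1, 2, 3], **kwargs)
print(sorted(kwargs))  # ['pixel_values']
```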
3. Register maximum number of multi-modal tokens
------------------------------------------------
For each modality type that the model accepts as input, calculate the maximum possible number of tokens
and register it via :meth:`MULTIMODAL_REGISTRY.register_max_image_tokens <vllm.multimodal.MultiModalRegistry.register_max_image_tokens>`.
.. code-block:: diff
from vllm.inputs import INPUT_REGISTRY
from vllm.model_executor.models.interfaces import SupportsVision
from vllm.multimodal import MULTIMODAL_REGISTRY
@MULTIMODAL_REGISTRY.register_image_input_mapper()
+ @MULTIMODAL_REGISTRY.register_max_image_tokens(<your_calculation>)
class YourModelForImage2Seq(nn.Module, SupportsVision):
Here are some examples:
- Image inputs (static feature size): `LLaVA-1.5 Model <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llava.py>`__
- Image inputs (dynamic feature size): `LLaVA-NeXT Model <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llava_next.py>`__
.. seealso::
:ref:`input_processing_pipeline`
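For a model with a static feature size, the calculation can be as simple as counting the patches produced by the vision encoder. The sketch below assumes a ViT-style encoder that splits a square image into non-overlapping square patches and emits one token per patch. The patch size of 14 is an assumption about the encoder used by LLaVA-1.5, and ``vit_num_patches`` is a hypothetical helper, not a vLLM function.

```python
def vit_num_patches(image_size: int, patch_size: int) -> int:
    """Number of patch tokens for a square image split into square patches."""
    if image_size <= 0 or patch_size <= 0:
        raise ValueError("image_size and patch_size must be positive")
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2


# 336x336 input with 14x14 patches (assumed): 24 patches per side.
max_image_tokens = vit_num_patches(image_size=336, patch_size=14)
print(max_image_tokens)  # 576
```

Models with a dynamic feature size need to return the largest value the calculation can produce over all supported image resolutions.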
4. (Optional) Register dummy data
---------------------------------
During startup, dummy data is passed to the vLLM model to allocate memory. By default, this dummy data consists only of text input, which may not be suitable for multi-modal models.
In such cases, you can define your own dummy data by registering a factory method via :meth:`INPUT_REGISTRY.register_dummy_data <vllm.inputs.registry.InputRegistry.register_dummy_data>`.
.. code-block:: diff
from vllm.inputs import INPUT_REGISTRY
from vllm.model_executor.models.interfaces import SupportsVision
from vllm.multimodal import MULTIMODAL_REGISTRY
@MULTIMODAL_REGISTRY.register_image_input_mapper()
@MULTIMODAL_REGISTRY.register_max_image_tokens(<your_calculation>)
+ @INPUT_REGISTRY.register_dummy_data(<your_dummy_data_factory>)
class YourModelForImage2Seq(nn.Module, SupportsVision):
.. note::
The dummy data should have the maximum possible number of multi-modal tokens, as described in the previous step.
Here are some examples:
- Image inputs (static feature size): `LLaVA-1.5 Model <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llava.py>`__
- Image inputs (dynamic feature size): `LLaVA-NeXT Model <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llava_next.py>`__
.. seealso::
:ref:`input_processing_pipeline`
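One way to picture the text side of such dummy data is a token sequence that contains the maximum number of image placeholder tokens, padded with a filler token up to the requested length. The sketch below is an illustration under stated assumptions: ``dummy_token_ids`` is a hypothetical helper, the filler id ``0`` is arbitrary, and the values ``32000`` and ``576`` are taken from the LLaVA-1.5 settings that appear elsewhere in this change. A real factory would also supply dummy image data.

```python
from typing import List


def dummy_token_ids(seq_len: int, image_token_id: int,
                    image_feature_size: int) -> List[int]:
    """Build dummy token ids holding the maximum number of image tokens."""
    if image_feature_size > seq_len:
        raise ValueError("seq_len is too short to hold all image tokens")
    image_part = [image_token_id] * image_feature_size
    # Pad the remainder with an arbitrary filler token id (assumed to be 0).
    text_part = [0] * (seq_len - image_feature_size)
    return image_part + text_part


tokens = dummy_token_ids(seq_len=1024, image_token_id=32000,
                         image_feature_size=576)
print(len(tokens), tokens.count(32000))  # 1024 576
```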
5. (Optional) Register input processor
--------------------------------------
Sometimes, inputs need to be processed at the :class:`~vllm.LLMEngine` level before they are passed to the model executor.
This is often because, unlike implementations in HuggingFace Transformers, the reshaping and/or expansion of multi-modal embeddings needs to take place outside the model's :meth:`~torch.nn.Module.forward` call.
You can register input processors via :meth:`INPUT_REGISTRY.register_input_processor <vllm.inputs.registry.InputRegistry.register_input_processor>`.
.. code-block:: diff
from vllm.inputs import INPUT_REGISTRY
from vllm.model_executor.models.interfaces import SupportsVision
from vllm.multimodal import MULTIMODAL_REGISTRY
@MULTIMODAL_REGISTRY.register_image_input_mapper()
@MULTIMODAL_REGISTRY.register_max_image_tokens(<your_calculation>)
@INPUT_REGISTRY.register_dummy_data(<your_dummy_data_factory>)
+ @INPUT_REGISTRY.register_input_processor(<your_input_processor>)
class YourModelForImage2Seq(nn.Module, SupportsVision):
A common use case of input processors is inserting placeholder tokens to leverage the vLLM framework for attention mask generation.
Here are some examples:
- Insert static number of image tokens: `LLaVA-1.5 Model <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llava.py>`__
- Insert dynamic number of image tokens: `LLaVA-NeXT Model <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llava_next.py>`__
.. seealso::
:ref:`input_processing_pipeline`
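The placeholder-insertion use case can be sketched on plain token id lists: every occurrence of the image token in the prompt is expanded into as many placeholder tokens as the image contributes features. ``expand_image_placeholders`` is a hypothetical helper written for this example, and a real input processor would operate on vLLM's input objects instead of bare lists.

```python
from typing import List


def expand_image_placeholders(token_ids: List[int], image_token_id: int,
                              image_feature_size: int) -> List[int]:
    """Repeat each image token so it reserves one slot per image feature."""
    expanded: List[int] = []
    for token_id in token_ids:
        if token_id == image_token_id:
            expanded.extend([image_token_id] * image_feature_size)
        else:
            expanded.append(token_id)
    return expanded


# A prompt with a single image token (id 32000) between text tokens.
prompt_ids = [1, 32000, 5, 6]
print(expand_image_placeholders(prompt_ids, image_token_id=32000,
                                image_feature_size=3))
# [1, 32000, 32000, 32000, 5, 6]
```

After expansion, the sequence length already accounts for the image embeddings, so attention metadata can be computed as for a text-only prompt.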
...@@ -4,6 +4,9 @@ Using LoRA adapters ...@@ -4,6 +4,9 @@ Using LoRA adapters
=================== ===================
This document shows you how to use `LoRA adapters <https://arxiv.org/abs/2106.09685>`_ with vLLM on top of a base model. This document shows you how to use `LoRA adapters <https://arxiv.org/abs/2106.09685>`_ with vLLM on top of a base model.
LoRA adapters can be used with any vLLM model that implements :class:`~vllm.model_executor.models.interfaces.SupportsLoRA`.
Adapters can be efficiently served on a per request basis with minimal overhead. First we download the adapter(s) and save Adapters can be efficiently served on a per request basis with minimal overhead. First we download the adapter(s) and save
them locally with them locally with
......
...@@ -73,5 +73,5 @@ Resources for vLLM contributors ...@@ -73,5 +73,5 @@ Resources for vLLM contributors
------------------------------- -------------------------------
* `A Hacker's Guide to Speculative Decoding in vLLM <https://www.youtube.com/watch?v=9wNAgpX6z_4>`_ * `A Hacker's Guide to Speculative Decoding in vLLM <https://www.youtube.com/watch?v=9wNAgpX6z_4>`_
* `What is Lookahead Scheduling in vLLM? <https://docs.google.com/document/d/1Z9TvqzzBPnh5WHcRwjvK2UEeFeq5zMZb5mFE8jR0HCs/edit#heading=h.1fjfb0donq5a>`_ * `What is Lookahead Scheduling in vLLM? <https://docs.google.com/document/d/1Z9TvqzzBPnh5WHcRwjvK2UEeFeq5zMZb5mFE8jR0HCs/edit#heading=h.1fjfb0donq5a>`_
* `Information on batch expansion. <https://docs.google.com/document/d/1T-JaS2T1NRfdP51qzqpyakoCXxSXTtORppiwaj5asxA/edit#heading=h.kk7dq05lc6q8>`_ * `Information on batch expansion <https://docs.google.com/document/d/1T-JaS2T1NRfdP51qzqpyakoCXxSXTtORppiwaj5asxA/edit#heading=h.kk7dq05lc6q8>`_
* `Dynamic speculative decoding <https://github.com/vllm-project/vllm/issues/4565>`_ * `Dynamic speculative decoding <https://github.com/vllm-project/vllm/issues/4565>`_
...@@ -7,6 +7,8 @@ vLLM supports a variety of generative Transformer models in `HuggingFace Transfo ...@@ -7,6 +7,8 @@ vLLM supports a variety of generative Transformer models in `HuggingFace Transfo
The following is the list of model architectures that are currently supported by vLLM. The following is the list of model architectures that are currently supported by vLLM.
Alongside each architecture, we include some popular models that use it. Alongside each architecture, we include some popular models that use it.
Decoder-only Language Models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. list-table:: .. list-table::
:widths: 25 25 50 5 :widths: 25 25 50 5
:header-rows: 1 :header-rows: 1
...@@ -55,6 +57,10 @@ Alongside each architecture, we include some popular models that use it. ...@@ -55,6 +57,10 @@ Alongside each architecture, we include some popular models that use it.
- Gemma - Gemma
- :code:`google/gemma-2b`, :code:`google/gemma-7b`, etc. - :code:`google/gemma-2b`, :code:`google/gemma-7b`, etc.
- ✅︎ - ✅︎
* - :code:`Gemma2ForCausalLM`
- Gemma2
- :code:`google/gemma-2-9b`, :code:`google/gemma-2-27b`, etc.
- ✅︎
* - :code:`GPT2LMHeadModel` * - :code:`GPT2LMHeadModel`
- GPT-2 - GPT-2
- :code:`gpt2`, :code:`gpt2-xl`, etc. - :code:`gpt2`, :code:`gpt2-xl`, etc.
...@@ -83,18 +89,14 @@ Alongside each architecture, we include some popular models that use it. ...@@ -83,18 +89,14 @@ Alongside each architecture, we include some popular models that use it.
- Jais - Jais
- :code:`core42/jais-13b`, :code:`core42/jais-13b-chat`, :code:`core42/jais-30b-v3`, :code:`core42/jais-30b-chat-v3`, etc. - :code:`core42/jais-13b`, :code:`core42/jais-13b-chat`, :code:`core42/jais-30b-v3`, :code:`core42/jais-30b-chat-v3`, etc.
- -
* - :code:`JambaForCausalLM`
- Jamba
- :code:`ai21labs/Jamba-v0.1`, etc.
- ✅︎
* - :code:`LlamaForCausalLM` * - :code:`LlamaForCausalLM`
- LLaMA, Llama 2, Meta Llama 3, Vicuna, Alpaca, Yi - LLaMA, Llama 2, Meta Llama 3, Vicuna, Alpaca, Yi
- :code:`meta-llama/Meta-Llama-3-8B-Instruct`, :code:`meta-llama/Meta-Llama-3-70B-Instruct`, :code:`meta-llama/Llama-2-13b-hf`, :code:`meta-llama/Llama-2-70b-hf`, :code:`openlm-research/open_llama_13b`, :code:`lmsys/vicuna-13b-v1.3`, :code:`01-ai/Yi-6B`, :code:`01-ai/Yi-34B`, etc. - :code:`meta-llama/Meta-Llama-3-8B-Instruct`, :code:`meta-llama/Meta-Llama-3-70B-Instruct`, :code:`meta-llama/Llama-2-13b-hf`, :code:`meta-llama/Llama-2-70b-hf`, :code:`openlm-research/open_llama_13b`, :code:`lmsys/vicuna-13b-v1.3`, :code:`01-ai/Yi-6B`, :code:`01-ai/Yi-34B`, etc.
- ✅︎ - ✅︎
* - :code:`LlavaForConditionalGeneration`
- LLaVA-1.5
- :code:`llava-hf/llava-1.5-7b-hf`, :code:`llava-hf/llava-1.5-13b-hf`, etc.
-
* - :code:`LlavaNextForConditionalGeneration`
- LLaVA-NeXT
- :code:`llava-hf/llava-v1.6-mistral-7b-hf`, :code:`llava-hf/llava-v1.6-vicuna-7b-hf`, etc.
-
* - :code:`MiniCPMForCausalLM` * - :code:`MiniCPMForCausalLM`
- MiniCPM - MiniCPM
- :code:`openbmb/MiniCPM-2B-sft-bf16`, :code:`openbmb/MiniCPM-2B-dpo-bf16`, etc. - :code:`openbmb/MiniCPM-2B-sft-bf16`, :code:`openbmb/MiniCPM-2B-dpo-bf16`, etc.
...@@ -129,12 +131,16 @@ Alongside each architecture, we include some popular models that use it. ...@@ -129,12 +131,16 @@ Alongside each architecture, we include some popular models that use it.
- ✅︎ - ✅︎
* - :code:`Phi3ForCausalLM` * - :code:`Phi3ForCausalLM`
- Phi-3 - Phi-3
- :code:`microsoft/Phi-3-mini-4k-instruct`, :code:`microsoft/Phi-3-mini-128k-instruct`, etc. - :code:`microsoft/Phi-3-mini-4k-instruct`, :code:`microsoft/Phi-3-mini-128k-instruct`, :code:`microsoft/Phi-3-medium-128k-instruct`, etc.
- -
* - :code:`Phi3SmallForCausalLM` * - :code:`Phi3SmallForCausalLM`
- Phi-3-Small - Phi-3-Small
- :code:`microsoft/Phi-3-small-8k-instruct`, :code:`microsoft/Phi-3-small-128k-instruct`, etc. - :code:`microsoft/Phi-3-small-8k-instruct`, :code:`microsoft/Phi-3-small-128k-instruct`, etc.
- -
* - :code:`PersimmonForCausalLM`
- Persimmon
- :code:`adept/persimmon-8b-base`, :code:`adept/persimmon-8b-chat`, etc.
-
* - :code:`QWenLMHeadModel` * - :code:`QWenLMHeadModel`
- Qwen - Qwen
- :code:`Qwen/Qwen-7B`, :code:`Qwen/Qwen-7B-Chat`, etc. - :code:`Qwen/Qwen-7B`, :code:`Qwen/Qwen-7B-Chat`, etc.
...@@ -160,14 +166,48 @@ Alongside each architecture, we include some popular models that use it. ...@@ -160,14 +166,48 @@ Alongside each architecture, we include some popular models that use it.
- :code:`xverse/XVERSE-7B-Chat`, :code:`xverse/XVERSE-13B-Chat`, :code:`xverse/XVERSE-65B-Chat`, etc. - :code:`xverse/XVERSE-7B-Chat`, :code:`xverse/XVERSE-13B-Chat`, :code:`xverse/XVERSE-65B-Chat`, etc.
- -
.. note::
Currently, the ROCm version of vLLM supports Mistral and Mixtral only for context lengths up to 4096.
.. _supported_vlms:
Vision Language Models
^^^^^^^^^^^^^^^^^^^^^^^
.. list-table::
:widths: 25 25 50 5
:header-rows: 1
* - Architecture
- Models
- Example HuggingFace Models
- :ref:`LoRA <lora>`
* - :code:`FuyuForCausalLM`
- Fuyu
- :code:`adept/fuyu-8b`, etc.
-
* - :code:`LlavaForConditionalGeneration`
- LLaVA-1.5
- :code:`llava-hf/llava-1.5-7b-hf`, :code:`llava-hf/llava-1.5-13b-hf`, etc.
-
* - :code:`LlavaNextForConditionalGeneration`
- LLaVA-NeXT
- :code:`llava-hf/llava-v1.6-mistral-7b-hf`, :code:`llava-hf/llava-v1.6-vicuna-7b-hf`, etc.
-
* - :code:`PaliGemmaForConditionalGeneration`
- PaliGemma
- :code:`google/paligemma-3b-pt-224`, :code:`google/paligemma-3b-mix-224`, etc.
-
* - :code:`Phi3VForCausalLM`
- Phi-3-Vision
- :code:`microsoft/Phi-3-vision-128k-instruct`, etc.
-
If your model uses one of the above model architectures, you can seamlessly run your model with vLLM. If your model uses one of the above model architectures, you can seamlessly run your model with vLLM.
Otherwise, please refer to :ref:`Adding a New Model <adding_a_new_model>` for instructions on how to implement support for your model. Otherwise, please refer to :ref:`Adding a New Model <adding_a_new_model>` and :ref:`Enabling Multimodal Inputs <enabling_multimodal_inputs>`
for instructions on how to implement support for your model.
Alternatively, you can raise an issue on our `GitHub <https://github.com/vllm-project/vllm/issues>`_ project. Alternatively, you can raise an issue on our `GitHub <https://github.com/vllm-project/vllm/issues>`_ project.
.. note::
Currently, the ROCm version of vLLM supports Mistral and Mixtral only for context lengths up to 4096.
.. tip:: .. tip::
The easiest way to check if your model is supported is to run the program below: The easiest way to check if your model is supported is to run the program below:
...@@ -198,8 +238,9 @@ Alternatively, you can raise an issue on our `GitHub <https://github.com/vllm-pr ...@@ -198,8 +238,9 @@ Alternatively, you can raise an issue on our `GitHub <https://github.com/vllm-pr
output = llm.generate("Hello, my name is") output = llm.generate("Hello, my name is")
print(output) print(output)
Model Support Policy Model Support Policy
--------------------- =====================
At vLLM, we are committed to facilitating the integration and support of third-party models within our ecosystem. Our approach is designed to balance the need for robustness and the practical limitations of supporting a wide range of models. Here’s how we manage third-party model support: At vLLM, we are committed to facilitating the integration and support of third-party models within our ecosystem. Our approach is designed to balance the need for robustness and the practical limitations of supporting a wide range of models. Here’s how we manage third-party model support:
......
...@@ -3,26 +3,17 @@ ...@@ -3,26 +3,17 @@
Using VLMs Using VLMs
========== ==========
vLLM provides experimental support for Vision Language Models (VLMs). This document shows you how to run and serve these models using vLLM. vLLM provides experimental support for Vision Language Models (VLMs). See the :ref:`list of supported VLMs here <supported_vlms>`.
This document shows you how to run and serve these models using vLLM.
Engine Arguments
----------------
The following :ref:`engine arguments <engine_args>` are specific to VLMs:
.. argparse::
:module: vllm.engine.arg_utils
:func: _vlm_engine_args_parser
:prog: -m vllm.entrypoints.openai.api_server
:nodefaultconst:
.. important:: .. important::
We are actively iterating on VLM support. Expect breaking changes to VLM usage and development in upcoming releases without prior deprecation.
Currently, the support for vision language models on vLLM has the following limitations: Currently, the support for vision language models on vLLM has the following limitations:
* Only single image input is supported per text prompt. * Only single image input is supported per text prompt.
* Dynamic ``image_input_shape`` is not supported: the input image will be resized to the static ``image_input_shape``. This means model output might not exactly match the HuggingFace implementation.
We are continuously improving user & developer experience for VLMs. Please raise an issue on GitHub if you have any feedback or feature requests. We are continuously improving user & developer experience for VLMs. Please `open an issue on GitHub <https://github.com/vllm-project/vllm/issues/new/choose>`_ if you have any feedback or feature requests.
Offline Batched Inference Offline Batched Inference
------------------------- -------------------------
...@@ -31,38 +22,60 @@ To initialize a VLM, the aforementioned arguments must be passed to the ``LLM`` ...@@ -31,38 +22,60 @@ To initialize a VLM, the aforementioned arguments must be passed to the ``LLM``
.. code-block:: python .. code-block:: python
llm = LLM( llm = LLM(model="llava-hf/llava-1.5-7b-hf")
model="llava-hf/llava-1.5-7b-hf",
image_input_type="pixel_values", .. important::
image_token_id=32000, We have removed all vision language related CLI args in the ``0.5.1`` release. **This is a breaking change**, so please update your code to follow
image_input_shape="1,3,336,336", the above snippet. Specifically, ``image_feature_size`` is no longer required to be specified as we now calculate that
image_feature_size=576, internally for each model.
)
To pass an image to the model, note the following in :class:`vllm.inputs.PromptStrictInputs`: To pass an image to the model, note the following in :class:`vllm.inputs.PromptStrictInputs`:
* ``prompt``: The prompt should have a number of ``<image>`` tokens equal to ``image_feature_size``. * ``prompt``: The prompt should follow the format that is documented on HuggingFace.
* ``multi_modal_data``: This should be an instance of :class:`~vllm.multimodal.image.ImagePixelData` or :class:`~vllm.multimodal.image.ImageFeatureData`. * ``multi_modal_data``: This is a dictionary that follows the schema defined in :class:`vllm.multimodal.MultiModalDataDict`.
.. code-block:: python .. code-block:: python
prompt = "<image>" * 576 + ( # Refer to the HuggingFace repo for the correct format to use
"\nUSER: What is the content of this image?\nASSISTANT:") prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"
# Load the image using PIL.Image # Load the image using PIL.Image
image = ... image = PIL.Image.open(...)
# Single prompt inference
outputs = llm.generate({ outputs = llm.generate({
"prompt": prompt, "prompt": prompt,
"multi_modal_data": ImagePixelData(image), "multi_modal_data": {"image": image},
}) })
for o in outputs: for o in outputs:
generated_text = o.outputs[0].text generated_text = o.outputs[0].text
print(generated_text) print(generated_text)
# Batch inference
image_1 = PIL.Image.open(...)
image_2 = PIL.Image.open(...)
outputs = llm.generate(
[
{
"prompt": "USER: <image>\nWhat is the content of this image?\nASSISTANT:",
"multi_modal_data": {"image": image_1},
},
{
"prompt": "USER: <image>\nWhat's the color of this image?\nASSISTANT:",
"multi_modal_data": {"image": image_2},
}
]
)
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
A code example can be found in `examples/llava_example.py <https://github.com/vllm-project/vllm/blob/main/examples/llava_example.py>`_. A code example can be found in `examples/llava_example.py <https://github.com/vllm-project/vllm/blob/main/examples/llava_example.py>`_.
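When many question/image pairs need to be submitted at once, the request dictionaries can be built programmatically. The sketch below only constructs the list of inputs in the ``prompt`` / ``multi_modal_data`` shape shown above. ``build_requests`` is a hypothetical helper, and the prompt template is the LLaVA-1.5 format used in this document, so other models will need a different template.

```python
from typing import Any, Dict, List

# LLaVA-1.5 style prompt, as used in the examples above.
PROMPT_TEMPLATE = "USER: <image>\n{question}\nASSISTANT:"


def build_requests(questions: List[str],
                   images: List[Any]) -> List[Dict[str, Any]]:
    """Pair each question with its image in the expected input format."""
    if len(questions) != len(images):
        raise ValueError("questions and images must have the same length")
    return [
        {
            "prompt": PROMPT_TEMPLATE.format(question=question),
            "multi_modal_data": {"image": image},
        }
        for question, image in zip(questions, images)
    ]


# Placeholder objects stand in for PIL images in this sketch.
requests = build_requests(
    ["What is the content of this image?", "What's the color of this image?"],
    [object(), object()],
)
print(requests[0]["prompt"])
```

The resulting list can then be passed to ``llm.generate`` in place of the hand-written list in the batch inference example.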
Online OpenAI Vision API Compatible Inference Online OpenAI Vision API Compatible Inference
---------------------------------------------- ----------------------------------------------
...@@ -83,12 +96,13 @@ Below is an example on how to launch the same ``llava-hf/llava-1.5-7b-hf`` with ...@@ -83,12 +96,13 @@ Below is an example on how to launch the same ``llava-hf/llava-1.5-7b-hf`` with
python -m vllm.entrypoints.openai.api_server \ python -m vllm.entrypoints.openai.api_server \
--model llava-hf/llava-1.5-7b-hf \ --model llava-hf/llava-1.5-7b-hf \
--image-input-type pixel_values \
--image-token-id 32000 \
--image-input-shape 1,3,336,336 \
--image-feature-size 576 \
--chat-template template_llava.jinja --chat-template template_llava.jinja
.. important::
All vision-language-related CLI arguments were removed in the ``0.5.1`` release. **This is a breaking change**, so please update your code to follow
the above snippet. In particular, ``image_feature_size`` no longer needs to be specified because vLLM now calculates it
internally for each model.
To consume the server, you can use the OpenAI client like in the example below: To consume the server, you can use the OpenAI client like in the example below:
.. code-block:: python .. code-block:: python
...@@ -105,6 +119,8 @@ To consume the server, you can use the OpenAI client like in the example below: ...@@ -105,6 +119,8 @@ To consume the server, you can use the OpenAI client like in the example below:
messages=[{ messages=[{
"role": "user", "role": "user",
"content": [ "content": [
# NOTE: The prompt formatting with the image token `<image>` is not needed
# since the prompt will be processed automatically by the API server.
{"type": "text", "text": "What's in this image?"}, {"type": "text", "text": "What's in this image?"},
{ {
"type": "image_url", "type": "image_url",
...@@ -117,6 +133,8 @@ To consume the server, you can use the OpenAI client like in the example below: ...@@ -117,6 +133,8 @@ To consume the server, you can use the OpenAI client like in the example below:
) )
print("Chat response:", chat_response) print("Chat response:", chat_response)
A full code example can be found in `examples/openai_vision_api_client.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_vision_api_client.py>`_.
.. note:: .. note::
By default, the timeout for fetching images through http url is ``5`` seconds. You can override this by setting the environment variable: By default, the timeout for fetching images through http url is ``5`` seconds. You can override this by setting the environment variable:
...@@ -126,5 +144,4 @@ To consume the server, you can use the OpenAI client like in the example below: ...@@ -126,5 +144,4 @@ To consume the server, you can use the OpenAI client like in the example below:
export VLLM_IMAGE_FETCH_TIMEOUT=<timeout> export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>
.. note:: .. note::
The prompt formatting with the image token ``<image>`` is not needed when serving VLMs with the API server since the prompt will be There is no need to format the prompt in the API request since it will be handled by the server.
processed automatically by the server.
...@@ -3,7 +3,10 @@ ...@@ -3,7 +3,10 @@
FP8 FP8
================== ==================
vLLM supports FP8 (8-bit floating point) computation using hardware acceleration on GPUs such as Nvidia H100 and AMD MI300x. Currently, only Hopper and Ada Lovelace GPUs are supported. Quantization of models with FP8 allows for a 2x reduction in model memory requirements and up to a 1.6x improvement in throughput with minimal impact on accuracy. vLLM supports FP8 (8-bit floating point) weight and activation quantization using hardware acceleration on GPUs such as Nvidia H100 and AMD MI300x.
Currently, only Hopper and Ada Lovelace GPUs are officially supported for W8A8.
Ampere GPUs are supported for W8A16 (weight-only FP8) utilizing Marlin kernels.
Quantization of models with FP8 allows for a 2x reduction in model memory requirements and up to a 1.6x improvement in throughput with minimal impact on accuracy.
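The 2x figure follows directly from the storage width of the weights. As a back-of-the-envelope sketch, the helper below estimates weight memory for a hypothetical 7-billion-parameter model at 16 and 8 bits per weight. It counts weights only and ignores activations, the KV cache, and quantization scales.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


fp16_gb = weight_memory_gb(7e9, 16)
fp8_gb = weight_memory_gb(7e9, 8)
print(fp16_gb, fp8_gb, fp16_gb / fp8_gb)  # 14.0 7.0 2.0
```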
Please visit the HF collection of `quantized FP8 checkpoints of popular LLMs ready to use with vLLM <https://huggingface.co/collections/neuralmagic/fp8-llms-for-vllm-666742ed2b78b7ac8df13127>`_. Please visit the HF collection of `quantized FP8 checkpoints of popular LLMs ready to use with vLLM <https://huggingface.co/collections/neuralmagic/fp8-llms-for-vllm-666742ed2b78b7ac8df13127>`_.
......
.. _supported_hardware_for_quantization:
Supported Hardware for Quantization Kernels
===========================================
The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
============== ====== ======= ======= ===== ====== ======= ========= ======= ============== ==========
Implementation Volta Turing Ampere Ada Hopper AMD GPU Intel GPU x86 CPU AWS Inferentia Google TPU
============== ====== ======= ======= ===== ====== ======= ========= ======= ============== ==========
AQLM ✅ ✅ ✅ ✅ ✅ ❌ ❌ ❌ ❌ ❌
AWQ ❌ ✅ ✅ ✅ ✅ ❌ ❌ ❌ ❌ ❌
DeepSpeedFP ✅ ✅ ✅ ✅ ✅ ❌ ❌ ❌ ❌ ❌
FP8 ❌ ❌ ✅ ✅ ✅ ❌ ❌ ❌ ❌ ❌
Marlin ❌ ❌ ✅ ✅ ✅ ❌ ❌ ❌ ❌ ❌
GPTQ ✅ ✅ ✅ ✅ ✅ ❌ ❌ ❌ ❌ ❌
SqueezeLLM ✅ ✅ ✅ ✅ ✅ ❌ ❌ ❌ ❌ ❌
bitsandbytes ✅ ✅ ✅ ✅ ✅ ❌ ❌ ❌ ❌ ❌
============== ====== ======= ======= ===== ====== ======= ========= ======= ============== ==========
Notes:
^^^^^^
- Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
- "✅" indicates that the quantization method is supported on the specified hardware.
- "❌" indicates that the quantization method is not supported on the specified hardware.
Please note that this compatibility chart may be subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.
For the most up-to-date information on hardware support and quantization methods, please check the `quantization directory <https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/layers/quantization>`_ or consult with the vLLM development team.
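The NVIDIA columns of the table can also be expressed as a small lookup, which is convenient when choosing a quantization method for a given GPU. The sketch below transcribes the table and the SM mapping from the notes above. ``supported_methods`` is a hypothetical helper, not a vLLM API, and the data is only as current as the table itself.

```python
from typing import Dict, List, Set, Tuple

ALL_ARCHS: Set[str] = {"Volta", "Turing", "Ampere", "Ada", "Hopper"}

# Transcribed from the NVIDIA columns of the table above.
SUPPORT: Dict[str, Set[str]] = {
    "AQLM": ALL_ARCHS,
    "AWQ": {"Turing", "Ampere", "Ada", "Hopper"},
    "DeepSpeedFP": ALL_ARCHS,
    "FP8": {"Ampere", "Ada", "Hopper"},
    "Marlin": {"Ampere", "Ada", "Hopper"},
    "GPTQ": ALL_ARCHS,
    "SqueezeLLM": ALL_ARCHS,
    "bitsandbytes": ALL_ARCHS,
}

# Compute capability (major, minor) to architecture name, per the notes.
SM_TO_ARCH: Dict[Tuple[int, int], str] = {
    (7, 0): "Volta",
    (7, 5): "Turing",
    (8, 0): "Ampere",
    (8, 6): "Ampere",
    (8, 9): "Ada",
    (9, 0): "Hopper",
}


def supported_methods(compute_capability: Tuple[int, int]) -> List[str]:
    """Return the methods the table marks as supported for this GPU."""
    arch = SM_TO_ARCH.get(compute_capability)
    return sorted(name for name, archs in SUPPORT.items() if arch in archs)


print(supported_methods((7, 0)))
# ['AQLM', 'DeepSpeedFP', 'GPTQ', 'SqueezeLLM', 'bitsandbytes']
```

An unknown compute capability yields an empty list, which mirrors the table's "not supported" default.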
.. _deploying_with_cerebrium:
Deploying with Cerebrium
============================
.. raw:: html
<p align="center">
<img src="https://i.ibb.co/hHcScTT/Screenshot-2024-06-13-at-10-14-54.png" alt="vLLM_plus_cerebrium"/>
</p>
vLLM can be run on a cloud based GPU machine with `Cerebrium <https://www.cerebrium.ai/>`__, a serverless AI infrastructure platform that makes it easier for companies to build and deploy AI based applications.
To install the Cerebrium client, run:
.. code-block:: console
$ pip install cerebrium
$ cerebrium login
Next, create your Cerebrium project by running:
.. code-block:: console
$ cerebrium init vllm-project
Next, to install the required packages, add the following to your cerebrium.toml:
.. code-block:: toml
[cerebrium.dependencies.pip]
vllm = "latest"
Next, add the code that handles inference for the LLM of your choice (`mistralai/Mistral-7B-Instruct-v0.1` in this example). Add the following code to your `main.py`:
.. code-block:: python

    from vllm import LLM, SamplingParams

    llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")

    def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95):
        sampling_params = SamplingParams(temperature=temperature, top_p=top_p)
        outputs = llm.generate(prompts, sampling_params)

        # Collect each prompt together with its generated text.
        results = []
        for output in outputs:
            prompt = output.prompt
            generated_text = output.outputs[0].text
            results.append({"prompt": prompt, "generated_text": generated_text})

        return {"results": results}

Then, run the following command to deploy it to the cloud:
.. code-block:: console
$ cerebrium deploy
If the deployment succeeds, Cerebrium returns a curl command that you can use to run inference. Remember to end the URL with the name of the function you are calling (``/run`` in our case):
.. code-block:: console
curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
-H 'Content-Type: application/json' \
-H 'Authorization: <JWT TOKEN>' \
--data '{
"prompts": [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is"
]
}'
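The same request can also be prepared from Python. This is a minimal sketch using only the standard library; the helper name ``build_run_request`` is hypothetical, and the endpoint and token are the placeholders from the curl command above, so the request is built but not sent.

```python
import json
import urllib.request
from typing import List


def build_run_request(endpoint: str, token: str,
                      prompts: List[str]) -> urllib.request.Request:
    """Build (but do not send) a POST request equivalent to the curl call."""
    body = json.dumps({"prompts": prompts}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": token,
        },
        method="POST",
    )


request = build_run_request(
    "https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run",  # placeholder URL
    "<JWT TOKEN>",  # placeholder token
    ["Hello, my name is", "The capital of France is"],
)
print(request.get_method(), request.full_url)
```

With a real endpoint and token, passing ``request`` to ``urllib.request.urlopen`` would send it.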
You should get a response like:
.. code-block:: json
{
"run_id": "52911756-3066-9ae8-bcc9-d9129d1bd262",
"result": {
"results": [
{
"prompt": "Hello, my name is",
"generated_text": " Sarah, and I'm a teacher. I teach elementary school students. One of"
},
{
"prompt": "The president of the United States is",
"generated_text": " elected every four years. This is a democratic system.\n\n5. What"
},
{
"prompt": "The capital of France is",
"generated_text": " Paris.\n"
},
{
"prompt": "The future of AI is",
"generated_text": " bright, but it's important to approach it with a balanced and nuanced perspective."
}
]
},
"run_time_ms": 152.53663063049316
}
You now have an autoscaling endpoint where you only pay for the compute you use!
...@@ -3,9 +3,8 @@ ...@@ -3,9 +3,8 @@
Deploying with Docker Deploying with Docker
============================ ============================
vLLM offers official docker image for deployment. vLLM offers an official Docker image for deployment.
The image can be used to run OpenAI compatible server. The image can be used to run OpenAI compatible server and is available on Docker Hub as `vllm/vllm-openai <https://hub.docker.com/r/vllm/vllm-openai/tags>`_.
The image is available on Docker Hub as `vllm/vllm-openai <https://hub.docker.com/r/vllm/vllm-openai/tags>`_.
.. code-block:: console .. code-block:: console
...@@ -25,7 +24,7 @@ The image is available on Docker Hub as `vllm/vllm-openai <https://hub.docker.co ...@@ -25,7 +24,7 @@ The image is available on Docker Hub as `vllm/vllm-openai <https://hub.docker.co
memory to share data between processes under the hood, particularly for tensor parallel inference. memory to share data between processes under the hood, particularly for tensor parallel inference.
You can build and run vLLM from source via the provided dockerfile. To build vLLM: You can build and run vLLM from source via the provided `Dockerfile <https://github.com/vllm-project/vllm/blob/main/Dockerfile>`_. To build vLLM:
.. code-block:: console .. code-block:: console
......
.. _distributed_serving: .. _distributed_serving:
How to decide the distributed inference strategy?
=================================================
Before going into the details of distributed inference and serving, let's first make it clear when to use distributed inference and which strategies are available. The common practice is:
- **Single GPU (no distributed inference)**: If your model fits in a single GPU, you probably don't need to use distributed inference. Just use the single GPU to run the inference.
- **Single-Node Multi-GPU (tensor parallel inference)**: If your model is too large to fit in a single GPU, but it can fit in a single node with multiple GPUs, you can use tensor parallelism. The tensor parallel size is the number of GPUs you want to use. For example, if you have 4 GPUs in a single node, you can set the tensor parallel size to 4.
- **Multi-Node Multi-GPU (tensor parallel plus pipeline parallel inference)**: If your model is too large to fit in a single node, you can use tensor parallelism together with pipeline parallelism. The tensor parallel size is the number of GPUs you want to use in each node, and the pipeline parallel size is the number of nodes you want to use. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2.
In short, you should increase the number of GPUs and the number of nodes until you have enough GPU memory to hold the model. The tensor parallel size should be the number of GPUs in each node, and the pipeline parallel size should be the number of nodes.
After adding enough GPUs and nodes to hold the model, run vLLM; it prints logs such as ``# GPU blocks: 790``. Multiply that number by ``16`` (the block size) to get roughly the maximum number of tokens that can be served on the current configuration. If this number is too low, e.g. because you want higher throughput, you can further increase the number of GPUs or nodes until the number of blocks is sufficient.
.. note::
There is one edge case: if the model fits in a single node with multiple GPUs, but the number of GPUs cannot divide the model size evenly, you can use pipeline parallelism, which splits the model along layers and supports uneven splits. In this case, the tensor parallel size should be 1 and the pipeline parallel size should be the number of GPUs.
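The sizing rule and the token estimate above can be written down as a short sketch. The function names ``choose_parallel_sizes`` and ``approx_max_tokens`` are hypothetical helpers for illustration, not vLLM APIs, and the sketch covers only the common cases, not the uneven-split edge case from the note.

```python
from typing import Tuple

BLOCK_SIZE = 16  # block size quoted in the text above


def choose_parallel_sizes(num_nodes: int, gpus_per_node: int) -> Tuple[int, int]:
    """Return (tensor_parallel_size, pipeline_parallel_size).

    Tensor parallel size is the number of GPUs per node and pipeline
    parallel size is the number of nodes, as described above.
    """
    if num_nodes < 1 or gpus_per_node < 1:
        raise ValueError("num_nodes and gpus_per_node must be positive")
    return gpus_per_node, num_nodes


def approx_max_tokens(num_gpu_blocks: int, block_size: int = BLOCK_SIZE) -> int:
    """Rough number of tokens that fit in the reported GPU blocks."""
    return num_gpu_blocks * block_size


tp_size, pp_size = choose_parallel_sizes(num_nodes=2, gpus_per_node=8)
print(tp_size, pp_size)        # 8 2
print(approx_max_tokens(790))  # 12640
```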
Distributed Inference and Serving Distributed Inference and Serving
================================= =================================
vLLM supports distributed tensor-parallel inference and serving. Currently, we support `Megatron-LM's tensor parallel algorithm <https://arxiv.org/pdf/1909.08053.pdf>`_. We manage the distributed runtime with either `Ray <https://github.com/ray-project/ray>`_ or python native multiprocessing. Multiprocessing can be used when deploying on a single node, multi-node inferencing currently requires Ray. vLLM supports distributed tensor-parallel inference and serving. Currently, we support `Megatron-LM's tensor parallel algorithm <https://arxiv.org/pdf/1909.08053.pdf>`_. We also support pipeline parallel as a beta feature for online serving. We manage the distributed runtime with either `Ray <https://github.com/ray-project/ray>`_ or python native multiprocessing. Multiprocessing can be used when deploying on a single node, multi-node inferencing currently requires Ray.
Multiprocessing will be used by default when not running in a Ray placement group and if there are sufficient GPUs available on the same node for the configured :code:`tensor_parallel_size`, otherwise Ray will be used. This default can be overridden via the :code:`LLM` class :code:`distributed-executor-backend` argument or :code:`--distributed-executor-backend` API server argument. Set it to :code:`mp` for multiprocessing or :code:`ray` for Ray. It's not required for Ray to be installed for the multiprocessing case. Multiprocessing will be used by default when not running in a Ray placement group and if there are sufficient GPUs available on the same node for the configured :code:`tensor_parallel_size`, otherwise Ray will be used. This default can be overridden via the :code:`LLM` class :code:`distributed-executor-backend` argument or :code:`--distributed-executor-backend` API server argument. Set it to :code:`mp` for multiprocessing or :code:`ray` for Ray. It's not required for Ray to be installed for the multiprocessing case.
...@@ -19,10 +35,23 @@ To run multi-GPU serving, pass in the :code:`--tensor-parallel-size` argument wh ...@@ -19,10 +35,23 @@ To run multi-GPU serving, pass in the :code:`--tensor-parallel-size` argument wh
.. code-block:: console .. code-block:: console
$ python -m vllm.entrypoints.api_server \ $ python -m vllm.entrypoints.openai.api_server \
$ --model facebook/opt-13b \ $ --model facebook/opt-13b \
$ --tensor-parallel-size 4 $ --tensor-parallel-size 4
You can additionally specify :code:`--pipeline-parallel-size` to enable pipeline parallelism. For example, to run the API server on 8 GPUs with pipeline parallelism and tensor parallelism:
.. code-block:: console
$ python -m vllm.entrypoints.openai.api_server \
$ --model gpt2 \
$ --tensor-parallel-size 4 \
$ --pipeline-parallel-size 2 \
$ --distributed-executor-backend ray
.. note::
Pipeline parallelism is a beta feature. For now, it is only supported for online serving with the Ray backend and for LLaMa and GPT2 style models.
To scale vLLM beyond a single machine, install and start a `Ray runtime <https://docs.ray.io/en/latest/ray-core/starting-ray.html>`_ via CLI before running vLLM: To scale vLLM beyond a single machine, install and start a `Ray runtime <https://docs.ray.io/en/latest/ray-core/starting-ray.html>`_ via CLI before running vLLM:
.. code-block:: console .. code-block:: console
...@@ -35,4 +64,7 @@ To scale vLLM beyond a single machine, install and start a `Ray runtime <https:/ ...@@ -35,4 +64,7 @@ To scale vLLM beyond a single machine, install and start a `Ray runtime <https:/
$ # On worker nodes $ # On worker nodes
$ ray start --address=<ray-head-address> $ ray start --address=<ray-head-address>
After that, you can run inference and serving on multiple machines by launching the vLLM process on the head node by setting :code:`tensor_parallel_size` to the number of GPUs to be the total number of GPUs across all machines. After that, you can run inference and serving on multiple machines by launching the vLLM process on the head node by setting :code:`tensor_parallel_size` multiplied by :code:`pipeline_parallel_size` to the number of GPUs to be the total number of GPUs across all machines.
.. warning::
Please make sure the model is downloaded to all the nodes, or to a distributed file system that is accessible by all nodes.
...@@ -3,6 +3,11 @@ Environment Variables ...@@ -3,6 +3,11 @@ Environment Variables
vLLM uses the following environment variables to configure the system: vLLM uses the following environment variables to configure the system:
.. warning::
Please note that ``VLLM_PORT`` and ``VLLM_HOST_IP`` set the port and IP for vLLM's **internal usage**. They are not the port and IP for the API server. If you use ``--host $VLLM_HOST_IP`` and ``--port $VLLM_PORT`` to start the API server, it will not work.
All environment variables used by vLLM are prefixed with ``VLLM_``. **Special care should be taken for Kubernetes users**: please do not name the service as ``vllm``, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because `Kubernetes sets environment variables for each service with the capitalized service name as the prefix <https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables>`_.
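To make the conflict concrete, here is a minimal sketch of how Kubernetes derives variable names from a service name (upper-cased, with dashes converted to underscores). The helper names are hypothetical, only a subset of the injected variables is listed, and ``VLLM_ENV_VARS`` contains just the two variables named in the warning above rather than the full list in ``vllm/envs.py``.

```python
from typing import List, Set

# Variables named in the warning above; vLLM defines more in vllm/envs.py.
VLLM_ENV_VARS: Set[str] = {"VLLM_PORT", "VLLM_HOST_IP"}


def k8s_service_env_names(service_name: str) -> List[str]:
    """Subset of the variables Kubernetes injects for a service."""
    prefix = service_name.upper().replace("-", "_")
    return [
        f"{prefix}_SERVICE_HOST",
        f"{prefix}_SERVICE_PORT",
        f"{prefix}_PORT",  # Docker-links-style variable, e.g. tcp://<ip>:<port>
    ]


def conflicting_vars(service_name: str) -> Set[str]:
    """vLLM variables that this service name would overwrite."""
    return VLLM_ENV_VARS.intersection(k8s_service_env_names(service_name))


print(conflicting_vars("vllm"))        # {'VLLM_PORT'}
print(conflicting_vars("llm-server"))  # set()
```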
.. literalinclude:: ../../../vllm/envs.py .. literalinclude:: ../../../vllm/envs.py
:language: python :language: python
:start-after: begin-env-vars-definition :start-after: begin-env-vars-definition
......
Frequently Asked Questions
===========================
Q: How can I serve multiple models on a single port using the OpenAI API?
A: Assuming you mean using the OpenAI-compatible server to serve multiple models at once, that is not currently supported. Instead, you can run multiple instances of the server (each serving a different model) at the same time and add another layer that routes each incoming request to the correct server.
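The routing layer can be as simple as a lookup on the ``model`` field that OpenAI-style requests carry. The following is a minimal sketch under assumed names: ``BACKENDS``, ``pick_backend``, and the ports are hypothetical, and a real router would also forward the request and stream the response back.

```python
from typing import Dict

# Hypothetical deployment: one OpenAI-compatible vLLM server per model.
BACKENDS: Dict[str, str] = {
    "facebook/opt-13b": "http://localhost:8000",
    "gpt2": "http://localhost:8001",
}


def pick_backend(request_body: Dict[str, object],
                 backends: Dict[str, str] = BACKENDS) -> str:
    """Return the base URL of the server that hosts the requested model."""
    model = request_body.get("model")
    if model not in backends:
        raise KeyError(f"No server is configured for model {model!r}")
    return backends[model]


print(pick_backend({"model": "gpt2", "prompt": "Hello"}))  # http://localhost:8001
```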
----------------------------------------
Q: Which model to use for offline inference embedding?
A: If you want to use an embedding model, try: https://huggingface.co/intfloat/e5-mistral-7b-instruct. In contrast, models such as Llama-3-8b and Mistral-7B-Instruct-v0.3 are generation models rather than embedding models.
...@@ -8,6 +8,7 @@ Integrations ...@@ -8,6 +8,7 @@ Integrations
deploying_with_kserve deploying_with_kserve
deploying_with_triton deploying_with_triton
deploying_with_bentoml deploying_with_bentoml
deploying_with_cerebrium
deploying_with_lws deploying_with_lws
deploying_with_dstack deploying_with_dstack
serving_with_langchain serving_with_langchain
...@@ -109,7 +109,7 @@ directory [here](https://github.com/vllm-project/vllm/tree/main/examples/) ...@@ -109,7 +109,7 @@ directory [here](https://github.com/vllm-project/vllm/tree/main/examples/)
```{argparse} ```{argparse}
:module: vllm.entrypoints.openai.cli_args :module: vllm.entrypoints.openai.cli_args
:func: make_arg_parser :func: create_parser_for_docs
:prog: -m vllm.entrypoints.openai.api_server :prog: -m vllm.entrypoints.openai.api_server
``` ```
......
.. _tensorizer:
Loading Models with CoreWeave's Tensorizer
==========================================
vLLM supports loading models with `CoreWeave's Tensorizer <https://docs.coreweave.com/coreweave-machine-learning-and-ai/inference/tensorizer>`_.
vLLM model tensors that have been serialized to disk, an HTTP/HTTPS endpoint, or S3 endpoint can be deserialized
at runtime extremely quickly directly to the GPU, resulting in significantly
shorter Pod startup times and lower CPU memory usage. Tensor encryption is also supported.
For more information on CoreWeave's Tensorizer, please refer to
`CoreWeave's Tensorizer documentation <https://github.com/coreweave/tensorizer>`_. For more information on serializing a vLLM model, as well as a general usage guide for Tensorizer with vLLM, see
the `vLLM example script <https://docs.vllm.ai/en/stable/getting_started/examples/tensorize_vllm_model.html>`_.
"""Example Python client for vllm.entrypoints.api_server""" """Example Python client for vllm.entrypoints.api_server
NOTE: The API server is used only for demonstration and simple performance
benchmarks. It is not intended for production use.
For production use, we recommend vllm.entrypoints.openai.api_server
and the OpenAI client API
"""
import argparse import argparse
import json import json
......
import argparse
from vllm import LLM, SamplingParams from vllm import LLM, SamplingParams
from vllm.utils import FlexibleArgumentParser
def main(): def main():
parser = argparse.ArgumentParser(description='AQLM examples') parser = FlexibleArgumentParser(description='AQLM examples')
parser.add_argument('--model', parser.add_argument('--model',
'-m', '-m',
...@@ -17,7 +16,7 @@ def main(): ...@@ -17,7 +16,7 @@ def main():
type=int, type=int,
default=0, default=0,
help='known good models by index, [0-4]') help='known good models by index, [0-4]')
parser.add_argument('--tensor_parallel_size', parser.add_argument('--tensor-parallel-size',
'-t', '-t',
type=int, type=int,
default=1, default=1,
......
...@@ -2,7 +2,7 @@ import argparse ...@@ -2,7 +2,7 @@ import argparse
import glob import glob
import json import json
import os import os
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple from typing import Any, Callable, Dict, List, Optional, Tuple
import numpy as np import numpy as np
import torch import torch
...@@ -19,7 +19,7 @@ def _prepare_hf_weights( ...@@ -19,7 +19,7 @@ def _prepare_hf_weights(
quantized_model_dir: str, quantized_model_dir: str,
load_format: str = "auto", load_format: str = "auto",
fall_back_to_pt: bool = True, fall_back_to_pt: bool = True,
) -> Tuple[str, List[str], bool]: ) -> Tuple[List[str], bool]:
if not os.path.isdir(quantized_model_dir): if not os.path.isdir(quantized_model_dir):
raise FileNotFoundError( raise FileNotFoundError(
f"The quantized model directory `{quantized_model_dir}` " f"The quantized model directory `{quantized_model_dir}` "
...@@ -94,7 +94,7 @@ def _hf_tensorfile_iterator(filename: str, load_format: str, ...@@ -94,7 +94,7 @@ def _hf_tensorfile_iterator(filename: str, load_format: str,
def _kv_scales_extractor( def _kv_scales_extractor(
hf_tensor_files: Iterable[str], hf_tensor_files: List[str],
use_safetensors: bool, use_safetensors: bool,
rank_keyword: str = "rank", rank_keyword: str = "rank",
expected_tp_size: Optional[int] = None) -> Dict[int, Dict[int, float]]: expected_tp_size: Optional[int] = None) -> Dict[int, Dict[int, float]]:
...@@ -115,7 +115,7 @@ def _kv_scales_extractor( ...@@ -115,7 +115,7 @@ def _kv_scales_extractor(
for char in rank_keyword: for char in rank_keyword:
assert not char.isdecimal( assert not char.isdecimal(
), f"Rank keyword {rank_keyword} contains a numeric character!" ), f"Rank keyword {rank_keyword} contains a numeric character!"
rank_scales_map = {} rank_scales_map: Dict[int, Dict[int, float]] = {}
for tensor_file in hf_tensor_files: for tensor_file in hf_tensor_files:
try: try:
rank_idx = tensor_file.find(rank_keyword) rank_idx = tensor_file.find(rank_keyword)
...@@ -141,7 +141,7 @@ def _kv_scales_extractor( ...@@ -141,7 +141,7 @@ def _kv_scales_extractor(
raise raise
if rank not in rank_scales_map: if rank not in rank_scales_map:
layer_scales_map = {} layer_scales_map: Dict[int, float] = {}
rank_scales_map[rank] = layer_scales_map rank_scales_map[rank] = layer_scales_map
else: else:
raise RuntimeError( raise RuntimeError(
...@@ -222,7 +222,7 @@ def _metadata_extractor(quantized_model_dir: str, ...@@ -222,7 +222,7 @@ def _metadata_extractor(quantized_model_dir: str,
"does not exist.") "does not exist.")
metadata_files = glob.glob(os.path.join(quantized_model_dir, "*.json")) metadata_files = glob.glob(os.path.join(quantized_model_dir, "*.json"))
result = {} result: Dict[str, Any] = {}
for file in metadata_files: for file in metadata_files:
with open(file) as f: with open(file) as f:
try: try:
...@@ -327,7 +327,7 @@ if __name__ == "__main__": ...@@ -327,7 +327,7 @@ if __name__ == "__main__":
"--quantization-param-path <filename>). This is only used " "--quantization-param-path <filename>). This is only used "
"if the KV cache dtype is FP8 and on ROCm (AMD GPU).") "if the KV cache dtype is FP8 and on ROCm (AMD GPU).")
parser.add_argument( parser.add_argument(
"--quantized_model", "--quantized-model",
help="Specify the directory containing a single quantized HF model. " help="Specify the directory containing a single quantized HF model. "
"It is expected that the quantization format is FP8_E4M3, for use " "It is expected that the quantization format is FP8_E4M3, for use "
"on ROCm (AMD GPU).", "on ROCm (AMD GPU).",
...@@ -339,18 +339,18 @@ if __name__ == "__main__": ...@@ -339,18 +339,18 @@ if __name__ == "__main__":
choices=["auto", "safetensors", "npz", "pt"], choices=["auto", "safetensors", "npz", "pt"],
default="auto") default="auto")
parser.add_argument( parser.add_argument(
"--output_dir", "--output-dir",
help="Optionally specify the output directory. By default the " help="Optionally specify the output directory. By default the "
"KV cache scaling factors will be saved in the model directory, " "KV cache scaling factors will be saved in the model directory, "
"however you can override this behavior here.", "however you can override this behavior here.",
default=None) default=None)
parser.add_argument( parser.add_argument(
"--output_name", "--output-name",
help="Optionally specify the output filename.", help="Optionally specify the output filename.",
# TODO: Change this once additional scaling factors are enabled # TODO: Change this once additional scaling factors are enabled
default="kv_cache_scales.json") default="kv_cache_scales.json")
parser.add_argument( parser.add_argument(
"--tp_size", "--tp-size",
help="Optionally specify the tensor-parallel (TP) size that the " help="Optionally specify the tensor-parallel (TP) size that the "
"quantized model should correspond to. If specified, during KV " "quantized model should correspond to. If specified, during KV "
"cache scaling factor extraction the observed TP size will be " "cache scaling factor extraction the observed TP size will be "
......
...@@ -332,7 +332,7 @@ def main(args): ...@@ -332,7 +332,7 @@ def main(args):
if __name__ == "__main__": if __name__ == "__main__":
parser = argparse.ArgumentParser(description=__doc__) parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument("--model_dir", parser.add_argument("--model-dir",
help="Specify where the HuggingFace model is", help="Specify where the HuggingFace model is",
required=True) required=True)
parser.add_argument("--device", default="cuda") parser.add_argument("--device", default="cuda")
...@@ -346,19 +346,19 @@ if __name__ == "__main__": ...@@ -346,19 +346,19 @@ if __name__ == "__main__":
"full_prec" "full_prec"
], ],
) )
parser.add_argument("--batch_size", parser.add_argument("--batch-size",
help="Batch size for calibration.", help="Batch size for calibration.",
type=int, type=int,
default=1) default=1)
parser.add_argument("--calib_size", parser.add_argument("--calib-size",
help="Number of samples for calibration.", help="Number of samples for calibration.",
type=int, type=int,
default=512) default=512)
parser.add_argument("--output_dir", default="exported_model") parser.add_argument("--output-dir", default="exported_model")
parser.add_argument("--tp_size", type=int, default=1) parser.add_argument("--tp-size", type=int, default=1)
parser.add_argument("--pp_size", type=int, default=1) parser.add_argument("--pp-size", type=int, default=1)
parser.add_argument("--awq_block_size", type=int, default=128) parser.add_argument("--awq-block-size", type=int, default=128)
parser.add_argument("--kv_cache_dtype", parser.add_argument("--kv-cache-dtype",
help="KV Cache dtype.", help="KV Cache dtype.",
default=None, default=None,
choices=["int8", "fp8", None]) choices=["int8", "fp8", None])
......