Merge branch 'v0.6.3.post1-dev'

ad385667 · zhuwenwen · be0967c1 · 903593d3 · ad385667 · ad385667
Commit ad385667 authored Oct 23, 2024 by zhuwenwen
20 changed files
--- a/docs/source/getting_started/xpu-installation.rst
+++ b/docs/source/getting_started/xpu-installation.rst
@@ -17,8 +17,8 @@ Requirements
 ------------
 * OS: Linux
-* Supported Hardware: Intel Data Center GPU (Intel ARC GPU WIP)
+* Supported Hardware: Intel Data Center GPU, Intel ARC GPU
-* OneAPI requirements: oneAPI 2024.1 
+* OneAPI requirements: oneAPI 2024.2 
 .. _xpu_backend_quick_start_dockerfile:
@@ -40,7 +40,7 @@ Quick start using Dockerfile
 Build from source
 -----------------
- First, install required driver and intel OneAPI 2024.1 or later.
+- First, install required driver and intel OneAPI 2024.2 or later.
 - Second, install Python packages for vLLM XPU backend building:

--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -31,8 +31,10 @@ vLLM is fast with:
 * Efficient management of attention key and value memory with **PagedAttention**
 * Continuous batching of incoming requests
 * Fast model execution with CUDA/HIP graph
-* Quantization: `GPTQ <https://arxiv.org/abs/2210.17323>`_, `AWQ <https://arxiv.org/abs/2306.00978>`_, `SqueezeLLM <https://arxiv.org/abs/2306.07629>`_, FP8 KV Cache
+* Quantization: `GPTQ <https://arxiv.org/abs/2210.17323>`_, `AWQ <https://arxiv.org/abs/2306.00978>`_, INT4, INT8, and FP8
-* Optimized CUDA kernels
+* Optimized CUDA kernels, including integration with FlashAttention and FlashInfer.
+* Speculative decoding
+* Chunked prefill
 vLLM is flexible and easy to use with:
@@ -41,9 +43,9 @@ vLLM is flexible and easy to use with:
 * Tensor parallelism and pipeline parallelism support for distributed inference
 * Streaming outputs
 * OpenAI-compatible API server
-* Support NVIDIA GPUs and AMD GPUs
+* Support NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, TPU, and AWS Trainium and Inferentia Accelerators.
-* (Experimental) Prefix caching support
+* Prefix caching support
-* (Experimental) Multi-lora support
+* Multi-lora support
 For more information, check out the following:
@@ -53,7 +55,6 @@ For more information, check out the following:
 * :ref:`vLLM Meetups <meetups>`.
 Documentation
 -------------
@@ -78,12 +79,14 @@ Documentation
   serving/openai_compatible_server
   serving/deploying_with_docker
+   serving/deploying_with_k8s
   serving/distributed_serving
   serving/metrics
   serving/env_vars
   serving/usage_stats
   serving/integrations
   serving/tensorizer
+   serving/compatibility_matrix
   serving/faq
 .. toctree::
@@ -106,6 +109,8 @@ Documentation
   quantization/supported_hardware
   quantization/auto_awq
   quantization/bnb
+   quantization/gguf
+   quantization/int8
   quantization/fp8
   quantization/fp8_e5m2_kvcache
   quantization/fp8_e4m3_kvcache
@@ -134,6 +139,7 @@ Documentation
   dev/input_processing/model_inputs_index
   dev/multimodal/multimodal_index
   dev/dockerfile/dockerfile
+   dev/profiling/profiling_index
 .. toctree::
   :maxdepth: 1

--- a/docs/source/models/adding_model.rst
+++ b/docs/source/models/adding_model.rst
@@ -85,21 +85,21 @@ When it comes to the linear layers, we provide the following options to parallel
 * :code:`ReplicatedLinear`: Replicates the inputs and weights across multiple GPUs. No memory saving.
 * :code:`RowParallelLinear`: The input tensor is partitioned along the hidden dimension. The weight matrix is partitioned along the rows (input dimension). An *all-reduce* operation is performed after the matrix multiplication to reduce the results. Typically used for the second FFN layer and the output linear transformation of the attention layer.
 * :code:`ColumnParallelLinear`: The input tensor is replicated. The weight matrix is partitioned along the columns (output dimension). The result is partitioned along the column dimension. Typically used for the first FFN layer and the separated QKV transformation of the attention layer in the original Transformer.
-* :code:`MergedColumnParallelLinear`: Column-parallel linear that merges multiple `ColumnParallelLinear` operators. Typically used for the first FFN layer with weighted activation functions (e.g., SiLU). This class handles the sharded weight loading logic of multiple weight matrices.
+* :code:`MergedColumnParallelLinear`: Column-parallel linear that merges multiple :code:`ColumnParallelLinear` operators. Typically used for the first FFN layer with weighted activation functions (e.g., SiLU). This class handles the sharded weight loading logic of multiple weight matrices.
 * :code:`QKVParallelLinear`: Parallel linear layer for the query, key, and value projections of the multi-head and grouped-query attention mechanisms. When number of key/value heads are less than the world size, this class replicates the key/value heads properly. This class handles the weight loading and replication of the weight matrices.
-Note that all the linear layers above take `linear_method` as an input. vLLM will set this parameter according to different quantization schemes to support weight quantization.
+Note that all the linear layers above take :code:`linear_method` as an input. vLLM will set this parameter according to different quantization schemes to support weight quantization.
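The partitioning schemes listed above can be checked on a toy example. The sketch below is illustrative only: plain-Python nested lists stand in for tensors, list slices stand in for GPUs, and the helper names `matmul`, `column_parallel`, and `row_parallel` are hypothetical, not vLLM APIs. It assumes the split dimension divides evenly by the world size.

```python
# Illustrative only: nested lists stand in for tensors and list slices for GPUs.
# None of these helpers are vLLM APIs.

def matmul(a, b):
    """Multiply an (m x k) matrix by a (k x n) matrix given as nested lists."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def column_parallel(x, w, world_size):
    """Replicate x, split w along its columns, then concatenate partial outputs."""
    step = len(w[0]) // world_size
    shards = [[row[r * step:(r + 1) * step] for row in w] for r in range(world_size)]
    partials = [matmul(x, shard) for shard in shards]
    # Each rank holds a slice of the output columns; gathering restores the full result.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]


def row_parallel(x, w, world_size):
    """Split x along its hidden dimension and w along its rows, then sum the partials."""
    step = len(w) // world_size
    partials = []
    for r in range(world_size):
        x_shard = [row[r * step:(r + 1) * step] for row in x]
        w_shard = w[r * step:(r + 1) * step]
        partials.append(matmul(x_shard, w_shard))
    # The element-wise sum plays the role of the all-reduce.
    return [[sum(p[i][j] for p in partials) for j in range(len(w[0]))]
            for i in range(len(x))]


x = [[1, 2, 3, 4], [5, 6, 7, 8]]                               # 2 tokens, hidden size 4
w = [[1, 0, 2, 1], [0, 1, 1, 3], [2, 2, 0, 1], [1, 3, 1, 0]]   # 4 x 4 weight matrix
full = matmul(x, w)
```

Both `column_parallel(x, w, 2)` and `row_parallel(x, w, 2)` reproduce `full`, which is why the row-parallel variant needs the all-reduce and the column-parallel variant does not.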
 4. Implement the weight loading logic
 -------------------------------------
 You now need to implement the :code:`load_weights` method in your :code:`*ForCausalLM` class.
-This method should load the weights from the HuggingFace's checkpoint file and assign them to the corresponding layers in your model. Specifically, for `MergedColumnParallelLinear` and `QKVParallelLinear` layers, if the original model has separated weight matrices, you need to load the different parts separately.
+This method should load the weights from the HuggingFace's checkpoint file and assign them to the corresponding layers in your model. Specifically, for :code:`MergedColumnParallelLinear` and :code:`QKVParallelLinear` layers, if the original model has separated weight matrices, you need to load the different parts separately.
 5. Register your model
 ----------------------
-Finally, register your :code:`*ForCausalLM` class to the :code:`_MODELS` in `vllm/model_executor/models/__init__.py <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/__init__.py>`_.
+Finally, register your :code:`*ForCausalLM` class to the :code:`_VLLM_MODELS` in `vllm/model_executor/models/registry.py <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/registry.py>`_.
 6. Out-of-Tree Model Integration
 --------------------------------------------
@@ -114,6 +114,18 @@ Just add the following lines in your code:
    from your_code import YourModelForCausalLM
    ModelRegistry.register_model("YourModelForCausalLM", YourModelForCausalLM)
+If your model imports modules that initialize CUDA, consider instead lazy-importing it to avoid an error like :code:`RuntimeError: Cannot re-initialize CUDA in forked subprocess`:
+.. code-block:: python
+    from vllm import ModelRegistry
+    ModelRegistry.register_model("YourModelForCausalLM", "your_code:YourModelForCausalLM")
+.. important::
+    If your model is a multimodal model, make sure the model class implements the :class:`~vllm.model_executor.models.interfaces.SupportsMultiModal` interface.
+    Read more about that :ref:`here <enabling_multimodal_inputs>`.
 If you are running api server with :code:`vllm serve <args>`, you can wrap the entrypoint with the following code:
 .. code-block:: python

--- a/docs/source/models/enabling_multimodal_inputs.rst
+++ b/docs/source/models/enabling_multimodal_inputs.rst
@@ -15,14 +15,14 @@ This document walks you through the steps to extend a vLLM model so that it acce
 It is assumed that you have already implemented the model in vLLM according to :ref:`these steps <adding_a_new_model>`.
 Further update the model as follows:
- Implement the :class:`~vllm.model_executor.models.interfaces.SupportsVision` interface.
+- Implement the :class:`~vllm.model_executor.models.interfaces.SupportsMultiModal` interface.
  .. code-block:: diff
-      + from vllm.model_executor.models.interfaces import SupportsVision
+      + from vllm.model_executor.models.interfaces import SupportsMultiModal
      - class YourModelForImage2Seq(nn.Module):
-      + class YourModelForImage2Seq(nn.Module, SupportsVision):
+      + class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
  .. note::
      The model class does not have to be named :code:`*ForCausalLM`.
@@ -51,11 +51,11 @@ This decorator accepts a function that maps multi-modal inputs to the keyword ar
 .. code-block:: diff
-      from vllm.model_executor.models.interfaces import SupportsVision
+      from vllm.model_executor.models.interfaces import SupportsMultiModal
    + from vllm.multimodal import MULTIMODAL_REGISTRY
    + @MULTIMODAL_REGISTRY.register_image_input_mapper()
-      class YourModelForImage2Seq(nn.Module, SupportsVision):
+      class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
 A default mapper is available for each modality in the core vLLM library. This input mapper will be used if you do not provide your own function.
@@ -66,19 +66,19 @@ A default mapper is available for each modality in the core vLLM library. This i
 3. Register maximum number of multi-modal tokens
 ------------------------------------------------
-For each modality type that the model accepts as input, calculate the maximum possible number of tokens
+For each modality type that the model accepts as input, calculate the maximum possible number of tokens per data instance
 and register it via :meth:`INPUT_REGISTRY.register_dummy_data <vllm.inputs.registry.InputRegistry.register_max_multimodal_tokens>`.
 .. code-block:: diff
      from vllm.inputs import INPUT_REGISTRY
-      from vllm.model_executor.models.interfaces import SupportsVision
+      from vllm.model_executor.models.interfaces import SupportsMultiModal
      from vllm.multimodal import MULTIMODAL_REGISTRY
      @MULTIMODAL_REGISTRY.register_image_input_mapper()
    + @MULTIMODAL_REGISTRY.register_max_image_tokens(<your_calculation>)
      @INPUT_REGISTRY.register_dummy_data(<your_dummy_data_factory>)
-      class YourModelForImage2Seq(nn.Module, SupportsVision):
+      class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
 Here are some examples:
@@ -98,13 +98,13 @@ In such cases, you can define your own dummy data by registering a factory metho
 .. code-block:: diff
      from vllm.inputs import INPUT_REGISTRY
-      from vllm.model_executor.models.interfaces import SupportsVision
+      from vllm.model_executor.models.interfaces import SupportsMultiModal
      from vllm.multimodal import MULTIMODAL_REGISTRY
      @MULTIMODAL_REGISTRY.register_image_input_mapper()
      @MULTIMODAL_REGISTRY.register_max_image_tokens(<your_calculation>)
    + @INPUT_REGISTRY.register_dummy_data(<your_dummy_data_factory>)
-      class YourModelForImage2Seq(nn.Module, SupportsVision):
+      class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
 .. note::
    The dummy data should have the maximum possible number of multi-modal tokens, as described in the previous step.
@@ -128,14 +128,14 @@ You can register input processors via :meth:`INPUT_REGISTRY.register_input_proce
 .. code-block:: diff
      from vllm.inputs import INPUT_REGISTRY
-      from vllm.model_executor.models.interfaces import SupportsVision
+      from vllm.model_executor.models.interfaces import SupportsMultiModal
      from vllm.multimodal import MULTIMODAL_REGISTRY
      @MULTIMODAL_REGISTRY.register_image_input_mapper()
      @MULTIMODAL_REGISTRY.register_max_image_tokens(<your_calculation>)
      @INPUT_REGISTRY.register_dummy_data(<your_dummy_data_factory>)
    + @INPUT_REGISTRY.register_input_processor(<your_input_processor>)
-      class YourModelForImage2Seq(nn.Module, SupportsVision):
+      class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
 A common use case of input processors is inserting placeholder tokens to leverage the vLLM framework for attention mask generation.
 Here are some examples:

--- a/docs/source/models/lora.rst
+++ b/docs/source/models/lora.rst
@@ -107,3 +107,119 @@ The following is an example request
            "max_tokens": 7,
            "temperature": 0
        }' | jq
+Dynamically serving LoRA Adapters
+---------------------------------
+In addition to serving LoRA adapters at server startup, the vLLM server now supports dynamically loading and unloading
+LoRA adapters at runtime through dedicated API endpoints. This feature can be particularly useful when the flexibility
+to change models on-the-fly is needed.
+Note: Enabling this feature in production environments is risky because users may take part in model adapter management.
+To enable dynamic LoRA loading and unloading, ensure that the environment variable `VLLM_ALLOW_RUNTIME_LORA_UPDATING`
+is set to `True`. When this option is enabled, the API server will log a warning to indicate that dynamic loading is active.
+.. code-block:: bash
+    export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True
+Loading a LoRA Adapter:
+To dynamically load a LoRA adapter, send a POST request to the `/v1/load_lora_adapter` endpoint with the necessary
+details of the adapter to be loaded. The request payload should include the name and path to the LoRA adapter.
+Example request to load a LoRA adapter:
+.. code-block:: bash
+    curl -X POST http://localhost:8000/v1/load_lora_adapter \
+    -H "Content-Type: application/json" \
+    -d '{
+        "lora_name": "sql_adapter",
+        "lora_path": "/path/to/sql-lora-adapter"
+    }'
+Upon a successful request, the API will respond with a 200 OK status code. If an error occurs, such as if the adapter
+cannot be found or loaded, an appropriate error message will be returned.
+Unloading a LoRA Adapter:
+To unload a LoRA adapter that has been previously loaded, send a POST request to the `/v1/unload_lora_adapter` endpoint
+with the name or ID of the adapter to be unloaded.
+Example request to unload a LoRA adapter:
+.. code-block:: bash
+    curl -X POST http://localhost:8000/v1/unload_lora_adapter \
+    -H "Content-Type: application/json" \
+    -d '{
+        "lora_name": "sql_adapter"
+    }'
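The two curl calls above can also be issued from Python. The following is a minimal sketch using only the standard library; the helper name `build_lora_request` and the `BASE_URL` constant are hypothetical, and the server address is assumed to match the curl examples. The requests are built but not sent.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed server address, as in the curl examples


def build_lora_request(action, lora_name, lora_path=None):
    """Build (but do not send) a POST request for the load/unload endpoints."""
    if action not in ("load", "unload"):
        raise ValueError(f"unknown action: {action!r}")
    payload = {"lora_name": lora_name}
    if action == "load":
        if lora_path is None:
            raise ValueError("loading an adapter requires lora_path")
        payload["lora_path"] = lora_path
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/{action}_lora_adapter",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


load_req = build_lora_request("load", "sql_adapter", "/path/to/sql-lora-adapter")
unload_req = build_lora_request("unload", "sql_adapter")
# To actually send a request against a running server:
#     urllib.request.urlopen(load_req)
```

Keeping request construction separate from sending makes it easy to inspect the payload before touching a live server.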
+New format for `--lora-modules`
+-------------------------------
+In the previous version, users would provide LoRA modules via the following format, either as a key-value pair or in JSON format. For example:
+.. code-block:: bash
+    --lora-modules sql-lora=$HOME/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/
+This would only include the `name` and `path` for each LoRA module, but did not provide a way to specify a `base_model_name`.
+Now, you can specify a base_model_name alongside the name and path using JSON format. For example:
+.. code-block:: bash
+    --lora-modules '{"name": "sql-lora", "path": "/path/to/lora", "base_model_name": "meta-llama/Llama-2-7b"}'
+For backward compatibility, you can still use the old key-value format (name=path), but `base_model_name` will remain unspecified in that case.
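To make the two accepted shapes concrete, here is a small sketch of how a single `--lora-modules` value could be interpreted. It is an assumption-laden illustration, not vLLM's actual argument parser: `LoRAModuleSpec` and `parse_lora_module` are hypothetical names, and only the field names `name`, `path`, and `base_model_name` come from the text above.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoRAModuleSpec:  # hypothetical container, not a vLLM class
    name: str
    path: str
    base_model_name: Optional[str] = None


def parse_lora_module(value: str) -> LoRAModuleSpec:
    """Accept either the old name=path form or the JSON form."""
    value = value.strip()
    if value.startswith("{"):
        fields = json.loads(value)
        return LoRAModuleSpec(
            name=fields["name"],
            path=fields["path"],
            base_model_name=fields.get("base_model_name"),
        )
    name, sep, path = value.partition("=")
    if not sep or not name or not path:
        raise ValueError(f"expected name=path or a JSON object, got {value!r}")
    # The old format carries no base model information.
    return LoRAModuleSpec(name=name, path=path)


old_style = parse_lora_module("sql-lora=/path/to/lora")
new_style = parse_lora_module(
    '{"name": "sql-lora", "path": "/path/to/lora", "base_model_name": "meta-llama/Llama-2-7b"}'
)
```

With the old format, `old_style.base_model_name` stays `None`, which mirrors the "unspecified" behaviour described above.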
+Lora model lineage in model card
+--------------------------------
+The new format of `--lora-modules` mainly supports displaying parent model information in the model card. The response below reflects this as follows:
+- The `parent` field of LoRA model `sql-lora` now links to its base model `meta-llama/Llama-2-7b-hf`. This correctly reflects the hierarchical relationship between the base model and the LoRA adapter.
+- The `root` field points to the artifact location of the lora adapter.
+.. code-block:: bash
+    $ curl http://localhost:8000/v1/models
+    {
+        "object": "list",
+        "data": [
+            {
+            "id": "meta-llama/Llama-2-7b-hf",
+            "object": "model",
+            "created": 1715644056,
+            "owned_by": "vllm",
+            "root": "~/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-hf/snapshots/01c7f73d771dfac7d292323805ebc428287df4f9/",
+            "parent": null,
+            "permission": [
+                {
+                .....
+                }
+            ]
+            },
+            {
+            "id": "sql-lora",
+            "object": "model",
+            "created": 1715644056,
+            "owned_by": "vllm",
+            "root": "~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/",
+            "parent": "meta-llama/Llama-2-7b-hf",
+            "permission": [
+                {
+                ....
+                }
+            ]
+            }
+        ]
+    }
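A client can recover the lineage from such a response by reading the `parent` field of each entry. The sketch below works on a trimmed, hand-written copy of the response (the `permission` entries and `root` paths are omitted); the helper name `lineage` is hypothetical.

```python
import json

# Trimmed copy of a models-list response; only the fields used here are kept.
response_text = """
{
    "object": "list",
    "data": [
        {"id": "meta-llama/Llama-2-7b-hf", "object": "model", "parent": null},
        {"id": "sql-lora", "object": "model", "parent": "meta-llama/Llama-2-7b-hf"}
    ]
}
"""


def lineage(models_response):
    """Map each model id to its parent id (None for base models)."""
    return {entry["id"]: entry.get("parent") for entry in models_response["data"]}


parents = lineage(json.loads(response_text))
adapters = sorted(model_id for model_id, parent in parents.items() if parent is not None)
```

Here `adapters` contains only `sql-lora`, since the base model has no parent.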
--- a/docs/source/models/performance.rst
+++ b/docs/source/models/performance.rst
@@ -22,6 +22,8 @@ If you frequently encounter preemptions from the vLLM engine, consider the follo
 You can also monitor the number of preemption requests through Prometheus metrics exposed by the vLLM. Additionally, you can log the cumulative number of preemption requests by setting disable_log_stats=False.
+.. _chunked-prefill:
 Chunked Prefill
 ---------------
 vLLM supports chunked prefill as an experimental feature. Chunked prefill splits large prefills into smaller chunks and batches them together with decode requests.

--- a/docs/source/models/spec_decode.rst
+++ b/docs/source/models/spec_decode.rst
@@ -14,31 +14,74 @@ Speculative decoding is a technique which improves inter-token latency in memory
 Speculating with a draft model
 ------------------------------
-The following code configures vLLM to use speculative decoding with a draft model, speculating 5 tokens at a time.
+The following code configures vLLM in an offline mode to use speculative decoding with a draft model, speculating 5 tokens at a time.
 .. code-block:: python
    from vllm import LLM, SamplingParams
    prompts = [
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    llm = LLM(
        model="facebook/opt-6.7b",
        tensor_parallel_size=1,
        speculative_model="facebook/opt-125m",
        num_speculative_tokens=5,
-        use_v2_block_manager=True,
    )
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+To do the same in online mode, launch the server:
+.. code-block:: bash
+    python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --model facebook/opt-6.7b \
+        --seed 42 -tp 1 --speculative_model facebook/opt-125m --use-v2-block-manager \
+        --num_speculative_tokens 5 --gpu_memory_utilization 0.8
+Then use a client:
+.. code-block:: python
+    from openai import OpenAI
+    # Modify OpenAI's API key and API base to use vLLM's API server.
+    openai_api_key = "EMPTY"
+    openai_api_base = "http://localhost:8000/v1"
+    client = OpenAI(
+        # defaults to os.environ.get("OPENAI_API_KEY")
+        api_key=openai_api_key,
+        base_url=openai_api_base,
+    )
+    models = client.models.list()
+    model = models.data[0].id
+    # Completion API
+    stream = False
+    completion = client.completions.create(
+        model=model,
+        prompt="The future of AI is",
+        echo=False,
+        n=1,
+        stream=stream,
+    )
+    print("Completion results:")
+    if stream:
+        for c in completion:
+            print(c)
+    else:
+        print(completion)
 Speculating by matching n-grams in the prompt
 ---------------------------------------------
@@ -48,27 +91,114 @@ matching n-grams in the prompt. For more information read `this thread. <https:/
 .. code-block:: python
    from vllm import LLM, SamplingParams
    prompts = [
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    llm = LLM(
        model="facebook/opt-6.7b",
        tensor_parallel_size=1,
        speculative_model="[ngram]",
        num_speculative_tokens=5,
        ngram_prompt_lookup_max=4,
-        use_v2_block_manager=True,
    )
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+Speculating using MLP speculators
+---------------------------------
+The following code configures vLLM to use speculative decoding where proposals are generated by
+draft models that condition draft predictions on both context vectors and sampled tokens.
+For more information see `this blog <https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/>`_ or
+`this technical report <https://arxiv.org/abs/2404.19124>`_.
+.. code-block:: python
+    from vllm import LLM, SamplingParams
+    prompts = [
+        "The future of AI is",
+    ]
+    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+    llm = LLM(
+        model="meta-llama/Meta-Llama-3.1-70B-Instruct",
+        tensor_parallel_size=4,
+        speculative_model="ibm-fms/llama3-70b-accelerator",
+        speculative_draft_tensor_parallel_size=1,
+    )
+    outputs = llm.generate(prompts, sampling_params)
+    for output in outputs:
+        prompt = output.prompt
+        generated_text = output.outputs[0].text
+        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+Note that these speculative models currently need to be run without tensor parallelism, although
+it is possible to run the main model using tensor parallelism (see example above). Since the
+speculative models are relatively small, we still see significant speedups. However, this
+limitation will be fixed in a future release.
+A variety of speculative models of this type are available on HF hub:
+* `llama-13b-accelerator <https://huggingface.co/ibm-fms/llama-13b-accelerator>`_
+* `llama3-8b-accelerator <https://huggingface.co/ibm-fms/llama3-8b-accelerator>`_
+* `codellama-34b-accelerator <https://huggingface.co/ibm-fms/codellama-34b-accelerator>`_
+* `llama2-70b-accelerator <https://huggingface.co/ibm-fms/llama2-70b-accelerator>`_
+* `llama3-70b-accelerator <https://huggingface.co/ibm-fms/llama3-70b-accelerator>`_
+* `granite-3b-code-instruct-accelerator <https://huggingface.co/ibm-granite/granite-3b-code-instruct-accelerator>`_
+* `granite-8b-code-instruct-accelerator <https://huggingface.co/ibm-granite/granite-8b-code-instruct-accelerator>`_
+* `granite-7b-instruct-accelerator <https://huggingface.co/ibm-granite/granite-7b-instruct-accelerator>`_
+* `granite-20b-code-instruct-accelerator <https://huggingface.co/ibm-granite/granite-20b-code-instruct-accelerator>`_
+Lossless guarantees of Speculative Decoding
+-------------------------------------------
+In vLLM, speculative decoding aims to enhance inference efficiency while maintaining accuracy. This section addresses the lossless guarantees of 
+speculative decoding, breaking down the guarantees into three key areas:
+1. **Theoretical Losslessness**
+   - Speculative decoding sampling is theoretically lossless up to the precision limits of hardware numerics. Floating-point errors might 
+   cause slight variations in output distributions, as discussed 
+   in `Accelerating Large Language Model Decoding with Speculative Sampling <https://arxiv.org/pdf/2302.01318>`_
+2. **Algorithmic Losslessness**
+   - vLLM’s implementation of speculative decoding is algorithmically validated to be lossless. Key validation tests include:
+    - **Rejection Sampler Convergence**: Ensures that samples from vLLM’s rejection sampler align with the target 
+      distribution. `View Test Code <https://github.com/vllm-project/vllm/blob/47b65a550866c7ffbd076ecb74106714838ce7da/tests/samplers/test_rejection_sampler.py#L252>`_
+    - **Greedy Sampling Equality**: Confirms that greedy sampling with speculative decoding matches greedy sampling
+      without it. This verifies that vLLM's speculative decoding framework, when integrated with the vLLM forward pass and the vLLM rejection sampler, 
+      provides a lossless guarantee.  Almost all of the tests in `this directory <https://github.com/vllm-project/vllm/tree/b67ae00cdbbe1a58ffc8ff170f0c8d79044a684a/tests/spec_decode/e2e>`_
+      verify this property using `this assertion implementation <https://github.com/vllm-project/vllm/blob/b67ae00cdbbe1a58ffc8ff170f0c8d79044a684a/tests/spec_decode/e2e/conftest.py#L291>`_
+3. **vLLM Logprob Stability**
+   - vLLM does not currently guarantee stable token log probabilities (logprobs). This can result in different outputs for the 
+   same request across runs. For more details, see the FAQ section 
+   titled *Can the output of a prompt vary across runs in vLLM?* in the `FAQs <../serving/faq.rst>`_.
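The theoretical claim in item 1 can be verified exactly on a toy vocabulary. The sketch below follows the accept/resample rule described in the speculative sampling paper cited above: accept a drafted token `x` with probability `min(1, p(x)/q(x))`, otherwise resample from the normalized residual `max(0, p - q)`. It is not vLLM's rejection sampler; the function name and the example distributions are made up for illustration, and exact fractions are used so that floating-point error plays no role.

```python
from fractions import Fraction


def speculative_output_distribution(p, q):
    """Distribution of the emitted token when a draft from q is verified against p."""
    # Accept a drafted token x with probability min(1, p[x] / q[x]).
    accept = [min(Fraction(1), px / qx) if qx else Fraction(0) for px, qx in zip(p, q)]
    accepted_mass = [qx * ax for qx, ax in zip(q, accept)]
    reject_prob = 1 - sum(accepted_mass)
    # On rejection, resample from the normalized residual max(0, p - q).
    residual = [max(Fraction(0), px - qx) for px, qx in zip(p, q)]
    norm = sum(residual)
    if norm == 0:  # p == q, so a draft is never rejected
        return accepted_mass
    return [am + reject_prob * r / norm for am, r in zip(accepted_mass, residual)]


target = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]  # made-up target model p
draft = [Fraction(1, 5), Fraction(1, 5), Fraction(3, 5)]    # made-up draft model q
result = speculative_output_distribution(target, draft)
```

For these inputs `result` equals `target` exactly, which is the sense in which the sampling scheme is lossless before hardware numerics enter.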
+**Conclusion**
+While vLLM strives to ensure losslessness in speculative decoding, variations in generated outputs with and without speculative decoding 
+can occur due to the following factors:
+- **Floating-Point Precision**: Differences in hardware numerical precision may lead to slight discrepancies in the output distribution.
+- **Batch Size and Numerical Stability**: Changes in batch size may cause variations in logprobs and output probabilities, potentially 
+  due to non-deterministic behavior in batched operations or numerical instability.
+**Mitigation Strategies**
+For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the `FAQs <../serving/faq.rst>`_.
 Resources for vLLM contributors
 -------------------------------
 * `A Hacker's Guide to Speculative Decoding in vLLM <https://www.youtube.com/watch?v=9wNAgpX6z_4>`_

--- a/docs/source/models/supported_models.rst
+++ b/docs/source/models/supported_models.rst
@@ -3,232 +3,485 @@
 Supported Models
 ================
-vLLM supports a variety of generative Transformer models in `HuggingFace Transformers <https://huggingface.co/models>`_.
+vLLM supports a variety of generative Transformer models in `HuggingFace (HF) Transformers <https://huggingface.co/models>`_.
 The following is the list of model architectures that are currently supported by vLLM.
 Alongside each architecture, we include some popular models that use it.
----
+Text-only Language Models
+^^^^^^^^^^^^^^^^^^^^^^^^^
+Text Generation
+---------------
-Decoder-only Language Models
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 .. list-table::
-  :widths: 25 25 50 5
+  :widths: 25 25 50 5 5
  :header-rows: 1
  * - Architecture
    - Models
-    - Example HuggingFace Models
+    - Example HF Models
    - :ref:`LoRA <lora>`
+    - :ref:`PP <distributed_serving>`
  * - :code:`AquilaForCausalLM`
-    - Aquila & Aquila2
+    - Aquila, Aquila2
    - :code:`BAAI/Aquila-7B`, :code:`BAAI/AquilaChat-7B`, etc.
    - ✅︎
+    - ✅︎
  * - :code:`ArcticForCausalLM`
    - Arctic
    - :code:`Snowflake/snowflake-arctic-base`, :code:`Snowflake/snowflake-arctic-instruct`, etc.
    -
+    - ✅︎
  * - :code:`BaiChuanForCausalLM`
-    - Baichuan & Baichuan2
+    - Baichuan2, Baichuan
    - :code:`baichuan-inc/Baichuan2-13B-Chat`, :code:`baichuan-inc/Baichuan-7B`, etc.
    - ✅︎
+    - ✅︎
  * - :code:`BloomForCausalLM`
    - BLOOM, BLOOMZ, BLOOMChat
    - :code:`bigscience/bloom`, :code:`bigscience/bloomz`, etc.
    -
+    - ✅︎
+  * - :code:`BartForConditionalGeneration`
+    - BART
+    - :code:`facebook/bart-base`, :code:`facebook/bart-large-cnn`, etc.
+    - 
+    - 
  * - :code:`ChatGLMModel`
    - ChatGLM
    - :code:`THUDM/chatglm2-6b`, :code:`THUDM/chatglm3-6b`, etc.
    - ✅︎
+    - ✅︎
  * - :code:`CohereForCausalLM`
    - Command-R
    - :code:`CohereForAI/c4ai-command-r-v01`, etc.
-    -
+    - ✅︎
+    - ✅︎
  * - :code:`DbrxForCausalLM`
    - DBRX
    - :code:`databricks/dbrx-base`, :code:`databricks/dbrx-instruct`, etc.
    -
+    - ✅︎
  * - :code:`DeciLMForCausalLM`
    - DeciLM
    - :code:`Deci/DeciLM-7B`, :code:`Deci/DeciLM-7B-instruct`, etc.
    -
+    - ✅︎
+  * - :code:`DeepseekForCausalLM`
+    - DeepSeek
+    - :code:`deepseek-ai/deepseek-llm-67b-base`, :code:`deepseek-ai/deepseek-llm-7b-chat`, etc.
+    - 
+    - ✅︎
+  * - :code:`DeepseekV2ForCausalLM`
+    - DeepSeek-V2
+    - :code:`deepseek-ai/DeepSeek-V2`, :code:`deepseek-ai/DeepSeek-V2-Chat`, etc.
+    - 
+    - ✅︎
+  * - :code:`ExaoneForCausalLM`
+    - EXAONE-3
+    - :code:`LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct`, etc.
+    - ✅︎
+    - ✅︎
  * - :code:`FalconForCausalLM`
    - Falcon
    - :code:`tiiuae/falcon-7b`, :code:`tiiuae/falcon-40b`, :code:`tiiuae/falcon-rw-7b`, etc.
    -
+    - ✅︎
  * - :code:`GemmaForCausalLM`
    - Gemma
    - :code:`google/gemma-2b`, :code:`google/gemma-7b`, etc.
    - ✅︎
+    - ✅︎
  * - :code:`Gemma2ForCausalLM`
    - Gemma2
    - :code:`google/gemma-2-9b`, :code:`google/gemma-2-27b`, etc.
    - ✅︎
+    - ✅︎
  * - :code:`GPT2LMHeadModel`
    - GPT-2
    - :code:`gpt2`, :code:`gpt2-xl`, etc.
    -
+    - ✅︎
  * - :code:`GPTBigCodeForCausalLM`
    - StarCoder, SantaCoder, WizardCoder
    - :code:`bigcode/starcoder`, :code:`bigcode/gpt_bigcode-santacoder`, :code:`WizardLM/WizardCoder-15B-V1.0`, etc.
    - ✅︎
+    - ✅︎
  * - :code:`GPTJForCausalLM`
    - GPT-J
    - :code:`EleutherAI/gpt-j-6b`, :code:`nomic-ai/gpt4all-j`, etc.
    -
+    - ✅︎
  * - :code:`GPTNeoXForCausalLM`
    - GPT-NeoX, Pythia, OpenAssistant, Dolly V2, StableLM
    - :code:`EleutherAI/gpt-neox-20b`, :code:`EleutherAI/pythia-12b`, :code:`OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5`, :code:`databricks/dolly-v2-12b`, :code:`stabilityai/stablelm-tuned-alpha-7b`, etc.
    -
+    - ✅︎
+  * - :code:`GraniteForCausalLM`
+    - PowerLM
+    - :code:`ibm/PowerLM-3b`, etc.
+    - ✅︎
+    - ✅︎
+  * - :code:`GraniteMoeForCausalLM`
+    - PowerMoE
+    - :code:`ibm/PowerMoE-3b`, etc.
+    - ✅︎
+    - ✅︎
  * - :code:`InternLMForCausalLM`
    - InternLM
    - :code:`internlm/internlm-7b`, :code:`internlm/internlm-chat-7b`, etc.
    - ✅︎
+    - ✅︎
  * - :code:`InternLM2ForCausalLM`
    - InternLM2
    - :code:`internlm/internlm2-7b`, :code:`internlm/internlm2-chat-7b`, etc.
    -
+    - ✅︎
  * - :code:`JAISLMHeadModel`
    - Jais
    - :code:`core42/jais-13b`, :code:`core42/jais-13b-chat`, :code:`core42/jais-30b-v3`, :code:`core42/jais-30b-chat-v3`, etc.
    -
+    - ✅︎
  * - :code:`JambaForCausalLM`
    - Jamba
-    - :code:`ai21labs/Jamba-v0.1`, etc.
+    - :code:`ai21labs/AI21-Jamba-1.5-Large`, :code:`ai21labs/AI21-Jamba-1.5-Mini`, :code:`ai21labs/Jamba-v0.1`, etc.
    - ✅︎
+    - 
  * - :code:`LlamaForCausalLM`
    - Llama 3.1, Llama 3, Llama 2, LLaMA, Yi
    - :code:`meta-llama/Meta-Llama-3.1-405B-Instruct`, :code:`meta-llama/Meta-Llama-3.1-70B`, :code:`meta-llama/Meta-Llama-3-70B-Instruct`, :code:`meta-llama/Llama-2-70b-hf`, :code:`01-ai/Yi-34B`, etc.
    - ✅︎
+    - ✅︎
+  * - :code:`MambaForCausalLM`
+    - Mamba
+    - :code:`state-spaces/mamba-130m-hf`, :code:`state-spaces/mamba-790m-hf`, :code:`state-spaces/mamba-2.8b-hf`, etc.
+    -
+    -
  * - :code:`MiniCPMForCausalLM`
    - MiniCPM
-    - :code:`openbmb/MiniCPM-2B-sft-bf16`, :code:`openbmb/MiniCPM-2B-dpo-bf16`, etc.
+    - :code:`openbmb/MiniCPM-2B-sft-bf16`, :code:`openbmb/MiniCPM-2B-dpo-bf16`, :code:`openbmb/MiniCPM-S-1B-sft`, etc.
-    -
+    - ✅︎
+    - ✅︎
+  * - :code:`MiniCPM3ForCausalLM`
+    - MiniCPM3
+    - :code:`openbmb/MiniCPM3-4B`, etc.
+    - ✅︎
+    - ✅︎
  * - :code:`MistralForCausalLM`
    - Mistral, Mistral-Instruct
    - :code:`mistralai/Mistral-7B-v0.1`, :code:`mistralai/Mistral-7B-Instruct-v0.1`, etc.
    - ✅︎
+    - ✅︎
  * - :code:`MixtralForCausalLM`
    - Mixtral-8x7B, Mixtral-8x7B-Instruct
    - :code:`mistralai/Mixtral-8x7B-v0.1`, :code:`mistralai/Mixtral-8x7B-Instruct-v0.1`, :code:`mistral-community/Mixtral-8x22B-v0.1`, etc.
    - ✅︎
+    - ✅︎
  * - :code:`MPTForCausalLM`
    - MPT, MPT-Instruct, MPT-Chat, MPT-StoryWriter
    - :code:`mosaicml/mpt-7b`, :code:`mosaicml/mpt-7b-storywriter`, :code:`mosaicml/mpt-30b`, etc.
    -
+    - ✅︎
  * - :code:`NemotronForCausalLM`
    - Nemotron-3, Nemotron-4, Minitron
    - :code:`nvidia/Minitron-8B-Base`, :code:`mgoin/Nemotron-4-340B-Base-hf-FP8`, etc.
    - ✅︎
+    - ✅︎
  * - :code:`OLMoForCausalLM`
    - OLMo
    - :code:`allenai/OLMo-1B-hf`, :code:`allenai/OLMo-7B-hf`, etc.
    -
+    - ✅︎
+  * - :code:`OLMoEForCausalLM`
+    - OLMoE
+    - :code:`allenai/OLMoE-1B-7B-0924`, :code:`allenai/OLMoE-1B-7B-0924-Instruct`, etc.
+    - ✅︎
+    - ✅︎
  * - :code:`OPTForCausalLM`
    - OPT, OPT-IML
    - :code:`facebook/opt-66b`, :code:`facebook/opt-iml-max-30b`, etc.
    -
+    - ✅︎
  * - :code:`OrionForCausalLM`
    - Orion
    - :code:`OrionStarAI/Orion-14B-Base`, :code:`OrionStarAI/Orion-14B-Chat`, etc.
    -
+    - ✅︎
  * - :code:`PhiForCausalLM`
    - Phi
    - :code:`microsoft/phi-1_5`, :code:`microsoft/phi-2`, etc.
    - ✅︎
+    - ✅︎
  * - :code:`Phi3ForCausalLM`
    - Phi-3
    - :code:`microsoft/Phi-3-mini-4k-instruct`, :code:`microsoft/Phi-3-mini-128k-instruct`, :code:`microsoft/Phi-3-medium-128k-instruct`, etc.
-    -
+    - ✅︎
+    - ✅︎
  * - :code:`Phi3SmallForCausalLM`
    - Phi-3-Small
    - :code:`microsoft/Phi-3-small-8k-instruct`, :code:`microsoft/Phi-3-small-128k-instruct`, etc.
    -
+    - ✅︎
+  * - :code:`PhiMoEForCausalLM`
+    - Phi-3.5-MoE
+    - :code:`microsoft/Phi-3.5-MoE-instruct`, etc.
+    - ✅︎
+    - ✅︎
  * - :code:`PersimmonForCausalLM`
    - Persimmon
    - :code:`adept/persimmon-8b-base`, :code:`adept/persimmon-8b-chat`, etc.
    - 
+    - ✅︎
  * - :code:`QWenLMHeadModel`
    - Qwen
    - :code:`Qwen/Qwen-7B`, :code:`Qwen/Qwen-7B-Chat`, etc.
    -
+    - ✅︎
  * - :code:`Qwen2ForCausalLM`
    - Qwen2
    - :code:`Qwen/Qwen2-beta-7B`, :code:`Qwen/Qwen2-beta-7B-Chat`, etc.
    - ✅︎
+    - ✅︎
  * - :code:`Qwen2MoeForCausalLM`
    - Qwen2MoE
    - :code:`Qwen/Qwen1.5-MoE-A2.7B`, :code:`Qwen/Qwen1.5-MoE-A2.7B-Chat`, etc.
    -
+    - ✅︎
  * - :code:`StableLmForCausalLM`
    - StableLM
-    - :code:`stabilityai/stablelm-3b-4e1t/` , :code:`stabilityai/stablelm-base-alpha-7b-v2`, etc.
+    - :code:`stabilityai/stablelm-3b-4e1t`, :code:`stabilityai/stablelm-base-alpha-7b-v2`, etc.
    -
+    - ✅︎
  * - :code:`Starcoder2ForCausalLM`
    - Starcoder2
    - :code:`bigcode/starcoder2-3b`, :code:`bigcode/starcoder2-7b`, :code:`bigcode/starcoder2-15b`, etc.
    -
+    - ✅︎
+  * - :code:`SolarForCausalLM`
+    - Solar Pro
+    - :code:`upstage/solar-pro-preview-instruct`, etc.
+    - ✅︎
+    - ✅︎
  * - :code:`XverseForCausalLM`
-    - Xverse
+    - XVERSE
    - :code:`xverse/XVERSE-7B-Chat`, :code:`xverse/XVERSE-13B-Chat`, :code:`xverse/XVERSE-65B-Chat`, etc.
-    -
+    - ✅︎
+    - ✅︎
 .. note::
    Currently, the ROCm version of vLLM supports Mistral and Mixtral only for context lengths up to 4096.
+Text Embedding
+--------------
+.. list-table::
+  :widths: 25 25 50 5 5
+  :header-rows: 1
+  * - Architecture
+    - Models
+    - Example HF Models
+    - :ref:`LoRA <lora>`
+    - :ref:`PP <distributed_serving>`
+  * - :code:`Gemma2Model`
+    - Gemma2-based
+    - :code:`BAAI/bge-multilingual-gemma2`, etc.
+    - 
+    - ✅︎
+  * - :code:`MistralModel`
+    - Mistral-based
+    - :code:`intfloat/e5-mistral-7b-instruct`, etc.
+    - 
+    - ✅︎
+Reward Modeling
+---------------
+.. list-table::
+  :widths: 25 25 50 5 5
+  :header-rows: 1
+  * - Architecture
+    - Models
+    - Example HF Models
+    - :ref:`LoRA <lora>`
+    - :ref:`PP <distributed_serving>`
+  * - :code:`Qwen2ForRewardModel`
+    - Qwen2-based
+    - :code:`Qwen/Qwen2.5-Math-RM-72B`, etc.
+    - 
+    - ✅︎
+.. note::
+    As an interim measure, these models are supported via the Embeddings API. See `this RFC <https://github.com/vllm-project/vllm/issues/8967>`_ for upcoming changes.
+Multimodal Language Models
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+The following modalities are supported depending on the model:
+- **T**\ ext
+- **I**\ mage
+- **V**\ ideo
+- **A**\ udio
 .. _supported_vlms:
-Vision Language Models
+Text Generation
-^^^^^^^^^^^^^^^^^^^^^^^
+---------------
 .. list-table::
-  :widths: 25 25 50 5
+  :widths: 25 25 15 25 5 5
  :header-rows: 1
  * - Architecture
    - Models
-    - Example HuggingFace Models
+    - Inputs
+    - Example HF Models
    - :ref:`LoRA <lora>`
+    - :ref:`PP <distributed_serving>`
  * - :code:`Blip2ForConditionalGeneration`
    - BLIP-2
+    - T + I\ :sup:`E`
    - :code:`Salesforce/blip2-opt-2.7b`, :code:`Salesforce/blip2-opt-6.7b`, etc.
    -
+    - ✅︎
  * - :code:`ChameleonForConditionalGeneration`
    - Chameleon
+    - T + I
    - :code:`facebook/chameleon-7b` etc.
    - 
+    - ✅︎
  * - :code:`FuyuForCausalLM`
    - Fuyu
+    - T + I
    - :code:`adept/fuyu-8b` etc.
    - 
+    - ✅︎
+  * - :code:`ChatGLMModel`
+    - GLM-4V
+    - T + I
+    - :code:`THUDM/glm-4v-9b` etc.
+    - 
+    - ✅︎
  * - :code:`InternVLChatModel`
    - InternVL2
+    - T + I\ :sup:`E+`
    - :code:`OpenGVLab/InternVL2-4B`, :code:`OpenGVLab/InternVL2-8B`, etc.
    - 
+    - ✅︎
  * - :code:`LlavaForConditionalGeneration`
    - LLaVA-1.5
+    - T + I\ :sup:`E+`
    - :code:`llava-hf/llava-1.5-7b-hf`, :code:`llava-hf/llava-1.5-13b-hf`, etc.
    -
+    - ✅︎
  * - :code:`LlavaNextForConditionalGeneration`
    - LLaVA-NeXT
+    - T + I\ :sup:`E+`
    - :code:`llava-hf/llava-v1.6-mistral-7b-hf`, :code:`llava-hf/llava-v1.6-vicuna-7b-hf`, etc.
    -
+    - ✅︎
+  * - :code:`LlavaNextVideoForConditionalGeneration`
+    - LLaVA-NeXT-Video
+    - T + V
+    - :code:`llava-hf/LLaVA-NeXT-Video-7B-hf`, etc.
+    -
+    - ✅︎
+  * - :code:`LlavaOnevisionForConditionalGeneration`
+    - LLaVA-Onevision
+    - T + I\ :sup:`+` + V
+    - :code:`llava-hf/llava-onevision-qwen2-7b-ov-hf`, :code:`llava-hf/llava-onevision-qwen2-0.5b-ov-hf`, etc.
+    -
+    - ✅︎
+  * - :code:`MiniCPMV`
+    - MiniCPM-V
+    - T + I\ :sup:`E+`
+    - :code:`openbmb/MiniCPM-V-2` (see note), :code:`openbmb/MiniCPM-Llama3-V-2_5`, :code:`openbmb/MiniCPM-V-2_6`, etc.
+    - ✅︎
+    - ✅︎
+  * - :code:`MllamaForConditionalGeneration`
+    - Llama 3.2
+    - T + I
+    - :code:`meta-llama/Llama-3.2-90B-Vision-Instruct`, :code:`meta-llama/Llama-3.2-11B-Vision`, etc.
+    -
+    -
+  * - :code:`MolmoForCausalLM`
+    - Molmo
+    - T + I
+    - :code:`allenai/Molmo-7B-D-0924`, :code:`allenai/Molmo-72B-0924`, etc.
+    -
+    - ✅︎
+  * - :code:`NVLM_D_Model`
+    - NVLM-D 1.0
+    - T + I\ :sup:`E+`
+    - :code:`nvidia/NVLM-D-72B`, etc.
+    - 
+    - ✅︎
  * - :code:`PaliGemmaForConditionalGeneration`
    - PaliGemma
+    - T + I\ :sup:`E`
    - :code:`google/paligemma-3b-pt-224`, :code:`google/paligemma-3b-mix-224`, etc.
    - 
+    - ✅︎
  * - :code:`Phi3VForCausalLM`
-    - Phi-3-Vision
+    - Phi-3-Vision, Phi-3.5-Vision
-    - :code:`microsoft/Phi-3-vision-128k-instruct`, etc.
+    - T + I\ :sup:`E+`
+    - :code:`microsoft/Phi-3-vision-128k-instruct`, :code:`microsoft/Phi-3.5-vision-instruct`, etc.
    -
-  * - :code:`MiniCPMV`
+    - ✅︎
-    - MiniCPM-V
+  * - :code:`PixtralForConditionalGeneration`
-    - :code:`openbmb/MiniCPM-V-2` (see note), :code:`openbmb/MiniCPM-Llama3-V-2_5`, etc.
+    - Pixtral
+    - T + I\ :sup:`+`
+    - :code:`mistralai/Pixtral-12B-2409`
    -
+    - ✅︎
+  * - :code:`QWenLMHeadModel`
+    - Qwen-VL
+    - T + I\ :sup:`E+`
+    - :code:`Qwen/Qwen-VL`, :code:`Qwen/Qwen-VL-Chat`, etc.
+    -
+    - ✅︎
+  * - :code:`Qwen2VLForConditionalGeneration`
+    - Qwen2-VL
+    - T + I\ :sup:`E+` + V\ :sup:`+`
+    - :code:`Qwen/Qwen2-VL-2B-Instruct`, :code:`Qwen/Qwen2-VL-7B-Instruct`, :code:`Qwen/Qwen2-VL-72B-Instruct`, etc.
+    -
+    - ✅︎
+  * - :code:`UltravoxModel`
+    - Ultravox
+    - T + A\ :sup:`E+`
+    - :code:`fixie-ai/ultravox-v0_3`
+    -
+    - ✅︎
+| :sup:`E` Pre-computed embeddings can be provided as input for this modality.
+| :sup:`+` Multiple items can be provided per text prompt for this modality.
 .. note::
  For :code:`openbmb/MiniCPM-V-2`, the official repo doesn't work yet, so we need to use a fork (:code:`HwwwH/MiniCPM-V-2`) for now.
  For more details, please see: https://github.com/vllm-project/vllm/pull/4087#issuecomment-2250397630
+Multimodal Embedding
+--------------------
+.. list-table::
+  :widths: 25 25 15 25 5 5
+  :header-rows: 1
+  * - Architecture
+    - Models
+    - Inputs
+    - Example HF Models
+    - :ref:`LoRA <lora>`
+    - :ref:`PP <distributed_serving>`
+  * - :code:`Phi3VForCausalLM`
+    - Phi-3-Vision-based
+    - T + I
+    - :code:`TIGER-Lab/VLM2Vec-Full`
+    - 🚧
+    - ✅︎
 ----
 If your model uses one of the above model architectures, you can seamlessly run your model with vLLM.
@@ -288,7 +541,7 @@ Note that, as an inference engine, vLLM does not introduce new models. Therefore
 We have the following levels of testing for models:
-1. **Strict Consistency**: We compare the output of the model with the output of the model in the HuggingFace Transformers library under greedy decoding. This is the most stringent test. Please refer to `test_models.py <https://github.com/vllm-project/vllm/blob/main/tests/models/test_models.py>`_ and `test_big_models.py <https://github.com/vllm-project/vllm/blob/main/tests/models/test_big_models.py>`_ for the models that have passed this test.
+1. **Strict Consistency**: We compare the output of the model with the output of the model in the HuggingFace Transformers library under greedy decoding. This is the most stringent test. Please refer to `models tests <https://github.com/vllm-project/vllm/blob/main/tests/models>`_ for the models that have passed this test.
 2. **Output Sensibility**: We check if the output of the model is sensible and coherent, by measuring the perplexity of the output and checking for any obvious errors. This is a less stringent test.
 3. **Runtime Functionality**: We check if the model can be loaded and run without errors. This is the least stringent test. Please refer to `functionality tests <https://github.com/vllm-project/vllm/tree/main/tests>`_ and `examples <https://github.com/vllm-project/vllm/tree/main/examples>`_ for the models that have passed this test.
 4. **Community Feedback**: We rely on the community to provide feedback on the models. If a model is broken or not working as expected, we encourage users to raise issues to report it or open pull requests to fix it. The rest of the models fall under this category.
--- a/docs/source/models/vlm.rst
+++ b/docs/source/models/vlm.rst
@@ -6,34 +6,26 @@ Using VLMs
 vLLM provides experimental support for Vision Language Models (VLMs). See the :ref:`list of supported VLMs here <supported_vlms>`.
 This document shows you how to run and serve these models using vLLM.
-.. important::
+.. note::
-    We are actively iterating on VLM support. Expect breaking changes to VLM usage and development in upcoming releases without prior deprecation.
+    We are actively iterating on VLM support. See `this RFC <https://github.com/vllm-project/vllm/issues/4194>`_ for upcoming changes,
+    and `open an issue on GitHub <https://github.com/vllm-project/vllm/issues/new/choose>`_ if you have any feedback or feature requests.
-    Currently, the support for vision language models on vLLM has the following limitations:
-    * Only single image input is supported per text prompt.
-    We are continuously improving user & developer experience for VLMs. Please `open an issue on GitHub <https://github.com/vllm-project/vllm/issues/new/choose>`_ if you have any feedback or feature requests.
+Offline Inference
+-----------------
-Offline Batched Inference
+Single-image input
-------------------------
+^^^^^^^^^^^^^^^^^^
-To initialize a VLM, the aforementioned arguments must be passed to the ``LLM`` class for instantiating the engine.
+The :class:`~vllm.LLM` class can be instantiated in much the same way as language-only models.
 .. code-block:: python
    llm = LLM(model="llava-hf/llava-1.5-7b-hf")
-.. important::
+To pass an image to the model, note the following in :class:`vllm.inputs.PromptType`:
-    We have removed all vision language related CLI args in the ``0.5.1`` release. **This is a breaking change**, so please update your code to follow
-    the above snippet. Specifically, ``image_feature_size`` is no longer required to be specified as we now calculate that
-    internally for each model.
-To pass an image to the model, note the following in :class:`vllm.inputs.PromptInputs`:
 * ``prompt``: The prompt should follow the format that is documented on HuggingFace.
-* ``multi_modal_data``: This is a dictionary that follows the schema defined in :class:`vllm.multimodal.MultiModalDataDict`. 
+* ``multi_modal_data``: This is a dictionary that follows the schema defined in :class:`vllm.multimodal.MultiModalDataDict`.
 .. code-block:: python
@@ -42,7 +34,7 @@ To pass an image to the model, note the following in :class:`vllm.inputs.PromptI
    # Load the image using PIL.Image
    image = PIL.Image.open(...)
    # Single prompt inference
    outputs = llm.generate({
        "prompt": prompt,
@@ -52,7 +44,42 @@ To pass an image to the model, note the following in :class:`vllm.inputs.PromptI
    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)
+    # Inference with image embeddings as input
+    image_embeds = torch.load(...) # torch.Tensor of shape (1, image_feature_size, hidden_size of LM)
+    outputs = llm.generate({
+        "prompt": prompt,
+        "multi_modal_data": {"image": image_embeds},
+    })
+    for o in outputs:
+        generated_text = o.outputs[0].text
+        print(generated_text)
+    # Inference with image embeddings as input with additional parameters
+    # Specifically, we are conducting a trial run of Qwen2VL and MiniCPM-V with the new input format, which utilizes additional parameters.
+    mm_data = {}
+    image_embeds = torch.load(...) # torch.Tensor of shape (num_images, image_feature_size, hidden_size of LM)
+    # For Qwen2VL, image_grid_thw is needed to calculate positional encoding.
+    mm_data['image'] = {
+        "image_embeds": image_embeds,
+        "image_grid_thw": torch.load(...),  # torch.Tensor of shape (1, 3)
+    }
+    # For MiniCPM-V, set this entry instead: image_size_list is needed to calculate details of the sliced image.
+    mm_data['image'] = {
+        "image_embeds": image_embeds,
+        "image_size_list": [image.size] # list of image sizes
+    }
+    outputs = llm.generate({
+        "prompt": prompt,
+        "multi_modal_data": mm_data,
+    })
+    for o in outputs:
+        generated_text = o.outputs[0].text
+        print(generated_text)
    # Batch inference
    image_1 = PIL.Image.open(...)
    image_2 = PIL.Image.open(...)
@@ -75,63 +102,146 @@ To pass an image to the model, note the following in :class:`vllm.inputs.PromptI
 A code example can be found in `examples/offline_inference_vision_language.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language.py>`_.
+Multi-image input
+^^^^^^^^^^^^^^^^^
-Online OpenAI Vision API Compatible Inference
+Multi-image input is only supported for a subset of VLMs, as shown :ref:`here <supported_vlms>`.
----------------------------------------------
-You can serve vision language models with vLLM's HTTP server that is compatible with `OpenAI Vision API <https://platform.openai.com/docs/guides/vision>`_.
+To enable multiple multi-modal items per text prompt, you have to set ``limit_mm_per_prompt`` for the :class:`~vllm.LLM` class.
-.. note::
+.. code-block:: python
-    Currently, vLLM supports only **single** ``image_url`` input per ``messages``. Support for multi-image inputs will be
-    added in the future.
-Below is an example on how to launch the same ``llava-hf/llava-1.5-7b-hf`` with vLLM API server.
+    llm = LLM(
+        model="microsoft/Phi-3.5-vision-instruct",
+        trust_remote_code=True,  # Required to load Phi-3.5-vision
+        max_model_len=4096,  # Otherwise, it may not fit in smaller GPUs
+        limit_mm_per_prompt={"image": 2},  # The maximum number to accept
+    )
-.. important::
+Instead of passing in a single image, you can pass in a list of images.
-    Since OpenAI Vision API is based on `Chat <https://platform.openai.com/docs/api-reference/chat>`_ API, a chat template 
-    is **required** to launch the API server if the model's tokenizer does not come with one. In this example, we use the 
+.. code-block:: python
-    HuggingFace Llava chat template that you can find in the example folder `here <https://github.com/vllm-project/vllm/blob/main/examples/template_llava.jinja>`_.
+    # Refer to the HuggingFace repo for the correct format to use
+    prompt = "<|user|>\n<|image_1|>\n<|image_2|>\nWhat is the content of each image?<|end|>\n<|assistant|>\n"
+    # Load the images using PIL.Image
+    image1 = PIL.Image.open(...)
+    image2 = PIL.Image.open(...)
+    outputs = llm.generate({
+        "prompt": prompt,
+        "multi_modal_data": {
+            "image": [image1, image2]
+        },
+    })
+    for o in outputs:
+        generated_text = o.outputs[0].text
+        print(generated_text)
+A code example can be found in `examples/offline_inference_vision_language_multi_image.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language_multi_image.py>`_.
+Multi-image input can be extended to perform video captioning. We show this with `Qwen2-VL <https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct>`_ as it supports videos:
+.. code-block:: python
+    # Specify the maximum number of frames per video to be 4. This can be changed.
+    llm = LLM("Qwen/Qwen2-VL-2B-Instruct", limit_mm_per_prompt={"image": 4})
+    # Create the request payload.
+    video_frames = ... # load your video making sure it only has the number of frames specified earlier.
+    message = {
+        "role": "user",
+        "content": [
+            {"type": "text", "text": "Describe this set of frames. Consider the frames to be a part of the same video."},
+        ],
+    }
+    for i in range(len(video_frames)):
+        base64_image = encode_image(video_frames[i]) # base64 encoding.
+        new_image = {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
+        message["content"].append(new_image)
+    # Perform inference and log output.
+    outputs = llm.chat([message])
+    for o in outputs:
+        generated_text = o.outputs[0].text
+        print(generated_text)
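The ``encode_image`` helper called in the loop above is not defined in the snippet. A minimal sketch follows, assuming each frame is already available as JPEG-encoded bytes. The function body and the ``to_data_url`` helper are illustrative assumptions, not part of vLLM.

```python
import base64


def encode_image(jpeg_bytes: bytes) -> str:
    """Return base64 text for JPEG bytes (hypothetical helper, not a vLLM API)."""
    return base64.b64encode(jpeg_bytes).decode("utf-8")


def to_data_url(jpeg_bytes: bytes) -> str:
    """Build a data URL in the same form as the f-string in the loop above."""
    return f"data:image/jpeg;base64,{encode_image(jpeg_bytes)}"


# Placeholder bytes standing in for a real JPEG-encoded video frame.
frame = b"\xff\xd8\xff\xe0 placeholder"
url = to_data_url(frame)
```

If the frames are held as PIL images or arrays instead, they would first need to be serialized to JPEG bytes (for example into an in-memory buffer) before being passed to a helper like this.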
+Online Inference
+----------------
+OpenAI Vision API
+^^^^^^^^^^^^^^^^^
+You can serve vision language models with vLLM's HTTP server that is compatible with `OpenAI Vision API <https://platform.openai.com/docs/guides/vision>`_.
+Below is an example of how to launch the same ``microsoft/Phi-3.5-vision-instruct`` with vLLM's OpenAI-compatible API server.
 .. code-block:: bash
-    vllm serve llava-hf/llava-1.5-7b-hf --chat-template template_llava.jinja
+    vllm serve microsoft/Phi-3.5-vision-instruct --max-model-len 4096 \
+      --trust-remote-code --limit-mm-per-prompt image=2
 .. important::
-    We have removed all vision language related CLI args in the ``0.5.1`` release. **This is a breaking change**, so please update your code to follow
+    Since OpenAI Vision API is based on `Chat Completions <https://platform.openai.com/docs/api-reference/chat>`_ API,
-    the above snippet. Specifically, ``image_feature_size`` is no longer required to be specified as we now calculate that
+    a chat template is **required** to launch the API server.
-    internally for each model.
+    Although Phi-3.5-Vision comes with a chat template, for other models you may have to provide one if the model's tokenizer does not come with it.
+    The chat template can be inferred based on the documentation on the model's HuggingFace repo.
+    For example, LLaVA-1.5 (``llava-hf/llava-1.5-7b-hf``) requires a chat template that can be found `here <https://github.com/vllm-project/vllm/blob/main/examples/template_llava.jinja>`_.
 To consume the server, you can use the OpenAI client like in the example below:
 .. code-block:: python
    from openai import OpenAI
    openai_api_key = "EMPTY"
    openai_api_base = "http://localhost:8000/v1"
    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )
+    # Single-image input inference
+    image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
    chat_response = client.chat.completions.create(
-        model="llava-hf/llava-1.5-7b-hf",
+        model="microsoft/Phi-3.5-vision-instruct",
        messages=[{
            "role": "user",
            "content": [
                # NOTE: The prompt formatting with the image token `<image>` is not needed
                # since the prompt will be processed automatically by the API server.
-                {"type": "text", "text": "What's in this image?"},
+                {"type": "text", "text": "What’s in this image?"},
-                {
+                {"type": "image_url", "image_url": {"url": image_url}},
-                    "type": "image_url",
+            ],
-                    "image_url": {
+        }],
-                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
+    )
-                    },
+    print("Chat completion output:", chat_response.choices[0].message.content)
-                },
+    # Multi-image input inference
+    image_url_duck = "https://upload.wikimedia.org/wikipedia/commons/d/da/2015_Kaczka_krzy%C5%BCowka_w_wodzie_%28samiec%29.jpg"
+    image_url_lion = "https://upload.wikimedia.org/wikipedia/commons/7/77/002_The_lion_king_Snyggve_in_the_Serengeti_National_Park_Photo_by_Giles_Laurent.jpg"
+    chat_response = client.chat.completions.create(
+        model="microsoft/Phi-3.5-vision-instruct",
+        messages=[{
+            "role": "user",
+            "content": [
+                {"type": "text", "text": "What are the animals in these images?"},
+                {"type": "image_url", "image_url": {"url": image_url_duck}},
+                {"type": "image_url", "image_url": {"url": image_url_lion}},
            ],
        }],
    )
-    print("Chat response:", chat_response)
+    print("Chat completion output:", chat_response.choices[0].message.content)
-A full code example can be found in `examples/openai_vision_api_client.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_vision_api_client.py>`_.
+A full code example can be found in `examples/openai_api_client_for_multimodal.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_api_client_for_multimodal.py>`_.
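The single-image and multi-image requests above differ only in how many ``image_url`` parts follow the text part. As a sketch, a small helper can build that message from a list of URLs. ``build_image_message`` is a hypothetical name used for illustration and the ``example.com`` URLs are placeholders.

```python
from typing import Any


def build_image_message(text: str, image_urls: list[str]) -> dict[str, Any]:
    """Build one user message: a text part followed by one image_url part per URL."""
    content: list[dict[str, Any]] = [{"type": "text", "text": text}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    return {"role": "user", "content": content}


# Placeholder URLs; substitute real image locations.
message = build_image_message(
    "What are the animals in these images?",
    ["https://example.com/duck.jpg", "https://example.com/lion.jpg"],
)
```

The resulting dictionary can be passed as the single element of ``messages``. The number of URLs should stay within the value given to ``--limit-mm-per-prompt`` when the server was launched.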
 .. note::

--- a/docs/source/performance_benchmark/benchmarks.rst
+++ b/docs/source/performance_benchmark/benchmarks.rst
@@ -20,4 +20,4 @@ The performance benchmarks and nightly benchmarks can be triggered by submitting
 .. note::
-   Please refer to `vLLM performance benchmark descriptions <https://github.com/vllm-project/vllm/blob/main/.buildkite/nightly-benchmarks/tests/descriptions.md>`_ and `vLLM nightly benchmark descriptions <https://github.com/vllm-project/vllm/blob/main/.buildkite/nightly-benchmarks/nightly-descriptions.md>`_ for detailed descriptions on benchmark environment, workload and metrics.
+   Please refer to `vLLM performance benchmark descriptions <https://github.com/vllm-project/vllm/blob/main/.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md>`_ and `vLLM nightly benchmark descriptions <https://github.com/vllm-project/vllm/blob/main/.buildkite/nightly-benchmarks/nightly-descriptions.md>`_ for detailed descriptions on benchmark environment, workload and metrics.
--- a/docs/source/quantization/auto_awq.rst
+++ b/docs/source/quantization/auto_awq.rst
@@ -19,27 +19,31 @@ You can quantize your own models by installing AutoAWQ or picking one of the `40
    $ pip install autoawq
-After installing AutoAWQ, you are ready to quantize a model. Here is an example of how to quantize Vicuna 7B v1.5:
+After installing AutoAWQ, you are ready to quantize a model. Here is an example of how to quantize ``mistralai/Mistral-7B-Instruct-v0.2``:
 .. code-block:: python
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer
-    model_path = 'lmsys/vicuna-7b-v1.5'
+    model_path = 'mistralai/Mistral-7B-Instruct-v0.2'
-    quant_path = 'vicuna-7b-v1.5-awq'
+    quant_path = 'mistral-instruct-v0.2-awq'
    quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
    # Load model
-    model = AutoAWQForCausalLM.from_pretrained(model_path, **{"low_cpu_mem_usage": True})
+    model = AutoAWQForCausalLM.from_pretrained(
+        model_path, **{"low_cpu_mem_usage": True, "use_cache": False}
+    )
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    # Quantize
    model.quantize(tokenizer, quant_config=quant_config)
    # Save quantized model
    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)
+    print(f'Model is quantized and saved at "{quant_path}"')
 To run an AWQ model with vLLM, you can use `TheBloke/Llama-2-7b-Chat-AWQ <https://huggingface.co/TheBloke/Llama-2-7b-Chat-AWQ>`_ with the following command:

--- a/docs/source/quantization/bnb.rst
+++ b/docs/source/quantization/bnb.rst
@@ -11,7 +11,7 @@ Below are the steps to utilize BitsAndBytes with vLLM.
 .. code-block:: console
-    $ pip install bitsandbytes>=0.42.0
+    $ pip install "bitsandbytes>=0.44.0"
 vLLM reads the model's config file and supports both in-flight quantization and pre-quantized checkpoint.

--- a/docs/source/quantization/fp8.rst
+++ b/docs/source/quantization/fp8.rst
 .. _fp8:
-FP8
+FP8 W8A8
 ==================
 vLLM supports FP8 (8-bit floating point) weight and activation quantization using hardware acceleration on GPUs such as Nvidia H100 and AMD MI300x. 
@@ -15,6 +15,11 @@ The FP8 types typically supported in hardware have two distinct representations,
 - **E4M3**: Consists of 1 sign bit, 4 exponent bits, and 3 bits of mantissa. It can store values up to +/-448 and ``nan``.
 - **E5M2**: Consists of 1 sign bit, 5 exponent bits, and 2 bits of mantissa. It can store values up to +/-57344, +/- ``inf``, and ``nan``. The tradeoff for the increased dynamic range is lower precision of the stored values.
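The two maximum values quoted above follow directly from the bit layouts. The sketch below derives them, assuming the common conventions for these formats: E5M2 reserves its all-ones exponent for ``inf`` and ``nan`` as in IEEE 754, while E4M3 reserves only the single all-ones exponent and mantissa pattern for ``nan``. The function name and parameters are illustrative.

```python
def fp8_max(exp_bits: int, man_bits: int, ieee_like: bool) -> float:
    """Largest finite magnitude for a 1-sign-bit float with the given layout."""
    bias = 2 ** (exp_bits - 1) - 1
    all_ones_exp = 2 ** exp_bits - 1
    if ieee_like:
        # Whole top exponent is reserved for inf/nan (E5M2).
        max_exp = all_ones_exp - 1
        max_man = 2 ** man_bits - 1
    else:
        # Only all-ones exponent with all-ones mantissa is nan (E4M3).
        max_exp = all_ones_exp
        max_man = 2 ** man_bits - 2
    return (1 + max_man / 2 ** man_bits) * 2.0 ** (max_exp - bias)


e4m3_max = fp8_max(exp_bits=4, man_bits=3, ieee_like=False)
e5m2_max = fp8_max(exp_bits=5, man_bits=2, ieee_like=True)
```

Under these assumptions ``e4m3_max`` is 448 and ``e5m2_max`` is 57344, matching the ranges listed above.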
+.. note::
+   FP8 computation is supported on NVIDIA GPUs with compute capability >= 8.9 (Ada Lovelace, Hopper).
+   FP8 models will run on compute capability >= 8.0 (Ampere) as weight-only W8A16, utilizing FP8 Marlin.
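The note above can be read as a simple decision on the GPU's compute capability. The sketch below treats both thresholds as inclusive, since Ada Lovelace is compute capability 8.9 and Ampere starts at 8.0. ``fp8_mode`` is a hypothetical helper for illustration, not a vLLM function.

```python
def fp8_mode(major: int, minor: int) -> str:
    """Map a CUDA compute capability to the FP8 execution mode described above."""
    capability = (major, minor)
    if capability >= (8, 9):
        # Ada Lovelace and Hopper: FP8 weights and FP8 activations.
        return "W8A8"
    if capability >= (8, 0):
        # Ampere: FP8 weights only, via the FP8 Marlin path.
        return "W8A16"
    return "unsupported"


modes = {cc: fp8_mode(*cc) for cc in [(9, 0), (8, 9), (8, 6), (8, 0), (7, 5)]}
```

On a machine with PyTorch and a CUDA device, the ``(major, minor)`` pair can be obtained from ``torch.cuda.get_device_capability()``.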
 Quick Start with Online Dynamic Quantization
 --------------------------------------------
@@ -33,106 +38,134 @@ In this mode, all Linear modules (except for the final ``lm_head``) have their w
    Currently, we load the model at original precision before quantizing down to 8-bits, so you need enough memory to load the whole model.
-Offline Quantization
+Installation
+------------
+To produce performant FP8 quantized models with vLLM, you'll need to install the `llm-compressor <https://github.com/vllm-project/llm-compressor/>`_ library:
+.. code-block:: console
+   $ pip install llmcompressor==0.1.0
+Quantization Process
 --------------------
-For offline quantization to FP8, please install the `AutoFP8 library <https://github.com/neuralmagic/autofp8>`_.
+The quantization process involves three main steps:
-.. code-block:: bash
+1. Loading the model
+2. Applying quantization
+3. Evaluating accuracy in vLLM
-    git clone https://github.com/neuralmagic/AutoFP8.git
+1. Loading the Model
-    pip install -e AutoFP8
+^^^^^^^^^^^^^^^^^^^^
-This package introduces the ``AutoFP8ForCausalLM`` and ``BaseQuantizeConfig`` objects for managing how your model will be compressed.
+Use ``SparseAutoModelForCausalLM``, which wraps ``AutoModelForCausalLM``, for saving and loading quantized models:
-Offline Quantization with Dynamic Activation Scaling Factors
+.. code-block:: python
------------------------------------------------------------
-You can use AutoFP8 to produce checkpoints with their weights quantized to FP8 ahead of time and let vLLM handle calculating dynamic scales for the activations at runtime for maximum accuracy. You can enable this with the ``activation_scheme="dynamic"`` argument.
+   from llmcompressor.transformers import SparseAutoModelForCausalLM
+   from transformers import AutoTokenizer
-.. warning::
+   MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
+   model = SparseAutoModelForCausalLM.from_pretrained(
+     MODEL_ID, device_map="auto", torch_dtype="auto")
+   tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+2. Applying Quantization
+^^^^^^^^^^^^^^^^^^^^^^^^
-    Please note that although this mode doesn't give you better performance, it reduces memory footprint compared to online quantization.
+For FP8 quantization, we can recover accuracy with simple round-to-nearest (RTN) quantization. We recommend targeting all ``Linear`` layers using the ``FP8_DYNAMIC`` scheme, which uses:
+- Static, per-channel quantization on the weights
+- Dynamic, per-token quantization on the activations
+Since simple RTN does not require data for weight quantization and the activations are quantized dynamically, we do not need any calibration data for this quantization flow.
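To illustrate the difference in granularity between the two bullet points, the sketch below computes only the scaling factors, not the FP8 rounding itself. It assumes a max-abs scale mapped onto the E4M3 range of 448. The helper name and the toy values are illustrative, and the actual llm-compressor implementation may differ.

```python
FP8_E4M3_MAX = 448.0  # largest E4M3 magnitude


def per_row_scales(matrix: list[list[float]]) -> list[float]:
    """One scale per row: the row's max |x| mapped onto the FP8 range."""
    return [max(abs(x) for x in row) / FP8_E4M3_MAX for row in matrix]


# Weights: rows are output channels. Scales are computed once, offline (static).
weights = [[0.5, -2.24], [4.48, 1.0]]
weight_scales = per_row_scales(weights)

# Activations: rows are tokens. Scales are recomputed on every forward pass (dynamic).
activations = [[44.8, -1.0], [0.1, 8.96]]
activation_scales = per_row_scales(activations)
```

Because the weight scales depend only on the weights and the activation scales are derived at runtime, neither step needs calibration data. A production implementation would also need to handle all-zero rows, which this sketch does not.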
 .. code-block:: python
-    from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig
+   from llmcompressor.transformers import oneshot
+   from llmcompressor.modifiers.quantization import QuantizationModifier
-    pretrained_model_dir = "meta-llama/Meta-Llama-3-8B-Instruct"
+   # Configure the simple PTQ quantization
-    quantized_model_dir = "Meta-Llama-3-8B-Instruct-FP8-Dynamic"
+   recipe = QuantizationModifier(
+     targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
-    # Define quantization config with static activation scales
+   # Apply the quantization algorithm.
-    quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="dynamic")
+   oneshot(model=model, recipe=recipe)
-    # For dynamic activation scales, there is no need for calbration examples
-    examples = []
-    # Load the model, quantize, and save checkpoint
+   # Save the model.
-    model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
+   SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
-    model.quantize(examples)
+   model.save_pretrained(SAVE_DIR)
-    model.save_quantized(quantized_model_dir)
+   tokenizer.save_pretrained(SAVE_DIR)
+3. Evaluating Accuracy
+^^^^^^^^^^^^^^^^^^^^^^
+Install ``vllm`` and ``lm-evaluation-harness``:
+.. code-block:: console
+   $ pip install vllm lm-eval==0.4.4
+Load and run the model in ``vllm``:
+.. code-block:: python
+   from vllm import LLM
+   model = LLM("./Meta-Llama-3-8B-Instruct-FP8-Dynamic")
+   model.generate("Hello my name is")
+Evaluate accuracy with ``lm_eval`` (for example on 250 samples of ``gsm8k``):
+.. note::
+   Quantized models can be sensitive to the presence of the ``bos`` token. ``lm_eval`` does not add a ``bos`` token by default, so make sure to include the ``add_bos_token=True`` argument when running your evaluations.
+.. code-block:: console
+   $ MODEL=$PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic 
+   $ lm_eval \
+     --model vllm \
+     --model_args pretrained=$MODEL,add_bos_token=True \
+     --tasks gsm8k  --num_fewshot 5 --batch_size auto --limit 250
-In the output of the above script, you should be able to see the quantized Linear modules (FP8DynamicLinear) replaced in the model definition. 
+Here's an example of the resulting scores:
-Note that the ``lm_head`` Linear module at the end is currently skipped by default.
 .. code-block:: text
-    LlamaForCausalLM(
+   |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
-      (model): LlamaModel(
+   |-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
-        (embed_tokens): Embedding(128256, 4096)
+   |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.768|±  |0.0268|
-        (layers): ModuleList(
+   |     |       |strict-match    |     5|exact_match|↑  |0.768|±  |0.0268|
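As a rough sanity check on the table above, the reported ``Stderr`` is consistent with the standard error of a mean over 250 binary (correct/incorrect) scores. The sketch below assumes that ``exact_match`` is a plain per-sample mean and that the sample standard deviation (``n - 1`` denominator) is used. Both are assumptions and are not stated in the source; ``binary_mean_stderr`` is a hypothetical helper name.

```python
import math


def binary_mean_stderr(accuracy: float, n_samples: int) -> float:
    """Standard error of the mean of n 0/1 scores, using the sample (n - 1) variance."""
    sample_variance = accuracy * (1.0 - accuracy) * n_samples / (n_samples - 1)
    return math.sqrt(sample_variance / n_samples)


# Values taken from the example table: exact_match = 0.768 with --limit 250.
stderr = binary_mean_stderr(0.768, 250)
print(f"stderr ~= {stderr:.4f}")
```

With only 250 samples the uncertainty is a few percentage points, so a small gap between a quantized and an unquantized model at this sample size is not conclusive on its own.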
-          (0-31): 32 x LlamaDecoderLayer(
-            (self_attn): LlamaSdpaAttention(
-              (q_proj): FP8DynamicLinear()
-              (k_proj): FP8DynamicLinear()
-              (v_proj): FP8DynamicLinear()
-              (o_proj): FP8DynamicLinear()
-              (rotary_emb): LlamaRotaryEmbedding()
-            )
-            (mlp): LlamaMLP(
-              (gate_proj): FP8DynamicLinear()
-              (up_proj): FP8DynamicLinear()
-              (down_proj): FP8DynamicLinear()
-              (act_fn): SiLU()
-            )
-            (input_layernorm): LlamaRMSNorm()
-            (post_attention_layernorm): LlamaRMSNorm()
-          )
-        )
-        (norm): LlamaRMSNorm()
-      )
-      (lm_head): Linear(in_features=4096, out_features=128256, bias=False)
-    )
-    Saving the model to Meta-Llama-3-8B-Instruct-FP8-Dynamic
-Your model checkpoint with quantized weights should be available at ``Meta-Llama-3-8B-Instruct-FP8/``.
-We can see that the weights are smaller than the original BF16 precision.
-.. code-block:: bash
+Troubleshooting and Support
+---------------------------
-    ls -lh Meta-Llama-3-8B-Instruct-FP8-Dynamic/
+If you encounter any issues or have feature requests, please open an issue on the ``vllm-project/llm-compressor`` GitHub repository.
-    total 8.5G
-    -rw-rw-r-- 1 user user  869 Jun  7 14:43 config.json
-    -rw-rw-r-- 1 user user  194 Jun  7 14:43 generation_config.json
-    -rw-rw-r-- 1 user user 4.7G Jun  7 14:43 model-00001-of-00002.safetensors
-    -rw-rw-r-- 1 user user 3.9G Jun  7 14:43 model-00002-of-00002.safetensors
-    -rw-rw-r-- 1 user user  43K Jun  7 14:43 model.safetensors.index.json
-    -rw-rw-r-- 1 user user  296 Jun  7 14:43 special_tokens_map.json
-    -rw-rw-r-- 1 user user  50K Jun  7 14:43 tokenizer_config.json
-    -rw-rw-r-- 1 user user 8.7M Jun  7 14:43 tokenizer.json
-Finally, you can load the quantized model checkpoint directly in vLLM.
-.. code-block:: python
+Deprecated Flow
+------------------
-    from vllm import LLM
+.. note::
-    model = LLM(model="Meta-Llama-3-8B-Instruct-FP8-Dynamic/")
-    # INFO 06-10 21:15:41 model_runner.py:159] Loading model weights took 8.4596 GB
+   The following information is preserved for reference and search purposes.
-    result = model.generate("Hello, my name is")
+   The quantization method described below is deprecated in favor of the ``llmcompressor`` method described above.
+For static per-tensor offline quantization to FP8, please install the `AutoFP8 library <https://github.com/neuralmagic/autofp8>`_.
+.. code-block:: bash
+    git clone https://github.com/neuralmagic/AutoFP8.git
+    pip install -e AutoFP8
+This package introduces the ``AutoFP8ForCausalLM`` and ``BaseQuantizeConfig`` objects for managing how your model will be compressed.
 Offline Quantization with Static Activation Scaling Factors
 -----------------------------------------------------------
-For the best inference performance, you can use AutoFP8 with calibration data to produce per-tensor static scales for both the weights and activations by enabling the ``activation_scheme="static"`` argument.
+You can use AutoFP8 with calibration data to produce per-tensor static scales for both the weights and activations by enabling the ``activation_scheme="static"`` argument.
 .. code-block:: python
@@ -169,41 +202,3 @@ Finally, you can load the quantized model checkpoint directly in vLLM.
    # INFO 06-10 21:15:41 model_runner.py:159] Loading model weights took 8.4596 GB
    result = model.generate("Hello, my name is")
-FP8 checkpoint structure explanation
-----------------------------------------------------------
-Here we detail the structure for the FP8 checkpoints.
-The following is necessary to be present in the model's ``config.json``:
-.. code-block:: text
-    "quantization_config": {
-        "quant_method": "fp8",
-        "activation_scheme": "static" or "dynamic"
-    }
-Each quantized layer in the state_dict will have these tensors:
-* If the config has ``"activation_scheme": "static"``:
-.. code-block:: text
-    model.layers.0.mlp.down_proj.weight              < F8_E4M3
-    model.layers.0.mlp.down_proj.input_scale         < F32
-    model.layers.0.mlp.down_proj.weight_scale        < F32
-* If the config has ``"activation_scheme": "dynamic"``:
-.. code-block:: text
-    model.layers.0.mlp.down_proj.weight              < F8_E4M3
-    model.layers.0.mlp.down_proj.weight_scale        < F32
-Additionally, there can be `FP8 kv-cache scaling factors <https://github.com/vllm-project/vllm/pull/4893>`_ contained within quantized checkpoints specified through the ``.kv_scale`` parameter present on the Attention Module, such as:
-.. code-block:: text
-    model.layers.0.self_attn.kv_scale	             < F32
--- a/docs/source/quantization/fp8_e4m3_kvcache.rst
+++ b/docs/source/quantization/fp8_e4m3_kvcache.rst
@@ -45,5 +45,3 @@ Here is an example of how to enable this feature:
        # output w/ scaling factors:  England, the United Kingdom, and one of the world's leading financial,
        # output w/o scaling factors:  England, located in the southeastern part of the country. It is known 
-Note, current prefix caching doesn't work with FP8 KV cache enabled, forward_prefix kernel should handle different KV and cache type.
--- a/docs/source/quantization/fp8_e5m2_kvcache.rst
+++ b/docs/source/quantization/fp8_e5m2_kvcache.rst
@@ -32,5 +32,3 @@ Here is an example of how to enable this feature:
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
-Note, current prefix caching doesn't work with FP8 KV cache enabled, forward_prefix kernel should handle different KV and cache type.
--- a/docs/source/quantization/gguf.rst
+++ b/docs/source/quantization/gguf.rst
+.. _gguf:
+GGUF
+==================
+.. warning::
+   GGUF support in vLLM is highly experimental and under-optimized at the moment, and it might be incompatible with other features. Currently, you can use GGUF as a way to reduce memory footprint. If you encounter any issues, please report them to the vLLM team.
+.. warning::
+   Currently, vLLM only supports loading single-file GGUF models. If you have a multi-file GGUF model, you can use the `gguf-split <https://github.com/ggerganov/llama.cpp/pull/6135>`_ tool to merge the files into a single-file model.
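Before pointing vLLM at a local file, it can help to check that the file is a single, unsplit GGUF file. The following is a minimal sketch, not part of vLLM: ``is_single_file_gguf`` is a hypothetical helper, and the shard naming pattern (``-00001-of-00003.gguf``) is an assumption about how split files are usually named. The only format detail relied on is that a GGUF file starts with the four bytes ``GGUF``.

```python
import re
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file
# Assumed shard naming, e.g. "model-00001-of-00003.gguf"; adjust if your files differ.
SPLIT_PATTERN = re.compile(r"-\d{5}-of-\d{5}\.gguf$")


def is_single_file_gguf(path: str) -> bool:
    """Return True if `path` looks like one complete, unsplit GGUF file."""
    p = Path(path)
    if SPLIT_PATTERN.search(p.name):
        return False  # looks like one shard of a multi-file model
    with p.open("rb") as f:
        return f.read(4) == GGUF_MAGIC
```

A file that fails this check because of its shard-style name would need to be merged first, as described in the warning above.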
+To run a GGUF model with vLLM, you can download and use the local GGUF model from `TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF <https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF>`_ with the following command:
+.. code-block:: console
+   $ wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
+   $ # We recommend using the tokenizer from the base model to avoid slow and error-prone tokenizer conversion.
+   $ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0
+You can also add ``--tensor-parallel-size 2`` to enable tensor-parallel inference with 2 GPUs:
+.. code-block:: console
+   $ # We recommend using the tokenizer from the base model to avoid slow and error-prone tokenizer conversion.
+   $ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tensor-parallel-size 2
+.. warning::
+   We recommend using the tokenizer from the base model instead of the GGUF model, because tokenizer conversion from GGUF is time-consuming and unstable, especially for models with a large vocabulary.
+You can also use the GGUF model directly through the LLM entrypoint:
+.. code-block:: python
+   from vllm import LLM, SamplingParams
+   # In this script, we demonstrate how to pass input to the chat method:
+   conversation = [
+      {
+         "role": "system",
+         "content": "You are a helpful assistant"
+      },
+      {
+         "role": "user",
+         "content": "Hello"
+      },
+      {
+         "role": "assistant",
+         "content": "Hello! How can I assist you today?"
+      },
+      {
+         "role": "user",
+         "content": "Write an essay about the importance of higher education.",
+      },
+   ]
+   # Create a sampling params object.
+   sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+   # Create an LLM.
+   llm = LLM(model="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
+            tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
+   # Generate texts from the prompts. The output is a list of RequestOutput objects
+   # that contain the prompt, generated text, and other information.
+   outputs = llm.chat(conversation, sampling_params)
+   # Print the outputs.
+   for output in outputs:
+      prompt = output.prompt
+      generated_text = output.outputs[0].text
+      print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
--- a/docs/source/quantization/int8.rst
+++ b/docs/source/quantization/int8.rst
+.. _int8:
+INT8 W8A8
+==================
+vLLM supports quantizing weights and activations to INT8 for memory savings and inference acceleration.
+This quantization method is particularly useful for reducing model size while maintaining good performance.
+Please visit the HF collection of `quantized INT8 checkpoints of popular LLMs ready to use with vLLM <https://huggingface.co/collections/neuralmagic/int8-llms-for-vllm-668ec32c049dca0369816415>`_.
+.. note::
+   INT8 computation is supported on NVIDIA GPUs with compute capability >= 7.5 (Turing, Ampere, Ada Lovelace, Hopper).
+Prerequisites
+-------------
+To use INT8 quantization with vLLM, you'll need to install the `llm-compressor <https://github.com/vllm-project/llm-compressor/>`_ library:
+.. code-block:: console
+   $ pip install llmcompressor==0.1.0
+Quantization Process
+--------------------
+The quantization process involves four main steps:
+1. Loading the model
+2. Preparing calibration data
+3. Applying quantization
+4. Evaluating accuracy in vLLM
+1. Loading the Model
+^^^^^^^^^^^^^^^^^^^^
+Use ``SparseAutoModelForCausalLM``, which wraps ``AutoModelForCausalLM``, for saving and loading quantized models:
+.. code-block:: python
+   from llmcompressor.transformers import SparseAutoModelForCausalLM
+   from transformers import AutoTokenizer
+   MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
+   model = SparseAutoModelForCausalLM.from_pretrained(
+       MODEL_ID, device_map="auto", torch_dtype="auto",
+   )
+   tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+2. Preparing Calibration Data
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+When quantizing activations to INT8, you need sample data to estimate the activation scales.
+It's best to use calibration data that closely matches your deployment data. 
+For a general-purpose instruction-tuned model, you can use a dataset like ``ultrachat``:
+.. code-block:: python
+   from datasets import load_dataset
+   NUM_CALIBRATION_SAMPLES = 512
+   MAX_SEQUENCE_LENGTH = 2048
+   # Load and preprocess the dataset
+   ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
+   ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
+   def preprocess(example):
+       return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
+   ds = ds.map(preprocess)
+   def tokenize(sample):
+       return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
+   ds = ds.map(tokenize, remove_columns=ds.column_names)
+3. Applying Quantization
+^^^^^^^^^^^^^^^^^^^^^^^^
+Now, apply the quantization algorithms:
+.. code-block:: python
+   from llmcompressor.transformers import oneshot
+   from llmcompressor.modifiers.quantization import GPTQModifier
+   from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
+   # Configure the quantization algorithms
+   recipe = [
+       SmoothQuantModifier(smoothing_strength=0.8),
+       GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
+   ]
+   # Apply quantization
+   oneshot(
+       model=model,
+       dataset=ds,
+       recipe=recipe,
+       max_seq_length=MAX_SEQUENCE_LENGTH,
+       num_calibration_samples=NUM_CALIBRATION_SAMPLES,
+   )
+   # Save the compressed model
+   SAVE_DIR = MODEL_ID.split("/")[1] + "-W8A8-Dynamic-Per-Token"
+   model.save_pretrained(SAVE_DIR, save_compressed=True)
+   tokenizer.save_pretrained(SAVE_DIR)
+This process creates a W8A8 model with weights and activations quantized to 8-bit integers.
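To make "quantized to 8-bit integers" concrete, here is a minimal pure-Python sketch of symmetric INT8 quantization with a single scale per vector (for activations, think of one vector per token). It is an illustration only: it does not reproduce the SmoothQuant or GPTQ algorithms used in the recipe above, nor vLLM's kernels, and the function names are hypothetical.

```python
def quantize_int8_symmetric(values):
    """Quantize a list of floats to INT8 using one symmetric scale."""
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero input.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from INT8 codes and their scale."""
    return [q * scale for q in quantized]


activations = [0.5, -1.2, 0.25, 2.0]
q, scale = quantize_int8_symmetric(activations)
restored = dequantize(q, scale)
print(q, scale, restored)
```

The round trip is lossy: each restored value differs from the original by at most half a quantization step (``scale / 2``), which is why a single large outlier, by inflating the scale, hurts the precision of every other value in the vector.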
+4. Evaluating Accuracy
+^^^^^^^^^^^^^^^^^^^^^^
+After quantization, you can load and run the model in vLLM:
+.. code-block:: python
+   from vllm import LLM
+   model = LLM("./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token")
+To evaluate accuracy, you can use ``lm_eval``:
+.. code-block:: console
+   $ lm_eval --model vllm \
+     --model_args pretrained="./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token",add_bos_token=True \
+     --tasks gsm8k \
+     --num_fewshot 5 \
+     --limit 250 \
+     --batch_size 'auto'
+.. note::
+   Quantized models can be sensitive to the presence of the ``bos`` token. Make sure to include the ``add_bos_token=True`` argument when running evaluations.
+Best Practices
+--------------
+- Start with 512 samples for calibration data (increase if accuracy drops)
+- Use a sequence length of 2048 as a starting point
+- Employ the chat template or instruction template that the model was trained with
+- If you've fine-tuned a model, consider using a sample of your training data for calibration
+Troubleshooting and Support
+---------------------------
+If you encounter any issues or have feature requests, please open an issue on the ``vllm-project/llm-compressor`` GitHub repository.
\ No newline at end of file
--- a/docs/source/quantization/supported_hardware.rst
+++ b/docs/source/quantization/supported_hardware.rst
@@ -5,25 +5,127 @@ Supported Hardware for Quantization Kernels
 The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
-==============  ======  =======  =======  =====  ======  =======  =========  =======  ==============  ==========
+.. list-table::
-Implementation  Volta   Turing   Ampere   Ada    Hopper  AMD GPU  Intel GPU  x86 CPU  AWS Inferentia  Google TPU
+   :header-rows: 1
-==============  ======  =======  =======  =====  ======  =======  =========  =======  ==============  ==========
+   :widths: 20 8 8 8 8 8 8 8 8 8 8
-AQLM            ✅      ✅       ✅       ✅     ✅      ❌        ❌         ❌       ❌              ❌
-AWQ             ❌      ✅       ✅       ✅     ✅      ❌        ❌         ❌       ❌              ❌
+   * - Implementation
-DeepSpeedFP     ✅      ✅       ✅       ✅     ✅      ❌        ❌         ❌       ❌              ❌
+     - Volta
-FP8             ❌      ❌       ✅       ✅     ✅      ❌        ❌         ❌       ❌              ❌
+     - Turing
-Marlin          ❌      ❌       ✅       ✅     ✅      ❌        ❌         ❌       ❌              ❌
+     - Ampere
-GPTQ            ✅      ✅       ✅       ✅     ✅      ❌        ❌         ❌       ❌              ❌
+     - Ada
-SqueezeLLM      ✅      ✅       ✅       ✅     ✅      ❌        ❌         ❌       ❌              ❌
+     - Hopper
-bitsandbytes    ✅      ✅       ✅       ✅     ✅      ❌        ❌         ❌       ❌              ❌
+     - AMD GPU
-==============  ======  =======  =======  =====  ======  =======  =========  =======  ==============  ==========
+     - Intel GPU
+     - x86 CPU
+     - AWS Inferentia
+     - Google TPU
+   * - AWQ
+     - ✗
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✗
+     - ✗
+     - ✅︎
+     - ✗
+     - ✗
+   * - GPTQ
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✗
+     - ✗
+     - ✗
+     - ✗
+     - ✗
+   * - Marlin (GPTQ/AWQ/FP8)
+     - ✗
+     - ✗
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✗
+     - ✗
+     - ✗
+     - ✗
+     - ✗
+   * - INT8 (W8A8)
+     - ✗
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✗
+     - ✗
+     - ✅︎
+     - ✗
+     - ✗
+   * - FP8 (W8A8)
+     - ✗
+     - ✗
+     - ✗
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✗
+     - ✗
+     - ✗
+     - ✗
+   * - AQLM
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✗
+     - ✗
+     - ✗
+     - ✗
+     - ✗
+   * - bitsandbytes
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✗
+     - ✗
+     - ✗
+     - ✗
+     - ✗
+   * - DeepSpeedFP
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✗
+     - ✗
+     - ✗
+     - ✗
+     - ✗
+   * - GGUF
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✗
+     - ✗
+     - ✗
+     - ✗
+     - ✗
 Notes:
 ^^^^^^
 - Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
- "✅" indicates that the quantization method is supported on the specified hardware.
+- "✅︎" indicates that the quantization method is supported on the specified hardware.
- "❌" indicates that the quantization method is not supported on the specified hardware.
+- "✗" indicates that the quantization method is not supported on the specified hardware.
 Please note that this compatibility chart may be subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.
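The notes and table above can be turned into a small lookup. This is a sketch under stated assumptions: the dictionaries copy only the SM-to-architecture note and the NVIDIA columns of two table rows, and ``is_supported`` is a hypothetical helper, not a vLLM API.

```python
# SM version -> architecture name, taken from the Notes in this section.
SM_TO_ARCH = {
    (7, 0): "Volta",
    (7, 5): "Turing",
    (8, 0): "Ampere",
    (8, 6): "Ampere",
    (8, 9): "Ada",
    (9, 0): "Hopper",
}

# NVIDIA columns of two rows copied from the table above.
SUPPORTED_ARCHS = {
    "INT8 (W8A8)": {"Turing", "Ampere", "Ada", "Hopper"},
    "FP8 (W8A8)": {"Ada", "Hopper"},
}


def is_supported(method: str, capability: tuple) -> bool:
    """Return True if the table lists `method` as supported for this SM version."""
    arch = SM_TO_ARCH.get(tuple(capability))
    return arch is not None and arch in SUPPORTED_ARCHS[method]
```

On a machine with PyTorch and a CUDA device, ``torch.cuda.get_device_capability()`` returns a ``(major, minor)`` tuple that can be passed as ``capability``.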

--- a/docs/source/serving/compatibility_matrix.rst
+++ b/docs/source/serving/compatibility_matrix.rst
+.. _compatibility_matrix:
+Compatibility Matrix
+====================
+The tables below show which features are mutually exclusive and which features are supported on which hardware.
+.. note::
+   Follow the link on a '✗' to see the tracking issue for that unsupported feature/hardware combination.
+Feature x Feature
+-----------------
+.. raw:: html
+    <style>
+      /* Make smaller to try to improve readability  */
+      td {
+        font-size: 0.8rem;
+        text-align: center;
+      }
+      th {
+        text-align: center;
+        font-size: 0.8rem;
+      }
+    </style>
+.. list-table::
+   :header-rows: 1
+   :widths: auto
+   * - Feature
+     - :ref:`CP <chunked-prefill>`
+     - :ref:`APC <apc>`
+     - :ref:`LoRA <lora>`
+     - :abbr:`prmpt adptr (Prompt Adapter)`
+     - :ref:`SD <spec_decode>`
+     - CUDA graph
+     - :abbr:`enc-dec (Encoder-Decoder Models)`
+     - :abbr:`logP (Logprobs)`
+     - :abbr:`prmpt logP (Prompt Logprobs)`
+     - :abbr:`async output (Async Output Processing)`
+     - multi-step
+     - :abbr:`MM (Multimodal)`
+     - best-of
+     - beam-search
+     - :abbr:`guided dec (Guided Decoding)`
+   * - :ref:`CP <chunked-prefill>`
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+   * - :ref:`APC <apc>`
+     - ✅
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+   * - :ref:`LoRA <lora>`
+     - `✗ <https://github.com/vllm-project/vllm/pull/9057>`__ 
+     - ✅
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+   * - :abbr:`prmpt adptr (Prompt Adapter)`
+     - ✅
+     - ✅
+     - ✅
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+   * - :ref:`SD <spec_decode>`
+     - ✗
+     - ✅
+     - ✗
+     - ✅
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+   * - CUDA graph
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+   * - :abbr:`enc-dec (Encoder-Decoder Models)`
+     - ✗
+     - `✗ <https://github.com/vllm-project/vllm/issues/7366>`__ 
+     - ✗ 
+     - ✗
+     - `✗ <https://github.com/vllm-project/vllm/issues/7366>`__ 
+     - ✅
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+   * - :abbr:`logP (Logprobs)`
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+   * - :abbr:`prmpt logP (Prompt Logprobs)`
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - `✗ <https://github.com/vllm-project/vllm/pull/8199>`__ 
+     - ✅
+     - ✅
+     - ✅
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+   * - :abbr:`async output (Async Output Processing)`
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✗
+     - ✅ 
+     - ✗
+     - ✅
+     - ✅
+     - 
+     - 
+     - 
+     - 
+     - 
+     - 
+   * - multi-step
+     - ✗
+     - ✅
+     - ✗
+     - ✅
+     - ✗
+     - ✅
+     - ✗
+     - ✅
+     - `✗ <https://github.com/vllm-project/vllm/issues/8198>`__ 
+     - ✅
+     - 
+     - 
+     - 
+     - 
+     - 
+   * - :abbr:`MM (Multimodal)`
+     -  `✗ <https://github.com/vllm-project/vllm/pull/8346>`__ 
+     -  `✗ <https://github.com/vllm-project/vllm/pull/8348>`__ 
+     -  `✗ <https://github.com/vllm-project/vllm/pull/7199>`__ 
+     - ?
+     - ?
+     - ✅
+     - ✗
+     - ✅
+     - ✅
+     - ✅
+     - ?
+     - 
+     - 
+     - 
+     - 
+   * - best-of
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - `✗ <https://github.com/vllm-project/vllm/issues/6137>`__ 
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ?
+     - `✗ <https://github.com/vllm-project/vllm/issues/7968>`__ 
+     - ✅
+     - 
+     - 
+     - 
+   * - beam-search
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - `✗ <https://github.com/vllm-project/vllm/issues/6137>`__ 
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ?
+     - `✗ <https://github.com/vllm-project/vllm/issues/7968>`__ 
+     - ?
+     - ✅
+     - 
+     - 
+   * - :abbr:`guided dec (Guided Decoding)`
+     - ✅
+     - ✅
+     - ?
+     - ?
+     - ✅
+     - ✅
+     - ?
+     - ✅
+     - ✅
+     - ✅
+     - ✗
+     - ?
+     - ✅
+     - ✅
+     - 
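Only the lower triangle of the table above is filled in, so a pair of features has to be looked up as (row, column) with the row feature listed later than the column feature. A minimal sketch of a symmetric lookup follows. It copies only a six-cell subset of the table, and ``compatible`` is a hypothetical helper.

```python
# Subset of the Feature x Feature table: (row feature, column feature) -> supported?
# Only the lower triangle is recorded, exactly as in the table.
LOWER_TRIANGLE = {
    ("APC", "CP"): True,
    ("LoRA", "CP"): False,
    ("LoRA", "APC"): True,
    ("SD", "CP"): False,
    ("SD", "APC"): True,
    ("SD", "LoRA"): False,
}


def compatible(a: str, b: str):
    """Return True/False from the table, or None if the pair is not recorded."""
    if (a, b) in LOWER_TRIANGLE:
        return LOWER_TRIANGLE[(a, b)]
    # Try the mirrored cell, since the table is symmetric by construction.
    return LOWER_TRIANGLE.get((b, a))


print(compatible("CP", "SD"))
```

Returning ``None`` for unrecorded pairs keeps "unknown" distinct from "unsupported", mirroring the ``?`` and empty cells in the table.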
+Feature x Hardware
+------------------
+.. list-table::
+   :header-rows: 1
+   :widths: auto
+   * - Feature
+     - Volta
+     - Turing
+     - Ampere
+     - Ada
+     - Hopper
+     - CPU
+     - AMD
+   * - :ref:`CP <chunked-prefill>`
+     - `✗ <https://github.com/vllm-project/vllm/issues/2729>`__ 
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✗ 
+     - ✅
+   * - :ref:`APC <apc>`
+     - `✗ <https://github.com/vllm-project/vllm/issues/3687>`__ 
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✗
+     - ✅
+   * - :ref:`LoRA <lora>`
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - `✗ <https://github.com/vllm-project/vllm/pull/4830>`__ 
+     - ✅
+   * - :abbr:`prmpt adptr (Prompt Adapter)`
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - `✗ <https://github.com/vllm-project/vllm/issues/8475>`__ 
+     - ✅
+   * - :ref:`SD <spec_decode>`
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+   * - CUDA graph
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✗
+     - ✅
+   * - :abbr:`enc-dec (Encoder-Decoder Models)`
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - `✗ <https://github.com/vllm-project/vllm/blob/a84e598e2125960d3b4f716b78863f24ac562947/vllm/worker/cpu_model_runner.py#L125>`__ 
+     - ✗
+   * - :abbr:`logP (Logprobs)`
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+   * - :abbr:`prmpt logP (Prompt Logprobs)`
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+   * - :abbr:`async output (Async Output Processing)`
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✗
+     - ✗
+   * - multi-step
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - `✗ <https://github.com/vllm-project/vllm/issues/8477>`__ 
+     - ✅
+   * - :abbr:`MM (Multimodal)`
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+   * - best-of
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+   * - beam-search
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+   * - :abbr:`guided dec (Guided Decoding)`
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
+     - ✅
--- a/docs/source/serving/deploying_with_k8s.rst
+++ b/docs/source/serving/deploying_with_k8s.rst
+.. _deploying_with_k8s:
+Deploying with Kubernetes
+==========================
+Using Kubernetes to deploy vLLM is a scalable and efficient way to serve machine learning models. This guide will walk you through the process of deploying vLLM with Kubernetes, including the necessary prerequisites, steps for deployment, and testing.
+Prerequisites
+-------------
+Before you begin, ensure that you have the following:
+- A running Kubernetes cluster
+- NVIDIA Kubernetes Device Plugin (``k8s-device-plugin``), available at https://github.com/NVIDIA/k8s-device-plugin/
+- Available GPU resources in your cluster
+Deployment Steps
+----------------
+1.  **Create a PVC, Secret, and Deployment for vLLM**
+The PVC stores the model cache. It is optional; you can use hostPath or another storage option instead.
+.. code-block:: yaml
+  apiVersion: v1
+  kind: PersistentVolumeClaim
+  metadata:
+    name: mistral-7b
+    namespace: default
+  spec:
+    accessModes:
+    - ReadWriteOnce
+    resources:
+      requests:
+        storage: 50Gi
+    storageClassName: default
+    volumeMode: Filesystem
+The Secret is only required for accessing gated models; you can skip this step if you are not using any.
+.. code-block:: yaml
+  apiVersion: v1
+  kind: Secret
+  metadata:
+    name: hf-token-secret
+    namespace: default
+  type: Opaque
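+  # stringData accepts the plain-text token; the data field would require a base64-encoded value.
+  stringData:
+    token: "REPLACE_WITH_TOKEN"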
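Kubernetes expects values under a Secret's ``data`` field to be base64-encoded, whereas the ``stringData`` field takes plain text. If you place the token under ``data``, a minimal sketch for producing the encoded value is shown below. ``encode_secret_value`` is a hypothetical helper, and the token string is a made-up placeholder.

```python
import base64


def encode_secret_value(plain_text: str) -> str:
    """Base64-encode a value for use under a Secret's `data` field."""
    return base64.b64encode(plain_text.encode("utf-8")).decode("ascii")


# "hf_example_token" is a made-up placeholder, not a real token.
encoded = encode_secret_value("hf_example_token")
print(encoded)
```

Base64 is an encoding, not encryption, so the manifest should still be treated as sensitive.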
+Create a deployment file for vLLM to run the model server. The following example deploys the ``Mistral-7B-Instruct-v0.3`` model:
+.. code-block:: yaml
+  apiVersion: apps/v1
+  kind: Deployment
+  metadata:
+    name: mistral-7b
+    namespace: default
+    labels:
+      app: mistral-7b
+  spec:
+    replicas: 1
+    selector:
+      matchLabels:
+        app: mistral-7b
+    template:
+      metadata:
+        labels:
+          app: mistral-7b
+      spec:
+        volumes:
+        - name: cache-volume
+          persistentVolumeClaim:
+            claimName: mistral-7b
+        # vLLM needs to access the host's shared memory for tensor parallel inference.
+        - name: shm
+          emptyDir:
+            medium: Memory
+            sizeLimit: "2Gi"
+        containers:
+        - name: mistral-7b
+          image: vllm/vllm-openai:latest
+          command: ["/bin/sh", "-c"]
+          args: [
+            "vllm serve mistralai/Mistral-7B-Instruct-v0.3 --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
+          ]
+          env:
+          - name: HUGGING_FACE_HUB_TOKEN
+            valueFrom:
+              secretKeyRef:
+                name: hf-token-secret
+                key: token
+          ports:
+          - containerPort: 8000
+          resources:
+            limits:
+              cpu: "10"
+              memory: 20G
+              nvidia.com/gpu: "1"
+            requests:
+              cpu: "2"
+              memory: 6G
+              nvidia.com/gpu: "1"
+          volumeMounts:
+          - mountPath: /root/.cache/huggingface
+            name: cache-volume
+          - name: shm
+            mountPath: /dev/shm
+          livenessProbe:
+            httpGet:
+              path: /health
+              port: 8000
+            initialDelaySeconds: 60
+            periodSeconds: 10
+          readinessProbe:
+            httpGet:
+              path: /health
+              port: 8000
+            initialDelaySeconds: 60
+            periodSeconds: 5
+2. **Create a Kubernetes Service for vLLM**
+Next, create a Kubernetes Service file to expose the ``mistral-7b`` deployment:
+.. code-block:: yaml
+    apiVersion: v1
+    kind: Service
+    metadata:
+      name: mistral-7b
+      namespace: default
+    spec:
+      ports:
+      - name: http-mistral-7b
+        port: 80
+        protocol: TCP
+        targetPort: 8000
+      # The label selector should match the deployment labels & it is useful for prefix caching feature
+      selector:
+        app: mistral-7b
+      sessionAffinity: None
+      type: ClusterIP
+3. **Deploy and Test**
+Apply the deployment and service configurations using ``kubectl apply -f <filename>``:
+.. code-block:: console
+    kubectl apply -f deployment.yaml
+    kubectl apply -f service.yaml
+To test the deployment, run the following ``curl`` command:
+.. code-block:: console
+    curl http://mistral-7b.default.svc.cluster.local/v1/completions \
+      -H "Content-Type: application/json" \
+      -d '{
+            "model": "mistralai/Mistral-7B-Instruct-v0.3",
+            "prompt": "San Francisco is a",
+            "max_tokens": 7,
+            "temperature": 0
+          }'
+If the service is correctly deployed, you should receive a response from the vLLM model.
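The same test can be scripted from a pod inside the cluster. The sketch below uses only the Python standard library. The helper names ``build_completion_request`` and ``send`` are hypothetical, and it assumes the ``model`` field should match the model name passed to ``vllm serve`` in the Deployment above. Building the request is separated from sending it so the payload can be inspected without network access.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": 7, "temperature": 0}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response (requires cluster access)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_completion_request(
    "http://mistral-7b.default.svc.cluster.local",  # Service DNS name from the manifest above
    "mistralai/Mistral-7B-Instruct-v0.3",           # model served by the Deployment above
    "San Francisco is a",
)
```

Calling ``send(request)`` from inside the cluster should return the completion as a dictionary; from outside the cluster the Service DNS name will not resolve.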
+Conclusion
+----------
+Deploying vLLM with Kubernetes allows for efficient scaling and management of ML models leveraging GPU resources. By following the steps outlined above, you should be able to set up and test a vLLM deployment within your Kubernetes cluster. If you encounter any issues or have suggestions, please feel free to contribute to the documentation.
\ No newline at end of file