Merge tag 'v0.6.1' into v0.6.1-dev

4851c202 · zhuwenwen · 9b902f9e · 3fd2b0d2 · 4851c202 · 4851c202
Commit 4851c202 authored Sep 13, 2024 by zhuwenwen
20 changed files
--- a/docs/source/getting_started/debugging.rst
+++ b/docs/source/getting_started/debugging.rst
@@ -21,7 +21,7 @@ If you have already taken care of the above issues, but the vLLM instance still
 With more logging, hopefully you can find the root cause of the issue.
-If it crashes, and the error trace shows somewhere around ``self.graph.replay()`` in ``vllm/worker/model_runner.py``, it is a cuda error inside cudagraph. To know the particular cuda operation that causes the error, you can add ``--enforce-eager`` to the command line, or ``enforce_eager=True`` to the ``LLM`` class, to disable the cudagraph optimization. This way, you can locate the exact cuda operation that causes the error.
+If it crashes, and the error trace shows somewhere around ``self.graph.replay()`` in ``vllm/worker/model_runner.py``, it is a cuda error inside cudagraph. To know the particular cuda operation that causes the error, you can add ``--enforce-eager`` to the command line, or ``enforce_eager=True`` to the :class:`~vllm.LLM` class, to disable the cudagraph optimization. This way, you can locate the exact cuda operation that causes the error.
 Here are some common issues that can cause hangs:

--- a/docs/source/getting_started/quickstart.rst
+++ b/docs/source/getting_started/quickstart.rst
@@ -24,7 +24,9 @@ Offline Batched Inference
 We first show an example of using vLLM for offline batched inference on a dataset. In other words, we use vLLM to generate texts for a list of input prompts.
-Import ``LLM`` and ``SamplingParams`` from vLLM. The ``LLM`` class is the main class for running offline inference with vLLM engine. The ``SamplingParams`` class specifies the parameters for the sampling process.
+Import :class:`~vllm.LLM` and :class:`~vllm.SamplingParams` from vLLM.
+The :class:`~vllm.LLM` class is the main class for running offline inference with vLLM engine.
+The :class:`~vllm.SamplingParams` class specifies the parameters for the sampling process.
 .. code-block:: python
@@ -42,7 +44,7 @@ Define the list of input prompts and the sampling parameters for generation. The
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
-Initialize vLLM's engine for offline inference with the ``LLM`` class and the `OPT-125M model <https://arxiv.org/abs/2205.01068>`_. The list of supported models can be found at :ref:`supported models <supported_models>`.
+Initialize vLLM's engine for offline inference with the :class:`~vllm.LLM` class and the `OPT-125M model <https://arxiv.org/abs/2205.01068>`_. The list of supported models can be found at :ref:`supported models <supported_models>`.
 .. code-block:: python

--- a/docs/source/models/lora.rst
+++ b/docs/source/models/lora.rst
@@ -107,3 +107,55 @@ The following is an example request
            "max_tokens": 7,
            "temperature": 0
        }' | jq
+Dynamically serving LoRA Adapters
+---------------------------------
+In addition to serving LoRA adapters at server startup, the vLLM server now supports dynamically loading and unloading
+LoRA adapters at runtime through dedicated API endpoints. This feature can be particularly useful when the flexibility
+to change models on-the-fly is needed.
+Note: Enabling this feature in production environments is risky as user may participate model adapter management.
+To enable dynamic LoRA loading and unloading, ensure that the environment variable `VLLM_ALLOW_RUNTIME_LORA_UPDATING`
+is set to `True`. When this option is enabled, the API server will log a warning to indicate that dynamic loading is active.
+.. code-block:: bash
+    export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True
+Loading a LoRA Adapter:
+To dynamically load a LoRA adapter, send a POST request to the `/v1/load_lora_adapter` endpoint with the necessary
+details of the adapter to be loaded. The request payload should include the name and path to the LoRA adapter.
+Example request to load a LoRA adapter:
+.. code-block:: bash
+    curl -X POST http://localhost:8000/v1/load_lora_adapter \
+    -H "Content-Type: application/json" \
+    -d '{
+        "lora_name": "sql_adapter",
+        "lora_path": "/path/to/sql-lora-adapter"
+    }'
+Upon a successful request, the API will respond with a 200 OK status code. If an error occurs, such as if the adapter
+cannot be found or loaded, an appropriate error message will be returned.
+Unloading a LoRA Adapter:
+To unload a LoRA adapter that has been previously loaded, send a POST request to the `/v1/unload_lora_adapter` endpoint
+with the name or ID of the adapter to be unloaded.
+Example request to unload a LoRA adapter:
+.. code-block:: bash
+    curl -X POST http://localhost:8000/v1/unload_lora_adapter \
+    -H "Content-Type: application/json" \
+    -d '{
+        "lora_name": "sql_adapter"
+    }'
--- a/docs/source/models/spec_decode.rst
+++ b/docs/source/models/spec_decode.rst
@@ -161,6 +161,46 @@ A variety of speculative models of this type are available on HF hub:
 * `granite-7b-instruct-accelerator <https://huggingface.co/ibm-granite/granite-7b-instruct-accelerator>`_
 * `granite-20b-code-instruct-accelerator <https://huggingface.co/ibm-granite/granite-20b-code-instruct-accelerator>`_
+Lossless guarantees of Speculative Decoding
+-------------------------------------------
+In vLLM, speculative decoding aims to enhance inference efficiency while maintaining accuracy. This section addresses the lossless guarantees of 
+speculative decoding, breaking down the guarantees into three key areas:
+1. **Theoretical Losslessness**
+   - Speculative decoding sampling is theoretically lossless up to the precision limits of hardware numerics. Floating-point errors might 
+   cause slight variations in output distributions, as discussed 
+   in `Accelerating Large Language Model Decoding with Speculative Sampling <https://arxiv.org/pdf/2302.01318>`_
+2. **Algorithmic Losslessness**
+   - vLLM’s implementation of speculative decoding is algorithmically validated to be lossless. Key validation tests include:
+    - **Rejection Sampler Convergence**: Ensures that samples from vLLM’s rejection sampler align with the target 
+      distribution. `View Test Code <https://github.com/vllm-project/vllm/blob/47b65a550866c7ffbd076ecb74106714838ce7da/tests/samplers/test_rejection_sampler.py#L252>`_
+    - **Greedy Sampling Equality**: Confirms that greedy sampling with speculative decoding matches greedy sampling
+      without it. This verifies that vLLM's speculative decoding framework, when integrated with the vLLM forward pass and the vLLM rejection sampler, 
+      provides a lossless guarantee.  Almost all of the tests in `this directory <https://github.com/vllm-project/vllm/tree/b67ae00cdbbe1a58ffc8ff170f0c8d79044a684a/tests/spec_decode/e2e>`_
+      verify this property using `this assertion implementation <https://github.com/vllm-project/vllm/blob/b67ae00cdbbe1a58ffc8ff170f0c8d79044a684a/tests/spec_decode/e2e/conftest.py#L291>`_
+3. **vLLM Logprob Stability**
+   - vLLM does not currently guarantee stable token log probabilities (logprobs). This can result in different outputs for the 
+   same request across runs. For more details, see the FAQ section 
+   titled *Can the output of a prompt vary across runs in vLLM?* in the `FAQs <../serving/faq.rst>`_.
+**Conclusion**
+While vLLM strives to ensure losslessness in speculative decoding, variations in generated outputs with and without speculative decoding 
+can occur due to following factors:
+- **Floating-Point Precision**: Differences in hardware numerical precision may lead to slight discrepancies in the output distribution.
+- **Batch Size and Numerical Stability**: Changes in batch size may cause variations in logprobs and output probabilities, potentially 
+  due to non-deterministic behavior in batched operations or numerical instability.
+**Mitigation Strategies**
+For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the `FAQs <../serving/faq.rst>`_.
 Resources for vLLM contributors
 -------------------------------

--- a/docs/source/models/supported_models.rst
+++ b/docs/source/models/supported_models.rst
@@ -194,12 +194,12 @@ Multimodal Language Models
  * - Architecture
    - Models
-    - Supported Modalities
+    - Modalities
    - Example HuggingFace Models
    - :ref:`LoRA <lora>`
  * - :code:`Blip2ForConditionalGeneration`
    - BLIP-2
-    - Image
+    - Image\ :sup:`E`
    - :code:`Salesforce/blip2-opt-2.7b`, :code:`Salesforce/blip2-opt-6.7b`, etc.
    -
  * - :code:`ChameleonForConditionalGeneration`
@@ -214,44 +214,75 @@ Multimodal Language Models
    - 
  * - :code:`InternVLChatModel`
    - InternVL2
-    - Image
+    - Image\ :sup:`E+`
    - :code:`OpenGVLab/InternVL2-4B`, :code:`OpenGVLab/InternVL2-8B`, etc.
    - 
  * - :code:`LlavaForConditionalGeneration`
    - LLaVA-1.5
-    - Image
+    - Image\ :sup:`E+`
    - :code:`llava-hf/llava-1.5-7b-hf`, :code:`llava-hf/llava-1.5-13b-hf`, etc.
    -
  * - :code:`LlavaNextForConditionalGeneration`
    - LLaVA-NeXT
-    - Image
+    - Image\ :sup:`E+`
    - :code:`llava-hf/llava-v1.6-mistral-7b-hf`, :code:`llava-hf/llava-v1.6-vicuna-7b-hf`, etc.
    -
+  * - :code:`LlavaNextVideoForConditionalGeneration`
+    - LLaVA-NeXT-Video
+    - Video
+    - :code:`llava-hf/LLaVA-NeXT-Video-7B-hf`, etc. (see note)
+    -
+  * - :code:`MiniCPMV`
+    - MiniCPM-V
+    - Image\ :sup:`+`
+    - :code:`openbmb/MiniCPM-V-2` (see note), :code:`openbmb/MiniCPM-Llama3-V-2_5`, :code:`openbmb/MiniCPM-V-2_6`, etc.
+    -
  * - :code:`PaliGemmaForConditionalGeneration`
    - PaliGemma
-    - Image
+    - Image\ :sup:`E`
    - :code:`google/paligemma-3b-pt-224`, :code:`google/paligemma-3b-mix-224`, etc.
    - 
  * - :code:`Phi3VForCausalLM`
    - Phi-3-Vision, Phi-3.5-Vision
-    - Image
+    - Image\ :sup:`E+`
    - :code:`microsoft/Phi-3-vision-128k-instruct`, :code:`microsoft/Phi-3.5-vision-instruct` etc.
    -
-  * - :code:`MiniCPMV`
+  * - :code:`PixtralForConditionalGeneration`
-    - MiniCPM-V
+    - Pixtral
-    - Image
+    - Image\ :sup:`+`
-    - :code:`openbmb/MiniCPM-V-2` (see note), :code:`openbmb/MiniCPM-Llama3-V-2_5`, :code:`openbmb/MiniCPM-V-2_6`, etc.
+    - :code:`mistralai/Pixtral-12B-2409`
+    -
+  * - :code:`QWenLMHeadModel`
+    - Qwen-VL
+    - Image\ :sup:`E`
+    - :code:`Qwen/Qwen-VL`, :code:`Qwen/Qwen-VL-Chat`, etc.
+    -
+  * - :code:`Qwen2VLForConditionalGeneration`
+    - Qwen2-VL (see note)
+    - Image\ :sup:`+` / Video\ :sup:`+`
+    - :code:`Qwen/Qwen2-VL-2B-Instruct`, :code:`Qwen/Qwen2-VL-7B-Instruct`, :code:`Qwen/Qwen2-VL-72B-Instruct`, etc.
    -
  * - :code:`UltravoxModel`
    - Ultravox
-    - Audio
+    - Audio\ :sup:`E+`
    - :code:`fixie-ai/ultravox-v0_3`
    -
+| :sup:`E` Pre-computed embeddings can be inputted for this modality.
+| :sup:`+` Multiple items can be inputted per text prompt for this modality.
 .. note::
  For :code:`openbmb/MiniCPM-V-2`, the official repo doesn't work yet, so we need to use a fork (:code:`HwwwH/MiniCPM-V-2`) for now.
  For more details, please see: https://github.com/vllm-project/vllm/pull/4087#issuecomment-2250397630
+.. note::
+  For :code:`LLaVA-NeXT-Video` and :code:`Qwen2-VL`, the latest release of :code:`huggingface/transformers` doesn't work yet, so we need to use a developer version (:code:`21fac7abba2a37fae86106f87fcf9974fd1e3830`) for now.
+  This can be installed by running the following command: 
+  .. code-block:: bash
+    pip install git+https://github.com/huggingface/transformers.git@21fac7abba2a37fae86106f87fcf9974fd1e3830
 ----
 If your model uses one of the above model architectures, you can seamlessly run your model with vLLM.

--- a/docs/source/models/vlm.rst
+++ b/docs/source/models/vlm.rst
@@ -9,26 +9,23 @@ This document shows you how to run and serve these models using vLLM.
 .. important::
    We are actively iterating on VLM support. Expect breaking changes to VLM usage and development in upcoming releases without prior deprecation.
-    Currently, the support for vision language models on vLLM has the following limitations:
-    * Only single image input is supported per text prompt.
    We are continuously improving user & developer experience for VLMs. Please `open an issue on GitHub <https://github.com/vllm-project/vllm/issues/new/choose>`_ if you have any feedback or feature requests.
-Offline Batched Inference
+Offline Inference
-------------------------
+-----------------
-To initialize a VLM, the aforementioned arguments must be passed to the ``LLM`` class for instantiating the engine.
+Single-image input
+^^^^^^^^^^^^^^^^^^
+The :class:`~vllm.LLM` class can be instantiated in much the same way as language-only models.
 .. code-block:: python
    llm = LLM(model="llava-hf/llava-1.5-7b-hf")
-.. important::
+.. note::
    We have removed all vision language related CLI args in the ``0.5.1`` release. **This is a breaking change**, so please update your code to follow
-    the above snippet. Specifically, ``image_feature_size`` is no longer required to be specified as we now calculate that
+    the above snippet. Specifically, ``image_feature_size`` can no longer be specified as we now calculate that internally for each model.
-    internally for each model.
 To pass an image to the model, note the following in :class:`vllm.inputs.PromptInputs`:
@@ -86,61 +83,117 @@ To pass an image to the model, note the following in :class:`vllm.inputs.PromptI
 A code example can be found in `examples/offline_inference_vision_language.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language.py>`_.
+Multi-image input
+^^^^^^^^^^^^^^^^^
-Online OpenAI Vision API Compatible Inference
+Multi-image input is only supported for a subset of VLMs, as shown :ref:`here <supported_vlms>`.
----------------------------------------------
-You can serve vision language models with vLLM's HTTP server that is compatible with `OpenAI Vision API <https://platform.openai.com/docs/guides/vision>`_.
+To enable multiple multi-modal items per text prompt, you have to set ``limit_mm_per_prompt`` for the :class:`~vllm.LLM` class.
-.. note::
+.. code-block:: python
-    Currently, vLLM supports only **single** ``image_url`` input per ``messages``. Support for multi-image inputs will be
-    added in the future.
+    llm = LLM(
+        model="microsoft/Phi-3.5-vision-instruct",
+        trust_remote_code=True,  # Required to load Phi-3.5-vision
+        max_model_len=4096,  # Otherwise, it may not fit in smaller GPUs
+        limit_mm_per_prompt={"image": 2},  # The maximum number to accept
+    )
-Below is an example on how to launch the same ``llava-hf/llava-1.5-7b-hf`` with vLLM API server.
+Instead of passing in a single image, you can pass in a list of images.
-.. important::
+.. code-block:: python
-    Since OpenAI Vision API is based on `Chat <https://platform.openai.com/docs/api-reference/chat>`_ API, a chat template 
-    is **required** to launch the API server if the model's tokenizer does not come with one. In this example, we use the 
+    # Refer to the HuggingFace repo for the correct format to use
-    HuggingFace Llava chat template that you can find in the example folder `here <https://github.com/vllm-project/vllm/blob/main/examples/template_llava.jinja>`_.
+    prompt = "<|user|>\n<image_1>\n<image_2>\nWhat is the content of each image?<|end|>\n<|assistant|>\n"
+    # Load the images using PIL.Image
+    image1 = PIL.Image.open(...)
+    image2 = PIL.Image.open(...)
+    outputs = llm.generate({
+        "prompt": prompt,
+        "multi_modal_data": {
+            "image": [image1, image2]
+        },
+    })
+    for o in outputs:
+        generated_text = o.outputs[0].text
+        print(generated_text)
+A code example can be found in `examples/offline_inference_vision_language_multi_image.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language_multi_image.py>`_.
+Online Inference
+----------------
+OpenAI Vision API
+^^^^^^^^^^^^^^^^^
+You can serve vision language models with vLLM's HTTP server that is compatible with `OpenAI Vision API <https://platform.openai.com/docs/guides/vision>`_.
+Below is an example on how to launch the same ``microsoft/Phi-3.5-vision-instruct`` with vLLM's OpenAI-compatible API server.
 .. code-block:: bash
-    vllm serve llava-hf/llava-1.5-7b-hf --chat-template template_llava.jinja
+    vllm serve microsoft/Phi-3.5-vision-instruct --max-model-len 4096 \
+      --trust-remote-code --limit-mm-per-prompt image=2
 .. important::
-    We have removed all vision language related CLI args in the ``0.5.1`` release. **This is a breaking change**, so please update your code to follow
+    Since OpenAI Vision API is based on `Chat Completions <https://platform.openai.com/docs/api-reference/chat>`_ API,
-    the above snippet. Specifically, ``image_feature_size`` is no longer required to be specified as we now calculate that
+    a chat template is **required** to launch the API server.
-    internally for each model.
+    Although Phi-3.5-Vision comes with a chat template, for other models you may have to provide one if the model's tokenizer does not come with it.
+    The chat template can be inferred based on the documentation on the model's HuggingFace repo.
+    For example, LLaVA-1.5 (``llava-hf/llava-1.5-7b-hf``) requires a chat template that can be found `here <https://github.com/vllm-project/vllm/blob/main/examples/template_llava.jinja>`_.
 To consume the server, you can use the OpenAI client like in the example below:
 .. code-block:: python
    from openai import OpenAI
    openai_api_key = "EMPTY"
    openai_api_base = "http://localhost:8000/v1"
    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )
+    # Single-image input inference
+    image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
    chat_response = client.chat.completions.create(
-        model="llava-hf/llava-1.5-7b-hf",
+        model="microsoft/Phi-3.5-vision-instruct",
        messages=[{
            "role": "user",
            "content": [
                # NOTE: The prompt formatting with the image token `<image>` is not needed
                # since the prompt will be processed automatically by the API server.
-                {"type": "text", "text": "What's in this image?"},
+                {"type": "text", "text": "What’s in this image?"},
-                {
+                {"type": "image_url", "image_url": {"url": image_url}},
-                    "type": "image_url",
+            ],
-                    "image_url": {
+        }],
-                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
+    )
-                    },
+    print("Chat completion output:", chat_response.choices[0].message.content)
-                },
+    # Multi-image input inference
+    image_url_duck = "https://upload.wikimedia.org/wikipedia/commons/d/da/2015_Kaczka_krzy%C5%BCowka_w_wodzie_%28samiec%29.jpg"
+    image_url_lion = "https://upload.wikimedia.org/wikipedia/commons/7/77/002_The_lion_king_Snyggve_in_the_Serengeti_National_Park_Photo_by_Giles_Laurent.jpg"
+    chat_response = client.chat.completions.create(
+        model="microsoft/Phi-3.5-vision-instruct",
+        messages=[{
+            "role": "user",
+            "content": [
+                {"type": "text", "text": "What are the animals in these images?"},
+                {"type": "image_url", "image_url": {"url": image_url_duck}},
+                {"type": "image_url", "image_url": {"url": image_url_lion}},
            ],
        }],
    )
-    print("Chat response:", chat_response)
+    print("Chat completion output:", chat_response.choices[0].message.content)
 A full code example can be found in `examples/openai_vision_api_client.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_vision_api_client.py>`_.

--- a/docs/source/quantization/supported_hardware.rst
+++ b/docs/source/quantization/supported_hardware.rst
@@ -119,17 +119,6 @@ The table below shows the compatibility of various quantization implementations
     - ✗
     - ✗
     - ✗
-   * - SqueezeLLM
-     - ✅︎
-     - ✅︎
-     - ✅︎
-     - ✅︎
-     - ✅︎
-     - ✗
-     - ✗
-     - ✗
-     - ✗
-     - ✗
 Notes:
 ^^^^^^

--- a/docs/source/serving/faq.rst
+++ b/docs/source/serving/faq.rst
@@ -10,3 +10,22 @@ A: Assuming that you're referring to using OpenAI compatible server to serve mul
    Q: Which model to use for offline inference embedding?
 A: If you want to use an embedding model, try: https://huggingface.co/intfloat/e5-mistral-7b-instruct. Instead models, such as Llama-3-8b, Mistral-7B-Instruct-v0.3, are generation models rather than an embedding model
+----------------------------------------
+    Q: Can the output of a prompt vary across runs in vLLM?
+A: Yes, it can. vLLM does not guarantee stable log probabilities (logprobs) for the output tokens. Variations in logprobs may occur due to
+numerical instability in Torch operations or non-deterministic behavior in batched Torch operations when batching changes. For more details, 
+see the `Numerical Accuracy section <https://pytorch.org/docs/stable/notes/numerical_accuracy.html#batched-computations-or-slice-computations>`_.
+In vLLM, the same requests might be batched differently due to factors such as other concurrent requests,
+changes in batch size, or batch expansion in speculative decoding. These batching variations, combined with numerical instability of Torch operations, 
+can lead to slightly different logit/logprob values at each step. Such differences can accumulate, potentially resulting in 
+different tokens being sampled. Once a different token is sampled, further divergence is likely.
+**Mitigation Strategies**
+- For improved stability and reduced variance, use `float32`. Note that this will require more memory.
+- If using `bfloat16`, switching to `float16` can also help.
+- Using request seeds can aid in achieving more stable generation for temperature > 0, but discrepancies due to precision differences may still occur.
--- a/examples/fp8/README.md
+++ b/examples/fp8/README.md
@@ -62,7 +62,7 @@ This script evaluates the inference throughput of language models using various
 python3 benchmarks/benchmark_throughput.py --help 
 usage: benchmark_throughput.py [-h] [--backend {vllm,hf,mii}] [--dataset DATASET] [--input-len INPUT_LEN] [--output-len OUTPUT_LEN] [--model MODEL]
-                               [--tokenizer TOKENIZER] [--quantization {awq,gptq,squeezellm,None}] [--tensor-parallel-size TENSOR_PARALLEL_SIZE] [--n N]
+                               [--tokenizer TOKENIZER] [--quantization {awq,gptq,None}] [--tensor-parallel-size TENSOR_PARALLEL_SIZE] [--n N]
                               [--use-beam-search] [--num-prompts NUM_PROMPTS] [--seed SEED] [--hf-max-batch-size HF_MAX_BATCH_SIZE] [--trust-remote-code]
                               [--max-model-len MAX_MODEL_LEN] [--dtype {auto,half,float16,bfloat16,float,float32}] [--enforce-eager] [--kv-cache-dtype {auto,fp8}]
                               [--quantization-param-path KV_CACHE_quantization_param_path]
@@ -76,7 +76,7 @@ optional arguments:
  --output-len OUTPUT_LEN  Output length for each request. Overrides the output length from the dataset.
  --model MODEL
  --tokenizer TOKENIZER
-  --quantization {awq,gptq,squeezellm,None}, -q {awq,gptq,squeezellm,None}
+  --quantization {awq,gptq,None}, -q {awq,gptq,None}
  --tensor-parallel-size TENSOR_PARALLEL_SIZE, -tp TENSOR_PARALLEL_SIZE
  --n N  Number of generated sequences per prompt.
  --use-beam-search

--- a/examples/fp8/quantizer/README.md
+++ b/examples/fp8/quantizer/README.md
 ### Quantizer Utilities
-`quantize.py`: NVIDIA Quantization utilities using AMMO, ported from TensorRT-LLM:
+`quantize.py`: NVIDIA Quantization utilities using TensorRT-Model-Optimizer, ported
-`https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/quantization/quantize.py`
+from TensorRT-LLM: [`examples/quantization/quantize.py`](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/quantization/quantize.py)
 ### Prerequisite

--- a/examples/offline_inference_pixtral.py
+++ b/examples/offline_inference_pixtral.py
+# ruff: noqa
+import argparse
+from vllm import LLM
+from vllm.sampling_params import SamplingParams
+# This script is an offline demo for running Pixtral.
+#
+# If you want to run a server/client setup, please follow this code:
+#
+# - Server:
+#
+# ```bash
+# vllm serve mistralai/Pixtral-12B-2409 --tokenizer_mode mistral --limit_mm_per_prompt 'image=4' --max_num_batched_tokens 16384
+# ```
+#
+# - Client:
+#
+# ```bash
+# curl --location 'http://<your-node-url>:8000/v1/chat/completions' \
+# --header 'Content-Type: application/json' \
+# --header 'Authorization: Bearer token' \
+# --data '{
+#     "model": "mistralai/Pixtral-12B-2409",
+#     "messages": [
+#       {
+#         "role": "user",
+#         "content": [
+#             {"type" : "text", "text": "Describe this image in detail please."},
+#             {"type": "image_url", "image_url": {"url": "https://s3.amazonaws.com/cms.ipressroom.com/338/files/201808/5b894ee1a138352221103195_A680%7Ejogging-edit/A680%7Ejogging-edit_hero.jpg"}},
+#             {"type" : "text", "text": "and this one as well. Answer in French."},
+#             {"type": "image_url", "image_url": {"url": "https://www.wolframcloud.com/obj/resourcesystem/images/a0e/a0ee3983-46c6-4c92-b85d-059044639928/6af8cfb971db031b.png"}}
+#         ]
+#       }
+#     ]
+#   }'
+# ```
+#
+# Usage:
+#     python demo.py simple
+#     python demo.py advanced
+def run_simple_demo():
+    model_name = "mistralai/Pixtral-12B-2409"
+    sampling_params = SamplingParams(max_tokens=8192)
+    llm = LLM(model=model_name, tokenizer_mode="mistral")
+    prompt = "Describe this image in one sentence."
+    image_url = "https://picsum.photos/id/237/200/300"
+    messages = [
+        {
+            "role":
+            "user",
+            "content": [
+                {
+                    "type": "text",
+                    "text": prompt
+                },
+                {
+                    "type": "image_url",
+                    "image_url": {
+                        "url": image_url
+                    }
+                },
+            ],
+        },
+    ]
+    outputs = llm.chat(messages, sampling_params=sampling_params)
+    print(outputs[0].outputs[0].text)
+def run_advanced_demo():
+    model_name = "mistralai/Pixtral-12B-2409"
+    max_img_per_msg = 5
+    max_tokens_per_img = 4096
+    sampling_params = SamplingParams(max_tokens=8192, temperature=0.7)
+    llm = LLM(
+        model=model_name,
+        tokenizer_mode="mistral",
+        limit_mm_per_prompt={"image": max_img_per_msg},
+        max_num_batched_tokens=max_img_per_msg * max_tokens_per_img,
+    )
+    prompt = "Describe the following image."
+    url_1 = "https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/yosemite.png"
+    url_2 = "https://picsum.photos/seed/picsum/200/300"
+    url_3 = "https://picsum.photos/id/32/512/512"
+    messages = [
+        {
+            "role":
+            "user",
+            "content": [
+                {
+                    "type": "text",
+                    "text": prompt
+                },
+                {
+                    "type": "image_url",
+                    "image_url": {
+                        "url": url_1
+                    }
+                },
+                {
+                    "type": "image_url",
+                    "image_url": {
+                        "url": url_2
+                    }
+                },
+            ],
+        },
+        {
+            "role": "assistant",
+            "content": "The images show nature.",
+        },
+        {
+            "role": "user",
+            "content": "More details please and answer only in French!.",
+        },
+        {
+            "role": "user",
+            "content": [
+                {
+                    "type": "image_url",
+                    "image_url": {
+                        "url": url_3
+                    }
+                },
+            ],
+        },
+    ]
+    outputs = llm.chat(messages=messages, sampling_params=sampling_params)
+    print(outputs[0].outputs[0].text)
+def main():
+    parser = argparse.ArgumentParser(
+        description="Run a demo in simple or advanced mode.")
+    parser.add_argument(
+        "mode",
+        choices=["simple", "advanced"],
+        help="Specify the demo mode: 'simple' or 'advanced'",
+    )
+    args = parser.parse_args()
+    if args.mode == "simple":
+        print("Running simple demo...")
+        run_simple_demo()
+    elif args.mode == "advanced":
+        print("Running advanced demo...")
+        run_advanced_demo()
+if __name__ == "__main__":
+    main()
--- a/examples/offline_inference_vision_language.py
+++ b/examples/offline_inference_vision_language.py
@@ -9,12 +9,9 @@ from transformers import AutoTokenizer
 from vllm import LLM, SamplingParams
 from vllm.assets.image import ImageAsset
+from vllm.assets.video import VideoAsset
 from vllm.utils import FlexibleArgumentParser
-# Input image and question
-image = ImageAsset("cherry_blossom").pil_image.convert("RGB")
-question = "What is the content of this image?"
 # LLaVA-1.5
 def run_llava(question):
@@ -30,7 +27,16 @@ def run_llava(question):
 def run_llava_next(question):
    prompt = f"[INST] <image>\n{question} [/INST]"
-    llm = LLM(model="llava-hf/llava-v1.6-mistral-7b-hf")
+    llm = LLM(model="llava-hf/llava-v1.6-mistral-7b-hf", max_model_len=8192)
+    stop_token_ids = None
+    return llm, prompt, stop_token_ids
+# LlaVA-NeXT-Video
+# Currently only support for video input
+def run_llava_next_video(question):
+    prompt = f"USER: <video>\n{question} ASSISTANT:"
+    llm = LLM(model="llava-hf/LLaVA-NeXT-Video-7B-hf", max_model_len=8192)
    stop_token_ids = None
    return llm, prompt, stop_token_ids
@@ -159,9 +165,41 @@ def run_blip2(question):
    return llm, prompt, stop_token_ids
+# Qwen
+def run_qwen_vl(question):
+    llm = LLM(
+        model="Qwen/Qwen-VL",
+        trust_remote_code=True,
+        max_num_seqs=5,
+    )
+    prompt = f"{question}Picture 1: <img></img>\n"
+    stop_token_ids = None
+    return llm, prompt, stop_token_ids
+# Qwen2-VL
+def run_qwen2_vl(question):
+    model_name = "Qwen/Qwen2-VL-7B-Instruct"
+    llm = LLM(
+        model=model_name,
+        max_num_seqs=5,
+    )
+    prompt = ("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
+              "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
+              f"{question}<|im_end|>\n"
+              "<|im_start|>assistant\n")
+    stop_token_ids = None
+    return llm, prompt, stop_token_ids
 model_example_map = {
    "llava": run_llava,
    "llava-next": run_llava_next,
+    "llava-next-video": run_llava_next_video,
    "fuyu": run_fuyu,
    "phi3_v": run_phi3v,
    "paligemma": run_paligemma,
@@ -169,14 +207,54 @@ model_example_map = {
    "minicpmv": run_minicpmv,
    "blip-2": run_blip2,
    "internvl_chat": run_internvl,
+    "qwen_vl": run_qwen_vl,
+    "qwen2_vl": run_qwen2_vl,
 }
+def get_multi_modal_input(args):
+    """
+    return {
+        "data": image or video,
+        "question": question,
+    }
+    """
+    if args.modality == "image":
+        # Input image and question
+        image = ImageAsset("cherry_blossom") \
+            .pil_image.convert("RGB")
+        img_question = "What is the content of this image?"
+        return {
+            "data": image,
+            "question": img_question,
+        }
+    if args.modality == "video":
+        # Input video and question
+        video = VideoAsset(name="sample_demo_1.mp4",
+                           num_frames=args.num_frames).np_ndarrays
+        vid_question = "Why is this video funny?"
+        return {
+            "data": video,
+            "question": vid_question,
+        }
+    msg = f"Modality {args.modality} is not supported."
+    raise ValueError(msg)
 def main(args):
    model = args.model_type
    if model not in model_example_map:
        raise ValueError(f"Model type {model} is not supported.")
+    modality = args.modality
+    mm_input = get_multi_modal_input(args)
+    data = mm_input["data"]
+    question = mm_input["question"]
    llm, prompt, stop_token_ids = model_example_map[model](question)
    # We set temperature to 0.2 so that outputs can be different
@@ -191,7 +269,7 @@ def main(args):
        inputs = {
            "prompt": prompt,
            "multi_modal_data": {
-                "image": image
+                modality: data
            },
        }
@@ -200,7 +278,7 @@ def main(args):
        inputs = [{
            "prompt": prompt,
            "multi_modal_data": {
-                "image": image
+                modality: data
            },
        } for _ in range(args.num_prompts)]
@@ -223,8 +301,15 @@ if __name__ == "__main__":
                        help='Huggingface "model_type".')
    parser.add_argument('--num-prompts',
                        type=int,
-                        default=1,
+                        default=4,
                        help='Number of prompts to run.')
+    parser.add_argument('--modality',
+                        type=str,
+                        default="image",
+                        help='Modality of the input.')
+    parser.add_argument('--num-frames',
+                        type=int,
+                        default=16,
+                        help='Number of frames to extract from the video.')
    args = parser.parse_args()
    main(args)
--- a/examples/offline_inference_vision_language_multi_image.py
+++ b/examples/offline_inference_vision_language_multi_image.py
+"""
+This example shows how to use vLLM for running offline inference with
+multi-image input on vision language models, using the chat template defined
+by the model.
+"""
+from argparse import Namespace
+from typing import List
+from transformers import AutoProcessor, AutoTokenizer
+from vllm import LLM, SamplingParams
+from vllm.multimodal.utils import fetch_image
+from vllm.utils import FlexibleArgumentParser
+QUESTION = "What is the content of each image?"
+IMAGE_URLS = [
+    "https://upload.wikimedia.org/wikipedia/commons/d/da/2015_Kaczka_krzy%C5%BCowka_w_wodzie_%28samiec%29.jpg",
+    "https://upload.wikimedia.org/wikipedia/commons/7/77/002_The_lion_king_Snyggve_in_the_Serengeti_National_Park_Photo_by_Giles_Laurent.jpg",
+]
+def load_phi3v(question, image_urls: List[str]):
+    llm = LLM(
+        model="microsoft/Phi-3.5-vision-instruct",
+        trust_remote_code=True,
+        max_model_len=4096,
+        limit_mm_per_prompt={"image": len(image_urls)},
+    )
+    placeholders = "\n".join(f"<|image_{i}|>"
+                             for i, _ in enumerate(image_urls, start=1))
+    prompt = f"<|user|>\n{placeholders}\n{question}<|end|>\n<|assistant|>\n"
+    stop_token_ids = None
+    return llm, prompt, stop_token_ids, None
+def load_internvl(question, image_urls: List[str]):
+    model_name = "OpenGVLab/InternVL2-2B"
+    llm = LLM(
+        model=model_name,
+        trust_remote_code=True,
+        max_num_seqs=5,
+        max_model_len=4096,
+        limit_mm_per_prompt={"image": len(image_urls)},
+    )
+    placeholders = "\n".join(f"Image-{i}: <image>\n"
+                             for i, _ in enumerate(image_urls, start=1))
+    messages = [{'role': 'user', 'content': f"{placeholders}\n{question}"}]
+    tokenizer = AutoTokenizer.from_pretrained(model_name,
+                                              trust_remote_code=True)
+    prompt = tokenizer.apply_chat_template(messages,
+                                           tokenize=False,
+                                           add_generation_prompt=True)
+    # Stop tokens for InternVL
+    # models variants may have different stop tokens
+    # please refer to the model card for the correct "stop words":
+    # https://huggingface.co/OpenGVLab/InternVL2-2B#service
+    stop_tokens = ["<|endoftext|>", "<|im_start|>", "<|im_end|>", "<|end|>"]
+    stop_token_ids = [tokenizer.convert_tokens_to_ids(i) for i in stop_tokens]
+    return llm, prompt, stop_token_ids, None
+def load_qwen2_vl(question, image_urls: List[str]):
+    try:
+        from qwen_vl_utils import process_vision_info
+    except ModuleNotFoundError:
+        print('WARNING: `qwen-vl-utils` not installed, input images will not '
+              'be automatically resized. You can enable this functionality by '
+              '`pip install qwen-vl-utils`.')
+        process_vision_info = None
+    model_name = "Qwen/Qwen2-VL-7B-Instruct"
+    llm = LLM(
+        model=model_name,
+        max_num_seqs=5,
+        max_model_len=32768 if process_vision_info is None else 4096,
+        limit_mm_per_prompt={"image": len(image_urls)},
+    )
+    placeholders = [{"type": "image", "image": url} for url in image_urls]
+    messages = [{
+        "role": "system",
+        "content": "You are a helpful assistant."
+    }, {
+        "role":
+        "user",
+        "content": [
+            *placeholders,
+            {
+                "type": "text",
+                "text": question
+            },
+        ],
+    }]
+    processor = AutoProcessor.from_pretrained(model_name)
+    prompt = processor.apply_chat_template(messages,
+                                           tokenize=False,
+                                           add_generation_prompt=True)
+    stop_token_ids = None
+    if process_vision_info is None:
+        image_data = [fetch_image(url) for url in image_urls]
+    else:
+        image_data, _ = process_vision_info(messages)
+    return llm, prompt, stop_token_ids, image_data
+model_example_map = {
+    "phi3_v": load_phi3v,
+    "internvl_chat": load_internvl,
+    "qwen2_vl": load_qwen2_vl,
+}
+def run_generate(model, question: str, image_urls: List[str]):
+    llm, prompt, stop_token_ids, image_data = model_example_map[model](
+        question, image_urls)
+    if image_data is None:
+        image_data = [fetch_image(url) for url in image_urls]
+    sampling_params = SamplingParams(temperature=0.0,
+                                     max_tokens=128,
+                                     stop_token_ids=stop_token_ids)
+    outputs = llm.generate(
+        {
+            "prompt": prompt,
+            "multi_modal_data": {
+                "image": image_data
+            },
+        },
+        sampling_params=sampling_params)
+    for o in outputs:
+        generated_text = o.outputs[0].text
+        print(generated_text)
+def run_chat(model: str, question: str, image_urls: List[str]):
+    llm, _, stop_token_ids, _ = model_example_map[model](question, image_urls)
+    sampling_params = SamplingParams(temperature=0.0,
+                                     max_tokens=128,
+                                     stop_token_ids=stop_token_ids)
+    outputs = llm.chat([{
+        "role":
+        "user",
+        "content": [
+            {
+                "type": "text",
+                "text": question,
+            },
+            *({
+                "type": "image_url",
+                "image_url": {
+                    "url": image_url
+                },
+            } for image_url in image_urls),
+        ],
+    }],
+                       sampling_params=sampling_params)
+    for o in outputs:
+        generated_text = o.outputs[0].text
+        print(generated_text)
+def main(args: Namespace):
+    model = args.model_type
+    method = args.method
+    if method == "generate":
+        run_generate(model, QUESTION, IMAGE_URLS)
+    elif method == "chat":
+        run_chat(model, QUESTION, IMAGE_URLS)
+    else:
+        raise ValueError(f"Invalid method: {method}")
+if __name__ == "__main__":
+    parser = FlexibleArgumentParser(
+        description='Demo on using vLLM for offline inference with '
+        'vision language models that support multi-image input')
+    parser.add_argument('--model-type',
+                        '-m',
+                        type=str,
+                        default="phi3_v",
+                        choices=model_example_map.keys(),
+                        help='Huggingface "model_type".')
+    parser.add_argument("--method",
+                        type=str,
+                        default="generate",
+                        choices=["generate", "chat"],
+                        help="The method to run in `vllm.LLM`.")
+    args = parser.parse_args()
+    main(args)
--- a/examples/offline_inference_with_profiler.py
+++ b/examples/offline_inference_with_profiler.py
+import os
+from vllm import LLM, SamplingParams
+# enable torch profiler, can also be set on cmd line
+os.environ["VLLM_TORCH_PROFILER_DIR"] = "./vllm_profile"
+# Sample prompts.
+prompts = [
+    "Hello, my name is",
+    "The president of the United States is",
+    "The capital of France is",
+    "The future of AI is",
+]
+# Create a sampling params object.
+sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+# Create an LLM.
+llm = LLM(model="facebook/opt-125m")
+llm.start_profile()
+# Generate texts from the prompts. The output is a list of RequestOutput objects
+# that contain the prompt, generated text, and other information.
+outputs = llm.generate(prompts, sampling_params)
+llm.stop_profile()
+# Print the outputs.
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
--- a/examples/openai_vision_api_client.py
+++ b/examples/openai_vision_api_client.py
@@ -27,9 +27,10 @@ client = OpenAI(
 models = client.models.list()
 model = models.data[0].id
+# Single-image input inference
 image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
-# Use image url in the payload
+## Use image url in the payload
 chat_completion_from_url = client.chat.completions.create(
    messages=[{
        "role":
@@ -52,10 +53,10 @@ chat_completion_from_url = client.chat.completions.create(
 )
 result = chat_completion_from_url.choices[0].message.content
-print(f"Chat completion output:{result}")
+print("Chat completion output:", result)
-# Use base64 encoded image in the payload
+## Use base64 encoded image in the payload
 def encode_image_base64_from_url(image_url: str) -> str:
    """Encode an image retrieved from a remote url to base64 format."""
@@ -122,4 +123,4 @@ chat_completion_from_url = client.chat.completions.create(
 )
 result = chat_completion_from_url.choices[0].message.content
-print(f"Chat completion output:{result}")
+print("Chat completion output:", result)
--- a/examples/tool_chat_template_hermes.jinja
+++ b/examples/tool_chat_template_hermes.jinja
@@ -97,12 +97,13 @@
            {{- '{' }}
            {{- '"name": "' }}
            {{- tool_call.name }}
-                {{- '"}' }}
+            {{- '"' }}
-                {{- ', ' }}
            {%- if tool_call.arguments is defined %}
+                {{- ', ' }}
                {{- '"arguments": ' }}
                {{- tool_call.arguments|tojson }}
            {%- endif %}
+            {{- '}' }}
            {{- '\n</tool_call>' }}
        {%- endfor %}
        {{- '<|im_end|>\n' }}

--- a/requirements-adag.txt
+++ b/requirements-adag.txt
-# Dependencies for Ray accelerated DAG
-cupy-cuda12x
-ray >= 2.32
\ No newline at end of file
--- a/requirements-common.txt
+++ b/requirements-common.txt
@@ -25,5 +25,7 @@ pyzmq
 msgspec
 gguf == 0.9.1
 importlib_metadata
-mistral_common >= 1.3.4
+mistral_common >= 1.4.0
 pyyaml
+six>=1.16.0; python_version > '3.11' # transitive dependency of pandas that needs to be the latest version for python 3.12
+einops # Required for Qwen2-VL.
--- a/requirements-test.txt
+++ b/requirements-test.txt
-# Needed for Ray accelerated DAG tests
-r requirements-adag.txt
 # testing
 pytest
 tensorizer>=2.9.0
@@ -14,9 +11,10 @@ awscli
 einops # required for MPT, qwen-vl and Mamba
 httpx
 librosa # required for audio test
+opencv-python # required for video test
 peft
 requests
-ray
+ray[adag]>=2.35
 sentence-transformers # required for embedding
 soundfile # required for audio test
 compressed-tensors==0.4.0 # required for compressed-tensors

--- a/setup.py
+++ b/setup.py
@@ -177,14 +177,17 @@ class cmake_build_ext(build_ext):
        if is_sccache_available():
            cmake_args += [
+                '-DCMAKE_C_COMPILER_LAUNCHER=sccache',
                '-DCMAKE_CXX_COMPILER_LAUNCHER=sccache',
                '-DCMAKE_CUDA_COMPILER_LAUNCHER=sccache',
-                '-DCMAKE_C_COMPILER_LAUNCHER=sccache',
+                '-DCMAKE_HIP_COMPILER_LAUNCHER=sccache',
            ]
        elif is_ccache_available():
            cmake_args += [
+                '-DCMAKE_C_COMPILER_LAUNCHER=ccache',
                '-DCMAKE_CXX_COMPILER_LAUNCHER=ccache',
                '-DCMAKE_CUDA_COMPILER_LAUNCHER=ccache',
+                '-DCMAKE_HIP_COMPILER_LAUNCHER=ccache',
            ]
        # Pass the python executable to cmake so it can find an exact
@@ -412,8 +415,8 @@ except Exception as e:
                  stacklevel=2)
    __commit__ = "COMMIT_HASH_PLACEHOLDER"
-__version__ = "0.6.0"
+__version__ = "0.6.1"
-__dcu_version__ = f'0.6.0+{version}' 
+__dcu_version__ = f'0.6.1+{version}' 
 """
@@ -575,6 +578,7 @@ setup(
    ext_modules=ext_modules,
    extras_require={
        "tensorizer": ["tensorizer>=2.9.0"],
+        "video": ["opencv-python"],  # Required for video processing
        "audio": ["librosa", "soundfile"]  # Required for audio processing
    },
    cmdclass={"build_ext": cmake_build_ext} if len(ext_modules) > 0 else {},