Merge tag 'v0.5.5' into v0.5.5-dtk24.04.1

af7f4372 · zhuwenwen · 5e19cdef · 09c77926 · af7f4372 · af7f4372
Commit af7f4372 authored Sep 03, 2024 by zhuwenwen
20 changed files
--- a/docs/source/models/enabling_multimodal_inputs.rst
+++ b/docs/source/models/enabling_multimodal_inputs.rst
@@ -15,14 +15,14 @@ This document walks you through the steps to extend a vLLM model so that it acce
 It is assumed that you have already implemented the model in vLLM according to :ref:`these steps <adding_a_new_model>`.
 Further update the model as follows:

- Implement the :class:`~vllm.model_executor.models.interfaces.SupportsVision` interface.
+- Implement the :class:`~vllm.model_executor.models.interfaces.SupportsMultiModal` interface.

  .. code-block:: diff

-      + from vllm.model_executor.models.interfaces import SupportsVision
+      + from vllm.model_executor.models.interfaces import SupportsMultiModal

      - class YourModelForImage2Seq(nn.Module):
-      + class YourModelForImage2Seq(nn.Module, SupportsVision):
+      + class YourModelForImage2Seq(nn.Module, SupportsMultiModal):

  .. note::
      The model class does not have to be named :code:`*ForCausalLM`.
@@ -51,11 +51,11 @@ This decorator accepts a function that maps multi-modal inputs to the keyword ar

 .. code-block:: diff

-      from vllm.model_executor.models.interfaces import SupportsVision
+      from vllm.model_executor.models.interfaces import SupportsMultiModal
    + from vllm.multimodal import MULTIMODAL_REGISTRY

    + @MULTIMODAL_REGISTRY.register_image_input_mapper()
-      class YourModelForImage2Seq(nn.Module, SupportsVision):
+      class YourModelForImage2Seq(nn.Module, SupportsMultiModal):

 A default mapper is available for each modality in the core vLLM library. This input mapper will be used if you do not provide your own function.

@@ -66,19 +66,19 @@ A default mapper is available for each modality in the core vLLM library. This i
 3. Register maximum number of multi-modal tokens
 ------------------------------------------------

-For each modality type that the model accepts as input, calculate the maximum possible number of tokens
+For each modality type that the model accepts as input, calculate the maximum possible number of tokens per data instance
 and register it via :meth:`MULTIMODAL_REGISTRY.register_max_image_tokens <vllm.multimodal.MultiModalRegistry.register_max_image_tokens>`.

 .. code-block:: diff

      from vllm.inputs import INPUT_REGISTRY
-      from vllm.model_executor.models.interfaces import SupportsVision
+      from vllm.model_executor.models.interfaces import SupportsMultiModal
      from vllm.multimodal import MULTIMODAL_REGISTRY

      @MULTIMODAL_REGISTRY.register_image_input_mapper()
    + @MULTIMODAL_REGISTRY.register_max_image_tokens(<your_calculation>)
      @INPUT_REGISTRY.register_dummy_data(<your_dummy_data_factory>)
-      class YourModelForImage2Seq(nn.Module, SupportsVision):
+      class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
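
As a rough illustration of the calculation step, the sketch below counts tokens for a fixed-resolution ViT-style image encoder that emits one token per patch. The function name ``max_image_tokens`` and the 336/14 sizes are assumptions made for this example; they are not taken from the vLLM codebase, and a real model may add or merge tokens differently.

```python
def max_image_tokens(image_size: int, patch_size: int,
                     has_cls_token: bool = False) -> int:
    """Upper bound on image tokens for one image (hypothetical helper).

    Assumes a square input split into non-overlapping square patches,
    with one token per patch and an optional extra class token.
    """
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + (1 if has_cls_token else 0)


# Illustrative sizes only: a 336x336 input with 14x14 patches.
print(max_image_tokens(336, 14))
print(max_image_tokens(336, 14, has_cls_token=True))
```

The resulting number is what ``<your_calculation>`` in the decorator above would need to produce for such an encoder.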

 Here are some examples:

@@ -98,13 +98,13 @@ In such cases, you can define your own dummy data by registering a factory metho
 .. code-block:: diff

      from vllm.inputs import INPUT_REGISTRY
-      from vllm.model_executor.models.interfaces import SupportsVision
+      from vllm.model_executor.models.interfaces import SupportsMultiModal
      from vllm.multimodal import MULTIMODAL_REGISTRY

      @MULTIMODAL_REGISTRY.register_image_input_mapper()
      @MULTIMODAL_REGISTRY.register_max_image_tokens(<your_calculation>)
    + @INPUT_REGISTRY.register_dummy_data(<your_dummy_data_factory>)
-      class YourModelForImage2Seq(nn.Module, SupportsVision):
+      class YourModelForImage2Seq(nn.Module, SupportsMultiModal):

 .. note::
    The dummy data should have the maximum possible number of multi-modal tokens, as described in the previous step.
@@ -128,14 +128,14 @@ You can register input processors via :meth:`INPUT_REGISTRY.register_input_proce
 .. code-block:: diff

      from vllm.inputs import INPUT_REGISTRY
-      from vllm.model_executor.models.interfaces import SupportsVision
+      from vllm.model_executor.models.interfaces import SupportsMultiModal
      from vllm.multimodal import MULTIMODAL_REGISTRY

      @MULTIMODAL_REGISTRY.register_image_input_mapper()
      @MULTIMODAL_REGISTRY.register_max_image_tokens(<your_calculation>)
      @INPUT_REGISTRY.register_dummy_data(<your_dummy_data_factory>)
    + @INPUT_REGISTRY.register_input_processor(<your_input_processor>)
-      class YourModelForImage2Seq(nn.Module, SupportsVision):
+      class YourModelForImage2Seq(nn.Module, SupportsMultiModal):

 A common use case of input processors is inserting placeholder tokens to leverage the vLLM framework for attention mask generation.
 Here are some examples:
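
The placeholder insertion mentioned above can be sketched in plain Python. This is only an illustration of the idea, not vLLM's input-processor signature: the function name, the token ID ``32000``, and the per-image token count are all hypothetical.

```python
from typing import List


def expand_image_placeholders(prompt_token_ids: List[int],
                              image_token_id: int,
                              tokens_per_image: int) -> List[int]:
    """Repeat each image placeholder so it occupies one slot per image token.

    Hypothetical helper: after expansion, the sequence length already
    accounts for the image features that will replace the placeholders.
    """
    expanded: List[int] = []
    for token_id in prompt_token_ids:
        if token_id == image_token_id:
            expanded.extend([image_token_id] * tokens_per_image)
        else:
            expanded.append(token_id)
    return expanded


# Made-up IDs: 32000 stands in for the image placeholder token.
print(expand_image_placeholders([1, 32000, 5, 6], 32000, 3))
```

Because the expanded prompt has the final length, position and attention bookkeeping can treat image tokens like any other tokens.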

--- a/docs/source/models/spec_decode.rst
+++ b/docs/source/models/spec_decode.rst
@@ -14,7 +14,7 @@ Speculative decoding is a technique which improves inter-token latency in memory
 Speculating with a draft model
 ------------------------------

-The following code configures vLLM to use speculative decoding with a draft model, speculating 5 tokens at a time.
+The following code configures vLLM in offline mode to use speculative decoding with a draft model, speculating 5 tokens at a time.

 .. code-block:: python

@@ -39,6 +39,50 @@ The following code configures vLLM to use speculative decoding with a draft mode
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

+To do the same in online mode, launch the server:
+
+.. code-block:: bash
+
+    python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --model facebook/opt-6.7b \
+    --seed 42 -tp 1 --speculative_model facebook/opt-125m --use-v2-block-manager \
+    --num_speculative_tokens 5 --gpu_memory_utilization 0.8
+
+Then use a client:
+
+.. code-block:: python
+
+    from openai import OpenAI
+
+    # Modify OpenAI's API key and API base to use vLLM's API server.
+    openai_api_key = "EMPTY"
+    openai_api_base = "http://localhost:8000/v1"
+
+    client = OpenAI(
+        # defaults to os.environ.get("OPENAI_API_KEY")
+        api_key=openai_api_key,
+        base_url=openai_api_base,
+    )
+
+    models = client.models.list()
+    model = models.data[0].id
+
+    # Completion API
+    stream = False
+    completion = client.completions.create(
+        model=model,
+        prompt="The future of AI is",
+        echo=False,
+        n=1,
+        stream=stream,
+    )
+
+    print("Completion results:")
+    if stream:
+        for c in completion:
+            print(c)
+    else:
+        print(completion)
+
 Speculating by matching n-grams in the prompt
 ---------------------------------------------

@@ -69,6 +113,55 @@ matching n-grams in the prompt. For more information read `this thread. <https:/
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
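
To make the n-gram idea concrete, here is a minimal sketch of how draft tokens could be proposed by looking back through the tokens seen so far. It is an assumption-laden toy, not vLLM's implementation: ``propose_from_prompt`` and its parameters are hypothetical names, and it uses a single fixed n-gram size.

```python
from typing import List


def propose_from_prompt(token_ids: List[int], ngram_size: int,
                        num_speculative_tokens: int) -> List[int]:
    """Propose draft tokens by matching the trailing n-gram earlier in the text.

    Hypothetical helper: finds the most recent earlier occurrence of the
    last `ngram_size` tokens and returns the tokens that followed it.
    """
    if len(token_ids) <= ngram_size:
        return []
    tail = token_ids[-ngram_size:]
    # Scan backwards, skipping the trailing n-gram itself.
    for start in range(len(token_ids) - ngram_size - 1, -1, -1):
        if token_ids[start:start + ngram_size] == tail:
            follow = start + ngram_size
            return token_ids[follow:follow + num_speculative_tokens]
    return []  # no match: nothing to speculate


print(propose_from_prompt([1, 2, 3, 4, 5, 1, 2], ngram_size=2,
                          num_speculative_tokens=3))
```

The proposed tokens are only guesses; the target model still verifies them, so a wrong proposal costs time but should not change the output.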

+Speculating using MLP speculators
+---------------------------------
+
+The following code configures vLLM to use speculative decoding where proposals are generated by
+draft models that condition draft predictions on both context vectors and sampled tokens.
+For more information see `this blog <https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/>`_ or
+`this technical report <https://arxiv.org/abs/2404.19124>`_.
+
+.. code-block:: python
+
+    from vllm import LLM, SamplingParams
+
+    prompts = [
+        "The future of AI is",
+    ]
+    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+
+    llm = LLM(
+        model="meta-llama/Meta-Llama-3.1-70B-Instruct",
+        tensor_parallel_size=4,
+        speculative_model="ibm-fms/llama3-70b-accelerator",
+        speculative_draft_tensor_parallel_size=1,
+        use_v2_block_manager=True,
+    )
+    outputs = llm.generate(prompts, sampling_params)
+
+    for output in outputs:
+        prompt = output.prompt
+        generated_text = output.outputs[0].text
+        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+
+Note that these speculative models currently need to be run without tensor parallelism, although
+it is possible to run the main model using tensor parallelism (see example above). Since the
+speculative models are relatively small, we still see significant speedups. This
+limitation will be fixed in a future release.
+
+A variety of speculative models of this type are available on the Hugging Face Hub:
+
+* `llama-13b-accelerator <https://huggingface.co/ibm-fms/llama-13b-accelerator>`_
+* `llama3-8b-accelerator <https://huggingface.co/ibm-fms/llama3-8b-accelerator>`_
+* `codellama-34b-accelerator <https://huggingface.co/ibm-fms/codellama-34b-accelerator>`_
+* `llama2-70b-accelerator <https://huggingface.co/ibm-fms/llama2-70b-accelerator>`_
+* `llama3-70b-accelerator <https://huggingface.co/ibm-fms/llama3-70b-accelerator>`_
+* `granite-3b-code-instruct-accelerator <https://huggingface.co/ibm-granite/granite-3b-code-instruct-accelerator>`_
+* `granite-8b-code-instruct-accelerator <https://huggingface.co/ibm-granite/granite-8b-code-instruct-accelerator>`_
+* `granite-7b-instruct-accelerator <https://huggingface.co/ibm-granite/granite-7b-instruct-accelerator>`_
+* `granite-20b-code-instruct-accelerator <https://huggingface.co/ibm-granite/granite-20b-code-instruct-accelerator>`_
+
+
 Resources for vLLM contributors
 -------------------------------
 * `A Hacker's Guide to Speculative Decoding in vLLM <https://www.youtube.com/watch?v=9wNAgpX6z_4>`_

--- a/docs/source/models/supported_models.rst
+++ b/docs/source/models/supported_models.rst
@@ -177,52 +177,67 @@ Decoder-only Language Models

 .. _supported_vlms:

-Vision Language Models
-^^^^^^^^^^^^^^^^^^^^^^^
+Multimodal Language Models
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 .. list-table::
-  :widths: 25 25 50 5
+  :widths: 25 25 25 25 5
  :header-rows: 1

  * - Architecture
    - Models
+    - Supported Modalities
    - Example HuggingFace Models
    - :ref:`LoRA <lora>`
  * - :code:`Blip2ForConditionalGeneration`
    - BLIP-2
+    - Image
    - :code:`Salesforce/blip2-opt-2.7b`, :code:`Salesforce/blip2-opt-6.7b`, etc.
    -
  * - :code:`ChameleonForConditionalGeneration`
    - Chameleon
+    - Image
    - :code:`facebook/chameleon-7b` etc.
    - 
  * - :code:`FuyuForCausalLM`
    - Fuyu
+    - Image
    - :code:`adept/fuyu-8b` etc.
    - 
  * - :code:`InternVLChatModel`
    - InternVL2
+    - Image
    - :code:`OpenGVLab/InternVL2-4B`, :code:`OpenGVLab/InternVL2-8B`, etc.
    - 
  * - :code:`LlavaForConditionalGeneration`
    - LLaVA-1.5
+    - Image
    - :code:`llava-hf/llava-1.5-7b-hf`, :code:`llava-hf/llava-1.5-13b-hf`, etc.
    -
  * - :code:`LlavaNextForConditionalGeneration`
    - LLaVA-NeXT
+    - Image
    - :code:`llava-hf/llava-v1.6-mistral-7b-hf`, :code:`llava-hf/llava-v1.6-vicuna-7b-hf`, etc.
    -
  * - :code:`PaliGemmaForConditionalGeneration`
    - PaliGemma
+    - Image
    - :code:`google/paligemma-3b-pt-224`, :code:`google/paligemma-3b-mix-224`, etc.
    - 
  * - :code:`Phi3VForCausalLM`
-    - Phi-3-Vision
-    - :code:`microsoft/Phi-3-vision-128k-instruct`, etc.
+    - Phi-3-Vision, Phi-3.5-Vision
+    - Image
+    - :code:`microsoft/Phi-3-vision-128k-instruct`, :code:`microsoft/Phi-3.5-vision-instruct`, etc.
    -
  * - :code:`MiniCPMV`
    - MiniCPM-V
-    - :code:`openbmb/MiniCPM-V-2` (see note), :code:`openbmb/MiniCPM-Llama3-V-2_5`, etc.
+    - Image
+    - :code:`openbmb/MiniCPM-V-2` (see note), :code:`openbmb/MiniCPM-Llama3-V-2_5`, :code:`openbmb/MiniCPM-V-2_6`, etc.
+    -
+  * - :code:`UltravoxModel`
+    - Ultravox
+    - Audio
+    - :code:`fixie-ai/ultravox-v0_3`
    -

 .. note::

--- a/docs/source/models/vlm.rst
+++ b/docs/source/models/vlm.rst
@@ -49,6 +49,17 @@ To pass an image to the model, note the following in :class:`vllm.inputs.PromptI
        "multi_modal_data": {"image": image},
    })

+    for o in outputs:
+        generated_text = o.outputs[0].text
+        print(generated_text)
+
+    # Inference with image embeddings as input
+    image_embeds = torch.load(...) # torch.Tensor of shape (1, image_feature_size, hidden_size of LM)
+    outputs = llm.generate({
+        "prompt": prompt,
+        "multi_modal_data": {"image": image_embeds},
+    })
+
    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)

--- a/docs/source/quantization/fp8.rst
+++ b/docs/source/quantization/fp8.rst
 .. _fp8:

-FP8
+FP8 W8A8
 ==================

 vLLM supports FP8 (8-bit floating point) weight and activation quantization using hardware acceleration on GPUs such as Nvidia H100 and AMD MI300x. 
@@ -15,6 +15,11 @@ The FP8 types typically supported in hardware have two distinct representations,
 - **E4M3**: Consists of 1 sign bit, 4 exponent bits, and 3 bits of mantissa. It can store values up to +/-448 and ``nan``.
 - **E5M2**: Consists of 1 sign bit, 5 exponent bits, and 2 bits of mantissa. It can store values up to +/-57344, +/- ``inf``, and ``nan``. The tradeoff for the increased dynamic range is lower precision of the stored values.
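
The two maxima quoted above follow from the bit layouts. The sketch below reproduces them, assuming the usual conventions: E5M2 reserves its top exponent code for ``inf``/``nan`` as IEEE formats do, while E4M3 reserves only the all-ones exponent-plus-mantissa pattern for ``nan``. The function name is made up for this example.

```python
def fp8_max_finite(exp_bits: int, man_bits: int,
                   reserve_top_exponent: bool) -> float:
    """Largest finite magnitude of a small float format (illustrative helper)."""
    bias = 2 ** (exp_bits - 1) - 1
    if reserve_top_exponent:
        # IEEE-like (E5M2): the all-ones exponent encodes inf/nan.
        max_exp_field = 2 ** exp_bits - 2
        max_mantissa = 2 ** man_bits - 1
    else:
        # E4M3: only exponent and mantissa both all-ones is nan.
        max_exp_field = 2 ** exp_bits - 1
        max_mantissa = 2 ** man_bits - 2
    return 2.0 ** (max_exp_field - bias) * (1 + max_mantissa / 2 ** man_bits)


print(fp8_max_finite(4, 3, reserve_top_exponent=False))  # E4M3
print(fp8_max_finite(5, 2, reserve_top_exponent=True))   # E5M2
```

Giving up the ``inf`` encodings is what lets E4M3 reach 448 with only four exponent bits.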

+.. note::
+
+   FP8 computation is supported on NVIDIA GPUs with compute capability >= 8.9 (Ada Lovelace, Hopper).
+   FP8 models will run on compute capability >= 8.0 (Ampere) as weight-only W8A16, utilizing FP8 Marlin.
+
 Quick Start with Online Dynamic Quantization
 --------------------------------------------

@@ -33,106 +38,134 @@ In this mode, all Linear modules (except for the final ``lm_head``) have their w

    Currently, we load the model at original precision before quantizing down to 8-bits, so you need enough memory to load the whole model.

-Offline Quantization
--------------------
+Installation
+------------

-For offline quantization to FP8, please install the `AutoFP8 library <https://github.com/neuralmagic/autofp8>`_.
+To produce performant FP8 quantized models with vLLM, you'll need to install the `llm-compressor <https://github.com/vllm-project/llm-compressor/>`_ library:

-.. code-block:: bash
+.. code-block:: console

-    git clone https://github.com/neuralmagic/AutoFP8.git
-    pip install -e AutoFP8
+   $ pip install llmcompressor==0.1.0

-This package introduces the ``AutoFP8ForCausalLM`` and ``BaseQuantizeConfig`` objects for managing how your model will be compressed.
+Quantization Process
+--------------------

-Offline Quantization with Dynamic Activation Scaling Factors
------------------------------------------------------------
+The quantization process involves three main steps:

-You can use AutoFP8 to produce checkpoints with their weights quantized to FP8 ahead of time and let vLLM handle calculating dynamic scales for the activations at runtime for maximum accuracy. You can enable this with the ``activation_scheme="dynamic"`` argument.
+1. Loading the model
+2. Applying quantization
+3. Evaluating accuracy in vLLM

-.. warning::
+1. Loading the Model
+^^^^^^^^^^^^^^^^^^^^

-    Please note that although this mode doesn't give you better performance, it reduces memory footprint compared to online quantization.
+Use ``SparseAutoModelForCausalLM``, which wraps ``AutoModelForCausalLM``, for saving and loading quantized models:

 .. code-block:: python

-    from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig
+   from llmcompressor.transformers import SparseAutoModelForCausalLM
+   from transformers import AutoTokenizer

-    pretrained_model_dir = "meta-llama/Meta-Llama-3-8B-Instruct"
-    quantized_model_dir = "Meta-Llama-3-8B-Instruct-FP8-Dynamic"
+   MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

-    # Define quantization config with static activation scales
-    quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="dynamic")
-    # For dynamic activation scales, there is no need for calbration examples
-    examples = []
+   model = SparseAutoModelForCausalLM.from_pretrained(
+     MODEL_ID, device_map="auto", torch_dtype="auto")
+   tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

-    # Load the model, quantize, and save checkpoint
-    model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
-    model.quantize(examples)
-    model.save_quantized(quantized_model_dir)
+2. Applying Quantization
+^^^^^^^^^^^^^^^^^^^^^^^^

-In the output of the above script, you should be able to see the quantized Linear modules (FP8DynamicLinear) replaced in the model definition. 
-Note that the ``lm_head`` Linear module at the end is currently skipped by default.
+For FP8 quantization, we can recover accuracy with simple round-to-nearest (RTN) quantization. We recommend targeting all ``Linear`` layers using the ``FP8_DYNAMIC`` scheme, which uses:

-.. code-block:: text
+- Static, per-channel quantization on the weights
+- Dynamic, per-token quantization on the activations

-    LlamaForCausalLM(
-      (model): LlamaModel(
-        (embed_tokens): Embedding(128256, 4096)
-        (layers): ModuleList(
-          (0-31): 32 x LlamaDecoderLayer(
-            (self_attn): LlamaSdpaAttention(
-              (q_proj): FP8DynamicLinear()
-              (k_proj): FP8DynamicLinear()
-              (v_proj): FP8DynamicLinear()
-              (o_proj): FP8DynamicLinear()
-              (rotary_emb): LlamaRotaryEmbedding()
-            )
-            (mlp): LlamaMLP(
-              (gate_proj): FP8DynamicLinear()
-              (up_proj): FP8DynamicLinear()
-              (down_proj): FP8DynamicLinear()
-              (act_fn): SiLU()
-            )
-            (input_layernorm): LlamaRMSNorm()
-            (post_attention_layernorm): LlamaRMSNorm()
-          )
-        )
-        (norm): LlamaRMSNorm()
-      )
-      (lm_head): Linear(in_features=4096, out_features=128256, bias=False)
-    )
-    Saving the model to Meta-Llama-3-8B-Instruct-FP8-Dynamic
-
-Your model checkpoint with quantized weights should be available at ``Meta-Llama-3-8B-Instruct-FP8/``.
-We can see that the weights are smaller than the original BF16 precision.
+Since simple RTN does not require data for weight quantization and the activations are quantized dynamically, we do not need any calibration data for this quantization flow.

-.. code-block:: bash
+.. code-block:: python

-    ls -lh Meta-Llama-3-8B-Instruct-FP8-Dynamic/
-    total 8.5G
-    -rw-rw-r-- 1 user user  869 Jun  7 14:43 config.json
-    -rw-rw-r-- 1 user user  194 Jun  7 14:43 generation_config.json
-    -rw-rw-r-- 1 user user 4.7G Jun  7 14:43 model-00001-of-00002.safetensors
-    -rw-rw-r-- 1 user user 3.9G Jun  7 14:43 model-00002-of-00002.safetensors
-    -rw-rw-r-- 1 user user  43K Jun  7 14:43 model.safetensors.index.json
-    -rw-rw-r-- 1 user user  296 Jun  7 14:43 special_tokens_map.json
-    -rw-rw-r-- 1 user user  50K Jun  7 14:43 tokenizer_config.json
-    -rw-rw-r-- 1 user user 8.7M Jun  7 14:43 tokenizer.json
+   from llmcompressor.transformers import oneshot
+   from llmcompressor.modifiers.quantization import QuantizationModifier

-Finally, you can load the quantized model checkpoint directly in vLLM.
+   # Configure simple post-training quantization (PTQ)
+   recipe = QuantizationModifier(
+     targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
+
+   # Apply the quantization algorithm.
+   oneshot(model=model, recipe=recipe)
+
+   # Save the model.
+   SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
+   model.save_pretrained(SAVE_DIR)
+   tokenizer.save_pretrained(SAVE_DIR)
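
To show numerically what per-channel weight scales and per-token activation scales mean, here is a pure-Python sketch of round-to-nearest quantization onto the E4M3 grid. It is not how ``llmcompressor`` or vLLM implement the scheme; the helper names are hypothetical, and the rounding routine assumes the E4M3 layout (3 mantissa bits, smallest normal exponent of -6, maximum of 448).

```python
import math
from typing import List, Tuple

E4M3_MAX = 448.0          # largest finite E4M3 value
E4M3_MIN_NORMAL_EXP = -6  # exponent of the smallest normal value
MANTISSA_BITS = 3


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    # floor(log2(mag)); subnormals share the smallest normal exponent.
    exp = max(math.frexp(mag)[1] - 1, E4M3_MIN_NORMAL_EXP)
    step = 2.0 ** (exp - MANTISSA_BITS)
    return sign * min(round(mag / step) * step, E4M3_MAX)


def quantize_rows_e4m3(matrix: List[List[float]]) -> Tuple[List[List[float]], List[float]]:
    """One scale per row, chosen so the row's largest magnitude maps to 448."""
    quantized, scales = [], []
    for row in matrix:
        absmax = max(abs(v) for v in row)
        scale = absmax / E4M3_MAX if absmax > 0 else 1.0
        quantized.append([round_to_e4m3(v / scale) for v in row])
        scales.append(scale)
    return quantized, scales


# Weights: rows are output channels, scales are computed once offline (static).
w_q, w_scales = quantize_rows_e4m3([[0.5, -1.0, 0.25], [10.0, 20.0, -40.0]])
# Activations: rows are tokens, scales are recomputed per request (dynamic).
a_q, a_scales = quantize_rows_e4m3([[0.3, 0.1, -0.2]])
print(w_q, w_scales)
print(a_q, a_scales)
```

The same routine serves both cases; what differs is when the scales are computed, which is why this flow needs no calibration data.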
+
+3. Evaluating Accuracy
+^^^^^^^^^^^^^^^^^^^^^^
+
+Install ``vllm`` and ``lm-evaluation-harness``:
+
+.. code-block:: console
+
+   $ pip install vllm lm_eval==0.4.3
+
+Load and run the model in ``vllm``:

 .. code-block:: python

   from vllm import LLM
-    model = LLM(model="Meta-Llama-3-8B-Instruct-FP8-Dynamic/")
-    # INFO 06-10 21:15:41 model_runner.py:159] Loading model weights took 8.4596 GB
-    result = model.generate("Hello, my name is")
+   model = LLM("./Meta-Llama-3-8B-Instruct-FP8-Dynamic")
+   model.generate("Hello my name is")
+
+Evaluate accuracy with ``lm_eval`` (for example on 250 samples of ``gsm8k``):
+
+.. note::
+
+   Quantized models can be sensitive to the presence of the ``bos`` token. ``lm_eval`` does not add a ``bos`` token by default, so make sure to include the ``add_bos_token=True`` argument when running your evaluations.
+
+.. code-block:: console
+
+   $ MODEL=$PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic 
+   $ lm_eval \
+     --model vllm \
+     --model_args pretrained=$MODEL,add_bos_token=True \
+     --tasks gsm8k  --num_fewshot 5 --batch_size auto --limit 250
+
+Here's an example of the resulting scores:
+
+.. code-block:: text
+
+   |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
+   |-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
+   |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.768|±  |0.0268|
+   |     |       |strict-match    |     5|exact_match|↑  |0.768|±  |0.0268|
+
+Troubleshooting and Support
+---------------------------
+
+If you encounter any issues or have feature requests, please open an issue on the ``vllm-project/llm-compressor`` GitHub repository.
+
+
+Deprecated Flow
+------------------
+
+.. note::
+
+   The following information is preserved for reference and search purposes.
+   The quantization method described below is deprecated in favor of the ``llmcompressor`` method described above.
+
+For static per-tensor offline quantization to FP8, please install the `AutoFP8 library <https://github.com/neuralmagic/autofp8>`_.
+
+.. code-block:: bash
+
+    git clone https://github.com/neuralmagic/AutoFP8.git
+    pip install -e AutoFP8
+
+This package introduces the ``AutoFP8ForCausalLM`` and ``BaseQuantizeConfig`` objects for managing how your model will be compressed.

 Offline Quantization with Static Activation Scaling Factors
 -----------------------------------------------------------

-For the best inference performance, you can use AutoFP8 with calibration data to produce per-tensor static scales for both the weights and activations by enabling the ``activation_scheme="static"`` argument.
+You can use AutoFP8 with calibration data to produce per-tensor static scales for both the weights and activations by enabling the ``activation_scheme="static"`` argument.

 .. code-block:: python

@@ -169,41 +202,3 @@ Finally, you can load the quantized model checkpoint directly in vLLM.
    # INFO 06-10 21:15:41 model_runner.py:159] Loading model weights took 8.4596 GB
    result = model.generate("Hello, my name is")

-FP8 checkpoint structure explanation
-----------------------------------------------------------
-
-Here we detail the structure for the FP8 checkpoints.
-
-The following is necessary to be present in the model's ``config.json``:
-
-.. code-block:: text
-
-    "quantization_config": {
-        "quant_method": "fp8",
-        "activation_scheme": "static" or "dynamic"
-    }
-
-
-Each quantized layer in the state_dict will have these tensors:
-
-* If the config has ``"activation_scheme": "static"``:
-
-.. code-block:: text
-
-    model.layers.0.mlp.down_proj.weight              < F8_E4M3
-    model.layers.0.mlp.down_proj.input_scale         < F32
-    model.layers.0.mlp.down_proj.weight_scale        < F32
-
-* If the config has ``"activation_scheme": "dynamic"``:
-
-.. code-block:: text
-
-    model.layers.0.mlp.down_proj.weight              < F8_E4M3
-    model.layers.0.mlp.down_proj.weight_scale        < F32
-
-
-Additionally, there can be `FP8 kv-cache scaling factors <https://github.com/vllm-project/vllm/pull/4893>`_ contained within quantized checkpoints specified through the ``.kv_scale`` parameter present on the Attention Module, such as:
-
-.. code-block:: text
-
-    model.layers.0.self_attn.kv_scale	             < F32
--- a/docs/source/quantization/fp8_e4m3_kvcache.rst
+++ b/docs/source/quantization/fp8_e4m3_kvcache.rst
@@ -45,5 +45,3 @@ Here is an example of how to enable this feature:
        # output w/ scaling factors:  England, the United Kingdom, and one of the world's leading financial,
        # output w/o scaling factors:  England, located in the southeastern part of the country. It is known 

-Note, current prefix caching doesn't work with FP8 KV cache enabled, forward_prefix kernel should handle different KV and cache type.
-
--- a/docs/source/quantization/fp8_e5m2_kvcache.rst
+++ b/docs/source/quantization/fp8_e5m2_kvcache.rst
@@ -32,5 +32,3 @@ Here is an example of how to enable this feature:
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


-Note, current prefix caching doesn't work with FP8 KV cache enabled, forward_prefix kernel should handle different KV and cache type.
-
--- a/docs/source/quantization/int8.rst
+++ b/docs/source/quantization/int8.rst
+.. _int8:
+
+INT8 W8A8
+==================
+
+vLLM supports quantizing weights and activations to INT8 for memory savings and inference acceleration.
+This quantization method is particularly useful for reducing model size while maintaining good performance.
+
+Please visit the HF collection of `quantized INT8 checkpoints of popular LLMs ready to use with vLLM <https://huggingface.co/collections/neuralmagic/int8-llms-for-vllm-668ec32c049dca0369816415>`_.
+
+.. note::
+
+   INT8 computation is supported on NVIDIA GPUs with compute capability >= 7.5 (Turing, Ampere, Ada Lovelace, Hopper).
+
+Prerequisites
+-------------
+
+To use INT8 quantization with vLLM, you'll need to install the `llm-compressor <https://github.com/vllm-project/llm-compressor/>`_ library:
+
+.. code-block:: console
+
+   $ pip install llmcompressor==0.1.0
+
+Quantization Process
+--------------------
+
+The quantization process involves four main steps:
+
+1. Loading the model
+2. Preparing calibration data
+3. Applying quantization
+4. Evaluating accuracy in vLLM
+
+1. Loading the Model
+^^^^^^^^^^^^^^^^^^^^
+
+Use ``SparseAutoModelForCausalLM``, which wraps ``AutoModelForCausalLM``, for saving and loading quantized models:
+
+.. code-block:: python
+
+   from llmcompressor.transformers import SparseAutoModelForCausalLM
+   from transformers import AutoTokenizer
+
+   MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
+   model = SparseAutoModelForCausalLM.from_pretrained(
+       MODEL_ID, device_map="auto", torch_dtype="auto",
+   )
+   tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+
+2. Preparing Calibration Data
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+When quantizing activations to INT8, you need sample data to estimate the activation scales.
+It's best to use calibration data that closely matches your deployment data. 
+For a general-purpose instruction-tuned model, you can use a dataset like ``ultrachat``:
+
+.. code-block:: python
+
+   from datasets import load_dataset
+
+   NUM_CALIBRATION_SAMPLES = 512
+   MAX_SEQUENCE_LENGTH = 2048
+
+   # Load and preprocess the dataset
+   ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
+   ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
+
+   def preprocess(example):
+       return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
+   ds = ds.map(preprocess)
+
+   def tokenize(sample):
+       return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
+   ds = ds.map(tokenize, remove_columns=ds.column_names)
+
+3. Applying Quantization
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Now, apply the quantization algorithms:
+
+.. code-block:: python
+
+   from llmcompressor.transformers import oneshot
+   from llmcompressor.modifiers.quantization import GPTQModifier
+   from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
+
+   # Configure the quantization algorithms
+   recipe = [
+       SmoothQuantModifier(smoothing_strength=0.8),
+       GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
+   ]
+
+   # Apply quantization
+   oneshot(
+       model=model,
+       dataset=ds,
+       recipe=recipe,
+       max_seq_length=MAX_SEQUENCE_LENGTH,
+       num_calibration_samples=NUM_CALIBRATION_SAMPLES,
+   )
+
+   # Save the compressed model
+   SAVE_DIR = MODEL_ID.split("/")[1] + "-W8A8-Dynamic-Per-Token"
+   model.save_pretrained(SAVE_DIR, save_compressed=True)
+   tokenizer.save_pretrained(SAVE_DIR)
+
+This process creates a W8A8 model with weights and activations quantized to 8-bit integers.
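
For intuition about the W8A8 arithmetic, the sketch below quantizes a tiny weight matrix and one activation row to INT8 with plain round-to-nearest, multiplies them as integers, and rescales the result. This is a deliberately simplified assumption: the recipe above uses SmoothQuant and GPTQ, which pick the quantized weights more carefully, and all function names here are hypothetical.

```python
from typing import List, Tuple


def quantize_int8_rows(matrix: List[List[float]]) -> Tuple[List[List[int]], List[float]]:
    """Symmetric INT8 quantization with one scale per row (round-to-nearest)."""
    quantized, scales = [], []
    for row in matrix:
        absmax = max(abs(v) for v in row)
        scale = absmax / 127 if absmax > 0 else 1.0
        quantized.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return quantized, scales


def int8_matmul(act_q, act_scales, w_q, w_scales) -> List[List[float]]:
    """Integer dot products, rescaled by activation scale * weight scale."""
    return [
        [sum(a * w for a, w in zip(a_row, w_row)) * a_s * w_s
         for w_row, w_s in zip(w_q, w_scales)]
        for a_row, a_s in zip(act_q, act_scales)
    ]


weights = [[1.0, -2.0], [0.5, 0.25]]   # rows are output channels
activations = [[3.0, 4.0]]             # rows are tokens

w_q, w_scales = quantize_int8_rows(weights)
a_q, a_scales = quantize_int8_rows(activations)
print(int8_matmul(a_q, a_scales, w_q, w_scales))  # close to [[-5.0, 2.5]]
```

The small gap between the printed values and the exact product is the quantization error that the accuracy evaluation in the next step measures at model scale.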
+
+4. Evaluating Accuracy
+^^^^^^^^^^^^^^^^^^^^^^
+
+After quantization, you can load and run the model in vLLM:
+
+.. code-block:: python
+
+   from vllm import LLM
+   model = LLM("./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token")
+
+To evaluate accuracy, you can use ``lm_eval``:
+
+.. code-block:: console
+
+   $ lm_eval --model vllm \
+     --model_args pretrained="./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token",add_bos_token=True \
+     --tasks gsm8k \
+     --num_fewshot 5 \
+     --limit 250 \
+     --batch_size 'auto'
+
+.. note::
+
+   Quantized models can be sensitive to the presence of the ``bos`` token. Make sure to include the ``add_bos_token=True`` argument when running evaluations.
+
+Best Practices
+--------------
+
+- Start with 512 samples for calibration data (increase if accuracy drops)
+- Use a sequence length of 2048 as a starting point
+- Employ the chat template or instruction template that the model was trained with
+- If you've fine-tuned a model, consider using a sample of your training data for calibration
+
+Troubleshooting and Support
+---------------------------
+
+If you encounter any issues or have feature requests, please open an issue on the ``vllm-project/llm-compressor`` GitHub repository.
\ No newline at end of file
--- a/docs/source/quantization/supported_hardware.rst
+++ b/docs/source/quantization/supported_hardware.rst
@@ -5,25 +5,138 @@ Supported Hardware for Quantization Kernels

 The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:

-==============  ======  =======  =======  =====  ======  =======  =========  =======  ==============  ==========
-Implementation  Volta   Turing   Ampere   Ada    Hopper  AMD GPU  Intel GPU  x86 CPU  AWS Inferentia  Google TPU
-==============  ======  =======  =======  =====  ======  =======  =========  =======  ==============  ==========
-AQLM            ✅      ✅       ✅       ✅     ✅      ❌        ❌         ❌       ❌              ❌
-AWQ             ❌      ✅       ✅       ✅     ✅      ❌        ❌         ❌       ❌              ❌
-DeepSpeedFP     ✅      ✅       ✅       ✅     ✅      ❌        ❌         ❌       ❌              ❌
-FP8             ❌      ❌       ✅       ✅     ✅      ❌        ❌         ❌       ❌              ❌
-Marlin          ❌      ❌       ✅       ✅     ✅      ❌        ❌         ❌       ❌              ❌
-GPTQ            ✅      ✅       ✅       ✅     ✅      ❌        ❌         ❌       ❌              ❌
-SqueezeLLM      ✅      ✅       ✅       ✅     ✅      ❌        ❌         ❌       ❌              ❌
-bitsandbytes    ✅      ✅       ✅       ✅     ✅      ❌        ❌         ❌       ❌              ❌
-==============  ======  =======  =======  =====  ======  =======  =========  =======  ==============  ==========
+.. list-table::
+   :header-rows: 1
+   :widths: 20 8 8 8 8 8 8 8 8 8 8
+
+   * - Implementation
+     - Volta
+     - Turing
+     - Ampere
+     - Ada
+     - Hopper
+     - AMD GPU
+     - Intel GPU
+     - x86 CPU
+     - AWS Inferentia
+     - Google TPU
+   * - AWQ
+     - ✗
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✗
+     - ✗
+     - ✗
+     - ✗
+     - ✗
+   * - GPTQ
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✗
+     - ✗
+     - ✗
+     - ✗
+     - ✗
+   * - Marlin (GPTQ/AWQ/FP8)
+     - ✗
+     - ✗
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✗
+     - ✗
+     - ✗
+     - ✗
+     - ✗
+   * - INT8 (W8A8)
+     - ✗
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✗
+     - ✗
+     - ✗
+     - ✗
+     - ✗
+   * - FP8 (W8A8)
+     - ✗
+     - ✗
+     - ✗
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✗
+     - ✗
+     - ✗
+     - ✗
+   * - AQLM
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✗
+     - ✗
+     - ✗
+     - ✗
+     - ✗
+   * - bitsandbytes
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✗
+     - ✗
+     - ✗
+     - ✗
+     - ✗
+   * - DeepSpeedFP
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✗
+     - ✗
+     - ✗
+     - ✗
+     - ✗
+   * - GGUF
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✗
+     - ✗
+     - ✗
+     - ✗
+     - ✗
+   * - SqueezeLLM
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✅︎
+     - ✗
+     - ✗
+     - ✗
+     - ✗
+     - ✗

 Notes:
 ^^^^^^

 - Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
- "✅" indicates that the quantization method is supported on the specified hardware.
- "❌" indicates that the quantization method is not supported on the specified hardware.
+- "✅︎" indicates that the quantization method is supported on the specified hardware.
+- "✗" indicates that the quantization method is not supported on the specified hardware.

 Please note that this compatibility chart may be subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.


--- a/docs/source/serving/integrations.rst
+++ b/docs/source/serving/integrations.rst
@@ -12,3 +12,4 @@ Integrations
   deploying_with_lws
   deploying_with_dstack
   serving_with_langchain
+   serving_with_llamaindex
--- a/docs/source/serving/serving_with_llamaindex.rst
+++ b/docs/source/serving/serving_with_llamaindex.rst
+.. _run_on_llamaindex:
+
+Serving with llama_index
+============================
+
+vLLM is also available via `llama_index <https://github.com/run-llama/llama_index>`_.
+
+To install the llama_index integration for vLLM, run:
+
+.. code-block:: console
+
+    $ pip install llama-index-llms-vllm -q
+
+To run inference on a single or multiple GPUs, use the ``Vllm`` class from ``llama_index``.
+
+.. code-block:: python
+
+    from llama_index.llms.vllm import Vllm
+
+    llm = Vllm(
+        model="microsoft/Orca-2-7b",
+        tensor_parallel_size=4,
+        max_new_tokens=100,
+        vllm_kwargs={"swap_space": 1, "gpu_memory_utilization": 0.5},
+    )
+
+Please refer to this `Tutorial <https://docs.llamaindex.ai/en/latest/examples/llm/vllm/>`_ for more details.
--- a/examples/gguf_inference.py
+++ b/examples/gguf_inference.py
+from huggingface_hub import hf_hub_download
+
+from vllm import LLM, SamplingParams
+
+
+def run_gguf_inference(model_path):
+    PROMPT_TEMPLATE = "<|system|>\n{system_message}</s>\n<|user|>\n{prompt}</s>\n<|assistant|>\n"  # noqa: E501
+    system_message = "You are a friendly chatbot who always responds in the style of a pirate."  # noqa: E501
+    # Sample prompts.
+    prompts = [
+        "How many helicopters can a human eat in one sitting?",
+        "What's the future of AI?",
+    ]
+    prompts = [
+        PROMPT_TEMPLATE.format(system_message=system_message, prompt=prompt)
+        for prompt in prompts
+    ]
+    # Create a sampling params object.
+    sampling_params = SamplingParams(temperature=0, max_tokens=128)
+
+    # Create an LLM.
+    llm = LLM(model=model_path,
+              tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+              gpu_memory_utilization=0.95)
+
+    outputs = llm.generate(prompts, sampling_params)
+    # Print the outputs.
+    for output in outputs:
+        prompt = output.prompt
+        generated_text = output.outputs[0].text
+        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+
+
+if __name__ == "__main__":
+    repo_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
+    filename = "tinyllama-1.1b-chat-v1.0.Q4_0.gguf"
+    model = hf_hub_download(repo_id, filename=filename)
+    run_gguf_inference(model)
--- a/examples/offline_inference_audio_language.py
+++ b/examples/offline_inference_audio_language.py
+"""
+This example shows how to use vLLM for running offline inference 
+with the correct prompt format on audio language models.
+
+For most models, the prompt format should follow the corresponding examples
+on the HuggingFace model repository.
+"""
+from transformers import AutoTokenizer
+
+from vllm import LLM, SamplingParams
+from vllm.assets.audio import AudioAsset
+from vllm.utils import FlexibleArgumentParser
+
+# Input audio and question
+audio_and_sample_rate = AudioAsset("mary_had_lamb").audio_and_sample_rate
+question = "What is recited in the audio?"
+
+
+# Ultravox 0.3
+def run_ultravox(question):
+    model_name = "fixie-ai/ultravox-v0_3"
+
+    tokenizer = AutoTokenizer.from_pretrained(model_name)
+    messages = [{
+        'role': 'user',
+        'content': f"<|reserved_special_token_0|>\n{question}"
+    }]
+    prompt = tokenizer.apply_chat_template(messages,
+                                           tokenize=False,
+                                           add_generation_prompt=True)
+
+    llm = LLM(model=model_name)
+    stop_token_ids = None
+    return llm, prompt, stop_token_ids
+
+
+model_example_map = {
+    "ultravox": run_ultravox,
+}
+
+
+def main(args):
+    model = args.model_type
+    if model not in model_example_map:
+        raise ValueError(f"Model type {model} is not supported.")
+
+    llm, prompt, stop_token_ids = model_example_map[model](question)
+
+    # We set temperature to 0.2 so that outputs can be different
+    # even when all prompts are identical when running batch inference.
+    sampling_params = SamplingParams(temperature=0.2,
+                                     max_tokens=64,
+                                     stop_token_ids=stop_token_ids)
+
+    assert args.num_prompts > 0
+    if args.num_prompts == 1:
+        # Single inference
+        inputs = {
+            "prompt": prompt,
+            "multi_modal_data": {
+                "audio": audio_and_sample_rate
+            },
+        }
+
+    else:
+        # Batch inference
+        inputs = [{
+            "prompt": prompt,
+            "multi_modal_data": {
+                "audio": audio_and_sample_rate
+            },
+        } for _ in range(args.num_prompts)]
+
+    outputs = llm.generate(inputs, sampling_params=sampling_params)
+
+    for o in outputs:
+        generated_text = o.outputs[0].text
+        print(generated_text)
+
+
+if __name__ == "__main__":
+    parser = FlexibleArgumentParser(
+        description='Demo on using vLLM for offline inference with '
+        'audio language models')
+    parser.add_argument('--model-type',
+                        '-m',
+                        type=str,
+                        default="ultravox",
+                        choices=model_example_map.keys(),
+                        help='Huggingface "model_type".')
+    parser.add_argument('--num-prompts',
+                        type=int,
+                        default=1,
+                        help='Number of prompts to run.')
+
+    args = parser.parse_args()
+    main(args)
--- a/examples/offline_inference_chat.py
+++ b/examples/offline_inference_chat.py
+from vllm import LLM, SamplingParams
+
+llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
+sampling_params = SamplingParams(temperature=0.5)
+
+
+def print_outputs(outputs):
+    for output in outputs:
+        prompt = output.prompt
+        generated_text = output.outputs[0].text
+        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+    print("-" * 80)
+
+
+print("=" * 80)
+
+# In this script, we demonstrate how to pass input to the chat method:
+
+conversation = [
+    {
+        "role": "system",
+        "content": "You are a helpful assistant"
+    },
+    {
+        "role": "user",
+        "content": "Hello"
+    },
+    {
+        "role": "assistant",
+        "content": "Hello! How can I assist you today?"
+    },
+    {
+        "role": "user",
+        "content": "Write an essay about the importance of higher education.",
+    },
+]
+outputs = llm.chat(conversation,
+                   sampling_params=sampling_params,
+                   use_tqdm=False)
+print_outputs(outputs)
+
+# A chat template can be optionally supplied.
+# If not, the model will use its default chat template.
+
+# with open('template_falcon_180b.jinja', "r") as f:
+#     chat_template = f.read()
+
+# outputs = llm.chat(
+#     conversation,
+#     sampling_params=sampling_params,
+#     use_tqdm=False,
+#     chat_template=chat_template,
+# )
--- a/examples/offline_inference_encoder_decoder.py
+++ b/examples/offline_inference_encoder_decoder.py
+'''
+Demonstrate prompting of text-to-text
+encoder/decoder models, specifically BART
+'''
+
+from vllm import LLM, SamplingParams
+from vllm.inputs import (ExplicitEncoderDecoderPrompt, TextPrompt,
+                         TokensPrompt, zip_enc_dec_prompts)
+
+dtype = "float"
+
+# Create a BART encoder/decoder model instance
+llm = LLM(
+    model="facebook/bart-large-cnn",
+    dtype=dtype,
+)
+
+# Get BART tokenizer
+tokenizer = llm.llm_engine.get_tokenizer_group()
+
+# Test prompts
+#
+# This section shows all of the valid ways to prompt an
+# encoder/decoder model.
+#
+# - Helpers for building prompts
+text_prompt_raw = "Hello, my name is"
+text_prompt = TextPrompt(prompt="The president of the United States is")
+tokens_prompt = TokensPrompt(prompt_token_ids=tokenizer.encode(
+    prompt="The capital of France is"))
+# - Pass a single prompt to encoder/decoder model
+#   (implicitly encoder input prompt);
+#   decoder input prompt is assumed to be None
+
+single_text_prompt_raw = text_prompt_raw  # Pass a string directly
+single_text_prompt = text_prompt  # Pass a TextPrompt
+single_tokens_prompt = tokens_prompt  # Pass a TokensPrompt
+
+# - Pass explicit encoder and decoder input prompts within one data structure.
+#   Encoder and decoder prompts can both independently be text or tokens, with
+#   no requirement that they be the same prompt type. Some example prompt-type
+#   combinations are shown below; note that these are not exhaustive.
+
+enc_dec_prompt1 = ExplicitEncoderDecoderPrompt(
+    # Pass encoder prompt string directly, &
+    # pass decoder prompt tokens
+    encoder_prompt=single_text_prompt_raw,
+    decoder_prompt=single_tokens_prompt,
+)
+enc_dec_prompt2 = ExplicitEncoderDecoderPrompt(
+    # Pass TextPrompt to encoder, and
+    # pass decoder prompt string directly
+    encoder_prompt=single_text_prompt,
+    decoder_prompt=single_text_prompt_raw,
+)
+enc_dec_prompt3 = ExplicitEncoderDecoderPrompt(
+    # Pass encoder prompt tokens directly, and
+    # pass TextPrompt to decoder
+    encoder_prompt=single_tokens_prompt,
+    decoder_prompt=single_text_prompt,
+)
+
+# - Finally, here's a useful helper function for zipping encoder and
+#   decoder prompts together into a list of ExplicitEncoderDecoderPrompt
+#   instances
+zipped_prompt_list = zip_enc_dec_prompts(
+    ['An encoder prompt', 'Another encoder prompt'],
+    ['A decoder prompt', 'Another decoder prompt'])
+
+# - Let's put all of the above example prompts together into one list
+#   which we will pass to the encoder/decoder LLM.
+prompts = [
+    single_text_prompt_raw, single_text_prompt, single_tokens_prompt,
+    enc_dec_prompt1, enc_dec_prompt2, enc_dec_prompt3
+] + zipped_prompt_list
+
+print(prompts)
+
+# Create a sampling params object.
+sampling_params = SamplingParams(
+    temperature=0,
+    top_p=1.0,
+    min_tokens=0,
+    max_tokens=20,
+)
+
+# Generate output tokens from the prompts. The output is a list of
+# RequestOutput objects that contain the prompt, generated
+# text, and other information.
+outputs = llm.generate(prompts, sampling_params)
+
+# Print the outputs.
+for output in outputs:
+    prompt = output.prompt
+    encoder_prompt = output.encoder_prompt
+    generated_text = output.outputs[0].text
+    print(f"Encoder prompt: {encoder_prompt!r}, "
+          f"Decoder prompt: {prompt!r}, "
+          f"Generated text: {generated_text!r}")
--- a/examples/offline_inference_openai.md
+++ b/examples/offline_inference_openai.md
@@ -10,7 +10,7 @@
 
 Each line represents a separate request. See the [OpenAI package reference](https://platform.openai.com/docs/api-reference/batch/requestInput) for more details.
 
- **NOTE:** We currently only support to `/v1/chat/completions` endpoint (embeddings and completions coming soon).
+ **NOTE:** We currently only support the `/v1/chat/completions` and `/v1/embeddings` endpoints (completions coming soon).
 
 ## Pre-requisites
 
@@ -21,7 +21,7 @@
  - Get access to the gated model by [visiting the model card](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and agreeing to the terms and conditions.
 
 
- ## Example: Running with a local file
+ ## Example 1: Running with a local file
 
 ### Step 1: Create your batch file
 
@@ -54,7 +54,7 @@ python -m vllm.entrypoints.openai.run_batch -i openai_example_batch.jsonl -o res
 You should now have your results at `results.jsonl`. You can check your results by running `cat results.jsonl`

 ```
-$ cat ../results.jsonl
+$ cat results.jsonl
 {"id":"vllm-383d1c59835645aeb2e07d004d62a826","custom_id":"request-1","response":{"id":"cmpl-61c020e54b964d5a98fa7527bfcdd378","object":"chat.completion","created":1715633336,"model":"meta-llama/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"Hello! It's great to meet you! I'm here to help with any questions or tasks you may have. What's on your mind today?"},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":25,"total_tokens":56,"completion_tokens":31}},"error":null}
 {"id":"vllm-42e3d09b14b04568afa3f1797751a267","custom_id":"request-2","response":{"id":"cmpl-f44d049f6b3a42d4b2d7850bb1e31bcc","object":"chat.completion","created":1715633336,"model":"meta-llama/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"*silence*"},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":27,"total_tokens":32,"completion_tokens":5}},"error":null}
 ```
@@ -107,7 +107,7 @@ aws s3 cp openai_example_batch.jsonl s3://MY_BUCKET/MY_INPUT_FILE.jsonl
  
 ### Step 2: Generate your presigned urls

-Presigned put urls can only be generated via the SDK. You can run the following python script to generate your presigned urls. Be sure to replace the `MY_BUCKET`, `MY_INPUT_FILE.jsonl`, and `MY_OUTPUT_FILE.jsonl` placeholders with your bucket and file names.
+Presigned urls can only be generated via the SDK. You can run the following python script to generate your presigned urls. Be sure to replace the `MY_BUCKET`, `MY_INPUT_FILE.jsonl`, and `MY_OUTPUT_FILE.jsonl` placeholders with your bucket and file names.

 (The script is adapted from https://github.com/awsdocs/aws-doc-sdk-examples/blob/main/python/example_code/s3/s3_basics/presigned_url.py)

@@ -170,3 +170,36 @@ Your results are now on S3. You can view them in your terminal by running
 ```
 aws s3 cp s3://MY_BUCKET/MY_OUTPUT_FILE.jsonl -
 ```
+
+## Example 4: Using embeddings endpoint
+
+### Additional prerequisites
+
+* Ensure you are using `vllm >= 0.5.5`.
+
+### Step 1: Create your batch file
+
+Add embedding requests to your batch file. The following is an example:
+
+```
+{"custom_id": "request-1", "method": "POST", "url": "/v1/embeddings", "body": {"model": "intfloat/e5-mistral-7b-instruct", "input": "You are a helpful assistant."}}
+{"custom_id": "request-2", "method": "POST", "url": "/v1/embeddings", "body": {"model": "intfloat/e5-mistral-7b-instruct", "input": "You are an unhelpful assistant."}}
+```
+
+You can even mix chat completion and embedding requests in the batch file, as long as the model you are using supports both chat completion and embeddings (note that all requests must use the same model).
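
A batch file like the one in Step 1 could also be generated programmatically. The following is a minimal sketch, not part of vLLM: the helper name `write_embedding_batch` is hypothetical, and the request fields simply mirror the example lines above.

```python
import json
from pathlib import Path


def write_embedding_batch(path, model, inputs):
    """Write one /v1/embeddings request per line (JSONL).

    Hypothetical helper for illustration; the field names follow the
    batch-file example shown in this document.
    """
    lines = []
    for i, text in enumerate(inputs, start=1):
        request = {
            "custom_id": f"request-{i}",  # must be unique per request
            "method": "POST",
            "url": "/v1/embeddings",
            "body": {"model": model, "input": text},
        }
        lines.append(json.dumps(request))
    # One JSON object per line, with a trailing newline at the end of the file.
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)
```

Calling `write_embedding_batch("openai_example_batch.jsonl", "intfloat/e5-mistral-7b-instruct", ["You are a helpful assistant."])` would produce a one-request batch file that can be passed to the same `run_batch` command via `-i`.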
+
+
+ ### Step 2: Run the batch
+
+You can run the batch using the same command as in earlier examples.
+
+
+### Step 3: Check your results
+
+You can check your results by running `cat results.jsonl`
+
+```
+$ cat results.jsonl
+{"id":"vllm-db0f71f7dec244e6bce530e0b4ef908b","custom_id":"request-1","response":{"status_code":200,"request_id":"vllm-batch-3580bf4d4ae54d52b67eee266a6eab20","body":{"id":"embd-33ac2efa7996430184461f2e38529746","object":"list","created":444647,"model":"intfloat/e5-mistral-7b-instruct","data":[{"index":0,"object":"embedding","embedding":[0.016204833984375,0.0092010498046875,0.0018358230590820312,-0.0028228759765625,0.001422882080078125,-0.0031147003173828125,...]}],"usage":{"prompt_tokens":8,"total_tokens":8,"completion_tokens":0}}},"error":null}
+...
+```
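
To work with the embeddings rather than just print them, the results file can be parsed line by line. The sketch below assumes the record layout seen in the sample output above (`custom_id`, `response.body.data[0].embedding`, and a top-level `error` field); the function name `read_embeddings` is hypothetical and not part of vLLM.

```python
import json


def read_embeddings(lines):
    """Map custom_id -> embedding vector; collect failed requests separately.

    Assumes each line is a JSON record shaped like the sample output in
    this document. Hypothetical helper for illustration.
    """
    embeddings, errors = {}, {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("error") is not None:
            errors[record["custom_id"]] = record["error"]
            continue
        body = record["response"]["body"]
        embeddings[record["custom_id"]] = body["data"][0]["embedding"]
    return embeddings, errors
```

For example, `with open("results.jsonl") as f: embeddings, errors = read_embeddings(f)` would return one vector per successful request, keyed by the `custom_id` set in the batch file.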
--- a/examples/offline_inference_vision_language.py
+++ b/examples/offline_inference_vision_language.py
@@ -22,8 +22,8 @@ def run_llava(question):
    prompt = f"USER: <image>\n{question}\nASSISTANT:"

    llm = LLM(model="llava-hf/llava-1.5-7b-hf")
-
-    return llm, prompt
+    stop_token_ids = None
+    return llm, prompt, stop_token_ids


 # LLaVA-1.6/LLaVA-NeXT
@@ -31,8 +31,8 @@ def run_llava_next(question):

    prompt = f"[INST] <image>\n{question} [/INST]"
    llm = LLM(model="llava-hf/llava-v1.6-mistral-7b-hf")
-
-    return llm, prompt
+    stop_token_ids = None
+    return llm, prompt, stop_token_ids


 # Fuyu
@@ -40,8 +40,8 @@ def run_fuyu(question):

    prompt = f"{question}\n"
    llm = LLM(model="adept/fuyu-8b")
-
-    return llm, prompt
+    stop_token_ids = None
+    return llm, prompt, stop_token_ids


 # Phi-3-Vision
@@ -59,7 +59,8 @@ def run_phi3v(question):
        trust_remote_code=True,
        max_num_seqs=5,
    )
-    return llm, prompt
+    stop_token_ids = None
+    return llm, prompt, stop_token_ids


 # PaliGemma
@@ -68,8 +69,8 @@ def run_paligemma(question):
    # PaliGemma has special prompt format for VQA
    prompt = "caption en"
    llm = LLM(model="google/paligemma-3b-mix-224")
-
-    return llm, prompt
+    stop_token_ids = None
+    return llm, prompt, stop_token_ids


 # Chameleon
@@ -77,7 +78,8 @@ def run_chameleon(question):

    prompt = f"{question}<image>"
    llm = LLM(model="facebook/chameleon-7b")
-    return llm, prompt
+    stop_token_ids = None
+    return llm, prompt, stop_token_ids


 # MiniCPM-V
@@ -89,13 +91,26 @@ def run_minicpmv(question):
    # model_name = "HwwwH/MiniCPM-V-2"

    # 2.5
-    model_name = "openbmb/MiniCPM-Llama3-V-2_5"
+    # model_name = "openbmb/MiniCPM-Llama3-V-2_5"
+
+    # 2.6
+    model_name = "openbmb/MiniCPM-V-2_6"
    tokenizer = AutoTokenizer.from_pretrained(model_name,
                                              trust_remote_code=True)
    llm = LLM(
        model=model_name,
        trust_remote_code=True,
    )
+    # NOTE The stop_token_ids are different for various versions of MiniCPM-V
+    # 2.0
+    # stop_token_ids = [tokenizer.eos_id]
+
+    # 2.5
+    # stop_token_ids = [tokenizer.eos_id, tokenizer.eot_id]
+
+    # 2.6
+    stop_tokens = ['<|im_end|>', '<|endoftext|>']
+    stop_token_ids = [tokenizer.convert_tokens_to_ids(i) for i in stop_tokens]

    messages = [{
        'role': 'user',
@@ -104,21 +119,33 @@ def run_minicpmv(question):
    prompt = tokenizer.apply_chat_template(messages,
                                           tokenize=False,
                                           add_generation_prompt=True)
-    return llm, prompt
+    return llm, prompt, stop_token_ids


 # InternVL
 def run_internvl(question):
-    # Generally, InternVL can use chatml template for conversation
-    TEMPLATE = "<|im_start|>User\n{prompt}<|im_end|>\n<|im_start|>Assistant\n"
-    prompt = f"<image>\n{question}\n"
-    prompt = TEMPLATE.format(prompt=prompt)
+    model_name = "OpenGVLab/InternVL2-2B"
+
    llm = LLM(
-        model="OpenGVLab/InternVL2-4B",
+        model=model_name,
        trust_remote_code=True,
        max_num_seqs=5,
    )
-    return llm, prompt
+
+    tokenizer = AutoTokenizer.from_pretrained(model_name,
+                                              trust_remote_code=True)
+    messages = [{'role': 'user', 'content': f"<image>\n{question}"}]
+    prompt = tokenizer.apply_chat_template(messages,
+                                           tokenize=False,
+                                           add_generation_prompt=True)
+
+    # Stop tokens for InternVL.
+    # Model variants may have different stop tokens;
+    # please refer to the model card for the correct "stop words":
+    # https://huggingface.co/OpenGVLab/InternVL2-2B#service
+    stop_tokens = ["<|endoftext|>", "<|im_start|>", "<|im_end|>", "<|end|>"]
+    stop_token_ids = [tokenizer.convert_tokens_to_ids(i) for i in stop_tokens]
+    return llm, prompt, stop_token_ids


 # BLIP-2
@@ -128,7 +155,8 @@ def run_blip2(question):
    # See https://huggingface.co/Salesforce/blip2-opt-2.7b/discussions/15#64ff02f3f8cf9e4f5b038262 #noqa
    prompt = f"Question: {question} Answer:"
    llm = LLM(model="Salesforce/blip2-opt-2.7b")
-    return llm, prompt
+    stop_token_ids = None
+    return llm, prompt, stop_token_ids


 model_example_map = {
@@ -149,11 +177,13 @@ def main(args):
    if model not in model_example_map:
        raise ValueError(f"Model type {model} is not supported.")

-    llm, prompt = model_example_map[model](question)
+    llm, prompt, stop_token_ids = model_example_map[model](question)

    # We set temperature to 0.2 so that outputs can be different
    # even when all prompts are identical when running batch inference.
-    sampling_params = SamplingParams(temperature=0.2, max_tokens=64)
+    sampling_params = SamplingParams(temperature=0.2,
+                                     max_tokens=64,
+                                     stop_token_ids=stop_token_ids)

    assert args.num_prompts > 0
    if args.num_prompts == 1:

--- a/examples/openai_audio_api_client.py
+++ b/examples/openai_audio_api_client.py
+"""An example showing how to use vLLM to serve audio language models.
+
+Launch the vLLM server with the following command:
+vllm serve fixie-ai/ultravox-v0_3
+"""
+import base64
+
+import requests
+from openai import OpenAI
+
+from vllm.assets.audio import AudioAsset
+
+# Modify OpenAI's API key and API base to use vLLM's API server.
+openai_api_key = "EMPTY"
+openai_api_base = "http://localhost:8000/v1"
+
+client = OpenAI(
+    # defaults to os.environ.get("OPENAI_API_KEY")
+    api_key=openai_api_key,
+    base_url=openai_api_base,
+)
+
+models = client.models.list()
+model = models.data[0].id
+
+# Any format supported by librosa is supported
+audio_url = AudioAsset("winning_call").url
+
+# Use audio url in the payload
+chat_completion_from_url = client.chat.completions.create(
+    messages=[{
+        "role":
+        "user",
+        "content": [
+            {
+                "type": "text",
+                "text": "What's in this audio?"
+            },
+            {
+                "type": "audio_url",
+                "audio_url": {
+                    "url": audio_url
+                },
+            },
+        ],
+    }],
+    model=model,
+    max_tokens=64,
+)
+
+result = chat_completion_from_url.choices[0].message.content
+print(f"Chat completion output:{result}")
+
+
+# Use base64 encoded audio in the payload
+def encode_audio_base64_from_url(audio_url: str) -> str:
+    """Encode an audio retrieved from a remote url to base64 format."""
+
+    with requests.get(audio_url) as response:
+        response.raise_for_status()
+        result = base64.b64encode(response.content).decode('utf-8')
+
+    return result
+
+
+audio_base64 = encode_audio_base64_from_url(audio_url=audio_url)
+chat_completion_from_base64 = client.chat.completions.create(
+    messages=[{
+        "role":
+        "user",
+        "content": [
+            {
+                "type": "text",
+                "text": "What's in this audio?"
+            },
+            {
+                "type": "audio_url",
+                "audio_url": {
+                    # Any format supported by librosa is supported
+                    "url": f"data:audio/ogg;base64,{audio_base64}"
+                },
+            },
+        ],
+    }],
+    model=model,
+    max_tokens=64,
+)
+
+result = chat_completion_from_base64.choices[0].message.content
+print(f"Chat completion output:{result}")
--- a/examples/production_monitoring/Otel.md
+++ b/examples/production_monitoring/Otel.md
@@ -3,10 +3,10 @@
 1. Install OpenTelemetry packages:
    ```
    pip install \
-        opentelemetry-sdk \
-        opentelemetry-api \
-        opentelemetry-exporter-otlp \
-        opentelemetry-semantic-conventions-ai
+      'opentelemetry-sdk>=1.26.0,<1.27.0' \
+      'opentelemetry-api>=1.26.0,<1.27.0' \
+      'opentelemetry-exporter-otlp>=1.26.0,<1.27.0' \
+      'opentelemetry-semantic-conventions-ai>=0.4.1,<0.5.0'
    ```

 1. Start Jaeger in a docker container:

--- a/format.sh
+++ b/format.sh
@@ -102,7 +102,6 @@ mypy vllm/attention --follow-imports skip
 mypy vllm/core --follow-imports skip
 mypy vllm/distributed --follow-imports skip
 mypy vllm/engine  --follow-imports skip
-mypy vllm/entrypoints --follow-imports skip
 mypy vllm/executor --follow-imports skip
 mypy vllm/lora --follow-imports skip
 mypy vllm/model_executor  --follow-imports skip
@@ -242,6 +241,11 @@ echo 'vLLM isort: Done'
 # NOTE: Keep up to date with .github/workflows/clang-format.yml
 CLANG_FORMAT_EXCLUDES=(
    'csrc/moe/topk_softmax_kernels.cu'
+    'csrc/quantization/gguf/ggml-common.h'
+    'csrc/quantization/gguf/dequantize.cuh'
+    'csrc/quantization/gguf/vecdotq.cuh'
+    'csrc/quantization/gguf/mmq.cuh'
+    'csrc/quantization/gguf/mmvq.cuh'
 )

 # Format specified files with clang-format