Merge tag 'v0.6.6.post1' into v0.6.6.post1-dev

Commit 96ae75ad authored Jan 04, 2025 by zhuwenwen
20 changed files
--- a/docs/source/models/generative_models.md
+++ b/docs/source/models/generative_models.md
+(generative-models)=
+
+# Generative Models
+
+vLLM provides first-class support for generative models, which covers most LLMs.
+
+In vLLM, generative models implement the {class}`~vllm.model_executor.models.VllmModelForTextGeneration` interface.
+Based on the final hidden states of the input, these models output log probabilities of the tokens to generate,
+which are then passed through {class}`~vllm.model_executor.layers.Sampler` to obtain the final text.
+
+## Offline Inference
+
+The {class}`~vllm.LLM` class provides various methods for offline inference.
+See [Engine Arguments](#engine-args) for a list of options when initializing the model.
+
+For generative models, the only supported {code}`task` option is {code}`"generate"`.
+Usually, this is automatically inferred so you don't have to specify it.
+
+### `LLM.generate`
+
+The {class}`~vllm.LLM.generate` method is available to all generative models in vLLM.
+It is similar to [its counterpart in HF Transformers](https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationMixin.generate),
+except that tokenization and detokenization are also performed automatically.
+
+```python
+llm = LLM(model="facebook/opt-125m")
+outputs = llm.generate("Hello, my name is")
+
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+```
+
+You can optionally control the language generation by passing {class}`~vllm.SamplingParams`.
+For example, you can use greedy sampling by setting {code}`temperature=0`:
+
+```python
+llm = LLM(model="facebook/opt-125m")
+params = SamplingParams(temperature=0)
+outputs = llm.generate("Hello, my name is", params)
+
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+```
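To illustrate why `temperature=0` amounts to greedy decoding, here is a minimal sketch. It is not vLLM's sampler: the function name `sample_token` and the toy log probabilities are hypothetical, and it assumes that a temperature of zero is handled as a special case that picks the most likely token.

```python
import math
import random


def sample_token(logprobs, temperature, rng=random):
    """Pick a token index from log probabilities at a given temperature."""
    if temperature == 0:
        # Greedy decoding: always take the most likely token.
        return max(range(len(logprobs)), key=lambda i: logprobs[i])
    # Scale the log probabilities, then renormalize with a stable softmax.
    scaled = [lp / temperature for lp in logprobs]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logprobs)), weights=weights, k=1)[0]


toy_logprobs = [math.log(0.1), math.log(0.7), math.log(0.2)]
print(sample_token(toy_logprobs, temperature=0))
```

With a temperature of zero the result is deterministic (index 1 here); any positive temperature draws from the rescaled distribution instead.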
+
+A code example can be found here: <gh-file:examples/offline_inference.py>
+
+### `LLM.beam_search`
+
+The {class}`~vllm.LLM.beam_search` method implements [beam search](https://huggingface.co/docs/transformers/en/generation_strategies#beam-search-decoding) on top of {class}`~vllm.LLM.generate`.
+For example, to search using 5 beams and output at most 50 tokens:
+
+```python
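+from vllm import LLM
+from vllm.sampling_params import BeamSearchParams
+
+llm = LLM(model="facebook/opt-125m")
+params = BeamSearchParams(beam_width=5, max_tokens=50)
+# Beam search has its own method and takes a list of prompts.
+outputs = llm.beam_search([{"prompt": "Hello, my name is"}], params)
+
+for output in outputs:
+    # Each output holds the candidate sequences found for one prompt.
+    generated_text = output.sequences[0].text
+    print(f"Generated text: {generated_text!r}")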
+```
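For readers unfamiliar with the technique, the following is a toy sketch of beam search itself, independent of vLLM. The `toy_model` table and the function names are hypothetical stand-ins for a real language model; the point is only to show how keeping several partial sequences can beat a greedy choice.

```python
import math


def beam_search(next_logprobs, beam_width, max_tokens):
    """Return the `beam_width` best sequences as (tokens, cumulative logprob)."""
    beams = [((), 0.0)]
    for _ in range(max_tokens):
        candidates = []
        for tokens, score in beams:
            # Extend every surviving beam by every possible next token.
            for token, logprob in next_logprobs(tokens).items():
                candidates.append((tokens + (token,), score + logprob))
        # Keep only the highest-scoring partial sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams


def toy_model(tokens):
    """Hypothetical stand-in for a language model: a fixed next-token table."""
    if not tokens:
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if tokens[-1] == "a":
        return {"a": math.log(0.5), "b": math.log(0.5)}
    return {"a": math.log(0.1), "b": math.log(0.9)}


best = beam_search(toy_model, beam_width=2, max_tokens=2)
print(best[0][0])
```

With two beams the search returns `('b', 'b')` (probability 0.36), whereas a single beam commits to `"a"` first and can reach at most 0.30.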
+
+### `LLM.chat`
+
+The {class}`~vllm.LLM.chat` method implements chat functionality on top of {class}`~vllm.LLM.generate`.
+In particular, it accepts input similar to [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat)
+and automatically applies the model's [chat template](https://huggingface.co/docs/transformers/en/chat_templating) to format the prompt.
+
+```{important}
+In general, only instruction-tuned models have a chat template.
+Base models may perform poorly as they are not trained to respond to the chat conversation.
+```
+
+```python
+llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
+conversation = [
+    {
+        "role": "system",
+        "content": "You are a helpful assistant"
+    },
+    {
+        "role": "user",
+        "content": "Hello"
+    },
+    {
+        "role": "assistant",
+        "content": "Hello! How can I assist you today?"
+    },
+    {
+        "role": "user",
+        "content": "Write an essay about the importance of higher education.",
+    },
+]
+outputs = llm.chat(conversation)
+
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+```
+
+A code example can be found here: <gh-file:examples/offline_inference_chat.py>
+
+If the model doesn't have a chat template or you want to specify another one,
+you can explicitly pass a chat template:
+
+```python
+from vllm.entrypoints.chat_utils import load_chat_template
+
+# You can find a list of existing chat templates under `examples/`
+custom_template = load_chat_template(chat_template="<path_to_template>")
+print("Loaded chat template:", custom_template)
+
+outputs = llm.chat(conversation, chat_template=custom_template)
+```
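As a rough illustration of what applying a chat template means, the sketch below flattens a conversation into a single prompt string. The marker format (`<|role|>`) and the function name are hypothetical; real templates are Jinja files shipped with the model and differ from model to model.

```python
def apply_toy_chat_template(conversation, add_generation_prompt=True):
    """Flatten chat messages into one prompt string (hypothetical format)."""
    parts = [f"<|{m['role']}|>\n{m['content']}\n" for m in conversation]
    if add_generation_prompt:
        # Cue the model to answer as the assistant.
        parts.append("<|assistant|>\n")
    return "".join(parts)


conversation = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello"},
]
prompt = apply_toy_chat_template(conversation)
print(prompt)
```

A base model that never saw such markers during training has no reason to treat them as turn boundaries, which is why a template alone does not make a model chat-capable.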
+
+## Online Inference
+
+Our [OpenAI Compatible Server](../serving/openai_compatible_server.md) provides endpoints that correspond to the offline APIs:
+
+- [Completions API](#completions-api) is similar to `LLM.generate` but only accepts text.
+- [Chat API](#chat-api) is similar to `LLM.chat`, accepting both text and [multi-modal inputs](#multimodal-inputs) for models with a chat template.
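The difference between the two endpoints is easiest to see in the request bodies. The sketch below only builds the JSON payloads and assumes the standard OpenAI routes (`/v1/completions` and `/v1/chat/completions`); the model name is taken from the chat example and should be replaced with whichever model the server was launched with.

```python
import json

# Assumption: the server was started with this model.
MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"

# Counterpart of `LLM.generate`: a plain text prompt.
completion_request = {
    "model": MODEL,
    "prompt": "Hello, my name is",
    "max_tokens": 16,
}

# Counterpart of `LLM.chat`: a list of role-tagged messages.
chat_request = {
    "model": MODEL,
    "messages": [{"role": "user", "content": "Hello"}],
}

print("POST /v1/completions", json.dumps(completion_request))
print("POST /v1/chat/completions", json.dumps(chat_request))
```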
--- a/docs/source/models/generative_models.rst
+++ b/docs/source/models/generative_models.rst
-.. _generative_models:
-
-Generative Models
-=================
-
-vLLM provides first-class support for generative models, which covers most of LLMs.
-
-In vLLM, generative models implement the :class:`~vllm.model_executor.models.VllmModelForTextGeneration` interface.
-Based on the final hidden states of the input, these models output log probabilities of the tokens to generate,
-which are then passed through :class:`~vllm.model_executor.layers.Sampler` to obtain the final text.
-
-Offline Inference
-----------------
-
-The :class:`~vllm.LLM` class provides various methods for offline inference.
-See :ref:`Engine Arguments <engine_args>` for a list of options when initializing the model.
-
-For generative models, the only supported :code:`task` option is :code:`"generate"`.
-Usually, this is automatically inferred so you don't have to specify it.
-
-``LLM.generate``
-^^^^^^^^^^^^^^^^
-
-The :class:`~vllm.LLM.generate` method is available to all generative models in vLLM.
-It is similar to `its counterpart in HF Transformers <https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationMixin.generate>`__,
-except that tokenization and detokenization are also performed automatically.
-
-.. code-block:: python
-
-    llm = LLM(model="facebook/opt-125m")
-    outputs = llm.generate("Hello, my name is")
-
-    for output in outputs:
-        prompt = output.prompt
-        generated_text = output.outputs[0].text
-        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
-
-You can optionally control the language generation by passing :class:`~vllm.SamplingParams`.
-For example, you can use greedy sampling by setting :code:`temperature=0`:
-
-.. code-block:: python
-
-    llm = LLM(model="facebook/opt-125m")
-    params = SamplingParams(temperature=0)
-    outputs = llm.generate("Hello, my name is", params)
-
-    for output in outputs:
-        prompt = output.prompt
-        generated_text = output.outputs[0].text
-        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
-
-A code example can be found in `examples/offline_inference.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference.py>`_.
-
-``LLM.beam_search``
-^^^^^^^^^^^^^^^^^^^
-
-The :class:`~vllm.LLM.beam_search` method implements `beam search <https://huggingface.co/docs/transformers/en/generation_strategies#beam-search-decoding>`__ on top of :class:`~vllm.LLM.generate`.
-For example, to search using 5 beams and output at most 50 tokens:
-
-.. code-block:: python
-
-    llm = LLM(model="facebook/opt-125m")
-    params = BeamSearchParams(beam_width=5, max_tokens=50)
-    outputs = llm.generate("Hello, my name is", params)
-
-    for output in outputs:
-        prompt = output.prompt
-        generated_text = output.outputs[0].text
-        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
-
-``LLM.chat``
-^^^^^^^^^^^^
-
-The :class:`~vllm.LLM.chat` method implements chat functionality on top of :class:`~vllm.LLM.generate`.
-In particular, it accepts input similar to `OpenAI Chat Completions API <https://platform.openai.com/docs/api-reference/chat>`__
-and automatically applies the model's `chat template <https://huggingface.co/docs/transformers/en/chat_templating>`__ to format the prompt.
-
-.. important::
-
-    In general, only instruction-tuned models have a chat template.
-    Base models may perform poorly as they are not trained to respond to the chat conversation.
-
-.. code-block:: python
-
-    llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
-    conversation = [
-        {
-            "role": "system",
-            "content": "You are a helpful assistant"
-        },
-        {
-            "role": "user",
-            "content": "Hello"
-        },
-        {
-            "role": "assistant",
-            "content": "Hello! How can I assist you today?"
-        },
-        {
-            "role": "user",
-            "content": "Write an essay about the importance of higher education.",
-        },
-    ]
-    outputs = llm.chat(conversation)
-
-    for output in outputs:
-        prompt = output.prompt
-        generated_text = output.outputs[0].text
-        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
-
-A code example can be found in `examples/offline_inference_chat.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_chat.py>`_.
-
-If the model doesn't have a chat template or you want to specify another one,
-you can explicitly pass a chat template:
-
-.. code-block:: python
-
-    from vllm.entrypoints.chat_utils import load_chat_template
-
-    # You can find a list of existing chat templates under `examples/`
-    custom_template = load_chat_template(chat_template="<path_to_template>")
-    print("Loaded chat template:", custom_template)
-
-    outputs = llm.chat(conversation, chat_template=custom_template)
-
-Online Inference
----------------
-
-Our `OpenAI Compatible Server <../serving/openai_compatible_server>`__ can be used for online inference.
-Please click on the above link for more details on how to launch the server.
-
-Completions API
-^^^^^^^^^^^^^^^
-
-Our Completions API is similar to ``LLM.generate`` but only accepts text.
-It is compatible with `OpenAI Completions API <https://platform.openai.com/docs/api-reference/completions>`__
-so that you can use OpenAI client to interact with it.
-A code example can be found in `examples/openai_completion_client.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_completion_client.py>`_.
-
-Chat API
-^^^^^^^^
-
-Our Chat API is similar to ``LLM.chat``, accepting both text and :ref:`multi-modal inputs <multimodal_inputs>`.
-It is compatible with `OpenAI Chat Completions API <https://platform.openai.com/docs/api-reference/chat>`__
-so that you can use OpenAI client to interact with it.
-A code example can be found in `examples/openai_chat_completion_client.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_client.py>`_.
--- a/docs/source/models/pooling_models.md
+++ b/docs/source/models/pooling_models.md
+(pooling-models)=
+
+# Pooling Models
+
+vLLM also supports pooling models, including embedding, reranking and reward models.
+
+In vLLM, pooling models implement the {class}`~vllm.model_executor.models.VllmModelForPooling` interface.
+These models use a {class}`~vllm.model_executor.layers.Pooler` to extract the final hidden states of the input
+before returning them.
+
+```{note}
+We currently support pooling models primarily as a matter of convenience.
+As shown in the [Compatibility Matrix](#compatibility-matrix), most vLLM features are not applicable to
+pooling models because those features only apply to the generation (decode) stage, so performance may not improve as much.
+```
+
+## Offline Inference
+
+The {class}`~vllm.LLM` class provides various methods for offline inference.
+See [Engine Arguments](#engine-args) for a list of options when initializing the model.
+
+For pooling models, we support the following {code}`task` options:
+
+- Embedding ({code}`"embed"` / {code}`"embedding"`)
+- Classification ({code}`"classify"`)
+- Sentence Pair Scoring ({code}`"score"`)
+- Reward Modeling ({code}`"reward"`)
+
+The selected task determines the default {class}`~vllm.model_executor.layers.Pooler` that is used:
+
+- Embedding: Extract only the hidden states corresponding to the last token, and apply normalization.
+- Classification: Extract only the hidden states corresponding to the last token, and apply softmax.
+- Sentence Pair Scoring: Extract only the hidden states corresponding to the last token, and apply softmax.
+- Reward Modeling: Extract all of the hidden states and return them directly.
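The four defaults above can be sketched in plain Python. This is an illustration under simplifying assumptions, not the actual `Pooler` implementation: the `pool` function is hypothetical, hidden states are plain lists of floats, and any task-specific head applied before pooling is ignored.

```python
import math


def l2_normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def softmax(vector):
    peak = max(vector)
    exps = [math.exp(x - peak) for x in vector]
    total = sum(exps)
    return [e / total for e in exps]


def pool(hidden_states, task):
    """hidden_states holds one vector per input token (toy values)."""
    last = hidden_states[-1]
    if task == "embed":
        return l2_normalize(last)  # last token, normalized
    if task in ("classify", "score"):
        return softmax(last)  # last token, softmax
    if task == "reward":
        return hidden_states  # all tokens, returned as-is
    raise ValueError(f"unknown task: {task}")


hidden_states = [[1.0, 2.0], [3.0, 4.0]]
print(pool(hidden_states, "embed"))
```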
+
+When loading [Sentence Transformers](https://huggingface.co/sentence-transformers) models,
+we attempt to override the default pooler based on its Sentence Transformers configuration file ({code}`modules.json`).
+
+You can customize the model's pooling method via the {code}`override_pooler_config` option,
+which takes priority over both the model's and Sentence Transformers's defaults.
+
+### `LLM.encode`
+
+The {class}`~vllm.LLM.encode` method is available to all pooling models in vLLM.
+It returns the extracted hidden states directly, which is useful for reward models.
+
+```python
+llm = LLM(model="Qwen/Qwen2.5-Math-RM-72B", task="reward")
+(output,) = llm.encode("Hello, my name is")
+
+data = output.outputs.data
+print(f"Data: {data!r}")
+```
+
+### `LLM.embed`
+
+The {class}`~vllm.LLM.embed` method outputs an embedding vector for each prompt.
+It is primarily designed for embedding models.
+
+```python
+llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embed")
+(output,) = llm.embed("Hello, my name is")
+
+embeds = output.outputs.embedding
+print(f"Embeddings: {embeds!r} (size={len(embeds)})")
+```
+
+A code example can be found here: <gh-file:examples/offline_inference_embedding.py>
+
+### `LLM.classify`
+
+The {class}`~vllm.LLM.classify` method outputs a probability vector for each prompt.
+It is primarily designed for classification models.
+
+```python
+llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", task="classify")
+(output,) = llm.classify("Hello, my name is")
+
+probs = output.outputs.probs
+print(f"Class Probabilities: {probs!r} (size={len(probs)})")
+```
+
+A code example can be found here: <gh-file:examples/offline_inference_classification.py>
+
+### `LLM.score`
+
+The {class}`~vllm.LLM.score` method outputs similarity scores between sentence pairs.
+It is primarily designed for [cross-encoder models](https://www.sbert.net/examples/applications/cross-encoder/README.html).
+These types of models serve as rerankers between candidate query-document pairs in RAG systems.
+
+```{note}
+vLLM can only perform the model inference component (e.g. embedding, reranking) of RAG.
+To handle RAG at a higher level, you should use integration frameworks such as [LangChain](https://github.com/langchain-ai/langchain).
+```
+
+```python
+llm = LLM(model="BAAI/bge-reranker-v2-m3", task="score")
+(output,) = llm.score("What is the capital of France?",
+                      "The capital of Brazil is Brasilia.")
+
+score = output.outputs.score
+print(f"Score: {score}")
+```
+
+A code example can be found here: <gh-file:examples/offline_inference_scoring.py>
+
+## Online Inference
+
+Our [OpenAI Compatible Server](../serving/openai_compatible_server.md) provides endpoints that correspond to the offline APIs:
+
+- [Pooling API](#pooling-api) is similar to `LLM.encode`, being applicable to all types of pooling models.
+- [Embeddings API](#embeddings-api) is similar to `LLM.embed`, accepting both text and [multi-modal inputs](#multimodal-inputs) for embedding models.
+- [Score API](#score-api) is similar to `LLM.score` for cross-encoder models.
--- a/docs/source/models/pooling_models.rst
+++ b/docs/source/models/pooling_models.rst
-.. _pooling_models:
-
-Pooling Models
-==============
-
-vLLM also supports pooling models, including embedding, reranking and reward models.
-
-In vLLM, pooling models implement the :class:`~vllm.model_executor.models.VllmModelForPooling` interface.
-These models use a :class:`~vllm.model_executor.layers.Pooler` to extract the final hidden states of the input
-before returning them.
-
-.. note::
-
-    We currently support pooling models primarily as a matter of convenience.
-    As shown in the :ref:`Compatibility Matrix <compatibility_matrix>`, most vLLM features are not applicable to
-    pooling models as they only work on the generation or decode stage, so performance may not improve as much.
-
-Offline Inference
-----------------
-
-The :class:`~vllm.LLM` class provides various methods for offline inference.
-See :ref:`Engine Arguments <engine_args>` for a list of options when initializing the model.
-
-For pooling models, we support the following :code:`task` options:
-
- Embedding (:code:`"embed"` / :code:`"embedding"`)
- Classification (:code:`"classify"`)
- Sentence Pair Scoring (:code:`"score"`)
- Reward Modeling (:code:`"reward"`)
-
-The selected task determines the default :class:`~vllm.model_executor.layers.Pooler` that is used:
-
- Embedding: Extract only the hidden states corresponding to the last token, and apply normalization.
- Classification: Extract only the hidden states corresponding to the last token, and apply softmax.
- Sentence Pair Scoring: Extract only the hidden states corresponding to the last token, and apply softmax.
- Reward Modeling: Extract all of the hidden states and return them directly.
-
-When loading `Sentence Transformers <https://huggingface.co/sentence-transformers>`__ models,
-we attempt to override the default pooler based on its Sentence Transformers configuration file (:code:`modules.json`).
-
-You can customize the model's pooling method via the :code:`override_pooler_config` option,
-which takes priority over both the model's and Sentence Transformers's defaults.
-
-``LLM.encode``
-^^^^^^^^^^^^^^
-
-The :class:`~vllm.LLM.encode` method is available to all pooling models in vLLM.
-It returns the extracted hidden states directly, which is useful for reward models.
-
-.. code-block:: python
-
-    llm = LLM(model="Qwen/Qwen2.5-Math-RM-72B", task="reward")
-    (output,) = llm.encode("Hello, my name is")
-
-    data = output.outputs.data
-    print(f"Data: {data!r}")
-
-``LLM.embed``
-^^^^^^^^^^^^^
-
-The :class:`~vllm.LLM.embed` method outputs an embedding vector for each prompt.
-It is primarily designed for embedding models.
-
-.. code-block:: python
-
-    llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embed")
-    (output,) = llm.embed("Hello, my name is")
-
-    embeds = output.outputs.embedding
-    print(f"Embeddings: {embeds!r} (size={len(embeds)})")
-
-A code example can be found in `examples/offline_inference_embedding.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_embedding.py>`_.
-
-``LLM.classify``
-^^^^^^^^^^^^^^^^
-
-The :class:`~vllm.LLM.classify` method outputs a probability vector for each prompt.
-It is primarily designed for classification models.
-
-.. code-block:: python
-
-    llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", task="classify")
-    (output,) = llm.classify("Hello, my name is")
-
-    probs = output.outputs.probs
-    print(f"Class Probabilities: {probs!r} (size={len(probs)})")
-
-A code example can be found in `examples/offline_inference_classification.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_classification.py>`_.
-
-``LLM.score``
-^^^^^^^^^^^^^
-
-The :class:`~vllm.LLM.score` method outputs similarity scores between sentence pairs.
-It is primarily designed for `cross-encoder models <https://www.sbert.net/examples/applications/cross-encoder/README.html>`__.
-These types of models serve as rerankers between candidate query-document pairs in RAG systems.
-
-.. note::
-
-    vLLM can only perform the model inference component (e.g. embedding, reranking) of RAG.
-    To handle RAG at a higher level, you should use integration frameworks such as `LangChain <https://github.com/langchain-ai/langchain>`_.
-
-.. code-block:: python
-
-    llm = LLM(model="BAAI/bge-reranker-v2-m3", task="score")
-    (output,) = llm.score("What is the capital of France?",
-                          "The capital of Brazil is Brasilia.")
-
-    score = output.outputs.score
-    print(f"Score: {score}")
-
-A code example can be found in `examples/offline_inference_scoring.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_scoring.py>`_.
-
-Online Inference
----------------
-
-Our `OpenAI Compatible Server <../serving/openai_compatible_server>`__ can be used for online inference.
-Please click on the above link for more details on how to launch the server.
-
-Embeddings API
-^^^^^^^^^^^^^^
-
-Our Embeddings API is similar to ``LLM.embed``, accepting both text and :ref:`multi-modal inputs <multimodal_inputs>`.
-
-The text-only API is compatible with `OpenAI Embeddings API <https://platform.openai.com/docs/api-reference/embeddings>`__
-so that you can use OpenAI client to interact with it.
-A code example can be found in `examples/openai_embedding_client.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_embedding_client.py>`_.
-
-The multi-modal API is an extension of the `OpenAI Embeddings API <https://platform.openai.com/docs/api-reference/embeddings>`__
-that incorporates `OpenAI Chat Completions API <https://platform.openai.com/docs/api-reference/chat>`__,
-so it is not part of the OpenAI standard. Please see :ref:`this page <multimodal_inputs>` for more details on how to use it.
-
-Score API
-^^^^^^^^^
-
-Our Score API is similar to ``LLM.score``.
-Please see `this page <../serving/openai_compatible_server.html#score-api-for-cross-encoder-models>`__ for more details on how to use it.
--- a/docs/source/models/supported_models.rst
+++ b/docs/source/models/supported_models.rst
-.. _supported_models:
+(supported-models)=

-Supported Models
-================
+# Supported Models

 vLLM supports generative and pooling models across various tasks.
-If a model supports more than one task, you can set the task via the :code:`--task` argument.
+If a model supports more than one task, you can set the task via the {code}`--task` argument.

 For each task, we list the model architectures that have been implemented in vLLM.
 Alongside each architecture, we include some popular models that use it.

-Loading a Model
-^^^^^^^^^^^^^^^
+## Loading a Model

-HuggingFace Hub
-+++++++++++++++
+### HuggingFace Hub

-By default, vLLM loads models from `HuggingFace (HF) Hub <https://huggingface.co/models>`_.
+By default, vLLM loads models from [HuggingFace (HF) Hub](https://huggingface.co/models).

-To determine whether a given model is supported, you can check the :code:`config.json` file inside the HF repository.
-If the :code:`"architectures"` field contains a model architecture listed below, then it should be supported in theory.
+To determine whether a given model is supported, you can check the {code}`config.json` file inside the HF repository.
+If the {code}`"architectures"` field contains a model architecture listed below, then it should be supported in theory.
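The check described above can be scripted. The sketch below is a hypothetical helper, not part of vLLM: it parses the text of a `config.json` file and reports which of its architectures appear in a hand-copied subset of the tables on this page.

```python
import json

# Hypothetical subset of architecture names copied from the tables on this page.
SUPPORTED_ARCHITECTURES = {"Qwen2ForCausalLM", "DeepseekV3ForCausalLM", "BertModel"}


def listed_architectures(config_text):
    """Return the architectures in a config.json string that appear in the list."""
    config = json.loads(config_text)
    return [a for a in config.get("architectures", []) if a in SUPPORTED_ARCHITECTURES]


example_config = '{"architectures": ["Qwen2ForCausalLM"], "model_type": "qwen2"}'
print(listed_architectures(example_config))
```

A match only means the architecture is listed; running the model remains the definitive check.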

-.. tip::
-    The easiest way to check if your model is really supported at runtime is to run the program below:
+````{tip}
+The easiest way to check if your model is really supported at runtime is to run the program below:

-    .. code-block:: python
+```python
+from vllm import LLM

-        from vllm import LLM
+# For generative models (task=generate) only
+llm = LLM(model=..., task="generate")  # Name or path of your model
+output = llm.generate("Hello, my name is")
+print(output)

-        # For generative models (task=generate) only
-        llm = LLM(model=..., task="generate")  # Name or path of your model
-        output = llm.generate("Hello, my name is")
-        print(output)
+# For pooling models (task={embed,classify,reward,score}) only
+llm = LLM(model=..., task="embed")  # Name or path of your model
+output = llm.encode("Hello, my name is")
+print(output)
+```

-        # For pooling models (task={embed,classify,reward}) only
-        llm = LLM(model=..., task="embed")  # Name or path of your model
-        output = llm.encode("Hello, my name is")
-        print(output)
+If vLLM successfully returns text (for generative models) or hidden states (for pooling models), it indicates that your model is supported.
+````

-    If vLLM successfully returns text (for generative models) or hidden states (for pooling models), it indicates that your model is supported.
+Otherwise, please refer to [Adding a New Model](#adding-a-new-model) and [Enabling Multimodal Inputs](#enabling-multimodal-inputs) for instructions on how to implement your model in vLLM.
+Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) to request vLLM support.

-Otherwise, please refer to :ref:`Adding a New Model <adding_a_new_model>` and :ref:`Enabling Multimodal Inputs <enabling_multimodal_inputs>` 
-for instructions on how to implement your model in vLLM.
-Alternatively, you can `open an issue on GitHub <https://github.com/vllm-project/vllm/issues/new/choose>`_ to request vLLM support.
+### ModelScope

-ModelScope
-++++++++++
+To use models from [ModelScope](https://www.modelscope.cn) instead of HuggingFace Hub, set an environment variable:

-To use models from `ModelScope <https://www.modelscope.cn>`_ instead of HuggingFace Hub, set an environment variable:
+```shell
+$ export VLLM_USE_MODELSCOPE=True
+```

-.. code-block:: shell
+Then load the model with {code}`trust_remote_code=True`:

-    $ export VLLM_USE_MODELSCOPE=True
+```python
+from vllm import LLM

-And use with :code:`trust_remote_code=True`.
+llm = LLM(model=..., revision=..., task=..., trust_remote_code=True)

-.. code-block:: python
+# For generative models (task=generate) only
+output = llm.generate("Hello, my name is")
+print(output)

-    from vllm import LLM
+# For pooling models (task={embed,classify,reward,score}) only
+output = llm.encode("Hello, my name is")
+print(output)
+```

-    llm = LLM(model=..., revision=..., task=..., trust_remote_code=True)
+## List of Text-only Language Models

-    # For generative models (task=generate) only
-    output = llm.generate("Hello, my name is")
-    print(output)
+### Generative Models

-    # For pooling models (task={embed,classify,reward}) only
-    output = llm.encode("Hello, my name is")
-    print(output)
+See [this page](#generative-models) for more information on how to use generative models.

-List of Text-only Language Models
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Generative Models
-+++++++++++++++++
-
-See :ref:`this page <generative_models>` for more information on how to use generative models.
-
-Text Generation (``--task generate``)
-------------------------------------
+#### Text Generation (`--task generate`)

+```{eval-rst}
 .. list-table::
  :widths: 25 25 50 5 5
  :header-rows: 1
@@ -86,8 +80,8 @@ Text Generation (``--task generate``)
  * - Architecture
    - Models
    - Example HF Models
-    - :ref:`LoRA <lora>`
-    - :ref:`PP <distributed_serving>`
+    - :ref:`LoRA <lora-adapter>`
+    - :ref:`PP <distributed-serving>`
  * - :code:`AquilaForCausalLM`
    - Aquila, Aquila2
    - :code:`BAAI/Aquila-7B`, :code:`BAAI/AquilaChat-7B`, etc.
@@ -111,8 +105,8 @@ Text Generation (``--task generate``)
  * - :code:`BartForConditionalGeneration`
    - BART
    - :code:`facebook/bart-base`, :code:`facebook/bart-large-cnn`, etc.
-    - 
-    - 
+    -
+    -
  * - :code:`ChatGLMModel`
    - ChatGLM
    - :code:`THUDM/chatglm2-6b`, :code:`THUDM/chatglm3-6b`, etc.
@@ -136,12 +130,17 @@ Text Generation (``--task generate``)
  * - :code:`DeepseekForCausalLM`
    - DeepSeek
    - :code:`deepseek-ai/deepseek-llm-67b-base`, :code:`deepseek-ai/deepseek-llm-7b-chat` etc.
-    - 
+    -
    - ✅︎
  * - :code:`DeepseekV2ForCausalLM`
    - DeepSeek-V2
    - :code:`deepseek-ai/DeepSeek-V2`, :code:`deepseek-ai/DeepSeek-V2-Chat` etc.
-    - 
+    -
+    - ✅︎
+  * - :code:`DeepseekV3ForCausalLM`
+    - DeepSeek-V3
+    - :code:`deepseek-ai/DeepSeek-V3-Base`, :code:`deepseek-ai/DeepSeek-V3` etc.
+    -
    - ✅︎
  * - :code:`ExaoneForCausalLM`
    - EXAONE-3
@@ -194,8 +193,8 @@ Text Generation (``--task generate``)
    -
    - ✅︎
  * - :code:`GraniteForCausalLM`
-    - Granite 3.0, PowerLM
-    - :code:`ibm-granite/granite-3.0-2b-base`, :code:`ibm-granite/granite-3.0-8b-instruct`, :code:`ibm/PowerLM-3b`, etc.
+    - Granite 3.0, Granite 3.1, PowerLM
+    - :code:`ibm-granite/granite-3.0-2b-base`, :code:`ibm-granite/granite-3.1-8b-instruct`, :code:`ibm/PowerLM-3b`, etc.
    - ✅︎
    - ✅︎
  * - :code:`GraniteMoeForCausalLM`
@@ -316,7 +315,7 @@ Text Generation (``--task generate``)
  * - :code:`PersimmonForCausalLM`
    - Persimmon
    - :code:`adept/persimmon-8b-base`, :code:`adept/persimmon-8b-chat`, etc.
-    - 
+    -
    - ✅︎
  * - :code:`QWenLMHeadModel`
    - Qwen
@@ -325,7 +324,7 @@ Text Generation (``--task generate``)
    - ✅︎
  * - :code:`Qwen2ForCausalLM`
    - Qwen2
-    - :code:`Qwen/Qwen2-7B-Instruct`, :code:`Qwen/Qwen2-7B`, etc.
+    - :code:`Qwen/QwQ-32B-Preview`, :code:`Qwen/Qwen2-7B-Instruct`, :code:`Qwen/Qwen2-7B`, etc.
    - ✅︎
    - ✅︎
  * - :code:`Qwen2MoeForCausalLM`
@@ -358,29 +357,24 @@ Text Generation (``--task generate``)
    - :code:`xverse/XVERSE-7B-Chat`, :code:`xverse/XVERSE-13B-Chat`, :code:`xverse/XVERSE-65B-Chat`, etc.
    - ✅︎
    - ✅︎
+```

-.. note::
-    Currently, the ROCm version of vLLM supports Mistral and Mixtral only for context lengths up to 4096.
-
-Pooling Models
-++++++++++++++
-
-See :ref:`this page <pooling_models>` for more information on how to use pooling models.
+```{note}
+Currently, the ROCm version of vLLM supports Mistral and Mixtral only for context lengths up to 4096.
+```

-.. important::
-    Since some model architectures support both generative and pooling tasks,
-    you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode.
+### Pooling Models

-Text Embedding (``--task embed``)
---------------------------------
+See [this page](#pooling-models) for more information on how to use pooling models.

-Any text generation model can be converted into an embedding model by passing :code:`--task embed`.
+```{important}
+Since some model architectures support both generative and pooling tasks,
+you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode.
+```

-.. note::
-    To get the best results, you should use pooling models that are specifically trained as such.
-
-The following table lists those that are tested in vLLM.
+#### Text Embedding (`--task embed`)

+```{eval-rst}
 .. list-table::
  :widths: 25 25 50 5 5
  :header-rows: 1
@@ -388,17 +382,17 @@ The following table lists those that are tested in vLLM.
  * - Architecture
    - Models
    - Example HF Models
-    - :ref:`LoRA <lora>`
-    - :ref:`PP <distributed_serving>`
+    - :ref:`LoRA <lora-adapter>`
+    - :ref:`PP <distributed-serving>`
  * - :code:`BertModel`
    - BERT-based
    - :code:`BAAI/bge-base-en-v1.5`, etc.
-    - 
-    - 
+    -
+    -
  * - :code:`Gemma2Model`
    - Gemma2-based
    - :code:`BAAI/bge-multilingual-gemma2`, etc.
-    - 
+    -
    - ✅︎
  * - :code:`GritLM`
    - GritLM
@@ -418,28 +412,35 @@ The following table lists those that are tested in vLLM.
  * - :code:`RobertaModel`, :code:`RobertaForMaskedLM`
    - RoBERTa-based
    - :code:`sentence-transformers/all-roberta-large-v1`, :code:`sentence-transformers/all-roberta-large-v1`, etc.
-    - 
-    - 
+    -
+    -
  * - :code:`XLMRobertaModel`
    - XLM-RoBERTa-based
    - :code:`intfloat/multilingual-e5-large`, etc.
-    - 
-    - 
+    -
+    -
+```
+
+```{note}
+{code}`ssmits/Qwen2-7B-Instruct-embed-base` has an improperly defined Sentence Transformers config.
+You should manually set mean pooling by passing {code}`--override-pooler-config '{"pooling_type": "MEAN"}'`.
+```

-.. note::
-  :code:`ssmits/Qwen2-7B-Instruct-embed-base` has an improperly defined Sentence Transformers config.
-  You should manually set mean pooling by passing :code:`--override-pooler-config '{"pooling_type": "MEAN"}'`.
+```{note}
+Unlike base Qwen2, {code}`Alibaba-NLP/gte-Qwen2-7B-instruct` uses bi-directional attention.
+You can set {code}`--hf-overrides '{"is_causal": false}'` to change the attention mask accordingly.

-.. note::
-  Unlike base Qwen2, :code:`Alibaba-NLP/gte-Qwen2-7B-instruct` uses bi-directional attention.
-  You can set :code:`--hf-overrides '{"is_causal": false}'` to change the attention mask accordingly.
+On the other hand, its 1.5B variant ({code}`Alibaba-NLP/gte-Qwen2-1.5B-instruct`) uses causal attention
+despite being described otherwise on its model card.
+```

-  On the other hand, its 1.5B variant (:code:`Alibaba-NLP/gte-Qwen2-1.5B-instruct`) uses causal attention
-  despite being described otherwise on its model card.
+If your model is not in the above list, we will try to automatically convert the model using
+{func}`vllm.model_executor.models.adapters.as_embedding_model`. By default, the embeddings
+of the whole prompt are extracted from the normalized hidden state corresponding to the last token.
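
The default pooling described above (take the last token's hidden state, then normalize it) can be illustrated with a small standalone sketch. It does not call vLLM; the function name `last_token_embedding` and the toy inputs are hypothetical, and L2 normalization is assumed as the meaning of "normalized".

```python
import math


def last_token_embedding(hidden_states: list[list[float]]) -> list[float]:
    """Pool per-token hidden states into one prompt embedding (illustrative only)."""
    if not hidden_states:
        raise ValueError("expected at least one token")
    # Keep only the hidden state of the final token.
    last = hidden_states[-1]
    # L2-normalize it (assumption: "normalized" means unit Euclidean length).
    norm = math.sqrt(sum(x * x for x in last))
    # Guard against a zero vector to avoid dividing by zero.
    return [x / norm for x in last] if norm > 0 else list(last)


# Toy hidden states: three tokens, hidden size 2.
states = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]
print(last_token_embedding(states))  # [0.6, 0.8]
```

Only the last row contributes to the result, which is why models trained with a different pooling scheme (such as mean pooling) need an explicit `--override-pooler-config`.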

-Reward Modeling (``--task reward``)
-----------------------------------
+#### Reward Modeling (`--task reward`)

+```{eval-rst}
 .. list-table::
  :widths: 25 25 50 5 5
  :header-rows: 1
@@ -447,8 +448,8 @@ Reward Modeling (``--task reward``)
  * - Architecture
    - Models
    - Example HF Models
-    - :ref:`LoRA <lora>`
-    - :ref:`PP <distributed_serving>`
+    - :ref:`LoRA <lora-adapter>`
+    - :ref:`PP <distributed-serving>`
  * - :code:`LlamaForCausalLM`
    - Llama-based
    - :code:`peiyi9979/math-shepherd-mistral-7b-prm`, etc.
@@ -459,14 +460,19 @@ Reward Modeling (``--task reward``)
    - :code:`Qwen/Qwen2.5-Math-RM-72B`, etc.
    - ✅︎
    - ✅︎
+```

-.. important::
-  For process-supervised reward models such as :code:`peiyi9979/math-shepherd-mistral-7b-prm`, the pooling config should be set explicitly,
-  e.g.: :code:`--override-pooler-config '{"pooling_type": "STEP", "step_tag_id": 123, "returned_token_ids": [456, 789]}'`.
+If your model is not in the above list, we will try to automatically convert the model using
+{func}`vllm.model_executor.models.adapters.as_reward_model`. By default, we return the hidden states of each token directly.

-Classification (``--task classify``)
------------------------------------
+```{important}
+For process-supervised reward models such as {code}`peiyi9979/math-shepherd-mistral-7b-prm`, the pooling config should be set explicitly,
+e.g.: {code}`--override-pooler-config '{"pooling_type": "STEP", "step_tag_id": 123, "returned_token_ids": [456, 789]}'`.
+```

+#### Classification (`--task classify`)
+
+```{eval-rst}
 .. list-table::
  :widths: 25 25 50 5 5
  :header-rows: 1
@@ -474,17 +480,26 @@ Classification (``--task classify``)
  * - Architecture
    - Models
    - Example HF Models
-    - :ref:`LoRA <lora>`
-    - :ref:`PP <distributed_serving>`
+    - :ref:`LoRA <lora-adapter>`
+    - :ref:`PP <distributed-serving>`
+  * - :code:`JambaForSequenceClassification`
+    - Jamba
+    - :code:`ai21labs/Jamba-tiny-reward-dev`, etc.
+    - ✅︎
+    - ✅︎
  * - :code:`Qwen2ForSequenceClassification`
    - Qwen2-based
    - :code:`jason9693/Qwen2.5-1.5B-apeach`, etc.
    - ✅︎
    - ✅︎
+```

-Sentence Pair Scoring (``--task score``)
----------------------------------------
+If your model is not in the above list, we will try to automatically convert the model using
+{func}`vllm.model_executor.models.adapters.as_classification_model`. By default, the class probabilities are extracted from the softmaxed hidden state corresponding to the last token.
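
As a rough illustration of that default, the sketch below applies a softmax to the last token's vector. It is standalone and does not use vLLM; `last_token_class_probs` is a hypothetical name, and the last-token vector is assumed to already hold one score per class.

```python
import math


def last_token_class_probs(hidden_states: list[list[float]]) -> list[float]:
    """Turn the last token's per-class scores into probabilities (illustrative only)."""
    last = hidden_states[-1]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(last)
    exps = [math.exp(x - peak) for x in last]
    total = sum(exps)
    return [e / total for e in exps]


# Two tokens, two classes; only the final token's scores are used.
print(last_token_class_probs([[5.0, -5.0], [0.0, 0.0]]))  # [0.5, 0.5]
```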

+#### Sentence Pair Scoring (`--task score`)
+
+```{eval-rst}
 .. list-table::
  :widths: 25 25 50 5 5
  :header-rows: 1
@@ -492,54 +507,53 @@ Sentence Pair Scoring (``--task score``)
  * - Architecture
    - Models
    - Example HF Models
-    - :ref:`LoRA <lora>`
-    - :ref:`PP <distributed_serving>`
+    - :ref:`LoRA <lora-adapter>`
+    - :ref:`PP <distributed-serving>`
  * - :code:`BertForSequenceClassification`
    - BERT-based
    - :code:`cross-encoder/ms-marco-MiniLM-L-6-v2`, etc.
-    - 
-    - 
+    -
+    -
  * - :code:`RobertaForSequenceClassification`
    - RoBERTa-based
    - :code:`cross-encoder/quora-roberta-base`, etc.
-    - 
-    - 
+    -
+    -
  * - :code:`XLMRobertaForSequenceClassification`
    - XLM-RoBERTa-based
    - :code:`BAAI/bge-reranker-v2-m3`, etc.
-    - 
-    - 
+    -
+    -
+```

-.. _supported_mm_models:
+(supported-mm-models)=

-List of Multimodal Language Models
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+## List of Multimodal Language Models

 The following modalities are supported depending on the model:

- **T**\ ext
- **I**\ mage
- **V**\ ideo
- **A**\ udio
+- **T**ext
+- **I**mage
+- **V**ideo
+- **A**udio

-Any combination of modalities joined by :code:`+` are supported.
+Any combination of modalities joined by {code}`+` is supported.

- e.g.: :code:`T + I` means that the model supports text-only, image-only, and text-with-image inputs.
+- e.g.: {code}`T + I` means that the model supports text-only, image-only, and text-with-image inputs.

-On the other hand, modalities separated by :code:`/` are mutually exclusive.
+On the other hand, modalities separated by {code}`/` are mutually exclusive.

- e.g.: :code:`T / I` means that the model supports text-only and image-only inputs, but not text-with-image inputs.
+- e.g.: {code}`T / I` means that the model supports text-only and image-only inputs, but not text-with-image inputs.
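
To make the two separators concrete, here is a small sketch that expands a spec into the input kinds it allows. The helper `supported_inputs` is hypothetical and not part of vLLM; it handles plain letters only and ignores the superscript markers used in the tables.

```python
from itertools import combinations


def supported_inputs(spec: str) -> list[tuple[str, ...]]:
    """Expand a modality spec such as "T + I" or "T / I" into allowed input kinds."""
    if "+" in spec and "/" in spec:
        raise ValueError("this sketch handles only one separator per spec")
    if "/" in spec:
        # Mutually exclusive: each modality may only appear on its own.
        return [(m.strip(),) for m in spec.split("/")]
    parts = [m.strip() for m in spec.split("+")]
    # Joined by "+": every non-empty combination is accepted.
    return [c for r in range(1, len(parts) + 1) for c in combinations(parts, r)]


print(supported_inputs("T + I"))  # [('T',), ('I',), ('T', 'I')]
print(supported_inputs("T / I"))  # [('T',), ('I',)]
```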

-See :ref:`this page <multimodal_inputs>` on how to pass multi-modal inputs to the model.
+See [this page](#multimodal-inputs) on how to pass multi-modal inputs to the model.

-Generative Models
-+++++++++++++++++
+### Generative Models

-See :ref:`this page <generative_models>` for more information on how to use generative models.
+See [this page](#generative-models) for more information on how to use generative models.

-Text Generation (``--task generate``)
-------------------------------------
+#### Text Generation (`--task generate`)

+```{eval-rst}
 .. list-table::
  :widths: 25 25 15 20 5 5 5
  :header-rows: 1
@@ -548,63 +562,63 @@ Text Generation (``--task generate``)
    - Models
    - Inputs
    - Example HF Models
-    - :ref:`LoRA <lora>`
-    - :ref:`PP <distributed_serving>`
+    - :ref:`LoRA <lora-adapter>`
+    - :ref:`PP <distributed-serving>`
    - V1
  * - :code:`AriaForConditionalGeneration`
    - Aria
    - T + I
    - :code:`rhymes-ai/Aria`
-    - 
+    -
    - ✅︎
-    - 
+    -
  * - :code:`Blip2ForConditionalGeneration`
    - BLIP-2
    - T + I\ :sup:`E`
    - :code:`Salesforce/blip2-opt-2.7b`, :code:`Salesforce/blip2-opt-6.7b`, etc.
    -
    - ✅︎
-    - 
+    -
  * - :code:`ChameleonForConditionalGeneration`
    - Chameleon
    - T + I
    - :code:`facebook/chameleon-7b` etc.
-    - 
+    -
    - ✅︎
-    - 
+    -
  * - :code:`FuyuForCausalLM`
    - Fuyu
    - T + I
    - :code:`adept/fuyu-8b` etc.
-    - 
+    -
    - ✅︎
-    - 
+    -
  * - :code:`ChatGLMModel`
    - GLM-4V
    - T + I
    - :code:`THUDM/glm-4v-9b` etc.
    - ✅︎
    - ✅︎
-    - 
+    -
  * - :code:`H2OVLChatModel`
    - H2OVL
    - T + I\ :sup:`E+`
    - :code:`h2oai/h2ovl-mississippi-800m`, :code:`h2oai/h2ovl-mississippi-2b`, etc.
-    - 
+    -
    - ✅︎
-    - 
+    -
  * - :code:`Idefics3ForConditionalGeneration`
    - Idefics3
    - T + I
    - :code:`HuggingFaceM4/Idefics3-8B-Llama3` etc.
    - ✅︎
    -
-    - 
+    -
  * - :code:`InternVLChatModel`
    - InternVL 2.5, Mono-InternVL, InternVL 2.0
    - T + I\ :sup:`E+`
    - :code:`OpenGVLab/InternVL2_5-4B`, :code:`OpenGVLab/Mono-InternVL-2B`, :code:`OpenGVLab/InternVL2-4B`, etc.
-    - 
+    -
    - ✅︎
    - ✅︎
  * - :code:`LlavaForConditionalGeneration`
@@ -620,28 +634,28 @@ Text Generation (``--task generate``)
    - :code:`llava-hf/llava-v1.6-mistral-7b-hf`, :code:`llava-hf/llava-v1.6-vicuna-7b-hf`, etc.
    -
    - ✅︎
-    - 
+    -
  * - :code:`LlavaNextVideoForConditionalGeneration`
    - LLaVA-NeXT-Video
    - T + V
    - :code:`llava-hf/LLaVA-NeXT-Video-7B-hf`, etc.
    -
    - ✅︎
-    - 
+    -
  * - :code:`LlavaOnevisionForConditionalGeneration`
    - LLaVA-Onevision
    - T + I\ :sup:`+` + V\ :sup:`+`
    - :code:`llava-hf/llava-onevision-qwen2-7b-ov-hf`, :code:`llava-hf/llava-onevision-qwen2-0.5b-ov-hf`, etc.
    -
    - ✅︎
-    - 
+    -
  * - :code:`MiniCPMV`
    - MiniCPM-V
    - T + I\ :sup:`E+`
    - :code:`openbmb/MiniCPM-V-2` (see note), :code:`openbmb/MiniCPM-Llama3-V-2_5`, :code:`openbmb/MiniCPM-V-2_6`, etc.
    - ✅︎
    - ✅︎
-    - 
+    -
  * - :code:`MllamaForConditionalGeneration`
    - Llama 3.2
    - T + I\ :sup:`+`
@@ -660,16 +674,16 @@ Text Generation (``--task generate``)
    - NVLM-D 1.0
    - T + I\ :sup:`E+`
    - :code:`nvidia/NVLM-D-72B`, etc.
-    - 
+    -
    - ✅︎
    - ✅︎
  * - :code:`PaliGemmaForConditionalGeneration`
    - PaliGemma, PaliGemma 2
    - T + I\ :sup:`E`
    - :code:`google/paligemma-3b-pt-224`, :code:`google/paligemma-3b-mix-224`, :code:`google/paligemma2-3b-ft-docci-448`, etc.
-    - 
+    -
    - ✅︎
-    - 
+    -
  * - :code:`Phi3VForCausalLM`
    - Phi-3-Vision, Phi-3.5-Vision
    - T + I\ :sup:`E+`
@@ -697,70 +711,79 @@ Text Generation (``--task generate``)
    - :code:`Qwen/Qwen2-Audio-7B-Instruct`
    -
    - ✅︎
-    - 
+    -
  * - :code:`Qwen2VLForConditionalGeneration`
    - Qwen2-VL
    - T + I\ :sup:`E+` + V\ :sup:`E+`
-    - :code:`Qwen/Qwen2-VL-2B-Instruct`, :code:`Qwen/Qwen2-VL-7B-Instruct`, :code:`Qwen/Qwen2-VL-72B-Instruct`, etc.
+    - :code:`Qwen/QVQ-72B-Preview`, :code:`Qwen/Qwen2-VL-7B-Instruct`, :code:`Qwen/Qwen2-VL-72B-Instruct`, etc.
    - ✅︎
    - ✅︎
-    - 
+    -
  * - :code:`UltravoxModel`
    - Ultravox
    - T + A\ :sup:`E+`
    - :code:`fixie-ai/ultravox-v0_3`
    -
    - ✅︎
-    - 
-
-| :sup:`E` Pre-computed embeddings can be inputted for this modality.
-| :sup:`+` Multiple items can be inputted per text prompt for this modality.
+    -
+```

-.. important::
-    To enable multiple multi-modal items per text prompt, you have to set :code:`limit_mm_per_prompt` (offline inference)
-    or :code:`--limit-mm-per-prompt` (online inference). For example, to enable passing up to 4 images per text prompt:
+```{eval-rst}
+:sup:`E` Pre-computed embeddings can be inputted for this modality.

-    .. code-block:: python
+:sup:`+` Multiple items can be inputted per text prompt for this modality.
+```

-        llm = LLM(
-            model="Qwen/Qwen2-VL-7B-Instruct",
-            limit_mm_per_prompt={"image": 4},
-        )
+````{important}
+To enable multiple multi-modal items per text prompt, you have to set {code}`limit_mm_per_prompt` (offline inference)
+or {code}`--limit-mm-per-prompt` (online inference). For example, to enable passing up to 4 images per text prompt:

-    .. code-block:: bash
+```python
+llm = LLM(
+    model="Qwen/Qwen2-VL-7B-Instruct",
+    limit_mm_per_prompt={"image": 4},
+)
+```

-        vllm serve Qwen/Qwen2-VL-7B-Instruct --limit-mm-per-prompt image=4
+```bash
+vllm serve Qwen/Qwen2-VL-7B-Instruct --limit-mm-per-prompt image=4
+```
+````

-.. note::
-  vLLM currently only supports adding LoRA to the language backbone of multimodal models.
+```{note}
+vLLM currently only supports adding LoRA to the language backbone of multimodal models.
+```

-.. note::
-  To use :code:`TIGER-Lab/Mantis-8B-siglip-llama3`, you have to install their GitHub repo (:code:`pip install git+https://github.com/TIGER-AI-Lab/Mantis.git`)
-  and pass :code:`--hf_overrides '{"architectures": ["MantisForConditionalGeneration"]}'` when running vLLM.
+```{note}
+To use {code}`TIGER-Lab/Mantis-8B-siglip-llama3`, you have to install their GitHub repo ({code}`pip install git+https://github.com/TIGER-AI-Lab/Mantis.git`)
+and pass {code}`--hf_overrides '{"architectures": ["MantisForConditionalGeneration"]}'` when running vLLM.
+```

-.. note::
-  The official :code:`openbmb/MiniCPM-V-2` doesn't work yet, so we need to use a fork (:code:`HwwwH/MiniCPM-V-2`) for now.
-  For more details, please see: https://github.com/vllm-project/vllm/pull/4087#issuecomment-2250397630
+```{note}
+The official {code}`openbmb/MiniCPM-V-2` doesn't work yet, so we need to use a fork ({code}`HwwwH/MiniCPM-V-2`) for now.
+For more details, please see: <gh-pr:4087#issuecomment-2250397630>
+```

-Pooling Models
-++++++++++++++
+### Pooling Models

-See :ref:`this page <pooling_models>` for more information on how to use pooling models.
+See [this page](#pooling-models) for more information on how to use pooling models.

-.. important::
-    Since some model architectures support both generative and pooling tasks,
-    you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode.
+```{important}
+Since some model architectures support both generative and pooling tasks,
+you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode.
+```

-Text Embedding (``--task embed``)
---------------------------------
+#### Text Embedding (`--task embed`)

-Any text generation model can be converted into an embedding model by passing :code:`--task embed`.
+Any text generation model can be converted into an embedding model by passing {code}`--task embed`.

-.. note::
-    To get the best results, you should use pooling models that are specifically trained as such.
+```{note}
+To get the best results, you should use pooling models that are specifically trained as such.
+```

 The following table lists those that are tested in vLLM.

+```{eval-rst}
 .. list-table::
  :widths: 25 25 15 25 5 5
  :header-rows: 1
@@ -769,13 +792,13 @@ The following table lists those that are tested in vLLM.
    - Models
    - Inputs
    - Example HF Models
-    - :ref:`LoRA <lora>`
-    - :ref:`PP <distributed_serving>`
+    - :ref:`LoRA <lora-adapter>`
+    - :ref:`PP <distributed-serving>`
  * - :code:`LlavaNextForConditionalGeneration`
    - LLaVA-NeXT-based
    - T / I
    - :code:`royokong/e5-v`
-    - 
+    -
    - ✅︎
  * - :code:`Phi3VForCausalLM`
    - Phi-3-Vision-based
@@ -787,27 +810,25 @@ The following table lists those that are tested in vLLM.
    - Qwen2-VL-based
    - T + I
    - :code:`MrLight/dse-qwen2-2b-mrl-v1`
-    - 
+    -
    - ✅︎
+```

----
+______________________________________________________________________

-Model Support Policy
-=====================
+# Model Support Policy

 At vLLM, we are committed to facilitating the integration and support of third-party models within our ecosystem. Our approach is designed to balance the need for robustness and the practical limitations of supporting a wide range of models. Here’s how we manage third-party model support:

 1. **Community-Driven Support**: We encourage community contributions for adding new models. When a user requests support for a new model, we welcome pull requests (PRs) from the community. These contributions are evaluated primarily on the sensibility of the output they generate, rather than strict consistency with existing implementations such as those in transformers. **Call for contribution:** PRs coming directly from model vendors are greatly appreciated!
-
 2. **Best-Effort Consistency**: While we aim to maintain a level of consistency between the models implemented in vLLM and other frameworks like transformers, complete alignment is not always feasible. Factors like acceleration techniques and the use of low-precision computations can introduce discrepancies. Our commitment is to ensure that the implemented models are functional and produce sensible results.

-.. tip::
-  When comparing the output of :code:`model.generate` from HuggingFace Transformers with the output of :code:`llm.generate` from vLLM, note that the former reads the model's generation config file (i.e., `generation_config.json <https://github.com/huggingface/transformers/blob/19dabe96362803fb0a9ae7073d03533966598b17/src/transformers/generation/utils.py#L1945>`__) and applies the default parameters for generation, while the latter only uses the parameters passed to the function. Ensure all sampling parameters are identical when comparing outputs.
+```{tip}
+When comparing the output of {code}`model.generate` from HuggingFace Transformers with the output of {code}`llm.generate` from vLLM, note that the former reads the model's generation config file (i.e., [generation_config.json](https://github.com/huggingface/transformers/blob/19dabe96362803fb0a9ae7073d03533966598b17/src/transformers/generation/utils.py#L1945)) and applies the default parameters for generation, while the latter only uses the parameters passed to the function. Ensure all sampling parameters are identical when comparing outputs.
+```

 3. **Issue Resolution and Model Updates**: Users are encouraged to report any bugs or issues they encounter with third-party models. Proposed fixes should be submitted via PRs, with a clear explanation of the problem and the rationale behind the proposed solution. If a fix for one model impacts another, we rely on the community to highlight and address these cross-model dependencies. Note: for bugfix PRs, it is good etiquette to inform the original author to seek their feedback.
-
 4. **Monitoring and Updates**: Users interested in specific models should monitor the commit history for those models (e.g., by tracking changes in the main/vllm/model_executor/models directory). This proactive approach helps users stay informed about updates and changes that may affect the models they use.
-
 5. **Selective Focus**: Our resources are primarily directed towards models with significant user interest and impact. Models that are less frequently used may receive less attention, and we rely on the community to play a more active role in their upkeep and improvement.
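
For the tip above about comparing against HuggingFace Transformers, one way to line up the parameters is to translate the model's generation defaults into keyword arguments for `SamplingParams`. The helper below is a hypothetical sketch that covers only a handful of common fields; it assumes the generation config has already been loaded into a plain dict.

```python
def sampling_kwargs_from_generation_config(gen_cfg: dict) -> dict:
    """Translate HF generation defaults into SamplingParams keyword arguments."""
    kwargs = {}
    # Parameters that share a name in both libraries.
    for key in ("temperature", "top_p", "top_k", "repetition_penalty"):
        if gen_cfg.get(key) is not None:
            kwargs[key] = gen_cfg[key]
    # HF calls the output length limit max_new_tokens; vLLM calls it max_tokens.
    if gen_cfg.get("max_new_tokens") is not None:
        kwargs["max_tokens"] = gen_cfg["max_new_tokens"]
    # Without do_sample, HF decodes greedily and ignores the sampling knobs;
    # temperature=0 requests greedy decoding in vLLM.
    if not gen_cfg.get("do_sample", False):
        kwargs["temperature"] = 0.0
        kwargs.pop("top_p", None)
        kwargs.pop("top_k", None)
    return kwargs


cfg = {"do_sample": True, "temperature": 0.6, "top_p": 0.9, "max_new_tokens": 64}
print(sampling_kwargs_from_generation_config(cfg))
```

The resulting dict can be passed as `SamplingParams(**kwargs)`. Fields not listed here still need to be checked by hand.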

 Through this approach, vLLM fosters a collaborative environment where both the core development team and the broader community contribute to the robustness and diversity of the third-party models supported in our ecosystem.
@@ -816,7 +837,7 @@ Note that, as an inference engine, vLLM does not introduce new models. Therefore

 We have the following levels of testing for models:

-1. **Strict Consistency**: We compare the output of the model with the output of the model in the HuggingFace Transformers library under greedy decoding. This is the most stringent test. Please refer to `models tests <https://github.com/vllm-project/vllm/blob/main/tests/models>`_ for the models that have passed this test.
+1. **Strict Consistency**: We compare the output of the model with the output of the model in the HuggingFace Transformers library under greedy decoding. This is the most stringent test. Please refer to [models tests](https://github.com/vllm-project/vllm/blob/main/tests/models) for the models that have passed this test.
 2. **Output Sensibility**: We check if the output of the model is sensible and coherent, by measuring the perplexity of the output and checking for any obvious errors. This is a less stringent test.
-3. **Runtime Functionality**: We check if the model can be loaded and run without errors. This is the least stringent test. Please refer to `functionality tests <https://github.com/vllm-project/vllm/tree/main/tests>`_ and `examples <https://github.com/vllm-project/vllm/tree/main/examples>`_ for the models that have passed this test.
+3. **Runtime Functionality**: We check if the model can be loaded and run without errors. This is the least stringent test. Please refer to [functionality tests](gh-dir:tests) and [examples](gh-dir:examples) for the models that have passed this test.
 4. **Community Feedback**: We rely on the community to provide feedback on the models. If a model is broken or not working as expected, we encourage users to raise issues to report it or open pull requests to fix it. The rest of the models fall under this category.
\ No newline at end of file
--- a/docs/source/performance/benchmarks.md
+++ b/docs/source/performance/benchmarks.md
+(benchmarks)=
+
+# Benchmark Suites
+
+vLLM contains two sets of benchmarks:
+
+- [Performance benchmarks](#performance-benchmarks)
+- [Nightly benchmarks](#nightly-benchmarks)
+
+(performance-benchmarks)=
+
+## Performance Benchmarks
+
+The performance benchmarks are used for development to confirm whether new changes improve performance under various workloads. They are triggered on every commit with both the `perf-benchmarks` and `ready` labels, and when a PR is merged into vLLM.
+
+The latest performance results are hosted on the public [vLLM Performance Dashboard](https://perf.vllm.ai).
+
+More information on the performance benchmarks and their parameters can be found [here](gh-file:.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md).
+
+(nightly-benchmarks)=
+
+## Nightly Benchmarks
+
+These compare vLLM's performance against alternatives (`tgi`, `trt-llm`, and `lmdeploy`) when there are major updates of vLLM (e.g., bumping up to a new version). They are primarily intended for consumers to evaluate when to choose vLLM over other options and are triggered on every commit with both the `perf-benchmarks` and `nightly-benchmarks` labels.
+
+The latest nightly benchmark results are shared in major release blog posts such as [vLLM v0.6.0](https://blog.vllm.ai/2024/09/05/perf-update.html).
+
+More information on the nightly benchmarks and their parameters can be found [here](gh-file:.buildkite/nightly-benchmarks/nightly-descriptions.md).
--- a/docs/source/performance/benchmarks.rst
+++ b/docs/source/performance/benchmarks.rst
-.. _benchmarks:
-
-================
-Benchmark Suites
-================
-
-vLLM contains two sets of benchmarks:
-
-+ :ref:`Performance benchmarks <performance_benchmarks>`
-+ :ref:`Nightly benchmarks <nightly_benchmarks>`
-
-
-.. _performance_benchmarks:
-
-Performance Benchmarks
----------------------
-
-The performance benchmarks are used for development to confirm whether new changes improve performance under various workloads. They are triggered on every commit with both the ``perf-benchmarks`` and ``ready`` labels, and when a PR is merged into vLLM.
-
-The latest performance results are hosted on the public `vLLM Performance Dashboard <https://perf.vllm.ai>`_.
-
-More information on the performance benchmarks and their parameters can be found `here <https://github.com/vllm-project/vllm/blob/main/.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md>`__.
-
-.. _nightly_benchmarks:
-
-Nightly Benchmarks
------------------
-
-These compare vLLM's performance against alternatives (``tgi``, ``trt-llm``, and ``lmdeploy``) when there are major updates of vLLM (e.g., bumping up to a new version). They are primarily intended for consumers to evaluate when to choose vLLM over other options and are triggered on every commit with both the ``perf-benchmarks`` and ``nightly-benchmarks`` labels. 
-
-The latest nightly benchmark results are shared in major release blog posts such as `vLLM v0.6.0 <https://blog.vllm.ai/2024/09/05/perf-update.html>`_.
-
-More information on the nightly benchmarks and their parameters can be found `here <https://github.com/vllm-project/vllm/blob/main/.buildkite/nightly-benchmarks/nightly-descriptions.md>`__.
\ No newline at end of file
--- a/docs/source/quantization/auto_awq.md
+++ b/docs/source/quantization/auto_awq.md
+(auto-awq)=
+
+# AutoAWQ
+
+```{warning}
+Please note that AWQ support in vLLM is under-optimized at the moment. We recommend using the unquantized version of the model for better
+accuracy and higher throughput. Currently, you can use AWQ as a way to reduce memory footprint. As of now, it is more suitable for low-latency
+inference with a small number of concurrent requests. vLLM's AWQ implementation has lower throughput than the unquantized version.
+```
+
+To create a new 4-bit quantized model, you can leverage [AutoAWQ](https://github.com/casper-hansen/AutoAWQ).
+Quantizing reduces the model's precision from FP16 to INT4 which effectively reduces the file size by ~70%.
+The main benefits are lower latency and memory usage.
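
A back-of-the-envelope calculation shows where a figure of roughly 70% can come from. The sketch assumes each group of weights stores one FP16 scale and one 4-bit zero point alongside the 4-bit weights; these storage sizes are assumptions for illustration, while `w_bit` and `q_group_size` mirror the `quant_config` keys used in the quantization example in this page.

```python
def awq_bits_per_weight(
    w_bit: int = 4,
    q_group_size: int = 128,
    scale_bits: int = 16,  # assumed: one FP16 scale per group
    zero_bits: int = 4,    # assumed: one 4-bit zero point per group
) -> float:
    """Average storage per weight, including per-group scale and zero point."""
    return w_bit + (scale_bits + zero_bits) / q_group_size


bits = awq_bits_per_weight()
reduction = 1 - bits / 16  # relative to 16 bits per weight in FP16
print(f"{bits:.2f} bits per weight, {reduction:.1%} smaller")  # 4.16 bits per weight, 74.0% smaller
```

This covers the quantized layers only. Any tensors that stay in FP16 would pull the whole-file saving somewhat below this number, which is consistent with the ~70% quoted above.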
+
+You can quantize your own models by installing AutoAWQ, or pick one of the [400+ models on Hugging Face](https://huggingface.co/models?sort=trending&search=awq).
+
+```console
+$ pip install autoawq
+```
+
+After installing AutoAWQ, you are ready to quantize a model. Here is an example of how to quantize `mistralai/Mistral-7B-Instruct-v0.2`:
+
+```python
+from awq import AutoAWQForCausalLM
+from transformers import AutoTokenizer
+
+model_path = 'mistralai/Mistral-7B-Instruct-v0.2'
+quant_path = 'mistral-instruct-v0.2-awq'
+quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
+
+# Load model
+model = AutoAWQForCausalLM.from_pretrained(
+    model_path, **{"low_cpu_mem_usage": True, "use_cache": False}
+)
+tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+
+# Quantize
+model.quantize(tokenizer, quant_config=quant_config)
+
+# Save quantized model
+model.save_quantized(quant_path)
+tokenizer.save_pretrained(quant_path)
+
+print(f'Model is quantized and saved at "{quant_path}"')
+```
+
+To run an AWQ model with vLLM, you can use [TheBloke/Llama-2-7b-Chat-AWQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-AWQ) with the following command:
+
+```console
+$ python examples/llm_engine_example.py --model TheBloke/Llama-2-7b-Chat-AWQ --quantization awq
+```
+
+AWQ models are also supported directly through the LLM entrypoint:
+
+```python
+from vllm import LLM, SamplingParams
+
+# Sample prompts.
+prompts = [
+    "Hello, my name is",
+    "The president of the United States is",
+    "The capital of France is",
+    "The future of AI is",
+]
+# Create a sampling params object.
+sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+
+# Create an LLM.
+llm = LLM(model="TheBloke/Llama-2-7b-Chat-AWQ", quantization="AWQ")
+# Generate texts from the prompts. The output is a list of RequestOutput objects
+# that contain the prompt, generated text, and other information.
+outputs = llm.generate(prompts, sampling_params)
+# Print the outputs.
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+```
--- a/docs/source/quantization/auto_awq.rst
+++ b/docs/source/quantization/auto_awq.rst
-.. _auto_awq:
-
-AutoAWQ
-==================
-
-.. warning::
-
-   Please note that AWQ support in vLLM is under-optimized at the moment. We would recommend using the unquantized version of the model for better
-   accuracy and higher throughput. Currently, you can use AWQ as a way to reduce memory footprint. As of now, it is more suitable for low latency
-   inference with small number of concurrent requests. vLLM's AWQ implementation have lower throughput than unquantized version.
-
-To create a new 4-bit quantized model, you can leverage `AutoAWQ <https://github.com/casper-hansen/AutoAWQ>`_. 
-Quantizing reduces the model's precision from FP16 to INT4 which effectively reduces the file size by ~70%.
-The main benefits are lower latency and memory usage.
-
-You can quantize your own models by installing AutoAWQ or picking one of the `400+ models on Huggingface <https://huggingface.co/models?sort=trending&search=awq>`_. 
-
-.. code-block:: console
-
-    $ pip install autoawq
-
-After installing AutoAWQ, you are ready to quantize a model. Here is an example of how to quantize `mistralai/Mistral-7B-Instruct-v0.2`:
-
-.. code-block:: python
-
-    from awq import AutoAWQForCausalLM
-    from transformers import AutoTokenizer
-    
-    model_path = 'mistralai/Mistral-7B-Instruct-v0.2'
-    quant_path = 'mistral-instruct-v0.2-awq'
-    quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
-    
-    # Load model
-    model = AutoAWQForCausalLM.from_pretrained(
-        model_path, **{"low_cpu_mem_usage": True, "use_cache": False}
-    )
-    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
-    
-    # Quantize
-    model.quantize(tokenizer, quant_config=quant_config)
-    
-    # Save quantized model
-    model.save_quantized(quant_path)
-    tokenizer.save_pretrained(quant_path)
-    
-    print(f'Model is quantized and saved at "{quant_path}"')
-
-To run an AWQ model with vLLM, you can use `TheBloke/Llama-2-7b-Chat-AWQ <https://huggingface.co/TheBloke/Llama-2-7b-Chat-AWQ>`_ with the following command:
-
-.. code-block:: console
-
-    $ python examples/llm_engine_example.py --model TheBloke/Llama-2-7b-Chat-AWQ --quantization awq
-
-AWQ models are also supported directly through the LLM entrypoint:
-
-.. code-block:: python
-
-    from vllm import LLM, SamplingParams
-
-    # Sample prompts.
-    prompts = [
-        "Hello, my name is",
-        "The president of the United States is",
-        "The capital of France is",
-        "The future of AI is",
-    ]
-    # Create a sampling params object.
-    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
-
-    # Create an LLM.
-    llm = LLM(model="TheBloke/Llama-2-7b-Chat-AWQ", quantization="AWQ")
-    # Generate texts from the prompts. The output is a list of RequestOutput objects
-    # that contain the prompt, generated text, and other information.
-    outputs = llm.generate(prompts, sampling_params)
-    # Print the outputs.
-    for output in outputs:
-        prompt = output.prompt
-        generated_text = output.outputs[0].text
-        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
--- a/docs/source/quantization/bnb.md
+++ b/docs/source/quantization/bnb.md
+(bits-and-bytes)=
+
+# BitsAndBytes
+
+vLLM now supports [BitsAndBytes](https://github.com/TimDettmers/bitsandbytes) for more efficient model inference.
+BitsAndBytes quantizes models to reduce memory usage and enhance performance without significantly sacrificing accuracy.
+Compared to other quantization methods, BitsAndBytes eliminates the need for calibrating the quantized model with input data.
+
+Below are the steps to utilize BitsAndBytes with vLLM.
+
+```console
+$ pip install "bitsandbytes>=0.45.0"
+```
+
+vLLM reads the model's config file and supports both in-flight quantization and pre-quantized checkpoints.
+
+You can find bitsandbytes quantized models on <https://huggingface.co/models?other=bitsandbytes>.
+These repositories usually have a `config.json` file that includes a `quantization_config` section.
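
As a quick way to tell the two cases apart before loading, you could inspect the checkpoint's `config.json` yourself. The helper below is a hypothetical sketch, not a vLLM API; it assumes the config has been parsed into a dict and that pre-quantized bitsandbytes checkpoints record `"quant_method": "bitsandbytes"` inside `quantization_config`.

```python
def is_prequantized_bnb(config: dict) -> bool:
    """Return True if a parsed config.json describes a bitsandbytes checkpoint."""
    # A missing or null section means the weights are stored unquantized.
    quant_cfg = config.get("quantization_config") or {}
    # Assumed marker written by bitsandbytes-quantized checkpoints.
    return quant_cfg.get("quant_method") == "bitsandbytes"


prequantized = {"quantization_config": {"quant_method": "bitsandbytes", "load_in_4bit": True}}
unquantized = {"model_type": "llama"}
print(is_prequantized_bnb(prequantized))  # True
print(is_prequantized_bnb(unquantized))   # False
```

In both cases the `LLM` arguments shown in the sections below are the same.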
+
+## Read quantized checkpoint
+
+```python
+from vllm import LLM
+import torch
+# unsloth/tinyllama-bnb-4bit is a pre-quantized checkpoint.
+model_id = "unsloth/tinyllama-bnb-4bit"
+llm = LLM(
+    model=model_id,
+    dtype=torch.bfloat16,
+    trust_remote_code=True,
+    quantization="bitsandbytes",
+    load_format="bitsandbytes",
+)
+```
+
+## In-flight quantization: load as 4-bit quantization
+
+```python
+from vllm import LLM
+import torch
+model_id = "huggyllama/llama-7b"
+llm = LLM(
+    model=model_id,
+    dtype=torch.bfloat16,
+    trust_remote_code=True,
+    quantization="bitsandbytes",
+    load_format="bitsandbytes",
+)
+```
--- a/docs/source/quantization/bnb.rst
+++ b/docs/source/quantization/bnb.rst
-.. _bits_and_bytes:
-
-BitsAndBytes
-==================
-
-vLLM now supports `BitsAndBytes <https://github.com/TimDettmers/bitsandbytes>`_ for more efficient model inference.
-BitsAndBytes quantizes models to reduce memory usage and enhance performance without significantly sacrificing accuracy.
-Compared to other quantization methods,  BitsAndBytes eliminates the need for calibrating the quantized model with input data.
-
-Below are the steps to utilize BitsAndBytes with vLLM.
-
-.. code-block:: console
-
-    $ pip install bitsandbytes>=0.45.0
-
-vLLM reads the model's config file and supports both in-flight quantization and pre-quantized checkpoint.
-
-You can find bitsandbytes quantized models on https://huggingface.co/models?other=bitsandbytes.
-And usually, these repositories have a config.json file that includes a quantization_config section.
-
-Read quantized checkpoint.
--------------------------
-
-.. code-block:: python
-
-    from vllm import LLM
-    import torch
-    # unsloth/tinyllama-bnb-4bit is a pre-quantized checkpoint.
-    model_id = "unsloth/tinyllama-bnb-4bit"
-    llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True, \
-    quantization="bitsandbytes", load_format="bitsandbytes")
-
-Inflight quantization: load as 4bit quantization
------------------------------------------------
-
-.. code-block:: python
-
-    from vllm import LLM
-    import torch
-    model_id = "huggyllama/llama-7b"
-    llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True, \
-    quantization="bitsandbytes", load_format="bitsandbytes")
-
--- a/docs/source/quantization/fp8.md
+++ b/docs/source/quantization/fp8.md
+(fp8)=
+
+# FP8 W8A8
+
+vLLM supports FP8 (8-bit floating point) weight and activation quantization using hardware acceleration on GPUs such as NVIDIA H100 and AMD MI300X.
+Currently, only Hopper and Ada Lovelace GPUs are officially supported for W8A8.
+Ampere GPUs are supported for W8A16 (weight-only FP8) utilizing Marlin kernels.
+Quantization of models with FP8 allows for a 2x reduction in model memory requirements and up to a 1.6x improvement in throughput with minimal impact on accuracy.
+
+Please visit the HF collection of [quantized FP8 checkpoints of popular LLMs ready to use with vLLM](https://huggingface.co/collections/neuralmagic/fp8-llms-for-vllm-666742ed2b78b7ac8df13127).
+
+The FP8 types typically supported in hardware have two distinct representations, each useful in different scenarios:
+
+- **E4M3**: Consists of 1 sign bit, 4 exponent bits, and 3 bits of mantissa. It can store values up to +/-448 and `nan`.
+- **E5M2**: Consists of 1 sign bit, 5 exponent bits, and 2 bits of mantissa. It can store values up to +/-57344, +/- `inf`, and `nan`. The tradeoff for the increased dynamic range is lower precision of the stored values.
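
The two maxima quoted above follow directly from the bit layouts. The sketch below recomputes them; the function name `fp8_max` and its `ieee_like` flag are made up for illustration, and the encoding rules in the comments (E5M2 reserving its top exponent for `inf`/`nan`, E4M3 reserving only one bit pattern for `nan`) are stated here as assumptions about the common hardware formats.

```python
def fp8_max(exp_bits, man_bits, ieee_like):
    """Largest finite magnitude of an FP8 format with one sign bit.

    ieee_like=True:  the all-ones exponent is reserved for inf/nan (E5M2).
    ieee_like=False: only all-ones exponent plus all-ones mantissa is nan,
                     so the top exponent still holds finite values (E4M3).
    """
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        max_exp = (2 ** exp_bits - 2) - bias  # top exponent is reserved
        max_man = 2 ** man_bits - 1           # all mantissa bits set
    else:
        max_exp = (2 ** exp_bits - 1) - bias  # top exponent is usable
        max_man = 2 ** man_bits - 2           # all-ones mantissa means nan
    return 2.0 ** max_exp * (1 + max_man / 2 ** man_bits)


e4m3_max = fp8_max(4, 3, ieee_like=False)
e5m2_max = fp8_max(5, 2, ieee_like=True)
print(e4m3_max, e5m2_max)
```

The extra exponent bit in E5M2 buys range at the cost of one mantissa bit, which is the tradeoff the list describes.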
+
+```{note}
+FP8 computation is supported on NVIDIA GPUs with compute capability >= 8.9 (Ada Lovelace, Hopper).
+FP8 models will run on compute capability >= 8.0 (Ampere) as weight-only W8A16, utilizing FP8 Marlin.
+```
+
+## Quick Start with Online Dynamic Quantization
+
+Dynamic quantization of an original precision BF16/FP16 model to FP8 can be achieved with vLLM without any calibration data required. You can enable the feature by specifying `--quantization="fp8"` in the command line or setting `quantization="fp8"` in the LLM constructor.
+
+In this mode, all Linear modules (except for the final `lm_head`) have their weights quantized down to FP8_E4M3 precision with a per-tensor scale. Activations have their minimum and maximum values calculated during each forward pass to provide a dynamic per-tensor scale for high accuracy. As a result, latency improvements are limited in this mode.
+
+```python
+from vllm import LLM
+model = LLM("facebook/opt-125m", quantization="fp8")
+# INFO 06-10 17:55:42 model_runner.py:157] Loading model weights took 0.1550 GB
+result = model.generate("Hello, my name is")
+```
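
To make the idea of a dynamic per-tensor scale concrete, here is a minimal pure-Python sketch. It is not vLLM's kernel code: the helper names are hypothetical, it assumes a symmetric scale derived from the largest absolute value, and it skips rounding to the actual FP8 grid.

```python
FP8_E4M3_MAX = 448.0  # largest finite E4M3 value


def per_tensor_scale(values):
    """One scale for the whole tensor: map the largest magnitude to FP8 max."""
    amax = max(abs(v) for v in values)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def scale_and_clamp(values, scale):
    """Divide by the scale and clamp into the E4M3 range (rounding omitted)."""
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]


activations = [0.5, -896.0, 3.0]
scale = per_tensor_scale(activations)  # recomputed on every forward pass
scaled = scale_and_clamp(activations, scale)
print(scale, scaled)
```

Because the scale depends on the current activations, it has to be recomputed at run time, which is one reason the latency gains of this mode are limited.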
+
+```{warning}
+Currently, we load the model at original precision before quantizing down to 8 bits, so you need enough memory to load the whole model.
+```
+
+## Installation
+
+To produce performant FP8 quantized models with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:
+
+```console
+$ pip install llmcompressor
+```
+
+## Quantization Process
+
+The quantization process involves three main steps:
+
+1. Loading the model
+2. Applying quantization
+3. Evaluating accuracy in vLLM
+
+### 1. Loading the Model
+
+Use `SparseAutoModelForCausalLM`, which wraps `AutoModelForCausalLM`, for saving and loading quantized models:
+
+```python
+from llmcompressor.transformers import SparseAutoModelForCausalLM
+from transformers import AutoTokenizer
+
+MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
+
+model = SparseAutoModelForCausalLM.from_pretrained(
+  MODEL_ID, device_map="auto", torch_dtype="auto")
+tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+```
+
+### 2. Applying Quantization
+
+For FP8 quantization, we can recover accuracy with simple round-to-nearest (RTN) quantization. We recommend targeting all `Linear` layers using the `FP8_DYNAMIC` scheme, which uses:
+
+- Static, per-channel quantization on the weights
+- Dynamic, per-token quantization on the activations
+
+Since simple RTN does not require data for weight quantization and the activations are quantized dynamically, we do not need any calibration data for this quantization flow.
+
+```python
+from llmcompressor.transformers import oneshot
+from llmcompressor.modifiers.quantization import QuantizationModifier
+
+# Configure the simple PTQ quantization
+recipe = QuantizationModifier(
+  targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
+
+# Apply the quantization algorithm.
+oneshot(model=model, recipe=recipe)
+
+# Save the model.
+SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
+model.save_pretrained(SAVE_DIR)
+tokenizer.save_pretrained(SAVE_DIR)
+```
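
The two granularities used by `FP8_DYNAMIC` differ only in which axis gets its own scale. The sketch below is an illustration under assumptions, not llm-compressor's implementation: it assumes weights laid out as `[out_channels, in_features]` and activations as `[tokens, hidden]`, and the helper `row_scales` is hypothetical.

```python
FP8_E4M3_MAX = 448.0  # largest finite E4M3 value


def row_scales(matrix):
    """One scale per row: that row's largest magnitude mapped to FP8 max."""
    return [max(abs(v) for v in row) / FP8_E4M3_MAX for row in matrix]


# Weights: one scale per output channel, computed once and stored (static).
weights = [[0.5, -7.0], [14.0, 0.25]]
weight_scales = row_scales(weights)

# Activations: one scale per token, recomputed for each input (dynamic).
activations = [[448.0, -1.0], [0.5, 224.0]]
token_scales = row_scales(activations)

print(weight_scales, token_scales)
```

Neither computation needs a calibration dataset: the weight scales come from the weights themselves, and the activation scales come from the live inputs.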
+
+### 3. Evaluating Accuracy
+
+Install `vllm` and `lm-evaluation-harness`:
+
+```console
+$ pip install vllm lm-eval==0.4.4
+```
+
+Load and run the model in `vllm`:
+
+```python
+from vllm import LLM
+model = LLM("./Meta-Llama-3-8B-Instruct-FP8-Dynamic")
+model.generate("Hello my name is")
+```
+
+Evaluate accuracy with `lm_eval` (for example on 250 samples of `gsm8k`):
+
+```{note}
+Quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations.
+```
+
+```console
+$ MODEL=$PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic
+$ lm_eval \
+  --model vllm \
+  --model_args pretrained=$MODEL,add_bos_token=True \
+  --tasks gsm8k  --num_fewshot 5 --batch_size auto --limit 250
+```
+
+Here's an example of the resulting scores:
+
+```text
+|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
+|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
+|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.768|±  |0.0268|
+|     |       |strict-match    |     5|exact_match|↑  |0.768|±  |0.0268|
+```
+
+## Troubleshooting and Support
+
+If you encounter any issues or have feature requests, please open an issue on the `vllm-project/llm-compressor` GitHub repository.
+
+## Deprecated Flow
+
+```{note}
+The following information is preserved for reference and search purposes.
+The quantization method described below is deprecated in favor of the `llmcompressor` method described above.
+```
+
+For static per-tensor offline quantization to FP8, please install the [AutoFP8 library](https://github.com/neuralmagic/autofp8).
+
+```bash
+git clone https://github.com/neuralmagic/AutoFP8.git
+pip install -e AutoFP8
+```
+
+This package introduces the `AutoFP8ForCausalLM` and `BaseQuantizeConfig` objects for managing how your model will be compressed.
+
+## Offline Quantization with Static Activation Scaling Factors
+
+You can use AutoFP8 with calibration data to produce per-tensor static scales for both the weights and activations by enabling the `activation_scheme="static"` argument.
+
+```python
+from datasets import load_dataset
+from transformers import AutoTokenizer
+from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig
+
+pretrained_model_dir = "meta-llama/Meta-Llama-3-8B-Instruct"
+quantized_model_dir = "Meta-Llama-3-8B-Instruct-FP8"
+
+tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
+tokenizer.pad_token = tokenizer.eos_token
+
+# Load and tokenize 512 dataset samples for calibration of activation scales
+ds = load_dataset("mgoin/ultrachat_2k", split="train_sft").select(range(512))
+examples = [tokenizer.apply_chat_template(batch["messages"], tokenize=False) for batch in ds]
+examples = tokenizer(examples, padding=True, truncation=True, return_tensors="pt").to("cuda")
+
+# Define quantization config with static activation scales
+quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="static")
+
+# Load the model, quantize, and save checkpoint
+model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
+model.quantize(examples)
+model.save_quantized(quantized_model_dir)
+```
+
+Your model checkpoint with quantized weights and activations should be available at `Meta-Llama-3-8B-Instruct-FP8/`.
+Finally, you can load the quantized model checkpoint directly in vLLM.
+
+```python
+from vllm import LLM
+model = LLM(model="Meta-Llama-3-8B-Instruct-FP8/")
+# INFO 06-10 21:15:41 model_runner.py:159] Loading model weights took 8.4596 GB
+result = model.generate("Hello, my name is")
+```
--- a/docs/source/quantization/fp8.rst
+++ b/docs/source/quantization/fp8.rst
-.. _fp8:
-
-FP8 W8A8
-==================
-
-vLLM supports FP8 (8-bit floating point) weight and activation quantization using hardware acceleration on GPUs such as Nvidia H100 and AMD MI300x. 
-Currently, only Hopper and Ada Lovelace GPUs are officially supported for W8A8. 
-Ampere GPUs are supported for W8A16 (weight-only FP8) utilizing Marlin kernels.
-Quantization of models with FP8 allows for a 2x reduction in model memory requirements and up to a 1.6x improvement in throughput with minimal impact on accuracy.
-
-Please visit the HF collection of `quantized FP8 checkpoints of popular LLMs ready to use with vLLM <https://huggingface.co/collections/neuralmagic/fp8-llms-for-vllm-666742ed2b78b7ac8df13127>`_.
-
-The FP8 types typically supported in hardware have two distinct representations, each useful in different scenarios:
-
- **E4M3**: Consists of 1 sign bit, 4 exponent bits, and 3 bits of mantissa. It can store values up to +/-448 and ``nan``.
- **E5M2**: Consists of 1 sign bit, 5 exponent bits, and 2 bits of mantissa. It can store values up to +/-57344, +/- ``inf``, and ``nan``. The tradeoff for the increased dynamic range is lower precision of the stored values.
-
-.. note::
-
-   FP8 computation is supported on NVIDIA GPUs with compute capability > 8.9 (Ada Lovelace, Hopper).
-   FP8 models will run on compute capability > 8.0 (Ampere) as weight-only W8A16, utilizing FP8 Marlin.
-
-Quick Start with Online Dynamic Quantization
--------------------------------------------
-
-Dynamic quantization of an original precision BF16/FP16 model to FP8 can be achieved with vLLM without any calibration data required. You can enable the feature by specifying ``--quantization="fp8"`` in the command line or setting ``quantization="fp8"`` in the LLM constructor.
-
-In this mode, all Linear modules (except for the final ``lm_head``) have their weights quantized down to FP8_E4M3 precision with a per-tensor scale. Activations have their minimum and maximum values calculated during each forward pass to provide a dynamic per-tensor scale for high accuracy. As a result, latency improvements are limited in this mode.
-
-.. code-block:: python
-
-    from vllm import LLM
-    model = LLM("facebook/opt-125m", quantization="fp8")
-    # INFO 06-10 17:55:42 model_runner.py:157] Loading model weights took 0.1550 GB
-    result = model.generate("Hello, my name is")
-
-.. warning::
-
-    Currently, we load the model at original precision before quantizing down to 8-bits, so you need enough memory to load the whole model.
-
-Installation
------------
-
-To produce performant FP8 quantized models with vLLM, you'll need to install the `llm-compressor <https://github.com/vllm-project/llm-compressor/>`_ library:
-
-.. code-block:: console
-
-   $ pip install llmcompressor
-
-Quantization Process
--------------------
-
-The quantization process involves three main steps:
-
-1. Loading the model
-2. Applying quantization
-3. Evaluating accuracy in vLLM
-
-1. Loading the Model
-^^^^^^^^^^^^^^^^^^^^
-
-Use ``SparseAutoModelForCausalLM``, which wraps ``AutoModelForCausalLM``, for saving and loading quantized models:
-
-.. code-block:: python
-
-   from llmcompressor.transformers import SparseAutoModelForCausalLM
-   from transformers import AutoTokenizer
-
-   MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
-
-   model = SparseAutoModelForCausalLM.from_pretrained(
-     MODEL_ID, device_map="auto", torch_dtype="auto")
-   tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
-
-2. Applying Quantization
-^^^^^^^^^^^^^^^^^^^^^^^^
-
-For FP8 quantization, we can recover accuracy with simple RTN quantization. We recommend targeting all ``Linear`` layers using the ``FP8_DYNAMIC`` scheme, which uses:
-
- Static, per-channel quantization on the weights
- Dynamic, per-token quantization on the activations
-
-Since simple RTN does not require data for weight quantization and the activations are quantized dynamically, we do not need any calibration data for this quantization flow.
-
-.. code-block:: python
-
-   from llmcompressor.transformers import oneshot
-   from llmcompressor.modifiers.quantization import QuantizationModifier
-
-   # Configure the simple PTQ quantization
-   recipe = QuantizationModifier(
-     targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
-
-   # Apply the quantization algorithm.
-   oneshot(model=model, recipe=recipe)
-
-   # Save the model.
-   SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
-   model.save_pretrained(SAVE_DIR)
-   tokenizer.save_pretrained(SAVE_DIR)
-
-3. Evaluating Accuracy
-^^^^^^^^^^^^^^^^^^^^^^
-
-Install ``vllm`` and ``lm-evaluation-harness``:
-
-.. code-block:: console
-
-   $ pip install vllm lm-eval==0.4.4
-
-Load and run the model in ``vllm``:
-
-.. code-block:: python
-
-   from vllm import LLM
-   model = LLM("./Meta-Llama-3-8B-Instruct-FP8-Dynamic")
-   model.generate("Hello my name is")
-
-Evaluate accuracy with ``lm_eval`` (for example on 250 samples of ``gsm8k``):
-
-.. note::
-
-   Quantized models can be sensitive to the presence of the ``bos`` token. ``lm_eval`` does not add a ``bos`` token by default, so make sure to include the ``add_bos_token=True`` argument when running your evaluations.
-
-.. code-block:: console
-
-   $ MODEL=$PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic 
-   $ lm_eval \
-     --model vllm \
-     --model_args pretrained=$MODEL,add_bos_token=True \
-     --tasks gsm8k  --num_fewshot 5 --batch_size auto --limit 250
-
-Here's an example of the resulting scores:
-
-.. code-block:: text
-
-   |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
-   |-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
-   |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.768|±  |0.0268|
-   |     |       |strict-match    |     5|exact_match|↑  |0.768|±  |0.0268|
-
-Troubleshooting and Support
---------------------------
-
-If you encounter any issues or have feature requests, please open an issue on the ``vllm-project/llm-compressor`` GitHub repository.
-
-
-Deprecated Flow
------------------
-
-.. note::
-
-   The following information is preserved for reference and search purposes.
-   The quantization method described below is deprecated in favor of the ``llmcompressor`` method described above.
-
-For static per-tensor offline quantization to FP8, please install the `AutoFP8 library <https://github.com/neuralmagic/autofp8>`_.
-
-.. code-block:: bash
-
-    git clone https://github.com/neuralmagic/AutoFP8.git
-    pip install -e AutoFP8
-
-This package introduces the ``AutoFP8ForCausalLM`` and ``BaseQuantizeConfig`` objects for managing how your model will be compressed.
-
-Offline Quantization with Static Activation Scaling Factors
-----------------------------------------------------------
-
-You can use AutoFP8 with calibration data to produce per-tensor static scales for both the weights and activations by enabling the ``activation_scheme="static"`` argument.
-
-.. code-block:: python
-
-    from datasets import load_dataset
-    from transformers import AutoTokenizer
-    from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig
-
-    pretrained_model_dir = "meta-llama/Meta-Llama-3-8B-Instruct"
-    quantized_model_dir = "Meta-Llama-3-8B-Instruct-FP8"
-
-    tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
-    tokenizer.pad_token = tokenizer.eos_token
-
-    # Load and tokenize 512 dataset samples for calibration of activation scales
-    ds = load_dataset("mgoin/ultrachat_2k", split="train_sft").select(range(512))
-    examples = [tokenizer.apply_chat_template(batch["messages"], tokenize=False) for batch in ds]
-    examples = tokenizer(examples, padding=True, truncation=True, return_tensors="pt").to("cuda")
-
-    # Define quantization config with static activation scales
-    quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="static")
-
-    # Load the model, quantize, and save checkpoint
-    model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
-    model.quantize(examples)
-    model.save_quantized(quantized_model_dir)
-
-Your model checkpoint with quantized weights and activations should be available at ``Meta-Llama-3-8B-Instruct-FP8/``.
-Finally, you can load the quantized model checkpoint directly in vLLM.
-
-.. code-block:: python
-
-    from vllm import LLM
-    model = LLM(model="Meta-Llama-3-8B-Instruct-FP8/")
-    # INFO 06-10 21:15:41 model_runner.py:159] Loading model weights took 8.4596 GB
-    result = model.generate("Hello, my name is")
-
--- a/docs/source/quantization/fp8_e4m3_kvcache.rst
+++ b/docs/source/quantization/fp8_e4m3_kvcache.rst
-.. _fp8_e4m3_kvcache:
+(fp8-e4m3-kvcache)=

-FP8 E4M3 KV Cache
-==================
+# FP8 E4M3 KV Cache

-Quantizing the KV cache to FP8 reduces its memory footprint. This increases the number of tokens that can be stored in the cache, 
-improving throughput. OCP (Open Compute Project www.opencompute.org) specifies two common 8-bit floating point data formats: E5M2 
-(5 exponent bits and 2 mantissa bits) and E4M3FN (4 exponent bits and 3 mantissa bits), often shortened as E4M3. One benefit of 
-the E4M3 format over E5M2 is that floating point numbers are represented in higher precision. However, the small dynamic range of 
-FP8 E4M3 (±240.0 can be represented) typically necessitates the use of a higher-precision (typically FP32) scaling factor alongside 
-each quantized tensor. For now, only per-tensor (scalar) scaling factors are supported. Development is ongoing to support scaling 
+Quantizing the KV cache to FP8 reduces its memory footprint. This increases the number of tokens that can be stored in the cache,
+improving throughput. OCP (Open Compute Project www.opencompute.org) specifies two common 8-bit floating point data formats: E5M2
+(5 exponent bits and 2 mantissa bits) and E4M3FN (4 exponent bits and 3 mantissa bits), often shortened as E4M3. One benefit of
+the E4M3 format over E5M2 is that floating point numbers are represented in higher precision. However, the small dynamic range of
+FP8 E4M3 (±448.0 can be represented) typically necessitates the use of a higher-precision (typically FP32) scaling factor alongside
+each quantized tensor. For now, only per-tensor (scalar) scaling factors are supported. Development is ongoing to support scaling
 factors of a finer granularity (e.g. per-channel).

-These scaling factors can be specified by passing an optional quantization param JSON to the LLM engine at load time. If 
-this JSON is not specified, scaling factors default to 1.0. These scaling factors are typically obtained when running an 
-unquantized model through a quantizer tool (e.g. AMD quantizer or NVIDIA AMMO). 
+These scaling factors can be specified by passing an optional quantization param JSON to the LLM engine at load time. If
+this JSON is not specified, scaling factors default to 1.0. These scaling factors are typically obtained when running an
+unquantized model through a quantizer tool (e.g. AMD quantizer or NVIDIA AMMO).

 To install AMMO (AlgorithMic Model Optimization):

-.. code-block:: console
+```console
+$ pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-ammo
+```

-        $ pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-ammo
-
-Studies have shown that FP8 E4M3 quantization typically only minimally degrades inference accuracy. The most recent silicon 
-offerings e.g. AMD MI300, NVIDIA Hopper or later support native hardware conversion to and from fp32, fp16, bf16, etc. 
+Studies have shown that FP8 E4M3 quantization typically only minimally degrades inference accuracy. The most recent silicon
+offerings e.g. AMD MI300, NVIDIA Hopper or later support native hardware conversion to and from fp32, fp16, bf16, etc.
 Thus, LLM inference is greatly accelerated with minimal accuracy loss.

-
 Here is an example of how to enable this feature:

-.. code-block:: python
-
-        # two float8_e4m3fn kv cache scaling factor files are provided under tests/fp8_kv, please refer to 
-        # https://github.com/vllm-project/vllm/blob/main/examples/fp8/README.md to generate kv_cache_scales.json of your own.
-
-        from vllm import LLM, SamplingParams
-        sampling_params = SamplingParams(temperature=1.3, top_p=0.8)
-        llm = LLM(model="meta-llama/Llama-2-7b-chat-hf",
-                  kv_cache_dtype="fp8",
-                  quantization_param_path="./tests/fp8_kv/llama2-7b-fp8-kv/kv_cache_scales.json")
-        prompt = "London is the capital of"
-        out = llm.generate(prompt, sampling_params)[0].outputs[0].text
-        print(out)
-
-        # output w/ scaling factors:  England, the United Kingdom, and one of the world's leading financial,
-        # output w/o scaling factors:  England, located in the southeastern part of the country. It is known 
-
+```python
+# two float8_e4m3fn kv cache scaling factor files are provided under tests/fp8_kv, please refer to
+# https://github.com/vllm-project/vllm/blob/main/examples/fp8/README.md to generate kv_cache_scales.json of your own.
+
+from vllm import LLM, SamplingParams
+sampling_params = SamplingParams(temperature=1.3, top_p=0.8)
+llm = LLM(model="meta-llama/Llama-2-7b-chat-hf",
+          kv_cache_dtype="fp8",
+          quantization_param_path="./tests/fp8_kv/llama2-7b-fp8-kv/kv_cache_scales.json")
+prompt = "London is the capital of"
+out = llm.generate(prompt, sampling_params)[0].outputs[0].text
+print(out)
+
+# output w/ scaling factors:  England, the United Kingdom, and one of the world's leading financial,
+# output w/o scaling factors:  England, located in the southeastern part of the country. It is known
+```
--- a/docs/source/quantization/fp8_e5m2_kvcache.md
+++ b/docs/source/quantization/fp8_e5m2_kvcache.md
+(fp8-kv-cache)=
+
+# FP8 E5M2 KV Cache
+
+INT8/INT4 quantization schemes require additional GPU memory to store scales, which reduces the expected GPU memory savings.
+The FP8 data format retains 2 to 3 mantissa bits, and values can be converted between FP8 and float/fp16/bfloat16.
+
+Here is an example of how to enable this feature:
+
+```python
+from vllm import LLM, SamplingParams
+# Sample prompts.
+prompts = [
+    "Hello, my name is",
+    "The president of the United States is",
+    "The capital of France is",
+    "The future of AI is",
+]
+# Create a sampling params object.
+sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+# Create an LLM.
+llm = LLM(model="facebook/opt-125m", kv_cache_dtype="fp8")
+# Generate texts from the prompts. The output is a list of RequestOutput objects
+# that contain the prompt, generated text, and other information.
+outputs = llm.generate(prompts, sampling_params)
+# Print the outputs.
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+```
--- a/docs/source/quantization/fp8_e5m2_kvcache.rst
+++ b/docs/source/quantization/fp8_e5m2_kvcache.rst
-.. _fp8_kv_cache:
-
-FP8 E5M2 KV Cache
-==================
-
-The int8/int4 quantization scheme requires additional scale GPU memory storage, which reduces the expected GPU memory benefits.
-The FP8 data format retains 2~3 mantissa bits and can convert float/fp16/bfloat16 and fp8 to each other.
-
-Here is an example of how to enable this feature:
-
-.. code-block:: python
-
-    from vllm import LLM, SamplingParams
-    # Sample prompts.
-    prompts = [
-        "Hello, my name is",
-        "The president of the United States is",
-        "The capital of France is",
-        "The future of AI is",
-    ]
-    # Create a sampling params object.
-    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
-    # Create an LLM.
-    llm = LLM(model="facebook/opt-125m", kv_cache_dtype="fp8")
-    # Generate texts from the prompts. The output is a list of RequestOutput objects
-    # that contain the prompt, generated text, and other information.
-    outputs = llm.generate(prompts, sampling_params)
-    # Print the outputs.
-    for output in outputs:
-        prompt = output.prompt
-        generated_text = output.outputs[0].text
-        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
-
-
--- a/docs/source/quantization/gguf.md
+++ b/docs/source/quantization/gguf.md
+(gguf)=
+
+# GGUF
+
+```{warning}
+GGUF support in vLLM is highly experimental and under-optimized at the moment, and it might be incompatible with other features. Currently, you can use GGUF as a way to reduce memory footprint. If you encounter any issues, please report them to the vLLM team.
+```
+
+```{warning}
+Currently, vLLM only supports loading single-file GGUF models. If you have a multi-file GGUF model, you can use the [gguf-split](https://github.com/ggerganov/llama.cpp/pull/6135) tool to merge it into a single-file model.
+```
+
+To run a GGUF model with vLLM, you can download and use the local GGUF model from [TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF) with the following command:
+
+```console
+$ wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
+$ # We recommend using the tokenizer from the base model to avoid slow and error-prone tokenizer conversion.
+$ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0
+```
+
+You can also add `--tensor-parallel-size 2` to enable tensor-parallel inference across 2 GPUs:
+
+```console
+$ # We recommend using the tokenizer from the base model to avoid slow and error-prone tokenizer conversion.
+$ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tensor-parallel-size 2
+```
+
+```{warning}
+We recommend using the tokenizer from the base model instead of the one in the GGUF model, because tokenizer conversion from GGUF is time-consuming and unstable, especially for models with a large vocabulary.
+```
+
+You can also use the GGUF model directly through the LLM entrypoint:
+
+```python
+from vllm import LLM, SamplingParams
+
+# In this script, we demonstrate how to pass input to the chat method:
+conversation = [
+   {
+      "role": "system",
+      "content": "You are a helpful assistant"
+   },
+   {
+      "role": "user",
+      "content": "Hello"
+   },
+   {
+      "role": "assistant",
+      "content": "Hello! How can I assist you today?"
+   },
+   {
+      "role": "user",
+      "content": "Write an essay about the importance of higher education.",
+   },
+]
+
+# Create a sampling params object.
+sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+
+# Create an LLM.
+llm = LLM(model="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
+         tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
+# Generate a reply to the conversation. The output is a list of RequestOutput objects
+# that contain the prompt, generated text, and other information.
+outputs = llm.chat(conversation, sampling_params)
+
+# Print the outputs.
+for output in outputs:
+   prompt = output.prompt
+   generated_text = output.outputs[0].text
+   print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+```
--- a/docs/source/quantization/gguf.rst
+++ b/docs/source/quantization/gguf.rst
-.. _gguf:
-
-GGUF
-==================
-
-.. warning::
-
-   Please note that GGUF support in vLLM is highly experimental and under-optimized at the moment, it might be incompatible with other features. Currently, you can use GGUF as a way to reduce memory footprint. If you encounter any issues, please report them to the vLLM team.
-
-.. warning::
-
-   Currently, vllm only supports loading single-file GGUF models. If you have a multi-files GGUF model, you can use `gguf-split <https://github.com/ggerganov/llama.cpp/pull/6135>`_ tool to merge them to a single-file model.
-
-To run a GGUF model with vLLM, you can download and use the local GGUF model from `TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF <https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF>`_ with the following command:
-
-.. code-block:: console
-
-   $ wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
-   $ # We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
-   $ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0
-
-You can also add ``--tensor-parallel-size 2`` to enable tensor parallelism inference with 2 GPUs:
-
-.. code-block:: console
-
-   $ # We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
-   $ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tensor-parallel-size 2
-
-.. warning::
-
-   We recommend using the tokenizer from base model instead of GGUF model. Because the tokenizer conversion from GGUF is time-consuming and unstable, especially for some models with large vocab size.
-
-You can also use the GGUF model directly through the LLM entrypoint:
-
-.. code-block:: python
-
-   from vllm import LLM, SamplingParams
-
-   # In this script, we demonstrate how to pass input to the chat method:
-   conversation = [
-      {
-         "role": "system",
-         "content": "You are a helpful assistant"
-      },
-      {
-         "role": "user",
-         "content": "Hello"
-      },
-      {
-         "role": "assistant",
-         "content": "Hello! How can I assist you today?"
-      },
-      {
-         "role": "user",
-         "content": "Write an essay about the importance of higher education.",
-      },
-   ]
-
-   # Create a sampling params object.
-   sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
-
-   # Create an LLM.
-   llm = LLM(model="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
-            tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
-   # Generate texts from the prompts. The output is a list of RequestOutput objects
-   # that contain the prompt, generated text, and other information.
-   outputs = llm.chat(conversation, sampling_params)
-
-   # Print the outputs.
-   for output in outputs:
-      prompt = output.prompt
-      generated_text = output.outputs[0].text
-      print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
--- a/docs/source/quantization/int8.md
+++ b/docs/source/quantization/int8.md
+(int8)=
+
+# INT8 W8A8
+
+vLLM supports quantizing weights and activations to INT8 for memory savings and inference acceleration.
+Storing weights in 8 bits roughly halves model size relative to 16-bit checkpoints, while accuracy usually stays close to that of the original model.
+
+Please visit the HF collection of [quantized INT8 checkpoints of popular LLMs ready to use with vLLM](https://huggingface.co/collections/neuralmagic/int8-llms-for-vllm-668ec32c049dca0369816415).
+
+```{note}
+INT8 computation is supported on NVIDIA GPUs with compute capability >= 7.5 (Turing, Ampere, Ada Lovelace, Hopper).
+```
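+
+As a rough illustration of the note above, the hardware check can be written as a comparison of `(major, minor)` tuples. The helper name `supports_int8` is hypothetical and not part of vLLM; on a real machine the tuple could come from `torch.cuda.get_device_capability()`.
+
+```python
+def supports_int8(capability: tuple[int, int]) -> bool:
+    """Return True if a CUDA compute capability is Turing (7.5) or newer."""
+    # Tuples compare element by element, so (8, 0) >= (7, 5) is True.
+    return capability >= (7, 5)
+
+
+# Example capabilities: Volta (7.0), Turing (7.5), Ampere (8.0), Hopper (9.0)
+for cap in [(7, 0), (7, 5), (8, 0), (9, 0)]:
+    print(cap, supports_int8(cap))
+```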
+
+## Prerequisites
+
+To use INT8 quantization with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:
+
+```console
+$ pip install llmcompressor
+```
+
+## Quantization Process
+
+The quantization process involves four main steps:
+
+1. Loading the model
+2. Preparing calibration data
+3. Applying quantization
+4. Evaluating accuracy in vLLM
+
+### 1. Loading the Model
+
+Use `SparseAutoModelForCausalLM`, which wraps `AutoModelForCausalLM`, for saving and loading quantized models:
+
+```python
+from llmcompressor.transformers import SparseAutoModelForCausalLM
+from transformers import AutoTokenizer
+
+MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
+model = SparseAutoModelForCausalLM.from_pretrained(
+    MODEL_ID, device_map="auto", torch_dtype="auto",
+)
+tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+```
+
+### 2. Preparing Calibration Data
+
+When quantizing activations to INT8, you need sample data to estimate the activation scales.
+It's best to use calibration data that closely matches your deployment data.
+For a general-purpose instruction-tuned model, you can use a dataset like `ultrachat`:
+
+```python
+from datasets import load_dataset
+
+NUM_CALIBRATION_SAMPLES = 512
+MAX_SEQUENCE_LENGTH = 2048
+
+# Load and preprocess the dataset
+ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
+ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
+
+def preprocess(example):
+    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
+ds = ds.map(preprocess)
+
+def tokenize(sample):
+    return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
+ds = ds.map(tokenize, remove_columns=ds.column_names)
+```
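+
+To make "estimating the activation scales" more concrete, here is a minimal sketch that assumes symmetric absmax quantization. The function names are illustrative, and this is not how llm-compressor implements calibration internally; it only shows why representative sample values are needed to pick a scale.
+
+```python
+def absmax_scale(values, num_bits=8):
+    """Pick a scale so the largest magnitude maps to the largest integer."""
+    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
+    peak = max(abs(v) for v in values)
+    return peak / qmax if peak > 0 else 1.0
+
+
+def quantize(values, scale, num_bits=8):
+    """Round to the nearest integer step and clamp to the symmetric range."""
+    qmax = 2 ** (num_bits - 1) - 1
+    return [max(-qmax, min(qmax, round(v / scale))) for v in values]
+
+
+def dequantize(qvalues, scale):
+    """Map integers back to approximate floating-point values."""
+    return [q * scale for q in qvalues]
+
+
+# Hypothetical activations observed on calibration data
+activations = [0.5, -1.27, 0.02, 1.0]
+scale = absmax_scale(activations)
+quantized = quantize(activations, scale)
+restored = dequantize(quantized, scale)
+print(scale, quantized, restored)
+```
+
+If the calibration data never produces the large activations seen in deployment, the estimated scale is too small and those values are clamped, which is one reason the calibration set should resemble real traffic.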
+
+### 3. Applying Quantization
+
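+Now apply the quantization algorithms. The recipe below runs SmoothQuant first, then GPTQ with the `W8A8` scheme on all `Linear` layers except `lm_head`: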
+
+```python
+from llmcompressor.transformers import oneshot
+from llmcompressor.modifiers.quantization import GPTQModifier
+from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
+
+# Configure the quantization algorithms
+recipe = [
+    SmoothQuantModifier(smoothing_strength=0.8),
+    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
+]
+
+# Apply quantization
+oneshot(
+    model=model,
+    dataset=ds,
+    recipe=recipe,
+    max_seq_length=MAX_SEQUENCE_LENGTH,
+    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
+)
+
+# Save the compressed model
+SAVE_DIR = MODEL_ID.split("/")[1] + "-W8A8-Dynamic-Per-Token"
+model.save_pretrained(SAVE_DIR, save_compressed=True)
+tokenizer.save_pretrained(SAVE_DIR)
+```
+
+This process creates a W8A8 model with weights and activations quantized to 8-bit integers.
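+
+For intuition about the SmoothQuant step in the recipe, the sketch below computes per-channel smoothing factors in the form described in the SmoothQuant paper, `s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)`. It assumes that `smoothing_strength` plays the role of `alpha`; the function name and the sample values are hypothetical, and this is not llm-compressor's code.
+
+```python
+def smoothing_scales(act_absmax, weight_absmax, alpha=0.8):
+    """Per-channel factors that shift quantization difficulty from activations to weights."""
+    return [
+        (a ** alpha) / (w ** (1.0 - alpha))
+        for a, w in zip(act_absmax, weight_absmax)
+    ]
+
+
+# Hypothetical per-channel maxima: channel 0 has an activation outlier
+act_absmax = [16.0, 1.0]
+weight_absmax = [1.0, 1.0]
+
+scales = smoothing_scales(act_absmax, weight_absmax, alpha=0.5)
+
+# Activations are divided by the factor and weights are multiplied by it,
+# so the product of the two is unchanged.
+smoothed_act = [a / s for a, s in zip(act_absmax, scales)]
+smoothed_weight = [w * s for w, s in zip(weight_absmax, scales)]
+print(scales, smoothed_act, smoothed_weight)
+```
+
+A larger `alpha` moves more of the outlier range into the weights, which are generally easier to quantize than activations.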
+
+### 4. Evaluating Accuracy
+
+After quantization, you can load and run the model in vLLM:
+
+```python
+from vllm import LLM
+model = LLM("./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token")
+```
+
+To evaluate accuracy, you can use `lm_eval`:
+
+```console
+$ lm_eval --model vllm \
+  --model_args pretrained="./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token",add_bos_token=true \
+  --tasks gsm8k \
+  --num_fewshot 5 \
+  --limit 250 \
+  --batch_size 'auto'
+```
+
+```{note}
+Quantized models can be sensitive to the presence of the `bos` token. Make sure to include `add_bos_token=true` in `--model_args` when running evaluations.
+```
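+
+Conceptually, adding the `bos` token means every tokenized prompt starts with the tokenizer's beginning-of-sequence id. A minimal sketch follows, using a hypothetical helper `ensure_bos` that is not part of vLLM or `lm_eval`; with a Hugging Face tokenizer the id would typically be read from `tokenizer.bos_token_id`.
+
+```python
+def ensure_bos(token_ids: list[int], bos_token_id: int) -> list[int]:
+    """Return a copy of token_ids that starts with bos_token_id."""
+    if token_ids and token_ids[0] == bos_token_id:
+        return list(token_ids)
+    return [bos_token_id] + list(token_ids)
+
+
+# Hypothetical ids: 1 stands in for the bos token
+print(ensure_bos([5, 6, 7], bos_token_id=1))
+print(ensure_bos([1, 5, 6], bos_token_id=1))
+```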
+
+## Best Practices
+
+- Start with 512 samples for calibration data (increase if accuracy drops)
+- Use a sequence length of 2048 as a starting point
+- Employ the chat template or instruction template that the model was trained with
+- If you've fine-tuned a model, consider using a sample of your training data for calibration
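+
+The last point can be sketched as a reproducible random draw from your own training records. The helper name `sample_calibration` and the placeholder data are assumptions for illustration; the defaults mirror the 512-sample starting point suggested above.
+
+```python
+import random
+
+
+def sample_calibration(records, num_samples=512, seed=42):
+    """Draw a fixed-seed random subset of records for calibration."""
+    rng = random.Random(seed)
+    # Never request more samples than are available.
+    k = min(num_samples, len(records))
+    return rng.sample(records, k)
+
+
+# Placeholder for fine-tuning data; replace with your own records
+training_texts = [f"example {i}" for i in range(2000)]
+calibration = sample_calibration(training_texts)
+print(len(calibration))
+```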
+
+## Troubleshooting and Support
+
+If you encounter any issues or have feature requests, please open an issue on the `vllm-project/llm-compressor` GitHub repository.
--- a/docs/source/quantization/int8.rst
+++ b/docs/source/quantization/int8.rst
-.. _int8:
-
-INT8 W8A8
-==================
-
-vLLM supports quantizing weights and activations to INT8 for memory savings and inference acceleration.
-This quantization method is particularly useful for reducing model size while maintaining good performance.
-
-Please visit the HF collection of `quantized INT8 checkpoints of popular LLMs ready to use with vLLM <https://huggingface.co/collections/neuralmagic/int8-llms-for-vllm-668ec32c049dca0369816415>`_.
-
-.. note::
-
-   INT8 computation is supported on NVIDIA GPUs with compute capability > 7.5 (Turing, Ampere, Ada Lovelace, Hopper).
-
-Prerequisites
-------------
-
-To use INT8 quantization with vLLM, you'll need to install the `llm-compressor <https://github.com/vllm-project/llm-compressor/>`_ library:
-
-.. code-block:: console
-
-   $ pip install llmcompressor
-
-Quantization Process
--------------------
-
-The quantization process involves four main steps:
-
-1. Loading the model
-2. Preparing calibration data
-3. Applying quantization
-4. Evaluating accuracy in vLLM
-
-1. Loading the Model
-^^^^^^^^^^^^^^^^^^^^
-
-Use ``SparseAutoModelForCausalLM``, which wraps ``AutoModelForCausalLM``, for saving and loading quantized models:
-
-.. code-block:: python
-
-   from llmcompressor.transformers import SparseAutoModelForCausalLM
-   from transformers import AutoTokenizer
-
-   MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
-   model = SparseAutoModelForCausalLM.from_pretrained(
-       MODEL_ID, device_map="auto", torch_dtype="auto",
-   )
-   tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
-
-2. Preparing Calibration Data
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-When quantizing activations to INT8, you need sample data to estimate the activation scales.
-It's best to use calibration data that closely matches your deployment data. 
-For a general-purpose instruction-tuned model, you can use a dataset like ``ultrachat``:
-
-.. code-block:: python
-
-   from datasets import load_dataset
-
-   NUM_CALIBRATION_SAMPLES = 512
-   MAX_SEQUENCE_LENGTH = 2048
-
-   # Load and preprocess the dataset
-   ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
-   ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
-
-   def preprocess(example):
-       return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
-   ds = ds.map(preprocess)
-
-   def tokenize(sample):
-       return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
-   ds = ds.map(tokenize, remove_columns=ds.column_names)
-
-3. Applying Quantization
-^^^^^^^^^^^^^^^^^^^^^^^^
-
-Now, apply the quantization algorithms:
-
-.. code-block:: python
-
-   from llmcompressor.transformers import oneshot
-   from llmcompressor.modifiers.quantization import GPTQModifier
-   from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
-
-   # Configure the quantization algorithms
-   recipe = [
-       SmoothQuantModifier(smoothing_strength=0.8),
-       GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
-   ]
-
-   # Apply quantization
-   oneshot(
-       model=model,
-       dataset=ds,
-       recipe=recipe,
-       max_seq_length=MAX_SEQUENCE_LENGTH,
-       num_calibration_samples=NUM_CALIBRATION_SAMPLES,
-   )
-
-   # Save the compressed model
-   SAVE_DIR = MODEL_ID.split("/")[1] + "-W8A8-Dynamic-Per-Token"
-   model.save_pretrained(SAVE_DIR, save_compressed=True)
-   tokenizer.save_pretrained(SAVE_DIR)
-
-This process creates a W8A8 model with weights and activations quantized to 8-bit integers.
-
-4. Evaluating Accuracy
-^^^^^^^^^^^^^^^^^^^^^^
-
-After quantization, you can load and run the model in vLLM:
-
-.. code-block:: python
-
-   from vllm import LLM
-   model = LLM("./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token")
-
-To evaluate accuracy, you can use ``lm_eval``:
-
-.. code-block:: console
-
-   $ lm_eval --model vllm \
-     --model_args pretrained="./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token",add_bos_token=true \
-     --tasks gsm8k \
-     --num_fewshot 5 \
-     --limit 250 \
-     --batch_size 'auto'
-
-.. note::
-
-   Quantized models can be sensitive to the presence of the ``bos`` token. Make sure to include the ``add_bos_token=True`` argument when running evaluations.
-
-Best Practices
--------------
-
-- Start with 512 samples for calibration data (increase if accuracy drops)
-- Use a sequence length of 2048 as a starting point
-- Employ the chat template or instruction template that the model was trained with
-- If you've fine-tuned a model, consider using a sample of your training data for calibration
-
-Troubleshooting and Support
---------------------------
-
-If you encounter any issues or have feature requests, please open an issue on the ``vllm-project/llm-compressor`` GitHub repository.