Migrate docs from Sphinx to MkDocs (#18145)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

Commit a1fe24d9 (unverified), authored May 23, 2025 by Harry Mellor, committed by GitHub on May 23, 2025
20 changed files
--- a/docs/source/serving/integrations/langchain.md
+++ b/docs/source/serving/integrations/langchain.md
-(serving-langchain)=
-
-# LangChain
+---
+title: LangChain
+---
+[](){ #serving-langchain }

 vLLM is also available via [LangChain](https://github.com/langchain-ai/langchain) .


--- a/docs/source/serving/integrations/llamaindex.md
+++ b/docs/source/serving/integrations/llamaindex.md
-(serving-llamaindex)=
-
-# LlamaIndex
+---
+title: LlamaIndex
+---
+[](){ #serving-llamaindex }

 vLLM is also available via [LlamaIndex](https://github.com/run-llama/llama_index) .


--- a/docs/source/serving/metrics.md
+++ b/docs/source/serving/metrics.md
@@ -4,7 +4,7 @@ vLLM exposes a number of metrics that can be used to monitor the health of the
 system. These metrics are exposed via the `/metrics` endpoint on the vLLM
 OpenAI compatible API server.

-You can start the server using Python, or using [Docker](#deployment-docker):
+You can start the server using Python, or using [Docker][deployment-docker]:

 ```console
 vllm serve unsloth/Llama-3.2-1B-Instruct
@@ -31,11 +31,9 @@ vllm:iteration_tokens_total_bucket{le="512.0",model_name="unsloth/Llama-3.2-1B-I

 The following metrics are exposed:

-:::{literalinclude} ../../../vllm/engine/metrics.py
-:end-before: end-metrics-definitions
-:language: python
-:start-after: begin-metrics-definitions
-:::
+```python
+--8<-- "vllm/engine/metrics.py:metrics-definitions"
+```

 The following metrics are deprecated and due to be removed in a future version:


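Note on `metrics.md`: the `/metrics` endpoint returns Prometheus text-format samples like the `vllm:iteration_tokens_total_bucket{...}` line in the hunk header above. The following is a minimal sketch of reading one such sample, not part of this commit. The helper names, the sample value `3.0`, and the `http://localhost:8000` address are assumptions for illustration, and the label parser is deliberately naive (it assumes no commas or spaces inside label values and no trailing timestamp).

```python
from urllib.request import urlopen


def parse_metric_line(line):
    """Split one Prometheus text-format sample into (name, labels, value).

    Returns None for blank lines and for '# HELP' / '# TYPE' comment lines.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # The value is the last space-separated field (timestamps are not handled).
    head, _, value = line.rpartition(" ")
    labels = {}
    if "{" in head:
        name, _, rest = head.partition("{")
        for pair in rest.rstrip("}").split(","):
            if pair:
                key, _, val = pair.partition("=")
                labels[key.strip()] = val.strip().strip('"')
    else:
        name = head
    return name, labels, float(value)


def fetch_metrics(base_url="http://localhost:8000"):
    """Fetch and parse every sample from a running server (hypothetical address)."""
    with urlopen(f"{base_url}/metrics") as resp:
        text = resp.read().decode("utf-8")
    return [s for s in map(parse_metric_line, text.splitlines()) if s is not None]


# Illustrative sample; the value 3.0 is made up.
sample = 'vllm:iteration_tokens_total_bucket{le="512.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0'
name, labels, value = parse_metric_line(sample)
print(name, labels["le"], value)
```

For anything beyond a quick check, a real Prometheus client or scraper is the safer choice, since the text format allows escaping that this sketch ignores.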
--- a/docs/source/serving/offline_inference.md
+++ b/docs/source/serving/offline_inference.md
-(offline-inference)=
-
-# Offline Inference
+---
+title: Offline Inference
+---
+[](){ #offline-inference }

 You can run vLLM in your own code on a list of prompts.

-The offline API is based on the {class}`~vllm.LLM` class.
+The offline API is based on the [LLM][vllm.LLM] class.
 To initialize the vLLM engine, create a new instance of `LLM` and specify the model to run.

 For example, the following code downloads the [`facebook/opt-125m`](https://huggingface.co/facebook/opt-125m) model from HuggingFace
@@ -19,23 +20,22 @@ llm = LLM(model="facebook/opt-125m")
 After initializing the `LLM` instance, you can perform model inference using various APIs.
 The available APIs depend on the type of model that is being run:

- [Generative models](#generative-models) output logprobs which are sampled from to obtain the final output text.
- [Pooling models](#pooling-models) output their hidden states directly.
+- [Generative models][generative-models] output logprobs which are sampled from to obtain the final output text.
+- [Pooling models][pooling-models] output their hidden states directly.

 Please refer to the above pages for more details about each API.

-:::{seealso}
-[API Reference](#offline-inference-api)
-:::
+!!! info
+    [API Reference][offline-inference-api]

-(configuration-options)=
+[](){ #configuration-options }

 ## Configuration Options

 This section lists the most common options for running the vLLM engine.
-For a full list, refer to the <project:#configuration> page.
+For a full list, refer to the [configuration][configuration] page.

-(model-resolution)=
+[](){ #model-resolution }

 ### Model resolution

@@ -59,9 +59,9 @@ model = LLM(
 )
 ```

-Our [list of supported models](#supported-models) shows the model architectures that are recognized by vLLM.
+Our [list of supported models][supported-models] shows the model architectures that are recognized by vLLM.

-(reducing-memory-usage)=
+[](){ #reducing-memory-usage }

 ### Reducing memory usage

@@ -80,18 +80,16 @@ llm = LLM(model="ibm-granite/granite-3.1-8b-instruct",
          tensor_parallel_size=2)
 ```

-:::{important}
-To ensure that vLLM initializes CUDA correctly, you should avoid calling related functions (e.g. {func}`torch.cuda.set_device`)
-before initializing vLLM. Otherwise, you may run into an error like `RuntimeError: Cannot re-initialize CUDA in forked subprocess`.
+!!! warning
+    To ensure that vLLM initializes CUDA correctly, you should avoid calling related functions (e.g. [torch.cuda.set_device][])
+    before initializing vLLM. Otherwise, you may run into an error like `RuntimeError: Cannot re-initialize CUDA in forked subprocess`.

-To control which devices are used, please instead set the `CUDA_VISIBLE_DEVICES` environment variable.
-:::
+    To control which devices are used, please instead set the `CUDA_VISIBLE_DEVICES` environment variable.

-:::{note}
-With tensor parallelism enabled, each process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism).
+!!! note
+    With tensor parallelism enabled, each process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism).

-You can convert the model checkpoint to a sharded checkpoint using <gh-file:examples/offline_inference/save_sharded_state.py>. The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
-:::
+    You can convert the model checkpoint to a sharded checkpoint using <gh-file:examples/offline_inference/save_sharded_state.py>. The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.

 #### Quantization

@@ -100,7 +98,7 @@ Quantized models take less memory at the cost of lower precision.
 Statically quantized models can be downloaded from HF Hub (some popular ones are available at [Red Hat AI](https://huggingface.co/RedHatAI))
 and used directly without extra configuration.

-Dynamic quantization is also supported via the `quantization` option -- see [here](#quantization-index) for more details.
+Dynamic quantization is also supported via the `quantization` option -- see [here][quantization-index] for more details.

 #### Context length and batch size

@@ -119,9 +117,8 @@ llm = LLM(model="adept/fuyu-8b",

 By default, we optimize model inference using CUDA graphs which take up extra memory in the GPU.

-:::{important}
-CUDA graph capture takes up more memory in V1 than in V0.
-:::
+!!! warning
+    CUDA graph capture takes up more memory in V1 than in V0.

 You can adjust `compilation_config` to achieve a better balance between inference speed and memory usage:

@@ -214,4 +211,4 @@ llm = LLM(model="OpenGVLab/InternVL2-2B",
 ### Performance optimization and tuning

 You can potentially improve the performance of vLLM by finetuning various options.
-Please refer to [this guide](#optimization-and-tuning) for more details.
+Please refer to [this guide][optimization-and-tuning] for more details.
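Note on the CUDA warning in `offline_inference.md`: the advice is to choose devices through `CUDA_VISIBLE_DEVICES` rather than calling `torch.cuda.set_device` before vLLM starts. A minimal sketch of that ordering follows. The helper `select_gpus` is hypothetical, and the commented-out `LLM(...)` call simply repeats the arguments shown in the hunk above.

```python
import os


def select_gpus(device_ids):
    """Restrict this process to the given GPU indices via the environment.

    Must run before anything initializes CUDA, so call it at the very top
    of the script, ahead of importing or constructing the engine.
    """
    value = ",".join(str(i) for i in device_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


select_gpus([0, 1])

# Only after the variable is set would the engine be created, for example:
# from vllm import LLM
# llm = LLM(model="ibm-granite/granite-3.1-8b-instruct", tensor_parallel_size=2)
```

Setting the variable in the shell (`CUDA_VISIBLE_DEVICES=0,1 python script.py`) has the same effect and avoids any ordering concerns inside the script.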
--- a/docs/source/serving/openai_compatible_server.md
+++ b/docs/source/serving/openai_compatible_server.md
-(openai-compatible-server)=
-
-# OpenAI-Compatible Server
+---
+title: OpenAI-Compatible Server
+---
+[](){ #openai-compatible-server }

 vLLM provides an HTTP server that implements OpenAI's [Completions API](https://platform.openai.com/docs/api-reference/completions), [Chat API](https://platform.openai.com/docs/api-reference/chat), and more! This functionality lets you serve models and interact with them using an HTTP client.

-In your terminal, you can [install](../getting_started/installation.md) vLLM, then start the server with the [`vllm serve`](#serve-args) command. (You can also use our [Docker](#deployment-docker) image.)
+In your terminal, you can [install](../getting_started/installation.md) vLLM, then start the server with the [`vllm serve`][serve-args] command. (You can also use our [Docker][deployment-docker] image.)

 ```bash
 vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123
@@ -20,58 +21,56 @@ client = OpenAI(
 )

 completion = client.chat.completions.create(
-  model="NousResearch/Meta-Llama-3-8B-Instruct",
-  messages=[
-    {"role": "user", "content": "Hello!"}
-  ]
+    model="NousResearch/Meta-Llama-3-8B-Instruct",
+    messages=[
+        {"role": "user", "content": "Hello!"}
+    ]
 )

 print(completion.choices[0].message)
 ```

-:::{tip}
-vLLM supports some parameters that are not supported by OpenAI, `top_k` for example.
-You can pass these parameters to vLLM using the OpenAI client in the `extra_body` parameter of your requests, i.e. `extra_body={"top_k": 50}` for `top_k`.
-:::
+!!! tip
+    vLLM supports some parameters that are not supported by OpenAI, `top_k` for example.
+    You can pass these parameters to vLLM using the OpenAI client in the `extra_body` parameter of your requests, i.e. `extra_body={"top_k": 50}` for `top_k`.

-:::{important}
-By default, the server applies `generation_config.json` from the Hugging Face model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.
+!!! warning
+    By default, the server applies `generation_config.json` from the Hugging Face model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.

-To disable this behavior, please pass `--generation-config vllm` when launching the server.
-:::
+    To disable this behavior, please pass `--generation-config vllm` when launching the server.

 ## Supported APIs

 We currently support the following OpenAI APIs:

- [Completions API](#completions-api) (`/v1/completions`)
-  - Only applicable to [text generation models](../models/generative_models.md) (`--task generate`).
-  - *Note: `suffix` parameter is not supported.*
- [Chat Completions API](#chat-api) (`/v1/chat/completions`)
-  - Only applicable to [text generation models](../models/generative_models.md) (`--task generate`) with a [chat template](#chat-template).
-  - *Note: `parallel_tool_calls` and `user` parameters are ignored.*
- [Embeddings API](#embeddings-api) (`/v1/embeddings`)
-  - Only applicable to [embedding models](../models/pooling_models.md) (`--task embed`).
- [Transcriptions API](#transcriptions-api) (`/v1/audio/transcriptions`)
-  - Only applicable to Automatic Speech Recognition (ASR) models (OpenAI Whisper) (`--task generate`).
+- [Completions API][completions-api] (`/v1/completions`)
+    - Only applicable to [text generation models](../models/generative_models.md) (`--task generate`).
+    - *Note: `suffix` parameter is not supported.*
+- [Chat Completions API][chat-api] (`/v1/chat/completions`)
+    - Only applicable to [text generation models](../models/generative_models.md) (`--task generate`) with a [chat template][chat-template].
+    - *Note: `parallel_tool_calls` and `user` parameters are ignored.*
+- [Embeddings API][embeddings-api] (`/v1/embeddings`)
+    - Only applicable to [embedding models](../models/pooling_models.md) (`--task embed`).
+- [Transcriptions API][transcriptions-api] (`/v1/audio/transcriptions`)
+    - Only applicable to Automatic Speech Recognition (ASR) models (OpenAI Whisper) (`--task generate`).

 In addition, we have the following custom APIs:

- [Tokenizer API](#tokenizer-api) (`/tokenize`, `/detokenize`)
-  - Applicable to any model with a tokenizer.
- [Pooling API](#pooling-api) (`/pooling`)
-  - Applicable to all [pooling models](../models/pooling_models.md).
- [Classification API](#classification-api) (`/classify`)
-  - Only applicable to [classification models](../models/pooling_models.md) (`--task classify`).
- [Score API](#score-api) (`/score`)
-  - Applicable to embedding models and [cross-encoder models](../models/pooling_models.md) (`--task score`).
- [Re-rank API](#rerank-api) (`/rerank`, `/v1/rerank`, `/v2/rerank`)
-  - Implements [Jina AI's v1 re-rank API](https://jina.ai/reranker/)
-  - Also compatible with [Cohere's v1 & v2 re-rank APIs](https://docs.cohere.com/v2/reference/rerank)
-  - Jina and Cohere's APIs are very similar; Jina's includes extra information in the rerank endpoint's response.
-  - Only applicable to [cross-encoder models](../models/pooling_models.md) (`--task score`).
-
-(chat-template)=
+- [Tokenizer API][tokenizer-api] (`/tokenize`, `/detokenize`)
+    - Applicable to any model with a tokenizer.
+- [Pooling API][pooling-api] (`/pooling`)
+    - Applicable to all [pooling models](../models/pooling_models.md).
+- [Classification API][classification-api] (`/classify`)
+    - Only applicable to [classification models](../models/pooling_models.md) (`--task classify`).
+- [Score API][score-api] (`/score`)
+    - Applicable to embedding models and [cross-encoder models](../models/pooling_models.md) (`--task score`).
+- [Re-rank API][rerank-api] (`/rerank`, `/v1/rerank`, `/v2/rerank`)
+    - Implements [Jina AI's v1 re-rank API](https://jina.ai/reranker/)
+    - Also compatible with [Cohere's v1 & v2 re-rank APIs](https://docs.cohere.com/v2/reference/rerank)
+    - Jina and Cohere's APIs are very similar; Jina's includes extra information in the rerank endpoint's response.
+    - Only applicable to [cross-encoder models](../models/pooling_models.md) (`--task score`).
+
+[](){ #chat-template }

 ## Chat Template

@@ -97,10 +96,10 @@ both a `type` and a `text` field. An example is provided below:

 ```python
 completion = client.chat.completions.create(
-  model="NousResearch/Meta-Llama-3-8B-Instruct",
-  messages=[
-    {"role": "user", "content": [{"type": "text", "text": "Classify this sentiment: vLLM is wonderful!"}]}
-  ]
+    model="NousResearch/Meta-Llama-3-8B-Instruct",
+    messages=[
+        {"role": "user", "content": [{"type": "text", "text": "Classify this sentiment: vLLM is wonderful!"}]}
+    ]
 )
 ```

@@ -111,9 +110,9 @@ request. vLLM provides best-effort support to detect this automatically, which i
 the detected format, which can be one of:

 - `"string"`: A string.
-  - Example: `"Hello world"`
+    - Example: `"Hello world"`
 - `"openai"`: A list of dictionaries, similar to OpenAI schema.
-  - Example: `[{"type": "text", "text": "Hello world!"}]`
+    - Example: `[{"type": "text", "text": "Hello world!"}]`

 If the result is not what you expect, you can set the `--chat-template-content-format` CLI argument
 to override which format to use.
@@ -126,13 +125,13 @@ Or directly merge them into the JSON payload if you are using HTTP call directly

 ```python
 completion = client.chat.completions.create(
-  model="NousResearch/Meta-Llama-3-8B-Instruct",
-  messages=[
-    {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
-  ],
-  extra_body={
-    "guided_choice": ["positive", "negative"]
-  }
+    model="NousResearch/Meta-Llama-3-8B-Instruct",
+    messages=[
+        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
+    ],
+    extra_body={
+        "guided_choice": ["positive", "negative"]
+    }
 )
 ```

@@ -148,29 +147,29 @@ with `--enable-request-id-headers`.

 ```python
 completion = client.chat.completions.create(
-  model="NousResearch/Meta-Llama-3-8B-Instruct",
-  messages=[
-    {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
-  ],
-  extra_headers={
-    "x-request-id": "sentiment-classification-00001",
-  }
+    model="NousResearch/Meta-Llama-3-8B-Instruct",
+    messages=[
+        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
+    ],
+    extra_headers={
+        "x-request-id": "sentiment-classification-00001",
+    }
 )
 print(completion._request_id)

 completion = client.completions.create(
-  model="NousResearch/Meta-Llama-3-8B-Instruct",
-  prompt="A robot may not injure a human being",
-  extra_headers={
-    "x-request-id": "completion-test",
-  }
+    model="NousResearch/Meta-Llama-3-8B-Instruct",
+    prompt="A robot may not injure a human being",
+    extra_headers={
+        "x-request-id": "completion-test",
+    }
 )
 print(completion._request_id)
 ```

 ## API Reference

-(completions-api)=
+[](){ #completions-api }

 ### Completions API

@@ -181,23 +180,19 @@ Code example: <gh-file:examples/online_serving/openai_completion_client.py>

 #### Extra parameters

-The following [sampling parameters](#sampling-params) are supported.
+The following [sampling parameters][sampling-params] are supported.

-:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
-:language: python
-:start-after: begin-completion-sampling-params
-:end-before: end-completion-sampling-params
-:::
+```python
+--8<-- "vllm/entrypoints/openai/protocol.py:completion-sampling-params"
+```

 The following extra parameters are supported:

-:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
-:language: python
-:start-after: begin-completion-extra-params
-:end-before: end-completion-extra-params
-:::
+```python
+--8<-- "vllm/entrypoints/openai/protocol.py:completion-extra-params"
+```

-(chat-api)=
+[](){ #chat-api }

 ### Chat API

@@ -206,37 +201,33 @@ you can use the [official OpenAI Python client](https://github.com/openai/openai

 We support both [Vision](https://platform.openai.com/docs/guides/vision)- and
 [Audio](https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in)-related parameters;
-see our [Multimodal Inputs](#multimodal-inputs) guide for more information.
+see our [Multimodal Inputs][multimodal-inputs] guide for more information.
 - *Note: `image_url.detail` parameter is not supported.*

 Code example: <gh-file:examples/online_serving/openai_chat_completion_client.py>

 #### Extra parameters

-The following [sampling parameters](#sampling-params) are supported.
+The following [sampling parameters][sampling-params] are supported.

-:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
-:language: python
-:start-after: begin-chat-completion-sampling-params
-:end-before: end-chat-completion-sampling-params
-:::
+```python
+--8<-- "vllm/entrypoints/openai/protocol.py:chat-completion-sampling-params"
+```

 The following extra parameters are supported:

-:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
-:language: python
-:start-after: begin-chat-completion-extra-params
-:end-before: end-chat-completion-extra-params
-:::
+```python
+--8<-- "vllm/entrypoints/openai/protocol.py:chat-completion-extra-params"
+```

-(embeddings-api)=
+[](){ #embeddings-api }

 ### Embeddings API

 Our Embeddings API is compatible with [OpenAI's Embeddings API](https://platform.openai.com/docs/api-reference/embeddings);
 you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.

-If the model has a [chat template](#chat-template), you can replace `inputs` with a list of `messages` (same schema as [Chat API](#chat-api))
+If the model has a [chat template][chat-template], you can replace `inputs` with a list of `messages` (same schema as [Chat API][chat-api])
 which will be treated as a single prompt to the model.

 Code example: <gh-file:examples/online_serving/openai_embedding_client.py>
@@ -246,138 +237,117 @@ Code example: <gh-file:examples/online_serving/openai_embedding_client.py>
 You can pass multi-modal inputs to embedding models by defining a custom chat template for the server
 and passing a list of `messages` in the request. Refer to the examples below for illustration.

-:::::{tab-set}
-::::{tab-item} VLM2Vec
-
-To serve the model:
+=== "VLM2Vec"

-```bash
-vllm serve TIGER-Lab/VLM2Vec-Full --task embed \
-  --trust-remote-code --max-model-len 4096 --chat-template examples/template_vlm2vec.jinja
-```
+    To serve the model:

-:::{important}
-Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--task embed`
-to run this model in embedding mode instead of text generation mode.
+    ```bash
+    vllm serve TIGER-Lab/VLM2Vec-Full --task embed \
+      --trust-remote-code --max-model-len 4096 --chat-template examples/template_vlm2vec.jinja
+    ```

-The custom chat template is completely different from the original one for this model,
-and can be found here: <gh-file:examples/template_vlm2vec.jinja>
-:::
+    !!! warning
+        Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--task embed`
+        to run this model in embedding mode instead of text generation mode.

-Since the request schema is not defined by OpenAI client, we post a request to the server using the lower-level `requests` library:
+        The custom chat template is completely different from the original one for this model,
+        and can be found here: <gh-file:examples/template_vlm2vec.jinja>

-```python
-import requests
-
-image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
-
-response = requests.post(
-    "http://localhost:8000/v1/embeddings",
-    json={
-        "model": "TIGER-Lab/VLM2Vec-Full",
-        "messages": [{
-            "role": "user",
-            "content": [
-                {"type": "image_url", "image_url": {"url": image_url}},
-                {"type": "text", "text": "Represent the given image."},
-            ],
-        }],
-        "encoding_format": "float",
-    },
-)
-response.raise_for_status()
-response_json = response.json()
-print("Embedding output:", response_json["data"][0]["embedding"])
-```
+    Since the request schema is not defined by OpenAI client, we post a request to the server using the lower-level `requests` library:

-::::
+    ```python
+    import requests

-::::{tab-item} DSE-Qwen2-MRL
+    image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"

-To serve the model:
+    response = requests.post(
+        "http://localhost:8000/v1/embeddings",
+        json={
+            "model": "TIGER-Lab/VLM2Vec-Full",
+            "messages": [{
+                "role": "user",
+                "content": [
+                    {"type": "image_url", "image_url": {"url": image_url}},
+                    {"type": "text", "text": "Represent the given image."},
+                ],
+            }],
+            "encoding_format": "float",
+        },
+    )
+    response.raise_for_status()
+    response_json = response.json()
+    print("Embedding output:", response_json["data"][0]["embedding"])
+    ```

-```bash
-vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \
-  --trust-remote-code --max-model-len 8192 --chat-template examples/template_dse_qwen2_vl.jinja
-```
+=== "DSE-Qwen2-MRL"

-:::{important}
-Like with VLM2Vec, we have to explicitly pass `--task embed`.
+    To serve the model:

-Additionally, `MrLight/dse-qwen2-2b-mrl-v1` requires an EOS token for embeddings, which is handled
-by a custom chat template: <gh-file:examples/template_dse_qwen2_vl.jinja>
-:::
+    ```bash
+    vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \
+      --trust-remote-code --max-model-len 8192 --chat-template examples/template_dse_qwen2_vl.jinja
+    ```

-:::{important}
-`MrLight/dse-qwen2-2b-mrl-v1` requires a placeholder image of the minimum image size for text query embeddings. See the full code
-example below for details.
-:::
+    !!! warning
+        Like with VLM2Vec, we have to explicitly pass `--task embed`.

-::::
+        Additionally, `MrLight/dse-qwen2-2b-mrl-v1` requires an EOS token for embeddings, which is handled
+        by a custom chat template: <gh-file:examples/template_dse_qwen2_vl.jinja>

-:::::
+    !!! warning
+        `MrLight/dse-qwen2-2b-mrl-v1` requires a placeholder image of the minimum image size for text query embeddings. See the full code
+        example below for details.

 Full example: <gh-file:examples/online_serving/openai_chat_embedding_client_for_multimodal.py>

 #### Extra parameters

-The following [pooling parameters](#pooling-params) are supported.
+The following [pooling parameters][pooling-params] are supported.

-:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
-:language: python
-:start-after: begin-embedding-pooling-params
-:end-before: end-embedding-pooling-params
-:::
+```python
+--8<-- "vllm/entrypoints/openai/protocol.py:embedding-pooling-params"
+```

 The following extra parameters are supported by default:

-:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
-:language: python
-:start-after: begin-embedding-extra-params
-:end-before: end-embedding-extra-params
-:::
+```python
+--8<-- "vllm/entrypoints/openai/protocol.py:embedding-extra-params"
+```

 For chat-like input (i.e. if `messages` is passed), these extra parameters are supported instead:

-:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
-:language: python
-:start-after: begin-chat-embedding-extra-params
-:end-before: end-chat-embedding-extra-params
-:::
+```python
+--8<-- "vllm/entrypoints/openai/protocol.py:chat-embedding-extra-params"
+```

-(transcriptions-api)=
+[](){ #transcriptions-api }

 ### Transcriptions API

 Our Transcriptions API is compatible with [OpenAI's Transcriptions API](https://platform.openai.com/docs/api-reference/audio/createTranscription);
 you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.

-:::{note}
-To use the Transcriptions API, please install with extra audio dependencies using `pip install vllm[audio]`.
-:::
+!!! note
+    To use the Transcriptions API, please install with extra audio dependencies using `pip install vllm[audio]`.

 Code example: <gh-file:examples/online_serving/openai_transcription_client.py>
 <!-- TODO: api enforced limits + uploading audios -->

 #### Extra Parameters

-The following [sampling parameters](#sampling-params) are supported.
+The following [sampling parameters][sampling-params] are supported.

-:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
-:language: python
-:start-after: begin-transcription-sampling-params
-:end-before: end-transcription-sampling-params
-:::
+```python
+--8<-- "vllm/entrypoints/openai/protocol.py:transcription-sampling-params"
+```

 The following extra parameters are supported:

-:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
-:language: python
-:start-after: begin-transcription-extra-params
-:end-before: end-transcription-extra-params
-:::
+```python
+--8<-- "vllm/entrypoints/openai/protocol.py:transcription-extra-params"
+```

-(tokenizer-api)=
+[](){ #tokenizer-api }

 ### Tokenizer API

@@ -387,17 +357,17 @@ It consists of two endpoints:
 - `/tokenize` corresponds to calling `tokenizer.encode()`.
 - `/detokenize` corresponds to calling `tokenizer.decode()`.

-(pooling-api)=
+[](){ #pooling-api }

 ### Pooling API

 Our Pooling API encodes input prompts using a [pooling model](../models/pooling_models.md) and returns the corresponding hidden states.

-The input format is the same as [Embeddings API](#embeddings-api), but the output data can contain an arbitrary nested list, not just a 1-D list of floats.
+The input format is the same as [Embeddings API][embeddings-api], but the output data can contain an arbitrary nested list, not just a 1-D list of floats.

 Code example: <gh-file:examples/online_serving/openai_pooling_client.py>

-(classification-api)=
+[](){ #classification-api }

 ### Classification API

@@ -505,23 +475,19 @@ Response:

 #### Extra parameters

-The following [pooling parameters](#pooling-params) are supported.
+The following [pooling parameters][pooling-params] are supported.

-:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
-:language: python
-:start-after: begin-classification-pooling-params
-:end-before: end-classification-pooling-params
-:::
+```python
+--8<-- "vllm/entrypoints/openai/protocol.py:classification-pooling-params"
+```

 The following extra parameters are supported:

-:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
-:language: python
-:start-after: begin-classification-extra-params
-:end-before: end-classification-extra-params
-:::
+```python
+--8<-- "vllm/entrypoints/openai/protocol.py:classification-extra-params"
+```

-(score-api)=
+[](){ #score-api }

 ### Score API

@@ -668,23 +634,19 @@ Response:

 #### Extra parameters

-The following [pooling parameters](#pooling-params) are supported.
+The following [pooling parameters][pooling-params] are supported.

-:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
-:language: python
-:start-after: begin-score-pooling-params
-:end-before: end-score-pooling-params
-:::
+```python
+--8<-- "vllm/entrypoints/openai/protocol.py:score-pooling-params"
+```

 The following extra parameters are supported:

-:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
-:language: python
-:start-after: begin-score-extra-params
-:end-before: end-score-extra-params
-:::
+```python
+--8<-- "vllm/entrypoints/openai/protocol.py:score-extra-params"
+```

-(rerank-api)=
+[](){ #rerank-api }

 ### Re-rank API

@@ -755,18 +717,14 @@ Response:

 #### Extra parameters

-The following [pooling parameters](#pooling-params) are supported.
+The following [pooling parameters][pooling-params] are supported.

-:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
-:language: python
-:start-after: begin-rerank-pooling-params
-:end-before: end-rerank-pooling-params
-:::
+```python
+--8<-- "vllm/entrypoints/openai/protocol.py:rerank-pooling-params"
+```

 The following extra parameters are supported:

-:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
-:language: python
-:start-after: begin-rerank-extra-params
-:end-before: end-rerank-extra-params
-:::
+```python
+--8<-- "vllm/entrypoints/openai/protocol.py:rerank-extra-params"
+```
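Note on the `--8<-- "path:section"` lines in this file: they replace the Sphinx `literalinclude` directives that used `start-after` / `end-before` markers. With the PyMdown Extensions Snippets syntax, a named section is delimited in the included file by `--8<-- [start:name]` and `--8<-- [end:name]` markers placed inside comments. The sketch below imitates that extraction to show what ends up in the rendered page. It is not the extension's implementation, and the sample source and its contents are invented; the diff does not show the markers actually used in `vllm/entrypoints/openai/protocol.py`.

```python
def extract_section(source, name):
    """Return the lines between the start/end markers of a named section."""
    start = f"--8<-- [start:{name}]"
    end = f"--8<-- [end:{name}]"
    keep = False
    out = []
    for line in source.splitlines():
        if start in line:
            keep = True       # begin collecting after the start marker
        elif end in line:
            keep = False      # stop at the end marker
        elif keep:
            out.append(line)
    return "\n".join(out)


# Invented file contents, for illustration only.
SAMPLE = """\
import os
# --8<-- [start:completion-extra-params]
example_param = "placeholder"
# --8<-- [end:completion-extra-params]
tail = 1
"""

print(extract_section(SAMPLE, "completion-extra-params"))
```

The marker lines themselves are dropped, so only the code between them is embedded in the fenced block.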
--- a/docs/source/serving/serve_args.md
+++ b/docs/source/serving/serve_args.md
-(serve-args)=
-
-# Server Arguments
+---
+title: Server Arguments
+---
+[](){ #serve-args }

 The `vllm serve` command is used to launch the OpenAI-compatible server.

 ## CLI Arguments

-The following are all arguments available from the `vllm serve` command:
-
-<!--- pyml disable-num-lines 7 no-space-in-emphasis -->
-```{eval-rst}
-.. argparse::
-    :module: vllm.entrypoints.openai.cli_args
-    :func: create_parser_for_docs
-    :prog: vllm serve
-    :nodefaultconst:
-    :markdownhelp:
-```
+The `vllm serve` command is used to launch the OpenAI-compatible server.
+To see the available CLI arguments, run `vllm serve --help`!

 ## Configuration file

 You can load CLI arguments via a [YAML](https://yaml.org/) config file.
-The argument names must be the long form of those outlined [above](#serve-args).
+The argument names must be the long form of those outlined [above][serve-args].

 For example:

@@ -40,8 +32,7 @@ To use the above config file:
 vllm serve --config config.yaml
 ```

-:::{note}
-In case an argument is supplied simultaneously using command line and the config file, the value from the command line will take precedence.
-The order of priorities is `command line > config file values > defaults`.
-e.g. `vllm serve SOME_MODEL --config config.yaml`, SOME_MODEL takes precedence over `model` in config file.
-:::
+!!! note
+    In case an argument is supplied simultaneously using command line and the config file, the value from the command line will take precedence.
+    The order of priorities is `command line > config file values > defaults`.
+    e.g. `vllm serve SOME_MODEL --config config.yaml`, SOME_MODEL takes precedence over `model` in config file.
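Note on the precedence rule in `serve_args.md`: the order `command line > config file values > defaults` can be pictured as a layered merge. The sketch below illustrates that rule only; it is not vLLM's argument parser. The function name, the dictionary keys, and the port numbers are hypothetical.

```python
def resolve_args(defaults, config_file, command_line):
    """Merge three layers so that later layers win: defaults < config < CLI."""
    merged = dict(defaults)
    # Values of None are treated as "not supplied" and do not override.
    merged.update({k: v for k, v in config_file.items() if v is not None})
    merged.update({k: v for k, v in command_line.items() if v is not None})
    return merged


resolved = resolve_args(
    defaults={"model": "default-model", "port": 8000},
    config_file={"model": "CONFIG_MODEL", "port": 9000},
    command_line={"model": "SOME_MODEL"},
)
print(resolved)
```

Here `SOME_MODEL` from the command line replaces the `model` value from the config file, while `port` falls back to the config file because the command line did not supply it.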
--- a/docs/source/serving/usage_stats.md
+++ b/docs/source/serving/usage_stats.md
--- a/docs/source/_static/custom.css
+++ b/docs/source/_static/custom.css
-.vertical-table-header th.head:not(.stub) {
-    writing-mode: sideways-lr;
-    white-space: nowrap;
-    max-width: 0;
-    p {
-       margin: 0;
-    }
-}
--- a/docs/source/_templates/sections/header.html
+++ b/docs/source/_templates/sections/header.html
-<style>
-  .notification-bar {
-    width: 100vw;
-    display: flex;
-    justify-content: center;
-    align-items: center;
-    font-size: 16px;
-    padding: 0 6px 0 6px;
-  }
-  .notification-bar p {
-    margin: 0;
-  }
-  .notification-bar a {
-    font-weight: bold;
-    text-decoration: none;
-  }
-
-  /* Light mode styles (default) */
-  .notification-bar {
-    background-color: #fff3cd;
-    color: #856404;
-  }
-  .notification-bar a {
-    color: #d97706;
-  }
-
-  /* Dark mode styles */
-  html[data-theme=dark] .notification-bar {
-    background-color: #333;
-    color: #ddd;
-  }
-  html[data-theme=dark] .notification-bar a {
-    color: #ffa500; /* Brighter color for visibility */
-  }
-</style>
-
-<div class="notification-bar">
-  <p>You are viewing the latest developer preview docs. <a href="https://docs.vllm.ai/en/stable/">Click here</a> to view docs for the latest stable release.</p>
-</div>
--- a/docs/source/api/summary.md
+++ b/docs/source/api/summary.md
-# Summary
-
-(configuration)=
-
-## Configuration
-
-API documentation for vLLM's configuration classes.
-
-```{autodoc2-summary}
-    vllm.config.ModelConfig
-    vllm.config.CacheConfig
-    vllm.config.TokenizerPoolConfig
-    vllm.config.LoadConfig
-    vllm.config.ParallelConfig
-    vllm.config.SchedulerConfig
-    vllm.config.DeviceConfig
-    vllm.config.SpeculativeConfig
-    vllm.config.LoRAConfig
-    vllm.config.PromptAdapterConfig
-    vllm.config.MultiModalConfig
-    vllm.config.PoolerConfig
-    vllm.config.DecodingConfig
-    vllm.config.ObservabilityConfig
-    vllm.config.KVTransferConfig
-    vllm.config.CompilationConfig
-    vllm.config.VllmConfig
-```
-
-(offline-inference-api)=
-
-## Offline Inference
-
-LLM Class.
-
-```{autodoc2-summary}
-    vllm.LLM
-```
-
-LLM Inputs.
-
-```{autodoc2-summary}
-    vllm.inputs.PromptType
-    vllm.inputs.TextPrompt
-    vllm.inputs.TokensPrompt
-```
-
-## vLLM Engines
-
-Engine classes for offline and online inference.
-
-```{autodoc2-summary}
-    vllm.LLMEngine
-    vllm.AsyncLLMEngine
-```
-
-## Inference Parameters
-
-Inference parameters for vLLM APIs.
-
-(sampling-params)=
-(pooling-params)=
-
-```{autodoc2-summary}
-    vllm.SamplingParams
-    vllm.PoolingParams
-```
-
-(multi-modality)=
-
-## Multi-Modality
-
-vLLM provides experimental support for multi-modal models through the {mod}`vllm.multimodal` package.
-
-Multi-modal inputs can be passed alongside text and token prompts to [supported models](#supported-mm-models)
-via the `multi_modal_data` field in {class}`vllm.inputs.PromptType`.
-
-Looking to add your own multi-modal model? Please follow the instructions listed [here](#supports-multimodal).
-
-```{autodoc2-summary}
-    vllm.multimodal.MULTIMODAL_REGISTRY
-```
-
-### Inputs
-
-User-facing inputs.
-
-```{autodoc2-summary}
-    vllm.multimodal.inputs.MultiModalDataDict
-```
-
-Internal data structures.
-
-```{autodoc2-summary}
-    vllm.multimodal.inputs.PlaceholderRange
-    vllm.multimodal.inputs.NestedTensors
-    vllm.multimodal.inputs.MultiModalFieldElem
-    vllm.multimodal.inputs.MultiModalFieldConfig
-    vllm.multimodal.inputs.MultiModalKwargsItem
-    vllm.multimodal.inputs.MultiModalKwargs
-    vllm.multimodal.inputs.MultiModalInputs
-```
-
-### Data Parsing
-
-```{autodoc2-summary}
-    vllm.multimodal.parse
-```
-
-### Data Processing
-
-```{autodoc2-summary}
-    vllm.multimodal.processing
-```
-
-### Memory Profiling
-
-```{autodoc2-summary}
-    vllm.multimodal.profiling
-```
-
-### Registry
-
-```{autodoc2-summary}
-    vllm.multimodal.registry
-```
-
-## Model Development
-
-```{autodoc2-summary}
-    vllm.model_executor.models.interfaces_base
-    vllm.model_executor.models.interfaces
-    vllm.model_executor.models.adapters
-```
--- a/docs/source/autodoc2_docstring_parser.py
+++ b/docs/source/autodoc2_docstring_parser.py
-# SPDX-License-Identifier: Apache-2.0
-from docutils import nodes
-from myst_parser.parsers.sphinx_ import MystParser
-from sphinx.ext.napoleon import docstring
-
-
-class NapoleonParser(MystParser):
-
-    def parse(self, input_string: str, document: nodes.document) -> None:
-        # Get the Sphinx configuration
-        config = document.settings.env.config
-
-        parsed_content = str(
-            docstring.GoogleDocstring(
-                str(docstring.NumpyDocstring(input_string, config)),
-                config,
-            ))
-        return super().parse(parsed_content, document)
-
-
-Parser = NapoleonParser
--- a/docs/source/community/blog.md
+++ b/docs/source/community/blog.md
-# vLLM Blog
-
-vLLM blog posts are published [here](https://blog.vllm.ai/).
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
-# SPDX-License-Identifier: Apache-2.0
-
-# Configuration file for the Sphinx documentation builder.
-#
-# This file only contains a selection of the most common options. For a full
-# list see the documentation:
-# https://www.sphinx-doc.org/en/master/usage/configuration.html
-
-# -- Path setup --------------------------------------------------------------
-
-# If extensions (or modules to document with autodoc) are in another directory,
-# add these directories to sys.path here. If the directory is relative to the
-# documentation root, use os.path.abspath to make it absolute, like shown here.
-
-import datetime
-import logging
-import os
-import re
-import sys
-from pathlib import Path
-
-import requests
-
-logger = logging.getLogger(__name__)
-REPO_ROOT = Path(__file__).resolve().parent.parent.parent
-sys.path.append(os.path.abspath(REPO_ROOT))
-
-# -- Project information -----------------------------------------------------
-
-project = 'vLLM'
-copyright = f'{datetime.datetime.now().year}, vLLM Team'
-author = 'the vLLM Team'
-
-# -- General configuration ---------------------------------------------------
-
-# Add any Sphinx extension module names here, as strings. They can be
-# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
-# ones.
-extensions = [
-    "sphinx.ext.napoleon",
-    "sphinx.ext.linkcode",
-    "sphinx.ext.intersphinx",
-    "sphinx_copybutton",
-    "autodoc2",
-    "myst_parser",
-    "sphinxarg.ext",
-    "sphinx_design",
-    "sphinx_togglebutton",
-]
-myst_enable_extensions = [
-    "colon_fence",
-    "fieldlist",
-]
-autodoc2_packages = [
-    {
-        "path": "../../vllm",
-        "exclude_dirs": ["__pycache__", "third_party"],
-    },
-]
-autodoc2_output_dir = "api"
-autodoc2_render_plugin = "myst"
-autodoc2_hidden_objects = ["dunder", "private", "inherited"]
-autodoc2_sort_names = True
-autodoc2_index_template = None
-
-# Add any paths that contain templates here, relative to this directory.
-templates_path = ['_templates']
-
-# List of patterns, relative to source directory, that match files and
-# directories to ignore when looking for source files.
-# This pattern also affects html_static_path and html_extra_path.
-exclude_patterns: list[str] = ["**/*.template.md", "**/*.inc.md"]
-
-# Exclude the prompt "$" when copying code
-copybutton_prompt_text = r"\$ "
-copybutton_prompt_is_regexp = True
-
-# -- Options for HTML output -------------------------------------------------
-
-# The theme to use for HTML and HTML Help pages.  See the documentation for
-# a list of builtin themes.
-#
-html_title = project
-html_theme = 'sphinx_book_theme'
-html_logo = 'assets/logos/vllm-logo-text-light.png'
-html_favicon = 'assets/logos/vllm-logo-only-light.ico'
-html_theme_options = {
-    'path_to_docs': 'docs/source',
-    'repository_url': 'https://github.com/vllm-project/vllm',
-    'use_repository_button': True,
-    'use_edit_page_button': True,
-    # Prevents the full API being added to the left sidebar of every page.
-    # Reduces build time by 2.5x and reduces build size from ~225MB to ~95MB.
-    'collapse_navbar': True,
-    # Makes API visible in the right sidebar on API reference pages.
-    'show_toc_level': 3,
-}
-# Add any paths that contain custom static files (such as style sheets) here,
-# relative to this directory. They are copied after the builtin static files,
-# so a file named "default.css" will overwrite the builtin "default.css".
-html_static_path = ["_static"]
-html_js_files = ["custom.js"]
-html_css_files = ["custom.css"]
-
-myst_heading_anchors = 2
-myst_url_schemes = {
-    'http': None,
-    'https': None,
-    'mailto': None,
-    'ftp': None,
-    "gh-issue": {
-        "url":
-        "https://github.com/vllm-project/vllm/issues/{{path}}#{{fragment}}",
-        "title": "Issue #{{path}}",
-        "classes": ["github"],
-    },
-    "gh-pr": {
-        "url":
-        "https://github.com/vllm-project/vllm/pull/{{path}}#{{fragment}}",
-        "title": "Pull Request #{{path}}",
-        "classes": ["github"],
-    },
-    "gh-project": {
-        "url": "https://github.com/orgs/vllm-project/projects/{{path}}",
-        "title": "Project #{{path}}",
-        "classes": ["github"],
-    },
-    "gh-dir": {
-        "url": "https://github.com/vllm-project/vllm/tree/main/{{path}}",
-        "title": "{{path}}",
-        "classes": ["github"],
-    },
-    "gh-file": {
-        "url": "https://github.com/vllm-project/vllm/blob/main/{{path}}",
-        "title": "{{path}}",
-        "classes": ["github"],
-    },
-}
-
-# see https://docs.readthedocs.io/en/stable/reference/environment-variables.html # noqa
-READTHEDOCS_VERSION_TYPE = os.environ.get('READTHEDOCS_VERSION_TYPE')
-if READTHEDOCS_VERSION_TYPE == "tag":
-    # remove the warning banner if the version is a tagged release
-    header_file = os.path.join(os.path.dirname(__file__),
-                               "_templates/sections/header.html")
-    # The file might be removed already if the build is triggered multiple times
-    # (readthedocs builds both HTML and PDF versions separately)
-    if os.path.exists(header_file):
-        os.remove(header_file)
-
-
-# Generate additional rst documentation here.
-def setup(app):
-    from docs.source.generate_examples import generate_examples
-    generate_examples()
-
-
-_cached_base: str = ""
-_cached_branch: str = ""
-
-
-def get_repo_base_and_branch(pr_number):
-    global _cached_base, _cached_branch
-    if _cached_base and _cached_branch:
-        return _cached_base, _cached_branch
-
-    url = f"https://api.github.com/repos/vllm-project/vllm/pulls/{pr_number}"
-    response = requests.get(url)
-    if response.status_code == 200:
-        data = response.json()
-        _cached_base = data['head']['repo']['full_name']
-        _cached_branch = data['head']['ref']
-        return _cached_base, _cached_branch
-    else:
-        logger.error("Failed to fetch PR details: %s", response)
-        return None, None
-
-
-def linkcode_resolve(domain, info):
-    if domain != 'py':
-        return None
-    if not info['module']:
-        return None
-
-    # Get path from module name
-    file = Path(f"{info['module'].replace('.', '/')}.py")
-    path = REPO_ROOT / file
-    if not path.exists():
-        path = REPO_ROOT / file.with_suffix("") / "__init__.py"
-    if not path.exists():
-        return None
-
-    # Get the line number of the object
-    with open(path) as f:
-        lines = f.readlines()
-    name = info['fullname'].split(".")[-1]
-    pattern = fr"^( {{4}})*((def|class) )?{name}\b.*"
-    for lineno, line in enumerate(lines, 1):
-        if not line or line.startswith("#"):
-            continue
-        if re.match(pattern, line):
-            break
-
-    # If the line number is not found, return None
-    if lineno == len(lines):
-        return None
-
-    # If the line number is found, create the URL
-    filename = path.relative_to(REPO_ROOT)
-    if "checkouts" in path.parts:
-        # a PR build on readthedocs
-        pr_number = REPO_ROOT.name
-        base, branch = get_repo_base_and_branch(pr_number)
-        if base and branch:
-            return f"https://github.com/{base}/blob/{branch}/{filename}#L{lineno}"
-    # Otherwise, link to the source file on the main branch
-    return f"https://github.com/vllm-project/vllm/blob/main/{filename}#L{lineno}"
-
-
-# Mock out external dependencies here, otherwise sphinx-argparse won't work.
-autodoc_mock_imports = [
-    "huggingface_hub",
-    "pydantic",
-    "zmq",
-    "cloudpickle",
-    "aiohttp",
-    "starlette",
-    "blake3",
-    "cpuinfo",
-    "transformers",
-    "psutil",
-    "vllm._C",
-    "PIL",
-    "numpy",
-    "tqdm",
-    # The mocks below are required by
-    # docs/source/serving/openai_compatible_server.md's
-    # vllm.entrypoints.openai.cli_args
-    "openai",
-    "fastapi",
-    "partial_json_parser",
-]
-
-for mock_target in autodoc_mock_imports:
-    if mock_target in sys.modules:
-        logger.info(
-            "Potentially problematic mock target (%s) found; "
-            "autodoc_mock_imports cannot mock modules that have already "
-            "been loaded into sys.modules when the sphinx build starts.",
-            mock_target)
-
-intersphinx_mapping = {
-    "python": ("https://docs.python.org/3", None),
-    "typing_extensions":
-    ("https://typing-extensions.readthedocs.io/en/latest", None),
-    "aiohttp": ("https://docs.aiohttp.org/en/stable", None),
-    "pillow": ("https://pillow.readthedocs.io/en/stable", None),
-    "numpy": ("https://numpy.org/doc/stable", None),
-    "torch": ("https://pytorch.org/docs/stable", None),
-    "psutil": ("https://psutil.readthedocs.io/en/stable", None),
-}
-
-navigation_with_keys = False
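
One detail of the removed `linkcode_resolve` is worth noting: its "not found" check, `lineno == len(lines)`, also fires when the match is on the final line of the file. A minimal sketch of the same line lookup that returns as soon as a match is found avoids that ambiguity. The helper name `find_definition_line` and the sample lines are hypothetical; the regular expression follows the one in the removed code, with the name escaped.

```python
import re
from typing import Optional


def find_definition_line(lines: list[str], name: str) -> Optional[int]:
    """Return the 1-based line number where `name` is defined, or None."""
    pattern = re.compile(rf"^( {{4}})*((def|class) )?{re.escape(name)}\b")
    for lineno, line in enumerate(lines, 1):
        if line.startswith("#"):
            continue  # skip top-level comment lines
        if pattern.match(line):
            return lineno
    return None  # reached only when no line matched


# Hypothetical file content.
sample_lines = [
    "# header\n",
    "import os\n",
    "\n",
    "class Foo:\n",
    "    def bar(self):\n",
    "        pass\n",
    "def baz():\n",
]

print(find_definition_line(sample_lines, "bar"))
print(find_definition_line(sample_lines, "baz"))
print(find_definition_line(sample_lines, "missing"))
```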
--- a/docs/source/contributing/model/index.md
+++ b/docs/source/contributing/model/index.md
-(new-model)=
-
-# Adding a New Model
-
-This section provides more information on how to integrate a [PyTorch](https://pytorch.org/) model into vLLM.
-
-:::{toctree}
-:caption: Contents
-:maxdepth: 1
-
-basic
-registration
-tests
-multimodal
-:::
-
-:::{note}
-The complexity of adding a new model depends heavily on the model's architecture.
-The process is fairly straightforward if the model shares a similar architecture with an existing model in vLLM.
-However, for models that include new operators (e.g., a new attention mechanism), the process can be a bit more complex.
-:::
-
-:::{tip}
-If you are encountering issues while integrating your model into vLLM, feel free to open a [GitHub issue](https://github.com/vllm-project/vllm/issues)
-or ask on our [developer slack](https://slack.vllm.ai).
-We will be happy to help you out!
-:::
--- a/docs/source/contributing/model/multimodal.md
+++ b/docs/source/contributing/model/multimodal.md
-(supports-multimodal)=
-
-# Multi-Modal Support
-
-This document walks you through the steps to extend a basic model so that it accepts [multi-modal inputs](#multimodal-inputs).
-
-## 1. Update the base vLLM model
-
-It is assumed that you have already implemented the model in vLLM according to [these steps](#new-model-basic).
-Further update the model as follows:
-
-- Reserve a keyword parameter in {meth}`~torch.nn.Module.forward` for each input tensor that corresponds to a multi-modal input, as shown in the following example:
-
-  ```diff
-    def forward(
-        self,
-        input_ids: torch.Tensor,
-        positions: torch.Tensor,
-  +     pixel_values: torch.Tensor,
-    ) -> SamplerOutput:
-  ```
-  
-  More conveniently, you can simply pass `**kwargs` to the {meth}`~torch.nn.Module.forward` method and retrieve the keyword parameters for multimodal inputs from it.
-
-- Implement {meth}`~vllm.model_executor.models.interfaces.SupportsMultiModal.get_multimodal_embeddings` that returns the embeddings from running the multimodal inputs through the multimodal tokenizer of the model. Below is a boilerplate for a typical implementation pattern; adjust it to your own needs.
-
-    ```python
-    class YourModelForImage2Seq(nn.Module):
-        ...
-
-        def _process_image_input(self, image_input: YourModelImageInputs) -> torch.Tensor:
-
-            assert self.vision_encoder is not None
-            image_features = self.vision_encoder(image_input)
-            return self.multi_modal_projector(image_features)
-
-        def get_multimodal_embeddings(
-                self, **kwargs: object) -> Optional[MultiModalEmbeddings]:
-
-            # Validate the multimodal input keyword arguments
-            image_input = self._parse_and_validate_image_input(**kwargs)
-            if image_input is None:
-                return None
-
-            # Run multimodal inputs through encoder and projector
-            vision_embeddings = self._process_image_input(image_input)
-            return vision_embeddings
-    ```
-
-    :::{important}
-    The returned `multimodal_embeddings` must be either a **3D {class}`torch.Tensor`** of shape `(num_items, feature_size, hidden_size)`, or a **list / tuple of 2D {class}`torch.Tensor`s** of shape `(feature_size, hidden_size)`, so that `multimodal_embeddings[i]` retrieves the embeddings generated from the `i`-th multimodal data item (e.g., image) of the request.
-    :::
-
-- Implement {meth}`~vllm.model_executor.models.interfaces.SupportsMultiModal.get_input_embeddings` to merge `multimodal_embeddings` with text embeddings from the `input_ids`. If input processing for the model is implemented correctly (see sections below), then you can leverage the utility function we provide to easily merge the embeddings.
-
-    ```python
-    from .utils import merge_multimodal_embeddings
-
-    class YourModelForImage2Seq(nn.Module):
-        ...
-
-        def get_input_embeddings(
-            self,
-            input_ids: torch.Tensor,
-            multimodal_embeddings: Optional[MultiModalEmbeddings] = None,
-        ) -> torch.Tensor:
-
-            # `get_input_embeddings` should already be implemented for the language 
-            # model as one of the requirements of basic vLLM model implementation.
-            inputs_embeds = self.language_model.get_input_embeddings(input_ids)
-
-            if multimodal_embeddings is not None:
-                inputs_embeds = merge_multimodal_embeddings(
-                    input_ids=input_ids, 
-                    inputs_embeds=inputs_embeds, 
-                    multimodal_embeddings=multimodal_embeddings,
-                    placeholder_token_id=self.config.image_token_index)
-
-            return inputs_embeds
-    ```
-
-- Implement {meth}`~vllm.model_executor.models.interfaces.SupportsMultiModal.get_language_model` getter to provide stable access to the underlying language model.
-
-    ```python
-    class YourModelForImage2Seq(nn.Module):
-        ...
-
-        def get_language_model(self) -> torch.nn.Module:
-            # Change `language_model` according to your implementation.
-            return self.language_model
-    ```
-
-- Once the above steps are done, update the model class with the {class}`~vllm.model_executor.models.interfaces.SupportsMultiModal` interface.
-
-  ```diff
-  + from vllm.model_executor.models.interfaces import SupportsMultiModal
-
-  - class YourModelForImage2Seq(nn.Module):
-  + class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
-  ```
-
-  :::{note}
-  The model class does not have to be named {code}`*ForCausalLM`.
-  Check out [the HuggingFace Transformers documentation](https://huggingface.co/docs/transformers/model_doc/auto#multimodal) for some examples.
-  :::
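
The merge step described in the list above can be pictured with plain Python lists standing in for tensors. This is an illustrative sketch only and not vLLM's `merge_multimodal_embeddings`; the function name `merge_placeholder_embeddings`, the token ID `99`, and the toy embeddings are all hypothetical.

```python
def merge_placeholder_embeddings(
    input_ids: list[int],
    inputs_embeds: list[list[float]],
    multimodal_embeddings: list[list[list[float]]],
    placeholder_token_id: int,
) -> list[list[float]]:
    """Overwrite text embeddings at placeholder positions with multimodal rows."""
    # Flatten the per-item embeddings into one sequence of feature rows.
    flat_rows = [row for item in multimodal_embeddings for row in item]
    positions = [
        i for i, token in enumerate(input_ids) if token == placeholder_token_id
    ]
    if len(positions) != len(flat_rows):
        raise ValueError(
            f"Expected {len(positions)} multimodal rows, got {len(flat_rows)}"
        )
    merged = list(inputs_embeds)  # shallow copy; the input list is not mutated
    for position, row in zip(positions, flat_rows):
        merged[position] = row
    return merged


# One image contributing two feature rows; 99 is a hypothetical placeholder ID.
ids = [1, 99, 99, 2]
text_embeds = [[0.0, 0.0] for _ in ids]
image_embeds = [[[1.0, 1.0], [2.0, 2.0]]]

merged = merge_placeholder_embeddings(ids, text_embeds, image_embeds, 99)
print(merged)
```

The length check mirrors the requirement that the number of placeholder tokens in the prompt equals the number of multimodal feature rows.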
-
-## 2. Specify processing information
-
-Next, create a subclass of {class}`~vllm.multimodal.processing.BaseProcessingInfo`
-to provide basic information related to HF processing.
-
-### Maximum number of input items
-
-You need to override the abstract method {meth}`~vllm.multimodal.processing.BaseProcessingInfo.get_supported_mm_limits`
-to return the maximum number of input items for each modality supported by the model.
-
-For example, if the model supports any number of images but only one video per prompt:
-
-```python
-def get_supported_mm_limits(self) -> Mapping[str, Optional[int]]:
-    return {"image": None, "video": 1}
-```
-
-## 3. Specify dummy inputs
-
-Then, inherit {class}`~vllm.multimodal.profiling.BaseDummyInputsBuilder` to construct dummy inputs for
-HF processing as well as memory profiling.
-
-### For memory profiling
-
-Override the abstract methods {meth}`~vllm.multimodal.profiling.BaseDummyInputsBuilder.get_dummy_text` and {meth}`~vllm.multimodal.profiling.BaseDummyInputsBuilder.get_dummy_mm_data` to construct dummy inputs for memory profiling. These dummy inputs should result in the worst-case memory usage of the model so that vLLM can reserve the correct amount of memory for it.
-
-Assuming that the memory usage increases with the number of tokens, the dummy inputs can be constructed to maximize the number of output embeddings, which equals the number of placeholder feature tokens.
-
-::::{tab-set}
-:::{tab-item} Basic example: LLaVA
-:sync: llava
-
-Looking at the code of HF's `LlavaForConditionalGeneration`:
-
-```python
-# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L530-L544
-n_image_tokens = (input_ids == self.config.image_token_index).sum().item()
-n_image_features = image_features.shape[0] * image_features.shape[1]
-
-if n_image_tokens != n_image_features:
-    raise ValueError(
-        f"Image features and image tokens do not match: tokens: {n_image_tokens}, features {n_image_features}"
-    )
-special_image_mask = (
-    (input_ids == self.config.image_token_index)
-    .unsqueeze(-1)
-    .expand_as(inputs_embeds)
-    .to(inputs_embeds.device)
-)
-image_features = image_features.to(inputs_embeds.device, inputs_embeds.dtype)
-inputs_embeds = inputs_embeds.masked_scatter(special_image_mask, image_features)
-```
-
-The number of placeholder feature tokens per image is `image_features.shape[1]`.
-`image_features` is calculated inside the `get_image_features` method:
-
-```python
-# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L290-L300
-image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)
-
-selected_image_feature = image_outputs.hidden_states[vision_feature_layer]
-if vision_feature_select_strategy == "default":
-    selected_image_feature = selected_image_feature[:, 1:]
-elif vision_feature_select_strategy == "full":
-    selected_image_feature = selected_image_feature
-else:
-    raise ValueError(f"Unexpected select feature strategy: {self.config.vision_feature_select_strategy}")
-image_features = self.multi_modal_projector(selected_image_feature)
-return image_features
-```
-
-We can infer that `image_features.shape[1]` is based on `image_outputs.hidden_states.shape[1]` from the vision tower
-(`CLIPVisionModel` for the [`llava-hf/llava-1.5-7b-hf`](https://huggingface.co/llava-hf/llava-1.5-7b-hf) model).
-Moreover, we only need the sequence length (the second dimension of the tensor) to get `image_features.shape[1]`.
-The sequence length is determined by the initial hidden states in `CLIPVisionTransformer` since the attention
-mechanism doesn't change the sequence length of the output hidden states.
-
-```python
-# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/modeling_clip.py#L1094-L1102
-hidden_states = self.embeddings(pixel_values, interpolate_pos_encoding=interpolate_pos_encoding)
-hidden_states = self.pre_layrnorm(hidden_states)
-
-encoder_outputs = self.encoder(
-    inputs_embeds=hidden_states,
-    output_attentions=output_attentions,
-    output_hidden_states=output_hidden_states,
-    return_dict=return_dict,
-)
-```
-
-To find the sequence length, we turn to the code of `CLIPVisionEmbeddings`:
-
-```python
-# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/modeling_clip.py#L247-L257
-target_dtype = self.patch_embedding.weight.dtype
-patch_embeds = self.patch_embedding(pixel_values.to(dtype=target_dtype))  # shape = [*, width, grid, grid]
-patch_embeds = patch_embeds.flatten(2).transpose(1, 2)
-
-class_embeds = self.class_embedding.expand(batch_size, 1, -1)
-embeddings = torch.cat([class_embeds, patch_embeds], dim=1)
-if interpolate_pos_encoding:
-    embeddings = embeddings + self.interpolate_pos_encoding(embeddings, height, width)
-else:
-    embeddings = embeddings + self.position_embedding(self.position_ids)
-return embeddings
-```
-
-We can infer that `embeddings.shape[1] == self.num_positions`, where
-
-```python
-# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/modeling_clip.py#L195-L196
-self.num_patches = (self.image_size // self.patch_size) ** 2
-self.num_positions = self.num_patches + 1
-```
-
-Overall, the number of placeholder feature tokens for an image can be calculated as:
-
-```python
-def get_num_image_tokens(
-    self,
-    *,
-    image_width: int,
-    image_height: int,
-) -> int:
-    hf_config = self.get_hf_config()
-    hf_processor = self.get_hf_processor()
-
-    image_size = hf_config.vision_config.image_size
-    patch_size = hf_config.vision_config.patch_size
-
-    num_image_tokens = (image_size // patch_size) ** 2 + 1
-    if hf_processor.vision_feature_select_strategy == "default":
-        num_image_tokens -= 1
-
-    return num_image_tokens
-```
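
As a worked instance of the calculation above, the sketch below plugs in an image size of 336 and a patch size of 14. These values are assumed here to match the CLIP vision tower of `llava-hf/llava-1.5-7b-hf` and should be checked against the model config; the standalone function `num_image_tokens` is a hypothetical rewrite of the method without the vLLM helpers.

```python
def num_image_tokens(image_size: int, patch_size: int, select_strategy: str) -> int:
    """Count placeholder feature tokens for one image."""
    num_patches = (image_size // patch_size) ** 2
    num_positions = num_patches + 1  # patches plus the class embedding
    if select_strategy == "default":
        return num_positions - 1  # "default" drops the class embedding
    if select_strategy == "full":
        return num_positions
    raise ValueError(f"Unexpected select feature strategy: {select_strategy}")


# Assumed config values: image_size=336, patch_size=14.
print(num_image_tokens(336, 14, "default"))  # 24 * 24 = 576
print(num_image_tokens(336, 14, "full"))     # 576 + 1 = 577
```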
-
-Notice that the number of image tokens doesn't depend on the image width and height.
-We can simply use a dummy `image_size` to calculate the multimodal profiling data:
-
-```python
-# NOTE: In actuality, this is usually implemented as part of the
-# model's subclass of `BaseProcessingInfo`, but we show it as is
-# here for simplicity.
-def get_image_size_with_most_features(self) -> ImageSize:
-    hf_config = self.get_hf_config()
-    width = height = hf_config.vision_config.image_size
-    return ImageSize(width=width, height=height)
-
-def get_dummy_mm_data(
-    self,
-    seq_len: int,
-    mm_counts: Mapping[str, int],
-) -> MultiModalDataDict:
-    num_images = mm_counts.get("image", 0)
-
-    target_width, target_height = \
-        self.info.get_image_size_with_most_features()
-
-    return {
-        "image":
-        self._get_dummy_images(width=target_width,
-                               height=target_height,
-                               num_images=num_images)
-    }
-```
-
-For the text, we simply expand the multimodal image token from the model config to match the desired number of images.
-
-```python
-def get_dummy_text(self, mm_counts: Mapping[str, int]) -> str:
-    num_images = mm_counts.get("image", 0)
-
-    processor = self.info.get_hf_processor()
-    image_token = processor.image_token
-
-    return image_token * num_images
-```
-
-:::
-
-:::{tab-item} No input placeholders: Fuyu
-:sync: fuyu
-
-Looking at the code of HF's `FuyuForCausalLM`:
-
-```python
-# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/modeling_fuyu.py#L311-L322
-if image_patches is not None and past_key_values is None:
-    patch_embeddings = [
-        self.vision_embed_tokens(patch.to(self.vision_embed_tokens.weight.dtype))
-        .squeeze(0)
-        .to(inputs_embeds.device)
-        for patch in image_patches
-    ]
-    inputs_embeds = self.gather_continuous_embeddings(
-        word_embeddings=inputs_embeds,
-        continuous_embeddings=patch_embeddings,
-        image_patch_input_indices=image_patches_indices,
-    )
-```
-
-The number of placeholder feature tokens for the `i`th item in the batch is `patch_embeddings[i].shape[0]`,
-which is the same as `image_patches[i].shape[0]`, i.e. `num_total_patches`.
-
-Unlike LLaVA, Fuyu does not define the number of patches inside the modeling file. Where can we get more information?
-Considering that the model input comes from the output of `FuyuProcessor`, let's **look at the preprocessing files**.
-
-The image outputs are obtained by calling `FuyuImageProcessor.preprocess` and then
-`FuyuImageProcessor.preprocess_with_tokenizer_info` inside `FuyuProcessor`.
-
-In `FuyuImageProcessor.preprocess`, the images are resized and padded to the target `FuyuImageProcessor.size`,
-returning the dimensions after resizing (but before padding) as metadata.
-
-```python
-# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L541-L544
-image_encoding = self.image_processor.preprocess(images, **output_kwargs["images_kwargs"])
-batch_images = image_encoding["images"]
-image_unpadded_heights = image_encoding["image_unpadded_heights"]
-image_unpadded_widths = image_encoding["image_unpadded_widths"]
-
-# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L480-L
-if do_resize:
-    batch_images = [
-        [self.resize(image, size=size, input_data_format=input_data_format) for image in images]
-        for images in batch_images
-    ]
-
-image_sizes = [get_image_size(images[0], channel_dim=input_data_format) for images in batch_images]
-image_unpadded_heights = [[image_size[0]] for image_size in image_sizes]
-image_unpadded_widths = [[image_size[1]] for image_size in image_sizes]
-
-if do_pad:
-    batch_images = [
-        [
-            self.pad_image(
-                image,
-                size=size,
-                mode=padding_mode,
-                constant_values=padding_value,
-                input_data_format=input_data_format,
-            )
-            for image in images
-        ]
-        for images in batch_images
-    ]
-```
-
-In `FuyuImageProcessor.preprocess_with_tokenizer_info`, the images are split into patches based on this metadata:
-
-```python
-# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L425
-model_image_input = self.image_processor.preprocess_with_tokenizer_info(
-    image_input=tensor_batch_images,
-    image_present=image_present,
-    image_unpadded_h=image_unpadded_heights,
-    image_unpadded_w=image_unpadded_widths,
-    image_placeholder_id=image_placeholder_id,
-    image_newline_id=image_newline_id,
-    variable_sized=True,
-)
-
-# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L638-L658
-image_height, image_width = image.shape[1], image.shape[2]
-if variable_sized:  # variable_sized=True
-    new_h = min(
-        image_height,
-        math.ceil(image_unpadded_h[batch_index, subseq_index] / patch_height) * patch_height,
-    )
-    new_w = min(
-        image_width,
-        math.ceil(image_unpadded_w[batch_index, subseq_index] / patch_width) * patch_width,
-    )
-    image = image[:, :new_h, :new_w]
-    image_height, image_width = new_h, new_w
-
-num_patches = self.get_num_patches(image_height=image_height, image_width=image_width)
-tensor_of_image_ids = torch.full(
-    [num_patches], image_placeholder_id, dtype=torch.int32, device=image_input.device
-)
-patches = self.patchify_image(image=image.unsqueeze(0)).squeeze(0)
-assert num_patches == patches.shape[0]
-```
-
-The number of patches is in turn defined by `FuyuImageProcessor.get_num_patches`:
-
-```python
-# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L552-L562
-patch_size = patch_size if patch_size is not None else self.patch_size
-patch_height, patch_width = self.patch_size["height"], self.patch_size["width"]
-
-if image_height % patch_height != 0:
-    raise ValueError(f"{image_height=} must be divisible by {patch_height}")
-if image_width % patch_width != 0:
-    raise ValueError(f"{image_width=} must be divisible by {patch_width}")
-
-num_patches_per_dim_h = image_height // patch_height
-num_patches_per_dim_w = image_width // patch_width
-num_patches = num_patches_per_dim_h * num_patches_per_dim_w
-```
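
As a quick arithmetic check of the patch count described above, here is a minimal standalone sketch. The function `count_patches` is a hypothetical restatement of the logic, not part of the Transformers API, and the 1080×1920 image size and 30×30 patch size are assumed example values.

```python
def count_patches(image_height: int, image_width: int,
                  patch_height: int, patch_width: int) -> int:
    """Count non-overlapping patches in an image whose sides are exact
    multiples of the patch size (hypothetical helper for illustration)."""
    if image_height % patch_height != 0:
        raise ValueError(f"{image_height=} must be divisible by {patch_height}")
    if image_width % patch_width != 0:
        raise ValueError(f"{image_width=} must be divisible by {patch_width}")

    # Patches per column times patches per row.
    return (image_height // patch_height) * (image_width // patch_width)


# Assumed example values: a 1080x1920 image split into 30x30 patches.
example_num_patches = count_patches(1080, 1920, 30, 30)  # 36 * 64 = 2304
```

With these assumed values there are 36 rows and 64 columns of patches, so 2304 placeholder tokens would be needed before any newline tokens are added.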
-
-These image patches correspond to placeholder tokens (`|SPEAKER|`). So, we just need to maximize the number of image patches. Since input images are first resized
-to fit within `image_processor.size`, we can maximize the number of image patches by inputting an image with size equal to `image_processor.size`.
-
-```python
-def get_image_size_with_most_features(self) -> ImageSize:
-    image_processor = self.get_image_processor()
-    return ImageSize(width=image_processor.size["width"],
-                     height=image_processor.size["height"])
-```
-
-Fuyu does not expect image placeholders in the inputs to HF processor, so
-the dummy prompt text is empty regardless of the number of images.
-
-```python
-def get_dummy_text(self, mm_counts: Mapping[str, int]) -> str:
-    return ""
-```
-
-For the multimodal image profiling data, the logic is very similar to LLaVA:
-
-```python
-def get_dummy_mm_data(
-    self,
-    seq_len: int,
-    mm_counts: Mapping[str, int],
-) -> MultiModalDataDict:
-    target_width, target_height = \
-        self.info.get_image_size_with_most_features()
-    num_images = mm_counts.get("image", 0)
-
-    return {
-        "image":
-        self._get_dummy_images(width=target_width,
-                               height=target_height,
-                               num_images=num_images)
-    }
-```
-
-:::
-
-::::
-
-## 4. Specify processing details
-
-Afterwards, create a subclass of {class}`~vllm.multimodal.processing.BaseMultiModalProcessor`
-to fill in the missing details about HF processing.
-
-:::{seealso}
-[Multi-Modal Data Processing](#mm-processing)
-:::
-
-### Multi-modal fields
-
-Override {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_mm_fields_config` to
-return a schema of the tensors outputted by the HF processor that are related to the input multi-modal items.
-
-:::::{tab-set}
-::::{tab-item} Basic example: LLaVA
-:sync: llava
-
-The output of `CLIPImageProcessor` is a simple tensor with shape
-`(num_images, num_channels, image_height, image_width)`:
-
-```python
-# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/image_processing_clip.py#L339-L345
-images = [
-    to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format)
-    for image in all_images
-]
-
-data = {"pixel_values": images}
-return BatchFeature(data=data, tensor_type=return_tensors)
-```
-
-So, we override {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_mm_fields_config` as follows:
-
-```python
-def _get_mm_fields_config(
-    self,
-    hf_inputs: BatchFeature,
-    hf_processor_mm_kwargs: Mapping[str, object],
-) -> Mapping[str, MultiModalFieldConfig]:
-    return dict(
-        pixel_values=MultiModalFieldConfig.batched("image"),
-    )
-```
-
-:::{note}
-Our [actual code](gh-file:vllm/model_executor/models/llava.py) additionally supports
-pre-computed image embeddings, which can be passed to the model via the `image_embeds` argument.
-:::
-
-::::
-
-::::{tab-item} With postprocessing: Fuyu
-:sync: fuyu
-
-The `image_patches` output of `FuyuImageProcessor.preprocess_with_tokenizer_info` concatenates
-the patches from each image belonging to an item in the batch:
-
-```python
-# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L673-L679
-        image_input_ids.append(tensor_of_image_ids)
-        image_patches.append(patches)
-    else:
-        image_input_ids.append(torch.tensor([], dtype=torch.int32, device=image_input.device))
-
-batch_image_input_ids.append(image_input_ids)
-batch_image_patches.append(image_patches)
-```
-
-The shape of `image_patches` outputted by `FuyuImageProcessor` is therefore
-`(1, num_images, num_patches, patch_width * patch_height * num_channels)`.
-
-In order to support the use of {func}`MultiModalFieldConfig.batched` like in LLaVA,
-we remove the extra batch dimension by overriding {meth}`BaseMultiModalProcessor._call_hf_processor`:
-
-```python
-def _call_hf_processor(
-    self,
-    prompt: str,
-    mm_data: Mapping[str, object],
-    mm_kwargs: Mapping[str, object],
-) -> BatchFeature:
-    processed_outputs = super()._call_hf_processor(
-        prompt=prompt,
-        mm_data=mm_data,
-        mm_kwargs=mm_kwargs,
-    )
-
-    image_patches = processed_outputs.get("image_patches")
-    if image_patches is not None:
-        images = mm_data["images"]
-        assert isinstance(images, list)
-
-        # Original output: (1, num_images, Pn, Px * Py * C)
-        # New output: (num_images, Pn, Px * Py * C)
-        assert (isinstance(image_patches, list)
-                and len(image_patches) == 1)
-        assert (isinstance(image_patches[0], torch.Tensor)
-                and len(image_patches[0]) == len(images))
-
-        processed_outputs["image_patches"] = image_patches[0]
-
-    return processed_outputs
-```
-
-:::{note}
-Our [actual code](gh-file:vllm/model_executor/models/fuyu.py) has special handling
-for text-only inputs to prevent unnecessary warnings from HF processor.
-:::
-
-This lets us override {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_mm_fields_config` as follows:
-
-```python
-def _get_mm_fields_config(
-    self,
-    hf_inputs: BatchFeature,
-    hf_processor_mm_kwargs: Mapping[str, object],
-) -> Mapping[str, MultiModalFieldConfig]:
-    return dict(image_patches=MultiModalFieldConfig.batched("image"))
-```
-
-::::
-
-:::::
-
-### Prompt updates
-
-Override {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates` to
-return a list of {class}`~vllm.multimodal.processing.PromptUpdate` instances.
-
-Each {class}`~vllm.multimodal.processing.PromptUpdate` instance specifies an update operation
-(e.g.: insertion, replacement) performed by the HF processor.
-
-::::{tab-set}
-:::{tab-item} Basic example: LLaVA
-:sync: llava
-
-Looking at HF's `LlavaProcessor`:
-
-```python
-# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/processing_llava.py#L167-L170
-prompt_strings = []
-for sample in text:
-    sample = sample.replace(self.image_token, self.image_token * num_image_tokens)
-    prompt_strings.append(sample)
-```
-
-It simply repeats each input `image_token` a number of times equal to the number of placeholder feature tokens (`num_image_tokens`).
-Based on this, we override {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates` as follows:
-
-```python
-def _get_prompt_updates(
-    self,
-    mm_items: MultiModalDataItems,
-    hf_processor_mm_kwargs: Mapping[str, object],
-    out_mm_kwargs: MultiModalKwargs,
-) -> Sequence[PromptUpdate]:
-    hf_config = self.info.get_hf_config()
-    image_token_id = hf_config.image_token_index
-
-    def get_replacement(item_idx: int):
-        images = mm_items.get_items("image", ImageProcessorItems)
-
-        image_size = images.get_image_size(item_idx)
-        num_image_tokens = self.info.get_num_image_tokens(
-            image_width=image_size.width,
-            image_height=image_size.height,
-        )
-
-        return [image_token_id] * num_image_tokens
-
-    return [
-        PromptReplacement(
-            modality="image",
-            target=[image_token_id],
-            replacement=get_replacement,
-        ),
-    ]
-```
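
To make the effect of this replacement concrete, the following is a minimal sketch that operates on a plain list of token IDs. The helper `expand_image_placeholders` and the token ID `32000` are hypothetical and used only for illustration; this is not how vLLM applies prompt updates internally.

```python
from collections.abc import Sequence


def expand_image_placeholders(
    token_ids: Sequence[int],
    image_token_id: int,
    num_image_tokens: Sequence[int],
) -> list[int]:
    """Replace the i-th image placeholder with `num_image_tokens[i]` copies
    of itself (hypothetical helper for illustration)."""
    num_placeholders = sum(1 for tok in token_ids if tok == image_token_id)
    if num_placeholders != len(num_image_tokens):
        raise ValueError("Expected one token count per image placeholder")

    expanded: list[int] = []
    item_idx = 0
    for tok in token_ids:
        if tok == image_token_id:
            # One prompt update is applied per multi-modal item.
            expanded.extend([image_token_id] * num_image_tokens[item_idx])
            item_idx += 1
        else:
            expanded.append(tok)
    return expanded


# Two images that need 3 and 2 feature tokens respectively (assumed values).
expanded_prompt = expand_image_placeholders([1, 32000, 5, 32000, 2], 32000, [3, 2])
```

Each placeholder is expanded independently, which mirrors the per-item `get_replacement(item_idx)` callback above.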
-
-:::
-
-:::{tab-item} Handling additional tokens: Fuyu
-:sync: fuyu
-
-Recall the layout of feature tokens from Step 2:
-
-```
-|SPEAKER||SPEAKER|...|SPEAKER||NEWLINE|
-|SPEAKER||SPEAKER|...|SPEAKER||NEWLINE|
-...
-|SPEAKER||SPEAKER|...|SPEAKER||NEWLINE|
-```
-
-We define a helper function to return `ncols` and `nrows` directly:
-
-```python
-def get_image_feature_grid_size(
-    self,
-    *,
-    image_width: int,
-    image_height: int,
-) -> tuple[int, int]:
-    image_processor = self.get_image_processor()
-    target_width = image_processor.size["width"]
-    target_height = image_processor.size["height"]
-    patch_width = image_processor.patch_size["width"]
-    patch_height = image_processor.patch_size["height"]
-
-    if not (image_width <= target_width and image_height <= target_height):
-        height_scale_factor = target_height / image_height
-        width_scale_factor = target_width / image_width
-        optimal_scale_factor = min(height_scale_factor, width_scale_factor)
-
-        image_height = int(image_height * optimal_scale_factor)
-        image_width = int(image_width * optimal_scale_factor)
-
-    ncols = math.ceil(image_width / patch_width)
-    nrows = math.ceil(image_height / patch_height)
-    return ncols, nrows
-```
-
-Based on this, we can initially define our replacement tokens as:
-
-```python
-def get_replacement(item_idx: int):
-    images = mm_items.get_items("image", ImageProcessorItems)
-    image_size = images.get_image_size(item_idx)
-
-    ncols, nrows = self.info.get_image_feature_grid_size(
-        image_width=image_size.width,
-        image_height=image_size.height,
-    )
-
-    # `_IMAGE_TOKEN_ID` corresponds to `|SPEAKER|`
-    # `_NEWLINE_TOKEN_ID` corresponds to `|NEWLINE|`
-    return ([_IMAGE_TOKEN_ID] * ncols + [_NEWLINE_TOKEN_ID]) * nrows
-```
-
-However, this is not entirely correct. After `FuyuImageProcessor.preprocess_with_tokenizer_info` is called,
-a BOS token (`<s>`) is also added to the prompt:
-
-```python
-# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L435
-model_image_input = self.image_processor.preprocess_with_tokenizer_info(
-    image_input=tensor_batch_images,
-    image_present=image_present,
-    image_unpadded_h=image_unpadded_heights,
-    image_unpadded_w=image_unpadded_widths,
-    image_placeholder_id=image_placeholder_id,
-    image_newline_id=image_newline_id,
-    variable_sized=True,
-)
-prompt_tokens, prompts_length = _tokenize_prompts_with_image_and_batch(
-    tokenizer=self.tokenizer,
-    prompts=prompts,
-    scale_factors=scale_factors,
-    max_tokens_to_generate=self.max_tokens_to_generate,
-    max_position_embeddings=self.max_position_embeddings,
-    add_BOS=True,
-    add_beginning_of_answer_token=True,
-)
-```
-
-To assign the vision embeddings to only the image tokens, instead of a string
-you can return an instance of {class}`~vllm.multimodal.processing.PromptUpdateDetails`:
-
-```python
-hf_config = self.info.get_hf_config()
-bos_token_id = hf_config.bos_token_id  # `<s>`
-assert isinstance(bos_token_id, int)
-
-def get_replacement_fuyu(item_idx: int):
-    images = mm_items.get_items("image", ImageProcessorItems)
-    image_size = images.get_image_size(item_idx)
-
-    ncols, nrows = self.info.get_image_feature_grid_size(
-        image_width=image_size.width,
-        image_height=image_size.height,
-    )
-    image_tokens = ([_IMAGE_TOKEN_ID] * ncols +
-                    [_NEWLINE_TOKEN_ID]) * nrows
-
-    return PromptUpdateDetails.select_token_id(
-        image_tokens + [bos_token_id],
-        embed_token_id=_IMAGE_TOKEN_ID,
-    )
-```
-
-Finally, noticing that the HF processor removes the `|ENDOFTEXT|` token from the tokenized prompt,
-we can search for it to conduct the replacement at the start of the string:
-
-```python
-def _get_prompt_updates(
-    self,
-    mm_items: MultiModalDataItems,
-    hf_processor_mm_kwargs: Mapping[str, object],
-    out_mm_kwargs: MultiModalKwargs,
-) -> Sequence[PromptUpdate]:
-    hf_config = self.info.get_hf_config()
-    bos_token_id = hf_config.bos_token_id
-    assert isinstance(bos_token_id, int)
-
-    tokenizer = self.info.get_tokenizer()
-    eot_token_id = tokenizer.bos_token_id
-    assert isinstance(eot_token_id, int)
-
-    def get_replacement_fuyu(item_idx: int):
-        images = mm_items.get_items("image", ImageProcessorItems)
-        image_size = images.get_image_size(item_idx)
-
-        ncols, nrows = self.info.get_image_feature_grid_size(
-            image_width=image_size.width,
-            image_height=image_size.height,
-        )
-        image_tokens = ([_IMAGE_TOKEN_ID] * ncols +
-                        [_NEWLINE_TOKEN_ID]) * nrows
-
-        return PromptUpdateDetails.select_token_id(
-            image_tokens + [bos_token_id],
-            embed_token_id=_IMAGE_TOKEN_ID,
-        )
-
-    return [
-        PromptReplacement(
-            modality="image",
-            target=[eot_token_id],
-            replacement=get_replacement_fuyu,
-        )
-    ]
-```
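
The resulting token layout can be checked with a small standalone sketch. Everything here is an assumption made for illustration: the token IDs are placeholders, the 1920×1080 target size and 30×30 patch size are example values, and the boolean mask only imitates the idea of assigning embeddings to the image tokens; it is not the `PromptUpdateDetails` API.

```python
import math

# Hypothetical placeholder IDs, chosen for illustration only.
IMAGE_TOKEN_ID = 1    # stands in for `|SPEAKER|`
NEWLINE_TOKEN_ID = 2  # stands in for `|NEWLINE|`
BOS_TOKEN_ID = 0      # stands in for `<s>`


def grid_size(image_width: int, image_height: int, *,
              target_width: int = 1920, target_height: int = 1080,
              patch_width: int = 30, patch_height: int = 30) -> tuple[int, int]:
    """Return (ncols, nrows) after downscaling to fit the assumed target size."""
    if not (image_width <= target_width and image_height <= target_height):
        scale = min(target_height / image_height, target_width / image_width)
        image_height = int(image_height * scale)
        image_width = int(image_width * scale)

    ncols = math.ceil(image_width / patch_width)
    nrows = math.ceil(image_height / patch_height)
    return ncols, nrows


def build_replacement(image_width: int, image_height: int) -> tuple[list[int], list[bool]]:
    """Build the feature-token rows followed by BOS, plus a mask that marks
    which positions are image tokens."""
    ncols, nrows = grid_size(image_width, image_height)
    image_tokens = ([IMAGE_TOKEN_ID] * ncols + [NEWLINE_TOKEN_ID]) * nrows
    full = image_tokens + [BOS_TOKEN_ID]
    is_embed = [tok == IMAGE_TOKEN_ID for tok in full]
    return full, is_embed


tokens, is_embed = build_replacement(image_width=90, image_height=60)
```

For a 90×60 image under these assumptions, the grid has 3 columns and 2 rows, so the sequence holds 6 image tokens, 2 newline tokens, and one trailing BOS token; only the 6 image-token positions are marked.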
-
-:::
-
-::::
-
-## 5. Register processor-related classes
-
-After you have defined {class}`~vllm.multimodal.processing.BaseProcessingInfo` (Step 2),
-{class}`~vllm.multimodal.profiling.BaseDummyInputsBuilder` (Step 3),
-and {class}`~vllm.multimodal.processing.BaseMultiModalProcessor` (Step 4),
-decorate the model class with {meth}`MULTIMODAL_REGISTRY.register_processor <vllm.multimodal.registry.MultiModalRegistry.register_processor>`
-to register them to the multi-modal registry:
-
-```diff
-  from vllm.model_executor.models.interfaces import SupportsMultiModal
-+ from vllm.multimodal import MULTIMODAL_REGISTRY
-
-+ @MULTIMODAL_REGISTRY.register_processor(YourMultiModalProcessor,
-+                                         info=YourProcessingInfo,
-+                                         dummy_inputs=YourDummyInputsBuilder)
-  class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
-```
-
-## Notes
-
-### Inserting feature tokens without replacement
-
-Some HF processors directly insert feature tokens without replacing anything in the original prompt. In that case, you can use {class}`~vllm.multimodal.processing.PromptInsertion` instead of {class}`~vllm.multimodal.processing.PromptReplacement` inside {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates`.
-
-Examples:
-
-- BLIP-2 (insert at start of prompt): <gh-file:vllm/model_executor/models/blip2.py>
-- Florence2 (insert at start of prompt): <gh-file:vllm/model_executor/models/florence2.py>
-- Molmo (insert after `<|endoftext|>` token): <gh-file:vllm/model_executor/models/molmo.py>
-
-### Handling prompt updates unrelated to multi-modal data
-
-{meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates` assumes that each application of a prompt update corresponds to one multi-modal item. If the HF processor performs additional processing regardless of how many multi-modal items there are, you should override {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._apply_hf_processor_tokens_only` so that the processed token inputs are consistent with the result of applying the HF processor to text inputs. This is because token inputs bypass the HF processor according to [our design](#mm-processing).
-
-Examples:
-
-- Chameleon (appends `sep_token`): <gh-file:vllm/model_executor/models/chameleon.py>
-- Fuyu (appends `boa_token`): <gh-file:vllm/model_executor/models/fuyu.py>
-- Molmo (applies chat template which is not defined elsewhere): <gh-file:vllm/model_executor/models/molmo.py>
-
-### Custom HF processor
-
-Some models don't define a HF processor class on HF Hub. In that case, you can define a custom HF processor that has the same call signature as HF processors and pass it to {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._call_hf_processor`.
-
-Examples:
-
-- DeepSeek-VL2: <gh-file:vllm/model_executor/models/deepseek_vl2.py>
-- InternVL: <gh-file:vllm/model_executor/models/internvl.py>
-- Qwen-VL: <gh-file:vllm/model_executor/models/qwen_vl.py>
--- a/docs/source/deployment/frameworks/helm.md
+++ b/docs/source/deployment/frameworks/helm.md
-(deployment-helm)=
-
-# Helm
-
-A Helm chart to deploy vLLM for Kubernetes
-
-Helm is a package manager for Kubernetes. It helps you deploy vLLM on Kubernetes and automate that deployment. With Helm, you can deploy the same architecture with different configurations to multiple namespaces by overriding variable values.
-
-This guide walks you through deploying vLLM with Helm, including the prerequisites, the installation steps, and documentation of the architecture and the values file.
-
-## Prerequisites
-
-Before you begin, ensure that you have the following:
-
-- A running Kubernetes cluster
-- NVIDIA Kubernetes Device Plugin (`k8s-device-plugin`): This can be found at [https://github.com/NVIDIA/k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin)
-- Available GPU resources in your cluster
-- An S3 bucket containing the model to be deployed
-
-## Installing the chart
-
-To install the chart with the release name `test-vllm`:
-
-```console
-helm upgrade --install --create-namespace --namespace=ns-vllm test-vllm . -f values.yaml --set secrets.s3endpoint=$ACCESS_POINT --set secrets.s3bucketname=$BUCKET --set secrets.s3accesskeyid=$ACCESS_KEY --set secrets.s3accesskey=$SECRET_KEY
-```
-
-## Uninstalling the Chart
-
-To uninstall the `test-vllm` deployment:
-
-```console
-helm uninstall test-vllm --namespace=ns-vllm
-```
-
-The command removes all the Kubernetes components associated with the
-chart **including persistent volumes** and deletes the release.
-
-## Architecture
-
-:::{image} /assets/deployment/architecture_helm_deployment.png
-:::
-
-## Values
-
-:::{list-table}
-:widths: 25 25 25 25
-:header-rows: 1
-
- * Key
-  * Type
-  * Default
-  * Description
- * autoscaling
-  * object
-  * {"enabled":false,"maxReplicas":100,"minReplicas":1,"targetCPUUtilizationPercentage":80}
-  * Autoscaling configuration
- * autoscaling.enabled
-  * bool
-  * false
-  * Enable autoscaling
- * autoscaling.maxReplicas
-  * int
-  * 100
-  * Maximum replicas
- * autoscaling.minReplicas
-  * int
-  * 1
-  * Minimum replicas
- * autoscaling.targetCPUUtilizationPercentage
-  * int
-  * 80
-  * Target CPU utilization for autoscaling
- * configs
-  * object
-  * {}
-  * Configmap
- * containerPort
-  * int
-  * 8000
-  * Container port
- * customObjects
-  * list
-  * []
-  * Custom Objects configuration
- * deploymentStrategy
-  * object
-  * {}
-  * Deployment strategy configuration
- * externalConfigs
-  * list
-  * []
-  * External configuration
- * extraContainers
-  * list
-  * []
-  * Additional containers configuration
- * extraInit
-  * object
-  * {"pvcStorage":"1Gi","s3modelpath":"relative_s3_model_path/opt-125m", "awsEc2MetadataDisabled": true}
-  * Additional configuration for the init container
- * extraInit.pvcStorage
-  * string
-  * "50Gi"
-  * Storage size of the s3
- * extraInit.s3modelpath
-  * string
-  * "relative_s3_model_path/opt-125m"
-  * Path of the model on the s3 which hosts model weights and config files
- * extraInit.awsEc2MetadataDisabled
-  * boolean
-  * true
-  * Disables the use of the Amazon EC2 instance metadata service
- * extraPorts
-  * list
-  * []
-  * Additional ports configuration
- * gpuModels
-  * list
-  * ["TYPE_GPU_USED"]
-  * Type of gpu used
- * image
-  * object
-  * {"command":["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"],"repository":"vllm/vllm-openai","tag":"latest"}
-  * Image configuration
- * image.command
-  * list
-  * ["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"]
-  * Container launch command
- * image.repository
-  * string
-  * "vllm/vllm-openai"
-  * Image repository
- * image.tag
-  * string
-  * "latest"
-  * Image tag
- * livenessProbe
-  * object
-  * {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":15,"periodSeconds":10}
-  * Liveness probe configuration
- * livenessProbe.failureThreshold
-  * int
-  * 3
-  * Number of consecutive probe failures after which Kubernetes considers the overall check to have failed: the container is not alive
- * livenessProbe.httpGet
-  * object
-  * {"path":"/health","port":8000}
-  * Configuration of the Kubelet http request on the server
- * livenessProbe.httpGet.path
-  * string
-  * "/health"
-  * Path to access on the HTTP server
- * livenessProbe.httpGet.port
-  * int
-  * 8000
-  * Name or number of the port to access on the container, on which the server is listening
- * livenessProbe.initialDelaySeconds
-  * int
-  * 15
-  * Number of seconds after the container has started before liveness probe is initiated
- * livenessProbe.periodSeconds
-  * int
-  * 10
-  * How often (in seconds) to perform the liveness probe
- * maxUnavailablePodDisruptionBudget
-  * string
-  * ""
-  * Disruption Budget Configuration
- * readinessProbe
-  * object
-  * {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":5,"periodSeconds":5}
-  * Readiness probe configuration
- * readinessProbe.failureThreshold
-  * int
-  * 3
-  * Number of consecutive probe failures after which Kubernetes considers the overall check to have failed: the container is not ready
- * readinessProbe.httpGet
-  * object
-  * {"path":"/health","port":8000}
-  * Configuration of the Kubelet http request on the server
- * readinessProbe.httpGet.path
-  * string
-  * "/health"
-  * Path to access on the HTTP server
- * readinessProbe.httpGet.port
-  * int
-  * 8000
-  * Name or number of the port to access on the container, on which the server is listening
- * readinessProbe.initialDelaySeconds
-  * int
-  * 5
-  * Number of seconds after the container has started before readiness probe is initiated
- * readinessProbe.periodSeconds
-  * int
-  * 5
-  * How often (in seconds) to perform the readiness probe
- * replicaCount
-  * int
-  * 1
-  * Number of replicas
- * resources
-  * object
-  * {"limits":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1},"requests":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1}}
-  * Resource configuration
- * resources.limits."nvidia.com/gpu"
-  * int
-  * 1
-  * Number of gpus used
- * resources.limits.cpu
-  * int
-  * 4
-  * Number of CPUs
- * resources.limits.memory
-  * string
-  * "16Gi"
-  * CPU memory configuration
- * resources.requests."nvidia.com/gpu"
-  * int
-  * 1
-  * Number of gpus used
- * resources.requests.cpu
-  * int
-  * 4
-  * Number of CPUs
- * resources.requests.memory
-  * string
-  * "16Gi"
-  * CPU memory configuration
- * secrets
-  * object
-  * {}
-  * Secrets configuration
- * serviceName
-  * string
-  *
-  * Service name
- * servicePort
-  * int
-  * 80
-  * Service port
- * labels.environment
-  * string
-  * test
-  * Environment name
- * labels.release
-  * string
-  * test
-  * Release name
-:::
--- a/docs/source/deployment/frameworks/index.md
+++ b/docs/source/deployment/frameworks/index.md
-# Using other frameworks
-
-:::{toctree}
-:maxdepth: 1
-
-anything-llm
-bentoml
-cerebrium
-chatbox
-dify
-dstack
-helm
-litellm
-lobe-chat
-lws
-modal
-open-webui
-retrieval_augmented_generation
-skypilot
-streamlit
-triton
-:::
--- a/docs/source/deployment/integrations/index.md
+++ b/docs/source/deployment/integrations/index.md
-# External Integrations
-
-:::{toctree}
-:maxdepth: 1
-
-kserve
-kubeai
-llamastack
-llmaz
-production-stack
-:::
--- a/docs/source/features/compatibility_matrix.md
+++ b/docs/source/features/compatibility_matrix.md
-(compatibility-matrix)=
-
-# Compatibility Matrix
-
-The tables below show which features are mutually exclusive and which features are supported on which hardware.
-
-The symbols used have the following meanings:
-
-- ✅ = Full compatibility
-- 🟠 = Partial compatibility
-- ❌ = No compatibility
-
-:::{note}
-Click a linked ❌ or 🟠 to see the tracking issue for that unsupported feature/hardware combination.
-:::
-
-## Feature x Feature
-
-:::{raw} html
-<style>
-  /* Make smaller to try to improve readability  */
-  td {
-    font-size: 0.8rem;
-    text-align: center;
-  }
-
-  th {
-    text-align: center;
-    font-size: 0.8rem;
-  }
-</style>
-:::
-
-:::{list-table}
-:header-rows: 1
-:stub-columns: 1
-:widths: auto
-:class: vertical-table-header
-
- * Feature
-  * [CP](#chunked-prefill)
-  * [APC](#automatic-prefix-caching)
-  * [LoRA](#lora-adapter)
-  * <abbr title="Prompt Adapter">prmpt adptr</abbr>
-  * [SD](#spec-decode)
-  * CUDA graph
-  * <abbr title="Pooling Models">pooling</abbr>
-  * <abbr title="Encoder-Decoder Models">enc-dec</abbr>
-  * <abbr title="Logprobs">logP</abbr>
-  * <abbr title="Prompt Logprobs">prmpt logP</abbr>
-  * <abbr title="Async Output Processing">async output</abbr>
-  * multi-step
-  * <abbr title="Multimodal Inputs">mm</abbr>
-  * best-of
-  * beam-search
-  * <abbr title="Guided Decoding">guided dec</abbr>
- * [CP](#chunked-prefill)
-  * ✅
-  *
-  *
-  *
-  *
-  *
-  *
-  *
-  *
-  *
-  *
-  *
-  *
-  *
-  *
-  *
- * [APC](#automatic-prefix-caching)
-  * ✅
-  * ✅
-  *
-  *
-  *
-  *
-  *
-  *
-  *
-  *
-  *
-  *
-  *
-  *
-  *
-  *
- * [LoRA](#lora-adapter)
-  * ✅
-  * ✅
-  * ✅
-  *
-  *
-  *
-  *
-  *
-  *
-  *
-  *
-  *
-  *
-  *
-  *
-  *
- * <abbr title="Prompt Adapter">prmpt adptr</abbr>
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  *
-  *
-  *
-  *
-  *
-  *
-  *
-  *
-  *
-  *
-  *
-  *
- * [SD](#spec-decode)
-  * ✅
-  * ✅
-  * ❌
-  * ✅
-  * ✅
-  *
-  *
-  *
-  *
-  *
-  *
-  *
-  *
-  *
-  *
-  *
- * CUDA graph
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  *
-  *
-  *
-  *
-  *
-  *
-  *
-  *
-  *
-  *
- * <abbr title="Pooling Models">pooling</abbr>
-  * ❌
-  * ❌
-  * ❌
-  * ❌
-  * ❌
-  * ❌
-  * ✅
-  *
-  *
-  *
-  *
-  *
-  *
-  *
-  *
-  *
- * <abbr title="Encoder-Decoder Models">enc-dec</abbr>
-  * ❌
-  * [❌](gh-issue:7366)
-  * ❌
-  * ❌
-  * [❌](gh-issue:7366)
-  * ✅
-  * ✅
-  * ✅
-  *
-  *
-  *
-  *
-  *
-  *
-  *
-  *
- * <abbr title="Logprobs">logP</abbr>
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ❌
-  * ✅
-  * ✅
-  *
-  *
-  *
-  *
-  *
-  *
-  *
- * <abbr title="Prompt Logprobs">prmpt logP</abbr>
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ❌
-  * ✅
-  * ✅
-  * ✅
-  *
-  *
-  *
-  *
-  *
-  *
- * <abbr title="Async Output Processing">async output</abbr>
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ❌
-  * ✅
-  * ❌
-  * ❌
-  * ✅
-  * ✅
-  * ✅
-  *
-  *
-  *
-  *
-  *
- * multi-step
-  * ❌
-  * ✅
-  * ❌
-  * ✅
-  * ❌
-  * ✅
-  * ❌
-  * ❌
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  *
-  *
-  *
-  *
- * <abbr title="Multimodal Inputs">mm</abbr>
-  * ✅
-  * [🟠](gh-pr:8348)
-  * [🟠](gh-pr:4194)
-  * ❔
-  * ❔
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ❔
-  * ✅
-  *
-  *
-  *
- * best-of
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * [❌](gh-issue:6137)
-  * ✅
-  * ❌
-  * ✅
-  * ✅
-  * ✅
-  * ❔
-  * [❌](gh-issue:7968)
-  * ✅
-  * ✅
-  *
-  *
- * beam-search
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * [❌](gh-issue:6137)
-  * ✅
-  * ❌
-  * ✅
-  * ✅
-  * ✅
-  * ❔
-  * [❌](gh-issue:7968)
-  * ❔
-  * ✅
-  * ✅
-  *
- * <abbr title="Guided Decoding">guided dec</abbr>
-  * ✅
-  * ✅
-  * ❔
-  * ❔
-  * [❌](gh-issue:11484)
-  * ✅
-  * ❌
-  * ❔
-  * ✅
-  * ✅
-  * ✅
-  * [❌](gh-issue:9893)
-  * ❔
-  * ✅
-  * ✅
-  * ✅
-:::
-
-(feature-x-hardware)=
-
-## Feature x Hardware
-
-:::{list-table}
-:header-rows: 1
-:stub-columns: 1
-:widths: auto
-
- * Feature
-  * Volta
-  * Turing
-  * Ampere
-  * Ada
-  * Hopper
-  * CPU
-  * AMD
- * [CP](#chunked-prefill)
-  * [❌](gh-issue:2729)
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
- * [APC](#automatic-prefix-caching)
-  * [❌](gh-issue:3687)
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
- * [LoRA](#lora-adapter)
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
- * <abbr title="Prompt Adapter">prmpt adptr</abbr>
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * [❌](gh-issue:8475)
-  * ✅
- * [SD](#spec-decode)
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
- * CUDA graph
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ❌
-  * ✅
- * <abbr title="Pooling Models">pooling</abbr>
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ❔
- * <abbr title="Encoder-Decoder Models">enc-dec</abbr>
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ❌
- * <abbr title="Multimodal Inputs">mm</abbr>
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
- * <abbr title="Logprobs">logP</abbr>
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
- * <abbr title="Prompt Logprobs">prmpt logP</abbr>
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
- * <abbr title="Async Output Processing">async output</abbr>
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ❌
-  * ❌
- * multi-step
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * [❌](gh-issue:8477)
-  * ✅
- * best-of
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
- * beam-search
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
- * <abbr title="Guided Decoding">guided dec</abbr>
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-  * ✅
-:::
--- a/docs/source/features/quantization/index.md
+++ b/docs/source/features/quantization/index.md
-(quantization-index)=
-
-# Quantization
-
-Quantization trades off model precision for smaller memory footprint, allowing large models to be run on a wider range of devices.
-
-:::{toctree}
-:caption: Contents
-:maxdepth: 1
-
-supported_hardware
-auto_awq
-bnb
-bitblas
-gguf
-gptqmodel
-int4
-int8
-fp8
-modelopt
-quark
-quantized_kvcache
-torchao
-:::