Merge tag 'v0.9.1' into v0.9.1-dev

4c676e3d · zhuwenwen · b4c4464d · b6553be1 · 4c676e3d · 4c676e3d
Commit 4c676e3d authored Jun 20, 2025 by zhuwenwen
20 changed files
--- a/docs/source/assets/logos/vllm-logo-text-dark.png
+++ b/docs/source/assets/logos/vllm-logo-text-dark.png
--- a/docs/source/assets/logos/vllm-logo-text-light.png
+++ b/docs/source/assets/logos/vllm-logo-text-light.png
--- a/docs/cli/README.md
+++ b/docs/cli/README.md
+# vLLM CLI Guide
+
+The vllm command-line tool is used to run and manage vLLM models. You can start by viewing the help message with:
+
+```
+vllm --help
+```
+
+Available Commands:
+
+```
+vllm {chat,complete,serve,bench,collect-env,run-batch}
+```
+
+## serve
+
+Start the vLLM OpenAI Compatible API server.
+
+Examples:
+
+```bash
+# Start with a model
+vllm serve meta-llama/Llama-2-7b-hf
+
+# Specify the port
+vllm serve meta-llama/Llama-2-7b-hf --port 8100
+
+# Check with --help for more options
+# To list all groups
+vllm serve --help=listgroup
+
+# To view an argument group
+vllm serve --help=ModelConfig
+
+# To view a single argument
+vllm serve --help=max-num-seqs
+
+# To search by keyword
+vllm serve --help=max
+```
+
+## chat
+
+Generate chat completions via the running API server.
+
+Examples:
+
+```bash
+# Directly connect to localhost API without arguments
+vllm chat
+
+# Specify API url
+vllm chat --url http://{vllm-serve-host}:{vllm-serve-port}/v1
+
+# Quick chat with a single prompt
+vllm chat --quick "hi"
+```
+
+## complete
+
+Generate text completions based on the given prompt via the running API server.
+
+Examples:
+
+```bash
+# Directly connect to localhost API without arguments
+vllm complete
+
+# Specify API url
+vllm complete --url http://{vllm-serve-host}:{vllm-serve-port}/v1
+
+# Quick complete with a single prompt
+vllm complete --quick "The future of AI is"
+```
+
+## bench
+
+Run benchmark tests for latency, online serving throughput, and offline inference throughput.
+
+To use benchmark commands, please install with extra dependencies using `pip install vllm[bench]`.
+
+Available Commands:
+
+```bash
+vllm bench {latency, serve, throughput}
+```
+
+### latency
+
+Benchmark the latency of a single batch of requests.
+
+Example:
+
+```bash
+vllm bench latency \
+    --model meta-llama/Llama-3.2-1B-Instruct \
+    --input-len 32 \
+    --output-len 1 \
+    --enforce-eager \
+    --load-format dummy
+```
+
+### serve
+
+Benchmark the online serving throughput.
+
+Example:
+
+```bash
+vllm bench serve \
+    --model meta-llama/Llama-3.2-1B-Instruct \
+    --host server-host \
+    --port server-port \
+    --random-input-len 32 \
+    --random-output-len 4 \
+    --num-prompts 5
+```
+
+### throughput
+
+Benchmark offline inference throughput.
+
+Example:
+
+```bash
+vllm bench throughput \
+    --model meta-llama/Llama-3.2-1B-Instruct \
+    --input-len 32 \
+    --output-len 1 \
+    --enforce-eager \
+    --load-format dummy
+```
+
+## collect-env
+
+Start collecting environment information.
+
+```bash
+vllm collect-env
+```
+
+## run-batch
+
+Run batch prompts and write results to file.
+
+Examples:
+
+```bash
+# Running with a local file
+vllm run-batch \
+    -i offline_inference/openai_batch/openai_example_batch.jsonl \
+    -o results.jsonl \
+    --model meta-llama/Meta-Llama-3-8B-Instruct
+
+# Using remote file
+vllm run-batch \
+    -i https://raw.githubusercontent.com/vllm-project/vllm/main/examples/offline_inference/openai_batch/openai_example_batch.jsonl \
+    -o results.jsonl \
+    --model meta-llama/Meta-Llama-3-8B-Instruct
+```
+
+## More Help
+
+For detailed options of any subcommand, use:
+
+```bash
+vllm <subcommand> --help
+```
--- a/docs/source/community/meetups.md
+++ b/docs/source/community/meetups.md
-(meetups)=
-
-# vLLM Meetups
+---
+title: Meetups
+---
+[](){ #meetups }

 We host regular meetups in San Francisco Bay Area every 2 months. We will share the project updates from the vLLM team and have guest speakers from the industry to share their experience and insights. Please find the materials of our previous meetups below:

+- [NYC vLLM Meetup](https://lu.ma/c1rqyf1f), May 7th, 2025. [[Slides]](https://docs.google.com/presentation/d/1_q_aW_ioMJWUImf1s1YM-ZhjXz8cUeL0IJvaquOYBeA/edit?usp=sharing)
 - [Asia Developer Day](https://www.sginnovate.com/event/limited-availability-morning-evening-slots-remaining-inaugural-vllm-asia-developer-day), April 3rd 2025. [[Slides]](https://docs.google.com/presentation/d/19cp6Qu8u48ihB91A064XfaXruNYiBOUKrBxAmDOllOo/edit?usp=sharing).
 - [vLLM x Ollama Inference Night](https://lu.ma/vllm-ollama), March 27th 2025. [[Slides]](https://docs.google.com/presentation/d/16T2PDD1YwRnZ4Tu8Q5r6n53c5Lr5c73UV9Vd2_eBo4U/edit?usp=sharing).
 - [The first vLLM China Meetup](https://mp.weixin.qq.com/s/n77GibL2corAtQHtVEAzfg), March 16th 2025. [[Slides]](https://docs.google.com/presentation/d/1REHvfQMKGnvz6p3Fd23HhSO4c8j5WPGZV0bKYLwnHyQ/edit?usp=sharing).

--- a/docs/source/community/sponsors.md
+++ b/docs/source/community/sponsors.md
--- a/docs/configuration/README.md
+++ b/docs/configuration/README.md
+# Configuration Options
+
+This section lists the most common options for running vLLM.
+
+There are three main levels of configuration, from highest priority to lowest priority:
+
+- [Request parameters][completions-api] and [input arguments][sampling-params]
+- [Engine arguments](./engine_args.md)
+- [Environment variables](./env_vars.md)
--- a/docs/source/serving/offline_inference.md
+++ b/docs/source/serving/offline_inference.md
-(offline-inference)=
-
-# Offline Inference
-
-You can run vLLM in your own code on a list of prompts.
-
-The offline API is based on the {class}`~vllm.LLM` class.
-To initialize the vLLM engine, create a new instance of `LLM` and specify the model to run.
-
-For example, the following code downloads the [`facebook/opt-125m`](https://huggingface.co/facebook/opt-125m) model from HuggingFace
-and runs it in vLLM using the default configuration.
-
-```python
-from vllm import LLM
-
-llm = LLM(model="facebook/opt-125m")
-```
-
-After initializing the `LLM` instance, you can perform model inference using various APIs.
-The available APIs depend on the type of model that is being run:
-
- [Generative models](#generative-models) output logprobs which are sampled from to obtain the final output text.
- [Pooling models](#pooling-models) output their hidden states directly.
-
-Please refer to the above pages for more details about each API.
-
-:::{seealso}
-[API Reference](/api/offline_inference/index)
-:::
-
-(configuration-options)=
-
-## Configuration Options
-
-This section lists the most common options for running the vLLM engine.
-For a full list, refer to the [Engine Arguments](#engine-args) page.
-
-(model-resolution)=
-
-### Model resolution
-
-vLLM loads HuggingFace-compatible models by inspecting the `architectures` field in `config.json` of the model repository
-and finding the corresponding implementation that is registered to vLLM.
-Nevertheless, our model resolution may fail for the following reasons:
-
- The `config.json` of the model repository lacks the `architectures` field.
- Unofficial repositories refer to a model using alternative names which are not recorded in vLLM.
- The same architecture name is used for multiple models, creating ambiguity as to which model should be loaded.
-
-To fix this, explicitly specify the model architecture by passing `config.json` overrides to the `hf_overrides` option.
-For example:
-
-```python
-from vllm import LLM
-
-model = LLM(
-    model="cerebras/Cerebras-GPT-1.3B",
-    hf_overrides={"architectures": ["GPT2LMHeadModel"]},  # GPT-2
-)
-```
-
-Our [list of supported models](#supported-models) shows the model architectures that are recognized by vLLM.
-
-(reducing-memory-usage)=
-
-### Reducing memory usage
+# Conserving Memory

 Large models might cause your machine to run out of memory (OOM). Here are some options that help alleviate this problem.

-#### Tensor Parallelism (TP)
+## Tensor Parallelism (TP)

 Tensor parallelism (`tensor_parallel_size` option) can be used to split the model across multiple GPUs.

 The following code splits the model across 2 GPUs.

 ```python
+from vllm import LLM
+
 llm = LLM(model="ibm-granite/granite-3.1-8b-instruct",
          tensor_parallel_size=2)
 ```

-:::{important}
-To ensure that vLLM initializes CUDA correctly, you should avoid calling related functions (e.g. {func}`torch.cuda.set_device`)
-before initializing vLLM. Otherwise, you may run into an error like `RuntimeError: Cannot re-initialize CUDA in forked subprocess`.
+!!! warning
+    To ensure that vLLM initializes CUDA correctly, you should avoid calling related functions (e.g. [torch.cuda.set_device][])
+    before initializing vLLM. Otherwise, you may run into an error like `RuntimeError: Cannot re-initialize CUDA in forked subprocess`.

-To control which devices are used, please instead set the `CUDA_VISIBLE_DEVICES` environment variable.
-:::
+    To control which devices are used, please instead set the `CUDA_VISIBLE_DEVICES` environment variable.

-:::{note}
-With tensor parallelism enabled, each process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism).
+!!! note
+    With tensor parallelism enabled, each process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism).

-You can convert the model checkpoint to a sharded checkpoint using <gh-file:examples/offline_inference/save_sharded_state.py>. The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
-:::
+    You can convert the model checkpoint to a sharded checkpoint using <gh-file:examples/offline_inference/save_sharded_state.py>. The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.

-#### Quantization
+## Quantization

 Quantized models take less memory at the cost of lower precision.

-Statically quantized models can be downloaded from HF Hub (some popular ones are available at [Neural Magic](https://huggingface.co/neuralmagic))
+Statically quantized models can be downloaded from HF Hub (some popular ones are available at [Red Hat AI](https://huggingface.co/RedHatAI))
 and used directly without extra configuration.

-Dynamic quantization is also supported via the `quantization` option -- see [here](#quantization-index) for more details.
+Dynamic quantization is also supported via the `quantization` option -- see [here][quantization-index] for more details.

-#### Context length and batch size
+## Context length and batch size

 You can further reduce memory usage by limiting the context length of the model (`max_model_len` option)
 and the maximum batch size (`max_num_seqs` option).
@@ -113,13 +48,12 @@ llm = LLM(model="adept/fuyu-8b",
          max_num_seqs=2)
 ```

-#### Reduce CUDA Graphs
+## Reduce CUDA Graphs

 By default, we optimize model inference using CUDA graphs which take up extra memory in the GPU.

-:::{important}
-CUDA graph capture takes up more memory in V1 than in V0.
-:::
+!!! warning
+    CUDA graph capture takes up more memory in V1 than in V0.

 You can adjust `compilation_config` to achieve a better balance between inference speed and memory usage:

@@ -146,14 +80,14 @@ llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
          enforce_eager=True)
 ```

-#### Adjust cache size
+## Adjust cache size

 If you run out of CPU RAM, try the following options:

 - (Multi-modal models only) you can set the size of multi-modal input cache using `VLLM_MM_INPUT_CACHE_GIB` environment variable (default 4 GiB).
 - (CPU backend only) you can set the size of KV cache using `VLLM_CPU_KVCACHE_SPACE` environment variable (default 4 GiB).

-#### Multi-modal input limits
+## Multi-modal input limits

 You can allow a smaller number of multi-modal items per prompt to reduce the memory footprint of the model:

@@ -186,7 +120,7 @@ llm = LLM(model="google/gemma-3-27b-it",
          limit_mm_per_prompt={"image": 0})
 ```

-#### Multi-modal processor arguments
+## Multi-modal processor arguments

 For certain models, you can adjust the multi-modal processor arguments to
 reduce the size of the processed multi-modal inputs, which in turn saves memory.
@@ -208,8 +142,3 @@ llm = LLM(model="OpenGVLab/InternVL2-2B",
              "max_dynamic_patch": 4,  # Default is 12
          })
 ```
-
-### Performance optimization and tuning
-
-You can potentially improve the performance of vLLM by finetuning various options.
-Please refer to [this guide](#optimization-and-tuning) for more details.
--- a/docs/configuration/engine_args.md
+++ b/docs/configuration/engine_args.md
+---
+title: Engine Arguments
+---
+[](){ #engine-args }
+
+Engine arguments control the behavior of the vLLM engine.
+
+- For [offline inference][offline-inference], they are part of the arguments to the [LLM][vllm.LLM] class.
+- For [online serving][openai-compatible-server], they are part of the arguments to `vllm serve`.
+
+You can look at [EngineArgs][vllm.engine.arg_utils.EngineArgs] and [AsyncEngineArgs][vllm.engine.arg_utils.AsyncEngineArgs] to see the available engine arguments.
+
+However, these classes are a combination of the configuration classes defined in [vllm.config][]. We therefore recommend reading about the arguments there, where they are best documented.
+
+For offline inference, you have direct access to these configuration classes. For online serving, you can cross-reference the configs with `vllm serve --help`, which groups its arguments by config.
+
+!!! note
+    Additional arguments are available to the [AsyncLLMEngine][vllm.engine.async_llm_engine.AsyncLLMEngine], which is used for online serving. These can be found by running `vllm serve --help`.
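
The engine arguments page above says the same arguments reach the engine either as `LLM(...)` keyword arguments or as `vllm serve` flags. The sketch below illustrates how the two spellings relate. It assumes the kebab-case convention visible elsewhere in this diff (`max_num_seqs` vs `--help=max-num-seqs`, `enforce_eager=True` vs `--enforce-eager`) and that boolean options are bare flags. The helper `to_cli_flags` is hypothetical and not part of vLLM, so check the result against `vllm serve --help`.

```python
def to_cli_flags(engine_kwargs: dict) -> list[str]:
    """Translate LLM-style keyword arguments into CLI-style flags.

    Assumption: each snake_case keyword maps to a --kebab-case flag.
    This is a naming sketch, not vLLM's own argument parser.
    """
    flags: list[str] = []
    for name, value in engine_kwargs.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            # Assumed: boolean options are passed as bare flags.
            flags.append(flag)
        elif value is False or value is None:
            # Unset options are simply omitted.
            continue
        else:
            flags.extend([flag, str(value)])
    return flags


flags = to_cli_flags(
    {"max_num_seqs": 2, "enforce_eager": True, "tensor_parallel_size": 2}
)
# MODEL is a placeholder for the model name.
print(" ".join(["vllm", "serve", "MODEL", *flags]))
```
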
--- a/docs/configuration/env_vars.md
+++ b/docs/configuration/env_vars.md
+# Environment Variables
+
+vLLM uses the following environment variables to configure the system:
+
+!!! warning
+    Please note that `VLLM_PORT` and `VLLM_HOST_IP` set the port and IP for vLLM's **internal usage**. They are not the port and IP for the API server. If you use `--host $VLLM_HOST_IP` and `--port $VLLM_PORT` to start the API server, it will not work.
+
+    All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service as `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables).
+
+```python
+--8<-- "vllm/envs.py:env-vars-definition"
+```
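
The Kubernetes warning above can be made concrete. For a Service named `vllm`, Kubernetes injects Docker-links-style variables such as `VLLM_SERVICE_HOST` and `VLLM_PORT=tcp://<ip>:<port>`, and the latter overwrites the integer that `VLLM_PORT` is meant to hold. The following is a minimal sketch of a pre-flight check. The helper name `find_port_conflicts` and the example addresses are hypothetical, and this is not a check that vLLM itself is known to perform in this form.

```python
def find_port_conflicts(environ: dict) -> list[str]:
    """Report signs that Kubernetes service variables shadow vLLM's own."""
    problems: list[str] = []

    port = environ.get("VLLM_PORT")
    if port is not None and not port.isdigit():
        # A service named `vllm` yields e.g. VLLM_PORT=tcp://10.0.0.11:8000.
        problems.append(f"VLLM_PORT={port!r} is not an integer port")

    if "VLLM_SERVICE_HOST" in environ:
        # Set by Kubernetes only when a Service is literally named `vllm`.
        problems.append("VLLM_SERVICE_HOST is set; is a Service named 'vllm'?")

    return problems


# Example environment as Kubernetes would inject it (addresses are made up).
k8s_env = {"VLLM_PORT": "tcp://10.0.0.11:8000", "VLLM_SERVICE_HOST": "10.0.0.11"}
for problem in find_port_conflicts(k8s_env):
    print(problem)
```

In a real deployment you would pass `dict(os.environ)` instead of the hand-written dictionary.
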
--- a/docs/configuration/model_resolution.md
+++ b/docs/configuration/model_resolution.md
+# Model Resolution
+
+vLLM loads HuggingFace-compatible models by inspecting the `architectures` field in `config.json` of the model repository
+and finding the corresponding implementation that is registered to vLLM.
+Nevertheless, our model resolution may fail for the following reasons:
+
+- The `config.json` of the model repository lacks the `architectures` field.
+- Unofficial repositories refer to a model using alternative names which are not recorded in vLLM.
+- The same architecture name is used for multiple models, creating ambiguity as to which model should be loaded.
+
+To fix this, explicitly specify the model architecture by passing `config.json` overrides to the `hf_overrides` option.
+For example:
+
+```python
+from vllm import LLM
+
+model = LLM(
+    model="cerebras/Cerebras-GPT-1.3B",
+    hf_overrides={"architectures": ["GPT2LMHeadModel"]},  # GPT-2
+)
+```
+
+Our [list of supported models][supported-models] shows the model architectures that are recognized by vLLM.
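
To diagnose the first failure reason listed above (a `config.json` without an `architectures` field), you can inspect a locally downloaded config before deciding on `hf_overrides`. This is a minimal sketch using only the standard library. The helper `read_architectures` is hypothetical, and the temporary files stand in for a real model directory.

```python
import json
import tempfile
from pathlib import Path


def read_architectures(config_path: str) -> list[str]:
    """Return the `architectures` list from a config.json, or [] if absent."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return list(config.get("architectures") or [])


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "config.json"

    # A config that lacks the field: model resolution would need hf_overrides.
    path.write_text(json.dumps({"model_type": "gpt2"}), encoding="utf-8")
    missing = read_architectures(str(path))

    # A config that declares its architecture explicitly.
    path.write_text(
        json.dumps({"architectures": ["GPT2LMHeadModel"]}), encoding="utf-8"
    )
    present = read_architectures(str(path))

print(missing, present)
```
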
--- a/docs/configuration/optimization.md
+++ b/docs/configuration/optimization.md
+# Optimization and Tuning
+
+This guide covers optimization strategies and performance tuning for vLLM V1.
+
+## Preemption
+
+Due to the auto-regressive nature of the transformer architecture, there are times when KV cache space is insufficient to handle all batched requests.
+In such cases, vLLM can preempt requests to free up KV cache space for other requests. Preempted requests are recomputed when sufficient KV cache space becomes
+available again. When this occurs, you may see the following warning:
+
+```text
+WARNING 05-09 00:49:33 scheduler.py:1057 Sequence group 0 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_cumulative_preemption_cnt=1
+```
+
+While this mechanism ensures system robustness, preemption and recomputation can adversely affect end-to-end latency.
+If you frequently encounter preemptions, consider the following actions:
+
+- Increase `gpu_memory_utilization`. vLLM pre-allocates GPU cache using this percentage of memory. By increasing utilization, you can provide more KV cache space.
+- Decrease `max_num_seqs` or `max_num_batched_tokens`. This reduces the number of concurrent requests in a batch, thereby requiring less KV cache space.
+- Increase `tensor_parallel_size`. This shards model weights across GPUs, allowing each GPU to have more memory available for KV cache. However, increasing this value may cause excessive synchronization overhead.
+- Increase `pipeline_parallel_size`. This distributes model layers across GPUs, reducing the memory needed for model weights on each GPU, indirectly leaving more memory available for KV cache. However, increasing this value may cause latency penalties.
+
+You can monitor the number of preemption requests through Prometheus metrics exposed by vLLM. Additionally, you can log the cumulative number of preemption requests by setting `disable_log_stats=False`.
+
+In vLLM V1, the default preemption mode is `RECOMPUTE` rather than `SWAP`, as recomputation has lower overhead in the V1 architecture.
+
+[](){ #chunked-prefill }
+
+## Chunked Prefill
+
+Chunked prefill allows vLLM to process large prefills in smaller chunks and batch them together with decode requests. This feature helps improve both throughput and latency by better balancing compute-bound (prefill) and memory-bound (decode) operations.
+
+In vLLM V1, **chunked prefill is always enabled by default**. This is different from vLLM V0, where it was conditionally enabled based on model characteristics.
+
+With chunked prefill enabled, the scheduling policy prioritizes decode requests. It batches all pending decode requests before scheduling any prefill operations. When there are available tokens in the `max_num_batched_tokens` budget, it schedules pending prefills. If a pending prefill request cannot fit into `max_num_batched_tokens`, it automatically chunks it.
+
+This policy has two benefits:
+
+- It improves inter-token latency (ITL) and decode performance because decode requests are prioritized.
+- It helps achieve better GPU utilization by placing compute-bound (prefill) and memory-bound (decode) requests in the same batch.
+
+### Performance Tuning with Chunked Prefill
+
+You can tune the performance by adjusting `max_num_batched_tokens`:
+
+- Smaller values (e.g., 2048) achieve better inter-token latency (ITL) because there are fewer prefills slowing down decodes.
+- Higher values achieve better time to first token (TTFT) as you can process more prefill tokens in a batch.
+- For optimal throughput, we recommend setting `max_num_batched_tokens > 8096`, especially for smaller models on large GPUs.
+- If `max_num_batched_tokens` is the same as `max_model_len`, the behavior is almost equivalent to the V0 default scheduling policy (except that it still prioritizes decodes).
+
+```python
+from vllm import LLM
+
+# Set max_num_batched_tokens to tune performance
+llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", max_num_batched_tokens=16384)
+```
+
+See related papers for more details (<https://arxiv.org/pdf/2401.08671> or <https://arxiv.org/pdf/2308.16369>).
+
+## Parallelism Strategies
+
+vLLM supports multiple parallelism strategies that can be combined to optimize performance across different hardware configurations.
+
+### Tensor Parallelism (TP)
+
+Tensor parallelism shards model parameters across multiple GPUs within each model layer. This is the most common strategy for large model inference within a single node.
+
+**When to use:**
+
+- When the model is too large to fit on a single GPU
+- When you need to reduce memory pressure per GPU to allow more KV cache space for higher throughput
+
+```python
+from vllm import LLM
+
+# Split model across 4 GPUs
+llm = LLM(model="meta-llama/Llama-3.3-70B-Instruct", tensor_parallel_size=4)
+```
+
+For models that are too large to fit on a single GPU (like 70B parameter models), tensor parallelism is essential.
+
+### Pipeline Parallelism (PP)
+
+Pipeline parallelism distributes model layers across multiple GPUs. Each GPU processes different parts of the model in sequence.
+
+**When to use:**
+
+- When you've already maxed out efficient tensor parallelism but need to distribute the model further, or across nodes
+- For very deep and narrow models where layer distribution is more efficient than tensor sharding
+
+Pipeline parallelism can be combined with tensor parallelism for very large models:
+
+```python
+from vllm import LLM
+
+# Combine pipeline and tensor parallelism
+llm = LLM(
+    model="meta-llama/Llama-3.3-70B-Instruct",
+    tensor_parallel_size=4,
+    pipeline_parallel_size=2
+)
+```
+
+### Expert Parallelism (EP)
+
+Expert parallelism is a specialized form of parallelism for Mixture of Experts (MoE) models, where different expert networks are distributed across GPUs.
+
+**When to use:**
+
+- Specifically for MoE models (like DeepSeekV3, Qwen3MoE, Llama-4)
+- When you want to balance the expert computation load across GPUs
+
+Expert parallelism is enabled by setting `enable_expert_parallel=True`, which will use expert parallelism instead of tensor parallelism for MoE layers.
+It will use the same degree of parallelism as what you have set for tensor parallelism.
+
+### Data Parallelism (DP)
+
+Data parallelism replicates the entire model across multiple GPU sets and processes different batches of requests in parallel.
+
+**When to use:**
+
+- When you have enough GPUs to replicate the entire model
+- When you need to scale throughput rather than model size
+- In multi-user environments where isolation between request batches is beneficial
+
+Data parallelism can be combined with the other parallelism strategies and is set by `data_parallel_size=N`.
+Note that MoE layers will be sharded according to the product of the tensor parallel size and data parallel size.
+
+## Reducing Memory Usage
+
+If you encounter out-of-memory issues, consider these strategies:
+
+### Context Length and Batch Size
+
+You can reduce memory usage by limiting the context length and batch size:
+
+```python
+from vllm import LLM
+
+llm = LLM(
+    model="meta-llama/Llama-3.1-8B-Instruct",
+    max_model_len=2048,  # Limit context window
+    max_num_seqs=4       # Limit batch size
+)
+```
+
+### Adjust CUDA Graph Compilation
+
+CUDA graph compilation in V1 uses more memory than in V0. You can reduce memory usage by adjusting the compilation level:
+
+```python
+from vllm import LLM
+from vllm.config import CompilationConfig, CompilationLevel
+
+llm = LLM(
+    model="meta-llama/Llama-3.1-8B-Instruct",
+    compilation_config=CompilationConfig(
+        level=CompilationLevel.PIECEWISE,
+        cudagraph_capture_sizes=[1, 2, 4, 8]  # Capture fewer batch sizes
+    )
+)
+```
+
+Or, if you are not concerned about latency or overall performance, disable CUDA graph compilation entirely with `enforce_eager=True`:
+
+```python
+from vllm import LLM
+
+llm = LLM(
+    model="meta-llama/Llama-3.1-8B-Instruct",
+    enforce_eager=True  # Disable CUDA graph compilation
+)
+```
+
+### Multimodal Models
+
+For multi-modal models, you can reduce memory usage by limiting the number of images/videos per request:
+
+```python
+from vllm import LLM
+
+# Accept up to 2 images per prompt
+llm = LLM(
+    model="Qwen/Qwen2.5-VL-3B-Instruct",
+    limit_mm_per_prompt={"image": 2}
+)
+```
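
The Chunked Prefill section of `docs/configuration/optimization.md` above describes a token budget: decodes are scheduled first, then pending prefills use what remains of `max_num_batched_tokens`, and a prefill that does not fit is chunked. The toy model below walks through one such step. It is a sketch under simplifying assumptions (each decode request costs exactly one token, prefills are taken in order) and is not vLLM's scheduler. The function name `schedule_step` is hypothetical.

```python
def schedule_step(
    num_decodes: int,
    pending_prefills: list[int],
    max_num_batched_tokens: int,
) -> tuple[int, list[int], list[int]]:
    """Simulate one scheduling step of a decode-first, chunked-prefill policy.

    Returns (decode_tokens, prefill_chunks, remaining_prefill_tokens).
    """
    # Decodes are prioritized; assumed cost is one token per decode request.
    decode_tokens = min(num_decodes, max_num_batched_tokens)
    budget = max_num_batched_tokens - decode_tokens

    chunks: list[int] = []
    remaining: list[int] = []
    for tokens in pending_prefills:
        take = min(tokens, budget)  # chunk the prefill if it does not fit
        if take > 0:
            chunks.append(take)
            budget -= take
        if tokens > take:
            remaining.append(tokens - take)  # left over for a later step
    return decode_tokens, chunks, remaining


# 8 running decodes, two waiting prompts of 3000 and 500 tokens, budget 2048.
result = schedule_step(8, [3000, 500], 2048)
print(result)
```

With a budget of 2048, the 8 decodes leave 2040 tokens, so the first prompt is chunked and the second waits. A larger budget admits more prefill tokens per step at the cost of slower decodes in that step, which is the ITL versus TTFT trade-off described above.
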
--- a/docs/configuration/serve_args.md
+++ b/docs/configuration/serve_args.md
+---
+title: Server Arguments
+---
+[](){ #serve-args }
+
+The `vllm serve` command is used to launch the OpenAI-compatible server.
+
+## CLI Arguments
+
+To see the available CLI arguments, run `vllm serve --help`.
+
+## Configuration file
+
+You can load CLI arguments via a [YAML](https://yaml.org/) config file.
+The argument names must be the long form of those outlined [above][serve-args].
+
+For example:
+
+```yaml
+# config.yaml
+
+model: meta-llama/Llama-3.1-8B-Instruct
+host: "127.0.0.1"
+port: 6379
+uvicorn-log-level: "info"
+```
+
+To use the above config file:
+
+```bash
+vllm serve --config config.yaml
+```
+
+!!! note
+    If an argument is supplied both on the command line and in the config file, the value from the command line takes precedence.
+    The order of priorities is `command line > config file values > defaults`.
+    For example, in `vllm serve SOME_MODEL --config config.yaml`, `SOME_MODEL` takes precedence over `model` in the config file.
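
The priority order in the note above (`command line > config file values > defaults`) is a layered lookup. The sketch below shows one way to model it with `collections.ChainMap`. It illustrates the rule only and is not how `vllm serve` parses its arguments. The helper `resolve_args` and the default values are hypothetical.

```python
from collections import ChainMap


def resolve_args(cli: dict, config_file: dict, defaults: dict) -> dict:
    """Merge three layers of settings; earlier layers win."""
    # Drop CLI options that were not supplied so they do not mask lower layers.
    supplied = {key: value for key, value in cli.items() if value is not None}
    return dict(ChainMap(supplied, config_file, defaults))


# Hypothetical defaults, used only for this illustration.
defaults = {"host": "0.0.0.0", "port": 8000, "model": None}
# Values from the config.yaml example above.
config_file = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "host": "127.0.0.1",
    "port": 6379,
}
# Equivalent of `vllm serve SOME_MODEL --config config.yaml`.
cli = {"model": "SOME_MODEL", "port": None}

effective = resolve_args(cli, config_file, defaults)
print(effective)
```
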
--- a/docs/source/contributing/overview.md
+++ b/docs/source/contributing/overview.md
@@ -16,9 +16,9 @@ Finally, one of the most impactful ways to support us is by raising awareness ab
 Unsure on where to start? Check out the following links for tasks to work on:

 - [Good first issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue%20state%3Aopen%20label%3A%22good%20first%20issue%22)
-  - [Selected onboarding tasks](gh-project:6)
- [New model requests](https://github.com/vllm-project/vllm/issues?q=is%3Aissue%20state%3Aopen%20label%3A%22new%20model%22)
-  - [Models with multi-modal capabilities](gh-project:10)
+    - [Selected onboarding tasks](gh-project:6)
+- [New model requests](https://github.com/vllm-project/vllm/issues?q=is%3Aissue%20state%3Aopen%20label%3A%22new-model%22)
+    - [Models with multi-modal capabilities](gh-project:10)

 ## License

@@ -27,7 +27,69 @@ See <gh-file:LICENSE>.
 ## Developing

 Depending on the kind of development you'd like to do (e.g. Python, CUDA), you can choose to build vLLM with or without compilation.
-Check out the [building from source](#build-from-source) documentation for details.
+Check out the [building from source][build-from-source] documentation for details.
+
+### Building the docs with MkDocs
+
+#### Introduction to MkDocs
+
+[MkDocs](https://github.com/mkdocs/mkdocs) is a fast, simple and downright gorgeous static site generator that's geared towards building project documentation. Documentation source files are written in Markdown, and configured with a single YAML configuration file.
+
+#### Install MkDocs and Plugins
+
+Install MkDocs along with the [plugins](https://github.com/vllm-project/vllm/blob/main/mkdocs.yaml) used in the vLLM documentation, as well as required dependencies:
+
+```bash
+pip install -r requirements/docs.txt
+```
+
+!!! note
+    Ensure that your Python version is compatible with the plugins (e.g., `mkdocs-awesome-nav` requires Python 3.10+).
+
+#### Verify Installation
+
+Confirm that MkDocs is correctly installed:
+
+```bash
+mkdocs --version
+```
+
+Example output:
+
+```console
+mkdocs, version 1.6.1 from /opt/miniconda3/envs/mkdoc/lib/python3.10/site-packages/mkdocs (Python 3.10)
+```
+
+#### Clone the `vLLM` repository
+
+```bash
+git clone https://github.com/vllm-project/vllm.git
+cd vllm
+```
+
+#### Start the Development Server
+
+MkDocs comes with a built-in dev-server that lets you preview your documentation as you work on it. Make sure you're in the same directory as the `mkdocs.yaml` configuration file, and then start the server by running the `mkdocs serve` command:
+
+```bash
+mkdocs serve
+```
+
+Example output:
+
+```console
+INFO    -  Documentation built in 106.83 seconds
+INFO    -  [22:02:02] Watching paths for changes: 'docs', 'mkdocs.yaml'
+INFO    -  [22:02:02] Serving on http://127.0.0.1:8000/
+```
+
+#### View in Your Browser
+
+Open up [http://127.0.0.1:8000/](http://127.0.0.1:8000/) in your browser to see a live preview.
+
+#### Learn More
+
+For additional features and advanced configurations, refer to the official [MkDocs Documentation](https://www.mkdocs.org/).

 ## Testing

@@ -40,27 +102,36 @@ pre-commit install --hook-type pre-commit --hook-type commit-msg
 # You can manually run pre-commit with
 pre-commit run --all-files

+# To manually run something from CI that does not run
+# locally by default, you can run:
+pre-commit run mypy-3.9 --hook-stage manual --all-files
+
 # Unit tests
 pytest tests/
+
+# Run tests for a single test file with detailed output
+pytest -s -v tests/test_logger.py
 ```

-:::{tip}
-Since the <gh-file:docker/Dockerfile> ships with Python 3.12, all tests in CI (except `mypy`) are run with Python 3.12.
+!!! tip
+    Since the <gh-file:docker/Dockerfile> ships with Python 3.12, all tests in CI (except `mypy`) are run with Python 3.12.
+
+    Therefore, we recommend developing with Python 3.12 to minimise the chance of your local environment clashing with our CI environment.

-Therefore, we recommend developing with Python 3.12 to minimise the chance of your local environment clashing with our CI environment.
-:::
+!!! note
+    Currently, the repository is not fully checked by `mypy`.

-:::{note}
-Currently, the repository is not fully checked by `mypy`.
-:::
+!!! note
+    Currently, not all unit tests pass when run on CPU platforms. If you don't have access to a GPU
+    platform to run unit tests locally, rely on the continuous integration system to run the tests for
+    now.

 ## Issues

 If you encounter a bug or have a feature request, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.

-:::{important}
-If you discover a security vulnerability, please follow the instructions [here](gh-file:SECURITY.md#reporting-a-vulnerability).
-:::
+!!! warning
+    If you discover a security vulnerability, please follow the instructions [here](gh-file:SECURITY.md#reporting-a-vulnerability).

 ## Pull Requests & Code Reviews

@@ -96,9 +167,8 @@ appropriately to indicate the type of change. Please use one of the following:
 - `[Misc]` for PRs that do not fit the above categories. Please use this
  sparingly.

-:::{note}
-If the PR spans more than one category, please include all relevant prefixes.
-:::
+!!! note
+    If the PR spans more than one category, please include all relevant prefixes.

 ### Code Quality

@@ -111,9 +181,8 @@ The PR needs to meet the following code quality standards:
  understand the code.
 - Include sufficient tests to ensure the project stays correct and robust. This
  includes both unit tests and integration tests.
- Please add documentation to `docs/source/` if the PR modifies the
-  user-facing behaviors of vLLM. It helps vLLM users understand and utilize the
-  new features or changes.
+- Please add documentation to `docs/` if the PR modifies the user-facing behaviors of vLLM.
+  It helps vLLM users understand and utilize the new features or changes.

 ### Adding or Changing Kernels


--- a/docs/source/performance/benchmarks.md
+++ b/docs/source/performance/benchmarks.md
-(benchmarks)=
-
-# Benchmark Suites
+---
+title: Benchmark Suites
+---
+[](){ #benchmarks }

 vLLM contains two sets of benchmarks:

- [Performance benchmarks](#performance-benchmarks)
- [Nightly benchmarks](#nightly-benchmarks)
+- [Performance benchmarks][performance-benchmarks]
+- [Nightly benchmarks][nightly-benchmarks]

-(performance-benchmarks)=
+[](){ #performance-benchmarks }

 ## Performance Benchmarks

@@ -17,7 +18,7 @@ The latest performance results are hosted on the public [vLLM Performance Dashbo

 More information on the performance benchmarks and their parameters can be found [here](gh-file:.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md).

-(nightly-benchmarks)=
+[](){ #nightly-benchmarks }

 ## Nightly Benchmarks


--- a/docs/contributing/ci-failures.md
+++ b/docs/contributing/ci-failures.md
+# CI Failures
+
+What should I do when a CI job fails on my PR, but I don't think my PR caused
+the failure?
+
+- Check the dashboard of current CI test failures:  
+  👉 [CI Failures Dashboard](https://github.com/orgs/vllm-project/projects/20)
+
+- If your failure **is already listed**, it's likely unrelated to your PR.  
+  Help with fixing it is always welcome!  
+    - Leave comments with links to additional instances of the failure.  
+    - React with a 👍 to signal how many are affected.
+
+- If your failure **is not listed**, you should **file an issue**.
+
+## Filing a CI Test Failure Issue
+
+- **File a bug report:**  
+    👉 [New CI Failure Report](https://github.com/vllm-project/vllm/issues/new?template=450-ci-failure.yml)
+
+- **Use this title format:**
+  
+    ```
+    [CI Failure]: failing-test-job - regex/matching/failing:test
+    ```
+
+- **For the environment field:**
+  
+    ```
+    Still failing on main as of commit abcdef123
+    ```
+
+- **In the description, include failing tests:**
+  
+    ```
+    FAILED failing/test.py:failing_test1 - Failure description  
+    FAILED failing/test.py:failing_test2 - Failure description  
+    https://github.com/orgs/vllm-project/projects/20  
+    https://github.com/vllm-project/vllm/issues/new?template=400-bug-report.yml  
+    FAILED failing/test.py:failing_test3 - Failure description  
+    ```
+
+- **Attach logs** (collapsible section example):
+    <details>
+    <summary>Logs:</summary>
+
+    ```text
+    ERROR 05-20 03:26:38 [dump_input.py:68] Dumping input data  
+    --- Logging error ---  
+    Traceback (most recent call last):  
+      File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 203, in execute_model  
+        return self.model_executor.execute_model(scheduler_output)  
+    ...
+    FAILED failing/test.py:failing_test1 - Failure description  
+    FAILED failing/test.py:failing_test2 - Failure description  
+    FAILED failing/test.py:failing_test3 - Failure description  
+    ```
+  
+    </details>
+
+## Logs Wrangling
+
+Download the full log file from Buildkite locally.
+
+Strip timestamps and colorization:
+
+<gh-file:.buildkite/scripts/ci-clean-log.sh>
+
+```bash
+./ci-clean-log.sh ci.log
+```
+
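If the script is not at hand, the cleanup step can be approximated in Python. The sketch below is a hypothetical stand-in and does not reproduce the contents of `ci-clean-log.sh`. It assumes that colorization is encoded as ANSI escape sequences and that each line may start with a bracketed ISO-8601 timestamp such as `[2025-05-20T03:26:38Z]`; both patterns are assumptions and may need adjusting for your log.

```python
import re

# Assumption: colors are ANSI CSI sequences (ESC [ ... final letter).
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;?]*[A-Za-z]")
# Assumption: a line may start with a bracketed ISO-8601 UTC timestamp.
LEADING_TIMESTAMP = re.compile(
    r"^\[\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z\]\s?"
)


def clean_log_line(line: str) -> str:
    """Remove ANSI escape sequences, then a leading timestamp, from one line."""
    line = ANSI_ESCAPE.sub("", line)
    return LEADING_TIMESTAMP.sub("", line)


def clean_log(text: str) -> str:
    """Clean every line of a log while preserving line order."""
    return "\n".join(clean_log_line(line) for line in text.splitlines())
```

Escape sequences are removed first so that a timestamp wrapped in color codes is still recognized at the start of the line.
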
+Use a clipboard tool such as [wl-clipboard](https://github.com/bugaevc/wl-clipboard) for quick copy-pasting:
+
+```bash
+tail -525 ci_build.log | wl-copy
+```
+
+## Investigating a CI Test Failure
+
+1. Go to 👉 [Buildkite main branch](https://buildkite.com/vllm/ci/builds?branch=main)  
+2. Bisect to find the first build that shows the issue.  
+3. Add your findings to the GitHub issue.  
+4. If you find a strong candidate PR, mention it in the issue and ping contributors.
+
+## Reproducing a Failure
+
+CI test failures may be flaky. Use a bash loop to run repeatedly:
+
+<gh-file:.buildkite/scripts/rerun-test.sh>
+
+```bash
+./rerun-test.sh tests/v1/engine/test_engine_core_client.py::test_kv_cache_events[True-tcp]
+```
+
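The same rerun-until-failure idea can be sketched in Python. This is an illustrative helper, not the contents of `rerun-test.sh`; the names `rerun_until_failure` and `pytest_command` and the default of 10 runs are assumptions made for this example.

```python
import subprocess
import sys
from typing import Optional, Sequence


def pytest_command(test_id: str) -> list[str]:
    """Build a pytest invocation for one test ID using the current interpreter."""
    return [sys.executable, "-m", "pytest", test_id]


def rerun_until_failure(cmd: Sequence[str], max_runs: int = 10) -> Optional[int]:
    """Run `cmd` up to `max_runs` times.

    Returns the 1-based index of the first failing run,
    or None if every run exited with status 0.
    """
    for attempt in range(1, max_runs + 1):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return attempt
    return None
```

For example, `rerun_until_failure(pytest_command("tests/v1/engine/test_engine_core_client.py::test_kv_cache_events[True-tcp]"))` stops at the first failing run, so its output is the last thing in the terminal.
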
+## Submitting a PR
+
+If you submit a PR to fix a CI failure:
+
+- Link the PR to the issue:  
+  Add `Closes #12345` to the PR description.
+- Add the `ci-failure` label:  
+  This helps track it in the [CI Failures GitHub Project](https://github.com/orgs/vllm-project/projects/20).
+
+## Other Resources
+
+- 🔍 [Test Reliability on `main`](https://buildkite.com/organizations/vllm/analytics/suites/ci-1/tests?branch=main&order=ASC&sort_by=reliability)
+- 🧪 [Latest Buildkite CI Runs](https://buildkite.com/vllm/ci/builds?branch=main)
+
+## Daily Triage
+
+Use [Buildkite analytics (2-day view)](https://buildkite.com/organizations/vllm/analytics/suites/ci-1/tests?branch=main&period=2days) to:
+
+- Identify recent test failures **on `main`**.
+- Exclude legitimate test failures on PRs.
+- (Optional) Ignore tests with 0% reliability.
+
+Compare to the [CI Failures Dashboard](https://github.com/orgs/vllm-project/projects/20).
--- a/docs/contributing/deprecation_policy.md
+++ b/docs/contributing/deprecation_policy.md
+# Deprecation Policy
+
+This document outlines the official policy and process for deprecating features
+in the vLLM project.
+
+## Overview
+
+vLLM uses a structured "deprecation pipeline" to guide the lifecycle of
+deprecated features. This policy ensures that users are given clear and
+sufficient notice when a feature is deprecated and that deprecations proceed in
+a consistent and predictable manner.
+
+We aim to strike a balance between continued innovation and respecting users’
+reliance on existing functionality. Deprecations are tied to our **minor (Y)
+releases** following semantic versioning (X.Y.Z), where:
+
+- **X** is a major version (rare)
+- **Y** is a minor version (used for significant changes, including deprecations/removals)
+- **Z** is a patch version (used for fixes and safer enhancements)
+
+Features that fall under this policy include (at a minimum) the following:
+
+- CLI flags
+- Environment variables
+- Configuration files
+- APIs in the OpenAI-compatible API server
+- Public Python APIs for the `vllm` library
+
+## Deprecation Pipeline
+
+The deprecation process consists of several clearly defined stages that span
+multiple Y releases:
+
+**1. Deprecated (Still On By Default)**
+
+- **Action**: Feature is marked as deprecated.
+- **Timeline**: A removal version is explicitly stated in the deprecation
+warning (e.g., "This will be removed in v0.10.0").
+- **Communication**: Deprecation is noted in the following, as applicable:
+  - Help strings
+  - Log output
+  - API responses
+  - `/metrics` output (for metrics features)
+  - User-facing documentation
+  - Release notes
+  - GitHub Issue (RFC) for feedback
+  - Documentation and use of the `@typing_extensions.deprecated` decorator for Python APIs
+
+**2. Deprecated (Off By Default)**
+
+- **Action**: Feature is disabled by default, but can still be re-enabled via a
+CLI flag or environment variable. Feature throws an error when used without
+re-enabling.
+- **Purpose**: Allows users who missed earlier warnings a temporary escape hatch
+while signaling imminent removal. Ensures any remaining usage is clearly
+surfaced and blocks silent breakage before full removal.
+
+**3. Removed**
+
+- **Action**: Feature is completely removed from the codebase.
+- **Note**: Only features that have passed through the previous deprecation
+stages will be removed.
+
+## Example Timeline
+
+Assume a feature is deprecated in `v0.9.0`.
+
+| Release       | Status                                                                                          |
+|---------------|-------------------------------------------------------------------------------------------------|
+| `v0.9.0`      | Feature is deprecated with clear removal version listed.                                        |
+| `v0.10.0`     | Feature is now off by default, throws an error when used, and can be re-enabled for legacy use. |
+| `v0.11.0`     | Feature is removed.                                                                             |
+
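The timeline in the table can be expressed as simple version arithmetic. The sketch below assumes exactly one minor (Y) release per stage, as in the example; the policy itself only requires that the stages span multiple Y releases, so real schedules may be longer. The `DeprecationTimeline` type and `deprecation_timeline` function are hypothetical names used for illustration.

```python
import re
from typing import NamedTuple


class DeprecationTimeline(NamedTuple):
    deprecated_in: str
    off_by_default_in: str
    removed_in: str


def deprecation_timeline(deprecated_in: str) -> DeprecationTimeline:
    """Compute the stage versions, assuming one minor release per stage."""
    match = re.fullmatch(r"v(\d+)\.(\d+)\.(\d+)", deprecated_in)
    if match is None:
        raise ValueError(f"Expected a version like 'v0.9.0', got {deprecated_in!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return DeprecationTimeline(
        deprecated_in=deprecated_in,
        # Stages advance on minor releases only; the patch component resets to 0.
        off_by_default_in=f"v{major}.{minor + 1}.0",
        removed_in=f"v{major}.{minor + 2}.0",
    )
```
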
+## Important Guidelines
+
+- **No Removals in Patch Releases**: Removing deprecated features in patch
+(`.Z`) releases is disallowed to avoid surprising users.
+- **Grace Period for Existing Deprecations**: Any feature deprecated **before
+this policy** will have its grace period start **now**, not retroactively.
+- **Documentation is Critical**: Ensure every stage of the pipeline is
+documented clearly for users.
+
+## Final Notes
+
+This policy is a living document and may evolve as the needs of the project and
+its users change. Community feedback is welcome and encouraged as we refine the
+process.
--- a/docs/source/contributing/dockerfile/dockerfile.md
+++ b/docs/source/contributing/dockerfile/dockerfile.md
 # Dockerfile

 We provide a <gh-file:docker/Dockerfile> to construct the image for running an OpenAI compatible server with vLLM.
-More information about deploying with Docker can be found [here](#deployment-docker).
+More information about deploying with Docker can be found [here][deployment-docker].

 Below is a visual representation of the multi-stage Dockerfile. The build graph contains the following nodes:

@@ -17,18 +17,21 @@ The edges of the build graph represent:

 - `RUN --mount=(.\*)from=...` dependencies (with a dotted line and an empty diamond arrow head)

-  > :::{figure} /assets/contributing/dockerfile-stages-dependency.png
-  > :align: center
-  > :alt: query
-  > :width: 100%
-  > :::
+  > <figure markdown="span">
+  >   ![](../../assets/contributing/dockerfile-stages-dependency.png){ align="center" alt="query" width="100%" }
+  > </figure>
  >
  > Made using: <https://github.com/patrickhoefler/dockerfilegraph>
  >
  > Commands to regenerate the build graph (make sure to run it **from the \`root\` directory of the vLLM repository** where the dockerfile is present):
  >
  > ```bash
-  > dockerfilegraph -o png --legend --dpi 200 --max-label-length 50 --filename docker/Dockerfile
+  > dockerfilegraph \
+  >   -o png \
+  >   --legend \
+  >   --dpi 200 \
+  >   --max-label-length 50 \
+  >   --filename docker/Dockerfile
  > ```
  >
  > or in case you want to run it directly with the docker image:

--- a/docs/contributing/model/README.md
+++ b/docs/contributing/model/README.md
+---
+title: Adding a New Model
+---
+[](){ #new-model }
+
+This section provides more information on how to integrate a [PyTorch](https://pytorch.org/) model into vLLM.
+
+Contents:
+
+- [Basic](basic.md)
+- [Registration](registration.md)
+- [Tests](tests.md)
+- [Multimodal](multimodal.md)
+
+!!! note
+    The complexity of adding a new model depends heavily on the model's architecture.
+    The process is fairly straightforward if the model shares a similar architecture with an existing model in vLLM.
+    However, for models that include new operators (e.g., a new attention mechanism), the process can be a bit more complex.
+
+!!! tip
+    If you are encountering issues while integrating your model into vLLM, feel free to open a [GitHub issue](https://github.com/vllm-project/vllm/issues)
+    or ask on our [developer slack](https://slack.vllm.ai).
+    We will be happy to help you out!
--- a/docs/source/contributing/model/basic.md
+++ b/docs/source/contributing/model/basic.md
-(new-model-basic)=
-
-# Implementing a Basic Model
+---
+title: Implementing a Basic Model
+---
+[](){ #new-model-basic }

 This guide walks you through the steps to implement a basic vLLM model.

@@ -10,9 +11,8 @@ First, clone the PyTorch model code from the source repository.
 For instance, vLLM's [OPT model](gh-file:vllm/model_executor/models/opt.py) was adapted from
 HuggingFace's [modeling_opt.py](https://github.com/huggingface/transformers/blob/main/src/transformers/models/opt/modeling_opt.py) file.

-:::{warning}
-Make sure to review and adhere to the original code's copyright and licensing terms!
-:::
+!!! warning
+    Make sure to review and adhere to the original code's copyright and licensing terms!

 ## 2. Make your code compatible with vLLM

@@ -67,7 +67,7 @@ class MyModel(nn.Module):
        ... 
 ```

- Rewrite the {meth}`~torch.nn.Module.forward` method of your model to remove any unnecessary code, such as training-specific code. Modify the input parameters to treat `input_ids` and `positions` as flattened tensors with a single batch size dimension, without a max-sequence length dimension.
+- Rewrite the [forward][torch.nn.Module.forward] method of your model to remove any unnecessary code, such as training-specific code. Modify the input parameters to treat `input_ids` and `positions` as flattened tensors with a single batch size dimension, without a max-sequence length dimension.

 ```python
 def forward(
@@ -78,10 +78,9 @@ def forward(
    ...
 ```

-:::{note}
-Currently, vLLM supports the basic multi-head attention mechanism and its variant with rotary positional embeddings.
-If your model employs a different attention mechanism, you will need to implement a new attention layer in vLLM.
-:::
+!!! note
+    Currently, vLLM supports the basic multi-head attention mechanism and its variant with rotary positional embeddings.
+    If your model employs a different attention mechanism, you will need to implement a new attention layer in vLLM.

 For reference, check out our [Llama implementation](gh-file:vllm/model_executor/models/llama.py). vLLM already supports a large number of models. It is recommended to find a model similar to yours and adapt it to your model's architecture. Check out <gh-dir:vllm/model_executor/models> for more examples.

@@ -89,7 +88,7 @@ For reference, check out our [Llama implementation](gh-file:vllm/model_executor/

 If your model is too large to fit into a single GPU, you can use tensor parallelism to manage it.
 To do this, substitute your model's linear and embedding layers with their tensor-parallel versions.
-For the embedding layer, you can simply replace {class}`torch.nn.Embedding` with `VocabParallelEmbedding`. For the output LM head, you can use `ParallelLMHead`.
+For the embedding layer, you can simply replace [torch.nn.Embedding][] with `VocabParallelEmbedding`. For the output LM head, you can use `ParallelLMHead`.
 When it comes to the linear layers, we provide the following options to parallelize them:

 - `ReplicatedLinear`: Replicates the inputs and weights across multiple GPUs. No memory saving.
@@ -107,7 +106,7 @@ This method should load the weights from the HuggingFace's checkpoint file and a

 ## 5. Register your model

-See [this page](#new-model-registration) for instructions on how to register your new model to be used by vLLM.
+See [this page][new-model-registration] for instructions on how to register your new model to be used by vLLM.

 ## Frequently Asked Questions

@@ -117,7 +116,7 @@ For models with interleaving sliding windows (e.g. `google/gemma-2-2b-it` and `m

 To support a model with interleaving sliding windows, we need to take care of the following details:

- Make sure [this line](https://github.com/vllm-project/vllm/blob/996357e4808ca5eab97d4c97c7d25b3073f46aab/vllm/config.py#L308) evaluates `has_interleaved_attention` to `True` for this model, and set `self.hf_text_config.interleaved_sliding_window` to the format of interleaving sliding windows the model can understand. Then, `self.hf_text_config.sliding_window` will be deleted, and the model will be treated as a full-attention model.
+- Make sure the model's `config.json` contains `sliding_window_pattern`. vLLM then sets `self.hf_text_config.interleaved_sliding_window` to the value of `self.hf_text_config.sliding_window` and deletes `sliding_window` from `self.hf_text_config`. The model will then be treated as a full-attention model.
 - In the modeling code, parse the correct sliding window value for every layer, and pass it to the attention layer's `per_layer_sliding_window` argument. For reference, check [this line](https://github.com/vllm-project/vllm/blob/996357e4808ca5eab97d4c97c7d25b3073f46aab/vllm/model_executor/models/llama.py#L171).

 With these two steps, interleaving sliding windows should work with the model.
--- a/docs/contributing/model/multimodal.md
+++ b/docs/contributing/model/multimodal.md
+---
+title: Multi-Modal Support
+---
+[](){ #supports-multimodal }
+
+This document walks you through the steps to extend a basic model so that it accepts [multi-modal inputs][multimodal-inputs].
+
+## 1. Update the base vLLM model
+
+It is assumed that you have already implemented the model in vLLM according to [these steps][new-model-basic].
+Further update the model as follows:
+
+- Reserve a keyword parameter in [forward][torch.nn.Module.forward] for each input tensor that corresponds to a multi-modal input, as shown in the following example:
+
+  ```diff
+    def forward(
+        self,
+        input_ids: torch.Tensor,
+        positions: torch.Tensor,
+  +     pixel_values: torch.Tensor,
+    ) -> SamplerOutput:
+  ```
+  
+  More conveniently, you can simply pass `**kwargs` to the [forward][torch.nn.Module.forward] method and retrieve the keyword parameters for multimodal inputs from it.
+
+- Implement [get_multimodal_embeddings][vllm.model_executor.models.interfaces.SupportsMultiModal.get_multimodal_embeddings] that returns the embeddings from running the multimodal inputs through the multimodal tokenizer of the model. Below we provide a boilerplate of a typical implementation pattern, but feel free to adjust it to your own needs.
+
+    ```python
+    class YourModelForImage2Seq(nn.Module):
+        ...
+
+        def _process_image_input(self, image_input: YourModelImageInputs) -> torch.Tensor:
+
+            assert self.vision_encoder is not None
+            image_features = self.vision_encoder(image_input)
+            return self.multi_modal_projector(image_features)
+
+        def get_multimodal_embeddings(
+                self, **kwargs: object) -> Optional[MultiModalEmbeddings]:
+
+            # Validate the multimodal input keyword arguments
+            image_input = self._parse_and_validate_image_input(**kwargs)
+            if image_input is None:
+                return None
+
+            # Run multimodal inputs through encoder and projector
+            vision_embeddings = self._process_image_input(image_input)
+            return vision_embeddings
+    ```
+
+!!! warning
+        The returned `multimodal_embeddings` must be either a **3D [torch.Tensor][]** of shape `(num_items, feature_size, hidden_size)`, or a **list / tuple of 2D [torch.Tensor][]'s** of shape `(feature_size, hidden_size)`, so that `multimodal_embeddings[i]` retrieves the embeddings generated from the `i`-th multimodal data item (e.g., image) of the request.
+
+- Implement [get_input_embeddings][vllm.model_executor.models.interfaces.SupportsMultiModal.get_input_embeddings] to merge `multimodal_embeddings` with text embeddings from the `input_ids`. If input processing for the model is implemented correctly (see sections below), then you can leverage the utility function we provide to easily merge the embeddings.
+
+    ```python
+    from .utils import merge_multimodal_embeddings
+
+    class YourModelForImage2Seq(nn.Module):
+        ...
+
+        def get_input_embeddings(
+            self,
+            input_ids: torch.Tensor,
+            multimodal_embeddings: Optional[MultiModalEmbeddings] = None,
+        ) -> torch.Tensor:
+
+            # `get_input_embeddings` should already be implemented for the language 
+            # model as one of the requirements of basic vLLM model implementation.
+            inputs_embeds = self.language_model.get_input_embeddings(input_ids)
+
+            if multimodal_embeddings is not None:
+                inputs_embeds = merge_multimodal_embeddings(
+                    input_ids=input_ids, 
+                    inputs_embeds=inputs_embeds, 
+                    multimodal_embeddings=multimodal_embeddings,
+                    placeholder_token_id=self.config.image_token_index)
+
+            return inputs_embeds
+    ```
+
+- Implement [get_language_model][vllm.model_executor.models.interfaces.SupportsMultiModal.get_language_model] getter to provide stable access to the underlying language model.
+
+    ```python
+    class YourModelForImage2Seq(nn.Module):
+        ...
+
+        def get_language_model(self) -> torch.nn.Module:
+            # Change `language_model` according to your implementation.
+            return self.language_model
+    ```
+
+- Once the above steps are done, update the model class with the [SupportsMultiModal][vllm.model_executor.models.interfaces.SupportsMultiModal] interface.
+
+  ```diff
+  + from vllm.model_executor.models.interfaces import SupportsMultiModal
+
+  - class YourModelForImage2Seq(nn.Module):
+  + class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
+  ```
+
+!!! note
+      The model class does not have to be named `*ForCausalLM`.
+      Check out [the HuggingFace Transformers documentation](https://huggingface.co/docs/transformers/model_doc/auto#multimodal) for some examples.
+
+## 2. Specify processing information
+
+Next, create a subclass of [BaseProcessingInfo][vllm.multimodal.processing.BaseProcessingInfo]
+to provide basic information related to HF processing.
+
+### Maximum number of input items
+
+You need to override the abstract method [get_supported_mm_limits][vllm.multimodal.processing.BaseProcessingInfo.get_supported_mm_limits]
+to return the maximum number of input items for each modality supported by the model.
+
+For example, if the model supports any number of images but only one video per prompt:
+
+```python
+def get_supported_mm_limits(self) -> Mapping[str, Optional[int]]:
+    return {"image": None, "video": 1}
+```
+
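To make the meaning of the returned mapping concrete, the sketch below checks per-prompt item counts against such limits, treating `None` as "no limit". It is a standalone illustration and not vLLM's internal validation logic; the function name `check_mm_counts` is hypothetical.

```python
from collections.abc import Mapping
from typing import Optional


def check_mm_counts(
    limits: Mapping[str, Optional[int]],
    counts: Mapping[str, int],
) -> None:
    """Raise ValueError if `counts` exceeds `limits`; a limit of None is unlimited."""
    for modality, count in counts.items():
        if modality not in limits:
            raise ValueError(f"Unsupported modality: {modality!r}")
        limit = limits[modality]
        if limit is not None and count > limit:
            raise ValueError(
                f"Got {count} {modality} item(s), but at most {limit} are allowed"
            )
```

With the limits from the example above, a prompt with many images and one video passes, while a prompt with two videos is rejected.
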
+## 3. Specify dummy inputs
+
+Then, inherit [BaseDummyInputsBuilder][vllm.multimodal.profiling.BaseDummyInputsBuilder] to construct dummy inputs for
+HF processing as well as memory profiling.
+
+### For memory profiling
+
+Override the abstract methods [get_dummy_text][vllm.multimodal.profiling.BaseDummyInputsBuilder.get_dummy_text] and [get_dummy_mm_data][vllm.multimodal.profiling.BaseDummyInputsBuilder.get_dummy_mm_data] to construct dummy inputs for memory profiling. These dummy inputs should result in the worst-case memory usage of the model so that vLLM can reserve the correct amount of memory for it.
+
+Assuming that memory usage increases with the number of tokens, the dummy inputs can be constructed to maximize the number of output embeddings, which equals the number of placeholder feature tokens.
+
+=== "Basic example: LLaVA"
+
+    Looking at the code of HF's `LlavaForConditionalGeneration`:
+
+    ```python
+    # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L530-L544
+    n_image_tokens = (input_ids == self.config.image_token_index).sum().item()
+    n_image_features = image_features.shape[0] * image_features.shape[1]
+
+    if n_image_tokens != n_image_features:
+        raise ValueError(
+            f"Image features and image tokens do not match: tokens: {n_image_tokens}, features {n_image_features}"
+        )
+    special_image_mask = (
+        (input_ids == self.config.image_token_index)
+        .unsqueeze(-1)
+        .expand_as(inputs_embeds)
+        .to(inputs_embeds.device)
+    )
+    image_features = image_features.to(inputs_embeds.device, inputs_embeds.dtype)
+    inputs_embeds = inputs_embeds.masked_scatter(special_image_mask, image_features)
+    ```
+
+    The number of placeholder feature tokens per image is `image_features.shape[1]`.
+    `image_features` is calculated inside the `get_image_features` method:
+
+    ```python
+    # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L290-L300
+    image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)
+
+    selected_image_feature = image_outputs.hidden_states[vision_feature_layer]
+    if vision_feature_select_strategy == "default":
+        selected_image_feature = selected_image_feature[:, 1:]
+    elif vision_feature_select_strategy == "full":
+        selected_image_feature = selected_image_feature
+    else:
+        raise ValueError(f"Unexpected select feature strategy: {self.config.vision_feature_select_strategy}")
+    image_features = self.multi_modal_projector(selected_image_feature)
+    return image_features
+    ```
+
+    We can infer that `image_features.shape[1]` is based on `image_outputs.hidden_states.shape[1]` from the vision tower
+    (`CLIPVisionModel` for the [`llava-hf/llava-1.5-7b-hf`](https://huggingface.co/llava-hf/llava-1.5-7b-hf) model).
+    Moreover, we only need the sequence length (the second dimension of the tensor) to get `image_features.shape[1]`.
+    The sequence length is determined by the initial hidden states in `CLIPVisionTransformer` since the attention
+    mechanism doesn't change the sequence length of the output hidden states.
+
+    ```python
+    # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/modeling_clip.py#L1094-L1102
+    hidden_states = self.embeddings(pixel_values, interpolate_pos_encoding=interpolate_pos_encoding)
+    hidden_states = self.pre_layrnorm(hidden_states)
+
+    encoder_outputs = self.encoder(
+        inputs_embeds=hidden_states,
+        output_attentions=output_attentions,
+        output_hidden_states=output_hidden_states,
+        return_dict=return_dict,
+    )
+    ```
+
+    To find the sequence length, we turn to the code of `CLIPVisionEmbeddings`:
+
+    ```python
+    # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/modeling_clip.py#L247-L257
+    target_dtype = self.patch_embedding.weight.dtype
+    patch_embeds = self.patch_embedding(pixel_values.to(dtype=target_dtype))  # shape = [*, width, grid, grid]
+    patch_embeds = patch_embeds.flatten(2).transpose(1, 2)
+
+    class_embeds = self.class_embedding.expand(batch_size, 1, -1)
+    embeddings = torch.cat([class_embeds, patch_embeds], dim=1)
+    if interpolate_pos_encoding:
+        embeddings = embeddings + self.interpolate_pos_encoding(embeddings, height, width)
+    else:
+        embeddings = embeddings + self.position_embedding(self.position_ids)
+    return embeddings
+    ```
+
+    We can infer that `embeddings.shape[1] == self.num_positions`, where
+
+    ```python
+    # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/modeling_clip.py#L195-L196
+    self.num_patches = (self.image_size // self.patch_size) ** 2
+    self.num_positions = self.num_patches + 1
+    ```
+
+    Overall, the number of placeholder feature tokens for an image can be calculated as:
+
+    ```python
+    def get_num_image_tokens(
+        self,
+        *,
+        image_width: int,
+        image_height: int,
+    ) -> int:
+        hf_config = self.get_hf_config()
+        hf_processor = self.get_hf_processor()
+
+        image_size = hf_config.vision_config.image_size
+        patch_size = hf_config.vision_config.patch_size
+
+        num_image_tokens = (image_size // patch_size) ** 2 + 1
+        if hf_processor.vision_feature_select_strategy == "default":
+            num_image_tokens -= 1
+
+        return num_image_tokens
+    ```
+
+    Notice that the number of image tokens doesn't depend on the image width and height.
+    We can simply use a dummy `image_size` to calculate the multimodal profiling data:
+
+    ```python
+    # NOTE: In actuality, this is usually implemented as part of the
+    # model's subclass of `BaseProcessingInfo`, but we show it as is
+    # here for simplicity.
+    def get_image_size_with_most_features(self) -> ImageSize:
+        hf_config = self.get_hf_config()
+        width = height = hf_config.vision_config.image_size
+        return ImageSize(width=width, height=height)
+
+    def get_dummy_mm_data(
+        self,
+        seq_len: int,
+        mm_counts: Mapping[str, int],
+    ) -> MultiModalDataDict:
+        num_images = mm_counts.get("image", 0)
+
+        target_width, target_height = \
+            self.info.get_image_size_with_most_features()
+
+        return {
+            "image":
+            self._get_dummy_images(width=target_width,
+                                   height=target_height,
+                                   num_images=num_images)
+        }
+    ```
+
+    For the text, we simply expand the multimodal image token from the model config to match the desired number of images.
+
+    ```python
+    def get_dummy_text(self, mm_counts: Mapping[str, int]) -> str:
+        num_images = mm_counts.get("image", 0)
+
+        processor = self.info.get_hf_processor()
+        image_token = processor.image_token
+
+        return image_token * num_images
+    ```
+
+=== "No input placeholders: Fuyu"
+
+    Looking at the code of HF's `FuyuForCausalLM`:
+
+    ```python
+    # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/modeling_fuyu.py#L311-L322
+    if image_patches is not None and past_key_values is None:
+        patch_embeddings = [
+            self.vision_embed_tokens(patch.to(self.vision_embed_tokens.weight.dtype))
+            .squeeze(0)
+            .to(inputs_embeds.device)
+            for patch in image_patches
+        ]
+        inputs_embeds = self.gather_continuous_embeddings(
+            word_embeddings=inputs_embeds,
+            continuous_embeddings=patch_embeddings,
+            image_patch_input_indices=image_patches_indices,
+        )
+    ```
+
+    The number of placeholder feature tokens for the `i`th item in the batch is `patch_embeddings[i].shape[0]`,
+    which is the same as `image_patches[i].shape[0]`, i.e. `num_total_patches`.
+
+    Unlike LLaVA, Fuyu does not define the number of patches inside the modeling file. Where can we get more information?
+    Considering that the model input comes from the output of `FuyuProcessor`, let's **look at the preprocessing files**.
+
+    The image outputs are obtained by calling `FuyuImageProcessor.preprocess` and then
+    `FuyuImageProcessor.preprocess_with_tokenizer_info` inside `FuyuProcessor`.
+
+    In `FuyuImageProcessor.preprocess`, the images are resized and padded to the target `FuyuImageProcessor.size`,
+    returning the dimensions after resizing (but before padding) as metadata.
+
+    ```python
+    # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L541-L544
+    image_encoding = self.image_processor.preprocess(images, **output_kwargs["images_kwargs"])
+    batch_images = image_encoding["images"]
+    image_unpadded_heights = image_encoding["image_unpadded_heights"]
+    image_unpadded_widths = image_encoding["image_unpadded_widths"]
+
+    # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L480-L
+    if do_resize:
+        batch_images = [
+            [self.resize(image, size=size, input_data_format=input_data_format) for image in images]
+            for images in batch_images
+        ]
+
+    image_sizes = [get_image_size(images[0], channel_dim=input_data_format) for images in batch_images]
+    image_unpadded_heights = [[image_size[0]] for image_size in image_sizes]
+    image_unpadded_widths = [[image_size[1]] for image_size in image_sizes]
+
+    if do_pad:
+        batch_images = [
+            [
+                self.pad_image(
+                    image,
+                    size=size,
+                    mode=padding_mode,
+                    constant_values=padding_value,
+                    input_data_format=input_data_format,
+                )
+                for image in images
+            ]
+            for images in batch_images
+        ]
+    ```
+
+    In `FuyuImageProcessor.preprocess_with_tokenizer_info`, the images are split into patches based on this metadata:
+
+    ```python
+    # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L425
+    model_image_input = self.image_processor.preprocess_with_tokenizer_info(
+        image_input=tensor_batch_images,
+        image_present=image_present,
+        image_unpadded_h=image_unpadded_heights,
+        image_unpadded_w=image_unpadded_widths,
+        image_placeholder_id=image_placeholder_id,
+        image_newline_id=image_newline_id,
+        variable_sized=True,
+    )
+
+    # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L638-L658
+    image_height, image_width = image.shape[1], image.shape[2]
+    if variable_sized:  # variable_sized=True
+        new_h = min(
+            image_height,
+            math.ceil(image_unpadded_h[batch_index, subseq_index] / patch_height) * patch_height,
+        )
+        new_w = min(
+            image_width,
+            math.ceil(image_unpadded_w[batch_index, subseq_index] / patch_width) * patch_width,
+        )
+        image = image[:, :new_h, :new_w]
+        image_height, image_width = new_h, new_w
+
+    num_patches = self.get_num_patches(image_height=image_height, image_width=image_width)
+    tensor_of_image_ids = torch.full(
+        [num_patches], image_placeholder_id, dtype=torch.int32, device=image_input.device
+    )
+    patches = self.patchify_image(image=image.unsqueeze(0)).squeeze(0)
+    assert num_patches == patches.shape[0]
+    ```
+
+    The number of patches is in turn defined by `FuyuImageProcessor.get_num_patches`:
+
+    ```python
+    # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L552-L562
+    patch_size = patch_size if patch_size is not None else self.patch_size
+    patch_height, patch_width = self.patch_size["height"], self.patch_size["width"]
+
+    if image_height % patch_height != 0:
+        raise ValueError(f"{image_height=} must be divisible by {patch_height}")
+    if image_width % patch_width != 0:
+        raise ValueError(f"{image_width=} must be divisible by {patch_width}")
+
+    num_patches_per_dim_h = image_height // patch_height
+    num_patches_per_dim_w = image_width // patch_width
+    num_patches = num_patches_per_dim_h * num_patches_per_dim_w
+    ```
+
+    These image patches correspond to placeholder tokens (`|SPEAKER|`). So, we just need to maximize the number of image patches. Since input images are first resized
+    to fit within `image_processor.size`, we can maximize the number of image patches by inputting an image with size equal to `image_processor.size`.
+
+    ```python
+    def get_image_size_with_most_features(self) -> ImageSize:
+        image_processor = self.get_image_processor()
+        return ImageSize(width=image_processor.size["width"],
+                            height=image_processor.size["height"])
+    ```
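+
+    As a sanity check on this reasoning, the patch count can be worked out by hand. The sketch below re-implements the arithmetic of `get_num_patches` as a standalone function. `count_patches` is a hypothetical helper, and the 1920x1080 target size and 30x30 patch size are assumed values for illustration; read the real ones from `image_processor.size` and `image_processor.patch_size`.
+
+    ```python
+    def count_patches(image_height: int, image_width: int,
+                      patch_height: int, patch_width: int) -> int:
+        """Number of patches tiling an image whose sides divide evenly."""
+        if image_height % patch_height != 0:
+            raise ValueError(f"{image_height=} must be divisible by {patch_height}")
+        if image_width % patch_width != 0:
+            raise ValueError(f"{image_width=} must be divisible by {patch_width}")
+        return (image_height // patch_height) * (image_width // patch_width)
+
+
+    # Assumed values: target size 1920x1080 (width x height), 30x30 patches.
+    full_size = count_patches(1080, 1920, 30, 30)  # 36 rows * 64 columns
+    half_size = count_patches(540, 960, 30, 30)    # 18 rows * 32 columns
+    assert full_size > half_size
+    ```
+
+    Under these assumptions, an image that already fills the target size yields more patches than any smaller one, which is why it is used for profiling.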
+
+    Fuyu does not expect image placeholders in the inputs to HF processor, so
+    the dummy prompt text is empty regardless of the number of images.
+
+    ```python
+    def get_dummy_text(self, mm_counts: Mapping[str, int]) -> str:
+        return ""
+    ```
+
+    For the multimodal image profiling data, the logic is very similar to LLaVA:
+
+    ```python
+    def get_dummy_mm_data(
+        self,
+        seq_len: int,
+        mm_counts: Mapping[str, int],
+    ) -> MultiModalDataDict:
+        target_width, target_height = \
+            self.info.get_image_size_with_most_features()
+        num_images = mm_counts.get("image", 0)
+
+        return {
+            "image":
+            self._get_dummy_images(width=target_width,
+                                   height=target_height,
+                                   num_images=num_images)
+        }
+    ```
+
+## 4. Specify processing details
+
+Afterwards, create a subclass of [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor]
+to fill in the missing details about HF processing.
+
+!!! info
+    [Multi-Modal Data Processing][mm-processing]
+
+### Multi-modal fields
+
+Override [_get_mm_fields_config][vllm.multimodal.processing.BaseMultiModalProcessor._get_mm_fields_config] to
+return a schema of the tensors outputted by the HF processor that are related to the input multi-modal items.
+
+=== "Basic example: LLaVA"
+
+    The output of `CLIPImageProcessor` is a simple tensor with shape
+    `(num_images, num_channels, image_height, image_width)`:
+
+    ```python
+    # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/image_processing_clip.py#L339-L345
+    images = [
+        to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format)
+        for image in all_images
+    ]
+
+    data = {"pixel_values": images}
+    return BatchFeature(data=data, tensor_type=return_tensors)
+    ```
+
+    So, we override [_get_mm_fields_config][vllm.multimodal.processing.BaseMultiModalProcessor._get_mm_fields_config] as follows:
+
+    ```python
+    def _get_mm_fields_config(
+        self,
+        hf_inputs: BatchFeature,
+        hf_processor_mm_kwargs: Mapping[str, object],
+    ) -> Mapping[str, MultiModalFieldConfig]:
+        return dict(
+            pixel_values=MultiModalFieldConfig.batched("image"),
+        )
+    ```
+
+    !!! note
+        Our [actual code](gh-file:vllm/model_executor/models/llava.py) additionally supports
+        pre-computed image embeddings, which can be passed to the model via the `image_embeds` argument.
+
+=== "With postprocessing: Fuyu"
+
+    The `image_patches` output of `FuyuImageProcessor.preprocess_with_tokenizer_info` concatenates
+    the patches from each image belonging to an item in the batch:
+
+    ```python
+    # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L673-L679
+            image_input_ids.append(tensor_of_image_ids)
+            image_patches.append(patches)
+        else:
+            image_input_ids.append(torch.tensor([], dtype=torch.int32, device=image_input.device))
+
+    batch_image_input_ids.append(image_input_ids)
+    batch_image_patches.append(image_patches)
+    ```
+
+    The shape of `image_patches` outputted by `FuyuImageProcessor` is therefore
+    `(1, num_images, num_patches, patch_width * patch_height * num_channels)`.
+
+    In order to support the use of [MultiModalFieldConfig.batched][] like in LLaVA,
+    we remove the extra batch dimension by overriding [BaseMultiModalProcessor._call_hf_processor][]:
+
+    ```python
+    def _call_hf_processor(
+        self,
+        prompt: str,
+        mm_data: Mapping[str, object],
+        mm_kwargs: Mapping[str, object],
+    ) -> BatchFeature:
+        processed_outputs = super()._call_hf_processor(
+            prompt=prompt,
+            mm_data=mm_data,
+            mm_kwargs=mm_kwargs,
+        )
+
+        image_patches = processed_outputs.get("image_patches")
+        if image_patches is not None:
+            images = mm_data["images"]
+            assert isinstance(images, list)
+
+            # Original output: (1, num_images, Pn, Px * Py * C)
+            # New output: (num_images, Pn, Px * Py * C)
+            assert (isinstance(image_patches, list)
+                    and len(image_patches) == 1)
+            assert (isinstance(image_patches[0], torch.Tensor)
+                    and len(image_patches[0]) == len(images))
+
+            processed_outputs["image_patches"] = image_patches[0]
+
+        return processed_outputs
+    ```
+
+    !!! note
+        Our [actual code](gh-file:vllm/model_executor/models/fuyu.py) has special handling
+        for text-only inputs to prevent unnecessary warnings from HF processor.
+
+    This lets us override [_get_mm_fields_config][vllm.multimodal.processing.BaseMultiModalProcessor._get_mm_fields_config] as follows:
+
+    ```python
+    def _get_mm_fields_config(
+        self,
+        hf_inputs: BatchFeature,
+        hf_processor_mm_kwargs: Mapping[str, object],
+    ) -> Mapping[str, MultiModalFieldConfig]:
+        return dict(image_patches=MultiModalFieldConfig.batched("image"))
+    ```
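+
+    To see why the extra dimension has to go, here is a toy illustration in which plain nested lists stand in for tensors. The patch labels are made up, and the sketch assumes that a batched field is sliced along its first dimension, one slice per multi-modal item.
+
+    ```python
+    # Plain nested lists stand in for tensors here; the patch values are made up.
+    # Outer list: the extra batch dimension added by the HF processor (length 1).
+    # Next level: one entry per image. Innermost: the patches of that image.
+    hf_image_patches = [[["img0_p0", "img0_p1"], ["img1_p0", "img1_p1"]]]
+    num_images = 2
+
+    assert len(hf_image_patches) == 1
+    assert len(hf_image_patches[0]) == num_images
+
+    # Dropping the batch dimension makes the first axis index images,
+    # which is the layout assumed for a batched field.
+    image_patches = hf_image_patches[0]
+    per_image = [image_patches[i] for i in range(num_images)]
+    ```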
+
+### Prompt updates
+
+Override [_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates] to
+return a list of [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instances.
+
+Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies an update operation
+(e.g., insertion or replacement) performed by the HF processor.
+
+=== "Basic example: LLaVA"
+
+    Looking at HF's `LlavaProcessor`:
+
+    ```python
+    # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/processing_llava.py#L167-L170
+    prompt_strings = []
+    for sample in text:
+        sample = sample.replace(self.image_token, self.image_token * num_image_tokens)
+        prompt_strings.append(sample)
+    ```
+
+    It simply repeats each input `image_token` a number of times equal to the number of placeholder feature tokens (`num_image_tokens`).
+    Based on this, we override [_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates] as follows:
+
+    ```python
+    def _get_prompt_updates(
+        self,
+        mm_items: MultiModalDataItems,
+        hf_processor_mm_kwargs: Mapping[str, object],
+        out_mm_kwargs: MultiModalKwargs,
+    ) -> Sequence[PromptUpdate]:
+        hf_config = self.info.get_hf_config()
+        image_token_id = hf_config.image_token_index
+
+        def get_replacement(item_idx: int):
+            images = mm_items.get_items("image", ImageProcessorItems)
+
+            image_size = images.get_image_size(item_idx)
+            num_image_tokens = self.info.get_num_image_tokens(
+                image_width=image_size.width,
+                image_height=image_size.height,
+            )
+
+            return [image_token_id] * num_image_tokens
+
+        return [
+            PromptReplacement(
+                modality="image",
+                target=[image_token_id],
+                replacement=get_replacement,
+            ),
+        ]
+    ```
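+
+    The effect of such a replacement on a tokenized prompt can be sketched without vLLM. `expand_placeholders` below is a hypothetical stand-in for the matching logic, and the token IDs and per-image feature counts are invented for illustration.
+
+    ```python
+    from typing import Callable
+
+
+    def expand_placeholders(
+        token_ids: list[int],
+        target_id: int,
+        get_replacement: Callable[[int], list[int]],
+    ) -> list[int]:
+        """Replace the i-th occurrence of target_id with get_replacement(i)."""
+        out: list[int] = []
+        item_idx = 0
+        for token_id in token_ids:
+            if token_id == target_id:
+                out.extend(get_replacement(item_idx))
+                item_idx += 1
+            else:
+                out.append(token_id)
+        return out
+
+
+    IMAGE_TOKEN_ID = 9  # hypothetical placeholder ID
+    tokens_per_image = [2, 3]  # hypothetical feature counts for two images
+
+    expanded = expand_placeholders(
+        [1, IMAGE_TOKEN_ID, 5, IMAGE_TOKEN_ID],
+        IMAGE_TOKEN_ID,
+        lambda idx: [IMAGE_TOKEN_ID] * tokens_per_image[idx],
+    )
+    ```
+
+    Each placeholder is expanded independently, so images of different sizes may contribute different numbers of feature tokens.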
+
+=== "Handling additional tokens: Fuyu"
+
+    Recall the layout of feature tokens from Step 2:
+
+    ```
+    |SPEAKER||SPEAKER|...|SPEAKER||NEWLINE|
+    |SPEAKER||SPEAKER|...|SPEAKER||NEWLINE|
+    ...
+    |SPEAKER||SPEAKER|...|SPEAKER||NEWLINE|
+    ```
+
+    We define a helper function to return `ncols` and `nrows` directly:
+
+    ```python
+    def get_image_feature_grid_size(
+        self,
+        *,
+        image_width: int,
+        image_height: int,
+    ) -> tuple[int, int]:
+        image_processor = self.get_image_processor()
+        target_width = image_processor.size["width"]
+        target_height = image_processor.size["height"]
+        patch_width = image_processor.patch_size["width"]
+        patch_height = image_processor.patch_size["height"]
+
+        if not (image_width <= target_width and image_height <= target_height):
+            height_scale_factor = target_height / image_height
+            width_scale_factor = target_width / image_width
+            optimal_scale_factor = min(height_scale_factor, width_scale_factor)
+
+            image_height = int(image_height * optimal_scale_factor)
+            image_width = int(image_width * optimal_scale_factor)
+
+        ncols = math.ceil(image_width / patch_width)
+        nrows = math.ceil(image_height / patch_height)
+        return ncols, nrows
+    ```
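+
+    A standalone copy of this arithmetic makes it easy to check a few sizes by hand. `feature_grid_size` is a hypothetical free function mirroring the helper above, and the processor settings are assumed values for illustration only.
+
+    ```python
+    import math
+
+
+    def feature_grid_size(image_width, image_height, *, target_width,
+                          target_height, patch_width, patch_height):
+        """Standalone copy of the helper's arithmetic; returns (ncols, nrows)."""
+        if not (image_width <= target_width and image_height <= target_height):
+            scale = min(target_height / image_height, target_width / image_width)
+            image_height = int(image_height * scale)
+            image_width = int(image_width * scale)
+        return (math.ceil(image_width / patch_width),
+                math.ceil(image_height / patch_height))
+
+
+    # Assumed processor settings, for illustration only.
+    settings = dict(target_width=1920, target_height=1080,
+                    patch_width=30, patch_height=30)
+
+    small = feature_grid_size(300, 200, **settings)    # fits, no resize
+    large = feature_grid_size(3840, 2160, **settings)  # scaled by 0.5 first
+    ```
+
+    With these settings, the small image gives a 10x7 grid (the last row is only partially covered, hence the `ceil`), while the oversized image is scaled down to the target size and gives a 64x36 grid.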
+
+    Based on this, we can initially define our replacement tokens as:
+
+    ```python
+    def get_replacement(item_idx: int):
+        images = mm_items.get_items("image", ImageProcessorItems)
+        image_size = images.get_image_size(item_idx)
+
+        ncols, nrows = self.info.get_image_feature_grid_size(
+            image_width=image_size.width,
+            image_height=image_size.height,
+        )
+
+        # `_IMAGE_TOKEN_ID` corresponds to `|SPEAKER|`
+        # `_NEWLINE_TOKEN_ID` corresponds to `|NEWLINE|`
+        return ([_IMAGE_TOKEN_ID] * ncols + [_NEWLINE_TOKEN_ID]) * nrows
+    ```
+
+    However, this is not entirely correct. After `FuyuImageProcessor.preprocess_with_tokenizer_info` is called,
+    a BOS token (`<s>`) is also added to the prompt:
+
+    ```python
+    # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L435
+    model_image_input = self.image_processor.preprocess_with_tokenizer_info(
+        image_input=tensor_batch_images,
+        image_present=image_present,
+        image_unpadded_h=image_unpadded_heights,
+        image_unpadded_w=image_unpadded_widths,
+        image_placeholder_id=image_placeholder_id,
+        image_newline_id=image_newline_id,
+        variable_sized=True,
+    )
+    prompt_tokens, prompts_length = _tokenize_prompts_with_image_and_batch(
+        tokenizer=self.tokenizer,
+        prompts=prompts,
+        scale_factors=scale_factors,
+        max_tokens_to_generate=self.max_tokens_to_generate,
+        max_position_embeddings=self.max_position_embeddings,
+        add_BOS=True,
+        add_beginning_of_answer_token=True,
+    )
+    ```
+
+    To assign the vision embeddings to only the image tokens, you can return an instance of
+    [PromptUpdateDetails][vllm.multimodal.processing.PromptUpdateDetails] instead of a plain token sequence:
+
+    ```python
+    hf_config = self.info.get_hf_config()
+    bos_token_id = hf_config.bos_token_id  # `<s>`
+    assert isinstance(bos_token_id, int)
+
+    def get_replacement_fuyu(item_idx: int):
+        images = mm_items.get_items("image", ImageProcessorItems)
+        image_size = images.get_image_size(item_idx)
+
+        ncols, nrows = self.info.get_image_feature_grid_size(
+            image_width=image_size.width,
+            image_height=image_size.height,
+        )
+        image_tokens = ([_IMAGE_TOKEN_ID] * ncols +
+                        [_NEWLINE_TOKEN_ID]) * nrows
+
+        return PromptUpdateDetails.select_token_id(
+            image_tokens + [bos_token_id],
+            embed_token_id=_IMAGE_TOKEN_ID,
+        )
+    ```
+
+    Finally, noticing that the HF processor removes the `|ENDOFTEXT|` token from the tokenized prompt,
+    we can search for it to conduct the replacement at the start of the string:
+
+    ```python
+    def _get_prompt_updates(
+        self,
+        mm_items: MultiModalDataItems,
+        hf_processor_mm_kwargs: Mapping[str, object],
+        out_mm_kwargs: MultiModalKwargs,
+    ) -> Sequence[PromptUpdate]:
+        hf_config = self.info.get_hf_config()
+        bos_token_id = hf_config.bos_token_id
+        assert isinstance(bos_token_id, int)
+
+        tokenizer = self.info.get_tokenizer()
+        eot_token_id = tokenizer.bos_token_id
+        assert isinstance(eot_token_id, int)
+
+        def get_replacement_fuyu(item_idx: int):
+            images = mm_items.get_items("image", ImageProcessorItems)
+            image_size = images.get_image_size(item_idx)
+
+            ncols, nrows = self.info.get_image_feature_grid_size(
+                image_width=image_size.width,
+                image_height=image_size.height,
+            )
+            image_tokens = ([_IMAGE_TOKEN_ID] * ncols +
+                            [_NEWLINE_TOKEN_ID]) * nrows
+
+            return PromptUpdateDetails.select_token_id(
+                image_tokens + [bos_token_id],
+                embed_token_id=_IMAGE_TOKEN_ID,
+            )
+
+        return [
+            PromptReplacement(
+                modality="image",
+                target=[eot_token_id],
+                replacement=get_replacement_fuyu,
+            )
+        ]
+    ```
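+
+    The resulting token layout can be illustrated with string labels in place of real token IDs. The grid size and prompt below are invented, and the `is_embed` mask is only a simplified picture of which positions are assumed to receive vision embeddings.
+
+    ```python
+    IMAGE, NEWLINE, BOS = "|SPEAKER|", "|NEWLINE|", "<s>"  # labels, not real IDs
+
+    ncols, nrows = 3, 2  # hypothetical grid size
+    image_tokens = ([IMAGE] * ncols + [NEWLINE]) * nrows
+    replacement = image_tokens + [BOS]
+
+    # Only positions holding the image token receive vision embeddings.
+    is_embed = [token == IMAGE for token in replacement]
+
+    # Replace the first target token, which sits at the start of the prompt.
+    prompt = ["|ENDOFTEXT|", "Describe", "the", "image"]
+    target_idx = prompt.index("|ENDOFTEXT|")
+    updated = prompt[:target_idx] + replacement + prompt[target_idx + 1:]
+    ```
+
+    The newline and BOS tokens are part of the replacement but are excluded from the mask, so they keep their ordinary text embeddings.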
+
+## 5. Register processor-related classes
+
+After you have defined [BaseProcessingInfo][vllm.multimodal.processing.BaseProcessingInfo] (Step 2),
+[BaseDummyInputsBuilder][vllm.multimodal.profiling.BaseDummyInputsBuilder] (Step 3),
+and [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor] (Step 4),
+decorate the model class with [MULTIMODAL_REGISTRY.register_processor][vllm.multimodal.registry.MultiModalRegistry.register_processor]
+to register them to the multi-modal registry:
+
+```diff
+  from vllm.model_executor.models.interfaces import SupportsMultiModal
++ from vllm.multimodal import MULTIMODAL_REGISTRY
+
++ @MULTIMODAL_REGISTRY.register_processor(YourMultiModalProcessor,
++                                         info=YourProcessingInfo,
++                                         dummy_inputs=YourDummyInputsBuilder)
+  class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
+```
+
+## Notes
+
+### Inserting feature tokens without replacement
+
+Some HF processors directly insert feature tokens without replacing anything in the original prompt. In that case, you can use [PromptInsertion][vllm.multimodal.processing.PromptInsertion] instead of [PromptReplacement][vllm.multimodal.processing.PromptReplacement] inside [_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates].
+
+Examples:
+
+- BLIP-2 (insert at start of prompt): <gh-file:vllm/model_executor/models/blip2.py>
+- Florence2 (insert at start of prompt): <gh-file:vllm/model_executor/models/florence2.py>
+- Molmo (insert after `<|endoftext|>` token): <gh-file:vllm/model_executor/models/molmo.py>
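+
+The difference between the two kinds of update can be sketched on plain token lists. Both functions are hypothetical illustrations rather than vLLM APIs, and the token IDs are invented.
+
+```python
+def insert_at_start(token_ids: list[int], features: list[int]) -> list[int]:
+    """Insertion: prepend feature tokens; every original token is kept."""
+    return features + token_ids
+
+
+def replace_first(token_ids: list[int], target_id: int,
+                  features: list[int]) -> list[int]:
+    """Replacement: the first occurrence of target_id is consumed."""
+    idx = token_ids.index(target_id)
+    return token_ids[:idx] + features + token_ids[idx + 1:]
+
+
+prompt_ids = [7, 8, 9]  # hypothetical token IDs
+feature_ids = [0, 0]    # hypothetical feature tokens
+
+inserted = insert_at_start(prompt_ids, feature_ids)
+replaced = replace_first(prompt_ids, 7, feature_ids)
+```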
+
+### Handling prompt updates unrelated to multi-modal data
+
+[_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates] assumes that each application of a prompt update corresponds to one multi-modal item. If the HF processor performs additional processing regardless of how many multi-modal items there are, you should override [_apply_hf_processor_tokens_only][vllm.multimodal.processing.BaseMultiModalProcessor._apply_hf_processor_tokens_only] so that the processed token inputs are consistent with the result of applying the HF processor to text inputs. This is necessary because token inputs bypass the HF processor according to [our design][mm-processing].
+
+Examples:
+
+- Chameleon (appends `sep_token`): <gh-file:vllm/model_executor/models/chameleon.py>
+- Fuyu (appends `boa_token`): <gh-file:vllm/model_executor/models/fuyu.py>
+- Molmo (applies chat template which is not defined elsewhere): <gh-file:vllm/model_executor/models/molmo.py>
+
+### Custom HF processor
+
+Some models don't define a HF processor class on HF Hub. In that case, you can define a custom HF processor that has the same call signature as HF processors and pass it to [_call_hf_processor][vllm.multimodal.processing.BaseMultiModalProcessor._call_hf_processor].
+
+Examples:
+
+- DeepSeek-VL2: <gh-file:vllm/model_executor/models/deepseek_vl2.py>
+- InternVL: <gh-file:vllm/model_executor/models/internvl.py>
+- Qwen-VL: <gh-file:vllm/model_executor/models/qwen_vl.py>