Merge tag 'v0.7.1' into v0.7.1-dev

afd0da21 · zhuwenwen · 1a11f127 · 4f4d427a · afd0da21 · afd0da21
Commit afd0da21 authored Feb 03, 2025 by zhuwenwen
20 changed files
--- a/docs/source/deployment/frameworks/index.md
+++ b/docs/source/deployment/frameworks/index.md
+# Using other frameworks
+
+:::{toctree}
+:maxdepth: 1
+
+bentoml
+cerebrium
+dstack
+helm
+lws
+modal
+skypilot
+triton
+:::
--- a/docs/source/serving/deploying_with_lws.md
+++ b/docs/source/serving/deploying_with_lws.md
-(deploying-with-lws)=
+(deployment-lws)=

-# Deploying with LWS
+# LWS

 LeaderWorkerSet (LWS) is a Kubernetes API that aims to address common deployment patterns of AI/ML inference workloads.
 A major use case is for multi-host/multi-node distributed inference.

--- a/docs/source/deployment/frameworks/modal.md
+++ b/docs/source/deployment/frameworks/modal.md
+(deployment-modal)=
+
+# Modal
+
+vLLM can be run on cloud GPUs with [Modal](https://modal.com), a serverless computing platform designed for fast auto-scaling.
+
+For details on how to deploy vLLM on Modal, see [this tutorial in the Modal documentation](https://modal.com/docs/examples/vllm_inference).
--- a/docs/source/serving/run_on_sky.md
+++ b/docs/source/serving/run_on_sky.md
-(on-cloud)=
+(deployment-skypilot)=

-# Deploying and scaling up with SkyPilot
+# SkyPilot

-```{raw} html
+:::{raw} html
 <p align="center">
  <img src="https://imgur.com/yxtzPEu.png" alt="vLLM"/>
 </p>
-```
+:::

 vLLM can be **run and scaled to multiple service replicas on clouds and Kubernetes** with [SkyPilot](https://github.com/skypilot-org/skypilot), an open-source framework for running LLMs on any cloud. More examples for various open models, such as Llama-3 and Mixtral, can be found in the [SkyPilot AI gallery](https://skypilot.readthedocs.io/en/latest/gallery/index.html).

 ## Prerequisites

- Go to the [HuggingFace model page](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and request access to the model {code}`meta-llama/Meta-Llama-3-8B-Instruct`.
+- Go to the [HuggingFace model page](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and request access to the model `meta-llama/Meta-Llama-3-8B-Instruct`.
 - Check that you have installed SkyPilot ([docs](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html)).
- Check that {code}`sky check` shows clouds or Kubernetes are enabled.
+- Check that `sky check` shows clouds or Kubernetes are enabled.

 ```console
 pip install skypilot-nightly
@@ -61,7 +61,7 @@ run: |

  echo 'Starting gradio server...'
  git clone https://github.com/vllm-project/vllm.git || true
-  python vllm/examples/gradio_openai_chatbot_webserver.py \
+  python vllm/examples/online_serving/gradio_openai_chatbot_webserver.py \
    -m $MODEL_NAME \
    --port 8811 \
    --model-url http://localhost:8081/v1 \
@@ -104,10 +104,10 @@ service:
  max_completion_tokens: 1
 ```

-```{raw} html
+:::{raw} html
 <details>
 <summary>Click to see the full recipe YAML</summary>
-```
+:::

 ```yaml
 service:
@@ -153,9 +153,9 @@ run: |
    2>&1 | tee api_server.log
 ```

-```{raw} html
+:::{raw} html
 </details>
-```
+:::

 Start serving the Llama-3 8B model on multiple replicas:

@@ -169,10 +169,10 @@ Wait until the service is ready:
 watch -n10 sky serve status vllm
 ```

-```{raw} html
+:::{raw} html
 <details>
 <summary>Example outputs:</summary>
-```
+:::

 ```console
 Services
@@ -185,9 +185,9 @@ vllm          1   1        xx.yy.zz.121  18 mins ago  1x GCP([Spot]{'L4': 1})  R
 vllm          2   1        xx.yy.zz.245  18 mins ago  1x GCP([Spot]{'L4': 1})  READY   us-east4
 ```

-```{raw} html
+:::{raw} html
 </details>
-```
+:::

 After the service is READY, you can find a single endpoint for the service and access the service with the endpoint:

@@ -223,10 +223,10 @@ service:

 This will scale the service up when the QPS exceeds 2 for each replica.
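The scaling rule above can be illustrated with a small target-tracking calculation. This is only a sketch of the general idea, not SkyPilot's actual autoscaler: the function name `desired_replicas` and its parameters are hypothetical, and the replica bounds are assumed to come from the service's replica policy.

```python
import math


def desired_replicas(
    total_qps: float,
    target_qps_per_replica: float,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Return how many replicas keep per-replica QPS at or below the target.

    Hypothetical helper for illustration; it does not call SkyPilot.
    """
    if target_qps_per_replica <= 0:
        raise ValueError("target_qps_per_replica must be positive")
    # Enough replicas so that total_qps / replicas <= target_qps_per_replica.
    wanted = math.ceil(total_qps / target_qps_per_replica)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, wanted))
```

With a target of 2 QPS per replica, a total load of 5 QPS would call for three replicas under this rule, subject to the configured minimum and maximum.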

-```{raw} html
+:::{raw} html
 <details>
 <summary>Click to see the full recipe YAML</summary>
-```
+:::

 ```yaml
 service:
@@ -275,9 +275,9 @@ run: |
    2>&1 | tee api_server.log
 ```

-```{raw} html
+:::{raw} html
 </details>
-```
+:::

 To update the service with the new config:

@@ -295,10 +295,10 @@ sky serve down vllm

 It is also possible to access the Llama-3 service with a separate GUI frontend, so the user requests sent to the GUI will be load-balanced across replicas.

-```{raw} html
+:::{raw} html
 <details>
 <summary>Click to see the full GUI YAML</summary>
-```
+:::

 ```yaml
 envs:
@@ -321,25 +321,25 @@ run: |

  echo 'Starting gradio server...'
  git clone https://github.com/vllm-project/vllm.git || true
-  python vllm/examples/gradio_openai_chatbot_webserver.py \
+  python vllm/examples/online_serving/gradio_openai_chatbot_webserver.py \
    -m $MODEL_NAME \
    --port 8811 \
    --model-url http://$ENDPOINT/v1 \
    --stop-token-ids 128009,128001 | tee ~/gradio.log
 ```

-```{raw} html
+:::{raw} html
 </details>
-```
+:::

 1. Start the chat web UI:

-```console
-sky launch -c gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint vllm)
-```
+    ```console
+    sky launch -c gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint vllm)
+    ```

 2. Then, we can access the GUI at the returned gradio link:

-```console
-| INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live
-```
+    ```console
+    | INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live
+    ```
--- a/docs/source/serving/deploying_with_triton.md
+++ b/docs/source/serving/deploying_with_triton.md
-(deploying-with-triton)=
+(deployment-triton)=

-# Deploying with NVIDIA Triton
+# NVIDIA Triton

 The [Triton Inference Server](https://github.com/triton-inference-server) hosts a tutorial demonstrating how to quickly deploy a simple [facebook/opt-125m](https://huggingface.co/facebook/opt-125m) model using vLLM. Please see [Deploying a vLLM model in Triton](https://github.com/triton-inference-server/tutorials/blob/main/Quick_Deploy/vLLM/README.md#deploying-a-vllm-model-in-triton) for more details.
--- a/docs/source/deployment/integrations/index.md
+++ b/docs/source/deployment/integrations/index.md
+# External Integrations
+
+:::{toctree}
+:maxdepth: 1
+
+kserve
+kubeai
+llamastack
+:::
--- a/docs/source/serving/deploying_with_kserve.md
+++ b/docs/source/serving/deploying_with_kserve.md
-(deploying-with-kserve)=
+(deployment-kserve)=

-# Deploying with KServe
+# KServe

 vLLM can be deployed with [KServe](https://github.com/kserve/kserve) on Kubernetes for highly scalable distributed model serving.


--- a/docs/source/serving/deploying_with_kubeai.md
+++ b/docs/source/serving/deploying_with_kubeai.md
-(deploying-with-kubeai)=
+(deployment-kubeai)=

-# Deploying with KubeAI
+# KubeAI

 [KubeAI](https://github.com/substratusai/kubeai) is a Kubernetes operator that enables you to deploy and manage AI models on Kubernetes. It provides a simple and scalable way to deploy vLLM in production. Functionality such as scale-from-zero, load based autoscaling, model caching, and much more is provided out of the box with zero external dependencies.


--- a/docs/source/serving/serving_with_llamastack.md
+++ b/docs/source/serving/serving_with_llamastack.md
-(run-on-llamastack)=
+(deployment-llamastack)=

-# Serving with Llama Stack
+# Llama Stack

 vLLM is also available via [Llama Stack](https://github.com/meta-llama/llama-stack).

 To install Llama Stack, run

 ```console
-$ pip install llama-stack -q
+pip install llama-stack -q
 ```

 ## Inference using OpenAI Compatible API

--- a/docs/source/deployment/k8s.md
+++ b/docs/source/deployment/k8s.md
+(deployment-k8s)=
+
+# Using Kubernetes
+
+Using Kubernetes to deploy vLLM is a scalable and efficient way to serve machine learning models. This guide will walk you through the process of deploying vLLM with Kubernetes, including the necessary prerequisites, steps for deployment, and testing.
+
+## Prerequisites
+
+Before you begin, ensure that you have the following:
+
+- A running Kubernetes cluster
+- NVIDIA Kubernetes Device Plugin (`k8s-device-plugin`): This can be found at `https://github.com/NVIDIA/k8s-device-plugin/`
+- Available GPU resources in your cluster
+
+## Deployment Steps
+
+1. Create a PVC, Secret and Deployment for vLLM
+
+      A PVC is used to store the model cache. It is optional; you can use hostPath or other storage options instead.
+
+      ```yaml
+      apiVersion: v1
+      kind: PersistentVolumeClaim
+      metadata:
+        name: mistral-7b
+        namespace: default
+      spec:
+        accessModes:
+        - ReadWriteOnce
+        resources:
+          requests:
+            storage: 50Gi
+        storageClassName: default
+        volumeMode: Filesystem
+      ```
+
+      The Secret is only required for accessing gated models. You can skip this step if you are not using gated models.
+
+      ```yaml
+      apiVersion: v1
+      kind: Secret
+      metadata:
+        name: hf-token-secret
+        namespace: default
+      type: Opaque
+      stringData:
+        token: "REPLACE_WITH_TOKEN"
+      ```
+
+      Next, create the deployment file for vLLM to run the model server. The following example deploys the `Mistral-7B-Instruct-v0.3` model.
+
+      Here are two examples, one for NVIDIA GPUs and one for AMD GPUs.
+
+      NVIDIA GPU:
+
+      ```yaml
+      apiVersion: apps/v1
+      kind: Deployment
+      metadata:
+        name: mistral-7b
+        namespace: default
+        labels:
+          app: mistral-7b
+      spec:
+        replicas: 1
+        selector:
+          matchLabels:
+            app: mistral-7b
+        template:
+          metadata:
+            labels:
+              app: mistral-7b
+          spec:
+            volumes:
+            - name: cache-volume
+              persistentVolumeClaim:
+                claimName: mistral-7b
+            # vLLM needs to access the host's shared memory for tensor parallel inference.
+            - name: shm
+              emptyDir:
+                medium: Memory
+                sizeLimit: "2Gi"
+            containers:
+            - name: mistral-7b
+              image: vllm/vllm-openai:latest
+              command: ["/bin/sh", "-c"]
+              args: [
+                "vllm serve mistralai/Mistral-7B-Instruct-v0.3 --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
+              ]
+              env:
+              - name: HUGGING_FACE_HUB_TOKEN
+                valueFrom:
+                  secretKeyRef:
+                    name: hf-token-secret
+                    key: token
+              ports:
+              - containerPort: 8000
+              resources:
+                limits:
+                  cpu: "10"
+                  memory: 20G
+                  nvidia.com/gpu: "1"
+                requests:
+                  cpu: "2"
+                  memory: 6G
+                  nvidia.com/gpu: "1"
+              volumeMounts:
+              - mountPath: /root/.cache/huggingface
+                name: cache-volume
+              - name: shm
+                mountPath: /dev/shm
+              livenessProbe:
+                httpGet:
+                  path: /health
+                  port: 8000
+                initialDelaySeconds: 60
+                periodSeconds: 10
+              readinessProbe:
+                httpGet:
+                  path: /health
+                  port: 8000
+                initialDelaySeconds: 60
+                periodSeconds: 5
+      ```
+
+      AMD GPU:
+
+      You can refer to the `deployment.yaml` below if you are using an AMD ROCm GPU such as the MI300X.
+
+      ```yaml
+      apiVersion: apps/v1
+      kind: Deployment
+      metadata:
+        name: mistral-7b
+        namespace: default
+        labels:
+          app: mistral-7b
+      spec:
+        replicas: 1
+        selector:
+          matchLabels:
+            app: mistral-7b
+        template:
+          metadata:
+            labels:
+              app: mistral-7b
+          spec:
+            volumes:
+            # PVC
+            - name: cache-volume
+              persistentVolumeClaim:
+                claimName: mistral-7b
+            # vLLM needs to access the host's shared memory for tensor parallel inference.
+            - name: shm
+              emptyDir:
+                medium: Memory
+                sizeLimit: "8Gi"
+            hostNetwork: true
+            hostIPC: true
+            containers:
+            - name: mistral-7b
+              image: rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
+              securityContext:
+                seccompProfile:
+                  type: Unconfined
+                runAsGroup: 44
+                capabilities:
+                  add:
+                  - SYS_PTRACE
+              command: ["/bin/sh", "-c"]
+              args: [
+                "vllm serve mistralai/Mistral-7B-v0.3 --port 8000 --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
+              ]
+              env:
+              - name: HUGGING_FACE_HUB_TOKEN
+                valueFrom:
+                  secretKeyRef:
+                    name: hf-token-secret
+                    key: token
+              ports:
+              - containerPort: 8000
+              resources:
+                limits:
+                  cpu: "10"
+                  memory: 20G
+                  amd.com/gpu: "1"
+                requests:
+                  cpu: "6"
+                  memory: 6G
+                  amd.com/gpu: "1"
+              volumeMounts:
+              - name: cache-volume
+                mountPath: /root/.cache/huggingface
+              - name: shm
+                mountPath: /dev/shm
+      ```
+
+      You can get the full example with steps and sample yaml files from <https://github.com/ROCm/k8s-device-plugin/tree/master/example/vllm-serve>.
+
+2. Create a Kubernetes Service for vLLM
+
+      Next, create a Kubernetes Service file to expose the `mistral-7b` deployment:
+
+      ```yaml
+      apiVersion: v1
+      kind: Service
+      metadata:
+        name: mistral-7b
+        namespace: default
+      spec:
+        ports:
+        - name: http-mistral-7b
+          port: 80
+          protocol: TCP
+          targetPort: 8000
+        # The label selector should match the deployment labels; it is also useful for the prefix caching feature
+        selector:
+          app: mistral-7b
+        sessionAffinity: None
+        type: ClusterIP
+      ```
+
+3. Deploy and Test
+
+      Apply the deployment and service configurations using `kubectl apply -f <filename>`:
+
+      ```console
+      kubectl apply -f deployment.yaml
+      kubectl apply -f service.yaml
+      ```
+
+      To test the deployment, run the following `curl` command:
+
+      ```console
+      curl http://mistral-7b.default.svc.cluster.local/v1/completions \
+        -H "Content-Type: application/json" \
+        -d '{
+              "model": "mistralai/Mistral-7B-Instruct-v0.3",
+              "prompt": "San Francisco is a",
+              "max_tokens": 7,
+              "temperature": 0
+            }'
+      ```
+
+      If the service is correctly deployed, you should receive a response from the vLLM model.
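The `curl` test above can also be issued from Python. The following is a minimal sketch using only the standard library. It assumes the in-cluster Service URL from the manifest above and the OpenAI-style completions response shape (`choices[0].text`); the helper names `build_completion_request` and `complete` are illustrative and not part of vLLM.

```python
import json
import urllib.request

# Assumed in-cluster URL, derived from the Service manifest (port 80 -> 8000).
BASE_URL = "http://mistral-7b.default.svc.cluster.local"


def build_completion_request(base_url: str, prompt: str) -> urllib.request.Request:
    """Build the same POST request as the curl command, without sending it."""
    payload = {
        "model": "mistralai/Mistral-7B-Instruct-v0.3",
        "prompt": prompt,
        "max_tokens": 7,
        "temperature": 0,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url: str, prompt: str, timeout: float = 30.0) -> str:
    """Send the request and return the text of the first completion."""
    request = build_completion_request(base_url, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Example usage from a pod inside the cluster:
# print(complete(BASE_URL, "San Francisco is a"))
```

Keeping request construction separate from sending makes it easy to inspect the payload before anything touches the network.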
+
+## Conclusion
+
+Deploying vLLM with Kubernetes allows for efficient scaling and management of ML models leveraging GPU resources. By following the steps outlined above, you should be able to set up and test a vLLM deployment within your Kubernetes cluster. If you encounter any issues or have suggestions, please feel free to contribute to the documentation.
--- a/docs/source/serving/deploying_with_nginx.md
+++ b/docs/source/serving/deploying_with_nginx.md
 (nginxloadbalancer)=

-# Deploying with Nginx Loadbalancer
+# Using Nginx

 This document shows how to launch multiple vLLM serving containers and use Nginx to act as a load balancer between the servers.

@@ -105,9 +105,9 @@ docker run -itd --ipc host --privileged --network vllm_nginx --gpus all --shm-si
 docker run -itd --ipc host --privileged --network vllm_nginx --gpus all --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8082:8000 --name vllm1 vllm --model meta-llama/Llama-2-7b-chat-hf
 ```

-```{note}
+:::{note}
 If you are behind proxy, you can pass the proxy settings to the docker run command via `-e http_proxy=$http_proxy -e https_proxy=$https_proxy`.
-```
+:::

 (nginxloadbalancer-nginx-launch-nginx)=


--- a/docs/source/design/arch_overview.md
+++ b/docs/source/design/arch_overview.md
@@ -4,19 +4,19 @@

 This document provides an overview of the vLLM architecture.

-```{contents} Table of Contents
+:::{contents} Table of Contents
 :depth: 2
 :local: true
-```
+:::

 ## Entrypoints

 vLLM provides a number of entrypoints for interacting with the system. The
 following diagram shows the relationship between them.

-```{image} /assets/design/arch_overview/entrypoints.excalidraw.png
+:::{image} /assets/design/arch_overview/entrypoints.excalidraw.png
 :alt: Entrypoints Diagram
-```
+:::

 ### LLM Class

@@ -53,11 +53,11 @@ for output in outputs:
 ```

 More API details can be found in the {doc}`Offline Inference
-</dev/offline_inference/offline_index>` section of the API docs.
+</api/offline_inference/index>` section of the API docs.

 The code for the `LLM` class can be found in <gh-file:vllm/entrypoints/llm.py>.

-### OpenAI-compatible API server
+### OpenAI-Compatible API Server

 The second primary interface to vLLM is via its OpenAI-compatible API server.
 This server can be started using the `vllm serve` command.
@@ -77,17 +77,16 @@ python -m vllm.entrypoints.openai.api_server --model <model>

 That code can be found in <gh-file:vllm/entrypoints/openai/api_server.py>.

-More details on the API server can be found in the {doc}`OpenAI Compatible
-Server </serving/openai_compatible_server>` document.
+More details on the API server can be found in the [OpenAI-Compatible Server](#openai-compatible-server) document.

 ## LLM Engine

 The `LLMEngine` and `AsyncLLMEngine` classes are central to the functioning of
 the vLLM system, handling model inference and asynchronous request processing.

-```{image} /assets/design/arch_overview/llm_engine.excalidraw.png
+:::{image} /assets/design/arch_overview/llm_engine.excalidraw.png
 :alt: LLMEngine Diagram
-```
+:::

 ### LLMEngine

@@ -145,11 +144,11 @@ configurations affect the class we ultimately get.

 The following figure shows the class hierarchy of vLLM:

-> ```{figure} /assets/design/hierarchy.png
+> :::{figure} /assets/design/hierarchy.png
 > :align: center
 > :alt: query
 > :width: 100%
-> ```
+> :::

 There are several important design choices behind this class hierarchy:

@@ -179,7 +178,7 @@ of a vision model and a language model. By making the constructor uniform, we
 can easily create a vision model and a language model and compose them into a
 vision-language model.

-````{note}
+:::{note}
 To support this change, all vLLM models' signatures have been updated to:

 ```python
@@ -216,7 +215,7 @@ else:
 ```

 This way, the model can work with both old and new versions of vLLM.
-````
+:::

 3\. **Sharding and Quantization at Initialization**: Certain features require
 changing the model weights. For example, tensor parallelism needs to shard the

--- a/docs/source/automatic_prefix_caching/details.md
+++ b/docs/source/automatic_prefix_caching/details.md
-# Implementation
+(design-automatic-prefix-caching)=

-The core idea of PagedAttention is to partition the KV cache of each request into KV Blocks. Each block contains the attention keys and values for a fixed number of tokens. The PagedAttention algorithm allows these blocks to be stored in non-contiguous physical memory so that we can eliminate memory fragmentation by allocating the memory on demand.
+# Automatic Prefix Caching
+
+The core idea of [PagedAttention](https://blog.vllm.ai/2023/06/20/vllm.html) is to partition the KV cache of each request into KV Blocks. Each block contains the attention keys and values for a fixed number of tokens. The PagedAttention algorithm allows these blocks to be stored in non-contiguous physical memory so that we can eliminate memory fragmentation by allocating the memory on demand.

 To automatically cache the KV cache, we utilize the following key observation: Each KV block can be uniquely identified by the tokens within the block and the tokens in the prefix before the block.

-```
+```text
                    Block 1                  Block 2                  Block 3
         [A gentle breeze stirred] [the leaves as children] [laughed in the distance]
 Block 1: |<--- block tokens ---->|
@@ -12,19 +14,16 @@ Block 2: |<------- prefix ------>| |<--- block tokens --->|
 Block 3: |<------------------ prefix -------------------->| |<--- block tokens ---->|
 ```

-
 In the example above, the KV cache in the first block can be uniquely identified with the tokens “A gentle breeze stirred”. The third block can be uniquely identified with the tokens in the block “laughed in the distance”, along with the prefix tokens “A gentle breeze stirred the leaves as children”. Therefore, we can build the following one-to-one mapping:

-```
+```text
 hash(prefix tokens + block tokens) <--> KV Block
 ```

 With this mapping, we can add another indirection in vLLM’s KV cache management. Previously, each sequence in vLLM maintained a mapping from its logical KV blocks to physical blocks. To achieve automatic caching of KV blocks, we map the logical KV blocks to their hash value and maintain a global hash table of all the physical blocks. In this way, all the KV blocks sharing the same hash value (e.g., shared prefix blocks across two requests) can be mapped to the same physical block and share the memory space.

-
 This design achieves automatic prefix caching without needing to maintain a tree structure among the KV blocks. More specifically, all of the blocks are independent of each other and can be allocated and freed individually, which enables us to manage the KV cache like an ordinary cache in an operating system.
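The mapping from `hash(prefix tokens + block tokens)` to a physical block can be sketched in a few lines. This is an illustration of the idea only, not vLLM's block manager: the names `block_hashes` and `BlockTable`, the use of whole words as tokens, the block size of 4, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # tokens per KV block; illustrative value


def block_hashes(tokens: List[str], block_size: int = BLOCK_SIZE) -> List[str]:
    """Return one hash per full block, covering the prefix plus the block tokens."""
    hashes = []
    for end in range(block_size, len(tokens) + 1, block_size):
        # tokens[:end] is the prefix before the block followed by the block itself.
        key = repr(tuple(tokens[:end])).encode("utf-8")
        hashes.append(hashlib.sha256(key).hexdigest())
    return hashes


class BlockTable:
    """Global hash table mapping a block hash to a physical block id."""

    def __init__(self) -> None:
        self._physical: Dict[str, int] = {}

    def lookup_or_allocate(self, block_hash: str) -> Tuple[int, bool]:
        """Return (physical block id, cache hit?) for the given hash."""
        if block_hash in self._physical:
            return self._physical[block_hash], True
        block_id = len(self._physical)
        self._physical[block_hash] = block_id
        return block_id, False


table = BlockTable()
first = "A gentle breeze stirred the leaves as children laughed in the distance".split()
second = "A gentle breeze stirred the leaves as children played by the pond".split()
first_blocks = [table.lookup_or_allocate(h) for h in block_hashes(first)]
second_blocks = [table.lookup_or_allocate(h) for h in block_hashes(second)]
```

The two requests share their first two blocks, so the second request reuses those physical blocks and only allocates a new one for its final block.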

-
 ## Generalized Caching Policy

 Keeping all the KV blocks in a hash table enables vLLM to cache KV blocks from earlier requests to save memory and accelerate the computation of future requests. For example, if a new request shares the system prompt with the previous request, the KV cache of the shared prompt can directly be used for the new request without recomputation. However, the total KV cache space is limited and we have to decide which KV blocks to keep or evict when the cache is full.
@@ -39,5 +38,5 @@ Note that this eviction policy effectively implements the exact policy as in [Ra

 However, the hash-based KV cache management gives us the flexibility to handle more complicated serving scenarios and implement more complicated eviction policies beyond the policy above:

- Multi-LoRA serving. When serving requests for multiple LoRA adapters, we can simply let the hash of each KV block to also include the LoRA ID the request is querying for to enable caching for all adapters. In this way, we can jointly manage the KV blocks for different adapters, which simplifies the system implementation and improves the global cache hit rate and efficiency.
- Multi-modal models. When the user input includes more than just discrete tokens, we can use different hashing methods to handle the caching of inputs of different modalities. For example, perceptual hashing for images to cache similar input images.
+* Multi-LoRA serving. When serving requests for multiple LoRA adapters, we can simply let the hash of each KV block also include the LoRA ID the request is querying for to enable caching for all adapters. In this way, we can jointly manage the KV blocks for different adapters, which simplifies the system implementation and improves the global cache hit rate and efficiency.
+* Multi-modal models. When the user input includes more than just discrete tokens, we can use different hashing methods to handle the caching of inputs of different modalities. For example, perceptual hashing could be applied to images so that similar input images share cache entries.
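The Multi-LoRA case above amounts to adding one more field to the hash key. A minimal sketch, assuming integer token IDs and an optional integer LoRA ID; the function name `block_hash` and the key layout are hypothetical and do not reflect vLLM's internal hashing.

```python
import hashlib
from typing import Optional, Sequence


def block_hash(
    prefix_tokens: Sequence[int],
    block_tokens: Sequence[int],
    lora_id: Optional[int] = None,
) -> str:
    """Hash a KV block by its prefix, its own tokens, and the LoRA adapter in use."""
    h = hashlib.sha256()
    # Including the adapter ID keeps blocks from different adapters apart.
    h.update(repr(lora_id).encode("utf-8"))
    h.update(b"|")
    h.update(repr(tuple(prefix_tokens) + tuple(block_tokens)).encode("utf-8"))
    return h.hexdigest()
```

Requests for the same adapter with the same tokens produce the same hash and can share a block, while identical tokens under different adapters never collide in the table.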
--- a/docs/source/design/input_processing/input_processing_pipeline.md
+++ b/docs/source/design/input_processing/input_processing_pipeline.md
-(input-processing-pipeline)=
-
-# Input Processing Pipeline
-
-1. Input data is passed to {class}`~vllm.LLMEngine` (or {class}`~vllm.AsyncLLMEngine`).
-
-2. Tokenize the data if necessary.
-
-3. Process the inputs using {meth}`INPUT_REGISTRY.process_input <vllm.inputs.registry.InputRegistry.process_input>`.
-
-   - For example, add placeholder tokens to reserve KV cache for multi-modal embeddings.
-
-4. Send the processed inputs to {class}`~vllm.executor.executor_base.ExecutorBase`.
-
-5. Distribute the inputs via {class}`~vllm.worker.worker_base.WorkerBase` to {class}`~vllm.worker.model_runner_base.ModelRunnerBase`.
-
-6. If the data contains multi-modal data, convert it into keyword arguments using {meth}`MULTIMODAL_REGISTRY.map_input <vllm.multimodal.MultiModalRegistry.map_input>`.
-
-   - For example, convert a {class}`PIL.Image.Image` input to its pixel values for a vision model.
--- a/docs/source/design/input_processing/model_inputs_index.md
+++ b/docs/source/design/input_processing/model_inputs_index.md
-(input-processing)=
-
-# Input Processing
-
-```{eval-rst}
-.. currentmodule:: vllm.inputs
-```
-
-Each model can override parts of vLLM's [input processing pipeline](#input-processing-pipeline) via
-{data}`~vllm.inputs.INPUT_REGISTRY` and {data}`~vllm.multimodal.MULTIMODAL_REGISTRY`.
-
-Currently, this mechanism is only utilized in [multi-modal](#multi-modality) models for preprocessing multi-modal input
-data in addition to input prompt, but it can be extended to text-only language models when needed.
-
-## Guides
-
-```{toctree}
-:maxdepth: 1
-
-input_processing_pipeline
-```
-
-## Module Contents
-
-### LLM Engine Inputs
-
-```{eval-rst}
-.. autoclass:: vllm.inputs.DecoderOnlyInputs
-    :members:
-    :show-inheritance:
-```
-
-### Registry
-
-```{eval-rst}
-.. autodata:: vllm.inputs.INPUT_REGISTRY
-```
-
-```{eval-rst}
-.. automodule:: vllm.inputs.registry
-    :members:
-    :show-inheritance:
-```
--- a/docs/source/design/kernel/paged_attention.md
+++ b/docs/source/design/kernel/paged_attention.md
+(design-paged-attention)=
+
 # vLLM Paged Attention

 - Currently, vLLM utilizes its own implementation of a multi-head query
@@ -137,26 +139,26 @@
  const scalar_t* q_ptr = q + seq_idx * q_stride + head_idx * HEAD_SIZE;
  ```

-  ```{figure} ../../assets/kernel/query.png
+  :::{figure} ../../assets/kernel/query.png
  :align: center
  :alt: query
  :width: 70%

  Query data of one token at one head
-  ```
+  :::

 - Each thread defines its own `q_ptr` which points to the assigned
  query token data on global memory. For example, if `VEC_SIZE` is 4
  and `HEAD_SIZE` is 128, the `q_ptr` points to data that contains
   a total of 128 elements divided into 128 / 4 = 32 vecs.

-  ```{figure} ../../assets/kernel/q_vecs.png
+  :::{figure} ../../assets/kernel/q_vecs.png
  :align: center
  :alt: q_vecs
  :width: 70%

  `q_vecs` for one thread group
-  ```
+  :::

  ```cpp
  __shared__ Q_vec q_vecs[THREAD_GROUP_SIZE][NUM_VECS_PER_THREAD];
@@ -193,13 +195,13 @@
  points to key token data based on `k_cache` at assigned block,
  assigned head and assigned token.

-  ```{figure} ../../assets/kernel/key.png
+  :::{figure} ../../assets/kernel/key.png
  :align: center
  :alt: key
  :width: 70%

  Key data of all context tokens at one head
-  ```
+  :::

 - The diagram above illustrates the memory layout for key data. It
  assumes that the `BLOCK_SIZE` is 16, `HEAD_SIZE` is 128, `x` is
@@ -212,13 +214,13 @@
  elements for one token) that will be processed by 2 threads (one
  thread group) separately.

-  ```{figure} ../../assets/kernel/k_vecs.png
+  :::{figure} ../../assets/kernel/k_vecs.png
  :align: center
  :alt: k_vecs
  :width: 70%

  `k_vecs` for one thread
-  ```
+  :::

  ```cpp
  K_vec k_vecs[NUM_VECS_PER_THREAD]
@@ -287,14 +289,14 @@
  should be performed across the entire thread block, encompassing
  results between the query token and all context key tokens.

-  ```{math}
+  :::{math}
  :nowrap: true

  \begin{gather*}
  m(x):=\max _i \quad x_i \\ \quad f(x):=\left[\begin{array}{lll}e^{x_1-m(x)} & \ldots & e^{x_B-m(x)}\end{array}\right]\\ \quad \ell(x):=\sum_i f(x)_i \\
  \quad \operatorname{softmax}(x):=\frac{f(x)}{\ell(x)}
  \end{gather*}
-  ```
+  :::
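The definitions of m(x), f(x), and ℓ(x) above translate directly into a numerically stable softmax. The sketch below is a plain single-threaded Python version for reference; it does not model the kernel's per-thread-block reductions.

```python
import math
from typing import List


def softmax(x: List[float]) -> List[float]:
    """Compute softmax(x) = f(x) / l(x) with the max subtracted for stability."""
    m = max(x)                          # m(x): maximum over all elements
    f = [math.exp(v - m) for v in x]    # f(x): shifted exponentials
    l = sum(f)                          # l(x): normalizer
    return [v / l for v in f]
```

Subtracting m(x) before exponentiating keeps every exponent at or below zero, so large logits such as 1000.0 do not overflow.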

 ### `qk_max` and `logits`

@@ -377,29 +379,29 @@

 ## Value

-```{figure} ../../assets/kernel/value.png
+:::{figure} ../../assets/kernel/value.png
 :align: center
 :alt: value
 :width: 70%

 Value data of all context tokens at one head
-```
+:::

-```{figure} ../../assets/kernel/logits_vec.png
+:::{figure} ../../assets/kernel/logits_vec.png
 :align: center
 :alt: logits_vec
 :width: 50%

 `logits_vec` for one thread
-```
+:::

-```{figure} ../../assets/kernel/v_vec.png
+:::{figure} ../../assets/kernel/v_vec.png
 :align: center
 :alt: v_vec
 :width: 70%

 List of `v_vec` for one thread
-```
+:::

 - Now we need to retrieve the value data and perform dot multiplication
  with `logits`. Unlike query and key, there is no thread group

--- a/docs/source/design/mm_processing.md
+++ b/docs/source/design/mm_processing.md
+(mm-processing)=
+
+# Multi-Modal Data Processing
+
+To enable various optimizations in vLLM such as [chunked prefill](#chunked-prefill) and [prefix caching](#automatic-prefix-caching), we use {class}`~vllm.multimodal.processing.BaseMultiModalProcessor` to provide the correspondence between placeholder feature tokens (e.g. `<image>`) and multi-modal inputs (e.g. the raw input image) based on the outputs of HF processor.
+
+Here are the main features of {class}`~vllm.multimodal.processing.BaseMultiModalProcessor`:
+
+## Prompt Replacement Detection
+
+One of the main responsibilities of HF processor is to replace input placeholder tokens (e.g. `<image>` for a single image) with feature placeholder tokens (e.g. `<image><image>...<image>`, the number of which equals the feature size). The information about which tokens have been replaced is key to finding the correspondence between placeholder feature tokens and multi-modal inputs.
+
+In vLLM, this information is specified using {class}`~vllm.multimodal.processing.PromptReplacement` in {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_replacements`. Given this specification, we can automatically detect whether HF has replaced the input placeholder tokens by checking whether the feature placeholder tokens exist in the prompt.
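The detection step can be pictured as a scan for a run of feature placeholder tokens. The following is a simplified sketch on raw token IDs, not the logic in `BaseMultiModalProcessor`: the function name `hf_already_replaced` and the example token ID are hypothetical, and it assumes a feature size greater than 1 so that a single input placeholder is not mistaken for an expanded one.

```python
from typing import Sequence


def hf_already_replaced(
    prompt_ids: Sequence[int],
    feature_token_id: int,
    feature_size: int,
) -> bool:
    """Return True if the prompt contains feature_size consecutive feature tokens."""
    run = 0
    for token_id in prompt_ids:
        # Extend the current run, or reset it when a different token appears.
        run = run + 1 if token_id == feature_token_id else 0
        if run >= feature_size:
            return True
    return False
```

If the check returns False, the prompt still holds unexpanded input placeholders and prompt replacement has to be applied separately.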
+
+## Tokenized Prompt Inputs
+
+To enable tokenization in a separate process, we support passing input token IDs alongside multi-modal data.
+
+### The problem
+
+Consider that HF processors follow these main steps:
+
+1. Tokenize the text
+2. Process multi-modal inputs
+3. Perform prompt replacement
+
+And we require that:
+
+- For text + multi-modal inputs, apply all steps 1--3.
+- For tokenized + multi-modal inputs, apply only steps 2--3.
+
+How can we achieve this without rewriting HF processors? We can try to call the HF processor several times on different inputs:
+
+- For text + multi-modal inputs, simply call the HF processor directly.
+- For tokenized + multi-modal inputs, call the processor only on the multi-modal inputs.
+
+While HF processors support text + multi-modal inputs natively, this is not so for tokenized + multi-modal inputs: an error is thrown if the number of input placeholder tokens does not correspond to the number of multi-modal inputs.
+
+Moreover, since the tokenized text has not passed through the HF processor, we have to apply Step 3 ourselves to keep the output tokens and multi-modal data consistent with each other.
+
+(mm-dummy-text)=
+
+### Dummy text
+
+We work around the first issue by requiring each model to define how to generate dummy text based on the number of multi-modal inputs, via {meth}`~vllm.multimodal.profiling.BaseDummyInputsBuilder.get_dummy_processor_inputs`. This lets us generate dummy text corresponding to the multi-modal inputs and input them together to obtain the processed multi-modal data.
+
+(mm-automatic-prompt-replacement)=
+
+### Automatic prompt replacement
+
+We address the second issue by implementing model-agnostic code in
+{meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._apply_prompt_replacements` to automatically replace input placeholder tokens with feature placeholder tokens, based on the specification returned by {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_replacements`.
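
To make the replacement step concrete, here is a simplified sketch that works on token IDs. It is an illustration only and does not reproduce `_apply_prompt_replacements`; the function `replace_placeholders`, its parameters, and the token ID `32000` are assumptions made for this example.

```python
def replace_placeholders(
    token_ids: list[int],
    placeholder_id: int,
    feature_id: int,
    feature_sizes: list[int],
) -> list[int]:
    """Expand the i-th input placeholder into feature_sizes[i] feature tokens."""
    out: list[int] = []
    sizes = iter(feature_sizes)
    for tok in token_ids:
        if tok == placeholder_id:
            size = next(sizes, None)
            if size is None:
                raise ValueError("more placeholders than multi-modal items")
            out.extend([feature_id] * size)
        else:
            out.append(tok)
    # Every multi-modal item must have been matched to a placeholder.
    if next(sizes, None) is not None:
        raise ValueError("fewer placeholders than multi-modal items")
    return out


# Two images with (hypothetical) feature sizes 2 and 3.
expanded = replace_placeholders(
    [1, 32000, 15, 32000, 2],
    placeholder_id=32000,
    feature_id=32000,
    feature_sizes=[2, 3],
)
```

The mismatch checks mirror the consistency requirement from the text: the output tokens and the multi-modal data must describe the same number of items.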
+
+### Summary
+
+With the help of dummy text and automatic prompt replacement, our multi-modal processor can finally accept both text and token prompts with multi-modal data. The detailed logic is shown in {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._apply_hf_processor_main`.
+
+## Processor Output Caching
+
+Some HF processors, such as the one for Qwen2-VL, are [very slow](gh-issue:9238). To alleviate this problem, we cache the multi-modal outputs of the HF processor to avoid processing the same multi-modal input (e.g. image) again.
+
+When new data is passed in, we first check which items are in the cache, and which ones are missing. The missing items are passed into the HF processor in a single batch and cached, before being merged with the existing items in the cache.
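
The check-then-batch flow above can be sketched as follows. This is a toy model and not vLLM's cache: keying items by a SHA-256 hash of their bytes is an assumption made here, and `process_with_cache` and `fake_processor` are hypothetical names standing in for the real cache and HF processor.

```python
import hashlib
from typing import Callable


def process_with_cache(
    items: list[bytes],
    cache: dict[str, str],
    process_batch: Callable[[list[bytes]], list[str]],
) -> list[str]:
    """Process only uncached items in one batch, then merge with cached outputs."""
    keys = [hashlib.sha256(item).hexdigest() for item in items]

    # Collect missing items once each, preserving their order of appearance.
    missing: dict[str, bytes] = {}
    for key, item in zip(keys, items):
        if key not in cache and key not in missing:
            missing[key] = item

    if missing:
        outputs = process_batch(list(missing.values()))
        cache.update(zip(missing.keys(), outputs))

    # Merge: every requested item is now available from the cache.
    return [cache[key] for key in keys]


calls: list[list[bytes]] = []


def fake_processor(batch: list[bytes]) -> list[str]:
    """Stand-in for a slow processor; records each batch it receives."""
    calls.append(list(batch))
    return [f"features({item.decode()})" for item in batch]


cache: dict[str, str] = {}
first = process_with_cache([b"cat", b"dog"], cache, fake_processor)
second = process_with_cache([b"dog", b"bird"], cache, fake_processor)
```

On the second call only `b"bird"` reaches the stand-in processor, because `b"dog"` was cached by the first call.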
+
+Since we only process the missing multi-modal data items, the number of input placeholder tokens no longer corresponds to the number of multi-modal inputs, so they can't be passed alongside the text prompt to the HF processor. Therefore, we process the text and multi-modal inputs separately, using [dummy text](#mm-dummy-text) to avoid HF errors. Since this skips HF's prompt replacement code, we apply [automatic prompt replacement](#mm-automatic-prompt-replacement) afterwards to keep the output tokens and multi-modal data consistent with each other.
--- a/docs/source/design/multimodal/adding_multimodal_plugin.md
+++ b/docs/source/design/multimodal/adding_multimodal_plugin.md
-(adding-multimodal-plugin)=
-
-# Adding a Multimodal Plugin
-
-This document teaches you how to add a new modality to vLLM.
-
-Each modality in vLLM is represented by a {class}`~vllm.multimodal.MultiModalPlugin` and registered to {data}`~vllm.multimodal.MULTIMODAL_REGISTRY`.
-For vLLM to recognize a new modality type, you have to create a new plugin and then pass it to {meth}`~vllm.multimodal.MultiModalRegistry.register_plugin`.
-
-The remainder of this document details how to define custom {class}`~vllm.multimodal.MultiModalPlugin` s.
-
-```{note}
-This article is a work in progress.
-```
-
-% TODO: Add more instructions on how to add new plugins once embeddings is in.
--- a/docs/source/design/multimodal/multimodal_index.md
+++ b/docs/source/design/multimodal/multimodal_index.md
-(multi-modality)=
-
-# Multi-Modality
-
-```{eval-rst}
-.. currentmodule:: vllm.multimodal
-```
-
-vLLM provides experimental support for multi-modal models through the {mod}`vllm.multimodal` package.
-
-Multi-modal inputs can be passed alongside text and token prompts to [supported models](#supported-mm-models)
-via the `multi_modal_data` field in {class}`vllm.inputs.PromptType`.
-
-Currently, vLLM only has built-in support for image data. You can extend vLLM to process additional modalities
-by following [this guide](#adding-multimodal-plugin).
-
-Looking to add your own multi-modal model? Please follow the instructions listed [here](#enabling-multimodal-inputs).
-
-## Guides
-
-```{toctree}
-:maxdepth: 1
-
-adding_multimodal_plugin
-```
-
-## Module Contents
-
-```{eval-rst}
-.. automodule:: vllm.multimodal
-```
-
-### Registry
-
-```{eval-rst}
-.. autodata:: vllm.multimodal.MULTIMODAL_REGISTRY
-```
-
-```{eval-rst}
-.. autoclass:: vllm.multimodal.MultiModalRegistry
-    :members:
-    :show-inheritance:
-```
-
-### Base Classes
-
-```{eval-rst}
-.. autodata:: vllm.multimodal.NestedTensors
-```
-
-```{eval-rst}
-.. autodata:: vllm.multimodal.BatchedTensorInputs
-```
-
-```{eval-rst}
-.. autoclass:: vllm.multimodal.MultiModalDataBuiltins
-    :members:
-    :show-inheritance:
-```
-
-```{eval-rst}
-.. autodata:: vllm.multimodal.MultiModalDataDict
-```
-
-```{eval-rst}
-.. autoclass:: vllm.multimodal.MultiModalKwargs
-    :members:
-    :show-inheritance:
-```
-
-```{eval-rst}
-.. autoclass:: vllm.multimodal.MultiModalPlugin
-    :members:
-    :show-inheritance:
-```
-
-### Image Classes
-
-```{eval-rst}
-.. automodule:: vllm.multimodal.image
-    :members:
-    :show-inheritance:
-```
--- a/docs/source/design/multiprocessing.md
+++ b/docs/source/design/multiprocessing.md
@@ -2,14 +2,14 @@

 ## Debugging

-Please see the [Debugging Tips](#debugging-python-multiprocessing)
+Please see the [Troubleshooting](#troubleshooting-python-multiprocessing)
 page for information on known issues and how to solve them.

 ## Introduction

-```{important}
+:::{important}
 The source code references are to the state of the code at the time of writing in December, 2024.
-```
+:::

 The use of Python multiprocessing in vLLM is complicated by:

@@ -21,7 +21,7 @@ This document describes how vLLM deals with these challenges.

 ## Multiprocessing Methods

-[Python multiprocessing methods](https://docs.python.org/3/library/multiprocessing.html.md#contexts-and-start-methods) include:
+[Python multiprocessing methods](https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods) include:

 - `spawn` - spawn a new Python process. This will be the default as of Python
  3.14.