Commit 4eabe123 authored by zhuwenwen's avatar zhuwenwen
Browse files

Merge remote-tracking branch 'mirror/releases/v0.9.0' into v0.9.0-ori

parents 45840cd2 58738772
(deployment-lws)= ---
title: LWS
# LWS ---
[](){ #deployment-lws }
LeaderWorkerSet (LWS) is a Kubernetes API that aims to address common deployment patterns of AI/ML inference workloads. LeaderWorkerSet (LWS) is a Kubernetes API that aims to address common deployment patterns of AI/ML inference workloads.
A major use case is for multi-host/multi-node distributed inference. A major use case is for multi-host/multi-node distributed inference.
......
(deployment-modal)= ---
title: Modal
# Modal ---
[](){ #deployment-modal }
vLLM can be run on cloud GPUs with [Modal](https://modal.com), a serverless computing platform designed for fast auto-scaling. vLLM can be run on cloud GPUs with [Modal](https://modal.com), a serverless computing platform designed for fast auto-scaling.
......
(deployment-open-webui)= ---
title: Open WebUI
# Open WebUI ---
[](){ #deployment-open-webui }
1. Install the [Docker](https://docs.docker.com/engine/install/) 1. Install the [Docker](https://docs.docker.com/engine/install/)
...@@ -25,5 +26,4 @@ ghcr.io/open-webui/open-webui:main ...@@ -25,5 +26,4 @@ ghcr.io/open-webui/open-webui:main
On the top of the web page, you can see the model `qwen/Qwen1.5-0.5B-Chat`. On the top of the web page, you can see the model `qwen/Qwen1.5-0.5B-Chat`.
:::{image} /assets/deployment/open_webui.png ![](../../assets/deployment/open_webui.png)
:::
(deployment-retrieval-augmented-generation)= ---
title: Retrieval-Augmented Generation
# Retrieval-Augmented Generation ---
[](){ #deployment-retrieval-augmented-generation }
[Retrieval-augmented generation (RAG)](https://en.wikipedia.org/wiki/Retrieval-augmented_generation) is a technique that enables generative artificial intelligence (Gen AI) models to retrieve and incorporate new information. It modifies interactions with a large language model (LLM) so that the model responds to user queries with reference to a specified set of documents, using this information to supplement information from its pre-existing training data. This allows LLMs to use domain-specific and/or updated information. Use cases include providing chatbot access to internal company data or generating responses based on authoritative sources. [Retrieval-augmented generation (RAG)](https://en.wikipedia.org/wiki/Retrieval-augmented_generation) is a technique that enables generative artificial intelligence (Gen AI) models to retrieve and incorporate new information. It modifies interactions with a large language model (LLM) so that the model responds to user queries with reference to a specified set of documents, using this information to supplement information from its pre-existing training data. This allows LLMs to use domain-specific and/or updated information. Use cases include providing chatbot access to internal company data or generating responses based on authoritative sources.
......
(deployment-skypilot)= ---
title: SkyPilot
---
[](){ #deployment-skypilot }
# SkyPilot
:::{raw} html
<p align="center"> <p align="center">
<img src="https://imgur.com/yxtzPEu.png" alt="vLLM"/> <img src="https://imgur.com/yxtzPEu.png" alt="vLLM"/>
</p> </p>
:::
vLLM can be **run and scaled to multiple service replicas on clouds and Kubernetes** with [SkyPilot](https://github.com/skypilot-org/skypilot), an open-source framework for running LLMs on any cloud. More examples for various open models, such as Llama-3, Mixtral, etc, can be found in [SkyPilot AI gallery](https://skypilot.readthedocs.io/en/latest/gallery/index.html). vLLM can be **run and scaled to multiple service replicas on clouds and Kubernetes** with [SkyPilot](https://github.com/skypilot-org/skypilot), an open-source framework for running LLMs on any cloud. More examples for various open models, such as Llama-3, Mixtral, etc, can be found in [SkyPilot AI gallery](https://skypilot.readthedocs.io/en/latest/gallery/index.html).
...@@ -83,7 +82,11 @@ Check the output of the command. There will be a shareable gradio link (like the ...@@ -83,7 +82,11 @@ Check the output of the command. There will be a shareable gradio link (like the
**Optional**: Serve the 70B model instead of the default 8B and use more GPU: **Optional**: Serve the 70B model instead of the default 8B and use more GPU:
```console ```console
HF_TOKEN="your-huggingface-token" sky launch serving.yaml --gpus A100:8 --env HF_TOKEN --env MODEL_NAME=meta-llama/Meta-Llama-3-70B-Instruct HF_TOKEN="your-huggingface-token" \
sky launch serving.yaml \
--gpus A100:8 \
--env HF_TOKEN \
--env MODEL_NAME=meta-llama/Meta-Llama-3-70B-Instruct
``` ```
## Scale up to multiple replicas ## Scale up to multiple replicas
...@@ -104,10 +107,8 @@ service: ...@@ -104,10 +107,8 @@ service:
max_completion_tokens: 1 max_completion_tokens: 1
``` ```
:::{raw} html
<details> <details>
<summary>Click to see the full recipe YAML</summary> <summary>Click to see the full recipe YAML</summary>
:::
```yaml ```yaml
service: service:
...@@ -153,14 +154,14 @@ run: | ...@@ -153,14 +154,14 @@ run: |
2>&1 | tee api_server.log 2>&1 | tee api_server.log
``` ```
:::{raw} html
</details> </details>
:::
Start the serving the Llama-3 8B model on multiple replicas: Start the serving the Llama-3 8B model on multiple replicas:
```console ```console
HF_TOKEN="your-huggingface-token" sky serve up -n vllm serving.yaml --env HF_TOKEN HF_TOKEN="your-huggingface-token" \
sky serve up -n vllm serving.yaml \
--env HF_TOKEN
``` ```
Wait until the service is ready: Wait until the service is ready:
...@@ -169,10 +170,8 @@ Wait until the service is ready: ...@@ -169,10 +170,8 @@ Wait until the service is ready:
watch -n10 sky serve status vllm watch -n10 sky serve status vllm
``` ```
:::{raw} html
<details> <details>
<summary>Example outputs:</summary> <summary>Example outputs:</summary>
:::
```console ```console
Services Services
...@@ -185,9 +184,7 @@ vllm 1 1 xx.yy.zz.121 18 mins ago 1x GCP([Spot]{'L4': 1}) R ...@@ -185,9 +184,7 @@ vllm 1 1 xx.yy.zz.121 18 mins ago 1x GCP([Spot]{'L4': 1}) R
vllm 2 1 xx.yy.zz.245 18 mins ago 1x GCP([Spot]{'L4': 1}) READY us-east4 vllm 2 1 xx.yy.zz.245 18 mins ago 1x GCP([Spot]{'L4': 1}) READY us-east4
``` ```
:::{raw} html
</details> </details>
:::
After the service is READY, you can find a single endpoint for the service and access the service with the endpoint: After the service is READY, you can find a single endpoint for the service and access the service with the endpoint:
...@@ -223,10 +220,8 @@ service: ...@@ -223,10 +220,8 @@ service:
This will scale the service up to when the QPS exceeds 2 for each replica. This will scale the service up to when the QPS exceeds 2 for each replica.
:::{raw} html
<details> <details>
<summary>Click to see the full recipe YAML</summary> <summary>Click to see the full recipe YAML</summary>
:::
```yaml ```yaml
service: service:
...@@ -275,9 +270,7 @@ run: | ...@@ -275,9 +270,7 @@ run: |
2>&1 | tee api_server.log 2>&1 | tee api_server.log
``` ```
:::{raw} html
</details> </details>
:::
To update the service with the new config: To update the service with the new config:
...@@ -295,10 +288,8 @@ sky serve down vllm ...@@ -295,10 +288,8 @@ sky serve down vllm
It is also possible to access the Llama-3 service with a separate GUI frontend, so the user requests send to the GUI will be load-balanced across replicas. It is also possible to access the Llama-3 service with a separate GUI frontend, so the user requests send to the GUI will be load-balanced across replicas.
:::{raw} html
<details> <details>
<summary>Click to see the full GUI YAML</summary> <summary>Click to see the full GUI YAML</summary>
:::
```yaml ```yaml
envs: envs:
...@@ -328,14 +319,14 @@ run: | ...@@ -328,14 +319,14 @@ run: |
--stop-token-ids 128009,128001 | tee ~/gradio.log --stop-token-ids 128009,128001 | tee ~/gradio.log
``` ```
:::{raw} html
</details> </details>
:::
1. Start the chat web UI: 1. Start the chat web UI:
```console ```console
sky launch -c gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint vllm) sky launch \
-c gui ./gui.yaml \
--env ENDPOINT=$(sky serve status --endpoint vllm)
``` ```
2. Then, we can access the GUI at the returned gradio link: 2. Then, we can access the GUI at the returned gradio link:
......
(deployment-streamlit)= ---
title: Streamlit
# Streamlit ---
[](){ #deployment-streamlit }
[Streamlit](https://github.com/streamlit/streamlit) lets you transform Python scripts into interactive web apps in minutes, instead of weeks. Build dashboards, generate reports, or create chat apps. [Streamlit](https://github.com/streamlit/streamlit) lets you transform Python scripts into interactive web apps in minutes, instead of weeks. Build dashboards, generate reports, or create chat apps.
...@@ -32,11 +33,11 @@ pip install streamlit openai ...@@ -32,11 +33,11 @@ pip install streamlit openai
streamlit run streamlit_openai_chatbot_webserver.py streamlit run streamlit_openai_chatbot_webserver.py
# or specify the VLLM_API_BASE or VLLM_API_KEY # or specify the VLLM_API_BASE or VLLM_API_KEY
VLLM_API_BASE="http://vllm-server-host:vllm-server-port/v1" streamlit run streamlit_openai_chatbot_webserver.py VLLM_API_BASE="http://vllm-server-host:vllm-server-port/v1" \
streamlit run streamlit_openai_chatbot_webserver.py
# start with debug mode to view more details # start with debug mode to view more details
streamlit run streamlit_openai_chatbot_webserver.py --logger.level=debug streamlit run streamlit_openai_chatbot_webserver.py --logger.level=debug
``` ```
:::{image} /assets/deployment/streamlit-chat.png ![](../../assets/deployment/streamlit-chat.png)
:::
(deployment-triton)= ---
title: NVIDIA Triton
# NVIDIA Triton ---
[](){ #deployment-triton }
The [Triton Inference Server](https://github.com/triton-inference-server) hosts a tutorial demonstrating how to quickly deploy a simple [facebook/opt-125m](https://huggingface.co/facebook/opt-125m) model using vLLM. Please see [Deploying a vLLM model in Triton](https://github.com/triton-inference-server/tutorials/blob/main/Quick_Deploy/vLLM/README.md#deploying-a-vllm-model-in-triton) for more details. The [Triton Inference Server](https://github.com/triton-inference-server) hosts a tutorial demonstrating how to quickly deploy a simple [facebook/opt-125m](https://huggingface.co/facebook/opt-125m) model using vLLM. Please see [Deploying a vLLM model in Triton](https://github.com/triton-inference-server/tutorials/blob/main/Quick_Deploy/vLLM/README.md#deploying-a-vllm-model-in-triton) for more details.
(deployment-kserve)= ---
title: KServe
# KServe ---
[](){ #deployment-kserve }
vLLM can be deployed with [KServe](https://github.com/kserve/kserve) on Kubernetes for highly scalable distributed model serving. vLLM can be deployed with [KServe](https://github.com/kserve/kserve) on Kubernetes for highly scalable distributed model serving.
......
(deployment-kubeai)= ---
title: KubeAI
# KubeAI ---
[](){ #deployment-kubeai }
[KubeAI](https://github.com/substratusai/kubeai) is a Kubernetes operator that enables you to deploy and manage AI models on Kubernetes. It provides a simple and scalable way to deploy vLLM in production. Functionality such as scale-from-zero, load based autoscaling, model caching, and much more is provided out of the box with zero external dependencies. [KubeAI](https://github.com/substratusai/kubeai) is a Kubernetes operator that enables you to deploy and manage AI models on Kubernetes. It provides a simple and scalable way to deploy vLLM in production. Functionality such as scale-from-zero, load based autoscaling, model caching, and much more is provided out of the box with zero external dependencies.
......
(deployment-llamastack)= ---
title: Llama Stack
# Llama Stack ---
[](){ #deployment-llamastack }
vLLM is also available via [Llama Stack](https://github.com/meta-llama/llama-stack) . vLLM is also available via [Llama Stack](https://github.com/meta-llama/llama-stack) .
......
(deployment-llmaz)= ---
title: llmaz
# llmaz ---
[](){ #deployment-llmaz }
[llmaz](https://github.com/InftyAI/llmaz) is an easy-to-use and advanced inference platform for large language models on Kubernetes, aimed for production use. It uses vLLM as the default model serving backend. [llmaz](https://github.com/InftyAI/llmaz) is an easy-to-use and advanced inference platform for large language models on Kubernetes, aimed for production use. It uses vLLM as the default model serving backend.
......
(deployment-production-stack)= ---
title: Production stack
# Production stack ---
[](){ #deployment-production-stack }
Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using the [vLLM production stack](https://github.com/vllm-project/production-stack). Born out of a Berkeley-UChicago collaboration, [vLLM production stack](https://github.com/vllm-project/production-stack) is an officially released, production-optimized codebase under the [vLLM project](https://github.com/vllm-project), designed for LLM deployment with: Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using the [vLLM production stack](https://github.com/vllm-project/production-stack). Born out of a Berkeley-UChicago collaboration, [vLLM production stack](https://github.com/vllm-project/production-stack) is an officially released, production-optimized codebase under the [vLLM project](https://github.com/vllm-project), designed for LLM deployment with:
...@@ -114,7 +115,7 @@ To remove the deployment, run: ...@@ -114,7 +115,7 @@ To remove the deployment, run:
sudo helm uninstall vllm sudo helm uninstall vllm
``` ```
------ ---
### (Advanced) Configuring vLLM production stack ### (Advanced) Configuring vLLM production stack
......
(deployment-k8s)= ---
title: Using Kubernetes
# Using Kubernetes ---
[](){ #deployment-k8s }
Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using native Kubernetes. Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using native Kubernetes.
...@@ -8,6 +9,7 @@ Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine le ...@@ -8,6 +9,7 @@ Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine le
* [Deployment with GPUs](#deployment-with-gpus) * [Deployment with GPUs](#deployment-with-gpus)
Alternatively, you can deploy vLLM to Kubernetes using any of the following: Alternatively, you can deploy vLLM to Kubernetes using any of the following:
* [Helm](frameworks/helm.md) * [Helm](frameworks/helm.md)
* [InftyAI/llmaz](integrations/llmaz.md) * [InftyAI/llmaz](integrations/llmaz.md)
* [KServe](integrations/kserve.md) * [KServe](integrations/kserve.md)
...@@ -19,9 +21,8 @@ Alternatively, you can deploy vLLM to Kubernetes using any of the following: ...@@ -19,9 +21,8 @@ Alternatively, you can deploy vLLM to Kubernetes using any of the following:
## Deployment with CPUs ## Deployment with CPUs
:::{note} !!! note
The use of CPUs here is for demonstration and testing purposes only and its performance will not be on par with GPUs. The use of CPUs here is for demonstration and testing purposes only and its performance will not be on par with GPUs.
:::
First, create a Kubernetes PVC and Secret for downloading and storing Hugging Face model: First, create a Kubernetes PVC and Secret for downloading and storing Hugging Face model:
......
(nginxloadbalancer)= ---
title: Using Nginx
# Using Nginx ---
[](){ #nginxloadbalancer }
This document shows how to launch multiple vLLM serving containers and use Nginx to act as a load balancer between the servers. This document shows how to launch multiple vLLM serving containers and use Nginx to act as a load balancer between the servers.
Table of contents: Table of contents:
1. [Build Nginx Container](#nginxloadbalancer-nginx-build) 1. [Build Nginx Container][nginxloadbalancer-nginx-build]
2. [Create Simple Nginx Config file](#nginxloadbalancer-nginx-conf) 2. [Create Simple Nginx Config file][nginxloadbalancer-nginx-conf]
3. [Build vLLM Container](#nginxloadbalancer-nginx-vllm-container) 3. [Build vLLM Container][nginxloadbalancer-nginx-vllm-container]
4. [Create Docker Network](#nginxloadbalancer-nginx-docker-network) 4. [Create Docker Network][nginxloadbalancer-nginx-docker-network]
5. [Launch vLLM Containers](#nginxloadbalancer-nginx-launch-container) 5. [Launch vLLM Containers][nginxloadbalancer-nginx-launch-container]
6. [Launch Nginx](#nginxloadbalancer-nginx-launch-nginx) 6. [Launch Nginx][nginxloadbalancer-nginx-launch-nginx]
7. [Verify That vLLM Servers Are Ready](#nginxloadbalancer-nginx-verify-nginx) 7. [Verify That vLLM Servers Are Ready][nginxloadbalancer-nginx-verify-nginx]
(nginxloadbalancer-nginx-build)= [](){ #nginxloadbalancer-nginx-build }
## Build Nginx Container ## Build Nginx Container
...@@ -39,7 +40,7 @@ Build the container: ...@@ -39,7 +40,7 @@ Build the container:
docker build . -f Dockerfile.nginx --tag nginx-lb docker build . -f Dockerfile.nginx --tag nginx-lb
``` ```
(nginxloadbalancer-nginx-conf)= [](){ #nginxloadbalancer-nginx-conf }
## Create Simple Nginx Config file ## Create Simple Nginx Config file
...@@ -63,7 +64,7 @@ server { ...@@ -63,7 +64,7 @@ server {
} }
``` ```
(nginxloadbalancer-nginx-vllm-container)= [](){ #nginxloadbalancer-nginx-vllm-container }
## Build vLLM Container ## Build vLLM Container
...@@ -76,10 +77,14 @@ If you are behind proxy, you can pass the proxy settings to the docker build com ...@@ -76,10 +77,14 @@ If you are behind proxy, you can pass the proxy settings to the docker build com
```console ```console
cd $vllm_root cd $vllm_root
docker build -f docker/Dockerfile . --tag vllm --build-arg http_proxy=$http_proxy --build-arg https_proxy=$https_proxy docker build \
-f docker/Dockerfile . \
--tag vllm \
--build-arg http_proxy=$http_proxy \
--build-arg https_proxy=$https_proxy
``` ```
(nginxloadbalancer-nginx-docker-network)= [](){ #nginxloadbalancer-nginx-docker-network }
## Create Docker Network ## Create Docker Network
...@@ -87,7 +92,7 @@ docker build -f docker/Dockerfile . --tag vllm --build-arg http_proxy=$http_prox ...@@ -87,7 +92,7 @@ docker build -f docker/Dockerfile . --tag vllm --build-arg http_proxy=$http_prox
docker network create vllm_nginx docker network create vllm_nginx
``` ```
(nginxloadbalancer-nginx-launch-container)= [](){ #nginxloadbalancer-nginx-launch-container }
## Launch vLLM Containers ## Launch vLLM Containers
...@@ -101,23 +106,45 @@ Notes: ...@@ -101,23 +106,45 @@ Notes:
```console ```console
mkdir -p ~/.cache/huggingface/hub/ mkdir -p ~/.cache/huggingface/hub/
hf_cache_dir=~/.cache/huggingface/ hf_cache_dir=~/.cache/huggingface/
docker run -itd --ipc host --network vllm_nginx --gpus device=0 --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8081:8000 --name vllm0 vllm --model meta-llama/Llama-2-7b-chat-hf docker run \
docker run -itd --ipc host --network vllm_nginx --gpus device=1 --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8082:8000 --name vllm1 vllm --model meta-llama/Llama-2-7b-chat-hf -itd \
--ipc host \
--network vllm_nginx \
--gpus device=0 \
--shm-size=10.24gb \
-v $hf_cache_dir:/root/.cache/huggingface/ \
-p 8081:8000 \
--name vllm0 vllm \
--model meta-llama/Llama-2-7b-chat-hf
docker run \
-itd \
--ipc host \
--network vllm_nginx \
--gpus device=1 \
--shm-size=10.24gb \
-v $hf_cache_dir:/root/.cache/huggingface/ \
-p 8082:8000 \
--name vllm1 vllm \
--model meta-llama/Llama-2-7b-chat-hf
``` ```
:::{note} !!! note
If you are behind proxy, you can pass the proxy settings to the docker run command via `-e http_proxy=$http_proxy -e https_proxy=$https_proxy`. If you are behind proxy, you can pass the proxy settings to the docker run command via `-e http_proxy=$http_proxy -e https_proxy=$https_proxy`.
:::
(nginxloadbalancer-nginx-launch-nginx)= [](){ #nginxloadbalancer-nginx-launch-nginx }
## Launch Nginx ## Launch Nginx
```console ```console
docker run -itd -p 8000:80 --network vllm_nginx -v ./nginx_conf/:/etc/nginx/conf.d/ --name nginx-lb nginx-lb:latest docker run \
-itd \
-p 8000:80 \
--network vllm_nginx \
-v ./nginx_conf/:/etc/nginx/conf.d/ \
--name nginx-lb nginx-lb:latest
``` ```
(nginxloadbalancer-nginx-verify-nginx)= [](){ #nginxloadbalancer-nginx-verify-nginx }
## Verify That vLLM Servers Are Ready ## Verify That vLLM Servers Are Ready
......
(arch-overview)= ---
title: Architecture Overview
# Architecture Overview ---
[](){ #arch-overview }
This document provides an overview of the vLLM architecture. This document provides an overview of the vLLM architecture.
:::{contents} Table of Contents [TOC]
:depth: 2
:local: true
:::
## Entrypoints ## Entrypoints
vLLM provides a number of entrypoints for interacting with the system. The vLLM provides a number of entrypoints for interacting with the system. The
following diagram shows the relationship between them. following diagram shows the relationship between them.
:::{image} /assets/design/arch_overview/entrypoints.excalidraw.png ![Entrypoints Diagram](../assets/design/arch_overview/entrypoints.excalidraw.png)
:alt: Entrypoints Diagram
:::
### LLM Class ### LLM Class
...@@ -77,16 +73,14 @@ python -m vllm.entrypoints.openai.api_server --model <model> ...@@ -77,16 +73,14 @@ python -m vllm.entrypoints.openai.api_server --model <model>
That code can be found in <gh-file:vllm/entrypoints/openai/api_server.py>. That code can be found in <gh-file:vllm/entrypoints/openai/api_server.py>.
More details on the API server can be found in the [OpenAI-Compatible Server](#openai-compatible-server) document. More details on the API server can be found in the [OpenAI-Compatible Server][openai-compatible-server] document.
## LLM Engine ## LLM Engine
The `LLMEngine` and `AsyncLLMEngine` classes are central to the functioning of The `LLMEngine` and `AsyncLLMEngine` classes are central to the functioning of
the vLLM system, handling model inference and asynchronous request processing. the vLLM system, handling model inference and asynchronous request processing.
:::{image} /assets/design/arch_overview/llm_engine.excalidraw.png ![LLMEngine Diagram](../assets/design/arch_overview/llm_engine.excalidraw.png)
:alt: LLMEngine Diagram
:::
### LLMEngine ### LLMEngine
...@@ -137,18 +131,16 @@ input tensors and capturing cudagraphs. ...@@ -137,18 +131,16 @@ input tensors and capturing cudagraphs.
## Model ## Model
Every model runner object has one model object, which is the actual Every model runner object has one model object, which is the actual
`torch.nn.Module` instance. See [huggingface_integration](#huggingface-integration) for how various `torch.nn.Module` instance. See [huggingface_integration][huggingface-integration] for how various
configurations affect the class we ultimately get. configurations affect the class we ultimately get.
## Class Hierarchy ## Class Hierarchy
The following figure shows the class hierarchy of vLLM: The following figure shows the class hierarchy of vLLM:
> :::{figure} /assets/design/hierarchy.png > <figure markdown="span">
> :align: center > ![](../assets/design/hierarchy.png){ align="center" alt="query" width="100%" }
> :alt: query > </figure>
> :width: 100%
> :::
There are several important design choices behind this class hierarchy: There are several important design choices behind this class hierarchy:
...@@ -178,17 +170,17 @@ of a vision model and a language model. By making the constructor uniform, we ...@@ -178,17 +170,17 @@ of a vision model and a language model. By making the constructor uniform, we
can easily create a vision model and a language model and compose them into a can easily create a vision model and a language model and compose them into a
vision-language model. vision-language model.
:::{note} !!! note
To support this change, all vLLM models' signatures have been updated to: To support this change, all vLLM models' signatures have been updated to:
```python ```python
def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
``` ```
To avoid accidentally passing incorrect arguments, the constructor is now keyword-only. This ensures that the constructor will raise an error if old configurations are passed. vLLM developers have already made this change for all models within vLLM. For out-of-tree registered models, developers need to update their models, for example by adding shim code to adapt the old constructor signature to the new one: To avoid accidentally passing incorrect arguments, the constructor is now keyword-only. This ensures that the constructor will raise an error if old configurations are passed. vLLM developers have already made this change for all models within vLLM. For out-of-tree registered models, developers need to update their models, for example by adding shim code to adapt the old constructor signature to the new one:
```python ```python
class MyOldModel(nn.Module): class MyOldModel(nn.Module):
def __init__( def __init__(
self, self,
config, config,
...@@ -199,8 +191,8 @@ class MyOldModel(nn.Module): ...@@ -199,8 +191,8 @@ class MyOldModel(nn.Module):
) -> None: ) -> None:
... ...
from vllm.config import VllmConfig from vllm.config import VllmConfig
class MyNewModel(MyOldModel): class MyNewModel(MyOldModel):
def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""): def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
config = vllm_config.model_config.hf_config config = vllm_config.model_config.hf_config
cache_config = vllm_config.cache_config cache_config = vllm_config.cache_config
...@@ -208,14 +200,13 @@ class MyNewModel(MyOldModel): ...@@ -208,14 +200,13 @@ class MyNewModel(MyOldModel):
lora_config = vllm_config.lora_config lora_config = vllm_config.lora_config
super().__init__(config, cache_config, quant_config, lora_config, prefix) super().__init__(config, cache_config, quant_config, lora_config, prefix)
if __version__ >= "0.6.4": if __version__ >= "0.6.4":
MyModel = MyNewModel MyModel = MyNewModel
else: else:
MyModel = MyOldModel MyModel = MyOldModel
``` ```
This way, the model can work with both old and new versions of vLLM. This way, the model can work with both old and new versions of vLLM.
:::
3\. **Sharding and Quantization at Initialization**: Certain features require 3\. **Sharding and Quantization at Initialization**: Certain features require
changing the model weights. For example, tensor parallelism needs to shard the changing the model weights. For example, tensor parallelism needs to shard the
......
(design-automatic-prefix-caching)= ---
title: Automatic Prefix Caching
# Automatic Prefix Caching ---
[](){ #design-automatic-prefix-caching }
The core idea of [PagedAttention](https://blog.vllm.ai/2023/06/20/vllm.html) is to partition the KV cache of each request into KV Blocks. Each block contains the attention keys and values for a fixed number of tokens. The PagedAttention algorithm allows these blocks to be stored in non-contiguous physical memory so that we can eliminate memory fragmentation by allocating the memory on demand. The core idea of [PagedAttention](https://blog.vllm.ai/2023/06/20/vllm.html) is to partition the KV cache of each request into KV Blocks. Each block contains the attention keys and values for a fixed number of tokens. The PagedAttention algorithm allows these blocks to be stored in non-contiguous physical memory so that we can eliminate memory fragmentation by allocating the memory on demand.
......
(huggingface-integration)= ---
title: Integration with HuggingFace
# Integration with HuggingFace ---
[](){ #huggingface-integration }
This document describes how vLLM integrates with HuggingFace libraries. We will explain step by step what happens under the hood when we run `vllm serve`. This document describes how vLLM integrates with HuggingFace libraries. We will explain step by step what happens under the hood when we run `vllm serve`.
Let's say we want to serve the popular QWen model by running `vllm serve Qwen/Qwen2-7B`. Let's say we want to serve the popular QWen model by running `vllm serve Qwen/Qwen2-7B`.
1. The `model` argument is `Qwen/Qwen2-7B`. vLLM determines whether this model exists by checking for the corresponding config file `config.json`. See this [code snippet](https://github.com/vllm-project/vllm/blob/10b67d865d92e376956345becafc249d4c3c0ab7/vllm/transformers_utils/config.py#L162-L182) for the implementation. Within this process: 1. The `model` argument is `Qwen/Qwen2-7B`. vLLM determines whether this model exists by checking for the corresponding config file `config.json`. See this [code snippet](https://github.com/vllm-project/vllm/blob/10b67d865d92e376956345becafc249d4c3c0ab7/vllm/transformers_utils/config.py#L162-L182) for the implementation. Within this process:
- If the `model` argument corresponds to an existing local path, vLLM will load the config file directly from this path. - If the `model` argument corresponds to an existing local path, vLLM will load the config file directly from this path.
- If the `model` argument is a HuggingFace model ID consisting of a username and model name, vLLM will first try to use the config file from the HuggingFace local cache, using the `model` argument as the model name and the `--revision` argument as the revision. See [their website](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hfhome) for more information on how the HuggingFace cache works. - If the `model` argument is a HuggingFace model ID consisting of a username and model name, vLLM will first try to use the config file from the HuggingFace local cache, using the `model` argument as the model name and the `--revision` argument as the revision. See [their website](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hfhome) for more information on how the HuggingFace cache works.
- If the `model` argument is a HuggingFace model ID but it is not found in the cache, vLLM will download the config file from the HuggingFace model hub. Refer to [this function](https://github.com/vllm-project/vllm/blob/10b67d865d92e376956345becafc249d4c3c0ab7/vllm/transformers_utils/config.py#L91) for the implementation. The input arguments include the `model` argument as the model name, the `--revision` argument as the revision, and the environment variable `HF_TOKEN` as the token to access the model hub. In our case, vLLM will download the [config.json](https://huggingface.co/Qwen/Qwen2-7B/blob/main/config.json) file. - If the `model` argument is a HuggingFace model ID but it is not found in the cache, vLLM will download the config file from the HuggingFace model hub. Refer to [this function](https://github.com/vllm-project/vllm/blob/10b67d865d92e376956345becafc249d4c3c0ab7/vllm/transformers_utils/config.py#L91) for the implementation. The input arguments include the `model` argument as the model name, the `--revision` argument as the revision, and the environment variable `HF_TOKEN` as the token to access the model hub. In our case, vLLM will download the [config.json](https://huggingface.co/Qwen/Qwen2-7B/blob/main/config.json) file.
...@@ -15,7 +15,6 @@ Let's say we want to serve the popular QWen model by running `vllm serve Qwen/Qw ...@@ -15,7 +15,6 @@ Let's say we want to serve the popular QWen model by running `vllm serve Qwen/Qw
2. After confirming the existence of the model, vLLM loads its config file and converts it into a dictionary. See this [code snippet](https://github.com/vllm-project/vllm/blob/10b67d865d92e376956345becafc249d4c3c0ab7/vllm/transformers_utils/config.py#L185-L186) for the implementation. 2. After confirming the existence of the model, vLLM loads its config file and converts it into a dictionary. See this [code snippet](https://github.com/vllm-project/vllm/blob/10b67d865d92e376956345becafc249d4c3c0ab7/vllm/transformers_utils/config.py#L185-L186) for the implementation.
3. Next, vLLM [inspects](https://github.com/vllm-project/vllm/blob/10b67d865d92e376956345becafc249d4c3c0ab7/vllm/transformers_utils/config.py#L189) the `model_type` field in the config dictionary to [generate](https://github.com/vllm-project/vllm/blob/10b67d865d92e376956345becafc249d4c3c0ab7/vllm/transformers_utils/config.py#L190-L216) the config object to use. There are some `model_type` values that vLLM directly supports; see [here](https://github.com/vllm-project/vllm/blob/10b67d865d92e376956345becafc249d4c3c0ab7/vllm/transformers_utils/config.py#L48) for the list. If the `model_type` is not in the list, vLLM will use [AutoConfig.from_pretrained](https://huggingface.co/docs/transformers/en/model_doc/auto#transformers.AutoConfig.from_pretrained) to load the config class, with `model`, `--revision`, and `--trust_remote_code` as the arguments. Please note that: 3. Next, vLLM [inspects](https://github.com/vllm-project/vllm/blob/10b67d865d92e376956345becafc249d4c3c0ab7/vllm/transformers_utils/config.py#L189) the `model_type` field in the config dictionary to [generate](https://github.com/vllm-project/vllm/blob/10b67d865d92e376956345becafc249d4c3c0ab7/vllm/transformers_utils/config.py#L190-L216) the config object to use. There are some `model_type` values that vLLM directly supports; see [here](https://github.com/vllm-project/vllm/blob/10b67d865d92e376956345becafc249d4c3c0ab7/vllm/transformers_utils/config.py#L48) for the list. If the `model_type` is not in the list, vLLM will use [AutoConfig.from_pretrained](https://huggingface.co/docs/transformers/en/model_doc/auto#transformers.AutoConfig.from_pretrained) to load the config class, with `model`, `--revision`, and `--trust_remote_code` as the arguments. Please note that:
- HuggingFace also has its own logic to determine the config class to use. It will again use the `model_type` field to search for the class name in the transformers library; see [here](https://github.com/huggingface/transformers/tree/main/src/transformers/models) for the list of supported models. If the `model_type` is not found, HuggingFace will use the `auto_map` field from the config JSON file to determine the class name. Specifically, it is the `AutoConfig` field under `auto_map`. See [DeepSeek](https://huggingface.co/deepseek-ai/DeepSeek-V2.5/blob/main/config.json) for an example. - HuggingFace also has its own logic to determine the config class to use. It will again use the `model_type` field to search for the class name in the transformers library; see [here](https://github.com/huggingface/transformers/tree/main/src/transformers/models) for the list of supported models. If the `model_type` is not found, HuggingFace will use the `auto_map` field from the config JSON file to determine the class name. Specifically, it is the `AutoConfig` field under `auto_map`. See [DeepSeek](https://huggingface.co/deepseek-ai/DeepSeek-V2.5/blob/main/config.json) for an example.
- The `AutoConfig` field under `auto_map` points to a module path in the model's repository. To create the config class, HuggingFace will import the module and use the `from_pretrained` method to load the config class. This can generally cause arbitrary code execution, so it is only executed when `--trust_remote_code` is enabled. - The `AutoConfig` field under `auto_map` points to a module path in the model's repository. To create the config class, HuggingFace will import the module and use the `from_pretrained` method to load the config class. This can generally cause arbitrary code execution, so it is only executed when `--trust_remote_code` is enabled.
...@@ -28,7 +27,6 @@ Beyond that, there are two more things vLLM depends on HuggingFace for. ...@@ -28,7 +27,6 @@ Beyond that, there are two more things vLLM depends on HuggingFace for.
1. **Tokenizer**: vLLM uses the tokenizer from HuggingFace to tokenize the input text. The tokenizer is loaded using [AutoTokenizer.from_pretrained](https://huggingface.co/docs/transformers/en/model_doc/auto#transformers.AutoTokenizer.from_pretrained) with the `model` argument as the model name and the `--revision` argument as the revision. It is also possible to use a tokenizer from another model by specifying the `--tokenizer` argument in the `vllm serve` command. Other relevant arguments are `--tokenizer-revision` and `--tokenizer-mode`. Please check HuggingFace's documentation for the meaning of these arguments. This part of the logic can be found in the [get_tokenizer](https://github.com/vllm-project/vllm/blob/127c07480ecea15e4c2990820c457807ff78a057/vllm/transformers_utils/tokenizer.py#L87) function. After obtaining the tokenizer, notably, vLLM will cache some expensive attributes of the tokenizer in [get_cached_tokenizer](https://github.com/vllm-project/vllm/blob/127c07480ecea15e4c2990820c457807ff78a057/vllm/transformers_utils/tokenizer.py#L24). 1. **Tokenizer**: vLLM uses the tokenizer from HuggingFace to tokenize the input text. The tokenizer is loaded using [AutoTokenizer.from_pretrained](https://huggingface.co/docs/transformers/en/model_doc/auto#transformers.AutoTokenizer.from_pretrained) with the `model` argument as the model name and the `--revision` argument as the revision. It is also possible to use a tokenizer from another model by specifying the `--tokenizer` argument in the `vllm serve` command. Other relevant arguments are `--tokenizer-revision` and `--tokenizer-mode`. Please check HuggingFace's documentation for the meaning of these arguments. This part of the logic can be found in the [get_tokenizer](https://github.com/vllm-project/vllm/blob/127c07480ecea15e4c2990820c457807ff78a057/vllm/transformers_utils/tokenizer.py#L87) function. After obtaining the tokenizer, notably, vLLM will cache some expensive attributes of the tokenizer in [get_cached_tokenizer](https://github.com/vllm-project/vllm/blob/127c07480ecea15e4c2990820c457807ff78a057/vllm/transformers_utils/tokenizer.py#L24).
2. **Model weight**: vLLM downloads the model weight from the HuggingFace model hub using the `model` argument as the model name and the `--revision` argument as the revision. vLLM provides the argument `--load-format` to control what files to download from the model hub. By default, it will try to load the weights in the safetensors format and fall back to the PyTorch bin format if the safetensors format is not available. We can also pass `--load-format dummy` to skip downloading the weights. 2. **Model weight**: vLLM downloads the model weight from the HuggingFace model hub using the `model` argument as the model name and the `--revision` argument as the revision. vLLM provides the argument `--load-format` to control what files to download from the model hub. By default, it will try to load the weights in the safetensors format and fall back to the PyTorch bin format if the safetensors format is not available. We can also pass `--load-format dummy` to skip downloading the weights.
- It is recommended to use the safetensors format, as it is efficient for loading in distributed inference and also safe from arbitrary code execution. See the [documentation](https://huggingface.co/docs/safetensors/en/index) for more information on the safetensors format. This part of the logic can be found [here](https://github.com/vllm-project/vllm/blob/10b67d865d92e376956345becafc249d4c3c0ab7/vllm/model_executor/model_loader/loader.py#L385). Please note that: - It is recommended to use the safetensors format, as it is efficient for loading in distributed inference and also safe from arbitrary code execution. See the [documentation](https://huggingface.co/docs/safetensors/en/index) for more information on the safetensors format. This part of the logic can be found [here](https://github.com/vllm-project/vllm/blob/10b67d865d92e376956345becafc249d4c3c0ab7/vllm/model_executor/model_loader/loader.py#L385). Please note that:
This completes the integration between vLLM and HuggingFace. This completes the integration between vLLM and HuggingFace.
......
---
title: vLLM Paged Attention
---
[](){ #design-paged-attention }
Currently, vLLM utilizes its own implementation of a multi-head query
attention kernel (`csrc/attention/attention_kernels.cu`).
This kernel is designed to be compatible with
vLLM's paged KV caches, where the key and value cache are stored in
separate blocks (note that this block concept differs from the GPU
thread block. So in a later document, I will refer to vLLM paged
attention block as "block", while refer to GPU thread block as
"thread block").
To achieve high performance, this kernel relies on a specially
designed memory layout and access method, specifically when threads
read data from global memory to shared memory. The purpose of this
document is to provide a high-level explanation of the kernel
implementation step by step, aiding those who wish to learn about the
vLLM multi-head query attention kernel. After going through this
document, users will likely have a better understanding and feel easier
to follow the actual implementation.
Please note that this document may not cover all details, such as how
to calculate the correct index for the corresponding data or the dot
multiplication implementation. However, after reading this document
and becoming familiar with the high-level logic flow, it should be
easier for you to read the actual code and understand the details.
## Inputs
The kernel function takes a list of arguments for the current thread
to perform its assigned work. The three most important arguments are
the input pointers `q`, `k_cache`, and `v_cache`, which point
to query, key, and value data on global memory that need to be read
and processed. The output pointer `out` points to global memory
where the result should be written. These four pointers actually
refer to multi-dimensional arrays, but each thread only accesses the
portion of data assigned to it. I have omitted all other runtime
parameters here for simplicity.
```cpp
template<typename scalar_t, int HEAD_SIZE, int BLOCK_SIZE, int NUM_THREADS, int PARTITION_SIZE = 0>
__device__ void paged_attention_kernel(
... // Other side args.
const scalar_t* __restrict__ out, // [num_seqs, num_heads, max_num_partitions, head_size]
const scalar_t* __restrict__ q, // [num_seqs, num_heads, head_size]
const scalar_t* __restrict__ k_cache, // [num_blocks, num_kv_heads, head_size/x, block_size, x]
const scalar_t* __restrict__ v_cache, // [num_blocks, num_kv_heads, head_size, block_size]
... // Other side args.
)
```
There are also a list of template arguments above the function
signature that are determined during compilation time. `scalar_t`
represents the data type of the query, key, and value data elements,
such as FP16. `HEAD_SIZE` indicates the number of elements in each
head. `BLOCK_SIZE` refers to the number of tokens in each block.
`NUM_THREADS` denotes the number of threads in each thread block.
`PARTITION_SIZE` represents the number of tensor parallel GPUs (For
simplicity, we assume this is 0 and tensor parallel is disabled).
With these arguments, we need to perform a sequence of preparations.
This includes calculating the current head index, block index, and
other necessary variables. However, for now, we can ignore these
preparations and proceed directly to the actual calculations. It will
be easier to understand them once we grasp the entire flow.
## Concepts
Just before we dive into the calculation flow, I want to describe a
few concepts that are needed for later sections. However, you may
skip this section and return later if you encounter any confusing
terminologies.
- **Sequence**: A sequence represents a client request. For example,
the data pointed to by `q` has a shape of
`[num_seqs, num_heads, head_size]`. That represents there are total
`num_seqs` of query sequence data are pointed by `q`. Since this
kernel is a single query attention kernel, each sequence only has one
query token. Hence, the `num_seqs` equals the total number of tokens
that are processed in the batch.
- **Context**: The context consists of the generated tokens from the
sequence. For instance, `["What", "is", "your"]` are the context
tokens, and the input query token is `"name"`. The model might
generate the token `"?"`.
- **Vec**: The vec is a list of elements that are fetched and
calculated together. For query and key data, the vec size
(`VEC_SIZE`) is determined so that each thread group can fetch and
calculate 16 bytes of data at a time. For value data, the vec size
(`V_VEC_SIZE`) is determined so that each thread can fetch and
calculate 16 bytes of data at a time. For example, if the
`scalar_t` is FP16 (2 bytes) and `THREAD_GROUP_SIZE` is 2, the
`VEC_SIZE` will be 4, while the `V_VEC_SIZE` will be 8.
- **Thread group**: The thread group is a small group of
threads(`THREAD_GROUP_SIZE`) that fetches and calculates one
query token and one key token at a time. Each thread handles only a
portion of the token data. The total number of elements processed by
one thread group is referred as `x`. For example, if the thread
group contains 2 threads and the head size is 8, then thread 0
handles the query and key elements at index 0, 2, 4, 6, while thread
1 handles the elements at index 1, 3, 5, 7.
- **Block**: The key and value cache data in vLLM are split into
blocks. Each block stores data for a fixed number(`BLOCK_SIZE`)
of tokens at one head. Each block may contain only a portion of the
whole context tokens. For example, if the block size is 16 and the
head size is 128, then for one head, one block can store 16 * 128 =
2048 elements.
- **Warp**: A warp is a group of 32 threads(`WARP_SIZE`) that
execute simultaneously on a stream multiprocessor (SM). In this
kernel, each warp processes the calculation between one query token
and key tokens of one entire block at a time (it may process multiple
blocks in multiple iterations). For example, if there are 4 warps and
6 blocks for one context, the assignment would be like warp 0 handles
the 0th, 4th blocks, warp 1 handles the 1st, 5th blocks, warp 2
handles the 2nd block and warp 3 handles the 3rd block.
- **Thread block**: A thread block is a group of
threads(`NUM_THREADS`) that can access the same shared memory.
Each thread block contains multiple warps(`NUM_WARPS`), and in
this kernel, each thread block processes the calculation between one
query token and key tokens of a whole context.
- **Grid**: A grid is a collection of thread blocks and defines the
shape of the collection. In this kernel, the shape is
`(num_heads, num_seqs, max_num_partitions)`. Therefore, each thread
block only handles the calculation for one head, one sequence, and
one partition.
## Query
This section will introduce how query data is stored in memory and
fetched by each thread. As mentioned above, each thread group fetches
one query token data, while each thread itself only handles a part of
one query token data. Within each warp, every thread group will fetch
the same query token data, but will multiply it with different key
token data.
```cpp
const scalar_t* q_ptr = q + seq_idx * q_stride + head_idx * HEAD_SIZE;
```
<figure markdown="span">
![](../../assets/kernel/query.png){ align="center" alt="query" width="70%" }
</figure>
Each thread defines its own `q_ptr` which points to the assigned
query token data on global memory. For example, if `VEC_SIZE` is 4
and `HEAD_SIZE` is 128, the `q_ptr` points to data that contains
total of 128 elements divided into 128 / 4 = 32 vecs.
<figure markdown="span">
![](../../assets/kernel/q_vecs.png){ align="center" alt="q_vecs" width="70%" }
</figure>
```cpp
__shared__ Q_vec q_vecs[THREAD_GROUP_SIZE][NUM_VECS_PER_THREAD];
```
Next, we need to read the global memory data pointed to by `q_ptr`
into shared memory as `q_vecs`. It is important to note that each
vecs is assigned to a different row. For example, if the
`THREAD_GROUP_SIZE` is 2, thread 0 will handle the 0th row vecs,
while thread 1 handles the 1st row vecs. By reading the query data in
this way, neighboring threads like thread 0 and thread 1 can read
neighbor memory, achieving the memory coalescing to improve
performance.
## Key
Similar to the "Query" section, this section introduces memory layout
and assignment for keys. While each thread group only handle one
query token one kernel run, it may handle multiple key tokens across
multiple iterations. Meanwhile, each warp will process multiple blocks
of key tokens in multiple iterations, ensuring that all context
tokens are processed by the entire thread group after the kernel run.
In this context, "handle" refers to performing the dot multiplication
between query data and key data.
```cpp
const scalar_t* k_ptr = k_cache + physical_block_number * kv_block_stride
+ kv_head_idx * kv_head_stride
+ physical_block_offset * x;
```
Unlike to `q_ptr`, `k_ptr` in each thread will point to different
key token at different iterations. As shown above, that `k_ptr`
points to key token data based on `k_cache` at assigned block,
assigned head and assigned token.
<figure markdown="span">
![](../../assets/kernel/key.png){ align="center" alt="key" width="70%" }
</figure>
The diagram above illustrates the memory layout for key data. It
assumes that the `BLOCK_SIZE` is 16, `HEAD_SIZE` is 128, `x` is
8, `THREAD_GROUP_SIZE` is 2, and there are a total of 4 warps. Each
rectangle represents all the elements for one key token at one head,
which will be processed by one thread group. The left half shows the
total 16 blocks of key token data for warp 0, while the right half
represents the remaining key token data for other warps or
iterations. Inside each rectangle, there are a total 32 vecs (128
elements for one token) that will be processed by 2 threads (one
thread group) separately.
<figure markdown="span">
![](../../assets/kernel/k_vecs.png){ align="center" alt="k_vecs" width="70%" }
</figure>
```cpp
K_vec k_vecs[NUM_VECS_PER_THREAD]
```
Next, we need to read the key token data from `k_ptr` and store
them on register memory as `k_vecs`. We use register memory for
`k_vecs` because it will only be accessed by one thread once,
whereas `q_vecs` will be accessed by multiple threads multiple
times. Each `k_vecs` will contain multiple vectors for later
calculation. Each vec will be set at each inner iteration. The
assignment of vecs allows neighboring threads in a warp to read
neighboring memory together, which again promotes the memory
coalescing. For instance, thread 0 will read vec 0, while thread 1
will read vec 1. In the next inner loop, thread 0 will read vec 2,
while thread 1 will read vec 3, and so on.
You may still be a little confused about the overall flow. Don't
worry, please keep reading the next "QK" section. It will illustrate
the query and key calculation flow in a clearer and higher-level
manner.
## QK
As shown the pseudo code below, before the entire for loop block, we
fetch the query data for one token and store it in `q_vecs`. Then,
in the outer for loop, we iterate through different `k_ptrs` that
point to different tokens and prepare the `k_vecs` in the inner for
loop. Finally, we perform the dot multiplication between the
`q_vecs` and each `k_vecs`.
```cpp
q_vecs = ...
for ... {
k_ptr = ...
for ... {
k_vecs[i] = ...
}
...
float qk = scale * Qk_dot<scalar_t, THREAD_GROUP_SIZE>::dot(q_vecs[thread_group_offset], k_vecs);
}
```
As mentioned before, for each thread, it only fetches part of the
query and key token data at a time. However, there will be a cross
thread group reduction happen in the `Qk_dot<>::dot` . So `qk`
returned here is not just between part of the query and key token dot
multiplication, but actually a full result between entire query and
key token data.
For example, if the value of `HEAD_SIZE` is 128 and
`THREAD_GROUP_SIZE` is 2, each thread's `k_vecs` will contain
total 64 elements. However, the returned `qk` is actually the
result of dot multiplication between 128 query elements and 128 key
elements. If you want to learn more about the details of the dot
multiplication and reduction, you may refer to the implementation of
`Qk_dot<>::dot`. However, for the sake of simplicity, I will not
cover it in this document.
## Softmax
Next, we need to calculate the normalized softmax for all `qk`s,
as shown above, where each $x$ represents a `qk`. To do this,
we must obtain the reduced value of `qk_max`($m(x)$) and
the `exp_sum`($\ell(x)$) of all `qk`s. The reduction
should be performed across the entire thread block, encompassing
results between the query token and all context key tokens.
$$
\begin{gather*}
m(x):=\max _i \quad x_i \\ \quad f(x):=\left[\begin{array}{lll}e^{x_1-m(x)} & \ldots & e^{x_B-m(x)}\end{array}\right]\\ \quad \ell(x):=\sum_i f(x)_i \\
\quad \operatorname{softmax}(x):=\frac{f(x)}{\ell(x)}
\end{gather*}
$$
### `qk_max` and `logits`
Just right after we get the `qk` result, we can set the temporary
`logits` result with `qk` (In the end, the `logits` should
store the normalized softmax result). Also we can compare and collect
the `qk_max` for all `qk`s that are calculated by current
thread group.
```cpp
if (thread_group_offset == 0) {
const bool mask = token_idx >= context_len;
logits[token_idx - start_token_idx] = mask ? 0.f : qk;
qk_max = mask ? qk_max : fmaxf(qk_max, qk);
}
```
Please note that the `logits` here is on shared memory, so each
thread group will set the fields for its own assigned context tokens.
Overall, the size of logits should be number of context tokens.
```cpp
for (int mask = WARP_SIZE / 2; mask >= THREAD_GROUP_SIZE; mask /= 2) {
qk_max = fmaxf(qk_max, VLLM_SHFL_XOR_SYNC(qk_max, mask));
}
if (lane == 0) {
red_smem[warp_idx] = qk_max;
}
```
Then we need to get the reduced `qk_max` across each warp. The main
idea is to make threads in warp to communicate with each other and
get the final max `qk` .
```cpp
for (int mask = NUM_WARPS / 2; mask >= 1; mask /= 2) {
qk_max = fmaxf(qk_max, VLLM_SHFL_XOR_SYNC(qk_max, mask));
}
qk_max = VLLM_SHFL_SYNC(qk_max, 0);
```
Finally, we can get the reduced `qk_max` from whole thread block by
compare the `qk_max` from all warps in this thread block. Then we
need to broadcast the final result to each thread.
### `exp_sum`
Similar to `qk_max`, we need to get the reduced sum value from the
entire thread block too.
```cpp
for (int i = thread_idx; i < num_tokens; i += NUM_THREADS) {
float val = __expf(logits[i] - qk_max);
logits[i] = val;
exp_sum += val;
}
...
exp_sum = block_sum<NUM_WARPS>(&red_smem[NUM_WARPS], exp_sum);
```
Firstly, sum all exp values from each thread group, and meanwhile,
convert each entry of `logits` from `qk` to `exp(qk - qk_max)`.
Please note, the `qk_max` here is already the max `qk` across the
whole thread block. And then we can do reduction for `exp_sum`
across whole thread block just like the `qk_max`.
```cpp
const float inv_sum = __fdividef(1.f, exp_sum + 1e-6f);
for (int i = thread_idx; i < num_tokens; i += NUM_THREADS) {
logits[i] *= inv_sum;
}
```
Finally, with the reduced `qk_max` and `exp_sum`, we can obtain
the final normalized softmax result as `logits`. This `logits`
variable will be used for dot multiplication with the value data in
later steps. Now, it should store the normalized softmax result of
`qk` for all assigned context tokens.
## Value
<figure markdown="span">
![](../../assets/kernel/value.png){ align="center" alt="value" width="70%" }
</figure>
<figure markdown="span">
![](../../assets/kernel/logits_vec.png){ align="center" alt="logits_vec" width="50%" }
</figure>
<figure markdown="span">
![](../../assets/kernel/v_vec.png){ align="center" alt="v_vec" width="70%" }
</figure>
Now we need to retrieve the value data and perform dot multiplication
with `logits`. Unlike query and key, there is no thread group
concept for value data. As shown in diagram, different from key token
memory layout, elements from the same column correspond to the same
value token. For one block of value data, there are `HEAD_SIZE` of
rows and `BLOCK_SIZE` of columns that are split into multiple
`v_vecs`.
Each thread always fetches `V_VEC_SIZE` elements from the same
`V_VEC_SIZE` of tokens at a time. As a result, a single thread
retrieves multiple `v_vec`s from different rows and the same
columns through multiple inner iterations. For each `v_vec`, it
needs to be dot multiplied with the corresponding `logits_vec`,
which is also `V_VEC_SIZE` elements from `logits`. Overall, with
multiple inner iterations, each warp will process one block of value
tokens. And with multiple outer iterations, the whole context value
tokens are processed
```cpp
float accs[NUM_ROWS_PER_THREAD];
for ... { // Iteration over different blocks.
logits_vec = ...
for ... { // Iteration over different rows.
v_vec = ...
...
accs[i] += dot(logits_vec, v_vec);
}
}
```
As shown in the above pseudo code, in the outer loop, similar to
`k_ptr`, `logits_vec` iterates over different blocks and reads
`V_VEC_SIZE` elements from `logits`. In the inner loop, each
thread reads `V_VEC_SIZE` elements from the same tokens as a
`v_vec` and performs dot multiplication. It is important to note
that in each inner iteration, the thread fetches different head
position elements for the same tokens. The dot result is then
accumulated in `accs`. Therefore, each entry of `accs` is mapped
to a head position assigned to the current thread.
For example, if `BLOCK_SIZE` is 16 and `V_VEC_SIZE` is 8, each
thread fetches 8 value elements for 8 tokens at a time. Each element
is from different tokens at the same head position. If `HEAD_SIZE`
is 128 and `WARP_SIZE` is 32, for each inner loop, a warp needs to
fetch `WARP_SIZE * V_VEC_SIZE = 256` elements. This means there are
a total of 128 * 16 / 256 = 8 inner iterations for a warp to handle
a whole block of value tokens. And each `accs` in each thread
contains 8 elements that accumulated at 8 different head positions.
For the thread 0, the `accs` variable will have 8 elements, which
are 0th, 32th … 224th elements of a value head that are accumulated
from all assigned 8 tokens.
## LV
Now, we need to perform reduction for `accs` within each warp. This
process allows each thread to accumulate the `accs` for the
assigned head positions of all tokens in one block.
```cpp
for (int i = 0; i < NUM_ROWS_PER_THREAD; i++) {
float acc = accs[i];
for (int mask = NUM_V_VECS_PER_ROW / 2; mask >= 1; mask /= 2) {
acc += VLLM_SHFL_XOR_SYNC(acc, mask);
}
accs[i] = acc;
}
```
Next, we perform reduction for `accs` across all warps, allowing
each thread to have the accumulation of `accs` for the assigned
head positions of all context tokens. Please note that each `accs`
in every thread only stores the accumulation for a portion of
elements of the entire head for all context tokens. However, overall,
all results for output have been calculated but are just stored in
different thread register memory.
```cpp
float* out_smem = reinterpret_cast<float*>(shared_mem);
for (int i = NUM_WARPS; i > 1; i /= 2) {
// Upper warps write to shared memory.
...
float* dst = &out_smem[(warp_idx - mid) * HEAD_SIZE];
for (int i = 0; i < NUM_ROWS_PER_THREAD; i++) {
...
dst[row_idx] = accs[i];
}
// Lower warps update the output.
const float* src = &out_smem[warp_idx * HEAD_SIZE];
for (int i = 0; i < NUM_ROWS_PER_THREAD; i++) {
...
accs[i] += src[row_idx];
}
// Write out the accs.
}
```
## Output
Now we can write all of calculated result from local register memory
to final output global memory.
```cpp
scalar_t* out_ptr = out + seq_idx * num_heads * max_num_partitions * HEAD_SIZE
+ head_idx * max_num_partitions * HEAD_SIZE
+ partition_idx * HEAD_SIZE;
```
First, we need to define the `out_ptr` variable, which points to
the start address of the assigned sequence and assigned head.
```cpp
for (int i = 0; i < NUM_ROWS_PER_THREAD; i++) {
const int row_idx = lane / NUM_V_VECS_PER_ROW + i * NUM_ROWS_PER_ITER;
if (row_idx < HEAD_SIZE && lane % NUM_V_VECS_PER_ROW == 0) {
from_float(*(out_ptr + row_idx), accs[i]);
}
}
```
Finally, we need to iterate over different assigned head positions
and write out the corresponding accumulated result based on the
`out_ptr`.
(mm-processing)= ---
title: Multi-Modal Data Processing
---
[](){ #mm-processing }
# Multi-Modal Data Processing To enable various optimizations in vLLM such as [chunked prefill][chunked-prefill] and [prefix caching][automatic-prefix-caching], we use [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor] to provide the correspondence between placeholder feature tokens (e.g. `<image>`) and multi-modal inputs (e.g. the raw input image) based on the outputs of HF processor.
To enable various optimizations in vLLM such as [chunked prefill](#chunked-prefill) and [prefix caching](#automatic-prefix-caching), we use {class}`~vllm.multimodal.processing.BaseMultiModalProcessor` to provide the correspondence between placeholder feature tokens (e.g. `<image>`) and multi-modal inputs (e.g. the raw input image) based on the outputs of HF processor. Here are the main features of [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor]:
Here are the main features of {class}`~vllm.multimodal.processing.BaseMultiModalProcessor`:
## Prompt Update Detection ## Prompt Update Detection
...@@ -15,7 +16,7 @@ One of the main responsibilities of HF processor is to update the prompt with pl ...@@ -15,7 +16,7 @@ One of the main responsibilities of HF processor is to update the prompt with pl
The information about which tokens have been updated is key to finding the correspondence between placeholder feature tokens and multi-modal inputs. The information about which tokens have been updated is key to finding the correspondence between placeholder feature tokens and multi-modal inputs.
In vLLM, this information is specified using {class}`~vllm.multimodal.processing.PromptUpdate` in {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates`. We can automatically detect whether HF has updated the prompt by checking the existence of the updated tokens. In vLLM, this information is specified using [PromptUpdate][vllm.multimodal.processing.PromptUpdate] in [_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates]. We can automatically detect whether HF has updated the prompt by checking the existence of the updated tokens.
## Tokenized Prompt Inputs ## Tokenized Prompt Inputs
...@@ -43,22 +44,22 @@ While HF processors support text + multi-modal inputs natively, this is not so f ...@@ -43,22 +44,22 @@ While HF processors support text + multi-modal inputs natively, this is not so f
Moreover, since the tokenized text has not passed through the HF processor, we have to apply Step 3 by ourselves to keep the output tokens and multi-modal data consistent with each other. Moreover, since the tokenized text has not passed through the HF processor, we have to apply Step 3 by ourselves to keep the output tokens and multi-modal data consistent with each other.
(mm-dummy-text)= [](){ #mm-dummy-text }
### Dummy text ### Dummy text
We work around the first issue by requiring each model to define how to generate dummy text based on the number of multi-modal inputs, via {meth}`~vllm.multimodal.profiling.BaseDummyInputsBuilder.get_dummy_text`. This lets us generate dummy text corresponding to the multi-modal inputs and input them together to obtain the processed multi-modal data. We work around the first issue by requiring each model to define how to generate dummy text based on the number of multi-modal inputs, via [get_dummy_text][vllm.multimodal.profiling.BaseDummyInputsBuilder.get_dummy_text]. This lets us generate dummy text corresponding to the multi-modal inputs and input them together to obtain the processed multi-modal data.
(mm-automatic-prompt-updating)= [](){ #mm-automatic-prompt-updating }
### Automatic prompt updating ### Automatic prompt updating
We address the second issue by implementing model-agnostic code in We address the second issue by implementing model-agnostic code in
{meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._apply_prompt_updates` to automatically update the prompt with feature placeholder tokens based on the specification outputted by {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates`. [_apply_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._apply_prompt_updates] to automatically update the prompt with feature placeholder tokens based on the specification outputted by [_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates].
### Summary ### Summary
With the help of dummy text and automatic prompt updating, our multi-modal processor can finally accept both text and token prompts with multi-modal data. The detailed logic is shown in {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._apply_hf_processor_main`. With the help of dummy text and automatic prompt updating, our multi-modal processor can finally accept both text and token prompts with multi-modal data. The detailed logic is shown in [_apply_hf_processor_main][vllm.multimodal.processing.BaseMultiModalProcessor._apply_hf_processor_main].
## Processor Output Caching ## Processor Output Caching
...@@ -66,4 +67,4 @@ Some HF processors, such as the one for Qwen2-VL, are [very slow](gh-issue:9238) ...@@ -66,4 +67,4 @@ Some HF processors, such as the one for Qwen2-VL, are [very slow](gh-issue:9238)
When new data is passed in, we first check which items are in the cache, and which ones are missing. The missing items are passed into the HF processor in a single batch and cached, before being merged with the existing items in the cache. When new data is passed in, we first check which items are in the cache, and which ones are missing. The missing items are passed into the HF processor in a single batch and cached, before being merged with the existing items in the cache.
Since we only process the missing multi-modal data items, the number of input placeholder tokens no longer corresponds to the number of the multi-modal inputs, so they can't be passed alongside the text prompt to HF processor. Therefore, we process the text and multi-modal inputs separately, using [dummy text](#mm-dummy-text) to avoid HF errors. Since this skips HF's prompt updating code, we apply [automatic prompt updating](#mm-automatic-prompt-updating) afterwards to keep the output tokens and multi-modal data consistent with each other. Since we only process the missing multi-modal data items, the number of input placeholder tokens no longer corresponds to the number of the multi-modal inputs, so they can't be passed alongside the text prompt to HF processor. Therefore, we process the text and multi-modal inputs separately, using [dummy text][mm-dummy-text] to avoid HF errors. Since this skips HF's prompt updating code, we apply [automatic prompt updating][mm-automatic-prompt-updating] afterwards to keep the output tokens and multi-modal data consistent with each other.
...@@ -2,14 +2,13 @@ ...@@ -2,14 +2,13 @@
## Debugging ## Debugging
Please see the [Troubleshooting](#troubleshooting-python-multiprocessing) Please see the [Troubleshooting][troubleshooting-python-multiprocessing]
page for information on known issues and how to solve them. page for information on known issues and how to solve them.
## Introduction ## Introduction
:::{important} !!! warning
The source code references are to the state of the code at the time of writing in December, 2024. The source code references are to the state of the code at the time of writing in December, 2024.
:::
The use of Python multiprocessing in vLLM is complicated by: The use of Python multiprocessing in vLLM is complicated by:
...@@ -124,7 +123,7 @@ what is happening. First, a log message from vLLM: ...@@ -124,7 +123,7 @@ what is happening. First, a log message from vLLM:
WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously
initialized. We must use the `spawn` multiprocessing start method. Setting initialized. We must use the `spawn` multiprocessing start method. Setting
VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See
https://docs.vllm.ai/en/latest/getting_started/debugging.html#python-multiprocessing https://docs.vllm.ai/en/latest/usage/debugging.html#python-multiprocessing
for more information. for more information.
``` ```
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment