Unverified Commit a1fe24d9 authored by Harry Mellor's avatar Harry Mellor Committed by GitHub
Browse files

Migrate docs from Sphinx to MkDocs (#18145)


Signed-off-by: default avatarHarry Mellor <19981378+hmellor@users.noreply.github.com>
parent d0bc2f81
(deployment-kubeai)=
# KubeAI
---
title: KubeAI
---
[](){ #deployment-kubeai }
[KubeAI](https://github.com/substratusai/kubeai) is a Kubernetes operator that enables you to deploy and manage AI models on Kubernetes. It provides a simple and scalable way to deploy vLLM in production. Functionality such as scale-from-zero, load based autoscaling, model caching, and much more is provided out of the box with zero external dependencies.
......
(deployment-llamastack)=
# Llama Stack
---
title: Llama Stack
---
[](){ #deployment-llamastack }
vLLM is also available via [Llama Stack](https://github.com/meta-llama/llama-stack) .
......
(deployment-llmaz)=
# llmaz
---
title: llmaz
---
[](){ #deployment-llmaz }
[llmaz](https://github.com/InftyAI/llmaz) is an easy-to-use and advanced inference platform for large language models on Kubernetes, aimed for production use. It uses vLLM as the default model serving backend.
......
(deployment-production-stack)=
# Production stack
---
title: Production stack
---
[](){ #deployment-production-stack }
Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using the [vLLM production stack](https://github.com/vllm-project/production-stack). Born out of a Berkeley-UChicago collaboration, [vLLM production stack](https://github.com/vllm-project/production-stack) is an officially released, production-optimized codebase under the [vLLM project](https://github.com/vllm-project), designed for LLM deployment with:
......@@ -114,7 +115,7 @@ To remove the deployment, run:
sudo helm uninstall vllm
```
------
---
### (Advanced) Configuring vLLM production stack
......
(deployment-k8s)=
# Using Kubernetes
---
title: Using Kubernetes
---
[](){ #deployment-k8s }
Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using native Kubernetes.
......@@ -19,9 +20,8 @@ Alternatively, you can deploy vLLM to Kubernetes using any of the following:
## Deployment with CPUs
:::{note}
The use of CPUs here is for demonstration and testing purposes only and its performance will not be on par with GPUs.
:::
!!! note
The use of CPUs here is for demonstration and testing purposes only and its performance will not be on par with GPUs.
First, create a Kubernetes PVC and Secret for downloading and storing Hugging Face model:
......
(nginxloadbalancer)=
# Using Nginx
---
title: Using Nginx
---
[](){ #nginxloadbalancer }
This document shows how to launch multiple vLLM serving containers and use Nginx to act as a load balancer between the servers.
Table of contents:
1. [Build Nginx Container](#nginxloadbalancer-nginx-build)
2. [Create Simple Nginx Config file](#nginxloadbalancer-nginx-conf)
3. [Build vLLM Container](#nginxloadbalancer-nginx-vllm-container)
4. [Create Docker Network](#nginxloadbalancer-nginx-docker-network)
5. [Launch vLLM Containers](#nginxloadbalancer-nginx-launch-container)
6. [Launch Nginx](#nginxloadbalancer-nginx-launch-nginx)
7. [Verify That vLLM Servers Are Ready](#nginxloadbalancer-nginx-verify-nginx)
1. [Build Nginx Container][nginxloadbalancer-nginx-build]
2. [Create Simple Nginx Config file][nginxloadbalancer-nginx-conf]
3. [Build vLLM Container][nginxloadbalancer-nginx-vllm-container]
4. [Create Docker Network][nginxloadbalancer-nginx-docker-network]
5. [Launch vLLM Containers][nginxloadbalancer-nginx-launch-container]
6. [Launch Nginx][nginxloadbalancer-nginx-launch-nginx]
7. [Verify That vLLM Servers Are Ready][nginxloadbalancer-nginx-verify-nginx]
(nginxloadbalancer-nginx-build)=
[](){ #nginxloadbalancer-nginx-build }
## Build Nginx Container
......@@ -39,7 +40,7 @@ Build the container:
docker build . -f Dockerfile.nginx --tag nginx-lb
```
(nginxloadbalancer-nginx-conf)=
[](){ #nginxloadbalancer-nginx-conf }
## Create Simple Nginx Config file
......@@ -63,7 +64,7 @@ server {
}
```
(nginxloadbalancer-nginx-vllm-container)=
[](){ #nginxloadbalancer-nginx-vllm-container }
## Build vLLM Container
......@@ -79,7 +80,7 @@ cd $vllm_root
docker build -f docker/Dockerfile . --tag vllm --build-arg http_proxy=$http_proxy --build-arg https_proxy=$https_proxy
```
(nginxloadbalancer-nginx-docker-network)=
[](){ #nginxloadbalancer-nginx-docker-network }
## Create Docker Network
......@@ -87,7 +88,7 @@ docker build -f docker/Dockerfile . --tag vllm --build-arg http_proxy=$http_prox
docker network create vllm_nginx
```
(nginxloadbalancer-nginx-launch-container)=
[](){ #nginxloadbalancer-nginx-launch-container }
## Launch vLLM Containers
......@@ -105,11 +106,10 @@ docker run -itd --ipc host --network vllm_nginx --gpus device=0 --shm-size=10.24
docker run -itd --ipc host --network vllm_nginx --gpus device=1 --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8082:8000 --name vllm1 vllm --model meta-llama/Llama-2-7b-chat-hf
```
:::{note}
If you are behind proxy, you can pass the proxy settings to the docker run command via `-e http_proxy=$http_proxy -e https_proxy=$https_proxy`.
:::
!!! note
If you are behind proxy, you can pass the proxy settings to the docker run command via `-e http_proxy=$http_proxy -e https_proxy=$https_proxy`.
(nginxloadbalancer-nginx-launch-nginx)=
[](){ #nginxloadbalancer-nginx-launch-nginx }
## Launch Nginx
......@@ -117,7 +117,7 @@ If you are behind proxy, you can pass the proxy settings to the docker run comma
docker run -itd -p 8000:80 --network vllm_nginx -v ./nginx_conf/:/etc/nginx/conf.d/ --name nginx-lb nginx-lb:latest
```
(nginxloadbalancer-nginx-verify-nginx)=
[](){ #nginxloadbalancer-nginx-verify-nginx }
## Verify That vLLM Servers Are Ready
......
(arch-overview)=
# Architecture Overview
---
title: Architecture Overview
---
[](){ #arch-overview }
This document provides an overview of the vLLM architecture.
:::{contents} Table of Contents
:depth: 2
:local: true
:::
[TOC]
## Entrypoints
vLLM provides a number of entrypoints for interacting with the system. The
following diagram shows the relationship between them.
:::{image} /assets/design/arch_overview/entrypoints.excalidraw.png
:alt: Entrypoints Diagram
:::
![Entrypoints Diagram](../assets/design/arch_overview/entrypoints.excalidraw.png)
### LLM Class
......@@ -77,16 +73,14 @@ python -m vllm.entrypoints.openai.api_server --model <model>
That code can be found in <gh-file:vllm/entrypoints/openai/api_server.py>.
More details on the API server can be found in the [OpenAI-Compatible Server](#openai-compatible-server) document.
More details on the API server can be found in the [OpenAI-Compatible Server][openai-compatible-server] document.
## LLM Engine
The `LLMEngine` and `AsyncLLMEngine` classes are central to the functioning of
the vLLM system, handling model inference and asynchronous request processing.
:::{image} /assets/design/arch_overview/llm_engine.excalidraw.png
:alt: LLMEngine Diagram
:::
![LLMEngine Diagram](../assets/design/arch_overview/llm_engine.excalidraw.png)
### LLMEngine
......@@ -137,18 +131,16 @@ input tensors and capturing cudagraphs.
## Model
Every model runner object has one model object, which is the actual
`torch.nn.Module` instance. See [huggingface_integration](#huggingface-integration) for how various
`torch.nn.Module` instance. See [huggingface_integration][huggingface-integration] for how various
configurations affect the class we ultimately get.
## Class Hierarchy
The following figure shows the class hierarchy of vLLM:
> :::{figure} /assets/design/hierarchy.png
> :align: center
> :alt: query
> :width: 100%
> :::
> <figure markdown="span">
> ![](../assets/design/hierarchy.png){ align="center" alt="query" width="100%" }
> </figure>
There are several important design choices behind this class hierarchy:
......@@ -178,44 +170,43 @@ of a vision model and a language model. By making the constructor uniform, we
can easily create a vision model and a language model and compose them into a
vision-language model.
:::{note}
To support this change, all vLLM models' signatures have been updated to:
```python
def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
```
To avoid accidentally passing incorrect arguments, the constructor is now keyword-only. This ensures that the constructor will raise an error if old configurations are passed. vLLM developers have already made this change for all models within vLLM. For out-of-tree registered models, developers need to update their models, for example by adding shim code to adapt the old constructor signature to the new one:
!!! note
To support this change, all vLLM models' signatures have been updated to:
```python
class MyOldModel(nn.Module):
def __init__(
self,
config,
cache_config: Optional[CacheConfig] = None,
quant_config: Optional[QuantizationConfig] = None,
lora_config: Optional[LoRAConfig] = None,
prefix: str = "",
) -> None:
...
from vllm.config import VllmConfig
class MyNewModel(MyOldModel):
```python
def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
config = vllm_config.model_config.hf_config
cache_config = vllm_config.cache_config
quant_config = vllm_config.quant_config
lora_config = vllm_config.lora_config
super().__init__(config, cache_config, quant_config, lora_config, prefix)
if __version__ >= "0.6.4":
MyModel = MyNewModel
else:
MyModel = MyOldModel
```
This way, the model can work with both old and new versions of vLLM.
:::
```
To avoid accidentally passing incorrect arguments, the constructor is now keyword-only. This ensures that the constructor will raise an error if old configurations are passed. vLLM developers have already made this change for all models within vLLM. For out-of-tree registered models, developers need to update their models, for example by adding shim code to adapt the old constructor signature to the new one:
```python
class MyOldModel(nn.Module):
def __init__(
self,
config,
cache_config: Optional[CacheConfig] = None,
quant_config: Optional[QuantizationConfig] = None,
lora_config: Optional[LoRAConfig] = None,
prefix: str = "",
) -> None:
...
from vllm.config import VllmConfig
class MyNewModel(MyOldModel):
def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
config = vllm_config.model_config.hf_config
cache_config = vllm_config.cache_config
quant_config = vllm_config.quant_config
lora_config = vllm_config.lora_config
super().__init__(config, cache_config, quant_config, lora_config, prefix)
if __version__ >= "0.6.4":
MyModel = MyNewModel
else:
MyModel = MyOldModel
```
This way, the model can work with both old and new versions of vLLM.
3\. **Sharding and Quantization at Initialization**: Certain features require
changing the model weights. For example, tensor parallelism needs to shard the
......
(design-automatic-prefix-caching)=
# Automatic Prefix Caching
---
title: Automatic Prefix Caching
---
[](){ #design-automatic-prefix-caching }
The core idea of [PagedAttention](https://blog.vllm.ai/2023/06/20/vllm.html) is to partition the KV cache of each request into KV Blocks. Each block contains the attention keys and values for a fixed number of tokens. The PagedAttention algorithm allows these blocks to be stored in non-contiguous physical memory so that we can eliminate memory fragmentation by allocating the memory on demand.
......
(huggingface-integration)=
# Integration with HuggingFace
---
title: Integration with HuggingFace
---
[](){ #huggingface-integration }
This document describes how vLLM integrates with HuggingFace libraries. We will explain step by step what happens under the hood when we run `vllm serve`.
......
(design-paged-attention)=
# vLLM Paged Attention
---
title: vLLM Paged Attention
---
[](){ #design-paged-attention }
- Currently, vLLM utilizes its own implementation of a multi-head query
attention kernel (`csrc/attention/attention_kernels.cu`).
......@@ -139,26 +140,22 @@
const scalar_t* q_ptr = q + seq_idx * q_stride + head_idx * HEAD_SIZE;
```
:::{figure} ../../assets/kernel/query.png
:align: center
:alt: query
:width: 70%
Query data of one token at one head
:::
<figure markdown="span">
![](../../assets/kernel/query.png){ align="center" alt="query" width="70%" }
<figcaption>
</figcaption>
</figure>
- Each thread defines its own `q_ptr` which points to the assigned
query token data on global memory. For example, if `VEC_SIZE` is 4
and `HEAD_SIZE` is 128, the `q_ptr` points to data that contains
total of 128 elements divided into 128 / 4 = 32 vecs.
:::{figure} ../../assets/kernel/q_vecs.png
:align: center
:alt: q_vecs
:width: 70%
`q_vecs` for one thread group
:::
<figure markdown="span">
![](../../assets/kernel/q_vecs.png){ align="center" alt="q_vecs" width="70%" }
<figcaption>
</figcaption>
</figure>
```cpp
__shared__ Q_vec q_vecs[THREAD_GROUP_SIZE][NUM_VECS_PER_THREAD];
......@@ -195,13 +192,11 @@
points to key token data based on `k_cache` at assigned block,
assigned head and assigned token.
:::{figure} ../../assets/kernel/key.png
:align: center
:alt: key
:width: 70%
Key data of all context tokens at one head
:::
<figure markdown="span">
![](../../assets/kernel/key.png){ align="center" alt="key" width="70%" }
<figcaption>
</figcaption>
</figure>
- The diagram above illustrates the memory layout for key data. It
assumes that the `BLOCK_SIZE` is 16, `HEAD_SIZE` is 128, `x` is
......@@ -214,13 +209,11 @@
elements for one token) that will be processed by 2 threads (one
thread group) separately.
:::{figure} ../../assets/kernel/k_vecs.png
:align: center
:alt: k_vecs
:width: 70%
`k_vecs` for one thread
:::
<figure markdown="span">
![](../../assets/kernel/k_vecs.png){ align="center" alt="k_vecs" width="70%" }
<figcaption>
</figcaption>
</figure>
```cpp
K_vec k_vecs[NUM_VECS_PER_THREAD]
......@@ -289,14 +282,12 @@
should be performed across the entire thread block, encompassing
results between the query token and all context key tokens.
:::{math}
:nowrap: true
$$
\begin{gather*}
m(x):=\max _i \quad x_i \\ \quad f(x):=\left[\begin{array}{lll}e^{x_1-m(x)} & \ldots & e^{x_B-m(x)}\end{array}\right]\\ \quad \ell(x):=\sum_i f(x)_i \\
\quad \operatorname{softmax}(x):=\frac{f(x)}{\ell(x)}
\end{gather*}
:::
$$
### `qk_max` and `logits`
......@@ -379,29 +370,23 @@
## Value
:::{figure} ../../assets/kernel/value.png
:align: center
:alt: value
:width: 70%
Value data of all context tokens at one head
:::
:::{figure} ../../assets/kernel/logits_vec.png
:align: center
:alt: logits_vec
:width: 50%
`logits_vec` for one thread
:::
:::{figure} ../../assets/kernel/v_vec.png
:align: center
:alt: v_vec
:width: 70%
List of `v_vec` for one thread
:::
<figure markdown="span">
![](../../assets/kernel/value.png){ align="center" alt="value" width="70%" }
<figcaption>
</figcaption>
</figure>
<figure markdown="span">
![](../../assets/kernel/logits_vec.png){ align="center" alt="logits_vec" width="50%" }
<figcaption>
</figcaption>
</figure>
<figure markdown="span">
![](../../assets/kernel/v_vec.png){ align="center" alt="v_vec" width="70%" }
<figcaption>
</figcaption>
</figure>
- Now we need to retrieve the value data and perform dot multiplication
with `logits`. Unlike query and key, there is no thread group
......
(mm-processing)=
---
title: Multi-Modal Data Processing
---
[](){ #mm-processing }
# Multi-Modal Data Processing
To enable various optimizations in vLLM such as [chunked prefill][chunked-prefill] and [prefix caching][automatic-prefix-caching], we use [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor] to provide the correspondence between placeholder feature tokens (e.g. `<image>`) and multi-modal inputs (e.g. the raw input image) based on the outputs of HF processor.
To enable various optimizations in vLLM such as [chunked prefill](#chunked-prefill) and [prefix caching](#automatic-prefix-caching), we use {class}`~vllm.multimodal.processing.BaseMultiModalProcessor` to provide the correspondence between placeholder feature tokens (e.g. `<image>`) and multi-modal inputs (e.g. the raw input image) based on the outputs of HF processor.
Here are the main features of {class}`~vllm.multimodal.processing.BaseMultiModalProcessor`:
Here are the main features of [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor]:
## Prompt Update Detection
......@@ -15,7 +16,7 @@ One of the main responsibilities of HF processor is to update the prompt with pl
The information about which tokens have been updated is key to finding the correspondence between placeholder feature tokens and multi-modal inputs.
In vLLM, this information is specified using {class}`~vllm.multimodal.processing.PromptUpdate` in {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates`. We can automatically detect whether HF has updated the prompt by checking the existence of the updated tokens.
In vLLM, this information is specified using [PromptUpdate][vllm.multimodal.processing.PromptUpdate] in [_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates]. We can automatically detect whether HF has updated the prompt by checking the existence of the updated tokens.
## Tokenized Prompt Inputs
......@@ -43,22 +44,22 @@ While HF processors support text + multi-modal inputs natively, this is not so f
Moreover, since the tokenized text has not passed through the HF processor, we have to apply Step 3 by ourselves to keep the output tokens and multi-modal data consistent with each other.
(mm-dummy-text)=
[](){ #mm-dummy-text }
### Dummy text
We work around the first issue by requiring each model to define how to generate dummy text based on the number of multi-modal inputs, via {meth}`~vllm.multimodal.profiling.BaseDummyInputsBuilder.get_dummy_text`. This lets us generate dummy text corresponding to the multi-modal inputs and input them together to obtain the processed multi-modal data.
We work around the first issue by requiring each model to define how to generate dummy text based on the number of multi-modal inputs, via [get_dummy_text][vllm.multimodal.profiling.BaseDummyInputsBuilder.get_dummy_text]. This lets us generate dummy text corresponding to the multi-modal inputs and input them together to obtain the processed multi-modal data.
(mm-automatic-prompt-updating)=
[](){ #mm-automatic-prompt-updating }
### Automatic prompt updating
We address the second issue by implementing model-agnostic code in
{meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._apply_prompt_updates` to automatically update the prompt with feature placeholder tokens based on the specification outputted by {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates`.
[_apply_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._apply_prompt_updates] to automatically update the prompt with feature placeholder tokens based on the specification outputted by [_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates].
### Summary
With the help of dummy text and automatic prompt updating, our multi-modal processor can finally accept both text and token prompts with multi-modal data. The detailed logic is shown in {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._apply_hf_processor_main`.
With the help of dummy text and automatic prompt updating, our multi-modal processor can finally accept both text and token prompts with multi-modal data. The detailed logic is shown in [_apply_hf_processor_main][vllm.multimodal.processing.BaseMultiModalProcessor._apply_hf_processor_main].
## Processor Output Caching
......@@ -66,4 +67,4 @@ Some HF processors, such as the one for Qwen2-VL, are [very slow](gh-issue:9238)
When new data is passed in, we first check which items are in the cache, and which ones are missing. The missing items are passed into the HF processor in a single batch and cached, before being merged with the existing items in the cache.
Since we only process the missing multi-modal data items, the number of input placeholder tokens no longer corresponds to the number of the multi-modal inputs, so they can't be passed alongside the text prompt to HF processor. Therefore, we process the text and multi-modal inputs separately, using [dummy text](#mm-dummy-text) to avoid HF errors. Since this skips HF's prompt updating code, we apply [automatic prompt updating](#mm-automatic-prompt-updating) afterwards to keep the output tokens and multi-modal data consistent with each other.
Since we only process the missing multi-modal data items, the number of input placeholder tokens no longer corresponds to the number of the multi-modal inputs, so they can't be passed alongside the text prompt to HF processor. Therefore, we process the text and multi-modal inputs separately, using [dummy text][mm-dummy-text] to avoid HF errors. Since this skips HF's prompt updating code, we apply [automatic prompt updating][mm-automatic-prompt-updating] afterwards to keep the output tokens and multi-modal data consistent with each other.
......@@ -2,14 +2,13 @@
## Debugging
Please see the [Troubleshooting](#troubleshooting-python-multiprocessing)
Please see the [Troubleshooting][troubleshooting-python-multiprocessing]
page for information on known issues and how to solve them.
## Introduction
:::{important}
The source code references are to the state of the code at the time of writing in December, 2024.
:::
!!! warning
The source code references are to the state of the code at the time of writing in December, 2024.
The use of Python multiprocessing in vLLM is complicated by:
......
(plugin-system)=
# vLLM's Plugin System
---
title: vLLM's Plugin System
---
[](){ #plugin-system }
The community frequently requests the ability to extend vLLM with custom features. To facilitate this, vLLM includes a plugin system that allows users to add custom features without modifying the vLLM codebase. This document explains how plugins work in vLLM and how to create a plugin for vLLM.
## How Plugins Work in vLLM
Plugins are user-registered code that vLLM executes. Given vLLM's architecture (see [](#arch-overview)), multiple processes may be involved, especially when using distributed inference with various parallelism techniques. To enable plugins successfully, every process created by vLLM needs to load the plugin. This is done by the [load_general_plugins](https://github.com/vllm-project/vllm/blob/c76ac49d266e27aa3fea84ef2df1f813d24c91c7/vllm/plugins/__init__.py#L16) function in the `vllm.plugins` module. This function is called for every process created by vLLM before it starts any work.
Plugins are user-registered code that vLLM executes. Given vLLM's architecture (see [Arch Overview][arch-overview]), multiple processes may be involved, especially when using distributed inference with various parallelism techniques. To enable plugins successfully, every process created by vLLM needs to load the plugin. This is done by the [load_general_plugins](https://github.com/vllm-project/vllm/blob/c76ac49d266e27aa3fea84ef2df1f813d24c91c7/vllm/plugins/__init__.py#L16) function in the `vllm.plugins` module. This function is called for every process created by vLLM before it starts any work.
## How vLLM Discovers Plugins
......
......@@ -57,7 +57,7 @@ In v0, the following metrics are exposed via a Prometheus-compatible `/metrics`
- `vllm:spec_decode_num_draft_tokens_total` (Counter)
- `vllm:spec_decode_num_emitted_tokens_total` (Counter)
These are documented under [Inferencing and Serving -> Production Metrics](project:../../serving/metrics.md).
These are documented under [Inferencing and Serving -> Production Metrics](../../serving/metrics.md).
### Grafana Dashboard
......@@ -222,9 +222,7 @@ And the calculated intervals are:
Put another way:
:::{image} /assets/design/v1/metrics/intervals-1.png
:alt: Interval calculations - common case
:::
![Interval calculations - common case](../../assets/design/v1/metrics/intervals-1.png)
We explored the possibility of having the frontend calculate these
intervals using the timing of events visible by the frontend. However,
......@@ -239,17 +237,13 @@ When a preemption occurs during decode, since any already generated
tokens are reused, we consider the preemption as affecting the
inter-token, decode, and inference intervals.
:::{image} /assets/design/v1/metrics/intervals-2.png
:alt: Interval calculations - preempted decode
:::
![Interval calculations - preempted decode](../../assets/design/v1/metrics/intervals-2.png)
When a preemption occurs during prefill (assuming such an event
is possible), we consider the preemption as affecting the
time-to-first-token and prefill intervals.
:::{image} /assets/design/v1/metrics/intervals-3.png
:alt: Interval calculations - preempted prefill
:::
![Interval calculations - preempted prefill](../../assets/design/v1/metrics/intervals-3.png)
### Frontend Stats Collection
......@@ -467,7 +461,7 @@ In general:
hatch](https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/#show-hidden-metrics)
for some time before deleting them.
See the [deprecation policy](project:../../contributing/deprecation_policy.md) for
See the [deprecation policy](../../contributing/deprecation_policy.md) for
the project-wide deprecation policy.
### Unimplemented - `vllm:tokens_total`
......
......@@ -122,9 +122,7 @@ There are two design points to highlight:
As a result, we will have the following components when the KV cache manager is initialized:
:::{image} /assets/design/v1/prefix_caching/overview.png
:alt: Component Overview
:::
![Component Overview](../../assets/design/v1/prefix_caching/overview.png)
* Block Pool: A list of KVCacheBlock.
* Free Block Queue: Only store the pointers of head and tail blocks for manipulations.
......@@ -194,9 +192,7 @@ As can be seen, block 3 is a new full block and is cached. However, it is redund
When a request is finished, we free all its blocks if no other requests are using them (reference count = 0). In this example, we free request 1 and block 2, 3, 4, 8 associated with it. We can see that the freed blocks are added to the tail of the free queue in the *reverse* order. This is because the last block of a request must hash more tokens and is less likely to be reused by other requests. As a result, it should be evicted first.
:::{image} /assets/design/v1/prefix_caching/free.png
:alt: Free Queue after Free a Request
:::
![Free queue after a request us freed](../../assets/design/v1/prefix_caching/free.png)
### Eviction (LRU)
......@@ -212,36 +208,24 @@ In this example, we assume the block size is 4 (each block can cache 4 tokens),
**Time 1: The cache is empty and a new request comes in.** We allocate 4 blocks. 3 of them are already full and cached. The fourth block is partially full with 3 of 4 tokens.
:::{image} /assets/design/v1/prefix_caching/example-time-1.png
:alt: Example Time 1
:::
![Example Time 1](../../assets/design/v1/prefix_caching/example-time-1.png)
**Time 3: Request 0 makes the block 3 full and asks for a new block to keep decoding.** We cache block 3 and allocate block 4.
:::{image} /assets/design/v1/prefix_caching/example-time-3.png
:alt: Example Time 3
:::
![Example Time 3](../../assets/design/v1/prefix_caching/example-time-3.png)
**Time 4: Request 1 comes in with the 14 prompt tokens, where the first 10 tokens are the same as request 0.** We can see that only the first 2 blocks (8 tokens) hit the cache, because the 3rd block only matches 2 of 4 tokens.
:::{image} /assets/design/v1/prefix_caching/example-time-4.png
:alt: Example Time 4
:::
![Example Time 4](../../assets/design/v1/prefix_caching/example-time-4.png)
**Time 5: Request 0 is finished and free.** Blocks 2, 3 and 4 are added to the free queue in the reverse order (but block 2 and 3 are still cached). Block 0 and 1 are not added to the free queue because they are being used by Request 1.
:::{image} /assets/design/v1/prefix_caching/example-time-5.png
:alt: Example Time 5
:::
![Example Time 5](../../assets/design/v1/prefix_caching/example-time-5.png)
**Time 6: Request 1 is finished and free.**
:::{image} /assets/design/v1/prefix_caching/example-time-6.png
:alt: Example Time 6
:::
![Example Time 6](../../assets/design/v1/prefix_caching/example-time-6.png)
**Time 7: Request 2 comes in with the 29 prompt tokens, where the first 12 tokens are the same as request 0\.** Note that even the block order in the free queue was `7 - 8 - 9 - 4 - 3 - 2 - 6 - 5 - 1 - 0`, the cache hit blocks (i.e., 0, 1, 2) are touched and removed from the queue before allocation, so the free queue becomes `7 - 8 - 9 - 4 - 3 - 6 - 5`. As a result, the allocated blocks are 0 (cached), 1 (cached), 2 (cached), 7, 8, 9, 4, 3 (evicted).
:::{image} /assets/design/v1/prefix_caching/example-time-7.png
:alt: Example Time 7
:::
![Example Time 7](../../assets/design/v1/prefix_caching/example-time-7.png)
(automatic-prefix-caching)=
# Automatic Prefix Caching
---
title: Automatic Prefix Caching
---
[](){ #automatic-prefix-caching }
## Introduction
Automatic Prefix Caching (APC in short) caches the KV cache of existing queries, so that a new query can directly reuse the KV cache if it shares the same prefix with one of the existing queries, allowing the new query to skip the computation of the shared part.
:::{note}
Technical details on how vLLM implements APC can be found [here](#design-automatic-prefix-caching).
:::
!!! note
Technical details on how vLLM implements APC can be found [here][design-automatic-prefix-caching].
## Enabling APC in vLLM
......
---
title: Compatibility Matrix
---
[](){ #compatibility-matrix }
The tables below show mutually exclusive features and the support on some hardware.
The symbols used have the following meanings:
- ✅ = Full compatibility
- 🟠 = Partial compatibility
- ❌ = No compatibility
!!! note
Check the ❌ or 🟠 with links to see tracking issue for unsupported feature/hardware combination.
## Feature x Feature
<style>
td:not(:first-child) {
text-align: center !important;
}
td {
padding: 0.5rem !important;
white-space: nowrap;
}
th {
padding: 0.5rem !important;
min-width: 0 !important;
}
th:not(:first-child) {
writing-mode: vertical-lr;
transform: rotate(180deg)
}
</style>
| Feature | [CP][chunked-prefill] | [APC][automatic-prefix-caching] | [LoRA][lora-adapter] | <abbr title="Prompt Adapter">prmpt adptr</abbr> | [SD][spec-decode] | CUDA graph | <abbr title="Pooling Models">pooling</abbr> | <abbr title="Encoder-Decoder Models">enc-dec</abbr> | <abbr title="Logprobs">logP</abbr> | <abbr title="Prompt Logprobs">prmpt logP</abbr> | <abbr title="Async Output Processing">async output</abbr> | multi-step | <abbr title="Multimodal Inputs">mm</abbr> | best-of | beam-search |
|-----------------------------------------------------------|-------------------------|-----------------------------------|------------------------|---------------------------------------------------|---------------------|--------------|-----------------------------------------------|-------------------------------------------------------|--------------------------------------|---------------------------------------------------|-------------------------------------------------------------|--------------------|---------------------------------------------|-----------|---------------|
| [CP][chunked-prefill] | ✅ | | | | | | | | | | | | | | |
| [APC][automatic-prefix-caching] | ✅ | ✅ | | | | | | | | | | | | | |
| [LoRA][lora-adapter] | ✅ | ✅ | ✅ | | | | | | | | | | | | |
| <abbr title="Prompt Adapter">prmpt adptr</abbr> | ✅ | ✅ | ✅ | ✅ | | | | | | | | | | | |
| [SD][spec-decode] | ✅ | ✅ | ❌ | ✅ | ✅ | | | | | | | | | | |
| CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | | | | | | | | |
| <abbr title="Pooling Models">pooling</abbr> | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | | | | | | | | |
| <abbr title="Encoder-Decoder Models">enc-dec</abbr> | ❌ | [](gh-issue:7366) | ❌ | ❌ | [](gh-issue:7366) | ✅ | ✅ | ✅ | | | | | | | |
| <abbr title="Logprobs">logP</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | | | | | | |
| <abbr title="Prompt Logprobs">prmpt logP</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | | | | | |
| <abbr title="Async Output Processing">async output</abbr> | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | | | | |
| multi-step | ❌ | ✅ | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | | | |
| <abbr title="Multimodal Inputs">mm</abbr> | ✅ | [🟠](gh-pr:8348) | [🟠](gh-pr:4194) | ❔ | ❔ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❔ | ✅ | | |
| best-of | ✅ | ✅ | ✅ | ✅ | [](gh-issue:6137) | ✅ | ❌ | ✅ | ✅ | ✅ | ❔ | [](gh-issue:7968) | ✅ | ✅ | |
| beam-search | ✅ | ✅ | ✅ | ✅ | [](gh-issue:6137) | ✅ | ❌ | ✅ | ✅ | ✅ | ❔ | [](gh-issue:7968) | ❔ | ✅ | ✅ |
[](){ #feature-x-hardware }
## Feature x Hardware
| Feature | Volta | Turing | Ampere | Ada | Hopper | CPU | AMD |
|-----------------------------------------------------------|--------------------|----------|----------|-------|----------|--------------------|-------|
| [CP][chunked-prefill] | [](gh-issue:2729) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| [APC][automatic-prefix-caching] | [](gh-issue:3687) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| [LoRA][lora-adapter] | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| <abbr title="Prompt Adapter">prmpt adptr</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | [](gh-issue:8475) | ✅ |
| [SD][spec-decode] | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
| <abbr title="Pooling Models">pooling</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❔ |
| <abbr title="Encoder-Decoder Models">enc-dec</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| <abbr title="Multimodal Inputs">mm</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| <abbr title="Logprobs">logP</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| <abbr title="Prompt Logprobs">prmpt logP</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| <abbr title="Async Output Processing">async output</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| multi-step | ✅ | ✅ | ✅ | ✅ | ✅ | [](gh-issue:8477) | ✅ |
| best-of | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| beam-search | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
(disagg-prefill)=
# Disaggregated Prefilling (experimental)
---
title: Disaggregated Prefilling (experimental)
---
[](){ #disagg-prefill }
This page introduces you the disaggregated prefilling feature in vLLM.
:::{note}
This feature is experimental and subject to change.
:::
!!! note
This feature is experimental and subject to change.
## Why disaggregated prefilling?
......@@ -15,9 +15,8 @@ Two main reasons:
- **Tuning time-to-first-token (TTFT) and inter-token-latency (ITL) separately**. Disaggregated prefilling put prefill and decode phase of LLM inference inside different vLLM instances. This gives you the flexibility to assign different parallel strategies (e.g. `tp` and `pp`) to tune TTFT without affecting ITL, or to tune ITL without affecting TTFT.
- **Controlling tail ITL**. Without disaggregated prefilling, vLLM may insert some prefill jobs during the decoding of one request. This results in higher tail latency. Disaggregated prefilling helps you solve this issue and control tail ITL. Chunked prefill with a proper chunk size also can achieve the same goal, but in practice it's hard to figure out the correct chunk size value. So disaggregated prefilling is a much more reliable way to control tail ITL.
:::{note}
Disaggregated prefill DOES NOT improve throughput.
:::
!!! note
Disaggregated prefill DOES NOT improve throughput.
## Usage example
......@@ -39,21 +38,16 @@ Key abstractions for disaggregated prefilling:
- **LookupBuffer**: LookupBuffer provides two API: `insert` KV cache and `drop_select` KV cache. The semantics of `insert` and `drop_select` are similar to SQL, where `insert` inserts a KV cache into the buffer, and `drop_select` returns the KV cache that matches the given condition and drop it from the buffer.
- **Pipe**: A single-direction FIFO pipe for tensor transmission. It supports `send_tensor` and `recv_tensor`.
:::{note}
`insert` is non-blocking operation but `drop_select` is blocking operation.
:::
!!! note
`insert` is non-blocking operation but `drop_select` is blocking operation.
Here is a figure illustrating how the above 3 abstractions are organized:
:::{image} /assets/features/disagg_prefill/abstraction.jpg
:alt: Disaggregated prefilling abstractions
:::
![Disaggregated prefilling abstractions](../assets/features/disagg_prefill/abstraction.jpg)
The workflow of disaggregated prefilling is as follows:
:::{image} /assets/features/disagg_prefill/overview.jpg
:alt: Disaggregated prefilling workflow
:::
![Disaggregated prefilling workflow](../assets/features/disagg_prefill/overview.jpg)
The `buffer` corresponds to `insert` API in LookupBuffer, and the `drop_select` corresponds to `drop_select` API in LookupBuffer.
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment