Commit 711aa9d5 authored by zhuwenwen

Merge tag 'v0.10.0' into v0.10.0-dev

parents 751c492c 6d8d0a24
--- # NVIDIA Triton
title: NVIDIA Triton
---
[](){ #deployment-triton }
The [Triton Inference Server](https://github.com/triton-inference-server) hosts a tutorial demonstrating how to quickly deploy a simple [facebook/opt-125m](https://huggingface.co/facebook/opt-125m) model using vLLM. Please see [Deploying a vLLM model in Triton](https://github.com/triton-inference-server/tutorials/blob/main/Quick_Deploy/vLLM/README.md#deploying-a-vllm-model-in-triton) for more details.
--- # KServe
title: KServe
---
[](){ #deployment-kserve }
vLLM can be deployed with [KServe](https://github.com/kserve/kserve) on Kubernetes for highly scalable distributed model serving.
......
--- # KubeAI
title: KubeAI
---
[](){ #deployment-kubeai }
[KubeAI](https://github.com/substratusai/kubeai) is a Kubernetes operator that enables you to deploy and manage AI models on Kubernetes. It provides a simple and scalable way to deploy vLLM in production. Functionality such as scale-from-zero, load-based autoscaling, model caching, and much more is provided out of the box with zero external dependencies.
......
# KubeRay
[KubeRay](https://github.com/ray-project/kuberay) provides a Kubernetes-native way to run vLLM workloads on Ray clusters.
A Ray cluster can be declared in YAML, and the operator then handles pod scheduling, networking configuration, restarts, and blue-green deployments, all while preserving the familiar Kubernetes experience.
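As a rough illustration of the declarative approach, the sketch below builds a minimal `RayCluster` manifest as a Python dictionary and serializes it to JSON, which `kubectl apply -f` accepts as well as YAML. The cluster name, container image, worker group name, and replica counts are placeholders invented for this example, and the `ray.io/v1` API version is an assumption that may not match older KubeRay releases. Treat it as an outline of the shape of the resource, not a complete or recommended spec.

```python
import json


def build_ray_cluster(name: str, image: str, workers: int) -> dict:
    """Return a minimal RayCluster manifest (illustrative, not a complete spec)."""

    def pod_template(container_name: str) -> dict:
        # Bare pod template; real deployments add resources, volumes, and so on.
        return {"spec": {"containers": [{"name": container_name, "image": image}]}}

    return {
        "apiVersion": "ray.io/v1",  # assumption: may differ on older KubeRay releases
        "kind": "RayCluster",
        "metadata": {"name": name},
        "spec": {
            "headGroupSpec": {
                "rayStartParams": {},
                "template": pod_template("ray-head"),
            },
            "workerGroupSpecs": [
                {
                    "groupName": "workers",  # hypothetical group name
                    "replicas": workers,
                    "minReplicas": workers,
                    "maxReplicas": workers,
                    "rayStartParams": {},
                    "template": pod_template("ray-worker"),
                }
            ],
        },
    }


# Placeholder name and image, chosen only for this example.
manifest = build_ray_cluster("vllm-ray", "example.com/ray-vllm:placeholder", workers=2)
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

Saving the printed JSON to a file and applying it with `kubectl apply -f` would hand the cluster definition to the operator, assuming KubeRay is already installed in the cluster.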
## Why KubeRay instead of manual scripts?
| Feature | Manual scripts | KubeRay |
|---------|-----------------------------------------------------------|---------|
| Cluster bootstrap | Manually SSH into every node and run a script | One command to create or update the whole cluster: `kubectl apply -f cluster.yaml` |
| Autoscaling | Manual | Automatically patches CRDs for adjusting cluster size |
| Upgrades | Tear down & re-create manually | Blue/green deployment updates supported |
| Declarative config | Bash flags & environment variables | Git-ops-friendly YAML CRDs (RayCluster/RayService) |
Using KubeRay reduces the operational burden and simplifies integration of Ray + vLLM with existing Kubernetes workflows (CI/CD, secrets, storage classes, etc.).
## Learn more
* ["Serve a Large Language Model using Ray Serve LLM on Kubernetes"](https://docs.ray.io/en/master/cluster/kubernetes/examples/rayserve-llm-example.html) - An end-to-end example of how to serve a model using vLLM, KubeRay, and Ray Serve.
* [KubeRay documentation](https://docs.ray.io/en/latest/cluster/kubernetes/index.html)
--- # Llama Stack
title: Llama Stack
---
[](){ #deployment-llamastack }
vLLM is also available via [Llama Stack](https://github.com/meta-llama/llama-stack).
......
--- # llmaz
title: llmaz
---
[](){ #deployment-llmaz }
[llmaz](https://github.com/InftyAI/llmaz) is an easy-to-use and advanced inference platform for large language models on Kubernetes, aimed at production use. It uses vLLM as the default model serving backend.
......
--- # Production stack
title: Production stack
---
[](){ #deployment-production-stack }
Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using the [vLLM production stack](https://github.com/vllm-project/production-stack). Born out of a Berkeley-UChicago collaboration, [vLLM production stack](https://github.com/vllm-project/production-stack) is an officially released, production-optimized codebase under the [vLLM project](https://github.com/vllm-project), designed for LLM deployment with:
...@@ -44,7 +41,8 @@ vllm-deployment-router-859d8fb668-2x2b7 1/1 Running 0 2m38 ...@@ -44,7 +41,8 @@ vllm-deployment-router-859d8fb668-2x2b7 1/1 Running 0 2m38
vllm-opt125m-deployment-vllm-84dfc9bd7-vb9bs 1/1 Running 0 2m38s vllm-opt125m-deployment-vllm-84dfc9bd7-vb9bs 1/1 Running 0 2m38s
``` ```
**NOTE**: It may take some time for the containers to download the Docker images and LLM weights. !!! note
It may take some time for the containers to download the Docker images and LLM weights.
### Send a Query to the Stack ### Send a Query to the Stack
...@@ -60,7 +58,7 @@ And then you can send out a query to the OpenAI-compatible API to check the avai ...@@ -60,7 +58,7 @@ And then you can send out a query to the OpenAI-compatible API to check the avai
curl -o- http://localhost:30080/models curl -o- http://localhost:30080/models
``` ```
??? Output ??? console "Output"
```json ```json
{ {
...@@ -89,7 +87,7 @@ curl -X POST http://localhost:30080/completions \ ...@@ -89,7 +87,7 @@ curl -X POST http://localhost:30080/completions \
}' }'
``` ```
??? Output ??? console "Output"
```json ```json
{ {
...@@ -121,7 +119,7 @@ sudo helm uninstall vllm ...@@ -121,7 +119,7 @@ sudo helm uninstall vllm
The core vLLM production stack configuration is managed with YAML. Here is the example configuration used in the installation above:
??? Yaml ??? code "Yaml"
```yaml ```yaml
servingEngineSpec: servingEngineSpec:
...@@ -152,6 +150,8 @@ In this YAML configuration: ...@@ -152,6 +150,8 @@ In this YAML configuration:
* **`requestGPU`**: Specifies the number of GPUs required.
* **`pvcStorage`**: Allocates persistent storage for the model.
**NOTE:** If you intend to set up two pods, please refer to this [YAML file](https://github.com/vllm-project/production-stack/blob/main/tutorials/assets/values-01-2pods-minimal-example.yaml). !!! note
If you intend to set up two pods, please refer to this [YAML file](https://github.com/vllm-project/production-stack/blob/main/tutorials/assets/values-01-2pods-minimal-example.yaml).
**NOTE:** vLLM production stack offers many more features (*e.g.* CPU offloading and a wide range of routing algorithms). Please check out these [examples and tutorials](https://github.com/vllm-project/production-stack/tree/main/tutorials) and our [repo](https://github.com/vllm-project/production-stack) for more details! !!! tip
vLLM production stack offers many more features (*e.g.* CPU offloading and a wide range of routing algorithms). Please check out these [examples and tutorials](https://github.com/vllm-project/production-stack/tree/main/tutorials) and our [repo](https://github.com/vllm-project/production-stack) for more details!
--- # Using Kubernetes
title: Using Kubernetes
---
[](){ #deployment-k8s }
Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using native Kubernetes.
...@@ -16,6 +13,7 @@ Alternatively, you can deploy vLLM to Kubernetes using any of the following: ...@@ -16,6 +13,7 @@ Alternatively, you can deploy vLLM to Kubernetes using any of the following:
- [Helm](frameworks/helm.md)
- [InftyAI/llmaz](integrations/llmaz.md)
- [KServe](integrations/kserve.md)
- [KubeRay](integrations/kuberay.md)
- [kubernetes-sigs/lws](frameworks/lws.md)
- [meta-llama/llama-stack](integrations/llamastack.md)
- [substratusai/kubeai](integrations/kubeai.md)
...@@ -29,7 +27,7 @@ Alternatively, you can deploy vLLM to Kubernetes using any of the following: ...@@ -29,7 +27,7 @@ Alternatively, you can deploy vLLM to Kubernetes using any of the following:
First, create a Kubernetes PVC and Secret for downloading and storing the Hugging Face model:
??? Config ??? console "Config"
```bash ```bash
cat <<EOF |kubectl apply -f - cat <<EOF |kubectl apply -f -
...@@ -57,7 +55,7 @@ First, create a Kubernetes PVC and Secret for downloading and storing Hugging Fa ...@@ -57,7 +55,7 @@ First, create a Kubernetes PVC and Secret for downloading and storing Hugging Fa
Next, start the vLLM server as a Kubernetes Deployment and Service:
??? Config ??? console "Config"
```bash ```bash
cat <<EOF |kubectl apply -f - cat <<EOF |kubectl apply -f -
......
--- # Using Nginx
title: Using Nginx
---
[](){ #nginxloadbalancer }
This document shows how to launch multiple vLLM serving containers and use Nginx to act as a load balancer between the servers.
...@@ -36,7 +33,7 @@ docker build . -f Dockerfile.nginx --tag nginx-lb ...@@ -36,7 +33,7 @@ docker build . -f Dockerfile.nginx --tag nginx-lb
Create a file named `nginx_conf/nginx.conf`. You can add as many servers as you'd like; the example below starts with two. To add more, add another `server vllmN:8000 max_fails=3 fail_timeout=10000s;` entry to `upstream backend`.
??? Config ??? console "Config"
```console ```console
upstream backend { upstream backend {
...@@ -95,7 +92,7 @@ Notes: ...@@ -95,7 +92,7 @@ Notes:
- The example below assumes a GPU backend. If you are using a CPU backend, remove `--gpus device=ID` and add the `VLLM_CPU_KVCACHE_SPACE` and `VLLM_CPU_OMP_THREADS_BIND` environment variables to the `docker run` command.
- Adjust the model name that you want to use in your vLLM servers if you don't want to use `Llama-2-7b-chat-hf`.
??? Commands ??? console "Commands"
```console ```console
mkdir -p ~/.cache/huggingface/hub/ mkdir -p ~/.cache/huggingface/hub/
......
--- # Architecture Overview
title: Architecture Overview
---
[](){ #arch-overview }
This document provides an overview of the vLLM architecture.
...@@ -22,7 +19,7 @@ server. ...@@ -22,7 +19,7 @@ server.
Here is a sample of `LLM` class usage: Here is a sample of `LLM` class usage:
??? Code ??? code
```python ```python
from vllm import LLM, SamplingParams from vllm import LLM, SamplingParams
...@@ -74,7 +71,7 @@ python -m vllm.entrypoints.openai.api_server --model <model> ...@@ -74,7 +71,7 @@ python -m vllm.entrypoints.openai.api_server --model <model>
That code can be found in <gh-file:vllm/entrypoints/openai/api_server.py>. That code can be found in <gh-file:vllm/entrypoints/openai/api_server.py>.
More details on the API server can be found in the [OpenAI-Compatible Server][serving-openai-compatible-server] document. More details on the API server can be found in the [OpenAI-Compatible Server](../serving/openai_compatible_server.md) document.
## LLM Engine ## LLM Engine
...@@ -132,7 +129,7 @@ input tensors and capturing cudagraphs. ...@@ -132,7 +129,7 @@ input tensors and capturing cudagraphs.
## Model ## Model
Every model runner object has one model object, which is the actual Every model runner object has one model object, which is the actual
`torch.nn.Module` instance. See [huggingface_integration][huggingface-integration] for how various `torch.nn.Module` instance. See [huggingface_integration](huggingface_integration.md) for how various
configurations affect the class we ultimately get. configurations affect the class we ultimately get.
## Class Hierarchy ## Class Hierarchy
...@@ -180,7 +177,7 @@ vision-language model. ...@@ -180,7 +177,7 @@ vision-language model.
To avoid accidentally passing incorrect arguments, the constructor is now keyword-only. This ensures that the constructor will raise an error if old configurations are passed. vLLM developers have already made this change for all models within vLLM. For out-of-tree registered models, developers need to update their models, for example by adding shim code to adapt the old constructor signature to the new one:
??? Code ??? code
```python ```python
class MyOldModel(nn.Module): class MyOldModel(nn.Module):
......
--- # Automatic Prefix Caching
title: Automatic Prefix Caching
---
[](){ #design-automatic-prefix-caching }
The core idea of [PagedAttention](https://blog.vllm.ai/2023/06/20/vllm.html) is to partition the KV cache of each request into KV Blocks. Each block contains the attention keys and values for a fixed number of tokens. The PagedAttention algorithm allows these blocks to be stored in non-contiguous physical memory so that we can eliminate memory fragmentation by allocating the memory on demand.
......
--- # Integration with HuggingFace
title: Integration with HuggingFace
---
[](){ #huggingface-integration }
This document describes how vLLM integrates with HuggingFace libraries. We will explain step by step what happens under the hood when we run `vllm serve`.
......
--- # vLLM Paged Attention
title: vLLM Paged Attention
---
[](){ #design-paged-attention }
Currently, vLLM utilizes its own implementation of a multi-head query Currently, vLLM utilizes its own implementation of a multi-head query
attention kernel (`csrc/attention/attention_kernels.cu`). attention kernel (`csrc/attention/attention_kernels.cu`).
...@@ -448,7 +445,7 @@ elements of the entire head for all context tokens. However, overall, ...@@ -448,7 +445,7 @@ elements of the entire head for all context tokens. However, overall,
all results for output have been calculated but are just stored in all results for output have been calculated but are just stored in
different thread register memory. different thread register memory.
??? Code ??? code
```cpp ```cpp
float* out_smem = reinterpret_cast<float*>(shared_mem); float* out_smem = reinterpret_cast<float*>(shared_mem);
......
--- # Multi-Modal Data Processing
title: Multi-Modal Data Processing
---
[](){ #mm-processing }
To enable various optimizations in vLLM such as [chunked prefill][chunked-prefill] and [prefix caching][automatic-prefix-caching], we use [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor] to provide the correspondence between placeholder feature tokens (e.g. `<image>`) and multi-modal inputs (e.g. the raw input image) based on the outputs of HF processor. To enable various optimizations in vLLM such as [chunked prefill][chunked-prefill] and [prefix caching](../features/automatic_prefix_caching.md), we use [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor] to provide the correspondence between placeholder feature tokens (e.g. `<image>`) and multi-modal inputs (e.g. the raw input image) based on the outputs of HF processor.
Here are the main features of [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor]: Here are the main features of [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor]:
......
--- # vLLM's Plugin System
title: vLLM's Plugin System
---
[](){ #plugin-system }
The community frequently requests the ability to extend vLLM with custom features. To facilitate this, vLLM includes a plugin system that allows users to add custom features without modifying the vLLM codebase. This document explains how plugins work in vLLM and how to create a plugin for vLLM.
## How Plugins Work in vLLM ## How Plugins Work in vLLM
Plugins are user-registered code that vLLM executes. Given vLLM's architecture (see [Arch Overview][arch-overview]), multiple processes may be involved, especially when using distributed inference with various parallelism techniques. To enable plugins successfully, every process created by vLLM needs to load the plugin. This is done by the [load_general_plugins](https://github.com/vllm-project/vllm/blob/c76ac49d266e27aa3fea84ef2df1f813d24c91c7/vllm/plugins/__init__.py#L16) function in the `vllm.plugins` module. This function is called for every process created by vLLM before it starts any work. Plugins are user-registered code that vLLM executes. Given vLLM's architecture (see [Arch Overview](arch_overview.md)), multiple processes may be involved, especially when using distributed inference with various parallelism techniques. To enable plugins successfully, every process created by vLLM needs to load the plugin. This is done by the [load_general_plugins](https://github.com/vllm-project/vllm/blob/c76ac49d266e27aa3fea84ef2df1f813d24c91c7/vllm/plugins/__init__.py#L16) function in the `vllm.plugins` module. This function is called for every process created by vLLM before it starts any work.
## How vLLM Discovers Plugins ## How vLLM Discovers Plugins
vLLM's plugin system uses the standard Python `entry_points` mechanism. This mechanism allows developers to register functions in their Python packages for use by other packages. An example of a plugin: vLLM's plugin system uses the standard Python `entry_points` mechanism. This mechanism allows developers to register functions in their Python packages for use by other packages. An example of a plugin:
??? Code ??? code
```python ```python
# inside `setup.py` file # inside `setup.py` file
......
...@@ -5,17 +5,17 @@ Ensure the v1 LLM Engine exposes a superset of the metrics available in v0. ...@@ -5,17 +5,17 @@ Ensure the v1 LLM Engine exposes a superset of the metrics available in v0.
## Objectives ## Objectives
- Achieve parity of metrics between v0 and v1.
- The priority use case is accessing these metrics via Prometheus as this is what we expect to be used in production environments. - The priority use case is accessing these metrics via Prometheus, as this is what we expect to be used in production environments.
- Logging support - i.e. printing metrics to the info log - is provided for more ad-hoc testing, debugging, development, and exploratory use cases. - Logging support (i.e. printing metrics to the info log) is provided for more ad-hoc testing, debugging, development, and exploratory use cases.
## Background ## Background
Metrics in vLLM can be categorized as follows: Metrics in vLLM can be categorized as follows:
1. Server-level metrics: these are global metrics that track the state and performance of the LLM engine. These are typically exposed as Gauges or Counters in Prometheus. 1. Server-level metrics: Global metrics that track the state and performance of the LLM engine. These are typically exposed as Gauges or Counters in Prometheus.
2. Request-level metrics: these are metrics that track the characteristics - e.g. size and timing - of individual requests. These are typically exposed as Histograms in Prometheus, and are often the SLO that an SRE monitoring vLLM will be tracking. 2. Request-level metrics: Metrics that track the characteristics (e.g. size and timing) of individual requests. These are typically exposed as Histograms in Prometheus and are often the SLOs that an SRE monitoring vLLM will be tracking.
The mental model is that the "Server-level Metrics" explain why the "Request-level Metrics" are what they are. The mental model is that server-level metrics help explain the values of request-level metrics.
### v0 Metrics ### v0 Metrics
...@@ -61,24 +61,24 @@ These are documented under [Inferencing and Serving -> Production Metrics](../.. ...@@ -61,24 +61,24 @@ These are documented under [Inferencing and Serving -> Production Metrics](../..
### Grafana Dashboard ### Grafana Dashboard
vLLM also provides [a reference example](https://docs.vllm.ai/en/latest/examples/prometheus_grafana.html) for how to collect and store these metrics using Prometheus and visualize them using a Grafana dashboard. vLLM also provides [a reference example](../../examples/online_serving/prometheus_grafana.md) for how to collect and store these metrics using Prometheus and visualize them using a Grafana dashboard.
The subset of metrics exposed in the Grafana dashboard gives us an indication of which metrics are especially important:
- `vllm:e2e_request_latency_seconds_bucket` - End to end request latency measured in seconds - `vllm:e2e_request_latency_seconds_bucket` - End to end request latency measured in seconds.
- `vllm:prompt_tokens_total` - Prompt Tokens - `vllm:prompt_tokens_total` - Prompt tokens.
- `vllm:generation_tokens_total` - Generation Tokens - `vllm:generation_tokens_total` - Generation tokens.
- `vllm:time_per_output_token_seconds` - Inter token latency (Time Per Output Token, TPOT) in second. - `vllm:time_per_output_token_seconds` - Inter-token latency (Time Per Output Token, TPOT) in seconds.
- `vllm:time_to_first_token_seconds` - Time to First Token (TTFT) latency in seconds. - `vllm:time_to_first_token_seconds` - Time to First Token (TTFT) latency in seconds.
- `vllm:num_requests_running` (also, `_swapped` and `_waiting`) - Number of requests in RUNNING, WAITING, and SWAPPED state - `vllm:num_requests_running` (also, `_swapped` and `_waiting`) - Number of requests in the RUNNING, WAITING, and SWAPPED states.
- `vllm:gpu_cache_usage_perc` - Percentage of used cache blocks by vLLM. - `vllm:gpu_cache_usage_perc` - Percentage of used cache blocks by vLLM.
- `vllm:request_prompt_tokens` - Request prompt length - `vllm:request_prompt_tokens` - Request prompt length.
- `vllm:request_generation_tokens` - request generation length - `vllm:request_generation_tokens` - Request generation length.
- `vllm:request_success_total` - Number of finished requests by their finish reason: either an EOS token was generated or the max sequence length was reached - `vllm:request_success_total` - Number of finished requests by their finish reason: either an EOS token was generated or the max sequence length was reached.
- `vllm:request_queue_time_seconds` - Queue Time - `vllm:request_queue_time_seconds` - Queue time.
- `vllm:request_prefill_time_seconds` - Requests Prefill Time - `vllm:request_prefill_time_seconds` - Requests prefill time.
- `vllm:request_decode_time_seconds` - Requests Decode Time - `vllm:request_decode_time_seconds` - Requests decode time.
- `vllm:request_max_num_generation_tokens` - Max Generation Token in Sequence Group - `vllm:request_max_num_generation_tokens` - Max generation tokens in a sequence group.
See [the PR which added this Dashboard](gh-pr:2316) for interesting and useful background on the choices made here.
...@@ -103,7 +103,7 @@ In v0, metrics are collected in the engine core process and we use multi-process ...@@ -103,7 +103,7 @@ In v0, metrics are collected in the engine core process and we use multi-process
### Built in Python/Process Metrics ### Built in Python/Process Metrics
The following metrics are supported by default by `prometheus_client`, but the are not exposed with multiprocess mode is used: The following metrics are supported by default by `prometheus_client`, but they are not exposed when multi-process mode is used:
- `python_gc_objects_collected_total` - `python_gc_objects_collected_total`
- `python_gc_objects_uncollectable_total` - `python_gc_objects_uncollectable_total`
...@@ -158,6 +158,7 @@ In v1, we wish to move computation and overhead out of the engine core ...@@ -158,6 +158,7 @@ In v1, we wish to move computation and overhead out of the engine core
process to minimize the time between each forward pass. process to minimize the time between each forward pass.
The overall idea of V1 EngineCore design is: The overall idea of V1 EngineCore design is:
- EngineCore is the inner loop. Performance is most critical here - EngineCore is the inner loop. Performance is most critical here
- AsyncLLM is the outer loop. This is overlapped with GPU execution - AsyncLLM is the outer loop. This is overlapped with GPU execution
(ideally), so this is where any "overheads" should be if (ideally), so this is where any "overheads" should be if
...@@ -178,7 +179,7 @@ time" (`time.time()`) to calculate intervals as the former is ...@@ -178,7 +179,7 @@ time" (`time.time()`) to calculate intervals as the former is
unaffected by system clock changes (e.g. from NTP). unaffected by system clock changes (e.g. from NTP).
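A minimal sketch of interval measurement with a monotonic clock follows. It only uses the Python standard library; the `timed` helper is a name made up for this example and is not part of vLLM. Both timestamps are taken in the same process, so their difference is meaningful even if the wall clock is adjusted in between.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) measured with a monotonic clock."""
    start = time.monotonic()
    result = fn(*args, **kwargs)
    # Subtracting two monotonic readings from the same process gives a valid interval.
    elapsed = time.monotonic() - start
    return result, elapsed


result, elapsed = timed(sum, range(1000))
print(f"result={result} elapsed={elapsed:.6f}s")
```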
It's also important to note that monotonic clocks differ between It's also important to note that monotonic clocks differ between
processes - each process has its own reference. point. So it is processes - each process has its own reference point. So it is
meaningless to compare monotonic timestamps from different processes. meaningless to compare monotonic timestamps from different processes.
Therefore, in order to calculate an interval, we must compare two Therefore, in order to calculate an interval, we must compare two
...@@ -343,14 +344,15 @@ vllm:time_to_first_token_seconds_bucket{le="0.1",model_name="meta-llama/Llama-3. ...@@ -343,14 +344,15 @@ vllm:time_to_first_token_seconds_bucket{le="0.1",model_name="meta-llama/Llama-3.
vllm:time_to_first_token_seconds_count{model_name="meta-llama/Llama-3.1-8B-Instruct"} 140.0 vllm:time_to_first_token_seconds_count{model_name="meta-llama/Llama-3.1-8B-Instruct"} 140.0
``` ```
Note - the choice of histogram buckets to be most useful to users !!! note
across a broad set of use cases is not straightforward and will The choice of histogram buckets to be most useful to users
require refinement over time. across a broad set of use cases is not straightforward and will
require refinement over time.
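To make the bucket trade-off concrete, here is a small standard-library sketch of how cumulative `le` ("less than or equal") buckets are counted, in the style of the `_bucket` samples shown above. The bucket bounds and latency samples are invented for illustration and are not vLLM's actual bucket configuration.

```python
from bisect import bisect_left


def cumulative_buckets(observations, upper_bounds):
    """Count observations per cumulative `le` bucket, including the +Inf bucket."""
    bounds = sorted(upper_bounds)
    counts = [0] * (len(bounds) + 1)  # the last slot is the +Inf bucket
    for value in observations:
        # Index of the first bound that is >= value, i.e. the smallest matching `le`.
        counts[bisect_left(bounds, value)] += 1
    running = 0
    result = {}
    for bound, count in zip(bounds + [float("inf")], counts):
        running += count  # buckets are cumulative in the exposition format
        result[bound] = running
    return result


ttft_seconds = [0.03, 0.08, 0.12, 0.4, 2.5]  # made-up samples
buckets = cumulative_buckets(ttft_seconds, [0.05, 0.1, 0.5, 1.0])
print(buckets)
```

With these made-up bounds, the 2.5 s sample is only visible in the `+Inf` bucket, so the histogram says nothing about how far above 1.0 s it was. That loss of resolution is why bucket boundaries need tuning for the latencies a deployment actually sees.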
### Cache Config Info ### Cache Config Info
`prometheus_client` has support for [Info `prometheus_client` has support for
metrics](https://prometheus.github.io/client_python/instrumenting/info/) [Info metrics](https://prometheus.github.io/client_python/instrumenting/info/)
which are equivalent to a `Gauge` whose value is permanently set to 1, which are equivalent to a `Gauge` whose value is permanently set to 1,
but exposes interesting key/value pair information via labels. This is but exposes interesting key/value pair information via labels. This is
used for information about an instance that does not change - so it used for information about an instance that does not change - so it
...@@ -363,14 +365,11 @@ We use this concept for the `vllm:cache_config_info` metric: ...@@ -363,14 +365,11 @@ We use this concept for the `vllm:cache_config_info` metric:
# HELP vllm:cache_config_info Information of the LLMEngine CacheConfig # HELP vllm:cache_config_info Information of the LLMEngine CacheConfig
# TYPE vllm:cache_config_info gauge # TYPE vllm:cache_config_info gauge
vllm:cache_config_info{block_size="16",cache_dtype="auto",calculate_kv_scales="False",cpu_offload_gb="0",enable_prefix_caching="False",gpu_memory_utilization="0.9",...} 1.0 vllm:cache_config_info{block_size="16",cache_dtype="auto",calculate_kv_scales="False",cpu_offload_gb="0",enable_prefix_caching="False",gpu_memory_utilization="0.9",...} 1.0
``` ```
However, `prometheus_client` has [never supported Info metrics in However, `prometheus_client` has
multiprocessing [never supported Info metrics in multiprocessing mode](https://github.com/prometheus/client_python/pull/300) -
mode](https://github.com/prometheus/client_python/pull/300) - for for [unclear reasons](gh-pr:7279#discussion_r1710417152). We
[unclear
reasons](gh-pr:7279#discussion_r1710417152). We
simply use a `Gauge` metric set to 1 and simply use a `Gauge` metric set to 1 and
`multiprocess_mode="mostrecent"` instead. `multiprocess_mode="mostrecent"` instead.
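The idea of an info-style gauge can be sketched without any dependencies: the sample value is always 1, and the configuration travels in the labels. The helper below is hypothetical and only renders one line of text in the Prometheus exposition format. It does not use `prometheus_client` and is not how vLLM registers the metric; the label values are taken from the sample output above.

```python
def render_info_gauge(name: str, labels: dict) -> str:
    """Render one exposition-format sample with value 1.0 and config as labels."""

    def escape(value) -> str:
        # Escape backslashes, double quotes, and newlines inside label values.
        return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

    label_str = ",".join(f'{key}="{escape(value)}"' for key, value in labels.items())
    return f"{name}{{{label_str}}} 1.0"


cache_config = {"block_size": 16, "cache_dtype": "auto", "enable_prefix_caching": False}
line = render_info_gauge("vllm:cache_config_info", cache_config)
print(line)
```

Because every label value is stringified, a change in configuration produces a new label set rather than a new numeric value, which is why this pattern suits information that stays fixed for the lifetime of an instance.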
...@@ -395,11 +394,9 @@ distinguish between per-adapter counts. This should be revisited. ...@@ -395,11 +394,9 @@ distinguish between per-adapter counts. This should be revisited.
Note that `multiprocess_mode="livemostrecent"` is used - the most
recent metric is used, but only from currently running processes.
This was added in This was added in <gh-pr:9477> and there is
<gh-pr:9477> and there is [at least one known user](https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/54).
[at least one known If we revisit this design and deprecate the old metric, we should reduce
user](https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/54). If
we revisit this design and deprecate the old metric, we should reduce
the need for a significant deprecation period by making the change in the need for a significant deprecation period by making the change in
v0 also and asking this project to move to the new metric. v0 also and asking this project to move to the new metric.
...@@ -442,23 +439,20 @@ suddenly (from their perspective) when it is removed, even if there is ...@@ -442,23 +439,20 @@ suddenly (from their perspective) when it is removed, even if there is
an equivalent metric for them to use. an equivalent metric for them to use.
As an example, see how `vllm:avg_prompt_throughput_toks_per_s` was As an example, see how `vllm:avg_prompt_throughput_toks_per_s` was
[deprecated](gh-pr:2764) (with a [deprecated](gh-pr:2764) (with a comment in the code),
comment in the code), [removed](gh-pr:12383), and then [noticed by a user](gh-issue:13218).
[removed](gh-pr:12383), and then
[noticed by a
user](gh-issue:13218).
In general: In general:
1) We should be cautious about deprecating metrics, especially since 1. We should be cautious about deprecating metrics, especially since
it can be hard to predict the user impact. it can be hard to predict the user impact.
2) We should include a prominent deprecation notice in the help string 2. We should include a prominent deprecation notice in the help string
that is included in the `/metrics` output. that is included in the `/metrics` output.
3) We should list deprecated metrics in user-facing documentation and 3. We should list deprecated metrics in user-facing documentation and
release notes. release notes.
4) We should consider hiding deprecated metrics behind a CLI argument 4. We should consider hiding deprecated metrics behind a CLI argument
in order to give administrators [an escape in order to give administrators
hatch](https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/#show-hidden-metrics) [an escape hatch](https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/#show-hidden-metrics)
for some time before deleting them. for some time before deleting them.
See the [deprecation policy](../../contributing/deprecation_policy.md) for See the [deprecation policy](../../contributing/deprecation_policy.md) for
...@@ -474,7 +468,7 @@ removed. ...@@ -474,7 +468,7 @@ removed.
The `vllm:time_in_queue_requests` Histogram metric was added by The `vllm:time_in_queue_requests` Histogram metric was added by
<gh-pr:9659> and its calculation is: <gh-pr:9659> and its calculation is:
``` ```python
self.metrics.first_scheduled_time = now self.metrics.first_scheduled_time = now
self.metrics.time_in_queue = now - self.metrics.arrival_time self.metrics.time_in_queue = now - self.metrics.arrival_time
``` ```
...@@ -482,7 +476,7 @@ The `vllm:time_in_queue_requests` Histogram metric was added by ...@@ -482,7 +476,7 @@ The `vllm:time_in_queue_requests` Histogram metric was added by
Two weeks later, <gh-pr:4464> added `vllm:request_queue_time_seconds` leaving Two weeks later, <gh-pr:4464> added `vllm:request_queue_time_seconds` leaving
us with: us with:
``` ```python
if seq_group.is_finished(): if seq_group.is_finished():
if (seq_group.metrics.first_scheduled_time is not None and if (seq_group.metrics.first_scheduled_time is not None and
seq_group.metrics.first_token_time is not None): seq_group.metrics.first_token_time is not None):
...@@ -517,8 +511,7 @@ cache to complete other requests), we swap kv cache blocks out to CPU ...@@ -517,8 +511,7 @@ cache to complete other requests), we swap kv cache blocks out to CPU
memory. This is also known as "KV cache offloading" and is configured memory. This is also known as "KV cache offloading" and is configured
with `--swap-space` and `--preemption-mode`. with `--swap-space` and `--preemption-mode`.
In v0, [vLLM has long supported beam In v0, [vLLM has long supported beam search](gh-issue:6226). The
search](gh-issue:6226). The
SequenceGroup encapsulated the idea of N Sequences which SequenceGroup encapsulated the idea of N Sequences which
all shared the same prompt kv blocks. This enabled KV cache block all shared the same prompt kv blocks. This enabled KV cache block
sharing between requests, and copy-on-write to do branching. CPU sharing between requests, and copy-on-write to do branching. CPU
...@@ -530,9 +523,8 @@ option than CPU swapping since blocks can be evicted slowly on demand ...@@ -530,9 +523,8 @@ option than CPU swapping since blocks can be evicted slowly on demand
and the part of the prompt that was evicted can be recomputed. and the part of the prompt that was evicted can be recomputed.
SequenceGroup was removed in V1, although a replacement will be SequenceGroup was removed in V1, although a replacement will be
required for "parallel sampling" (`n>1`). [Beam search was moved out of required for "parallel sampling" (`n>1`).
the core (in [Beam search was moved out of the core (in V0)](gh-issue:8306). There was a
V0)](gh-issue:8306). There was a
lot of complex code for a very uncommon feature. lot of complex code for a very uncommon feature.
In V1, with prefix caching being better (zero overhead) and therefore In V1, with prefix caching being better (zero overhead) and therefore
...@@ -547,18 +539,18 @@ Some v0 metrics are only relevant in the context of "parallel ...@@ -547,18 +539,18 @@ Some v0 metrics are only relevant in the context of "parallel
sampling". This is where the `n` parameter in a request is used to sampling". This is where the `n` parameter in a request is used to
request multiple completions from the same prompt. request multiple completions from the same prompt.
As part of adding parallel sampling support in <gh-pr:10980> we should As part of adding parallel sampling support in <gh-pr:10980>, we should
also add these metrics. also add these metrics.
- `vllm:request_params_n` (Histogram) - `vllm:request_params_n` (Histogram)
Observes the value of the 'n' parameter of every finished request. Observes the value of the 'n' parameter of every finished request.
- `vllm:request_max_num_generation_tokens` (Histogram) - `vllm:request_max_num_generation_tokens` (Histogram)
Observes the maximum output length of all sequences in every finished Observes the maximum output length of all sequences in every finished
sequence group. In the absence of parallel sampling, this is sequence group. In the absence of parallel sampling, this is
equivalent to `vllm:request_generation_tokens`. equivalent to `vllm:request_generation_tokens`.
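
To make the two definitions concrete, here is a small sketch of the values each histogram would observe for one finished request. The `FinishedRequest` record and the function name are hypothetical and only illustrate the definitions above; they are not vLLM types.

```python
from dataclasses import dataclass


@dataclass
class FinishedRequest:
    """Hypothetical record: one entry per sampled sequence of a finished request."""
    output_lengths: list[int]  # number of generated tokens in each sequence


def parallel_sampling_observations(request: FinishedRequest) -> dict[str, int]:
    """Return the value each histogram would observe for this request."""
    return {
        # The 'n' parameter equals the number of sequences sampled.
        "vllm:request_params_n": len(request.output_lengths),
        # Longest output among all sequences in the group.
        "vllm:request_max_num_generation_tokens": max(request.output_lengths),
    }


parallel = parallel_sampling_observations(FinishedRequest(output_lengths=[12, 40, 7]))
single = parallel_sampling_observations(FinishedRequest(output_lengths=[25]))
print(parallel)
print(single)
```

With `n=1` the maximum is simply the length of the only sequence, which is why the metric adds no information without parallel sampling.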
### Speculative Decoding ### Speculative Decoding
...@@ -576,26 +568,23 @@ There is a PR under review (<gh-pr:12193>) to add "prompt lookup (ngram)" ...@@ -576,26 +568,23 @@ There is a PR under review (<gh-pr:12193>) to add "prompt lookup (ngram)"
speculative decoding to v1. Other techniques will follow. We should speculative decoding to v1. Other techniques will follow. We should
revisit the v0 metrics in this context. revisit the v0 metrics in this context.
Note - we should probably expose acceptance rate as separate accepted !!! note
and draft counters, like we do for prefix caching hit rate. Efficiency We should probably expose acceptance rate as separate accepted
likely also needs similar treatment. and draft counters, like we do for prefix caching hit rate. Efficiency
likely also needs similar treatment.
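
A rough sketch of why separate counters are attractive: a consumer can derive the acceptance rate over any window from two scrapes of each counter, instead of relying on a pre-averaged gauge. The function and argument names below are hypothetical.

```python
def acceptance_rate(accepted_then: float, accepted_now: float,
                    draft_then: float, draft_now: float) -> float:
    """Acceptance rate over a window, from two scrapes of two monotonic counters."""
    draft_delta = draft_now - draft_then
    if draft_delta <= 0:
        # No draft tokens were proposed in the window (or a counter was reset).
        return 0.0
    return (accepted_now - accepted_then) / draft_delta


# 100 draft tokens proposed in the window, 60 of them accepted.
rate = acceptance_rate(accepted_then=100, accepted_now=160,
                       draft_then=200, draft_now=300)
print(rate)
```

In PromQL the same calculation would typically be a ratio of two `rate()` expressions, one per counter.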
### Autoscaling and Load-balancing ### Autoscaling and Load-balancing
A common use case for our metrics is to support automated scaling of A common use case for our metrics is to support automated scaling of
vLLM instances. vLLM instances.
For related discussion from the [Kubernetes Serving Working For related discussion from the
Group](https://github.com/kubernetes/community/tree/master/wg-serving), [Kubernetes Serving Working Group](https://github.com/kubernetes/community/tree/master/wg-serving),
see: see:
- [Standardizing Large Model Server Metrics in - [Standardizing Large Model Server Metrics in Kubernetes](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk)
Kubernetes](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk) - [Benchmarking LLM Workloads for Performance Evaluation and Autoscaling in Kubernetes](https://docs.google.com/document/d/1k4Q4X14hW4vftElIuYGDu5KDe2LtV1XammoG-Xi3bbQ)
- [Benchmarking LLM Workloads for Performance Evaluation and - [Inference Perf](https://github.com/kubernetes-sigs/wg-serving/tree/main/proposals/013-inference-perf)
Autoscaling in
Kubernetes](https://docs.google.com/document/d/1k4Q4X14hW4vftElIuYGDu5KDe2LtV1XammoG-Xi3bbQ)
- [Inference
Perf](https://github.com/kubernetes-sigs/wg-serving/tree/main/proposals/013-inference-perf)
- <gh-issue:5041> and <gh-pr:12726>. - <gh-issue:5041> and <gh-pr:12726>.
This is a non-trivial topic. Consider this comment from Rob: This is a non-trivial topic. Consider this comment from Rob:
...@@ -619,19 +608,16 @@ should judge an instance as approaching saturation: ...@@ -619,19 +608,16 @@ should judge an instance as approaching saturation:
Our approach to naming metrics probably deserves to be revisited: Our approach to naming metrics probably deserves to be revisited:
1. The use of colons in metric names seems contrary to ["colons are 1. The use of colons in metric names seems contrary to
reserved for user defined recording ["colons are reserved for user defined recording rules"](https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels).
rules"](https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels)
2. Most of our metrics follow the convention of ending with units, but 2. Most of our metrics follow the convention of ending with units, but
not all do. not all do.
3. Some of our metric names end with `_total`: 3. Some of our metric names end with `_total`:
``` If there is a suffix of `_total` on the metric name, it will be removed. When
If there is a suffix of `_total` on the metric name, it will be removed. When exposing the time series for counter, a `_total` suffix will be added. This is
exposing the time series for counter, a `_total` suffix will be added. This is for compatibility between OpenMetrics and the Prometheus text format, as OpenMetrics
for compatibility between OpenMetrics and the Prometheus text format, as OpenMetrics requires the `_total` suffix.
requires the `_total` suffix.
```
### Adding More Metrics ### Adding More Metrics
...@@ -642,8 +628,7 @@ There is no shortage of ideas for new metrics: ...@@ -642,8 +628,7 @@ There is no shortage of ideas for new metrics:
- Proposals arising from specific use cases, like the Kubernetes - Proposals arising from specific use cases, like the Kubernetes
auto-scaling topic above auto-scaling topic above
- Proposals that might arise out of standardisation efforts like - Proposals that might arise out of standardisation efforts like
[OpenTelemetry Semantic Conventions for Gen [OpenTelemetry Semantic Conventions for Gen AI](https://github.com/open-telemetry/semantic-conventions/tree/main/docs/gen-ai).
AI](https://github.com/open-telemetry/semantic-conventions/tree/main/docs/gen-ai).
We should be cautious in our approach to adding new metrics. While We should be cautious in our approach to adding new metrics. While
metrics are often relatively straightforward to add: metrics are often relatively straightforward to add:
...@@ -668,19 +653,14 @@ fall under the more general heading of "Observability". ...@@ -668,19 +653,14 @@ fall under the more general heading of "Observability".
v0 has support for OpenTelemetry tracing: v0 has support for OpenTelemetry tracing:
- Added by <gh-pr:4687> - Added by <gh-pr:4687>
- Configured with `--otlp-traces-endpoint` and - Configured with `--otlp-traces-endpoint` and `--collect-detailed-traces`
`--collect-detailed-traces` - [OpenTelemetry blog post](https://opentelemetry.io/blog/2024/llm-observability/)
- [OpenTelemetry blog - [User-facing docs](../../examples/online_serving/opentelemetry.md)
post](https://opentelemetry.io/blog/2024/llm-observability/) - [Blog post](https://medium.com/@ronen.schaffer/follow-the-trail-supercharging-vllm-with-opentelemetry-distributed-tracing-aa655229b46f)
- [User-facing - [IBM product docs](https://www.ibm.com/docs/en/instana-observability/current?topic=mgaa-monitoring-large-language-models-llms-vllm-public-preview)
docs](https://docs.vllm.ai/en/latest/examples/opentelemetry.html)
- [Blog
post](https://medium.com/@ronen.schaffer/follow-the-trail-supercharging-vllm-with-opentelemetry-distributed-tracing-aa655229b46f)
- [IBM product
docs](https://www.ibm.com/docs/en/instana-observability/current?topic=mgaa-monitoring-large-language-models-llms-vllm-public-preview)
OpenTelemetry has a [Gen AI Working OpenTelemetry has a
Group](https://github.com/open-telemetry/community/blob/main/projects/gen-ai.md). [Gen AI Working Group](https://github.com/open-telemetry/community/blob/main/projects/gen-ai.md).
Since metrics is a big enough topic on its own, we are going to tackle Since metrics is a big enough topic on its own, we are going to tackle
the topic of tracing in v1 separately. the topic of tracing in v1 separately.
...@@ -699,7 +679,7 @@ These metrics are only enabled when OpenTelemetry tracing is enabled ...@@ -699,7 +679,7 @@ These metrics are only enabled when OpenTelemetry tracing is enabled
and if `--collect-detailed-traces=all/model/worker` is used. The and if `--collect-detailed-traces=all/model/worker` is used. The
documentation for this option states: documentation for this option states:
> collect detailed traces for the specified "modules. This involves > collect detailed traces for the specified modules. This involves
> use of possibly costly and or blocking operations and hence might > use of possibly costly and or blocking operations and hence might
> have a performance impact. > have a performance impact.
......
...@@ -31,7 +31,7 @@ Each P/D instance periodically sends a heartbeat packet to the Proxy/Router (cur ...@@ -31,7 +31,7 @@ Each P/D instance periodically sends a heartbeat packet to the Proxy/Router (cur
## KV Cache Transfer Methods ## KV Cache Transfer Methods
There are three methods for KVcache transfer: PUT, GET, and PUT_ASYNC. These methods can be specified using the `--kv-transfer-config` and `kv_connector_extra_config` parameters, specifically through the `send_type` field. Both PUT and PUT_ASYNC involve the P instance actively sending KVcache to the D instance. The difference is that PUT is a synchronous transfer method that blocks the main process, while PUT_ASYNC is an asynchronous transfer method. PUT_ASYNC uses a dedicated thread for sending KVcache, which means it does not block the main process. In contrast, the GET method involves the P instance saving the KVcache to the memory buffer after computing the prefill. The D instance then actively retrieves the computed KVcache from the P instance once it has allocated space for the KVcache. There are three methods for KVCache transfer: PUT, GET, and PUT_ASYNC. These methods can be specified using the `--kv-transfer-config` and `kv_connector_extra_config` parameters, specifically through the `send_type` field. Both PUT and PUT_ASYNC involve the P instance actively sending KVCache to the D instance. The difference is that PUT is a synchronous transfer method that blocks the main process, while PUT_ASYNC is an asynchronous transfer method. PUT_ASYNC uses a dedicated thread for sending KVCache, which means it does not block the main process. In contrast, the GET method involves the P instance saving the KVCache to the memory buffer after computing the prefill. The D instance then actively retrieves the computed KVCache from the P instance once it has allocated space for the KVCache.
Experimental results have shown that the performance of these methods, from highest to lowest, is as follows: PUT_ASYNC → GET → PUT. Experimental results have shown that the performance of these methods, from highest to lowest, is as follows: PUT_ASYNC → GET → PUT.
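
As an illustrative sketch, the JSON passed to `--kv-transfer-config` can be assembled programmatically so that `send_type` is validated against the three methods named above. The field names are the ones used in the commands in this document; the helper `build_kv_transfer_config` itself is hypothetical and not part of vLLM.

```python
import json

# The three transfer methods named in the text.
SEND_TYPES = ("PUT", "GET", "PUT_ASYNC")


def build_kv_transfer_config(kv_role, kv_buffer_size, kv_port,
                             proxy_ip, proxy_port, http_port, send_type):
    """Return the JSON string to pass to --kv-transfer-config."""
    if send_type not in SEND_TYPES:
        raise ValueError(f"send_type must be one of {SEND_TYPES}, got {send_type!r}")
    config = {
        "kv_connector": "P2pNcclConnector",
        "kv_role": kv_role,
        "kv_buffer_size": kv_buffer_size,
        "kv_port": kv_port,
        "kv_connector_extra_config": {
            "proxy_ip": proxy_ip,
            "proxy_port": proxy_port,
            "http_port": http_port,
            "send_type": send_type,
        },
    }
    return json.dumps(config, separators=(",", ":"))


decode_config = build_kv_transfer_config(
    kv_role="kv_consumer", kv_buffer_size="8e9", kv_port="22001",
    proxy_ip="10.0.1.1", proxy_port="30001", http_port="20002",
    send_type="PUT_ASYNC",
)
print(decode_config)
```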
...@@ -39,13 +39,13 @@ Experimental results have shown that the performance of these methods, from high ...@@ -39,13 +39,13 @@ Experimental results have shown that the performance of these methods, from high
As long as the address of the counterpart is known, point-to-point KV cache transfer (using NCCL) can be performed, without being constrained by rank and world size. This supports dynamic scaling (expansion and contraction) of instances with PD disaggregation, which means that adding or removing P/D instances does not require a full system restart. As long as the address of the counterpart is known, point-to-point KV cache transfer (using NCCL) can be performed, without being constrained by rank and world size. This supports dynamic scaling (expansion and contraction) of instances with PD disaggregation, which means that adding or removing P/D instances does not require a full system restart.
Each P/D instance only needs to create a single `P2pNcclEngine` instance. This instance maintains a ZMQ Server, which runs a dedicated thread to listen on the `zmq_addr` address and receive control flow requests from other instances. These requests include requests to establish an NCCL connection and requests to send KVcache metadata (such as tensor shapes and data types). However, it does not actually transmit the KVcache data itself. Each P/D instance only needs to create a single `P2pNcclEngine` instance. This instance maintains a ZMQ Server, which runs a dedicated thread to listen on the `zmq_addr` address and receive control flow requests from other instances. These requests include requests to establish an NCCL connection and requests to send KVCache metadata (such as tensor shapes and data types). However, it does not actually transmit the KVCache data itself.
When a P instance and a D instance transmit KVcache for the first time, they need to establish a ZMQ connection and an NCCL group. For subsequent KVcache transmissions, this ZMQ connection and NCCL group are reused. The NCCL group consists of only two ranks, meaning the world size is equal to 2. This design is intended to support dynamic scaling, which means that adding or removing P/D instances does not require a full system restart. As long as the address of the counterpart is known, point-to-point KVcache transmission can be performed, without being restricted by rank or world size. When a P instance and a D instance transmit KVCache for the first time, they need to establish a ZMQ connection and an NCCL group. For subsequent KVCache transmissions, this ZMQ connection and NCCL group are reused. The NCCL group consists of only two ranks, meaning the world size is equal to 2. This design is intended to support dynamic scaling, which means that adding or removing P/D instances does not require a full system restart. As long as the address of the counterpart is known, point-to-point KVCache transmission can be performed, without being restricted by rank or world size.
## NCCL Group Topology ## NCCL Group Topology
Currently, only symmetric TP (Tensor Parallelism) methods are supported for KVcache transmission. Asymmetric TP and PP (Pipeline Parallelism) methods will be supported in the future. Figure 2 illustrates the 1P2D setup, where each instance has a TP (Tensor Parallelism) degree of 2. There are a total of 7 NCCL groups: three vLLM instances each have one NCCL group with TP=2. Additionally, the 0th GPU card of the P instance establishes an NCCL group with the 0th GPU card of each D instance. Similarly, the 1st GPU card of the P instance establishes an NCCL group with the 1st GPU card of each D instance. Currently, only symmetric TP (Tensor Parallelism) methods are supported for KVCache transmission. Asymmetric TP and PP (Pipeline Parallelism) methods will be supported in the future. Figure 2 illustrates the 1P2D setup, where each instance has a TP (Tensor Parallelism) degree of 2. There are a total of 7 NCCL groups: three vLLM instances each have one NCCL group with TP=2. Additionally, the 0th GPU card of the P instance establishes an NCCL group with the 0th GPU card of each D instance. Similarly, the 1st GPU card of the P instance establishes an NCCL group with the 1st GPU card of each D instance.
![image2](https://github.com/user-attachments/assets/837e61d6-365e-4cbf-8640-6dd7ab295b36) ![image2](https://github.com/user-attachments/assets/837e61d6-365e-4cbf-8640-6dd7ab295b36)
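
The count of 7 in the 1P2D example can be reproduced with a short calculation. The sketch below assumes symmetric TP, one intra-instance TP group per vLLM instance, and that every P instance has already exchanged KV cache with every D instance (the cross-instance groups are only created on first transfer, so this is an upper bound). The function name is hypothetical.

```python
def nccl_group_count(num_prefill: int, num_decode: int, tp_size: int) -> int:
    """Upper bound on the number of NCCL groups in an xPyD deployment."""
    # One TP group inside each vLLM instance.
    intra_instance = num_prefill + num_decode
    # GPU i of each P instance pairs with GPU i of each D instance.
    cross_instance = num_prefill * num_decode * tp_size
    return intra_instance + cross_instance


groups_1p2d = nccl_group_count(num_prefill=1, num_decode=2, tp_size=2)
print(groups_1p2d)  # 3 intra-instance groups + 4 two-rank P2P groups
```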
...@@ -53,33 +53,17 @@ Each NCCL group occupies a certain amount of GPU memory buffer for communication ...@@ -53,33 +53,17 @@ Each NCCL group occupies a certain amount of GPU memory buffer for communication
## GPU Memory Buffer and Tensor Memory Pool ## GPU Memory Buffer and Tensor Memory Pool
The trade-off in the size of the memory buffer is as follows: For P instances, the memory buffer is not required in PUT and PUT_ASYNC modes, but it is necessary in GET mode. For D instances, a memory buffer is needed in all three modes. The memory buffer for D instances should not be too large. Similarly, for P instances in GET mode, the memory buffer should also not be too large. The memory buffer of D instances is used to temporarily store KVcache sent by P instances. If it is too large, it will reduce the KVcache space available for normal inference by D instances, thereby decreasing the inference batch size and ultimately leading to a reduction in output throughput. The size of the memory buffer is configured by the parameter `kv_buffer_size`, measured in bytes, and is typically set to 5%~10% of the memory size. The trade-off in the size of the memory buffer is as follows: For P instances, the memory buffer is not required in PUT and PUT_ASYNC modes, but it is necessary in GET mode. For D instances, a memory buffer is needed in all three modes. The memory buffer for D instances should not be too large. Similarly, for P instances in GET mode, the memory buffer should also not be too large. The memory buffer of D instances is used to temporarily store KVCache sent by P instances. If it is too large, it will reduce the KVCache space available for normal inference by D instances, thereby decreasing the inference batch size and ultimately leading to a reduction in output throughput. The size of the memory buffer is configured by the parameter `kv_buffer_size`, measured in bytes, and is typically set to 5%~10% of the memory size.
If the `--max-num-seqs` parameter for P instances is set to a large value, due to the large batch size, P instances will generate a large amount of KVcache simultaneously. This may exceed the capacity of the memory buffer of D instances, resulting in KVcache loss. Once KVcache is lost, D instances need to recompute Prefill, which is equivalent to performing Prefill twice. Consequently, the time-to-first-token (TTFT) will significantly increase, leading to degraded performance. If the `--max-num-seqs` parameter for P instances is set to a large value, due to the large batch size, P instances will generate a large amount of KVCache simultaneously. This may exceed the capacity of the memory buffer of D instances, resulting in KVCache loss. Once KVCache is lost, D instances need to recompute Prefill, which is equivalent to performing Prefill twice. Consequently, the time-to-first-token (TTFT) will significantly increase, leading to degraded performance.
To address the above issues, I have designed and developed a local Tensor memory pool for storing KVcache, inspired by the buddy system used in Linux memory modules. Since the memory is sufficiently large, typically in the TB range on servers, there is no need to consider prefix caching or using block-based designs to reuse memory, thereby saving space. When the memory buffer is insufficient, KVcache can be directly stored in the Tensor memory pool, and D instances can subsequently retrieve KVcache from it. The read and write speed is that of PCIe, with PCIe 4.0 having a speed of approximately 21 GB/s, which is usually faster than the Prefill speed. Otherwise, solutions like Mooncake and lmcache would not be necessary. The Tensor memory pool acts as a flood diversion area, typically unused except during sudden traffic surges. In the worst-case scenario, my solution performs no worse than the normal situation with a Cache store. To address the above issues, I have designed and developed a local Tensor memory pool for storing KVCache, inspired by the buddy system used in Linux memory modules. Since the memory is sufficiently large, typically in the TB range on servers, there is no need to consider prefix caching or using block-based designs to reuse memory, thereby saving space. When the memory buffer is insufficient, KVCache can be directly stored in the Tensor memory pool, and D instances can subsequently retrieve KVCache from it. The read and write speed is that of PCIe, with PCIe 4.0 having a speed of approximately 21 GB/s, which is usually faster than the Prefill speed. Otherwise, solutions like Mooncake and lmcache would not be necessary. The Tensor memory pool acts as a flood diversion area, typically unused except during sudden traffic surges. In the worst-case scenario, my solution performs no worse than the normal situation with a Cache store.
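
As a worked example of the 5%~10% guideline for `kv_buffer_size`, the sketch below computes the bounds in bytes. It assumes "memory size" refers to the GPU's memory capacity, which the text does not state explicitly; the function name and the 80 GB figure are illustrative.

```python
def kv_buffer_size_range(gpu_memory_bytes: int) -> tuple[int, int]:
    """Return the (5%, 10%) bounds for kv_buffer_size, in bytes."""
    # Integer arithmetic avoids floating-point rounding in the result.
    low = gpu_memory_bytes * 5 // 100
    high = gpu_memory_bytes * 10 // 100
    return low, high


# Example: a GPU with 80 GB (80 * 10**9 bytes) of memory.
low, high = kv_buffer_size_range(80 * 10**9)
print(low, high)
```

Under that assumption, the upper bound for an 80 GB device is 8 * 10**9 bytes, in line with the `"kv_buffer_size":"8e9"` used for the D instances in the commands in this document.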
# Install vLLM # Install vLLM
??? Commands ```shell
pip install "vllm>=0.9.2"
```shell ```
# Enter the home directory or your working directory.
cd /home
# Download the installation package, and I will update the commit-id in time. You can directly copy the command.
wget https://vllm-wheels.s3.us-west-2.amazonaws.com/9112b443a042d8d815880b8780633882ad32b183/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
# Download the code repository.
git clone -b xpyd-v1 https://github.com/Abatom/vllm.git
cd vllm
# Set the installation package path.
export VLLM_PRECOMPILED_WHEEL_LOCATION=/home/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
# installation
pip install -e . -v
```
# Run xPyD # Run xPyD
...@@ -90,7 +74,7 @@ To address the above issues, I have designed and developed a local Tensor memory ...@@ -90,7 +74,7 @@ To address the above issues, I have designed and developed a local Tensor memory
- You may need to modify the `kv_buffer_size` and `port` in the following commands (if there is a conflict). - You may need to modify the `kv_buffer_size` and `port` in the following commands (if there is a conflict).
- `PUT_ASYNC` offers the best performance and should be prioritized. - `PUT_ASYNC` offers the best performance and should be prioritized.
- The `--port` must be consistent with the `http_port` in the `--kv-transfer-config`. - The `--port` must be consistent with the `http_port` in the `--kv-transfer-config`.
- The `disagg_prefill_proxy_xpyd.py` script will use port 10001 (for receiving client requests) and port 30001 (for receiving service discovery from P and D instances). - The `disagg_proxy_p2p_nccl_xpyd.py` script will use port 10001 (for receiving client requests) and port 30001 (for receiving service discovery from P and D instances).
- The node running the proxy must have `quart` installed. - The node running the proxy must have `quart` installed.
- Supports multiple nodes; you just need to modify the `proxy_ip` and `proxy_port` in `--kv-transfer-config`. - Supports multiple nodes; you just need to modify the `proxy_ip` and `proxy_port` in `--kv-transfer-config`.
- In the following examples, it is assumed that **the proxy's IP is 10.0.1.1**. - In the following examples, it is assumed that **the proxy's IP is 10.0.1.1**.
...@@ -100,18 +84,18 @@ To address the above issues, I have designed and developed a local Tensor memory ...@@ -100,18 +84,18 @@ To address the above issues, I have designed and developed a local Tensor memory
### Proxy (e.g. 10.0.1.1) ### Proxy (e.g. 10.0.1.1)
```shell ```shell
cd {your vllm directory}/examples/online_serving/disagg_xpyd/ cd {your vllm directory}/examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/
python3 disagg_prefill_proxy_xpyd.py & python3 disagg_proxy_p2p_nccl_xpyd.py &
``` ```
### Prefill1 (e.g. 10.0.1.2 or 10.0.1.1) ### Prefill1 (e.g. 10.0.1.2 or 10.0.1.1)
??? Command ??? console "Command"
```shell ```shell
VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=0 vllm serve {your model directory} \ VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=0 vllm serve {your model directory} \
--host 0.0.0.0 \ --host 0.0.0.0 \
--port 20005 \ --port 20001 \
--tensor-parallel-size 1 \ --tensor-parallel-size 1 \
--seed 1024 \ --seed 1024 \
--served-model-name base_model \ --served-model-name base_model \
...@@ -123,17 +107,17 @@ python3 disagg_prefill_proxy_xpyd.py & ...@@ -123,17 +107,17 @@ python3 disagg_prefill_proxy_xpyd.py &
--gpu-memory-utilization 0.9 \ --gpu-memory-utilization 0.9 \
--disable-log-request \ --disable-log-request \
--kv-transfer-config \ --kv-transfer-config \
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"21001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20005","send_type":"PUT_ASYNC","nccl_num_channels":"16"}}' > /var/vllm.log 2>&1 & '{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"21001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20001"}}' > /var/vllm.log 2>&1 &
``` ```
### Decode1 (e.g. 10.0.1.3 or 10.0.1.1) ### Decode1 (e.g. 10.0.1.3 or 10.0.1.1)
??? Command ??? console "Command"
```shell ```shell
VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=1 vllm serve {your model directory} \ VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=1 vllm serve {your model directory} \
--host 0.0.0.0 \ --host 0.0.0.0 \
--port 20009 \ --port 20002 \
--tensor-parallel-size 1 \ --tensor-parallel-size 1 \
--seed 1024 \ --seed 1024 \
--served-model-name base_model \ --served-model-name base_model \
...@@ -145,12 +129,12 @@ python3 disagg_prefill_proxy_xpyd.py & ...@@ -145,12 +129,12 @@ python3 disagg_prefill_proxy_xpyd.py &
--gpu-memory-utilization 0.7 \ --gpu-memory-utilization 0.7 \
--disable-log-request \ --disable-log-request \
--kv-transfer-config \ --kv-transfer-config \
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"22001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20009","send_type":"PUT_ASYNC","nccl_num_channels":"16"}}' > /var/vllm.log 2>&1 & '{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"22001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20002"}}' > /var/vllm.log 2>&1 &
``` ```
### Decode2 (e.g. 10.0.1.4 or 10.0.1.1) ### Decode2 (e.g. 10.0.1.4 or 10.0.1.1)
??? Command ??? console "Command"
```shell ```shell
VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=2 vllm serve {your model directory} \ VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=2 vllm serve {your model directory} \
...@@ -167,17 +151,17 @@ python3 disagg_prefill_proxy_xpyd.py & ...@@ -167,17 +151,17 @@ python3 disagg_prefill_proxy_xpyd.py &
--gpu-memory-utilization 0.7 \ --gpu-memory-utilization 0.7 \
--disable-log-request \ --disable-log-request \
--kv-transfer-config \ --kv-transfer-config \
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"23001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20003","send_type":"PUT_ASYNC","nccl_num_channels":"16"}}' > /var/vllm.log 2>&1 & '{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"23001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20003"}}' > /var/vllm.log 2>&1 &
``` ```
### Decode3 (e.g. 10.0.1.5 or 10.0.1.1) ### Decode3 (e.g. 10.0.1.5 or 10.0.1.1)
??? Command ??? console "Command"
```shell ```shell
VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=3 vllm serve {your model directory} \ VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=3 vllm serve {your model directory} \
--host 0.0.0.0 \ --host 0.0.0.0 \
--port 20008 \ --port 20004 \
--tensor-parallel-size 1 \ --tensor-parallel-size 1 \
--seed 1024 \ --seed 1024 \
--served-model-name base_model \ --served-model-name base_model \
...@@ -189,7 +173,7 @@ python3 disagg_prefill_proxy_xpyd.py & ...@@ -189,7 +173,7 @@ python3 disagg_prefill_proxy_xpyd.py &
--gpu-memory-utilization 0.7 \ --gpu-memory-utilization 0.7 \
--disable-log-request \ --disable-log-request \
--kv-transfer-config \ --kv-transfer-config \
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"24001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20008","send_type":"PUT_ASYNC","nccl_num_channels":"16"}}' > /var/vllm.log 2>&1 & '{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"24001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20004"}}' > /var/vllm.log 2>&1 &
``` ```
## Run 3P1D ## Run 3P1D
...@@ -197,18 +181,18 @@ python3 disagg_prefill_proxy_xpyd.py & ...@@ -197,18 +181,18 @@ python3 disagg_prefill_proxy_xpyd.py &
### Proxy (e.g. 10.0.1.1) ### Proxy (e.g. 10.0.1.1)
```shell ```shell
cd {your vllm directory}/examples/online_serving/disagg_xpyd/ cd {your vllm directory}/examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/
python3 disagg_prefill_proxy_xpyd.py & python3 disagg_proxy_p2p_nccl_xpyd.py &
``` ```
### Prefill1 (e.g. 10.0.1.2 or 10.0.1.1) ### Prefill1 (e.g. 10.0.1.2 or 10.0.1.1)
??? Command ??? console "Command"
```shell ```shell
VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=0 vllm serve {your model directory} \ VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=0 vllm serve {your model directory} \
--host 0.0.0.0 \ --host 0.0.0.0 \
--port 20005 \ --port 20001 \
--tensor-parallel-size 1 \ --tensor-parallel-size 1 \
--seed 1024 \ --seed 1024 \
--served-model-name base_model \ --served-model-name base_model \
...@@ -220,17 +204,17 @@ python3 disagg_prefill_proxy_xpyd.py & ...@@ -220,17 +204,17 @@ python3 disagg_prefill_proxy_xpyd.py &
--gpu-memory-utilization 0.9 \ --gpu-memory-utilization 0.9 \
--disable-log-request \ --disable-log-request \
--kv-transfer-config \ --kv-transfer-config \
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"21001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20005","send_type":"PUT_ASYNC","nccl_num_channels":"16"}}' > /var/vllm.log 2>&1 & '{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"21001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20001"}}' > /var/vllm.log 2>&1 &
``` ```
### Prefill2 (e.g. 10.0.1.3 or 10.0.1.1) ### Prefill2 (e.g. 10.0.1.3 or 10.0.1.1)
??? Command ??? console "Command"
```shell ```shell
VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=1 vllm serve {your model directory} \ VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=1 vllm serve {your model directory} \
--host 0.0.0.0 \ --host 0.0.0.0 \
--port 20009 \ --port 20002 \
--tensor-parallel-size 1 \ --tensor-parallel-size 1 \
--seed 1024 \ --seed 1024 \
--served-model-name base_model \ --served-model-name base_model \
...@@ -242,12 +226,12 @@ python3 disagg_prefill_proxy_xpyd.py & ...@@ -242,12 +226,12 @@ python3 disagg_prefill_proxy_xpyd.py &
--gpu-memory-utilization 0.9 \ --gpu-memory-utilization 0.9 \
--disable-log-request \ --disable-log-request \
--kv-transfer-config \ --kv-transfer-config \
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"22001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20009","send_type":"PUT_ASYNC","nccl_num_channels":"16"}}' > /var/vllm.log 2>&1 & '{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"22001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20002"}}' > /var/vllm.log 2>&1 &
``` ```
### Prefill3 (e.g. 10.0.1.4 or 10.0.1.1) ### Prefill3 (e.g. 10.0.1.4 or 10.0.1.1)
??? Command ??? console "Command"
```shell ```shell
VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=2 vllm serve {your model directory} \ VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=2 vllm serve {your model directory} \
...@@ -264,17 +248,17 @@ python3 disagg_prefill_proxy_xpyd.py & ...@@ -264,17 +248,17 @@ python3 disagg_prefill_proxy_xpyd.py &
--gpu-memory-utilization 0.9 \ --gpu-memory-utilization 0.9 \
--disable-log-request \ --disable-log-request \
--kv-transfer-config \ --kv-transfer-config \
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"23001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20003","send_type":"PUT_ASYNC","nccl_num_channels":"16"}}' > /var/vllm.log 2>&1 & '{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"23001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20003"}}' > /var/vllm.log 2>&1 &
``` ```
### Decode1 (e.g. 10.0.1.5 or 10.0.1.1) ### Decode1 (e.g. 10.0.1.5 or 10.0.1.1)
??? Command ??? console "Command"
```shell ```shell
VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=3 vllm serve {your model directory} \ VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=3 vllm serve {your model directory} \
--host 0.0.0.0 \ --host 0.0.0.0 \
--port 20008 \ --port 20004 \
--tensor-parallel-size 1 \ --tensor-parallel-size 1 \
--seed 1024 \ --seed 1024 \
--served-model-name base_model \ --served-model-name base_model \
...@@ -286,7 +270,7 @@ python3 disagg_prefill_proxy_xpyd.py & ...@@ -286,7 +270,7 @@ python3 disagg_prefill_proxy_xpyd.py &
--gpu-memory-utilization 0.7 \ --gpu-memory-utilization 0.7 \
--disable-log-request \ --disable-log-request \
--kv-transfer-config \ --kv-transfer-config \
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"24001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20008","send_type":"PUT_ASYNC","nccl_num_channels":"16"}}' > /var/vllm.log 2>&1 & '{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"24001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20004"}}' > /var/vllm.log 2>&1 &
``` ```
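The `--kv-transfer-config` strings in the commands above differ only in role, buffer size, and ports, so they can be generated rather than typed by hand. The following is a minimal sketch, not part of vLLM: the helper name `kv_transfer_config` is hypothetical, and the port layout is assumed to be the 3P1D one shown on the new side of this diff (HTTP ports 20001 to 20004).

```python
import json


def kv_transfer_config(role, kv_port, http_port,
                       proxy_ip="10.0.1.1", proxy_port=30001):
    """Build the JSON string passed to --kv-transfer-config (hypothetical helper)."""
    # Buffer sizes are copied from the commands above:
    # producers (prefill) use "1e1", consumers (decode) use "8e9".
    buffer_size = "1e1" if role == "kv_producer" else "8e9"
    config = {
        "kv_connector": "P2pNcclConnector",
        "kv_role": role,
        "kv_buffer_size": buffer_size,
        "kv_port": str(kv_port),
        "kv_connector_extra_config": {
            "proxy_ip": proxy_ip,
            "proxy_port": str(proxy_port),
            "http_port": str(http_port),
        },
    }
    # Compact separators reproduce the whitespace-free form used in the docs.
    return json.dumps(config, separators=(",", ":"))


# Assumed 3P1D layout: (role, kv_port, http_port) for Prefill1..3 and Decode1.
instances = [
    ("kv_producer", 21001, 20001),
    ("kv_producer", 22001, 20002),
    ("kv_producer", 23001, 20003),
    ("kv_consumer", 24001, 20004),
]
configs = [kv_transfer_config(*instance) for instance in instances]

for config in configs:
    print(config)
```

Each printed string can be pasted, in single quotes, after `--kv-transfer-config`. The `http_port` in each string has to match the `--port` of the same `vllm serve` command, which is the pairing this diff changes.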
# Single request # Single request
...@@ -304,7 +288,7 @@ curl -X POST -s http://10.0.1.1:10001/v1/completions \ ...@@ -304,7 +288,7 @@ curl -X POST -s http://10.0.1.1:10001/v1/completions \
# Benchmark # Benchmark
??? Command ??? console "Command"
```shell ```shell
python3 benchmark_serving.py \ python3 benchmark_serving.py \
...@@ -334,24 +318,6 @@ pgrep python | xargs kill -9 && pkill -f python ...@@ -334,24 +318,6 @@ pgrep python | xargs kill -9 && pkill -f python
# Test data # Test data
## **Scenario 1**: 1K input & 1K output tokens, E2E P99 latency ~20s ## **Scenario**: 1K input & 200 output tokens, E2E P99 latency ~2s
- **1P5D (6×A800) vs vLLM (1×A800)**:
- Throughput ↑7.2% (1085 → 6979/6) ![testdata](https://github.com/user-attachments/assets/cef0953b-4567-4bf9-b940-405b92a28eb1)
- ITL (P99) ↓81.3% (120ms → 22.9ms)
- TTFT (P99) ↑26.8% (175ms → 222ms)
- TPOT: No change
- **1P6D (7×A800) vs vLLM (1×A800)**:
- Throughput ↑9.6% (1085 → 8329/7)
- ITL (P99) ↓81.0% (120ms → 22.7ms)
- TTFT (P99) ↑210% (175ms → 543ms)
- TPOT: No change
## **Scenario 2**: 1K input & 200 output tokens, E2E P99 latency ~4s
- **1P1D (2×A800) vs vLLM (1×A800)**:
- Throughput ↑37.4% (537 → 1476/2)
- ITL (P99) ↓81.8% (127ms → 23.1ms)
- TTFT (P99) ↑41.8% (160ms → 227ms)
- TPOT: No change
![testdata](https://github.com/user-attachments/assets/f791bfc7-9f3d-4e5c-9171-a42f9f4da627)
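The throughput entries in the removed lists appear to compare per-GPU throughput: the xPyD total divided by the GPU count, against the single-GPU vLLM baseline. That reading of expressions such as `6979/6` is an assumption; the short sketch below redoes the arithmetic under it, with a hypothetical helper name.

```python
def per_gpu_change(baseline, total, num_gpus):
    """Percent change in per-GPU throughput versus a single-GPU baseline."""
    per_gpu = total / num_gpus
    return (per_gpu / baseline - 1.0) * 100.0


# Inputs are the (baseline, total, GPU count) values quoted in the lists above.
results = {
    "1P5D": per_gpu_change(1085, 6979, 6),
    "1P6D": per_gpu_change(1085, 8329, 7),
    "1P1D": per_gpu_change(537, 1476, 2),
}

for name, pct in results.items():
    print(f"{name}: {pct:+.2f}%")
```

Under this reading the gains are per GPU, not in aggregate, which is why adding decode instances raises the figure only modestly.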
...@@ -28,7 +28,7 @@ A unique aspect of vLLM's `torch.compile` integration, is that we guarantee all ...@@ -28,7 +28,7 @@ A unique aspect of vLLM's `torch.compile` integration, is that we guarantee all
In the very verbose logs, we can see: In the very verbose logs, we can see:
??? Logs ??? console "Logs"
```text ```text
DEBUG 03-07 03:06:52 [decorators.py:203] Start compiling function <code object forward at 0x7f08acf40c90, file "xxx/vllm/model_executor/models/llama.py", line 339> DEBUG 03-07 03:06:52 [decorators.py:203] Start compiling function <code object forward at 0x7f08acf40c90, file "xxx/vllm/model_executor/models/llama.py", line 339>
...@@ -110,7 +110,7 @@ Then it will also compile a specific kernel just for batch size `1, 2, 4, 8`. At ...@@ -110,7 +110,7 @@ Then it will also compile a specific kernel just for batch size `1, 2, 4, 8`. At
When all the shapes are known, `torch.compile` can compare different configs, and often find some better configs to run the kernel. For example, we can see the following log: When all the shapes are known, `torch.compile` can compare different configs, and often find some better configs to run the kernel. For example, we can see the following log:
??? Logs ??? console "Logs"
``` ```
AUTOTUNE mm(8x2048, 2048x3072) AUTOTUNE mm(8x2048, 2048x3072)
......
--- # Automatic Prefix Caching
title: Automatic Prefix Caching
---
[](){ #automatic-prefix-caching }
## Introduction ## Introduction
Automatic Prefix Caching (APC in short) caches the KV cache of existing queries, so that a new query can directly reuse the KV cache if it shares the same prefix with one of the existing queries, allowing the new query to skip the computation of the shared part. Automatic Prefix Caching (APC in short) caches the KV cache of existing queries, so that a new query can directly reuse the KV cache if it shares the same prefix with one of the existing queries, allowing the new query to skip the computation of the shared part.
!!! note !!! note
Technical details on how vLLM implements APC can be found [here][design-automatic-prefix-caching]. Technical details on how vLLM implements APC can be found [here](../design/automatic_prefix_caching.md).
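To make the idea of prefix reuse concrete, here is a toy sketch that counts how many leading tokens of a new query could skip computation when an earlier query shares its prefix. It is not vLLM's implementation: the function name is hypothetical, and both the whole-block granularity and the block size of 16 are assumptions made for illustration.

```python
def reusable_prefix_tokens(cached, query, block_size=16):
    """Count leading tokens of `query` whose KV cache could be reused.

    Toy model: only whole blocks of `block_size` tokens count as reusable.
    """
    shared = 0
    for cached_token, query_token in zip(cached, query):
        if cached_token != query_token:
            break
        shared += 1
    # Round down to a whole number of blocks.
    return (shared // block_size) * block_size


# An earlier query of 100 token IDs, and a new query sharing its first 40.
cached_query = list(range(100))
new_query = list(range(40)) + [999, 998]

skipped = reusable_prefix_tokens(cached_query, new_query)
remaining = len(new_query) - skipped
print(f"reused tokens: {skipped}, tokens still to compute: {remaining}")
```

In this toy model a longer shared prefix (for example a long system prompt or document) leaves fewer tokens to compute for the new query.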
## Enabling APC in vLLM ## Enabling APC in vLLM
......
--- # Compatibility Matrix
title: Compatibility Matrix
---
[](){ #compatibility-matrix }
The tables below show mutually exclusive features and the support on some hardware. The tables below show mutually exclusive features and the support on some hardware.
...@@ -37,23 +34,22 @@ th:not(:first-child) { ...@@ -37,23 +34,22 @@ th:not(:first-child) {
} }
</style> </style>
| Feature | [CP][chunked-prefill] | [APC][automatic-prefix-caching] | [LoRA][lora-adapter] | <abbr title="Prompt Adapter">prmpt adptr</abbr> | [SD][spec-decode] | CUDA graph | <abbr title="Pooling Models">pooling</abbr> | <abbr title="Encoder-Decoder Models">enc-dec</abbr> | <abbr title="Logprobs">logP</abbr> | <abbr title="Prompt Logprobs">prmpt logP</abbr> | <abbr title="Async Output Processing">async output</abbr> | multi-step | <abbr title="Multimodal Inputs">mm</abbr> | best-of | beam-search | | Feature | [CP][chunked-prefill] | [APC](automatic_prefix_caching.md) | [LoRA](lora.md) | [SD](spec_decode.md) | CUDA graph | <abbr title="Pooling Models">pooling</abbr> | <abbr title="Encoder-Decoder Models">enc-dec</abbr> | <abbr title="Logprobs">logP</abbr> | <abbr title="Prompt Logprobs">prmpt logP</abbr> | <abbr title="Async Output Processing">async output</abbr> | multi-step | <abbr title="Multimodal Inputs">mm</abbr> | best-of | beam-search |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---| |---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [CP][chunked-prefill] | ✅ | | | | | | | | | | | | | | | | [CP][chunked-prefill] | ✅ | | | | | | | | | | | | | | |
| [APC][automatic-prefix-caching] | ✅ | ✅ | | | | | | | | | | | | | | | [APC](automatic_prefix_caching.md) | ✅ | ✅ | | | | | | | | | | | | | |
| [LoRA][lora-adapter] | ✅ | ✅ | ✅ | | | | | | | | | | | | | | [LoRA](lora.md) | ✅ | ✅ | ✅ | | | | | | | | | | | | |
| <abbr title="Prompt Adapter">prmpt adptr</abbr> | ✅ | ✅ | ✅ | ✅ | | | | | | | | | | | | | [SD](spec_decode.md) | ✅ | ✅ | ❌ | ✅ | | | | | | | | | | |
| [SD][spec-decode] | ✅ | ✅ | ❌ | ✅ | ✅ | | | | | | | | | | | | CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | | | | | | | | | |
| CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | | | | | | | | | | <abbr title="Pooling Models">pooling</abbr> | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | | | | | | | | |
| <abbr title="Pooling Models">pooling</abbr> | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | | | | | | | | | | <abbr title="Encoder-Decoder Models">enc-dec</abbr> | ❌ | [](gh-issue:7366) | ❌ | [](gh-issue:7366) | ✅ | ✅ | ✅ | | | | | | | |
| <abbr title="Encoder-Decoder Models">enc-dec</abbr> | ❌ | [](gh-issue:7366) | ❌ | ❌ | [](gh-issue:7366) | ✅ | ✅ | ✅ | | | | | | | | | <abbr title="Logprobs">logP</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | | | | | | |
| <abbr title="Logprobs">logP</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | | | | | | | | <abbr title="Prompt Logprobs">prmpt logP</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | | | | | |
| <abbr title="Prompt Logprobs">prmpt logP</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | | | | | | | <abbr title="Async Output Processing">async output</abbr> | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | | | | |
| <abbr title="Async Output Processing">async output</abbr> | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | | | | | | multi-step | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | | | |
| multi-step | ❌ | ✅ | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | | | | | <abbr title="Multimodal Inputs">mm</abbr> | ✅ | [🟠](gh-pr:8348) | [🟠](gh-pr:4194) | ❔ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❔ | ✅ | | |
| <abbr title="Multimodal Inputs">mm</abbr> | ✅ | [🟠](gh-pr:8348) | [🟠](gh-pr:4194) | ❔ | ❔ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❔ | ✅ | | | | best-of | ✅ | ✅ | ✅ | [](gh-issue:6137) | ✅ | ❌ | ✅ | ✅ | ✅ | ❔ | [](gh-issue:7968) | ✅ | ✅ | |
| best-of | ✅ | ✅ | ✅ | ✅ | [](gh-issue:6137) | ✅ | ❌ | ✅ | ✅ | ✅ | ❔ | [](gh-issue:7968) | ✅ | ✅ | | | beam-search | ✅ | ✅ | ✅ | [](gh-issue:6137) | ✅ | ❌ | ✅ | ✅ | ✅ | ❔ | [](gh-issue:7968) | ❔ | ✅ | ✅ |
| beam-search | ✅ | ✅ | ✅ | ✅ | [](gh-issue:6137) | ✅ | ❌ | ✅ | ✅ | ✅ | ❔ | [](gh-issue:7968) | ❔ | ✅ | ✅ |
[](){ #feature-x-hardware } [](){ #feature-x-hardware }
...@@ -62,10 +58,9 @@ th:not(:first-child) { ...@@ -62,10 +58,9 @@ th:not(:first-child) {
| Feature | Volta | Turing | Ampere | Ada | Hopper | CPU | AMD | TPU | | Feature | Volta | Turing | Ampere | Ada | Hopper | CPU | AMD | TPU |
|-----------------------------------------------------------|---------------------|-----------|-----------|--------|------------|--------------------|--------|-----| |-----------------------------------------------------------|---------------------|-----------|-----------|--------|------------|--------------------|--------|-----|
| [CP][chunked-prefill] | [](gh-issue:2729) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | [CP][chunked-prefill] | [](gh-issue:2729) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| [APC][automatic-prefix-caching] | [](gh-issue:3687) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | [APC](automatic_prefix_caching.md) | [](gh-issue:3687) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| [LoRA][lora-adapter] | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | [LoRA](lora.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| <abbr title="Prompt Adapter">prmpt adptr</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | [](gh-issue:8475) | ✅ | ❌ | | [SD](spec_decode.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| [SD][spec-decode] | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | | CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ |
| <abbr title="Pooling Models">pooling</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❔ | ❌ | | <abbr title="Pooling Models">pooling</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❔ | ❌ |
| <abbr title="Encoder-Decoder Models">enc-dec</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | | <abbr title="Encoder-Decoder Models">enc-dec</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
......