Unverified Commit dd6a3a02 authored by Harry Mellor, committed by GitHub

[Doc] Convert docs to use colon fences (#12471)


Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
parent a7e3eba6
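
The change in this commit is mechanical: MyST directive fences written with backticks become colon fences, while ordinary code fences are left alone. A minimal sketch of that rewrite is shown here for illustration only. It is not the tool used for this commit; the name `to_colon_fences` is hypothetical, the sketch keeps each fence's original length (the commit shortens some four-backtick fences to `:::`), and it does not handle the `list-table` bullet reordering visible further down.

```python
import re

# A fence line: optional indentation / blockquote marker, 3+ backticks, then the rest.
FENCE_RE = re.compile(r"^(?P<prefix>\s*(?:>\s*)?)(?P<ticks>`{3,})(?P<rest>.*)$")
# A directive fence carries "{name}" right after the backticks, e.g. {note}.
DIRECTIVE_RE = re.compile(r"^\{[\w-]+\}")


def to_colon_fences(text):
    """Rewrite backtick directive fences as colon fences (illustrative sketch)."""
    out = []
    stack = []  # open fences as (tick_count, is_directive)
    for line in text.splitlines():
        m = FENCE_RE.match(line)
        if not m:
            out.append(line)
            continue
        ticks, rest = len(m["ticks"]), m["rest"]
        in_code = bool(stack) and not stack[-1][1]
        if stack and rest.strip() == "" and ticks >= stack[-1][0]:
            # Closing fence: convert it only if it closes a directive.
            _, is_directive = stack.pop()
            out.append((m["prefix"] + ":" * ticks) if is_directive else line)
        elif in_code:
            # Inside an ordinary code block everything is kept verbatim.
            out.append(line)
        elif DIRECTIVE_RE.match(rest):
            stack.append((ticks, True))
            out.append(m["prefix"] + ":" * ticks + rest)
        else:
            # Ordinary code fence such as a python or cpp block: leave untouched.
            stack.append((ticks, False))
            out.append(line)
    return "\n".join(out)
```

Tracking open fences on a stack is what lets a code block nested inside a `{note}` keep its backticks while the surrounding directive is converted.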
...@@ -105,9 +105,9 @@ docker run -itd --ipc host --privileged --network vllm_nginx --gpus all --shm-si ...@@ -105,9 +105,9 @@ docker run -itd --ipc host --privileged --network vllm_nginx --gpus all --shm-si
docker run -itd --ipc host --privileged --network vllm_nginx --gpus all --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8082:8000 --name vllm1 vllm --model meta-llama/Llama-2-7b-chat-hf docker run -itd --ipc host --privileged --network vllm_nginx --gpus all --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8082:8000 --name vllm1 vllm --model meta-llama/Llama-2-7b-chat-hf
``` ```
```{note} :::{note}
If you are behind proxy, you can pass the proxy settings to the docker run command via `-e http_proxy=$http_proxy -e https_proxy=$https_proxy`. If you are behind proxy, you can pass the proxy settings to the docker run command via `-e http_proxy=$http_proxy -e https_proxy=$https_proxy`.
``` :::
(nginxloadbalancer-nginx-launch-nginx)= (nginxloadbalancer-nginx-launch-nginx)=
......
...@@ -4,19 +4,19 @@ ...@@ -4,19 +4,19 @@
This document provides an overview of the vLLM architecture. This document provides an overview of the vLLM architecture.
```{contents} Table of Contents :::{contents} Table of Contents
:depth: 2 :depth: 2
:local: true :local: true
``` :::
## Entrypoints ## Entrypoints
vLLM provides a number of entrypoints for interacting with the system. The vLLM provides a number of entrypoints for interacting with the system. The
following diagram shows the relationship between them. following diagram shows the relationship between them.
```{image} /assets/design/arch_overview/entrypoints.excalidraw.png :::{image} /assets/design/arch_overview/entrypoints.excalidraw.png
:alt: Entrypoints Diagram :alt: Entrypoints Diagram
``` :::
### LLM Class ### LLM Class
...@@ -84,9 +84,9 @@ More details on the API server can be found in the [OpenAI-Compatible Server](#o ...@@ -84,9 +84,9 @@ More details on the API server can be found in the [OpenAI-Compatible Server](#o
The `LLMEngine` and `AsyncLLMEngine` classes are central to the functioning of The `LLMEngine` and `AsyncLLMEngine` classes are central to the functioning of
the vLLM system, handling model inference and asynchronous request processing. the vLLM system, handling model inference and asynchronous request processing.
```{image} /assets/design/arch_overview/llm_engine.excalidraw.png :::{image} /assets/design/arch_overview/llm_engine.excalidraw.png
:alt: LLMEngine Diagram :alt: LLMEngine Diagram
``` :::
### LLMEngine ### LLMEngine
...@@ -144,11 +144,11 @@ configurations affect the class we ultimately get. ...@@ -144,11 +144,11 @@ configurations affect the class we ultimately get.
The following figure shows the class hierarchy of vLLM: The following figure shows the class hierarchy of vLLM:
> ```{figure} /assets/design/hierarchy.png > :::{figure} /assets/design/hierarchy.png
> :align: center > :align: center
> :alt: query > :alt: query
> :width: 100% > :width: 100%
> ``` > :::
There are several important design choices behind this class hierarchy: There are several important design choices behind this class hierarchy:
...@@ -178,7 +178,7 @@ of a vision model and a language model. By making the constructor uniform, we ...@@ -178,7 +178,7 @@ of a vision model and a language model. By making the constructor uniform, we
can easily create a vision model and a language model and compose them into a can easily create a vision model and a language model and compose them into a
vision-language model. vision-language model.
````{note} :::{note}
To support this change, all vLLM models' signatures have been updated to: To support this change, all vLLM models' signatures have been updated to:
```python ```python
...@@ -215,7 +215,7 @@ else: ...@@ -215,7 +215,7 @@ else:
``` ```
This way, the model can work with both old and new versions of vLLM. This way, the model can work with both old and new versions of vLLM.
```` :::
3\. **Sharding and Quantization at Initialization**: Certain features require 3\. **Sharding and Quantization at Initialization**: Certain features require
changing the model weights. For example, tensor parallelism needs to shard the changing the model weights. For example, tensor parallelism needs to shard the
......
...@@ -139,26 +139,26 @@ ...@@ -139,26 +139,26 @@
const scalar_t* q_ptr = q + seq_idx * q_stride + head_idx * HEAD_SIZE; const scalar_t* q_ptr = q + seq_idx * q_stride + head_idx * HEAD_SIZE;
``` ```
```{figure} ../../assets/kernel/query.png :::{figure} ../../assets/kernel/query.png
:align: center :align: center
:alt: query :alt: query
:width: 70% :width: 70%
Query data of one token at one head Query data of one token at one head
``` :::
- Each thread defines its own `q_ptr` which points to the assigned - Each thread defines its own `q_ptr` which points to the assigned
query token data on global memory. For example, if `VEC_SIZE` is 4 query token data on global memory. For example, if `VEC_SIZE` is 4
and `HEAD_SIZE` is 128, the `q_ptr` points to data that contains and `HEAD_SIZE` is 128, the `q_ptr` points to data that contains
total of 128 elements divided into 128 / 4 = 32 vecs. total of 128 elements divided into 128 / 4 = 32 vecs.
```{figure} ../../assets/kernel/q_vecs.png :::{figure} ../../assets/kernel/q_vecs.png
:align: center :align: center
:alt: q_vecs :alt: q_vecs
:width: 70% :width: 70%
`q_vecs` for one thread group `q_vecs` for one thread group
``` :::
```cpp ```cpp
__shared__ Q_vec q_vecs[THREAD_GROUP_SIZE][NUM_VECS_PER_THREAD]; __shared__ Q_vec q_vecs[THREAD_GROUP_SIZE][NUM_VECS_PER_THREAD];
...@@ -195,13 +195,13 @@ ...@@ -195,13 +195,13 @@
points to key token data based on `k_cache` at assigned block, points to key token data based on `k_cache` at assigned block,
assigned head and assigned token. assigned head and assigned token.
```{figure} ../../assets/kernel/key.png :::{figure} ../../assets/kernel/key.png
:align: center :align: center
:alt: key :alt: key
:width: 70% :width: 70%
Key data of all context tokens at one head Key data of all context tokens at one head
``` :::
- The diagram above illustrates the memory layout for key data. It - The diagram above illustrates the memory layout for key data. It
assumes that the `BLOCK_SIZE` is 16, `HEAD_SIZE` is 128, `x` is assumes that the `BLOCK_SIZE` is 16, `HEAD_SIZE` is 128, `x` is
...@@ -214,13 +214,13 @@ ...@@ -214,13 +214,13 @@
elements for one token) that will be processed by 2 threads (one elements for one token) that will be processed by 2 threads (one
thread group) separately. thread group) separately.
```{figure} ../../assets/kernel/k_vecs.png :::{figure} ../../assets/kernel/k_vecs.png
:align: center :align: center
:alt: k_vecs :alt: k_vecs
:width: 70% :width: 70%
`k_vecs` for one thread `k_vecs` for one thread
``` :::
```cpp ```cpp
K_vec k_vecs[NUM_VECS_PER_THREAD] K_vec k_vecs[NUM_VECS_PER_THREAD]
...@@ -289,14 +289,14 @@ ...@@ -289,14 +289,14 @@
should be performed across the entire thread block, encompassing should be performed across the entire thread block, encompassing
results between the query token and all context key tokens. results between the query token and all context key tokens.
```{math} :::{math}
:nowrap: true :nowrap: true
\begin{gather*} \begin{gather*}
m(x):=\max _i \quad x_i \\ \quad f(x):=\left[\begin{array}{lll}e^{x_1-m(x)} & \ldots & e^{x_B-m(x)}\end{array}\right]\\ \quad \ell(x):=\sum_i f(x)_i \\ m(x):=\max _i \quad x_i \\ \quad f(x):=\left[\begin{array}{lll}e^{x_1-m(x)} & \ldots & e^{x_B-m(x)}\end{array}\right]\\ \quad \ell(x):=\sum_i f(x)_i \\
\quad \operatorname{softmax}(x):=\frac{f(x)}{\ell(x)} \quad \operatorname{softmax}(x):=\frac{f(x)}{\ell(x)}
\end{gather*} \end{gather*}
``` :::
### `qk_max` and `logits` ### `qk_max` and `logits`
...@@ -379,29 +379,29 @@ ...@@ -379,29 +379,29 @@
## Value ## Value
```{figure} ../../assets/kernel/value.png :::{figure} ../../assets/kernel/value.png
:align: center :align: center
:alt: value :alt: value
:width: 70% :width: 70%
Value data of all context tokens at one head Value data of all context tokens at one head
``` :::
```{figure} ../../assets/kernel/logits_vec.png :::{figure} ../../assets/kernel/logits_vec.png
:align: center :align: center
:alt: logits_vec :alt: logits_vec
:width: 50% :width: 50%
`logits_vec` for one thread `logits_vec` for one thread
``` :::
```{figure} ../../assets/kernel/v_vec.png :::{figure} ../../assets/kernel/v_vec.png
:align: center :align: center
:alt: v_vec :alt: v_vec
:width: 70% :width: 70%
List of `v_vec` for one thread List of `v_vec` for one thread
``` :::
- Now we need to retrieve the value data and perform dot multiplication - Now we need to retrieve the value data and perform dot multiplication
with `logits`. Unlike query and key, there is no thread group with `logits`. Unlike query and key, there is no thread group
......
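
For readers following the `{math}` block in the kernel hunk above, the definitions of `m(x)`, `f(x)` and `ℓ(x)` can be sketched in plain Python. This is a single-threaded illustration of the formula only, under the assumption that `x` is a non-empty list of floats; it does not reflect how the CUDA kernel distributes the reduction across threads, and `stable_softmax` is a hypothetical name.

```python
import math


def stable_softmax(x):
    """Softmax computed as in the formula: subtract the max before exponentiating."""
    m = max(x)                         # m(x) = max_i x_i
    f = [math.exp(v - m) for v in x]   # f(x)_i = exp(x_i - m(x))
    ell = sum(f)                       # l(x) = sum_i f(x)_i
    return [v / ell for v in f]        # softmax(x) = f(x) / l(x)
```

Subtracting `m(x)` keeps every exponent at or below zero, so large logits such as `1000.0` do not overflow.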
...@@ -7,9 +7,9 @@ page for information on known issues and how to solve them. ...@@ -7,9 +7,9 @@ page for information on known issues and how to solve them.
## Introduction ## Introduction
```{important} :::{important}
The source code references are to the state of the code at the time of writing in December, 2024. The source code references are to the state of the code at the time of writing in December, 2024.
``` :::
The use of Python multiprocessing in vLLM is complicated by: The use of Python multiprocessing in vLLM is complicated by:
......
...@@ -6,9 +6,9 @@ ...@@ -6,9 +6,9 @@
Automatic Prefix Caching (APC in short) caches the KV cache of existing queries, so that a new query can directly reuse the KV cache if it shares the same prefix with one of the existing queries, allowing the new query to skip the computation of the shared part. Automatic Prefix Caching (APC in short) caches the KV cache of existing queries, so that a new query can directly reuse the KV cache if it shares the same prefix with one of the existing queries, allowing the new query to skip the computation of the shared part.
```{note} :::{note}
Technical details on how vLLM implements APC can be found [here](#design-automatic-prefix-caching). Technical details on how vLLM implements APC can be found [here](#design-automatic-prefix-caching).
``` :::
## Enabling APC in vLLM ## Enabling APC in vLLM
......
...@@ -4,13 +4,13 @@ ...@@ -4,13 +4,13 @@
The tables below show mutually exclusive features and the support on some hardware. The tables below show mutually exclusive features and the support on some hardware.
```{note} :::{note}
Check the '✗' with links to see tracking issue for unsupported feature/hardware combination. Check the '✗' with links to see tracking issue for unsupported feature/hardware combination.
``` :::
## Feature x Feature ## Feature x Feature
```{raw} html :::{raw} html
<style> <style>
/* Make smaller to try to improve readability */ /* Make smaller to try to improve readability */
td { td {
...@@ -23,448 +23,447 @@ Check the '✗' with links to see tracking issue for unsupported feature/hardwar ...@@ -23,448 +23,447 @@ Check the '✗' with links to see tracking issue for unsupported feature/hardwar
font-size: 0.8rem; font-size: 0.8rem;
} }
</style> </style>
``` :::
```{list-table} :::{list-table}
:header-rows: 1 :header-rows: 1
:stub-columns: 1 :stub-columns: 1
:widths: auto :widths: auto
* - Feature - * Feature
- [CP](#chunked-prefill) * [CP](#chunked-prefill)
- [APC](#automatic-prefix-caching) * [APC](#automatic-prefix-caching)
- [LoRA](#lora-adapter) * [LoRA](#lora-adapter)
- <abbr title="Prompt Adapter">prmpt adptr</abbr> * <abbr title="Prompt Adapter">prmpt adptr</abbr>
- [SD](#spec_decode) * [SD](#spec_decode)
- CUDA graph * CUDA graph
- <abbr title="Pooling Models">pooling</abbr> * <abbr title="Pooling Models">pooling</abbr>
- <abbr title="Encoder-Decoder Models">enc-dec</abbr> * <abbr title="Encoder-Decoder Models">enc-dec</abbr>
- <abbr title="Logprobs">logP</abbr> * <abbr title="Logprobs">logP</abbr>
- <abbr title="Prompt Logprobs">prmpt logP</abbr> * <abbr title="Prompt Logprobs">prmpt logP</abbr>
- <abbr title="Async Output Processing">async output</abbr> * <abbr title="Async Output Processing">async output</abbr>
- multi-step * multi-step
- <abbr title="Multimodal Inputs">mm</abbr> * <abbr title="Multimodal Inputs">mm</abbr>
- best-of * best-of
- beam-search * beam-search
- <abbr title="Guided Decoding">guided dec</abbr> * <abbr title="Guided Decoding">guided dec</abbr>
* - [CP](#chunked-prefill) - * [CP](#chunked-prefill)
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
* - [APC](#automatic-prefix-caching) - * [APC](#automatic-prefix-caching)
- ✅ *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
* - [LoRA](#lora-adapter) - * [LoRA](#lora-adapter)
- [✗](gh-pr:9057) * [](gh-pr:9057)
- ✅ *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
* - <abbr title="Prompt Adapter">prmpt adptr</abbr> - * <abbr title="Prompt Adapter">prmpt adptr</abbr>
- ✅ *
- ✅ *
- ✅ *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
* - [SD](#spec_decode) - * [SD](#spec_decode)
- ✅ *
- ✅ *
- ✗ *
- ✅ *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
* - CUDA graph - * CUDA graph
- ✅ *
- ✅ *
- ✅ *
- ✅ *
- ✅ *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
* - <abbr title="Pooling Models">pooling</abbr> - * <abbr title="Pooling Models">pooling</abbr>
- ✗ *
- ✗ *
- ✗ *
- ✗ *
- ✗ *
- ✗ *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
* - <abbr title="Encoder-Decoder Models">enc-dec</abbr> - * <abbr title="Encoder-Decoder Models">enc-dec</abbr>
- ✗ *
- [✗](gh-issue:7366) * [](gh-issue:7366)
- ✗ *
- ✗ *
- [✗](gh-issue:7366) * [](gh-issue:7366)
- ✅ *
- ✅ *
- *
- *
- *
- *
- *
- *
- *
- *
- *
* - <abbr title="Logprobs">logP</abbr> - * <abbr title="Logprobs">logP</abbr>
- ✅ *
- ✅ *
- ✅ *
- ✅ *
- ✅ *
- ✅ *
- ✗ *
- ✅ *
- *
- *
- *
- *
- *
- *
- *
- *
* - <abbr title="Prompt Logprobs">prmpt logP</abbr> - * <abbr title="Prompt Logprobs">prmpt logP</abbr>
- ✅ *
- ✅ *
- ✅ *
- ✅ *
- [✗](gh-pr:8199) * [](gh-pr:8199)
- ✅ *
- ✗ *
- ✅ *
- ✅ *
- *
- *
- *
- *
- *
- *
- *
* - <abbr title="Async Output Processing">async output</abbr> - * <abbr title="Async Output Processing">async output</abbr>
- ✅ *
- ✅ *
- ✅ *
- ✅ *
- ✗ *
- ✅ *
- ✗ *
- ✗ *
- ✅ *
- ✅ *
- *
- *
- *
- *
- *
- *
* - multi-step - * multi-step
- ✗ *
- ✅ *
- ✗ *
- ✅ *
- ✗ *
- ✅ *
- ✗ *
- ✗ *
- ✅ *
- [✗](gh-issue:8198) * [](gh-issue:8198)
- ✅ *
- *
- *
- *
- *
- *
* - <abbr title="Multimodal Inputs">mm</abbr> - * <abbr title="Multimodal Inputs">mm</abbr>
- ✅ *
- [✗](gh-pr:8348) * [](gh-pr:8348)
- [✗](gh-pr:7199) * [](gh-pr:7199)
- ? * ?
- ? * ?
- ✅ *
- ✅ *
- ✅ *
- ✅ *
- ✅ *
- ✅ *
- ? * ?
- *
- *
- *
- *
* - best-of - * best-of
- ✅ *
- ✅ *
- ✅ *
- ✅ *
- [✗](gh-issue:6137) * [](gh-issue:6137)
- ✅ *
- ✗ *
- ✅ *
- ✅ *
- ✅ *
- ? * ?
- [✗](gh-issue:7968) * [](gh-issue:7968)
- ✅ *
- *
- *
- *
* - beam-search - * beam-search
- ✅ *
- ✅ *
- ✅ *
- ✅ *
- [✗](gh-issue:6137) * [](gh-issue:6137)
- ✅ *
- ✗ *
- ✅ *
- ✅ *
- ✅ *
- ? * ?
- [✗](gh-issue:7968>) * [](gh-issue:7968>)
- ? * ?
- ✅ *
- *
- *
* - <abbr title="Guided Decoding">guided dec</abbr> - * <abbr title="Guided Decoding">guided dec</abbr>
- ✅ *
- ✅ *
- ? * ?
- ? * ?
- [✗](gh-issue:11484) * [](gh-issue:11484)
- ✅ *
- ✗ *
- ? * ?
- ✅ *
- ✅ *
- ✅ *
- [✗](gh-issue:9893) * [](gh-issue:9893)
- ? * ?
- ✅ *
- ✅ *
- *
:::
```
(feature-x-hardware)= (feature-x-hardware)=
## Feature x Hardware ## Feature x Hardware
```{list-table} :::{list-table}
:header-rows: 1 :header-rows: 1
:stub-columns: 1 :stub-columns: 1
:widths: auto :widths: auto
* - Feature - * Feature
- Volta * Volta
- Turing * Turing
- Ampere * Ampere
- Ada * Ada
- Hopper * Hopper
- CPU * CPU
- AMD * AMD
* - [CP](#chunked-prefill) - * [CP](#chunked-prefill)
- [✗](gh-issue:2729) * [](gh-issue:2729)
- *
- *
- *
- *
- *
- *
* - [APC](#automatic-prefix-caching) - * [APC](#automatic-prefix-caching)
- [✗](gh-issue:3687) * [](gh-issue:3687)
- *
- *
- *
- *
- *
- *
* - [LoRA](#lora-adapter) - * [LoRA](#lora-adapter)
- *
- *
- *
- *
- *
- *
- *
* - <abbr title="Prompt Adapter">prmpt adptr</abbr> - * <abbr title="Prompt Adapter">prmpt adptr</abbr>
- *
- *
- *
- *
- *
- [✗](gh-issue:8475) * [](gh-issue:8475)
- *
* - [SD](#spec_decode) - * [SD](#spec_decode)
- *
- *
- *
- *
- *
- *
- *
* - CUDA graph - * CUDA graph
- *
- *
- *
- *
- *
- *
- *
* - <abbr title="Pooling Models">pooling</abbr> - * <abbr title="Pooling Models">pooling</abbr>
- *
- *
- *
- *
- *
- *
- ? * ?
* - <abbr title="Encoder-Decoder Models">enc-dec</abbr> - * <abbr title="Encoder-Decoder Models">enc-dec</abbr>
- *
- *
- *
- *
- *
- *
- *
* - <abbr title="Multimodal Inputs">mm</abbr> - * <abbr title="Multimodal Inputs">mm</abbr>
- *
- *
- *
- *
- *
- *
- *
* - <abbr title="Logprobs">logP</abbr> - * <abbr title="Logprobs">logP</abbr>
- *
- *
- *
- *
- *
- *
- *
* - <abbr title="Prompt Logprobs">prmpt logP</abbr> - * <abbr title="Prompt Logprobs">prmpt logP</abbr>
- *
- *
- *
- *
- *
- *
- *
* - <abbr title="Async Output Processing">async output</abbr> - * <abbr title="Async Output Processing">async output</abbr>
- *
- *
- *
- *
- *
- *
- *
* - multi-step - * multi-step
- *
- *
- *
- *
- *
- [✗](gh-issue:8477) * [](gh-issue:8477)
- *
* - best-of - * best-of
- *
- *
- *
- *
- *
- *
- *
* - beam-search - * beam-search
- *
- *
- *
- *
- *
- *
- *
* - <abbr title="Guided Decoding">guided dec</abbr> - * <abbr title="Guided Decoding">guided dec</abbr>
- *
- *
- *
- *
- *
- *
- *
``` :::
...@@ -4,9 +4,9 @@ ...@@ -4,9 +4,9 @@
This page introduces you the disaggregated prefilling feature in vLLM. This page introduces you the disaggregated prefilling feature in vLLM.
```{note} :::{note}
This feature is experimental and subject to change. This feature is experimental and subject to change.
``` :::
## Why disaggregated prefilling? ## Why disaggregated prefilling?
...@@ -15,9 +15,9 @@ Two main reasons: ...@@ -15,9 +15,9 @@ Two main reasons:
- **Tuning time-to-first-token (TTFT) and inter-token-latency (ITL) separately**. Disaggregated prefilling put prefill and decode phase of LLM inference inside different vLLM instances. This gives you the flexibility to assign different parallel strategies (e.g. `tp` and `pp`) to tune TTFT without affecting ITL, or to tune ITL without affecting TTFT. - **Tuning time-to-first-token (TTFT) and inter-token-latency (ITL) separately**. Disaggregated prefilling put prefill and decode phase of LLM inference inside different vLLM instances. This gives you the flexibility to assign different parallel strategies (e.g. `tp` and `pp`) to tune TTFT without affecting ITL, or to tune ITL without affecting TTFT.
- **Controlling tail ITL**. Without disaggregated prefilling, vLLM may insert some prefill jobs during the decoding of one request. This results in higher tail latency. Disaggregated prefilling helps you solve this issue and control tail ITL. Chunked prefill with a proper chunk size also can achieve the same goal, but in practice it's hard to figure out the correct chunk size value. So disaggregated prefilling is a much more reliable way to control tail ITL. - **Controlling tail ITL**. Without disaggregated prefilling, vLLM may insert some prefill jobs during the decoding of one request. This results in higher tail latency. Disaggregated prefilling helps you solve this issue and control tail ITL. Chunked prefill with a proper chunk size also can achieve the same goal, but in practice it's hard to figure out the correct chunk size value. So disaggregated prefilling is a much more reliable way to control tail ITL.
```{note} :::{note}
Disaggregated prefill DOES NOT improve throughput. Disaggregated prefill DOES NOT improve throughput.
``` :::
## Usage example ## Usage example
...@@ -39,21 +39,21 @@ Key abstractions for disaggregated prefilling: ...@@ -39,21 +39,21 @@ Key abstractions for disaggregated prefilling:
- **LookupBuffer**: LookupBuffer provides two API: `insert` KV cache and `drop_select` KV cache. The semantics of `insert` and `drop_select` are similar to SQL, where `insert` inserts a KV cache into the buffer, and `drop_select` returns the KV cache that matches the given condition and drop it from the buffer. - **LookupBuffer**: LookupBuffer provides two API: `insert` KV cache and `drop_select` KV cache. The semantics of `insert` and `drop_select` are similar to SQL, where `insert` inserts a KV cache into the buffer, and `drop_select` returns the KV cache that matches the given condition and drop it from the buffer.
- **Pipe**: A single-direction FIFO pipe for tensor transmission. It supports `send_tensor` and `recv_tensor`. - **Pipe**: A single-direction FIFO pipe for tensor transmission. It supports `send_tensor` and `recv_tensor`.
```{note} :::{note}
`insert` is non-blocking operation but `drop_select` is blocking operation. `insert` is non-blocking operation but `drop_select` is blocking operation.
``` :::
Here is a figure illustrating how the above 3 abstractions are organized: Here is a figure illustrating how the above 3 abstractions are organized:
```{image} /assets/features/disagg_prefill/abstraction.jpg :::{image} /assets/features/disagg_prefill/abstraction.jpg
:alt: Disaggregated prefilling abstractions :alt: Disaggregated prefilling abstractions
``` :::
The workflow of disaggregated prefilling is as follows: The workflow of disaggregated prefilling is as follows:
```{image} /assets/features/disagg_prefill/overview.jpg :::{image} /assets/features/disagg_prefill/overview.jpg
:alt: Disaggregated prefilling workflow :alt: Disaggregated prefilling workflow
``` :::
The `buffer` corresponds to `insert` API in LookupBuffer, and the `drop_select` corresponds to `drop_select` API in LookupBuffer. The `buffer` corresponds to `insert` API in LookupBuffer, and the `drop_select` corresponds to `drop_select` API in LookupBuffer.
......
...@@ -60,9 +60,9 @@ vllm serve meta-llama/Llama-2-7b-hf \ ...@@ -60,9 +60,9 @@ vllm serve meta-llama/Llama-2-7b-hf \
--lora-modules sql-lora=$HOME/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/ --lora-modules sql-lora=$HOME/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/
``` ```
```{note} :::{note}
The commit ID `0dfa347e8877a4d4ed19ee56c140fa518470028c` may change over time. Please check the latest commit ID in your environment to ensure you are using the correct one. The commit ID `0dfa347e8877a4d4ed19ee56c140fa518470028c` may change over time. Please check the latest commit ID in your environment to ensure you are using the correct one.
``` :::
The server entrypoint accepts all other LoRA configuration parameters (`max_loras`, `max_lora_rank`, `max_cpu_loras`, The server entrypoint accepts all other LoRA configuration parameters (`max_loras`, `max_lora_rank`, `max_cpu_loras`,
etc.), which will apply to all forthcoming requests. Upon querying the `/models` endpoint, we should see our LoRA along etc.), which will apply to all forthcoming requests. Upon querying the `/models` endpoint, we should see our LoRA along
......
...@@ -2,11 +2,11 @@ ...@@ -2,11 +2,11 @@
# AutoAWQ # AutoAWQ
```{warning} :::{warning}
Please note that AWQ support in vLLM is under-optimized at the moment. We would recommend using the unquantized version of the model for better Please note that AWQ support in vLLM is under-optimized at the moment. We would recommend using the unquantized version of the model for better
accuracy and higher throughput. Currently, you can use AWQ as a way to reduce memory footprint. As of now, it is more suitable for low latency accuracy and higher throughput. Currently, you can use AWQ as a way to reduce memory footprint. As of now, it is more suitable for low latency
inference with small number of concurrent requests. vLLM's AWQ implementation have lower throughput than unquantized version. inference with small number of concurrent requests. vLLM's AWQ implementation have lower throughput than unquantized version.
``` :::
To create a new 4-bit quantized model, you can leverage [AutoAWQ](https://github.com/casper-hansen/AutoAWQ). To create a new 4-bit quantized model, you can leverage [AutoAWQ](https://github.com/casper-hansen/AutoAWQ).
Quantizing reduces the model's precision from FP16 to INT4 which effectively reduces the file size by ~70%. Quantizing reduces the model's precision from FP16 to INT4 which effectively reduces the file size by ~70%.
......
...@@ -14,10 +14,10 @@ The FP8 types typically supported in hardware have two distinct representations, ...@@ -14,10 +14,10 @@ The FP8 types typically supported in hardware have two distinct representations,
- **E4M3**: Consists of 1 sign bit, 4 exponent bits, and 3 bits of mantissa. It can store values up to +/-448 and `nan`. - **E4M3**: Consists of 1 sign bit, 4 exponent bits, and 3 bits of mantissa. It can store values up to +/-448 and `nan`.
- **E5M2**: Consists of 1 sign bit, 5 exponent bits, and 2 bits of mantissa. It can store values up to +/-57344, +/- `inf`, and `nan`. The tradeoff for the increased dynamic range is lower precision of the stored values. - **E5M2**: Consists of 1 sign bit, 5 exponent bits, and 2 bits of mantissa. It can store values up to +/-57344, +/- `inf`, and `nan`. The tradeoff for the increased dynamic range is lower precision of the stored values.
```{note} :::{note}
FP8 computation is supported on NVIDIA GPUs with compute capability > 8.9 (Ada Lovelace, Hopper). FP8 computation is supported on NVIDIA GPUs with compute capability > 8.9 (Ada Lovelace, Hopper).
FP8 models will run on compute capability > 8.0 (Ampere) as weight-only W8A16, utilizing FP8 Marlin. FP8 models will run on compute capability > 8.0 (Ampere) as weight-only W8A16, utilizing FP8 Marlin.
``` :::
## Quick Start with Online Dynamic Quantization ## Quick Start with Online Dynamic Quantization
...@@ -32,9 +32,9 @@ model = LLM("facebook/opt-125m", quantization="fp8") ...@@ -32,9 +32,9 @@ model = LLM("facebook/opt-125m", quantization="fp8")
result = model.generate("Hello, my name is") result = model.generate("Hello, my name is")
``` ```
```{warning} :::{warning}
Currently, we load the model at original precision before quantizing down to 8-bits, so you need enough memory to load the whole model. Currently, we load the model at original precision before quantizing down to 8-bits, so you need enough memory to load the whole model.
``` :::
## Installation ## Installation
...@@ -110,9 +110,9 @@ model.generate("Hello my name is") ...@@ -110,9 +110,9 @@ model.generate("Hello my name is")
Evaluate accuracy with `lm_eval` (for example on 250 samples of `gsm8k`): Evaluate accuracy with `lm_eval` (for example on 250 samples of `gsm8k`):
```{note} :::{note}
Quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations. Quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations.
``` :::
```console ```console
$ MODEL=$PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic $ MODEL=$PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic
...@@ -137,10 +137,10 @@ If you encounter any issues or have feature requests, please open an issue on th ...@@ -137,10 +137,10 @@ If you encounter any issues or have feature requests, please open an issue on th
## Deprecated Flow ## Deprecated Flow
```{note} :::{note}
The following information is preserved for reference and search purposes. The following information is preserved for reference and search purposes.
The quantization method described below is deprecated in favor of the `llmcompressor` method described above. The quantization method described below is deprecated in favor of the `llmcompressor` method described above.
``` :::
For static per-tensor offline quantization to FP8, please install the [AutoFP8 library](https://github.com/neuralmagic/autofp8). For static per-tensor offline quantization to FP8, please install the [AutoFP8 library](https://github.com/neuralmagic/autofp8).
......
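
The maximum values quoted in the FP8 hunk above (448 for E4M3, 57344 for E5M2) follow from the bit layouts. The sketch below reproduces them under two stated assumptions that the documentation text implies but does not spell out: E5M2 reserves its top exponent code for `inf` and `nan` in the IEEE style, while E4M3 has no infinities and reserves only the all-ones exponent-and-mantissa pattern for `nan`. The function name and the `ieee_like` parameter are hypothetical.

```python
def fp8_max_finite(exp_bits, man_bits, ieee_like):
    """Largest finite magnitude of a small float format (illustrative sketch)."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        # Top exponent code is reserved for inf/nan, so use the one below it.
        max_exp = (2 ** exp_bits - 2) - bias
        max_man = 2 ** man_bits - 1
    else:
        # Assumption: only the all-ones pattern is nan, so the top exponent
        # is usable with every mantissa except the all-ones one.
        max_exp = (2 ** exp_bits - 1) - bias
        max_man = 2 ** man_bits - 2
    return 2.0 ** max_exp * (1 + max_man / 2 ** man_bits)


e4m3_max = fp8_max_finite(4, 3, ieee_like=False)  # 2**8 * 1.75
e5m2_max = fp8_max_finite(5, 2, ieee_like=True)   # 2**15 * 1.75
```

The extra exponent bit in E5M2 buys range at the cost of one mantissa bit, which is the precision tradeoff the hunk describes.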
...@@ -2,13 +2,13 @@ ...@@ -2,13 +2,13 @@
# GGUF # GGUF
```{warning} :::{warning}
Please note that GGUF support in vLLM is highly experimental and under-optimized at the moment, it might be incompatible with other features. Currently, you can use GGUF as a way to reduce memory footprint. If you encounter any issues, please report them to the vLLM team. Please note that GGUF support in vLLM is highly experimental and under-optimized at the moment, it might be incompatible with other features. Currently, you can use GGUF as a way to reduce memory footprint. If you encounter any issues, please report them to the vLLM team.
``` :::
```{warning} :::{warning}
Currently, vllm only supports loading single-file GGUF models. If you have a multi-files GGUF model, you can use [gguf-split](https://github.com/ggerganov/llama.cpp/pull/6135) tool to merge them to a single-file model. Currently, vllm only supports loading single-file GGUF models. If you have a multi-files GGUF model, you can use [gguf-split](https://github.com/ggerganov/llama.cpp/pull/6135) tool to merge them to a single-file model.
``` :::
To run a GGUF model with vLLM, you can download and use the local GGUF model from [TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF) with the following command: To run a GGUF model with vLLM, you can download and use the local GGUF model from [TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF) with the following command:
...@@ -25,9 +25,9 @@ You can also add `--tensor-parallel-size 2` to enable tensor parallelism inferen ...@@ -25,9 +25,9 @@ You can also add `--tensor-parallel-size 2` to enable tensor parallelism inferen
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tensor-parallel-size 2 vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tensor-parallel-size 2
``` ```
```{warning} :::{warning}
We recommend using the tokenizer from base model instead of GGUF model. Because the tokenizer conversion from GGUF is time-consuming and unstable, especially for some models with large vocab size. We recommend using the tokenizer from base model instead of GGUF model. Because the tokenizer conversion from GGUF is time-consuming and unstable, especially for some models with large vocab size.
``` :::
You can also use the GGUF model directly through the LLM entrypoint: You can also use the GGUF model directly through the LLM entrypoint:
......
...@@ -4,7 +4,7 @@ ...@@ -4,7 +4,7 @@
Quantization trades off model precision for smaller memory footprint, allowing large models to be run on a wider range of devices. Quantization trades off model precision for smaller memory footprint, allowing large models to be run on a wider range of devices.
```{toctree} :::{toctree}
:caption: Contents :caption: Contents
:maxdepth: 1 :maxdepth: 1
...@@ -15,4 +15,4 @@ gguf ...@@ -15,4 +15,4 @@ gguf
int8 int8
fp8 fp8
quantized_kvcache quantized_kvcache
``` :::
...@@ -7,9 +7,9 @@ This quantization method is particularly useful for reducing model size while ma ...@@ -7,9 +7,9 @@ This quantization method is particularly useful for reducing model size while ma
Please visit the HF collection of [quantized INT8 checkpoints of popular LLMs ready to use with vLLM](https://huggingface.co/collections/neuralmagic/int8-llms-for-vllm-668ec32c049dca0369816415). Please visit the HF collection of [quantized INT8 checkpoints of popular LLMs ready to use with vLLM](https://huggingface.co/collections/neuralmagic/int8-llms-for-vllm-668ec32c049dca0369816415).
```{note} :::{note}
INT8 computation is supported on NVIDIA GPUs with compute capability >= 7.5 (Turing, Ampere, Ada Lovelace, Hopper). INT8 computation is supported on NVIDIA GPUs with compute capability >= 7.5 (Turing, Ampere, Ada Lovelace, Hopper).
``` :::
## Prerequisites ## Prerequisites
...@@ -119,9 +119,9 @@ $ lm_eval --model vllm \ ...@@ -119,9 +119,9 @@ $ lm_eval --model vllm \
--batch_size 'auto' --batch_size 'auto'
``` ```
```{note} :::{note}
Quantized models can be sensitive to the presence of the `bos` token. Make sure to include the `add_bos_token=True` argument when running evaluations. Quantized models can be sensitive to the presence of the `bos` token. Make sure to include the `add_bos_token=True` argument when running evaluations.
``` :::
## Best Practices ## Best Practices
......
...@@ -4,128 +4,129 @@ ...@@ -4,128 +4,129 @@
The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM: The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
```{list-table} :::{list-table}
:header-rows: 1 :header-rows: 1
:widths: 20 8 8 8 8 8 8 8 8 8 8 :widths: 20 8 8 8 8 8 8 8 8 8 8
* - Implementation - * Implementation
- Volta * Volta
- Turing * Turing
- Ampere * Ampere
- Ada * Ada
- Hopper * Hopper
- AMD GPU * AMD GPU
- Intel GPU * Intel GPU
- x86 CPU * x86 CPU
- AWS Inferentia * AWS Inferentia
- Google TPU * Google TPU
* - AWQ - * AWQ
- ✗ *
- ✅︎ * ✅︎
- ✅︎ * ✅︎
- ✅︎ * ✅︎
- ✅︎ * ✅︎
- ✗ *
- ✅︎ * ✅︎
- ✅︎ * ✅︎
- ✗ *
- ✗ *
* - GPTQ - * GPTQ
- ✅︎ * ✅︎
- ✅︎ * ✅︎
- ✅︎ * ✅︎
- ✅︎ * ✅︎
- ✅︎ * ✅︎
- ✗ *
- ✅︎ * ✅︎
- ✅︎ * ✅︎
- ✗ *
- ✗ *
* - Marlin (GPTQ/AWQ/FP8) - * Marlin (GPTQ/AWQ/FP8)
- ✗ *
- ✗ *
- ✅︎ * ✅︎
- ✅︎ * ✅︎
- ✅︎ * ✅︎
- ✗ *
- ✗ *
- ✗ *
- ✗ *
- ✗ *
* - INT8 (W8A8) - * INT8 (W8A8)
- ✗ *
- ✅︎ * ✅︎
- ✅︎ * ✅︎
- ✅︎ * ✅︎
- ✅︎ * ✅︎
- ✗ *
- ✗ *
- ✅︎ * ✅︎
- ✗ *
- ✗ *
* - FP8 (W8A8) - * FP8 (W8A8)
- ✗ *
- ✗ *
- ✗ *
- ✅︎ * ✅︎
- ✅︎ * ✅︎
- ✅︎ * ✅︎
- ✗ *
- ✗ *
- ✗ *
- ✗ *
* - AQLM - * AQLM
- ✅︎ * ✅︎
- ✅︎ * ✅︎
- ✅︎ * ✅︎
- ✅︎ * ✅︎
- ✅︎ * ✅︎
- ✗ *
- ✗ *
- ✗ *
- ✗ *
- ✗ *
* - bitsandbytes - * bitsandbytes
- ✅︎ * ✅︎
- ✅︎ * ✅︎
- ✅︎ * ✅︎
- ✅︎ * ✅︎
- ✅︎ * ✅︎
- ✗ *
- ✗ *
- ✗ *
- ✗ *
- ✗ *
* - DeepSpeedFP - * DeepSpeedFP
- ✅︎ * ✅︎
- ✅︎ * ✅︎
- ✅︎ * ✅︎
- ✅︎ * ✅︎
- ✅︎ * ✅︎
- ✗ *
- ✗ *
- ✗ *
- ✗ *
- ✗ *
* - GGUF - * GGUF
- ✅︎ * ✅︎
- ✅︎ * ✅︎
- ✅︎ * ✅︎
- ✅︎ * ✅︎
- ✅︎ * ✅︎
- ✅︎ * ✅︎
- ✗ *
- ✗ *
- ✗ *
- ✗ *
```
:::
- Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0. - Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
- "✅︎" indicates that the quantization method is supported on the specified hardware. - "✅︎" indicates that the quantization method is supported on the specified hardware.
- "✗" indicates that the quantization method is not supported on the specified hardware. - "✗" indicates that the quantization method is not supported on the specified hardware.
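The SM-to-architecture legend above can be turned into a small lookup, which may help when checking which column of the table applies to a given GPU. This is a minimal sketch: the dictionary mirrors only the SM versions named in the legend, and `SM_TO_ARCH` and `arch_for_capability` are illustrative names, not part of vLLM.

```python
# Compute capability -> table column, copied from the legend above.
# SM_TO_ARCH and arch_for_capability are hypothetical names for illustration.
SM_TO_ARCH = {
    (7, 0): "Volta",
    (7, 5): "Turing",
    (8, 0): "Ampere",
    (8, 6): "Ampere",
    (8, 9): "Ada",
    (9, 0): "Hopper",
}


def arch_for_capability(major: int, minor: int) -> str:
    """Return the table column name for a CUDA compute capability."""
    try:
        return SM_TO_ARCH[(major, minor)]
    except KeyError:
        # SM versions not named in the legend are deliberately not guessed.
        raise ValueError(f"SM {major}.{minor} is not covered by the legend") from None


if __name__ == "__main__":
    print(arch_for_capability(8, 6))
```

On a machine with PyTorch and CUDA available, the `(major, minor)` pair can be obtained from `torch.cuda.get_device_capability()`.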
```{note} :::{note}
This compatibility chart is subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods. This compatibility chart is subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.
For the most up-to-date information on hardware support and quantization methods, please refer to <gh-dir:vllm/model_executor/layers/quantization> or consult with the vLLM development team. For the most up-to-date information on hardware support and quantization methods, please refer to <gh-dir:vllm/model_executor/layers/quantization> or consult with the vLLM development team.
``` :::
...@@ -2,15 +2,15 @@ ...@@ -2,15 +2,15 @@
# Speculative Decoding # Speculative Decoding
```{warning} :::{warning}
Please note that speculative decoding in vLLM is not yet optimized and does Please note that speculative decoding in vLLM is not yet optimized and does
not usually yield inter-token latency reductions for all prompt datasets or sampling parameters. not usually yield inter-token latency reductions for all prompt datasets or sampling parameters.
The work to optimize it is ongoing and can be followed here: <gh-issue:4630> The work to optimize it is ongoing and can be followed here: <gh-issue:4630>
``` :::
```{warning} :::{warning}
Currently, speculative decoding in vLLM is not compatible with pipeline parallelism. Currently, speculative decoding in vLLM is not compatible with pipeline parallelism.
``` :::
This document shows how to use [Speculative Decoding](https://x.com/karpathy/status/1697318534555336961) with vLLM. This document shows how to use [Speculative Decoding](https://x.com/karpathy/status/1697318534555336961) with vLLM.
Speculative decoding is a technique which improves inter-token latency in memory-bound LLM inference. Speculative decoding is a technique which improves inter-token latency in memory-bound LLM inference.
......
...@@ -95,10 +95,10 @@ completion = client.chat.completions.create( ...@@ -95,10 +95,10 @@ completion = client.chat.completions.create(
print(completion.choices[0].message.content) print(completion.choices[0].message.content)
``` ```
```{tip} :::{tip}
While not strictly necessary, it's normally better to indicate in the prompt that a JSON needs to be generated, which fields the LLM should fill, and how. While not strictly necessary, it's normally better to indicate in the prompt that a JSON needs to be generated, which fields the LLM should fill, and how.
This can improve the results notably in most cases. This can improve the results notably in most cases.
``` :::
Finally we have the `guided_grammar`, which is probably the most difficult one to use but is really powerful, as it allows us to define complete languages like SQL queries. Finally we have the `guided_grammar`, which is probably the most difficult one to use but is really powerful, as it allows us to define complete languages like SQL queries.
It works by using a context-free EBNF grammar, which we can use, for example, to define a specific format of simplified SQL queries, like in the example below: It works by using a context-free EBNF grammar, which we can use, for example, to define a specific format of simplified SQL queries, like in the example below:
......
...@@ -57,9 +57,9 @@ class Index: ...@@ -57,9 +57,9 @@ class Index:
def generate(self) -> str: def generate(self) -> str:
content = f"# {self.title}\n\n{self.description}\n\n" content = f"# {self.title}\n\n{self.description}\n\n"
content += "```{toctree}\n" content += ":::{toctree}\n"
content += f":caption: {self.caption}\n:maxdepth: {self.maxdepth}\n" content += f":caption: {self.caption}\n:maxdepth: {self.maxdepth}\n"
content += "\n".join(self.documents) + "\n```\n" content += "\n".join(self.documents) + "\n:::\n"
return content return content
......
...@@ -86,9 +86,9 @@ docker build -f Dockerfile.hpu -t vllm-hpu-env . ...@@ -86,9 +86,9 @@ docker build -f Dockerfile.hpu -t vllm-hpu-env .
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
``` ```
```{tip} :::{tip}
If you're observing the following error: `docker: Error response from daemon: Unknown runtime specified habana.`, please refer to "Install Using Containers" section of [Intel Gaudi Software Stack and Driver Installation](https://docs.habana.ai/en/v1.18.0/Installation_Guide/Bare_Metal_Fresh_OS.html). Make sure you have `habana-container-runtime` package installed and that `habana` container runtime is registered. If you're observing the following error: `docker: Error response from daemon: Unknown runtime specified habana.`, please refer to "Install Using Containers" section of [Intel Gaudi Software Stack and Driver Installation](https://docs.habana.ai/en/v1.18.0/Installation_Guide/Bare_Metal_Fresh_OS.html). Make sure you have `habana-container-runtime` package installed and that `habana` container runtime is registered.
``` :::
## Extra information ## Extra information
...@@ -155,30 +155,30 @@ Gaudi2 devices. Configurations that are not listed may or may not work. ...@@ -155,30 +155,30 @@ Gaudi2 devices. Configurations that are not listed may or may not work.
Currently in vLLM for HPU we support four execution modes, depending on selected HPU PyTorch Bridge backend (via `PT_HPU_LAZY_MODE` environment variable), and `--enforce-eager` flag. Currently in vLLM for HPU we support four execution modes, depending on selected HPU PyTorch Bridge backend (via `PT_HPU_LAZY_MODE` environment variable), and `--enforce-eager` flag.
```{list-table} vLLM execution modes :::{list-table} vLLM execution modes
:widths: 25 25 50 :widths: 25 25 50
:header-rows: 1 :header-rows: 1
* - `PT_HPU_LAZY_MODE` - * `PT_HPU_LAZY_MODE`
- `enforce_eager` * `enforce_eager`
- execution mode * execution mode
* - 0 - * 0
- 0 * 0
- torch.compile * torch.compile
* - 0 - * 0
- 1 * 1
- PyTorch eager mode * PyTorch eager mode
* - 1 - * 1
- 0 * 0
- HPU Graphs * HPU Graphs
* - 1 - * 1
- 1 * 1
- PyTorch lazy mode * PyTorch lazy mode
``` :::
```{warning} :::{warning}
In 1.18.0, all modes utilizing `PT_HPU_LAZY_MODE=0` are highly experimental and should be only used for validating functional correctness. Their performance will be improved in the next releases. For obtaining the best performance in 1.18.0, please use HPU Graphs, or PyTorch lazy mode. In 1.18.0, all modes utilizing `PT_HPU_LAZY_MODE=0` are highly experimental and should be only used for validating functional correctness. Their performance will be improved in the next releases. For obtaining the best performance in 1.18.0, please use HPU Graphs, or PyTorch lazy mode.
``` :::
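The execution-mode table above can be read as a two-key lookup. The sketch below encodes exactly the four rows of the table; `EXECUTION_MODES` and `execution_mode` are illustrative names and do not correspond to any vLLM API.

```python
# (PT_HPU_LAZY_MODE, enforce_eager) -> execution mode, copied from the table above.
# The names below are hypothetical and exist only for this illustration.
EXECUTION_MODES = {
    (0, 0): "torch.compile",
    (0, 1): "PyTorch eager mode",
    (1, 0): "HPU Graphs",
    (1, 1): "PyTorch lazy mode",
}


def execution_mode(pt_hpu_lazy_mode: int, enforce_eager: bool) -> str:
    """Look up the execution mode selected by the two settings."""
    return EXECUTION_MODES[(pt_hpu_lazy_mode, int(enforce_eager))]


if __name__ == "__main__":
    print(execution_mode(1, False))
```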
(gaudi-bucketing-mechanism)= (gaudi-bucketing-mechanism)=
...@@ -187,9 +187,9 @@ In 1.18.0, all modes utilizing `PT_HPU_LAZY_MODE=0` are highly experimental and ...@@ -187,9 +187,9 @@ In 1.18.0, all modes utilizing `PT_HPU_LAZY_MODE=0` are highly experimental and
Intel Gaudi accelerators work best when operating on models with fixed tensor shapes. [Intel Gaudi Graph Compiler](https://docs.habana.ai/en/latest/Gaudi_Overview/Intel_Gaudi_Software_Suite.html#graph-compiler-and-runtime) is responsible for generating optimized binary code that implements the given model topology on Gaudi. In its default configuration, the produced binary code may be heavily dependent on input and output tensor shapes, and can require graph recompilation when encountering differently shaped tensors within the same topology. While the resulting binaries utilize Gaudi efficiently, the compilation itself may introduce a noticeable overhead in end-to-end execution. Intel Gaudi accelerators work best when operating on models with fixed tensor shapes. [Intel Gaudi Graph Compiler](https://docs.habana.ai/en/latest/Gaudi_Overview/Intel_Gaudi_Software_Suite.html#graph-compiler-and-runtime) is responsible for generating optimized binary code that implements the given model topology on Gaudi. In its default configuration, the produced binary code may be heavily dependent on input and output tensor shapes, and can require graph recompilation when encountering differently shaped tensors within the same topology. While the resulting binaries utilize Gaudi efficiently, the compilation itself may introduce a noticeable overhead in end-to-end execution.
In a dynamic inference serving scenario, there is a need to minimize the number of graph compilations and reduce the risk of graph compilation occurring during server runtime. Currently this is achieved by "bucketing" the model's forward pass across two dimensions - `batch_size` and `sequence_length`. In a dynamic inference serving scenario, there is a need to minimize the number of graph compilations and reduce the risk of graph compilation occurring during server runtime. Currently this is achieved by "bucketing" the model's forward pass across two dimensions - `batch_size` and `sequence_length`.
```{note} :::{note}
Bucketing allows us to reduce the number of required graphs significantly, but it does not handle any graph compilation and device code generation - this is done in warmup and HPUGraph capture phase. Bucketing allows us to reduce the number of required graphs significantly, but it does not handle any graph compilation and device code generation - this is done in warmup and HPUGraph capture phase.
``` :::
Bucketing ranges are determined with 3 parameters - `min`, `step` and `max`. They can be set separately for prompt and decode phase, and for batch size and sequence length dimension. These parameters can be observed in logs during vLLM startup: Bucketing ranges are determined with 3 parameters - `min`, `step` and `max`. They can be set separately for prompt and decode phase, and for batch size and sequence length dimension. These parameters can be observed in logs during vLLM startup:
...@@ -222,15 +222,15 @@ min = 128, step = 128, max = 512 ...@@ -222,15 +222,15 @@ min = 128, step = 128, max = 512
In the logged scenario, 24 buckets were generated for prompt (prefill) runs, and 48 buckets for decode runs. Each bucket corresponds to a separate optimized device binary for a given model with specified tensor shapes. Whenever a batch of requests is processed, it is padded across batch and sequence length dimension to the smallest possible bucket. In the logged scenario, 24 buckets were generated for prompt (prefill) runs, and 48 buckets for decode runs. Each bucket corresponds to a separate optimized device binary for a given model with specified tensor shapes. Whenever a batch of requests is processed, it is padded across batch and sequence length dimension to the smallest possible bucket.
```{warning} :::{warning}
If a request exceeds the maximum bucket size in any dimension, it will be processed without padding, and its processing may require a graph compilation, potentially significantly increasing end-to-end latency. The boundaries of the buckets are user-configurable via environment variables, and upper bucket boundaries can be increased to avoid such a scenario. If a request exceeds the maximum bucket size in any dimension, it will be processed without padding, and its processing may require a graph compilation, potentially significantly increasing end-to-end latency. The boundaries of the buckets are user-configurable via environment variables, and upper bucket boundaries can be increased to avoid such a scenario.
``` :::
As an example, if a request of 3 sequences with a max sequence length of 412 comes in to an idle vLLM server, it will be padded and executed as a `(4, 512)` prefill bucket, as `batch_size` (number of sequences) will be padded to 4 (closest batch_size dimension higher than 3), and max sequence length will be padded to 512 (closest sequence length dimension higher than 412). After the prefill stage, it will be executed as a `(4, 512)` decode bucket and will continue as that bucket until either the batch dimension changes (due to a request being finished), in which case it will become a `(2, 512)` bucket, or the context length increases above 512 tokens, in which case it will become a `(4, 640)` bucket. As an example, if a request of 3 sequences with a max sequence length of 412 comes in to an idle vLLM server, it will be padded and executed as a `(4, 512)` prefill bucket, as `batch_size` (number of sequences) will be padded to 4 (closest batch_size dimension higher than 3), and max sequence length will be padded to 512 (closest sequence length dimension higher than 412). After the prefill stage, it will be executed as a `(4, 512)` decode bucket and will continue as that bucket until either the batch dimension changes (due to a request being finished), in which case it will become a `(2, 512)` bucket, or the context length increases above 512 tokens, in which case it will become a `(4, 640)` bucket.
```{note} :::{note}
Bucketing is transparent to a client -- padding in sequence length dimension is never returned to the client, and padding in batch dimension does not create new requests. Bucketing is transparent to a client -- padding in sequence length dimension is never returned to the client, and padding in batch dimension does not create new requests.
``` :::
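The padding rule from the worked example can be sketched as "pick the smallest bucket boundary that fits each dimension". The bucket lists below are assumptions chosen only so that the example values (3 to 4, 412 to 512, 513 to 640) come out as described; the real boundaries are derived from the configured `min`, `step` and `max` values. `pad_to_bucket` and `select_bucket` are hypothetical helper names.

```python
from typing import Sequence, Tuple

# Assumed bucket boundaries for illustration only; not vLLM's actual defaults.
BATCH_BUCKETS = [1, 2, 4, 8, 16, 32, 64]
SEQ_BUCKETS = list(range(128, 1024 + 1, 128))  # 128, 256, ..., 1024


def pad_to_bucket(value: int, buckets: Sequence[int]) -> int:
    """Return the smallest bucket boundary that is >= value.

    If value exceeds every boundary it is returned unchanged, mirroring the
    warning above that such requests are processed without padding.
    """
    candidates = [b for b in buckets if b >= value]
    return min(candidates) if candidates else value


def select_bucket(batch_size: int, seq_len: int) -> Tuple[int, int]:
    """Pad both dimensions independently to their nearest bucket."""
    return (
        pad_to_bucket(batch_size, BATCH_BUCKETS),
        pad_to_bucket(seq_len, SEQ_BUCKETS),
    )


if __name__ == "__main__":
    print(select_bucket(3, 412))  # the request from the example above
```

Treating the two dimensions independently keeps the lookup simple; a value above the largest boundary falls through unpadded, which is the case where a new graph compilation may be triggered.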
### Warmup ### Warmup
...@@ -252,9 +252,9 @@ INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][48/48] batch_size ...@@ -252,9 +252,9 @@ INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][48/48] batch_size
This example uses the same buckets as in the [Bucketing Mechanism](#gaudi-bucketing-mechanism) section. Each output line corresponds to execution of a single bucket. When a bucket is executed for the first time, its graph is compiled and can be reused later on, skipping further graph compilations. This example uses the same buckets as in the [Bucketing Mechanism](#gaudi-bucketing-mechanism) section. Each output line corresponds to execution of a single bucket. When a bucket is executed for the first time, its graph is compiled and can be reused later on, skipping further graph compilations.
```{tip} :::{tip}
Compiling all the buckets might take some time and can be turned off with `VLLM_SKIP_WARMUP=true` environment variable. Keep in mind that if you do that, you may face graph compilations once executing a given bucket for the first time. It is fine to disable warmup for development, but it's highly recommended to enable it in deployment. Compiling all the buckets might take some time and can be turned off with `VLLM_SKIP_WARMUP=true` environment variable. Keep in mind that if you do that, you may face graph compilations once executing a given bucket for the first time. It is fine to disable warmup for development, but it's highly recommended to enable it in deployment.
``` :::
### HPU Graph capture ### HPU Graph capture
...@@ -269,9 +269,9 @@ With its default value (`VLLM_GRAPH_RESERVED_MEM=0.1`), 10% of usable memory wil ...@@ -269,9 +269,9 @@ With its default value (`VLLM_GRAPH_RESERVED_MEM=0.1`), 10% of usable memory wil
Environment variable `VLLM_GRAPH_PROMPT_RATIO` determines the ratio of usable graph memory reserved for prefill and decode graphs. By default (`VLLM_GRAPH_PROMPT_RATIO=0.3`), 30% of usable graph memory is reserved for prefill graphs and 70% for decode graphs. Environment variable `VLLM_GRAPH_PROMPT_RATIO` determines the ratio of usable graph memory reserved for prefill and decode graphs. By default (`VLLM_GRAPH_PROMPT_RATIO=0.3`), 30% of usable graph memory is reserved for prefill graphs and 70% for decode graphs.
Lower value corresponds to less usable graph memory reserved for prefill stage, e.g. `VLLM_GRAPH_PROMPT_RATIO=0.2` will reserve 20% of usable graph memory for prefill graphs, and 80% of usable graph memory for decode graphs. Lower value corresponds to less usable graph memory reserved for prefill stage, e.g. `VLLM_GRAPH_PROMPT_RATIO=0.2` will reserve 20% of usable graph memory for prefill graphs, and 80% of usable graph memory for decode graphs.
```{note} :::{note}
`gpu_memory_utilization` does not correspond to the absolute memory usage across HPU. It specifies the memory margin after loading the model and performing a profile run. If device has 100 GiB of total memory, and 50 GiB of free memory after loading model weights and executing profiling run, `gpu_memory_utilization` at its default value will mark 90% of 50 GiB as usable, leaving 5 GiB of margin, regardless of total device memory. `gpu_memory_utilization` does not correspond to the absolute memory usage across HPU. It specifies the memory margin after loading the model and performing a profile run. If device has 100 GiB of total memory, and 50 GiB of free memory after loading model weights and executing profiling run, `gpu_memory_utilization` at its default value will mark 90% of 50 GiB as usable, leaving 5 GiB of margin, regardless of total device memory.
``` :::
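The arithmetic behind these settings can be sketched as a chain of multiplications. This is an illustration under stated assumptions, not vLLM's implementation: it assumes `VLLM_GRAPH_RESERVED_MEM` is a fraction of the usable memory (as the defaults quoted above suggest) and that the prefill/decode split is a plain proportion. `graph_memory_budget` is a hypothetical helper.

```python
from typing import Dict


def graph_memory_budget(
    free_gib: float,
    gpu_memory_utilization: float = 0.9,
    graph_reserved_mem: float = 0.1,
    graph_prompt_ratio: float = 0.3,
) -> Dict[str, float]:
    """Split free device memory (GiB) into usable, graph, prefill and decode pools.

    free_gib is the memory left after loading weights and the profiling run.
    The parameter defaults mirror the values quoted in the text above.
    """
    usable = free_gib * gpu_memory_utilization
    graph = usable * graph_reserved_mem          # assumed: fraction of usable memory
    prefill = graph * graph_prompt_ratio         # share for prefill graphs
    decode = graph - prefill                     # remainder for decode graphs
    return {
        "usable": usable,
        "margin": free_gib - usable,
        "graph": graph,
        "prefill_graph": prefill,
        "decode_graph": decode,
    }


if __name__ == "__main__":
    for name, gib in graph_memory_budget(50.0).items():
        print(f"{name}: {gib:.2f} GiB")
```

With 50 GiB free, this reproduces the 45 GiB usable and 5 GiB margin from the note above. As the text explains elsewhere, the prefill/decode split is not a hard limit, so the last two values are starting budgets rather than caps.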
User can also configure the strategy for capturing HPU Graphs for prompt and decode stages separately. Strategy affects the order of capturing graphs. There are two strategies implemented: User can also configure the strategy for capturing HPU Graphs for prompt and decode stages separately. Strategy affects the order of capturing graphs. There are two strategies implemented:
\- `max_bs` - graph capture queue will be sorted in descending order by batch size. Buckets with equal batch sizes are sorted by sequence length in ascending order (e.g. `(64, 128)`, `(64, 256)`, `(32, 128)`, `(32, 256)`, `(1, 128)`, `(1, 256)`); this is the default strategy for decode \- `max_bs` - graph capture queue will be sorted in descending order by batch size. Buckets with equal batch sizes are sorted by sequence length in ascending order (e.g. `(64, 128)`, `(64, 256)`, `(32, 128)`, `(32, 256)`, `(1, 128)`, `(1, 256)`); this is the default strategy for decode
...@@ -279,9 +279,9 @@ User can also configure the strategy for capturing HPU Graphs for prompt and dec ...@@ -279,9 +279,9 @@ User can also configure the strategy for capturing HPU Graphs for prompt and dec
When a large number of requests are pending, the vLLM scheduler will attempt to fill the maximum batch size for decode as soon as possible. When a request is finished, decode batch size decreases. When that happens, vLLM will attempt to schedule a prefill iteration for requests in the waiting queue, to fill the decode batch size to its previous state. This means that in a full load scenario, decode batch size is often at its maximum, which makes large batch size HPU Graphs crucial to capture, as reflected by the `max_bs` strategy. On the other hand, prefills will be executed most frequently with very low batch sizes (1-4), which is reflected in the `min_tokens` strategy. When a large number of requests are pending, the vLLM scheduler will attempt to fill the maximum batch size for decode as soon as possible. When a request is finished, decode batch size decreases. When that happens, vLLM will attempt to schedule a prefill iteration for requests in the waiting queue, to fill the decode batch size to its previous state. This means that in a full load scenario, decode batch size is often at its maximum, which makes large batch size HPU Graphs crucial to capture, as reflected by the `max_bs` strategy. On the other hand, prefills will be executed most frequently with very low batch sizes (1-4), which is reflected in the `min_tokens` strategy.
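The `max_bs` ordering described earlier (batch size descending, then sequence length ascending) can be expressed as a single sort key. This is a sketch of the ordering only, using `(batch_size, sequence_length)` tuples; `max_bs_order` is a hypothetical name, and the sketch says nothing about how vLLM stores or captures the graphs.

```python
from typing import List, Tuple

Bucket = Tuple[int, int]  # (batch_size, sequence_length)


def max_bs_order(buckets: List[Bucket]) -> List[Bucket]:
    """Order buckets by batch size descending, then sequence length ascending."""
    return sorted(buckets, key=lambda b: (-b[0], b[1]))


if __name__ == "__main__":
    queue = [(1, 256), (32, 128), (64, 256), (1, 128), (64, 128), (32, 256)]
    print(max_bs_order(queue))
```

Negating the batch size in the key gives the descending order without a second sort pass, so large-batch buckets are captured first while memory is still available.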
```{note} :::{note}
`VLLM_GRAPH_PROMPT_RATIO` does not set a hard limit on memory taken by graphs for each stage (prefill and decode). vLLM will first attempt to use up the entirety of usable prefill graph memory (usable graph memory * `VLLM_GRAPH_PROMPT_RATIO`) for capturing prefill HPU Graphs; next, it will attempt to do the same for decode graphs and the usable decode graph memory pool. If one stage is fully captured, and there is unused memory left within the usable graph memory pool, vLLM will attempt further graph capture for the other stage, until no more HPU Graphs can be captured without exceeding the reserved memory pool. The behavior of that mechanism can be observed in the example below. `VLLM_GRAPH_PROMPT_RATIO` does not set a hard limit on memory taken by graphs for each stage (prefill and decode). vLLM will first attempt to use up the entirety of usable prefill graph memory (usable graph memory * `VLLM_GRAPH_PROMPT_RATIO`) for capturing prefill HPU Graphs; next, it will attempt to do the same for decode graphs and the usable decode graph memory pool. If one stage is fully captured, and there is unused memory left within the usable graph memory pool, vLLM will attempt further graph capture for the other stage, until no more HPU Graphs can be captured without exceeding the reserved memory pool. The behavior of that mechanism can be observed in the example below.
``` :::
Each described step is logged by vLLM server, as follows (negative values correspond to memory being released): Each described step is logged by vLLM server, as follows (negative values correspond to memory being released):
...@@ -352,13 +352,13 @@ INFO 08-02 17:38:43 hpu_executor.py:91] init_cache_engine took 37.92 GiB of devi ...@@ -352,13 +352,13 @@ INFO 08-02 17:38:43 hpu_executor.py:91] init_cache_engine took 37.92 GiB of devi
- `VLLM_{phase}_{dim}_BUCKET_{param}` - collection of 12 environment variables configuring ranges of bucketing mechanism - `VLLM_{phase}_{dim}_BUCKET_{param}` - collection of 12 environment variables configuring ranges of bucketing mechanism
- `{phase}` is either `PROMPT` or `DECODE` * `{phase}` is either `PROMPT` or `DECODE`
- `{dim}` is either `BS`, `SEQ` or `BLOCK` * `{dim}` is either `BS`, `SEQ` or `BLOCK`
- `{param}` is either `MIN`, `STEP` or `MAX` * `{param}` is either `MIN`, `STEP` or `MAX`
- Default values: * Default values:
- Prompt: - Prompt:
- batch size min (`VLLM_PROMPT_BS_BUCKET_MIN`): `1` - batch size min (`VLLM_PROMPT_BS_BUCKET_MIN`): `1`
......
...@@ -2,374 +2,374 @@ ...@@ -2,374 +2,374 @@
vLLM is a Python library that supports the following AI accelerators. Select your AI accelerator type to see vendor specific instructions: vLLM is a Python library that supports the following AI accelerators. Select your AI accelerator type to see vendor specific instructions:
::::{tab-set} :::::{tab-set}
:sync-group: device :sync-group: device
:::{tab-item} TPU ::::{tab-item} TPU
:sync: tpu :sync: tpu
```{include} tpu.inc.md :::{include} tpu.inc.md
:start-after: "# Installation" :start-after: "# Installation"
:end-before: "## Requirements" :end-before: "## Requirements"
```
::: :::
:::{tab-item} Intel Gaudi ::::
::::{tab-item} Intel Gaudi
:sync: hpu-gaudi :sync: hpu-gaudi
```{include} hpu-gaudi.inc.md :::{include} hpu-gaudi.inc.md
:start-after: "# Installation" :start-after: "# Installation"
:end-before: "## Requirements" :end-before: "## Requirements"
```
::: :::
:::{tab-item} Neuron ::::
::::{tab-item} Neuron
:sync: neuron :sync: neuron
```{include} neuron.inc.md :::{include} neuron.inc.md
:start-after: "# Installation" :start-after: "# Installation"
:end-before: "## Requirements" :end-before: "## Requirements"
```
::: :::
:::{tab-item} OpenVINO ::::
::::{tab-item} OpenVINO
:sync: openvino :sync: openvino
```{include} openvino.inc.md :::{include} openvino.inc.md
:start-after: "# Installation" :start-after: "# Installation"
:end-before: "## Requirements" :end-before: "## Requirements"
```
::: :::
:::: ::::
:::::
## Requirements ## Requirements
::::{tab-set} :::::{tab-set}
:sync-group: device :sync-group: device
:::{tab-item} TPU ::::{tab-item} TPU
:sync: tpu :sync: tpu
```{include} tpu.inc.md :::{include} tpu.inc.md
:start-after: "## Requirements" :start-after: "## Requirements"
:end-before: "## Configure a new environment" :end-before: "## Configure a new environment"
```
::: :::
:::{tab-item} Intel Gaudi ::::
::::{tab-item} Intel Gaudi
:sync: hpu-gaudi :sync: hpu-gaudi
```{include} hpu-gaudi.inc.md :::{include} hpu-gaudi.inc.md
:start-after: "## Requirements" :start-after: "## Requirements"
:end-before: "## Configure a new environment" :end-before: "## Configure a new environment"
```
::: :::
:::{tab-item} Neuron ::::
::::{tab-item} Neuron
:sync: neuron :sync: neuron
```{include} neuron.inc.md :::{include} neuron.inc.md
:start-after: "## Requirements" :start-after: "## Requirements"
:end-before: "## Configure a new environment" :end-before: "## Configure a new environment"
```
::: :::
:::{tab-item} OpenVINO ::::
::::{tab-item} OpenVINO
:sync: openvino :sync: openvino
```{include} openvino.inc.md :::{include} openvino.inc.md
:start-after: "## Requirements" :start-after: "## Requirements"
:end-before: "## Set up using Python" :end-before: "## Set up using Python"
```
::: :::
:::: ::::
:::::
## Configure a new environment ## Configure a new environment
::::{tab-set} :::::{tab-set}
:sync-group: device :sync-group: device
:::{tab-item} TPU ::::{tab-item} TPU
:sync: tpu :sync: tpu
```{include} tpu.inc.md :::{include} tpu.inc.md
:start-after: "## Configure a new environment" :start-after: "## Configure a new environment"
:end-before: "## Set up using Python" :end-before: "## Set up using Python"
```
::: :::
:::{tab-item} Intel Gaudi ::::
::::{tab-item} Intel Gaudi
:sync: hpu-gaudi :sync: hpu-gaudi
```{include} hpu-gaudi.inc.md :::{include} hpu-gaudi.inc.md
:start-after: "## Configure a new environment" :start-after: "## Configure a new environment"
:end-before: "## Set up using Python" :end-before: "## Set up using Python"
```
::: :::
:::{tab-item} Neuron ::::
::::{tab-item} Neuron
:sync: neuron :sync: neuron
```{include} neuron.inc.md :::{include} neuron.inc.md
:start-after: "## Configure a new environment" :start-after: "## Configure a new environment"
:end-before: "## Set up using Python" :end-before: "## Set up using Python"
```
::: :::
:::{tab-item} OpenVINO ::::
:sync: openvino
```{include} ../python_env_setup.inc.md ::::{tab-item} OpenVINO
``` :sync: openvino
:::{include} ../python_env_setup.inc.md
::: :::
:::: ::::
:::::
## Set up using Python ## Set up using Python
### Pre-built wheels ### Pre-built wheels
::::{tab-set} :::::{tab-set}
:sync-group: device :sync-group: device
:::{tab-item} TPU ::::{tab-item} TPU
:sync: tpu :sync: tpu
```{include} tpu.inc.md :::{include} tpu.inc.md
:start-after: "### Pre-built wheels" :start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source" :end-before: "### Build wheel from source"
```
::: :::
:::{tab-item} Intel Gaudi ::::
::::{tab-item} Intel Gaudi
:sync: hpu-gaudi :sync: hpu-gaudi
```{include} hpu-gaudi.inc.md :::{include} hpu-gaudi.inc.md
:start-after: "### Pre-built wheels" :start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source" :end-before: "### Build wheel from source"
```
::: :::
:::{tab-item} Neuron ::::
::::{tab-item} Neuron
:sync: neuron :sync: neuron
```{include} neuron.inc.md :::{include} neuron.inc.md
:start-after: "### Pre-built wheels" :start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source" :end-before: "### Build wheel from source"
```
::: :::
:::{tab-item} OpenVINO ::::
::::{tab-item} OpenVINO
:sync: openvino :sync: openvino
```{include} openvino.inc.md :::{include} openvino.inc.md
:start-after: "### Pre-built wheels" :start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source" :end-before: "### Build wheel from source"
```
::: :::
:::: ::::
:::::
### Build wheel from source ### Build wheel from source
::::{tab-set} :::::{tab-set}
:sync-group: device :sync-group: device
:::{tab-item} TPU ::::{tab-item} TPU
:sync: tpu :sync: tpu
```{include} tpu.inc.md :::{include} tpu.inc.md
:start-after: "### Build wheel from source" :start-after: "### Build wheel from source"
:end-before: "## Set up using Docker" :end-before: "## Set up using Docker"
```
::: :::
:::{tab-item} Intel Gaudi ::::
::::{tab-item} Intel Gaudi
:sync: hpu-gaudi :sync: hpu-gaudi
```{include} hpu-gaudi.inc.md :::{include} hpu-gaudi.inc.md
:start-after: "### Build wheel from source" :start-after: "### Build wheel from source"
:end-before: "## Set up using Docker" :end-before: "## Set up using Docker"
```
::: :::
:::{tab-item} Neuron ::::
::::{tab-item} Neuron
:sync: neuron :sync: neuron
```{include} neuron.inc.md :::{include} neuron.inc.md
:start-after: "### Build wheel from source" :start-after: "### Build wheel from source"
:end-before: "## Set up using Docker" :end-before: "## Set up using Docker"
```
::: :::
:::{tab-item} OpenVINO ::::
::::{tab-item} OpenVINO
:sync: openvino :sync: openvino
```{include} openvino.inc.md :::{include} openvino.inc.md
:start-after: "### Build wheel from source" :start-after: "### Build wheel from source"
:end-before: "## Set up using Docker" :end-before: "## Set up using Docker"
```
::: :::
:::: ::::
:::::
## Set up using Docker ## Set up using Docker
### Pre-built images ### Pre-built images
::::{tab-set} :::::{tab-set}
:sync-group: device :sync-group: device
:::{tab-item} TPU ::::{tab-item} TPU
:sync: tpu :sync: tpu
```{include} tpu.inc.md :::{include} tpu.inc.md
:start-after: "### Pre-built images" :start-after: "### Pre-built images"
:end-before: "### Build image from source" :end-before: "### Build image from source"
```
::: :::
:::{tab-item} Intel Gaudi ::::
::::{tab-item} Intel Gaudi
:sync: hpu-gaudi :sync: hpu-gaudi
```{include} hpu-gaudi.inc.md :::{include} hpu-gaudi.inc.md
:start-after: "### Pre-built images" :start-after: "### Pre-built images"
:end-before: "### Build image from source" :end-before: "### Build image from source"
```
::: :::
:::{tab-item} Neuron ::::
::::{tab-item} Neuron
:sync: neuron :sync: neuron
```{include} neuron.inc.md :::{include} neuron.inc.md
:start-after: "### Pre-built images" :start-after: "### Pre-built images"
:end-before: "### Build image from source" :end-before: "### Build image from source"
```
::: :::
:::{tab-item} OpenVINO ::::
::::{tab-item} OpenVINO
:sync: openvino :sync: openvino
```{include} openvino.inc.md :::{include} openvino.inc.md
:start-after: "### Pre-built images" :start-after: "### Pre-built images"
:end-before: "### Build image from source" :end-before: "### Build image from source"
```
::: :::
:::: ::::
:::::
### Build image from source ### Build image from source
::::{tab-set} :::::{tab-set}
:sync-group: device :sync-group: device
:::{tab-item} TPU ::::{tab-item} TPU
:sync: tpu :sync: tpu
```{include} tpu.inc.md :::{include} tpu.inc.md
:start-after: "### Build image from source" :start-after: "### Build image from source"
:end-before: "## Extra information" :end-before: "## Extra information"
```
::: :::
:::{tab-item} Intel Gaudi ::::
::::{tab-item} Intel Gaudi
:sync: hpu-gaudi :sync: hpu-gaudi
```{include} hpu-gaudi.inc.md :::{include} hpu-gaudi.inc.md
:start-after: "### Build image from source" :start-after: "### Build image from source"
:end-before: "## Extra information" :end-before: "## Extra information"
```
::: :::
:::{tab-item} Neuron ::::
::::{tab-item} Neuron
:sync: neuron :sync: neuron
```{include} neuron.inc.md :::{include} neuron.inc.md
:start-after: "### Build image from source" :start-after: "### Build image from source"
:end-before: "## Extra information" :end-before: "## Extra information"
```
::: :::
:::{tab-item} OpenVINO ::::
::::{tab-item} OpenVINO
:sync: openvino :sync: openvino
```{include} openvino.inc.md :::{include} openvino.inc.md
:start-after: "### Build image from source" :start-after: "### Build image from source"
:end-before: "## Extra information" :end-before: "## Extra information"
```
::: :::
:::: ::::
:::::
## Extra information ## Extra information
::::{tab-set} :::::{tab-set}
:sync-group: device :sync-group: device
:::{tab-item} TPU ::::{tab-item} TPU
:sync: tpu :sync: tpu
```{include} tpu.inc.md :::{include} tpu.inc.md
:start-after: "## Extra information" :start-after: "## Extra information"
```
::: :::
:::{tab-item} Intel Gaudi ::::
::::{tab-item} Intel Gaudi
:sync: hpu-gaudi :sync: hpu-gaudi
```{include} hpu-gaudi.inc.md :::{include} hpu-gaudi.inc.md
:start-after: "## Extra information" :start-after: "## Extra information"
```
::: :::
:::{tab-item} Neuron ::::
::::{tab-item} Neuron
:sync: neuron :sync: neuron
```{include} neuron.inc.md :::{include} neuron.inc.md
:start-after: "## Extra information" :start-after: "## Extra information"
```
::: :::
:::{tab-item} OpenVINO ::::
::::{tab-item} OpenVINO
:sync: openvino :sync: openvino
```{include} openvino.inc.md :::{include} openvino.inc.md
:start-after: "## Extra information" :start-after: "## Extra information"
```
::: :::
:::: ::::
:::::
...@@ -67,9 +67,9 @@ Currently, there are no pre-built Neuron wheels. ...@@ -67,9 +67,9 @@ Currently, there are no pre-built Neuron wheels.
### Build wheel from source ### Build wheel from source
```{note} :::{note}
The currently supported version of Pytorch for Neuron installs `triton` version `2.1.0`. This is incompatible with `vllm >= 0.5.3`. You may see an error `cannot import name 'default_dump_dir...`. To work around this, run a `pip install --upgrade triton==3.0.0` after installing the vLLM wheel. The currently supported version of Pytorch for Neuron installs `triton` version `2.1.0`. This is incompatible with `vllm >= 0.5.3`. You may see an error `cannot import name 'default_dump_dir...`. To work around this, run a `pip install --upgrade triton==3.0.0` after installing the vLLM wheel.
``` :::
Following instructions are applicable to Neuron SDK 2.16 and beyond. Following instructions are applicable to Neuron SDK 2.16 and beyond.
......