Commit 4eabe123 authored by zhuwenwen

Merge remote-tracking branch 'mirror/releases/v0.9.0' into v0.9.0-ori

parents 45840cd2 58738772
(plugin-system)= ---
title: vLLM's Plugin System
# vLLM's Plugin System ---
[](){ #plugin-system }
The community frequently requests the ability to extend vLLM with custom features. To facilitate this, vLLM includes a plugin system that allows users to add custom features without modifying the vLLM codebase. This document explains how plugins work in vLLM and how to create a plugin for vLLM.
## How Plugins Work in vLLM ## How Plugins Work in vLLM
Plugins are user-registered code that vLLM executes. Given vLLM's architecture (see [](#arch-overview)), multiple processes may be involved, especially when using distributed inference with various parallelism techniques. To enable plugins successfully, every process created by vLLM needs to load the plugin. This is done by the [load_general_plugins](https://github.com/vllm-project/vllm/blob/c76ac49d266e27aa3fea84ef2df1f813d24c91c7/vllm/plugins/__init__.py#L16) function in the `vllm.plugins` module. This function is called for every process created by vLLM before it starts any work. Plugins are user-registered code that vLLM executes. Given vLLM's architecture (see [Arch Overview][arch-overview]), multiple processes may be involved, especially when using distributed inference with various parallelism techniques. To enable plugins successfully, every process created by vLLM needs to load the plugin. This is done by the [load_general_plugins](https://github.com/vllm-project/vllm/blob/c76ac49d266e27aa3fea84ef2df1f813d24c91c7/vllm/plugins/__init__.py#L16) function in the `vllm.plugins` module. This function is called for every process created by vLLM before it starts any work.
## How vLLM Discovers Plugins ## How vLLM Discovers Plugins
...@@ -29,8 +30,10 @@ def register(): ...@@ -29,8 +30,10 @@ def register():
from vllm import ModelRegistry from vllm import ModelRegistry
if "MyLlava" not in ModelRegistry.get_supported_archs(): if "MyLlava" not in ModelRegistry.get_supported_archs():
ModelRegistry.register_model("MyLlava", ModelRegistry.register_model(
"vllm_add_dummy_model.my_llava:MyLlava") "MyLlava",
"vllm_add_dummy_model.my_llava:MyLlava",
)
``` ```
For more information on adding entry points to your package, please check the [official documentation](https://setuptools.pypa.io/en/latest/userguide/entry_point.html). For more information on adding entry points to your package, please check the [official documentation](https://setuptools.pypa.io/en/latest/userguide/entry_point.html).
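As a rough illustration of the entry-point mechanism, the sketch below resolves a `module:attribute` target into a Python object with the standard library's `importlib.metadata`. The helper `load_plugin`, the group name `vllm.general_plugins`, and the stand-in target `json:dumps` are assumptions made for this example; a real plugin would point at its own `register` function, and vLLM's actual loading code may differ.

```python
import json
from importlib.metadata import EntryPoint


def load_plugin(name: str, target: str, group: str = "vllm.general_plugins"):
    """Resolve a 'module:attribute' entry-point target into a Python object.

    Hypothetical helper for illustration; the group name is an assumption.
    """
    entry_point = EntryPoint(name=name, value=target, group=group)
    return entry_point.load()


# Stand-in target so the sketch runs without any plugin installed.
func = load_plugin("demo", "json:dumps")
print(func is json.dumps)
```

In an installed package, the same `module:attribute` string would be declared in the package metadata rather than constructed by hand.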
......
...@@ -57,11 +57,11 @@ In v0, the following metrics are exposed via a Prometheus-compatible `/metrics` ...@@ -57,11 +57,11 @@ In v0, the following metrics are exposed via a Prometheus-compatible `/metrics`
- `vllm:spec_decode_num_draft_tokens_total` (Counter) - `vllm:spec_decode_num_draft_tokens_total` (Counter)
- `vllm:spec_decode_num_emitted_tokens_total` (Counter) - `vllm:spec_decode_num_emitted_tokens_total` (Counter)
These are documented under [Inferencing and Serving -> Production Metrics](project:../../serving/metrics.md). These are documented under [Inferencing and Serving -> Production Metrics](../../usage/metrics.md).
### Grafana Dashboard ### Grafana Dashboard
vLLM also provides [a reference example](https://docs.vllm.ai/en/latest/getting_started/examples/prometheus_grafana.html) for how to collect and store these metrics using Prometheus and visualize them using a Grafana dashboard. vLLM also provides [a reference example](https://docs.vllm.ai/en/latest/examples/prometheus_grafana.html) for how to collect and store these metrics using Prometheus and visualize them using a Grafana dashboard.
The subset of metrics exposed in the Grafana dashboard gives us an indication of which metrics are especially important: The subset of metrics exposed in the Grafana dashboard gives us an indication of which metrics are especially important:
...@@ -222,9 +222,7 @@ And the calculated intervals are: ...@@ -222,9 +222,7 @@ And the calculated intervals are:
Put another way: Put another way:
:::{image} /assets/design/v1/metrics/intervals-1.png ![Interval calculations - common case](../../assets/design/v1/metrics/intervals-1.png)
:alt: Interval calculations - common case
:::
We explored the possibility of having the frontend calculate these We explored the possibility of having the frontend calculate these
intervals using the timing of events visible by the frontend. However, intervals using the timing of events visible by the frontend. However,
...@@ -239,17 +237,13 @@ When a preemption occurs during decode, since any already generated ...@@ -239,17 +237,13 @@ When a preemption occurs during decode, since any already generated
tokens are reused, we consider the preemption as affecting the tokens are reused, we consider the preemption as affecting the
inter-token, decode, and inference intervals. inter-token, decode, and inference intervals.
:::{image} /assets/design/v1/metrics/intervals-2.png ![Interval calculations - preempted decode](../../assets/design/v1/metrics/intervals-2.png)
:alt: Interval calculations - preempted decode
:::
When a preemption occurs during prefill (assuming such an event When a preemption occurs during prefill (assuming such an event
is possible), we consider the preemption as affecting the is possible), we consider the preemption as affecting the
time-to-first-token and prefill intervals. time-to-first-token and prefill intervals.
:::{image} /assets/design/v1/metrics/intervals-3.png ![Interval calculations - preempted prefill](../../assets/design/v1/metrics/intervals-3.png)
:alt: Interval calculations - preempted prefill
:::
### Frontend Stats Collection ### Frontend Stats Collection
...@@ -467,7 +461,7 @@ In general: ...@@ -467,7 +461,7 @@ In general:
hatch](https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/#show-hidden-metrics) hatch](https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/#show-hidden-metrics)
for some time before deleting them. for some time before deleting them.
See the [deprecation policy](project:../../contributing/deprecation_policy.md) for See the [deprecation policy](../../contributing/deprecation_policy.md) for
the project-wide deprecation policy. the project-wide deprecation policy.
### Unimplemented - `vllm:tokens_total` ### Unimplemented - `vllm:tokens_total`
...@@ -679,7 +673,7 @@ v0 has support for OpenTelemetry tracing: ...@@ -679,7 +673,7 @@ v0 has support for OpenTelemetry tracing:
- [OpenTelemetry blog - [OpenTelemetry blog
post](https://opentelemetry.io/blog/2024/llm-observability/) post](https://opentelemetry.io/blog/2024/llm-observability/)
- [User-facing - [User-facing
docs](https://docs.vllm.ai/en/latest/getting_started/examples/opentelemetry.html) docs](https://docs.vllm.ai/en/latest/examples/opentelemetry.html)
- [Blog - [Blog
post](https://medium.com/@ronen.schaffer/follow-the-trail-supercharging-vllm-with-opentelemetry-distributed-tracing-aa655229b46f) post](https://medium.com/@ronen.schaffer/follow-the-trail-supercharging-vllm-with-opentelemetry-distributed-tracing-aa655229b46f)
- [IBM product - [IBM product
......
...@@ -122,9 +122,7 @@ There are two design points to highlight: ...@@ -122,9 +122,7 @@ There are two design points to highlight:
As a result, we will have the following components when the KV cache manager is initialized: As a result, we will have the following components when the KV cache manager is initialized:
:::{image} /assets/design/v1/prefix_caching/overview.png ![Component Overview](../../assets/design/v1/prefix_caching/overview.png)
:alt: Component Overview
:::
* Block Pool: A list of KVCacheBlock. * Block Pool: A list of KVCacheBlock.
* Free Block Queue: Only store the pointers of head and tail blocks for manipulations. * Free Block Queue: Only store the pointers of head and tail blocks for manipulations.
...@@ -194,9 +192,7 @@ As can be seen, block 3 is a new full block and is cached. However, it is redund ...@@ -194,9 +192,7 @@ As can be seen, block 3 is a new full block and is cached. However, it is redund
When a request is finished, we free all its blocks if no other requests are using them (reference count = 0). In this example, we free request 1 and its associated blocks 2, 3, 4, and 8. The freed blocks are added to the tail of the free queue in *reverse* order. This is because the last block of a request hashes more tokens and is therefore less likely to be reused by other requests, so it should be evicted first.
:::{image} /assets/design/v1/prefix_caching/free.png ![Free queue after a request is freed](../../assets/design/v1/prefix_caching/free.png)
:alt: Free Queue after Free a Request
:::
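The freeing rule described above can be sketched as follows. This is a toy model, not vLLM's implementation: the function name `free_request`, the plain `deque`, and the reference-count dictionary are all assumptions chosen for illustration.

```python
from collections import deque


def free_request(free_queue: deque, request_blocks: list, ref_counts: dict) -> None:
    """Release a request's blocks, appending unreferenced ones in reverse order."""
    for block in reversed(request_blocks):
        ref_counts[block] -= 1
        if ref_counts[block] == 0:
            # The request's last block is appended first, so it sits nearer
            # the head of the queue and is evicted earlier.
            free_queue.append(block)


# Hypothetical starting state: an empty free queue and one reference per block.
queue = deque()
refs = {2: 1, 3: 1, 4: 1, 8: 1}
free_request(queue, [2, 3, 4, 8], refs)
print(list(queue))
```

Blocks that are still referenced by another request keep a non-zero count and stay out of the queue.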
### Eviction (LRU) ### Eviction (LRU)
...@@ -212,36 +208,24 @@ In this example, we assume the block size is 4 (each block can cache 4 tokens), ...@@ -212,36 +208,24 @@ In this example, we assume the block size is 4 (each block can cache 4 tokens),
**Time 1: The cache is empty and a new request comes in.** We allocate 4 blocks. 3 of them are already full and cached. The fourth block is partially full with 3 of 4 tokens. **Time 1: The cache is empty and a new request comes in.** We allocate 4 blocks. 3 of them are already full and cached. The fourth block is partially full with 3 of 4 tokens.
:::{image} /assets/design/v1/prefix_caching/example-time-1.png ![Example Time 1](../../assets/design/v1/prefix_caching/example-time-1.png)
:alt: Example Time 1
:::
**Time 3: Request 0 fills block 3 and asks for a new block to keep decoding.** We cache block 3 and allocate block 4.
:::{image} /assets/design/v1/prefix_caching/example-time-3.png ![Example Time 3](../../assets/design/v1/prefix_caching/example-time-3.png)
:alt: Example Time 3
:::
**Time 4: Request 1 comes in with 14 prompt tokens, where the first 10 tokens are the same as request 0.** Only the first 2 blocks (8 tokens) hit the cache, because the 3rd block matches only 2 of its 4 tokens.
:::{image} /assets/design/v1/prefix_caching/example-time-4.png ![Example Time 4](../../assets/design/v1/prefix_caching/example-time-4.png)
:alt: Example Time 4
:::
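The block-granular matching in Time 4 comes down to integer division: only full blocks inside the shared prefix can hit the cache. A minimal sketch, assuming a hypothetical helper named `cached_prefix`:

```python
def cached_prefix(shared_tokens: int, block_size: int) -> tuple[int, int]:
    """Return (blocks_hit, tokens_hit) for a shared prefix of the given length."""
    blocks_hit = shared_tokens // block_size
    return blocks_hit, blocks_hit * block_size


# Time 4: 10 shared tokens with a block size of 4.
print(cached_prefix(shared_tokens=10, block_size=4))
```

The two leftover shared tokens fall in a partially matching block and must be recomputed.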
**Time 5: Request 0 is finished and freed.** Blocks 2, 3 and 4 are added to the free queue in reverse order (but blocks 2 and 3 are still cached). Blocks 0 and 1 are not added to the free queue because they are being used by Request 1.
:::{image} /assets/design/v1/prefix_caching/example-time-5.png ![Example Time 5](../../assets/design/v1/prefix_caching/example-time-5.png)
:alt: Example Time 5
:::
**Time 6: Request 1 is finished and freed.**
:::{image} /assets/design/v1/prefix_caching/example-time-6.png ![Example Time 6](../../assets/design/v1/prefix_caching/example-time-6.png)
:alt: Example Time 6
:::
**Time 7: Request 2 comes in with 29 prompt tokens, where the first 12 tokens are the same as request 0.** Note that even though the block order in the free queue was `7 - 8 - 9 - 4 - 3 - 2 - 6 - 5 - 1 - 0`, the cache-hit blocks (i.e., 0, 1, 2) are touched and removed from the queue before allocation, so the free queue becomes `7 - 8 - 9 - 4 - 3 - 6 - 5`. As a result, the allocated blocks are 0 (cached), 1 (cached), 2 (cached), 7, 8, 9, 4, 3 (evicted).
:::{image} /assets/design/v1/prefix_caching/example-time-7.png ![Example Time 7](../../assets/design/v1/prefix_caching/example-time-7.png)
:alt: Example Time 7
:::
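The Time 7 allocation can be replayed with a short toy model. The function `allocate` and its parameters are hypothetical names for this sketch; it models the free queue as a plain `deque` and is not vLLM's actual data structure.

```python
import math
from collections import deque


def allocate(free_queue, cache_hits, num_tokens, block_size):
    """Return (allocated_blocks, remaining_free_queue) for a new request."""
    hits = set(cache_hits)
    # Cache-hit blocks are touched: removed from the free queue before allocation.
    queue = deque(block for block in free_queue if block not in hits)
    needed = math.ceil(num_tokens / block_size) - len(cache_hits)
    # New blocks are taken from the head of the queue, evicting any cached content.
    new_blocks = [queue.popleft() for _ in range(needed)]
    return list(cache_hits) + new_blocks, list(queue)


blocks, remaining = allocate(
    free_queue=[7, 8, 9, 4, 3, 2, 6, 5, 1, 0],
    cache_hits=[0, 1, 2],
    num_tokens=29,
    block_size=4,
)
print(blocks)
print(remaining)
```

With 29 tokens and a block size of 4, the request needs 8 blocks: 3 come from the cache and 5 from the head of the free queue.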
...@@ -99,7 +99,9 @@ This time, Inductor compilation is completely bypassed, and we will load from di ...@@ -99,7 +99,9 @@ This time, Inductor compilation is completely bypassed, and we will load from di
The above example just uses Inductor to compile for a general shape (i.e. symbolic shape). We can also use Inductor to compile for some of the specific shapes, for example: The above example just uses Inductor to compile for a general shape (i.e. symbolic shape). We can also use Inductor to compile for some of the specific shapes, for example:
`vllm serve meta-llama/Llama-3.2-1B --compilation_config "{'compile_sizes': [1, 2, 4, 8]}"` ```
vllm serve meta-llama/Llama-3.2-1B --compilation_config '{"compile_sizes": [1, 2, 4, 8]}'
```
Then it will also compile a specific kernel for each of the batch sizes `1, 2, 4, 8`. In this case, all of the shapes in the computation graph are static and known, and we turn on auto-tuning to tune for maximum performance. This can be slow the first time you run it, but on subsequent runs we can bypass the tuning and run the tuned kernel directly.
...@@ -134,12 +136,14 @@ The cudagraphs are captured and managed by the compiler backend, and replayed wh ...@@ -134,12 +136,14 @@ The cudagraphs are captured and managed by the compiler backend, and replayed wh
By default, vLLM will try to determine a set of sizes to capture cudagraph. You can also override it using the config `cudagraph_capture_sizes`: By default, vLLM will try to determine a set of sizes to capture cudagraph. You can also override it using the config `cudagraph_capture_sizes`:
`vllm serve meta-llama/Llama-3.2-1B --compilation-config "{'cudagraph_capture_sizes': [1, 2, 4, 8]}"` ```
vllm serve meta-llama/Llama-3.2-1B --compilation-config '{"cudagraph_capture_sizes": [1, 2, 4, 8]}'
```
Then it will only capture cudagraph for the specified sizes. It can be useful to have fine-grained control over the cudagraph capture. Then it will only capture cudagraph for the specified sizes. It can be useful to have fine-grained control over the cudagraph capture.
### Full Cudagraph capture ### Full Cudagraph capture
It is possible to include attention as part of the cudagraph if using an attention backend that is cudagraph compatible. This can improve performance in some cases such as decode speed for smaller models. Enable this using `--compilation-config "{'full_cuda_graph': True}"` It is possible to include attention as part of the cudagraph if using an attention backend that is cudagraph compatible. This can improve performance in some cases such as decode speed for smaller models. Enable this using `--compilation-config '{"full_cuda_graph": true}'`.
Currently only FlashAttention 3 is compatible, and only when cascade attention is disabled. Currently only FlashAttention 3 is compatible, and only when cascade attention is disabled.
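The commands above pass the compilation config as JSON (double-quoted keys, lowercase `true`) wrapped in single quotes for the shell. As a small sketch of how such an argument string could be produced from a Python dict, using only the standard library (the helper name `compilation_config_arg` is hypothetical):

```python
import json
import shlex


def compilation_config_arg(config: dict) -> str:
    """Render a dict as a shell-safe `--compilation-config` argument."""
    return "--compilation-config " + shlex.quote(json.dumps(config))


print(compilation_config_arg({"cudagraph_capture_sizes": [1, 2, 4, 8]}))
print(compilation_config_arg({"full_cuda_graph": True}))
```

`json.dumps` turns Python's `True` into JSON `true`, and `shlex.quote` adds the single quotes that keep the shell from interpreting the braces and double quotes.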
---
title: Automatic Prefix Caching
---
[](){ #automatic-prefix-caching }
## Introduction
Automatic Prefix Caching (APC for short) caches the KV cache of existing queries, so that a new query can directly reuse the KV cache if it shares the same prefix with one of the existing queries, allowing the new query to skip the computation of the shared part.
!!! note
Technical details on how vLLM implements APC can be found [here][design-automatic-prefix-caching].
## Enabling APC in vLLM
Set `enable_prefix_caching=True` in vLLM engine to enable APC. Here is an example:
<gh-file:examples/offline_inference/automatic_prefix_caching.py>
## Example workloads
We describe two example workloads where APC can provide a large performance benefit:
- Long document query, where the user repeatedly queries the same long document (e.g. software manual or annual report) with different queries. In this case, instead of processing the long document again and again, APC allows vLLM to process this long document *only once*, and all future requests can avoid recomputing this long document by reusing its KV cache. This allows vLLM to serve future requests with much higher throughput and much lower latency.
- Multi-round conversation, where the user may chat with the application multiple times in the same chatting session. In this case, instead of processing the whole chatting history again and again, APC allows vLLM to reuse the processing results of the chat history across all future rounds of conversation, allowing vLLM to serve future requests with much higher throughput and much lower latency.
## Limits
APC in general does not reduce the performance of vLLM. That said, APC only reduces the time spent processing queries (the prefilling phase) and does not reduce the time spent generating new tokens (the decoding phase). So APC brings no performance gain when vLLM spends most of its time generating answers (e.g. when the answers are long), or when new queries do not share a prefix with any existing query (so the computation cannot be reused).
---
title: Compatibility Matrix
---
[](){ #compatibility-matrix }
The tables below show mutually exclusive features and the support on some hardware.
The symbols used have the following meanings:
- ✅ = Full compatibility
- 🟠 = Partial compatibility
- ❌ = No compatibility
!!! note
Follow the links on ❌ or 🟠 cells to see the tracking issue for each unsupported feature/hardware combination.
## Feature x Feature
<style>
td:not(:first-child) {
text-align: center !important;
}
td {
padding: 0.5rem !important;
white-space: nowrap;
}
th {
padding: 0.5rem !important;
min-width: 0 !important;
}
th:not(:first-child) {
writing-mode: vertical-lr;
transform: rotate(180deg)
}
</style>
| Feature | [CP][chunked-prefill] | [APC][automatic-prefix-caching] | [LoRA][lora-adapter] | <abbr title="Prompt Adapter">prmpt adptr</abbr> | [SD][spec-decode] | CUDA graph | <abbr title="Pooling Models">pooling</abbr> | <abbr title="Encoder-Decoder Models">enc-dec</abbr> | <abbr title="Logprobs">logP</abbr> | <abbr title="Prompt Logprobs">prmpt logP</abbr> | <abbr title="Async Output Processing">async output</abbr> | multi-step | <abbr title="Multimodal Inputs">mm</abbr> | best-of | beam-search |
|-----------------------------------------------------------|-------------------------|-----------------------------------|------------------------|---------------------------------------------------|---------------------|--------------|-----------------------------------------------|-------------------------------------------------------|--------------------------------------|---------------------------------------------------|-------------------------------------------------------------|--------------------|---------------------------------------------|-----------|---------------|
| [CP][chunked-prefill] | ✅ | | | | | | | | | | | | | | |
| [APC][automatic-prefix-caching] | ✅ | ✅ | | | | | | | | | | | | | |
| [LoRA][lora-adapter] | ✅ | ✅ | ✅ | | | | | | | | | | | | |
| <abbr title="Prompt Adapter">prmpt adptr</abbr> | ✅ | ✅ | ✅ | ✅ | | | | | | | | | | | |
| [SD][spec-decode] | ✅ | ✅ | ❌ | ✅ | ✅ | | | | | | | | | | |
| CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | | | | | | | | |
| <abbr title="Pooling Models">pooling</abbr> | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | | | | | | | | |
| <abbr title="Encoder-Decoder Models">enc-dec</abbr> | ❌ | [](gh-issue:7366) | ❌ | ❌ | [](gh-issue:7366) | ✅ | ✅ | ✅ | | | | | | | |
| <abbr title="Logprobs">logP</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | | | | | | |
| <abbr title="Prompt Logprobs">prmpt logP</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | | | | | |
| <abbr title="Async Output Processing">async output</abbr> | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | | | | |
| multi-step | ❌ | ✅ | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | | | |
| <abbr title="Multimodal Inputs">mm</abbr> | ✅ | [🟠](gh-pr:8348) | [🟠](gh-pr:4194) | ❔ | ❔ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❔ | ✅ | | |
| best-of | ✅ | ✅ | ✅ | ✅ | [](gh-issue:6137) | ✅ | ❌ | ✅ | ✅ | ✅ | ❔ | [](gh-issue:7968) | ✅ | ✅ | |
| beam-search | ✅ | ✅ | ✅ | ✅ | [](gh-issue:6137) | ✅ | ❌ | ✅ | ✅ | ✅ | ❔ | [](gh-issue:7968) | ❔ | ✅ | ✅ |
[](){ #feature-x-hardware }
## Feature x Hardware
| Feature | Volta | Turing | Ampere | Ada | Hopper | CPU | AMD |
|-----------------------------------------------------------|--------------------|----------|----------|-------|----------|--------------------|-------|
| [CP][chunked-prefill] | [](gh-issue:2729) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| [APC][automatic-prefix-caching] | [](gh-issue:3687) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| [LoRA][lora-adapter] | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| <abbr title="Prompt Adapter">prmpt adptr</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | [](gh-issue:8475) | ✅ |
| [SD][spec-decode] | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
| <abbr title="Pooling Models">pooling</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❔ |
| <abbr title="Encoder-Decoder Models">enc-dec</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| <abbr title="Multimodal Inputs">mm</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| <abbr title="Logprobs">logP</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| <abbr title="Prompt Logprobs">prmpt logP</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| <abbr title="Async Output Processing">async output</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| multi-step | ✅ | ✅ | ✅ | ✅ | ✅ | [](gh-issue:8477) | ✅ |
| best-of | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| beam-search | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
(disagg-prefill)= ---
title: Disaggregated Prefilling (experimental)
# Disaggregated Prefilling (experimental) ---
[](){ #disagg-prefill }
This page introduces you to the disaggregated prefilling feature in vLLM.
:::{note} !!! note
This feature is experimental and subject to change. This feature is experimental and subject to change.
:::
## Why disaggregated prefilling? ## Why disaggregated prefilling?
...@@ -15,9 +15,8 @@ Two main reasons: ...@@ -15,9 +15,8 @@ Two main reasons:
- **Tuning time-to-first-token (TTFT) and inter-token-latency (ITL) separately**. Disaggregated prefilling puts the prefill and decode phases of LLM inference inside different vLLM instances. This gives you the flexibility to assign different parallel strategies (e.g. `tp` and `pp`) to tune TTFT without affecting ITL, or to tune ITL without affecting TTFT.
- **Controlling tail ITL**. Without disaggregated prefilling, vLLM may insert some prefill jobs during the decoding of one request. This results in higher tail latency. Disaggregated prefilling helps you solve this issue and control tail ITL. Chunked prefill with a proper chunk size can achieve the same goal, but in practice it is hard to figure out the correct chunk size. So disaggregated prefilling is a much more reliable way to control tail ITL.
:::{note} !!! note
Disaggregated prefill DOES NOT improve throughput. Disaggregated prefill DOES NOT improve throughput.
:::
## Usage example ## Usage example
...@@ -39,21 +38,16 @@ Key abstractions for disaggregated prefilling: ...@@ -39,21 +38,16 @@ Key abstractions for disaggregated prefilling:
- **LookupBuffer**: LookupBuffer provides two APIs: `insert` KV cache and `drop_select` KV cache. The semantics of `insert` and `drop_select` are similar to SQL, where `insert` inserts a KV cache into the buffer, and `drop_select` returns the KV cache that matches the given condition and drops it from the buffer.
- **Pipe**: A single-direction FIFO pipe for tensor transmission. It supports `send_tensor` and `recv_tensor`. - **Pipe**: A single-direction FIFO pipe for tensor transmission. It supports `send_tensor` and `recv_tensor`.
:::{note} !!! note
`insert` is a non-blocking operation, but `drop_select` is a blocking operation.
:::
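To make the blocking behavior concrete, here is a toy, single-process sketch of the `insert` / `drop_select` semantics. The class `ToyLookupBuffer`, its key-based matching, and the `timeout` parameter are assumptions for illustration only; vLLM's real LookupBuffer works on KV-cache tensors and has a different signature.

```python
import threading


class ToyLookupBuffer:
    """Illustrative buffer: insert never waits, drop_select waits for a match."""

    def __init__(self):
        self._items = {}
        self._cond = threading.Condition()

    def insert(self, key, kv_cache):
        # Non-blocking: store the entry and wake up any waiting readers.
        with self._cond:
            self._items[key] = kv_cache
            self._cond.notify_all()

    def drop_select(self, key, timeout=None):
        # Blocking: wait until a matching entry exists, then remove and return it.
        with self._cond:
            if not self._cond.wait_for(lambda: key in self._items, timeout):
                raise TimeoutError(f"no entry for {key!r}")
            return self._items.pop(key)


buffer = ToyLookupBuffer()
results = []
consumer = threading.Thread(
    target=lambda: results.append(buffer.drop_select("req-1"))
)
consumer.start()  # waits inside drop_select if the insert has not happened yet
buffer.insert("req-1", "kv-cache-placeholder")
consumer.join()
print(results)
```

Because `drop_select` removes the entry it returns, a second call with the same key waits again until another matching `insert` arrives.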
Here is a figure illustrating how the above 3 abstractions are organized: Here is a figure illustrating how the above 3 abstractions are organized:
:::{image} /assets/features/disagg_prefill/abstraction.jpg ![Disaggregated prefilling abstractions](../assets/features/disagg_prefill/abstraction.jpg)
:alt: Disaggregated prefilling abstractions
:::
The workflow of disaggregated prefilling is as follows: The workflow of disaggregated prefilling is as follows:
:::{image} /assets/features/disagg_prefill/overview.jpg ![Disaggregated prefilling workflow](../assets/features/disagg_prefill/overview.jpg)
:alt: Disaggregated prefilling workflow
:::
The `buffer` corresponds to the `insert` API in LookupBuffer, and the `drop_select` corresponds to the `drop_select` API in LookupBuffer.
......
(lora-adapter)= ---
title: LoRA Adapters
# LoRA Adapters ---
[](){ #lora-adapter }
This document shows you how to use [LoRA adapters](https://arxiv.org/abs/2106.09685) with vLLM on top of a base model. This document shows you how to use [LoRA adapters](https://arxiv.org/abs/2106.09685) with vLLM on top of a base model.
LoRA adapters can be used with any vLLM model that implements {class}`~vllm.model_executor.models.interfaces.SupportsLoRA`. LoRA adapters can be used with any vLLM model that implements [SupportsLoRA][vllm.model_executor.models.interfaces.SupportsLoRA].
Adapters can be efficiently served on a per request basis with minimal overhead. First we download the adapter(s) and save Adapters can be efficiently served on a per request basis with minimal overhead. First we download the adapter(s) and save
them locally with them locally with
...@@ -60,9 +61,8 @@ vllm serve meta-llama/Llama-2-7b-hf \ ...@@ -60,9 +61,8 @@ vllm serve meta-llama/Llama-2-7b-hf \
--lora-modules sql-lora=$HOME/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/ --lora-modules sql-lora=$HOME/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/
``` ```
:::{note} !!! note
The commit ID `0dfa347e8877a4d4ed19ee56c140fa518470028c` may change over time. Please check the latest commit ID in your environment to ensure you are using the correct one. The commit ID `0dfa347e8877a4d4ed19ee56c140fa518470028c` may change over time. Please check the latest commit ID in your environment to ensure you are using the correct one.
:::
The server entrypoint accepts all other LoRA configuration parameters (`max_loras`, `max_lora_rank`, `max_cpu_loras`, The server entrypoint accepts all other LoRA configuration parameters (`max_loras`, `max_lora_rank`, `max_cpu_loras`,
etc.), which will apply to all forthcoming requests. Upon querying the `/models` endpoint, we should see our LoRA along etc.), which will apply to all forthcoming requests. Upon querying the `/models` endpoint, we should see our LoRA along
......
(multimodal-inputs)= ---
title: Multimodal Inputs
---
[](){ #multimodal-inputs }
# Multimodal Inputs This page teaches you how to pass multi-modal inputs to [multi-modal models][supported-mm-models] in vLLM.
This page teaches you how to pass multi-modal inputs to [multi-modal models](#supported-mm-models) in vLLM. !!! note
We are actively iterating on multi-modal support. See [this RFC](gh-issue:4194) for upcoming changes,
:::{note} and [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) if you have any feedback or feature requests.
We are actively iterating on multi-modal support. See [this RFC](gh-issue:4194) for upcoming changes,
and [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) if you have any feedback or feature requests.
:::
## Offline Inference ## Offline Inference
To input multi-modal data, follow this schema in {class}`vllm.inputs.PromptType`: To input multi-modal data, follow this schema in [vllm.inputs.PromptType][]:
- `prompt`: The prompt should follow the format that is documented on HuggingFace. - `prompt`: The prompt should follow the format that is documented on HuggingFace.
- `multi_modal_data`: This is a dictionary that follows the schema defined in {class}`vllm.multimodal.inputs.MultiModalDataDict`. - `multi_modal_data`: This is a dictionary that follows the schema defined in [vllm.multimodal.inputs.MultiModalDataDict][].
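The two keys above can be assembled into a plain dictionary. The sketch below only shows the shape of such a prompt: the helper `build_multimodal_prompt`, the `image` modality key as used here, the prompt template, and the placeholder object are assumptions for illustration, and the exact prompt format depends on the model.

```python
def build_multimodal_prompt(prompt: str, **modalities) -> dict:
    """Assemble a prompt dict with the two keys described above."""
    return {"prompt": prompt, "multi_modal_data": dict(modalities)}


# Placeholder standing in for real image data (e.g. a loaded image object).
fake_image = object()

request = build_multimodal_prompt(
    "USER: <image>\nWhat is in this picture?\nASSISTANT:",  # illustrative template
    image=fake_image,
)
print(sorted(request))
```

A dictionary of this shape is what would be handed to the engine in place of a plain text prompt.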
### Image Inputs ### Image Inputs
...@@ -211,16 +211,15 @@ for o in outputs: ...@@ -211,16 +211,15 @@ for o in outputs:
Our OpenAI-compatible server accepts multi-modal data via the [Chat Completions API](https://platform.openai.com/docs/api-reference/chat). Our OpenAI-compatible server accepts multi-modal data via the [Chat Completions API](https://platform.openai.com/docs/api-reference/chat).
:::{important} !!! warning
A chat template is **required** to use Chat Completions API. A chat template is **required** to use Chat Completions API.
For HF format models, the default chat template is defined inside `chat_template.json` or `tokenizer_config.json`. For HF format models, the default chat template is defined inside `chat_template.json` or `tokenizer_config.json`.
If no default chat template is available, we will first look for a built-in fallback in <gh-file:vllm/transformers_utils/chat_templates/registry.py>. If no default chat template is available, we will first look for a built-in fallback in <gh-file:vllm/transformers_utils/chat_templates/registry.py>.
If no fallback is available, an error is raised and you have to provide the chat template manually via the `--chat-template` argument. If no fallback is available, an error is raised and you have to provide the chat template manually via the `--chat-template` argument.
For certain models, we provide alternative chat templates inside <gh-dir:vllm/examples>. For certain models, we provide alternative chat templates inside <gh-dir:vllm/examples>.
For example, VLM2Vec uses <gh-file:examples/template_vlm2vec.jinja> which is different from the default one for Phi-3-Vision. For example, VLM2Vec uses <gh-file:examples/template_vlm2vec.jinja> which is different from the default one for Phi-3-Vision.
:::
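The lookup order described above can be sketched as a small function. This is an illustrative model only, not vLLM's implementation: `resolve_chat_template` and its parameters are hypothetical names, and giving an explicitly supplied `--chat-template` the highest priority is an assumption.

```python
from typing import Optional


def resolve_chat_template(
    user_template: Optional[str],
    model_default: Optional[str],
    builtin_fallback: Optional[str],
) -> str:
    """Return the first available chat template, or raise if none exists.

    Hypothetical helper: the three arguments stand for the template passed
    via --chat-template, the one shipped with the model, and the built-in
    fallback from the registry.
    """
    # Assumption: an explicitly supplied template takes precedence.
    for candidate in (user_template, model_default, builtin_fallback):
        if candidate is not None:
            return candidate
    raise ValueError("No chat template found; provide one via --chat-template.")


# Only the built-in fallback is available here, so it is selected.
print(resolve_chat_template(None, None, "{{ fallback template }}"))
```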
### Image Inputs ### Image Inputs
...@@ -284,25 +283,21 @@ print("Chat completion output:", chat_response.choices[0].message.content) ...@@ -284,25 +283,21 @@ print("Chat completion output:", chat_response.choices[0].message.content)
Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py> Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py>
:::{tip} !!! tip
Loading from local file paths is also supported on vLLM: You can specify the allowed local media path via `--allowed-local-media-path` when launching the API server/engine, Loading from local file paths is also supported on vLLM: You can specify the allowed local media path via `--allowed-local-media-path` when launching the API server/engine,
and pass the file path as `url` in the API request. and pass the file path as `url` in the API request.
:::
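As a rough sketch of the tip above, a local file can be turned into a message whose `url` points at the file. The helper name `local_image_message` is hypothetical, and the use of a `file://` URL is an assumption; the path must lie under the directory passed to `--allowed-local-media-path`.

```python
from pathlib import Path


def local_image_message(text: str, image_path: str) -> dict:
    """Build a user message that references a local image file."""
    # Assumption: the server accepts a file:// URL for files located under
    # the directory given to --allowed-local-media-path.
    file_url = Path(image_path).resolve().as_uri()
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": file_url}},
        ],
    }


message = local_image_message("What is in this image?", "images/cat.png")
print(message["content"][1]["image_url"]["url"])
```

The resulting dictionary can be passed in the `messages` list of a Chat Completions request.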
:::{tip}
There is no need to place image placeholders in the text content of the API request - they are already represented by the image content.
In fact, you can place image placeholders in the middle of the text by interleaving text and image content.
:::
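To illustrate interleaving, the sketch below builds a `content` list in which an image sits between two pieces of text. `build_interleaved_content` is a hypothetical helper and the URL is a placeholder; only the `text` and `image_url` content part types come from the Chat Completions format.

```python
def build_interleaved_content(parts: list) -> list:
    """Convert a mixed list of strings and ("image", url) tuples into
    Chat Completions content parts, preserving their order."""
    content = []
    for part in parts:
        if isinstance(part, str):
            content.append({"type": "text", "text": part})
        else:
            kind, url = part
            if kind != "image":
                raise ValueError(f"Unsupported part kind: {kind}")
            content.append({"type": "image_url", "image_url": {"url": url}})
    return content


content = build_interleaved_content([
    "Compare the object shown here:",
    ("image", "https://example.com/first.jpg"),  # placeholder URL
    "with the description in the following text.",
])
print([part["type"] for part in content])
```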
:::{note} !!! tip
By default, the timeout for fetching images through HTTP URL is `5` seconds. There is no need to place image placeholders in the text content of the API request - they are already represented by the image content.
You can override this by setting the environment variable: In fact, you can place image placeholders in the middle of the text by interleaving text and image content.
```console !!! note
export VLLM_IMAGE_FETCH_TIMEOUT=<timeout> By default, the timeout for fetching images through HTTP URL is `5` seconds.
``` You can override this by setting the environment variable:
::: ```console
export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>
```
### Video Inputs ### Video Inputs
...@@ -357,15 +352,13 @@ print("Chat completion output from image url:", result) ...@@ -357,15 +352,13 @@ print("Chat completion output from image url:", result)
Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py> Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py>
:::{note} !!! note
By default, the timeout for fetching videos through HTTP URL is `30` seconds. By default, the timeout for fetching videos through HTTP URL is `30` seconds.
You can override this by setting the environment variable: You can override this by setting the environment variable:
```console ```console
export VLLM_VIDEO_FETCH_TIMEOUT=<timeout> export VLLM_VIDEO_FETCH_TIMEOUT=<timeout>
``` ```
:::
### Audio Inputs ### Audio Inputs
...@@ -461,15 +454,13 @@ print("Chat completion output from audio url:", result) ...@@ -461,15 +454,13 @@ print("Chat completion output from audio url:", result)
Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py> Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py>
:::{note} !!! note
By default, the timeout for fetching audios through HTTP URL is `10` seconds. By default, the timeout for fetching audios through HTTP URL is `10` seconds.
You can override this by setting the environment variable: You can override this by setting the environment variable:
```console
export VLLM_AUDIO_FETCH_TIMEOUT=<timeout>
```
::: ```console
export VLLM_AUDIO_FETCH_TIMEOUT=<timeout>
```
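When the server is launched from Python rather than a shell, the same variables can be set in the child process environment. The sketch below is a minimal example under stated assumptions: `fetch_timeout_env` is a hypothetical helper, the defaults are the values quoted in the notes above, and the values are assumed to be whole seconds.

```python
import os

# Defaults as documented in the notes above (seconds).
DEFAULT_FETCH_TIMEOUTS = {
    "VLLM_IMAGE_FETCH_TIMEOUT": 5,
    "VLLM_VIDEO_FETCH_TIMEOUT": 30,
    "VLLM_AUDIO_FETCH_TIMEOUT": 10,
}


def fetch_timeout_env(overrides: dict) -> dict:
    """Return a copy of the current environment with timeout overrides."""
    unknown = set(overrides) - set(DEFAULT_FETCH_TIMEOUTS)
    if unknown:
        raise KeyError(f"Unknown timeout variables: {sorted(unknown)}")
    env = dict(os.environ)
    for name, seconds in overrides.items():
        # Assumption: the variables are read as integer seconds.
        env[name] = str(int(seconds))
    return env


env = fetch_timeout_env({"VLLM_IMAGE_FETCH_TIMEOUT": 15})
print(env["VLLM_IMAGE_FETCH_TIMEOUT"])
# The dictionary can then be passed as `env=` to subprocess.run(...)
# when starting the server.
```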
### Embedding Inputs ### Embedding Inputs
...@@ -535,7 +526,6 @@ chat_completion = client.chat.completions.create( ...@@ -535,7 +526,6 @@ chat_completion = client.chat.completions.create(
) )
``` ```
:::{note} !!! note
Only one message can contain `{"type": "image_embeds"}`. Only one message can contain `{"type": "image_embeds"}`.
If used with a model that requires additional parameters, you must also provide a tensor for each of them, e.g. `image_grid_thw`, `image_sizes`, etc. If used with a model that requires additional parameters, you must also provide a tensor for each of them, e.g. `image_grid_thw`, `image_sizes`, etc.
:::
...@@ -6,13 +6,12 @@ This page teaches you how to pass prompt embedding inputs to vLLM. ...@@ -6,13 +6,12 @@ This page teaches you how to pass prompt embedding inputs to vLLM.
The traditional flow of text data for a Large Language Model goes from text to token ids (via a tokenizer) then from token ids to prompt embeddings. For a traditional decoder-only model (such as meta-llama/Llama-3.1-8B-Instruct), this step of converting token ids to prompt embeddings happens via a look-up from a learned embedding matrix, but the model is not limited to processing only the embeddings corresponding to its token vocabulary. The traditional flow of text data for a Large Language Model goes from text to token ids (via a tokenizer) then from token ids to prompt embeddings. For a traditional decoder-only model (such as meta-llama/Llama-3.1-8B-Instruct), this step of converting token ids to prompt embeddings happens via a look-up from a learned embedding matrix, but the model is not limited to processing only the embeddings corresponding to its token vocabulary.
:::{note} !!! note
Prompt embeddings are currently only supported in the v0 engine. Prompt embeddings are currently only supported in the v0 engine.
:::
## Offline Inference ## Offline Inference
To input multi-modal data, follow this schema in {class}`vllm.inputs.EmbedsPrompt`: To input multi-modal data, follow this schema in [vllm.inputs.EmbedsPrompt][]:
- `prompt_embeds`: A torch tensor representing a sequence of prompt/token embeddings. This has the shape (sequence_length, hidden_size), where sequence_length is the number of token embeddings and hidden_size is the hidden size (embedding size) of the model. - `prompt_embeds`: A torch tensor representing a sequence of prompt/token embeddings. This has the shape (sequence_length, hidden_size), where sequence_length is the number of token embeddings and hidden_size is the hidden size (embedding size) of the model.
...@@ -20,59 +19,7 @@ To input multi-modal data, follow this schema in {class}`vllm.inputs.EmbedsPromp ...@@ -20,59 +19,7 @@ To input multi-modal data, follow this schema in {class}`vllm.inputs.EmbedsPromp
You can pass prompt embeddings from Hugging Face Transformers models to the `'prompt_embeds'` field of the prompt embedding dictionary, as shown in the following examples: You can pass prompt embeddings from Hugging Face Transformers models to the `'prompt_embeds'` field of the prompt embedding dictionary, as shown in the following examples:
```python <gh-file:examples/offline_inference/prompt_embed_inference.py>
from vllm import LLM
import transformers

model_name = "meta-llama/Llama-3.2-1B-Instruct"

# Transformers
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
transformers_model = transformers.AutoModelForCausalLM.from_pretrained(model_name)

llm = LLM(model=model_name, enable_prompt_embeds=True)

# Refer to the HuggingFace repo for the correct format to use
chat = [{"role": "user", "content": "Please tell me about the capital of France."}]
token_ids = tokenizer.apply_chat_template(chat, add_generation_prompt=True, return_tensors='pt')

embedding_layer = transformers_model.get_input_embeddings()
prompt_embeds = embedding_layer(token_ids).squeeze(0)

# Single prompt inference
outputs = llm.generate({
    "prompt_embeds": prompt_embeds,
})

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)

# Batch inference
chats = [
    [{"role": "user", "content": "Please tell me about the capital of France."}],
    [{"role": "user", "content": "When is the day longest during the year?"}],
    [{"role": "user", "content": "Where is bigger, the moon or the sun?"}]
]

token_ids_list = [
    tokenizer.apply_chat_template(chat, add_generation_prompt=True, return_tensors='pt') for chat in chats
]
prompt_embeds_list = [embedding_layer(token_ids).squeeze(0) for token_ids in token_ids_list]

outputs = llm.generate(
    [
        {
            "prompt_embeds": prompt_embeds,
        } for prompt_embeds in prompt_embeds_list
    ]
)

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)
```
## Online Serving ## Online Serving
...@@ -93,52 +40,4 @@ vllm serve meta-llama/Llama-3.2-1B-Instruct --task generate \ ...@@ -93,52 +40,4 @@ vllm serve meta-llama/Llama-3.2-1B-Instruct --task generate \
Then, you can use the OpenAI client as follows: Then, you can use the OpenAI client as follows:
```python <gh-file:examples/online_serving/prompt_embed_inference_with_openai_client.py>
import base64
import io

from openai import OpenAI
import transformers
import torch

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model_name = "meta-llama/Llama-3.2-1B-Instruct"

# Transformers
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
transformers_model = transformers.AutoModelForCausalLM.from_pretrained(model_name)

# Refer to the HuggingFace repo for the correct format to use
chat = [{"role": "user", "content": "Please tell me about the capital of France."}]
token_ids = tokenizer.apply_chat_template(chat, add_generation_prompt=True, return_tensors='pt')

embedding_layer = transformers_model.get_input_embeddings()
prompt_embeds = embedding_layer(token_ids).squeeze(0)

# Prompt embeddings: serialize the tensor and base64-encode it for transport
buffer = io.BytesIO()
torch.save(prompt_embeds, buffer)
buffer.seek(0)
binary_data = buffer.read()
encoded_embeds = base64.b64encode(binary_data).decode('utf-8')

completion = client.completions.create(
    model=model_name,
    # NOTE: The OpenAI client does not allow `None` as an input to
    # `prompt`. Use an empty string if you have no text prompts.
    prompt="",
    max_tokens=5,
    temperature=0.0,
    # NOTE: The OpenAI client allows passing in extra JSON body via the
    # `extra_body` argument.
    extra_body={"prompt_embeds": encoded_embeds}
)

print(completion.choices[0].text)
```
---
title: Quantization
---
[](){ #quantization-index }
Quantization trades off model precision for smaller memory footprint, allowing large models to be run on a wider range of devices.
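As a back-of-the-envelope illustration of that trade-off, the weight memory of a model scales linearly with the number of bits per weight. The sketch below is an estimate under simplifying assumptions: it counts weights only and ignores quantization scales, zero points, activations, and the KV cache. The 7-billion-parameter figure is just an example.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 2**30


# Example: a hypothetical 7-billion-parameter model at several precisions.
for name, bits in [("BF16/FP16", 16), ("INT8/FP8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gib(7e9, bits):.1f} GiB")
```

Halving the bit width halves the estimated weight footprint, which is why 4-bit and 8-bit formats allow larger models to fit on smaller devices.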
Contents:
- [Supported_Hardware](supported_hardware.md)
- [Auto_Awq](auto_awq.md)
- [Bnb](bnb.md)
- [Bitblas](bitblas.md)
- [Gguf](gguf.md)
- [Gptqmodel](gptqmodel.md)
- [Int4](int4.md)
- [Int8](int8.md)
- [Fp8](fp8.md)
- [Modelopt](modelopt.md)
- [Quark](quark.md)
- [Quantized_Kvcache](quantized_kvcache.md)
- [Torchao](torchao.md)
(auto-awq)= ---
title: AutoAWQ
# AutoAWQ ---
[](){ #auto-awq }
To create a new 4-bit quantized model, you can leverage [AutoAWQ](https://github.com/casper-hansen/AutoAWQ). To create a new 4-bit quantized model, you can leverage [AutoAWQ](https://github.com/casper-hansen/AutoAWQ).
Quantization reduces the model's precision from BF16/FP16 to INT4 which effectively reduces the total model memory footprint. Quantization reduces the model's precision from BF16/FP16 to INT4 which effectively reduces the total model memory footprint.
...@@ -41,7 +42,9 @@ print(f'Model is quantized and saved at "{quant_path}"') ...@@ -41,7 +42,9 @@ print(f'Model is quantized and saved at "{quant_path}"')
To run an AWQ model with vLLM, you can use [TheBloke/Llama-2-7b-Chat-AWQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-AWQ) with the following command: To run an AWQ model with vLLM, you can use [TheBloke/Llama-2-7b-Chat-AWQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-AWQ) with the following command:
```console ```console
python examples/offline_inference/llm_engine_example.py --model TheBloke/Llama-2-7b-Chat-AWQ --quantization awq python examples/offline_inference/llm_engine_example.py \
--model TheBloke/Llama-2-7b-Chat-AWQ \
--quantization awq
``` ```
AWQ models are also supported directly through the LLM entrypoint: AWQ models are also supported directly through the LLM entrypoint:
......
(bitblas)= ---
title: BitBLAS
# BitBLAS ---
[](){ #bitblas }
vLLM now supports [BitBLAS](https://github.com/microsoft/BitBLAS) for more efficient and flexible model inference. Compared to other quantization frameworks, BitBLAS provides more precision combinations. vLLM now supports [BitBLAS](https://github.com/microsoft/BitBLAS) for more efficient and flexible model inference. Compared to other quantization frameworks, BitBLAS provides more precision combinations.
:::{note} !!! note
Ensure your hardware supports the selected `dtype` (`torch.bfloat16` or `torch.float16`). Ensure your hardware supports the selected `dtype` (`torch.bfloat16` or `torch.float16`).
Most recent NVIDIA GPUs support `float16`, while `bfloat16` is more common on newer architectures like Ampere or Hopper. Most recent NVIDIA GPUs support `float16`, while `bfloat16` is more common on newer architectures like Ampere or Hopper.
For details see [supported hardware](https://docs.vllm.ai/en/latest/features/quantization/supported_hardware.html). For details see [supported hardware](https://docs.vllm.ai/en/latest/features/quantization/supported_hardware.html).
:::
Below are the steps to utilize BitBLAS with vLLM. Below are the steps to utilize BitBLAS with vLLM.
...@@ -33,7 +33,12 @@ import torch ...@@ -33,7 +33,12 @@ import torch
# "hxbgsyxh/llama-13b-4bit-g-1-bitblas" is a pre-quantized checkpoint. # "hxbgsyxh/llama-13b-4bit-g-1-bitblas" is a pre-quantized checkpoint.
model_id = "hxbgsyxh/llama-13b-4bit-g-1-bitblas" model_id = "hxbgsyxh/llama-13b-4bit-g-1-bitblas"
llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True, quantization="bitblas") llm = LLM(
model=model_id,
dtype=torch.bfloat16,
trust_remote_code=True,
quantization="bitblas"
)
``` ```
## Read gptq format checkpoint ## Read gptq format checkpoint
...@@ -44,5 +49,11 @@ import torch ...@@ -44,5 +49,11 @@ import torch
# "hxbgsyxh/llama-13b-4bit-g-1" is a pre-quantized checkpoint. # "hxbgsyxh/llama-13b-4bit-g-1" is a pre-quantized checkpoint.
model_id = "hxbgsyxh/llama-13b-4bit-g-1" model_id = "hxbgsyxh/llama-13b-4bit-g-1"
llm = LLM(model=model_id, dtype=torch.float16, trust_remote_code=True, quantization="bitblas", max_model_len=1024) llm = LLM(
model=model_id,
dtype=torch.float16,
trust_remote_code=True,
quantization="bitblas",
max_model_len=1024
)
``` ```
(bits-and-bytes)= ---
title: BitsAndBytes
# BitsAndBytes ---
[](){ #bits-and-bytes }
vLLM now supports [BitsAndBytes](https://github.com/TimDettmers/bitsandbytes) for more efficient model inference. vLLM now supports [BitsAndBytes](https://github.com/TimDettmers/bitsandbytes) for more efficient model inference.
BitsAndBytes quantizes models to reduce memory usage and enhance performance without significantly sacrificing accuracy. BitsAndBytes quantizes models to reduce memory usage and enhance performance without significantly sacrificing accuracy.
...@@ -14,7 +15,7 @@ pip install bitsandbytes>=0.45.3 ...@@ -14,7 +15,7 @@ pip install bitsandbytes>=0.45.3
vLLM reads the model's config file and supports both in-flight quantization and pre-quantized checkpoint. vLLM reads the model's config file and supports both in-flight quantization and pre-quantized checkpoint.
You can find bitsandbytes quantized models on <https://huggingface.co/models?search=bitsandbytes>. You can find bitsandbytes quantized models on [Hugging Face](https://huggingface.co/models?search=bitsandbytes).
And usually, these repositories have a config.json file that includes a quantization_config section. And usually, these repositories have a config.json file that includes a quantization_config section.
## Read quantized checkpoint ## Read quantized checkpoint
...@@ -26,7 +27,11 @@ from vllm import LLM ...@@ -26,7 +27,11 @@ from vllm import LLM
import torch import torch
# unsloth/tinyllama-bnb-4bit is a pre-quantized checkpoint. # unsloth/tinyllama-bnb-4bit is a pre-quantized checkpoint.
model_id = "unsloth/tinyllama-bnb-4bit" model_id = "unsloth/tinyllama-bnb-4bit"
llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True) llm = LLM(
model=model_id,
dtype=torch.bfloat16,
trust_remote_code=True
)
``` ```
## Inflight quantization: load as 4bit quantization ## Inflight quantization: load as 4bit quantization
...@@ -37,8 +42,12 @@ For inflight 4bit quantization with BitsAndBytes, you need to explicitly specify ...@@ -37,8 +42,12 @@ For inflight 4bit quantization with BitsAndBytes, you need to explicitly specify
from vllm import LLM from vllm import LLM
import torch import torch
model_id = "huggyllama/llama-7b" model_id = "huggyllama/llama-7b"
llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True, \ llm = LLM(
quantization="bitsandbytes") model=model_id,
dtype=torch.bfloat16,
trust_remote_code=True,
quantization="bitsandbytes"
)
``` ```
## OpenAI Compatible Server ## OpenAI Compatible Server
......
(fp8)= ---
title: FP8 W8A8
# FP8 W8A8 ---
[](){ #fp8 }
vLLM supports FP8 (8-bit floating point) weight and activation quantization using hardware acceleration on GPUs such as Nvidia H100 and AMD MI300x. vLLM supports FP8 (8-bit floating point) weight and activation quantization using hardware acceleration on GPUs such as Nvidia H100 and AMD MI300x.
Currently, only Hopper and Ada Lovelace GPUs are officially supported for W8A8. Currently, only Hopper and Ada Lovelace GPUs are officially supported for W8A8.
...@@ -14,10 +15,9 @@ The FP8 types typically supported in hardware have two distinct representations, ...@@ -14,10 +15,9 @@ The FP8 types typically supported in hardware have two distinct representations,
- **E4M3**: Consists of 1 sign bit, 4 exponent bits, and 3 bits of mantissa. It can store values up to +/-448 and `nan`. - **E4M3**: Consists of 1 sign bit, 4 exponent bits, and 3 bits of mantissa. It can store values up to +/-448 and `nan`.
- **E5M2**: Consists of 1 sign bit, 5 exponent bits, and 2 bits of mantissa. It can store values up to +/-57344, +/- `inf`, and `nan`. The tradeoff for the increased dynamic range is lower precision of the stored values. - **E5M2**: Consists of 1 sign bit, 5 exponent bits, and 2 bits of mantissa. It can store values up to +/-57344, +/- `inf`, and `nan`. The tradeoff for the increased dynamic range is lower precision of the stored values.
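The maximum values quoted above follow directly from the bit layouts. The sketch below recomputes them; `fp8_max` is a hypothetical helper, and it assumes the common convention in which E5M2 reserves the top exponent for `inf`/`nan` (IEEE-like) while E4M3 reserves only the all-ones bit pattern for `nan`.

```python
def fp8_max(exp_bits: int, man_bits: int, ieee_like: bool) -> float:
    """Largest finite value representable in an FP8 format."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        # The all-ones exponent is reserved for inf/nan (E5M2).
        max_exp_field = 2**exp_bits - 2
        max_mantissa = 2**man_bits - 1
    else:
        # Only the all-ones exponent with all-ones mantissa is nan (E4M3),
        # so the top exponent is usable with a mantissa one step lower.
        max_exp_field = 2**exp_bits - 1
        max_mantissa = 2**man_bits - 2
    return 2.0 ** (max_exp_field - bias) * (1 + max_mantissa / 2**man_bits)


print(fp8_max(4, 3, ieee_like=False))  # 448.0
print(fp8_max(5, 2, ieee_like=True))   # 57344.0
```

E5M2 spends one more bit on the exponent, which extends the range, and one fewer on the mantissa, which is where the lower precision comes from.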
:::{note} !!! note
FP8 computation is supported on NVIDIA GPUs with compute capability >= 8.9 (Ada Lovelace, Hopper). FP8 computation is supported on NVIDIA GPUs with compute capability >= 8.9 (Ada Lovelace, Hopper).
FP8 models will run on compute capability >= 8.0 (Ampere) as weight-only W8A16, utilizing FP8 Marlin. FP8 models will run on compute capability >= 8.0 (Ampere) as weight-only W8A16, utilizing FP8 Marlin.
:::
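The note above can be read as a simple decision rule on the GPU's compute capability. The sketch below is illustrative only: `fp8_mode` is a hypothetical helper, and it treats the thresholds as inclusive, on the assumption that Ada Lovelace reports 8.9 and Ampere starts at 8.0.

```python
def fp8_mode(capability: tuple) -> str:
    """Map a (major, minor) compute capability to the expected FP8 path."""
    if capability >= (8, 9):
        # Ada Lovelace (8.9) and Hopper (9.0): native FP8 computation.
        return "W8A8"
    if capability >= (8, 0):
        # Ampere: weight-only FP8 via FP8 Marlin.
        return "W8A16 (FP8 Marlin)"
    return "unsupported"


# On a machine with PyTorch and a GPU, the tuple could come from
# torch.cuda.get_device_capability().
for cap in [(9, 0), (8, 9), (8, 6), (7, 5)]:
    print(cap, fp8_mode(cap))
```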
## Installation ## Installation
...@@ -94,9 +94,8 @@ print(result[0].outputs[0].text) ...@@ -94,9 +94,8 @@ print(result[0].outputs[0].text)
Evaluate accuracy with `lm_eval` (for example on 250 samples of `gsm8k`): Evaluate accuracy with `lm_eval` (for example on 250 samples of `gsm8k`):
:::{note} !!! note
Quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations. Quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations.
:::
```console ```console
$ MODEL=$PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic $ MODEL=$PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic
...@@ -133,6 +132,5 @@ result = model.generate("Hello, my name is") ...@@ -133,6 +132,5 @@ result = model.generate("Hello, my name is")
print(result[0].outputs[0].text) print(result[0].outputs[0].text)
``` ```
:::{warning} !!! warning
Currently, we load the model at original precision before quantizing down to 8-bits, so you need enough memory to load the whole model. Currently, we load the model at original precision before quantizing down to 8-bits, so you need enough memory to load the whole model.
:::
(gguf)= ---
title: GGUF
---
[](){ #gguf }
# GGUF !!! warning
Please note that GGUF support in vLLM is highly experimental and under-optimized at the moment, it might be incompatible with other features. Currently, you can use GGUF as a way to reduce memory footprint. If you encounter any issues, please report them to the vLLM team.
:::{warning} !!! warning
Please note that GGUF support in vLLM is highly experimental and under-optimized at the moment, it might be incompatible with other features. Currently, you can use GGUF as a way to reduce memory footprint. If you encounter any issues, please report them to the vLLM team. Currently, vllm only supports loading single-file GGUF models. If you have a multi-files GGUF model, you can use [gguf-split](https://github.com/ggerganov/llama.cpp/pull/6135) tool to merge them to a single-file model.
:::
:::{warning}
Currently, vllm only supports loading single-file GGUF models. If you have a multi-files GGUF model, you can use [gguf-split](https://github.com/ggerganov/llama.cpp/pull/6135) tool to merge them to a single-file model.
:::
To run a GGUF model with vLLM, you can download and use the local GGUF model from [TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF) with the following command: To run a GGUF model with vLLM, you can download and use the local GGUF model from [TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF) with the following command:
```console ```console
wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
# We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion. # We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
--tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0
``` ```
You can also add `--tensor-parallel-size 2` to enable tensor parallelism inference with 2 GPUs: You can also add `--tensor-parallel-size 2` to enable tensor parallelism inference with 2 GPUs:
```console ```console
# We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion. # We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tensor-parallel-size 2 vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
--tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--tensor-parallel-size 2
``` ```
:::{warning} !!! warning
We recommend using the tokenizer from the base model instead of the GGUF model, because tokenizer conversion from GGUF is time-consuming and unstable, especially for models with a large vocabulary. We recommend using the tokenizer from the base model instead of the GGUF model, because tokenizer conversion from GGUF is time-consuming and unstable, especially for models with a large vocabulary.
:::
GGUF assumes that Hugging Face can convert the metadata to a config file. If Hugging Face doesn't support your model, you can manually create a config and pass it via `--hf-config-path`. GGUF assumes that Hugging Face can convert the metadata to a config file. If Hugging Face doesn't support your model, you can manually create a config and pass it via `--hf-config-path`.
```console ```console
# If your model is not supported by huggingface you can manually provide a huggingface compatible config path # If your model is not supported by huggingface you can manually provide a huggingface compatible config path
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --hf-config-path Tinyllama/TInyLlama-1.1B-Chat-v1.0 vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
--tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--hf-config-path Tinyllama/TInyLlama-1.1B-Chat-v1.0
``` ```
You can also use the GGUF model directly through the LLM entrypoint: You can also use the GGUF model directly through the LLM entrypoint:
......
(gptqmodel)= ---
title: GPTQModel
# GPTQModel ---
[](){ #gptqmodel }
To create a new 4-bit or 8-bit GPTQ quantized model, you can leverage [GPTQModel](https://github.com/ModelCloud/GPTQModel) from ModelCloud.AI. To create a new 4-bit or 8-bit GPTQ quantized model, you can leverage [GPTQModel](https://github.com/ModelCloud/GPTQModel) from ModelCloud.AI.
...@@ -58,7 +59,8 @@ model.save(quant_path) ...@@ -58,7 +59,8 @@ model.save(quant_path)
To run a GPTQModel quantized model with vLLM, you can use [DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2](https://huggingface.co/ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2) with the following command: To run a GPTQModel quantized model with vLLM, you can use [DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2](https://huggingface.co/ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2) with the following command:
```console ```console
python examples/offline_inference/llm_engine_example.py --model ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2 python examples/offline_inference/llm_engine_example.py \
--model ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2
``` ```
## Using GPTQModel with vLLM's Python API ## Using GPTQModel with vLLM's Python API
......
(int4)= ---
title: INT4 W4A16
# INT4 W4A16 ---
[](){ #int4 }
vLLM supports quantizing weights to INT4 for memory savings and inference acceleration. This quantization method is particularly useful for reducing model size and maintaining low latency in workloads with low queries per second (QPS). vLLM supports quantizing weights to INT4 for memory savings and inference acceleration. This quantization method is particularly useful for reducing model size and maintaining low latency in workloads with low queries per second (QPS).
Please visit the HF collection of [quantized INT4 checkpoints of popular LLMs ready to use with vLLM](https://huggingface.co/collections/neuralmagic/int4-llms-for-vllm-668ec34bf3c9fa45f857df2c). Please visit the HF collection of [quantized INT4 checkpoints of popular LLMs ready to use with vLLM](https://huggingface.co/collections/neuralmagic/int4-llms-for-vllm-668ec34bf3c9fa45f857df2c).
:::{note} !!! note
INT4 computation is supported on NVIDIA GPUs with compute capability >= 8.0 (Ampere, Ada Lovelace, Hopper, Blackwell). INT4 computation is supported on NVIDIA GPUs with compute capability >= 8.0 (Ampere, Ada Lovelace, Hopper, Blackwell).
:::
## Prerequisites ## Prerequisites
...@@ -121,9 +121,8 @@ $ lm_eval --model vllm \ ...@@ -121,9 +121,8 @@ $ lm_eval --model vllm \
--batch_size 'auto' --batch_size 'auto'
``` ```
:::{note} !!! note
Quantized models can be sensitive to the presence of the `bos` token. Make sure to include the `add_bos_token=True` argument when running evaluations. Quantized models can be sensitive to the presence of the `bos` token. Make sure to include the `add_bos_token=True` argument when running evaluations.
:::
## Best Practices ## Best Practices
......
(int8)= ---
title: INT8 W8A8
# INT8 W8A8 ---
[](){ #int8 }
vLLM supports quantizing weights and activations to INT8 for memory savings and inference acceleration. vLLM supports quantizing weights and activations to INT8 for memory savings and inference acceleration.
This quantization method is particularly useful for reducing model size while maintaining good performance. This quantization method is particularly useful for reducing model size while maintaining good performance.
Please visit the HF collection of [quantized INT8 checkpoints of popular LLMs ready to use with vLLM](https://huggingface.co/collections/neuralmagic/int8-llms-for-vllm-668ec32c049dca0369816415). Please visit the HF collection of [quantized INT8 checkpoints of popular LLMs ready to use with vLLM](https://huggingface.co/collections/neuralmagic/int8-llms-for-vllm-668ec32c049dca0369816415).
:::{note} !!! note
INT8 computation is supported on NVIDIA GPUs with compute capability >= 7.5 (Turing, Ampere, Ada Lovelace, Hopper, Blackwell). INT8 computation is supported on NVIDIA GPUs with compute capability >= 7.5 (Turing, Ampere, Ada Lovelace, Hopper, Blackwell).
:::
## Prerequisites ## Prerequisites
...@@ -125,9 +125,8 @@ $ lm_eval --model vllm \ ...@@ -125,9 +125,8 @@ $ lm_eval --model vllm \
--batch_size 'auto' --batch_size 'auto'
``` ```
:::{note} !!! note
Quantized models can be sensitive to the presence of the `bos` token. Make sure to include the `add_bos_token=True` argument when running evaluations. Quantized models can be sensitive to the presence of the `bos` token. Make sure to include the `add_bos_token=True` argument when running evaluations.
:::
## Best Practices ## Best Practices
......