# vLLM V1

!!! announcement

    We have fully deprecated V0. Please read [RFC #18571](https://github.com/vllm-project/vllm/issues/18571) for more details.

    If you have a use case that works on V0 Engine but not V1, please share it on [GitHub](https://github.com/vllm-project/vllm) or in the [vLLM Slack](https://inviter.co/vllm-slack).

vLLM V0 successfully supported a wide range of models and hardware, but as new features were developed independently, the system grew increasingly complex. This complexity made it harder to integrate new capabilities and introduced technical debt, revealing the need for a more streamlined and unified design.

Building on V0’s success, vLLM V1 retains the stable and proven components from V0
(such as the models, GPU kernels, and utilities). At the same time, it significantly
re-architects the core systems, covering the scheduler, KV cache manager, worker,
sampler, and API server, to provide a cohesive, maintainable framework that better
accommodates continued growth and innovation.

Specifically, V1 aims to:

- Provide a **simple, modular, and easy-to-hack codebase**.
- Ensure **high performance** with near-zero CPU overhead.
- **Combine key optimizations** into a unified architecture.
- Require **zero configs** by enabling features/optimizations by default.

We see significant performance improvements from upgrading to the V1 core engine, in
particular for long-context scenarios. Please see the performance benchmark (to be
added).

For more details, check out the vLLM V1 blog post [vLLM V1: A Major
Upgrade to vLLM’s Core Architecture](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html) (published Jan 27, 2025).

This living user guide outlines a few known **important changes and limitations** introduced by vLLM V1. The team is actively working to make V1 the default engine, so this guide will be updated regularly as more features are supported on vLLM V1.

## Differences from V0

This section lists some differences in behavior between V0 and V1.

### Chunked Prefill

Chunked prefill is enabled by default whenever possible, unlike in V0 where it was conditionally enabled based on model characteristics.

### CUDA Graphs

CUDA graph capture takes up more memory in V1 than in V0.

### Semantic Changes to Logprobs

#### Logprobs Calculation

By default, logprobs in V1 are returned as soon as they are computed from the model’s raw output (i.e.,
before applying any logits post-processing such as temperature scaling or penalty
adjustments). As a result, the returned logprobs do not reflect the final adjusted
probabilities used during sampling.

You can change this behavior by setting the `--logprobs-mode` flag.
Four modes are supported: `raw_logprobs` (default), `processed_logprobs`, `raw_logits`, `processed_logits`.
"Raw" means the values before any logits processors (such as bad words) are applied.
"Processed" means the values after all processors are applied, including temperature and top_k/top_p.
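
To make the raw/processed distinction concrete, here is a minimal sketch in plain Python. It is not vLLM's implementation: `log_softmax` and `apply_temperature` are illustrative helpers, and temperature is the only post-processing step modeled.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def apply_temperature(logits, temperature):
    """Stand-in for logits post-processing (assumption: temperature only)."""
    return [x / temperature for x in logits]


# Raw model output for a toy 3-token vocabulary.
logits = [2.0, 1.0, 0.0]

# Analogous to `raw_logprobs`: computed before any post-processing.
raw_logprobs = log_softmax(logits)

# Analogous to `processed_logprobs`: computed after post-processing.
processed_logprobs = log_softmax(apply_temperature(logits, 0.5))

print([round(v, 3) for v in raw_logprobs])
print([round(v, 3) for v in processed_logprobs])
```

With a temperature below 1, the processed distribution is sharper than the raw one, so the two modes report different values for the same token.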

#### Prompt Logprobs with Prefix Caching

While V1 supports passing prompt logprobs with prefix caching enabled, it no longer caches the logprobs.
For a request that requires prompt logprobs, the engine ignores the prefix cache and recomputes the prefill of the full prompt to generate the logprobs.
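
The practical effect on prefill work can be sketched as follows. `tokens_to_prefill` is a hypothetical helper that models only the behavior described above; it is not part of vLLM and ignores other details of the real cache.

```python
def tokens_to_prefill(prompt_len: int, cached_prefix_len: int,
                      wants_prompt_logprobs: bool) -> int:
    """Return how many prompt tokens must be run through the model.

    Hypothetical helper for illustration only, not vLLM code.
    """
    if wants_prompt_logprobs:
        # Logprobs are not cached, so the whole prompt is recomputed.
        return prompt_len
    # Otherwise only the part missing from the prefix cache is computed.
    return prompt_len - min(cached_prefix_len, prompt_len)


print(tokens_to_prefill(1000, 768, wants_prompt_logprobs=False))  # 232
print(tokens_to_prefill(1000, 768, wants_prompt_logprobs=True))   # 1000
```

Requests that ask for prompt logprobs therefore pay the full prefill cost even when most of the prompt is already cached.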

## Feature Support

For each item, its support in vLLM V1 falls into one of the following states:

- **🟢 Functional**: Fully operational with optimizations comparable to or better than V0.
- **🟡 In Progress**: Planned to be in vLLM V1, with open PRs/RFCs.
- **🔴 Removed**: Dropped from vLLM V1. Reintroduction will be considered only if there is strong demand.

!!! note
    vLLM V1’s unified scheduler treats both prompt and output tokens the same
    way by using a simple dictionary (e.g., `{request_id: num_tokens}`) to dynamically
    allocate a fixed token budget per request, enabling features like chunked prefills,
    prefix caching, and speculative decoding without a strict separation between prefill
    and decode phases.

The V1 scheduler supports multiple scheduling policies, including First-Come,
First-Served (FCFS) and priority-based scheduling (where requests are processed
based on assigned priority, with FCFS as a tie-breaker), configurable via the
`--scheduling-policy` argument.
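
The following is a minimal sketch of how a unified token budget and the two policies could interact. It is not the vLLM scheduler: the `Request` class, `schedule_step`, the policy strings, and the convention that a lower `priority` value is served first are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    arrival: int           # lower = arrived earlier
    remaining_tokens: int  # prompt tokens left to prefill, or 1 for a decode step
    priority: int = 0      # assumption: lower value is served first


def schedule_step(requests, token_budget, policy="fcfs"):
    """Return {request_id: num_tokens} for one scheduling step."""
    if policy == "priority":
        # Priority first, arrival order as the FCFS tie-breaker.
        order = sorted(requests, key=lambda r: (r.priority, r.arrival))
    elif policy == "fcfs":
        order = sorted(requests, key=lambda r: r.arrival)
    else:
        raise ValueError(f"unknown policy: {policy}")

    scheduled = {}
    for req in order:
        if token_budget == 0:
            break
        # A long prompt receives only part of the budget (a chunked prefill).
        num_tokens = min(req.remaining_tokens, token_budget)
        scheduled[req.request_id] = num_tokens
        token_budget -= num_tokens
    return scheduled


requests = [
    Request("a", arrival=0, remaining_tokens=1),     # already decoding
    Request("b", arrival=1, remaining_tokens=5000),  # long prompt
    Request("c", arrival=2, remaining_tokens=300, priority=-1),
]

fcfs_plan = schedule_step(requests, token_budget=2048, policy="fcfs")
priority_plan = schedule_step(requests, token_budget=2048, policy="priority")
print(fcfs_plan)      # {'a': 1, 'b': 2047}
print(priority_plan)  # {'c': 300, 'a': 1, 'b': 1747}
```

Because prompt and output tokens draw from the same budget, a decode step and a partial prefill can share one step without separate prefill and decode phases.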

### Hardware

| Hardware         | Status                                        |
|------------------|-----------------------------------------------|
| **NVIDIA**       | <nobr>🟢</nobr>                               |
| **AMD**          | <nobr>🟢</nobr>                               |
| **INTEL GPU**    | <nobr>🟢</nobr>                               |
| **TPU**          | <nobr>🟢</nobr>                               |
| **CPU**          | <nobr>🟢</nobr>                               |

!!! note

    More hardware platforms may be supported via plugins, e.g.:

    - [vllm-ascend](https://github.com/vllm-project/vllm-ascend)
    - [vllm-spyre](https://github.com/vllm-project/vllm-spyre)
    - [vllm-gaudi](https://github.com/vllm-project/vllm-gaudi)
    - [vllm-openvino](https://github.com/vllm-project/vllm-openvino)

    Please check their corresponding repositories for more details.

### Models

| Model Type                  | Status                                                                  |
|-----------------------------|-------------------------------------------------------------------------|
| **Decoder-only Models**     | <nobr>🟢</nobr>                                                         |
| **Encoder-Decoder Models**  | <nobr>🟢 (Whisper), 🔴 (Others) </nobr>                                |
| **Pooling Models**          | <nobr>🟢</nobr>                                                         |
| **Mamba Models**            | <nobr>🟢</nobr>                                                         |
| **Multimodal Models**       | <nobr>🟢</nobr>                                                         |

See below for the status of models that are not yet supported or have more features planned in V1.

#### Pooling Models

Pooling models are now fully supported, with prefix caching and chunked prefill newly available for last-pooling models.

We are working on enabling prefix caching and chunked prefill for more categories of pooling models.

#### Mamba Models

Models that use selective state-space mechanisms instead of standard transformer attention are supported.
Models that use Mamba-2 and Mamba-1 layers (e.g., `Mamba2ForCausalLM`, `MambaForCausalLM`, `FalconMambaForCausalLM`) are supported.

Hybrid models that combine Mamba-2 and Mamba-1 layers with standard attention layers are also supported (e.g., `BambaForCausalLM`,
`Zamba2ForCausalLM`, `NemotronHForCausalLM`, `FalconH1ForCausalLM`, `GraniteMoeHybridForCausalLM`, `JambaForCausalLM`, and `Plamo2ForCausalLM`).

Hybrid models with mechanisms different from Mamba are also supported (e.g., `MiniMaxText01ForCausalLM`, `MiniMaxM1ForCausalLM`, `Lfm2ForCausalLM`).

Please note that prefix caching is not yet supported for any of the above models.

#### Encoder-Decoder Models

Whisper is supported natively. Other encoder-decoder models are supported via the plugin system:

- **BART**: `BartForConditionalGeneration` is supported via the official [bart-plugin](https://github.com/vllm-project/bart-plugin).

For other encoder-decoder models (e.g., `MllamaForConditionalGeneration`), we recommend
following a similar pattern by implementing support through the [plugin system](../design/plugin_system.md).

### Features

| Feature                                     | Status                                                                            |
|---------------------------------------------|-----------------------------------------------------------------------------------|
| **Prefix Caching**                          | <nobr>🟢 Functional</nobr>                                                        |
| **Chunked Prefill**                         | <nobr>🟢 Functional</nobr>                                                        |
| **LoRA**                                    | <nobr>🟢 Functional</nobr>                                                        |
| **Logprobs Calculation**                    | <nobr>🟢 Functional</nobr>                                                        |
| **FP8 KV Cache**                            | <nobr>🟢 Functional</nobr>                                                        |
| **Spec Decode**                             | <nobr>🟢 Functional</nobr>                                                        |
| **Prompt Logprobs with Prefix Caching**     | <nobr>🟢 Functional</nobr>                                                        |
| **Structured Output Alternative Backends**  | <nobr>🟢 Functional</nobr>                                                        |
| **Concurrent Partial Prefills**             | <nobr>🟡 [In Progress](https://github.com/vllm-project/vllm/issues/14003)</nobr>  |
| **best_of**                                 | <nobr>🔴 [Removed](https://github.com/vllm-project/vllm/issues/13361)</nobr>      |
| **Per-Request Logits Processors**           | <nobr>🔴 [Removed](https://github.com/vllm-project/vllm/pull/13360)</nobr>        |
| **GPU <> CPU KV Cache Swapping**            | <nobr>🔴 Removed</nobr>                                                           |
| **Request-level Structured Output Backend** | <nobr>🔴 Removed</nobr>                                                           |

#### Removed Features

As part of the major architectural rework in vLLM V1, several legacy features have been removed.

##### Sampling features

- **best_of**: This feature has been removed due to limited usage. See details at [RFC #13361](https://github.com/vllm-project/vllm/issues/13361).
- **Per-Request Logits Processors**: In V0, users could pass custom
  processing functions to adjust logits on a per-request basis. In vLLM V1, this
  feature has been removed. Instead, we now support **global logits processors**,
  which are set at startup time; see [RFC #17799](https://github.com/vllm-project/vllm/issues/17799).
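
As a rough illustration of the "global" idea, the sketch below registers a processor once and applies it to every request in a batch. It deliberately does not use vLLM's actual logits processor interface: `BanTokens`, `GLOBAL_PROCESSORS`, and `process` are hypothetical names, and logits are plain Python lists.

```python
import math


class BanTokens:
    """Hypothetical global logits processor that masks a fixed set of token ids."""

    def __init__(self, banned_token_ids):
        self.banned = set(banned_token_ids)

    def __call__(self, batch_logits):
        # batch_logits holds one list of floats per request in the batch.
        return [
            [-math.inf if i in self.banned else x for i, x in enumerate(row)]
            for row in batch_logits
        ]


# Configured once at startup, then applied to every request in every step.
GLOBAL_PROCESSORS = [BanTokens({2})]


def process(batch_logits):
    """Run all registered processors over the whole batch."""
    for proc in GLOBAL_PROCESSORS:
        batch_logits = proc(batch_logits)
    return batch_logits


result = process([[0.1, 0.2, 0.3], [1.0, 2.0, 3.0]])
print(result)
```

The contrast with V0 is that the processor is part of the engine configuration and is not passed as a function alongside each individual request.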

##### KV Cache features

- **GPU <> CPU KV Cache Swapping**: With the new simplified core architecture, vLLM V1 no longer requires KV cache swapping
  to handle request preemptions.

##### Structured Output features

- **Request-level Structured Output Backend**: Removed; alternative backends (outlines, guidance) with fallbacks are now supported.