# vLLM V1

!!! announcement

    We have fully deprecated V0. Please read [RFC #18571](https://github.com/vllm-project/vllm/issues/18571) for more details.

    If you have a use case that works on V0 Engine but not V1, please share it on [GitHub](https://github.com/vllm-project/vllm) or in the [vLLM Slack](https://inviter.co/vllm-slack).

vLLM V0 successfully supported a wide range of models and hardware, but as new features were developed independently, the system grew increasingly complex. This complexity made it harder to integrate new capabilities and introduced technical debt, revealing the need for a more streamlined and unified design.

Building on V0’s success, vLLM V1 retains the stable and proven components from V0
(such as the models, GPU kernels, and utilities). At the same time, it significantly
re-architects the core systems, covering the scheduler, KV cache manager, worker,
sampler, and API server, to provide a cohesive, maintainable framework that better
accommodates continued growth and innovation.

Specifically, V1 aims to:

- Provide a **simple, modular, and easy-to-hack codebase**.
- Ensure **high performance** with near-zero CPU overhead.
- **Combine key optimizations** into a unified architecture.
- Require **zero configs** by enabling features/optimizations by default.

We see significant performance improvements from upgrading to the V1 core engine,
particularly in long-context scenarios. A performance benchmark is to be added.

For more details, check out the vLLM V1 blog post [vLLM V1: A Major
Upgrade to vLLM’s Core Architecture](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html) (published Jan 27, 2025).

This living user guide outlines a few known **important changes and limitations** introduced by vLLM V1. The team has been actively working to make V1 the default engine, so this guide will be updated as more features are supported on vLLM V1.

## Differences from V0

This section lists some differences in behavior between V0 and V1.

### Chunked Prefill

Chunked prefill is enabled by default whenever possible, unlike in V0 where it was conditionally enabled based on model characteristics.

### CUDA Graphs

CUDA graph capture takes up more memory in V1 than in V0.

### Semantic Changes to Logprobs

#### Logprobs Calculation

By default, logprobs in V1 are now returned immediately once computed from the model’s raw output (i.e.
before applying any logits post-processing such as temperature scaling or penalty
adjustments). As a result, the returned logprobs do not reflect the final adjusted
probabilities used during sampling.

You can adjust this behavior by setting the `--logprobs-mode` flag.
Four modes are supported: `raw_logprobs` (default), `processed_logprobs`, `raw_logits`, and `processed_logits`.
"Raw" refers to values taken before any logits processors (such as bad-words filtering) are applied.
"Processed" refers to values taken after all processors are applied, including temperature and top_k/top_p.
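The difference between the four modes can be illustrated with a small, self-contained sketch. It assumes temperature scaling is the only post-processing step, and the helper names (`log_softmax`, `apply_temperature`) and the logit values are illustrative, not vLLM internals.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def apply_temperature(logits, temperature):
    """One example of logits post-processing: divide each logit by the temperature."""
    return [x / temperature for x in logits]


# Hypothetical logits for a 3-token vocabulary (illustrative values only).
raw_logits = [2.0, 1.0, 0.0]
processed_logits = apply_temperature(raw_logits, temperature=0.5)

# One entry per mode name listed above.
views = {
    "raw_logits": raw_logits,
    "raw_logprobs": log_softmax(raw_logits),
    "processed_logits": processed_logits,
    "processed_logprobs": log_softmax(processed_logits),
}

for mode, values in views.items():
    print(mode, [round(v, 3) for v in values])
```

With a temperature below 1, the processed distribution is sharper than the raw one, so the two logprob views diverge even though they come from the same forward pass.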

#### Prompt Logprobs with Prefix Caching

Although V1 supports prompt logprobs with prefix caching enabled, it no longer caches the logprobs.
For a request that requires prompt logprobs, the engine ignores the prefix cache and recomputes the prefill of the full prompt to generate them.

## Feature Support

For each item, its support in vLLM V1 falls into one of the following states:

- **🟢 Functional**: Fully operational with optimizations comparable to or better than V0.
- **🟡 In Progress**: Planned to be in vLLM V1, with open PRs/RFCs.
- **🔴 Removed**: Dropped from vLLM V1. Will only consider re-introducing if there is strong demand.

!!! note
    vLLM V1’s unified scheduler treats both prompt and output tokens the same
    way by using a simple dictionary (e.g., `{request_id: num_tokens}`) to dynamically
    allocate a fixed token budget per request, enabling features like chunked prefills,
    prefix caching, and speculative decoding without a strict separation between prefill
    and decode phases.
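A minimal sketch of this idea follows, assuming the token budget applies to a single scheduling step. The `Request` class and `schedule_step` function are hypothetical names for illustration and do not mirror vLLM's actual scheduler code.

```python
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    num_tokens_total: int         # prompt tokens plus output tokens known so far
    num_computed_tokens: int = 0  # tokens already processed in earlier steps


def schedule_step(requests, token_budget):
    """Return {request_id: num_tokens} for one step, spending at most token_budget."""
    scheduled = {}
    for req in requests:
        if token_budget == 0:
            break
        remaining = req.num_tokens_total - req.num_computed_tokens
        if remaining <= 0:
            continue
        # A decoding request needs 1 token; a long prompt is simply cut to fit
        # the remaining budget, which is what makes the prefill "chunked".
        num_tokens = min(remaining, token_budget)
        scheduled[req.request_id] = num_tokens
        token_budget -= num_tokens
    return scheduled


requests = [
    Request("decode-a", num_tokens_total=101, num_computed_tokens=100),
    Request("prefill-b", num_tokens_total=5000),
]
step = schedule_step(requests, token_budget=2048)
print(step)  # {'decode-a': 1, 'prefill-b': 2047}
```

Because both requests are described only by how many tokens they still need, the same loop handles a decode step and a partial prefill without separate code paths.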

The V1 scheduler supports multiple scheduling policies, including First-Come,
First-Served (FCFS) and priority-based scheduling (where requests are processed
based on assigned priority, with FCFS as a tie-breaker), configurable via the
`--scheduling-policy` argument.
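The ordering rule can be sketched with a small heap-based queue. This is an illustration only: the `WaitingQueue` class is hypothetical, the policy names `"fcfs"` and `"priority"` are assumed here, and so is the convention that a lower priority value is served first.

```python
import heapq
from itertools import count


class WaitingQueue:
    """Toy waiting queue supporting FCFS and priority ordering."""

    def __init__(self, policy="fcfs"):
        if policy not in ("fcfs", "priority"):
            raise ValueError(f"unknown policy: {policy}")
        self.policy = policy
        self._heap = []
        self._arrival = count()  # monotonically increasing arrival index

    def add(self, request_id, priority=0):
        arrival = next(self._arrival)
        # FCFS ignores the priority; the priority policy sorts by priority
        # first and falls back to arrival order to break ties.
        key = (priority, arrival) if self.policy == "priority" else (0, arrival)
        heapq.heappush(self._heap, (key, request_id))

    def pop(self):
        return heapq.heappop(self._heap)[1]

    def __len__(self):
        return len(self._heap)


def drain(policy):
    """Enqueue three requests and return the order in which they are served."""
    queue = WaitingQueue(policy)
    queue.add("a", priority=2)
    queue.add("b", priority=1)
    queue.add("c", priority=1)
    return [queue.pop() for _ in range(len(queue))]


print(drain("fcfs"))      # ['a', 'b', 'c']
print(drain("priority"))  # ['b', 'c', 'a']
```

Requests `b` and `c` share a priority, so their relative order is decided by arrival, which is the FCFS tie-breaker described above.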

### Hardware

| Hardware         | Status                                        |
|------------------|-----------------------------------------------|
| **NVIDIA**       | <nobr>🟢</nobr>                               |
| **AMD**          | <nobr>🟢</nobr>                               |
| **INTEL GPU**    | <nobr>🟢</nobr>                               |
| **TPU**          | <nobr>🟢</nobr>                               |
| **CPU**          | <nobr>🟢</nobr>                               |

!!! note

    More hardware platforms may be supported via plugins, e.g.:

    - [vllm-ascend](https://github.com/vllm-project/vllm-ascend)
    - [vllm-spyre](https://github.com/vllm-project/vllm-spyre)
    - [vllm-gaudi](https://github.com/vllm-project/vllm-gaudi)
    - [vllm-openvino](https://github.com/vllm-project/vllm-openvino)

    Please check their corresponding repositories for more details.

### Models

| Model Type                  | Status                                                                  |
|-----------------------------|-------------------------------------------------------------------------|
| **Decoder-only Models**     | <nobr>🟢</nobr>                                                         |
| **Encoder-Decoder Models**  | <nobr>🟢 (Whisper), 🔴 (Others) </nobr>                                |
| **Pooling Models**          | <nobr>🟢</nobr>                                                         |
| **Mamba Models**            | <nobr>🟢</nobr>                                                         |
| **Multimodal Models**       | <nobr>🟢</nobr>                                                         |

See below for the status of models that are not yet supported or have more features planned in V1.

#### Pooling Models

Pooling models are now fully supported, with prefix caching and chunked prefill newly available for last-pooling models.

We are working on enabling prefix caching and chunked prefill for more categories of pooling models.

#### Mamba Models

Models using selective state-space mechanisms instead of standard transformer attention are supported.
Models that use Mamba-2 and Mamba-1 layers (e.g., `Mamba2ForCausalLM`, `MambaForCausalLM`, `FalconMambaForCausalLM`) are supported.

Hybrid models that combine Mamba-2 and Mamba-1 layers with standard attention layers are also supported (e.g., `BambaForCausalLM`,
`Zamba2ForCausalLM`, `NemotronHForCausalLM`, `FalconH1ForCausalLM`, `GraniteMoeHybridForCausalLM`, `JambaForCausalLM`, and `Plamo2ForCausalLM`).

Hybrid models that use mechanisms other than Mamba are also supported (e.g., `MiniMaxText01ForCausalLM`, `MiniMaxM1ForCausalLM`, `Lfm2ForCausalLM`).

Please note that prefix caching is not yet supported for any of the above models.

#### Encoder-Decoder Models

Whisper is supported. Other models requiring cross-attention between separate
encoder and decoder (e.g., `BartForConditionalGeneration`,
`MllamaForConditionalGeneration`) are no longer supported.

### Features

| Feature                                     | Status                                                                            |
|---------------------------------------------|-----------------------------------------------------------------------------------|
| **Prefix Caching**                          | <nobr>🟢 Functional</nobr>                                                        |
| **Chunked Prefill**                         | <nobr>🟢 Functional</nobr>                                                        |
| **LoRA**                                    | <nobr>🟢 Functional</nobr>                                                        |
| **Logprobs Calculation**                    | <nobr>🟢 Functional</nobr>                                                        |
| **FP8 KV Cache**                            | <nobr>🟢 Functional</nobr>                                                        |
| **Spec Decode**                             | <nobr>🟢 Functional</nobr>                                                        |
| **Prompt Logprobs with Prefix Caching**     | <nobr>🟢 Functional</nobr>                                                        |
| **Structured Output Alternative Backends**  | <nobr>🟢 Functional</nobr>                                                        |
| **Concurrent Partial Prefills**             | <nobr>🟡 [In Progress](https://github.com/vllm-project/vllm/issues/14003)</nobr>  |
| **best_of**                                 | <nobr>🔴 [Removed](https://github.com/vllm-project/vllm/issues/13361)</nobr>      |
| **Per-Request Logits Processors**           | <nobr>🔴 [Removed](https://github.com/vllm-project/vllm/pull/13360)</nobr>        |
| **GPU <> CPU KV Cache Swapping**            | <nobr>🔴 Removed</nobr>                                                           |
| **Request-level Structured Output Backend** | <nobr>🔴 Removed</nobr>                                                           |


#### Removed Features

As part of the major architectural rework in vLLM V1, several legacy features have been removed.

##### Sampling features

- **best_of**: This feature has been removed due to limited usage. See details at [RFC #13361](https://github.com/vllm-project/vllm/issues/13361).
- **Per-Request Logits Processors**: In V0, users could pass custom
  processing functions to adjust logits on a per-request basis. In vLLM V1, this
  feature has been removed. Instead, we now support **global logits processors**,
  which are set at startup time; see [RFC #17799](https://github.com/vllm-project/vllm/issues/17799).

##### KV Cache features

- **GPU <> CPU KV Cache Swapping**: With the new simplified core architecture, vLLM V1 no longer requires KV cache swapping
  to handle request preemptions.

##### Structured Output features

- **Request-level Structured Output Backend**: Removed. Alternative backends (outlines, guidance) with fallbacks are now supported.