# vLLM V1

!!! announcement

    We have started the process of deprecating V0. Please read [RFC #18571](gh-issue:18571) for more details.

V1 is now enabled by default for all supported use cases, and we will gradually enable it for every use case we plan to support. Please share any feedback on [GitHub](https://github.com/vllm-project/vllm) or in the [vLLM Slack](https://inviter.co/vllm-slack).

To disable V1, set the environment variable `VLLM_USE_V1=0`, and please open a GitHub issue telling us why!
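
For scripts that create the engine from Python, one way to apply this setting is to export the variable before vLLM is imported. The snippet below is a minimal sketch, assuming the variable is read from the process environment when the engine starts; `disable_v1` is a hypothetical helper name, not part of vLLM.

```python
import os


def disable_v1(env=None):
    """Set VLLM_USE_V1=0 in the given mapping (hypothetical helper, not a vLLM API)."""
    env = os.environ if env is None else env
    env["VLLM_USE_V1"] = "0"
    return env


# Call this before importing vllm so the setting is visible when the engine is created.
disable_v1()
print(os.environ["VLLM_USE_V1"])
```

When launching from a shell, the same effect is usually achieved by prefixing the command with `VLLM_USE_V1=0`.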

## Why vLLM V1?

vLLM V0 successfully supported a wide range of models and hardware, but as new features were developed independently, the system grew increasingly complex. This complexity made it harder to integrate new capabilities and introduced technical debt, revealing the need for a more streamlined and unified design.

Building on V0’s success, vLLM V1 retains the stable and proven components from V0
(such as the models, GPU kernels, and utilities). At the same time, it significantly
re-architects the core systems, covering the scheduler, KV cache manager, worker,
sampler, and API server, to provide a cohesive, maintainable framework that better
accommodates continued growth and innovation.

Specifically, V1 aims to:

- Provide a **simple, modular, and easy-to-hack codebase**.
- Ensure **high performance** with near-zero CPU overhead.
- **Combine key optimizations** into a unified architecture.
- Require **zero configs** by enabling features/optimizations by default.

Upgrading to the V1 core engine yields significant performance improvements,
particularly in long-context scenarios. See the performance benchmark (to be
added).

For more details, check out the vLLM V1 blog post [vLLM V1: A Major
Upgrade to vLLM’s Core Architecture](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html) (published Jan 27, 2025).

This living user guide outlines a few known **important changes and limitations** introduced by vLLM V1. The team is actively working to make V1 the default engine, so this guide will be updated regularly as more features are supported on vLLM V1.

## Current Status

For each item, our progress towards V1 support falls into one of the following states:

- **🚀 Optimized**: Nearly fully optimized, with no further work currently planned.
- **🟢 Functional**: Fully operational, with ongoing optimizations.
- **🚧 WIP**: Under active development.
- **🟡 Planned**: Scheduled for future implementation (some may have open PRs/RFCs).
- **🟠 Delayed**: Temporarily dropped in V1 but planned to be re-introduced later.
- **🔴 Deprecated**: Not planned for V1 unless there is strong demand.

!!! note
    vLLM V1’s unified scheduler treats both prompt and output tokens the same
    way by using a simple dictionary (e.g., `{request_id: num_tokens}`) to dynamically
    allocate a fixed token budget per request, enabling features like chunked prefills,
    prefix caching, and speculative decoding without a strict separation between prefill
    and decode phases.

The V1 scheduler supports multiple scheduling policies, including First-Come,
First-Served (FCFS) and priority-based scheduling (where requests are processed
based on assigned priority, with FCFS as a tie-breaker), configurable via the
`--scheduling-policy` argument.
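
The scheduling behavior described above can be sketched in a few lines. This is an illustrative toy, not vLLM's scheduler: the `Request` fields, the `order_requests` and `schedule_step` helpers, the single per-step token budget, and the convention that a lower priority value is served first are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    arrival_time: float
    num_tokens: int               # prompt tokens plus tokens generated so far
    num_computed_tokens: int = 0  # tokens the model has already processed
    priority: int = 0             # assumption: lower value is served first


def order_requests(requests, policy):
    """Order the waiting requests according to the chosen policy."""
    if policy == "fcfs":
        return sorted(requests, key=lambda r: r.arrival_time)
    if policy == "priority":
        # Arrival time breaks ties between requests with equal priority.
        return sorted(requests, key=lambda r: (r.priority, r.arrival_time))
    raise ValueError(f"unknown policy: {policy}")


def schedule_step(requests, token_budget, policy="fcfs"):
    """Return {request_id: num_tokens} for one step.

    Prefill and decode are not distinguished: each request simply receives
    as many of its outstanding tokens as the remaining budget allows.
    """
    scheduled = {}
    for req in order_requests(requests, policy):
        if token_budget == 0:
            break
        remaining = req.num_tokens - req.num_computed_tokens
        if remaining <= 0:
            continue
        n = min(remaining, token_budget)
        scheduled[req.request_id] = n
        token_budget -= n
    return scheduled


requests = [
    Request("prefill-a", arrival_time=0.0, num_tokens=5000, priority=1),
    Request("decode-b", arrival_time=1.0, num_tokens=101, num_computed_tokens=100),
    Request("prefill-c", arrival_time=2.0, num_tokens=300),
]
fcfs_step = schedule_step(requests, token_budget=2048, policy="fcfs")
priority_step = schedule_step(requests, token_budget=2048, policy="priority")
print(fcfs_step)
print(priority_step)
```

Under FCFS the long prompt that arrived first consumes the whole budget as one prefill chunk. Under the priority policy the decode step and the short prompt are scheduled first, and the long prompt receives whatever budget is left.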

### Hardware

| Hardware      | Status                                        |
|---------------|-----------------------------------------------|
| **NVIDIA**    | <nobr>🚀</nobr>                               |
| **AMD**       | <nobr>🟢</nobr>                               |
| **Intel GPU** | <nobr>🟢</nobr>                               |
| **TPU**       | <nobr>🟢</nobr>                               |
| **CPU**       | <nobr>🟢 (x86\_64/aarch64) 🟡 (macOS) </nobr> |

!!! note

    More hardware platforms may be supported via plugins, e.g.:

    - [vllm-ascend](https://github.com/vllm-project/vllm-ascend)
    - [vllm-spyre](https://github.com/vllm-project/vllm-spyre)
    - [vllm-gaudi](https://github.com/vllm-project/vllm-gaudi)
    - [vllm-openvino](https://github.com/vllm-project/vllm-openvino)

    Please check their corresponding repositories for more details.

### Models

| Model Type                  | Status                                                                             |
|-----------------------------|------------------------------------------------------------------------------------|
| **Decoder-only Models**     | <nobr>🚀 Optimized</nobr>                                                          |
| **Encoder-Decoder Models**  | <nobr>🟠 Delayed</nobr>                                                            |
| **Embedding Models**        | <nobr>🟢 Functional</nobr>                                                         |
| **Mamba Models**            | <nobr>🟢 (Mamba-2), 🟢 (Mamba-1)</nobr>                                            |
| **Multimodal Models**       | <nobr>🟢 Functional</nobr>                                                         |

vLLM V1 currently excludes model architectures with the `SupportsV0Only` protocol.

!!! tip

    This corresponds to the V1 column in our [list of supported models](../models/supported_models.md).

See below for the status of models that are not yet supported or have more features planned in V1.

#### Embedding Models

Initial basic support is now functional.

Later, we will consider using a [hidden states processor](gh-issue:12249),
which is based on the [global logits processor](gh-pr:13360),
to enable simultaneous generation and embedding using the same engine instance in V1.

#### Mamba Models

Models using selective state-space mechanisms instead of standard transformer attention are supported.
This covers models that use Mamba-2 and Mamba-1 layers (e.g., `Mamba2ForCausalLM`, `MambaForCausalLM`). Note that these models currently require disabling prefix caching in V1.

Models that combine Mamba-2 and Mamba-1 layers with standard attention layers are also supported (e.g., `BambaForCausalLM`,
`Zamba2ForCausalLM`, `NemotronHForCausalLM`, `FalconH1ForCausalLM`, `GraniteMoeHybridForCausalLM`, and `JambaForCausalLM`). Note that
these models currently require disabling prefix caching and using the FlashInfer attention backend in V1.

Hybrid models with mechanisms different from Mamba are also supported (e.g., `MiniMaxText01ForCausalLM`, `MiniMaxM1ForCausalLM`).
Note that these models currently require disabling prefix caching, enforcing eager mode, and using the FlashInfer
attention backend in V1.
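
The restrictions in the three paragraphs above can be summarized as a small lookup. This is only a sketch for keeping them straight: `required_settings`, the category names, and the dictionary keys are hypothetical labels chosen for this example, not vLLM option names, so consult the engine arguments reference for the actual flags.

```python
def required_settings(model_kind):
    """Summarize the V1 restrictions for a model category (hypothetical helper)."""
    known = ("pure_mamba", "mamba_attention_hybrid", "non_mamba_hybrid")
    if model_kind not in known:
        raise ValueError(f"unknown model kind: {model_kind}")

    # All three categories currently require prefix caching to be disabled.
    settings = {"prefix_caching": False}

    # Models that mix in attention layers also need the FlashInfer backend.
    if model_kind != "pure_mamba":
        settings["attention_backend"] = "FlashInfer"

    # Hybrids with non-Mamba mechanisms additionally need eager mode.
    if model_kind == "non_mamba_hybrid":
        settings["eager_mode"] = True

    return settings


for kind in ("pure_mamba", "mamba_attention_hybrid", "non_mamba_hybrid"):
    print(kind, required_settings(kind))
```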

#### Encoder-Decoder Models

Models requiring cross-attention between a separate encoder and decoder (e.g., `BartForConditionalGeneration`, `MllamaForConditionalGeneration`)
are not yet supported.

### Features

| Feature                                     | Status                                                                            |
|---------------------------------------------|-----------------------------------------------------------------------------------|
| **Prefix Caching**                          | <nobr>🚀 Optimized</nobr>                                                         |
| **Chunked Prefill**                         | <nobr>🚀 Optimized</nobr>                                                         |
| **LoRA**                                    | <nobr>🚀 Optimized</nobr>                                                         |
| **Logprobs Calculation**                    | <nobr>🟢 Functional</nobr>                                                        |
| **FP8 KV Cache**                            | <nobr>🟢 Functional on Hopper devices (<gh-pr:15191>)</nobr>                      |
| **Spec Decode**                             | <nobr>🚀 Optimized</nobr>                                                         |
| **Prompt Logprobs with Prefix Caching**     | <nobr>🟡 Planned ([RFC #13414](gh-issue:13414))</nobr>                            |
| **Structured Output Alternative Backends**  | <nobr>🟢 Functional</nobr>                                                        |
| **Request-level Structured Output Backend** | <nobr>🔴 Deprecated</nobr>                                                        |
| **best_of**                                 | <nobr>🔴 Deprecated ([RFC #13361](gh-issue:13361))</nobr>                         |
| **Per-Request Logits Processors**           | <nobr>🔴 Deprecated ([RFC #13360](gh-pr:13360))</nobr>                            |
| **GPU <> CPU KV Cache Swapping**            | <nobr>🔴 Deprecated</nobr>                                                        |

#### Semantic Changes to Logprobs

vLLM V1 supports logprobs and prompt logprobs. However, there are some important semantic
differences compared to V0:

##### Logprobs Calculation

By default, logprobs in V1 are returned as soon as they are computed from the model’s raw output (i.e.,
before applying any logits post-processing such as temperature scaling or penalty
adjustments). As a result, the returned logprobs do not reflect the final adjusted
probabilities used during sampling.

You can adjust this behavior by setting the `--logprobs-mode` flag.
Four modes are supported: `raw_logprobs` (default), `processed_logprobs`, `raw_logits`, and `processed_logits`.
Raw means the values before any logits processors (such as bad-words filtering) are applied.
Processed means the values after all processors are applied, including temperature and top_k/top_p.
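
To make the difference between the four modes concrete, here is a small numeric sketch in plain Python. It assumes temperature scaling is the only processor applied; `log_softmax` and `report` are helper names invented for this example and do not mirror vLLM's internals.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def report(logits, temperature, mode):
    """Return the values a given mode would expose (illustrative only)."""
    # Assumption: temperature scaling stands in for all logits processors.
    processed = [x / temperature for x in logits]
    if mode == "raw_logits":
        return list(logits)
    if mode == "processed_logits":
        return processed
    if mode == "raw_logprobs":
        return log_softmax(logits)
    if mode == "processed_logprobs":
        return log_softmax(processed)
    raise ValueError(f"unknown mode: {mode}")


logits = [2.0, 1.0, 0.0]
for mode in ("raw_logprobs", "processed_logprobs", "raw_logits", "processed_logits"):
    print(mode, [round(v, 3) for v in report(logits, temperature=0.5, mode=mode)])
```

With a temperature below 1, the processed logprobs are more peaked than the raw ones, which is why the default raw values do not match the distribution actually sampled from.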

##### Prompt Logprobs with Prefix Caching

Logprobs are not cached. For a request requiring prompt logprobs, the engine ignores the prefix cache and recomputes the prefill of the full prompt to generate the logprobs.

#### Deprecated Features

As part of the major architectural rework in vLLM V1, several legacy features have been deprecated.

##### Sampling features

- **best_of**: This feature has been deprecated due to limited usage. See details at [RFC #13361](gh-issue:13361).
- **Per-Request Logits Processors**: In V0, users could pass custom
  processing functions to adjust logits on a per-request basis. In vLLM V1, this
  feature has been deprecated. Instead, the design is moving toward supporting **global logits
  processors**, a feature the team is actively working on for future releases. See details at [RFC #13360](gh-pr:13360).

##### KV Cache features

- **GPU <> CPU KV Cache Swapping**: With the new simplified core architecture, vLLM V1 no longer requires KV cache swapping
  to handle request preemptions.

##### Structured Output features

- **Request-level Structured Output Backend**: Deprecated. Alternative backends (outlines, guidance) with fallbacks are now supported.