# Sleep Mode

vLLM-Omni’s **Sleep Mode** allows you to temporarily release most GPU memory used by a model—such as model weights and key-value (KV) caches (for autoregressive models)—**without stopping the server or unloading the Docker container**.

This feature is inherited from [vLLM’s Sleep Mode](https://blog.vllm.ai/2025/10/26/sleep-mode.html), which provides zero-reload model switching for multi-model serving.  

It is especially useful in **RLHF**, **training**, or **cost-saving scenarios**, where GPU resources must be freed between inference workloads.

---

## Omni Model

Omni models inherit this feature directly from vLLM's Sleep Mode.

This means:

- Both **Level 1** and **Level 2** sleep are supported, so both model weights and the KV cache can be released and restored.
- Level 1 offloads model weights to CPU memory and discards the KV cache; waking up copies the weights back without reloading from disk.
- Level 2 discards the weights as well (no CPU copy is kept), trading a slower wake-up for maximal memory savings.
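The difference between the two levels can be illustrated with a small toy model of the memory movement involved. This is not vLLM-Omni's actual implementation, just a sketch of the semantics described above:

```python
# Toy illustration of Level 1 vs Level 2 sleep semantics.
# Plain dicts stand in for GPU memory; this is NOT vLLM-Omni internals.
class ToyEngine:
    def __init__(self):
        self.gpu_weights = {"layer0": [1.0, 2.0]}  # pretend GPU-resident weights
        self.kv_cache = {"seq0": [0.5]}            # pretend KV cache
        self.cpu_backup = None                     # CPU copy used by Level 1

    def sleep(self, level):
        self.kv_cache.clear()          # both levels discard the KV cache
        if level == 1:
            self.cpu_backup = self.gpu_weights  # Level 1 keeps a CPU copy
        self.gpu_weights = {}          # release the "GPU" memory

    def wake_up(self):
        if self.cpu_backup is not None:
            self.gpu_weights = self.cpu_backup  # Level 1: restore from CPU
            self.cpu_backup = None
        else:
            # Level 2: no CPU copy exists, so weights must be reloaded
            self.gpu_weights = {"layer0": [1.0, 2.0]}
```

Level 1 therefore wakes up faster (a device copy instead of a full reload), while Level 2 frees host memory too.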

## Diffusion Model Extension

We added Sleep Mode support for **diffusion models**, which previously lacked this functionality.  
In diffusion pipelines, this currently only offloads **model weight memory**, as these models typically do not use KV caches.

This means:

- Diffusion models can now enter Level 1 sleep.
- Pipeline states (e.g., noise schedulers, buffers) remain intact after waking.
- Useful for releasing VRAM between image generation or training cycles.
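A sleep/wake cycle for a diffusion pipeline might look like the sketch below. The import path, the model identifier, and the `sleep()`/`wake_up()` method names (borrowed from vLLM's `LLM` API) are assumptions here, not confirmed vLLM-Omni APIs:

```python
# Hedged sketch: assumed import path and method names, mirroring vLLM's API.
from vllm_omni import Omni  # assumption: actual package/module may differ

# Placeholder model identifier; substitute a real diffusion checkpoint.
omni = Omni(model="your-diffusion-model", enable_sleep_mode=True)

omni.sleep(level=1)   # offload diffusion weights; VRAM is released
# ... run training or serve another model here ...
omni.wake_up()        # weights restored; noise schedulers and buffers intact
```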

---

## Enable Sleep Mode

To enable sleep mode, set `enable_sleep_mode` to `True` in `engine_args`.


Example:
```python
omni = Omni(model=..., enable_sleep_mode=True)
```
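Once enabled, the engine can be put to sleep and woken up on demand. Continuing the example above, the calls below mirror vLLM's `LLM.sleep()`/`LLM.wake_up()` API and are assumed to carry over to vLLM-Omni:

```python
# Hedged sketch: sleep()/wake_up() names assumed from vLLM's LLM API.
omni.sleep(level=1)  # release GPU memory (weights offloaded to CPU)
omni.wake_up()       # restore weights without restarting the server
```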