vLLM-Omni’s **Sleep Mode** allows you to temporarily release most of the GPU memory used by a model, such as the model weights and key-value (KV) caches (for autoregressive models), **without stopping the server or tearing down the Docker container**.
This feature is inherited from [vLLM’s Sleep Mode](https://blog.vllm.ai/2025/10/26/sleep-mode.html), which provides zero-reload model switching for multi-model serving.
It is especially useful in **RLHF**, **training**, or **cost-saving scenarios**, where GPU resources must be freed between inference workloads.
---
## Omni Model
Omni models inherit this feature from vLLM's Sleep Mode.
This means:
- Both Level 1 and Level 2 sleep are supported, allowing the model weights and the KV cache to be released and reset.
## Diffusion Model Extension
We added Sleep Mode support for **diffusion models**, which previously lacked this functionality.
In diffusion pipelines, this currently only offloads **model weight memory**, as these models typically do not use KV caches.
This means:
- Diffusion models can now enter Level 1 sleep.
- Pipeline states (e.g., noise schedulers, buffers) remain intact after waking.
- Useful for releasing VRAM between image generation or training cycles.
---
## Enable Sleep Mode
To enable sleep mode, set `enable_sleep_mode` in `engine_args` to `True`.
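A minimal sketch using vLLM's offline Python API, which vLLM-Omni inherits (the model name is a placeholder, and the flag is assumed to be passed through vLLM-Omni's engine arguments unchanged):

```python
from vllm import LLM

# Placeholder model; enable_sleep_mode must be set when the engine is created.
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct", enable_sleep_mode=True)

# Level 1 sleep: offload model weights to CPU memory and discard the KV cache.
llm.sleep(level=1)
# ... GPU memory is now free for other workloads ...

# Wake up: reload the weights and reallocate the KV cache before serving again.
llm.wake_up()
```

Level 2 sleep (`llm.sleep(level=2)`) discards the weights instead of offloading them, so waking up reloads them from scratch.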

---

vLLM-Omni is a Python library that supports the following GPU variants. The library itself mainly contains Python implementations of the framework and models.
## Requirements
- OS: Linux
- Python: 3.12
!!! note
    vLLM-Omni is currently not natively supported on Windows.
vLLM-Omni depends on vLLM, so please follow the instructions below, which are mainly for vLLM.
!!! note
    PyTorch installed via `conda` will statically link the `NCCL` library, which can cause issues when vLLM tries to use `NCCL`. See <gh-issue:8420> for more details.
To be performant, vLLM has to compile many CUDA kernels. Unfortunately, the compilation introduces binary incompatibility with other CUDA and PyTorch versions, and even with the same PyTorch version built with a different configuration.
Therefore, it is recommended to install vLLM and vLLM-Omni in a **fresh** environment. If you have a different CUDA version or want to use an existing PyTorch installation, you need to build vLLM from source. See [build-from-source-vllm](https://docs.vllm.ai/en/stable/getting_started/installation/gpu/#build-wheel-from-source) for more details.
# --8<-- [start:pre-built-wheels]
#### Installation of vLLM
Note: Pre-built wheels are currently only available for vLLM-Omni 0.11.0rc1, 0.12.0rc1, 0.14.0rc1, 0.14.0. For the latest version, please [build from source](https://docs.vllm.ai/projects/vllm-omni/en/latest/getting_started/installation/gpu/#build-wheel-from-source).
vLLM-Omni is built on top of vLLM. Please install vLLM with the command below.
```bash
uv pip install vllm==0.14.0 --torch-backend=auto
```
#### Installation of vLLM-Omni
```bash
uv pip install vllm-omni
```
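To quickly confirm that the wheel was installed, you can inspect the package metadata (a generic check, not a project-specific verification step):

```bash
uv pip show vllm-omni
```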
# --8<-- [end:pre-built-wheels]
# --8<-- [start:build-wheel-from-source]
#### Installation of vLLM
If you do not need to modify the source code of vLLM, you can directly install the stable 0.14.0 release of the library:
```bash
uv pip install vllm==0.14.0 --torch-backend=auto
```
The 0.14.0 release of vLLM is based on PyTorch 2.9.0, which requires a CUDA 12.9 environment.
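If you want to double-check what your environment actually provides, you can print the installed PyTorch version and the CUDA version it was built against (standard PyTorch attributes, shown only as a sanity check):

```bash
python -c "import torch; print(torch.__version__, torch.version.cuda)"
```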
#### Installation of vLLM-Omni
Since vLLM-Omni is rapidly evolving, it's recommended to install it from source.
Set up the environment variable to get the pre-built wheels. If you have network problems, download the whl file manually and set `VLLM_PRECOMPILED_WHEEL_LOCATION` to the local absolute path of the whl file.
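A rough sketch of what a source install can look like (the repository URL is an assumption, and the wheel path is a placeholder for a manually downloaded file):

```bash
# Assumed repository location; adjust if your copy lives elsewhere.
git clone https://github.com/vllm-project/vllm-omni.git
cd vllm-omni

# Optional: point the build at a manually downloaded wheel if network access is unreliable.
# export VLLM_PRECOMPILED_WHEEL_LOCATION=/abs/path/to/downloaded.whl

uv pip install -e .
```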
vLLM-Omni offers official Docker images for deployment. These images are built on top of the vLLM Docker images and are available on Docker Hub as [vllm/vllm-omni](https://hub.docker.com/r/vllm/vllm-omni/tags). The version of vLLM-Omni indicates which release of vLLM it is based on.
Here's an example deployment command that has been verified on 2 x H100s:
You can use this Docker image to serve models the same way you would with vLLM! To do so, make sure you overwrite the default entrypoint (`vllm serve --omni`), which only works for models supported by the vLLM-Omni project.
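For illustration only, a hedged sketch of such an override (the image tag, port, and model are placeholders, not the verified 2 x H100 configuration):

```bash
docker run --gpus all --ipc=host \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --entrypoint vllm \
    vllm/vllm-omni:latest \
    serve Qwen/Qwen2.5-7B-Instruct
```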
vLLM-Omni offers official Docker images for ROCm deployment. These images are built on top of the vLLM Docker images and are available on Docker Hub as [vllm/vllm-omni-rocm](https://hub.docker.com/r/vllm/vllm-omni-rocm/tags). The version of vLLM-Omni indicates which release of vLLM it is based on.
#### Launch vLLM-Omni Server
Here's an example deployment command that has been verified on 2 x MI300s:
For detailed hardware and software requirements, please refer to the [vllm-ascend installation documentation](https://docs.vllm.ai/projects/ascend/en/latest/installation.html).
# --8<-- [end:requirements]
# --8<-- [start:installation]
The recommended way to use vLLM-Omni on NPU is through the vllm-ascend pre-built Docker images:
```bash
# Update DEVICE according to your NPUs (/dev/davinci[0-7])
export DEVICE=/dev/davinci0
# See the vllm-ascend installation guide linked below for the full `docker run` command.
```

The default workdir is `/workspace`, with the vLLM, vLLM-Ascend, and vLLM-Omni code placed in `/vllm-workspace` and installed in development mode.
For other installation methods (pip installation, building from source, custom Docker builds), please refer to the [vllm-ascend installation guide](https://docs.vllm.ai/projects/ascend/en/latest/installation.html).
It's recommended to use [uv](https://docs.astral.sh/uv/), a very fast Python environment manager, to create and manage Python environments. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment using the following commands:
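```bash
# A typical uv workflow; Python 3.12 matches the requirements above.
uv venv --python 3.12 --seed
source .venv/bin/activate
```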