vLLM-Omni’s **Sleep Mode** allows you to temporarily release most of the GPU memory used by a model, such as the model weights and key-value (KV) caches (for autoregressive models), **without stopping the server or tearing down the Docker container**.
This feature is inherited from [vLLM’s Sleep Mode](https://blog.vllm.ai/2025/10/26/sleep-mode.html), which provides zero-reload model switching for multi-model serving.
It is especially useful in **RLHF**, **training**, or **cost-saving scenarios**, where GPU resources must be freed between inference workloads.
---
## Omni Model
Omni models inherit this feature from vLLM's Sleep Mode.
This means:
- Both Level 1 and Level 2 sleep are supported, allowing the model weights and the KV cache to be released and reset.
## Diffusion Model Extension
We added Sleep Mode support for **diffusion models**, which previously lacked this functionality.
In diffusion pipelines, this currently only offloads **model weight memory**, as these models typically do not use KV caches.
This means:
- Diffusion models can now enter Level 1 sleep.
- Pipeline states (e.g., noise schedulers, buffers) remain intact after waking.
- Useful for releasing VRAM between image generation or training cycles.
---
## Enable Sleep Mode
To enable sleep mode, set `enable_sleep_mode` in `engine_args` to `True`.
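A minimal sketch using vLLM's offline Python API, which vLLM-Omni inherits (the model name is a placeholder, and the flag is assumed to be passed through vLLM-Omni's engine arguments unchanged):

```python
from vllm import LLM

# Placeholder model; enable_sleep_mode must be set when the engine is created.
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct", enable_sleep_mode=True)

# Level 1 sleep: offload model weights to CPU memory and discard the KV cache.
llm.sleep(level=1)
# ... GPU memory is now free for other workloads ...

# Wake up: reload the weights and reallocate the KV cache before serving again.
llm.wake_up()
```

Level 2 sleep (`llm.sleep(level=2)`) discards the weights instead of offloading them, so waking up reloads them from scratch.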

---

vLLM-Omni is a Python library that supports the following GPU variants. The library itself mainly contains Python implementations of the framework and models.
## Requirements
- OS: Linux
- Python: 3.12
!!! note
    vLLM-Omni is currently not natively supported on Windows.
vLLM-Omni depends on vLLM, so please follow the instructions below, which are mainly for vLLM.
!!! note
    PyTorch installed via `conda` will statically link the `NCCL` library, which can cause issues when vLLM tries to use `NCCL`. See <gh-issue:8420> for more details.
To be performant, vLLM has to compile many CUDA kernels. Unfortunately, the compilation introduces binary incompatibility with other CUDA and PyTorch versions, and even with the same PyTorch version built with a different configuration.
Therefore, it is recommended to install vLLM and vLLM-Omni in a **fresh** environment. If you have a different CUDA version or want to use an existing PyTorch installation, you need to build vLLM from source. See [build-from-source-vllm](https://docs.vllm.ai/en/stable/getting_started/installation/gpu/#build-wheel-from-source) for more details.
# --8<-- [start:pre-built-wheels]
#### Installation of vLLM
Note: Pre-built wheels are currently only available for vLLM-Omni 0.11.0rc1, 0.12.0rc1, 0.14.0rc1, 0.14.0. For the latest version, please [build from source](https://docs.vllm.ai/projects/vllm-omni/en/latest/getting_started/installation/gpu/#build-wheel-from-source).
vLLM-Omni is built on top of vLLM. Please install vLLM with the command below.
```bash
uv pip install vllm==0.14.0 --torch-backend=auto
```
#### Installation of vLLM-Omni
```bash
uv pip install vllm-omni
```
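To quickly confirm that the wheel was installed, you can inspect the package metadata (a generic check, not a project-specific verification step):

```bash
uv pip show vllm-omni
```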
# --8<-- [end:pre-built-wheels]
# --8<-- [start:build-wheel-from-source]
#### Installation of vLLM
If you do not need to modify the source code of vLLM, you can directly install the stable 0.14.0 release of the library:
```bash
uv pip install vllm==0.14.0 --torch-backend=auto
```
The 0.14.0 release of vLLM is based on PyTorch 2.9.0, which requires a CUDA 12.9 environment.
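If you want to double-check what your environment actually provides, you can print the installed PyTorch version and the CUDA version it was built against (standard PyTorch attributes, shown only as a sanity check):

```bash
python -c "import torch; print(torch.__version__, torch.version.cuda)"
```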
#### Installation of vLLM-Omni
Since vLLM-Omni is rapidly evolving, it's recommended to install it from source.
Set up the environment variable to get the pre-built wheels. If you have network problems, download the whl file manually and set `VLLM_PRECOMPILED_WHEEL_LOCATION` to the local absolute path of the whl file.
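A rough sketch of what a source install can look like (the repository URL is an assumption, and the wheel path is a placeholder for a manually downloaded file):

```bash
# Assumed repository location; adjust if your copy lives elsewhere.
git clone https://github.com/vllm-project/vllm-omni.git
cd vllm-omni

# Optional: point the build at a manually downloaded wheel if network access is unreliable.
# export VLLM_PRECOMPILED_WHEEL_LOCATION=/abs/path/to/downloaded.whl

uv pip install -e .
```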
vLLM-Omni offers official Docker images for deployment. These images are built on top of the vLLM Docker images and are available on Docker Hub as [vllm/vllm-omni](https://hub.docker.com/r/vllm/vllm-omni/tags). The version of vLLM-Omni indicates which release of vLLM it is based on.
Here's an example deployment command that has been verified on 2 x H100s:
You can use this Docker image to serve models the same way you would with vLLM! To do so, make sure you overwrite the default entrypoint (`vllm serve --omni`), which only works for models supported by the vLLM-Omni project.
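For illustration only, a hedged sketch of such an override (the image tag, port, and model are placeholders, not the verified 2 x H100 configuration):

```bash
docker run --gpus all --ipc=host \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --entrypoint vllm \
    vllm/vllm-omni:latest \
    serve Qwen/Qwen2.5-7B-Instruct
```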
vLLM-Omni offers official Docker images for ROCm deployment. These images are built on top of the vLLM Docker images and are available on Docker Hub as [vllm/vllm-omni-rocm](https://hub.docker.com/r/vllm/vllm-omni-rocm/tags). The version of vLLM-Omni indicates which release of vLLM it is based on.
#### Launch vLLM-Omni Server
Here's an example deployment command that has been verified on 2 x MI300s:
For detailed hardware and software requirements, please refer to the [vllm-ascend installation documentation](https://docs.vllm.ai/projects/ascend/en/latest/installation.html).
# --8<-- [end:requirements]
# --8<-- [start:installation]
The recommended way to use vLLM-Omni on NPU is through the vllm-ascend pre-built Docker images:
```bash
# Update DEVICE according to your NPUs (/dev/davinci[0-7])
export DEVICE=/dev/davinci0
# See the vllm-ascend installation guide linked below for the full `docker run` command.
```

The default workdir is `/workspace`, with the vLLM, vLLM-Ascend, and vLLM-Omni code placed in `/vllm-workspace` and installed in development mode.
For other installation methods (pip installation, building from source, custom Docker builds), please refer to the [vllm-ascend installation guide](https://docs.vllm.ai/projects/ascend/en/latest/installation.html).
It's recommended to use [uv](https://docs.astral.sh/uv/), a very fast Python environment manager, to create and manage Python environments. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment using the following commands:
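```bash
# A typical uv workflow; Python 3.12 matches the requirements above.
uv venv --python 3.12 --seed
source .venv/bin/activate
```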