docs(multimodality): Refactor MM docs into high level features and VLM/Diffusion sections (#6831)

Signed-off-by: Ryan McCormick <mccormick.codes@gmail.com>
Co-authored-by: akshatha-k <33278067+akshatha-k@users.noreply.github.com>

Unverified Commit f0bcabe0 authored Mar 06, 2026 by Ryan McCormick Committed by GitHub Mar 06, 2026
6 changed files
--- a/docs/features/multimodal/README.md
+++ b/docs/features/multimodal/README.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: Multimodality Support
+title: Multimodal Model Serving
 subtitle: Deploy multimodal models with image, video, and audio support in Dynamo
 ---

-Dynamo supports multimodal inference across multiple LLM backends, enabling models to process images, video, and audio alongside text. This section provides comprehensive documentation for deploying multimodal models.
+Dynamo supports multimodal inference across multiple LLM backends, enabling models to process images, video, and audio alongside text.

-> [!IMPORTANT]
-> **Security Requirement**: Multimodal processing must be explicitly enabled at startup.
-> See the relevant documentation for each backend for the necessary flags.
->
-> This prevents unintended processing of multimodal data from untrusted sources.
+<Warning>
+**Security Requirement**: Multimodal processing must be explicitly enabled at startup. See the relevant backend documentation ([vLLM](multimodal-vllm.md), [SGLang](multimodal-sglang.md), [TRT-LLM](multimodal-trtllm.md)) for the necessary flags. This prevents unintended processing of multimodal data from untrusted sources.
+</Warning>

-## Backend Documentation
-## Support Matrix
-
-### Backend Capabilities
-
-| Stack | E/PD | E/P/D | EP/D | EPD | Image | Video | Audio |
-|-------|------|-------|------|-----|-------|-------|-------|
-| **[vLLM](multimodal-vllm.md)** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 🧪 |
-| **[TRT-LLM](multimodal-trtllm.md)** | ❌ | 🚧* | ✅ | ✅ | ✅ | ❌ | ❌ |
-| **[SGLang](multimodal-sglang.md)** | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ |
-
-\* E/P/D supported in TRT-LLM with pre-computed embeddings only; image URL support is WIP ([PR #4668](https://github.com/ai-dynamo/dynamo/pull/4668))
+## Key Features
+```mermaid
+---
+title: Sample flow for an aggregated VLM serving scenario
+---
+flowchart TD
+    A[Request] --> B{KV cache hit?}
+    B -->|Yes| C[Use KV]
+    B -->|No| D{Embedding cache hit?}
+    D -->|Yes| E[Load embedding]
+    D -->|No| F[Run encoder]
+    F --> G[Save to cache]
+    G --> H["PREFILL (image tokens + text tokens → KV cache)"]
+    E --> H
+    C --> I[DECODE]
+    H --> I
+    I --> J[Response]
+```
+
+Dynamo improves latency and throughput for vision-language workloads through the following features, which can be used together or separately depending on your workload characteristics:
+
+| Feature | Description |
+|---------|-------------|
+| **[Embedding Cache](embedding-cache.md)** | CPU-side LRU cache that skips re-encoding repeated images |
+| **[Encoder Disaggregation](encoder-disaggregation.md)** | Separate vision encoder worker for independent scaling |
+| **[Multimodal KV Routing](multimodal-kv-routing.md)** | MM-aware KV cache routing for optimal worker selection |
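
The sample flow in the flowchart above can be read as a small decision procedure. The sketch below is illustrative only: `serve_request`, its parameters, and the plain dictionaries standing in for the KV cache and embedding cache are hypothetical names, not Dynamo APIs. It only shows the order in which the two caches are consulted.

```python
from typing import Callable, Dict, Hashable, List


def serve_request(
    prefix_key: Hashable,
    image_key: Hashable,
    kv_cache: Dict[Hashable, str],
    embedding_cache: Dict[Hashable, List[float]],
    run_encoder: Callable[[Hashable], List[float]],
) -> List[str]:
    """Return the ordered stages one request passes through (hypothetical model)."""
    stages: List[str] = []
    if prefix_key in kv_cache:
        # KV cache hit: prefill (and therefore encoding) is skipped.
        stages.append("use_kv")
    else:
        if image_key in embedding_cache:
            # Embedding cache hit: the vision encoder is skipped.
            stages.append("load_embedding")
        else:
            embedding_cache[image_key] = run_encoder(image_key)
            stages.extend(["run_encoder", "save_to_cache"])
        # Prefill turns image tokens + text tokens into KV cache state.
        kv_cache[prefix_key] = "kv-blocks"
        stages.append("prefill")
    stages.append("decode")
    return stages
```

In this model, a first request runs the encoder and prefill, a later request with a new prompt but the same image reuses the embedding, and an exact repeat goes straight to decode.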

-**Pattern Key:**
+## Support Matrix

-- **EPD** - All-in-one worker (Simple Aggregated)
-- **E/PD** - Separate encode, combined prefill+decode
-- **E/P/D** - All stages separate
-- **EP/D** - Combined encode+prefill, separate decode
+| Stack | Image | Video | Audio |
+|-------|-------|-------|-------|
+| **[vLLM](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/multimodal-vllm.md)** | ✅ | 🧪  | 🧪 |
+| **[TRT-LLM](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/multimodal-trtllm.md)** | ✅ | ❌ | ❌ |
+| **[SGLang](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/multimodal-sglang.md)** | ✅ | ❌ | ❌ |

-**Status:** ✅ Supported | 🚧 WIP | 🧪 Experimental | ❌ Not supported
+**Status:** ✅ Supported | 🧪 Experimental | ❌ Not supported

 ### Input Format Support

@@ -43,150 +54,19 @@ Dynamo supports multimodal inference across multiple LLM backends, enabling mode
 | Data URL (Base64) | ❌ | ❌ | ✅ |
 | Pre-computed Embeddings (.pt) | ❌ | ✅ | ❌ |

-## Architecture Patterns
-
-Dynamo supports several deployment patterns for multimodal inference based on two dimensions:
-
-1. **Encoding**: Is media encoding handled inline (within prefill) or by a separate **Encode Worker**?
-   - *Inline*: Simpler setup, encoding happens in the prefill worker
-   - *Separate (EPD)*: Dedicated encode worker transfers embeddings via **NIXL (RDMA)**, enabling independent scaling
-
-2. **Prefill/Decode**: Are prefill and decode in the same worker or separate?
-   - *Aggregated*: Single worker handles both prefill and decode
-   - *Disaggregated*: Separate workers for prefill and decode, with KV cache transfer between them
-
-These combine into four deployment patterns:
-
-### EPD - Simple Aggregated
-
-All processing happens within a single worker - the simplest setup.
-
-```text
-HTTP Frontend (Rust)
-    ↓
-Worker (Python)
-    ↓ image load + encode + prefill + decode
-Response
-```
-
-| Component | Purpose |
-|-----------|---------|
-| Frontend (Rust) | HTTP entry point, tokenization, image URL preprocessing |
-| Worker | Complete inference pipeline (encode + prefill + decode) |
-
-**When to use:** Quick setup, smaller models, development/testing.
-
-### E/PD - Encode Separate
-
-Encoding happens in a separate worker; prefill and decode share the same engine.
-
-```text
-HTTP Frontend (Rust)
-    ↓
-Processor (Python)
-    ↓ tokenizes, extracts media URL
-Encode Worker (Python)
-    ↓ downloads media, generates embeddings, NIXL transfer
-PD Worker (Python)
-    ↓ receives embeddings via NIXL, prefill + decode
-Response
-```
-
-| Component | Purpose |
-|-----------|---------|
-| Frontend (Rust) | HTTP entry point |
-| Processor (Python) | Tokenization, extracts media URLs |
-| Encode Worker | Media encoding, embeddings generation |
-| PD Worker | Prefill + Decode with embeddings |
-
-**When to use:** Offload vision encoding to separate GPU, scale encode workers independently.
-
-### E/P/D - Full Disaggregation
-
-Full disaggregation with separate workers for encoding, prefill, and decode.
-There are two variants of this workflow:
-- Prefill-first, used by vLLM
-- Decode-first, used by SGLang
-
-Prefill-first:
-
-```text
-HTTP Frontend (Rust)
-    ↓
-Processor (Python)
-    ↓ tokenizes, extracts media URL
-Encode Worker (Python)
-    ↓ downloads media, generates embeddings, NIXL transfer
-Prefill Worker (Python)
-    ↓ receives embeddings via NIXL, prefill only, KV cache transfer
-Decode Worker (Python)
-    ↓ decode only, token generation
-Response
-```
-
-OR
-
-Decode-first:
-
-```text
-HTTP Frontend (Rust)
-    ↓
-Processor (Python)
-    ↓ tokenizes, extracts media URL
-Encode Worker (Python)
-    ↓ downloads media, generates embeddings, NIXL transfer
-Decode Worker (Python)
-    ↓ Bootstraps prefill worker
-Prefill Worker (Python)
-    ↓ receives embeddings via NIXL, prefill only, KV cache transfer
-Decode Worker (Python)
-    ↓ decode only, token generation
-Response
-```
-
-| Component | Purpose |
-|-----------|---------|
-| Frontend (Rust) | HTTP entry point |
-| Processor (Python) | Tokenization, extracts media URLs |
-| Encode Worker | Media encoding, embeddings generation |
-| Prefill Worker | Prefill only, transfers KV cache |
-| Decode Worker | Decode only, token generation |
-
-**When to use:** Maximum optimization, multi-node deployment, independent scaling of each phase.
-
-### EP/D - Traditional Disaggregated
-
-Encoding is combined with prefill, with decode separate.
-
-```text
-HTTP Frontend (Rust)
-    ↓
-Processor (Python)
-    ↓ tokenizes, extracts media URL
-Encode+Prefill Worker (Python)
-    ↓ downloads media, encodes inline, prefill, KV cache transfer
-Decode Worker (Python)
-    ↓ decode only, token generation
-Response
-```
-
-| Component | Purpose |
-|-----------|---------|
-| Frontend (Rust) | HTTP entry point |
-| Processor (Python) | Tokenization, extracts media URLs (vLLM only) |
-| Encode+Prefill Worker | Combined encoding and prefill |
-| Decode Worker | Decode only, token generation |
-
-> **Note:** TRT-LLM's EP/D mode skips the Python Processor - the Rust frontend handles tokenization and routes directly to the Prefill worker.
-> For multimodal requests, the Python prefill worker still re-tokenizes/builds inputs; Rust token_ids are ignored.
-
-**When to use:** Models without pre-computed embedding support (Llama 4), or TRT-LLM disaggregated deployment.
-
 ## Example Workflows

-You can find example workflows and reference implementations for deploying multimodal models in:
+Reference implementations for deploying multimodal models:

 - [vLLM multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/launch)
 - [TRT-LLM multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/launch)
 - [SGLang multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/launch)
-- [Advanced multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal/launch) (video, audio)
+- [Experimental multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal/launch) (video, audio)
+
+## Backend Documentation
+
+Detailed deployment guides, configuration, and examples for each backend:
+
+- **[vLLM Multimodal](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/multimodal-vllm.md)**
+- **[TensorRT-LLM Multimodal](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/multimodal-trtllm.md)**
+- **[SGLang Multimodal](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/multimodal-sglang.md)**
--- a/docs/features/multimodal/diffusion.md
+++ b/docs/features/multimodal/diffusion.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: Diffusion
+subtitle: Deploy diffusion models for text-to-image, text-to-video, and more in Dynamo
+---
+
+## Overview
+
+Dynamo supports serving diffusion models across multiple backends, enabling generation of images and video from text prompts. Backends expose diffusion capabilities through the same Dynamo pipeline infrastructure used for LLM inference, including frontend routing, scaling, and observability.
+
+## Support Matrix
+
+| Modality | vLLM-Omni | SGLang | TRT-LLM |
+|----------|-----------|--------|---------|
+| Text-to-Text | ✅ | ✅ | ❌ |
+| Text-to-Image | ✅ | ✅ | ❌ |
+| Text-to-Video | ✅ | ✅ | ✅ |
+| Image-to-Video | ❌ | ❌ | ❌ |
+
+**Status:** ✅ Supported | ❌ Not supported
+
+> [!NOTE]
+> Image-to-video support is planned and coming soon across all backends.
+
+## Backend Documentation
+
+Deployment guides, configuration, and examples for each backend:
+
+- **[vLLM-Omni](../../backends/vllm/vllm-omni.md)**
+- **[SGLang Diffusion](../../backends/sglang/sglang-diffusion.md)**
+- **[TRT-LLM Diffusion](../../backends/trtllm/trtllm-video-diffusion.md)**
--- a/docs/features/multimodal/embedding-cache.md
+++ b/docs/features/multimodal/embedding-cache.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: Embedding Cache
+subtitle: Cache vision encoder embeddings to skip re-encoding repeated images
+---
+
+## Overview
+
+The embedding cache is a CPU-side LRU cache that stores vision encoder outputs. When the same image appears in multiple requests, the cached embedding is reused instead of running the vision encoder again. This reduces GPU load on the encoder and lowers latency for repeated images.
+
+> Note: This feature is also referred to as **encoder cache**. The embedding cache is separate from the KV cache, which reuses attention key/value state from an earlier prefill so that a matching request can skip prefill and go straight to decode. For KV cache reuse and routing, see [Multimodal KV Routing](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/multimodal-kv-routing.md).
+
+## When to Use
+
+Use the embedding cache when your workload includes repeated images across requests. Common scenarios:
+
+- Product catalog queries where users ask about the same product images
+- Document processing pipelines that reference shared diagrams or figures
+- Chat sessions where the same image is discussed across multiple turns, such as an architecture diagram in a code-generation use case
+
+If your workload consists entirely of unique images, the cache provides no benefit.
+
+## Support Matrix
+
+| Backend | Aggregated | Disaggregated (E/PD) | Notes |
+|---------|------------|----------------------|-------|
+| **vLLM** | ✅* | ✅ | Aggregated uses vLLM-native `ec_both`; disaggregated uses Dynamo `EmbeddingCacheManager` |
+| **TRT-LLM** | ❌ | ✅ | Dynamo `MultimodalEmbeddingCacheManager` in PD worker |
+| **SGLang** | ❌ | ❌ | Not supported yet |
+
+*Requires a vLLM version that has not yet been released. Support will be available once that release is published.
+
+## How It Works
+
+In the disaggregated (E/PD) setup, the prefill worker owns the CPU-side LRU cache. On a hit, the encode worker is skipped entirely. On a miss, the encode worker produces the embedding and transfers it via NIXL, and the prefill worker saves it to the cache.
+
+```mermaid
+flowchart LR
+    req[Request] --> check{CPU cache hit?}
+    check -. hit .-> use[Use cached embedding]
+    check -- miss --> E[Encode Worker]
+    E -- embeddings via NIXL --> save[Save to cache]
+    save --> engine[Inference Engine]
+    use --> engine
+```
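
As a rough illustration of the hit/miss behavior described above, here is a minimal sketch of a byte-bounded LRU cache keyed by an image hash. `EmbeddingLRUCache` and `get_or_encode` are hypothetical names, embeddings are modeled as raw `bytes`, and SHA-256 is an assumed key function; Dynamo's actual `EmbeddingCacheManager` is not shown in this change and may differ.

```python
import hashlib
from collections import OrderedDict
from typing import Callable, Optional, Tuple


class EmbeddingLRUCache:
    """Hypothetical CPU-side LRU cache bounded by total bytes stored."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._entries: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, image_hash: str) -> Optional[bytes]:
        value = self._entries.get(image_hash)
        if value is not None:
            # Mark as most recently used.
            self._entries.move_to_end(image_hash)
        return value

    def put(self, image_hash: str, embedding: bytes) -> None:
        size = len(embedding)
        if size > self.capacity_bytes:
            return  # Cannot fit; with capacity 0 nothing non-empty is cached.
        old = self._entries.pop(image_hash, None)
        if old is not None:
            self.used_bytes -= len(old)
        # Evict least recently used entries until the new one fits.
        while self.used_bytes + size > self.capacity_bytes:
            _, evicted = self._entries.popitem(last=False)
            self.used_bytes -= len(evicted)
        self._entries[image_hash] = embedding
        self.used_bytes += size


def get_or_encode(
    cache: EmbeddingLRUCache,
    image_bytes: bytes,
    encode: Callable[[bytes], bytes],
) -> Tuple[bytes, bool]:
    """Return (embedding, was_hit); `encode` stands in for the encode worker."""
    key = hashlib.sha256(image_bytes).hexdigest()  # assumed key function
    cached = cache.get(key)
    if cached is not None:
        return cached, True
    embedding = encode(image_bytes)
    cache.put(key, embedding)
    return embedding, False
```

Bounding the cache by bytes rather than entry count mirrors the GB-based capacity flag, since embeddings for different images can differ in size.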
+
+**Launch (vLLM):**
+
+```bash
+cd $DYNAMO_HOME/examples/backends/vllm
+bash launch/disagg_multimodal_e_pd.sh --multimodal-embedding-cache-capacity-gb 10
+```
+
+**Launch (TRT-LLM):**
+
+```bash
+cd $DYNAMO_HOME/examples/backends/trtllm
+./launch/disagg_e_pd.sh --multimodal-embedding-cache-capacity-gb 10
+```
+
+## Configuration
+
+| Parameter | Description | Default |
+|-----------|-------------|---------|
+| `--multimodal-embedding-cache-capacity-gb` | CPU-side LRU cache size in GB | 0 (disabled) |
+
+Set the capacity based on your expected working set of unique images. A larger cache holds more embeddings but consumes more host memory.
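
A back-of-the-envelope estimate can help pick a capacity. The sketch below is only arithmetic under stated assumptions: the helper name `embeddings_that_fit` is hypothetical, the per-image token count, hidden size, and 2-byte element width are example values rather than properties of any specific model, and 1 GB is treated as 1024³ bytes.

```python
def embeddings_that_fit(
    capacity_gb: float,
    tokens_per_image: int,
    hidden_size: int,
    bytes_per_value: int = 2,
) -> int:
    """Estimate how many image embeddings fit in the cache (hypothetical helper)."""
    # One embedding is a [tokens_per_image, hidden_size] tensor.
    bytes_per_embedding = tokens_per_image * hidden_size * bytes_per_value
    capacity_bytes = int(capacity_gb * 1024**3)  # assumes 1 GB = 1024**3 bytes
    return capacity_bytes // bytes_per_embedding


# Example values only: 1024 visual tokens per image, hidden size 4096, 16-bit values.
# Each embedding is 1024 * 4096 * 2 bytes = 8 MiB, so 10 GB holds 1280 of them.
print(embeddings_that_fit(10, tokens_per_image=1024, hidden_size=4096))
```

If the expected working set of distinct images is larger than this estimate, less recently used embeddings will be evicted and re-encoded on their next use.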
+
+See the backend-specific documentation ([vLLM](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/multimodal-vllm.md#embedding-cache), [TRT-LLM](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/multimodal-trtllm.md#embedding-cache)) for more details.
--- a/docs/features/multimodal/encoder-disaggregation.md
+++ b/docs/features/multimodal/encoder-disaggregation.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: Encoder Disaggregation
+subtitle: Separate vision encoding into a dedicated worker for independent scaling
+---
+
+## Overview
+
+Encoder disaggregation separates the vision encoder from the prefill/decode pipeline into its own worker. Instead of running image encoding inline, a dedicated encode worker handles media processing and transfers the resulting embeddings to downstream workers via NIXL (RDMA).
+
+This enables:
+
+- Independent scaling of encode workers based on vision workload
+- Reduced GPU memory pressure on prefill/decode workers
+- Better GPU utilization by matching worker counts to actual bottlenecks
+
+## When to Use
+
+Use encoder disaggregation when:
+
+- Vision encoding is a bottleneck and you need to scale encoders independently of LLM workers
+- You want to run the vision encoder on different hardware (e.g., smaller GPUs for encoding, larger GPUs for LLM inference)
+- Your deployment handles high volumes of multimodal requests and encoding throughput is limiting
+
+For simple deployments or development/testing, the aggregated (EPD) pattern is easier to set up.
+
+## Support Matrix
+
+| Backend | E/PD | E/P/D | Notes |
+|---------|------|-------|-------|
+| **vLLM** | ✅ | ✅ | NIXL transfer for embeddings; NIXL KV cache transfer for P/D |
+| **TRT-LLM** | ❌ | ✅ | Supports image URLs (via `MultimodalEncoder`) and pre-computed embeddings (via NIXL) |
+| **SGLang** | ✅ | ✅ | NIXL for embeddings; bootstrap mechanism for P/D KV transfer |
+
+## Deployment Patterns
+
+**E/PD** — Separate encoder, combined prefill+decode:
+
+```text
+Frontend → Processor → Encode Worker → PD Worker → Response
+                           (NIXL)
+```
+
+The encode worker runs the vision model and transfers embeddings via NIXL to a combined prefill+decode worker.
+
+**E/P/D** — All stages separate:
+
+```text
+Frontend → Processor → Encode Worker → Prefill Worker → Decode Worker → Response
+                           (NIXL)          (KV transfer)
+```
+
+Full disaggregation with separate workers for each stage. The encode worker transfers embeddings to the prefill worker, which then transfers KV cache to the decode worker.
+
+## Launching
+
+### vLLM
+
+```bash
+cd $DYNAMO_HOME/examples/backends/vllm
+
+# E/PD
+bash launch/disagg_multimodal_e_pd.sh --model "Qwen/Qwen3-VL-30B-A3B-Instruct-FP8"
+
+# E/P/D
+bash launch/disagg_multimodal_epd.sh --model "Qwen/Qwen3-VL-30B-A3B-Instruct-FP8"
+```
+
+### TRT-LLM
+
+```bash
+cd $DYNAMO_HOME/examples/backends/trtllm
+
+# E/PD
+bash launch/disagg_e_pd.sh
+
+# E/P/D
+./launch/epd_multimodal_image_and_embeddings.sh
+```
+
+### SGLang
+
+```bash
+cd $DYNAMO_HOME/examples/backends/sglang
+
+# E/PD
+./launch/multimodal_epd.sh
+
+# E/P/D
+./launch/multimodal_disagg.sh
+```
+
+See the backend-specific documentation ([vLLM](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/multimodal-vllm.md), [TRT-LLM](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/multimodal-trtllm.md), [SGLang](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/multimodal-sglang.md)) for full configuration details and component flags.
--- a/docs/features/multimodal/multimodal-kv-routing.md
+++ b/docs/features/multimodal/multimodal-kv-routing.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: Multimodal KV Routing
+subtitle: Route multimodal requests to workers with the best KV cache overlap
+---
+
+## Overview
+
+Multimodal KV routing extends Dynamo's KV-aware router to account for image content when computing cache overlap scores. A dedicated MM router worker sits between the frontend and backend workers. It downloads images, computes a hash of each image (`mm_hash`), and includes this hash in per-block routing metadata. The KV router then selects the backend worker with the highest cache overlap, including overlap on image embedding blocks.
+
+Repeated requests containing the same image are routed to the worker that already has the corresponding KV cache blocks, maximizing prefix cache reuse.
+
+> Note: The KV cache is separate from the embedding cache (also called encoder cache), which reuses vision encoder outputs (image→embeddings) to avoid re-running the encoder. For encoder-side reuse, see [Embedding Cache](https://github.com/ai-dynamo/dynamo/blob/main/docs/features/multimodal/embedding-cache.md).
+
+## When to Use
+
+Use multimodal KV routing when:
+
+- You have multiple backend workers serving multimodal requests
+- Your workload includes repeated images across requests (e.g., the same product photo, shared reference images)
+- You want to maximize KV cache hit rates for multimodal content
+
+Without MM-aware routing, the standard router treats image token blocks as opaque and cannot tell which worker has cached a particular image's KV blocks.
+
+## Support Matrix
+
+| Backend | Supported | Notes |
+|---------|-----------|-------|
+| **vLLM** | ✅* | Requires vLLM with KV events `extra_keys` support ([PR #33304](https://github.com/vllm-project/vllm/pull/33304)) |
+| **TRT-LLM** | ✅ | Requires `--publish-events-and-metrics` on TRT-LLM workers |
+| **SGLang** | ❌ | Not supported yet |
+
+*Requires a vLLM version that has not yet been released. Support will be available once that release is published.
+
+## How It Works
+
+```text
+Frontend (round-robin) → MM Router Worker → Backend Workers
+                              │
+                              ├─ Download image
+                              ├─ Compute mm_hash
+                              ├─ Build per-block MM metadata
+                              └─ KvRouter selects best worker
+```
+
+1. The frontend routes to the MM router worker via round-robin
+2. The MM router downloads each image and computes an `mm_hash`
+3. Per-block routing metadata (`block_mm_infos`) is built, tagging blocks that contain image tokens
+4. The KV router evaluates overlap across all backend workers, accounting for image-bearing blocks
+5. The request is forwarded to the worker with the highest overlap
+
+On repeated requests with the same image, the selected worker shows higher cached block counts, reducing prefill latency.
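
The steps above can be sketched in a few lines. This is a simplified model, not the router's implementation: `compute_mm_hash`, `block_keys`, `overlap`, and `select_worker` are hypothetical helpers, SHA-256 is an assumed hash, and overlap is counted as the number of leading request blocks a worker already holds. The point it illustrates is that mixing the image hash into each image-bearing block's key lets two requests with identical placeholder tokens but different images map to different cache blocks.

```python
import hashlib
from typing import Dict, List, Optional, Sequence, Set, Tuple


def compute_mm_hash(image_bytes: bytes) -> str:
    """Hash the image content (SHA-256 is an assumption)."""
    return hashlib.sha256(image_bytes).hexdigest()


def block_keys(
    token_blocks: Sequence[Tuple[int, ...]],
    block_mm_hashes: Sequence[Optional[str]],
) -> List[str]:
    """Chain-hash each block with its parent and any image hash it carries."""
    keys: List[str] = []
    parent = ""
    for tokens, mm_hash in zip(token_blocks, block_mm_hashes):
        payload = f"{parent}|{tokens}|{mm_hash or ''}"
        parent = hashlib.sha256(payload.encode()).hexdigest()
        keys.append(parent)
    return keys


def overlap(request_keys: Sequence[str], cached_keys: Set[str]) -> int:
    """Count leading request blocks already cached on a worker."""
    count = 0
    for key in request_keys:
        if key not in cached_keys:
            break
        count += 1
    return count


def select_worker(request_keys: Sequence[str], workers: Dict[str, Set[str]]) -> str:
    """Pick the worker with the highest overlap (first one wins ties)."""
    return max(workers, key=lambda name: overlap(request_keys, workers[name]))
```

For example, if `w1` previously served a request containing image A, a new request with the same prompt and image A yields the same block keys and is routed to `w1`, while a worker that cached image B under the same prompt only matches the leading text-only blocks.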
+
+## Launching
+
+### vLLM
+
+```bash
+cd $DYNAMO_HOME/examples/backends/vllm/mm_router_worker
+MODEL=Qwen/Qwen3-VL-2B-Instruct ./launch.sh
+```
+
+### TRT-LLM
+
+```bash
+cd $DYNAMO_HOME/examples/backends/trtllm/mm_router_worker
+./launch.sh
+```
+
+See the [vLLM MM Router README](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/mm_router_worker/README.md) and [TRT-LLM MM Router README](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/mm_router_worker/README.md) for full setup instructions and configuration options.
+
+## Known Limitations
+
+- Currently supports Qwen-family multimodal processors (Qwen2-VL, Qwen2.5-VL, Qwen3-VL) for per-image visual token counting
+- Images are downloaded twice: once in the MM router (for hash computation) and once in the backend worker (for processing)
--- a/docs/index.yml
+++ b/docs/index.yml
@@ -84,15 +84,26 @@ navigation:
        path: components/kvbm/kvbm-guide.md
      - page: Dynamo Benchmarking
        path: benchmarks/benchmarking.md
-      - section: Multimodality Support
-        path: features/multimodal/README.md
+      - section: Multimodal Model Serving
        contents:
-          - page: vLLM Multimodal
-            path: features/multimodal/multimodal-vllm.md
-          - page: TensorRT-LLM Multimodal
-            path: features/multimodal/multimodal-trtllm.md
-          - page: SGLang Multimodal
-            path: features/multimodal/multimodal-sglang.md
+          - section: Vision Language Models (VLMs)
+            path: features/multimodal/README.md
+            contents:
+              - page: Embedding Cache
+                path: features/multimodal/embedding-cache.md
+              - page: Encoder Disaggregation
+                path: features/multimodal/encoder-disaggregation.md
+              - page: Multimodal KV Routing
+                path: features/multimodal/multimodal-kv-routing.md
+          - section: Diffusion (Experimental)
+            path: features/multimodal/diffusion.md
+            contents:
+              - page: vLLM-Omni
+                path: backends/vllm/vllm-omni.md
+              - page: SGLang Diffusion
+                path: backends/sglang/sglang-diffusion.md
+              - page: TRT-LLM Diffusion
+                path: backends/trtllm/trtllm-video-diffusion.md
      - page: Tool Calling
        path: agents/tool-calling.md
      - page: LoRA Adapters