feat: sglang to 0.5.9 + updated docs (#6518)

Co-authored-by: baihuitian <baihuitian.bht@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

Unverified Commit 6642e23e authored Feb 24, 2026 by ishandhanani Committed by GitHub Feb 24, 2026
20 changed files
--- a/docs/pages/backends/sglang/profiling.md
+++ b/docs/pages/backends/sglang/profiling.md
---
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-title: Profiling
---
-# Profiling SGLang Workers in Dynamo
-> [!NOTE]
-> **See also**: [Profiler Component Overview](../../components/profiler/README.md) for SLA-driven profiling and deployment optimization.
-Dynamo exposes profiling endpoints for SGLang workers via the system server's `/engine/*` routes. This allows you to start and stop PyTorch profiling on running inference workers without restarting them.
-These endpoints wrap SGLang's internal `TokenizerManager.start_profile()` and `stop_profile()` methods. See SGLang's documentation for the full list of supported parameters.
-## Quick Start
-1. **Start profiling:**
-```bash
-curl -X POST http://localhost:9090/engine/start_profile \
-  -H "Content-Type: application/json" \
-  -d '{"output_dir": "/tmp/profiler_output"}'
-```
-2. **Run some inference requests to generate profiling data**
-3. **Stop profiling:**
-```bash
-curl -X POST http://localhost:9090/engine/stop_profile
-```
-4. **View the traces:**
-The profiler outputs Chrome trace files in the specified `output_dir`. You can view them using:
- Chrome's `chrome://tracing`
- [Perfetto UI](https://ui.perfetto.dev/)
- TensorBoard with the PyTorch Profiler plugin
-## Test Script
-A test script is provided at [`examples/backends/sglang/test_sglang_profile.py`](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/test_sglang_profile.py) that demonstrates the full profiling workflow:
-```bash
-python examples/backends/sglang/test_sglang_profile.py
-```
--- a/docs/pages/backends/sglang/sglang-diffusion.md
+++ b/docs/pages/backends/sglang/sglang-diffusion.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: Diffusion
+---
+# Diffusion Models
+Dynamo SGLang supports three types of diffusion-based generation: **LLM diffusion** (text generation via iterative refinement), **image diffusion** (text-to-image), and **video generation** (text-to-video). Each uses a different worker flag and handler, but all integrate with SGLang's `DiffGenerator`.
+## Overview
+| Type | Worker Flag | API Endpoint |
+|------|------------|--------------|
+| LLM Diffusion | `--dllm-algorithm <algo>` | `/v1/chat/completions`, `/v1/completions` |
+| Image Diffusion | `--image-diffusion-worker` | `/v1/images/generations` |
+| Video Generation | `--video-generation-worker` | `/v1/videos` |
+<Note>
+If you see a cuDNN version mismatch error on startup (`cuDNN frontend 1.8.1 requires cuDNN lib >= 9.5.0`), set `SGLANG_DISABLE_CUDNN_CHECK=1` before launching. This is common when PyTorch ships a cuDNN version older than what SGLang requires for Conv3d operations.
+</Note>
+## LLM Diffusion
+Diffusion Language Models generate text through iterative refinement rather than autoregressive token-by-token generation. The model starts with masked tokens and progressively replaces them with predictions, refining low-confidence tokens each step.
+LLM diffusion is auto-detected: when `--dllm-algorithm` is set, the worker automatically uses `DiffusionWorkerHandler` without needing a separate flag. For more details on diffusion algorithms, see the [SGLang Diffusion Language Models documentation](https://github.com/sgl-project/sglang/blob/main/docs/supported_models/text_generation/diffusion_language_models.md).
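The iterative refinement described above can be illustrated with a toy loop. This is a minimal sketch, not SGLang's implementation: `toy_predict`, `refine`, the vocabulary, and the confidence values are all hypothetical stand-ins, and the "commit the most confident predictions, re-predict the rest" rule is a simplified assumption about how low-confidence positions are handled.

```python
import random

MASK = "<mask>"


def toy_predict(position, rng):
    """Hypothetical stand-in for the model: returns (token, confidence)."""
    vocab = ["the", "cat", "sat", "on", "mat"]
    return vocab[position % len(vocab)], rng.random()


def refine(length, steps, seed=0):
    """Fill `length` masked positions over at most `steps` refinement passes."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    per_step = -(-length // steps)  # ceiling division: positions committed per pass
    for _ in range(steps):
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        if not masked:
            break
        # Predict every masked position, then commit only the most confident ones.
        preds = {i: toy_predict(i, rng) for i in masked}
        keep = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:per_step]
        for i in keep:
            tokens[i] = preds[i][0]
        # Positions not in `keep` stay masked and are re-predicted on the next pass.
    return tokens


print(refine(length=5, steps=3))
```

Unlike autoregressive decoding, the order in which positions are filled here depends on confidence rather than on position.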
+### Launch
+```bash
+cd $DYNAMO_HOME/examples/backends/sglang
+./launch/diffusion_llada.sh
+```
+See the [launch script](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/launch/diffusion_llada.sh) for configuration options.
+### Test
+```bash
+curl -X POST http://localhost:8001/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "inclusionAI/LLaDA2.0-mini-preview",
+    "messages": [{"role": "user", "content": "Hello! How are you?"}],
+    "temperature": 0.7,
+    "max_tokens": 512
+  }'
+```
+## Image Diffusion
+Image diffusion workers generate images from text prompts using SGLang's `DiffGenerator`. Generated images are returned as either URLs (when using `--media-output-fs-url` for storage) or base64 data, in an OpenAI-compatible response format.
+### Launch
+```bash
+cd $DYNAMO_HOME/examples/backends/sglang
+./launch/image_diffusion.sh
+```
+The script supports local storage (`--fs-url file:///tmp/images`) and S3 (`--fs-url s3://bucket`). Pass `--http-url` to set the base URL for serving stored images. See the launch script for all configuration options.
+### Test
+```bash
+curl http://localhost:8000/v1/images/generations \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "black-forest-labs/FLUX.1-dev",
+    "prompt": "A sunset over the ocean",
+    "size": "1024x1024",
+    "response_format": "url",
+    "nvext": {
+      "num_inference_steps": 15
+    }
+  }'
+```
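A client has to handle both return styles mentioned above (URL or base64). The sketch below assumes the response follows the OpenAI Images API shape, with a `data` list whose items carry either `url` or `b64_json`. The helper name `extract_images` and the sample responses are hypothetical, and the exact fields Dynamo returns should be confirmed against a live response.

```python
import base64


def extract_images(response):
    """Return a list of ("url", str) or ("bytes", bytes) entries from a response dict."""
    results = []
    for item in response.get("data", []):
        if item.get("url"):
            results.append(("url", item["url"]))
        elif item.get("b64_json"):
            results.append(("bytes", base64.b64decode(item["b64_json"])))
    return results


# Hypothetical responses, for illustration only.
url_response = {"data": [{"url": "file:///tmp/images/example.png"}]}
b64_response = {"data": [{"b64_json": base64.b64encode(b"fake-png-bytes").decode()}]}

print(extract_images(url_response))
print(extract_images(b64_response))
```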
+## Video Generation
+Video generation workers produce videos from text or image prompts using SGLang's `DiffGenerator` with frame-to-video encoding. Both text-to-video (T2V) and image-to-video (I2V) workflows are supported.
+### Launch
+```bash
+cd $DYNAMO_HOME/examples/backends/sglang
+./launch/text-to-video-diffusion.sh
+```
+Use `--wan-size 1b` (default, 1 GPU) or `--wan-size 14b` (2 GPUs). See the launch script for all configuration options.
+### Test
+```bash
+curl http://localhost:8000/v1/videos \
+  -H "Content-Type: application/json" \
+  -d '{
+    "prompt": "A curious raccoon exploring a garden",
+    "model": "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
+    "seconds": 2,
+    "size": "832x480",
+    "response_format": "url",
+    "nvext": {
+      "fps": 8,
+      "num_frames": 17,
+      "num_inference_steps": 50
+    }
+  }'
+```
+## See Also
+- **[Examples](sglang-examples.md)**: Launch scripts for all deployment patterns
+- **[Reference Guide](sglang-reference-guide.md)**: Worker types and argument reference
+- **[SGLang Diffusion LMs (upstream)](https://github.com/sgl-project/sglang/blob/main/docs/supported_models/text_generation/diffusion_language_models.md)**: SGLang diffusion documentation
--- a/docs/pages/backends/sglang/sglang-examples.md
+++ b/docs/pages/backends/sglang/sglang-examples.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: Examples
+---
+# SGLang Examples
+For quick start instructions, see the [SGLang README](README.md). This document covers the deployment patterns for running SGLang with Dynamo: LLM serving, multimodal and diffusion models, and Kubernetes deployment.
+## Table of Contents
+- [Infrastructure Setup](#infrastructure-setup)
+- [LLM Serving](#llm-serving)
+- [Multimodal Serving](#multimodal-serving)
+- [Diffusion Models](#diffusion-models)
+- [Kubernetes Deployment](#kubernetes-deployment)
+- [Troubleshooting](#troubleshooting)
+## Infrastructure Setup
+For local/bare-metal development, start etcd and optionally NATS using Docker Compose:
+```bash
+docker compose -f deploy/docker-compose.yml up -d
+```
+<Note>
+- **etcd** is optional but is the default local discovery backend. You can also use `--discovery-backend file` to use file system based discovery.
+- **NATS** is only needed when using KV routing with events (`--kv-events-config`). Use `--no-router-kv-events` on the frontend for prediction-based routing without NATS.
+- **On Kubernetes**, neither is required when using the Dynamo operator (`DYN_DISCOVERY_BACKEND=kubernetes`).
+</Note>
+<Tip>
+Each launch script runs the frontend and worker(s) in a single terminal. You can run each command separately in different terminals for testing. For AI agents working with Dynamo, you can run the launch script in the background and use the `curl` commands to test the deployment.
+</Tip>
+## LLM Serving
+### Aggregated Serving
+The simplest deployment pattern: a single worker handles both prefill and decode.
+```bash
+cd $DYNAMO_HOME/examples/backends/sglang
+./launch/agg.sh
+```
+### Aggregated Serving with KV Routing
+Two workers behind a [KV-aware router](../../components/router/README.md) that maximizes cache reuse:
+```bash
+cd $DYNAMO_HOME/examples/backends/sglang
+./launch/agg_router.sh
+```
+This launches the frontend with `--router-mode kv` and two workers with ZMQ-based KV event publishing.
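To make "maximizes cache reuse" concrete, here is a toy sketch of prefix-overlap routing. It is not Dynamo's router: `block_hashes`, `pick_worker`, the block size of 4, and the in-memory `worker_blocks` map are all assumptions made for illustration. The idea is only that a request is sent to the worker that already holds the longest matching prefix of KV blocks.

```python
def block_hashes(token_ids, block_size=4):
    """Chain-hash full blocks so each hash identifies the entire prefix up to that block."""
    hashes = []
    prev = 0
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        block = tuple(token_ids[start:start + block_size])
        prev = hash((prev, block))
        hashes.append(prev)
    return hashes


def pick_worker(request_tokens, worker_blocks, block_size=4):
    """Pick the worker whose cached blocks cover the longest prefix of the request."""
    request = block_hashes(request_tokens, block_size)

    def overlap(cached):
        count = 0
        for h in request:
            if h not in cached:
                break
            count += 1
        return count

    return max(worker_blocks, key=lambda worker: overlap(worker_blocks[worker]))


# Hypothetical cache state: each worker has served one earlier prompt.
worker_blocks = {
    "worker-0": set(block_hashes([1, 2, 3, 4, 5, 6, 7, 8])),
    "worker-1": set(block_hashes([9, 9, 9, 9])),
}
print(pick_worker([1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13], worker_blocks))
```

A production router would also weigh worker load. This sketch considers cache overlap alone.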
+### Disaggregated Serving
+Separates prefill and decode into independent workers connected via NIXL for KV cache transfer. Requires 2 GPUs.
+```bash
+cd $DYNAMO_HOME/examples/backends/sglang
+./launch/disagg.sh
+```
+For details on how SGLang disaggregation works with Dynamo, including the bootstrap mechanism and RDMA transfer flow, see [SGLang Disaggregation](sglang-disaggregation.md).
+### Disaggregated Serving with KV-Aware Prefill Routing
+Scales to 2 prefill + 2 decode workers with KV-aware routing on both pools. Requires 4 GPUs.
+```bash
+cd $DYNAMO_HOME/examples/backends/sglang
+./launch/disagg_router.sh
+```
+The frontend uses `--router-mode kv` and automatically detects prefill workers to activate an internal prefill router. Each worker publishes KV events over ZMQ on unique ports.
+## Multimodal Serving
+### Aggregated Multimodal
+Serve multimodal models using SGLang's built-in multimodal support:
+```bash
+cd $DYNAMO_HOME/examples/backends/sglang
+./launch/agg_vision.sh
+```
+<Accordion title="Verify the deployment">
+```bash
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "Qwen/Qwen3-VL-8B-Instruct",
+    "messages": [
+      {
+        "role": "user",
+        "content": [
+          {"type": "text", "text": "Describe the image."},
+          {"type": "image_url", "image_url": {"url": "http://images.cocodataset.org/test2017/000000155781.jpg"}}
+        ]
+      }
+    ],
+    "max_tokens": 50,
+    "stream": false
+  }' | jq
+```
+</Accordion>
+### Multimodal with Disaggregated Components
+For advanced multimodal deployments with separate encoder, prefill, and decode workers (E/PD and E/P/D patterns), see the dedicated [SGLang Multimodal](../../features/multimodal/multimodal-sglang.md) documentation.
+| Pattern | Script | Description |
+|---------|--------|-------------|
+| E/PD | `./launch/multimodal_epd.sh` | Separate vision encoder + combined PD worker |
+| E/P/D | `./launch/multimodal_disagg.sh` | Separate encoder, prefill, and decode workers |
+## Diffusion Models
+### Diffusion LM
+Run diffusion language models like [LLaDA2.0](https://github.com/inclusionAI/LLaDA2.0):
+```bash
+cd $DYNAMO_HOME/examples/backends/sglang
+./launch/diffusion_llada.sh
+```
+### Image Diffusion
+Generate images from text prompts using [FLUX](https://huggingface.co/black-forest-labs/FLUX.1-dev) or other diffusion models:
+```bash
+cd $DYNAMO_HOME/examples/backends/sglang
+./launch/image_diffusion.sh
+```
+Options: `--model-path`, `--fs-url` (local or S3), `--http-url`.
+### Video Generation
+Generate videos from text prompts using [Wan2.1](https://huggingface.co/Wan-AI) models:
+```bash
+cd $DYNAMO_HOME/examples/backends/sglang
+./launch/text-to-video-diffusion.sh
+```
+Options: `--wan-size 1b|14b`, `--num-frames`, `--height`, `--width`, `--num-inference-steps`.
+For full details on all diffusion worker types (LLM, image, video), see [Diffusion](sglang-diffusion.md).
+## Kubernetes Deployment
+For complete K8s deployment examples, see:
+- [SGLang K8s deployment guide](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/deploy)
+- [SGLang aggregated router K8s example](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends/sglang/deploy/agg_router.yaml)
+- [Kubernetes Deployment Guide](../../kubernetes/README.md)
+## Troubleshooting
+### cuDNN Version Check Fails
+```
+RuntimeError: cuDNN frontend 1.8.1 requires cuDNN lib >= 9.5.0
+```
+Set `SGLANG_DISABLE_CUDNN_CHECK=1` before launching. This is common when PyTorch ships a cuDNN version older than what SGLang's Conv3d models require. It affects vision and diffusion models.
+### Model Registration Fails with `config.json` Error
+```
+unable to extract config.json from directory ...
+```
+This happens with diffusers models (FLUX.1-dev, Wan2.1, etc.) that use `model_index.json` instead of `config.json`. Ensure you are using the correct worker flag (`--image-diffusion-worker` or `--video-generation-worker`) rather than the standard LLM worker mode. These flags use a registration path that does not require `config.json`.
+### GPU OOM on Startup
+If a previous run left orphaned GPU processes, the next launch may run out of GPU memory. Check for lingering processes that still hold memory:
+```bash
+nvidia-smi  # look for lingering sgl_diffusion::scheduler or python processes
+kill -9 <PID>
+```
+### Disaggregated Workers Cannot Connect
+Ensure both prefill and decode workers can reach each other over TCP. The bootstrap mechanism uses `--disaggregation-bootstrap-port` (default: 12345). For multi-node setups, ensure the port is reachable across hosts and set `--host 0.0.0.0`.
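A quick way to check TCP reachability from one host to another is a plain socket connect. The sketch below uses only the Python standard library. The helper name `can_connect` and the `localhost` target are placeholders, and the port is the default bootstrap port mentioned above.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # "localhost" is a placeholder; 12345 is the default bootstrap port noted above.
    print(can_connect("localhost", 12345))
```

Run it from the decode host against the prefill host (and the reverse) to rule out firewall or binding issues before debugging further.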
+## See Also
+- **[SGLang README](README.md)**: Quick start and feature overview
+- **[Reference Guide](sglang-reference-guide.md)**: Architecture, configuration, and operational details
+- **[SGLang Multimodal](../../features/multimodal/multimodal-sglang.md)**: Vision model deployment patterns
+- **[SGLang HiCache](../../integrations/sglang-hicache.md)**: Hierarchical cache integration
+- **[Benchmarking](../../benchmarks/benchmarking.md)**: Performance benchmarking tools
+- **[Tuning Disaggregated Performance](../../performance/tuning.md)**: P/D tuning guide
--- a/docs/pages/backends/sglang/prometheus.md
+++ b/docs/pages/backends/sglang/prometheus.md
--- a/docs/pages/backends/sglang/sglang-reference-guide.md
+++ b/docs/pages/backends/sglang/sglang-reference-guide.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: Reference Guide
+subtitle: Architecture, configuration, and operational details for the SGLang backend
+---
+# Reference Guide
+## Overview
+The SGLang backend in Dynamo uses a modular architecture where `main.py` dispatches to specialized initialization modules based on the worker type. Each worker type has its own init module, request handler, health check, and registration logic.
+Dynamo SGLang uses SGLang's native argument parser -- all SGLang engine arguments (e.g., `--model-path`, `--tp`, `--trust-remote-code`) are passed through directly. Dynamo adds its own arguments for worker mode selection, tokenizer control, and disaggregation configuration.
+### Worker Types
+| Worker Type | Description |
+|------------|-------------|
+| **Decode** *(default)* | Standard LLM inference (aggregated or disaggregated decode) |
+| **Prefill** | Disaggregated prefill phase (`--disaggregation-mode prefill`) |
+| **Embedding** | Text embedding models (`--embedding-worker`) |
+| **Multimodal Processor** | HTTP entry point for multimodal, OpenAI-to-SGLang conversion (`--multimodal-processor`) |
+| **Multimodal Encode** | Vision encoder and embeddings generation (`--multimodal-encode-worker`) |
+| **Multimodal Worker** | LLM inference with multimodal data (`--multimodal-worker`) |
+| **Multimodal Prefill** | Prefill phase for multimodal disaggregation (`--multimodal-worker --disaggregation-mode prefill`) |
+| **Image Diffusion** | Image generation via DiffGenerator (`--image-diffusion-worker`) |
+| **Video Generation** | Text/image-to-video via DiffGenerator (`--video-generation-worker`) |
+| **LLM Diffusion** | Diffusion language models like LLaDA (`--dllm-algorithm <algo>`) |
+## Argument Reference
+### Dynamo-Specific Arguments
+These arguments are added by Dynamo on top of SGLang's native arguments.
+| Argument | Env Var | Default | Description |
+|----------|---------|---------|-------------|
+| `--endpoint` | `DYN_ENDPOINT` | Auto-generated | Dynamo endpoint in `dyn://namespace.component.endpoint` format |
+| `--use-sglang-tokenizer` | `DYN_SGL_USE_TOKENIZER` | `false` | Use SGLang's tokenizer instead of Dynamo's |
+| `--dyn-tool-call-parser` | `DYN_TOOL_CALL_PARSER` | `None` | [Tool call](../../agents/tool-calling.md) parser (overrides SGLang's `--tool-call-parser`) |
+| `--dyn-reasoning-parser` | `DYN_REASONING_PARSER` | `None` | Reasoning parser for chain-of-thought models |
+| `--custom-jinja-template` | `DYN_CUSTOM_JINJA_TEMPLATE` | `None` | Custom chat template path (incompatible with `--use-sglang-tokenizer`) |
+| `--embedding-worker` | `DYN_SGL_EMBEDDING_WORKER` | `false` | Run as embedding worker (also sets SGLang's `--is-embedding`) |
+| `--multimodal-processor` | `DYN_SGL_MULTIMODAL_PROCESSOR` | `false` | Run as [multimodal](../../features/multimodal/multimodal-sglang.md) processor |
+| `--multimodal-encode-worker` | `DYN_SGL_MULTIMODAL_ENCODE_WORKER` | `false` | Run as multimodal encode worker |
+| `--multimodal-worker` | `DYN_SGL_MULTIMODAL_WORKER` | `false` | Run as multimodal LLM worker |
+| `--image-diffusion-worker` | `DYN_SGL_IMAGE_DIFFUSION_WORKER` | `false` | Run as [image diffusion](sglang-diffusion.md#image-diffusion) worker |
+| `--video-generation-worker` | `DYN_SGL_VIDEO_GENERATION_WORKER` | `false` | Run as [video generation](sglang-diffusion.md#video-generation) worker |
+| `--disagg-config` | `DYN_SGL_DISAGG_CONFIG` | `None` | Path to YAML disaggregation config file |
+| `--disagg-config-key` | `DYN_SGL_DISAGG_CONFIG_KEY` | `None` | Key to select from disaggregation config (e.g., `prefill`, `decode`) |
+<Note>
+`--disagg-config` and `--disagg-config-key` must be provided together. The selected section is written to a temp YAML file and passed to SGLang's `--config` flag.
+</Note>
+## Tokenizer Behavior
+By default, Dynamo handles tokenization and detokenization through its Rust-based frontend, passing `input_ids` to SGLang. This enables all frontend endpoints (`/v1/chat/completions`, `/v1/completions`, `/v1/embeddings`).
+With `--use-sglang-tokenizer`, SGLang handles tokenization internally and Dynamo passes raw prompts. This restricts the frontend to `/v1/chat/completions` only.
+<Warning>
+`--custom-jinja-template` and `--use-sglang-tokenizer` are mutually exclusive. Custom templates require Dynamo's preprocessor.
+</Warning>
+## Request Cancellation
+When a client disconnects, Dynamo automatically cancels the in-flight request across all workers, freeing compute resources. A background cancellation monitor detects disconnection and aborts the SGLang request.
+| Mode | Prefill | Decode |
+|------|---------|--------|
+| **Aggregated** | ✅ | ✅ |
+| **Disaggregated** | ⚠️ | ✅ |
+<Warning>Cancellation during remote prefill in disaggregated mode is not currently supported.</Warning>
+For details on the cancellation architecture, see [Request Cancellation](../../fault-tolerance/request-cancellation.md).
+## Graceful Shutdown
+SGLang workers use Dynamo's graceful shutdown mechanism. When a `SIGTERM` or `SIGINT` is received:
+1. **Discovery unregister**: The worker is removed from service discovery so no new requests are routed to it
+2. **Grace period**: In-flight requests are allowed to complete
+3. **Deferred handlers**: SGLang's internal signal handlers (captured during startup via monkey-patching `loop.add_signal_handler`) are invoked after the grace period
+This ensures zero dropped requests during rolling updates or scale-down events.
+For more details, see [Graceful Shutdown](../../fault-tolerance/graceful-shutdown.md).
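The three shutdown steps above can be sketched with `asyncio`. This is an illustration of the pattern, not Dynamo's code: `DeferredSignalHandlers`, `graceful_shutdown`, and the `unregister` callback are hypothetical names, and replacing `add_signal_handler` on the loop instance is assumed to work only for event loops implemented in Python.

```python
import asyncio
import signal


class DeferredSignalHandlers:
    """Capture handlers an engine tries to register so they can be invoked later."""

    def __init__(self, loop):
        self.captured = {}
        # Shadow the loop's method on the instance; registrations are recorded, not installed.
        loop.add_signal_handler = self._capture

    def _capture(self, sig, callback, *args):
        self.captured[sig] = (callback, args)

    def invoke(self, sig):
        callback, args = self.captured[sig]
        callback(*args)


async def graceful_shutdown(unregister, in_flight, deferred, sig, grace_seconds):
    await unregister()  # 1. leave service discovery so no new requests arrive
    if in_flight:  # 2. give in-flight requests a bounded time to finish
        await asyncio.wait(in_flight, timeout=grace_seconds)
    deferred.invoke(sig)  # 3. run the engine's own handler last
```

The ordering is the point: the engine's handler would otherwise tear the process down before in-flight requests complete.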
+## Health Checks
+Each worker type has a specialized health check payload that validates the full inference pipeline:
+| Worker Type | Health Check Strategy |
+|------------|----------------------|
+| Decode / Aggregated | Short generation request (`max_new_tokens=1`) |
+| Prefill | Wrapped prefill-specific request structure |
+| Image Diffusion | Minimal image generation request |
+| Video Generation | Minimal video generation request |
+| Embedding | Standard embedding request |
+Health checks are registered with the Dynamo runtime and called by the frontend or Kubernetes liveness probes. See [Health Checks](../../observability/health-checks.md) for the broader health check architecture.
+## Metrics and KV Events
+### Prometheus Metrics
+Enable metrics with `--enable-metrics` on the worker. Set `DYN_SYSTEM_PORT` to expose the `/metrics` endpoint:
+```bash
+DYN_SYSTEM_PORT=8081 python -m dynamo.sglang --model-path Qwen/Qwen3-0.6B --enable-metrics
+```
+Both SGLang engine metrics (`sglang:*` prefix) and Dynamo runtime metrics (`dynamo_*` prefix) are served from the same endpoint.
+For metric details, see [SGLang Prometheus Metrics](sglang-prometheus.md). For visualization setup, see [Prometheus + Grafana](../../observability/prometheus-grafana.md).
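Because both families share one endpoint, a scrape can be split by prefix. The sketch below parses Prometheus text exposition output with the standard library only. The function name `group_by_prefix` and the sample metric names are hypothetical; real names should be taken from the actual `/metrics` output.

```python
def group_by_prefix(metrics_text):
    """Group Prometheus sample names by the prefixes named above."""
    groups = {"sglang": set(), "dynamo": set(), "other": set()}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Split off labels first, then the value, leaving the bare metric name.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith("sglang:"):
            groups["sglang"].add(name)
        elif name.startswith("dynamo_"):
            groups["dynamo"].add(name)
        else:
            groups["other"].add(name)
    return groups


# Hypothetical scrape output, for illustration only.
sample = """\
# HELP sglang:example_gauge An illustrative engine metric
sglang:example_gauge{model="m"} 3
dynamo_example_total 42
"""
print(group_by_prefix(sample))
```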
+### KV Events
+When configured with `--kv-events-config`, workers publish KV cache events (block creation/deletion) for the [KV-aware router](../../components/router/README.md). Events are published via ZMQ from SGLang's scheduler and relayed through Dynamo's event plane.
+For DP attention mode (`--enable-dp-attention`), the publisher handles multiple DP ranks per node, each with its own KV event stream.
+## Engine Routes
+SGLang workers expose operational endpoints via Dynamo's system server:
+| Route | Description |
+|-------|-------------|
+| `/engine/start_profile` | Start PyTorch profiling |
+| `/engine/stop_profile` | Stop profiling and save traces |
+| `/engine/release_memory_occupation` | Release GPU memory for maintenance |
+| `/engine/resume_memory_occupation` | Resume GPU memory after release |
+| `/engine/update_weights_from_distributor` | Update model weights from distributor |
+| `/engine/update_weights_from_disk` | Update model weights from disk |
+| `/engine/update_weight_version` | Update weight version metadata |
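As a sketch of how the profiling routes might be called from Python, the snippet below builds the two POST requests without sending them. The port `9090` and the `output_dir` field are taken from the `curl` example in the profiling page removed by this diff, so they may not match your deployment. The helper name `build_profile_requests` is hypothetical.

```python
import json
import urllib.request


def build_profile_requests(base_url, output_dir):
    """Build (start, stop) POST requests for the profiling routes listed above."""
    start = urllib.request.Request(
        f"{base_url}/engine/start_profile",
        data=json.dumps({"output_dir": output_dir}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    stop = urllib.request.Request(f"{base_url}/engine/stop_profile", method="POST")
    return start, stop


start, stop = build_profile_requests("http://localhost:9090", "/tmp/profiler_output")
print(start.get_method(), start.full_url)
print(stop.get_method(), stop.full_url)
# To send them against a running worker: urllib.request.urlopen(start), then urlopen(stop).
```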
+## See Also
+- **[Examples](sglang-examples.md)**: All deployment patterns
+- **[Disaggregation](sglang-disaggregation.md)**: P/D architecture and KV transfer
+- **[Diffusion](sglang-diffusion.md)**: LLM, image, and video diffusion models
+- **[Router Guide](../../components/router/router-guide.md)**: KV-aware routing configuration
--- a/docs/pages/observability/metrics.md
+++ b/docs/pages/observability/metrics.md
@@ -86,7 +86,7 @@ Dynamo exposes several categories of metrics:
 - **Frontend Metrics** (`dynamo_frontend_*`) - Request handling, token processing, and latency measurements
 - **Component Metrics** (`dynamo_component_*`) - Request counts, processing times, byte transfers, and system uptime
 - **Specialized Component Metrics** (e.g., `dynamo_preprocessor_*`) - Component-specific metrics
- **Engine Metrics** (Pass-through) - Backend engines expose their own metrics: [vLLM](../backends/vllm/prometheus.md) (`vllm:*`), [SGLang](../backends/sglang/prometheus.md) (`sglang:*`), [TensorRT-LLM](../backends/trtllm/prometheus.md) (`trtllm_*`)
+- **Engine Metrics** (Pass-through) - Backend engines expose their own metrics: [vLLM](../backends/vllm/prometheus.md) (`vllm:*`), [SGLang](../backends/sglang/sglang-prometheus.md) (`sglang:*`), [TensorRT-LLM](../backends/trtllm/prometheus.md) (`trtllm_*`)
 ## Runtime Hierarchy

--- a/docs/pages/reference/support-matrix.md
+++ b/docs/pages/reference/support-matrix.md
@@ -16,7 +16,7 @@ The following table shows the backend framework versions included with each Dyna
 | **Dynamo** | **vLLM** | **SGLang** | **TensorRT-LLM** | **NIXL** |
 | :--- | :--- | :--- | :--- | :--- |
-| **main (ToT)** | `0.15.1` | `0.5.8` | `1.3.0rc3` | `0.9.0` |
+| **main (ToT)** | `0.15.1` | `0.5.9` | `1.3.0rc3` | `0.9.0` |
 | **v1.0.0** *(planned)* | `0.15.0` | *Latest as of 2/17* | *Latest as of 2/17* | `0.10.0` |
 | **v0.9.1** *(in progress)* | `0.14.1` | `0.5.8` | `1.3.0rc3` | `0.9.0` |
 | **v0.9.0** *(in progress)* | `0.14.1` | `0.5.8` | `1.3.0rc1` | `0.9.0` |

--- a/docs/versions/dev.yml
+++ b/docs/versions/dev.yml
@@ -137,8 +137,19 @@ navigation:
        contents:
          - page: vLLM
            path: ../pages/backends/vllm/README.md
-          - page: SGLang
+          - section: SGLang
            path: ../pages/backends/sglang/README.md
+            contents:
+              - page: Reference Guide
+                path: ../pages/backends/sglang/sglang-reference-guide.md
+              - page: Examples
+                path: ../pages/backends/sglang/sglang-examples.md
+              - page: Disaggregation
+                path: ../pages/backends/sglang/sglang-disaggregation.md
+              - page: Diffusion
+                path: ../pages/backends/sglang/sglang-diffusion.md
+              - page: Prometheus
+                path: ../pages/backends/sglang/sglang-prometheus.md
          - page: TensorRT-LLM
            path: ../pages/backends/trtllm/README.md
      - section: Frontend
@@ -280,20 +291,6 @@ navigation:
            path: ../pages/backends/vllm/prompt-embeddings.md
          - page: vLLM-Omni
            path: ../pages/backends/vllm/vllm-omni.md
-      - section: SGLang Details
-        contents:
-          - page: Expert Distribution (EPLB)
-            path: ../pages/backends/sglang/expert-distribution-eplb.md
-          - page: GPT-OSS
-            path: ../pages/backends/sglang/gpt-oss.md
-          - page: Diffusion LM
-            path: ../pages/backends/sglang/diffusion-lm.md
-          - page: Profiling
-            path: ../pages/backends/sglang/profiling.md
-          - page: Disaggregation
-            path: ../pages/backends/sglang/sglang-disaggregation.md
-          - page: Prometheus
-            path: ../pages/backends/sglang/prometheus.md
      - section: TensorRT-LLM Details
        contents:
          - page: Multinode Examples

--- a/examples/backends/sglang/launch/agg.sh
+++ b/examples/backends/sglang/launch/agg.sh
 #!/bin/bash
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
+#
+# Aggregated serving: single worker handles both prefill and decode.
+# GPUs: 1
 # Setup cleanup trap
 cleanup() {
@@ -54,6 +57,26 @@ if [ "$ENABLE_OTEL" = true ]; then
    TRACE_ARGS+=(--enable-trace --otlp-traces-endpoint localhost:4317)
 fi
+HTTP_PORT="${DYN_HTTP_PORT:-8000}"
+echo "=========================================="
+echo "Launching Aggregated LLM Worker"
+echo "=========================================="
+echo "Model:       $MODEL"
+echo "Frontend:    http://localhost:$HTTP_PORT"
+echo "=========================================="
+echo ""
+echo "Example test command:"
+echo ""
+echo "  curl http://localhost:${HTTP_PORT}/v1/chat/completions \\"
+echo "    -H 'Content-Type: application/json' \\"
+echo "    -d '{"
+echo "      \"model\": \"${MODEL}\","
+echo "      \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}],"
+echo "      \"max_tokens\": 32"
+echo "    }'"
+echo ""
+echo "=========================================="
 # run ingress
 # dynamo.frontend accepts either --http-port flag or DYN_HTTP_PORT env var (defaults to 8000)
 OTEL_SERVICE_NAME=dynamo-frontend \

--- a/examples/backends/sglang/launch/agg_embed.sh
+++ b/examples/backends/sglang/launch/agg_embed.sh
 #!/bin/bash
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
+#
+# Aggregated embedding model serving.
+# GPUs: 1
 # Setup cleanup trap
 cleanup() {
@@ -31,6 +34,26 @@ while [[ $# -gt 0 ]]; do
    esac
 done
+MODEL="Qwen/Qwen3-Embedding-4B"
+HTTP_PORT="${DYN_HTTP_PORT:-8000}"
+echo "=========================================="
+echo "Launching Embedding Worker"
+echo "=========================================="
+echo "Model:       $MODEL"
+echo "Frontend:    http://localhost:$HTTP_PORT"
+echo "=========================================="
+echo ""
+echo "Example test command:"
+echo ""
+echo "  curl http://localhost:${HTTP_PORT}/v1/embeddings \\"
+echo "    -H 'Content-Type: application/json' \\"
+echo "    -d '{"
+echo "      \"model\": \"${MODEL}\","
+echo "      \"input\": \"Hello world\""
+echo "    }'"
+echo ""
+echo "=========================================="
 # run ingress
 # dynamo.frontend accepts either --http-port flag or DYN_HTTP_PORT env var (defaults to 8000)
 python3 -m dynamo.frontend &

--- a/examples/backends/sglang/launch/agg_router.sh
+++ b/examples/backends/sglang/launch/agg_router.sh
 #!/bin/bash
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
+#
+# Two aggregated workers behind a KV-aware router.
+# GPUs: 2
 # Setup cleanup trap
 cleanup() {
@@ -51,6 +54,27 @@ if [ "$ENABLE_OTEL" = true ]; then
    TRACE_ARGS+=(--enable-trace --otlp-traces-endpoint localhost:4317)
 fi
+MODEL="Qwen/Qwen3-0.6B"
+HTTP_PORT="${DYN_HTTP_PORT:-8000}"
+echo "=========================================="
+echo "Launching Aggregated Router (2 workers)"
+echo "=========================================="
+echo "Model:       $MODEL"
+echo "Frontend:    http://localhost:$HTTP_PORT"
+echo "=========================================="
+echo ""
+echo "Example test command:"
+echo ""
+echo "  curl http://localhost:${HTTP_PORT}/v1/chat/completions \\"
+echo "    -H 'Content-Type: application/json' \\"
+echo "    -d '{"
+echo "      \"model\": \"${MODEL}\","
+echo "      \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}],"
+echo "      \"max_tokens\": 32"
+echo "    }'"
+echo ""
+echo "=========================================="
 # run ingress
 # dynamo.frontend accepts either --http-port flag or DYN_HTTP_PORT env var (defaults to 8000)
 FRONTEND_ARGS=(--router-mode kv)

--- a/examples/backends/sglang/launch/agg_vision.sh
+++ b/examples/backends/sglang/launch/agg_vision.sh
 #!/bin/bash
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
+#
+# Aggregated multimodal (vision + LLM) serving.
+# GPUs: 1
 # Setup cleanup trap
 cleanup() {
@@ -60,6 +63,29 @@ if [ "$ENABLE_OTEL" = true ]; then
    TRACE_ARGS+=(--enable-trace --otlp-traces-endpoint localhost:4317)
 fi
+HTTP_PORT="${DYN_HTTP_PORT:-8000}"
+echo "=========================================="
+echo "Launching Aggregated Vision Worker"
+echo "=========================================="
+echo "Model:       $MODEL"
+echo "Frontend:    http://localhost:$HTTP_PORT"
+echo "=========================================="
+echo ""
+echo "Example test command:"
+echo ""
+echo "  curl http://localhost:${HTTP_PORT}/v1/chat/completions \\"
+echo "    -H 'Content-Type: application/json' \\"
+echo "    -d '{"
+echo "      \"model\": \"${MODEL}\","
+echo "      \"messages\": [{\"role\": \"user\", \"content\": ["
+echo "        {\"type\": \"text\", \"text\": \"Describe the image.\"},"
+echo "        {\"type\": \"image_url\", \"image_url\": {\"url\": \"http://images.cocodataset.org/test2017/000000155781.jpg\"}}"
+echo "      ]}],"
+echo "      \"max_tokens\": 50"
+echo "    }'"
+echo ""
+echo "=========================================="
 # run ingress
 # dynamo.frontend accepts either --http-port flag or DYN_HTTP_PORT env var (defaults to 8000)
 OTEL_SERVICE_NAME=dynamo-frontend \

--- a/examples/backends/sglang/launch/diffusion_llada.sh
+++ b/examples/backends/sglang/launch/diffusion_llada.sh
 #!/bin/bash
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
+#
+# Diffusion language model (LLaDA2.0). Text generation via iterative refinement.
+# GPUs: 1
 set -e
@@ -38,6 +41,19 @@ echo "TP Size: $TP_SIZE"
 echo "Diffusion Algorithm: ${DLLM_ALGORITHM:-LowConfidence}"
 echo "Algorithm Config: ${DLLM_ALGORITHM_CONFIG:-default}"
 echo "=========================================="
+echo ""
+echo "Example test command:"
+echo ""
+echo "  curl http://localhost:${HTTP_PORT}/v1/chat/completions \\"
+echo "    -H 'Content-Type: application/json' \\"
+echo "    -d '{"
+echo "      \"model\": \"${MODEL_PATH}\","
+echo "      \"messages\": [{\"role\": \"user\", \"content\": \"Hello! How are you?\"}],"
+echo "      \"temperature\": 0.7,"
+echo "      \"max_tokens\": 512"
+echo "    }'"
+echo ""
+echo "=========================================="
 # Launch frontend (OpenAI-compatible API server)
 echo "Starting Dynamo Frontend on port $HTTP_PORT..."

--- a/examples/backends/sglang/launch/disagg.sh
+++ b/examples/backends/sglang/launch/disagg.sh
 #!/bin/bash
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
+#
+# Disaggregated serving: prefill on GPU 0, decode on GPU 1.
+# GPUs: 2
 # Setup cleanup trap
 cleanup() {
@@ -45,6 +48,27 @@ if [ "$ENABLE_OTEL" = true ]; then
    TRACE_ARGS+=(--enable-trace --otlp-traces-endpoint localhost:4317)
 fi
+MODEL="Qwen/Qwen3-0.6B"
+HTTP_PORT="${DYN_HTTP_PORT:-8000}"
+echo "=========================================="
+echo "Launching Disaggregated Workers (P/D)"
+echo "=========================================="
+echo "Model:       $MODEL"
+echo "Frontend:    http://localhost:$HTTP_PORT"
+echo "=========================================="
+echo ""
+echo "Example test command:"
+echo ""
+echo "  curl http://localhost:${HTTP_PORT}/v1/chat/completions \\"
+echo "    -H 'Content-Type: application/json' \\"
+echo "    -d '{"
+echo "      \"model\": \"${MODEL}\","
+echo "      \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}],"
+echo "      \"max_tokens\": 32"
+echo "    }'"
+echo ""
+echo "=========================================="
 # run ingress
 # dynamo.frontend accepts either --http-port flag or DYN_HTTP_PORT env var (defaults to 8000)
 OTEL_SERVICE_NAME=dynamo-frontend \

--- a/examples/backends/sglang/launch/disagg_router.sh
+++ b/examples/backends/sglang/launch/disagg_router.sh
 #!/bin/bash
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
+#
+# Disaggregated serving with KV-aware routing: 2 prefill + 2 decode workers.
+# GPUs: 4
 # Setup cleanup trap
 cleanup() {
@@ -46,6 +49,27 @@ if [ "$ENABLE_OTEL" = true ]; then
    TRACE_ARGS+=(--enable-trace --otlp-traces-endpoint localhost:4317)
 fi
+MODEL="Qwen/Qwen3-0.6B"
+HTTP_PORT="${DYN_HTTP_PORT:-8000}"
+echo "=========================================="
+echo "Launching Disaggregated Router (2P + 2D)"
+echo "=========================================="
+echo "Model:       $MODEL"
+echo "Frontend:    http://localhost:$HTTP_PORT"
+echo "=========================================="
+echo ""
+echo "Example test command:"
+echo ""
+echo "  curl http://localhost:${HTTP_PORT}/v1/chat/completions \\"
+echo "    -H 'Content-Type: application/json' \\"
+echo "    -d '{"
+echo "      \"model\": \"${MODEL}\","
+echo "      \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}],"
+echo "      \"max_tokens\": 32"
+echo "    }'"
+echo ""
+echo "=========================================="
 # Start frontend with KV routing
 # The frontend will automatically detect prefill workers and activate an internal prefill router
 # No standalone prefill router needed - the frontend handles prefill routing internally

--- a/examples/backends/sglang/launch/disagg_same_gpu.sh
+++ b/examples/backends/sglang/launch/disagg_same_gpu.sh
@@ -2,6 +2,9 @@
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
 #
+# Disaggregated serving on a single GPU (prefill + decode share memory).
+# GPUs: 1 (requires 16+ GB VRAM)
+#
 # Usage: ./disagg_same_gpu.sh [GPU_MEM_FRACTION]
 #   GPU_MEM_FRACTION: Fraction of GPU memory to use per worker (default: 0.45)
 #   Example: ./disagg_same_gpu.sh 0.45
@@ -36,6 +39,28 @@ cleanup() {
 trap cleanup EXIT INT TERM
+MODEL="Qwen/Qwen3-0.6B"
+HTTP_PORT="${DYN_HTTP_PORT:-8000}"
+echo "=========================================="
+echo "Launching Disaggregated (same GPU)"
+echo "=========================================="
+echo "Model:       $MODEL"
+echo "Frontend:    http://localhost:$HTTP_PORT"
+echo "GPU Mem:     ${GPU_MEM_FRACTION} per worker"
+echo "=========================================="
+echo ""
+echo "Example test command:"
+echo ""
+echo "  curl http://localhost:${HTTP_PORT}/v1/chat/completions \\"
+echo "    -H 'Content-Type: application/json' \\"
+echo "    -d '{"
+echo "      \"model\": \"${MODEL}\","
+echo "      \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}],"
+echo "      \"max_tokens\": 32"
+echo "    }'"
+echo ""
+echo "=========================================="
 # run ingress with KV router mode for disaggregated setup
 # dynamo.frontend accepts either --http-port flag or DYN_HTTP_PORT env var (defaults to 8000)
 python3 -m dynamo.frontend --router-mode kv &
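
Because prefill and decode share one GPU here, the per-worker `GPU_MEM_FRACTION` (default 0.45 per the usage text) only makes sense if two workers fit together. A minimal pre-flight sketch of that arithmetic follows. The `check_mem_fraction` helper is hypothetical and not part of `disagg_same_gpu.sh`, and the rule "twice the fraction must stay below 1" is an assumption about what a sensible bound looks like.

```bash
#!/bin/bash
# Hypothetical pre-flight check (not part of disagg_same_gpu.sh).
# Two workers share one GPU, so 2 * fraction is assumed to need to stay below 1.
check_mem_fraction() {
    local fraction="${1:-0.45}"   # same default the usage text documents
    # f + 0 forces a numeric value, so non-numeric input becomes 0 and is rejected.
    awk -v f="$fraction" 'BEGIN { f = f + 0; exit !(f > 0 && 2 * f < 1) }'
}
```

A caller could run `check_mem_fraction "$GPU_MEM_FRACTION" || exit 1` before starting the workers.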

--- a/examples/backends/sglang/launch/image_diffusion.sh
+++ b/examples/backends/sglang/launch/image_diffusion.sh
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Image diffusion worker (text-to-image). Default model: FLUX.1-dev (~38 GB VRAM).
+# GPUs: 1
+set -e
+# Setup cleanup trap
+cleanup() {
+    echo "Cleaning up background processes..."
+    kill "$FRONTEND_PID" 2>/dev/null || true
+    wait "$FRONTEND_PID" 2>/dev/null || true
+    echo "Cleanup complete."
+}
+trap cleanup EXIT INT TERM
+# Defaults
+MODEL_PATH="black-forest-labs/FLUX.1-dev"
+FS_URL="file:///tmp/dynamo_media"
+HTTP_URL=""
+HTTP_PORT="${HTTP_PORT:-8000}"
+# Parse command line arguments
+EXTRA_ARGS=()
+while [[ $# -gt 0 ]]; do
+    case $1 in
+        --model-path)
+            MODEL_PATH="$2"
+            shift 2
+            ;;
+        --fs-url)
+            FS_URL="$2"
+            shift 2
+            ;;
+        --http-url)
+            HTTP_URL="$2"
+            shift 2
+            ;;
+        --http-port)
+            HTTP_PORT="$2"
+            shift 2
+            ;;
+        -h|--help)
+            echo "Usage: $0 [OPTIONS]"
+            echo ""
+            echo "Launch a Dynamo image diffusion worker."
+            echo ""
+            echo "Options:"
+            echo "  --model-path <path>          Model path (default: black-forest-labs/FLUX.1-dev)"
+            echo "  --fs-url <url>               Filesystem URL for image storage (default: file:///tmp/dynamo_media)"
+            echo "  --http-url <url>             Base URL for serving images over HTTP (optional)"
+            echo "  --http-port <port>           Frontend HTTP port (default: 8000)"
+            echo "  -h, --help                   Show this help message"
+            echo ""
+            echo "Additional flags are forwarded to dynamo.sglang."
+            echo ""
+            echo "Examples:"
+            echo "  # Local file storage"
+            echo "  $0 --model-path black-forest-labs/FLUX.1-dev --fs-url file:///tmp/images"
+            echo ""
+            echo "  # S3 storage (set FSSPEC_S3_KEY, FSSPEC_S3_SECRET, optionally FSSPEC_S3_ENDPOINT_URL)"
+            echo "  $0 --fs-url s3://my-bucket/images"
+            exit 0
+            ;;
+        *)
+            EXTRA_ARGS+=("$1")
+            shift
+            ;;
+    esac
+done
+echo "=========================================="
+echo "Launching Image Diffusion Worker"
+echo "=========================================="
+echo "Model:       $MODEL_PATH"
+echo "Frontend:    http://localhost:$HTTP_PORT"
+echo "FS URL:      $FS_URL"
+[ -n "$HTTP_URL" ] && echo "HTTP URL:    $HTTP_URL"
+echo "=========================================="
+echo ""
+echo "Example test command:"
+echo ""
+echo "  curl http://localhost:${HTTP_PORT}/v1/images/generations \\"
+echo "    -H 'Content-Type: application/json' \\"
+echo "    -d '{"
+echo "      \"prompt\": \"A curious raccoon exploring a garden\","
+echo "      \"model\": \"${MODEL_PATH}\","
+echo "      \"size\": \"1024x1024\","
+echo "      \"response_format\": \"url\","
+echo "      \"nvext\": {"
+echo "        \"num_inference_steps\": 15"
+echo "      }"
+echo "    }'"
+echo ""
+echo "=========================================="
+# Build optional HTTP URL arg
+HTTP_URL_ARGS=()
+if [ -n "$HTTP_URL" ]; then
+    HTTP_URL_ARGS=(--media-output-http-url "$HTTP_URL")
+fi
+# Launch frontend
+echo "Starting Dynamo Frontend on port $HTTP_PORT..."
+python3 -m dynamo.frontend \
+    --http-port "$HTTP_PORT" &
+FRONTEND_PID=$!
+sleep 2
+# Launch image diffusion worker
+echo "Starting Image Diffusion Worker..."
+python3 -m dynamo.sglang \
+    --model-path "$MODEL_PATH" \
+    --served-model-name "$MODEL_PATH" \
+    --image-diffusion-worker \
+    --media-output-fs-url "$FS_URL" \
+    "${HTTP_URL_ARGS[@]}" \
+    --trust-remote-code \
+    --skip-tokenizer-init \
+    --enable-metrics \
+    "${EXTRA_ARGS[@]}"
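
The script above splits its arguments into flags it knows and everything else, which is forwarded to `dynamo.sglang`. It also builds the optional `--media-output-http-url` argument as an array, so nothing is passed when no URL is given. A condensed sketch of that pattern follows. The `parse_args` wrapper is a hypothetical name used only to make the logic testable in isolation.

```bash
#!/bin/bash
# Sketch of the known-flag / pass-through split used in image_diffusion.sh.
# parse_args is a hypothetical wrapper; the script runs this logic inline.
parse_args() {
    FS_URL="file:///tmp/dynamo_media"
    HTTP_URL=""
    EXTRA_ARGS=()
    while [[ $# -gt 0 ]]; do
        case $1 in
            --fs-url)   FS_URL="$2";   shift 2 ;;
            --http-url) HTTP_URL="$2"; shift 2 ;;
            *)          EXTRA_ARGS+=("$1"); shift ;;   # unknown flags are forwarded
        esac
    done
    # Build the optional argument as an array so an empty value expands to nothing.
    HTTP_URL_ARGS=()
    if [ -n "$HTTP_URL" ]; then
        HTTP_URL_ARGS=(--media-output-http-url "$HTTP_URL")
    fi
}
```

Expanding `"${HTTP_URL_ARGS[@]}"` and `"${EXTRA_ARGS[@]}"` in quotes keeps each element as one word, even when a value contains spaces.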
--- a/examples/backends/sglang/launch/multimodal_disagg.sh
+++ b/examples/backends/sglang/launch/multimodal_disagg.sh
 #!/bin/bash
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
+#
+# Multimodal E/P/D: encoder (GPU 0), prefill (GPU 1), decode (GPU 2).
+# GPUs: 3
 # Setup cleanup trap
 cleanup() {
@@ -59,6 +62,29 @@ if [[ -n "$SERVED_MODEL_NAME" ]]; then
    SERVED_MODEL_ARG="--served-model-name $SERVED_MODEL_NAME"
 fi
+HTTP_PORT="${DYN_HTTP_PORT:-8000}"
+echo "=========================================="
+echo "Launching Multimodal E/P/D Workers"
+echo "=========================================="
+echo "Model:       $MODEL_NAME"
+echo "Frontend:    http://localhost:$HTTP_PORT"
+echo "=========================================="
+echo ""
+echo "Example test command:"
+echo ""
+echo "  curl http://localhost:${HTTP_PORT}/v1/chat/completions \\"
+echo "    -H 'Content-Type: application/json' \\"
+echo "    -d '{"
+echo "      \"model\": \"${MODEL_NAME}\","
+echo "      \"messages\": [{\"role\": \"user\", \"content\": ["
+echo "        {\"type\": \"text\", \"text\": \"Describe the image.\"},"
+echo "        {\"type\": \"image_url\", \"image_url\": {\"url\": \"http://images.cocodataset.org/test2017/000000155781.jpg\"}}"
+echo "      ]}],"
+echo "      \"max_tokens\": 50"
+echo "    }'"
+echo ""
+echo "=========================================="
 # run ingress
 # dynamo.frontend accepts either --http-port flag or DYN_HTTP_PORT env var (defaults to 8000)
 python3 -m dynamo.frontend &
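
The banner above prints the multimodal example request one `echo` at a time, which needs every inner double quote escaped. One alternative is a heredoc, sketched below. The `print_vision_example` helper is hypothetical and not part of this change. It takes the model name and port as arguments instead of reading `MODEL_NAME` and `HTTP_PORT` directly.

```bash
#!/bin/bash
# Hypothetical helper: print the same example request using a heredoc.
# In an unquoted heredoc, \\ yields a single backslash and ${...} is expanded.
print_vision_example() {
    local model="$1" port="$2"
    cat <<EOF
  curl http://localhost:${port}/v1/chat/completions \\
    -H 'Content-Type: application/json' \\
    -d '{
      "model": "${model}",
      "messages": [{"role": "user", "content": [
        {"type": "text", "text": "Describe the image."},
        {"type": "image_url", "image_url": {"url": "http://images.cocodataset.org/test2017/000000155781.jpg"}}
      ]}],
      "max_tokens": 50
    }'
EOF
}
```

Calling `print_vision_example "$MODEL_NAME" "$HTTP_PORT"` would produce the same command text as the echo block, without the escaped quotes.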

--- a/examples/backends/sglang/launch/multimodal_epd.sh
+++ b/examples/backends/sglang/launch/multimodal_epd.sh
 #!/bin/bash
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
+#
+# Multimodal E/PD: separate vision encoder (GPU 0) + combined PD worker (GPU 1).
+# GPUs: 2
 # Setup cleanup trap
 cleanup() {
@@ -59,6 +62,29 @@ if [[ -n "$SERVED_MODEL_NAME" ]]; then
    SERVED_MODEL_ARG="--served-model-name $SERVED_MODEL_NAME"
 fi
+HTTP_PORT="${DYN_HTTP_PORT:-8000}"
+echo "=========================================="
+echo "Launching Multimodal E/PD Workers"
+echo "=========================================="
+echo "Model:       $MODEL_NAME"
+echo "Frontend:    http://localhost:$HTTP_PORT"
+echo "=========================================="
+echo ""
+echo "Example test command:"
+echo ""
+echo "  curl http://localhost:${HTTP_PORT}/v1/chat/completions \\"
+echo "    -H 'Content-Type: application/json' \\"
+echo "    -d '{"
+echo "      \"model\": \"${MODEL_NAME}\","
+echo "      \"messages\": [{\"role\": \"user\", \"content\": ["
+echo "        {\"type\": \"text\", \"text\": \"Describe the image.\"},"
+echo "        {\"type\": \"image_url\", \"image_url\": {\"url\": \"http://images.cocodataset.org/test2017/000000155781.jpg\"}}"
+echo "      ]}],"
+echo "      \"max_tokens\": 50"
+echo "    }'"
+echo ""
+echo "=========================================="
 # run ingress
 # dynamo.frontend accepts either --http-port flag or DYN_HTTP_PORT env var (defaults to 8000)
 python3 -m dynamo.frontend &

--- a/examples/backends/sglang/launch/text-to-video-diffusion.sh
+++ b/examples/backends/sglang/launch/text-to-video-diffusion.sh
 #!/bin/bash
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
+#
+# Text-to-video generation with Wan2.1 models.
+# GPUs: 1 (--wan-size 1b) or 2 (--wan-size 14b)
 set -e