docs: add FastVideo example and guide with light sidebar reorg (#7283)

Signed-off-by: Dan Gil <dagil@nvidia.com>

Unverified Commit 2adf8a2d authored Mar 12, 2026 by dagil-nvidia Committed by GitHub Mar 12, 2026
18 changed files
--- a/docs/features/agentic_workloads.md
+++ b/docs/features/agentic_workloads.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: Agentic Workloads
+title: Agents
 subtitle: Workload-aware inference with agentic hints for routing, scheduling, and KV cache Management
 ---


--- a/docs/features/multimodal/diffusion.md
+++ b/docs/features/multimodal/diffusion.md
@@ -30,3 +30,4 @@ For deployment guides, configuration, and examples for each backend:
 - **[vLLM-Omni](../../backends/vllm/vllm-omni.md)**
 - **[SGLang Diffusion](../../backends/sglang/sglang-diffusion.md)**
 - **[TRT-LLM Diffusion](../../backends/trtllm/trtllm-video-diffusion.md)**
+- **[FastVideo (custom worker)](fastvideo.md)**
--- a/docs/features/diffusion/fastvideo.md
+++ b/docs/features/diffusion/fastvideo.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+sidebar-title: FastVideo
+---
+
+# FastVideo
+
+This guide covers deploying [FastVideo](https://github.com/hao-ai-lab/FastVideo) text-to-video generation on Dynamo using a custom worker (`worker.py`) exposed through the `/v1/videos` endpoint.
+
+> [!NOTE]
+> Dynamo also supports diffusion through built-in backends: [SGLang Diffusion](../../backends/sglang/sglang-diffusion.md) (LLM diffusion, image, video), [vLLM-Omni](../../backends/vllm/vllm-omni.md) (text-to-image, text-to-video), and [TRT-LLM Video Diffusion](../../backends/trtllm/trtllm-video-diffusion.md). See the [Diffusion Overview](README.md) for the full support matrix.
+
+## Overview
+
+- **Default model:** `FastVideo/LTX2-Distilled-Diffusers` — a distilled variant of the LTX-2 Diffusion Transformer (Lightricks), reducing inference from 50+ steps to just 5.
+- **Two-stage pipeline:** Stage 1 generates video at target resolution; Stage 2 refines with a distilled LoRA for improved fidelity and texture.
+- **Optimized inference:** FP4 quantization and `torch.compile` are enabled by default for maximum throughput.
+- **Response format:** Returns one complete MP4 payload per request as `data[0].b64_json` (non-streaming).
+- **Concurrency:** One request at a time per worker (VideoGenerator is not re-entrant). Scale throughput by running multiple workers.
+
+> [!IMPORTANT]
+> This example is optimized for **NVIDIA B200/B300** GPUs (CUDA arch 10.0) with FP4 quantization and flash-attention. It can run on other GPUs (H100, A100, etc.) if you pass `--disable-optimizations` to `worker.py`, which disables FP4 quantization and `torch.compile` and switches the attention backend from `FLASH_ATTN` to `TORCH_SDPA`. Expect lower performance but broader compatibility.
+
+## Docker Image Build
+
+The local Docker workflow builds a runtime image from the [`Dockerfile`](https://github.com/ai-dynamo/dynamo/blob/main/examples/diffusers/Dockerfile):
+
+- Base image: `nvidia/cuda:13.1.1-devel-ubuntu24.04`
+- Installs [FastVideo](https://github.com/hao-ai-lab/FastVideo) from GitHub
+- Installs Dynamo from the `release/1.0.0` branch (for `/v1/videos` support)
+- Compiles a [flash-attention](https://github.com/RandNMR73/flash-attention) fork from source
+
+> [!WARNING]
+> The first Docker image build can take **20–40+ minutes** because FastVideo and CUDA-dependent components are compiled during the build. Subsequent builds are much faster if Docker layer cache is preserved. Compiling `flash-attention` can use significant RAM — low-memory builders may hit out-of-memory failures. If that happens, lower `MAX_JOBS` in the Dockerfile to reduce parallel compile memory usage. The [flash-attn install notes](https://pypi.org/project/flash-attn/) specifically recommend this on machines with less than 96 GB RAM and many CPU cores.
+
+## Warmup Time
+
+On first start, workers download model weights and run compile/warmup steps. Expect roughly **10–20 minutes** before a worker can serve its first request (hardware-dependent). Even after the first successful response, the second request can take around **35 seconds** while runtime caches finish warming up; steady-state performance is typically reached from the third request onward.
+
+> [!TIP]
+> When using Kubernetes, mount a shared Hugging Face cache PVC (see [Kubernetes Deployment](#kubernetes-deployment)) so model weights are downloaded once and reused across pod restarts.
+
+## Local Deployment
+
+### Prerequisites
+
+**For Docker Compose:**
+
+- Docker Engine 26.0+
+- Docker Compose v2
+- NVIDIA Container Toolkit
+
+**For host-local script:**
+
+- Python environment with Dynamo + FastVideo dependencies installed
+- CUDA-compatible GPU runtime available on host
+
+### Option 1: Docker Compose
+
+```bash
+cd <dynamo-root>/examples/diffusers/local
+
+# Start 4 workers on GPUs 0..3
+COMPOSE_PROFILES=4 docker compose up --build
+```
+
+The Compose file builds from the Dockerfile and exposes the API on `http://localhost:8000`. See the [Docker Image Build](#docker-image-build) section for build time expectations.
+
+### Option 2: Host-Local Script
+
+```bash
+cd <dynamo-root>/examples/diffusers/local
+./run_local.sh
+```
+
+Environment variables:
+
+| Variable | Default | Description |
+|---|---|---|
+| `PYTHON_BIN` | `python3` | Python interpreter |
+| `MODEL` | `FastVideo/LTX2-Distilled-Diffusers` | Hugging Face model path |
+| `NUM_GPUS` | `1` | Number of GPUs |
+| `HTTP_PORT` | `8000` | Frontend HTTP port |
+| `WORKER_EXTRA_ARGS` | — | Extra flags for `worker.py` (e.g., `--disable-optimizations`) |
+| `FRONTEND_EXTRA_ARGS` | — | Extra flags for `dynamo.frontend` |
+
+Example:
+
+```bash
+MODEL=FastVideo/LTX2-Distilled-Diffusers \
+NUM_GPUS=1 \
+HTTP_PORT=8000 \
+WORKER_EXTRA_ARGS="--disable-optimizations" \
+./run_local.sh
+```
+
+> [!NOTE]
+> `--disable-optimizations` is a `worker.py` flag (not a `dynamo.frontend` flag), so pass it through `WORKER_EXTRA_ARGS`.
+
+The script writes logs to:
+
+- `.runtime/logs/worker.log`
+- `.runtime/logs/frontend.log`
+
+## Kubernetes Deployment
+
+### Files
+
+| File | Description |
+|---|---|
+| `agg.yaml` | Base aggregated deployment (Frontend + `FastVideoWorker`) |
+| `agg_user_workload.yaml` | Same deployment with `user-workload` tolerations and `imagePullSecrets` |
+| `huggingface-cache-pvc.yaml` | Shared HF cache PVC for model weights |
+| `dynamo-platform-values-user-workload.yaml` | Optional Helm values for clusters with tainted `user-workload` nodes |
+
+### Prerequisites
+
+1. Dynamo Kubernetes Platform installed
+2. GPU-enabled Kubernetes cluster
+3. FastVideo runtime image pushed to your registry
+4. Optional HF token secret (for gated models)
+
+Create a Hugging Face token secret if needed:
+
+```bash
+export NAMESPACE=<your-namespace>
+export HF_TOKEN=<your-hf-token>
+kubectl create secret generic hf-token-secret \
+  --from-literal=HF_TOKEN="${HF_TOKEN}" \
+  -n "${NAMESPACE}"
+```
+
+### Deploy
+
+```bash
+cd <dynamo-root>/examples/diffusers/deploy
+export NAMESPACE=<your-namespace>
+
+kubectl apply -f huggingface-cache-pvc.yaml -n ${NAMESPACE}
+kubectl apply -f agg.yaml -n ${NAMESPACE}
+```
+
+For clusters with tainted `user-workload` nodes and private registry pulls:
+
+1. Set your pull secret name and image in `agg_user_workload.yaml`.
+2. Apply:
+
+```bash
+kubectl apply -f huggingface-cache-pvc.yaml -n ${NAMESPACE}
+kubectl apply -f agg_user_workload.yaml -n ${NAMESPACE}
+```
+
+### Update Image Quickly
+
+```bash
+export DEPLOYMENT_FILE=agg.yaml
+export FASTVIDEO_IMAGE=<my-registry/fastvideo-runtime:my-tag>
+
+yq '.spec.services.[].extraPodSpec.mainContainer.image = env(FASTVIDEO_IMAGE)' \
+  ${DEPLOYMENT_FILE} > ${DEPLOYMENT_FILE}.generated
+
+kubectl apply -f ${DEPLOYMENT_FILE}.generated -n ${NAMESPACE}
+```
+
+### Verify and Access
+
+```bash
+kubectl get dgd -n ${NAMESPACE}
+kubectl get pods -n ${NAMESPACE}
+kubectl logs -n ${NAMESPACE} -l nvidia.com/dynamo-component=FastVideoWorker
+```
+
+```bash
+kubectl port-forward -n ${NAMESPACE} svc/fastvideo-agg-frontend 8000:8000
+```
+
+## Test Request
+
+> [!NOTE]
+> If this is the first request after startup, expect it to take longer while warmup completes. See [Warmup Time](#warmup-time) for details.
+
+Send a request and decode the response:
+
+```bash
+curl -s -X POST http://localhost:8000/v1/videos \
+  -H 'Content-Type: application/json' \
+  -d '{
+    "model": "FastVideo/LTX2-Distilled-Diffusers",
+    "prompt": "A cinematic drone shot over a snowy mountain range at sunrise",
+    "size": "1920x1088",
+    "seconds": 5,
+    "nvext": {
+      "fps": 24,
+      "num_frames": 121,
+      "num_inference_steps": 5,
+      "guidance_scale": 1.0,
+      "seed": 10
+    }
+  }' > response.json
+
+# Linux
+jq -r '.data[0].b64_json' response.json | base64 --decode > output.mp4
+
+# macOS
+jq -r '.data[0].b64_json' response.json | base64 -D > output.mp4
+```
+
+## Worker Configuration Reference
+
+### CLI Flags
+
+| Flag | Default | Description |
+|---|---|---|
+| `--model` | `FastVideo/LTX2-Distilled-Diffusers` | Hugging Face model path |
+| `--num-gpus` | `1` | Number of GPUs for distributed inference |
+| `--disable-optimizations` | off | Disables FP4 quantization and `torch.compile`, and switches attention from `FLASH_ATTN` to `TORCH_SDPA` |
+
+### Request Parameters (`nvext`)
+
+| Field | Default | Description |
+|---|---|---|
+| `fps` | `24` | Frames per second |
+| `num_frames` | `121` | Total frames; overrides `fps * seconds` when set |
+| `num_inference_steps` | `5` | Diffusion inference steps |
+| `guidance_scale` | `1.0` | Classifier-free guidance scale |
+| `seed` | `10` | RNG seed for reproducibility |
+| `negative_prompt` | — | Text to avoid in generation |
+
+### Environment Variables
+
+| Variable | Default | Description |
+|---|---|---|
+| `FASTVIDEO_VIDEO_CODEC` | `libx264` | Video codec for MP4 encoding |
+| `FASTVIDEO_X264_PRESET` | `ultrafast` | x264 encoding speed preset |
+| `FASTVIDEO_ATTENTION_BACKEND` | `FLASH_ATTN` | Attention backend (`FLASH_ATTN` or `TORCH_SDPA`) |
+| `FASTVIDEO_STAGE_LOGGING` | `1` | Enable per-stage timing logs |
+| `FASTVIDEO_LOG_LEVEL` | — | Set to `DEBUG` for verbose logging |
+
+## Troubleshooting
+
+| Symptom | Cause | Fix |
+|---|---|---|
+| OOM during Docker build | `flash-attention` compilation uses too much RAM | Lower `MAX_JOBS` in the Dockerfile |
+| 10–20 min wait on first start | Model download + `torch.compile` warmup | Expected behavior; subsequent starts are faster if weights are cached |
+| ~35 s second request | Runtime caches still warming | Steady-state performance from third request onward |
+| Poor performance on non-B200/B300 GPUs | FP4 and flash-attention optimizations require CUDA arch 10.0 | Pass `--disable-optimizations` to `worker.py` |
+
+## Source Code
+
+The example source lives at [`examples/diffusers/`](https://github.com/ai-dynamo/dynamo/tree/main/examples/diffusers) in the Dynamo repository.
+
+## See Also
+
+- [vLLM-Omni Text-to-Video](../../backends/vllm/vllm-omni.md#text-to-video) — vLLM-Omni video generation via `/v1/videos`
+- [vLLM-Omni Text-to-Image](../../backends/vllm/vllm-omni.md#text-to-image) — vLLM-Omni image generation
+- [SGLang Video Generation](../../backends/sglang/sglang-diffusion.md#video-generation) — SGLang video generation worker
+- [SGLang Image Diffusion](../../backends/sglang/sglang-diffusion.md#image-diffusion) — SGLang image diffusion worker
+- [TRT-LLM Video Diffusion](../../backends/trtllm/trtllm-video-diffusion.md#quick-start) — TensorRT-LLM video diffusion quick start
+- [Diffusion Overview](README.md) — Full backend support matrix
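
The request and response shapes described in the guide above can be exercised from Python as well as from `curl`. The following is a minimal sketch, not part of this commit: the helper names (`resolve_num_frames`, `build_request`, `decode_video`) are hypothetical, and the frame-count rule simply restates the `nvext` table ("overrides `fps * seconds` when set"); the worker's exact rule may differ.

```python
import base64
import json
from typing import Optional


def resolve_num_frames(fps: int, seconds: int, num_frames: Optional[int]) -> int:
    """Return num_frames when set, otherwise fps * seconds (assumed rule)."""
    return num_frames if num_frames is not None else fps * seconds


def build_request(prompt: str, seconds: int = 5, fps: int = 24,
                  num_frames: Optional[int] = 121) -> dict:
    """Assemble a /v1/videos request body using the defaults from the guide."""
    return {
        "model": "FastVideo/LTX2-Distilled-Diffusers",
        "prompt": prompt,
        "size": "1920x1088",
        "seconds": seconds,
        "nvext": {
            "fps": fps,
            "num_frames": resolve_num_frames(fps, seconds, num_frames),
            "num_inference_steps": 5,
            "guidance_scale": 1.0,
            "seed": 10,
        },
    }


def decode_video(response: dict) -> bytes:
    """Extract the MP4 bytes carried in data[0].b64_json."""
    return base64.b64decode(response["data"][0]["b64_json"])


if __name__ == "__main__":
    body = build_request("A cinematic drone shot over a snowy mountain range at sunrise")
    print(json.dumps(body, indent=2))
    # Stand-in response; a real one would come from POSTing `body` to /v1/videos.
    fake = {"data": [{"b64_json": base64.b64encode(b"not-a-real-mp4").decode()}]}
    print(len(decode_video(fake)), "bytes decoded")
```

Writing the bytes returned by `decode_video` to `output.mp4` would mirror the `jq ... | base64 --decode` step in the guide.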
--- a/docs/index.yml
+++ b/docs/index.yml
@@ -89,31 +89,34 @@ navigation:
        path: components/kvbm/kvbm-guide.md
      - page: Dynamo Benchmarking
        path: benchmarks/benchmarking.md
-      - section: Multimodal Model Serving
+      - section: Multimodal
+        path: features/multimodal/README.md
        contents:
-          - section: Vision Language Models (VLMs)
-            path: features/multimodal/README.md
-            contents:
-              - page: Embedding Cache
-                path: features/multimodal/embedding-cache.md
-              - page: Encoder Disaggregation
-                path: features/multimodal/encoder-disaggregation.md
-              - page: Multimodal KV Routing
-                path: features/multimodal/multimodal-kv-routing.md
-          - section: Diffusion (Experimental)
-            path: features/multimodal/diffusion.md
-            contents:
-              - page: vLLM-Omni
-                path: backends/vllm/vllm-omni.md
-              - page: SGLang Diffusion
-                path: backends/sglang/sglang-diffusion.md
-              - page: TRT-LLM Diffusion
-                path: backends/trtllm/trtllm-video-diffusion.md
+          - page: Embedding Cache
+            path: features/multimodal/embedding-cache.md
+          - page: Encoder Disaggregation
+            path: features/multimodal/encoder-disaggregation.md
+          - page: Multimodal KV Routing
+            path: features/multimodal/multimodal-kv-routing.md
+      - section: Diffusion (Preview)
+        slug: diffusion
+        path: features/diffusion/README.md
+        contents:
+          - page: FastVideo
+            slug: fastvideo
+            path: features/diffusion/fastvideo.md
+          - page: vLLM-Omni
+            path: backends/vllm/vllm-omni.md
+          - page: SGLang Diffusion
+            path: backends/sglang/sglang-diffusion.md
+          - page: TRT-LLM Diffusion
+            path: backends/trtllm/trtllm-video-diffusion.md
      - page: Tool Calling
        path: agents/tool-calling.md
      - page: LoRA Adapters
        path: features/lora/README.md
-      - section: Agentic Workloads
+      - section: Agents
+        slug: agents
        path: features/agentic_workloads.md
        contents:
          - page: SGLang for Agentic Workloads

--- a/examples/diffusers/.dockerignore
+++ b/examples/diffusers/.dockerignore
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+# Local outputs
+outputs
+outputs_video
+
+# Python caches
+__pycache__
+*.pyc
+.git
+local/.runtime
--- a/examples/diffusers/Dockerfile
+++ b/examples/diffusers/Dockerfile
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+# Shared runtime image for Dynamo frontend and FastVideo workers.
+FROM nvidia/cuda:13.1.1-devel-ubuntu24.04
+
+RUN apt-get update \
+ && apt-get install -yq libucx0 python3-dev python3-pip python3-venv git protobuf-compiler curl ffmpeg libclang-dev \
+ && apt-get clean
+
+COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
+ENV UV_LINK_MODE=copy
+
+RUN uv venv /opt/dynamo/venv --python 3.12 \
+ && . /opt/dynamo/venv/bin/activate \
+ && uv pip install pip setuptools packaging ninja psutil uvloop \
+ && uv pip install torch torchvision --index-url https://download.pytorch.org/whl/cu130 \
+ && uv pip install flashinfer-python
+
+ENV VIRTUAL_ENV=/opt/dynamo/venv
+ENV PATH="${VIRTUAL_ENV}/bin:${PATH}"
+
+# flash-attn compilation is memory-intensive. If the build OOMs, lower MAX_JOBS.
+# The flash-attn install notes call this out for machines with <96GB RAM and many CPU cores.
+RUN git clone https://github.com/RandNMR73/flash-attention \
+ && cd flash-attention \
+ && git switch fa4-compile \
+ && TORCH_CUDA_ARCH_LIST="10.0 10.0a" MAX_JOBS=4 uv pip install . --no-build-isolation \
+ && TORCH_CUDA_ARCH_LIST="10.0 10.0a" MAX_JOBS=4 uv pip install ./flash_attn/cute \
+ && rm -rf ../flash-attention
+
+# Install Dynamo with /v1/videos support.
+RUN uv pip install 'git+https://github.com/ai-dynamo/dynamo@release/1.0.0#subdirectory=lib/bindings/python' \
+ && uv pip install 'git+https://github.com/ai-dynamo/dynamo@release/1.0.0'
+
+# Install FastVideo directly from the public upstream repository.
+# Checkout with --recurse-submodules to get the required submodules as well.
+RUN . /opt/dynamo/venv/bin/activate \
+ && uv pip install setuptools_scm scikit-build-core cmake ninja \
+ && git clone --recurse-submodules https://github.com/hao-ai-lab/FastVideo.git /tmp/FastVideo \
+ && TORCH_CUDA_ARCH_LIST="10.0 10.0a" uv pip install --no-build-isolation /tmp/FastVideo
+
+ENV FASTVIDEO_VIDEO_CODEC=libx264
+ENV FASTVIDEO_X264_PRESET=ultrafast
+
+WORKDIR /opt/app
+COPY . /opt/app/
--- a/examples/diffusers/README.md
+++ b/examples/diffusers/README.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+-->
+
+# FastVideo Video Diffusion Example
+
+Full documentation is available at:
+
+- [FastVideo - Dynamo Docs](https://docs.nvidia.com/dynamo/dev/user-guides/diffusion/fastvideo) (Recommended)
+- [FastVideo - GitHub](../../docs/features/diffusion/fastvideo.md)
--- a/examples/diffusers/deploy/README.md
+++ b/examples/diffusers/deploy/README.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+-->
+
+# FastVideo Kubernetes Deployment
+
+Full documentation is available at:
+
+- [FastVideo - Dynamo Docs](https://docs.nvidia.com/dynamo/dev/user-guides/diffusion/fastvideo#kubernetes-deployment) (Recommended)
+- [FastVideo - GitHub](../../../docs/features/diffusion/fastvideo.md#kubernetes-deployment)
--- a/examples/diffusers/deploy/agg.yaml
+++ b/examples/diffusers/deploy/agg.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: fastvideo-agg
+spec:
+  pvcs:
+    - name: huggingface-cache
+      create: false
+  services:
+    Frontend:
+      componentType: frontend
+      replicas: 1
+      envs:
+        - name: DYN_DISCOVERY_BACKEND
+          value: kubernetes
+      extraPodSpec:
+        mainContainer:
+          image: my-registry/fastvideo-runtime:my-tag
+          imagePullPolicy: IfNotPresent
+          workingDir: /opt/app
+          command:
+            - python
+            - -m
+            - dynamo.frontend
+          args:
+            - --http-port
+            - "8000"
+
+    FastVideoWorker:
+      componentType: worker
+      replicas: 1
+      sharedMemory:
+        size: 8Gi
+      resources:
+        limits:
+          gpu: "1"
+        requests:
+          gpu: "1"
+      envs:
+        - name: DYN_DISCOVERY_BACKEND
+          value: kubernetes
+        - name: LD_LIBRARY_PATH
+          value: ""
+        - name: TORCHINDUCTOR_CACHE_DIR
+          value: /cache/torchinductor
+        - name: TRITON_CACHE_DIR
+          value: /cache/triton
+        - name: HF_HOME
+          value: /root/.cache/huggingface
+      volumeMounts:
+        - name: huggingface-cache
+          mountPoint: /root/.cache/huggingface
+      extraPodSpec:
+        mainContainer:
+          image: my-registry/fastvideo-runtime:my-tag
+          imagePullPolicy: IfNotPresent
+          workingDir: /opt/app
+          command:
+            - python
+            - worker.py
--- a/examples/diffusers/deploy/agg_user_workload.yaml
+++ b/examples/diffusers/deploy/agg_user_workload.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: fastvideo-agg
+spec:
+  pvcs:
+    - name: huggingface-cache
+      create: false
+  services:
+    Frontend:
+      componentType: frontend
+      replicas: 1
+      envs:
+        - name: DYN_DISCOVERY_BACKEND
+          value: kubernetes
+      extraPodSpec:
+        imagePullSecrets:
+          - name: my-image-pull-secret
+        tolerations:
+          - key: dedicated
+            operator: Equal
+            value: user-workload
+            effect: NoSchedule
+          - key: dedicated
+            operator: Equal
+            value: user-workload
+            effect: NoExecute
+        mainContainer:
+          image: my-registry/fastvideo-runtime:my-tag
+          imagePullPolicy: IfNotPresent
+          workingDir: /opt/app
+          command:
+            - python
+            - -m
+            - dynamo.frontend
+          args:
+            - --http-port
+            - "8000"
+
+    FastVideoWorker:
+      componentType: worker
+      replicas: 1
+      sharedMemory:
+        size: 8Gi
+      resources:
+        limits:
+          gpu: "1"
+        requests:
+          gpu: "1"
+      envs:
+        - name: DYN_DISCOVERY_BACKEND
+          value: kubernetes
+        - name: LD_LIBRARY_PATH
+          value: ""
+        - name: TORCHINDUCTOR_CACHE_DIR
+          value: /cache/torchinductor
+        - name: TRITON_CACHE_DIR
+          value: /cache/triton
+        - name: HF_HOME
+          value: /root/.cache/huggingface
+      volumeMounts:
+        - name: huggingface-cache
+          mountPoint: /root/.cache/huggingface
+      extraPodSpec:
+        imagePullSecrets:
+          - name: my-image-pull-secret
+        tolerations:
+          - key: dedicated
+            operator: Equal
+            value: user-workload
+            effect: NoSchedule
+          - key: dedicated
+            operator: Equal
+            value: user-workload
+            effect: NoExecute
+        mainContainer:
+          image: my-registry/fastvideo-runtime:my-tag
+          imagePullPolicy: IfNotPresent
+          workingDir: /opt/app
+          command:
+            - python
+            - worker.py
--- a/examples/diffusers/deploy/dynamo-platform-values-user-workload.yaml
+++ b/examples/diffusers/deploy/dynamo-platform-values-user-workload.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+dynamo-operator:
+  namespaceRestriction:
+    enabled: true
+  controllerManager:
+    tolerations:
+      - { key: dedicated, operator: Equal, value: user-workload, effect: NoSchedule }
+      - { key: dedicated, operator: Equal, value: user-workload, effect: NoExecute }
+
+etcd:
+  persistence:
+    storageClass: ebs
+  tolerations:
+    - { key: dedicated, operator: Equal, value: user-workload, effect: NoSchedule }
+    - { key: dedicated, operator: Equal, value: user-workload, effect: NoExecute }
+
+nats:
+  config:
+    jetstream:
+      fileStore:
+        pvc:
+          storageClassName: ebs
+  podTemplate:
+    merge:
+      spec:
+        tolerations:
+          - { key: dedicated, operator: Equal, value: user-workload, effect: NoSchedule }
+          - { key: dedicated, operator: Equal, value: user-workload, effect: NoExecute }
--- a/examples/diffusers/deploy/huggingface-cache-pvc.yaml
+++ b/examples/diffusers/deploy/huggingface-cache-pvc.yaml
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: huggingface-cache
+spec:
+  accessModes:
+    - ReadWriteOnce
+  storageClassName: ebs
+  resources:
+    requests:
+      storage: 200Gi
--- a/examples/diffusers/local/.gitignore
+++ b/examples/diffusers/local/.gitignore
+.runtime/
+response.json
+output.mp4
--- a/examples/diffusers/local/README.md
+++ b/examples/diffusers/local/README.md
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+-->
+
+# FastVideo Local Run
+
+Full documentation is available at:
+
+- [FastVideo - Dynamo Docs](https://docs.nvidia.com/dynamo/dev/user-guides/diffusion/fastvideo#local-deployment) (Recommended)
+- [FastVideo - GitHub](../../../docs/features/diffusion/fastvideo.md#local-deployment)
--- a/examples/diffusers/local/docker-compose.yml
+++ b/examples/diffusers/local/docker-compose.yml
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+# Worker count is controlled by COMPOSE_PROFILES (1-8).
+# COMPOSE_PROFILES=4 starts backend-0..backend-3 on GPUs 0..3.
+
+x-backend-base: &backend-base
+  build:
+    context: ..
+    dockerfile: Dockerfile
+  image: dynamo-fastvideo-diffusers:latest
+  restart: on-failure
+  command: python worker.py
+  environment:
+    - DYN_DISCOVERY_BACKEND=file
+    - DYN_FILE_KV=/tmp/dynamo-discovery
+    - LD_LIBRARY_PATH=
+    - TORCHINDUCTOR_CACHE_DIR=/cache/torchinductor
+    - TRITON_CACHE_DIR=/cache/triton
+  volumes:
+    - dynamo-discovery:/tmp/dynamo-discovery
+    - huggingface-cache:/root/.cache/huggingface
+  ipc: host
+  shm_size: 8g
+  ulimits:
+    memlock: -1
+    stack: 67108864
+  depends_on:
+    - frontend
+
+services:
+  frontend:
+    build:
+      context: ..
+      dockerfile: Dockerfile
+    image: dynamo-fastvideo-diffusers:latest
+    restart: on-failure
+    command: >
+      python -m dynamo.frontend
+        --http-port 8000
+        --discovery-backend file
+    environment:
+      - DYN_FILE_KV=/tmp/dynamo-discovery
+    volumes:
+      - dynamo-discovery:/tmp/dynamo-discovery
+    ports:
+      - "8000:8000"
+
+  backend-0:
+    <<: *backend-base
+    profiles: ["1", "2", "3", "4", "5", "6", "7", "8"]
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - { driver: nvidia, device_ids: ["0"], capabilities: [gpu] }
+
+  backend-1:
+    <<: *backend-base
+    profiles: ["2", "3", "4", "5", "6", "7", "8"]
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - { driver: nvidia, device_ids: ["1"], capabilities: [gpu] }
+
+  backend-2:
+    <<: *backend-base
+    profiles: ["3", "4", "5", "6", "7", "8"]
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - { driver: nvidia, device_ids: ["2"], capabilities: [gpu] }
+
+  backend-3:
+    <<: *backend-base
+    profiles: ["4", "5", "6", "7", "8"]
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - { driver: nvidia, device_ids: ["3"], capabilities: [gpu] }
+
+  backend-4:
+    <<: *backend-base
+    profiles: ["5", "6", "7", "8"]
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - { driver: nvidia, device_ids: ["4"], capabilities: [gpu] }
+
+  backend-5:
+    <<: *backend-base
+    profiles: ["6", "7", "8"]
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - { driver: nvidia, device_ids: ["5"], capabilities: [gpu] }
+
+  backend-6:
+    <<: *backend-base
+    profiles: ["7", "8"]
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - { driver: nvidia, device_ids: ["6"], capabilities: [gpu] }
+
+  backend-7:
+    <<: *backend-base
+    profiles: ["8"]
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - { driver: nvidia, device_ids: ["7"], capabilities: [gpu] }
+
+volumes:
+  dynamo-discovery:
+  huggingface-cache:
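
The profile lists in the Compose file above follow a simple rule: `backend-<i>` is tagged with every worker count greater than `i`, so `COMPOSE_PROFILES=N` activates `backend-0` through `backend-(N-1)`. A small sketch of that rule, with hypothetical helper names that do not appear in the commit, may make the pattern easier to check:

```python
from typing import List

MAX_WORKERS = 8  # the Compose file defines backend-0 .. backend-7


def profiles_for_backend(index: int) -> List[str]:
    """Profiles listed on backend-<index>: every worker count greater than index."""
    return [str(n) for n in range(index + 1, MAX_WORKERS + 1)]


def backends_for_profile(profile: str) -> List[str]:
    """Backend services activated when COMPOSE_PROFILES=<profile>."""
    return [
        f"backend-{i}"
        for i in range(MAX_WORKERS)
        if profile in profiles_for_backend(i)
    ]


if __name__ == "__main__":
    print(backends_for_profile("4"))
```

The `frontend` service carries no `profiles` key, so it starts regardless of the selected worker count.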
--- a/examples/diffusers/local/run_local.sh
+++ b/examples/diffusers/local/run_local.sh
+#!/usr/bin/env bash
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+EXAMPLE_DIR="$(cd "${SCRIPT_DIR}/.." && pwd)"
+
+: "${PYTHON_BIN:=python3}"
+: "${MODEL:=FastVideo/LTX2-Distilled-Diffusers}"
+: "${NUM_GPUS:=1}"
+: "${HTTP_PORT:=8000}"
+: "${DISCOVERY_DIR:=${SCRIPT_DIR}/.runtime/discovery}"
+: "${LOG_DIR:=${SCRIPT_DIR}/.runtime/logs}"
+: "${WORKER_EXTRA_ARGS:=}"
+: "${FRONTEND_EXTRA_ARGS:=}"
+
+if ! command -v "${PYTHON_BIN}" >/dev/null 2>&1; then
+  echo "error: ${PYTHON_BIN} not found" >&2
+  exit 1
+fi
+
+mkdir -p "${DISCOVERY_DIR}" "${LOG_DIR}"
+
+export DYN_DISCOVERY_BACKEND=file
+export DYN_FILE_KV="${DYN_FILE_KV:-${DISCOVERY_DIR}}"
+
+cd "${EXAMPLE_DIR}"
+
+worker_cmd=("${PYTHON_BIN}" worker.py --model "${MODEL}" --num-gpus "${NUM_GPUS}")
+if [[ -n "${WORKER_EXTRA_ARGS}" ]]; then
+  # shellcheck disable=SC2206
+  worker_extra=( ${WORKER_EXTRA_ARGS} )
+  worker_cmd+=("${worker_extra[@]}")
+fi
+
+frontend_cmd=("${PYTHON_BIN}" -m dynamo.frontend --http-port "${HTTP_PORT}" --discovery-backend file)
+if [[ -n "${FRONTEND_EXTRA_ARGS}" ]]; then
+  # shellcheck disable=SC2206
+  frontend_extra=( ${FRONTEND_EXTRA_ARGS} )
+  frontend_cmd+=("${frontend_extra[@]}")
+fi
+
+cleanup() {
+  echo
+  echo "Stopping local processes..."
+  kill "${frontend_pid:-}" "${worker_pid:-}" 2>/dev/null || true
+  wait "${frontend_pid:-}" "${worker_pid:-}" 2>/dev/null || true
+}
+trap cleanup EXIT INT TERM
+
+echo "Starting worker: ${worker_cmd[*]}"
+"${worker_cmd[@]}" >"${LOG_DIR}/worker.log" 2>&1 &
+worker_pid=$!
+
+echo "Starting frontend: ${frontend_cmd[*]}"
+"${frontend_cmd[@]}" >"${LOG_DIR}/frontend.log" 2>&1 &
+frontend_pid=$!
+
+echo ""
+echo "Worker log:   ${LOG_DIR}/worker.log"
+echo "Frontend log: ${LOG_DIR}/frontend.log"
+echo ""
+echo "API endpoint: http://localhost:${HTTP_PORT}/v1/videos"
+echo ""
+echo "Example request:"
+echo "curl -s -X POST http://localhost:${HTTP_PORT}/v1/videos -H 'Content-Type: application/json' -d '{\"model\":\"${MODEL}\",\"prompt\":\"A cinematic drone shot over snowy mountains at sunrise\",\"size\":\"1920x1088\",\"seconds\":5,\"nvext\":{\"fps\":24,\"num_frames\":121,\"num_inference_steps\":5,\"guidance_scale\":1.0,\"seed\":10}}' > response.json"
+echo ""
+
+wait -n "${worker_pid}" "${frontend_pid}"
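
The way `run_local.sh` assembles the worker command is worth spelling out, because `WORKER_EXTRA_ARGS` is expanded unquoted: the value is split on whitespace, and quotes inside it are not honored. The sketch below mirrors that behavior in Python for illustration only. `build_worker_cmd` is a hypothetical helper, and it does not reproduce the pathname expansion (globbing) that bash also applies to unquoted words.

```python
from typing import List, Mapping


def build_worker_cmd(env: Mapping[str, str]) -> List[str]:
    """Rebuild the worker argv the way run_local.sh does (illustrative)."""
    # `: "${VAR:=default}"` applies the default when VAR is unset OR empty,
    # which `env.get(...) or default` reproduces.
    cmd = [
        env.get("PYTHON_BIN") or "python3",
        "worker.py",
        "--model", env.get("MODEL") or "FastVideo/LTX2-Distilled-Diffusers",
        "--num-gpus", env.get("NUM_GPUS") or "1",
    ]
    # str.split() with no argument splits on runs of whitespace and keeps
    # quote characters literally, like the script's unquoted expansion.
    cmd += (env.get("WORKER_EXTRA_ARGS") or "").split()
    return cmd


if __name__ == "__main__":
    print(build_worker_cmd({"WORKER_EXTRA_ARGS": "--disable-optimizations"}))
```

One consequence is that an extra argument containing spaces cannot be passed through `WORKER_EXTRA_ARGS` as a single word.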
--- a/examples/diffusers/worker.py
+++ b/examples/diffusers/worker.py
+#!/usr/bin/env python3
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+"""
+FastVideo Worker for Dynamo (non-streaming)
+
+Registers a VideoGenerator as a Dynamo backend endpoint compatible with the
+/v1/videos frontend endpoint.  The endpoint generates a full video
+clip from the request parameters and returns it as a single response containing
+the complete MP4 file base64-encoded in data[0].b64_json.
+
+Generation parameters (size, fps, num_frames, etc.) are taken from the
+request body's nvext field, so the same worker instance can serve requests
+with different resolutions and quality settings without restarting.
+
+One request at a time (asyncio.Lock — VideoGenerator is not re-entrant).
+
+Usage:
+  python worker.py [--model MODEL] [--num-gpus N] [--disable-optimizations]
+
+Options:
+  --model          HuggingFace model path
+                   (default: FastVideo/LTX2-Distilled-Diffusers)
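+  --num-gpus       Number of GPUs (default: 1)
+  --disable-optimizations
+                   Disable FP4 quantization and torch.compile, and use
+                   TORCH_SDPA attention instead of FLASH_ATTN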
+
+Request format (sent to /v1/videos):
+  prompt:   text description of the desired video
+  model:    HuggingFace model path (must match what the worker registered)
+  size:     "WxH" string, e.g. "1920x1088" (default: "1920x1088")
+  seconds:  clip duration when nvext.num_frames is not set (default: 5)
+  nvext:
+    fps:                frames per second (default: 24)
+    num_frames:         total frames; overrides fps * seconds when set (default: 121)
+    num_inference_steps: diffusion steps (default: 5)
+    guidance_scale:     CFG scale (default: 1.0)
+    seed:               RNG seed (default: 10)
+    negative_prompt:    text to avoid (optional)
+"""
+
+import argparse
+import asyncio
+import base64
+import logging
+import os
+import tempfile
+import time
+import uuid
+
+import uvloop
+from fastvideo import VideoGenerator
+from fastvideo.configs.pipelines.base import PipelineConfig
+from fastvideo.layers.quantization.fp4_config import FP4Config
+from pydantic import BaseModel, Field
+
+from dynamo.llm import ModelInput, ModelType, register_llm  # type: ignore[attr-defined]
+from dynamo.runtime import DistributedRuntime, dynamo_endpoint
+
+logger = logging.getLogger(__name__)
+
+DEFAULT_MODEL = "FastVideo/LTX2-Distilled-Diffusers"
+
+# ── Request / Response models ─────────────────────────────────────────────────
+
+
+def _get_worker_namespace() -> str:
+    """
+    Resolve Dynamo namespace for endpoint registration.
+
+    Kubernetes operator injects DYN_NAMESPACE (and optionally a rollout suffix).
+    Compose/local runs keep using the historical "dynamo" default.
+    """
+    namespace = os.environ.get("DYN_NAMESPACE", "dynamo")
+    suffix = os.environ.get("DYN_NAMESPACE_WORKER_SUFFIX")
+    if suffix:
+        namespace = f"{namespace}-{suffix}"
+    return namespace
+
+
+class NvExtVideoCreateRequest(BaseModel):
+    fps: int = Field(default=24, description="Frames per second")
+    num_frames: int | None = Field(
+        default=121, description="Total frames; overrides fps * seconds"
+    )
+    num_inference_steps: int = Field(default=5, description="Diffusion inference steps")
+    guidance_scale: float = Field(
+        default=1.0, description="Classifier-free guidance scale"
+    )
+    seed: int | None = Field(default=10, description="RNG seed for reproducibility")
+    negative_prompt: str | None = Field(
+        default=None, description="Text to avoid in generation"
+    )
+
+
+class VideoCreateRequest(BaseModel):
+    prompt: str = Field(description="Text description of the desired video")
+    model: str = Field(description="HuggingFace model path")
+    size: str = Field(default="1920x1088", description="Frame dimensions as 'WxH'")
+    seconds: int = Field(
+        default=5, description="Clip duration; used when nvext.num_frames is unset"
+    )
+    user: str | None = Field(default=None)
+    nvext: NvExtVideoCreateRequest = Field(default_factory=NvExtVideoCreateRequest)
+
+
+class VideoData(BaseModel):
+    b64_json: str | None = Field(default=None, description="Base64-encoded MP4 video")
+    mime_type: str = Field(default="video/mp4")
+
+
+class VideoCreateResponse(BaseModel):
+    id: str
+    object: str = "video"
+    created: int
+    model: str
+    status: str = "complete"
+    data: list[VideoData]
+
+
+# ── Backend ───────────────────────────────────────────────────────────────────
+
+
+def _coerce_optional_float(value: object) -> float | None:
+    """Best-effort conversion for optional numeric metrics from backend results."""
+    if value is None:
+        return None
+    try:
+        return float(value)
+    except (TypeError, ValueError):
+        return None
+
+
+class FastVideoBackend:
+    def __init__(self, args: argparse.Namespace) -> None:
+        self.model_name: str = args.model
+        self.num_gpus: int = args.num_gpus
+        self.disable_optimizations: bool = args.disable_optimizations
+
+        # One request at a time — VideoGenerator is not re-entrant
+        self._generate_lock = asyncio.Lock()
+        self.generator: VideoGenerator | None = None
+
+        attn_backend = "TORCH_SDPA" if self.disable_optimizations else "FLASH_ATTN"
+        os.environ["FASTVIDEO_ATTENTION_BACKEND"] = attn_backend
+        os.environ["FASTVIDEO_STAGE_LOGGING"] = "1"
+        os.environ["FASTVIDEO_ENABLE_RMSNORM_FP4_PREQUANT"] = "0"
+
+    async def initialize_model(self) -> None:
+        logger.info("Loading VideoGenerator model=%s", self.model_name)
+        loop = asyncio.get_running_loop()
+
+        def _load():
+            pipeline_config = PipelineConfig.from_pretrained(self.model_name)
+            if not self.disable_optimizations:
+                logger.info(
+                    "Using FP4 quantization for VideoGenerator model=%s",
+                    self.model_name,
+                )
+                pipeline_config.dit_config.quant_config = FP4Config()
+
+            return VideoGenerator.from_pretrained(
+                self.model_name,
+                num_gpus=self.num_gpus,
+                ltx2_refine_enabled=True,
+                ltx2_refine_lora_path="",  # disable refine lora for distilled model
+                ltx2_refine_num_inference_steps=2,
+                ltx2_refine_guidance_scale=1.0,
+                ltx2_refine_add_noise=True,
+                pipeline_config=pipeline_config,
+                enable_torch_compile=not self.disable_optimizations,
+                enable_torch_compile_text_encoder=not self.disable_optimizations,
+                torch_compile_kwargs={
+                    "backend": "inductor",
+                    "fullgraph": True,
+                    "mode": "max-autotune-no-cudagraphs",
+                },
+                dit_cpu_offload=False,
+                vae_cpu_offload=False,
+                text_encoder_cpu_offload=False,
+                ltx2_vae_tiling=False,
+            )
+
+        self.generator = await loop.run_in_executor(None, _load)
+        logger.info("VideoGenerator ready")
+
+    # ── Helpers ───────────────────────────────────────────────────────────────
+
+    def _generate_mp4(
+        self,
+        prompt: str,
+        video_id: str,
+        width: int,
+        height: int,
+        num_frames: int,
+        fps: int,
+        num_inference_steps: int,
+        guidance_scale: float,
+        seed: int | None,
+        negative_prompt: str | None,
+    ) -> bytes:
+        """Generate a video clip and return it as MP4 bytes."""
+        assert self.generator is not None
+
+        with tempfile.TemporaryDirectory() as tmpdir:
+            output_path = os.path.join(tmpdir, "output.mp4")
+            kwargs: dict = dict(
+                save_video=True,
+                return_frames=False,
+                output_path=output_path,
+                height=height,
+                width=width,
+                num_frames=num_frames,
+                fps=fps,
+                num_inference_steps=num_inference_steps,
+                guidance_scale=guidance_scale,
+            )
+            if seed is not None:
+                kwargs["seed"] = seed
+            if negative_prompt is not None:
+                kwargs["negative_prompt"] = negative_prompt
+
+            result = self.generator.generate_video(prompt=prompt, **kwargs)
+            result_dict = result if isinstance(result, dict) else {}
+            generation_time = _coerce_optional_float(result_dict.get("generation_time"))
+            e2e_latency = _coerce_optional_float(result_dict.get("e2e_latency"))
+            logger.info("[%s] MP4 written to %s", video_id, output_path)
+            if generation_time is not None:
+                logger.info(
+                    "[%s] Generation time: %.2f seconds", video_id, generation_time
+                )
+            else:
+                logger.info("[%s] Generation time: unavailable", video_id)
+
+            if e2e_latency is not None:
+                logger.info("[%s] E2E latency: %.2f seconds", video_id, e2e_latency)
+            else:
+                logger.info("[%s] E2E latency: unavailable", video_id)
+
+            time_start = time.perf_counter()
+            with open(output_path, "rb") as f:
+                data = f.read()
+            time_end = time.perf_counter()
+            logger.info(
+                "[%s] File read time: %.2f seconds", video_id, time_end - time_start
+            )
+
+            return data
+
+    # ── Dynamo endpoint ───────────────────────────────────────────────────────
+
+    @dynamo_endpoint(VideoCreateRequest, VideoCreateResponse)
+    async def create_video(self, request: VideoCreateRequest):
+        """
+        Non-streaming endpoint.
+
+        Generates one video clip using the request's size, seconds, and nvext
+        parameters, then yields a single VideoCreateResponse with
+        data[0].b64_json containing the complete MP4 file encoded in base64.
+        """
+        if self.generator is None:
+            raise RuntimeError("Generator is not initialized")
+
+        nvext = request.nvext
+        try:
+            width_str, height_str = request.size.lower().split("x", 1)
+            width, height = int(width_str), int(height_str)
+        except (ValueError, TypeError) as exc:
+            raise ValueError(
+                f"Invalid size format '{request.size}', expected 'WxH'"
+            ) from exc
+
+        if width <= 0 or height <= 0:
+            raise ValueError(
+                f"Invalid size '{request.size}', width and height must be positive"
+            )
+
+        # Validate fps first so a bad fps is not misreported as a num_frames error.
+        fps = nvext.fps
+        if fps <= 0:
+            raise ValueError("fps must be positive")
+
+        num_frames = (
+            nvext.num_frames
+            if nvext.num_frames is not None
+            else fps * request.seconds
+        )
+        if num_frames <= 0:
+            raise ValueError("num_frames must be positive")
+
+        video_id = f"video_{uuid.uuid4().hex}"
+        created_ts = int(time.time())
+
+        logger.info(
+            "[%s] create_video: prompt='%s...' size=%s frames=%d steps=%d",
+            video_id,
+            request.prompt[:60],
+            request.size,
+            num_frames,
+            nvext.num_inference_steps,
+        )
+        logger.info(
+            "[%s] Waiting for generate lock (locked=%s)",
+            video_id,
+            self._generate_lock.locked(),
+        )
+        async with self._generate_lock:
+            t = time.perf_counter()
+            logger.info(
+                "[%s] Generating video (%dx%d, %d frames, %d steps) ...",
+                video_id,
+                width,
+                height,
+                num_frames,
+                nvext.num_inference_steps,
+            )
+            try:
+                mp4_bytes = await asyncio.to_thread(
+                    self._generate_mp4,
+                    prompt=request.prompt,
+                    video_id=video_id,
+                    width=width,
+                    height=height,
+                    num_frames=num_frames,
+                    fps=fps,
+                    num_inference_steps=nvext.num_inference_steps,
+                    guidance_scale=nvext.guidance_scale,
+                    seed=nvext.seed,
+                    negative_prompt=nvext.negative_prompt,
+                )
+            except Exception as exc:
+                logger.exception("[%s] Generation failed", video_id)
+                raise RuntimeError(
+                    f"Video generation failed for request {video_id}"
+                ) from exc
+
+            elapsed = time.perf_counter() - t
+            logger.info(
+                "[%s] Generation done in %.1fs — encoding %.2f MB MP4",
+                video_id,
+                elapsed,
+                len(mp4_bytes) / 1_048_576,
+            )
+
+            yield VideoCreateResponse(
+                id=video_id,
+                created=created_ts,
+                model=request.model,
+                data=[VideoData(b64_json=base64.b64encode(mp4_bytes).decode())],
+            ).model_dump()
+        logger.info("[%s] Generation request finished", video_id)
+
+
+# ── Dynamo wiring ─────────────────────────────────────────────────────────────
+
+
+async def _register_model(endpoint, model_name: str) -> None:
+    try:
+        await register_llm(
+            ModelInput.Text,  # type: ignore[attr-defined]
+            ModelType.Videos,
+            endpoint,
+            model_name,
+            model_name,
+        )
+        logger.info("Successfully registered model: %s", model_name)
+    except Exception as e:
+        logger.error("Failed to register model: %s", e, exc_info=True)
+        raise RuntimeError("Model registration failed") from e
+
+
+async def backend_worker(runtime: DistributedRuntime, args: argparse.Namespace) -> None:
+    namespace_name = _get_worker_namespace()
+    component_name = "backend"
+    endpoint_name = "generate"
+
+    endpoint = runtime.endpoint(f"{namespace_name}.{component_name}.{endpoint_name}")
+    logger.info(
+        "Serving endpoint %s/%s/%s", namespace_name, component_name, endpoint_name
+    )
+
+    backend = FastVideoBackend(args)
+    await backend.initialize_model()
+
+    await asyncio.gather(
+        endpoint.serve_endpoint(backend.create_video),  # type: ignore[arg-type]
+        _register_model(endpoint, backend.model_name),
+    )
+
+
+def _parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(
+        description="FastVideo Worker for Dynamo (non-streaming)"
+    )
+    parser.add_argument(
+        "--model",
+        default=DEFAULT_MODEL,
+        help=f"HuggingFace model path (default: {DEFAULT_MODEL})",
+    )
+    parser.add_argument(
+        "--num-gpus",
+        type=int,
+        default=1,
+        dest="num_gpus",
+        help="Number of GPUs (default: 1)",
+    )
+    parser.add_argument(
+        "--disable-optimizations",
+        action="store_true",
+        dest="disable_optimizations",
+        help="Disable FP4 quantization and torch.compile, and use TORCH_SDPA attention",
+    )
+    return parser.parse_args()
+
+
+async def main(args: argparse.Namespace) -> None:
+    loop = asyncio.get_running_loop()
+    # Use Kubernetes discovery in-cluster and file discovery for local compose by default.
+    discovery_backend = os.environ.get("DYN_DISCOVERY_BACKEND")
+    if not discovery_backend:
+        discovery_backend = (
+            "kubernetes" if os.environ.get("KUBERNETES_SERVICE_HOST") else "file"
+        )
+    logger.info("Using discovery backend: %s", discovery_backend)
+    logger.info("Resolved worker namespace: %s", _get_worker_namespace())
+    runtime = DistributedRuntime(loop, discovery_backend, "tcp", False)
+    await backend_worker(runtime, args)
+
+
+if __name__ == "__main__":
+    _args = _parse_args()
+    logging.basicConfig(
+        level=(
+            logging.DEBUG
+            if os.environ.get("FASTVIDEO_LOG_LEVEL") == "DEBUG"
+            else logging.INFO
+        ),
+        format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
+        force=True,
+    )
+    uvloop.install()
+    asyncio.run(main(_args))
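
Reviewer note: since the worker returns the whole clip base64-encoded in `data[0].b64_json`, a client has to decode it before playback. The sketch below is a minimal illustration of that step. It assumes the HTTP response body has the same shape as the worker's `VideoCreateResponse`; `extract_mp4_bytes` and `save_mp4` are hypothetical helper names and not part of this change.

```python
import base64
from pathlib import Path


def extract_mp4_bytes(response: dict) -> bytes:
    """Return the decoded MP4 bytes from a /v1/videos style response dict."""
    data = response.get("data") or []
    if not data or not data[0].get("b64_json"):
        raise ValueError("response has no data[0].b64_json payload")
    return base64.b64decode(data[0]["b64_json"])


def save_mp4(response: dict, out_path: str) -> Path:
    """Decode the response payload and write it to out_path."""
    out = Path(out_path)
    out.write_bytes(extract_mp4_bytes(response))
    return out


# Stand-in response for illustration; a real one would come from response.json.
fake_response = {
    "id": "video_example",
    "object": "video",
    "status": "complete",
    "data": [{"b64_json": base64.b64encode(b"not a real mp4").decode()}],
}
decoded = extract_mp4_bytes(fake_response)
```

With a real run, loading `response.json` via `json.load` and passing the result to `save_mp4(response, "output.mp4")` should produce a playable file, provided the frontend forwards the worker's payload unchanged.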
--- a/fern/docs.yml
+++ b/fern/docs.yml
@@ -42,6 +42,20 @@ redirects:
    destination: "/dynamo/resources/release-artifacts"
  - source: "/dynamo/getting-started/examples"
    destination: "/dynamo/resources/examples"
+  - source: "/dynamo/dev/user-guides/multimodal-model-serving/diffusion-experimental/:slug*"
+    destination: "/dynamo/dev/user-guides/diffusion/:slug*"
+  - source: "/dynamo/dev/user-guides/multimodal-model-serving/diffusion-experimental"
+    destination: "/dynamo/dev/user-guides/diffusion"
+  - source: "/dynamo/dev/user-guides/multimodal-model-serving/vision-language-models-vlms/:slug*"
+    destination: "/dynamo/dev/user-guides/multimodal/:slug*"
+  - source: "/dynamo/dev/user-guides/multimodal-model-serving/vision-language-models-vlms"
+    destination: "/dynamo/dev/user-guides/multimodal"
+  - source: "/dynamo/dev/user-guides/diffusion/diffusion-guide"
+    destination: "/dynamo/dev/user-guides/diffusion/fastvideo"
+  - source: "/dynamo/dev/user-guides/agentic-workloads/:slug*"
+    destination: "/dynamo/dev/user-guides/agents/:slug*"
+  - source: "/dynamo/dev/user-guides/agentic-workloads"
+    destination: "/dynamo/dev/user-guides/agents"

 # GitHub repository link in navbar
 navbar-links: