"...git@developer.sourcefind.cn:2222/OpenDAS/vllm_cscc.git" did not exist on "0f2f24c8b205b5bf2dadacf1f95f1ad9f7de73e0"
Commit 2adf8a2d authored by dagil-nvidia, committed by GitHub

docs: add FastVideo example and guide with light sidebar reorg (#7283)


Signed-off-by: Dan Gil <dagil@nvidia.com>
parent 52b460e4
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Agentic Workloads
title: Agents
subtitle: Workload-aware inference with agentic hints for routing, scheduling, and KV cache management
---
......
@@ -30,3 +30,4 @@ For deployment guides, configuration, and examples for each backend:
- **[vLLM-Omni](../../backends/vllm/vllm-omni.md)**
- **[SGLang Diffusion](../../backends/sglang/sglang-diffusion.md)**
- **[TRT-LLM Diffusion](../../backends/trtllm/trtllm-video-diffusion.md)**
- **[FastVideo (custom worker)](fastvideo.md)**
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
sidebar-title: FastVideo
---
# FastVideo
This guide covers deploying [FastVideo](https://github.com/hao-ai-lab/FastVideo) text-to-video generation on Dynamo using a custom worker (`worker.py`) exposed through the `/v1/videos` endpoint.
> [!NOTE]
> Dynamo also supports diffusion through built-in backends: [SGLang Diffusion](../../backends/sglang/sglang-diffusion.md) (LLM diffusion, image, video), [vLLM-Omni](../../backends/vllm/vllm-omni.md) (text-to-image, text-to-video), and [TRT-LLM Video Diffusion](../../backends/trtllm/trtllm-video-diffusion.md). See the [Diffusion Overview](README.md) for the full support matrix.
## Overview
- **Default model:** `FastVideo/LTX2-Distilled-Diffusers` — a distilled variant of the LTX-2 Diffusion Transformer (Lightricks), reducing inference from 50+ steps to just 5.
- **Two-stage pipeline:** Stage 1 generates video at target resolution; Stage 2 refines with a distilled LoRA for improved fidelity and texture.
- **Optimized inference:** FP4 quantization and `torch.compile` are enabled by default for maximum throughput.
- **Response format:** Returns one complete MP4 payload per request as `data[0].b64_json` (non-streaming).
- **Concurrency:** One request at a time per worker (VideoGenerator is not re-entrant). Scale throughput by running multiple workers.
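
The one-request-at-a-time behavior can be pictured with a small `asyncio` sketch. It is illustrative only: `FakeGenerator`, `SerializedBackend`, and `run_demo` are hypothetical stand-ins, not FastVideo or Dynamo classes. The only idea borrowed from the worker is guarding a blocking, non-re-entrant call with an `asyncio.Lock` and running it off the event loop.

```python
import asyncio
import threading
import time


class FakeGenerator:
    """Hypothetical stand-in for a blocking, non-re-entrant generator."""

    def __init__(self) -> None:
        self._guard = threading.Lock()
        self.active = 0
        self.max_active = 0

    def generate(self, prompt: str) -> str:
        with self._guard:
            self.active += 1
            self.max_active = max(self.max_active, self.active)
        time.sleep(0.01)  # pretend to do blocking GPU work
        with self._guard:
            self.active -= 1
        return f"video for {prompt!r}"


class SerializedBackend:
    """Allows only one in-flight generate() call per backend instance."""

    def __init__(self, generator: FakeGenerator) -> None:
        self.generator = generator
        self._lock = asyncio.Lock()

    async def create(self, prompt: str) -> str:
        async with self._lock:
            loop = asyncio.get_running_loop()
            # Run the blocking call in a thread so the event loop stays responsive.
            return await loop.run_in_executor(None, self.generator.generate, prompt)


async def run_demo(n: int = 4) -> int:
    """Submit n concurrent requests and report the peak number running at once."""
    backend = SerializedBackend(FakeGenerator())
    await asyncio.gather(*(backend.create(f"prompt {i}") for i in range(n)))
    return backend.generator.max_active
```

Because every request queues on the same lock, adding workers (one lock each) is what raises throughput in this model.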
> [!IMPORTANT]
> This example is optimized for **NVIDIA B200/B300** GPUs (CUDA arch 10.0) with FP4 quantization and flash-attention. To run it on other GPUs (H100, A100, etc.), pass `--disable-optimizations` to `worker.py`; this disables FP4 quantization and `torch.compile` and switches the attention backend from `FLASH_ATTN` to `TORCH_SDPA`. Expect lower performance but broader compatibility.
## Docker Image Build
The local Docker workflow builds a runtime image from the [`Dockerfile`](https://github.com/ai-dynamo/dynamo/tree/main/examples/diffusers/Dockerfile):
- Base image: `nvidia/cuda:13.1.1-devel-ubuntu24.04`
- Installs [FastVideo](https://github.com/hao-ai-lab/FastVideo) from GitHub
- Installs Dynamo from the `release/1.0.0` branch (for `/v1/videos` support)
- Compiles a [flash-attention](https://github.com/RandNMR73/flash-attention) fork from source
> [!WARNING]
> The first Docker image build can take **20–40+ minutes** because FastVideo and CUDA-dependent components are compiled during the build. Subsequent builds are much faster if Docker layer cache is preserved. Compiling `flash-attention` can use significant RAM — low-memory builders may hit out-of-memory failures. If that happens, lower `MAX_JOBS` in the Dockerfile to reduce parallel compile memory usage. The [flash-attn install notes](https://pypi.org/project/flash-attn/) specifically recommend this on machines with less than 96 GB RAM and many CPU cores.
## Warmup Time
On first start, workers download model weights and run compile/warmup steps. Expect roughly **10–20 minutes** before a worker can serve its first request (hardware-dependent). After the first successful response, the second request can still take around **35 seconds** while runtime caches finish warming up; steady-state performance is typically reached from the third request onward.
> [!TIP]
> When using Kubernetes, mount a shared Hugging Face cache PVC (see [Kubernetes Deployment](#kubernetes-deployment)) so model weights are downloaded once and reused across pod restarts.
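
Given the long warmup, a client script may want to poll until the service responds instead of failing on the first attempt. The helper below is a minimal sketch under stated assumptions: `wait_until_ready` and its parameters are hypothetical names, and what counts as "ready" is left to the caller-supplied `probe` callable, since this guide does not define a readiness endpoint.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 1800.0,
    interval_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `probe` until it returns True or `timeout_s` elapses."""
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return True
        except OSError:
            # Connection errors are expected while the service is still starting.
            pass
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The default timeout of 30 minutes is an arbitrary choice that leaves headroom over the 10–20 minute estimate above; `sleep` and `clock` are injectable only to make the helper easy to test.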
## Local Deployment
### Prerequisites
**For Docker Compose:**
- Docker Engine 26.0+
- Docker Compose v2
- NVIDIA Container Toolkit
**For host-local script:**
- Python environment with Dynamo + FastVideo dependencies installed
- CUDA-compatible GPU runtime available on host
### Option 1: Docker Compose
```bash
cd <dynamo-root>/examples/diffusers/local
# Start 4 workers on GPUs 0..3
COMPOSE_PROFILES=4 docker compose up --build
```
The Compose file builds from the Dockerfile and exposes the API on `http://localhost:8000`. See the [Docker Image Build](#docker-image-build) section for build time expectations.
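
To make the profile mechanism concrete, the sketch below reproduces the mapping encoded in the Compose file in this commit, where `backend-i` lists profiles `i+1` through `8`. The function name `backends_for_profile` is hypothetical and exists only for illustration.

```python
MAX_WORKERS = 8  # the Compose file defines backend-0 through backend-7


def backends_for_profile(profile: int) -> list[str]:
    """Return the backend services started by COMPOSE_PROFILES=<profile>."""
    if not 1 <= profile <= MAX_WORKERS:
        raise ValueError(f"profile must be between 1 and {MAX_WORKERS}, got {profile}")
    # backend-i is enabled for every profile greater than i,
    # so profile N starts backend-0 .. backend-(N-1).
    return [f"backend-{i}" for i in range(profile)]
```

The `frontend` service has no `profiles` entry in that file, so it starts regardless of the profile chosen.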
### Option 2: Host-Local Script
```bash
cd <dynamo-root>/examples/diffusers/local
./run_local.sh
```
Environment variables:
| Variable | Default | Description |
|---|---|---|
| `PYTHON_BIN` | `python3` | Python interpreter |
| `MODEL` | `FastVideo/LTX2-Distilled-Diffusers` | HuggingFace model path |
| `NUM_GPUS` | `1` | Number of GPUs |
| `HTTP_PORT` | `8000` | Frontend HTTP port |
| `WORKER_EXTRA_ARGS` | — | Extra flags for `worker.py` (e.g., `--disable-optimizations`) |
| `FRONTEND_EXTRA_ARGS` | — | Extra flags for `dynamo.frontend` |
Example:
```bash
MODEL=FastVideo/LTX2-Distilled-Diffusers \
NUM_GPUS=1 \
HTTP_PORT=8000 \
WORKER_EXTRA_ARGS="--disable-optimizations" \
./run_local.sh
```
> [!NOTE]
> `--disable-optimizations` is a `worker.py` flag (not a `dynamo.frontend` flag), so pass it through `WORKER_EXTRA_ARGS`.
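
As a rough illustration of how these variables become the worker command line, the sketch below mirrors the relevant part of `run_local.sh` in Python. `build_worker_cmd` is a hypothetical helper, and `str.split()` only approximates the script's unquoted shell expansion (it ignores glob expansion).

```python
DEFAULT_MODEL = "FastVideo/LTX2-Distilled-Diffusers"


def build_worker_cmd(env: dict[str, str]) -> list[str]:
    """Assemble the worker argv the way run_local.sh does."""
    cmd = [
        env.get("PYTHON_BIN") or "python3",
        "worker.py",
        "--model", env.get("MODEL") or DEFAULT_MODEL,
        "--num-gpus", env.get("NUM_GPUS") or "1",
    ]
    # The script relies on plain word splitting, so quotes inside
    # WORKER_EXTRA_ARGS are not interpreted; str.split() mirrors that.
    cmd += (env.get("WORKER_EXTRA_ARGS") or "").split()
    return cmd
```

One consequence of the word-splitting approach is that an extra argument containing spaces cannot be passed as a single token through `WORKER_EXTRA_ARGS`.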
The script writes logs to:
- `.runtime/logs/worker.log`
- `.runtime/logs/frontend.log`
## Kubernetes Deployment
### Files
| File | Description |
|---|---|
| `agg.yaml` | Base aggregated deployment (Frontend + `FastVideoWorker`) |
| `agg_user_workload.yaml` | Same deployment with `user-workload` tolerations and `imagePullSecrets` |
| `huggingface-cache-pvc.yaml` | Shared HF cache PVC for model weights |
| `dynamo-platform-values-user-workload.yaml` | Optional Helm values for clusters with tainted `user-workload` nodes |
### Prerequisites
1. Dynamo Kubernetes Platform installed
2. GPU-enabled Kubernetes cluster
3. FastVideo runtime image pushed to your registry
4. Optional HF token secret (for gated models)
Create a Hugging Face token secret if needed:
```bash
export NAMESPACE=<your-namespace>
export HF_TOKEN=<your-hf-token>
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="${HF_TOKEN}" \
  -n "${NAMESPACE}"
```
### Deploy
```bash
cd <dynamo-root>/examples/diffusers/deploy
export NAMESPACE=<your-namespace>
kubectl apply -f huggingface-cache-pvc.yaml -n ${NAMESPACE}
kubectl apply -f agg.yaml -n ${NAMESPACE}
```
For clusters with tainted `user-workload` nodes and private registry pulls:
1. Set your pull secret name and image in `agg_user_workload.yaml`.
2. Apply:
```bash
kubectl apply -f huggingface-cache-pvc.yaml -n ${NAMESPACE}
kubectl apply -f agg_user_workload.yaml -n ${NAMESPACE}
```
### Update Image Quickly
```bash
export DEPLOYMENT_FILE=agg.yaml
export FASTVIDEO_IMAGE=<my-registry/fastvideo-runtime:my-tag>
yq '.spec.services.[].extraPodSpec.mainContainer.image = env(FASTVIDEO_IMAGE)' \
${DEPLOYMENT_FILE} > ${DEPLOYMENT_FILE}.generated
kubectl apply -f ${DEPLOYMENT_FILE}.generated -n ${NAMESPACE}
```
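
For readers who prefer to see what the `yq` expression does, here is a sketch of the same update on an already-parsed manifest. It operates on a plain dictionary so it needs no YAML library; `set_service_images` is a hypothetical name, and the dictionary layout is assumed to follow the `agg.yaml` structure from this example.

```python
import copy


def set_service_images(deployment: dict, image: str) -> dict:
    """Return a copy of the manifest with every service's main container image replaced."""
    updated = copy.deepcopy(deployment)
    for service in updated["spec"]["services"].values():
        # Create the intermediate mappings if a service does not define them yet.
        pod_spec = service.setdefault("extraPodSpec", {})
        pod_spec.setdefault("mainContainer", {})["image"] = image
    return updated
```

Working on a copy keeps the original manifest untouched, much like writing to a separate `.generated` file.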
### Verify and Access
```bash
kubectl get dgd -n ${NAMESPACE}
kubectl get pods -n ${NAMESPACE}
kubectl logs -n ${NAMESPACE} -l nvidia.com/dynamo-component=FastVideoWorker
```
```bash
kubectl port-forward -n ${NAMESPACE} svc/fastvideo-agg-frontend 8000:8000
```
## Test Request
> [!NOTE]
> If this is the first request after startup, expect it to take longer while warmup completes. See [Warmup Time](#warmup-time) for details.
Send a request and decode the response:
```bash
curl -s -X POST http://localhost:8000/v1/videos \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "FastVideo/LTX2-Distilled-Diffusers",
    "prompt": "A cinematic drone shot over a snowy mountain range at sunrise",
    "size": "1920x1088",
    "seconds": 5,
    "nvext": {
      "fps": 24,
      "num_frames": 121,
      "num_inference_steps": 5,
      "guidance_scale": 1.0,
      "seed": 10
    }
  }' > response.json
# Linux
jq -r '.data[0].b64_json' response.json | base64 --decode > output.mp4
# macOS
jq -r '.data[0].b64_json' response.json | base64 -D > output.mp4
```
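
Where `jq` and `base64` are not available, the same decoding step can be done in Python. This is a minimal sketch that assumes the response was saved as `response.json` as above; `save_video` is a hypothetical helper name.

```python
import base64
import json
from pathlib import Path


def save_video(response_path: str, output_path: str) -> int:
    """Decode data[0].b64_json from a saved response and write the MP4 bytes."""
    payload = json.loads(Path(response_path).read_text())
    video = base64.b64decode(payload["data"][0]["b64_json"])
    Path(output_path).write_bytes(video)
    return len(video)  # number of bytes written
```

Calling `save_video("response.json", "output.mp4")` produces the same file as the shell pipeline, without the Linux/macOS flag difference.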
## Worker Configuration Reference
### CLI Flags
| Flag | Default | Description |
|---|---|---|
| `--model` | `FastVideo/LTX2-Distilled-Diffusers` | HuggingFace model path |
| `--num-gpus` | `1` | Number of GPUs for distributed inference |
| `--disable-optimizations` | off | Disables FP4 quantization and `torch.compile`, and switches attention from `FLASH_ATTN` to `TORCH_SDPA` |
### Request Parameters (`nvext`)
| Field | Default | Description |
|---|---|---|
| `fps` | `24` | Frames per second |
| `num_frames` | `121` | Total frames; overrides `fps * seconds` when set |
| `num_inference_steps` | `5` | Diffusion inference steps |
| `guidance_scale` | `1.0` | Classifier-free guidance scale |
| `seed` | `10` | RNG seed for reproducibility |
| `negative_prompt` | — | Text to avoid in generation |
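
The way `size`, `seconds`, and `num_frames` interact is easier to see in code. The sketch below restates the validation logic from `worker.py` in this commit as two standalone functions; the names `parse_size` and `resolve_num_frames` are hypothetical and do not exist in the worker.

```python
from typing import Optional


def parse_size(size: str) -> tuple[int, int]:
    """Parse a 'WxH' string into (width, height), rejecting non-positive values."""
    try:
        width_str, height_str = size.lower().split("x", 1)
        width, height = int(width_str), int(height_str)
    except ValueError as exc:
        raise ValueError(f"Invalid size format {size!r}, expected 'WxH'") from exc
    if width <= 0 or height <= 0:
        raise ValueError(f"Invalid size {size!r}, width and height must be positive")
    return width, height


def resolve_num_frames(fps: int, seconds: int, num_frames: Optional[int]) -> int:
    """Use num_frames when given; otherwise derive it from fps * seconds."""
    frames = num_frames if num_frames is not None else fps * seconds
    if frames <= 0:
        raise ValueError("num_frames must be positive")
    return frames
```

Because the request model defaults `num_frames` to `121`, the `fps * seconds` fallback appears to apply only when a client sends `"num_frames": null` explicitly; omitting the field yields 121 frames.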
### Environment Variables
| Variable | Default | Description |
|---|---|---|
| `FASTVIDEO_VIDEO_CODEC` | `libx264` | Video codec for MP4 encoding |
| `FASTVIDEO_X264_PRESET` | `ultrafast` | x264 encoding speed preset |
| `FASTVIDEO_ATTENTION_BACKEND` | `FLASH_ATTN` | Attention backend (`FLASH_ATTN` or `TORCH_SDPA`) |
| `FASTVIDEO_STAGE_LOGGING` | `1` | Enable per-stage timing logs |
| `FASTVIDEO_LOG_LEVEL` | — | Set to `DEBUG` for verbose logging |
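
One detail worth knowing when setting these variables: the `FastVideoBackend` constructor in this commit assigns three FastVideo variables itself at startup, so values exported beforehand for those names are overwritten. The sketch below restates that assignment as a pure function; `worker_env_overrides` is a hypothetical name used only for illustration.

```python
def worker_env_overrides(disable_optimizations: bool) -> dict[str, str]:
    """Return the environment values the worker sets for itself at startup."""
    return {
        # The attention backend follows the --disable-optimizations flag.
        "FASTVIDEO_ATTENTION_BACKEND": "TORCH_SDPA" if disable_optimizations else "FLASH_ATTN",
        # Per-stage timing logs are always switched on.
        "FASTVIDEO_STAGE_LOGGING": "1",
        # RMSNorm FP4 pre-quantization is always switched off.
        "FASTVIDEO_ENABLE_RMSNORM_FP4_PREQUANT": "0",
    }
```

In practice this suggests that the attention backend is chosen through `--disable-optimizations` rather than by exporting `FASTVIDEO_ATTENTION_BACKEND` before launching the worker.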
## Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| OOM during Docker build | `flash-attention` compilation uses too much RAM | Lower `MAX_JOBS` in the Dockerfile |
| 10–20 min wait on first start | Model download + `torch.compile` warmup | Expected behavior; subsequent starts are faster if weights are cached |
| ~35 s second request | Runtime caches still warming | Steady-state performance from third request onward |
| Poor performance on non-B200/B300 GPUs | FP4 and flash-attention optimizations require CUDA arch 10.0 | Pass `--disable-optimizations` to `worker.py` |
## Source Code
The example source lives at [`examples/diffusers/`](https://github.com/ai-dynamo/dynamo/tree/main/examples/diffusers) in the Dynamo repository.
## See Also
- [vLLM-Omni Text-to-Video](../../backends/vllm/vllm-omni.md#text-to-video) — vLLM-Omni video generation via `/v1/videos`
- [vLLM-Omni Text-to-Image](../../backends/vllm/vllm-omni.md#text-to-image) — vLLM-Omni image generation
- [SGLang Video Generation](../../backends/sglang/sglang-diffusion.md#video-generation) — SGLang video generation worker
- [SGLang Image Diffusion](../../backends/sglang/sglang-diffusion.md#image-diffusion) — SGLang image diffusion worker
- [TRT-LLM Video Diffusion](../../backends/trtllm/trtllm-video-diffusion.md#quick-start) — TensorRT-LLM video diffusion quick start
- [Diffusion Overview](README.md) — Full backend support matrix
@@ -89,31 +89,34 @@ navigation:
path: components/kvbm/kvbm-guide.md
- page: Dynamo Benchmarking
path: benchmarks/benchmarking.md
- section: Multimodal Model Serving
- section: Multimodal
path: features/multimodal/README.md
contents:
- section: Vision Language Models (VLMs)
path: features/multimodal/README.md
contents:
- page: Embedding Cache
path: features/multimodal/embedding-cache.md
- page: Encoder Disaggregation
path: features/multimodal/encoder-disaggregation.md
- page: Multimodal KV Routing
path: features/multimodal/multimodal-kv-routing.md
- section: Diffusion (Experimental)
path: features/multimodal/diffusion.md
contents:
- page: vLLM-Omni
path: backends/vllm/vllm-omni.md
- page: SGLang Diffusion
path: backends/sglang/sglang-diffusion.md
- page: TRT-LLM Diffusion
path: backends/trtllm/trtllm-video-diffusion.md
- page: Embedding Cache
path: features/multimodal/embedding-cache.md
- page: Encoder Disaggregation
path: features/multimodal/encoder-disaggregation.md
- page: Multimodal KV Routing
path: features/multimodal/multimodal-kv-routing.md
- section: Diffusion (Preview)
slug: diffusion
path: features/diffusion/README.md
contents:
- page: FastVideo
slug: fastvideo
path: features/diffusion/fastvideo.md
- page: vLLM-Omni
path: backends/vllm/vllm-omni.md
- page: SGLang Diffusion
path: backends/sglang/sglang-diffusion.md
- page: TRT-LLM Diffusion
path: backends/trtllm/trtllm-video-diffusion.md
- page: Tool Calling
path: agents/tool-calling.md
- page: LoRA Adapters
path: features/lora/README.md
- section: Agentic Workloads
- section: Agents
slug: agents
path: features/agentic_workloads.md
contents:
- page: SGLang for Agentic Workloads
......
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# Local outputs
outputs
outputs_video
# Python caches
__pycache__
*.pyc
.git
local/.runtime
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# Shared runtime image for Dynamo frontend and FastVideo workers.
FROM nvidia/cuda:13.1.1-devel-ubuntu24.04
RUN apt-get update \
    && apt-get install -yq libucx0 python3-dev python3-pip python3-venv git protobuf-compiler curl ffmpeg libclang-dev \
    && apt-get clean
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
ENV UV_LINK_MODE=copy
RUN uv venv /opt/dynamo/venv --python 3.12 \
    && . /opt/dynamo/venv/bin/activate \
    && uv pip install pip setuptools packaging ninja psutil uvloop \
    && uv pip install torch torchvision --index-url https://download.pytorch.org/whl/cu130 \
    && uv pip install flashinfer-python
ENV VIRTUAL_ENV=/opt/dynamo/venv
ENV PATH="${VIRTUAL_ENV}/bin:${PATH}"
# flash-attn compilation is memory-intensive. If the build OOMs, lower MAX_JOBS.
# The flash-attn install notes call this out for machines with <96GB RAM and many CPU cores.
RUN git clone https://github.com/RandNMR73/flash-attention \
    && cd flash-attention \
    && git switch fa4-compile \
    && TORCH_CUDA_ARCH_LIST="10.0 10.0a" MAX_JOBS=4 uv pip install . --no-build-isolation \
    && TORCH_CUDA_ARCH_LIST="10.0 10.0a" MAX_JOBS=4 uv pip install ./flash_attn/cute \
    && rm -rf ../flash-attention
# Install Dynamo with /v1/videos support.
RUN uv pip install 'git+https://github.com/ai-dynamo/dynamo@release/1.0.0#subdirectory=lib/bindings/python' \
    && uv pip install 'git+https://github.com/ai-dynamo/dynamo@release/1.0.0'
# Install FastVideo directly from the public upstream repository.
# Clone with --recurse-submodules to get the required submodules as well.
RUN . /opt/dynamo/venv/bin/activate \
    && uv pip install setuptools_scm scikit-build-core cmake ninja \
    && git clone --recurse-submodules https://github.com/hao-ai-lab/FastVideo.git /tmp/FastVideo \
    && TORCH_CUDA_ARCH_LIST="10.0 10.0a" uv pip install --no-build-isolation /tmp/FastVideo
ENV FASTVIDEO_VIDEO_CODEC=libx264
ENV FASTVIDEO_X264_PRESET=ultrafast
WORKDIR /opt/app
COPY . /opt/app/
<!--
SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# FastVideo Video Diffusion Example
Full documentation can be found:
- [FastVideo - Dynamo Docs](https://docs.nvidia.com/dynamo/dev/user-guides/diffusion/fastvideo) (Recommended)
- [FastVideo - GitHub](../../docs/features/diffusion/fastvideo.md)
<!--
SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# FastVideo Kubernetes Deployment
Full documentation can be found:
- [FastVideo - Dynamo Docs](https://docs.nvidia.com/dynamo/dev/user-guides/diffusion/fastvideo#kubernetes-deployment) (Recommended)
- [FastVideo - GitHub](../../../docs/features/diffusion/fastvideo.md#kubernetes-deployment)
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: fastvideo-agg
spec:
  pvcs:
    - name: huggingface-cache
      create: false
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      envs:
        - name: DYN_DISCOVERY_BACKEND
          value: kubernetes
      extraPodSpec:
        mainContainer:
          image: my-registry/fastvideo-runtime:my-tag
          imagePullPolicy: IfNotPresent
          workingDir: /opt/app
          command:
            - python
            - -m
            - dynamo.frontend
          args:
            - --http-port
            - "8000"
    FastVideoWorker:
      componentType: worker
      replicas: 1
      sharedMemory:
        size: 8Gi
      resources:
        limits:
          gpu: "1"
        requests:
          gpu: "1"
      envs:
        - name: DYN_DISCOVERY_BACKEND
          value: kubernetes
        - name: LD_LIBRARY_PATH
          value: ""
        - name: TORCHINDUCTOR_CACHE_DIR
          value: /cache/torchinductor
        - name: TRITON_CACHE_DIR
          value: /cache/triton
        - name: HF_HOME
          value: /root/.cache/huggingface
      volumeMounts:
        - name: huggingface-cache
          mountPoint: /root/.cache/huggingface
      extraPodSpec:
        mainContainer:
          image: my-registry/fastvideo-runtime:my-tag
          imagePullPolicy: IfNotPresent
          workingDir: /opt/app
          command:
            - python
            - worker.py
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: fastvideo-agg
spec:
  pvcs:
    - name: huggingface-cache
      create: false
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      envs:
        - name: DYN_DISCOVERY_BACKEND
          value: kubernetes
      extraPodSpec:
        imagePullSecrets:
          - name: my-image-pull-secret
        tolerations:
          - key: dedicated
            operator: Equal
            value: user-workload
            effect: NoSchedule
          - key: dedicated
            operator: Equal
            value: user-workload
            effect: NoExecute
        mainContainer:
          image: my-registry/fastvideo-runtime:my-tag
          imagePullPolicy: IfNotPresent
          workingDir: /opt/app
          command:
            - python
            - -m
            - dynamo.frontend
          args:
            - --http-port
            - "8000"
    FastVideoWorker:
      componentType: worker
      replicas: 1
      sharedMemory:
        size: 8Gi
      resources:
        limits:
          gpu: "1"
        requests:
          gpu: "1"
      envs:
        - name: DYN_DISCOVERY_BACKEND
          value: kubernetes
        - name: LD_LIBRARY_PATH
          value: ""
        - name: TORCHINDUCTOR_CACHE_DIR
          value: /cache/torchinductor
        - name: TRITON_CACHE_DIR
          value: /cache/triton
        - name: HF_HOME
          value: /root/.cache/huggingface
      volumeMounts:
        - name: huggingface-cache
          mountPoint: /root/.cache/huggingface
      extraPodSpec:
        imagePullSecrets:
          - name: my-image-pull-secret
        tolerations:
          - key: dedicated
            operator: Equal
            value: user-workload
            effect: NoSchedule
          - key: dedicated
            operator: Equal
            value: user-workload
            effect: NoExecute
        mainContainer:
          image: my-registry/fastvideo-runtime:my-tag
          imagePullPolicy: IfNotPresent
          workingDir: /opt/app
          command:
            - python
            - worker.py
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
dynamo-operator:
  namespaceRestriction:
    enabled: true
  controllerManager:
    tolerations:
      - { key: dedicated, operator: Equal, value: user-workload, effect: NoSchedule }
      - { key: dedicated, operator: Equal, value: user-workload, effect: NoExecute }
etcd:
  persistence:
    storageClass: ebs
  tolerations:
    - { key: dedicated, operator: Equal, value: user-workload, effect: NoSchedule }
    - { key: dedicated, operator: Equal, value: user-workload, effect: NoExecute }
nats:
  config:
    jetstream:
      fileStore:
        pvc:
          storageClassName: ebs
  podTemplate:
    merge:
      spec:
        tolerations:
          - { key: dedicated, operator: Equal, value: user-workload, effect: NoSchedule }
          - { key: dedicated, operator: Equal, value: user-workload, effect: NoExecute }
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: huggingface-cache
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ebs
  resources:
    requests:
      storage: 200Gi
.runtime/
response.json
output.mp4
<!--
SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# FastVideo Local Run
Full documentation can be found:
- [FastVideo - Dynamo Docs](https://docs.nvidia.com/dynamo/dev/user-guides/diffusion/fastvideo#local-deployment) (Recommended)
- [FastVideo - GitHub](../../../docs/features/diffusion/fastvideo.md#local-deployment)
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# Worker count is controlled by COMPOSE_PROFILES (1-8).
# COMPOSE_PROFILES=4 starts backend-0..backend-3 on GPUs 0..3.
x-backend-base: &backend-base
  build:
    context: ..
    dockerfile: Dockerfile
  image: dynamo-fastvideo-diffusers:latest
  restart: on-failure
  command: python worker.py
  environment:
    - DYN_DISCOVERY_BACKEND=file
    - DYN_FILE_KV=/tmp/dynamo-discovery
    - LD_LIBRARY_PATH=
    - TORCHINDUCTOR_CACHE_DIR=/cache/torchinductor
    - TRITON_CACHE_DIR=/cache/triton
  volumes:
    - dynamo-discovery:/tmp/dynamo-discovery
    - huggingface-cache:/root/.cache/huggingface
  ipc: host
  shm_size: 8g
  ulimits:
    memlock: -1
    stack: 67108864
  depends_on:
    - frontend
services:
  frontend:
    build:
      context: ..
      dockerfile: Dockerfile
    image: dynamo-fastvideo-diffusers:latest
    restart: on-failure
    command: >
      python -m dynamo.frontend
      --http-port 8000
      --discovery-backend file
    environment:
      - DYN_FILE_KV=/tmp/dynamo-discovery
    volumes:
      - dynamo-discovery:/tmp/dynamo-discovery
    ports:
      - "8000:8000"
  backend-0:
    <<: *backend-base
    profiles: ["1", "2", "3", "4", "5", "6", "7", "8"]
    deploy:
      resources:
        reservations:
          devices:
            - { driver: nvidia, device_ids: ["0"], capabilities: [gpu] }
  backend-1:
    <<: *backend-base
    profiles: ["2", "3", "4", "5", "6", "7", "8"]
    deploy:
      resources:
        reservations:
          devices:
            - { driver: nvidia, device_ids: ["1"], capabilities: [gpu] }
  backend-2:
    <<: *backend-base
    profiles: ["3", "4", "5", "6", "7", "8"]
    deploy:
      resources:
        reservations:
          devices:
            - { driver: nvidia, device_ids: ["2"], capabilities: [gpu] }
  backend-3:
    <<: *backend-base
    profiles: ["4", "5", "6", "7", "8"]
    deploy:
      resources:
        reservations:
          devices:
            - { driver: nvidia, device_ids: ["3"], capabilities: [gpu] }
  backend-4:
    <<: *backend-base
    profiles: ["5", "6", "7", "8"]
    deploy:
      resources:
        reservations:
          devices:
            - { driver: nvidia, device_ids: ["4"], capabilities: [gpu] }
  backend-5:
    <<: *backend-base
    profiles: ["6", "7", "8"]
    deploy:
      resources:
        reservations:
          devices:
            - { driver: nvidia, device_ids: ["5"], capabilities: [gpu] }
  backend-6:
    <<: *backend-base
    profiles: ["7", "8"]
    deploy:
      resources:
        reservations:
          devices:
            - { driver: nvidia, device_ids: ["6"], capabilities: [gpu] }
  backend-7:
    <<: *backend-base
    profiles: ["8"]
    deploy:
      resources:
        reservations:
          devices:
            - { driver: nvidia, device_ids: ["7"], capabilities: [gpu] }
volumes:
  dynamo-discovery:
  huggingface-cache:
#!/usr/bin/env bash
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
EXAMPLE_DIR="$(cd "${SCRIPT_DIR}/.." && pwd)"
: "${PYTHON_BIN:=python3}"
: "${MODEL:=FastVideo/LTX2-Distilled-Diffusers}"
: "${NUM_GPUS:=1}"
: "${HTTP_PORT:=8000}"
: "${DISCOVERY_DIR:=${SCRIPT_DIR}/.runtime/discovery}"
: "${LOG_DIR:=${SCRIPT_DIR}/.runtime/logs}"
: "${WORKER_EXTRA_ARGS:=}"
: "${FRONTEND_EXTRA_ARGS:=}"
if ! command -v "${PYTHON_BIN}" >/dev/null 2>&1; then
  echo "error: ${PYTHON_BIN} not found"
  exit 1
fi
mkdir -p "${DISCOVERY_DIR}" "${LOG_DIR}"
export DYN_DISCOVERY_BACKEND=file
export DYN_FILE_KV="${DYN_FILE_KV:-${DISCOVERY_DIR}}"
cd "${EXAMPLE_DIR}"
worker_cmd=("${PYTHON_BIN}" worker.py --model "${MODEL}" --num-gpus "${NUM_GPUS}")
if [[ -n "${WORKER_EXTRA_ARGS}" ]]; then
  # shellcheck disable=SC2206
  worker_extra=( ${WORKER_EXTRA_ARGS} )
  worker_cmd+=("${worker_extra[@]}")
fi
frontend_cmd=("${PYTHON_BIN}" -m dynamo.frontend --http-port "${HTTP_PORT}" --discovery-backend file)
if [[ -n "${FRONTEND_EXTRA_ARGS}" ]]; then
  # shellcheck disable=SC2206
  frontend_extra=( ${FRONTEND_EXTRA_ARGS} )
  frontend_cmd+=("${frontend_extra[@]}")
fi
cleanup() {
  echo
  echo "Stopping local processes..."
  kill "${frontend_pid:-}" "${worker_pid:-}" 2>/dev/null || true
  wait "${frontend_pid:-}" "${worker_pid:-}" 2>/dev/null || true
}
trap cleanup EXIT INT TERM
echo "Starting worker: ${worker_cmd[*]}"
"${worker_cmd[@]}" >"${LOG_DIR}/worker.log" 2>&1 &
worker_pid=$!
echo "Starting frontend: ${frontend_cmd[*]}"
"${frontend_cmd[@]}" >"${LOG_DIR}/frontend.log" 2>&1 &
frontend_pid=$!
echo ""
echo "Worker log: ${LOG_DIR}/worker.log"
echo "Frontend log: ${LOG_DIR}/frontend.log"
echo ""
echo "API endpoint: http://localhost:${HTTP_PORT}/v1/videos"
echo ""
echo "Example request:"
echo "curl -s -X POST http://localhost:${HTTP_PORT}/v1/videos -H 'Content-Type: application/json' -d '{\"model\":\"${MODEL}\",\"prompt\":\"A cinematic drone shot over snowy mountains at sunrise\",\"size\":\"1920x1088\",\"seconds\":5,\"nvext\":{\"fps\":24,\"num_frames\":121,\"num_inference_steps\":5,\"guidance_scale\":1.0,\"seed\":10}}' > response.json"
echo ""
wait -n "${worker_pid}" "${frontend_pid}"
#!/usr/bin/env python3
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
"""
FastVideo Worker for Dynamo (non-streaming)

Registers a VideoGenerator as a Dynamo backend endpoint compatible with the
/v1/videos frontend endpoint. The endpoint generates a full video
clip from the request parameters and returns it as a single response containing
the complete MP4 file base64-encoded in data[0].b64_json.

Generation parameters (size, fps, num_frames, etc.) are taken from the
request body and its nvext field, so the same worker instance can serve
requests with different resolutions and quality settings without restarting.

One request at a time (asyncio.Lock; VideoGenerator is not re-entrant).

Usage:
    python worker.py [--model MODEL] [--num-gpus N] [--disable-optimizations]

Options:
    --model                  HuggingFace model path
                             (default: FastVideo/LTX2-Distilled-Diffusers)
    --num-gpus               Number of GPUs (default: 1)
    --disable-optimizations  Disable FP4 quantization and torch.compile, and
                             use TORCH_SDPA attention instead of FLASH_ATTN

Request format (sent to /v1/videos):
    prompt:   text description of the desired video
    model:    HuggingFace model path (must match what the worker registered)
    size:     "WxH" string, e.g. "1920x1088" (default: "1920x1088")
    seconds:  clip duration when nvext.num_frames is not set (default: 5)
    nvext:
        fps:                 frames per second (default: 24)
        num_frames:          total frames; overrides fps * seconds when set (default: 121)
        num_inference_steps: diffusion steps (default: 5)
        guidance_scale:      CFG scale (default: 1.0)
        seed:                RNG seed (default: 10)
        negative_prompt:     text to avoid (optional)
"""
import argparse
import asyncio
import base64
import logging
import os
import tempfile
import time
import uuid
import uvloop
from fastvideo import VideoGenerator
from fastvideo.configs.pipelines.base import PipelineConfig
from fastvideo.layers.quantization.fp4_config import FP4Config
from pydantic import BaseModel, Field
from dynamo.llm import ModelInput, ModelType, register_llm # type: ignore[attr-defined]
from dynamo.runtime import DistributedRuntime, dynamo_endpoint
logger = logging.getLogger(__name__)
DEFAULT_MODEL = "FastVideo/LTX2-Distilled-Diffusers"
# ── Request / Response models ─────────────────────────────────────────────────
def _get_worker_namespace() -> str:
    """
    Resolve Dynamo namespace for endpoint registration.

    Kubernetes operator injects DYN_NAMESPACE (and optionally a rollout suffix).
    Compose/local runs keep using the historical "dynamo" default.
    """
    namespace = os.environ.get("DYN_NAMESPACE", "dynamo")
    suffix = os.environ.get("DYN_NAMESPACE_WORKER_SUFFIX")
    if suffix:
        namespace = f"{namespace}-{suffix}"
    return namespace
class NvExtVideoCreateRequest(BaseModel):
    fps: int = Field(default=24, description="Frames per second")
    num_frames: int | None = Field(
        default=121, description="Total frames; overrides fps * seconds"
    )
    num_inference_steps: int = Field(default=5, description="Diffusion inference steps")
    guidance_scale: float = Field(
        default=1.0, description="Classifier-free guidance scale"
    )
    seed: int | None = Field(default=10, description="RNG seed for reproducibility")
    negative_prompt: str | None = Field(
        default=None, description="Text to avoid in generation"
    )


class VideoCreateRequest(BaseModel):
    prompt: str = Field(description="Text description of the desired video")
    model: str = Field(description="HuggingFace model path")
    size: str = Field(default="1920x1088", description="Frame dimensions as 'WxH'")
    seconds: int = Field(
        default=5, description="Clip duration; used when nvext.num_frames is unset"
    )
    user: str | None = Field(default=None)
    nvext: NvExtVideoCreateRequest = Field(default_factory=NvExtVideoCreateRequest)


class VideoData(BaseModel):
    b64_json: str | None = Field(default=None, description="Base64-encoded MP4 video")
    mime_type: str = Field(default="video/mp4")


class VideoCreateResponse(BaseModel):
    id: str
    object: str = "video"
    created: int
    model: str
    status: str = "complete"
    data: list[VideoData]
# ── Backend ───────────────────────────────────────────────────────────────────
def _coerce_optional_float(value: object) -> float | None:
    """Best-effort conversion for optional numeric metrics from backend results."""
    if value is None:
        return None
    try:
        return float(value)
    except (TypeError, ValueError):
        return None
class FastVideoBackend:
    def __init__(self, args: argparse.Namespace) -> None:
        self.model_name: str = args.model
        self.num_gpus: int = args.num_gpus
        self.disable_optimizations: bool = args.disable_optimizations
        # One request at a time: VideoGenerator is not re-entrant
        self._generate_lock = asyncio.Lock()
        self.generator: VideoGenerator | None = None

        attn_backend = "TORCH_SDPA" if self.disable_optimizations else "FLASH_ATTN"
        os.environ["FASTVIDEO_ATTENTION_BACKEND"] = attn_backend
        os.environ["FASTVIDEO_STAGE_LOGGING"] = "1"
        os.environ["FASTVIDEO_ENABLE_RMSNORM_FP4_PREQUANT"] = "0"

    async def initialize_model(self) -> None:
        logger.info("Loading VideoGenerator model=%s", self.model_name)
        loop = asyncio.get_running_loop()

        def _load():
            pipeline_config = PipelineConfig.from_pretrained(self.model_name)
            if not self.disable_optimizations:
                logger.info(
                    "Using FP4 quantization for VideoGenerator model=%s",
                    self.model_name,
                )
                pipeline_config.dit_config.quant_config = FP4Config()
            return VideoGenerator.from_pretrained(
                self.model_name,
                num_gpus=self.num_gpus,
                ltx2_refine_enabled=True,
                ltx2_refine_lora_path="",  # disable refine lora for distilled model
                ltx2_refine_num_inference_steps=2,
                ltx2_refine_guidance_scale=1.0,
                ltx2_refine_add_noise=True,
                pipeline_config=pipeline_config,
                enable_torch_compile=not self.disable_optimizations,
                enable_torch_compile_text_encoder=not self.disable_optimizations,
                torch_compile_kwargs={
                    "backend": "inductor",
                    "fullgraph": True,
                    "mode": "max-autotune-no-cudagraphs",
                },
                dit_cpu_offload=False,
                vae_cpu_offload=False,
                text_encoder_cpu_offload=False,
                ltx2_vae_tiling=False,
            )

        self.generator = await loop.run_in_executor(None, _load)
        logger.info("VideoGenerator ready")
    # ── Helpers ───────────────────────────────────────────────────────────────
    def _generate_mp4(
        self,
        prompt: str,
        video_id: str,
        width: int,
        height: int,
        num_frames: int,
        fps: int,
        num_inference_steps: int,
        guidance_scale: float,
        seed: int | None,
        negative_prompt: str | None,
    ) -> bytes:
        """Generate a video clip and return it as MP4 bytes."""
        assert self.generator is not None
        # The file is read back inside the `with` block because the temporary
        # directory is deleted when the block exits.
        with tempfile.TemporaryDirectory() as tmpdir:
            output_path = os.path.join(tmpdir, "output.mp4")
            kwargs: dict = dict(
                save_video=True,
                return_frames=False,
                output_path=output_path,
                height=height,
                width=width,
                num_frames=num_frames,
                fps=fps,
                num_inference_steps=num_inference_steps,
                guidance_scale=guidance_scale,
            )
            if seed is not None:
                kwargs["seed"] = seed
            if negative_prompt is not None:
                kwargs["negative_prompt"] = negative_prompt
            result = self.generator.generate_video(prompt=prompt, **kwargs)
            result_dict = result if isinstance(result, dict) else {}
            generation_time = _coerce_optional_float(result_dict.get("generation_time"))
            e2e_latency = _coerce_optional_float(result_dict.get("e2e_latency"))
            logger.info("[%s] MP4 written to %s", video_id, output_path)
            if generation_time is not None:
                logger.info(
                    "[%s] Generation time: %.2f seconds", video_id, generation_time
                )
            else:
                logger.info("[%s] Generation time: unavailable", video_id)
            if e2e_latency is not None:
                logger.info("[%s] E2E latency: %.2f seconds", video_id, e2e_latency)
            else:
                logger.info("[%s] E2E latency: unavailable", video_id)
            time_start = time.perf_counter()
            with open(output_path, "rb") as f:
                data = f.read()
            time_end = time.perf_counter()
            logger.info(
                "[%s] File read time: %.2f seconds", video_id, time_end - time_start
            )
            return data

    # ── Dynamo endpoint ───────────────────────────────────────────────────────

    @dynamo_endpoint(VideoCreateRequest, VideoCreateResponse)
    async def create_video(self, request: VideoCreateRequest):
        """
        Non-streaming endpoint.

        Generates one video clip using the parameters from the request's nvext
        field, then yields a single VideoCreateResponse with data[0].b64_json
        containing the complete MP4 file encoded in base64.
        """
        if self.generator is None:
            raise RuntimeError("Generator is not initialized")

        nvext = request.nvext
        try:
            width_str, height_str = request.size.lower().split("x", 1)
            width, height = int(width_str), int(height_str)
        except (ValueError, TypeError) as exc:
            raise ValueError(
                f"Invalid size format '{request.size}', expected 'WxH'"
            ) from exc
        if width <= 0 or height <= 0:
            raise ValueError(
                f"Invalid size '{request.size}', width and height must be positive"
            )

        num_frames = (
            nvext.num_frames
            if nvext.num_frames is not None
            else nvext.fps * request.seconds
        )
        if num_frames <= 0:
            raise ValueError("num_frames must be positive")
        fps = nvext.fps
        if fps <= 0:
            raise ValueError("fps must be positive")

        video_id = f"video_{uuid.uuid4().hex}"
        created_ts = int(time.time())
        logger.info(
            "[%s] create_video: prompt='%s...' size=%s frames=%d steps=%d",
            video_id,
            request.prompt[:60],
            request.size,
            num_frames,
            nvext.num_inference_steps,
        )
        logger.info(
            "[%s] Waiting for generate lock (locked=%s)",
            video_id,
            self._generate_lock.locked(),
        )

        async with self._generate_lock:
            t = time.perf_counter()
            logger.info(
                "[%s] Generating video (%dx%d, %d frames, %d steps) ...",
                video_id,
                width,
                height,
                num_frames,
                nvext.num_inference_steps,
            )
            try:
                mp4_bytes = await asyncio.to_thread(
                    self._generate_mp4,
                    prompt=request.prompt,
                    video_id=video_id,
                    width=width,
                    height=height,
                    num_frames=num_frames,
                    fps=fps,
                    num_inference_steps=nvext.num_inference_steps,
                    guidance_scale=nvext.guidance_scale,
                    seed=nvext.seed,
                    negative_prompt=nvext.negative_prompt,
                )
            except Exception as exc:
                logger.exception("[%s] Generation failed", video_id)
                raise RuntimeError(
                    f"Video generation failed for request {video_id}"
                ) from exc
            elapsed = time.perf_counter() - t
            logger.info(
                "[%s] Generation done in %.1fs — encoding %.2f MB MP4",
                video_id,
                elapsed,
                len(mp4_bytes) / 1_048_576,
            )

        yield VideoCreateResponse(
            id=video_id,
            created=created_ts,
            model=request.model,
            data=[VideoData(b64_json=base64.b64encode(mp4_bytes).decode())],
        ).model_dump()
        logger.info("[%s] Generation request finished", video_id)


# ── Dynamo wiring ─────────────────────────────────────────────────────────────


async def _register_model(endpoint, model_name: str) -> None:
    try:
        await register_llm(
            ModelInput.Text,  # type: ignore[attr-defined]
            ModelType.Videos,
            endpoint,
            model_name,
            model_name,
        )
        logger.info("Successfully registered model: %s", model_name)
    except Exception as e:
        logger.error("Failed to register model: %s", e, exc_info=True)
        raise RuntimeError("Model registration failed") from e


async def backend_worker(runtime: DistributedRuntime, args: argparse.Namespace) -> None:
    namespace_name = _get_worker_namespace()
    component_name = "backend"
    endpoint_name = "generate"

    endpoint = runtime.endpoint(f"{namespace_name}.{component_name}.{endpoint_name}")
    logger.info(
        "Serving endpoint %s/%s/%s", namespace_name, component_name, endpoint_name
    )

    backend = FastVideoBackend(args)
    await backend.initialize_model()

    await asyncio.gather(
        endpoint.serve_endpoint(backend.create_video),  # type: ignore[arg-type]
        _register_model(endpoint, backend.model_name),
    )


def _parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="FastVideo Worker for Dynamo (non-streaming)"
    )
    parser.add_argument(
        "--model",
        default=DEFAULT_MODEL,
        help=f"HuggingFace model path (default: {DEFAULT_MODEL})",
    )
    parser.add_argument(
        "--num-gpus",
        type=int,
        default=1,
        dest="num_gpus",
        help="Number of GPUs (default: 1)",
    )
    parser.add_argument(
        "--disable-optimizations",
        action="store_true",
        dest="disable_optimizations",
        help="Disable FP4 quantization, torch.compile, and use TORCH_SDPA attention",
    )
    return parser.parse_args()


async def main(args: argparse.Namespace) -> None:
    loop = asyncio.get_running_loop()

    # Use Kubernetes discovery in-cluster and file discovery for local compose by default.
    discovery_backend = os.environ.get("DYN_DISCOVERY_BACKEND")
    if not discovery_backend:
        discovery_backend = (
            "kubernetes" if os.environ.get("KUBERNETES_SERVICE_HOST") else "file"
        )
    logger.info("Using discovery backend: %s", discovery_backend)
    logger.info("Resolved worker namespace: %s", _get_worker_namespace())

    runtime = DistributedRuntime(loop, discovery_backend, "tcp", False)
    await backend_worker(runtime, args)


if __name__ == "__main__":
    _args = _parse_args()
    logging.basicConfig(
        level=(
            logging.DEBUG
            if os.environ.get("FASTVIDEO_LOG_LEVEL") == "DEBUG"
            else logging.INFO
        ),
        format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
        force=True,
    )
    uvloop.install()
    asyncio.run(main(_args))
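
The request validation in `create_video` (parsing `size` as `WxH`, then falling back to `fps * seconds` when `num_frames` is unset) can be exercised without a GPU or a running worker. The following is a minimal standalone sketch of that logic, not part of the worker itself; the helper names `parse_size` and `resolve_num_frames` are hypothetical, and `seconds` is assumed to be an integer.

```python
from __future__ import annotations


def parse_size(size: str) -> tuple[int, int]:
    """Parse a 'WxH' string into a positive (width, height) pair."""
    try:
        width_str, height_str = size.lower().split("x", 1)
        width, height = int(width_str), int(height_str)
    except (ValueError, AttributeError) as exc:
        # Missing 'x', non-numeric parts, or a non-string input.
        raise ValueError(f"Invalid size format {size!r}, expected 'WxH'") from exc
    if width <= 0 or height <= 0:
        raise ValueError(f"Invalid size {size!r}, width and height must be positive")
    return width, height


def resolve_num_frames(num_frames: int | None, fps: int, seconds: int) -> int:
    """Use the explicit frame count if given, otherwise derive fps * seconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    frames = num_frames if num_frames is not None else fps * seconds
    if frames <= 0:
        raise ValueError("num_frames must be positive")
    return frames


if __name__ == "__main__":
    print(parse_size("1280x720"))                       # (1280, 720)
    print(resolve_num_frames(None, fps=24, seconds=5))  # 120
```

Unlike the worker, this sketch checks `fps` before deriving the frame count, so a non-positive `fps` is reported as an `fps` error instead of surfacing as a `num_frames` error.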
......@@ -42,6 +42,20 @@ redirects:
    destination: "/dynamo/resources/release-artifacts"
  - source: "/dynamo/getting-started/examples"
    destination: "/dynamo/resources/examples"
  - source: "/dynamo/dev/user-guides/multimodal-model-serving/diffusion-experimental/:slug*"
    destination: "/dynamo/dev/user-guides/diffusion/:slug*"
  - source: "/dynamo/dev/user-guides/multimodal-model-serving/diffusion-experimental"
    destination: "/dynamo/dev/user-guides/diffusion"
  - source: "/dynamo/dev/user-guides/multimodal-model-serving/vision-language-models-vlms/:slug*"
    destination: "/dynamo/dev/user-guides/multimodal/:slug*"
  - source: "/dynamo/dev/user-guides/multimodal-model-serving/vision-language-models-vlms"
    destination: "/dynamo/dev/user-guides/multimodal"
  - source: "/dynamo/dev/user-guides/diffusion/diffusion-guide"
    destination: "/dynamo/dev/user-guides/diffusion/fastvideo"
  - source: "/dynamo/dev/user-guides/agentic-workloads/:slug*"
    destination: "/dynamo/dev/user-guides/agents/:slug*"
  - source: "/dynamo/dev/user-guides/agentic-workloads"
    destination: "/dynamo/dev/user-guides/agents"
# GitHub repository link in navbar
navbar-links:
......