fix: Support FastVideo example on current public version (#7431)

Commit 1042c552 (unverified) · authored Mar 17, 2026 by John Pohl · committed by GitHub Mar 17, 2026 · 708858a1
4 changed files
--- a/docs/features/diffusion/fastvideo.md
+++ b/docs/features/diffusion/fastvideo.md
@@ -15,12 +15,12 @@ This guide covers deploying [FastVideo](https://github.com/hao-ai-lab/FastVideo)
 - **Default model:** `FastVideo/LTX2-Distilled-Diffusers` — a distilled variant of the LTX-2 Diffusion Transformer (Lightricks), reducing inference from 50+ steps to just 5.
 - **Two-stage pipeline:** Stage 1 generates video at target resolution; Stage 2 refines with a distilled LoRA for improved fidelity and texture.
- **Optimized inference:** FP4 quantization and `torch.compile` are enabled by default for maximum throughput.
+- **Optimized inference:** FP4 quantization and `torch.compile` are available via `--enable-optimizations`; attention backend selection is controlled separately via `--attention-backend`.
 - **Response format:** Returns one complete MP4 payload per request as `data[0].b64_json` (non-streaming).
 - **Concurrency:** One request at a time per worker (VideoGenerator is not re-entrant). Scale throughput by running multiple workers.
 > [!IMPORTANT]
-> This example is optimized for **NVIDIA B200/B300** GPUs (CUDA arch 10.0) with FP4 quantization and flash-attention. It can run on other GPUs (H100, A100, etc.) by passing `--disable-optimizations` to `worker.py`, which disables FP4 quantization, `torch.compile`, and switches the attention backend from FLASH_ATTN to TORCH_SDPA. Expect lower performance but broader compatibility.
+> `worker.py` defaults to `--attention-backend TORCH_SDPA` for broader compatibility across GPUs, including systems such as H100. For the B200/B300-oriented path, enable FP4/compile with `--enable-optimizations` and, if desired, opt into flash-attention explicitly with `--attention-backend FLASH_ATTN`.
 ## Docker Image Build
@@ -31,12 +31,35 @@ The local Docker workflow builds a runtime image from the [`Dockerfile`](https:/
 - Installs Dynamo from the `release/1.0.0` branch (for `/v1/videos` support)
 - Compiles a [flash-attention](https://github.com/RandNMR73/flash-attention) fork from source
+The Dockerfile exposes `TORCH_CUDA_ARCH_LIST` as a build argument (default: `10.0 10.0a` for Blackwell). Pass `--build-arg` to target a different architecture:
+```bash
+# Blackwell (default)
+docker build examples/diffusers/ --build-arg TORCH_CUDA_ARCH_LIST="10.0 10.0a"
+# Hopper
+docker build examples/diffusers/ --build-arg TORCH_CUDA_ARCH_LIST="9.0 9.0a"
+```
+`MAX_JOBS` (default: `4`) controls parallel compilation jobs for flash-attention. Lower it if the build runs out of memory:
+```bash
+docker build examples/diffusers/ --build-arg MAX_JOBS=2
+```
+When using Docker Compose, set these as environment variables before running `docker compose up --build`:
+```bash
+# Hopper on a memory-constrained builder
+TORCH_CUDA_ARCH_LIST="9.0 9.0a" MAX_JOBS=2 COMPOSE_PROFILES=4 docker compose up --build
+```
 > [!WARNING]
 > The first Docker image build can take **20–40+ minutes** because FastVideo and CUDA-dependent components are compiled during the build. Subsequent builds are much faster if Docker layer cache is preserved. Compiling `flash-attention` can use significant RAM — low-memory builders may hit out-of-memory failures. If that happens, lower `MAX_JOBS` in the Dockerfile to reduce parallel compile memory usage. The [flash-attn install notes](https://pypi.org/project/flash-attn/) specifically recommend this on machines with less than 96 GB RAM and many CPU cores.
 ## Warmup Time
-On first start, workers download model weights and run compile/warmup steps. Expect roughly **10–20 minutes** before the first request is ready (hardware-dependent). After the first successful response, the second request can still take around **35 seconds** while runtime caches finish warming up; steady-state performance is typically reached from the third request onward.
+On first start, workers download model weights. When `--enable-optimizations` is enabled, compile/warmup steps can push the first ready time to roughly **10–20 minutes** (hardware-dependent). After the first successful optimized response, the second request can still take around **35 seconds** while runtime caches finish warming up; steady-state performance is typically reached from the third request onward.
 > [!TIP]
 > When using Kubernetes, mount a shared Hugging Face cache PVC (see [Kubernetes Deployment](#kubernetes-deployment)) so model weights are downloaded once and reused across pod restarts.
@@ -82,7 +105,7 @@ Environment variables:
 | `MODEL` | `FastVideo/LTX2-Distilled-Diffusers` | HuggingFace model path |
 | `NUM_GPUS` | `1` | Number of GPUs |
 | `HTTP_PORT` | `8000` | Frontend HTTP port |
-| `WORKER_EXTRA_ARGS` | — | Extra flags for `worker.py` (e.g., `--disable-optimizations`) |
+| `WORKER_EXTRA_ARGS` | — | Extra flags for `worker.py` (for example, `--enable-optimizations --attention-backend FLASH_ATTN`) |
 | `FRONTEND_EXTRA_ARGS` | — | Extra flags for `dynamo.frontend` |
 Example:
@@ -91,12 +114,12 @@ Example:
 MODEL=FastVideo/LTX2-Distilled-Diffusers \
 NUM_GPUS=1 \
 HTTP_PORT=8000 \
-WORKER_EXTRA_ARGS="--disable-optimizations" \
+WORKER_EXTRA_ARGS="--enable-optimizations --attention-backend FLASH_ATTN" \
 ./run_local.sh
 ```
 > [!NOTE]
-> `--disable-optimizations` is a `worker.py` flag (not a `dynamo.frontend` flag), so pass it through `WORKER_EXTRA_ARGS`.
+> `--enable-optimizations` and `--attention-backend` are `worker.py` flags, not `dynamo.frontend` flags, so pass them through `WORKER_EXTRA_ARGS` when you want a non-default worker configuration.
 The script writes logs to:
@@ -214,7 +237,8 @@ jq -r '.data[0].b64_json' response.json | base64 -D > output.mp4
 |---|---|---|
 | `--model` | `FastVideo/LTX2-Distilled-Diffusers` | HuggingFace model path |
 | `--num-gpus` | `1` | Number of GPUs for distributed inference |
-| `--disable-optimizations` | off | Disables FP4 quantization, `torch.compile`, and switches attention from FLASH_ATTN to TORCH_SDPA |
+| `--enable-optimizations` | off | Enables FP4 quantization and `torch.compile` |
+| `--attention-backend` | `TORCH_SDPA` | Sets `FASTVIDEO_ATTENTION_BACKEND`; choices: `FLASH_ATTN`, `TORCH_SDPA`, `SAGE_ATTN`, `SAGE_ATTN_THREE`, `VIDEO_SPARSE_ATTN`, `VMOBA_ATTN`, `SLA_ATTN`, `SAGE_SLA_ATTN` |
 ### Request Parameters (`nvext`)
@@ -233,7 +257,7 @@ jq -r '.data[0].b64_json' response.json | base64 -D > output.mp4
 |---|---|---|
 | `FASTVIDEO_VIDEO_CODEC` | `libx264` | Video codec for MP4 encoding |
 | `FASTVIDEO_X264_PRESET` | `ultrafast` | x264 encoding speed preset |
-| `FASTVIDEO_ATTENTION_BACKEND` | `FLASH_ATTN` | Attention backend (`FLASH_ATTN` or `TORCH_SDPA`) |
+| `FASTVIDEO_ATTENTION_BACKEND` | `TORCH_SDPA` | Attention backend; `worker.py` sets this from `--attention-backend` and validates `FLASH_ATTN`, `TORCH_SDPA`, `SAGE_ATTN`, `SAGE_ATTN_THREE`, `VIDEO_SPARSE_ATTN`, `VMOBA_ATTN`, `SLA_ATTN`, and `SAGE_SLA_ATTN` |
 | `FASTVIDEO_STAGE_LOGGING` | `1` | Enable per-stage timing logs |
 | `FASTVIDEO_LOG_LEVEL` | — | Set to `DEBUG` for verbose logging |
@@ -241,10 +265,12 @@ jq -r '.data[0].b64_json' response.json | base64 -D > output.mp4
 | Symptom | Cause | Fix |
 |---|---|---|
-| OOM during Docker build | `flash-attention` compilation uses too much RAM | Lower `MAX_JOBS` in the Dockerfile |
+| OOM during Docker build | `flash-attention` compilation uses too much RAM | Pass `--build-arg MAX_JOBS=2` (or lower) at build time |
-| 10–20 min wait on first start | Model download + `torch.compile` warmup | Expected behavior; subsequent starts are faster if weights are cached |
+| `no kernel image available for this GPU` or CUDA arch error at runtime | Image was built for a different GPU architecture | Rebuild with the correct `TORCH_CUDA_ARCH_LIST` (e.g. `9.0 9.0a` for Hopper) |
+| 10–20 min wait on first start with optimizations enabled | Model download + `torch.compile` warmup | Expected behavior; subsequent starts are faster if weights are cached |
 | ~35 s second request | Runtime caches still warming | Steady-state performance from third request onward |
-| Poor performance on non-B200/B300 GPUs | FP4 and flash-attention optimizations require CUDA arch 10.0 | Pass `--disable-optimizations` to `worker.py` |
+| Lower throughput than expected on B200/B300 | FP4/compile and flash-attention are configured separately | Pass `--enable-optimizations` and, if desired, `--attention-backend FLASH_ATTN` |
+| Startup or import failure after enabling optimizations or changing the attention backend | FP4 and some attention backends depend on specific hardware/software support | Re-run `worker.py` without `--enable-optimizations`, or use `--attention-backend TORCH_SDPA` |
 ## Source Code
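
The documentation above states that each request returns one complete MP4 as `data[0].b64_json`. As a minimal sketch of the client side, assuming the response body is a JSON document with exactly that shape, the payload could be decoded in Python as follows. The helper name `save_video` is hypothetical and is not part of the example's source.

```python
import base64
import json
from pathlib import Path


def save_video(response_body: str, out_path: str) -> int:
    """Decode data[0].b64_json from a response body and write the MP4 bytes.

    Hypothetical helper: assumes the non-streaming response shape described
    in the docs (one base64-encoded MP4 under data[0].b64_json).
    Returns the number of bytes written.
    """
    payload = json.loads(response_body)
    video_b64 = payload["data"][0]["b64_json"]
    video_bytes = base64.b64decode(video_b64)
    Path(out_path).write_bytes(video_bytes)
    return len(video_bytes)
```

This does the same job as the `jq ... | base64 -D` pipeline referenced in the hunk headers, without relying on a platform-specific `base64` flag.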

--- a/examples/diffusers/Dockerfile
+++ b/examples/diffusers/Dockerfile
@@ -8,7 +8,7 @@ RUN apt-get update \
 && apt-get install -yq libucx0 python3-dev python3-pip python3-venv git protobuf-compiler curl ffmpeg libclang-dev \
 && apt-get clean
-COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
+COPY --from=ghcr.io/astral-sh/uv:0.10.11 /uv /uvx /bin/
 ENV UV_LINK_MODE=copy
 RUN uv venv /opt/dynamo/venv --python 3.12 \
@@ -20,25 +20,36 @@ RUN uv venv /opt/dynamo/venv --python 3.12 \
 ENV VIRTUAL_ENV=/opt/dynamo/venv
 ENV PATH="${VIRTUAL_ENV}/bin:${PATH}"
-# flash-attn compilation is memory-intensive. If the build OOMs, lower MAX_JOBS.
-# The flash-attn install notes call this out for machines with <96GB RAM and many CPU cores.
-RUN git clone https://github.com/RandNMR73/flash-attention \
+# Override at build time to target a different GPU architecture, e.g.:
+#   docker build --build-arg TORCH_CUDA_ARCH_LIST="9.0 9.0a" ...
+ARG TORCH_CUDA_ARCH_LIST="10.0 10.0a"
+# Lower MAX_JOBS if the build OOMs (machines with <96GB RAM and many CPU cores).
+#   docker build --build-arg MAX_JOBS=2 ...
+ARG MAX_JOBS=4
+# flash-attention ignores TORCH_CUDA_ARCH_LIST and uses its own FLASH_ATTN_CUDA_ARCHS variable.
+# Translate from PyTorch format ("10.0 10.0a", space-separated with dots) to flash-attention
+# format ("100;100a", semicolon-separated without dots).
+RUN export FLASH_ATTN_CUDA_ARCHS=$(echo "${TORCH_CUDA_ARCH_LIST}" | sed 's/ /;/g; s/\.//g') \
+ && echo "Building flash-attention for TORCH_CUDA_ARCH_LIST=${TORCH_CUDA_ARCH_LIST} FLASH_ATTN_CUDA_ARCHS=${FLASH_ATTN_CUDA_ARCHS} MAX_JOBS=${MAX_JOBS}" \
+ && git clone https://github.com/RandNMR73/flash-attention \
 && cd flash-attention \
 && git switch fa4-compile \
- && TORCH_CUDA_ARCH_LIST="10.0 10.0a" MAX_JOBS=4 uv pip install . --no-build-isolation \
+ && uv pip install . --no-build-isolation \
- && TORCH_CUDA_ARCH_LIST="10.0 10.0a" MAX_JOBS=4 uv pip install ./flash_attn/cute \
+ && uv pip install ./flash_attn/cute \
 && rm -rf ../flash-attention
 # Install Dynamo with /v1/videos support.
-RUN uv pip install 'git+https://github.com/ai-dynamo/dynamo@release/1.0.0#subdirectory=lib/bindings/python' \
- && uv pip install 'git+https://github.com/ai-dynamo/dynamo@release/1.0.0'
+RUN uv pip install ai-dynamo==1.0.0
 # Install FastVideo directly from the public upstream repository.
 # Checkout with --recurse-submodules to get the required submodules as well.
-RUN . /opt/dynamo/venv/bin/activate \
+RUN echo "Building FastVideo for TORCH_CUDA_ARCH_LIST=${TORCH_CUDA_ARCH_LIST}" \
+ && . /opt/dynamo/venv/bin/activate \
 && uv pip install setuptools_scm scikit-build-core cmake ninja \
 && git clone --recurse-submodules https://github.com/hao-ai-lab/FastVideo.git /tmp/FastVideo \
- && TORCH_CUDA_ARCH_LIST="10.0 10.0a" uv pip install --no-build-isolation /tmp/FastVideo
+ && uv pip install --no-build-isolation /tmp/FastVideo
 ENV FASTVIDEO_VIDEO_CODEC=libx264
 ENV FASTVIDEO_X264_PRESET=ultrafast
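
The Dockerfile change above converts `TORCH_CUDA_ARCH_LIST` into the `FLASH_ATTN_CUDA_ARCHS` format with a `sed` expression. As an illustration only, the same string transformation can be written in Python to check what a given input produces. The function name `to_flash_attn_archs` is hypothetical; only the two substitutions come from the diff.

```python
def to_flash_attn_archs(torch_cuda_arch_list: str) -> str:
    """Mirror the Dockerfile's sed 's/ /;/g; s/\\.//g' translation.

    Spaces become semicolons and dots are dropped, so the PyTorch-style
    list "10.0 10.0a" becomes the flash-attention-style "100;100a".
    """
    return torch_cuda_arch_list.replace(" ", ";").replace(".", "")


if __name__ == "__main__":
    for arch_list in ("10.0 10.0a", "9.0 9.0a"):
        print(f"{arch_list!r} -> {to_flash_attn_archs(arch_list)!r}")
```

Like the `sed` expression, this sketch does no validation: it assumes single spaces between entries and would pass malformed input through unchanged.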

--- a/examples/diffusers/local/docker-compose.yml
+++ b/examples/diffusers/local/docker-compose.yml
@@ -8,6 +8,9 @@ x-backend-base: &backend-base
  build:
    context: ..
    dockerfile: Dockerfile
+    args:
+      TORCH_CUDA_ARCH_LIST: ${TORCH_CUDA_ARCH_LIST:-10.0 10.0a}
+      MAX_JOBS: ${MAX_JOBS:-4}
  image: dynamo-fastvideo-diffusers:latest
  restart: on-failure
  command: python worker.py
@@ -33,6 +36,9 @@ services:
    build:
      context: ..
      dockerfile: Dockerfile
+      args:
+        TORCH_CUDA_ARCH_LIST: ${TORCH_CUDA_ARCH_LIST:-10.0 10.0a}
+        MAX_JOBS: ${MAX_JOBS:-4}
    image: dynamo-fastvideo-diffusers:latest
    restart: on-failure
    command: >
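
The Compose change above relies on `${VAR:-default}` interpolation, which falls back to the default when the variable is unset or empty. A small Python sketch of that resolution rule, using the two build arguments and defaults from the diff, might look like this. The names `BUILD_ARG_DEFAULTS` and `resolve_build_args` are hypothetical and exist only for illustration.

```python
from __future__ import annotations

import os
from collections.abc import Mapping

# Defaults copied from the docker-compose.yml hunk above.
BUILD_ARG_DEFAULTS = {
    "TORCH_CUDA_ARCH_LIST": "10.0 10.0a",
    "MAX_JOBS": "4",
}


def resolve_build_args(env: Mapping[str, str] | None = None) -> dict[str, str]:
    """Resolve build args the way ${VAR:-default} does.

    An unset or empty variable yields the default; any other value wins.
    """
    env = os.environ if env is None else env
    return {
        name: env.get(name) or default
        for name, default in BUILD_ARG_DEFAULTS.items()
    }
```

For example, exporting only `TORCH_CUDA_ARCH_LIST="9.0 9.0a"` before `docker compose up --build` would leave `MAX_JOBS` at `4`.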

--- a/examples/diffusers/worker.py
+++ b/examples/diffusers/worker.py
@@ -16,12 +16,17 @@ with different resolutions and quality settings without restarting.
 One request at a time (asyncio.Lock — VideoGenerator is not re-entrant).
 Usage:
-  python worker.py [--model MODEL] [--num-gpus N] [--disable-optimizations]
+  python worker.py [--model MODEL] [--num-gpus N] [--enable-optimizations]
+                   [--attention-backend ATTENTION_BACKEND]
 Options:
  --model          HuggingFace model path
                   (default: FastVideo/LTX2-Distilled-Diffusers)
  --num-gpus       Number of GPUs (default: 1)
+  --enable-optimizations
+                   Enable FP4 quantization (if available) and torch.compile
+  --attention-backend
+                   Attention backend (default: TORCH_SDPA)
 Request format (sent to /v1/videos):
  prompt:   text description of the desired video
@@ -46,10 +51,11 @@ import tempfile
 import time
 import uuid
+import torch
 import uvloop
 from fastvideo import VideoGenerator
 from fastvideo.configs.pipelines.base import PipelineConfig
-from fastvideo.layers.quantization.fp4_config import FP4Config
+from fastvideo.platforms.interface import AttentionBackendEnum
 from pydantic import BaseModel, Field
 from dynamo.llm import ModelInput, ModelType, register_llm  # type: ignore[attr-defined]
@@ -58,6 +64,14 @@ from dynamo.runtime import DistributedRuntime, dynamo_endpoint
 logger = logging.getLogger(__name__)
 DEFAULT_MODEL = "FastVideo/LTX2-Distilled-Diffusers"
+DEFAULT_ATTENTION_BACKEND = "TORCH_SDPA"
+# FastVideo exposes NO_ATTENTION in the enum, but it is not a selectable
+# inference backend for this worker's FASTVIDEO_ATTENTION_BACKEND override.
+ATTENTION_BACKEND_CHOICES = tuple(
+    backend_name
+    for backend_name in AttentionBackendEnum.__members__
+    if backend_name != "NO_ATTENTION"
+)
 # ── Request / Response models ─────────────────────────────────────────────────
@@ -133,14 +147,14 @@ class FastVideoBackend:
    def __init__(self, args: argparse.Namespace) -> None:
        self.model_name: str = args.model
        self.num_gpus: int = args.num_gpus
-        self.disable_optimizations: bool = args.disable_optimizations
+        self.enable_optimizations: bool = args.enable_optimizations
+        self.attention_backend: str = args.attention_backend
        # One request at a time — VideoGenerator is not re-entrant
        self._generate_lock = asyncio.Lock()
        self.generator: VideoGenerator | None = None
-        attn_backend = "TORCH_SDPA" if self.disable_optimizations else "FLASH_ATTN"
-        os.environ["FASTVIDEO_ATTENTION_BACKEND"] = attn_backend
+        os.environ["FASTVIDEO_ATTENTION_BACKEND"] = self.attention_backend
        os.environ["FASTVIDEO_STAGE_LOGGING"] = "1"
        os.environ["FASTVIDEO_ENABLE_RMSNORM_FP4_PREQUANT"] = "0"
@@ -150,33 +164,56 @@ class FastVideoBackend:
        def _load():
            pipeline_config = PipelineConfig.from_pretrained(self.model_name)
-            if not self.disable_optimizations:
-                logger.info(
-                    "Using FP4 quantization for VideoGenerator model=%s",
-                    self.model_name,
-                )
-                pipeline_config.dit_config.quant_config = FP4Config()
+            optimization_kwargs = {}
+            if self.enable_optimizations:
+                major, minor = torch.cuda.get_device_capability()
+                if major < 10:
+                    logger.warning(
+                        "FP4 quantization is only supported on NVIDIA Blackwell GPUs (compute capability 10.0+). Detected compute capability: %d.%d. Continuing without FP4 optimizations.",
+                        major,
+                        minor,
+                    )
+                else:
+                    logger.info(
+                        "Using FP4 quantization for VideoGenerator model=%s",
+                        self.model_name,
+                    )
+                    try:
+                        from fastvideo.layers.quantization.fp4_config import FP4Config
+                    except ImportError as exc:
+                        raise RuntimeError(
+                            "FastVideo optimizations require "
+                            "fastvideo.layers.quantization.fp4_config, but this "
+                            "FastVideo build does not provide it. Re-run "
+                            "worker.py without --enable-optimizations or install a "
+                            "FastVideo version that includes fp4_config."
+                        ) from exc
+                    pipeline_config.dit_config.quant_config = FP4Config()
+                optimization_kwargs = {
+                    "ltx2_refine_enabled": True,
+                    "ltx2_refine_lora_path": "",  # disable refine lora for distilled model
+                    "ltx2_refine_num_inference_steps": 2,
+                    "ltx2_refine_guidance_scale": 1.0,
+                    "ltx2_refine_add_noise": True,
+                    "enable_torch_compile": True,
+                    "enable_torch_compile_text_encoder": True,
+                    "torch_compile_kwargs": {
+                        "backend": "inductor",
+                        "fullgraph": True,
+                        "mode": "max-autotune-no-cudagraphs",
+                    },
+                    "dit_cpu_offload": False,
+                    "vae_cpu_offload": False,
+                    "text_encoder_cpu_offload": False,
+                    "ltx2_vae_tiling": False,
+                }
            return VideoGenerator.from_pretrained(
                self.model_name,
                num_gpus=self.num_gpus,
-                ltx2_refine_enabled=True,
-                ltx2_refine_lora_path="",  # disable refine lora for distilled model
-                ltx2_refine_num_inference_steps=2,
-                ltx2_refine_guidance_scale=1.0,
-                ltx2_refine_add_noise=True,
                pipeline_config=pipeline_config,
-                enable_torch_compile=not self.disable_optimizations,
-                enable_torch_compile_text_encoder=not self.disable_optimizations,
-                torch_compile_kwargs={
-                    "backend": "inductor",
-                    "fullgraph": True,
-                    "mode": "max-autotune-no-cudagraphs",
-                },
-                dit_cpu_offload=False,
-                vae_cpu_offload=False,
-                text_encoder_cpu_offload=False,
-                ltx2_vae_tiling=False,
+                **optimization_kwargs,
            )
        self.generator = await loop.run_in_executor(None, _load)
@@ -402,10 +439,21 @@ def _parse_args() -> argparse.Namespace:
        help="Number of GPUs (default: 1)",
    )
    parser.add_argument(
-        "--disable-optimizations",
+        "--enable-optimizations",
        action="store_true",
-        dest="disable_optimizations",
+        dest="enable_optimizations",
-        help="Disable FP4 quantization, torch.compile, and use TORCH_SDPA attention",
+        help="Enable FP4 quantization (if available) and torch.compile",
+    )
+    parser.add_argument(
+        "--attention-backend",
+        choices=ATTENTION_BACKEND_CHOICES,
+        default=DEFAULT_ATTENTION_BACKEND,
+        dest="attention_backend",
+        help=(
+            "Attention backend to set via FASTVIDEO_ATTENTION_BACKEND "
+            f"(choices: {', '.join(ATTENTION_BACKEND_CHOICES)}; "
+            f"default: {DEFAULT_ATTENTION_BACKEND})"
+        ),
    )
    return parser.parse_args()
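
The `worker.py` changes combine two ideas: the attention-backend choices are derived from an enum with `NO_ATTENTION` filtered out, and FP4 is only applied when optimizations are requested and the GPU reports compute capability 10.0 or higher. The sketch below reproduces that logic in isolation so it can run without FastVideo or a GPU. `StandInBackendEnum` is a hypothetical stand-in for FastVideo's `AttentionBackendEnum` (only three members are shown), and `fp4_allowed` is a hypothetical helper that takes the capability tuple as an argument instead of calling `torch.cuda.get_device_capability()`.

```python
import argparse
import enum


class StandInBackendEnum(enum.Enum):
    """Hypothetical stand-in for FastVideo's AttentionBackendEnum."""

    FLASH_ATTN = enum.auto()
    TORCH_SDPA = enum.auto()
    NO_ATTENTION = enum.auto()


# Same filtering pattern as the diff: every member name except NO_ATTENTION.
CHOICES = tuple(
    name for name in StandInBackendEnum.__members__ if name != "NO_ATTENTION"
)


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the two flags introduced by this commit."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--enable-optimizations", action="store_true")
    parser.add_argument(
        "--attention-backend", choices=CHOICES, default="TORCH_SDPA"
    )
    return parser


def fp4_allowed(enable_optimizations: bool, capability: tuple[int, int]) -> bool:
    """Return True only if optimizations are on and major capability >= 10."""
    major, _minor = capability
    return enable_optimizations and major >= 10
```

With these definitions, `build_parser().parse_args([])` yields the compatibility-first defaults (optimizations off, `TORCH_SDPA`), and `fp4_allowed(True, (9, 0))` is `False`, which corresponds to the warning branch in the diff where the worker continues without FP4.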